<a href="https://colab.research.google.com/github/rskrisel/factiva_dataframe/blob/main/Create_spreadsheet_Factiva_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#https://jcldinco.medium.com/obtaining-and-cleaning-news-data-from-factiva-21a7a0ae2759
#from factiva, select display/Full Article/Report plus Indexing
#restrict data select based on region/publications
#duplicates:identical
#select articles, then click print. Then hit command+u for html code. Copy/paste code into VSCODE.
#Save docs as htm

import glob
import pandas as pd
# files = glob.glob('html/*.htm', recursive = True)

# Collecting news data from Factiva and saving it in a Dataframe

In [None]:
from google.colab import drive
drive.mount('/content/drive')

To get data from Factiva, you must have a subscription to the Factiva database. Most universities subscribe to the database, so you can use your library to connect to it.

Please consult with your librarian for how to properly search for articles in Factiva.

Once you have narrowed down your search and found the articles you wish to work with, follow these steps:


1. Click "Display" and select "Full Article/Report plus Indexing".
2. Set the duplicates option to "Identical".
3. Select the articles you wish to collect, then click print/article format.
4. From the print view window, press `command` + `u` (`ctrl` + `u` on Windows or Linux) to view the page's HTML source.
5. Copy the HTML source and paste it into the `html_code` variable below.





In [None]:
html_code= """
PASTE HTML CODE HERE
"""

Next, you want to save the contents of the `html_code` variable into a `.htm` file:

In [None]:

# Write the HTML code to the file
with open("/content/drive/MyDrive/Factiva/factiva.htm", 'w', encoding='utf-8') as file: #replace with your path
    file.write(html_code)


# # For a list of variables with HTML content
# html_list = [html_code1, html_code2, html_code3]  # Replace with your actual variables

# # Iterate over the list and write each HTML content to a separate file
# for i, html_code in enumerate(html_list):
#     file_path = f"/content/drive/MyDrive/Factiva/factiva_{i}.htm"  # Replace with your path
#     with open(file_path, 'w') as file:
#         file.write(html_code)



The following line collects every `.htm` file in your Factiva folder, located at the specified path on your Google Drive, and returns the matches as a list of file paths.

- `glob.glob` is a function that finds all files whose paths match a pattern. Here it looks for files with the `.htm` extension inside a folder called `Factiva`.
- `/content/drive/MyDrive/Factiva/*.htm` is the pattern to match. Replace it with the path where your own files are located. The `*.htm` part matches every file name ending in `.htm` (the HTML files saved in the previous step).
- `recursive = True` only has an effect when the pattern contains `**`. With the pattern as written, only files directly inside the `Factiva` folder are matched. To include subfolders as well, use a pattern such as `/content/drive/MyDrive/Factiva/**/*.htm`.
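
As a minimal sketch of how the pattern affects the search, the snippet below builds a throwaway folder and compares a plain `*.htm` pattern with a `**/*.htm` pattern. The folder and file names (`batch2`, `a.htm`, `b.htm`) are made up for illustration and say nothing about your own Drive layout.

```python
import glob
import os
import tempfile

# Build a temporary folder that stands in for the Factiva folder (hypothetical layout).
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "batch2"))
for rel_path in ["a.htm", os.path.join("batch2", "b.htm")]:
    with open(os.path.join(root, rel_path), "w", encoding="utf-8") as f:
        f.write("<html></html>")

# A single * does not cross into subfolders, even with recursive=True.
top_level = glob.glob(os.path.join(root, "*.htm"), recursive=True)

# ** together with recursive=True matches the folder itself and every subfolder.
all_levels = glob.glob(os.path.join(root, "**", "*.htm"), recursive=True)

print(len(top_level), len(all_levels))
```

The first pattern finds only `a.htm`, while the second also finds `batch2/b.htm`.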




In [None]:
files = glob.glob('/content/drive/MyDrive/Factiva/*.htm', recursive = True) #replace with your path

In [None]:
files

The following `for` loop starts with an empty list named `empty_list`. It goes through each HTML file in the `files` list, reads any tables found in that file, and adds them to `empty_list`.

In this case, the goal is to have a list where each element is a dataframe, not a list of lists. Since `pd.read_html()` returns a list of dataframes for each file, `extend` is used to merge those dataframes directly into `empty_list` so that it contains all the dataframes in one flat structure.

If you used `append`, you would end up with a nested structure where each element is a list of dataframes, which is likely not what you want.
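
The difference is easier to see on a small example. This is only an illustrative sketch: the two lists of tiny dataframes below are invented stand-ins for what `pd.read_html()` returns for two files.

```python
import pandas as pd

# Two stand-ins for what pd.read_html() returns: each is a list of dataframes.
tables_file_1 = [pd.DataFrame({"a": [1]}), pd.DataFrame({"a": [2]})]
tables_file_2 = [pd.DataFrame({"a": [3]})]

flat = []
nested = []
for tables in (tables_file_1, tables_file_2):
    flat.extend(tables)    # adds each dataframe individually
    nested.append(tables)  # adds the whole list as a single element

print(len(flat))    # 3 dataframes
print(len(nested))  # 2 lists of dataframes
```

Only the flat version can be passed straight to functions that expect a list of dataframes, such as `pd.concat`.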

In [None]:
empty_list = []
for file in files:
    data = pd.read_html(file, index_col = 0) #reads the HTML content of the file and tries to find any tables inside it.
    empty_list.extend(data) # The extend() method is used to add the data (which is a list of dataframes) from the current file to empty_list.

In [None]:
empty_list

Let's create a variable, `frames`, which contains a dataframe where all relevant dataframes (those containing 'HD' in their index) are combined and flipped.

The next line of code does the following:
- Looks through all the dataframes in `empty_list` and selects only the ones where `HD` is found in the index.
- Concatenates those dataframes side by side (combining their columns).
- Transposes the resulting dataframe, flipping the rows and columns, and assigns it to the variable `frames`.


Breaking it down:
- `[l for l in empty_list if 'HD' in l.index.values]`:
  1. This is a list comprehension. It goes through each item `l` in `empty_list` (which contains dataframes).
  2. For each dataframe `l`, it checks whether `'HD'` is present in that dataframe's index values (`l.index.values`).
  3. If `'HD'` is found, the dataframe is included in the resulting list; otherwise it is ignored.

- `pd.concat([...], axis=1)`:
  1. This concatenates (joins) all the dataframes that contain `'HD'` in their index. The `axis=1` argument places them side by side. The result is a new dataframe in which each matching dataframe contributes its columns, aligned on the shared index labels (`HD`, `PD`, and so on).

- `.T`:
  1. This is shorthand for "transpose", which flips the rows and columns of the resulting dataframe.
  2. After concatenating the dataframes side by side, `.T` switches the rows and columns, so the field codes in the index become column names and each former column becomes a row.
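
The three steps above can be sketched on toy data. The tables below are invented for illustration and only imitate the shape assumed here (field codes in the index, one column of values per table); real Factiva tables contain many more fields.

```python
import pandas as pd

# Two toy tables shaped like one article each: field codes in the index,
# a single column of values. The contents are invented for illustration.
article_1 = pd.DataFrame({1: ["First headline", "1 January 2020"]}, index=["HD", "PD"])
article_2 = pd.DataFrame({1: ["Second headline", "2 January 2020"]}, index=["HD", "PD"])
# A table without 'HD' in its index, standing in for non-article tables.
other = pd.DataFrame({1: ["search summary"]}, index=["Text"])

tables = [article_1, other, article_2]
selected = [t for t in tables if "HD" in t.index.values]  # keeps the two article tables

side_by_side = pd.concat(selected, axis=1)  # rows: HD, PD; one column per table
toy_frames = side_by_side.T                 # one row per table; columns: HD, PD

print(side_by_side.shape)
print(toy_frames.columns.tolist())
```

After the transpose, each field code is a column, which is what makes the later `drop` and `rename` calls on columns possible.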

In [None]:
frames = pd.concat([l for l in empty_list if 'HD' in l.index.values], axis=1).T

In [None]:
frames

Next, let's drop unnecessary columns from the DataFrame and then rename the remaining columns to more meaningful or readable names (e.g., 'HD' becomes 'Headline', 'PD' becomes 'Publication_Date', etc.).

In [None]:
frames.drop(columns=['SC', 'CY','PUB', 'NS',
                   'IPD', 'IPC',
                   'IN', 'VOL', 'RF', 'LA', 'CO', 'AN', 'SE', 'WC', 'ED', 'PG', 'ART', 'CLM'  ], inplace = True)
frames.rename(columns = {'HD': 'Headline',
                         'PD': 'Publication_Date','SN': 'Source_Name', 'LP': 'Lead Paragraph',
                          'TD': 'Body', 'ET':'Estimated_Time',
                         'BY':'Author_Name', 'RE':'Region' }, inplace=True)
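
If an export lacks some of the field codes listed above, `drop` raises a `KeyError`. One possible workaround, sketched here on an invented toy dataframe, is the `errors="ignore"` option, which skips labels that are not present. Whether silently skipping missing columns is appropriate depends on your data.

```python
import pandas as pd

# Toy dataframe with only some of the Factiva field codes (invented values).
toy = pd.DataFrame({"HD": ["A headline"], "PD": ["1 January 2020"], "WC": ["350 words"]})

# errors="ignore" skips labels that are not present instead of raising KeyError.
trimmed = toy.drop(columns=["WC", "CLM", "ART"], errors="ignore")
renamed = trimmed.rename(columns={"HD": "Headline", "PD": "Publication_Date"})

print(renamed.columns.tolist())
```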

Let's make sure our `Publication_Date` column is in datetime format.

In [None]:
frames['Publication_Date'] = pd.to_datetime(frames['Publication_Date'])
frames.sort_values(by='Publication_Date', inplace=True)
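
If some dates cannot be parsed, `pd.to_datetime` raises an error by default. The sketch below, using invented values, shows the `errors="coerce"` option, which turns unparseable entries into `NaT` (missing) so they can be inspected afterwards. It assumes the dates are written out like `12 March 2021`; your export may use a different format.

```python
import pandas as pd

# Invented sample values; the last one is deliberately not a date.
toy = pd.DataFrame({"Publication_Date": ["12 March 2021", "25 January 2020", "not a date"]})

# errors="coerce" turns unparseable values into NaT instead of raising an error.
toy["Publication_Date"] = pd.to_datetime(toy["Publication_Date"], errors="coerce")

# sort_values places NaT rows last by default.
toy_sorted = toy.sort_values(by="Publication_Date")
print(toy_sorted)
```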

The next cell downloads a resource called `punkt` for the Natural Language Toolkit (`nltk`) package:

- `import nltk`:
  1. This imports the `nltk` library, a popular Python library for working with human language data (natural language processing).
- `nltk.download('punkt')`:
  1. This downloads Punkt, a pre-trained sentence tokenizer model. It is not bundled with the `nltk` installation, so it has to be downloaded separately.
  2. Once downloaded, it is used by `nltk.sent_tokenize()` to break text into sentences. `nltk.word_tokenize()` also relies on it, because it splits text into sentences before splitting them into words.

Why do you need this?
`nltk` keeps its models and corpora separate from the library code. Without the Punkt resource, sentence splitting and word tokenization raise a `LookupError`. Recent `nltk` releases may ask for a resource named `punkt_tab` instead; if that happens, download the resource named in the error message.

In [None]:
import nltk
nltk.download('punkt')

Next, let's import the `sent_tokenize` function from the `nltk.tokenize` module. This function is used to split a given text into individual sentences. It takes a string of text as input and returns a list of sentences.

In [None]:
from nltk.tokenize import sent_tokenize

Next, we want to combine the text from the `Lead Paragraph` and the `Body` columns so we have the full article in a single cell.

In [None]:
frames['CombinedText'] = frames['Lead Paragraph'] + " " + frames['Body']
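
Note that adding two text columns yields a missing value whenever either side is missing. A minimal sketch on invented data shows the effect and one possible way around it, filling missing values with empty strings before concatenating:

```python
import pandas as pd

toy = pd.DataFrame({
    "Lead Paragraph": ["Lead one.", "Lead two."],
    "Body": ["Body one.", None],  # second article has no body text
})

# Plain concatenation: a missing value in either column makes the result missing.
plain = toy["Lead Paragraph"] + " " + toy["Body"]

# Filling missing values with empty strings first keeps whatever text is available.
filled = (toy["Lead Paragraph"].fillna("") + " " + toy["Body"].fillna("")).str.strip()

print(plain.isna().tolist())
print(filled.tolist())
```

With the plain version, the second article would end up with no text at all, even though its lead paragraph exists.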

In [None]:
frames

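Let's reset the index so that the rows are numbered 0, 1, 2, and so on, in their current (date-sorted) order. Note that `reset_index()` keeps the old index as a new column named `index`.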

In [None]:
df = frames.reset_index()
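
If the old index labels are not needed, `reset_index(drop=True)` discards them instead of storing them in a column. The toy dataframe below is invented to show the difference; the repeated label `1` is only an assumption about what the leftover index might look like.

```python
import pandas as pd

# Toy dataframe whose rows all carry the same leftover label (here 1).
toy = pd.DataFrame({"Headline": ["B", "A"]}, index=[1, 1])

kept = toy.reset_index()              # old labels move into a new 'index' column
dropped = toy.reset_index(drop=True)  # old labels are discarded

print(kept.columns.tolist())
print(dropped.columns.tolist())
```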

In [None]:
path= '/content/drive/MyDrive/Factiva'  # change to your path

Next, we will use code that loops through each row in the DataFrame `df`, creates a unique text file for each row, and writes the content of the `CombinedText` column into that file. If the `CombinedText` value is missing (i.e., `NaN`), it writes an empty string instead. The filenames are generated dynamically from the row index.

In essence, it saves the text content from each row in the DataFrame as individual text files.

1. **`for index, row in df.iterrows():`**
   - This line starts a loop over each row in the DataFrame `df`.
   - `df.iterrows()` is a pandas function that allows you to loop through the DataFrame row by row.
   - `index` is the row's index label (0, 1, 2, and so on after `reset_index()`), and `row` contains the data for that specific row in the form of a pandas Series.

2. **`file_name = f"{path}/text_file_{index + 1}.txt"`**
   - This line generates a unique filename for each row.
   - The `f"{path}/text_file_{index + 1}.txt"` uses an f-string to create a file name based on the `index` (plus 1 to make it start at 1 instead of 0).
   - `path` is a variable that contains the directory where the file will be saved (it should be defined earlier in the code).
   - For example, for the first row (index 0), this would create a filename like `"path/to/directory/text_file_1.txt"`.

3. **`with open(file_name, 'w') as file:`**
   - This opens a file with the name `file_name` in write mode (`'w'`), allowing the program to write data into it.
   - The `with` statement ensures the file is properly closed after writing, even if an error occurs.

4. **`text_content = str(row['CombinedText']) if pd.notnull(row['CombinedText']) else ''`**
   - This checks the value in the `'CombinedText'` column for the current row.
   - **`pd.notnull(row['CombinedText'])`** checks if the value is **not** `NaN` (i.e., it’s not missing).
   - If the value is **not** `NaN`, it converts the value to a string using `str(row['CombinedText'])`.
   - If the value is `NaN`, it sets `text_content` to an empty string (`''`).
   - This ensures that no matter what value is in the `CombinedText` column, you will have valid text to write to the file (either the text itself or an empty string).

5. **`file.write(text_content)`**
   - This writes the `text_content` (the string version of the `CombinedText` value) to the file.
   - If the `CombinedText` was `NaN`, it writes an empty string; otherwise, it writes the actual content from that column.




In [None]:
for index, row in df.iterrows():
    file_name = f"{path}/text_file_{index + 1}.txt"  # Create a unique filename for each row
    with open(file_name, 'w', encoding='utf-8') as file:
        text_content = str(row['CombinedText']) if pd.notnull(row['CombinedText']) else ''  # Convert to string and handle NaN
        file.write(text_content)  # Write the text content to the file
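
The same loop can be tried out safely before writing to Google Drive. The sketch below is one possible variant, not the only way to do it: it uses an invented two-row dataframe, a temporary folder in place of `path`, and `pathlib` with an explicit UTF-8 encoding.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Invented data: one article with text, one with a missing value.
toy = pd.DataFrame({"CombinedText": ["First article text.", None]})

# A temporary folder stands in for the Drive path used above.
out_dir = Path(tempfile.mkdtemp())

for index, row in toy.iterrows():
    text_content = str(row["CombinedText"]) if pd.notnull(row["CombinedText"]) else ""
    # encoding="utf-8" avoids depending on the platform's default encoding.
    (out_dir / f"text_file_{index + 1}.txt").write_text(text_content, encoding="utf-8")

written = sorted(p.name for p in out_dir.glob("*.txt"))
print(written)
```

The second file is created but left empty, which mirrors how the loop above handles `NaN` values.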