JSON data is currently available only for 2019-2022, so 2017-2018 are omitted for now.

# News Data Processing Script

This script processes news data stored in JSON format and converts it into a CSV file.

## Script Workflow

The script performs the following steps:

1. **Mounts Google Drive:** This connects to the user's Google Drive to access the data files using the `google.colab.drive` module.
2. **Installs necessary libraries:** This installs the `dropbox` library using `pip` for interacting with Dropbox. It also imports `json`, `csv`, `os`, `glob`, and `tqdm` for data processing and file handling.
3. **Defines file paths:** This sets the paths to the raw news data folder, project folder, and the dataset folder within Google Drive.
4. **Sets Dropbox access token:** This sets the access token for authenticating with Dropbox.
5. **Defines years to process:** This specifies the years of data to be processed.
6. **Processes data from Google Drive (if applicable):**
    - If the data is already in Google Drive, it iterates through the specified years.
    - It reads JSON files from the raw news data folder for each year.
    - It takes the file name from each JSON file's base name and extracts the ID and content from each record.
    - It writes the extracted data to a CSV file in the dataset folder.
7. **Processes data from Dropbox (if applicable):**
    - If the data is in Dropbox, it initializes the Dropbox client using the access token.
    - It retrieves a list of files from the specified Dropbox folder path.
    - It iterates through the files, downloads them, and reads the JSON data.
    - It extracts the ID and content from each JSON record and takes the file name from the part of the ID before `--`.
    - It writes the extracted data to a CSV file in the dataset folder.
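
Steps 6 and 7 share the same core: read each JSON file, pull `id` and `content` from every record, and emit one CSV row per record. The following is a minimal sketch of that core, assuming each file holds a list of record dictionaries. The helper names `iter_rows` and `write_rows` are hypothetical and not part of the script. Rows are yielded one at a time and written as they are produced, so the full dataset never has to sit in memory.

```python
import csv
import glob
import json
import os
import tempfile


def iter_rows(json_paths):
    """Yield [file_name, id, content] for every record in every JSON file."""
    for path in json_paths:
        with open(path, 'r', encoding='utf-8') as f:
            records = json.load(f)  # assumed: a list of dicts
        file_name = os.path.splitext(os.path.basename(path))[0]
        for record in records:
            yield [file_name, record.get('id', ''), record.get('content', '')]


def write_rows(csv_path, rows):
    """Write the header, then stream rows to the CSV; return the row count."""
    count = 0
    with open(csv_path, 'w', newline='', encoding='utf-8') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(['file_name', 'id', 'content'])
        for row in rows:
            writer.writerow(row)
            count += 1
    return count


# Tiny made-up example in a temporary folder (not real project data).
with tempfile.TemporaryDirectory() as tmp:
    sample = [{'id': 'examplesource--0001', 'content': 'First article.'},
              {'id': 'examplesource--0002'}]  # missing 'content' becomes ''
    with open(os.path.join(tmp, 'examplesource.json'), 'w', encoding='utf-8') as f:
        json.dump(sample, f)

    out_path = os.path.join(tmp, 'combined_data_demo.csv')
    paths = sorted(glob.glob(os.path.join(tmp, '*.json')))
    written = write_rows(out_path, iter_rows(paths))

    # Read the result back to inspect it.
    with open(out_path, newline='', encoding='utf-8') as f:
        result = list(csv.reader(f))
```

Because `iter_rows` is a generator, the same `write_rows` call could in principle be fed rows from any source, including records downloaded one file at a time.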

## Input

- JSON files containing news data for each year, stored either in Google Drive or Dropbox.
- File paths specified in the `json_dir`, `project_folder`, and `dataset_folder` variables.
- Dropbox access token.
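
The code reads each JSON file as a list of records and uses only the `id` and `content` keys. The snippet below sketches that assumed shape with made-up values. The `source--suffix` form of `id` is inferred from the Dropbox branch, which takes the text before `--` as the file name; the helper `file_name_from_id` is hypothetical.

```python
import json

# Made-up example of one input file's contents (not real project data).
raw = '''
[
  {"id": "examplesource--2020-01-01--headline", "content": "Article text."},
  {"id": "examplesource--2020-01-02--other", "content": "More text."}
]
'''


def file_name_from_id(id_value):
    """Return the part of the id before the first '--' (Dropbox branch behavior)."""
    return id_value.split('--')[0]


records = json.loads(raw)
# A missing 'id' falls back to '', which yields an empty file name.
names = [file_name_from_id(r.get('id', '')) for r in records]
```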

## Output

- A CSV file named `combined_data_{year}.csv` for each year, written to the dataset folder and containing the data extracted from the JSON files. The columns are `file_name`, `id`, and `content`.
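
To sanity-check an output file, it can be read back with `csv.DictReader`. The sketch below writes a tiny made-up file first so that it runs on its own; in practice `csv_path` would point at a `combined_data_{year}.csv` in the dataset folder. The helper `summarize_csv` is hypothetical.

```python
import csv
import os
import tempfile


def summarize_csv(csv_path):
    """Return (column names, number of data rows, rows with empty content)."""
    with open(csv_path, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        total = 0
        empty = 0
        for row in reader:
            total += 1
            if not row['content']:
                empty += 1
        return reader.fieldnames, total, empty


# Made-up demo file in a temporary folder (not real project data).
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'combined_data_demo.csv')
    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['file_name', 'id', 'content'])
        # Multi-line content is quoted by csv.writer and read back as one row.
        writer.writerow(['examplesource', 'examplesource--0001', 'Line one.\nLine two.'])
        writer.writerow(['examplesource', 'examplesource--0002', ''])
    columns, total, empty = summarize_csv(csv_path)
```

Opening the file with `newline=''` on both write and read matters here, since article text often contains line breaks inside a single field.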

## Dependencies

- `dropbox`
- `json`
- `csv`
- `os`
- `glob`
- `tqdm`
- `google.colab` (for Google Colab environment)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!pip install dropbox

import json
import csv
import os
import glob
from tqdm import tqdm
import dropbox

In [None]:
# Path prefix for the raw JSON folders; the year is appended later (e.g. .../newsdata_2022)
json_dir = '/content/drive/MyDrive/2024SUDSProject/rawNewsData/newsdata_'
project_folder = "/content/drive/MyDrive/2024SUDSProject/"
dataset_folder = "/content/drive/MyDrive/2024SUDSProject/processedNewsData/"

def create_dropbox_folder_path(year):
    return f'/CCAMO/Data/Unconverted/nela-gt-{year}/newsdata/'

In [None]:
# Define the years to process. Process one year at a time to keep the Google Colab notebook from timing out.

years = ['2022']

dropbox_access_token = 'YOUR_TOKEN_HERE'

In [None]:
# If the dataset is already in the Google Drive

for year in years:
    with open(dataset_folder+f'combined_data_{year}.csv', 'w', newline='', encoding='utf-8') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(['file_name', 'id', 'content'])  # Writing the header

        # Get a list of all JSON files for this year
        json_files = glob.glob(os.path.join(json_dir+year, '*.json'))
        # Iterate over all JSON files in the directory with a progress bar
        for json_file in tqdm(json_files, desc="Processing JSON files", miniters=1):
            # Read the JSON file (separate name so the CSV file handle is not shadowed)
            with open(json_file, 'r', encoding='utf-8') as json_fp:
                data = json.load(json_fp)

            file_name = os.path.splitext(os.path.basename(json_file))[0]

            for record in data:
                id_value = record.get('id', '')
                content_value = record.get('content', '')
                writer.writerow([file_name, id_value, content_value])

print("Conversion complete.")

In [None]:
# If the dataset is in Dropbox (short-term solution: all rows are collected in memory first; TODO: stream rows directly to the CSV instead)

def import_from_dropbox(year, dropbox_folder_path):
    # Initialize Dropbox client
    dbx = dropbox.Dropbox(dropbox_access_token)

    result = dbx.files_list_folder(dropbox_folder_path)

    files = result.entries

    # Fetch the remaining pages of the folder listing
    while result.has_more:
        result = dbx.files_list_folder_continue(result.cursor)
        files.extend(result.entries)

    total_data = []

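    for file in tqdm(files, desc="Reading JSON files from Dropbox", miniters=1):
        # Skip folders and other non-file entries, which cannot be downloaded
        if not isinstance(file, dropbox.files.FileMetadata):
            continue
        _, file_response = dbx.files_download(file.path_lower)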

        # Read JSON data
        json_data = file_response.content.decode('utf-8')
        json_data = json.loads(json_data)
        for record in json_data:
            id_value = record.get('id', '')
            content_value = record.get('content', '')
            total_data.append([id_value.split('--')[0], id_value, content_value])


    keys = ['file_name', 'id', 'content']  # Column names for the header row

    # Write to CSV file (UTF-8, matching the Google Drive branch)
    with open(dataset_folder+f'combined_data_{year}.csv', 'w', newline='', encoding='utf-8') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(keys)
        writer.writerows(total_data)

    # print(f'Conversion Complete')

import_from_dropbox('2020', create_dropbox_folder_path('2020'))
