# Data Preprocessing

For our analysis, tokenizing phrases rather than individual words may be the better choice, because important phrases such as "United States of America" lose their meaning when broken into individual words.
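
As a small illustration of this point, plain whitespace splitting (used here only as a stand-in for word-level tokenization, not the tokenizer used in this notebook) breaks the phrase into four separate tokens, while a phrase-level unit keeps it intact:

```python
# Illustration only: str.split() stands in for word-level tokenization.
phrase = "United States of America"

word_tokens = phrase.split()   # four separate tokens
phrase_tokens = [phrase]       # the phrase kept as one unit

print(word_tokens)
print(phrase_tokens)
```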

We pass the texts to spaCy's `nlp.pipe` in batches (`batch_size=1000` by default), which is faster than calling `nlp` on each text separately.
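
To make the idea of a batch concrete, here is a minimal sketch of splitting a list of texts into fixed-size groups. The helper `iter_batches` is a hypothetical name used only for illustration; in this notebook, spaCy's `nlp.pipe` handles batching itself through its `batch_size` argument.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Made-up placeholder texts, not taken from the datasets.
texts = [f"text {i}" for i in range(5)]
batches = list(iter_batches(texts, batch_size=2))
print(batches)
```

The last batch may be smaller than `batch_size` when the number of texts is not an exact multiple of it.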

In [1]:
import json
import spacy
import os
import time

In [2]:
nlp = spacy.load("en_core_web_sm")

In [3]:
os.makedirs('../data/tokenized', exist_ok=True)

In [4]:
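def preprocess_data_phrases(dataset, batch_size=1000):
    """Return one list of noun-chunk phrases per entry in `dataset`."""
    # An entry's 'text' may be a list of strings; join it into a single string.
    texts = [
        " ".join(entry['text']) if isinstance(entry['text'], list) else entry['text']
        for entry in dataset
    ]

    tokenized_phrases = []
    # nlp.pipe processes the texts in batches of `batch_size`.
    for doc in nlp.pipe(texts, batch_size=batch_size):
        if doc:
            # Keep each noun chunk as a single phrase-level token.
            phrases = [chunk.text for chunk in doc.noun_chunks]
            tokenized_phrases.append(phrases)
        else:
            # Empty documents still get an entry so the output stays aligned with the input.
            tokenized_phrases.append([])

    return tokenized_phrases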
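
The function accepts entries whose `text` field is either a single string or a list of strings. The sketch below isolates that normalization step in a hypothetical helper, `entry_to_text`, with made-up sample entries, to show what is passed to spaCy in each case:

```python
def entry_to_text(entry):
    """Return the entry's text as one string, joining list-valued text with spaces."""
    text = entry['text']
    return " ".join(text) if isinstance(text, list) else text


# Made-up sample entries, not taken from the datasets.
sample = [
    {'text': ["Thank you.", "It is good to be here."]},
    {'text': "Thank you all."},
]
texts = [entry_to_text(entry) for entry in sample]
print(texts)
```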

We decided not to lowercase the text because our task is NER, and some words lose their distinguishing features when lowercased.
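
A quick illustration of what lowercasing would discard. The example words are made up for demonstration and are not taken from the datasets:

```python
# After lowercasing, an acronym or proper noun becomes indistinguishable
# from an ordinary word spelled with the same letters.
pairs = [("US", "us"), ("Bush", "bush"), ("March", "march")]

for proper, common in pairs:
    print(proper, common, proper.lower() == common)
```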

In [5]:
with open('../data/cleaned/cleaned_biden_data.json', 'r') as f:
    biden_data = json.load(f)

with open('../data/cleaned/cleaned_obama_data.json', 'r') as f:
    obama_data = json.load(f)

with open('../data/cleaned/cleaned_trump_data.json', 'r') as f:
    trump_data = json.load(f)

In [6]:
start_time = time.time()

In [7]:
print("Processing Biden dataset...")
biden_preprocessed_phrases = preprocess_data_phrases(biden_data)
with open('../data/tokenized/preprocessed_biden_phrases.json', 'w') as f:
    json.dump(biden_preprocessed_phrases, f)

Processing Biden dataset...


In [8]:
print("Processing Obama dataset...")
obama_preprocessed_phrases = preprocess_data_phrases(obama_data)
with open('../data/tokenized/preprocessed_obama_phrases.json', 'w') as f:
    json.dump(obama_preprocessed_phrases, f)

Processing Obama dataset...


In [9]:
print("Processing Trump dataset...")
trump_preprocessed_phrases = preprocess_data_phrases(trump_data)
with open('../data/tokenized/preprocessed_trump_phrases.json', 'w') as f:
    json.dump(trump_preprocessed_phrases, f)

Processing Trump dataset...


In [10]:
end_time = time.time()
print(f"Preprocessing completed in {end_time - start_time} seconds")

Preprocessing completed in 14299.131327867508 seconds
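
For readability, the elapsed time can be converted to hours, minutes, and seconds. A minimal sketch using the standard library, with the value reported above hard-coded:

```python
from datetime import timedelta

elapsed_seconds = 14299.131327867508  # value printed by the cell above
readable = str(timedelta(seconds=round(elapsed_seconds)))
print(readable)  # 3:58:19
```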
