# Data Pre-processing

Starting from the raw file produced by rtf_conversion.ipynb, we now pre-process the data into separate CSVs for the two methods used later, namely
1. collocates
2. topic modelling

For collocates, we want words to stay next to each other in the order they appear in the article text, but we remove punctuation so that each word can be segmented and treated as sitting at a single "index". We also remove general stopwords.

For topic modelling, we follow a "bag-of-words" approach in which word order is discarded, so more processing is needed: on top of the steps above, the text is also lemmatized.
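To make the difference between the two representations concrete, here is a minimal sketch. The sentence is invented for illustration, and the regex tokenizer is only a stand-in for the `word_tokenize` call used later in this notebook.

```python
import re
from collections import Counter

# Toy sentence, made up for this illustration (not from the corpus)
sentence = "The mayor praised the new park, and the park opened today."

# Stand-in tokenizer: lowercase and keep runs of letters only
tokens = re.findall(r"[a-z]+", sentence.lower())

# Collocate view: every token keeps its position ("index") in the text
positions = list(enumerate(tokens))

# Bag-of-words view: only the counts survive, the order is gone
bag = Counter(tokens)

print(positions[:6])
print(bag.most_common(3))
```

The collocate methods need the first view, because "next to" only makes sense when positions are kept; topic modelling only needs the second.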

In [2]:
# general imports 
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer

In [3]:
# load dataframe of raw data
df = pd.read_csv('data/all_years_raw.csv')

In [11]:
# check whether any of the parsed years look wrong
# no years above 2011 were detected, but 41 articles have a year below 1980

df_test = df.loc[(df['year'] < 1980)] 
df_test

Unnamed: 0,title,publisher,year,month,day,full text
465,"THEODORE C. SORENSEN; May 8, 1928 - Oct. 31, 2...",Pittsburgh Post-Gazette (Pennsylvania),1928,5,8,"Theodore C. Sorensen, John F. Kennedy's close ..."
1344,"SUSAN NELSON; MARCH 2, 1947 - NOV. 4, 2010; FE...",Pittsburgh Post-Gazette (Pennsylvania),1947,3,2,"The Rev. Susan Nelson, a feminist theologian w..."
1373,"ELLIOT HANDLER; APRIL 9, 1916 - JULY 21, 2011;...",Pittsburgh Post-Gazette (Pennsylvania),1916,4,9,"Elliot Handler, a pioneering toy maker who co-..."
2186,PEOPLE\n,St. Louis Post-Dispatch (Missouri),1942,7,30,The Great Flood of '93 has finally risen as hi...
2603,"JILL JOHNSTON; MAY 17, 1929 - SEPT. 18, 2010; ...",Pittsburgh Post-Gazette (Pennsylvania),1929,5,17,"Jill Johnston, a longtime cultural critic for ..."
4149,"H.F. GARDNER; MARCH 2, 1926 - JULY 25, 2009; S...",Pittsburgh Post-Gazette (Pennsylvania),1926,3,2,"Gerald H.F. Gardner, an Irish-born geophysicis..."
6758,"JACK. T. LITMAN; JULY 26, 1943-JAN. 23, 2010; ...",The New York Times,1943,7,26,"Jack T. Litman, a lawyer known for his cerebra..."
7341,Review/Architecture;\nAn Architect From the Gr...,The New York Times,0,0,0,"ABOUT 15 years ago, I was living in the Berksh..."
7721,MILES THE JAZZ GREAT'S NEW AUTOBIOGRAPHY IS A ...,St. Louis Post-Dispatch (Missouri),1926,5,26,IN SOME fairly hip magazine of about 30 years ...
7798,"HAZEL DICKENS; JUNE 1, 1935 - APRIL 22, 2011; ...",The New York Times,1935,6,1,"Hazel Dickens, a clarion-voiced advocate for c..."


In [14]:
# a perusal of the data shows that most of the pre-1980 rows are actually obituaries,
# where the parsed year/month/day is the birth date given in the title
# so we will ignore them (remove them from the dataset)
df = df.loc[~(df['year'] < 1980)]

# check that they have been removed
df.loc[(df['year'] < 1980)]  

Unnamed: 0,title,publisher,year,month,day,full text
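The obituary titles listed above share a visible "NAME; birth date - death date" shape, which suggests a simple way to flag them. The sketch below is a heuristic under that assumption only: the pattern `LIFESPAN` and the helper `lifespan_years` are hypothetical names, and the titles are copied from the table above and cut off after the second date.

```python
import re

# Hypothetical pattern: "; <month> <day>, <year> - <month> <day>, <year>"
LIFESPAN = re.compile(
    r";\s*[A-Za-z]+\.?\s+\d{1,2},\s+(\d{4})\s*-\s*[A-Za-z]+\.?\s+\d{1,2},\s+(\d{4})"
)

def lifespan_years(title):
    """Return (birth_year, death_year) if the title looks like an obituary, else None."""
    match = LIFESPAN.search(title)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

# Titles taken from the outputs in this notebook
titles = [
    "JACK. T. LITMAN; JULY 26, 1943-JAN. 23, 2010",
    "HAZEL DICKENS; JUNE 1, 1935 - APRIL 22, 2011",
    "SUSAN NELSON; MARCH 2, 1947 - NOV. 4, 2010",
    "Pretty Poison",
]

for title in titles:
    print(title, "->", lifespan_years(title))
```

In each matching title the first year equals the `year` column of that row, which is consistent with the birth date having been picked up as the publication date.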


## Topic Modelling Section

The pre-processing needed for topic modelling.

In [15]:
# Download the NLTK resources used below (stopword list, tokenizer, WordNet)
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("wordnet")

# Load stopwords
stop_words = set(stopwords.words("english"))

# Define the cleaning functions
def remove_special_characters_and_numbers(text):
    return re.sub(r"[^a-zA-Z\s]", "", text)

def remove_stopwords_and_lowercase(text):
    word_tokens = word_tokenize(text)
    filtered_words = [w.lower() for w in word_tokens if w.lower() not in stop_words]
    return " ".join(filtered_words)

# Implement other cleaning functions (e.g., correct OCR errors, lemmatize/stem) if needed
def remove_urls_and_mentions(text):
    return re.sub(r"http\S+|www\S+|@\S+", "", text)

def remove_punctuation(text):
    return re.sub(r"[^\w\s]", "", text)

# create the lemmatizer once instead of on every call
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    return ' '.join(lemmatizer.lemmatize(word) for word in text.split())

# additional stopwords, one per line; stored as a set for fast membership tests
with open('data/stopwords-en.txt') as f:
    stopwords_list = set(f.read().splitlines())

# function to remove these additional stopwords
def remove_additional_stopwords(text):
    word_tokens = word_tokenize(text)
    filtered_words = [w for w in word_tokens if w not in stopwords_list]
    return " ".join(filtered_words)

# Apply cleaning functions to a new dataframe for topic modelling (df_tm)
# replace() returns a new dataframe, so df itself is left untouched
df_tm = df.replace(r'\n', ' ', regex=True)
# strip URLs and @mentions first, while the punctuation that marks them is still present
df_tm['clean_text'] = df_tm['full text'].apply(remove_urls_and_mentions)
df_tm['clean_text'] = df_tm['clean_text'].apply(remove_special_characters_and_numbers)
df_tm['clean_text'] = df_tm['clean_text'].apply(remove_stopwords_and_lowercase)
df_tm['clean_text'] = df_tm['clean_text'].apply(remove_additional_stopwords)
df_tm['clean_text'] = df_tm['clean_text'].apply(remove_punctuation)
df_tm['clean_text'] = df_tm['clean_text'].apply(lemmatize_text)

print(df_tm.head())

# get a sense of the data: article length in words after cleaning
df_tm['article_length'] = df_tm['clean_text'].apply(lambda x: len(x.split()))

# Basic statistical analysis
article_stats = df_tm['article_length'].describe()
print(article_stats)

[nltk_data] Error loading stopwords: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1129)>
[nltk_data] Error loading punkt: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1129)>
[nltk_data] Error loading wordnet: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1129)>


                                               title   
0  Primer of populist progressives;  They claim s...  \
1                                   Pretty Poison      
2  RECIFE JOURNAL; BRAZIL'S FLESHPOTS BRING TOURI...   
3  Life in U.S. Brings Success and Visibility For...   
4               Beating Time Warner at Its Own Game    

                        publisher  year  month  day   
0  Star Tribune (Minneapolis, MN)  1993      2   28  \
1              The New York Times  2002      2   10   
2              The New York Times  1987      2   24   
3              The New York Times  2010     12   28   
4              The New York Times  1990      4    8   

                                           full text   
0  Lots of Minnesota's liberal DFLers these days ...  \
1  Five years ago, Anna Quindlen wrote that there...   
2  One repertoire of the night in this hot coasta...   
3  ATLANTA -- Around Sept. 11, 2001, not long aft...   
4  I didn't have any money,'' said Robin Wolaner....
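The order of the regex cleaning steps matters. Here is a minimal sketch that reuses the two regexes from the cleaning functions above on an invented sample string (the URL and the handle are placeholders): once "@" has been stripped, the mention can no longer be recognised, so its text stays in the output.

```python
import re

def remove_special_characters_and_numbers(text):
    return re.sub(r"[^a-zA-Z\s]", "", text)

def remove_urls_and_mentions(text):
    return re.sub(r"http\S+|www\S+|@\S+", "", text)

# Invented sample; example.com and @newsdesk are placeholders
sample = "Read more at http://example.com/story and follow @newsdesk today"

# Order A: strip non-letters first, then try to remove URLs and mentions
strip_first = remove_urls_and_mentions(
    remove_special_characters_and_numbers(sample)
).split()

# Order B: remove URLs and mentions first, then strip non-letters
urls_first = remove_special_characters_and_numbers(
    remove_urls_and_mentions(sample)
).split()

print(strip_first)  # the handle text "newsdesk" is still present
print(urls_first)   # both the URL and the handle are gone
```

For newspaper text this probably affects few tokens, but running the URL and mention step first is the safer order.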

In [19]:
# let us then export this df_tm as a .csv for use in the topic_modelling file
df_tm.to_csv('data_for_tm.csv', encoding='utf-8', index=False)

## Collocates Section

The pre-processing needed for collocates.

In [20]:
# the cleaning functions were already defined in the topic modelling section, so we just call the ones we need here
# we don't want to lemmatize the text though

# start again from df (not df_tm) so nothing from the topic modelling frame carries over
df_cl = df.replace(r'\n', ' ', regex=True)
# strip URLs and @mentions first, while the punctuation that marks them is still present
df_cl['clean_text'] = df_cl['full text'].apply(remove_urls_and_mentions)
df_cl['clean_text'] = df_cl['clean_text'].apply(remove_special_characters_and_numbers)
df_cl['clean_text'] = df_cl['clean_text'].apply(remove_stopwords_and_lowercase)
df_cl['clean_text'] = df_cl['clean_text'].apply(remove_additional_stopwords)
df_cl['clean_text'] = df_cl['clean_text'].apply(remove_punctuation)

print(df_cl.head())

# get a sense of the data: article length in words after cleaning
df_cl['article_length'] = df_cl['clean_text'].apply(lambda x: len(x.split()))

# Basic statistical analysis
article_stats = df_cl['article_length'].describe()
print(article_stats)

                                               title   
0  Primer of populist progressives;  They claim s...  \
1                                   Pretty Poison      
2  RECIFE JOURNAL; BRAZIL'S FLESHPOTS BRING TOURI...   
3  Life in U.S. Brings Success and Visibility For...   
4               Beating Time Warner at Its Own Game    

                        publisher  year  month  day   
0  Star Tribune (Minneapolis, MN)  1993      2   28  \
1              The New York Times  2002      2   10   
2              The New York Times  1987      2   24   
3              The New York Times  2010     12   28   
4              The New York Times  1990      4    8   

                                           full text   
0  Lots of Minnesota's liberal DFLers these days ...  \
1  Five years ago, Anna Quindlen wrote that there...   
2  One repertoire of the night in this hot coasta...   
3  ATLANTA -- Around Sept. 11, 2001, not long aft...   
4  I didn't have any money,'' said Robin Wolaner....

In [None]:
# let us then export this df_cl as a .csv for use in the collocates file

df_cl.to_csv('data_for_cl.csv', encoding='utf-8', index=False)
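Because `clean_text` in df_cl keeps the original word order, a later step can count which words occur within a few positions of a given word. The sketch below is one possible way to do that, not the method used in the collocates file: the helper `collocates`, the window size, and the toy string are all assumptions made for illustration.

```python
from collections import Counter

def collocates(tokens, node, window=2):
    """Count words appearing within `window` positions of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    return counts

# Toy cleaned text, invented for this illustration
clean_text = "mayor praised new park today mayor visited park"
tokens = clean_text.split()

result = collocates(tokens, "park", window=2)
print(result.most_common())
```

Lemmatizing before this step would merge word forms that we may want to keep apart when looking at neighbours, which is why the collocates frame skips it.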