# Preprocessing Notebook

In this notebook, we build on insights from the exploratory data analysis to iteratively develop and refine data cleaning and preprocessing functions. This trial-and-error process will guide the creation of a finalized preprocessing script to be used consistently during the modeling phase.

In [18]:
import pandas as pd

#Reading in the data
df = pd.read_csv('../Data/WELFake_Dataset.csv')

## Handling Missing Values

As identified during the exploratory data analysis, the dataset contains **558** rows with missing titles and **39** rows with missing text. Given the overall size of the dataset, the most appropriate approach is to remove these rows rather than impute missing values with blanks. Dropping them preserves data quality and avoids introducing noise that could negatively impact model performance.

In [19]:
#Print size of dataframe before dropping rows with NA values
print("Before dropping NA's: ", df.shape)

#Dropping rows with NA values
df = df.dropna(subset = ['title', 'text'])

#Printing size of dataframe after dropping rows with NA values
print("After dropping NA's", df.shape)

Before dropping NA's:  (72134, 4)
After dropping NA's (71537, 4)
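
As a quick sanity check on the numbers above: 72,134 − 71,537 = 597 rows were dropped, which equals 558 + 39, so no row appears to be missing both its title and its text. The sketch below shows how this kind of check could be expressed. The `toy` frame and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are invented)
toy = pd.DataFrame({
    "title": ["a", None, "c", None],
    "text": ["x", "y", None, None],
    "label": [0, 1, 0, 1],
})

# Rows where title OR text is missing; a row missing both is counted once
rows_with_na = int(toy[["title", "text"]].isna().any(axis=1).sum())

cleaned = toy.dropna(subset=["title", "text"])

# The number of rows dropped should match the count computed above
dropped = len(toy) - len(cleaned)
print(rows_with_na, dropped, cleaned.shape)
```

If the two counts ever disagree on the real data, that would suggest the frame was modified between the EDA and this step.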


## Text Cleaning

In [20]:
#Retrieving a sample of 5 articles' titles and text
sample_df = df[['title', 'text']].sample(5)

#Creates a loop that iterates over the sample data and outputs the text without truncating the string
for i, row in sample_df.iterrows():
    print(f"\n=== ARTICLE {i} ===")
    print(f"Title: {row['title']}\n")
    print(f"Text:\n{row['text']}\n")


=== ARTICLE 19281 ===
Title: Clinton clear on Trump: 'We were not friends': People magazine

Text:
WASHINGTON (Reuters) - Democratic presidential candidate Hillary Clinton wants to set the record straight on Donald Trump: “We were not friends.” “We knew each other, obviously, in New York,” Clinton, a former U.S. senator from New York, said in excerpts of a People magazine interview released on Wednesday. “I knew a lot of people.” Trump, the real estate billionaire whose standing as Republican front-runner was dented by a second-place finish in the Iowa caucuses on Monday, had long touted his friendship with Bill and Hillary Clinton. In a March 2012 Fox News interview, Trump praised Clinton as a “terrific woman.” “I am biased because I have known her for years. I live in New York. She lives in New York. I really like her and her husband both a lot. I think she really works hard,” Trump told Fox. But the Clintons, who attended Trump’s 2005 wedding, were fair game on the campaign trail. 

These five sample articles offer meaningful insight into the types of preprocessing necessary to prepare the text for effective modeling. Several key cleaning operations have been identified based on this initial review:

- **Convert all text to lowercase**: Standardizing case helps reduce vocabulary size and ensures that semantically identical words (e.g., “President” and “president”) are treated equivalently. This is especially important when using bag-of-words or TF-IDF representations, where case differences would otherwise be interpreted as separate tokens.

- **Remove punctuation**: Punctuation rarely contributes meaningful information in traditional NLP classification tasks, and its removal simplifies the tokenization process. This also reduces sparsity in the feature space.

- **Eliminate redundant whitespace**: Extra spaces introduced by removing punctuation, URLs, or formatting artifacts can create inconsistencies. Normalizing whitespace ensures cleaner tokenization and more consistent feature extraction.

- **Remove content within parentheses**: In many cases, parenthetical text contains metadata such as newswire sources (e.g., “(Reuters)”), which may bias the model or introduce non-generalizable patterns. Removing this content encourages the model to focus on the article's substantive content rather than source attribution.

- **Remove stop words**: Stop words are common terms such as “the,” “is,” and “and” that typically carry little semantic weight in text classification. Removing them reduces noise and allows the model to focus on more informative tokens that are better indicators of content or tone.

To validate the impact of these steps and ensure correctness, we will first apply them to a single article. This trial implementation will serve as the foundation for designing a reusable text-cleaning function to be integrated into the preprocessing script.
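
To make the first point in the list concrete, here is a minimal sketch of how case folding shrinks a vocabulary. The three-sentence corpus is hypothetical and whitespace splitting stands in for a real tokenizer.

```python
# Hypothetical mini-corpus; not taken from the dataset
docs = ["The President spoke", "the president left", "PRESIDENT returns"]

# Vocabulary with case preserved: "The"/"the" and the three spellings
# of "president" all count as distinct tokens
raw_vocab = {tok for doc in docs for tok in doc.split()}

# Vocabulary after lowercasing every document first
folded_vocab = {tok for doc in docs for tok in doc.lower().split()}

print(len(raw_vocab), len(folded_vocab))
```

In a bag-of-words or TF-IDF matrix, each entry in the vocabulary becomes a column, so the folded version yields fewer and denser features.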

In [21]:
#Pulling one row of data to perform trial and error cleaning on. This one specifically has text within parentheses.
test_row = df.iloc[[38919]].copy()

#Prints title and text value of test row
print("Before converting to lowercase\n")
print(f"Title: {test_row['title'].iloc[0]}\n")
print(f"Text:\n{test_row['text'].iloc[0]}\n")

#Converts title and text strings to lowercase
test_row['title'] = test_row['title'].str.lower()
test_row['text'] = test_row['text'].str.lower()

#Prints values of title and text values to confirm lowercase conversion worked
print("After converting to lowercase\n")
print(f"Title: {test_row['title'].iloc[0]}\n")
print(f"Text:\n{test_row['text'].iloc[0]}\n")

Before converting to lowercase

Title: Egypt defends human rights position after criticism from OHCHR

Text:
CAIRO (Reuters) - Egypt s United Nations envoy on Tuesday criticized U.N. High Commissioner for Human Rights Zeid Ra ad al-Hussein s remarks on systemic violence in the country, saying they reflected  flawed logic , state news agency MENA reported. Ambassador Amr Ramadan was quoted as saying that he had cautioned Zeid against his office becoming a  mouthpiece for paid agencies with political and economic agendas,  and he rejected his accusations, without elaborating. At a UN Human Rights Council meeting in Geneva on Monday, Hussein said the state of emergency declared by the Egyptian government last April had been used to justify  systemic silencing of civil society.  He cited reports of waves of arrests, arbitrary detention, black-listing, travel bans, asset freezes, intimidation and other reprisals against human rights defenders, journalists, political dissidents and those aff

With the text now fully converted to lowercase, the next preprocessing step is to remove words within parentheses.

In [22]:
import re

#Looks for an opening and closing parenthesis with any characters in between them and replaces the whole match with nothing
test_row['title'] = test_row['title'].str.replace(r'\([^)]*\)', '', regex = True)
test_row['text'] = test_row['text'].str.replace(r'\([^)]*\)', '', regex = True)

#Print title and text column after removing text within parentheses
print("After removing parenthetical text\n")
print(f"Title: {test_row['title'].iloc[0]}\n")
print(f"Text:\n{test_row['text'].iloc[0]}\n")

After removing parenthetical text

Title: egypt defends human rights position after criticism from ohchr

Text:
cairo  - egypt s united nations envoy on tuesday criticized u.n. high commissioner for human rights zeid ra ad al-hussein s remarks on systemic violence in the country, saying they reflected  flawed logic , state news agency mena reported. ambassador amr ramadan was quoted as saying that he had cautioned zeid against his office becoming a  mouthpiece for paid agencies with political and economic agendas,  and he rejected his accusations, without elaborating. at a un human rights council meeting in geneva on monday, hussein said the state of emergency declared by the egyptian government last april had been used to justify  systemic silencing of civil society.  he cited reports of waves of arrests, arbitrary detention, black-listing, travel bans, asset freezes, intimidation and other reprisals against human rights defenders, journalists, political dissidents and those affiliated with 
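
The pattern `\([^)]*\)` works well for simple cases like the one above, but it is worth knowing how it behaves on less tidy input. The sketch below probes two edge cases. The sample strings are invented for illustration.

```python
import re

# Same pattern as in the cell above: "(", any run of non-")" characters, ")"
PAREN_RE = re.compile(r"\([^)]*\)")

simple = PAREN_RE.sub("", "cairo (reuters) - egypt")
nested = PAREN_RE.sub("", "a (b (c) d) e")
unclosed = PAREN_RE.sub("", "unclosed (paren stays")

print(repr(simple))    # the parenthetical is removed, leaving a double space
print(repr(nested))    # the match stops at the first ")", so " d)" survives
print(repr(unclosed))  # no closing parenthesis, so nothing is removed
```

Leftover parentheses from nested or unclosed cases are removed later by the punctuation step, but the words that were inside them stay in the text.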

With parenthetical content successfully removed, we can now proceed to eliminate the remaining punctuation from the text.

In [23]:
import string

#Creates a translation table that maps no characters to replacements, but deletes every character in string.punctuation
punctuation_table = str.maketrans('','', string.punctuation)

#Applies the punctuation_table to the title and text column using translate
test_row['title'] = test_row['title'].str.translate(punctuation_table)
test_row['text'] = test_row['text'].str.translate(punctuation_table)

#Print title and text column after removing punctuation
print("After removing punctuation\n")
print(f"Title: {test_row['title'].iloc[0]}\n")
print(f"Text:\n{test_row['text'].iloc[0]}\n")

After removing punctuation

Title: egypt defends human rights position after criticism from ohchr

Text:
cairo   egypt s united nations envoy on tuesday criticized un high commissioner for human rights zeid ra ad alhussein s remarks on systemic violence in the country saying they reflected  flawed logic  state news agency mena reported ambassador amr ramadan was quoted as saying that he had cautioned zeid against his office becoming a  mouthpiece for paid agencies with political and economic agendas  and he rejected his accusations without elaborating at a un human rights council meeting in geneva on monday hussein said the state of emergency declared by the egyptian government last april had been used to justify  systemic silencing of civil society  he cited reports of waves of arrests arbitrary detention blacklisting travel bans asset freezes intimidation and other reprisals against human rights defenders journalists political dissidents and those affiliated with the muslim brotherho

While this step removes standard ASCII punctuation, `string.punctuation` does not include typographic characters such as curly apostrophes, curly quotation marks, and the single-character ellipsis, so these can remain in the text. To address this more comprehensively, we will implement an alternative cleaning method.

In [24]:
#Removes all punctuation from the title and text columns by stripping everything except word characters (letters, digits, underscores) and whitespace
test_row['title'] = test_row['title'].str.replace(r'[^\w\s]', '', regex = True)
test_row['text'] = test_row['text'].str.replace(r'[^\w\s]', '', regex = True)

#Print title and text column after removing punctuation
print("After removing punctuation\n")
print(f"Title: {test_row['title'].iloc[0]}\n")
print(f"Text:\n{test_row['text'].iloc[0]}\n")

After removing punctuation

Title: egypt defends human rights position after criticism from ohchr

Text:
cairo   egypt s united nations envoy on tuesday criticized un high commissioner for human rights zeid ra ad alhussein s remarks on systemic violence in the country saying they reflected  flawed logic  state news agency mena reported ambassador amr ramadan was quoted as saying that he had cautioned zeid against his office becoming a  mouthpiece for paid agencies with political and economic agendas  and he rejected his accusations without elaborating at a un human rights council meeting in geneva on monday hussein said the state of emergency declared by the egyptian government last april had been used to justify  systemic silencing of civil society  he cited reports of waves of arrests arbitrary detention blacklisting travel bans asset freezes intimidation and other reprisals against human rights defenders journalists political dissidents and those affiliated with the muslim brotherho
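
The test article happens to contain no typographic punctuation at this point, so the two outputs above look identical. To see where the approaches differ, here is a small sketch on an invented string that includes curly quotes and an ellipsis character.

```python
import re
import string

# Invented sample containing a curly apostrophe, curly quotes, and "…"
sample = "trump’s “terrific” u.s. deal… really?"

# Approach 1: delete only the ASCII characters listed in string.punctuation
ascii_table = str.maketrans("", "", string.punctuation)
after_translate = sample.translate(ascii_table)

# Approach 2: delete anything that is not a word character or whitespace
after_regex = re.sub(r"[^\w\s]", "", sample)

print(after_translate)
print(after_regex)
```

One difference runs the other way: `string.punctuation` includes the underscore, while `\w` treats it as a word character, so the regex leaves underscores in place.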

One observation from the text review is the presence of isolated single-character tokens, such as "s", which do not contribute meaningful information. Before proceeding with stop word removal, we will eliminate these extraneous single-character words to further refine the text.

In [25]:
#Removes all single character words within title and text
test_row['title'] = test_row['title'].str.replace(r'\b\w\b', '', regex=True)
test_row['text'] = test_row['text'].str.replace(r'\b\w\b', '', regex=True)

#Print title and text column after removing all single character words
print("After removing single-character words\n")
print(f"Title: {test_row['title'].iloc[0]}\n")
print(f"Text:\n{test_row['text'].iloc[0]}\n")

After removing single-character words

Title: egypt defends human rights position after criticism from ohchr

Text:
cairo   egypt  united nations envoy on tuesday criticized un high commissioner for human rights zeid ra ad alhussein  remarks on systemic violence in the country saying they reflected  flawed logic  state news agency mena reported ambassador amr ramadan was quoted as saying that he had cautioned zeid against his office becoming   mouthpiece for paid agencies with political and economic agendas  and he rejected his accusations without elaborating at  un human rights council meeting in geneva on monday hussein said the state of emergency declared by the egyptian government last april had been used to justify  systemic silencing of civil society  he cited reports of waves of arrests arbitrary detention blacklisting travel bans asset freezes intimidation and other reprisals against human rights defenders journalists political dissidents and those affiliated with the muslim brotherhood g
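
Note that `\b\w\b` matches any isolated word character, not only the stray "s" left behind by removed apostrophes. The sketch below, on an invented string, shows that the article "a" and single digits are removed as well.

```python
import re

# Same pattern as in the cell above: one word character with a
# word boundary on each side
SINGLE_CHAR_RE = re.compile(r"\b\w\b")

sample = "egypt s envoy at a un meeting on 5 may"
stripped = SINGLE_CHAR_RE.sub("", sample)

# Collapse the gaps left behind so the result is easier to read
collapsed = " ".join(stripped.split())

print(repr(stripped))
print(collapsed)
```

Losing "a" is harmless here because it is a stop word anyway. Whether dropping single digits is acceptable depends on how much weight numeric tokens should carry in the model.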

With the text now largely cleaned and standardized, we are ready to proceed with the removal of stop words to further refine the dataset. While this step may not be necessary in the final preprocessing script, since the TF-IDF vectorizer can handle stop word removal internally, it is valuable to explore this process here for experimental purposes and to better understand its potential impact on the data.

In [26]:
import nltk

#Downloads most current list of stopwords to remove from text
nltk.download('stopwords')
from nltk.corpus import stopwords

#Sets stop words to english list of stop words
stop_words = set(stopwords.words('english'))

#Function to remove stop words from a string to be applied to columns of the data
def remove_stopwords(text):
    return ' '.join(word for word in text.split() if word.lower() not in stop_words)

#Applies the function created above to the title and text columns of the data
test_row['title'] = test_row['title'].apply(remove_stopwords)
test_row['text'] = test_row['text'].apply(remove_stopwords)

#Print title and text column after removing all stop words
print("After removing stop words\n")
print(f"Title: {test_row['title'].iloc[0]}\n")
print(f"Text:\n{test_row['text'].iloc[0]}\n")

After removing stop words

Title: egypt defends human rights position criticism ohchr

Text:
cairo egypt united nations envoy tuesday criticized un high commissioner human rights zeid ra ad alhussein remarks systemic violence country saying reflected flawed logic state news agency mena reported ambassador amr ramadan quoted saying cautioned zeid office becoming mouthpiece paid agencies political economic agendas rejected accusations without elaborating un human rights council meeting geneva monday hussein said state emergency declared egyptian government last april used justify systemic silencing civil society cited reports waves arrests arbitrary detention blacklisting travel bans asset freezes intimidation reprisals human rights defenders journalists political dissidents affiliated muslim brotherhood group last week egypt came fire human rights watch said report systemic torture country jails leading cairo block access hrw website egypt human rights parliamentary committee critical 

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/tylerkatz/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


With all stop words successfully removed from the text and title columns, we now move on to the final step of text preprocessing: eliminating unnecessary whitespace to ensure clean and consistent formatting.

In [27]:
#Collapses runs of internal whitespace into a single space with replace, and removes leading/trailing whitespace with strip
test_row['title'] = test_row['title'].str.replace(r'\s+', ' ', regex=True).str.strip()
test_row['text'] = test_row['text'].str.replace(r'\s+', ' ', regex=True).str.strip()

#Print title and text column after normalizing whitespace
print("After normalizing whitespace\n")
print(f"Title: {test_row['title'].iloc[0]}\n")
print(f"Text:\n{test_row['text'].iloc[0]}\n")

After normalizing whitespace

Title: egypt defends human rights position criticism ohchr

Text:
cairo egypt united nations envoy tuesday criticized un high commissioner human rights zeid ra ad alhussein remarks systemic violence country saying reflected flawed logic state news agency mena reported ambassador amr ramadan quoted saying cautioned zeid office becoming mouthpiece paid agencies political economic agendas rejected accusations without elaborating un human rights council meeting geneva monday hussein said state emergency declared egyptian government last april used justify systemic silencing civil society cited reports waves arrests arbitrary detention blacklisting travel bans asset freezes intimidation reprisals human rights defenders journalists political dissidents affiliated muslim brotherhood group last week egypt came fire human rights watch said report systemic torture country jails leading cairo block access hrw website egypt human rights parliamentary committee critical 
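
This output matches the one from the stop word step. A likely explanation is that `text.split()` followed by `' '.join(...)` already collapses whitespace as a side effect. The sketch below illustrates this with a stop word helper that takes its word list as an argument, so it runs without an NLTK download. The `toy_stop_words` set is a hypothetical stand-in for the NLTK list.

```python
import re

def remove_stopwords(text, stop_words):
    # Same split/join pattern as the notebook's helper; split() with no
    # arguments discards leading, trailing, and repeated whitespace
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

toy_stop_words = {"the", "on", "at"}  # hypothetical subset, for illustration
messy = "  cairo   egypt  envoy on\ttuesday  "

after_stopwords = remove_stopwords(messy, toy_stop_words)
after_whitespace = re.sub(r"\s+", " ", after_stopwords).strip()

print(repr(after_stopwords))
print(after_stopwords == after_whitespace)
```

The explicit whitespace step still earns its place if stop word removal is later delegated to the vectorizer, since the text would then reach this point with its gaps intact.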

With whitespace successfully normalized, the title and text columns are now fully standardized and cleaned. We can now proceed to the next phase of the pipeline: feature engineering.

## Feature Engineering

As identified during the exploratory data analysis, sentiment can serve as a valuable indicator of an article's authenticity. To leverage this insight, we will engineer a new sentiment feature, replicating the approach used in the EDA phase, and incorporate it as an input variable in the subsequent modeling process.

In [28]:
from nltk.sentiment import SentimentIntensityAnalyzer

#Downloads vocabulary and rules to analyze sentiment
nltk.download('vader_lexicon')

#Initializes a VADER sentiment analyzer object
sia = SentimentIntensityAnalyzer()

#Creates a sentiment column for the test row based on the text of the article
test_row['sentiment'] = (
    test_row['text'].fillna('').apply(lambda t: sia.polarity_scores(t)['compound'])
)

print('Sentiment of test row:')
print(test_row.iloc[0]['sentiment'])

Sentiment of test row:
-0.9808


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/tylerkatz/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
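
To help interpret a value such as -0.9808: as far as we understand VADER's implementation, the compound score is the sum of the adjusted word valences squashed into the range -1 to 1 by a normalization of the form x / sqrt(x² + α), with α = 15 by default. The sketch below re-implements that normalization for illustration only. It is not a call into NLTK, and the input sums are arbitrary.

```python
import math

def normalize(score, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)

# Arbitrary valence sums, chosen only to show the shape of the curve
for total in (-20, -4, 0, 4, 20):
    print(total, round(normalize(total), 4))
```

Long articles with many negative terms therefore saturate close to -1. Because VADER also reacts to capitalization and to punctuation such as exclamation marks, scores computed on lowercased, punctuation-free text may differ from scores on the raw text, which is worth checking against the EDA results.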


With the sentiment feature successfully generated, the preprocessing notebook is now complete. The next step is to consolidate all the cleaning and preprocessing steps into a single, reusable script that can be applied to the entire dataset in preparation for modeling.
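
As a starting point for that script, the cleaning steps explored above could be combined into one function along the following lines. This is a sketch under stated assumptions rather than the final implementation: the name `clean_text`, the optional `stop_words` parameter, and the tiny `toy_stop_words` set are all hypothetical, and the real script would presumably pass in the NLTK stop word list or leave that step to the vectorizer.

```python
import re

PAREN_RE = re.compile(r"\([^)]*\)")      # parenthetical content
NON_WORD_RE = re.compile(r"[^\w\s]")     # punctuation, including curly quotes
SINGLE_CHAR_RE = re.compile(r"\b\w\b")   # isolated one-character tokens
WHITESPACE_RE = re.compile(r"\s+")       # runs of whitespace

def clean_text(text, stop_words=frozenset()):
    """Apply the notebook's cleaning steps, in order, to a single string."""
    text = text.lower()
    text = PAREN_RE.sub("", text)
    text = NON_WORD_RE.sub("", text)
    text = SINGLE_CHAR_RE.sub("", text)
    # With the default empty set, no words are filtered out here
    text = " ".join(w for w in text.split() if w not in stop_words)
    return WHITESPACE_RE.sub(" ", text).strip()

toy_stop_words = frozenset({"on", "the"})  # hypothetical, for illustration
example = clean_text("CAIRO (Reuters) - Egypt s U.N. envoy on Tuesday", toy_stop_words)
print(example)
```

Because the function operates on a plain string, it could be applied to a column with something like `df['text'].apply(clean_text)` and tested independently of pandas.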