# 01 Preprocessing

Goal: Prepare the data for model training.
1. Combine the datasets into one homogeneous, human-readable dataset with a text **Article** column and a boolean **Truth** column, ensuring no empty cells.
2. Remove punctuation and other error-prone text.
3. Tokenize the data for NLP.

Output:
1. Human-readable dataset
2. Reduced dataset
3. Tokenized dataset

## 01.1 Human-Readable Dataset

First, we'll reduce the raw FakeNewsNet dataset to the one column we need (the `title` column, falling back to `text` if a file has no `title`) and add a truth label derived from each file name: 1 for real, 0 for fake.

In [None]:
import pandas as pd

In [None]:
fnn_files = [
    '../data/raw/FakeNewsNet/gossipcop_fake.csv',
    '../data/raw/FakeNewsNet/gossipcop_real.csv',
    '../data/raw/FakeNewsNet/politifact_fake.csv',
    '../data/raw/FakeNewsNet/politifact_real.csv'
]

dfs = []
for file in fnn_files:
    temp_df = pd.read_csv(file)

    # Determine truth value: 1 for real, 0 for fake
    truth = 1 if 'real' in file else 0
    
    # Some files may use different column names for the article text
    if 'title' in temp_df.columns:
        titles = temp_df['title']
    elif 'text' in temp_df.columns:
        titles = temp_df['text']
    else:
        continue  # skip if no suitable column
    df_subset = pd.DataFrame({'Article': titles, 'Truth': truth})
    dfs.append(df_subset)

joint_df = pd.concat(dfs, ignore_index=True)

Next, we'll do the same for the ISOT dataset. Since ISOT provides both the title and the body of each article, we concatenate them into a single **Article** string.

In [None]:
isot_files = [
    '../data/raw/ISOTFakeNewsDataset/True.csv',
    '../data/raw/ISOTFakeNewsDataset/Fake.csv'
]

for file in isot_files:
    temp_df = pd.read_csv(file)
    truth = 1 if 'True' in file else 0  # 1 for True.csv, 0 for Fake.csv

    # Concatenate title and text for ISOT dataset
    articles = temp_df['title'] + ' ' + temp_df['text']
    df_subset = pd.DataFrame({'Article': articles, 'Truth': truth})
    dfs.append(df_subset)
joint_df = pd.concat(dfs, ignore_index=True)

In [None]:
# save to file
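
The save step above is left as a stub. A minimal sketch of one way to fill it in is shown here, assuming the human-readable dataset is written as CSV. The helper name `save_dataset` and any output path passed to it are hypothetical and do not come from the notebook.

```python
from pathlib import Path

import pandas as pd


def save_dataset(df: pd.DataFrame, path: str) -> Path:
    """Write df to a CSV file at path, creating parent directories if needed."""
    out_path = Path(path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # index=False keeps the row index out of the file, so only the
    # Article and Truth columns are written
    df.to_csv(out_path, index=False)
    return out_path
```

In the notebook this could be called as `save_dataset(joint_df, '../data/processed/human_readable.csv')`, where the path is only an assumed location.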

## 01.2 Reduced Dataset

Next, we'll check some traits of this data for potential points of failure. First, check for blank article entries.

In [None]:
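# A notebook cell only displays its last expression, so print the shape explicitly
print(joint_df.shape)   # (rows, columns)
joint_df.isna().sum()   # number of missing values per column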
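
If the check reports missing entries, they need to be dropped to meet the "no empty cells" goal. The following is a sketch under the assumption that whitespace-only strings should count as blank too. The helper name `drop_blank_articles` is hypothetical.

```python
import pandas as pd


def drop_blank_articles(df: pd.DataFrame, column: str = 'Article') -> pd.DataFrame:
    """Return a copy of df without rows whose text is missing or whitespace-only."""
    text = df[column]
    # A row is blank if the value is NaN/None or is empty after stripping whitespace
    is_blank = text.isna() | (text.astype(str).str.strip() == '')
    return df.loc[~is_blank].reset_index(drop=True)
```

Calling `joint_df = drop_blank_articles(joint_df)` before the cleaning steps would also keep later string operations from running into missing values.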

Next, we want to remove punctuation and unusual characters. We'll use a standard procedure for NLP cleaning:
1. Lowercasing
2. Removal of punctuation and special characters
3. Removal of extra whitespace
4. Removal of stopwords (such as 'the', 'is', or 'and')

In [None]:
# Lowercasing
joint_df['Article'] = joint_df['Article'].str.lower()

# Removal of punctuation and special characters
# (the text is already lowercase, so keep only a-z, 0-9, and whitespace;
# the vectorized .str methods skip missing values instead of raising)
joint_df['Article'] = joint_df['Article'].str.replace(r'[^a-z0-9\s]', '', regex=True)

# Removal of extra whitespace
joint_df['Article'] = joint_df['Article'].str.replace(r'\s+', ' ', regex=True).str.strip()

# Removal of stopwords (such as 'the', 'is', or 'and')
