## Data Cleaning and Preprocessing

In this notebook, we clean and preprocess the raw AITA posts (loaded below from aita_raw.csv) so that they can be fed to our logistic regression model.

### Necessary Imports Below

In [2]:
import pandas as pd
import numpy as np
import spacy
from tqdm import tqdm

### Preprocessing

Here, we preprocess each Reddit post. Specifically, the code below applies the following steps to the post title and body:

- Stop word removal
- Lowercasing
- Lemmatization
- Punctuation Removal
- Tokenization
- Enforce str data type

Doing this removes irrelevant and noisy text, shortens the input so it is less likely to hit token length limits, normalizes slangy or repetitive posts, and keeps the downstream model efficient and focused.

The title and body are combined into a new column called "combined_text", and the preprocessing is applied to that combined text. Then, we drop the title, body, and other unnecessary columns to save space.
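To make the steps listed above concrete, here is a minimal, dependency-free sketch of the same idea. It is only an illustration: the stop word list `TOY_STOP_WORDS` and the function `toy_preprocess` are hypothetical names, the stop list is far smaller than spaCy's, and lemmatization is left out because it needs a language model.

```python
import string

# Hypothetical, tiny stop word list for illustration only; the notebook
# itself uses spaCy's much larger English list (minus "not").
TOY_STOP_WORDS = {"i", "a", "the", "to", "my", "am", "for", "is"}

def toy_preprocess(text: str) -> str:
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stop words."""
    lowered = text.lower()
    # Remove ASCII punctuation characters
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # Whitespace tokenization (spaCy's tokenizer is more sophisticated)
    tokens = no_punct.split()
    # Keep everything that is not in the toy stop list; "not" survives
    kept = [tok for tok in tokens if tok not in TOY_STOP_WORDS]
    return " ".join(kept)

print(toy_preprocess("AITA for NOT going to my sister's wedding?"))
# prints: aita not going sisters wedding
```

Note that "not" is kept, since dropping it would flip the meaning of the sentence.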

In [None]:
# Load dataset
df = pd.read_csv('../../Data/aita_raw.csv')

# print initial dataset shape
print("Initial dataset shape:", df.shape)

# Drop unnecessary columns
df.drop(columns=['timestamp', 'edited', 'is_asshole'], inplace=True)

Initial dataset shape: (97628, 9)


In [5]:
# Load spaCy and disable unnecessary components
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])
nlp.max_length = 60000000

# Build the stopword set once: all spaCy stopwords except 'not', which carries negation
STOP_WORDS = nlp.Defaults.stop_words - {'not'}

def preprocess_text(text, nlp):
    """
    Preprocesses text by lowercasing, lemmatizing, and removing stopwords
    (except 'not'), punctuation, and whitespace tokens.
    """
    # Process the lowercased text with spaCy
    doc = nlp(text.lower())
    
    # Keep the lemmas of tokens that are not punctuation, whitespace, or stopwords
    tokens = [token.lemma_ for token in doc 
             if (not token.is_punct 
                 and not token.is_space
                 and token.text not in STOP_WORDS)]
    
    # Join tokens back into a string
    return ' '.join(tokens)

# Ensure title and body are strings; missing values become empty strings
df['body'] = df['body'].fillna('').astype(str)
df['title'] = df['title'].fillna('').astype(str)

# Combine the raw title and body into a new column (preprocessed in the next step)
df['combined_text'] = df['title'] + ' ' + df['body']

# Apply preprocessing
tqdm.pandas(desc="Preprocessing texts")
df['combined_text'] = df['combined_text'].progress_apply(lambda x: preprocess_text(x, nlp))

# Drop the body and title columns
df.drop(columns=['body', 'title'], inplace=True)

Preprocessing texts:   6%|▌         | 578/10000 [00:30<08:20, 18.81it/s]


KeyboardInterrupt: 
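The order of `fillna('')` and `astype(str)` in the cell above matters. The sketch below, which uses a hypothetical two-row toy frame rather than the real dataset, compares the combined text with and without the `fillna('')` step. In many pandas versions, casting a missing value with `astype(str)` yields the literal string "nan"; other versions may keep it missing. Either way, filling first gives a predictable result.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical rows; the second post has no body
toy = pd.DataFrame({
    "title": ["AITA for leaving early?", "WIBTA if I said no?"],
    "body": ["Long story short...", np.nan],
})

# Without fillna, the missing body leaks into the combined text
# (as the string "nan" or as a missing value, depending on the pandas version)
naive = toy["title"] + " " + toy["body"].astype(str)

# With fillna(''), the missing body simply contributes nothing
safe = toy["title"] + " " + toy["body"].fillna("").astype(str)

print(repr(naive[1]))
print(repr(safe[1]))
```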

After cleaning, we save the preprocessed data in the Data subdirectory.

In [None]:
# Save the preprocessed DataFrame to a new CSV file
df.to_csv("../../Data/aita_posts_preprocessed.csv", index=False)
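One thing to watch for when this file is loaded again: a post whose text becomes empty after preprocessing is written as an empty CSV field, and `pd.read_csv` reads empty fields back as missing values by default. The sketch below shows the round trip on a hypothetical toy frame (the `id` column and the in-memory buffer are assumptions made for illustration, not part of the real dataset).

```python
import io

import pandas as pd

# Hypothetical toy frame: the second post became empty after preprocessing
toy = pd.DataFrame({
    "id": ["a1", "b2"],
    "combined_text": ["aita not go wedding", ""],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Default read: the empty string comes back as a missing value
reloaded = pd.read_csv(buffer)
n_missing = int(reloaded["combined_text"].isna().sum())
print("Missing after reload:", n_missing)

# Restore empty strings before feeding the text to a downstream model
reloaded["combined_text"] = reloaded["combined_text"].fillna("")
```

An alternative would be to drop such rows before saving, if empty posts are not useful for training.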