## Data Cleaning and Preprocessing

In this notebook, we process the data in aita_posts.csv so that it can be fed to our BERT model.

### Necessary Imports Below

In [15]:
import pandas as pd
import numpy as np
import spacy
from tqdm import tqdm

In [16]:
df = pd.read_csv('../../aita_clean.csv', nrows=10000)

# drop() returns a new DataFrame, so assign the result to keep the change
df = df.drop(columns=['timestamp', 'edited', 'is_asshole'])
df

Unnamed: 0,id,title,body,verdict,score,num_comments
0,1ytxov,[AITA] I wrote an explanation in TIL and came ...,[Here is the post in question](http://www.redd...,asshole,52,13.0
1,1yu29c,[AITA] Threw my parent's donuts away,"My parents are diabetic, morbidly obese, and a...",asshole,140,27.0
2,1yu8hi,I told a goth girl she looked like a clown.,I was four.,not the asshole,74,15.0
3,1yuc78,[AItA]: Argument I had with another redditor i...,http://www.reddit.com/r/HIMYM/comments/1vvfkq/...,everyone sucks,22,3.0
4,1yueqb,[AITA] I let my story get a little long and bo...,,not the asshole,6,4.0
...,...,...,...,...,...,...
9995,aex06k,AITA for not wanting to go to prom?,i’m a junior in high school and i’ve been to p...,not the asshole,8,9.0
9996,aex3da,AITA for not letting my brother use my computer?,Hey reddit am I not the asshole for letting my...,not the asshole,16,28.0
9997,aex4r5,AITA for despising this girl?,AITA??\n\nSo I know a girl who got pregnant be...,asshole,31,80.0
9998,aex7pl,AITA becuase a christmas gift I ordered for my...,"Ok so long story here, I decided back in Novem...",not the asshole,11,13.0


### Preprocessing

Here, we preprocess each Reddit post. Specifically, the code below applies the following steps to the post title and body:

- Stop word removal
- Lowercasing
- Lemmatization
- Punctuation Removal

These steps remove irrelevant and noisy text, help posts stay within the model's token length limit, normalize slangy or repetitive posts, and keep the input to the BERT model compact and focused.

Afterwards, the processed title and body are joined and stored in a new column called "combined_text". We then drop the original title and body columns, which are no longer needed, to save space.
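To make the four steps concrete, here is a minimal, dependency-free sketch of the same idea. It is not the spaCy pipeline used in this notebook: `TOY_STOP_WORDS`, `TOY_LEMMAS`, `toy_preprocess`, and the sample title and body are hypothetical stand-ins, and real lemmatization relies on spaCy's much larger resources rather than a three-entry lookup table.

```python
import string

# Hypothetical miniature resources; spaCy ships far larger ones.
TOY_STOP_WORDS = {"i", "my", "a", "the", "and", "for", "to", "was"}
TOY_LEMMAS = {"throwing": "throw", "donuts": "donut", "parents": "parent"}

def toy_preprocess(text):
    """Lowercase, strip surrounding punctuation, drop stop words, lemmatize by lookup."""
    tokens = []
    for raw in text.split():
        # Lowercasing and punctuation removal
        word = raw.lower().strip(string.punctuation)
        # Stop word removal (also skips tokens that were only punctuation)
        if not word or word in TOY_STOP_WORDS:
            continue
        # Lemmatization via a toy lookup; unknown words pass through unchanged
        tokens.append(TOY_LEMMAS.get(word, word))
    return tokens

# Made-up example post
title = "AITA for throwing away my parents' donuts?"
body = "I was four."

# Title tokens come first, then body tokens, joined by single spaces
combined = " ".join(toy_preprocess(title) + toy_preprocess(body))
print(combined)  # aita throw away parent donut four
```

The output is noticeably shorter than the input, which is the effect the steps above aim for on full-length posts.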

In [None]:
# Load spaCy and disable unnecessary components
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])
nlp.max_length = 60000000

# Replace missing values with empty strings and make sure both columns are str
df['body'] = df['body'].fillna('').astype(str)
df['title'] = df['title'].fillna('').astype(str)

# Define preprocessing functions
def preprocess_text(text):
    doc = nlp(text)
    return [token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct]

def process_row(row):
    try:
        title_tokens = preprocess_text(row['title'])
        body_tokens = preprocess_text(row['body'])
        return ' '.join(title_tokens + body_tokens)
    except Exception as e:
        print(f"Error processing row: {e}")
        return ''

# Process rows with visible progress bar
processed_texts = []
for idx, row in tqdm(df.iterrows(), total=len(df), desc="Processing posts"):
    processed_text = process_row(row)
    processed_texts.append(processed_text)

# Add processed texts as new column
df['combined_text'] = processed_texts

# Drop the body and title columns
df.drop(columns=['body', 'title'], inplace=True)



Processing posts: 100%|██████████| 10000/10000 [09:41<00:00, 17.20it/s]


After cleaning, we save the preprocessed data to the Data subdirectory.

In [None]:
# Save the preprocessed DataFrame to a new CSV file
df.to_csv("../Data/aita_posts_preprocessed.csv", index=False)
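
One thing to watch when this file is read back later: rows whose processed text is an empty string (for example, rows where `process_row` hit an error and returned `''`) are written as empty CSV fields, and `pd.read_csv` loads empty fields as missing values by default. The sketch below shows the round trip on a hypothetical two-row frame, using an in-memory buffer instead of the real file; `demo`, `buffer`, and `reloaded` are illustrative names only.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the preprocessed data
demo = pd.DataFrame(
    {"id": ["a1", "b2"], "combined_text": ["aita throw away parent donut", ""]}
)

# Write to an in-memory buffer the same way the notebook writes to disk
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

# Empty CSV fields come back as missing values by default
reloaded = pd.read_csv(buffer)
n_missing = int(reloaded["combined_text"].isna().sum())
print(n_missing)

# Restore empty strings so downstream tokenization always receives str input
reloaded["combined_text"] = reloaded["combined_text"].fillna("")
```

Whether to fill such rows or drop them depends on how the downstream model code treats empty posts; filling is shown here only as one option.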