## Data Cleaning and Preprocessing

In this notebook, we process the data in aita_raw.csv so that it can be fed to our BERT model.

### Necessary Imports Below

In [16]:
import pandas as pd
import numpy as np
import spacy
from tqdm import tqdm

In [18]:
df = pd.read_csv('../Data/aita_raw.csv')

df.drop(columns=['timestamp', 'edited', 'is_asshole'], inplace=True)

### Preprocessing

Here, we begin preprocessing each Reddit post. Specifically, the code below applies the following steps to the post title and body:

- Stop word removal
- Lowercasing
- Lemmatization
- Punctuation Removal

Doing this removes irrelevant and noisy text, makes posts less likely to hit BERT's token length limit, normalizes slangy or repetitive posts, and keeps the model efficient and focused.

Afterwards, the title and body are combined into a new column called "combined_text". We then drop the now-redundant title and body columns to save space.
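
The four steps listed above can be sketched with spaCy token attributes. This is a minimal illustration, not necessarily the exact routine used in this project: the helper names `clean_tokens` and `clean_texts` and the batch size are assumptions, and `nlp` is expected to be a loaded spaCy pipeline such as the `en_core_web_sm` model used in this notebook.

```python
from typing import Iterable, List


def clean_tokens(tokens: Iterable) -> str:
    """Apply the four listed steps to one parsed document.

    `tokens` is expected to be a spaCy Doc (or any iterable of objects that
    expose `is_stop`, `is_punct`, `is_space`, and `lemma_`).
    """
    kept = []
    for tok in tokens:
        # Stop word and punctuation removal (whitespace tokens are dropped too)
        if tok.is_stop or tok.is_punct or tok.is_space:
            continue
        # Lemmatization followed by lowercasing
        kept.append(tok.lemma_.lower())
    return " ".join(kept)


def clean_texts(texts: Iterable[str], nlp, batch_size: int = 64) -> List[str]:
    """Run texts through a spaCy pipeline in batches and clean each one."""
    # nlp.pipe streams documents, which is faster than calling nlp() per row
    return [clean_tokens(doc) for doc in nlp.pipe(texts, batch_size=batch_size)]
```

With a pipeline loaded, a call such as `clean_texts(df['title'], nlp)` would return one cleaned string per post, which can be assigned back to a column. Wrapping the input in `tqdm` should give a progress bar on large frames.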

In [19]:
# Load spaCy and disable unnecessary components
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])
nlp.max_length = 60000000

# Replace missing values with empty strings and enforce str dtype
df['body'] = df['body'].fillna('').astype(str)
df['title'] = df['title'].fillna('').astype(str)


# Combine title and body into a single text column
df['combined_text'] = df['title'] + ' ' + df['body']

# Drop the body and title columns
df.drop(columns=['body', 'title'], inplace=True)

df

Unnamed: 0,id,verdict,score,num_comments,combined_text
0,1ytxov,asshole,52,13.0,[AITA] I wrote an explanation in TIL and came ...
1,1yu29c,asshole,140,27.0,[AITA] Threw my parent's donuts away My parent...
2,1yu8hi,not the asshole,74,15.0,I told a goth girl she looked like a clown. I ...
3,1yuc78,everyone sucks,22,3.0,[AItA]: Argument I had with another redditor i...
4,1yueqb,not the asshole,6,4.0,[AITA] I let my story get a little long and bo...
...,...,...,...,...,...
9995,aex06k,not the asshole,8,9.0,AITA for not wanting to go to prom? i’m a juni...
9996,aex3da,not the asshole,16,28.0,AITA for not letting my brother use my compute...
9997,aex4r5,asshole,31,80.0,AITA for despising this girl? AITA??\n\nSo I k...
9998,aex7pl,not the asshole,11,13.0,AITA becuase a christmas gift I ordered for my...
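
The `fillna('')` calls in the cell above matter because pandas propagates missing values through string concatenation. The sketch below shows the difference on a hypothetical two-row frame; the example posts are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame; the second post has no body text
posts = pd.DataFrame({
    "title": ["AITA for leaving early?", "AITA for skipping prom?"],
    "body": ["I left at nine.", np.nan],
})

# Without filling, a missing body makes the whole combined value missing
raw = posts["title"] + " " + posts["body"]

# Filling first keeps the title and leaves only a trailing space
filled = (
    posts["title"].fillna("").astype(str)
    + " "
    + posts["body"].fillna("").astype(str)
)

print(raw.isna().tolist())
print(filled.tolist())
```

Posts with a deleted or empty body therefore keep their title in the combined column instead of turning into a missing value.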


After cleaning, we save the cleaned data in the Data subdirectory.

In [20]:
# Save the preprocessed DataFrame to a new CSV file
df.to_csv("../Data/aita_posts_preprocessed.csv", index=False)