## Data Cleaning and Preprocessing

In this notebook, we clean and preprocess the data in `aita_posts.csv` so that it can be fed to our BERT model.

### Necessary Imports Below

In [11]:
import pandas
import numpy as np
import spacy

### Preprocessing

Here, we preprocess each Reddit post. Specifically, the code below applies the following steps to the post title and body:

- Stop word removal
- Lowercasing
- Lemmatization
- Punctuation Removal

These steps remove irrelevant and noisy text, shorten posts so that fewer of them hit the model's token length limit, normalize slangy or repetitive wording, and keep the BERT model's input efficient and focused.
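To make the four steps concrete, here is a toy sketch in plain Python. It is not spaCy's implementation: `TOY_STOP_WORDS`, `TOY_LEMMAS`, and `toy_preprocess` are hypothetical names with hand-picked entries, used only to show what each step does to a sentence.

```python
import string

# Hypothetical, hand-picked entries for illustration only.
# spaCy's real stop-word list and lemmatizer are much larger and context-aware.
TOY_STOP_WORDS = {"i", "my", "the", "a", "to", "was", "for", "am"}
TOY_LEMMAS = {"yelled": "yell", "brothers": "brother", "watching": "watch"}


def toy_preprocess(text):
    """Lowercase, strip punctuation, drop stop words, then look up lemmas."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    kept = [tok for tok in tokens if tok not in TOY_STOP_WORDS]
    return [TOY_LEMMAS.get(tok, tok) for tok in kept]


print(toy_preprocess("AITA for watching my brothers? I yelled!"))
# → ['aita', 'watch', 'brother', 'yell']
```

The real pipeline relies on spaCy's `token.is_stop`, `token.is_punct`, and `token.lemma_` attributes instead of these lookup tables.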

Afterwards, the preprocessed title and body are combined and stored in a new column called "combined". Then, we drop the title, selftext, and other unnecessary columns to save space.
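The combine-and-drop step can be sketched on a toy DataFrame as follows. The rows are made up, the text is assumed to be already preprocessed, and treating a missing body as an empty string is an assumption of this sketch rather than something the notebook states.

```python
import pandas as pd

# Toy rows standing in for aita_posts.csv; the values are made up.
toy = pd.DataFrame({
    "post_id": ["a1", "b2"],
    "title": ["aita skip party", "aita keep dog"],
    "selftext": ["friend upset", None],
    "url": ["u1", "u2"],
})

# Treat missing text as empty, then join title and body with a single space.
toy["combined"] = (
    toy["title"].fillna("") + " " + toy["selftext"].fillna("")
).str.strip()

# Drop the columns that are no longer needed.
toy = toy.drop(columns=["title", "selftext", "url"])
print(toy)
```

Only `post_id` and `combined` remain, which mirrors the shape of the final DataFrame on a smaller scale.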

In [13]:
# Load the spaCy language model
nlp = spacy.load("en_core_web_sm")

# Load the dataset
df = pandas.read_csv("../Data/aita_posts.csv")

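# Define a function to handle preprocessing of text
def preprocess(text):
    # Treat missing values (e.g. NaN) as empty text so spaCy always receives a string
    if not isinstance(text, str):
        return []
    doc = nlp(text)
    # Keep lowercased lemmas, skipping stop words and punctuation
    return [token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct]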

# Loop through each row and preprocess the title and post body
for index, row in df.iterrows():
    # Preprocess the title and post body
    title_tokens = preprocess(row['title'])
    body_tokens = preprocess(row['selftext'])
    
    # Combine the tokens into a single list
    combined_tokens = title_tokens + body_tokens
    
    # Store the combined tokens back into the DataFrame
    df.at[index, 'combined'] = ' '.join(combined_tokens)

# Drop the original title and selftext columns, plus other metadata columns we no longer need
df.drop(columns=['title', 'selftext', 'is_self', 'created_utc', 'subreddit_name', 'author', 'url'], inplace=True)


df



| Unnamed: 0 | post_id | score | upvote_ratio | num_comments | flair | label | combined |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1k6igkk | 5 | 0.78 | 12 | Not the A-hole | Not the A-hole | aita freak mom watch brother 14f 13 m brother ... |
| 1 | 1k6hyib | 16 | 0.86 | 27 | Not the A-hole | Not the A-hole | aita ask brother m19 voice chat friend night f... |
| 2 | 1k6gmp6 | 59 | 0.69 | 100 | Not the A-hole | Not the A-hole | aita let bf use desk chair 34nb bf(35 m live 9... |
| 3 | 1k6g0wj | 57 | 0.70 | 163 | Not the A-hole | Not the A-hole | aita tell wife quit smoking asshole tell wife ... |
| 4 | 1k6flrt | 4 | 0.64 | 11 | Not the A-hole | Not the A-hole | aita share switch 14f young sister 12f switch ... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 189 | 1jyorzh | 2 | 0.56 | 16 | Everyone Sucks | Everyone Sucks | aitah gf 33f accuse 39 m bully child post r ad... |
| 190 | 1jymj11 | 15 | 0.73 | 10 | Everyone Sucks | Everyone Sucks | aita snap roommate clean shit stain toilet 23f... |
| 191 | 1jyl1xn | 596 | 0.78 | 537 | Everyone Sucks | Everyone Sucks | aita side dad divorce cheat mom \n dad(52 m mo... |
| 192 | 1jyelws | 311 | 0.79 | 114 | Everyone Sucks | Everyone Sucks | aita give dog ex breakup say month ago ex 30f ... |


After cleaning, we save the preprocessed data in the Data subdirectory.

In [14]:
# Save the preprocessed DataFrame to a new CSV file
df.to_csv("../Data/aita_posts_preprocessed.csv", index=False)
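As a small illustration of why `index=False` matters when saving, the sketch below writes a toy DataFrame to an in-memory buffer both ways and reads it back. The data is made up. The `Unnamed: 0` column in the output above is typical of a CSV that was written with its index, although that is only an assumption about how `aita_posts.csv` was produced.

```python
import io

import pandas as pd

# Toy data standing in for the preprocessed posts; the values are made up.
toy = pd.DataFrame({
    "post_id": ["a1", "b2"],
    "combined": ["aita skip party", "aita keep dog"],
})

# Written without the index: the columns survive the round trip unchanged.
clean_buffer = io.StringIO()
toy.to_csv(clean_buffer, index=False)
clean_buffer.seek(0)
reloaded = pd.read_csv(clean_buffer)

# Written with the index (the default): reading it back adds "Unnamed: 0".
index_buffer = io.StringIO()
toy.to_csv(index_buffer)
index_buffer.seek(0)
with_index = pd.read_csv(index_buffer)

print(list(reloaded.columns))
print(list(with_index.columns))
```

If the extra column is unwanted when loading such a file, passing `index_col=0` to `pandas.read_csv` is one way to absorb it back into the index.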