# Cleaning Data

_This notebook cleans the data collected from the two subreddits. I hope to combine the title and selftext columns into one column, as well as combine all the data into one .csv file. I will also remove any parts of the text that carry no meaning (such as \n)._

In [1]:
import pandas as pd
import regex as re

In [2]:
jokes = pd.read_csv('./datasets/jokes.csv')
a_jokes = pd.read_csv('./datasets/antijokes.csv')

_The general layout for most of the posts in the /r/Jokes subreddit is that the setup of the joke is in the title and the punchline is in the selftext. Because this might be an important feature, I decided to clean both the title and the selftext for both jokes and anti-jokes._

In [3]:
jokes.head()

Unnamed: 0,author,num_comments,permalink,score,selftext,subreddit,title
0,love_the_heat,270,/r/Jokes/comments/7c3dev/by_popular_demand_we_...,3412,**Guaranteed reposts.** \n\nhttps://discord.gg...,Jokes,"By popular demand, we now have a discord serve..."
1,Carljohnson09,219,/r/Jokes/comments/a629g5/husband_was_screwing_...,17201,Wife: (sobbing) You can't do this to me!\n\nHu...,Jokes,Husband was screwing his secretary up the ass ...
2,dandan_56,48,/r/Jokes/comments/a63k1m/why_does_batman_wear_...,1419,Batman doesn't want to get shot.\n\nWhy does R...,Jokes,Why does Batman wear Dark clothing?
3,JustKeepScrollingDad,203,/r/Jokes/comments/a605lp/a_man_is_in_court_the...,15281,"""Guilty"", said the man in the dock. At this po...",Jokes,"A man is in court. The Judges says,""on the 3rd..."
4,boced,350,/r/Jokes/comments/a5xqnf/a_poor_old_lady_was_f...,17905,"As she rummaged through her dusty belongings, ...",Jokes,A poor old lady was forced to sell her valuabl...


### Cleaning the escape characters

In [4]:
escapes = {}
newline = 0
backslash = 0
squote = 0
dquote = 0
carriage = 0
tab = 0
invspace = 0

for joke in jokes['selftext']:
    newline += joke.count('\n')
    backslash += joke.count('\\')
    squote += joke.count("\'")
    dquote += joke.count('\"')
    carriage += joke.count('\r')
    tab += joke.count('\t')
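    # NOTE: the character inside these quotes appears to be missing; as written,
    # str.count('') returns len(joke) + 1, so this tallies length, not invisible spaces
    invspace += joke.count('')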

escapes['newline'] = newline
escapes['backslash'] = backslash
escapes['single quote'] = squote
escapes['double quotes'] = dquote
escapes['carriage returns'] = carriage
escapes['tab'] = tab
escapes['inv space'] = invspace
escapes

{'newline': 11935,
 'backslash': 72,
 'single quote': 4412,
 'double quotes': 5509,
 'carriage returns': 216,
 'tab': 0,
 'inv space': 917124}
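
_The tally above can also be written with a lookup table instead of one counter per character. This is only a sketch: `ESCAPES` and `count_escapes` are hypothetical names that are not part of the notebook, and the invisible-space entry is left out because the character being counted cannot be recovered from the cell above._

```python
# Hypothetical lookup table: label -> character to count
ESCAPES = {
    'newline': '\n',
    'backslash': '\\',
    'single quote': "'",
    'double quotes': '"',
    'carriage returns': '\r',
    'tab': '\t',
}

def count_escapes(texts, targets=ESCAPES):
    """Return {label: total occurrences} over all strings in `texts`."""
    counts = {label: 0 for label in targets}
    for text in texts:
        for label, char in targets.items():
            counts[label] += text.count(char)
    return counts

# Made-up sample strings, used only to exercise the helper
sample = ['line one\nline two', 'tab\there', 'say "hi"']
sample_counts = count_escapes(sample)
print(sample_counts)
```

_In the notebook this would be called as `count_escapes(jokes['selftext'])`, since iterating over a pandas Series yields its values._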

_In the selftext of the Jokes and AntiJokes subreddits, there are quite a few escape characters that need to be addressed. I created a function to replace those escape characters with spaces._

In [5]:
# Replace escape characters (and the HTML entity &amp;) with spaces.
# Each step also replaces double spaces with single ones in one pass,
# so some longer runs of spaces can survive.
def remove_esc(x):
    x = x.replace('\n',' ').replace('  ', ' ')
    x = x.replace('\\',' ').replace('  ', ' ')
    x = x.replace('\"',' ').replace('  ', ' ')
    x = x.replace('\r',' ').replace('  ', ' ')
    x = x.replace('\t',' ').replace('  ', ' ')
    x = x.replace('&amp;', ' ').replace('  ', ' ')
    return x
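
_A regex-based variant could collapse runs of whitespace of any length in one pass. The sketch below is not the function used in the rest of this notebook: `remove_esc_regex` is a hypothetical name, it uses the standard-library `re` module, and `\s` also matches other Unicode whitespace, so its output can differ slightly from `remove_esc`._

```python
import re

def remove_esc_regex(x):
    """Variant of remove_esc that collapses whitespace runs of any length."""
    # Same targets as remove_esc: backslash, double quote and the '&amp;' entity
    x = re.sub(r'\\|"|&amp;', ' ', x)
    # \s+ matches newlines, carriage returns, tabs and spaces,
    # so a run of any length becomes a single space
    return re.sub(r'\s+', ' ', x)

# Made-up input mixing every character the function targets
cleaned = remove_esc_regex('He said:\n\n\n"Why?"\t\\ Tom &amp; Jerry')
print(cleaned)  # He said: Why? Tom Jerry
```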

In [6]:
jokes['selftext'] = jokes['selftext'].apply(remove_esc)

In [7]:
jokes['title'] = jokes['title'].apply(remove_esc)

In [8]:
# There are a few entries in /r/AntiJokes that have null selftext values
# This is probably due to the nature of "AntiJokes", which might not have a punchline
# I filled the null values in with an empty string
a_jokes['selftext'].isna().sum()

396

In [9]:
a_jokes['selftext'] = a_jokes['selftext'].fillna('')

In [10]:
a_jokes['selftext'].isna().sum()

0

In [11]:
a_jokes['selftext'] = a_jokes['selftext'].apply(remove_esc)

In [12]:
a_jokes['title'] = a_jokes['title'].apply(remove_esc)

### Removing duplicates

In [13]:
# Getting rid of the first observation, since it was
# an advertisement for the /r/Jokes discord channel
jokes = jokes.drop(0)

In [14]:
jokes.shape

(5269, 7)

In [15]:
a_jokes.shape

(5472, 7)

In [16]:
jokes = jokes.drop_duplicates(subset = ['selftext'], keep = 'first')
jokes = jokes.reset_index(drop=True)

In [17]:
a_jokes = a_jokes.drop_duplicates(subset = ['selftext'], keep = 'first')
a_jokes = a_jokes.reset_index(drop=True)

In [18]:
jokes.shape

(883, 7)

In [19]:
a_jokes.shape

(842, 7)

_Turns out that many of the jokes and antijokes are duplicates! I guess that's why they call it a 'repost'... After removing duplicates, I was left with 883 unique jokes and 842 unique antijokes._
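
_One thing to keep in mind: because the null selftexts in /r/AntiJokes were filled with empty strings, deduplicating on `selftext` alone keeps only the first of those posts, even when their titles differ. The sketch below uses a made-up toy dataframe (not the real data) to show how deduplicating on the title and selftext pair behaves instead. Whether that is preferable depends on how title-only posts should be treated._

```python
import pandas as pd

# Toy frame standing in for a_jokes; the rows are invented for illustration
toy = pd.DataFrame({
    'title': ['Joke A', 'Joke B', 'Joke A', 'Joke C'],
    'selftext': ['', '', '', 'punchline'],
})

# Deduplicating on selftext alone keeps only one of the empty-selftext rows
by_selftext = toy.drop_duplicates(subset=['selftext'], keep='first')

# Deduplicating on the (title, selftext) pair drops only exact reposts
by_pair = toy.drop_duplicates(subset=['title', 'selftext'], keep='first')

print(len(by_selftext), len(by_pair))  # 2 3
```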

### Combining dataframes

In [20]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
df = pd.concat([jokes, a_jokes])
df.shape

(1725, 7)

In [21]:
df.head()

Unnamed: 0,author,num_comments,permalink,score,selftext,subreddit,title
0,Carljohnson09,219,/r/Jokes/comments/a629g5/husband_was_screwing_...,17201,Wife: (sobbing) You can't do this to me! Husba...,Jokes,Husband was screwing his secretary up the ass ...
1,dandan_56,48,/r/Jokes/comments/a63k1m/why_does_batman_wear_...,1419,Batman doesn't want to get shot. Why does Robi...,Jokes,Why does Batman wear Dark clothing?
2,JustKeepScrollingDad,203,/r/Jokes/comments/a605lp/a_man_is_in_court_the...,15281,"Guilty , said the man in the dock. At this po...",Jokes,"A man is in court. The Judges says, on the 3rd..."
3,boced,350,/r/Jokes/comments/a5xqnf/a_poor_old_lady_was_f...,17905,"As she rummaged through her dusty belongings, ...",Jokes,A poor old lady was forced to sell her valuabl...
4,adjustable_wrench,27,/r/Jokes/comments/a643d9/how_do_you_get_a_nun_...,285,Dress her up as a choir boy.,Jokes,How do you get a nun pregnant?


_I decided to encode the 'AntiJokes' posts as 1 and the 'Jokes' posts as 0, since my problem statement is whether or not an algorithm can differentiate an AntiJoke from a Joke._

In [22]:
df['subreddit'] = df['subreddit'].apply(lambda x: 0 if x == 'Jokes' else 1)

### Dropping columns that are unnecessary

In [23]:
df.head()

Unnamed: 0,author,num_comments,permalink,score,selftext,subreddit,title
0,Carljohnson09,219,/r/Jokes/comments/a629g5/husband_was_screwing_...,17201,Wife: (sobbing) You can't do this to me! Husba...,0,Husband was screwing his secretary up the ass ...
1,dandan_56,48,/r/Jokes/comments/a63k1m/why_does_batman_wear_...,1419,Batman doesn't want to get shot. Why does Robi...,0,Why does Batman wear Dark clothing?
2,JustKeepScrollingDad,203,/r/Jokes/comments/a605lp/a_man_is_in_court_the...,15281,"Guilty , said the man in the dock. At this po...",0,"A man is in court. The Judges says, on the 3rd..."
3,boced,350,/r/Jokes/comments/a5xqnf/a_poor_old_lady_was_f...,17905,"As she rummaged through her dusty belongings, ...",0,A poor old lady was forced to sell her valuabl...
4,adjustable_wrench,27,/r/Jokes/comments/a643d9/how_do_you_get_a_nun_...,285,Dress her up as a choir boy.,0,How do you get a nun pregnant?


_For the sake of analysis, I only need the 'title' and 'selftext' columns to predict the target, 'subreddit'. The other columns will be dropped._

In [24]:
df = df.drop(['author','num_comments','permalink','score'], axis=1)

### Joining title and selftext, then saving the cleaned df into csvs: one joined, one for selftexts and one for titles

In [25]:
joined = []
for index, row in df.iterrows():
    join = row['title'] + " " + row['selftext']
    joined.append(join)
joined

["Husband was screwing his secretary up the ass when his wife walked in Wife: (sobbing) You can't do this to me! Husband: I know that's why I am doing it with her!",
 "Why does Batman wear Dark clothing? Batman doesn't want to get shot. Why does Robin wear bright clothing? Batman doesn't want to get shot.",
 "A man is in court. The Judges says, on the 3rd August you are accused of killing your wife by beating her to death with a hammer, how do you plead?   Guilty , said the man in the dock. At this point a man at the back of the court stood up and shouted You dirty rat! The Judge asked the man to site down and to refrain from making any noise. The Judge continued ..... and that also on the 17th September you are accused of killing your son by beating him to death with a hammer, how do you plead ? Guilty , said the man in the dock. Again the same man at the back stood up and shouted even louder, You dirty rotten stinking rat !! At this point the Judge called the man to the bench and sai

In [26]:
df['joined'] = joined
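
_The same column can be built without an explicit loop, since pandas concatenates string columns element-wise. A minimal sketch on a made-up toy dataframe (the rows are not from the dataset):_

```python
import pandas as pd

# Toy rows invented for illustration
toy = pd.DataFrame({
    'title': ['Why did the chicken cross the road?', 'Knock knock'],
    'selftext': ['To get to the other side.', ''],
})

# Element-wise concatenation: title, a single space, then selftext
toy['joined'] = toy['title'] + ' ' + toy['selftext']
print(toy['joined'].tolist())
```

_As with the loop, a post with an empty selftext ends up with a trailing space after its title._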

In [27]:
df.to_csv('./datasets/joined.csv', index=False)

In [28]:
df[['selftext','subreddit']].to_csv('./datasets/selftext.csv', index=False)
df[['title','subreddit']].to_csv('./datasets/title.csv', index=False)
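
_A caveat for whoever loads these files later: by default `pd.read_csv` parses empty fields as NaN, so any row whose selftext is an empty string would come back as a null. The sketch below shows the round trip on a made-up toy dataframe, using an in-memory buffer in place of the real files._

```python
import io
import pandas as pd

# Toy frame with one empty selftext; the values are invented for illustration
toy = pd.DataFrame({'selftext': ['a punchline', ''], 'subreddit': [0, 1]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default parsing turns the empty field back into NaN
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps it as an empty string
buffer.seek(0)
string_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read['selftext'].isna().sum())   # 1
print(string_read['selftext'].tolist())        # ['a punchline', '']
```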

_This notebook was used to clean the data collected from Reddit. This mainly included getting rid of the escape characters and removing duplicate posts. After removing duplicates, the number of posts decreased from over 5,000 per subreddit to 883 jokes and 842 antijokes. Lastly, I exported the data into .csv files for EDA and modeling. The next notebook will perform EDA on the data._