# EDA & PreProcessing

In this notebook, we'll import and explore the cleaned data from the previous notebook with the goal of determining which characteristics should and should not be the focus of our analysis. Afterwards, we'll preprocess the data and send it to our modelling notebook.

It's anticipated that the output of this section will provide important insights into the structure and content of the data, which can guide subsequent analysis and modeling decisions.

### Library Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests
import pickle
import string

from pandas import json_normalize

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from collections import Counter
from textblob import TextBlob
from sklearn.preprocessing import StandardScaler

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords, wordnet
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('stopwords')
nltk.download('vader_lexicon')

[nltk_data] Downloading package stopwords to /Users/ryan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/ryan/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

### Data Imports

In [2]:
with open('pickles/df_parent.pkl', 'rb') as f:
    df_parent = pickle.load(f)
    
with open('pickles/df_childfree.pkl', 'rb') as f:
    df_childfree = pickle.load(f)

In [3]:
df_parent.head(5)

Unnamed: 0,subreddit,selftext,gilded,title,hidden,pwls,link_flair_css_class,hide_score,quarantine,upvote_ratio,...,id,is_robot_indexable,num_comments,send_replies,whitelist_status,contest_mode,parent_whitelist_status,stickied,subreddit_subscribers,num_crossposts
0,Parenting,"Will the title says it all, my mother-in-law p...",0,First night away from my baby- help :(,False,6,advice,True,False,1.0,...,11icx6s,True,0,True,all_ads,False,all_ads,False,5221073,0
1,Parenting,"Took the boy (14), the spouse (51), the dog (3...",0,Exercising the teen and the…cat,False,6,teenager,True,False,1.0,...,11icgox,True,0,True,all_ads,False,all_ads,False,5221030,0
2,Parenting,Not sure if this is the right sub to post in. ...,0,Questions about pediatrician…,False,6,advice,True,False,1.0,...,11icck2,True,0,True,all_ads,False,all_ads,False,5221024,0
3,Parenting,"My daughter is amazing. She is smart, fucking ...",0,normal 4/5 year old behavior?,False,6,child,True,False,1.0,...,11ic4rf,True,0,True,all_ads,False,all_ads,False,5221010,0
4,Parenting,There's already a lot of good threads about ra...,0,Yet another trilingual baby question,False,6,education,True,False,1.0,...,11ibuvf,True,0,True,all_ads,False,all_ads,False,5220995,0


### Let's combine the DataFrames

In [4]:
df = pd.concat([df_childfree,df_parent]).reset_index(drop= True)
df.head(-15)

Unnamed: 0,subreddit,selftext,gilded,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,hide_score,...,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,retrieved_utc,updated_utc,utc_datetime_str
0,childfree,It‘s just something I’ve been wondering lately...,0,I wonder how many of the guys who have bullied...,"[{'e': 'text', 't': 'RANT'}]",r/childfree,False,6,rant,True,...,/r/childfree/comments/11id1o4/i_wonder_how_man...,all_ads,False,https://www.reddit.com/r/childfree/comments/11...,1494153,1.677964e+09,0,1.677964e+09,1.677964e+09,2023-03-04 20:58:42
1,childfree,So I work in a job that means I deal with a lo...,0,Needing to rant,"[{'e': 'text', 't': 'RANT'}]",r/childfree,False,6,rant,True,...,/r/childfree/comments/11icxz6/needing_to_rant/,all_ads,False,https://www.reddit.com/r/childfree/comments/11...,1494155,1.677963e+09,0,1.677963e+09,1.677963e+09,2023-03-04 20:54:45
2,childfree,[removed],0,Hysterectomy hashtags,"[{'e': 'text', 't': 'RAVE'}]",r/childfree,False,6,rave,True,...,/r/childfree/comments/11icv81/hysterectomy_has...,all_ads,False,https://www.reddit.com/r/childfree/comments/11...,1494154,1.677963e+09,0,1.677963e+09,1.677963e+09,2023-03-04 20:51:40
3,childfree,Especially when they say that their children a...,0,What do you say in response to a parent when t...,"[{'e': 'text', 't': 'DISCUSSION'}]",r/childfree,False,6,discussion,True,...,/r/childfree/comments/11icnw8/what_do_you_say_...,all_ads,False,https://www.reddit.com/r/childfree/comments/11...,1494157,1.677963e+09,0,1.677963e+09,1.677963e+09,2023-03-04 20:43:48
4,childfree,The phrase that sounds sweet but drives me up ...,0,What's a phrase Parents use that sounds altrui...,"[{'e': 'text', 't': 'DISCUSSION'}]",r/childfree,False,6,discussion,True,...,/r/childfree/comments/11ichxn/whats_a_phrase_p...,all_ads,False,https://www.reddit.com/r/childfree/comments/11...,1494154,1.677962e+09,0,1.677962e+09,1.677962e+09,2023-03-04 20:37:18
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9980,Parenting,[removed],0,Should I be concerned about my 14 year old's h...,,,False,6,teenager,True,...,,all_ads,False,,5139651,,0,,,
9981,Parenting,Our 11 year old boy has been disrespecting his...,0,How should we punish our 11 year old,,,False,6,tween,True,...,,all_ads,False,,5139648,,0,,,
9982,Parenting,"Like, those are a thing, right? So instead of ...",0,Toddler chew toys?,,,False,6,humour,True,...,,all_ads,False,,5139647,,0,,,
9983,Parenting,"I'm a bit lost here, it's been a recurring thi...",0,4 y/o comes home mean and angry after a sleepo...,,,False,6,child,True,...,,all_ads,False,,5139645,,0,,,


### Check for missing values

In [5]:
df.isna().sum().sum()

60000
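
The single total above does not show where the gaps are. Here is a minimal sketch of a per-column and per-subreddit breakdown, using a small hypothetical frame (`toy`, with made-up values) in place of `df`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; the column names loosely mirror the real ones.
toy = pd.DataFrame({
    'subreddit': ['childfree', 'childfree', 'Parenting', 'Parenting'],
    'selftext': ['some text', '[removed]', 'more text', np.nan],
    'url': ['https://example.com/a', 'https://example.com/b', np.nan, np.nan],
})

# Missing values per column, largest first.
missing_by_col = toy.isna().sum().sort_values(ascending=False)

# Missing values per subreddit, to see whether one source is responsible.
missing_by_sub = (
    toy.drop(columns='subreddit')
       .isna()
       .groupby(toy['subreddit'])
       .sum()
)

print(missing_by_col)
print(missing_by_sub)
```

In the combined preview above, the r/Parenting rows show empty values for columns such as `url` and `permalink`, which suggests that columns present in only one source account for much of the total.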

### Label Encoding our Data

Before we proceed, let's convert the subreddit column to 1s and 0s. In this analysis, 1 corresponds to r/Parenting and 0 corresponds to r/childfree.

We want to create this binary column so we can evaluate and model our data, as we'll see in the next steps.

In [6]:
df['subreddit'] = df['subreddit'].map({'Parenting': 1, "childfree": 0})
df.head(5)

Unnamed: 0,subreddit,selftext,gilded,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,hide_score,...,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,retrieved_utc,updated_utc,utc_datetime_str
0,0,It‘s just something I’ve been wondering lately...,0,I wonder how many of the guys who have bullied...,"[{'e': 'text', 't': 'RANT'}]",r/childfree,False,6,rant,True,...,/r/childfree/comments/11id1o4/i_wonder_how_man...,all_ads,False,https://www.reddit.com/r/childfree/comments/11...,1494153,1677964000.0,0,1677964000.0,1677964000.0,2023-03-04 20:58:42
1,0,So I work in a job that means I deal with a lo...,0,Needing to rant,"[{'e': 'text', 't': 'RANT'}]",r/childfree,False,6,rant,True,...,/r/childfree/comments/11icxz6/needing_to_rant/,all_ads,False,https://www.reddit.com/r/childfree/comments/11...,1494155,1677963000.0,0,1677963000.0,1677963000.0,2023-03-04 20:54:45
2,0,[removed],0,Hysterectomy hashtags,"[{'e': 'text', 't': 'RAVE'}]",r/childfree,False,6,rave,True,...,/r/childfree/comments/11icv81/hysterectomy_has...,all_ads,False,https://www.reddit.com/r/childfree/comments/11...,1494154,1677963000.0,0,1677963000.0,1677963000.0,2023-03-04 20:51:40
3,0,Especially when they say that their children a...,0,What do you say in response to a parent when t...,"[{'e': 'text', 't': 'DISCUSSION'}]",r/childfree,False,6,discussion,True,...,/r/childfree/comments/11icnw8/what_do_you_say_...,all_ads,False,https://www.reddit.com/r/childfree/comments/11...,1494157,1677963000.0,0,1677963000.0,1677963000.0,2023-03-04 20:43:48
4,0,The phrase that sounds sweet but drives me up ...,0,What's a phrase Parents use that sounds altrui...,"[{'e': 'text', 't': 'DISCUSSION'}]",r/childfree,False,6,discussion,True,...,/r/childfree/comments/11ichxn/whats_a_phrase_p...,all_ads,False,https://www.reddit.com/r/childfree/comments/11...,1494154,1677962000.0,0,1677962000.0,1677962000.0,2023-03-04 20:37:18
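
One thing worth checking after a dictionary-based `map` is that every label was actually covered, because `Series.map` returns `NaN` for any value missing from the dictionary. A small sketch with hypothetical labels (the `'Childfree'` spelling is invented to trigger the case):

```python
import pandas as pd

# Hypothetical labels, including one with unexpected casing.
labels = pd.Series(['Parenting', 'childfree', 'Childfree', 'Parenting'])

mapping = {'Parenting': 1, 'childfree': 0}
encoded = labels.map(mapping)

# Values not found in the mapping come back as NaN,
# so the NaN positions reveal labels that were not encoded.
unmapped = labels[encoded.isna()].unique().tolist()
print(unmapped)
```

An empty `unmapped` list would confirm that the encoded column contains only 0s and 1s.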


### Baseline Calculation

Let's calculate a baseline for our model using a concatenated version of the post text and title text.

In [7]:
# Extracting Text Columns
df_text = pd.DataFrame()
df_text['text'] = df['selftext'].str.cat(df['title'],sep =' ')
df_text['subreddit'] = df['subreddit']
df_text

Unnamed: 0,text,subreddit
0,It‘s just something I’ve been wondering lately...,0
1,So I work in a job that means I deal with a lo...,0
2,[removed] Hysterectomy hashtags,0
3,Especially when they say that their children a...,0
4,The phrase that sounds sweet but drives me up ...,0
...,...,...
9995,It seems like it's relatively common for a tod...,1
9996,Hi all! In March my husband and I will be taki...,1
9997,[removed] Question about hanna andersson store,1
9998,Hello - this post is to get some opinions on a...,1


In [8]:
# Train Test Split
X= df_text.drop('subreddit',axis = 1)
y = df_text['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.2,random_state=2023)

y.value_counts(normalize = True)

0    0.5
1    0.5
Name: subreddit, dtype: float64

As expected, our baseline is 0.5. The data is evenly split between the two subreddits, so always guessing the same class would be right half the time. Any model we create should predict the subreddit correctly more often than that.
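
The baseline figure is the accuracy of always predicting the most common class. A minimal sketch of that calculation, using hypothetical label lists rather than the real `y` (the `majority_baseline` helper is illustrative and not part of the notebook):

```python
from collections import Counter

def majority_baseline(labels):
    """Return (most common label, accuracy of always predicting it)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Hypothetical labels: one balanced set (like ours) and one skewed set.
balanced = [0] * 5 + [1] * 5
skewed = [0] * 8 + [1] * 2

balanced_result = majority_baseline(balanced)
skewed_result = majority_baseline(skewed)
print(balanced_result, skewed_result)
```

With balanced classes the baseline sits at 0.5. With skewed classes it rises, which is why it is worth computing before judging a model's accuracy.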

### Sentiment Analysis

Before proceeding to the modelling portion of this analysis, we would like to evaluate the sentiment of the original text data and include that as a feature alongside our text.

We expect this to improve the efficacy of our model, as we anticipate that the two subreddits will differ in terms of sentiment.

To carry out this sentiment analysis, we'll use the VADER lexicon, which was designed for informal, social-media-style text like ours.

In [9]:
sa = SentimentIntensityAnalyzer()

In [10]:
def sentiment(text):
    scores = sa.polarity_scores(text)
    neg = scores['neg']
    neu = scores['neu']
    pos = scores['pos']
    compound = scores['compound'] 
    return (neg,neu,pos,compound)

In [11]:
df_text[['neg','neu','pos','compound']] = df_text['text'].apply(sentiment).tolist()
df_text

Unnamed: 0,text,subreddit,neg,neu,pos,compound
0,It‘s just something I’ve been wondering lately...,0,0.176,0.733,0.091,-0.9947
1,So I work in a job that means I deal with a lo...,0,0.169,0.718,0.113,-0.9448
2,[removed] Hysterectomy hashtags,0,0.000,1.000,0.000,0.0000
3,Especially when they say that their children a...,0,0.040,0.960,0.000,-0.4215
4,The phrase that sounds sweet but drives me up ...,0,0.202,0.722,0.077,-0.9951
...,...,...,...,...,...,...
9995,It seems like it's relatively common for a tod...,1,0.086,0.830,0.084,-0.1685
9996,Hi all! In March my husband and I will be taki...,1,0.000,0.827,0.173,0.9279
9997,[removed] Question about hanna andersson store,1,0.000,1.000,0.000,0.0000
9998,Hello - this post is to get some opinions on a...,1,0.044,0.873,0.083,0.8490
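
The `compound` column is a single normalized score between -1 and 1. If a categorical label is ever needed, one option is to bucket it. The sketch below assumes the ±0.05 cut-offs commonly suggested in VADER's documentation. The notebook itself keeps the raw scores, and `label_compound` is a hypothetical helper:

```python
def label_compound(compound, threshold=0.05):
    """Bucket a VADER compound score into a coarse sentiment label."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# Compound scores copied from a few rows of the table above.
sample_scores = [-0.9947, 0.0, -0.4215, 0.9279]
sample_labels = [label_compound(score) for score in sample_scores]
print(sample_labels)
```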


In [12]:
## Textblob Polarity

In [13]:
def sentiment_tb(text):
    tb = TextBlob(text)
    tb_polarity = tb.sentiment.polarity
    return tb_polarity

# polarity is a single float per row, so it can be assigned directly
df_text['polarity'] = df_text['text'].apply(sentiment_tb)

### Pre-Processing

Now that we have our baseline, we can continue preprocessing the DataFrame. The text DataFrame, which contains the text-based elements of both subreddits, will need to be vectorized.

With the data provided above, we'll standardize and then stem the text, which should improve the accuracy of the model we'll create in the next notebook. The sentiment scores calculated above from the raw text will also be important features in our model's predictions.

#### Standardizing Data

In [14]:
# removing special characters
def remove_punct(text):
    return text.translate(str.maketrans('','',string.punctuation + '\n'))

df_text['text'] = df_text['text'].apply(remove_punct)
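
A side effect of deleting `'\n'` outright is that words separated only by a line break are joined together. The sketch below compares the approach above with a variant that maps newlines to spaces. The sample string is made up, and the variant is a suggestion rather than what the notebook ran:

```python
import string

raw = "First line.\nSecond line, isn't it?"

# Approach used above: delete punctuation and newlines outright.
delete_table = str.maketrans('', '', string.punctuation + '\n')
deleted = raw.translate(delete_table)

# Variant: delete punctuation but turn each newline into a space.
space_table = str.maketrans('\n', ' ', string.punctuation)
spaced = raw.translate(space_table)

print(repr(deleted))
print(repr(spaced))
```

Switching to the variant would change the token counts reported later, so the figures in this notebook reflect the original behaviour.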

In [15]:
# lowercasing
def conv_lower(text):
    return text.lower()
    
df_text['text'] = df_text['text'].apply(conv_lower)

In [16]:
# tokenizing lower case strings
def tokenizer(text):
    token = RegexpTokenizer(r'\w+')  # raw string avoids an invalid escape sequence warning
    word = token.tokenize(text)
    
    return word

df_text['text'] = df_text['text'].apply(tokenizer)

In [17]:
# removing stop words
stops = set(stopwords.words('english'))  # build the set once for fast lookups

def rem_stops(text):
    return [word for word in text if word not in stops]

df_text['text'] = df_text['text'].apply(rem_stops)
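
To make the last two steps concrete, here is a dependency-free sketch of tokenizing and then filtering stop words. It assumes `re.findall(r'\w+', ...)` behaves like the `RegexpTokenizer` pattern used above, and `STOPS` is a tiny hypothetical stand-in for NLTK's English stop word list:

```python
import re

# Hypothetical stand-in for stopwords.words('english').
STOPS = {'i', 'the', 'a', 'is', 'and'}

def tokenize(text):
    """Split text into runs of word characters."""
    return re.findall(r'\w+', text)

def remove_stops(tokens, stops=STOPS):
    """Keep only the tokens that are not stop words."""
    return [token for token in tokens if token not in stops]

tokens = tokenize('the toddler is asleep and i am tired')
kept = remove_stops(tokens)
print(tokens)
print(kept)
```

Because punctuation was stripped before this step, contractions such as "don't" arrive as "dont" and will only be removed if the stop list contains that spelling.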

In [18]:
subreddit_1 = df_text[df_text['subreddit'] == 1]['text']
subreddit_0 = df_text[df_text['subreddit'] == 0]['text']
words_0 = [word for text in subreddit_0 for word in text]
words_1 = [word for text in subreddit_1 for word in text]

word_count = len(words_0) + len(words_1)
word_count

743618

#### Removing Common Words

Let's look at the words that appear most frequently across both subreddits. Removing words that are common to both may improve our model's ability to discern between them.

In [19]:
subreddit_1 = df_text[df_text['subreddit'] == 1]['text']
subreddit_0 = df_text[df_text['subreddit'] == 0]['text']
words_0 = [word for text in subreddit_0 for word in text]
words_1 = [word for text in subreddit_1 for word in text]
words = (words_0 + words_1)
word_freq = {word: words.count(word) for word in set(words)}
top_words = [word for word,freq in sorted(word_freq.items(),key = lambda x: x[1], reverse = True)]
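
Counting with `list.count` inside a comprehension rescans the whole list once per unique word, which gets slow at several hundred thousand tokens. `collections.Counter` (already imported) produces the same counts in a single pass. A small sketch on a made-up word list:

```python
from collections import Counter

words = ['kid', 'sleep', 'kid', 'school', 'sleep', 'kid']

# One pass over the list.
counts = Counter(words)

# The comprehension style used above, for comparison.
slow_counts = {word: words.count(word) for word in set(words)}

# Words ranked from most to least frequent.
ranked = [word for word, _ in counts.most_common()]
print(ranked)
```

Both approaches yield the same word-to-count mapping. Only the ordering of tied words may differ.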

After completing this analysis and modelling the results, we found that not removing any of the top words yielded the best results, so the removal step below is left in place but configured to remove nothing.

#### Storing Top Words by Subreddit

In [20]:
word_freq_0 = {word: words_0.count(word) for word in set(words_0)}
word_freq_1 = {word: words_1.count(word) for word in set(words_1)}

In [21]:
word_freq = {}
for word in set(words_0 + words_1):
    freq_0 = word_freq_0.get(word, 0)
    freq_1 = word_freq_1.get(word, 0)
    word_freq[word] = {'subreddit_0': freq_0, 'subreddit_1': freq_1}

In [22]:
top_words_0 = [word for word,freq in sorted(word_freq_0.items(),key = lambda x: x[1], reverse = True)][:10]
top_words_1 = [word for word,freq in sorted(word_freq_1.items(),key = lambda x: x[1], reverse = True)][:10]
top_freqs_0 = [word_freq_0[word] for word in top_words_0]
top_freqs_1 = [word_freq_1[word] for word in top_words_1]
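
Once the per-subreddit top lists exist, a natural follow-up is to see which top words are shared and which are distinctive. A minimal sketch with invented token lists (`words_a` and `words_b` are hypothetical stand-ins for `words_0` and `words_1`):

```python
from collections import Counter

# Hypothetical token lists standing in for words_0 and words_1.
words_a = ['kid'] * 4 + ['work'] * 3 + ['time'] * 2 + ['travel']
words_b = ['kid'] * 5 + ['school'] * 3 + ['time'] * 2 + ['nap']

top_a = [word for word, _ in Counter(words_a).most_common(3)]
top_b = [word for word, _ in Counter(words_b).most_common(3)]

# Words appearing in both top lists carry little signal for classification.
shared = [word for word in top_a if word in top_b]
only_a = [word for word in top_a if word not in top_b]
only_b = [word for word in top_b if word not in top_a]
print(shared, only_a, only_b)
```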

#### Function to Remove Words

In [23]:
def rem_shared(text):
    rem = [word for word in text if word not in top_words[:0]]
    return rem

df_text['text'] = df_text['text'].apply(rem_shared)
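
Note that `top_words[:0]` is an empty slice, so the function above currently removes nothing. One way to make that choice explicit is to expose the cut-off as a parameter. The helper below is a hypothetical variant, shown on made-up data:

```python
def remove_top_shared(tokens, ranked_words, n=0):
    """Drop tokens that appear among the n most frequent shared words."""
    blocked = set(ranked_words[:n])  # empty when n == 0
    return [token for token in tokens if token not in blocked]

ranked = ['kid', 'time', 'like', 'school']   # most frequent first
tokens = ['kid', 'like', 'nap', 'time']

unchanged = remove_top_shared(tokens, ranked)        # n=0 removes nothing
trimmed = remove_top_shared(tokens, ranked, n=2)     # drops 'kid' and 'time'
print(unchanged, trimmed)
```

Raising `n` makes it easy to rerun the experiment with different numbers of removed words.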

In [24]:
subreddit_1 = df_text[df_text['subreddit'] == 1]['text']
subreddit_0 = df_text[df_text['subreddit'] == 0]['text']
words_0 = [word for text in subreddit_0 for word in text]
words_1 = [word for text in subreddit_1 for word in text]

word_count = len(words_0) + len(words_1)
word_count

743618

#### Stemming

Since we are working with informal text data, we'll stem the data to reduce each word in the DataFrame to its stem.

In [25]:
# Stemming with Porter Stemmer
ps = PorterStemmer()

In [26]:
def stemmer(text_list):
    return ' '.join([ps.stem(word) for word in text_list])

In [27]:
df_text['text'] = df_text['text'].apply(stemmer)
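
Porter stemming strips suffixes by rule, so the results are not always dictionary words. A short sketch on a few illustrative tokens (chosen for this example, although similar stems are visible in the output further down):

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

tokens = ['single', 'parenting', 'children', 'baby']
stems = [ps.stem(token) for token in tokens]

# Pair each token with its stem for a quick side-by-side look.
for token, stem in zip(tokens, stems):
    print(f'{token} -> {stem}')
```

Irregular forms such as "children" pass through unchanged, which is one reason lemmatization is sometimes preferred when dictionary forms matter.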

#### Removing Zero Compound Score Rows

There are several rows in the data where the post was removed. When a post is removed, Reddit replaces the post body (`selftext`) with the text '[removed]', leaving only the title to score. When the title contains no sentiment-bearing words, the sentiment analysis returns 1 for "neutral" (and a compound score of 0) because there is nothing else to parse.

To improve the result of our model, we'll remove these rows.

In [28]:
df_text = df_text[df_text['neu'] !=1]
df_text

Unnamed: 0,text,subreddit,neg,neu,pos,compound,polarity
0,someth wonder late lot guy singl relationship ...,0,0.176,0.733,0.091,-0.9947,0.039933
1,work job mean deal lot medic inform vagu purpo...,0,0.169,0.718,0.113,-0.9448,-0.081382
3,especi say children grown abl live lifestyl mi...,0,0.040,0.960,0.000,-0.4215,0.110390
4,phrase sound sweet drive wall insid ill better...,0,0.202,0.722,0.077,-0.9951,0.051185
5,sigh knew good last forev singl elderli neighb...,0,0.065,0.893,0.042,-0.6310,-0.014286
...,...,...,...,...,...,...,...
9994,remov protect daughter,1,0.000,0.658,0.342,0.3818,0.000000
9995,seem like rel common toddler act newborn come ...,1,0.086,0.830,0.084,-0.1685,0.106848
9996,hi march husband take first trip abroad babi b...,1,0.000,0.827,0.173,0.9279,0.251786
9998,hello post get opinion disagr spous one morn p...,1,0.044,0.873,0.083,0.8490,0.090000
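
Filtering on `neu` is an indirect way to find removed posts: it also drops genuine posts that score as fully neutral, and it keeps removed posts whose titles carry sentiment (row 9994 above appears to be one). A sketch of the alternative, filtering on the '[removed]' marker directly, assuming it runs on the raw text before punctuation is stripped. The frame and its `neu` values are hypothetical:

```python
import pandas as pd

# Hypothetical raw text with made-up neutral scores.
toy = pd.DataFrame({
    'text': ['[removed] Hysterectomy hashtags',
             '[removed] Tips to protect my daughter',
             'Nap schedule question for a two year old',
             'I love how well bedtime went tonight'],
    'neu': [1.0, 0.658, 1.0, 0.6],
})

# Filter used in this notebook: drop rows scored as fully neutral.
by_sentiment = toy[toy['neu'] != 1].copy()

# Alternative: drop rows whose body was replaced by the removal marker.
by_marker = toy[~toy['text'].str.startswith('[removed]')].copy()

print(by_sentiment.index.tolist())
print(by_marker.index.tolist())
```

Taking `.copy()` after filtering also gives an independent frame, which avoids chained-assignment warnings when columns are modified afterwards.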


In [29]:
df_text['text']

0       someth wonder late lot guy singl relationship ...
1       work job mean deal lot medic inform vagu purpo...
3       especi say children grown abl live lifestyl mi...
4       phrase sound sweet drive wall insid ill better...
5       sigh knew good last forev singl elderli neighb...
                              ...                        
9994                               remov protect daughter
9995    seem like rel common toddler act newborn come ...
9996    hi march husband take first trip abroad babi b...
9998    hello post get opinion disagr spous one morn p...
9999    toddler 25 year old put play area that fenc ge...
Name: text, Length: 8825, dtype: object

In [30]:
# converting text column to strings (previously object)
# df_text is a filtered slice of the earlier frame, so take an explicit copy
# first to avoid pandas' SettingWithCopyWarning on assignment
df_text = df_text.copy()
df_text['text'] = df_text['text'].astype('string')


In [31]:
with open('pickles/df_text.pkl', 'wb') as f:
    pickle.dump(df_text, f)
    
with open('pickles/top_words.pkl', 'wb') as f:
    pickle.dump(top_words, f)
    
with open('pickles/word_freq.pkl', 'wb') as f:
    pickle.dump(word_freq, f)
    
with open('pickles/top_words_0.pkl', 'wb') as f:
    pickle.dump(top_words_0, f)
    
with open('pickles/top_words_1.pkl', 'wb') as f:
    pickle.dump(top_words_1, f)
    
with open('pickles/top_freqs_0.pkl', 'wb') as f:
    pickle.dump(top_freqs_0, f)
    
with open('pickles/top_freqs_1.pkl', 'wb') as f:
    pickle.dump(top_freqs_1, f)
    
with open('pickles/word_freq_0.pkl', 'wb') as f:
    pickle.dump(word_freq_0, f)
    
with open('pickles/word_freq_1.pkl', 'wb') as f:
    pickle.dump(word_freq_1, f)
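
The nine near-identical blocks above could be collapsed into a loop over a name-to-object mapping. A minimal sketch, writing to a temporary directory instead of `pickles/`, with hypothetical helpers `dump_all` and `load_all` and made-up objects:

```python
import pickle
import tempfile
from pathlib import Path

def dump_all(objects, folder):
    """Pickle each object to <folder>/<name>.pkl."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    for name, obj in objects.items():
        with open(folder / f'{name}.pkl', 'wb') as f:
            pickle.dump(obj, f)

def load_all(names, folder):
    """Load the named pickles back into a dict."""
    loaded = {}
    for name in names:
        with open(Path(folder) / f'{name}.pkl', 'rb') as f:
            loaded[name] = pickle.load(f)
    return loaded

# Made-up objects standing in for the notebook's outputs.
outputs = {'top_words': ['kid', 'time'], 'word_freq': {'kid': 3, 'time': 2}}

with tempfile.TemporaryDirectory() as tmp:
    dump_all(outputs, tmp)
    restored = load_all(outputs.keys(), tmp)

print(restored == outputs)
```

Keeping the names in one mapping also gives the modelling notebook a single list of files to load.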