# EDA

In this notebook, we'll import and explore the cleaned data from the previous notebook to decide which features our analysis should focus on.

The output of this section should show how the data is structured and what it contains, which will guide later analysis and modeling decisions.

### Library Imports

In [23]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests
import pickle

from pandas import json_normalize
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

### Data Imports

In [2]:
with open('pickles/df_parent.pkl', 'rb') as f:
    df_parent = pickle.load(f)
    
with open('pickles/df_childfree.pkl', 'rb') as f:
    df_childfree = pickle.load(f)

In [27]:
df_parent.head(5)

Unnamed: 0,subreddit,selftext,gilded,title,hidden,pwls,link_flair_css_class,hide_score,quarantine,upvote_ratio,...,id,is_robot_indexable,num_comments,send_replies,whitelist_status,contest_mode,parent_whitelist_status,stickied,subreddit_subscribers,num_crossposts
0,Parenting,We had been trying to get pregnant for a while...,0,Drinking heavily before getting a positive tes...,False,6,advice,True,False,1.0,...,11b6nwj,True,0,True,all_ads,False,all_ads,False,5200856,0
1,Parenting,My wife spends so much time working with him t...,0,How do I get my 10 month old to say mama?,False,6,infant,True,False,1.0,...,11b6nlo,True,0,True,all_ads,False,all_ads,False,5200855,0
2,Parenting,This is difficult to write about and googling ...,0,Addressing Masturbation in Children,False,6,child,True,False,1.0,...,11b6m8x,True,0,True,promo_adult_nsfw,False,all_ads,False,5200848,0
3,Parenting,"So we just had our 2nd baby, he is 3 weeks old...",0,Activities for toddler and newborn,False,6,toddler,True,False,1.0,...,11b6j0f,True,0,True,all_ads,False,all_ads,False,5200837,0
4,Parenting,I’m scared out of my mind that my newborn is g...,0,5 year old has a stomach bug and I have a newb...,False,6,multiple ages,True,False,1.0,...,11b6dqd,True,0,True,all_ads,False,all_ads,False,5200826,0


### Let's combine the DataFrames

In [29]:
df = pd.concat([df_childfree, df_parent]).reset_index(drop=True)
df.head(-15)

Unnamed: 0,subreddit,selftext,gilded,title,hidden,pwls,link_flair_css_class,hide_score,quarantine,upvote_ratio,...,id,is_robot_indexable,num_comments,send_replies,whitelist_status,contest_mode,parent_whitelist_status,stickied,subreddit_subscribers,num_crossposts
0,childfree,"New day, new person with their baby at the Bar...",0,A Baby at the Bar.,False,6,rant,True,False,1.0,...,11b6u0b,True,0,True,all_ads,False,all_ads,False,1493376,0
1,childfree,"Hello All,\n\nA couple of weeks ago I read on ...",0,Good Morning America covered the events of the...,False,6,article,True,False,1.0,...,11b6o3v,True,0,True,all_ads,False,all_ads,False,1493376,0
2,childfree,[removed],0,YouTube couples are getting ridiculous with th...,False,6,rant,True,False,1.0,...,11b5yh7,False,0,True,all_ads,False,all_ads,False,1493374,0
3,childfree,“Stay at home mum” isn’t good enough for some ...,0,“Career mum”?,False,6,rant,True,False,1.0,...,11b44xx,True,0,True,all_ads,False,all_ads,False,1493378,0
4,childfree,in the summer my friend and I tried to hang ou...,0,'friend' called me lifeless because I dont hav...,False,6,rant,True,False,1.0,...,11b3k8d,True,0,True,all_ads,False,all_ads,False,1493374,0


### Check for missing values

In [5]:
df.isna().sum().sum()

0
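
A count of zero missing values does not mean every post has usable text: the previews above include a `selftext` of `[removed]`, which is an ordinary string and is not caught by `isna()`. The sketch below shows one way such placeholders could be counted. It runs on a made-up toy frame rather than the real `df`, and treating `[deleted]` as a second placeholder is an assumption, since only `[removed]` appears in the output shown here.

```python
import pandas as pd

# Toy frame standing in for the combined `df`; these rows are made up for illustration.
toy = pd.DataFrame({
    "selftext": ["A real post body", "[removed]", "[deleted]", "Another post"],
    "title": ["t1", "t2", "t3", "t4"],
})

# Assumed placeholder strings; only "[removed]" is visible in the notebook output.
placeholders = ["[removed]", "[deleted]"]

# Strip surrounding whitespace first so " [removed] " would also match.
is_placeholder = toy["selftext"].str.strip().isin(placeholders)

n_placeholder = int(is_placeholder.sum())
print(f"{n_placeholder} of {len(toy)} posts have a placeholder body")
```

Whether to drop such rows or keep them (their titles still carry text) is a modeling choice this notebook leaves open.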

### Label Encoding our Data

Before we proceed, let's convert the subreddit column to 1s and 0s. In this analysis, a 1 will correspond to r/Parenting and a 0 will correspond to r/childfree.

We want to create this binary column so we can evaluate and model our data, as we'll see in the next steps.

In [6]:
df['subreddit'] = df['subreddit'].map({'Parenting': 1, 'childfree': 0})
df.head(5)

Unnamed: 0,subreddit,selftext,gilded,title,hidden,pwls,link_flair_css_class,hide_score,quarantine,upvote_ratio,...,id,is_robot_indexable,num_comments,send_replies,whitelist_status,contest_mode,parent_whitelist_status,stickied,subreddit_subscribers,num_crossposts
0,0,"New day, new person with their baby at the Bar...",0,A Baby at the Bar.,False,6,rant,True,False,1.0,...,11b6u0b,True,0,True,all_ads,False,all_ads,False,1493376,0
1,0,"Hello All,\n\nA couple of weeks ago I read on ...",0,Good Morning America covered the events of the...,False,6,article,True,False,1.0,...,11b6o3v,True,0,True,all_ads,False,all_ads,False,1493376,0
2,0,[removed],0,YouTube couples are getting ridiculous with th...,False,6,rant,True,False,1.0,...,11b5yh7,False,0,True,all_ads,False,all_ads,False,1493374,0
3,0,“Stay at home mum” isn’t good enough for some ...,0,“Career mum”?,False,6,rant,True,False,1.0,...,11b44xx,True,0,True,all_ads,False,all_ads,False,1493378,0
4,0,in the summer my friend and I tried to hang ou...,0,'friend' called me lifeless because I dont hav...,False,6,rant,True,False,1.0,...,11b3k8d,True,0,True,all_ads,False,all_ads,False,1493374,0
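
One thing to watch with `Series.map` and a dictionary: any value that is not a key in the dictionary becomes `NaN`, and the keys are case-sensitive. A minimal sketch of a check that could follow the mapping step is shown here. The toy labels, including the deliberately mis-cased `ChildFree`, are made up for illustration.

```python
import pandas as pd

# Toy labels; "ChildFree" is a deliberately mis-cased value for illustration.
labels = pd.Series(["Parenting", "childfree", "ChildFree"])

encoded = labels.map({"Parenting": 1, "childfree": 0})

# Values missing from the mapping dictionary come back as NaN.
unmapped = labels[encoded.isna()].unique().tolist()
print(unmapped)
```

An empty list here would mean every label was encoded.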


### Baseline Calculation

Let's calculate a baseline accuracy for our model. We'll first combine each post's body (`selftext`) and title into a single text column.

In [17]:
# Extracting Text Columns
df_text = pd.DataFrame()
df_text['text'] = df['selftext'].str.cat(df['title'], sep=' ')
df_text['subreddit'] = df['subreddit']
df_text

Unnamed: 0,text,subreddit
0,"New day, new person with their baby at the Bar...",0
1,"Hello All,\n\nA couple of weeks ago I read on ...",0
2,[removed] YouTube couples are getting ridiculo...,0
3,“Stay at home mum” isn’t good enough for some ...,0
4,in the summer my friend and I tried to hang ou...,0
5,Getting some work done on our pool and talking...,0
6,[removed] Am I in the wrong?,0
7,I’m a huge gym rat but before I started out I ...,0
8,"Wish I could post screenshots, but the entitle...",0
9,One more reason I would never give birth. I wo...,0
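
The concatenation above is safe here because the earlier check found no missing values. As a hedged illustration of why that matters: when `str.cat` is given another Series, a missing value on either side makes the combined result missing too, unless `na_rep` is supplied. The toy frame below is made up to show both behaviours.

```python
import pandas as pd

# Toy frame with one missing body; these rows are made up for illustration.
toy = pd.DataFrame({
    "selftext": ["Body text", None],
    "title": ["First title", "Second title"],
})

# Default behaviour: a missing value on either side gives a missing result.
default = toy["selftext"].str.cat(toy["title"], sep=" ")

# With na_rep, missing values are replaced before joining.
filled = toy["selftext"].str.cat(toy["title"], sep=" ", na_rep="")

print(default.isna().tolist())
print(filled.str.strip().tolist())
```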


In [26]:
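# Train/test split
X = df_text.drop('subreddit', axis=1)  # features: the combined text column
y = df_text['subreddit']               # target: 1 = r/Parenting, 0 = r/childfree

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=2023)

# Class proportions in the full dataset
y.value_counts(normalize=True)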


0    0.5
1    0.5
Name: subreddit, dtype: float64

As expected, our baseline accuracy is 0.5. The data is evenly split between the two subreddits, so a model that always predicts the same class would be right half the time. Any model we build should beat that.
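
The same baseline can be confirmed with a model that ignores the text entirely. The sketch below uses scikit-learn's `DummyClassifier` on synthetic balanced labels, not the real data. Passing `stratify=y` is an addition that the split above does not use; it is included here only so that both halves of the split keep the 50/50 class balance.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic balanced labels standing in for `y`; the dummy model ignores the features.
y = np.array([0, 1] * 50)
X = np.zeros((len(y), 1))

# stratify=y is an assumption added for this sketch, not part of the notebook's split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2023, stratify=y
)

# Always predict the most frequent class seen in training.
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)

baseline_accuracy = dummy.score(X_test, y_test)
print(baseline_accuracy)
```

Scoring later models against this number on the same test set gives a like-for-like comparison.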

In [31]:
with open('pickles/df_text.pkl', 'wb') as f:
    pickle.dump(df_text, f)