# Data Cleaning

In this notebook, we'll import and explore the data, with the goal of preparing it for EDA in the following notebook.

### Library Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests
import pickle

from pandas import json_normalize
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

### Pushshift API calls

In [2]:
# urls for API calls
url_parent = 'https://api.pushshift.io/reddit/search/submission/?subreddit=parenting'
url_childfree = 'https://api.pushshift.io/reddit/search/submission/?subreddit=childfree'

req_parent = requests.get(url_parent)
req_childfree = requests.get(url_childfree)

In [3]:
# status check
print(req_childfree.status_code, req_parent.status_code)

200 200
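
A printed status code is easy to overlook when a request fails. As a minimal sketch, the fetch could be wrapped in a small helper that sets a timeout and raises on an error response. The names `fetch_submissions` and `get` are hypothetical and not part of the original notebook; the only thing assumed about the response is the 'data' key used later in this notebook.

```python
import requests


def fetch_submissions(url, get=requests.get, timeout=30):
    """Return the list stored under the 'data' key of the JSON response.

    `get` is injectable (hypothetical parameter) so the helper can be
    exercised without making a real network call.
    """
    resp = get(url, timeout=timeout)
    resp.raise_for_status()  # raises an HTTPError on 4xx/5xx responses
    return resp.json()["data"]
```

With this helper, a failed call stops the notebook at the request cell instead of surfacing later as a confusing KeyError.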


### JSON data

In [4]:
json_parent = req_parent.json()
json_childfree = req_childfree.json()

### Converting to DataFrames

In [5]:
df_parent = pd.json_normalize(json_parent['data'])
df_childfree = pd.json_normalize(json_childfree['data'])
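
To illustrate what `pd.json_normalize` does with the list stored under 'data', here is a small sketch on a made-up payload (the records and the `df_demo` name are illustrative, not real API output). Nested dictionaries are flattened into dot-separated column names, while list values stay in a single object column.

```python
import pandas as pd

# Hypothetical payload that mimics the shape of the API response
payload = {
    "data": [
        {"title": "First post", "score": 1,
         "tags": [{"e": "text"}], "flair": {"text": "Advice"}},
        {"title": "Second post", "score": 3,
         "tags": [], "flair": {"text": "Rant"}},
    ]
}

# One row per record; the nested 'flair' dict becomes the column 'flair.text'
df_demo = pd.json_normalize(payload["data"])
print(sorted(df_demo.columns))
```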

### r/Parenting Data Cleaning & Evaluation

In [6]:
df_parent.head(50)

Unnamed: 0,subreddit,selftext,author_fullname,gilded,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,...,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,retrieved_utc,updated_utc,utc_datetime_str
0,Parenting,We had been trying to get pregnant for a while...,t2_mgozuz0u,0,Drinking heavily before getting a positive tes...,"[{'e': 'text', 't': 'Advice'}]",r/Parenting,False,6,advice,...,False,https://www.reddit.com/r/Parenting/comments/11...,5200856,1677284294,0,,False,1677284310,1677284310,2023-02-25 00:18:14
1,Parenting,My wife spends so much time working with him t...,t2_dg5c9s7m,0,How do I get my 10 month old to say mama?,"[{'e': 'text', 't': 'Infant 2-12 Months'}]",r/Parenting,False,6,infant,...,False,https://www.reddit.com/r/Parenting/comments/11...,5200855,1677284272,0,,False,1677284285,1677284285,2023-02-25 00:17:52
2,Parenting,This is difficult to write about and googling ...,t2_o3h5u,0,Addressing Masturbation in Children,"[{'e': 'text', 't': 'Child 4-9 Years'}]",r/Parenting,False,6,child,...,False,https://www.reddit.com/r/Parenting/comments/11...,5200848,1677284170,0,,False,1677284186,1677284186,2023-02-25 00:16:10
3,Parenting,"So we just had our 2nd baby, he is 3 weeks old...",t2_kbdxdxlx,0,Activities for toddler and newborn,"[{'e': 'text', 't': 'Toddler 1-3 Years'}]",r/Parenting,False,6,toddler,...,False,https://www.reddit.com/r/Parenting/comments/11...,5200837,1677283936,0,,False,1677283946,1677283947,2023-02-25 00:12:16
4,Parenting,I’m scared out of my mind that my newborn is g...,t2_81dphiyd,0,5 year old has a stomach bug and I have a newb...,"[{'e': 'text', 't': 'Multiple Ages'}]",r/Parenting,False,6,multiple ages,...,False,https://www.reddit.com/r/Parenting/comments/11...,5200826,1677283566,0,,False,1677283579,1677283580,2023-02-25 00:06:06
5,Parenting,"Hey guys, I'm a mom to a 13 year old 8th grade...",t2_dffrjwft,0,My 13 year old son has a bully. How can I help...,"[{'e': 'text', 't': 'Teenager 13-19 Years'}]",r/Parenting,False,6,teenager,...,False,https://www.reddit.com/r/Parenting/comments/11...,5200750,1677281343,0,,False,1677281357,1677281358,2023-02-24 23:29:03
6,Parenting,"Here's the situation. Two weeks ago, my 2 year...",t2_13uapkdw,0,A toddler with nightmares and a tired family,"[{'e': 'text', 't': 'Sleep &amp; Naps'}]",r/Parenting,False,6,sleep,...,False,https://www.reddit.com/r/Parenting/comments/11...,5200692,1677279453,0,,False,1677279465,1677279465,2023-02-24 22:57:33
7,Parenting,"I'm a mom to a little baby, so all I have seen...",t2_mg5a09qe,0,Am I the only parent to not like LOL Dolls?,"[{'e': 'text', 't': 'Rant/Vent'}]",r/Parenting,False,6,rant vent,...,False,https://www.reddit.com/r/Parenting/comments/11...,5200689,1677279248,0,,False,1677279261,1677279261,2023-02-24 22:54:08
8,Parenting,I guess this is more of a vent but if there ar...,t2_tbt58ffy,0,Convincing people that LO sleep schedule is no...,"[{'e': 'text', 't': 'Sleep &amp; Naps'}]",r/Parenting,False,6,sleep,...,False,https://www.reddit.com/r/Parenting/comments/11...,5200653,1677278237,0,,False,1677278252,1677278253,2023-02-24 22:37:17
9,Parenting,My daughter is 1 so I won’t have to worry abou...,t2_5c4fi518,0,Do you have rules on what kind of movies/shows...,"[{'e': 'text', 't': 'Media'}]",r/Parenting,False,6,media,...,False,https://www.reddit.com/r/Parenting/comments/11...,5200623,1677276868,0,,False,1677276888,1677276888,2023-02-24 22:14:28


In [7]:
df_parent.shape

(10, 84)

In [8]:
df_parent.isna().sum()

subreddit            0
selftext             0
author_fullname      0
gilded               0
title                0
                    ..
media               10
is_video             0
retrieved_utc        0
updated_utc          0
utc_datetime_str     0
Length: 84, dtype: int64

Since we're going to apply an NLP model to this data, the columns pertaining to video or image content are unnecessary. We'll remove those below.

In [9]:
df_parent = df_parent.drop(columns=['is_video', 'thumbnail_width', 'thumbnail'])
#'preview_images','preview_enabled'
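
Hard-coded column names raise a KeyError if a response ever omits one of them. A minimal sketch of a more defensive drop, using the `errors='ignore'` option of `DataFrame.drop` on a toy frame (the frame and the `media_cols` name are illustrative):

```python
import pandas as pd

df_demo = pd.DataFrame({
    "title": ["a", "b"],
    "is_video": [False, False],
    "thumbnail": ["self", "self"],
})

# 'thumbnail_width' is absent from this frame; errors='ignore' skips it
# instead of raising a KeyError
media_cols = ["is_video", "thumbnail_width", "thumbnail"]
df_demo = df_demo.drop(columns=media_cols, errors="ignore")
print(df_demo.columns.tolist())
```

The trade-off is that a misspelled column name is also skipped silently, so this option suits exploratory work better than a strict pipeline.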

In [10]:
df_parent.isna().sum().head(40)

subreddit                         0
selftext                          0
author_fullname                   0
gilded                            0
title                             0
link_flair_richtext               0
subreddit_name_prefixed           0
hidden                            0
pwls                              0
link_flair_css_class              0
thumbnail_height                 10
top_awarded_type                 10
hide_score                        0
quarantine                        0
link_flair_text_color             0
upvote_ratio                      0
author_flair_background_color    10
subreddit_type                    0
total_awards_received             0
author_flair_template_id         10
is_original_content               0
secure_media                     10
is_reddit_media_domain            0
is_meta                           0
category                         10
link_flair_text                   0
score                             0
is_created_from_ads_ui      

The remaining columns that contain NA values aren't text-based and aren't necessary for our analysis. We'll drop them below:

In [11]:
df_parent = df_parent.dropna(axis=1)
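
Note that `dropna(axis=1)` removes every column containing at least one missing value, not only the fully empty ones. The sketch below, on a toy frame with made-up values, records which columns are about to be dropped so the choice can be reviewed first; `df_demo` and `cols_with_na` are illustrative names.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "title": ["a", "b", "c"],
    "media": [None, None, None],    # entirely missing
    "category": ["x", None, "y"],   # partially missing
})

# Count missing values per column and keep the names with at least one
na_counts = df_demo.isna().sum()
cols_with_na = na_counts[na_counts > 0].index.tolist()
print(cols_with_na)

# Both the fully and the partially missing column are removed
df_demo = df_demo.dropna(axis=1)
```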

### Evaluating Remaining Features

In [12]:
df_parent.head(3)

Unnamed: 0,subreddit,selftext,author_fullname,gilded,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,...,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,retrieved_utc,updated_utc,utc_datetime_str
0,Parenting,We had been trying to get pregnant for a while...,t2_mgozuz0u,0,Drinking heavily before getting a positive tes...,"[{'e': 'text', 't': 'Advice'}]",r/Parenting,False,6,advice,...,/r/Parenting/comments/11b6nwj/drinking_heavily...,all_ads,False,https://www.reddit.com/r/Parenting/comments/11...,5200856,1677284294,0,1677284310,1677284310,2023-02-25 00:18:14
1,Parenting,My wife spends so much time working with him t...,t2_dg5c9s7m,0,How do I get my 10 month old to say mama?,"[{'e': 'text', 't': 'Infant 2-12 Months'}]",r/Parenting,False,6,infant,...,/r/Parenting/comments/11b6nlo/how_do_i_get_my_...,all_ads,False,https://www.reddit.com/r/Parenting/comments/11...,5200855,1677284272,0,1677284285,1677284285,2023-02-25 00:17:52
2,Parenting,This is difficult to write about and googling ...,t2_o3h5u,0,Addressing Masturbation in Children,"[{'e': 'text', 't': 'Child 4-9 Years'}]",r/Parenting,False,6,child,...,/r/Parenting/comments/11b6m8x/addressing_mastu...,all_ads,False,https://www.reddit.com/r/Parenting/comments/11...,5200848,1677284170,0,1677284186,1677284186,2023-02-25 00:16:10


Several of the columns are descriptors of the posts or of the subreddit itself. All of the subreddit descriptors will be removed, as we don't expect them to add value to the model. Some of the post descriptors will be removed because they duplicate other columns (e.g., 'link_flair_richtext' carries the same information as 'link_flair_css_class').

In [13]:
#Subreddit descriptors
r_desc = ['author_fullname', 'permalink', 'url', 'link_flair_text_color','utc_datetime_str','author','subreddit_type','link_flair_background_color','author_patreon_flair']

#duplicate features
dup_feat = ['link_flair_richtext','subreddit_name_prefixed','retrieved_utc', 'updated_utc','created_utc']
rem_features = r_desc+dup_feat

In [14]:
df_parent = df_parent.drop(columns=rem_features)


In [15]:
df_parent.head(5)

Unnamed: 0,subreddit,selftext,gilded,title,hidden,pwls,link_flair_css_class,hide_score,quarantine,upvote_ratio,...,id,is_robot_indexable,num_comments,send_replies,whitelist_status,contest_mode,parent_whitelist_status,stickied,subreddit_subscribers,num_crossposts
0,Parenting,We had been trying to get pregnant for a while...,0,Drinking heavily before getting a positive tes...,False,6,advice,True,False,1.0,...,11b6nwj,True,0,True,all_ads,False,all_ads,False,5200856,0
1,Parenting,My wife spends so much time working with him t...,0,How do I get my 10 month old to say mama?,False,6,infant,True,False,1.0,...,11b6nlo,True,0,True,all_ads,False,all_ads,False,5200855,0
2,Parenting,This is difficult to write about and googling ...,0,Addressing Masturbation in Children,False,6,child,True,False,1.0,...,11b6m8x,True,0,True,promo_adult_nsfw,False,all_ads,False,5200848,0
3,Parenting,"So we just had our 2nd baby, he is 3 weeks old...",0,Activities for toddler and newborn,False,6,toddler,True,False,1.0,...,11b6j0f,True,0,True,all_ads,False,all_ads,False,5200837,0
4,Parenting,I’m scared out of my mind that my newborn is g...,0,5 year old has a stomach bug and I have a newb...,False,6,multiple ages,True,False,1.0,...,11b6dqd,True,0,True,all_ads,False,all_ads,False,5200826,0


### r/Childfree Data Cleaning & Evaluation

In [16]:
df_childfree.head(50)

Unnamed: 0,subreddit,selftext,author_fullname,gilded,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,...,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,retrieved_utc,updated_utc,utc_datetime_str
0,childfree,"New day, new person with their baby at the Bar...",t2_ge4byjgp,0,A Baby at the Bar.,"[{'e': 'text', 't': 'RANT'}]",r/childfree,False,6,rant,...,False,https://www.reddit.com/r/childfree/comments/11...,1493376,1677284731,0,,False,1677284747,1677284747,2023-02-25 00:25:31
1,childfree,"Hello All,\n\nA couple of weeks ago I read on ...",t2_262rqv78,0,Good Morning America covered the events of the...,"[{'e': 'text', 't': 'ARTICLE'}]",r/childfree,False,6,article,...,False,https://www.reddit.com/r/childfree/comments/11...,1493376,1677284307,0,,False,1677284318,1677284318,2023-02-25 00:18:27
2,childfree,[removed],t2_5r79txrx,0,YouTube couples are getting ridiculous with th...,"[{'e': 'text', 't': 'RANT'}]",r/childfree,False,6,rant,...,False,https://www.reddit.com/r/childfree/comments/11...,1493374,1677282495,0,,False,1677282510,1677282510,2023-02-24 23:48:15
3,childfree,“Stay at home mum” isn’t good enough for some ...,t2_a2gxg5tc,0,“Career mum”?,"[{'e': 'text', 't': 'RANT'}]",r/childfree,False,6,rant,...,False,https://www.reddit.com/r/childfree/comments/11...,1493378,1677277866,0,,False,1677277876,1677277877,2023-02-24 22:31:06
4,childfree,in the summer my friend and I tried to hang ou...,t2_dwbwlnva,0,'friend' called me lifeless because I dont hav...,"[{'e': 'text', 't': 'RANT'}]",r/childfree,False,6,rant,...,False,https://www.reddit.com/r/childfree/comments/11...,1493374,1677276406,0,,False,1677276420,1677276421,2023-02-24 22:06:46
5,childfree,Getting some work done on our pool and talking...,t2_ewh36rg4,0,"""We need more people like you in the world.""","[{'e': 'text', 't': 'RAVE'}]",r/childfree,False,6,rave,...,False,https://www.reddit.com/r/childfree/comments/11...,1493373,1677276098,0,,False,1677276110,1677276110,2023-02-24 22:01:38
6,childfree,[removed],t2_dfzc6h5p,0,Am I in the wrong?,"[{'e': 'text', 't': 'PERSONAL'}]",r/childfree,False,6,personal,...,False,https://www.reddit.com/r/childfree/comments/11...,1493368,1677275127,0,,False,1677275142,1677275143,2023-02-24 21:45:27
7,childfree,I’m a huge gym rat but before I started out I ...,t2_o5sdvoq0,0,Yet another fitness influencer to unfollow,"[{'e': 'text', 't': 'RANT'}]",r/childfree,False,6,rant,...,False,https://www.reddit.com/r/childfree/comments/11...,1493368,1677275022,0,,False,1677275035,1677275035,2023-02-24 21:43:42
8,childfree,"Wish I could post screenshots, but the entitle...",t2_8jx6mthd,0,Just go to literally any other restaurant,"[{'e': 'text', 't': 'DISCUSSION'}]",r/childfree,False,6,discussion,...,False,https://www.reddit.com/r/childfree/comments/11...,1493369,1677274342,0,,False,1677274354,1677274355,2023-02-24 21:32:22
9,childfree,One more reason I would never give birth. I wo...,t2_goh6n38u,0,Husband stitch,"[{'e': 'text', 't': 'RANT'}]",r/childfree,False,6,rant,...,False,https://www.reddit.com/r/childfree/comments/11...,1493368,1677274199,0,,False,1677274217,1677274217,2023-02-24 21:29:59


In [26]:
df_childfree.isna().sum().head(10)

subreddit               0
selftext                0
gilded                  0
title                   0
hidden                  0
pwls                    0
link_flair_css_class    0
hide_score              0
quarantine              0
upvote_ratio            0
dtype: int64

Since the columns with missing values aren't relevant to our analysis, we can drop them.

In [18]:
df_childfree = df_childfree.dropna(axis=1) 

In [19]:
df_childfree.columns

Index(['subreddit', 'selftext', 'author_fullname', 'gilded', 'title',
       'link_flair_richtext', 'subreddit_name_prefixed', 'hidden', 'pwls',
       'link_flair_css_class', 'hide_score', 'quarantine',
       'link_flair_text_color', 'upvote_ratio', 'subreddit_type',
       'total_awards_received', 'is_original_content',
       'is_reddit_media_domain', 'is_meta', 'link_flair_text', 'score',
       'is_created_from_ads_ui', 'author_premium', 'thumbnail', 'edited',
       'author_flair_richtext', 'is_self', 'link_flair_type', 'wls',
       'author_flair_type', 'domain', 'allow_live_comments', 'suggested_sort',
       'archived', 'no_follow', 'is_crosspostable', 'pinned', 'over_18',
       'all_awardings', 'awarders', 'media_only', 'link_flair_template_id',
       'can_gild', 'spoiler', 'locked', 'treatment_tags', 'subreddit_id',
       'link_flair_background_color', 'id', 'is_robot_indexable', 'author',
       'num_comments', 'send_replies', 'whitelist_status', 'contest_mode',
       

In [20]:
# removing features pertaining to video or image content
df_childfree = df_childfree.drop(columns=['is_video', 'thumbnail'])

### Evaluating Remaining Features

We'll clean the remaining features the same way we did for the r/Parenting data. Since both datasets are expected to contain the same columns, we can reuse the rem_features variable, which holds the duplicate and subreddit descriptor features.

In [21]:
df_childfree = df_childfree.drop(columns=rem_features)
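
Reusing one column list across two frames relies on both frames actually having those columns. As a rough sketch, the assumption could be checked before dropping; `missing_columns`, `df_demo`, and `to_remove` are hypothetical names used only for illustration.

```python
import pandas as pd


def missing_columns(df, cols):
    """Return the names in `cols` that are not columns of `df`."""
    return [c for c in cols if c not in df.columns]


df_demo = pd.DataFrame({"title": ["a"], "url": ["u"], "author": ["x"]})
to_remove = ["url", "author", "permalink"]

# An empty list means the drop will succeed; otherwise it would raise a KeyError
print(missing_columns(df_demo, to_remove))
```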

In [22]:
df_childfree

Unnamed: 0,subreddit,selftext,gilded,title,hidden,pwls,link_flair_css_class,hide_score,quarantine,upvote_ratio,...,id,is_robot_indexable,num_comments,send_replies,whitelist_status,contest_mode,parent_whitelist_status,stickied,subreddit_subscribers,num_crossposts
0,childfree,"New day, new person with their baby at the Bar...",0,A Baby at the Bar.,False,6,rant,True,False,1.0,...,11b6u0b,True,0,True,all_ads,False,all_ads,False,1493376,0
1,childfree,"Hello All,\n\nA couple of weeks ago I read on ...",0,Good Morning America covered the events of the...,False,6,article,True,False,1.0,...,11b6o3v,True,0,True,all_ads,False,all_ads,False,1493376,0
2,childfree,[removed],0,YouTube couples are getting ridiculous with th...,False,6,rant,True,False,1.0,...,11b5yh7,False,0,True,all_ads,False,all_ads,False,1493374,0
3,childfree,“Stay at home mum” isn’t good enough for some ...,0,“Career mum”?,False,6,rant,True,False,1.0,...,11b44xx,True,0,True,all_ads,False,all_ads,False,1493378,0
4,childfree,in the summer my friend and I tried to hang ou...,0,'friend' called me lifeless because I dont hav...,False,6,rant,True,False,1.0,...,11b3k8d,True,0,True,all_ads,False,all_ads,False,1493374,0
5,childfree,Getting some work done on our pool and talking...,0,"""We need more people like you in the world.""",False,6,rave,True,False,1.0,...,11b3fum,True,0,True,all_ads,False,all_ads,False,1493373,0
6,childfree,[removed],0,Am I in the wrong?,False,6,personal,True,False,1.0,...,11b31o1,False,0,True,all_ads,False,all_ads,False,1493368,0
7,childfree,I’m a huge gym rat but before I started out I ...,0,Yet another fitness influencer to unfollow,False,6,rant,True,False,1.0,...,11b304k,True,0,True,all_ads,False,all_ads,False,1493368,0
8,childfree,"Wish I could post screenshots, but the entitle...",0,Just go to literally any other restaurant,False,6,discussion,True,False,1.0,...,11b2q9h,True,0,True,all_ads,False,all_ads,False,1493369,0
9,childfree,One more reason I would never give birth. I wo...,0,Husband stitch,False,6,rant,True,False,1.0,...,11b2o30,True,0,True,promo_adult_nsfw,False,all_ads,False,1493368,0


In [28]:
list_col_cf = df_childfree.columns
list_col_par = df_parent.columns

# columns present in the r/childfree data but not in the r/Parenting data
notin_col = [col for col in list_col_cf if col not in list_col_par]
notin_col

df_childfree['suggested_sort'].head(20)

0    confidence
1    confidence
2    confidence
3    confidence
4    confidence
5    confidence
6    confidence
7    confidence
8    confidence
9    confidence
Name: suggested_sort, dtype: object

The suggested_sort column holds the same value in every row, so it doesn't provide any value for our model. We can remove it.

In [29]:
df_childfree = df_childfree.drop(columns='suggested_sort')
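
Rather than inspecting columns one at a time, constant columns can be found in a single pass. A minimal sketch on a toy frame, assuming every column holds hashable values (`df_demo` and `constant_cols` are illustrative names):

```python
import pandas as pd

df_demo = pd.DataFrame({
    "title": ["a", "b", "c"],
    "suggested_sort": ["confidence"] * 3,
    "quarantine": [False, False, False],
})

# nunique() counts the distinct non-null values in each column
n_unique = df_demo.nunique()
constant_cols = n_unique[n_unique <= 1].index.tolist()
print(constant_cols)
```

Columns that store lists or dicts are not hashable and may raise a TypeError here, so they would need to be converted first (for example with `astype(str)`).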

### Exporting Data

In [30]:
with open('pickles/df_childfree.pkl', 'wb') as f:
    pickle.dump(df_childfree, f)
      
with open('pickles/df_parent.pkl', 'wb') as f:
    pickle.dump(df_parent, f)
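
Opening a file for writing fails if the target folder does not exist yet. The sketch below creates the folder first and uses the pandas helpers `to_pickle` and `read_pickle` to do a round-trip check. It writes to a temporary location so it can be run safely; the file name `df_demo.pkl` and the toy frame are illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Temporary location for the demo; the notebook itself writes to 'pickles/'
out_dir = Path(tempfile.mkdtemp()) / "pickles"
out_dir.mkdir(parents=True, exist_ok=True)

df_demo = pd.DataFrame({"title": ["a", "b"], "score": [1, 3]})
df_demo.to_pickle(out_dir / "df_demo.pkl")

# Reading the file back confirms the export worked
restored = pd.read_pickle(out_dir / "df_demo.pkl")
```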