# Data Cleaning

In this notebook, we'll import and explore the data, with the goal of preparing it for EDA in the following notebook.

#### Library Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests
import pickle
import time

from pandas import json_normalize
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [2]:
int(time.time())

1678059388
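The integer above is a Unix epoch timestamp in seconds, the same unit used by the `before` parameter and the `created_utc` field later in this notebook. A minimal sketch for turning such a value into a readable UTC datetime (the helper name `to_utc` is only illustrative):

```python
from datetime import datetime, timezone

def to_utc(ts: int) -> datetime:
    """Convert a Unix epoch timestamp (seconds) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)

# The value returned by int(time.time()) in the cell above
print(to_utc(1678059388).strftime('%Y-%m-%d %H:%M:%S'))
```

This makes it easier to sanity-check a `before` value against the `utc_datetime_str` column in the retrieved data.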

### Pushshift API calls

Since the API limits us to 1000 posts per request, we'll compile the data in two steps: 'before' and 'after'. 'Before' refers to posts created before the 'created_utc' value that Pushshift returns for the default API call for both subreddits.

By calling the API once before and once after that UTC time, we can retrieve the maximum number of posts for each subreddit.

#### Finding Maximum Posts

Before we can get more than 1000 posts, we need to make sure that both subreddits can return more posts. To do this, we'll read the 'value' of the 'total' element in the JSON metadata.

In [3]:
import requests

url = 'https://api.pushshift.io/reddit/search/submission/'

params = {
    'subreddit': 'parenting',
    'size': 0
}

req = requests.get(url, params=params)
num_posts = req.json()['metadata']['es']['hits']['total']['value']
print(f'Total number of posts in r/parenting: {num_posts}')

Total number of posts in r/parenting: 10000


In [4]:
import requests

url = 'https://api.pushshift.io/reddit/search/submission/'

params = {
    'subreddit': 'childfree',
    'size': 0
}

req = requests.get(url, params=params)
num_posts = req.json()['metadata']['es']['hits']['total']['value']
print(f'Total number of posts in r/childfree: {num_posts}')

Total number of posts in r/childfree: 10000


Both subreddits return 10,000 (a figure that may be a cap on the reported count rather than the true number of posts), so our limiting factor becomes processing speed. I'll use 5000 posts per subreddit to reduce computation time, as I expect 10,000 posts per subreddit would require an infeasible amount of time to model.
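The two count cells above differ only in the subreddit name, so they could share one helper. The sketch below is one way to do that; `count_posts`, `FakeResponse`, and `fake_get` are hypothetical names, and the fake response only mirrors the keys indexed in the cells above so that the example runs without network access. In the notebook, `requests.get` would be passed in place of `fake_get`.

```python
URL = 'https://api.pushshift.io/reddit/search/submission/'

def count_posts(subreddit, get):
    """Return the reported post count for a subreddit.

    `get` is any callable shaped like requests.get(url, params=...).
    """
    resp = get(URL, params={'subreddit': subreddit, 'size': 0})
    return resp.json()['metadata']['es']['hits']['total']['value']


# Offline stand-in (assumption): returns a fixed payload instead of calling the API
class FakeResponse:
    def json(self):
        return {'metadata': {'es': {'hits': {'total': {'value': 10000}}}}}

def fake_get(url, params=None):
    return FakeResponse()

for name in ['parenting', 'childfree']:
    print(f'Total number of posts in r/{name}: {count_posts(name, fake_get)}')
```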

### Calling API

I want two separate datasets, so I'll call the API for each subreddit separately using the loop below.

##### API Call for r/Parenting

In [5]:
subreddit = 'parenting'
num_posts = 5000
size = 500
before = 1677963622 # current UTC timestamp at time of modelling

params = {
    'subreddit': subreddit,
    'size': size,
    'before': before,
    #'sort': 'desc',
    'sort_type': 'created_utc'
}

all_posts = []

for i in range(num_posts // size):
    # Make a request for each batch of 500 posts
    req = requests.get(url, params=params)
    posts = req.json()['data']
    all_posts.extend(posts)
    print(f'Retrieved {len(posts)} posts, total: {len(all_posts)}')
    # Update the before parameter to the timestamp of the earliest post retrieved
    before = posts[-1]['created_utc']
    params['before'] = before

print(f'Total number of posts retrieved: {len(all_posts)}')

Retrieved 500 posts, total: 500
Retrieved 500 posts, total: 1000
Retrieved 500 posts, total: 1500
Retrieved 500 posts, total: 2000
Retrieved 500 posts, total: 2500
Retrieved 500 posts, total: 3000
Retrieved 500 posts, total: 3500
Retrieved 500 posts, total: 4000
Retrieved 500 posts, total: 4500
Retrieved 500 posts, total: 5000
Total number of posts retrieved: 5000
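The loop above assumes every request returns a non-empty batch; if a request came back empty, `posts[-1]` would raise an `IndexError`. The sketch below shows the same backwards pagination with a guard for that case. It is not the code used to collect this data: `fetch_all`, `fetch_page`, and `fake_fetch` are hypothetical names, and the fake fetcher stands in for the API so the example runs offline.

```python
import time

def fetch_all(fetch_page, num_posts, size, before, pause=0.0):
    """Page backwards in time until num_posts are collected or the source runs dry.

    fetch_page(before, size) is a hypothetical callable returning a list of
    post dicts, newest first, each with a 'created_utc' key.
    """
    collected = []
    while len(collected) < num_posts:
        batch = fetch_page(before, min(size, num_posts - len(collected)))
        if not batch:
            # Nothing older is available; stop instead of indexing batch[-1]
            break
        collected.extend(batch)
        # The last post in the batch is the oldest one retrieved so far
        before = batch[-1]['created_utc']
        time.sleep(pause)  # optional delay between requests
    return collected


# Offline stand-in for the API: 12 fake posts with descending timestamps
fake_posts = [{'id': i, 'created_utc': 1000 - i} for i in range(12)]

def fake_fetch(before, size):
    older = [p for p in fake_posts if p['created_utc'] < before]
    return older[:size]

posts = fetch_all(fake_fetch, num_posts=10, size=4, before=1001)
print(f'Total number of posts retrieved: {len(posts)}')
```

One caveat with this style of pagination: if `before` is treated as a strict cutoff, posts sharing the exact timestamp of a batch boundary could be skipped, which is worth keeping in mind when checking counts.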


##### API Call for r/Childfree

In [6]:
subreddit = 'Childfree'
num_posts = 5000
size = 500
before = 1677963622 # current UTC timestamp at time of modelling

params = {
    'subreddit': subreddit,
    'size': size,
    'before': before,
    'sort': 'created_utc',
    'sort_type': 'desc'
}

all_posts_child = []

for i in range(num_posts // size):
    # Make a request for each batch of 500 posts
    req_child = requests.get(url, params=params)
    posts_child = req_child.json()['data']
    all_posts_child.extend(posts_child)
    print(f'Retrieved {len(posts_child)} posts, total: {len(all_posts_child)}')
    # Update the before parameter to the timestamp of the earliest post retrieved
    before = posts_child[-1]['created_utc']
    params['before'] = before

print(f'Total number of posts retrieved: {len(all_posts_child)}')

Retrieved 500 posts, total: 500
Retrieved 500 posts, total: 1000
Retrieved 500 posts, total: 1500
Retrieved 500 posts, total: 2000
Retrieved 500 posts, total: 2500
Retrieved 500 posts, total: 3000
Retrieved 500 posts, total: 3500
Retrieved 500 posts, total: 4000
Retrieved 500 posts, total: 4500
Retrieved 500 posts, total: 5000
Total number of posts retrieved: 5000


### JSON data to df

#### Parenting JSON to df

In [7]:
df_parent = pd.DataFrame(all_posts)

#### Childfree JSON to df

In [8]:
df_childfree = pd.DataFrame(all_posts_child)

In [9]:
df_parent

Unnamed: 0,subreddit,selftext,author_fullname,gilded,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,...,num_crossposts,media,is_video,retrieved_utc,updated_utc,utc_datetime_str,post_hint,preview,edited_on,author_cakeday
0,Parenting,"Will the title says it all, my mother-in-law p...",t2_9wgzn8tj,0,First night away from my baby- help :(,"[{'e': 'text', 't': 'Advice'}]",r/Parenting,False,6,advice,...,0,,False,1677963246,1677963247,2023-03-04 20:53:51,,,,
1,Parenting,"Took the boy (14), the spouse (51), the dog (3...",t2_7cpfk85g,0,Exercising the teen and the…cat,"[{'e': 'text', 't': 'Teenager 13-19 Years'}]",r/Parenting,False,6,teenager,...,0,,False,1677962173,1677962174,2023-03-04 20:35:55,,,,
2,Parenting,Not sure if this is the right sub to post in. ...,t2_13h354,0,Questions about pediatrician…,"[{'e': 'text', 't': 'Advice'}]",r/Parenting,False,6,advice,...,0,,False,1677961908,1677961908,2023-03-04 20:31:34,,,,
3,Parenting,"My daughter is amazing. She is smart, fucking ...",t2_12755s,0,normal 4/5 year old behavior?,"[{'e': 'text', 't': 'Child 4-9 Years'}]",r/Parenting,False,6,child,...,0,,False,1677961401,1677961402,2023-03-04 20:23:05,,,,
4,Parenting,There's already a lot of good threads about ra...,t2_7n9yf,0,Yet another trilingual baby question,"[{'e': 'text', 't': 'Education &amp; Learning'}]",r/Parenting,False,6,education,...,0,,False,1677960744,1677960745,2023-03-04 20:12:11,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,Parenting,It seems like it's relatively common for a tod...,t2_sd3f9k1,0,Our 2.5yo toddler keeps hitting our 3mo newborn,"[{'e': 'text', 't': 'Toddler 1-3 Years'}]",r/Parenting,False,6,toddler,...,0,,False,1675289476,1675289477,2023-02-01 22:11:01,,,,
4996,Parenting,Hi all! In March my husband and I will be taki...,t2_6f7kfj0w,0,First trip abroad with baby - tips please!,"[{'e': 'text', 't': 'Infant 2-12 Months'}]",r/Parenting,False,6,infant,...,0,,False,1675289296,1675289297,2023-02-01 22:08:00,,,,
4997,Parenting,[removed],t2_364j9bzi,0,Question about hanna andersson store,"[{'e': 'text', 't': 'Child 4-9 Years'}]",r/Parenting,False,6,child,...,0,,False,1675289073,1675289074,2023-02-01 22:04:16,,,,
4998,Parenting,Hello - this post is to get some opinions on a...,t2_3hqfuetx,0,Does a parent sleeping in until 8:30/9 most we...,"[{'e': 'text', 't': 'Child 4-9 Years'}]",r/Parenting,False,6,child,...,0,,False,1675288625,1675288625,2023-02-01 21:56:50,,,,


In [10]:
df_childfree

Unnamed: 0,subreddit,selftext,author_fullname,gilded,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,...,media,is_video,retrieved_utc,updated_utc,utc_datetime_str,post_hint,preview,edited_on,url_overridden_by_dest,author_cakeday
0,childfree,It‘s just something I’ve been wondering lately...,t2_tyrl12vk,0,I wonder how many of the guys who have bullied...,"[{'e': 'text', 't': 'RANT'}]",r/childfree,False,6,rant,...,,False,1677963532,1677963533,2023-03-04 20:58:42,,,,,
1,childfree,So I work in a job that means I deal with a lo...,t2_4qo87csv,0,Needing to rant,"[{'e': 'text', 't': 'RANT'}]",r/childfree,False,6,rant,...,,False,1677963299,1677963300,2023-03-04 20:54:45,,,,,
2,childfree,[removed],t2_j5qjl,0,Hysterectomy hashtags,"[{'e': 'text', 't': 'RAVE'}]",r/childfree,False,6,rave,...,,False,1677963115,1677963116,2023-03-04 20:51:40,,,,,
3,childfree,Especially when they say that their children a...,t2_etjx83hh,0,What do you say in response to a parent when t...,"[{'e': 'text', 't': 'DISCUSSION'}]",r/childfree,False,6,discussion,...,,False,1677962638,1677962639,2023-03-04 20:43:48,,,,,
4,childfree,The phrase that sounds sweet but drives me up ...,t2_3m72kvkn,0,What's a phrase Parents use that sounds altrui...,"[{'e': 'text', 't': 'DISCUSSION'}]",r/childfree,False,6,discussion,...,,False,1677962253,1677962254,2023-03-04 20:37:18,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,childfree,"Title, just curious about your opinions",t2_flo4mdql,0,Why is the average person so vehemently suppor...,"[{'e': 'text', 't': 'HUMOR'}]",r/childfree,False,6,humor,...,,False,1672793325,1672793326,2023-01-04 00:48:30,,,,,
4996,childfree,"Title says it all. So for the past few days, I...",t2_9i35ibaf,0,Why do little kids interrupt their parents' cu...,"[{'e': 'text', 't': 'RANT'}]",r/childfree,False,6,rant,...,,False,1672791222,1672791222,2023-01-04 00:13:30,,,,,
4997,childfree,"I've been lurking for a while, but the number ...",t2_teqj3t66,0,What's with the misandry on this sub?,"[{'e': 'text', 't': 'RANT'}]",r/childfree,False,6,rant,...,,False,1672789953,1672789954,2023-01-03 23:52:15,,,,,
4998,childfree,I saw on tiktok a girl with a list of every we...,t2_mcp7m6mn,0,Pregnancy nose??,"[{'e': 'text', 't': 'DISCUSSION'}]",r/childfree,False,6,discussion,...,,False,1672789294,1672789294,2023-01-03 23:41:19,,,,,


### r/Parenting Data Cleaning & Evaluation

In [11]:
df_parent.head(50)

Unnamed: 0,subreddit,selftext,author_fullname,gilded,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,...,num_crossposts,media,is_video,retrieved_utc,updated_utc,utc_datetime_str,post_hint,preview,edited_on,author_cakeday
0,Parenting,"Will the title says it all, my mother-in-law p...",t2_9wgzn8tj,0,First night away from my baby- help :(,"[{'e': 'text', 't': 'Advice'}]",r/Parenting,False,6,advice,...,0,,False,1677963246,1677963247,2023-03-04 20:53:51,,,,
1,Parenting,"Took the boy (14), the spouse (51), the dog (3...",t2_7cpfk85g,0,Exercising the teen and the…cat,"[{'e': 'text', 't': 'Teenager 13-19 Years'}]",r/Parenting,False,6,teenager,...,0,,False,1677962173,1677962174,2023-03-04 20:35:55,,,,
2,Parenting,Not sure if this is the right sub to post in. ...,t2_13h354,0,Questions about pediatrician…,"[{'e': 'text', 't': 'Advice'}]",r/Parenting,False,6,advice,...,0,,False,1677961908,1677961908,2023-03-04 20:31:34,,,,
3,Parenting,"My daughter is amazing. She is smart, fucking ...",t2_12755s,0,normal 4/5 year old behavior?,"[{'e': 'text', 't': 'Child 4-9 Years'}]",r/Parenting,False,6,child,...,0,,False,1677961401,1677961402,2023-03-04 20:23:05,,,,
4,Parenting,There's already a lot of good threads about ra...,t2_7n9yf,0,Yet another trilingual baby question,"[{'e': 'text', 't': 'Education &amp; Learning'}]",r/Parenting,False,6,education,...,0,,False,1677960744,1677960745,2023-03-04 20:12:11,,,,
5,Parenting,"I have two boys, 4yrs and almost 2yrs. They’re...",t2_8tetvd05,0,When did your kids stop being so crazy loud?,"[{'e': 'text', 't': 'Behaviour'}]",r/Parenting,False,6,behaviour,...,0,,False,1677960727,1677960728,2023-03-04 20:11:55,,,,
6,Parenting,We've been together for about 8 months and we ...,t2_nfbtbbez,0,I'm pregnant but my partner (dad of 3) doesn't...,"[{'e': 'text', 't': 'Advice'}]",r/Parenting,False,6,advice,...,0,,False,1677959979,1677959980,2023-03-04 19:59:26,,,,
7,Parenting,"Over the last several weeks, our almost 2.5YO ...",t2_j64zs,0,2.5YO refusing to swallow or spit out last bit...,"[{'e': 'text', 't': 'Toddler 1-3 Years'}]",r/Parenting,False,6,toddler,...,0,,False,1677959954,1677959955,2023-03-04 19:59:00,,,,
8,Parenting,Sparked by a recent thread where 99 percent of...,t2_vqxhb2vc,0,How do you realistically prevent your tweens a...,"[{'e': 'text', 't': 'Teenager 13-19 Years'}]",r/Parenting,False,6,teenager,...,0,,False,1677958279,1677958280,2023-03-04 19:31:08,,,,
9,Parenting,My daughters best friend is staying with us...,t2_9o27qyes,0,"My daughter's friend (12f) has a ""boyfriend"" s...","[{'e': 'text', 't': 'Tween 10-12 Years'}]",r/Parenting,False,6,tween,...,0,,False,1677957739,1677957739,2023-03-04 19:22:04,,,,


In [12]:
df_parent.shape

(5000, 91)

In [13]:
df_parent.isna().sum()

subreddit              0
selftext               0
author_fullname       20
gilded                 0
title                  0
                    ... 
utc_datetime_str       0
post_hint           4856
preview             4856
edited_on           4987
author_cakeday      4986
Length: 91, dtype: int64

Since we're going to apply an NLP model to this data, the columns pertaining to video or image content are unnecessary. We'll remove those below.

In [14]:
df_parent = df_parent.drop(columns=['is_video', 'thumbnail_width', 'thumbnail'])
#'preview_images','preview_enabled'
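`DataFrame.drop` raises a `KeyError` when a listed column is missing, which can happen if a cell is re-run or if the two subreddit frames do not carry exactly the same columns. A small sketch of a more forgiving variant, using a made-up toy frame in place of `df_parent`:

```python
import pandas as pd

# Toy frame standing in for df_parent; the values are made up
toy = pd.DataFrame({
    'title': ['a', 'b'],
    'is_video': [False, False],
    'thumbnail': ['self', 'self'],
})

media_cols = ['is_video', 'thumbnail_width', 'thumbnail']

# errors='ignore' skips labels that are absent instead of raising KeyError
cleaned = toy.drop(columns=media_cols, errors='ignore')
print(list(cleaned.columns))
```

The trade-off is that a misspelled column name is silently ignored, so this is best paired with a quick look at the remaining columns.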

In [15]:
df_parent.isna().sum().head(40)

subreddit                           0
selftext                            0
author_fullname                    20
gilded                              0
title                               0
link_flair_richtext                 0
subreddit_name_prefixed             0
hidden                              0
pwls                                0
link_flair_css_class                0
thumbnail_height                 5000
top_awarded_type                 5000
hide_score                          0
quarantine                          0
link_flair_text_color               0
upvote_ratio                        0
author_flair_background_color    4910
subreddit_type                      0
total_awards_received               0
media_embed                         0
author_flair_template_id         4931
is_original_content                 0
secure_media                     5000
is_reddit_media_domain              0
is_meta                             0
category                         5000
secure_media

The remaining columns that contain NA values aren't text based and aren't necessary for our analysis. We'll drop them below:

In [16]:
df_parent = df_parent.dropna(axis=1)
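`dropna(axis=1)` removes every column that has at least one missing value, so a column such as `author_fullname` (20 missing out of 5000) is dropped along with the entirely empty ones. A minimal sketch, on a made-up toy frame, for listing what will be removed before committing to it:

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['a', 'b', 'c'],
    'author_fullname': ['t2_x', None, 't2_z'],  # one missing value
    'category': [None, None, None],             # entirely missing
})

# Columns containing at least one missing value: these are what dropna(axis=1) removes
to_drop = toy.columns[toy.isna().any()].tolist()
print(to_drop)

remaining = toy.dropna(axis=1)
print(list(remaining.columns))
```

If mostly complete columns should be kept instead, the `thresh` argument of `dropna` sets the minimum number of non-missing values a column needs to survive.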

### Evaluating Remaining Features

In [17]:
df_parent.head(3)

Unnamed: 0,subreddit,selftext,gilded,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,hide_score,...,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,retrieved_utc,updated_utc,utc_datetime_str
0,Parenting,"Will the title says it all, my mother-in-law p...",0,First night away from my baby- help :(,"[{'e': 'text', 't': 'Advice'}]",r/Parenting,False,6,advice,True,...,/r/Parenting/comments/11icx6s/first_night_away...,all_ads,False,https://www.reddit.com/r/Parenting/comments/11...,5221073,1677963231,0,1677963246,1677963247,2023-03-04 20:53:51
1,Parenting,"Took the boy (14), the spouse (51), the dog (3...",0,Exercising the teen and the…cat,"[{'e': 'text', 't': 'Teenager 13-19 Years'}]",r/Parenting,False,6,teenager,True,...,/r/Parenting/comments/11icgox/exercising_the_t...,all_ads,False,https://www.reddit.com/r/Parenting/comments/11...,5221030,1677962155,0,1677962173,1677962174,2023-03-04 20:35:55
2,Parenting,Not sure if this is the right sub to post in. ...,0,Questions about pediatrician…,"[{'e': 'text', 't': 'Advice'}]",r/Parenting,False,6,advice,True,...,/r/Parenting/comments/11icck2/questions_about_...,all_ads,False,https://www.reddit.com/r/Parenting/comments/11...,5221024,1677961894,0,1677961908,1677961908,2023-03-04 20:31:34


Several of the columns are descriptors of the posts or of the subreddit itself. All of the subreddit descriptors will be removed, as I don't believe they will provide adequate value to the model. Some of the post descriptors will be removed because they provide duplicate data (e.g., 'link_flair_richtext' carries the same information as 'link_flair_css_class').

In [18]:
#Subreddit descriptors
r_desc = [
    #'author_fullname', 
    'permalink', 
    'url', 
    'link_flair_text_color',
    'utc_datetime_str',
    'author','subreddit_type',
    #'link_flair_background_color',
    #'author_patreon_flair'
]

#duplicate features
dup_feat = ['link_flair_richtext','subreddit_name_prefixed','retrieved_utc', 'updated_utc','created_utc']
rem_features = r_desc+dup_feat

In [19]:
df_parent = df_parent.drop(columns=rem_features)


In [20]:
df_parent.head(5)

Unnamed: 0,subreddit,selftext,gilded,title,hidden,pwls,link_flair_css_class,hide_score,quarantine,upvote_ratio,...,id,is_robot_indexable,num_comments,send_replies,whitelist_status,contest_mode,parent_whitelist_status,stickied,subreddit_subscribers,num_crossposts
0,Parenting,"Will the title says it all, my mother-in-law p...",0,First night away from my baby- help :(,False,6,advice,True,False,1.0,...,11icx6s,True,0,True,all_ads,False,all_ads,False,5221073,0
1,Parenting,"Took the boy (14), the spouse (51), the dog (3...",0,Exercising the teen and the…cat,False,6,teenager,True,False,1.0,...,11icgox,True,0,True,all_ads,False,all_ads,False,5221030,0
2,Parenting,Not sure if this is the right sub to post in. ...,0,Questions about pediatrician…,False,6,advice,True,False,1.0,...,11icck2,True,0,True,all_ads,False,all_ads,False,5221024,0
3,Parenting,"My daughter is amazing. She is smart, fucking ...",0,normal 4/5 year old behavior?,False,6,child,True,False,1.0,...,11ic4rf,True,0,True,all_ads,False,all_ads,False,5221010,0
4,Parenting,There's already a lot of good threads about ra...,0,Yet another trilingual baby question,False,6,education,True,False,1.0,...,11ibuvf,True,0,True,all_ads,False,all_ads,False,5220995,0


### r/Childfree Data Cleaning & Evaluation

In [21]:
df_childfree.head(50)

Unnamed: 0,subreddit,selftext,author_fullname,gilded,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,...,media,is_video,retrieved_utc,updated_utc,utc_datetime_str,post_hint,preview,edited_on,url_overridden_by_dest,author_cakeday
0,childfree,It‘s just something I’ve been wondering lately...,t2_tyrl12vk,0,I wonder how many of the guys who have bullied...,"[{'e': 'text', 't': 'RANT'}]",r/childfree,False,6,rant,...,,False,1677963532,1677963533,2023-03-04 20:58:42,,,,,
1,childfree,So I work in a job that means I deal with a lo...,t2_4qo87csv,0,Needing to rant,"[{'e': 'text', 't': 'RANT'}]",r/childfree,False,6,rant,...,,False,1677963299,1677963300,2023-03-04 20:54:45,,,,,
2,childfree,[removed],t2_j5qjl,0,Hysterectomy hashtags,"[{'e': 'text', 't': 'RAVE'}]",r/childfree,False,6,rave,...,,False,1677963115,1677963116,2023-03-04 20:51:40,,,,,
3,childfree,Especially when they say that their children a...,t2_etjx83hh,0,What do you say in response to a parent when t...,"[{'e': 'text', 't': 'DISCUSSION'}]",r/childfree,False,6,discussion,...,,False,1677962638,1677962639,2023-03-04 20:43:48,,,,,
4,childfree,The phrase that sounds sweet but drives me up ...,t2_3m72kvkn,0,What's a phrase Parents use that sounds altrui...,"[{'e': 'text', 't': 'DISCUSSION'}]",r/childfree,False,6,discussion,...,,False,1677962253,1677962254,2023-03-04 20:37:18,,,,,
5,childfree,Sigh... I knew it was too good to last forever...,t2_1xd9gg9,0,Young kids moved in next door,"[{'e': 'text', 't': 'DISCUSSION'}]",r/childfree,False,6,discussion,...,,False,1677959419,1677959419,2023-03-04 19:50:05,,,,,
6,childfree,I’m 27 so I’m in the age when most people are ...,t2_60d6kmwk,0,All the people telling me not to get married h...,"[{'e': 'text', 't': 'RANT'}]",r/childfree,False,6,rant,...,,False,1677959317,1677959317,2023-03-04 19:48:26,,,,,
7,childfree,Best $450ish I ever spent! Any childfree guys ...,t2_ybfrs,0,Today marks one year since my vasectomy,"[{'e': 'text', 't': 'FIX'}]",r/childfree,False,6,fix,...,,False,1677958980,1677958980,2023-03-04 19:42:47,,,,,
8,childfree,[removed],t2_60p31vgvh,0,FURFRIENDS Pet Survey,"[{'e': 'text', 't': 'PET'}]",r/childfree,False,6,pet,...,,False,1677956457,1677956457,2023-03-04 19:00:42,self,{'images': [{'source': {'url': 'https://extern...,,,
9,childfree,[removed],t2_60p31vgvh,0,FURFRIENDS Pet Survey,"[{'e': 'text', 't': 'PET'}]",r/childfree,False,6,pet,...,,False,1677956413,1677956414,2023-03-04 19:00:00,self,{'images': [{'source': {'url': 'https://extern...,,,


In [22]:
df_childfree.isna().sum().head(10)

subreddit                   0
selftext                    0
author_fullname            25
gilded                      0
title                       0
link_flair_richtext         0
subreddit_name_prefixed     0
hidden                      0
pwls                        0
link_flair_css_class        0
dtype: int64

Since the columns with missing values aren't relevant to our analysis, we can drop them.

In [23]:
df_childfree = df_childfree.dropna(axis=1) 

In [24]:
df_childfree.columns

Index(['subreddit', 'selftext', 'gilded', 'title', 'link_flair_richtext',
       'subreddit_name_prefixed', 'hidden', 'pwls', 'link_flair_css_class',
       'hide_score', 'quarantine', 'link_flair_text_color', 'upvote_ratio',
       'subreddit_type', 'total_awards_received', 'media_embed',
       'is_original_content', 'is_reddit_media_domain', 'is_meta',
       'secure_media_embed', 'link_flair_text', 'score',
       'is_created_from_ads_ui', 'thumbnail', 'edited', 'gildings', 'is_self',
       'link_flair_type', 'wls', 'domain', 'allow_live_comments',
       'suggested_sort', 'archived', 'no_follow', 'is_crosspostable', 'pinned',
       'over_18', 'all_awardings', 'awarders', 'media_only',
       'link_flair_template_id', 'can_gild', 'spoiler', 'locked',
       'treatment_tags', 'subreddit_id', 'link_flair_background_color', 'id',
       'is_robot_indexable', 'author', 'num_comments', 'send_replies',
       'whitelist_status', 'contest_mode', 'permalink',
       'parent_whitelist_sta

In [25]:
#removing features pertaining to video
df_childfree = df_childfree.drop(columns=['is_video', 'thumbnail'])

### Evaluating Remaining Features

We'll clean the remaining features the same way we did for the r/Parenting data. Since both datasets contain the same columns, we'll reuse the rem_features variable, which holds the duplicate and subreddit descriptor features.

In [26]:
df_childfree.head(5)

Unnamed: 0,subreddit,selftext,gilded,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,hide_score,...,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,retrieved_utc,updated_utc,utc_datetime_str
0,childfree,It‘s just something I’ve been wondering lately...,0,I wonder how many of the guys who have bullied...,"[{'e': 'text', 't': 'RANT'}]",r/childfree,False,6,rant,True,...,/r/childfree/comments/11id1o4/i_wonder_how_man...,all_ads,False,https://www.reddit.com/r/childfree/comments/11...,1494153,1677963522,0,1677963532,1677963533,2023-03-04 20:58:42
1,childfree,So I work in a job that means I deal with a lo...,0,Needing to rant,"[{'e': 'text', 't': 'RANT'}]",r/childfree,False,6,rant,True,...,/r/childfree/comments/11icxz6/needing_to_rant/,all_ads,False,https://www.reddit.com/r/childfree/comments/11...,1494155,1677963285,0,1677963299,1677963300,2023-03-04 20:54:45
2,childfree,[removed],0,Hysterectomy hashtags,"[{'e': 'text', 't': 'RAVE'}]",r/childfree,False,6,rave,True,...,/r/childfree/comments/11icv81/hysterectomy_has...,all_ads,False,https://www.reddit.com/r/childfree/comments/11...,1494154,1677963100,0,1677963115,1677963116,2023-03-04 20:51:40
3,childfree,Especially when they say that their children a...,0,What do you say in response to a parent when t...,"[{'e': 'text', 't': 'DISCUSSION'}]",r/childfree,False,6,discussion,True,...,/r/childfree/comments/11icnw8/what_do_you_say_...,all_ads,False,https://www.reddit.com/r/childfree/comments/11...,1494157,1677962628,0,1677962638,1677962639,2023-03-04 20:43:48
4,childfree,The phrase that sounds sweet but drives me up ...,0,What's a phrase Parents use that sounds altrui...,"[{'e': 'text', 't': 'DISCUSSION'}]",r/childfree,False,6,discussion,True,...,/r/childfree/comments/11ichxn/whats_a_phrase_p...,all_ads,False,https://www.reddit.com/r/childfree/comments/11...,1494154,1677962238,0,1677962253,1677962254,2023-03-04 20:37:18


In [27]:
df_childfree.head(-5)

Unnamed: 0,subreddit,selftext,gilded,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,hide_score,...,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,retrieved_utc,updated_utc,utc_datetime_str
0,childfree,It‘s just something I’ve been wondering lately...,0,I wonder how many of the guys who have bullied...,"[{'e': 'text', 't': 'RANT'}]",r/childfree,False,6,rant,True,...,/r/childfree/comments/11id1o4/i_wonder_how_man...,all_ads,False,https://www.reddit.com/r/childfree/comments/11...,1494153,1677963522,0,1677963532,1677963533,2023-03-04 20:58:42
1,childfree,So I work in a job that means I deal with a lo...,0,Needing to rant,"[{'e': 'text', 't': 'RANT'}]",r/childfree,False,6,rant,True,...,/r/childfree/comments/11icxz6/needing_to_rant/,all_ads,False,https://www.reddit.com/r/childfree/comments/11...,1494155,1677963285,0,1677963299,1677963300,2023-03-04 20:54:45
2,childfree,[removed],0,Hysterectomy hashtags,"[{'e': 'text', 't': 'RAVE'}]",r/childfree,False,6,rave,True,...,/r/childfree/comments/11icv81/hysterectomy_has...,all_ads,False,https://www.reddit.com/r/childfree/comments/11...,1494154,1677963100,0,1677963115,1677963116,2023-03-04 20:51:40
3,childfree,Especially when they say that their children a...,0,What do you say in response to a parent when t...,"[{'e': 'text', 't': 'DISCUSSION'}]",r/childfree,False,6,discussion,True,...,/r/childfree/comments/11icnw8/what_do_you_say_...,all_ads,False,https://www.reddit.com/r/childfree/comments/11...,1494157,1677962628,0,1677962638,1677962639,2023-03-04 20:43:48
4,childfree,The phrase that sounds sweet but drives me up ...,0,What's a phrase Parents use that sounds altrui...,"[{'e': 'text', 't': 'DISCUSSION'}]",r/childfree,False,6,discussion,True,...,/r/childfree/comments/11ichxn/whats_a_phrase_p...,all_ads,False,https://www.reddit.com/r/childfree/comments/11...,1494154,1677962238,0,1677962253,1677962254,2023-03-04 20:37:18
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4990,childfree,"To start off with, I have been iffy on posting...",0,"22 Years of ""You'll Change Your Mind""","[{'e': 'text', 't': 'RANT'}]",r/childfree,False,6,rant,True,...,/r/childfree/comments/102qog3/22_years_of_youl...,all_ads,False,https://www.reddit.com/r/childfree/comments/10...,1487929,1672797547,0,1672797561,1672797562,2023-01-04 01:59:07
4991,childfree,My boyfriend &amp; I had a New Year’s Eve part...,0,"Friends expecting a baby in March tell me to ""...","[{'e': 'text', 't': 'RANT'}]",r/childfree,False,6,rant,True,...,/r/childfree/comments/102qht4/friends_expectin...,all_ads,False,https://www.reddit.com/r/childfree/comments/10...,1487929,1672797028,0,1672797047,1672797048,2023-01-04 01:50:28
4992,childfree,Anyone else here HATE when pregnant women rub/...,0,Pregnant Women Pet Peeve,"[{'e': 'text', 't': 'RANT'}]",r/childfree,False,6,rant,True,...,/r/childfree/comments/102q7bp/pregnant_women_p...,all_ads,False,https://www.reddit.com/r/childfree/comments/10...,1487928,1672796257,0,1672796274,1672796274,2023-01-04 01:37:37
4993,childfree,some of my younger of relatives are basically ...,0,"i have a flaming, burning hate for “ipad kids”","[{'e': 'text', 't': 'RANT'}]",r/childfree,False,6,rant,True,...,/r/childfree/comments/102pu25/i_have_a_flaming...,all_ads,False,https://www.reddit.com/r/childfree/comments/10...,1487927,1672795274,0,1672795288,1672795288,2023-01-04 01:21:14


In [28]:
list_col_cf = df_childfree.columns
list_col_par = df_parent.columns

notin_col = [col for col in list_col_cf if col not in list_col_par]
notin_col

['link_flair_richtext',
 'subreddit_name_prefixed',
 'link_flair_text_color',
 'subreddit_type',
 'suggested_sort',
 'author',
 'permalink',
 'url',
 'created_utc',
 'retrieved_utc',
 'updated_utc',
 'utc_datetime_str']
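The list above contains the columns held in `rem_features` plus `suggested_sort`, all of which are still present in `df_childfree` but not in `df_parent`. Assuming the goal is for both frames to end up with identical columns, one way to align them is sketched below on made-up toy frames (`par`, `cf`, and `cf_aligned` are illustrative names only):

```python
import pandas as pd

# Toy frames standing in for df_parent and df_childfree
par = pd.DataFrame({'subreddit': ['Parenting'], 'title': ['t1'], 'id': ['a1']})
cf = pd.DataFrame({
    'subreddit': ['childfree'], 'title': ['t2'], 'id': ['b1'],
    'permalink': ['/r/childfree/comments/x'], 'suggested_sort': ['confidence'],
})

# Same comprehension as in the cell above: columns only the childfree frame has
extra = [col for col in cf.columns if col not in par.columns]
cf_aligned = cf.drop(columns=extra)

print(extra)
print(list(cf_aligned.columns) == list(par.columns))
```

Re-running the `notin_col` check afterwards should then return an empty list.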

#### Checking for Duplicates

Before we proceed, I want to make sure that I haven't introduced any duplicate posts into my data. As we move deeper into the analysis, it will become more difficult to determine whether this has occurred.

In [29]:
duplicates_cf = df_childfree.duplicated(subset=['title', 'id'])
duplicates_cf.sum()

0

In [30]:
duplicates_p = df_parent.duplicated(subset=['title', 'id'])
duplicates_p.sum()

0
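Because `subset=['title', 'id']` only flags rows where both fields repeat, a repost with the same title but a new id is not counted (rows 8 and 9 of the `df_childfree` preview share a title, for example). The toy example below, with made-up values, illustrates the difference between the two checks:

```python
import pandas as pd

toy = pd.DataFrame({
    'id': ['a1', 'a2', 'a2', 'a3'],
    'title': ['first', 'second', 'second', 'first'],
})

# Flagged only when both 'title' and 'id' repeat an earlier row
both = toy.duplicated(subset=['title', 'id']).sum()

# Flagged when the title alone repeats, e.g. a repost under a new id
title_only = toy.duplicated(subset=['title']).sum()

print(both, title_only)

# Keep the first occurrence of each id
deduped = toy.drop_duplicates(subset=['id'])
```

Whether same-title posts should be treated as duplicates is a modelling decision; the sketch only shows how to surface them.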

### Exporting Data

In [31]:
with open('pickles/df_childfree.pkl', 'wb') as f:
    pickle.dump(df_childfree, f)
      
with open('pickles/df_parent.pkl', 'wb') as f:
    pickle.dump(df_parent, f)
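The export above assumes a `pickles/` folder already exists; `open(..., 'wb')` fails otherwise. A minimal round-trip sketch using pandas' own pickle helpers, with a made-up toy frame and a temporary directory so nothing in the project is overwritten:

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({'title': ['a', 'b'], 'score': [1, 2]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / 'pickles'
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    path = out_dir / 'toy.pkl'

    toy.to_pickle(path)              # write the frame
    restored = pd.read_pickle(path)  # read it back, as the next notebook would

print(restored.equals(toy))
```

Checking `equals` after reloading is a cheap way to confirm that the next notebook starts from exactly the data cleaned here.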