# Project 3 - NLP and Reddit Classification - Part 1: Data Import and Cleanup

## Problem Statement

This project aims to improve and update public education on topics that people feel they should understand but are too embarrassed to ask about.

We will build a classification model that takes a post as input and helps determine whether a topic is worth building resources for. It does this by comparing posts in ‘TooAfraidToAsk’ (generally viewed as the more serious forum) with posts in ‘NoStupidQuestions’ (which centers on questions asked out of curiosity).

## Executive Summary

The model gave us some valuable insight despite its low accuracy score. The two subreddits are very similar in that both welcome questions on a wide variety of topics, so it is not entirely surprising that the model could not cleanly separate the posts from each.

Where we found success was in identifying the most frequently occurring keywords that the model prioritized. These relate to health and family situations. If there is a chance to improve public education, it could lie in providing more support on these sorts of matters.

### Contents:
#### Part 1:
- [API Data Import](#API-Data-Import)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)

#### Part 2:
- Model Setup
- Initial Model Configurations
- Modelling Using Stemmed X Values
- Changing Train/Test Split to 80:20 from 66:33
- Engineer Additional Feature
- Score Against Train and Test Datasets
- Confusion Matrix
- Sensitivity and Specificity
- Reddit Post Content Analysis
- Conclusion



## Package Import

In [567]:
import requests
import pandas as pd
import numpy as np
import re

## API Data Import

### Acquiring 'NoStupidQuestions' data

In [568]:
url = "https://api.pushshift.io/reddit/search/submission"


In [569]:
params = {
    'subreddit' : 'NoStupidQuestions',
    'size' : 500,
    'before' : 1585961257 #500 posts before Apr 3, 8:47PM (posts range from Apr 3 2:38PM to 8:47PM)
}


In [570]:
req = requests.get(url, params)


In [571]:
req.status_code

200

In [572]:
sq = req.json()


In [574]:
posts = sq['data']

In [575]:
df = pd.DataFrame(posts)

In [576]:
df

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,preview,author_flair_background_color,author_flair_text_color,link_flair_css_class,link_flair_template_id,link_flair_text,author_cakeday,author_flair_template_id,banned_by,edited
0,[],False,greenkittypower,,[],,text,t2_53nrb95z,False,False,...,,,,,,,,,,
1,[],False,darkLordSantaClaus,,[],,text,t2_2in1fjf6,False,False,...,,,,,,,,,,
2,[],False,Natnaeltefera,,[],,text,t2_311kkk4l,False,False,...,,,,,,,,,,
3,[],False,ItchyPositive9,,[],,text,t2_647696a8,False,False,...,"{'enabled': False, 'images': [{'id': 'hki1NSgD...",,,,,,,,,
4,[],False,CoronaEnema,,[],,text,t2_629nnf7i,False,False,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,[],False,Cherryascanbe,,[],,text,t2_3x86spwe,False,False,...,,,,,,,,,,
496,[],False,GenericUsername180,,[],,text,t2_4ytqyiwl,False,False,...,,,,,,,,,,
497,[],False,br58T,,[],,text,t2_5vpiigy2,False,False,...,,,,,,,,,,
498,[],False,HarebrainedLitre,,[],,text,t2_3goufuaa,False,False,...,,,,,,,,,,


In [577]:
df.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_patreon_flair',
       'author_premium', 'awarders', 'can_mod_post', 'contest_mode',
       'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_crosspostable', 'is_meta', 'is_original_content',
       'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
       'link_flair_background_color', 'link_flair_richtext',
       'link_flair_text_color', 'link_flair_type', 'locked', 'media_only',
       'no_follow', 'num_comments', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'removed_by_category', 'retrieved_on', 'score', 'selftext',
       'send_replies', 'spoiler', 'stickied', 'subreddit', 'subreddit_id',
       'subreddit_subscribers', 'subreddit_type', 'thumbnail', 'title',
       'total_awards_received', 'url', 'wh

In [578]:
df['created_utc'].sort_values()

499    1585939037
498    1585939094
497    1585939166
496    1585939196
495    1585939230
          ...    
4      1585961052
3      1585961134
2      1585961141
1      1585961222
0      1585961223
Name: created_utc, Length: 500, dtype: int64
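
The `created_utc` values are Unix epoch seconds, so the time ranges quoted in the request comments can be checked by converting them. The following is a minimal sketch, assuming the times in those comments are US Eastern (this is inferred from the values and is not stated in the notebook); `to_local` is a hypothetical helper name.

```python
import pandas as pd

def to_local(epoch_seconds, tz="America/New_York"):
    """Convert Unix epoch seconds to timezone-aware timestamps.

    The Eastern time zone default is an assumption inferred from the comments.
    """
    ts = pd.to_datetime(pd.Series(epoch_seconds), unit="s", utc=True)
    return ts.dt.tz_convert(tz)

# Cutoff used in the first request, plus the oldest post it returned
local = to_local([1585961257, 1585939037])
print(local.dt.strftime("%Y-%m-%d %H:%M").tolist())
```

Running the same helper on each `before` value is a quick way to confirm that a comment matches the cutoff it describes.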

In [579]:
nsq = df[['subreddit','selftext','title']]

In [580]:
nsq.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  500 non-null    object
 1   selftext   494 non-null    object
 2   title      500 non-null    object
dtypes: object(3)
memory usage: 11.8+ KB


In [581]:
nsq['selftext'].value_counts()

             138
[removed]     46
[deleted]

In [582]:
params = {
    'subreddit' : 'NoStupidQuestions',
    'size' : 500,
    'before' : '1585910503' #500 posts before 6:41AM Apr 3 (posts range from Apr 2 9:28PM to Apr 3 6:41AM)
}

In [583]:
req = requests.get(url, params)
req.status_code

200

In [584]:
sq = req.json()
posts = sq['data']
df = pd.DataFrame(posts)


In [585]:
df['created_utc'].sort_values()

499    1585877308
498    1585877341
497    1585877401
496    1585877405
495    1585877474
          ...    
4      1585910072
3      1585910219
2      1585910307
1      1585910319
0      1585910404
Name: created_utc, Length: 500, dtype: int64

In [586]:
nsq1 = df[['subreddit','selftext','title']]

In [587]:
nsq = pd.concat([nsq, nsq1], ignore_index=True)

In [588]:
params = {
    'subreddit' : 'NoStupidQuestions',
    'size' : 500,
    'before' : '1585877308' #500 posts before 9:28PM Apr 2 (posts range from Apr 2 2:54PM to Apr 2 9:28PM)
}

In [589]:
req = requests.get(url, params)
req.status_code

200

In [590]:
sq = req.json()
posts = sq['data']
df = pd.DataFrame(posts)

In [591]:
df['created_utc'].sort_values()

499    1585853695
498    1585853714
497    1585853721
496    1585853724
495    1585853842
          ...    
4      1585877132
3      1585877202
2      1585877218
1      1585877253
0      1585877306
Name: created_utc, Length: 500, dtype: int64

In [592]:
nsq2 = df[['subreddit','selftext','title']]

In [593]:
nsq = pd.concat([nsq, nsq2], ignore_index=True)

In [594]:
params = {
    'subreddit' : 'NoStupidQuestions',
    'size' : 500,
    'before' : '1585853595' #500 posts before 2:53PM Apr 2 (posts range from Apr 2 7:37AM to Apr 2 2:51PM)
}

In [595]:
req = requests.get(url, params)
req.status_code

200

In [596]:
sq = req.json()
posts = sq['data']
df = pd.DataFrame(posts)

In [597]:
df['created_utc'].sort_values()

499    1585827424
498    1585827688
497    1585827837
496    1585827956
495    1585828024
          ...    
4      1585853416
3      1585853466
2      1585853469
1      1585853503
0      1585853505
Name: created_utc, Length: 500, dtype: int64

In [598]:
nsq3 = df[['subreddit','selftext','title']]

In [599]:
nsq = pd.concat([nsq, nsq3], ignore_index=True)

In [600]:
nsq.shape

(2000, 3)
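
The four request cells above repeat the same steps with a hand-picked `before` value each time. As a rough sketch, the pattern could be wrapped in a loop that reuses the oldest `created_utc` on each page as the next cutoff. The names `collect_pages` and `pushshift_fetch` are hypothetical, and the sketch assumes the endpoint still responds as it did when the notebook was run, with a `data` list of posts.

```python
def collect_pages(fetch_page, subreddit, before, n_pages=4, size=500):
    """Collect several pages of posts, walking backwards in time.

    fetch_page is any callable that takes a params dict and returns a list
    of post dicts, which keeps this sketch usable without network access.
    """
    posts = []
    for _ in range(n_pages):
        page = fetch_page({"subreddit": subreddit, "size": size, "before": before})
        if not page:
            break  # nothing older is available
        posts.extend(page)
        # The oldest post on this page becomes the cutoff for the next request
        before = min(post["created_utc"] for post in page)
    return posts


def pushshift_fetch(params, url="https://api.pushshift.io/reddit/search/submission"):
    """Fetch one page from the endpoint used in this notebook."""
    import requests  # imported lazily so the sketch loads without it

    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()  # fail loudly instead of inspecting status_code by hand
    return response.json()["data"]
```

A call such as `collect_pages(pushshift_fetch, 'NoStupidQuestions', 1585961257)` would return a list that can be passed to `pd.DataFrame`. Because the notebook chose its cutoffs by hand, this loop would not necessarily reproduce the same 2,000 rows.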

In [601]:
nsq['selftext'].value_counts()

             630
[removed]

### Acquiring 'TooAfraidToAsk' Data

In [602]:
params = {
    'subreddit' : 'TooAfraidToAsk',
    'size' : 500,
    'before' : 1585961257 #500 posts before Apr 3, 8:47PM (posts range from Apr 2 5:07AM to Apr 3 8:36PM)
}


In [603]:
req = requests.get(url, params)
req.status_code

200

In [604]:
sq = req.json()
posts = sq['data']
df = pd.DataFrame(posts)

In [605]:
df['created_utc'].sort_values()

499    1585818469
498    1585818880
497    1585818946
496    1585818961
495    1585819243
          ...    
4      1585959934
3      1585960438
2      1585960559
1      1585960596
0      1585960605
Name: created_utc, Length: 500, dtype: int64

In [606]:
tata = df[['subreddit','selftext','title']]

In [607]:
tata['selftext'].value_counts()


In [608]:
params = {
    'subreddit' : 'TooAfraidToAsk',
    'size' : 500,
    'before' : 1585818069 #500 posts before Apr 2, 5:01AM (posts range from Mar 31 1:57PM to Apr 2 4:45AM)
}


In [609]:
req = requests.get(url, params)
req.status_code

200

In [610]:
sq = req.json()
posts = sq['data']
df = pd.DataFrame(posts)

In [611]:
df['created_utc'].sort_values()

499    1585677422
498    1585677591
497    1585678686
496    1585678743
495    1585678863
          ...    
4      1585815307
3      1585815792
2      1585815920
1      1585816447
0      1585817120
Name: created_utc, Length: 500, dtype: int64

In [612]:
tata1 = df[['subreddit','selftext','title']]

In [613]:
tata = pd.concat([tata, tata1], ignore_index=True)

In [614]:
params = {
    'subreddit' : 'TooAfraidToAsk',
    'size' : 500,
    'before' : 1585677422 #500 posts before Mar 31, 1:57PM (posts range from Mar 29 10:11PM to Mar 31 1:53PM)
}


In [615]:
req = requests.get(url, params)
req.status_code

200

In [616]:
sq = req.json()
posts = sq['data']
df = pd.DataFrame(posts)

In [617]:
df['created_utc'].sort_values()

499    1585534284
498    1585534304
497    1585534760
496    1585535078
495    1585535134
          ...    
4      1585676884
3      1585677033
2      1585677086
1      1585677190
0      1585677239
Name: created_utc, Length: 500, dtype: int64

In [618]:
tata2 = df[['subreddit','selftext','title']]

In [619]:
tata = pd.concat([tata, tata2], ignore_index=True)

In [620]:
params = {
    'subreddit' : 'TooAfraidToAsk',
    'size' : 500,
    'before' : 1585534284 #500 posts before Mar 29, 10:11PM (posts range from Mar 28 4:23AM to Mar 29 10:11PM)
}


In [621]:
req = requests.get(url, params)
req.status_code

200

In [622]:
sq = req.json()
posts = sq['data']
df = pd.DataFrame(posts)

In [623]:
df['created_utc'].sort_values()

499    1585383782
498    1585385670
497    1585385726
496    1585386485
495    1585386857
          ...    
4      1585533547
3      1585533804
2      1585533865
1      1585533957
0      1585534261
Name: created_utc, Length: 500, dtype: int64

In [624]:
tata3 = df[['subreddit','selftext','title']]

In [625]:
tata = pd.concat([tata, tata3], ignore_index=True)

### Merge 'NoStupidQuestions' and 'TooAfraidToAsk' posts together

In [860]:
data = pd.concat([nsq, tata], ignore_index=True)

In [861]:
data

| | subreddit | selftext | title |
|---|---|---|---|
| 0 | NoStupidQuestions | \[removed\] | Local travel in the time of Corona |
| 1 | NoStupidQuestions | Like, that scene of Taken 3 of Liam Neeson jum... | In cinematography, why are cuts considered bad? |
| 2 | NoStupidQuestions | I need an advice, since I was grade 10 student... | This Decision will completely change my life |
| 3 | NoStupidQuestions | Every source I found says clinical depression ... | What gave some people the idea that clinical d... |
| 4 | NoStupidQuestions | My little brother (14) got an Xbox One and I g... | What exactly is the logic behind birthday pres... |
| ... | ... | ... | ... |
| 3995 | TooAfraidToAsk | I know that our body's become "paralyzed" to p... | Why don't we sneeze while sleeping? |
| 3996 | TooAfraidToAsk | I have a an android my volume is up but no sou... | Audio For Mobile Reddit App |
| 3997 | TooAfraidToAsk | \[removed\] | \*SERIOUS ANSWERS ONLY\* Is this a sign of menta... |
| 3998 | TooAfraidToAsk | So I read the news line about the U.S having a... | 2 trillion bailout? |


## Exploratory Data Analysis

In [862]:
data.info() #note, original pull of 1000 posts from each subreddit was expanded to 2000 posts due to high number of [deleted] and [removed] posts

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  4000 non-null   object
 1   selftext   3961 non-null   object
 2   title      4000 non-null   object
dtypes: object(3)
memory usage: 93.9+ KB


In [863]:
data['selftext'].isnull().sum()

39

In [864]:
data['selftext'].value_counts()


Based on the above, there are 39 null values in the 'selftext' field. Additionally, 1023 entries are blank, 671 have been removed and 43 were deleted. The null and blank values will be replaced by the title. Per Reddit, \[removed\] posts were removed by moderators or admins, while \[deleted\] posts were deleted by the user. Posts in both categories should therefore be dropped from our analysis, as they have either been flagged as inappropriate for the subreddit or withdrawn by the asker. The deleted posts may be added back to the dataset depending on the modeling results, time permitting.
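
The cleaning plan described above can be sketched on a toy frame. This is an illustration under stated assumptions and not the notebook's exact code: the rows are invented, and `fillna` plus `mask` stand in for the `replace` calls used in the cells that follow.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; the rows are invented for illustration
toy = pd.DataFrame({
    "selftext": [np.nan, "", "   ", "[removed]", "[deleted]", "Some body text"],
    "title": ["t0", "t1", "t2", "t3", "t4", "t5"],
})

# Treat null, empty and whitespace-only bodies as missing, then use the title
body = toy["selftext"].fillna("")
is_blank = body.str.strip().eq("")
toy["selftext"] = body.mask(is_blank, toy["title"])

# Drop posts removed by moderators or deleted by their authors
toy = toy[~toy["selftext"].isin(["[removed]", "[deleted]"])].reset_index(drop=True)
print(toy)
```

Handling whitespace-only bodies in the same step avoids the need for a separate pass to catch blank rows later.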

In [865]:
data['selftext'] = data['selftext'].fillna(data['title'])

In [866]:
data['selftext'].isnull().sum()

0

In [867]:
data['selftext'] = data['selftext'].mask(data['selftext'] == "", data['title'])


In [868]:
for i in range(len(data['selftext'])):
    if 'skeleton' in data.loc[i,'title']:
        print (i)

197


In [869]:
data.loc[197,'selftext'] #defective row, to be adjusted at end of cleaning

'##'

In [870]:
for i in range(len(data['selftext'])):
    if 'upvote counter' in data.loc[i,'title']:
        print (i)

3652
3655


In [871]:
data.loc[3652,'selftext'] #defective row, to be adjusted at end of cleaning

'.'

In [872]:
data.loc[3655,'selftext']

'[removed]'

In [873]:
data['selftext'] = [data.loc[i,'title'] if data.loc[i,'selftext'].isspace() else data.loc[i,'selftext'] for i in range(len(data['selftext']))] #Inserted as above code didn't catch all blank 'selftext' rows.

In [874]:
# Select duplicate rows, except the first occurrence, based on the 'title' column
duplicateRowsDF = data[data.duplicated(['title'])]
 
duplicateRowsDF

| | subreddit | selftext | title |
|---|---|---|---|
| 466 | NoStupidQuestions | I’m not good at electronic stuff so are all th... | Security camera question |
| 487 | NoStupidQuestions | \nI’m not good at electronic stuff so are all ... | Security camera question |
| 583 | NoStupidQuestions | I have family and friends who say "I want thre... | What is the point in having multiples of the s... |
| 584 | NoStupidQuestions | I have family and friends who say "I want thre... | What is the point in having multiples of the s... |
| 589 | NoStupidQuestions | So I just got a message from Reddit saying tha... | So I just got a message from Reddit saying tha... |
| 633 | NoStupidQuestions | I've heard a little bit about the transition o... | What is the reputation for Academy Schools in UK? |
| 733 | NoStupidQuestions | Msg vs molly | Who badder |
| 734 | NoStupidQuestions | Msg vs coke | Who badder |
| 735 | NoStupidQuestions | Msg vs thc | Who badder |
| 938 | NoStupidQuestions | Does my account need to be 30 days old? | Can you make your own subreddit on mobile? |


These rows repeat the title of an earlier post and will need to be removed.

In [875]:
duplicateRowsDF.shape

(53, 3)

In [876]:
duplicateRowsDF.index

Int64Index([ 466,  487,  583,  584,  589,  633,  733,  734,  735,  938, 1020,
            1028, 1237, 1347, 1426, 1633, 1678, 1716, 2015, 2041, 2066, 2165,
            2167, 2175, 2259, 2277, 2309, 2314, 2342, 2572, 2593, 2625, 2629,
            2672, 2679, 2680, 2684, 2685, 2725, 2737, 2765, 2766, 2928, 3183,
            3365, 3366, 3399, 3625, 3639, 3688, 3827, 3894, 3956],
           dtype='int64')

In [877]:
#test case prior to dupe drop - confirmed that a result remains after dupe drop completed
data[data['title'] == 'Security camera question']

| | subreddit | selftext | title |
|---|---|---|---|
| 385 | NoStupidQuestions | \nI’m not good at electronic stuff so are all ... | Security camera question |
| 466 | NoStupidQuestions | I’m not good at electronic stuff so are all th... | Security camera question |
| 487 | NoStupidQuestions | \nI’m not good at electronic stuff so are all ... | Security camera question |


In [878]:
data.drop(data.index[duplicateRowsDF.index], inplace=True)

In [879]:
data.shape

(3947, 3)
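
The duplicate removal above keys on the title alone, which also drops posts such as the 'Who badder' rows whose bodies differ. As a hedged sketch on an invented toy frame, `drop_duplicates` can express both that rule and a stricter one that requires the body to repeat as well. Which rule is appropriate is a judgement call that the notebook does not discuss.

```python
import pandas as pd

# Toy frame; the rows are simplified stand-ins for illustration
toy = pd.DataFrame({
    "title": ["Security camera question", "Who badder",
              "Security camera question", "Who badder"],
    "selftext": ["body a", "Msg vs molly", "body a", "Msg vs coke"],
})

# Keep the first row for each title, matching the notebook's rule
by_title = toy.drop_duplicates(subset="title", keep="first")

# Stricter alternative: drop a row only if title and body both repeat
by_title_and_body = toy.drop_duplicates(subset=["title", "selftext"], keep="first")

print(len(by_title), len(by_title_and_body))
```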

In [880]:
#Dropping removed and deleted posts
data = data[(data['selftext'] != '[removed]')]
data = data[(data['selftext'] != '[deleted]')]

In [881]:
data.shape

(3254, 3)

In [882]:
#Reset index
data = data.reset_index(drop=True)
data

| | subreddit | selftext | title |
|---|---|---|---|
| 0 | NoStupidQuestions | Like, that scene of Taken 3 of Liam Neeson jum... | In cinematography, why are cuts considered bad? |
| 1 | NoStupidQuestions | I need an advice, since I was grade 10 student... | This Decision will completely change my life |
| 2 | NoStupidQuestions | Every source I found says clinical depression ... | What gave some people the idea that clinical d... |
| 3 | NoStupidQuestions | My little brother (14) got an Xbox One and I g... | What exactly is the logic behind birthday pres... |
| 4 | NoStupidQuestions | Do most people go through a "phase" in their l... | Do most people go through a "phase" in their l... |
| ... | ... | ... | ... |
| 3249 | TooAfraidToAsk | Like in the UK you're allowed to go outside to... | Do people judge you for walking round the park? |
| 3250 | TooAfraidToAsk | I know that our body's become "paralyzed" to p... | Why don't we sneeze while sleeping? |
| 3251 | TooAfraidToAsk | I have a an android my volume is up but no sou... | Audio For Mobile Reddit App |
| 3252 | TooAfraidToAsk | So I read the news line about the U.S having a... | 2 trillion bailout? |


In [883]:
#Clean up selftext: replace non-letters with spaces, lowercase, and collapse repeated whitespace
for i in range(len(data['selftext'])):
    data.loc[i,'selftext'] = " ".join(re.sub("[^a-zA-Z]", " ", data.loc[i,'selftext']).lower().split())
    
    

In [884]:
data.loc[0,'selftext']

'like that scene of taken of liam neeson jumping the fence taking several cuts or the long shot in being considered a selling point i don t understand why either is considered good or bad'

In [885]:
#Clean up titles: replace non-letters with spaces, lowercase, and collapse repeated whitespace
for i in range(len(data['title'])):
    data.loc[i,'title'] = " ".join(re.sub("[^a-zA-Z]", " ", data.loc[i,'title']).lower().split())
    
    

In [886]:
data.loc[0,'title']

'in cinematography why are cuts considered bad'
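
The row-by-row loops above could also be written with pandas string methods, which apply the same three steps to a whole column at once. This is a minimal sketch; `clean_text` is a hypothetical helper name and is not part of the notebook.

```python
import pandas as pd

def clean_text(series):
    """Replace non-letters with spaces, lowercase, and collapse whitespace."""
    return (
        series.str.replace(r"[^a-zA-Z]", " ", regex=True)
        .str.lower()
        .str.split()      # splits on any run of whitespace
        .str.join(" ")    # rejoins the words with single spaces
    )

sample = pd.Series([
    "In cinematography, why are cuts considered bad?",
    "2 trillion bailout?",
])
print(clean_text(sample).tolist())
```

With this helper, each column could be cleaned with a single assignment such as `data['title'] = clean_text(data['title'])`.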

In [887]:
data['selftext'] = [data.loc[i,'title'] if data.loc[i,'selftext'] == "" else data.loc[i,'selftext'] for i in range(len(data['selftext']))] #Replace 'selftext' fields left empty after symbol removal with the title.

In [888]:
data['subreddit'] = [1 if i == 'TooAfraidToAsk' else 0 for i in data['subreddit']]

In [889]:
data

| | subreddit | selftext | title |
|---|---|---|---|
| 0 | 0 | like that scene of taken of liam neeson jumpin... | in cinematography why are cuts considered bad |
| 1 | 0 | i need an advice since i was grade student i w... | this decision will completely change my life |
| 2 | 0 | every source i found says clinical depression ... | what gave some people the idea that clinical d... |
| 3 | 0 | my little brother got an xbox one and i got a ... | what exactly is the logic behind birthday pres... |
| 4 | 0 | do most people go through a phase in their lif... | do most people go through a phase in their lif... |
| ... | ... | ... | ... |
| 3249 | 1 | like in the uk you re allowed to go outside to... | do people judge you for walking round the park |
| 3250 | 1 | i know that our body s become paralyzed to pre... | why don t we sneeze while sleeping |
| 3251 | 1 | i have a an android my volume is up but no sou... | audio for mobile reddit app |
| 3252 | 1 | so i read the news line about the u s having a... | trillion bailout |


In [890]:
#Export cleaned data to csv
data.to_csv('posts_clean.csv', sep = ',', index=False)

#### Confirm that rows whose selftext contained only symbols now hold the text from the title

In [891]:
data.loc[179,'selftext'] 

'when a person dies do the bones stick together and create a skeleton or do the bones separate from each other'

In [892]:
data.loc[2996,'selftext']

'why do some posts comments have the upvote counter hidden does it have to do with privacy'

In [893]:
data.loc[179,'title']

'when a person dies do the bones stick together and create a skeleton or do the bones separate from each other'

In [894]:
data.loc[2996,'title']

'why do some posts comments have the upvote counter hidden does it have to do with privacy'
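
The spot checks above look at two known rows. A broader check, sketched here with a hypothetical helper `find_empty_text` and an invented toy frame, would list every row where a text column is still empty or whitespace-only.

```python
import pandas as pd

def find_empty_text(frame, columns=("selftext", "title")):
    """Return the rows where any of the given text columns is empty or whitespace."""
    empty = pd.Series(False, index=frame.index)
    for col in columns:
        empty |= frame[col].fillna("").str.strip().eq("")
    return frame[empty]

# Toy frame; the second row has a whitespace-only body
toy = pd.DataFrame({
    "selftext": ["some text", " "],
    "title": ["a title", "another title"],
})
print(find_empty_text(toy).index.tolist())
```

Calling `find_empty_text(data)` on the cleaned frame should return no rows if every empty body was filled from its title.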