# Reddit OverWatch - Data collection using Reddit API and Data Cleaning

# Problem Statement

This project addresses the intentional or unintentional miscategorisation of Reddit posts: posts showing suicidal tendencies can go unnoticed when they are submitted under the **Depression** subreddit. It is a classification problem. A model is needed that predicts whether a post a user has written (or is writing) belongs to the **Depression** subreddit or the **SuicideWatch** subreddit. Accurately classifying the **SuicideWatch** category is important, but misclassification of **Depression** posts is also a real concern.  

**Links**  
[r/depression](https://www.reddit.com/r/depression/)  
[r/SuicideWatch](https://www.reddit.com/r/SuicideWatch/)

# Evaluation

Models are evaluated by their classification accuracy score:

$$\text{Classification Accuracy} = \frac{\text{True Positive} + \text{True Negative}}{\text{True Positive} + \text{True Negative} + \text{False Positive} + \text{False Negative}}$$
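
As a small worked example of the formula above, the sketch below computes accuracy from the four confusion-matrix counts. The function name `classification_accuracy` and the counts are hypothetical, chosen only for illustration; they are not results from this project.

```python
def classification_accuracy(tp, tn, fp, fn):
    """Return (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    return (tp + tn) / total


# Hypothetical counts for illustration only (not results from this project).
example_accuracy = classification_accuracy(tp=40, tn=45, fp=5, fn=10)
print(example_accuracy)  # 0.85
```

Because the two classes are balanced later in this notebook, accuracy is not dominated by one class, which is one reason it can serve as a headline metric here.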

# Scope

- Using Reddit's API, collect posts from two subreddits of your choosing.
- Use NLP to train a classifier on which subreddit a given post came from. 
- This is a binary classification problem.

## Requirements
- Gather and prepare your data using the requests library.
- Create and compare two models.
- One of these must be a Bayes classifier; the other can be a classifier of your choosing: logistic regression, KNN, SVM, etc.
- A Jupyter Notebook with your analysis for a peer audience of data scientists.
- An executive summary of the results you found.
- A short presentation outlining your process and findings for a semi-technical audience.

## Project Deliverables

- A README.md (that isn't this file)
- Jupyter notebook(s) with your analysis and models
- Data files
- Presentation slides
- Any other necessary files (images, etc.)

## Data Dictionary

|Dataframes:| comp1_df, comp2_df, final_df|
|---|---|

|Feature|Type|Description|
|---|---|---|
|name|object|post index|
|author|object|post author id|
|title|object|title of the post|
|selftext|object|actual body of the post|
|subreddit|int64|parent subreddit the post belongs to (0 = depression, 1 = SuicideWatch)|


## Import libraries

In [1]:
import requests
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import time
import regex as re
from nltk.corpus import stopwords
from bs4 import BeautifulSoup
%matplotlib inline

## Data collection using Reddit API

In [2]:
#Function to request data from the Reddit API
def Get_posts(url):
    posts=[]
    headers={'User-agent':'GA SG DSI10'}
    after=None #no 'after' cursor for the first request

    for i in range(40): #loop 40 times to receive 25 posts in every loop
        if i % 10 == 0: 
            print('Starting loop ',i+1) #print status on every 10th loop
        if after is None: #no params for the first request
            param={} 
        else:
            param={'after':after} #pass the 'after' cursor from the 2nd request onwards
        
        res = requests.get(url,params=param,headers=headers)
        if res.status_code==200: #status 200 is OK, proceed
            raw=res.json()
            posts.extend(raw['data']['children']) #collect the posts on this page
            after=raw['data']['after'] #cursor for the next page
        else:
            print('Error code ',res.status_code)
            break
        time.sleep(1) #seconds to sleep between requests
    print('Done')
    print('Number of posts received: ',len(posts)) #print length of posts
    return posts
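
The `after` cursor logic can be exercised without touching the network. The following is a minimal sketch, assuming a hypothetical `fetch_page` callable that stands in for `requests.get(...).json()`; `collect_posts`, `fake_fetch` and `FAKE_PAGES` are illustrative names, not part of the Reddit API. It also assumes the listing reports `after` as `None` once there are no more pages.

```python
def collect_posts(fetch_page, max_pages=40):
    """Follow the 'after' cursor until max_pages is reached or the listing ends."""
    posts = []
    after = None
    for _ in range(max_pages):
        params = {} if after is None else {'after': after}
        listing = fetch_page(params)
        posts.extend(listing['data']['children'])
        after = listing['data']['after']
        if after is None:  # assumed end-of-listing signal: no further pages
            break
    return posts


# Hypothetical stand-in for the API: three pages keyed by cursor.
FAKE_PAGES = {
    None: {'data': {'children': [{'data': {'name': 't3_a'}}], 'after': 't3_a'}},
    't3_a': {'data': {'children': [{'data': {'name': 't3_b'}}], 'after': 't3_b'}},
    't3_b': {'data': {'children': [{'data': {'name': 't3_c'}}], 'after': None}},
}


def fake_fetch(params):
    """Return the fake page matching the requested cursor."""
    return FAKE_PAGES[params.get('after')]


names = [p['data']['name'] for p in collect_posts(fake_fetch)]
print(names)  # ['t3_a', 't3_b', 't3_c']
```

Unlike `Get_posts`, this sketch stops as soon as the cursor comes back empty, which avoids requesting the first page again when a subreddit has fewer pages than the loop count.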
    

### Get posts from Depression subreddit


In [3]:
comp1=Get_posts('https://www.reddit.com/r/depression/new.json')

Starting loop  1
Starting loop  11
Starting loop  21
Starting loop  31
Done
Number of posts received:  1000


### Get posts from SuicideWatch subreddit


In [4]:
comp2=Get_posts('https://www.reddit.com/r/SuicideWatch/new.json')

Starting loop  1
Starting loop  11
Starting loop  21
Starting loop  31
Done
Number of posts received:  1000


In [5]:
#Check class
print('Depression type:',type(comp1))
print('SuicideWatch type:',type(comp2))

Depression type: <class 'list'>
SuicideWatch type: <class 'list'>


In [6]:
#check 1st entry
comp1[0]['data'].keys() 

dict_keys(['approved_at_utc', 'subreddit', 'selftext', 'author_fullname', 'saved', 'mod_reason_title', 'gilded', 'clicked', 'title', 'link_flair_richtext', 'subreddit_name_prefixed', 'hidden', 'pwls', 'link_flair_css_class', 'downs', 'hide_score', 'name', 'quarantine', 'link_flair_text_color', 'author_flair_background_color', 'subreddit_type', 'ups', 'total_awards_received', 'media_embed', 'author_flair_template_id', 'is_original_content', 'user_reports', 'secure_media', 'is_reddit_media_domain', 'is_meta', 'category', 'secure_media_embed', 'link_flair_text', 'can_mod_post', 'score', 'approved_by', 'thumbnail', 'edited', 'author_flair_css_class', 'steward_reports', 'author_flair_richtext', 'gildings', 'content_categories', 'is_self', 'mod_note', 'created', 'link_flair_type', 'wls', 'banned_by', 'author_flair_type', 'domain', 'allow_live_comments', 'selftext_html', 'likes', 'suggested_sort', 'banned_at_utc', 'view_count', 'archived', 'no_follow', 'is_crosspostable', 'pinned', 'over_18',

In [7]:
#check 1st entry
comp2[0]['data'].keys()

dict_keys(['approved_at_utc', 'subreddit', 'selftext', 'author_fullname', 'saved', 'mod_reason_title', 'gilded', 'clicked', 'title', 'link_flair_richtext', 'subreddit_name_prefixed', 'hidden', 'pwls', 'link_flair_css_class', 'downs', 'hide_score', 'name', 'quarantine', 'link_flair_text_color', 'author_flair_background_color', 'subreddit_type', 'ups', 'total_awards_received', 'media_embed', 'author_flair_template_id', 'is_original_content', 'user_reports', 'secure_media', 'is_reddit_media_domain', 'is_meta', 'category', 'secure_media_embed', 'link_flair_text', 'can_mod_post', 'score', 'approved_by', 'thumbnail', 'edited', 'author_flair_css_class', 'steward_reports', 'author_flair_richtext', 'gildings', 'content_categories', 'is_self', 'mod_note', 'created', 'link_flair_type', 'wls', 'banned_by', 'author_flair_type', 'domain', 'allow_live_comments', 'selftext_html', 'likes', 'suggested_sort', 'banned_at_utc', 'view_count', 'archived', 'no_follow', 'is_crosspostable', 'pinned', 'over_18',

In [8]:
#Convert Depression posts to data frame
comp1_df = pd.DataFrame([i['data'] for i in comp1]) #convert to dataframe

In [9]:
#Convert SuicideWatch posts to data frame
comp2_df = pd.DataFrame([x['data'] for x in comp2]) #convert to dataframe

In [10]:
#inspect df
comp1_df.head()

Unnamed: 0,all_awardings,allow_live_comments,approved_at_utc,approved_by,archived,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,...,thumbnail,title,total_awards_received,ups,url,user_reports,view_count,visited,whitelist_status,wls
0,[],False,,,False,Kostas235,,,[],,...,,Why am I like this,0,2,https://www.reddit.com/r/depression/comments/d...,[],,False,no_ads,0.0
1,[],False,,,False,Amxela,,,[],,...,,I’ve started writing down my depressed thought...,0,3,https://www.reddit.com/r/depression/comments/d...,[],,False,no_ads,0.0
2,[],False,,,False,cwatson1060,,,[],,...,,How do I know I was truly laughing?,0,2,https://www.reddit.com/r/depression/comments/d...,[],,False,no_ads,0.0
3,[],False,,,False,Gregorvitch,,,[],,...,,A rant about Imposter syndrome.,0,2,https://www.reddit.com/r/depression/comments/d...,[],,False,no_ads,0.0
4,[],False,,,False,Asmodeuss1990,,,[],,...,,Its 3 am and I can't fucking sleep.,0,3,https://www.reddit.com/r/depression/comments/d...,[],,False,no_ads,0.0


In [11]:
#inspect df
comp2_df.head()

Unnamed: 0,all_awardings,allow_live_comments,approved_at_utc,approved_by,archived,author,author_cakeday,author_flair_background_color,author_flair_css_class,author_flair_richtext,...,thumbnail,title,total_awards_received,ups,url,user_reports,view_count,visited,whitelist_status,wls
0,[],False,,,False,akskdqieirkff,,,modmsg,[],...,,want to die for no reason,0,1,https://www.reddit.com/r/SuicideWatch/comments...,[],,False,no_ads,0
1,[],False,,,False,B055ANOVA,,,modmsg,[],...,,Dogs are great,0,3,https://www.reddit.com/r/SuicideWatch/comments...,[],,False,no_ads,0
2,[],False,,,False,kirelred,,,,[],...,,I just need things to stop.,0,1,https://www.reddit.com/r/SuicideWatch/comments...,[],,False,no_ads,0
3,[],False,,,False,Shamad00,,,,[],...,,it's just difficult,0,1,https://www.reddit.com/r/SuicideWatch/comments...,[],,False,no_ads,0
4,[],False,,,False,JMovilla,,,modmsg,[],...,,Guidance,0,2,https://www.reddit.com/r/SuicideWatch/comments...,[],,False,no_ads,0


#### Backup posts to CSV files

In [12]:
comp1_df.to_csv('depression_posts.csv',index=False)
comp2_df.to_csv('suicidewatch_posts.csv',index=False)

## Data Cleaning

#### Drop duplicate posts based on the body of the post ('selftext')

In [13]:
#check for duplicate posts and drop them
comp1_df = comp1_df.drop_duplicates(subset='selftext', keep='first')

In [14]:
#check remaining number of posts for Depression
comp1_df.shape

(995, 97)

In [15]:
#check for duplicate posts and drop them
comp2_df = comp2_df.drop_duplicates(subset='selftext', keep='first')

In [16]:
#check remaining number of posts for SuicideWatch
comp2_df.shape

(961, 98)

### Selecting only the features of interest


In [17]:
comp1_df.columns

Index(['all_awardings', 'allow_live_comments', 'approved_at_utc',
       'approved_by', 'archived', 'author', 'author_flair_background_color',
       'author_flair_css_class', 'author_flair_richtext',
       'author_flair_template_id', 'author_flair_text',
       'author_flair_text_color', 'author_flair_type', 'author_fullname',
       'author_patreon_flair', 'awarders', 'banned_at_utc', 'banned_by',
       'can_gild', 'can_mod_post', 'category', 'clicked', 'content_categories',
       'contest_mode', 'created', 'created_utc', 'discussion_type',
       'distinguished', 'domain', 'downs', 'edited', 'gilded', 'gildings',
       'hidden', 'hide_score', 'id', 'is_crosspostable', 'is_meta',
       'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable',
       'is_self', 'is_video', 'likes', 'link_flair_background_color',
       'link_flair_css_class', 'link_flair_richtext', 'link_flair_text',
       'link_flair_text_color', 'link_flair_type', 'locked', 'media',
       'media_

Only 'title', 'selftext' and 'subreddit' are of interest as features, but 'name' and 'author' are also selected for traceability.

**'name'**- post index

**'author'**- post author id 

**'title'**- title of the post

**'selftext'**- actual body of the post

**'subreddit'**- parent subreddit the post belongs to

In [18]:
comp1_df=comp1_df[['name','author','title','selftext','subreddit']] #select only 'name','author','title','selftext','subreddit' columns

In [19]:
comp2_df=comp2_df[['name','author','title','selftext','subreddit']] #select only 'name','author','title','selftext','subreddit' columns

In [20]:
comp1_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 995 entries, 0 to 999
Data columns (total 5 columns):
name         995 non-null object
author       995 non-null object
title        995 non-null object
selftext     995 non-null object
subreddit    995 non-null object
dtypes: object(5)
memory usage: 46.6+ KB


In [21]:
comp2_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 961 entries, 0 to 999
Data columns (total 5 columns):
name         961 non-null object
author       961 non-null object
title        961 non-null object
selftext     961 non-null object
subreddit    961 non-null object
dtypes: object(5)
memory usage: 45.0+ KB


#### Check for null values

In [23]:
#check for null values
comp1_df.isnull().sum() 

name         0
author       0
title        0
selftext     0
subreddit    0
dtype: int64

In [24]:
#check for null values
comp2_df.isnull().sum()

name         0
author       0
title        0
selftext     0
subreddit    0
dtype: int64

No null values found in any data frames

#### Check for blank string entries

In [25]:
#Function to check blank entries
def Chk_blanks(df):
    print((df[df.columns] == '').sum())

In [26]:
Chk_blanks(comp1_df)

name         0
author       0
title        0
selftext     1
subreddit    0
dtype: int64


Only one blank string entry was found, probably an image post. The row is dropped since there is only one.

In [27]:
Chk_blanks(comp2_df)

name         0
author       0
title        0
selftext     1
subreddit    0
dtype: int64


Only one blank string entry was found, probably an image post. The row is dropped since there is only one.

In [28]:
#replace blank rows with NaN and remove
comp1_df = comp1_df.replace({'selftext': {'': np.nan}}) #replace blanks in 'selftext' with NaN
comp1_df = comp1_df.dropna() #Drop rows containing NaN
Chk_blanks(comp1_df)

name         0
author       0
title        0
selftext     0
subreddit    0
dtype: int64


In [29]:
#replace blank rows with NaN and remove
comp2_df = comp2_df.replace({'selftext': {'': np.nan}}) #replace blanks in 'selftext' with NaN
comp2_df = comp2_df.dropna() #Drop rows containing NaN
Chk_blanks(comp2_df)

name         0
author       0
title        0
selftext     0
subreddit    0
dtype: int64
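
The blank-entry removal can also be expressed as a single boolean filter. This is a minimal sketch on a hypothetical toy frame (`toy_df` is not part of the project data); it additionally treats whitespace-only bodies as blank, which is an assumption beyond what the cells above check.

```python
import pandas as pd

# Hypothetical toy data standing in for comp1_df / comp2_df.
toy_df = pd.DataFrame({
    'title': ['first', 'second', 'third'],
    'selftext': ['some text', '', 'more text'],
})

# Keep only rows whose body is a non-empty string after trimming whitespace.
non_blank = toy_df[toy_df['selftext'].str.strip() != ''].copy()
print(len(non_blank))  # 2
```

Filtering and reassigning, rather than modifying a selected column in place, sidesteps pandas' chained-assignment warnings on frames that were themselves created by column selection.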


#### To Balance the categories

To balance the number of observations for the Depression and SuicideWatch subreddits, both categories are kept at the same number of rows.

In [30]:
#check shape
comp1_df.shape

(994, 5)

In [31]:
#check shape
comp2_df.shape

(960, 5)

In [32]:
#Remove rows randomly from the Depression dataframe to match SuicideWatch
rdiff=len(comp1_df)-len(comp2_df)
print('Rows difference between Depression and SuicideWatch dataframes: ',rdiff)
print('Depression data frame shape:' ,comp1_df.shape)
print('SuicideWatch data frame shape: ' ,comp2_df.shape)
if rdiff > 0:
    drop_idx=np.random.choice(comp1_df.index,size=rdiff,replace=False) #pick random existing index labels
    comp1_df=comp1_df.drop(drop_idx) #remove the selected rows

print('Depression data frame shape after dropped rows:' ,comp1_df.shape)
print('SuicideWatch data frame shape: ' ,comp2_df.shape)

Rows difference between Depression and SuicideWatch dataframes:  34
Depression data frame shape: (994, 5)
SuicideWatch data frame shape:  (960, 5)
Depression data frame shape after dropped rows: (960, 5)
SuicideWatch data frame shape:  (960, 5)


Since the SuicideWatch row count is lower than the Depression row count, random posts were removed from Depression to match the number of SuicideWatch rows.
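
The same balancing step can be written with `DataFrame.sample`, which draws rows at random without replacement by default. This is a minimal sketch under stated assumptions: `downsample_to_match`, the toy frames and the seed value are hypothetical and only illustrate the idea.

```python
import pandas as pd


def downsample_to_match(larger_df, smaller_df, seed=42):
    """Randomly keep as many rows of larger_df as smaller_df has."""
    return larger_df.sample(n=len(smaller_df), random_state=seed)


# Hypothetical toy frames standing in for the two subreddit dataframes.
toy_large = pd.DataFrame({'selftext': list('abcdef')})
toy_small = pd.DataFrame({'selftext': list('xyz')})

balanced = downsample_to_match(toy_large, toy_small)
print(balanced.shape)  # (3, 1)
```

Fixing `random_state` makes the retained rows reproducible between notebook runs, which helps when results are compared across sessions.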

In [33]:
#Convert all string values in 'title' and 'selftext' to lower case
comp1_df['title']=comp1_df['title'].str.lower()
comp1_df['selftext']=comp1_df['selftext'].str.lower()
comp2_df['title']=comp2_df['title'].str.lower()
comp2_df['selftext']=comp2_df['selftext'].str.lower()

In [34]:
#check dataframe
comp1_df.head()

Unnamed: 0,name,author,title,selftext,subreddit
0,t3_dlvltz,Kostas235,why am i like this,"my classmates don’t talk to me, they think i a...",depression
1,t3_dlvka1,Amxela,i’ve started writing down my depressed thought...,depression\n\n10/19-10/20 1:22am. \ni’m sittin...,depression
2,t3_dlvjgc,cwatson1060,how do i know i was truly laughing?,"i'm confused, i laugh in some classes in schoo...",depression
3,t3_dlvj4a,Gregorvitch,a rant about imposter syndrome.,imposter syndrome has hit me pretty hard latel...,depression
4,t3_dlvihy,Asmodeuss1990,its 3 am and i can't fucking sleep.,so all i can do is just listen to depressing m...,depression


In [35]:
#check dataframe
comp2_df.head()

Unnamed: 0,name,author,title,selftext,subreddit
0,t3_dlvmb0,akskdqieirkff,want to die for no reason,"i’ve always been suicidal, but i’m doing alrig...",SuicideWatch
1,t3_dlvihd,B055ANOVA,dogs are great,just tonight when i was fed up with everything...,SuicideWatch
2,t3_dlvh35,kirelred,i just need things to stop.,i struggle with my mask on a daily basis. my...,SuicideWatch
3,t3_dlvfwe,Shamad00,it's just difficult,it's getting difficult to even do the things i...,SuicideWatch
4,t3_dlvatq,JMovilla,guidance,i wanna end it quick where no one will ever fi...,SuicideWatch


In [36]:
#Replace every non-letter character (digits, punctuation, symbols) with a space using regex
comp1_df['title'] = comp1_df['title'].str.replace("[^a-zA-Z]", " ", regex=True)

In [37]:
#Replace every non-letter character (digits, punctuation, symbols) with a space using regex
comp2_df['title'] = comp2_df['title'].str.replace("[^a-zA-Z]", " ", regex=True)

In [38]:
#Replace every non-letter character (digits, punctuation, symbols) with a space using regex
comp1_df['selftext'] = comp1_df['selftext'].str.replace("[^a-zA-Z]", " ", regex=True)

In [39]:
#Replace every non-letter character (digits, punctuation, symbols) with a space using regex
comp2_df['selftext'] = comp2_df['selftext'].str.replace("[^a-zA-Z]", " ", regex=True)

In [40]:
#check dataframe
comp1_df.head()

Unnamed: 0,name,author,title,selftext,subreddit
0,t3_dlvltz,Kostas235,why am i like this,my classmates don t talk to me they think i a...,depression
1,t3_dlvka1,Amxela,i ve started writing down my depressed thought...,depression am i m sitting h...,depression
2,t3_dlvjgc,cwatson1060,how do i know i was truly laughing,i m confused i laugh in some classes in schoo...,depression
3,t3_dlvj4a,Gregorvitch,a rant about imposter syndrome,imposter syndrome has hit me pretty hard latel...,depression
4,t3_dlvihy,Asmodeuss1990,its am and i can t fucking sleep,so all i can do is just listen to depressing m...,depression


In [41]:
comp2_df.head()

Unnamed: 0,name,author,title,selftext,subreddit
0,t3_dlvmb0,akskdqieirkff,want to die for no reason,i ve always been suicidal but i m doing alrig...,SuicideWatch
1,t3_dlvihd,B055ANOVA,dogs are great,just tonight when i was fed up with everything...,SuicideWatch
2,t3_dlvh35,kirelred,i just need things to stop,i struggle with my mask on a daily basis my...,SuicideWatch
3,t3_dlvfwe,Shamad00,it s just difficult,it s getting difficult to even do the things i...,SuicideWatch
4,t3_dlvatq,JMovilla,guidance,i wanna end it quick where no one will ever fi...,SuicideWatch
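
To make the effect of the `[^a-zA-Z]` pattern concrete: every character that is not a letter (digits, punctuation, apostrophes, newlines) becomes a space, which is why contractions such as "can't" appear as "can t" in the outputs above. The sketch below uses the standard-library `re` module and a hypothetical helper name, `letters_only`, on a made-up sentence.

```python
import re


def letters_only(text):
    """Replace every character outside a-z / A-Z with a single space."""
    return re.sub(r"[^a-zA-Z]", " ", text)


sample = "i’m 3 am & can't sleep."
cleaned = letters_only(sample)

# Collapse the runs of spaces left behind by the substitution.
print(" ".join(cleaned.split()))  # i m am can t sleep
```

The substitution leaves the string length unchanged, so runs of spaces remain where several non-letters were adjacent; most tokenizers ignore the extra whitespace.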


#### Backup individual cleaned posts to CSV files

In [42]:
#cleaned dataframes to csv
comp1_df.to_csv('depressed_clean.csv',index=False)
comp2_df.to_csv('suicidewatch_clean.csv',index=False)

#### Load saved cleaned data from csv and check for null values

In [43]:
#load cleaned csv and check for null values
dftemp=pd.read_csv('depressed_clean.csv') #file written in the backup step above
dftemp.isnull().sum()

name         0
author       0
title        0
selftext     0
subreddit    0
dtype: int64

In [44]:
#load cleaned csv and check for null values
dftemp=pd.read_csv('suicidewatch_clean.csv') #file written in the backup step above
dftemp.isnull().sum()

name         0
author       0
title        0
selftext     0
subreddit    0
dtype: int64

No null values found in any of the posts

#### Concatenate cleaned data to final dataframe

In [45]:
# Concat cleaned Depressed and SuicideWatch posts to final dataframe
final_df=pd.concat([comp1_df,comp2_df])

In [46]:
#check shape
final_df.shape

(1920, 5)

In [47]:
#check columns
final_df.columns

Index(['name', 'author', 'title', 'selftext', 'subreddit'], dtype='object')

In [48]:
#shuffle posts to randomise the order of the categories
from sklearn.utils import shuffle
final_df = shuffle(final_df)

#### Map and replace Depression category to 0 and SuicideWatch to 1

In [49]:
#convert categories 'SuicideWatch' =1 and 'depression'=0
final_df['subreddit'] = final_df['subreddit'].map(dict(SuicideWatch=1, depression=0))
final_df['subreddit'].value_counts()

1    960
0    960
Name: subreddit, dtype: int64
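
One thing worth checking with `.map()` is that it returns NaN for any value missing from the dictionary, so a misspelt or unexpected subreddit name would silently become a missing label. The following is a small sketch on a hypothetical toy frame; the `LABELS` constant and `toy` are illustrative names.

```python
import pandas as pd

# Same encoding as used in this notebook: depression -> 0, SuicideWatch -> 1.
LABELS = {'depression': 0, 'SuicideWatch': 1}

# Hypothetical toy data standing in for final_df.
toy = pd.DataFrame({'subreddit': ['depression', 'SuicideWatch', 'depression']})
encoded = toy['subreddit'].map(LABELS)

# .map() yields NaN for unmapped values, so verify that none slipped through.
if encoded.isna().any():
    raise ValueError("unmapped subreddit label found")

print(encoded.astype(int).tolist())  # [0, 1, 0]
```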

In [52]:
#Check types
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1920 entries, 935 to 932
Data columns (total 5 columns):
name         1920 non-null object
author       1920 non-null object
title        1920 non-null object
selftext     1920 non-null object
subreddit    1920 non-null int64
dtypes: int64(1), object(4)
memory usage: 90.0+ KB


In [50]:
#Check dataframe
final_df.head()

Unnamed: 0,name,author,title,selftext,subreddit
935,t3_dl75tg,Yeledkafot,miserable with my life losing my mind,my mom got diagnosed with bipolar disease and ...,0
324,t3_dlompu,lavender-slushie,fucccck you x vent,every single fucking time i try to talk to my ...,0
843,t3_dkfrvy,TheREAL_VeraPeterson,i m so scared,i don t think i can keep going i ll begging f...,1
103,t3_dlqv8z,nearlytherezopiclone,conversation away from organising and actionin...,year old male major depressive since y...,1
728,t3_dlbxd2,flyawaythrowaway1234,self harm tw,i have razor blades and i use them to slice my...,0


#### Save final cleaned data to CSV file

In [51]:
#final dataframes to csv
#commented out to preserve the report in the next notebook.
#final_df.to_csv('final_table.csv',index=False)

The rest of the project continues in the `2 Model Building - Reddit OverWatch` file.