<img src="../assets/a_eyes_readme.gif" style="float:right ; margin: 10px ; width:300px;"> 


<h1><left>Prediction and Analysis of Degree of Suicidal
Ideation in Online Content</left></h1>
<h4><left>by- Rajswi Lochan(Roll no-1806095)</left></h4>
<h4><left> Prince Sinha(Roll no-1806132)</left></h4>
<h4><left> Parth Sharma(Roll no-1806172)</left></h4>

___

## 1. Data Collection
- For this project, we will use Reddit's API to collect posts from two subreddits: "r/depression" and "r/SuicideWatch".
- We aim to automate as much of this process as possible in neat functions so that the data collection is repeatable.
- When collecting data, we will add a randomized delay between requests out of consideration for Reddit's servers and security staff.
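
The randomized delay mentioned in the last point can be sketched as a small helper. This is only an illustration: the name `polite_pause` and its parameters are hypothetical, and the 1 to 6 second range is borrowed from the scraping function defined later in this notebook. The `sleep` argument is injectable so the helper can be tried out without actually waiting.

```python
import random
import time


def polite_pause(min_seconds=1, max_seconds=6, sleep=time.sleep):
    """Sleep for a random whole number of seconds and return the delay used.

    `sleep` is injectable so the helper can be exercised without waiting.
    """
    delay = random.randint(min_seconds, max_seconds)  # inclusive on both ends
    sleep(delay)
    return delay


# Example: record the delays instead of actually sleeping.
recorded = []
delays = [polite_pause(sleep=recorded.append) for _ in range(5)]
```

In a real scrape, `polite_pause()` would simply be called once after each request.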

> Note: The data in this notebook was collected on 10th October 2020. Running the code again on another day will scrape a different set of posts.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import requests
import time
import pandas as pd
from random import randint

### 1.1 Exploring the JSON structure of the r/depression subreddit page

In [None]:
#WE WILL SCRAPE THE r/depression AND r/SuicideWatch SUBREDDITS
#LET'S START BY EXPLORING THE JSON INNARDS OF THE FORMER
url_1 = "https://www.reddit.com/r/depression.json"

In [None]:
#DEFINING A USER AGENT AND MAKING SURE STATUS IS GOOD TO GO
headers = {"User-agent" : "Sam He"}
res = requests.get(url_1, headers=headers)
res.status_code

200

In [None]:
#PEEKING AT WHAT OUR DATA WILL LOOK LIKE
depress_json = res.json()
depress_json

{'data': {'after': 't3_jrxz5c',
  'before': None,
  'children': [{'data': {'all_awardings': [{'award_sub_type': 'GROUP',
       'award_type': 'global',
       'awardings_required_to_grant_benefits': 3,
       'coin_price': 300,
       'coin_reward': 250,
       'count': 1,
       'days_of_drip_extension': 0,
       'days_of_premium': 0,
       'description': 'THIS right here! Join together to give multiple This awards and see the award evolve in its display and shower benefits for the recipient. For every 3 This awards given to a post or comment, the author will get 250 coins.',
       'end_date': None,
       'giver_coin_reward': None,
       'icon_format': None,
       'icon_height': 2048,
       'icon_url': 'https://i.redd.it/award_images/t5_22cerq/vu6om0xnb7e41_This.png',
       'icon_width': 2048,
       'id': 'award_68ba1ee3-9baf-4252-be52-b808c1e8bdc4',
       'is_enabled': True,
       'is_new': False,
       'name': 'This',
       'penny_donate': None,
       'penny_price': No

In [None]:
#THE REDDIT DATA SEEMS TO BE ORGANISED AS A DICTIONARY
#LET'S GET ITS KEYS
sorted(depress_json["data"].keys())

['after', 'before', 'children', 'dist', 'modhash']

In [None]:
#WITH SOME HELP FROM THIS YOUTUBE TUTORIAL: https://www.youtube.com/watch?v=5Y3ZE26Ciuk
#WE FIND OUT THAT THE after KEY HOLDS THE NAME OF THE LAST POST ON THE CURRENT PAGE.
#PASSING IT AS THE after QUERY PARAMETER IN OUR URL REQUESTS THE NEXT 25 POSTS THAT FOLLOW IT

depress_json["data"]["after"]

't3_jrxz5c'
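
To make the role of the `after` code concrete, here is a minimal sketch of how it ends up in the request URL. The helper name `next_page_url` is hypothetical and is not used elsewhere in this notebook; `requests` builds the same kind of query string when the code is passed through its `params` argument.

```python
from urllib.parse import urlencode


def next_page_url(base_url, after=None):
    """Return the listing URL for the page that follows the `after` code."""
    if after is None:
        return base_url  # first page: no query string needed
    return base_url + "?" + urlencode({"after": after})


first = next_page_url("https://www.reddit.com/r/depression.json")
second = next_page_url("https://www.reddit.com/r/depression.json", "t3_jrxz5c")
print(second)
```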

In [None]:
#DOUBLE CONFIRMING THAT THE PREVIOUS AFTER KEY IS REALLY THE LAST ITEM ON OUR PAGE
[post["data"]["name"] for post in depress_json["data"]["children"]]

['t3_doqwow',
 't3_iq10oq',
 't3_jrla14',
 't3_jrtbuz',
 't3_jrk7w9',
 't3_jrpu3j',
 't3_jr9klb',
 't3_jrx773',
 't3_jrmqkj',
 't3_jrwlk0',
 't3_jrpvjj',
 't3_jroq0h',
 't3_jrlsor',
 't3_jrvxck',
 't3_jrufwm',
 't3_jrnpc2',
 't3_jrl9tf',
 't3_jrmnu1',
 't3_jrois1',
 't3_jrwyau',
 't3_jrv6qs',
 't3_jrurvx',
 't3_jrwcrr',
 't3_jrx6j0',
 't3_jrwof9',
 't3_jri35b',
 't3_jrxz5c']

In [None]:
#CHECKING OUT THE NUMBER OF POSTS IN ONE PAGE
len(depress_json["data"]["children"])

27
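
A page size of 27 rather than the 25 mentioned earlier is probably explained by pinned posts: in the DataFrame preview later in this notebook, the first two r/depression posts have `stickied` set to `True`. The sketch below separates pinned posts from regular ones. It runs on synthetic data shaped like the listing, and the helper name `split_stickied` is an assumption made for illustration.

```python
def split_stickied(children):
    """Separate pinned (stickied) posts from regular ones in a listing page."""
    pinned = [c for c in children if c["data"].get("stickied")]
    regular = [c for c in children if not c["data"].get("stickied")]
    return pinned, regular


# Synthetic page shaped like Reddit's listing: 2 pinned posts + 25 regular ones.
fake_children = (
    [{"kind": "t3", "data": {"name": f"t3_pin{i}", "stickied": True}} for i in range(2)]
    + [{"kind": "t3", "data": {"name": f"t3_post{i}", "stickied": False}} for i in range(25)]
)
pinned, regular = split_stickied(fake_children)
print(len(fake_children), len(pinned), len(regular))
```

On the real data, the same call would be `split_stickied(depress_json["data"]["children"])`.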

In [None]:
# OH, WE CAN DATAFRAME IT. 
pd.DataFrame(depress_json["data"]["children"])

Unnamed: 0,kind,data
0,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."
1,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."
2,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."
3,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."
4,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."
5,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."
6,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."
7,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."
8,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."
9,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."


In [None]:
#LOOKS LIKE THIS DATA IS REALLY WHAT WE ARE LOOKING FOR
depress_json["data"]["children"][0]["data"]

{'all_awardings': [{'award_sub_type': 'GROUP',
   'award_type': 'global',
   'awardings_required_to_grant_benefits': 3,
   'coin_price': 300,
   'coin_reward': 250,
   'count': 1,
   'days_of_drip_extension': 0,
   'days_of_premium': 0,
   'description': 'THIS right here! Join together to give multiple This awards and see the award evolve in its display and shower benefits for the recipient. For every 3 This awards given to a post or comment, the author will get 250 coins.',
   'end_date': None,
   'giver_coin_reward': None,
   'icon_format': None,
   'icon_height': 2048,
   'icon_url': 'https://i.redd.it/award_images/t5_22cerq/vu6om0xnb7e41_This.png',
   'icon_width': 2048,
   'id': 'award_68ba1ee3-9baf-4252-be52-b808c1e8bdc4',
   'is_enabled': True,
   'is_new': False,
   'name': 'This',
   'penny_donate': None,
   'penny_price': None,
   'resized_icons': [{'height': 16,
     'url': 'https://preview.redd.it/award_images/t5_22cerq/vu6om0xnb7e41_This.png?width=16&amp;height=16&amp;auto

### 1.2 Creating functions to automate the Data Collection process 
- We will first run those functions on r/depression and check if they have worked well.

In [None]:
# NOW WE CAN DEFINE A FUNCTION TO SCRAPE A REDDIT PAGE

def reddit_scrape(url_string, number_of_scrapes, output_list):
    #SCRAPED POSTS WILL BE ADDED TO output_list (SHOULD BE EMPTY)
    #THE FUNCTION RELIES ON THE headers DICTIONARY DEFINED EARLIER
    #after STAYS None FOR THE FIRST REQUEST, WHICH FETCHES THE FIRST PAGE
    after = None
    for batch in range(number_of_scrapes):
        if batch == 0:
            print("SCRAPING {}\n--------------------------------------------------".format(url_string))
            print("<<<SCRAPING COMMENCED>>>")
            print("Downloading Batch {} of {}...".format(1, number_of_scrapes))
        elif (batch + 1) % 5 == 0:
            print("Downloading Batch {} of {}...".format(batch + 1, number_of_scrapes))

        #THIS WILL TELL THE SCRAPER TO GET THE NEXT SET AFTER REDDIT'S after CODE
        params = {} if after is None else {"after": after}
        res = requests.get(url_string, params=params, headers=headers)
        if res.status_code == 200:
            the_json = res.json()
            output_list.extend(the_json["data"]["children"])
            after = the_json["data"]["after"]
        else:
            #STOP ON ANY NON-200 RESPONSE AND SHOW THE STATUS CODE
            print(res.status_code)
            break
        #RANDOMIZED DELAY BETWEEN REQUESTS
        time.sleep(randint(1, 6))

    print("<<<SCRAPING COMPLETED>>>")
    print("Number of posts downloaded: {}".format(len(output_list)))
    print("Number of unique posts: {}".format(len({p["data"]["name"] for p in output_list})))
 

In [None]:
#CALLING THE FUNCTION ON OUR DEPRESSION SUBREDDIT
depress_scraped = [] #DEFINING AN EMPTY LIST THAT WILL CONTAIN OUR SCRAPED DATA
reddit_scrape("https://www.reddit.com/r/depression.json", 50, depress_scraped)

SCRAPING https://www.reddit.com/r/depression.json
--------------------------------------------------
<<<SCRAPING COMMENCED>>>
Downloading Batch 1 of 50...
Downloading Batch 5 of 50...
Downloading Batch 10 of 50...
Downloading Batch 15 of 50...
Downloading Batch 20 of 50...
Downloading Batch 25 of 50...
Downloading Batch 30 of 50...
Downloading Batch 35 of 50...
Downloading Batch 40 of 50...
Downloading Batch 45 of 50...
Downloading Batch 50 of 50...
<<<SCRAPING COMPLETED>>>
Number of posts downloaded: 1248
Number of unique posts: 997
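
The gap between 1248 downloaded and 997 unique posts suggests that some pages were fetched more than once. One plausible cause, stated here as an assumption rather than a verified diagnosis, is that the listing runs out of pages: `after` comes back as `None`, and the next request starts again from the first page. The sketch below stops paginating in that case. `fetch_page` is a hypothetical stand-in for the `requests.get(...).json()` call, so the loop can run without network access.

```python
def paginate(fetch_page, max_pages):
    """Collect posts page by page, stopping early when no `after` code remains."""
    posts = []
    after = None
    for _ in range(max_pages):
        page = fetch_page(after)
        posts.extend(page["data"]["children"])
        after = page["data"]["after"]
        if after is None:  # no further pages; another request would restart at page 1
            break
    return posts


# Hypothetical three-page listing used in place of the live Reddit endpoint.
FAKE_PAGES = {
    None: {"data": {"children": [{"data": {"name": "t3_a"}}], "after": "t3_a"}},
    "t3_a": {"data": {"children": [{"data": {"name": "t3_b"}}], "after": "t3_b"}},
    "t3_b": {"data": {"children": [{"data": {"name": "t3_c"}}], "after": None}},
}

posts = paginate(FAKE_PAGES.get, max_pages=50)
names = [p["data"]["name"] for p in posts]
```

Even though `max_pages` is 50, the loop ends after the third page because there is nothing left to request.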


In [None]:
#CREATING A FUNCTION TO OUTPUT A LIST OF UNIQUE POSTS
def create_unique_list(original_scrape_list, new_list_name):
    seen_names = set() #NAMES OF POSTS ALREADY ADDED TO THE NEW LIST
    for post in original_scrape_list:
        name = post["data"]["name"]
        if name not in seen_names:
            new_list_name.append(post["data"])
            seen_names.add(name)
    #CHECKING IF THE NEW LIST IS OF SAME LENGTH AS UNIQUE POSTS
    print("LIST NOW CONTAINS {} UNIQUE SCRAPED POSTS".format(len(new_list_name)))
    

In [None]:
#CALLING THE FUNCTION ON OUR SCRAPED DATA
depress_scraped_unique = []
create_unique_list(depress_scraped, depress_scraped_unique)

LIST NOW CONTAINS 997 UNIQUE SCRAPED POSTS


In [None]:
#PUTTING DEPRESSION DATA INTO A DATAFRAME AND LABELLING IT WITH is_suicide = 0
depression = pd.DataFrame(depress_scraped_unique)
depression["is_suicide"] = 0
depression.head() #CHECK IF THERE ARE 100 COLS AND LAST DUMMY is_suicide COL

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,downs,top_awarded_type,hide_score,name,quarantine,link_flair_text_color,upvote_ratio,author_flair_background_color,subreddit_type,ups,total_awards_received,media_embed,author_flair_template_id,is_original_content,user_reports,secure_media,is_reddit_media_domain,is_meta,category,secure_media_embed,link_flair_text,can_mod_post,score,approved_by,author_premium,thumbnail,...,over_18,all_awardings,awarders,media_only,can_gild,spoiler,locked,author_flair_text,treatment_tags,visited,removed_by,num_reports,distinguished,subreddit_id,mod_reason_by,removal_reason,link_flair_background_color,id,is_robot_indexable,report_reasons,author,discussion_type,num_comments,send_replies,whitelist_status,contest_mode,mod_reports,author_patreon_flair,author_flair_text_color,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,author_cakeday,is_suicide
0,,depression,We understand that most people who reply immed...,t2_1t70,False,,1,False,Our most-broken and least-understood rules is ...,[],r/depression,False,0,,0,,False,t3_doqwow,False,dark,1.0,,public,2322,29,{},,False,[],,False,False,,{},,False,2322,,True,,...,False,"[{'giver_coin_reward': None, 'subreddit_id': N...",[],False,False,False,False,,[],False,,,moderator,t5_2qqqf,,,,doqwow,True,,SQLwitch,,175,True,no_ads,False,[],False,,/r/depression/comments/doqwow/our_mostbroken_a...,no_ads,True,https://www.reddit.com/r/depression/comments/d...,697672,1572361000.0,1,,False,,0
1,,depression,Welcome to /r/depression's check-in post - a p...,t2_1t70,False,,0,False,"Regular Check-In Post. Plus, a reminder about ...",[],r/depression,False,0,,0,,False,t3_iq10oq,False,dark,1.0,,public,446,13,{},,False,[],,False,False,,{},,False,446,,True,,...,False,"[{'giver_coin_reward': None, 'subreddit_id': N...",[],False,False,False,False,,[],False,,,moderator,t5_2qqqf,,,,iq10oq,True,,SQLwitch,,1589,False,no_ads,False,[],False,,/r/depression/comments/iq10oq/regular_checkin_...,no_ads,True,https://www.reddit.com/r/depression/comments/i...,697672,1599735000.0,0,,False,,0
2,,depression,"All day I have my headphones on, I even sleep ...",t2_11tsnwf9,False,,0,False,The only reason I survive my days is music.,[],r/depression,False,0,,0,,False,t3_jrla14,False,dark,0.99,,public,1263,9,{},,False,[],,False,False,,{},,False,1263,,False,,...,False,"[{'giver_coin_reward': None, 'subreddit_id': N...",[],False,False,False,False,,[],False,,,,t5_2qqqf,,,,jrla14,True,,whenthetimesgettough,,137,True,no_ads,False,[],False,,/r/depression/comments/jrla14/the_only_reason_...,no_ads,False,https://www.reddit.com/r/depression/comments/j...,697672,1605016000.0,0,,False,,0
3,,depression,For the past 10 years I have been trying super...,t2_47641id0,False,,0,False,"Eight days ago I was trying to end my life, pa...",[],r/depression,False,0,,0,,False,t3_jrtbuz,False,dark,1.0,,public,45,1,{},,False,[],,False,False,,{},,False,45,,True,,...,False,"[{'giver_coin_reward': 0, 'subreddit_id': None...",[],False,False,False,False,,[],False,,,,t5_2qqqf,,,,jrtbuz,True,,d3adandbr0k3n,,4,True,no_ads,False,[],False,,/r/depression/comments/jrtbuz/eight_days_ago_i...,no_ads,False,https://www.reddit.com/r/depression/comments/j...,697672,1605042000.0,0,,False,,0
4,,depression,"i hate everything, i hate working, i hate scho...",t2_7c1kynyq,False,,0,False,fuck everything on this godforsaken planet.,[],r/depression,False,0,,0,,False,t3_jrk7w9,False,dark,0.98,,public,199,1,{},,False,[],,False,False,,{},,False,199,,False,,...,False,"[{'giver_coin_reward': 0, 'subreddit_id': None...",[],False,False,False,False,,[],False,,,,t5_2qqqf,,,,jrk7w9,True,,Evangelion909,,20,True,no_ads,False,[],False,,/r/depression/comments/jrk7w9/fuck_everything_...,no_ads,False,https://www.reddit.com/r/depression/comments/j...,697672,1605012000.0,0,,False,,0


### 1.3 Running our functions on the r/SuicideWatch subreddit 

In [None]:
#CALLING THE SCRAPING FUNCTION ON OUR SUICIDEWATCH SUBREDDIT
suicide_scraped = [] #DEFINING AN EMPTY LIST THAT WILL CONTAIN OUR SCRAPED DATA
reddit_scrape("https://www.reddit.com/r/SuicideWatch.json", 50, suicide_scraped)

SCRAPING https://www.reddit.com/r/SuicideWatch.json
--------------------------------------------------
<<<SCRAPING COMMENCED>>>
Downloading Batch 1 of 50...
Downloading Batch 5 of 50...
Downloading Batch 10 of 50...
Downloading Batch 15 of 50...
Downloading Batch 20 of 50...
Downloading Batch 25 of 50...
Downloading Batch 30 of 50...
Downloading Batch 35 of 50...
Downloading Batch 40 of 50...
Downloading Batch 45 of 50...
Downloading Batch 50 of 50...
<<<SCRAPING COMPLETED>>>
Number of posts downloaded: 1230
Number of unique posts: 980


In [None]:
#CALLING THE "UNIQUE ONLY" FUNCTION ON OUR SCRAPED DATA
suicide_scraped_unique = []
create_unique_list(suicide_scraped, suicide_scraped_unique)

LIST NOW CONTAINS 980 UNIQUE SCRAPED POSTS


In [None]:
#PUTTING SUICIDEWATCH DATA INTO A DATAFRAME AND LABELLING IT WITH is_suicide = 1
suicide_watch = pd.DataFrame(suicide_scraped_unique)
suicide_watch["is_suicide"] = 1
suicide_watch.head() #CHECK IF THERE ARE 100 COLS AND LAST DUMMY is_suicide COL

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,downs,top_awarded_type,hide_score,name,quarantine,link_flair_text_color,upvote_ratio,author_flair_background_color,subreddit_type,ups,total_awards_received,media_embed,author_flair_template_id,is_original_content,user_reports,secure_media,is_reddit_media_domain,is_meta,category,secure_media_embed,link_flair_text,can_mod_post,score,approved_by,author_premium,thumbnail,...,over_18,all_awardings,awarders,media_only,can_gild,spoiler,locked,author_flair_text,treatment_tags,visited,removed_by,num_reports,distinguished,subreddit_id,mod_reason_by,removal_reason,link_flair_background_color,id,is_robot_indexable,report_reasons,author,discussion_type,num_comments,send_replies,whitelist_status,contest_mode,mod_reports,author_patreon_flair,author_flair_text_color,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,author_cakeday,is_suicide
0,,SuicideWatch,We've been seeing a worrying increase in pro-s...,t2_1t70,False,,1,False,New wiki on how to avoid accidentally encourag...,[],r/SuicideWatch,False,0,,0,,False,t3_cz6nfd,False,dark,0.99,,public,1730,17,{},2a7b5518-8e45-11e5-a506-0ed10b342609,False,[],,False,False,,{},,False,1730,,True,,...,False,"[{'giver_coin_reward': None, 'subreddit_id': N...",[],False,False,False,False,,[],False,,,moderator,t5_2qpzs,,,,cz6nfd,True,,SQLwitch,,255,True,no_ads,False,[],False,dark,/r/SuicideWatch/comments/cz6nfd/new_wiki_on_ho...,no_ads,True,https://www.reddit.com/r/SuicideWatch/comments...,237392,1567526000.0,0,,False,,1
1,,SuicideWatch,"Activism, i.e. advocating or fundraising for s...",t2_1t70,False,,1,False,Please remember that NO ACTIVISM of any kind i...,[],r/SuicideWatch,False,0,,0,,False,t3_iq0w21,False,dark,0.99,,public,778,6,{},2a7b5518-8e45-11e5-a506-0ed10b342609,False,[],,False,False,,{},,False,778,,True,,...,False,"[{'giver_coin_reward': None, 'subreddit_id': N...",[],False,False,False,False,,[],False,,,moderator,t5_2qpzs,,,,iq0w21,True,,SQLwitch,,53,True,no_ads,False,[],False,dark,/r/SuicideWatch/comments/iq0w21/please_remembe...,no_ads,True,https://www.reddit.com/r/SuicideWatch/comments...,237392,1599734000.0,0,,False,,1
2,,SuicideWatch,takes all the guilt and choice out of this who...,t2_55uydf5s,False,,0,False,would like to be murdered,[],r/SuicideWatch,False,0,,0,,False,t3_jrp9sh,False,dark,0.99,,public,293,0,{},,False,[],,False,False,,{},,False,293,,False,,...,False,[],[],False,False,False,False,,[],False,,,,t5_2qpzs,,,,jrp9sh,True,,check_my_french,,31,True,no_ads,False,[],False,,/r/SuicideWatch/comments/jrp9sh/would_like_to_...,no_ads,False,https://www.reddit.com/r/SuicideWatch/comments...,237392,1605029000.0,0,,False,,1
3,,SuicideWatch,I'm average teen boy and i've very much depres...,t2_7rmj1706,False,,0,False,Just to thank a girl angel that appears in my ...,[],r/SuicideWatch,False,0,,0,,False,t3_jrmi3o,False,dark,0.99,,public,440,3,{},,False,[],,False,False,,{},,False,440,,False,,...,False,"[{'giver_coin_reward': None, 'subreddit_id': N...",[],False,False,False,False,,[],False,,,,t5_2qpzs,,,,jrmi3o,True,,Grimarc,,33,True,no_ads,False,[],False,,/r/SuicideWatch/comments/jrmi3o/just_to_thank_...,no_ads,False,https://www.reddit.com/r/SuicideWatch/comments...,237392,1605021000.0,0,,False,,1
4,,SuicideWatch,,t2_8an2ccyw,False,,0,False,Why do I have to stay alive? Why? To make othe...,[],r/SuicideWatch,False,0,,0,,False,t3_jrn9qk,False,dark,0.99,,public,126,0,{},,False,[],,False,False,,{},,False,126,,False,,...,False,[],[],False,False,False,False,,[],False,,,,t5_2qpzs,,,,jrn9qk,True,,bleachdrinker2001,,13,True,no_ads,False,[],False,,/r/SuicideWatch/comments/jrn9qk/why_do_i_have_...,no_ads,False,https://www.reddit.com/r/SuicideWatch/comments...,237392,1605023000.0,0,,False,,1


#### NOTE: I've commented out the "pd.to_csv" code in the next cell to prevent any accidental overwriting of the saved dataset.

In [None]:
#suicide_watch.to_csv('../data/suicide_watch.csv', index = False)
#depression.to_csv('../data/depression.csv', index = False)
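
As an alternative to commenting the lines out, the overwrite could be blocked in code. The following is a minimal sketch under stated assumptions: `save_csv_once` is a hypothetical helper, and `_FakeFrame` is a stand-in for a DataFrame so the example runs without pandas or the scraped data.

```python
import tempfile
from pathlib import Path


def save_csv_once(frame, path):
    """Write `frame` to `path` only if no file exists there yet."""
    target = Path(path)
    if target.exists():
        raise FileExistsError(f"{target} already exists; refusing to overwrite it")
    frame.to_csv(target, index=False)


class _FakeFrame:
    """Stand-in for a DataFrame; only mimics the to_csv call used above."""

    def to_csv(self, path, index=True):
        Path(path).write_text("col\n1\n")


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "demo.csv"
    save_csv_once(_FakeFrame(), out)  # first write succeeds
    try:
        save_csv_once(_FakeFrame(), out)  # second write is refused
        second_write_blocked = False
    except FileExistsError:
        second_write_blocked = True
```

With the real frames, the call would look like `save_csv_once(suicide_watch, '../data/suicide_watch.csv')`.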

In [None]:
#INVESTIGATING THE CASE OF r/SuicideWatch HAVING AN ADDITIONAL COLUMN
suicide_watch.columns.difference(depression.columns)

In [None]:
#LOOKING INTO THAT ADDITIONAL COLUMN
suicide_watch['author_cakeday'].isnull().value_counts()
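
Note that `Index.difference` only reports columns present in the first frame and missing from the second. A minimal sketch of a two-way comparison is shown below. The helper name `compare_columns` and the short column lists are assumptions made for illustration, not the notebook's actual output.

```python
def compare_columns(left_columns, right_columns):
    """Return column names that appear on only one side, in both directions."""
    left, right = set(left_columns), set(right_columns)
    return {
        "only_in_left": sorted(left - right),
        "only_in_right": sorted(right - left),
    }


# Illustrative column names only; with the real frames one could pass
# suicide_watch.columns and depression.columns instead.
report = compare_columns(
    ["title", "selftext", "author_cakeday", "is_suicide"],
    ["title", "selftext", "is_suicide"],
)
print(report)
```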

#### Early thoughts about the collected data
- Data seems to be collected successfully.
- The two sets are slightly uneven in size: we collected 980 r/SuicideWatch posts and 997 r/depression posts. We might want to consider evening them out with another round of collection.
- There is also the matter of r/SuicideWatch having one extra column, which is strange considering that both subreddits exist on the same site. The column is "author_cakeday" and it is mostly NaNs, so it doesn't seem like a column we will be using for our classifier.
