<img src="../assets/a_eyes_readme.gif" style="float:right ; margin: 10px ; width:300px;"> 
<h1><left>NLP Project</left></h1>
<h4><left>Using Natural Language Processing to better understand Depression & Anxiety</left></h4>
___

## 1. Data Collection

In [23]:
import requests
import time
import pandas as pd
from random import randint

### 1.1 Exploring the JSON structure

In [24]:
url_test = "https://www.reddit.com/r/depression.json"

headers = {"User-agent": "Yeganeh"}
res = requests.get(url_test, headers=headers)

res.status_code

200

In [25]:
depress_json = res.json()

In [26]:
sorted(depress_json.keys())

['data', 'kind']

In [27]:
sorted(depress_json["data"].keys())

['after', 'before', 'children', 'dist', 'modhash']

In [28]:
# The `after` key holds the name of the last post on this page. Passing it as the
# `after` query parameter in the URL requests the next 25 posts that follow it.
depress_json["data"]["after"]

't3_n7138z'
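
To make the role of `after` concrete, here is a minimal sketch of how the value ends up in the request URL. The helper name `next_page_url` is hypothetical and not part of this notebook; it uses only the standard library, whereas the scraper in section 1.2 lets `requests` build the query string from `params`.

```python
from urllib.parse import urlencode


def next_page_url(base_url, after=None):
    """Return the listing URL for the page that follows the post named `after`."""
    if after is None:
        # No code yet: this is the first page of the listing
        return base_url
    return base_url + "?" + urlencode({"after": after})


print(next_page_url("https://www.reddit.com/r/depression.json", "t3_n7138z"))
# https://www.reddit.com/r/depression.json?after=t3_n7138z
```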

In [29]:
# Confirm that the `after` value above matches the name of the last post on the page
[post["data"]["name"] for post in depress_json["data"]["children"]]

['t3_doqwow',
 't3_m246c4',
 't3_n728cp',
 't3_n6fydg',
 't3_n6ye5n',
 't3_n6idcp',
 't3_n72ux9',
 't3_n72bwo',
 't3_n7218c',
 't3_n6yxob',
 't3_n70c5z',
 't3_n759a4',
 't3_n6udjd',
 't3_n6k4kv',
 't3_n6zids',
 't3_n732tn',
 't3_n741lk',
 't3_n6n3mg',
 't3_n74pzk',
 't3_n6w6na',
 't3_n6yj90',
 't3_n73m2y',
 't3_n62n6o',
 't3_n6q9dq',
 't3_n6ze7q',
 't3_n714g5',
 't3_n7138z']

In [30]:
len(depress_json["data"]["children"])

27

In [31]:
pd.DataFrame(depress_json["data"]["children"])

| | kind | data |
|---|---|---|
| 0 | t3 | {'approved_at_utc': None, 'subreddit': 'depres... |
| 1 | t3 | {'approved_at_utc': None, 'subreddit': 'depres... |
| 2 | t3 | {'approved_at_utc': None, 'subreddit': 'depres... |
| 3 | t3 | {'approved_at_utc': None, 'subreddit': 'depres... |
| 4 | t3 | {'approved_at_utc': None, 'subreddit': 'depres... |
| 5 | t3 | {'approved_at_utc': None, 'subreddit': 'depres... |
| 6 | t3 | {'approved_at_utc': None, 'subreddit': 'depres... |
| 7 | t3 | {'approved_at_utc': None, 'subreddit': 'depres... |
| 8 | t3 | {'approved_at_utc': None, 'subreddit': 'depres... |
| 9 | t3 | {'approved_at_utc': None, 'subreddit': 'depres... |


In [32]:
depress_json["data"]["children"][0]["data"]

{'approved_at_utc': None,
 'subreddit': 'depression',
 'selftext': 'We understand that most people who reply immediately to an OP with an invitation to talk privately  mean only to help, but this type of response usually leads to either disappointment or disaster.  it usually works out quite differently here than when you say "PM me anytime" in a casual social context.  \n\nWe have huge admiration and appreciation for the goodwill and good citizenship of so many of you who support others here and flag inappropriate content - even more so because we know that so many of you are struggling yourselves.  We\'re hard at work behind the scenes on more information and resources to make it easier to give and get quality help here - this is just a small start.  \n\nOur new wiki page explains in detail why it\'s much better to respond in public comments, at least until you\'ve gotten to know someone.  It will be maintained at /r/depression/wiki/private_contact, and the full text of the current v

### 1.2 Automate the Data Collection process

In [33]:
def reddit_scrape(url, number_of_scrapes, output_list):
    """Download `number_of_scrapes` pages of posts from `url` into `output_list`.

    `output_list` should be an empty list; raw post objects are appended in place.
    Uses the `headers` dict defined in section 1.1.
    """
    # No `after` code yet, so the first request fetches the first page of the subreddit
    after = None

    for i in range(number_of_scrapes):
        # Progress is logged for the first batch and for every fifth batch
        should_log = i == 0 or (i + 1) % 5 == 0

        if i == 0:
            print("SCRAPING {}\n--------------------------------------------------".format(url))
            print("<<<SCRAPING COMMENCED>>>")
        if should_log:
            print("\nStart:", time.ctime())
            print("Downloading Batch {} of {}...".format(i + 1, number_of_scrapes))

        # Passing Reddit's `after` code requests the posts that follow the last one seen
        params = {} if after is None else {"after": after}

        res = requests.get(url, params=params, headers=headers)

        if res.status_code == 200:
            the_json = res.json()
            output_list.extend(the_json["data"]["children"])
            after = the_json["data"]["after"]
        else:
            # Report the failing status code and stop scraping
            print(res.status_code)
            break

        if should_log:
            print("End:", time.ctime())

        # Pause for a random 1 to 6 seconds between requests
        time.sleep(randint(1, 6))

    print("\n<<<SCRAPING COMPLETED>>>")
    print("Number of posts downloaded:", len(output_list))
    print("Number of unique posts:", len({p["data"]["name"] for p in output_list}))
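
One detail of the loop above is worth illustrating: if a response ever carries an `after` of `None`, the next request is sent with empty `params` and fetches the first page again, which is one way duplicate posts can end up in the output. The sketch below shows a pagination loop that stops at that point instead. It assumes the listing signals its last page with a null `after`; `collect_pages`, `fetch_page`, and the fake pages are hypothetical names made up for illustration, and no network request is made.

```python
def collect_pages(fetch_page, max_pages):
    """Follow `after` codes page by page, stopping early when the listing runs out.

    `fetch_page(after)` is a hypothetical callable returning a dict shaped like
    the listing JSON: {"data": {"children": [...], "after": <code or None>}}.
    """
    posts = []
    after = None
    for _ in range(max_pages):
        page = fetch_page(after)
        posts.extend(page["data"]["children"])
        after = page["data"]["after"]
        if after is None:
            # No further page: another request would start again from page one
            break
    return posts


# Two fake pages standing in for real responses (made-up post names)
FAKE_PAGES = {
    None: {"data": {"children": [{"data": {"name": "t3_a"}},
                                 {"data": {"name": "t3_b"}}],
                    "after": "t3_b"}},
    "t3_b": {"data": {"children": [{"data": {"name": "t3_c"}}],
                      "after": None}},
}

posts = collect_pages(lambda after: FAKE_PAGES[after], max_pages=50)
print([p["data"]["name"] for p in posts])
# ['t3_a', 't3_b', 't3_c']
```

Injecting the fetcher keeps the stopping rule separate from the HTTP call, so the loop can be checked on small hand-written pages before it is pointed at a real subreddit.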

In [34]:
def create_unique_list(original_scrape_list, data_list):
    """Append the `data` dict of each uniquely named post to `data_list`, keeping first occurrences."""
    # Names already added; a set keeps each membership check fast
    seen_names = set()

    for post in original_scrape_list:
        name = post["data"]["name"]
        if name not in seen_names:
            data_list.append(post["data"])
            seen_names.add(name)

    # This count should match the number of unique posts reported by reddit_scrape
    print("LIST NOW CONTAINS {} UNIQUE SCRAPED POSTS".format(len(data_list)))

### 1.3 Collect depression data

In [35]:
depress_scraped = [] 
reddit_scrape("https://www.reddit.com/r/depression.json", 50, depress_scraped)

SCRAPING https://www.reddit.com/r/depression.json
--------------------------------------------------
<<<SCRAPING COMMENCED>>>

Start: Fri May  7 23:50:39 2021
Downloading Batch 1 of 50...
End: Fri May  7 23:50:40 2021

Start: Fri May  7 23:50:42 2021
Downloading Batch 2 of 50...
End: Fri May  7 23:50:43 2021

Start: Fri May  7 23:50:48 2021
Downloading Batch 3 of 50...
End: Fri May  7 23:50:49 2021

Start: Fri May  7 23:50:55 2021
Downloading Batch 4 of 50...
End: Fri May  7 23:50:57 2021

Start: Fri May  7 23:50:59 2021
Downloading Batch 5 of 50...
End: Fri May  7 23:51:00 2021

Start: Fri May  7 23:51:05 2021
Downloading Batch 6 of 50...
End: Fri May  7 23:51:06 2021

Start: Fri May  7 23:51:10 2021
Downloading Batch 7 of 50...
End: Fri May  7 23:51:11 2021

Start: Fri May  7 23:51:17 2021
Downloading Batch 8 of 50...
End: Fri May  7 23:51:18 2021

Start: Fri May  7 23:51:22 2021
Downloading Batch 9 of 50...
End: Fri May  7 23:51:23 2021

Start: Fri May  7 23:51:28 2021
Downloading B

In [36]:
depress_scraped_unique = []
create_unique_list(depress_scraped, depress_scraped_unique)

LIST NOW CONTAINS 932 UNIQUE SCRAPED POSTS


In [37]:
depression = pd.DataFrame(depress_scraped_unique)
depression["is_anxiety"] = 0
depression.head() 

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,author_cakeday,is_anxiety
0,,depression,We understand that most people who reply immed...,t2_1t70,False,,1,False,Our most-broken and least-understood rules is ...,[],...,no_ads,True,https://www.reddit.com/r/depression/comments/d...,755513,1572361000.0,2,,False,,0
1,,depression,Welcome to /r/depression's check-in post - a p...,t2_1t70,False,,0,False,"Regular Check-In Post, with important reminder...",[],...,no_ads,True,https://www.reddit.com/r/depression/comments/m...,755513,1615400000.0,0,,False,,0
2,,depression,I'm so low rn I can't even type anything coher...,t2_8oa0yyky,False,,0,False,Low,[],...,no_ads,False,https://www.reddit.com/r/depression/comments/n...,755513,1620404000.0,0,,False,,0
3,,depression,When I wake up after 8 hours of decent sleep I...,t2_8bk84r51,False,,0,False,I’m always amazed at how much energy healthy p...,[],...,no_ads,False,https://www.reddit.com/r/depression/comments/n...,755513,1620331000.0,0,,False,,0
4,,depression,I guess i have always been depressed but never...,t2_bzoskmwx,False,,0,False,30 and never lived a day in my life,[],...,no_ads,False,https://www.reddit.com/r/depression/comments/n...,755513,1620394000.0,0,,False,,0


In [38]:
depression.to_csv('../data/depression.csv', index = False)

### 1.4 Collect Anxiety data

In [39]:
anxiety_scraped = []
reddit_scrape("https://www.reddit.com/r/Anxiety.json", 50, anxiety_scraped)

SCRAPING https://www.reddit.com/r/Anxiety.json
--------------------------------------------------
<<<SCRAPING COMMENCED>>>

Start: Fri May  7 23:54:21 2021
Downloading Batch 1 of 50...
End: Fri May  7 23:54:23 2021

Start: Fri May  7 23:54:24 2021
Downloading Batch 2 of 50...
End: Fri May  7 23:54:25 2021

Start: Fri May  7 23:54:28 2021
Downloading Batch 3 of 50...
End: Fri May  7 23:54:29 2021

Start: Fri May  7 23:54:30 2021
Downloading Batch 4 of 50...
End: Fri May  7 23:54:31 2021

Start: Fri May  7 23:54:34 2021
Downloading Batch 5 of 50...
End: Fri May  7 23:54:36 2021

Start: Fri May  7 23:54:38 2021
Downloading Batch 6 of 50...
End: Fri May  7 23:54:39 2021

Start: Fri May  7 23:54:43 2021
Downloading Batch 7 of 50...
End: Fri May  7 23:54:44 2021

Start: Fri May  7 23:54:45 2021
Downloading Batch 8 of 50...
End: Fri May  7 23:54:46 2021

Start: Fri May  7 23:54:50 2021
Downloading Batch 9 of 50...
End: Fri May  7 23:54:51 2021

Start: Fri May  7 23:54:55 2021
Downloading Batc

In [40]:
anxiety_scraped_unique = []
create_unique_list(anxiety_scraped, anxiety_scraped_unique)

LIST NOW CONTAINS 998 UNIQUE SCRAPED POSTS


In [41]:
anxiety = pd.DataFrame(anxiety_scraped_unique)
anxiety["is_anxiety"] = 1
anxiety.head() 

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,post_hint,preview,is_anxiety
0,,Anxiety,Hello everyone! Welcome to the r/Anxiety month...,t2_6l4z3,False,,0,False,Monthly Check-In Thread,[],...,True,https://www.reddit.com/r/Anxiety/comments/myl0...,453474,1619395000.0,0,,False,,,1
1,,Anxiety,With the subreddit continuing to grow we're lo...,t2_5uptt,False,,0,False,Looking for new mods!,[],...,True,https://www.reddit.com/r/Anxiety/comments/modo...,453474,1618091000.0,0,,False,self,{'images': [{'source': {'url': 'https://extern...,1
2,,Anxiety,"The company that I worked for: ""Hey it's menta...",t2_7grjc3lq,False,,0,False,It's so frustrating when society wants to be a...,[],...,False,https://www.reddit.com/r/Anxiety/comments/n6ze...,453474,1620397000.0,0,,False,,,1
3,,Anxiety,"It's pretty simple and may seem obvious, but j...",t2_50q4oaru,False,,0,False,My therapist recently taught me a trick that h...,[],...,False,https://www.reddit.com/r/Anxiety/comments/n6kr...,453474,1620344000.0,1,,False,,,1
4,,Anxiety,I’m proud as fuck of myself. It’s hard. Really...,t2_ziole,False,,0,False,I’m 31 years old and have been depressed and h...,[],...,False,https://www.reddit.com/r/Anxiety/comments/n6r5...,453474,1620365000.0,0,,False,,,1


In [42]:
anxiety.to_csv('../data/anxiety.csv', index = False)

In [43]:
# Columns that appear in the anxiety data but not in the depression data
anxiety.columns.difference(depression.columns)

Index(['link_flair_template_id', 'post_hint', 'preview', 'thumbnail_height',
       'thumbnail_width'],
      dtype='object')

In [54]:
# Null vs. non-null counts for each column found only in the anxiety data
for diff in anxiety.columns.difference(depression.columns):
    print(anxiety[diff].isnull().value_counts(), "\n")

False    998
Name: link_flair_template_id, dtype: int64 

True     987
False     11
Name: post_hint, dtype: int64 

True     987
False     11
Name: preview, dtype: int64 

True    998
Name: thumbnail_height, dtype: int64 

True    998
Name: thumbnail_width, dtype: int64 
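
The same kind of check can be sketched on raw post dictionaries, before any DataFrame is built. This is only an illustration under stated assumptions: the function name `field_coverage` and the sample posts are made up, and a field counts as populated when it is present and not `None`.

```python
def field_coverage(posts_a, posts_b):
    """For fields found only in `posts_a`, count the posts holding a non-null value."""
    keys_a = set().union(*(p.keys() for p in posts_a))
    keys_b = set().union(*(p.keys() for p in posts_b))
    only_in_a = sorted(keys_a - keys_b)
    return {
        key: sum(1 for p in posts_a if p.get(key) is not None)
        for key in only_in_a
    }


# Made-up sample posts: only the first group ever carries `post_hint`
anxiety_like = [
    {"title": "a", "post_hint": "self"},
    {"title": "b", "post_hint": None},
    {"title": "c"},
]
depression_like = [{"title": "x"}, {"title": "y"}]

print(field_coverage(anxiety_like, depression_like))
# {'post_hint': 1}
```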

