<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# **Project 3: Web APIs & NLP**

### **Content** 

---

#### Part 1 : [Web Scraping from Reddit](#part-1-web-scraping-from-reddit)

#### Part 2 : [Data Cleaning](Part_2_data_cleaning.ipynb)

#### Part 3 : [EDA and Data Preprocessing](Part_3_preprocessing_and_eda.ipynb)

#### Part 4 : [Modelling](Part_4_modelling.ipynb)

### **Context**
---
Mental health is an increasingly serious issue in Singapore, with depression being the most common mental illness, particularly among youths. One of the leading factors behind the rising trend of depression in youths is heavy social media usage, which exposes them to cyberbullying and to pressure to constantly present a positive self-image online. Given that it is difficult to control how much time youths spend on social media, it is more effective to use social media itself to address rising depression rates in youths and give them the best support possible.

### **Problem Statement**
---
Our aim is to address the following question: **"How might we detect whether youths are at risk of depression based on their online posts?"**

Another objective of this project is to create a `predictive model` that identifies, based on a social media post, whether a youth is at risk of depression. The success of the model will be evaluated based on its `sensitivity` score.
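
Sensitivity is the share of truly at-risk posts that the model flags, that is, true positives divided by true positives plus false negatives. The sketch below is only an illustration of that formula: the function name `sensitivity` and the toy labels are assumptions, not part of the project code.

```python
def sensitivity(y_true, y_pred, positive=1):
    """Return TP / (TP + FN) for the given positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp + fn == 0:
        raise ValueError("y_true contains no positive examples")
    return tp / (tp + fn)


# Hypothetical labels: 1 = at risk (r/depression), 0 = not at risk (r/happy)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

print(sensitivity(y_true, y_pred))  # 0.75 (3 true positives, 1 false negative)
```

A high sensitivity means few at-risk posts are missed, which is why it suits a screening task where a missed case is costlier than a false alarm.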

<a id="section_identifier"></a>
###  **Part 1: Web Scraping from Reddit**

---

Import Libraries Here

In [1]:
import praw 
import json
import pandas as pd
import time

For this project, we will use these two subreddits as the datasets to build our model:
1. r/depression
2. r/happy


We will start by creating functions to scrape the following:

1. Functions to scrape submission posts

Because Reddit returns fewer than 1,000 posts per extraction, we will extract from both the hot and new sections and do further cleaning in the next part of the code.

Function to scrape **hot** posts

In [6]:
#define function to scrape "hot" posts from reddit to json
def post_to_json(subreddit_name, client_id, client_secret, user_agent):
    # initialize praw
    reddit = praw.Reddit(client_id=client_id, client_secret=client_secret, user_agent=user_agent)

    # reddit code
    subreddit = reddit.subreddit(subreddit_name)

    post_data = []
    count = 0
    processed_id = set()
    while count < 10000:
        new_posts = 0  # posts added during this pass over the listing
        for post in subreddit.hot(limit=None):
            if post.id not in processed_id:
                if post.selftext != '[deleted]' and post.selftext != '[removed]':  # skip deleted or removed posts
                    post_data.append({
                        'id': post.id,
                        'title': post.title,
                        'body': post.selftext,
                        'author': str(post.author) if post.author else None,
                        'score': post.score,
                        'total comment': post.num_comments,
                        'created_utc': post.created_utc,
                        'subreddit': str(post.subreddit)
                    })
                    count += 1
                    new_posts += 1
                    processed_id.add(post.id)

        # Output the number of processed posts
        print(f"Processed {count} posts.")
        if new_posts == 0:
            # the listing returned nothing new, so stop instead of looping forever
            break
        time.sleep(1)

    # Save the posts into JSON
    filename = f'post_{subreddit_name}.json'

    with open(filename, 'w') as f:
        json.dump(post_data, f, indent=4)
        print(f"Posts from r/{subreddit_name} have been saved")

    

Function to scrape **new** posts

In [1]:
#define function to scrape "new" post from reddit to json
def post_new_to_json(subreddit_name, client_id, client_secret, user_agent):
    # Initialize praw
    reddit = praw.Reddit(client_id=client_id, client_secret=client_secret, user_agent=user_agent)

    # Reddit code
    subreddit = reddit.subreddit(subreddit_name)
    
    post_data = []
    count = 0
    
    # Request up to 10000 posts; Reddit returns fewer than 1000 per extraction
    posts = subreddit.new(limit=10000)
    for post in posts:
        if post.selftext != '[deleted]' and post.selftext != '[removed]':  # Filter out deleted or removed posts
            post_data.append({
                'id': post.id,
                'title': post.title,
                'body': post.selftext,
                'author': str(post.author) if post.author else None,
                'score': post.score,
                'total comment': post.num_comments,
                'created_utc': post.created_utc,
                'subreddit': str(post.subreddit)
            })
            count += 1

        time.sleep(1)
    # Output the number of processed posts
    print(f"Processed {count} posts.")

    # Save the posts into JSON 
    filename = f'post_{subreddit_name}_new.json'
    
    with open(filename, 'w') as f:
        json.dump(post_data, f, indent=4)
        print(f"Posts from r/{subreddit_name} have been saved")
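
Both post scrapers build the same dictionary for every submission. As a sketch of how that shared step could be factored out, the helpers below convert a submission-like object into that layout and apply the deleted/removed filter. The names `post_to_record` and `is_available` are hypothetical, and `SimpleNamespace` only stands in for a real PRAW submission so the example runs without network access.

```python
from types import SimpleNamespace


def is_available(post):
    """True when the post body has not been deleted or removed."""
    return post.selftext not in ('[deleted]', '[removed]')


def post_to_record(post):
    """Convert a submission-like object into the dict layout used by the scrapers."""
    return {
        'id': post.id,
        'title': post.title,
        'body': post.selftext,
        'author': str(post.author) if post.author else None,
        'score': post.score,
        'total comment': post.num_comments,
        'created_utc': post.created_utc,
        'subreddit': str(post.subreddit),
    }


# Stand-in object with made-up values (not real scraped data)
fake_post = SimpleNamespace(
    id='abc123', title='Example title', selftext='Example body',
    author=None, score=1, num_comments=0,
    created_utc=1711257670, subreddit='depression',
)

if is_available(fake_post):
    record = post_to_record(fake_post)
    print(record['author'])  # None, because the stand-in has no author
```

Keeping the record layout in one place means the hot and new extracts are guaranteed to share the same columns when they are later loaded into dataframes.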

2. Function to scrape comments from the subreddit's **hot** posts

In [20]:
#define function to get comments to JSON
def comments_to_json(subreddit_name, client_id, client_secret, user_agent):
    # Initialize praw
    reddit = praw.Reddit(client_id=client_id, client_secret=client_secret, user_agent=user_agent)

    # Reddit code
    subreddit = reddit.subreddit(subreddit_name)
    comments_data = []
    count = 0
    for post in subreddit.hot(limit = 10000):
        post.comments.replace_more(limit=3)  # Load up to 3 "more comments" placeholders and drop the rest
        for comment in post.comments.list():
            comments_data.append({
                'id': comment.id,
                'body': comment.body,
                'author': str(comment.author),
                'score': comment.score,
                'created_utc': comment.created_utc,
                'subreddit' : str(comment.subreddit)
            })
        count += 1
        time.sleep(1)    
        print(f"Post check count {count}")
    # Save the comments to a JSON file
    filename = f'comment_{subreddit_name}_hot.json'
    with open(filename, 'w') as f:
        json.dump(comments_data, f, indent=4)

    print(f"Comments have been saved to {filename}")
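
Each comment can carry nested replies, and `post.comments.list()` hands them back as one flat list. The toy model below sketches that flattening as a level-by-level walk. `ToyComment` and `flatten` are hypothetical stand-ins, and the traversal order is an assumption about PRAW's behaviour rather than a copy of its code.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyComment:
    """Hypothetical stand-in for a comment with nested replies."""
    id: str
    replies: list = field(default_factory=list)


def flatten(top_level):
    """Return every comment in the tree, visiting it level by level."""
    flat = []
    queue = deque(top_level)
    while queue:
        comment = queue.popleft()
        flat.append(comment)
        queue.extend(comment.replies)  # replies are visited after the current level
    return flat


tree = [
    ToyComment('c1', [ToyComment('c1a', [ToyComment('c1a-i')]), ToyComment('c1b')]),
    ToyComment('c2'),
]

print([c.id for c in flatten(tree)])  # ['c1', 'c2', 'c1a', 'c1b', 'c1a-i']
```

Because the flat list mixes top-level comments with deep replies, the saved JSON keeps no record of which comment answered which.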

Function to scrape comments from the subreddit's **new** posts

In [6]:
#define function to get comments to JSON
def comments_new_to_json(subreddit_name, client_id, client_secret, user_agent):
    # Initialize praw
    reddit = praw.Reddit(client_id=client_id, client_secret=client_secret, user_agent=user_agent)

    # Reddit code
    subreddit = reddit.subreddit(subreddit_name)
    comments_data = []
    count = 0

    for post in subreddit.new(limit = None):
        post.comments.replace_more(limit=0)  # Drop all "more comments" placeholders without loading them
        for comment in post.comments.list():
            comments_data.append({
                'id': comment.id,
                'body': comment.body,
                'author': str(comment.author),
                'score': comment.score,
                'created_utc': comment.created_utc,
                'subreddit' : str(comment.subreddit)
            })
        count += 1
        time.sleep(1)    
    print(f"Post check count {count}")
    # Save the comments to a JSON file
    filename = f'comment_{subreddit_name}_new.json'
    with open(filename, 'w') as f:
        json.dump(comments_data, f, indent=4)

    print(f"Comments have been saved to {filename}")

Instantiate Reddit Client Information from Reddit API 

In [7]:
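#read the Reddit API credentials from environment variables instead of hardcoding them
#(the variable names REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are illustrative choices)
import os
client_id = os.environ['REDDIT_CLIENT_ID']
client_secret = os.environ['REDDIT_CLIENT_SECRET']
user_agent = 'Web Scraper'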

The code cells below scrape Reddit post submissions and save them to JSON files.

In [22]:
#call function to extract hot post
post_to_json('depression', client_id, client_secret, user_agent)
post_to_json('happy', client_id, client_secret, user_agent)

In [None]:
# call function to extract new post 
post_new_to_json('depression', client_id, client_secret, user_agent)
post_new_to_json('happy', client_id, client_secret, user_agent)

The code below loads the JSON files into dataframes.

In [23]:
#Load hot JSON file into dataframe 
df_post_depression = pd.read_json('../datasets/json/post_depression.json') #dataframe created from subreddit r/depression hot
df_post_happy = pd.read_json('../datasets/json/post_happy.json') #dataframe created from subreddit r/happy hot


In [24]:
#Load new JSON file into dataframe 
df_post_new_depression = pd.read_json('../datasets/json/post_depression_new.json') #dataframe created from subreddit r/depression new
df_post_new_happy = pd.read_json('../datasets/json/post_happy_new.json') #dataframe created from subreddit r/happy new

Check the number of rows and columns in the dataframes, and their content.

In [9]:
df_post_new_depression.head()

| | id | title | body | author | score | total comment | created_utc | subreddit |
|---|---|---|---|---|---|---|---|---|
| 0 | 1bmd4ud | Jobless | I'm currently jobless i had gone through depre... | Independent_Show9505 | 1 | 0 | 1711257670 | depression |
| 1 | 1bmd45v | 19 and have nothing to show for it | I haven't finished highschool, I basically hav... | Lux_butimnotreal | 1 | 0 | 1711257598 | depression |
| 2 | 1bmd2og | Need help ,is this healthy | I’ll get straight to it I’ve been on this dati... | Ok_Code_2143 | 1 | 0 | 1711257433 | depression |
| 3 | 1bmd2cx | Gut wrench | \n\nThe sinking feeling your heart gets right ... | Critical-Strength-83 | 1 | 0 | 1711257398 | depression |
| 4 | 1bmcwt4 | I feel like my girlfriend is depressed and I w... | My girlfriend and I have been going strong for... | Fit-Basil-5447 | 1 | 1 | 1711256827 | depression |


In [10]:
df_post_new_depression.info()

<class 'pandas.core.frame.DataFrame'>
Index: 987 entries, 0 to 986
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   id             987 non-null    object
 1   title          987 non-null    object
 2   body           987 non-null    object
 3   author         981 non-null    object
 4   score          987 non-null    int64 
 5   total comment  987 non-null    int64 
 6   created_utc    987 non-null    int64 
 7   subreddit      987 non-null    object
dtypes: int64(3), object(5)
memory usage: 69.4+ KB
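
The `created_utc` column is stored as `int64`. Assuming these values are Unix epoch seconds in UTC, as Reddit's API reports them, a value can be turned into a readable timestamp as sketched below. The helper name `utc_from_epoch` is hypothetical.

```python
from datetime import datetime, timezone


def utc_from_epoch(seconds):
    """Convert Unix epoch seconds into a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# created_utc of the first r/depression row shown above
created = utc_from_epoch(1711257670)
print(created.isoformat())  # 2024-03-24T05:21:10+00:00
```

For a whole dataframe column, `pd.to_datetime(df['created_utc'], unit='s')` performs the same conversion.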


In [11]:
df_post_new_happy.head()

| | id | title | body | author | score | total comment | created_utc | subreddit |
|---|---|---|---|---|---|---|---|---|
| 0 | 1bmd3kx | i hit 8 months sober today!!! (and i couldnt b... | | haute_honey | 8 | 3 | 1711257536 | happy |
| 1 | 1bmbu9p | I just had one of the most delicious meals in ... | There wasn't anything special about it really.... | Someragingpacifist | 3 | 2 | 1711253084 | happy |
| 2 | 1bm8fue | I managed to minimize my screen time by about ... | I am not a phone addict, but I always used to ... | saayoutloud | 4 | 6 | 1711242519 | happy |
| 3 | 1bm85b6 | Out of work for the last 5 months, finally lan... | Was starting to feel worthless. I could cry I’... | bakerjunt | 92 | 3 | 1711241694 | happy |
| 4 | 1bm7hqc | Today marks 2 years alcohol-free for me. I was... | | theeblackdahlia | 280 | 23 | 1711239848 | happy |


In [12]:
df_post_new_happy.info()

<class 'pandas.core.frame.DataFrame'>
Index: 968 entries, 0 to 967
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   id             968 non-null    object
 1   title          968 non-null    object
 2   body           968 non-null    object
 3   author         945 non-null    object
 4   score          968 non-null    int64 
 5   total comment  968 non-null    int64 
 6   created_utc    968 non-null    int64 
 7   subreddit      968 non-null    object
dtypes: int64(3), object(5)
memory usage: 68.1+ KB


The code cells below scrape Reddit post comments and save them to JSON files.

In [23]:
# Extract comments from both subreddits
comments_to_json('depression', client_id, client_secret, user_agent)
comments_to_json('happy', client_id, client_secret, user_agent)

Post check count 1
Post check count 2
Post check count 3
...

In [8]:
#extract new comments from both subreddits
comments_new_to_json('depression', client_id, client_secret, user_agent)
comments_new_to_json('happy', client_id, client_secret, user_agent)

Post check count 988
Comments have been saved to comment_depression_new.json
Post check count 968
Comments have been saved to comment_happy_new.json


The code below loads the JSON files into dataframes.

In [2]:
#Load comment JSON into dataframe 
df_comment_depression = pd.read_json('../datasets/json/comment_depression.json')
df_comment_happy = pd.read_json('../datasets/json/comment_happy.json')

In [4]:
df_comment_new_depression = pd.read_json('../datasets/json/comment_depression_new.json')
df_comment_new_happy = pd.read_json('../datasets/json/comment_happy_new.json')

In [12]:
df_comment = pd.concat([df_comment_depression,df_comment_new_depression],axis=0)

In [16]:
df_comment

| | id | body | author | score | created_utc | subreddit |
|---|---|---|---|---|---|---|
| 0 | f5pot56 | Understood and I apologise if I forget in the ... | scorpiontank27 | 237 | 1572364418 | depression |
| 1 | f5pot7j | [removed] | | 63 | 1572364419 | depression |
| 2 | f647tsy | Biggest Problem on private talks may be that y... | BloodyClash1133 | 47 | 1572689495 | depression |
| 3 | f5pnusx | I have to agree with this. I know that people ... | | 47 | 1572363803 | depression |
| 4 | f5pq8wf | Great rule! I’ve never thought about things yo... | 000000- | 14 | 1572365346 | depression |
| ... | ... | ... | ... | ... | ... | ... |
| 2436 | kvyc1a3 | I have, but thanks for the warning! I will not... | Misce11aneou5 | 2 | 1711057307 | depression |
| 2437 | kvvy16f | Thanks for the advice, I’ll see what happens. ... | Misce11aneou5 | 2 | 1711028517 | depression |
| 2438 | kvurr0d | I'm in the exact same situation. I don't have ... | EyeAnon | 1 | 1711003231 | depression |
| 2439 | kvtlh3d | Try to forgive yourself. No matter who you are... | floppy_gonga | 1 | 1710982289 | depression |


In [19]:
df_comment[['id','body']].duplicated().value_counts()

True     38548
False     5226
Name: count, dtype: int64
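
The counts above show that most `(id, body)` pairs in the combined comments repeat. As a sketch of how a later cleaning step might handle this, the example below flags and drops repeated rows on two small made-up frames. The frame contents are illustrative only and are not scraped data.

```python
import pandas as pd

# Toy stand-ins for the hot and new extracts; two ids appear in both
hot = pd.DataFrame({'id': ['a1', 'a2', 'a3'], 'body': ['first', 'second', 'third']})
new = pd.DataFrame({'id': ['a2', 'a3', 'a4'], 'body': ['second', 'third', 'fourth']})

combined = pd.concat([hot, new], axis=0)

# True for every row whose (id, body) pair already appeared earlier
dup_mask = combined[['id', 'body']].duplicated()
print(int(dup_mask.sum()))  # 2

# Keep the first occurrence of each pair and rebuild a clean index
deduped = combined.drop_duplicates(subset=['id', 'body']).reset_index(drop=True)
print(len(deduped))  # 4
```

Resetting the index matters here because `pd.concat` keeps each frame's original row labels, so the combined frame otherwise contains repeated index values.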

Check the number of rows and columns in the comment dataframes, and their content.

In [37]:
df_comment_depression.head()

| | id | body | author | score | created_utc | subreddit |
|---|---|---|---|---|---|---|
| 0 | f5pot56 | Understood and I apologise if I forget in the ... | scorpiontank27 | 237 | 1572364418 | depression |
| 1 | f5pot7j | [removed] | | 63 | 1572364419 | depression |
| 2 | f647tsy | Biggest Problem on private talks may be that y... | BloodyClash1133 | 47 | 1572689495 | depression |
| 3 | f5pnusx | I have to agree with this. I know that people ... | | 47 | 1572363803 | depression |
| 4 | f5pq8wf | Great rule! I’ve never thought about things yo... | 000000- | 14 | 1572365346 | depression |


In [38]:
df_comment_depression.info()

<class 'pandas.core.frame.DataFrame'>
Index: 41333 entries, 0 to 41332
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           41333 non-null  object
 1   body         41333 non-null  object
 2   author       41333 non-null  object
 3   score        41333 non-null  int64 
 4   created_utc  41333 non-null  int64 
 5   subreddit    41333 non-null  object
dtypes: int64(2), object(4)
memory usage: 2.2+ MB


In [39]:
df_comment_happy.head()

| | id | body | author | score | created_utc | subreddit |
|---|---|---|---|---|---|---|
| 0 | jnx95nz | But no involvement in the protest? | SorcererSupreme21 | 3 | 1686600125 | happy |
| 1 | jnxcwhi | Shut down the subreddit | WhoDidYourCirc_Bro | 3 | 1686602074 | happy |
| 2 | k7jijc6 | Is this discord active? :) | MaxSteelMetal | 1 | 1698949875 | happy |
| 3 | kw014ic | Welcome to /r/happy where we support people in... | AutoModerator | 1 | 1711081369 | happy |
| 4 | kw0215k | Robin eggs! What an Easter blessing. | Independent_Ad_5664 | 140 | 1711081849 | happy |


In [40]:
df_comment_happy.info()

<class 'pandas.core.frame.DataFrame'>
Index: 130922 entries, 0 to 130921
Data columns (total 6 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   id           130922 non-null  object
 1   body         130922 non-null  object
 2   author       130922 non-null  object
 3   score        130922 non-null  int64 
 4   created_utc  130922 non-null  int64 
 5   subreddit    130922 non-null  object
dtypes: int64(2), object(4)
memory usage: 7.0+ MB


Export Files to CSV

Export hot posts to CSV files

In [9]:
df_post_depression.to_csv('../datasets/post_depression.csv', index=False)
df_post_happy.to_csv('../datasets/post_happy.csv', index=False)

Export new posts to CSV files

In [21]:
df_post_new_depression.to_csv('../datasets/post_new_depression.csv', index=False)
df_post_new_happy.to_csv('../datasets/post_new_happy.csv', index=False)

Export comments to CSV files

In [22]:
df_comment_depression.to_csv('../datasets/comment_depression.csv', index=False)
df_comment_happy.to_csv('../datasets/comment_happy.csv', index=False)

#### Next: [Part 2 Data Cleaning](Part_2_data_cleaning.ipynb)