# Project 3(1): Classifying Reddit Posts with Natural Language Processing and Machine Learning - Scraping Subreddit Text

Done by: Richelle-Joy Chia, a Redditor-and-data-science enthusiast! 

Problem statement: Through natural language processing and classification models, how can we help Reddit and other interested parties classify posts based on the language used by people who may be depressed or anxious? Furthermore, how can sentiment analysis be utilized to detect emotions associated with depression and anxiety?

## Introduction
Mental illness is highly prevalent worldwide and is a major cause of distress in people's lives, with consequences for society's health and well-being. Depression and anxiety are psychiatric disorders that show up in many areas of everyday life and affect millions of people worldwide.
Signs of these disorders appear fairly often in text written on social media by users who have not been diagnosed. Furthermore, the social stigma around mental disorders discourages people from speaking out and leads them to turn to other outlets (e.g., social platforms).

For the scope of this project, I focused on scraping text from Reddit, one of the common platforms where people express and share their thoughts. I then ran multiple classification models (logistic regression, Naive Bayes, and a random forest classifier) and sentiment analysis (Hugging Face models) on the text. For the classification models, I used accuracy scores and confusion matrices as the metrics for picking the best model. For the sentiment analysis, I used the predicted probability as the metric for determining which model is better.
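
To make the two classification metrics concrete, here is a minimal sketch of how accuracy and the four confusion-matrix counts are computed for a two-class problem. It is plain Python rather than the modeling code used later in the project; the helper names `confusion_counts` and `accuracy` and the toy labels are assumptions made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive):
    """Return (tp, fp, fn, tn) for a two-class problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1   # predicted positive, actually positive
        elif pred == positive:
            fp += 1   # predicted positive, actually negative
        elif truth == positive:
            fn += 1   # predicted negative, actually positive
        else:
            tn += 1   # predicted negative, actually negative
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels, invented for illustration only (not project data).
y_true = ['Anxiety', 'Anxiety', 'depression', 'depression', 'Anxiety', 'depression']
y_pred = ['Anxiety', 'depression', 'depression', 'depression', 'Anxiety', 'Anxiety']

tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive='Anxiety')
print(tp, fp, fn, tn)              # 2 1 1 2
print(accuracy(y_true, y_pred))    # 4 correct out of 6
```

Accuracy alone hides which class is being misclassified, which is why the confusion matrix is read alongside it.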

These are the two subreddits:

- Anxiety subreddit: https://www.reddit.com/r/Anxiety/
- Depression subreddit: https://www.reddit.com/r/depression/

## Problem statement

As such, it would be interesting to explore how natural language processing and classification models can help Reddit and other interested parties classify texts based on the differences in the words used by people who suffer from depression and anxiety. Furthermore, how can sentiment analysis be utilized to detect emotions associated with depression and anxiety?

## Outline:
    
- Part 1: Introduction and Data Scraping
- Part 2: Data Cleaning
- Part 3: Preprocessing
- Part 4: Modeling 
- Part 5a-b: Exploratory Analysis with 2 models from Hugging Face
    - Model 1: https://huggingface.co/j-hartmann/emotion-english-roberta-large
    - Model 2: https://huggingface.co/arpanghoshal/EmoRoBERTa
- Part 5c: Hugging Face Models Insights

## Data Scraping

In [1]:
#import libraries 

import requests
import pandas as pd
import datetime

#### Data scraping will begin from Friday as it is usually the day that posts tend to get deleted by moderators for various reasons. 
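
The scrape is anchored to a Unix epoch timestamp, the `utc_timing` default of `getposts` (1664755200). A quick way to confirm which calendar moment that value encodes is to convert it with the standard `datetime` module, as sketched here; the variable name `start` is only illustrative.

```python
import datetime

utc_timing = 1664755200  # default start time passed to getposts

# Interpret the epoch seconds explicitly in UTC rather than in local time.
start = datetime.datetime.fromtimestamp(utc_timing, tz=datetime.timezone.utc)
print(start.strftime('%A %Y-%m-%d %H:%M:%S %Z'))  # Monday 2022-10-03 00:00:00 UTC
```

Running a check like this makes it easy to confirm that the default matches the intended start day. Note that `datetime.datetime.fromtimestamp` without a `tz` argument converts to the machine's local time zone, so the `date_time` values saved by the scraper depend on where it is run.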

In [1]:
# Custom function to grab posts from a subreddit through the Pushshift API

def getposts(subreddit, utc_timing=1664755200):  # default is Monday 2022-10-03 00:00 UTC, so posts are collected from Sunday backwards

    url = 'https://api.pushshift.io/reddit/search/submission'  # base endpoint; the subreddit is passed through params
    timing = utc_timing   # 'before' cursor: only posts older than this epoch time are returned
    posts_number = 0      # running count of posts collected so far
    total_post = []       # one dictionary per post

    while posts_number < 15000:  # stop once roughly 15,000 posts have been collected
        params = {
            'subreddit': subreddit,  # the subreddit to search
            'size': 250,             # posts requested per call (the most I could get)
            'before': timing         # starts at utc_timing, then moves to the oldest post seen so far
        }
        print(f'Searching through {subreddit}. Total post so far: {posts_number}')  # show that it is running
        res = requests.get(url, params=params)  # request the url with the params
        res.raise_for_status()       # fail loudly on an HTTP error instead of parsing an error page
        posts = res.json()['data']   # the list of posts sits under the 'data' key

        if not posts:  # no older posts left: stop instead of indexing into an empty list
            break

        for p in posts:
            # I save the date and time to check that the posts pulled cover the expected period.
            date_time = datetime.datetime.fromtimestamp(p['created_utc'])  # epoch seconds to local date and time
            total_post.append({
                'date_time': date_time.strftime('%Y-%m-%d %H:%M:%S'),  # readable format
                # .get() falls back to 'post deleted' when a field is missing from a post
                'subreddit': p.get('subreddit', 'post deleted'),
                'selftext': p.get('selftext', 'post deleted'),
                'title': p.get('title', 'post deleted'),
            })

        posts_number += len(posts)         # update the counter that controls the while loop
        timing = posts[-1]['created_utc']  # the oldest post in this batch becomes the next 'before' cursor

    posts_df = pd.DataFrame(total_post)  # turn the collected posts into a dataframe
    posts_df.to_csv('./dataset/' + subreddit + '.csv')  # saved as ./dataset/<subreddit>.csv; the dataset folder must already exist
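
The `before` cursor used by `getposts` can be tried offline. The sketch below swaps the real API call for a stand-in, `fetch_page`, that serves fake posts from a list, so the walk backwards in time can be checked without any network access. `fetch_page`, `collect`, and `fake_store` are hypothetical names, and the timestamps are invented. The sketch also stops when a page comes back empty, which guards against indexing into an empty list.

```python
def fetch_page(before, size, store):
    """Stand-in for the API call: newest-first posts older than `before`."""
    older = [p for p in store if p['created_utc'] < before]
    older.sort(key=lambda p: p['created_utc'], reverse=True)
    return older[:size]


def collect(store, before, size, target):
    """Walk backwards in time until `target` posts are collected or none remain."""
    collected = []
    while len(collected) < target:
        page = fetch_page(before, size, store)
        if not page:   # nothing older left, so stop
            break
        collected.extend(page)
        before = page[-1]['created_utc']   # oldest post becomes the next cursor
    return collected


# Ten fake posts with timestamps 100, 110, ..., 190 (not real Reddit data).
fake_store = [{'created_utc': t} for t in range(100, 200, 10)]

got = collect(fake_store, before=200, size=4, target=8)
print([p['created_utc'] for p in got])   # [190, 180, 170, 160, 150, 140, 130, 120]

everything = collect(fake_store, before=200, size=4, target=50)
print(len(everything))                   # 10, the store runs out before the target
```

The second call shows why the empty-page check matters: when a subreddit has fewer posts than the target, the loop ends cleanly instead of raising an error.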
    

In [3]:
getposts('Anxiety')

Searching through Anxiety. Total post so far: 0
Searching through Anxiety. Total post so far: 250
Searching through Anxiety. Total post so far: 500
Searching through Anxiety. Total post so far: 750
Searching through Anxiety. Total post so far: 1000
Searching through Anxiety. Total post so far: 1250
Searching through Anxiety. Total post so far: 1500
Searching through Anxiety. Total post so far: 1750
Searching through Anxiety. Total post so far: 1999
Searching through Anxiety. Total post so far: 2249
Searching through Anxiety. Total post so far: 2499
Searching through Anxiety. Total post so far: 2749
Searching through Anxiety. Total post so far: 2999
Searching through Anxiety. Total post so far: 3249
Searching through Anxiety. Total post so far: 3499
Searching through Anxiety. Total post so far: 3748
Searching through Anxiety. Total post so far: 3998
Searching through Anxiety. Total post so far: 4248
Searching through Anxiety. Total post so far: 4498
Searching through Anxiety. Total post

In [4]:
getposts('Depression')

Searching through Depression. Total post so far: 0
Searching through Depression. Total post so far: 250
Searching through Depression. Total post so far: 500
Searching through Depression. Total post so far: 750
Searching through Depression. Total post so far: 1000
Searching through Depression. Total post so far: 1250
Searching through Depression. Total post so far: 1499
Searching through Depression. Total post so far: 1749
Searching through Depression. Total post so far: 1999
Searching through Depression. Total post so far: 2248
Searching through Depression. Total post so far: 2497
Searching through Depression. Total post so far: 2745
Searching through Depression. Total post so far: 2995
Searching through Depression. Total post so far: 3245
Searching through Depression. Total post so far: 3495
Searching through Depression. Total post so far: 3745
Searching through Depression. Total post so far: 3995
Searching through Depression. Total post so far: 4245
Searching through Depression. Tota

## It is time to proceed to data cleaning! 