<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web APIs & NLP ( Part 1 )

## About the project

This project covers three of the major topics in the class: classification modeling, natural language processing (NLP), and data wrangling/acquisition.

For this project, we will scrape data from two subreddits, then apply natural language processing to train a classifier that predicts which subreddit a given post comes from.

---

## Problem Statement

According to Bachmann, S., "Epidemiology of Suicide and the Psychiatric Perspective", most suicides are related to psychiatric disease, with depression, substance use disorders and psychosis being the most relevant risk factors. In view of this finding, a newly developed social media application, Chipper, has implemented a feature that lets users report other users' posts for suspected mental health issues, so that the company can offer help to these users before it is too late.

As a data scientist working at this company, I am tasked with training a classifier that categorises posts reported for mental health issues as either Anxiety or Depression, so that we can route these users to the appropriate helpline. To train the classifier, I will use posts from Reddit's r/Anxiety and r/Depression subreddits as proxy data.

Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6165520/#B1-ijerph-15-02028

---

## In this section

In this section, we only scrape data from the `Anxiety` and `Depression` subreddits.

---

## Pushshift API

The pushshift.io Reddit API was designed and created by the /r/datasets mod team to help provide enhanced functionality and search capabilities for searching Reddit comments and submissions.
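As a rough illustration of how a submission-search request is addressed, the sketch below assembles a query URL from the same endpoint and query parameters (`subreddit`, `size`, `after`) that the scrape function in this notebook uses. The helper name `build_pushshift_url` is hypothetical and not part of any library, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# Endpoint taken from the scrape function in this notebook
BASE_URL = "https://api.pushshift.io/reddit/search/submission"


def build_pushshift_url(subreddit, size=100, after_days=None):
    """Hypothetical helper: assemble a submission-search URL without sending it."""
    params = {"subreddit": subreddit, "size": size}
    if after_days is not None:
        # The notebook passes relative offsets such as "30d" (30 days ago)
        params["after"] = f"{after_days}d"
    return f"{BASE_URL}?{urlencode(params)}"


print(build_pushshift_url("Anxiety", after_days=30))
# https://api.pushshift.io/reddit/search/submission?subreddit=Anxiety&size=100&after=30d
```

Using `urlencode` rather than string formatting keeps the query string valid if a parameter value ever contains characters that need escaping.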

In [7]:
# pip install psaw

---

### Import Libraries

In [1]:
# Import libraries
import requests
import pandas as pd
import datetime as dt 
import time
import random

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

---

### Scrape Function

In [2]:
def get_reddit_posts(subreddit, n, days = 30):
    
    # Url
    base_url = 'https://api.pushshift.io/reddit/search/submission'
    full_url = f'{base_url}?subreddit={subreddit}&size=100'
    
    # Creating an empty list to store the posts
    posts = []
    
    # Each iteration shifts the `after` window back by another `days` days
    for i in range(1, n+1):
        urlmod = '{}&after={}d'.format(full_url, days*i)
        
        # Skip this window if the request fails or returns a non-200 status
        try:
            res = requests.get(urlmod)
            assert res.status_code == 200
        except (requests.exceptions.RequestException, AssertionError):
            continue
        
        # Converting to json
        extracted = res.json()['data']
        # Constructing a dataframe from dict
        df = pd.DataFrame.from_dict(extracted)
        # Adding the df to post list(created on top)
        posts.append(df)
        
        # Total number of posts scraped so far
        total_scraped = sum(len(x) for x in posts)
        
        # Stop once more than n posts have been collected
        if total_scraped > n:
            break
        
        # Generate a random sleep duration to seem like a human user
        sleep_duration = random.randint(1,9)
        time.sleep(sleep_duration)
            
    
    # Columns that we will be using
    features_of_interest = ['subreddit', 'title', 'selftext']
    
    # Return an empty frame if every request failed (pd.concat raises on an empty list)
    if not posts:
        return pd.DataFrame(columns=features_of_interest)
    
    # Combine all iterations
    final_df = pd.concat(posts, sort=False)
    # Keep only the required columns and drop duplicate posts
    final_df = final_df[features_of_interest].drop_duplicates()
    return final_df.reset_index(drop=True)

---

### Scraped on 24 Nov 2022, 11:43 A.M.

In [3]:
anxiety = get_reddit_posts('Anxiety', 3000)
depression = get_reddit_posts('depression', 3000)

print(f"Scraped {len(anxiety)} posts on 'Anxiety' using Pushshift")
print(f"Scraped {len(depression)} posts on 'Depression' using Pushshift")

Scraped 3085 posts on 'Anxiety' using Pushshift
Scraped 3082 posts on 'Depression' using Pushshift
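Both counts are slightly above the requested 3000 because the function stops only once the running total exceeds `n`. The sketch below replays that stopping rule offline. It assumes every request succeeds and returns a full page of 100 rows, which the real run does not guarantee. The function name `rows_before_dedup` is hypothetical.

```python
def rows_before_dedup(n, page_size=100):
    """Replay the stopping rule: add full pages until the total exceeds n."""
    total = 0
    requests_made = 0
    while total <= n:
        total += page_size
        requests_made += 1
    return requests_made, total


requests_made, total = rows_before_dedup(3000)
print(requests_made, total)  # 31 3100
```

Under that assumption, 31 requests yield 3100 rows, and dropping duplicates would then account for the gap down to the 3085 and 3082 posts reported above.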


---

### Exporting Dataset

In [5]:
anxiety.to_csv('./datasets/anxiety.csv')
depression.to_csv('./datasets/depression.csv')
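Two details of this export are easy to miss: `to_csv` does not create the target directory, and by default it writes the DataFrame index as an unnamed first column. The following is a minimal sketch on toy data, written to a temporary folder rather than `./datasets`, showing one way to handle both. The toy frame and file name are made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the scraped data (not real posts)
toy = pd.DataFrame({
    "subreddit": ["Anxiety", "Anxiety"],
    "title": ["first title", "second title"],
    "selftext": ["first body", "second body"],
})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "datasets"
    # Create the folder first; to_csv fails if the directory is missing
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "toy.csv"

    # The index is written as an unnamed first column by default
    toy.to_csv(out_path)

    # index_col=0 reads that column back as the index instead of as data
    restored = pd.read_csv(out_path, index_col=0)

print(restored.columns.tolist())
```

Passing `index=False` to `to_csv` is an alternative that leaves the index out of the file altogether.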