# Project 3

**Contents**
1. Problem Statement
2. Import Relevant Libraries
3. Retrieve Data from Alcoholic Anonymous Subreddit
4. Retrieve Data from Stop Smoking Subreddit


## Problem Statement


Due to the COVID-19 pandemic, the ability to physically visit facilities of any nature has become limited. Specifically, walking into a rehabilitative facility on a whim is no longer possible without first testing negative for COVID-19. This discourages people from seeking help: their needs can no longer be attended to immediately, and the expected delays often put them off seeking help altogether.

People are cooped up at home more now, and it is no surprise that the stress of living through such unprecedented times is catching up with them. With few other avenues to cope with this novel situation, people are drinking, smoking and abusing drugs more, and it is easy to slide into addiction.

People who recognise that they are becoming addicted to a substance often do want to seek help. Given how difficult it is to seek help in person during the pandemic, recovering addicts have turned to social media platforms such as Reddit to air their concerns. While this allows them to offer each other peer-to-peer recovery support, it does not give them equally easy access to professional resources.

Rehabilitative facilities have therefore reached out to our team of data scientists and tasked us with identifying people seeking aid for addiction, specifically those seeking to curb a smoking or drinking problem. Help is available, but there is no means of accessing professional services online that serves addicts as well as a physical visit to a rehabilitative centre would. We aim to create an online chatbox for these facilities so that any query a user enters can be identified almost instantly as a smoking or drinking problem, based on the keywords used in the query. Within seconds, users can then be redirected to targeted professional aid to guide them along their path to recovery.

### The process of addressing the problem statement

This section details the process of creating the chatbox. 2000 posts will be scraped from each of the subreddits r/stopsmoking and r/alcoholicsanonymous. These datasets will then be checked for duplicate posts, unwanted characters such as emojis, and URLs. Duplicate rows will be dropped, and unwanted characters and URLs will be scrubbed from the title and selftext columns. All other columns will be dropped, as we want to focus on text posts for this classification task. A new column combining the title and selftext will be created so that it captures all the text of a post; this column will be used in our modelling process. The r/alcoholicsanonymous and r/stopsmoking datasets will be encoded as 0 and 1 respectively before being merged.

The merged data will then undergo pre-processing, which involves removing stop words, tokenizing words and lemmatizing them. After some exploratory data analysis to gain preliminary insights, classification models such as Naive Bayes, K-Nearest Neighbours and Logistic Regression will be trained on the data so that they can classify whether an unseen post comes from r/stopsmoking or r/alcoholicsanonymous. The best model for predicting whether a post was written by a recovering smoker or a recovering alcoholic (judged on classification metrics such as accuracy or the area under the ROC curve) will then be used to construct the chatbox.
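As a rough illustration of the scrubbing and column-combining steps described above, here is a minimal sketch using only Python's `re` module. The function names `clean_text` and `combine_post` and both regular expressions are assumptions made for this example; the data cleaning notebook may do this differently.

```python
import re

# Hypothetical patterns for illustration only.
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
# Crude stand-in for emoji removal: this drops every non-ASCII character,
# which also removes accented letters.
NON_ASCII_PATTERN = re.compile(r'[^\x00-\x7F]+')


def clean_text(text):
    """Strip URLs and non-ASCII characters, then collapse repeated whitespace."""
    text = URL_PATTERN.sub(' ', text)
    text = NON_ASCII_PATTERN.sub(' ', text)
    return re.sub(r'\s+', ' ', text).strip()


def combine_post(title, selftext):
    """Join a post's title and body into one cleaned string for modelling."""
    return clean_text(f'{title} {selftext}')


print(combine_post('Day 3 🎉', 'Read this: https://example.com it helped'))
```

With a DataFrame, the same function could be applied row by row to the title and selftext columns to build the combined text column.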

## Import Relevant Libraries

In [1]:
import requests
import pandas as pd

In [2]:
url = 'https://api.pushshift.io/reddit/search/submission'

## Retrieve Data from Alcoholic Anonymous Subreddit

We will use the PushShift API to pull data from our subreddits. Let's start with the r/alcoholicsanonymous subreddit.

Note that the PushShift API returns at most 100 posts per request, so we have to make multiple requests to collect all of our data.
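The idea behind making multiple requests is to page backwards in time: each request asks only for posts older than the oldest post seen so far. The sketch below shows that logic in isolation, with no network access. `fetch_posts` and `fake_fetch_page` are hypothetical names introduced for this example, and the fake fetcher simply generates descending timestamps in place of a real API call. It also assumes that `before` excludes the given timestamp.

```python
def fetch_posts(fetch_page, target=2000):
    """Collect up to `target` posts by paging backwards in time.

    `fetch_page(before)` is a hypothetical callable that returns a list of
    post dicts, each with a 'created_utc' key. Passing before=None asks for
    the most recent page.
    """
    posts = []
    before = None
    while len(posts) < target:
        page = fetch_page(before)
        if not page:  # nothing left to fetch, so stop instead of looping forever
            break
        posts.extend(page)
        # The oldest timestamp in this page bounds the next request.
        before = min(p['created_utc'] for p in page)
    return posts[:target]


def fake_fetch_page(before, page_size=100):
    """Stand-in for the API: returns posts with timestamps just below `before`."""
    newest = 10_000 if before is None else before - 1
    oldest = max(newest - page_size, 0)
    return [{'created_utc': t} for t in range(newest, oldest, -1)]


posts = fetch_posts(fake_fetch_page, target=250)
print(len(posts))
```

In a real run, `fetch_page` would wrap a `requests.get` call that passes `before` as a query parameter and returns the `'data'` list from the JSON response.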

In [3]:
# Set up parameters for PushShift API to retrieve the data from the Alcoholic Anonymous subreddit
params = {
    'subreddit': 'alcoholicsanonymous', 'size': '100'
}

In [4]:
df_alc_anon = pd.DataFrame()
total_data = 0
last = 0
# Initialise our memory variables

while total_data < 2000:
    if last == 0: # First pass of the loop, so the 'before' parameter should be omitted
        params.pop('before', None)
    else:
        params['before'] = last
        # A previous batch is already in our DataFrame, so set 'before' to the oldest timestamp retrieved so far

    response = requests.get(url, params=params) # Get a response from PushShift API
    response.raise_for_status() # Stop early if the request failed
    data = response.json() # Decode the JSON body of the response
    post = pd.DataFrame(data['data']) # Load the decoded posts into a DataFrame
    if post.empty: # No more posts available, so stop rather than loop forever
        break
    df_alc_anon = pd.concat([df_alc_anon, post]) # Append the new batch underneath the existing data
    total_data += len(post) # Increase our counter by the number of posts retrieved
    df_alc_anon.sort_values(by = 'created_utc', ascending = False, inplace = True) # Sort posts from newest to oldest
    last = int(df_alc_anon['created_utc'].iloc[-1]) # The oldest timestamp becomes 'before' for the next request

df_alc_anon.reset_index(drop = True, inplace = True) # The index is no longer sequential after concatenating and sorting, so reset it
df_alc_anon.info() # Review the retrieved data to check it is as expected

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 70 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   all_awardings                  2000 non-null   object 
 1   allow_live_comments            2000 non-null   bool   
 2   author                         2000 non-null   object 
 3   author_flair_css_class         0 non-null      object 
 4   author_flair_richtext          1991 non-null   object 
 5   author_flair_text              0 non-null      object 
 6   author_flair_type              1991 non-null   object 
 7   author_fullname                1991 non-null   object 
 8   author_is_blocked              2000 non-null   bool   
 9   author_patreon_flair           1991 non-null   object 
 10  author_premium                 1991 non-null   object 
 11  awarders                       2000 non-null   object 
 12  can_mod_post                   2000 non-null   b

### Save Pulled Alcoholics Data into CSV
With the data retrieved, let's save it to a CSV file.

In [5]:
# save our data into csv
try:
    df_alc_anon.to_csv('../data/alcoholicsanonymous_raw.csv')
except PermissionError: # Raised when the file is locked, e.g. open in another program
    print('alcoholicsanonymous_raw.csv is currently open in another program. Please close it and rerun this cell.')

## Retrieve Data from Stop Smoking Subreddit
With our alcoholics data in, we now need the data from the smoking subreddit. Let's pull it in the same way.

In [6]:
# Set up parameters for PushShift API to retrieve the data from the Stop Smoking subreddit
params = {
    'subreddit': 'stopsmoking', 'size': '100'
}

In [7]:
df_stop_smoking = pd.DataFrame()
total_data = 0
last = 0
# Initialise our memory variables

while total_data < 2000:
    if last == 0: # First pass of the loop, so the 'before' parameter should be omitted
        params.pop('before', None)
    else:
        params['before'] = last
        # A previous batch is already in our DataFrame, so set 'before' to the oldest timestamp retrieved so far

    response = requests.get(url, params=params) # Get a response from PushShift API
    response.raise_for_status() # Stop early if the request failed
    data = response.json() # Decode the JSON body of the response
    post = pd.DataFrame(data['data']) # Load the decoded posts into a DataFrame
    if post.empty: # No more posts available, so stop rather than loop forever
        break
    df_stop_smoking = pd.concat([df_stop_smoking, post]) # Append the new batch underneath the existing data
    total_data += len(post) # Increase our counter by the number of posts retrieved
    df_stop_smoking.sort_values(by = 'created_utc', ascending = False, inplace = True) # Sort posts from newest to oldest
    last = int(df_stop_smoking['created_utc'].iloc[-1]) # The oldest timestamp becomes 'before' for the next request

df_stop_smoking.reset_index(drop = True, inplace = True) # The index is no longer sequential after concatenating and sorting, so reset it
df_stop_smoking.info() # Review the retrieved data to check it is as expected

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 81 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   all_awardings                  2000 non-null   object 
 1   allow_live_comments            2000 non-null   bool   
 2   author                         2000 non-null   object 
 3   author_flair_css_class         197 non-null    object 
 4   author_flair_richtext          1993 non-null   object 
 5   author_flair_text              197 non-null    object 
 6   author_flair_type              1993 non-null   object 
 7   author_fullname                1993 non-null   object 
 8   author_is_blocked              2000 non-null   bool   
 9   author_patreon_flair           1993 non-null   object 
 10  author_premium                 1993 non-null   object 
 11  awarders                       2000 non-null   object 
 12  can_mod_post                   2000 non-null   b

### Save Pulled Smoking Data into CSV
With the data retrieved, let's save it to a CSV file.

In [8]:
# save our data into csv
df_stop_smoking.to_csv('../data/stopsmoking_raw.csv')

## End of Data Retrieval
With our data scraped, we can proceed to the next part of our analysis: cleaning the data. This is done in the data cleaning notebook.