# Project 3: Data Scraping From Subreddits
--- 


## Context
We are a group of consultants hired to present our data models to a panel of directors of rehabilitation centers for smokers and alcoholics.

## Problem Statement
The recent COVID-19 situation has restricted rehabilitation for many people with smoking or alcohol issues. This has led to a rise in alcoholism and smoking in communities. Directors of these rehabilitation centers are looking for more innovative ways to engage the community about alcoholism and smoking.

One way of engaging patients is to reach them online. Many of them have turned to subreddits to post about their problems and concerns. We can leverage these posts to direct each patient to the appropriate treatment.

Our director has asked us to create an algorithm that can more efficiently process requests from patients who will be using the centers' online platform for rehabilitation. This algorithm will be used in the online frontend to quickly classify patients as having alcohol or smoking problems, minimising wasted time and resources and allowing help to be rendered more efficiently.

## Goal of this Data Analysis
Our goal in this data analysis is to classify our patients' posts as relating to smoking or alcohol issues. This is a classification problem and requires supervised machine learning. We will be using the Naive Bayes and KNN classification models in our data analysis.
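
To make the classification idea concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier with add-one smoothing, written with the standard library only. The toy posts, labels and function names are illustrative assumptions and not our scraped data; the actual modelling is done in a later notebook and may use a library implementation instead.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(texts, labels):
    """Count documents per class and words per class (hypothetical helper)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = text.lower().split()  # naive whitespace tokenisation
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_naive_bayes(model, text):
    """Return the class with the highest log posterior for `text`."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)


# Toy examples, assumed for illustration only
toy_texts = [
    "one week without a drink",
    "went to my first meeting sober",
    "three days without a cigarette",
    "cravings for nicotine are strong",
]
toy_labels = ["alcohol", "alcohol", "smoking", "smoking"]

model = train_naive_bayes(toy_texts, toy_labels)
prediction = predict_naive_bayes(model, "no cigarette today")
```

Words that never appear in the training posts are skipped, so the prediction rests only on vocabulary the model has seen.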

## Baseline Model
Currently, directors can only gauge whether posts concern smoking or alcoholism by pulling a set of data from the subreddits and looking at the majority of the data pulled before deciding which issue it relates to. Should any future data be sent to them, they have no machine learning means of recognising whether it belongs to alcoholism or smoking.
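
One way to put a number on this baseline is to always predict the majority class and measure how often that is right. The sketch below assumes the two scraped sets are combined with 2,000 posts each before any cleaning; the label names and the helper function are hypothetical.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Assumed class sizes: 2,000 posts per subreddit, before cleaning
labels = ["alcohol"] * 2000 + ["smoking"] * 2000
baseline_label, baseline_accuracy = majority_baseline(labels)
print(baseline_label, baseline_accuracy)
```

With balanced classes the baseline accuracy is 0.5, so any useful model should score clearly above that.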

## Success Evaluation
We will evaluate the success of our model by its prediction accuracy on both the train and test data (split 80-20 from our data), compared against the accuracy of our baseline model.
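
The 80-20 split and the accuracy score can be sketched with the standard library as follows. The function names, the fixed seed and the toy rows are assumptions for illustration; the modelling notebook may use a library helper for the same purpose.

```python
import random


def split_80_20(rows, seed=42):
    """Shuffle a copy of `rows` and split it into 80% train and 20% test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy rows of (post text, label), assumed for illustration only
rows = [(f"post {i}", "alcohol" if i % 2 == 0 else "smoking") for i in range(10)]
train, test = split_80_20(rows)
```

Comparing train and test accuracy side by side also shows whether the model is overfitting: a train score far above the test score is a warning sign.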

### Import Relevant Libraries

In [1]:
import requests
import pandas as pd

In [2]:
url = 'https://api.pushshift.io/reddit/search/submission'

## Retrieve Data from Alcoholics Anonymous Subreddit

We will be using the PushShift API to pull data from our subreddits. Let's start by pulling the data from the Alcoholics Anonymous subreddit.

Note that the PushShift API does not allow pulling more than 100 posts per request, so we have to make multiple requests to scrape our data.
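
The paging idea is to request a page, take the oldest `created_utc` value in it, and pass that value as `before` on the next request. Here is a minimal sketch of that loop without any network calls. `fetch_page` is a hypothetical callable standing in for the real request, and the fake fetcher assumes `before` excludes the given timestamp.

```python
def collect_posts(fetch_page, target=2000, page_size=100):
    """Page backwards through time until `target` posts are collected.

    `fetch_page(before, size)` is a hypothetical callable returning a list of
    post dicts, each with a 'created_utc' key.
    """
    posts = []
    before = None  # None means "start from the newest posts"
    while len(posts) < target:
        page = fetch_page(before, page_size)
        if not page:  # nothing left to fetch, stop instead of looping forever
            break
        posts.extend(page)
        # The oldest timestamp on this page becomes the next 'before' value
        before = min(post["created_utc"] for post in page)
    return posts[:target]


def fake_fetch_page(before, size):
    """Stand-in for the API: serves posts with timestamps 250 down to 1."""
    newest = 250 if before is None else before - 1  # assumes 'before' is exclusive
    oldest = max(newest - size, 0)
    return [{"created_utc": ts} for ts in range(newest, oldest, -1)]


posts = collect_posts(fake_fetch_page, target=230)
```

In this notebook, the callable would wrap the `requests.get` call and return `response.json()['data']`.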

In [3]:
# Set up parameters for PushShift API to retrieve the data from the Alcoholics Anonymous subreddit
params = {
    'subreddit': 'alcoholicsanonymous', 'size': '100'
}

In [4]:
df_alc_anon = pd.DataFrame()
total_data = 0
last = 0
# Initialise our memory variables

while total_data < 2000:
    if last == 0: # This means it is a fresh run of the loop and thus 'before' parameter should be omitted
        params.pop('before', None)
    else:
        params['before'] = last 
        # There was a previous entry from PushShift API into our DataFrame, set 'before' to the previous oldest entry
        
    response = requests.get(url, params=params) # Get a response from PushShift API
    data = response.json() # Decode the JSON body of the PushShift response
    post = pd.DataFrame(data['data']) # Load the decoded data into our post DataFrame object
    if post.empty: # No more posts available, stop instead of looping forever
        break
    df_alc_anon = pd.concat([df_alc_anon, post]) # Concatenate the old and new DataFrames together with the new one underneath.
    total_data += len(post) # Increase our counter by the number of posts we retrieved in post
    df_alc_anon.sort_values(by = 'created_utc', ascending = False, inplace = True) # sort data by their post time, newest first
    last = int(df_alc_anon['created_utc'].iloc[-1]) # oldest timestamp so far, used as 'before' in our next loop

df_alc_anon.reset_index(drop = True, inplace = True) # The index is no longer sequential after concatenation, reset it to maintain continuity
df_alc_anon.info() # Review our Data Retrieval to ensure it is correct

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 70 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   all_awardings                  2000 non-null   object 
 1   allow_live_comments            2000 non-null   bool   
 2   author                         2000 non-null   object 
 3   author_flair_css_class         0 non-null      object 
 4   author_flair_richtext          1991 non-null   object 
 5   author_flair_text              0 non-null      object 
 6   author_flair_type              1991 non-null   object 
 7   author_fullname                1991 non-null   object 
 8   author_is_blocked              2000 non-null   bool   
 9   author_patreon_flair           1991 non-null   object 
 10  author_premium                 1991 non-null   object 
 11  awarders                       2000 non-null   object 
 12  can_mod_post                   2000 non-null   b

### Save Pulled Alcoholics Data into CSV
With the data retrieved, let's save it to a CSV file.

In [5]:
# save our data into csv
try:
    df_alc_anon.to_csv('../data/alcoholicsanonymous_raw.csv')
except PermissionError: # Raised when the file is locked by another program
    print('alcoholicsanonymous_raw.csv is currently open in another program. Please close it and rerun this code cell.')

## Retrieve Data from Stop Smoking Subreddit
With our alcoholics data in, we now need our smoking subreddit data. Let's pull that data as well.

In [6]:
# Set up parameters for PushShift API to retrieve the data from the Stop Smoking subreddit
params = {
    'subreddit': 'stopsmoking', 'size': '100'
}

In [7]:
df_stop_smoking = pd.DataFrame()
total_data = 0
last = 0
# Initialise our memory variables

while total_data < 2000:
    if last == 0: # This means it is a fresh run of the loop and thus 'before' parameter should be omitted
        params.pop('before', None)
    else:
        params['before'] = last 
        # There was a previous entry from PushShift API into our DataFrame, set 'before' to the previous oldest entry
        
    response = requests.get(url, params=params) # Get a response from PushShift API
    data = response.json() # Decode the JSON body of the PushShift response
    post = pd.DataFrame(data['data']) # Load the decoded data into our post DataFrame object
    if post.empty: # No more posts available, stop instead of looping forever
        break
    df_stop_smoking = pd.concat([df_stop_smoking, post]) # Concatenate the old and new DataFrames together with the new one underneath.
    total_data += len(post) # Increase our counter by the number of posts we retrieved in post
    df_stop_smoking.sort_values(by = 'created_utc', ascending = False, inplace = True) # sort data by their post time, newest first
    last = int(df_stop_smoking['created_utc'].iloc[-1]) # oldest timestamp so far, used as 'before' in our next loop

df_stop_smoking.reset_index(drop = True, inplace = True) # The index is no longer sequential after concatenation, reset it to maintain continuity
df_stop_smoking.info() # Review our Data Retrieval to ensure it is correct

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 81 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   all_awardings                  2000 non-null   object 
 1   allow_live_comments            2000 non-null   bool   
 2   author                         2000 non-null   object 
 3   author_flair_css_class         197 non-null    object 
 4   author_flair_richtext          1993 non-null   object 
 5   author_flair_text              197 non-null    object 
 6   author_flair_type              1993 non-null   object 
 7   author_fullname                1993 non-null   object 
 8   author_is_blocked              2000 non-null   bool   
 9   author_patreon_flair           1993 non-null   object 
 10  author_premium                 1993 non-null   object 
 11  awarders                       2000 non-null   object 
 12  can_mod_post                   2000 non-null   b

### Save Pulled Smoking Data into CSV
With the data retrieved, let's save it to a CSV file.

In [8]:
# save our data into csv
df_stop_smoking.to_csv('../data/stopsmoking_raw.csv')

## End of Data Retrieval
With our data scraped, we can proceed to the next portion of our analysis, which is cleaning the data. This will be done in the data cleaning notebook.