## <img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3:

## Problem Statement

We are a group of data scientists hired by a rehabilitation center to help build a chatbot or online platform that identifies users who need help. We focus on the two most common addictions, smoking and alcoholism, because tobacco and alcohol are the most accessible legal substances available to adults.

From the subreddits r/alcoholicsanonymous and r/stopsmoking, we aim to identify whether a user is reaching out to address a smoking addiction or alcoholism.

We will use Logistic Regression and Multinomial Naive Bayes to build the classification model, with two vectorizers: CVEC (Count Vectorization) and TF-IDF (Term Frequency-Inverse Document Frequency). The n-gram range within each vectorizer is tuned as a further parameter.

The model's success is measured by train and test accuracy, Receiver Operating Characteristic Area Under the Curve (ROC AUC), sensitivity, specificity and precision.
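
As a minimal sketch of how the threshold-based metrics above relate to a confusion matrix, the function below computes accuracy, sensitivity, specificity and precision from raw counts. The function name and the counts are hypothetical and are not results from this project; which subreddit is treated as the "positive" class is also an assumption. ROC AUC is left out because it needs predicted probabilities rather than counts (scikit-learn's `roc_auc_score` is one way to compute it).

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        'accuracy': (tp + tn) / total,     # share of all posts classified correctly
        'sensitivity': tp / (tp + fn),     # true positive rate (recall)
        'specificity': tn / (tn + fp),     # true negative rate
        'precision': tp / (tp + fp),       # share of positive predictions that are correct
    }

# Hypothetical counts, for illustration only
metrics = classification_metrics(tp=45, fp=5, tn=40, fn=10)
for name, value in metrics.items():
    print(f'{name}: {value:.3f}')
```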

### Contents:
- [Problem Statement](#Problem-Statement)
- [Background](#Background)
- [Executive Summary](#Executive-Summary)
- [Data Scraping from Subreddits](#Data-Scraping-from-Subreddits)

## Background

#### Subreddits
Reddit is a website made up of an aggregation of forums where people share news and content and discuss any topic of their choosing ([*source*](https://www.digitaltrends.com/web/what-is-reddit/)). Reddit is divided into communities known as 'subreddits', whose names start with 'r/'. For instance, r/boardgames is a subreddit for people to discuss board games, while r/starwars is a subreddit for Star Wars enthusiasts.

#### Addiction
Addiction is a condition of dependence on a particular substance or activity. Two of the most common forms are addiction to alcohol and to nicotine.

Alcoholism involves problems controlling one's drinking that persist even when the drinking affects the person's overall quality of life. Such consumption may lead to tolerance, causing the person to consume more to attain the same effect, or to withdrawal symptoms when trying to cut off alcohol ([source](https://www.mayoclinic.org/diseases-conditions/alcohol-use-disorder/symptoms-causes/syc-20369243), [source](https://www.healthline.com/health/alcoholism/basics)).

Cigarette smoking is associated with a host of diseases including (but not limited to) cardiovascular disease, respiratory disease and even cancer ([source](https://www.cdc.gov/tobacco/data_statistics/fact_sheets/health_effects/effects_cig_smoking/index.htm)). Even when smokers know this, quitting is hard because cigarettes contain tobacco, which often leads to addiction ([source](https://www.healthhub.sg/live-healthy/615/smoking_habitoraddiction)).

#### r/alcoholicsanonymous and r/stopsmoking
Smoking and alcoholism were chosen because tobacco and alcohol are the most accessible legal substances available to adults. A common thread between these two addictions is the difficulty of breaking them. As a result, people looking to quit often seek external help, whether professional treatment or a community of like-minded people. The subreddits r/alcoholicsanonymous and r/stopsmoking provide such a community, and with COVID-19 and the accompanying social restrictions, these online communities offer a feasible alternative to in-person meetings and support groups.

## Datasets

Data was collected from two subreddits through the Pushshift API.

1. Pushshift API Link: https://github.com/pushshift/api
- The Pushshift API lets us collect data from reddit.com as JSON that can be processed programmatically.

2. alcoholicsanonymous_raw.csv
- Subreddit for discussing alcoholic recovery
- Source: https://www.reddit.com/r/alcoholicsanonymous/

3. stopsmoking_raw.csv 
- Subreddit for discussing smoking cessation
- Source: https://www.reddit.com/r/stopsmoking/

## Executive Summary

This project aims to create a classification model based on the subreddits r/alcoholicsanonymous and r/stopsmoking to classify posts based on their key features.

The data was scraped from both subreddits and cleaned by removing duplicate posts and imputing null values. Special characters and emojis were removed via regex, and the text was cleaned through punctuation removal, tokenization and removal of stopwords. Stemming and lemmatizing were compared to see which was more appropriate for obtaining the root word, and lemmatizing was chosen as it produced more meaningful words.
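
The cleaning steps above could be sketched roughly as follows. This is an illustration on toy data, not the project's actual code: the column names `title` and `selftext`, the tiny stopword set and the helper name `clean_text` are all assumptions. Lemmatization is omitted to keep the example self-contained, since a lemmatizer such as NLTK's `WordNetLemmatizer` needs a separate corpus download.

```python
import re
import pandas as pd

# Tiny illustrative stopword set (assumption); a real run would use a fuller list
STOPWORDS_SKETCH = {'a', 'an', 'the', 'to', 'and', 'for', 'my', 'of', 'is', 'it', 'are', 'i'}

def clean_text(text):
    """Lowercase, replace non-letters with spaces, tokenize on whitespace, drop stopwords."""
    text = text.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)   # removes punctuation, digits and emojis
    tokens = text.split()                   # simple whitespace tokenization
    return [tok for tok in tokens if tok not in STOPWORDS_SKETCH]

# Toy posts standing in for scraped data (hypothetical content)
posts = pd.DataFrame({
    'title': ['Day 3 without a cigarette!', 'Day 3 without a cigarette!', 'First AA meeting'],
    'selftext': ['Cravings are rough 😩', 'Cravings are rough 😩', None],
})

# Remove duplicate posts, then impute missing post bodies with an empty string
posts = posts.drop_duplicates(subset=['title', 'selftext']).reset_index(drop=True)
posts['selftext'] = posts['selftext'].fillna('')

# Combine title and body, then clean
posts['tokens'] = (posts['title'] + ' ' + posts['selftext']).apply(clean_text)
print(posts['tokens'].tolist())
```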

Both CVEC and TVEC were applied to see how each vectorizes the text, and unigrams, bigrams and trigrams were generated to visualise the top features in each. New stopwords were added based on overlapping words detected in the n-grams.
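
One way to list the top n-grams, assuming scikit-learn's `CountVectorizer` is the CVEC implementation, is sketched below. The helper `top_ngrams` and the toy documents are hypothetical. The last call shows how a word that dominates the counts (here `quit`) could be passed in as an extra stopword.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def top_ngrams(docs, ngram_range=(1, 1), n=5, stop_words=None):
    """Return the n most frequent n-grams in docs as (term, count) pairs."""
    cvec = CountVectorizer(ngram_range=ngram_range, stop_words=stop_words)
    counts = cvec.fit_transform(docs)
    totals = np.asarray(counts.sum(axis=0)).ravel()          # total count per term
    vocab = sorted(cvec.vocabulary_, key=cvec.vocabulary_.get)  # terms in column order
    # Sort by descending count, breaking ties alphabetically
    ranked = sorted(zip(vocab, totals), key=lambda pair: (-pair[1], pair[0]))
    return [(term, int(count)) for term, count in ranked[:n]]

# Toy documents, for illustration only
docs = ['quit smoking today', 'quit smoking cold turkey', 'quit drinking today']

unigrams = top_ngrams(docs, ngram_range=(1, 1), n=3)
bigrams = top_ngrams(docs, ngram_range=(2, 2), n=1)
without_quit = top_ngrams(docs, ngram_range=(1, 1), n=2, stop_words=['quit'])

print(unigrams)
print(bigrams)
print(without_quit)
```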

Logistic regression and multinomial naive Bayes were chosen as candidate models. A pipeline with grid search determined the optimal parameters for each vectorizer-model combination. The final model was CVEC with multinomial NB, as it produced the highest accuracy (96.9%), though all models fared well.
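
A pipeline with grid search for the CVEC and multinomial NB combination might look like the sketch below, assuming scikit-learn. The toy posts, the label encoding (1 for alcohol, 0 for smoking), the parameter grid and the 2-fold cross-validation are all assumptions made for illustration; they are not the settings or data behind the accuracy reported above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy posts (hypothetical); 1 = alcohol-related, 0 = smoking-related
texts = [
    'stopped drinking beer last week',
    'my sponsor helped me avoid drinking',
    'sober for a month no beer',
    'aa meeting kept me sober',
    'quit smoking cigarettes last week',
    'nicotine cravings after my last cigarette',
    'no cigarettes for a month',
    'nicotine patch helps with smoking',
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorizer and classifier chained so the grid search tunes both together
pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('nb', MultinomialNB()),
])

# Illustrative grid: n-gram range for the vectorizer, smoothing for naive Bayes
param_grid = {
    'cvec__ngram_range': [(1, 1), (1, 2)],
    'nb__alpha': [0.5, 1.0],
}

gs = GridSearchCV(pipe, param_grid, cv=2, scoring='accuracy')
gs.fit(texts, labels)

print(gs.best_params_)
predictions = gs.predict(['drinking beer again', 'smoking cigarettes again'])
print(predictions)
```

Swapping `CountVectorizer` for `TfidfVectorizer`, or `MultinomialNB` for `LogisticRegression`, gives the other vectorizer-model combinations under the same pattern.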

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import requests
import time
from datetime import datetime 
import re

import warnings
warnings.filterwarnings('ignore')

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

## Data Scraping from Subreddits

In [2]:
# Set up function for PushShift API for data retrieval from subreddit

def scrape_subreddit(subreddit):
    
    # Define pushshift base URL
    url = 'https://api.pushshift.io/reddit/search/submission'  
    
    # Set variable for 'before' in first iteration as None 
    #(would automatically pull out the latest posts)
    df_time = None
    
    # Create empty df for concat loop
    df = pd.DataFrame()
    
    # Create total post count for looping
    total_posts = 0
    
    while total_posts < 2000:
        
        # Set params
        params = {
        'subreddit': subreddit,
        'size': 100, 
        'before': df_time
        }
        
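        # Get response from PushShift API; fail loudly on HTTP errors
        res = requests.get(url, params=params)
        res.raise_for_status()
        data = res.json()

        # Stop if no more posts are returned, otherwise the loop never ends
        if not data['data']:
            break

        # Concat relevant json into df
        df = pd.concat([df, pd.DataFrame(data['data'])])

        # Get earliest-dated post in df (used as 'before' in the next request)
        df_time = df['created_utc'].min()

        # Get current length of df
        total_posts = len(df)

        # Pause between requests (1 second is an assumed, conservative delay)
        time.sleep(1)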
    
    df.reset_index(inplace=True)
        
    return df
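
The `before` cursor logic in the function above can be tried offline with a stand-in for the API. The sketch below is a simplified illustration, not the project's code: `paginate`, `fake_fetch` and `FAKE_POSTS` are hypothetical names, and the fake source simply returns posts older than the cursor, newest first.

```python
def paginate(fetch_page, target=5):
    """Collect posts page by page, using the oldest timestamp seen as the next cursor."""
    posts, before = [], None
    while len(posts) < target:
        page = fetch_page(before)
        if not page:                      # stop when the source runs out of posts
            break
        posts.extend(page)
        before = min(p['created_utc'] for p in posts)
    return posts

# Hypothetical stand-in for the API: timestamps 10 (newest) down to 1 (oldest)
FAKE_POSTS = [{'id': i, 'created_utc': i} for i in range(10, 0, -1)]

def fake_fetch(before, size=3):
    """Return up to `size` posts strictly older than `before`, newest first."""
    older = [p for p in FAKE_POSTS if before is None or p['created_utc'] < before]
    return older[:size]

collected = paginate(fake_fetch, target=5)
print([p['created_utc'] for p in collected])

# Asking for more posts than exist ends cleanly instead of looping forever
everything = paginate(fake_fetch, target=100)
print(len(everything))
```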

In [3]:
# Defining subreddits
subreddit_1 = 'alcoholicsanonymous'
subreddit_2 = 'stopsmoking'

In [4]:
# Scraping of subreddits
df_aa = scrape_subreddit(subreddit_1)
df_ss = scrape_subreddit(subreddit_2)

In [5]:
# save to csv
df_aa.to_csv('data/alcoholicsanonymous_test.csv')
df_ss.to_csv('data/stopsmoking_test.csv')