<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web APIs & Classification - Part 1 

Web scraping from Reddit


<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Problem-Statement" data-toc-modified-id="Problem-Statement-1">Problem Statement</a></span></li><li><span><a href="#Executive-Summary" data-toc-modified-id="Executive-Summary-2">Executive Summary</a></span></li><li><span><a href="#Methodology" data-toc-modified-id="Methodology-3">Methodology</a></span></li><li><span><a href="#Organization-of-Notebook:" data-toc-modified-id="Organization-of-Notebook:-4">Organization of Notebook:</a></span></li><li><span><a href="#Part-1---Webscrapping" data-toc-modified-id="Part-1---Webscrapping-5">Part 1 - Webscrapping</a></span></li><li><span><a href="#Data-Wrangling/Gathering" data-toc-modified-id="Data-Wrangling/Gathering-6">Data Wrangling/Gathering</a></span><ul class="toc-item"><li><span><a href="#Data-collecting-using-Reddit-API" data-toc-modified-id="Data-collecting-using-Reddit-API-6.1">Data collecting using Reddit API</a></span></li><li><span><a href="#Combining-2-Subreddits-files-into-1" data-toc-modified-id="Combining-2-Subreddits-files-into-1-6.2">Combining 2 Subreddits files into 1</a></span></li></ul></li></ul></div>

## Problem Statement

As members of the Data Science team at the All Wellness online platform, we are tasked with using NLP to reduce the time and effort required to classify members' online queries as fitness-related or diet-related, so that each query can be channeled to the appropriate certified fitness coach or nutritionist on our panel.

---

## Executive Summary

**Who we are**

With the increasing emphasis on health and wellness, All Wellness is a startup online platform that aims to help people improve their overall well-being through different channels. One of our selling points is a panel of certified fitness and nutrition coaches who advise platform members on queries about their workout routines, nutrition and diet.

**The pain**

One issue All Wellness often faces is that members seeking advice via the platform typically do not label the nature of their question, or label it wrongly, when they fill in the query form. To ensure each query reaches the right channel, a lot of time and effort is spent manually tagging the large volume of queries received daily.

**What can be done**

Using Natural Language Processing (NLP), this tagging can be automated as queries come in. Queries are channeled in the shortest possible time, the panel of advisors receives them in real time, and members in turn are more engaged when their queries are correctly addressed.

**Solution**

Using training data scraped from Reddit, we trained a Logistic Regression model (details are described in the following sections of the technical notebook) that achieves about **85%** accuracy in classifying the 2 categories.

Misclassifications can still occur if a member's query is too general (no specific indication of diet or workout), mixes both topics, or lacks detail (a short question of 2 or 3 words).

The model can be fine-tuned as more training data becomes available.

**Conclusion**

Overall, the model can be deployed to reduce the time and effort spent on classification, while more data is collected to fine-tune it. Future enhancements can be made through more robust ensembling of models and with more data.

**Future Enhancements**

Building on this solution, we could also create a chatbot that serves members standard FAQ answers when the model detects certain words or word structures.

---

## Methodology

In this project, we will use Reddit's API to scrape 2 subreddits 

-  ```/r/workout```
-  ```/r/diet/```

for posts, and use NLP to train a classifier to identify the type of query.


Models used:
- Logistic Regression with CountVectorizer, TF-IDF
- Naive Bayes with CountVectorizer, TF-IDF
- K-Nearest Neighbours (KNN) with CountVectorizer, TF-IDF
- Support Vector Machines (SVM) with CountVectorizer, TF-IDF
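
Each vectorizer and classifier pairing can be expressed as a scikit-learn `Pipeline`. The following is a minimal sketch of one such pairing (CountVectorizer with Logistic Regression). The toy posts, labels and step names are assumptions for illustration and are not taken from the scraped data or from the Part 2 notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy posts and labels (hypothetical, not from the scraped subreddits)
posts = [
    "best push day routine for chest and triceps",
    "how many sets of squats per week",
    "is intermittent fasting good for weight loss",
    "low carb meal ideas for dinner",
]
labels = ["workout", "workout", "diet", "diet"]

# The vectorizer turns text into token counts; the classifier learns from those counts
pipe = Pipeline([
    ("cvec", CountVectorizer(stop_words="english")),
    ("lr", LogisticRegression(max_iter=1000)),
])
pipe.fit(posts, labels)

prediction = pipe.predict(["squats and chest routine"])[0]
print(prediction)
```

Swapping `CountVectorizer` for `TfidfVectorizer`, or `LogisticRegression` for another classifier, only changes the corresponding pipeline step.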

The model with the highest accuracy and Matthews Correlation Coefficient (MCC) on the validation data set will be deployed.
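
As a rough illustration of the two selection metrics, the sketch below computes accuracy and MCC with scikit-learn. The labels and predictions are made up for the example. MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, so it is less flattering than accuracy when errors are lopsided.

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Hypothetical validation labels and predictions (1 = workout, 0 = diet)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

# 3 true positives, 3 true negatives, 1 false positive, 1 false negative
accuracy = accuracy_score(y_true, y_pred)   # (3 + 3) / 8 = 0.75
mcc = matthews_corrcoef(y_true, y_pred)     # (3*3 - 1*1) / sqrt(4*4*4*4) = 0.5
print(f"accuracy={accuracy:.3f}, mcc={mcc:.3f}")
```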

## Organization of Notebook:

- Project3_Part 1_Web APIs & Classification-Scrapping - Problem Statement, Executive Summary and web scraping from Reddit
- Project3_Part 2_Web APIs & Classification-EDA Modelling - EDA, Modelling and Conclusion

---

## Part 1 - Webscrapping

In [448]:
#import libraries

#standard imports
import pandas as pd
#so that pandas does not truncate the columns
pd.set_option('display.max_columns', 100) 
#Set dataframe display format
pd.options.display.float_format = "{:,.3f}".format

import numpy as np

#graph imports
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#define the style of sns/plt

#API imports
import requests
import time
import random

from bs4 import BeautifulSoup

#Regex
import regex as re

#Lemmatizing and Stopwords
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer

#Modelling
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

#==KNN
from sklearn.neighbors import KNeighborsClassifier

#==== Classification metrics
#plot_confusion_matrix and plot_roc_curve were removed in scikit-learn 1.2; use the Display classes instead
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report, accuracy_score, matthews_corrcoef
from sklearn.metrics import auc, roc_auc_score, RocCurveDisplay

#=== Naive Bayes
from sklearn.naive_bayes import MultinomialNB

#=== CountVectorizer and TFIDFVectorizer from feature_extraction.text.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from collections import Counter

#=== SVC
from sklearn.svm import SVC

#word cloud
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

## Data Wrangling/Gathering
<font color = "blue"> 

**Part 1** of the project focuses on **Data wrangling/gathering/acquisition**. This is a very important skill as not all the data you will need will be in clean CSVs or a single table in SQL. There is a good chance that wherever you land you will have to gather some data from some unstructured/semi-structured sources; when possible, requesting information from an API, but often scraping it because they don't have an API (or it's terribly documented).

For this project, I will use the Reddit API to scrape 2 subreddits:

-  ```/r/workout```
-  ```/r/diet/```

for tagging and processing</font>

### Data collecting using Reddit API
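
The scraper below reads two things from each JSON response: the list of posts under `data.children`, and the `data.after` cursor that identifies where the next page starts. The sketch here shows that extraction on a hand-written, trimmed-down listing. The field values and the `t3_abc123` ID are hypothetical.

```python
# Hypothetical, trimmed-down listing in the shape the scraper expects
sample_listing = {
    "kind": "Listing",
    "data": {
        "after": "t3_abc123",
        "children": [
            {"kind": "t3", "data": {"subreddit": "workout", "title": "Leg day tips", "selftext": "Any advice?"}},
            {"kind": "t3", "data": {"subreddit": "workout", "title": "Push-up challenge", "selftext": ""}},
        ],
    },
}

# Each child wraps one post; the post's fields live under its own 'data' key
posts = [child["data"] for child in sample_listing["data"]["children"]]
# Cursor to pass as ?after=... on the next request
after = sample_listing["data"]["after"]

titles = [p["title"] for p in posts]
print(titles, after)
```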

In [109]:
def scrapreddit(url, filename=None, num_of_posts=100):
    
    """
    Collect posts from a subreddit using the Reddit API.
    Arguments:
    - url: base url for the subreddit
    - filename: the filename to save the posts to
    - num_of_posts: the total number of posts to be scraped
    
    Note: the Reddit API returns 25 posts per request by default,
    hence the function will loop ceil(num_of_posts/25) times.
    """
    
    #if filename is not specified, it will default to datasets/scrapreddit.csv
    if filename is None:
        filename='datasets/scrapreddit.csv'
        
    posts = []
    after = None

    #reddit default 25 posts per scrape, loop until desired posts
    for a in range(int(np.ceil(num_of_posts/25))):
    
        if after is None: #first run of getting data
            current_url = url
        else:
            #subsequent runs. Must get the after ID so that we know which post to extract from
            current_url = url + '?after=' + after 
        
        print(current_url)
    
        res = requests.get(current_url, headers={'User-agent': 'DSI 16.1'})
    
        #did not manage to get the data
        if res.status_code != 200:
            print('Status error', res.status_code)
            break
    
        #assign the requested data to a dict
        current_dict = res.json()
    
        current_posts = [p['data'] for p in current_dict['data']['children']]
        posts.extend(current_posts)
        after = current_dict['data']['after']
        
        # not the first run: read the previously saved posts and append the new batch
        if a > 0:
            #read the previous records so that we can extend the data
            prev_posts = pd.read_csv(filename)

            print('\nreading existing file: current length', len(prev_posts))
            
            #DataFrame.append was removed in pandas 2.0, so use pd.concat instead
            prev_posts = pd.concat([prev_posts, pd.DataFrame(current_posts)], ignore_index=True)

            print('\nWriting new data to file: new length', len(prev_posts))
            prev_posts.to_csv(filename, index = False)
            
        else:
            print('new file created:', len(posts))
            pd.DataFrame(posts).to_csv(filename, index = False)
    
        # generate a random sleep duration to look more 'natural', so that it is not continuously getting data
        sleep_duration = random.randint(2,6)
        print(f'Done with {(a+1)*25} of {num_of_posts} records, pause for : {sleep_duration} sec')
        print('=============================')
        time.sleep(sleep_duration)
    
    print('Completed. Total posts scraped: ', len(posts))
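
One caveat with `scrapreddit`: when a response comes back with `after` set to `None` (no further pages), the next iteration treats it as a first run and requests the base URL again. This may explain the repeated `after` IDs in the logs below and some of the duplicates dropped later. The sketch that follows shows one possible way to stop at the end of a listing. `collect_posts` and `fetch_page` are hypothetical names, and the fake pages stand in for the network call.

```python
def collect_posts(fetch_page, max_pages):
    """Follow the `after` cursor until it runs out or max_pages is reached.

    fetch_page is a hypothetical callable: it takes the current cursor
    (None for the first page) and returns the parsed listing JSON.
    """
    posts, after = [], None
    for _ in range(max_pages):
        listing = fetch_page(after)
        posts.extend(child["data"] for child in listing["data"]["children"])
        after = listing["data"]["after"]
        if after is None:  # no further pages: stop instead of restarting from page 1
            break
    return posts


# Fake two-page listing standing in for requests.get(...).json()
fake_pages = {
    None: {"data": {"after": "t3_p2", "children": [{"data": {"id": "a"}}, {"data": {"id": "b"}}]}},
    "t3_p2": {"data": {"after": None, "children": [{"data": {"id": "c"}}]}},
}
result = collect_posts(lambda after: fake_pages[after], max_pages=10)
print([p["id"] for p in result])
```

Even though `max_pages` is 10, the loop ends after the second page because the cursor is exhausted.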

<font color = "blue"> 
Scraping ```/r/workout``` subreddit
</font>

In [111]:
scrapreddit('http://www.reddit.com/r/workout.json', filename='datasets/workout.csv', num_of_posts=1000)

http://www.reddit.com/r/workout.json
new file created: 27
Done with 25 of 1000 records, pause for : 6 sec
http://www.reddit.com/r/workout.json?after=t3_iietoq

reading existing file: current length 27

Writing new data to file: new length 52
Done with 50 of 1000 records, pause for : 4 sec
http://www.reddit.com/r/workout.json?after=t3_ii992l

reading existing file: current length 52

Writing new data to file: new length 77
Done with 75 of 1000 records, pause for : 2 sec
http://www.reddit.com/r/workout.json?after=t3_igyulv

reading existing file: current length 77

Writing new data to file: new length 102
Done with 100 of 1000 records, pause for : 5 sec
http://www.reddit.com/r/workout.json?after=t3_igzm3b

reading existing file: current length 102

Writing new data to file: new length 127
Done with 125 of 1000 records, pause for : 3 sec
http://www.reddit.com/r/workout.json?after=t3_igewro

reading existing file: current length 127

Writing new data to file: new length 152
Done with 150 o


reading existing file: current length 948

Writing new data to file: new length 973
Done with 975 of 1000 records, pause for : 5 sec
http://www.reddit.com/r/workout.json?after=t3_ii992l

reading existing file: current length 973

Writing new data to file: new length 998
Done with 1000 of 1000 records, pause for : 2 sec
Completed. Total posts scraped:  998


<font color = "blue"> 
Scraping ```/r/diet``` subreddit
</font>

In [112]:
scrapreddit('http://www.reddit.com/r/diet.json',filename='datasets/diet.csv', num_of_posts=1000)

http://www.reddit.com/r/diet.json
new file created: 26
Done with 25 of 1000 records, pause for : 6 sec
http://www.reddit.com/r/diet.json?after=t3_ihfxz2

reading existing file: current length 26

Writing new data to file: new length 51
Done with 50 of 1000 records, pause for : 2 sec
http://www.reddit.com/r/diet.json?after=t3_igqkcz

reading existing file: current length 51

Writing new data to file: new length 76
Done with 75 of 1000 records, pause for : 5 sec
http://www.reddit.com/r/diet.json?after=t3_ifki8m

reading existing file: current length 76

Writing new data to file: new length 101
Done with 100 of 1000 records, pause for : 5 sec
http://www.reddit.com/r/diet.json?after=t3_ie0iz2

reading existing file: current length 101

Writing new data to file: new length 126
Done with 125 of 1000 records, pause for : 3 sec
http://www.reddit.com/r/diet.json?after=t3_icrzdk

reading existing file: current length 126

Writing new data to file: new length 151
Done with 150 of 1000 records, pa

http://www.reddit.com/r/diet.json?after=t3_igqkcz

reading existing file: current length 973

Writing new data to file: new length 998
Done with 1000 of 1000 records, pause for : 2 sec
Completed. Total posts scraped:  998


<font color = "blue"> 
Managed to scrape:
- 998 text records for ```/r/workout``` in ```'workout.csv'```
- 998 text records for ```/r/diet``` in ```'diet.csv'``` 


</font>

### Combining 2 Subreddits files into 1

In [251]:
#reading in our data files
df_workout = pd.read_csv('datasets/workout.csv')
df_workout.shape

(998, 111)

In [252]:
#combine title and selftext into one column called post
df_workout['post'] = df_workout['title'].fillna('')+ ' ' + df_workout['selftext'].fillna('')
df_selectedworkout = df_workout[['subreddit','post']]
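
The `fillna('')` calls matter because an empty `selftext` is read back from CSV as `NaN`, and concatenating a string with `NaN` gives `NaN`. A small sketch on made-up rows (the frame below is an assumption, not the scraped data):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical rows); the second post has no body text
df = pd.DataFrame({
    "subreddit": ["workout", "workout"],
    "title": ["Leg day tips", "Morning run"],
    "selftext": ["Any advice?", np.nan],
})

# Without fillna, the whole combined value becomes NaN and the title is lost
naive = df["title"] + " " + df["selftext"]

# With fillna(''), the title survives (note the trailing space)
df["post"] = df["title"].fillna("") + " " + df["selftext"].fillna("")
print(naive.isna().tolist(), df["post"].tolist())
```

If the trailing space is unwanted, `.str.strip()` on the combined column removes it.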

In [253]:
#check for null values
df_selectedworkout.isnull().sum()

subreddit    0
post         0
dtype: int64

In [254]:
#drop duplicates
df_selectedworkout = df_selectedworkout.drop_duplicates()
df_selectedworkout.shape

(914, 2)

In [255]:
#do the same for diet csv
df_diet = pd.read_csv('datasets/diet.csv')
df_diet.shape

(998, 110)

In [256]:
#combine title and selftext into one column called post
df_diet['post'] = df_diet['title'].fillna('')+ ' ' + df_diet['selftext'].fillna('')
df_selecteddiet = df_diet[['subreddit','post']]

In [257]:
#check for duplicates
df_selecteddiet[df_selecteddiet.duplicated()]

| | subreddit | post |
|---|---|---|
| 178 | diet | Post Workout Food |
| 299 | diet | Thought on a veggie only diet Before someone s... |
| 888 | diet | Raising my calorie intake So I’ve been calorie... |
| 922 | diet | Join our Discord server! Hey everyone! In orde... |
| 923 | diet | I have a friend that I'm looking to help [ques... |
| ... | ... | ... |
| 993 | diet | Stress related hypothesis... Is there any scie... |
| 994 | diet | Your suggestions would really be appreciated H... |
| 995 | diet | Gilbert syndrome and Oatmeal (bilirubin and ma... |
| 996 | diet | Losing weight I’m 6’0 tall and weigh about 85k... |


In [258]:
#drop duplicates
df_selecteddiet = df_selecteddiet.drop_duplicates()
df_selecteddiet.shape

(919, 2)

In [259]:
#combine 2 datasets
df_combine = pd.concat([df_selectedworkout,df_selecteddiet], axis=0, ignore_index=True)
df_combine

| | subreddit | post |
|---|---|---|
| 0 | workout | Do you need to Gain Weight, Lose Weight, or Ma... |
| 1 | workout | Beginner's Guide to Working Out As a personal... |
| 2 | workout | By next Saturday, 1 upvote = 5 pushups, will p... |
| 3 | workout | Morning workout in my adidas. What are yall do... |
| 4 | workout | Day 25 of Zac Efron's 12 Week Baywatch Programme |
| ... | ... | ... |
| 1828 | diet | Am I Actually "Intermittent Fasting"? I have b... |
| 1829 | diet | 50 calories over calorie allowance I went over... |
| 1830 | diet | BEST WHEY PROTEIN FOR WEIGHT LOSS (lean) Which... |
| 1831 | diet | Keep getting cravings/hungry in the evening He... |
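
The deduplicate-then-combine steps above can be reproduced on a toy example. The frames below are made up for illustration. The point is that `drop_duplicates` removes exact repeated rows and `ignore_index=True` renumbers the combined rows from 0.

```python
import pandas as pd

# Hypothetical mini versions of the two subreddit frames
workout = pd.DataFrame({"subreddit": ["workout"] * 3, "post": ["Leg day", "Leg day", "Push day"]})
diet = pd.DataFrame({"subreddit": ["diet"] * 2, "post": ["Keto tips", "Keto tips"]})

# Remove exact duplicate rows within each frame first
workout = workout.drop_duplicates()
diet = diet.drop_duplicates()

# ignore_index=True renumbers rows 0..n-1 instead of keeping each frame's old index
combined = pd.concat([workout, diet], axis=0, ignore_index=True)
print(combined.shape, combined.index.tolist())
```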


In [260]:
#saving the combined dataset for future manipulations
df_combine.to_csv('datasets/combine.csv')
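
Note that `to_csv` writes the index by default, so reading such a file back gives an extra `Unnamed: 0` column. The scraper passes `index = False`, while the cell above does not. A minimal sketch of the difference, using an in-memory buffer and toy rows that are assumptions for illustration:

```python
import io
import pandas as pd

df = pd.DataFrame({"subreddit": ["workout", "diet"], "post": ["Leg day tips", "Low carb dinner"]})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
no_index = io.StringIO()
df.to_csv(no_index, index=False)
no_index.seek(0)
cols_clean = pd.read_csv(no_index).columns.tolist()

print(cols_default, cols_clean)
```

Whether to drop the index here depends on how the Part 2 notebook reads the file, so this is a consideration and not a required change.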

<font color = "blue"> 
The combined dataset has 1,833 rows (914 workout + 919 diet) and 2 columns: subreddit and post text.
</font>

----