
# Web APIs & NLP Part 1    

_**Authors:** Zhan Yu_

## Problem Statement    
   
Sephora currently does not have a **chatbot**, a computer program that simulates and processes human conversation (in this case, text messages) and lets customers interact with digital devices as if they were communicating with a real person.  
Sephora would like to launch a chatbot on its website to improve the customer experience. Because human labor is expensive, the main purpose of this project is to use Natural Language Processing (NLP) to build a model that identifies which questions need a response from customer service employees and which can simply be answered by a computer.  
The model in this project is based on two subreddits: [Sephora](https://www.reddit.com/r/Sephora/) and [Makeup](https://www.reddit.com/r/Makeup/). The Sephora subreddit has 14.5k members and discusses anything Sephora-related, such as makeup and skincare advice. The Makeup subreddit has 146k members and covers makeup tips and advice.  
In this project, we are going to use Logistic Regression, Naive Bayes and Bagging models and evaluate them using accuracy.

## Executive Summary

We broke down our process into two parts (two notebooks):  

### Part 1:   
**Data Gathering**   
First we wrote a function using the [pushshift.io Reddit API](https://github.com/pushshift/api), which provides enhanced functionality and search capabilities for Reddit comments and submissions, to gather post data from two subreddits: [Sephora](https://www.reddit.com/r/Sephora/) and [Makeup](https://www.reddit.com/r/Makeup/). The Sephora subreddit has 14.5k members and the Makeup subreddit has 146k members, so for the Sephora subreddit we query three times as many time periods (`times = 15`) as for the Makeup subreddit (`times = 5`) to balance the two classes.

**Data Cleaning**  
We combined the scraped data from the two subreddits into one dataframe. We then merged each post's title and body into a single `text` column, dropped the columns we do not need, changed the 'subreddit' column to a binary class (`{'Makeup': 0, 'Sephora': 1}`), removed website links and `\n`, and dropped the rows with missing values. We finished our cleaning process by defining a function called `words_only` to convert semi-raw text to a string of words.  

### Part 2:  
**Feature Engineering**  
We tested stemming, lemmatizing, and neither on the `text` column within the models we built, compared the resulting accuracy, and concluded that lemmatizing is the best option.


**Modeling**  
First we established a baseline model. In this project we are going to build a model to predict whether a post is from "Makeup" or "Sephora", so this is a classification problem. The baseline model predicts the majority class, which gives a baseline accuracy of 0.582881.  
Then we built Logistic Regression, Naive Bayes and Bagging models with two vectorizers, `CountVectorizer` and `TfidfVectorizer`.   
For the Logistic Regression models, we used `Pipeline` to combine `CountVectorizer` or `TfidfVectorizer` with a `LogisticRegression` model, together with `GridSearchCV` to find the best parameters.   
For the Naive Bayes models, we used `MultinomialNB` with `CountVectorizer` and `GaussianNB` with `TfidfVectorizer`.   
For the Bagging models, we used `BaggingClassifier` with both `CountVectorizer` and `TfidfVectorizer`.  
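
To make the majority-class baseline concrete, here is a minimal sketch using only the standard library. The helper name `majority_baseline_accuracy` and the toy labels are hypothetical and are not the project's data; the actual baseline of 0.582881 comes from the real class proportions.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)

# Hypothetical toy labels (0 = Makeup, 1 = Sephora), not the project's data.
toy_labels = [0, 0, 0, 1, 1]
print(majority_baseline_accuracy(toy_labels))  # 0.6
```

Any model we build should beat this number to be considered useful.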

**Data Visualization**  
We used `CountVectorizer` and `TfidfVectorizer` with two-word n-grams (`ngram_range=(2, 2)`), which makes it easier to see the most popular phrases in the two subreddits.  
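
As a rough illustration of what counting two-word n-grams means, the sketch below tallies adjacent word pairs with the standard library. It is a simplified stand-in for `CountVectorizer(ngram_range=(2, 2))`, not the vectorizer itself (which also applies its own tokenization rules), and the function name `top_bigrams` and the toy posts are hypothetical.

```python
from collections import Counter

def top_bigrams(documents, n=3):
    """Count adjacent word pairs across documents and return the n most common."""
    counts = Counter()
    for doc in documents:
        words = doc.split()
        # zip pairs each word with the word that follows it
        counts.update(zip(words, words[1:]))
    return [(" ".join(pair), count) for pair, count in counts.most_common(n)]

# Hypothetical cleaned posts, not taken from the project's data.
toy_posts = [
    "beauty insider points disappeared",
    "beauty insider birthday gift",
    "setting spray dry skin",
]
print(top_bigrams(toy_posts, n=1))  # [('beauty insider', 2)]
```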

**Model Evaluation**  
The Naive Bayes model using `CountVectorizer` with `MultinomialNB` is our best model according to accuracy.

## Table of Contents
- [Problem Statement](#Problem-Statement)  
- [Executive Summary](#Executive-Summary)
- [Gathering Data](#Gathering-Data)
- [Data Cleaning](#Data-Cleaning)

## Loading Libraries   

All the libraries will be imported here:

In [62]:
import time
import requests
import pandas as pd
import datetime as dt
import numpy as np
import matplotlib.pyplot as plt

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
import regex as re
from nltk.corpus import stopwords
from bs4 import BeautifulSoup  


from sklearn.model_selection import train_test_split

## Gathering Data

In this project, we use [Pushshift's](https://github.com/pushshift/api) API to collect posts from two subreddits, [Sephora](https://www.reddit.com/r/Sephora/) and [Makeup](https://www.reddit.com/r/Makeup/), on [reddit](https://www.reddit.com/).   
First, we define a function that makes it easy to get post data from the API.

In [63]:
def query_pushshift(subreddit, 
                    kind='submission', 
                    skip=30, 
                    times=5, 
                    subfield = ['title', 'selftext', 'subreddit', 
                                'created_utc', 'author', 'num_comments', 
                                'score', 'is_self'],
                    comfields = ['body', 'score', 'created_utc']):
    # Setting URL:
    stem = "https://api.pushshift.io/reddit/search/{}/?subreddit={}&size=500".format(kind, subreddit)
    mylist = []
    for x in range(1, times + 1):
        URL = "{}&after={}d".format(stem, skip * x)
        # The URL will show the time periods
        print(URL)
        
        # Gathering the data:
        response = requests.get(URL)                     # library "requests"
        assert response.status_code == 200               # check the status code
        mine = response.json()['data']                   # get all the data
        df = pd.DataFrame.from_dict(mine)                # make a dataframe
        mylist.append(df)                                # the list of dataframes we made
        # Pause 2 seconds between requests to avoid overloading the API
        time.sleep(2)
    
    full = pd.concat(mylist, sort=False)                 # put all dataframes together
    if kind == "submission":
        full = full[subfield]                            # keep only columns we want
        full = full.drop_duplicates()                    # drop all duplicated rows
        full = full.loc[full['is_self'] == True].copy()  # keep only self (text) posts; copy so a column can be added safely
    
    # Transforming the dates:
    def get_date(created):
        return dt.date.fromtimestamp(created)
    _timestamp = full["created_utc"].apply(get_date)
    full['timestamp'] = _timestamp                        
    print(full.shape)
    return full 
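
To see which requests the function sends without touching the network, the URL construction can be previewed on its own. This is a minimal sketch: `preview_pushshift_urls` is a hypothetical helper that only repeats the string formatting used in `query_pushshift`.

```python
def preview_pushshift_urls(subreddit, kind="submission", skip=30, times=5):
    """Return the request URLs that `query_pushshift` would print, without sending them."""
    stem = "https://api.pushshift.io/reddit/search/{}/?subreddit={}&size=500".format(kind, subreddit)
    # Each iteration moves the `after` cutoff back by another `skip` days.
    return ["{}&after={}d".format(stem, skip * x) for x in range(1, times + 1)]

urls = preview_pushshift_urls("Sephora", times=3)
for url in urls:
    print(url)
```

With `skip=30` and `times=3`, the cutoffs are 30, 60 and 90 days, so `times` controls how far back the collection reaches.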

### Data Dictionary  

| Feature | Description | Type |
| ------ | ------ | ------ |
| title | Name of a post | String |
| selftext | Content of a post | String |
| subreddit | Specific subreddit | String |
| timestamp | Posting date (converted from `created_utc`) | Date |
| is_self | Whether the submission is a self (text) post | Boolean |
| created_utc | Time the post was created, as a Unix epoch value | Integer |
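
The relationship between `created_utc` and `timestamp` can be sketched as follows. Note that `dt.date.fromtimestamp`, as used in `query_pushshift`, interprets the epoch value in the local timezone of the machine running the notebook; the variant below is an alternative (an assumption, not the notebook's code) that pins the conversion to UTC. The name `epoch_to_utc_date` is hypothetical.

```python
import datetime as dt

def epoch_to_utc_date(created_utc):
    """Convert a Unix epoch (seconds) to a calendar date in UTC."""
    return dt.datetime.fromtimestamp(created_utc, tz=dt.timezone.utc).date()

print(epoch_to_utc_date(1577836800))  # 2020-01-01
```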

Since [Makeup](https://www.reddit.com/r/Makeup/) is a much bigger and more popular subreddit, we choose `times=5` for it and `times=15` for [Sephora](https://www.reddit.com/r/Sephora/) to balance the two classes.

In [64]:
df_sephora = query_pushshift(subreddit='Sephora', 
                      times=15, 
                      subfield = ['title', 'selftext', 'subreddit', 'created_utc', 'is_self'])

https://api.pushshift.io/reddit/search/submission/?subreddit=Sephora&size=500&after=30d
https://api.pushshift.io/reddit/search/submission/?subreddit=Sephora&size=500&after=60d
https://api.pushshift.io/reddit/search/submission/?subreddit=Sephora&size=500&after=90d
https://api.pushshift.io/reddit/search/submission/?subreddit=Sephora&size=500&after=120d
https://api.pushshift.io/reddit/search/submission/?subreddit=Sephora&size=500&after=150d
https://api.pushshift.io/reddit/search/submission/?subreddit=Sephora&size=500&after=180d
https://api.pushshift.io/reddit/search/submission/?subreddit=Sephora&size=500&after=210d
https://api.pushshift.io/reddit/search/submission/?subreddit=Sephora&size=500&after=240d
https://api.pushshift.io/reddit/search/submission/?subreddit=Sephora&size=500&after=270d
https://api.pushshift.io/reddit/search/submission/?subreddit=Sephora&size=500&after=300d
https://api.pushshift.io/reddit/search/submission/?subreddit=Sephora&size=500&after=330d
https://api.pushshift.io

In [65]:
df_makeup = query_pushshift(subreddit='Makeup', 
                      times=5, 
                      subfield = ['title', 'selftext', 'subreddit', 'created_utc', 'is_self'])

https://api.pushshift.io/reddit/search/submission/?subreddit=Makeup&size=500&after=30d
https://api.pushshift.io/reddit/search/submission/?subreddit=Makeup&size=500&after=60d
https://api.pushshift.io/reddit/search/submission/?subreddit=Makeup&size=500&after=90d
https://api.pushshift.io/reddit/search/submission/?subreddit=Makeup&size=500&after=120d
https://api.pushshift.io/reddit/search/submission/?subreddit=Makeup&size=500&after=150d
(2481, 6)


## Data Cleaning

After gathering data from the two subreddits, we do some data cleaning:  
First, we combine the two dataframes:

In [66]:
df = pd.concat([df_sephora, df_makeup], ignore_index=True)

# Check the shape
df.shape

(4250, 6)

In [67]:
# Combine 'title' and 'selftext' together and create a new column 'text'
df['text'] = df['title'] + ' ' + df['selftext']

# Drop the columns we don't need:
df = df.drop(columns=['title', 'selftext', 'created_utc', 'is_self'])

# Change 'subreddit' column to binary class:
df['subreddit'] = df['subreddit'].map({'Makeup': 0, 'Sephora': 1})

# Removed website links and '\n'
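# (regex=True is required in recent pandas versions when passing a pattern with case=False)
df['text'] = df['text'].str.replace(r'http\S+|www\.\S+', ' ', case=False, regex=True).str.replace('\n', ' ')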
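
The effect of the link-removal pattern can be checked on a single string with the standard `re` module. This is a sketch on a made-up sample; the helper name `strip_links_and_newlines` is hypothetical, and the dot in `www\.` is escaped here so that it matches only a literal dot.

```python
import re

LINK_PATTERN = re.compile(r"http\S+|www\.\S+", flags=re.IGNORECASE)

def strip_links_and_newlines(text):
    """Replace URLs and newline characters with spaces."""
    return LINK_PATTERN.sub(" ", text).replace("\n", " ")

sample = "Swatches here: https://example.com/swatches\nalso WWW.example.org thanks"
cleaned = strip_links_and_newlines(sample)
# Collapse the leftover runs of spaces for display only.
print(" ".join(cleaned.split()))  # Swatches here: also thanks
```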

In [68]:
# Check missing values:
df.isnull().sum()

subreddit    0
timestamp    0
text         6
dtype: int64
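
One likely source of these missing values (an assumption, since the raw rows are not shown) is a post with no body: in pandas, concatenating a string with a missing value yields a missing value, so a null `selftext` makes the whole `text` entry null. A minimal sketch with a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical two-row frame; the second post has no body text.
toy = pd.DataFrame({"title": ["Title one", "Title two"],
                    "selftext": ["body text", None]})

combined = toy["title"] + " " + toy["selftext"]
print(combined.isnull().tolist())  # [False, True]

# Filling missing bodies first would keep the title-only post instead.
filled = toy["title"] + " " + toy["selftext"].fillna("")
print(filled.tolist())  # ['Title one body text', 'Title two ']
```

This notebook takes the simpler route and drops the affected rows.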

In [69]:
df.loc[df['text'].isnull()]

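| | subreddit | timestamp | text |
| ------ | ------ | ------ | ------ |
| 1837 | 0 | 2020-01-02 | NaN |
| 1851 | 0 | 2020-01-02 | NaN |
| 1882 | 0 | 2020-01-03 | NaN |
| 1915 | 0 | 2020-01-04 | NaN |
| 1917 | 0 | 2020-01-04 | NaN |
| 1921 | 0 | 2020-01-04 | NaN |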


In [70]:
# Drop 6 rows and reset index
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)

In [71]:
# Check missing values again:
df.isnull().sum().sum()

0

Now we define a function called `words_only` to convert semi-raw text into a string of words. 

In [99]:
def words_only(raw):
    """Convert semi-raw text to a string of lowercase words with stopwords removed."""

    # 1. Remove non-letters.
    letters_only = re.sub('[^a-zA-Z]', ' ', raw)
    
    # 2. Convert to lower case, split into individual words.
    words = letters_only.lower().split()
    
    # 3. Strip "'" from the NLTK stopwords (e.g. "don't" -> "dont") and keep them in a set for fast lookup.
    stops = set(" ".join(stopwords.words('english')).replace("'", "").split())
    
    # 4. Remove stopwords.
    meaningful_words = [w for w in words if w not in stops]
    
    # 5. Join the words back into one string separated by space and return the result.
    return " ".join(meaningful_words)
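
To illustrate what these steps do to a single post, here is a self-contained sketch that swaps NLTK's English list for a tiny hand-picked stop-word set. Both `DEMO_STOPS` and `words_only_demo` are hypothetical names used for illustration only, and the sample sentence is made up.

```python
import re

# Tiny hand-picked stop-word set for illustration only;
# the notebook uses NLTK's full English list instead.
DEMO_STOPS = {"i", "the", "it", "s", "a", "and", "is"}

def words_only_demo(raw, stops=DEMO_STOPS):
    """Keep lowercase alphabetic words that are not stop words."""
    letters_only = re.sub("[^a-zA-Z]", " ", raw)   # digits and punctuation become spaces
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in stops)

print(words_only_demo("I LOVE the new Sephora palette!! It's $45."))
# love new sephora palette
```

Because non-letters are replaced before splitting, a contraction such as "It's" is split into "it" and "s", and numbers such as prices are dropped entirely.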

In [100]:
df['text'] = df['text'].apply(words_only)
df.head()

| | subreddit | timestamp | text |
| ------ | ------ | ------ | ------ |
| 0 | 1 | 2019-12-31 | final predictions beauty insider program happy... |
| 1 | 1 | 2020-01-01 | love go nearly points really hope going fixed ... |
| 2 | 1 | 2020-01-01 | points disappeared bought things december st c... |
| 3 | 1 | 2020-01-01 | sephora birthday perks days redeem sephora bir... |
| 4 | 1 | 2020-01-01 | bite beauty relaunching went sephora bite beau... |


In [101]:
# Check missing values again:
df.isnull().sum().sum()

0

Export the data:

In [102]:
df.to_csv('data/meaningful_words.csv', index=False)