# Project 3: Web APIs & Classification

## Problem Statement

**Scenario:**

[r/TalesFromTheCustomer](https://www.reddit.com/r/TalesFromTheCustomer/) and [r/TalesFromRetail](https://www.reddit.com/r/TalesFromRetail/) are similar subreddits describing customer service experiences, one from the customer's point of view and the other from that of the service staff.

These subreddits contain narratives of people's experiences, so they are rich in text data and suitable for natural language processing (NLP) modelling using a bag of words approach. As their content is similar, it is interesting to examine whether an NLP machine learning model can distinguish between the two subreddits and, in the process, identify the words most frequently associated with customer and retail experiences.
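
To make the bag of words idea concrete, here is a minimal sketch using only the Python standard library. The two sentences are made-up examples rather than scraped posts, and the simple regex tokenizer is an assumption chosen for illustration.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into word tokens and count each token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Hypothetical sentences for illustration only (not taken from either subreddit)
customer_post = "The cashier was rude, so I asked the manager for a refund."
retail_post = "The customer was rude and demanded a refund from my manager."

customer_counts = bag_of_words(customer_post)
retail_counts = bag_of_words(retail_post)

# Word order is discarded; only the count of each word remains
print(customer_counts.most_common(3))
print(retail_counts["customer"], customer_counts["customer"])
```

Both sentences share words such as "rude", "refund" and "manager", while words such as "cashier" and "customer" appear in only one of them. Differences of this kind are what a bag of words classifier can pick up on.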

Such an NLP model could be useful to retail executives who wish to understand more about the experiences of their customers and frontline staff. The model should therefore preferably be interpretable, allowing inference on how much each word contributes to the model's classification.

**Task:**

This project will use the Reddit API to scrape text data (title + main post) from these two subreddits. After data cleaning, exploratory data analysis will be performed to identify the most frequent words in each subreddit.

The data will be split into a train and test set. Classification models will be fit on the train set and scored on the test set to identify the best performing model. F1 score will be used as the main metric. The best models will be analyzed to examine how their underlying algorithms could have affected their predictive ability on this bag of words classification problem.
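
Since the F1 score is the main metric, a small worked example may help. The sketch below computes precision, recall and F1 (the harmonic mean of precision and recall) from scratch. The labels are hypothetical, and treating class `1` as the positive class is an assumption; the source does not state which subreddit is encoded as positive.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```

A false positive lowers precision and a false negative lowers recall, so F1 drops when either kind of error becomes common.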

## Executive Summary

927 rows of data for r/TalesFromTheCustomer and 320 rows for r/TalesFromRetail were scraped and combined into a single dataframe. The columns 'title' and 'selftext' were combined into a single 'text' column. Lemmatized and stemmed versions of the text were created as new columns.
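
The column combination described above can be sketched as follows. The rows are hypothetical stand-ins for the scraped data, and filling missing values with an empty string before joining is an assumption about how empty posts might be handled.

```python
import pandas as pd

# Hypothetical rows standing in for the scraped data
df = pd.DataFrame({
    "title": ["Rude cashier", "Refund drama"],
    "selftext": ["She rolled her eyes at me.", None],
    "subreddit": ["TalesFromTheCustomer", "TalesFromRetail"],
})

# Fill missing values first so that a missing body does not make the combined text NaN
df["text"] = (df["title"].fillna("") + " " + df["selftext"].fillna("")).str.strip()

print(df[["subreddit", "text"]])
```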

A first level of screening was performed across each shortener (lemmatized, stemmed, raw unshortened text), each vectorizer (Count Vectorizer, TF-IDF Vectorizer), and each model (Logistic Regression, Multinomial Naive Bayes, Support Vector Machine, K-Nearest Neighbours, Extra Trees and Random Forest). 
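
The size of this screening grid can be enumerated with `itertools.product`. The sketch below only lists the combinations (3 × 2 × 6 = 36) and does not fit any model. The string labels are shorthand chosen for illustration, not identifiers taken from the modelling notebook.

```python
from itertools import product

# Shorthand labels for the options named above (assumed names, for illustration)
shorteners = ["lemmatized", "stemmed", "raw"]
vectorizers = ["CountVectorizer", "TfidfVectorizer"]
models = [
    "LogisticRegression",
    "MultinomialNB",
    "SupportVectorMachine",
    "KNearestNeighbours",
    "ExtraTrees",
    "RandomForest",
]

# Every (shortener, vectorizer, model) triple to be screened
combinations = list(product(shorteners, vectorizers, models))

print(len(combinations))
for shortener, vectorizer, model in combinations[:3]:
    print(shortener, vectorizer, model)
```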

Logistic Regression, Multinomial Naive Bayes and Support Vector Machine (each with Count Vectorizer) and K-Nearest Neighbours (with TF-IDF Vectorizer) were found to be the best performing models and were tuned further with grid search.

After grid searching, Logistic Regression (test F1 score = 0.70) and Multinomial Naive Bayes (test F1 score = 0.72) were selected as the final models. If we want to **maximize precision and minimize false positives**, **Logistic Regression** would be the best model. If we want to **maximize recall and minimize false negatives**, **Multinomial Naive Bayes** would be the best choice.

## Web Scraping

In [15]:
# Import libraries
import requests
import pandas as pd
import time
import random
import string

In [16]:
# Request the first listing page of r/TalesFromTheCustomer and r/TalesFromRetail
cust_url = 'https://www.reddit.com/r/TalesFromTheCustomer.json'
cust_req = requests.get(cust_url, headers={'User-agent': 'random_123'})

ret_url = 'https://www.reddit.com/r/TalesFromRetail.json'
ret_req = requests.get(ret_url, headers={'User-agent': 'random_123'})

In [17]:
# Check status codes
cust_req.status_code

200

In [18]:
ret_req.status_code

200
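
Before looping over pages, it may help to see the shape of a listing response. The sketch below pulls the posts and the `after` pagination token out of a listing dictionary, using the same keys as the scraping loop in this notebook. The sample dictionary is hypothetical and heavily trimmed; real responses carry many more fields per post.

```python
def extract_posts(listing):
    """Return (posts, after) from a Reddit listing dictionary."""
    children = listing["data"]["children"]
    posts = [child["data"] for child in children]
    after = listing["data"]["after"]  # token for the next page, or None
    return posts, after


# Hypothetical, trimmed listing for illustration only
sample_listing = {
    "kind": "Listing",
    "data": {
        "after": "t3_abc123",
        "children": [
            {"kind": "t3", "data": {"title": "First post", "selftext": "Body one"}},
            {"kind": "t3", "data": {"title": "Second post", "selftext": "Body two"}},
        ],
    },
}

posts, after = extract_posts(sample_listing)
print(len(posts), after)
```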

In [27]:
# Perform 40 iterations of scraping r/TalesFromTheCustomer
url = cust_url
posts = []
after = None

# Set number of loops = 40 and generate random user-agents for each iteration
num_loops = 40
user_agents = [''.join(random.choice(string.hexdigits) for i in range(8)) for j in range(num_loops)]

for i in range(num_loops):
    if after is None:
        current_url = url
    else:
        current_url = url + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': user_agents[i]})
    
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']
    
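    # Save updated csv file for each iteration.
    # Note: `posts` is cumulative, so concatenating it with the saved file
    # writes earlier posts again on every pass; duplicates are dropped later.
    if i > 0:
        prev_posts = pd.read_csv('customer.csv')
        current_df = pd.DataFrame(posts)
        new_df = pd.concat([prev_posts, current_df])
        new_df.to_csv('customer.csv', index = False)
        
    else:
        pd.DataFrame(posts).to_csv('customer.csv', index = False)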
    
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,60)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/TalesFromTheCustomer.json
19
https://www.reddit.com/r/TalesFromTheCustomer.json?after=t3_luyo4v
27
https://www.reddit.com/r/TalesFromTheCustomer.json?after=t3_lmveps
22
https://www.reddit.com/r/TalesFromTheCustomer.json?after=t3_ldokfx
4
https://www.reddit.com/r/TalesFromTheCustomer.json?after=t3_l74h1j
60
https://www.reddit.com/r/TalesFromTheCustomer.json?after=t3_kzxa1m
29
https://www.reddit.com/r/TalesFromTheCustomer.json?after=t3_kx0mly
56
https://www.reddit.com/r/TalesFromTheCustomer.json?after=t3_kqosh6
18
https://www.reddit.com/r/TalesFromTheCustomer.json?after=t3_kj45dt
52
https://www.reddit.com/r/TalesFromTheCustomer.json?after=t3_kehnme
44
https://www.reddit.com/r/TalesFromTheCustomer.json?after=t3_k93rif
42
https://www.reddit.com/r/TalesFromTheCustomer.json?after=t3_k44zfg
44
https://www.reddit.com/r/TalesFromTheCustomer.json?after=t3_jz8yr1
10
https://www.reddit.com/r/TalesFromTheCustomer.json?after=t3_jrr5sx
7
https://www.reddit.com/r/TalesFromTheC

39
https://www.reddit.com/r/TalesFromTheCustomer.json?after=t3_hv9p8i
7
https://www.reddit.com/r/TalesFromTheCustomer.json?after=t3_htii5x
36
https://www.reddit.com/r/TalesFromTheCustomer.json?after=t3_hq5lsu
3
https://www.reddit.com/r/TalesFromTheCustomer.json?after=t3_ho32ng
16
https://www.reddit.com/r/TalesFromTheCustomer.json?after=t3_hk73nx
44
https://www.reddit.com/r/TalesFromTheCustomer.json?after=t3_hgi1fa
3
https://www.reddit.com/r/TalesFromTheCustomer.json?after=t3_hcaha0
26
https://www.reddit.com/r/TalesFromTheCustomer.json?after=t3_h8h2ki
53
https://www.reddit.com/r/TalesFromTheCustomer.json?after=t3_gvcf8s
35
https://www.reddit.com/r/TalesFromTheCustomer.json?after=t3_gpssyw

20
https://www.reddit.com/r/TalesFromTheCustomer.json?after=t3_gkmuuj
46
https://www.reddit.com/r/TalesFromTheCustomer.json
38
https://www.reddit.com/r/TalesFromTheCustomer.json?after=t3_luyo4v
48
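
The output above shows the URL returning to the base listing after the `t3_gkmuuj` page, which suggests that `after` came back empty and the loop started again from the first page. A possible variant, sketched below, stops as soon as no `after` token is returned. The `fetch_page` callable and the fake pages are hypothetical stand-ins so the loop can be tried without sending requests.

```python
import time


def scrape_listing(fetch_page, max_pages=40, pause=0.0):
    """Collect posts page by page until max_pages is reached or `after` is None.

    fetch_page(after) must return a listing dictionary shaped like the
    Reddit JSON used in this notebook; it is injected so the loop can run offline.
    """
    posts = []
    after = None
    for _ in range(max_pages):
        listing = fetch_page(after)
        posts.extend(child["data"] for child in listing["data"]["children"])
        after = listing["data"]["after"]
        if after is None:
            # No further page: stop instead of requesting the first page again
            break
        time.sleep(pause)
    return posts


# Hypothetical three-page listing used in place of live requests
fake_pages = {
    None: {"data": {"children": [{"data": {"id": "a"}}], "after": "t3_a"}},
    "t3_a": {"data": {"children": [{"data": {"id": "b"}}], "after": "t3_b"}},
    "t3_b": {"data": {"children": [{"data": {"id": "c"}}], "after": None}},
}
calls = []


def fake_fetch(after):
    """Record the requested token and return the matching fake page."""
    calls.append(after)
    return fake_pages[after]


all_posts = scrape_listing(fake_fetch)
print([p["id"] for p in all_posts], len(calls))
```

Because `posts` already holds everything collected so far, writing it to disk once after the loop (or overwriting the file on each pass) would avoid the repeated rows that the de-duplication step later removes.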


In [45]:
# Check final csv file
cust_df = pd.read_csv('customer.csv')
cust_df.shape

(20470, 107)

In [46]:
# Remove duplicates by 'selftext' column
cust_df.drop_duplicates(subset='selftext',inplace=True)
cust_df.shape

(927, 107)
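
The effect of de-duplicating on `selftext` can be seen on a toy frame. The rows below are hypothetical. One side effect worth keeping in mind is that distinct posts with identical body text would also be collapsed into a single row.

```python
import pandas as pd

# Hypothetical rows: the post with id "a" was saved twice
raw = pd.DataFrame({
    "id": ["a", "b", "a"],
    "selftext": ["First story", "Second story", "First story"],
})

# Rows with the same 'selftext' are dropped; the first occurrence is kept by default
deduped = raw.drop_duplicates(subset="selftext")

print(raw.shape, deduped.shape)
```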

In [47]:
# Save final csv file
cust_df.to_csv('datasets/customer.csv', index = False)

In [21]:
# Perform 40 iterations of scraping r/TalesFromRetail
url = ret_url
posts = []
after = None

# Set number of loops = 40 and generate random user-agents for each iteration
num_loops = 40
user_agents = [''.join(random.choice(string.ascii_letters) for i in range(10)) for j in range(num_loops)]

for i in range(num_loops):
    if after is None:
        current_url = url
    else:
        current_url = url + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': user_agents[i]})
    
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']
    
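    # Save updated csv file for each iteration.
    # Note: `posts` is cumulative, so concatenating it with the saved file
    # writes earlier posts again on every pass; duplicates are dropped later.
    if i > 0:
        prev_posts = pd.read_csv('retail.csv')
        current_df = pd.DataFrame(posts)
        new_df = pd.concat([prev_posts, current_df])
        new_df.to_csv('retail.csv', index = False)
        
    else:
        pd.DataFrame(posts).to_csv('retail.csv', index = False)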
    
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(30,90)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/TalesFromRetail.json
46
https://www.reddit.com/r/TalesFromRetail.json?after=t3_lrpknf
33
https://www.reddit.com/r/TalesFromRetail.json?after=t3_lgsxw2
40
https://www.reddit.com/r/TalesFromRetail.json?after=t3_l815gs
89
https://www.reddit.com/r/TalesFromRetail.json?after=t3_l0zo17
39
https://www.reddit.com/r/TalesFromRetail.json?after=t3_kri44z
67
https://www.reddit.com/r/TalesFromRetail.json?after=t3_kj9kc1
54
https://www.reddit.com/r/TalesFromRetail.json?after=t3_kckgnu
65
https://www.reddit.com/r/TalesFromRetail.json?after=t3_k5kpqi
62
https://www.reddit.com/r/TalesFromRetail.json?after=t3_k17hda
54
https://www.reddit.com/r/TalesFromRetail.json?after=t3_jv37q0
33
https://www.reddit.com/r/TalesFromRetail.json?after=t3_jqz8tj
59
https://www.reddit.com/r/TalesFromRetail.json?after=t3_jkbyxd
60
https://www.reddit.com/r/TalesFromRetail.json?after=t3_jfmfhs
85
https://www.reddit.com/r/TalesFromRetail.json
79
https://www.reddit.com/r/TalesFromRetail.json?after=t3_lr

In [22]:
# Check final csv file
ret_df = pd.read_csv('retail.csv')
ret_df.shape

(19898, 104)

In [23]:
# Remove duplicates by 'selftext' column
ret_df.drop_duplicates(subset='selftext',inplace=True)
ret_df.shape

(320, 104)

In [24]:
# Save final csv file
ret_df.to_csv('datasets/retail.csv', index=False)

The raw customer.csv and retail.csv files have been deleted from the main directory because of their large size.

927 rows of data remain for r/TalesFromTheCustomer and 320 rows remain for r/TalesFromRetail after removing duplicates.
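
These counts imply an imbalanced dataset. The short calculation below works out each subreddit's share of the combined data. A model that always predicted the larger class would be right for roughly 74% of rows, which may be part of why the F1 score is more informative than plain accuracy here.

```python
n_customer = 927  # r/TalesFromTheCustomer rows after removing duplicates
n_retail = 320    # r/TalesFromRetail rows after removing duplicates
total = n_customer + n_retail

customer_share = n_customer / total
retail_share = n_retail / total

print(f"total={total}, customer={customer_share:.3f}, retail={retail_share:.3f}")
```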

In the next section, 02_modelling, the datasets will be processed and analysed. Classification models will then be fitted and scored on the processed data.