#  Project 3: Web APIs & Classification

## Problem Statement

KiddyToy sells wonderful, reliable children's toys. Last year, KiddyToy decided to move its business online and also launched an online customer feedback service. This has increased the company's revenue significantly. The company wants to use customer feedback to improve toy manufacturing and business decision-making. The customer service team categorizes each piece of feedback as "customer complaint" or "customer support".

However, KiddyToy receives a very large volume of customer feedback each day. The current system only allows employees to categorize feedback manually, which is time-consuming. As members of the company's data science team, we are initiating a project to build a text classifier that automates this process.


## Executive Summary


It's estimated that around 80% of all information is unstructured, with text being one of the most common types of unstructured data. Because of the messy nature of text, analyzing, understanding, organizing, and sorting through text data is hard and time-consuming. KiddyToy's customer service team faces the same issue. 

In this project, we develop a text classifier to help KiddyToy's customer service team. We select two subreddits, /r/Coffee and /r/Tea, from Reddit. These two subreddits were chosen because their posts are text-heavy. The goal is to classify which subreddit a given post came from. We build and compare two models, a logistic regression classifier and a naive Bayes classifier, and choose the better one. Our best model achieves a test accuracy of 92.05%.
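To make the classification task concrete, here is a minimal from-scratch sketch of a multinomial naive Bayes classifier with add-one smoothing, using only the Python standard library. It is not the model used in this project. The toy posts, the whitespace tokenizer, and the function names `tokenize`, `train_naive_bayes`, and `predict` are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase + whitespace split is enough for this toy example.
    return text.lower().split()


def train_naive_bayes(posts, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(posts, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log posterior for the given text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        # log prior of the class
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen during training
            # add-one (Laplace) smoothed log likelihood of the word
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy data standing in for subreddit posts.
toy_posts = [
    "best espresso grinder for home",
    "pour over brew ratio question",
    "oolong steeping time and temperature",
    "loose leaf green tea recommendations",
]
toy_labels = ["Coffee", "Coffee", "Tea", "Tea"]

model = train_naive_bayes(toy_posts, toy_labels)
print(predict(model, "espresso grinder"))
print(predict(model, "green tea steeping"))
```

In practice a library implementation with a proper vectorizer would normally be preferred. The sketch only shows the idea that a post is assigned to the class whose word frequencies best match it.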

With this text classifier, the customer service team can automatically structure customer feedback text in a fast and cost-effective way. This allows the company to save time analyzing text data, automate business processes, and improve customer satisfaction.

## Data Collection

In this project, we choose the subreddits Coffee and Tea, which share enough similarities to provide a good challenge but are different enough that it should be possible to train a machine learning model to tell them apart.

In [1]:
# import libraries
import numpy as np
import pandas as pd

import requests
import time
import random

In [4]:
def grab_post(topic):
    """Download up to 40 listing pages from a subreddit and save the posts to CSV."""
    posts = []
    after = None
    url = 'https://www.reddit.com/r/' + topic + '.json'

    for _ in range(40):
        # the first request has no 'after' token; later requests continue
        # from the last post of the previous page
        if after is None:
            current_url = url
        else:
            current_url = url + '?after=' + after
        print(current_url)
        res = requests.get(current_url, headers={'User-agent': 'YL Agent'})

        if res.status_code != 200:
            print('Status error', res.status_code)
            break

        current_dict = res.json()
        current_posts = [p['data'] for p in current_dict['data']['children']]
        posts.extend(current_posts)
        after = current_dict['data']['after']

        # no 'after' token means there are no more pages; stop here so the
        # next iteration does not download the first page again
        if after is None:
            break

        # generate a random sleep duration to look more 'natural'
        sleep_duration = random.randint(2, 6)
        print(sleep_duration)
        time.sleep(sleep_duration)

    # save raw data to csv
    pd.DataFrame(posts).to_csv("../data/" + topic + ".csv", index=False)
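
The pagination logic in `grab_post` can be isolated into two small helpers that are easy to check without calling the network. This is a minimal sketch: `build_page_url` and `dedupe_posts` are hypothetical names, not part of the notebook. It assumes each post dictionary carries Reddit's `name` field (the `t3_...` identifier that also appears as the `after` token in the URLs).

```python
from urllib.parse import urlencode


def build_page_url(topic, after=None):
    """Build the listing URL for a subreddit, optionally continuing after a post."""
    base = f"https://www.reddit.com/r/{topic}.json"
    if after is None:
        return base
    return base + "?" + urlencode({"after": after})


def dedupe_posts(posts):
    """Keep the first occurrence of each post, keyed on its 'name' field."""
    seen = set()
    unique = []
    for post in posts:
        post_id = post.get("name")
        if post_id in seen:
            continue
        seen.add(post_id)
        unique.append(post)
    return unique


# Hypothetical post dictionaries, standing in for p['data'] entries.
sample = [
    {"name": "t3_aaa", "title": "first"},
    {"name": "t3_bbb", "title": "second"},
    {"name": "t3_aaa", "title": "first"},
]
print(build_page_url("Tea"))
print(build_page_url("Tea", "t3_o51yit"))
print(len(dedupe_posts(sample)))
```

Deduplicating before saving is a cheap safeguard in case the same post is returned on more than one page, which would otherwise inflate the dataset with repeated rows.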

In [5]:
grab_post('Tea')

https://www.reddit.com/r/Tea.json
3
https://www.reddit.com/r/Tea.json?after=t3_o51yit
3
https://www.reddit.com/r/Tea.json?after=t3_o4kuzn
4
https://www.reddit.com/r/Tea.json?after=t3_o3j7gr
6
https://www.reddit.com/r/Tea.json?after=t3_o32arr
4
https://www.reddit.com/r/Tea.json?after=t3_o1py6g
3
https://www.reddit.com/r/Tea.json?after=t3_o1m7vj
5
https://www.reddit.com/r/Tea.json?after=t3_o0q8q7
6
https://www.reddit.com/r/Tea.json?after=t3_nzqhi2
5
https://www.reddit.com/r/Tea.json?after=t3_nz02je
4
https://www.reddit.com/r/Tea.json?after=t3_nylgjr
3
https://www.reddit.com/r/Tea.json?after=t3_nyc8cd
6
https://www.reddit.com/r/Tea.json?after=t3_nxccoz
4
https://www.reddit.com/r/Tea.json?after=t3_nwg554
3
https://www.reddit.com/r/Tea.json?after=t3_nvx641
2
https://www.reddit.com/r/Tea.json?after=t3_nv1eb3
6
https://www.reddit.com/r/Tea.json?after=t3_nu7w5x
6
https://www.reddit.com/r/Tea.json?after=t3_ntfstp
4
https://www.reddit.com/r/Tea.json?after=t3_nsaop6
5
https://www.reddit.com/r/Tea

In [6]:
grab_post('Coffee')

https://www.reddit.com/r/Coffee.json
4
https://www.reddit.com/r/Coffee.json?after=t3_o3q04o
2
https://www.reddit.com/r/Coffee.json?after=t3_o1wpx6
5
https://www.reddit.com/r/Coffee.json?after=t3_o0ap08
3
https://www.reddit.com/r/Coffee.json?after=t3_nzpcba
5
https://www.reddit.com/r/Coffee.json?after=t3_nxviug
5
https://www.reddit.com/r/Coffee.json?after=t3_nv2tug
2
https://www.reddit.com/r/Coffee.json?after=t3_ntwm1b
4
https://www.reddit.com/r/Coffee.json?after=t3_ns7sxd
5
https://www.reddit.com/r/Coffee.json?after=t3_nqi9ch
5
https://www.reddit.com/r/Coffee.json?after=t3_npjq4t
6
https://www.reddit.com/r/Coffee.json?after=t3_no7zg6
2
https://www.reddit.com/r/Coffee.json?after=t3_nmzzaq
4
https://www.reddit.com/r/Coffee.json?after=t3_nl7fnv
5
https://www.reddit.com/r/Coffee.json?after=t3_nk8op0
4
https://www.reddit.com/r/Coffee.json?after=t3_nj4kad
4
https://www.reddit.com/r/Coffee.json?after=t3_ngwxmb
2
https://www.reddit.com/r/Coffee.json?after=t3_ngb5js
4
https://www.reddit.com/r/C