#  Project 3: Web APIs & Classification

## Problem Statement

Using Reddit's API, you'll collect posts from two subreddits of your choosing.
You'll then use NLP to train a classifier to predict which subreddit a given post came from. This is a binary classification problem.
Create and compare two models. One of these must be a Bayes classifier; the other can be a classifier of your choosing: logistic regression, KNN, SVM, etc.
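
To make the Bayes classifier requirement concrete, here is a minimal sketch of a multinomial naive Bayes text classifier with Laplace smoothing, written in plain Python. The toy posts, the whitespace tokenizer, and the function names `train_nb` and `predict_nb` are illustrative assumptions; a real project would more likely rely on a library implementation and proper text preprocessing.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and split on whitespace; real preprocessing would do more.
    return text.lower().split()


def train_nb(docs, labels):
    """Collect per-class word counts, per-class document counts and the vocabulary."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = tokenize(doc)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab


def predict_nb(model, doc, alpha=1.0):
    """Return the class with the highest log posterior (Laplace smoothing with alpha)."""
    word_counts, class_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / n_docs)  # log prior
        total = sum(word_counts[label].values())
        for token in tokenize(doc):
            if token not in vocab:
                continue  # words never seen in training are ignored
            likelihood = (word_counts[label][token] + alpha) / (total + alpha * len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy posts, only for illustration.
docs = [
    "best espresso grinder for home",
    "pour over coffee beans roast",
    "green tea steeping time",
    "oolong and black tea leaves",
]
labels = ["Coffee", "Coffee", "Tea", "Tea"]

model = train_nb(docs, labels)
print(predict_nb(model, "how long to steep oolong tea"))
print(predict_nb(model, "espresso roast beans"))
```

The second model can be trained on the same tokens, so both classifiers are compared on identical features.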


## Executive Summary
The Ames housing dataset includes 80 features (nominal, ordinal, discrete and continuous variables) describing individual residential properties sold in Ames, IA from 2006 to 2010.

In the first workflow step, data cleaning, missing values were detected and fixed, and outlier points were investigated and eliminated from the dataset. Once data cleaning was done, Exploratory Data Analysis (EDA) was conducted for each feature. Ordinal features were converted to discrete values. For categorical (nominal) data, bar plots were created to visualize the mean Sale Price across categories. During EDA for categorical features, special attention was paid to any patterns or clusters in Sale Price that emerged. I also explored continuous/discrete features in similar groups, such as year sold/built, number of bathrooms and floor SF, to create more meaningful new features.
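
As an illustration of converting ordinal features to discrete values, the sketch below maps ordered category labels to integers. The five-level quality scale, the sample values and the choice of `0` for missing entries are assumptions made for the example, not the mapping actually used in the analysis.

```python
# Hypothetical quality scale, ordered from worst to best.
QUALITY_SCALE = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}


def encode_ordinal(values, scale, missing=0):
    """Map ordered category labels to integers; unknown or missing labels get `missing`."""
    return [scale.get(value, missing) for value in values]


kitchen_quality = ["Gd", "TA", None, "Ex"]  # made-up sample values
encoded = encode_ordinal(kitchen_quality, QUALITY_SCALE)
print(encoded)  # [4, 3, 0, 5]
```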

The linear relationship between Sale Price and all numeric features was examined using a heatmap and correlation coefficients. Features with a high correlation (>= 0.4) were filtered into a shortlist for a visual check.
Features with continuous data were plotted with scatter plots, while features with discrete data were plotted with box plots.
A heatmap of all filtered features was also plotted to check collinearity between features. If two features were highly collinear, the one with the lower correlation with Sale Price was removed from the filtered features list.
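
The filtering procedure described above can be sketched as follows. The 0.4 target-correlation threshold comes from the text; the 0.8 collinearity threshold, the use of absolute correlation, the column names and the toy numbers are assumptions for illustration only.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def select_features(features, target, min_corr=0.4, max_collinear=0.8):
    """Keep features correlated with the target, dropping the weaker of any collinear pair."""
    strength = {name: abs(pearson(col, target)) for name, col in features.items()}
    # Visit candidates from strongest to weakest so the stronger feature of a pair wins.
    candidates = [n for n in sorted(strength, key=strength.get, reverse=True)
                  if strength[n] >= min_corr]
    selected = []
    for name in candidates:
        collinear = any(abs(pearson(features[name], features[other])) >= max_collinear
                        for other in selected)
        if not collinear:
            selected.append(name)
    return selected


# Made-up toy data.
target = [10, 20, 30, 40, 50]
features = {
    "area": [1, 2, 3, 4, 5],        # strongly correlated with the target
    "area_noisy": [1, 2, 3, 5, 4],  # collinear with "area", weaker link to the target
    "rooms": [3, 1, 2, 5, 4],       # moderately correlated with the target
    "noise": [3, 1, 4, 1, 5],       # below the 0.4 threshold
}
print(select_features(features, target))  # ['area', 'rooms']
```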

After the filtered features list was confirmed, the train dataset was divided into a training set (70% of the data) and a holdout set (30% of the data) to prepare for modeling. In this project, 4 models (linear regression, ridge regression, lasso regression and ElasticNet regression) were used.
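
A 70/30 split of this kind can be sketched in plain Python as shown below. The helper name `train_holdout_split` and the fixed seed are assumptions; a library splitter would normally be used in practice.

```python
import random


def train_holdout_split(rows, train_frac=0.7, seed=42):
    """Shuffle a copy of the rows and split them into training and holdout lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train_frac)
    return shuffled[:n_train], shuffled[n_train:]


rows = list(range(10))  # stand-in for the rows of the train dataset
train, holdout = train_holdout_split(rows)
print(len(train), len(holdout))  # 7 3
```

Shuffling before splitting avoids a holdout set that only contains the last rows of the file.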

In the first verification, the filtered features list of 27 features from EDA was used for these 4 models. Among these models, ElasticNet had the best MSE score. However, I observed that the line of best fit did not fit well at higher Sale Prices, where the relationship tended to curve. I decided to add squared values (power 2) for some features that had a high correlation with Sale Price, to see whether this could improve the MSE score and the line of best fit in the second verification.

In the second verification, 6 squared values (power 2) were added. The result showed a much better MSE and a slight improvement in the line of best fit for higher Sale Prices. So I decided to try a higher power, i.e. add power 3 features, to verify whether this could further improve the models in the third verification.
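
Adding power 2 and power 3 terms amounts to appending new columns derived from existing ones. The sketch below shows the idea on a small column-oriented table; the column names and values are hypothetical, since the six features that were actually expanded are not listed here.

```python
def add_power_features(table, columns, powers=(2, 3)):
    """Return a copy of `table` (column name -> list of numbers) with power columns added."""
    expanded = {name: list(values) for name, values in table.items()}
    for name in columns:
        for power in powers:
            expanded[f"{name}^{power}"] = [value ** power for value in table[name]]
    return expanded


# Hypothetical columns and values.
table = {"living_area": [1.0, 2.0, 3.0], "overall_quality": [5, 6, 7]}
expanded = add_power_features(table, ["living_area"])
print(expanded["living_area^2"])  # [1.0, 4.0, 9.0]
print(expanded["living_area^3"])  # [1.0, 8.0, 27.0]
```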

In the third verification, 6 power 3 values were added. MSE scores improved slightly. Checking the lasso regression coefficients showed that some features' coefficients were zero. This may be due to high collinearity between a feature and its power 2 / power 3 terms. In the fourth verification, features with zero coefficients were dropped.


In the fourth verification, the MSE score showed almost no improvement. However, the gap between the train MSE score and the holdout MSE score was small. The R2 scores for the train and holdout datasets were 91% and 90% respectively, which suggests that this model fits both the train and holdout data well.
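
For reference, the two metrics compared above can be computed as in the following sketch. The function names mirror common library names but are defined here from scratch, and the numbers are made up rather than taken from the project.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values for illustration.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.0, 7.5]
print(round(mean_squared_error(y_true, y_pred), 4))  # 0.1667
print(r2_score(y_true, y_pred))  # 0.9375
```

Computing both metrics on the training set and on the holdout set, then comparing the pairs, is what reveals whether a model overfits.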

After the best model was identified, the residuals plot was evaluated and Sale Prices in the test dataset were predicted. Interpretations and recommendations were made based on the best-performing model.

## Data Collection

The two subreddits chosen are r/Coffee and r/Tea.

In [1]:
import numpy as np
import pandas as pd

import requests
import time
import random

In [4]:
def grab_post(topic):
    """Download up to 40 listing pages from r/<topic> and save the posts to a CSV file."""
    posts = []
    after = None
    url = 'https://www.reddit.com/r/' + topic + '.json'

    for _ in range(40):
        # the first request has no 'after' token; later requests continue from the last post seen
        if after is None:
            current_url = url
        else:
            current_url = url + '?after=' + after
        print(current_url)
        res = requests.get(current_url, headers={'User-agent': 'YL Agent'})

        if res.status_code != 200:
            print('Status error', res.status_code)
            break

        current_dict = res.json()
        current_posts = [p['data'] for p in current_dict['data']['children']]
        posts.extend(current_posts)
        after = current_dict['data']['after']

        # 'after' is None on the last page; stop so the loop does not restart from the first page
        if after is None:
            break

        # generate a random sleep duration to look more 'natural'
        sleep_duration = random.randint(2, 6)
        print(sleep_duration)
        time.sleep(sleep_duration)

    pd.DataFrame(posts).to_csv("../data/" + topic + ".csv", index=False)
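
Listing pages can overlap between requests, so the collected list may contain the same post more than once. The sketch below removes such repeats before the data is used for modeling. It assumes each post dictionary carries Reddit's `name` field (the `t3_...` identifier that also appears in the `after` tokens); the helper name `dedupe_posts` and the sample posts are hypothetical.

```python
def dedupe_posts(posts, key="name"):
    """Keep the first occurrence of each post, identified by its `key` field."""
    seen = set()
    unique = []
    for post in posts:
        if post[key] not in seen:
            seen.add(post[key])
            unique.append(post)
    return unique


# Placeholder posts with made-up identifiers.
sample_posts = [
    {"name": "t3_aaa", "title": "First post"},
    {"name": "t3_bbb", "title": "Second post"},
    {"name": "t3_aaa", "title": "First post"},
]
unique_posts = dedupe_posts(sample_posts)
print(len(unique_posts))  # 2
```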

In [5]:
grab_post('Tea')

https://www.reddit.com/r/Tea.json
3
https://www.reddit.com/r/Tea.json?after=t3_o51yit
3
https://www.reddit.com/r/Tea.json?after=t3_o4kuzn
4
https://www.reddit.com/r/Tea.json?after=t3_o3j7gr
6
https://www.reddit.com/r/Tea.json?after=t3_o32arr
4
https://www.reddit.com/r/Tea.json?after=t3_o1py6g
3
https://www.reddit.com/r/Tea.json?after=t3_o1m7vj
5
https://www.reddit.com/r/Tea.json?after=t3_o0q8q7
6
https://www.reddit.com/r/Tea.json?after=t3_nzqhi2
5
https://www.reddit.com/r/Tea.json?after=t3_nz02je
4
https://www.reddit.com/r/Tea.json?after=t3_nylgjr
3
https://www.reddit.com/r/Tea.json?after=t3_nyc8cd
6
https://www.reddit.com/r/Tea.json?after=t3_nxccoz
4
https://www.reddit.com/r/Tea.json?after=t3_nwg554
3
https://www.reddit.com/r/Tea.json?after=t3_nvx641
2
https://www.reddit.com/r/Tea.json?after=t3_nv1eb3
6
https://www.reddit.com/r/Tea.json?after=t3_nu7w5x
6
https://www.reddit.com/r/Tea.json?after=t3_ntfstp
4
https://www.reddit.com/r/Tea.json?after=t3_nsaop6
5
https://www.reddit.com/r/Tea

In [6]:
grab_post('Coffee')

https://www.reddit.com/r/Coffee.json
4
https://www.reddit.com/r/Coffee.json?after=t3_o3q04o
2
https://www.reddit.com/r/Coffee.json?after=t3_o1wpx6
5
https://www.reddit.com/r/Coffee.json?after=t3_o0ap08
3
https://www.reddit.com/r/Coffee.json?after=t3_nzpcba
5
https://www.reddit.com/r/Coffee.json?after=t3_nxviug
5
https://www.reddit.com/r/Coffee.json?after=t3_nv2tug
2
https://www.reddit.com/r/Coffee.json?after=t3_ntwm1b
4
https://www.reddit.com/r/Coffee.json?after=t3_ns7sxd
5
https://www.reddit.com/r/Coffee.json?after=t3_nqi9ch
5
https://www.reddit.com/r/Coffee.json?after=t3_npjq4t
6
https://www.reddit.com/r/Coffee.json?after=t3_no7zg6
2
https://www.reddit.com/r/Coffee.json?after=t3_nmzzaq
4
https://www.reddit.com/r/Coffee.json?after=t3_nl7fnv
5
https://www.reddit.com/r/Coffee.json?after=t3_nk8op0
4
https://www.reddit.com/r/Coffee.json?after=t3_nj4kad
4
https://www.reddit.com/r/Coffee.json?after=t3_ngwxmb
2
https://www.reddit.com/r/Coffee.json?after=t3_ngb5js
4
https://www.reddit.com/r/C