<img src="../images/tinap.png" style="float: left; margin: 15px; height: 50px">

# Tina's Project - Subreddit Posts Classification (Web APIs & NLP Application)

## Part 4. Modeling
Gradient Boosting with 1-word and 2-word Count Vectorizer features

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import GradientBoostingClassifier

### Load Data - 1-word-cvec

In [2]:
df_cvec_1 = pd.read_csv('../data/df_cvec_1.csv')
df_cvec_1.head()

Unnamed: 0,00,008,01,011,013,02,022,026,03,031,...,zero,zimmer,zion,zions,zolak,zombie,zone,zrebiec,zubac,label
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


#### Shuffle the dataset
Before splitting the data into training and testing sets, we shuffle the original dataset. The posts are sorted by their subreddit labels, so splitting them in this order would leave most of the training data labeled `1 (nba)`, which could bias the model toward that class.

In [3]:
df_cvec_1 = df_cvec_1.sample(frac = 1, random_state = 2022).reset_index(drop = True)
df_cvec_1.head()

Unnamed: 0,00,008,01,011,013,02,022,026,03,031,...,zero,zimmer,zion,zions,zolak,zombie,zone,zrebiec,zubac,label
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
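
To make the concern above concrete, here is a minimal sketch on a made-up frame (the `toy` data and the 70% cut are assumptions for illustration, not the project's data). It compares the label mix of an ordered split with the same cut applied after shuffling.

```python
import pandas as pd

# Hypothetical toy data: 6 rows labeled 1 (nba) followed by 4 rows labeled 0,
# sorted by label just like the original dataset.
toy = pd.DataFrame({"label": [1] * 6 + [0] * 4})

# Ordered split: the first 70% of rows become the training set.
cut = int(len(toy) * 0.7)
ordered_train = toy.iloc[:cut]
ordered_test = toy.iloc[cut:]

# Shuffled split: the same cut, applied after shuffling the rows.
shuffled = toy.sample(frac=1, random_state=2022).reset_index(drop=True)
shuffled_train = shuffled.iloc[:cut]

print("ordered train, share of label 1:", ordered_train["label"].mean())
print("ordered test, share of label 1:", ordered_test["label"].mean())
print("shuffled train, share of label 1:", shuffled_train["label"].mean())
```

In the ordered split the test set contains no `1` rows at all, so a score on it would say nothing about how well the model recognizes that class.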


### Set up predictor variable (X) and target variable (y)
* X will be all columns except `label`.
* y will be the column `label`.

In [4]:
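# Features: every token-count column; target: the subreddit label
X = df_cvec_1.drop(columns = 'label')
y = df_cvec_1['label']
# Default split: 75% training, 25% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)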
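
A related option, shown here only as a sketch on made-up data, is the `stratify` argument of `train_test_split`. It keeps the class proportions the same in both splits. Note that `train_test_split` also shuffles rows by default. The `toy` frame and the `count_a` column are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 12 rows labeled 1 and 8 rows labeled 0 (60% / 40%).
toy = pd.DataFrame({"count_a": range(20), "label": [1] * 12 + [0] * 8})
X_toy = toy.drop(columns="label")
y_toy = toy["label"]

# stratify=y_toy asks for the same 60/40 label mix in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=42
)

print("train share of label 1:", y_tr.mean())
print("test share of label 1:", y_te.mean())
```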

### Modeling

In [5]:
# Fit a gradient boosting classifier with default hyperparameters
gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)

# .score() returns mean accuracy
print(f"Train: {gb.score(X_train, y_train)}")
print(f"Test: {gb.score(X_test, y_test)}")

Train: 0.8478478478478478
Test: 0.8538538538538538
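
Accuracy scores like these are easier to read next to a majority-class baseline. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels (the 6/4 class mix is an assumption, not the project's actual class balance) to show how such a baseline could be computed.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy labels: 6 posts labeled 1 and 4 posts labeled 0.
y_toy = np.array([1] * 6 + [0] * 4)
# The dummy model ignores the features, so a column of zeros is enough.
X_toy = np.zeros((len(y_toy), 1))

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
baseline_acc = baseline.score(X_toy, y_toy)

print(f"Baseline accuracy: {baseline_acc}")
```

A model is only useful if it scores clearly above this number on the same data.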


### Load Data - 2-word-cvec

In [6]:
df_cvec_2 = pd.read_csv('../data/df_cvec_2.csv')
df_cvec_2.head()

Unnamed: 0,00 atlanta,00 brooklyn,00 charlotte,00 chicago,00 dallas,00 denver,00 detroit,00 golden,00 houston,00 indiana,...,zions return,zolak belichick,zone jumps,zone june,zone pick,zone seattle,zone staying,zrebiec ravens,zubac scored,label
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [7]:
df_cvec_2 = df_cvec_2.sample(frac = 1, random_state = 2022).reset_index(drop = True)
df_cvec_2.head()

Unnamed: 0,00 atlanta,00 brooklyn,00 charlotte,00 chicago,00 dallas,00 denver,00 detroit,00 golden,00 houston,00 indiana,...,zions return,zolak belichick,zone jumps,zone june,zone pick,zone seattle,zone staying,zrebiec ravens,zubac scored,label
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
X = df_cvec_2.drop(columns = 'label')
y = df_cvec_2['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [9]:
gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)

print(f"Train: {gb.score(X_train, y_train)}")
print(f"Test: {gb.score(X_test, y_test)}")

Train: 0.7027027027027027
Test: 0.6726726726726727


### Summary
Gradient Boosting Classifier reaches a test accuracy of about 0.85 on the 1-word count vectors and about 0.67 on the 2-word count vectors. Compared to Multinomial Naive Bayes, it doesn't score higher on either one.
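
A side-by-side comparison of the two models could be organized as in the sketch below. Everything in it is illustrative: the `posts`, `labels`, and `n_estimators=20` are assumptions, and the scores are training accuracy on a tiny made-up corpus, so they say nothing about the real results.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: 1 = nba, 0 = the other subreddit.
posts = [
    "lakers beat celtics in overtime",
    "curry scored forty points tonight",
    "nets trade rumors heat up",
    "bucks win the playoff series",
    "patriots sign a new quarterback",
    "ravens defense forces three turnovers",
    "chiefs win on a late touchdown",
    "packers draft a wide receiver",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

results = {}
for name, ngram in [("1-word", (1, 1)), ("2-word", (2, 2))]:
    # Build the count vectors for this n-gram size.
    X_counts = CountVectorizer(ngram_range=ngram).fit_transform(posts)
    models = (
        MultinomialNB(),
        GradientBoostingClassifier(n_estimators=20, random_state=42),
    )
    for model in models:
        model.fit(X_counts, labels)
        # Training accuracy only: the toy corpus is too small to hold out a test set.
        results[(name, type(model).__name__)] = model.score(X_counts, labels)

for key, acc in results.items():
    print(key, acc)
```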