# Project 4: Hackathon

## *Examining Classification Models on Entire Data Set*

In this notebook:

* [Linear Regression](#lr-data)
* [Logistic Regression](#lgr-data)
* [KNN](#knn-data)
* [Tree Models](#trees-data)
* [Support Vector Machines](#svm-data)
* [Evaluating the Models](#eval-data)

Note: on this larger data set, KNN and SVM took too long to fit, so those models were aborted. Otherwise, the models mirror those used to examine the Hip-Hop genre.


#### Import Libraries & Read in Data

In [1]:
## standard imports 
import pandas as pd 
import numpy as np
import re
## visualizations
import matplotlib.pyplot as plt
import seaborn as sns
## preprocessing
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.dummy import DummyClassifier
## modeling
from sklearn.linear_model import LogisticRegression, LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.svm import SVC, SVR
from sklearn.naive_bayes import MultinomialNB
## trees
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import BaggingClassifier, BaggingRegressor, RandomForestClassifier, ExtraTreesClassifier, RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor, AdaBoostClassifier, GradientBoostingRegressor
## NLP
from sklearn.feature_extraction.text import CountVectorizer
## analysis
## plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score, make_scorer, f1_score, mean_squared_error

## options
import sklearn
pd.options.display.max_rows = 4000
pd.options.display.max_columns = 100
pd.set_option('max_colwidth', 100)

In [2]:
### read in data
data = pd.read_csv('../data/data_cleaned_classification.csv')
data.head()

| | genre | track_id | popularity | acousticness | danceability | energy | instrumentalness | liveness | loudness | speechiness | tempo | valence | is_popular |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Movie | 0BRjO6ga9RKCKjfDqeFgWV | 0 | 0.611 | 0.389 | 0.91 | 0.0 | 0.346 | -1.828 | 0.0525 | 166.969 | 0.814 | 0 |
| 1 | Movie | 0BjC1NfoEOOusryehmNudP | 1 | 0.246 | 0.59 | 0.737 | 0.0 | 0.151 | -5.559 | 0.0868 | 174.003 | 0.816 | 0 |
| 2 | Movie | 0CoSDzoNIKCRs124s9uTVy | 3 | 0.952 | 0.663 | 0.131 | 0.0 | 0.103 | -13.879 | 0.0362 | 99.488 | 0.368 | 0 |
| 3 | Movie | 0Gc6TVm52BwZD07Ki6tIvf | 0 | 0.703 | 0.24 | 0.326 | 0.0 | 0.0985 | -12.178 | 0.0395 | 171.758 | 0.227 | 0 |
| 4 | Movie | 0IuslXpMROHdEPvSl1fTQK | 4 | 0.95 | 0.331 | 0.225 | 0.123 | 0.202 | -21.15 | 0.0456 | 140.576 | 0.39 | 0 |


### EDA & Visualizations

In [3]:
### check balance of classes
data['is_popular'].value_counts()

1    114071
0    113132
Name: is_popular, dtype: int64
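
With the classes this close to balanced, a model that always predicts the majority class sets the accuracy floor. A minimal sketch of that baseline, computed from the counts printed above (the variable names are illustrative and not part of the original notebook):

```python
# Class counts copied from the value_counts() output above.
class_counts = {1: 114071, 0: 113132}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = class_counts[majority_class] / total

print('Majority class:', majority_class)
print('Baseline accuracy: {:.4f}'.format(baseline_accuracy))
```

Any accuracy score reported for a classifier in this notebook can be read against this floor of roughly 0.50.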

In [4]:
data.describe()

| | popularity | acousticness | danceability | energy | instrumentalness | liveness | loudness | speechiness | tempo | valence | is_popular |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 227203.0 | 227203.0 | 227203.0 | 227203.0 | 227203.0 | 227203.0 | 227203.0 | 227203.0 | 227203.0 | 227203.0 | 227203.0 |
| mean | 42.021074 | 0.363011 | 0.551036 | 0.575265 | 0.148487 | 0.216242 | -9.518473 | 0.120889 | 117.587412 | 0.449727 | 0.502066 |
| std | 17.427064 | 0.353884 | 0.184889 | 0.263201 | 0.3028 | 0.19938 | 6.023565 | 0.186038 | 30.905558 | 0.258224 | 0.499997 |
| min | 0.0 | 0.0 | 0.0569 | 2e-05 | 0.0 | 0.00967 | -52.457 | 0.0222 | 30.379 | 0.0 | 0.0 |
| 25% | 30.0 | 0.0355 | 0.433 | 0.394 | 0.0 | 0.0976 | -11.642 | 0.0367 | 92.757 | 0.234 | 0.0 |
| 50% | 44.0 | 0.223 | 0.567 | 0.612 | 4.7e-05 | 0.129 | -7.678 | 0.05 | 115.604 | 0.438 | 1.0 |
| 75% | 55.0 | 0.713 | 0.688 | 0.79 | 0.0366 | 0.266 | -5.464 | 0.105 | 139.131 | 0.653 | 1.0 |
| max | 100.0 | 0.996 | 0.987 | 0.999 | 0.999 | 1.0 | 3.744 | 0.967 | 242.903 | 1.0 | 1.0 |


In [8]:
# sns.pairplot(data, hue='is_popular')

### Select Data

In [5]:
### select data
X = data.drop(columns=['genre', 'track_id', 'is_popular', 'popularity'])
y = data['is_popular']
### TTS
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

## Linear Regression <a class="anchor" id="lr-data"></a>
<hr/>

In [6]:
lr = LinearRegression()
lr.fit(X_train, y_train)

print('Training score: ', lr.score(X_train, y_train))
print('Testing score: ', lr.score(X_test, y_test))

Training score:  0.19472233641243453
Testing score:  0.19798707836096807
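
`LinearRegression.score` returns R², not accuracy, so the values above are not directly comparable with the classifier scores that follow. One way to put the linear model on the same footing is to threshold its predictions at 0.5 and score the resulting labels. The sketch below illustrates the idea on synthetic data from `make_classification`, which stands in for the real features. The 0.5 cutoff and all variable names are assumptions, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the nine audio features and the binary target.
X_demo, y_demo = make_classification(n_samples=2000, n_features=9, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=42)

lr_demo = LinearRegression().fit(Xd_train, yd_train)

# .score() on a regressor is R^2, computed on the continuous predictions.
r2_test = lr_demo.score(Xd_test, yd_test)

# Thresholding at 0.5 (an assumed cutoff) turns the predictions into class labels.
labels_test = (lr_demo.predict(Xd_test) >= 0.5).astype(int)
acc_test = accuracy_score(yd_test, labels_test)

print('Test R^2: ', r2_test)
print('Test accuracy at 0.5 cutoff: ', acc_test)
```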


## Logistic Regression <a class="anchor" id="lgr-data"></a>
<hr/>

In [7]:
lgr = LogisticRegression(max_iter=1000)
lgr.fit(X_train, y_train)

print('Training score: ', lgr.score(X_train, y_train))
print('Testing score: ', lgr.score(X_test, y_test))

Training score:  0.6851151981784251
Testing score:  0.6896357458495449


In [8]:
pipe_lgr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
params_lgr = {
#     'selectkbest__k': [10],
#     'logisticregression__max_iter':[1000],
    'logisticregression__C': [0.01, 0.1, 0.15]
}

gs_lgr = GridSearchCV(pipe_lgr, params_lgr, n_jobs=-1)
gs_lgr.fit(X_train, y_train)

print('Training score: ', gs_lgr.score(X_train, y_train))
print('Testing score: ', gs_lgr.score(X_test, y_test))
print('Best Params', gs_lgr.best_params_)

Training score:  0.6855494653818617
Testing score:  0.6904279854227918
Best Params {'logisticregression__C': 0.1}


## KNN Classifier <a class="anchor" id="knn-data"></a>
<hr/>

In [11]:
# pipe_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())

# params_knn = {
# #     'selectkbest__k': [5, 10, 15],
#     'kneighborsclassifier__n_neighbors': [5],
# }

# gs_knn = GridSearchCV(pipe_knn, params_knn, n_jobs=-1)
# gs_knn.fit(X_train, y_train)

# print('Training score: ', gs_knn.score(X_train, y_train))
# print('Testing score: ', gs_knn.score(X_test, y_test))
# print('F1 score: ', f1_score(y, gs_knn.predict(X)))
# print('Best Params', gs_knn.best_params_)

KNN took too long to fit on the full data set, so this model was aborted.
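
One common workaround when KNN is too slow on the full data is to fit on a stratified random subsample of the training rows. The sketch below shows the pattern on synthetic data. The subsample size of 5,000 rows is an arbitrary assumption, the `_demo` names are illustrative, and none of this was run as part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target.
X_demo, y_demo = make_classification(n_samples=20000, n_features=9, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=42)

# Draw a stratified subsample of the training rows (size is an assumption).
X_sub, _, y_sub, _ = train_test_split(
    Xd_train, yd_train, train_size=5000, stratify=yd_train, random_state=42
)

# Scale the features, then fit KNN on the subsample only.
pipe_knn_demo = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
pipe_knn_demo.fit(X_sub, y_sub)

knn_test_score = pipe_knn_demo.score(Xd_test, yd_test)
print('Testing score: ', knn_test_score)
```

Scoring is still done on the full held-out test split, so the result stays comparable with the other models.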

## Tree Models <a class="anchor" id="trees-data"></a>
<hr/>

### Basic Decision Tree

In [9]:
dt = DecisionTreeClassifier(random_state = 42)

dt.fit(X_train, y_train)

print('Training Score: ', dt.score(X_train, y_train))
print('Testing Score: ', dt.score(X_test, y_test))

Training Score:  0.9949531108789803
Testing Score:  0.7526275945846024


### Bagged Decision Tree

In [10]:
bag = BaggingClassifier()

bag.fit(X_train, y_train)

print('Training Score: ', bag.score(X_train, y_train))
print('Testing Score: ', bag.score(X_test, y_test))

Training Score:  0.9860623701599747
Testing Score:  0.7847220999630288


### Random Forest

In [11]:
rf = RandomForestClassifier()

rf.fit(X_train, y_train)

print('Training Score: ', rf.score(X_train, y_train))
print('Testing Score: ', rf.score(X_test, y_test))

Training Score:  0.9949472424032582
Testing Score:  0.7934367352687453




### Ada Boost

In [12]:
ada = AdaBoostClassifier(n_estimators=100, random_state = 22)

ada.fit(X_train, y_train)

print('Training Score: ',ada.score(X_train, y_train))
print('Testing Score: ', ada.score(X_test, y_test))

Training Score:  0.6979202122040821
Testing Score:  0.7019418672206476


## Support Vector Machine <a class="anchor" id="svm-data"></a>
<hr/>

In [None]:
# svm = SVC()

# svm.fit(X_train, y_train)

# print('Training Score: ', svm.score(X_train, y_train))
# print('Testing Score: ', svm.score(X_test, y_test))

SVM took too long to fit on the full data set, so this model was aborted.
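
Kernel `SVC` fit time grows at least quadratically with the number of samples, which makes it impractical for a training split of roughly 170,000 rows. A linear SVM via `LinearSVC` is one scalable alternative, at the cost of giving up the default RBF kernel. The sketch below shows the pattern on synthetic data. The hyperparameters and `_demo` names are assumptions, and this model was not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the real features and target.
X_demo, y_demo = make_classification(n_samples=20000, n_features=9, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=42)

# Linear SVMs are sensitive to feature scale, so standardize first.
pipe_svm_demo = make_pipeline(
    StandardScaler(),
    LinearSVC(C=1.0, max_iter=5000, random_state=42),
)
pipe_svm_demo.fit(Xd_train, yd_train)

svm_train_score = pipe_svm_demo.score(Xd_train, yd_train)
svm_test_score = pipe_svm_demo.score(Xd_test, yd_test)
print('Training Score: ', svm_train_score)
print('Testing Score: ', svm_test_score)
```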

## Evaluate Models <a class="anchor" id="eval-data"></a>
<hr/>

In [14]:
models = {
    'Logistic Regression' : gs_lgr,
#     'KNN': gs_knn,
    'Decision Tree': dt,
    'Bagged Decision Tree' : bag,
    'Random Forest': rf,
    'AdaBoost': ada,
#     'SVC': svm
}

print('Classification Models: F1 Score')
print('----------------------------------------------------------------------')
print('{:^20s} | {:^14s} | {:^14s}| {:^14s}|'.format('Model', 'Training Score', 'Testing Score', 'Full Score'))
print('----------------------------------------------------------------------')

for name, model in models.items():
    ## predictions on each split and on the full data set
    y_preds_train = model.predict(X_train)
    y_preds_test = model.predict(X_test)
    y_preds = model.predict(X)
    train_score = f1_score(y_train, y_preds_train)
    test_score = f1_score(y_test, y_preds_test)
    ## the full score includes the training rows, so it overstates generalization
    data_score = f1_score(y, y_preds)
    print('{:20s} | {:^14f} | {:^14f}| {:^14f}|'.format(name, train_score, test_score, data_score))
print('----------------------------------------------------------------------')

Classification Models: F1 Score
----------------------------------------------------------------------
       Model         | Training Score | Testing Score |   Full Score  |
----------------------------------------------------------------------
Logistic Regression  |    0.712909    |    0.716579   |    0.713824   |
Decision Tree        |    0.994969    |    0.763360   |    0.935254   |
Bagged Decision Tree |    0.986095    |    0.786052   |    0.935939   |
Random Forest        |    0.994982    |    0.800673   |    0.945348   |
AdaBoost             |    0.723251    |    0.726264   |    0.724003   |
----------------------------------------------------------------------
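
F1 scores are easier to interpret next to a trivial reference point. As a rough check, the sketch below computes the F1 score of a hypothetical model that labels every track as popular, using the full-data class counts reported earlier in the notebook. This baseline was not part of the original evaluation, and the variable names are illustrative.

```python
# Full-data class counts from the value_counts() output earlier in the notebook.
n_popular = 114071
n_unpopular = 113132

# A model that predicts 1 for every row:
true_pos = n_popular       # every popular track is labeled correctly
false_pos = n_unpopular    # every unpopular track is mislabeled as popular
false_neg = 0              # no popular track is missed

precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)
f1_all_positive = 2 * precision * recall / (precision + recall)

print('All-positive baseline F1: {:.4f}'.format(f1_all_positive))
```

On these counts the baseline lands near 0.67, so the logistic regression and AdaBoost testing scores of about 0.72 to 0.73 are a modest improvement, while the random forest's testing score of about 0.80 is a clearer gain.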
