# 04_Modeling with Categorical Features (6 Classes, Omitting Ignore)

In this section, I will attempt to build a model to predict what the specific state response will be to each protest, given that the state doesn't simply 'ignore' the protest movement.

In [195]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import confusion_matrix, plot_confusion_matrix, accuracy_score, recall_score, precision_score

pd.set_option('display.max_columns', None)

In [196]:
# Cleaned dataframe
df = pd.read_csv('../data/protests_clean3.csv')
df.head()

Unnamed: 0,year,region,protesterviolence,participants_category,notes,nationwide,startdate,enddate,length,participants_size,demand_political_behavior_process,demand_labor_wage_dispute,demand_police_brutality,demand_social_restrictions,demand_land_farm_issue,demand_politician_removal,demand_price_inc_tax_policy,stateresponse,neg_response
0,1990,North America,0,1000-1999,canada s railway passenger system was finally ...,1,1990-01-15,1990-01-15,0 days,1500,1,1,0,0,0,0,0,ignore,0
1,1990,North America,0,1000-1999,protestors were only identified as young peopl...,0,1990-06-25,1990-06-25,0 days,1500,1,0,0,0,0,0,0,ignore,0
2,1990,North America,0,100-999,"the queen, after calling on canadians to remai...",0,1990-07-01,1990-07-01,0 days,550,1,0,0,0,0,0,0,ignore,0
3,1990,North America,1,100-999,canada s federal government has agreed to acqu...,0,1990-07-12,1990-09-06,56 days,550,0,0,0,0,1,0,0,accomodation,0
4,1990,North America,1,100-999,protests were directed against the state due t...,0,1990-08-14,1990-08-15,1 days,550,1,0,0,0,0,0,0,arrests,1


In [197]:
no_ignore = df[df['stateresponse'] != 'ignore']

In [198]:
no_ignore.shape

(7214, 19)

Dataframe excluding the ignore values have 7214 rows to work with.

In [199]:
no_ignore['stateresponse'].value_counts()

crowd dispersal    2720
arrests            1566
accomodation       1032
killings            825
beatings            637
shootings           434
Name: stateresponse, dtype: int64

In [200]:
no_ignore['stateresponse'].value_counts(normalize=True)

crowd dispersal    0.377045
arrests            0.217078
accomodation       0.143055
killings           0.114361
beatings           0.088301
shootings          0.060161
Name: stateresponse, dtype: float64

There is a high class imbalances in which the most frequent 'crowd dispersal' takes up 38% of the data, when the least frequent'shooting' only takes up 6% of the data. However, I decided not to bring any changes to the dataframe to deal with the imbalance because the dataset is very small to begin with that little tweak with the dataset may highly affect the integrity and variance of the model. 

In [201]:
# drop unneccessary columns.
drop_columns1 = ['year', 'notes', 'startdate', 'enddate', 'neg_response', 'participants_size', 'nationwide']

In [202]:
no_ignore.drop(columns=drop_columns1)

Unnamed: 0,region,protesterviolence,participants_category,length,demand_political_behavior_process,demand_labor_wage_dispute,demand_police_brutality,demand_social_restrictions,demand_land_farm_issue,demand_politician_removal,demand_price_inc_tax_policy,stateresponse
3,North America,1,100-999,56 days,0,0,0,0,1,0,0,accomodation
4,North America,1,100-999,1 days,1,0,0,0,0,0,0,arrests
5,North America,0,100-999,0 days,0,0,1,0,0,0,0,shootings
8,North America,1,1000-1999,1 days,0,0,1,0,0,0,0,arrests
10,North America,0,100-999,61 days,1,0,0,0,0,0,0,arrests
...,...,...,...,...,...,...,...,...,...,...,...,...
15192,Oceania,0,2000-4999,0 days,1,0,0,0,0,0,0,crowd dispersal
15193,Oceania,1,100-999,2 days,1,0,0,0,0,0,0,shootings
15194,Oceania,1,1000-1999,25 days,0,0,0,0,0,1,0,killings
15195,Oceania,0,0-99,0 days,1,0,0,0,1,0,0,accomodation


In [203]:
df1 = no_ignore.drop(columns=drop_columns1)
df1.head()

Unnamed: 0,region,protesterviolence,participants_category,length,demand_political_behavior_process,demand_labor_wage_dispute,demand_police_brutality,demand_social_restrictions,demand_land_farm_issue,demand_politician_removal,demand_price_inc_tax_policy,stateresponse
3,North America,1,100-999,56 days,0,0,0,0,1,0,0,accomodation
4,North America,1,100-999,1 days,1,0,0,0,0,0,0,arrests
5,North America,0,100-999,0 days,0,0,1,0,0,0,0,shootings
8,North America,1,1000-1999,1 days,0,0,1,0,0,0,0,arrests
10,North America,0,100-999,61 days,1,0,0,0,0,0,0,arrests


In [204]:
# 'length feature to integers'
df1['length'] = df1['length'].str.replace('days', '')

In [103]:
df1['length'] = df1['length'].astype(int)

In [104]:
df1.dtypes

region                               object
protesterviolence                     int64
participants_category                object
length                                int64
demand_political_behavior_process     int64
demand_labor_wage_dispute             int64
demand_police_brutality               int64
demand_social_restrictions            int64
demand_land_farm_issue                int64
demand_politician_removal             int64
demand_price_inc_tax_policy           int64
stateresponse                        object
dtype: object

#### OHE categoricals

categorical data 'region' and 'participants_category' was one-hot encoded.

In [105]:
df1_ohe = pd.get_dummies(df1, columns=['region', 'participants_category'], drop_first=True)
df1_ohe.head()

Unnamed: 0,protesterviolence,length,demand_political_behavior_process,demand_labor_wage_dispute,demand_police_brutality,demand_social_restrictions,demand_land_farm_issue,demand_politician_removal,demand_price_inc_tax_policy,stateresponse,region_Asia,region_Central America,region_Europe,region_MENA,region_North America,region_Oceania,region_South America,participants_category_100-999,participants_category_1000-1999,participants_category_2000-4999,participants_category_5000-10000,participants_category_>10000
3,1,56,0,0,0,0,1,0,0,accomodation,0,0,0,0,1,0,0,1,0,0,0,0
4,1,1,1,0,0,0,0,0,0,arrests,0,0,0,0,1,0,0,1,0,0,0,0
5,0,0,0,0,1,0,0,0,0,shootings,0,0,0,0,1,0,0,1,0,0,0,0
8,1,1,0,0,1,0,0,0,0,arrests,0,0,0,0,1,0,0,0,1,0,0,0
10,0,61,1,0,0,0,0,0,0,arrests,0,0,0,0,1,0,0,1,0,0,0,0


Transform Target Column

In [205]:
# Transform target column to integer values, so that we can fit XGBoost model
df1_ohe['stateresponse'] = df1_ohe['stateresponse'].replace({'killings':5,
                                                             'shootings':4,
                                                             'beatings':3,
                                                             'arrests':2, 
                                                             'crowd dispersal':1,
                                                             'accomodation':0
                                                             })

In [122]:
df1_ohe['stateresponse'].value_counts()

1    2720
2    1566
0    1032
5     825
3     637
4     434
Name: stateresponse, dtype: int64

In [176]:
# baseline
df1_ohe['stateresponse'].value_counts(normalize=True)

1    0.377045
2    0.217078
0    0.143055
5    0.114361
3    0.088301
4    0.060161
Name: stateresponse, dtype: float64

The baseline accuracy is 38%

#### train-test-split

In [138]:
X = df1_ohe.drop(columns='stateresponse')
y = df1_ohe['stateresponse']

In [139]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

**standard scale**

In [141]:
ss = StandardScaler()

Xs_train = ss.fit_transform(X_train)
Xs_test = ss.transform(X_test)

#### Modeling helper functions

*From Denise's code*

In [177]:
def run_models(model, 
               Xs_train, y_train, Xs_test, y_test,
               verbose=True):
    
    results = {}
    
    model.fit(X_train, y_train)

    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    
    results['train_accuracy'] = accuracy_score(y_train, y_pred_train)
    results['test_accuracy'] = accuracy_score(y_test, y_pred_test)
    results['variance'] = results['train_accuracy'] - results['test_accuracy']
    results['test_recall'] = recall_score(y_test, y_pred_test, zero_division=0, average='weighted')
    results['test_precision'] = precision_score(y_test, y_pred_test, zero_division=0, average='weighted')
    
    return results

In [178]:
def test_models(models, Xs_train, y_train, Xs_test, y_test, verbose=False):

    results = {}
    
    for name,model in models.items():
        if verbose:
            print('\nRunning {} - {}'.format(name, model))
        
        results[name] = run_models(model, X_train, y_train, X_test, y_test, verbose=False)
        
        if verbose:
            print('Results: ', results[name])

    return pd.DataFrame.from_dict(results, orient='index')

In [180]:
models = {'Most Frequent': DummyClassifier(strategy='most_frequent'),
          'Multinomial Naive Bayes': MultinomialNB(),
          'Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
          'Random Forest': RandomForestClassifier(n_estimators=100),
          'XGBoost': XGBClassifier(),
          'Support Vector Classifier': SVC()}

#### Models

In [181]:
results1 = test_models(models,
                       Xs_train,
                       y_train,
                       Xs_test,
                       y_test,
                       verbose=True)


Running Most Frequent - DummyClassifier(strategy='most_frequent')
Results:  {'train_accuracy': 0.3770577023046266, 'test_accuracy': 0.376992376992377, 'variance': 6.532531224962002e-05, 'test_recall': 0.376992376992377, 'test_precision': 0.1421232523103625}

Running Multinomial Naive Bayes - MultinomialNB()
Results:  {'train_accuracy': 0.397504765205337, 'test_accuracy': 0.3866943866943867, 'variance': 0.010810378510950291, 'test_recall': 0.3866943866943867, 'test_precision': 0.38818144000021076}

Running Nearest Neighbors - KNeighborsClassifier()
Results:  {'train_accuracy': 0.4474094610985964, 'test_accuracy': 0.3700623700623701, 'variance': 0.07734709103622633, 'test_recall': 0.3700623700623701, 'test_precision': 0.3314089540529993}

Running Random Forest - RandomForestClassifier()
Results:  {'train_accuracy': 0.5884595390746837, 'test_accuracy': 0.3984753984753985, 'variance': 0.18998414059928526, 'test_recall': 0.3984753984753985, 'test_precision': 0.364537814684312}

Running XGB

**weighted models**

In [182]:
models2 = {'Nearest Neighbors': KNeighborsClassifier(n_neighbors=5, weights='uniform'),
          'Random Forest': RandomForestClassifier(n_estimators=100, class_weight='balanced'),
          'Support Vector Classifier': SVC(class_weight='balanced')}

In [186]:
results2 = test_models(models2,
                       Xs_train,
                       y_train,
                       Xs_test,
                       y_test,
                       verbose=True)


Running Nearest Neighbors - KNeighborsClassifier()
Results:  {'train_accuracy': 0.4474094610985964, 'test_accuracy': 0.3700623700623701, 'variance': 0.07734709103622633, 'test_recall': 0.3700623700623701, 'test_precision': 0.3314089540529993}

Running Random Forest - RandomForestClassifier(class_weight='balanced')
Results:  {'train_accuracy': 0.512389533876278, 'test_accuracy': 0.30353430353430355, 'variance': 0.2088552303419744, 'test_recall': 0.30353430353430355, 'test_precision': 0.36038122748138884}

Running Support Vector Classifier - SVC(class_weight='balanced')
Results:  {'train_accuracy': 0.24761739733148502, 'test_accuracy': 0.24532224532224534, 'variance': 0.002295152009239687, 'test_recall': 0.24532224532224534, 'test_precision': 0.35154094589100626}


Weighting the models that internally have the sample weighting system does not give dramatically better model. Compared to the non-weighted models, non-weighted XGBoost still performed the best. 

In the end, the models overall have bad accuracy and precision. Out of the models, XGBoost seems to be performing the best for both accuracy and precision.

In [184]:
results1

Unnamed: 0,train_accuracy,test_accuracy,variance,test_recall,test_precision
Most Frequent,0.377058,0.376992,6.5e-05,0.376992,0.142123
Multinomial Naive Bayes,0.397505,0.386694,0.01081,0.386694,0.388181
Nearest Neighbors,0.447409,0.370062,0.077347,0.370062,0.331409
Random Forest,0.58846,0.398475,0.189984,0.398475,0.364538
XGBoost,0.542887,0.420651,0.122235,0.420651,0.379625
Support Vector Classifier,0.385722,0.38115,0.004571,0.38115,0.400654


In [187]:
results2

Unnamed: 0,train_accuracy,test_accuracy,variance,test_recall,test_precision
Nearest Neighbors,0.447409,0.370062,0.077347,0.370062,0.331409
Random Forest,0.51239,0.303534,0.208855,0.303534,0.360381
Support Vector Classifier,0.247617,0.245322,0.002295,0.245322,0.351541


### Hypertuning

Will exclusively work with XGBoost and SVC as it performed the best in accuracy and precision, respectively.

In [155]:
#Initial estimates
xgb = XGBClassifier(learning_rate=0.1,
                    n_estimators=1000,
                    max_depth=5,
                    min_child_weight=1,
                    gamma=0,
                    subsample=0.8,
                    colsampe_bytree=0.8,
                    nthread=4,
                    scale_pos_weight=1,
                   )

Parameters: { "colsampe_bytree", "scale_pos_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.




XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsampe_bytree=0.8, colsample_bylevel=1, colsample_bynode=1,
              colsample_bytree=1, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, gamma=0, gpu_id=-1,
              grow_policy='depthwise', importance_type=None,
              interaction_constraints='', learning_rate=0.1, max_bin=256,
              max_cat_to_onehot=4, max_delta_step=0, max_depth=5, max_leaves=0,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=1000, n_jobs=4, nthread=4, num_parallel_tree=1,
              objective='multi:softprob', predictor='auto', ...)

In [163]:
train_pred = xgb.predict(Xs_train)
test_pred = xgb.predict(Xs_test)

In [164]:
accuracy_score(y_train, train_pred)

0.5770230462658118

In [165]:
accuracy_score(y_test, test_pred)

0.4047124047124047

It gave higher train accuracy, but lower test accuracy; the model has higher variance than the baseline model run above. 

In [169]:
param_test1 = {'max_depth':range(3, 10, 2),
               'min_child_weight':range(1, 6, 2)}

gsearch1 = GridSearchCV(estimator = XGBClassifier(learning_rate=0.1,
                                   n_estimators=1000,
                                   gamma=0,
                                   subsample=0.8,
                                   colsampe_bytree=0.8,
                                   nthread=4,
                                   scale_pos_weight=1),
                       param_grid = param_test1,
                       scoring = 'roc_auc',
                       n_jobs = 4,
                       cv = 5)

In [170]:
gsearch1.fit(Xs_train, y_train)



Parameters: { "colsampe_bytree", "scale_pos_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.




GridSearchCV(cv=5,
             estimator=XGBClassifier(base_score=None, booster=None,
                                     callbacks=None, colsampe_bytree=0.8,
                                     colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None,
                                     early_stopping_rounds=None,
                                     enable_categorical=False, eval_metric=None,
                                     gamma=0, gpu_id=None, grow_policy=None,
                                     importance_type=None,
                                     interaction_constraints=None,
                                     learning_rate=0.1, max_bin=None,
                                     max_cat_to_onehot=None,
                                     max_delta_step=None, max_depth=None,
                                     max_leaves=None, min_child_weight=None,
                           

Parameters: { "colsampe_bytree", "scale_pos_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Parameters: { "colsampe_bytree", "scale_pos_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Parameters: { "colsampe_bytree", "scale_pos_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if y

Traceback (most recent call last):
  File "/Users/rhoeunpark/opt/anaconda3/envs/dsi/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 761, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/Users/rhoeunpark/opt/anaconda3/envs/dsi/lib/python3.8/site-packages/sklearn/metrics/_scorer.py", line 216, in __call__
    return self._score(
  File "/Users/rhoeunpark/opt/anaconda3/envs/dsi/lib/python3.8/site-packages/sklearn/metrics/_scorer.py", line 349, in _score
    raise ValueError("{0} format is not supported".format(y_type))
ValueError: multiclass format is not supported

Traceback (most recent call last):
  File "/Users/rhoeunpark/opt/anaconda3/envs/dsi/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 761, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/Users/rhoeunpark/opt/anaconda3/envs/dsi/lib/python3.8/site-packages/sklearn/metrics/_scorer.py", line 216, in __call__
    return self._score(
  File "/Users/

Parameters: { "colsampe_bytree", "scale_pos_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Parameters: { "colsampe_bytree", "scale_pos_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Parameters: { "colsampe_bytree", "scale_pos_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if y

Traceback (most recent call last):
  File "/Users/rhoeunpark/opt/anaconda3/envs/dsi/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 761, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/Users/rhoeunpark/opt/anaconda3/envs/dsi/lib/python3.8/site-packages/sklearn/metrics/_scorer.py", line 216, in __call__
    return self._score(
  File "/Users/rhoeunpark/opt/anaconda3/envs/dsi/lib/python3.8/site-packages/sklearn/metrics/_scorer.py", line 349, in _score
    raise ValueError("{0} format is not supported".format(y_type))
ValueError: multiclass format is not supported

Traceback (most recent call last):
  File "/Users/rhoeunpark/opt/anaconda3/envs/dsi/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 761, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/Users/rhoeunpark/opt/anaconda3/envs/dsi/lib/python3.8/site-packages/sklearn/metrics/_scorer.py", line 216, in __call__
    return self._score(
  File "/Users/

Parameters: { "colsampe_bytree", "scale_pos_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Parameters: { "colsampe_bytree", "scale_pos_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Parameters: { "colsampe_bytree", "scale_pos_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if y

Traceback (most recent call last):
  File "/Users/rhoeunpark/opt/anaconda3/envs/dsi/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 761, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/Users/rhoeunpark/opt/anaconda3/envs/dsi/lib/python3.8/site-packages/sklearn/metrics/_scorer.py", line 216, in __call__
    return self._score(
  File "/Users/rhoeunpark/opt/anaconda3/envs/dsi/lib/python3.8/site-packages/sklearn/metrics/_scorer.py", line 349, in _score
    raise ValueError("{0} format is not supported".format(y_type))
ValueError: multiclass format is not supported

Traceback (most recent call last):
  File "/Users/rhoeunpark/opt/anaconda3/envs/dsi/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 761, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/Users/rhoeunpark/opt/anaconda3/envs/dsi/lib/python3.8/site-packages/sklearn/metrics/_scorer.py", line 216, in __call__
    return self._score(
  File "/Users/

Parameters: { "colsampe_bytree", "scale_pos_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Parameters: { "colsampe_bytree", "scale_pos_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Parameters: { "colsampe_bytree", "scale_pos_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if y

Traceback (most recent call last):
  File "/Users/rhoeunpark/opt/anaconda3/envs/dsi/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 761, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/Users/rhoeunpark/opt/anaconda3/envs/dsi/lib/python3.8/site-packages/sklearn/metrics/_scorer.py", line 216, in __call__
    return self._score(
  File "/Users/rhoeunpark/opt/anaconda3/envs/dsi/lib/python3.8/site-packages/sklearn/metrics/_scorer.py", line 349, in _score
    raise ValueError("{0} format is not supported".format(y_type))
ValueError: multiclass format is not supported

Traceback (most recent call last):
  File "/Users/rhoeunpark/opt/anaconda3/envs/dsi/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 761, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/Users/rhoeunpark/opt/anaconda3/envs/dsi/lib/python3.8/site-packages/sklearn/metrics/_scorer.py", line 216, in __call__
    return self._score(
  File "/Users/

In [171]:
gsearch1.best_params_

{'max_depth': 3, 'min_child_weight': 1}

In [172]:
gs_train_preds = gsearch1.predict(Xs_train)
gs_test_preds = gsearch1.predict(Xs_test)

In [174]:
accuracy_score(y_train, gs_train_preds)

0.51429561601109

In [175]:
accuracy_score(y_test, gs_test_preds)

0.41302841302841303

In [189]:
precision_score(y_train, gs_train_preds, zero_division=0, average='weighted')

0.5420478218398938

In [191]:
precision_score(y_test, gs_test_preds, zero_division=0, average='weighted')

0.3739522262087129

Even with the gridsearch, the accuracy score was not better than it was in the baseline model. 

### Conclusion on 6 class classification

Overall, models to predict the specific state responses without the 'ignore' class does not perform well; just looking at the accuracy, only XGBoost performed slightly better than the baseline model that had 38% accuracy. Precision of all models were higher than the baseline, indicating that it performs somewhat better at predicting the positives than the null model. Even so, with the test accuracy lower than 50%, it is hard to say that the models would be helpful in predicting the state response solely based on the categorical features that we fed into the training model. 

Out of 6 different classification models, XGBoost had the highest accuracy score in both training and test set (0.54 and 0.42 respectively). The attempt to hypertune the model yielded training and test accuracy score of 0.51 and 0.41 respectively; other than reducing the variance slighty, the hypertuning process did not increase the predicting power of the model. 