# 02_Baseline Modeling with Categorical Features (2 & 3 Classes)

This notebook will consist of a wide variety of baseline modeling.

Features being considered:
- *region*
- *year*
- *protesterviolence*
- *participants_category*
- *demand_political_behavior_process*
- *demand_labor_wage_dispute*
- *demand_police_brutality*
- *demand_social_restrictions*
- *demand_land_farm_issue*
- *demand_politician_removal*
- *demand_price_inc_tax_policy*


Targets being considered:
- *response_category_2*
- *response_category_3*

As seen in the EDA, it was unclear what the impact of the year feature would be. With that in mind, models with and without the year feature will be run to see if it makes a difference. The models will also switch off between using 2 classes (response_category_2) and 3 classes (response_category_3) as the target.

The baseline modeling results will be categorized in 4 ways:
- *2 classes using year data*
- *3 classes using year data*
- *2 classes without year data*
- *3 classes without year data*

Due to the class imbalances in both the 2 classes and 3 classes, the following class imbalance techniques will be tested for each of the above mentioned categories:
- *Oversampling the least frequent class*
- *Undersampling the most frequent class with Near Miss*
- *Weighted models*

The following metrics will be used to assess model performance:
- *train accuracy*
- *test accuracy*
- *variance (train accuracy - test accuracy)*
- *test precision*
- *test F1 score*

Of these metrics, test accuracy, variance and test precision will be considered the most. Overall accuracy is the primary interest, but considering precision will provide insight as to how well the less frequent class is being classified. The F1 score will also be helpful to include since imbalanced classes are being dealt with.

## Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import NearMiss

from sklearn.metrics import accuracy_score, precision_score, f1_score

pd.set_option('display.max_columns', None)

import warnings
warnings.filterwarnings("ignore")

## Data

In [2]:
protests = pd.read_csv('../data/protests_clean.csv')

In [101]:
# Dropping the columns that are not of interest at this time

protests.drop(columns=['country', 
                       'startdate',
                       'enddate',
                       'length_days', 
                       'protesteridentity',
                       'stateresponse',
                       'notes'], inplace=True)

In [4]:
print(protests.shape)
protests.head(3)

(15198, 13)


Unnamed: 0,region,year,protesterviolence,participants_category,demand_political_behavior_process,demand_labor_wage_dispute,demand_police_brutality,demand_social_restrictions,demand_land_farm_issue,demand_politician_removal,demand_price_inc_tax_policy,response_category_2,response_category_3
0,North America,2015,0,50-99,1,0,0,0,0,0,0,0,0
1,North America,2016,0,100-999,0,1,0,0,0,0,0,0,0
2,North America,2016,0,100-999,0,1,0,0,0,0,0,0,0


In [114]:
# Writing protests dataframe with dropped columns to a CSV

protests.to_csv('../data/protests_clean2.csv', index=False)

____________
## **#1. Modeling 2 classes (with YEAR data)**

### One-Hot Encoding Categorical Variables

In [5]:
# One-Hot-Encoding region, year and participants_category features

protests_ohe_w_year = pd.get_dummies(protests, 
                                     prefix={'region':'region',
                                             'year':'year', 
                                             'participants_category':'participants_category'},
                                     columns=['region','year',
                                              'participants_category'],
                                     drop_first=False)

In [6]:
# Organizing columns to make sure everything is correct when viewing the dataframe

protests_ohe_w_year.insert(54, 'protesterviolence', protests_ohe_w_year.pop('protesterviolence'))
protests_ohe_w_year.insert(54, 'demand_political_behavior_process', protests_ohe_w_year.pop('demand_political_behavior_process'))
protests_ohe_w_year.insert(54, 'demand_labor_wage_dispute', protests_ohe_w_year.pop('demand_labor_wage_dispute'))
protests_ohe_w_year.insert(54, 'demand_police_brutality', protests_ohe_w_year.pop('demand_police_brutality'))
protests_ohe_w_year.insert(54, 'demand_social_restrictions', protests_ohe_w_year.pop('demand_social_restrictions'))
protests_ohe_w_year.insert(54, 'demand_land_farm_issue', protests_ohe_w_year.pop('demand_land_farm_issue'))
protests_ohe_w_year.insert(54, 'demand_politician_removal', protests_ohe_w_year.pop('demand_politician_removal'))
protests_ohe_w_year.insert(54, 'demand_price_inc_tax_policy', protests_ohe_w_year.pop('demand_price_inc_tax_policy'))
protests_ohe_w_year.insert(54, 'response_category_2', protests_ohe_w_year.pop('response_category_2'))
protests_ohe_w_year.insert(54, 'response_category_3', protests_ohe_w_year.pop('response_category_3'))

In [7]:
protests_ohe_w_year.head(3)

Unnamed: 0,region_Africa,region_Asia,region_Central America,region_Europe,region_MENA,region_North America,region_Oceania,region_South America,year_1990,year_1991,year_1992,year_1993,year_1994,year_1995,year_1996,year_1997,year_1998,year_1999,year_2000,year_2001,year_2002,year_2003,year_2004,year_2005,year_2006,year_2007,year_2008,year_2009,year_2010,year_2011,year_2012,year_2013,year_2014,year_2015,year_2016,year_2017,year_2018,year_2019,year_2020,participants_category_100-999,participants_category_1000-1999,participants_category_2000-4999,participants_category_50-99,participants_category_5000-10000,participants_category_>10000,protesterviolence,demand_political_behavior_process,demand_labor_wage_dispute,demand_police_brutality,demand_social_restrictions,demand_land_farm_issue,demand_politician_removal,demand_price_inc_tax_policy,response_category_2,response_category_3
0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0
1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0


In [8]:
# Writing protests_ohe_w_year dataframe to a CSV

# protests_ohe_w_year.to_csv('../data/protests_ohe_w_year.csv', index=False)

### Helper Functions (2 classes)

In [9]:
def run_baseline_2(model, 
                 X_train, y_train, X_test, y_test,
                 verbose=True):
    """
    Fits a baseline model for each model specified.
    Compiles accuracy, variance, precision and f1 score results in a dictionary.
    For 2 classes.
    """
    
    results = {}
    
    model.fit(X_train, y_train)

    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    
    results['train_accuracy'] = accuracy_score(y_train, y_pred_train)
    results['test_accuracy'] = accuracy_score(y_test, y_pred_test)
    results['variance'] = results['train_accuracy'] - results['test_accuracy']
    results['test_precision'] = precision_score(y_test, y_pred_test, pos_label=1, zero_division=0)
    results['test_f1'] = f1_score(y_test, y_pred_test, pos_label=1, zero_division=0)
    
    return results

In [10]:
def test_models_2(models, X_train, y_train, X_test, y_test, verbose=False):
    """
    Returns the baseline model results in a dataframe.
    For 2 classes.
    """
    results = {}
    
    for name,model in models.items():
        if verbose:
            print('\nRunning {} - {}'.format(name, model))
        
        results[name] = run_baseline_2(model, X_train, y_train, X_test, y_test, verbose=False)

    return pd.DataFrame.from_dict(results, orient='index')

### Defining models (2 classes)

In [11]:
# Models to use with imbalanced, oversampled and undersampled classes

models2 = {'Most Frequent': DummyClassifier(strategy='most_frequent'),
          'Logistic Regression': LogisticRegression(solver='lbfgs'),
          'Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
          'Random Forest': RandomForestClassifier(n_estimators=100),
          'XGBoost': XGBClassifier(),
          'Support Vector Classifier': SVC(),
          'Bernoulli Naive Bayes': BernoulliNB(),
          'Multinomial Naive Bayes': MultinomialNB()}

In [12]:
# Models to use with weighted models

models2_weighted = {'Logistic Regression': LogisticRegression(solver='lbfgs', class_weight='balanced'),
          'Nearest Neighbors': KNeighborsClassifier(n_neighbors=5, weights='distance'),
          'Random Forest': RandomForestClassifier(n_estimators=100, class_weight='balanced'),
          'XGBoost': XGBClassifier(scale_pos_weight=3.38),
          'Support Vector Classifier': SVC(class_weight='balanced')}

### X, y & train-test split

In [13]:
X2_year = protests_ohe_w_year.drop(columns=['response_category_2','response_category_3'])
y2_year = protests_ohe_w_year['response_category_2']

In [14]:
X2_year_train, X2_year_test, y2_year_train, y2_year_test = train_test_split(X2_year, y2_year, test_size=0.2, random_state=42, stratify=y2_year)

### Imbalanced Data Techniques

#### Original, Imbalanced Data

In [106]:
# Checking the training set balance

y2_year_train.value_counts(normalize=True)

0    0.772166
1    0.227834
Name: response_category_2, dtype: float64

In [16]:
imbal2_year_model_results = test_models_2(models2,
                                          X2_year_train, y2_year_train,
                                          X2_year_test, y2_year_test,
                                          verbose=False)

#### #1: Oversampled

In [17]:
ros2_year = RandomOverSampler()

X2_year_train_over, y2_year_train_over = ros2_year.fit_resample(X2_year_train, y2_year_train)

In [107]:
# Checking the training set balance

y2_year_train_over.value_counts(normalize=True)

0    0.5
1    0.5
Name: response_category_2, dtype: float64

In [19]:
oversampled2_year_model_results = test_models_2(models2,
                                        X2_year_train_over, y2_year_train_over,
                                        X2_year_test, y2_year_test,
                                        verbose=False)

#### #2: Undersampled (Near Miss)

In [20]:
nr2_year = NearMiss() 

X2_year_train_near, y2_year_train_near= nr2_year.fit_sample(X2_year_train, y2_year_train) 

In [21]:
# Checking the training set balance

y2_year_train_near.value_counts(normalize=True)

0    0.5
1    0.5
Name: response_category_2, dtype: float64

In [22]:
undersampled2_year_model_results = test_models_2(models2,
                                                 X2_year_train_near, y2_year_train_near,
                                                 X2_year_test, y2_year_test,
                                                 verbose=False)

#### #3: Weighted Models

In [110]:
# Checking the training set balance

y2_year_train.value_counts(normalize=True)

0    0.772166
1    0.227834
Name: response_category_2, dtype: float64

In [109]:
weighted2_year_model_results = test_models_2(models2_weighted,
                                             X2_year_train, y2_year_train,
                                             X2_year_test, y2_year_test,
                                             verbose=False)

### **RESULTS**

In [25]:
imbal2_year_model_results.sort_values(by='test_accuracy', ascending=False)

Unnamed: 0,train_accuracy,test_accuracy,variance,test_precision,test_f1
XGBoost,0.846192,0.796382,0.04981,0.582766,0.453663
Support Vector Classifier,0.826863,0.794079,0.032784,0.573991,0.449912
Multinomial Naive Bayes,0.786231,0.787171,-0.00094,0.588933,0.315344
Logistic Regression,0.79314,0.784211,0.00893,0.539823,0.426573
Bernoulli Naive Bayes,0.788041,0.782237,0.005804,0.527881,0.461789
Random Forest,0.893486,0.780921,0.112565,0.52549,0.445923
Most Frequent,0.772166,0.772368,-0.000202,0.0,0.0
Nearest Neighbors,0.828261,0.769737,0.058524,0.49005,0.360146


In [26]:
oversampled2_year_model_results.sort_values(by='test_accuracy', ascending=False)

Unnamed: 0,train_accuracy,test_accuracy,variance,test_precision,test_f1
Most Frequent,0.5,0.772368,-0.272368,0.0,0.0
Support Vector Classifier,0.783181,0.747368,0.035812,0.462302,0.548235
Logistic Regression,0.71373,0.745395,-0.031664,0.456933,0.529197
XGBoost,0.808532,0.744737,0.063795,0.458167,0.542453
Random Forest,0.870473,0.737829,0.132644,0.442873,0.505276
Bernoulli Naive Bayes,0.705688,0.721382,-0.015693,0.426956,0.516828
Nearest Neighbors,0.81519,0.685526,0.129663,0.380866,0.468889
Multinomial Naive Bayes,0.689923,0.679934,0.009989,0.384742,0.490842


In [27]:
undersampled2_year_model_results.sort_values(by='test_accuracy', ascending=False)

Unnamed: 0,train_accuracy,test_accuracy,variance,test_precision,test_f1
Most Frequent,0.5,0.772368,-0.272368,0.0,0.0
Nearest Neighbors,0.751083,0.565461,0.185623,0.28793,0.392644
Logistic Regression,0.724007,0.541776,0.182231,0.290996,0.411988
Bernoulli Naive Bayes,0.726354,0.539803,0.186551,0.286663,0.404427
Support Vector Classifier,0.763177,0.513158,0.250019,0.281838,0.407526
Multinomial Naive Bayes,0.708303,0.497697,0.210606,0.265844,0.38303
XGBoost,0.788267,0.480263,0.308004,0.265079,0.388071
Random Forest,0.803069,0.452303,0.350766,0.25614,0.38035


In [28]:
weighted2_year_model_results.sort_values(by='test_accuracy', ascending=False)

Unnamed: 0,train_accuracy,test_accuracy,variance,test_precision,test_f1
Nearest Neighbors,0.890114,0.767763,0.12235,0.484716,0.386087
Random Forest,0.869633,0.756579,0.113054,0.469231,0.497283
XGBoost,0.803175,0.754934,0.048241,0.472987,0.554692
Support Vector Classifier,0.780474,0.751974,0.0285,0.468876,0.553318
Logistic Regression,0.758924,0.749342,0.009582,0.462687,0.532515


### **SUMMARY**

Of all the imbalanced data techniques, the **weighted models** performed best, though not great. The oversampled models performed very similarly, but the weighted models had them beat by 0.01-0.02 across the board. The undersampled models performed substantially worse than the both the weighted and oversampled models, with lower scores all around and a very high degree of variance. Of the weighted models, it was **XGBoost**, **Support Vector Classifier** and **Logistic Regression**, that performed the best. The test accuracy was in the mid 70's, variance was between 0.009 and 0.048, and test precision was in the mid-upper 40's.

____________
## **#2. Modeling 3 classes (with YEAR data)**

### Helper Functions (3 classes)

In [40]:
def run_baseline_3(model, 
                 X_train, y_train, X_test, y_test,
                 verbose=True):
    """
    Fits a baseline model for each model specified.
    Compiles accuracy, variance, precision and f1 score results in a dictionary.
    For 3 classes.
    """
    
    results = {}
    
    model.fit(X_train, y_train)

    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    
    results['train_accuracy'] = accuracy_score(y_train, y_pred_train)
    results['test_accuracy'] = accuracy_score(y_test, y_pred_test)
    results['variance'] = results['train_accuracy'] - results['test_accuracy']
    results['test_precision'] = precision_score(y_test, y_pred_test, average='weighted', zero_division=0)
    results['test_f1'] = f1_score(y_test, y_pred_test, average='weighted', zero_division=0)
    
    return results

In [41]:
def test_models_3(models, X_train, y_train, X_test, y_test, verbose=False):

    """
    Returns the baseline model results in a dataframe.
    For 2 classes.
    """

    results = {}
    
    for name,model in models.items():
        if verbose:
            print('\nRunning {} - {}'.format(name, model))
        
        results[name] = run_baseline_3(model, X_train, y_train, X_test, y_test, verbose=False)
        
        if verbose:
            print('Results: ', results[name])

    return pd.DataFrame.from_dict(results, orient='index')

### Defining models (3 classes)

In [31]:
models3 = {'Most Frequent': DummyClassifier(strategy='most_frequent'),
          'Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
          'Random Forest': RandomForestClassifier(n_estimators=100),
          'XGBoost': XGBClassifier(),
          'Support Vector Classifier': SVC(),
          'Multinomial Naive Bayes': MultinomialNB()}

In [32]:
models3_weighted = {'Nearest Neighbors': KNeighborsClassifier(n_neighbors=5, weights='distance'),
          'Random Forest': RandomForestClassifier(n_estimators=100, class_weight='balanced'),
          'XGBoost': XGBClassifier(),
          'Support Vector Classifier': SVC(class_weight='balanced')}

### X, y & train-test split

In [33]:
X3_year = protests_ohe_w_year.drop(columns=['response_category_2','response_category_3'])
y3_year = protests_ohe_w_year['response_category_3']

In [34]:
# Train/test split

X3_year_train, X3_year_test, y3_year_train, y3_year_test = train_test_split(X3_year, y3_year, test_size=0.2, random_state=42, stratify=y3_year)

### Imbalanced Data Techniques

#### Original, Imbalanced Data

In [35]:
# Checking the training set balance

y3_year_train.value_counts(normalize=True)

0    0.525333
1    0.246833
2    0.227834
Name: response_category_3, dtype: float64

In [42]:
imbal3_year_model_results = test_models_3(models3,
                                          X3_year_train, y3_year_train,
                                          X3_year_test, y3_year_test,
                                          verbose=False)

#### #1. Oversampled

In [43]:
ros3_year = RandomOverSampler()

X3_year_train_over, y3_year_train_over = ros3_year.fit_resample(X3_year_train, y3_year_train)

In [44]:
# Checking the training set balance

y3_year_train_over.value_counts(normalize=True)

1    0.333333
0    0.333333
2    0.333333
Name: response_category_3, dtype: float64

In [45]:
oversampled3_year_model_results = test_models_3(models3,
                                                X3_year_train_over, y3_year_train_over,
                                                X3_year_test, y3_year_test,
                                                verbose=False)

#### #2. Undersampled (Near Miss)

In [46]:
nr3_year = NearMiss() 

X3_year_train_near, y3_year_train_near= nr3_year.fit_sample(X3_year_train, y3_year_train) 

In [47]:
# Checking the training set balance

y3_year_train_near.value_counts(normalize=True)

0    0.333333
1    0.333333
2    0.333333
Name: response_category_3, dtype: float64

In [48]:
undersampled3_year_model_results = test_models_3(models3,
                                                 X3_year_train_near, y3_year_train_near,
                                                 X3_year_test, y3_year_test,
                                                 verbose=False)

#### #3. Weighted Models

In [49]:
# Checking the training set balance

y3_year_train.value_counts(normalize=True)

0    0.525333
1    0.246833
2    0.227834
Name: response_category_3, dtype: float64

In [50]:
weighted3_year_model_results = test_models_3(models3_weighted,
                                             X3_year_train, y3_year_train,
                                             X3_year_test, y3_year_test,
                                             verbose=False)

### **RESULTS**

In [51]:
imbal3_year_model_results.sort_values(by='test_accuracy', ascending=False)

Unnamed: 0,train_accuracy,test_accuracy,variance,test_precision,test_f1
Support Vector Classifier,0.668367,0.651974,0.016393,0.631672,0.608871
XGBoost,0.712782,0.640132,0.07265,0.611746,0.610613
Random Forest,0.796348,0.623684,0.172664,0.60258,0.608078
Multinomial Naive Bayes,0.623458,0.621382,0.002076,0.579724,0.572751
Nearest Neighbors,0.674864,0.581908,0.092956,0.55104,0.545793
Most Frequent,0.525333,0.525329,4e-06,0.275971,0.36185


In [52]:
oversampled3_year_model_results.sort_values(by='test_accuracy', ascending=False)

Unnamed: 0,train_accuracy,test_accuracy,variance,test_precision,test_f1
Support Vector Classifier,0.659882,0.612171,0.047711,0.598257,0.603801
XGBoost,0.691248,0.599342,0.091906,0.594501,0.596741
Random Forest,0.787067,0.570724,0.216344,0.581669,0.575249
Multinomial Naive Bayes,0.523929,0.570395,-0.046466,0.575436,0.570677
Nearest Neighbors,0.713533,0.532237,0.181296,0.54871,0.538813
Most Frequent,0.333333,0.525329,-0.191996,0.275971,0.36185


In [53]:
undersampled3_year_model_results.sort_values(by='test_accuracy', ascending=False)

Unnamed: 0,train_accuracy,test_accuracy,variance,test_precision,test_f1
Most Frequent,0.333333,0.525329,-0.191996,0.275971,0.36185
Multinomial Naive Bayes,0.543682,0.507237,0.036445,0.535672,0.517039
Support Vector Classifier,0.629122,0.506908,0.122214,0.539906,0.518466
XGBoost,0.687605,0.475,0.212605,0.532306,0.490645
Nearest Neighbors,0.627076,0.469408,0.157668,0.50532,0.480513
Random Forest,0.759206,0.442763,0.316443,0.529802,0.457593


In [54]:
weighted3_year_model_results.sort_values(by='test_accuracy', ascending=False)

Unnamed: 0,train_accuracy,test_accuracy,variance,test_precision,test_f1
XGBoost,0.712782,0.640132,0.07265,0.611746,0.610613
Support Vector Classifier,0.68457,0.623026,0.061544,0.604485,0.61053
Nearest Neighbors,0.788041,0.592763,0.195278,0.566858,0.569436
Random Forest,0.785409,0.591118,0.19429,0.589653,0.59018


### **SUMMARY**

The **weighted models** performed best here as well, though still not great. The oversampled models did not perform as similarly as with only 2 classes, but they were not far behind. The undersampled models again performed substantially worse than the both the weighted and oversampled models, with all around lower scores and a higher degree of variance. Of the weighted models, it was **XGBoost** and **Support Vector Classifier**, that performed the best. The test accuracy was in the low 60's, variance was between 0.06 and 0.07, and test precision was in the low 60's.

____________
## **#3. Modeling 2 classes (NO YEAR data)**

### One-Hot Encoding Categorical Variables

In [55]:
# One-Hot-Encoding region and participants_category features

protests_ohe_wo_year = pd.get_dummies(protests, 
                                      prefix={'region':'region',
                                              'participants_category':'participants_category'},
                                      columns=['region',
                                               'participants_category'],
                                      drop_first=False)

In [56]:
# Organizing columns

protests_ohe_wo_year.insert(23, 'protesterviolence', protests_ohe_wo_year.pop('protesterviolence'))
protests_ohe_wo_year.insert(23, 'demand_political_behavior_process', protests_ohe_wo_year.pop('demand_political_behavior_process'))
protests_ohe_wo_year.insert(23, 'demand_labor_wage_dispute', protests_ohe_wo_year.pop('demand_labor_wage_dispute'))
protests_ohe_wo_year.insert(23, 'demand_police_brutality', protests_ohe_wo_year.pop('demand_police_brutality'))
protests_ohe_wo_year.insert(23, 'demand_social_restrictions', protests_ohe_wo_year.pop('demand_social_restrictions'))
protests_ohe_wo_year.insert(23, 'demand_land_farm_issue', protests_ohe_wo_year.pop('demand_land_farm_issue'))
protests_ohe_wo_year.insert(23, 'demand_politician_removal', protests_ohe_wo_year.pop('demand_politician_removal'))
protests_ohe_wo_year.insert(23, 'demand_price_inc_tax_policy', protests_ohe_wo_year.pop('demand_price_inc_tax_policy'))
protests_ohe_wo_year.insert(23, 'response_category_2', protests_ohe_wo_year.pop('response_category_2'))
protests_ohe_wo_year.insert(23, 'response_category_3', protests_ohe_wo_year.pop('response_category_3'))

In [57]:
protests_ohe_wo_year.head(3)

Unnamed: 0,year,region_Africa,region_Asia,region_Central America,region_Europe,region_MENA,region_North America,region_Oceania,region_South America,participants_category_100-999,participants_category_1000-1999,participants_category_2000-4999,participants_category_50-99,participants_category_5000-10000,protesterviolence,demand_political_behavior_process,demand_labor_wage_dispute,demand_police_brutality,demand_social_restrictions,demand_land_farm_issue,demand_politician_removal,demand_price_inc_tax_policy,response_category_2,response_category_3,participants_category_>10000
0,2015,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0
1,2016,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
2,2016,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0


In [58]:
# Writing protests_ohe (without year column) dataframe to a CSV

# protests_ohe_wo_year.to_csv('../data/protests_ohe_wo_year.csv', index=False)

### X, y & train-test split

In [61]:
X2 = protests_ohe_wo_year.drop(columns=['response_category_2','response_category_3'])
y2 = protests_ohe_wo_year['response_category_2']

In [62]:
# Train/test split

X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.2, random_state=42, stratify=y2)

### Imbalanced Data Techniques

#### Original, Imbalanced Data

In [64]:
# Checking the training set balance

y2_train.value_counts(normalize=True)

0    0.772166
1    0.227834
Name: response_category_2, dtype: float64

In [65]:
imbal2_model_results = test_models_2(models2,
                                     X2_train, y2_train,
                                     X2_test, y2_test,
                                     verbose=False)

#### #1. Oversampled

In [66]:
ros2 = RandomOverSampler()

X2_train_over, y2_train_over = ros2.fit_resample(X2_train, y2_train)

In [67]:
# Checking the training set balance

y2_train_over.value_counts(normalize=True)

0    0.5
1    0.5
Name: response_category_2, dtype: float64

In [68]:
oversampled2_model_results = test_models_2(models2,
                                           X2_train_over, y2_train_over,
                                           X2_test, y2_test,
                                           verbose=False)

#### #2. Undersampled (Near Miss)

In [69]:
nr2 = NearMiss() 

X2_train_near, y2_train_near= nr2.fit_sample(X2_train, y2_train) 

In [70]:
y2_train_near.value_counts(normalize=True)

0    0.5
1    0.5
Name: response_category_2, dtype: float64

In [71]:
undersampled2_model_results = test_models_2(models2,
                                            X2_train_near, y2_train_near,
                                            X2_test, y2_test,
                                            verbose=False)

#### #3. Weighted Models

In [72]:
# Checking the training set balance

y2_train.value_counts(normalize=True)

0    0.772166
1    0.227834
Name: response_category_2, dtype: float64

In [73]:
weighted2_model_results = test_models_2(models2_weighted,
                                        X2_train, y2_train,
                                        X2_test, y2_test,
                                        verbose=False)

### **RESULTS**

In [74]:
imbal2_model_results.sort_values(by='test_accuracy', ascending=False)

Unnamed: 0,train_accuracy,test_accuracy,variance,test_precision,test_f1
XGBoost,0.853594,0.793092,0.060502,0.572082,0.44287
Logistic Regression,0.790179,0.780921,0.009258,0.527083,0.431741
Bernoulli Naive Bayes,0.784751,0.780921,0.00383,0.523636,0.463768
Multinomial Naive Bayes,0.785985,0.778947,0.007037,0.537594,0.298539
Most Frequent,0.772166,0.772368,-0.000202,0.0,0.0
Support Vector Classifier,0.772166,0.772368,-0.000202,0.0,0.0
Nearest Neighbors,0.833114,0.770395,0.062719,0.493088,0.380107
Random Forest,0.893404,0.770066,0.123338,0.493578,0.434923


In [75]:
oversampled2_model_results.sort_values(by='test_accuracy', ascending=False)

Unnamed: 0,train_accuracy,test_accuracy,variance,test_precision,test_f1
Most Frequent,0.5,0.772368,-0.272368,0.0,0.0
Logistic Regression,0.714795,0.750658,-0.035862,0.464968,0.536108
XGBoost,0.811035,0.745066,0.06597,0.456453,0.528875
Bernoulli Naive Bayes,0.709949,0.732566,-0.022617,0.439921,0.521483
Random Forest,0.870366,0.717763,0.152603,0.412262,0.47619
Multinomial Naive Bayes,0.693545,0.696053,-0.002508,0.39913,0.498371
Nearest Neighbors,0.817799,0.683882,0.133918,0.379155,0.46759
Support Vector Classifier,0.523008,0.550987,-0.027979,0.238539,0.310258


In [76]:
undersampled2_model_results.sort_values(by='test_accuracy', ascending=False)

Unnamed: 0,train_accuracy,test_accuracy,variance,test_precision,test_f1
Most Frequent,0.5,0.772368,-0.272368,0.0,0.0
Nearest Neighbors,0.753249,0.584211,0.169039,0.293651,0.391723
Support Vector Classifier,0.542238,0.549013,-0.006775,0.237026,0.308623
Logistic Regression,0.729603,0.536184,0.193419,0.292245,0.417355
Bernoulli Naive Bayes,0.727256,0.521053,0.206204,0.274232,0.389262
Multinomial Naive Bayes,0.72148,0.498026,0.223454,0.256709,0.365752
XGBoost,0.794043,0.456579,0.337464,0.252066,0.371385
Random Forest,0.806679,0.424671,0.382008,0.238237,0.354851


In [77]:
weighted2_model_results.sort_values(by='test_accuracy', ascending=False)

Unnamed: 0,train_accuracy,test_accuracy,variance,test_precision,test_f1
Support Vector Classifier,0.772166,0.772368,-0.000202,0.0,0.0
Nearest Neighbors,0.890689,0.767434,0.123255,0.483871,0.388937
Logistic Regression,0.770933,0.765132,0.005801,0.487059,0.536965
XGBoost,0.814608,0.753947,0.06066,0.470149,0.541104
Random Forest,0.868893,0.75,0.118893,0.456743,0.485792


### **SUMMARY**

The **weighted models** performed best yet again, but still not great. The oversampled models performed similarly, and actually did better with the degree of variance to the point where multiple models' test accuracy beat the train accuracy. But the weighted models still achieved the highest test accuracy and precision scores. The undersampled models again performed substantially worse than the both the weighted and oversampled models, with all around lower scores and a higher degree of variance. Of the weighted models, it was **XGBoost**, **Support Vector Classifier**, and **Logistic Regression** that performed the best. The test accuracy was in the mid-high 70's, variance was -0.0002 and 0.06, and test precision was in the mid-high 40's.

____________
## **#4. Modeling 3 classes (NO YEAR data)**

### X, y & train-test split

In [79]:
X3 = protests_ohe_wo_year.drop(columns=['response_category_2','response_category_3'])
y3 = protests_ohe_wo_year['response_category_3']

In [80]:
# Train/test split

X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y3, test_size=0.2, random_state=42, stratify=y3)

### Imbalanced Data Techniques

#### Original, Imbalanced Data

In [84]:
# Checking the training set balance

y3_train.value_counts(normalize=True)

0    0.525333
1    0.246833
2    0.227834
Name: response_category_3, dtype: float64

In [95]:
imbal3_model_results = test_models_3(models2,
                                     X2_train, y2_train,
                                     X2_test, y2_test,
                                     verbose=False)

#### #1. Oversampled

In [86]:
ros3 = RandomOverSampler()

X3_train_over, y3_train_over = ros3.fit_resample(X3_train, y3_train)

In [87]:
# Checking the training set balance

y3_train_over.value_counts(normalize=True)

1    0.333333
0    0.333333
2    0.333333
Name: response_category_3, dtype: float64

In [88]:
oversampled3_model_results = test_models_3(models3,
                                           X3_train_over, y3_train_over,
                                           X3_test, y3_test,
                                           verbose=False)

#### #2. Undersampled (Near Miss)

In [89]:
nr3 = NearMiss() 

X3_train_near, y3_train_near= nr3.fit_sample(X3_train, y3_train) 

In [90]:
# Checking the training set balance

y3_train_near.value_counts(normalize=True)

0    0.333333
1    0.333333
2    0.333333
Name: response_category_3, dtype: float64

In [91]:
undersampled3_model_results = test_models_3(models3,
                                            X3_train_near, y3_train_near,
                                            X3_test, y3_test,
                                            verbose=False)

#### #3. Weighted Models

In [92]:
# Checking the training set balance

y2_train.value_counts(normalize=True)

0    0.772166
1    0.227834
Name: response_category_2, dtype: float64

In [93]:
weighted3_model_results = test_models_3(models3_weighted,
                                        X3_train, y3_train,
                                        X3_test, y3_test,
                                        verbose=False)

### **RESULTS**

In [96]:
imbal3_model_results.sort_values(by='test_accuracy', ascending=False)

Unnamed: 0,train_accuracy,test_accuracy,variance,test_precision,test_f1
XGBoost,0.853594,0.793092,0.060502,0.771441,0.775054
Logistic Regression,0.790179,0.780921,0.009258,0.7599,0.765838
Bernoulli Naive Bayes,0.784751,0.780921,0.00383,0.766249,0.771612
Multinomial Naive Bayes,0.785985,0.778947,0.007037,0.741883,0.738991
Random Forest,0.893404,0.774342,0.119061,0.754835,0.761296
Most Frequent,0.772166,0.772368,-0.000202,0.596553,0.67317
Support Vector Classifier,0.772166,0.772368,-0.000202,0.596553,0.67317
Nearest Neighbors,0.833114,0.770395,0.062719,0.742941,0.750069


In [97]:
oversampled3_model_results.sort_values(by='test_accuracy', ascending=False)

Unnamed: 0,train_accuracy,test_accuracy,variance,test_precision,test_f1
XGBoost,0.69965,0.599671,0.099979,0.598291,0.598531
Multinomial Naive Bayes,0.513961,0.581579,-0.067618,0.571581,0.571975
Random Forest,0.784719,0.554934,0.229785,0.571598,0.561729
Nearest Neighbors,0.707113,0.533882,0.173232,0.55287,0.541436
Most Frequent,0.333333,0.525329,-0.191996,0.275971,0.36185
Support Vector Classifier,0.354313,0.264803,0.089511,0.50599,0.204455


In [98]:
undersampled3_model_results.sort_values(by='test_accuracy', ascending=False)

Unnamed: 0,train_accuracy,test_accuracy,variance,test_precision,test_f1
Most Frequent,0.333333,0.525329,-0.191996,0.275971,0.36185
Nearest Neighbors,0.627316,0.505921,0.121395,0.531912,0.514605
Multinomial Naive Bayes,0.544645,0.494079,0.050566,0.52561,0.503903
XGBoost,0.696871,0.468421,0.22845,0.533184,0.484021
Random Forest,0.758845,0.424013,0.334832,0.51174,0.438276
Support Vector Classifier,0.365343,0.2625,0.102843,0.48758,0.198714


In [99]:
weighted3_model_results.sort_values(by='test_accuracy', ascending=False)

Unnamed: 0,train_accuracy,test_accuracy,variance,test_precision,test_f1
XGBoost,0.721829,0.634211,0.087619,0.603785,0.606346
Nearest Neighbors,0.790837,0.599342,0.191495,0.573589,0.575004
Random Forest,0.785244,0.567105,0.218139,0.572082,0.569422
Support Vector Classifier,0.227834,0.227632,0.000202,0.051816,0.084416


### **SUMMARY**

The **weighted models** performed best one last time, but still not great. The oversampled models performed similarly but with slightly lesser scores, and the undersampled models again were incomparable. Of the weighted models, **XGBoost** emerged as the best performed. The test accuracy was in the low 60's, variance was 0.087, and test precision was in the low 60's.

____
## Basline Modeling Comparison & Evaluation

### 2 Classes (with year) vs 3 Classes (with year)

The weighted models performed best for both 2 classes and 3 classes. The 2 classes with year data weighted models performed better in terms of test accuracy and a lower variance. They achieved accuracy in the mid 70's (with a variance range of 0.009 and 0.048) while the 3 classes with year data were in the low 60's (with a variance range of 0.06 and 0.07). However, the 3 classes with year data weighted models performed better in terms of precision. They achieved precision in the low 60's while the 2 classes with year data were in the mid-high 40's.

### 2 Classes (no year) vs 3 Classes (no year)

The weighted models performed best for both 2 classes and 3 classes here as well. The 2 classes with no year data weighted models performed better in terms of test accuracy and a lower variance. They achieved accuracy in the mid-high 70's (with a variance range of -0.0002 and 0.06) while the 3 classes with year data were in the low 60's (with a variance of 0.087). However, the 3 classes with no year data weighted models performed better in terms of precision. They achieved precision in the low 60's while the 2 classes with year data were in the mid-high 40's.

____
## Baseline Modeling Final Thoughts

Overall, weighted models performed better than the models with oversampled or undersampled data. The best performing weighted models were Logistic Regression, Support Vector Classifier and XGBoost.

Running models for 2 and 3 classes came with some interesting observations. It was assumed that because the classes weren't going to be as imbalanced as with 2 classes, that they would achieve a higher accuracy, but the opposite happened. The accuracy decreased substantially. However, they did do better with classifying the less frequent classes.

Running models with and without year data also showed that the feature didn't impact model performance much, not even in terms of variance, which was the primary reason for testing this feature. After One-Hot-Encoding the data, when the year data was included, the dataframe contained 53 features. When the year data wasn't included, the dataframe contained 23 features, so it was thought that the variance might decrease without the year data, but it didn't make much of a difference. The accuracy and precision scores were also very similar with and without the year data.

Due to the lack of clarity on the impact of using year data and using 2 or 3 classes, all four cases will be modeled and hyper-tuned using Logistic Regression, Support Vector Classifier and XGBoost, in the event that hyper-tuning causes the impact to become clearer.