## Exercise

- In this exercise, we work on a classification task: predicting the vote in the Brexit referendum
- The data come from the British Election Study Online Panel
  - codebook: https://www.britishelectionstudy.com/wp-content/uploads/2020/05/Bes_wave19Documentation_V2.pdf
- The outcome is `LeaveVote` (1: Leave, 0: otherwise)
- The inputs we use come from the following article:
  - Hobolt, Sara (2016) The Brexit vote: a divided nation, a divided continent. _Journal of European Public Policy_, 23 (9) (https://doi.org/10.1080/13501763.2016.1225785)

In [None]:
!wget https://www.dropbox.com/s/up1zpkozgscaty1/brexit_bes_sampled_data.csv

## Import packages

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

## Load data

In [None]:
df_bes = pd.read_csv("brexit_bes_sampled_data.csv")

In [None]:
df_bes.head()

### Model Description

- There are four models in the article. We will use the identity model (Model 2 in Table 2)
- List of input variables:
  `gender`, `age`, `edlevel`, `hhincome`, `EuropeanIdentity`, `EnglishIdentity`, `BritishIdentity`
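
If the file turns out to contain columns beyond the seven inputs and the outcome, the inputs can be selected by name. A minimal sketch on a tiny made-up frame: the values and the `extra_col` column are purely illustrative and do not follow the BES coding.

```python
import pandas as pd

# Input variables of the identity model, as listed above.
input_vars = ["gender", "age", "edlevel", "hhincome",
              "EuropeanIdentity", "EnglishIdentity", "BritishIdentity"]

# Made-up rows for illustration only (not BES data); extra_col is hypothetical.
df_demo = pd.DataFrame({
    "gender": [0, 1, 1], "age": [34, 58, 45], "edlevel": [3, 1, 2],
    "hhincome": [7, 4, 9], "EuropeanIdentity": [5, 2, 3],
    "EnglishIdentity": [3, 7, 6], "BritishIdentity": [5, 6, 6],
    "extra_col": [1, 2, 3], "LeaveVote": [0, 1, 1],
})

X_demo = df_demo[input_vars]   # keep only the model inputs
y_demo = df_demo["LeaveVote"]  # outcome
print(X_demo.columns.tolist())
```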

## Data Preprocessing



In [None]:
X = df_bes.drop('LeaveVote', axis = 1)
y = df_bes['LeaveVote']

## Train-test split

In [None]:
from sklearn.model_selection import train_test_split
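
One possible way to do the split, sketched on synthetic stand-in data so it runs on its own. The 30% test share, the `random_state`, and the `_demo` names are assumptions, not part of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the BES data (assumption: 200 rows, 3 numeric inputs).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
y_demo = pd.Series(rng.integers(0, 2, size=200), name="LeaveVote")

# Hold out 30% for testing; stratify keeps the class balance similar in both parts.
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
print(X_train_demo.shape, X_test_demo.shape)
```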

## Data wrangling

In [None]:
from sklearn.preprocessing import StandardScaler
st_scaler = StandardScaler()
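
A sketch of how the scaler is typically used: fit it on the training inputs only, then apply the same transformation to the test inputs so that no test information leaks into preprocessing. The arrays here are synthetic stand-ins with hypothetical names.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in inputs (assumption: two numeric features on an arbitrary scale).
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=50, scale=10, size=(100, 2))
X_test_demo = rng.normal(loc=50, scale=10, size=(40, 2))

scaler_demo = StandardScaler()
# Learn mean and standard deviation from the training data only.
X_train_scaled = scaler_demo.fit_transform(X_train_demo)
# Reuse the training statistics on the test data.
X_test_scaled = scaler_demo.transform(X_test_demo)

print(X_train_scaled.mean(axis=0).round(3), X_train_scaled.std(axis=0).round(3))
```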

## Logistic Regression

- Now, let's estimate a logistic regression model. 

### Train the model

In [None]:
from sklearn.linear_model import LogisticRegression
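
A minimal sketch of fitting the model, shown on synthetic data from `make_classification` rather than the BES sample. `max_iter=1000` is an assumption made to avoid convergence warnings; all other settings are scikit-learn defaults.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (stand-in, not the BES sample).
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

logit_demo = LogisticRegression(max_iter=1000)
logit_demo.fit(X_demo, y_demo)

# One coefficient per input, plus predicted class probabilities for a few rows.
print(logit_demo.coef_.shape)
proba_demo = logit_demo.predict_proba(X_demo[:3])
print(proba_demo)
```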

### Model Evaluation

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
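
Evaluation could look roughly like the following: predict on the held-out data, then print the confusion matrix and the per-class precision, recall, and F1. The data and the `_demo` names are synthetic placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, split into training and test parts.
X_demo, y_demo = make_classification(n_samples=400, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

model_demo = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred_demo = model_demo.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm_demo = confusion_matrix(y_te, pred_demo)
print(cm_demo)
print(classification_report(y_te, pred_demo))
```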

## KNN classifier

- Let's repeat the procedure using a KNN classifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier


### Try `k=3` Model 

We start with a model with `k=3`. 

- What do the results look like?
- Any performance gain over logistic regression?
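
A sketch of the `k=3` model on synthetic data. On the real exercise data the inputs should be the scaled training and test sets, since KNN relies on distances.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the BES sample).
X_demo, y_demo = make_classification(n_samples=400, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Each prediction is the majority class among the 3 nearest training points.
knn_demo = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
pred_knn_demo = knn_demo.predict(X_te)

f1_knn_demo = f1_score(y_te, pred_knn_demo, pos_label=1)
print(f1_knn_demo)
```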


### Parameter tuning for KNN (already coded)

- Parameter tuning will be done using cross-validation
- Reestimate the models for the different values of tuning parameters
  - For KNN, try different values of _k_
- By default, the evaluation metric for classification tasks is accuracy. Here we use the F1 score for the positive class instead.


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, make_scorer
f1 = make_scorer(f1_score, average = 'binary', pos_label = 1)

#### Setting up the schedule


In [2]:
ks = list(range(1, 26))+ [30, 40, 50]
param_grid = {'n_neighbors': ks}

#### Fit the model

In [None]:
knn2 = KNeighborsClassifier()
knn_cv = GridSearchCV(knn2, param_grid, cv=10, scoring=f1)
knn_cv.fit(X_train, y_train)

In [None]:
knn_cv.param_grid

#### Visualize the training results

In [None]:
sns.set_style('whitegrid')
# Mean cross-validated F1 score for each value of k
plt.plot(ks, knn_cv.cv_results_['mean_test_score'])
plt.xlabel('k (n_neighbors)')
plt.ylabel('Mean CV F1 score')

In [None]:
print(knn_cv.best_score_)
print(knn_cv.best_params_)

### Evaluate the final model
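
With the default `refit=True`, `GridSearchCV` refits the best configuration on the whole training set and exposes it as `best_estimator_`. A sketch of the final evaluation on synthetic data, using a smaller grid of `k` values and 5 folds than in the exercise so that it runs quickly:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the BES sample).
X_demo, y_demo = make_classification(n_samples=400, n_features=5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1, stratify=y_demo
)

f1_demo = make_scorer(f1_score, average="binary", pos_label=1)
search_demo = GridSearchCV(
    KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7, 9]}, cv=5, scoring=f1_demo
)
search_demo.fit(X_tr, y_tr)

# The refitted best model is evaluated once on the untouched test set.
best_knn_demo = search_demo.best_estimator_
pred_best_demo = best_knn_demo.predict(X_te)
print(search_demo.best_params_)
print(classification_report(y_te, pred_best_demo))
```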

## Random Forest Classifier

Here, we try two tree-based methods:

- Random Forest
- AdaBoost

In [None]:
from sklearn.ensemble import RandomForestClassifier

### Training models with parameter tuning

- Now we train the model with parameter tuning
- The parameter grid is shown below

In [None]:
from sklearn.model_selection import GridSearchCV

In [1]:
parameter_grid = {"n_estimators": [32, 64, 100, 200, 500],
                  "max_features": [2, 3, 4]}
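
A sketch of how such a grid could be passed to `GridSearchCV`. To keep it fast, it uses synthetic data, 3 folds, and only two values of `n_estimators`; these reductions and the `"f1"` scoring string are assumptions, not the settings of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with 7 inputs, so max_features up to 4 is valid.
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=0)

# Reduced version of the grid above (assumption, for speed).
small_grid = {"n_estimators": [32, 64], "max_features": [2, 3, 4]}

rf_cv_demo = GridSearchCV(
    RandomForestClassifier(random_state=0), small_grid, cv=3, scoring="f1"
)
rf_cv_demo.fit(X_demo, y_demo)

print(rf_cv_demo.best_params_)
# Impurity-based importance of each input in the refitted best forest.
importances_demo = rf_cv_demo.best_estimator_.feature_importances_
print(importances_demo.round(3))
```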

## AdaBoost Classifier

In [None]:
from sklearn.ensemble import AdaBoostClassifier

### Training models with parameter tuning

- Again, we train the model with parameter tuning
- The parameter grid is shown below

In [None]:
parameter_grid = {"n_estimators": [100, 200, 500],
                  "learning_rate": [0.001, 0.01, 0.1, 0.2, 0.5, 1]}
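
The same pattern applies to AdaBoost. A sketch under the same assumptions as before: synthetic data, 3 folds, and a much smaller grid than the one above so that it finishes quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the BES sample).
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=0)

# Reduced grid (assumption, for speed): number of boosting rounds and shrinkage.
small_grid = {"n_estimators": [50, 100], "learning_rate": [0.1, 1.0]}

ada_cv_demo = GridSearchCV(
    AdaBoostClassifier(random_state=0), small_grid, cv=3, scoring="f1"
)
ada_cv_demo.fit(X_demo, y_demo)

print(ada_cv_demo.best_score_)
print(ada_cv_demo.best_params_)
```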

## (Optional) Support Vector Classifier

- We now try another popular supervised ML algorithm, the support vector classifier (SVC)
- The cells below show how to train an SVC

In [None]:
from sklearn.svm import SVC
svcmod = SVC(gamma='auto')

In [None]:
svcmod.fit(X_train, y_train)

In [None]:
pred_svc = svcmod.predict(X_test)

In [None]:
print(confusion_matrix(y_test, pred_svc))
print(classification_report(y_test, pred_svc))

In [None]:
param_grid = {'C': [1, 10, 100, 1000],  # cost of misclassification
              'gamma': [1, 0.1, 0.001, 0.0001],  # flexibility of the model (larger = more flexible)
              'kernel': ['rbf']}
svc_cv = GridSearchCV(SVC(), param_grid, refit=True, verbose=2)
svc_cv.fit(X_train, y_train)

In [None]:
print(svc_cv.best_score_)
print(svc_cv.best_params_)

In [None]:
pred_svc = svc_cv.predict(X_test)
print(classification_report(y_test, pred_svc))
print(confusion_matrix(y_test, pred_svc))