# Modeling and Evaluation

In this notebook, I train several classifiers and evaluate their performance, then attempt to improve them with hyperparameter tuning. The classifiers I will be building are:
- Logistic Regression
- Decision Tree
- Naive Bayes
- Random Forest
- Support Vector Machine (SVM)

In [15]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

In [16]:
# Load cleaned data
train = pd.read_csv('../data/final/train_cleaned.csv')
test_features = pd.read_csv('../data/final/test_cleaned.csv')

In [17]:
train_features = train.drop('target', axis=1)
train_target = train['target']

---

## Dropping Unneeded Columns

In [18]:
# Unneeded and redundant columns to drop
columns_to_drop = ['id', 'wpt_name', 'num_private', 'subvillage', 'ward', 'recorded_by', 'scheme_name', 
                    'scheme_management', 'water_quality', 'waterpoint_type_group', 'quantity_group', 'region_code', 
                    'extraction_type', 'extraction_type_group', 'payment', 'source_class', 'source_type',
                    'funder', 'installer', 'longitude', 'latitude', 'date_recorded', 'construction_year',
                    'district_code']

In [19]:
train_features = train_features.drop(columns=columns_to_drop)
test_features = test_features.drop(columns=columns_to_drop)

In [20]:
train_features.shape, test_features.shape

((59400, 18), (14850, 18))

## Encoding Target

0: <code>functional</code>: pump is functional

1: <code>non functional</code>: pump is not functional

In [21]:
train_target = train_target.map({'functional': 0, 'non functional': 1})
train_target.value_counts()

target
0    32259
1    27141
Name: count, dtype: int64
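
Because `Series.map` with a dictionary returns `NaN` for any label that is not one of its keys, a misspelled label (for example `non-functional` with a hyphen instead of `non functional`) would silently become a missing value. A minimal sketch of a sanity check, using a small made-up Series rather than the real `train_target`:

```python
import pandas as pd

# Hypothetical labels for illustration; the last one has a hyphen and is not in the mapping
labels = pd.Series(['functional', 'non functional', 'non-functional'])

encoded = labels.map({'functional': 0, 'non functional': 1})

# Labels that are missing from the dictionary come back as NaN
n_unmapped = int(encoded.isna().sum())
print(encoded.tolist(), n_unmapped)
```

In the cell above, the two counts sum to 59,400, the number of training rows, so no label was left unmapped.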

## Encoding Features

Categorical features need to be encoded before being fed into classification algorithms. 

In [22]:
# Import packages for categorical encoding
import time
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

from category_encoders import OrdinalEncoder
from category_encoders import CountEncoder
from category_encoders import HashingEncoder
from category_encoders import BackwardDifferenceEncoder
from category_encoders import HelmertEncoder
from category_encoders import CatBoostEncoder
from category_encoders import GLMMEncoder

The following encoding methods will be used:
- <code>OrdinalEncoder()</code>: replaces each category with an integer, producing a single numeric column per feature
- <code>CountEncoder()</code>: replaces each category with the number of times it appears in the column
- <code>HashingEncoder()</code>: hashes each category into a fixed number of output columns (<code>n_components</code>); distinct categories can collide in the same column
- <code>BackwardDifferenceEncoder()</code>: contrast coding in which the mean of the target for the $k^{th}$ category level is compared with the mean for the $(k-1)^{th}$ level
- <code>HelmertEncoder()</code>: contrast coding in which the mean of the target for the $k^{th}$ category level is compared with the mean over all previous levels
- <code>CatBoostEncoder()</code>: similar to <code>TargetEncoder</code>, which replaces each category with the mean of the target for that category, but it computes each row's value from the preceding rows only, which reduces target leakage
- <code>GLMMEncoder()</code>: a supervised encoder based on a generalized linear mixed model. Like <code>TargetEncoder</code> it uses the target, but it rests on established statistical theory and has no hyperparameters to tune
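
To make two of these ideas concrete, here is a minimal pure-Python sketch of count encoding and of the ordering principle behind CatBoost-style encoding. It is a simplified illustration, not the `category_encoders` implementation: the helper names, the toy column, and the smoothing weight `a` are assumptions made for this example.

```python
from collections import Counter

def count_encode(values):
    """Replace each category with how often it occurs in the column."""
    counts = Counter(values)
    return [counts[v] for v in values]

def ordered_target_encode(values, targets, a=1.0):
    """Encode each row from the targets of earlier rows in the same category.

    The current row's own target is never used, and the global target
    mean acts as a prior weighted by `a`.
    """
    prior = sum(targets) / len(targets)
    sums, counts = {}, {}
    encoded = []
    for v, t in zip(values, targets):
        seen_sum = sums.get(v, 0.0)
        seen_count = counts.get(v, 0)
        encoded.append((seen_sum + prior * a) / (seen_count + a))
        # Only now does this row's target become visible to later rows
        sums[v] = seen_sum + t
        counts[v] = seen_count + 1
    return encoded

# Made-up toy column and binary target
extraction = ['handpump', 'handpump', 'gravity', 'handpump']
target = [1, 0, 1, 1]

print(count_encode(extraction))
print(ordered_target_encode(extraction, target))
```

The first occurrence of every category receives the prior alone, which is why an ordered encoder depends on row order and is usually fitted with a fixed `random_state`.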

In [23]:
# Encoding methods
random_state = 42

encoding_methods = {
    'ordinal': OrdinalEncoder(),
    'count': CountEncoder(),
    'hashing': HashingEncoder(n_components=32, drop_invariant=True),
    'backward_difference': BackwardDifferenceEncoder(),
    'Helmert': HelmertEncoder(),
    'CatBoost': CatBoostEncoder(random_state=random_state),
    'GLMM': GLMMEncoder(random_state=random_state)
}
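
The `n_components=32` argument fixes the width of the hashed output regardless of how many distinct categories exist. A rough sketch of the hashing trick, assuming an MD5 hash and a smaller width of 8 purely for illustration (the function name and the sample row are hypothetical, and the library's implementation differs in detail):

```python
import hashlib

def hash_encode_row(row, n_components=8):
    """Map a row of categorical values onto n_components counter columns."""
    vector = [0] * n_components
    for value in row:
        digest = hashlib.md5(str(value).encode('utf-8')).hexdigest()
        # The hash decides which column this value increments
        vector[int(digest, 16) % n_components] += 1
    return vector

# Made-up row with three categorical values
row = ['Iringa', 'gravity', 'spring']
encoded = hash_encode_row(row)
print(encoded)
```

Two different values can land in the same column, so a smaller `n_components` trades information for a narrower feature matrix.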

## Training Classifiers

In [24]:
# Classifiers
classifiers = {
    'Logistic Regression': LogisticRegression(n_jobs=-1, random_state=random_state),
    'Decision Tree': DecisionTreeClassifier(random_state=random_state),
    'Naive Bayes': GaussianNB(),
    'Random Forest': RandomForestClassifier(random_state=random_state, n_jobs=-1),
}

In [25]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(train_features, train_target, test_size=0.2, random_state=random_state)

In [26]:
# Best (accuracy, encoder name) seen so far for each classifier; (0, 'encoder') is a placeholder
classifier_best_results = {classifier_name: (0, 'encoder') for classifier_name in classifiers.keys()}


for encoding_method, encoder in encoding_methods.items():
    print(f'\n----- Encoding data using {encoding_method} Encoder -----', end='')

    start_time = time.time()

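    # Fit the encoder on the training split only (supervised encoders use y_train),
    # then apply the fitted encoder to the held-out split
    encoded_X_train = encoder.fit_transform(X_train, y_train)
    encoded_X_test = encoder.transform(X_test)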

    end_time = time.time()
    print(f'Done in {round(end_time-start_time, 2)}s')

    for classifier_name, clf_algorithm in classifiers.items():
        print(f'Training {classifier_name} Classifier', end='')

        start_time = time.time()

        clf_algorithm.fit(encoded_X_train, y_train)
        y_pred = clf_algorithm.predict(encoded_X_test)
        acc = accuracy_score(y_test, y_pred)

        end_time = time.time()
        print(f'Done in {round(end_time-start_time, 2)}s')

        # Keep this encoder if it beats the classifier's best accuracy so far
        previous_acc, _ = classifier_best_results[classifier_name]
        if previous_acc < acc:
            classifier_best_results[classifier_name] = (acc, encoding_method)


for classifier_name, (score, encoding_method) in classifier_best_results.items():
    print(f'\nClassifier: {classifier_name}\tBest Score: {score} on Test Data\t Using {encoding_method} Encoder')


----- Encoding data using ordinal Encoder -----Done in 0.14s
Training Logistic Regression ClassifierDone in 0.24s
Training Decision Tree ClassifierDone in 0.2s
Training Naive Bayes ClassifierDone in 0.01s
Training Random Forest ClassifierDone in 1.22s

----- Encoding data using count Encoder -----Done in 0.29s
Training Logistic Regression ClassifierDone in 0.34s
Training Decision Tree ClassifierDone in 0.19s
Training Naive Bayes ClassifierDone in 0.01s
Training Random Forest ClassifierDone in 0.77s

----- Encoding data using hashing Encoder -----Done in 0.53s
Training Logistic Regression Classifier

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Done in 1.18s
Training Decision Tree ClassifierDone in 0.24s
Training Naive Bayes ClassifierDone in 0.02s
Training Random Forest ClassifierDone in 0.74s

----- Encoding data using backward_difference Encoder -----Done in 0.87s
Training Logistic Regression ClassifierDone in 0.84s
Training Decision Tree ClassifierDone in 0.72s
Training Naive Bayes ClassifierDone in 0.1s
Training Random Forest ClassifierDone in 1.42s

----- Encoding data using Helmert Encoder -----Done in 0.79s
Training Logistic Regression Classifier

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Done in 1.27s
Training Decision Tree ClassifierDone in 0.97s
Training Naive Bayes ClassifierDone in 0.11s
Training Random Forest ClassifierDone in 1.53s

----- Encoding data using CatBoost Encoder -----Done in 0.18s
Training Logistic Regression Classifier

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Done in 0.68s
Training Decision Tree ClassifierDone in 0.92s
Training Naive Bayes ClassifierDone in 0.01s
Training Random Forest ClassifierDone in 2.4s

----- Encoding data using GLMM Encoder -----Done in 20.95s
Training Logistic Regression ClassifierDone in 0.26s
Training Decision Tree ClassifierDone in 0.2s
Training Naive Bayes ClassifierDone in 0.01s
Training Random Forest ClassifierDone in 0.7s

Classifier: Logistic Regression	Best Score: 0.7383838383838384 on Test Data	 Using GLMM Encoder

Classifier: Decision Tree	Best Score: 0.7792087542087542 on Test Data	 Using Helmert Encoder

Classifier: Naive Bayes	Best Score: 0.7301346801346801 on Test Data	 Using CatBoost Encoder

Classifier: Random Forest	Best Score: 0.8118686868686869 on Test Data	 Using GLMM Encoder
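
The iteration-limit warnings above appeared for Logistic Regression with the hashing, Helmert, and CatBoost encoders, and the warning text itself suggests raising `max_iter` or scaling the data. A minimal sketch of both remedies combined in one pipeline, run on synthetic data from `make_classification` as a stand-in for the encoded features (the data, the `max_iter=1000` value, and the variable names are assumptions for illustration, not results from this notebook):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded features (not the pump data)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X[:, 0] *= 10_000  # put one feature on a much larger scale than the rest

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features, then fit with a higher iteration cap than the default of 100
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, random_state=42))
pipe.fit(X_tr, y_tr)

# Iterations the solver actually used, and accuracy on the held-out split
n_iter = int(pipe.named_steps['logisticregression'].n_iter_[0])
accuracy = pipe.score(X_te, y_te)
print(n_iter, round(accuracy, 3))
```

Wrapping the scaler in a pipeline means it is fitted on the training split only, in the same way the encoders are in the loop above.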


In [27]:
scores = []
pipelines = []

# Refit each classifier with its best-scoring encoder and print per-class metrics
for classifier_name, metric in classifier_best_results.items():
    score, encoding_method = metric
    encoder = encoding_methods[encoding_method]
    encoded_X_train = encoder.fit_transform(X_train, y_train)
    encoded_X_test = encoder.transform(X_test)

    clf = classifiers[classifier_name]
    clf.fit(encoded_X_train, y_train)
    y_pred = clf.predict(encoded_X_test)

    scores.append(score)
    pipelines.append(classifier_name + ' - ' + encoding_method)

    print(f'\n-----Classifier: {classifier_name}\tEncoding: {encoding_method}-----')
    print(classification_report(y_test, y_pred))


-----Classifier: Logistic Regression	Encoding: GLMM-----
              precision    recall  f1-score   support

           0       0.71      0.87      0.78      6457
           1       0.79      0.59      0.67      5423

    accuracy                           0.74     11880
   macro avg       0.75      0.73      0.73     11880
weighted avg       0.75      0.74      0.73     11880
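
The `macro avg` and `weighted avg` rows in the report above can be approximately reproduced from the per-class rows. A small sketch using the rounded values printed for Logistic Regression (scikit-learn computes these from unrounded values, so the last digit can differ slightly; the helper names are made up for this example):

```python
# Per-class rows copied from the Logistic Regression report above
rows = {
    0: {'precision': 0.71, 'recall': 0.87, 'f1': 0.78, 'support': 6457},
    1: {'precision': 0.79, 'recall': 0.59, 'f1': 0.67, 'support': 5423},
}

def macro_avg(metric):
    """Unweighted mean of the metric over classes."""
    return sum(r[metric] for r in rows.values()) / len(rows)

def weighted_avg(metric):
    """Mean of the metric over classes, weighted by each class's support."""
    total = sum(r['support'] for r in rows.values())
    return sum(r[metric] * r['support'] for r in rows.values()) / total

for metric in ('precision', 'recall', 'f1'):
    print(metric, round(macro_avg(metric), 3), round(weighted_avg(metric), 3))
```

Because the two classes have similar support (6,457 and 5,423), the macro and weighted averages stay close to each other here.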






-----Classifier: Decision Tree	Encoding: Helmert-----
              precision    recall  f1-score   support

           0       0.79      0.81      0.80      6457
           1       0.76      0.75      0.76      5423

    accuracy                           0.78     11880
   macro avg       0.78      0.78      0.78     11880
weighted avg       0.78      0.78      0.78     11880


-----Classifier: Naive Bayes	Encoding: CatBoost-----
              precision    recall  f1-score   support

           0       0.71      0.85      0.77      6457
           1       0.77      0.59      0.67      5423

    accuracy                           0.73     11880
   macro avg       0.74      0.72      0.72     11880
weighted avg       0.74      0.73      0.72     11880


-----Classifier: Random Forest	Encoding: GLMM-----
              precision    recall  f1-score   support

           0       0.81      0.85      0.83      6457
           1       0.81      0.77      0.79      5423

    accuracy         