# Model Training Notebook

Train, evaluate, and compare classifiers on the prepared training data

# Steps

1. Load the prepared training and test data
1. Train each candidate classifier
1. Evaluate each model (estimator score, cross-validation, F1 score)
1. Save the trained models and the comparison table


# Import packages

In [61]:
# model
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# serializing, compressing, and loading the models
import joblib

# formatting the results table
from tabulate import tabulate

# performance
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score

# loading the prepared arrays
import numpy as np

# make the project helpers in ../lib importable
import sys
sys.path.append("../lib")

from getConfig import *
config = getConfig("../")
config.cleanup(config.trained_path)


# Model Selection

Many classifiers are available; this notebook trains the three listed below, compares them, and selects one.

[Classification Model Comparison](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html?highlight=svm%20svc)

1. [Support Vector Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)
1. [Random Forest Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)
1. [Logistic Regression](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)

If you are interested in a more extensive collection, i.e., training other kinds of models and comparing them to pick the best one, please refer to:

https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

### Model Evaluation
https://scikit-learn.org/stable/modules/model_evaluation.html

In this notebook, we use the following APIs to evaluate the quality of a model’s predictions. For more detailed evaluation and other metrics, please refer to the link above.

1. Estimator score method (Score)
2. Scoring parameter (Cross validation)
3. Metric functions (F1 score from Classification metrics)

It is good practice to save every model you experiment with so that you can easily return to any of them.
Save both the hyperparameters and the trained parameters, as well as the cross-validation scores and predictions.
This lets you compare scores across model types. Use the pickle or joblib libraries.
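As a minimal sketch of that advice, the snippet below stores a model together with its hyperparameters, cross-validation scores, and predictions in one pickled record. The helper names `save_experiment` and `load_experiment`, the file name, and the placeholder model are assumptions made for illustration; they are not part of this notebook. For a scikit-learn estimator, `model.get_params()` returns the hyperparameters, and `joblib.dump` / `joblib.load` can be used in place of `pickle` for the same record.

```python
import pickle
import tempfile
from pathlib import Path


def save_experiment(path, model, hyperparameters, cv_scores, predictions):
    """Store a model together with the metadata needed to compare it later."""
    record = {
        "model": model,
        "hyperparameters": hyperparameters,
        "cv_scores": list(cv_scores),
        "predictions": list(predictions),
    }
    with open(path, "wb") as fh:
        pickle.dump(record, fh)


def load_experiment(path):
    """Load a record written by save_experiment."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Hypothetical stand-in for a fitted estimator; in the notebook this would be CLF.
placeholder_model = {"kind": "placeholder"}

with tempfile.TemporaryDirectory() as tmp_dir:
    record_path = Path(tmp_dir) / "svc_experiment.pkl"
    save_experiment(
        record_path,
        model=placeholder_model,
        hyperparameters={"C": 1.0, "kernel": "rbf"},  # illustrative values
        cv_scores=[0.5, 0.5, 0.5],
        predictions=[0, 0, 1, 0],  # illustrative values
    )
    restored = load_experiment(record_path)

print(restored["hyperparameters"], restored["cv_scores"])
```

Keeping everything in one record means a later comparison does not depend on re-running the training cell. Only unpickle files from sources you trust.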

## Load training and test data


In [62]:
with open(config.traintest_path + "X_train_prepared.csv") as file_name:
    X_train_prepared = np.loadtxt(file_name, delimiter=",")

with open(config.traintest_path + "X_train_prepared_m.csv") as file_name:
    X_train_prepared_m = np.loadtxt(file_name, delimiter=",")
    
with open(config.traintest_path + "y_train.csv") as file_name:
    y_train = np.loadtxt(file_name, delimiter=",")
    
with open(config.traintest_path + "y_test.csv") as file_name:
    y_test = np.loadtxt(file_name, delimiter=",")




## Initialize results and define method to evaluate models


In [63]:
# Initialize rows and columns of table to store results

results = [["Support Vector Classifier"], 
          ["Random Forest Classifier"], 
          ["Logistic Regression"]]
  
#define header names
col_names = ["Estimator score method", "Scoring parameter", "Metric functions"]


In [64]:
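def eval_model(model):
    """Evaluate a fitted classifier on the training data with three scikit-learn APIs."""

    # ESTIMATOR SCORE METHOD
    # For classifiers, score() returns the mean accuracy on the given data.
    accuracy = model.score(X_train_prepared, y_train)

    # SCORING PARAMETER
    # 3-fold cross-validation scored with macro-averaged recall;
    # cross_val_score fits clones of the model, one per fold.
    cross_validation = cross_val_score(model, X_train_prepared, y_train, cv=3, scoring='recall_macro')

    # METRIC FUNCTIONS
    # F1 score of the positive class, computed on training-set predictions.
    y_pred = model.predict(X_train_prepared)
    f1score = f1_score(y_train, y_pred)

    print('Accuracy:' + str(accuracy) + '\nCross validation score: ' + str(cross_validation) + '\nf1 score: ' + str(f1score))

    return [accuracy, cross_validation, f1score]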

## Support Vector Classifier (SVC)
https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC

### Model Training

In [65]:
# select the SVM
clf = svm.SVC()

# fit the model to the data
CLF = clf.fit(X_train_prepared, y_train)


### Model Evaluation

In [66]:
result0 = eval_model(CLF)
for item in result0:
    results[0].append(item)

Accuracy:0.9118198874296435
Cross validation score: [0.5 0.5 0.5]
f1 score: 0.06622516556291391
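
High accuracy combined with a macro recall of 0.5 and an F1 score near zero is the pattern produced by a classifier that almost always predicts the majority class on imbalanced labels. The sketch below reproduces that pattern by hand in plain Python. The 90/10 class split and the always-negative predictions are assumptions chosen for illustration; the notebook does not state the actual class balance.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall_for_label(y_true, y_pred, label):
    """Fraction of samples with the given true label that were predicted as it."""
    predictions_for_label = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in predictions_for_label) / len(predictions_for_label)


def macro_recall(y_true, y_pred):
    """Unweighted mean of the per-class recalls (what 'recall_macro' averages)."""
    labels = sorted(set(y_true))
    return sum(recall_for_label(y_true, y_pred, label) for label in labels) / len(labels)


def f1_positive(y_true, y_pred, positive=1):
    """F1 score of the positive class; 0.0 when there are no true positives."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical imbalanced labels: 90 negatives, 10 positives.
y_true = [0] * 90 + [1] * 10
# A classifier that always predicts the majority class.
y_pred = [0] * 100

print("accuracy:    ", accuracy(y_true, y_pred))
print("macro recall:", macro_recall(y_true, y_pred))
print("f1 score:    ", f1_positive(y_true, y_pred))
```

Accuracy rewards the many correct negatives, while macro recall averages a perfect recall on the negative class with a zero recall on the positive class, which is why it lands at exactly 0.5.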


### Store Model

In [67]:
# save and compress the model

joblib.dump(CLF, config.trained_path + "svc_model.pkl", compress=('bz2', 3))

['../experiments/experiment_4/models/trained/svc_model.pkl']

## Random Forest Classifier

### Model Training

In [68]:
# select the estimator
CLF = RandomForestClassifier()

# fit the model to the data
CLF.fit(X_train_prepared, y_train)

### Model Evaluation

In [69]:
result1 = eval_model(CLF)
for item in result1:
    results[1].append(item)

Accuracy:1.0
Cross validation score: [0.74587629 0.71519228 0.77137797]
f1 score: 1.0


### Store Model

In [70]:
# save the model
joblib.dump(CLF, config.trained_path + "rfc_model.pkl", compress=('bz2', 3))


['../experiments/experiment_4/models/trained/rfc_model.pkl']

## Logistic Regression
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logisticregression#sklearn.linear_model.LogisticRegression

### Model Training

In [71]:
# select the classifier
CLF = LogisticRegression()

# fit the model to the data
CLF.fit(X_train_prepared, y_train)

### Model Evaluation

In [72]:
result2 = eval_model(CLF)
for item in result2:
    results[2].append(item)

Accuracy:0.9049405878674172
Cross validation score: [0.5042311  0.49173554 0.51937511]
f1 score: 0.05


### Store Model

In [73]:
# save the model
joblib.dump(CLF, config.trained_path + "log_model.pkl", compress=('bz2', 3))

['../experiments/experiment_4/models/trained/log_model.pkl']

# Model Selection

In [74]:
#display table
print(tabulate(results, headers=col_names))


                             Estimator score method  Scoring parameter                     Metric functions
-------------------------  ------------------------  ----------------------------------  ------------------
Support Vector Classifier                  0.91182   [0.5 0.5 0.5]                                0.0662252
Random Forest Classifier                   1         [0.74587629 0.71519228 0.77137797]           1
Logistic Regression                        0.904941  [0.5042311  0.49173554 0.51937511]           0.05
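
One way to turn the table into a selection is to rank the models by their mean cross-validation score, since the estimator score and F1 columns are computed on the training data and are therefore optimistic. The sketch below does this with the fold scores copied from the table; choosing the mean of the folds as the selection criterion is an assumption, not something the notebook prescribes.

```python
from statistics import mean

# Cross-validation (macro recall) scores per fold, copied from the table above.
cv_scores = {
    "Support Vector Classifier": [0.5, 0.5, 0.5],
    "Random Forest Classifier": [0.74587629, 0.71519228, 0.77137797],
    "Logistic Regression": [0.5042311, 0.49173554, 0.51937511],
}

# Average the folds so each model is summarized by a single number.
mean_scores = {name: mean(scores) for name, scores in cv_scores.items()}

# Pick the model with the highest mean cross-validation score.
best_model = max(mean_scores, key=mean_scores.get)

for name, score in sorted(mean_scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.4f}")
print("Selected:", best_model)
```

With these numbers the Random Forest Classifier ranks first. Its perfect training accuracy and F1 score alongside a lower cross-validation score suggest it fits the training data more closely than it generalizes.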


In [75]:
# write the comparison table to disk; the file is closed automatically
with open(config.trained_path + "model_comparison.txt", "w") as fp:
    fp.write(tabulate(results, headers=col_names))
