Classification Loop

This notebook streamlines the classification workflow. Each classifier is different, and the tuning effort each one needs varies. However, they are used in Python in much the same way, and they are judged on the same set of measures. It therefore makes sense to run all the models in one loop, which makes the comparison more straightforward.

In [1]:
import numpy as np
import pandas as pd

In [2]:
def classifier_fit(X_train, y_train, X_test, y_test, model):
    if model == 'logistic regression':
        from sklearn.linear_model import LogisticRegression
        classifier = LogisticRegression(random_state = 0)
    elif model == 'tree':
        from sklearn.tree import DecisionTreeClassifier
        classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
    elif model == 'random forest':
        from sklearn.ensemble import RandomForestClassifier
        classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
    elif model == 'k nearest neighbors':
        from sklearn.neighbors import KNeighborsClassifier
        classifier = KNeighborsClassifier(n_neighbors = 9, metric = 'minkowski', p = 2)
    elif model == 'naive bayes':
        from sklearn.naive_bayes import GaussianNB
        classifier = GaussianNB()
    elif model == 'kernel svm':
        from sklearn.svm import SVC
        classifier = SVC(kernel = 'rbf', random_state = 0)
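    elif model == 'xgboost':
        from xgboost import XGBClassifier
        # eval_metric can also be 'auc', 'error', 'logloss', etc.
        classifier = XGBClassifier(eval_metric='aucpr', use_label_encoder=False)
    else:
        raise ValueError(f"Unknown model: {model!r}")
    from sklearn.metrics import confusion_matrix
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    # Rows are true classes and columns are predicted classes, in sorted label order.
    # Assuming 0/1 labels with 1 as the positive class: cm[1,1]=TP, cm[1,0]=FN, cm[0,1]=FP, cm[0,0]=TN.
    cm = confusion_matrix(y_test, y_pred)
    return cm[1,1], cm[1,0], cm[0,1], cm[0,0]  # TP, FN, FP, TN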
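
As a quick check of which cell of scikit-learn's confusion matrix holds which count, the sketch below uses a made-up pair of label vectors (illustrative only, not taken from any dataset in this notebook) and assumes 0/1 labels with 1 as the positive class.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 4 actual positives followed by 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# labels=[0, 1] pins the row/column order: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# Flattening row by row therefore gives TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()
print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")
```

With this layout the true positives sit in the bottom-right cell, cm[1,1], and the true negatives in the top-left cell, cm[0,0].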

Set up a single DataFrame to store the results of every model.

In [3]:
models = ['logistic regression', 'tree', 'random forest', 'k nearest neighbors', 'naive bayes', 'kernel svm', 'xgboost']
model_comparison = pd.DataFrame(columns = ['Model', 'TP', 'FN', 'FP', 'TN', 'Accuracy', 'Precision', 'Recall', 'F1 Score'])
for model in models:
    model_comparison.loc[len(model_comparison.index)] = [model, 0, 0, 0, 0, 0, 0, 0, 0]

model_comparison

                 Model  TP  FN  FP  TN  Accuracy  Precision  Recall  F1 Score
0  logistic regression   0   0   0   0         0          0       0         0
1                 tree   0   0   0   0         0          0       0         0
2        random forest   0   0   0   0         0          0       0         0
3  k nearest neighbors   0   0   0   0         0          0       0         0
4          naive bayes   0   0   0   0         0          0       0         0
5           kernel svm   0   0   0   0         0          0       0         0
6              xgboost   0   0   0   0         0          0       0         0


Every time a model makes predictions on a test set, add the resulting counts to that model's TP, FN, FP, and TN totals.

This step can be nested inside cross-validation, where each type of model is fitted several times on different subsets of the data and the counts accumulate across folds.

In [None]:
for i, model in enumerate(models):
    TP, FN, FP, TN = classifier_fit(X_train, y_train, X_test, y_test, model)
    model_comparison.at[i, 'TP'] += TP
    model_comparison.at[i, 'FN'] += FN
    model_comparison.at[i, 'FP'] += FP
    model_comparison.at[i, 'TN'] += TN
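
The nesting inside cross-validation could look roughly like the sketch below. It is a minimal illustration under stated assumptions: the data are synthetic (`make_classification`), only two of the models are included to keep it short, and the names `make_model`, `totals`, and `cv_counts` are hypothetical rather than part of the notebook. Labels are assumed to be 0/1 with 1 as the positive class.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the real features and labels (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Two of the models from the list; each lambda builds a fresh, unfitted estimator.
make_model = {
    'logistic regression': lambda: LogisticRegression(random_state=0, max_iter=1000),
    'tree': lambda: DecisionTreeClassifier(criterion='entropy', random_state=0),
}

# Running totals per model, in the order TP, FN, FP, TN.
totals = {name: np.zeros(4, dtype=int) for name in make_model}

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in folds.split(X, y):
    for name, build in make_model.items():
        clf = build().fit(X[train_idx], y[train_idx])
        y_pred = clf.predict(X[test_idx])
        # labels=[0, 1] fixes the layout so ravel() yields TN, FP, FN, TP.
        tn, fp, fn, tp = confusion_matrix(y[test_idx], y_pred, labels=[0, 1]).ravel()
        totals[name] += np.array([tp, fn, fp, tn])

cv_counts = pd.DataFrame.from_dict(totals, orient='index', columns=['TP', 'FN', 'FP', 'TN'])
print(cv_counts)
```

Because every sample lands in exactly one test fold, each model's four counts add up to the number of samples, which is a handy consistency check before computing any metrics.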

Once all model fitting is finished, calculate the accuracy, precision, recall, and F1 score for each model from the accumulated counts.

In [None]:
for i in range(len(models)):
    TP = model_comparison.at[i, 'TP']
    FN = model_comparison.at[i, 'FN']
    FP = model_comparison.at[i, 'FP']
    TN = model_comparison.at[i, 'TN']
    model_comparison.at[i, 'Accuracy'] = (TP+TN)/(TP+TN+FP+FN)
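    # Precision is undefined (division by zero) if the model never predicted the positive class.
    model_comparison.at[i, 'Precision'] = TP/(TP+FP)
    model_comparison.at[i, 'Recall'] = TP/(TP+FN)
    # Harmonic mean of precision and recall, written in terms of counts so that
    # a precision or recall of zero does not cause a division by zero.
    model_comparison.at[i, 'F1 Score'] = 2*TP / (2*TP + FP + FN)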

model_comparison
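
The same metrics can also be computed for all rows at once with column arithmetic. The sketch below is a minimal illustration on a made-up two-row table: the counts and the names `demo`, `model a`, and `model b` are hypothetical, not results from any model above. The second row shows what happens when a model never predicts the positive class.

```python
import pandas as pd

# Illustrative counts only; 'model b' never predicts the positive class (TP + FP == 0).
demo = pd.DataFrame({
    'Model': ['model a', 'model b'],
    'TP': [40, 0],
    'FN': [10, 5],
    'FP': [20, 0],
    'TN': [30, 95],
})

total = demo['TP'] + demo['FN'] + demo['FP'] + demo['TN']
demo['Accuracy'] = (demo['TP'] + demo['TN']) / total
# pandas returns NaN for 0/0 instead of raising, so an undefined precision stays visible.
demo['Precision'] = demo['TP'] / (demo['TP'] + demo['FP'])
demo['Recall'] = demo['TP'] / (demo['TP'] + demo['FN'])
# Count-based form of the harmonic mean of precision and recall.
demo['F1 Score'] = 2 * demo['TP'] / (2 * demo['TP'] + demo['FP'] + demo['FN'])

print(demo)
```

For the first row this gives an accuracy of 0.7, a precision of 40/60, a recall of 0.8, and an F1 score of 80/110. For the second row the precision is NaN, while the recall and F1 score are both 0.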