# Capstone Project: Classifying clinically actionable genetic mutations

***

## Notebook 3: Baseline Model

This notebook contains the code to identify a baseline classifier and use it to make predictions for the testing dataset.

### Contents

- [Importing of Libraries](#Importing-of-Libraries)
- [Data Import](#Data-Import)

## Importing of Libraries

In [1]:
# pip install imblearn

In [2]:
# pip install transformers

In [3]:
# pip install tabulate

In [4]:
import pandas as pd
import numpy as np

from tabulate import tabulate
from gensim.models.word2vec import Word2Vec
from collections import Counter, defaultdict

TRAIN_SET_PATH = "../assets/train_prep.csv"

GLOVE_6B_50D_PATH = "../assets/glove.6B.50d.txt"
GLOVE_6B_300D_PATH = "../assets/glove.6B.300d.txt"
encoding="utf-8"

from sklearn import linear_model, metrics, svm
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV,\
    cross_val_score, RandomizedSearchCV, StratifiedShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler, label_binarize
from sklearn.utils import resample
from sklearn.metrics import accuracy_score, roc_curve, auc, roc_auc_score
from sklearn.multiclass import OneVsRestClassifier

from imblearn.over_sampling import SMOTE

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from itertools import cycle
plt.style.use('fivethirtyeight')

import time

import nltk
from nltk.tokenize import RegexpTokenizer
import regex as re
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

import torch
import transformers as ppb
import warnings
warnings.filterwarnings('ignore')

from wordcloud import WordCloud

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

# Initialise random seed for more consistent results
from numpy.random import seed
seed(42)

Using TensorFlow backend.


## Data Import

In [5]:
# Import 'train_prep' and 'test_prep' datasets
# We set the 'keep_default_na' option to False to ensure that pandas does not re-introduce missing values
train = pd.read_csv("../assets/train_prep.csv", keep_default_na=False)
test = pd.read_csv("../assets/test_prep.csv", keep_default_na=False)

In [6]:
train.shape, test.shape

((3321, 4325), (986, 4324))

In [7]:
train.head(2)

Unnamed: 0,id,class,text,gene_ABCB11,gene_ABCC6,gene_ABL1,gene_ACVR1,gene_ADAMTS13,gene_ADGRG1,gene_AGO2,...,variation_YAP1-TFE3 Fusion,variation_YWHAE-ROS1 Fusion,variation_ZC3H7B-BCOR Fusion,variation_ZNF198-FGFR1 Fusion,variation_null1313Y,variation_null189Y,variation_null262Q,variation_null267R,variation_null399R,variation_p61BRAF
0,0,1,cyclin dependent kinase cdks regulate variety ...,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,2,abstract background non small lung nsclc heter...,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
test.head(2)

Unnamed: 0,id,text,gene_ABCB11,gene_ABCC6,gene_ABL1,gene_ACVR1,gene_ADAMTS13,gene_ADGRG1,gene_AGO2,gene_AGXT,...,variation_YAP1-TFE3 Fusion,variation_YWHAE-ROS1 Fusion,variation_ZC3H7B-BCOR Fusion,variation_ZNF198-FGFR1 Fusion,variation_null1313Y,variation_null189Y,variation_null262Q,variation_null267R,variation_null399R,variation_p61BRAF
0,1,incidence breast increase china recent decade ...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,unselected series colorectal carcinoma stratif...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Splitting of data into Predictor (X) and Target (y) Dataframes

In [9]:
X = train[[i for i in train.columns if i not in ['id', 'class']]]
y = train['class']

In [10]:
X.shape, y.shape

((3321, 4323), (3321,))

In [11]:
X_test = test.drop(['id'], axis=1)

In [12]:
X_test.shape

(986, 4323)

## Creation of (Inner) Training and Validation Datasets

From our single training data set (X and y) we will create two separate datasets:
- (Inner) Training Dataset: this will be used to train our models (this will take 75% of the original training dataset)
- Validation Dataset: this will be used to validate our trained models, e.g. to check for overfitting (this will take 25% of the original training dataset)

To create our datasets, we use train_test_split with the stratify option to ensure a consistent mix of values for the target feature within the created datasets.
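
To illustrate what the stratify option achieves, here is a minimal pure-Python sketch of a stratified split. It is not scikit-learn's implementation: the helper `stratified_split` and the toy labels are hypothetical and only show that each class is split separately, so both parts keep the same class mix.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, val_fraction=0.25, seed=42):
    """Split indices so that every class keeps roughly the same share in both parts."""
    rng = random.Random(seed)

    # Group the sample indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        # Shuffle within the class, then carve off the validation share
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Hypothetical toy labels: 80 samples of class "a" and 20 of class "b"
labels = ["a"] * 80 + ["b"] * 20
train_idx, val_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
val_counts = Counter(labels[i] for i in val_idx)
print(train_counts, val_counts)
```

Both parts keep the 80:20 ratio of the toy labels, which is the property we rely on when the rarest class has only a handful of samples.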

In [13]:
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42, stratify=y)

In [14]:
X_train.shape, y_train.shape, X_val.shape, y_val.shape

((2490, 4323), (2490,), (831, 4323), (831,))

In [15]:
# Reset the indices to prevent spurious rows from appearing later during merging
X_train.reset_index(inplace=True, drop=True)
X_val.reset_index(inplace=True, drop=True)

## Generation of word embeddings using TfidfVectorizer

For our baseline model, we use the TfidfVectorizer from sklearn, which creates weighted word embeddings (also known as vectors). Each vector consists of the number of times each word is observed in a descriptive text string, weighted by the word's inverse document frequency (i.e. heavier weights are assigned to words that appear in fewer documents).
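
The weighting can be sketched by hand on a toy corpus. The snippet below is a simplified illustration, assuming the TfidfVectorizer defaults of a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. It splits on whitespace instead of using the vectorizer's tokeniser, and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) of TF-IDF weights with smoothed idf and L2-normalised rows."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for doc in tokenised for tok in doc})
    n_docs = len(docs)

    # Document frequency: in how many documents does each token appear?
    df = Counter(tok for doc in tokenised for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for doc in tokenised:
        counts = Counter(doc)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


# Hypothetical toy corpus: "mutation" appears in every document, "kinase" in only one
docs = ["kinase mutation kinase", "mutation tumour", "mutation"]
vocab, rows = tfidf(docs)

for doc, row in zip(docs, rows):
    print(doc, [round(v, 3) for v in row])
```

In the first document "kinase" receives a larger weight than "mutation", both because it occurs twice and because it is rarer across the corpus.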

In [16]:
# Instantiate a TfidfVectorizer object
tvec = TfidfVectorizer()

In [17]:
%%time
X_train_tvec = tvec.fit_transform(X_train['text'])
X_val_tvec = tvec.transform(X_val['text'])
X_test_tvec = tvec.transform(X_test['text'])

Wall time: 19.7 s


In [18]:
X_train_tvec.shape, X_val_tvec.shape, X_test_tvec.shape

((2490, 72190), (831, 72190), (986, 72190))

In [19]:
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on scikit-learn >= 1.0
feature_names = tvec.get_feature_names()
X_train_tvec_df = pd.DataFrame(X_train_tvec.toarray(), columns=feature_names)
X_val_tvec_df = pd.DataFrame(X_val_tvec.toarray(), columns=feature_names)
X_test_tvec_df = pd.DataFrame(X_test_tvec.toarray(), columns=feature_names)

In [20]:
X_train_tvec_df.shape, X_val_tvec_df.shape, X_test_tvec_df.shape

((2490, 72190), (831, 72190), (986, 72190))

In [21]:
X_train_tvec_df.head()

Unnamed: 0,aa,aaa,aaaa,aaaaa,aaaaaagaaaattttagataaaaagag,aaaaaatcccaaccataacaaaattt,aaaaaatcctcttgtgttcag,aaaaaccggtatgaaaagcagcataccgaacaataaggagatccc,aaaaag,aaaaataactactgc,...,zytolight,zytomed,zytovision,zyx,zyxin,zz,zzo,zzq,zzsi,zzzq
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Combining word embeddings with dummy columns

In [22]:
%%time
# Concatenate the component parts of the dataframe
X_train = pd.concat([X_train, X_train_tvec_df], axis=1)
X_val = pd.concat([X_val, X_val_tvec_df], axis=1)
X_test = pd.concat([X_test, X_test_tvec_df], axis=1)

Wall time: 4.22 s


In [23]:
X_train.drop(columns=['text'], inplace=True)
X_val.drop(columns=['text'], inplace=True)
X_test.drop(columns=['text'], inplace=True)

In [24]:
X_train.shape, X_val.shape, X_test.shape

((2490, 76511), (831, 76511), (986, 76511))

In [25]:
X_train.head()

Unnamed: 0,gene_ABCB11,gene_ABCC6,gene_ABL1,gene_ACVR1,gene_ADAMTS13,gene_ADGRG1,gene_AGO2,gene_AGXT,gene_AKAP9,gene_AKT1,...,zytolight,zytomed,zytovision,zyx,zyxin,zz,zzo,zzq,zzsi,zzzq
0,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Handling of imbalanced classes

In [26]:
y_train.value_counts(normalize=True)

7    0.287149
4    0.206426
1    0.171084
2    0.136145
6    0.082731
5    0.072691
3    0.026908
9    0.011245
8    0.005622
Name: class, dtype: float64

In [27]:
y_train.value_counts()

7    715
4    514
1    426
2    339
6    206
5    181
3     67
9     28
8     14
Name: class, dtype: int64

We note above that the **training set is highly imbalanced**: classes 4 and 7 alone account for almost 50% of all samples in the training set.

To deal with this, we will need to oversample one or more of the minority classes rather than undersample the majority classes as the latter will remove valuable data for our modelling.

We oversample by creating synthetic samples using imblearn’s SMOTE (Synthetic Minority Oversampling Technique). SMOTE uses a nearest-neighbours algorithm to generate new, synthetic data points that we can use for training our model. We generate new samples **only in the training set** so that the validation and testing datasets contain only real observations and therefore give an honest indication of how well the model generalises to unseen data.

Instead of oversampling all minority classes, we oversample only the 3 most infrequent classes ('3', '9' and '8') so that each of them has 181 data points, which is the number of data points in class '5'. Our previous attempts to oversample **all** minority classes expanded the training dataset so much that execution times for the subsequent modelling became unmanageable.
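
The core idea of SMOTE can be sketched in a few lines: pick a minority sample, pick one of its nearest minority neighbours, and place a new point somewhere on the line segment between them. The snippet below is a simplified illustration, not imblearn's implementation; the helper `smote_like`, the choice of `k=2` and the toy points are hypothetical.

```python
import random


def smote_like(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority samples and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)

        # The k nearest minority neighbours of the base point (excluding the point itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)

        # Interpolate: new = base + lam * (neighbour - base), with lam drawn from [0, 1)
        lam = rng.random()
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class samples in a 2-dimensional feature space
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority, n_new=6)
print(new_points)
```

Every synthetic point lies between two existing minority samples, so no values outside the observed range are invented. With the class counts shown above, raising classes 3, 9 and 8 to 181 samples each means adding 114 + 153 + 167 = 434 synthetic rows to the 2490 original training rows.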

In [28]:
# Instantiate a SMOTE object to oversample minority classes
sm = SMOTE(random_state=42, sampling_strategy={3:181, 9:181, 8:181})

In [29]:
%%time
# fit_resample replaces the deprecated fit_sample, which was removed in imbalanced-learn 0.8
X_train, y_train = sm.fit_resample(X_train, y_train)

Wall time: 40.6 s


In [30]:
X_train.shape, y_train.shape

((2924, 76511), (2924,))

In [31]:
y_train.value_counts()

7    715
4    514
1    426
2    339
6    206
9    181
5    181
3    181
8    181
Name: class, dtype: int64

As shown above, we have oversampled the three most infrequent classes such that there are 181 samples for each of them.

## Randomised Search for optimal classifier parameters

To manage the total time and resources used to tune the classifier parameters, we use RandomizedSearchCV, which randomly samples up to 10 parameter combinations (iterations) from the specified ranges and keeps the combination with the best cross-validated accuracy score on the training dataset. We specify the range of parameters for each classifier based on experience and past results of running RandomizedSearchCV.

We select the best classifier as the one with the highest accuracy score on the **validation dataset**.

In [32]:
# We have selected the models below for modelling purposes.
estimators = {
    'lr': LogisticRegression(random_state=42),
    'mnb': MultinomialNB(),
    'knn': KNeighborsClassifier(),
    'ada': AdaBoostClassifier(random_state=42),
    'dtree': DecisionTreeClassifier(random_state=42),
    'rf': RandomForestClassifier(random_state=42),
    'etree': ExtraTreesClassifier(random_state=42),
    'svm': SVC(random_state=42)
}.items()

In [33]:
params = {
    'lr': {
        # 'liblinear' solver has been excluded as a potential solver as it cannot learn a true multinomial
        # (multiclass) model; instead, the optimization problem is decomposed in a “one-vs-rest”
        # fashion so separate binary classifiers are trained for all classes.
        # 'lbfgs' solver has also been excluded as it fails to converge through past attempts
        'lr__solver': ['sag','saga'], 
        'lr__penalty': ['l1', 'l2'],
        'lr__C': np.logspace(-3, 1, 5),
        'lr__multi_class':['multinomial'] 
    },
    'mnb': {
        'mnb__alpha': np.linspace(0.5, 1.5, 3),
        'mnb__fit_prior': [True, False],  
    },
    'knn': {
        'knn__n_neighbors': [3, 5, 7]
    },
    'ada': {
        'ada__n_estimators': [50, 100, 150],
        'ada__learning_rate': [1, 1.5, 2]
    },
    'dtree': {
        'dtree__max_features': ['auto', 'sqrt', 'log2', None],
        'dtree__min_samples_split': [4, 6, 8],
        'dtree__min_samples_leaf': [2, 3, 4]
    },
    'rf': {
        'rf__n_estimators': [100, 200, 300],
        'rf__class_weight': ['balanced'], # 'balanced' will help to deal with our imbalanced classes
        'rf__min_samples_split':[5, 10, 15],
        'rf__min_samples_leaf':[2, 3, 4]    
    },
    'etree': {
        'etree__max_features': ['auto', 'sqrt', 'log2', None],
        'etree__min_samples_split': [4, 6, 8],
        'etree__min_samples_leaf': [2, 3, 4]
    },
    'svm': {
        'svm__C': np.logspace(-3, 3, 10),
        'svm__kernel': ['linear','poly', 'rbf', 'sigmoid']
    }
}
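
As a rough illustration of how many candidates the randomised search will try, the sketch below samples parameter combinations from a grid of lists without replacement, which is how we understand RandomizedSearchCV to behave when every parameter is given as a list. The helper `sample_candidates` is hypothetical; the grids are copied from the dictionary above, with the `np.logspace` and `np.linspace` ranges written out.

```python
import random
from itertools import product


def sample_candidates(grid, n_iter=10, seed=42):
    """Draw up to n_iter distinct parameter combinations from a grid of lists."""
    keys = sorted(grid)
    all_combos = [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]
    rng = random.Random(seed)
    # A grid with fewer than n_iter combinations is simply tried exhaustively
    return rng.sample(all_combos, min(n_iter, len(all_combos)))


lr_grid = {
    "lr__solver": ["sag", "saga"],
    "lr__penalty": ["l1", "l2"],
    "lr__C": [0.001, 0.01, 0.1, 1.0, 10.0],  # np.logspace(-3, 1, 5)
    "lr__multi_class": ["multinomial"],
}
mnb_grid = {"mnb__alpha": [0.5, 1.0, 1.5], "mnb__fit_prior": [True, False]}
knn_grid = {"knn__n_neighbors": [3, 5, 7]}

for name, grid in [("lr", lr_grid), ("mnb", mnb_grid), ("knn", knn_grid)]:
    print(name, len(sample_candidates(grid)))
```

The logistic regression grid has 20 combinations, so only 10 are sampled, whereas the smaller grids for MultinomialNB (6) and KNN (3) are searched in full.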

We now use RandomizedSearchCV to select, for each classifier, the parameters that produce the best 3-fold cross-validated mean accuracy score on the training dataset.

In [None]:
%%time
# initialise empty lists to store information later
models = []
parameters = []
train_accuracy = []
val_accuracy = []
best_score = []
train_roc_auc = []
val_roc_auc = []
sensitivity = []

for k,v in estimators:
    start = time.time()
    pipe = Pipeline([(k,v)])
    param = params[k]
    randomsearch = RandomizedSearchCV(
        n_iter=10, # we set a max. of 10 iterations
        estimator=pipe,
        random_state=42,
        param_distributions=param,
        verbose=1,
        cv= 3,
        # We limit the no. of jobs to ensure sufficient memory for successful execution
        n_jobs=4,
        return_train_score= True,
        # RandomizedSearchCV will use best cross-validation accuracy score to determine best parameters
        scoring = 'accuracy' 
    )

    randomsearch.fit(X_train, y_train)
    
    model = randomsearch.best_estimator_
    cv_score = randomsearch.cv_results_
    best_params = randomsearch.best_params_

    # predict y
    y_pred_train = model.predict(X_train)
    y_pred_val = model.predict(X_val)
    
    # print results
    print ("Model: ", k)
    print ("Fitting time: {}".format(time.time()-start))
    print ("Best parameters:", best_params)
    print ("Best accuracy cross validation score:", randomsearch.best_score_)
    print ("Training dataset accuracy:", accuracy_score(y_train,y_pred_train))
    print ("Validation dataset accuracy:", accuracy_score(y_val,y_pred_val))
    print ("")
    
    # append info to list
    models.append(k)
    best_score.append(randomsearch.best_score_)
    parameters.append(best_params)
    train_accuracy.append(accuracy_score(y_train,y_pred_train))
    val_accuracy.append(accuracy_score(y_val,y_pred_val))

Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  30 out of  30 | elapsed: 102.6min finished


Model:  lr
Fitting time: 6981.50004863739
Best parameters: {'lr__solver': 'sag', 'lr__penalty': 'l2', 'lr__multi_class': 'multinomial', 'lr__C': 10.0}
Best accuracy cross validation score: 0.7041752224503764
Training dataset accuracy: 0.9976060191518468
Validation dataset accuracy: 0.6293622141997594

Fitting 3 folds for each of 6 candidates, totalling 18 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  18 out of  18 | elapsed:  1.3min finished


Model:  mnb
Fitting time: 86.08883881568909
Best parameters: {'mnb__fit_prior': False, 'mnb__alpha': 0.5}
Best accuracy cross validation score: 0.6087551554081329
Training dataset accuracy: 0.7113543091655267
Validation dataset accuracy: 0.5896510228640193

Fitting 3 folds for each of 3 candidates, totalling 9 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   9 out of   9 | elapsed: 48.3min finished


Model:  knn
Fitting time: 3954.8259868621826
Best parameters: {'knn__n_neighbors': 3}
Best accuracy cross validation score: 0.5201762052686077
Training dataset accuracy: 0.6853625170998632
Validation dataset accuracy: 0.45126353790613716

Fitting 3 folds for each of 9 candidates, totalling 27 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.


## Confirmation of Baseline Model

In [None]:
# Produce a summary table of the tuned classifiers
summary = pd.DataFrame({
    'model': models,
    'parameters': parameters,
    'Best accuracy cross-validation score': best_score,
    'Training dataset accuracy': train_accuracy,
    'Validation dataset accuracy': val_accuracy
    })

pd.set_option('display.max_colwidth', None)
summary.sort_values('Validation dataset accuracy', ascending=False).reset_index(drop=True)

<div class="alert alert-block alert-info">
The table above summarises the optimal parameters for each candidate classifier based on the best 3-fold cross-validation accuracy score (see "Best accuracy cross-validation score"). The "Training dataset accuracy" is also shown for reference. The candidate classifiers are sorted in descending order of the last column, which measures the "Validation dataset accuracy" for each classifier. We focus on validation dataset accuracy to ensure that we choose the classifier that gives us the least overfitting.<br>
<br>    
The best classifier is the Extra Trees Classifier, as it has the highest validation dataset accuracy among the tuned classifiers.<br>
<br>
    Our <b>baseline model</b> therefore consists of:
    <br>
    <ul>
        <li>Word embeddings created by <b>TfidfVectorizer</b></li>
        <li><b>Extra Trees Classifier</b> based on the optimal parameters given in the table above.</li>
    </ul>
</div>
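
Ranking by validation accuracy and measuring overfitting are related but not identical. As a small worked example, the sketch below uses only the three classifiers whose logs appear in the search output above (scores rounded to four decimal places) and computes the gap between training and validation accuracy for each. The dictionary layout and variable names are hypothetical.

```python
# Scores copied from the RandomizedSearchCV log above, rounded to 4 decimal places
logged = {
    "lr": {"cv": 0.7042, "train": 0.9976, "val": 0.6294},
    "mnb": {"cv": 0.6088, "train": 0.7114, "val": 0.5897},
    "knn": {"cv": 0.5202, "train": 0.6854, "val": 0.4513},
}

# Rank the classifiers by validation accuracy (highest first)
ranked = sorted(logged, key=lambda name: logged[name]["val"], reverse=True)

# Overfitting gap: how much accuracy is lost when moving from training to validation data
gaps = {name: round(scores["train"] - scores["val"], 4) for name, scores in logged.items()}

print("Ranking by validation accuracy:", ranked)
print("Train-validation gaps:", gaps)
```

Among these three, logistic regression has the highest validation accuracy but also the largest gap, so the two criteria can point in different directions and are worth reporting side by side.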

## Further Exploration of Baseline Model

In [None]:
# We instantiate the baseline classifier based on the best parameters found above
baseline_clf = ExtraTreesClassifier(verbose=1, n_jobs=-1, random_state=42, \
                                  min_samples_split=4, min_samples_leaf=3, max_features=None)

In [None]:
%%time
# Fit the best classifier on the training dataset
baseline_clf.fit(X_train, y_train)

### Visualisation of Extra Trees Classifier

We now visualise one particular decision tree.

In [None]:
# The fitted trees of an ensemble are stored in 'estimators_'; we choose one particular tree arbitrarily
estimator = baseline_clf.estimators_[0]

from sklearn.tree import export_graphviz
# Export as dot file
export_graphviz(estimator, out_file='tree.dot', 
                feature_names = X_train.columns,
                class_names = [str(c) for c in range(1, 10)],
                rounded = True, proportion = False, 
                precision = 2, filled = True)

# Convert to png using Graphviz via system command
from subprocess import call
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])

# Display in jupyter notebook
from IPython.display import Image
Image(filename = 'tree.png')

### ROC Curve & Metrics

In [None]:
# Generate predictions for the validation data based on our baseline model
y_val_pred = baseline_clf.predict(X_val)

In [None]:
# Binarize the output
y_train_binarized = label_binarize(y_train, classes=list(np.unique(y)))
y_val_pred_binarized = label_binarize(y_val_pred, classes=list(np.unique(y)))
n_classes = len(np.unique(y))

To come up with actual scores that can be used for ROC calculation, we use OneVsRestClassifier wrapped around an SVC (C-Support Vector Classifier) and fit it on the training dataset so that we can obtain the distance of each sample from the decision boundary for each class.

In [None]:
%%time
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True,
                                 random_state=42, verbose=1), n_jobs=4)
y_score = classifier.fit(X_train, y_train_binarized).decision_function(X_val)

In [None]:
# Compute ROC curve and ROC area for each class
# The ground truth must be the actual validation labels, not the baseline model's predictions
y_val_binarized = label_binarize(y_val, classes=list(np.unique(y)))

fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    # we compare the actual validation dataset labels with the scores for the validation dataset
    # first parameter of roc_curve is y_true, and second parameter is y_score
    fpr[i], tpr[i], _ = roc_curve(y_val_binarized[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_val_binarized.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
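
To make the ROC calculation concrete, here is a minimal pure-Python sketch of how a ROC curve and its area can be computed for one class: sort the samples by score, lower the threshold one sample at a time, and record the false and true positive rates at each step. It assumes there are no tied scores, and the helper names and toy data are hypothetical. Note that the first argument holds the true labels.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by sweeping a threshold from the highest score down."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i in order:
        # Lowering the threshold past sample i turns it into a positive prediction
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Area under the curve using the trapezoidal rule."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
        for i in range(1, len(fpr))
    )


# Hypothetical toy data: true membership of one class and the classifier's scores for it
y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]

fpr_toy, tpr_toy = roc_points(y_true, scores)
auc_toy = trapezoid_auc(fpr_toy, tpr_toy)
print(fpr_toy, tpr_toy, auc_toy)
```

The area of 0.75 matches the fraction of positive-negative pairs that the scores rank correctly (3 out of 4), which is the usual interpretation of AUC.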

We plot ROC curves for all the 9 classes.

In [None]:
# First aggregate all false positive rates
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))

# Then interpolate all ROC curves at these points
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
    mean_tpr += np.interp(all_fpr, fpr[i], tpr[i])

# Finally average it and compute AUC
mean_tpr /= n_classes

fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])

# Plot all ROC curves
plt.figure(figsize=(12,8))
plt.plot(fpr["micro"], tpr["micro"],
         label='micro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["micro"]),
         color='deeppink', linestyle=':', linewidth=4)

plt.plot(fpr["macro"], tpr["macro"],
         label='macro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["macro"]),
         color='navy', linestyle=':', linewidth=4)

lw=2
colors = cycle(['rosybrown', 'firebrick', 'sienna', 'olivedrab', 'darkgreen',\
                'lightseagreen', 'darkturquoise', 'b', 'darkorange'])
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=lw,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(i+1, roc_auc[i]))

plt.plot([0, 1], [0, 1], 'k--', lw=lw)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Multiple Classes')
plt.legend(loc="lower right")
plt.show()

We note from the AUC scores ('area' metric) in the plot above that class 7 (the most predominant class in the training dataset) has a high AUC score relative to the other classes, which is not surprising. The AUC score for class 8, however, is very low. 

In [None]:
%%time
y_prob = classifier.predict_proba(X_val)

In [None]:
# We use the actual validation labels (y_val) as the ground truth. For the multi_class option to
# take effect, roc_auc_score needs 1-D class labels and probabilities that sum to 1 for each sample,
# so we normalise the per-class probabilities returned by the one-vs-rest classifier.
y_prob_norm = y_prob / y_prob.sum(axis=1, keepdims=True)

# Calculate the average unweighted AUC of all pairwise combinations of classes (one-vs-one);
# insensitive to class imbalance
macro_roc_auc_ovo = roc_auc_score(y_val, y_prob_norm, multi_class="ovo",
                                  average="macro")

# Calculate the weighted average AUC of all pairwise combinations of classes (one-vs-one);
# sensitive to class imbalance by considering the no. of true instances for each label
weighted_roc_auc_ovo = roc_auc_score(y_val, y_prob_norm, multi_class="ovo",
                                     average="weighted")

# Calculate the unweighted AUC of each class against the rest;
# still sensitive to class imbalance because the imbalance affects the composition of each of the ‘rest’ groupings
macro_roc_auc_ovr = roc_auc_score(y_val, y_prob_norm, multi_class="ovr",
                                  average="macro")

# Calculate the weighted AUC of each class against the rest; sensitive to class imbalance
weighted_roc_auc_ovr = roc_auc_score(y_val, y_prob_norm, multi_class="ovr",
                                     average="weighted")

print("One-vs-One ROC AUC scores for validation dataset:\n{:.6f} (macro),\n{:.6f} "
      "(weighted by prevalence)"
      .format(macro_roc_auc_ovo, weighted_roc_auc_ovo))
print("One-vs-Rest ROC AUC scores for validation dataset:\n{:.6f} (macro),\n{:.6f} "
      "(weighted by prevalence)"
      .format(macro_roc_auc_ovr, weighted_roc_auc_ovr))
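
The difference between the macro and the prevalence-weighted averages can be seen with a small worked example. The per-class AUC values and class counts below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def macro_average(values):
    """Unweighted mean: every class counts equally, however rare it is."""
    return sum(values.values()) / len(values)


def weighted_average(values, support):
    """Mean weighted by the number of true instances (support) of each class."""
    total = sum(support.values())
    return sum(values[c] * support[c] for c in values) / total


# Hypothetical per-class AUC scores and class counts
per_class_auc = {"common": 0.9, "rare": 0.5}
support = {"common": 90, "rare": 10}

macro = macro_average(per_class_auc)
weighted = weighted_average(per_class_auc, support)
print("macro: {:.2f}, weighted: {:.2f}".format(macro, weighted))
```

The poor score of the rare class pulls the macro average down to 0.70, while the weighted average stays at 0.86 because the rare class contributes only 10% of the weight. For a dataset as imbalanced as ours, it is therefore worth reading both figures together.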

## Evaluation of Baseline Model

<div class="alert alert-block alert-info">
    
(To be written)
    
</div>

## Data Export (for Kaggle Submission)

In [None]:
# Generate predictions
y_test_pred = baseline_clf.predict(X_test)

In [None]:
y_test_pred.shape

In [None]:
# Restore the 'id' column since we need this for the Kaggle submission
test_pred = pd.concat([test['id'], pd.DataFrame(y_test_pred, columns=['class'])], axis=1)

In [None]:
# Verify that we have a mix of predictions for variation classes
test_pred['class'].value_counts()

In [None]:
test_pred.head()

In [None]:
test_pred.to_csv("../assets/test_pred.csv", index=False)