
# Compare Classifiers using Heatmaps-Episode-2 
In this tutorial we compute per-class AUC scores for each model and use the results to generate a heatmap.

### Classifiers
 - Adaboost
 - Extra Trees
 - K-Nearest Neighbor
 - Logistic Regression
 - Naive Bayes
 - Random Forests
 - Support Vector Machines
 - XGBoost

## Required Libraries
 - numpy
 - matplotlib
 - seaborn
 - pandas
 - scikit-learn
 - xgboost

## Import Python libraries

In [None]:
#data handling
import pandas as pd
import numpy as np

#data visualization
import matplotlib.pyplot as plt
import seaborn as sns

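#preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import label_binarize

#feature selection
from sklearn.feature_selection import mutual_info_classif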

#classification
from xgboost import XGBClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

# performance metrics
from sklearn.metrics import roc_auc_score

## Read data

In [None]:

#read data directly from a github repository

file_url='https://github.com/vappiah/Machine-Learning-Tutorials/raw/main/data/cancer_gene_expression.zip'

dataframe=pd.read_csv(file_url)


 
## **Data preprocessing** 
This step puts the data in an appropriate format before modelling.


In [None]:
#we will now separate the feature values from the class.
#scikit-learn requires that features and class are separated before passing them to the classifiers.

X=dataframe.iloc[:,0:-1]
y=dataframe.iloc[:,-1]

\
**Encode labels**

The labels for this data are categorical, so we have to convert them to numeric form. This is referred to as encoding. Machine learning models usually require their inputs to be numeric, hence we encode the labels.

In [None]:
#let's encode target labels (y) with values between 0 and n_classes-1.
#encoding will be done using the LabelEncoder
label_encoder=LabelEncoder()
label_encoder.fit(y)
y_encoded=label_encoder.transform(y)
labels=label_encoder.classes_
classes=np.unique(y_encoded)
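
As a small illustration of what the encoder does, the sketch below applies `LabelEncoder` to a few made-up class names (the names are hypothetical and are not taken from the dataset). It shows how `classes_` maps each integer code back to its original label.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical class names, used only for illustration
toy_labels = np.array(["tumor_B", "tumor_A", "tumor_C", "tumor_A"])

toy_encoder = LabelEncoder()
toy_encoded = toy_encoder.fit_transform(toy_labels)

# classes_ holds the original names in sorted order; position i maps to code i
print(toy_encoder.classes_)   # ['tumor_A' 'tumor_B' 'tumor_C']
print(toy_encoded)            # [1 0 2 0]

# inverse_transform recovers the original names from the integer codes
toy_decoded = toy_encoder.inverse_transform(toy_encoded)
```

Keeping `classes_` around is what later allows numeric results to be reported under the original class names.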

\
**Data Splitting**\
We will now split the data into training and test subsets.
The training data is passed to the machine learning model first. This enables the model to identify discriminatory patterns which can be used to make future predictions.
The testing data is used to evaluate the model after the training phase.

In [None]:
#split data into training and test sets
X_train,X_test,y_train,y_test=train_test_split(X,y_encoded,test_size=0.2,random_state=42)
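
When some classes are much rarer than others, a plain random split can leave a class under-represented in the test subset. One optional variation, shown here on synthetic toy labels rather than the tutorial data, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both subsets. This is a sketch of an alternative, not the split used in this tutorial.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy labels: 80 samples of class 0, 20 of class 1
toy_y = np.array([0] * 80 + [1] * 20)
toy_X = np.arange(100).reshape(-1, 1)

# stratify keeps the 80/20 class ratio in both the training and the test subset
Xtr, Xte, ytr, yte = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

train_counts = np.bincount(ytr)  # samples per class in the training subset
test_counts = np.bincount(yte)   # samples per class in the test subset
print(train_counts, test_counts)
```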

\
**Data Normalization**\
Data normalization rescales every feature to the same range (here 0 to 1). This prevents features with large values from dominating distance-based or gradient-based models such as KNN, SVM and logistic regression.

In [None]:
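# scale data between 0 and 1
# the scaler is fitted on the training data only and then applied to the test data,
# so that no information from the test set leaks into preprocessing

min_max_scaler=MinMaxScaler()
X_train_norm=min_max_scaler.fit_transform(X_train)
X_test_norm=min_max_scaler.transform(X_test)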
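
A common convention is to learn the scaling parameters (minimum and maximum) from the training subset only and to reuse them for the test subset. The toy sketch below, on made-up numbers, shows what that looks like with `MinMaxScaler`. Note that test values outside the training range end up outside 0 to 1, which is expected.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data, used only for illustration
toy_train = np.array([[0.0], [5.0], [10.0]])
toy_test = np.array([[2.0], [20.0]])

toy_scaler = MinMaxScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learns min=0 and max=10
toy_test_scaled = toy_scaler.transform(toy_test)        # reuses the training min/max

print(toy_train_scaled.ravel())  # values 0, 0.5 and 1
print(toy_test_scaled.ravel())   # values 0.2 and 2 (20 lies above the training max)
```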

## **Feature Selection**
The purpose of feature selection is to select relevant features for classification. 
Feature selection is usually used as a pre-processing step before doing the actual learning. 

In this tutorial, mutual information is used to score the relevance of each feature to the class labels. The top n (e.g. 300) features are then selected for the machine learning analysis.

### Feature Selection using Mutual Information

In [None]:
MI=mutual_info_classif(X_train_norm,y_train)

In [None]:
#select the top n features. let's say 300.
#you can modify the value and see how the performance of the model changes

n_features=300
selected_scores_indices=np.argsort(MI)[::-1][0:n_features]

In [None]:
X_train_selected=X_train_norm[:,selected_scores_indices]
X_test_selected=X_test_norm[:,selected_scores_indices]
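
To make the ranking step concrete, here is a minimal sketch on synthetic data (not the gene expression data). One column is built to follow the class closely and two columns are pure noise, so the mutual information score of the first column should come out highest. The `random_state` values are only there to make the sketch repeatable.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
toy_y = rng.integers(0, 2, size=200)

# Column 0 tracks the class closely; columns 1 and 2 are pure noise
informative = toy_y + rng.normal(scale=0.1, size=200)
noise = rng.normal(size=(200, 2))
toy_X = np.column_stack([informative, noise])

toy_mi = mutual_info_classif(toy_X, toy_y, random_state=0)

# argsort is ascending, so reverse it to rank features from most to least relevant
toy_ranking = np.argsort(toy_mi)[::-1]
toy_top = toy_ranking[:1]            # keep only the single best feature here
toy_X_selected = toy_X[:, toy_top]   # same column indexing as used for the real data
```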

## Classification
Eight classifiers are trained and compared in this tutorial (see the list at the top). Classification involves training each model and then evaluating it on the test subset.

### Model Training
Training allows the machine learning model to learn patterns from the data, which it then uses to predict the outcomes of data it has never seen before.
In the training phase, each model is given the training subset. Two helper functions are defined here: `compute_auc` calculates a one-vs-rest AUC score per class for a fitted model, and `fit_data` trains all eight classifiers.

In [None]:
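def compute_auc(model,xtest,ytest):
    #get one score per class for every test sample
    #decision_function is used when the model has it, otherwise class probabilities
    if hasattr(model,'decision_function'):
        scores=model.decision_function(xtest)
    elif hasattr(model,'predict_proba'):
        scores=model.predict_proba(xtest)
    else:
        raise ValueError('model has neither decision_function nor predict_proba')

    #one-hot encode the true labels: one column per class
    #(assumes more than two classes, so both arrays have shape (n_samples, n_classes))
    y_test_binarized=label_binarize(ytest,classes=classes)

    #one-vs-rest AUC for each class, keyed by the original class name
    roc_auc_dict={}
    n_class=classes.shape[0]
    for i in range(n_class):
        roc_auc_dict[labels[i]]=roc_auc_score(y_test_binarized[:,i],scores[:,i])

    return roc_auc_dict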
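
The per-class, one-vs-rest idea behind `compute_auc` can be checked by hand on a tiny example. In the sketch below the true labels and the score matrix are hypothetical values chosen for illustration. Each class is scored by treating it as the positive class and all other classes as negative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

toy_classes = np.array([0, 1, 2])
toy_true = np.array([0, 1, 2, 0, 1, 2])

# Hypothetical per-class scores (rows: samples, columns: classes)
toy_scores = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
    [0.5, 0.3, 0.2],
    [0.3, 0.3, 0.4],
    [0.2, 0.1, 0.7],
])

# One binary column per class: 1 where the sample belongs to that class
toy_true_bin = label_binarize(toy_true, classes=toy_classes)

# One-vs-rest AUC per class
toy_auc = {
    int(c): roc_auc_score(toy_true_bin[:, i], toy_scores[:, i])
    for i, c in enumerate(toy_classes)
}
print(toy_auc)
```

Classes 0 and 2 are ranked perfectly here. For class 1, one positive sample ties with one negative sample at 0.3, which counts as half a correct pair and gives 7.5 of 8 pairs, so its AUC is 0.9375.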

In [None]:
def fit_data(xtrain,ytrain,xtest,ytest):
    
    #Adaboost Classifier
    ADB=AdaBoostClassifier()
    ADB.fit(xtrain,ytrain)

    #XGBoost Classifier
    #the number of classes is inferred from ytrain during fit
    XGB=XGBClassifier(eval_metric='mlogloss')
    XGB.fit(xtrain,ytrain)
    
    #Random Forest Classifier
    RF=RandomForestClassifier(max_features=0.2)
    RF.fit(xtrain,ytrain)

    #Support Vector Machine Classifier
    SVM=SVC()
    SVM.fit(xtrain,ytrain)

    #K-nearest Neighbor classifier
    KNN=KNeighborsClassifier()
    KNN.fit(xtrain,ytrain)

    #Naive Bayes Classifier
    NB=GaussianNB()
    NB.fit(xtrain,ytrain)

    #Extra Trees Classifier
    ETC=ExtraTreesClassifier()
    ETC.fit(xtrain,ytrain)

    #Logistic Regression
    #with the lbfgs solver, recent scikit-learn versions fit a multinomial model by default for multiclass targets
    LOGREG=LogisticRegression(C=50,solver='lbfgs',max_iter=3000)
    LOGREG.fit(xtrain,ytrain)

    #create a dictionary object to store the models
    models={'KNN':KNN,'Random Forest':RF,'LOGREG':LOGREG,'SVM':SVM,'Naive Bayes':NB,
            'XGBoost':XGB,'Extra Trees':ETC,'Adaboost':ADB}
    
    return models
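
A dictionary of fitted models like the one returned above can be looped over to collect per-class AUC scores into a table with one row per model and one column per class, which is the shape a heatmap needs. The sketch below is self-contained and uses synthetic data from `make_classification` with only two of the classifiers. The column names `class_0`, `class_1` and `class_2` are placeholders, not the dataset's labels.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import label_binarize

# Synthetic 3-class data, used only for illustration
toy_X, toy_y = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)
Xtr, Xte, ytr, yte = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=0, stratify=toy_y
)

# Same pattern as the models dictionary: name -> fitted model
toy_models = {
    "KNN": KNeighborsClassifier().fit(Xtr, ytr),
    "Naive Bayes": GaussianNB().fit(Xtr, ytr),
}

yte_bin = label_binarize(yte, classes=[0, 1, 2])

rows = {}
for name, model in toy_models.items():
    scores = model.predict_proba(Xte)
    # one-vs-rest AUC for each class of this model
    rows[name] = {
        f"class_{i}": roc_auc_score(yte_bin[:, i], scores[:, i])
        for i in range(3)
    }

# Models as rows, classes as columns
toy_auc_table = pd.DataFrame.from_dict(rows, orient="index")
print(toy_auc_table.round(3))
```

A table in this layout can be passed directly to `sns.heatmap`, for example with `annot=True` to print the scores inside the cells.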
