
# Phase 3: ML Classifier Model Building Part 1
**Shreya Das**

This dataset was originally created for machine learning, specifically for models that predict which cancer type a sample originates from based on its gene expression. Our goal is to build an ML model that classifies the cancer type correctly.

Before we start building our ML model, we need to figure out which type of model best suits this dataset and our goal. We will use the original dataset for this process.

## Import libraries

In [None]:
import numpy as np
import pandas as pd

Read the .csv file into the notebook.

In [None]:
df = pd.read_csv("/content/Brain_GSE50161.csv")

In [None]:
df

We will now split the dataset into training data and test data using the train_test_split function from the sklearn.model_selection module. This is a crucial step: we first train our ML model on one part of the data, then test how good the model is on data that it has never seen. This is how we estimate how it would perform on new samples.

In [None]:
# install lazypredict module if required
!pip install lazypredict

In [None]:
from lazypredict.Supervised import LazyClassifier
from sklearn.model_selection import train_test_split

We will keep the X data (independent variable) as the gene expression values of each of the genes. The Y data (dependent variable) is the cancer categories; this is the variable that the model will predict based on the gene expression levels of each gene.

In [None]:
# For the X variable we only want the gene expression values, not the samples and types
selection = ['samples', 'type']

# Exclude the two non-expression columns: samples and type
X = df.loc[:, ~df.columns.isin(selection)]
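The column selection above can be checked on a small example. This is a minimal sketch on a toy DataFrame; the gene column names and values are made up for illustration and do not come from the real dataset. It also shows that `drop(columns=...)` gives the same result as the boolean mask.

```python
import pandas as pd

# Toy frame; the gene column names and values are hypothetical
toy_df = pd.DataFrame({
    "samples": [1, 2, 3],
    "type": ["normal", "ependymoma", "normal"],
    "gene_a": [5.1, 6.3, 4.8],
    "gene_b": [2.2, 2.9, 2.0],
})
toy_selection = ["samples", "type"]

# Boolean-mask approach, as used in this notebook
X_mask = toy_df.loc[:, ~toy_df.columns.isin(toy_selection)]

# Equivalent alternative: drop the unwanted columns by name
X_drop = toy_df.drop(columns=toy_selection)

print(list(X_mask.columns))
print(X_mask.equals(X_drop))
```

Either form works; the mask version silently ignores names that are not present, while `drop` raises an error for a missing column unless `errors="ignore"` is passed.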

In [None]:
X.head()

In [None]:
# Setting the Y variable to only include the types of cancer or categories
Y = df['type']

In [None]:
print(np.unique(Y))

Here we see that we have 5 unique categories: 4 cancer types and 1 normal type.
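Knowing how many samples fall into each category is also useful, since unequal group sizes affect how we read the metrics later. A minimal sketch, assuming a small hypothetical label array in place of `df['type']`:

```python
import numpy as np

# Hypothetical labels standing in for df['type']
toy_types = np.array(
    ["normal", "ependymoma", "ependymoma", "glioblastoma", "normal", "ependymoma"]
)

# return_counts=True also returns how often each unique value occurs
classes, counts = np.unique(toy_types, return_counts=True)
class_counts = dict(zip(classes.tolist(), counts.tolist()))

print(class_counts)
```

On the real data, `Y.value_counts()` would give the same information directly from the pandas Series.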

Before training, we will use LabelEncoder from the sklearn.preprocessing module. LabelEncoder is not an ML model itself; it is a preprocessing utility that converts the cancer type names into integers for the classifiers to work with.

The fit_transform() function learns the set of unique labels and replaces each one with a numerical value. To a human, a category name means something; to a computer it is just a string of characters, so we replace each category with a unique number that the models can handle easily.

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

# Encode each unique cancer type as an integer for the classifiers to work with
Y_encoded = le.fit_transform(Y)
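To see what the encoder does, here is a minimal sketch on a few hypothetical labels (the real ones come from `df['type']`). LabelEncoder sorts the unique labels and assigns integers in that order; `classes_` and `inverse_transform` let us map the integers back to names.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels for illustration only
toy_labels = ["normal", "glioblastoma", "ependymoma", "normal"]

toy_le = LabelEncoder()
toy_encoded = toy_le.fit_transform(toy_labels)

# classes_ holds the original names; the position of each name is its integer code
mapping = {name: code for code, name in enumerate(toy_le.classes_)}

# inverse_transform turns integer codes back into the original names
decoded = toy_le.inverse_transform(toy_encoded)

print(mapping)
print(toy_encoded, decoded)
```

Keeping the fitted encoder around is worthwhile, because a model's integer predictions can later be translated back to cancer type names with `inverse_transform`.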

We are now going to split the X and Y data into training and test sets. A common choice is to keep 20% of the dataset as test data and use the remaining 80% as training data, because we want our models to be trained with as much information as possible.

The random_state argument is a numerical value that sets the seed of the random number generator, so that when we run this code again we get the same train and test sets every time.

We are using LazyClassifier to compare the performance of many candidate ML models on the data we have.
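The effect of test_size and random_state can be seen on a small example. This is a sketch on toy arrays, not the real dataset. The last part uses the optional `stratify` argument, which this notebook does not use, to show how class proportions can be kept equal in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in data: 50 samples, 4 features, 5 classes with 10 samples each
X_toy = np.arange(200).reshape(50, 4)
y_toy = np.repeat(np.arange(5), 10)

# Same seed gives an identical split on every run
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=13)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=13)
same_split = np.array_equal(a_test, b_test)

# test_size=0.2 holds out 20% of the rows for testing
n_train, n_test = len(a_train), len(a_test)

# Optional: stratify keeps the class proportions the same in train and test
_, _, _, strat_test_labels = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=13, stratify=y_toy
)
test_class_counts = np.bincount(strat_test_labels, minlength=5)

print(same_split, n_train, n_test, test_class_counts)
```

With only a few samples per cancer type, a stratified split may be worth considering so that every type appears in the test set.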

In [None]:
# Splitting the data into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, Y_encoded, test_size=0.2, random_state=13)

# Here is the LazyClassifier that we will use to check the model performance and use the data to fit to each of the models
clf = LazyClassifier(verbose=1, ignore_warnings=True, custom_metric=None)

Using the LazyClassifier, we fit every candidate model on the training data and score its predictions (classifications) on the test data. This will take some time, so be patient.

In [None]:
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
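Conceptually, this kind of comparison amounts to fitting several estimators on the same split and scoring each one on the test set. The sketch below illustrates the idea with two scikit-learn models on synthetic data. It is not LazyClassifier's actual implementation, and the data, variable names, and choice of models are assumptions made for illustration.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the brain cancer dataset): 200 samples, 5 classes
X_syn, y_syn = make_classification(
    n_samples=200, n_features=30, n_informative=10, n_classes=5, random_state=13
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=13
)

candidates = {
    "RidgeClassifier": RidgeClassifier(),
    "ExtraTreesClassifier": ExtraTreesClassifier(random_state=13),
}

results = {}
for name, model in candidates.items():
    start = time.time()
    model.fit(Xs_train, ys_train)    # train on the training set only
    y_hat = model.predict(Xs_test)   # predict on data the model has not seen
    results[name] = {
        "Accuracy": accuracy_score(ys_test, y_hat),
        "Balanced Accuracy": balanced_accuracy_score(ys_test, y_hat),
        "Time Taken": time.time() - start,
    }

for name, row in results.items():
    print(name, row)
```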

We will now look at the performance of each ML model in classifying the different cancer types.

**Accuracy** This is the ratio of correctly predicted classifications to the total number of classifications. 1.0 means perfect accuracy.

**Balanced Accuracy** This is the average of the recall obtained on each class. Unlike plain accuracy, it is not inflated by groups that have many more samples than others. This is important for our dataset since the number of samples differs between sample types.

**ROC AUC** This metric measures how well a model distinguishes between 2 classes: positive and negative. A value of 1.0 is perfect classification, and 0.5 is random guessing. It does not apply directly to our classification task since we have five types rather than just positive and negative, and the ROC AUC column in the tables below is empty.

**F1 Score** This is the harmonic mean of precision and recall. Precision is the fraction of samples predicted as a given class that truly belong to it. Recall is the fraction of samples that truly belong to a class that the model identifies correctly. A value close to 1.0 is best.

**Time Taken** This is how long, in seconds, each model takes to train and make its predictions. All else being equal, we prefer the model that takes the least time.
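A small worked example shows why balanced accuracy matters when groups are unequal. This is a sketch on toy labels, not on our dataset: a "model" that always predicts the majority class looks decent on plain accuracy but poor on balanced accuracy. The `average="weighted"` setting for the F1 score is an assumption chosen for illustration.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Toy imbalanced problem: 8 samples of class 0 and 2 samples of class 1
y_true = [0] * 8 + [1] * 2

# A trivial "model" that always predicts the majority class
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)               # 8 of 10 correct
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall: (1.0 + 0.0) / 2

# Weighted F1 averages the per-class F1 scores by class size;
# zero_division=0 scores the never-predicted class as 0 without a warning
f1_weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)

print(acc, bal_acc, round(f1_weighted, 4))
```

Here accuracy is 0.8 while balanced accuracy is only 0.5, because the model never identifies the smaller class.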

In [77]:
models

| Model | Accuracy | Balanced Accuracy | ROC AUC | F1 Score | Time Taken |
|---|---|---|---|---|---|
| BaggingClassifier | 1.0 | 1.0 | | 1.0 | 14.78 |
| ExtraTreesClassifier | 1.0 | 1.0 | | 1.0 | 1.43 |
| RandomForestClassifier | 1.0 | 1.0 | | 1.0 | 2.96 |
| RidgeClassifier | 1.0 | 1.0 | | 1.0 | 1.42 |
| RidgeClassifierCV | 1.0 | 1.0 | | 1.0 | 1.38 |
| LGBMClassifier | 1.0 | 1.0 | | 1.0 | 152.13 |
| LinearDiscriminantAnalysis | 1.0 | 1.0 | | 1.0 | 4.21 |
| CalibratedClassifierCV | 0.96 | 0.98 | | 0.96 | 24.12 |
| LinearSVC | 0.96 | 0.98 | | 0.96 | 8.23 |
| PassiveAggressiveClassifier | 0.96 | 0.98 | | 0.96 | 4.18 |


In [78]:
predictions

| Model | Accuracy | Balanced Accuracy | ROC AUC | F1 Score | Time Taken |
|---|---|---|---|---|---|
| BaggingClassifier | 1.0 | 1.0 | | 1.0 | 14.78 |
| ExtraTreesClassifier | 1.0 | 1.0 | | 1.0 | 1.43 |
| RandomForestClassifier | 1.0 | 1.0 | | 1.0 | 2.96 |
| RidgeClassifier | 1.0 | 1.0 | | 1.0 | 1.42 |
| RidgeClassifierCV | 1.0 | 1.0 | | 1.0 | 1.38 |
| LGBMClassifier | 1.0 | 1.0 | | 1.0 | 152.13 |
| LinearDiscriminantAnalysis | 1.0 | 1.0 | | 1.0 | 4.21 |
| CalibratedClassifierCV | 0.96 | 0.98 | | 0.96 | 24.12 |
| LinearSVC | 0.96 | 0.98 | | 0.96 | 8.23 |
| PassiveAggressiveClassifier | 0.96 | 0.98 | | 0.96 | 4.18 |


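The models and predictions tables are identical. Since we created the LazyClassifier without predictions=True, both returned objects hold the same table of test-set scores, so neither table shows performance on the training data. Seven of the models shown reach perfect accuracy and F1 score on the test set. Among them, the 3 fastest are **RidgeClassifierCV, RidgeClassifier, and ExtraTreesClassifier**, which makes them the top 3 models across accuracy, F1 score, and time taken.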
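The ranking can be reproduced from the scores with a sort. The sketch below rebuilds the table by hand from the values shown above (in the notebook, the `models` DataFrame could be sorted directly) and orders it by accuracy, then F1 score, then time taken.

```python
import pandas as pd

# Scores copied from the output table above
scores = pd.DataFrame(
    {
        "Accuracy": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.96, 0.96, 0.96],
        "F1 Score": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.96, 0.96, 0.96],
        "Time Taken": [14.78, 1.43, 2.96, 1.42, 1.38, 152.13, 4.21, 24.12, 8.23, 4.18],
    },
    index=[
        "BaggingClassifier", "ExtraTreesClassifier", "RandomForestClassifier",
        "RidgeClassifier", "RidgeClassifierCV", "LGBMClassifier",
        "LinearDiscriminantAnalysis", "CalibratedClassifierCV", "LinearSVC",
        "PassiveAggressiveClassifier",
    ],
)

# Highest accuracy and F1 first; ties broken by the shortest time
ranked = scores.sort_values(
    ["Accuracy", "F1 Score", "Time Taken"], ascending=[False, False, True]
)
top3 = ranked.head(3).index.tolist()

print(top3)
```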

Based on this, we will build out the top 3 models for this dataset and compare their predictions at the end. This dataset has ~54,000 genes for each sample, which can hide complex patterns. We need to find the model that best handles this much data and complexity.