<a href="https://colab.research.google.com/github/shreyadas-maple/Brain-cancer-gene-analysis/blob/main/Phase_3_ML_Classifier_Model_Building_Part_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Phase 3: ML Classifier Model Building Part 2 with RidgeClassifierCV
**Shreya Das**

In this part we build the classifier with our chosen ML model, RidgeClassifierCV.

## Building a RidgeClassifierCV ML model

First we install scikit-learn, the package that provides RidgeClassifierCV.

In [5]:
!pip install scikit-learn



## Importing the required Libraries

We need **pandas** to load the dataset and to separate the gene expression columns from the cancer type labels. The cancer categories are later encoded as integers (with scikit-learn's LabelEncoder), which gives the model and the reports a compact numeric label for each class.

The **RidgeClassifierCV** class is imported because this is the ML model we are going to use for this classification task.

**classification_report** is used to generate a report that helps with tuning the hyperparameters of the RidgeClassifierCV model.


In [6]:
import pandas as pd
import numpy as np
from sklearn.linear_model import RidgeClassifierCV
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

Read in the data.

In [7]:
df = pd.read_csv("/content/Brain_GSE50161.csv")

In [8]:
df.head()

Unnamed: 0,samples,type,1007_s_at,1053_at,117_at,121_at,1255_g_at,1294_at,1316_at,1320_at,...,AFFX-r2-Ec-bioD-3_at,AFFX-r2-Ec-bioD-5_at,AFFX-r2-P1-cre-3_at,AFFX-r2-P1-cre-5_at,AFFX-ThrX-3_at,AFFX-ThrX-5_at,AFFX-ThrX-M_at,AFFX-TrpnX-3_at,AFFX-TrpnX-5_at,AFFX-TrpnX-M_at
0,834,ependymoma,12.49815,7.604868,6.880934,9.027128,4.176175,7.22492,6.085942,6.835999,...,9.979005,9.92647,12.719785,12.777792,5.403657,4.870548,4.04738,3.721936,4.516434,4.74994
1,835,ependymoma,13.067436,7.99809,7.209076,9.723322,4.826126,7.539381,6.250962,8.012549,...,11.924749,11.21593,13.605662,13.401342,5.224555,4.895315,3.786437,3.564481,4.430891,4.491416
2,836,ependymoma,13.068179,8.573674,8.647684,9.613002,4.396581,7.813101,6.007746,7.178156,...,12.154405,11.53246,13.764593,13.4778,5.303565,5.052184,4.005343,3.595382,4.563494,4.668827
3,837,ependymoma,12.45604,9.098977,6.628784,8.517677,4.154847,8.361843,6.596064,6.347285,...,11.969072,11.288801,13.600828,13.379029,4.953429,4.708371,3.892318,3.759429,4.748381,4.521275
4,838,ependymoma,12.699958,8.800721,11.556188,9.166309,4.165891,7.923826,6.212754,6.866387,...,11.411701,11.169317,13.751442,13.803646,4.892677,4.773806,3.796856,3.577544,4.504385,4.54145


In [9]:
# For the X variable we only want the gene expression values, not the sample IDs or cancer types
selection = ['samples', 'type']

# Drop the columns we don't want (samples and type); the labels go into Y
X = df.drop(selection, axis=1)
Y = df['type']

In [10]:
X.head()

Unnamed: 0,1007_s_at,1053_at,117_at,121_at,1255_g_at,1294_at,1316_at,1320_at,1405_i_at,1431_at,...,AFFX-r2-Ec-bioD-3_at,AFFX-r2-Ec-bioD-5_at,AFFX-r2-P1-cre-3_at,AFFX-r2-P1-cre-5_at,AFFX-ThrX-3_at,AFFX-ThrX-5_at,AFFX-ThrX-M_at,AFFX-TrpnX-3_at,AFFX-TrpnX-5_at,AFFX-TrpnX-M_at
0,12.49815,7.604868,6.880934,9.027128,4.176175,7.22492,6.085942,6.835999,5.898355,5.51341,...,9.979005,9.92647,12.719785,12.777792,5.403657,4.870548,4.04738,3.721936,4.516434,4.74994
1,13.067436,7.99809,7.209076,9.723322,4.826126,7.539381,6.250962,8.012549,5.453147,6.173106,...,11.924749,11.21593,13.605662,13.401342,5.224555,4.895315,3.786437,3.564481,4.430891,4.491416
2,13.068179,8.573674,8.647684,9.613002,4.396581,7.813101,6.007746,7.178156,8.400266,6.323471,...,12.154405,11.53246,13.764593,13.4778,5.303565,5.052184,4.005343,3.595382,4.563494,4.668827
3,12.45604,9.098977,6.628784,8.517677,4.154847,8.361843,6.596064,6.347285,4.90038,6.008684,...,11.969072,11.288801,13.600828,13.379029,4.953429,4.708371,3.892318,3.759429,4.748381,4.521275
4,12.699958,8.800721,11.556188,9.166309,4.165891,7.923826,6.212754,6.866387,5.405628,5.279579,...,11.411701,11.169317,13.751442,13.803646,4.892677,4.773806,3.796856,3.577544,4.504385,4.54145


Before splitting the data, we look at the cancer types in Y and encode them as integers.

In [11]:
Y.unique()

array(['ependymoma', 'glioblastoma', 'medulloblastoma', 'normal',
       'pilocytic_astrocytoma'], dtype=object)

In [12]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

Y_encoded = le.fit_transform(Y)

In [13]:
Y_encoded

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4])
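
To read the integer codes above, it helps to know which cancer type each one stands for. The sketch below rebuilds the encoder from the five class names printed by `Y.unique()` and shows the mapping in both directions. The variable names (`labels`, `encoder`, `mapping`) are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Class names copied from the Y.unique() output above
labels = np.array(['ependymoma', 'glioblastoma', 'medulloblastoma',
                   'normal', 'pilocytic_astrocytoma'])

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ is sorted alphabetically; the position of each name is its integer code
mapping = {name: int(code) for code, name in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform converts integer codes (e.g. predictions) back to readable names
decoded = encoder.inverse_transform(np.array([4, 0, 3]))
print(list(decoded))
```

In the notebook itself, `le.inverse_transform(predictions)` would serve the same purpose once predictions are available.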

In [112]:
# Split the data into training and testing sets.
# Holding out test data lets us check whether the model generalizes to unseen samples
# instead of just "memorizing" the training data.
X_train, X_test, y_train, y_test = train_test_split(X, Y_encoded, test_size=0.2, random_state=13)
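
With only 130 samples and unequal class sizes, a purely random split can leave a class with very few test samples (the classification reports further down show a single test sample for class 4). One possible alternative, not used in this notebook, is a stratified split via the `stratify` argument of `train_test_split`. The sketch below uses placeholder features and labels with the same class counts as `Y_encoded`; the names `X_demo`, `y` and `test_counts` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the same class counts as Y_encoded above (46, 34, 22, 13, 15)
y = np.repeat([0, 1, 2, 3, 4], [46, 34, 22, 13, 15])
# Placeholder features: only the labels matter for this illustration
X_demo = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the class proportions roughly equal in train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y, test_size=0.2, random_state=13, stratify=y
)

# Number of test samples per class
test_counts = np.bincount(y_te, minlength=5)
print(test_counts)
```

Switching the notebook to a stratified split would change which samples end up in the test set, so the reports below would no longer match exactly.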

We will now train the RidgeClassifierCV model on this data. Note that the hyperparameters require a little trial and error to figure out which combination works best. We will use a classification report to evaluate the performance of the RidgeClassifierCV model.

In [113]:
# Training the Ridge Classifier CV model on the X and Y training data
ridge_cc = RidgeClassifierCV(alphas=np.logspace(-5, 5, 20), cv=None)
ridge_cc.fit(X_train, y_train)
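
After fitting, it can be useful to check which regularization strength the cross-validation actually picked. The sketch below does this on synthetic stand-in data (an assumption, since the CSV is not available here) using the fitted `alpha_` attribute. With `cv=None`, scikit-learn uses its efficient leave-one-out cross-validation.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifierCV

# Synthetic stand-in data (assumption): 3 classes, 60 samples, 30 features,
# each class shifted on its own block of 10 features
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1, 2], 20)
X_demo = rng.normal(size=(60, 30))
for k in range(3):
    X_demo[y_demo == k, 10 * k:10 * (k + 1)] += 1.0

alphas = np.logspace(-5, 5, 20)
model = RidgeClassifierCV(alphas=alphas, cv=None)  # cv=None: efficient leave-one-out CV
model.fit(X_demo, y_demo)

# alpha_ holds the regularization strength selected from the candidate grid
print("selected alpha:", model.alpha_)
```

In the notebook, the equivalent check would be `ridge_cc.alpha_` after the fit above.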

In [114]:
# Using data the model has not seen (the testing data), check whether the model can accurately predict the
# cancer types.
predictions = ridge_cc.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00         8
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00         4
           3       1.00      1.00      1.00         4
           4       1.00      1.00      1.00         1

    accuracy                           1.00        26
   macro avg       1.00      1.00      1.00        26
weighted avg       1.00      1.00      1.00        26



In [115]:
ridge_cc.score(X_test, y_test)

1.0

Here we see that the accuracy is 1.0 on the 26 test samples. A perfect score on such a small test set should be treated with caution: the model may be overfitting (fitting the particular samples it was trained on rather than learning patterns that generalize), so its predictions on new data could be less reliable than this score suggests.
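
Because a single 26-sample test set gives a noisy estimate, one way to get a steadier picture is to repeat the train/test evaluation over several folds. The sketch below is a minimal illustration on synthetic stand-in data (an assumption), using `cross_val_score` with `StratifiedKFold`. It is not a step from the original notebook.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifierCV
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption): 3 classes, 90 samples, 30 features,
# each class shifted on its own block of 10 features
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1, 2], 30)
X_demo = rng.normal(size=(90, 30))
for k in range(3):
    X_demo[y_demo == k, 10 * k:10 * (k + 1)] += 1.0

model = RidgeClassifierCV(alphas=np.logspace(-5, 5, 20))
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=13)

# Each fold refits the model (including its alpha selection) on the other four folds
scores = cross_val_score(model, X_demo, y_demo, cv=folds, scoring="accuracy")
print("fold accuracies:", np.round(scores, 2))
print("mean:", scores.mean(), "std:", scores.std())
```

The spread across folds gives a rough sense of how much a single split's accuracy can vary.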

Let's see what happens when we use very high alpha values, a scoring metric, and a CV value.

In [116]:
# Training the Ridge Classifier CV model on the X and Y training data
ridge_cc = RidgeClassifierCV(alphas=[1e10, 1e11, 1e12], scoring = 'precision', cv = 2)
ridge_cc.fit(X_train, y_train)

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/sklearn/model_selection/_validation.py", line 949, in _score
    scores = scorer(estimator, X_test, y_test, **score_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_scorer.py", line 288, in __call__
    return self._score(partial(_cached_call, None), estimator, X, y_true, **_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_scorer.py", line 388, in _score
    return self._sign * self._score_func(y_true, y_pred, **scoring_kwargs)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sklearn/utils/_param_validation.py", line 189, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-pac

In [117]:
# Using data the model has not seen (the testing data), check whether the model can accurately predict the
# cancer types.
predictions = ridge_cc.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.31      1.00      0.47         8
           1       0.00      0.00      0.00         9
           2       0.00      0.00      0.00         4
           3       0.00      0.00      0.00         4
           4       0.00      0.00      0.00         1

    accuracy                           0.31        26
   macro avg       0.06      0.20      0.09        26
weighted avg       0.09      0.31      0.14        26





In [118]:
ridge_cc.score(X_test, y_test)

0.3076923076923077
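
A likely contributor to this result is the size of the alpha values. With extremely strong regularization the coefficients shrink toward zero, so the prediction is driven almost entirely by the intercepts and collapses to the most frequent class. The sketch below illustrates the effect on synthetic, imbalanced stand-in data (an assumption) with a plain `RidgeClassifier` and a fixed alpha.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier

# Synthetic stand-in data (assumption): 3 imbalanced classes (60, 30, 10 samples),
# each class shifted on its own block of 10 features
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1, 2], [60, 30, 10])
X_demo = rng.normal(size=(100, 30))
for k in range(3):
    X_demo[y_demo == k, 10 * k:10 * (k + 1)] += 2.0

results = {}
for alpha in (1.0, 1e12):
    clf = RidgeClassifier(alpha=alpha).fit(X_demo, y_demo)
    # Which classes does the model ever predict, and how large are its coefficients?
    results[alpha] = np.unique(clf.predict(X_demo))
    print(alpha, results[alpha], np.abs(clf.coef_).max())
```

With `alpha=1e12` the model in this sketch predicts only class 0, the majority class, which mirrors the pattern in the classification report above.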

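We see that the overall score of the model is very poor: it predicts class 0 for every test sample, which is why class 0 has a recall of 1.00 and every other class scores 0.00. This is likely due to the very high alpha values, which regularize the model so strongly that the features barely influence the prediction. The scoring metric is also a problem: the 'precision' scorer uses binary averaging by default, which is not defined for five classes, and the traceback above appears to come from that scoring step. The cv value provided may play a role as well.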
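
On the scoring metric specifically: in scikit-learn, `precision_score` defaults to `average='binary'`, which raises an error for more than two classes, while averaged variants such as `'precision_macro'` are defined for multiclass problems. The sketch below shows both behaviours on a tiny hand-made example, then uses the multiclass scorer to select alpha on synthetic stand-in data (an assumption). The alpha grid `[0.1, 1.0, 10.0]` is illustrative only.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifierCV
from sklearn.metrics import precision_score

y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 2, 1, 1, 0])

# The default average='binary' is only defined for two classes
try:
    precision_score(y_true, y_pred)
    binary_failed = False
except ValueError as err:
    binary_failed = True
    print("binary precision failed:", err)

# Macro averaging: precision per class, then the unweighted mean
macro = precision_score(y_true, y_pred, average="macro")
print("macro precision:", macro)

# Synthetic stand-in data (assumption): 3 classes, 90 samples, 30 features
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1, 2], 30)
X_demo = rng.normal(size=(90, 30))
for k in range(3):
    X_demo[y_demo == k, 10 * k:10 * (k + 1)] += 1.0

# A multiclass-aware scorer name lets the cross-validation score every fold
model = RidgeClassifierCV(alphas=[0.1, 1.0, 10.0], scoring="precision_macro", cv=3)
model.fit(X_demo, y_demo)
print("selected alpha:", model.alpha_)
```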

We want to aim for an overall score between 0.85 and 0.95. Let's do a bit of trial and error to see what works best.

In [131]:
# Training the Ridge Classifier CV model on the X and Y training data
ridge_cc = RidgeClassifierCV(alphas=np.logspace(-5, 5, 20), scoring = 'accuracy', cv = 6)
ridge_cc.fit(X_train, y_train)

In [132]:
# Using data the model has not seen (the testing data), check whether the model can accurately predict the
# cancer types.
predictions = ridge_cc.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00         8
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00         4
           3       1.00      1.00      1.00         4
           4       1.00      1.00      1.00         1

    accuracy                           1.00        26
   macro avg       1.00      1.00      1.00        26
weighted avg       1.00      1.00      1.00        26



In [129]:
ridge_cc.score(X_test, y_test)

1.0

We want the accuracy for these groups to be at least 0.85 and less than 0.95. Less than 0.85 means that the model may not be learning the training data well and thus is not predicting well for the test data. More than 0.95 may be a sign of overfitting; realistically, we are wary of an accuracy of exactly 1.0, because a model that is 100% accurate on such a small test set is unlikely to stay that accurate on new data. **Overfitting** is a huge problem in ML models: the model is fitted so closely to the training data that its prediction performance on new data drops.

We want to strike a good balance between achieving high accuracy and avoiding overfitting.
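
One common way to look for overfitting is to compare accuracy on the training data with accuracy on held-out data: a large gap suggests the model has memorized the training samples. The sketch below makes the gap visible with an extreme case, pure noise features and random labels (an assumption chosen for illustration), where nothing real can be learned.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Pure noise: labels are independent of the features, so nothing real can be learned
X_noise = rng.normal(size=(240, 500))
y_noise = rng.integers(0, 2, size=240)

# 40 training samples, 200 test samples, far more features than training samples
X_tr, X_te, y_tr, y_te = train_test_split(
    X_noise, y_noise, test_size=200, random_state=13
)

# Very weak regularization lets the model fit the training labels almost exactly
clf = RidgeClassifier(alpha=1e-3).fit(X_tr, y_tr)
train_acc = clf.score(X_tr, y_tr)
test_acc = clf.score(X_te, y_te)
print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```

In the notebook, comparing `ridge_cc.score(X_train, y_train)` with `ridge_cc.score(X_test, y_test)` would give the same kind of check on the real data.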