<img src='http://www-scf.usc.edu/~ghasemig/images/sharif.png' alt="SUT logo" width=200 height=200 align=left class="saturate" >

<br>
<font face="Times New Roman">
<div dir=ltr align=center>
<font color=0F5298 size=7>
    Introduction to Machine Learning <br>
<font color=2565AE size=5>
    Computer Engineering Department <br>
    Fall 2022<br>
<font color=3C99D size=5>
    Project <br>
<font color=696880 size=4>
    Project Team 
    
    
____


### Full Name : Mohammad Bagher Soltani, Masih Najafi
### Student Number : 98105813, ?
___

# Introduction

In this project, we are going to have a brief and elementary hands-on real-world project, predicting breast cancer survival using machine learning models with clinical data and gene expression profiles.

In [1]:
# imports
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer
import umap
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from tqdm.std import tqdm
from torch.optim import Adam

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
np.random.seed(42)

# Data Documentation

For this purpose, we will use "Breast Cancer Gene Expression Profiles (METABRIC)" data. 
The first 31 columns of data contain clinical information including death status.
The next columns of the data contain gene's related information which includes both gene expressions and mutation information. (gene's mutation info columns have been marked with "_mut" at the end of the names of the columns) 
For more information please read the [data documentation](https://www.kaggle.com/datasets/raghadalharbi/breast-cancer-gene-expression-profiles-metabric).

# Data Preparation (15 Points)

In this section you must first split data into three datasets:
<br>
1- clinical dataset
<br>
2- gene expressions dataset
<br>
3- gene mutation dataset. (We will not use this dataset in further steps of the project)

## Data Loading & Splitting

In [3]:
# TODO
df = pd.read_csv('METABRIC_RNA_Mutation.csv', low_memory=False)
df.head(5)

Unnamed: 0,patient_id,age_at_diagnosis,type_of_breast_surgery,cancer_type,cancer_type_detailed,cellularity,chemotherapy,pam50_+_claudin-low_subtype,cohort,er_status_measured_by_ihc,...,mtap_mut,ppp2cb_mut,smarcd1_mut,nras_mut,ndfip1_mut,hras_mut,prps2_mut,smarcb1_mut,stmn2_mut,siah1_mut
0,0,75.65,MASTECTOMY,Breast Cancer,Breast Invasive Ductal Carcinoma,,0,claudin-low,1.0,Positve,...,0,0,0,0,0,0,0,0,0,0
1,2,43.19,BREAST CONSERVING,Breast Cancer,Breast Invasive Ductal Carcinoma,High,0,LumA,1.0,Positve,...,0,0,0,0,0,0,0,0,0,0
2,5,48.87,MASTECTOMY,Breast Cancer,Breast Invasive Ductal Carcinoma,High,1,LumB,1.0,Positve,...,0,0,0,0,0,0,0,0,0,0
3,6,47.68,MASTECTOMY,Breast Cancer,Breast Mixed Ductal and Lobular Carcinoma,Moderate,1,LumB,1.0,Positve,...,0,0,0,0,0,0,0,0,0,0
4,8,76.97,MASTECTOMY,Breast Cancer,Breast Mixed Ductal and Lobular Carcinoma,High,1,LumB,1.0,Positve,...,0,0,0,0,0,0,0,0,0,0


In [4]:
# Get column names for clinical, gene expression and gene mutation datasets

columns = df.columns
clinical_columns = columns[:31]
clinical_data_columns = df.columns[:24].append(df.columns[25:30])
label_column = columns[24]
gene_columns = columns[31:]
gene_mut_columns = pd.Index(filter(lambda s: s.endswith('_mut'),columns))
gene_expr_columns = pd.Index(set(gene_columns) - set(gene_mut_columns))

print(f'Number of clinical columns {len(clinical_columns)}')
print(f'Number of gene expression columns {len(gene_expr_columns)}')
print(f'Number of gene mutation columns {len(gene_mut_columns)}')

Number of clinical columns 31
Number of gene expression columns 489
Number of gene mutation columns 173


In [5]:
clinical_dataset = df[clinical_columns]
gene_expr_dataset = df[gene_expr_columns]
gene_mut_dataset = df[gene_mut_columns]

## EDA

For each dataset, you must perform a sufficient EDA.

In [6]:
clinical_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1904 entries, 0 to 1903
Data columns (total 31 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   patient_id                      1904 non-null   int64  
 1   age_at_diagnosis                1904 non-null   float64
 2   type_of_breast_surgery          1882 non-null   object 
 3   cancer_type                     1904 non-null   object 
 4   cancer_type_detailed            1889 non-null   object 
 5   cellularity                     1850 non-null   object 
 6   chemotherapy                    1904 non-null   int64  
 7   pam50_+_claudin-low_subtype     1904 non-null   object 
 8   cohort                          1904 non-null   float64
 9   er_status_measured_by_ihc       1874 non-null   object 
 10  er_status                       1904 non-null   object 
 11  neoplasm_histologic_grade       1832 non-null   float64
 12  her2_status_measured_by_snp6    19

In [7]:
clinical_dataset.describe()

Unnamed: 0,patient_id,age_at_diagnosis,chemotherapy,cohort,neoplasm_histologic_grade,hormone_therapy,lymph_nodes_examined_positive,mutation_count,nottingham_prognostic_index,overall_survival_months,overall_survival,radio_therapy,tumor_size,tumor_stage
count,1904.0,1904.0,1904.0,1904.0,1832.0,1904.0,1904.0,1859.0,1904.0,1904.0,1904.0,1904.0,1884.0,1403.0
mean,3921.982143,61.087054,0.207983,2.643908,2.415939,0.616597,2.002101,5.697687,4.033019,125.121324,0.420693,0.597164,26.238726,1.750535
std,2358.478332,12.978711,0.405971,1.228615,0.650612,0.486343,4.079993,4.058778,1.144492,76.334148,0.4938,0.490597,15.160976,0.628999
min,0.0,21.93,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
25%,896.5,51.375,0.0,1.0,2.0,0.0,0.0,3.0,3.046,60.825,0.0,0.0,17.0,1.0
50%,4730.5,61.77,0.0,3.0,3.0,1.0,0.0,5.0,4.042,115.616667,0.0,1.0,23.0,2.0
75%,5536.25,70.5925,0.0,3.0,3.0,1.0,2.0,7.0,5.04025,184.716667,1.0,1.0,30.0,2.0
max,7299.0,96.29,1.0,5.0,3.0,1.0,45.0,80.0,6.36,355.2,1.0,1.0,182.0,4.0


In [8]:
gene_expr_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1904 entries, 0 to 1903
Columns: 489 entries, cdk6 to nfkb1
dtypes: float64(489)
memory usage: 7.1 MB


In [9]:
gene_expr_dataset.describe()

Unnamed: 0,cdk6,map3k3,cyp11a1,fgf2,dtx2,srd5a2,inha,casp9,rassf1,hsd17b10,...,hsd17b4,map3k13,bmp2,e2f5,tubb1,nr3c1,mmp12,pik3r2,hdac1,nfkb1
count,1904.0,1904.0,1904.0,1904.0,1904.0,1904.0,1904.0,1904.0,1904.0,1904.0,...,1904.0,1904.0,1904.0,1904.0,1904.0,1904.0,1904.0,1904.0,1904.0,1904.0
mean,6.302521e-07,-7.352941e-07,6.302521e-07,1e-06,-1e-06,-3.676471e-07,-6.302521e-07,-1.57563e-07,-3.151261e-07,-4.201681e-07,...,-1e-06,2.10084e-07,6.302521e-07,6.302521e-07,-1e-06,3.151261e-07,-8.928571e-07,5.777311e-07,2e-06,-2e-06
std,1.000263,1.000263,1.000263,1.000262,1.000262,1.000262,1.000263,1.000264,1.000264,1.000263,...,1.000263,1.000262,1.000262,1.000262,1.000263,1.000262,1.000263,1.000262,1.000263,1.000265
min,-2.2784,-3.0748,-2.4985,-2.1487,-4.5026,-3.3648,-2.0469,-3.5596,-2.7567,-3.7902,...,-4.9768,-2.6034,-1.5713,-2.8457,-2.7443,-2.5157,-1.0982,-4.4302,-5.9821,-4.5635
25%,-0.668075,-0.66675,-0.6545,-0.590625,-0.6833,-0.610475,-0.54925,-0.66415,-0.69835,-0.67235,...,-0.589925,-0.6962,-0.52015,-0.652575,-0.61225,-0.657325,-0.613125,-0.651325,-0.6242,-0.645775
50%,-0.1958,-0.06785,-0.16315,-0.20405,-0.097,-0.0469,-0.1259,0.0055,-0.077,-0.0495,...,0.0051,-0.1041,-0.20135,-0.12385,-0.0759,-0.04655,-0.4098,-0.0263,0.00275,0.03955
75%,0.4241,0.5865,0.4548,0.262825,0.5558,0.51445,0.307325,0.632575,0.62,0.58635,...,0.633275,0.53835,0.231425,0.54525,0.513975,0.648075,0.2039,0.6155,0.61375,0.655225
max,7.3807,5.4806,6.8484,8.2865,4.0397,10.2703,14.4243,4.1732,4.5171,5.0452,...,4.2983,5.4135,11.5429,5.1911,8.4721,6.1224,5.9782,4.3477,4.1961,3.8213


In [10]:
gene_mut_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1904 entries, 0 to 1903
Columns: 173 entries, pik3ca_mut to siah1_mut
dtypes: object(173)
memory usage: 2.5+ MB


In [11]:
gene_mut_dataset.describe()

Unnamed: 0,pik3ca_mut,tp53_mut,muc16_mut,ahnak2_mut,kmt2c_mut,syne1_mut,gata3_mut,map3k1_mut,ahnak_mut,dnah11_mut,...,mtap_mut,ppp2cb_mut,smarcd1_mut,nras_mut,ndfip1_mut,hras_mut,prps2_mut,smarcb1_mut,stmn2_mut,siah1_mut
count,1904,1904,1904,1904,1904,1904,1904,1904,1904,1904,...,1904,1904,1904,1904,1904,1904,1904,1904,1904,1904
unique,160,343,298,248,222,200,128,194,153,154,...,5,5,5,4,4,3,3,3,3,2
top,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
freq,1109,1245,1578,1593,1670,1672,1674,1706,1728,1729,...,1900,1900,1900,1901,1901,1902,1902,1902,1902,1903


In [12]:
# clean data means data with no NaN value in any column
def clean_stats(ds):
    return '''clean data: {0}'''.format(ds.shape[0] - ds.isnull().any(axis=1).sum())

print(f'Clinical dataset {clean_stats(clinical_dataset)}')
print(f'Gene expression dataset {clean_stats(gene_expr_dataset)}')
print(f'Gene mutation dataset {clean_stats(gene_mut_dataset)}')

Clinical dataset clean data: 1092
Gene expression dataset clean data: 1904
Gene mutation dataset clean data: 1904


In [13]:
clinical_dataset.dtypes

patient_id                          int64
age_at_diagnosis                  float64
type_of_breast_surgery             object
cancer_type                        object
cancer_type_detailed               object
cellularity                        object
chemotherapy                        int64
pam50_+_claudin-low_subtype        object
cohort                            float64
er_status_measured_by_ihc          object
er_status                          object
neoplasm_histologic_grade         float64
her2_status_measured_by_snp6       object
her2_status                        object
tumor_other_histologic_subtype     object
hormone_therapy                     int64
inferred_menopausal_state          object
integrative_cluster                object
primary_tumor_laterality           object
lymph_nodes_examined_positive     float64
mutation_count                    float64
nottingham_prognostic_index       float64
oncotree_code                      object
overall_survival_months           

In [14]:
def dtype_stats(ds):
    return '''
    columns: {0}, object columns: {1}, int columns: {2}, float columns: {3}
    '''.format(len(ds.columns),
               (ds.dtypes == object).sum(),
               (ds.dtypes == int).sum(),
               (ds.dtypes == float).sum())

print(f'Clinical dataset: {dtype_stats(clinical_dataset)}')
print(f'Gene expression dataset: {dtype_stats(gene_expr_dataset)}')
print(f'Gene mutation dataset: {dtype_stats(gene_mut_dataset)}')

Clinical dataset: 
    columns: 31, object columns: 17, int columns: 5, float columns: 9
    
Gene expression dataset: 
    columns: 489, object columns: 0, int columns: 0, float columns: 489
    
Gene mutation dataset: 
    columns: 173, object columns: 173, int columns: 0, float columns: 0
    


In [15]:
# check if int data needs scaling
clinical_dataset[clinical_columns[clinical_dataset.dtypes == int]].head(5)

Unnamed: 0,patient_id,chemotherapy,hormone_therapy,overall_survival,radio_therapy
0,0,0,1,1,1
1,2,0,1,1,1
2,5,1,1,0,0
3,6,1,1,1,1
4,8,1,1,0,1


In [16]:
# Perform scaling for float data
def scale(scaler, dataset):
    scaled = scaler.fit_transform(dataset)
    scaled_df = pd.DataFrame(scaled)
    scaled_df.columns = dataset.columns
    scaled_df.index = dataset.index
    return scaled_df

scaler = StandardScaler()
clinical_float_columns = list(clinical_columns[clinical_dataset.dtypes == float])
scaled_clinical = scaler.fit_transform(clinical_dataset[clinical_float_columns])
for i, index in enumerate(clinical_dataset.index):
    for j, column in enumerate(clinical_float_columns):
        clinical_dataset.loc[index, column] = scaled_clinical[i, j]

scaled = scaler.fit_transform(gene_expr_dataset)
scaled_df = pd.DataFrame(scaled)
scaled_df.columns = gene_expr_dataset.columns
scaled_df.index = gene_expr_dataset.index
gene_expr_dataset = scaled_df

In [17]:
clinical_dataset.head(5)

Unnamed: 0,patient_id,age_at_diagnosis,type_of_breast_surgery,cancer_type,cancer_type_detailed,cellularity,chemotherapy,pam50_+_claudin-low_subtype,cohort,er_status_measured_by_ihc,...,nottingham_prognostic_index,oncotree_code,overall_survival_months,overall_survival,pr_status,radio_therapy,3-gene_classifier_subtype,tumor_size,tumor_stage,death_from_cancer
0,0,1.122359,MASTECTOMY,Breast Cancer,Breast Invasive Ductal Carcinoma,,0,claudin-low,-1.338368,Positve,...,1.757557,IDC,0.201518,1,Negative,1,ER-/HER2-,-0.279656,0.396748,Living
1,2,-1.379317,BREAST CONSERVING,Breast Cancer,Breast Invasive Ductal Carcinoma,High,0,LumA,-1.338368,Positve,...,-0.011378,IDC,-0.530544,1,Positive,1,ER+/HER2- High Prolif,-1.071371,-1.193646,Living
2,5,-0.941562,MASTECTOMY,Breast Cancer,Breast Invasive Ductal Carcinoma,High,1,LumB,-1.338368,Positve,...,-0.002638,IDC,0.505525,0,Positive,0,,-0.74149,0.396748,Died of Disease
3,6,-1.033275,MASTECTOMY,Breast Cancer,Breast Mixed Ductal and Lobular Carcinoma,Moderate,1,LumB,-1.338368,Positve,...,0.014841,MDLC,0.521686,1,Positive,1,,-0.081727,0.396748,Living
4,8,1.224091,MASTECTOMY,Breast Cancer,Breast Mixed Ductal and Lobular Carcinoma,High,1,LumB,-1.338368,Positve,...,1.789021,MDLC,-1.097499,0,Positive,1,ER+/HER2- High Prolif,0.907918,0.396748,Died of Disease


In [18]:
gene_expr_dataset.head(5)

Unnamed: 0,cdk6,map3k3,cyp11a1,fgf2,dtx2,srd5a2,inha,casp9,rassf1,hsd17b10,...,hsd17b4,map3k13,bmp2,e2f5,tubb1,nr3c1,mmp12,pik3r2,hdac1,nfkb1
0,0.074399,0.9728,2.331099,2.4435,-1.8994,-0.0194,0.088301,-0.433199,1.416699,-0.0487,...,0.265001,-0.4401,4.774103,-0.090701,0.702901,1.5544,-0.529999,-3.162602,-1.682202,-0.883796
1,-0.604,-1.605598,-0.205301,-1.154802,-0.177499,0.453401,-0.279699,-0.448399,1.589099,-0.0693,...,-1.269299,-0.183,-1.138502,0.051899,0.897201,0.1142,-0.484499,0.2887,1.626699,-0.847696
2,0.258499,-2.254498,-0.7445,0.505899,0.231201,0.0668,-0.453899,-1.252699,1.283999,0.0082,...,0.111701,-0.4972,-0.786101,0.2885,0.315701,-0.421901,-0.024899,-0.990201,-1.810302,-1.245696
3,0.368899,-1.344399,-0.5875,0.040099,-0.276499,-0.7078,1.9837,0.6762,-0.253399,0.096,...,-0.658599,-1.274801,-0.600001,1.8323,8.472103,2.198501,-0.226099,-1.375201,-0.341202,-2.042094
4,0.889199,-0.284199,-1.1262,-0.044601,-0.752499,-0.3544,-0.170199,-0.3756,1.104499,0.2796,...,-0.705099,-0.4843,-0.526201,2.2129,1.018801,-0.268,-0.582599,-0.283101,0.481598,-1.387995


In [19]:
# Convert categorical data to numerical data for clinical dataset
ordinal_encoder = OrdinalEncoder()
encoded = ordinal_encoder.fit_transform(clinical_dataset)
encoded_df = pd.DataFrame(encoded)
encoded_df.columns = clinical_dataset.columns
encoded_df.index = clinical_dataset.index
clinical_dataset = encoded_df

print(f'Clinical dataset {dtype_stats(clinical_dataset)}')

Clinical dataset 
    columns: 31, object columns: 0, int columns: 0, float columns: 31
    


In [20]:
# Perform data imputation for clinical dataset
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
imputed_data = imputer.fit_transform(clinical_dataset)
imputed_df = pd.DataFrame(imputed_data)
imputed_df.columns = clinical_dataset.columns
imputed_df.index = clinical_dataset.index
clinical_dataset = imputed_df

print(f'Imputed clinical dataset: {clean_stats(clinical_dataset)}')

Imputed clinical dataset: clean data: 1904


In [21]:
clinical_dataset.head(5)

Unnamed: 0,patient_id,age_at_diagnosis,type_of_breast_surgery,cancer_type,cancer_type_detailed,cellularity,chemotherapy,pam50_+_claudin-low_subtype,cohort,er_status_measured_by_ihc,...,nottingham_prognostic_index,oncotree_code,overall_survival_months,overall_survival,pr_status,radio_therapy,3-gene_classifier_subtype,tumor_size,tumor_stage,death_from_cancer
0,0.0,1341.0,1.0,0.0,1.0,0.0,0.0,6.0,0.0,1.0,...,268.0,1.0,998.0,1.0,0.0,1.0,2.0,46.0,2.0,2.0
1,1.0,173.0,0.0,0.0,1.0,0.0,0.0,2.0,0.0,1.0,...,126.0,1.0,585.0,1.0,1.0,1.0,0.0,11.0,1.0,2.0
2,2.0,328.0,1.0,0.0,1.0,0.0,1.0,3.0,0.0,1.0,...,134.0,1.0,1129.0,0.0,1.0,0.0,1.0,23.0,2.0,0.0
3,3.0,293.0,1.0,0.0,4.0,2.0,1.0,3.0,0.0,1.0,...,152.0,5.0,1140.0,1.0,1.0,1.0,1.0,53.0,2.0,2.0
4,4.0,1386.0,1.0,0.0,4.0,0.0,1.0,3.0,0.0,1.0,...,286.0,5.0,264.0,0.0,1.0,1.0,0.0,72.0,2.0,0.0


In [22]:
# define data and labels for each dataset

labels = clinical_dataset[label_column].to_numpy()
clinical_data = clinical_dataset[clinical_data_columns].to_numpy()
gene_expr_data = gene_expr_dataset.to_numpy()

_clinical_train_X, _clinical_test_X, _clinical_train_y, _clinical_test_y = train_test_split(clinical_data, labels, test_size=0.17, random_state=42)
_gene_expr_train_X, _gene_expr_test_X, _gene_expr_train_y, _gene_expr_test_y = train_test_split(gene_expr_data, labels, test_size=0.17, random_state=42)

dataset = {
    'clinical':{
        'X_train': _clinical_train_X,
        'X_test': _clinical_test_X,
        'y_train': _clinical_train_y,
        'y_test': _clinical_test_y
    },
    'gene_expr':{
        'X_train': _gene_expr_train_X,
        'X_test': _gene_expr_test_X,
        'y_train': _gene_expr_train_y,
        'y_test': _gene_expr_test_y
    },
    'gene_expr_reduced':{
    }
}

## Dimension Reduction (20 + Up to 10 Points Optional)

For each dataset, investigate whether it is needed to use a dimensionality reduction approach or not. If yes, please reduce the dataset's dimension. You can use UMAP for this purpose but any other approach is acceptable. Finding the most important features contains extra points.

<span style="color:orange">
    we check if dimensionality reduction is needed by using a simple linear regression model as a baseline .
</span>



In [23]:
# predict for the clinical dataset using linear regression
_clf = LinearRegression()
_clf.fit(dataset['clinical']['X_train'], dataset['clinical']['y_train'])
_clinical_baseline_pred = np.round(_clf.predict(dataset['clinical']['X_test']))
_clinical_baseline_accuracy = accuracy_score(dataset['clinical']['y_test'], _clinical_baseline_pred)

# predict for the gene expression dataset using linear regression
_clf = LinearRegression()
_clf.fit(dataset['gene_expr']['X_train'], dataset['gene_expr']['y_train'])
_gene_expr_baseline_pred = np.round(_clf.predict(dataset['gene_expr']['X_test']))
_gene_expr_baseline_accuracy = accuracy_score(dataset['gene_expr']['y_test'], _gene_expr_baseline_pred)

print(f'Accuracy of simple linear regression model on clinical data: {_clinical_baseline_accuracy:.3f}')
print(f'Accuracy of simple linear regression model on gene expression data: {_gene_expr_baseline_accuracy:.3f}')

Accuracy of simple linear regression model on clinical data: 0.741
Accuracy of simple linear regression model on gene expression data: 0.583


<span style="color:orange">
    As we can see, the results are much better for the clinical dataset which has few dimensions, but not so much for the gene expession dataset.
    Therefore, we will only reduce the dimensions for gene expression dataset.
</span>



In [24]:
# reduce the dimensions for clinical data and predict using baseline model
_reducer = umap.UMAP(n_components=10)
_reducer.fit(clinical_data)
_reduced_X_train = _reducer.transform(dataset['clinical']['X_train'])
_reduced_X_test = _reducer.transform(dataset['clinical']['X_test'])

_clf = LinearRegression()
_clf.fit(_reduced_X_train, dataset['clinical']['y_train'])
_clinical_reduced_baseline_pred = np.round(_clf.predict(_reduced_X_test))
_clinical_reduced_baseline_accuracy = accuracy_score(dataset['clinical']['y_test'], _clinical_reduced_baseline_pred)

# reduce the dimensions for gene expression data and predict using baseline model
_reducer = umap.UMAP(n_components=20)
_reducer.fit(gene_expr_data)
_reduced_X_train = _reducer.transform(dataset['gene_expr']['X_train'])
_reduced_X_test = _reducer.transform(dataset['gene_expr']['X_test'])

_clf = LinearRegression()
_clf.fit(_reduced_X_train, dataset['gene_expr']['y_train'])
_gene_expr_reduced_baseline_pred = np.round(_clf.predict(_reduced_X_test))
_gene_expr_reduced_baseline_accuracy = accuracy_score(dataset['gene_expr']['y_test'], _gene_expr_reduced_baseline_pred)

print(f'Accuracy of simple linear regression model on reduced clinical data: {_clinical_reduced_baseline_accuracy:.3f}')
print(f'Accuracy of simple linear regression model on reduced gene expression data: {_gene_expr_reduced_baseline_accuracy:.3f}')

Accuracy of simple linear regression model on reduced clinical data: 0.698
Accuracy of simple linear regression model on reduced gene expression data: 0.630


<span style="color:orange">
    As we can see, applying dimension reduction on the clinical dataset leads to worse results, while on gene expression dataset improves the predictions.
    Therefore, we choose to reduce the dimensions of only the gene expression dataset. 
</span>



In [25]:
dataset['gene_expr_reduced'] = {
    'X_train': _reduced_X_train,
    'X_test': _reduced_X_test,
    'y_train': _gene_expr_train_y,
    'y_test': _gene_expr_test_y
}

# Classic Model (25 Points)

In this section, you must implement a classic classification model for clinical, gene expressions, and reduced gene expressions datasets. Using Random Forest is suggested. (minimum acceptable accuracy = 60%)

In [26]:
random_forst_models = {
    'clinical': None,
    'gene_expr': None,
    'gene_expr_reduced': None
}

for ds_name in random_forst_models:
    clf = RandomForestClassifier()
    ds = dataset[ds_name]
    clf.fit(ds['X_train'], ds['y_train'])
    y_pred = clf.predict(ds['X_test'])
    acc = accuracy_score(ds['y_test'], y_pred)
    random_forst_models[ds_name] = {
        'model': clf,
        'accuracy': acc
    }

    print(f'random forest on {ds_name} dataset had accuracy of {acc:.4f}')

svm_models = random_forst_models.copy()

for ds_name in random_forst_models:
    clf = RandomForestClassifier()
    ds = dataset[ds_name]
    clf.fit(ds['X_train'], ds['y_train'])
    y_pred = clf.predict(ds['X_test'])
    acc = accuracy_score(ds['y_test'], y_pred)
    random_forst_models[ds_name] = {
        'model': clf,
        'accuracy': acc
    }

    print(f'svm on {ds_name} dataset had accuracy of {acc:.4f}')



random forest on clinical dataset had accuracy of 0.7716
random forest on gene_expr dataset had accuracy of 0.6389
random forest on gene_expr_reduced dataset had accuracy of 0.6265
svm on clinical dataset had accuracy of 0.7685
svm on gene_expr dataset had accuracy of 0.6265
svm on gene_expr_reduced dataset had accuracy of 0.5895


# Neural Network (30 Points)

In this section, you must implement a neural network model for clinical, gene expressions and reduced gene expressions datasets. Using the MPL models is suggested. (minimum acceptable accuracy = 60%)

In [27]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
device

device(type='cpu')

In [28]:
class CancerDataset(Dataset):

    def __init__(self, X, y) -> None:
        super().__init__()
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.X)
    
    def __getitem__(self, idx):
        return self.X[idx].astype(np.float32), self.y[idx].astype(np.float32)


In [29]:
dataloaders = {}
batch_size = 64

for ds_name, ds_split in dataset.items():
    dataloaders[ds_name] = {}
    X_train = ds_split['X_train']
    X_test = ds_split['X_test']
    y_train = ds_split['y_train']
    y_test = ds_split['y_test']
    
    train_ds = CancerDataset(X_train, y_train)
    test_ds = CancerDataset(X_test, y_test)

    train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
    test_dl = DataLoader(test_ds, batch_size=batch_size, shuffle=True)

    dataloaders[ds_name]['train'] = train_dl
    dataloaders[ds_name]['test'] = test_dl

In [30]:
def train(model, criterion, optimizer, dataloader, num_epochs):
    model.train()
    for epoch in range(num_epochs):
        total, correct = 0, 0
        train_loss = 0.0
        with tqdm(enumerate(dataloader), total=len(dataloader)) as pbar:
            for _, (data, labels) in pbar:
                optimizer.zero_grad()
                data, labels = data.to(device), labels.to(device)

                pred = model(data).squeeze()
                out = pred.round()

                total = total + len(data)
                correct = correct + (labels == out).sum()
                
                loss = criterion(pred, out)
                train_loss = train_loss + loss.detach()
                loss.backward()

                optimizer.step()

                pbar.set_description('Epoch {0}: train loss={1}, train accuracy={2}'.format(epoch, train_loss / total,
                                                                                            accuracy_score(labels.detach().numpy(), out.detach().numpy())))
    
    return model

In [31]:
def evaluate(model, dataloader):
    model.eval()
    total, correct = 0, 0
    with torch.no_grad():
        for _, (data, labels) in enumerate(dataloader):
            data, labels = data.to(device), labels.to(device)

            pred = model(data).squeeze()
            out = torch.round(pred)

            correct = correct + (labels == out).sum()
            total = total + len(data)
    
    return correct / total

In [32]:
mlp_models = random_forst_models.copy()
lr = 1e-4
num_epochs = 10

for ds_name in mlp_models:
    net = nn.Sequential(
        nn.Linear(dataset[ds_name]['X_train'].shape[1], 64),
        nn.ReLU(),
        nn.Linear(64, 32),
        nn.ReLU(),
        nn.Linear(32, 16),
        nn.ReLU(),
        nn.Linear(16, 1),
        nn.Sigmoid()
    )
    net = net.to(device)
    optimizer = Adam(net.parameters(), lr=lr)
    criterion = nn.BCELoss()
    train_dl = dataloaders[ds_name]['train']
    test_dl = dataloaders[ds_name]['test']
    net = train(net, criterion, optimizer, train_dl, num_epochs)
    
    # test
    acc = evaluate(net, test_dl)
    mlp_models[ds_name] = {
        'model': net,
        'accuracy': acc
    }

    print(f'mlp accuracy on {ds_name} dataset had accuracy of {acc:.4f}')

Epoch 0: train loss=0.0006668840069323778, train accuracy=0.4318181818181818: 100%|██████████| 25/25 [00:00<00:00, 77.49it/s]
Epoch 1: train loss=0.0004343927139416337, train accuracy=0.5: 100%|██████████| 25/25 [00:00<00:00, 63.19it/s]      
Epoch 2: train loss=0.00036117422860115767, train accuracy=0.5227272727272727: 100%|██████████| 25/25 [00:00<00:00, 61.79it/s]
Epoch 3: train loss=0.000318124977638945, train accuracy=0.38636363636363635: 100%|██████████| 25/25 [00:00<00:00, 98.06it/s]
Epoch 4: train loss=0.0002822641108650714, train accuracy=0.5454545454545454: 100%|██████████| 25/25 [00:00<00:00, 127.97it/s]
Epoch 5: train loss=0.0003013236855622381, train accuracy=0.5454545454545454: 100%|██████████| 25/25 [00:00<00:00, 159.39it/s]
Epoch 6: train loss=0.0002810160513035953, train accuracy=0.38636363636363635: 100%|██████████| 25/25 [00:00<00:00, 98.23it/s]
Epoch 7: train loss=0.00030327821150422096, train accuracy=0.6818181818181818: 100%|██████████| 25/25 [00:00<00:00, 158.48i

mlp accuracy on clinical dataset had accuracy of 0.4722


Epoch 0: train loss=0.009958168491721153, train accuracy=0.3409090909090909: 100%|██████████| 25/25 [00:00<00:00, 156.02it/s]
Epoch 1: train loss=0.00962876994162798, train accuracy=0.38636363636363635: 100%|██████████| 25/25 [00:00<00:00, 161.11it/s]
Epoch 2: train loss=0.009201196022331715, train accuracy=0.4318181818181818: 100%|██████████| 25/25 [00:00<00:00, 140.20it/s]
Epoch 3: train loss=0.008616495877504349, train accuracy=0.3409090909090909: 100%|██████████| 25/25 [00:00<00:00, 165.75it/s]
Epoch 4: train loss=0.0078555503860116, train accuracy=0.29545454545454547: 100%|██████████| 25/25 [00:00<00:00, 162.31it/s]
Epoch 5: train loss=0.006951160728931427, train accuracy=0.45454545454545453: 100%|██████████| 25/25 [00:00<00:00, 171.76it/s]
Epoch 6: train loss=0.005925280041992664, train accuracy=0.45454545454545453: 100%|██████████| 25/25 [00:00<00:00, 155.03it/s]
Epoch 7: train loss=0.0048456136137247086, train accuracy=0.4090909090909091: 100%|██████████| 25/25 [00:00<00:00, 12

mlp accuracy on gene_expr dataset had accuracy of 0.3981


Epoch 0: train loss=0.009356952272355556, train accuracy=0.3409090909090909: 100%|██████████| 25/25 [00:00<00:00, 118.68it/s]
Epoch 1: train loss=0.008608638308942318, train accuracy=0.4318181818181818: 100%|██████████| 25/25 [00:00<00:00, 178.67it/s]
Epoch 2: train loss=0.008162664249539375, train accuracy=0.38636363636363635: 100%|██████████| 25/25 [00:00<00:00, 190.97it/s]
Epoch 3: train loss=0.007615293841809034, train accuracy=0.3409090909090909: 100%|██████████| 25/25 [00:00<00:00, 138.80it/s]
Epoch 4: train loss=0.006999541074037552, train accuracy=0.4090909090909091: 100%|██████████| 25/25 [00:00<00:00, 136.05it/s]
Epoch 5: train loss=0.0063178339041769505, train accuracy=0.5: 100%|██████████| 25/25 [00:00<00:00, 186.97it/s]    
Epoch 6: train loss=0.0055382391437888145, train accuracy=0.4772727272727273: 100%|██████████| 25/25 [00:00<00:00, 170.59it/s]
Epoch 7: train loss=0.004689392168074846, train accuracy=0.5: 100%|██████████| 25/25 [00:00<00:00, 122.50it/s]    
Epoch 8: tr

mlp accuracy on gene_expr_reduced dataset had accuracy of 0.3981


# Model Comparison (10 Points)

Compare different models and different datasets (clinical, gene expressions, and gene reduced expressions) and try to explain their differences.

#### \# TODO