<img src='http://www-scf.usc.edu/~ghasemig/images/sharif.png' alt="SUT logo" width=200 height=200 align=left class="saturate" >

<br>
<font face="Times New Roman">
<div dir=ltr align=center>
<font color=0F5298 size=7>
    Introduction to Machine Learning <br>
<font color=2565AE size=5>
    Computer Engineering Department <br>
    Fall 2022<br>
<font color=3C99D size=5>
    Project <br>
<font color=696880 size=4>
    Project Team 
    
    
____


### Full Name : Mohammad Bagher Soltani, Masih Najafi
### Student Number : 98105813, ?
___

# Introduction

In this project, we work through a brief, introductory hands-on task on real-world data: predicting breast cancer survival with machine learning models trained on clinical data and gene expression profiles.

In [458]:
# imports
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer
import umap
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Data Documentation

For this purpose, we will use the "Breast Cancer Gene Expression Profiles (METABRIC)" dataset. 
The first 31 columns contain clinical information, including death status.
The remaining columns contain gene-related information: gene expression values and mutation information (mutation columns are marked with a "_mut" suffix in their names). 
For more information, please read the [data documentation](https://www.kaggle.com/datasets/raghadalharbi/breast-cancer-gene-expression-profiles-metabric).

# Data Preparation (15 Points)

In this section, you must first split the data into three datasets:
<br>
1- clinical dataset
<br>
2- gene expressions dataset
<br>
3- gene mutation dataset. (We will not use this dataset in further steps of the project)

## Data Loading & Splitting

In [459]:
# TODO
df = pd.read_csv('METABRIC_RNA_Mutation.csv', low_memory=False)
df.head(5)

Unnamed: 0,patient_id,age_at_diagnosis,type_of_breast_surgery,cancer_type,cancer_type_detailed,cellularity,chemotherapy,pam50_+_claudin-low_subtype,cohort,er_status_measured_by_ihc,...,mtap_mut,ppp2cb_mut,smarcd1_mut,nras_mut,ndfip1_mut,hras_mut,prps2_mut,smarcb1_mut,stmn2_mut,siah1_mut
0,0,75.65,MASTECTOMY,Breast Cancer,Breast Invasive Ductal Carcinoma,,0,claudin-low,1.0,Positve,...,0,0,0,0,0,0,0,0,0,0
1,2,43.19,BREAST CONSERVING,Breast Cancer,Breast Invasive Ductal Carcinoma,High,0,LumA,1.0,Positve,...,0,0,0,0,0,0,0,0,0,0
2,5,48.87,MASTECTOMY,Breast Cancer,Breast Invasive Ductal Carcinoma,High,1,LumB,1.0,Positve,...,0,0,0,0,0,0,0,0,0,0
3,6,47.68,MASTECTOMY,Breast Cancer,Breast Mixed Ductal and Lobular Carcinoma,Moderate,1,LumB,1.0,Positve,...,0,0,0,0,0,0,0,0,0,0
4,8,76.97,MASTECTOMY,Breast Cancer,Breast Mixed Ductal and Lobular Carcinoma,High,1,LumB,1.0,Positve,...,0,0,0,0,0,0,0,0,0,0


In [460]:
# Get column names for clinical, gene expression and gene mutation datasets

columns = df.columns
clinical_columns = columns[:31]
clinical_data_columns = df.columns[:24].append(df.columns[25:30])
label_column = columns[24]
gene_columns = columns[31:]
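gene_mut_columns = pd.Index([c for c in columns if c.endswith('_mut')])
# NOTE: a set difference does not preserve column order, so the order of
# gene_expr_columns may differ between runs
gene_expr_columns = pd.Index(set(gene_columns) - set(gene_mut_columns))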

print(f'Number of clinical columns {len(clinical_columns)}')
print(f'Number of gene expression columns {len(gene_expr_columns)}')
print(f'Number of gene mutation columns {len(gene_mut_columns)}')

Number of clinical columns 31
Number of gene expression columns 489
Number of gene mutation columns 173
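
Building `gene_expr_columns` from a Python set discards the original column order. A minimal, order-preserving sketch of the same split is shown here on a tiny made-up frame; the column names and the `n_clinical` value are illustrative assumptions, not the METABRIC layout.

```python
import pandas as pd

# Toy frame standing in for the real data: 2 "clinical" columns,
# followed by gene expression and "_mut" columns.
toy = pd.DataFrame(
    [[1, 0, 0.5, "0", -1.2, "A1B"]],
    columns=["patient_id", "overall_survival", "brca1", "brca1_mut", "tp53", "tp53_mut"],
)
n_clinical = 2  # assumption for the toy frame; the notebook uses 31

gene_cols = toy.columns[n_clinical:]
is_mut = gene_cols.str.endswith("_mut")  # boolean mask over the gene columns

mut_cols = gene_cols[is_mut]     # mutation columns, in their original order
expr_cols = gene_cols[~is_mut]   # expression columns, in their original order

print(list(expr_cols))
print(list(mut_cols))
```

Because the mask is applied to the existing `Index`, the resulting column order is the same on every run.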


In [461]:
clinical_dataset = df[clinical_columns]
gene_expr_dataset = df[gene_expr_columns]
gene_mut_dataset = df[gene_mut_columns]

## EDA

For each dataset, you must perform a sufficient EDA.

In [462]:
clinical_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1904 entries, 0 to 1903
Data columns (total 31 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   patient_id                      1904 non-null   int64  
 1   age_at_diagnosis                1904 non-null   float64
 2   type_of_breast_surgery          1882 non-null   object 
 3   cancer_type                     1904 non-null   object 
 4   cancer_type_detailed            1889 non-null   object 
 5   cellularity                     1850 non-null   object 
 6   chemotherapy                    1904 non-null   int64  
 7   pam50_+_claudin-low_subtype     1904 non-null   object 
 8   cohort                          1904 non-null   float64
 9   er_status_measured_by_ihc       1874 non-null   object 
 10  er_status                       1904 non-null   object 
 11  neoplasm_histologic_grade       1832 non-null   float64
 12  her2_status_measured_by_snp6    19

In [463]:
clinical_dataset.describe()

Unnamed: 0,patient_id,age_at_diagnosis,chemotherapy,cohort,neoplasm_histologic_grade,hormone_therapy,lymph_nodes_examined_positive,mutation_count,nottingham_prognostic_index,overall_survival_months,overall_survival,radio_therapy,tumor_size,tumor_stage
count,1904.0,1904.0,1904.0,1904.0,1832.0,1904.0,1904.0,1859.0,1904.0,1904.0,1904.0,1904.0,1884.0,1403.0
mean,3921.982143,61.087054,0.207983,2.643908,2.415939,0.616597,2.002101,5.697687,4.033019,125.121324,0.420693,0.597164,26.238726,1.750535
std,2358.478332,12.978711,0.405971,1.228615,0.650612,0.486343,4.079993,4.058778,1.144492,76.334148,0.4938,0.490597,15.160976,0.628999
min,0.0,21.93,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
25%,896.5,51.375,0.0,1.0,2.0,0.0,0.0,3.0,3.046,60.825,0.0,0.0,17.0,1.0
50%,4730.5,61.77,0.0,3.0,3.0,1.0,0.0,5.0,4.042,115.616667,0.0,1.0,23.0,2.0
75%,5536.25,70.5925,0.0,3.0,3.0,1.0,2.0,7.0,5.04025,184.716667,1.0,1.0,30.0,2.0
max,7299.0,96.29,1.0,5.0,3.0,1.0,45.0,80.0,6.36,355.2,1.0,1.0,182.0,4.0


In [464]:
gene_expr_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1904 entries, 0 to 1903
Columns: 489 entries, kmt2c to casp3
dtypes: float64(489)
memory usage: 7.1 MB


In [465]:
gene_expr_dataset.describe()

Unnamed: 0,kmt2c,abcc1,stat5b,itgb3,npnt,smad5,dtx3,zfp36l1,e2f4,nrip1,...,dtx2,hsd3b2,mtor,nf2,gdf11,twist1,wwox,rad50,hsd17b3,casp3
count,1904.0,1904.0,1904.0,1904.0,1904.0,1904.0,1904.0,1904.0,1904.0,1904.0,...,1904.0,1904.0,1904.0,1904.0,1904.0,1904.0,1904.0,1904.0,1904.0,1904.0
mean,-3.151261e-07,-5.252101e-08,-2.10084e-07,-6.827731e-07,1e-06,-8.403361e-07,5.252101e-07,-4.726891e-07,-1e-06,1.57563e-07,...,-1e-06,1.05042e-07,-6.302521e-07,-2.62605e-07,-5.252101e-08,5.777311e-07,5.777311e-07,-8.928571e-07,-1.57563e-07,7.352941e-07
std,1.000262,1.000263,1.000262,1.000263,1.000262,1.000262,1.000262,1.000263,1.000263,1.000263,...,1.000262,1.000262,1.000263,1.000263,1.000263,1.000262,1.000263,1.000263,1.000264,1.000262
min,-2.0533,-2.401,-2.8606,-3.9635,-2.1219,-1.8512,-3.1155,-3.2109,-2.7194,-3.527,...,-4.5026,-3.8209,-3.7609,-3.9781,-2.8462,-2.1429,-1.7385,-3.3854,-2.7959,-3.6361
25%,-0.71785,-0.64865,-0.677925,-0.56105,-0.806875,-0.5345,-0.636325,-0.6781,-0.749075,-0.6198,...,-0.6833,-0.6573,-0.655475,-0.6945,-0.65185,-0.65005,-0.640025,-0.648025,-0.64675,-0.654325
50%,-0.1562,-0.07965,-0.0151,-0.07745,-0.02515,-0.159,-0.001,0.0282,-0.04705,-0.02525,...,-0.097,-0.01145,0.0354,-0.0427,-0.1695,-0.1573,-0.26935,-0.00395,-0.0527,-0.0383
75%,0.51735,0.541775,0.66605,0.438325,0.70755,0.31185,0.6028,0.6956,0.672625,0.6384,...,0.5558,0.638075,0.6543,0.629025,0.4389,0.46485,0.2789,0.64905,0.493125,0.623
max,4.677,5.3473,4.3042,15.3308,4.0223,10.3549,10.5253,2.816,4.072,3.2006,...,4.0397,8.5608,5.293,4.4484,5.457,7.9391,5.6368,5.1668,8.2924,6.0239


In [466]:
gene_mut_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1904 entries, 0 to 1903
Columns: 173 entries, pik3ca_mut to siah1_mut
dtypes: object(173)
memory usage: 2.5+ MB


In [467]:
gene_mut_dataset.describe()

Unnamed: 0,pik3ca_mut,tp53_mut,muc16_mut,ahnak2_mut,kmt2c_mut,syne1_mut,gata3_mut,map3k1_mut,ahnak_mut,dnah11_mut,...,mtap_mut,ppp2cb_mut,smarcd1_mut,nras_mut,ndfip1_mut,hras_mut,prps2_mut,smarcb1_mut,stmn2_mut,siah1_mut
count,1904,1904,1904,1904,1904,1904,1904,1904,1904,1904,...,1904,1904,1904,1904,1904,1904,1904,1904,1904,1904
unique,160,343,298,248,222,200,128,194,153,154,...,5,5,5,4,4,3,3,3,3,2
top,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
freq,1109,1245,1578,1593,1670,1672,1674,1706,1728,1729,...,1900,1900,1900,1901,1901,1902,1902,1902,1902,1903


In [468]:
# "clean data" = rows that have no NaN value in any column
def clean_stats(ds):
    return '''clean data: {0}'''.format(ds.shape[0] - ds.isnull().any(axis=1).sum())

print(f'Clinical dataset {clean_stats(clinical_dataset)}')
print(f'Gene expression dataset {clean_stats(gene_expr_dataset)}')
print(f'Gene mutation dataset {clean_stats(gene_mut_dataset)}')

Clinical dataset clean data: 1092
Gene expression dataset clean data: 1904
Gene mutation dataset clean data: 1904
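
The row counts above show how many rows are complete, but not which columns cause the gaps. A per-column summary could look like the following sketch; the toy values are made up for illustration and only the column names are borrowed from the clinical dataset.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with a few missing entries
toy = pd.DataFrame({
    "tumor_stage": [1.0, np.nan, 2.0, np.nan],
    "cellularity": ["High", None, "Low", "High"],
    "age_at_diagnosis": [75.65, 43.19, 48.87, 47.68],
})

missing = toy.isnull().sum()                        # missing count per column
missing_pct = (100 * missing / len(toy)).round(1)   # same, as a percentage of rows

report = pd.DataFrame({"missing": missing, "percent": missing_pct})
# keep only columns that actually have gaps, worst first
report = report[report["missing"] > 0].sort_values("missing", ascending=False)
print(report)
```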


In [469]:
def dtype_stats(ds):
    return '''
    columns: {0}, object columns: {1}, int columns: {2}, float columns: {3}
    '''.format(len(ds.columns),
               (ds.dtypes == object).sum(),
               (ds.dtypes == int).sum(),
               (ds.dtypes == float).sum())

print(f'Clinical dataset: {dtype_stats(clinical_dataset)}')
print(f'Gene expression dataset: {dtype_stats(gene_expr_dataset)}')
print(f'Gene mutation dataset: {dtype_stats(gene_mut_dataset)}')

Clinical dataset: 
    columns: 31, object columns: 17, int columns: 5, float columns: 9
    
Gene expression dataset: 
    columns: 489, object columns: 0, int columns: 0, float columns: 489
    
Gene mutation dataset: 
    columns: 173, object columns: 173, int columns: 0, float columns: 0
    


In [470]:
# check if int data needs scaling
clinical_dataset[clinical_columns[clinical_dataset.dtypes == int]].head(5)

Unnamed: 0,patient_id,chemotherapy,hormone_therapy,overall_survival,radio_therapy
0,0,0,1,1,1
1,2,0,1,1,1
2,5,1,1,0,0
3,6,1,1,1,1
4,8,1,1,0,1


In [471]:
# Standardize the float columns (zero mean, unit variance)
def scale(scaler, dataset):
    # fit the scaler and return the result as a DataFrame with the original labels
    scaled = scaler.fit_transform(dataset)
    return pd.DataFrame(scaled, columns=dataset.columns, index=dataset.index)

scaler = StandardScaler()
clinical_dataset = clinical_dataset.copy()  # work on a copy, not on a slice of df
clinical_float_columns = list(clinical_columns[clinical_dataset.dtypes == float])
clinical_dataset[clinical_float_columns] = scale(scaler, clinical_dataset[clinical_float_columns])

gene_expr_dataset = scale(scaler, gene_expr_dataset)

In [472]:
clinical_dataset.head(5)

Unnamed: 0,patient_id,age_at_diagnosis,type_of_breast_surgery,cancer_type,cancer_type_detailed,cellularity,chemotherapy,pam50_+_claudin-low_subtype,cohort,er_status_measured_by_ihc,...,nottingham_prognostic_index,oncotree_code,overall_survival_months,overall_survival,pr_status,radio_therapy,3-gene_classifier_subtype,tumor_size,tumor_stage,death_from_cancer
0,0,1.122359,MASTECTOMY,Breast Cancer,Breast Invasive Ductal Carcinoma,,0,claudin-low,-1.338368,Positve,...,1.757557,IDC,0.201518,1,Negative,1,ER-/HER2-,-0.279656,0.396748,Living
1,2,-1.379317,BREAST CONSERVING,Breast Cancer,Breast Invasive Ductal Carcinoma,High,0,LumA,-1.338368,Positve,...,-0.011378,IDC,-0.530544,1,Positive,1,ER+/HER2- High Prolif,-1.071371,-1.193646,Living
2,5,-0.941562,MASTECTOMY,Breast Cancer,Breast Invasive Ductal Carcinoma,High,1,LumB,-1.338368,Positve,...,-0.002638,IDC,0.505525,0,Positive,0,,-0.74149,0.396748,Died of Disease
3,6,-1.033275,MASTECTOMY,Breast Cancer,Breast Mixed Ductal and Lobular Carcinoma,Moderate,1,LumB,-1.338368,Positve,...,0.014841,MDLC,0.521686,1,Positive,1,,-0.081727,0.396748,Living
4,8,1.224091,MASTECTOMY,Breast Cancer,Breast Mixed Ductal and Lobular Carcinoma,High,1,LumB,-1.338368,Positve,...,1.789021,MDLC,-1.097499,0,Positive,1,ER+/HER2- High Prolif,0.907918,0.396748,Died of Disease


In [473]:
gene_expr_dataset.head(5)

Unnamed: 0,kmt2c,abcc1,stat5b,itgb3,npnt,smad5,dtx3,zfp36l1,e2f4,nrip1,...,dtx2,hsd3b2,mtor,nf2,gdf11,twist1,wwox,rad50,hsd17b3,casp3
0,-0.9045,-1.0213,2.577302,-0.230399,-1.154102,-0.364699,0.6028,0.3492,0.199201,0.4076,...,-1.8994,0.7345,-0.546899,-0.5124,-0.6499,2.809601,-0.4464,1.733,1.085999,-2.125702
1,-0.0208,0.4261,-1.325101,0.257701,1.112599,-0.539899,-0.799101,1.4963,1.001101,-0.1878,...,-0.177499,0.6433,-0.318599,-0.8002,0.0169,-0.227301,-0.8286,0.744,0.0623,0.5775
2,-0.5063,-0.5168,-1.083201,0.646901,0.489299,3.486902,-1.290601,2.815999,-0.833299,-0.2882,...,0.231201,-0.912801,-1.747199,0.6707,-1.2771,0.1984,-0.3171,1.4528,0.1493,-0.929101
3,-1.284001,-1.4145,-0.0195,-0.850199,0.602799,7.342503,-1.184501,2.0714,-0.833299,-0.2049,...,-0.276499,1.875401,-1.926499,-0.4457,-0.9396,0.1932,-0.4584,1.2102,0.2546,0.520699
4,-1.1026,-0.8794,-0.4278,-0.319399,-0.569501,-0.718699,-2.401801,-0.091199,-0.318099,0.2832,...,-0.752499,-0.126,-0.614899,0.8564,-0.3794,0.4128,0.163799,-0.763099,-0.762699,-0.520701


In [474]:
# Ordinal-encode the clinical dataset.
# NOTE: the encoder is fitted on every column, so numeric columns
# (including the scaled float columns) are replaced by rank codes too.
ordinal_encoder = OrdinalEncoder()
encoded = ordinal_encoder.fit_transform(clinical_dataset)
clinical_dataset = pd.DataFrame(encoded, columns=clinical_dataset.columns, index=clinical_dataset.index)

print(f'Clinical dataset {dtype_stats(clinical_dataset)}')

Clinical dataset 
    columns: 31, object columns: 0, int columns: 0, float columns: 31
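
Since `OrdinalEncoder` is applied to the whole frame, numeric columns such as `age_at_diagnosis` end up as rank codes as well. One possible alternative, sketched here on made-up rows, is to route only the non-numeric columns through the encoder with scikit-learn's `ColumnTransformer`. The toy frame and the variable names are assumptions for illustration, not part of the project code.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

toy = pd.DataFrame({
    "age_at_diagnosis": [75.65, 43.19, 48.87, 47.68],
    "cellularity": ["High", "Moderate", "High", "Low"],
    "type_of_breast_surgery": ["MASTECTOMY", "BREAST CONSERVING", "MASTECTOMY", "MASTECTOMY"],
})

# Split the column names by dtype
numeric = [c for c in toy.columns if pd.api.types.is_numeric_dtype(toy[c])]
categorical = [c for c in toy.columns if c not in numeric]

preprocess = ColumnTransformer([
    ("cat", OrdinalEncoder(), categorical),   # encode text columns only
    ("num", StandardScaler(), numeric),       # numeric columns keep their scale information
])

encoded = preprocess.fit_transform(toy)
# Output columns follow the transformer order: categorical first, then numeric
encoded_df = pd.DataFrame(encoded, columns=categorical + numeric, index=toy.index)
print(encoded_df)
```

Note that the default codes follow alphabetical order of the categories, which is not necessarily a meaningful clinical order.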
    


In [475]:
# Perform data imputation for clinical dataset
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
imputed_data = imputer.fit_transform(clinical_dataset)
imputed_df = pd.DataFrame(imputed_data)
imputed_df.columns = clinical_dataset.columns
imputed_df.index = clinical_dataset.index
clinical_dataset = imputed_df

print(f'Imputed clinical dataset: {clean_stats(clinical_dataset)}')

Imputed clinical dataset: clean data: 1904


In [476]:
clinical_dataset.head(5)

Unnamed: 0,patient_id,age_at_diagnosis,type_of_breast_surgery,cancer_type,cancer_type_detailed,cellularity,chemotherapy,pam50_+_claudin-low_subtype,cohort,er_status_measured_by_ihc,...,nottingham_prognostic_index,oncotree_code,overall_survival_months,overall_survival,pr_status,radio_therapy,3-gene_classifier_subtype,tumor_size,tumor_stage,death_from_cancer
0,0.0,1341.0,1.0,0.0,1.0,0.0,0.0,6.0,0.0,1.0,...,268.0,1.0,998.0,1.0,0.0,1.0,2.0,46.0,2.0,2.0
1,1.0,173.0,0.0,0.0,1.0,0.0,0.0,2.0,0.0,1.0,...,126.0,1.0,585.0,1.0,1.0,1.0,0.0,11.0,1.0,2.0
2,2.0,328.0,1.0,0.0,1.0,0.0,1.0,3.0,0.0,1.0,...,134.0,1.0,1129.0,0.0,1.0,0.0,1.0,23.0,2.0,0.0
3,3.0,293.0,1.0,0.0,4.0,2.0,1.0,3.0,0.0,1.0,...,152.0,5.0,1140.0,1.0,1.0,1.0,1.0,53.0,2.0,2.0
4,4.0,1386.0,1.0,0.0,4.0,0.0,1.0,3.0,0.0,1.0,...,286.0,5.0,264.0,0.0,1.0,1.0,0.0,72.0,2.0,0.0


In [477]:
# define data and labels for each dataset

labels = clinical_dataset[label_column].to_numpy()
clinical_data = clinical_dataset[clinical_data_columns].to_numpy()
gene_expr_data = gene_expr_dataset.to_numpy()

_clinical_train_X, _clinical_test_X, _clinical_train_y, _clinical_test_y = train_test_split(clinical_data, labels, test_size=0.17, random_state=42)
_gene_expr_train_X, _gene_expr_test_X, _gene_expr_train_y, _gene_expr_test_y = train_test_split(gene_expr_data, labels, test_size=0.17, random_state=42)

dataset = {
    'clinical':{
        'X_train': _clinical_train_X,
        'X_test': _clinical_test_X,
        'y_train': _clinical_train_y,
        'y_test': _clinical_test_y
    },
    'gene_expr':{
        'X_train': _gene_expr_train_X,
        'X_test': _gene_expr_test_X,
        'y_train': _gene_expr_train_y,
        'y_test': _gene_expr_test_y
    },
    'gene_expr_reduced':{
    }
}
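
In the cells above, scaling is fitted before the train/test split, so the test rows contribute to the scaling statistics. A hedged sketch of the alternative order (split first, then fit the scaler on the training rows only) is given below on synthetic data; the `stratify=y` argument is an additional assumption that keeps the class ratio similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 4))  # synthetic features
y = np.array([0] * 60 + [1] * 40)                  # synthetic binary labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.17, random_state=42, stratify=y
)

scaler = StandardScaler().fit(X_train)       # statistics come from the training rows only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)     # test rows reuse the training statistics

print(X_train_scaled.shape, X_test_scaled.shape)
```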

## Dimension Reduction (20 + Up to 10 Points Optional)

For each dataset, investigate whether a dimensionality reduction approach is needed. If so, please reduce the dataset's dimension. You can use UMAP for this purpose, but any other approach is acceptable. Finding the most important features earns extra points.

<span style="color:orange">
    We check whether dimensionality reduction is needed by using a simple linear regression model as a baseline.
</span>



In [478]:
# predict for the clinical dataset using linear regression; rounding the regression output gives the predicted class label
_clf = LinearRegression()
_clf.fit(dataset['clinical']['X_train'], dataset['clinical']['y_train'])
_clinical_baseline_pred = np.round(_clf.predict(dataset['clinical']['X_test']))
_clinical_baseline_accuracy = accuracy_score(dataset['clinical']['y_test'], _clinical_baseline_pred)

# predict for the gene expression dataset using linear regression
_clf = LinearRegression()
_clf.fit(dataset['gene_expr']['X_train'], dataset['gene_expr']['y_train'])
_gene_expr_baseline_pred = np.round(_clf.predict(dataset['gene_expr']['X_test']))
_gene_expr_baseline_accuracy = accuracy_score(dataset['gene_expr']['y_test'], _gene_expr_baseline_pred)

print(f'Accuracy of simple linear regression model on clinical data: {_clinical_baseline_accuracy:.3f}')
print(f'Accuracy of simple linear regression model on gene expression data: {_gene_expr_baseline_accuracy:.3f}')

Accuracy of simple linear regression model on clinical data: 0.741
Accuracy of simple linear regression model on gene expression data: 0.583


<span style="color:orange">
    As we can see, the results are much better for the clinical dataset, which has few dimensions, than for the gene expression dataset (0.741 vs. 0.583 accuracy).
    Therefore, we will only reduce the dimensions of the gene expression dataset.
</span>
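
Rounding the output of a linear regression can produce values outside {0, 1}, which are then always counted as errors. As a point of comparison, a logistic regression baseline always returns a valid class label. The sketch below uses synthetic data from `make_classification`, so its accuracy says nothing about the METABRIC data; it only illustrates the pattern.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (stand-in for the real features)
X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.17, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)   # labels are always 0 or 1, no rounding needed

baseline_accuracy = accuracy_score(y_test, pred)
print(f"accuracy: {baseline_accuracy:.3f}")
```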



In [481]:
# reduce the dimensions for clinical data and predict using baseline model
_reducer = umap.UMAP(n_components=10)
_reducer.fit(clinical_data)
_reduced_X_train = _reducer.transform(dataset['clinical']['X_train'])
_reduced_X_test = _reducer.transform(dataset['clinical']['X_test'])

_clf = LinearRegression()
_clf.fit(_reduced_X_train, dataset['clinical']['y_train'])
_clinical_reduced_baseline_pred = np.round(_clf.predict(_reduced_X_test))
_clinical_reduced_baseline_accuracy = accuracy_score(dataset['clinical']['y_test'], _clinical_reduced_baseline_pred)

# reduce the dimensions for gene expression data and predict using baseline model
_reducer = umap.UMAP(n_components=20)
_reducer.fit(gene_expr_data)
_reduced_X_train = _reducer.transform(dataset['gene_expr']['X_train'])
_reduced_X_test = _reducer.transform(dataset['gene_expr']['X_test'])

_clf = LinearRegression()
_clf.fit(_reduced_X_train, dataset['gene_expr']['y_train'])
_gene_expr_reduced_baseline_pred = np.round(_clf.predict(_reduced_X_test))
_gene_expr_reduced_baseline_accuracy = accuracy_score(dataset['gene_expr']['y_test'], _gene_expr_reduced_baseline_pred)

print(f'Accuracy of simple linear regression model on reduced clinical data: {_clinical_reduced_baseline_accuracy:.3f}')
print(f'Accuracy of simple linear regression model on reduced gene expression data: {_gene_expr_reduced_baseline_accuracy:.3f}')

Accuracy of simple linear regression model on reduced clinical data: 0.719
Accuracy of simple linear regression model on reduced gene expression data: 0.627


<span style="color:orange">
    As we can see, applying dimension reduction to the clinical dataset leads to worse results (0.719 vs. 0.741), while on the gene expression dataset it improves the predictions (0.627 vs. 0.583).
    Therefore, we choose to reduce the dimensions of only the gene expression dataset. 
</span>
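
The cells above fit the UMAP reducer on the full data matrix, which includes the test rows. A variant that fits the reducer on the training rows only is sketched below. It swaps in PCA (already imported at the top of the notebook) instead of UMAP, and it runs on synthetic data with five hidden factors, so the number of components it keeps is specific to this toy example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))   # 5 hidden factors
# 50 observed features = linear mix of the factors plus a little noise
X = latent @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(200, 50))
X_train, X_test = X[:160], X[160:]

pca = PCA(n_components=0.95)                 # keep enough components for 95% of the variance
X_train_red = pca.fit_transform(X_train)     # fitted on the training rows only
X_test_red = pca.transform(X_test)           # test rows are only projected

print(pca.n_components_, round(pca.explained_variance_ratio_.sum(), 3))
```

The explained variance ratio also gives a rough, model-independent indication of how much redundancy the original features contain.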



In [482]:
dataset['gene_expr_reduced'] = {
    'X_train': _reduced_X_train,
    'X_test': _reduced_X_test,
    'y_train': _gene_expr_train_y,
    'y_test': _gene_expr_test_y
}

# Classic Model (25 Points)

In this section, you must implement a classic classification model for clinical, gene expressions, and reduced gene expressions datasets. Using Random Forest is suggested. (minimum acceptable accuracy = 60%)

In [None]:
classic_models = {
    'random_forst_clinical': None,
    'random_forst_gene_expr': None,
    'random_forst_gene_expr_reduced': None
}
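
As a starting point for this section, a Random Forest classifier could be trained and evaluated roughly as follows. This is a sketch on synthetic data, not the project's solution: the hyperparameters (`n_estimators=200`) and the data generator are illustrative assumptions. The `feature_importances_` attribute is one way to look for the most important features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one of the prepared datasets
X, y = make_classification(n_samples=400, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.17, random_state=42, stratify=y
)

forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(X_train, y_train)
acc = accuracy_score(y_test, forest.predict(X_test))

# Indices of the five features with the highest impurity-based importance
top5 = np.argsort(forest.feature_importances_)[::-1][:5]
print(f"accuracy: {acc:.3f}")
print("most important feature indices:", top5)
```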

# Neural Network (30 Points)

In this section, you must implement a neural network model for the clinical, gene expression, and reduced gene expression datasets. Using MLP models is suggested. (minimum acceptable accuracy = 60%)

In [None]:
# TODO
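
One possible shape for an MLP model is sketched below with scikit-learn's `MLPClassifier`, wrapped in a pipeline so that scaling is fitted on the training rows only. The layer sizes, `max_iter`, and the synthetic data are assumptions chosen for illustration; a deep learning framework would work equally well.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one of the prepared datasets
X, y = make_classification(n_samples=400, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.17, random_state=42, stratify=y
)

mlp = make_pipeline(
    StandardScaler(),   # MLPs are sensitive to feature scale
    MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=42),
)
mlp.fit(X_train, y_train)

mlp_acc = mlp.score(X_test, y_test)   # mean accuracy on the held-out rows
print(f"accuracy: {mlp_acc:.3f}")
```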

# Model Comparison (10 Points)

Compare different models and different datasets (clinical, gene expressions, and gene reduced expressions) and try to explain their differences.

#### \# TODO
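
A comparison like the one requested here can be organized as a small table of cross-validated accuracies. The following sketch assumes two example models and synthetic data; in the project, the same loop would run over the trained models and the three datasets.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for one of the prepared datasets
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

rows = []
for name, model in models.items():
    # 5-fold cross-validated accuracy for each model
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    rows.append({"model": name, "mean_accuracy": scores.mean(), "std": scores.std()})

comparison = pd.DataFrame(rows).set_index("model")
print(comparison.round(3))
```

Reporting the standard deviation next to the mean helps judge whether a difference between two models is larger than the fold-to-fold variation.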