<img src='http://www-scf.usc.edu/~ghasemig/images/sharif.png' alt="SUT logo" width=200 height=200 align=left class="saturate" >

<br>
<font face="Times New Roman">
<div dir=ltr align=center>
<font color=0F5298 size=7>
    Introduction to Machine Learning <br>
<font color=2565AE size=5>
    Computer Engineering Department <br>
    Fall 2022<br>
<font color=3C99D size=5>
    Project <br>
<font color=696880 size=4>
    Project Team 
    
    
____


### Full Name : Mohammad Bagher Soltani, Masih Najafi
### Student Number : 98105813, ?
___

# Introduction

In this project, we are going to have a brief and elementary hands-on real-world project, predicting breast cancer survival using machine learning models with clinical data and gene expression profiles.

In [1]:
# imports
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer
import umap
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from tqdm.std import tqdm
from torch.optim import Adam
from copy import deepcopy

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
np.random.seed(42)

# Data Documentation

For this purpose, we will use "Breast Cancer Gene Expression Profiles (METABRIC)" data. 
The first 31 columns of data contain clinical information including death status.
The next columns of the data contain gene's related information which includes both gene expressions and mutation information. (gene's mutation info columns have been marked with "_mut" at the end of the names of the columns) 
For more information please read the [data documentation](https://www.kaggle.com/datasets/raghadalharbi/breast-cancer-gene-expression-profiles-metabric).

# Data Preparation (15 Points)

In this section you must first split data into three datasets:
<br>
1- clinical dataset
<br>
2- gene expressions dataset
<br>
3- gene mutation dataset. (We will not use this dataset in further steps of the project)

## Data Loading & Splitting

In [3]:
# TODO
df = pd.read_csv('METABRIC_RNA_Mutation.csv', low_memory=False)
df.head(5)

Unnamed: 0,patient_id,age_at_diagnosis,type_of_breast_surgery,cancer_type,cancer_type_detailed,cellularity,chemotherapy,pam50_+_claudin-low_subtype,cohort,er_status_measured_by_ihc,...,mtap_mut,ppp2cb_mut,smarcd1_mut,nras_mut,ndfip1_mut,hras_mut,prps2_mut,smarcb1_mut,stmn2_mut,siah1_mut
0,0,75.65,MASTECTOMY,Breast Cancer,Breast Invasive Ductal Carcinoma,,0,claudin-low,1.0,Positve,...,0,0,0,0,0,0,0,0,0,0
1,2,43.19,BREAST CONSERVING,Breast Cancer,Breast Invasive Ductal Carcinoma,High,0,LumA,1.0,Positve,...,0,0,0,0,0,0,0,0,0,0
2,5,48.87,MASTECTOMY,Breast Cancer,Breast Invasive Ductal Carcinoma,High,1,LumB,1.0,Positve,...,0,0,0,0,0,0,0,0,0,0
3,6,47.68,MASTECTOMY,Breast Cancer,Breast Mixed Ductal and Lobular Carcinoma,Moderate,1,LumB,1.0,Positve,...,0,0,0,0,0,0,0,0,0,0
4,8,76.97,MASTECTOMY,Breast Cancer,Breast Mixed Ductal and Lobular Carcinoma,High,1,LumB,1.0,Positve,...,0,0,0,0,0,0,0,0,0,0


In [4]:
# Get column names for clinical, gene expression and gene mutation datasets

columns = df.columns
clinical_columns = columns[:31]
clinical_data_columns = df.columns[:24].append(df.columns[25:30])
label_column = columns[24]
gene_columns = columns[31:]
gene_mut_columns = pd.Index(filter(lambda s: s.endswith('_mut'),columns))
gene_expr_columns = pd.Index(set(gene_columns) - set(gene_mut_columns))

print(f'Number of clinical columns {len(clinical_columns)}')
print(f'Number of gene expression columns {len(gene_expr_columns)}')
print(f'Number of gene mutation columns {len(gene_mut_columns)}')

Number of clinical columns 31
Number of gene expression columns 489
Number of gene mutation columns 173


In [5]:
clinical_dataset = df[clinical_columns]
gene_expr_dataset = df[gene_expr_columns]
gene_mut_dataset = df[gene_mut_columns]

## EDA

For each dataset, you must perform a sufficient EDA.

In [6]:
clinical_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1904 entries, 0 to 1903
Data columns (total 31 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   patient_id                      1904 non-null   int64  
 1   age_at_diagnosis                1904 non-null   float64
 2   type_of_breast_surgery          1882 non-null   object 
 3   cancer_type                     1904 non-null   object 
 4   cancer_type_detailed            1889 non-null   object 
 5   cellularity                     1850 non-null   object 
 6   chemotherapy                    1904 non-null   int64  
 7   pam50_+_claudin-low_subtype     1904 non-null   object 
 8   cohort                          1904 non-null   float64
 9   er_status_measured_by_ihc       1874 non-null   object 
 10  er_status                       1904 non-null   object 
 11  neoplasm_histologic_grade       1832 non-null   float64
 12  her2_status_measured_by_snp6    19

In [7]:
clinical_dataset.describe()

Unnamed: 0,patient_id,age_at_diagnosis,chemotherapy,cohort,neoplasm_histologic_grade,hormone_therapy,lymph_nodes_examined_positive,mutation_count,nottingham_prognostic_index,overall_survival_months,overall_survival,radio_therapy,tumor_size,tumor_stage
count,1904.0,1904.0,1904.0,1904.0,1832.0,1904.0,1904.0,1859.0,1904.0,1904.0,1904.0,1904.0,1884.0,1403.0
mean,3921.982143,61.087054,0.207983,2.643908,2.415939,0.616597,2.002101,5.697687,4.033019,125.121324,0.420693,0.597164,26.238726,1.750535
std,2358.478332,12.978711,0.405971,1.228615,0.650612,0.486343,4.079993,4.058778,1.144492,76.334148,0.4938,0.490597,15.160976,0.628999
min,0.0,21.93,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
25%,896.5,51.375,0.0,1.0,2.0,0.0,0.0,3.0,3.046,60.825,0.0,0.0,17.0,1.0
50%,4730.5,61.77,0.0,3.0,3.0,1.0,0.0,5.0,4.042,115.616667,0.0,1.0,23.0,2.0
75%,5536.25,70.5925,0.0,3.0,3.0,1.0,2.0,7.0,5.04025,184.716667,1.0,1.0,30.0,2.0
max,7299.0,96.29,1.0,5.0,3.0,1.0,45.0,80.0,6.36,355.2,1.0,1.0,182.0,4.0


In [8]:
gene_expr_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1904 entries, 0 to 1903
Columns: 489 entries, peg3 to brca1
dtypes: float64(489)
memory usage: 7.1 MB


In [9]:
gene_expr_dataset.describe()

Unnamed: 0,peg3,tubb4b,tbl1xr1,cdk6,fanca,rassf1,epcam,tp53bp1,tgfb1,dtx2,...,rasgef1b,vegfa,eif4ebp1,hsd3b2,prkg1,bmp2,folr1,akr1c2,map2,brca1
count,1904.0,1904.0,1904.0,1904.0,1904.0,1904.0,1904.0,1904.0,1904.0,1904.0,...,1904.0,1904.0,1904.0,1904.0,1904.0,1904.0,1904.0,1904.0,1904.0,1904.0
mean,-4.726891e-07,-1.05042e-07,-3.676471e-07,6.302521e-07,4.201681e-07,-3.151261e-07,1.4927370000000002e-17,8.928571e-07,-2.62605e-07,-1e-06,...,-5.252101e-07,5.252101e-07,-4.201681e-07,1.05042e-07,1.05042e-07,6.302521e-07,-9.978992e-07,4.726891e-07,-6.827731e-07,-6.302521e-07
std,1.000264,1.000263,1.000263,1.000263,1.000262,1.000264,1.000263,1.000262,1.000262,1.000262,...,1.000263,1.000263,1.000263,1.000262,1.000263,1.000262,1.000263,1.000263,1.000263,1.000262
min,-1.4641,-3.6677,-3.8971,-2.2784,-3.8734,-2.7567,-2.1659,-3.8527,-3.8126,-4.5026,...,-4.1326,-1.9344,-1.8112,-3.8209,-4.0392,-1.5713,-1.1505,-1.4675,-1.8012,-2.4444
25%,-0.708725,-0.664625,-0.61385,-0.668075,-0.687325,-0.69835,-0.686,-0.671875,-0.66,-0.6833,...,-0.655225,-0.69375,-0.66325,-0.6573,-0.68615,-0.52015,-0.696725,-0.73515,-0.5506,-0.71985
50%,-0.2802,-0.0026,-0.07585,-0.1958,0.0022,-0.077,-0.14945,0.01315,0.0144,-0.097,...,-0.0096,-0.1639,-0.25435,-0.01145,-0.05055,-0.20135,-0.4091,-0.19555,-0.2434,-0.12445
75%,0.406625,0.649125,0.51305,0.4241,0.61795,0.62,0.5255,0.692475,0.633925,0.5558,...,0.635575,0.513125,0.36485,0.638075,0.60645,0.231425,0.33005,0.50035,0.2167,0.553225
max,5.0043,5.2716,7.0649,7.3807,4.5736,4.5171,6.9539,4.5349,4.6105,4.0397,...,4.1822,5.1319,7.4036,8.5608,5.1692,11.5429,4.4417,4.4172,6.9005,4.5542


In [10]:
gene_mut_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1904 entries, 0 to 1903
Columns: 173 entries, pik3ca_mut to siah1_mut
dtypes: object(173)
memory usage: 2.5+ MB


In [11]:
gene_mut_dataset.describe()

Unnamed: 0,pik3ca_mut,tp53_mut,muc16_mut,ahnak2_mut,kmt2c_mut,syne1_mut,gata3_mut,map3k1_mut,ahnak_mut,dnah11_mut,...,mtap_mut,ppp2cb_mut,smarcd1_mut,nras_mut,ndfip1_mut,hras_mut,prps2_mut,smarcb1_mut,stmn2_mut,siah1_mut
count,1904,1904,1904,1904,1904,1904,1904,1904,1904,1904,...,1904,1904,1904,1904,1904,1904,1904,1904,1904,1904
unique,160,343,298,248,222,200,128,194,153,154,...,5,5,5,4,4,3,3,3,3,2
top,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
freq,1109,1245,1578,1593,1670,1672,1674,1706,1728,1729,...,1900,1900,1900,1901,1901,1902,1902,1902,1902,1903


In [12]:
# clean data means data with no NaN value in any column
def clean_stats(ds):
    return '''clean data: {0}'''.format(ds.shape[0] - ds.isnull().any(axis=1).sum())

print(f'Clinical dataset {clean_stats(clinical_dataset)}')
print(f'Gene expression dataset {clean_stats(gene_expr_dataset)}')
print(f'Gene mutation dataset {clean_stats(gene_mut_dataset)}')

Clinical dataset clean data: 1092
Gene expression dataset clean data: 1904
Gene mutation dataset clean data: 1904


In [13]:
clinical_dataset.dtypes

patient_id                          int64
age_at_diagnosis                  float64
type_of_breast_surgery             object
cancer_type                        object
cancer_type_detailed               object
cellularity                        object
chemotherapy                        int64
pam50_+_claudin-low_subtype        object
cohort                            float64
er_status_measured_by_ihc          object
er_status                          object
neoplasm_histologic_grade         float64
her2_status_measured_by_snp6       object
her2_status                        object
tumor_other_histologic_subtype     object
hormone_therapy                     int64
inferred_menopausal_state          object
integrative_cluster                object
primary_tumor_laterality           object
lymph_nodes_examined_positive     float64
mutation_count                    float64
nottingham_prognostic_index       float64
oncotree_code                      object
overall_survival_months           

In [14]:
def dtype_stats(ds):
    return '''
    columns: {0}, object columns: {1}, int columns: {2}, float columns: {3}
    '''.format(len(ds.columns),
               (ds.dtypes == 'object').sum(),
               (ds.dtypes == 'int64').sum(),
               (ds.dtypes == 'float64').sum())

print(f'Clinical dataset: {dtype_stats(clinical_dataset)}')
print(f'Gene expression dataset: {dtype_stats(gene_expr_dataset)}')
print(f'Gene mutation dataset: {dtype_stats(gene_mut_dataset)}')

Clinical dataset: 
    columns: 31, object columns: 17, int columns: 5, float columns: 9
    
Gene expression dataset: 
    columns: 489, object columns: 0, int columns: 0, float columns: 489
    
Gene mutation dataset: 
    columns: 173, object columns: 173, int columns: 0, float columns: 0
    


In [15]:
# check if int data needs scaling
clinical_dataset[clinical_columns[clinical_dataset.dtypes == 'int64']].head(5)

Unnamed: 0,patient_id,chemotherapy,hormone_therapy,overall_survival,radio_therapy
0,0,0,1,1,1
1,2,0,1,1,1
2,5,1,1,0,0
3,6,1,1,1,1
4,8,1,1,0,1


In [16]:
# define data and labels for each dataset

labels = clinical_dataset[label_column].to_numpy()
clinical_data = clinical_dataset[clinical_data_columns].to_numpy()
gene_expr_data = gene_expr_dataset.to_numpy()

In [17]:
# Convert categorical data to numerical data for clinical dataset
ordinal_encoder = OrdinalEncoder()
clinical_data = ordinal_encoder.fit_transform(clinical_data)

In [18]:
# Perform data imputation for clinical dataset
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
clinical_data = imputer.fit_transform(clinical_data)

In [19]:
scaler = StandardScaler()
clinical_data = scaler.fit_transform(clinical_data)

In [20]:
_clinical_train_X, _clinical_test_X, _clinical_train_y, _clinical_test_y = train_test_split(clinical_data, labels, test_size=0.10, random_state=42)
_clinical_train_X, _clinical_val_X, _clinical_train_y, _clinical_val_y = train_test_split(_clinical_train_X, _clinical_train_y, test_size=0.10, random_state=42)

_gene_expr_train_X, _gene_expr_test_X, _gene_expr_train_y, _gene_expr_test_y = train_test_split(gene_expr_data, labels, test_size=0.10, random_state=42)
_gene_expr_train_X, _gene_expr_val_X, _gene_expr_train_y, _gene_expr_val_y = train_test_split(_gene_expr_train_X, _gene_expr_train_y, test_size=0.10, random_state=42)

dataset = {
    'clinical':{
        'X_train': _clinical_train_X,
        'X_val': _clinical_val_X,
        'X_test': _clinical_test_X,
        'y_train': _clinical_train_y,
        'y_val': _clinical_val_y,
        'y_test': _clinical_test_y
    },
    'gene_expr':{
        'X_train': _gene_expr_train_X,
        'X_val': _gene_expr_val_X,
        'X_test': _gene_expr_test_X,
        'y_train': _gene_expr_train_y,
        'y_val': _gene_expr_val_y,
        'y_test': _gene_expr_test_y
    },
    'gene_expr_reduced':{
    }
}

## Dimension Reduction (20 + Up to 10 Points Optional)

For each dataset, investigate whether it is needed to use a dimensionality reduction approach or not. If yes, please reduce the dataset's dimension. You can use UMAP for this purpose but any other approach is acceptable. Finding the most important features contains extra points.

<span style="color:orange">
    we check if dimensionality reduction is needed by using a simple linear regression model as a baseline .
</span>



In [21]:
# predict for the clinical dataset using linear regression
_clf = LinearRegression()
_clf.fit(dataset['clinical']['X_train'], dataset['clinical']['y_train'])
_clinical_baseline_pred = np.round(_clf.predict(dataset['clinical']['X_test']))
_clinical_baseline_accuracy = accuracy_score(dataset['clinical']['y_test'], _clinical_baseline_pred)

# predict for the gene expression dataset using linear regression
_clf = LinearRegression()
_clf.fit(dataset['gene_expr']['X_train'], dataset['gene_expr']['y_train'])
_gene_expr_baseline_pred = np.round(_clf.predict(dataset['gene_expr']['X_test']))
_gene_expr_baseline_accuracy = accuracy_score(dataset['gene_expr']['y_test'], _gene_expr_baseline_pred)

print(f'Accuracy of simple linear regression model on clinical data: {_clinical_baseline_accuracy:.3f}')
print(f'Accuracy of simple linear regression model on gene expression data: {_gene_expr_baseline_accuracy:.3f}')

Accuracy of simple linear regression model on clinical data: 0.738
Accuracy of simple linear regression model on gene expression data: 0.550


<span style="color:orange">
    As we can see, the results are much better for the clinical dataset which has few dimensions, but not so much for the gene expession dataset.
    Therefore, we will only reduce the dimensions for gene expression dataset.
</span>



In [22]:
# reduce the dimensions for clinical data and predict using baseline model
CLINICAL_REDUCED_DIMENSIONS = 5
GENE_EXPR_REDUCED_DIMENSIONS = 40


_reducer = umap.UMAP(n_components=CLINICAL_REDUCED_DIMENSIONS, random_state=42)
_reducer.fit(clinical_data)
_reduced_X_train = _reducer.transform(dataset['clinical']['X_train'])
_reduced_X_test = _reducer.transform(dataset['clinical']['X_test'])

_clf = LinearRegression()
_clf.fit(_reduced_X_train, dataset['clinical']['y_train'])
_clinical_reduced_baseline_pred = np.round(_clf.predict(_reduced_X_test))
_clinical_reduced_baseline_accuracy = accuracy_score(dataset['clinical']['y_test'], _clinical_reduced_baseline_pred)

# reduce the dimensions for gene expression data and predict using baseline model
_reducer = umap.UMAP(n_components=GENE_EXPR_REDUCED_DIMENSIONS, random_state=42)
_reducer.fit(gene_expr_data)
_reduced_X_train = _reducer.transform(dataset['gene_expr']['X_train'])
_reduced_X_val = _reducer.transform(dataset['gene_expr']['X_val'])
_reduced_X_test = _reducer.transform(dataset['gene_expr']['X_test'])

_clf = LinearRegression()
_clf.fit(_reduced_X_train, dataset['gene_expr']['y_train'])
_gene_expr_reduced_baseline_pred = np.round(_clf.predict(_reduced_X_test))
_gene_expr_reduced_baseline_accuracy = accuracy_score(dataset['gene_expr']['y_test'], _gene_expr_reduced_baseline_pred)

print(f'Accuracy of simple linear regression model on reduced clinical data: {_clinical_reduced_baseline_accuracy:.3f}')
print(f'Accuracy of simple linear regression model on reduced gene expression data: {_gene_expr_reduced_baseline_accuracy:.3f}')

Accuracy of simple linear regression model on reduced clinical data: 0.628
Accuracy of simple linear regression model on reduced gene expression data: 0.618


<span style="color:orange">
    As we can see, applying dimension reduction on the clinical dataset leads to worse results, while on gene expression dataset improves the predictions.
    Therefore, we choose to reduce the dimensions of only the gene expression dataset. 
</span>



In [23]:
dataset['gene_expr_reduced'] = {
    'X_train': _reduced_X_train,
    'X_val': _reduced_X_val,
    'X_test': _reduced_X_test,
    'y_train': _gene_expr_train_y,
    'y_val': _gene_expr_val_y,
    'y_test': _gene_expr_test_y
}

# Classic Model (25 Points)

In this section, you must implement a classic classification model for clinical, gene expressions, and reduced gene expressions datasets. Using Random Forest is suggested. (minimum acceptable accuracy = 60%)

In [24]:
random_forst_models = {
    'clinical': None,
    'gene_expr': None,
    'gene_expr_reduced': None
}

for ds_name in random_forst_models:
    clf = RandomForestClassifier(random_state=42)
    ds = dataset[ds_name]
    clf.fit(ds['X_train'], ds['y_train'])
    y_pred = clf.predict(ds['X_test'])
    acc = accuracy_score(ds['y_test'], y_pred)
    random_forst_models[ds_name] = {
        'model': clf,
        'accuracy': acc
    }

    print(f'random forest on {ds_name} dataset had accuracy of {acc:.4f}')

svm_models = random_forst_models.copy()

for ds_name in random_forst_models:
    clf = SVC(random_state=42)
    ds = dataset[ds_name]
    clf.fit(ds['X_train'], ds['y_train'])
    y_pred = clf.predict(ds['X_test'])
    acc = accuracy_score(ds['y_test'], y_pred)
    random_forst_models[ds_name] = {
        'model': clf,
        'accuracy': acc
    }

    print(f'svm on {ds_name} dataset had accuracy of {acc:.4f}')



random forest on clinical dataset had accuracy of 0.7696
random forest on gene_expr dataset had accuracy of 0.6283
random forest on gene_expr_reduced dataset had accuracy of 0.5550
svm on clinical dataset had accuracy of 0.7539
svm on gene_expr dataset had accuracy of 0.6073
svm on gene_expr_reduced dataset had accuracy of 0.5864


# Neural Network (30 Points)

In this section, you must implement a neural network model for clinical, gene expressions and reduced gene expressions datasets. Using the MPL models is suggested. (minimum acceptable accuracy = 60%)

In [25]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
device

device(type='cpu')

In [26]:
class CancerDataset(Dataset):

    def __init__(self, X, y) -> None:
        super().__init__()
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.X)
    
    def __getitem__(self, idx):
        return self.X[idx].astype(np.float32), self.y[idx].astype(np.float32)


In [27]:
dataloaders = {}
batch_size = 64

for ds_name, ds_split in dataset.items():
    dataloaders[ds_name] = {}
    X_train = ds_split['X_train']
    X_val = ds_split['X_val']
    X_test = ds_split['X_test']
    y_train = ds_split['y_train']
    y_val = ds_split['y_val']
    y_test = ds_split['y_test']
    
    train_ds = CancerDataset(X_train, y_train)
    val_ds = CancerDataset(X_val, y_val)
    test_ds = CancerDataset(X_test, y_test)

    train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
    val_dl = DataLoader(val_ds, batch_size=batch_size, shuffle=True)
    test_dl = DataLoader(test_ds, batch_size=batch_size, shuffle=True)

    dataloaders[ds_name]['train'] = train_dl
    dataloaders[ds_name]['val'] = val_dl
    dataloaders[ds_name]['test'] = test_dl

In [28]:
def evaluate(model, dataloader):
    model.eval()
    total, correct = 0, 0
    with torch.no_grad():
        for _, (data, labels) in enumerate(dataloader):
            data, labels = data.to(device), labels.to(device)

            pred = model(data).squeeze()
            out = torch.round(pred)

            correct = correct + (labels == out).sum()
            total = total + len(data)
    
    return correct / total

In [29]:
def train(model, criterion, optimizer, train_dataloader, val_dataloader, num_epochs):
    best_model = model
    train_losses = []
    val_losses = []
    train_accs = []
    val_accs = []
    best_val_loss = np.inf

    for epoch in range(num_epochs):
        # train 
        model.train()
        total, correct = 0, 0
        train_loss = 0.0
        with tqdm(enumerate(train_dataloader), total=len(train_dataloader)) as pbar:
            for i, (data, labels) in pbar:
                data, labels = data.to(device), labels.to(device)
                optimizer.zero_grad()

                pred = model(data).squeeze()
                out = pred.detach().round()

                total = total + len(data)
                correct = correct + (labels == out).sum()
                
                loss = criterion(pred, labels)
                loss.backward()
                optimizer.step()

                train_loss = train_loss + loss.detach()

                if correct/total > 0.95:
                        return model
                
                pbar.set_description('Epoch {0}: train loss={1}, train accuracy={2}'.format(epoch, train_loss / total,
                                                                                            correct / total))
        
        train_losses.append(train_loss)
        train_accs.append((correct / total))
        
        # validation
        model.eval()
        total, correct = 0, 0
        val_loss = 0.0
        with torch.no_grad():
            with tqdm(enumerate(val_dataloader), total=len(val_dataloader)) as pbar:
                for i, (data, labels) in pbar:
                    data, labels = data.to(device), labels.to(device)

                    pred = model(data).squeeze()
                    out = torch.round(pred)

                    correct = correct + (labels == out).sum()
                    total = total + len(data)

                    val_loss += criterion(pred, labels).detach().cpu()
                    
                    pbar.set_description('Epoch {0}: val loss={1}, val accuracy={2}'.format(epoch, val_loss / total,
                                                                                                correct / total))
        if val_loss < best_val_loss:
            print('New model saved, loss {0} -> {1}'.format(best_val_loss, val_loss))
            best_val_loss = val_loss
            best_model = deepcopy(model)

    return [best_model, train_losses, val_losses, train_accs, val_accs]

In [30]:
mlp_models = random_forst_models.copy()
lr = 1e-4
num_epochs = 100

for ds_name in mlp_models:
    net = nn.Sequential(
        nn.Linear(dataset[ds_name]['X_train'].shape[1], 256),
        nn.LeakyReLU(negative_slope=0.01),
        nn.BatchNorm1d(256),
        nn.Linear(256, 64),
        nn.LeakyReLU(negative_slope=0.01),
        nn.BatchNorm1d(64),
        nn.Linear(64, 32),
        nn.LeakyReLU(negative_slope=0.01),
        nn.BatchNorm1d(32),
        nn.Linear(32, 16),
        nn.LeakyReLU(negative_slope=0.01),
        nn.BatchNorm1d(16),
        nn.Linear(16, 1),
        nn.Sigmoid()
    )
    net = net.to(device)
    optimizer = Adam(net.parameters(), lr=lr)
    criterion = nn.MSELoss()
    train_dl = dataloaders[ds_name]['train']
    val_dl = dataloaders[ds_name]['val']
    test_dl = dataloaders[ds_name]['test']

    return_vals = train(net, criterion, optimizer, train_dl, val_dl, num_epochs)
    net, train_losses, val_losses, train_accs, val_accs = return_vals 
    # test
    acc = evaluate(net, test_dl)
    mlp_models[ds_name] = {
        'model': net,
        'accuracy': acc,
        'train_losses': train_losses,
        'val_losses': val_losses,
        'train_accs': train_accs,
        'val_accs': val_accs
    }

Epoch 0: train loss=0.004064494743943214, train accuracy=0.5353666543960571: 100%|██████████| 25/25 [00:01<00:00, 24.56it/s]
Epoch 0: val loss=0.004318842198699713, val accuracy=0.5: 100%|██████████| 3/3 [00:00<00:00, 657.59it/s]


New model saved, loss inf -> 0.7428408265113831


Epoch 1: train loss=0.0036985548213124275, train accuracy=0.6067488789558411: 100%|██████████| 25/25 [00:00<00:00, 223.63it/s]
Epoch 1: val loss=0.0038636194076389074, val accuracy=0.680232584476471: 100%|██████████| 3/3 [00:00<00:00, 459.47it/s]


New model saved, loss 0.7428408265113831 -> 0.664542555809021


Epoch 2: train loss=0.003556503914296627, train accuracy=0.6391953229904175: 100%|██████████| 25/25 [00:00<00:00, 161.90it/s] 
Epoch 2: val loss=0.0037381132133305073, val accuracy=0.6744186282157898: 100%|██████████| 3/3 [00:00<00:00, 340.23it/s]


New model saved, loss 0.664542555809021 -> 0.642955482006073


Epoch 3: train loss=0.003454444231465459, train accuracy=0.6755353808403015: 100%|██████████| 25/25 [00:00<00:00, 205.41it/s]
Epoch 3: val loss=0.003708320204168558, val accuracy=0.6860465407371521: 100%|██████████| 3/3 [00:00<00:00, 472.15it/s]


New model saved, loss 0.642955482006073 -> 0.6378310918807983


Epoch 4: train loss=0.003233248135074973, train accuracy=0.6891629099845886: 100%|██████████| 25/25 [00:00<00:00, 203.43it/s] 
Epoch 4: val loss=0.003517341101542115, val accuracy=0.6686046719551086: 100%|██████████| 3/3 [00:00<00:00, 80.93it/s]


New model saved, loss 0.6378310918807983 -> 0.6049826741218567


Epoch 5: train loss=0.003174412529915571, train accuracy=0.7235561609268188: 100%|██████████| 25/25 [00:00<00:00, 193.14it/s] 
Epoch 5: val loss=0.0034870896488428116, val accuracy=0.6918604373931885: 100%|██████████| 3/3 [00:00<00:00, 328.45it/s]


New model saved, loss 0.6049826741218567 -> 0.5997794270515442


Epoch 6: train loss=0.003074994310736656, train accuracy=0.7300454378128052: 100%|██████████| 25/25 [00:00<00:00, 215.07it/s]
Epoch 6: val loss=0.0035475760232657194, val accuracy=0.7151162624359131: 100%|██████████| 3/3 [00:00<00:00, 329.07it/s]
Epoch 7: train loss=0.0029827633406966925, train accuracy=0.7566515207290649: 100%|██████████| 25/25 [00:00<00:00, 208.11it/s]
Epoch 7: val loss=0.0033412931952625513, val accuracy=0.6976743936538696: 100%|██████████| 3/3 [00:00<00:00, 338.73it/s]


New model saved, loss 0.5997794270515442 -> 0.5747024416923523


Epoch 8: train loss=0.0029082479886710644, train accuracy=0.7618429660797119: 100%|██████████| 25/25 [00:00<00:00, 154.86it/s]
Epoch 8: val loss=0.0034197531640529633, val accuracy=0.7209302186965942: 100%|██████████| 3/3 [00:00<00:00, 500.24it/s]
Epoch 9: train loss=0.0029281156603246927, train accuracy=0.7631407976150513: 100%|██████████| 25/25 [00:00<00:00, 210.38it/s]
Epoch 9: val loss=0.0033552872482687235, val accuracy=0.7325581312179565: 100%|██████████| 3/3 [00:00<00:00, 479.92it/s]
Epoch 10: train loss=0.0027322375681251287, train accuracy=0.7839065790176392: 100%|██████████| 25/25 [00:00<00:00, 216.72it/s]
Epoch 10: val loss=0.0032182831782847643, val accuracy=0.7383720874786377: 100%|██████████| 3/3 [00:00<00:00, 353.31it/s]


New model saved, loss 0.5747024416923523 -> 0.5535447001457214


Epoch 11: train loss=0.0028104630764573812, train accuracy=0.7806618809700012: 100%|██████████| 25/25 [00:00<00:00, 180.51it/s]
Epoch 11: val loss=0.0032856955658644438, val accuracy=0.75: 100%|██████████| 3/3 [00:00<00:00, 477.78it/s]
Epoch 12: train loss=0.0026331671979278326, train accuracy=0.7858533263206482: 100%|██████████| 25/25 [00:00<00:00, 212.27it/s]
Epoch 12: val loss=0.0033004970755428076, val accuracy=0.7441860437393188: 100%|██████████| 3/3 [00:00<00:00, 558.25it/s]
Epoch 13: train loss=0.0025841824244707823, train accuracy=0.805321216583252: 100%|██████████| 25/25 [00:00<00:00, 212.91it/s] 
Epoch 13: val loss=0.003187190042808652, val accuracy=0.7674418687820435: 100%|██████████| 3/3 [00:00<00:00, 396.14it/s]


New model saved, loss 0.5535447001457214 -> 0.5481966733932495


Epoch 14: train loss=0.0025685965083539486, train accuracy=0.8027254939079285: 100%|██████████| 25/25 [00:00<00:00, 163.51it/s]
Epoch 14: val loss=0.00323846354149282, val accuracy=0.7732558250427246: 100%|██████████| 3/3 [00:00<00:00, 603.99it/s]
Epoch 15: train loss=0.0025164878461509943, train accuracy=0.8254380226135254: 100%|██████████| 25/25 [00:00<00:00, 212.47it/s]
Epoch 15: val loss=0.003154254984110594, val accuracy=0.7616279125213623: 100%|██████████| 3/3 [00:00<00:00, 479.15it/s]


New model saved, loss 0.5481966733932495 -> 0.5425318479537964


Epoch 16: train loss=0.0024187962990254164, train accuracy=0.8234912157058716: 100%|██████████| 25/25 [00:00<00:00, 199.89it/s]
Epoch 16: val loss=0.0031456381548196077, val accuracy=0.7616279125213623: 100%|██████████| 3/3 [00:00<00:00, 504.91it/s]


New model saved, loss 0.5425318479537964 -> 0.5410497784614563


Epoch 17: train loss=0.0024570198729634285, train accuracy=0.8170019388198853: 100%|██████████| 25/25 [00:00<00:00, 155.74it/s]
Epoch 17: val loss=0.0030965828336775303, val accuracy=0.7441860437393188: 100%|██████████| 3/3 [00:00<00:00, 476.64it/s]


New model saved, loss 0.5410497784614563 -> 0.5326122641563416


Epoch 18: train loss=0.002315877005457878, train accuracy=0.8176508545875549: 100%|██████████| 25/25 [00:00<00:00, 212.28it/s] 
Epoch 18: val loss=0.003220517886802554, val accuracy=0.7325581312179565: 100%|██████████| 3/3 [00:00<00:00, 379.21it/s]
Epoch 19: train loss=0.0023259855806827545, train accuracy=0.8215444684028625: 100%|██████████| 25/25 [00:00<00:00, 211.63it/s]
Epoch 19: val loss=0.0031291479244828224, val accuracy=0.7441860437393188: 100%|██████████| 3/3 [00:00<00:00, 506.21it/s]
Epoch 20: train loss=0.002295862417668104, train accuracy=0.8371187448501587: 100%|██████████| 25/25 [00:00<00:00, 157.15it/s] 
Epoch 20: val loss=0.003224814310669899, val accuracy=0.7383720874786377: 100%|██████████| 3/3 [00:00<00:00, 312.11it/s]
Epoch 21: train loss=0.0022710664197802544, train accuracy=0.8325762748718262: 100%|██████████| 25/25 [00:00<00:00, 207.33it/s]
Epoch 21: val loss=0.003105603624135256, val accuracy=0.7732558250427246: 100%|██████████| 3/3 [00:00<00:00, 362.29it/s]
Epo

New model saved, loss 0.5326122641563416 -> 0.5285425186157227


  0%|          | 0/25 [00:00<?, ?it/s]


ValueError: too many values to unpack (expected 5)

In [None]:
for ds_name in mlp_models:
    acc = mlp_models[ds_name]['accuracy']
    print(f'mlp accuracy on {ds_name} dataset had accuracy of {acc:.4f}')

# Model Comparison (10 Points)

Compare different models and different datasets (clinical, gene expressions, and gene reduced expressions) and try to explain their differences.

#### \# TODO