# Machine learning application in cancer prognosis

This project is far from finished, but I would like to use it as an opportunity to demonstrate my skills in Python and machine learning.

**Objective**: Create a machine learning algorithm to predict cancer prognosis for patients with colorectal adenocarcinoma.

**Data**: Genomic and clinical data of the colon adenocarcinoma (COAD) cohort obtained from The Cancer Genome Atlas (TCGA). The results shown here are in whole based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga.



Loading the libraries

In [4]:
import numpy as np
import pandas as pd
import os
import csv
from numpy.random import randn
import matplotlib.pyplot as plt  # the plotting API lives in the pyplot submodule
import seaborn as sns
pd.options.mode.chained_assignment = None

In [5]:
path = os.getcwd()
print(path)
os.chdir('/Users/gv1u14/Desktop/ML')
print(os.getcwd())

/Users/gv1u14/Desktop
/Users/gv1u14/Desktop/ML


Loading the data that was previously pre-processed for a different project. The pre-processing code is not shown here because that analysis was done in R.

datE contains genomic data

datS contains clinical data

In [82]:
datE = pd.read_csv("datE.csv")
datS = pd.read_csv("datS_COAD_341.csv", header=0)

In [108]:
#for col in datS.columns:
#    print(col)

In [84]:
datS.shape

(341, 133)

# Preparing the data for the analysis

TCGA provides a lot of clinical data for each patient; however, the data contain missing values and categorical variables that have to be handled before modelling.

## Numeric variables
Missing values in **numerical continuous variables** (e.g. 'mitoScore', 'age', 'bmi', 'stemness_DNAmeth_based', 'Tumor cell DNA fraction', 'Fraction of genome with subclonal SCNAs', and the MS_stage2 subcategories, selected in the code as columns whose name contains 'MS_') were replaced by the **mean value** of each variable.


In [86]:
list1 = ['mitoScore','age','bmi','stemness_DNAmeth_based','abs_purity','abs_ploidy','WGD','Tumor cell DNA fraction','Fraction of genome with duplicated alleles','CDS','Fraction of genome with subclonal SCNAs','N_homozygous_deletions']
list2 = [col for col in datS.columns if 'MS_' in col]
list3 = list1 + list2
for x in list3:
    datS[[x]] = datS[[x]].fillna(value=datS[[x]].mean())

Missing values in **numerical discrete variables** (i.e. cens, CDKN2A methylation, MLH1 methylation) were replaced by the **median value** of each variable.

In [87]:
list4 = ['cens','CDKN2A methylation','MLH1 methylation']
for x in list4:
    datS[[x]] = datS[[x]].fillna(value=datS[[x]].median())

### Removing variables

In this analysis, the progression-free interval (PFI_yrs) will be used as the dependent variable; therefore, patients with unknown PFI time are removed.

The following columns were removed from datS because of a >90% proportion of NAs, low informative value, or duplicated information: 'project', 'weight', 'height', 'bmi_bin', 'Country', 'Geographic Region', 'Organ', 'Gastric histological classification', 'Anatomic Region', 'time_yrs', 'Days to last known alive', 'Left or right colon', 'Vital status.1', 'Pathologic Stage'.


In [88]:
datS = datS.dropna(subset = ['PFI_yrs'])

In [89]:
col = ['project','weight','height','bmi_bin','Country','Geographic Region','Organ','Gastric histological classification','Anatomic Region','time_yrs', 'Days to last known alive','Left or right colon','Vital status.1','Pathologic Stage']
datS = datS.drop(col, axis=1)
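
The ">90% NA" criterion mentioned above can be checked programmatically rather than by eye. The following is a minimal sketch on a toy DataFrame; the frame `toy`, its values, and the names `na_fraction` and `mostly_missing` are made up for illustration and are not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy stand-in for datS: 20 hypothetical patients with made-up values.
toy = pd.DataFrame({
    "age": [np.nan] + list(range(41, 60)),       # 1 of 20 values missing
    "weight": [82.0] + [np.nan] * 19,            # 19 of 20 values missing
    "PFI_yrs": [0.5 * i for i in range(1, 21)],  # complete
})

# Fraction of missing values per column (mean of a boolean mask).
na_fraction = toy.isna().mean()

# Columns whose missing fraction exceeds the 90% threshold.
mostly_missing = na_fraction[na_fraction > 0.9].index.tolist()

# Drop them, leaving the better-populated columns.
toy_reduced = toy.drop(columns=mostly_missing)
print(na_fraction.sort_values(ascending=False))
```

Applied to the real table, the same two lines (`isna().mean()` followed by a threshold filter) would list the candidate columns before they are dropped by hand.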

## Categorical variables
After all numerical NAs were handled, missing values of categorical variables were replaced by the most frequent value.

In [91]:
datS = datS.fillna(datS.mode().iloc[0])
datS.isnull().sum()
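
To make the mode-based imputation above easier to follow, here is a small sketch on made-up data. `DataFrame.mode()` returns one row per modal value, so `.iloc[0]` picks the first mode of each column; when several values tie, this is the first one in sorted order. The frame `toy` and its contents are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical columns with one missing value each.
toy = pd.DataFrame({
    "Race": ["white", "white", np.nan, "asian"],
    "Gender": ["male", np.nan, "female", "female"],
})

# First row of mode() holds the most frequent value of every column.
most_frequent = toy.mode().iloc[0]

# fillna with a Series fills each column with its matching entry.
filled = toy.fillna(most_frequent)
print(filled)
```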

### Handling ordinal categorical variables

**Ordinal** categorical variables (i.e. variables whose categories can be ordered): *Pathologic T, Pathologic N, Pathologic M, Stage*

In [92]:
print(datS['Pathologic T'].unique())

['T2' 'T4a' 'T3' 'T4b' 'T4' 'T1' 'Tis']


Because pathological T4 is subdivided into T4a and T4b, I will replace these two subclasses with T4. This might lead to some loss of information, but it makes more biological sense than having three T4 classes. The same process will be applied to the pathological N and M variables.

In [93]:
datS['Pathologic T'] = datS['Pathologic T'].replace(to_replace=['T4a','T4b'],value='T4')

T_map = {'Tis':0,'T1':1,'T2':2,'T3':3,'T4':4}
datS['Pathologic T'] = datS['Pathologic T'].map(T_map)

In [94]:
print(datS['Pathologic N'].unique())

['N0' 'N1b' 'N2b' 'N2' 'N1' 'N2a' 'N1a' 'N1c' 'NX']


In [95]:
datS['Pathologic N'] = datS['Pathologic N'].replace(to_replace=['N1a','N1b','N1c'],value='N1')
datS['Pathologic N'] = datS['Pathologic N'].replace(to_replace=['N2a','N2b'],value='N2')

N_map = {'NX':0,'N0':1,'N1':2,'N2':3}
datS['Pathologic N'] = datS['Pathologic N'].map(N_map)

In [96]:
print(datS['Pathologic M'].unique())

['M0' 'MX' 'M1b' 'M1a' 'M1']


In [97]:
datS['Pathologic M'] = datS['Pathologic M'].replace(to_replace=['M1a','M1b'],value='M1')

M_map = {'MX':0,'M0':1,'M1':2}
datS['Pathologic M'] = datS['Pathologic M'].map(M_map)
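
The collapse-then-map pattern used for the T, N and M variables could also be wrapped in one helper. This is only a sketch: the function `encode_ordinal` and the series `toy_T` are hypothetical names, and the regular expression assumes that substages are always a trailing a, b or c directly after a digit (so 'Tis', 'NX' and 'MX' are left untouched).

```python
import pandas as pd

def encode_ordinal(series, order):
    """Collapse substages such as 'T4a' to 'T4', then map labels to ranks.

    `order` lists the collapsed labels from lowest to highest rank.
    Labels not present in `order` become NaN.
    """
    # Remove a trailing a/b/c only when it directly follows a digit.
    collapsed = series.str.replace(r"(?<=\d)[a-c]$", "", regex=True)
    ranks = {label: rank for rank, label in enumerate(order)}
    return collapsed.map(ranks)

# Same labels as printed for 'Pathologic T' above.
toy_T = pd.Series(["T2", "T4a", "T3", "T4b", "T4", "T1", "Tis"])
encoded_T = encode_ordinal(toy_T, ["Tis", "T1", "T2", "T3", "T4"])
print(encoded_T.tolist())
```

With the order `['Tis', 'T1', 'T2', 'T3', 'T4']` the helper reproduces the integers assigned by `T_map`; the N and M orders would be passed in the same way.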

In [98]:
print(datS['Stage'].unique())

['I' 'III' 'II' 'IV']


In [99]:
Stage_map = {'I':0,'II':1,'III':2,'IV':3}
datS['Stage'] = datS['Stage'].map(Stage_map)

### Handling nominal categorical variables

**Nominal** categorical variables: *'Race', 'Gender', 'Hypermethylation category', 'anatomic_site', 'colorectal_CMS', 'Molecular_Subtype', 'MSI Status', 'Vital status'*

One-hot encoding will be used to handle the nominal categorical variables.

In [103]:
# 'Vital status' is listed only once; a duplicate entry would yield duplicated dummy columns
nom_cat = ['Race','Gender','MSI Status','Molecular_Subtype','Vital status','colorectal_CMS','Hypermethylation category','anatomic_site']
one_hot = pd.get_dummies(datS[nom_cat],drop_first=True)

In [104]:
datS = datS.drop(nom_cat,axis = 1)
datS = datS.join(one_hot)
datS.shape
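
As a small illustration of what `pd.get_dummies(..., drop_first=True)` does, the sketch below encodes two made-up columns. With `drop_first=True`, a variable with k levels yields k-1 indicator columns, and the first level in sorted order acts as the reference category. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical nominal variables: 2 levels and 3 levels.
toy = pd.DataFrame({
    "Gender": ["male", "female", "female"],
    "MSI Status": ["MSS", "MSI-H", "MSI-L"],
})

# One indicator column per level, minus the first (reference) level.
dummies = pd.get_dummies(toy, drop_first=True)
print(list(dummies.columns))
```

Dropping the reference level avoids perfectly collinear indicator columns, which matters for linear models; tree-based models are usually less sensitive to this choice.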