# Flatiron Health mCRC: Cox model build

**OBJECTIVE: Build a Cox model inspired by the model described in "Development and validation of risk prediction equations to estimate survival in patients with colorectal cancer: cohort study" by J. Hippisley-Cox et al. in BMJ (2017).**

**BACKGROUND: This model is a classical Cox model published in 2017 that predicts overall survival following colorectal cancer diagnosis at any stage using a number of clinical variables. Discriminatory performance was validated on an external dataset and resulted in a concordance index of 0.80 over 5 years. Missingness was resolved using multiple imputation by chained equations (MICE).**

**OUTLINE:**
1. **Preprocessing**
2. **Crude imputation** 
3. **Complete cases**
4. **Complete cases using only patients with de novo metastatic disease**

## 1. Preprocessing 

**The published Cox model includes the following variables:**
1. Age 
2. Deprivation score
3. Cancer stage
4. Cancer grade 
5. Smoking status
6. Colorectal surgery within a year of diagnosis
7. Chemotherapy within a year of diagnosis
8. Family history of bowel cancer
9. Abnormal platelet count
10. Abnormal liver function 
11. Cardiovascular disease
12. Diabetes
13. Chronic renal disease
14. Chronic obstructive pulmonary disease
15. Prescribed aspirin at diagnosis 
16. Prescribed statin at diagnosis 

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Function that returns number of rows and count of unique PatientIDs for a dataframe. 
def row_ID(dataframe):
    row = dataframe.shape[0]
    ID = dataframe['PatientID'].nunique()
    return row, ID

In [3]:
train = pd.read_csv('train_full.csv')
print(len(train), train.PatientID.is_unique)

27452 True


In [4]:
test = pd.read_csv('test_full.csv')
print(len(test), test.PatientID.is_unique)

6863 True


### Training set

#### Age

Age at diagnosis. No preprocessing is required.

#### Deprivation score

The Townsend deprivation score is an area-level score based on the patient's postcode that reflects their socioeconomic status. It is a numeric score that ranges from 1 to 5, with 1 being the most affluent and 5 being the most deprived. The Townsend score includes unemployment, non-car ownership, non-home ownership, and household overcrowding. We will use Flatiron's SES index as a proxy for the Townsend deprivation score. Like the Townsend deprivation score, Flatiron's SES index is an area-level score, also 1 through 5, based on a patient's address. It takes into account median household income, the percentage of individuals living below the poverty line, and the percentage of individuals who are unemployed, in addition to a few other variables.

#### Cancer stage 

Cancer stage is classified using TNM classification (version 7). No further preprocessing is required.

#### Cancer grade 

Cancer grade is classified as well differentiated, moderately differentiated, poorly differentiated, undifferentiated, or not recorded. Cancer grade is not provided in the Flatiron dataset, so this variable will not be included.

#### Smoking status

Smoking status is defined as most recent smoking status before diagnosis. Smoking status is not provided in the Flatiron dataset and so this variable will not be included.

#### Colorectal surgery within a year of diagnosis

Surgery status is not included in the Flatiron dataset. It is unlikely to be relevant for patients diagnosed at the time of metastatic disease, who make up about 60% of our dataset. A sensitivity analysis including only those diagnosed at the time of metastatic disease will be performed.

#### Chemotherapy within a year of diagnosis

We will use receipt of adjuvant therapy (`adjuv`) as a proxy for this variable. The goal of our machine learning model is to predict survival from the time of metastatic diagnosis, before therapy is selected, so we do not include a separate variable identifying patients who were treated with chemotherapy.

#### Family history of bowel cancer

Flatiron did not provide family history of bowel cancer, so this variable will not be included in the model.

#### Abnormal platelet count

Abnormal platelet count is defined as a value > 480 × 10⁹/L closest to the time of diagnosis.

In [5]:
train['raised_plt'] = np.where(train['platelet_diag'] > 480, 1, 0)

#### Abnormal liver function

An abnormal liver function test result is defined as γ glutamyltransferase, alanine aminotransferase, or bilirubin more than three times the upper limit of normal, based on the value closest to cancer diagnosis. Below are the thresholds used for three times the upper limit of normal:
- γ glutamyltransferase: 150
- alanine aminotransferase: 75
- bilirubin: 3

In [6]:
lab = pd.read_csv('Lab.csv')

In [7]:
# Assign the column directly so it takes the datetime64 dtype (needed for .dt below).
lab['ResultDate'] = pd.to_datetime(lab['ResultDate'])

In [8]:
lab = lab[lab['PatientID'].isin(train.PatientID)]

In [9]:
enhanced_met = pd.read_csv('Enhanced_MetastaticCRC.csv')

In [10]:
enhanced_met['MetDiagnosisDate'] = pd.to_datetime(enhanced_met['MetDiagnosisDate'])

In [11]:
enhanced_met = enhanced_met[enhanced_met['PatientID'].isin(train.PatientID)] 

In [12]:
lab = pd.merge(lab, enhanced_met[['PatientID', 'MetDiagnosisDate']], on = 'PatientID', how = 'left')

In [13]:
# Keep only gamma glutamyl transferase results (LOINC 2324-2).
lab = lab.query('LOINC == "2324-2"')

In [14]:
lab_win = (
    lab
    .assign(lab_date_diff = (lab['ResultDate'] - lab['MetDiagnosisDate']).dt.days)
    .query('lab_date_diff >= -90 and lab_date_diff <= 30')
    .filter(items = ['PatientID', 'TestBaseName', 'ResultDate', 'TestResultCleaned', 'MetDiagnosisDate', 'lab_date_diff'])
)

In [15]:
lab_win.loc[:, 'lab_date_diff'] = lab_win['lab_date_diff'].abs()

In [16]:
lab_wide = (
    lab_win.loc[lab_win.groupby('PatientID')['lab_date_diff'].idxmin()]
    .pivot(index = 'PatientID', columns = 'TestBaseName', values = 'TestResultCleaned')
    .reset_index()
    .rename(columns = {'gamma glutamyl transferase': 'ggt_diag'})
)

lab_wide.columns.name = None
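
The window-and-closest-value selection above can be illustrated on a toy table. This is a minimal sketch with made-up patients and values (not Flatiron data); the column names mirror the ones used in this notebook, and `abs_diff` is a hypothetical helper column.

```python
import pandas as pd

# Toy lab results (hypothetical values, not Flatiron data).
labs = pd.DataFrame({
    'PatientID': ['A', 'A', 'A', 'B', 'B'],
    'ResultDate': pd.to_datetime(['2020-01-01', '2020-03-20', '2020-06-01',
                                  '2019-01-01', '2020-02-10']),
    'TestResultCleaned': [40.0, 160.0, 55.0, 30.0, 35.0],
})
met_dates = pd.DataFrame({
    'PatientID': ['A', 'B'],
    'MetDiagnosisDate': pd.to_datetime(['2020-04-01', '2020-02-01']),
})

labs = labs.merge(met_dates, on='PatientID', how='left')
labs['lab_date_diff'] = (labs['ResultDate'] - labs['MetDiagnosisDate']).dt.days

# Keep results from 90 days before to 30 days after the metastatic diagnosis date.
in_window = labs.query('lab_date_diff >= -90 and lab_date_diff <= 30').copy()

# Among eligible results, keep the one with the smallest absolute gap per patient.
in_window['abs_diff'] = in_window['lab_date_diff'].abs()
closest = in_window.loc[in_window.groupby('PatientID')['abs_diff'].idxmin()]
closest_value = closest.set_index('PatientID')['TestResultCleaned'].to_dict()
print(closest_value)
```

Patients with no result inside the window simply do not appear in the output, which is why the later merge back onto the cohort leaves their lab value missing.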

In [17]:
train = pd.merge(train, lab_wide, on = 'PatientID', how = 'outer')

In [18]:
train['ab_lft'] = np.where((train['ggt_diag'] > 150) | 
                           (train['alt_diag'] > 75) | 
                           (train['total_bilirubin_diag'] > 3), 1, 0)

#### Cardiovascular disease, diabetes, chronic renal disease, and chronic obstructive pulmonary disease

The above chronic medical conditions are not explicitly defined in the BMJ article. Patients in the Flatiron dataset will be labeled as having one of the above diseases if they have an ICD code in their chart that maps to that disease.
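
As a rough illustration of the code-to-condition mapping, the sketch below flags conditions by matching code prefixes on a toy table. The patients, codes, and the shortened `patterns` dictionary are hypothetical, and for brevity it mixes ICD-9 and ICD-10 prefixes in one pattern, whereas the cells that follow handle the two code systems separately.

```python
import numpy as np
import pandas as pd

# Toy diagnosis rows (hypothetical patients and codes).
dx = pd.DataFrame({
    'PatientID': ['A', 'A', 'B', 'C'],
    'DiagnosisCode': ['250.00', '585.9', 'I10', 'C18.9'],
})

# Strip the decimal point so prefixes can be matched directly.
dx['diagnosis_code'] = dx['DiagnosisCode'].str.replace('.', '', regex=False)

# Illustrative prefix patterns only; Series.str.match anchors at the start of the string.
patterns = {
    'diabetes': r'250|E1[0123]',
    'crd': r'585|N18',
    'cardiovascular': r'4[01234]|I',
}
for name, pattern in patterns.items():
    dx[name] = np.where(dx['diagnosis_code'].str.match(pattern), 1, 0)

# One row per patient: 1 if any of the patient's codes matched, else 0.
flags = dx.groupby('PatientID')[list(patterns)].max()
print(flags)
```

Taking the per-patient maximum yields a 0/1 indicator directly, which is equivalent to summing the flags and then capping values above 1.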

In [19]:
diagnosis = pd.read_csv('Diagnosis.csv')

In [20]:
# Remove the decimal point so codes can be matched by prefix (raw string avoids an invalid-escape warning).
diagnosis.loc[:, 'diagnosis_code'] = diagnosis['DiagnosisCode'].replace(r'\.', '', regex = True)

In [21]:
diagnosis['DiagnosisDate'] = pd.to_datetime(diagnosis['DiagnosisDate'])

In [22]:
diagnosis = diagnosis[diagnosis['PatientID'].isin(train.PatientID)]

In [23]:
diagnosis = pd.merge(diagnosis, enhanced_met[['PatientID', 'MetDiagnosisDate']], on = 'PatientID', how = 'left')

In [24]:
diagnosis.loc[:, 'diagnosis_date_diff'] = (diagnosis['DiagnosisDate'] - diagnosis['MetDiagnosisDate']).dt.days

#### ICD-9

In [25]:
diagnosis_9 = (
    diagnosis
    .query('diagnosis_date_diff <= 30')
    .query('DiagnosisCodeSystem == "ICD-9-CM"')
    .drop_duplicates(subset = (['PatientID', 'DiagnosisCode']), keep = 'first')
    .filter(items = ['PatientID', 'DiagnosisCode', 'diagnosis_code'])
)

In [26]:
diagnosis_9.loc[:, 'cardiovascular'] = (
    np.where(diagnosis_9['diagnosis_code'].str.match('39[012345678]|'
                                                     '4[01234]'), 1, 0)
)

In [27]:
diagnosis_9.loc[:, 'diabetes'] = (
    np.where(diagnosis_9['diagnosis_code'].str.match('250'), 1, 0)
)

In [28]:
diagnosis_9.loc[:, 'crd'] = (
    np.where(diagnosis_9['diagnosis_code'].str.match('585|582'), 1, 0)
)

In [29]:
diagnosis_9.loc[:, 'copd'] = (
    np.where(diagnosis_9['diagnosis_code'].str.match('49[0123456]'), 1, 0)
)

In [30]:
diagnosis_9_wide = (
    diagnosis_9
    .drop(columns = ['DiagnosisCode', 'diagnosis_code'])
    .groupby('PatientID').sum()
    .reset_index()
)

#### ICD-10

In [31]:
diagnosis_10 = (
    diagnosis
    .query('diagnosis_date_diff <= 30')
    .query('DiagnosisCodeSystem == "ICD-10-CM"')
    .drop_duplicates(subset = (['PatientID', 'DiagnosisCode']), keep = 'first')
    .filter(items = ['PatientID', 'DiagnosisCode', 'diagnosis_code'])
)

In [32]:
diagnosis_10.loc[:, 'cardiovascular'] = (
    np.where(diagnosis_10['diagnosis_code'].str.match('I'), 1, 0)
)

In [33]:
diagnosis_10.loc[:, 'diabetes'] = (
    np.where(diagnosis_10['diagnosis_code'].str.match('E0[89]|'
                                                            'E1[0123]'), 1, 0)
)

In [34]:
diagnosis_10.loc[:, 'crd'] = (
    np.where(diagnosis_10['diagnosis_code'].str.match('N03|N18'), 1, 0)
)

In [35]:
diagnosis_10.loc[:, 'copd'] = (
    np.where(diagnosis_10['diagnosis_code'].str.match('J4[01234567]'), 1, 0)
)

In [36]:
diagnosis_10_wide = (
    diagnosis_10
    .drop(columns = ['DiagnosisCode', 'diagnosis_code'])
    .groupby('PatientID').sum()
    .reset_index()
)

In [37]:
# Merge ICD 9 and 10 codes and sum by PatientID.
diagnosis_pmh = (
    pd.concat([diagnosis_9_wide, diagnosis_10_wide])
    .groupby('PatientID').sum()
)

In [38]:
diagnosis_pmh = diagnosis_pmh.mask(diagnosis_pmh >1, 1)

In [39]:
train = pd.merge(train, diagnosis_pmh, on = 'PatientID', how = 'outer')

In [40]:
train[['cardiovascular', 'diabetes', 'crd', 'copd']] = train[['cardiovascular', 'diabetes', 'crd', 'copd']].fillna(0)

#### Prescribed aspirin or statin at diagnosis 

A variable will be created to capture patients with a recorded aspirin administration at any time up to 30 days after the metastatic diagnosis date. While Flatiron captures aspirin use, it does not capture receipt of statins, so no indicator variable will be created for statin use.

In [41]:
med_admin = pd.read_csv('MedicationAdministration.csv')

In [42]:
med_admin['AdministeredDate'] = pd.to_datetime(med_admin['AdministeredDate'])

In [43]:
med_admin = med_admin[med_admin['PatientID'].isin(train.PatientID)]

In [44]:
med_admin = pd.merge(med_admin, enhanced_met[['PatientID', 'MetDiagnosisDate']], on = 'PatientID', how = 'left')

In [45]:
med_admin = (
    med_admin
    .assign(med_admin_diff = (med_admin['AdministeredDate'] - med_admin['MetDiagnosisDate']).dt.days)
    .query('med_admin_diff <= 30')
)

In [46]:
med_admin.loc[:, 'aspirin'] = np.where(med_admin['DrugName'] == 'aspirin', 1, 0)

In [47]:
med_admin_aspirin = (
    med_admin
    .filter(items = ['PatientID', 'aspirin'])
    .groupby('PatientID').sum()
    .reset_index()
)
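
Note that summing the per-administration flag gives a count of aspirin administrations per patient rather than a 0/1 indicator. The toy sketch below (hypothetical patients and drug names) contrasts the two aggregations; which one is appropriate for the model is a design choice.

```python
import pandas as pd

# Toy medication administrations (hypothetical).
meds = pd.DataFrame({
    'PatientID': ['A', 'A', 'A', 'B', 'C'],
    'DrugName': ['aspirin', 'aspirin', 'ondansetron', 'ondansetron', 'aspirin'],
})
meds['aspirin'] = (meds['DrugName'] == 'aspirin').astype(int)

per_patient = meds.groupby('PatientID')['aspirin']
counts = per_patient.sum()    # number of aspirin administrations
ever = per_patient.max()      # 1 if any aspirin administration, else 0
print(pd.DataFrame({'count': counts, 'ever': ever}))
```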

In [48]:
row_ID(train)

(27452, 27452)

In [49]:
train = pd.merge(train, med_admin_aspirin, on = 'PatientID', how = 'left')

In [50]:
row_ID(train)

(27452, 27452)
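
The before-and-after `row_ID` checks confirm that the merge did not duplicate patients. A complementary option, sketched here on toy data with hypothetical frames, is to let pandas enforce this through the `validate` argument of `pd.merge`.

```python
import pandas as pd

cohort = pd.DataFrame({'PatientID': ['A', 'B', 'C'], 'age': [61, 72, 55]})
flags = pd.DataFrame({'PatientID': ['A', 'C'], 'aspirin': [1, 1]})

# validate='one_to_one' raises pandas.errors.MergeError if PatientID repeats on either side.
merged = pd.merge(cohort, flags, on='PatientID', how='left', validate='one_to_one')
merged['aspirin'] = merged['aspirin'].fillna(0)

# A left merge against a one-row-per-patient table keeps the cohort's row count.
assert len(merged) == len(cohort)

# A right-hand table with a repeated PatientID is rejected.
dup_flags = pd.concat([flags, flags.iloc[[0]]], ignore_index=True)
try:
    pd.merge(cohort, dup_flags, on='PatientID', how='left', validate='one_to_one')
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print(len(merged), duplicate_detected)
```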

In [51]:
train[['aspirin']] = train[['aspirin']].fillna(0)

### Test set

#### Abnormal platelet count

In [52]:
test['raised_plt'] = np.where(test['platelet_diag'] > 480, 1, 0)

#### Abnormal liver function

In [53]:
lab = pd.read_csv('Lab.csv')

In [54]:
lab['ResultDate'] = pd.to_datetime(lab['ResultDate'])

In [55]:
lab = lab[lab['PatientID'].isin(test.PatientID)]

In [56]:
enhanced_met = pd.read_csv('Enhanced_MetastaticCRC.csv')

In [57]:
enhanced_met['MetDiagnosisDate'] = pd.to_datetime(enhanced_met['MetDiagnosisDate'])

In [58]:
enhanced_met = enhanced_met[enhanced_met['PatientID'].isin(test.PatientID)] 

In [59]:
lab = pd.merge(lab, enhanced_met[['PatientID', 'MetDiagnosisDate']], on = 'PatientID', how = 'left')

In [60]:
# Keep only gamma glutamyl transferase results (LOINC 2324-2).
lab = lab.query('LOINC == "2324-2"')

In [61]:
lab_win = (
    lab
    .assign(lab_date_diff = (lab['ResultDate'] - lab['MetDiagnosisDate']).dt.days)
    .query('lab_date_diff >= -90 and lab_date_diff <= 30')
    .filter(items = ['PatientID', 'TestBaseName', 'ResultDate', 'TestResultCleaned', 'MetDiagnosisDate', 'lab_date_diff'])
)

In [62]:
lab_win.loc[:, 'lab_date_diff'] = lab_win['lab_date_diff'].abs()

In [63]:
lab_wide = (
    lab_win.loc[lab_win.groupby('PatientID')['lab_date_diff'].idxmin()]
    .pivot(index = 'PatientID', columns = 'TestBaseName', values = 'TestResultCleaned')
    .reset_index()
    .rename(columns = {'gamma glutamyl transferase': 'ggt_diag'})
)

lab_wide.columns.name = None

In [64]:
test = pd.merge(test, lab_wide, on = 'PatientID', how = 'outer')

In [65]:
test['ab_lft'] = np.where((test['ggt_diag'] > 150) |
                          (test['alt_diag'] > 75) | 
                          (test['total_bilirubin_diag'] > 3), 1, 0)

#### Cardiovascular disease, diabetes, chronic renal disease, and chronic obstructive pulmonary disease

In [66]:
diagnosis = pd.read_csv('Diagnosis.csv')

In [67]:
# Remove the decimal point so codes can be matched by prefix.
diagnosis.loc[:, 'diagnosis_code'] = diagnosis['DiagnosisCode'].replace(r'\.', '', regex = True)

In [68]:
diagnosis['DiagnosisDate'] = pd.to_datetime(diagnosis['DiagnosisDate'])

In [69]:
diagnosis = diagnosis[diagnosis['PatientID'].isin(test.PatientID)]

In [70]:
diagnosis = pd.merge(diagnosis, enhanced_met[['PatientID', 'MetDiagnosisDate']], on = 'PatientID', how = 'left')

In [71]:
diagnosis.loc[:, 'diagnosis_date_diff'] = (diagnosis['DiagnosisDate'] - diagnosis['MetDiagnosisDate']).dt.days

#### ICD-9

In [72]:
diagnosis_9 = (
    diagnosis
    .query('diagnosis_date_diff <= 30')
    .query('DiagnosisCodeSystem == "ICD-9-CM"')
    .drop_duplicates(subset = (['PatientID', 'DiagnosisCode']), keep = 'first')
    .filter(items = ['PatientID', 'DiagnosisCode', 'diagnosis_code'])
)

In [73]:
diagnosis_9.loc[:, 'cardiovascular'] = (
    np.where(diagnosis_9['diagnosis_code'].str.match('39[012345678]|'
                                                     '4[01234]'), 1, 0)
)

In [74]:
diagnosis_9.loc[:, 'diabetes'] = (
    np.where(diagnosis_9['diagnosis_code'].str.match('250'), 1, 0)
)

In [75]:
diagnosis_9.loc[:, 'crd'] = (
    np.where(diagnosis_9['diagnosis_code'].str.match('585|582'), 1, 0)
)

In [76]:
diagnosis_9.loc[:, 'copd'] = (
    np.where(diagnosis_9['diagnosis_code'].str.match('49[0123456]'), 1, 0)
)

In [77]:
diagnosis_9_wide = (
    diagnosis_9
    .drop(columns = ['DiagnosisCode', 'diagnosis_code'])
    .groupby('PatientID').sum()
    .reset_index()
)

#### ICD-10

In [78]:
diagnosis_10 = (
    diagnosis
    .query('diagnosis_date_diff <= 30')
    .query('DiagnosisCodeSystem == "ICD-10-CM"')
    .drop_duplicates(subset = (['PatientID', 'DiagnosisCode']), keep = 'first')
    .filter(items = ['PatientID', 'DiagnosisCode', 'diagnosis_code'])
)

In [79]:
diagnosis_10.loc[:, 'cardiovascular'] = (
    np.where(diagnosis_10['diagnosis_code'].str.match('I'), 1, 0)
)

In [80]:
diagnosis_10.loc[:, 'diabetes'] = (
    np.where(diagnosis_10['diagnosis_code'].str.match('E0[89]|'
                                                            'E1[0123]'), 1, 0)
)

In [81]:
diagnosis_10.loc[:, 'crd'] = (
    np.where(diagnosis_10['diagnosis_code'].str.match('N03|N18'), 1, 0)
)

In [82]:
diagnosis_10.loc[:, 'copd'] = (
    np.where(diagnosis_10['diagnosis_code'].str.match('J4[01234567]'), 1, 0)
)

In [83]:
diagnosis_10_wide = (
    diagnosis_10
    .drop(columns = ['DiagnosisCode', 'diagnosis_code'])
    .groupby('PatientID').sum()
    .reset_index()
)

In [84]:
# Merge ICD 9 and 10 codes and sum by PatientID.
diagnosis_pmh = (
    pd.concat([diagnosis_9_wide, diagnosis_10_wide])
    .groupby('PatientID').sum()
)

In [85]:
diagnosis_pmh = diagnosis_pmh.mask(diagnosis_pmh >1, 1)

In [86]:
test = pd.merge(test, diagnosis_pmh, on = 'PatientID', how = 'outer')

In [87]:
test[['cardiovascular', 'diabetes', 'crd', 'copd']] = test[['cardiovascular', 'diabetes', 'crd', 'copd']].fillna(0)

#### Prescribed aspirin or statin at diagnosis 

In [88]:
med_admin = pd.read_csv('MedicationAdministration.csv')

In [89]:
med_admin['AdministeredDate'] = pd.to_datetime(med_admin['AdministeredDate'])

In [90]:
med_admin = med_admin[med_admin['PatientID'].isin(test.PatientID)]

In [91]:
med_admin = pd.merge(med_admin, enhanced_met[['PatientID', 'MetDiagnosisDate']], on = 'PatientID', how = 'left')

In [92]:
med_admin = (
    med_admin
    .assign(med_admin_diff = (med_admin['AdministeredDate'] - med_admin['MetDiagnosisDate']).dt.days)
    .query('med_admin_diff <= 30')
)

In [93]:
med_admin.loc[:, 'aspirin'] = np.where(med_admin['DrugName'] == 'aspirin', 1, 0)

In [94]:
med_admin_aspirin = (
    med_admin
    .filter(items = ['PatientID', 'aspirin'])
    .groupby('PatientID').sum()
    .reset_index()
)

In [95]:
row_ID(test)

(6863, 6863)

In [96]:
test = pd.merge(test, med_admin_aspirin, on = 'PatientID', how = 'left')

In [97]:
row_ID(test)

(6863, 6863)

In [98]:
test[['aspirin']] = test[['aspirin']].fillna(0)

## 2. Crude imputation

There are 4 variables in the model that have missing values: (1) stage, (2) deprivation score, (3) abnormal platelet count, and (4) abnormal liver function. An "unknown" category will be created for those with unknown stage or deprivation score. Patients with a missing platelet value or missing liver function tests will be assumed to be normal.
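
The "assumed to be normal" rule follows from how the indicator columns were built: a comparison against a missing value is False, so `np.where` codes it as 0. A minimal sketch on toy data, where `raised_plt_or_nan` is a hypothetical column shown only for contrast:

```python
import numpy as np
import pandas as pd

labs = pd.DataFrame({'platelet_diag': [520.0, 300.0, np.nan]})

# NaN > 480 evaluates to False, so a missing platelet count is coded 0 (normal).
labs['raised_plt'] = np.where(labs['platelet_diag'] > 480, 1, 0)

# Hypothetical alternative: keep missingness visible instead of folding it into 0.
labs['raised_plt_or_nan'] = labs['raised_plt'].where(labs['platelet_diag'].notna())
print(labs)
```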

In [99]:
from sksurv.linear_model import CoxPHSurvivalAnalysis

from sksurv.metrics import cumulative_dynamic_auc

In [100]:
# Dummy-encode a categorical variable and drop the original column.
def encode_and_bind(original_dataframe, feature_to_encode):
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis = 1)
    res = res.drop([feature_to_encode], axis = 1)
    return res
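
A short usage sketch of this helper on a toy frame (hypothetical values), including the step of dropping one dummy column so that level acts as the reference category:

```python
import pandas as pd

def encode_and_bind(original_dataframe, feature_to_encode):
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis = 1)
    return res.drop([feature_to_encode], axis = 1)

toy = pd.DataFrame({'age': [60, 71, 48],
                    'stage': ['II', 'IV', 'unknown']})

encoded = encode_and_bind(toy, 'stage')
# Drop one level so it serves as the reference category.
encoded = encoded.drop(columns = ['stage_unknown'])
print(list(encoded.columns))
```

Leaving all dummy columns in would make them sum to 1 for every row, which is collinear with the baseline hazard in a Cox model, so one level is removed.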

### Processing X 

#### Select relevant variables 

In [101]:
train_cox_x = (
    train
    .filter(items = [
        'PatientID',
        'age',
        'ses',
        'stage',
        'adjuv',
        'raised_plt',
        'ab_lft',
        'cardiovascular',
        'diabetes',
        'crd',
        'copd',
        'aspirin']))

In [102]:
train_cox_x = train_cox_x.set_index('PatientID')

In [103]:
train_cox_x.shape

(27452, 11)

In [104]:
test_cox_x = (
    test
    .filter(items = [
        'PatientID',
        'age',
        'ses',
        'stage',
        'adjuv',
        'raised_plt',
        'ab_lft',
        'cardiovascular',
        'diabetes',
        'crd',
        'copd',
        'aspirin']))

In [105]:
test_cox_x = test_cox_x.set_index('PatientID')

In [106]:
test_cox_x.shape

(6863, 11)

#### Impute "unknown" for missing deprivation score (i.e., SES)

In [107]:
train_cox_x[['ses']] = train_cox_x[['ses']].fillna('unknown')

In [108]:
test_cox_x[['ses']] = test_cox_x[['ses']].fillna('unknown')

#### Convert relevant variables to categorical 

In [109]:
list(train_cox_x.select_dtypes(include = ['object']).columns)

['ses', 'stage']

In [110]:
to_be_categorical = list(train_cox_x.select_dtypes(include = ['object']).columns)

In [111]:
to_be_categorical.extend([
    'ses',
    'adjuv',
    'raised_plt',
    'ab_lft',
    'cardiovascular',
    'diabetes',
    'crd',
    'copd',
    'aspirin'])

In [112]:
for x in list(to_be_categorical):
    train_cox_x[x] = train_cox_x[x].astype('category')

In [113]:
train_cox_x.dtypes

age                  int64
ses               category
stage             category
adjuv             category
raised_plt        category
ab_lft            category
cardiovascular    category
diabetes          category
crd               category
copd              category
aspirin           category
dtype: object

In [114]:
for x in list(to_be_categorical):
    test_cox_x[x] = test_cox_x[x].astype('category')

#### Dummy encode categorical variables 

In [115]:
# Dummy variables for ses
train_cox_x = encode_and_bind(train_cox_x, 'ses')
train_cox_x = train_cox_x.drop(columns = ['ses_unknown'])

test_cox_x = encode_and_bind(test_cox_x, 'ses')
test_cox_x = test_cox_x.drop(columns = ['ses_unknown'])

# Dummy variables for stage 
train_cox_x = encode_and_bind(train_cox_x, 'stage')
train_cox_x = train_cox_x.drop(columns = ['stage_unknown'])

test_cox_x = encode_and_bind(test_cox_x, 'stage')
test_cox_x = test_cox_x.drop(columns = ['stage_unknown'])

In [116]:
print(train_cox_x.shape)
print(test_cox_x.shape)

(27452, 19)
(6863, 19)


### Processing Y

In [117]:
# Convert death_status into True or False (required for scikit-survival). 
train['death_status'] = train['death_status'].astype('bool')

In [118]:
y_dtypes = train[['death_status', 'timerisk_activity']].dtypes

train_cox_y = np.array([tuple(x) for x in train[['death_status', 'timerisk_activity']].values],
                       dtype = list(zip(y_dtypes.index, y_dtypes)))
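
The outcome passed to scikit-survival is a NumPy structured array with one boolean event field and one numeric time field. The toy sketch below (hypothetical values) shows what the construction above produces; scikit-survival also offers `sksurv.util.Surv.from_dataframe` for the same purpose, which is not used here.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'death_status': [True, False, True],
                    'timerisk_activity': [120.0, 900.0, 45.0]})

y_dtypes = toy[['death_status', 'timerisk_activity']].dtypes
y = np.array([tuple(row) for row in toy[['death_status', 'timerisk_activity']].values],
             dtype = list(zip(y_dtypes.index, y_dtypes)))

# Fields are accessed by name, like columns.
print(y.dtype.names)
print(y['death_status'], y['timerisk_activity'])
```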

In [119]:
train_cox_y.shape

(27452,)

In [120]:
# Convert death_status into True or False (required for scikit-survival). 
test['death_status'] = test['death_status'].astype('bool')

In [121]:
y_dtypes = test[['death_status', 'timerisk_activity']].dtypes

test_cox_y = np.array([tuple(x) for x in test[['death_status', 'timerisk_activity']].values],
                      dtype = list(zip(y_dtypes.index, y_dtypes)))

In [122]:
test_cox_y.shape

(6863,)

### Build and assess model performance 

In [123]:
cox_crude = CoxPHSurvivalAnalysis()

cox_crude.fit(train_cox_x, train_cox_y)

CoxPHSurvivalAnalysis()

In [124]:
cox_crude_risk_scores_te = cox_crude.predict(test_cox_x)
cox_crude_auc_te = cumulative_dynamic_auc(train_cox_y, test_cox_y, cox_crude_risk_scores_te, 730)[0][0]
print('Test set AUC at 2 years:', cox_crude_auc_te)

Test set AUC at 2 years: 0.6413335390494931


In [125]:
cox_crude_risk_scores_tr = cox_crude.predict(train_cox_x)
cox_crude_auc_tr = cumulative_dynamic_auc(train_cox_y, train_cox_y, cox_crude_risk_scores_tr, 730)[0][0]
print('Training set AUC at 2 years:', cox_crude_auc_tr)

Training set AUC at 2 years: 0.6454691507363597


In [126]:
# Bootstrap 10000 2 yr AUCs for test set 
n_bootstraps = 10000
rng_seed = 42 
bootstrapped_scores_te = []

rng = np.random.RandomState(rng_seed)
for i in range(n_bootstraps):
    indices = rng.randint(0, len(cox_crude_risk_scores_te), len(cox_crude_risk_scores_te))
    auc_yr = cumulative_dynamic_auc(train_cox_y, test_cox_y[indices], cox_crude_risk_scores_te[indices], 730)[0][0]
    bootstrapped_scores_te.append(auc_yr)

In [127]:
# Approximate standard error of the test set AUC from the width of the 95% bootstrap percentile interval
sorted_scores_te = np.array(bootstrapped_scores_te)
sorted_scores_te.sort()

conf_lower_te = sorted_scores_te[int(0.025 * len(sorted_scores_te))]
conf_upper_te = sorted_scores_te[int(0.975 * len(sorted_scores_te))]

standard_error_te = (conf_upper_te - conf_lower_te) / 3.92
print('Test set AUC standard error:', standard_error_te)

Test set AUC standard error: 0.00751863720762494
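
Dividing the width of the 95% percentile interval by 3.92 (2 × 1.96) assumes the bootstrap distribution is roughly normal. The sketch below uses simulated scores as a stand-in for bootstrapped AUCs (not real results) to compare that estimate with the standard deviation of the replicates, which is the more direct bootstrap standard error.

```python
import numpy as np

rng = np.random.RandomState(42)

# Stand-in for a bootstrap distribution of AUCs (simulated, not real results).
boot_scores = rng.normal(loc = 0.64, scale = 0.0075, size = 10000)

lower, upper = np.percentile(boot_scores, [2.5, 97.5])

# Under approximate normality the 95% interval spans about 2 * 1.96 standard errors.
se_from_ci = (upper - lower) / 3.92

# The standard deviation of the bootstrap replicates is the direct bootstrap estimate.
se_from_sd = boot_scores.std(ddof = 1)
print(se_from_ci, se_from_sd)
```

When the two estimates disagree noticeably, the bootstrap distribution is probably skewed and reporting the percentile interval itself may be more informative.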


In [128]:
# Bootstrap 10000 2-yr AUCs for train set 
n_bootstraps = 10000
rng_seed = 42 
bootstrapped_scores_tr = []

rng = np.random.RandomState(rng_seed)
for i in range(n_bootstraps):
    indices = rng.randint(0, len(cox_crude_risk_scores_tr), len(cox_crude_risk_scores_tr))
    auc_yr = cumulative_dynamic_auc(train_cox_y, train_cox_y[indices], cox_crude_risk_scores_tr[indices], 730)[0][0]
    bootstrapped_scores_tr.append(auc_yr)

In [129]:
# Approximate standard error of the train set AUC from the width of the 95% bootstrap percentile interval
sorted_scores_tr = np.array(bootstrapped_scores_tr)
sorted_scores_tr.sort()

conf_lower_tr = sorted_scores_tr[int(0.025 * len(sorted_scores_tr))]
conf_upper_tr = sorted_scores_tr[int(0.975 * len(sorted_scores_tr))]

standard_error_tr = (conf_upper_tr - conf_lower_tr) / 3.92
print('Training set AUC standard error', standard_error_tr)

Training set AUC standard error 0.003806408910917419


In [130]:
cox_auc_data = {'model': ['cox_crude'],
                'auc_2yr_te': [cox_crude_auc_te],
                'sem_te': [standard_error_te],
                'auc_2yr_tr': [cox_crude_auc_tr],
                'sem_tr': [standard_error_tr]}

cox_auc_df = pd.DataFrame(cox_auc_data)

In [131]:
cox_auc_df

| | model | auc_2yr_te | sem_te | auc_2yr_tr | sem_tr |
|---|---|---|---|---|---|
| 0 | cox_crude | 0.641334 | 0.007519 | 0.645469 | 0.003806 |


In [132]:
cox_auc_df.to_csv('cox_auc_df.csv', index = False, header = True)

In [133]:
times = np.arange(30, 1810, 30)
crude_cox_auc_over5 = cumulative_dynamic_auc(train_cox_y, test_cox_y, cox_crude_risk_scores_te, times)[0]

# Label each AUC with its evaluation time (in days) and store as a one-row dataframe.
time_names = ['time_' + str(t) for t in times]
times_data = dict(zip(time_names, crude_cox_auc_over5))

cox_auc_over5 = pd.DataFrame(times_data, index = ['cox_crude'])

In [134]:
cox_auc_over5

| | time_30 | time_60 | time_90 | time_120 | time_150 | time_180 | time_210 | time_240 | time_270 | time_300 | ... | time_1530 | time_1560 | time_1590 | time_1620 | time_1650 | time_1680 | time_1710 | time_1740 | time_1770 | time_1800 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cox_crude | 0.684849 | 0.673378 | 0.688523 | 0.680508 | 0.667829 | 0.667237 | 0.664302 | 0.663377 | 0.658468 | 0.653381 | ... | 0.639717 | 0.639204 | 0.637584 | 0.633764 | 0.632305 | 0.63068 | 0.632406 | 0.637725 | 0.63637 | 0.637921 |


In [135]:
cox_auc_over5.to_csv('cox_auc_over5.csv', index = True, header = True)

## 3. Complete cases

**This Cox build will look at complete cases only, excluding patients with unknown stage, missing SES, or missing labs.**

### Remove patients with missing values 

In [136]:
train_cox_cc = (
    train
    .query('stage != "unknown"')
    .query('ses.notna()', engine = 'python')
    .query('platelet_diag.notna()', engine = 'python')
    .query('alt_diag.notna() or total_bilirubin_diag.notna() or ggt_diag.notna()', engine = 'python')
)

In [137]:
test_cox_cc = (
    test
    .query('stage != "unknown"')
    .query('ses.notna()', engine = 'python')
    .query('platelet_diag.notna()', engine = 'python')
    .query('alt_diag.notna() or total_bilirubin_diag.notna() or ggt_diag.notna()', engine = 'python')
)
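
The complete-case rule used above keeps a patient only if stage is known, SES and platelet count are present, and at least one of the three liver tests is recorded. A minimal sketch of the same logic with boolean masks on a toy frame (hypothetical values):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'stage': ['IV', 'unknown', 'II', 'III'],
    'ses': [3.0, 2.0, np.nan, 5.0],
    'platelet_diag': [250.0, 300.0, 410.0, 500.0],
    'alt_diag': [np.nan, 20.0, 30.0, np.nan],
    'total_bilirubin_diag': [0.8, 1.0, 0.5, np.nan],
    'ggt_diag': [np.nan, np.nan, np.nan, np.nan],
})

liver_cols = ['alt_diag', 'total_bilirubin_diag', 'ggt_diag']
keep = (
    (toy['stage'] != 'unknown')
    & toy['ses'].notna()
    & toy['platelet_diag'].notna()
    & toy[liver_cols].notna().any(axis = 1)   # at least one liver test recorded
)
complete = toy[keep]
print(complete.index.tolist())
```

Only the first row survives: the second has unknown stage, the third is missing SES, and the fourth has no liver test at all.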

### Processing X 

#### Select relevant variables

In [138]:
train_cox_cc_x = (
    train_cox_cc
    .filter(items = [
        'PatientID',
        'age',
        'ses',
        'stage',
        'adjuv',
        'raised_plt',
        'ab_lft',
        'cardiovascular',
        'diabetes',
        'crd',
        'copd',
        'aspirin']))

In [139]:
train_cox_cc_x = train_cox_cc_x.set_index('PatientID')

In [140]:
train_cox_cc_x.shape

(10715, 11)

In [141]:
test_cox_cc_x = (
    test_cox_cc
    .filter(items = [
        'PatientID',
        'age',
        'ses',
        'stage',
        'adjuv',
        'raised_plt',
        'ab_lft',
        'cardiovascular',
        'diabetes',
        'crd',
        'copd',
        'aspirin']))

In [142]:
test_cox_cc_x = test_cox_cc_x.set_index('PatientID')

In [143]:
test_cox_cc_x.shape

(2699, 11)

#### Convert relevant variables to categorical 

In [144]:
list(train_cox_cc_x.select_dtypes(include = ['object']).columns)

['stage']

In [145]:
to_be_categorical = list(train_cox_cc_x.select_dtypes(include = ['object']).columns)

In [146]:
to_be_categorical.extend([
    'ses',
    'adjuv',
    'raised_plt',
    'ab_lft',
    'cardiovascular',
    'diabetes',
    'crd',
    'copd',
    'aspirin'])

In [147]:
for x in list(to_be_categorical):
    train_cox_cc_x[x] = train_cox_cc_x[x].astype('category')

In [148]:
train_cox_cc_x.dtypes

age                  int64
ses               category
stage             category
adjuv             category
raised_plt        category
ab_lft            category
cardiovascular    category
diabetes          category
crd               category
copd              category
aspirin           category
dtype: object

In [149]:
for x in list(to_be_categorical):
    test_cox_cc_x[x] = test_cox_cc_x[x].astype('category')

#### Dummy encode categorical variables 

In [150]:
# Dummy variables for ses
train_cox_cc_x = encode_and_bind(train_cox_cc_x, 'ses')
train_cox_cc_x = train_cox_cc_x.drop(columns = ['ses_5.0'])

test_cox_cc_x = encode_and_bind(test_cox_cc_x, 'ses')
test_cox_cc_x = test_cox_cc_x.drop(columns = ['ses_5.0'])

# Dummy variables for stage 
train_cox_cc_x = encode_and_bind(train_cox_cc_x, 'stage')
train_cox_cc_x = train_cox_cc_x.drop(columns = ['stage_I', 'stage_0'])

test_cox_cc_x = encode_and_bind(test_cox_cc_x, 'stage')
test_cox_cc_x = test_cox_cc_x.drop(columns = ['stage_I'])
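
Here the train and test sets drop different stage columns, presumably because a stage level present in train is absent from test. A more defensive pattern, sketched below on toy data (not the approach used in this notebook), is to drop with `errors = 'ignore'` and then reindex the test columns to match the training columns.

```python
import pandas as pd

train_x = pd.DataFrame({'stage': ['0', 'I', 'II', 'IV']})
test_x = pd.DataFrame({'stage': ['I', 'II', 'IV']})   # no stage 0 patients

train_d = pd.get_dummies(train_x).drop(columns = ['stage_I', 'stage_0'])
# Dropping the same labels from test would raise a KeyError because 'stage_0' is absent;
# errors='ignore' skips labels that are not present.
test_d = pd.get_dummies(test_x).drop(columns = ['stage_I', 'stage_0'], errors = 'ignore')

# Reindex test to the training columns so both matrices share names and order.
test_d = test_d.reindex(columns = train_d.columns, fill_value = 0)
print(list(train_d.columns), list(test_d.columns))
```

Matching column names and order matters because the fitted model applies its coefficients to the test matrix by position.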

In [151]:
print(train_cox_cc_x.shape)
print(test_cox_cc_x.shape)

(10715, 16)
(2699, 16)


### Processing Y

In [152]:
y_dtypes = train[['death_status', 'timerisk_activity']].dtypes

train_cox_cc_y = np.array([tuple(x) for x in train_cox_cc[['death_status', 'timerisk_activity']].values],
                          dtype = list(zip(y_dtypes.index, y_dtypes)))

In [153]:
train_cox_cc_y.shape

(10715,)

In [154]:
y_dtypes = test[['death_status', 'timerisk_activity']].dtypes

test_cox_cc_y = np.array([tuple(x) for x in test_cox_cc[['death_status', 'timerisk_activity']].values],
                         dtype = list(zip(y_dtypes.index, y_dtypes)))

In [155]:
test_cox_cc_y.shape

(2699,)

### Build and assess model performance

In [156]:
cox_cc = CoxPHSurvivalAnalysis()

cox_cc.fit(train_cox_cc_x, train_cox_cc_y)

CoxPHSurvivalAnalysis()

In [157]:
cox_cc_risk_scores_te = cox_cc.predict(test_cox_cc_x)
cox_cc_auc_te = cumulative_dynamic_auc(train_cox_cc_y, test_cox_cc_y, cox_cc_risk_scores_te, 730)[0][0]
print('Test set AUC at 2 years:', cox_cc_auc_te)

Test set AUC at 2 years: 0.638517217248732


In [158]:
cox_cc_risk_scores_tr = cox_cc.predict(train_cox_cc_x)
cox_cc_auc_tr = cumulative_dynamic_auc(train_cox_cc_y, train_cox_cc_y, cox_cc_risk_scores_tr, 730)[0][0]
print('Training set AUC at 2 years:', cox_cc_auc_tr)

Training set AUC at 2 years: 0.6503900451309126


In [159]:
# Bootstrap 10000 2-yr AUCs for test set 
n_bootstraps = 10000
rng_seed = 42 
bootstrapped_scores_te = []

rng = np.random.RandomState(rng_seed)
for i in range(n_bootstraps):
    indices = rng.randint(0, len(cox_cc_risk_scores_te), len(cox_cc_risk_scores_te))
    auc_yr = cumulative_dynamic_auc(train_cox_cc_y, test_cox_cc_y[indices], cox_cc_risk_scores_te[indices], 730)[0][0]
    bootstrapped_scores_te.append(auc_yr)

In [160]:
# Approximate standard error of the test set AUC from the width of the 95% bootstrap percentile interval
sorted_scores_te = np.array(bootstrapped_scores_te)
sorted_scores_te.sort()

conf_lower_te = sorted_scores_te[int(0.025 * len(sorted_scores_te))]
conf_upper_te = sorted_scores_te[int(0.975 * len(sorted_scores_te))]

standard_error_te = (conf_upper_te - conf_lower_te) / 3.92
print('Test set AUC standard error:', standard_error_te)

Test set AUC standard error: 0.01217997342931366


In [161]:
# Bootstrap 10000 2-yr AUCs for train set 
n_bootstraps = 10000
rng_seed = 42 
bootstrapped_scores_tr = []

rng = np.random.RandomState(rng_seed)
for i in range(n_bootstraps):
    indices = rng.randint(0, len(cox_cc_risk_scores_tr), len(cox_cc_risk_scores_tr))
    auc_yr = cumulative_dynamic_auc(train_cox_cc_y, train_cox_cc_y[indices], cox_cc_risk_scores_tr[indices], 730)[0][0]
    bootstrapped_scores_tr.append(auc_yr)

In [162]:
# Approximate standard error of the train set AUC from the width of the 95% bootstrap percentile interval
sorted_scores_tr = np.array(bootstrapped_scores_tr)
sorted_scores_tr.sort()

conf_lower_tr = sorted_scores_tr[int(0.025 * len(sorted_scores_tr))]
conf_upper_tr = sorted_scores_tr[int(0.975 * len(sorted_scores_tr))]

standard_error_tr = (conf_upper_tr - conf_lower_tr) / 3.92
print('Training set AUC standard error', standard_error_tr)

Training set AUC standard error 0.005999249018900298


In [163]:
cox_auc_data = {'model': 'cox_cc',
                'auc_2yr_te': cox_cc_auc_te,
                'sem_te': standard_error_te,
                'auc_2yr_tr': cox_cc_auc_tr,
                'sem_tr': standard_error_tr}

In [164]:
cox_auc_df = pd.read_csv('cox_auc_df.csv')

In [165]:
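# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
cox_auc_df = pd.concat([cox_auc_df, pd.DataFrame([cox_auc_data])], ignore_index = True)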

In [166]:
cox_auc_df

| | model | auc_2yr_te | sem_te | auc_2yr_tr | sem_tr |
|---|---|---|---|---|---|
| 0 | cox_crude | 0.641334 | 0.007519 | 0.645469 | 0.003806 |
| 1 | cox_cc | 0.638517 | 0.01218 | 0.65039 | 0.005999 |


In [167]:
cox_auc_df.to_csv('cox_auc_df.csv', index = False, header = True)

In [168]:
times = np.arange(30, 1810, 30)
cc_cox_auc_over5 = cumulative_dynamic_auc(train_cox_cc_y, test_cox_cc_y, cox_cc_risk_scores_te, times)[0]

times_data = {}
values = cc_cox_auc_over5
time_names = []

for x in range(len(times)):
    time_names.append('time_'+str(times[x]))

for i in range(len(time_names)):
    times_data[time_names[i]] = values[i]
    
cc_cox_over5_df = pd.DataFrame(times_data, index = ['cox_cc'])
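
The cell above builds one column per evaluation time with two loops. A more compact construction is sketched here with placeholder AUC values standing in for the array returned by `cumulative_dynamic_auc`. The names `auc_values` and `over5_df` are illustrative only.

```python
import numpy as np
import pandas as pd

times = np.arange(30, 1810, 30)
# Placeholder values standing in for the array returned by cumulative_dynamic_auc
auc_values = np.linspace(0.68, 0.66, len(times))

# One column per time point, one row per model
over5_df = pd.DataFrame([auc_values],
                        columns = ['time_' + str(t) for t in times],
                        index = ['cox_cc'])
print(over5_df.shape)
```

`np.arange(30, 1810, 30)` yields 60 time points, from day 30 to day 1800 in 30-day steps, so the resulting frame has a single row and 60 columns.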

In [169]:
cox_auc_over5 = pd.read_csv('cox_auc_over5.csv', index_col = 0)

In [170]:
cox_auc_over5 = pd.concat([cox_auc_over5, cc_cox_over5_df])

In [171]:
cox_auc_over5

Unnamed: 0,time_30,time_60,time_90,time_120,time_150,time_180,time_210,time_240,time_270,time_300,...,time_1530,time_1560,time_1590,time_1620,time_1650,time_1680,time_1710,time_1740,time_1770,time_1800
cox_crude,0.684849,0.673378,0.688523,0.680508,0.667829,0.667237,0.664302,0.663377,0.658468,0.653381,...,0.639717,0.639204,0.637584,0.633764,0.632305,0.63068,0.632406,0.637725,0.63637,0.637921
cox_cc,0.68496,0.6767,0.688195,0.675908,0.663172,0.658616,0.659252,0.653853,0.651654,0.649249,...,0.644787,0.646598,0.644487,0.64727,0.650721,0.650166,0.652438,0.662345,0.666495,0.662999


In [172]:
cox_auc_over5.to_csv('cox_auc_over5.csv', index = True, header = True)

## 4. Complete cases using only patients with de novo metastatic disease

**This Cox build looks at patients with de novo metastatic disease who also have a recorded SES index, platelet count, and at least one liver function test (ALT, total bilirubin, or GGT).**

### Select de novo metastatic disease and no missing variables 

In [173]:
train_cox_cc_dn = (
    train
    .query('stage == "IV"')
    .query('ses.notna()', engine = 'python')
    .query('platelet_diag.notna()', engine = 'python')
    .query('alt_diag.notna() or total_bilirubin_diag.notna() or ggt_diag.notna()', engine = 'python')
)

In [174]:
test_cox_cc_dn = (
    test
    .query('stage == "IV"')
    .query('ses.notna()', engine = 'python')
    .query('platelet_diag.notna()', engine = 'python')
    .query('alt_diag.notna() or total_bilirubin_diag.notna() or ggt_diag.notna()', engine = 'python')
)
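
To make the selection rule concrete, here is a minimal sketch on a made-up five-patient table. It uses boolean masks instead of chained `.query` calls, and the toy values and the names `toy`, `has_lft`, and `toy_cc_dn` are illustrative assumptions. The rule itself mirrors the cells above: stage IV, SES present, platelet count present, and at least one of the three liver function tests present.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'PatientID': ['A', 'B', 'C', 'D', 'E'],
    'stage': ['IV', 'IV', 'III', 'IV', 'IV'],
    'ses': [1.0, np.nan, 3.0, 4.0, 5.0],
    'platelet_diag': [250.0, 300.0, 180.0, np.nan, 410.0],
    'alt_diag': [np.nan, 30.0, 25.0, 40.0, np.nan],
    'total_bilirubin_diag': [0.8, np.nan, 1.1, 0.9, np.nan],
    'ggt_diag': [np.nan, np.nan, 35.0, np.nan, np.nan],
})

# At least one liver function test must be recorded
has_lft = toy[['alt_diag', 'total_bilirubin_diag', 'ggt_diag']].notna().any(axis = 1)

keep = (
    (toy['stage'] == 'IV')
    & toy['ses'].notna()
    & toy['platelet_diag'].notna()
    & has_lft
)
toy_cc_dn = toy[keep]
print(toy_cc_dn['PatientID'].tolist())
```

Only patient A survives: B lacks SES, C is not stage IV, D lacks a platelet count, and E has no liver function test at all.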

### Processing X 

#### Select relevant variables

In [175]:
train_cox_cc_dn_x = (
    train_cox_cc_dn
    .filter(items = [
        'PatientID',
        'age',
        'ses',
        'adjuv',
        'raised_plt',
        'ab_lft',
        'cardiovascular',
        'diabetes',
        'crd',
        'copd',
        'aspirin']))

In [176]:
train_cox_cc_dn_x = train_cox_cc_dn_x.set_index('PatientID')

In [177]:
train_cox_cc_dn_x.shape

(5346, 10)

In [178]:
test_cox_cc_dn_x = (
    test_cox_cc_dn
    .filter(items = [
        'PatientID',
        'age',
        'ses',
        'adjuv',
        'raised_plt',
        'ab_lft',
        'cardiovascular',
        'diabetes',
        'crd',
        'copd',
        'aspirin']))

In [179]:
test_cox_cc_dn_x = test_cox_cc_dn_x.set_index('PatientID')

In [180]:
test_cox_cc_dn_x.shape

(1371, 10)

#### Convert relevant variables to categorical

In [181]:
list(train_cox_cc_dn_x.select_dtypes(include = ['object']).columns)

[]

In [182]:
to_be_categorical = list(train_cox_cc_dn_x.select_dtypes(include = ['object']).columns)

In [183]:
to_be_categorical.append('ses')
to_be_categorical.append('adjuv')
to_be_categorical.append('raised_plt')
to_be_categorical.append('ab_lft')
to_be_categorical.append('cardiovascular')
to_be_categorical.append('diabetes')
to_be_categorical.append('crd')
to_be_categorical.append('copd')
to_be_categorical.append('aspirin')

In [184]:
for x in list(to_be_categorical):
    train_cox_cc_dn_x[x] = train_cox_cc_dn_x[x].astype('category')

In [185]:
train_cox_cc_dn_x.dtypes

age                  int64
ses               category
adjuv             category
raised_plt        category
ab_lft            category
cardiovascular    category
diabetes          category
crd               category
copd              category
aspirin           category
dtype: object

In [186]:
for x in list(to_be_categorical):
    test_cox_cc_dn_x[x] = test_cox_cc_dn_x[x].astype('category')

#### Dummy encode categorical variables 

In [187]:
# Dummy variables for ses
train_cox_cc_dn_x = encode_and_bind(train_cox_cc_dn_x, 'ses')
train_cox_cc_dn_x = train_cox_cc_dn_x.drop(columns = ['ses_5.0'])

test_cox_cc_dn_x = encode_and_bind(test_cox_cc_dn_x, 'ses')
test_cox_cc_dn_x = test_cox_cc_dn_x.drop(columns = ['ses_5.0'])

In [188]:
print(train_cox_cc_dn_x.shape)
print(test_cox_cc_dn_x.shape)

(5346, 13)
(1371, 13)
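
The helper `encode_and_bind` is defined earlier in the notebook. The sketch below is a hypothetical stand-in, assuming it one-hot encodes a column and binds the dummies back onto the frame. It illustrates why the column count goes from 10 to 13: `ses` is replaced by five dummies, and one of them (`ses_5.0`) is dropped to serve as the reference level.

```python
import pandas as pd

def encode_and_bind_sketch(df, feature):
    """Hypothetical stand-in: one-hot encode `feature` and replace the original column."""
    dummies = pd.get_dummies(df[feature], prefix = feature)
    return pd.concat([df.drop(columns = [feature]), dummies], axis = 1)

toy = pd.DataFrame({'age': [61, 72, 55, 68, 80],
                    'ses': [1.0, 2.0, 3.0, 4.0, 5.0]})
toy['ses'] = toy['ses'].astype('category')

encoded = encode_and_bind_sketch(toy, 'ses')
# Drop one level so the remaining dummies are not collinear; ses_5.0 becomes the reference
encoded = encoded.drop(columns = ['ses_5.0'])
print(list(encoded.columns))
```

Dropping one level avoids perfectly collinear dummy columns in the Cox model. The coefficients for the remaining SES levels are then interpreted relative to level 5.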


### Processing Y

In [189]:
y_dtypes = train[['death_status', 'timerisk_activity']].dtypes

train_cox_cc_dn_y = np.array([tuple(x) for x in train_cox_cc_dn[['death_status', 'timerisk_activity']].values],
                             dtype = list(zip(y_dtypes.index, y_dtypes)))

In [190]:
train_cox_cc_dn_y.shape

(5346,)

In [191]:
y_dtypes = test[['death_status', 'timerisk_activity']].dtypes

test_cox_cc_dn_y = np.array([tuple(x) for x in test_cox_cc_dn[['death_status', 'timerisk_activity']].values],
                            dtype = list(zip(y_dtypes.index, y_dtypes)))

In [192]:
test_cox_cc_dn_y.shape

(1371,)
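
The outcome is stored as a NumPy structured array with one event field and one time field. A toy version of the same construction is sketched below. The three rows are made up, and only the column names are taken from the notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'death_status': [True, False, True],
                    'timerisk_activity': [120.0, 900.0, 365.0]})

# Field names and dtypes are taken from the DataFrame columns
y_dtypes = toy[['death_status', 'timerisk_activity']].dtypes
y = np.array([tuple(row) for row in toy[['death_status', 'timerisk_activity']].values],
             dtype = list(zip(y_dtypes.index, y_dtypes)))

print(y.dtype.names)
print(y['death_status'])
```

Each record pairs a boolean event indicator with a follow-up time, which is the layout scikit-survival estimators expect for `y`. scikit-survival also offers `Surv.from_dataframe` in `sksurv.util` for the same purpose, which may be more convenient.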

### Build and assess model performance

In [193]:
cox_cc_dn = CoxPHSurvivalAnalysis()

cox_cc_dn.fit(train_cox_cc_dn_x, train_cox_cc_dn_y)

CoxPHSurvivalAnalysis()

In [194]:
cox_cc_dn_risk_scores_te = cox_cc_dn.predict(test_cox_cc_dn_x)
cox_cc_dn_auc_te = cumulative_dynamic_auc(train_cox_cc_dn_y, test_cox_cc_dn_y, cox_cc_dn_risk_scores_te, 730)[0][0]
print('Test set AUC at 2 years:', cox_cc_dn_auc_te)

Test set AUC at 2 years: 0.62754875611693


In [195]:
cox_cc_dn_risk_scores_tr = cox_cc_dn.predict(train_cox_cc_dn_x)
cox_cc_dn_auc_tr = cumulative_dynamic_auc(train_cox_cc_dn_y, train_cox_cc_dn_y, cox_cc_dn_risk_scores_tr, 730)[0][0]
print('Training set AUC at 2 years:', cox_cc_dn_auc_tr)

Training set AUC at 2 years: 0.6499190957511551
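
For intuition about what the 2-year AUC measures, the sketch below computes a simplified cumulative/dynamic AUC in plain NumPy. It is not the scikit-survival implementation: it ignores the inverse-probability-of-censoring weights that `cumulative_dynamic_auc` applies and simply leaves out patients censored before the evaluation time. The function name and the toy data are hypothetical.

```python
import numpy as np

def simple_cumulative_dynamic_auc(event, time, risk, t):
    """Simplified cumulative/dynamic AUC at time t, ignoring censoring weights."""
    event = np.asarray(event, dtype = bool)
    time = np.asarray(time, dtype = float)
    risk = np.asarray(risk, dtype = float)

    cases = event & (time <= t)      # died on or before t
    controls = time > t              # still under follow-up after t
    # Compare the risk score of every case with that of every control
    diff = risk[cases][:, None] - risk[controls][None, :]
    # Ties count as half-concordant
    concordant = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return concordant / (cases.sum() * controls.sum())

event = [True, True, False, True, False]
time = [100, 400, 800, 900, 1200]
risk = [2.0, 0.5, 1.0, 0.2, 0.1]
auc = simple_cumulative_dynamic_auc(event, time, risk, 730)
print(auc)
```

The value is the fraction of (case, control) pairs in which the patient who died by day 730 received the higher risk score. A value of 0.5 corresponds to random ranking.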


In [196]:
# Bootstrap 10000 2-yr (730-day) AUCs for test set
n_bootstraps = 10000
rng_seed = 42 
bootstrapped_scores_te = []

rng = np.random.RandomState(rng_seed)
for i in range(n_bootstraps):
    indices = rng.randint(0, len(cox_cc_dn_risk_scores_te), len(cox_cc_dn_risk_scores_te))
    auc_yr = cumulative_dynamic_auc(train_cox_cc_dn_y, test_cox_cc_dn_y[indices], cox_cc_dn_risk_scores_te[indices], 730)[0][0]
    bootstrapped_scores_te.append(auc_yr)

In [197]:
# Standard error of test set AUC, approximated as the 95% bootstrap interval width divided by 3.92 (2 x 1.96)
sorted_scores_te = np.array(bootstrapped_scores_te)
sorted_scores_te.sort()

conf_lower_te = sorted_scores_te[int(0.025 * len(sorted_scores_te))]
conf_upper_te = sorted_scores_te[int(0.975 * len(sorted_scores_te))]

standard_error_te = (conf_upper_te - conf_lower_te) / 3.92
print('Test set AUC standard error:', standard_error_te)

Test set AUC standard error: 0.017420664952171296


In [198]:
# Bootstrap 10000 2-yr (730-day) AUCs for train set
n_bootstraps = 10000
rng_seed = 42 
bootstrapped_scores_tr = []

rng = np.random.RandomState(rng_seed)
for i in range(n_bootstraps):
    indices = rng.randint(0, len(cox_cc_dn_risk_scores_tr), len(cox_cc_dn_risk_scores_tr))
    auc_yr = cumulative_dynamic_auc(train_cox_cc_dn_y, train_cox_cc_dn_y[indices], cox_cc_dn_risk_scores_tr[indices], 730)[0][0]
    bootstrapped_scores_tr.append(auc_yr)

In [199]:
# Standard error of train set AUC, approximated as the 95% bootstrap interval width divided by 3.92 (2 x 1.96)
sorted_scores_tr = np.array(bootstrapped_scores_tr)
sorted_scores_tr.sort()

conf_lower_tr = sorted_scores_tr[int(0.025 * len(sorted_scores_tr))]
conf_upper_tr = sorted_scores_tr[int(0.975 * len(sorted_scores_tr))]

standard_error_tr = (conf_upper_tr - conf_lower_tr) / 3.92
print('Training set AUC standard error', standard_error_tr)

Training set AUC standard error 0.008523072754855874


In [200]:
cox_auc_data = {'model': 'cox_cc_dn',
                'auc_2yr_te': cox_cc_dn_auc_te,
                'sem_te': standard_error_te,
                'auc_2yr_tr': cox_cc_dn_auc_tr,
                'sem_tr': standard_error_tr}

In [201]:
cox_auc_df = pd.read_csv('cox_auc_df.csv')

In [202]:
cox_auc_df = pd.concat([cox_auc_df, pd.DataFrame([cox_auc_data])], ignore_index = True)

In [203]:
cox_auc_df

Unnamed: 0,model,auc_2yr_te,sem_te,auc_2yr_tr,sem_tr
0,cox_crude,0.641334,0.007519,0.645469,0.003806
1,cox_cc,0.638517,0.01218,0.65039,0.005999
2,cox_cc_dn,0.627549,0.017421,0.649919,0.008523


In [204]:
cox_auc_df.to_csv('cox_auc_df.csv', index = False, header = True)

In [205]:
times = np.arange(30, 1810, 30)
cc_dn_cox_auc_over5 = cumulative_dynamic_auc(train_cox_cc_dn_y, test_cox_cc_dn_y, cox_cc_dn_risk_scores_te, times)[0]

times_data = {}
values = cc_dn_cox_auc_over5
time_names = []

for x in range(len(times)):
    time_names.append('time_'+str(times[x]))

for i in range(len(time_names)):
    times_data[time_names[i]] = values[i]
    
cc_dn_cox_over5_df = pd.DataFrame(times_data, index = ['cox_cc_dn'])

In [206]:
cox_auc_over5 = pd.read_csv('cox_auc_over5.csv', index_col = 0)

In [207]:
cox_auc_over5 = pd.concat([cox_auc_over5, cc_dn_cox_over5_df])

In [208]:
cox_auc_over5

Unnamed: 0,time_30,time_60,time_90,time_120,time_150,time_180,time_210,time_240,time_270,time_300,...,time_1530,time_1560,time_1590,time_1620,time_1650,time_1680,time_1710,time_1740,time_1770,time_1800
cox_crude,0.684849,0.673378,0.688523,0.680508,0.667829,0.667237,0.664302,0.663377,0.658468,0.653381,...,0.639717,0.639204,0.637584,0.633764,0.632305,0.63068,0.632406,0.637725,0.63637,0.637921
cox_cc,0.68496,0.6767,0.688195,0.675908,0.663172,0.658616,0.659252,0.653853,0.651654,0.649249,...,0.644787,0.646598,0.644487,0.64727,0.650721,0.650166,0.652438,0.662345,0.666495,0.662999
cox_cc_dn,0.774638,0.707427,0.715662,0.692114,0.68569,0.67748,0.668571,0.666876,0.661479,0.661678,...,0.676489,0.675633,0.673224,0.675832,0.68071,0.675896,0.678305,0.680421,0.689083,0.687752


In [209]:
cox_auc_over5.to_csv('cox_auc_over5.csv', index = True, header = True)

In [210]:
ml_auc_df = pd.read_csv('ml_auc_df.csv', dtype = {'auc_2yr_te': np.float64,
                                                  'sem_te': np.float64,
                                                  'auc_2yr_tr': np.float64,
                                                  'sem_tr': np.float64})

In [211]:
ml_auc_df

Unnamed: 0,model,auc_2yr_te,sem_te,auc_2yr_tr,sem_tr
0,gbm_crude,0.767784,0.006411,0.817127,0.002841
1,rsf_crude,0.743442,0.006667,0.878791,0.00228
2,ridge_crude,0.735669,0.006739,0.731836,0.003417
3,lasso_crude,0.720964,0.006929,0.712436,0.003557
4,enet_crude,0.720824,0.006937,0.712219,0.003549
5,linear_svm_crude,0.73764,0.006739,0.736308,0.003442
6,gbm_mice,0.777956,0.003426,0.83132,0.012594


In [212]:
all_models_auc_df = pd.concat([ml_auc_df, cox_auc_df], ignore_index = True)

In [213]:
all_models_auc_df.sort_values(by = 'auc_2yr_te', ascending = False)

Unnamed: 0,model,auc_2yr_te,sem_te,auc_2yr_tr,sem_tr
6,gbm_mice,0.777956,0.003426,0.83132,0.012594
0,gbm_crude,0.767784,0.006411,0.817127,0.002841
1,rsf_crude,0.743442,0.006667,0.878791,0.00228
5,linear_svm_crude,0.73764,0.006739,0.736308,0.003442
2,ridge_crude,0.735669,0.006739,0.731836,0.003417
3,lasso_crude,0.720964,0.006929,0.712436,0.003557
4,enet_crude,0.720824,0.006937,0.712219,0.003549
7,cox_crude,0.641334,0.007519,0.645469,0.003806
8,cox_cc,0.638517,0.01218,0.65039,0.005999
9,cox_cc_dn,0.627549,0.017421,0.649919,0.008523


In [214]:
all_models_auc_df.to_csv('all_models_auc_df.csv', index = False, header = True)

**In conclusion, the Cox models perform worse than the machine learning models on 2-year test-set AUC (0.63 to 0.64 versus 0.72 to 0.78).**
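
As a rough check on this comparison, the sketch below turns the reported standard errors back into approximate 95% intervals (AUC plus or minus 1.96 standard errors) for three rows copied from the table above. The normal approximation and the helper `approx_ci` are assumptions for illustration, not part of the original analysis.

```python
# Test-set AUC and standard error copied from the all_models_auc_df table
results = {
    'gbm_mice': (0.777956, 0.003426),
    'enet_crude': (0.720824, 0.006937),
    'cox_crude': (0.641334, 0.007519),
}

def approx_ci(auc, sem, z = 1.96):
    """Normal-approximation 95% interval: auc +/- z * sem."""
    return auc - z * sem, auc + z * sem

intervals = {model: approx_ci(auc, sem) for model, (auc, sem) in results.items()}
for model, (low, high) in intervals.items():
    print(model, round(low, 3), round(high, 3))
```

Under this approximation, the interval for the best Cox model (`cox_crude`) lies entirely below the interval for the weakest machine learning model (`enet_crude`), which is consistent with the conclusion.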