# Flatiron Health mBC: Cox model build

**OBJECTIVE: Build a Cox model inspired by the model described in "Prognostic Model for De Novo and Recurrent Metastatic Breast Cancer" by C. Barcenas et al. in JCO Clinical Cancer Informatics (2021).**

**BACKGROUND: The published model is a classical Cox model that predicts overall survival for patients with newly diagnosed metastatic breast cancer. It uses 11 clinical and biomarker variables. Discriminatory performance was validated on an external dataset, yielding a concordance index of 0.73. Only patients with complete data were included for training and testing. Because the proportional hazards assumption was violated, the final model consisted of six independent models built over different time points.**

**OUTLINE:**
1. **Preprocessing**
2. **Crude imputation** 
3. **Complete cases** 

## 1. Preprocessing 

**The published Cox model includes the following variables:** 
* Age at diagnosis of MBC (continuous)
* De novo and metastasis-free interval
* HR and HER2 status
* KPS (ECOG will be used instead)
* Organs involved
* Number of organs involved
* Race or ethnicity
* Frontline biotherapy (trastuzumab and/or pertuzumab)
* Frontline HT
* Interaction between de novo and recurrent MBC status and HR and HER2 status
* Interaction between de novo and recurrent MBC status and HT

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Function that returns number of rows and count of unique PatientIDs for a dataframe. 
def row_ID(dataframe):
    row = dataframe.shape[0]
    ID = dataframe['PatientID'].nunique()
    return row, ID
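As a quick illustration of what `row_ID` reports, here is a minimal sketch on a made-up long-format table (the patient IDs and sites are hypothetical). A gap between the two numbers means some patients occupy more than one row.

```python
import pandas as pd

def row_ID(dataframe):
    row = dataframe.shape[0]
    ID = dataframe['PatientID'].nunique()
    return row, ID

# Toy long-format table: patient "A" appears twice (hypothetical IDs).
toy = pd.DataFrame({'PatientID': ['A', 'A', 'B'],
                    'SiteOfMetastasis': ['Bone', 'Lung', 'Liver']})

rows, ids = row_ID(toy)
print(rows, ids)  # 3 2
```

Equal values, as in the `(25341, 25341)` checks later on, indicate one row per patient.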

In [3]:
train = pd.read_csv('train_full.csv')
print(len(train), train.PatientID.is_unique)

25341 True


In [4]:
test = pd.read_csv('test_full.csv')
print(len(test), test.PatientID.is_unique)

6336 True


### Training set 

In [5]:
train_cox = (
    train
    .filter(items = [
        'PatientID',
        'age',
        'delta_met_diagnosis',
        'ER', 
        'PR',
        'HER2',
        'ecog_diagnosis',
        'race', 
        'death_status',
        'timerisk_activity'])
)

#### De novo and metastasis-free interval
* de novo
* MFI <24 months
* MFI >=24 months

In [6]:
conditions = [
    train_cox['delta_met_diagnosis'] == 0,
    (train_cox['delta_met_diagnosis'] != 0) & (train_cox['delta_met_diagnosis'] < 730),
    (train_cox['delta_met_diagnosis'] != 0) & (train_cox['delta_met_diagnosis'] >= 730)
]    

choices = ['de_novo',
           'l24',
           'ge24']
    
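# default = 'unknown' catches rows that match no condition (e.g., a missing interval).
train_cox.loc[:, 'dn_recurrent'] = np.select(conditions, choices, default = 'unknown')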

In [7]:
train_cox.dn_recurrent.value_counts(normalize = True, dropna = False)

ge24       0.519672
de_novo    0.312340
l24        0.167989
Name: dn_recurrent, dtype: float64
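The same categorization is applied to the test set further down, so the logic could be wrapped in a small helper. The sketch below is one possible version, run on toy data; the function name `add_dn_recurrent`, the `cutoff_days` parameter, and the explicit `'unknown'` fallback are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def add_dn_recurrent(df, col='delta_met_diagnosis', cutoff_days=730):
    """Return a copy of df with a 'dn_recurrent' column (hypothetical helper)."""
    delta = df[col]
    conditions = [
        delta == 0,
        (delta != 0) & (delta < cutoff_days),
        (delta != 0) & (delta >= cutoff_days),
    ]
    choices = ['de_novo', 'l24', 'ge24']
    # Missing intervals satisfy no condition and fall through to 'unknown'.
    return df.assign(dn_recurrent=np.select(conditions, choices, default='unknown'))

toy = pd.DataFrame({'delta_met_diagnosis': [0, 200, 730, 1500, np.nan]})
toy = add_dn_recurrent(toy)
print(toy['dn_recurrent'].tolist())
# ['de_novo', 'l24', 'ge24', 'ge24', 'unknown']
```

Calling one function on both `train_cox` and `test_cox` would keep the 730-day cutoff in a single place.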

#### HR and HER2 status
* HR-positive and HER2-negative
* HR and HER2-positive
* HR-negative and HER2-positive
* Triple-negative

In [8]:
conditions = [
    ((train_cox['ER'] == 'positive') | (train_cox['PR'] == 'positive')) & (train_cox['HER2'] == 'negative'),
    ((train_cox['ER'] == 'positive') | (train_cox['PR'] == 'positive')) & (train_cox['HER2'] == 'positive'),
    ((train_cox['ER'] == 'negative') & (train_cox['PR'] == 'negative')) & (train_cox['HER2'] == 'positive'),
    (train_cox['ER'] == 'negative') & (train_cox['PR'] == 'negative') & (train_cox['HER2'] == 'negative')
]    

choices = ['hrp_her2n',
           'hrp_her2p',
           'hrn_her2p',
           'trip_neg']
    
train_cox.loc[:, 'hr_her2'] = np.select(conditions, choices, default = 'unknown')

In [9]:
train_cox.hr_her2.value_counts(normalize = True)

hrp_her2n    0.525433
unknown      0.190876
trip_neg     0.135551
hrp_her2p    0.097826
hrn_her2p    0.050314
Name: hr_her2, dtype: float64
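The subtype rules can also be written with named boolean masks, which makes the "ER or PR positive" definition of HR-positive easier to read. This is a sketch on toy data; the helper name `add_hr_her2` is hypothetical, and it assumes, as the code above does, that results are stored as the strings `'positive'` and `'negative'`.

```python
import numpy as np
import pandas as pd

def add_hr_her2(df):
    """Return a copy of df with an 'hr_her2' subtype column (hypothetical helper)."""
    hr_pos = (df['ER'] == 'positive') | (df['PR'] == 'positive')
    hr_neg = (df['ER'] == 'negative') & (df['PR'] == 'negative')
    her2_pos = df['HER2'] == 'positive'
    her2_neg = df['HER2'] == 'negative'
    conditions = [hr_pos & her2_neg, hr_pos & her2_pos, hr_neg & her2_pos, hr_neg & her2_neg]
    choices = ['hrp_her2n', 'hrp_her2p', 'hrn_her2p', 'trip_neg']
    return df.assign(hr_her2=np.select(conditions, choices, default='unknown'))

toy = pd.DataFrame({
    'ER':   ['positive', 'negative', 'negative', 'negative', 'unknown'],
    'PR':   ['negative', 'positive', 'negative', 'negative', 'negative'],
    'HER2': ['negative', 'positive', 'positive', 'negative', 'negative'],
})
subtypes = add_hr_her2(toy)['hr_her2'].tolist()
print(subtypes)
# ['hrp_her2n', 'hrp_her2p', 'hrn_her2p', 'trip_neg', 'unknown']
```

The last toy row shows why a share of patients ends up as `unknown`: one negative receptor is not enough to call HR-negative when the other result is missing.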

#### Organs involved 
* Nonvisceral (i.e., soft tissue, lymphadenopathy, skin, and bone)
* Bone-only
* Visceral, non-CNS
* CNS

In [10]:
mets = pd.read_csv('Enhanced_MetBreastSitesOfMet.csv')

In [11]:
mets = mets[mets['PatientID'].isin(train_cox['PatientID'])]

In [12]:
row_ID(mets)

(62969, 25234)

In [13]:
mets.sample(5)

| | PatientID | DateOfMetastasis | SiteOfMetastasis |
|---|---|---|---|
| 58932 | FE78885870648 | 2019-08 | Lung |
| 54790 | F0B521B23CAD5 | 2022-01 | Distant lymph node |
| 21950 | F49825FE6FCB4 | 2013-06 | Lung |
| 72942 | FE37AC81588B1 | 2017-12 | Distant lymph node |
| 54144 | FA5A3BBC89F9F | 2020-12 | Distant lymph node |


In [14]:
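# assign() replaces the column outright, so the result has a datetime dtype
# (assignment through .loc[:, col] can keep the original object dtype in newer pandas).
mets = mets.assign(DateOfMetastasis = pd.to_datetime(mets['DateOfMetastasis']))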

In [15]:
enhanced_met = pd.read_csv('Enhanced_MetastaticBreast.csv')

In [16]:
enhanced_met = enhanced_met[enhanced_met['PatientID'].isin(train_cox['PatientID'])]

In [17]:
row_ID(enhanced_met)

(25341, 25341)

In [18]:
# assign() replaces the column outright, so the result has a datetime dtype.
enhanced_met = enhanced_met.assign(MetDiagnosisDate = pd.to_datetime(enhanced_met['MetDiagnosisDate']))

In [19]:
mets = pd.merge(mets, enhanced_met[['PatientID', 'MetDiagnosisDate']], on = 'PatientID', how = 'left')

In [20]:
mets.loc[:, 'date_diff'] = (mets['DateOfMetastasis'] - mets['MetDiagnosisDate']).dt.days

In [21]:
mets = mets.query('date_diff <= 30')
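To make the 30-day window concrete, here is a small sketch on made-up rows (the IDs and dates are hypothetical). It assumes, as in the sample above, that `DateOfMetastasis` is month-granular and is therefore parsed as the first day of the month.

```python
import pandas as pd

sites = pd.DataFrame({
    'PatientID': ['A', 'A', 'B'],
    'DateOfMetastasis': ['2019-08', '2020-03', '2017-12'],
    'SiteOfMetastasis': ['Bone', 'Liver', 'Lung'],
})
diagnosis = pd.DataFrame({
    'PatientID': ['A', 'B'],
    'MetDiagnosisDate': ['2019-08-15', '2017-12-01'],
})

# Month-only strings become the first day of that month.
sites['DateOfMetastasis'] = pd.to_datetime(sites['DateOfMetastasis'], format='%Y-%m')
diagnosis['MetDiagnosisDate'] = pd.to_datetime(diagnosis['MetDiagnosisDate'], format='%Y-%m-%d')

merged = sites.merge(diagnosis, on='PatientID', how='left')
merged['date_diff'] = (merged['DateOfMetastasis'] - merged['MetDiagnosisDate']).dt.days

# Keep sites recorded up to 30 days after diagnosis; negative differences are kept too.
baseline_sites = merged.query('date_diff <= 30')
print(baseline_sites[['PatientID', 'SiteOfMetastasis', 'date_diff']])
```

Patient A's bone site has a difference of -14 days because of the month rounding and is kept, while the liver site recorded months later is dropped.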

In [22]:
mets.SiteOfMetastasis.value_counts(normalize = True)

Bone                  0.356779
Lung                  0.161228
Distant lymph node    0.151943
Liver                 0.124087
Pleura                0.052285
Brain                 0.037276
Other                 0.024497
Skin                  0.019246
Soft tissue           0.018300
Peritoneum            0.013973
Bone marrow           0.012373
Adrenal               0.010637
CNS site              0.008992
Ovary                 0.003854
Spleen                0.001893
Pancreas              0.001240
Kidney                0.000969
Thyroid               0.000428
Name: SiteOfMetastasis, dtype: float64

In [23]:
mets_wide = mets.groupby('PatientID')['SiteOfMetastasis'].apply(','.join).reset_index()

In [24]:
mets_wide.sample(3)

| | PatientID | SiteOfMetastasis |
|---|---|---|
| 3914 | F27EEA657AC06 | Bone |
| 1800 | F1312F93EAB67 | Distant lymph node,Other |
| 19469 | FC6ACA3C389EB | Bone,Distant lymph node,Liver |


In [25]:
row_ID(mets_wide)

(25105, 25105)

In [26]:
not_bone = mets.SiteOfMetastasis.unique().tolist()
not_bone.remove('Bone')
not_bone.remove('Bone marrow')

In [27]:
not_bone

['Distant lymph node',
 'Lung',
 'Pleura',
 'Liver',
 'Skin',
 'Brain',
 'Other',
 'Soft tissue',
 'Peritoneum',
 'Adrenal',
 'CNS site',
 'Pancreas',
 'Ovary',
 'Spleen',
 'Kidney',
 'Thyroid']

In [28]:
visceral_cns = mets.SiteOfMetastasis.unique().tolist()
visceral_cns.remove('Distant lymph node')
visceral_cns.remove('Skin')
visceral_cns.remove('Soft tissue')
visceral_cns.remove('Bone')
visceral_cns.remove('Bone marrow')

In [29]:
visceral_cns

['Lung',
 'Pleura',
 'Liver',
 'Brain',
 'Other',
 'Peritoneum',
 'Adrenal',
 'CNS site',
 'Pancreas',
 'Ovary',
 'Spleen',
 'Kidney',
 'Thyroid']

In [30]:
bone_only_IDs = (
    mets_wide[mets_wide['SiteOfMetastasis'].str.contains('Bone|Bone marrow') & 
             ~mets_wide['SiteOfMetastasis'].str.contains('|'.join(not_bone))]
    .PatientID
)

In [31]:
cns_IDs = (
    mets_wide[mets_wide['SiteOfMetastasis'].str.contains('Brain|CNS site')]
    .PatientID
)

In [32]:
nonvisceral = (
    mets_wide[mets_wide['SiteOfMetastasis'].str.contains('Distant lymph node|'
                                                         'Skin|'
                                                         'Soft tissue|'
                                                         'Bone|'
                                                         'Bone marrow') & 
             ~mets_wide['SiteOfMetastasis'].str.contains('|'.join(visceral_cns))]
)

In [33]:
nonvisceral_IDs = nonvisceral[~nonvisceral.PatientID.isin(bone_only_IDs)].PatientID

In [34]:
bone_cns_nonvisceral = np.concatenate([bone_only_IDs, cns_IDs, nonvisceral_IDs])

In [35]:
visceral_IDs = mets_wide[~mets_wide.PatientID.isin(bone_cns_nonvisceral)].PatientID

In [36]:
len(bone_only_IDs) + len(cns_IDs) + len(nonvisceral_IDs) + len(visceral_IDs) == len(mets_wide)

True

In [37]:
conditions = [
    (train_cox['PatientID'].isin(bone_only_IDs)),
    (train_cox['PatientID'].isin(cns_IDs)),
    (train_cox['PatientID'].isin(nonvisceral_IDs)),
    (train_cox['PatientID'].isin(visceral_IDs))
]    

choices = ['bone_only',
           'cns',
           'nonvisceral',
           'visceral_ncns']
    
train_cox.loc[:, 'met_site'] = np.select(conditions, choices, default = 'unknown')

In [38]:
train_cox.met_site.value_counts(normalize = True, dropna = False)

visceral_ncns    0.482854
bone_only        0.310722
nonvisceral      0.122252
cns              0.074859
unknown          0.009313
Name: met_site, dtype: float64
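The four categories can also be derived with set operations instead of regular expressions on a joined string. The sketch below is an alternative formulation under the same grouping of sites used above (anything outside bone, lymph node, skin, and soft tissue counts as visceral, including `Other`); the constant and function names are hypothetical, and the data are made up.

```python
import pandas as pd

BONE = {'Bone', 'Bone marrow'}
CNS = {'Brain', 'CNS site'}
NONVISCERAL = BONE | {'Distant lymph node', 'Skin', 'Soft tissue'}

def classify_sites(sites):
    """Map one patient's metastatic sites to a single category."""
    sites = set(sites)
    if sites & CNS:            # any CNS involvement
        return 'cns'
    if sites <= BONE:          # nothing outside bone / bone marrow
        return 'bone_only'
    if sites <= NONVISCERAL:   # nonvisceral sites only, not bone-only
        return 'nonvisceral'
    return 'visceral_ncns'     # at least one visceral site, no CNS

toy = pd.DataFrame({
    'PatientID': ['A', 'B', 'B', 'C', 'C', 'D', 'D'],
    'SiteOfMetastasis': ['Bone', 'Bone', 'Skin', 'Bone', 'Liver', 'Lung', 'Brain'],
})
met_site = toy.groupby('PatientID')['SiteOfMetastasis'].apply(classify_sites)
print(met_site.to_dict())
# {'A': 'bone_only', 'B': 'nonvisceral', 'C': 'visceral_ncns', 'D': 'cns'}
```

Because each patient receives exactly one label, the categories are mutually exclusive by construction and no separate length check is needed.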

#### Number of organs involved: 1, 2, 3, or >=4

In [39]:
mets_count = (
    mets
    .groupby('PatientID')['SiteOfMetastasis'].count()
    .reset_index()
    .rename(columns = {'SiteOfMetastasis': 'organ_number'})
)

In [40]:
mets_count['organ_number'] = np.where(mets_count['organ_number'] >= 4, 4, mets_count['organ_number'])

In [41]:
mets_count.organ_number.value_counts(normalize = True, dropna = False)

1    0.546385
2    0.253774
3    0.122884
4    0.076957
Name: organ_number, dtype: float64
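The count above is the number of rows per patient. If the site table can list the same site more than once for a patient (an assumption that is not checked in this notebook), counting distinct sites may be closer to "number of organs". A sketch of that variant on made-up data, using `clip` for the ">=4" cap:

```python
import pandas as pd

toy = pd.DataFrame({
    'PatientID': ['A', 'B', 'B', 'C', 'C', 'C', 'C', 'C'],
    'SiteOfMetastasis': ['Bone', 'Bone', 'Bone', 'Bone', 'Liver', 'Lung', 'Skin', 'Brain'],
})

organ_number = (
    toy
    .groupby('PatientID')['SiteOfMetastasis'].nunique()  # distinct sites, not rows
    .clip(upper=4)                                        # 4 stands for ">=4"
    .rename('organ_number')
    .reset_index()
)
print(organ_number['organ_number'].tolist())  # [1, 1, 4]
```

Patient B has two bone rows but counts as one organ here; with `count()` the value would be 2.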

In [42]:
row_ID(train_cox)

(25341, 25341)

In [43]:
train_cox = pd.merge(train_cox, mets_count, on = 'PatientID', how = 'left')

In [44]:
row_ID(train_cox)

(25341, 25341)

In [45]:
train_cox.organ_number.value_counts(normalize = True, dropna = False)

1.0    0.541297
2.0    0.251411
3.0    0.121739
4.0    0.076240
NaN    0.009313
Name: organ_number, dtype: float64

#### Race or ethnicity
* White
* Black
* Others

In [46]:
train_cox.race.value_counts(normalize = True, dropna = False)

White                        0.642911
Black or African American    0.117951
Other Race                   0.116846
unknown                      0.101732
Asian                        0.020560
Name: race, dtype: float64

In [47]:
train_cox['race'] = np.where(train_cox['race'] == 'Asian', 'Other Race', train_cox['race'])

In [48]:
train_cox.race.value_counts(normalize = True, dropna = False)

White                        0.642911
Other Race                   0.137406
Black or African American    0.117951
unknown                      0.101732
Name: race, dtype: float64

#### Interaction between de novo and recurrent MBC status and HR and HER2 status
* De novo MBC or HR-positive and HER2-negative
* Recurrent MBC and HR and HER2-positive
* Recurrent MBC and HR-negative and HER2-positive
* Recurrent MBC and triple-negative 

In [49]:
conditions = [
    (train_cox['dn_recurrent'] == 'de_novo') | (train_cox['hr_her2'] == 'hrp_her2n'),
    ((train_cox['dn_recurrent'] == 'ge24') | (train_cox['dn_recurrent'] == 'l24')) & (train_cox['hr_her2'] == 'hrp_her2p'),
    ((train_cox['dn_recurrent'] == 'ge24') | (train_cox['dn_recurrent'] == 'l24')) & (train_cox['hr_her2'] == 'hrn_her2p'),
    ((train_cox['dn_recurrent'] == 'ge24') | (train_cox['dn_recurrent'] == 'l24')) & (train_cox['hr_her2'] == 'trip_neg')
]    

choices = ['dn_hrp_her2n',
           'r_hrp_her2p',
           'r_hrn_her2p',
           'r_trip_neg']
    
train_cox.loc[:, 'recurrent_biomarker'] = np.select(conditions, choices, default = 'unknown')
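Because the first condition uses "or", it absorbs every de novo patient regardless of subtype, including those with unknown biomarkers. The toy example below (made-up rows) applies the same rules to show how each combination resolves.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'dn_recurrent': ['de_novo', 'ge24', 'l24', 'ge24', 'l24', 'de_novo'],
    'hr_her2': ['trip_neg', 'hrp_her2n', 'hrp_her2p', 'trip_neg', 'unknown', 'unknown'],
})

recurrent = toy['dn_recurrent'].isin(['l24', 'ge24'])
conditions = [
    (toy['dn_recurrent'] == 'de_novo') | (toy['hr_her2'] == 'hrp_her2n'),
    recurrent & (toy['hr_her2'] == 'hrp_her2p'),
    recurrent & (toy['hr_her2'] == 'hrn_her2p'),
    recurrent & (toy['hr_her2'] == 'trip_neg'),
]
choices = ['dn_hrp_her2n', 'r_hrp_her2p', 'r_hrn_her2p', 'r_trip_neg']
toy['recurrent_biomarker'] = np.select(conditions, choices, default='unknown')
print(toy)
```

Only recurrent patients with an unknown subtype land in `unknown`; a de novo patient with an unknown subtype is assigned to `dn_hrp_her2n`.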

**Frontline biotherapy (trastuzumab and/or pertuzumab) and HT are not included in the model, since the machine learning models do not include therapy.**

### Test set 

In [50]:
test_cox = (
    test
    .filter(items = [
        'PatientID',
        'age',
        'delta_met_diagnosis',
        'ER', 
        'PR',
        'HER2',
        'ecog_diagnosis',
        'race', 
        'death_status',
        'timerisk_activity'])
)

#### De novo and metastasis-free interval
* de novo
* MFI <24 months
* MFI >=24 months

In [51]:
conditions = [
    test_cox['delta_met_diagnosis'] == 0,
    (test_cox['delta_met_diagnosis'] != 0) & (test_cox['delta_met_diagnosis'] < 730),
    (test_cox['delta_met_diagnosis'] != 0) & (test_cox['delta_met_diagnosis'] >= 730)
]    

choices = ['de_novo',
           'l24',
           'ge24']
    
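# default = 'unknown' catches rows that match no condition (e.g., a missing interval).
test_cox.loc[:, 'dn_recurrent'] = np.select(conditions, choices, default = 'unknown')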

#### HR and HER2 status
* HR-positive and HER2-negative
* HR and HER2-positive
* HR-negative and HER2-positive
* Triple-negative

In [52]:
conditions = [
    ((test_cox['ER'] == 'positive') | (test_cox['PR'] == 'positive')) & (test_cox['HER2'] == 'negative'),
    ((test_cox['ER'] == 'positive') | (test_cox['PR'] == 'positive')) & (test_cox['HER2'] == 'positive'),
    ((test_cox['ER'] == 'negative') & (test_cox['PR'] == 'negative')) & (test_cox['HER2'] == 'positive'),
    (test_cox['ER'] == 'negative') & (test_cox['PR'] == 'negative') & (test_cox['HER2'] == 'negative')
]    

choices = ['hrp_her2n',
           'hrp_her2p',
           'hrn_her2p',
           'trip_neg']
    
test_cox.loc[:, 'hr_her2'] = np.select(conditions, choices, default = 'unknown')

#### Organs involved 
* Nonvisceral (i.e., soft tissue, lymphadenopathy, skin, and bone)
* Bone-only
* Visceral, non-CNS
* CNS

In [53]:
mets = pd.read_csv('Enhanced_MetBreastSitesOfMet.csv')

In [54]:
mets = mets[mets['PatientID'].isin(test_cox['PatientID'])]

In [55]:
row_ID(mets)

(15986, 6306)

In [56]:
mets.sample(5)

| | PatientID | DateOfMetastasis | SiteOfMetastasis |
|---|---|---|---|
| 14608 | F68B6B363D4F7 | 2022-02 | Distant lymph node |
| 40378 | F93F0003D40E5 | 2016-12 | Pleura |
| 6892 | FC0280FA5FB34 | 2017-10 | Liver |
| 65815 | F7FED801777F9 | 2011-10 | Bone |
| 37866 | FD518B51CE1C8 | 2016-07 | Pleura |


In [57]:
# assign() replaces the column outright, so the result has a datetime dtype.
mets = mets.assign(DateOfMetastasis = pd.to_datetime(mets['DateOfMetastasis']))

In [58]:
enhanced_met = pd.read_csv('Enhanced_MetastaticBreast.csv')

In [59]:
enhanced_met = enhanced_met[enhanced_met['PatientID'].isin(test_cox['PatientID'])]

In [60]:
row_ID(enhanced_met)

(6336, 6336)

In [61]:
# assign() replaces the column outright, so the result has a datetime dtype.
enhanced_met = enhanced_met.assign(MetDiagnosisDate = pd.to_datetime(enhanced_met['MetDiagnosisDate']))

In [62]:
mets = pd.merge(mets, enhanced_met[['PatientID', 'MetDiagnosisDate']], on = 'PatientID', how = 'left')

In [63]:
mets.loc[:, 'date_diff'] = (mets['DateOfMetastasis'] - mets['MetDiagnosisDate']).dt.days

In [64]:
mets = mets.query('date_diff <= 30')

In [65]:
mets_wide = mets.groupby('PatientID')['SiteOfMetastasis'].apply(','.join).reset_index()

In [66]:
row_ID(mets_wide)

(6271, 6271)

In [67]:
not_bone = mets.SiteOfMetastasis.unique().tolist()
not_bone.remove('Bone')
not_bone.remove('Bone marrow')

In [68]:
visceral_cns = mets.SiteOfMetastasis.unique().tolist()
visceral_cns.remove('Distant lymph node')
visceral_cns.remove('Skin')
visceral_cns.remove('Soft tissue')
visceral_cns.remove('Bone')
visceral_cns.remove('Bone marrow')

In [69]:
bone_only_IDs = (
    mets_wide[mets_wide['SiteOfMetastasis'].str.contains('Bone|Bone marrow') & 
             ~mets_wide['SiteOfMetastasis'].str.contains('|'.join(not_bone))]
    .PatientID
)

In [70]:
cns_IDs = (
    mets_wide[mets_wide['SiteOfMetastasis'].str.contains('Brain|CNS site')]
    .PatientID
)

In [71]:
nonvisceral = (
    mets_wide[mets_wide['SiteOfMetastasis'].str.contains('Distant lymph node|'
                                                         'Skin|'
                                                         'Soft tissue|'
                                                         'Bone|'
                                                         'Bone marrow') & 
             ~mets_wide['SiteOfMetastasis'].str.contains('|'.join(visceral_cns))]
)

In [72]:
nonvisceral_IDs = nonvisceral[~nonvisceral.PatientID.isin(bone_only_IDs)].PatientID

In [73]:
bone_cns_nonvisceral = np.concatenate([bone_only_IDs, cns_IDs, nonvisceral_IDs])

In [74]:
visceral_IDs = mets_wide[~mets_wide.PatientID.isin(bone_cns_nonvisceral)].PatientID

In [75]:
len(bone_only_IDs) + len(cns_IDs) + len(nonvisceral_IDs) + len(visceral_IDs) == len(mets_wide)

True

In [76]:
conditions = [
    (test_cox['PatientID'].isin(bone_only_IDs)),
    (test_cox['PatientID'].isin(cns_IDs)),
    (test_cox['PatientID'].isin(nonvisceral_IDs)),
    (test_cox['PatientID'].isin(visceral_IDs))
]    

choices = ['bone_only',
           'cns',
           'nonvisceral',
           'visceral_ncns']
    
test_cox.loc[:, 'met_site'] = np.select(conditions, choices, default = 'unknown')

#### Number of organs involved: 1, 2, 3, or >=4

In [77]:
mets_count = (
    mets
    .groupby('PatientID')['SiteOfMetastasis'].count()
    .reset_index()
    .rename(columns = {'SiteOfMetastasis': 'organ_number'})
)

In [78]:
mets_count['organ_number'] = np.where(mets_count['organ_number'] >= 4, 4, mets_count['organ_number'])

In [79]:
row_ID(test_cox)

(6336, 6336)

In [80]:
test_cox = pd.merge(test_cox, mets_count, on = 'PatientID', how = 'left')

In [81]:
row_ID(test_cox)

(6336, 6336)

#### Race or ethnicity
* White
* Black
* Others

In [82]:
test_cox['race'] = np.where(test_cox['race'] == 'Asian', 'Other Race', test_cox['race'])

#### Interaction between de novo and recurrent MBC status and HR and HER2 status
* De novo MBC or HR-positive and HER2-negative
* Recurrent MBC and HR and HER2-positive
* Recurrent MBC and HR-negative and HER2-positive
* Recurrent MBC and triple-negative 

In [83]:
conditions = [
    (test_cox['dn_recurrent'] == 'de_novo') | (test_cox['hr_her2'] == 'hrp_her2n'),
    ((test_cox['dn_recurrent'] == 'ge24') | (test_cox['dn_recurrent'] == 'l24')) & (test_cox['hr_her2'] == 'hrp_her2p'),
    ((test_cox['dn_recurrent'] == 'ge24') | (test_cox['dn_recurrent'] == 'l24')) & (test_cox['hr_her2'] == 'hrn_her2p'),
    ((test_cox['dn_recurrent'] == 'ge24') | (test_cox['dn_recurrent'] == 'l24')) & (test_cox['hr_her2'] == 'trip_neg')
]    

choices = ['dn_hrp_her2n',
           'r_hrp_her2p',
           'r_hrn_her2p',
           'r_trip_neg']
    
test_cox.loc[:, 'recurrent_biomarker'] = np.select(conditions, choices, default = 'unknown')

## 2. Crude imputation

**The first version of the Cox model leaves missing values as "unknown", except for missing organ number, which is imputed as 1.**

In [84]:
from sksurv.linear_model import CoxPHSurvivalAnalysis

from sksurv.metrics import cumulative_dynamic_auc

In [85]:
# Create dummy variables for one categorical column and drop the original column.
def encode_and_bind(original_dataframe, feature_to_encode):
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis = 1)
    res = res.drop([feature_to_encode], axis = 1)
    return(res) 
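A small usage sketch of `encode_and_bind` on made-up data, followed by dropping one level as the reference category (here `race_unknown`, mirroring the pattern used in the cells that follow):

```python
import pandas as pd

def encode_and_bind(original_dataframe, feature_to_encode):
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    res = res.drop([feature_to_encode], axis=1)
    return res

toy = pd.DataFrame({'age': [50, 61, 72], 'race': ['White', 'unknown', 'White']})

# One indicator column per level, named "<column>_<level>".
encoded = encode_and_bind(toy, 'race').drop(columns=['race_unknown'])
print(list(encoded.columns))  # ['age', 'race_White']
```

Dropping one level per variable avoids perfectly collinear indicator columns in the Cox model; the dropped level becomes the reference against which the remaining coefficients are interpreted.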

### Processing X 

#### Select relevant variables

In [86]:
train_cox_x = (
    train_cox
    .filter(items = [
        'PatientID',
        'age',
        'ecog_diagnosis',
        'race',
        'dn_recurrent',
        'hr_her2',
        'met_site',
        'organ_number',
        'recurrent_biomarker']))

In [87]:
train_cox_x = train_cox_x.set_index('PatientID')

In [88]:
train_cox_x.shape

(25341, 8)

In [89]:
test_cox_x = (
    test_cox
    .filter(items = [
        'PatientID',
        'age',
        'ecog_diagnosis',
        'race',
        'dn_recurrent',
        'hr_her2',
        'met_site',
        'organ_number',
        'recurrent_biomarker']))

In [90]:
test_cox_x = test_cox_x.set_index('PatientID')

In [91]:
test_cox_x.shape

(6336, 8)

#### Impute 1 for missing organ numbers

In [92]:
train_cox_x['organ_number'] = train_cox_x['organ_number'].fillna(1)

In [93]:
train_cox_x.organ_number.value_counts(normalize = True, dropna = False)

1.0    0.550610
2.0    0.251411
3.0    0.121739
4.0    0.076240
Name: organ_number, dtype: float64

In [94]:
test_cox_x['organ_number'] = test_cox_x['organ_number'].fillna(1)

#### Convert relevant variables to categorical 

In [95]:
list(train_cox_x.select_dtypes(include = ['object']).columns)

['ecog_diagnosis',
 'race',
 'dn_recurrent',
 'hr_her2',
 'met_site',
 'recurrent_biomarker']

In [96]:
to_be_categorical = list(train_cox_x.select_dtypes(include = ['object']).columns)

In [97]:
to_be_categorical.append('organ_number')

In [98]:
for x in list(to_be_categorical):
    train_cox_x[x] = train_cox_x[x].astype('category')

In [99]:
train_cox_x.dtypes

age                       int64
ecog_diagnosis         category
race                   category
dn_recurrent           category
hr_her2                category
met_site               category
organ_number           category
recurrent_biomarker    category
dtype: object

In [100]:
for x in list(to_be_categorical):
    test_cox_x[x] = test_cox_x[x].astype('category')

#### Dummy encode categorical variables 

In [101]:
# Dummy variables for ecog_diagnosis
train_cox_x = encode_and_bind(train_cox_x, 'ecog_diagnosis')
train_cox_x = train_cox_x.drop(columns = ['ecog_diagnosis_4.0'])

test_cox_x = encode_and_bind(test_cox_x, 'ecog_diagnosis')
test_cox_x = test_cox_x.drop(columns = ['ecog_diagnosis_4.0'])

# Dummy variables for race
train_cox_x = encode_and_bind(train_cox_x, 'race')
train_cox_x = train_cox_x.drop(columns = ['race_unknown'])

test_cox_x = encode_and_bind(test_cox_x, 'race')
test_cox_x = test_cox_x.drop(columns = ['race_unknown'])

# Dummy variables for dn_recurrent
train_cox_x = encode_and_bind(train_cox_x, 'dn_recurrent')
train_cox_x = train_cox_x.drop(columns = ['dn_recurrent_l24'])

test_cox_x = encode_and_bind(test_cox_x, 'dn_recurrent')
test_cox_x = test_cox_x.drop(columns = ['dn_recurrent_l24'])

# Dummy variables for hr_her2
train_cox_x = encode_and_bind(train_cox_x, 'hr_her2')
train_cox_x = train_cox_x.drop(columns = ['hr_her2_unknown'])

test_cox_x = encode_and_bind(test_cox_x, 'hr_her2')
test_cox_x = test_cox_x.drop(columns = ['hr_her2_unknown'])

# Dummy variables for met_site
train_cox_x = encode_and_bind(train_cox_x, 'met_site')
train_cox_x = train_cox_x.drop(columns = ['met_site_unknown'])

test_cox_x = encode_and_bind(test_cox_x, 'met_site')
test_cox_x = test_cox_x.drop(columns = ['met_site_unknown'])

# Dummy variables for organ_number
train_cox_x = encode_and_bind(train_cox_x, 'organ_number')
train_cox_x = train_cox_x.drop(columns = ['organ_number_4.0'])

test_cox_x = encode_and_bind(test_cox_x, 'organ_number')
test_cox_x = test_cox_x.drop(columns = ['organ_number_4.0'])

# Dummy variables for recurrent_biomarker
train_cox_x = encode_and_bind(train_cox_x, 'recurrent_biomarker')
train_cox_x = train_cox_x.drop(columns = ['recurrent_biomarker_unknown'])

test_cox_x = encode_and_bind(test_cox_x, 'recurrent_biomarker')
test_cox_x = test_cox_x.drop(columns = ['recurrent_biomarker_unknown'])

In [102]:
print(train_cox_x.shape)
print(test_cox_x.shape)

(25341, 26)
(6336, 26)
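Matching shapes are reassuring, but they do not by themselves guarantee that train and test have the same columns in the same order, since a level that is absent from one split produces no dummy column there. One defensive option, sketched on toy data with hypothetical names, is to reindex the test matrix to the training columns:

```python
import pandas as pd

train_x = pd.DataFrame({'site': ['bone', 'cns', 'visceral']})
test_x = pd.DataFrame({'site': ['bone', 'visceral']})  # 'cns' never occurs here

train_d = pd.get_dummies(train_x)
test_d = pd.get_dummies(test_x)

# Reindex test columns to the training layout; absent levels become all-False columns.
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=False)
print(list(test_aligned.columns))  # ['site_bone', 'site_cns', 'site_visceral']
```

A direct check such as `list(train_cox_x.columns) == list(test_cox_x.columns)` would confirm whether this step is needed for the matrices above.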


### Processing Y

In [103]:
# Convert death_status into True or False (required for scikit-survival). 
train_cox['death_status'] = train_cox['death_status'].astype('bool')

In [104]:
y_dtypes = train_cox[['death_status', 'timerisk_activity']].dtypes

train_cox_y = np.array([tuple(x) for x in train_cox[['death_status', 'timerisk_activity']].values],
                        dtype = list(zip(y_dtypes.index, y_dtypes)))
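scikit-survival expects the outcome as a NumPy structured array whose first field is the boolean event indicator and whose second field is the observed time. The list comprehension above builds exactly that; an equivalent construction without a Python-level loop is sketched below on made-up values.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'death_status': [1, 0, 1],
                    'timerisk_activity': [120.0, 800.0, 45.0]})

# Allocate the structured array, then fill each field from its column.
y = np.empty(len(toy), dtype=[('death_status', bool), ('timerisk_activity', float)])
y['death_status'] = toy['death_status'].astype(bool).to_numpy()
y['timerisk_activity'] = toy['timerisk_activity'].to_numpy()

print(y.dtype.names)  # ('death_status', 'timerisk_activity')
print(y.shape)        # (3,)
```

scikit-survival also provides `sksurv.util.Surv.from_dataframe`, which builds the same kind of array from an event column and a time column.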

In [105]:
train_cox_y.shape

(25341,)

In [106]:
# Convert death_status into True or False (required for scikit-survival). 
test_cox['death_status'] = test_cox['death_status'].astype('bool')

In [107]:
y_dtypes = test_cox[['death_status', 'timerisk_activity']].dtypes

test_cox_y = np.array([tuple(x) for x in test_cox[['death_status', 'timerisk_activity']].values],
                        dtype = list(zip(y_dtypes.index, y_dtypes)))

In [108]:
test_cox_y.shape

(6336,)

### Build and assess model performance

In [109]:
cox_crude = CoxPHSurvivalAnalysis()

cox_crude.fit(train_cox_x, train_cox_y)

CoxPHSurvivalAnalysis()

In [110]:
cox_crude_risk_scores_te = cox_crude.predict(test_cox_x)
cox_crude_auc_te = cumulative_dynamic_auc(train_cox_y, test_cox_y, cox_crude_risk_scores_te, 730)[0][0]
print('Test set AUC at 2 years:', cox_crude_auc_te)

Test set AUC at 2 years: 0.7381584195762808


In [111]:
cox_crude_risk_scores_tr = cox_crude.predict(train_cox_x)
cox_crude_auc_tr = cumulative_dynamic_auc(train_cox_y, train_cox_y, cox_crude_risk_scores_tr, 730)[0][0]
print('Training set AUC at 2 years:', cox_crude_auc_tr)

Training set AUC at 2 years: 0.7401442929043079


In [112]:
# Bootstrap 10,000 2-year (730-day) AUCs for the test set 
n_bootstraps = 10000
rng_seed = 42 
bootstrapped_scores_te = []

rng = np.random.RandomState(rng_seed)
for i in range(n_bootstraps):
    indices = rng.randint(0, len(cox_crude_risk_scores_te), len(cox_crude_risk_scores_te))
    auc_yr = cumulative_dynamic_auc(train_cox_y, test_cox_y[indices], cox_crude_risk_scores_te[indices], 730)[0][0]
    bootstrapped_scores_te.append(auc_yr)

In [113]:
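# Standard error of the test set AUC, approximated from the width of the
# bootstrap 95% percentile interval (3.92 = 2 * 1.96 under a normal approximation)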
sorted_scores_te = np.array(bootstrapped_scores_te)
sorted_scores_te.sort()

conf_lower_te = sorted_scores_te[int(0.025 * len(sorted_scores_te))]
conf_upper_te = sorted_scores_te[int(0.975 * len(sorted_scores_te))]

standard_error_te = (conf_upper_te - conf_lower_te) / 3.92
print('Test set AUC standard error:', standard_error_te)

Test set AUC standard error: 0.007090920750268647
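The standard error above is backed out of the 95% percentile interval, which assumes the bootstrap distribution is roughly normal. The sketch below shows the same calculation as a reusable function next to the more direct estimate, the standard deviation of the bootstrap replicates. The function name `bootstrap_se`, the toy accuracy metric, and the random data are all hypothetical stand-ins for the AUC computation.

```python
import numpy as np

def bootstrap_se(metric, y_true, y_score, n_bootstraps=1000, seed=42):
    """Return (SE from 95% percentile interval, SD of bootstrap replicates)."""
    rng = np.random.RandomState(seed)
    n = len(y_score)
    scores = np.empty(n_bootstraps)
    for b in range(n_bootstraps):
        idx = rng.randint(0, n, n)  # resample rows with replacement
        scores[b] = metric(y_true[idx], y_score[idx])
    lower, upper = np.percentile(scores, [2.5, 97.5])
    # A 95% normal interval spans 2 * 1.96 = 3.92 standard errors.
    return (upper - lower) / 3.92, scores.std(ddof=1)

# Toy stand-in for the AUC: classification accuracy at a 0.5 threshold.
rng = np.random.RandomState(0)
y_true = rng.rand(200) < 0.5
y_score = rng.rand(200)
accuracy = lambda y, s: np.mean((s > 0.5) == y)

se_percentile, se_std = bootstrap_se(accuracy, y_true, y_score)
print(se_percentile, se_std)
```

When the two numbers agree closely, the normal approximation behind the division by 3.92 is reasonable; a large gap would suggest reporting the percentile interval itself.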


In [114]:
# Bootstrap 10,000 2-year (730-day) AUCs for the training set 
n_bootstraps = 10000
rng_seed = 42 
bootstrapped_scores_tr = []

rng = np.random.RandomState(rng_seed)
for i in range(n_bootstraps):
    indices = rng.randint(0, len(cox_crude_risk_scores_tr), len(cox_crude_risk_scores_tr))
    auc_yr = cumulative_dynamic_auc(train_cox_y, train_cox_y[indices], cox_crude_risk_scores_tr[indices], 730)[0][0]
    bootstrapped_scores_tr.append(auc_yr)

In [115]:
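# Standard error of the training set AUC, approximated from the width of the
# bootstrap 95% percentile interval (3.92 = 2 * 1.96 under a normal approximation)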
sorted_scores_tr = np.array(bootstrapped_scores_tr)
sorted_scores_tr.sort()

conf_lower_tr = sorted_scores_tr[int(0.025 * len(sorted_scores_tr))]
conf_upper_tr = sorted_scores_tr[int(0.975 * len(sorted_scores_tr))]

standard_error_tr = (conf_upper_tr - conf_lower_tr) / 3.92
print('Training set AUC standard error', standard_error_tr)

Training set AUC standard error 0.003576991982545572


In [116]:
cox_auc_data = {'model': ['cox_crude'],
                'auc_2yr_te': [cox_crude_auc_te],
                'sem_te': [standard_error_te],
                'auc_2yr_tr': [cox_crude_auc_tr],
                'sem_tr': [standard_error_tr]}

cox_auc_df = pd.DataFrame(cox_auc_data)

In [117]:
cox_auc_df

| | model | auc_2yr_te | sem_te | auc_2yr_tr | sem_tr |
|---|---|---|---|---|---|
| 0 | cox_crude | 0.738158 | 0.007091 | 0.740144 | 0.003577 |


In [118]:
cox_auc_df.to_csv('cox_auc_df.csv', index = False, header = True)

In [119]:
times = np.arange(30, 1810, 30)
crude_cox_auc_over5 = cumulative_dynamic_auc(train_cox_y, test_cox_y, cox_crude_risk_scores_te, times)[0]

times_data = {}
values = crude_cox_auc_over5
time_names = []

for x in range(len(times)):
    time_names.append('time_'+str(times[x]))

for i in range(len(time_names)):
    times_data[time_names[i]] = values[i]
    
cox_auc_over5 = pd.DataFrame(times_data, index = ['cox_crude'])
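The one-row table of AUCs over time can also be built in a single step by passing the column names directly. In this sketch the AUC values are placeholders generated with `np.linspace`, not model results, and `auc_over_time` is a hypothetical name.

```python
import numpy as np
import pandas as pd

times = np.arange(30, 1810, 30)                      # 30, 60, ..., 1800 days
auc_values = np.linspace(0.75, 0.71, len(times))     # placeholder values only

auc_over_time = pd.DataFrame(
    [auc_values],                                    # one row of values
    columns=[f'time_{t}' for t in times],            # time_30 ... time_1800
    index=['cox_crude'],
)
print(auc_over_time.shape)  # (1, 60)
```

Rows for other models could then be appended with `pd.concat`, since every row shares the same `time_*` columns.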

In [120]:
cox_auc_over5

| | time_30 | time_60 | time_90 | time_120 | time_150 | time_180 | time_210 | time_240 | time_270 | time_300 | ... | time_1530 | time_1560 | time_1590 | time_1620 | time_1650 | time_1680 | time_1710 | time_1740 | time_1770 | time_1800 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cox_crude | 0.754622 | 0.757984 | 0.760242 | 0.758289 | 0.756238 | 0.754524 | 0.754279 | 0.751298 | 0.757135 | 0.756677 | ... | 0.712425 | 0.711211 | 0.712376 | 0.710032 | 0.711739 | 0.711844 | 0.712153 | 0.712078 | 0.715026 | 0.716995 |


In [121]:
cox_auc_over5.to_csv('cox_auc_over5.csv', index = True, header = True)

## 3. Complete cases

**This Cox build uses complete cases only: patients with unknown ECOG, biomarkers, or site of metastasis, or with a missing organ number, are removed.**

### Remove patients with missing values 

In [122]:
train_cox_cc = (
    train_cox
    .query('ecog_diagnosis != "unknown"')
    .query('hr_her2 != "unknown"')
    .query('met_site != "unknown"')
    .query('organ_number.notna()', engine = 'python')
)

In [123]:
test_cox_cc = (
    test_cox
    .query('ecog_diagnosis != "unknown"')
    .query('hr_her2 != "unknown"')
    .query('met_site != "unknown"')
    .query('organ_number.notna()', engine = 'python')
)
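The chained `query` calls amount to a single boolean mask. The sketch below builds that mask on made-up rows and reports how many patients survive the filter; note that `race` is left out of the filter, as in the cells above, so patients with unknown race are kept.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'PatientID': ['A', 'B', 'C', 'D'],
    'ecog_diagnosis': ['1.0', 'unknown', '0.0', '2.0'],
    'hr_her2': ['trip_neg', 'hrp_her2n', 'unknown', 'hrp_her2p'],
    'met_site': ['bone_only', 'cns', 'nonvisceral', 'visceral_ncns'],
    'organ_number': [1.0, 2.0, 3.0, np.nan],
})

# A row is complete when no categorical field is 'unknown' and organ_number is present.
complete = (
    toy[['ecog_diagnosis', 'hr_her2', 'met_site']].ne('unknown').all(axis=1)
    & toy['organ_number'].notna()
)
toy_cc = toy[complete]
print(f'{complete.sum()} of {len(toy)} patients are complete cases')
# 1 of 4 patients are complete cases
```

Printing the retained fraction alongside the filter makes the attrition explicit, which matters here because the complete-case training set is well under half of the full training set (10068 of 25341).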

### Processing X 

#### Select relevant variables

In [124]:
train_cox_cc_x = (
    train_cox_cc
    .filter(items = [
        'PatientID',
        'age',
        'ecog_diagnosis',
        'race',
        'dn_recurrent',
        'hr_her2',
        'met_site',
        'organ_number',
        'recurrent_biomarker']))

In [125]:
train_cox_cc_x = train_cox_cc_x.set_index('PatientID')

In [126]:
train_cox_cc_x.shape

(10068, 8)

In [127]:
test_cox_cc_x = (
    test_cox_cc
    .filter(items = [
        'PatientID',
        'age',
        'ecog_diagnosis',
        'race',
        'dn_recurrent',
        'hr_her2',
        'met_site',
        'organ_number',
        'recurrent_biomarker']))

In [128]:
test_cox_cc_x = test_cox_cc_x.set_index('PatientID')

In [129]:
test_cox_cc_x.shape

(2552, 8)

#### Convert relevant variables to categorical 

In [130]:
list(train_cox_cc_x.select_dtypes(include = ['object']).columns)

['ecog_diagnosis',
 'race',
 'dn_recurrent',
 'hr_her2',
 'met_site',
 'recurrent_biomarker']

In [131]:
to_be_categorical = list(train_cox_cc_x.select_dtypes(include = ['object']).columns)

In [132]:
to_be_categorical.append('organ_number')

In [133]:
for x in list(to_be_categorical):
    train_cox_cc_x[x] = train_cox_cc_x[x].astype('category')

In [134]:
train_cox_cc_x.dtypes

age                       int64
ecog_diagnosis         category
race                   category
dn_recurrent           category
hr_her2                category
met_site               category
organ_number           category
recurrent_biomarker    category
dtype: object

In [135]:
for x in list(to_be_categorical):
    test_cox_cc_x[x] = test_cox_cc_x[x].astype('category')

#### Dummy encode categorical variables 

In [136]:
# Dummy variables for ecog_diagnosis
train_cox_cc_x = encode_and_bind(train_cox_cc_x, 'ecog_diagnosis')
train_cox_cc_x = train_cox_cc_x.drop(columns = ['ecog_diagnosis_4.0'])

test_cox_cc_x = encode_and_bind(test_cox_cc_x, 'ecog_diagnosis')
test_cox_cc_x = test_cox_cc_x.drop(columns = ['ecog_diagnosis_4.0'])

# Dummy variables for race
train_cox_cc_x = encode_and_bind(train_cox_cc_x, 'race')
train_cox_cc_x = train_cox_cc_x.drop(columns = ['race_unknown'])

test_cox_cc_x = encode_and_bind(test_cox_cc_x, 'race')
test_cox_cc_x = test_cox_cc_x.drop(columns = ['race_unknown'])

# Dummy variables for dn_recurrent
train_cox_cc_x = encode_and_bind(train_cox_cc_x, 'dn_recurrent')
train_cox_cc_x = train_cox_cc_x.drop(columns = ['dn_recurrent_l24'])

test_cox_cc_x = encode_and_bind(test_cox_cc_x, 'dn_recurrent')
test_cox_cc_x = test_cox_cc_x.drop(columns = ['dn_recurrent_l24'])

# Dummy variables for hr_her2
train_cox_cc_x = encode_and_bind(train_cox_cc_x, 'hr_her2')
train_cox_cc_x = train_cox_cc_x.drop(columns = ['hr_her2_hrn_her2p'])

test_cox_cc_x = encode_and_bind(test_cox_cc_x, 'hr_her2')
test_cox_cc_x = test_cox_cc_x.drop(columns = ['hr_her2_hrn_her2p'])

# Dummy variables for met_site
train_cox_cc_x = encode_and_bind(train_cox_cc_x, 'met_site')
train_cox_cc_x = train_cox_cc_x.drop(columns = ['met_site_cns'])

test_cox_cc_x = encode_and_bind(test_cox_cc_x, 'met_site')
test_cox_cc_x = test_cox_cc_x.drop(columns = ['met_site_cns'])

# Dummy variables for organ_number
train_cox_cc_x = encode_and_bind(train_cox_cc_x, 'organ_number')
train_cox_cc_x = train_cox_cc_x.drop(columns = ['organ_number_4.0'])

test_cox_cc_x = encode_and_bind(test_cox_cc_x, 'organ_number')
test_cox_cc_x = test_cox_cc_x.drop(columns = ['organ_number_4.0'])

# Dummy variables for recurrent_biomarker
train_cox_cc_x = encode_and_bind(train_cox_cc_x, 'recurrent_biomarker')
train_cox_cc_x = train_cox_cc_x.drop(columns = ['recurrent_biomarker_r_hrn_her2p'])

test_cox_cc_x = encode_and_bind(test_cox_cc_x, 'recurrent_biomarker')
test_cox_cc_x = test_cox_cc_x.drop(columns = ['recurrent_biomarker_r_hrn_her2p'])

In [137]:
print(train_cox_cc_x.shape)
print(test_cox_cc_x.shape)

(10068, 22)
(2552, 22)


### Processing Y

In [138]:
y_dtypes = train_cox[['death_status', 'timerisk_activity']].dtypes

train_cox_cc_y = np.array([tuple(x) for x in train_cox_cc[['death_status', 'timerisk_activity']].values],
                          dtype = list(zip(y_dtypes.index, y_dtypes)))

In [139]:
train_cox_cc_y.shape

(10068,)

In [140]:
y_dtypes = test_cox[['death_status', 'timerisk_activity']].dtypes

test_cox_cc_y = np.array([tuple(x) for x in test_cox_cc[['death_status', 'timerisk_activity']].values],
                         dtype = list(zip(y_dtypes.index, y_dtypes)))

In [141]:
test_cox_cc_y.shape

(2552,)
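
The outcome arrays above are NumPy structured arrays with an event field followed by a time field. As a minimal sketch of the same idea with explicit dtypes, assuming the event column can be cast to boolean and the time column to float, the hypothetical helper `to_structured_y` below builds such an array from a toy data frame. scikit-survival also ships a `Surv.from_dataframe` helper in `sksurv.util` for this purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical helper: build a 1-D structured array with a boolean event field
# and a float time field, named after the source columns.
def to_structured_y(df, event_col, time_col):
    y = np.empty(len(df), dtype=[(event_col, bool), (time_col, np.float64)])
    y[event_col] = df[event_col].to_numpy().astype(bool)
    y[time_col] = df[time_col].to_numpy().astype(np.float64)
    return y

# Toy data (made up for illustration) with the notebook's column names.
toy = pd.DataFrame({'death_status': [True, False, True],
                    'timerisk_activity': [120.0, 730.0, 45.0]})

toy_y = to_structured_y(toy, 'death_status', 'timerisk_activity')
print(toy_y.shape, toy_y.dtype.names)
```

Fixing the dtypes explicitly avoids depending on whatever dtypes the source data frame happens to carry.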

### Build and assess model performance

In [142]:
cox_cc = CoxPHSurvivalAnalysis()

cox_cc.fit(train_cox_cc_x, train_cox_cc_y)

CoxPHSurvivalAnalysis()

In [143]:
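cox_cc_risk_scores_te = cox_cc.predict(test_cox_cc_x)
# cumulative_dynamic_auc returns (AUC per time point, mean AUC); [0][0] selects the AUC at 730 days
cox_cc_auc_te = cumulative_dynamic_auc(train_cox_cc_y, test_cox_cc_y, cox_cc_risk_scores_te, 730)[0][0]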
print('Test set AUC at 2 years:', cox_cc_auc_te)

Test set AUC at 2 years: 0.7543168534141764


In [144]:
cox_cc_risk_scores_tr = cox_cc.predict(train_cox_cc_x)
cox_cc_auc_tr = cumulative_dynamic_auc(train_cox_cc_y, train_cox_cc_y, cox_cc_risk_scores_tr, 730)[0][0]
print('Training set AUC at 2 years:', cox_cc_auc_tr)

Training set AUC at 2 years: 0.7548071153536786


In [145]:
# Bootstrap 10000 2-yr (730-day) AUCs for test set
n_bootstraps = 10000
rng_seed = 42 
bootstrapped_scores_te = []

rng = np.random.RandomState(rng_seed)
for i in range(n_bootstraps):
    indices = rng.randint(0, len(cox_cc_risk_scores_te), len(cox_cc_risk_scores_te))
    auc_yr = cumulative_dynamic_auc(train_cox_cc_y, test_cox_cc_y[indices], cox_cc_risk_scores_te[indices], 730)[0][0]
    bootstrapped_scores_te.append(auc_yr)

In [146]:
# Standard error for test set AUC, approximated as the width of the
# 95% percentile bootstrap interval divided by 3.92 (2 * 1.96)
sorted_scores_te = np.array(bootstrapped_scores_te)
sorted_scores_te.sort()

conf_lower_te = sorted_scores_te[int(0.025 * len(sorted_scores_te))]
conf_upper_te = sorted_scores_te[int(0.975 * len(sorted_scores_te))]

standard_error_te = (conf_upper_te - conf_lower_te) / 3.92
print('Test set AUC standard error:', standard_error_te)

Test set AUC standard error: 0.010813202127139708
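
The standard error above is backed out of the 95% percentile interval: under an approximately normal bootstrap distribution the interval spans about 2 × 1.96 = 3.92 standard errors. The sketch below wraps that calculation in a hypothetical helper, `bootstrap_se`, and compares it with the standard deviation of the bootstrap replicates, which is the more direct estimate. It uses a toy binary AUC (`toy_auc`) on simulated data rather than `cumulative_dynamic_auc`, and `np.percentile` interpolates, so the bounds can differ slightly from the sorted-index lookup used in the cell above.

```python
import numpy as np

# Toy metric for illustration: binary AUC as the share of (positive, negative)
# pairs ranked correctly, counting ties as half.
def toy_auc(labels, scores):
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

# Hypothetical helper: resample (labels, scores) with replacement and summarise the metric.
def bootstrap_se(metric_fn, labels, scores, n_bootstraps=1000, seed=42):
    rng = np.random.RandomState(seed)
    n = len(scores)
    stats = np.empty(n_bootstraps)
    for b in range(n_bootstraps):
        idx = rng.randint(0, n, n)
        stats[b] = metric_fn(labels[idx], scores[idx])
    lower, upper = np.percentile(stats, [2.5, 97.5])
    return {'ci': (lower, upper),
            'se_from_ci': (upper - lower) / 3.92,   # approach used in this notebook
            'se_direct': stats.std(ddof=1)}         # standard deviation of the replicates

# Simulated data: scores are noisy versions of the labels.
sim_rng = np.random.RandomState(0)
labels = sim_rng.randint(0, 2, 200)
scores = labels + sim_rng.normal(0, 1, 200)

result = bootstrap_se(toy_auc, labels, scores, n_bootstraps=500)
print(result)
```

When the bootstrap distribution is roughly symmetric the two estimates should be close; a large gap would suggest reporting the percentile interval itself rather than a single standard error.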


In [147]:
# Bootstrap 10000 2-yr (730-day) AUCs for train set
n_bootstraps = 10000
rng_seed = 42 
bootstrapped_scores_tr = []

rng = np.random.RandomState(rng_seed)
for i in range(n_bootstraps):
    indices = rng.randint(0, len(cox_cc_risk_scores_tr), len(cox_cc_risk_scores_tr))
    auc_yr = cumulative_dynamic_auc(train_cox_cc_y, train_cox_cc_y[indices], cox_cc_risk_scores_tr[indices], 730)[0][0]
    bootstrapped_scores_tr.append(auc_yr)

In [148]:
# Standard error for train set AUC, approximated as the width of the
# 95% percentile bootstrap interval divided by 3.92 (2 * 1.96)
sorted_scores_tr = np.array(bootstrapped_scores_tr)
sorted_scores_tr.sort()

conf_lower_tr = sorted_scores_tr[int(0.025 * len(sorted_scores_tr))]
conf_upper_tr = sorted_scores_tr[int(0.975 * len(sorted_scores_tr))]

standard_error_tr = (conf_upper_tr - conf_lower_tr) / 3.92
print('Training set AUC standard error:', standard_error_tr)

Training set AUC standard error: 0.005608897625185096


In [149]:
cox_auc_data = {'model': 'cox_cc',
                'auc_2yr_te': cox_cc_auc_te,
                'sem_te': standard_error_te,
                'auc_2yr_tr': cox_cc_auc_tr,
                'sem_tr': standard_error_tr}

In [150]:
cox_auc_df = pd.read_csv('cox_auc_df.csv')

In [151]:
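# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
cox_auc_df = pd.concat([cox_auc_df, pd.DataFrame([cox_auc_data])], ignore_index = True)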

In [152]:
cox_auc_df

Unnamed: 0,model,auc_2yr_te,sem_te,auc_2yr_tr,sem_tr
0,cox_crude,0.738158,0.007091,0.740144,0.003577
1,cox_cc,0.754317,0.010813,0.754807,0.005609


In [153]:
cox_auc_df.to_csv('cox_auc_df.csv', index = False, header = True)

In [154]:
# Time-dependent AUC every 30 days from day 30 through day 1800 (about 5 years)
times = np.arange(30, 1810, 30)
cc_cox_auc_over5 = cumulative_dynamic_auc(train_cox_cc_y, test_cox_cc_y, cox_cc_risk_scores_te, times)[0]

# One column per evaluation time: 'time_30', 'time_60', ..., 'time_1800'
times_data = {'time_' + str(t): auc for t, auc in zip(times, cc_cox_auc_over5)}

cc_cox_over5_df = pd.DataFrame(times_data, index = ['cox_cc'])

In [155]:
cox_auc_over5 = pd.read_csv('cox_auc_over5.csv', index_col = 0)

In [156]:
# DataFrame.append was removed in pandas 2.0; pd.concat keeps the model names as the index
cox_auc_over5 = pd.concat([cox_auc_over5, cc_cox_over5_df])

In [157]:
cox_auc_over5

Unnamed: 0,time_30,time_60,time_90,time_120,time_150,time_180,time_210,time_240,time_270,time_300,...,time_1530,time_1560,time_1590,time_1620,time_1650,time_1680,time_1710,time_1740,time_1770,time_1800
cox_crude,0.754622,0.757984,0.760242,0.758289,0.756238,0.754524,0.754279,0.751298,0.757135,0.756677,...,0.712425,0.711211,0.712376,0.710032,0.711739,0.711844,0.712153,0.712078,0.715026,0.716995
cox_cc,0.70476,0.735457,0.754288,0.756185,0.751992,0.746681,0.741454,0.740093,0.749957,0.756006,...,0.739851,0.739346,0.739217,0.74414,0.743436,0.740559,0.735132,0.732202,0.732925,0.730467


In [158]:
cox_auc_over5.to_csv('cox_auc_over5.csv', index = True, header = True)

In [159]:
ml_auc_df = pd.read_csv('ml_auc_df.csv', dtype = {'auc_2yr_te': np.float64,
                                                  'sem_te': np.float64,
                                                  'auc_2yr_tr': np.float64,
                                                  'sem_tr': np.float64})

In [160]:
ml_auc_df

Unnamed: 0,model,auc_2yr_te,sem_te,auc_2yr_tr,sem_tr
0,gbm_crude,0.81423,0.006072,0.850084,0.002805
1,rsf_crude,0.79544,0.006377,0.883202,0.002398
2,ridge_crude,0.782643,0.006584,0.784937,0.00334
3,lasso_crude,0.783386,0.006583,0.785551,0.003348
4,enet_crude,0.783361,0.006578,0.785536,0.003336
5,linear_svm_crude,0.785235,0.006473,0.790172,0.003294
6,gbm_mice,0.817845,0.002939,0.84065,0.005536


In [161]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
all_models_auc_df = pd.concat([ml_auc_df, cox_auc_df], ignore_index = True)

In [162]:
all_models_auc_df.sort_values(by = 'auc_2yr_te', ascending = False)

Unnamed: 0,model,auc_2yr_te,sem_te,auc_2yr_tr,sem_tr
6,gbm_mice,0.817845,0.002939,0.84065,0.005536
0,gbm_crude,0.81423,0.006072,0.850084,0.002805
1,rsf_crude,0.79544,0.006377,0.883202,0.002398
5,linear_svm_crude,0.785235,0.006473,0.790172,0.003294
3,lasso_crude,0.783386,0.006583,0.785551,0.003348
4,enet_crude,0.783361,0.006578,0.785536,0.003336
2,ridge_crude,0.782643,0.006584,0.784937,0.00334
8,cox_cc,0.754317,0.010813,0.754807,0.005609
7,cox_crude,0.738158,0.007091,0.740144,0.003577


In [163]:
all_models_auc_df.to_csv('all_models_auc_df.csv', index = False, header = True)

**In conclusion, both Cox models perform worse than the machine learning models on 2-year test AUC (0.74 to 0.75 versus 0.78 to 0.82).**
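
To get a rough sense of how large the gap is relative to bootstrap uncertainty, the difference between two test AUCs can be divided by the combined standard error. The sketch below does this for `gbm_crude` and `cox_crude` using the values from the table above. It treats the two estimates as independent, which is a simplifying assumption: models scored on the same test patients have correlated AUCs, so this is a back-of-the-envelope check and not a formal test. `z_for_difference` is a hypothetical helper name.

```python
import math

# Hypothetical helper: z-score for the difference between two AUC estimates,
# assuming (as a simplification) that the estimates are independent.
def z_for_difference(auc_a, se_a, auc_b, se_b):
    return (auc_a - auc_b) / math.sqrt(se_a ** 2 + se_b ** 2)

# auc_2yr_te and sem_te for gbm_crude and cox_crude, taken from the table above.
z = z_for_difference(0.81423, 0.006072, 0.738158, 0.007091)
print(round(z, 2))
```

A paired bootstrap, resampling the same test indices for both models and taking the difference in AUC within each replicate, would account for the correlation directly.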