# Flatiron Health mPC: Cox model build

**OBJECTIVE: Build a Cox model inspired by the model described in "Prostascore: A Simplified Tool for Predicting Outcomes among Patients with Treatment-naive Advanced Prostate Cancer" by O. Abdel-Rahman in Clinical Oncology (2017).**

**BACKGROUND: The published model is a classical Cox proportional hazards model that predicts overall survival for patients with newly diagnosed advanced prostate cancer. It uses three clinical variables. Patients with complete data from the SEER database were included for training and testing. AUC on the hold-out set was 0.728.**

**OUTLINE:**
1. **Preprocessing**
2. **De novo metastatic disease** 
3. **De novo metastatic disease with complete cases**

## 1. Preprocessing 

**The published Cox model includes the following variables:** 
* Site of extra-prostatic disease
* PSA level
* Grade group 

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Function that returns number of rows and count of unique PatientIDs for a dataframe. 
def row_ID(dataframe):
    row = dataframe.shape[0]
    ID = dataframe['PatientID'].nunique()
    return row, ID

In [3]:
train = pd.read_csv('train_full.csv')
row_ID(train)

(15141, 15141)

In [4]:
test = pd.read_csv('test_full.csv')
row_ID(test)

(3786, 3786)

### Training set 

In [5]:
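# .copy() gives an independent DataFrame, which avoids pandas' SettingWithCopyWarning
# on the column assignments that follow.
train_cox = train.query('stage == "IV"').copy()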

#### Site of extra-prostatic disease
* N1
* M1a
* M1b
* M1c

In [6]:
train_cox.MStage.value_counts(dropna = False)

M1                          5002
M1b                         1277
M0                           522
M1c                          439
M1a                          179
Unknown / Not documented      91
Name: MStage, dtype: int64

In [7]:
train_cox.query('MStage == "M0"').NStage.value_counts()

N1                          443
N0                           48
Unknown / Not documented     23
NX                            8
Name: NStage, dtype: int64

In [8]:
train_cox['MStage'] = np.where((train_cox['MStage'] == 'M0') & (train_cox['NStage'] == 'N1'), 'N1', train_cox['MStage'])

In [9]:
train_cox.MStage.value_counts(dropna = False)

M1                          5002
M1b                         1277
N1                           443
M1c                          439
M1a                          179
Unknown / Not documented      91
M0                            79
Name: MStage, dtype: int64

In [10]:
unknown_bone_IDs = (
    train_cox
    .query('MStage == "M0" or MStage == "Unknown / Not documented"')
    .query('bone_met == 1')
    .query('thorax_met == 0')
    .query('peritoneum_met == 0')
    .query('liver_met == 0')
    .query('other_gi_met == 0')
    .query('cns_met == 0')
    .query('lymph_met == 0')
    .query('kidney_bladder_met == 0')
    .query('other_met == 0')
    .PatientID
)

In [11]:
unknown_vcns_IDs = (
    train_cox
    .query('MStage == "M0" or MStage == "Unknown / Not documented"')
    .query('thorax_met == 1 or peritoneum_met == 1 or liver_met == 1 or other_gi_met == 1 or cns_met == 1 or kidney_bladder_met == 1')
    .PatientID
)

In [12]:
m1_bone_IDs = (
    train_cox
    .query('MStage == "M1"')
    .query('bone_met == 1')
    .query('thorax_met == 0')
    .query('peritoneum_met == 0')
    .query('liver_met == 0')
    .query('other_gi_met == 0')
    .query('cns_met == 0')
    .query('lymph_met == 0')
    .query('kidney_bladder_met == 0')
    .query('other_met == 0')
    .PatientID
)

In [13]:
m1_vcns_IDs = (
    train_cox
    .query('MStage == "M1"')
    .query('thorax_met == 1 or peritoneum_met == 1 or liver_met == 1 or other_gi_met == 1 or cns_met == 1 or kidney_bladder_met == 1')
    .PatientID
)

In [14]:
n1_bone_IDs = (
    train_cox
    .query('MStage == "N1"')
    .query('bone_met == 1')
    .query('thorax_met == 0')
    .query('peritoneum_met == 0')
    .query('liver_met == 0')
    .query('other_gi_met == 0')
    .query('cns_met == 0')
    .query('lymph_met == 0')
    .query('kidney_bladder_met == 0')
    .query('other_met == 0')
    .PatientID
)

In [15]:
n1_vcns_IDs = (
    train_cox
    .query('MStage == "N1"')
    .query('thorax_met == 1 or peritoneum_met == 1 or liver_met == 1 or other_gi_met == 1 or cns_met == 1 or kidney_bladder_met == 1')
    .PatientID
)

In [16]:
bone_IDs = np.concatenate([unknown_bone_IDs, m1_bone_IDs, n1_bone_IDs])

In [17]:
vcns_IDs = np.concatenate([unknown_vcns_IDs, m1_vcns_IDs, n1_vcns_IDs])

In [18]:
conditions = [
    (train_cox['PatientID'].isin(bone_IDs)),
    (train_cox['PatientID'].isin(vcns_IDs))
]    

choices = ['M1b',
           'M1c']
    
train_cox.loc[:, 'MStage_n'] = np.select(conditions, choices, default = train_cox['MStage'])

In [19]:
train_cox.MStage_n.value_counts(dropna = False)

M1                          3703
M1b                         2576
M1c                          571
N1                           354
M1a                          179
Unknown / Not documented      68
M0                            59
Name: MStage_n, dtype: int64

In [20]:
train_cox['MStage_n'] = np.where(
    (train_cox['MStage_n'] == "M0") | (train_cox['MStage_n'] == "Unknown / Not documented"), 'unknown', train_cox['MStage_n']
)

In [21]:
train_cox.MStage_n.value_counts(dropna = False)

M1         3703
M1b        2576
M1c         571
N1          354
M1a         179
unknown     127
Name: MStage_n, dtype: int64
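The recoding steps above can be gathered into a single function. The sketch below is one possible consolidation, not the notebook's own code: `recode_site`, `VISCERAL_COLS`, and `NON_BONE_COLS` are hypothetical names, and the toy rows are made up for illustration. It assumes the same column names and the same rules as the cells above (N1 relabelling, bone-only disease to M1b, any visceral/CNS site to M1c, remaining M0/unknown to `unknown`).

```python
import pandas as pd

# Hypothetical constants; column names mirror those queried in the cells above.
VISCERAL_COLS = ['thorax_met', 'peritoneum_met', 'liver_met',
                 'other_gi_met', 'cns_met', 'kidney_bladder_met']
NON_BONE_COLS = VISCERAL_COLS + ['lymph_met', 'other_met']


def recode_site(df):
    """Return a Series with the recoded site of extra-prostatic disease."""
    # M0 with N1 nodal disease is relabelled N1.
    is_n1 = (df['MStage'] == 'M0') & (df['NStage'] == 'N1')
    mstage = df['MStage'].mask(is_n1, 'N1')

    # Only rows without a specific M substage are eligible for reassignment.
    eligible = mstage.isin(['M0', 'Unknown / Not documented', 'M1', 'N1'])
    bone_only = (eligible
                 & (df['bone_met'] == 1)
                 & (df[NON_BONE_COLS] == 0).all(axis=1))
    visceral = eligible & (df[VISCERAL_COLS] == 1).any(axis=1)

    site = mstage.mask(bone_only, 'M1b').mask(visceral, 'M1c')
    return site.replace({'M0': 'unknown',
                         'Unknown / Not documented': 'unknown'})


# Toy rows (not study data).
toy = pd.DataFrame({
    'MStage': ['M0', 'M1', 'M1', 'Unknown / Not documented', 'M1a'],
    'NStage': ['N1', 'N0', 'N1', 'NX', 'N0'],
})
for col in ['bone_met'] + NON_BONE_COLS:
    toy[col] = 0
toy.loc[1, 'bone_met'] = 1    # M1 with bone-only disease
toy.loc[2, 'liver_met'] = 1   # M1 with a visceral site
toy.loc[4, 'bone_met'] = 1    # M1a already has a substage and is left alone

toy_site = recode_site(toy)
print(toy_site.tolist())
```

Applying one function to both `train_cox` and `test_cox` would keep the two sets on identical rules without repeating the query chains.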

#### PSA level (>= 60 or < 60)

In [22]:
train_cox.PSAMetDiagnosis.isna().sum()/len(train_cox)

0.10732356857523302
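Roughly 10.7% of these patients have no PSA recorded, so it matters how a threshold comparison treats missing values. The sketch below uses toy values (not study data) to show that `NaN >= 60` evaluates to `False`, which means a missing PSA ends up coded 0. The nullable-integer variant is a hypothetical alternative for keeping missingness explicit, not something the notebook does.

```python
import numpy as np
import pandas as pd

psa = pd.Series([5.0, 60.0, 250.0, np.nan])  # toy values, not study data

# NaN >= 60 is False, so the missing value is coded 0, the same as PSA < 60.
psa_60 = np.where(psa >= 60, 1, 0)

# Hypothetical alternative: nullable integer that stays <NA> where PSA is missing.
psa_60_nullable = (psa >= 60).astype('Int64').mask(psa.isna())

print(psa_60.tolist())
print(psa_60_nullable.isna().tolist())
```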

In [23]:
# Missing PSA values fail the comparison and are therefore coded 0 (same as PSA < 60).
train_cox['psa_60'] = np.where(train_cox['PSAMetDiagnosis'] >= 60, 1, 0)

#### Grade group
* Grade group 1-3
* Grade group 4-5

In [24]:
train_cox.GleasonScore.value_counts()

9                                   2572
Unknown / Not documented            2399
8                                   1184
10                                   677
4 + 3 = 7                            381
3 + 4 = 7                            185
Less than or equal to 6               71
7 (when breakdown not available)      41
Name: GleasonScore, dtype: int64

In [25]:
conditions = [
    (train_cox['GleasonScore'] == 'Less than or equal to 6') | 
    (train_cox['GleasonScore'] == '3 + 4 = 7') |
    (train_cox['GleasonScore'] == '4 + 3 = 7') | 
    (train_cox['GleasonScore'] == '7 (when breakdown not available)'),
    (train_cox['GleasonScore'] == '8') | 
    (train_cox['GleasonScore'] == '9') |
    (train_cox['GleasonScore'] == '10')]    

choices = ['1-3',
           '4-5']
    
train_cox.loc[:, 'grade_group'] = np.select(conditions, choices, default = train_cox['GleasonScore'])

In [26]:
train_cox.grade_group.value_counts()

4-5                         4433
Unknown / Not documented    2399
1-3                          678
Name: grade_group, dtype: int64
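The same Gleason-to-grade-group collapse can be written as a dictionary lookup, which keeps the grouping in one place for reuse on the test set. This is a sketch under the assumption that the labels are exactly those shown in the `value_counts` output above; `GRADE_GROUP_MAP` and `collapse_gleason` are hypothetical names.

```python
import pandas as pd

# Hypothetical mapping that reproduces the grouping used above.
GRADE_GROUP_MAP = {
    'Less than or equal to 6': '1-3',
    '3 + 4 = 7': '1-3',
    '4 + 3 = 7': '1-3',
    '7 (when breakdown not available)': '1-3',
    '8': '4-5',
    '9': '4-5',
    '10': '4-5',
}


def collapse_gleason(gleason):
    """Map Gleason labels to '1-3' or '4-5'; leave any other label unchanged."""
    return gleason.map(GRADE_GROUP_MAP).fillna(gleason)


toy_gleason = pd.Series(['9', '3 + 4 = 7', 'Unknown / Not documented'])
toy_grade = collapse_gleason(toy_gleason)
print(toy_grade.tolist())
```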

### Test set 

In [27]:
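# .copy() gives an independent DataFrame, which avoids pandas' SettingWithCopyWarning
# on the column assignments that follow.
test_cox = test.query('stage == "IV"').copy()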

#### Site of extra-prostatic disease
* N1
* M1a
* M1b
* M1c

In [28]:
test_cox.MStage.value_counts(dropna = False)

M1                          1288
M1b                          324
M0                           151
M1c                           88
M1a                           42
Unknown / Not documented      20
Name: MStage, dtype: int64

In [29]:
test_cox.query('MStage == "M0"').NStage.value_counts()

N1                          133
N0                           12
Unknown / Not documented      5
NX                            1
Name: NStage, dtype: int64

In [30]:
test_cox['MStage'] = np.where((test_cox['MStage'] == 'M0') & (test_cox['NStage'] == 'N1'), 'N1', test_cox['MStage'])

In [31]:
test_cox.MStage.value_counts(dropna = False)

M1                          1288
M1b                          324
N1                           133
M1c                           88
M1a                           42
Unknown / Not documented      20
M0                            18
Name: MStage, dtype: int64

In [32]:
unknown_bone_IDs = (
    test_cox
    .query('MStage == "M0" or MStage == "Unknown / Not documented"')
    .query('bone_met == 1')
    .query('thorax_met == 0')
    .query('peritoneum_met == 0')
    .query('liver_met == 0')
    .query('other_gi_met == 0')
    .query('cns_met == 0')
    .query('lymph_met == 0')
    .query('kidney_bladder_met == 0')
    .query('other_met == 0')
    .PatientID
)

In [33]:
unknown_vcns_IDs = (
    test_cox
    .query('MStage == "M0" or MStage == "Unknown / Not documented"')
    .query('thorax_met == 1 or peritoneum_met == 1 or liver_met == 1 or other_gi_met == 1 or cns_met == 1 or kidney_bladder_met == 1')
    .PatientID
)

In [34]:
m1_bone_IDs = (
    test_cox
    .query('MStage == "M1"')
    .query('bone_met == 1')
    .query('thorax_met == 0')
    .query('peritoneum_met == 0')
    .query('liver_met == 0')
    .query('other_gi_met == 0')
    .query('cns_met == 0')
    .query('lymph_met == 0')
    .query('kidney_bladder_met == 0')
    .query('other_met == 0')
    .PatientID
)

In [35]:
m1_vcns_IDs = (
    test_cox
    .query('MStage == "M1"')
    .query('thorax_met == 1 or peritoneum_met == 1 or liver_met == 1 or other_gi_met == 1 or cns_met == 1 or kidney_bladder_met == 1')
    .PatientID
)

In [36]:
n1_bone_IDs = (
    test_cox
    .query('MStage == "N1"')
    .query('bone_met == 1')
    .query('thorax_met == 0')
    .query('peritoneum_met == 0')
    .query('liver_met == 0')
    .query('other_gi_met == 0')
    .query('cns_met == 0')
    .query('lymph_met == 0')
    .query('kidney_bladder_met == 0')
    .query('other_met == 0')
    .PatientID
)

In [37]:
n1_vcns_IDs = (
    test_cox
    .query('MStage == "N1"')
    .query('thorax_met == 1 or peritoneum_met == 1 or liver_met == 1 or other_gi_met == 1 or cns_met == 1 or kidney_bladder_met == 1')
    .PatientID
)

In [38]:
bone_IDs = np.concatenate([unknown_bone_IDs, m1_bone_IDs, n1_bone_IDs])

In [39]:
vcns_IDs = np.concatenate([unknown_vcns_IDs, m1_vcns_IDs, n1_vcns_IDs])

In [40]:
conditions = [
    (test_cox['PatientID'].isin(bone_IDs)),
    (test_cox['PatientID'].isin(vcns_IDs))
]    

choices = ['M1b',
           'M1c']
    
test_cox.loc[:, 'MStage_n'] = np.select(conditions, choices, default = test_cox['MStage'])

In [41]:
test_cox.MStage_n.value_counts(dropna = False)

M1                          909
M1b                         702
M1c                         123
N1                          109
M1a                          42
Unknown / Not documented     16
M0                           12
Name: MStage_n, dtype: int64

In [42]:
test_cox['MStage_n'] = np.where(
    (test_cox['MStage_n'] == "M0") | (test_cox['MStage_n'] == "Unknown / Not documented"), 'unknown', test_cox['MStage_n']
)

In [43]:
test_cox.MStage_n.value_counts(dropna = False)

M1         909
M1b        702
M1c        123
N1         109
M1a         42
unknown     28
Name: MStage_n, dtype: int64

#### PSA level (>= 60 or < 60)

In [44]:
# Missing PSA values fail the comparison and are therefore coded 0 (same as PSA < 60).
test_cox['psa_60'] = np.where(test_cox['PSAMetDiagnosis'] >= 60, 1, 0)

#### Grade group
* Grade group 1-3
* Grade group 4-5

In [45]:
test_cox.GleasonScore.value_counts()

Unknown / Not documented            668
9                                   636
8                                   301
10                                  147
4 + 3 = 7                            83
3 + 4 = 7                            54
Less than or equal to 6              16
7 (when breakdown not available)      8
Name: GleasonScore, dtype: int64

In [46]:
conditions = [
    (test_cox['GleasonScore'] == 'Less than or equal to 6') | 
    (test_cox['GleasonScore'] == '3 + 4 = 7') |
    (test_cox['GleasonScore'] == '4 + 3 = 7') | 
    (test_cox['GleasonScore'] == '7 (when breakdown not available)'),
    (test_cox['GleasonScore'] == '8') | 
    (test_cox['GleasonScore'] == '9') |
    (test_cox['GleasonScore'] == '10')]    

choices = ['1-3',
           '4-5']
    
test_cox.loc[:, 'grade_group'] = np.select(conditions, choices, default = test_cox['GleasonScore'])

In [47]:
test_cox.grade_group.value_counts()

4-5                         1084
Unknown / Not documented     668
1-3                          161
Name: grade_group, dtype: int64

## 2. De novo metastatic disease 

**The first version of the Cox model will include all patients with de novo metastatic disease. Patients with missing site of metastasis, PSA, or grade group will be included.**

In [48]:
from sksurv.linear_model import CoxPHSurvivalAnalysis

from sksurv.metrics import cumulative_dynamic_auc

In [49]:
# Create function that creates dummy variables for categorical variable and drops original categorical variable. 
def encode_and_bind(original_dataframe, feature_to_encode):
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis = 1)
    res = res.drop([feature_to_encode], axis = 1)
    return(res) 
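A small usage sketch of `encode_and_bind` on made-up rows may help show what the later cells rely on: the dummy columns are named `<feature>_<level>`, and dropping one of them afterwards turns that level into the reference category. The function is repeated here only so the block runs on its own; the toy frame is an assumption for illustration.

```python
import pandas as pd


def encode_and_bind(original_dataframe, feature_to_encode):
    """Append dummy columns for one categorical feature and drop the original."""
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    return res.drop([feature_to_encode], axis=1)


# Toy rows (not study data).
toy = pd.DataFrame({
    'psa_60': [1, 0, 1],
    'grade_group': ['1-3', '4-5', 'Unknown / Not documented'],
})

toy_encoded = encode_and_bind(toy, 'grade_group')
# Dropping one dummy makes that level the reference category.
toy_encoded = toy_encoded.drop(columns=['grade_group_Unknown / Not documented'])
print(list(toy_encoded.columns))
```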

### Processing X 

#### Select relevant variables

In [50]:
train_cox_x = (
    train_cox
    .filter(items = [
        'PatientID',
        'MStage_n',
        'psa_60',
        'grade_group']))

In [51]:
train_cox_x = train_cox_x.set_index('PatientID')

In [52]:
train_cox_x.shape

(7510, 3)

In [53]:
test_cox_x = (
    test_cox
    .filter(items = [
        'PatientID',
        'MStage_n',
        'psa_60',
        'grade_group']))

In [54]:
test_cox_x = test_cox_x.set_index('PatientID')

In [55]:
test_cox_x.shape

(1913, 3)

#### Convert relevant variables to categorical 

In [56]:
list(train_cox_x.select_dtypes(include = ['object']).columns)

['MStage_n', 'grade_group']

In [57]:
to_be_categorical = list(train_cox_x.select_dtypes(include = ['object']).columns)

In [58]:
to_be_categorical.append('psa_60')

In [59]:
for x in list(to_be_categorical):
    train_cox_x[x] = train_cox_x[x].astype('category')

In [60]:
train_cox_x.dtypes

MStage_n       category
psa_60         category
grade_group    category
dtype: object

In [61]:
for x in list(to_be_categorical):
    test_cox_x[x] = test_cox_x[x].astype('category')

#### Dummy encode categorical variables 

In [62]:
# Dummy variables for site of metastasis
train_cox_x = encode_and_bind(train_cox_x, 'MStage_n')
train_cox_x = train_cox_x.drop(columns = ['MStage_n_unknown'])

test_cox_x = encode_and_bind(test_cox_x, 'MStage_n')
test_cox_x = test_cox_x.drop(columns = ['MStage_n_unknown'])

# Dummy variables for grade group 
train_cox_x = encode_and_bind(train_cox_x, 'grade_group')
train_cox_x = train_cox_x.drop(columns = ['grade_group_Unknown / Not documented'])

test_cox_x = encode_and_bind(test_cox_x, 'grade_group')
test_cox_x = test_cox_x.drop(columns = ['grade_group_Unknown / Not documented'])

In [63]:
print(train_cox_x.shape)
print(test_cox_x.shape)

(7510, 8)
(1913, 8)


### Processing Y

In [64]:
# Convert death_status into True or False (required for scikit-survival). 
train_cox['death_status'] = train_cox['death_status'].astype('bool')

In [65]:
y_dtypes = train_cox[['death_status', 'timerisk_activity']].dtypes

train_cox_y = np.array([tuple(x) for x in train_cox[['death_status', 'timerisk_activity']].values],
                        dtype = list(zip(y_dtypes.index, y_dtypes)))

In [66]:
train_cox_y.shape

(7510,)
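The structured array built above has one boolean event field and one time field per patient. The sketch below shows an equivalent construction that fills the fields column by column; `to_structured_y` is a hypothetical helper and the toy rows are made up. scikit-survival also ships a `Surv` helper in `sksurv.util` for this purpose, which is not used here so that the block depends only on NumPy and pandas.

```python
import numpy as np
import pandas as pd

# Toy rows (not study data).
toy = pd.DataFrame({
    'death_status': [1, 0, 1],
    'timerisk_activity': [120.0, 900.0, 45.0],
})


def to_structured_y(df, event_col='death_status', time_col='timerisk_activity'):
    """Build a structured array with a boolean event field and a float time field."""
    y = np.empty(len(df), dtype=[(event_col, bool), (time_col, float)])
    y[event_col] = df[event_col].astype(bool).to_numpy()
    y[time_col] = df[time_col].astype(float).to_numpy()
    return y


toy_y = to_structured_y(toy)
print(toy_y.dtype.names, toy_y.shape)
```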

In [67]:
# Convert death_status into True or False (required for scikit-survival). 
test_cox['death_status'] = test_cox['death_status'].astype('bool')

In [68]:
y_dtypes = test_cox[['death_status', 'timerisk_activity']].dtypes

test_cox_y = np.array([tuple(x) for x in test_cox[['death_status', 'timerisk_activity']].values],
                        dtype = list(zip(y_dtypes.index, y_dtypes)))

In [69]:
test_cox_y.shape

(1913,)

### Build and assess model performance

In [70]:
cox_crude = CoxPHSurvivalAnalysis()

cox_crude.fit(train_cox_x, train_cox_y)

CoxPHSurvivalAnalysis()
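For orientation, the estimator fitted above is the standard Cox proportional hazards model. The notation below ($h_0$ for the baseline hazard, $\beta$ for the coefficients, $x$ for a patient's dummy-encoded covariates) is introduced here for illustration and does not appear in the notebook.

$$
h(t \mid x) = h_0(t)\,\exp\!\left(x^\top \beta\right)
$$

In scikit-survival, `predict` returns the linear predictor $x^\top \beta$, so a larger risk score corresponds to a higher predicted hazard. These scores are what the AUC calculations below rank patients by.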

In [71]:
cox_crude_risk_scores_te = cox_crude.predict(test_cox_x)
cox_crude_auc_te = cumulative_dynamic_auc(train_cox_y, test_cox_y, cox_crude_risk_scores_te, 730)[0][0]
print('Test set AUC at 2 years:', cox_crude_auc_te)

Test set AUC at 2 years: 0.6121953517893658


In [72]:
cox_crude_risk_scores_tr = cox_crude.predict(train_cox_x)
cox_crude_auc_tr = cumulative_dynamic_auc(train_cox_y, train_cox_y, cox_crude_risk_scores_tr, 730)[0][0]
print('Training set AUC at 2 years:', cox_crude_auc_tr)

Training set AUC at 2 years: 0.6259058212378803


In [73]:
# Bootstrap 10000 2-yr AUCs for test set 
n_bootstraps = 10000
rng_seed = 42 
bootstrapped_scores_te = []

rng = np.random.RandomState(rng_seed)
for i in range(n_bootstraps):
    indices = rng.randint(0, len(cox_crude_risk_scores_te), len(cox_crude_risk_scores_te))
    auc_yr = cumulative_dynamic_auc(train_cox_y, test_cox_y[indices], cox_crude_risk_scores_te[indices], 730)[0][0]
    bootstrapped_scores_te.append(auc_yr)

In [74]:
# Approximate standard error of the test set AUC:
# width of the 95% bootstrap percentile interval divided by 3.92 (2 * 1.96)
sorted_scores_te = np.array(bootstrapped_scores_te)
sorted_scores_te.sort()

conf_lower_te = sorted_scores_te[int(0.025 * len(sorted_scores_te))]
conf_upper_te = sorted_scores_te[int(0.975 * len(sorted_scores_te))]

standard_error_te = (conf_upper_te - conf_lower_te) / 3.92
print('Test set AUC standard error:', standard_error_te)

Test set AUC standard error: 0.015423403661116221
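The standard error above is backed out of the bootstrap percentile interval: under approximate normality, a 95% interval spans 2 × 1.96 = 3.92 standard errors. The sketch below illustrates that idea on simulated data with a plain, uncensored AUC as a stand-in metric. `bootstrap_se` and `toy_auc` are hypothetical helpers, the data are random, and the censoring-aware `cumulative_dynamic_auc` is deliberately not used. It also returns the standard deviation of the bootstrap distribution, a common direct estimate, for comparison.

```python
import numpy as np


def bootstrap_se(metric, y_true, scores, n_bootstraps=1000, seed=42):
    """Percentile 95% interval, the SE implied by its width, and the direct SE."""
    rng = np.random.RandomState(seed)
    n = len(scores)
    stats = np.empty(n_bootstraps)
    for b in range(n_bootstraps):
        idx = rng.randint(0, n, n)  # resample indices with replacement
        stats[b] = metric(y_true[idx], scores[idx])
    lower, upper = np.percentile(stats, [2.5, 97.5])
    # A 95% normal interval is 2 * 1.96 = 3.92 standard errors wide.
    return lower, upper, (upper - lower) / 3.92, stats.std(ddof=1)


def toy_auc(y_true, scores):
    """Plain (uncensored) AUC via pairwise comparison; a stand-in metric."""
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size


# Simulated labels and scores (not study data).
rng = np.random.RandomState(0)
y_toy = rng.randint(0, 2, 200)
scores_toy = y_toy + rng.normal(0, 1.5, 200)

lower, upper, se_from_ci, se_direct = bootstrap_se(toy_auc, y_toy, scores_toy)
print(lower, upper, se_from_ci, se_direct)
```

When the bootstrap distribution is roughly symmetric, the two estimates should be close; a large gap would suggest the normal approximation is a poor fit.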


In [75]:
# Bootstrap 10000 2-yr AUCs for train set 
n_bootstraps = 10000
rng_seed = 42 
bootstrapped_scores_tr = []

rng = np.random.RandomState(rng_seed)
for i in range(n_bootstraps):
    indices = rng.randint(0, len(cox_crude_risk_scores_tr), len(cox_crude_risk_scores_tr))
    auc_yr = cumulative_dynamic_auc(train_cox_y, train_cox_y[indices], cox_crude_risk_scores_tr[indices], 730)[0][0]
    bootstrapped_scores_tr.append(auc_yr)

In [76]:
# Approximate standard error of the train set AUC:
# width of the 95% bootstrap percentile interval divided by 3.92 (2 * 1.96)
sorted_scores_tr = np.array(bootstrapped_scores_tr)
sorted_scores_tr.sort()

conf_lower_tr = sorted_scores_tr[int(0.025 * len(sorted_scores_tr))]
conf_upper_tr = sorted_scores_tr[int(0.975 * len(sorted_scores_tr))]

standard_error_tr = (conf_upper_tr - conf_lower_tr) / 3.92
print('Training set AUC standard error', standard_error_tr)

Training set AUC standard error 0.007833688914237594


In [77]:
cox_auc_data = {'model': ['cox_crude'],
                'auc_2yr_te': [cox_crude_auc_te],
                'sem_te': [standard_error_te],
                'auc_2yr_tr': [cox_crude_auc_tr],
                'sem_tr': [standard_error_tr]}

cox_auc_df = pd.DataFrame(cox_auc_data)

In [78]:
cox_auc_df

| | model | auc_2yr_te | sem_te | auc_2yr_tr | sem_tr |
|---|---|---|---|---|---|
| 0 | cox_crude | 0.612195 | 0.015423 | 0.625906 | 0.007834 |


In [79]:
cox_auc_df.to_csv('cox_auc_df.csv', index = False, header = True)

In [80]:
times = np.arange(30, 1810, 30)
crude_cox_auc_over5 = cumulative_dynamic_auc(train_cox_y, test_cox_y, cox_crude_risk_scores_te, times)[0]

times_data = {}
values = crude_cox_auc_over5
time_names = []

for x in range(len(times)):
    time_names.append('time_'+str(times[x]))

for i in range(len(time_names)):
    times_data[time_names[i]] = values[i]
    
cox_auc_over5 = pd.DataFrame(times_data, index = ['cox_crude'])
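The one-row table of time-dependent AUCs can also be assembled with a single dictionary comprehension. The sketch below assumes the same `time_<days>` column naming as the cell above; `auc_row` is a hypothetical helper and the AUC values are placeholders, not model output.

```python
import numpy as np
import pandas as pd

times = np.arange(30, 1810, 30)                        # 30-day steps up to 1800 days
toy_auc_values = np.linspace(0.70, 0.64, len(times))   # placeholder values only


def auc_row(model_name, times, values):
    """One-row DataFrame with a 'time_<days>' column per evaluation time."""
    return pd.DataFrame({f'time_{t}': [v] for t, v in zip(times, values)},
                        index=[model_name])


toy_row = auc_row('toy_model', times, toy_auc_values)
print(toy_row.shape)
```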

In [81]:
cox_auc_over5

| | time_30 | time_60 | time_90 | time_120 | time_150 | time_180 | time_210 | time_240 | time_270 | time_300 | ... | time_1530 | time_1560 | time_1590 | time_1620 | time_1650 | time_1680 | time_1710 | time_1740 | time_1770 | time_1800 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cox_crude | 0.63523 | 0.719533 | 0.696485 | 0.696683 | 0.670808 | 0.672577 | 0.674294 | 0.667327 | 0.670997 | 0.669549 | ... | 0.639714 | 0.638639 | 0.635354 | 0.635355 | 0.639401 | 0.639922 | 0.641251 | 0.643446 | 0.644487 | 0.645597 |


In [82]:
cox_auc_over5.to_csv('cox_auc_over5.csv', index = True, header = True)

## 3. De novo metastatic disease with complete cases

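**This version includes patients with de novo metastatic disease who are not missing site of metastasis or PSA. Patients with unknown grade group are retained, with "Unknown / Not documented" serving as the reference category.**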

### Remove patients with missing values 

In [83]:
train_cox_cc = (
    train_cox
    .query('MStage_n != "unknown"')
    .query('PSAMetDiagnosis.notna()', engine = 'python')
)

In [84]:
test_cox_cc = (
    test_cox
    .query('MStage_n != "unknown"')
    .query('PSAMetDiagnosis.notna()', engine = 'python')
)
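The complete-case filter above can equally be written as a boolean mask, which avoids the `engine = 'python'` argument that `query` needs for `.notna()`. This is a sketch on made-up rows; `complete_cases` is a hypothetical helper, and it assumes the `MStage_n` and `PSAMetDiagnosis` columns defined earlier.

```python
import numpy as np
import pandas as pd

# Toy rows (not study data).
toy = pd.DataFrame({
    'PatientID': ['a', 'b', 'c', 'd'],
    'MStage_n': ['M1b', 'unknown', 'M1c', 'N1'],
    'PSAMetDiagnosis': [12.0, 80.0, np.nan, 150.0],
})


def complete_cases(df):
    """Keep rows with a known site of disease and a recorded PSA."""
    keep = (df['MStage_n'] != 'unknown') & df['PSAMetDiagnosis'].notna()
    return df.loc[keep].copy()


toy_cc = complete_cases(toy)
print(toy_cc['PatientID'].tolist())
```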

### Processing X 

#### Select relevant variables

In [85]:
train_cox_cc_x = (
    train_cox_cc
    .filter(items = [
        'PatientID',
        'MStage_n',
        'psa_60',
        'grade_group']))

In [86]:
train_cox_cc_x = train_cox_cc_x.set_index('PatientID')

In [87]:
train_cox_cc_x.shape

(6605, 3)

In [88]:
test_cox_cc_x = (
    test_cox_cc
    .filter(items = [
        'PatientID',
        'MStage_n',
        'psa_60',
        'grade_group']))

In [89]:
test_cox_cc_x = test_cox_cc_x.set_index('PatientID')

In [90]:
test_cox_cc_x.shape

(1686, 3)

#### Convert relevant variables to categorical 

In [91]:
list(train_cox_cc_x.select_dtypes(include = ['object']).columns)

['MStage_n', 'grade_group']

In [92]:
to_be_categorical = list(train_cox_cc_x.select_dtypes(include = ['object']).columns)

In [93]:
to_be_categorical.append('psa_60')

In [94]:
for x in list(to_be_categorical):
    train_cox_cc_x[x] = train_cox_cc_x[x].astype('category')

In [95]:
train_cox_cc_x.dtypes

MStage_n       category
psa_60         category
grade_group    category
dtype: object

In [96]:
for x in list(to_be_categorical):
    test_cox_cc_x[x] = test_cox_cc_x[x].astype('category')

#### Dummy encode categorical variables 

In [97]:
# Dummy variables for site of metastasis
train_cox_cc_x = encode_and_bind(train_cox_cc_x, 'MStage_n')
train_cox_cc_x = train_cox_cc_x.drop(columns = ['MStage_n_N1'])

test_cox_cc_x = encode_and_bind(test_cox_cc_x, 'MStage_n')
test_cox_cc_x = test_cox_cc_x.drop(columns = ['MStage_n_N1'])

# Dummy variables for grade group 
train_cox_cc_x = encode_and_bind(train_cox_cc_x, 'grade_group')
train_cox_cc_x = train_cox_cc_x.drop(columns = ['grade_group_Unknown / Not documented'])

test_cox_cc_x = encode_and_bind(test_cox_cc_x, 'grade_group')
test_cox_cc_x = test_cox_cc_x.drop(columns = ['grade_group_Unknown / Not documented'])

In [98]:
print(train_cox_cc_x.shape)
print(test_cox_cc_x.shape)

(6605, 7)
(1686, 7)


### Processing Y

In [99]:
y_dtypes = train_cox[['death_status', 'timerisk_activity']].dtypes

train_cox_cc_y = np.array([tuple(x) for x in train_cox_cc[['death_status', 'timerisk_activity']].values],
                          dtype = list(zip(y_dtypes.index, y_dtypes)))

In [100]:
train_cox_cc_y.shape

(6605,)

In [101]:
y_dtypes = test_cox[['death_status', 'timerisk_activity']].dtypes

test_cox_cc_y = np.array([tuple(x) for x in test_cox_cc[['death_status', 'timerisk_activity']].values],
                         dtype = list(zip(y_dtypes.index, y_dtypes)))

In [102]:
test_cox_cc_y.shape

(1686,)

### Build and assess model performance

In [103]:
cox_cc = CoxPHSurvivalAnalysis()

cox_cc.fit(train_cox_cc_x, train_cox_cc_y)

CoxPHSurvivalAnalysis()

In [104]:
cox_cc_risk_scores_te = cox_cc.predict(test_cox_cc_x)
cox_cc_auc_te = cumulative_dynamic_auc(train_cox_cc_y, test_cox_cc_y, cox_cc_risk_scores_te, 730)[0][0]
print('Test set AUC at 2 years:', cox_cc_auc_te)

Test set AUC at 2 years: 0.6045630959059534


In [105]:
cox_cc_risk_scores_tr = cox_cc.predict(train_cox_cc_x)
cox_cc_auc_tr = cumulative_dynamic_auc(train_cox_cc_y, train_cox_cc_y, cox_cc_risk_scores_tr, 730)[0][0]
print('Training set AUC at 2 years:', cox_cc_auc_tr)

Training set AUC at 2 years: 0.6286820193801725


In [106]:
# Bootstrap 10000 2-yr AUCs for test set 
n_bootstraps = 10000
rng_seed = 42 
bootstrapped_scores_te = []

rng = np.random.RandomState(rng_seed)
for i in range(n_bootstraps):
    indices = rng.randint(0, len(cox_cc_risk_scores_te), len(cox_cc_risk_scores_te))
    auc_yr = cumulative_dynamic_auc(train_cox_cc_y, test_cox_cc_y[indices], cox_cc_risk_scores_te[indices], 730)[0][0]
    bootstrapped_scores_te.append(auc_yr)

In [107]:
# Approximate standard error of the test set AUC:
# width of the 95% bootstrap percentile interval divided by 3.92 (2 * 1.96)
sorted_scores_te = np.array(bootstrapped_scores_te)
sorted_scores_te.sort()

conf_lower_te = sorted_scores_te[int(0.025 * len(sorted_scores_te))]
conf_upper_te = sorted_scores_te[int(0.975 * len(sorted_scores_te))]

standard_error_te = (conf_upper_te - conf_lower_te) / 3.92
print('Test set AUC standard error:', standard_error_te)

Test set AUC standard error: 0.016152617569610277


In [108]:
# Bootstrap 10000 2-yr AUCs for train set 
n_bootstraps = 10000
rng_seed = 42 
bootstrapped_scores_tr = []

rng = np.random.RandomState(rng_seed)
for i in range(n_bootstraps):
    indices = rng.randint(0, len(cox_cc_risk_scores_tr), len(cox_cc_risk_scores_tr))
    auc_yr = cumulative_dynamic_auc(train_cox_cc_y, train_cox_cc_y[indices], cox_cc_risk_scores_tr[indices], 730)[0][0]
    bootstrapped_scores_tr.append(auc_yr)

In [109]:
# Approximate standard error of the train set AUC:
# width of the 95% bootstrap percentile interval divided by 3.92 (2 * 1.96)
sorted_scores_tr = np.array(bootstrapped_scores_tr)
sorted_scores_tr.sort()

conf_lower_tr = sorted_scores_tr[int(0.025 * len(sorted_scores_tr))]
conf_upper_tr = sorted_scores_tr[int(0.975 * len(sorted_scores_tr))]

standard_error_tr = (conf_upper_tr - conf_lower_tr) / 3.92
print('Training set AUC standard error', standard_error_tr)

Training set AUC standard error 0.00827588705646784


In [110]:
cox_auc_data = {'model': 'cox_cc',
                'auc_2yr_te': cox_cc_auc_te,
                'sem_te': standard_error_te,
                'auc_2yr_tr': cox_cc_auc_tr,
                'sem_tr': standard_error_tr}

In [111]:
cox_auc_df = pd.read_csv('cox_auc_df.csv')

In [112]:
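# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent.
cox_auc_df = pd.concat([cox_auc_df, pd.DataFrame([cox_auc_data])], ignore_index = True)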

In [113]:
cox_auc_df

| | model | auc_2yr_te | sem_te | auc_2yr_tr | sem_tr |
|---|---|---|---|---|---|
| 0 | cox_crude | 0.612195 | 0.015423 | 0.625906 | 0.007834 |
| 1 | cox_cc | 0.604563 | 0.016153 | 0.628682 | 0.008276 |


In [114]:
cox_auc_df.to_csv('cox_auc_df.csv', index = False, header = True)

In [115]:
times = np.arange(30, 1810, 30)
cc_cox_auc_over5 = cumulative_dynamic_auc(train_cox_cc_y, test_cox_cc_y, cox_cc_risk_scores_te, times)[0]

times_data = {}
values = cc_cox_auc_over5
time_names = []

for x in range(len(times)):
    time_names.append('time_'+str(times[x]))

for i in range(len(time_names)):
    times_data[time_names[i]] = values[i]
    
cc_cox_over5_df = pd.DataFrame(times_data, index = ['cox_cc'])

In [116]:
cox_auc_over5 = pd.read_csv('cox_auc_over5.csv', index_col = 0)

In [117]:
# DataFrame.append was removed in pandas 2.0; pd.concat keeps the row index by default.
cox_auc_over5 = pd.concat([cox_auc_over5, cc_cox_over5_df])

In [118]:
cox_auc_over5

| | time_30 | time_60 | time_90 | time_120 | time_150 | time_180 | time_210 | time_240 | time_270 | time_300 | ... | time_1530 | time_1560 | time_1590 | time_1620 | time_1650 | time_1680 | time_1710 | time_1740 | time_1770 | time_1800 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cox_crude | 0.63523 | 0.719533 | 0.696485 | 0.696683 | 0.670808 | 0.672577 | 0.674294 | 0.667327 | 0.670997 | 0.669549 | ... | 0.639714 | 0.638639 | 0.635354 | 0.635355 | 0.639401 | 0.639922 | 0.641251 | 0.643446 | 0.644487 | 0.645597 |
| cox_cc | 0.784304 | 0.729547 | 0.722173 | 0.70625 | 0.68377 | 0.680388 | 0.679858 | 0.679429 | 0.687113 | 0.677172 | ... | 0.643343 | 0.643382 | 0.63848 | 0.636745 | 0.644063 | 0.642834 | 0.650582 | 0.651796 | 0.651795 | 0.652504 |


In [119]:
cox_auc_over5.to_csv('cox_auc_over5.csv', index = True, header = True)

In [120]:
ml_auc_df = pd.read_csv('ml_auc_df.csv', dtype = {'auc_2yr_te': np.float64,
                                                  'sem_te': np.float64,
                                                  'auc_2yr_tr': np.float64,
                                                  'sem_tr': np.float64})

In [121]:
ml_auc_df

| | model | auc_2yr_te | sem_te | auc_2yr_tr | sem_tr |
|---|---|---|---|---|---|
| 0 | gbm_crude | 0.754326 | 0.009489 | 0.784849 | 0.00459 |
| 1 | rsf_crude | 0.747107 | 0.009631 | 0.835588 | 0.003914 |
| 2 | ridge_crude | 0.739499 | 0.009629 | 0.743281 | 0.004897 |
| 3 | lasso_crude | 0.729512 | 0.009712 | 0.733718 | 0.004997 |
| 4 | enet_crude | 0.731323 | 0.009719 | 0.734977 | 0.004973 |
| 5 | linear_svm_crude | 0.740643 | 0.009662 | 0.74623 | 0.004833 |
| 6 | gbm_mice | 0.772443 | 0.00454 | 0.82346 | 0.011367 |


In [122]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent.
all_models_auc_df = pd.concat([ml_auc_df, cox_auc_df], ignore_index = True)

In [123]:
all_models_auc_df.sort_values(by = 'auc_2yr_te', ascending = False)

| | model | auc_2yr_te | sem_te | auc_2yr_tr | sem_tr |
|---|---|---|---|---|---|
| 6 | gbm_mice | 0.772443 | 0.00454 | 0.82346 | 0.011367 |
| 0 | gbm_crude | 0.754326 | 0.009489 | 0.784849 | 0.00459 |
| 1 | rsf_crude | 0.747107 | 0.009631 | 0.835588 | 0.003914 |
| 5 | linear_svm_crude | 0.740643 | 0.009662 | 0.74623 | 0.004833 |
| 2 | ridge_crude | 0.739499 | 0.009629 | 0.743281 | 0.004897 |
| 4 | enet_crude | 0.731323 | 0.009719 | 0.734977 | 0.004973 |
| 3 | lasso_crude | 0.729512 | 0.009712 | 0.733718 | 0.004997 |
| 7 | cox_crude | 0.612195 | 0.015423 | 0.625906 | 0.007834 |
| 8 | cox_cc | 0.604563 | 0.016153 | 0.628682 | 0.008276 |


In [124]:
all_models_auc_df.to_csv('all_models_auc_df.csv', index = False, header = True)