# Duration test

##### Updated 01/26/2023
##### Selin Kubali

#### Goal: 
Test whether restricting to incident cases (for HCM) and prevalent cases (for all other diseases) makes a difference to the Cox model. Create summary tables documenting statistics for each category of phenotypic data.

#### Input: 
- hypertrophic_df.csv, produced from the notebook parse_phenotypic_data.ipynb

In [34]:
import pandas as pd
from lifelines import CoxPHFitter
phenotypic_data = pd.read_csv('/Users/uriel/Downloads/work_temp/hypertrophic_df.csv')


### Test whether to keep only incident HCM cases or all HCM cases
Incident cases are cases that developed after the first assessment. The 'duration_left_censored' column holds the duration for this category, measured from the time of first assessment.
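To make the definition concrete, the sketch below derives an incident flag and a duration measured from the first assessment on a toy table. The column names `assessment_date`, `diagnosis_date`, `end_of_followup`, and `duration_from_assessment_days`, and the use of days as the unit, are assumptions for illustration; the actual derivation lives in parse_phenotypic_data.ipynb and may differ.

```python
import pandas as pd

# Toy data; all column names and dates here are hypothetical.
toy = pd.DataFrame({
    "assessment_date": pd.to_datetime(["2008-01-01", "2008-01-01", "2008-01-01"]),
    "diagnosis_date": pd.to_datetime(["2005-06-01", "2010-01-01", None]),
    "end_of_followup": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-01"]),
})

# Incident case: diagnosed strictly after the first assessment.
# Comparisons against NaT evaluate to False, so undiagnosed rows are not incident.
toy["incident"] = toy["diagnosis_date"] > toy["assessment_date"]

# Follow-up ends at diagnosis for incident cases, otherwise at end of follow-up.
# Whether prevalent cases (row 0) are excluded or kept as non-events is an
# analysis choice that this sketch does not settle.
end = toy["diagnosis_date"].where(toy["incident"], toy["end_of_followup"])
toy["duration_from_assessment_days"] = (end - toy["assessment_date"]).dt.days

print(toy[["incident", "duration_from_assessment_days"]])
```

Only the second row is flagged as incident, because its diagnosis date falls after the assessment date.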

In [24]:
# All HCM cases
phenotypic_data_all = phenotypic_data.drop(['incident_hcm', 'duration_left_censored', 'eid'], axis = 1)
cph = CoxPHFitter()
cph.fit(phenotypic_data_all, 'duration', 'is_HCM', fit_options = {"step_size":0.1})
cph.print_summary()


# Only incident HCM cases
phenotypic_data_incident = phenotypic_data.drop(['is_HCM', 'duration', 'eid'], axis = 1)
cph = CoxPHFitter()
cph.fit(phenotypic_data_incident, 'duration_left_censored', 'incident_hcm', fit_options = {"step_size":0.1})
cph.print_summary()

model,lifelines.CoxPHFitter
duration col,'duration'
event col,'is_HCM'
baseline estimation,breslow
number of observations,470059
number of events observed,695
partial log-likelihood,-8129.74
time fit was run,2024-01-26 16:27:26 UTC

covariate,coef,exp(coef),se(coef),coef lower 95%,coef upper 95%,exp(coef) lower 95%,exp(coef) upper 95%,cmp to,z,p,-log2(p)
prevalent_af,-0.45,0.64,0.13,-0.7,-0.2,0.5,0.82,0.0,-3.53,<0.005,11.25
prevalent_ckd,0.13,1.14,0.21,-0.29,0.55,0.75,1.74,0.0,0.62,0.54,0.9
prevalent_cad,0.84,2.32,0.09,0.66,1.02,1.93,2.78,0.0,9.07,<0.005,62.85
prevalent_t2d,0.02,1.02,0.12,-0.2,0.25,0.82,1.29,0.0,0.21,0.84,0.26
prevalent_htn,0.01,1.01,0.1,-0.18,0.19,0.83,1.21,0.0,0.05,0.96,0.06
age,-0.12,0.89,0.01,-0.13,-0.1,0.88,0.9,0.0,-16.23,<0.005,194.46
sex,0.44,1.55,0.08,0.27,0.61,1.32,1.83,0.0,5.21,<0.005,22.33
smoking_status,-0.2,0.82,0.08,-0.35,-0.04,0.7,0.96,0.0,-2.53,0.01,6.44
bmi,-0.01,0.99,0.01,-0.02,0.01,0.98,1.01,0.0,-0.64,0.52,0.93
physical_activity,-0.03,0.97,0.02,-0.06,0.0,0.94,1.0,0.0,-1.69,0.09,3.45

Concordance,0.84
Partial AIC,16303.47
log-likelihood ratio test,1224.03 on 22 df
-log2(p) of ll-ratio test,812.15


model,lifelines.CoxPHFitter
duration col,'duration_left_censored'
event col,'incident_hcm'
baseline estimation,breslow
number of observations,470059
number of events observed,475
partial log-likelihood,-5774.43
time fit was run,2024-01-26 16:27:33 UTC

covariate,coef,exp(coef),se(coef),coef lower 95%,coef upper 95%,exp(coef) lower 95%,exp(coef) upper 95%,cmp to,z,p,-log2(p)
prevalent_af,-0.01,0.99,0.19,-0.39,0.36,0.68,1.44,0.0,-0.07,0.94,0.09
prevalent_ckd,0.29,1.34,0.28,-0.26,0.85,0.77,2.34,0.0,1.04,0.30,1.74
prevalent_cad,1.14,3.11,0.11,0.91,1.36,2.49,3.89,0.0,10.02,<0.005,76.05
prevalent_t2d,0.03,1.03,0.14,-0.25,0.31,0.78,1.36,0.0,0.21,0.83,0.26
prevalent_htn,0.1,1.1,0.11,-0.12,0.31,0.89,1.37,0.0,0.86,0.39,1.37
age,-0.02,0.98,0.01,-0.04,-0.01,0.96,0.99,0.0,-3.35,<0.005,10.3
sex,0.2,1.22,0.1,0.01,0.4,1.01,1.49,0.0,2.03,0.04,4.57
smoking_status,-0.12,0.89,0.09,-0.3,0.07,0.74,1.07,0.0,-1.22,0.22,2.17
bmi,-0.01,0.99,0.01,-0.03,0.01,0.97,1.01,0.0,-1.05,0.29,1.77
physical_activity,0.0,1.0,0.02,-0.04,0.04,0.96,1.04,0.0,0.04,0.97,0.05

Concordance,0.82
Partial AIC,11592.87
log-likelihood ratio test,775.47 on 22 df
-log2(p) of ll-ratio test,495.15


#### Conclusion
Model concordance is slightly higher with all HCM cases (0.84 vs. 0.82), and more traits are statistically significant. For these reasons, and to maximize the number of HCM cases among rare variants, we will keep all HCM cases.
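As a rough check on the significance comparison, the sketch below counts covariates whose z statistic exceeds the two-sided 5% normal cutoff (|z| > 1.96). It uses the z column because the p column prints small values as "<0.005". Only the ten covariate rows displayed in each summary are included, although both fits report 22 degrees of freedom, so this is a partial view. The helper name `significant` is made up for this example.

```python
# z statistics copied from the ten covariate rows displayed in each summary above.
z_all_hcm = {
    "prevalent_af": -3.53, "prevalent_ckd": 0.62, "prevalent_cad": 9.07,
    "prevalent_t2d": 0.21, "prevalent_htn": 0.05, "age": -16.23, "sex": 5.21,
    "smoking_status": -2.53, "bmi": -0.64, "physical_activity": -1.69,
}
z_incident = {
    "prevalent_af": -0.07, "prevalent_ckd": 1.04, "prevalent_cad": 10.02,
    "prevalent_t2d": 0.21, "prevalent_htn": 0.86, "age": -3.35, "sex": 2.03,
    "smoking_status": -1.22, "bmi": -1.05, "physical_activity": 0.04,
}


def significant(z_scores, threshold=1.96):
    """Return covariates whose |z| exceeds the two-sided 5% normal cutoff."""
    return [name for name, z in z_scores.items() if abs(z) > threshold]


sig_all = significant(z_all_hcm)
sig_incident = significant(z_incident)
print(len(sig_all), sig_all)
print(len(sig_incident), sig_incident)
```

Among the displayed rows, five covariates pass the cutoff with all HCM cases and three with incident cases only.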

In [35]:
phenotypic_data = phenotypic_data.drop(['incident_hcm', 'duration_left_censored'], axis = 1)

### Test whether to keep only prevalent disease data or all disease data 
Prevalent disease is disease that developed before the initial assessment. Biddinger et al. (2022) measured only prevalent cases when evaluating the Cox model of HCM. Here we test whether this is a necessary assumption. 
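The two models in this test differ only in which disease columns they receive. A minimal sketch of building both covariate sets from one list of disease names is given below; the helper `covariate_sets` and the toy values are hypothetical, while the column names follow the `prevalent_*` and `is_*` naming used in this notebook.

```python
import pandas as pd

DISEASES = ["af", "ckd", "cad", "t2d", "htn"]
PREVALENT_COLS = [f"prevalent_{d}" for d in DISEASES]   # prevalent_af, ...
ANY_TIME_COLS = [f"is_{d.upper()}" for d in DISEASES]   # is_AF, ...


def covariate_sets(df, id_col="eid"):
    """Return (all-disease frame, prevalent-only frame), both without the ID column."""
    all_disease = df.drop(columns=PREVALENT_COLS + [id_col])
    prevalent_only = df.drop(columns=ANY_TIME_COLS + [id_col])
    return all_disease, prevalent_only


# Toy frame with made-up values, two rows only.
toy = pd.DataFrame({"eid": [1, 2], "duration": [10.0, 12.0], "is_HCM": [0, 1], "age": [60, 65]})
for col in PREVALENT_COLS + ANY_TIME_COLS:
    toy[col] = [0, 1]

all_disease, prevalent_only = covariate_sets(toy)
print(sorted(all_disease.columns))
print(sorted(prevalent_only.columns))
```

Listing the disease names once avoids a prefix match on `is_`, which would also remove `is_HCM` and `is_family_hist`.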

In [31]:
# All diseases
phenotypic_data_all = phenotypic_data.drop(['prevalent_af', 'prevalent_ckd', 'prevalent_cad', 'prevalent_t2d', 'prevalent_htn', 'eid'], axis = 1)
cph = CoxPHFitter()
cph.fit(phenotypic_data_all, 'duration', 'is_HCM', fit_options = {"step_size":0.1})
cph.print_summary()


# Only prevalent diseases
phenotypic_data_prevalent = phenotypic_data.drop(['is_AF','is_CKD', 'is_CAD', 'is_T2D', 'is_HTN', 'eid'], axis = 1)
cph = CoxPHFitter()
cph.fit(phenotypic_data_prevalent, 'duration', 'is_HCM', fit_options = {"step_size":0.1})
cph.print_summary()

model,lifelines.CoxPHFitter
duration col,'duration'
event col,'is_HCM'
baseline estimation,breslow
number of observations,470059
number of events observed,695
partial log-likelihood,-8172.29
time fit was run,2024-01-26 16:33:47 UTC

covariate,coef,exp(coef),se(coef),coef lower 95%,coef upper 95%,exp(coef) lower 95%,exp(coef) upper 95%,cmp to,z,p,-log2(p)
age,-0.11,0.89,0.01,-0.13,-0.1,0.88,0.91,0.0,-15.98,<0.005,188.54
sex,0.48,1.61,0.08,0.31,0.64,1.37,1.9,0.0,5.68,<0.005,26.15
smoking_status,-0.2,0.82,0.08,-0.35,-0.04,0.7,0.96,0.0,-2.5,0.01,6.32
bmi,-0.01,0.99,0.01,-0.02,0.01,0.98,1.01,0.0,-0.66,0.51,0.97
physical_activity,-0.03,0.97,0.02,-0.06,0.01,0.94,1.01,0.0,-1.63,0.10,3.28
alcohol_use,0.04,1.05,0.03,-0.02,0.11,0.98,1.11,0.0,1.42,0.16,2.68
is_family_hist,0.42,1.52,0.08,0.27,0.57,1.31,1.77,0.0,5.38,<0.005,23.65
diastolic,-0.01,0.99,0.0,-0.02,-0.0,0.98,1.0,0.0,-2.05,0.04,4.64
systolic,0.0,1.0,0.0,-0.0,0.01,1.0,1.01,0.0,0.43,0.67,0.58
heavy_alcohol_use,-0.29,0.75,0.13,-0.55,-0.04,0.58,0.96,0.0,-2.26,0.02,5.38

Concordance,0.83
Partial AIC,16378.58
log-likelihood ratio test,1138.93 on 17 df
-log2(p) of ll-ratio test,766.67


model,lifelines.CoxPHFitter
duration col,'duration'
event col,'is_HCM'
baseline estimation,breslow
number of observations,470059
number of events observed,695
partial log-likelihood,-8248.72
time fit was run,2024-01-26 16:33:56 UTC

covariate,coef,exp(coef),se(coef),coef lower 95%,coef upper 95%,exp(coef) lower 95%,exp(coef) upper 95%,cmp to,z,p,-log2(p)
prevalent_af,1.1,2.99,0.1,0.91,1.29,2.47,3.62,0.0,11.27,<0.005,95.44
prevalent_ckd,0.44,1.55,0.11,0.22,0.66,1.24,1.93,0.0,3.93,<0.005,13.51
prevalent_cad,0.94,2.56,0.09,0.76,1.12,2.13,3.07,0.0,10.12,<0.005,77.61
prevalent_t2d,0.06,1.06,0.12,-0.17,0.29,0.85,1.34,0.0,0.53,0.60,0.75
prevalent_htn,0.29,1.33,0.09,0.11,0.47,1.11,1.6,0.0,3.14,<0.005,9.19
age,-0.1,0.91,0.01,-0.11,-0.08,0.89,0.92,0.0,-14.02,<0.005,145.87
sex,0.55,1.73,0.08,0.39,0.71,1.47,2.04,0.0,6.59,<0.005,34.38
smoking_status,-0.16,0.85,0.08,-0.32,-0.01,0.73,0.99,0.0,-2.09,0.04,4.79
bmi,0.01,1.01,0.01,-0.01,0.03,0.99,1.03,0.0,1.29,0.20,2.34
physical_activity,-0.03,0.97,0.02,-0.06,0.0,0.94,1.0,0.0,-1.93,0.05,4.21

Concordance,0.81
Partial AIC,16531.44
log-likelihood ratio test,986.07 on 17 df
-log2(p) of ll-ratio test,657.96


#### Conclusion
Concordance is slightly lower for the prevalent-only model (0.81 vs. 0.83 with all disease data), but it has more statistically significant features, so we will keep only the prevalent cases for AF, CKD, CAD, T2D, and HTN.
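The partial AIC values reported for these two fits can be reproduced from the partial log-likelihoods, which gives a complementary criterion alongside concordance. The sketch below assumes the usual form AIC = -2 * log-likelihood + 2 * k, with k taken as the 17 degrees of freedom reported by the log-likelihood ratio test; the function name `partial_aic` is made up for this example. Both models are fit to the same outcome and observations, so the two values are comparable, and lower is better on this criterion.

```python
def partial_aic(partial_log_likelihood, n_params):
    """AIC computed from a partial log-likelihood: -2 * ll + 2 * k."""
    return -2.0 * partial_log_likelihood + 2.0 * n_params


# Partial log-likelihoods copied from the two summaries above.
aic_all_disease = partial_aic(-8172.29, 17)
aic_prevalent = partial_aic(-8248.72, 17)

print(round(aic_all_disease, 2), round(aic_prevalent, 2))
```

The recomputed values agree with the reported partial AIC of 16378.58 for all disease data and 16531.44 for prevalent-only data. AIC is only one of several criteria and does not by itself settle which covariate set to keep.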

In [36]:
phenotypic_data = phenotypic_data.drop(['is_AF', 'is_CKD', 'is_CAD', 'is_T2D', 'is_HTN'], axis = 1)

## Summary statistics

#### HCM cases

In [37]:
pd.DataFrame({'Total number of patients': [len(phenotypic_data)], 
              'Number of patients with HCM':  [phenotypic_data['is_HCM'].sum()], 
              'Number of patients without HCM': [len(phenotypic_data) - phenotypic_data['is_HCM'].sum()],
              'Percent of patients with HCM': [str(phenotypic_data['is_HCM'].sum()/len(phenotypic_data) * 100)[0:4] + '%']})

Total number of patients,Number of patients with HCM,Number of patients without HCM,Percent of patients with HCM
470059,695,469364,0.14%
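The percentage above is built by slicing the string form of a float, which truncates instead of rounding. A format specification is a more robust alternative; the sketch below compares the two on the counts from the table. The variable names are chosen for this example only.

```python
n_total = 470059   # total number of patients, from the table above
n_hcm = 695        # patients with HCM, from the table above

pct = n_hcm / n_total * 100

sliced = str(pct)[0:4] + "%"     # keeps the first 4 characters, so it truncates
formatted = f"{pct:.2f}%"        # rounds to two decimal places

print(sliced, formatted)

# Slicing also misbehaves on values that Python prints in scientific notation.
tiny = 0.00001
print(str(tiny)[0:4])            # the exponent is cut off
```

With rounding, the share of patients with HCM is 0.15% where slicing gives 0.14%.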


#### Discrete cases

In [40]:
discrete_df = phenotypic_data[['prevalent_af', 'prevalent_ckd', 'prevalent_cad', 'prevalent_t2d', 'prevalent_htn', 'is_obesity', 'is_heart_failure', 'heavy_alcohol_use', 'is_family_hist', 'is_HCM', 'sex']]
all_patients = discrete_df.drop(['is_HCM'], axis = 1).sum().to_list()
hcm = discrete_df[discrete_df['is_HCM'] == True].drop(['is_HCM'], axis = 1).sum().to_list()
no_hcm = discrete_df[discrete_df['is_HCM'] == False].drop(['is_HCM'], axis = 1).sum().to_list()

percent_list = []
for i in range(len(all_patients)):
    percent_list.append(str(hcm[i] / all_patients[i] * 100)[0:4] + '%')


discrete_df = pd.DataFrame({'Columns': ['Has AF', 'Has CKD', 'Has CAD', 'Has prevalent T2D', 'Has HTN', 'Has obesity', 'Has heart failure', 'Heavy alcohol use', 'Has family history', 'Is male'], 
                            'All patients': all_patients, 
                            'Patients with HCM': hcm, 
                            'Patients without HCM': no_hcm, 
                            'Percent of group with HCM': percent_list})

discrete_df

Columns,All patients,Patients with HCM,Patients without HCM,Percent of group with HCM
Has AF,30832.0,232.0,30600.0,0.75%
Has CKD,22177.0,120.0,22057.0,0.54%
Has CAD,35667.0,232.0,35435.0,0.65%
Has prevalent T2D,29377.0,95.0,29282.0,0.32%
Has HTN,61634.0,169.0,61465.0,0.27%
Has obesity,99.0,0.0,99.0,0.0%
Has heart failure,17407.0,191.0,17216.0,1.09%
Heavy alcohol use,97462.0,112.0,97350.0,0.11%
Has family history,209861.0,409.0,209452.0,0.19%
Is male,216686.0,447.0,216239.0,0.20%
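The same counts can be obtained in one pass with `groupby`, instead of filtering the frame once per group. The sketch below runs on a small toy frame that reuses two of the notebook's column names; the values are made up, and the percentage is the share of each flagged group that has HCM, as in the table above.

```python
import pandas as pd

# Toy data reusing two of the notebook's column names; the values are made up.
toy = pd.DataFrame({
    "is_HCM":       [1, 1, 0, 0, 0, 0],
    "prevalent_af": [1, 0, 1, 0, 0, 1],
    "sex":          [1, 1, 0, 1, 0, 0],
})

# Sum each 0/1 flag within the HCM and non-HCM groups, then put flags on rows.
by_status = toy.groupby("is_HCM").sum().T

summary = pd.DataFrame({
    "All patients": by_status.sum(axis=1),
    "Patients with HCM": by_status[1],
    "Patients without HCM": by_status[0],
})

# Share of each flagged group that has HCM, as a rounded percentage.
summary["Percent of group with HCM"] = (
    summary["Patients with HCM"] / summary["All patients"] * 100
).round(2)

print(summary)
```

Keeping the percentage numeric and rounding it with `round` leaves the column sortable; a "%" suffix can be added at display time.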


#### Continuous cases
More information about the alcohol intake frequency coding can be found [here](https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=1558).

In [41]:

continuous_df = phenotypic_data[['duration', 'age', 'bmi', 'systolic', 'diastolic', 'alcohol_use', 'physical_activity','is_HCM']]
all_patients = continuous_df.drop(['is_HCM'], axis = 1).mean().to_list()
hcm = continuous_df[continuous_df['is_HCM'] == True].drop(['is_HCM'], axis = 1).mean().to_list()
no_hcm = continuous_df[continuous_df['is_HCM'] == False].drop(['is_HCM'], axis = 1).mean().to_list()

for n in range(len(all_patients)):
    all_patients[n] = str(all_patients[n])[0:5]

for n in range(len(hcm)):
    hcm[n] = str(hcm[n])[0:5]

for n in range(len(no_hcm)):
    no_hcm[n] = str(no_hcm[n])[0:5]


# Labels must follow the column order selected above (alcohol_use comes before physical_activity).
continuous_df = pd.DataFrame({'Columns': ['Duration', 'Age', 'BMI', 'Systolic bp', 'Diastolic bp', 'Alcohol intake frequency', 'Days/wk moderate physical activity'], 'All patients': all_patients, 'Patients with HCM': hcm, 'Patients without HCM': no_hcm})

continuous_df

Columns,All patients,Patients with HCM,Patients without HCM
Duration,70.27,62.08,70.29
Age,70.34,72.26,70.33
BMI,27.34,28.61,27.34
Systolic bp,135.8,138.8,135.8
Diastolic bp,82.23,82.23,82.23
Alcohol intake frequency,2.912,3.138,2.912
Days/wk moderate physical activity,3.627,3.428,3.627
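The table above attaches display labels to the means by list position, which depends on the label list matching the column order exactly. A dictionary keyed by column name removes that dependence. The sketch below is one way to do it; the helper `mean_table` and the toy values are hypothetical, and the column names are the ones used in this notebook.

```python
import pandas as pd

# Map each column to its display label, so labels cannot drift out of order.
LABELS = {
    "duration": "Duration",
    "age": "Age",
    "bmi": "BMI",
    "systolic": "Systolic bp",
    "diastolic": "Diastolic bp",
    "alcohol_use": "Alcohol intake frequency",
    "physical_activity": "Days/wk moderate physical activity",
}


def mean_table(df, labels=LABELS, group_col="is_HCM"):
    """Mean of each labelled column overall and by HCM status, rounded to 2 decimals."""
    cols = list(labels)
    table = pd.DataFrame({
        "All patients": df[cols].mean(),
        "Patients with HCM": df.loc[df[group_col] == 1, cols].mean(),
        "Patients without HCM": df.loc[df[group_col] == 0, cols].mean(),
    })
    return table.rename(index=labels).round(2)


# Toy data with made-up values: every column is 1..4 except alcohol_use.
toy = pd.DataFrame({col: [1.0, 2.0, 3.0, 4.0] for col in LABELS})
toy["alcohol_use"] = [6.0, 6.0, 2.0, 2.0]
toy["is_HCM"] = [1, 1, 0, 0]

table = mean_table(toy)
print(table)
```

Rounding with `round(2)` also gives every row the same precision, unlike slicing the string form of each mean to five characters.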
