# Data Regrouping Example

As previously discussed, both estimators require observations for each (j, t). Sometimes the data do not meet this requirement, either because observations become less frequent at later times or because there are no observations at specific time points. For example, when dealing with hospitalization length of stay, patients are more likely to be released after a few days than after a month, and releases can be less frequent on weekends.

In this example, we demonstrate a data regrouping step that can be applied during the data preprocessing stage in both of these cases, and which allows the model to be estimated successfully.

In [1]:
import warnings
import sys 

import pandas as pd
import numpy as np
from pydts.examples_utils.generate_simulations_data import generate_quick_start_df
from pydts.fitters import TwoStagesFitter

pd.set_option("display.max_rows", 500)
pd.set_option("display.max_columns", 25)
warnings.filterwarnings('ignore')
%matplotlib inline

## Not enough observations in later times

In this case, we consider a setting in which the observed events become less frequent at later times. We generate the data using the same model described in the Data Preparation section of the Usage Example, only with far fewer patients ($n=1000$).

In [2]:
real_coef_dict = {
    "alpha": {
        1: lambda t: -1 - 0.3 * np.log(t),
        2: lambda t: -1.75 - 0.15 * np.log(t)
    },
    "beta": {
        1: -np.log([0.8, 3, 3, 2.5, 2]),
        2: -np.log([1, 3, 4, 3, 2])
    }
}

df = generate_quick_start_df(n_patients=1000, n_cov=5, d_times=30, j_events=2, pid_col='pid', seed=0, 
                             real_coef_dict=real_coef_dict)
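
The `alpha` entries above are functions of time, so it can help to evaluate them on the time grid before looking at the data. The following is a minimal sketch that only re-uses the two lambdas from `real_coef_dict` and NumPy; the names `alpha`, `times`, and `true_alpha` are illustrative and not part of pydts.

```python
import numpy as np

# Same alpha functions as in real_coef_dict above.
alpha = {
    1: lambda t: -1 - 0.3 * np.log(t),
    2: lambda t: -1.75 - 0.15 * np.log(t),
}

# Discrete time grid matching d_times=30.
times = np.arange(1, 31)

# Evaluate each time-dependent term on the whole grid.
true_alpha = {j: f(times) for j, f in alpha.items()}

for j, values in true_alpha.items():
    print(f"J={j}: alpha at t=1 is {values[0]:.2f}, at t=30 is {values[-1]:.2f}")
```

Both functions decrease with $t$, which is consistent with events becoming rarer at later times.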

Now, when we repeat the data check to make sure we observe events for each (j,t), we see that we do not observe enough events in later times. For example, the number of events $n_{j=1,t=25} = 1$ and $n_{j=2,t=25} = 0$

In [3]:
df.groupby(['J', 'X'])['pid'].count().unstack('J')

X,J=0,J=1,J=2
1,30.0,63.0,24.0
2,20.0,49.0,24.0
3,28.0,34.0,13.0
4,21.0,34.0,11.0
5,22.0,15.0,9.0
6,22.0,25.0,12.0
7,23.0,20.0,14.0
8,21.0,17.0,6.0
9,20.0,11.0,13.0
10,11.0,12.0,3.0
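
Instead of scanning the table by eye, the empty (j, t) cells can be listed programmatically. The sketch below uses pandas only; `find_empty_cells` is a hypothetical helper (not a pydts function) and it assumes that `J == 0` marks censoring and that the time grid is 1 up to the largest observed `X`.

```python
import pandas as pd

def find_empty_cells(df, event_col="J", time_col="X", censor_value=0):
    """Return {event type: [times with zero observed events]} (hypothetical helper)."""
    events = df[df[event_col] != censor_value]
    # Assumed time grid: every integer from 1 to the largest observed time.
    all_times = range(1, int(df[time_col].max()) + 1)
    counts = (
        events.groupby([time_col, event_col]).size()
        .unstack(event_col, fill_value=0)      # rows: times, columns: event types
        .reindex(all_times, fill_value=0)      # add times with no events at all
    )
    return {int(j): counts.index[counts[j] == 0].tolist() for j in counts.columns}

# Toy data: event type 2 has no events at times 2 and 3.
toy = pd.DataFrame({
    "pid": range(8),
    "J": [1, 2, 1, 0, 1, 2, 0, 1],
    "X": [1, 1, 2, 2, 3, 4, 4, 4],
})
result = find_empty_cells(toy)
print(result)
```

An event type that never occurs in the data does not appear in the result at all, so this check complements, rather than replaces, the count table above.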


Trying to fit the model with such data will result in the following error message:

In [4]:
m2 = TwoStagesFitter()
try:
    m2.fit(df.drop(columns=['C', 'T']), verbose=0)
except RuntimeError as e:
    raise e.with_traceback(None)

RuntimeError: In event J=1, The method didn't have events D=[24, 27, 28, 29, 30]. Consider changing the problem definition.
 See https://tomer1812.github.io/pydts/UsageExample-RegroupingData/ for more details.

To fix this, we can induce administrative censoring, so that any observation later than the 21st day (whether $J=1$ or $J=2$) is assigned to a single 21+ occurrence time.

In [5]:
regrouped_df = df.copy()
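# Assign the clipped column back; an in-place clip on a selected column may not update the DataFrame.
regrouped_df['X'] = regrouped_df['X'].clip(upper=21)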
regrouped_df.groupby(['J', 'X'])['pid'].count().unstack('J')

X,J=0,J=1,J=2
1,30,63,24
2,20,49,24
3,28,34,13
4,21,34,11
5,22,15,9
6,22,25,12
7,23,20,14
8,21,17,6
9,20,11,13
10,11,12,3
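
One possible way to pick a cut-off is to find the last time up to which every event type is still observed at every time point. This is a conservative heuristic sketched under assumptions, not the rule used to choose 21 in this example: `last_fully_observed_time` and `min_events` are hypothetical names, `J == 0` is assumed to mark censoring, and every event type is assumed to occur at least once.

```python
import pandas as pd

def last_fully_observed_time(df, min_events=1, event_col="J", time_col="X"):
    """Largest t such that each event type has >= min_events at every time 1..t."""
    events = df[df[event_col] != 0]
    all_times = range(1, int(df[time_col].max()) + 1)
    counts = (
        events.groupby([time_col, event_col]).size()
        .unstack(event_col, fill_value=0)
        .reindex(all_times, fill_value=0)
    )
    # True at times where all event types have enough events.
    ok = (counts >= min_events).all(axis=1)
    # cumprod zeroes out everything after the first failing time,
    # so the sum is the length of the leading run of valid times.
    return int(ok.astype(int).cumprod().sum())

toy = pd.DataFrame({
    "J": [1, 2, 1, 2, 1, 2, 2, 1, 0],
    "X": [1, 1, 2, 2, 3, 3, 4, 5, 5],
})
cutoff = last_fully_observed_time(toy)

# Apply the cut-off the same way as in the cell above.
clipped = toy.copy()
clipped["X"] = clipped["X"].clip(upper=cutoff)
print(cutoff, sorted(clipped["X"].unique()))
```

Raising `min_events` gives an earlier cut-off with more events per cell, at the price of a coarser final time category.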


Now, we can successfully estimate the model:

In [6]:
m2 = TwoStagesFitter()
m2.fit(regrouped_df.drop(columns=['C', 'T']))
m2.print_summary()

INFO: Pandarallel will run on 2 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


Model summary for event: 1


0,1
model,lifelines.CoxPHFitter
duration col,'X'
event col,'j_1'
strata,X_copy
baseline estimation,breslow
number of observations,9091
number of events observed,359
partial log-likelihood,-2222.61
time fit was run,2022-04-09 21:53:20 UTC

covariate,coef,exp(coef),se(coef),coef lower 95%,coef upper 95%,exp(coef) lower 95%,exp(coef) upper 95%,z,p,-log2(p)
Z1,0.06,1.06,0.19,-0.31,0.43,0.74,1.53,0.32,0.75,0.41
Z2,-0.98,0.38,0.19,-1.36,-0.6,0.26,0.55,-5.08,<0.005,21.31
Z3,-1.01,0.36,0.19,-1.39,-0.64,0.25,0.53,-5.29,<0.005,22.99
Z4,-1.05,0.35,0.18,-1.4,-0.69,0.25,0.5,-5.79,<0.005,27.12
Z5,-0.82,0.44,0.18,-1.17,-0.46,0.31,0.63,-4.53,<0.005,17.37

0,1
Concordance,0.65
Partial AIC,4455.21
log-likelihood ratio test,104.24 on 5 df
-log2(p) of ll-ratio test,67.01


J,X,n_jt,success,alpha_jt
1,1,63,True,-0.947684
1,2,49,True,-1.051086
1,3,34,True,-1.295866
1,4,34,True,-1.17856
1,5,15,True,-1.903136
1,6,25,True,-1.298924
1,7,20,True,-1.401847
1,8,17,True,-1.451159
1,9,11,True,-1.788478
1,10,12,True,-1.564731




Model summary for event: 2


0,1
model,lifelines.CoxPHFitter
duration col,'X'
event col,'j_2'
strata,X_copy
baseline estimation,breslow
number of observations,9091
number of events observed,156
partial log-likelihood,-976.47
time fit was run,2022-04-09 21:53:21 UTC

covariate,coef,exp(coef),se(coef),coef lower 95%,coef upper 95%,exp(coef) lower 95%,exp(coef) upper 95%,z,p,-log2(p)
Z1,0.43,1.54,0.29,-0.13,0.99,0.88,2.7,1.51,0.13,2.94
Z2,-0.65,0.52,0.29,-1.21,-0.09,0.3,0.91,-2.28,0.02,5.47
Z3,-1.15,0.32,0.29,-1.72,-0.58,0.18,0.56,-3.95,<0.005,13.66
Z4,-0.22,0.8,0.27,-0.75,0.31,0.47,1.37,-0.81,0.42,1.26
Z5,-0.48,0.62,0.27,-1.01,0.06,0.36,1.06,-1.74,0.08,3.62

0,1
Concordance,0.62
Partial AIC,1962.94
log-likelihood ratio test,26.60 on 5 df
-log2(p) of ll-ratio test,13.84


J,X,n_jt,success,alpha_jt
2,1,24,True,-2.770174
2,2,24,True,-2.619309
2,3,13,True,-3.105049
2,4,11,True,-3.164241
2,5,9,True,-3.269706
2,6,12,True,-2.900518
2,7,14,True,-2.616379
2,8,6,True,-3.361561
2,9,13,True,-2.468053
2,10,3,True,-3.827


## Not enough observations at specific times

Now let's examine the case in which there are no events during the weekend, only censoring. We will rearrange the data to represent such a case:

In [7]:
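def map_days(row):
    # Move events (J=1 or J=2) observed on days 7, 14, 21 to the preceding day,
    # so that only censoring remains on those days.
    if row['X'] in [7, 14, 21] and row['J'] in [1, 2]:
        row['X'] -= 1
    return row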

regrouped_df = regrouped_df.apply(map_days, axis=1)
regrouped_df[['J', 'T', 'C', 'X']] = regrouped_df[['J', 'T', 'C', 'X']].astype('int64')
(regrouped_df.groupby(['J'])['X'].value_counts()).to_frame().unstack()

J,X=1,X=2,X=3,X=4,X=5,X=6,X=7,X=8,X=9,X=10,X=11,X=12,X=13,X=14,X=15,X=16,X=17,X=18,X=19,X=20,X=21
0,30.0,20.0,28.0,21.0,22.0,22.0,23.0,21.0,20.0,11.0,18.0,15.0,15.0,13.0,21.0,16.0,19.0,14.0,14.0,14.0,108.0
1,63.0,49.0,34.0,34.0,15.0,45.0,,17.0,11.0,12.0,15.0,12.0,12.0,,14.0,5.0,5.0,4.0,1.0,11.0,
2,24.0,24.0,13.0,11.0,9.0,26.0,,6.0,13.0,3.0,1.0,3.0,7.0,,3.0,3.0,1.0,2.0,2.0,5.0,


Trying to fit the model with such data will result in the following error message:

In [8]:
m2 = TwoStagesFitter()
try: 
    m2.fit(regrouped_df.drop(columns=['C', 'T']), verbose=0)
except RuntimeError as e:
    raise e.with_traceback(None)

RuntimeError: In event J=1, The method didn't have events D=[7, 14, 21]. Consider changing the problem definition.
 See https://tomer1812.github.io/pydts/UsageExample-RegroupingData/ for more details.

A possible solution is to merge each time point that has no events with the preceding day, forming combined "weekend" categories:

In [9]:
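def map_days_second_try(row):
    # Move the remaining censored observations (J=0) on days 7, 14, 21
    # to the preceding day, so that these days no longer appear in the data.
    if row['X'] in [7, 14, 21] and row['J'] == 0:
        row['X'] -= 1
    return row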

regrouped_df = regrouped_df.apply(map_days_second_try, axis=1)
regrouped_df[['J', 'T', 'C', 'X']] = regrouped_df[['J', 'T', 'C', 'X']].astype('int64')
(regrouped_df.groupby(['J'])['X'].value_counts()).to_frame().unstack()

J,X=1,X=2,X=3,X=4,X=5,X=6,X=8,X=9,X=10,X=11,X=12,X=13,X=15,X=16,X=17,X=18,X=19,X=20
0,30,20,28,21,22,45,21,20,11,18,15,28,21,16,19,14,14,122
1,63,49,34,34,15,45,17,11,12,15,12,12,14,5,5,4,1,11
2,24,24,13,11,9,26,6,13,3,1,3,7,3,3,1,2,2,5


Note that the problematic times 7, 14, and 21 are no longer in our data, and times 6, 13, and 20 now represent times 6-7, 13-14, and 20-21, respectively. We can now estimate the model, and it will provide $\alpha_{jt}$ estimates only for the times present in the data. Thus, our model cannot predict for times 7, 14, and 21 separately, but only for the combined times 6-7, 13-14, and 20-21.

In [10]:
m2 = TwoStagesFitter()
m2.fit(regrouped_df.drop(columns=['C', 'T']), verbose=0)
m2.print_summary()



Model summary for event: 1


0,1
model,lifelines.CoxPHFitter
duration col,'X'
event col,'j_1'
strata,X_copy
baseline estimation,breslow
number of observations,1718
number of events observed,240
partial log-likelihood,-1314.84
time fit was run,2022-04-09 21:53:28 UTC

covariate,coef,exp(coef),se(coef),coef lower 95%,coef upper 95%,exp(coef) lower 95%,exp(coef) upper 95%,z,p,-log2(p)
Z1,0.13,1.13,0.23,-0.32,0.57,0.73,1.77,0.55,0.58,0.79
Z2,-0.59,0.55,0.23,-1.04,-0.14,0.35,0.87,-2.57,0.01,6.6
Z3,-0.58,0.56,0.23,-1.03,-0.13,0.36,0.88,-2.5,0.01,6.35
Z4,-0.81,0.44,0.22,-1.25,-0.38,0.29,0.68,-3.67,<0.005,12.03
Z5,-0.59,0.56,0.23,-1.04,-0.13,0.35,0.88,-2.53,0.01,6.46

0,1
Concordance,0.63
Partial AIC,2639.68
log-likelihood ratio test,36.64 on 5 df
-log2(p) of ll-ratio test,20.43


J,X,n_jt,success,alpha_jt
1,1,63,True,-1.544779
1,2,49,True,-1.658131
1,3,34,True,-1.911357
1,4,34,True,-1.79552
1,5,15,True,-2.530238
1,6,45,True,-1.296992
1,8,17,True,-2.090432
1,9,11,True,-2.431846
1,10,12,True,-2.211441
1,11,15,True,-1.904531




Model summary for event: 2


0,1
model,lifelines.CoxPHFitter
duration col,'X'
event col,'j_2'
strata,X_copy
baseline estimation,breslow
number of observations,1718
number of events observed,107
partial log-likelihood,-588.54
time fit was run,2022-04-09 21:53:29 UTC

covariate,coef,exp(coef),se(coef),coef lower 95%,coef upper 95%,exp(coef) lower 95%,exp(coef) upper 95%,z,p,-log2(p)
Z1,0.56,1.75,0.34,-0.12,1.23,0.89,3.44,1.62,0.1,3.25
Z2,-0.15,0.86,0.34,-0.81,0.52,0.44,1.68,-0.43,0.67,0.59
Z3,-0.6,0.55,0.34,-1.28,0.07,0.28,1.07,-1.75,0.08,3.65
Z4,0.12,1.13,0.33,-0.52,0.76,0.59,2.15,0.36,0.72,0.48
Z5,-0.25,0.78,0.35,-0.93,0.42,0.39,1.52,-0.74,0.46,1.12

0,1
Concordance,0.57
Partial AIC,1187.08
log-likelihood ratio test,6.29 on 5 df
-log2(p) of ll-ratio test,1.84


J,X,n_jt,success,alpha_jt
2,1,24,True,-3.574037
2,2,24,True,-3.442172
2,3,13,True,-3.946532
2,4,11,True,-4.010704
2,5,9,True,-4.10463
2,6,26,True,-2.947698
2,8,6,True,-4.20507
2,9,13,True,-3.331808
2,10,3,True,-4.678181
2,11,1,True,-5.489276
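
When regrouping real data (as opposed to first constructing the no-weekend-events scenario as done here), the merge of days 7, 14, 21 into the preceding day can be written as a single vectorized mapping instead of a row-wise `apply`. The sketch below uses a small toy frame and pandas only; `weekend_to_previous_day`, `labels`, and `X_label` are illustrative names, and the label column is merely a reminder that the merged times stand for two-day intervals.

```python
import pandas as pd

# Toy data with events (J=1, 2) and censoring (J=0) around the merged days.
toy = pd.DataFrame({
    "J": [1, 1, 0, 2, 0, 1, 0],
    "X": [3, 6, 7, 13, 14, 20, 21],
})

# Map each time without events to the preceding day, for all rows at once.
weekend_to_previous_day = {7: 6, 14: 13, 21: 20}
merged = toy.copy()
merged["X"] = merged["X"].replace(weekend_to_previous_day)

# Optional human-readable labels for the combined time categories.
labels = {6: "6-7", 13: "13-14", 20: "20-21"}
merged["X_label"] = merged["X"].map(labels).fillna(merged["X"].astype(str))

print(merged)
```

Because the mapping is applied regardless of `J`, it matches the net effect of moving events and censored observations in two separate steps.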
