# Regularization

Regularized regression is supported only by `TwoStagesFitter`, which first estimates $\beta_j$ and then $\alpha_{jt}$. Regularization is applied through the `CoxPHFitter` of lifelines, using event-specific tuning parameters $\eta_j \geq 0$ and the `l1_ratio` argument.

For each $j$, a path of models in $\eta_j$ is usually fitted, and the value of `l1_ratio` determines the type of regularized model. In particular, ridge regression is performed by setting `l1_ratio=0`, lasso by `l1_ratio=1`, and elastic net by `0 < l1_ratio < 1`.
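To make the role of `l1_ratio` concrete, the following minimal sketch evaluates a generic elastic-net penalty for a coefficient vector. The function name `elastic_net_penalty` and the 1/2 factor on the ridge term are assumptions made for illustration; the exact scaling used internally by lifelines may differ.

```python
import numpy as np


def elastic_net_penalty(beta, eta, l1_ratio):
    """Generic elastic-net penalty (illustrative, not lifelines' internal code)."""
    beta = np.asarray(beta, dtype=float)
    l1_term = np.abs(beta).sum()           # lasso part
    l2_term = 0.5 * np.square(beta).sum()  # ridge part (assumed 1/2 scaling)
    return eta * (l1_ratio * l1_term + (1.0 - l1_ratio) * l2_term)


beta = [0.5, -1.0, 0.0]
ridge = elastic_net_penalty(beta, eta=0.01, l1_ratio=0)    # squared L2 term only
lasso = elastic_net_penalty(beta, eta=0.01, l1_ratio=1)    # L1 term only
mixed = elastic_net_penalty(beta, eta=0.01, l1_ratio=0.5)  # elastic net
print(ridge, lasso, mixed)
```

With `l1_ratio=0` only the squared term contributes, with `l1_ratio=1` only the absolute-value term contributes, and intermediate values blend the two.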

In the following, we present how to use *PyDTS* to fit a lasso regularized model, and how to tune the regularization parameters $\eta_j$.

We start by generating data, as discussed in previous sections:

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from pydts.examples_utils.generate_simulations_data import generate_quick_start_df
import warnings
pd.set_option("display.max_rows", 500)
warnings.filterwarnings('ignore')
%matplotlib inline

real_coef_dict = {
    "alpha": {
        1: lambda t: -1 - 0.3 * np.log(t),
        2: lambda t: -1.75 - 0.15 * np.log(t)
    },
    "beta": {
        1: -np.log([0.8, 3, 3, 2.5, 2]),
        2: -np.log([1, 3, 4, 3, 2])
    }
}

n_patients = 50000
n_cov = 5

patients_df = generate_quick_start_df(n_patients=n_patients, n_cov=n_cov, d_times=30, j_events=2, 
                                      pid_col='pid', seed=0, censoring_prob=0.8, 
                                      real_coef_dict=real_coef_dict)

train_df, test_df = train_test_split(patients_df, test_size=0.2)

patients_df.head()

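| | pid | Z1 | Z2 | Z3 | Z4 | Z5 | J | T | C | X |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.548814 | 0.715189 | 0.602763 | 0.544883 | 0.423655 | 0 | 31 | 10 | 10 |
| 1 | 1 | 0.645894 | 0.437587 | 0.891773 | 0.963663 | 0.383442 | 0 | 31 | 24 | 24 |
| 2 | 2 | 0.791725 | 0.528895 | 0.568045 | 0.925597 | 0.071036 | 0 | 17 | 11 | 11 |
| 3 | 3 | 0.087129 | 0.020218 | 0.83262 | 0.778157 | 0.870012 | 1 | 1 | 31 | 1 |
| 4 | 4 | 0.978618 | 0.799159 | 0.461479 | 0.780529 | 0.118274 | 0 | 15 | 14 | 14 |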


## Predefined Regularization Parameters

Lasso with $\eta_1=0.003$ and $\eta_2=0.005$ can be applied by:

In [2]:
from pydts.fitters import TwoStagesFitter

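L1_regularized_fitter = TwoStagesFitter()

# Event-specific penalizers (eta_j); l1_ratio=1 selects the lasso penalty
fit_beta_kwargs = {
    'model_kwargs': {
        1: {'penalizer': 0.003, 'l1_ratio': 1},
        2: {'penalizer': 0.005, 'l1_ratio': 1}
    }
}

# Drop the event time (T) and censoring time (C) columns before fitting
L1_regularized_fitter.fit(df=patients_df.drop(['C', 'T'], axis=1),
                          fit_beta_kwargs=fit_beta_kwargs)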

L1_regularized_fitter.print_summary()



| covariate | j1_params | j1_SE | j2_params | j2_SE |
|---|---|---|---|---|
| Z1 | 2e-06 | 0.000101 | 1.981085e-08 | 2.4e-05 |
| Z2 | -0.772797 | 0.0254 | -7.976064e-07 | 7.1e-05 |
| Z3 | -0.761229 | 0.025532 | -0.1720702 | 0.038499 |
| Z4 | -0.550481 | 0.025318 | -8.073968e-07 | 7.2e-05 |
| Z5 | -0.338471 | 0.025211 | -3.40961e-07 | 3.1e-05 |




Model summary for event: 1


| J | X | n_jt | success | alpha_jt |
|---|---|---|---|---|
| 1 | 1 | 3374 | True | -1.471644 |
| 1 | 2 | 2328 | True | -1.714213 |
| 1 | 3 | 1805 | True | -1.859723 |
| 1 | 4 | 1524 | True | -1.920774 |
| 1 | 5 | 1214 | True | -2.050566 |
| 1 | 6 | 1114 | True | -2.038532 |
| 1 | 7 | 916 | True | -2.142666 |
| 1 | 8 | 830 | True | -2.151764 |
| 1 | 9 | 683 | True | -2.257665 |
| 1 | 10 | 626 | True | -2.258146 |




Model summary for event: 2


| J | X | n_jt | success | alpha_jt |
|---|---|---|---|---|
| 2 | 1 | 1250 | True | -3.578498 |
| 2 | 2 | 839 | True | -3.857174 |
| 2 | 3 | 805 | True | -3.793806 |
| 2 | 4 | 644 | True | -3.92234 |
| 2 | 5 | 570 | True | -3.95329 |
| 2 | 6 | 483 | True | -4.033435 |
| 2 | 7 | 416 | True | -4.097484 |
| 2 | 8 | 409 | True | -4.031967 |
| 2 | 9 | 323 | True | -4.185559 |
| 2 | 10 | 306 | True | -4.159369 |
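With the L1 penalty, several of the estimated coefficients in the summary above are shrunk to values very close to, but not exactly, zero. The following minimal sketch lists the covariates retained for each event, using the estimates printed above and a hypothetical tolerance `tol` (an assumption made for illustration, not a PyDTS default).

```python
import pandas as pd

# Coefficient estimates copied from the summary table above
beta_df = pd.DataFrame(
    {
        "j1_params": [2e-06, -0.772797, -0.761229, -0.550481, -0.338471],
        "j2_params": [1.981085e-08, -7.976064e-07, -0.1720702,
                      -8.073968e-07, -3.40961e-07],
    },
    index=["Z1", "Z2", "Z3", "Z4", "Z5"],
)

tol = 1e-4  # hypothetical cutoff below which a coefficient is treated as zero

# Covariates whose absolute coefficient exceeds the tolerance, per event
selected = {
    col: beta_df.index[beta_df[col].abs() > tol].tolist()
    for col in beta_df.columns
}
print(selected)
# {'j1_params': ['Z2', 'Z3', 'Z4', 'Z5'], 'j2_params': ['Z3']}
```

Under this cutoff, four covariates are retained for event 1 and only `Z3` for event 2; a different tolerance could change the selection.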


## Tuning Regularization Parameters

In penalized regression, one should fit a path of models in each $\eta_j$, $j=1,\ldots,M$. The final values of $\eta_1,\ldots,\eta_M$ are those yielding the best results in terms of a pre-specified criterion, such as maximizing $\widehat{\mbox{AUC}}_j$ and $\widehat{\mbox{AUC}}$, or minimizing $\widehat{\mbox{BS}}_j$ and $\widehat{\mbox{BS}}$. The default criterion in *PyDTS* is maximizing the global AUC, $\widehat{\mbox{AUC}}$. Two $M$-dimensional grid search options are implemented: `PenaltyGridSearch`, for when the user provides train and test datasets, and `PenaltyGridSearchCV`, for applying a K-fold cross validation (CV) approach.

### PenaltyGridSearch

When train and test sets are available, executing the following code computes all four optimization criteria over the $M$-dimensional grid, and `optimal_set` contains the optimal values of $\eta_1,\ldots,\eta_M$ based on $\widehat{\mbox{AUC}}$. Here, the optimal set based on $\widehat{\mbox{AUC}}$ is $\log\eta_1 = -6$ and $\log\eta_2 = -6$.

Note that the coefficients are estimated only once for each candidate value of $\eta_j$. However, since our performance measures require the evaluation of the overall survival function, each possible combination of $\eta_1,\ldots,\eta_M$ must be evaluated separately. This can be time consuming, especially when choosing among a large number of candidate penalizers.
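The cost structure described above can be illustrated with a short counting sketch. With $M=2$ events and five candidate penalizers, the coefficients are estimated once per candidate value, whereas the performance measures are evaluated over all $5^2 = 25$ combinations. The variable names are chosen for illustration, and this is not the PyDTS implementation.

```python
import itertools

import numpy as np

penalizers = np.exp([-2, -3, -4, -5, -6])
n_events = 2  # M

# One coefficient-estimation step per candidate penalizer value
n_estimation_steps = len(penalizers)

# Every combination (eta_1, ..., eta_M) must be evaluated separately
combinations = list(itertools.product(penalizers, repeat=n_events))

print(n_estimation_steps, len(combinations))
# 5 25
```

The number of combinations grows as the number of candidates raised to the power $M$, which is why the evaluation step dominates for large grids.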

In [3]:
from pydts.model_selection import PenaltyGridSearch


penalizers = np.exp([-2, -3, -4, -5, -6])
grid_search = PenaltyGridSearch()
optimal_set = grid_search.evaluate(train_df, test_df, l1_ratio = 1, 
                                   penalizers = penalizers,
                                   metrics = ['IBS', 'GBS', 'IAUC', 'GAUC'])

print(optimal_set)

Started estimating the coefficients for penalizer 0.1353352832366127 (1/5)
Finished estimating the coefficients for penalizer 0.1353352832366127 (1/5), 199 seconds
Started estimating the coefficients for penalizer 0.049787068367863944 (2/5)
Finished estimating the coefficients for penalizer 0.049787068367863944 (2/5), 204 seconds
Started estimating the coefficients for penalizer 0.01831563888873418 (3/5)
Finished estimating the coefficients for penalizer 0.01831563888873418 (3/5), 206 seconds
Started estimating the coefficients for penalizer 0.006737946999085467 (4/5)
Finished estimating the coefficients for penalizer 0.006737946999085467 (4/5), 207 seconds
Started estimating the coefficients for penalizer 0.0024787521766663585 (5/5)
Finished estimating the coefficients for penalizer 0.0024787521766663585 (5/5), 213 seconds
(0.0024787521766663585, 0.0024787521766663585)


The user can choose the set of values $\eta_j$, $j=1,\ldots,M$, that optimizes any other desired criterion. For example, the set that minimizes $\widehat{\mbox{BS}}$ can be selected as follows:

In [4]:
res = grid_search.convert_results_dict_to_df(grid_search.global_bs)
res.columns = ['BS']
res.index.set_names(['eta_1', 'eta_2'], inplace=True)
res

| eta_1 | eta_2 | BS |
|---|---|---|
| 0.135335 | 0.135335 | 0.038662 |
| 0.135335 | 0.049787 | 0.038662 |
| 0.135335 | 0.018316 | 0.038633 |
| 0.135335 | 0.006738 | 0.038468 |
| 0.135335 | 0.002479 | 0.038137 |
| 0.049787 | 0.135335 | 0.03843 |
| 0.049787 | 0.049787 | 0.03843 |
| 0.049787 | 0.018316 | 0.038403 |
| 0.049787 | 0.006738 | 0.038245 |
| 0.049787 | 0.002479 | 0.037919 |


In [5]:
grid_search.convert_results_dict_to_df(grid_search.global_bs).idxmin()

0    (0.0024787521766663585, 0.0024787521766663585)
dtype: object

The final model can be retrieved by:

In [6]:
optimal_two_stages_fitter = grid_search.get_mixed_two_stages_fitter(optimal_set)

### PenaltyGridSearchCV

Alternatively, 5-fold CV is performed by:

In [7]:
from pydts.cross_validation import PenaltyGridSearchCV


penalizers = np.exp([-2, -3, -4, -5, -6])
grid_search_cv = PenaltyGridSearchCV()
results_df = grid_search_cv.cross_validate(patients_df, l1_ratio=1, 
                                           penalizers=penalizers, n_splits=5, 
                                           metrics=['IBS', 'GBS', 'IAUC', 'GAUC'])

Starting fold 1/5
Started estimating the coefficients for penalizer 0.1353352832366127 (1/5)
Finished estimating the coefficients for penalizer 0.1353352832366127 (1/5), 174 seconds
Started estimating the coefficients for penalizer 0.049787068367863944 (2/5)
Finished estimating the coefficients for penalizer 0.049787068367863944 (2/5), 186 seconds
Started estimating the coefficients for penalizer 0.01831563888873418 (3/5)
Finished estimating the coefficients for penalizer 0.01831563888873418 (3/5), 200 seconds
Started estimating the coefficients for penalizer 0.006737946999085467 (4/5)
Finished estimating the coefficients for penalizer 0.006737946999085467 (4/5), 209 seconds
Started estimating the coefficients for penalizer 0.0024787521766663585 (5/5)
Finished estimating the coefficients for penalizer 0.0024787521766663585 (5/5), 155 seconds
Finished fold 1/5, 1003 seconds
Starting fold 2/5
Started estimating the coefficients for penalizer 0.1353352832366127 (1/5)
Finished estimating t

Finished estimating the coefficients for penalizer 0.006737946999085467 (4/5), 213 seconds
Started estimating the coefficients for penalizer 0.0024787521766663585 (5/5)
Finished estimating the coefficients for penalizer 0.0024787521766663585 (5/5), 182 seconds
Finished fold 5/5, 1040 seconds


In [8]:
results_df

| | | Mean | SE |
|---|---|---|---|
| 0.135335 | 0.135335 | 0.638475 | 0.002592 |
| 0.135335 | 0.049787 | 0.639094 | 0.003754 |
| 0.135335 | 0.018316 | 0.639103 | 0.003627 |
| 0.135335 | 0.006738 | 0.637383 | 0.002958 |
| 0.135335 | 0.002479 | 0.488831 | 0.003291 |
| 0.049787 | 0.135335 | 0.638764 | 0.002806 |
| 0.049787 | 0.049787 | 0.638899 | 0.003862 |
| 0.049787 | 0.018316 | 0.639014 | 0.00378 |
| 0.049787 | 0.006738 | 0.637522 | 0.003173 |
| 0.049787 | 0.002479 | 0.488832 | 0.003291 |

In [10]:
optimal_set = results_df['Mean'].idxmax()
optimal_set

(0.1353352832366127, 0.01831563888873418)
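The selection step can be reproduced on the rows of `results_df` displayed above. The following sketch rebuilds those ten rows by hand and applies the same `idxmax` call. The index names `eta_1` and `eta_2` are labels added for illustration, and the rows of the full grid that are not shown in the output are omitted, so this is only a check on the displayed subset.

```python
import pandas as pd

# (eta_1, eta_2, Mean, SE) copied from the displayed rows of results_df
rows = [
    (0.135335, 0.135335, 0.638475, 0.002592),
    (0.135335, 0.049787, 0.639094, 0.003754),
    (0.135335, 0.018316, 0.639103, 0.003627),
    (0.135335, 0.006738, 0.637383, 0.002958),
    (0.135335, 0.002479, 0.488831, 0.003291),
    (0.049787, 0.135335, 0.638764, 0.002806),
    (0.049787, 0.049787, 0.638899, 0.003862),
    (0.049787, 0.018316, 0.639014, 0.003780),
    (0.049787, 0.006738, 0.637522, 0.003173),
    (0.049787, 0.002479, 0.488832, 0.003291),
]
displayed = pd.DataFrame(rows, columns=["eta_1", "eta_2", "Mean", "SE"])
displayed = displayed.set_index(["eta_1", "eta_2"])

# Penalizer pair with the largest mean score across folds
best = displayed["Mean"].idxmax()
print(best)
```

On the displayed rows, the selected pair agrees (up to rounding) with the `optimal_set` reported above.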

## References

[1] Meir, Tomer, Gutman, Rom, and Gorfine, Malka, "PyDTS: A Python Package for Discrete-Time Survival Analysis with Competing Risks" (2022)