# GPU LightGBM Baseline 
In this notebook, we present a GPU LightGBM baseline. Compared to my previous starter notebooks, it teaches 5 new things:
* How to transform `efs` and `efs_time` into a single target with `KaplanMeierFitter`.
* How to train a `GPU LightGBM model` with the `KaplanMeierFitter` target.
* How to train `XGBoost with Survival:Cox loss`.
* How to train `CatBoost with Survival:Cox loss`.
* How to ensemble 5 models using `scipy.stats.rankdata()`.

# Two Competition Approaches
In this competition, there are two ways to train a Survival Model:
* We can input both `efs` and `efs_time` and train a **model that supports** a `survival loss like Cox`.
* We can transform `efs` and `efs_time` into a single target that is a proxy for `risk score` and train **any model** with a `regression loss like MSE`.

In this notebook, we train 5 models. The first 3 models (XGBoost, CatBoost, LightGBM) use the second approach, and the next 2 models (XGBoost Cox, CatBoost Cox) use the first. Discussion about this notebook is [here][4] and [here][3].
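To make the second approach concrete, here is a minimal sketch of a Kaplan-Meier style target. It assumes the target for each row is the estimated survival probability at that row's own `efs_time`; the helper name `kaplan_meier_target` and the toy data are hypothetical, and the estimator is written by hand in NumPy rather than calling `KaplanMeierFitter`, so treat it as an illustration of the idea and not as this notebook's exact transform.

```python
import numpy as np

def kaplan_meier_target(time, event):
    """Hypothetical helper: Kaplan-Meier survival estimate at each row's own time."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    uniq = np.unique(time)  # sorted distinct times
    # Rows still at risk just before each distinct time
    at_risk = np.array([(time >= t).sum() for t in uniq])
    # Observed events (efs == 1) at each distinct time
    deaths = np.array([((time == t) & (event == 1)).sum() for t in uniq])
    # Product-limit estimator: S(t) = prod over t_k <= t of (1 - d_k / n_k)
    surv = np.cumprod(1.0 - deaths / at_risk)
    return surv[np.searchsorted(uniq, time)]

# Toy data (made up for illustration)
efs_time = [5.0, 8.0, 12.0, 20.0]
efs = [1, 0, 1, 0]
y = kaplan_meier_target(efs_time, efs)
print(y)
```

A higher target means longer estimated survival, so a model trained on it with MSE predicts the opposite of risk. Note that an event and a censored row at the same time receive the same value under this assumption.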

Since this competition's metric is a ranking metric, we ensemble the 5 predictions by first converting each into ranks using `scipy.stats.rankdata()`. Afterward we create a weighted average of the ranks.
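The rank ensembling step can be sketched as follows. The function name `rank_ensemble`, the toy predictions, and the weights are hypothetical; the sketch also assumes every model's output is oriented the same way (a larger value means the same thing for all models) before ranking.

```python
import numpy as np
from scipy.stats import rankdata

def rank_ensemble(preds, weights=None):
    """Hypothetical helper: weighted average of per-model ranks."""
    # rankdata maps each prediction vector to ranks 1..n (ties get the average rank),
    # which removes differences in scale between models
    ranks = np.vstack([rankdata(p) for p in preds])
    return np.average(ranks, axis=0, weights=weights)

# Two toy models on very different scales (made-up numbers)
model_a = np.array([0.10, 0.40, 0.35])
model_b = np.array([120.0, 300.0, 900.0])
blend = rank_ensemble([model_a, model_b], weights=[2, 1])
print(blend)
```

Because the metric only depends on the ordering of predictions, summing ranks (equal weights) and averaging ranks give the same score.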

Have Fun! Enjoy!

# Previous Notebooks
My previous starter notebooks are:
* XGBoost and CatBoost starter [here][1]
* NN (MLP) starter [here][2]

Associated discussions are [here][3], [here][4], [here][5]!

[1]: https://www.kaggle.com/code/cdeotte/xgboost-catboost-baseline-cv-668-lb-668
[2]: https://www.kaggle.com/code/cdeotte/nn-mlp-baseline-cv-670-lb-676
[3]: https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550003
[4]: https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550141
[5]: https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550343

# Load Train and Test

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

train_csv_path = "../data/post_hct_survival/train.csv"
test_csv_path = "../data/post_hct_survival/test.csv"

test = pd.read_csv(test_csv_path)
print("Test shape:", test.shape )

train = pd.read_csv(train_csv_path)
print("Train shape:",train.shape)
train.head()

Test shape: (3, 58)
Train shape: (28800, 60)


Unnamed: 0,ID,dri_score,psych_disturb,cyto_score,diabetes,hla_match_c_high,hla_high_res_8,tbi_status,arrhythmia,hla_low_res_6,graft_type,vent_hist,renal_issue,pulm_severe,prim_disease_hct,hla_high_res_6,cmv_status,hla_high_res_10,hla_match_dqb1_high,tce_imm_match,hla_nmdp_6,hla_match_c_low,rituximab,hla_match_drb1_low,hla_match_dqb1_low,prod_type,cyto_score_detail,conditioning_intensity,ethnicity,year_hct,obesity,mrd_hct,in_vivo_tcd,tce_match,hla_match_a_high,hepatic_severe,donor_age,prior_tumor,hla_match_b_low,peptic_ulcer,age_at_hct,hla_match_a_low,gvhd_proph,rheum_issue,sex_match,hla_match_b_high,race_group,comorbidity_score,karnofsky_score,hepatic_mild,tce_div_match,donor_related,melphalan_dose,hla_low_res_8,cardiac,hla_match_drb1_high,pulm_moderate,hla_low_res_10,efs,efs_time
0,0,N/A - non-malignant indication,No,,No,,,No TBI,No,6.0,Bone marrow,No,No,No,IEA,6.0,+/+,,2.0,,6.0,2.0,No,2.0,2.0,BM,,,Not Hispanic or Latino,2016,No,,Yes,,2.0,No,,No,2.0,No,9.942,2.0,FKalone,No,M-F,2.0,More than one race,0.0,90.0,No,,Unrelated,"N/A, Mel not given",8.0,No,2.0,No,10.0,0.0,42.356
1,1,Intermediate,No,Intermediate,No,2.0,8.0,"TBI +- Other, >cGy",No,6.0,Peripheral blood,No,No,No,AML,6.0,+/+,10.0,2.0,P/P,6.0,2.0,No,2.0,2.0,PB,Intermediate,MAC,Not Hispanic or Latino,2008,No,Positive,No,Permissive,2.0,No,72.29,No,2.0,No,43.705,2.0,Other GVHD Prophylaxis,No,F-F,2.0,Asian,3.0,90.0,No,Permissive mismatched,Related,"N/A, Mel not given",8.0,No,2.0,Yes,10.0,1.0,4.672
2,2,N/A - non-malignant indication,No,,No,2.0,8.0,No TBI,No,6.0,Bone marrow,No,No,No,HIS,6.0,+/+,10.0,2.0,P/P,6.0,2.0,No,2.0,2.0,BM,,,Not Hispanic or Latino,2019,No,,Yes,,2.0,No,,No,2.0,No,33.997,2.0,Cyclophosphamide alone,No,F-M,2.0,More than one race,0.0,90.0,No,Permissive mismatched,Related,"N/A, Mel not given",8.0,No,2.0,No,10.0,0.0,19.793
3,3,High,No,Intermediate,No,2.0,8.0,No TBI,No,6.0,Bone marrow,No,No,No,ALL,6.0,+/+,10.0,2.0,P/P,6.0,2.0,No,2.0,2.0,BM,Intermediate,MAC,Not Hispanic or Latino,2009,No,Positive,No,Permissive,2.0,No,29.23,No,2.0,No,43.245,2.0,FK+ MMF +- others,No,M-M,2.0,White,0.0,90.0,Yes,Permissive mismatched,Unrelated,"N/A, Mel not given",8.0,No,2.0,No,10.0,0.0,102.349
4,4,High,No,,No,2.0,8.0,No TBI,No,6.0,Peripheral blood,No,No,No,MPN,6.0,+/+,10.0,2.0,,5.0,2.0,No,2.0,2.0,PB,,MAC,Hispanic or Latino,2018,No,,Yes,,2.0,No,56.81,No,2.0,No,29.74,2.0,TDEPLETION +- other,No,M-F,2.0,American Indian or Alaska Native,1.0,90.0,No,Permissive mismatched,Related,MEL,8.0,No,2.0,No,10.0,0.0,16.223


# Features
There are a total of 57 features. Of these, 35 are categorical and 22 are numerical. We label encode the categorical features so that our XGB and CAT models accept them as categorical features and handle them specially internally. We leave the numerical NaNs as NaNs because GBDTs (like XGB and CAT) can handle NaN and will use this information.
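As a small illustration of the label encoding described above, the sketch below encodes one categorical column on a toy frame. The frames `train_demo` and `test_demo` are made up; the column name is borrowed from the dataset. Train and test are concatenated first so that both share one code per category, and missing values become their own `"NAN"` category.

```python
import pandas as pd

# Toy frames (hypothetical data)
train_demo = pd.DataFrame({"graft_type": ["Bone marrow", None, "Peripheral blood"]})
test_demo = pd.DataFrame({"graft_type": ["Peripheral blood", "Bone marrow"]})

# Encode train and test together so codes are consistent across both
combined = pd.concat([train_demo, test_demo], axis=0, ignore_index=True)
combined["graft_type"] = combined["graft_type"].fillna("NAN")

# factorize assigns integer codes in order of first appearance
codes, uniques = combined["graft_type"].factorize()
combined["graft_type"] = pd.Series(codes, dtype="int32").astype("category")

print(list(codes))
print(list(uniques))
```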

In [2]:
RMV = ["ID","efs","efs_time","y"]
FEATURES = [c for c in train.columns if c not in RMV]
print(f"There are {len(FEATURES)} FEATURES: {FEATURES}")

There are 57 FEATURES: ['dri_score', 'psych_disturb', 'cyto_score', 'diabetes', 'hla_match_c_high', 'hla_high_res_8', 'tbi_status', 'arrhythmia', 'hla_low_res_6', 'graft_type', 'vent_hist', 'renal_issue', 'pulm_severe', 'prim_disease_hct', 'hla_high_res_6', 'cmv_status', 'hla_high_res_10', 'hla_match_dqb1_high', 'tce_imm_match', 'hla_nmdp_6', 'hla_match_c_low', 'rituximab', 'hla_match_drb1_low', 'hla_match_dqb1_low', 'prod_type', 'cyto_score_detail', 'conditioning_intensity', 'ethnicity', 'year_hct', 'obesity', 'mrd_hct', 'in_vivo_tcd', 'tce_match', 'hla_match_a_high', 'hepatic_severe', 'donor_age', 'prior_tumor', 'hla_match_b_low', 'peptic_ulcer', 'age_at_hct', 'hla_match_a_low', 'gvhd_proph', 'rheum_issue', 'sex_match', 'hla_match_b_high', 'race_group', 'comorbidity_score', 'karnofsky_score', 'hepatic_mild', 'tce_div_match', 'donor_related', 'melphalan_dose', 'hla_low_res_8', 'cardiac', 'hla_match_drb1_high', 'pulm_moderate', 'hla_low_res_10']


In [3]:
CATS = []
for c in FEATURES:
    if train[c].dtype=="object":
        CATS.append(c)
        train[c] = train[c].fillna("NAN")
        test[c] = test[c].fillna("NAN")
print(f"In these features, there are {len(CATS)} CATEGORICAL FEATURES: {CATS}")

In these features, there are 35 CATEGORICAL FEATURES: ['dri_score', 'psych_disturb', 'cyto_score', 'diabetes', 'tbi_status', 'arrhythmia', 'graft_type', 'vent_hist', 'renal_issue', 'pulm_severe', 'prim_disease_hct', 'cmv_status', 'tce_imm_match', 'rituximab', 'prod_type', 'cyto_score_detail', 'conditioning_intensity', 'ethnicity', 'obesity', 'mrd_hct', 'in_vivo_tcd', 'tce_match', 'hepatic_severe', 'prior_tumor', 'peptic_ulcer', 'gvhd_proph', 'rheum_issue', 'sex_match', 'race_group', 'hepatic_mild', 'tce_div_match', 'donor_related', 'melphalan_dose', 'cardiac', 'pulm_moderate']


In [4]:
combined = pd.concat([train,test],axis=0,ignore_index=True)
#print("Combined data shape:", combined.shape )

# LABEL ENCODE CATEGORICAL FEATURES
print("We LABEL ENCODE the CATEGORICAL FEATURES: ",end="")
for c in FEATURES:

    # LABEL ENCODE CATEGORICAL AND CONVERT TO INT32 CATEGORY
    if c in CATS:
        print(f"{c}, ",end="")
        combined[c],_ = combined[c].factorize()
        combined[c] -= combined[c].min()
        combined[c] = combined[c].astype("int32")
        combined[c] = combined[c].astype("category")
        
    # REDUCE PRECISION OF NUMERICAL TO 32BIT TO SAVE MEMORY
    else:
        if combined[c].dtype=="float64":
            combined[c] = combined[c].astype("float32")
        if combined[c].dtype=="int64":
            combined[c] = combined[c].astype("int32")
    
train = combined.iloc[:len(train)].copy()
test = combined.iloc[len(train):].reset_index(drop=True).copy()

We LABEL ENCODE the CATEGORICAL FEATURES: dri_score, psych_disturb, cyto_score, diabetes, tbi_status, arrhythmia, graft_type, vent_hist, renal_issue, pulm_severe, prim_disease_hct, cmv_status, tce_imm_match, rituximab, prod_type, cyto_score_detail, conditioning_intensity, ethnicity, obesity, mrd_hct, in_vivo_tcd, tce_match, hepatic_severe, prior_tumor, peptic_ulcer, gvhd_proph, rheum_issue, sex_match, race_group, hepatic_mild, tce_div_match, donor_related, melphalan_dose, cardiac, pulm_moderate, 

In [6]:
from sklearn.model_selection import KFold
from lifelines.utils import concordance_index

In [7]:
def score(solution: pd.DataFrame, submission: pd.DataFrame, row_id_column_name: str) -> float:
    """
    >>> import pandas as pd
    >>> row_id_column_name = "id"
    >>> y_pred = {'prediction': {0: 1.0, 1: 0.0, 2: 1.0}}
    >>> y_pred = pd.DataFrame(y_pred)
    >>> y_pred.insert(0, row_id_column_name, range(len(y_pred)))
    >>> y_true = { 'efs': {0: 1.0, 1: 0.0, 2: 0.0}, 'efs_time': {0: 25.1234,1: 250.1234,2: 2500.1234}, 'race_group': {0: 'race_group_1', 1: 'race_group_1', 2: 'race_group_1'}}
    >>> y_true = pd.DataFrame(y_true)
    >>> y_true.insert(0, row_id_column_name, range(len(y_true)))
    >>> score(y_true.copy(), y_pred.copy(), row_id_column_name)
    0.75
    """
    
    del solution[row_id_column_name]
    del submission[row_id_column_name]
    
    event_label = 'efs'
    interval_label = 'efs_time'
    prediction_label = 'prediction'
    # Merging solution and submission dfs on ID
    merged_df = pd.concat([solution, submission], axis=1)
    merged_df.reset_index(inplace=True)
    merged_df_race_dict = dict(merged_df.groupby(['race_group']).groups)
    metric_list = []
    for race in merged_df_race_dict.keys():
        # Retrieving values from y_test based on index
        indices = sorted(merged_df_race_dict[race])
        merged_df_race = merged_df.iloc[indices]
        # Calculate the concordance index
        c_index_race = concordance_index(
                        merged_df_race[interval_label],
                        -merged_df_race[prediction_label],
                        merged_df_race[event_label])
        metric_list.append(c_index_race)
    return float(np.mean(metric_list)-np.sqrt(np.var(metric_list)))
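To show what the metric above measures, here is a simplified sketch. It is a plain pairwise concordance index that takes a risk score directly (higher means an earlier expected event) and ignores pairs with tied times, so it may differ from `lifelines` on such edge cases. The names `c_index` and `stratified_score` are hypothetical.

```python
import numpy as np

def c_index(time, risk, event):
    """Hypothetical helper: share of comparable pairs ranked correctly (risk ties count 0.5)."""
    time, risk, event = map(np.asarray, (time, risk, event))
    concordant, comparable = 0.0, 0
    for i in range(len(time)):
        if event[i] != 1:
            continue  # a censored row cannot be the earlier member of a pair
        for j in range(len(time)):
            if time[i] < time[j]:  # row i had its event before row j was last seen
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

def stratified_score(per_group_c_index):
    """Mean of the per-group c-indexes minus their (population) standard deviation."""
    vals = np.asarray(per_group_c_index, dtype=float)
    return float(vals.mean() - vals.std())

# Same toy data as the docstring example above (a single race group)
c = c_index([25.1234, 250.1234, 2500.1234], [1.0, 0.0, 1.0], [1.0, 0.0, 0.0])
print(c)
print(stratified_score([0.70, 0.66]))
```

Subtracting the standard deviation penalizes models whose accuracy varies between race groups, so two groups scoring 0.70 and 0.66 yield less than their mean of 0.68.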

In [8]:
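# SURVIVAL COX NEEDS THIS TARGET (TO DIGEST EFS AND EFS_TIME)
# POSITIVE VALUE = EVENT OBSERVED AT THAT TIME (EFS==1)
# NEGATIVE VALUE = CENSORED AT THAT TIME (EFS==0)
train["efs_time2"] = train.efs_time.copy()
train.loc[train.efs==0,"efs_time2"] *= -1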
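The signed target packs both `efs` and `efs_time` into one column. The sketch below checks on toy values that nothing is lost, assuming all times are strictly positive. The variable names are hypothetical. The negative-means-censored convention is the one XGBoost documents for `survival:cox`; this notebook applies the same encoding for the CatBoost `Cox` loss.

```python
import numpy as np

# Toy values (a few rows in the style of the training data)
efs = np.array([1, 0, 1, 0])
efs_time = np.array([4.672, 42.356, 4.892, 19.793])

# Events keep a positive time, censored rows get a negative time
cox_label = np.where(efs == 1, efs_time, -efs_time)

# Both original columns can be recovered from the single label
recovered_event = (cox_label > 0).astype(int)
recovered_time = np.abs(cox_label)
print(cox_label)
```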

In [9]:
train

Unnamed: 0,ID,dri_score,psych_disturb,cyto_score,diabetes,hla_match_c_high,hla_high_res_8,tbi_status,arrhythmia,hla_low_res_6,graft_type,vent_hist,renal_issue,pulm_severe,prim_disease_hct,hla_high_res_6,cmv_status,hla_high_res_10,hla_match_dqb1_high,tce_imm_match,hla_nmdp_6,hla_match_c_low,rituximab,hla_match_drb1_low,hla_match_dqb1_low,prod_type,cyto_score_detail,conditioning_intensity,ethnicity,year_hct,obesity,mrd_hct,in_vivo_tcd,tce_match,hla_match_a_high,hepatic_severe,donor_age,prior_tumor,hla_match_b_low,peptic_ulcer,age_at_hct,hla_match_a_low,gvhd_proph,rheum_issue,sex_match,hla_match_b_high,race_group,comorbidity_score,karnofsky_score,hepatic_mild,tce_div_match,donor_related,melphalan_dose,hla_low_res_8,cardiac,hla_match_drb1_high,pulm_moderate,hla_low_res_10,efs,efs_time,efs_time2
0,0,0,0,0,0,,,0,0,6.0,0,0,0,0,0,6.0,0,,2.0,0,6.0,2.0,0,2.0,2.0,0,0,0,0,2016,0,0,0,0,2.0,0,,0,2.0,0,9.942000,2.0,0,0,0,2.0,0,0.0,90.0,0,0,0,0,8.0,0,2.0,0,10.0,0.0,42.356,-42.356
1,1,1,0,1,0,2.0,8.0,1,0,6.0,1,0,0,0,1,6.0,0,10.0,2.0,1,6.0,2.0,0,2.0,2.0,1,1,1,0,2008,0,1,1,1,2.0,0,72.290001,0,2.0,0,43.705002,2.0,1,0,1,2.0,1,3.0,90.0,0,1,1,0,8.0,0,2.0,1,10.0,1.0,4.672,4.672
2,2,0,0,0,0,2.0,8.0,0,0,6.0,0,0,0,0,2,6.0,0,10.0,2.0,1,6.0,2.0,0,2.0,2.0,0,0,0,0,2019,0,0,0,0,2.0,0,,0,2.0,0,33.997002,2.0,2,0,2,2.0,0,0.0,90.0,0,1,1,0,8.0,0,2.0,0,10.0,0.0,19.793,-19.793
3,3,2,0,1,0,2.0,8.0,0,0,6.0,0,0,0,0,3,6.0,0,10.0,2.0,1,6.0,2.0,0,2.0,2.0,0,1,1,0,2009,0,1,1,1,2.0,0,29.230000,0,2.0,0,43.244999,2.0,3,0,3,2.0,2,0.0,90.0,1,1,0,0,8.0,0,2.0,0,10.0,0.0,102.349,-102.349
4,4,2,0,0,0,2.0,8.0,0,0,6.0,1,0,0,0,4,6.0,0,10.0,2.0,0,5.0,2.0,0,2.0,2.0,1,0,1,1,2018,0,0,0,0,2.0,0,56.810001,0,2.0,0,29.740000,2.0,4,0,0,2.0,3,1.0,90.0,0,1,1,1,8.0,0,2.0,0,10.0,0.0,16.223,-16.223
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28795,28795,7,3,4,0,2.0,8.0,0,0,6.0,1,0,0,2,3,6.0,2,10.0,2.0,1,6.0,2.0,0,2.0,2.0,1,1,1,0,2018,0,2,0,3,2.0,0,24.212000,1,2.0,0,51.136002,2.0,7,1,0,2.0,0,0.0,,2,4,3,0,8.0,3,2.0,0,10.0,0.0,18.633,-18.633
28796,28796,2,0,2,1,1.0,4.0,0,0,5.0,1,0,0,0,1,3.0,1,6.0,2.0,4,4.0,1.0,0,2.0,2.0,1,2,2,1,2017,0,1,1,0,1.0,0,30.770000,0,1.0,0,18.075001,2.0,8,0,0,1.0,4,3.0,90.0,0,2,1,0,6.0,1,1.0,1,8.0,1.0,4.892,4.892
28797,28797,6,3,2,3,2.0,8.0,0,2,6.0,1,0,1,2,9,6.0,1,10.0,2.0,4,6.0,2.0,1,2.0,2.0,1,3,1,0,2018,0,0,1,4,2.0,0,22.627001,0,2.0,1,51.005001,2.0,3,1,0,2.0,4,5.0,90.0,2,2,0,0,8.0,3,2.0,0,10.0,0.0,23.157,-23.157
28798,28798,0,0,2,0,1.0,4.0,0,0,3.0,1,0,1,2,9,3.0,0,5.0,1.0,1,3.0,1.0,0,1.0,1.0,1,0,3,0,2018,2,0,0,0,1.0,0,58.074001,1,1.0,1,0.044000,1.0,2,0,3,1.0,5,1.0,90.0,0,1,1,1,4.0,0,1.0,0,5.0,0.0,52.351,-52.351


# CatBoost with Survival:Cox
We train CatBoost using Survival:Cox loss for 10 folds and achieve **CV=0.671**!

In [10]:
import catboost
from catboost import CatBoostClassifier, CatBoostRegressor

print(catboost.__version__)

1.2.7


In [11]:
FOLDS = 10
kf = KFold(n_splits=FOLDS, shuffle=True, random_state=42)
    
oof_cat_cox = np.zeros(len(train))
pred_cat_cox = np.zeros(len(test))

for i, (train_index, test_index) in enumerate(kf.split(train)):

    print("#"*25)
    print(f"### Fold {i+1}")
    print("#"*25)
    
    x_train = train.loc[train_index,FEATURES].copy()
    y_train = train.loc[train_index,"efs_time2"]    
    x_valid = train.loc[test_index,FEATURES].copy()
    y_valid = train.loc[test_index,"efs_time2"]
    x_test = test[FEATURES].copy()

    model_cat_cox = CatBoostRegressor(
        loss_function="Cox",
        #task_type="GPU",   
        iterations=400,     
        learning_rate=0.1,  
        grow_policy='Lossguide',
        use_best_model=False,
    )
    model_cat_cox.fit(x_train,y_train,
              eval_set=(x_valid, y_valid),
              cat_features=CATS,
              verbose=100)
    
    # INFER OOF
    oof_cat_cox[test_index] = model_cat_cox.predict(x_valid)
    # INFER TEST
    pred_cat_cox += model_cat_cox.predict(x_test)

    y_true = train.loc[test_index, ["ID","efs","efs_time","race_group"]].copy()
    y_pred = train.loc[test_index, ["ID"]].copy()
    
    y_pred["prediction"] = oof_cat_cox[test_index]
    m = score(y_true.copy(), y_pred.copy(), "ID")
    print(f"CV Fold {i} score for CatBoost Survival:Cox =",m)

# COMPUTE AVERAGE TEST PREDS
pred_cat_cox /= FOLDS

#########################
### Fold 1
#########################
0:	learn: -137204.2010418	test: -11625.0126498	best: -11625.0126498 (0)	total: 79.2ms	remaining: 31.6s
100:	learn: -134245.0940003	test: -11368.0935757	best: -11367.7720241 (99)	total: 2.24s	remaining: 6.63s
200:	learn: -133569.4247640	test: -11357.0053940	best: -11356.8330165 (182)	total: 4.33s	remaining: 4.28s
300:	learn: -133095.7842781	test: -11351.1819262	best: -11351.0222775 (299)	total: 6.5s	remaining: 2.14s
399:	learn: -132763.5913301	test: -11349.4816640	best: -11349.4142821 (327)	total: 8.53s	remaining: 0us

bestTest = -11349.41428
bestIteration = 327

CV Fold 0 score for CatBoost Survival:Cox = 0.6601447774389542
#########################
### Fold 2
#########################
0:	learn: -137014.2912101	test: -11772.8856048	best: -11772.8856048 (0)	total: 23.3ms	remaining: 9.28s


100:	learn: -134091.3022715	test: -11485.4489792	best: -11485.3225232 (99)	total: 2.16s	remaining: 6.39s
200:	learn: -133312.7852628	test: -11460.6629034	best: -11460.6629034 (200)	total: 4.26s	remaining: 4.22s
300:	learn: -132843.8300906	test: -11453.5101666	best: -11453.1395642 (286)	total: 6.37s	remaining: 2.09s
399:	learn: -132444.2041710	test: -11451.6650578	best: -11451.1640114 (386)	total: 8.44s	remaining: 0us

bestTest = -11451.16401
bestIteration = 386

CV Fold 1 score for CatBoost Survival:Cox = 0.6686153514799281
#########################
### Fold 3
#########################
0:	learn: -136740.2719659	test: -11983.0664595	best: -11983.0664595 (0)	total: 22.5ms	remaining: 9s


100:	learn: -133765.3366558	test: -11689.7400344	best: -11689.7400344 (100)	total: 2.15s	remaining: 6.35s
200:	learn: -133055.1524830	test: -11675.0143694	best: -11674.4228636 (194)	total: 4.25s	remaining: 4.21s
300:	learn: -132628.9478783	test: -11670.8603836	best: -11670.7024139 (293)	total: 6.31s	remaining: 2.07s
399:	learn: -132318.8285745	test: -11674.4124251	best: -11670.3801276 (317)	total: 8.31s	remaining: 0us

bestTest = -11670.38013
bestIteration = 317

CV Fold 2 score for CatBoost Survival:Cox = 0.6701362347142874
#########################
### Fold 4
#########################
0:	learn: -136474.7243316	test: -12180.0536823	best: -12180.0536823 (0)	total: 21.8ms	remaining: 8.7s


100:	learn: -133463.0770737	test: -11892.2690720	best: -11892.2690720 (100)	total: 2.12s	remaining: 6.26s
200:	learn: -132783.0162317	test: -11878.1443791	best: -11877.4964925 (197)	total: 4.26s	remaining: 4.21s
300:	learn: -132368.6648703	test: -11875.1126818	best: -11874.8392905 (290)	total: 6.34s	remaining: 2.08s
399:	learn: -131959.4648801	test: -11873.9173657	best: -11873.8885020 (398)	total: 8.37s	remaining: 0us

bestTest = -11873.8885
bestIteration = 398

CV Fold 3 score for CatBoost Survival:Cox = 0.6668181895906764
#########################
### Fold 5
#########################
0:	learn: -137321.8175168	test: -11539.7868480	best: -11539.7868480 (0)	total: 22.3ms	remaining: 8.89s


100:	learn: -134353.3215180	test: -11253.3822723	best: -11253.3822723 (100)	total: 2.1s	remaining: 6.21s


KeyboardInterrupt: 

In [None]:
y_true = train[["ID","efs","efs_time","race_group"]].copy()
y_pred = train[["ID"]].copy()
y_pred["prediction"] = oof_cat_cox
m = score(y_true.copy(), y_pred.copy(), "ID")
print("\nOverall CV for CatBoost Survival:Cox =",m)

In [None]:
feature_importance = model_cat_cox.get_feature_importance()
importance_df = pd.DataFrame({
    "Feature": FEATURES, 
    "Importance": feature_importance
}).sort_values(by="Importance", ascending=False)
plt.figure(figsize=(10, 15))
plt.barh(importance_df["Feature"], importance_df["Importance"])
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.title("CatBoost Survival:Cox Feature Importance")
plt.gca().invert_yaxis()  # Put the most important feature at the top
plt.show()

# Ensemble CAT and XGB and LGB
We ensemble our XGBoost, CatBoost, LightGBM, XGBoost Cox, and CatBoost Cox using `scipy.stats.rankdata()` and achieve an amazing **CV=0.681**. Wow!

In [None]:
from scipy.stats import rankdata 

y_true = train[["ID","efs","efs_time","race_group"]].copy()
y_pred = train[["ID"]].copy()
y_pred["prediction"] = rankdata(oof_xgb) + rankdata(oof_cat) + rankdata(oof_lgb)\
                     + rankdata(oof_xgb_cox) + rankdata(oof_cat_cox)
m = score(y_true.copy(), y_pred.copy(), "ID")
print("\nOverall CV for Ensemble =",m)

# Create Submission CSV

In [None]:
sub = pd.read_csv("/kaggle/input/equity-post-HCT-survival-predictions/sample_submission.csv")
sub.prediction = rankdata(pred_xgb) + rankdata(pred_cat) + rankdata(pred_lgb)\
                     + rankdata(pred_xgb_cox) + rankdata(pred_cat_cox)
sub.to_csv("submission.csv",index=False)
print("Sub shape:",sub.shape)
sub.head()