OBJECTIVE: The main goal of this competition is to predict survival outcomes for allogeneic hematopoietic cell transplantation (HCT) patients.
TARGET VARs: 2 Critical Target Variables
  1) EFS (Event-Free Survival) - The event indicator: whether the patient experienced an event (such as disease recurrence or treatment failure) or was censored. The code in this notebook passes it to the Kaplan-Meier fitter as event_observed.
  2) EFS_Time (Event-Free Survival Time) - The actual time that a patient remains event-free. It is a continuous numerical value representing the survival period.
Our task is to predict the likelihood of survival, where the relationship is:
Lower survival means higher risk, i.e., the shorter the event-free survival time (EFS_Time), the higher the risk associated with the patient.

Goal:
Develop a model to provide predictions for HCT patients based on their risk levels:
Higher Prediction values => Lower Risk or Better Survival

C-index
To evaluate the equitable prediction of transplant survival outcomes,
we use the concordance index (C-index) between a series of event times and a predicted score across each race group.

It is a global measure of the model's discriminative power: its ability to rank survival times correctly based on the individual risk scores.

The concordance index is a value between 0 and 1:
0.5 - Expected result from random predictions
1.0 - Perfect concordance
0.0 - Perfect anti-concordance
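
As a rough illustration of the ranking idea above, here is a minimal pure-Python sketch of a concordance index computed separately per race group. The function `c_index`, the toy rows, and the group names are hypothetical and only for illustration; the notebook itself relies on `lifelines.utils.concordance_index`, which handles ties in more detail. How the per-group values are combined into a single score is not stated here, so the sketch only reports them per group.

```python
from itertools import combinations

def c_index(times, scores, events):
    """Simplified concordance index for right-censored data.

    A pair is comparable when the patient with the shorter time had an
    observed event. It is concordant when that patient also has the lower
    score (higher score = longer expected survival).
    """
    concordant, comparable = 0.0, 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: pairs with tied times are skipped
        first, second = (i, j) if times[i] < times[j] else (j, i)
        if not events[first]:
            continue  # earlier time is censored, so the true order is unknown
        comparable += 1
        if scores[first] < scores[second]:
            concordant += 1.0
        elif scores[first] == scores[second]:
            concordant += 0.5  # tied scores count as half
    return concordant / comparable

# Toy rows (made up): (race_group, efs_time, efs, predicted score)
toy = [
    ("group_a", 2.0, 1, 0.1),
    ("group_a", 5.0, 1, 0.4),
    ("group_a", 9.0, 0, 0.8),
    ("group_b", 3.0, 1, 0.7),
    ("group_b", 6.0, 0, 0.2),
    ("group_b", 8.0, 1, 0.9),
]

per_group = {}
for group in sorted({row[0] for row in toy}):
    rows = [row for row in toy if row[0] == group]
    per_group[group] = c_index(
        [r[1] for r in rows],  # event times
        [r[3] for r in rows],  # predicted scores
        [r[2] for r in rows],  # event indicators
    )

print(per_group)
```

In this toy data the scores for group_a are ordered exactly like the survival times, while group_b has one correctly ordered and one incorrectly ordered comparable pair.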

In [1]:
 # Installations-1
!pip install /kaggle/input/pip-install-lifelines/autograd-1.7.0-py3-none-any.whl
!pip install /kaggle/input/pip-install-lifelines/autograd-gamma-0.5.0.tar.gz
!pip install /kaggle/input/pip-install-lifelines/interface_meta-1.3.0-py3-none-any.whl
!pip install /kaggle/input/pip-install-lifelines/formulaic-1.0.2-py3-none-any.whl
!pip install /kaggle/input/pip-install-lifelines/lifelines-0.30.0-py3-none-any.whl

Processing /kaggle/input/pip-install-lifelines/autograd-1.7.0-py3-none-any.whl
autograd is already installed with the same version as the provided wheel. Use --force-reinstall to force an installation of the wheel.
Processing /kaggle/input/pip-install-lifelines/autograd-gamma-0.5.0.tar.gz
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: autograd-gamma
  Building wheel for autograd-gamma (setup.py) ... done
  Created wheel for autograd-gamma: filename=autograd_gamma-0.5.0-py3-none-any.whl size=4031 sha256=2660df9618e252e21f84d90641d69d95d53ef882dced105df16242b0d3269c84
  Stored in directory: /root/.cache/pip/wheels/6b/b5/e0/4c79e15c0b5f2c15ecf613c720bb20daab20a666eb67135155
Successfully built autograd-gamma
Installing collected packages: autograd-gamma
Successfully installed autograd-gamma-0.5.0
Processing /kaggle/input/pip-install-lifelines/interface_meta-1.3.0-py3-none-any.whl
Installing collected packages: interface-

In [2]:
import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, KFold
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from lifelines.utils import concordance_index
from lifelines import KaplanMeierFitter

In [3]:
import plotly.io as pio
pio.renderers.default = 'iframe'
pd.options.display.max_columns = None 

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

In [5]:
# Import dataset (df0)
df0=pd.read_csv('/kaggle/input/equity-post-HCT-survival-predictions/train.csv')

In [6]:
df = df0.drop('ID', axis=1)

#### KaplanMeierFitter()
_KaplanMeierFitter_ is a class from the Python lifelines library for survival analysis. It computes Kaplan-Meier estimates of the survival function from lifetime data, including censored observations.  
It also makes it easy to fit and plot survival curves.
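
To make the estimate concrete, the following is a minimal hand-rolled sketch of the Kaplan-Meier product-limit formula, S(t) = product over event times t_i <= t of (1 - d_i / n_i). The function name `kaplan_meier` and the toy data are hypothetical; this is not how lifelines is implemented internally, only an illustration of the quantity it estimates.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event.

    S(t) is the running product of (1 - d_i / n_i), where d_i is the number
    of events at time t_i and n_i the number of patients still at risk
    (those whose time is >= t_i, whether they later have an event or not).
    """
    survival, curve = 1.0, []
    for t in sorted(set(times)):
        at_risk = sum(1 for x in times if x >= t)
        n_events = sum(1 for x, e in zip(times, events) if x == t and e)
        if n_events:  # censored-only times do not change the estimate
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
    return curve

# Toy data (made up): 1 = event observed, 0 = censored
times = [1, 2, 3, 4, 5]
events = [1, 0, 1, 1, 0]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t={t}: S(t)={s:.4f}")
```

Censored patients (times 2 and 5 here) never lower the curve themselves, but they do shrink the at-risk count for later event times.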

In [7]:
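# Combine the two target variables (efs and efs_time) into one target:
# the Kaplan-Meier survival probability at each patient's efs_time

def kaplan(data=df, time_col='efs_time', event_col='efs'):
    kmf = KaplanMeierFitter()
    kmf.fit(data[time_col], event_observed=data[event_col])
    # Evaluate on `data` (not the global df) so the function works for any frame
    return kmf.survival_function_at_times(data[time_col]).values.flatten()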


df['target'] = kaplan(data=df)

In [8]:
df.shape

(28800, 60)

In [9]:
df = df.drop(columns=['efs', 'efs_time'], errors='ignore')

In [10]:
df.duplicated()[df.duplicated()==True]
    #> No duplicates found

Series([], dtype: bool)

## Handling Missing Values

In [11]:
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].str.strip().str.lower().replace(
        {'n/a': None, 'na': None, 'nan': None, '-': None})

In [12]:
# cat_vars
cat_vars = [var for var in df.columns if df[var].dtype == "O"]

# num_vars
num_vars= [var for var in df.columns if df[var].dtype != "O" and var != 'target']

In [13]:
print('cat_vars: ',cat_vars,'\n')
print('num_vars: ',num_vars )

# Note: target is numeric but is deliberately excluded from num_vars

cat_vars:  ['dri_score', 'psych_disturb', 'cyto_score', 'diabetes', 'tbi_status', 'arrhythmia', 'graft_type', 'vent_hist', 'renal_issue', 'pulm_severe', 'prim_disease_hct', 'cmv_status', 'tce_imm_match', 'rituximab', 'prod_type', 'cyto_score_detail', 'conditioning_intensity', 'ethnicity', 'obesity', 'mrd_hct', 'in_vivo_tcd', 'tce_match', 'hepatic_severe', 'prior_tumor', 'peptic_ulcer', 'gvhd_proph', 'rheum_issue', 'sex_match', 'race_group', 'hepatic_mild', 'tce_div_match', 'donor_related', 'melphalan_dose', 'cardiac', 'pulm_moderate'] 

num_vars:  ['hla_match_c_high', 'hla_high_res_8', 'hla_low_res_6', 'hla_high_res_6', 'hla_high_res_10', 'hla_match_dqb1_high', 'hla_nmdp_6', 'hla_match_c_low', 'hla_match_drb1_low', 'hla_match_dqb1_low', 'year_hct', 'hla_match_a_high', 'donor_age', 'hla_match_b_low', 'age_at_hct', 'hla_match_a_low', 'hla_match_b_high', 'comorbidity_score', 'karnofsky_score', 'hla_low_res_8', 'hla_match_drb1_high', 'hla_low_res_10']


In [14]:
# Filling missing values for numerical columns with their median except target column
df[num_vars] = df[num_vars].fillna(df[num_vars].median())

In [15]:
# Filling missing values for categorical columns with their mode
df[cat_vars] = df[cat_vars].apply(
    lambda col: col.fillna(col.mode()[0] if not col.mode().empty else 'unknown'))

In [16]:
# Factorize categorical vars into integer codes
for col in cat_vars:
    df[col], _ = pd.factorize(df[col])

In [17]:
pd.set_option('display.max_columns', None)

df.head()

Unnamed: 0,dri_score,psych_disturb,cyto_score,diabetes,hla_match_c_high,hla_high_res_8,tbi_status,arrhythmia,hla_low_res_6,graft_type,vent_hist,renal_issue,pulm_severe,prim_disease_hct,hla_high_res_6,cmv_status,hla_high_res_10,hla_match_dqb1_high,tce_imm_match,hla_nmdp_6,hla_match_c_low,rituximab,hla_match_drb1_low,hla_match_dqb1_low,prod_type,cyto_score_detail,conditioning_intensity,ethnicity,year_hct,obesity,mrd_hct,in_vivo_tcd,tce_match,hla_match_a_high,hepatic_severe,donor_age,prior_tumor,hla_match_b_low,peptic_ulcer,age_at_hct,hla_match_a_low,gvhd_proph,rheum_issue,sex_match,hla_match_b_high,race_group,comorbidity_score,karnofsky_score,hepatic_mild,tce_div_match,donor_related,melphalan_dose,hla_low_res_8,cardiac,hla_match_drb1_high,pulm_moderate,hla_low_res_10,target
0,0,0,0,0,2.0,8.0,0,0,6.0,0,0,0,0,0,6.0,0,10.0,2.0,0,6.0,2.0,0,2.0,2.0,0,0,0,0,2016,0,0,0,0,2.0,0,40.063,0,2.0,0,9.942,2.0,0,0,0,2.0,0,0.0,90.0,0,0,0,0,8.0,0,2.0,0,10.0,0.458687
1,1,0,1,0,2.0,8.0,1,0,6.0,1,0,0,0,1,6.0,0,10.0,2.0,0,6.0,2.0,0,2.0,2.0,1,0,0,0,2008,0,1,1,0,2.0,0,72.29,0,2.0,0,43.705,2.0,1,0,1,2.0,1,3.0,90.0,0,0,1,0,8.0,0,2.0,1,10.0,0.847759
2,0,0,0,0,2.0,8.0,0,0,6.0,0,0,0,0,2,6.0,0,10.0,2.0,0,6.0,2.0,0,2.0,2.0,0,0,0,0,2019,0,0,0,0,2.0,0,40.063,0,2.0,0,33.997,2.0,2,0,2,2.0,0,0.0,90.0,0,0,1,0,8.0,0,2.0,0,10.0,0.462424
3,2,0,1,0,2.0,8.0,0,0,6.0,0,0,0,0,3,6.0,0,10.0,2.0,0,6.0,2.0,0,2.0,2.0,0,0,0,0,2009,0,1,1,0,2.0,0,29.23,0,2.0,0,43.245,2.0,3,0,3,2.0,2,0.0,90.0,1,0,0,0,8.0,0,2.0,0,10.0,0.456661
4,2,0,0,0,2.0,8.0,0,0,6.0,1,0,0,0,4,6.0,0,10.0,2.0,0,5.0,2.0,0,2.0,2.0,1,0,0,1,2018,0,0,0,0,2.0,0,56.81,0,2.0,0,29.74,2.0,4,0,0,2.0,3,1.0,90.0,0,0,1,1,8.0,0,2.0,0,10.0,0.464674


## Correlation

In [18]:
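# Get list of feature columns (excluding target)
# Note: categorical columns hold arbitrary factorize() codes, so their Pearson
# correlations are hard to interpret for features with more than two levels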

X = [col for col in df.columns if col != 'target']

for feature in X:
    correlation = df[feature].corr(df["target"])
    print('Correlation for ', feature, ' = ',correlation)

Correlation for  dri_score  =  -0.01566322289688407
Correlation for  psych_disturb  =  0.044588447814429666
Correlation for  cyto_score  =  0.005200974001810688
Correlation for  diabetes  =  0.06472635931119727
Correlation for  hla_match_c_high  =  -0.012607743434186169
Correlation for  hla_high_res_8  =  -0.021272383826416143
Correlation for  tbi_status  =  0.03962863015684239
Correlation for  arrhythmia  =  0.05401029510435931
Correlation for  hla_low_res_6  =  -0.016370955161609413
Correlation for  graft_type  =  0.14554016971025166
Correlation for  vent_hist  =  0.008114302625674752
Correlation for  renal_issue  =  -0.0014354737125929862
Correlation for  pulm_severe  =  0.0885694909623311
Correlation for  prim_disease_hct  =  0.022227791379227736
Correlation for  hla_high_res_6  =  -0.022485941112610514
Correlation for  cmv_status  =  -0.047002368730733844
Correlation for  hla_high_res_10  =  -0.014870926680981335
Correlation for  hla_match_dqb1_high  =  -0.011014935896483157
Corre

## K Fold Validation

In [19]:
df1=df0.copy()

In [20]:
df1['target']=df['target']

In [21]:
# Define features and target
X1 = df1.drop(columns=['efs', 'efs_time', 'ID', 'target', 'rituximab'], errors='ignore')
target = df1['target']

In [22]:
# Identify categorical and numerical columns
cat_vars1 = X1.select_dtypes(include=['object']).columns
num_vars1 = X1.select_dtypes(include=['int64', 'float64']).columns

In [23]:
# Preprocessing for numerical data: Imputation and Scaling
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

In [24]:
from sklearn.preprocessing import OneHotEncoder

In [25]:
# Preprocessing for categorical data: Imputation and One-Hot Encoding
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

In [26]:
from sklearn.compose import ColumnTransformer

In [27]:
# Combine preprocessors in a column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, num_vars1),
        ('cat', categorical_transformer, cat_vars1)
    ]
)

## Create Model Pipeline

In [28]:
from sklearn.ensemble import GradientBoostingRegressor

In [29]:
# Create the model pipeline (the final step is named 'classifier' but holds a regressor)
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingRegressor(random_state=42))
])

In [30]:
# Import necessary libraries for hyperparameter optimization
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from lifelines.utils import concordance_index

# Define parameter grid for hyperparameter optimization
param_grid = {
    'classifier__n_estimators': [100, 200],
    'classifier__learning_rate': [0.01, 0.1],
    'classifier__max_depth': [3, 5, 7]
}

# Setup GridSearchCV with custom scoring function (concordance index)
grid_search = GridSearchCV(
    model, 
    param_grid, 
    cv=5, 
    scoring=make_scorer(concordance_index), 
    verbose=1
)

# Fit the grid search to the data
grid_search.fit(X1, target)

# Retrieve best parameters and model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# GridSearchCV (refit=True by default) has already refit best_estimator_ on the
# full dataset, so this extra fit is redundant but harmless
best_model.fit(X1, target)


Fitting 5 folds for each of 12 candidates, totalling 60 fits


## Model Evaluation on training data

In [31]:
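# Evaluate the model on the full training data (using concordance index)
# Note: this is an in-sample score against the Kaplan-Meier target, so it is
# optimistic; grid_search.best_score_ holds the cross-validated estimate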
train_pred = best_model.predict(X1)
train_c_index = concordance_index(target, train_pred)
print(f"Concordance Index on Training Data: {train_c_index}")


Concordance Index on Training Data: 0.6560702660858555


## Make Predictions on Test Data

In [32]:
# Load the test data
test_data = pd.read_csv('/kaggle/input/equity-post-HCT-survival-predictions/test.csv')

# Predict survival outcomes on the test data
prediction = best_model.predict(test_data.drop(columns=['ID'], errors='ignore'))

In [33]:
# Add predictions to the test dataset
test_data['prediction'] = prediction

# Save predictions to a new CSV file
output_file_path = 'submission.csv'
test_data[['ID', 'prediction']].to_csv(output_file_path, index=False)

print(f"Predictions saved to {output_file_path}")

Predictions saved to submission.csv
