# Notebook 6: Two-Model Approach (CatBoost CLF + LightGBM REG)

## Model Overview

This approach combines two separate models, one for classification and one for regression:
1. **CatBoost Classifier**: Predicts death probability (with sample weights)
2. **LightGBM Regressor**: Predicts survival time (trained on events only)

This is inspired by the 4th-place solution in the Kaggle CIBMTR competition, which its author generously made public.

## Sample Weighting
- **Events (deaths)**: weight = 1.0
- **Censored**: weight = F(t) / F_max (KM cumulative density at censoring time)


## Merge Formula
```
risk = pred_event × (1 + odds(avg_pred_event) × (1 - pred_time_norm))
```
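
As a worked illustration of the formula, the sketch below evaluates it in plain Python for two hypothetical patients with the same predicted death probability but different normalized predicted survival times. The helper name `merge_risk` and all input values are made up for illustration; the clipping of the odds to [0.1, 10] follows the notebook's `merge_predictions`, while the small epsilon in its denominator is omitted here.

```python
def merge_risk(pred_event, pred_time_norm, avg_pred_event):
    """Evaluate risk = p * (1 + odds(avg_p) * (1 - t_norm)) for one patient."""
    odds = avg_pred_event / (1.0 - avg_pred_event)
    odds = min(max(odds, 0.1), 10.0)  # same clipping range as merge_predictions
    return pred_event * (1.0 + odds * (1.0 - pred_time_norm))

# Hypothetical inputs: the average predicted death probability is 0.5, so odds = 1.
early = merge_risk(pred_event=0.8, pred_time_norm=0.25, avg_pred_event=0.5)
late = merge_risk(pred_event=0.8, pred_time_norm=0.75, avg_pred_event=0.5)

print(round(early, 4), round(late, 4))
```

With equal death probabilities, the patient predicted to die earlier receives the higher risk score, which is the intended effect of the `(1 - pred_time_norm)` term.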

## Configuration
- **Features**: 128 fixed (NaN imputed, scaled)
- **Evaluation**: `concordance_index_ipcw` from sksurv
- **CV Score**: 0.6912 weighted C-index
- **Weight in 4-model ensemble**: 30%

In [None]:
import pandas as pd
import numpy as np
import lightgbm as lgb
from catboost import CatBoostClassifier
from sklearn.model_selection import StratifiedKFold
from sksurv.metrics import concordance_index_ipcw
from sksurv.util import Surv
from lifelines import KaplanMeierFitter
import optuna
from optuna.samplers import TPESampler

TRAIN_PATH = '/your_path/SurvivalPrediction/data'

## 1. Load Data

In [6]:
# Load 128-feature FIXED scaled dataset
X_train = pd.read_csv(f'{TRAIN_PATH}/X_train_128features_clean_fixed_scaled.csv')
X_train_unscaled = pd.read_csv(f'{TRAIN_PATH}/X_train_128features_clean_fixed.csv')
target = pd.read_csv(f'{TRAIN_PATH}/target_train_clean_aligned.csv')

y_time = target['OS_YEARS'].values
y_event = target['OS_STATUS'].values.astype(bool)
n_samples = len(X_train)

# Create structured array for sksurv
y_surv = Surv.from_arrays(event=y_event, time=y_time)

print(f"Features: {X_train.shape[1]}")
print(f"Samples: {n_samples}")
print(f"Events: {y_event.sum()} ({y_event.mean()*100:.1f}%)")

Features: 128
Samples: 3120
Events: 1600 (51.3%)


In [7]:
# Risk groups
def define_risk_groups(X_unscaled):
    risk_factors = pd.DataFrame(index=X_unscaled.index)
    risk_factors['high_blast'] = (X_unscaled['BM_BLAST'] > 10).astype(int)
    risk_factors['has_TP53'] = (X_unscaled['has_TP53'] > 0).astype(int)
    risk_factors['low_hb'] = (X_unscaled['HB'] < 10).astype(int)
    risk_factors['low_plt'] = (X_unscaled['PLT'] < 50).astype(int)
    risk_factors['high_cyto'] = (X_unscaled['cyto_risk_score'] >= 3).astype(int)
    n_risk_factors = risk_factors.sum(axis=1)
    return {
        'test_like': n_risk_factors >= 1,
        'high_risk': n_risk_factors >= 2,
    }

risk_groups = define_risk_groups(X_train_unscaled)

# Stratification variable
has_tp53 = (X_train_unscaled['has_TP53'] > 0).astype(int).values
strat_var = pd.Series([f"{int(e)}_{int(t)}" for e, t in zip(y_event, has_tp53)])
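
The stratification label joins the event flag and the TP53 flag into one string, so `StratifiedKFold` balances both at once across folds. A minimal sketch with made-up values (the two short lists below are hypothetical, not taken from the dataset) shows the labels that result:

```python
from collections import Counter

# Hypothetical toy inputs: event status and TP53 indicator for six patients
y_event_toy = [True, False, True, True, False, False]
has_tp53_toy = [1, 0, 0, 1, 0, 1]

# Same label construction as strat_var: "<event>_<tp53>"
strat_toy = [f"{int(e)}_{int(t)}" for e, t in zip(y_event_toy, has_tp53_toy)]
counts = Counter(strat_toy)

print(strat_toy)
print(counts)
```

Each of the four possible labels (`0_0`, `0_1`, `1_0`, `1_1`) becomes its own stratum, so a rare combination needs at least `n_splits` members for the split to succeed without warnings.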

## 2. Evaluation Metric

In [9]:
def weighted_cindex_ipcw(risk, y_surv_all, risk_groups, tau=7.0):
    """Compute weighted C-index using concordance_index_ipcw (competition metric)."""
    c_overall = concordance_index_ipcw(y_surv_all, y_surv_all, risk, tau=tau)[0]
    
    mask_test = np.asarray(risk_groups['test_like'])
    y_surv_test = Surv.from_arrays(event=y_surv_all['event'][mask_test], time=y_surv_all['time'][mask_test])
    c_test = concordance_index_ipcw(y_surv_all, y_surv_test, risk[mask_test], tau=tau)[0]

    mask_high = np.asarray(risk_groups['high_risk'])
    y_surv_high = Surv.from_arrays(event=y_surv_all['event'][mask_high], time=y_surv_all['time'][mask_high])
    c_high = concordance_index_ipcw(y_surv_all, y_surv_high, risk[mask_high], tau=tau)[0]

    weighted = 0.3 * c_overall + 0.4 * c_test + 0.3 * c_high
    return {'overall': c_overall, 'test_like': c_test, 'high_risk': c_high, 'weighted': weighted}

## 3. Classification Sample Weighting

Use Kaplan-Meier-based sample weights for the classifier:
- **Events (deaths)**: weight = 1.0 (full weight)
- **Censored observations**: weight = F(t) / F_max (cumulative density at censoring time / max cumulative density)

Censored patients who were observed longer therefore receive higher weights, since a long event-free follow-up is more informative than an early dropout.

The LightGBM regressor is trained on **events only**.
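
To make the weighting concrete, here is a small self-contained sketch that uses a hand-written Kaplan-Meier estimate instead of `lifelines`. The helper `km_cumulative_density` and the six-patient cohort are hypothetical and only meant to show how the censored weights grow with follow-up time; the final normalization to mean 1.0 is left out.

```python
def km_cumulative_density(times, events):
    """Return F(t) = 1 - S(t) as a function, from a plain Kaplan-Meier estimate."""
    # Distinct event times in increasing order
    event_times = sorted({t for t, e in zip(times, events) if e})
    steps = []  # (event time, survival probability just after it)
    surv = 1.0
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if e and u == t)
        surv *= 1.0 - deaths / at_risk
        steps.append((t, surv))

    def F(t):
        s = 1.0
        for step_time, step_surv in steps:
            if step_time <= t:
                s = step_surv
        return 1.0 - s

    return F

# Made-up cohort: times in years, 1 = death, 0 = censored
toy_times = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
toy_events = [1, 0, 1, 0, 1, 0]

F = km_cumulative_density(toy_times, toy_events)
F_max = max(F(max(toy_times)), 0.01)  # same floor as in the notebook
toy_weights = [1.0 if e else F(t) / F_max for t, e in zip(toy_times, toy_events)]

print([round(w, 3) for w in toy_weights])
```

In this toy cohort the three deaths keep weight 1.0, while the censored patients at 2, 4, and 6 years get increasing weights, with the longest-observed censored patient reaching 1.0.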

In [12]:
# Sample weight computation
def compute_sample_weights(times, events):
    """
    Events (deaths): weight = 1.0
    Censored: weight = F(t) / F_max (KM cumulative density)
    """

    # Fit Kaplan-Meier curve
    kmf_event = KaplanMeierFitter()
    kmf_event.fit(times, event_observed=events)

    # KM cumulative density F(t) = 1 - S(t) at the largest observed time
    t_max = times.max()  # maximum observed time
    F_max = kmf_event.cumulative_density_at_times([t_max]).values[0]  # maximum CDF value
    F_max = max(F_max, 0.01)  # floor F_max to avoid division by zero

    # Assign weights to samples
    weights = np.zeros(len(times))
    for i in range(len(times)):
        if events[i] == 1:
            weights[i] = 1.0
        else:
            F_t = kmf_event.cumulative_density_at_times([times[i]]).values[0]
            weights[i] = F_t / F_max
            
    # normalize so average weight = 1.0
    weights = weights / weights.mean()
    return weights

# Merge function: combine classifier and regressor predictions
def merge_predictions(clf_pred, reg_pred, time_min, time_max):
    # Min-max normalize predicted times with the training range, then clip to [0, 1]
    pred_time_norm = (reg_pred - time_min) / (time_max - time_min + 1e-8)
    pred_time_norm = np.clip(pred_time_norm, 0, 1)

    # Odds of the mean predicted death probability, clipped to [0.1, 10]
    avg_pred_event = np.mean(clf_pred)
    odds = avg_pred_event / (1 - avg_pred_event + 1e-8)
    odds = np.clip(odds, 0.1, 10)

    # Risk score: an earlier predicted death time inflates the classifier probability
    risk = clf_pred * (1 + odds * (1 - pred_time_norm))
    return risk

print("Classification Sample Weighting:")
print("  Events: weight = 1.0")
print("  Censored: weight = F(t) / F_max (KM cumulative density)")
print("\nMerge Formula:")
print("  risk = pred_event * (1 + odds(avg_pred_event) * (1 - pred_time_norm))")

Classification Sample Weighting:
  Events: weight = 1.0
  Censored: weight = F(t) / F_max (KM cumulative density)

Merge Formula:
  risk = pred_event * (1 + odds(avg_pred_event) * (1 - pred_time_norm))


## 4. Global OOF Evaluation

In [13]:
def global_oof_evaluate(cat_params, lgb_params, n_splits=5, base_seed=42):
    """Global OOF CV: CatBoost CLF (sample weights) + LightGBM REG (events only)."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=base_seed)
    oof_preds = np.zeros(n_samples)
    X_arr = X_train.values

    for fold_idx, (train_idx, val_idx) in enumerate(skf.split(X_arr, strat_var)):
        seed = base_seed + fold_idx
        X_tr, X_val = X_arr[train_idx], X_arr[val_idx]
        y_time_tr = y_time[train_idx]
        y_event_tr = y_event[train_idx].astype(int)

        # Sample weights for classifier; estimate from only training folds to avoid data leakage
        clf_weights = compute_sample_weights(y_time_tr, y_event_tr)

        # CatBoost classifier with sample weights
        cat_params_fold = {**cat_params, 'random_seed': seed}
        clf = CatBoostClassifier(**cat_params_fold)
        clf.fit(X_tr, y_event_tr, sample_weight=clf_weights)
        clf_pred = clf.predict_proba(X_val)[:, 1] # pick second column for predicted P(death)

        # LightGBM regressor (events only)
        event_mask = y_event_tr == 1
        X_tr_events = X_tr[event_mask]
        y_time_events = y_time_tr[event_mask]

        lgb_params_fold = {**lgb_params, 'random_state': seed, 'verbosity': -1}
        reg = lgb.LGBMRegressor(**lgb_params_fold)
        reg.fit(X_tr_events, y_time_events)
        reg_pred = reg.predict(X_val)

        # Merge predictions
        time_min, time_max = y_time_tr.min(), y_time_tr.max()
        risk = merge_predictions(clf_pred, reg_pred, time_min, time_max)
        oof_preds[val_idx] = risk

    # Global Z-score normalization
    oof_normalized = (oof_preds - oof_preds.mean()) / (oof_preds.std() + 1e-8)
    return weighted_cindex_ipcw(oof_normalized, y_surv, risk_groups)
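
The global z-score is a strictly increasing transform of the out-of-fold scores, so it puts them on a common scale without changing how patients are ranked. The sketch below checks this on made-up data with a plain Harrell C-index; note that this is a simplified stand-in, not the IPCW-weighted metric used in the notebook, and `harrell_cindex` is a hypothetical helper.

```python
from statistics import mean, pstdev

def harrell_cindex(times, events, risk):
    """Plain Harrell C-index (no IPCW weights): higher risk should mean earlier death."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # each comparable pair is anchored on an observed death
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

# Made-up data: five patients, three observed deaths
toy_times = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_events = [1, 1, 0, 1, 0]
toy_risk = [2.1, 0.4, 1.3, 0.9, 0.2]

# Same z-score as in global_oof_evaluate (population standard deviation)
mu, sigma = mean(toy_risk), pstdev(toy_risk)
toy_risk_z = [(r - mu) / (sigma + 1e-8) for r in toy_risk]

c_raw = harrell_cindex(toy_times, toy_events, toy_risk)
c_z = harrell_cindex(toy_times, toy_events, toy_risk_z)
print(c_raw, c_z)
```

Both calls return the same value, which suggests the normalization mainly matters when the scores are later combined with those of other models.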

## 5. Hyperparameter Tuning

In [17]:
def objective(trial):
    """Optuna objective for Two-Model hyperparameter tuning."""
    cat_params = {
        'depth': trial.suggest_int('cat_depth', 3, 8),
        'iterations': trial.suggest_int('cat_iterations', 100, 400),
        'learning_rate': trial.suggest_float('cat_learning_rate', 0.01, 0.1, log=True),
        'l2_leaf_reg': trial.suggest_float('cat_l2_leaf_reg', 0.1, 10, log=True),
        'verbose': False,
        'allow_writing_files': False,
    }

    lgb_params = {
        'max_depth': trial.suggest_int('lgb_max_depth', 3, 10),
        'n_estimators': trial.suggest_int('lgb_n_estimators', 100, 400),
        'learning_rate': trial.suggest_float('lgb_learning_rate', 0.01, 0.1, log=True),
        'num_leaves': trial.suggest_int('lgb_num_leaves', 15, 63),
        'min_child_samples': trial.suggest_int('lgb_min_child_samples', 5, 50),
        # Note: LightGBM only applies `subsample` when `subsample_freq` > 0 (default is 0)
        'subsample': trial.suggest_float('lgb_subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('lgb_colsample_bytree', 0.6, 1.0),
        'reg_alpha': trial.suggest_float('lgb_reg_alpha', 1e-8, 10, log=True),
        'reg_lambda': trial.suggest_float('lgb_reg_lambda', 1e-8, 10, log=True),
    }

    results = global_oof_evaluate(cat_params, lgb_params)

    trial.set_user_attr('overall', results['overall'])
    trial.set_user_attr('test_like', results['test_like'])
    trial.set_user_attr('high_risk', results['high_risk'])

    return results['weighted']

# Run Optuna study (200 trials; lower n_trials for a quicker run)
print("Running Optuna hyperparameter tuning...")
print("(Set n_trials=200 for full tuning)\n")

sampler = TPESampler(seed=42)
study = optuna.create_study(direction='maximize', sampler=sampler)
study.optimize(objective, n_trials=200, show_progress_bar=True)

print(f"\nBest trial:")
print(f"  Weighted C-index: {study.best_value:.4f}")
print(f"  Params: {study.best_params}")

Running Optuna hyperparameter tuning...
(Set n_trials=200 for full tuning)



Best trial:
  Weighted C-index: 0.6912
  Params: {'cat_depth': 5, 'cat_iterations': 262, 'cat_learning_rate': 0.019227204273246305, 'cat_l2_leaf_reg': 0.12876998314647772, 'lgb_max_depth': 9, 'lgb_n_estimators': 149, 'lgb_learning_rate': 0.08025255147825751, 'lgb_num_leaves': 36, 'lgb_min_child_samples': 10, 'lgb_subsample': 0.9638486108112962, 'lgb_colsample_bytree': 0.8126162431458441, 'lgb_reg_alpha': 4.6796458896222854e-06, 'lgb_reg_lambda': 4.190436671160808e-07}


## 6. Best Model Results

In [18]:
BEST_CAT_PARAMS = {
    'depth': 5,
    'iterations': 262,
    'learning_rate': 0.019227204273246305,
    'l2_leaf_reg': 0.12876998314647772,
    'verbose': False,
    'allow_writing_files': False,
}

BEST_LGB_PARAMS = {
    'max_depth': 9,
    'n_estimators': 149,
    'learning_rate': 0.08025255147825751,
    'num_leaves': 36,
    'min_child_samples': 10,
    'subsample': 0.9638486108112962,
    'colsample_bytree': 0.8126162431458441,
    'reg_alpha': 4.6796458896222854e-06,
    'reg_lambda': 4.190436671160808e-07,
}

result = global_oof_evaluate(BEST_CAT_PARAMS, BEST_LGB_PARAMS)
print("Two-Model Results (concordance_index_ipcw):")
print(f"  Overall C-index: {result['overall']:.4f}")
print(f"  Test-like C-index: {result['test_like']:.4f}")
print(f"  High-risk C-index: {result['high_risk']:.4f}")
print(f"  Weighted C-index: {result['weighted']:.4f}")

Two-Model Results (concordance_index_ipcw):
  Overall C-index: 0.7184
  Test-like C-index: 0.6937
  High-risk C-index: 0.6608
  Weighted C-index: 0.6912


## Summary

### Two-Model Results

| Metric | Value |
|--------|-------|
| Overall C-index | 0.7184 |
| Test-like C-index | 0.6937 |
| High-risk C-index | 0.6608 |
| **Weighted C-index** | **0.6912** |
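
The weighted score in the table can be reproduced from the three component scores with the 0.3 / 0.4 / 0.3 weights used in `weighted_cindex_ipcw`. A short arithmetic check (the dictionary names are illustrative):

```python
# Component C-indices as reported in the table above
components = {"overall": 0.7184, "test_like": 0.6937, "high_risk": 0.6608}
# Weights used in weighted_cindex_ipcw
metric_weights = {"overall": 0.3, "test_like": 0.4, "high_risk": 0.3}

weighted = sum(metric_weights[k] * components[k] for k in components)
print(f"{weighted:.4f}")  # prints 0.6912
```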

### Design
- **CatBoost Classifier**: Predicts death probability with KM-based sample weights
- **LightGBM Regressor**: Predicts survival time, trained on events only (censored times are only lower bounds on the true survival time, so they are excluded)
- **Merge**: `risk = pred_event × (1 + odds(avg_pred_event) × (1 - pred_time_norm))`