# OSIC Pulmonary Fibrosis Progression – Plan

Objectives:
- Establish a strong, fast baseline with robust CV mirroring test conditions
- Iterate with feature engineering and calibrated models to hit medal thresholds
- Maintain rigorous logging and artifact checks; avoid leakage

Planned Workflow:
1) Environment & GPU check; set deterministic seeds
2) Data loading & EDA-lite: schema, nulls, basic stats, target dist
3) Metric: implement modified-laplace-log-likelihood (OOF scorer)
4) CV: patient-level grouped KFold (GroupKFold by Patient) to prevent leakage
5) Baseline model: CatBoostRegressor (GPU) and XGBoost (GPU) comparison on tabular features
6) Features v1: demographics (Age, Sex), baseline FVC, Percent, Weeks relative features, simple interactions; no image data initially
7) Calibrate sigma prediction per fold (learned via secondary model or rule-based variance by residual bins)
8) OOF evaluation, error analysis; iterate features (trend per patient, deltas, slope from prior weeks)
9) Ensemble diverse seeds/models; finalize best CV; generate test predictions
10) Save artifacts: oof.csv, feature_importances.csv, submission.csv

Validation Discipline:
- Single source of folds saved; reuse across experiments
- Fit preprocessing inside folds only; no patient leakage across folds
- Multiple seeds; trust consistent CV

Next:
- Run environment check and quick data preview
- Request expert review on plan and CV/feature set before heavy training

In [1]:
# Environment check, data load, quick peek
import os, sys, subprocess, time, json, random
import numpy as np
import pandas as pd

def run(cmd):
    return subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True).stdout

print('=== NVIDIA-SMI ===', flush=True)
print(run(['bash','-lc','nvidia-smi || true']))

# Seeds
SEED = 42
random.seed(SEED)
np.random.seed(SEED)

t0 = time.time()
train_path = 'train.csv'
test_path = 'test.csv'
assert os.path.exists(train_path) and os.path.exists(test_path), 'Missing train.csv or test.csv'
train = pd.read_csv(train_path)
test = pd.read_csv(test_path)
print(f'Loaded train {train.shape}, test {test.shape} in {time.time()-t0:.2f}s', flush=True)
print('Train columns:', list(train.columns))
print('Test columns:', list(test.columns))

# Target analysis (meta-benchmark may use target column); fallbacks for OSIC original fields
target_col = 'target' if 'target' in train.columns else ('FVC' if 'FVC' in train.columns else None)
if target_col is None:
    raise ValueError('No target or FVC column found in train.csv')

print(f'Target column: {target_col}')
print(train[target_col].describe())

# Quick domain columns if present
for col in ['Patient','Weeks','Age','Sex','SmokingStatus','Percent','FVC','target']:
    if col in train.columns:
        nunique = train[col].nunique()
        print(f'- {col}: dtype={train[col].dtype}, nunique={nunique}')

print('\nHead of train:')
print(train.head(3))
print('\nHead of test:')
print(test.head(3))

print('Done.')

=== NVIDIA-SMI ===


Thu Sep 25 01:12:06 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.06             Driver Version: 550.144.06     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A10-24Q                 On  |   00000002:00:00.0 Off |                    0 |
| N/A   N/A    P0             N/A /  N/A  |     182MiB /  24512MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

Train columns: ['Patient', 'Weeks', 'FVC', 'Percent', 'Age', 'Sex', 'SmokingStatus']
Test columns: ['Patient', 'Weeks', 'FVC', 'Percent', 'Age', 'Sex', 'SmokingStatus']
Target column: FVC
count    1394.000000
mean     2666.939742
std       801.859900
min       837.000000
25%      2105.750000
50%      2607.000000
75%      3117.250000
max      5045.000000
Name: FVC, dtype: float64
- Patient: dtype=object, nunique=158
- Weeks: dtype=int64, nunique=112
- Age: dtype=int64, nunique=33
- Sex: dtype=object, nunique=2
- SmokingStatus: dtype=object, nunique=3
- Percent: dtype=float64, nunique=1381
- FVC: dtype=int64, nunique=1099

Head of train:
                     Patient  Weeks   FVC    Percent  Age   Sex SmokingStatus
0  ID00133637202223847701934     -2  3195  92.856312   83  Male  Never smoked
1  ID00133637202223847701934      2  3203  93.088817   83  Male  Never smoked
2  ID00133637202223847701934      4  3097  90.008138   83  Male  Never smoked

Head of test:
                     Patient 

In [7]:
# Metric, folds, and leak-proof feature builder
import math
from pathlib import Path
from sklearn.model_selection import GroupKFold

def modified_laplace_log_likelihood(y_true, mu_pred, sigma_pred, sigma_floor=70.0, err_clip=1000.0):
    y_true = np.asarray(y_true, dtype=float)
    mu_pred = np.asarray(mu_pred, dtype=float)
    sigma_pred = np.asarray(sigma_pred, dtype=float)
    sigma = np.maximum(sigma_pred, sigma_floor)
    err = np.minimum(np.abs(y_true - mu_pred), err_clip)
    return -np.mean(err / sigma + np.log(2.0 * sigma))

# Unit test sanity:
yt = np.array([1000.0, 2000.0, 3000.0])
mu = yt.copy()
sg = np.full_like(yt, 70.0)
score_test = modified_laplace_log_likelihood(yt, mu, sg)
print('Metric unit check: expected ≈', -np.log(140.0), 'got', score_test)

# Persisted folds
folds_path = Path('folds_groupkfold.csv')
if not folds_path.exists():
    gkf = GroupKFold(n_splits=5)
    groups = train['Patient'].values
    fold = np.full(len(train), -1, dtype=int)
    for i, (tr, va) in enumerate(gkf.split(train, groups=groups)):
        fold[va] = i
    folds_df = pd.DataFrame({'index': np.arange(len(train)), 'fold': fold})
    folds_df.to_csv(folds_path, index=False)
    print('Saved folds to', folds_path.as_posix())
else:
    folds_df = pd.read_csv(folds_path)
    print('Loaded existing folds from', folds_path.as_posix())
assert (folds_df['fold']>=0).all() and len(folds_df)==len(train), 'Bad folds'
print(f"Fold sizes: {folds_df['fold'].value_counts().sort_index().to_dict()}")

def compute_baseline_table(df_part):
    # baseline row per patient = min Weeks
    idx = df_part.groupby('Patient')['Weeks'].idxmin()
    # Only keep fields that do not collide with original df columns,
    # except for baseline_* renamed ones
    base = df_part.loc[idx, ['Patient','Weeks','FVC','Percent']].copy()
    base = base.rename(columns={'Weeks':'baseline_week','FVC':'baseline_fvc','Percent':'baseline_percent'})
    return base.set_index('Patient')

def build_features(df_part):
    base = compute_baseline_table(df_part)
    df = df_part.copy()
    df = df.join(base, on='Patient', how='left')
    # time features
    df['week_diff'] = df['Weeks'] - df['baseline_week']
    # cap extreme horizons before poly
    wd_cap = df['week_diff'].clip(-40, 40)
    df['abs_week_diff'] = df['week_diff'].abs()
    df['week_diff2'] = (wd_cap**2).astype(float)
    df['week_diff3'] = (wd_cap**3).astype(float)
    df['log_abs_week_diff'] = np.log1p(df['abs_week_diff'])
    # interactions (use ONLY baseline_* to avoid leakage)
    df['bfvc_x_week'] = df['baseline_fvc'] * df['week_diff']
    df['bpercent_x_week'] = df['baseline_percent'] * df['week_diff']
    df['age_x_week'] = df['Age'] * df['week_diff']
    df['age_x_bpercent'] = df['Age'] * df['baseline_percent']
    df['bfvc_x_age'] = df['baseline_fvc'] * df['Age']
    # optional logs
    df['log_baseline_fvc'] = np.log(df['baseline_fvc'].clip(lower=1.0))
    df['log_baseline_percent'] = np.log(df['baseline_percent'].clip(lower=1e-3))
    # Ensure categorical dtypes (useful for CatBoost)
    for c in ['Sex','SmokingStatus']:
        if c in df.columns:
            df[c] = df[c].astype('category')
    return df

def get_feature_cols(df):
    # Drop identifiers, target, baseline_week, and current-row Percent to avoid leakage
    drop_cols = {'Patient','Weeks','FVC','baseline_week','Percent'}
    return [c for c in df.columns if c not in drop_cols]

# Quick feature preview on full train (just to inspect; during CV we will rebuild per fold)
feat_preview = build_features(train)
print('Feature preview columns:', [c for c in feat_preview.columns if c not in ['Patient']][:12], '... total', feat_preview.shape[1])
print(feat_preview.head(2)[['Patient','Weeks','baseline_week','baseline_fvc','baseline_percent','week_diff']])

print('Setup ready: scorer ok, folds saved, feature builder ready.')

Metric unit check: expected ≈ -4.941642422609304 got -4.941642422609304
Loaded existing folds from folds_groupkfold.csv
Fold sizes: {0: 282, 1: 281, 2: 275, 3: 275, 4: 281}
Feature preview columns: ['Weeks', 'FVC', 'Percent', 'Age', 'Sex', 'SmokingStatus', 'baseline_week', 'baseline_fvc', 'baseline_percent', 'week_diff', 'abs_week_diff', 'week_diff2'] ... total 19
                     Patient  Weeks  baseline_week  baseline_fvc  \
0  ID00133637202223847701934     -2             -2          3195   
1  ID00133637202223847701934      2             -2          3195   

   baseline_percent  week_diff  
0         92.856312          0  
1         92.856312          4  
Setup ready: scorer ok, folds saved, feature builder ready.


In [6]:
# Two-stage slope model with GroupKFold OOF and sigma calibration
import time
import itertools
import lightgbm as lgb
from sklearn.metrics import mean_absolute_error

def per_patient_slope(df_fold):
    # Fit simple linear slope FVC ~ Weeks per patient within provided df_fold
    slopes = []
    for pid, g in df_fold.groupby('Patient'):
        if g['Weeks'].nunique() < 2:
            # fallback slope 0 if only one point
            b = 0.0
        else:
            x = g['Weeks'].values.astype(float)
            y = g['FVC'].values.astype(float)
            # ordinary least squares slope
            x_mean = x.mean(); y_mean = y.mean()
            denom = ((x - x_mean)**2).sum()
            if denom == 0:
                b = 0.0
            else:
                b = ((x - x_mean)*(y - y_mean)).sum() / denom
        slopes.append((pid, b))
    return pd.DataFrame(slopes, columns=['Patient','slope'])

def prepare_patient_level_features(df_part):
    # Build baseline features per patient from df_part only
    base = compute_baseline_table(df_part).reset_index()
    # Merge demographics from the baseline row in df_part
    demo_cols = ['Patient','Age','Sex','SmokingStatus']
    idx = df_part.groupby('Patient')['Weeks'].idxmin()
    demo = df_part.loc[idx, demo_cols].copy()
    base = base.merge(demo, on='Patient', how='left')
    # interactions
    base['bfvc_x_pct'] = base['baseline_fvc'] * base['baseline_percent']
    base['age_x_pct'] = base['Age'] * base['baseline_percent']
    # label encode small cats
    sex_map = {v:i for i,v in enumerate(sorted(base['Sex'].astype(str).unique()))}
    smoke_map = {v:i for i,v in enumerate(sorted(base['SmokingStatus'].astype(str).unique()))}
    base['Sex_le'] = base['Sex'].astype(str).map(sex_map).astype(int)
    base['Smoking_le'] = base['SmokingStatus'].astype(str).map(smoke_map).astype(int)
    feat_cols = ['baseline_fvc','baseline_percent','Age','bfvc_x_pct','age_x_pct','Sex_le','Smoking_le']
    return base[['Patient'] + feat_cols].copy(), feat_cols

def apply_mu_from_slope(val_rows, b_pred, val_patient_ids):
    # Build features from val_rows; use existing baseline_fvc and week_diff
    dfv = build_features(val_rows)
    b_series = pd.Series(b_pred, index=val_patient_ids)
    dfv['b_pred'] = dfv['Patient'].map(b_series).astype(float)
    dfv['mu'] = dfv['baseline_fvc'] + dfv['week_diff'] * dfv['b_pred']
    return dfv

def tune_sigma_linear(oof_abs_err, oof_abs_week_diff, grid_s0=None, grid_s1=None, sigma_floor=70.0, sigma_max=1000.0):
    if grid_s0 is None: grid_s0 = np.arange(70, 201, 10)
    if grid_s1 is None: grid_s1 = np.arange(0.0, 5.1, 0.25)
    best = (-1e9, 70.0, 0.0)
    for s0, s1 in itertools.product(grid_s0, grid_s1):
        sigma = np.clip(s0 + s1 * oof_abs_week_diff, sigma_floor, sigma_max)
        score = -np.mean(oof_abs_err / sigma + np.log(2.0 * sigma))
        if score > best[0]:
            best = (score, s0, s1)
    return {'score': best[0], 's0': best[1], 's1': best[2]}

t_start = time.time()
oof_mu = np.zeros(len(train), dtype=float)
oof_y = train['FVC'].values.astype(float)
oof_abs_week = np.zeros(len(train), dtype=float)
fold_indices = folds_df['fold'].values

feat_cols_cache = None
lgb_params = dict(objective='regression', learning_rate=0.05, num_leaves=31, min_data_in_leaf=20, feature_fraction=0.8, bagging_fraction=0.8, bagging_freq=1, reg_alpha=0.1, reg_lambda=0.1, n_estimators=2000)

for f in sorted(folds_df['fold'].unique()):
    t_fold = time.time()
    tr_idx = folds_df.index[fold_indices != f].values
    va_idx = folds_df.index[fold_indices == f].values
    df_tr = train.iloc[tr_idx].copy()
    df_va = train.iloc[va_idx].copy()
    print(f'Fold {f}: train rows={len(df_tr)}, val rows={len(df_va)}')
    # patient-level slope targets from train fold
    slopes_df = per_patient_slope(df_tr)
    # patient-level features for train and val
    X_tr_pat, feat_cols = prepare_patient_level_features(df_tr)
    X_va_pat, _ = prepare_patient_level_features(df_va)
    feat_cols_cache = feat_cols
    X_tr = X_tr_pat[feat_cols].values
    y_tr = slopes_df.set_index('Patient').loc[X_tr_pat['Patient'].values, 'slope'].values
    X_va = X_va_pat[feat_cols].values
    # train slope model
    lgbm = lgb.LGBMRegressor(**lgb_params)
    lgbm.fit(X_tr, y_tr, eval_set=[(X_tr, y_tr)], eval_metric='l1')
    b_pred_val = lgbm.predict(X_va)
    # build per-row mu for val
    df_val_mu = apply_mu_from_slope(df_va, b_pred_val, X_va_pat['Patient'].values)
    oof_mu[va_idx] = df_val_mu['mu'].values
    oof_abs_week[va_idx] = df_val_mu['abs_week_diff'].values
    mae_fold = mean_absolute_error(oof_y[va_idx], oof_mu[va_idx])
    print(f'Fold {f} done in {time.time()-t_fold:.2f}s | MAE={mae_fold:.3f}', flush=True)

# Tune sigma on OOF residuals
oof_abs_err = np.abs(oof_y - oof_mu)
t_sigma = time.time()
sigma_tune = tune_sigma_linear(oof_abs_err, oof_abs_week, grid_s0=np.arange(70, 201, 5), grid_s1=np.arange(0.0, 5.1, 0.1))
print('Sigma tuning:', sigma_tune, f'in {time.time()-t_sigma:.2f}s')
sigma_oof = np.clip(sigma_tune['s0'] + sigma_tune['s1'] * oof_abs_week, 70.0, 1000.0)
oof_score = modified_laplace_log_likelihood(oof_y, oof_mu, sigma_oof)
mae_oof = mean_absolute_error(oof_y, oof_mu)
print(f'OOF: score={oof_score:.5f}, MAE={mae_oof:.3f}, elapsed={time.time()-t_start:.2f}s')

# Save OOF
oof_df = train[['Patient','Weeks','FVC']].copy()
oof_df['mu'] = oof_mu
oof_df['sigma'] = sigma_oof
oof_df['fold'] = fold_indices
oof_df.to_csv('oof_slope_model.csv', index=False)
print('Saved oof_slope_model.csv')

# Train on full data for test inference
X_full_pat, feat_cols_final = prepare_patient_level_features(train)
slopes_full = per_patient_slope(train)
X_full = X_full_pat[feat_cols_final].values
y_full = slopes_full.set_index('Patient').loc[X_full_pat['Patient'].values, 'slope'].values
lgbm_full = lgb.LGBMRegressor(**lgb_params)
lgbm_full.fit(X_full, y_full, eval_set=[(X_full, y_full)], eval_metric='l1')

# Build test predictions
test_feats_pat, _ = prepare_patient_level_features(test)
b_pred_test = lgbm_full.predict(test_feats_pat[feat_cols_final].values)
df_test_mu = apply_mu_from_slope(test, b_pred_test, test_feats_pat['Patient'].values)
mu_test = df_test_mu['mu'].values
abs_week_test = df_test_mu['abs_week_diff'].values
sigma_test = np.clip(sigma_tune['s0'] + sigma_tune['s1'] * abs_week_test, 70.0, 1000.0)

# Prepare submission
sub = pd.DataFrame({'Patient': df_test_mu['Patient'], 'Weeks': df_test_mu['Weeks'], 'FVC': mu_test, 'Confidence': sigma_test})
sub['Patient_Week'] = sub['Patient'].astype(str) + '_' + sub['Weeks'].astype(str)
submission = sub[['Patient_Week','FVC','Confidence']].copy()
submission.to_csv('submission.csv', index=False)
print('Saved submission.csv with shape', submission.shape)
print(submission.head())

Fold 0: train rows=1112, val rows=282
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000041 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 199
[LightGBM] [Info] Number of data points in the train set: 126, number of used features: 7
[LightGBM] [Info] Start training from score -4.832560


Fold 0 done in 0.27s | MAE=185.094


Fold 1: train rows=1113, val rows=281
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000037 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 200
[LightGBM] [Info] Number of data points in the train set: 126, number of used features: 7
[LightGBM] [Info] Start training from score -4.561591




Fold 1 done in 0.30s | MAE=135.819


Fold 2: train rows=1119, val rows=275
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000052 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 204
[LightGBM] [Info] Number of data points in the train set: 127, number of used features: 7
[LightGBM] [Info] Start training from score -4.028586








Fold 2 done in 0.29s | MAE=153.087


Fold 3: train rows=1119, val rows=275


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000036 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 207
[LightGBM] [Info] Number of data points in the train set: 127, number of used features: 7
[LightGBM] [Info] Start training from score -4.037823








Fold 3 done in 0.28s | MAE=176.373


Fold 4: train rows=1113, val rows=281
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000037 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 200
[LightGBM] [Info] Number of data points in the train set: 126, number of used features: 7
[LightGBM] [Info] Start training from score -4.326875












Fold 4 done in 0.28s | MAE=174.272


Sigma tuning: {'score': -6.690801746567622, 's0': 80, 's1': 5.0} in 0.03s
OOF: score=-6.68476, MAE=164.945, elapsed=1.45s
Saved oof_slope_model.csv


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000054 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 246
[LightGBM] [Info] Number of data points in the train set: 158, number of used features: 7
[LightGBM] [Info] Start training from score -4.356461
















Saved submission.csv with shape (18, 3)
                   Patient_Week     FVC  Confidence
0   ID00014637202177757139317_0  3807.0        80.0
1  ID00019637202178323708467_13  2100.0        80.0
2   ID00047637202184938901501_2  3313.0        80.0
3  ID00082637202201836229724_19  2918.0        80.0
4  ID00126637202218610655908_18  2375.0        80.0


In [15]:
# Row-level quantile LightGBM with GroupKFold OOF on delta target, sigma from spread
import time, itertools
import lightgbm as lgb
from sklearn.metrics import mean_absolute_error

def encode_cats(train_df, val_df=None, test_df=None, cols=('Sex','SmokingStatus')):
    maps = {}
    enc_train = train_df.copy()
    enc_val = val_df.copy() if val_df is not None else None
    enc_test = test_df.copy() if test_df is not None else None
    for c in cols:
        if c in enc_train.columns:
            uniq = sorted(enc_train[c].astype(str).unique())
            mapping = {v:i for i,v in enumerate(uniq)}
            maps[c] = mapping
            enc_train[c] = enc_train[c].astype(str).map(mapping).fillna(-1).astype(int)
            if enc_val is not None:
                enc_val[c] = enc_val[c].astype(str).map(mapping).fillna(-1).astype(int)
            if enc_test is not None:
                enc_test[c] = enc_test[c].astype(str).map(mapping).fillna(-1).astype(int)
    return enc_train, enc_val, enc_test, maps

alphas = [0.2, 0.5, 0.8]
gbm_params = dict(objective='quantile', learning_rate=0.05, num_leaves=31, min_data_in_leaf=20, feature_fraction=0.8, bagging_fraction=0.8, bagging_freq=1, reg_alpha=0.1, reg_lambda=0.1, n_estimators=3000, verbosity=-1)

t0 = time.time()
y_true_full = train['FVC'].values.astype(float)
folds = folds_df['fold'].values
q_oof = {a: np.zeros(len(train), dtype=float) for a in alphas}

for f in sorted(np.unique(folds)):
    t_fold = time.time()
    tr_idx = folds_df.index[folds != f].values
    va_idx = folds_df.index[folds == f].values
    df_tr = build_features(train.iloc[tr_idx].copy())
    df_va = build_features(train.iloc[va_idx].copy())
    df_tr, df_va, _, _ = encode_cats(df_tr, df_va, None)
    feat_cols = get_feature_cols(df_tr)
    X_tr = df_tr[feat_cols].values
    X_va = df_va[feat_cols].values
    y_tr_delta = (df_tr['FVC'].values.astype(float) - df_tr['baseline_fvc'].values.astype(float))
    print(f'Quantile LGBM Fold {f}: X_tr={X_tr.shape}, X_va={X_va.shape}', flush=True)
    for a in alphas:
        mdl = lgb.LGBMRegressor(**gbm_params, alpha=a)
        mdl.fit(X_tr, y_tr_delta, eval_set=[(X_va, (df_va['FVC'].values.astype(float) - df_va['baseline_fvc'].values.astype(float)))], callbacks=[lgb.log_evaluation(0), lgb.early_stopping(100, verbose=False)])
        pred_delta = mdl.predict(X_va, num_iteration=mdl.best_iteration_)
        q_oof[a][va_idx] = df_va['baseline_fvc'].values.astype(float) + pred_delta
    mae_f = mean_absolute_error(y_true_full[va_idx], q_oof[0.5][va_idx])
    print(f'Fold {f} done in {time.time()-t_fold:.2f}s | MAE(q50 FVC)={mae_f:.3f}', flush=True)

# Compute OOF mu and sigma from quantile spread on FVC predictions
mu_oof = q_oof[0.5]
spread = (q_oof[0.8] - q_oof[0.2]).astype(float)
spread = np.abs(spread)
spread[~np.isfinite(spread)] = 0.0
base_sigma = spread / 1.6
base_sigma = np.maximum(base_sigma, 1e-6)

# Tune scale C
best = (-1e9, 1.0)
for C in np.arange(0.5, 2.01, 0.05):
    s = np.clip(C * base_sigma, 70.0, 600.0)
    sc = modified_laplace_log_likelihood(y_true_full, mu_oof, s)
    if sc > best[0]:
        best = (sc, float(C))
C_best = best[1]
sigma_oof = np.clip(C_best * base_sigma, 70.0, 600.0)
oof_score = modified_laplace_log_likelihood(y_true_full, mu_oof, sigma_oof)
mae_oof = mean_absolute_error(y_true_full, mu_oof)
print(f'Quantile OOF (delta->FVC): score={oof_score:.5f}, MAE={mae_oof:.3f}, C_best={C_best}, elapsed={time.time()-t0:.2f}s', flush=True)

# Save OOF
oof_q = train[['Patient','Weeks','FVC']].copy()
oof_q['mu'] = mu_oof
oof_q['sigma'] = sigma_oof
oof_q['fold'] = folds
oof_q.to_csv('oof_quantile_lgbm.csv', index=False)
print('Saved oof_quantile_lgbm.csv')

# Train full models and predict test (delta target), then convert back to FVC
df_full = build_features(train.copy())
df_test = build_features(test.copy())
df_full_enc, df_test_enc, _, _ = encode_cats(df_full, df_test, None)
feat_cols_full = get_feature_cols(df_full_enc)
X_full = df_full_enc[feat_cols_full].values
y_full_delta = (df_full_enc['FVC'].values.astype(float) - df_full_enc['baseline_fvc'].values.astype(float))
X_test = df_test_enc[feat_cols_full].values

q_test = {}
for a in alphas:
    mdl = lgb.LGBMRegressor(**gbm_params, alpha=a)
    mdl.fit(X_full, y_full_delta, eval_set=[(X_full, y_full_delta)], callbacks=[lgb.log_evaluation(0), lgb.early_stopping(50, verbose=False)])
    q_test[a] = mdl.predict(X_test, num_iteration=mdl.best_iteration_)

mu_test = (df_test_enc['baseline_fvc'].values.astype(float) + q_test[0.5].astype(float))
spread_test = np.abs(q_test[0.8] - q_test[0.2]).astype(float)
spread_test[~np.isfinite(spread_test)] = 0.0
sigma_test = np.clip(C_best * (spread_test / 1.6), 70.0, 600.0).astype(float)

# Submission (single-row test, not the expanded grid)
sub = pd.DataFrame({'Patient': test['Patient'].astype(str), 'Weeks': test['Weeks'].astype(int), 'FVC': mu_test.astype(float), 'Confidence': sigma_test.astype(float)})
sub['Patient_Week'] = sub['Patient'] + '_' + sub['Weeks'].astype(str)
submission = sub[['Patient_Week','FVC','Confidence']].copy()
submission.to_csv('submission.csv', index=False)
print('Saved submission.csv', submission.shape, '\n', submission.head())

Quantile LGBM Fold 0: X_tr=(1112, 15), X_va=(282, 15)


Fold 0 done in 4.70s | MAE(q50 FVC)=70.275


Quantile LGBM Fold 1: X_tr=(1113, 15), X_va=(281, 15)


Fold 1 done in 2.46s | MAE(q50 FVC)=35.297


Quantile LGBM Fold 2: X_tr=(1119, 15), X_va=(275, 15)


Fold 2 done in 4.10s | MAE(q50 FVC)=51.240


Quantile LGBM Fold 3: X_tr=(1119, 15), X_va=(275, 15)


Fold 3 done in 4.07s | MAE(q50 FVC)=48.615


Quantile LGBM Fold 4: X_tr=(1113, 15), X_va=(281, 15)


Fold 4 done in 4.69s | MAE(q50 FVC)=41.073


Quantile OOF (delta->FVC): score=-5.61412, MAE=49.310, C_best=1.3000000000000007, elapsed=20.02s


Saved oof_quantile_lgbm.csv








Saved submission.csv (18, 3) 
                    Patient_Week     FVC  Confidence
0   ID00014637202177757139317_0  3807.0        70.0
1  ID00019637202178323708467_13  2100.0        70.0
2   ID00047637202184938901501_2  3313.0        70.0
3  ID00082637202201836229724_19  2918.0        70.0
4  ID00126637202218610655908_18  2375.0        70.0


In [11]:
# CatBoost row-level mu model (CPU for speed), OOF eval, and blend with quantile LGBM
import time
from sklearn.metrics import mean_absolute_error

try:
    from catboost import CatBoostRegressor, Pool
except Exception as e:
    raise RuntimeError('CatBoost not installed. Please pip install catboost.')

def build_row_level_splits(folds_arr):
    all_folds = sorted(np.unique(folds_arr))
    split_indices = []
    for f in all_folds:
        tr_idx = folds_df.index[folds_arr != f].values
        va_idx = folds_df.index[folds_arr == f].values
        split_indices.append((tr_idx, va_idx))
    return split_indices

def catboost_oof_mu(train_df, folds_arr, cat_cols=('Sex','SmokingStatus'), params=None):
    if params is None:
        params = dict(
            loss_function='MAE',
            learning_rate=0.05,
            depth=6,
            l2_leaf_reg=8.0,
            iterations=2000,
            od_type='Iter',
            od_wait=200,
            random_seed=42,
            task_type='CPU',
            verbose=False,
            allow_writing_files=False
        )
    oof_pred = np.zeros(len(train_df), dtype=float)
    splits = build_row_level_splits(folds_arr)
    for i, (tr_idx, va_idx) in enumerate(splits):
        print(f'CatBoost fold {i}: train={len(tr_idx)}, val={len(va_idx)}', flush=True)
        df_tr = build_features(train_df.iloc[tr_idx].copy())
        df_va = build_features(train_df.iloc[va_idx].copy())
        # Encode cats using train mapping only
        df_tr_enc, df_va_enc, _, _ = encode_cats(df_tr, df_va, None, cols=cat_cols)
        feat_cols = get_feature_cols(df_tr_enc)
        X_tr = df_tr_enc[feat_cols]
        y_tr = df_tr_enc['FVC'].astype(float)
        X_va = df_va_enc[feat_cols]
        y_va = df_va_enc['FVC'].astype(float)
        cat_features_idx = [X_tr.columns.get_loc(c) for c in cat_cols if c in X_tr.columns]
        pool_tr = Pool(X_tr, y_tr, cat_features=cat_features_idx)
        pool_va = Pool(X_va, y_va, cat_features=cat_features_idx)
        model = CatBoostRegressor(**params)
        t0 = time.time()
        model.fit(pool_tr, eval_set=pool_va, verbose=False)
        oof_pred[va_idx] = model.predict(pool_va)
        print(f'  val MAE={mean_absolute_error(y_va, oof_pred[va_idx]):.3f} in {time.time()-t0:.2f}s', flush=True)
    return oof_pred

# Ensure quantile OOF artifacts exist in kernel: mu_oof, base_sigma, C_best from previous cell (4)
assert 'mu_oof' in globals() and 'base_sigma' in globals() and 'C_best' in globals(), 'Run cell 4 first to compute quantile OOF and C_best.'

folds_arr = folds_df['fold'].values
print('Training CatBoost OOF mu (CPU)...', flush=True)
mu_cb_oof = catboost_oof_mu(train, folds_arr)

# Blend mu (simple average) and evaluate with sigma from quantile spread
mu_blend_oof = 0.5 * mu_oof + 0.5 * mu_cb_oof
sigma_oof_blend = np.clip(C_best * base_sigma, 70.0, 600.0)
oof_score_cb = modified_laplace_log_likelihood(train['FVC'].values.astype(float), mu_cb_oof, sigma_oof_blend)
oof_score_blend = modified_laplace_log_likelihood(train['FVC'].values.astype(float), mu_blend_oof, sigma_oof_blend)
print(f'CatBoost OOF score (with quantile sigma): {oof_score_cb:.5f}')
print(f'Blend OOF score (0.5 LGBM q50 + 0.5 CatBoost, quantile sigma): {oof_score_blend:.5f}')

# Train full CatBoost for test mu and blend with quantile mu_test from cell 4
df_full = build_features(train.copy())
df_test = build_features(test.copy())
df_full_enc, df_test_enc, _, _ = encode_cats(df_full, df_test, None)
feat_cols_full = get_feature_cols(df_full_enc)
cat_cols = ['Sex','SmokingStatus']
cat_idx_full = [df_full_enc[feat_cols_full].columns.get_loc(c) for c in cat_cols if c in feat_cols_full]
pool_full = Pool(df_full_enc[feat_cols_full], df_full_enc['FVC'].astype(float), cat_features=cat_idx_full)
pool_test = Pool(df_test_enc[feat_cols_full], cat_features=cat_idx_full)
cb_params_full = dict(
    loss_function='MAE',
    learning_rate=0.05,
    depth=6,
    l2_leaf_reg=8.0,
    iterations=2000,
    od_type='Iter',
    od_wait=200,
    random_seed=42,
    task_type='CPU',
    verbose=False,
    allow_writing_files=False
)
cb_full = CatBoostRegressor(**cb_params_full)
cb_full.fit(pool_full, verbose=False)
mu_test_cb = cb_full.predict(pool_test).astype(float)

assert 'mu_test' in globals() and 'sigma_test' in globals(), 'Run cell 4 first to compute mu_test and sigma_test from quantiles.'
mu_test_blend = 0.5 * mu_test + 0.5 * mu_test_cb
submission = pd.DataFrame({'Patient': test['Patient'].astype(str)})
submission['Weeks'] = test['Weeks'].astype(int)
submission['Patient_Week'] = submission['Patient'] + '_' + submission['Weeks'].astype(str)
submission['FVC'] = mu_test_blend.astype(float)
submission['Confidence'] = sigma_test.astype(float)
submission[['Patient_Week','FVC','Confidence']].to_csv('submission.csv', index=False)
print('Saved blended submission.csv', submission.shape, '\n', submission.head())

Training CatBoost OOF mu (CPU)...


CatBoost fold 0: train=1112, val=282


  val MAE=89.480 in 3.40s


CatBoost fold 1: train=1113, val=281


  val MAE=105.645 in 2.95s


CatBoost fold 2: train=1119, val=275


  val MAE=96.306 in 3.03s


CatBoost fold 3: train=1119, val=275


  val MAE=64.822 in 2.91s


CatBoost fold 4: train=1113, val=281


  val MAE=69.156 in 2.08s


CatBoost OOF score (with quantile sigma): -6.11427
Blend OOF score (0.5 LGBM q50 + 0.5 CatBoost, quantile sigma): -6.03406


Saved blended submission.csv (18, 5) 
                      Patient  Weeks                  Patient_Week  \
0  ID00014637202177757139317      0   ID00014637202177757139317_0   
1  ID00019637202178323708467     13  ID00019637202178323708467_13   
2  ID00047637202184938901501      2   ID00047637202184938901501_2   
3  ID00082637202201836229724     19  ID00082637202201836229724_19   
4  ID00126637202218610655908     18  ID00126637202218610655908_18   

           FVC  Confidence  
0  3828.887473   70.000000  
1  2088.325381  113.004331  
2  3414.818039   70.000000  
3  2975.346900   70.000000  
4  2303.613272   70.000000  


In [16]:
# Expand test to Patient x Weeks grid and generate final blended submission using trained pipelines
import numpy as np, pandas as pd, time
from catboost import CatBoostRegressor, Pool

def expand_test_grid(test_df, week_start=-12, week_end=133):
    # One baseline row per patient provided in test_df
    base = test_df[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].copy()
    patients = base['Patient'].unique().tolist()
    weeks = np.arange(week_start, week_end+1, dtype=int)
    grid = pd.MultiIndex.from_product([patients, weeks], names=['Patient','Weeks']).to_frame(index=False)
    # Attach baseline info per patient (replicate across weeks);
    # since grid has only Patient and Weeks, non-overlapping columns will merge without suffixes.
    base_min = base.groupby('Patient', as_index=False).first()
    grid = grid.merge(base_min, on='Patient', how='left', suffixes=('', '_base'))
    # Ensure required columns for feature builder exist;
    # FVC/Percent/Age/Sex/SmokingStatus now come from the provided test baseline row for each patient.
    # Keep only required columns for feature builder
    cols = ['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']
    grid = grid[cols].copy()
    return grid

t0 = time.time()
print('Building expanded test grid and predicting...', flush=True)
# Build full-train features for encoders and full-model fits
df_full = build_features(train.copy())
df_full_enc, _, _, _ = encode_cats(df_full, None, None)
feat_cols_full = [c for c in get_feature_cols(df_full_enc)]
X_full = df_full_enc[feat_cols_full].values
y_full = df_full_enc['FVC'].values.astype(float)

# Expand test grid
test_grid = expand_test_grid(test, week_start=-12, week_end=133)
df_test_grid = build_features(test_grid.copy())
_, df_test_grid_enc, _, _ = encode_cats(df_full, df_test_grid, None)  # map cats from train
X_test_grid = df_test_grid_enc[feat_cols_full].values

# Quantile LGBM full-models for q20/q50/q80
import lightgbm as lgb
alphas = [0.2, 0.5, 0.8]
gbm_params_full = dict(objective='quantile', learning_rate=0.05, num_leaves=31, min_data_in_leaf=20, feature_fraction=0.8, bagging_fraction=0.8, bagging_freq=1, reg_alpha=0.1, reg_lambda=0.1, n_estimators=3000, verbosity=-1)
q_test_grid = {}
for a in alphas:
    mdl = lgb.LGBMRegressor(**gbm_params_full, alpha=a)
    mdl.fit(X_full, y_full, eval_set=[(X_full, y_full)], callbacks=[lgb.log_evaluation(0), lgb.early_stopping(50, verbose=False)])
    q_test_grid[a] = mdl.predict(X_test_grid, num_iteration=mdl.best_iteration_).astype(float)

# Derive mu and sigma (use C_best from OOF tuning in cell 4)
assert 'C_best' in globals(), 'C_best not found; run cell 4 first.'
mu_test_lgb = q_test_grid[0.5]
spread_test_grid = np.abs(q_test_grid[0.8] - q_test_grid[0.2])
sigma_test_grid = np.clip(C_best * (spread_test_grid / 1.6), 70.0, 600.0).astype(float)

# Train full CatBoost on full train and predict mu on grid
cat_cols = ['Sex','SmokingStatus']
cat_idx_full = [df_full_enc[feat_cols_full].columns.get_loc(c) for c in cat_cols if c in feat_cols_full]
pool_full = Pool(df_full_enc[feat_cols_full], y_full, cat_features=cat_idx_full)
pool_test_grid = Pool(df_test_grid_enc[feat_cols_full], cat_features=cat_idx_full)
cb_params_full = dict(
    loss_function='MAE',
    learning_rate=0.05,
    depth=6,
    l2_leaf_reg=8.0,
    iterations=2000,
    od_type='Iter',
    od_wait=200,
    random_seed=42,
    task_type='CPU',
    verbose=False,
    allow_writing_files=False
)
cb_full = CatBoostRegressor(**cb_params_full)
cb_full.fit(pool_full, verbose=False)
mu_test_cb_grid = cb_full.predict(pool_test_grid).astype(float)

# Blend mu and assemble submission
mu_test_blend_grid = 0.5 * mu_test_lgb + 0.5 * mu_test_cb_grid
sub_grid = pd.DataFrame({
    'Patient': df_test_grid_enc['Patient'].astype(str).values,
    'Weeks': df_test_grid_enc['Weeks'].astype(int).values,
    'FVC': mu_test_blend_grid.astype(float),
    'Confidence': sigma_test_grid.astype(float)
})
sub_grid['Patient_Week'] = sub_grid['Patient'] + '_' + sub_grid['Weeks'].astype(str)
submission_grid = sub_grid[['Patient_Week','FVC','Confidence']].copy()
submission_grid.sort_values(['Patient_Week'], inplace=True)
submission_grid.to_csv('submission.csv', index=False)
print('Saved expanded-grid submission.csv', submission_grid.shape, 'in', f'{time.time()-t0:.2f}s')
print(submission_grid.head())

Building expanded test grid and predicting...








Saved expanded-grid submission.csv (2628, 3) in 9.86s
                     Patient_Week          FVC  Confidence
11   ID00014637202177757139317_-1  3836.853124   70.521495
2   ID00014637202177757139317_-10  3824.965811   70.574987
1   ID00014637202177757139317_-11  3822.850768   70.000000
0   ID00014637202177757139317_-12  3828.887473   70.000000
10   ID00014637202177757139317_-2  3839.897444   70.000000


In [17]:
# Align submission to sample_submission keys
import pandas as pd
ss = pd.read_csv('sample_submission.csv')
pred_full = pd.read_csv('submission.csv')  # from expanded grid step
out = ss[['Patient_Week']].merge(pred_full, on='Patient_Week', how='left')
missing = out['FVC'].isna().sum() + out['Confidence'].isna().sum()
if missing > 0:
    print('Warning: missing predictions for some sample_submission keys:', int(missing))
out.to_csv('submission.csv', index=False)
print('Final submission aligned to sample keys:', out.shape)
print(out.head())

Final submission aligned to sample keys: (1908, 3)
                   Patient_Week          FVC  Confidence
0  ID00126637202218610655908_-3  2304.471206        70.0
1  ID00126637202218610655908_-2  2299.173766        70.0
2  ID00126637202218610655908_-1  2298.809196        70.0
3   ID00126637202218610655908_0  2299.394064        70.0
4   ID00126637202218610655908_1  2304.680378        70.0


In [18]:
# Diagnose and fix submission.csv format to match sample_submission exactly
import pandas as pd, numpy as np, time
import lightgbm as lgb
from catboost import CatBoostRegressor, Pool

def make_expanded_preds():
    # Rebuild expanded grid predictions (mirrors cell 6) to have a superset to align with sample keys
    df_full = build_features(train.copy())
    df_full_enc, _, _, _ = encode_cats(df_full, None, None)
    feat_cols_full = [c for c in get_feature_cols(df_full_enc)]
    X_full = df_full_enc[feat_cols_full].values
    y_full = df_full_enc['FVC'].values.astype(float)
    test_grid = expand_test_grid(test, week_start=-12, week_end=133)
    df_test_grid = build_features(test_grid.copy())
    _, df_test_grid_enc, _, _ = encode_cats(df_full, df_test_grid, None)
    X_test_grid = df_test_grid_enc[feat_cols_full].values
    alphas = [0.2, 0.5, 0.8]
    gbm_params_full = dict(objective='quantile', learning_rate=0.05, num_leaves=31, min_data_in_leaf=20, feature_fraction=0.8, bagging_fraction=0.8, bagging_freq=1, reg_alpha=0.1, reg_lambda=0.1, n_estimators=2000, verbosity=-1)
    q_test_grid = {}
    for a in alphas:
        mdl = lgb.LGBMRegressor(**gbm_params_full, alpha=a)
        mdl.fit(X_full, y_full, callbacks=[lgb.log_evaluation(0)])
        q_test_grid[a] = mdl.predict(X_test_grid).astype(float)
    assert 'C_best' in globals(), 'C_best missing; run cell 4 first.'
    mu_test_lgb = q_test_grid[0.5]
    spread = np.abs(q_test_grid[0.8] - q_test_grid[0.2])
    sigma = np.clip(C_best * (spread / 1.6), 70.0, 600.0).astype(float)
    # CatBoost full
    cat_cols = ['Sex','SmokingStatus']
    cat_idx_full = [df_full_enc[feat_cols_full].columns.get_loc(c) for c in cat_cols if c in feat_cols_full]
    pool_full = Pool(df_full_enc[feat_cols_full], y_full, cat_features=cat_idx_full)
    pool_test_grid = Pool(df_test_grid_enc[feat_cols_full], cat_features=cat_idx_full)
    cb_params_full = dict(loss_function='MAE', learning_rate=0.05, depth=6, l2_leaf_reg=8.0, iterations=1000, od_type='Iter', od_wait=100, random_seed=42, task_type='CPU', verbose=False, allow_writing_files=False)
    cb_full = CatBoostRegressor(**cb_params_full)
    cb_full.fit(pool_full, verbose=False)
    mu_cb = cb_full.predict(pool_test_grid).astype(float)
    mu_blend = 0.5 * mu_test_lgb + 0.5 * mu_cb
    df_pred = pd.DataFrame({
        'Patient': df_test_grid_enc['Patient'].astype(str).values,
        'Weeks': df_test_grid_enc['Weeks'].astype(int).values,
        'FVC': mu_blend.astype(float),
        'Confidence': sigma.astype(float)
    })
    df_pred['Patient_Week'] = df_pred['Patient'] + '_' + df_pred['Weeks'].astype(str)
    return df_pred[['Patient_Week','FVC','Confidence']]

t0 = time.time()
ss = pd.read_csv('sample_submission.csv')
print('sample_submission rows:', len(ss))
sub_cur = pd.read_csv('submission.csv')
print('current submission rows:', len(sub_cur), 'cols:', list(sub_cur.columns))

# If current submission doesn't match sample keys/size, rebuild and align
need_fix = (list(sub_cur.columns) != ['Patient_Week','FVC','Confidence']) or (len(sub_cur) != len(ss)) or (sub_cur['Patient_Week'].nunique() != len(ss))
if need_fix:
    print('Rebuilding expanded predictions and aligning to sample keys...')
    pred_full = make_expanded_preds()
    out = ss[['Patient_Week']].merge(pred_full, on='Patient_Week', how='left')
else:
    # Ensure alignment anyway
    out = ss[['Patient_Week']].merge(sub_cur, on='Patient_Week', how='left')

# Fill any missing with safe defaults
miss_fvc = out['FVC'].isna().sum()
miss_sig = out['Confidence'].isna().sum()
if miss_fvc or miss_sig:
    print('Filling missing values: FVC', int(miss_fvc), 'Confidence', int(miss_sig))
    out['FVC'] = out['FVC'].astype(float).fillna(2000.0)
    out['Confidence'] = out['Confidence'].astype(float).fillna(70.0)

# Dtype enforcement
out['FVC'] = out['FVC'].astype(float)
out['Confidence'] = out['Confidence'].astype(float)

# Sanity checks
assert len(out) == len(ss), f'Row count mismatch: {len(out)} vs {len(ss)}'
assert out['Patient_Week'].isna().sum() == 0, 'Patient_Week has NaNs'
assert out['FVC'].isna().sum() == 0 and out['Confidence'].isna().sum() == 0, 'NaNs in predictions'
out.to_csv('submission.csv', index=False)
print('Wrote submission.csv with', out.shape, 'in', f'{time.time()-t0:.2f}s')
print(out.head())

sample_submission rows: 1908
current submission rows: 1908 cols: ['Patient_Week', 'FVC', 'Confidence']
Wrote submission.csv with (1908, 3) in 0.01s
                   Patient_Week          FVC  Confidence
0  ID00126637202218610655908_-3  2304.471206        70.0
1  ID00126637202218610655908_-2  2299.173766        70.0
2  ID00126637202218610655908_-1  2298.809196        70.0
3   ID00126637202218610655908_0  2299.394064        70.0
4   ID00126637202218610655908_1  2304.680378        70.0


In [20]:
# Multi-seed quantile LGBM (delta target) and direct prediction for sample_submission keys
import time
import lightgbm as lgb

seeds = [42, 123, 2023, 314, 999]
alphas = [0.2, 0.5, 0.8]
gbm_params = dict(
    objective='quantile',
    learning_rate=0.05,
    num_leaves=31,
    min_data_in_leaf=20,
    feature_fraction=0.8,
    bagging_fraction=0.8,
    bagging_freq=1,
    reg_alpha=0.1,
    reg_lambda=0.1,
    n_estimators=3000,
    verbosity=-1
)

t0 = time.time()
print('Building full-train features and encoder...', flush=True)
df_full = build_features(train.copy())
df_full_enc, _, _, _ = encode_cats(df_full, None, None)
feat_cols = get_feature_cols(df_full_enc)
X_full = df_full_enc[feat_cols].values
y_full_delta = (df_full_enc['FVC'].values.astype(float) - df_full_enc['baseline_fvc'].values.astype(float))

# Prepare sample grid rows as inference target (exact keys and order)
ss = pd.read_csv('sample_submission.csv')
ss_split = ss['Patient_Week'].str.rsplit('_', n=1, expand=True)
ss_df = pd.DataFrame({'Patient': ss_split[0].astype(str), 'Weeks': ss_split[1].astype(int)})
# Merge baseline test row info per patient
test_base = test[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].copy()
test_base = test_base.groupby('Patient', as_index=False).first()
inf_df = ss_df.merge(test_base, on='Patient', how='left', suffixes=('', '_base_row'))
inf_df.rename(columns={'Weeks_x':'Weeks'}, inplace=True)
inf_df = inf_df[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].copy()
# Build features and encode cats using train mapping
df_inf = build_features(inf_df.copy())
_, df_inf_enc, _, _ = encode_cats(df_full, df_inf, None)
X_inf = df_inf_enc[feat_cols].values

# OOF via GroupKFold for each seed to compute robust C_best on averaged predictions
folds = folds_df['fold'].values
y_true_full = train['FVC'].values.astype(float)
mu_oof_seeds = []
base_sigma_seeds = []
q_inf_pred_seeds = []  # store dict of alpha->pred for inference grid

for si, seed in enumerate(seeds):
    print(f'== Seed {seed} ==', flush=True)
    q_oof = {a: np.zeros(len(train), dtype=float) for a in alphas}
    for f in sorted(np.unique(folds)):
        tr_idx = folds_df.index[folds != f].values
        va_idx = folds_df.index[folds == f].values
        df_tr = build_features(train.iloc[tr_idx].copy())
        df_va = build_features(train.iloc[va_idx].copy())
        df_tr, df_va, _, _ = encode_cats(df_tr, df_va, None)
        fcols = get_feature_cols(df_tr)
        X_tr = df_tr[fcols].values
        X_va = df_va[fcols].values
        y_tr_delta = (df_tr['FVC'].values.astype(float) - df_tr['baseline_fvc'].values.astype(float))
        for a in alphas:
            mdl = lgb.LGBMRegressor(**gbm_params, alpha=a, random_state=seed)
            mdl.fit(X_tr, y_tr_delta,
                    eval_set=[(X_va, (df_va['FVC'].values.astype(float) - df_va['baseline_fvc'].values.astype(float)))],
                    callbacks=[lgb.log_evaluation(0), lgb.early_stopping(100, verbose=False)])
            pred_delta = mdl.predict(X_va, num_iteration=mdl.best_iteration_)
            q_oof[a][va_idx] = df_va['baseline_fvc'].values.astype(float) + pred_delta
    mu_oof_seed = q_oof[0.5]
    spread_oof = np.abs(q_oof[0.8] - q_oof[0.2]).astype(float)
    base_sigma_seed = np.maximum(spread_oof / 1.6, 1e-6)
    mu_oof_seeds.append(mu_oof_seed)
    base_sigma_seeds.append(base_sigma_seed)

    # Train on full and predict on inference grid for this seed
    q_inf = {}
    for a in alphas:
        mdl = lgb.LGBMRegressor(**gbm_params, alpha=a, random_state=seed)
        mdl.fit(X_full, y_full_delta, eval_set=[(X_full, y_full_delta)], callbacks=[lgb.log_evaluation(0), lgb.early_stopping(50, verbose=False)])
        q_inf[a] = mdl.predict(X_inf, num_iteration=mdl.best_iteration_)
    q_inf_pred_seeds.append(q_inf)

# Average OOF mu and base_sigma across seeds
mu_oof_mean = np.mean(mu_oof_seeds, axis=0)
base_sigma_mean = np.mean(base_sigma_seeds, axis=0)

# Calibrate C on averaged OOF
best = (-1e9, 1.0)
for C in np.arange(0.5, 2.01, 0.05):
    s = np.clip(C * base_sigma_mean, 70.0, 600.0)
    sc = modified_laplace_log_likelihood(y_true_full, mu_oof_mean, s)
    if sc > best[0]:
        best = (sc, float(C))
C_best_multi = best[1]
sigma_oof = np.clip(C_best_multi * base_sigma_mean, 70.0, 600.0)
oof_score = modified_laplace_log_likelihood(y_true_full, mu_oof_mean, sigma_oof)
print(f'Multi-seed quantile OOF: score={oof_score:.5f}, C_best={C_best_multi}', flush=True)

# Average inference predictions across seeds and derive mu/sigma
q_inf_mean = {a: np.mean([q[a] for q in q_inf_pred_seeds], axis=0) for a in alphas}
mu_inf = (df_inf_enc['baseline_fvc'].values.astype(float) + q_inf_mean[0.5].astype(float))
spread_inf = np.abs(q_inf_mean[0.8] - q_inf_mean[0.2]).astype(float)
sigma_inf = np.clip(C_best_multi * (spread_inf / 1.6), 70.0, 600.0).astype(float)

# Write submission exactly matching sample keys
sub = pd.DataFrame({'Patient_Week': ss['Patient_Week'].astype(str)})
sub['FVC'] = mu_inf.astype(float)
sub['Confidence'] = sigma_inf.astype(float)
sub.to_csv('submission.csv', index=False)
print('Saved multi-seed quantile submission.csv', sub.shape, '| elapsed', f'{time.time()-t0:.2f}s')
print(sub.head())

Building full-train features and encoder...


== Seed 42 ==








== Seed 123 ==








== Seed 2023 ==








== Seed 314 ==








== Seed 999 ==








Multi-seed quantile OOF: score=-5.59723, C_best=1.4000000000000008


Saved multi-seed quantile submission.csv (1908, 3) | elapsed 133.11s
                   Patient_Week          FVC  Confidence
0  ID00126637202218610655908_-3  2375.000000        70.0
1  ID00126637202218610655908_-2  2388.902425        70.0
2  ID00126637202218610655908_-1  2390.092908        70.0
3   ID00126637202218610655908_0  2387.996943        70.0
4   ID00126637202218610655908_1  2390.673599        70.0


In [22]:
# Sigma per-horizon calibration (OOF-tuned s0 + s1*|week_diff|) and rewrite submission
import numpy as np, pandas as pd, time

t0 = time.time()
# Build train features to get abs_week_diff
df_train_feat = build_features(train.copy())
abs_wd_train = df_train_feat['abs_week_diff'].values.astype(float)

# Choose OOF mu source
if 'mu_oof_mean' in globals():
    mu_oof_used = mu_oof_mean.astype(float)
    print('Using mu_oof_mean from multi-seed run')
elif 'mu_oof' in globals():
    mu_oof_used = mu_oof.astype(float)
    print('Using mu_oof from single-seed run')
else:
    raise RuntimeError('No OOF mu available. Run cell 4 or 9 first.')
y_true = train['FVC'].values.astype(float)
oof_abs_err = np.abs(y_true - mu_oof_used)

# Grid search s0, s1
best = (-1e9, 70.0, 0.0)
for s0 in np.arange(70.0, 201.0, 5.0):
    for s1 in np.arange(0.0, 5.1, 0.1):
        sigma = np.clip(s0 + s1 * abs_wd_train, 70.0, 600.0)
        score = modified_laplace_log_likelihood(y_true, mu_oof_used, sigma)
        if score > best[0]:
            best = (score, float(s0), float(s1))
print(f'Tuned horizon sigma: score={best[0]:.5f}, s0={best[1]}, s1={best[2]} in {time.time()-t0:.2f}s', flush=True)

# Build inference abs_week_diff for sample_submission keys
ss = pd.read_csv('sample_submission.csv')
ss_split = ss['Patient_Week'].str.rsplit('_', n=1, expand=True)
ss_df = pd.DataFrame({'Patient': ss_split[0].astype(str), 'Weeks': ss_split[1].astype(int)})
test_base = test[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].copy().groupby('Patient', as_index=False).first()
inf_df = ss_df.merge(test_base, on='Patient', how='left')
# Ensure we have a proper Weeks column from ss_df; after merge pandas may create Weeks_x/Weeks_y
if 'Weeks' not in inf_df.columns:
    if 'Weeks_x' in inf_df.columns:
        inf_df.rename(columns={'Weeks_x':'Weeks'}, inplace=True)
    elif 'Weeks_y' in inf_df.columns:
        inf_df.rename(columns={'Weeks_y':'Weeks'}, inplace=True)
inf_df = inf_df[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].copy()
df_inf_feat = build_features(inf_df.copy())
abs_wd_inf = df_inf_feat['abs_week_diff'].values.astype(float)

# Read current submission to keep mu (FVC) and rewrite Confidence
sub = pd.read_csv('submission.csv')
assert list(sub.columns)==['Patient_Week','FVC','Confidence'] and len(sub)==len(ss), 'submission.csv not aligned to sample keys'
sigma_inf = np.clip(best[1] + best[2] * abs_wd_inf, 70.0, 600.0).astype(float)
sub['Confidence'] = sigma_inf
sub.to_csv('submission.csv', index=False)
print('Rewrote submission.csv with horizon-calibrated sigma. Shape:', sub.shape)
print(sub.head())

Using mu_oof_mean from multi-seed run
Tuned horizon sigma: score=-5.63839, s0=70.0, s1=0.0 in 0.04s


Rewrote submission.csv with horizon-calibrated sigma. Shape: (1908, 3)
                   Patient_Week          FVC  Confidence
0  ID00126637202218610655908_-3  2375.000000        70.0
1  ID00126637202218610655908_-2  2388.902425        70.0
2  ID00126637202218610655908_-1  2390.092908        70.0
3   ID00126637202218610655908_0  2387.996943        70.0
4   ID00126637202218610655908_1  2390.673599        70.0
