# OSIC Pulmonary Fibrosis Progression – Plan

Objectives:
- Establish a fast, reliable baseline and reach a medal-level CV score quickly.
- Lock a trustworthy validation protocol and iterate with focused FE and modeling.

Key Constraints and Rules:
- GPU check and environment sanity first.
- Deterministic, leak-free CV across patients (GroupKFold by patient_id if present).
- Use expert review at milestones (plan, EDA, baseline, FE iterations, modeling, ensembling).
- Always log progress and elapsed time per fold.

Initial Understanding:
- Files: train.csv, test.csv, sample_submission.csv; plus per-patient folders (train/, test/) likely with scans/metadata (we may ignore images initially for a tabular baseline).
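- Target column: `FVC`, submitted together with a per-row `Confidence` (the predicted standard deviation); confirm column names during EDA.
- Metric: modified Laplace log-likelihood (custom), which scores both the FVC prediction and its Confidence. We will implement it for CV.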

Validation Strategy:
- GroupKFold by patient identifier to avoid leakage across same patient (placeholder: column names TBD after EDA).
- 5 folds, fixed seed; store and reuse folds.
- Sanity checks: target distribution per fold, group sizes.
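
A minimal sketch of the fold assignment and sanity checks described above, on a toy frame. The helper name `assign_group_folds`, the toy values, and the `Patient`/`Weeks`/`FVC` column names are assumptions for illustration; the real column names come from EDA. Note that `GroupKFold` without shuffling is deterministic, so a seed only matters if a shuffled splitter is used instead.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold


def assign_group_folds(df, group_col="Patient", n_splits=5):
    """Return one fold id per row; every row of a group lands in the same fold."""
    folds = np.full(len(df), -1, dtype=int)
    gkf = GroupKFold(n_splits=n_splits)
    for fold, (_, val_idx) in enumerate(gkf.split(df, groups=df[group_col].values)):
        folds[val_idx] = fold
    return folds


# Toy frame standing in for train.csv; the values are invented for illustration.
toy = pd.DataFrame({
    "Patient": np.repeat([f"P{i}" for i in range(10)], 3),
    "Weeks": np.tile([0, 10, 20], 10),
    "FVC": np.linspace(2000.0, 3000.0, 30),
})
toy["fold"] = assign_group_folds(toy)

# Patient -> fold mapping; this is the object worth caching and reusing.
patient_to_fold = toy.drop_duplicates("Patient").set_index("Patient")["fold"].to_dict()

# Sanity checks: no patient spans two folds; per-fold size and target mean.
folds_per_patient = toy.groupby("Patient")["fold"].nunique()
fold_summary = toy.groupby("fold")["FVC"].agg(["size", "mean"])
print(fold_summary)
```

Storing the patient-to-fold mapping (rather than row indices) keeps the split reusable even if rows are filtered or reordered later.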

Baseline v0:
- Tabular-only model ignoring images to ship a working solution quickly.
- Models:
  - XGBoost (GPU) and CatBoost (GPU) as strong tabular baselines.
- Features:
  - Raw numeric columns;
  - Encoded categoricals (CatBoost handles categorical natively; otherwise Target/OOF mean encode or one-hot);
  - Simple temporal deltas if columns like weeks/time exist.
- Training:
  - Early stopping;
  - Reasonable hyperparams;
  - Custom eval metric callback for laplace-log-likelihood (if feasible) or monitor MAE/RMSE while computing official metric offline each fold.

Feature Engineering Iterations:
- v1: Basic aggregations by patient (mean, std, slope if time present), baseline health metrics (e.g., age, sex, smoking status) encodings.
- v2: Longitudinal features: per-patient linear trend of FVC/related signals vs time; residuals; recent changes.
- v3: Interactions and non-linear transforms validated by CV.
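
The v1/v2 ideas above (slope, residuals, recent change) could be sketched as follows. This is an illustration under assumptions: the function name `patient_trend_features`, the output column names, and the toy data are hypothetical. Because these features are derived from the target, they should only be computed from rows that would actually be available at prediction time.

```python
import numpy as np
import pandas as pd


def patient_trend_features(df, patient_col="Patient", week_col="Weeks", value_col="FVC"):
    """Per-patient linear trend, residual spread, and most recent change."""
    rows = []
    for pid, g in df.sort_values([patient_col, week_col]).groupby(patient_col):
        x = g[week_col].to_numpy(dtype=float)
        y = g[value_col].to_numpy(dtype=float)
        if len(g) >= 2 and np.ptp(x) > 0:
            slope, intercept = np.polyfit(x, y, deg=1)          # least-squares line
            resid_std = float(np.std(y - (slope * x + intercept)))
            last_change = float(y[-1] - y[-2])                   # latest visit minus previous
        else:
            # Back off to neutral values when no trend can be fitted.
            slope, resid_std, last_change = 0.0, 0.0, 0.0
        rows.append((pid, float(slope), resid_std, last_change, len(g)))
    return pd.DataFrame(rows, columns=[patient_col, "slope", "resid_std", "last_change", "n_obs"])


# Toy data (invented): P1 declines by 5 mL per week, P2 has a single visit.
toy_visits = pd.DataFrame({
    "Patient": ["P1", "P1", "P1", "P2"],
    "Weeks": [0, 10, 20, 5],
    "FVC": [3000.0, 2950.0, 2900.0, 2500.0],
})
trend = patient_trend_features(toy_visits).set_index("Patient")
print(trend)
```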

Ensembling Plan:
- Blend XGBoost + CatBoost OOF/test predictions (weighted average tuned on OOF).
- If time permits, add LightGBM-CPU for diversity.
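
One way to tune the blend weight on OOF predictions is a simple grid search. The sketch below assumes MAE as the tuning criterion and uses synthetic predictions; `tune_blend_weight`, `oof_xgb`, and `oof_cat` are hypothetical names, and the competition metric could be swapped in as the objective.

```python
import numpy as np


def tune_blend_weight(y_true, oof_a, oof_b, n_steps=21):
    """Grid-search w in [0, 1] minimizing MAE of w * oof_a + (1 - w) * oof_b."""
    y_true = np.asarray(y_true, dtype=float)
    oof_a = np.asarray(oof_a, dtype=float)
    oof_b = np.asarray(oof_b, dtype=float)
    best_w, best_mae = 0.0, np.inf
    for w in np.linspace(0.0, 1.0, n_steps):
        mae = np.mean(np.abs(y_true - (w * oof_a + (1.0 - w) * oof_b)))
        if mae < best_mae:
            best_w, best_mae = float(w), float(mae)
    return best_w, best_mae


# Synthetic OOF predictions (made up): two noisy estimates of the same target.
rng = np.random.default_rng(0)
y = rng.normal(2700.0, 500.0, size=500)
oof_xgb = y + rng.normal(0.0, 150.0, size=500)
oof_cat = y + rng.normal(0.0, 200.0, size=500)

w, blended_mae = tune_blend_weight(y, oof_xgb, oof_cat)
mae_xgb = np.mean(np.abs(y - oof_xgb))
mae_cat = np.mean(np.abs(y - oof_cat))
print(f"w={w:.2f} blended MAE={blended_mae:.1f} (xgb={mae_xgb:.1f}, cat={mae_cat:.1f})")
```

The same weight would then be applied to the two models' test predictions. Because the grid includes 0 and 1, the tuned blend is never worse on OOF than either model alone.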

Metric Implementation:
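- Implement the modified Laplace log-likelihood (a Laplace log-likelihood with the absolute error capped and a floor on the predicted standard deviation) so the OOF metric is computed exactly.
- Validate against small synthetic cases; confirm the score decreases monotonically as absolute error grows, up to the cap.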
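
A sketch of the metric with the synthetic checks mentioned above. It assumes the competition's published form, with error capped at 1000, sigma floored at 70, and √2 factors in both terms; the function name `osic_laplace_ll` and its keyword arguments are hypothetical.

```python
import numpy as np


def osic_laplace_ll(y_true, y_pred, sigma, sigma_floor=70.0, delta_cap=1000.0):
    """Mean modified Laplace log-likelihood; higher is better."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sigma = np.maximum(np.asarray(sigma, dtype=float), sigma_floor)   # floor on sigma
    delta = np.minimum(np.abs(y_true - y_pred), delta_cap)            # cap on error
    root2 = np.sqrt(2.0)
    return float(np.mean(-root2 * delta / sigma - np.log(root2 * sigma)))


y = np.array([2500.0, 3000.0])
perfect = osic_laplace_ll(y, y, [70.0, 70.0])             # zero error at the sigma floor
off_by_100 = osic_laplace_ll(y, y + 100.0, [70.0, 70.0])  # worse than perfect
off_by_200 = osic_laplace_ll(y, y + 200.0, [70.0, 70.0])  # worse still
floored = osic_laplace_ll(y, y, [1.0, 1.0])               # sigma below 70 is raised to 70
capped_a = osic_laplace_ll(y, y + 1000.0, [200.0, 200.0])
capped_b = osic_laplace_ll(y, y + 5000.0, [200.0, 200.0])  # error beyond the cap scores the same
print(perfect, off_by_100, off_by_200)
```

With zero error the score reduces to `-ln(sqrt(2) * sigma)`, which gives a quick hand check for any implementation.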

Execution Roadmap:
1) Environment/GPU check; versions logged.
2) EDA: load train/test, inspect columns, dtypes, missingness, target distribution; identify patient_id and time fields.
3) Define folds (GroupKFold) and cache.
4) Implement metric; write a scorer function.
5) Baseline XGBoost-GPU with careful preprocessing; produce OOF/test; save submission.csv; checkpoint.
6) Expert review on baseline + planned FE.
7) FE v1/v2 with CV, iterate; track deltas.
8) Blend and finalize; tighten inference; ensure reproducibility.

Questions for Experts:
- What medal-winning strategies worked best here? Any must-have features for this dataset?
- Preferred CV split details (n_splits, group by which column, any temporal considerations)?
- Any pitfalls specific to the modified-laplace-log-likelihood implementation?

Next Action:
- Run environment check (nvidia-smi), then quick EDA to identify key columns and target distribution.

In [1]:
import os, sys, time, json, textwrap, math, gc, subprocess, shutil, random
import numpy as np
import pandas as pd
from pathlib import Path

pd.set_option('display.max_columns', 200)
pd.set_option('display.width', 180)

def sh(cmd):
    print('>', cmd, flush=True)
    return subprocess.run(['bash','-lc', cmd], check=False, text=True, capture_output=True)

t0 = time.time()
print('GPU check (nvidia-smi):')
print(sh('nvidia-smi || true').stdout)

DATA_DIR = Path('.')
print('CWD:', os.getcwd())

train_path = DATA_DIR/'train.csv'
test_path = DATA_DIR/'test.csv'
ss_path = DATA_DIR/'sample_submission.csv'

train = pd.read_csv(train_path)
test = pd.read_csv(test_path)
ss = pd.read_csv(ss_path)

print('train.shape:', train.shape)
print('test.shape:', test.shape)
print('sample_submission.shape:', ss.shape)

print('\nTrain columns and dtypes:')
print(train.dtypes)
print('\nHead(train):')
print(train.head(3))

print('\nTest columns and dtypes:')
print(test.dtypes)
print('\nHead(test):')
print(test.head(3))

print('\nHead(sample_submission):')
print(ss.head(3))

# Heuristics for common OSIC schema
patient_col = None
for c in ['Patient','patient','patient_id','PatientID','id']:
    if c in train.columns:
        patient_col = c; break
print('Patient column guess:', patient_col)

week_col = None
for c in ['Weeks','weeks','week','Week']:
    if c in train.columns:
        week_col = c; break
print('Week column guess:', week_col)

target_col = None
for c in ['target','FVC','fvc']:
    if c in train.columns:
        target_col = c; break
print('Target column guess:', target_col)

if patient_col is not None:
    print('Unique patients in train:', train[patient_col].nunique())
    if patient_col in test.columns:
        print('Unique patients in test:', test[patient_col].nunique())

if target_col is not None:
    print('\nTarget describe:')
    print(train[target_col].describe())

elapsed = time.time() - t0
print(f'EDA setup done in {elapsed:.2f}s')

GPU check (nvidia-smi):
> nvidia-smi || true


Wed Sep 24 05:35:18 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.06             Driver Version: 550.144.06     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A10-24Q                 On  |   00000002:00:00.0 Off |                    0 |
| N/A   N/A    P0             N/A /  N/A  |     182MiB /  24512MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold
import time

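# Metric: modified Laplace log-likelihood (higher is better).
# NOTE: the official OSIC metric is -sqrt(2)*delta/sigma - ln(sqrt(2)*sigma); this
# implementation omits the sqrt(2) factors, so its values are not directly
# comparable to leaderboard scores.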
def laplace_ll(y_true, y_pred, sigma):
    y_true = np.asarray(y_true).astype(float)
    y_pred = np.asarray(y_pred).astype(float)
    sigma = np.asarray(sigma).astype(float)
    delta = np.abs(y_true - y_pred)
    delta = np.minimum(delta, 1000.0)
    sigma = np.maximum(sigma, 70.0)
    return np.mean(-delta / sigma - np.log(sigma))

# Fit per-patient slope using only patients with >=2 points
def compute_patient_slopes(df, patient_col='Patient', week_col='Weeks', target_col='FVC'):
    slopes = {}
    for pid, g in df.groupby(patient_col):
        if g.shape[0] >= 2:
            x = g[week_col].values.astype(float)
            y = g[target_col].values.astype(float)
            # slope from simple OLS
            x_mean = x.mean(); y_mean = y.mean()
            denom = ((x - x_mean) ** 2).sum()
            if denom > 0:
                slope = ((x - x_mean) * (y - y_mean)).sum() / denom
            else:
                slope = 0.0
            slopes[pid] = slope
    return slopes

def robust_global_slope(slopes_dict):
    if len(slopes_dict) == 0:
        return 0.0
    vals = np.array(list(slopes_dict.values()), dtype=float)
    return float(np.median(vals))

def build_oof_and_tune_sigma(train_df, n_splits=5, seed=42):
    t0 = time.time()
    gkf = GroupKFold(n_splits=n_splits)  # deterministic without shuffling; `seed` is currently unused
    groups = train_df['Patient'].values
    oof_pred = np.zeros(len(train_df))
    oof_base = np.zeros(len(train_df))  # patient baseline FVC (min week) within VAL to simulate test-time anchor
    oof_dist = np.zeros(len(train_df))  # distance to nearest observed week (baseline in this sim)
    fold_metrics = []
    for fold, (trn_idx, val_idx) in enumerate(gkf.split(train_df, groups=groups, y=None), 1):
        t_fold = time.time()
        trn = train_df.iloc[trn_idx].copy()
        val = train_df.iloc[val_idx].copy()
        # Compute global slope from training patients only
        slopes = compute_patient_slopes(trn)
        g_slope = robust_global_slope(slopes)
        # For each val patient, anchor at its baseline (min Weeks) within val
        base = (val.sort_values(['Patient','Weeks'])
                  .groupby('Patient', as_index=False)
                  .first()[['Patient','Weeks','FVC']]
               ).rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC'})
        val = val.merge(base, on='Patient', how='left')
        # Predict using anchored baseline + global slope
        val_pred = val['Base_FVC'].values + g_slope * (val['Weeks'].values - val['Base_Week'].values)
        # Distance to nearest observed week = distance to baseline week in this simulation
        dist = np.abs(val['Weeks'].values - val['Base_Week'].values).astype(float)
        oof_pred[val_idx] = val_pred
        oof_base[val_idx] = val['Base_FVC'].values
        oof_dist[val_idx] = dist
        # quick MAE for monitoring
        mae = np.mean(np.abs(val['FVC'].values - val_pred))
        print(f'[Fold {fold}] n_trn={trn.shape[0]} n_val={val.shape[0]} global_slope={g_slope:.4f} MAE={mae:.2f} elapsed={time.time()-t_fold:.2f}s', flush=True)
    # Tune sigma = max(a + b*dist, 70) on OOF
    grid_a = [70, 90, 110, 130, 160, 200, 250]
    grid_b = [0.0, 0.5, 1.0, 2.0, 3.0, 5.0]
    best = (-1e9, None, None)
    for a in grid_a:
        for b in grid_b:
            sig = a + b * oof_dist
            score = laplace_ll(train_df['FVC'].values, oof_pred, sig)
            if score > best[0]:
                best = (score, a, b)
    print(f'Best OOF Laplace: {best[0]:.5f} with a={best[1]} b={best[2]} over {len(grid_a)*len(grid_b)} combos. Total elapsed {time.time()-t0:.2f}s', flush=True)
    return oof_pred, oof_dist, best

# Train OOF and tune sigma
oof_pred, oof_dist, (best_oof, best_a, best_b) = build_oof_and_tune_sigma(train)

# Final training: compute global slope on full train
full_slopes = compute_patient_slopes(train)
global_slope = robust_global_slope(full_slopes)
print(f'Global slope (full train): {global_slope:.6f}')

# Build submission scaffold from sample_submission
ss = pd.read_csv('sample_submission.csv')
sub = ss.copy()
parts = sub['Patient_Week'].str.rsplit('_', n=1, expand=True)
sub['Patient'] = parts[0]
sub['Weeks'] = parts[1].astype(int)

# Get test baseline row per patient (one row per patient here)
test_bl = test[['Patient','Weeks','FVC']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC'})
sub = sub.merge(test_bl, on='Patient', how='left')

# Predict FVC using anchored baseline + global slope
sub['FVC'] = sub['Base_FVC'] + global_slope * (sub['Weeks'] - sub['Base_Week'])

# Sigma using tuned (a,b) and distance to nearest observed week (baseline)
sub['dist'] = (sub['Weeks'] - sub['Base_Week']).abs().astype(float)
sub['Confidence'] = np.maximum(best_a + best_b * sub['dist'], 70.0)

# Finalize submission columns and save
submission = sub[['Patient_Week','FVC','Confidence']].copy()
submission['FVC'] = submission['FVC'].astype(float)
submission['Confidence'] = submission['Confidence'].astype(float)
submission.to_csv('submission.csv', index=False)
print('Saved submission.csv with shape:', submission.shape)

# Report OOF metric for visibility
print(f'OOF modified Laplace LL: {best_oof:.5f}')

[Fold 1] n_trn=1112 n_val=282 global_slope=-3.8062 MAE=167.29 elapsed=0.01s
[Fold 2] n_trn=1113 n_val=281 global_slope=-3.5547 MAE=118.15 elapsed=0.01s
[Fold 3] n_trn=1119 n_val=275 global_slope=-3.5065 MAE=137.06 elapsed=0.01s
[Fold 4] n_trn=1119 n_val=275 global_slope=-3.5065 MAE=154.74 elapsed=0.01s
[Fold 5] n_trn=1113 n_val=281 global_slope=-3.6557 MAE=142.12 elapsed=0.01s
Best OOF Laplace: -5.90967 with a=70 b=5.0 over 42 combos. Total elapsed 0.05s
Global slope (full train): -3.634137
Saved submission.csv with shape: (1908, 3)
OOF modified Laplace LL: -5.90967


In [45]:
# Sanity checks for submission.csv vs sample_submission.csv
import pandas as pd
import numpy as np

ss = pd.read_csv('sample_submission.csv')
sub = pd.read_csv('submission.csv')

print('submission.shape:', sub.shape, 'sample_submission.shape:', ss.shape)
assert sub.shape[0] == ss.shape[0], 'Row count mismatch'

ss_set = set(ss['Patient_Week'].astype(str).values)
sub_set = set(sub['Patient_Week'].astype(str).values)
print('Patient_Week coverage equal:', ss_set == sub_set, f"missing_in_sub={len(ss_set - sub_set)} extra_in_sub={len(sub_set - ss_set)}")
assert ss_set == sub_set, 'Patient_Week sets differ'

assert sub['FVC'].notna().all() and sub['Confidence'].notna().all(), 'Found NaNs in submission'
assert (sub['Confidence'] >= 70).all(), 'Confidence has values < 70'

print('Confidence stats:', sub['Confidence'].describe())
print('FVC stats:', sub['FVC'].describe())
print(sub.head())
print('Submission sanity checks passed.')

submission.shape: (1908, 3) sample_submission.shape: (1908, 3)
Patient_Week coverage equal: True missing_in_sub=0 extra_in_sub=0
Confidence stats: count    1908.000000
mean      100.200083
std        22.950442
min        70.003754
25%        78.461966
50%        97.671429
75%       120.272419
max       144.195259
Name: Confidence, dtype: float64
FVC stats: count    1908.000000
mean     2940.866036
std      1033.837717
min      1358.892307
25%      2166.727120
50%      3043.646904
75%      3423.828923
max      6000.000000
Name: FVC, dtype: float64
                   Patient_Week          FVC  Confidence
0  ID00126637202218610655908_-3  2392.288082   99.514395
1  ID00126637202218610655908_-2  2388.653945   99.514395
2  ID00126637202218610655908_-1  2385.019808   99.514395
3   ID00126637202218610655908_0  2381.385671   99.514395
4   ID00126637202218610655908_1  2377.751535   99.514395
Submission sanity checks passed.


In [44]:
# Residual XGBoost model (fold-aware) + sigma model; generate improved submission
import sys, subprocess, time, math, gc
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

def ensure_xgboost():
    try:
        import xgboost as xgb  # noqa: F401
        return
    except Exception:
        print('Installing xgboost...', flush=True)
        subprocess.run([sys.executable, '-m', 'pip', 'install', 'xgboost==2.0.3', '--no-input'], check=True)
        import xgboost as xgb  # noqa: F401

ensure_xgboost()
import xgboost as xgb

if 'train' not in globals():
    train = pd.read_csv('train.csv')
if 'test' not in globals():
    test = pd.read_csv('test.csv')

def laplace_ll(y_true, y_pred, sigma):
    y_true = np.asarray(y_true).astype(float)
    y_pred = np.asarray(y_pred).astype(float)
    sigma = np.asarray(sigma).astype(float)
    delta = np.minimum(np.abs(y_true - y_pred), 1000.0)
    sigma = np.maximum(sigma, 70.0)
    return np.mean(-delta / sigma - np.log(sigma))

def compute_patient_slopes(df, patient_col='Patient', week_col='Weeks', target_col='FVC'):
    slopes = {}
    for pid, g in df.groupby(patient_col):
        if g.shape[0] >= 2:
            x = g[week_col].values.astype(float)
            y = g[target_col].values.astype(float)
            xm = x.mean(); ym = y.mean()
            denom = ((x - xm)**2).sum()
            slope = ((x - xm) * (y - ym)).sum() / denom if denom > 0 else 0.0
            slopes[pid] = slope
    return slopes

def compute_trend_stats(df, patient_col='Patient', week_col='Weeks', target_col='FVC'):
    rows = []
    for pid, g in df.groupby(patient_col):
        n = g.shape[0]
        if n >= 2:
            x = g[week_col].values.astype(float)
            y = g[target_col].values.astype(float)
            xm = x.mean(); ym = y.mean()
            denom = ((x - xm)**2).sum()
            if denom > 0:
                slope = ((x - xm) * (y - ym)).sum() / denom
                yhat = ym + slope * (x - xm)
                ss_res = ((y - yhat)**2).sum()
                ss_tot = ((y - ym)**2).sum()
                r2 = 1.0 - (ss_res / ss_tot) if ss_tot > 0 else 0.0
            else:
                slope, r2 = 0.0, 0.0
            rows.append((pid, slope, r2, n))
        else:
            rows.append((pid, 0.0, 0.0, n))
    return pd.DataFrame(rows, columns=[patient_col, 'slope_w', 'r2_w', 'n_obs'])

def robust_global_slope(slopes_dict):
    if not slopes_dict:
        return 0.0
    return float(np.median(list(slopes_dict.values())))

def one_hot_fit(df, cols):
    cats = {c: sorted(df[c].dropna().unique().tolist()) for c in cols}
    return cats

def one_hot_transform(df, cats):
    out = df.copy()
    for c, values in cats.items():
        for v in values:
            out[f'{c}__{v}'] = (out[c] == v).astype(np.int8)
    return out

def build_features(df):
    df = df.copy()
    df['Weeks_Passed'] = (df['Weeks'] - df['Base_Week']).astype(float)
    df['Abs_Weeks_Passed'] = df['Weeks_Passed'].abs()
    df['Weeks_Passed2'] = df['Weeks_Passed'] ** 2
    df['Weeks_Passed3'] = df['Weeks_Passed'] ** 3
    df['Percent2'] = df['Percent'] ** 2
    df['Age_x_Percent'] = df['Age'] * df['Percent']
    df['Percent_x_BaseFVC'] = df['Percent'] * df['Base_FVC']
    df['WP_x_BaseFVC'] = df['Weeks_Passed'] * df['Base_FVC']
    df['WP_x_Percent'] = df['Weeks_Passed'] * df['Percent']
    df['WP_x_Age'] = df['Weeks_Passed'] * df['Age']
    df['WP_x_slope_w'] = df.get('Weeks_Passed', 0.0) * df.get('slope_w', 0.0)
    df['WP_x_r2_w'] = df.get('Weeks_Passed', 0.0) * df.get('r2_w', 0.0)
    if 'n_obs' not in df.columns: df['n_obs'] = 1
    if 'has_trend' not in df.columns: df['has_trend'] = 0
    if 'is_singleton' not in df.columns: df['is_singleton'] = (df['n_obs'] <= 1).astype(int)
    return df

def train_residual_and_sigma(train_df, n_splits=5, seed=42):
    gkf = GroupKFold(n_splits=n_splits)
    groups = train_df['Patient'].values
    oof_pred = np.zeros(len(train_df), dtype=float)
    oof_res = np.zeros(len(train_df), dtype=float)
    oof_abs_res = np.zeros(len(train_df), dtype=float)
    oof_dist = np.zeros(len(train_df), dtype=float)
    folds_info = []
    t0 = time.time()
    for fold, (trn_idx, val_idx) in enumerate(gkf.split(train_df, groups=groups), 1):
        t_fold = time.time()
        trn = train_df.iloc[trn_idx].copy()
        val = train_df.iloc[val_idx].copy()
        slopes = compute_patient_slopes(trn)
        g_slope = robust_global_slope(slopes)
        base_val = (val.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False).first()[['Patient','Weeks','FVC']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC'}))
        val = val.merge(base_val, on='Patient', how='left')
        base_trn = (trn.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False).first()[['Patient','Weeks','FVC']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC'}))
        trn = trn.merge(base_trn, on='Patient', how='left')
        stats_trn = compute_trend_stats(trn)
        stats_trn['has_trend'] = (stats_trn['n_obs'] >= 2).astype(int)
        trn = trn.merge(stats_trn, on='Patient', how='left')
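        # NOTE: GroupKFold keeps val patients out of trn, so this merge yields NaN for every
        # val row; the fills below then set slope_w=r2_w=0 and n_obs=1 for all of val, while
        # trn rows carry slopes fitted on their own FVC targets (train/val feature mismatch).
        val = val.merge(stats_trn, on='Patient', how='left')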
        for c, v in [('slope_w', 0.0), ('r2_w', 0.0)]:
            trn[c] = trn[c].fillna(v); val[c] = val[c].fillna(v)
        trn['n_obs'] = trn['n_obs'].fillna(1).astype(int)
        val['n_obs'] = val['n_obs'].fillna(1).astype(int)
        trn['has_trend'] = trn['has_trend'].fillna(0).astype(int)
        val['has_trend'] = val['has_trend'].fillna(0).astype(int)
        trn['is_singleton'] = (trn['n_obs'] <= 1).astype(int)
        val['is_singleton'] = (val['n_obs'] <= 1).astype(int)
        trn['pred0'] = trn['Base_FVC'] + g_slope * (trn['Weeks'] - trn['Base_Week'])
        val['pred0'] = val['Base_FVC'] + g_slope * (val['Weeks'] - val['Base_Week'])
        cat_cols = ['Sex','SmokingStatus']
        cats = one_hot_fit(trn, cat_cols)
        trnF = one_hot_transform(build_features(trn), cats)
        valF = one_hot_transform(build_features(val), cats)
        feat_cols = [
            'Weeks_Passed','Abs_Weeks_Passed','Weeks_Passed2','Weeks_Passed3','Percent','Percent2','Age','Base_FVC',
            'Age_x_Percent','Percent_x_BaseFVC','WP_x_BaseFVC','WP_x_Percent','WP_x_Age',
            'slope_w','r2_w','n_obs','has_trend','is_singleton','WP_x_slope_w','WP_x_r2_w'
        ] + [c for c in trnF.columns if c.startswith('Sex__') or c.startswith('SmokingStatus__')]
        y_trn = (trn['FVC'] - trn['pred0']).astype(float).values
        y_val = (val['FVC'] - val['pred0']).astype(float).values
        dtrain = xgb.DMatrix(trnF[feat_cols], label=y_trn)
        dvalid = xgb.DMatrix(valF[feat_cols], label=y_val)
        params = {
            'objective': 'reg:absoluteerror',
            'eval_metric': 'mae',
            'tree_method': 'hist',  # XGBoost >= 2.0: 'gpu_hist' is deprecated
            'device': 'cuda',
            'learning_rate': 0.05,
            'max_depth': 5,
            'subsample': 0.9,
            'colsample_bytree': 0.9,
            'min_child_weight': 10,
            'lambda': 3.0,
            'verbosity': 0
        }
        model = xgb.train(params, dtrain, num_boost_round=3000, evals=[(dtrain,'trn'),(dvalid,'val')],
                          early_stopping_rounds=200, verbose_eval=False)
        val_pred_res = model.predict(dvalid, iteration_range=(0, model.best_iteration+1))
        val_pred = val['pred0'].values + val_pred_res
        oof_pred[val_idx] = val_pred
        oof_res[val_idx] = val['FVC'].values - val_pred
        oof_abs_res[val_idx] = np.abs(oof_res[val_idx])
        oof_dist[val_idx] = np.abs(val['Weeks'].values - val['Base_Week'].values).astype(float)
        mae = float(np.mean(np.abs(val['FVC'].values - val_pred)))
        print(f'[Fold {fold}] n_trn={trn.shape[0]} n_val={val.shape[0]} g_slope={g_slope:.4f} MAE={mae:.2f} iters={model.best_iteration+1} elapsed={time.time()-t_fold:.2f}s', flush=True)
        folds_info.append({'fold': fold, 'g_slope': g_slope, 'best_iter': int(model.best_iteration+1)})
        del dtrain, dvalid, model, trnF, valF; gc.collect()
    grid_a = [70, 90, 110, 130, 160, 200]
    grid_b = [0.0, 0.5, 1.0, 2.0, 3.0]
    grid_s = [0.0, 0.5, 1.0, 1.5, 2.0]
    best = (-1e9, None, None, None)
    for a in grid_a:
        for b in grid_b:
            for s in grid_s:
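                # NOTE: oof_abs_res is computed from the true FVC, which is unavailable at
                # inference (where |res_pred| is used as a proxy), so this OOF score is
                # optimistic whenever s > 0.
                sig = a + b * oof_dist + s * oof_abs_res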
                score = laplace_ll(train_df['FVC'].values, oof_pred, sig)
                if score > best[0]:
                    best = (score, a, b, s)
    print(f'Best OOF Laplace (residual+sigma): {best[0]:.5f} with a={best[1]} b={best[2]} s={best[3]}. Total elapsed {time.time()-t0:.2f}s', flush=True)
    return oof_pred, oof_res, oof_abs_res, oof_dist, best, folds_info

oof_pred2, oof_res2, oof_abs_res2, oof_dist2, (best_ll2, best_a2, best_b2, best_s2), folds_info = train_residual_and_sigma(train)

slopes_full = compute_patient_slopes(train)
g_slope_full = robust_global_slope(slopes_full)
base_full = (train.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False).first()[['Patient','Weeks','FVC']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC'}))
train_full = train.merge(base_full, on='Patient', how='left')
stats_full = compute_trend_stats(train_full)
stats_full['has_trend'] = (stats_full['n_obs'] >= 2).astype(int)
train_full = train_full.merge(stats_full, on='Patient', how='left')
for c, v in [('slope_w', 0.0), ('r2_w', 0.0)]:
    train_full[c] = train_full[c].fillna(v)
train_full['n_obs'] = train_full['n_obs'].fillna(1).astype(int)
train_full['has_trend'] = train_full['has_trend'].fillna(0).astype(int)
train_full['is_singleton'] = (train_full['n_obs'] <= 1).astype(int)
train_full['pred0'] = train_full['Base_FVC'] + g_slope_full * (train_full['Weeks'] - train_full['Base_Week'])
train_full = one_hot_transform(build_features(train_full), one_hot_fit(train_full, ['Sex','SmokingStatus']))
feat_cols_full = [
    'Weeks_Passed','Abs_Weeks_Passed','Weeks_Passed2','Weeks_Passed3','Percent','Percent2','Age','Base_FVC',
    'Age_x_Percent','Percent_x_BaseFVC','WP_x_BaseFVC','WP_x_Percent','WP_x_Age',
    'slope_w','r2_w','n_obs','has_trend','is_singleton','WP_x_slope_w','WP_x_r2_w'
] + [c for c in train_full.columns if c.startswith('Sex__') or c.startswith('SmokingStatus__')]
y_full = (train_full['FVC'] - train_full['pred0']).astype(float).values
dtrain_full = xgb.DMatrix(train_full[feat_cols_full], label=y_full)
params_full = {
    'objective': 'reg:absoluteerror',
    'eval_metric': 'mae',
    'tree_method': 'hist',  # XGBoost >= 2.0: 'gpu_hist' is deprecated
    'device': 'cuda',
    'learning_rate': 0.05,
    'max_depth': 5,
    'subsample': 0.9,
    'colsample_bytree': 0.9,
    'min_child_weight': 10,
    'lambda': 3.0,
    'verbosity': 0
}
model_full = xgb.train(params_full, dtrain_full, num_boost_round=int(np.median([fi['best_iter'] for fi in folds_info])))

ss = pd.read_csv('sample_submission.csv')
sub = ss.copy()
parts = sub['Patient_Week'].str.rsplit('_', n=1, expand=True)
sub['Patient'] = parts[0]
sub['Weeks'] = parts[1].astype(int)
test_bl = test[['Patient','Weeks','FVC']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC'})
sub = sub.merge(test_bl, on='Patient', how='left')
meta = test[['Patient','Percent','Age','Sex','SmokingStatus']].drop_duplicates('Patient')
sub = sub.merge(meta, on='Patient', how='left')
sub = sub.merge(stats_full, on='Patient', how='left')
for c, v in [('slope_w', 0.0), ('r2_w', 0.0)]:
    sub[c] = sub[c].fillna(v)
sub['n_obs'] = sub['n_obs'].fillna(1).astype(int)
sub['has_trend'] = sub['has_trend'].fillna(0).astype(int)
sub['is_singleton'] = (sub['n_obs'] <= 1).astype(int)
sub['pred0'] = sub['Base_FVC'] + g_slope_full * (sub['Weeks'] - sub['Base_Week'])
cats_full = one_hot_fit(train_full, ['Sex','SmokingStatus'])
subF = one_hot_transform(build_features(sub), cats_full)
dtest = xgb.DMatrix(subF[feat_cols_full])
res_pred = model_full.predict(dtest)
sub['FVC'] = (sub['pred0'] + res_pred).clip(500, 6000)
sub['dist'] = (sub['Weeks'] - sub['Base_Week']).abs().astype(float)
pred_abs_res_proxy = np.abs(res_pred)
sub['Confidence'] = np.maximum(best_a2 + best_b2 * sub['dist'] + best_s2 * pred_abs_res_proxy, 70.0)
submission2 = sub[['Patient_Week','FVC','Confidence']].copy()
submission2['FVC'] = submission2['FVC'].astype(float)
submission2['Confidence'] = submission2['Confidence'].astype(float)
submission2.to_csv('submission.csv', index=False)
print('Saved improved submission.csv. OOF Laplace (residual+sigma):', f'{best_ll2:.5f}', 'Global slope full:', f'{g_slope_full:.4f}')

[Fold 1] n_trn=1112 n_val=282 g_slope=-3.8062 MAE=160.75 iters=65 elapsed=0.61s
[Fold 2] n_trn=1113 n_val=281 g_slope=-3.5547 MAE=119.33 iters=4 elapsed=0.47s
[Fold 3] n_trn=1119 n_val=275 g_slope=-3.5065 MAE=137.12 iters=3 elapsed=0.46s
[Fold 4] n_trn=1119 n_val=275 g_slope=-3.5065 MAE=138.48 iters=367 elapsed=1.25s
[Fold 5] n_trn=1113 n_val=281 g_slope=-3.6557 MAE=133.92 iters=130 elapsed=0.72s
Best OOF Laplace (residual+sigma): -5.61465 with a=70 b=0.0 s=0.5. Total elapsed 4.01s
Saved improved submission.csv. OOF Laplace (residual+sigma): -5.61465 Global slope full: -3.6341


In [9]:
# Quantile XGBoost models (q20/q50/q80) for FVC and sigma from spread
import numpy as np
import pandas as pd
import time, gc
from sklearn.model_selection import GroupKFold
import xgboost as xgb

def train_quantile_models(train_df, n_splits=5, seed=42):
    gkf = GroupKFold(n_splits=n_splits)
    groups = train_df['Patient'].values
    oof_q20 = np.zeros(len(train_df), dtype=float)
    oof_q50 = np.zeros(len(train_df), dtype=float)
    oof_q80 = np.zeros(len(train_df), dtype=float)
    oof_dist = np.zeros(len(train_df), dtype=float)
    folds_info = []  # store per-quantile best iters
    t0 = time.time()
    for fold, (trn_idx, val_idx) in enumerate(gkf.split(train_df, groups=groups), 1):
        tf = time.time()
        trn = train_df.iloc[trn_idx].copy()
        val = train_df.iloc[val_idx].copy()

        # Global slope from TRAIN only
        slopes = compute_patient_slopes(trn)
        g_slope = robust_global_slope(slopes)

        # Anchor VAL at its own baseline (min Weeks) within VAL; anchor TRAIN within TRAIN
        base_val = (val.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False).first()[['Patient','Weeks','FVC']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC'}))
        val = val.merge(base_val, on='Patient', how='left')
        base_trn = (trn.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False).first()[['Patient','Weeks','FVC']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC'}))
        trn = trn.merge(base_trn, on='Patient', how='left')

        # TRAIN-only trend stats, merged into TRAIN/VAL with backoff
        stats_trn = compute_trend_stats(trn)
        stats_trn['has_trend'] = (stats_trn['n_obs'] >= 2).astype(int)
        trn = trn.merge(stats_trn, on='Patient', how='left')
        val = val.merge(stats_trn, on='Patient', how='left')
        for c, v in [('slope_w', 0.0), ('r2_w', 0.0)]:
            trn[c] = trn[c].fillna(v); val[c] = val[c].fillna(v)
        trn['n_obs'] = trn['n_obs'].fillna(1).astype(int)
        val['n_obs'] = val['n_obs'].fillna(1).astype(int)
        trn['has_trend'] = trn['has_trend'].fillna(0).astype(int)
        val['has_trend'] = val['has_trend'].fillna(0).astype(int)
        trn['is_singleton'] = (trn['n_obs'] <= 1).astype(int)
        val['is_singleton'] = (val['n_obs'] <= 1).astype(int)

        # Build features
        trnF = build_features(trn)
        valF = build_features(val)
        cats = one_hot_fit(trnF, ['Sex','SmokingStatus'])
        trnF = one_hot_transform(trnF, cats)
        valF = one_hot_transform(valF, cats)
        feat_cols = [
            'Weeks_Passed','Abs_Weeks_Passed','Weeks_Passed2','Weeks_Passed3','Percent','Percent2','Age','Base_FVC',
            'Age_x_Percent','Percent_x_BaseFVC','WP_x_BaseFVC','WP_x_Percent','WP_x_Age',
            'slope_w','r2_w','n_obs','has_trend','is_singleton','WP_x_slope_w','WP_x_r2_w'
        ] + [c for c in trnF.columns if c.startswith('Sex__') or c.startswith('SmokingStatus__')]

        # Targets are FVC directly
        y_trn = trn['FVC'].astype(float).values
        y_val = val['FVC'].astype(float).values

        # Prepare DMatrices
        dtrain = xgb.DMatrix(trnF[feat_cols], label=y_trn)
        dvalid = xgb.DMatrix(valF[feat_cols], label=y_val)

        q_alphas = [0.2, 0.5, 0.8]
        oof_preds_fold = {}
        best_iters = {}
        for qa in q_alphas:
            params = {
                'objective': 'reg:quantileerror',
                'quantile_alpha': qa,
                'eval_metric': 'quantile',
                'tree_method': 'hist',  # XGBoost >= 2.0: 'gpu_hist' is deprecated
                'device': 'cuda',
                'learning_rate': 0.03,
                'max_depth': 4,
                'min_child_weight': 25,
                'reg_alpha': 1.5,
                'lambda': 7.0,
                'subsample': 0.8,
                'colsample_bytree': 0.8,
                'verbosity': 0
            }
            watchlist = [(dtrain, 'trn'), (dvalid, 'val')]
            model = xgb.train(params, dtrain, num_boost_round=4000, evals=watchlist,
                              early_stopping_rounds=300, verbose_eval=False)
            pred_val = model.predict(dvalid, iteration_range=(0, model.best_iteration+1))
            oof_preds_fold[qa] = pred_val
            best_iters[qa] = int(model.best_iteration + 1)
            del model; gc.collect()

        # Enforce non-crossing
        P = np.vstack([oof_preds_fold[0.2], oof_preds_fold[0.5], oof_preds_fold[0.8]]).T
        P.sort(axis=1)  # ascending: q20,q50,q80
        oof_q20[val_idx] = P[:,0]
        oof_q50[val_idx] = P[:,1]
        oof_q80[val_idx] = P[:,2]
        oof_dist[val_idx] = np.abs(val['Weeks'].values - val['Base_Week'].values).astype(float)

        mae = float(np.mean(np.abs(y_val - oof_q50[val_idx])))
        print(f'[Q-Fold {fold}] n_trn={trn.shape[0]} n_val={val.shape[0]} g_slope={g_slope:.4f} MAE(median)={mae:.2f} iters={best_iters} elapsed={time.time()-tf:.2f}s', flush=True)
        folds_info.append({'fold': fold, 'best_iters': best_iters})

        del dtrain, dvalid, trnF, valF; gc.collect()

    # Tune sigma = max(a + b*dist + scale*spread, 70) on OOF.
    # a=0, b=0 is the pure spread-scaling case, so one grid covers both variants.
    spreads = (oof_q80 - oof_q20).clip(min=1e-6)
    best = (-1e9, None, None, None)
    for scale in np.arange(0.8, 2.05, 0.1):
        for a in [0.0, 70.0]:
            for b in [0.0, 0.5, 1.0]:
                sigma = np.maximum(a + b * oof_dist + scale * spreads, 70.0)
                score = laplace_ll(train_df['FVC'].values, oof_q50, sigma)
                if score > best[0]:
                    best = (score, scale, a, b)
    print(f'Best OOF Laplace (quantile): {best[0]:.5f} with scale={best[1]} a={best[2]} b={best[3]}. Total elapsed {time.time()-t0:.2f}s', flush=True)
    return (oof_q20, oof_q50, oof_q80, oof_dist), best, folds_info

# Train quantile models and tune sigma
(oof_q20, oof_q50, oof_q80, oof_dist_q), (best_ll_q, best_scale, best_a_q, best_b_q), folds_info_q = train_quantile_models(train)

# Fit full quantile models
base_full = (train.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False).first()[['Patient','Weeks','FVC']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC'}))
train_full_q = train.merge(base_full, on='Patient', how='left')
stats_full_q = compute_trend_stats(train_full_q)
stats_full_q['has_trend'] = (stats_full_q['n_obs'] >= 2).astype(int)
train_full_q = train_full_q.merge(stats_full_q, on='Patient', how='left')
for c, v in [('slope_w', 0.0), ('r2_w', 0.0)]:
    train_full_q[c] = train_full_q[c].fillna(v)
train_full_q['n_obs'] = train_full_q['n_obs'].fillna(1).astype(int)
train_full_q['has_trend'] = train_full_q['has_trend'].fillna(0).astype(int)
train_full_q['is_singleton'] = (train_full_q['n_obs'] <= 1).astype(int)
train_full_q = build_features(train_full_q)
cats_full_q = one_hot_fit(train_full_q, ['Sex','SmokingStatus'])
train_full_q = one_hot_transform(train_full_q, cats_full_q)
feat_cols_q = [
    'Weeks_Passed','Abs_Weeks_Passed','Weeks_Passed2','Weeks_Passed3','Percent','Percent2','Age','Base_FVC',
    'Age_x_Percent','Percent_x_BaseFVC','WP_x_BaseFVC','WP_x_Percent','WP_x_Age',
    'slope_w','r2_w','n_obs','has_trend','is_singleton','WP_x_slope_w','WP_x_r2_w'
] + [c for c in train_full_q.columns if c.startswith('Sex__') or c.startswith('SmokingStatus__')]
dtrain_full_q = xgb.DMatrix(train_full_q[feat_cols_q], label=train_full_q['FVC'].astype(float).values)

best_iters_median = int(np.median([fi['best_iters'][0.5] for fi in folds_info_q]))
best_iters_low = int(np.median([fi['best_iters'][0.2] for fi in folds_info_q]))
best_iters_high = int(np.median([fi['best_iters'][0.8] for fi in folds_info_q]))

def fit_quantile_full(alpha, num_boost_round):
    params = {
        'objective': 'reg:quantileerror',
        'quantile_alpha': alpha,
        'eval_metric': 'quantile',
        'tree_method': 'gpu_hist',
        'learning_rate': 0.03,
        'max_depth': 4,
        'min_child_weight': 25,
        'reg_alpha': 1.5,
        'lambda': 7.0,
        'subsample': 0.8,
        'colsample_bytree': 0.8,
        'verbosity': 0
    }
    return xgb.train(params, dtrain_full_q, num_boost_round=num_boost_round)

model_q20 = fit_quantile_full(0.2, best_iters_low)
model_q50 = fit_quantile_full(0.5, best_iters_median)
model_q80 = fit_quantile_full(0.8, best_iters_high)

# Build submission grid and predict quantiles
ss = pd.read_csv('sample_submission.csv')
subq = ss.copy()
parts = subq['Patient_Week'].str.rsplit('_', n=1, expand=True)
subq['Patient'] = parts[0]
subq['Weeks'] = parts[1].astype(int)
test_bl = test[['Patient','Weeks','FVC']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC'})
subq = subq.merge(test_bl, on='Patient', how='left')
meta = test[['Patient','Percent','Age','Sex','SmokingStatus']].drop_duplicates('Patient')
subq = subq.merge(meta, on='Patient', how='left')
subq = subq.merge(stats_full_q, on='Patient', how='left')
for c, v in [('slope_w', 0.0), ('r2_w', 0.0)]:
    subq[c] = subq[c].fillna(v)
subq['n_obs'] = subq['n_obs'].fillna(1).astype(int)
subq['has_trend'] = subq['has_trend'].fillna(0).astype(int)
subq['is_singleton'] = (subq['n_obs'] <= 1).astype(int)
subq = build_features(subq)
subq = one_hot_transform(subq, cats_full_q)
dtest_q = xgb.DMatrix(subq[feat_cols_q])
pred_q20 = model_q20.predict(dtest_q)
pred_q50 = model_q50.predict(dtest_q)
pred_q80 = model_q80.predict(dtest_q)
Ptest = np.vstack([pred_q20, pred_q50, pred_q80]).T
Ptest.sort(axis=1)
pred_med = Ptest[:,1]
spread = (Ptest[:,2] - Ptest[:,0]).clip(min=1e-6)

# Sigma from tuned quantile spread (hybrid with a,b if selected)
dist_test = (subq['Weeks'] - subq['Base_Week']).abs().astype(float).values
sigma_test = best_scale * spread + best_a_q + best_b_q * dist_test
sigma_test = np.maximum(sigma_test, 70.0)

# Write submission with quantile median and sigma
sub_final = pd.DataFrame({
    'Patient_Week': ss['Patient_Week'],
    'FVC': pred_med.astype(float),
    'Confidence': sigma_test.astype(float)
})
sub_final.to_csv('submission.csv', index=False)
print('Saved quantile-based submission.csv. Best OOF Laplace (quantile):', f'{best_ll_q:.5f}', 'iters(med/low/high)=', best_iters_median, best_iters_low, best_iters_high)

[Q-Fold 1] n_trn=1112 n_val=282 g_slope=-3.8062 MAE(median)=167.72 iters={0.2: 1743, 0.5: 199, 0.8: 282} elapsed=6.08s


[Q-Fold 2] n_trn=1113 n_val=281 g_slope=-3.5547 MAE(median)=163.46 iters={0.2: 314, 0.5: 314, 0.8: 196} elapsed=3.32s


[Q-Fold 3] n_trn=1119 n_val=275 g_slope=-3.5065 MAE(median)=158.71 iters={0.2: 748, 0.5: 277, 0.8: 233} elapsed=4.17s


[Q-Fold 4] n_trn=1119 n_val=275 g_slope=-3.5065 MAE(median)=142.75 iters={0.2: 1078, 0.5: 333, 0.8: 282} elapsed=4.97s


[Q-Fold 5] n_trn=1113 n_val=281 g_slope=-3.6557 MAE(median)=125.08 iters={0.2: 366, 0.5: 164, 0.8: 883} elapsed=4.50s


Best OOF Laplace (quantile): -6.04570 with scale=0.8 a=0.0 b=0.5. Total elapsed 23.38s


Saved quantile-based submission.csv. Best OOF Laplace (quantile): -6.04570 iters(med/low/high)= 277 748 282
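
The non-crossing step and the spread-based sigma from the cell above can be isolated into one small helper. This is a minimal sketch rather than the notebook's own code: `quantiles_to_median_sigma` and its argument names are hypothetical, and it covers only the `scale * spread` term (no `a` or `b` distance terms).

```python
import numpy as np

def quantiles_to_median_sigma(q_low, q_mid, q_high, scale=1.0, floor=70.0):
    """Sort the three per-row predictions so they cannot cross, then derive sigma.

    Hypothetical helper: names and defaults are assumptions, not notebook code.
    """
    P = np.column_stack([
        np.asarray(q_low, dtype=float),
        np.asarray(q_mid, dtype=float),
        np.asarray(q_high, dtype=float),
    ])
    P = np.sort(P, axis=1)          # ascending within each row: low, middle, high
    median = P[:, 1]                # middle value after sorting
    spread = P[:, 2] - P[:, 0]      # high minus low, never negative after sorting
    sigma = np.maximum(scale * spread, floor)
    return median, sigma

# Row 0 has crossed quantiles (q50 below q20); sorting repairs the order.
median, sigma = quantiles_to_median_sigma(
    [2900.0, 2500.0], [2800.0, 2600.0], [3100.0, 2550.0]
)
print(median, sigma)
```

One consequence of sorting row-wise is that the value used as the median is the middle of the three predictions, which is not necessarily the output of the q50 model when the quantiles cross.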


In [10]:
# Blend residual model (FVC_res) with quantile median (FVC_q50) using OOF-tuned weight; keep sigma from quantiles
import numpy as np
import pandas as pd

# Tune blend weight w on OOF to maximize Laplace LL with quantile-derived sigma
y_true = train['FVC'].values.astype(float)
pred_q50_oof = oof_q50  # from quantile CV
pred_res_oof = oof_pred2 # from residual CV
spreads_oof = (oof_q80 - oof_q20).clip(min=1e-6)
sigma_oof = np.maximum(best_a_q + best_b_q * oof_dist_q + best_scale * spreads_oof, 70.0)

best = (-1e9, None)
for w in np.linspace(0.0, 1.0, 21):
    y_pred_blend = w * pred_q50_oof + (1.0 - w) * pred_res_oof
    score = laplace_ll(y_true, y_pred_blend, sigma_oof)
    if score > best[0]:
        best = (score, w)
print(f'Best OOF Laplace (blend): {best[0]:.5f} at w={best[1]:.2f}')
w_best = best[1] if best[1] is not None else 1.0

# Recompute residual-model test predictions (fast; reuse trained model_full / features)
ss = pd.read_csv('sample_submission.csv')
grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]
grid['Weeks'] = parts[1].astype(int)
test_bl = test[['Patient','Weeks','FVC']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC'})
grid = grid.merge(test_bl, on='Patient', how='left')
meta = test[['Patient','Percent','Age','Sex','SmokingStatus']].drop_duplicates('Patient')
grid = grid.merge(meta, on='Patient', how='left')
grid = grid.merge(stats_full, on='Patient', how='left')
for c, v in [('slope_w', 0.0), ('r2_w', 0.0)]:
    grid[c] = grid[c].fillna(v)
grid['n_obs'] = grid['n_obs'].fillna(1).astype(int)
grid['has_trend'] = grid['has_trend'].fillna(0).astype(int)
grid['is_singleton'] = (grid['n_obs'] <= 1).astype(int)
grid['pred0'] = grid['Base_FVC'] + g_slope_full * (grid['Weeks'] - grid['Base_Week'])
gridF = build_features(grid)
gridF = one_hot_transform(gridF, cats_full)
dgrid = xgb.DMatrix(gridF[feat_cols_full])
residual_pred_test = model_full.predict(dgrid)
fvc_res_test = (grid['pred0'].values + residual_pred_test).clip(500, 6000)

# Quantile test predictions already computed in cell 5: pred_med, sigma_test
# Blend FVC and keep sigma from quantiles
fvc_blend = w_best * pred_med + (1.0 - w_best) * fvc_res_test
submission_blend = pd.DataFrame({
    'Patient_Week': ss['Patient_Week'],
    'FVC': fvc_blend.astype(float),
    'Confidence': sigma_test.astype(float)
})
submission_blend.to_csv('submission.csv', index=False)
print('Saved blended submission.csv using w=', w_best)

Best OOF Laplace (blend): -5.96107 at w=0.25
Saved blended submission.csv using w= 0.25
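
The blend-weight search above is repeated almost verbatim in the CatBoost blend cell, so it could be factored into one function. The sketch below assumes the same scoring form as the `laplace_ll` helper defined in this notebook's cells; `tune_blend_weight` and `laplace_ll_notebook` are hypothetical names introduced only for illustration.

```python
import numpy as np

def laplace_ll_notebook(y_true, y_pred, sigma):
    """Clipped Laplace score in the same form as the notebook's helper."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sigma = np.maximum(np.asarray(sigma, dtype=float), 70.0)
    delta = np.minimum(np.abs(y_true - y_pred), 1000.0)
    return float(np.mean(-delta / sigma - np.log(sigma)))

def tune_blend_weight(y_true, pred_a, pred_b, sigma, n_steps=21):
    """Grid-search w in [0, 1] for the blend w * pred_a + (1 - w) * pred_b."""
    pred_a = np.asarray(pred_a, dtype=float)
    pred_b = np.asarray(pred_b, dtype=float)
    best_w, best_score = None, -np.inf
    for w in np.linspace(0.0, 1.0, n_steps):
        score = laplace_ll_notebook(y_true, w * pred_a + (1.0 - w) * pred_b, sigma)
        if score > best_score:
            best_w, best_score = float(w), score
    return best_w, best_score

# Toy check: pred_a is exact and pred_b is biased, so all weight should go to pred_a.
y = np.array([3000.0, 2800.0])
w, score = tune_blend_weight(y, y, y + 200.0, sigma=np.full(2, 100.0))
print(w, score)
```

Because sigma is held fixed while the weight is tuned, the search only rewards a lower clipped absolute error; sigma would need re-tuning if the blended errors differ much from those of the quantile median.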


In [12]:
# FE v2 + more regularized XGB residual model with extrapolation guardrails
import sys, subprocess, time, math, gc
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

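def laplace_ll(y_true, y_pred, sigma):
    # Clipped Laplace log-likelihood: sigma floored at 70, |error| capped at 1000.
    # NOTE: the official metric is -sqrt(2)*delta/sigma - log(sqrt(2)*sigma). The sqrt(2)
    # factors are omitted here, so these scores are not on the leaderboard scale.
    y_true = np.asarray(y_true).astype(float)
    y_pred = np.asarray(y_pred).astype(float)
    sigma = np.asarray(sigma).astype(float)
    delta = np.abs(y_true - y_pred)
    delta = np.minimum(delta, 1000.0)
    sigma = np.maximum(sigma, 70.0)
    return np.mean(-delta / sigma - np.log(sigma))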

def ensure_xgboost():
    try:
        import xgboost as xgb  # noqa
        return
    except Exception:
        print('Installing xgboost...', flush=True)
        subprocess.run([sys.executable, '-m', 'pip', 'install', 'xgboost==2.0.3', '--no-input'], check=True)
        import xgboost as xgb  # noqa

ensure_xgboost()
import xgboost as xgb

if 'train' not in globals():
    train = pd.read_csv('train.csv')
if 'test' not in globals():
    test = pd.read_csv('test.csv')

def compute_patient_slopes(df, patient_col='Patient', week_col='Weeks', target_col='FVC'):
    slopes = {}
    for pid, g in df.groupby(patient_col):
        if g.shape[0] >= 2:
            x = g[week_col].values.astype(float)
            y = g[target_col].values.astype(float)
            xm = x.mean(); ym = y.mean()
            denom = ((x - xm)**2).sum()
            slope = ((x - xm) * (y - ym)).sum() / denom if denom > 0 else 0.0
            slopes[pid] = slope
    return slopes

def compute_trend_stats(df, patient_col='Patient', week_col='Weeks', target_col='FVC'):
    rows = []
    for pid, g in df.groupby(patient_col):
        n = g.shape[0]
        if n >= 2:
            x = g[week_col].values.astype(float)
            y = g[target_col].values.astype(float)
            xm = x.mean(); ym = y.mean()
            denom = ((x - xm)**2).sum()
            if denom > 0:
                slope = ((x - xm) * (y - ym)).sum() / denom
                yhat = ym + slope * (x - xm)
                ss_res = ((y - yhat)**2).sum()
                ss_tot = ((y - ym)**2).sum()
                r2 = 1.0 - (ss_res / ss_tot) if ss_tot > 0 else 0.0
            else:
                slope, r2 = 0.0, 0.0
            rows.append((pid, slope, r2, n))
        else:
            rows.append((pid, 0.0, 0.0, n))
    return pd.DataFrame(rows, columns=[patient_col, 'slope_w', 'r2_w', 'n_obs'])

def compute_percent_trend_stats(df, patient_col='Patient', week_col='Weeks', percent_col='Percent'):
    rows = []
    for pid, g in df.groupby(patient_col):
        n = g.shape[0]
        if n >= 2:
            x = g[week_col].values.astype(float)
            y = g[percent_col].values.astype(float)
            xm = x.mean(); ym = y.mean()
            denom = ((x - xm)**2).sum()
            slope = ((x - xm) * (y - ym)).sum() / denom if denom > 0 else 0.0
            rows.append((pid, slope))
        else:
            rows.append((pid, 0.0))
    return pd.DataFrame(rows, columns=[patient_col, 'slope_percent_w'])

def robust_global_slope(slopes_dict):
    if not slopes_dict:
        return 0.0
    return float(np.median(list(slopes_dict.values())))

def one_hot_fit(df, cols):
    return {c: sorted(df[c].dropna().unique().tolist()) for c in cols}

def one_hot_transform(df, cats):
    out = df.copy()
    for c, values in cats.items():
        for v in values:
            out[f'{c}__{v}'] = (out[c] == v).astype(np.int8)
    return out

def build_features_v2(df, cap_wp=26):
    df = df.copy()
    df['Weeks_Passed'] = (df['Weeks'] - df['Base_Week']).astype(float)
    df['Abs_Weeks_Passed'] = df['Weeks_Passed'].abs()
    df['sign_WP'] = np.sign(df['Weeks_Passed']).astype(float)
    df['is_future'] = (df['Weeks_Passed'] > 0).astype(np.int8)
    df['Weeks_Passed_cap'] = df['Weeks_Passed'].clip(-cap_wp, cap_wp)
    df['Weeks_Passed2'] = df['Weeks_Passed_cap'] ** 2
    # Percent handling
    df['Percent_clipped'] = df['Percent'].clip(40, 120)
    df['Percent2'] = df['Percent_clipped'] ** 2
    # Base features
    df['log_BaseFVC'] = np.log1p(df['Base_FVC'].clip(lower=1))
    df['Estimated_TLC'] = df['Base_FVC'] / (df['Percent_clipped'] / 100.0)
    df['log_TLC'] = np.log1p(df['Estimated_TLC'].clip(lower=1))
    # Interactions (using capped WP where appropriate)
    df['Age_x_Percent'] = df['Age'] * df['Percent_clipped']
    df['Percent_x_BaseFVC'] = df['Percent_clipped'] * df['Base_FVC']
    df['WP_x_BaseFVC'] = df['Weeks_Passed_cap'] * df['Base_FVC']
    df['WP_x_Percent'] = df['Weeks_Passed_cap'] * df['Percent_clipped']
    df['WP_x_Age'] = df['Weeks_Passed_cap'] * df['Age']
    if 'slope_w' in df.columns:
        df['WP_x_slope_w'] = df['Weeks_Passed_cap'] * df['slope_w'].clip(-50, 10)
    else:
        df['WP_x_slope_w'] = 0.0
    if 'r2_w' in df.columns:
        df['WP_x_r2_w'] = df['Weeks_Passed_cap'] * df['r2_w']
    else:
        df['WP_x_r2_w'] = 0.0
    if 'slope_percent_w' in df.columns:
        df['WP_x_slope_percent_w'] = df['Weeks_Passed_cap'] * df['slope_percent_w']
    else:
        df['WP_x_slope_percent_w'] = 0.0
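    # dPercent relative to baseline percent if available.
    # NOTE: Percent_at_base is not clipped here. On the test grid both Percent and
    # Percent_at_base come from test.csv, so dPercent there is 0 unless that Percent
    # falls outside [40, 120]; in training, Percent varies per visit.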
    if 'Percent_at_base' in df.columns:
        df['dPercent'] = df['Percent_clipped'] - df['Percent_at_base']
        df['WP_x_dPercent'] = df['Weeks_Passed_cap'] * df['dPercent']
    else:
        df['dPercent'] = 0.0
        df['WP_x_dPercent'] = 0.0
    # flags
    if 'n_obs' not in df.columns:
        df['n_obs'] = 1
    if 'has_trend' not in df.columns:
        df['has_trend'] = 0
    df['is_singleton'] = (df['n_obs'] <= 1).astype(int)
    return df

def train_residual_and_sigma_v2(train_df, n_splits=5, seed=42):
    gkf = GroupKFold(n_splits=n_splits)  # `seed` is not passed to the splitter and is unused in this function
    groups = train_df['Patient'].values
    oof_pred = np.zeros(len(train_df), dtype=float)
    oof_res = np.zeros(len(train_df), dtype=float)
    oof_abs_res = np.zeros(len(train_df), dtype=float)
    oof_dist = np.zeros(len(train_df), dtype=float)
    t0 = time.time()
    for fold, (trn_idx, val_idx) in enumerate(gkf.split(train_df, groups=groups), 1):
        tf = time.time()
        trn = train_df.iloc[trn_idx].copy()
        val = train_df.iloc[val_idx].copy()
        # Anchor within TRAIN/VAL (baseline = earliest week within each split)
        base_trn = (trn.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False).first()[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'}))
        trn = trn.merge(base_trn, on='Patient', how='left')
        base_val = (val.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False).first()[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'}))
        val = val.merge(base_val, on='Patient', how='left')
        # Train-only trend stats (FVC and Percent)
        stats_trn = compute_trend_stats(trn)
        stats_trn['has_trend'] = (stats_trn['n_obs'] >= 2).astype(int)
        pstats_trn = compute_percent_trend_stats(trn)
        trn = trn.merge(stats_trn, on='Patient', how='left')
        trn = trn.merge(pstats_trn, on='Patient', how='left')
        val = val.merge(stats_trn, on='Patient', how='left')
        val = val.merge(pstats_trn, on='Patient', how='left')
        for c, v in [('slope_w', 0.0), ('r2_w', 0.0), ('slope_percent_w', 0.0)]:
            trn[c] = trn[c].fillna(v); val[c] = val[c].fillna(v)
        trn['n_obs'] = trn['n_obs'].fillna(1).astype(int)
        val['n_obs'] = val['n_obs'].fillna(1).astype(int)
        trn['has_trend'] = trn['has_trend'].fillna(0).astype(int)
        val['has_trend'] = val['has_trend'].fillna(0).astype(int)

        # Global slope from TRAIN patients only
        g_slope = robust_global_slope(compute_patient_slopes(trn))
        trn['pred0'] = trn['Base_FVC'] + g_slope * (trn['Weeks'] - trn['Base_Week'])
        val['pred0'] = val['Base_FVC'] + g_slope * (val['Weeks'] - val['Base_Week'])

        # Build features and one-hot
        cat_cols = ['Sex','SmokingStatus']
        cats = one_hot_fit(trn, cat_cols)
        trnF = build_features_v2(trn)
        valF = build_features_v2(val)
        trnF = one_hot_transform(trnF, cats)
        valF = one_hot_transform(valF, cats)

        feat_cols = [
            'Weeks_Passed','Abs_Weeks_Passed','Weeks_Passed_cap','Weeks_Passed2','sign_WP','is_future',
            'Percent_clipped','Percent2','Age','Base_FVC','log_BaseFVC','Estimated_TLC','log_TLC',
            'Age_x_Percent','Percent_x_BaseFVC','WP_x_BaseFVC','WP_x_Percent','WP_x_Age',
            'slope_w','r2_w','slope_percent_w','n_obs','has_trend','is_singleton','WP_x_slope_w','WP_x_r2_w','WP_x_slope_percent_w','dPercent','WP_x_dPercent'
        ] + [c for c in trnF.columns if c.startswith('Sex__') or c.startswith('SmokingStatus__')]

        # Residual targets
        y_trn = (trn['FVC'] - trn['pred0']).astype(float).values
        y_val = (val['FVC'] - val['pred0']).astype(float).values
        dtrain = xgb.DMatrix(trnF[feat_cols], label=y_trn)
        dvalid = xgb.DMatrix(valF[feat_cols], label=y_val)

        params = {
            'objective': 'reg:absoluteerror',
            'eval_metric': 'mae',
            'tree_method': 'gpu_hist',
            'learning_rate': 0.03,
            'max_depth': 4,
            'subsample': 0.85,
            'colsample_bytree': 0.85,
            'min_child_weight': 25,
            'lambda': 5.0,
            'verbosity': 0
        }
        watchlist = [(dtrain, 'trn'), (dvalid, 'val')]
        model = xgb.train(params, dtrain, num_boost_round=4000, evals=watchlist, early_stopping_rounds=300, verbose_eval=False)
        val_pred_res = model.predict(dvalid, iteration_range=(0, model.best_iteration+1))
        val_pred = val['pred0'].values + val_pred_res

        # Store OOF
        oof_pred[val_idx] = val_pred
        oof_res[val_idx] = val['FVC'].values - val_pred
        oof_abs_res[val_idx] = np.abs(oof_res[val_idx])
        oof_dist[val_idx] = np.abs(val['Weeks'].values - val['Base_Week'].values).astype(float)

        mae = float(np.mean(np.abs(val['FVC'].values - val_pred)))
        print(f'[XGBv2-Fold {fold}] n_trn={trn.shape[0]} n_val={val.shape[0]} g_slope={g_slope:.4f} MAE={mae:.2f} iters={model.best_iteration+1} elapsed={time.time()-tf:.2f}s', flush=True)

        del dtrain, dvalid, model, trnF, valF; gc.collect()

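    # Tune sigma = max(a + b*dist + s*|residual|, 70) on OOF.
    # NOTE: oof_abs_res is |FVC - oof_pred|, i.e. it uses the true target, while the
    # test-time sigma uses |res_pred|. The OOF score reported here is therefore optimistic.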
    grid_a = [70, 110, 160, 200, 240]
    grid_b = [0.5, 1.0, 2.0, 3.0]
    grid_s = [0.5, 1.0]
    best = (-1e9, None, None, None)
    for a in grid_a:
        for b in grid_b:
            for s in grid_s:
                sig = a + b * oof_dist + s * oof_abs_res
                score = laplace_ll(train_df['FVC'].values, oof_pred, sig)
                if score > best[0]:
                    best = (score, a, b, s)
    print(f'Best OOF Laplace (XGBv2): {best[0]:.5f} with a={best[1]} b={best[2]} s={best[3]}', flush=True)
    return oof_pred, oof_abs_res, oof_dist, best

# Train FE v2 residual model
oof_pred_v2, oof_abs_v2, oof_dist_v2, (best_ll_v2, a2, b2, s2) = train_residual_and_sigma_v2(train)

# Fit final model on full data with FE v2 and generate submission
slopes_full = compute_patient_slopes(train)
g_slope_full = robust_global_slope(slopes_full)
base_full = (train.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False).first()[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'}))
train_full_v2 = train.merge(base_full, on='Patient', how='left')
stats_full = compute_trend_stats(train_full_v2)
stats_full['has_trend'] = (stats_full['n_obs'] >= 2).astype(int)
pstats_full = compute_percent_trend_stats(train_full_v2)
train_full_v2 = train_full_v2.merge(stats_full, on='Patient', how='left').merge(pstats_full, on='Patient', how='left')
for c, v in [('slope_w', 0.0), ('r2_w', 0.0), ('slope_percent_w', 0.0)]:
    train_full_v2[c] = train_full_v2[c].fillna(v)
train_full_v2['n_obs'] = train_full_v2['n_obs'].fillna(1).astype(int)
train_full_v2['has_trend'] = train_full_v2['has_trend'].fillna(0).astype(int)
train_full_v2['pred0'] = train_full_v2['Base_FVC'] + g_slope_full * (train_full_v2['Weeks'] - train_full_v2['Base_Week'])
cats_full = one_hot_fit(train_full_v2, ['Sex','SmokingStatus'])
train_full_v2F = build_features_v2(train_full_v2)
train_full_v2F = one_hot_transform(train_full_v2F, cats_full)
feat_cols_full_v2 = [
    'Weeks_Passed','Abs_Weeks_Passed','Weeks_Passed_cap','Weeks_Passed2','sign_WP','is_future',
    'Percent_clipped','Percent2','Age','Base_FVC','log_BaseFVC','Estimated_TLC','log_TLC',
    'Age_x_Percent','Percent_x_BaseFVC','WP_x_BaseFVC','WP_x_Percent','WP_x_Age',
    'slope_w','r2_w','slope_percent_w','n_obs','has_trend','is_singleton','WP_x_slope_w','WP_x_r2_w','WP_x_slope_percent_w','dPercent','WP_x_dPercent'
] + [c for c in train_full_v2F.columns if c.startswith('Sex__') or c.startswith('SmokingStatus__')]
y_full = (train_full_v2F['FVC'] - train_full_v2F['pred0']).astype(float).values
dtrain_full = xgb.DMatrix(train_full_v2F[feat_cols_full_v2], label=y_full)
params_full = {
    'objective': 'reg:absoluteerror',
    'eval_metric': 'mae',
    'tree_method': 'gpu_hist',
    'learning_rate': 0.03,
    'max_depth': 4,
    'subsample': 0.85,
    'colsample_bytree': 0.85,
    'min_child_weight': 25,
    'lambda': 5.0,
    'verbosity': 0
}
# Fixed round count (not derived from the CV best iterations)
model_full_v2 = xgb.train(params_full, dtrain_full, num_boost_round=800)

# Build test grid and predict
ss = pd.read_csv('sample_submission.csv')
grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]
grid['Weeks'] = parts[1].astype(int)
test_bl = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_bl, on='Patient', how='left')
meta = test[['Patient','Percent','Age','Sex','SmokingStatus']].drop_duplicates('Patient')
grid = grid.merge(meta, on='Patient', how='left', suffixes=('', '_meta'))
grid = grid.merge(stats_full, on='Patient', how='left').merge(pstats_full, on='Patient', how='left')
for c, v in [('slope_w', 0.0), ('r2_w', 0.0), ('slope_percent_w', 0.0)]:
    grid[c] = grid[c].fillna(v)
grid['n_obs'] = grid['n_obs'].fillna(1).astype(int)
grid['has_trend'] = grid['has_trend'].fillna(0).astype(int)
grid['is_singleton'] = (grid['n_obs'] <= 1).astype(int)
grid['pred0'] = grid['Base_FVC'] + g_slope_full * (grid['Weeks'] - grid['Base_Week'])
gridF = build_features_v2(grid)
gridF = one_hot_transform(gridF, cats_full)
dgrid = xgb.DMatrix(gridF[feat_cols_full_v2])
res_pred = model_full_v2.predict(dgrid)
fvc_pred = (gridF['pred0'].values + res_pred).clip(500, 6000)
dist = (gridF['Weeks'] - gridF['Base_Week']).abs().astype(float).values
sigma = np.maximum(a2 + b2 * dist + s2 * np.abs(res_pred), 70.0)

submission_v2 = pd.DataFrame({
    'Patient_Week': ss['Patient_Week'],
    'FVC': fvc_pred.astype(float),
    'Confidence': sigma.astype(float)
})
submission_v2.to_csv('submission.csv', index=False)
print('Saved FE v2 XGB residual submission.csv. Best OOF Laplace (XGBv2):', f'{best_ll_v2:.5f}')

[XGBv2-Fold 1] n_trn=1112 n_val=282 g_slope=-3.8062 MAE=45.48 iters=3939 elapsed=7.72s


[XGBv2-Fold 2] n_trn=1113 n_val=281 g_slope=-3.5547 MAE=13.49 iters=3783 elapsed=7.68s


[XGBv2-Fold 3] n_trn=1119 n_val=275 g_slope=-3.5065 MAE=26.64 iters=3988 elapsed=7.78s


[XGBv2-Fold 4] n_trn=1119 n_val=275 g_slope=-3.5065 MAE=24.62 iters=3883 elapsed=7.72s


[XGBv2-Fold 5] n_trn=1113 n_val=281 g_slope=-3.6557 MAE=14.72 iters=2450 elapsed=5.30s


Best OOF Laplace (XGBv2): -4.65387 with a=70 b=0.5 s=0.5


Saved FE v2 XGB residual submission.csv. Best OOF Laplace (XGBv2): -4.65387
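
For comparison with leaderboard scores, the competition metric as published includes a factor of sqrt(2) in both terms, which the `laplace_ll` helper in the cell above omits. The sketch below is our reading of that definition; the function name `laplace_ll_official` and its keyword arguments are hypothetical.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def laplace_ll_official(y_true, y_pred, sigma, sigma_floor=70.0, delta_cap=1000.0):
    """Mean of -sqrt(2)*delta/sigma_c - ln(sqrt(2)*sigma_c).

    sigma_c = max(sigma, 70) and delta = min(|y_true - y_pred|, 1000).
    The name and keyword arguments are assumptions made for this sketch.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sigma_c = np.maximum(np.asarray(sigma, dtype=float), sigma_floor)
    delta = np.minimum(np.abs(y_true - y_pred), delta_cap)
    return float(np.mean(-SQRT2 * delta / sigma_c - np.log(SQRT2 * sigma_c)))

# A perfect prediction at the sigma floor gives the best attainable per-row score.
print(laplace_ll_official([3000.0], [3000.0], [70.0]))
```

Under this form the sigma that maximizes the score for a given error level differs from the one favored by the unscaled helper, so the tuned `a`, `b`, and `s` values may shift when the metric is swapped.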


In [None]:
# CatBoost residual model (GPU) + blend with XGB residual; keep sigma from XGB residual model
import sys, subprocess, gc, time
import numpy as np
import pandas as pd

def ensure_catboost():
    try:
        import catboost  # noqa: F401
        return
    except Exception:
        print('Installing catboost...', flush=True)
        subprocess.run([sys.executable, '-m', 'pip', 'install', 'catboost==1.2.5', '--no-input'], check=True)
        import catboost  # noqa: F401

ensure_catboost()
from catboost import CatBoostRegressor, Pool
from sklearn.model_selection import GroupKFold

# Use same train/test already in memory; reuse feature builders and stats_full/g_slope_full from cell 4
def train_catboost_residual(train_df, n_splits=5, seed=42):
    gkf = GroupKFold(n_splits=n_splits)
    groups = train_df['Patient'].values
    oof_pred = np.zeros(len(train_df), dtype=float)
    folds_info = []
    t0 = time.time()
    for fold, (trn_idx, val_idx) in enumerate(gkf.split(train_df, groups=groups), 1):
        tf = time.time()
        trn = train_df.iloc[trn_idx].copy()
        val = train_df.iloc[val_idx].copy()
        # Anchor within TRAIN/VAL
        base_trn = (trn.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False).first()[['Patient','Weeks','FVC']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC'}))
        trn = trn.merge(base_trn, on='Patient', how='left')
        base_val = (val.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False).first()[['Patient','Weeks','FVC']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC'}))
        val = val.merge(base_val, on='Patient', how='left')
        # Trend stats from TRAIN-only
        stats_trn = compute_trend_stats(trn)
        stats_trn['has_trend'] = (stats_trn['n_obs'] >= 2).astype(int)
        trn = trn.merge(stats_trn, on='Patient', how='left')
        val = val.merge(stats_trn, on='Patient', how='left')
        for c, v in [('slope_w', 0.0), ('r2_w', 0.0)]:
            trn[c] = trn[c].fillna(v); val[c] = val[c].fillna(v)
        trn['n_obs'] = trn['n_obs'].fillna(1).astype(int)
        val['n_obs'] = val['n_obs'].fillna(1).astype(int)
        trn['has_trend'] = trn['has_trend'].fillna(0).astype(int)
        val['has_trend'] = val['has_trend'].fillna(0).astype(int)
        trn['is_singleton'] = (trn['n_obs'] <= 1).astype(int)
        val['is_singleton'] = (val['n_obs'] <= 1).astype(int)

        # Baseline pred0 using global slope from TRAIN patients only
        g_slope = robust_global_slope(compute_patient_slopes(trn))
        trn['pred0'] = trn['Base_FVC'] + g_slope * (trn['Weeks'] - trn['Base_Week'])
        val['pred0'] = val['Base_FVC'] + g_slope * (val['Weeks'] - val['Base_Week'])

        # Build features and one-hot
        trnF = build_features(trn)
        valF = build_features(val)
        cats = one_hot_fit(trnF, ['Sex','SmokingStatus'])
        trnF = one_hot_transform(trnF, cats)
        valF = one_hot_transform(valF, cats)
        feat_cols = [
            'Weeks_Passed','Abs_Weeks_Passed','Weeks_Passed2','Weeks_Passed3','Percent','Percent2','Age','Base_FVC',
            'Age_x_Percent','Percent_x_BaseFVC','WP_x_BaseFVC','WP_x_Percent','WP_x_Age',
            'slope_w','r2_w','n_obs','has_trend','is_singleton','WP_x_slope_w','WP_x_r2_w'
        ] + [c for c in trnF.columns if c.startswith('Sex__') or c.startswith('SmokingStatus__')]

        y_trn = (trn['FVC'] - trn['pred0']).astype(float).values
        y_val = (val['FVC'] - val['pred0']).astype(float).values

        train_pool = Pool(trnF[feat_cols], label=y_trn)
        valid_pool = Pool(valF[feat_cols], label=y_val)

        model = CatBoostRegressor(
            loss_function='MAE',
            depth=6,
            learning_rate=0.04,
            l2_leaf_reg=6.0,
            bootstrap_type='Bernoulli',  # GPU default (Bayesian) does not accept `subsample`
            subsample=0.8,
            iterations=5000,
            od_type='Iter',
            od_wait=300,
            task_type='GPU',
            random_seed=seed,
            verbose=False
        )
        model.fit(train_pool, eval_set=valid_pool, use_best_model=True, verbose=False)
        pred_res = model.predict(valid_pool)
        val_pred = val['pred0'].values + pred_res
        oof_pred[val_idx] = val_pred
        mae = float(np.mean(np.abs(val['FVC'].values - val_pred)))
        print(f'[CB-Fold {fold}] n_trn={trn.shape[0]} n_val={val.shape[0]} g_slope={g_slope:.4f} MAE={mae:.2f} elapsed={time.time()-tf:.2f}s', flush=True)
        # Record the best iteration so the full-data fit can size its iteration budget
        folds_info.append({'fold': fold, 'best_iter': int(model.get_best_iteration())})
        del model, train_pool, valid_pool, trnF, valF; gc.collect()
    print(f'CatBoost residual OOF ready in {time.time()-t0:.2f}s')
    return oof_pred, folds_info

# Train CatBoost residual OOF
oof_pred_cb, folds_info_cb = train_catboost_residual(train)

# Tune blend between XGB residual (oof_pred2) and CatBoost residual (oof_pred_cb); sigma from XGB residual pipeline
y_true = train['FVC'].values.astype(float)
sigma_oof_res = np.maximum(best_a2 + best_b2 * oof_dist2 + best_s2 * oof_abs_res2, 70.0)
best = (-1e9, None)
for w in np.linspace(0.0, 1.0, 21):
    y_pred_blend = w * oof_pred2 + (1.0 - w) * oof_pred_cb
    score = laplace_ll(y_true, y_pred_blend, sigma_oof_res)
    if score > best[0]:
        best = (score, w)
print(f'Best OOF Laplace (XGB-CB residual blend): {best[0]:.5f} at w={best[1]:.2f}')
w_xgb = best[1] if best[1] is not None else 1.0

# Fit CatBoost residual on full data and generate test predictions
train_full_cb = train_full.copy()  # from cell 4, already has Base_Week/Base_FVC, trend stats, features schema ready
train_full_cbF = train_full_cb.copy()
train_full_cbF = one_hot_transform(train_full_cbF, cats_full)
feat_cols_full_cb = feat_cols_full  # same feature list as XGB residual one-hot
y_full_cb = (train_full_cbF['FVC'] - train_full_cbF['pred0']).astype(float).values
pool_full = Pool(train_full_cbF[feat_cols_full_cb], label=y_full_cb)
model_cb_full = CatBoostRegressor(
    loss_function='MAE', depth=6, learning_rate=0.04, l2_leaf_reg=6.0,
    bootstrap_type='Bernoulli', subsample=0.8,  # GPU default (Bayesian) does not accept `subsample`
    iterations=int(np.median([fi['best_iter'] for fi in folds_info_cb]))*2 + 300,
    task_type='GPU', random_seed=42, verbose=False
)
model_cb_full.fit(pool_full, verbose=False)

# Build test grid (same as in cell 4) and predict residuals with CatBoost
grid = pd.read_csv('sample_submission.csv')
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]
grid['Weeks'] = parts[1].astype(int)
test_bl = test[['Patient','Weeks','FVC']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC'})
grid = grid.merge(test_bl, on='Patient', how='left')
meta = test[['Patient','Percent','Age','Sex','SmokingStatus']].drop_duplicates('Patient')
grid = grid.merge(meta, on='Patient', how='left')
grid = grid.merge(stats_full, on='Patient', how='left')
for c, v in [('slope_w', 0.0), ('r2_w', 0.0)]:
    grid[c] = grid[c].fillna(v)
grid['n_obs'] = grid['n_obs'].fillna(1).astype(int)
grid['has_trend'] = grid['has_trend'].fillna(0).astype(int)
grid['is_singleton'] = (grid['n_obs'] <= 1).astype(int)
grid['pred0'] = grid['Base_FVC'] + g_slope_full * (grid['Weeks'] - grid['Base_Week'])
gridF = build_features(grid)
gridF = one_hot_transform(gridF, cats_full)
pool_grid = Pool(gridF[feat_cols_full_cb])
res_cb = model_cb_full.predict(pool_grid)
fvc_cb = (gridF['pred0'].values + res_cb).clip(500, 6000)

# Recompute the XGB residual test prediction on this grid (cell 4 already stored it as sub['FVC'])
# so that both models are evaluated on exactly the same feature frame
dgrid = xgb.DMatrix(gridF[feat_cols_full_cb])
res_xgb = model_full.predict(dgrid)
fvc_xgb = (gridF['pred0'].values + res_xgb).clip(500, 6000)

# Blend XGB and CB residual FVC with tuned weight; use sigma from XGB residual pipeline for test
fvc_blended = w_xgb * fvc_xgb + (1.0 - w_xgb) * fvc_cb
sigma_test_residual = np.maximum(best_a2 + best_b2 * (gridF['Weeks'] - gridF['Base_Week']).abs().astype(float).values + best_s2 * np.abs(res_xgb), 70.0)

submission_cbblend = pd.DataFrame({
    'Patient_Week': pd.read_csv('sample_submission.csv')['Patient_Week'],
    'FVC': fvc_blended.astype(float),
    'Confidence': sigma_test_residual.astype(float)
})
submission_cbblend.to_csv('submission.csv', index=False)
print('Saved CB+XGB residual blended submission.csv with w_xgb=', w_xgb)
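
The blend above takes `w_xgb` as already tuned. A minimal sketch of how such a weight could be chosen on out-of-fold predictions follows. `tune_blend_weight` and its array arguments are hypothetical names, not objects from this notebook, and selecting by MAE is an assumption made only for illustration.

```python
import numpy as np

def tune_blend_weight(y_true, pred_a, pred_b, weights=None):
    """Return (best_w, best_mae) for the blend w * pred_a + (1 - w) * pred_b."""
    y_true = np.asarray(y_true, dtype=float)
    pred_a = np.asarray(pred_a, dtype=float)
    pred_b = np.asarray(pred_b, dtype=float)
    if weights is None:
        weights = np.linspace(0.0, 1.0, 21)  # 0.00, 0.05, ..., 1.00
    best_w, best_mae = None, np.inf
    for w in weights:
        blend = w * pred_a + (1.0 - w) * pred_b
        mae = float(np.mean(np.abs(y_true - blend)))
        if mae < best_mae:  # strict '<' keeps the first minimiser on ties
            best_w, best_mae = float(w), mae
    return best_w, best_mae

# Toy data (made up): model A is 100 too high, model B is 100 too low,
# so an even blend reproduces the target exactly.
y_demo = np.array([2500.0, 2800.0, 3100.0])
w_demo, mae_demo = tune_blend_weight(y_demo, y_demo + 100.0, y_demo - 100.0)
print(w_demo, mae_demo)
```

In practice the two prediction arrays would need to be OOF predictions on the same forward-scored rows; otherwise the selected weight is fitted to in-sample error.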

In [13]:
# Forward/grid CV to mimic test: anchored baseline model + dist-only sigma
import time, gc
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

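def laplace_ll(y_true, y_pred, sigma):
    """Mean clipped Laplace log-likelihood (higher is better).

    Note: the official competition metric is -sqrt(2)*delta/sigma - log(sqrt(2)*sigma).
    This version omits the sqrt(2) factors, so its values (and the sigma it favours)
    are not on the leaderboard scale.
    """
    y_true = np.asarray(y_true).astype(float)
    y_pred = np.asarray(y_pred).astype(float)
    sigma = np.asarray(sigma).astype(float)
    delta = np.minimum(np.abs(y_true - y_pred), 1000.0)  # cap the absolute error at 1000
    sigma = np.maximum(sigma, 70.0)  # floor the confidence at 70
    return np.mean(-delta / sigma - np.log(sigma))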

def compute_patient_slopes(df, patient_col='Patient', week_col='Weeks', target_col='FVC'):
    slopes = {}
    for pid, g in df.groupby(patient_col):
        if g.shape[0] >= 2:
            x = g[week_col].values.astype(float)
            y = g[target_col].values.astype(float)
            xm = x.mean(); ym = y.mean()
            denom = ((x - xm)**2).sum()
            slope = ((x - xm) * (y - ym)).sum() / denom if denom > 0 else 0.0
            slopes[pid] = slope
    return slopes

def robust_global_slope(slopes_dict):
    if not slopes_dict:
        return 0.0
    return float(np.median(list(slopes_dict.values())))

def compute_trend_stats(df, patient_col='Patient', week_col='Weeks', target_col='FVC'):
    rows = []
    for pid, g in df.groupby(patient_col):
        n = g.shape[0]
        if n >= 2:
            x = g[week_col].values.astype(float)
            y = g[target_col].values.astype(float)
            xm = x.mean(); ym = y.mean()
            denom = ((x - xm)**2).sum()
            if denom > 0:
                slope = ((x - xm) * (y - ym)).sum() / denom
                yhat = ym + slope * (x - xm)
                ss_res = ((y - yhat)**2).sum()
                ss_tot = ((y - ym)**2).sum()
                r2 = 1.0 - (ss_res / ss_tot) if ss_tot > 0 else 0.0
            else:
                slope, r2 = 0.0, 0.0
            rows.append((pid, slope, r2, n))
        else:
            rows.append((pid, 0.0, 0.0, n))
    return pd.DataFrame(rows, columns=[patient_col, 'slope_w', 'r2_w', 'n_obs'])

def forward_grid_cv_baseline(train_df, n_splits=5, seed=42):
    gkf = GroupKFold(n_splits=n_splits)  # no shuffling, so the split is deterministic and `seed` is unused
    groups = train_df['Patient'].values
    y_true_all, y_pred_all, dist_all = [], [], []
    t0 = time.time()
    for fold, (trn_idx, val_idx) in enumerate(gkf.split(train_df, groups=groups), 1):
        tf = time.time()
        trn = train_df.iloc[trn_idx].copy()
        val = train_df.iloc[val_idx].copy()

        # Guards: disjoint patients
        train_p = set(trn['Patient'].unique().tolist())
        val_p = set(val['Patient'].unique().tolist())
        assert train_p.isdisjoint(val_p), 'Fold leakage: overlapping patients'

        # TRAIN-only trend stats and merge into VAL with backoff zeros (diagnostic only here)
        stats_trn = compute_trend_stats(trn)
        stats_trn['has_trend'] = (stats_trn['n_obs'] >= 2).astype(int)
        val_stats = val.merge(stats_trn, on='Patient', how='left')
        for c, v in [('slope_w', 0.0), ('r2_w', 0.0), ('n_obs', 1), ('has_trend', 0)]:
            val_stats[c] = val_stats[c].fillna(v)
        val_stats['is_singleton'] = (val_stats['n_obs'] <= 1).astype(int)
        # Assert VAL patients have backoff zeros
        assert (val_stats['slope_w'] == 0).all() and (val_stats['r2_w'] == 0).all(), 'VAL trend stats not zeroed'
        assert (val_stats['n_obs'] == 1).all() and (val_stats['has_trend'] == 0).all() and (val_stats['is_singleton'] == 1).all(), 'VAL flags incorrect'

        # Global slope from TRAIN patients only
        g_slope = robust_global_slope(compute_patient_slopes(trn))

        # For each VAL patient: baseline = earliest week within VAL; score weeks >= baseline that have GT
        vb = (val.sort_values(['Patient','Weeks'])
                .groupby('Patient', as_index=False)
                .first()[['Patient','Weeks','FVC']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC'}))
        val = val.merge(vb, on='Patient', how='left')
        # Baseline minimal checks
        assert (val.groupby('Patient')['Base_Week'].transform('min') == val['Base_Week']).all(), 'Base_Week not min within VAL'

        cnt_scored = 0
        for pid, g in val.groupby('Patient'):
            bw = int(g['Base_Week'].iloc[0])
            bfvc = float(g['Base_FVC'].iloc[0])
            g2 = g[g['Weeks'] >= bw].copy()  # forward-only scoring
            if g2.empty:
                continue
            pred = bfvc + g_slope * (g2['Weeks'].values.astype(float) - bw)
            dist = (g2['Weeks'].values.astype(float) - bw)
            assert (dist >= 0).all(), 'Negative dist in forward scoring'
            y_true_all.append(g2['FVC'].values.astype(float))
            y_pred_all.append(pred.astype(float))
            dist_all.append(dist.astype(float))
            cnt_scored += len(g2)

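        # Fold-level MAE over all forward-scored rows (the previous [-1:] slice covered only the last patient).
        # Diagnostic only; it is not part of the log line below.
        fwd = val['Weeks'] >= val['Base_Week']
        fold_pred = val.loc[fwd, 'Base_FVC'] + g_slope * (val.loc[fwd, 'Weeks'] - val.loc[fwd, 'Base_Week'])
        mae = float(np.mean(np.abs(val.loc[fwd, 'FVC'] - fold_pred))) if cnt_scored > 0 else np.nan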
        print(f'[FWD-Fold {fold}] n_trn={trn.shape[0]} n_val={val.shape[0]} patients={len(val_p)} g_slope={g_slope:.4f} scored_rows={cnt_scored} elapsed={time.time()-tf:.2f}s', flush=True)

        del trn, val, val_stats; gc.collect()

    y_true = np.concatenate(y_true_all) if len(y_true_all)>0 else np.array([], float)
    y_pred = np.concatenate(y_pred_all) if len(y_pred_all)>0 else np.array([], float)
    dist = np.concatenate(dist_all) if len(dist_all)>0 else np.array([], float)
    print(f'Total scored rows: {y_true.shape[0]} (of {len(train_df)}) in {time.time()-t0:.2f}s')

    # Tune dist-only sigma
    grid_a = [120, 160, 200, 240]
    grid_b = [1.0, 2.0, 3.0]
    best = (-1e9, None, None)
    for a in grid_a:
        for b in grid_b:
            sig = a + b * dist
            score = laplace_ll(y_true, y_pred, sig)
            if score > best[0]:
                best = (score, a, b)
    print(f'Forward-CV OOF Laplace (anchored baseline): {best[0]:.5f} with a={best[1]} b={best[2]}')
    return y_true, y_pred, dist, best

# Run forward/grid CV anchored baseline
y_true_fwd, y_pred_fwd, dist_fwd, (best_ll_fwd, a_fwd, b_fwd) = forward_grid_cv_baseline(train)

# Train final anchored baseline on full data and write submission
full_slopes = compute_patient_slopes(train)
g_slope_full = robust_global_slope(full_slopes)
ss = pd.read_csv('sample_submission.csv')
grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]
grid['Weeks'] = parts[1].astype(int)
test_bl = test[['Patient','Weeks','FVC']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC'})
grid = grid.merge(test_bl, on='Patient', how='left')
grid['FVC'] = (grid['Base_FVC'] + g_slope_full * (grid['Weeks'] - grid['Base_Week'])).clip(500, 6000)
grid['dist'] = (grid['Weeks'] - grid['Base_Week']).abs().astype(float)
grid['Confidence'] = np.maximum(a_fwd + b_fwd * grid['dist'], 70.0)
submission_fwd = grid[['Patient_Week','FVC','Confidence']].copy()
submission_fwd.to_csv('submission.csv', index=False)
print('Saved forward-CV aligned submission.csv. Reported OOF LL:', f'{best_ll_fwd:.5f}', 'Full-train global slope:', f'{g_slope_full:.4f}')

[FWD-Fold 1] n_trn=1112 n_val=282 patients=32 g_slope=-3.8062 scored_rows=282 elapsed=0.03s
[FWD-Fold 2] n_trn=1113 n_val=281 patients=32 g_slope=-3.5547 scored_rows=281 elapsed=0.03s
[FWD-Fold 3] n_trn=1119 n_val=275 patients=31 g_slope=-3.5065 scored_rows=275 elapsed=0.03s
[FWD-Fold 4] n_trn=1119 n_val=275 patients=31 g_slope=-3.5065 scored_rows=275 elapsed=0.03s
[FWD-Fold 5] n_trn=1113 n_val=281 patients=32 g_slope=-3.6557 scored_rows=281 elapsed=0.03s
Total scored rows: 1394 (of 1394) in 0.51s
Forward-CV OOF Laplace (anchored baseline): -5.92555 with a=120 b=2.0
Saved forward-CV aligned submission.csv. Reported OOF LL: -5.92555 Full-train global slope: -3.6341
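
For reference when reading the scores above: the competition's published metric scales both terms by √2, that is, −√2·Δ/σ − ln(√2·σ) with Δ capped at 1000 and σ floored at 70, whereas `laplace_ll` in this cell omits those factors. The sketch below puts the two forms side by side. `laplace_ll_notebook` and `laplace_ll_official` are illustrative names, and the toy arrays are made up.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def _clip_inputs(y_true, y_pred, sigma):
    """Apply the metric's clipping: |error| capped at 1000, sigma floored at 70."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    delta = np.minimum(np.abs(y_true - y_pred), 1000.0)
    return delta, np.maximum(sigma, 70.0)

def laplace_ll_notebook(y_true, y_pred, sigma):
    """Form used in this notebook (no sqrt(2) factors)."""
    delta, sigma = _clip_inputs(y_true, y_pred, sigma)
    return float(np.mean(-delta / sigma - np.log(sigma)))

def laplace_ll_official(y_true, y_pred, sigma):
    """Form stated on the competition page (with sqrt(2) factors)."""
    delta, sigma = _clip_inputs(y_true, y_pred, sigma)
    return float(np.mean(-SQRT2 * delta / sigma - np.log(SQRT2 * sigma)))

# Toy example: errors of 50, 50 and 150 with a constant sigma of 200.
y_true_demo = np.array([2500.0, 2400.0, 2300.0])
y_pred_demo = np.array([2450.0, 2450.0, 2450.0])
sigma_demo = np.full(3, 200.0)
ll_nb = laplace_ll_notebook(y_true_demo, y_pred_demo, sigma_demo)
ll_off = laplace_ll_official(y_true_demo, y_pred_demo, sigma_demo)
print(f'notebook form: {ll_nb:.5f}  official form: {ll_off:.5f}')
```

Setting the derivative with respect to σ to zero shows the practical consequence: for a fixed error Δ the notebook form is maximised at σ = Δ, while the official form is maximised at σ = √2·Δ. Confidence values tuned against the notebook form may therefore be on the small side when scored with the official one.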


In [19]:
# Forward/grid CV residual XGB with capped time features + sigma |residual| GBM
import sys, subprocess, time, math, gc
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

def ensure_xgboost():
    try:
        import xgboost as xgb  # noqa
        return
    except Exception:
        print('Installing xgboost...', flush=True)
        subprocess.run([sys.executable, '-m', 'pip', 'install', 'xgboost==2.0.3', '--no-input'], check=True)
        import xgboost as xgb  # noqa

ensure_xgboost()
import xgboost as xgb

if 'train' not in globals():
    train = pd.read_csv('train.csv')
if 'test' not in globals():
    test = pd.read_csv('test.csv')

# Metric
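def laplace_ll(y_true, y_pred, sigma):
    # Clipped Laplace log-likelihood, higher is better. The official metric also carries
    # sqrt(2) factors (-sqrt(2)*delta/sigma - log(sqrt(2)*sigma)) that are omitted here.
    y_true = np.asarray(y_true).astype(float)
    y_pred = np.asarray(y_pred).astype(float)
    sigma = np.asarray(sigma).astype(float)
    delta = np.abs(y_true - y_pred)
    delta = np.minimum(delta, 1000.0)
    sigma = np.maximum(sigma, 70.0)
    return np.mean(-delta / sigma - np.log(sigma))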

def compute_patient_slopes(df, patient_col='Patient', week_col='Weeks', target_col='FVC'):
    slopes = {}
    for pid, g in df.groupby(patient_col):
        if g.shape[0] >= 2:
            x = g[week_col].values.astype(float)
            y = g[target_col].values.astype(float)
            xm = x.mean(); ym = y.mean()
            denom = ((x - xm)**2).sum()
            slope = ((x - xm) * (y - ym)).sum() / denom if denom > 0 else 0.0
            slopes[pid] = slope
    return slopes

def robust_global_slope(slopes_dict):
    if not slopes_dict:
        return 0.0
    return float(np.median(list(slopes_dict.values())))

def compute_trend_stats(df, patient_col='Patient', week_col='Weeks', target_col='FVC'):
    rows = []
    for pid, g in df.groupby(patient_col):
        n = g.shape[0]
        if n >= 2:
            x = g[week_col].values.astype(float)
            y = g[target_col].values.astype(float)
            xm = x.mean(); ym = y.mean()
            denom = ((x - xm)**2).sum()
            if denom > 0:
                slope = ((x - xm) * (y - ym)).sum() / denom
                yhat = ym + slope * (x - xm)
                ss_res = ((y - yhat)**2).sum()
                ss_tot = ((y - ym)**2).sum()
                r2 = 1.0 - (ss_res / ss_tot) if ss_tot > 0 else 0.0
            else:
                slope, r2 = 0.0, 0.0
            rows.append((pid, slope, r2, n))
        else:
            rows.append((pid, 0.0, 0.0, n))
    return pd.DataFrame(rows, columns=[patient_col, 'slope_w', 'r2_w', 'n_obs'])

def compute_percent_trend_stats(df, patient_col='Patient', week_col='Weeks', percent_col='Percent'):
    rows = []
    for pid, g in df.groupby(patient_col):
        n = g.shape[0]
        if n >= 2:
            x = g[week_col].values.astype(float)
            y = g[percent_col].values.astype(float)
            xm = x.mean(); ym = y.mean()
            denom = ((x - xm)**2).sum()
            slope = ((x - xm) * (y - ym)).sum() / denom if denom > 0 else 0.0
            rows.append((pid, slope))
        else:
            rows.append((pid, 0.0))
    return pd.DataFrame(rows, columns=[patient_col, 'slope_percent_w'])

def one_hot_fit(df, cols):
    return {c: sorted(df[c].dropna().unique().tolist()) for c in cols}

def one_hot_transform(df, cats):
    out = df.copy()
    for c, values in cats.items():
        for v in values:
            out[f'{c}__{v}'] = (out[c] == v).astype(np.int8)
    return out

def build_features_v2(df, cap_wp=26):
    df = df.copy()
    df['Weeks_Passed'] = (df['Weeks'] - df['Base_Week']).astype(float)
    df['Abs_Weeks_Passed'] = df['Weeks_Passed'].abs()
    df['is_future'] = (df['Weeks_Passed'] > 0).astype(np.int8)
    df['sign_WP'] = np.sign(df['Weeks_Passed']).astype(float)
    df['Weeks_Passed_cap'] = df['Weeks_Passed'].clip(-cap_wp, cap_wp)
    df['Weeks_Passed2'] = df['Weeks_Passed_cap'] ** 2
    # Percent handling: USE BASELINE ONLY to avoid leakage
    df['Percent_clipped'] = df['Percent_at_base'].clip(40, 120)
    df['Percent2'] = df['Percent_clipped'] ** 2
    df['log_BaseFVC'] = np.log1p(df['Base_FVC'].clip(lower=1))
    df['Estimated_TLC'] = df['Base_FVC'] / (df['Percent_clipped'] / 100.0)
    df['log_TLC'] = np.log1p(df['Estimated_TLC'].clip(lower=1))
    df['Age_x_Percent'] = df['Age'] * df['Percent_clipped']
    df['Percent_x_BaseFVC'] = df['Percent_clipped'] * df['Base_FVC']
    df['WP_x_BaseFVC'] = df['Weeks_Passed_cap'] * df['Base_FVC']
    df['WP_x_Percent'] = df['Weeks_Passed_cap'] * df['Percent_clipped']
    df['WP_x_Age'] = df['Weeks_Passed_cap'] * df['Age']
    df['slope_w'] = df.get('slope_w', 0.0)
    df['r2_w'] = df.get('r2_w', 0.0)
    df['slope_percent_w'] = df.get('slope_percent_w', 0.0)
    df['WP_x_slope_w'] = df['Weeks_Passed_cap'] * pd.Series(df['slope_w']).clip(-50, 10)
    df['WP_x_r2_w'] = df['Weeks_Passed_cap'] * df['r2_w']
    df['WP_x_slope_percent_w'] = df['Weeks_Passed_cap'] * df['slope_percent_w']
    # Do not use time-varying Percent; set dPercent features to zero
    df['dPercent'] = 0.0
    df['WP_x_dPercent'] = 0.0
    if 'n_obs' not in df.columns:
        df['n_obs'] = 1
    if 'has_trend' not in df.columns:
        df['has_trend'] = 0
    df['is_singleton'] = (df['n_obs'] <= 1).astype(int)
    return df

def forward_grid_cv_residual_xgb(train_df, n_splits=5, seed=42):
    gkf = GroupKFold(n_splits=n_splits)
    groups = train_df['Patient'].values
    y_true_all, y_pred_all, dist_all = [], [], []
    res_all, rows_sigma = [], []
    best_iters = []
    t0 = time.time()
    for fold, (trn_idx, val_idx) in enumerate(gkf.split(train_df, groups=groups), 1):
        tf = time.time()
        trn = train_df.iloc[trn_idx].copy()
        val = train_df.iloc[val_idx].copy()
        # Disjoint patients
        assert set(trn['Patient']).isdisjoint(set(val['Patient'])), 'Overlapping patients'
        # Anchors
        base_trn = (trn.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False)
                    .first()[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'}))
        trn = trn.merge(base_trn, on='Patient', how='left')
        base_val = (val.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False)
                    .first()[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'}))
        val = val.merge(base_val, on='Patient', how='left')
        # TRAIN-only stats
        stats_trn = compute_trend_stats(trn)
        stats_trn['has_trend'] = (stats_trn['n_obs'] >= 2).astype(int)
        pstats_trn = compute_percent_trend_stats(trn)
        trn = trn.merge(stats_trn, on='Patient', how='left').merge(pstats_trn, on='Patient', how='left')
        val = val.merge(stats_trn, on='Patient', how='left').merge(pstats_trn, on='Patient', how='left')
        for c, v in [('slope_w', 0.0), ('r2_w', 0.0), ('slope_percent_w', 0.0)]:
            trn[c] = trn[c].fillna(v); val[c] = val[c].fillna(v)
        trn['n_obs'] = trn['n_obs'].fillna(1).astype(int)
        val['n_obs'] = val['n_obs'].fillna(1).astype(int)
        trn['has_trend'] = trn['has_trend'].fillna(0).astype(int)
        val['has_trend'] = val['has_trend'].fillna(0).astype(int)
        # Assert VAL trend zeroed
        assert (val['slope_w'] == 0).all() and (val['r2_w'] == 0).all(), 'VAL trend not zeroed'
        assert (val['n_obs'] == 1).all() and (val['has_trend'] == 0).all(), 'VAL flags incorrect'
        val['is_singleton'] = 1; trn['is_singleton'] = (trn['n_obs'] <= 1).astype(int)
        # Global slope from TRAIN only
        g_slope = robust_global_slope(compute_patient_slopes(trn))
        trn['pred0'] = trn['Base_FVC'] + g_slope * (trn['Weeks'] - trn['Base_Week'])
        val['pred0'] = val['Base_FVC'] + g_slope * (val['Weeks'] - val['Base_Week'])
        # Features and cats
        trnF = build_features_v2(trn)
        valF = build_features_v2(val)
        cats = one_hot_fit(trnF, ['Sex','SmokingStatus'])
        trnF = one_hot_transform(trnF, cats)
        valF = one_hot_transform(valF, cats)
        feat_cols = [
            'Weeks_Passed','Abs_Weeks_Passed','Weeks_Passed_cap','Weeks_Passed2','sign_WP','is_future',
            'Percent_clipped','Percent2','Age','Base_FVC','log_BaseFVC','Estimated_TLC','log_TLC',
            'Age_x_Percent','Percent_x_BaseFVC','WP_x_BaseFVC','WP_x_Percent','WP_x_Age',
            'slope_w','r2_w','slope_percent_w','n_obs','has_trend','is_singleton','WP_x_slope_w','WP_x_r2_w','WP_x_slope_percent_w','dPercent','WP_x_dPercent'
        ] + [c for c in trnF.columns if c.startswith('Sex__') or c.startswith('SmokingStatus__')]
        y_trn = (trn['FVC'] - trn['pred0']).astype(float).values
        y_val = (val['FVC'] - val['pred0']).astype(float).values
        dtrain = xgb.DMatrix(trnF[feat_cols], label=y_trn)
        dvalid = xgb.DMatrix(valF[feat_cols], label=y_val)
        params = {
            'objective': 'reg:absoluteerror',
            'eval_metric': 'mae',
            'tree_method': 'gpu_hist',
            'learning_rate': 0.03,
            'max_depth': 4,
            'min_child_weight': 30,
            'lambda': 7.0,
            'subsample': 0.85,
            'colsample_bytree': 0.85,
            'verbosity': 0
        }
        watchlist = [(dtrain, 'trn'), (dvalid, 'val')]
        model = xgb.train(params, dtrain, num_boost_round=4000, evals=watchlist, early_stopping_rounds=300, verbose_eval=False)
        val_pred_res = model.predict(dvalid, iteration_range=(0, model.best_iteration+1))
        val_pred = val['pred0'].values + val_pred_res
        # Forward-only scoring mask
        mask = (val['Weeks'].values >= val['Base_Week'].values)
        v_true = val['FVC'].values[mask].astype(float)
        v_pred = val_pred[mask].astype(float)
        v_dist = (val['Weeks'].values[mask] - val['Base_Week'].values[mask]).astype(float)
        y_true_all.append(v_true); y_pred_all.append(v_pred); dist_all.append(v_dist)
        # Residuals for sigma model
        v_res = (v_true - v_pred)
        res_all.append(v_res)
        # Prepare sigma features
        vF = valF.loc[mask].copy()
        rows_sigma.append(pd.DataFrame({
            'Patient': val.loc[mask, 'Patient'].values,
            'dist': v_dist,
            'Abs_Weeks_Passed_cap': np.abs(vF['Weeks_Passed_cap'].values),
            'sign_WP': vF['sign_WP'].values,
            'is_future': vF['is_future'].values,
            'n_obs': vF['n_obs'].values,
            'is_singleton': vF['is_singleton'].values,
            'has_trend': vF['has_trend'].values,
            'r2_w': vF['r2_w'].values,
            'slope_w': pd.Series(vF['slope_w']).clip(-50,10).values,
            'slope_percent_w': pd.Series(vF['slope_percent_w']).clip(-10,10).values,
            'dPercent': vF['dPercent'].values,
            'Percent_at_base': val.loc[mask, 'Percent_at_base'].values,
            'Base_FVC': val.loc[mask, 'Base_FVC'].values,
            'log_BaseFVC': vF['log_BaseFVC'].values,
            'Estimated_TLC': vF['Estimated_TLC'].values,
            'log_TLC': vF['log_TLC'].values,
            'Age': val.loc[mask, 'Age'].values,
            'Sex': val.loc[mask, 'Sex'].values,
            'SmokingStatus': val.loc[mask, 'SmokingStatus'].values
        }))
        mae = float(np.mean(np.abs(v_true - v_pred))) if v_true.size else np.nan
        print(f'[FWD-XGB-Fold {fold}] n_trn={trn.shape[0]} n_val={val.shape[0]} g_slope={g_slope:.4f} scored={v_true.size} MAE={mae:.2f} iters={model.best_iteration+1} elapsed={time.time()-tf:.2f}s', flush=True)
        best_iters.append(int(model.best_iteration+1))
        if fold == 1:
            # Persist schema for full-train fit
            globals()['_feat_cols_res'] = feat_cols
            globals()['_cats_res'] = cats
        del dtrain, dvalid, model, trnF, valF; gc.collect()

    y_true = np.concatenate(y_true_all) if y_true_all else np.array([], float)
    y_pred = np.concatenate(y_pred_all) if y_pred_all else np.array([], float)
    dist = np.concatenate(dist_all) if dist_all else np.array([], float)
    res = np.concatenate(res_all) if res_all else np.array([], float)
    oof_ll_baseline = laplace_ll(y_true, y_pred, np.maximum(200 + 2.0*dist, 70.0))
    print(f'Forward residual OOF baseline LL (simple sigma): {oof_ll_baseline:.5f}')
    return (y_true, y_pred, dist, res, rows_sigma, int(np.median(best_iters)))

def sigma_model_oof(rows_sigma, res, n_splits=5, seed=42):
    df = pd.concat(rows_sigma, ignore_index=True)
    df['abs_res'] = np.abs(res)
    df['z'] = np.log1p(df['abs_res'])
    # One-hot for Sex/Smoking
    cats = { 'Sex': sorted(df['Sex'].dropna().unique()), 'SmokingStatus': sorted(df['SmokingStatus'].dropna().unique()) }
    for c, vals in cats.items():
        for v in vals:
            df[f'{c}__{v}'] = (df[c] == v).astype(np.int8)
    feat_cols = [
        'dist','Abs_Weeks_Passed_cap','sign_WP','is_future','n_obs','is_singleton','has_trend','r2_w','slope_w','slope_percent_w',
        'dPercent','Percent_at_base','Base_FVC','log_BaseFVC','Estimated_TLC','log_TLC','Age'
    ] + [c for c in df.columns if c.startswith('Sex__') or c.startswith('SmokingStatus__')]
    gkf = GroupKFold(n_splits=n_splits)
    groups = df['Patient'].values
    z_hat = np.zeros(df.shape[0], dtype=float)
    models = []
    for fold, (trn_idx, val_idx) in enumerate(gkf.split(df, groups=groups), 1):
        trn = df.iloc[trn_idx]; val = df.iloc[val_idx]
        dtrain = xgb.DMatrix(trn[feat_cols], label=trn['z'].values)
        dvalid = xgb.DMatrix(val[feat_cols], label=val['z'].values)
        params = {
            'objective': 'reg:squarederror',
            'tree_method': 'gpu_hist',
            'learning_rate': 0.05,
            'max_depth': 3,
            'min_child_weight': 20,
            'lambda': 5.0,
            'subsample': 0.8,
            'colsample_bytree': 0.8,
            'verbosity': 0
        }
        model = xgb.train(params, dtrain, num_boost_round=4000, evals=[(dvalid,'val')], early_stopping_rounds=200, verbose_eval=False)
        z_hat[val_idx] = model.predict(dvalid, iteration_range=(0, model.best_iteration+1))
        models.append(model)
    # a, b, c are tuned by the caller against the true forward OOF targets;
    # this function only produces the OOF predictions of log1p(|residual|).
    return df, z_hat, feat_cols, cats

# Run forward/grid residual CV
y_true_fw, y_pred_fw, dist_fw, res_fw, rows_sigma, best_iters_res = forward_grid_cv_residual_xgb(train)

# Proper tuning of a,b,c on OOF with sigma |res_hat|
df_sigma_oof, z_hat_oof, sigma_feat_cols, sigma_cats = sigma_model_oof(rows_sigma, res_fw)
abs_hat_oof = np.expm1(z_hat_oof)
best = (-1e9, None, None, None)
for a in [120, 160, 200, 240]:
    for b in [1.0, 2.0, 3.0]:
        for c in [0.5, 1.0, 1.5, 2.0]:
            sigma = np.maximum(a + b*df_sigma_oof['dist'].values + c*abs_hat_oof, 70.0)
            score = laplace_ll(y_true_fw, y_pred_fw, sigma)
            if score > best[0]:
                best = (score, a, b, c)
print(f'Forward OOF LL (residual + sigma-|res_hat|): {best[0]:.5f} with a={best[1]} b={best[2]} c={best[3]}')
a_sig, b_sig, c_sig = best[1], best[2], best[3]

# Train final residual model on full data
slopes_full = compute_patient_slopes(train)
g_slope_full = robust_global_slope(slopes_full)
base_full = (train.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False).first()[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'}))
train_full = train.merge(base_full, on='Patient', how='left')
stats_full = compute_trend_stats(train_full); stats_full['has_trend'] = (stats_full['n_obs'] >= 2).astype(int)
pstats_full = compute_percent_trend_stats(train_full)
train_full = train_full.merge(stats_full, on='Patient', how='left').merge(pstats_full, on='Patient', how='left')
for c, v in [('slope_w', 0.0), ('r2_w', 0.0), ('slope_percent_w', 0.0)]:
    train_full[c] = train_full[c].fillna(v)
train_full['n_obs'] = train_full['n_obs'].fillna(1).astype(int)
train_full['has_trend'] = train_full['has_trend'].fillna(0).astype(int)
train_full['pred0'] = train_full['Base_FVC'] + g_slope_full * (train_full['Weeks'] - train_full['Base_Week'])
train_fullF = build_features_v2(train_full)
cats_res_full = one_hot_fit(train_fullF, ['Sex','SmokingStatus'])
train_fullF = one_hot_transform(train_fullF, cats_res_full)
feat_cols_res = [
    'Weeks_Passed','Abs_Weeks_Passed','Weeks_Passed_cap','Weeks_Passed2','sign_WP','is_future',
    'Percent_clipped','Percent2','Age','Base_FVC','log_BaseFVC','Estimated_TLC','log_TLC',
    'Age_x_Percent','Percent_x_BaseFVC','WP_x_BaseFVC','WP_x_Percent','WP_x_Age',
    'slope_w','r2_w','slope_percent_w','n_obs','has_trend','is_singleton','WP_x_slope_w','WP_x_r2_w','WP_x_slope_percent_w','dPercent','WP_x_dPercent'
] + [c for c in train_fullF.columns if c.startswith('Sex__') or c.startswith('SmokingStatus__')]
y_full = (train_fullF['FVC'] - train_fullF['pred0']).astype(float).values
dtrain_full = xgb.DMatrix(train_fullF[feat_cols_res], label=y_full)
params_res_full = {
    'objective': 'reg:absoluteerror',
    'eval_metric': 'mae',
    'tree_method': 'gpu_hist',
    'learning_rate': 0.03,
    'max_depth': 4,
    'min_child_weight': 30,
    'lambda': 7.0,
    'subsample': 0.85,
    'colsample_bytree': 0.85,
    'verbosity': 0
}
model_res_full = xgb.train(params_res_full, dtrain_full, num_boost_round=max(200, best_iters_res))

# Train final sigma model on all forward-scored rows
df_sigma_full = df_sigma_oof.copy()
z_full = np.log1p(df_sigma_full['abs_res'].values)  # use true |res| from OOF for fitting
for c, vals in sigma_cats.items():
    for v in vals:
        if f'{c}__{v}' not in df_sigma_full.columns:
            df_sigma_full[f'{c}__{v}'] = (df_sigma_full[c] == v).astype(np.int8)
dtrain_sigma_full = xgb.DMatrix(df_sigma_full[sigma_feat_cols], label=z_full)
params_sigma_full = {
    'objective': 'reg:squarederror',
    'tree_method': 'gpu_hist',
    'learning_rate': 0.05,
    'max_depth': 3,
    'min_child_weight': 20,
    'lambda': 5.0,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'verbosity': 0
}
model_sigma_full = xgb.train(params_sigma_full, dtrain_sigma_full, num_boost_round=600)

# Build test grid and predict
ss = pd.read_csv('sample_submission.csv')
grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]
grid['Weeks'] = parts[1].astype(int)
test_bl = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_bl, on='Patient', how='left')
meta = test[['Patient','Percent','Age','Sex','SmokingStatus']].drop_duplicates('Patient')
grid = grid.merge(meta, on='Patient', how='left', suffixes=('', '_meta'))
# Do NOT merge train stats into test; hard-set backoff zeros to mirror VAL/TEST behavior
grid['slope_w'] = 0.0
grid['r2_w'] = 0.0
grid['slope_percent_w'] = 0.0
grid['n_obs'] = 1
grid['has_trend'] = 0
grid['is_singleton'] = 1
grid['pred0'] = grid['Base_FVC'] + g_slope_full * (grid['Weeks'] - grid['Base_Week'])
gridF = build_features_v2(grid)
gridF = one_hot_transform(gridF, cats_res_full)
dgrid = xgb.DMatrix(gridF[feat_cols_res])
res_pred = model_res_full.predict(dgrid)
fvc_pred = (gridF['pred0'].values + res_pred).clip(500, 6000)
dist_test = (gridF['Weeks'] - gridF['Base_Week']).abs().astype(float).values

# Sigma features for test
df_sigma_test = pd.DataFrame({
    'dist': dist_test,
    'Abs_Weeks_Passed_cap': np.abs(gridF['Weeks_Passed_cap'].values),  # test weeks can precede the baseline; training rows were forward-only
    'sign_WP': gridF['sign_WP'].values,
    'is_future': gridF['is_future'].values,
    'n_obs': gridF['n_obs'].values,
    'is_singleton': gridF['is_singleton'].values,
    'has_trend': gridF['has_trend'].values,
    'r2_w': gridF['r2_w'].values,
    'slope_w': pd.Series(gridF['slope_w']).clip(-50,10).values,
    'slope_percent_w': pd.Series(gridF['slope_percent_w']).clip(-10,10).values,
    'dPercent': gridF['dPercent'].values,
    'Percent_at_base': grid['Percent_at_base'].values,
    'Base_FVC': grid['Base_FVC'].values,
    'log_BaseFVC': gridF['log_BaseFVC'].values,
    'Estimated_TLC': gridF['Estimated_TLC'].values,
    'log_TLC': gridF['log_TLC'].values,
    'Age': grid['Age'].values
})
for c, vals in sigma_cats.items():
    for v in vals:
        df_sigma_test[f'{c}__{v}'] = (grid[c] == v).astype(np.int8)
for col in sigma_feat_cols:
    if col not in df_sigma_test.columns:
        df_sigma_test[col] = 0
dtest_sigma = xgb.DMatrix(df_sigma_test[sigma_feat_cols])
z_hat_test = model_sigma_full.predict(dtest_sigma)
abs_hat_test = np.expm1(z_hat_test)
sigma_test = np.maximum(a_sig + b_sig*dist_test + c_sig*abs_hat_test, 70.0)

submission_fw_res = pd.DataFrame({
    'Patient_Week': ss['Patient_Week'],
    'FVC': fvc_pred.astype(float),
    'Confidence': sigma_test.astype(float)
})
submission_fw_res.to_csv('submission.csv', index=False)
print('Saved forward-residual submission.csv. Forward OOF LL (res+sigma):', f'{best[0]:.5f}', 'best_iters_res:', best_iters_res, 'g_slope_full:', f'{g_slope_full:.4f}')

[FWD-XGB-Fold 1] n_trn=1112 n_val=282 g_slope=-3.8062 scored=282 MAE=167.37 iters=2 elapsed=0.64s
[FWD-XGB-Fold 2] n_trn=1113 n_val=281 g_slope=-3.5547 scored=281 MAE=117.11 iters=18 elapsed=0.65s
[FWD-XGB-Fold 3] n_trn=1119 n_val=275 g_slope=-3.5065 scored=275 MAE=136.84 iters=6 elapsed=0.62s
[FWD-XGB-Fold 4] n_trn=1119 n_val=275 g_slope=-3.5065 scored=275 MAE=154.29 iters=178 elapsed=0.93s
[FWD-XGB-Fold 5] n_trn=1113 n_val=281 g_slope=-3.6557 scored=281 MAE=137.79 iters=348 elapsed=1.25s
Forward residual OOF baseline LL (simple sigma): -6.03975
Forward OOF LL (residual + sigma-|res_hat|): -5.93471 with a=120 b=1.0 c=0.5
Saved forward-residual submission.csv. Forward OOF LL (res+sigma): -5.93471 best_iters_res: 18 g_slope_full: -3.6341
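
The a/b/c search in this cell could be wrapped in a small helper so the same loop does not have to be repeated. The sketch below is one way to do that, using the notebook's own metric form. `tune_sigma_grid` is a hypothetical name, and the inputs are synthetic values generated only to exercise the function.

```python
import itertools
import numpy as np

def laplace_ll(y_true, y_pred, sigma):
    """Same form as the notebook's metric: error capped at 1000, sigma floored at 70."""
    delta = np.minimum(np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float)), 1000.0)
    sigma = np.maximum(np.asarray(sigma, float), 70.0)
    return float(np.mean(-delta / sigma - np.log(sigma)))

def tune_sigma_grid(y_true, y_pred, dist, abs_hat, grid_a, grid_b, grid_c):
    """Grid-search sigma = a + b*dist + c*abs_hat; return (best_score, a, b, c)."""
    dist = np.asarray(dist, dtype=float)
    abs_hat = np.asarray(abs_hat, dtype=float)
    best = (-np.inf, None, None, None)
    for a, b, c in itertools.product(grid_a, grid_b, grid_c):
        sigma = a + b * dist + c * abs_hat
        score = laplace_ll(y_true, y_pred, sigma)
        if score > best[0]:
            best = (score, a, b, c)
    return best

# Synthetic forward rows: linear decline plus Laplace noise (illustrative values only).
rng = np.random.default_rng(0)
dist_demo = rng.integers(0, 60, size=200).astype(float)
y_pred_demo = 2500.0 - 4.0 * dist_demo
y_true_demo = y_pred_demo + rng.laplace(0.0, 150.0, size=200)
abs_hat_demo = np.full(200, 100.0)  # stand-in for expm1(z_hat)

grid_a, grid_b, grid_c = [120, 160, 200, 240], [1.0, 2.0, 3.0], [0.5, 1.0, 1.5, 2.0]
best_score, best_a, best_b, best_c = tune_sigma_grid(
    y_true_demo, y_pred_demo, dist_demo, abs_hat_demo, grid_a, grid_b, grid_c)
print(best_score, best_a, best_b, best_c)
```

Because the coefficients are selected on the same OOF rows they are scored on, the reported score is slightly optimistic. With only a few dozen grid points the effect is likely small, but it is worth keeping in mind when comparing against the dist-only baseline.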


In [17]:
# Forward/grid CatBoost residual (native cats) + blend with XGB forward residual; keep sigma from sigma |res_hat|
import sys, subprocess, gc, time
import numpy as np
import pandas as pd

def ensure_catboost():
    try:
        import catboost  # noqa: F401
        return
    except Exception:
        print('Installing catboost...', flush=True)
        subprocess.run([sys.executable, '-m', 'pip', 'install', 'catboost==1.2.5', '--no-input'], check=True)
        import catboost  # noqa: F401

ensure_catboost()
from catboost import CatBoostRegressor, Pool
from sklearn.model_selection import GroupKFold

if 'train' not in globals():
    train = pd.read_csv('train.csv')
if 'test' not in globals():
    test = pd.read_csv('test.csv')

def train_catboost_forward_residual(train_df, n_splits=5, seed=42):
    gkf = GroupKFold(n_splits=n_splits)
    groups = train_df['Patient'].values
    y_true_all, y_pred_all = [], []
    t0 = time.time()
    for fold, (trn_idx, val_idx) in enumerate(gkf.split(train_df, groups=groups), 1):
        tf = time.time()
        trn = train_df.iloc[trn_idx].copy()
        val = train_df.iloc[val_idx].copy()
        assert set(trn['Patient']).isdisjoint(set(val['Patient'])), 'Overlapping patients'
        # Anchors
        base_trn = (trn.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False)
                    .first()[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'}))
        trn = trn.merge(base_trn, on='Patient', how='left')
        base_val = (val.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False)
                    .first()[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'}))
        val = val.merge(base_val, on='Patient', how='left')
        # TRAIN-only stats
        stats_trn = compute_trend_stats(trn); stats_trn['has_trend'] = (stats_trn['n_obs'] >= 2).astype(int)
        pstats_trn = compute_percent_trend_stats(trn)
        trn = trn.merge(stats_trn, on='Patient', how='left').merge(pstats_trn, on='Patient', how='left')
        val = val.merge(stats_trn, on='Patient', how='left').merge(pstats_trn, on='Patient', how='left')
        for c, v in [('slope_w', 0.0), ('r2_w', 0.0), ('slope_percent_w', 0.0)]:
            trn[c] = trn[c].fillna(v); val[c] = val[c].fillna(v)
        trn['n_obs'] = trn['n_obs'].fillna(1).astype(int)
        val['n_obs'] = val['n_obs'].fillna(1).astype(int)
        trn['has_trend'] = trn['has_trend'].fillna(0).astype(int)
        val['has_trend'] = val['has_trend'].fillna(0).astype(int)
        val['is_singleton'] = 1; trn['is_singleton'] = (trn['n_obs'] <= 1).astype(int)
        # Global slope
        g_slope = robust_global_slope(compute_patient_slopes(trn))
        trn['pred0'] = trn['Base_FVC'] + g_slope * (trn['Weeks'] - trn['Base_Week'])
        val['pred0'] = val['Base_FVC'] + g_slope * (val['Weeks'] - val['Base_Week'])
        # Features for CatBoost: use build_features_v2, keep Sex, SmokingStatus as cats
        trnF = build_features_v2(trn)
        valF = build_features_v2(val)
        feat_cols = [
            'Weeks_Passed','Abs_Weeks_Passed','Weeks_Passed_cap','Weeks_Passed2','sign_WP','is_future',
            'Percent','Percent_clipped','Percent2','Age','Base_FVC','log_BaseFVC','Estimated_TLC','log_TLC',
            'Age_x_Percent','Percent_x_BaseFVC','WP_x_BaseFVC','WP_x_Percent','WP_x_Age',
            'slope_w','r2_w','slope_percent_w','n_obs','has_trend','is_singleton','WP_x_slope_w','WP_x_r2_w','WP_x_slope_percent_w','dPercent','WP_x_dPercent',
            'Sex','SmokingStatus'
        ]
        cat_cols = ['Sex','SmokingStatus']
        cat_idx = [feat_cols.index(c) for c in cat_cols]
        y_trn = (trn['FVC'] - trn['pred0']).astype(float).values
        y_val = (val['FVC'] - val['pred0']).astype(float).values
        train_pool = Pool(trnF[feat_cols], label=y_trn, cat_features=cat_idx)
        valid_pool = Pool(valF[feat_cols], label=y_val, cat_features=cat_idx)
        model = CatBoostRegressor(
            loss_function='MAE',
            iterations=6000,
            od_type='Iter',
            od_wait=300,
            learning_rate=0.035,
            depth=6,
            l2_leaf_reg=8.0,
            bootstrap_type='Bernoulli',
            subsample=0.8,
            task_type='GPU',
            random_seed=seed,
            verbose=False
        )
        model.fit(train_pool, eval_set=valid_pool, use_best_model=True, verbose=False)
        pred_res = model.predict(valid_pool)
        val_pred = val['pred0'].values + pred_res
        mask = (val['Weeks'].values >= val['Base_Week'].values)
        y_true_all.append(val['FVC'].values[mask].astype(float))
        y_pred_all.append(val_pred[mask].astype(float))
        mae = float(np.mean(np.abs(val['FVC'].values[mask] - val_pred[mask]))) if mask.any() else np.nan
        print(f'[FWD-CB-Fold {fold}] n_trn={trn.shape[0]} n_val={val.shape[0]} g_slope={g_slope:.4f} MAE={mae:.2f} elapsed={time.time()-tf:.2f}s', flush=True)
        del model, train_pool, valid_pool, trnF, valF; gc.collect()
    y_true = np.concatenate(y_true_all) if y_true_all else np.array([], float)
    y_pred = np.concatenate(y_pred_all) if y_pred_all else np.array([], float)
    print(f'CatBoost forward OOF ready: {y_true.shape[0]} rows in {time.time()-t0:.2f}s')
    return y_true, y_pred

# Train CatBoost forward residual OOF
y_true_cb, y_pred_cb = train_catboost_forward_residual(train)

# If available, blend with XGB forward OOF (from cell 10) using a simple tuned weight; otherwise default w=0.6
def safe_blend_weight(y_true_ref, y_pred_xgb, y_pred_cb, dist_ref, a=120, b=1.0, c=0.0):
    # Grid-search the XGB share w in [0, 1] (step 0.05) that maximizes the OOF Laplace LL.
    # sigma depends only on distance, so it is computed once; `c` is accepted but unused here.
    try:
        sig = np.maximum(a + b * dist_ref, 70.0)
        best = (-1e9, None)
        for w in np.linspace(0.0, 1.0, 21):
            yb = w * y_pred_xgb + (1.0 - w) * y_pred_cb
            score = laplace_ll(y_true_ref, yb, sig)
            if score > best[0]:
                best = (score, w)
        return best[1] if best[1] is not None else 0.6
    except Exception as e:
        print('Blend weight tuning failed, fallback w=0.6:', e)
        return 0.6

try:
    w_blend = safe_blend_weight(y_true_fw, y_pred_fw, y_pred_cb, dist_fw, a=120, b=1.0)
except Exception as e:
    print('Using default blend weight (0.6) due to missing XGB OOF context:', e)
    w_blend = 0.6
print('Chosen XGB:CB blend weight (XGB share)=', w_blend)

# Fit CatBoost residual on full data and generate test predictions
slopes_full = compute_patient_slopes(train)
g_slope_full = robust_global_slope(slopes_full)
base_full = (train.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False).first()[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'}))
train_full_cb = train.merge(base_full, on='Patient', how='left')
stats_full = compute_trend_stats(train_full_cb); stats_full['has_trend'] = (stats_full['n_obs'] >= 2).astype(int)
pstats_full = compute_percent_trend_stats(train_full_cb)
train_full_cb = train_full_cb.merge(stats_full, on='Patient', how='left').merge(pstats_full, on='Patient', how='left')
for c, v in [('slope_w', 0.0), ('r2_w', 0.0), ('slope_percent_w', 0.0)]:
    train_full_cb[c] = train_full_cb[c].fillna(v)
train_full_cb['n_obs'] = train_full_cb['n_obs'].fillna(1).astype(int)
train_full_cb['has_trend'] = train_full_cb['has_trend'].fillna(0).astype(int)
train_full_cb['pred0'] = train_full_cb['Base_FVC'] + g_slope_full * (train_full_cb['Weeks'] - train_full_cb['Base_Week'])
train_full_cbF = build_features_v2(train_full_cb)
feat_cols_cb_full = [
    'Weeks_Passed','Abs_Weeks_Passed','Weeks_Passed_cap','Weeks_Passed2','sign_WP','is_future',
    'Percent','Percent_clipped','Percent2','Age','Base_FVC','log_BaseFVC','Estimated_TLC','log_TLC',
    'Age_x_Percent','Percent_x_BaseFVC','WP_x_BaseFVC','WP_x_Percent','WP_x_Age',
    'slope_w','r2_w','slope_percent_w','n_obs','has_trend','is_singleton','WP_x_slope_w','WP_x_r2_w','WP_x_slope_percent_w','dPercent','WP_x_dPercent',
    'Sex','SmokingStatus'
]
cat_cols_full = ['Sex','SmokingStatus']
cat_idx_full = [feat_cols_cb_full.index(c) for c in cat_cols_full]
pool_full = Pool(train_full_cbF[feat_cols_cb_full], label=(train_full_cbF['FVC'] - train_full_cbF['pred0']).astype(float).values, cat_features=cat_idx_full)
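# No eval_set is passed to fit() below, so the overfitting detector (od_type/od_wait) has no
# validation data to monitor and training runs for the full 3000 iterations.
model_cb_full = CatBoostRegressor(
    loss_function='MAE', depth=6, learning_rate=0.035, l2_leaf_reg=8.0,
    bootstrap_type='Bernoulli', subsample=0.8, iterations=3000, od_type='Iter', od_wait=200,
    task_type='GPU', random_seed=42, verbose=False
)
model_cb_full.fit(pool_full, verbose=False)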

# Build test grid and predict residuals with CatBoost
ss = pd.read_csv('sample_submission.csv')
grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]
grid['Weeks'] = parts[1].astype(int)
test_bl = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_bl, on='Patient', how='left')
meta = test[['Patient','Percent','Age','Sex','SmokingStatus']].drop_duplicates('Patient')
grid = grid.merge(meta, on='Patient', how='left', suffixes=('', '_meta'))
# Test hygiene: do not merge train-derived trend stats onto test patients. Hard-set the same
# defaults that validation rows receive in CV (no trend information, singleton).
grid['slope_w'] = 0.0
grid['r2_w'] = 0.0
grid['slope_percent_w'] = 0.0
grid['n_obs'] = 1
grid['has_trend'] = 0
grid['is_singleton'] = 1
grid['pred0'] = grid['Base_FVC'] + g_slope_full * (grid['Weeks'] - grid['Base_Week'])
gridF = build_features_v2(grid)
pool_grid = Pool(gridF[feat_cols_cb_full], cat_features=cat_idx_full)
res_cb = model_cb_full.predict(pool_grid)
fvc_cb = (gridF['pred0'].values + res_cb).clip(500, 6000)

# Recompute/ensure XGB forward predictions available (from cell 10). If missing, fall back to pred0 only.
try:
    fvc_xgb = fvc_pred.copy()
    sigma_out = sigma_test.copy()
except Exception as e:
    print('XGB forward preds not found in scope, rebuilding baseline preds as fallback:', e)
    fvc_xgb = (gridF['pred0'].values).clip(500, 6000)
    dist_test_fb = (gridF['Weeks'] - gridF['Base_Week']).abs().astype(float).values
    sigma_out = np.maximum(200 + 2.0 * dist_test_fb, 70.0)

# Blend and save submission
fvc_blend = w_blend * fvc_xgb + (1.0 - w_blend) * fvc_cb
submission_cb_fw = pd.DataFrame({
    'Patient_Week': ss['Patient_Week'],
    'FVC': fvc_blend.astype(float),
    'Confidence': sigma_out.astype(float)
})
submission_cb_fw.to_csv('submission.csv', index=False)
print('Saved forward CV XGB+CatBoost blended submission.csv with w_blend (XGB share)=', w_blend)

Default metric period is 5 because MAE is/are not implemented for GPU


[FWD-CB-Fold 1] n_trn=1112 n_val=282 g_slope=-3.8062 MAE=112.78 elapsed=69.58s


Default metric period is 5 because MAE is/are not implemented for GPU


[FWD-CB-Fold 2] n_trn=1113 n_val=281 g_slope=-3.5547 MAE=67.32 elapsed=69.86s


Default metric period is 5 because MAE is/are not implemented for GPU


[FWD-CB-Fold 3] n_trn=1119 n_val=275 g_slope=-3.5065 MAE=86.21 elapsed=70.07s


Default metric period is 5 because MAE is/are not implemented for GPU


[FWD-CB-Fold 4] n_trn=1119 n_val=275 g_slope=-3.5065 MAE=98.66 elapsed=69.93s


Default metric period is 5 because MAE is/are not implemented for GPU


[FWD-CB-Fold 5] n_trn=1113 n_val=281 g_slope=-3.6557 MAE=89.78 elapsed=69.64s


CatBoost forward OOF ready: 1394 rows in 349.55s
Chosen XGB:CB blend weight (XGB share)= 1.0


Default metric period is 5 because MAE is/are not implemented for GPU


Saved forward CV XGB+CatBoost blended submission.csv with w_blend (XGB share)= 1.0
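
A chosen weight of 1.0 gives CatBoost no share, even though its per-fold MAE in the log above is lower than the XGB per-fold MAE logged for the earlier cell. If `y_pred_fw` really comes from that XGB run, one would expect the search to favor CatBoost, so it may be worth confirming that both OOF vectors cover the same rows in the same order before trusting the weight. A minimal sketch of such a check on synthetic data; `oof_aligned` and `blend_curve` are hypothetical helpers, and the metric uses the same unscaled form as the notebook's `laplace_ll`.

```python
import numpy as np

def oof_aligned(y_true_a, y_true_b, atol=1e-6):
    # Two OOF runs are comparable only if their targets match row by row.
    a = np.asarray(y_true_a, dtype=float)
    b = np.asarray(y_true_b, dtype=float)
    return a.shape == b.shape and bool(np.allclose(a, b, atol=atol))

def blend_curve(y_true, pred_xgb, pred_cb, sigma, weights=None):
    # Returns (w, MAE, LL) for each XGB share w so the whole curve can be inspected.
    weights = np.linspace(0.0, 1.0, 21) if weights is None else np.asarray(weights, dtype=float)
    sig = np.maximum(np.asarray(sigma, dtype=float), 70.0)
    rows = []
    for w in weights:
        yb = w * pred_xgb + (1.0 - w) * pred_cb
        err = np.abs(y_true - yb)
        ll = np.mean(-np.minimum(err, 1000.0) / sig - np.log(sig))
        rows.append((float(w), float(err.mean()), float(ll)))
    return rows

# Synthetic stand-ins: the "CatBoost" predictions are the more accurate of the two.
rng = np.random.default_rng(1)
y_true = rng.normal(2500.0, 300.0, size=400)
pred_xgb = y_true + rng.normal(0.0, 150.0, size=400)
pred_cb = y_true + rng.normal(0.0, 80.0, size=400)
dist = rng.integers(0, 60, size=400).astype(float)
sigma = 120.0 + 1.0 * dist

aligned = oof_aligned(y_true, y_true)
curve = blend_curve(y_true, pred_xgb, pred_cb, sigma)
best_w, best_mae, best_ll = max(curve, key=lambda r: r[2])
print(f'aligned={aligned} best_w={best_w:.2f} MAE={best_mae:.2f} LL={best_ll:.4f}')
```

In the notebook, the same check would compare `y_true_fw` with `y_true_cb` before calling `safe_blend_weight`; printing the MAE and LL at w=0 and w=1 shows directly which model the metric prefers.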


In [20]:
# Write conservative-sigma submission from forward residual XGB predictions (test hygiene fixed)
import numpy as np, pandas as pd

# Rebuild test grid and predictions using forward residual model objects from Cell 10
ss = pd.read_csv('sample_submission.csv')
grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]
grid['Weeks'] = parts[1].astype(int)
test_bl = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_bl, on='Patient', how='left')
meta = test[['Patient','Percent','Age','Sex','SmokingStatus']].drop_duplicates('Patient')
grid = grid.merge(meta, on='Patient', how='left', suffixes=('', '_meta'))

# STRICT TEST HYGIENE: Do NOT merge any train-derived trend stats. Hard-set zeros like in Cell 10.
grid['slope_w'] = 0.0
grid['r2_w'] = 0.0
grid['slope_percent_w'] = 0.0
grid['n_obs'] = 1
grid['has_trend'] = 0
grid['is_singleton'] = 1

# Baseline and features
grid['pred0'] = grid['Base_FVC'] + g_slope_full * (grid['Weeks'] - grid['Base_Week'])
gridF = build_features_v2(grid)
gridF = one_hot_transform(gridF, cats_res_full)
dgrid = xgb.DMatrix(gridF[feat_cols_res])
res_pred = model_res_full.predict(dgrid)
fvc_pred_cons = (gridF['pred0'].values + res_pred).clip(500, 6000)
dist_cons = (gridF['Weeks'] - gridF['Base_Week']).abs().astype(float).values

# Conservative sigma per expert advice
sigma_cons = np.maximum(200.0 + 3.0 * dist_cons, 70.0)

submission_cons = pd.DataFrame({
    'Patient_Week': ss['Patient_Week'],
    'FVC': fvc_pred_cons.astype(float),
    'Confidence': sigma_cons.astype(float)
})
submission_cons.to_csv('submission.csv', index=False)
print('Saved conservative-sigma submission.csv with dist-only sigma (200 + 3*dist) and strict test hygiene.')

Saved conservative-sigma submission.csv with dist-only sigma (200 + 3*dist) and strict test hygiene.
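
The grid construction and distance-only sigma used above can be exercised without the competition files. The following is a small sketch on toy rows; `build_grid`, `conservative_sigma`, and `check_submission` are hypothetical helper names, and the toy patient IDs and values are made up for illustration.

```python
import numpy as np
import pandas as pd

def build_grid(sample_sub, test_df):
    # Split 'Patient_Week' on the LAST underscore so IDs containing '_' and negative weeks survive.
    grid = sample_sub[['Patient_Week']].copy()
    parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
    grid['Patient'] = parts[0]
    grid['Weeks'] = parts[1].astype(int)
    base = test_df[['Patient', 'Weeks', 'FVC']].rename(
        columns={'Weeks': 'Base_Week', 'FVC': 'Base_FVC'})
    # many_to_one raises if test_df has more than one baseline row per patient.
    return grid.merge(base, on='Patient', how='left', validate='many_to_one')

def conservative_sigma(grid, a=200.0, b=3.0, floor=70.0):
    # sigma grows linearly with the absolute distance from the baseline week.
    dist = (grid['Weeks'] - grid['Base_Week']).abs().astype(float).values
    return np.maximum(a + b * dist, floor)

def check_submission(sub, sample_sub):
    # Basic hygiene: same rows in the same order, no missing or non-finite values.
    same_rows = sub['Patient_Week'].tolist() == sample_sub['Patient_Week'].tolist()
    finite = bool(np.isfinite(sub[['FVC', 'Confidence']].to_numpy(dtype=float)).all())
    return same_rows and finite

# Toy inputs (not competition data).
ss_toy = pd.DataFrame({'Patient_Week': ['ID001_-12', 'ID001_0', 'ID001_10', 'ID_002_5']})
test_toy = pd.DataFrame({'Patient': ['ID001', 'ID_002'], 'Weeks': [0, 3], 'FVC': [2800, 3100]})

grid_toy = build_grid(ss_toy, test_toy)
sigma_toy = conservative_sigma(grid_toy)
sub_toy = pd.DataFrame({
    'Patient_Week': ss_toy['Patient_Week'],
    'FVC': grid_toy['Base_FVC'].astype(float),  # placeholder prediction: carry the baseline forward
    'Confidence': sigma_toy,
})
print(sub_toy)
```

A check like `check_submission` before `to_csv` catches the common failure where a left merge leaves unmatched patients with NaN baselines.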


In [38]:
# Seed-averaged forward residual XGB (+ sigma re-tune) per expert advice
import numpy as np, pandas as pd, time, gc
from sklearn.model_selection import GroupKFold
import xgboost as xgb

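def laplace_ll(y_true, y_pred, sigma):
    # Clipped Laplace log-likelihood (higher is better): |error| capped at 1000, sigma floored at 70.
    # NOTE: the official competition metric also scales by sqrt(2):
    #   -sqrt(2)*delta/sigma - log(sqrt(2)*sigma)
    # This variant omits that factor, so its values are not on the leaderboard scale.
    y_true = np.asarray(y_true).astype(float)
    y_pred = np.asarray(y_pred).astype(float)
    sigma = np.asarray(sigma).astype(float)
    delta = np.abs(y_true - y_pred)
    delta = np.minimum(delta, 1000.0)
    sigma = np.maximum(sigma, 70.0)
    return np.mean(-delta / sigma - np.log(sigma))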

def ecdf_percentile(values_sorted, x):
    # values_sorted must be sorted ascending; returns percentile in [0,1]
    idx = np.searchsorted(values_sorted, np.asarray(x, dtype=float), side='right')
    return idx / float(len(values_sorted)) if len(values_sorted) > 0 else np.zeros_like(x, dtype=float)

def add_group_slope_prior(trn, val, g_slope_backoff):
    # Build static age bins and compute group slope priors from TRAIN-only per-patient slopes
    bins = [0, 50, 60, 70, 80, 200]
    labels = ['<=50','50-60','60-70','70-80','80+']
    trn = trn.copy(); val = val.copy()
    trn['AgeBin'] = pd.cut(trn['Age'].astype(float), bins=bins, labels=labels, include_lowest=True)
    val['AgeBin'] = pd.cut(val['Age'].astype(float), bins=bins, labels=labels, include_lowest=True)
    slopes_trn = compute_patient_slopes(trn)
    df_sl = pd.DataFrame({'Patient': list(slopes_trn.keys()), 'slope_p': list(slopes_trn.values())})
    trn = trn.merge(df_sl, on='Patient', how='left')
    grp = (trn.groupby(['Sex','SmokingStatus','AgeBin'], dropna=False)['slope_p']
              .median().reset_index().rename(columns={'slope_p':'group_slope_prior'}))
    trn = trn.merge(grp, on=['Sex','SmokingStatus','AgeBin'], how='left')
    val = val.merge(grp, on=['Sex','SmokingStatus','AgeBin'], how='left')
    trn['group_slope_prior'] = trn['group_slope_prior'].fillna(g_slope_backoff)
    val['group_slope_prior'] = val['group_slope_prior'].fillna(g_slope_backoff)
    return trn, val

def add_basefvc_percentile(trn, val, base_trn_vals_sorted):
    trn = trn.copy(); val = val.copy()
    trn['BaseFVC_pct'] = ecdf_percentile(base_trn_vals_sorted, trn['Base_FVC'].values)
    val['BaseFVC_pct'] = ecdf_percentile(base_trn_vals_sorted, val['Base_FVC'].values)
    return trn, val

def forward_residual_oof_and_full(seed=42, lr=0.045, max_depth=4, min_child_weight=20, reg_lambda=5.0, subsample=0.9, colsample=0.9, colsample_bylevel=1.0):
    gkf = GroupKFold(n_splits=5)
    groups = train['Patient'].values
    y_true_all, y_pred_all, dist_all, res_all = [], [], [], []
    iters = []
    t0 = time.time()
    for fold, (trn_idx, val_idx) in enumerate(gkf.split(train, groups=groups), 1):
        tf = time.time()
        trn = train.iloc[trn_idx].copy()
        val = train.iloc[val_idx].copy()
        # anchors
        base_trn = (trn.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False)
                    .first()[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'}))
        trn = trn.merge(base_trn, on='Patient', how='left')
        base_val = (val.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False)
                    .first()[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'}))
        val = val.merge(base_val, on='Patient', how='left')
        # train-only stats, VAL zeros
        stats_trn = compute_trend_stats(trn); stats_trn['has_trend'] = (stats_trn['n_obs'] >= 2).astype(int)
        pstats_trn = compute_percent_trend_stats(trn)
        trn = trn.merge(stats_trn, on='Patient', how='left').merge(pstats_trn, on='Patient', how='left')
        val = val.merge(stats_trn, on='Patient', how='left').merge(pstats_trn, on='Patient', how='left')
        for c, v in [('slope_w', 0.0), ('r2_w', 0.0), ('slope_percent_w', 0.0)]:
            trn[c] = trn[c].fillna(v); val[c] = val[c].fillna(v)
        trn['n_obs'] = trn['n_obs'].fillna(1).astype(int)
        val['n_obs'] = val['n_obs'].fillna(1).astype(int)
        trn['has_trend'] = trn['has_trend'].fillna(0).astype(int)
        val['has_trend'] = val['has_trend'].fillna(0).astype(int)
        val['is_singleton'] = 1; trn['is_singleton'] = (trn['n_obs'] <= 1).astype(int)
        # global slope
        g_slope = robust_global_slope(compute_patient_slopes(trn))
        # Add group slope prior (TRAIN-only derived) and BaseFVC percentile rank
        trn, val = add_group_slope_prior(trn, val, g_slope_backoff=g_slope)
        base_trn_sorted = np.sort(base_trn['Base_FVC'].values.astype(float))
        trn, val = add_basefvc_percentile(trn, val, base_trn_sorted)
        # Group median Base_FVC and delta from population priors (TRAIN-only)
        grp_med = (trn.groupby(['Sex','SmokingStatus','AgeBin'], dropna=False)['Base_FVC']
                      .median().reset_index().rename(columns={'Base_FVC':'group_basefvc_median'}))
        trn = trn.merge(grp_med, on=['Sex','SmokingStatus','AgeBin'], how='left')
        val = val.merge(grp_med, on=['Sex','SmokingStatus','AgeBin'], how='left')
        trn['group_basefvc_median'] = trn['group_basefvc_median'].fillna(trn['Base_FVC'].median())
        val['group_basefvc_median'] = val['group_basefvc_median'].fillna(trn['Base_FVC'].median())
        trn['delta_BaseFVC'] = trn['Base_FVC'] - trn['group_basefvc_median']
        val['delta_BaseFVC'] = val['Base_FVC'] - val['group_basefvc_median']
        # Percent_at_base percentile rank (TRAIN-only ECDF) — use Percent_at_base to avoid reliance on base_trn['Percent']
        base_percent_sorted = np.sort(trn['Percent_at_base'].values.astype(float))
        trn['PercentBase_pct'] = ecdf_percentile(base_percent_sorted, trn['Percent_at_base'].values)
        val['PercentBase_pct'] = ecdf_percentile(base_percent_sorted, val['Percent_at_base'].values)
        # pred0
        trn['pred0'] = trn['Base_FVC'] + g_slope * (trn['Weeks'] - trn['Base_Week'])
        val['pred0'] = val['Base_FVC'] + g_slope * (val['Weeks'] - val['Base_Week'])
        # features
        trnF = build_features_v2(trn); valF = build_features_v2(val)
        # add interactions for new priors and optional ratios
        trnF['WP_x_group_slope'] = trnF['Weeks_Passed_cap'] * trnF['group_slope_prior']
        valF['WP_x_group_slope'] = valF['Weeks_Passed_cap'] * valF['group_slope_prior']
        trnF['WP_x_BaseFVC_pct'] = trnF['Weeks_Passed_cap'] * trnF['BaseFVC_pct']
        valF['WP_x_BaseFVC_pct'] = valF['Weeks_Passed_cap'] * valF['BaseFVC_pct']
        # shrunk slope prior
        trnF['shrunk_slope'] = 0.7 * trnF['group_slope_prior'] + 0.3 * g_slope
        valF['shrunk_slope'] = 0.7 * valF['group_slope_prior'] + 0.3 * g_slope
        trnF['WP_x_shrunk_slope'] = trnF['Weeks_Passed_cap'] * trnF['shrunk_slope']
        valF['WP_x_shrunk_slope'] = valF['Weeks_Passed_cap'] * valF['shrunk_slope']
        # carry delta_BaseFVC and PercentBase_pct
        trnF['delta_BaseFVC'] = trn['delta_BaseFVC'].values
        valF['delta_BaseFVC'] = val['delta_BaseFVC'].values
        trnF['PercentBase_pct'] = trn['PercentBase_pct'].values
        valF['PercentBase_pct'] = val['PercentBase_pct'].values
        trnF['WP_x_delta_BaseFVC'] = trnF['Weeks_Passed_cap'] * trnF['delta_BaseFVC']
        valF['WP_x_delta_BaseFVC'] = valF['Weeks_Passed_cap'] * valF['delta_BaseFVC']
        # ratios
        trnF['BaseFVC_per_Age'] = trnF['Base_FVC'] / np.clip(trnF['Age'], 1, None)
        valF['BaseFVC_per_Age'] = valF['Base_FVC'] / np.clip(valF['Age'], 1, None)
        trnF['PercentBase_per_Age'] = trnF['Percent_clipped'] / np.clip(trnF['Age'], 1, None)
        valF['PercentBase_per_Age'] = valF['Percent_clipped'] / np.clip(valF['Age'], 1, None)
        trnF['WP_x_BaseFVC_per_Age'] = trnF['Weeks_Passed_cap'] * trnF['BaseFVC_per_Age']
        valF['WP_x_BaseFVC_per_Age'] = valF['Weeks_Passed_cap'] * valF['BaseFVC_per_Age']
        cats = one_hot_fit(trnF, ['Sex','SmokingStatus'])
        trnF = one_hot_transform(trnF, cats); valF = one_hot_transform(valF, cats)
        feat_cols = [
            'Weeks_Passed','Abs_Weeks_Passed','Weeks_Passed_cap','Weeks_Passed2','sign_WP','is_future',
            'Percent_clipped','Percent2','Age','Base_FVC','log_BaseFVC','Estimated_TLC','log_TLC',
            'Age_x_Percent','Percent_x_BaseFVC','WP_x_BaseFVC','WP_x_Percent','WP_x_Age',
            'n_obs','has_trend','is_singleton','WP_x_slope_w','WP_x_r2_w','WP_x_slope_percent_w','dPercent','WP_x_dPercent',
            'group_slope_prior','WP_x_group_slope','BaseFVC_pct','WP_x_BaseFVC_pct',
            'shrunk_slope','WP_x_shrunk_slope','delta_BaseFVC','WP_x_delta_BaseFVC','PercentBase_pct',
            'BaseFVC_per_Age','PercentBase_per_Age','WP_x_BaseFVC_per_Age'
        ] + [c for c in trnF.columns if c.startswith('Sex__') or c.startswith('SmokingStatus__')]
        y_trn = (trn['FVC'] - trn['pred0']).astype(float).values
        y_val = (val['FVC'] - val['pred0']).astype(float).values
        dtrain = xgb.DMatrix(trnF[feat_cols], label=y_trn)
        dvalid = xgb.DMatrix(valF[feat_cols], label=y_val)
        params = {
            'objective': 'reg:absoluteerror',
            'eval_metric': 'mae',
            'tree_method': 'gpu_hist',
            'learning_rate': lr,
            'max_depth': max_depth,
            'min_child_weight': min_child_weight,
            'lambda': reg_lambda,
            'subsample': subsample,
            'colsample_bytree': colsample,
            'colsample_bylevel': colsample_bylevel,
            'verbosity': 0,
            'seed': seed
        }
        model = xgb.train(params, dtrain, num_boost_round=4000, evals=[(dvalid,'val')], early_stopping_rounds=300, verbose_eval=False)
        val_res = model.predict(dvalid, iteration_range=(0, model.best_iteration+1))
        val_pred = val['pred0'].values + val_res
        mask = (val['Weeks'].values >= val['Base_Week'].values)
        v_true = val['FVC'].values[mask].astype(float)
        v_pred = val_pred[mask].astype(float)
        v_dist = (val['Weeks'].values[mask] - val['Base_Week'].values[mask]).astype(float)
        y_true_all.append(v_true); y_pred_all.append(v_pred); dist_all.append(v_dist)
        res_all.append((v_true - v_pred))
        it = int(model.best_iteration + 1)
        iters.append(it)
        print(f'[Seed{seed}-Fold {fold}] g_slope={g_slope:.4f} scored={v_true.size} MAE={np.mean(np.abs(v_true - v_pred)):.2f} iters={it} elapsed={time.time()-tf:.2f}s', flush=True)
        del dtrain, dvalid, model, trnF, valF; gc.collect()
    # OOF arrays (same ordering across seeds since GroupKFold deterministic)
    y_true = np.concatenate(y_true_all)
    y_pred = np.concatenate(y_pred_all)
    dist = np.concatenate(dist_all)
    res = np.concatenate(res_all)
    # Fit full model with median iters
    slopes_full = compute_patient_slopes(train); g_slope_full_loc = robust_global_slope(slopes_full)
    base_full = (train.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False).first()[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'}))
    train_full = train.merge(base_full, on='Patient', how='left')
    stats_full_loc = compute_trend_stats(train_full); stats_full_loc['has_trend'] = (stats_full_loc['n_obs'] >= 2).astype(int)
    pstats_full_loc = compute_percent_trend_stats(train_full)
    train_full = train_full.merge(stats_full_loc, on='Patient', how='left').merge(pstats_full_loc, on='Patient', how='left')
    for c, v in [('slope_w', 0.0), ('r2_w', 0.0), ('slope_percent_w', 0.0)]:
        train_full[c] = train_full[c].fillna(v)
    train_full['n_obs'] = train_full['n_obs'].fillna(1).astype(int)
    train_full['has_trend'] = train_full['has_trend'].fillna(0).astype(int)
    # Add TRAIN-only group slope prior and BaseFVC percentile rank for full model
    bins = [0, 50, 60, 70, 80, 200]
    labels = ['<=50','50-60','60-70','70-80','80+']
    train_full['AgeBin'] = pd.cut(train_full['Age'].astype(float), bins=bins, labels=labels, include_lowest=True)
    slopes_full_p = compute_patient_slopes(train_full)
    df_sl_full = pd.DataFrame({'Patient': list(slopes_full_p.keys()), 'slope_p': list(slopes_full_p.values())})
    train_full = train_full.merge(df_sl_full, on='Patient', how='left')
    grp_full = (train_full.groupby(['Sex','SmokingStatus','AgeBin'], dropna=False)['slope_p']
                   .median().reset_index().rename(columns={'slope_p':'group_slope_prior'}))
    train_full = train_full.merge(grp_full, on=['Sex','SmokingStatus','AgeBin'], how='left')
    train_full['group_slope_prior'] = train_full['group_slope_prior'].fillna(g_slope_full_loc)
    base_full_sorted = np.sort(base_full['Base_FVC'].values.astype(float))
    train_full['BaseFVC_pct'] = ecdf_percentile(base_full_sorted, train_full['Base_FVC'].values)
    # Group median Base_FVC
    grp_basefvc_full = (train_full.groupby(['Sex','SmokingStatus','AgeBin'], dropna=False)['Base_FVC']
                           .median().reset_index().rename(columns={'Base_FVC':'group_basefvc_median'}))
    train_full = train_full.merge(grp_basefvc_full, on=['Sex','SmokingStatus','AgeBin'], how='left')
    train_full['group_basefvc_median'] = train_full['group_basefvc_median'].fillna(train_full['Base_FVC'].median())
    train_full['delta_BaseFVC'] = train_full['Base_FVC'] - train_full['group_basefvc_median']
    # Percent_at_base ECDF (use Percent_at_base)
    percent_full_sorted = np.sort(train_full['Percent_at_base'].values.astype(float))
    train_full['PercentBase_pct'] = ecdf_percentile(percent_full_sorted, train_full['Percent_at_base'].values)
    train_full['pred0'] = train_full['Base_FVC'] + g_slope_full_loc * (train_full['Weeks'] - train_full['Base_Week'])
    train_fullF = build_features_v2(train_full)
    # Interactions for priors + extras
    train_fullF['WP_x_group_slope'] = train_fullF['Weeks_Passed_cap'] * train_fullF['group_slope_prior']
    train_fullF['WP_x_BaseFVC_pct'] = train_fullF['Weeks_Passed_cap'] * train_fullF['BaseFVC_pct']
    train_fullF['shrunk_slope'] = 0.7 * train_fullF['group_slope_prior'] + 0.3 * g_slope_full_loc
    train_fullF['WP_x_shrunk_slope'] = train_fullF['Weeks_Passed_cap'] * train_fullF['shrunk_slope']
    train_fullF['delta_BaseFVC'] = train_full['delta_BaseFVC'].values
    train_fullF['PercentBase_pct'] = train_full['PercentBase_pct'].values
    train_fullF['WP_x_delta_BaseFVC'] = train_fullF['Weeks_Passed_cap'] * train_fullF['delta_BaseFVC']
    train_fullF['BaseFVC_per_Age'] = train_fullF['Base_FVC'] / np.clip(train_fullF['Age'], 1, None)
    train_fullF['PercentBase_per_Age'] = train_fullF['Percent_clipped'] / np.clip(train_fullF['Age'], 1, None)
    train_fullF['WP_x_BaseFVC_per_Age'] = train_fullF['Weeks_Passed_cap'] * train_fullF['BaseFVC_per_Age']
    cats_full_loc = one_hot_fit(train_fullF, ['Sex','SmokingStatus'])
    train_fullF = one_hot_transform(train_fullF, cats_full_loc)
    feat_cols_loc = [
        'Weeks_Passed','Abs_Weeks_Passed','Weeks_Passed_cap','Weeks_Passed2','sign_WP','is_future',
        'Percent_clipped','Percent2','Age','Base_FVC','log_BaseFVC','Estimated_TLC','log_TLC',
        'Age_x_Percent','Percent_x_BaseFVC','WP_x_BaseFVC','WP_x_Percent','WP_x_Age',
        'n_obs','has_trend','is_singleton','WP_x_slope_w','WP_x_r2_w','WP_x_slope_percent_w','dPercent','WP_x_dPercent',
        'group_slope_prior','WP_x_group_slope','BaseFVC_pct','WP_x_BaseFVC_pct',
        'shrunk_slope','WP_x_shrunk_slope','delta_BaseFVC','WP_x_delta_BaseFVC','PercentBase_pct',
        'BaseFVC_per_Age','PercentBase_per_Age','WP_x_BaseFVC_per_Age'
    ] + [c for c in train_fullF.columns if c.startswith('Sex__') or c.startswith('SmokingStatus__')]
    dtrain_full = xgb.DMatrix(train_fullF[feat_cols_loc], label=(train_fullF['FVC'] - train_fullF['pred0']).astype(float).values)
    params_full = {
        'objective': 'reg:absoluteerror', 'eval_metric': 'mae', 'tree_method': 'gpu_hist',
        'learning_rate': lr, 'max_depth': max_depth, 'min_child_weight': min_child_weight, 'lambda': reg_lambda,
        'subsample': subsample, 'colsample_bytree': colsample, 'colsample_bylevel': colsample_bylevel, 'verbosity': 0, 'seed': seed
    }
    model_full = xgb.train(params_full, dtrain_full, num_boost_round=int(np.median(iters)))
    # Build strict test grid
    ss_loc = pd.read_csv('sample_submission.csv')
    grid = ss_loc.copy()
    parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
    grid['Patient'] = parts[0]
    grid['Weeks'] = parts[1].astype(int)
    test_bl = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
    grid = grid.merge(test_bl, on='Patient', how='left')
    meta = test[['Patient','Percent','Age','Sex','SmokingStatus']].drop_duplicates('Patient')
    grid = grid.merge(meta, on='Patient', how='left', suffixes=('', '_meta'))
    grid['slope_w'] = 0.0; grid['r2_w'] = 0.0; grid['slope_percent_w'] = 0.0; grid['n_obs'] = 1; grid['has_trend'] = 0; grid['is_singleton'] = 1
    # Add group slope prior (from train_full) and BaseFVC percentile + group baseFVC median for TEST
    grid['AgeBin'] = pd.cut(grid['Age'].astype(float), bins=[0,50,60,70,80,200], labels=['<=50','50-60','60-70','70-80','80+'], include_lowest=True)
    grid = grid.merge(grp_full, on=['Sex','SmokingStatus','AgeBin'], how='left')
    grid['group_slope_prior'] = grid['group_slope_prior'].fillna(g_slope_full_loc)
    grid = grid.merge(grp_basefvc_full, on=['Sex','SmokingStatus','AgeBin'], how='left')
    grid['group_basefvc_median'] = grid['group_basefvc_median'].fillna(train_full['Base_FVC'].median())
    grid['delta_BaseFVC'] = grid['Base_FVC'] - grid['group_basefvc_median']
    grid['BaseFVC_pct'] = ecdf_percentile(base_full_sorted, grid['Base_FVC'].values)
    percent_full_sorted = np.sort(train_full['Percent_at_base'].values.astype(float))
    grid['PercentBase_pct'] = ecdf_percentile(percent_full_sorted, grid['Percent_at_base'].values)
    grid['pred0'] = grid['Base_FVC'] + g_slope_full_loc * (grid['Weeks'] - grid['Base_Week'])
    gridF = build_features_v2(grid)
    gridF['WP_x_group_slope'] = gridF['Weeks_Passed_cap'] * gridF['group_slope_prior']
    gridF['WP_x_BaseFVC_pct'] = gridF['Weeks_Passed_cap'] * gridF['BaseFVC_pct']
    gridF['shrunk_slope'] = 0.7 * gridF['group_slope_prior'] + 0.3 * g_slope_full_loc
    gridF['WP_x_shrunk_slope'] = gridF['Weeks_Passed_cap'] * gridF['shrunk_slope']
    gridF['delta_BaseFVC'] = grid['delta_BaseFVC'].values
    gridF['PercentBase_pct'] = grid['PercentBase_pct'].values
    gridF['WP_x_delta_BaseFVC'] = gridF['Weeks_Passed_cap'] * gridF['delta_BaseFVC']
    gridF['BaseFVC_per_Age'] = gridF['Base_FVC'] / np.clip(gridF['Age'], 1, None)
    gridF['PercentBase_per_Age'] = gridF['Percent_clipped'] / np.clip(gridF['Age'], 1, None)
    gridF['WP_x_BaseFVC_per_Age'] = gridF['Weeks_Passed_cap'] * gridF['BaseFVC_per_Age']
    gridF = one_hot_transform(gridF, cats_full_loc)
    dgrid = xgb.DMatrix(gridF[feat_cols_loc])
    res_pred = model_full.predict(dgrid)
    fvc_pred = (gridF['pred0'].values + res_pred).clip(500, 6000)
    dist_test = (gridF['Weeks'] - gridF['Base_Week']).abs().astype(float).values
    return {
        'y_true': y_true, 'y_pred': y_pred, 'dist': dist, 'res': res,
        'fvc_test': fvc_pred, 'abs_res_hat_test': np.abs(res_pred), 'dist_test': dist_test
    }

# Expand to 9 seeds with slight jitter per expert advice (more diversity: include depth=3 seeds, subsample/colsample patterns, lr up to 0.055, vary colsample_bylevel)
seeds = [42, 1337, 2025, 7, 99, 123, 321, 777, 1001]

def jitter_params(seed):
    # Deterministic small jitters based on seed for diversity
    offs = seed % 4  # 0..3
    # learning_rate in 0.040-0.055
    lr_choices = [0.040, 0.045, 0.050, 0.055]
    lr = lr_choices[offs]
    # depth=3 when seed is divisible by 6 (only seed 42 in the current list); otherwise 4 (even seed) or 5 (odd seed)
    if seed % 6 == 0:
        max_depth = 3
    else:
        max_depth = 4 if (seed % 2 == 0) else 5
    # min_child_weight in 12-28
    mcw_choices = [12, 18, 22, 28]
    min_child_weight = mcw_choices[offs]
    # lambda in 3-9
    lam_choices = [3.0, 5.0, 7.0, 9.0]
    reg_lambda = lam_choices[(seed // 3) % 4]
    # subsample patterns {0.75,0.85,0.95}
    subs_choices = [0.75, 0.85, 0.95]
    subsample = subs_choices[(seed // 5) % 3]
    # colsample_bytree {0.80,0.90,1.00}
    col_choices = [0.80, 0.90, 1.00]
    colsample = col_choices[(seed // 7) % 3]
    # colsample_bylevel = 0.8 when seed % 7 is 0 or 3 (5 of the 9 current seeds); otherwise 1.0
    colsample_bylevel = 0.8 if (seed % 7 in (0, 3)) else 1.0
    return dict(lr=lr, max_depth=max_depth, min_child_weight=min_child_weight, reg_lambda=reg_lambda, subsample=subsample, colsample=colsample, colsample_bylevel=colsample_bylevel)

runs = []
for s in seeds:
    jp = jitter_params(s)
    # jp keys match the keyword names of forward_residual_oof_and_full, so unpack them directly
    runs.append(forward_residual_oof_and_full(seed=s, **jp))

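# Aggregate OOF by averaging predictions; verify that every run shares the same y_true/dist ordering
y_true_ref = runs[0]['y_true']
dist_ref = runs[0]['dist']
assert all(np.array_equal(r['y_true'], y_true_ref) and np.array_equal(r['dist'], dist_ref) for r in runs), 'OOF ordering differs across seeds'
y_pred_avg = np.mean([r['y_pred'] for r in runs], axis=0)
# NOTE: this proxy is the realized OOF error and therefore depends on y_true; at test time the c-term uses |res_pred| instead,
# so the OOF LL reported for c > 0 is likely optimistic.
abs_res_hat_oof_avg = np.mean([np.abs(r['y_true'] - r['y_pred']) for r in runs], axis=0)  # proxy from OOF errors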

# Retune sigma on OOF with conservative, transfer-friendly grid
best = (-1e9, None, None, None)
for a in [160, 200, 240]:
    for b in [2.0, 3.0]:
        for c in [0.5, 1.0]:
            sig = np.maximum(a + b*dist_ref + c*abs_res_hat_oof_avg, 70.0)
            score = laplace_ll(y_true_ref, y_pred_avg, sig)
            if score > best[0]:
                best = (score, a, b, c)
print(f'Seed-avg Forward OOF LL: {best[0]:.5f} with a={best[1]} b={best[2]} c={best[3]}')
a_best, b_best, c_best = best[1], best[2], best[3]

# Average test predictions and abs residual proxy
fvc_test_avg = np.mean([r['fvc_test'] for r in runs], axis=0)
abs_res_hat_test_avg = np.mean([r['abs_res_hat_test'] for r in runs], axis=0)
dist_test_ref = runs[0]['dist_test']
sigma_test_primary = np.maximum(a_best + b_best*dist_test_ref + c_best*abs_res_hat_test_avg, 70.0)
# Optional long-horizon floor: sigma >= 100 for dist > 20
sigma_test_primary = np.where(dist_test_ref > 20.0, np.maximum(sigma_test_primary, 100.0), sigma_test_primary)

# Write primary seed-averaged submission
ss = pd.read_csv('sample_submission.csv')
submission_primary = pd.DataFrame({
    'Patient_Week': ss['Patient_Week'],
    'FVC': fvc_test_avg.astype(float),
    'Confidence': sigma_test_primary.astype(float)
})
submission_primary.to_csv('submission.csv', index=False)
print('Saved seed-averaged forward residual submission.csv with tuned sigma (a,b,c)=', a_best, b_best, c_best)

[Seed42-Fold 1] g_slope=-3.8062 scored=282 MAE=165.22 iters=16 elapsed=0.66s
[Seed42-Fold 2] g_slope=-3.5547 scored=281 MAE=117.13 iters=8 elapsed=0.63s
[Seed42-Fold 3] g_slope=-3.5065 scored=275 MAE=136.70 iters=7 elapsed=0.63s
[Seed42-Fold 4] g_slope=-3.5065 scored=275 MAE=155.62 iters=1 elapsed=0.62s
[Seed42-Fold 5] g_slope=-3.6557 scored=281 MAE=132.67 iters=528 elapsed=1.55s
[Seed1337-Fold 1] g_slope=-3.8062 scored=282 MAE=165.59 iters=8 elapsed=0.78s
[Seed1337-Fold 2] g_slope=-3.5547 scored=281 MAE=118.05 iters=4 elapsed=0.77s
[Seed1337-Fold 3] g_slope=-3.5065 scored=275 MAE=135.80 iters=43 elapsed=0.86s
[Seed1337-Fold 4] g_slope=-3.5065 scored=275 MAE=154.89 iters=1 elapsed=0.77s
[Seed1337-Fold 5] g_slope=-3.6557 scored=281 MAE=132.44 iters=665 elapsed=2.30s
[Seed2025-Fold 1] g_slope=-3.8062 scored=282 MAE=166.53 iters=6 elapsed=0.71s
[Seed2025-Fold 2] g_slope=-3.5547 scored=281 MAE=117.87 iters=3 elapsed=0.70s
[Seed2025-Fold 3] g_slope=-3.5065 scored=275 MAE=135.80 iters=52 elapsed=0.80s
[Seed2025-Fold 4] g_slope=-3.5065 scored=275 MAE=154.78 iters=2 elapsed=0.70s
[Seed2025-Fold 5] g_slope=-3.6557 scored=281 MAE=134.05 iters=198 elapsed=1.11s
[Seed7-Fold 1] g_slope=-3.8062 scored=282 MAE=166.27 iters=10 elapsed=0.81s
[Seed7-Fold 2] g_slope=-3.5547 scored=281 MAE=117.64 iters=2 elapsed=0.79s
[Seed7-Fold 3] g_slope=-3.5065 scored=275 MAE=136.99 iters=1 elapsed=0.78s
[Seed7-Fold 4] g_slope=-3.5065 scored=275 MAE=154.93 iters=18 elapsed=0.81s
[Seed7-Fold 5] g_slope=-3.6557 scored=281 MAE=135.67 iters=183 elapsed=1.19s
[Seed99-Fold 1] g_slope=-3.8062 scored=282 MAE=167.10 iters=1 elapsed=0.67s
[Seed99-Fold 2] g_slope=-3.5547 scored=281 MAE=117.93 iters=7 elapsed=0.69s
[Seed99-Fold 3] g_slope=-3.5065 scored=275 MAE=136.06 iters=13 elapsed=0.70s
[Seed99-Fold 4] g_slope=-3.5065 scored=275 MAE=155.27 iters=77 elapsed=0.82s
[Seed99-Fold 5] g_slope=-3.6557 scored=281 MAE=133.82 iters=206 elapsed=1.07s
[Seed123-Fold 1] g_slope=-3.8062 scored=282 MAE=165.45 iters=9 elapsed=0.69s
[Seed123-Fold 2] g_slope=-3.5547 scored=281 MAE=117.75 iters=3 elapsed=0.67s
[Seed123-Fold 3] g_slope=-3.5065 scored=275 MAE=136.82 iters=4 elapsed=0.68s
[Seed123-Fold 4] g_slope=-3.5065 scored=275 MAE=155.21 iters=1 elapsed=0.67s
[Seed123-Fold 5] g_slope=-3.6557 scored=281 MAE=133.84 iters=288 elapsed=1.22s
[Seed321-Fold 1] g_slope=-3.8062 scored=282 MAE=166.39 iters=34 elapsed=0.77s
[Seed321-Fold 2] g_slope=-3.5547 scored=281 MAE=117.95 iters=1 elapsed=0.70s
[Seed321-Fold 3] g_slope=-3.5065 scored=275 MAE=136.77 iters=9 elapsed=0.72s
[Seed321-Fold 4] g_slope=-3.5065 scored=275 MAE=154.65 iters=174 elapsed=1.05s
[Seed321-Fold 5] g_slope=-3.6557 scored=281 MAE=135.63 iters=339 elapsed=1.40s
[Seed777-Fold 1] g_slope=-3.8062 scored=282 MAE=166.16 iters=29 elapsed=0.85s
[Seed777-Fold 2] g_slope=-3.5547 scored=281 MAE=117.63 iters=1 elapsed=0.78s
[Seed777-Fold 3] g_slope=-3.5065 scored=275 MAE=136.98 iters=4 elapsed=0.80s
[Seed777-Fold 4] g_slope=-3.5065 scored=275 MAE=154.05 iters=150 elapsed=1.13s
[Seed777-Fold 5] g_slope=-3.6557 scored=281 MAE=136.29 iters=232 elapsed=1.33s
[Seed1001-Fold 1] g_slope=-3.8062 scored=282 MAE=167.30 iters=4 elapsed=0.83s
[Seed1001-Fold 2] g_slope=-3.5547 scored=281 MAE=118.00 iters=2 elapsed=0.79s
[Seed1001-Fold 3] g_slope=-3.5065 scored=275 MAE=136.98 iters=6 elapsed=0.78s
[Seed1001-Fold 4] g_slope=-3.5065 scored=275 MAE=154.94 iters=1 elapsed=0.77s
[Seed1001-Fold 5] g_slope=-3.6557 scored=281 MAE=133.90 iters=189 elapsed=1.19s

Seed-avg Forward OOF LL: -5.96224 with a=160 b=2.0 c=0.5
Saved seed-averaged forward residual submission.csv with tuned sigma (a,b,c)= 160 2.0 0.5

In [22]:
# Compare OOF LL: primary (seed-avg tuned sigma) vs banker (dist-only 200+3*dist)
import numpy as np

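# Note: the official OSIC metric is -sqrt(2)*delta/sigma - ln(sqrt(2)*sigma); this helper omits the sqrt(2) factors,
# so its values are not directly comparable to leaderboard scores.
def laplace_ll(y_true, y_pred, sigma):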
    y_true = np.asarray(y_true).astype(float)
    y_pred = np.asarray(y_pred).astype(float)
    sigma = np.asarray(sigma).astype(float)
    delta = np.abs(y_true - y_pred)
    delta = np.minimum(delta, 1000.0)
    sigma = np.maximum(sigma, 70.0)
    return np.mean(-delta / sigma - np.log(sigma))

y_true_cmp = y_true_ref
y_pred_cmp = y_pred_avg
dist_cmp = dist_ref
sig_primary = np.maximum(a_best + b_best*dist_cmp + c_best*abs_res_hat_oof_avg, 70.0)
sig_banker = np.maximum(200.0 + 3.0*dist_cmp, 70.0)
ll_primary = laplace_ll(y_true_cmp, y_pred_cmp, sig_primary)
ll_banker = laplace_ll(y_true_cmp, y_pred_cmp, sig_banker)
print(f'OOF LL primary (seed-avg tuned sigma): {ll_primary:.5f}')
print(f'OOF LL banker (200 + 3*dist):        {ll_banker:.5f}')
print('Delta (primary - banker):', f'{ll_primary - ll_banker:.5f}')
print('Note: If delta < 0.02, prefer banker for LB robustness per expert advice.')

OOF LL primary (seed-avg tuned sigma): -5.81645
OOF LL banker (200 + 3*dist):        -6.06067
Delta (primary - banker): 0.24423
Note: If delta < 0.02, prefer banker for LB robustness per expert advice.
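
For reference, the competition's published form of the metric carries a sqrt(2) factor in both terms, while the helper in the cell above does not. The sketch below puts the two forms side by side on toy values. The function names `laplace_ll_official` and `laplace_ll_plain` and the toy arrays are illustrative assumptions, not part of the notebook.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def _clip(y_true, y_pred, sigma):
    # Shared clipping: sigma floor of 70 and absolute-error cap of 1000
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sigma = np.maximum(np.asarray(sigma, dtype=float), 70.0)
    delta = np.minimum(np.abs(y_true - y_pred), 1000.0)
    return delta, sigma

def laplace_ll_official(y_true, y_pred, sigma):
    """Mean modified Laplace log likelihood with the sqrt(2) factors."""
    delta, sigma = _clip(y_true, y_pred, sigma)
    return float(np.mean(-SQRT2 * delta / sigma - np.log(SQRT2 * sigma)))

def laplace_ll_plain(y_true, y_pred, sigma):
    """Same clipping, without the sqrt(2) factors."""
    delta, sigma = _clip(y_true, y_pred, sigma)
    return float(np.mean(-delta / sigma - np.log(sigma)))

# Toy values (not taken from the competition data)
y_true = np.array([2000.0, 2500.0, 3000.0])
y_pred = np.array([2100.0, 2500.0, 1500.0])  # last error (1500) is capped at 1000
sigma = np.array([200.0, 50.0, 400.0])       # 50 is raised to the floor of 70

ll_off = laplace_ll_official(y_true, y_pred, sigma)
ll_plain = laplace_ll_plain(y_true, y_pred, sigma)
print(f'official: {ll_off:.5f}  plain: {ll_plain:.5f}')
```

The official form is always lower than the plain form for the same inputs, and the two do not differ by a constant, so a sigma tuned under one is not guaranteed to be optimal under the other.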


In [39]:
# Blend 0.85*seed-avg forward residual + 0.15*anchored baseline; keep tuned sigma; apply guardrails (FVC non-increasing, sigma monotone in dist)
import numpy as np, pandas as pd

# Ensure we have seed-avg outputs from Cell 13
assert 'fvc_test_avg' in globals() and 'sigma_test_primary' in globals(), 'Run Cell 13 first to get seed-avg predictions and sigma.'

# Build strict test grid and anchored baseline
ss = pd.read_csv('sample_submission.csv')
grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]
grid['Weeks'] = parts[1].astype(int)
test_bl = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_bl, on='Patient', how='left')

# Use existing g_slope_full if available; otherwise compute
try:
    gsf = float(g_slope_full)
except Exception:
    slopes_full_tmp = compute_patient_slopes(train)
    gsf = robust_global_slope(slopes_full_tmp)

fvc_anchor = (grid['Base_FVC'].values + gsf * (grid['Weeks'].values - grid['Base_Week'].values)).astype(float)

# Blend
w = 0.85
fvc_blend = w * fvc_test_avg.astype(float) + (1.0 - w) * fvc_anchor

# Guardrails: per-patient non-increasing w.r.t. future weeks; clip to [500, 6000]
df_out = pd.DataFrame({
    'Patient': grid['Patient'].values,
    'Weeks': grid['Weeks'].values.astype(int),
    'FVC': fvc_blend
})
df_out['FVC'] = df_out['FVC'].clip(500, 6000)
def enforce_non_increasing(g):
    g = g.sort_values('Weeks').copy()
    # cumulative minimum forward in time, so FVC never rises as Weeks increases
    g['FVC'] = np.minimum.accumulate(g['FVC'].values)
    return g
# sort_index() restores the original row order so values stay aligned with ss['Patient_Week']
df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_non_increasing).sort_index()
fvc_blend_guarded = df_out['FVC'].values.astype(float)

# Keep sigma from tuned seed-avg, then enforce sigma monotone in dist (non-decreasing with Weeks distance)
sigma_out = np.maximum(sigma_test_primary.astype(float), 70.0)
df_sig = pd.DataFrame({
    'Patient': grid['Patient'].values,
    'Weeks': grid['Weeks'].values.astype(int),
    'Base_Week': grid['Base_Week'].values.astype(int),
    'Sigma': sigma_out
})
df_sig['dist'] = (df_sig['Weeks'] - df_sig['Base_Week']).abs().astype(float)
def enforce_sigma_monotone(g):
    g = g.sort_values('dist').copy()
    g['Sigma'] = np.maximum.accumulate(g['Sigma'].values)
    return g
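# sort_index() restores the original row order so values stay aligned with ss['Patient_Week']
df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone).sort_index()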
sigma_out_monotone = df_sig['Sigma'].values.astype(float)

# Save submission
submission_blended = pd.DataFrame({
    'Patient_Week': ss['Patient_Week'],
    'FVC': fvc_blend_guarded,
    'Confidence': sigma_out_monotone
})
submission_blended.to_csv('submission.csv', index=False)
print('Saved blended submission.csv: 0.85*seed-avg + 0.15*anchored baseline with tuned sigma (monotone) and FVC guardrails.')

Saved blended submission.csv: 0.85*seed-avg + 0.15*anchored baseline with tuned sigma (monotone) and FVC guardrails.
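
A minimal sketch of the two guardrails on toy data, written with `groupby(...).cummin()` and `cummax()` instead of `apply`. This is an alternative formulation, not the notebook's code. The toy frame, its interleaved patient order, and the assumption that every patient's base week is 0 (so dist equals Weeks) are illustrative. Because the results are assigned back by index, rows keep their original order.

```python
import pandas as pd

# Rows are interleaved by patient, as a submission file might be (assumed layout)
toy = pd.DataFrame({
    'Patient': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Weeks':   [0, 0, 1, 1, 2, 2],
    'FVC':     [3000.0, 2500.0, 3050.0, 2400.0, 2900.0, 2450.0],
    'Sigma':   [200.0, 210.0, 195.0, 220.0, 230.0, 215.0],
})

# Sort within patient by time, then take running min/max per patient.
# The returned Series keep the original index labels.
ordered = toy.sort_values(['Patient', 'Weeks'])
toy['FVC_guarded'] = ordered.groupby('Patient')['FVC'].cummin()       # never rises over time
toy['Sigma_guarded'] = ordered.groupby('Patient')['Sigma'].cummax()   # never shrinks with dist

print(toy)
```

Patient A's FVC of 3050 at week 1 is pulled down to 3000, and patient B's sigma of 215 at week 2 is raised to 220; every other value is unchanged.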


In [25]:
# Conservative-probe submission: same blended FVC, more conservative sigma (a=160, b=2.0, c=0.5)
import numpy as np, pandas as pd

assert 'fvc_test_avg' in globals() and 'abs_res_hat_test_avg' in globals() and 'dist_test_ref' in globals(), 'Run Cell 13 first.'

# Rebuild blend (0.85*seed-avg + 0.15*anchored baseline) to be robust to state
ss = pd.read_csv('sample_submission.csv')
grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]
grid['Weeks'] = parts[1].astype(int)
test_bl = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_bl, on='Patient', how='left')
try:
    gsf = float(g_slope_full)
except Exception:
    slopes_full_tmp = compute_patient_slopes(train)
    gsf = robust_global_slope(slopes_full_tmp)
fvc_anchor = (grid['Base_FVC'].values + gsf * (grid['Weeks'].values - grid['Base_Week'].values)).astype(float)
w = 0.85
fvc_blend = w * fvc_test_avg.astype(float) + (1.0 - w) * fvc_anchor

# Guardrails: non-increasing with future weeks and clip
df_out = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'FVC': fvc_blend})
df_out['FVC'] = df_out['FVC'].clip(500, 6000)
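def enforce_non_increasing(g):
    g = g.sort_values('Weeks').copy()
    # cumulative minimum forward in time, so FVC never rises as Weeks increases
    g['FVC'] = np.minimum.accumulate(g['FVC'].values)
    return g
# sort_index() restores the original row order so values stay aligned with ss['Patient_Week']
df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_non_increasing).sort_index()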
fvc_blend_guarded = df_out['FVC'].values.astype(float)

# Conservative sigma per expert probe
a_cons, b_cons, c_cons = 160.0, 2.0, 0.5
sigma_cons_hybrid = np.maximum(a_cons + b_cons*dist_test_ref + c_cons*abs_res_hat_test_avg, 70.0).astype(float)

submission_cons2 = pd.DataFrame({
    'Patient_Week': ss['Patient_Week'],
    'FVC': fvc_blend_guarded,
    'Confidence': sigma_cons_hybrid
})
submission_cons2.to_csv('submission.csv', index=False)
print('Saved conservative-probe submission.csv with blended FVC and sigma = max(160 + 2*dist + 0.5*|res_hat|, 70).')

Saved conservative-probe submission.csv with blended FVC and sigma = max(160 + 2*dist + 0.5*|res_hat|, 70).
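
The hybrid form `max(a + b*dist + c*|res_hat|, 70)` can be tuned with a small grid search, as sketched below on synthetic data. Everything here is an assumption for illustration: the data are random draws, `res_hat` stands in for a model-predicted residual that does not depend on `y_true`, and the grid values simply mirror the ones used earlier in the notebook.

```python
import itertools
import numpy as np

def laplace_ll(y_true, y_pred, sigma):
    # Same form as the notebook's helper (no sqrt(2) factors)
    delta = np.minimum(np.abs(y_true - y_pred), 1000.0)
    sigma = np.maximum(sigma, 70.0)
    return float(np.mean(-delta / sigma - np.log(sigma)))

rng = np.random.default_rng(0)
n = 500
dist = rng.integers(0, 40, size=n).astype(float)   # weeks from baseline
res_hat = rng.normal(0.0, 80.0, size=n)            # stand-in for predicted residual
y_pred = np.full(n, 2500.0)
y_true = y_pred + rng.laplace(0.0, 120.0 + 3.0 * dist)  # noise scale grows with dist

grid = list(itertools.product([160, 200, 240], [2.0, 3.0], [0.5, 1.0]))
best_score, best_abc = -np.inf, None
for a, b, c in grid:
    sig = np.maximum(a + b * dist + c * np.abs(res_hat), 70.0)
    score = laplace_ll(y_true, y_pred, sig)
    if score > best_score:
        best_score, best_abc = score, (a, b, c)

print(f'best LL={best_score:.5f} at (a, b, c)={best_abc}')
```

Keeping the sigma inputs free of `y_true` means the tuned score is computed the same way it would be at test time.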


In [28]:
# Banker submission: blended FVC with strict hygiene, sigma = max(300 + 6*dist, 70)
import numpy as np, pandas as pd

# Require seed-avg predictions computed (Cell 13). Rebuild blend to be state-robust.
assert 'fvc_test_avg' in globals(), 'Run Cell 13 first to compute seed-averaged predictions.'

ss = pd.read_csv('sample_submission.csv')
grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]
grid['Weeks'] = parts[1].astype(int)
test_bl = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_bl, on='Patient', how='left')

# Anchored baseline using global slope from full train
try:
    gsf = float(g_slope_full)
except Exception:
    slopes_full_tmp = compute_patient_slopes(train)
    gsf = robust_global_slope(slopes_full_tmp)
fvc_anchor = (grid['Base_FVC'].values + gsf * (grid['Weeks'].values - grid['Base_Week'].values)).astype(float)

# Blend 0.85 seed-avg + 0.15 anchored
w = 0.85
fvc_blend = w * fvc_test_avg.astype(float) + (1.0 - w) * fvc_anchor

# Guardrails: per-patient non-increasing in future weeks; clip
df_out = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'FVC': fvc_blend})
df_out['FVC'] = df_out['FVC'].clip(500, 6000)
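def enforce_non_increasing(g):
    g = g.sort_values('Weeks').copy()
    # cumulative minimum forward in time, so FVC never rises as Weeks increases
    g['FVC'] = np.minimum.accumulate(g['FVC'].values)
    return g
# sort_index() restores the original row order so values stay aligned with ss['Patient_Week']
df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_non_increasing).sort_index()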
fvc_final = df_out['FVC'].values.astype(float)

# Banker sigma
dist = (grid['Weeks'] - grid['Base_Week']).abs().astype(float).values
sigma_banker = np.maximum(300.0 + 6.0 * dist, 70.0)

submission_banker = pd.DataFrame({
    'Patient_Week': ss['Patient_Week'],
    'FVC': fvc_final,
    'Confidence': sigma_banker.astype(float)
})
submission_banker.to_csv('submission.csv', index=False)
print('Saved banker submission.csv with sigma = max(300 + 6*dist, 70) and blended FVC (0.85 seed-avg + 0.15 anchor).')

Saved banker submission.csv with sigma = max(300 + 6*dist, 70) and blended FVC (0.85 seed-avg + 0.15 anchor).
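
To make the dist-only schedules that appear in this notebook easier to compare, the sketch below tabulates them at a few distances. The chosen distances and the `schedules` dictionary are illustrative assumptions; only the (a, b) pairs come from the notebook.

```python
import numpy as np

dist = np.array([0.0, 10.0, 20.0, 40.0])  # weeks from baseline (illustrative)

# (a, b) pairs for sigma = max(a + b*dist, 70)
schedules = {
    '200 + 3*dist': (200.0, 3.0),
    '240 + 3*dist': (240.0, 3.0),
    '300 + 6*dist': (300.0, 6.0),
}

sigma_table = {
    name: np.maximum(a + b * dist, 70.0)
    for name, (a, b) in schedules.items()
}

for name, sig in sigma_table.items():
    print(f'{name:>14}: {sig.tolist()}')
```

Since every intercept is at least 200 and dist is non-negative, the floor of 70 never binds for these schedules.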


In [35]:
# Banker-240x3 submission: blended FVC with strict hygiene, sigma = max(240 + 3*dist, 70), with guardrails and sigma monotonicity
import numpy as np, pandas as pd

assert 'fvc_test_avg' in globals(), 'Run Cell 13 first to compute seed-averaged predictions.'

ss = pd.read_csv('sample_submission.csv')
grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]
grid['Weeks'] = parts[1].astype(int)
test_bl = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_bl, on='Patient', how='left')

# Anchored baseline using global slope from full train
try:
    gsf = float(g_slope_full)
except Exception:
    slopes_full_tmp = compute_patient_slopes(train)
    gsf = robust_global_slope(slopes_full_tmp)
fvc_anchor = (grid['Base_FVC'].values + gsf * (grid['Weeks'].values - grid['Base_Week'].values)).astype(float)

# Blend 0.85 seed-avg + 0.15 anchored
w = 0.85
fvc_blend = w * fvc_test_avg.astype(float) + (1.0 - w) * fvc_anchor

# Guardrails: per-patient non-increasing in future weeks; clip
df_out = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'FVC': fvc_blend})
df_out['FVC'] = df_out['FVC'].clip(500, 6000)
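def enforce_non_increasing(g):
    g = g.sort_values('Weeks').copy()
    # cumulative minimum forward in time, so FVC never rises as Weeks increases
    g['FVC'] = np.minimum.accumulate(g['FVC'].values)
    return g
# sort_index() restores the original row order so values stay aligned with ss['Patient_Week']
df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_non_increasing).sort_index()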
fvc_final = df_out['FVC'].values.astype(float)

# Banker 240 + 3*dist sigma, enforce monotone in dist
dist = (grid['Weeks'] - grid['Base_Week']).abs().astype(float).values
sigma = np.maximum(240.0 + 3.0 * dist, 70.0).astype(float)
df_sig = pd.DataFrame({'Patient': grid['Patient'].values, 'dist': dist, 'Sigma': sigma})
def enforce_sigma_monotone(g):
    g = g.sort_values('dist').copy()
    g['Sigma'] = np.maximum.accumulate(g['Sigma'].values)
    return g
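# sort_index() restores the original row order so values stay aligned with ss['Patient_Week']
df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone).sort_index()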
sigma_final = df_sig['Sigma'].values.astype(float)

submission_banker_240x3 = pd.DataFrame({
    'Patient_Week': ss['Patient_Week'],
    'FVC': fvc_final,
    'Confidence': sigma_final
})
submission_banker_240x3.to_csv('submission.csv', index=False)
print('Saved banker-240x3 submission.csv with sigma = max(240 + 3*dist, 70), blended FVC, and guardrails.')

Saved banker-240x3 submission.csv with sigma = max(240 + 3*dist, 70), blended FVC, and guardrails.
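
Each submission cell rebuilds the test grid by splitting `Patient_Week` on its last underscore and left-merging the baseline row. A minimal sketch of that step follows; the patient IDs and values are made up for illustration. It shows that negative weeks parse correctly and that the left merge keeps the submission's row order.

```python
import pandas as pd

# Hypothetical submission keys, including negative weeks and an ID containing '_'
ss_toy = pd.DataFrame({'Patient_Week': ['ID001_-12', 'ID_x_2_-12', 'ID001_5', 'ID_x_2_133']})

# rsplit with n=1 splits only on the last underscore
parts = ss_toy['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid_toy = ss_toy.copy()
grid_toy['Patient'] = parts[0]
grid_toy['Weeks'] = parts[1].astype(int)

# One baseline row per test patient (hypothetical values)
test_toy = pd.DataFrame({'Patient': ['ID_x_2', 'ID001'], 'Weeks': [4, 0], 'FVC': [2800.0, 3100.0]})
base = test_toy.rename(columns={'Weeks': 'Base_Week', 'FVC': 'Base_FVC'})

# A left merge preserves the order of the left frame's rows
grid_toy = grid_toy.merge(base, on='Patient', how='left')
grid_toy['dist'] = (grid_toy['Weeks'] - grid_toy['Base_Week']).abs().astype(float)

print(grid_toy)
```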


In [40]:
# Forward-CV LightGBM residual model (CPU) + simple blend with XGB seed-avg and anchored baseline
import numpy as np, pandas as pd, time, gc, sys, subprocess
from sklearn.model_selection import GroupKFold

try:
    import lightgbm as lgb
except Exception:
    print('Installing lightgbm...', flush=True)
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'lightgbm==4.6.0', '--no-input'], check=True)
    import lightgbm as lgb

def forward_residual_oof_and_full_lgb(seed=202, cap_wp=26):
    gkf = GroupKFold(n_splits=5)
    groups = train['Patient'].values
    y_true_all, y_pred_all, dist_all, res_all = [], [], [], []
    t0 = time.time()
    best_iters = []
    for fold, (trn_idx, val_idx) in enumerate(gkf.split(train, groups=groups), 1):
        tf = time.time()
        trn = train.iloc[trn_idx].copy()
        val = train.iloc[val_idx].copy()
        # Anchors
        base_trn = (trn.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False)
                    .first()[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'}))
        trn = trn.merge(base_trn, on='Patient', how='left')
        base_val = (val.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False)
                    .first()[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'}))
        val = val.merge(base_val, on='Patient', how='left')
        # TRAIN-only trend stats (will be zero for VAL to mirror test hygiene)
        stats_trn = compute_trend_stats(trn); stats_trn['has_trend'] = (stats_trn['n_obs'] >= 2).astype(int)
        pstats_trn = compute_percent_trend_stats(trn)
        trn = trn.merge(stats_trn, on='Patient', how='left').merge(pstats_trn, on='Patient', how='left')
        val = val.merge(stats_trn, on='Patient', how='left').merge(pstats_trn, on='Patient', how='left')
        for c, v in [('slope_w', 0.0), ('r2_w', 0.0), ('slope_percent_w', 0.0)]:
            trn[c] = trn[c].fillna(v); val[c] = val[c].fillna(v)
        trn['n_obs'] = trn['n_obs'].fillna(1).astype(int)
        val['n_obs'] = val['n_obs'].fillna(1).astype(int)
        trn['has_trend'] = trn['has_trend'].fillna(0).astype(int)
        val['has_trend'] = val['has_trend'].fillna(0).astype(int)
        val['is_singleton'] = 1; trn['is_singleton'] = (trn['n_obs'] <= 1).astype(int)
        # Global slope from TRAIN only
        g_slope = robust_global_slope(compute_patient_slopes(trn))
        trn['pred0'] = trn['Base_FVC'] + g_slope * (trn['Weeks'] - trn['Base_Week'])
        val['pred0'] = val['Base_FVC'] + g_slope * (val['Weeks'] - val['Base_Week'])
        # Features (strict baseline Percent usage)
        trnF = build_features_v2(trn, cap_wp=cap_wp)
        valF = build_features_v2(val, cap_wp=cap_wp)
        cats = one_hot_fit(trnF, ['Sex','SmokingStatus'])
        trnF = one_hot_transform(trnF, cats); valF = one_hot_transform(valF, cats)
        feat_cols = [
            'Weeks_Passed','Abs_Weeks_Passed','Weeks_Passed_cap','Weeks_Passed2','sign_WP','is_future',
            'Percent_clipped','Percent2','Age','Base_FVC','log_BaseFVC','Estimated_TLC','log_TLC',
            'Age_x_Percent','Percent_x_BaseFVC','WP_x_BaseFVC','WP_x_Percent','WP_x_Age',
            'slope_w','r2_w','slope_percent_w','n_obs','has_trend','is_singleton','WP_x_slope_w','WP_x_r2_w','WP_x_slope_percent_w','dPercent','WP_x_dPercent'
        ] + [c for c in trnF.columns if c.startswith('Sex__') or c.startswith('SmokingStatus__')]
        y_trn = (trn['FVC'] - trn['pred0']).astype(float).values
        y_val = (val['FVC'] - val['pred0']).astype(float).values
        lgb_train = lgb.Dataset(trnF[feat_cols], label=y_trn, free_raw_data=False)
        lgb_valid = lgb.Dataset(valF[feat_cols], label=y_val, free_raw_data=False)
        params = {
            'objective': 'l1',
            'metric': 'l1',
            'learning_rate': 0.05,
            'num_leaves': 63,
            'max_depth': -1,
            'min_data_in_leaf': 40,
            'min_sum_hessian_in_leaf': 10.0,
            'feature_fraction': 0.9,
            'bagging_fraction': 0.9,
            'bagging_freq': 1,
            'lambda_l2': 5.0,
            'verbosity': -1,
            'seed': seed
        }
        model = lgb.train(params, lgb_train, num_boost_round=4000, valid_sets=[lgb_valid],
                          callbacks=[lgb.early_stopping(300, verbose=False)])
        pred_res = model.predict(valF[feat_cols], num_iteration=model.best_iteration)
        val_pred = val['pred0'].values + pred_res
        mask = (val['Weeks'].values >= val['Base_Week'].values)
        v_true = val['FVC'].values[mask].astype(float)
        v_pred = val_pred[mask].astype(float)
        v_dist = (val['Weeks'].values[mask] - val['Base_Week'].values[mask]).astype(float)
        y_true_all.append(v_true); y_pred_all.append(v_pred); dist_all.append(v_dist)
        res_all.append((v_true - v_pred))
        best_iters.append(int(model.best_iteration if model.best_iteration is not None else 4000))
        mae = float(np.mean(np.abs(v_true - v_pred))) if v_true.size else np.nan
        print(f'[LGB-FWD-Fold {fold}] g_slope={g_slope:.4f} scored={v_true.size} MAE={mae:.2f} iters={best_iters[-1]} elapsed={time.time()-tf:.2f}s', flush=True)
        del model, lgb_train, lgb_valid, trnF, valF; gc.collect()
    y_true = np.concatenate(y_true_all) if y_true_all else np.array([], float)
    y_pred = np.concatenate(y_pred_all) if y_pred_all else np.array([], float)
    dist = np.concatenate(dist_all) if dist_all else np.array([], float)
    res = np.concatenate(res_all) if res_all else np.array([], float)
    # Fit full model with median iters
    slopes_full = compute_patient_slopes(train); g_slope_full_loc = robust_global_slope(slopes_full)
    base_full = (train.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False).first()[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'}))
    train_full = train.merge(base_full, on='Patient', how='left')
    stats_full = compute_trend_stats(train_full); stats_full['has_trend'] = (stats_full['n_obs'] >= 2).astype(int)
    pstats_full = compute_percent_trend_stats(train_full)
    train_full = train_full.merge(stats_full, on='Patient', how='left').merge(pstats_full, on='Patient', how='left')
    for c, v in [('slope_w', 0.0), ('r2_w', 0.0), ('slope_percent_w', 0.0)]:
        train_full[c] = train_full[c].fillna(v)
    train_full['n_obs'] = train_full['n_obs'].fillna(1).astype(int)
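    train_full['has_trend'] = train_full['has_trend'].fillna(0).astype(int)
    # Mirror the fold logic so 'is_singleton' is set explicitly for the full fit
    train_full['is_singleton'] = (train_full['n_obs'] <= 1).astype(int)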
    train_full['pred0'] = train_full['Base_FVC'] + g_slope_full_loc * (train_full['Weeks'] - train_full['Base_Week'])
    train_fullF = build_features_v2(train_full, cap_wp=cap_wp)
    # IMPORTANT: Recompute one-hot schema on FULL TRAIN and use for FULL + TEST
    cats_full = one_hot_fit(train_fullF, ['Sex','SmokingStatus'])
    train_fullF = one_hot_transform(train_fullF, cats_full)
    feat_cols_full = [
        'Weeks_Passed','Abs_Weeks_Passed','Weeks_Passed_cap','Weeks_Passed2','sign_WP','is_future',
        'Percent_clipped','Percent2','Age','Base_FVC','log_BaseFVC','Estimated_TLC','log_TLC',
        'Age_x_Percent','Percent_x_BaseFVC','WP_x_BaseFVC','WP_x_Percent','WP_x_Age',
        'slope_w','r2_w','slope_percent_w','n_obs','has_trend','is_singleton','WP_x_slope_w','WP_x_r2_w','WP_x_slope_percent_w','dPercent','WP_x_dPercent'
    ] + [c for c in train_fullF.columns if c.startswith('Sex__') or c.startswith('SmokingStatus__')]
    dtrain_full = lgb.Dataset(train_fullF[feat_cols_full], label=(train_fullF['FVC'] - train_fullF['pred0']).astype(float).values, free_raw_data=False)
    model_full = lgb.train(params, dtrain_full, num_boost_round=int(np.median(best_iters)))
    # Strict test grid
    ss_loc = pd.read_csv('sample_submission.csv')
    grid = ss_loc.copy()
    parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
    grid['Patient'] = parts[0]
    grid['Weeks'] = parts[1].astype(int)
    test_bl = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
    grid = grid.merge(test_bl, on='Patient', how='left')
    meta = test[['Patient','Percent','Age','Sex','SmokingStatus']].drop_duplicates('Patient')
    grid = grid.merge(meta, on='Patient', how='left', suffixes=('', '_meta'))
    grid['slope_w'] = 0.0; grid['r2_w'] = 0.0; grid['slope_percent_w'] = 0.0; grid['n_obs'] = 1; grid['has_trend'] = 0; grid['is_singleton'] = 1
    grid['pred0'] = grid['Base_FVC'] + g_slope_full_loc * (grid['Weeks'] - grid['Base_Week'])
    gridF = build_features_v2(grid, cap_wp=cap_wp)
    gridF = one_hot_transform(gridF, cats_full)
    res_pred = model_full.predict(gridF[feat_cols_full])
    fvc_test = (gridF['pred0'].values + res_pred).clip(500, 6000)
    dist_test = (gridF['Weeks'] - gridF['Base_Week']).abs().astype(float).values
    return {
        'y_true': y_true, 'y_pred': y_pred, 'dist': dist, 'res': res,
        'fvc_test': fvc_test, 'abs_res_hat_test': np.abs(res_pred), 'dist_test': dist_test
    }

# Train LGB forward residual and get test preds
lgb_run = forward_residual_oof_and_full_lgb(seed=202)
fvc_test_lgb = lgb_run['fvc_test']

# Build blended submission using XGB seed-avg (fvc_test_avg), LGB (10%), and anchor (15%); keep tuned sigma from seed-avg and guardrails
assert 'fvc_test_avg' in globals() and 'sigma_test_primary' in globals(), 'Run Cell 13 first'
ss = pd.read_csv('sample_submission.csv')
grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]
grid['Weeks'] = parts[1].astype(int)
test_bl = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_bl, on='Patient', how='left')
try:
    gsf = float(g_slope_full)
except Exception:
    slopes_full_tmp = compute_patient_slopes(train)
    gsf = robust_global_slope(slopes_full_tmp)
fvc_anchor = (grid['Base_FVC'].values + gsf * (grid['Weeks'].values - grid['Base_Week'].values)).astype(float)
# Weights: 0.75 XGB seed-avg, 0.10 LGB, 0.15 anchor
fvc_blend = 0.75 * fvc_test_avg.astype(float) + 0.10 * fvc_test_lgb.astype(float) + 0.15 * fvc_anchor
df_out = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'FVC': fvc_blend.clip(500, 6000)})
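def enforce_non_increasing(g):
    g = g.sort_values('Weeks').copy()
    # Forward running minimum over increasing Weeks makes FVC non-increasing in time.
    # (A reversed cummin is a suffix minimum: it yields a non-decreasing series and
    # flattens a declining curve to its final value.)
    g['FVC'] = np.minimum.accumulate(g['FVC'].values)
    return g
df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_non_increasing)
# Restore the original row order so values line up with ss['Patient_Week']
fvc_final = df_out.sort_index()['FVC'].values.astype(float)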
# Sigma: keep tuned seed-avg sigma and enforce monotonicity in dist
sigma_out = np.maximum(sigma_test_primary.astype(float), 70.0)
df_sig = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'Base_Week': grid['Base_Week'].values.astype(int), 'Sigma': sigma_out})
df_sig['dist'] = (df_sig['Weeks'] - df_sig['Base_Week']).abs().astype(float)
def enforce_sigma_monotone(g):
    g = g.sort_values('dist').copy()
    g['Sigma'] = np.maximum.accumulate(g['Sigma'].values)
    return g
df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)
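# Rows were re-sorted by dist within each patient; restore submission order before extracting values
sigma_final = df_sig.sort_index()['Sigma'].values.astype(float)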
submission_lgb_blend = pd.DataFrame({
    'Patient_Week': ss['Patient_Week'],
    'FVC': fvc_final,
    'Confidence': sigma_final
})
submission_lgb_blend.to_csv('submission.csv', index=False)
print('Saved XGB-seedavg(0.75) + LGB(0.10) + Anchor(0.15) blended submission.csv with tuned sigma and guardrails.')

[LGB-FWD-Fold 1] g_slope=-3.8062 scored=282 MAE=167.83 iters=1 elapsed=0.55s


[LGB-FWD-Fold 2] g_slope=-3.5547 scored=281 MAE=118.25 iters=4 elapsed=0.31s


[LGB-FWD-Fold 3] g_slope=-3.5065 scored=275 MAE=136.60 iters=3 elapsed=0.31s


[LGB-FWD-Fold 4] g_slope=-3.5065 scored=275 MAE=154.33 iters=84 elapsed=0.38s


[LGB-FWD-Fold 5] g_slope=-3.6557 scored=281 MAE=135.53 iters=228 elapsed=0.50s


Saved XGB-seedavg(0.75) + LGB(0.10) + Anchor(0.15) blended submission.csv with tuned sigma and guardrails.


  df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_non_increasing)
  df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)
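
A minimal sketch, on toy data, of the per-patient FVC guardrail used in the cell above. It illustrates two details worth checking: the direction in which the cumulative minimum runs, and that rows return to submission order after grouping. The frame `toy` and the helper `running_min_by_week` are hypothetical names introduced for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame in "submission order": rows interleave patients, as a sample file might.
toy = pd.DataFrame({
    "Patient": ["A", "B", "A", "B", "A", "B"],
    "Weeks": [0, 0, 1, 1, 2, 2],
    "FVC": [3000.0, 2500.0, 3050.0, 2400.0, 2900.0, 2450.0],
})

def running_min_by_week(g):
    """Make FVC non-increasing over Weeks via a forward running minimum."""
    g = g.sort_values("Weeks").copy()
    g["FVC"] = np.minimum.accumulate(g["FVC"].to_numpy())
    return g

# Process each patient separately, then restore the original row order.
pieces = [running_min_by_week(g) for _, g in toy.groupby("Patient")]
fixed = pd.concat(pieces).sort_index()

# For contrast: a suffix minimum (reverse, cummin, reverse) yields a
# non-decreasing sequence and flattens a declining curve to its last value.
declining = np.array([3000.0, 2950.0, 2900.0])
suffix_min = np.minimum.accumulate(declining[::-1])[::-1]

print(fixed["FVC"].tolist())
print(suffix_min.tolist())
```

Sorting by index at the end matters whenever the values are later paired positionally with `ss['Patient_Week']`, since grouping reorders rows by patient.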


In [41]:
# Tiny LGB weight sweep on forward OOF and regenerate blended submission with guardrails
import numpy as np, pandas as pd

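def laplace_ll(y_true, y_pred, sigma):
    # Simplified Laplace log-likelihood (higher is better). The official OSIC metric is
    # -sqrt(2)*delta/sigma - log(sqrt(2)*sigma); the sqrt(2) factors are omitted here, so
    # LL values are not on the leaderboard scale and the LL-optimal sigma differs.
    y_true = np.asarray(y_true).astype(float)
    y_pred = np.asarray(y_pred).astype(float)
    sigma = np.asarray(sigma).astype(float)
    delta = np.abs(y_true - y_pred)
    delta = np.minimum(delta, 1000.0)
    sigma = np.maximum(sigma, 70.0)
    return np.mean(-delta / sigma - np.log(sigma))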

# Requirements from Cell 13 (XGB seed-avg) and Cell 19 (LGB forward residual)
assert 'y_true_ref' in globals() and 'y_pred_avg' in globals() and 'dist_ref' in globals(), 'Run Cell 13 first (seed-avg OOF).'
assert 'lgb_run' in globals(), 'Run Cell 19 first (LGB forward residual).'
y_true_oof = y_true_ref
y_pred_xgb_oof = y_pred_avg
y_pred_lgb_oof = lgb_run['y_pred']
dist_oof = dist_ref

# Sanity: check lengths align; if not, fallback to default wl=0.10
fallback_wl = 0.10
if not (len(y_true_oof) == len(y_pred_xgb_oof) == len(y_pred_lgb_oof) == len(dist_oof)):
    print('OOF arrays length mismatch; using fallback wl=', fallback_wl)
    best_wl = fallback_wl
else:
    # Use primary sigma from seed-avg for tuning; robust to tiny weight tweaks
    assert 'a_best' in globals() and 'b_best' in globals() and 'c_best' in globals(), 'Missing tuned sigma params from Cell 13.'
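    # Proxy for abs residual from seed-avg to build sigma.
    # NOTE: this proxy is computed from y_true, so the OOF LL below is optimistic;
    # at test time only a predicted |residual| (abs_res_hat_test_avg) is available.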
    abs_res_proxy = np.abs(y_true_oof - y_pred_xgb_oof)
    sigma_oof = np.maximum(a_best + b_best*dist_oof + c_best*abs_res_proxy, 70.0)
    best = (-1e9, None)
    for wl in [0.05, 0.10, 0.15, 0.20, 0.25]:
        # OOF blend: adjust only between XGB and LGB; anchor held out of OOF (kept 0.15 at test time)
        y_pred_blend = (1.0 - wl) * y_pred_xgb_oof + wl * y_pred_lgb_oof
        score = laplace_ll(y_true_oof, y_pred_blend, sigma_oof)
        if score > best[0]:
            best = (score, wl)
    best_wl = best[1] if best[1] is not None else fallback_wl
    print(f'OOF sweep best LGB weight (within XGB+LGB only): wl={best_wl:.2f}, LL={best[0]:.5f}')
    # If gain small (<0.01), reduce wl to 0.05
    # Compare to pure XGB (wl=0) for delta
    score_xgb = laplace_ll(y_true_oof, y_pred_xgb_oof, sigma_oof)
    score_best = best[0]
    if (score_best - score_xgb) < 0.01:
        best_wl = 0.05
        print('Gain < 0.01; setting wl=0.05 for safety.')

# Retune sigma on the chosen OOF blend (optional, quick grid) using seed-avg proxies
y_pred_blend_oof = (1.0 - best_wl) * y_pred_xgb_oof + best_wl * y_pred_lgb_oof
abs_res_proxy = np.abs(y_true_oof - y_pred_xgb_oof)  # keep proxy from XGB for stability
best_sig = (-1e9, None, None, None)
for a in [160, 200, 240]:
    for b in [2.0, 3.0]:
        for c in [0.5, 1.0]:
            sigma_try = np.maximum(a + b*dist_oof + c*abs_res_proxy, 70.0)
            sc = laplace_ll(y_true_oof, y_pred_blend_oof, sigma_try)
            if sc > best_sig[0]:
                best_sig = (sc, a, b, c)
print(f'Retuned sigma on blended OOF: LL={best_sig[0]:.5f} with a={best_sig[1]} b={best_sig[2]} c={best_sig[3]}')
a_blend, b_blend, c_blend = best_sig[1], best_sig[2], best_sig[3]

# Build test-time 3-way blend: keep anchor=0.15, distribute remaining 0.85 as (XGB, LGB) = (0.85 - wl, wl)
assert 'fvc_test_avg' in globals() and 'dist_test_ref' in globals() and 'abs_res_hat_test_avg' in globals(), 'Run Cell 13 to get test arrays.'
assert 'fvc_test_lgb' in globals(), 'Run Cell 19 to compute LGB test predictions.'
ss = pd.read_csv('sample_submission.csv')
grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]
grid['Weeks'] = parts[1].astype(int)
test_bl = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_bl, on='Patient', how='left')
try:
    gsf = float(g_slope_full)
except Exception:
    slopes_full_tmp = compute_patient_slopes(train)
    gsf = robust_global_slope(slopes_full_tmp)
fvc_anchor = (grid['Base_FVC'].values + gsf * (grid['Weeks'].values - grid['Base_Week'].values)).astype(float)
wxgb = 0.85 - best_wl
wlgb = best_wl
wanc = 0.15
fvc_blend_test = wxgb * fvc_test_avg.astype(float) + wlgb * fvc_test_lgb.astype(float) + wanc * fvc_anchor

# Guardrails: non-increasing FVC per patient and sigma monotone in dist
df_out = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'FVC': fvc_blend_test.clip(500, 6000)})
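def enforce_non_increasing(g):
    g = g.sort_values('Weeks').copy()
    # Forward running minimum over increasing Weeks makes FVC non-increasing in time
    # (a reversed cummin would give a non-decreasing series instead).
    g['FVC'] = np.minimum.accumulate(g['FVC'].values)
    return g
df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_non_increasing)
# Restore the original row order so values line up with ss['Patient_Week']
fvc_final = df_out.sort_index()['FVC'].values.astype(float)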

# Sigma: use retuned (a_blend,b_blend,c_blend) on test with abs_res proxy from XGB seed-avg
dist_test = (grid['Weeks'] - grid['Base_Week']).abs().astype(float).values
sigma_final = np.maximum(a_blend + b_blend*dist_test + c_blend*abs_res_hat_test_avg.astype(float), 70.0)
df_sig = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'Base_Week': grid['Base_Week'].values.astype(int), 'Sigma': sigma_final})
df_sig['dist'] = (df_sig['Weeks'] - df_sig['Base_Week']).abs().astype(float)
def enforce_sigma_monotone(g):
    g = g.sort_values('dist').copy()
    g['Sigma'] = np.maximum.accumulate(g['Sigma'].values)
    return g
df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)
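# Rows were re-sorted by dist within each patient; restore submission order before extracting values
sigma_final = df_sig.sort_index()['Sigma'].values.astype(float)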

submission_blend_lgb_swept = pd.DataFrame({
    'Patient_Week': ss['Patient_Week'],
    'FVC': fvc_final,
    'Confidence': sigma_final
})
submission_blend_lgb_swept.to_csv('submission.csv', index=False)
print(f'Saved swept-weights submission.csv with weights XGB={wxgb:.2f}, LGB={wlgb:.2f}, Anchor={wanc:.2f}; sigma (a,b,c)=({a_blend},{b_blend},{c_blend}).')

OOF sweep best LGB weight (within XGB+LGB only): wl=0.05, LL=-5.96082
Gain < 0.01; setting wl=0.05 for safety.
Retuned sigma on blended OOF: LL=-5.96082 with a=160 b=2.0 c=0.5
Saved swept-weights submission.csv with weights XGB=0.80, LGB=0.05, Anchor=0.15; sigma (a,b,c)=(160,2.0,0.5).


  df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_non_increasing)
  df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)
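
The `laplace_ll` helper in the cell above scores with `-delta/sigma - log(sigma)`. The official competition metric also carries a factor of sqrt(2) in both terms. The sketch below puts the two side by side on a single toy observation to show how the LL-optimal sigma shifts; the function names, the toy values, and the sigma grid are assumptions made for illustration, not part of the notebook.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def laplace_ll_official(y_true, y_pred, sigma, sigma_floor=70.0, delta_cap=1000.0):
    """Modified Laplace log-likelihood with the sqrt(2) factors (higher is better)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sigma_c = np.maximum(np.asarray(sigma, dtype=float), sigma_floor)
    delta = np.minimum(np.abs(y_true - y_pred), delta_cap)
    return float(np.mean(-SQRT2 * delta / sigma_c - np.log(SQRT2 * sigma_c)))

def laplace_ll_simplified(y_true, y_pred, sigma, sigma_floor=70.0, delta_cap=1000.0):
    """Same clipping, but without the sqrt(2) factors (the variant used in the sweep)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sigma_c = np.maximum(np.asarray(sigma, dtype=float), sigma_floor)
    delta = np.minimum(np.abs(y_true - y_pred), delta_cap)
    return float(np.mean(-delta / sigma_c - np.log(sigma_c)))

# One toy observation with an absolute error of 100 ml.
y_t, y_p = [2800.0], [2700.0]
sigma_grid = np.arange(70.0, 301.0)

best_official = sigma_grid[np.argmax([laplace_ll_official(y_t, y_p, [s]) for s in sigma_grid])]
best_simplified = sigma_grid[np.argmax([laplace_ll_simplified(y_t, y_p, [s]) for s in sigma_grid])]

print("best sigma (official):  ", best_official)    # near sqrt(2) * 100
print("best sigma (simplified):", best_simplified)  # near 100
```

For a fixed error, the simplified form is maximized at sigma = delta and the official form at sigma = sqrt(2) * delta, so a sigma grid tuned under one does not necessarily transfer to the other.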


In [42]:
# Save primary submission copy and generate banker variant (240 + 3*dist) for current 3-way blend
import numpy as np, pandas as pd

# 1) Save current submission.csv as submission_primary.csv for safekeeping
sub_cur = pd.read_csv('submission.csv')
sub_cur.to_csv('submission_primary.csv', index=False)
print('Backed up current primary to submission_primary.csv')

# 2) Rebuild the same 3-way blend FVC using tuned LGB weight from Cell 20 (best_wl) and anchor=0.15
assert 'best_wl' in globals(), 'Run Cell 20 first to define best_wl for LGB weight.'
assert 'fvc_test_avg' in globals() and 'fvc_test_lgb' in globals(), 'Run Cells 13 and 19 first.'

ss = pd.read_csv('sample_submission.csv')
grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]
grid['Weeks'] = parts[1].astype(int)
test_bl = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_bl, on='Patient', how='left')
try:
    gsf = float(g_slope_full)
except Exception:
    slopes_full_tmp = compute_patient_slopes(train)
    gsf = robust_global_slope(slopes_full_tmp)
fvc_anchor = (grid['Base_FVC'].values + gsf * (grid['Weeks'].values - grid['Base_Week'].values)).astype(float)
wxgb = 0.85 - float(best_wl)
wlgb = float(best_wl)
wanc = 0.15
fvc_blend = wxgb * fvc_test_avg.astype(float) + wlgb * fvc_test_lgb.astype(float) + wanc * fvc_anchor

# Guardrails: non-increasing FVC per patient; clip
df_out = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'FVC': fvc_blend.clip(500, 6000)})
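def enforce_non_increasing(g):
    g = g.sort_values('Weeks').copy()
    # Forward running minimum over increasing Weeks makes FVC non-increasing in time
    # (a reversed cummin would give a non-decreasing series instead).
    g['FVC'] = np.minimum.accumulate(g['FVC'].values)
    return g
df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_non_increasing)
# Restore the original row order so values line up with ss['Patient_Week']
fvc_final = df_out.sort_index()['FVC'].values.astype(float)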

# Banker sigma: 240 + 3*dist, enforce monotone in dist
dist = (grid['Weeks'] - grid['Base_Week']).abs().astype(float).values
sigma = np.maximum(240.0 + 3.0 * dist, 70.0).astype(float)
df_sig = pd.DataFrame({'Patient': grid['Patient'].values, 'dist': dist, 'Sigma': sigma})
def enforce_sigma_monotone(g):
    g = g.sort_values('dist').copy()
    g['Sigma'] = np.maximum.accumulate(g['Sigma'].values)
    return g
df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)
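# Rows were re-sorted by dist within each patient; restore submission order before extracting values
sigma_final = df_sig.sort_index()['Sigma'].values.astype(float)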

submission_banker_curblend = pd.DataFrame({
    'Patient_Week': ss['Patient_Week'],
    'FVC': fvc_final,
    'Confidence': sigma_final
})
submission_banker_curblend.to_csv('submission_banker.csv', index=False)
print(f'Saved banker variant to submission_banker.csv with weights XGB={wxgb:.2f}, LGB={wlgb:.2f}, Anchor={wanc:.2f} and sigma=max(240+3*dist,70).')

Backed up current primary to submission_primary.csv
Saved banker variant to submission_banker.csv with weights XGB=0.80, LGB=0.05, Anchor=0.15 and sigma=max(240+3*dist,70).


  df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_non_increasing)
  df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)
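
Since several cells overwrite `submission.csv`, a quick structural check before saving can catch misaligned or out-of-range rows. The following is a minimal sketch under stated assumptions: `check_submission` is a hypothetical helper, and the FVC range and sigma floor simply reuse the clip values (500, 6000) and the floor (70) that appear in the cells above.

```python
import numpy as np
import pandas as pd

def check_submission(sub, sample, fvc_range=(500.0, 6000.0), sigma_floor=70.0):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    if list(sub.columns) != ["Patient_Week", "FVC", "Confidence"]:
        return ["unexpected columns: " + ", ".join(map(str, sub.columns))]
    same_ids = sub["Patient_Week"].reset_index(drop=True).equals(
        sample["Patient_Week"].reset_index(drop=True)
    )
    if not same_ids:
        problems.append("Patient_Week values or order differ from the sample file")
    if sub[["FVC", "Confidence"]].isna().any().any():
        problems.append("NaN in FVC or Confidence")
    if not sub["FVC"].between(fvc_range[0], fvc_range[1]).all():
        problems.append("FVC outside the clip range")
    if (sub["Confidence"] < sigma_floor).any():
        problems.append("Confidence below the sigma floor")
    return problems

# Toy usage with made-up IDs.
sample = pd.DataFrame({"Patient_Week": ["P1_0", "P2_0", "P1_1", "P2_1"]})
good = pd.DataFrame({
    "Patient_Week": sample["Patient_Week"],
    "FVC": [3000.0, 2500.0, 2990.0, 2480.0],
    "Confidence": [240.0, 240.0, 243.0, 243.0],
})
bad = good.copy()
bad.loc[2, "Confidence"] = 50.0
bad.loc[3, "FVC"] = np.nan

print(check_submission(good, sample))
print(check_submission(bad, sample))
```

A check like this does not prove that values are paired with the right patient, only that IDs, ranges, and missing values look sane; row alignment still has to be preserved upstream.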


In [43]:
# Add one shallow XGB forward residual seed for diversity, re-avg predictions, retune sigma, and regenerate submission
import numpy as np, pandas as pd, gc, time
import xgboost as xgb

assert 'forward_residual_oof_and_full' in globals(), 'Run Cell 13 first to define forward_residual_oof_and_full()'
assert 'y_true_ref' in globals() and 'y_pred_avg' in globals() and 'dist_ref' in globals(), 'Run Cell 13 first to compute seed-avg OOF arrays'

jp = dict(lr=0.045, max_depth=3, min_child_weight=28, reg_lambda=9.0, subsample=0.9, colsample=0.9, colsample_bylevel=1.0)
print('Training shallow XGB residual seed for diversity:', jp)
t0 = time.time()
run_extra = forward_residual_oof_and_full(seed=4242, lr=jp['lr'], max_depth=jp['max_depth'], min_child_weight=jp['min_child_weight'],
                                          reg_lambda=jp['reg_lambda'], subsample=jp['subsample'], colsample=jp['colsample'],
                                          colsample_bylevel=jp['colsample_bylevel'])
print(f'Extra seed trained in {time.time()-t0:.2f}s')

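# Re-average OOF and test
# NOTE: if y_pred_avg / fvc_test_avg already average several seeds, dividing by 2 gives the extra seed half of the total weight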
y_pred_avg2 = (y_pred_avg + run_extra['y_pred']) / 2.0
fvc_test_avg2 = (fvc_test_avg + run_extra['fvc_test']) / 2.0
abs_res_hat_oof_avg2 = (np.abs(y_true_ref - y_pred_avg) + np.abs(y_true_ref - run_extra['y_pred'])) / 2.0
abs_res_hat_test_avg2 = (abs_res_hat_test_avg + np.abs(run_extra['abs_res_hat_test'])) / 2.0

# Retune sigma on updated OOF blend (small grid) and report LL
def laplace_ll(y_true, y_pred, sigma):
    # Simplified Laplace LL: omits the sqrt(2) factors of the official OSIC metric
    y_true = np.asarray(y_true).astype(float)
    y_pred = np.asarray(y_pred).astype(float)
    sigma = np.asarray(sigma).astype(float)
    delta = np.minimum(np.abs(y_true - y_pred), 1000.0)
    sigma = np.maximum(sigma, 70.0)
    return np.mean(-delta / sigma - np.log(sigma))

best = (-1e9, None, None, None)
for a in [160, 200, 240]:
    for b in [2.0, 3.0]:
        for c in [0.5, 1.0]:
            sig = np.maximum(a + b*dist_ref + c*abs_res_hat_oof_avg2, 70.0)
            sc = laplace_ll(y_true_ref, y_pred_avg2, sig)
            if sc > best[0]:
                best = (sc, a, b, c)
print(f'OOF LL with extra shallow seed: {best[0]:.5f} using a={best[1]} b={best[2]} c={best[3]}')
a_ex, b_ex, c_ex = best[1], best[2], best[3]

# Build strict test grid and anchored baseline
ss = pd.read_csv('sample_submission.csv')
grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]
grid['Weeks'] = parts[1].astype(int)
test_bl = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_bl, on='Patient', how='left')
try:
    gsf = float(g_slope_full)
except Exception:
    slopes_full_tmp = compute_patient_slopes(train)
    gsf = robust_global_slope(slopes_full_tmp)
fvc_anchor = (grid['Base_FVC'].values + gsf * (grid['Weeks'].values - grid['Base_Week'].values)).astype(float)

# Primary blend: 0.85 seed-avg2 + 0.15 anchored; guardrails
fvc_blend = 0.85 * fvc_test_avg2.astype(float) + 0.15 * fvc_anchor
df_out = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'FVC': fvc_blend.clip(500, 6000)})
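def enforce_non_increasing(g):
    g = g.sort_values('Weeks').copy()
    # Forward running minimum over increasing Weeks makes FVC non-increasing in time
    # (a reversed cummin would give a non-decreasing series instead).
    g['FVC'] = np.minimum.accumulate(g['FVC'].values)
    return g
df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_non_increasing)
# Restore the original row order so values line up with ss['Patient_Week']
fvc_final = df_out.sort_index()['FVC'].values.astype(float)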

# Sigma: tuned on updated OOF; apply and enforce monotone in dist; optional floor at dist>20
dist_test = (grid['Weeks'] - grid['Base_Week']).abs().astype(float).values
sigma_primary = np.maximum(a_ex + b_ex*dist_test + c_ex*abs_res_hat_test_avg2.astype(float), 70.0)
sigma_primary = np.where(dist_test > 20.0, np.maximum(sigma_primary, 100.0), sigma_primary)
df_sig = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'Base_Week': grid['Base_Week'].values.astype(int), 'Sigma': sigma_primary})
df_sig['dist'] = (df_sig['Weeks'] - df_sig['Base_Week']).abs().astype(float)
def enforce_sigma_monotone(g):
    g = g.sort_values('dist').copy()
    g['Sigma'] = np.maximum.accumulate(g['Sigma'].values)
    return g
df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)
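# Rows were re-sorted by dist within each patient; restore submission order before extracting values
sigma_final = df_sig.sort_index()['Sigma'].values.astype(float)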

submission_extra_seed = pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': fvc_final, 'Confidence': sigma_final})
submission_extra_seed.to_csv('submission.csv', index=False)
print('Saved submission.csv with extra shallow seed averaged (85% blend with anchor) and retuned sigma. Weights: seed-avg2 0.85, anchor 0.15.')

Training shallow XGB residual seed for diversity: {'lr': 0.045, 'max_depth': 3, 'min_child_weight': 28, 'reg_lambda': 9.0, 'subsample': 0.9, 'colsample': 0.9, 'colsample_bylevel': 1.0}


  grp = (trn.groupby(['Sex','SmokingStatus','AgeBin'], dropna=False)['slope_p']
  grp_med = (trn.groupby(['Sex','SmokingStatus','AgeBin'], dropna=False)['Base_FVC']


[Seed4242-Fold 1] g_slope=-3.8062 scored=282 MAE=165.02 iters=33 elapsed=0.67s


  grp = (trn.groupby(['Sex','SmokingStatus','AgeBin'], dropna=False)['slope_p']
  grp_med = (trn.groupby(['Sex','SmokingStatus','AgeBin'], dropna=False)['Base_FVC']


[Seed4242-Fold 2] g_slope=-3.5547 scored=281 MAE=116.96 iters=9 elapsed=0.58s


  grp = (trn.groupby(['Sex','SmokingStatus','AgeBin'], dropna=False)['slope_p']
  grp_med = (trn.groupby(['Sex','SmokingStatus','AgeBin'], dropna=False)['Base_FVC']


[Seed4242-Fold 3] g_slope=-3.5065 scored=275 MAE=136.77 iters=124 elapsed=0.76s


  grp = (trn.groupby(['Sex','SmokingStatus','AgeBin'], dropna=False)['slope_p']
  grp_med = (trn.groupby(['Sex','SmokingStatus','AgeBin'], dropna=False)['Base_FVC']


[Seed4242-Fold 4] g_slope=-3.5065 scored=275 MAE=155.59 iters=1 elapsed=0.57s


  grp = (trn.groupby(['Sex','SmokingStatus','AgeBin'], dropna=False)['slope_p']
  grp_med = (trn.groupby(['Sex','SmokingStatus','AgeBin'], dropna=False)['Base_FVC']


[Seed4242-Fold 5] g_slope=-3.6557 scored=281 MAE=133.92 iters=356 elapsed=1.13s


Extra seed trained in 4.34s
OOF LL with extra shallow seed: -5.96105 using a=160 b=2.0 c=0.5
Saved submission.csv with extra shallow seed averaged (85% blend with anchor) and retuned sigma. Weights: seed-avg2 0.85, anchor 0.15.


  grp_full = (train_full.groupby(['Sex','SmokingStatus','AgeBin'], dropna=False)['slope_p']
  grp_basefvc_full = (train_full.groupby(['Sex','SmokingStatus','AgeBin'], dropna=False)['Base_FVC']
  df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_non_increasing)
  df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)
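
The re-averaging step in the cell above combines an existing seed average with one new run as `(avg + new) / 2`. If the existing average already covers several seeds, that gives the new run half of the total weight, which may or may not be intended. The sketch below shows an equal-weight-per-seed update for comparison; `add_to_running_mean` and the seed count are hypothetical, since the notebook does not state how many seeds `y_pred_avg` contains.

```python
import numpy as np

def add_to_running_mean(mean_so_far, n_so_far, new_pred):
    """Fold one more prediction vector into a mean of n_so_far vectors, equal weight per seed."""
    mean_so_far = np.asarray(mean_so_far, dtype=float)
    new_pred = np.asarray(new_pred, dtype=float)
    return (n_so_far * mean_so_far + new_pred) / (n_so_far + 1)

# Toy example: three earlier seeds, then one extra seed.
seeds = np.array([
    [100.0, 200.0],
    [110.0, 190.0],
    [120.0, 210.0],
])
avg3 = seeds.mean(axis=0)            # mean of the first three seeds
extra = np.array([150.0, 160.0])     # the additional run

equal_weight = add_to_running_mean(avg3, n_so_far=3, new_pred=extra)
half_weight = (avg3 + extra) / 2.0   # what a plain "/ 2" computes

print("equal weight per seed:", equal_weight.tolist())
print("extra seed at 50%:    ", half_weight.tolist())
```

Giving a deliberately different (shallower) model extra weight can be a reasonable diversity choice; the point of the comparison is only to make the effective weights explicit.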
