# Production pipeline: OSIC Pulmonary Fibrosis Progression

Goal:
- Reproduce a clean, medal-ready submission with strict forward-CV hygiene, seed-averaged XGB residual backbone, tiny LGB weight sweep, anchored baseline blend, and robust sigma.

Plan:
1) Environment + data load; deterministic setup.
2) Helpers: metric, slope/stats builders, one-hot utils, leak-free build_features_v2.
3) Forward-CV residual XGB:
   - 5-fold GroupKFold (by Patient), forward scoring (weeks >= base).
   - Strict hygiene: Percent_at_base only; zero trend stats for VAL/TEST.
   - 9-seed jittered bag; aggregate OOF/test.
4) Sigma primary:
   - Tune (a,b,c) on OOF with abs_res proxy; floor >=70; optional floor >=100 for dist>20.
5) Optional diversity:
   - Tiny CPU LightGBM residual; sweep small weight wl in [0.05,0.10,0.15,0.20,0.25] vs XGB on OOF.
6) Blending:
   - Primary FVC: 0.85*seed-avg (or XGB/LGB sweep) + 0.15*anchor.
   - Guardrails: per-patient FVC non-increasing in future; clip [500, 6000].
   - Sigma guardrail: non-decreasing with distance.
7) Sigma strategy:
   - Primary: tuned hybrid sigma (a,b,c).
   - Banker: sigma = max(240 + 3*dist, 70).
   - Rule: if primary vs banker OOF delta < 0.02, prefer banker.
8) Outputs:
   - Save submission_primary.csv (primary sigma) and submission_banker.csv (banker).
   - Choose which to write as submission.csv based on OOF rule.

Notes:
- Recompute one-hot schema on full train and reuse for test (avoid one-hot schema leak).
- Strict test hygiene: no train-derived per-patient trend stats used on test; use baseline-only features.
- Log per-fold progress and elapsed time.

In [21]:
# Copy banker submission to submission.csv (banker: sigma = max(240 + 3*dist, 70))
import pandas as pd
ss = pd.read_csv('sample_submission.csv')
sub_banker = pd.read_csv('submission_banker.csv')
assert sub_banker.shape[0] == ss.shape[0], 'Row count mismatch vs sample_submission'
assert set(sub_banker['Patient_Week'].astype(str)) == set(ss['Patient_Week'].astype(str)), 'Patient_Week sets differ'
assert sub_banker['FVC'].notna().all() and sub_banker['Confidence'].notna().all(), 'NaNs in banker submission'
sub_banker.to_csv('submission.csv', index=False)
print('submission.csv overwritten with banker submission (sigma=240+3*dist).')

submission.csv overwritten with banker submission (sigma=240+3*dist).


In [2]:
# Copy hybrid (primary) submission to submission.csv for second submit
import pandas as pd
ss = pd.read_csv('sample_submission.csv')
sub_primary = pd.read_csv('submission_primary.csv')
assert sub_primary.shape[0] == ss.shape[0], 'Row count mismatch vs sample_submission'
assert set(sub_primary['Patient_Week'].astype(str)) == set(ss['Patient_Week'].astype(str)), 'Patient_Week sets differ'
assert sub_primary['FVC'].notna().all() and sub_primary['Confidence'].notna().all(), 'NaNs in primary submission'
sub_primary.to_csv('submission.csv', index=False)
print('submission.csv overwritten with hybrid primary submission (tuned hybrid sigma).')

submission.csv overwritten with hybrid primary submission (tuned hybrid sigma).


In [3]:
# Build hybrid sigma floored by banker: sigma = max(hybrid, banker, 70), with guardrails; overwrite submission.csv
import pandas as pd
import numpy as np

ss = pd.read_csv('sample_submission.csv')
sub_primary = pd.read_csv('submission_primary.csv')  # hybrid
sub_banker = pd.read_csv('submission_banker.csv')   # banker

assert sub_primary.shape[0] == ss.shape[0] == sub_banker.shape[0], 'Row count mismatch'
assert set(sub_primary['Patient_Week'].astype(str)) == set(ss['Patient_Week'].astype(str)) == set(sub_banker['Patient_Week'].astype(str)), 'Patient_Week sets differ'

# Rebuild grid to compute dist and apply guardrails
grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]
grid['Weeks'] = parts[1].astype(int)
test = pd.read_csv('test.csv')
test_bl = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_bl, on='Patient', how='left')
grid['dist'] = (grid['Weeks'] - grid['Base_Week']).abs().astype(float)

# Align submissions with grid order
p = sub_primary.set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()
b = sub_banker.set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()

# FVC from hybrid primary; re-enforce non-increasing per patient
df_out = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'FVC': p['FVC'].astype(float).clip(500, 6000)})
def enforce_non_increasing(g):
    g = g.sort_values('Weeks').copy()
    g['FVC'] = np.minimum.accumulate(g['FVC'].values[::-1])[::-1]
    return g
df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_non_increasing)
fvc_final = df_out['FVC'].values.astype(float)

# Sigma: hybrid floored by banker; then guardrails
sigma_h = p['Confidence'].astype(float).values
sigma_b = b['Confidence'].astype(float).values
sigma = np.maximum(sigma_h, sigma_b)
sigma = np.maximum(sigma, 70.0)
# Floor sigma to >=100 when dist > 20
sigma = np.where(grid['dist'].values > 20.0, np.maximum(sigma, 100.0), sigma)

df_sig = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'dist': grid['dist'].values.astype(float), 'Sigma': sigma})
def enforce_sigma_monotone(g):
    g = g.sort_values('dist').copy()
    g['Sigma'] = np.maximum.accumulate(g['Sigma'].values)
    return g
df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)
sigma_final = df_sig['Sigma'].values.astype(float)

submission_hybrid_floored = pd.DataFrame({
    'Patient_Week': ss['Patient_Week'],
    'FVC': fvc_final,
    'Confidence': sigma_final
})
submission_hybrid_floored.to_csv('submission_hybrid_floored.csv', index=False)
submission_hybrid_floored.to_csv('submission.csv', index=False)
print('Saved submission_hybrid_floored.csv and overwritten submission.csv with hybrid sigma floored by banker, guardrails enforced.')

Saved submission_hybrid_floored.csv and overwritten submission.csv with hybrid sigma floored by banker, guardrails enforced.


  df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_non_increasing)
  df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)


In [5]:
# Setup: imports, seeds, data load, core helpers (metric, one-hot, ECDF, slope utils), leak-free feature builder
import os, sys, time, math, gc, random
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import GroupKFold

np.random.seed(42); random.seed(42)
pd.set_option('display.max_columns', 200)

# Load data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
ss = pd.read_csv('sample_submission.csv')

# Metric: modified Laplace log-likelihood (maximize)
def laplace_ll(y_true, y_pred, sigma):
    y_true = np.asarray(y_true).astype(float)
    y_pred = np.asarray(y_pred).astype(float)
    sigma = np.asarray(sigma).astype(float)
    delta = np.minimum(np.abs(y_true - y_pred), 1000.0)
    sigma = np.maximum(sigma, 70.0)
    return float(np.mean(-delta / sigma - np.log(sigma)))

# One-hot utilities (fit on reference df; transform applies same columns; unseen -> zeros)
def one_hot_fit(df, cols):
    return {c: sorted(df[c].dropna().astype(str).unique().tolist()) for c in cols}

def one_hot_transform(df, cats):
    out = df.copy()
    for c, values in cats.items():
        col = df[c].astype(str)
        for v in values:
            out[f'{c}__{v}'] = (col == v).astype(np.int8)
    return out

# ECDF rank (0..1) fitted on train only
def ecdf_rank_fit(x):
    xs = np.sort(np.asarray(x, dtype=float))
    return xs

def ecdf_rank_transform(x, xs):
    x = np.asarray(x, dtype=float)
    # rank = fraction <= x
    idx = np.searchsorted(xs, x, side='right')
    return idx / max(len(xs), 1)

# Per-patient slope (OLS) on (Weeks, FVC)
def compute_patient_slopes(df, patient_col='Patient', week_col='Weeks', target_col='FVC'):
    slopes = {}
    for pid, g in df.groupby(patient_col):
        if g.shape[0] >= 2:
            x = g[week_col].values.astype(float); y = g[target_col].values.astype(float)
            xm = x.mean(); ym = y.mean()
            denom = ((x - xm)**2).sum()
            slope = ((x - xm) * (y - ym)).sum() / denom if denom > 0 else 0.0
            slopes[pid] = slope
    return slopes

def robust_global_slope(slopes_dict):
    if not slopes_dict: return 0.0
    return float(np.median(list(slopes_dict.values())))

# Leak-free feature builder: baseline-only Percent, zero trends for val/test
def build_features_v2(df, cap_wp=26):
    d = df.copy()
    # Weeks_Passed relative to Base_Week
    d['Weeks_Passed'] = (d['Weeks'] - d['Base_Week']).astype(float)
    d['Abs_Weeks_Passed'] = d['Weeks_Passed'].abs()
    d['Weeks_Passed_cap'] = d['Weeks_Passed'].clip(-cap_wp, cap_wp)
    d['Weeks_Passed2'] = d['Weeks_Passed'] ** 2
    d['sign_WP'] = np.sign(d['Weeks_Passed']).astype(int)
    d['is_future'] = (d['Weeks_Passed'] >= 0).astype(int)
    # Baseline-only Percent usage
    d['Percent_at_base'] = d['Percent_at_base'].astype(float)
    d['Percent_clipped'] = d['Percent_at_base'].clip(30, 120)
    d['Percent2'] = d['Percent_clipped'] ** 2
    # Base_FVC transforms
    d['Base_FVC'] = d['Base_FVC'].astype(float)
    d['log_BaseFVC'] = np.log1p(np.maximum(d['Base_FVC'], 1.0))
    # Simple TLC proxy
    d['Estimated_TLC'] = d['Base_FVC'] / (d['Percent_at_base'].clip(1e-3) / 100.0)
    d['log_TLC'] = np.log1p(np.maximum(d['Estimated_TLC'], 1.0))
    # Interactions
    d['Age'] = d['Age'].astype(float)
    d['Age_x_Percent'] = d['Age'] * d['Percent_at_base']
    d['Percent_x_BaseFVC'] = d['Percent_at_base'] * d['Base_FVC']
    d['WP_x_BaseFVC'] = d['Weeks_Passed'] * d['Base_FVC']
    d['WP_x_Percent'] = d['Weeks_Passed'] * d['Percent_at_base']
    d['WP_x_Age'] = d['Weeks_Passed'] * d['Age']
    # Trend placeholders (should be zero for val/test hygiene)
    for c in ['slope_w','r2_w','slope_percent_w']:
        if c not in d.columns: d[c] = 0.0
    for c in ['n_obs','has_trend','is_singleton']:
        if c not in d.columns: d[c] = 0
    # dPercent disabled to avoid leak
    d['dPercent'] = 0.0; d['WP_x_dPercent'] = 0.0
    return d

print('Production setup ready: data loaded and helpers defined.')

Production setup ready: data loaded and helpers defined.


In [6]:
# Slope head: build forward-safe slope labels per fold and fit Ridge + KNN on baseline-only features
import numpy as np, pandas as pd, time, gc
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

def patient_forward_slope(g):
    # Simple OLS slope on all points of this patient's TRAIN partition
    x = g['Weeks'].values.astype(float); y = g['FVC'].values.astype(float)
    xm = x.mean(); ym = y.mean()
    denom = ((x - xm)**2).sum()
    return ((x - xm) * (y - ym)).sum() / denom if denom > 0 else 0.0

def prepare_baseline_table(df):
    # Earliest visit row per patient to extract baseline features
    base = (df.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False).first())
    base = base[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].rename(
        columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
    return base

def build_slope_features(base_df, ecdf_basefvc=None, ecdf_percent=None, cats=None, fit=False):
    b = base_df.copy()
    b['log_Base_FVC'] = np.log1p(np.maximum(b['Base_FVC'].astype(float), 1.0))
    b['BaseFVC_over_Age'] = b['Base_FVC'].astype(float) / np.maximum(b['Age'].astype(float), 1.0)
    b['PercentBase_over_Age'] = b['Percent_at_base'].astype(float) / np.maximum(b['Age'].astype(float), 1.0)
    # ECDF ranks
    if fit:
        ecdf_basefvc = ecdf_rank_fit(b['Base_FVC'].values)
        ecdf_percent = ecdf_rank_fit(b['Percent_at_base'].values)
    b['BaseFVC_ecdf'] = ecdf_rank_transform(b['Base_FVC'].values, ecdf_basefvc)
    b['Percent_ecdf'] = ecdf_rank_transform(b['Percent_at_base'].values, ecdf_percent)
    # One-hot
    if fit:
        cats = one_hot_fit(b, ['Sex','SmokingStatus'])
    b = one_hot_transform(b, cats)
    num_cols = ['Age','Base_FVC','log_Base_FVC','Percent_at_base','BaseFVC_over_Age','PercentBase_over_Age','BaseFVC_ecdf','Percent_ecdf']
    cat_cols = [c for c in b.columns if c.startswith('Sex__') or c.startswith('SmokingStatus__')]
    feat_cols = num_cols + cat_cols
    return b, feat_cols, ecdf_basefvc, ecdf_percent, cats

def run_slope_head_forward_cv(n_splits=5, seed=42):
    gkf = GroupKFold(n_splits=n_splits)
    groups = train['Patient'].values
    # OOF containers aligned to per-row grid of future-only scoring
    y_true_all, dist_all = [], []
    # We'll store FVC predictions from slope models per validation patient-week (future only)
    fvc_ridge_all, fvc_knn_all = [], []
    t0 = time.time()
    for fold, (trn_idx, val_idx) in enumerate(gkf.split(train, groups=groups), 1):
        tf = time.time()
        trn = train.iloc[trn_idx].copy(); val = train.iloc[val_idx].copy()
        # Build per-patient forward slope labels on TRAIN only
        slopes = []
        for pid, g in trn.groupby('Patient'):
            slopes.append((pid, patient_forward_slope(g)))
        slope_labels = pd.DataFrame(slopes, columns=['Patient','s_label'])
        # Baseline tables
        base_trn = prepare_baseline_table(trn)
        base_val = prepare_baseline_table(val)
        base_trn = base_trn.merge(slope_labels, on='Patient', how='left')
        # Fit baseline-only feature transforms on TRAIN baseline
        base_trnF, feat_cols, ecdf_bf, ecdf_pc, cats_full = build_slope_features(base_trn, fit=True)
        base_valF, _, _, _, _ = build_slope_features(base_val, ecdf_bf, ecdf_pc, cats_full, fit=False)
        # Standardize numeric space for KNN; Ridge will handle via same scaler
        scaler = StandardScaler(with_mean=True, with_std=True)
        X_trn = base_trnF[feat_cols].values.astype(float); y_trn = base_trnF['s_label'].fillna(0.0).values.astype(float)
        X_trn_std = scaler.fit_transform(X_trn)
        X_val_std = scaler.transform(base_valF[feat_cols].values.astype(float))
        # Models
        ridge = Ridge(alpha=1.0, random_state=seed)
        ridge.fit(X_trn_std, y_trn)
        knn = KNeighborsRegressor(n_neighbors=9, weights='distance', metric='euclidean')
        knn.fit(X_trn_std, y_trn)
        # Predict s_hat on VAL baseline, then expand to patient-week rows (future only) for scoring
        s_ridge = ridge.predict(X_val_std)
        s_knn = knn.predict(X_val_std)
        # Merge s_hat back to val rows
        val = val.merge(base_val[['Patient','Base_Week','Base_FVC']], on='Patient', how='left')
        val = val.merge(base_valF[['Patient']], on='Patient', how='left')
        # Future-only mask
        mask = (val['Weeks'].values >= val['Base_Week'].values)
        dist = (val['Weeks'].values - val['Base_Week'].values).astype(float)
        # Map patient -> s_hat
        map_ridge = dict(zip(base_val['Patient'].values, s_ridge))
        map_knn = dict(zip(base_val['Patient'].values, s_knn))
        s_r = val['Patient'].map(map_ridge).astype(float).fillna(0.0).values
        s_k = val['Patient'].map(map_knn).astype(float).fillna(0.0).values
        fvc_r = (val['Base_FVC'].values + s_r * dist).astype(float)
        fvc_k = (val['Base_FVC'].values + s_k * dist).astype(float)
        # Append future-only rows
        y_true_all.append(val['FVC'].values[mask].astype(float))
        dist_all.append(dist[mask].astype(float))
        fvc_ridge_all.append(fvc_r[mask].astype(float))
        fvc_knn_all.append(fvc_k[mask].astype(float))
        print(f'[Slope-Fold {fold}] n_pat_trn={base_trn.shape[0]} ridge_fitted, knn_fitted; elapsed={time.time()-tf:.2f}s', flush=True)
        del trn, val, base_trn, base_val, base_trnF, base_valF, X_trn, X_trn_std, X_val_std
        gc.collect()
    y_true = np.concatenate(y_true_all) if y_true_all else np.array([], float)
    dist = np.concatenate(dist_all) if dist_all else np.array([], float)
    fvc_ridge_oof = np.concatenate(fvc_ridge_all) if fvc_ridge_all else np.array([], float)
    fvc_knn_oof = np.concatenate(fvc_knn_all) if fvc_knn_all else np.array([], float)
    print(f'[Slope OOF] built in {time.time()-t0:.2f}s; arrays: y={y_true.shape}, ridge={fvc_ridge_oof.shape}, knn={fvc_knn_oof.shape}', flush=True)
    return dict(y_true=y_true, dist=dist, fvc_ridge_oof=fvc_ridge_oof, fvc_knn_oof=fvc_knn_oof)

def fit_slope_head_full_and_predict_test():
    # Train slope models on full train baseline table and predict test grid
    base_full = prepare_baseline_table(train)
    # Fit labels from full train per-patient slope (for logging only; not used for test prediction except model fit)
    slopes_full = compute_patient_slopes(train)
    slope_labels_full = pd.DataFrame({'Patient': list(slopes_full.keys()), 's_label': list(slopes_full.values())})
    base_full = base_full.merge(slope_labels_full, on='Patient', how='left')
    base_fullF, feat_cols, ecdf_bf, ecdf_pc, cats_full = build_slope_features(base_full, fit=True)
    scaler = StandardScaler(with_mean=True, with_std=True)
    X_full = base_fullF[feat_cols].values.astype(float); y_full = base_fullF['s_label'].fillna(0.0).values.astype(float)
    X_full_std = scaler.fit_transform(X_full)
    ridge = Ridge(alpha=1.0, random_state=42).fit(X_full_std, y_full)
    knn = KNeighborsRegressor(n_neighbors=9, weights='distance', metric='euclidean').fit(X_full_std, y_full)
    # Build strict test grid baseline
    grid = ss.copy()
    parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
    grid['Patient'] = parts[0]; grid['Weeks'] = parts[1].astype(int)
    test_bl = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
    grid = grid.merge(test_bl, on='Patient', how='left')
    base_test = grid[['Patient','Base_Week','Base_FVC']].drop_duplicates('Patient')
    # Build baseline features for test patients
    meta = test[['Patient','Percent','Age','Sex','SmokingStatus']].drop_duplicates('Patient').rename(columns={'Percent':'Percent_at_base'})
    base_test = base_test.merge(meta, on='Patient', how='left')
    base_testF, _, _, _, _ = build_slope_features(base_test, ecdf_bf, ecdf_pc, cats_full, fit=False)
    X_test_std = scaler.transform(base_testF[feat_cols].values.astype(float))
    s_ridge_test = ridge.predict(X_test_std)
    s_knn_test = knn.predict(X_test_std)
    # Map back to full grid
    map_r = dict(zip(base_testF['Patient'].values, s_ridge_test))
    map_k = dict(zip(base_testF['Patient'].values, s_knn_test))
    dist_test = (grid['Weeks'].values - grid['Base_Week'].values).astype(float)
    fvc_ridge_test = (grid['Base_FVC'].values + pd.Series(grid['Patient']).map(map_r).astype(float).fillna(0.0).values * dist_test).astype(float)
    fvc_knn_test = (grid['Base_FVC'].values + pd.Series(grid['Patient']).map(map_k).astype(float).fillna(0.0).values * dist_test).astype(float)
    return dict(grid=grid, fvc_ridge_test=fvc_ridge_test, fvc_knn_test=fvc_knn_test, dist_test=dist_test)

# Execute slope head
slope_oof = run_slope_head_forward_cv(n_splits=5, seed=42)
slope_test = fit_slope_head_full_and_predict_test()
print('Slope head ready: OOF and test predictions computed (Ridge + KNN).')

[Slope-Fold 1] n_pat_trn=126 ridge_fitted, knn_fitted; elapsed=0.02s


[Slope-Fold 2] n_pat_trn=126 ridge_fitted, knn_fitted; elapsed=0.02s


[Slope-Fold 3] n_pat_trn=127 ridge_fitted, knn_fitted; elapsed=0.02s


[Slope-Fold 4] n_pat_trn=127 ridge_fitted, knn_fitted; elapsed=0.02s


[Slope-Fold 5] n_pat_trn=126 ridge_fitted, knn_fitted; elapsed=0.02s


[Slope OOF] built in 0.33s; arrays: y=(1394,), ridge=(1394,), knn=(1394,)


Slope head ready: OOF and test predictions computed (Ridge + KNN).


In [7]:
# Blend Ridge + KNN slope heads, create slope+anchor submission with banker sigma and guardrails
import numpy as np, pandas as pd

# OOF blend weight scan for slope
y_true = slope_oof['y_true']
dist_oof = slope_oof['dist']
fvc_ridge_oof = slope_oof['fvc_ridge_oof']
fvc_knn_oof = slope_oof['fvc_knn_oof']

def laplace_ll(y_true, y_pred, sigma):
    y_true = np.asarray(y_true).astype(float)
    y_pred = np.asarray(y_pred).astype(float)
    sigma = np.asarray(sigma).astype(float)
    delta = np.minimum(np.abs(y_true - y_pred), 1000.0)
    sigma = np.maximum(sigma, 70.0)
    return float(np.mean(-delta / sigma - np.log(sigma)))

# Use banker-like sigma for OOF selection (robust, independent of model residuals)
sigma_oof = np.maximum(240.0 + 3.0 * dist_oof, 70.0)
best = (-1e9, None)
for w in [0.3,0.4,0.5,0.6,0.7,0.8]:
    fvc_bl = w * fvc_ridge_oof + (1.0 - w) * fvc_knn_oof
    sc = laplace_ll(y_true, fvc_bl, sigma_oof)
    if sc > best[0]:
        best = (sc, w)
w_ridge = best[1] if best[1] is not None else 0.6
w_knn = 1.0 - w_ridge
print(f'[Slope OOF] best blend weights: ridge={w_ridge:.2f}, knn={w_knn:.2f}, LL={best[0]:.5f}')

# Build test slope blend and final FVC with anchor=0.15 (slope bucket 0.85) as a standalone probe
grid = slope_test['grid'].copy()
dist_test = slope_test['dist_test'].astype(float)
fvc_ridge_test = slope_test['fvc_ridge_test'].astype(float)
fvc_knn_test = slope_test['fvc_knn_test'].astype(float)
slope_blend_test = w_ridge * fvc_ridge_test + w_knn * fvc_knn_test

# Anchored baseline (global slope from train)
gs = robust_global_slope(compute_patient_slopes(train))
fvc_anchor = (grid['Base_FVC'].values + gs * (grid['Weeks'].values - grid['Base_Week'].values)).astype(float)

# Final FVC for this probe: 0.85 slope + 0.15 anchor
fvc_probe = 0.85 * slope_blend_test + 0.15 * fvc_anchor

# Guardrails: non-increasing FVC per patient and clip
df_out = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'FVC': np.clip(fvc_probe, 500, 6000)})
def enforce_non_increasing(g):
    g = g.sort_values('Weeks').copy()
    g['FVC'] = np.minimum.accumulate(g['FVC'].values[::-1])[::-1]
    return g
df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_non_increasing)
fvc_final = df_out['FVC'].values.astype(float)

# Sigma banker + guardrails (monotone, floors incl. dist>20>=100)
sigma = np.maximum(240.0 + 3.0 * np.abs(dist_test), 70.0).astype(float)
sigma = np.where(np.abs(dist_test) > 20.0, np.maximum(sigma, 100.0), sigma)
df_sig = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'dist': np.abs(dist_test).astype(float), 'Sigma': sigma})
def enforce_sigma_monotone(g):
    g = g.sort_values('dist').copy()
    g['Sigma'] = np.maximum.accumulate(g['Sigma'].values)
    return g
df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)
sigma_final = df_sig['Sigma'].values.astype(float)

# Save slope+anchor banker submission as an additional artifact; do not overwrite current submission.csv automatically
submission_slope_banker = pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': fvc_final, 'Confidence': sigma_final})
submission_slope_banker.to_csv('submission_slope_banker.csv', index=False)
print('Saved submission_slope_banker.csv (0.85 slope blend + 0.15 anchor, banker sigma).')

[Slope OOF] best blend weights: ridge=0.80, knn=0.20, LL=-6.15768
Saved submission_slope_banker.csv (0.85 slope blend + 0.15 anchor, banker sigma).


  df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_non_increasing)
  df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)


In [20]:
# Integrate slope backbone with a light residual XGB, sweep weights, and build banker/primary submissions
import numpy as np, pandas as pd, time, gc, math
from sklearn.model_selection import GroupKFold
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor

t0 = time.time()

# Helper: build forward CV grids with baseline columns merged
def build_fold_grids(trn_df, val_df):
    # Baselines from earliest visit within each split
    base_trn = (trn_df.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False).first())
    base_trn = base_trn[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
    base_val = (val_df.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False).first())
    base_val = base_val[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
    # Drop potential duplicates to avoid suffixes; keep baseline versions
    trn_df = trn_df.drop(columns=['Age','Sex','SmokingStatus'], errors='ignore')
    val_df = val_df.drop(columns=['Age','Sex','SmokingStatus'], errors='ignore')
    trn = trn_df.merge(base_trn, on='Patient', how='left')
    val = val_df.merge(base_val, on='Patient', how='left')
    # Guardrails
    assert trn['Base_Week'].notna().all() and val['Base_Week'].notna().all(), 'Base_Week has NaNs in fold build'
    # future-only masks and dist
    trn['dist'] = (trn['Weeks'] - trn['Base_Week']).astype(float)
    val['dist'] = (val['Weeks'] - val['Base_Week']).astype(float)
    m_trn = trn['dist'] >= 0
    m_val = val['dist'] >= 0
    return trn, val, m_trn.values, m_val.values, base_trn, base_val

# Compute global slope from a dataframe
def global_slope_from_df(df):
    slopes = compute_patient_slopes(df)
    return robust_global_slope(slopes)

# Prepare features for residual model using leak-safe builder (baseline-only + distance bases)
def prepare_residual_features(df):
    d = df.copy()
    # map possible suffixed baseline columns back to expected names
    for name, alts in [('Age',['Age_x','Age_y','Age_base']), ('Sex',['Sex_x','Sex_y','Sex_base']), ('SmokingStatus',['SmokingStatus_x','SmokingStatus_y','Smoking_base'])]:
        if name not in d.columns:
            for a in alts:
                if a in d.columns:
                    d[name] = d[a]
                    break
    # ensure required cols exist
    for c in ['Base_Week','Base_FVC','Percent_at_base','Age','Sex','SmokingStatus','Weeks','dist']:
        assert c in d.columns, f'missing {c}'
    # minimal columns for build_features_v2
    d['Weeks_Passed'] = d['dist']
    d['Abs_Weeks_Passed'] = d['dist'].abs()
    d['Weeks_Passed2'] = d['dist']**2
    d['Weeks_Passed_cap'] = d['dist'].clip(-26, 26)
    d['sign_WP'] = np.sign(d['dist']).astype(int)
    d['is_future'] = (d['dist'] >= 0).astype(int)
    # percent and base transforms inside build_features_v2 expect certain cols already present
    d = build_features_v2(d)
    # distance bases and interactions (safe)
    d['dist2'] = d['dist']**2
    d['dist3'] = d['dist']**3
    d['dist_cap'] = d['dist'].clip(0, 30)
    d['dist_x_BaseFVC'] = d['dist'] * d['Base_FVC']
    d['dist_x_Age'] = d['dist'] * d['Age']
    # one-hot for Sex, SmokingStatus on TRAIN fit, applied to VAL/TEST later outside this fn
    return d

def fit_ohe(train_df, cols):
    return one_hot_fit(train_df, cols)

def apply_ohe(df, cats):
    return one_hot_transform(df, cats)

# Retrieve OOF arrays and create combined predictions aligned per-fold
y_true_oof_list, dist_oof_list = [], []
fvc_slope_oof_list, fvc_resid_oof_list, fvc_anchor_oof_list = [], [], []

gkf = GroupKFold(n_splits=5)
groups = train['Patient'].values

# To rebuild per-fold anchor and slope aligned arrays, recompute slope preds within the loop using the already-fitted slope_head logic per fold.
ridge_w, knn_w = 0.80, 0.20
bag_seeds = [0,1,2,3,4]  # residual XGB seed bag

fold_start = time.time()
for fold, (trn_idx, val_idx) in enumerate(gkf.split(train, groups=groups), 1):
    t_fold = time.time()
    trn_df = train.iloc[trn_idx].copy()
    val_df = train.iloc[val_idx].copy()
    trn, val, m_trn, m_val, base_trn, base_val = build_fold_grids(trn_df, val_df)
    # Fit slope label model on TRAIN baseline (Ridge+KNN) to get s_hat for VAL baseline
    slopes_tr = compute_patient_slopes(trn_df)
    base_trn_lab = base_trn.merge(pd.DataFrame({'Patient': list(slopes_tr.keys()), 's_label': list(slopes_tr.values())}), on='Patient', how='left')
    base_trnF, feat_cols, ecdf_bf, ecdf_pc, cats_full = build_slope_features(base_trn_lab, fit=True)
    base_valF, _, _, _, _ = build_slope_features(base_val, ecdf_bf, ecdf_pc, cats_full, fit=False)
    scaler = StandardScaler(with_mean=True, with_std=True)
    X_trn = base_trnF[feat_cols].values.astype(float); y_trn = base_trnF['s_label'].fillna(0.0).values.astype(float)
    X_trn_std = scaler.fit_transform(X_trn)
    X_val_std = scaler.transform(base_valF[feat_cols].values.astype(float))
    from sklearn.linear_model import Ridge
    from sklearn.neighbors import KNeighborsRegressor
    ridge = Ridge(alpha=1.0, random_state=42).fit(X_trn_std, y_trn)
    knn = KNeighborsRegressor(n_neighbors=9, weights='distance').fit(X_trn_std, y_trn)
    s_r = ridge.predict(X_val_std); s_k = knn.predict(X_val_std)
    # Clamp s_hat to [5th,95th] train-fold slope label percentiles
    q_lo, q_hi = np.percentile(y_trn, 5), np.percentile(y_trn, 95)
    s_blend_val = ridge_w * s_r + knn_w * s_k
    s_blend_val = np.clip(s_blend_val, q_lo, q_hi)
    s_hat_map = dict(zip(base_val['Patient'].values, s_blend_val))
    # Build per-row slope backbone predictions on VAL future rows
    dist_val = val['dist'].values.astype(float)
    s_hat_val = val['Patient'].map(s_hat_map).astype(float).fillna(0.0).values
    fvc_slope_val = (val['Base_FVC'].values + s_hat_val * dist_val).astype(float)
    # Compute per-fold anchor using TRAIN-only global slope
    gs_fold = global_slope_from_df(trn_df)
    fvc_anchor_val = (val['Base_FVC'].values + gs_fold * dist_val).astype(float)
    # Residual targets on TRAIN future rows: r = y - fvc_slope
    base_trnF2, _, _, _, _ = build_slope_features(base_trn, ecdf_bf, ecdf_pc, cats_full, fit=False)
    X_trn_std2 = scaler.transform(base_trnF2[feat_cols].values.astype(float))
    s_r_tr = ridge.predict(X_trn_std2); s_k_tr = knn.predict(X_trn_std2)
    s_blend_tr = ridge_w * s_r_tr + knn_w * s_k_tr
    s_blend_tr = np.clip(s_blend_tr, q_lo, q_hi)
    s_hat_map_tr = dict(zip(base_trn['Patient'].values, s_blend_tr))
    trn['s_hat'] = trn['Patient'].map(s_hat_map_tr).astype(float).fillna(0.0)
    trn['fvc_slope'] = trn['Base_FVC'].astype(float) + trn['s_hat'].astype(float) * trn['dist'].astype(float)
    trn_fut = trn[m_trn].copy()
    trn_fut['r_target'] = trn_fut['FVC'].astype(float) - trn_fut['fvc_slope'].astype(float)
    # Ensure baseline demographics present for feature builder
    trn_fut = trn_fut.merge(base_trn[['Patient','Age','Sex','SmokingStatus']], on='Patient', how='left')
    # Features for TRAIN and VAL future rows
    trn_feat = prepare_residual_features(trn_fut)
    val_fut = val[m_val].copy()
    # Ensure baseline demographics present for VAL as well
    val_fut = val_fut.merge(base_val[['Patient','Age','Sex','SmokingStatus']], on='Patient', how='left')
    val_feat = prepare_residual_features(val_fut)
    # Fit OHE on TRAIN
    cats = fit_ohe(trn_feat, ['Sex','SmokingStatus'])
    trn_feat = apply_ohe(trn_feat, cats)
    val_feat = apply_ohe(val_feat, cats)
    # Select numeric feature columns (exclude objects to avoid strings like 'Male')
    drop_cols = ['Patient','Weeks','FVC','Base_Week','s_hat','fvc_slope','r_target','Sex','SmokingStatus']
    feat_cols_resid = [c for c in trn_feat.columns if c not in drop_cols and np.issubdtype(trn_feat[c].dtype, np.number)]
    Xr = trn_feat[feat_cols_resid].values.astype(float)
    yr = trn_feat['r_target'].values.astype(float)
    Xv = val_feat[feat_cols_resid].values.astype(float)
    # Residual XGB seed bagging (reduced capacity per expert)
    r_pred_val_bag = np.zeros(Xv.shape[0], dtype=float)
    for sd in bag_seeds:
        xgb = XGBRegressor(
            n_estimators=500,
            max_depth=3,
            learning_rate=0.04,
            subsample=0.8,
            colsample_bytree=0.8,
            reg_lambda=5.0,
            min_child_weight=5.0,
            tree_method='hist',
            random_state=sd,
            n_jobs=0
        )
        xgb.fit(Xr, yr, verbose=False)
        r_pred_val_bag += xgb.predict(Xv)
        del xgb
    r_pred_val = r_pred_val_bag / max(len(bag_seeds), 1)
    fvc_resid_val = fvc_slope_val[m_val] + r_pred_val
    # Accumulate OOF arrays (VAL future rows only)
    y_true_oof_list.append(val_fut['FVC'].values.astype(float))
    dist_oof_list.append(dist_val[m_val].astype(float))
    fvc_slope_oof_list.append(fvc_slope_val[m_val].astype(float))
    fvc_resid_oof_list.append(fvc_resid_val.astype(float))
    fvc_anchor_oof_list.append(fvc_anchor_val[m_val].astype(float))
    print(f'[Integrate-Fold {fold}] trn_fut={trn_fut.shape[0]} val_fut={val_fut.shape[0]} done in {time.time()-t_fold:.2f}s', flush=True)
    del trn_df, val_df, trn, val, trn_fut, val_fut, trn_feat, val_feat, Xr, yr, Xv
    gc.collect()

y_true_oof = np.concatenate(y_true_oof_list)
dist_oof = np.concatenate(dist_oof_list)
fvc_slope_oof = np.concatenate(fvc_slope_oof_list)
fvc_resid_oof = np.concatenate(fvc_resid_oof_list)
fvc_anchor_oof = np.concatenate(fvc_anchor_oof_list)
print(f'[OOF Arrays] y={y_true_oof.shape}, slope={fvc_slope_oof.shape}, resid={fvc_resid_oof.shape}, anchor={fvc_anchor_oof.shape}')

# OOF weight selection: safer reweight toward anchor using banker sigma; choose by long-distance slice first
def laplace_ll_np(y_true, y_pred, sigma):
    y_true = y_true.astype(float); y_pred = y_pred.astype(float); sigma = sigma.astype(float)
    delta = np.minimum(np.abs(y_true - y_pred), 1000.0)
    sigma = np.maximum(sigma, 70.0)
    return float(np.mean(-delta / sigma - np.log(sigma)))

sigma_banker_oof = np.maximum(240.0 + 3.0 * np.abs(dist_oof), 70.0)
sigma_banker_oof = np.where(np.abs(dist_oof) > 20.0, np.maximum(sigma_banker_oof, 100.0), sigma_banker_oof)
m_long = np.abs(dist_oof) > 20.0
best = {'ll_long': -1e9, 'll_global': -1e9, 'w_anchor': None, 'w_resid': None}
for w_a in [0.20, 0.25, 0.30]:
    w_s = 0.00
    w_r = 1.0 - (w_a + w_s)
    fvc_bl = w_r * fvc_resid_oof + w_s * fvc_slope_oof + w_a * fvc_anchor_oof
    ll_g = laplace_ll_np(y_true_oof, fvc_bl, sigma_banker_oof)
    ll_l = laplace_ll_np(y_true_oof[m_long], fvc_bl[m_long], sigma_banker_oof[m_long]) if m_long.any() else ll_g
    print(f'[OOF Anchor Sweep] w_anchor={w_a:.2f}, w_resid={w_r:.2f}, LL_global={ll_g:.5f}, LL_long={ll_l:.5f}')
    if ll_l > best['ll_long'] or (np.isclose(ll_l, best['ll_long']) and ll_g > best['ll_global']):
        best.update({'ll_long': ll_l, 'll_global': ll_g, 'w_anchor': w_a, 'w_resid': w_r})
w_anchor_best = best['w_anchor'] if best['w_anchor'] is not None else 0.30
w_resid_best = best['w_resid'] if best['w_resid'] is not None else 0.70
w_slope_best = 0.00
print(f"[Final Weights] Using w_resid={w_resid_best:.2f}, w_slope={w_slope_best:.2f}, w_anchor={w_anchor_best:.2f} selected by long-distance OOF.")

# Train final residual model on full train future rows with consistent schema
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
# Full-train slope models on baseline-only features
def prepare_baseline_table(df):
    base = (df.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False).first())
    base = base[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].rename(
        columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
    return base
base_full = prepare_baseline_table(train)
slopes_full = compute_patient_slopes(train)
slope_labels_full = pd.DataFrame({'Patient': list(slopes_full.keys()), 's_label': list(slopes_full.values())})
base_full_lab = base_full.merge(slope_labels_full, on='Patient', how='left')
base_fullF, feat_cols_s, ecdf_bf_s, ecdf_pc_s, cats_s = build_slope_features(base_full_lab, fit=True)
scaler_full = StandardScaler(with_mean=True, with_std=True)
X_full_s = base_fullF[feat_cols_s].values.astype(float); y_full_s = base_fullF['s_label'].fillna(0.0).values.astype(float)
X_full_s_std = scaler_full.fit_transform(X_full_s)
ridge_full = Ridge(alpha=1.0, random_state=42).fit(X_full_s_std, y_full_s)
knn_full = KNeighborsRegressor(n_neighbors=9, weights='distance').fit(X_full_s_std, y_full_s)

# Build full-train grid with baseline merge and future-only rows
train_full = train.copy()
train_full = train_full.merge(base_full[['Patient','Base_Week','Base_FVC','Percent_at_base','Age','Sex','SmokingStatus']], on='Patient', how='left')
train_full['dist'] = (train_full['Weeks'] - train_full['Base_Week']).astype(float)
mask_full = train_full['dist'] >= 0
Xb_full = build_slope_features(base_full[['Patient','Base_Week','Base_FVC','Percent_at_base','Age','Sex','SmokingStatus']], ecdf_bf_s, ecdf_pc_s, cats_s, fit=False)[0]
s_r_full = ridge_full.predict(scaler_full.transform(Xb_full[feat_cols_s].values.astype(float)))
s_k_full = knn_full.predict(scaler_full.transform(Xb_full[feat_cols_s].values.astype(float)))
q_lo_full, q_hi_full = np.percentile(y_full_s, 5), np.percentile(y_full_s, 95)
s_blend_full = np.clip(ridge_w * s_r_full + knn_w * s_k_full, q_lo_full, q_hi_full)
map_s_full = dict(zip(base_full['Patient'].values, s_blend_full))
train_full['s_hat'] = train_full['Patient'].map(map_s_full).astype(float).fillna(0.0)
train_full['fvc_slope'] = train_full['Base_FVC'].astype(float) + train_full['s_hat'].astype(float) * train_full['dist'].astype(float)
train_fut = train_full[mask_full].copy()
train_fut['r_target'] = train_fut['FVC'].astype(float) - train_fut['fvc_slope'].astype(float)
# Residual features and OHE schema fit on full train cats
tr_full_feat = prepare_residual_features(train_fut)
cats_full = one_hot_fit(train[['Sex','SmokingStatus']].drop_duplicates(), ['Sex','SmokingStatus'])
tr_full_feat = apply_ohe(tr_full_feat, cats_full)
drop_cols_full = ['Patient','Weeks','FVC','Base_Week','s_hat','fvc_slope','r_target','Sex','SmokingStatus']
feat_cols_resid_full = [c for c in tr_full_feat.columns if c not in drop_cols_full and np.issubdtype(tr_full_feat[c].dtype, np.number)]
X_full_resid = tr_full_feat[feat_cols_resid_full].values.astype(float)
y_full_resid = tr_full_feat['r_target'].values.astype(float)
# Residual full model: seed bagging (reduced capacity per expert)
resid_full_models = []
for sd in bag_seeds:
    mdl = XGBRegressor(
        n_estimators=650,
        max_depth=3,
        learning_rate=0.04,
        subsample=0.8,
        colsample_bytree=0.8,
        reg_lambda=6.0,
        min_child_weight=5.0,
        tree_method='hist',
        random_state=sd,
        n_jobs=0
    )
    mdl.fit(X_full_resid, y_full_resid, verbose=False)
    resid_full_models.append(mdl)

# Build test grid and compute slope_backbone, anchor, and residual features for prediction
grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]; grid['Weeks'] = parts[1].astype(int)
test_bl = test[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_bl, on='Patient', how='left')
assert grid['Base_Week'].notna().all(), 'Base_Week has NaNs in test grid merge'
dist_test = (grid['Weeks'].values - grid['Base_Week'].values).astype(float)

# Slope backbone test from full-fit slope models (with clamp)
base_test = grid[['Patient','Base_Week','Base_FVC']].drop_duplicates('Patient')
meta = test[['Patient','Percent','Age','Sex','SmokingStatus']].drop_duplicates('Patient').rename(columns={'Percent':'Percent_at_base'})
base_test = base_test.merge(meta, on='Patient', how='left')
base_testF, _, _, _, _ = build_slope_features(base_test, ecdf_bf_s, ecdf_pc_s, cats_s, fit=False)
X_test_std = scaler_full.transform(base_testF[feat_cols_s].values.astype(float))
s_ridge_test = ridge_full.predict(X_test_std)
s_knn_test = knn_full.predict(X_test_std)
s_blend_test = np.clip(ridge_w * s_ridge_test + knn_w * s_knn_test, q_lo_full, q_hi_full)
map_s_test = dict(zip(base_testF['Patient'].values, s_blend_test))
fvc_slope_test = (grid['Base_FVC'].values + pd.Series(grid['Patient']).map(map_s_test).astype(float).fillna(0.0).values * dist_test).astype(float)

# Anchor test using global slope from full train
gs_full = global_slope_from_df(train)
fvc_anchor_test = (grid['Base_FVC'].values + gs_full * dist_test).astype(float)

# Residual features for test rows
test_feat = grid.copy()
test_feat['dist'] = dist_test
test_feat = prepare_residual_features(test_feat)
cats_pred = one_hot_fit(train[['Sex','SmokingStatus']].drop_duplicates(), ['Sex','SmokingStatus'])
test_feat = apply_ohe(test_feat, cats_pred)
drop_cols_pred = ['Patient','Weeks','FVC','Base_Week','Sex','SmokingStatus']
# Align test feature columns to training residual feature schema; add missing columns as zeros
for c in feat_cols_resid_full:
    if c not in test_feat.columns:
        test_feat[c] = 0.0
feat_cols_pred = [c for c in feat_cols_resid_full if c in test_feat.columns]
X_test_resid = test_feat[feat_cols_pred].values.astype(float)
# Predict residuals with seed bag and average
r_pred_test_sum = np.zeros(X_test_resid.shape[0], dtype=float)
for mdl in resid_full_models:
    r_pred_test_sum += mdl.predict(X_test_resid)
r_pred_test = r_pred_test_sum / max(len(resid_full_models), 1)
fvc_resid_test = fvc_slope_test + r_pred_test

# Final blended FVC for test using selected safer weights
fvc_blend_test = w_resid_best * fvc_resid_test + w_slope_best * fvc_slope_test + w_anchor_best * fvc_anchor_test

# Guardrails: pin dist==0 to Base_FVC, non-increasing FVC per patient, clip [500,6000]
fvc_clip = np.clip(fvc_blend_test, 500, 6000)
fvc_clip = np.where(dist_test == 0.0, grid['Base_FVC'].values.astype(float), fvc_clip)
df_out = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'FVC': fvc_clip})
def enforce_non_increasing_local(g):
    g = g.sort_values('Weeks').copy()
    g['FVC'] = np.minimum.accumulate(g['FVC'].values[::-1])[::-1]
    return g
df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_non_increasing_local)
fvc_final_test = df_out['FVC'].values.astype(float)

# Banker sigma for test with guardrails
sigma_banker_test = np.maximum(240.0 + 3.0 * np.abs(dist_test), 70.0)
sigma_banker_test = np.where(np.abs(dist_test) > 20.0, np.maximum(sigma_banker_test, 100.0), sigma_banker_test)
df_sig = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'dist': np.abs(dist_test).astype(float), 'Sigma': sigma_banker_test.astype(float)})
def enforce_sigma_monotone_local(g):
    g = g.sort_values('dist').copy()
    g['Sigma'] = np.maximum.accumulate(g['Sigma'].values)
    return g
df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone_local)
sigma_banker_final_test = df_sig['Sigma'].values.astype(float)

submission_banker = pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': fvc_final_test, 'Confidence': sigma_banker_final_test})
submission_banker.to_csv('submission_banker.csv', index=False)
print('Saved submission_banker.csv (integrated blend + banker sigma).')

# Learned sigma model on OOF residuals of the selected blend
r_oof = y_true_oof - (w_resid_best * fvc_resid_oof + w_slope_best * fvc_slope_oof + w_anchor_best * fvc_anchor_oof)
y_sigma = np.log1p(np.abs(r_oof))
sigma_feat_oof = pd.DataFrame({
    'dist': dist_oof,
    'dist2': dist_oof**2,
})
sigma_model = XGBRegressor(
    n_estimators=500,
    max_depth=3,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_lambda=10.0,
    tree_method='hist',
    random_state=42,
    n_jobs=0
)
sigma_model.fit(sigma_feat_oof.values.astype(float), y_sigma, verbose=False)
# Test sigma features
sigma_feat_test = pd.DataFrame({'dist': np.abs(dist_test).astype(float), 'dist2': (np.abs(dist_test)**2).astype(float)})
sigma_pred = np.expm1(sigma_model.predict(sigma_feat_test.values.astype(float)))
sigma_pred = np.maximum(sigma_pred, 70.0)
sigma_pred = np.where(np.abs(dist_test) > 20.0, np.maximum(sigma_pred, 100.0), sigma_pred)
# Monotone per patient
df_sig2 = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'dist': np.abs(dist_test).astype(float), 'Sigma': sigma_pred.astype(float)})
df_sig2 = df_sig2.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone_local)
sigma_learned_final = df_sig2['Sigma'].values.astype(float)
# Floor by banker
sigma_primary = np.maximum(sigma_learned_final, sigma_banker_final_test)

submission_primary = pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': fvc_final_test, 'Confidence': sigma_primary})
submission_primary.to_csv('submission_primary.csv', index=False)
submission_primary.to_csv('submission.csv', index=False)
print(f'Saved submission_primary.csv and overwritten submission.csv (integrated blend + learned sigma floored by banker). Elapsed {time.time()-t0:.1f}s')

[Integrate-Fold 1] trn_fut=1112 val_fut=282 done in 0.99s


[Integrate-Fold 2] trn_fut=1113 val_fut=281 done in 0.99s


[Integrate-Fold 3] trn_fut=1119 val_fut=275 done in 1.00s


[Integrate-Fold 4] trn_fut=1119 val_fut=275 done in 0.99s


  df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_non_increasing_local)
  df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone_local)


Saved submission_banker.csv (integrated blend + banker sigma).


Saved submission_primary.csv and overwritten submission.csv (integrated blend + learned sigma floored by banker). Elapsed 7.2s


  df_sig2 = df_sig2.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone_local)


In [14]:
# Linear sigma sweep (sigma = a + b*|dist|) with floors; compare to banker OOF and write alternative submission if >0.02 OOF gain
import numpy as np, pandas as pd

def laplace_ll_np(y_true, y_pred, sigma):
    y_true = np.asarray(y_true, float); y_pred = np.asarray(y_pred, float); sigma = np.asarray(sigma, float)
    delta = np.minimum(np.abs(y_true - y_pred), 1000.0)
    sigma = np.clip(sigma, 70.0, 1000.0)
    return float(np.mean(-delta / sigma - np.log(sigma)))

# Build OOF blend residuals from best weights discovered in Cell 7
fvc_oof_best = w_resid_best * fvc_resid_oof + w_slope_best * fvc_slope_oof + w_anchor_best * fvc_anchor_oof
r_oof = y_true_oof - fvc_oof_best
abs_dist_oof = np.abs(dist_oof).astype(float)

# Banker OOF
sigma_banker_oof = np.maximum(240.0 + 3.0 * abs_dist_oof, 70.0)
sigma_banker_oof = np.where(abs_dist_oof > 20.0, np.maximum(sigma_banker_oof, 100.0), sigma_banker_oof)
ll_banker = laplace_ll_np(y_true_oof, fvc_oof_best, sigma_banker_oof)
print(f'[Sigma OOF] Banker LL={ll_banker:.5f}')

# Sweep linear sigma grids
grids = {
    'A': {'a': [150, 200, 250, 300], 'b': [2.0, 2.5, 3.0, 3.5]},
    'B': {'a': [90, 100, 110, 120], 'b': [1.6, 1.8, 2.0, 2.2, 2.4]}
}
best = (ll_banker - 1e9, None, None)  # (LL, a, b) initialize far below banker for comparison printing
for grid_name, grid in grids.items():
    for a in grid['a']:
        for b in grid['b']:
            sig = a + b * abs_dist_oof
            sig = np.maximum(sig, 70.0)
            sig = np.where(abs_dist_oof > 20.0, np.maximum(sig, 100.0), sig)
            sig = np.clip(sig, 70.0, 1000.0)
            ll = laplace_ll_np(y_true_oof, fvc_oof_best, sig)
            print(f"[Sigma OOF] Grid {grid_name} a={a} b={b:.2f} LL={ll:.5f}")
            if ll > best[0]:
                best = (ll, a, b)

ll_best, a_best, b_best = best
delta = ll_best - ll_banker
print(f'[Sigma OOF] Best linear: a={a_best} b={b_best:.2f} LL={ll_best:.5f} (Δ vs banker={delta:+.5f})')

# Build test sigma for best (a,b), enforce floors and per-patient monotone; floor by banker for safety
grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]
grid['Weeks'] = parts[1].astype(int)
test = pd.read_csv('test.csv')
test_bl = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_bl, on='Patient', how='left')
assert grid['Base_Week'].notna().all(), 'Base_Week NaNs in linear sigma grid build'
abs_dist_test = (grid['Weeks'] - grid['Base_Week']).abs().astype(float)
sigma_linear_test = a_best + b_best * abs_dist_test
sigma_linear_test = np.maximum(sigma_linear_test, 70.0)
sigma_linear_test = np.where(abs_dist_test > 20.0, np.maximum(sigma_linear_test, 100.0), sigma_linear_test)
sigma_linear_test = np.clip(sigma_linear_test, 70.0, 1000.0)

# Per-patient monotone on |dist|
df_sigL = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'dist': abs_dist_test.values, 'Sigma': sigma_linear_test.astype(float)})
def enforce_sigma_monotone(g):
    g = g.sort_values('dist').copy()
    g['Sigma'] = np.maximum.accumulate(g['Sigma'].values)
    return g
df_sigL = df_sigL.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)
sigma_linear_test_mono = df_sigL['Sigma'].values.astype(float)

# Load current best FVC (from integrated blend built in Cell 7), align to ss order
sub_int = pd.read_csv('submission_banker.csv')
sub_int = sub_int.set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()
fvc_test_final = sub_int['FVC'].astype(float).values

# Banker test sigma for flooring
sigma_banker_test = np.maximum(240.0 + 3.0 * abs_dist_test, 70.0)
sigma_banker_test = np.where(abs_dist_test > 20.0, np.maximum(sigma_banker_test, 100.0), sigma_banker_test)

# Final linear-floored sigma
sigma_linear_floored = np.maximum(sigma_linear_test_mono, sigma_banker_test)

sub_linear = pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': fvc_test_final, 'Confidence': sigma_linear_floored})
sub_linear.to_csv('submission_linear_floored.csv', index=False)
print('Saved submission_linear_floored.csv (linear sigma floored by banker).')

# Overwrite submission.csv only if OOF gain > 0.02 over banker
if delta > 0.02:
    sub_linear.to_csv('submission.csv', index=False)
    print('[Sigma OOF] Linear sigma beats banker by >0.02 on OOF; submission.csv set to submission_linear_floored.csv')
else:
    print('[Sigma OOF] Linear sigma NOT >0.02 over banker; keeping banker as primary submission.csv')

[Sigma OOF] Banker LL=-5.89595
[Sigma OOF] Grid A a=150 b=2.00 LL=-5.57238
[Sigma OOF] Grid A a=150 b=2.50 LL=-5.59623
[Sigma OOF] Grid A a=150 b=3.00 LL=-5.61948
[Sigma OOF] Grid A a=150 b=3.50 LL=-5.64211
[Sigma OOF] Grid A a=200 b=2.00 LL=-5.73874
[Sigma OOF] Grid A a=200 b=2.50 LL=-5.76042
[Sigma OOF] Grid A a=200 b=3.00 LL=-5.78147
[Sigma OOF] Grid A a=200 b=3.50 LL=-5.80192
[Sigma OOF] Grid A a=250 b=2.00 LL=-5.88414
[Sigma OOF] Grid A a=250 b=2.50 LL=-5.90372
[Sigma OOF] Grid A a=250 b=3.00 LL=-5.92274
[Sigma OOF] Grid A a=250 b=3.50 LL=-5.94122
[Sigma OOF] Grid A a=300 b=2.00 LL=-6.01223
[Sigma OOF] Grid A a=300 b=2.50 LL=-6.02999
[Sigma OOF] Grid A a=300 b=3.00 LL=-6.04726
[Sigma OOF] Grid A a=300 b=3.50 LL=-6.06406
[Sigma OOF] Grid B a=90 b=1.60 LL=-5.32675
[Sigma OOF] Grid B a=90 b=1.80 LL=-5.33634
[Sigma OOF] Grid B a=90 b=2.00 LL=-5.34608
[Sigma OOF] Grid B a=90 b=2.20 LL=-5.35592
[Sigma OOF] Grid B a=90 b=2.40 LL=-5.36582
[Sigma OOF] Grid B a=100 b=1.60 LL=-5.36496
[Sigma

  df_sigL = df_sigL.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)


In [16]:
# Conservative linear sigma sweep near banker; select on |dist|>20 OOF and floor by banker
import numpy as np, pandas as pd

def laplace_ll_np(y_true, y_pred, sigma):
    y_true = np.asarray(y_true, float); y_pred = np.asarray(y_pred, float); sigma = np.asarray(sigma, float)
    delta = np.minimum(np.abs(y_true - y_pred), 1000.0)
    sigma = np.clip(sigma, 70.0, 1000.0)
    return float(np.mean(-delta / sigma - np.log(sigma)))

# Use OOF arrays from Cell 7
fvc_oof_best = w_resid_best * fvc_resid_oof + w_slope_best * fvc_slope_oof + w_anchor_best * fvc_anchor_oof
abs_dist_oof = np.abs(dist_oof).astype(float)

# Banker OOF sigma with floors
sigma_banker_oof = np.maximum(240.0 + 3.0 * abs_dist_oof, 70.0)
sigma_banker_oof = np.where(abs_dist_oof > 20.0, np.maximum(sigma_banker_oof, 100.0), sigma_banker_oof)
ll_banker_global = laplace_ll_np(y_true_oof, fvc_oof_best, sigma_banker_oof)
m_long = abs_dist_oof > 20.0
ll_banker_long = laplace_ll_np(y_true_oof[m_long], fvc_oof_best[m_long], sigma_banker_oof[m_long]) if m_long.any() else ll_banker_global
print(f'[Cons Sigma] Banker LL global={ll_banker_global:.5f} | long>20={ll_banker_long:.5f}')

# Conservative grid near banker
a_grid = [220, 240, 260]
b_grid = [2.6, 3.0, 3.4]
best = {'ll_long': -1e9, 'll_global': -1e9, 'a': None, 'b': None, 'diff_frac': 0.0}
for a in a_grid:
    for b in b_grid:
        sig_lin = a + b * abs_dist_oof
        sig_lin = np.maximum(sig_lin, 70.0)
        sig_lin = np.where(abs_dist_oof > 20.0, np.maximum(sig_lin, 100.0), sig_lin)
        # Floor by banker for OOF selection to match test-time parity
        sig_oof = np.maximum(sig_lin, sigma_banker_oof)
        ll_g = laplace_ll_np(y_true_oof, fvc_oof_best, sig_oof)
        ll_l = laplace_ll_np(y_true_oof[m_long], fvc_oof_best[m_long], sig_oof[m_long]) if m_long.any() else ll_g
        diff_frac = float(np.mean((sig_oof - sigma_banker_oof) > 1e-6))
        print(f'[Cons Sigma] a={a} b={b:.2f} LL_global={ll_g:.5f} LL_long={ll_l:.5f} changed_frac={diff_frac:.3f}')
        if ll_l > best['ll_long'] or (np.isclose(ll_l, best['ll_long']) and ll_g > best['ll_global']):
            best.update({'ll_long': ll_l, 'll_global': ll_g, 'a': a, 'b': b, 'diff_frac': diff_frac})

print(f"[Cons Sigma] Best by long slice: a={best['a']} b={best['b']:.2f} LL_global={best['ll_global']:.5f} (Δ={best['ll_global']-ll_banker_global:+.5f}) LL_long={best['ll_long']:.5f} (Δ={best['ll_long']-ll_banker_long:+.5f}) changed_frac={best['diff_frac']:.3f}")

# Also report OOF LL by distance bins for banker vs best candidate
bins = [(0.0, 5.0), (5.0, 15.0), (15.0, 1e9)]
sig_best_oof = None
if best['a'] is not None:
    sig_lin = best['a'] + best['b'] * abs_dist_oof
    sig_lin = np.maximum(sig_lin, 70.0)
    sig_lin = np.where(abs_dist_oof > 20.0, np.maximum(sig_lin, 100.0), sig_lin)
    sig_best_oof = np.maximum(sig_lin, sigma_banker_oof)
for lo, hi in bins:
    m = (abs_dist_oof > lo) & (abs_dist_oof <= hi)
    if not m.any():
        print(f'[Cons Sigma] Bin ({lo},{hi}] empty'); continue
    ll_b = laplace_ll_np(y_true_oof[m], fvc_oof_best[m], sigma_banker_oof[m])
    if sig_best_oof is not None:
        ll_c = laplace_ll_np(y_true_oof[m], fvc_oof_best[m], sig_best_oof[m])
        print(f'[Cons Sigma] Bin ({lo},{hi}] banker={ll_b:.5f} best={ll_c:.5f} Δ={ll_c-ll_b:+.5f}')
    else:
        print(f'[Cons Sigma] Bin ({lo},{hi}] banker={ll_b:.5f}')

# Build test sigma for best (a,b), enforce monotone and banker floor, and optionally set submission.csv
use_cons = False
if best['a'] is not None:
    gain_global = best['ll_global'] - ll_banker_global
    gain_long = best['ll_long'] - ll_banker_long
    use_cons = (gain_long > 0.0) and (gain_global > 0.02)
    print(f'[Cons Sigma] adopt? {use_cons} (global Δ={gain_global:+.5f}, long Δ={gain_long:+.5f})')

grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]
grid['Weeks'] = parts[1].astype(int)
test = pd.read_csv('test.csv')
test_bl = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_bl, on='Patient', how='left')
assert grid['Base_Week'].notna().all(), 'Base_Week NaNs in conservative sigma build'
abs_dist_test = (grid['Weeks'] - grid['Base_Week']).abs().astype(float)
a_best, b_best = best['a'], best['b']
if a_best is None:
    a_best, b_best = 240.0, 3.0
sigma_lin_test = a_best + b_best * abs_dist_test
sigma_lin_test = np.maximum(sigma_lin_test, 70.0)
sigma_lin_test = np.where(abs_dist_test > 20.0, np.maximum(sigma_lin_test, 100.0), sigma_lin_test)
df_sig = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'dist': abs_dist_test.values, 'Sigma': sigma_lin_test.astype(float)})
def enforce_sigma_monotone(g):
    g = g.sort_values('dist').copy()
    g['Sigma'] = np.maximum.accumulate(g['Sigma'].values)
    return g
df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)
sigma_lin_mono = df_sig['Sigma'].values.astype(float)
# Banker floor on test
sigma_banker_test = np.maximum(240.0 + 3.0 * abs_dist_test, 70.0)
sigma_banker_test = np.where(abs_dist_test > 20.0, np.maximum(sigma_banker_test, 100.0), sigma_banker_test)
sigma_cons_test = np.maximum(sigma_lin_mono, sigma_banker_test)

# Load FVC from integrated banker submission (same FVC across sigma variants)
sub_b = pd.read_csv('submission_banker.csv').set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()
sub_cons = pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': sub_b['FVC'].astype(float).values, 'Confidence': sigma_cons_test})
sub_cons.to_csv('submission_linear_floored_conservative.csv', index=False)
print('Saved submission_linear_floored_conservative.csv (conservative linear sigma floored by banker).')

if use_cons:
    sub_cons.to_csv('submission.csv', index=False)
    print('[Cons Sigma] Adopted conservative linear sigma as primary (submission.csv).')
else:
    print('[Cons Sigma] Keeping banker as primary (submission.csv unchanged).')

[Cons Sigma] Banker LL global=-5.89595 | long>20=-6.13413
[Cons Sigma] a=220 b=2.60 LL_global=-5.89595 LL_long=-6.13413 changed_frac=0.000
[Cons Sigma] a=220 b=3.00 LL_global=-5.89595 LL_long=-6.13413 changed_frac=0.000
[Cons Sigma] a=220 b=3.40 LL_global=-5.89632 LL_long=-6.13528 changed_frac=0.103
[Cons Sigma] a=240 b=2.60 LL_global=-5.89595 LL_long=-6.13413 changed_frac=0.000
[Cons Sigma] a=240 b=3.00 LL_global=-5.89595 LL_long=-6.13413 changed_frac=0.000
[Cons Sigma] a=240 b=3.40 LL_global=-5.91107 LL_long=-6.16592 changed_frac=0.882
[Cons Sigma] a=260 b=2.60 LL_global=-5.93437 LL_long=-6.14491 changed_frac=0.897
[Cons Sigma] a=260 b=3.00 LL_global=-5.94886 LL_long=-6.17533 changed_frac=1.000
[Cons Sigma] a=260 b=3.40 LL_global=-5.96340 LL_long=-6.20600 changed_frac=1.000
[Cons Sigma] Best by long slice: a=220 b=2.60 LL_global=-5.89595 (Δ=+0.00000) LL_long=-6.13413 (Δ=+0.00000) changed_frac=0.000
[Cons Sigma] Bin (0.0,5.0] banker=-5.77585 best=-5.77585 Δ=+0.00000
[Cons Sigma] Bin (

  df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)


In [17]:
# Overwrite submission.csv with slope+anchor banker submission
import pandas as pd
ss = pd.read_csv('sample_submission.csv')
sub_slope = pd.read_csv('submission_slope_banker.csv')
assert sub_slope.shape[0] == ss.shape[0], 'Row count mismatch vs sample_submission'
assert set(sub_slope['Patient_Week'].astype(str)) == set(ss['Patient_Week'].astype(str)), 'Patient_Week sets differ'
assert sub_slope['FVC'].notna().all() and sub_slope['Confidence'].notna().all(), 'NaNs in slope+anchor submission'
sub_slope.to_csv('submission.csv', index=False)
print('submission.csv overwritten with slope+anchor banker submission (0.85 slope blend + 0.15 anchor, banker sigma).')

submission.csv overwritten with slope+anchor banker submission (0.85 slope blend + 0.15 anchor, banker sigma).


In [18]:
# Build tolerant non-increasing FVC (allow +25 ml increases) on banker FVC; write alt submission and set as current
import numpy as np, pandas as pd

# Load banker submission and rebuild grid for ordering
ss = pd.read_csv('sample_submission.csv')
sub_b = pd.read_csv('submission_banker.csv').set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()
grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]
grid['Weeks'] = parts[1].astype(int)
test = pd.read_csv('test.csv')
test_bl = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_bl, on='Patient', how='left')
assert grid['Base_Week'].notna().all(), 'Base_Week NaNs when building tolerant FVC grid'

# Apply tolerant monotonicity per patient: going into the future, enforce FVC[t] <= next_FVC + tol
tol = 25.0
df_out = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'FVC': sub_b['FVC'].astype(float).clip(500, 6000)})
def enforce_non_increasing_tolerant(g, tol=25.0):
    g = g.sort_values('Weeks').copy()
    f = g['FVC'].values.astype(float)
    # Work backwards to future; allow small increases up to tol
    for i in range(len(f)-2, -1, -1):
        f[i] = min(f[i], f[i+1] + tol)
    g['FVC'] = f
    return g
df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(lambda g: enforce_non_increasing_tolerant(g, tol))

# Pin dist==0 to Base_FVC after tolerance enforcement
dist = (grid['Weeks'].values - grid['Base_Week'].values).astype(float)
base_fvc = grid['Base_FVC'].values.astype(float)
fvc_tol = df_out['FVC'].values.astype(float)
fvc_tol = np.where(dist == 0.0, base_fvc, fvc_tol)

# Keep banker sigma as-is (already monotone and floored in submission_banker)
sigma_banker = sub_b['Confidence'].astype(float).values

sub_tol = pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': fvc_tol, 'Confidence': sigma_banker})
sub_tol.to_csv('submission_banker_tol25.csv', index=False)
sub_tol.to_csv('submission.csv', index=False)
print('Saved submission_banker_tol25.csv and set submission.csv (banker sigma, FVC monotonicity tolerance +25 ml).')

Saved submission_banker_tol25.csv and set submission.csv (banker sigma, FVC monotonicity tolerance +25 ml).


  df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(lambda g: enforce_non_increasing_tolerant(g, tol))


In [19]:
# Build banker-sigma submission WITHOUT FVC monotonicity (only clip + pin dist==0); set as current
import numpy as np, pandas as pd

ss = pd.read_csv('sample_submission.csv')
sub_b = pd.read_csv('submission_banker.csv').set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()

# Rebuild grid to compute dist and access Base_FVC
grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]
grid['Weeks'] = parts[1].astype(int)
test = pd.read_csv('test.csv')
test_bl = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_bl, on='Patient', how='left')
assert grid['Base_Week'].notna().all(), 'Base_Week NaNs when building no-mono FVC grid'
dist = (grid['Weeks'].values - grid['Base_Week'].values).astype(float)
base_fvc = grid['Base_FVC'].values.astype(float)

# No monotonicity: just clip and pin dist==0 to Base_FVC
fvc_nm = np.clip(sub_b['FVC'].astype(float).values, 500, 6000)
fvc_nm = np.where(dist == 0.0, base_fvc, fvc_nm)

# Keep banker sigma from submission_banker
sigma_b = sub_b['Confidence'].astype(float).values

sub_no_mono = pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': fvc_nm, 'Confidence': sigma_b})
sub_no_mono.to_csv('submission_banker_no_mono.csv', index=False)
sub_no_mono.to_csv('submission.csv', index=False)
print('Saved submission_banker_no_mono.csv and set submission.csv (banker sigma, no FVC monotonicity; only clip + pin dist==0).')

Saved submission_banker_no_mono.csv and set submission.csv (banker sigma, no FVC monotonicity; only clip + pin dist==0).


In [22]:
# Medal path A: slope-only parametric line + global anchor (w_slope=0.60, w_anchor=0.40),
# clamp slopes to [5th,95th], banker sigma with floors and monotonicity; overwrite submission.csv
import numpy as np, pandas as pd
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# 1) Prepare baseline tables
def prepare_baseline_table(df):
    base = (df.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False).first())
    base = base[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].rename(
        columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
    return base

# Reuse helpers from above cells: build_slope_features, compute_patient_slopes, robust_global_slope already defined

# 2) Fit full-train slope models (Ridge + KNN) with ECDF/OHE on train baseline; clamp predictions
base_full = prepare_baseline_table(train)
slopes_full = compute_patient_slopes(train)
slope_labels_full = pd.DataFrame({'Patient': list(slopes_full.keys()), 's_label': list(slopes_full.values())})
base_full_lab = base_full.merge(slope_labels_full, on='Patient', how='left')
base_fullF, feat_cols_s, ecdf_bf_s, ecdf_pc_s, cats_s = build_slope_features(base_full_lab, fit=True)
scaler = StandardScaler(with_mean=True, with_std=True)
X_full = base_fullF[feat_cols_s].values.astype(float)
y_full = base_fullF['s_label'].fillna(0.0).values.astype(float)
X_full_std = scaler.fit_transform(X_full)
ridge = Ridge(alpha=1.0, random_state=42).fit(X_full_std, y_full)
knn = KNeighborsRegressor(n_neighbors=9, weights='distance').fit(X_full_std, y_full)
q_lo, q_hi = np.percentile(y_full, 5), np.percentile(y_full, 95)

# 3) Build strict test grid and baseline features for test patients
ss = pd.read_csv('sample_submission.csv')
grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]; grid['Weeks'] = parts[1].astype(int)
test_bl = test[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].rename(
    columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_bl, on='Patient', how='left')
assert grid['Base_Week'].notna().all(), 'Base_Week missing in test grid'
dist = (grid['Weeks'].values - grid['Base_Week'].values).astype(float)

base_test = grid[['Patient','Base_Week','Base_FVC']].drop_duplicates('Patient')
meta = test[['Patient','Percent','Age','Sex','SmokingStatus']].drop_duplicates('Patient').rename(
    columns={'Percent':'Percent_at_base'})
base_test = base_test.merge(meta, on='Patient', how='left')
base_testF, _, _, _, _ = build_slope_features(base_test, ecdf_bf_s, ecdf_pc_s, cats_s, fit=False)
X_test_std = scaler.transform(base_testF[feat_cols_s].values.astype(float))
s_r = ridge.predict(X_test_std)
s_k = knn.predict(X_test_std)
s_hat = 0.80 * s_r + 0.20 * s_k  # ridge/knn blend from probe
s_hat = np.clip(s_hat, q_lo, q_hi)
map_s = dict(zip(base_testF['Patient'].values, s_hat))

# 4) Build slope-only FVC and global anchor; blend w_slope=0.60, w_anchor=0.40
fvc_slope = (grid['Base_FVC'].values + pd.Series(grid['Patient']).map(map_s).astype(float).fillna(0.0).values * dist).astype(float)
gs = robust_global_slope(compute_patient_slopes(train))
fvc_anchor = (grid['Base_FVC'].values + gs * dist).astype(float)
fvc_mix = 0.60 * fvc_slope + 0.40 * fvc_anchor

# Guardrails: pin dist==0 to Base_FVC; per-patient non-increasing into the future; clip [500,6000]
fvc_mix = np.where(dist == 0.0, grid['Base_FVC'].values.astype(float), fvc_mix)
df_out = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'FVC': np.clip(fvc_mix, 500, 6000)})
def enforce_non_increasing(g):
    g = g.sort_values('Weeks').copy()
    g['FVC'] = np.minimum.accumulate(g['FVC'].values[::-1])[::-1]
    return g
df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_non_increasing)
fvc_final = df_out['FVC'].values.astype(float)

# 5) Sigma: banker sigma = max(240 + 3*|dist|, 70), and ≥100 when |dist|>20; enforce per-patient monotone in |dist|
abs_dist = np.abs(dist).astype(float)
sigma = np.maximum(240.0 + 3.0 * abs_dist, 70.0)
sigma = np.where(abs_dist > 20.0, np.maximum(sigma, 100.0), sigma)
df_sig = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'dist': abs_dist, 'Sigma': sigma.astype(float)})
def enforce_sigma_monotone(g):
    g = g.sort_values('dist').copy()
    g['Sigma'] = np.maximum.accumulate(g['Sigma'].values)
    return g
df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)
sigma_final = df_sig['Sigma'].values.astype(float)

# 6) Save submission
sub = pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': fvc_final, 'Confidence': sigma_final})
sub.to_csv('submission_slope_anchor_banker.csv', index=False)
sub.to_csv('submission.csv', index=False)
print('Saved submission_slope_anchor_banker.csv and set submission.csv (w_slope=0.60, w_anchor=0.40, banker sigma, clamps+guardrails).')

Saved submission_slope_anchor_banker.csv and set submission.csv (w_slope=0.60, w_anchor=0.40, banker sigma, clamps+guardrails).


  df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_non_increasing)
  df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)


In [23]:
# Iter 2: CatBoost slope head (blend 0.5 CatBoost + 0.5 Ridge), slope-only + anchor (0.60/0.40), banker sigma
import numpy as np, pandas as pd
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from catboost import CatBoostRegressor

# Reuse helpers: build_slope_features, compute_patient_slopes, robust_global_slope, prepare_baseline_table
def prepare_baseline_table(df):
    base = (df.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False).first())
    base = base[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].rename(
        columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
    return base

# 1) Full-train baseline with slope labels
base_full = prepare_baseline_table(train)
slopes_full = compute_patient_slopes(train)
slope_labels_full = pd.DataFrame({'Patient': list(slopes_full.keys()), 's_label': list(slopes_full.values())})
base_full_lab = base_full.merge(slope_labels_full, on='Patient', how='left')
base_fullF, feat_cols_s, ecdf_bf_s, ecdf_pc_s, cats_s = build_slope_features(base_full_lab, fit=True)
X_full = base_fullF[feat_cols_s].values.astype(float)
y_full = base_fullF['s_label'].fillna(0.0).values.astype(float)

# 2) Ridge slope (standardized)
scaler = StandardScaler(with_mean=True, with_std=True)
X_full_std = scaler.fit_transform(X_full)
ridge = Ridge(alpha=1.0, random_state=42).fit(X_full_std, y_full)

# 3) CatBoost slope
cb = CatBoostRegressor(
    iterations=1200, depth=4, learning_rate=0.05, l2_leaf_reg=6.0, subsample=0.8,
    random_strength=0.8, border_count=128, od_type='Iter', od_wait=50, bootstrap_type='Bernoulli',
    loss_function='RMSE', random_state=42, verbose=False
)
cb.fit(X_full, y_full)

# Clamp range from labels
q_lo, q_hi = np.percentile(y_full, 5), np.percentile(y_full, 95)

# 4) Build test grid baseline features
ss = pd.read_csv('sample_submission.csv')
grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]; grid['Weeks'] = parts[1].astype(int)
test_bl = test[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].rename(
    columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_bl, on='Patient', how='left')
assert grid['Base_Week'].notna().all(), 'Base_Week missing in test grid'
dist = (grid['Weeks'].values - grid['Base_Week'].values).astype(float)

base_test = grid[['Patient','Base_Week','Base_FVC']].drop_duplicates('Patient')
meta = test[['Patient','Percent','Age','Sex','SmokingStatus']].drop_duplicates('Patient').rename(
    columns={'Percent':'Percent_at_base'})
base_test = base_test.merge(meta, on='Patient', how='left')
base_testF, _, _, _, _ = build_slope_features(base_test, ecdf_bf_s, ecdf_pc_s, cats_s, fit=False)
X_test = base_testF[feat_cols_s].values.astype(float)
X_test_std = scaler.transform(X_test)

# Predict slopes
s_r = ridge.predict(X_test_std)
s_cb = cb.predict(X_test)
s_hat = 0.5 * s_r + 0.5 * s_cb
s_hat = np.clip(s_hat, q_lo, q_hi)
map_s = dict(zip(base_testF['Patient'].values, s_hat))

# 5) Build FVC: slope-only + global anchor (0.60/0.40)
fvc_slope = (grid['Base_FVC'].values + pd.Series(grid['Patient']).map(map_s).astype(float).fillna(0.0).values * dist).astype(float)
gs = robust_global_slope(compute_patient_slopes(train))
fvc_anchor = (grid['Base_FVC'].values + gs * dist).astype(float)
fvc_mix = 0.60 * fvc_slope + 0.40 * fvc_anchor

# Guardrails: pin dist==0 to Base_FVC; per-patient non-increasing; clip
fvc_mix = np.where(dist == 0.0, grid['Base_FVC'].values.astype(float), fvc_mix)
df_out = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'FVC': np.clip(fvc_mix, 500, 6000)})
def enforce_non_increasing(g):
    g = g.sort_values('Weeks').copy()
    g['FVC'] = np.minimum.accumulate(g['FVC'].values[::-1])[::-1]
    return g
df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_non_increasing)
fvc_final = df_out['FVC'].values.astype(float)

# 6) Banker sigma with floors and per-patient monotonicity in |dist|
abs_dist = np.abs(dist).astype(float)
sigma = np.maximum(240.0 + 3.0 * abs_dist, 70.0)
sigma = np.where(abs_dist > 20.0, np.maximum(sigma, 100.0), sigma)
df_sig = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'dist': abs_dist, 'Sigma': sigma.astype(float)})
def enforce_sigma_monotone(g):
    g = g.sort_values('dist').copy()
    g['Sigma'] = np.maximum.accumulate(g['Sigma'].values)
    return g
df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)
sigma_final = df_sig['Sigma'].values.astype(float)

# 7) Save submission
sub = pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': fvc_final, 'Confidence': sigma_final})
sub.to_csv('submission_slopeCB_anchor_banker.csv', index=False)
sub.to_csv('submission.csv', index=False)
print('Saved submission_slopeCB_anchor_banker.csv and set submission.csv (0.5 CatBoost + 0.5 Ridge slope, 0.60 slope + 0.40 anchor, banker sigma).')

Saved submission_slopeCB_anchor_banker.csv and set submission.csv (0.5 CatBoost + 0.5 Ridge slope, 0.60 slope + 0.40 anchor, banker sigma).


  df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_non_increasing)
  df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)


In [32]:
# Iter 3: Tune w_anchor with adoption rule: start at 0.60, probe 0.65 (adopt if LL_long +>=0.002 and LL_global drop <=0.002), then optionally 0.70
import numpy as np, pandas as pd, time, gc
from sklearn.model_selection import GroupKFold
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

def prepare_baseline_table(df):
    base = (df.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False).first())
    base = base[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].rename(
        columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
    return base

def build_slope_features(base_df, ecdf_basefvc=None, ecdf_percent=None, cats=None, fit=False):
    b = base_df.copy()
    b['log_Base_FVC'] = np.log1p(np.maximum(b['Base_FVC'].astype(float), 1.0))
    b['BaseFVC_over_Age'] = b['Base_FVC'].astype(float) / np.maximum(b['Age'].astype(float), 1.0)
    b['PercentBase_over_Age'] = b['Percent_at_base'].astype(float) / np.maximum(b['Age'].astype(float), 1.0)
    if fit:
        ecdf_basefvc = ecdf_rank_fit(b['Base_FVC'].values)
        ecdf_percent = ecdf_rank_fit(b['Percent_at_base'].values)
    b['BaseFVC_ecdf'] = ecdf_rank_transform(b['Base_FVC'].values, ecdf_basefvc)
    b['Percent_ecdf'] = ecdf_rank_transform(b['Percent_at_base'].values, ecdf_percent)
    if fit:
        cats = one_hot_fit(b, ['Sex','SmokingStatus'])
    b = one_hot_transform(b, cats)
    num_cols = ['Age','Base_FVC','log_Base_FVC','Percent_at_base','BaseFVC_over_Age','PercentBase_over_Age','BaseFVC_ecdf','Percent_ecdf']
    cat_cols = [c for c in b.columns if c.startswith('Sex__') or c.startswith('SmokingStatus__')]
    feat_cols = num_cols + cat_cols
    return b, feat_cols, ecdf_basefvc, ecdf_percent, cats

def forward_oof_slope_and_anchor(train_df, n_splits=5, seed=42, ridge_w=0.80, knn_w=0.20):
    gkf = GroupKFold(n_splits=n_splits)
    groups = train_df['Patient'].values
    y_true_list, dist_list = [], []
    fvc_slope_list, fvc_anchor_list = [], []
    for fold, (trn_idx, val_idx) in enumerate(gkf.split(train_df, groups=groups), 1):
        trn = train_df.iloc[trn_idx].copy(); val = train_df.iloc[val_idx].copy()
        base_trn = prepare_baseline_table(trn); base_val = prepare_baseline_table(val)
        slopes_tr = compute_patient_slopes(trn)
        base_trn_lab = base_trn.merge(pd.DataFrame({'Patient': list(slopes_tr.keys()), 's_label': list(slopes_tr.values())}), on='Patient', how='left')
        base_trnF, feat_cols, ecdf_bf, ecdf_pc, cats_full = build_slope_features(base_trn_lab, fit=True)
        base_valF, _, _, _, _ = build_slope_features(base_val, ecdf_bf, ecdf_pc, cats_full, fit=False)
        scaler = StandardScaler(with_mean=True, with_std=True)
        X_trn = base_trnF[feat_cols].values.astype(float); y_trn = base_trnF['s_label'].fillna(0.0).values.astype(float)
        X_trn_std = scaler.fit_transform(X_trn)
        X_val_std = scaler.transform(base_valF[feat_cols].values.astype(float))
        ridge = Ridge(alpha=1.0, random_state=seed).fit(X_trn_std, y_trn)
        knn = KNeighborsRegressor(n_neighbors=9, weights='distance').fit(X_trn_std, y_trn)
        s_r = ridge.predict(X_val_std); s_k = knn.predict(X_val_std)
        q_lo, q_hi = np.percentile(y_trn, 5), np.percentile(y_trn, 95)
        s_bl = np.clip(ridge_w * s_r + knn_w * s_k, q_lo, q_hi)
        s_map = dict(zip(base_val['Patient'].values, s_bl))
        valm = val.merge(base_val[['Patient','Base_Week','Base_FVC']], on='Patient', how='left')
        mask = (valm['Weeks'].values >= valm['Base_Week'].values)
        dist = (valm['Weeks'].values - valm['Base_Week'].values).astype(float)
        s_hat = valm['Patient'].map(s_map).astype(float).fillna(0.0).values
        fvc_slope = (valm['Base_FVC'].values + s_hat * dist).astype(float)
        gs_fold = robust_global_slope(compute_patient_slopes(trn))
        fvc_anchor = (valm['Base_FVC'].values + gs_fold * dist).astype(float)
        y_true_list.append(valm['FVC'].values[mask].astype(float))
        dist_list.append(dist[mask].astype(float))
        fvc_slope_list.append(fvc_slope[mask].astype(float))
        fvc_anchor_list.append(fvc_anchor[mask].astype(float))
        del trn, val, base_trn, base_val, base_trnF, base_valF, X_trn, X_trn_std, X_val_std
        gc.collect()
    return (np.concatenate(y_true_list), np.concatenate(dist_list),
            np.concatenate(fvc_slope_list), np.concatenate(fvc_anchor_list))

def laplace_ll_np(y_true, y_pred, sigma):
    y_true = y_true.astype(float); y_pred = y_pred.astype(float); sigma = sigma.astype(float)
    delta = np.minimum(np.abs(y_true - y_pred), 1000.0)
    sigma = np.maximum(sigma, 70.0)
    return float(np.mean(-delta / sigma - np.log(sigma)))

# 1) Build OOF arrays for slope-only and anchor
t0 = time.time()
y_oof, dist_oof, fvc_slope_oof, fvc_anchor_oof = forward_oof_slope_and_anchor(train, n_splits=5, seed=42)
sigma_oof = np.maximum(240.0 + 3.0 * np.abs(dist_oof), 70.0)
sigma_oof = np.where(np.abs(dist_oof) > 20.0, np.maximum(sigma_oof, 100.0), sigma_oof)
m_long = np.abs(dist_oof) > 20.0

# 2) Adoption rule: start at 0.60, probe 0.65; adopt if LL_long improves by >=0.002 and LL_global drop <=0.002; then optionally probe 0.70
def eval_w(w_a):
    fvc_bl = (1.0 - w_a) * fvc_slope_oof + w_a * fvc_anchor_oof
    ll_g = laplace_ll_np(y_oof, fvc_bl, sigma_oof)
    ll_l = laplace_ll_np(y_oof[m_long], fvc_bl[m_long], sigma_oof[m_long]) if m_long.any() else ll_g
    return ll_g, ll_l

ll_g_60, ll_l_60 = eval_w(0.60)
print(f'[Slope-only OOF] w_anchor=0.60 LL_global={ll_g_60:.5f} LL_long={ll_l_60:.5f}', flush=True)
ll_g_65, ll_l_65 = eval_w(0.65)
print(f'[Slope-only OOF] w_anchor=0.65 LL_global={ll_g_65:.5f} LL_long={ll_l_65:.5f}', flush=True)
w_anchor = 0.60
if (ll_l_65 - ll_l_60) >= 0.002 and (ll_g_60 - ll_g_65) <= 0.002:
    w_anchor = 0.65
    # Optionally probe 0.70 only if 0.65 adopted and it improved long slice sufficiently
    ll_g_70, ll_l_70 = eval_w(0.70)
    print(f'[Slope-only OOF] w_anchor=0.70 LL_global={ll_g_70:.5f} LL_long={ll_l_70:.5f}', flush=True)
    if (ll_l_70 - ll_l_65) >= 0.002 and (ll_g_65 - ll_g_70) <= 0.002:
        w_anchor = 0.70
print(f'[Select] Using w_anchor={w_anchor:.2f} by adoption rule; elapsed={time.time()-t0:.2f}s', flush=True)

# 3) Fit slope head on full baseline and predict test; blend with selected anchor weight
base_full = prepare_baseline_table(train)
slopes_full = compute_patient_slopes(train)
base_full_lab = base_full.merge(pd.DataFrame({'Patient': list(slopes_full.keys()), 's_label': list(slopes_full.values())}), on='Patient', how='left')
base_fullF, feat_cols, ecdf_bf, ecdf_pc, cats_full = build_slope_features(base_full_lab, fit=True)
scaler = StandardScaler(with_mean=True, with_std=True)
X_full = base_fullF[feat_cols].values.astype(float); y_full = base_fullF['s_label'].fillna(0.0).values.astype(float)
X_full_std = scaler.fit_transform(X_full)
ridge = Ridge(alpha=1.0, random_state=42).fit(X_full_std, y_full)
knn = KNeighborsRegressor(n_neighbors=9, weights='distance').fit(X_full_std, y_full)
q_lo, q_hi = np.percentile(y_full, 5), np.percentile(y_full, 95)

ss = pd.read_csv('sample_submission.csv')
grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]; grid['Weeks'] = parts[1].astype(int)
test_bl = test[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].rename(
    columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_bl, on='Patient', how='left')
dist = (grid['Weeks'].values - grid['Base_Week'].values).astype(float)

base_test = grid[['Patient','Base_Week','Base_FVC']].drop_duplicates('Patient')
meta = test[['Patient','Percent','Age','Sex','SmokingStatus']].drop_duplicates('Patient').rename(columns={'Percent':'Percent_at_base'})
base_test = base_test.merge(meta, on='Patient', how='left')
base_testF, _, _, _, _ = build_slope_features(base_test, ecdf_bf, ecdf_pc, cats_full, fit=False)
X_test_std = scaler.transform(base_testF[feat_cols].values.astype(float))
s_r = ridge.predict(X_test_std); s_k = knn.predict(X_test_std)
s_hat = np.clip(0.80 * s_r + 0.20 * s_k, q_lo, q_hi)
map_s = dict(zip(base_testF['Patient'].values, s_hat))

fvc_slope = (grid['Base_FVC'].values + pd.Series(grid['Patient']).map(map_s).astype(float).fillna(0.0).values * dist).astype(float)
gs = robust_global_slope(slopes_full)
fvc_anchor = (grid['Base_FVC'].values + gs * dist).astype(float)
fvc_mix = (1.0 - w_anchor) * fvc_slope + w_anchor * fvc_anchor

# Guardrails and sigma
fvc_mix = np.where(dist == 0.0, grid['Base_FVC'].values.astype(float), fvc_mix)
df_out = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'FVC': np.clip(fvc_mix, 500, 6000)})
def enforce_non_increasing(g):
    g = g.sort_values('Weeks').copy()
    g['FVC'] = np.minimum.accumulate(g['FVC'].values[::-1])[::-1]
    return g
df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_non_increasing)
fvc_final = df_out['FVC'].values.astype(float)

abs_dist = np.abs(dist).astype(float)
sigma = np.maximum(240.0 + 3.0 * abs_dist, 70.0)
sigma = np.where(abs_dist > 20.0, np.maximum(sigma, 100.0), sigma)
df_sig = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'dist': abs_dist, 'Sigma': sigma.astype(float)})
def enforce_sigma_monotone(g):
    g = g.sort_values('dist').copy()
    g['Sigma'] = np.maximum.accumulate(g['Sigma'].values)
    return g
df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)
sigma_final = df_sig['Sigma'].values.astype(float)

sub = pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': fvc_final, 'Confidence': sigma_final})
fname = f'submission_slope_anchor_banker_wA{int(round(w_anchor*100))}.csv'
sub.to_csv(fname, index=False)
sub.to_csv('submission.csv', index=False)
print(f'Saved {fname} and set submission.csv (slope-only + anchor, w_anchor={w_anchor:.2f}, banker sigma).')

[Slope-only OOF] w_anchor=0.60 LL_global=-6.14324 LL_long=-6.42511


[Slope-only OOF] w_anchor=0.65 LL_global=-6.14273 LL_long=-6.42402


[Select] Using w_anchor=0.60 by adoption rule; elapsed=0.60s


Saved submission_slope_anchor_banker_wA60.csv and set submission.csv (slope-only + anchor, w_anchor=0.60, banker sigma).


  df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_non_increasing)
  df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)


In [57]:
# Quantile LGBM v2: delta-target with ordered quantiles, fold hygiene, multi-param averaging
from lightgbm import LGBMRegressor
import lightgbm as lgb
from sklearn.model_selection import GroupKFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
import numpy as np, pandas as pd, time, gc

t0 = time.time()
ss = pd.read_csv('sample_submission.csv')

# Helpers assumed available earlier:
# - prepare_baseline_table, build_slope_features, ecdf_rank_fit, ecdf_rank_transform, one_hot_fit, one_hot_transform
# - compute_patient_slopes, robust_global_slope
# - laplace_ll

# Quantile feature builder (baseline-only + safe distance bases/interactions + ECDF/OHE)
def build_q_features(grid_df, base_df, ecdf_bf=None, ecdf_pc=None, cats=None, fit=False):
    d = grid_df.merge(base_df[['Patient','Base_Week','Base_FVC','Percent_at_base','Age','Sex','SmokingStatus']], on='Patient', how='left')
    d['dist'] = (d['Weeks'] - d['Base_Week']).astype(float)
    d = d[d['dist'] >= 0].copy()
    d['abs_dist'] = d['dist'].abs()
    d['log1p_abs_dist'] = np.log1p(d['abs_dist'])
    d['dist_cap'] = d['dist'].clip(0, 30)
    # Piecewise distance bases
    d['dist_short'] = d['dist'].clip(0, 5)
    d['dist_mid'] = (d['dist'] - 5).clip(lower=0, upper=10)
    d['dist_long'] = (d['dist'] - 15).clip(lower=0)
    d['dist2'] = d['dist']**2
    d['dist3'] = d['dist']**3

    d['Base_FVC'] = d['Base_FVC'].astype(float)
    d['Percent_at_base'] = d['Percent_at_base'].astype(float).clip(30, 120)
    d['Age'] = d['Age'].astype(float)
    d['log_Base_FVC'] = np.log1p(np.maximum(d['Base_FVC'], 1.0))

    # Safe interactions
    d['Age_x_Percent'] = d['Age'] * d['Percent_at_base']
    d['BaseFVC_x_dist'] = d['Base_FVC'] * d['dist']
    d['dist_x_Age'] = d['dist'] * d['Age']
    d['dist_x_Percent'] = d['dist'] * d['Percent_at_base']
    # Piecewise interactions with Base_FVC
    d['BaseFVC_x_dshort'] = d['Base_FVC'] * d['dist_short']
    d['BaseFVC_x_dmid'] = d['Base_FVC'] * d['dist_mid']
    d['BaseFVC_x_dlong'] = d['Base_FVC'] * d['dist_long']

    if fit:
        ecdf_bf = ecdf_rank_fit(d['Base_FVC'].values)
        ecdf_pc = ecdf_rank_fit(d['Percent_at_base'].values)
        cats = one_hot_fit(d, ['Sex','SmokingStatus'])
    d['BaseFVC_ecdf'] = ecdf_rank_transform(d['Base_FVC'].values, ecdf_bf)
    d['Percent_ecdf'] = ecdf_rank_transform(d['Percent_at_base'].values, ecdf_pc)
    d = one_hot_transform(d, cats)

    # ECDF-derived decile bucket for Base_FVC (fold-fit), deterministic one-hot without extra fit state
    d['BFV_decile'] = np.floor(d['BaseFVC_ecdf'] * 10).clip(0, 9).astype(int)
    for k in range(10):
        d[f'BFV_decile__{k}'] = (d['BFV_decile'] == k).astype(np.int8)

    feat_cols = [
        'Age','Base_FVC','log_Base_FVC','Percent_at_base','BaseFVC_ecdf','Percent_ecdf',
        'dist','abs_dist','log1p_abs_dist','dist_cap','dist_short','dist_mid','dist_long','dist2','dist3',
        'Age_x_Percent','BaseFVC_x_dist','dist_x_Age','dist_x_Percent','BaseFVC_x_dshort','BaseFVC_x_dmid','BaseFVC_x_dlong','s_hat'
    ] + [c for c in d.columns if c.startswith('Sex__') or c.startswith('SmokingStatus__') or c.startswith('BFV_decile__')]

    for c in feat_cols:
        if c not in d.columns: d[c] = 0.0

    return d, feat_cols, ecdf_bf, ecdf_pc, cats

# CV config
N_SPLITS = 5
gkf = GroupKFold(n_splits=N_SPLITS)
groups = train['Patient'].values
alphas = [0.10, 0.20, 0.50, 0.80, 0.90]  # q10/q20/q50/q80/q90
oof_cols = ['q10_delta_oof','q20_delta_oof','q50_delta_oof','q80_delta_oof','q90_delta_oof']

# 3 moderate param sets
params_list = [
    dict(objective='quantile', metric='quantile', n_estimators=3000, learning_rate=0.035,
         num_leaves=20, max_depth=5, min_data_in_leaf=32, subsample=0.8, colsample_bytree=0.8,
         reg_alpha=0.1, reg_lambda=0.2, n_jobs=-1, verbose=-1),
    dict(objective='quantile', metric='quantile', n_estimators=3500, learning_rate=0.030,
         num_leaves=31, max_depth=6, min_data_in_leaf=24, subsample=0.75, colsample_bytree=0.75,
         reg_alpha=0.1, reg_lambda=0.2, n_jobs=-1, verbose=-1),
    dict(objective='quantile', metric='quantile', n_estimators=4000, learning_rate=0.028,
         num_leaves=48, max_depth=7, min_data_in_leaf=20, subsample=0.7, colsample_bytree=0.7,
         reg_alpha=0.0, reg_lambda=0.3, n_jobs=-1, verbose=-1),
]

# OOF containers
oof_df = train[['Patient','Weeks','FVC']].copy()
for c in oof_cols: oof_df[c] = np.nan

# Static TEST grid (weeks, baseline merge) and index mapping to ss order
grid_te = ss.copy()
parts = grid_te['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid_te['Patient'] = parts[0]; grid_te['Weeks'] = parts[1].astype(int)
test_base = test[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].rename(
    columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})

# Build a mapping from (Patient,Weeks) -> row index in ss
grid_te_idx = grid_te[['Patient','Weeks']].copy()
grid_te_idx['ss_idx'] = np.arange(grid_te_idx.shape[0], dtype=int)

# Build fold-local s_hat machinery helper (use TRAIN fold only for slope labels; no leak)
def fit_s_hat_fold(trn_df, base_trn):
    slopes_trn = compute_patient_slopes(trn_df)
    slope_labels_trn = pd.DataFrame({'Patient': list(slopes_trn.keys()), 's_label': list(slopes_trn.values())})
    base_trn_lab = base_trn.merge(slope_labels_trn, on='Patient', how='left')
    bf_trn, f_cols_s, ecdf_bf_s, ecdf_pc_s, cats_s = build_slope_features(base_trn_lab, fit=True)
    scaler_s = StandardScaler(with_mean=True, with_std=True).fit(bf_trn[f_cols_s].values.astype(float))
    Xs_tr = scaler_s.transform(bf_trn[f_cols_s].values.astype(float))
    y_s = bf_trn['s_label'].fillna(0.0).values.astype(float)
    ridge = Ridge(alpha=1.0, random_state=42).fit(Xs_tr, y_s)
    knn = KNeighborsRegressor(n_neighbors=9, weights='distance').fit(Xs_tr, y_s)
    q_lo, q_hi = np.percentile(y_s, [5,95])

    def get_s_hat_map(base_df_patients):
        bf_pred, _, _, _, _ = build_slope_features(base_df_patients, ecdf_bf_s, ecdf_pc_s, cats_s, fit=False)
        Xs = scaler_s.transform(bf_pred[f_cols_s].values.astype(float))
        s = 0.8*ridge.predict(Xs) + 0.2*knn.predict(Xs)
        s = np.clip(s, q_lo, q_hi)
        return dict(zip(bf_pred['Patient'].values, s))
    return get_s_hat_map

# Test deltas accumulator (fold- and param-averaged); full SS length
test_preds_delta = np.zeros((ss.shape[0], len(alphas)), dtype=float)

for fold, (trn_idx, val_idx) in enumerate(gkf.split(train, groups=groups), 1):
    tf = time.time()
    trn_df = train.iloc[trn_idx].copy(); val_df = train.iloc[val_idx].copy()

    # Fold-local baseline tables
    base_trn = prepare_baseline_table(trn_df)
    base_val = prepare_baseline_table(val_df)

    # s_hat maps (TRAIN fold only)
    get_s_hat_map = fit_s_hat_fold(trn_df, base_trn)
    s_map_trn = get_s_hat_map(base_trn)
    s_map_val = get_s_hat_map(base_val)
    base_test = grid_te[['Patient']].drop_duplicates().merge(
        test_base.drop_duplicates('Patient'), on='Patient', how='left')
    s_map_test = get_s_hat_map(base_test[['Patient','Base_Week','Base_FVC','Percent_at_base','Age','Sex','SmokingStatus']])

    # Build future-only train/val with s_hat
    trn = trn_df.merge(base_trn[['Patient','Base_Week','Base_FVC','Percent_at_base','Age','Sex','SmokingStatus']], on='Patient', how='left')
    val = val_df.merge(base_val[['Patient','Base_Week','Base_FVC','Percent_at_base','Age','Sex','SmokingStatus']], on='Patient', how='left')
    trn['dist'] = (trn['Weeks'] - trn['Base_Week']).astype(float); trn = trn[trn['dist'] >= 0].copy()
    val['dist'] = (val['Weeks'] - val['Base_Week']).astype(float); val = val[val['dist'] >= 0].copy()
    trn['s_hat'] = trn['Patient'].map(s_map_trn).astype(float).fillna(0.0)
    val['s_hat'] = val['Patient'].map(s_map_val).astype(float).fillna(0.0)

    # Features (fit transforms on TRAIN fold)
    trn_feat, feat_cols, ecdf_bf, ecdf_pc, cats = build_q_features(trn[['Patient','Weeks']].copy(), base_trn, fit=True)
    trn_feat['s_hat'] = trn_feat['Patient'].map(s_map_trn).astype(float).fillna(0.0)

    # IMPORTANT: use base_val for VAL features to get correct Base_Week for val patients
    val_feat, _, _, _, _ = build_q_features(val[['Patient','Weeks']].copy(), base_val, ecdf_bf, ecdf_pc, cats, fit=False)
    val_feat['s_hat'] = val_feat['Patient'].map(s_map_val).astype(float).fillna(0.0)

    # Align features with labels strictly by (Patient, Weeks) keys to avoid length mismatches
    trn_feat_aligned = trn_feat.merge(trn[['Patient','Weeks','FVC']], on=['Patient','Weeks'], how='inner')
    val_feat_aligned = val_feat.merge(val[['Patient','Weeks','FVC']], on=['Patient','Weeks'], how='inner')

    # Targets: delta = FVC - Base_FVC
    y_tr_delta = (trn_feat_aligned['FVC'].astype(float).values - trn_feat_aligned['Base_FVC'].astype(float).values)
    y_va_delta = (val_feat_aligned['FVC'].astype(float).values - val_feat_aligned['Base_FVC'].astype(float).values)

    X_tr = trn_feat_aligned[feat_cols].values.astype(float)
    X_va = val_feat_aligned[feat_cols].values.astype(float)

    # Skip fold if no data (safety against degenerate future-only splits)
    if X_tr.shape[0] == 0 or X_va.shape[0] == 0:
        print(f'[Quantile-Δ Fold {fold}] skipped (X_tr={X_tr.shape[0]}, X_va={X_va.shape[0]})', flush=True)
        del trn_df, val_df, trn, val, trn_feat, val_feat, trn_feat_aligned, val_feat_aligned
        gc.collect()
        continue

    # Fold accumulators
    val_pred_delta_sum = np.zeros((X_va.shape[0], len(alphas)), dtype=float)
    test_pred_delta_sum = np.zeros((ss.shape[0], len(alphas)), dtype=float)

    # Test features under TRAIN-fold transforms; keep keys and map to ss indices
    te_feat, _, _, _, _ = build_q_features(grid_te[['Patient','Weeks']].copy(), test_base, ecdf_bf, ecdf_pc, cats, fit=False)
    te_feat['s_hat'] = te_feat['Patient'].map(s_map_test).astype(float).fillna(0.0).values
    X_te = te_feat[feat_cols].values.astype(float)
    te_keys = te_feat[['Patient','Weeks']].copy()
    te_keys = te_keys.merge(grid_te_idx, on=['Patient','Weeks'], how='left')
    te_idx = te_keys['ss_idx'].values.astype(int)

    for p_i, p in enumerate(params_list):
        for qi, a in enumerate(alphas):
            mdl = LGBMRegressor(**p, alpha=a, random_state=42+fold+p_i*17)
            mdl.fit(X_tr, y_tr_delta,
                    eval_set=[(X_va, y_va_delta)],
                    eval_metric='quantile',
                    callbacks=[lgb.early_stopping(200, verbose=False)])
            val_pred_delta_sum[:, qi] += mdl.predict(X_va, num_iteration=mdl.best_iteration_)
            # add to the correct positions only for future rows
            pred_te = mdl.predict(X_te, num_iteration=mdl.best_iteration_)
            test_pred_delta_sum[te_idx, qi] += pred_te / len(params_list)

    # Enforce quantile order on VAL deltas
    val_pred_delta = np.sort(val_pred_delta_sum / len(params_list), axis=1)
    # Write OOF delta quantiles by (Patient, Weeks)
    val_keys = val_feat_aligned[['Patient','Weeks']].reset_index(drop=True)
    oof_block = pd.DataFrame(val_pred_delta, columns=oof_cols)
    oof_block = pd.concat([val_keys.reset_index(drop=True), oof_block], axis=1)
    oof_df = oof_df.merge(oof_block, on=['Patient','Weeks'], how='left', suffixes=('','_new'))
    for c in oof_cols:
        oof_df[c] = oof_df[c].fillna(oof_df[c + '_new'])
        oof_df.drop(columns=[c + '_new'], inplace=True)

    # Accumulate TEST deltas (fold-avg)
    test_preds_delta += (test_pred_delta_sum / N_SPLITS)

    print(f'[Quantile-Δ Fold {fold}] trn={trn_feat_aligned.shape[0]} val={val_feat_aligned.shape[0]} elapsed={time.time()-tf:.2f}s', flush=True)
    del trn_df, val_df, trn, val, trn_feat, val_feat, trn_feat_aligned, val_feat_aligned, X_tr, X_va, X_te, te_feat, te_idx, te_keys
    gc.collect()

# Test: enforce quantile order on deltas (ascending across quantiles)
test_preds_delta = np.sort(test_preds_delta, axis=1)
q10_d, q20_d, q50_d, q80_d, q90_d = test_preds_delta.T

# Reconstruct test FVC; guardrails
parts = ss['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid = pd.DataFrame({'Patient': parts[0], 'Weeks': parts[1].astype(int)})
grid = grid.merge(test_base, on='Patient', how='left')
dist_te = (grid['Weeks'] - grid['Base_Week']).astype(float)
base_fvc_te = grid['Base_FVC'].values.astype(float)

fvc_point = base_fvc_te + q50_d
fvc_point = np.where(dist_te == 0.0, base_fvc_te, fvc_point)
fvc_point = np.clip(fvc_point, 500, 6000)
def enforce_non_increasing(g):
    g = g.sort_values('Weeks').copy()
    g['FVC'] = np.minimum.accumulate(g['FVC'].values[::-1])[::-1]
    return g
df_out = pd.DataFrame({'Patient': grid['Patient'], 'Weeks': grid['Weeks'].astype(int), 'FVC': fvc_point})
df_out = df_out.groupby('Patient', group_keys=False).apply(enforce_non_increasing)
fvc_final = df_out['FVC'].values.astype(float)

# Sigma from quantile band with banker floor; tune c on OOF in {1.6, 1.8, 2.0}
train_base = prepare_baseline_table(train)
oof_fut = oof_df.dropna(subset=['q50_delta_oof']).copy()
oof_fut = oof_fut.merge(train_base[['Patient','Base_Week','Base_FVC']], on='Patient', how='left')
oof_fut['dist'] = (oof_fut['Weeks'] - oof_fut['Base_Week']).astype(float)
oof_fut = oof_fut[oof_fut['dist'] >= 0].copy()
oof_fut['pred_fvc'] = oof_fut['Base_FVC'].astype(float) + oof_fut['q50_delta_oof'].astype(float)

band = (oof_fut['q80_delta_oof'] - oof_fut['q20_delta_oof']).abs().astype(float).values
abs_dist_oof = oof_fut['dist'].abs().astype(float).values
sigma_banker_oof = np.maximum(240.0 + 3.0 * abs_dist_oof, 70.0)
sigma_banker_oof = np.where(abs_dist_oof > 20.0, np.maximum(sigma_banker_oof, 100.0), sigma_banker_oof)

best_c, best_ll = 1.8, -1e9
for c in [1.6, 1.8, 2.0]:
    sigma_c = np.maximum(band / c, sigma_banker_oof)
    ll = laplace_ll(oof_fut['FVC'].values.astype(float), oof_fut['pred_fvc'].values.astype(float), sigma_c)
    if ll > best_ll:
        best_ll, best_c = ll, c
print(f'[Quantile-Δ] Tuned sigma c={best_c:.1f} on OOF (LL={best_ll:.5f})', flush=True)

# Test sigma with banker floor + per-patient monotone
abs_dist_te = np.abs(dist_te).astype(float)
sigma_from_band = (q80_d - q20_d) / best_c
sigma_banker = np.maximum(240.0 + 3.0 * abs_dist_te, 70.0)
sigma_banker = np.where(abs_dist_te > 20.0, np.maximum(sigma_banker, 100.0), sigma_banker)
sigma = np.maximum(sigma_from_band, sigma_banker)

df_sig = pd.DataFrame({'Patient': grid['Patient'], 'Weeks': grid['Weeks'].astype(int), 'dist': abs_dist_te, 'Sigma': sigma.astype(float)})
def enforce_sigma_monotone(g):
    g = g.sort_values('dist').copy()
    g['Sigma'] = np.maximum.accumulate(g['Sigma'].values)
    return g
df_sig = df_sig.groupby('Patient', group_keys=False).apply(enforce_sigma_monotone)
sigma_final = df_sig['Sigma'].values.astype(float)

# Save artifacts
sub = pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': fvc_final, 'Confidence': sigma_final})
sub.to_csv('submission_quantile_lgbm_v2.csv', index=False)

oof_save = oof_fut[['Patient','Weeks','FVC','Base_FVC','q10_delta_oof','q20_delta_oof','q50_delta_oof','q80_delta_oof','q90_delta_oof']].copy()
oof_save.to_csv('oof_quantile_lgbm_v2.csv', index=False)

# Also save test delta quantiles aligned to ss for downstream sigma band blending
pd.DataFrame({
    'Patient_Week': ss['Patient_Week'],
    'q10_d': q10_d,
    'q20_d': q20_d,
    'q50_d': q50_d,
    'q80_d': q80_d,
    'q90_d': q90_d
}).to_csv('pred_quantile_deltas_v2.csv', index=False)

print(f'Saved submission_quantile_lgbm_v2.csv, oof_quantile_lgbm_v2.csv, and pred_quantile_deltas_v2.csv. Elapsed {time.time()-t0:.1f}s')

[Quantile-Δ Fold 1] trn=1124 val=284 elapsed=3.00s


[Quantile-Δ Fold 2] trn=1127 val=281 elapsed=3.04s


[Quantile-Δ Fold 3] trn=1129 val=279 elapsed=3.25s


[Quantile-Δ Fold 4] trn=1129 val=279 elapsed=3.67s


[Quantile-Δ Fold 5] trn=1123 val=285 elapsed=2.66s


[Quantile-Δ] Tuned sigma c=1.8 on OOF (LL=-6.14165)


Saved submission_quantile_lgbm_v2.csv, oof_quantile_lgbm_v2.csv, and pred_quantile_deltas_v2.csv. Elapsed 16.2s


  df_out = df_out.groupby('Patient', group_keys=False).apply(enforce_non_increasing)
  df_sig = df_sig.groupby('Patient', group_keys=False).apply(enforce_sigma_monotone)


In [27]:
# Set submission.csv to tuned slope-only + anchor banker (w_anchor=0.50)
import pandas as pd
src = 'submission_slope_anchor_banker_wA50.csv'
sub = pd.read_csv(src)
ss = pd.read_csv('sample_submission.csv')
assert sub.shape[0] == ss.shape[0], 'Row mismatch vs sample_submission'
assert set(sub['Patient_Week'].astype(str)) == set(ss['Patient_Week'].astype(str)), 'Patient_Week mismatch'
sub.to_csv('submission.csv', index=False)
print(f'submission.csv overwritten with {src}')

submission.csv overwritten with submission_slope_anchor_banker_wA50.csv


In [31]:
# Iter 5: MixedLM (LME) banker submission per expert Blueprint C
import numpy as np, pandas as pd, warnings, time, gc
import statsmodels.formula.api as smf

# 1) Build train_lme with baseline merge and Weeks_Passed >= 0; standardize Age and Percent_at_base
def build_baseline(df):
    base = (df.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False).first())
    base = base[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].rename(
        columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
    return base

t0 = time.time()
train_base = build_baseline(train)
# Drop overlapping demographic columns from left to avoid suffixes
train_left = train.drop(columns=['Age','Sex','SmokingStatus'], errors='ignore')
train_lme = train_left.merge(train_base[['Patient','Base_Week','Base_FVC','Percent_at_base','Age','Sex','SmokingStatus']], on='Patient', how='left')
train_lme['Weeks_Passed'] = (train_lme['Weeks'] - train_lme['Base_Week']).astype(float)
train_lme = train_lme[train_lme['Weeks_Passed'] >= 0].copy()
train_lme['Weeks_Passed'] = train_lme['Weeks_Passed'] / 10.0  # stabilize
age_mean, age_std = float(train_lme['Age'].astype(float).mean()), float(train_lme['Age'].astype(float).std() + 1e-9)
pc_mean = float(train_lme['Percent_at_base'].astype(float).mean())
pc_std  = float(train_lme['Percent_at_base'].astype(float).std() + 1e-9)
train_lme['Age_std'] = (train_lme['Age'].astype(float) - age_mean) / age_std
train_lme['Percent_at_base_std'] = (train_lme['Percent_at_base'].astype(float) - pc_mean) / pc_std

# 2) Fit MixedLM: add quadratic time term I(Weeks_Passed**2) to fixed effects; random effects unchanged
formula = 'FVC ~ 1 + Weeks_Passed + I(Weeks_Passed**2) + Age_std + C(Sex) + C(SmokingStatus) + Percent_at_base_std + Age_std:Percent_at_base_std'
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    md = smf.mixedlm(formula, data=train_lme, groups=train_lme['Patient'], re_formula='~Weeks_Passed')
    mdf = md.fit(method='lbfgs', reml=True, maxiter=500, disp=False)
print('MixedLM fitted. Converged:', mdf.converged, 'nobs:', mdf.nobs)

# 3) Build strict test grid and predict fixed-effects only (new patients => random effects ~ 0 by default)
ss = pd.read_csv('sample_submission.csv')
grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]; grid['Weeks'] = parts[1].astype(int)
test_base = test[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].rename(
    columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_base, on='Patient', how='left')
assert grid['Base_Week'].notna().all(), 'Base_Week missing in test grid'
grid['Weeks_Passed'] = (grid['Weeks'] - grid['Base_Week']).astype(float) / 10.0
grid['Age_std'] = (grid['Age'].astype(float) - age_mean) / age_std
grid['Percent_at_base_std'] = (grid['Percent_at_base'].astype(float) - pc_mean) / pc_std

fvc_lme = mdf.predict(grid)

# 4) Guardrails: pin dist==0 to Base_FVC, per-patient non-increasing, clip [500,6000]
dist = (grid['Weeks'].values - grid['Base_Week'].values).astype(float)
fvc_clip = np.clip(fvc_lme.values.astype(float), 500, 6000)
fvc_clip = np.where(dist == 0.0, grid['Base_FVC'].values.astype(float), fvc_clip)
df_out = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'FVC': fvc_clip})
def enforce_non_increasing(g):
    g = g.sort_values('Weeks').copy()
    g['FVC'] = np.minimum.accumulate(g['FVC'].values[::-1])[::-1]
    return g
df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_non_increasing)
fvc_final = df_out['FVC'].values.astype(float)

# 5) Banker sigma with floors and per-patient monotonicity in |dist|
abs_dist = np.abs(dist).astype(float)
sigma = np.maximum(240.0 + 3.0 * abs_dist, 70.0)
sigma = np.where(abs_dist > 20.0, np.maximum(sigma, 100.0), sigma)
df_sig = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'dist': abs_dist, 'Sigma': sigma.astype(float)})
def enforce_sigma_monotone(g):
    g = g.sort_values('dist').copy()
    g['Sigma'] = np.maximum.accumulate(g['Sigma'].values)
    return g
df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)
sigma_final = df_sig['Sigma'].values.astype(float)

# 6) Save submission
sub_lme = pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': fvc_final, 'Confidence': sigma_final})
sub_lme.to_csv('submission_lme_banker.csv', index=False)
sub_lme.to_csv('submission.csv', index=False)
print('Saved submission_lme_banker.csv and set submission.csv (MixedLM banker). Elapsed {:.2f}s'.format(time.time()-t0))

MixedLM fitted. Converged: True nobs: 1394
Saved submission_lme_banker.csv and set submission.csv (MixedLM banker). Elapsed 0.41s


  df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_non_increasing)
  df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)


In [39]:
# Final 3-model blend: 0.30 slopeA60 + 0.30 LME + 0.40 Quantile; banker sigma with guardrails
import numpy as np, pandas as pd

ss = pd.read_csv('sample_submission.csv')

def load_fvc(path):
    df = pd.read_csv(path).set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()
    return df['FVC'].astype(float).values

# Load components
fvc_slope = load_fvc('submission_slope_anchor_banker_wA60.csv')
fvc_lme   = load_fvc('submission_lme_banker.csv')
fvc_q     = load_fvc('submission_quantile_lgbm.csv')

# Blend weights (per expert: 0.30 slope + 0.30 LME + 0.40 Quantile)
w_s, w_l, w_q = 0.30, 0.30, 0.40
fvc_blend = w_s*fvc_slope + w_l*fvc_lme + w_q*fvc_q

# Build grid to apply guardrails and banker sigma
grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]; grid['Weeks'] = parts[1].astype(int)
test = pd.read_csv('test.csv')
test_bl = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_bl, on='Patient', how='left')
abs_dist = (grid['Weeks'] - grid['Base_Week']).abs().astype(float)

# Guardrails on FVC
fvc_blend = np.where(abs_dist==0.0, grid['Base_FVC'].values.astype(float), fvc_blend)
fvc_blend = np.clip(fvc_blend, 500, 6000)
df_out = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'FVC': fvc_blend})
def enforce_non_increasing(g):
    g = g.sort_values('Weeks').copy()
    g['FVC'] = np.minimum.accumulate(g['FVC'].values[::-1])[::-1]
    return g
df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_non_increasing)
fvc_final = df_out['FVC'].values.astype(float)

# Banker sigma + monotone in |dist|
sigma = np.maximum(240.0 + 3.0*abs_dist, 70.0)
sigma = np.where(abs_dist>20.0, np.maximum(sigma, 100.0), sigma)
df_sig = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'dist': abs_dist.values, 'Sigma': sigma.astype(float)})
def enforce_sigma_monotone(g):
    g = g.sort_values('dist').copy()
    g['Sigma'] = np.maximum.accumulate(g['Sigma'].values)
    return g
df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)
sigma_final = df_sig['Sigma'].values.astype(float)

sub = pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': fvc_final, 'Confidence': sigma_final})
sub.to_csv('submission_final_blend.csv', index=False)
sub.to_csv('submission.csv', index=False)
print('Saved submission_final_blend.csv and set submission.csv (FVC=0.30 slopeA60 + 0.30 LME + 0.40 Quantile; banker sigma).')

Saved submission_final_blend.csv and set submission.csv (FVC=0.30 slopeA60 + 0.30 LME + 0.40 Quantile; banker sigma).


  df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_non_increasing)
  df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)


In [54]:
# OOF-driven, distance-aware 3-model blend (SlopeA60, LME, Quantile q50+anchor) with banker sigma
import numpy as np, pandas as pd, gc, warnings, time
from sklearn.model_selection import GroupKFold
from sklearn.preprocessing import StandardScaler
import statsmodels.formula.api as smf

t0 = time.time()
ss = pd.read_csv('sample_submission.csv')

def laplace_ll_np(y_true, y_pred, sigma):
    y_true = y_true.astype(float); y_pred = y_pred.astype(float); sigma = sigma.astype(float)
    delta = np.minimum(np.abs(y_true - y_pred), 1000.0)
    sigma = np.maximum(sigma, 70.0)
    return float(np.mean(-delta / sigma - np.log(sigma)))

# 1) Build Slope+Anchor OOF (GroupKFold, future-only); anchor weight applied later
def slope_anchor_oof(train_df, n_splits=5, seed=42):
    from sklearn.linear_model import Ridge
    from sklearn.neighbors import KNeighborsRegressor
    gkf = GroupKFold(n_splits=n_splits)
    groups = train_df['Patient'].values
    y_list, d_list, fvc_slope_list, fvc_anchor_list = [], [], [], []
    pid_list, week_list = [], []
    for fold, (trn_idx, val_idx) in enumerate(gkf.split(train_df, groups=groups), 1):
        trn = train_df.iloc[trn_idx].copy(); val = train_df.iloc[val_idx].copy()
        base_trn = prepare_baseline_table(trn); base_val = prepare_baseline_table(val)
        # Fit slope-head on TRAIN baseline
        slopes_tr = compute_patient_slopes(trn)
        lab = pd.DataFrame({'Patient': list(slopes_tr.keys()), 's_label': list(slopes_tr.values())})
        bf_trn, feat_cols, ecdf_bf, ecdf_pc, cats = build_slope_features(base_trn.merge(lab, on='Patient', how='left'), fit=True)
        bf_val, _, _, _, _ = build_slope_features(base_val, ecdf_bf, ecdf_pc, cats, fit=False)
        scaler = StandardScaler(with_mean=True, with_std=True)
        X_tr = bf_trn[feat_cols].values.astype(float); y_tr = bf_trn['s_label'].fillna(0.0).values.astype(float)
        X_trs = scaler.fit_transform(X_tr); X_vs = scaler.transform(bf_val[feat_cols].values.astype(float))
        ridge = Ridge(alpha=1.0, random_state=seed).fit(X_trs, y_tr)
        knn   = KNeighborsRegressor(n_neighbors=9, weights='distance').fit(X_trs, y_tr)
        s_r = ridge.predict(X_vs); s_k = knn.predict(X_vs)
        q_lo, q_hi = np.percentile(y_tr, [5,95])
        s_bl = np.clip(0.80*s_r + 0.20*s_k, q_lo, q_hi)
        s_map = dict(zip(base_val['Patient'].values, s_bl))
        valm = val.merge(base_val[['Patient','Base_Week','Base_FVC']], on='Patient', how='left')
        mask = (valm['Weeks'] >= valm['Base_Week'])
        dist = (valm['Weeks'] - valm['Base_Week']).astype(float)
        fvc_slope = (valm['Base_FVC'].values + valm['Patient'].map(s_map).fillna(0.0).values * dist).astype(float)
        gs_fold = robust_global_slope(compute_patient_slopes(trn))
        fvc_anchor = (valm['Base_FVC'].values + gs_fold * dist).astype(float)
        # Append masked rows along with keys
        y_list.append(valm.loc[mask, 'FVC'].values.astype(float))
        d_list.append(dist.values[mask].astype(float))
        fvc_slope_list.append(fvc_slope[mask].astype(float))
        fvc_anchor_list.append(fvc_anchor[mask].astype(float))
        pid_list.append(valm.loc[mask, 'Patient'].astype(str).values)
        week_list.append(valm.loc[mask, 'Weeks'].astype(int).values)
        del trn, val, base_trn, base_val, bf_trn, bf_val, X_tr, X_trs, X_vs
        gc.collect()
    return (np.concatenate(y_list), np.concatenate(d_list),
            np.concatenate(fvc_slope_list), np.concatenate(fvc_anchor_list),
            np.concatenate(pid_list), np.concatenate(week_list))

print('Building Slope+Anchor OOF...', flush=True)
y_oof_s, dist_oof_s, fvc_slope_oof, fvc_anchor_oof, pid_oof_s, weeks_oof_s = slope_anchor_oof(train, n_splits=5, seed=42)

# 2) Build LME OOF (GroupKFold, future-only hygiene)
def lme_oof(train_df, n_splits=5):
    gkf = GroupKFold(n_splits=n_splits)
    groups = train_df['Patient'].values
    y_list, d_list, fvc_list = [], [], []
    pid_list, week_list = [], []
    for fold, (trn_idx, val_idx) in enumerate(gkf.split(train_df, groups=groups), 1):
        trn = train_df.iloc[trn_idx].copy(); val = train_df.iloc[val_idx].copy()
        base_trn = prepare_baseline_table(trn); base_val = prepare_baseline_table(val)
        trn_l = trn.drop(columns=['Age','Sex','SmokingStatus'], errors='ignore') \
                   .merge(base_trn[['Patient','Base_Week','Base_FVC','Percent_at_base','Age','Sex','SmokingStatus']], on='Patient', how='left')
        trn_l['Weeks_Passed'] = (trn_l['Weeks'] - trn_l['Base_Week']).astype(float)/10.0
        trn_l = trn_l[trn_l['Weeks_Passed'] >= 0].copy()
        age_mean, age_std = trn_l['Age'].mean(), trn_l['Age'].std()+1e-9
        pc_mean, pc_std   = trn_l['Percent_at_base'].mean(), trn_l['Percent_at_base'].std()+1e-9
        trn_l['Age_std'] = (trn_l['Age'] - age_mean)/age_std
        trn_l['Percent_at_base_std'] = (trn_l['Percent_at_base'] - pc_mean)/pc_std
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')
            md = smf.mixedlm('FVC ~ 1 + Weeks_Passed + I(Weeks_Passed**2) + Age_std + C(Sex) + C(SmokingStatus) + Percent_at_base_std + Age_std:Percent_at_base_std',
                              data=trn_l, groups=trn_l['Patient'], re_formula='~Weeks_Passed')
            mdf = md.fit(method='lbfgs', reml=True, maxiter=500, disp=False)
        # Build VAL grid
        val_left = val.drop(columns=['Age','Sex','SmokingStatus'], errors='ignore')
        val_l = val_left.merge(base_val[['Patient','Base_Week','Base_FVC','Percent_at_base','Age','Sex','SmokingStatus']], on='Patient', how='left')
        mask = (val_l['Weeks'] >= val_l['Base_Week'])
        dist = (val_l['Weeks'] - val_l['Base_Week']).astype(float)
        val_l['Weeks_Passed'] = dist/10.0
        val_l['Age_std'] = (val_l['Age'] - age_mean)/age_std
        val_l['Percent_at_base_std'] = (val_l['Percent_at_base'] - pc_mean)/pc_std
        fvc_pred = mdf.predict(val_l).astype(float).values
        y_list.append(val_l.loc[mask, 'FVC'].values.astype(float))
        d_list.append(dist.values[mask].astype(float))
        fvc_list.append(fvc_pred[mask].astype(float))
        pid_list.append(val_l.loc[mask, 'Patient'].astype(str).values)
        week_list.append(val_l.loc[mask, 'Weeks'].astype(int).values)
        del trn, val, base_trn, base_val, trn_l, val_l
        gc.collect()
    return np.concatenate(y_list), np.concatenate(d_list), np.concatenate(fvc_list), np.concatenate(pid_list), np.concatenate(week_list)

print('Building LME OOF...', flush=True)
y_oof_l, dist_oof_l, fvc_lme_oof, pid_oof_l, weeks_oof_l = lme_oof(train, n_splits=5)

# 3) Align Quantile OOF (q50 delta) and build quantile-based point preds with per-fold anchor (no leak)
print('Loading Quantile OOF...', flush=True)
oof_q = pd.read_csv('oof_quantile_lgbm_v2.csv')
train_base = prepare_baseline_table(train)
oof_q = oof_q.merge(train_base[['Patient','Base_Week','Base_FVC']], on='Patient', how='left', suffixes=('', '_base'))
# Resolve possible suffixes from merge if oof_q already contains Base_FVC from saved file
if 'Base_FVC_base' in oof_q.columns:
    if 'Base_FVC' not in oof_q.columns:
        oof_q['Base_FVC'] = oof_q['Base_FVC_base']
    else:
        oof_q['Base_FVC'] = oof_q['Base_FVC'].fillna(oof_q['Base_FVC_base'])
    oof_q.drop(columns=['Base_FVC_base'], inplace=True)
if 'Base_Week_base' in oof_q.columns and 'Base_Week' not in oof_q.columns:
    oof_q['Base_Week'] = oof_q['Base_Week_base']
    oof_q.drop(columns=['Base_Week_base'], inplace=True)
oof_q['dist'] = (oof_q['Weeks'] - oof_q['Base_Week']).astype(float)
oof_q = oof_q[oof_q['dist'] >= 0].dropna(subset=['q50_delta_oof']).copy()

# Build fold membership and per-fold global slopes for anchor
N_SPLITS = 5
gkf = GroupKFold(n_splits=N_SPLITS)
groups = train['Patient'].values
patient_to_fold = {}
fold_to_gs = {}
for fold, (trn_idx, val_idx) in enumerate(gkf.split(train, groups=groups), 1):
    trn_df = train.iloc[trn_idx]
    gs_fold = robust_global_slope(compute_patient_slopes(trn_df))
    fold_to_gs[fold] = gs_fold
    val_pats = train.iloc[val_idx]['Patient'].astype(str).unique()
    for p in val_pats:
        patient_to_fold[p] = fold

oof_q['fold'] = oof_q['Patient'].astype(str).map(patient_to_fold).astype(int)
oof_q['gs_fold'] = oof_q['fold'].map(fold_to_gs).astype(float)
fvc_anchor_fold = oof_q['Base_FVC'].astype(float).values + oof_q['gs_fold'].values * oof_q['dist'].astype(float).values
# Reconstruct quantile point from delta
fvc_q_point = oof_q['Base_FVC'].astype(float).values + oof_q['q50_delta_oof'].astype(float).values
fvc_q_oof = 0.70 * fvc_q_point + 0.30 * fvc_anchor_fold
y_true_oof_q = oof_q['FVC'].astype(float).values

# 4) Key-based alignment of OOF sources by (Patient, Weeks) using inner joins
df_s = pd.DataFrame({
    'Patient': pid_oof_s.astype(str),
    'Weeks': weeks_oof_s.astype(int),
    'y_true': y_oof_s.astype(float),
    'dist': dist_oof_s.astype(float),
    'fvc_slope': fvc_slope_oof.astype(float),
    'fvc_anchor': fvc_anchor_oof.astype(float)
})
df_l = pd.DataFrame({
    'Patient': pid_oof_l.astype(str),
    'Weeks': weeks_oof_l.astype(int),
    'fvc_lme': fvc_lme_oof.astype(float)
})
df_q = oof_q[['Patient','Weeks']].astype({'Patient':'str','Weeks':'int'}).copy()
df_q['y_true_q'] = y_true_oof_q
df_q['dist_q'] = oof_q['dist'].astype(float).values
df_q['fvc_q'] = fvc_q_oof.astype(float)

df_merged = df_s.merge(df_l, on=['Patient','Weeks'], how='inner').merge(df_q, on=['Patient','Weeks'], how='inner')
# Sanity: y_true should match
y_true_aligned = df_merged['y_true'].values.astype(float)
dist_aligned = df_merged['dist'].values.astype(float)
fvc_s_aligned = df_merged['fvc_slope'].values.astype(float)
fvc_a_aligned = df_merged['fvc_anchor'].values.astype(float)
fvc_l_aligned = df_merged['fvc_lme'].values.astype(float)
fvc_q_aligned = df_merged['fvc_q'].values.astype(float)

# 5) Weight search on distance bins using banker sigma
sigma_oof = np.maximum(240.0 + 3.0 * np.abs(dist_aligned), 70.0)
sigma_oof = np.where(np.abs(dist_aligned) > 20.0, np.maximum(sigma_oof, 100.0), sigma_oof)

def grid_best(y, s, l, q, sigma, w_grid=np.arange(0.0, 1.01, 0.05)):
    best_ll, best_w = -1e9, (0.3, 0.3, 0.4)
    for ws in w_grid:
        for wl in w_grid:
            wq = 1.0 - ws - wl
            if wq < 0 or wq > 1: continue
            pred = ws*s + wl*l + wq*q
            ll = laplace_ll_np(y, pred, sigma)
            if ll > best_ll:
                best_ll, best_w = ll, (ws, wl, wq)
    return best_ll, best_w

bins = [(0.0, 5.0), (5.0, 15.0), (15.0, 1e9)]
best_weights = {}
for lo, hi in bins:
    m = (np.abs(dist_aligned) > lo) & (np.abs(dist_aligned) <= hi)
    if not m.any():
        best_weights[(lo,hi)] = (0.30, 0.30, 0.40)
        print(f'Bin {lo}-{hi}: empty; default 0.30/0.30/0.40')
        continue
    ll, w = grid_best(y_true_aligned[m], fvc_s_aligned[m], fvc_l_aligned[m], fvc_q_aligned[m], sigma_oof[m])
    best_weights[(lo, hi)] = w
    print(f'Bin {lo}-{hi}: best weights Slope/LME/Quantile = {w[0]:.2f}/{w[1]:.2f}/{w[2]:.2f}, OOF LL={ll:.5f}')

# 6) Apply weights to test submissions
def load_fvc(path):
    return pd.read_csv(path).set_index('Patient_Week').loc[ss['Patient_Week'],'FVC'].astype(float).values

fvc_s_test = load_fvc('submission_slope_anchor_banker_wA60.csv')
fvc_l_test = load_fvc('submission_lme_banker.csv')
fvc_q_test = load_fvc('submission_quantile_lgbm_v2.csv')

grid_te = ss.copy()
parts = grid_te['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid_te['Patient'] = parts[0]; grid_te['Weeks'] = parts[1].astype(int)
test_base = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid_te = grid_te.merge(test_base, on='Patient', how='left')
dist_te = (grid_te['Weeks'] - grid_te['Base_Week']).astype(float); abs_dist_te = np.abs(dist_te).astype(float)

fvc_blend = np.zeros_like(fvc_s_test)
for (lo, hi), w in best_weights.items():
    m = (abs_dist_te > lo) & (abs_dist_te <= hi)
    fvc_blend[m] = w[0]*fvc_s_test[m] + w[1]*fvc_l_test[m] + w[2]*fvc_q_test[m]

# Guardrails
fvc_blend = np.where(abs_dist_te==0.0, grid_te['Base_FVC'].values.astype(float), fvc_blend)
fvc_blend = np.clip(fvc_blend, 500, 6000)
df_out = pd.DataFrame({'Patient': grid_te['Patient'], 'Weeks': grid_te['Weeks'], 'FVC': fvc_blend})
def enforce_non_increasing(g):
    g = g.sort_values('Weeks').copy()
    g['FVC'] = np.minimum.accumulate(g['FVC'].values[::-1])[::-1]
    return g
df_out = df_out.groupby('Patient', group_keys=False).apply(enforce_non_increasing)
fvc_final = df_out['FVC'].values.astype(float)

# Sigma: banker only (per expert); optional hybrid with Quantile v2
HYBRID_SIGMA = False
sigma_banker = np.maximum(240.0 + 3.0 * abs_dist_te, 70.0)
sigma_banker = np.where(abs_dist_te > 20.0, np.maximum(sigma_banker, 100.0), sigma_banker)
if HYBRID_SIGMA:
    sig_q = pd.read_csv('submission_quantile_lgbm_v2.csv').set_index('Patient_Week').loc[ss['Patient_Week'],'Confidence'].astype(float).values
    sigma = np.maximum(sig_q, sigma_banker)
else:
    sigma = sigma_banker
df_sig = pd.DataFrame({'Patient': grid_te['Patient'], 'dist': abs_dist_te, 'Sigma': sigma})
df_sig = df_sig.groupby('Patient', group_keys=False).apply(lambda g: g.sort_values('dist').assign(Sigma=np.maximum.accumulate(g['Sigma'])))
sigma_final = df_sig['Sigma'].values.astype(float)

sub = pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': fvc_final, 'Confidence': sigma_final})
sub.to_csv('submission_distance_blend.csv', index=False)
sub.to_csv('submission.csv', index=False)
print('Saved submission_distance_blend.csv and set submission.csv. Elapsed {:.1f}s'.format(time.time()-t0))

Building Slope+Anchor OOF...


Building LME OOF...


Loading Quantile OOF...


Bin 0.0-5.0: best weights Slope/LME/Quantile = 0.05/0.00/0.95, OOF LL=-6.04045
Bin 5.0-15.0: best weights Slope/LME/Quantile = 0.00/0.00/1.00, OOF LL=-6.12583
Bin 15.0-1000000000.0: best weights Slope/LME/Quantile = 0.05/0.05/0.90, OOF LL=-6.39437
Saved submission_distance_blend.csv and set submission.csv. Elapsed 3.6s


  df_out = df_out.groupby('Patient', group_keys=False).apply(enforce_non_increasing)
  df_sig = df_sig.groupby('Patient', group_keys=False).apply(lambda g: g.sort_values('dist').assign(Sigma=np.maximum.accumulate(g['Sigma'])))


In [52]:
# Build secondary submission: hybrid sigma = max(distance-blend banker, Quantile v2), with per-patient monotone; do NOT overwrite submission.csv
import numpy as np, pandas as pd

ss = pd.read_csv('sample_submission.csv')
sub_blend = pd.read_csv('submission_distance_blend.csv').set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()
sub_qv2 = pd.read_csv('submission_quantile_lgbm_v2.csv').set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()

# Rebuild grid to get |dist| for monotone ordering
grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]
grid['Weeks'] = parts[1].astype(int)
test = pd.read_csv('test.csv')
test_bl = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_bl, on='Patient', how='left')
abs_dist = (grid['Weeks'] - grid['Base_Week']).abs().astype(float).values

# Hybrid sigma
sigma_blend = sub_blend['Confidence'].astype(float).values
sigma_q = sub_qv2['Confidence'].astype(float).values
sigma = np.maximum(sigma_blend, sigma_q)

# Per-patient monotone in |dist|
df_sig = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'dist': abs_dist, 'Sigma': sigma})
def enforce_sigma_monotone(g):
    g = g.sort_values('dist').copy()
    g['Sigma'] = np.maximum.accumulate(g['Sigma'].values)
    return g
df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)
sigma_final = df_sig['Sigma'].values.astype(float)

# Save secondary submission (keep FVC from distance blend)
sub_h = pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': sub_blend['FVC'].astype(float).values, 'Confidence': sigma_final})
sub_h.to_csv('submission_distance_blend_hybrid.csv', index=False)
print('Saved submission_distance_blend_hybrid.csv (sigma = max(distance-blend banker, Quantile v2), monotone per patient).')

Saved submission_distance_blend_hybrid.csv (sigma = max(distance-blend banker, Quantile v2), monotone per patient).


  df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)


In [56]:
# OOF diagnostics and constrained distance-aware blends: current vs regularized vs equal
import numpy as np, pandas as pd, warnings, gc, time
from sklearn.model_selection import GroupKFold
from sklearn.preprocessing import StandardScaler
import statsmodels.formula.api as smf

t0 = time.time()
ss = pd.read_csv('sample_submission.csv')

def laplace_ll_np(y_true, y_pred, sigma):
    y_true = y_true.astype(float); y_pred = y_pred.astype(float); sigma = sigma.astype(float)
    delta = np.minimum(np.abs(y_true - y_pred), 1000.0)
    sigma = np.maximum(sigma, 70.0)
    return float(np.mean(-delta / sigma - np.log(sigma)))

# Helpers already defined in earlier cells: prepare_baseline_table, build_slope_features, compute_patient_slopes, robust_global_slope

def slope_anchor_oof(train_df, n_splits=5, seed=42):
    from sklearn.linear_model import Ridge
    from sklearn.neighbors import KNeighborsRegressor
    gkf = GroupKFold(n_splits=n_splits)
    groups = train_df['Patient'].values
    y_list, d_list, fvc_slope_list, fvc_anchor_list = [], [], [], []
    pid_list, week_list = [], []
    for fold, (trn_idx, val_idx) in enumerate(gkf.split(train_df, groups=groups), 1):
        trn = train_df.iloc[trn_idx].copy(); val = train_df.iloc[val_idx].copy()
        base_trn = prepare_baseline_table(trn); base_val = prepare_baseline_table(val)
        slopes_tr = compute_patient_slopes(trn)
        lab = pd.DataFrame({'Patient': list(slopes_tr.keys()), 's_label': list(slopes_tr.values())})
        bf_trn, feat_cols, ecdf_bf, ecdf_pc, cats = build_slope_features(base_trn.merge(lab, on='Patient', how='left'), fit=True)
        bf_val, _, _, _, _ = build_slope_features(base_val, ecdf_bf, ecdf_pc, cats, fit=False)
        scaler = StandardScaler(with_mean=True, with_std=True)
        X_tr = bf_trn[feat_cols].values.astype(float); y_tr = bf_trn['s_label'].fillna(0.0).values.astype(float)
        X_trs = scaler.fit_transform(X_tr); X_vs = scaler.transform(bf_val[feat_cols].values.astype(float))
        from sklearn.linear_model import Ridge
        from sklearn.neighbors import KNeighborsRegressor
        ridge = Ridge(alpha=1.0, random_state=seed).fit(X_trs, y_tr)
        knn   = KNeighborsRegressor(n_neighbors=9, weights='distance').fit(X_trs, y_tr)
        s_r = ridge.predict(X_vs); s_k = knn.predict(X_vs)
        q_lo, q_hi = np.percentile(y_tr, [5,95])
        s_bl = np.clip(0.80*s_r + 0.20*s_k, q_lo, q_hi)
        s_map = dict(zip(base_val['Patient'].values, s_bl))
        valm = val.merge(base_val[['Patient','Base_Week','Base_FVC']], on='Patient', how='left')
        mask = (valm['Weeks'] >= valm['Base_Week'])
        dist = (valm['Weeks'] - valm['Base_Week']).astype(float)
        fvc_slope = (valm['Base_FVC'].values + valm['Patient'].map(s_map).fillna(0.0).values * dist).astype(float)
        gs_fold = robust_global_slope(compute_patient_slopes(trn))
        fvc_anchor = (valm['Base_FVC'].values + gs_fold * dist).astype(float)
        y_list.append(valm.loc[mask, 'FVC'].values.astype(float))
        d_list.append(dist.values[mask].astype(float))
        fvc_slope_list.append(fvc_slope[mask].astype(float))
        fvc_anchor_list.append(fvc_anchor[mask].astype(float))
        pid_list.append(valm.loc[mask, 'Patient'].astype(str).values)
        week_list.append(valm.loc[mask, 'Weeks'].astype(int).values)
        del trn, val, base_trn, base_val, bf_trn, bf_val, X_tr, X_trs, X_vs
        gc.collect()
    return (np.concatenate(y_list), np.concatenate(d_list), np.concatenate(fvc_slope_list), np.concatenate(fvc_anchor_list), np.concatenate(pid_list), np.concatenate(week_list))

def lme_oof(train_df, n_splits=5):
    gkf = GroupKFold(n_splits=n_splits)
    groups = train_df['Patient'].values
    y_list, d_list, fvc_list = [], [], []
    pid_list, week_list = [], []
    for fold, (trn_idx, val_idx) in enumerate(gkf.split(train_df, groups=groups), 1):
        trn = train_df.iloc[trn_idx].copy(); val = train_df.iloc[val_idx].copy()
        base_trn = prepare_baseline_table(trn); base_val = prepare_baseline_table(val)
        trn_l = trn.drop(columns=['Age','Sex','SmokingStatus'], errors='ignore').merge(
            base_trn[['Patient','Base_Week','Base_FVC','Percent_at_base','Age','Sex','SmokingStatus']], on='Patient', how='left'
        )
        trn_l['Weeks_Passed'] = (trn_l['Weeks'] - trn_l['Base_Week']).astype(float)/10.0
        trn_l = trn_l[trn_l['Weeks_Passed'] >= 0].copy()
        age_mean, age_std = trn_l['Age'].mean(), trn_l['Age'].std()+1e-9
        pc_mean, pc_std   = trn_l['Percent_at_base'].mean(), trn_l['Percent_at_base'].std()+1e-9
        trn_l['Age_std'] = (trn_l['Age'] - age_mean)/age_std
        trn_l['Percent_at_base_std'] = (trn_l['Percent_at_base'] - pc_mean)/pc_std
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')
            md = smf.mixedlm('FVC ~ 1 + Weeks_Passed + I(Weeks_Passed**2) + Age_std + C(Sex) + C(SmokingStatus) + Percent_at_base_std + Age_std:Percent_at_base_std',
                              data=trn_l, groups=trn_l['Patient'], re_formula='~Weeks_Passed')
            mdf = md.fit(method='lbfgs', reml=True, maxiter=500, disp=False)
        val_left = val.drop(columns=['Age','Sex','SmokingStatus'], errors='ignore')
        val_l = val_left.merge(base_val[['Patient','Base_Week','Base_FVC','Percent_at_base','Age','Sex','SmokingStatus']], on='Patient', how='left')
        mask = (val_l['Weeks'] >= val_l['Base_Week'])
        dist = (val_l['Weeks'] - val_l['Base_Week']).astype(float)
        val_l['Weeks_Passed'] = dist/10.0
        val_l['Age_std'] = (val_l['Age'] - age_mean)/age_std
        val_l['Percent_at_base_std'] = (val_l['Percent_at_base'] - pc_mean)/pc_std
        fvc_pred = mdf.predict(val_l).astype(float).values
        y_list.append(val_l.loc[mask, 'FVC'].values.astype(float))
        d_list.append(dist.values[mask].astype(float))
        fvc_list.append(fvc_pred[mask].astype(float))
        pid_list.append(val_l.loc[mask, 'Patient'].astype(str).values)
        week_list.append(val_l.loc[mask, 'Weeks'].astype(int).values)
        del trn, val, base_trn, base_val, trn_l, val_l
        gc.collect()
    return np.concatenate(y_list), np.concatenate(d_list), np.concatenate(fvc_list), np.concatenate(pid_list), np.concatenate(week_list)

print('Build OOF sources...', flush=True)
y_s, d_s, fvc_s, fvc_a, pid_s, wk_s = slope_anchor_oof(train, 5, 42)
y_l, d_l, fvc_l, pid_l, wk_l = lme_oof(train, 5)
oof_q = pd.read_csv('oof_quantile_lgbm_v2.csv')
train_base = prepare_baseline_table(train)
oof_q = oof_q.merge(train_base[['Patient','Base_Week','Base_FVC']], on='Patient', how='left', suffixes=('', '_base'))
if 'Base_FVC_base' in oof_q.columns:
    if 'Base_FVC' not in oof_q.columns: oof_q['Base_FVC'] = oof_q['Base_FVC_base']
    else: oof_q['Base_FVC'] = oof_q['Base_FVC'].fillna(oof_q['Base_FVC_base'])
    oof_q.drop(columns=['Base_FVC_base'], inplace=True)
if 'Base_Week_base' in oof_q.columns and 'Base_Week' not in oof_q.columns:
    oof_q['Base_Week'] = oof_q['Base_Week_base']
    oof_q.drop(columns=['Base_Week_base'], inplace=True)
oof_q['dist'] = (oof_q['Weeks'] - oof_q['Base_Week']).astype(float)
oof_q = oof_q[oof_q['dist'] >= 0].dropna(subset=['q50_delta_oof']).copy()

# Per-fold anchor for quantile OOF
N_SPLITS = 5
gkf = GroupKFold(n_splits=N_SPLITS)
groups = train['Patient'].values
patient_to_fold = {}; fold_to_gs = {}
for fold, (trn_idx, val_idx) in enumerate(gkf.split(train, groups=groups), 1):
    trn_df = train.iloc[trn_idx]
    gs_fold = robust_global_slope(compute_patient_slopes(trn_df))
    fold_to_gs[fold] = gs_fold
    for p in train.iloc[val_idx]['Patient'].astype(str).unique():
        patient_to_fold[p] = fold
oof_q['fold'] = oof_q['Patient'].astype(str).map(patient_to_fold).astype(int)
oof_q['gs_fold'] = oof_q['fold'].map(fold_to_gs).astype(float)
fvc_anchor_q = oof_q['Base_FVC'].astype(float).values + oof_q['gs_fold'].values * oof_q['dist'].astype(float).values
fvc_q_point = oof_q['Base_FVC'].astype(float).values + oof_q['q50_delta_oof'].astype(float).values
fvc_q = 0.70 * fvc_q_point + 0.30 * fvc_anchor_q

# Align OOF by keys
df_s = pd.DataFrame({'Patient': pid_s.astype(str), 'Weeks': wk_s.astype(int), 'y_true': y_s.astype(float), 'dist': d_s.astype(float), 'fvc_s': fvc_s.astype(float)})
df_s['fvc_a'] = fvc_a.astype(float)
df_l = pd.DataFrame({'Patient': pid_l.astype(str), 'Weeks': wk_l.astype(int), 'fvc_l': fvc_l.astype(float)})
df_q = oof_q[['Patient','Weeks']].astype({'Patient':'str','Weeks':'int'}).copy()
df_q['fvc_q'] = fvc_q.astype(float)

dfm = df_s.merge(df_l, on=['Patient','Weeks'], how='inner').merge(df_q, on=['Patient','Weeks'], how='inner')
y = dfm['y_true'].values.astype(float)
dist = dfm['dist'].values.astype(float)
s = dfm['fvc_s'].values.astype(float)
l = dfm['fvc_l'].values.astype(float)
q = dfm['fvc_q'].values.astype(float)
sigma_oof = np.maximum(240.0 + 3.0 * np.abs(dist), 70.0)
sigma_oof = np.where(np.abs(dist) > 20.0, np.maximum(sigma_oof, 100.0), sigma_oof)

def search_weights(dist_mask, w_grid=np.arange(0.0, 1.01, 0.05), ws_min=0.0, wl_min=0.0):
    idx = dist_mask
    if not np.any(idx):
        return (-1e9, (0.33, 0.33, 0.34))
    best_ll, best_w = -1e9, (0.33, 0.33, 0.34)
    for ws in w_grid:
        for wl in w_grid:
            wq = 1.0 - ws - wl
            if wq < 0 or wq > 1: continue
            if ws < ws_min or wl < wl_min: continue
            pred = ws*s[idx] + wl*l[idx] + wq*q[idx]
            ll = laplace_ll_np(y[idx], pred, sigma_oof[idx])
            if ll > best_ll:
                best_ll, best_w = ll, (ws, wl, wq)
    return best_ll, best_w

bins = [(0.0,5.0), (5.0,15.0), (15.0, 1e9)]
mask_bins = [ (np.abs(dist)>lo) & (np.abs(dist)<=hi) for lo,hi in bins ]

# Current: unconstrained
w_cur = {}
for (lo,hi), m in zip(bins, mask_bins):
    ll, w = search_weights(m, ws_min=0.0, wl_min=0.0)
    w_cur[(lo,hi)] = w
    print(f'[CUR] Bin {lo}-{hi}: ws={w[0]:.2f} wl={w[1]:.2f} wq={w[2]:.2f}')

# Regularized: enforce ws>=0.15 for long bin, keep others unconstrained; also wl_min=0.05 in long
w_reg = {}
for (lo,hi), m in zip(bins, mask_bins):
    ws_min = 0.15 if lo==15.0 else 0.0
    wl_min = 0.05 if lo==15.0 else 0.0
    ll, w = search_weights(m, ws_min=ws_min, wl_min=wl_min)
    w_reg[(lo,hi)] = w
    print(f'[REG] Bin {lo}-{hi}: ws={w[0]:.2f} wl={w[1]:.2f} wq={w[2]:.2f}')

# Equal weights
w_eq = {(lo,hi):(1/3,1/3,1/3) for (lo,hi) in bins}

def oof_ll_for_weights(weights):
    pred = np.zeros_like(y)
    for (lo,hi), m in zip(bins, mask_bins):
        ws, wl, wq = weights[(lo,hi)]
        pred[m] = ws*s[m] + wl*l[m] + wq*q[m]
    return laplace_ll_np(y, pred, sigma_oof)

ll_cur = oof_ll_for_weights(w_cur)
ll_reg = oof_ll_for_weights(w_reg)
ll_eq  = oof_ll_for_weights(w_eq)
print(f'[OOF LL] cur={ll_cur:.5f} | reg={ll_reg:.5f} (Δ={ll_reg-ll_cur:+.5f}) | equal={ll_eq:.5f} (Δ={ll_eq-ll_cur:+.5f})')

# Build test blends for cur, reg, equal with banker sigma and guardrails
def load_fvc(path):
    return pd.read_csv(path).set_index('Patient_Week').loc[ss['Patient_Week'],'FVC'].astype(float).values

fvc_s_test = load_fvc('submission_slope_anchor_banker_wA60.csv')
fvc_l_test = load_fvc('submission_lme_banker.csv')
fvc_q_test = load_fvc('submission_quantile_lgbm_v2.csv')

grid_te = ss.copy()
parts = grid_te['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid_te['Patient'] = parts[0]; grid_te['Weeks'] = parts[1].astype(int)
test_base = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid_te = grid_te.merge(test_base, on='Patient', how='left')
dist_te = (grid_te['Weeks'] - grid_te['Base_Week']).astype(float); abs_dist_te = np.abs(dist_te).astype(float)

def apply_weights_to_test(weights):
    fvc = np.zeros_like(fvc_s_test)
    for (lo,hi), (ws, wl, wq) in weights.items():
        m = (abs_dist_te > lo) & (abs_dist_te <= hi)
        fvc[m] = ws*fvc_s_test[m] + wl*fvc_l_test[m] + wq*fvc_q_test[m]
    fvc = np.where(abs_dist_te==0.0, grid_te['Base_FVC'].values.astype(float), fvc)
    fvc = np.clip(fvc, 500, 6000)
    df_out = pd.DataFrame({'Patient': grid_te['Patient'], 'Weeks': grid_te['Weeks'], 'FVC': fvc})
    def enforce_non_increasing(g):
        g = g.sort_values('Weeks').copy()
        g['FVC'] = np.minimum.accumulate(g['FVC'].values[::-1])[::-1]
        return g
    df_out = df_out.groupby('Patient', group_keys=False).apply(enforce_non_increasing)
    fvc_final = df_out['FVC'].values.astype(float)
    sigma_b = np.maximum(240.0 + 3.0 * abs_dist_te, 70.0)
    sigma_b = np.where(abs_dist_te > 20.0, np.maximum(sigma_b, 100.0), sigma_b)
    df_sig = pd.DataFrame({'Patient': grid_te['Patient'], 'dist': abs_dist_te, 'Sigma': sigma_b})
    df_sig = df_sig.groupby('Patient', group_keys=False).apply(lambda g: g.sort_values('dist').assign(Sigma=np.maximum.accumulate(g['Sigma'])))
    sigma_final = df_sig['Sigma'].values.astype(float)
    return fvc_final, sigma_final

fvc_cur, sig_cur = apply_weights_to_test(w_cur)
fvc_reg, sig_reg = apply_weights_to_test(w_reg)
fvc_eq,  sig_eq  = apply_weights_to_test(w_eq)

pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': fvc_cur, 'Confidence': sig_cur}).to_csv('submission_distance_blend_cur.csv', index=False)
pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': fvc_reg, 'Confidence': sig_reg}).to_csv('submission_distance_blend_reg.csv', index=False)
pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': fvc_eq,  'Confidence': sig_eq}).to_csv('submission_distance_blend_equal.csv', index=False)
print('Saved submissions: _cur, _reg (ws>=0.15, wl>=0.05 in long), _equal. Diagnostics complete. Elapsed {:.1f}s'.format(time.time()-t0))

Build OOF sources...


[CUR] Bin 0.0-5.0: ws=0.05 wl=0.00 wq=0.95
[CUR] Bin 5.0-15.0: ws=0.00 wl=0.00 wq=1.00
[CUR] Bin 15.0-1000000000.0: ws=0.05 wl=0.05 wq=0.90
[REG] Bin 0.0-5.0: ws=0.05 wl=0.00 wq=0.95
[REG] Bin 5.0-15.0: ws=0.00 wl=0.00 wq=1.00
[REG] Bin 15.0-1000000000.0: ws=0.15 wl=0.05 wq=0.80
[OOF LL] cur=-6.95557 | reg=-6.95593 (Δ=-0.00035) | equal=-7.03691 (Δ=-0.08134)
Saved submissions: _cur, _reg (ws>=0.15, wl>=0.05 in long), _equal. Diagnostics complete. Elapsed 3.5s


  df_out = df_out.groupby('Patient', group_keys=False).apply(enforce_non_increasing)
  df_sig = df_sig.groupby('Patient', group_keys=False).apply(lambda g: g.sort_values('dist').assign(Sigma=np.maximum.accumulate(g['Sigma'])))
  df_out = df_out.groupby('Patient', group_keys=False).apply(enforce_non_increasing)
  df_sig = df_sig.groupby('Patient', group_keys=False).apply(lambda g: g.sort_values('dist').assign(Sigma=np.maximum.accumulate(g['Sigma'])))
  df_out = df_out.groupby('Patient', group_keys=False).apply(enforce_non_increasing)
  df_sig = df_sig.groupby('Patient', group_keys=False).apply(lambda g: g.sort_values('dist').assign(Sigma=np.maximum.accumulate(g['Sigma'])))


In [60]:
# Sigma via quantile-band tuned against final blended OOF FVC; apply to test deltas; keep FVC from latest distance blend
import numpy as np, pandas as pd, gc, warnings, time
from sklearn.model_selection import GroupKFold
from sklearn.preprocessing import StandardScaler
import statsmodels.formula.api as smf

def laplace_ll_np(y_true, y_pred, sigma):
    y_true = np.asarray(y_true, float); y_pred = np.asarray(y_pred, float); sigma = np.asarray(sigma, float)
    delta = np.minimum(np.abs(y_true - y_pred), 1000.0)
    sigma = np.maximum(sigma, 70.0)
    return float(np.mean(-delta / sigma - np.log(sigma)))

# Helper OOF builders (reuse earlier helpers already defined in notebook):
def slope_anchor_oof(train_df, n_splits=5, seed=42):
    from sklearn.linear_model import Ridge
    from sklearn.neighbors import KNeighborsRegressor
    gkf = GroupKFold(n_splits=n_splits)
    groups = train_df['Patient'].values
    y_list, d_list, fvc_slope_list, fvc_anchor_list = [], [], [], []
    pid_list, week_list = [], []
    for fold, (trn_idx, val_idx) in enumerate(gkf.split(train_df, groups=groups), 1):
        trn = train_df.iloc[trn_idx].copy(); val = train_df.iloc[val_idx].copy()
        base_trn = prepare_baseline_table(trn); base_val = prepare_baseline_table(val)
        slopes_tr = compute_patient_slopes(trn)
        lab = pd.DataFrame({'Patient': list(slopes_tr.keys()), 's_label': list(slopes_tr.values())})
        bf_trn, feat_cols, ecdf_bf, ecdf_pc, cats = build_slope_features(base_trn.merge(lab, on='Patient', how='left'), fit=True)
        bf_val, _, _, _, _ = build_slope_features(base_val, ecdf_bf, ecdf_pc, cats, fit=False)
        scaler = StandardScaler(with_mean=True, with_std=True)
        X_tr = bf_trn[feat_cols].values.astype(float); y_tr = bf_trn['s_label'].fillna(0.0).values.astype(float)
        X_trs = scaler.fit_transform(X_tr); X_vs = scaler.transform(bf_val[feat_cols].values.astype(float))
        ridge = Ridge(alpha=1.0, random_state=seed).fit(X_trs, y_tr)
        knn   = KNeighborsRegressor(n_neighbors=9, weights='distance').fit(X_trs, y_tr)
        s_r = ridge.predict(X_vs); s_k = knn.predict(X_vs)
        q_lo, q_hi = np.percentile(y_tr, [5,95])
        s_bl = np.clip(0.80*s_r + 0.20*s_k, q_lo, q_hi)
        s_map = dict(zip(base_val['Patient'].values, s_bl))
        valm = val.merge(base_val[['Patient','Base_Week','Base_FVC']], on='Patient', how='left')
        mask = (valm['Weeks'] >= valm['Base_Week'])
        dist = (valm['Weeks'] - valm['Base_Week']).astype(float)
        fvc_slope = (valm['Base_FVC'].values + valm['Patient'].map(s_map).fillna(0.0).values * dist).astype(float)
        gs_fold = robust_global_slope(compute_patient_slopes(trn))
        fvc_anchor = (valm['Base_FVC'].values + gs_fold * dist).astype(float)
        y_list.append(valm.loc[mask, 'FVC'].values.astype(float))
        d_list.append(dist.values[mask].astype(float))
        fvc_slope_list.append(fvc_slope[mask].astype(float))
        fvc_anchor_list.append(fvc_anchor[mask].astype(float))
        pid_list.append(valm.loc[mask, 'Patient'].astype(str).values)
        week_list.append(valm.loc[mask, 'Weeks'].astype(int).values)
        del trn, val, base_trn, base_val, bf_trn, bf_val, X_tr, X_trs, X_vs
        gc.collect()
    return (np.concatenate(y_list), np.concatenate(d_list),
            np.concatenate(fvc_slope_list), np.concatenate(fvc_anchor_list),
            np.concatenate(pid_list), np.concatenate(week_list))

def lme_oof(train_df, n_splits=5):
    gkf = GroupKFold(n_splits=n_splits)
    groups = train_df['Patient'].values
    y_list, d_list, fvc_list = [], [], []
    pid_list, week_list = [], []
    for fold, (trn_idx, val_idx) in enumerate(gkf.split(train_df, groups=groups), 1):
        trn = train_df.iloc[trn_idx].copy(); val = train_df.iloc[val_idx].copy()
        base_trn = prepare_baseline_table(trn); base_val = prepare_baseline_table(val)
        trn_l = trn.drop(columns=['Age','Sex','SmokingStatus'], errors='ignore') \
                   .merge(base_trn[['Patient','Base_Week','Base_FVC','Percent_at_base','Age','Sex','SmokingStatus']], on='Patient', how='left')
        trn_l['Weeks_Passed'] = (trn_l['Weeks'] - trn_l['Base_Week']).astype(float)/10.0
        trn_l = trn_l[trn_l['Weeks_Passed'] >= 0].copy()
        age_mean, age_std = trn_l['Age'].mean(), trn_l['Age'].std()+1e-9
        pc_mean, pc_std   = trn_l['Percent_at_base'].mean(), trn_l['Percent_at_base'].std()+1e-9
        trn_l['Age_std'] = (trn_l['Age'] - age_mean)/age_std
        trn_l['Percent_at_base_std'] = (trn_l['Percent_at_base'] - pc_mean)/pc_std
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')
            md = smf.mixedlm('FVC ~ 1 + Weeks_Passed + I(Weeks_Passed**2) + Age_std + C(Sex) + C(SmokingStatus) + Percent_at_base_std + Age_std:Percent_at_base_std',
                              data=trn_l, groups=trn_l['Patient'], re_formula='~Weeks_Passed')
            mdf = md.fit(method='lbfgs', reml=True, maxiter=500, disp=False)
        val_left = val.drop(columns=['Age','Sex','SmokingStatus'], errors='ignore')
        val_l = val_left.merge(base_val[['Patient','Base_Week','Base_FVC','Percent_at_base','Age','Sex','SmokingStatus']], on='Patient', how='left')
        mask = (val_l['Weeks'] >= val_l['Base_Week'])
        dist = (val_l['Weeks'] - val_l['Base_Week']).astype(float)
        val_l['Weeks_Passed'] = dist/10.0
        val_l['Age_std'] = (val_l['Age'] - age_mean)/age_std
        val_l['Percent_at_base_std'] = (val_l['Percent_at_base'] - pc_mean)/pc_std
        fvc_pred = mdf.predict(val_l).astype(float).values
        y_list.append(val_l.loc[mask, 'FVC'].values.astype(float))
        d_list.append(dist.values[mask].astype(float))
        fvc_list.append(fvc_pred[mask].astype(float))
        pid_list.append(val_l.loc[mask, 'Patient'].astype(str).values)
        week_list.append(val_l.loc[mask, 'Weeks'].astype(int).values)
        del trn, val, base_trn, base_val, trn_l, val_l
        gc.collect()
    return np.concatenate(y_list), np.concatenate(d_list), np.concatenate(fvc_list), np.concatenate(pid_list), np.concatenate(week_list)

# 1) Build aligned OOF for Slope+Anchor, LME, and Quantile q50+per-fold anchor
print('[Sigma-QBand] Building OOF sources for blend alignment...', flush=True)
y_s, d_s, fvc_s, fvc_a, pid_s, wk_s = slope_anchor_oof(train, 5, 42)
y_l, d_l, fvc_l, pid_l, wk_l = lme_oof(train, 5)
oof_q = pd.read_csv('oof_quantile_lgbm_v2.csv')
train_base = prepare_baseline_table(train)
oof_q = oof_q.merge(train_base[['Patient','Base_Week','Base_FVC']], on='Patient', how='left', suffixes=('', '_base'))
if 'Base_FVC_base' in oof_q.columns:
    if 'Base_FVC' not in oof_q.columns: oof_q['Base_FVC'] = oof_q['Base_FVC_base']
    else: oof_q['Base_FVC'] = oof_q['Base_FVC'].fillna(oof_q['Base_FVC_base'])
    oof_q.drop(columns=['Base_FVC_base'], inplace=True)
if 'Base_Week_base' in oof_q.columns and 'Base_Week' not in oof_q.columns:
    oof_q['Base_Week'] = oof_q['Base_Week_base']
    oof_q.drop(columns=['Base_Week_base'], inplace=True)
oof_q['dist'] = (oof_q['Weeks'] - oof_q['Base_Week']).astype(float)
oof_q = oof_q[oof_q['dist'] >= 0].dropna(subset=['q50_delta_oof']).copy()

# Per-fold anchor for quantile OOF
N_SPLITS = 5
gkf = GroupKFold(n_splits=N_SPLITS)
groups = train['Patient'].values
patient_to_fold = {}; fold_to_gs = {}
for fold, (trn_idx, val_idx) in enumerate(gkf.split(train, groups=groups), 1):
    trn_df = train.iloc[trn_idx]
    gs_fold = robust_global_slope(compute_patient_slopes(trn_df))
    fold_to_gs[fold] = gs_fold
    for p in train.iloc[val_idx]['Patient'].astype(str).unique():
        patient_to_fold[p] = fold
oof_q['fold'] = oof_q['Patient'].astype(str).map(patient_to_fold).astype(int)
oof_q['gs_fold'] = oof_q['fold'].map(fold_to_gs).astype(float)
fvc_q_point = oof_q['Base_FVC'].astype(float).values + oof_q['q50_delta_oof'].astype(float).values
fvc_q = 0.70 * fvc_q_point + 0.30 * (oof_q['Base_FVC'].astype(float).values + oof_q['gs_fold'].values * oof_q['dist'].astype(float).values)

# Align by keys
df_s = pd.DataFrame({'Patient': pid_s.astype(str), 'Weeks': wk_s.astype(int), 'y_true': y_s.astype(float), 'dist': d_s.astype(float), 'fvc_s': fvc_s.astype(float), 'fvc_a': fvc_a.astype(float)})
df_l = pd.DataFrame({'Patient': pid_l.astype(str), 'Weeks': wk_l.astype(int), 'fvc_l': fvc_l.astype(float)})
df_q = oof_q[['Patient','Weeks']].astype({'Patient':'str','Weeks':'int'}).copy()
df_q['fvc_q'] = fvc_q.astype(float)
df_q['band'] = (oof_q['q80_delta_oof'] - oof_q['q20_delta_oof']).abs().astype(float).values
dfm = df_s.merge(df_l, on=['Patient','Weeks'], how='inner').merge(df_q, on=['Patient','Weeks'], how='inner')
y = dfm['y_true'].values.astype(float)
dist = dfm['dist'].values.astype(float)
s = dfm['fvc_s'].values.astype(float)
a = dfm['fvc_a'].values.astype(float)
l = dfm['fvc_l'].values.astype(float)
q = dfm['fvc_q'].values.astype(float)
band_oof = dfm['band'].values.astype(float)
sigma_banker_oof = np.maximum(240.0 + 3.0 * np.abs(dist), 70.0)
sigma_banker_oof = np.where(np.abs(dist) > 20.0, np.maximum(sigma_banker_oof, 100.0), sigma_banker_oof)

# 2) Recompute OOF distance-aware weights on bins (same protocol as Cell 20) with banker sigma
def grid_best(y, s, l, q, sigma, w_grid=np.arange(0.0, 1.01, 0.05)):
    best_ll, best_w = -1e9, (0.3, 0.3, 0.4)
    for ws in w_grid:
        for wl in w_grid:
            wq = 1.0 - ws - wl
            if wq < 0 or wq > 1: continue
            pred = ws*s + wl*l + wq*q
            ll = laplace_ll_np(y, pred, sigma)
            if ll > best_ll:
                best_ll, best_w = ll, (ws, wl, wq)
    return best_ll, best_w

bins = [(0.0, 5.0), (5.0, 15.0), (15.0, 1e9)]
best_w = {}
for lo, hi in bins:
    m = (np.abs(dist) > lo) & (np.abs(dist) <= hi)
    if not m.any():
        best_w[(lo,hi)] = (0.30, 0.30, 0.40)
        print(f'[Sigma-QBand] Bin {lo}-{hi} empty; default 0.30/0.30/0.40')
        continue
    ll, w = grid_best(y[m], s[m], l[m], q[m], sigma_banker_oof[m])
    best_w[(lo,hi)] = w
    print(f'[Sigma-QBand] Bin {lo}-{hi} weights S/L/Q = {w[0]:.2f}/{w[1]:.2f}/{w[2]:.2f} (OOF LL={ll:.5f})')

# Build blended OOF FVC per row using bin weights
fvc_blend_oof = np.zeros_like(y)
for (lo, hi), (ws, wl, wq) in best_w.items():
    m = (np.abs(dist) > lo) & (np.abs(dist) <= hi)
    if np.any(m):
        fvc_blend_oof[m] = ws*s[m] + wl*l[m] + wq*q[m]

# 3) Tune c per distance bin on OOF against blended FVC: sigma = max(band/c, banker); dist==0 -> 70
c_grid = [1.2, 1.4, 1.6, 1.8, 2.0, 2.2]
best_c = {}
for lo, hi in bins:
    m = (np.abs(dist) > lo) & (np.abs(dist) <= hi)
    if not m.any():
        best_c[(lo,hi)] = 1.8
        print(f'[Sigma-QBand] ({lo},{hi}] empty; default c=1.8')
        continue
    b_ll, b_c = -1e9, None
    for c in c_grid:
        sig = np.maximum(band_oof[m] / c, sigma_banker_oof[m])
        # Special-case: dist==0 -> sigma=70
        z = (np.abs(dist[m]) == 0.0)
        if np.any(z):
            sig[z] = np.maximum(70.0, sig[z])
        ll = laplace_ll_np(y[m], fvc_blend_oof[m], sig)
        if ll > b_ll:
            b_ll, b_c = ll, c
    best_c[(lo,hi)] = b_c
    print(f"[Sigma-QBand] ({lo},{hi}] best c={b_c:.2f} (OOF LL={b_ll:.5f})")

# 4) Apply tuned c to TEST using saved quantile delta predictions; keep FVC from latest distance-aware blend
ss = pd.read_csv('sample_submission.csv')
pred_d = pd.read_csv('pred_quantile_deltas_v2.csv').set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()
q20_d = pred_d['q20_d'].astype(float).values
q80_d = pred_d['q80_d'].astype(float).values
band_te = np.abs(q80_d - q20_d).astype(float)

# Build |dist| for bins and banker floor
grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]; grid['Weeks'] = parts[1].astype(int)
test_bl = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_bl, on='Patient', how='left')
abs_dist_te = (grid['Weeks'] - grid['Base_Week']).abs().astype(float).values
sigma_banker_te = np.maximum(240.0 + 3.0 * abs_dist_te, 70.0)
sigma_banker_te = np.where(abs_dist_te > 20.0, np.maximum(sigma_banker_te, 100.0), sigma_banker_te)

# Compute sigma per row using bin's c and floor by banker; dist==0 -> 70
sigma_from_band = np.zeros_like(band_te, dtype=float)
for (lo, hi), c in best_c.items():
    m = (abs_dist_te > lo) & (abs_dist_te <= hi)
    if np.any(m):
        sigma_from_band[m] = band_te[m] / c
sigma_te = np.maximum(sigma_from_band, sigma_banker_te)
sigma_te = np.where(abs_dist_te == 0.0, 70.0, sigma_te)

# Per-patient monotone in |dist|
df_sig = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'dist': abs_dist_te, 'Sigma': sigma_te.astype(float)})
def enforce_sigma_monotone(g):
    g = g.sort_values('dist').copy()
    g['Sigma'] = np.maximum.accumulate(g['Sigma'].values)
    return g
df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)
sigma_final = df_sig['Sigma'].values.astype(float)

# 5) Build final submission with FVC from latest distance blend and new sigma
sub_fvc = pd.read_csv('submission_distance_blend.csv').set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()
sub_new_sigma = pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': sub_fvc['FVC'].astype(float).values, 'Confidence': sigma_final})
sub_new_sigma.to_csv('submission_distance_blend_sigma_qband.csv', index=False)
sub_new_sigma.to_csv('submission.csv', index=False)
print('Saved submission_distance_blend_sigma_qband.csv and set submission.csv (OOF-aligned tuned q-band sigma floored by banker; dist==0 -> 70; monotone per patient).')

[Sigma-QBand] Building OOF sources for blend alignment...


[Sigma-QBand] Bin 0.0-5.0 weights S/L/Q = 0.05/0.00/0.95 (OOF LL=-6.04045)
[Sigma-QBand] Bin 5.0-15.0 weights S/L/Q = 0.00/0.00/1.00 (OOF LL=-6.12583)
[Sigma-QBand] Bin 15.0-1000000000.0 weights S/L/Q = 0.05/0.05/0.90 (OOF LL=-6.39437)
[Sigma-QBand] (0.0,5.0] best c=1.40 (OOF LL=-6.04044)
[Sigma-QBand] (5.0,15.0] best c=1.60 (OOF LL=-6.12583)
[Sigma-QBand] (15.0,1000000000.0] best c=1.80 (OOF LL=-6.39437)
Saved submission_distance_blend_sigma_qband.csv and set submission.csv (OOF-aligned tuned q-band sigma floored by banker; dist==0 -> 70; monotone per patient).


  df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)


In [62]:
# Sigma variant: asymmetric quantile band selection in long bins, optional 15-30 split; apply to test, keep FVC from distance blend
import numpy as np, pandas as pd, gc, warnings
from sklearn.model_selection import GroupKFold

def laplace_ll_np(y_true, y_pred, sigma):
    y_true = np.asarray(y_true, float); y_pred = np.asarray(y_pred, float); sigma = np.asarray(sigma, float)
    delta = np.minimum(np.abs(y_true - y_pred), 1000.0)
    sigma = np.maximum(sigma, 70.0)
    return float(np.mean(-delta / sigma - np.log(sigma)))

# 1) Rebuild aligned OOF using cell 23 logic but only to get blended OOF preds and quantile bands
ss = pd.read_csv('sample_submission.csv')
oof_q = pd.read_csv('oof_quantile_lgbm_v2.csv')
train_base = prepare_baseline_table(train)
oof_q = oof_q.merge(train_base[['Patient','Base_Week','Base_FVC']], on='Patient', how='left', suffixes=('', '_base'))
if 'Base_FVC_base' in oof_q.columns:
    if 'Base_FVC' not in oof_q.columns: oof_q['Base_FVC'] = oof_q['Base_FVC_base']
    else: oof_q['Base_FVC'] = oof_q['Base_FVC'].fillna(oof_q['Base_FVC_base'])
    oof_q.drop(columns=['Base_FVC_base'], inplace=True)
if 'Base_Week_base' in oof_q.columns and 'Base_Week' not in oof_q.columns:
    oof_q['Base_Week'] = oof_q['Base_Week_base']
    oof_q.drop(columns=['Base_Week_base'], inplace=True)
oof_q['dist'] = (oof_q['Weeks'] - oof_q['Base_Week']).astype(float)
oof_q = oof_q[oof_q['dist'] >= 0].dropna(subset=['q50_delta_oof']).copy()

# Per-fold anchor for quantile OOF
N_SPLITS = 5
gkf = GroupKFold(n_splits=N_SPLITS)
groups = train['Patient'].values
patient_to_fold, fold_to_gs = {}, {}
for fold, (trn_idx, val_idx) in enumerate(gkf.split(train, groups=groups), 1):
    trn_df = train.iloc[trn_idx]
    gs_fold = robust_global_slope(compute_patient_slopes(trn_df))
    fold_to_gs[fold] = gs_fold
    for p in train.iloc[val_idx]['Patient'].astype(str).unique():
        patient_to_fold[p] = fold
oof_q['fold'] = oof_q['Patient'].astype(str).map(patient_to_fold).astype(int)
oof_q['gs_fold'] = oof_q['fold'].map(fold_to_gs).astype(float)

# Build Slope+Anchor and LME OOF minimal (reuse quick builders from cell 23)
def slope_anchor_oof_min(train_df, n_splits=5, seed=42):
    from sklearn.linear_model import Ridge
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.preprocessing import StandardScaler
    gkf = GroupKFold(n_splits=n_splits); groups = train_df['Patient'].values
    y_list, d_list, fvc_s_list, fvc_a_list, pid_list, wk_list = [], [], [], [], [], []
    for fold, (trn_idx, val_idx) in enumerate(gkf.split(train_df, groups=groups), 1):
        trn = train_df.iloc[trn_idx].copy(); val = train_df.iloc[val_idx].copy()
        base_trn = prepare_baseline_table(trn); base_val = prepare_baseline_table(val)
        slopes_tr = compute_patient_slopes(trn)
        lab = pd.DataFrame({'Patient': list(slopes_tr.keys()), 's_label': list(slopes_tr.values())})
        bf_trn, feat_cols, ecdf_bf, ecdf_pc, cats = build_slope_features(base_trn.merge(lab, on='Patient', how='left'), fit=True)
        bf_val, _, _, _, _ = build_slope_features(base_val, ecdf_bf, ecdf_pc, cats, fit=False)
        sc = StandardScaler(with_mean=True, with_std=True)
        X_tr = bf_trn[feat_cols].values.astype(float); y_tr = bf_trn['s_label'].fillna(0.0).values.astype(float)
        X_trs = sc.fit_transform(X_tr); X_vs = sc.transform(bf_val[feat_cols].values.astype(float))
        ridge = Ridge(alpha=1.0, random_state=seed).fit(X_trs, y_tr)
        knn = KNeighborsRegressor(n_neighbors=9, weights='distance').fit(X_trs, y_tr)
        s_r = ridge.predict(X_vs); s_k = knn.predict(X_vs)
        q_lo, q_hi = np.percentile(y_tr, [5,95])
        s_bl = np.clip(0.80*s_r + 0.20*s_k, q_lo, q_hi)
        s_map = dict(zip(base_val['Patient'].values, s_bl))
        valm = val.merge(base_val[['Patient','Base_Week','Base_FVC']], on='Patient', how='left')
        mask = (valm['Weeks'] >= valm['Base_Week'])
        dist = (valm['Weeks'] - valm['Base_Week']).astype(float)
        fvc_s = (valm['Base_FVC'].values + valm['Patient'].map(s_map).fillna(0.0).values * dist).astype(float)
        gs_fold = robust_global_slope(compute_patient_slopes(trn))
        fvc_a = (valm['Base_FVC'].values + gs_fold * dist).astype(float)
        y_list.append(valm.loc[mask, 'FVC'].values.astype(float)); d_list.append(dist.values[mask].astype(float))
        fvc_s_list.append(fvc_s[mask].astype(float)); fvc_a_list.append(fvc_a[mask].astype(float))
        pid_list.append(valm.loc[mask, 'Patient'].astype(str).values); wk_list.append(valm.loc[mask, 'Weeks'].astype(int).values)
    return (np.concatenate(y_list), np.concatenate(d_list), np.concatenate(fvc_s_list), np.concatenate(fvc_a_list), np.concatenate(pid_list), np.concatenate(wk_list))

def lme_oof_min(train_df, n_splits=5):
    import statsmodels.formula.api as smf, warnings
    gkf = GroupKFold(n_splits=n_splits); groups = train_df['Patient'].values
    y_list, d_list, fvc_list, pid_list, wk_list = [], [], [], [], []
    for fold, (trn_idx, val_idx) in enumerate(gkf.split(train_df, groups=groups), 1):
        trn = train_df.iloc[trn_idx].copy(); val = train_df.iloc[val_idx].copy()
        base_trn = prepare_baseline_table(trn); base_val = prepare_baseline_table(val)
        trn_l = trn.drop(columns=['Age','Sex','SmokingStatus'], errors='ignore').merge(base_trn[['Patient','Base_Week','Base_FVC','Percent_at_base','Age','Sex','SmokingStatus']], on='Patient', how='left')
        trn_l['Weeks_Passed'] = (trn_l['Weeks'] - trn_l['Base_Week']).astype(float)/10.0
        trn_l = trn_l[trn_l['Weeks_Passed'] >= 0].copy()
        age_mean, age_std = trn_l['Age'].mean(), trn_l['Age'].std()+1e-9
        pc_mean, pc_std   = trn_l['Percent_at_base'].mean(), trn_l['Percent_at_base'].std()+1e-9
        trn_l['Age_std'] = (trn_l['Age'] - age_mean)/age_std; trn_l['Percent_at_base_std'] = (trn_l['Percent_at_base'] - pc_mean)/pc_std
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')
            md = smf.mixedlm('FVC ~ 1 + Weeks_Passed + I(Weeks_Passed**2) + Age_std + C(Sex) + C(SmokingStatus) + Percent_at_base_std + Age_std:Percent_at_base_std',
                              data=trn_l, groups=trn_l['Patient'], re_formula='~Weeks_Passed')
            mdf = md.fit(method='lbfgs', reml=True, maxiter=500, disp=False)
        val_left = val.drop(columns=['Age','Sex','SmokingStatus'], errors='ignore')
        val_l = val_left.merge(base_val[['Patient','Base_Week','Base_FVC','Percent_at_base','Age','Sex','SmokingStatus']], on='Patient', how='left')
        mask = (val_l['Weeks'] >= val_l['Base_Week'])
        dist = (val_l['Weeks'] - val_l['Base_Week']).astype(float)
        val_l['Weeks_Passed'] = dist/10.0
        val_l['Age_std'] = (val_l['Age'] - age_mean)/age_std
        val_l['Percent_at_base_std'] = (val_l['Percent_at_base'] - pc_mean)/pc_std
        fvc_pred = mdf.predict(val_l).astype(float).values
        y_list.append(val_l.loc[mask, 'FVC'].values.astype(float)); d_list.append(dist.values[mask].astype(float)); fvc_list.append(fvc_pred[mask].astype(float))
        pid_list.append(val_l.loc[mask, 'Patient'].astype(str).values); wk_list.append(val_l.loc[mask, 'Weeks'].astype(int).values)
    return np.concatenate(y_list), np.concatenate(d_list), np.concatenate(fvc_list), np.concatenate(pid_list), np.concatenate(wk_list)

y_s, d_s, fvc_s, fvc_a, pid_s, wk_s = slope_anchor_oof_min(train, 5, 42)
y_l, d_l, fvc_l, pid_l, wk_l = lme_oof_min(train, 5)

# Align OOF sources
df_s = pd.DataFrame({'Patient': pid_s.astype(str), 'Weeks': wk_s.astype(int), 'y_true': y_s.astype(float), 'dist': d_s.astype(float), 'fvc_s': fvc_s.astype(float), 'fvc_a': fvc_a.astype(float)})
df_l = pd.DataFrame({'Patient': pid_l.astype(str), 'Weeks': wk_l.astype(int), 'fvc_l': fvc_l.astype(float)})
# Bring gs_fold directly into df_q via merge to avoid indexing length mismatches
df_q = oof_q[['Patient','Weeks','q10_delta_oof','q20_delta_oof','q50_delta_oof','q80_delta_oof','q90_delta_oof','Base_FVC','dist','gs_fold']].astype({'Patient':'str','Weeks':'int'})
fvc_q = 0.70 * (df_q['Base_FVC'].astype(float) + df_q['q50_delta_oof'].astype(float)) + 0.30 * (df_q['Base_FVC'].astype(float) + df_q['gs_fold'].astype(float) * df_q['dist'].astype(float))
df_q['fvc_q'] = fvc_q.astype(float)
df_q['band_sym'] = (df_q['q80_delta_oof'] - df_q['q20_delta_oof']).abs().astype(float)
df_q['band_asym'] = np.maximum((df_q['q90_delta_oof'] - df_q['q50_delta_oof']).abs().astype(float), (df_q['q50_delta_oof'] - df_q['q10_delta_oof']).abs().astype(float))
dfm = df_s.merge(df_l, on=['Patient','Weeks'], how='inner').merge(df_q[['Patient','Weeks','fvc_q','band_sym','band_asym']], on=['Patient','Weeks'], how='inner')

y = dfm['y_true'].values.astype(float)
dist = dfm['dist'].values.astype(float)
s = dfm['fvc_s'].values.astype(float)
a = dfm['fvc_a'].values.astype(float)
lme = dfm['fvc_l'].values.astype(float)
qpt = dfm['fvc_q'].values.astype(float)
band_sym = dfm['band_sym'].values.astype(float)
band_asym = dfm['band_asym'].values.astype(float)

# Recompute OOF distance-aware weights using banker sigma (same as cell 23)
sigma_banker_oof = np.maximum(240.0 + 3.0 * np.abs(dist), 70.0)
sigma_banker_oof = np.where(np.abs(dist) > 20.0, np.maximum(sigma_banker_oof, 100.0), sigma_banker_oof)

def grid_best(y, s, l, q, sigma, w_grid=np.arange(0.0, 1.01, 0.05)):
    best_ll, best_w = -1e9, (0.3, 0.3, 0.4)
    for ws in w_grid:
        for wl in w_grid:
            wq = 1.0 - ws - wl
            if wq < 0 or wq > 1: continue
            pred = ws*s + wl*l + wq*q
            ll = laplace_ll_np(y, pred, sigma)
            if ll > best_ll:
                best_ll, best_w = ll, (ws, wl, wq)
    return best_ll, best_w

bins = [(0.0,5.0),(5.0,15.0),(15.0,30.0),(30.0,1e9)]
best_w = {}
for lo, hi in bins:
    m = (np.abs(dist) > lo) & (np.abs(dist) <= hi)
    if not m.any():
        best_w[(lo,hi)] = (0.05, 0.05, 0.90) if lo>=15.0 else (0.05, 0.00, 0.95)
        continue
    ll, w = grid_best(y[m], s[m], lme[m], qpt[m], sigma_banker_oof[m])
    best_w[(lo,hi)] = w

fvc_blend_oof = np.zeros_like(y)
for (lo,hi), (ws, wl, wq) in best_w.items():
    m = (np.abs(dist) > lo) & (np.abs(dist) <= hi)
    fvc_blend_oof[m] = ws*s[m] + wl*lme[m] + wq*qpt[m]

# 2) For each bin, choose band type (sym for short/mid, choose sym vs asym for long bins) and tune c in {1.2..2.2}
c_grid = [1.2, 1.4, 1.6, 1.8, 2.0, 2.2]
best_c, best_band_type = {}, {}
for lo, hi in bins:
    m = (np.abs(dist) > lo) & (np.abs(dist) <= hi)
    if not m.any():
        best_c[(lo,hi)] = 1.8; best_band_type[(lo,hi)] = 'sym' if hi<=15.0 else 'sym'
        continue
    # band candidates
    bands = {'sym': band_sym[m]} if hi <= 15.0 else {'sym': band_sym[m], 'asym': band_asym[m]}
    best_ll, sel_c, sel_type = -1e9, 1.8, 'sym'
    for btype, bvals in bands.items():
        for c in c_grid:
            sig = np.maximum(bvals / c, sigma_banker_oof[m])
            ll = laplace_ll_np(y[m], fvc_blend_oof[m], sig)
            if ll > best_ll:
                best_ll, sel_c, sel_type = ll, c, btype
    best_c[(lo,hi)] = sel_c; best_band_type[(lo,hi)] = sel_type
    print(f"[Sigma-ASYM] Bin ({lo},{hi}] type={sel_type} c={sel_c:.2f} OOF LL={best_ll:.5f}")

# 3) Apply to TEST deltas using selected band per bin, banker floor, dist==0->70, +5 stabilizer for |dist|>15
pred_d = pd.read_csv('pred_quantile_deltas_v2.csv').set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()
q10_d = pred_d['q10_d'].astype(float).values
q20_d = pred_d['q20_d'].astype(float).values
q50_d = pred_d['q50_d'].astype(float).values
q80_d = pred_d['q80_d'].astype(float).values
q90_d = pred_d['q90_d'].astype(float).values
band_sym_te = np.abs(q80_d - q20_d).astype(float)
band_asym_te = np.maximum(np.abs(q90_d - q50_d), np.abs(q50_d - q10_d)).astype(float)

grid_te = ss.copy()
parts = grid_te['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid_te['Patient'] = parts[0]; grid_te['Weeks'] = parts[1].astype(int)
test_bl = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid_te = grid_te.merge(test_bl, on='Patient', how='left')
abs_dist_te = (grid_te['Weeks'] - grid_te['Base_Week']).abs().astype(float).values
sigma_banker_te = np.maximum(240.0 + 3.0 * abs_dist_te, 70.0)
sigma_banker_te = np.where(abs_dist_te > 20.0, np.maximum(sigma_banker_te, 100.0), sigma_banker_te)

sigma_from_band = np.zeros_like(abs_dist_te, dtype=float)
for (lo,hi), c in best_c.items():
    m = (abs_dist_te > lo) & (abs_dist_te <= hi)
    if not np.any(m): continue
    btype = best_band_type[(lo,hi)]
    bvals = band_sym_te if btype=='sym' else band_asym_te
    sigma_from_band[m] = bvals[m] / c
sigma_te = np.maximum(sigma_from_band, sigma_banker_te)
sigma_te = np.where(abs_dist_te == 0.0, 70.0, sigma_te)
sigma_te = np.where(abs_dist_te > 15.0, sigma_te + 5.0, sigma_te)

# Per-patient monotone in |dist|
df_sig = pd.DataFrame({'Patient': grid_te['Patient'].values, 'Weeks': grid_te['Weeks'].values.astype(int), 'dist': abs_dist_te, 'Sigma': sigma_te.astype(float)})
def enforce_sigma_monotone(g):
    g = g.sort_values('dist').copy()
    g['Sigma'] = np.maximum.accumulate(g['Sigma'].values)
    return g
df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)
sigma_final = df_sig['Sigma'].values.astype(float)

# 4) Save submission with FVC from distance blend
sub_fvc = pd.read_csv('submission_distance_blend.csv').set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()
sub_new = pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': sub_fvc['FVC'].astype(float).values, 'Confidence': sigma_final})
sub_new.to_csv('submission_distance_blend_sigma_qband_asym.csv', index=False)
sub_new.to_csv('submission.csv', index=False)
print('Saved submission_distance_blend_sigma_qband_asym.csv and set submission.csv (bin-wise band type selection with asym in long bins, banker floor, +5 past 15w, monotone).')

[Sigma-ASYM] Bin (0.0,5.0] type=sym c=1.40 OOF LL=-6.04044
[Sigma-ASYM] Bin (5.0,15.0] type=sym c=1.60 OOF LL=-6.12583
[Sigma-ASYM] Bin (15.0,30.0] type=sym c=1.60 OOF LL=-6.31705
[Sigma-ASYM] Bin (30.0,1000000000.0] type=asym c=1.60 OOF LL=-6.43738
Saved submission_distance_blend_sigma_qband_asym.csv and set submission.csv (bin-wise band type selection with asym in long bins, banker floor, +5 past 15w, monotone).


  df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)


In [67]:
# 2-model (q50 + per-fold anchor) with constrained per-bin alpha; retune sigma c per bin; apply to test
import numpy as np, pandas as pd, gc, time
from sklearn.model_selection import GroupKFold

def laplace_ll_np(y_true, y_pred, sigma):
    y_true = np.asarray(y_true, float); y_pred = np.asarray(y_pred, float); sigma = np.asarray(sigma, float)
    delta = np.minimum(np.abs(y_true - y_pred), 1000.0)
    sigma = np.maximum(sigma, 70.0)
    return float(np.mean(-delta / sigma - np.log(sigma)))

t0 = time.time()
# 1) Build OOF for q50 and per-fold anchor; align future rows only
oof_q = pd.read_csv('oof_quantile_lgbm_v2.csv')
train_base = prepare_baseline_table(train)
oof_q = oof_q.merge(train_base[['Patient','Base_Week','Base_FVC']], on='Patient', how='left', suffixes=('', '_base'))
if 'Base_FVC_base' in oof_q.columns:
    if 'Base_FVC' not in oof_q.columns: oof_q['Base_FVC'] = oof_q['Base_FVC_base']
    else: oof_q['Base_FVC'] = oof_q['Base_FVC'].fillna(oof_q['Base_FVC_base'])
    oof_q.drop(columns=['Base_FVC_base'], inplace=True)
if 'Base_Week_base' in oof_q.columns and 'Base_Week' not in oof_q.columns:
    oof_q['Base_Week'] = oof_q['Base_Week_base']
    oof_q.drop(columns=['Base_Week_base'], inplace=True)
oof_q['dist'] = (oof_q['Weeks'] - oof_q['Base_Week']).astype(float)
oof_q = oof_q[oof_q['dist'] >= 0].dropna(subset=['q20_delta_oof','q50_delta_oof','q80_delta_oof']).copy()

# Per-fold global slope (anchor) without leak
N_SPLITS = 5
gkf = GroupKFold(n_splits=N_SPLITS)
groups = train['Patient'].values
patient_to_fold, fold_to_gs = {}, {}
for fold, (trn_idx, val_idx) in enumerate(gkf.split(train, groups=groups), 1):
    trn_df = train.iloc[trn_idx]
    gs_fold = robust_global_slope(compute_patient_slopes(trn_df))
    fold_to_gs[fold] = gs_fold
    for p in train.iloc[val_idx]['Patient'].astype(str).unique():
        patient_to_fold[p] = fold
oof_q['fold'] = oof_q['Patient'].astype(str).map(patient_to_fold).astype(int)
oof_q['gs_fold'] = oof_q['fold'].map(fold_to_gs).astype(float)

fvc_q50_oof = (oof_q['Base_FVC'].astype(float).values + oof_q['q50_delta_oof'].astype(float).values)
fvc_anchor_oof = (oof_q['Base_FVC'].astype(float).values + oof_q['gs_fold'].astype(float).values * oof_q['dist'].astype(float).values)
y_oof = oof_q['FVC'].astype(float).values
dist_oof = oof_q['dist'].astype(float).values
abs_dist_oof = np.abs(dist_oof)
band_oof = np.abs(oof_q['q80_delta_oof'].astype(float).values - oof_q['q20_delta_oof'].astype(float).values)

# Banker sigma OOF
sigma_banker_oof = np.maximum(240.0 + 3.0 * abs_dist_oof, 70.0)
sigma_banker_oof = np.where(abs_dist_oof > 20.0, np.maximum(sigma_banker_oof, 100.0), sigma_banker_oof)

# 2) Use hard-capped alphas = 0.30 across all bins (70/30 q50/anchor), per expert guidance
bins = [(0.0,5.0),(5.0,15.0),(15.0,1e9)]
masks = [ (abs_dist_oof>lo) & (abs_dist_oof<=hi) for lo,hi in bins ]
alpha_s = alpha_m = alpha_l = 0.30
print(f"[2-Model] Using fixed alphas (short,mid,long) = {alpha_s:.2f}, {alpha_m:.2f}, {alpha_l:.2f}")

# Build blended OOF preds with fixed alphas for sigma tuning
fvc_blend_oof = np.zeros_like(y_oof)
for (lo,hi), m, a in zip(bins, masks, (alpha_s, alpha_m, alpha_l)):
    if np.any(m):
        fvc_blend_oof[m] = (1.0 - a) * fvc_q50_oof[m] + a * fvc_anchor_oof[m]

# 3) Tune sigma c per bin on this OOF: sigma = max(|q80-q20|/c, banker). Optional >=130 for |dist|>30 if OOF-neutral
c_grid = [1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2]
best_c = {}
use_floor130 = False
for (lo,hi), m in zip(bins, masks):
    if not np.any(m):
        best_c[(lo,hi)] = 1.8
        print(f'[Sigma-2M] Bin ({lo},{hi}] empty; default c=1.8')
        continue
    b_ll, b_c = -1e9, 1.8
    for c in c_grid:
        sig = np.maximum(band_oof[m] / c, sigma_banker_oof[m])
        # dist==0 -> 70
        z = (abs_dist_oof[m] == 0.0)
        if np.any(z): sig[z] = np.maximum(70.0, sig[z])
        ll = laplace_ll_np(y_oof[m], fvc_blend_oof[m], sig)
        if ll > b_ll: b_ll, b_c = ll, c
    best_c[(lo,hi)] = b_c
    print(f"[Sigma-2M] Bin ({lo},{hi}] best c={b_c:.2f} OOF LL={b_ll:.5f}")

# Test >=130 for |dist|>30 as optional floor (OOF-neutral adoption check)
m_gt30 = abs_dist_oof > 30.0
if np.any(m_gt30):
    sig_base = np.zeros_like(abs_dist_oof)
    for (lo,hi), m in zip(bins, masks):
        if np.any(m):
            sig_base[m] = np.maximum(band_oof[m] / best_c[(lo,hi)], sigma_banker_oof[m])
    sig_base = np.where(abs_dist_oof == 0.0, np.maximum(70.0, sig_base), sig_base)
    ll_no130 = laplace_ll_np(y_oof, fvc_blend_oof, sig_base)
    sig_130 = np.where(m_gt30, np.maximum(sig_base, 130.0), sig_base)
    ll_130 = laplace_ll_np(y_oof, fvc_blend_oof, sig_130)
    if ll_130 >= ll_no130 - 1e-6:
        use_floor130 = True
    print(f"[Sigma-2M] >=130 floor test: LL_no130={ll_no130:.5f} LL_130={ll_130:.5f} adopt={use_floor130}")

# 4) Apply to TEST: build q50 and anchor from full-train gs; blend with fixed alphas; build sigma with tuned c
ss = pd.read_csv('sample_submission.csv')
pred_d = pd.read_csv('pred_quantile_deltas_v2.csv').set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()
q20_d = pred_d['q20_d'].astype(float).values
q50_d = pred_d['q50_d'].astype(float).values
q80_d = pred_d['q80_d'].astype(float).values
band_te = np.abs(q80_d - q20_d).astype(float)

grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]; grid['Weeks'] = parts[1].astype(int)
test_base = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_base, on='Patient', how='left')
dist_te = (grid['Weeks'] - grid['Base_Week']).astype(float).values
abs_dist_te = np.abs(dist_te)
base_fvc_te = grid['Base_FVC'].astype(float).values

gs_full = robust_global_slope(compute_patient_slopes(train))
fvc_q50_te = base_fvc_te + q50_d
fvc_anchor_te = base_fvc_te + gs_full * dist_te

# Fixed alpha 0.30 across bins
alpha_bins = np.zeros_like(abs_dist_te, dtype=float) + 0.30

fvc_pred = (1.0 - alpha_bins) * fvc_q50_te + alpha_bins * fvc_anchor_te
fvc_pred = np.clip(fvc_pred, 500, 6000)

# Apply tolerant non-increasing FVC per patient (+25 ml), then pin dist==0 to Base_FVC
df_out = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'FVC': fvc_pred})
def enforce_non_increasing_tolerant(g, tol=25.0):
    g = g.sort_values('Weeks').copy()
    f = g['FVC'].values.astype(float)
    for i in range(len(f)-2, -1, -1):
        f[i] = min(f[i], f[i+1] + tol)
    g['FVC'] = f
    return g
df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(lambda g: enforce_non_increasing_tolerant(g, 25.0))
fvc_final = df_out['FVC'].values.astype(float)
fvc_final = np.where(abs_dist_te == 0.0, base_fvc_te, fvc_final)

# Sigma test: max(band/c_bin, banker); dist==0->70; floors; per-patient monotone
sigma_banker_te = np.maximum(240.0 + 3.0 * abs_dist_te, 70.0)
sigma_banker_te = np.where(abs_dist_te > 20.0, np.maximum(sigma_banker_te, 100.0), sigma_banker_te)
sigma_from_band = np.zeros_like(abs_dist_te, dtype=float)
for (lo,hi) in bins:
    m = (abs_dist_te>lo) & (abs_dist_te<=hi)
    if not np.any(m): continue
    c = best_c[(lo,hi)]
    sigma_from_band[m] = band_te[m] / c
sigma_te = np.maximum(sigma_from_band, sigma_banker_te)
if use_floor130:
    sigma_te = np.where(abs_dist_te > 30.0, np.maximum(sigma_te, 130.0), sigma_te)
sigma_te = np.where(abs_dist_te == 0.0, 70.0, sigma_te)

df_sig = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'dist': abs_dist_te, 'Sigma': sigma_te.astype(float)})
def enforce_sigma_monotone(g):
    g = g.sort_values('dist').copy()
    g['Sigma'] = np.maximum.accumulate(g['Sigma'].values)
    return g
df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)
sigma_final = df_sig['Sigma'].values.astype(float)

# 5) Save submission
sub = pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': fvc_final, 'Confidence': sigma_final})
sub.to_csv('submission_2model_q50_anchor_qband.csv', index=False)
sub.to_csv('submission.csv', index=False)
print(f"Saved submission_2model_q50_anchor_qband.csv and set submission.csv. Alphas: {alpha_s:.2f}/{alpha_m:.2f}/{alpha_l:.2f} (forced to 0.30); use_floor130={use_floor130}. Elapsed {time.time()-t0:.1f}s")

[2-Model] Using fixed alphas (short,mid,long) = 0.30, 0.30, 0.30
[Sigma-2M] Bin (0.0,5.0] best c=1.30 OOF LL=-6.04051
[Sigma-2M] Bin (5.0,15.0] best c=1.60 OOF LL=-6.12583
[Sigma-2M] Bin (15.0,1000000000.0] best c=1.70 OOF LL=-6.40760
[Sigma-2M] >=130 floor test: LL_no130=-7.93749 LL_130=-7.93749 adopt=True
Saved submission_2model_q50_anchor_qband.csv and set submission.csv. Alphas: 0.30/0.30/0.30 (forced to 0.30); use_floor130=True. Elapsed 0.1s


  df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(lambda g: enforce_non_increasing_tolerant(g, 25.0))
  df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)


In [66]:
# 2-model (70/30 q50+anchor) with banker-only sigma; tolerant FVC monotonicity; save alt submission
import numpy as np, pandas as pd
from sklearn.model_selection import GroupKFold

# Build q50 and per-fold anchor OOF just to ensure hygiene for anchor; then use full-train anchor for test
def prepare_baseline_table(df):
    base = (df.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False).first())
    base = base[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].rename(
        columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
    return base

# Test grid
ss = pd.read_csv('sample_submission.csv')
pred_d = pd.read_csv('pred_quantile_deltas_v2.csv').set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()
q50_d = pred_d['q50_d'].astype(float).values

grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]; grid['Weeks'] = parts[1].astype(int)
test_base = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_base, on='Patient', how='left')
dist_te = (grid['Weeks'] - grid['Base_Week']).astype(float).values
abs_dist_te = np.abs(dist_te)
base_fvc_te = grid['Base_FVC'].astype(float).values

# Full-train global slope for anchor
gs_full = robust_global_slope(compute_patient_slopes(train))
fvc_q50_te = base_fvc_te + q50_d
fvc_anchor_te = base_fvc_te + gs_full * dist_te

# Fixed alpha 0.30 across bins (70/30 blend)
alpha = 0.30
fvc_pred = (1.0 - alpha) * fvc_q50_te + alpha * fvc_anchor_te
fvc_pred = np.clip(fvc_pred, 500, 6000)

# Tolerant non-increasing per patient (+25 ml), then pin dist==0 to Base_FVC
df_out = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'FVC': fvc_pred})
def enforce_non_increasing_tolerant(g, tol=25.0):
    g = g.sort_values('Weeks').copy()
    f = g['FVC'].values.astype(float)
    for i in range(len(f)-2, -1, -1):
        f[i] = min(f[i], f[i+1] + tol)
    g['FVC'] = f
    return g
df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(lambda g: enforce_non_increasing_tolerant(g, 25.0))
fvc_final = df_out['FVC'].values.astype(float)
fvc_final = np.where(abs_dist_te == 0.0, base_fvc_te, fvc_final)

# Banker-only sigma with standard floors and per-patient monotone in |dist|
sigma_banker_te = np.maximum(240.0 + 3.0 * abs_dist_te, 70.0)
sigma_banker_te = np.where(abs_dist_te > 20.0, np.maximum(sigma_banker_te, 100.0), sigma_banker_te)
df_sig = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'dist': abs_dist_te, 'Sigma': sigma_banker_te.astype(float)})
def enforce_sigma_monotone(g):
    g = g.sort_values('dist').copy()
    g['Sigma'] = np.maximum.accumulate(g['Sigma'].values)
    return g
df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)
sigma_final = df_sig['Sigma'].values.astype(float)

sub = pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': fvc_final, 'Confidence': sigma_final})
sub.to_csv('submission_2model_q50_anchor_banker.csv', index=False)
sub.to_csv('submission.csv', index=False)
print('Saved submission_2model_q50_anchor_banker.csv and set submission.csv (70/30 q50+anchor; banker-only sigma; tolerant FVC monotonicity).')

Saved submission_2model_q50_anchor_banker.csv and set submission.csv (70/30 q50+anchor; banker-only sigma; tolerant FVC monotonicity).


  df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(lambda g: enforce_non_increasing_tolerant(g, 25.0))
  df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)


In [79]:
# Step 3: Rebuild 2-model (70/30 q50+anchor) using v3 q50; retune sigma c per bin on blended OOF; write v3 submissions
import numpy as np, pandas as pd, time, gc
from sklearn.model_selection import GroupKFold

t0 = time.time()
train = pd.read_csv('train.csv')
ss = pd.read_csv('sample_submission.csv')

def laplace_ll_np(y_true, y_pred, sigma):
    y_true = np.asarray(y_true, float); y_pred = np.asarray(y_pred, float); sigma = np.asarray(sigma, float)
    delta = np.minimum(np.abs(y_true - y_pred), 1000.0)
    sigma = np.maximum(sigma, 70.0)
    return float(np.mean(-delta / sigma - np.log(sigma)))

def prepare_baseline_table(df):
    base = (df.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False).first())
    base = base[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].rename(
        columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
    return base

def compute_patient_slopes(df, patient_col='Patient', week_col='Weeks', target_col='FVC'):
    slopes = {}
    for pid, g in df.groupby(patient_col):
        if g.shape[0] >= 2:
            x = g[week_col].values.astype(float); y = g[target_col].values.astype(float)
            xm = x.mean(); ym = y.mean()
            denom = ((x - xm)**2).sum()
            slope = ((x - xm) * (y - ym)).sum() / denom if denom > 0 else 0.0
            slopes[pid] = slope
    return slopes

def robust_global_slope(slopes_dict):
    if not slopes_dict: return 0.0
    return float(np.median(list(slopes_dict.values())))

# 1) Load OOF: v3 q50 deltas (q50_delta_oof) and v2 bands (q20/q80 deltas); dedupe/average per (Patient,Weeks) and align by keys
oof_v3 = pd.read_csv('oof_quantile_lgbm_v3.csv')  # Patient, Weeks, FVC, Base_Week, Base_FVC, q50_delta_oof
oof_v2 = pd.read_csv('oof_quantile_lgbm_v2.csv')[['Patient','Weeks','q20_delta_oof','q80_delta_oof']]

# Deduplicate/average within each source to remove multi-seed duplicates
oof_v3 = (oof_v3.groupby(['Patient','Weeks'], as_index=False)
          .agg({'FVC':'first','Base_Week':'first','Base_FVC':'first','q50_delta_oof':'mean'}))
oof_v2 = (oof_v2.groupby(['Patient','Weeks'], as_index=False)
          .agg({'q20_delta_oof':'mean','q80_delta_oof':'mean'}))

oof = oof_v3.merge(oof_v2, on=['Patient','Weeks'], how='inner')

# Build dist and future-only filter; drop dist==0 rows for OOF tuning/scoring; remove residual duplicates
oof['dist'] = (oof['Weeks'] - oof['Base_Week']).astype(float)
pre_rows = oof.shape[0]
oof = oof[(oof['dist'] >= 0) & oof[['q50_delta_oof','q20_delta_oof','q80_delta_oof']].notna().all(axis=1)].copy()
oof = oof.sort_values(['Patient','Weeks']).drop_duplicates(['Patient','Weeks'])
oof = oof[oof['dist'] > 0].copy()
print(f"[Diag] OOF merged rows pre-filter={pre_rows} post-filter={oof.shape[0]} unique_patients={oof['Patient'].nunique()}")

# 2) Per-fold anchor gs_fold (TRAIN-only) using GroupKFold; map fold to patients and gs
N_SPLITS = 5
gkf = GroupKFold(n_splits=N_SPLITS)
groups = train['Patient'].values
patient_to_fold, fold_to_gs = {}, {}
for fold, (trn_idx, val_idx) in enumerate(gkf.split(train, groups=groups), 1):
    trn_df = train.iloc[trn_idx]
    gs_fold = robust_global_slope(compute_patient_slopes(trn_df))
    fold_to_gs[fold] = gs_fold
    for p in train.iloc[val_idx]['Patient'].astype(str).unique():
        patient_to_fold[p] = fold
oof['fold'] = oof['Patient'].astype(str).map(patient_to_fold).astype(int)
oof['gs_fold'] = oof['fold'].map(fold_to_gs).astype(float)

# 3) Build 70/30 blended OOF FVC and compute banker sigma + tune c per bin for q-band sigma
base = oof['Base_FVC'].astype(float).values
dist = oof['dist'].astype(float).values
abs_dist = np.abs(dist).astype(float)
fvc_q50_oof = base + oof['q50_delta_oof'].astype(float).values
fvc_anchor_oof = base + oof['gs_fold'].astype(float).values * dist
fvc_blend_oof = 0.70 * fvc_q50_oof + 0.30 * fvc_anchor_oof
y_oof = oof['FVC'].astype(float).values
band_oof = np.abs(oof['q80_delta_oof'].astype(float).values - oof['q20_delta_oof'].astype(float).values)

sigma_banker_oof = np.maximum(240.0 + 3.0 * abs_dist, 70.0)
sigma_banker_oof = np.where(abs_dist > 20.0, np.maximum(sigma_banker_oof, 100.0), sigma_banker_oof)

bins = [(0.0,5.0),(5.0,15.0),(15.0,1e9)]
masks = [ (abs_dist>lo) & (abs_dist<=hi) for lo,hi in bins ]
print('[Diag] OOF bin counts short/mid/long =', [int(m.sum()) for m in masks])

c_grid_short_mid = [1.3,1.4,1.5,1.6,1.7,1.8,2.0]
c_grid_long = [1.3,1.4,1.5,1.6,1.7,1.8,2.0,2.3,2.4,2.5,2.6]
best_c = {}
for (lo,hi), m in zip(bins, masks):
    if not np.any(m):
        best_c[(lo,hi)] = 1.8
        print(f'[v3 Sigma] Bin ({lo},{hi}] empty; default c=1.8')
        continue
    grid_c = c_grid_short_mid if hi<=15.0 else c_grid_long
    b_ll, b_c = -1e9, 1.8
    for c in grid_c:
        sig = np.maximum(band_oof[m] / c, sigma_banker_oof[m])
        z = (abs_dist[m] == 0.0)
        if np.any(z):
            sig[z] = np.maximum(70.0, sig[z])
        ll = laplace_ll_np(y_oof[m], fvc_blend_oof[m], sig)
        if ll > b_ll:
            b_ll, b_c = ll, c
    best_c[(lo,hi)] = b_c
    print(f"[v3 Sigma] Bin ({lo},{hi}] best c={b_c:.2f} OOF LL={b_ll:.5f}")

# Optional >=130 floor for |dist|>30 if OOF-neutral
m_gt30 = abs_dist > 30.0
use_floor130 = False
if np.any(m_gt30):
    sig_base = np.zeros_like(abs_dist)
    for (lo,hi), m in zip(bins, masks):
        if np.any(m):
            sig_base[m] = np.maximum(band_oof[m] / best_c[(lo,hi)], sigma_banker_oof[m])
    sig_base = np.where(abs_dist == 0.0, np.maximum(70.0, sig_base), sig_base)
    ll_no130 = laplace_ll_np(y_oof, fvc_blend_oof, sig_base)
    sig_130 = np.where(m_gt30, np.maximum(sig_base, 130.0), sig_base)
    ll_130 = laplace_ll_np(y_oof, fvc_blend_oof, sig_130)
    use_floor130 = (ll_130 >= ll_no130 - 1e-6)
    print(f"[v3 Sigma] >=130 floor test: LL_no130={ll_no130:.5f} LL_130={ll_130:.5f} adopt={use_floor130}")

# Compute global OOF LL for q-band sigma tuned above and banker
sig_oof = np.zeros_like(abs_dist)
for (lo,hi), m in zip(bins, masks):
    if np.any(m):
        sig_oof[m] = np.maximum(band_oof[m] / best_c[(lo,hi)], sigma_banker_oof[m])
sig_oof = np.where(abs_dist == 0.0, np.maximum(70.0, sig_oof), sig_oof)
if use_floor130:
    sig_oof = np.where(abs_dist > 30.0, np.maximum(sig_oof, 130.0), sig_oof)
ll_global_qband = laplace_ll_np(y_oof, fvc_blend_oof, sig_oof)
ll_global_banker = laplace_ll_np(y_oof, fvc_blend_oof, sigma_banker_oof)
print(f'[v3 OOF] rows={oof.shape[0]} pats={oof["Patient"].nunique()} LL_qband={ll_global_qband:.5f} | LL_banker={ll_global_banker:.5f}')

# 4) Build TEST FVC using v3 q50_d and full-train anchor; apply guardrails
pred_v3 = pd.read_csv('pred_quantile_deltas_v3.csv').set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()
q50_d_te = pred_v3['q50_d'].astype(float).values

grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]; grid['Weeks'] = parts[1].astype(int)
test = pd.read_csv('test.csv')
test_base = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_base, on='Patient', how='left')
dist_te = (grid['Weeks'] - grid['Base_Week']).astype(float).values
abs_dist_te = np.abs(dist_te).astype(float)
base_fvc_te = grid['Base_FVC'].astype(float).values

gs_full = robust_global_slope(compute_patient_slopes(train))
fvc_q50_te = base_fvc_te + q50_d_te
fvc_anchor_te = base_fvc_te + gs_full * dist_te
fvc_te = 0.70 * fvc_q50_te + 0.30 * fvc_anchor_te
fvc_te = np.clip(fvc_te, 500, 6000)

# Tolerant non-increasing per patient (+25 ml), then pin dist==0 to Base_FVC
df_out = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'FVC': fvc_te})
def enforce_non_increasing_tolerant(g, tol=25.0):
    g = g.sort_values('Weeks').copy()
    f = g['FVC'].values.astype(float)
    for i in range(len(f)-2, -1, -1):
        f[i] = min(f[i], f[i+1] + tol)
    g['FVC'] = f
    return g
df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(lambda g: enforce_non_increasing_tolerant(g, 25.0))
fvc_final = df_out['FVC'].values.astype(float)
fvc_final = np.where(abs_dist_te == 0.0, base_fvc_te, fvc_final)

# 5) Build TEST sigmas: q-band (tuned c) and banker; per-patient monotone
pred_v2 = pd.read_csv('pred_quantile_deltas_v2.csv').set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()
band_te = np.abs(pred_v2['q80_d'].astype(float).values - pred_v2['q20_d'].astype(float).values)
sigma_banker_te = np.maximum(240.0 + 3.0 * abs_dist_te, 70.0)
sigma_banker_te = np.where(abs_dist_te > 20.0, np.maximum(sigma_banker_te, 100.0), sigma_banker_te)
sigma_from_band = np.zeros_like(abs_dist_te, dtype=float)
for (lo,hi) in bins:
    m = (abs_dist_te>lo) & (abs_dist_te<=hi)
    if np.any(m):
        c = best_c[(lo,hi)]
        sigma_from_band[m] = band_te[m] / c
sigma_qband_te = np.maximum(sigma_from_band, sigma_banker_te)
if use_floor130:
    sigma_qband_te = np.where(abs_dist_te > 30.0, np.maximum(sigma_qband_te, 130.0), sigma_qband_te)
sigma_qband_te = np.where(abs_dist_te == 0.0, 70.0, sigma_qband_te)

def enforce_sigma_monotone(df):
    def _mono(g):
        g = g.sort_values('dist').copy()
        g['Sigma'] = np.maximum.accumulate(g['Sigma'].values)
        return g
    return df.groupby('Patient', as_index=False, group_keys=False).apply(_mono)

df_sig_q = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'dist': abs_dist_te, 'Sigma': sigma_qband_te.astype(float)})
df_sig_q = enforce_sigma_monotone(df_sig_q)
sigma_qband_final = df_sig_q['Sigma'].values.astype(float)

df_sig_b = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'dist': abs_dist_te, 'Sigma': sigma_banker_te.astype(float)})
df_sig_b = enforce_sigma_monotone(df_sig_b)
sigma_banker_final = df_sig_b['Sigma'].values.astype(float)

# 6) Save two submissions; do not overwrite submission.csv here
sub_q = pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': fvc_final, 'Confidence': sigma_qband_final})
sub_b = pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': fvc_final, 'Confidence': sigma_banker_final})
sub_q.to_csv('submission_v3_qband.csv', index=False)
sub_b.to_csv('submission_v3_banker.csv', index=False)
print('Saved submission_v3_qband.csv and submission_v3_banker.csv. Elapsed {:.1f}s'.format(time.time()-t0))

[Diag] OOF merged rows pre-filter=1387 post-filter=1229 unique_patients=158
[Diag] OOF bin counts short/mid/long = [286, 438, 505]
[v3 Sigma] Bin (0.0,5.0] best c=1.40 OOF LL=-6.04133
[v3 Sigma] Bin (5.0,15.0] best c=1.60 OOF LL=-6.12839
[v3 Sigma] Bin (15.0,1000000000.0] best c=1.70 OOF LL=-6.41465
[v3 Sigma] >=130 floor test: LL_no130=-6.22576 LL_130=-6.22576 adopt=True
[v3 OOF] rows=1229 pats=158 LL_qband=-6.22576 | LL_banker=-6.22576
Saved submission_v3_qband.csv and submission_v3_banker.csv. Elapsed 0.1s


  df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(lambda g: enforce_non_increasing_tolerant(g, 25.0))
  return df.groupby('Patient', as_index=False, group_keys=False).apply(_mono)
  return df.groupby('Patient', as_index=False, group_keys=False).apply(_mono)


In [75]:
# Set submission.csv to v3 q-band primary
import pandas as pd
src = 'submission_v3_qband.csv'
ss = pd.read_csv('sample_submission.csv')
sub = pd.read_csv(src)
assert sub.shape[0] == ss.shape[0], 'Row count mismatch vs sample_submission'
assert set(sub['Patient_Week'].astype(str)) == set(ss['Patient_Week'].astype(str)), 'Patient_Week sets differ'
assert sub['FVC'].notna().all() and sub['Confidence'].notna().all(), 'NaNs in v3_qband submission'
sub.to_csv('submission.csv', index=False)
print(f'submission.csv overwritten with {src}')

submission.csv overwritten with submission_v3_qband.csv


In [71]:
# Optional lever: add tiny LME weight (0.10) only for |dist|>15 to v3 2-model FVC; keep v3 q-band sigma
import numpy as np, pandas as pd
from sklearn.model_selection import GroupKFold

ss = pd.read_csv('sample_submission.csv')

# Load v3 q50 deltas and build q50+anchor FVC like in Cell 27
pred_v3 = pd.read_csv('pred_quantile_deltas_v3.csv').set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()
q50_d_te = pred_v3['q50_d'].astype(float).values

grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]; grid['Weeks'] = parts[1].astype(int)
test = pd.read_csv('test.csv')
test_base = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_base, on='Patient', how='left')
dist_te = (grid['Weeks'] - grid['Base_Week']).astype(float).values
abs_dist_te = np.abs(dist_te).astype(float)
base_fvc_te = grid['Base_FVC'].astype(float).values

def compute_patient_slopes(df, patient_col='Patient', week_col='Weeks', target_col='FVC'):
    slopes = {}
    for pid, g in df.groupby(patient_col):
        if g.shape[0] >= 2:
            x = g[week_col].values.astype(float); y = g[target_col].values.astype(float)
            xm = x.mean(); ym = y.mean()
            denom = ((x - xm)**2).sum()
            slopes[pid] = ((x - xm) * (y - ym)).sum() / denom if denom > 0 else 0.0
    return slopes

def robust_global_slope(slopes_dict):
    if not slopes_dict: return 0.0
    return float(np.median(list(slopes_dict.values())))

train = pd.read_csv('train.csv')
gs_full = robust_global_slope(compute_patient_slopes(train))
fvc_q50_te = base_fvc_te + q50_d_te
fvc_anchor_te = base_fvc_te + gs_full * dist_te

# Load LME FVC from existing artifact
fvc_lme = pd.read_csv('submission_lme_banker.csv').set_index('Patient_Week').loc[ss['Patient_Week'],'FVC'].astype(float).values

# Build FVC with rule: if |dist|<=15: 0.70*q50 + 0.30*anchor; else: 0.60*q50 + 0.30*anchor + 0.10*LME
fvc_base = 0.70 * fvc_q50_te + 0.30 * fvc_anchor_te
m_long = abs_dist_te > 15.0
fvc_alt_long = 0.60 * fvc_q50_te + 0.30 * fvc_anchor_te + 0.10 * fvc_lme
fvc_mix = np.where(m_long, fvc_alt_long, fvc_base)
fvc_mix = np.clip(fvc_mix, 500, 6000)

# Tolerant non-increasing FVC per patient (+25 ml), then pin dist==0 to Base_FVC
df_out = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'FVC': fvc_mix})
def enforce_non_increasing_tolerant(g, tol=25.0):
    g = g.sort_values('Weeks').copy()
    f = g['FVC'].values.astype(float)
    for i in range(len(f)-2, -1, -1):
        f[i] = min(f[i], f[i+1] + tol)
    g['FVC'] = f
    return g
df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(lambda g: enforce_non_increasing_tolerant(g, 25.0))
fvc_final = df_out['FVC'].values.astype(float)
fvc_final = np.where(abs_dist_te == 0.0, base_fvc_te, fvc_final)

# Keep sigma from v3 q-band submission (already banker-floored, monotone)
sub_qband = pd.read_csv('submission_v3_qband.csv').set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()
sigma_final = sub_qband['Confidence'].astype(float).values

sub_new = pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': fvc_final, 'Confidence': sigma_final})
sub_new.to_csv('submission_v3_qband_lme10long.csv', index=False)
sub_new.to_csv('submission.csv', index=False)
print('Saved submission_v3_qband_lme10long.csv and set submission.csv (add 0.10 LME for |dist|>15; sigma from v3 q-band unchanged).')

Saved submission_v3_qband_lme10long.csv and set submission.csv (add 0.10 LME for |dist|>15; sigma from v3 q-band unchanged).


  df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(lambda g: enforce_non_increasing_tolerant(g, 25.0))


In [72]:
# Set submission.csv to v3 banker backup
import pandas as pd
src = 'submission_v3_banker.csv'
ss = pd.read_csv('sample_submission.csv')
sub = pd.read_csv(src)
assert sub.shape[0] == ss.shape[0], 'Row count mismatch vs sample_submission'
assert set(sub['Patient_Week'].astype(str)) == set(ss['Patient_Week'].astype(str)), 'Patient_Week sets differ'
assert sub['FVC'].notna().all() and sub['Confidence'].notna().all(), 'NaNs in v3_banker submission'
sub.to_csv('submission.csv', index=False)
print(f'submission.csv overwritten with {src}')

submission.csv overwritten with submission_v3_banker.csv


In [78]:
# Step 3b: Average q50 (LGBM v3 + CatBoost v1), rebuild 2-model 70/30, retune sigma per-bin, write submissions
import numpy as np, pandas as pd, time
from sklearn.model_selection import GroupKFold

t0 = time.time()
train = pd.read_csv('train.csv')
ss = pd.read_csv('sample_submission.csv')

def laplace_ll_np(y_true, y_pred, sigma):
    y_true = np.asarray(y_true, float); y_pred = np.asarray(y_pred, float); sigma = np.asarray(sigma, float)
    delta = np.minimum(np.abs(y_true - y_pred), 1000.0)
    sigma = np.maximum(sigma, 70.0)
    return float(np.mean(-delta / sigma - np.log(sigma)))

def prepare_baseline_table(df):
    base = (df.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False).first())
    base = base[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].rename(
        columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
    return base

def compute_patient_slopes(df, patient_col='Patient', week_col='Weeks', target_col='FVC'):
    slopes = {}
    for pid, g in df.groupby(patient_col):
        if g.shape[0] >= 2:
            x = g[week_col].values.astype(float); y = g[target_col].values.astype(float)
            xm = x.mean(); ym = y.mean()
            denom = ((x - xm)**2).sum()
            slope = ((x - xm) * (y - ym)).sum() / denom if denom > 0 else 0.0
            slopes[pid] = slope
    return slopes

def robust_global_slope(slopes_dict):
    if not slopes_dict: return 0.0
    return float(np.median(list(slopes_dict.values())))

# 1) Load OOF q50 deltas: v3 (LGBM) and CatBoost v1; dedupe/average per (Patient, Weeks) to avoid multi-seed duplicates
oof_v3 = pd.read_csv('oof_quantile_lgbm_v3.csv')  # Patient, Weeks, FVC, Base_Week, Base_FVC, q50_delta_oof
oof_cb = pd.read_csv('oof_quantile_cat_v1.csv')  # Patient, Weeks, FVC, Base_Week, Base_FVC, q50_delta_oof

# Deduplicate/average
oof_v3 = (oof_v3.groupby(['Patient','Weeks'], as_index=False)
          .agg({'FVC':'first','Base_Week':'first','Base_FVC':'first','q50_delta_oof':'mean'}))
oof_cb = (oof_cb.groupby(['Patient','Weeks'], as_index=False)
          .agg({'q50_delta_oof':'mean'}))

# Merge averaged q50s
oof = (oof_v3.rename(columns={'q50_delta_oof':'q50_v3'})
       .merge(oof_cb.rename(columns={'q50_delta_oof':'q50_cb'}), on=['Patient','Weeks'], how='inner'))

# Bring v2 bands for sigma and average duplicates there too
oof_v2 = pd.read_csv('oof_quantile_lgbm_v2.csv')[['Patient','Weeks','q20_delta_oof','q80_delta_oof']]
oof_v2 = (oof_v2.groupby(['Patient','Weeks'], as_index=False)
          .agg({'q20_delta_oof':'mean','q80_delta_oof':'mean'}))
oof = oof.merge(oof_v2, on=['Patient','Weeks'], how='inner')

# Future-only and diagnostics; drop dist==0 rows for OOF tuning/scoring; also drop any residual duplicates
oof['dist'] = (oof['Weeks'] - oof['Base_Week']).astype(float)
pre = oof.shape[0]
oof = oof[(oof['dist'] >= 0) & oof[['q50_v3','q50_cb','q20_delta_oof','q80_delta_oof']].notna().all(axis=1)].copy()
oof = oof.sort_values(['Patient','Weeks']).drop_duplicates(['Patient','Weeks'])
oof = oof[oof['dist'] > 0].copy()
print(f"[Avg-q50 Diag] OOF pre={pre} post={oof.shape[0]} pats={oof['Patient'].nunique()}")

# Per-fold anchor gs_fold (TRAIN-only)
N_SPLITS = 5
gkf = GroupKFold(n_splits=N_SPLITS)
groups = train['Patient'].values
patient_to_fold, fold_to_gs = {}, {}
for fold, (trn_idx, val_idx) in enumerate(gkf.split(train, groups=groups), 1):
    trn_df = train.iloc[trn_idx]
    gs_fold = robust_global_slope(compute_patient_slopes(trn_df))
    fold_to_gs[fold] = gs_fold
    for p in train.iloc[val_idx]['Patient'].astype(str).unique():
        patient_to_fold[p] = fold
oof['fold'] = oof['Patient'].astype(str).map(patient_to_fold).astype(int)
oof['gs_fold'] = oof['fold'].map(fold_to_gs).astype(float)

# 2) Build 70/30 blended OOF with averaged q50
base = oof['Base_FVC'].astype(float).values
dist = oof['dist'].astype(float).values
abs_dist = np.abs(dist).astype(float)
q50_avg = 0.5 * (oof['q50_v3'].astype(float).values + oof['q50_cb'].astype(float).values)
fvc_q50_oof = base + q50_avg
fvc_anchor_oof = base + oof['gs_fold'].astype(float).values * dist
fvc_blend_oof = 0.70 * fvc_q50_oof + 0.30 * fvc_anchor_oof
y_oof = oof['FVC'].astype(float).values
band_oof = np.abs(oof['q80_delta_oof'].astype(float).values - oof['q20_delta_oof'].astype(float).values)

sigma_banker_oof = np.maximum(240.0 + 3.0 * abs_dist, 70.0)
sigma_banker_oof = np.where(abs_dist > 20.0, np.maximum(sigma_banker_oof, 100.0), sigma_banker_oof)

bins = [(0.0,5.0),(5.0,15.0),(15.0,1e9)]
masks = [ (abs_dist>lo) & (abs_dist<=hi) for lo,hi in bins ]
print('[Avg-q50 Diag] bin counts:', [int(m.sum()) for m in masks])

# 3) Tune c per bin for q-band sigma floored by banker; optional >=130 for |dist|>30 if OOF-neutral
c_grid_short_mid = [1.3,1.4,1.5,1.6,1.7,1.8,2.0]
c_grid_long = [1.3,1.4,1.5,1.6,1.7,1.8,2.0,2.3,2.4,2.5,2.6]
best_c = {}
for (lo,hi), m in zip(bins, masks):
    if not np.any(m):
        best_c[(lo,hi)] = 1.8
        print(f'[Avg-q50 Sigma] Bin ({lo},{hi}] empty; c=1.8')
        continue
    grid_c = c_grid_short_mid if hi<=15.0 else c_grid_long
    b_ll, b_c = -1e9, 1.8
    for c in grid_c:
        sig = np.maximum(band_oof[m] / c, sigma_banker_oof[m])
        z = (abs_dist[m] == 0.0)
        if np.any(z): sig[z] = np.maximum(70.0, sig[z])
        ll = laplace_ll_np(y_oof[m], fvc_blend_oof[m], sig)
        if ll > b_ll: b_ll, b_c = ll, c
    best_c[(lo,hi)] = b_c
    print(f"[Avg-q50 Sigma] Bin ({lo},{hi}] best c={b_c:.2f} OOF LL={b_ll:.5f}")

m_gt30 = abs_dist > 30.0
use_floor130 = False
if np.any(m_gt30):
    sig_base = np.zeros_like(abs_dist)
    for (lo,hi), m in zip(bins, masks):
        if np.any(m): sig_base[m] = np.maximum(band_oof[m] / best_c[(lo,hi)], sigma_banker_oof[m])
    sig_base = np.where(abs_dist == 0.0, np.maximum(70.0, sig_base), sig_base)
    ll_no130 = laplace_ll_np(y_oof, fvc_blend_oof, sig_base)
    sig_130 = np.where(m_gt30, np.maximum(sig_base, 130.0), sig_base)
    ll_130 = laplace_ll_np(y_oof, fvc_blend_oof, sig_130)
    use_floor130 = (ll_130 >= ll_no130 - 1e-6)
    print(f"[Avg-q50 Sigma] >=130 test: LL_no130={ll_no130:.5f} LL_130={ll_130:.5f} adopt={use_floor130}")

# Global OOF LLs
sig_oof = np.zeros_like(abs_dist)
for (lo,hi), m in zip(bins, masks):
    if np.any(m): sig_oof[m] = np.maximum(band_oof[m] / best_c[(lo,hi)], sigma_banker_oof[m])
sig_oof = np.where(abs_dist == 0.0, np.maximum(70.0, sig_oof), sig_oof)
if use_floor130: sig_oof = np.where(abs_dist > 30.0, np.maximum(sig_oof, 130.0), sig_oof)
ll_qb = laplace_ll_np(y_oof, fvc_blend_oof, sig_oof)
ll_bk = laplace_ll_np(y_oof, fvc_blend_oof, sigma_banker_oof)
print(f"[Avg-q50 OOF] rows={oof.shape[0]} pats={oof['Patient'].nunique()} LL_qband={ll_qb:.5f} | LL_banker={ll_bk:.5f}")

# 4) TEST: average q50_d (v3 + CatBoost), 70/30 with full-train anchor; guardrails
pred_v3 = pd.read_csv('pred_quantile_deltas_v3.csv').set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()
pred_cb = pd.read_csv('pred_quantile_deltas_cat_v1.csv').set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()
q50_d_te = 0.5 * (pred_v3['q50_d'].astype(float).values + pred_cb['q50_d'].astype(float).values)

grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]; grid['Weeks'] = parts[1].astype(int)
test = pd.read_csv('test.csv')
test_base = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_base, on='Patient', how='left')
dist_te = (grid['Weeks'] - grid['Base_Week']).astype(float).values
abs_dist_te = np.abs(dist_te).astype(float)
base_fvc_te = grid['Base_FVC'].astype(float).values

gs_full = robust_global_slope(compute_patient_slopes(train))
fvc_q50_te = base_fvc_te + q50_d_te
fvc_anchor_te = base_fvc_te + gs_full * dist_te
fvc_te = 0.70 * fvc_q50_te + 0.30 * fvc_anchor_te
fvc_te = np.clip(fvc_te, 500, 6000)

# Tolerant per-patient monotonicity (+25ml), then pin dist==0 to Base_FVC
df_out = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'FVC': fvc_te})
def enforce_non_increasing_tolerant(g, tol=25.0):
    g = g.sort_values('Weeks').copy()
    f = g['FVC'].values.astype(float)
    for i in range(len(f)-2, -1, -1):
        f[i] = min(f[i], f[i+1] + tol)
    g['FVC'] = f
    return g
df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(lambda g: enforce_non_increasing_tolerant(g, 25.0))
fvc_final = df_out['FVC'].values.astype(float)
fvc_final = np.where(abs_dist_te == 0.0, base_fvc_te, fvc_final)

# 5) TEST sigma: q-band tuned per-bin and banker; per-patient monotone
pred_v2 = pd.read_csv('pred_quantile_deltas_v2.csv').set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()
band_te = np.abs(pred_v2['q80_d'].astype(float).values - pred_v2['q20_d'].astype(float).values)
sigma_banker_te = np.maximum(240.0 + 3.0 * abs_dist_te, 70.0)
sigma_banker_te = np.where(abs_dist_te > 20.0, np.maximum(sigma_banker_te, 100.0), sigma_banker_te)
sigma_from_band = np.zeros_like(abs_dist_te, dtype=float)
for (lo,hi) in bins:
    m = (abs_dist_te>lo) & (abs_dist_te<=hi)
    if np.any(m): sigma_from_band[m] = band_te[m] / best_c[(lo,hi)]
sigma_qband_te = np.maximum(sigma_from_band, sigma_banker_te)
if use_floor130: sigma_qband_te = np.where(abs_dist_te > 30.0, np.maximum(sigma_qband_te, 130.0), sigma_qband_te)
sigma_qband_te = np.where(abs_dist_te == 0.0, 70.0, sigma_qband_te)

def enforce_sigma_monotone(df):
    def _mono(g):
        g = g.sort_values('dist').copy()
        g['Sigma'] = np.maximum.accumulate(g['Sigma'].values)
        return g
    return df.groupby('Patient', as_index=False, group_keys=False).apply(_mono)

df_sig = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'dist': abs_dist_te, 'Sigma': sigma_qband_te.astype(float)})
df_sig = enforce_sigma_monotone(df_sig)
sigma_qband_final = df_sig['Sigma'].values.astype(float)

# Banker-only version
df_sig_b = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'dist': abs_dist_te, 'Sigma': sigma_banker_te.astype(float)})
df_sig_b = enforce_sigma_monotone(df_sig_b)
sigma_banker_final = df_sig_b['Sigma'].values.astype(float)

# 6) Save submissions; do not overwrite submission.csv automatically
sub_q = pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': fvc_final, 'Confidence': sigma_qband_final})
sub_b = pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': fvc_final, 'Confidence': sigma_banker_final})
sub_q.to_csv('submission_v3cat_qband.csv', index=False)
sub_b.to_csv('submission_v3cat_banker.csv', index=False)
print('Saved submission_v3cat_qband.csv and submission_v3cat_banker.csv. Elapsed {:.1f}s'.format(time.time()-t0))

[Avg-q50 Diag] OOF pre=1387 post=1229 pats=158
[Avg-q50 Diag] bin counts: [286, 438, 505]
[Avg-q50 Sigma] Bin (0.0,5.0] best c=1.50 OOF LL=-6.04091
[Avg-q50 Sigma] Bin (5.0,15.0] best c=1.60 OOF LL=-6.12936
[Avg-q50 Sigma] Bin (15.0,1000000000.0] best c=1.70 OOF LL=-6.41399
[Avg-q50 Sigma] >=130 test: LL_no130=-6.22573 LL_130=-6.22573 adopt=True
[Avg-q50 OOF] rows=1229 pats=158 LL_qband=-6.22573 | LL_banker=-6.22573
Saved submission_v3cat_qband.csv and submission_v3cat_banker.csv. Elapsed 0.1s


  df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(lambda g: enforce_non_increasing_tolerant(g, 25.0))
  return df.groupby('Patient', as_index=False, group_keys=False).apply(_mono)
  return df.groupby('Patient', as_index=False, group_keys=False).apply(_mono)


In [96]:
# Set submission.csv to averaged q50 (LGBM v3 + CatBoost v1) 2-model with banker sigma
import pandas as pd
src = 'submission_v3cat_banker.csv'
ss = pd.read_csv('sample_submission.csv')
sub = pd.read_csv(src)
assert sub.shape[0] == ss.shape[0], 'Row count mismatch vs sample_submission'
assert set(sub['Patient_Week'].astype(str)) == set(ss['Patient_Week'].astype(str)), 'Patient_Week sets differ'
assert sub['FVC'].notna().all() and sub['Confidence'].notna().all(), 'NaNs in v3cat_banker submission'
sub.to_csv('submission.csv', index=False)
print(f'submission.csv overwritten with {src}')

submission.csv overwritten with submission_v3cat_banker.csv


In [97]:
# Fast LB probes per expert: strict mono on v3cat_banker; LME-long variants (0.05, 0.10) with banker sigma
import numpy as np, pandas as pd

ss = pd.read_csv('sample_submission.csv')
test = pd.read_csv('test.csv')

# Build test grid with baseline to compute dist and Base_FVC
grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]; grid['Weeks'] = parts[1].astype(int)
test_base = test[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].rename(
    columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_base, on='Patient', how='left')
dist = (grid['Weeks'].values - grid['Base_Week'].values).astype(float)
abs_dist = np.abs(dist).astype(float)
base_fvc = grid['Base_FVC'].values.astype(float)

# 1) Strict monotonicity A/B on current best v3cat banker
sub_b = pd.read_csv('submission_v3cat_banker.csv').set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()
fvc_b = sub_b['FVC'].astype(float).clip(500, 6000).values

def enforce_non_increasing_strict(df):
    g = df.sort_values('Weeks').copy()
    f = g['FVC'].values.astype(float)
    # Strict non-increasing: f[i] <= f[i+1]
    for i in range(len(f)-2, -1, -1):
        f[i] = min(f[i], f[i+1])
    g['FVC'] = f
    return g

df_out = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'FVC': fvc_b})
df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_non_increasing_strict)
fvc_strict = df_out['FVC'].values.astype(float)
fvc_strict = np.where(abs_dist == 0.0, base_fvc, fvc_strict)
sigma_b = sub_b['Confidence'].astype(float).values  # keep sigma unchanged
sub_strict = pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': fvc_strict, 'Confidence': sigma_b})
sub_strict.to_csv('submission_v3cat_banker_strictmono.csv', index=False)

# 2) LME boost only in long horizon (|dist|>15): w_lme in {0.05, 0.10}; else keep 0.70*q50 + 0.30*anchor; banker sigma
from pathlib import Path

# Recompute v3cat backbone FVC from components to avoid using post-mono FVC
pred_v3 = pd.read_csv('pred_quantile_deltas_v3.csv').set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()
pred_cb = pd.read_csv('pred_quantile_deltas_cat_v1.csv').set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()
q50_d_avg = 0.5 * (pred_v3['q50_d'].astype(float).values + pred_cb['q50_d'].astype(float).values)

train = pd.read_csv('train.csv')
def compute_patient_slopes(df, patient_col='Patient', week_col='Weeks', target_col='FVC'):
    slopes = {}
    for pid, g in df.groupby(patient_col):
        if g.shape[0] >= 2:
            x = g[week_col].values.astype(float); y = g[target_col].values.astype(float)
            xm = x.mean(); ym = y.mean()
            denom = ((x - xm)**2).sum()
            slopes[pid] = ((x - xm) * (y - ym)).sum() / denom if denom > 0 else 0.0
    return slopes
def robust_global_slope(slopes_dict):
    if not slopes_dict: return 0.0
    return float(np.median(list(slopes_dict.values())))

gs_full = robust_global_slope(compute_patient_slopes(train))
fvc_q50 = base_fvc + q50_d_avg
fvc_anchor = base_fvc + gs_full * dist
fvc_base = 0.70 * fvc_q50 + 0.30 * fvc_anchor

# LME FVC from artifact
fvc_lme = pd.read_csv('submission_lme_banker.csv').set_index('Patient_Week').loc[ss['Patient_Week'], 'FVC'].astype(float).values
m_long = abs_dist > 15.0

def build_lme_long(w_lme):
    fvc_mix = fvc_base.copy()
    fvc_long = 0.60 * fvc_q50 + 0.30 * fvc_anchor + w_lme * fvc_lme
    fvc_mix[m_long] = fvc_long[m_long]
    fvc_mix = np.clip(fvc_mix, 500, 6000)
    # Tolerant monotonicity (+25 ml) as current default; pin dist==0 to Base_FVC
    def enforce_non_increasing_tolerant(g, tol=25.0):
        g = g.sort_values('Weeks').copy()
        f = g['FVC'].values.astype(float)
        for i in range(len(f)-2, -1, -1):
            f[i] = min(f[i], f[i+1] + tol)
        g['FVC'] = f
        return g
    df = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'FVC': fvc_mix})
    df = df.groupby('Patient', as_index=False, group_keys=False).apply(lambda g: enforce_non_increasing_tolerant(g, 25.0))
    fvc_final = df['FVC'].values.astype(float)
    fvc_final = np.where(abs_dist == 0.0, base_fvc, fvc_final)
    # Banker sigma with floors and per-patient monotone in |dist|
    sigma_banker = np.maximum(240.0 + 3.0 * abs_dist, 70.0)
    sigma_banker = np.where(abs_dist > 20.0, np.maximum(sigma_banker, 100.0), sigma_banker)
    df_sig = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'dist': abs_dist, 'Sigma': sigma_banker.astype(float)})
    def enforce_sigma_monotone(g):
        g = g.sort_values('dist').copy()
        g['Sigma'] = np.maximum.accumulate(g['Sigma'].values)
        return g
    df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)
    sigma_final = df_sig['Sigma'].values.astype(float)
    return fvc_final, sigma_final

for w in [0.05, 0.10]:
    fvc_l, sig_l = build_lme_long(w)
    pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': fvc_l, 'Confidence': sig_l}).to_csv(f'submission_v3cat_banker_lme{int(w*100):02d}long.csv', index=False)

# Set submission.csv to strict mono variant for immediate A/B submit
sub_strict.to_csv('submission.csv', index=False)
print('Saved submission_v3cat_banker_strictmono.csv, submission_v3cat_banker_lme05long.csv, submission_v3cat_banker_lme10long.csv; submission.csv set to strict mono.')

Saved submission_v3cat_banker_strictmono.csv, submission_v3cat_banker_lme05long.csv, submission_v3cat_banker_lme10long.csv; submission.csv set to strict mono.


  df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_non_increasing_strict)
  df = df.groupby('Patient', as_index=False, group_keys=False).apply(lambda g: enforce_non_increasing_tolerant(g, 25.0))
  df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)
  df = df.groupby('Patient', as_index=False, group_keys=False).apply(lambda g: enforce_non_increasing_tolerant(g, 25.0))
  df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)


In [94]:
# Set submission.csv to LME-long (w=0.10) variant per expert A/B plan
import pandas as pd
src = 'submission_v3cat_banker_lme10long.csv'
ss = pd.read_csv('sample_submission.csv')
sub = pd.read_csv(src)
assert sub.shape[0] == ss.shape[0], 'Row count mismatch vs sample_submission'
assert set(sub['Patient_Week'].astype(str)) == set(ss['Patient_Week'].astype(str)), 'Patient_Week sets differ'
assert sub['FVC'].notna().all() and sub['Confidence'].notna().all(), 'NaNs in LME-long submission'
sub.to_csv('submission.csv', index=False)
print(f'submission.csv overwritten with {src}')

submission.csv overwritten with submission_v3cat_banker_lme10long.csv


In [98]:
# Set submission.csv to LME-long (w=0.05) variant for A/B probe
import pandas as pd
src = 'submission_v3cat_banker_lme05long.csv'
ss = pd.read_csv('sample_submission.csv')
sub = pd.read_csv(src)
assert sub.shape[0] == ss.shape[0], 'Row count mismatch vs sample_submission'
assert set(sub['Patient_Week'].astype(str)) == set(ss['Patient_Week'].astype(str)), 'Patient_Week sets differ'
assert sub['FVC'].notna().all() and sub['Confidence'].notna().all(), 'NaNs in LME05-long submission'
sub.to_csv('submission.csv', index=False)
print(f'submission.csv overwritten with {src}')

submission.csv overwritten with submission_v3cat_banker_lme05long.csv


In [83]:
# Per-bin alpha re-optimization (cap <=0.30) on corrected OOF for avg q50 backbone; banker sigma; gated adoption
import numpy as np, pandas as pd, time
from sklearn.model_selection import GroupKFold

t0 = time.time()
train = pd.read_csv('train.csv')
ss = pd.read_csv('sample_submission.csv')

def laplace_ll_np(y_true, y_pred, sigma):
    y_true = np.asarray(y_true, float); y_pred = np.asarray(y_pred, float); sigma = np.asarray(sigma, float)
    delta = np.minimum(np.abs(y_true - y_pred), 1000.0)
    sigma = np.maximum(sigma, 70.0)
    return float(np.mean(-delta / sigma - np.log(sigma)))

def prepare_baseline_table(df):
    base = (df.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False).first())
    base = base[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].rename(
        columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
    return base

def compute_patient_slopes(df, patient_col='Patient', week_col='Weeks', target_col='FVC'):
    slopes = {}
    for pid, g in df.groupby(patient_col):
        if g.shape[0] >= 2:
            x = g[week_col].values.astype(float); y = g[target_col].values.astype(float)
            xm = x.mean(); ym = y.mean()
            denom = ((x - xm)**2).sum()
            slope = ((x - xm) * (y - ym)).sum() / denom if denom > 0 else 0.0
            slopes[pid] = slope
    return slopes

def robust_global_slope(slopes_dict):
    if not slopes_dict: return 0.0
    return float(np.median(list(slopes_dict.values())))

# 1) Build corrected OOF for averaged q50 (LGBM v3 + CatBoost v1) with per-fold anchors; dedupe and drop dist==0
oof_v3 = pd.read_csv('oof_quantile_lgbm_v3.csv')
oof_cb = pd.read_csv('oof_quantile_cat_v1.csv')
oof_v3 = (oof_v3.groupby(['Patient','Weeks'], as_index=False)
          .agg({'FVC':'first','Base_Week':'first','Base_FVC':'first','q50_delta_oof':'mean'}))
oof_cb = (oof_cb.groupby(['Patient','Weeks'], as_index=False)
          .agg({'q50_delta_oof':'mean'}))
oof = (oof_v3.rename(columns={'q50_delta_oof':'q50_v3'})
       .merge(oof_cb.rename(columns={'q50_delta_oof':'q50_cb'}), on=['Patient','Weeks'], how='inner'))
oof['dist'] = (oof['Weeks'] - oof['Base_Week']).astype(float)
pre = oof.shape[0]
oof = oof[(oof['dist'] >= 0)].copy()
oof = oof.sort_values(['Patient','Weeks']).drop_duplicates(['Patient','Weeks'])
oof = oof[oof['dist'] > 0].copy()

# Per-fold anchors
N_SPLITS = 5
gkf = GroupKFold(n_splits=N_SPLITS)
groups = train['Patient'].values
patient_to_fold, fold_to_gs = {}, {}
for fold, (trn_idx, val_idx) in enumerate(gkf.split(train, groups=groups), 1):
    trn_df = train.iloc[trn_idx]
    gs_fold = robust_global_slope(compute_patient_slopes(trn_df))
    fold_to_gs[fold] = gs_fold
    for p in train.iloc[val_idx]['Patient'].astype(str).unique():
        patient_to_fold[p] = fold
oof['fold'] = oof['Patient'].astype(str).map(patient_to_fold).astype(int)
oof['gs_fold'] = oof['fold'].map(fold_to_gs).astype(float)

# 2) Compute OOF predictions for a fixed 70/30 and for a per-bin alpha grid; banker sigma for scoring
base = oof['Base_FVC'].astype(float).values
dist = oof['dist'].astype(float).values; abs_dist = np.abs(dist).astype(float)
q50_avg = 0.5 * (oof['q50_v3'].astype(float).values + oof['q50_cb'].astype(float).values)
fvc_q50 = base + q50_avg
fvc_anchor = base + oof['gs_fold'].astype(float).values * dist
y_oof = oof['FVC'].astype(float).values
sigma_banker_oof = np.maximum(240.0 + 3.0 * abs_dist, 70.0)
sigma_banker_oof = np.where(abs_dist > 20.0, np.maximum(sigma_banker_oof, 100.0), sigma_banker_oof)

bins = [(0.0,5.0),(5.0,15.0),(15.0,1e9)]
masks = [ (abs_dist>lo) & (abs_dist<=hi) for lo,hi in bins ]

# Baseline fixed alpha=0.30 across bins
fvc_bl_fixed = np.zeros_like(y_oof)
for (lo,hi), m in zip(bins, masks):
    if np.any(m):
        a = 0.30
        fvc_bl_fixed[m] = (1.0 - a) * fvc_q50[m] + a * fvc_anchor[m]
ll_fixed = laplace_ll_np(y_oof, fvc_bl_fixed, sigma_banker_oof)
print(f"[AlphaOpt] Baseline fixed alpha=0.30 OOF LL={ll_fixed:.5f}")

# Grid alpha per bin in {0.20, 0.25, 0.30} with cap <=0.30
grid_alphas = [0.20, 0.25, 0.30]
best_alpha = {}
for (lo,hi), m in zip(bins, masks):
    if not np.any(m):
        best_alpha[(lo,hi)] = 0.30
        continue
    b_ll, b_a = -1e9, 0.30
    for a in grid_alphas:
        fvc_m = (1.0 - a) * fvc_q50[m] + a * fvc_anchor[m]
        ll = laplace_ll_np(y_oof[m], fvc_m, sigma_banker_oof[m])
        if ll > b_ll: b_ll, b_a = ll, a
    best_alpha[(lo,hi)] = b_a
    print(f"[AlphaOpt] Bin ({lo},{hi}] best alpha={b_a:.2f} OOF LL={b_ll:.5f}")

# Build blended OOF with optimized alphas and assess global gain
fvc_bl_opt = np.zeros_like(y_oof)
for (lo,hi), m in zip(bins, masks):
    a = best_alpha[(lo,hi)]
    if np.any(m): fvc_bl_opt[m] = (1.0 - a) * fvc_q50[m] + a * fvc_anchor[m]
ll_opt = laplace_ll_np(y_oof, fvc_bl_opt, sigma_banker_oof)
gain = ll_opt - ll_fixed
print(f"[AlphaOpt] OOF global LL_opt={ll_opt:.5f} gain={gain:+.5f} vs fixed 0.30")

# 3) If gain > 0.005, build TEST with per-bin alphas; else skip adoption
adopt = gain > 0.005
print(f"[AlphaOpt] Adopt per-bin alphas? {adopt}")

# Prepare TEST components
pred_v3 = pd.read_csv('pred_quantile_deltas_v3.csv').set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()
pred_cb = pd.read_csv('pred_quantile_deltas_cat_v1.csv').set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()
q50_d_te = 0.5 * (pred_v3['q50_d'].astype(float).values + pred_cb['q50_d'].astype(float).values)
grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]; grid['Weeks'] = parts[1].astype(int)
test = pd.read_csv('test.csv')
test_base = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_base, on='Patient', how='left')
dist_te = (grid['Weeks'] - grid['Base_Week']).astype(float).values
abs_dist_te = np.abs(dist_te).astype(float)
base_fvc_te = grid['Base_FVC'].astype(float).values
gs_full = robust_global_slope(compute_patient_slopes(train))
fvc_q50_te = base_fvc_te + q50_d_te
fvc_anchor_te = base_fvc_te + gs_full * dist_te

alpha_map = {}
for (lo,hi) in bins:
    alpha_map[(lo,hi)] = best_alpha[(lo,hi)] if adopt else 0.30

alpha_vec = np.zeros_like(abs_dist_te, dtype=float) + 0.30
for (lo,hi), a in alpha_map.items():
    m = (abs_dist_te>lo) & (abs_dist_te<=hi)
    if np.any(m): alpha_vec[m] = a

fvc_te = (1.0 - alpha_vec) * fvc_q50_te + alpha_vec * fvc_anchor_te
fvc_te = np.clip(fvc_te, 500, 6000)

# Tolerant monotonicity (+25 ml) and pin dist==0 to Base_FVC
df_out = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'FVC': fvc_te})
def enforce_non_increasing_tolerant(g, tol=25.0):
    g = g.sort_values('Weeks').copy()
    f = g['FVC'].values.astype(float)
    for i in range(len(f)-2, -1, -1):
        f[i] = min(f[i], f[i+1] + tol)
    g['FVC'] = f
    return g
df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(lambda g: enforce_non_increasing_tolerant(g, 25.0))
fvc_final = df_out['FVC'].values.astype(float)
fvc_final = np.where(abs_dist_te == 0.0, base_fvc_te, fvc_final)

# Banker sigma with floors and per-patient monotone in |dist|
sigma_banker_te = np.maximum(240.0 + 3.0 * abs_dist_te, 70.0)
sigma_banker_te = np.where(abs_dist_te > 20.0, np.maximum(sigma_banker_te, 100.0), sigma_banker_te)
df_sig = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'dist': abs_dist_te, 'Sigma': sigma_banker_te.astype(float)})
def enforce_sigma_monotone(g):
    g = g.sort_values('dist').copy()
    g['Sigma'] = np.maximum.accumulate(g['Sigma'].values)
    return g
df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)
sigma_final = df_sig['Sigma'].values.astype(float)

sub = pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': fvc_final, 'Confidence': sigma_final})
sub.to_csv('submission_v3cat_banker_alphaOpt.csv', index=False)
if adopt:
    sub.to_csv('submission.csv', index=False)
    print('Saved submission_v3cat_banker_alphaOpt.csv and set submission.csv (adopted per-bin alphas). Elapsed {:.1f}s'.format(time.time()-t0))
else:
    print('Saved submission_v3cat_banker_alphaOpt.csv (not adopted; ΔOOF <= 0.005). Elapsed {:.1f}s'.format(time.time()-t0))

[AlphaOpt] Baseline fixed alpha=0.30 OOF LL=-6.22573
[AlphaOpt] Bin (0.0,5.0] best alpha=0.30 OOF LL=-6.04091
[AlphaOpt] Bin (5.0,15.0] best alpha=0.30 OOF LL=-6.12936
[AlphaOpt] Bin (15.0,1000000000.0] best alpha=0.30 OOF LL=-6.41399
[AlphaOpt] OOF global LL_opt=-6.22573 gain=+0.00000 vs fixed 0.30
[AlphaOpt] Adopt per-bin alphas? False
Saved submission_v3cat_banker_alphaOpt.csv (not adopted; ΔOOF <= 0.005). Elapsed 0.1s


  df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(lambda g: enforce_non_increasing_tolerant(g, 25.0))
  df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)


In [84]:
# Quantile LGBM v3-bands: train q20 and q80 deltas (OOF + Test); reuse v3 q50 setup
import numpy as np, pandas as pd, time, gc
from sklearn.model_selection import GroupKFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
import lightgbm as lgb
from lightgbm import LGBMRegressor

t0 = time.time()
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
ss = pd.read_csv('sample_submission.csv')

def prepare_baseline_table(df):
    base = (df.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False).first())
    base = base[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].rename(
        columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
    return base

def one_hot_fit(df, cols):
    return {c: sorted(df[c].dropna().astype(str).unique().tolist()) for c in cols}

def one_hot_transform(df, cats):
    out = df.copy()
    for c, values in cats.items():
        col = df[c].astype(str)
        for v in values:
            out[f'{c}__{v}'] = (col == v).astype(np.int8)
    return out

def ecdf_rank_fit(x):
    xs = np.sort(np.asarray(x, dtype=float))
    return xs

def ecdf_rank_transform(x, xs):
    x = np.asarray(x, dtype=float)
    idx = np.searchsorted(xs, x, side='right')
    return idx / max(len(xs), 1)

def build_slope_features(base_df, ecdf_basefvc=None, ecdf_percent=None, cats=None, fit=False):
    b = base_df.copy()
    b['log_Base_FVC'] = np.log1p(np.maximum(b['Base_FVC'].astype(float), 1.0))
    b['BaseFVC_over_Age'] = b['Base_FVC'].astype(float) / np.maximum(b['Age'].astype(float), 1.0)
    b['PercentBase_over_Age'] = b['Percent_at_base'].astype(float) / np.maximum(b['Age'].astype(float), 1.0)
    if fit:
        ecdf_basefvc = ecdf_rank_fit(b['Base_FVC'].values)
        ecdf_percent = ecdf_rank_fit(b['Percent_at_base'].values)
    b['BaseFVC_ecdf'] = ecdf_rank_transform(b['Base_FVC'].values, ecdf_basefvc)
    b['Percent_ecdf'] = ecdf_rank_transform(b['Percent_at_base'].values, ecdf_percent)
    if fit:
        cats = one_hot_fit(b, ['Sex','SmokingStatus'])
    b = one_hot_transform(b, cats)
    num_cols = ['Age','Base_FVC','log_Base_FVC','Percent_at_base','BaseFVC_over_Age','PercentBase_over_Age','BaseFVC_ecdf','Percent_ecdf']
    cat_cols = [c for c in b.columns if c.startswith('Sex__') or c.startswith('SmokingStatus__')]
    feat_cols = num_cols + cat_cols
    return b, feat_cols, ecdf_basefvc, ecdf_percent, cats

def compute_patient_slopes(df, patient_col='Patient', week_col='Weeks', target_col='FVC'):
    slopes = {}
    for pid, g in df.groupby(patient_col):
        if g.shape[0] >= 2:
            x = g[week_col].values.astype(float); y = g[target_col].values.astype(float)
            xm = x.mean(); ym = y.mean()
            denom = ((x - xm)**2).sum()
            slope = ((x - xm) * (y - ym)).sum() / denom if denom > 0 else 0.0
            slopes[pid] = slope
    return slopes

def build_q_features(grid_df, base_df, ecdf_bf=None, ecdf_pc=None, cats=None, fit=False):
    d = grid_df.merge(base_df[['Patient','Base_Week','Base_FVC','Percent_at_base','Age','Sex','SmokingStatus']], on='Patient', how='left')
    d['dist'] = (d['Weeks'] - d['Base_Week']).astype(float)
    d = d[d['dist'] >= 0].copy()
    d['abs_dist'] = d['dist'].abs()
    d['log1p_abs_dist'] = np.log1p(d['abs_dist'])
    d['dist_cap'] = d['dist'].clip(0, 30)
    d['dist_short'] = d['dist'].clip(0, 5)
    d['dist_mid'] = (d['dist'] - 5).clip(lower=0, upper=10)
    d['dist_long'] = (d['dist'] - 15).clip(lower=0)
    d['dist2'] = d['dist']**2
    d['dist3'] = d['dist']**3
    d['Base_FVC'] = d['Base_FVC'].astype(float)
    d['Percent_at_base'] = d['Percent_at_base'].astype(float).clip(30, 120)
    d['Age'] = d['Age'].astype(float)
    d['log_Base_FVC'] = np.log1p(np.maximum(d['Base_FVC'], 1.0))
    d['Age_x_Percent'] = d['Age'] * d['Percent_at_base']
    d['BaseFVC_x_dist'] = d['Base_FVC'] * d['dist']
    d['dist_x_Age'] = d['dist'] * d['Age']
    d['dist_x_Percent'] = d['dist'] * d['Percent_at_base']
    d['BaseFVC_x_dshort'] = d['Base_FVC'] * d['dist_short']
    d['BaseFVC_x_dmid'] = d['Base_FVC'] * d['dist_mid']
    d['BaseFVC_x_dlong'] = d['Base_FVC'] * d['dist_long']
    if fit:
        ecdf_bf = ecdf_rank_fit(d['Base_FVC'].values)
        ecdf_pc = ecdf_rank_fit(d['Percent_at_base'].values)
        cats = one_hot_fit(d, ['Sex','SmokingStatus'])
    d['BaseFVC_ecdf'] = ecdf_rank_transform(d['Base_FVC'].values, ecdf_bf)
    d['Percent_ecdf'] = ecdf_rank_transform(d['Percent_at_base'].values, ecdf_pc)
    d = one_hot_transform(d, cats)
    d['BFV_decile'] = np.floor(d['BaseFVC_ecdf'] * 10).clip(0, 9).astype(int)
    for k in range(10):
        d[f'BFV_decile__{k}'] = (d['BFV_decile'] == k).astype(np.int8)
    feat_cols = [
        'Age','Base_FVC','log_Base_FVC','Percent_at_base','BaseFVC_ecdf','Percent_ecdf',
        'dist','abs_dist','log1p_abs_dist','dist_cap','dist_short','dist_mid','dist_long','dist2','dist3',
        'Age_x_Percent','BaseFVC_x_dist','dist_x_Age','dist_x_Percent','BaseFVC_x_dshort','BaseFVC_x_dmid','BaseFVC_x_dlong','s_hat'
    ] + [c for c in d.columns if c.startswith('Sex__') or c.startswith('SmokingStatus__') or c.startswith('BFV_decile__')]
    for c in feat_cols:
        if c not in d.columns: d[c] = 0.0
    return d, feat_cols, ecdf_bf, ecdf_pc, cats

def fit_s_hat_fold(trn_df, base_trn):
    slopes_trn = compute_patient_slopes(trn_df)
    slope_labels_trn = pd.DataFrame({'Patient': list(slopes_trn.keys()), 's_label': list(slopes_trn.values())})
    base_trn_lab = base_trn.merge(slope_labels_trn, on='Patient', how='left')
    bf_trn, f_cols_s, ecdf_bf_s, ecdf_pc_s, cats_s = build_slope_features(base_trn_lab, fit=True)
    scaler_s = StandardScaler(with_mean=True, with_std=True).fit(bf_trn[f_cols_s].values.astype(float))
    Xs_tr = scaler_s.transform(bf_trn[f_cols_s].values.astype(float))
    y_s = bf_trn['s_label'].fillna(0.0).values.astype(float)
    ridge = Ridge(alpha=1.0, random_state=42).fit(Xs_tr, y_s)
    knn = KNeighborsRegressor(n_neighbors=9, weights='distance').fit(Xs_tr, y_s)
    q_lo, q_hi = np.percentile(y_s, [5,95])
    def get_s_hat_map(base_df_patients):
        bf_pred, _, _, _, _ = build_slope_features(base_df_patients, ecdf_bf_s, ecdf_pc_s, cats_s, fit=False)
        Xs = scaler_s.transform(bf_pred[f_cols_s].values.astype(float))
        s = 0.8*ridge.predict(Xs) + 0.2*knn.predict(Xs)
        s = np.clip(s, q_lo, q_hi)
        return dict(zip(bf_pred['Patient'].values, s))
    return get_s_hat_map

# Config
alphas = [0.2, 0.8]  # q20 and q80
seeds = [1337, 2027, 3037]
params = dict(objective='quantile', metric='quantile',
              n_estimators=2400, learning_rate=0.032,
              num_leaves=31, max_depth=6, min_data_in_leaf=24,
              subsample=0.75, colsample_bytree=0.75,
              reg_alpha=0.1, reg_lambda=0.2, n_jobs=-1, verbose=-1)

# OOF container
oof_df = train[['Patient','Weeks','FVC']].copy()
oof_df['q20_delta_oof'] = np.nan
oof_df['q80_delta_oof'] = np.nan

# Static TEST grid and SS index map
grid_te = ss.copy()
parts = grid_te['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid_te['Patient'] = parts[0]; grid_te['Weeks'] = parts[1].astype(int)
test_base = test[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].rename(
    columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid_te_idx = grid_te[['Patient','Weeks']].copy()
grid_te_idx['ss_idx'] = np.arange(grid_te_idx.shape[0], dtype=int)

# Accumulators for TEST deltas per alpha
test_preds = {0.2: np.zeros(ss.shape[0], dtype=float), 0.8: np.zeros(ss.shape[0], dtype=float)}

gkf = GroupKFold(n_splits=5)
groups = train['Patient'].values

for fold, (trn_idx, val_idx) in enumerate(gkf.split(train, groups=groups), 1):
    tf = time.time()
    trn_df = train.iloc[trn_idx].copy(); val_df = train.iloc[val_idx].copy()
    base_trn = prepare_baseline_table(trn_df)
    base_val = prepare_baseline_table(val_df)
    # s_hat maps
    get_s_hat_map = fit_s_hat_fold(trn_df, base_trn)
    s_map_trn = get_s_hat_map(base_trn)
    s_map_val = get_s_hat_map(base_val)
    base_test = grid_te[['Patient']].drop_duplicates().merge(test_base.drop_duplicates('Patient'), on='Patient', how='left')
    s_map_test = get_s_hat_map(base_test[['Patient','Base_Week','Base_FVC','Percent_at_base','Age','Sex','SmokingStatus']])

    # Future-only train/val with s_hat
    trn = trn_df.merge(base_trn[['Patient','Base_Week','Base_FVC','Percent_at_base','Age','Sex','SmokingStatus']], on='Patient', how='left')
    val = val_df.merge(base_val[['Patient','Base_Week','Base_FVC','Percent_at_base','Age','Sex','SmokingStatus']], on='Patient', how='left')
    trn['dist'] = (trn['Weeks'] - trn['Base_Week']).astype(float); trn = trn[trn['dist'] >= 0].copy()
    val['dist'] = (val['Weeks'] - val['Base_Week']).astype(float); val = val[val['dist'] >= 0].copy()
    trn['s_hat'] = trn['Patient'].map(s_map_trn).astype(float).fillna(0.0)
    val['s_hat'] = val['Patient'].map(s_map_val).astype(float).fillna(0.0)

    # Features (fit on TRAIN fold)
    trn_feat, feat_cols, ecdf_bf, ecdf_pc, cats = build_q_features(trn[['Patient','Weeks']].copy(), base_trn, fit=True)
    trn_feat['s_hat'] = trn_feat['Patient'].map(s_map_trn).astype(float).fillna(0.0)
    val_feat, _, _, _, _ = build_q_features(val[['Patient','Weeks']].copy(), base_val, ecdf_bf, ecdf_pc, cats, fit=False)
    val_feat['s_hat'] = val_feat['Patient'].map(s_map_val).astype(float).fillna(0.0)

    # Strict alignment
    trn_feat_aligned = trn_feat.merge(trn[['Patient','Weeks','FVC']], on=['Patient','Weeks'], how='inner')
    val_feat_aligned = val_feat.merge(val[['Patient','Weeks','FVC']], on=['Patient','Weeks'], how='inner')

    y_tr_delta = (trn_feat_aligned['FVC'].astype(float).values - trn_feat_aligned['Base_FVC'].astype(float).values)
    y_va_delta = (val_feat_aligned['FVC'].astype(float).values - val_feat_aligned['Base_FVC'].astype(float).values)
    X_tr = trn_feat_aligned[feat_cols].values.astype(float)
    X_va = val_feat_aligned[feat_cols].values.astype(float)

    if X_tr.shape[0] == 0 or X_va.shape[0] == 0:
        print(f'[v3-bands Fold {fold}] skipped (X_tr={X_tr.shape[0]}, X_va={X_va.shape[0]})', flush=True)
        del trn_df, val_df, trn, val, trn_feat, val_feat, trn_feat_aligned, val_feat_aligned
        gc.collect()
        continue

    # TEST features under TRAIN-fold transforms; align to ss via index map
    te_feat, _, _, _, _ = build_q_features(grid_te[['Patient','Weeks']].copy(), test_base, ecdf_bf, ecdf_pc, cats, fit=False)
    te_feat['s_hat'] = te_feat['Patient'].map(s_map_test).astype(float).fillna(0.0)
    X_te = te_feat[feat_cols].values.astype(float)
    te_keys = te_feat[['Patient','Weeks']].copy().merge(grid_te_idx, on=['Patient','Weeks'], how='left')
    te_idx = te_keys['ss_idx'].values.astype(int)

    # For each alpha (0.2, 0.8), seed-bag models and accumulate
    for a in alphas:
        val_pred_sum = np.zeros(X_va.shape[0], dtype=float)
        test_pred_sum_fold = np.zeros(ss.shape[0], dtype=float)
        for si, sd in enumerate(seeds):
            lr = params['learning_rate'] + (0.002 if (si % 2 == 0) else -0.002)
            mdl = LGBMRegressor(**{**params, 'alpha': a, 'learning_rate': lr}, random_state=sd)
            mdl.fit(X_tr, y_tr_delta,
                    eval_set=[(X_va, y_va_delta)],
                    eval_metric='quantile',
                    callbacks=[lgb.early_stopping(200, verbose=False)])
            val_pred_sum += mdl.predict(X_va, num_iteration=mdl.best_iteration_)
            pred_te = mdl.predict(X_te, num_iteration=mdl.best_iteration_)
            test_pred_sum_fold[te_idx] += pred_te
            del mdl
        val_pred_avg = val_pred_sum / max(len(seeds), 1)
        test_pred_avg_fold = test_pred_sum_fold / max(len(seeds), 1)
        # Write OOF deltas for this alpha
        keys = val_feat_aligned[['Patient','Weeks']].reset_index(drop=True)
        col = 'q20_delta_oof' if np.isclose(a, 0.2) else 'q80_delta_oof'
        block = pd.DataFrame({'Patient': keys['Patient'].astype(str), 'Weeks': keys['Weeks'].astype(int), col: val_pred_avg})
        oof_df = oof_df.merge(block, on=['Patient','Weeks'], how='left', suffixes=('', '_new'))
        oof_df[col] = oof_df[col].fillna(oof_df[col + '_new'])
        oof_df.drop(columns=[col + '_new'], inplace=True)
        # Accumulate test preds
        test_preds[a] += (test_pred_avg_fold / 5.0)

    print(f'[v3-bands Fold {fold}] trn={X_tr.shape[0]} val={X_va.shape[0]} elapsed={time.time()-tf:.2f}s', flush=True)
    del trn_df, val_df, trn, val, trn_feat, val_feat, trn_feat_aligned, val_feat_aligned, X_tr, X_va, X_te, te_feat, te_idx, te_keys
    gc.collect()

# Save OOF bands with baseline for downstream use
train_base = prepare_baseline_table(train)
oof_save = oof_df.dropna(subset=['q20_delta_oof','q80_delta_oof']).merge(train_base[['Patient','Base_Week','Base_FVC']], on='Patient', how='left')
oof_save.to_csv('oof_quantile_lgbm_v3_bands.csv', index=False)

# Save TEST bands aligned to ss
pd.DataFrame({
    'Patient_Week': ss['Patient_Week'],
    'q20_d': test_preds[0.2].astype(float),
    'q80_d': test_preds[0.8].astype(float)
}).to_csv('pred_quantile_deltas_v3_bands.csv', index=False)

print(f'Saved oof_quantile_lgbm_v3_bands.csv and pred_quantile_deltas_v3_bands.csv. Elapsed {time.time()-t0:.1f}s')

[v3-bands Fold 1] trn=1124 val=284 elapsed=1.09s


[v3-bands Fold 2] trn=1127 val=281 elapsed=1.18s


[v3-bands Fold 3] trn=1129 val=279 elapsed=1.25s


[v3-bands Fold 4] trn=1129 val=279 elapsed=1.12s


[v3-bands Fold 5] trn=1123 val=285 elapsed=1.19s


Saved oof_quantile_lgbm_v3_bands.csv and pred_quantile_deltas_v3_bands.csv. Elapsed 6.5s


In [85]:
# Step 3c: Use v3 quantile bands (q20/q80) to retune sigma for averaged q50 (LGBM v3 + CatBoost v1); banker floor; save submission
import numpy as np, pandas as pd, time
from sklearn.model_selection import GroupKFold

t0 = time.time()
train = pd.read_csv('train.csv')
ss = pd.read_csv('sample_submission.csv')

def laplace_ll_np(y_true, y_pred, sigma):
    y_true = np.asarray(y_true, float); y_pred = np.asarray(y_pred, float); sigma = np.asarray(sigma, float)
    delta = np.minimum(np.abs(y_true - y_pred), 1000.0)
    sigma = np.maximum(sigma, 70.0)
    return float(np.mean(-delta / sigma - np.log(sigma)))

def prepare_baseline_table(df):
    base = (df.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False).first())
    base = base[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].rename(
        columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
    return base

def compute_patient_slopes(df, patient_col='Patient', week_col='Weeks', target_col='FVC'):
    slopes = {}
    for pid, g in df.groupby(patient_col):
        if g.shape[0] >= 2:
            x = g[week_col].values.astype(float); y = g[target_col].values.astype(float)
            xm = x.mean(); ym = y.mean()
            denom = ((x - xm)**2).sum()
            slope = ((x - xm) * (y - ym)).sum() / denom if denom > 0 else 0.0
            slopes[pid] = slope
    return slopes

def robust_global_slope(slopes_dict):
    if not slopes_dict: return 0.0
    return float(np.median(list(slopes_dict.values())))

# 1) Build corrected OOF for averaged q50 with v3 bands merged (dedupe, drop dist==0)
oof_v3 = pd.read_csv('oof_quantile_lgbm_v3.csv')  # q50 deltas
oof_cb = pd.read_csv('oof_quantile_cat_v1.csv')   # q50 deltas
oof_bands = pd.read_csv('oof_quantile_lgbm_v3_bands.csv')[['Patient','Weeks','q20_delta_oof','q80_delta_oof']]

# Deduplicate/average per (Patient, Weeks)
oof_v3 = (oof_v3.groupby(['Patient','Weeks'], as_index=False)
          .agg({'FVC':'first','Base_Week':'first','Base_FVC':'first','q50_delta_oof':'mean'}))
oof_cb = (oof_cb.groupby(['Patient','Weeks'], as_index=False)
          .agg({'q50_delta_oof':'mean'}))
oof_bands = (oof_bands.groupby(['Patient','Weeks'], as_index=False)
             .agg({'q20_delta_oof':'mean','q80_delta_oof':'mean'}))

oof = (oof_v3.rename(columns={'q50_delta_oof':'q50_v3'})
       .merge(oof_cb.rename(columns={'q50_delta_oof':'q50_cb'}), on=['Patient','Weeks'], how='inner')
       .merge(oof_bands, on=['Patient','Weeks'], how='inner'))

oof['dist'] = (oof['Weeks'] - oof['Base_Week']).astype(float)
pre = oof.shape[0]
oof = oof[(oof['dist'] >= 0) & oof[['q50_v3','q50_cb','q20_delta_oof','q80_delta_oof']].notna().all(axis=1)].copy()
oof = oof.sort_values(['Patient','Weeks']).drop_duplicates(['Patient','Weeks'])
oof = oof[oof['dist'] > 0].copy()
print(f"[v3-bands Diag] OOF pre={pre} post={oof.shape[0]} pats={oof['Patient'].nunique()}")

# Per-fold anchor (TRAIN-only)
N_SPLITS = 5
gkf = GroupKFold(n_splits=N_SPLITS)
groups = train['Patient'].values
patient_to_fold, fold_to_gs = {}, {}
for fold, (trn_idx, val_idx) in enumerate(gkf.split(train, groups=groups), 1):
    trn_df = train.iloc[trn_idx]
    gs_fold = robust_global_slope(compute_patient_slopes(trn_df))
    fold_to_gs[fold] = gs_fold
    for p in train.iloc[val_idx]['Patient'].astype(str).unique():
        patient_to_fold[p] = fold
oof['fold'] = oof['Patient'].astype(str).map(patient_to_fold).astype(int)
oof['gs_fold'] = oof['fold'].map(fold_to_gs).astype(float)

# 2) 70/30 averaged q50 + anchor OOF; band from v3 bands
base = oof['Base_FVC'].astype(float).values
dist = oof['dist'].astype(float).values
abs_dist = np.abs(dist).astype(float)
q50_avg = 0.5 * (oof['q50_v3'].astype(float).values + oof['q50_cb'].astype(float).values)
fvc_q50_oof = base + q50_avg
fvc_anchor_oof = base + oof['gs_fold'].astype(float).values * dist
fvc_blend_oof = 0.70 * fvc_q50_oof + 0.30 * fvc_anchor_oof
y_oof = oof['FVC'].astype(float).values
band_oof = np.abs(oof['q80_delta_oof'].astype(float).values - oof['q20_delta_oof'].astype(float).values)

sigma_banker_oof = np.maximum(240.0 + 3.0 * abs_dist, 70.0)
sigma_banker_oof = np.where(abs_dist > 20.0, np.maximum(sigma_banker_oof, 100.0), sigma_banker_oof)

bins = [(0.0,5.0),(5.0,15.0),(15.0,1e9)]
masks = [ (abs_dist>lo) & (abs_dist<=hi) for lo,hi in bins ]
print('[v3-bands Diag] bin counts:', [int(m.sum()) for m in masks])

c_grid_short_mid = [1.3,1.4,1.5,1.6,1.7,1.8,2.0]
c_grid_long = [1.3,1.4,1.5,1.6,1.7,1.8,2.0,2.3,2.4,2.5,2.6]
best_c = {}
for (lo,hi), m in zip(bins, masks):
    if not np.any(m):
        best_c[(lo,hi)] = 1.8
        print(f'[v3-bands Sigma] Bin ({lo},{hi}] empty; c=1.8')
        continue
    grid_c = c_grid_short_mid if hi<=15.0 else c_grid_long
    b_ll, b_c = -1e9, 1.8
    for c in grid_c:
        sig = np.maximum(band_oof[m] / c, sigma_banker_oof[m])
        ll = laplace_ll_np(y_oof[m], fvc_blend_oof[m], sig)
        if ll > b_ll: b_ll, b_c = ll, c
    best_c[(lo,hi)] = b_c
    print(f"[v3-bands Sigma] Bin ({lo},{hi}] best c={b_c:.2f} OOF LL={b_ll:.5f}")

# Optional >=130 for |dist|>30 (adopt if OOF-neutral)
m_gt30 = abs_dist > 30.0
use_floor130 = False
if np.any(m_gt30):
    sig_base = np.zeros_like(abs_dist)
    for (lo,hi), m in zip(bins, masks):
        if np.any(m): sig_base[m] = np.maximum(band_oof[m] / best_c[(lo,hi)], sigma_banker_oof[m])
    ll_no130 = laplace_ll_np(y_oof, fvc_blend_oof, sig_base)
    sig_130 = np.where(m_gt30, np.maximum(sig_base, 130.0), sig_base)
    ll_130 = laplace_ll_np(y_oof, fvc_blend_oof, sig_130)
    use_floor130 = (ll_130 >= ll_no130 - 1e-6)
    print(f"[v3-bands Sigma] >=130 test: LL_no130={ll_no130:.5f} LL_130={ll_130:.5f} adopt={use_floor130}")

sig_oof = np.zeros_like(abs_dist)
for (lo,hi), m in zip(bins, masks):
    if np.any(m): sig_oof[m] = np.maximum(band_oof[m] / best_c[(lo,hi)], sigma_banker_oof[m])
if use_floor130: sig_oof = np.where(abs_dist > 30.0, np.maximum(sig_oof, 130.0), sig_oof)
ll_qb = laplace_ll_np(y_oof, fvc_blend_oof, sig_oof)
ll_bk = laplace_ll_np(y_oof, fvc_blend_oof, sigma_banker_oof)
print(f"[v3-bands OOF] rows={oof.shape[0]} pats={oof['Patient'].nunique()} LL_qband={ll_qb:.5f} | LL_banker={ll_bk:.5f}")

# 3) TEST: averaged q50 + anchor; sigma from v3 bands with tuned c, floored by banker; monotone
pred_v3 = pd.read_csv('pred_quantile_deltas_v3.csv').set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()
pred_cb = pd.read_csv('pred_quantile_deltas_cat_v1.csv').set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()
pred_bands_v3 = pd.read_csv('pred_quantile_deltas_v3_bands.csv').set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()
q50_d_te = 0.5 * (pred_v3['q50_d'].astype(float).values + pred_cb['q50_d'].astype(float).values)
band_te = np.abs(pred_bands_v3['q80_d'].astype(float).values - pred_bands_v3['q20_d'].astype(float).values)

grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]; grid['Weeks'] = parts[1].astype(int)
test = pd.read_csv('test.csv')
test_base = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_base, on='Patient', how='left')
dist_te = (grid['Weeks'] - grid['Base_Week']).astype(float).values
abs_dist_te = np.abs(dist_te).astype(float)
base_fvc_te = grid['Base_FVC'].astype(float).values

gs_full = robust_global_slope(compute_patient_slopes(train))
fvc_q50_te = base_fvc_te + q50_d_te
fvc_anchor_te = base_fvc_te + gs_full * dist_te
fvc_te = 0.70 * fvc_q50_te + 0.30 * fvc_anchor_te
fvc_te = np.clip(fvc_te, 500, 6000)

# Tolerant non-increasing per patient (+25 ml) then pin dist==0
df_out = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'FVC': fvc_te})
def enforce_non_increasing_tolerant(g, tol=25.0):
    g = g.sort_values('Weeks').copy()
    f = g['FVC'].values.astype(float)
    for i in range(len(f)-2, -1, -1):
        f[i] = min(f[i], f[i+1] + tol)
    g['FVC'] = f
    return g
df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(lambda g: enforce_non_increasing_tolerant(g, 25.0))
fvc_final = df_out['FVC'].values.astype(float)
fvc_final = np.where(abs_dist_te == 0.0, base_fvc_te, fvc_final)

# Sigma: v3 band/c per-bin floored by banker; dist==0->70; per-patient monotone
sigma_banker_te = np.maximum(240.0 + 3.0 * abs_dist_te, 70.0)
sigma_banker_te = np.where(abs_dist_te > 20.0, np.maximum(sigma_banker_te, 100.0), sigma_banker_te)
sigma_from_band = np.zeros_like(abs_dist_te, dtype=float)
for (lo,hi), c in best_c.items():
    m = (abs_dist_te>lo) & (abs_dist_te<=hi)
    if np.any(m): sigma_from_band[m] = band_te[m] / c
sigma_qband_te = np.maximum(sigma_from_band, sigma_banker_te)
if use_floor130: sigma_qband_te = np.where(abs_dist_te > 30.0, np.maximum(sigma_qband_te, 130.0), sigma_qband_te)
sigma_qband_te = np.where(abs_dist_te == 0.0, 70.0, sigma_qband_te)

def enforce_sigma_monotone(df):
    def _mono(g):
        g = g.sort_values('dist').copy()
        g['Sigma'] = np.maximum.accumulate(g['Sigma'].values)
        return g
    return df.groupby('Patient', as_index=False, group_keys=False).apply(_mono)

df_sig = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'dist': abs_dist_te, 'Sigma': sigma_qband_te.astype(float)})
df_sig = enforce_sigma_monotone(df_sig)
sigma_qband_final = df_sig['Sigma'].values.astype(float)

sub_q = pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': fvc_final, 'Confidence': sigma_qband_final})
sub_q.to_csv('submission_v3cat_qband_v3bands.csv', index=False)
print('Saved submission_v3cat_qband_v3bands.csv. Elapsed {:.1f}s'.format(time.time()-t0))

[v3-bands Diag] OOF pre=1387 post=1229 pats=158
[v3-bands Diag] bin counts: [286, 438, 505]
[v3-bands Sigma] Bin (0.0,5.0] best c=1.50 OOF LL=-6.04091
[v3-bands Sigma] Bin (5.0,15.0] best c=1.60 OOF LL=-6.12936
[v3-bands Sigma] Bin (15.0,1000000000.0] best c=1.60 OOF LL=-6.41386
[v3-bands Sigma] >=130 test: LL_no130=-6.22568 LL_130=-6.22568 adopt=True
[v3-bands OOF] rows=1229 pats=158 LL_qband=-6.22568 | LL_banker=-6.22573
Saved submission_v3cat_qband_v3bands.csv. Elapsed 0.1s


  df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(lambda g: enforce_non_increasing_tolerant(g, 25.0))
  return df.groupby('Patient', as_index=False, group_keys=False).apply(_mono)


In [86]:
# Set submission.csv to v3cat q-band with v3 bands (A/B vs banker)
import pandas as pd
src = 'submission_v3cat_qband_v3bands.csv'
ss = pd.read_csv('sample_submission.csv')
sub = pd.read_csv(src)
assert sub.shape[0] == ss.shape[0], 'Row count mismatch vs sample_submission'
assert set(sub['Patient_Week'].astype(str)) == set(ss['Patient_Week'].astype(str)), 'Patient_Week sets differ'
assert sub['FVC'].notna().all() and sub['Confidence'].notna().all(), 'NaNs in v3cat_qband_v3bands submission'
sub.to_csv('submission.csv', index=False)
print(f'submission.csv overwritten with {src}')

submission.csv overwritten with submission_v3cat_qband_v3bands.csv


In [87]:
# A/B: strict non-increasing FVC on v3cat_qband_v3bands; keep sigma; set as submission.csv
import numpy as np, pandas as pd

ss = pd.read_csv('sample_submission.csv')
test = pd.read_csv('test.csv')

# Build grid for dist and Base_FVC
grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]; grid['Weeks'] = parts[1].astype(int)
test_base = test[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_base, on='Patient', how='left')
dist = (grid['Weeks'].values - grid['Base_Week'].values).astype(float)
abs_dist = np.abs(dist).astype(float)
base_fvc = grid['Base_FVC'].values.astype(float)

# Load v3bands q-band submission
sub = pd.read_csv('submission_v3cat_qband_v3bands.csv').set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()
fvc = sub['FVC'].astype(float).clip(500, 6000).values
sigma = sub['Confidence'].astype(float).values  # unchanged

def enforce_non_increasing_strict(g):
    g = g.sort_values('Weeks').copy()
    f = g['FVC'].values.astype(float)
    for i in range(len(f)-2, -1, -1):
        f[i] = min(f[i], f[i+1])
    g['FVC'] = f
    return g

df_out = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'FVC': fvc})
df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_non_increasing_strict)
fvc_strict = df_out['FVC'].values.astype(float)
fvc_strict = np.where(abs_dist == 0.0, base_fvc, fvc_strict)

sub_strict = pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': fvc_strict, 'Confidence': sigma})
sub_strict.to_csv('submission_v3cat_qband_v3bands_strictmono.csv', index=False)
sub_strict.to_csv('submission.csv', index=False)
print('Saved submission_v3cat_qband_v3bands_strictmono.csv and set submission.csv (strict FVC monotonicity; sigma unchanged).')

Saved submission_v3cat_qband_v3bands_strictmono.csv and set submission.csv (strict FVC monotonicity; sigma unchanged).


  df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_non_increasing_strict)


In [88]:
# CatBoost Quantile v1-bands: train q20 and q80 deltas (OOF + Test) to match averaged q50 backbone
import numpy as np, pandas as pd, time, gc
from sklearn.model_selection import GroupKFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from catboost import CatBoostRegressor

t0 = time.time()
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
ss = pd.read_csv('sample_submission.csv')

def prepare_baseline_table(df):
    base = (df.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False).first())
    base = base[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].rename(
        columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
    return base

def one_hot_fit(df, cols):
    return {c: sorted(df[c].dropna().astype(str).unique().tolist()) for c in cols}

def one_hot_transform(df, cats):
    out = df.copy()
    for c, values in cats.items():
        col = df[c].astype(str)
        for v in values:
            out[f'{c}__{v}'] = (col == v).astype(np.int8)
    return out

def ecdf_rank_fit(x):
    xs = np.sort(np.asarray(x, dtype=float))
    return xs

def ecdf_rank_transform(x, xs):
    x = np.asarray(x, dtype=float)
    idx = np.searchsorted(xs, x, side='right')
    return idx / max(len(xs), 1)

def build_slope_features(base_df, ecdf_basefvc=None, ecdf_percent=None, cats=None, fit=False):
    b = base_df.copy()
    b['log_Base_FVC'] = np.log1p(np.maximum(b['Base_FVC'].astype(float), 1.0))
    b['BaseFVC_over_Age'] = b['Base_FVC'].astype(float) / np.maximum(b['Age'].astype(float), 1.0)
    b['PercentBase_over_Age'] = b['Percent_at_base'].astype(float) / np.maximum(b['Age'].astype(float), 1.0)
    if fit:
        ecdf_basefvc = ecdf_rank_fit(b['Base_FVC'].values)
        ecdf_percent = ecdf_rank_fit(b['Percent_at_base'].values)
    b['BaseFVC_ecdf'] = ecdf_rank_transform(b['Base_FVC'].values, ecdf_basefvc)
    b['Percent_ecdf'] = ecdf_rank_transform(b['Percent_at_base'].values, ecdf_percent)
    if fit:
        cats = one_hot_fit(b, ['Sex','SmokingStatus'])
    b = one_hot_transform(b, cats)
    num_cols = ['Age','Base_FVC','log_Base_FVC','Percent_at_base','BaseFVC_over_Age','PercentBase_over_Age','BaseFVC_ecdf','Percent_ecdf']
    cat_cols = [c for c in b.columns if c.startswith('Sex__') or c.startswith('SmokingStatus__')]
    feat_cols = num_cols + cat_cols
    return b, feat_cols, ecdf_basefvc, ecdf_percent, cats

def compute_patient_slopes(df, patient_col='Patient', week_col='Weeks', target_col='FVC'):
    slopes = {}
    for pid, g in df.groupby(patient_col):
        if g.shape[0] >= 2:
            x = g[week_col].values.astype(float); y = g[target_col].values.astype(float)
            xm = x.mean(); ym = y.mean()
            denom = ((x - xm)**2).sum()
            slopes[pid] = ((x - xm) * (y - ym)).sum() / denom if denom > 0 else 0.0
    return slopes

def build_q_features(grid_df, base_df, ecdf_bf=None, ecdf_pc=None, cats=None, fit=False):
    d = grid_df.merge(base_df[['Patient','Base_Week','Base_FVC','Percent_at_base','Age','Sex','SmokingStatus']], on='Patient', how='left')
    d['dist'] = (d['Weeks'] - d['Base_Week']).astype(float)
    d = d[d['dist'] >= 0].copy()
    d['abs_dist'] = d['dist'].abs()
    d['log1p_abs_dist'] = np.log1p(d['abs_dist'])
    d['dist_cap'] = d['dist'].clip(0, 30)
    d['dist_short'] = d['dist'].clip(0, 5)
    d['dist_mid'] = (d['dist'] - 5).clip(lower=0, upper=10)
    d['dist_long'] = (d['dist'] - 15).clip(lower=0)
    d['dist2'] = d['dist']**2
    d['dist3'] = d['dist']**3
    d['Base_FVC'] = d['Base_FVC'].astype(float)
    d['Percent_at_base'] = d['Percent_at_base'].astype(float).clip(30, 120)
    d['Age'] = d['Age'].astype(float)
    d['log_Base_FVC'] = np.log1p(np.maximum(d['Base_FVC'], 1.0))
    d['Age_x_Percent'] = d['Age'] * d['Percent_at_base']
    d['BaseFVC_x_dist'] = d['Base_FVC'] * d['dist']
    d['dist_x_Age'] = d['dist'] * d['Age']
    d['dist_x_Percent'] = d['dist'] * d['Percent_at_base']
    d['BaseFVC_x_dshort'] = d['Base_FVC'] * d['dist_short']
    d['BaseFVC_x_dmid'] = d['Base_FVC'] * d['dist_mid']
    d['BaseFVC_x_dlong'] = d['Base_FVC'] * d['dist_long']
    if fit:
        ecdf_bf = ecdf_rank_fit(d['Base_FVC'].values)
        ecdf_pc = ecdf_rank_fit(d['Percent_at_base'].values)
        cats = one_hot_fit(d, ['Sex','SmokingStatus'])
    d['BaseFVC_ecdf'] = ecdf_rank_transform(d['Base_FVC'].values, ecdf_bf)
    d['Percent_ecdf'] = ecdf_rank_transform(d['Percent_at_base'].values, ecdf_pc)
    d = one_hot_transform(d, cats)
    d['BFV_decile'] = np.floor(d['BaseFVC_ecdf'] * 10).clip(0, 9).astype(int)
    for k in range(10):
        d[f'BFV_decile__{k}'] = (d['BFV_decile'] == k).astype(np.int8)
    feat_cols = [
        'Age','Base_FVC','log_Base_FVC','Percent_at_base','BaseFVC_ecdf','Percent_ecdf',
        'dist','abs_dist','log1p_abs_dist','dist_cap','dist_short','dist_mid','dist_long','dist2','dist3',
        'Age_x_Percent','BaseFVC_x_dist','dist_x_Age','dist_x_Percent','BaseFVC_x_dshort','BaseFVC_x_dmid','BaseFVC_x_dlong','s_hat'
    ] + [c for c in d.columns if c.startswith('Sex__') or c.startswith('SmokingStatus__') or c.startswith('BFV_decile__')]
    for c in feat_cols:
        if c not in d.columns: d[c] = 0.0
    return d, feat_cols, ecdf_bf, ecdf_pc, cats

def fit_s_hat_fold(trn_df, base_trn):
    slopes_trn = compute_patient_slopes(trn_df)
    slope_labels_trn = pd.DataFrame({'Patient': list(slopes_trn.keys()), 's_label': list(slopes_trn.values())})
    base_trn_lab = base_trn.merge(slope_labels_trn, on='Patient', how='left')
    bf_trn, f_cols_s, ecdf_bf_s, ecdf_pc_s, cats_s = build_slope_features(base_trn_lab, fit=True)
    scaler_s = StandardScaler(with_mean=True, with_std=True).fit(bf_trn[f_cols_s].values.astype(float))
    Xs_tr = scaler_s.transform(bf_trn[f_cols_s].values.astype(float))
    y_s = bf_trn['s_label'].fillna(0.0).values.astype(float)
    ridge = Ridge(alpha=1.0, random_state=42).fit(Xs_tr, y_s)
    knn = KNeighborsRegressor(n_neighbors=9, weights='distance').fit(Xs_tr, y_s)
    q_lo, q_hi = np.percentile(y_s, [5,95])
    def get_s_hat_map(base_df_patients):
        bf_pred, _, _, _, _ = build_slope_features(base_df_patients, ecdf_bf_s, ecdf_pc_s, cats_s, fit=False)
        Xs = scaler_s.transform(bf_pred[f_cols_s].values.astype(float))
        s = 0.8*ridge.predict(Xs) + 0.2*knn.predict(Xs)
        s = np.clip(s, q_lo, q_hi)
        return dict(zip(bf_pred['Patient'].values, s))
    return get_s_hat_map

# Config: alphas and seeds
alphas = [0.2, 0.8]
seeds = [1337, 2027, 3037]
cb_params = dict(
    iterations=1600,
    learning_rate=0.045,
    depth=6,
    l2_leaf_reg=6.0,
    subsample=0.8,
    rsm=0.8,
    random_strength=0.8,
    od_type='Iter',
    od_wait=120,
    bootstrap_type='Bernoulli',
    loss_function=None,  # set per alpha
    task_type='CPU',
    verbose=False
)

# OOF container
oof_df = train[['Patient','Weeks','FVC']].copy()
oof_df['q20_delta_oof'] = np.nan
oof_df['q80_delta_oof'] = np.nan

# TEST grid and index mapping
grid_te = ss.copy()
parts = grid_te['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid_te['Patient'] = parts[0]; grid_te['Weeks'] = parts[1].astype(int)
test_base = test[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].rename(
    columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid_te_idx = grid_te[['Patient','Weeks']].copy()
grid_te_idx['ss_idx'] = np.arange(grid_te_idx.shape[0], dtype=int)

# Accumulators for TEST deltas per alpha
test_preds = {0.2: np.zeros(ss.shape[0], dtype=float), 0.8: np.zeros(ss.shape[0], dtype=float)}

gkf = GroupKFold(n_splits=5)
groups = train['Patient'].values

for fold, (trn_idx, val_idx) in enumerate(gkf.split(train, groups=groups), 1):
    tf = time.time()
    trn_df = train.iloc[trn_idx].copy(); val_df = train.iloc[val_idx].copy()
    base_trn = prepare_baseline_table(trn_df)
    base_val = prepare_baseline_table(val_df)
    # s_hat maps
    get_s_hat_map = fit_s_hat_fold(trn_df, base_trn)
    s_map_trn = get_s_hat_map(base_trn)
    s_map_val = get_s_hat_map(base_val)
    base_test = grid_te[['Patient']].drop_duplicates().merge(test_base.drop_duplicates('Patient'), on='Patient', how='left')
    s_map_test = get_s_hat_map(base_test[['Patient','Base_Week','Base_FVC','Percent_at_base','Age','Sex','SmokingStatus']])

    # Future-only train/val with s_hat
    trn = trn_df.merge(base_trn[['Patient','Base_Week','Base_FVC','Percent_at_base','Age','Sex','SmokingStatus']], on='Patient', how='left')
    val = val_df.merge(base_val[['Patient','Base_Week','Base_FVC','Percent_at_base','Age','Sex','SmokingStatus']], on='Patient', how='left')
    trn['dist'] = (trn['Weeks'] - trn['Base_Week']).astype(float); trn = trn[trn['dist'] >= 0].copy()
    val['dist'] = (val['Weeks'] - val['Base_Week']).astype(float); val = val[val['dist'] >= 0].copy()
    trn['s_hat'] = trn['Patient'].map(s_map_trn).astype(float).fillna(0.0)
    val['s_hat'] = val['Patient'].map(s_map_val).astype(float).fillna(0.0)

    # Features (fit on TRAIN fold)
    trn_feat, feat_cols, ecdf_bf, ecdf_pc, cats = build_q_features(trn[['Patient','Weeks']].copy(), base_trn, fit=True)
    trn_feat['s_hat'] = trn_feat['Patient'].map(s_map_trn).astype(float).fillna(0.0)
    val_feat, _, _, _, _ = build_q_features(val[['Patient','Weeks']].copy(), base_val, ecdf_bf, ecdf_pc, cats, fit=False)
    val_feat['s_hat'] = val_feat['Patient'].map(s_map_val).astype(float).fillna(0.0)

    # Strict alignment
    trn_feat_aligned = trn_feat.merge(trn[['Patient','Weeks','FVC']], on=['Patient','Weeks'], how='inner')
    val_feat_aligned = val_feat.merge(val[['Patient','Weeks','FVC']], on=['Patient','Weeks'], how='inner')

    y_tr_delta = (trn_feat_aligned['FVC'].astype(float).values - trn_feat_aligned['Base_FVC'].astype(float).values)
    y_va_delta = (val_feat_aligned['FVC'].astype(float).values - val_feat_aligned['Base_FVC'].astype(float).values)
    X_tr = trn_feat_aligned[feat_cols].values.astype(float)
    X_va = val_feat_aligned[feat_cols].values.astype(float)

    if X_tr.shape[0] == 0 or X_va.shape[0] == 0:
        print(f'[cat-bands Fold {fold}] skipped (X_tr={X_tr.shape[0]}, X_va={X_va.shape[0]})', flush=True)
        del trn_df, val_df, trn, val, trn_feat, val_feat, trn_feat_aligned, val_feat_aligned
        gc.collect()
        continue

    # TEST features under TRAIN-fold transforms; align to ss via index map
    te_feat, _, _, _, _ = build_q_features(grid_te[['Patient','Weeks']].copy(), test_base, ecdf_bf, ecdf_pc, cats, fit=False)
    te_feat['s_hat'] = te_feat['Patient'].map(s_map_test).astype(float).fillna(0.0)
    X_te = te_feat[feat_cols].values.astype(float)
    te_keys = te_feat[['Patient','Weeks']].copy().merge(grid_te_idx, on=['Patient','Weeks'], how='left')
    te_idx = te_keys['ss_idx'].values.astype(int)

    # For each alpha (0.2, 0.8), seed-bag CatBoost and accumulate
    for a in alphas:
        loss = f'Quantile:alpha={a}'
        val_pred_sum = np.zeros(X_va.shape[0], dtype=float)
        test_pred_sum_fold = np.zeros(ss.shape[0], dtype=float)
        for sd in seeds:
            params = dict(cb_params)
            params['loss_function'] = loss
            model = CatBoostRegressor(**params, random_state=sd)
            model.fit(X_tr, y_tr_delta, eval_set=(X_va, y_va_delta))
            val_pred_sum += model.predict(X_va)
            pred_te = model.predict(X_te)
            test_pred_sum_fold[te_idx] += pred_te
            del model
        val_pred_avg = val_pred_sum / max(len(seeds), 1)
        test_pred_avg_fold = test_pred_sum_fold / max(len(seeds), 1)
        # Write OOF deltas for this alpha
        keys = val_feat_aligned[['Patient','Weeks']].reset_index(drop=True)
        col = 'q20_delta_oof' if np.isclose(a, 0.2) else 'q80_delta_oof'
        block = pd.DataFrame({'Patient': keys['Patient'].astype(str), 'Weeks': keys['Weeks'].astype(int), col: val_pred_avg})
        oof_df = oof_df.merge(block, on=['Patient','Weeks'], how='left', suffixes=('', '_new'))
        oof_df[col] = oof_df[col].fillna(oof_df[col + '_new'])
        oof_df.drop(columns=[col + '_new'], inplace=True)
        # Accumulate TEST preds
        test_preds[a] += (test_pred_avg_fold / 5.0)

    print(f'[cat-bands Fold {fold}] trn={X_tr.shape[0]} val={X_va.shape[0]} elapsed={time.time()-tf:.2f}s', flush=True)
    del trn_df, val_df, trn, val, trn_feat, val_feat, trn_feat_aligned, val_feat_aligned, X_tr, X_va, X_te, te_feat, te_idx, te_keys
    gc.collect()

# Save OOF bands
train_base = prepare_baseline_table(train)
oof_save = oof_df.dropna(subset=['q20_delta_oof','q80_delta_oof']).merge(train_base[['Patient','Base_Week','Base_FVC']], on='Patient', how='left')
oof_save.to_csv('oof_quantile_cat_v1_bands.csv', index=False)

# Save TEST bands aligned to ss
pd.DataFrame({
    'Patient_Week': ss['Patient_Week'],
    'q20_d': test_preds[0.2].astype(float),
    'q80_d': test_preds[0.8].astype(float)
}).to_csv('pred_quantile_deltas_cat_v1_bands.csv', index=False)

print(f'Saved oof_quantile_cat_v1_bands.csv and pred_quantile_deltas_cat_v1_bands.csv. Elapsed {time.time()-t0:.1f}s')

[cat-bands Fold 1] trn=1124 val=284 elapsed=1.15s


[cat-bands Fold 2] trn=1127 val=281 elapsed=1.20s


[cat-bands Fold 3] trn=1129 val=279 elapsed=1.35s


[cat-bands Fold 4] trn=1129 val=279 elapsed=1.17s


[cat-bands Fold 5] trn=1123 val=285 elapsed=1.16s


Saved oof_quantile_cat_v1_bands.csv and pred_quantile_deltas_cat_v1_bands.csv. Elapsed 6.7s


In [93]:
# Step 3d: Average LGBM v3 bands + CatBoost v1 bands; retune sigma on averaged-q50 OOF; build test submission
import numpy as np, pandas as pd, time
from sklearn.model_selection import GroupKFold

t0 = time.time()
train = pd.read_csv('train.csv')
ss = pd.read_csv('sample_submission.csv')

def laplace_ll_np(y_true, y_pred, sigma):
    y_true = np.asarray(y_true, float); y_pred = np.asarray(y_pred, float); sigma = np.asarray(sigma, float)
    delta = np.minimum(np.abs(y_true - y_pred), 1000.0)
    sigma = np.maximum(sigma, 70.0)
    return float(np.mean(-delta / sigma - np.log(sigma)))

def prepare_baseline_table(df):
    base = (df.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False).first())
    base = base[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].rename(
        columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
    return base

def compute_patient_slopes(df, patient_col='Patient', week_col='Weeks', target_col='FVC'):
    slopes = {}
    for pid, g in df.groupby(patient_col):
        if g.shape[0] >= 2:
            x = g[week_col].values.astype(float); y = g[target_col].values.astype(float)
            xm = x.mean(); ym = y.mean()
            denom = ((x - xm)**2).sum()
            slope = ((x - xm) * (y - ym)).sum() / denom if denom > 0 else 0.0
            slopes[pid] = slope
    return slopes

def robust_global_slope(slopes_dict):
    if not slopes_dict: return 0.0
    return float(np.median(list(slopes_dict.values())))

# 1) Build corrected OOF with averaged q50 (LGBM v3 + Cat v1) and averaged bands (v3 + cat)
oof_v3 = pd.read_csv('oof_quantile_lgbm_v3.csv')      # q50
oof_cb = pd.read_csv('oof_quantile_cat_v1.csv')       # q50
oof_b3 = pd.read_csv('oof_quantile_lgbm_v3_bands.csv')[['Patient','Weeks','q20_delta_oof','q80_delta_oof']]
oof_bC = pd.read_csv('oof_quantile_cat_v1_bands.csv')[['Patient','Weeks','q20_delta_oof','q80_delta_oof']]

# Deduplicate/average within sources
oof_v3 = (oof_v3.groupby(['Patient','Weeks'], as_index=False)
          .agg({'FVC':'first','Base_Week':'first','Base_FVC':'first','q50_delta_oof':'mean'}))
oof_cb = (oof_cb.groupby(['Patient','Weeks'], as_index=False)
          .agg({'q50_delta_oof':'mean'}))
oof_b3 = (oof_b3.groupby(['Patient','Weeks'], as_index=False)
          .agg({'q20_delta_oof':'mean','q80_delta_oof':'mean'}))
oof_bC = (oof_bC.groupby(['Patient','Weeks'], as_index=False)
          .agg({'q20_delta_oof':'mean','q80_delta_oof':'mean'}))

# Merge and compute averaged q50 and averaged bands
oof = (oof_v3.rename(columns={'q50_delta_oof':'q50_v3'})
       .merge(oof_cb.rename(columns={'q50_delta_oof':'q50_cb'}), on=['Patient','Weeks'], how='inner')
       .merge(oof_b3.rename(columns={'q20_delta_oof':'q20_b3','q80_delta_oof':'q80_b3'}), on=['Patient','Weeks'], how='inner')
       .merge(oof_bC.rename(columns={'q20_delta_oof':'q20_cB','q80_delta_oof':'q80_cB'}), on=['Patient','Weeks'], how='inner'))

oof['q50_avg'] = 0.5 * (oof['q50_v3'].astype(float) + oof['q50_cb'].astype(float))
oof['q20_avg'] = 0.5 * (oof['q20_b3'].astype(float) + oof['q20_cB'].astype(float))
oof['q80_avg'] = 0.5 * (oof['q80_b3'].astype(float) + oof['q80_cB'].astype(float))
oof['band_avg'] = (oof['q80_avg'] - oof['q20_avg']).abs().astype(float)

oof['dist'] = (oof['Weeks'] - oof['Base_Week']).astype(float)
pre = oof.shape[0]
oof = oof[(oof['dist'] >= 0) & oof[['q50_avg','band_avg']].notna().all(axis=1)].copy()
oof = oof.sort_values(['Patient','Weeks']).drop_duplicates(['Patient','Weeks'])
oof = oof[oof['dist'] > 0].copy()
print(f"[avgBands Diag] OOF pre={pre} post={oof.shape[0]} pats={oof['Patient'].nunique()}")

# Per-fold anchor (TRAIN-only) for gs_fold
N_SPLITS = 5
gkf = GroupKFold(n_splits=N_SPLITS)
groups = train['Patient'].values
patient_to_fold, fold_to_gs = {}, {}
for fold, (trn_idx, val_idx) in enumerate(gkf.split(train, groups=groups), 1):
    trn_df = train.iloc[trn_idx]
    gs_fold = robust_global_slope(compute_patient_slopes(trn_df))
    fold_to_gs[fold] = gs_fold
    for p in train.iloc[val_idx]['Patient'].astype(str).unique():
        patient_to_fold[p] = fold
oof['fold'] = oof['Patient'].astype(str).map(patient_to_fold).astype(int)
oof['gs_fold'] = oof['fold'].map(fold_to_gs).astype(float)

# 2) Build 70/30 averaged q50 + per-fold anchor; tune c per bin with banker floor; optional >=130 for |dist|>30
base = oof['Base_FVC'].astype(float).values
dist = oof['dist'].astype(float).values
abs_dist = np.abs(dist).astype(float)
fvc_q50_oof = base + oof['q50_avg'].astype(float).values
fvc_anchor_oof = base + oof['gs_fold'].astype(float).values * dist
fvc_blend_oof = 0.70 * fvc_q50_oof + 0.30 * fvc_anchor_oof
y_oof = oof['FVC'].astype(float).values
band_oof = oof['band_avg'].astype(float).values

sigma_banker_oof = np.maximum(240.0 + 3.0 * abs_dist, 70.0)
sigma_banker_oof = np.where(abs_dist > 20.0, np.maximum(sigma_banker_oof, 100.0), sigma_banker_oof)
bins = [(0.0,5.0),(5.0,15.0),(15.0,1e9)]
masks = [ (abs_dist>lo) & (abs_dist<=hi) for lo,hi in bins ]
print('[avgBands Diag] bin counts:', [int(m.sum()) for m in masks])

c_grid_short_mid = [1.3,1.4,1.5,1.6,1.7,1.8,2.0]
c_grid_long = [1.3,1.4,1.5,1.6,1.7,1.8,2.0,2.3,2.4,2.5,2.6]
best_c = {}
for (lo,hi), m in zip(bins, masks):
    if not np.any(m):
        best_c[(lo,hi)] = 1.8
        print(f'[avgBands Sigma] Bin ({lo},{hi}] empty; c=1.8')
        continue
    grid_c = c_grid_short_mid if hi<=15.0 else c_grid_long
    b_ll, b_c = -1e9, 1.8
    for c in grid_c:
        sig = np.maximum(band_oof[m] / c, sigma_banker_oof[m])
        ll = laplace_ll_np(y_oof[m], fvc_blend_oof[m], sig)
        if ll > b_ll: b_ll, b_c = ll, c
    best_c[(lo,hi)] = b_c
    print(f"[avgBands Sigma] Bin ({lo},{hi}] best c={b_c:.2f} OOF LL={b_ll:.5f}")

m_gt30 = abs_dist > 30.0
use_floor130 = False
if np.any(m_gt30):
    sig_base = np.zeros_like(abs_dist)
    for (lo,hi), m in zip(bins, masks):
        if np.any(m): sig_base[m] = np.maximum(band_oof[m] / best_c[(lo,hi)], sigma_banker_oof[m])
    ll_no130 = laplace_ll_np(y_oof, fvc_blend_oof, sig_base)
    sig_130 = np.where(m_gt30, np.maximum(sig_base, 130.0), sig_base)
    ll_130 = laplace_ll_np(y_oof, fvc_blend_oof, sig_130)
    use_floor130 = (ll_130 >= ll_no130 - 1e-6)
    print(f"[avgBands Sigma] >=130 test: LL_no130={ll_no130:.5f} LL_130={ll_130:.5f} adopt={use_floor130}")

sig_oof = np.zeros_like(abs_dist)
for (lo,hi), m in zip(bins, masks):
    if np.any(m): sig_oof[m] = np.maximum(band_oof[m] / best_c[(lo,hi)], sigma_banker_oof[m])
if use_floor130: sig_oof = np.where(abs_dist > 30.0, np.maximum(sig_oof, 130.0), sig_oof)
ll_qb = laplace_ll_np(y_oof, fvc_blend_oof, sig_oof)
ll_bk = laplace_ll_np(y_oof, fvc_blend_oof, sigma_banker_oof)
print(f"[avgBands OOF] rows={oof.shape[0]} pats={oof['Patient'].nunique()} LL_qband={ll_qb:.5f} | LL_banker={ll_bk:.5f}")

# 3) TEST: averaged q50_d (v3 + cat) 70/30 with full-train anchor; sigma from averaged bands; guardrails
pred_v3 = pd.read_csv('pred_quantile_deltas_v3.csv').set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()
pred_cb = pd.read_csv('pred_quantile_deltas_cat_v1.csv').set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()
q50_d_te = 0.5 * (pred_v3['q50_d'].astype(float).values + pred_cb['q50_d'].astype(float).values)

pred_b3 = pd.read_csv('pred_quantile_deltas_v3_bands.csv').set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()
pred_bC = pd.read_csv('pred_quantile_deltas_cat_v1_bands.csv').set_index('Patient_Week').loc[ss['Patient_Week']].reset_index()
q20_avg_te = 0.5 * (pred_b3['q20_d'].astype(float).values + pred_bC['q20_d'].astype(float).values)
q80_avg_te = 0.5 * (pred_b3['q80_d'].astype(float).values + pred_bC['q80_d'].astype(float).values)
band_te = np.abs(q80_avg_te - q20_avg_te).astype(float)

grid = ss.copy()
parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid['Patient'] = parts[0]; grid['Weeks'] = parts[1].astype(int)
test = pd.read_csv('test.csv')
test_base = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid = grid.merge(test_base, on='Patient', how='left')
dist_te = (grid['Weeks'] - grid['Base_Week']).astype(float).values
abs_dist_te = np.abs(dist_te).astype(float)
base_fvc_te = grid['Base_FVC'].astype(float).values

gs_full = robust_global_slope(compute_patient_slopes(train))
fvc_q50_te = base_fvc_te + q50_d_te
fvc_anchor_te = base_fvc_te + gs_full * dist_te
fvc_te = 0.70 * fvc_q50_te + 0.30 * fvc_anchor_te
fvc_te = np.clip(fvc_te, 500, 6000)

# Tolerant non-increasing per patient (+25 ml) then pin dist==0
df_out = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'FVC': fvc_te})
def enforce_non_increasing_tolerant(g, tol=25.0):
    g = g.sort_values('Weeks').copy()
    f = g['FVC'].values.astype(float)
    for i in range(len(f)-2, -1, -1):
        f[i] = min(f[i], f[i+1] + tol)
    g['FVC'] = f
    return g
df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(lambda g: enforce_non_increasing_tolerant(g, 25.0))
fvc_final = df_out['FVC'].values.astype(float)
fvc_final = np.where(abs_dist_te == 0.0, base_fvc_te, fvc_final)

# Sigma: averaged bands per-bin with tuned c; banker floor; dist==0->70; optional >=130; per-patient monotone
sigma_banker_te = np.maximum(240.0 + 3.0 * abs_dist_te, 70.0)
sigma_banker_te = np.where(abs_dist_te > 20.0, np.maximum(sigma_banker_te, 100.0), sigma_banker_te)
sigma_from_band = np.zeros_like(abs_dist_te, dtype=float)
for (lo,hi), c in best_c.items():
    m = (abs_dist_te>lo) & (abs_dist_te<=hi)
    if np.any(m): sigma_from_band[m] = band_te[m] / c
sigma_qband_te = np.maximum(sigma_from_band, sigma_banker_te)
if use_floor130: sigma_qband_te = np.where(abs_dist_te > 30.0, np.maximum(sigma_qband_te, 130.0), sigma_qband_te)
sigma_qband_te = np.where(abs_dist_te == 0.0, 70.0, sigma_qband_te)

def enforce_sigma_monotone(df):
    def _mono(g):
        g = g.sort_values('dist').copy()
        g['Sigma'] = np.maximum.accumulate(g['Sigma'].values)
        return g
    return df.groupby('Patient', as_index=False, group_keys=False).apply(_mono)

df_sig = pd.DataFrame({'Patient': grid['Patient'].values, 'Weeks': grid['Weeks'].values.astype(int), 'dist': abs_dist_te, 'Sigma': sigma_qband_te.astype(float)})
df_sig = enforce_sigma_monotone(df_sig)
sigma_final = df_sig['Sigma'].values.astype(float)

sub = pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': fvc_final, 'Confidence': sigma_final})
sub.to_csv('submission_v3cat_qband_avgBands.csv', index=False)
sub.to_csv('submission.csv', index=False)
print('Saved submission_v3cat_qband_avgBands.csv and set submission.csv. Elapsed {:.1f}s'.format(time.time()-t0))

[avgBands Diag] OOF pre=1387 post=1229 pats=158
[avgBands Diag] bin counts: [286, 438, 505]
[avgBands Sigma] Bin (0.0,5.0] best c=1.50 OOF LL=-6.04091
[avgBands Sigma] Bin (5.0,15.0] best c=1.40 OOF LL=-6.12936
[avgBands Sigma] Bin (15.0,1000000000.0] best c=1.60 OOF LL=-6.41398
[avgBands Sigma] >=130 test: LL_no130=-6.22573 LL_130=-6.22573 adopt=True
[avgBands OOF] rows=1229 pats=158 LL_qband=-6.22573 | LL_banker=-6.22573
Saved submission_v3cat_qband_avgBands.csv and set submission.csv. Elapsed 0.1s


  df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(lambda g: enforce_non_increasing_tolerant(g, 25.0))
  return df.groupby('Patient', as_index=False, group_keys=False).apply(_mono)


In [95]:
# Parametric sigma (a + b*|dist|) floored by banker with progressive floors; hybrid FVC monotonicity on distance-aware blend
import numpy as np, pandas as pd, time, warnings, gc
from sklearn.model_selection import GroupKFold
from sklearn.preprocessing import StandardScaler
import statsmodels.formula.api as smf

t0 = time.time()
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
ss = pd.read_csv('sample_submission.csv')

def laplace_ll_np(y_true, y_pred, sigma):
    y_true = np.asarray(y_true, float); y_pred = np.asarray(y_pred, float); sigma = np.asarray(sigma, float)
    delta = np.minimum(np.abs(y_true - y_pred), 1000.0)
    sigma = np.maximum(sigma, 70.0)
    return float(np.mean(-delta / sigma - np.log(sigma)))

def prepare_baseline_table(df):
    base = (df.sort_values(['Patient','Weeks']).groupby('Patient', as_index=False).first())
    base = base[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].rename(
        columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
    return base

def compute_patient_slopes(df, patient_col='Patient', week_col='Weeks', target_col='FVC'):
    slopes = {}
    for pid, g in df.groupby(patient_col):
        if g.shape[0] >= 2:
            x = g[week_col].values.astype(float); y = g[target_col].values.astype(float)
            xm = x.mean(); ym = y.mean()
            denom = ((x - xm)**2).sum()
            slope = ((x - xm) * (y - ym)).sum() / denom if denom > 0 else 0.0
            slopes[pid] = slope
    return slopes

def robust_global_slope(slopes_dict):
    if not slopes_dict: return 0.0
    return float(np.median(list(slopes_dict.values())))

# --- Build OOF sources for distance-aware blend (Slope+Anchor, LME, Quantile q50+per-fold anchor) ---
def slope_anchor_oof(train_df, n_splits=5, seed=42):
    from sklearn.linear_model import Ridge
    from sklearn.neighbors import KNeighborsRegressor
    gkf = GroupKFold(n_splits=n_splits)
    groups = train_df['Patient'].values
    y_list, d_list, fvc_s_list, fvc_a_list, pid_list, wk_list = [], [], [], [], [], []
    for fold, (trn_idx, val_idx) in enumerate(gkf.split(train_df, groups=groups), 1):
        trn = train_df.iloc[trn_idx].copy(); val = train_df.iloc[val_idx].copy()
        base_trn = prepare_baseline_table(trn); base_val = prepare_baseline_table(val)
        slopes_tr = compute_patient_slopes(trn)
        lab = pd.DataFrame({'Patient': list(slopes_tr.keys()), 's_label': list(slopes_tr.values())})
        # Fit slope head on TRAIN baseline
        bf_trn, feat_cols, ecdf_bf, ecdf_pc, cats = build_slope_features(base_trn.merge(lab, on='Patient', how='left'), fit=True)
        bf_val, _, _, _, _ = build_slope_features(base_val, ecdf_bf, ecdf_pc, cats, fit=False)
        sc = StandardScaler(with_mean=True, with_std=True)
        X_tr = bf_trn[feat_cols].values.astype(float); y_tr = bf_trn['s_label'].fillna(0.0).values.astype(float)
        X_trs = sc.fit_transform(X_tr); X_vs = sc.transform(bf_val[feat_cols].values.astype(float))
        ridge = Ridge(alpha=1.0, random_state=seed).fit(X_trs, y_tr)
        knn   = KNeighborsRegressor(n_neighbors=9, weights='distance').fit(X_trs, y_tr)
        s_r = ridge.predict(X_vs); s_k = knn.predict(X_vs)
        q_lo, q_hi = np.percentile(y_tr, [5,95])
        s_bl = np.clip(0.80*s_r + 0.20*s_k, q_lo, q_hi)
        s_map = dict(zip(base_val['Patient'].values, s_bl))
        valm = val.merge(base_val[['Patient','Base_Week','Base_FVC']], on='Patient', how='left')
        mask = (valm['Weeks'] >= valm['Base_Week'])
        dist = (valm['Weeks'] - valm['Base_Week']).astype(float)
        fvc_s = (valm['Base_FVC'].values + valm['Patient'].map(s_map).fillna(0.0).values * dist).astype(float)
        gs_fold = robust_global_slope(compute_patient_slopes(trn))
        fvc_a = (valm['Base_FVC'].values + gs_fold * dist).astype(float)
        y_list.append(valm.loc[mask,'FVC'].values.astype(float)); d_list.append(dist.values[mask].astype(float))
        fvc_s_list.append(fvc_s[mask].astype(float)); fvc_a_list.append(fvc_a[mask].astype(float))
        pid_list.append(valm.loc[mask,'Patient'].astype(str).values); wk_list.append(valm.loc[mask,'Weeks'].astype(int).values)
        del trn, val, base_trn, base_val, bf_trn, bf_val, X_tr, X_trs, X_vs; gc.collect()
    return (np.concatenate(y_list), np.concatenate(d_list), np.concatenate(fvc_s_list), np.concatenate(fvc_a_list), np.concatenate(pid_list), np.concatenate(wk_list))

def lme_oof(train_df, n_splits=5):
    gkf = GroupKFold(n_splits=5); groups = train_df['Patient'].values
    y_list, d_list, fvc_list, pid_list, wk_list = [], [], [], [], []
    for fold, (trn_idx, val_idx) in enumerate(gkf.split(train_df, groups=groups), 1):
        trn = train_df.iloc[trn_idx].copy(); val = train_df.iloc[val_idx].copy()
        base_trn = prepare_baseline_table(trn); base_val = prepare_baseline_table(val)
        trn_l = trn.drop(columns=['Age','Sex','SmokingStatus'], errors='ignore').merge(base_trn[['Patient','Base_Week','Base_FVC','Percent_at_base','Age','Sex','SmokingStatus']], on='Patient', how='left')
        trn_l['Weeks_Passed'] = (trn_l['Weeks'] - trn_l['Base_Week']).astype(float)/10.0
        trn_l = trn_l[trn_l['Weeks_Passed'] >= 0].copy()
        age_mean, age_std = trn_l['Age'].mean(), trn_l['Age'].std()+1e-9
        pc_mean, pc_std   = trn_l['Percent_at_base'].mean(), trn_l['Percent_at_base'].std()+1e-9
        trn_l['Age_std'] = (trn_l['Age'] - age_mean)/age_std; trn_l['Percent_at_base_std'] = (trn_l['Percent_at_base'] - pc_mean)/pc_std
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')
            md = smf.mixedlm('FVC ~ 1 + Weeks_Passed + I(Weeks_Passed**2) + Age_std + C(Sex) + C(SmokingStatus) + Percent_at_base_std + Age_std:Percent_at_base_std',
                              data=trn_l, groups=trn_l['Patient'], re_formula='~Weeks_Passed')
            mdf = md.fit(method='lbfgs', reml=True, maxiter=500, disp=False)
        val_l = val.drop(columns=['Age','Sex','SmokingStatus'], errors='ignore').merge(base_val[['Patient','Base_Week','Base_FVC','Percent_at_base','Age','Sex','SmokingStatus']], on='Patient', how='left')
        mask = (val_l['Weeks'] >= val_l['Base_Week'])
        dist = (val_l['Weeks'] - val_l['Base_Week']).astype(float)
        val_l['Weeks_Passed'] = dist/10.0
        val_l['Age_std'] = (val_l['Age'] - age_mean)/age_std; val_l['Percent_at_base_std'] = (val_l['Percent_at_base'] - pc_mean)/pc_std
        fvc_pred = mdf.predict(val_l).astype(float).values
        y_list.append(val_l.loc[mask,'FVC'].values.astype(float)); d_list.append(dist.values[mask].astype(float));
        fvc_list.append(fvc_pred[mask].astype(float)); pid_list.append(val_l.loc[mask,'Patient'].astype(str).values); wk_list.append(val_l.loc[mask,'Weeks'].astype(int).values)
        del trn, val, base_trn, base_val, trn_l, val_l; gc.collect()
    return np.concatenate(y_list), np.concatenate(d_list), np.concatenate(fvc_list), np.concatenate(pid_list), np.concatenate(wk_list)

def build_slope_features(base_df, ecdf_basefvc=None, ecdf_percent=None, cats=None, fit=False):
    # Light adapter using earlier definitions already in notebook
    b = base_df.copy()
    b['log_Base_FVC'] = np.log1p(np.maximum(b['Base_FVC'].astype(float), 1.0))
    b['BaseFVC_over_Age'] = b['Base_FVC'].astype(float) / np.maximum(b['Age'].astype(float), 1.0)
    b['PercentBase_over_Age'] = b['Percent_at_base'].astype(float) / np.maximum(b['Age'].astype(float), 1.0)
    if fit:
        ecdf_basefvc = np.sort(b['Base_FVC'].values.astype(float))
        ecdf_percent = np.sort(b['Percent_at_base'].values.astype(float))
    def ecdf_rank_transform(x, xs):
        x = np.asarray(x, float); idx = np.searchsorted(xs, x, side='right'); return idx / max(len(xs), 1)
    b['BaseFVC_ecdf'] = ecdf_rank_transform(b['Base_FVC'].values, ecdf_basefvc)
    b['Percent_ecdf'] = ecdf_rank_transform(b['Percent_at_base'].values, ecdf_percent)
    if fit:
        cats = {'Sex': sorted(b['Sex'].dropna().astype(str).unique().tolist()), 'SmokingStatus': sorted(b['SmokingStatus'].dropna().astype(str).unique().tolist())}
    # one-hot
    out = b.copy()
    for c, values in cats.items():
        col = b[c].astype(str)
        for v in values:
            out[f'{c}__{v}'] = (col == v).astype(np.int8)
    num_cols = ['Age','Base_FVC','log_Base_FVC','Percent_at_base','BaseFVC_over_Age','PercentBase_over_Age','BaseFVC_ecdf','Percent_ecdf']
    cat_cols = [c for c in out.columns if c.startswith('Sex__') or c.startswith('SmokingStatus__')]
    feat_cols = num_cols + cat_cols
    return out, feat_cols, ecdf_basefvc, ecdf_percent, cats

# Build OOF arrays
y_s, d_s, fvc_s, fvc_a, pid_s, wk_s = slope_anchor_oof(train, 5, 42)
y_l, d_l, fvc_l, pid_l, wk_l = lme_oof(train, 5)
oof_q = pd.read_csv('oof_quantile_lgbm_v2.csv')
train_base = prepare_baseline_table(train)
oof_q = oof_q.merge(train_base[['Patient','Base_Week','Base_FVC']], on='Patient', how='left', suffixes=('', '_base'))
if 'Base_FVC_base' in oof_q.columns:
    if 'Base_FVC' not in oof_q.columns: oof_q['Base_FVC'] = oof_q['Base_FVC_base']
    else: oof_q['Base_FVC'] = oof_q['Base_FVC'].fillna(oof_q['Base_FVC_base'])
    oof_q.drop(columns=['Base_FVC_base'], inplace=True)
if 'Base_Week_base' in oof_q.columns and 'Base_Week' not in oof_q.columns:
    oof_q['Base_Week'] = oof_q['Base_Week_base']; oof_q.drop(columns=['Base_Week_base'], inplace=True)
oof_q['dist'] = (oof_q['Weeks'] - oof_q['Base_Week']).astype(float)
oof_q = oof_q[oof_q['dist'] >= 0].dropna(subset=['q50_delta_oof']).copy()

# Per-fold anchors for quantile OOF
N_SPLITS = 5
gkf = GroupKFold(n_splits=N_SPLITS)
groups = train['Patient'].values
patient_to_fold, fold_to_gs = {}, {}
for fold, (trn_idx, val_idx) in enumerate(gkf.split(train, groups=groups), 1):
    trn_df = train.iloc[trn_idx]
    gs_fold = robust_global_slope(compute_patient_slopes(trn_df))
    fold_to_gs[fold] = gs_fold
    for p in train.iloc[val_idx]['Patient'].astype(str).unique():
        patient_to_fold[p] = fold
oof_q['fold'] = oof_q['Patient'].astype(str).map(patient_to_fold).astype(int)
oof_q['gs_fold'] = oof_q['fold'].map(fold_to_gs).astype(float)
fvc_anchor_q = oof_q['Base_FVC'].astype(float).values + oof_q['gs_fold'].values * oof_q['dist'].astype(float).values
fvc_q_point = oof_q['Base_FVC'].astype(float).values + oof_q['q50_delta_oof'].astype(float).values
fvc_q = 0.70 * fvc_q_point + 0.30 * fvc_anchor_q

# Align OOF by keys
df_s = pd.DataFrame({'Patient': pid_s.astype(str), 'Weeks': wk_s.astype(int), 'y_true': y_s.astype(float), 'dist': d_s.astype(float), 'fvc_s': fvc_s.astype(float), 'fvc_a': fvc_a.astype(float)})
df_l = pd.DataFrame({'Patient': pid_l.astype(str), 'Weeks': wk_l.astype(int), 'fvc_l': fvc_l.astype(float)})
df_q = oof_q[['Patient','Weeks']].astype({'Patient':'str','Weeks':'int'}).copy()
df_q['fvc_q'] = fvc_q.astype(float)
dfm = df_s.merge(df_l, on=['Patient','Weeks'], how='inner').merge(df_q, on=['Patient','Weeks'], how='inner')

y = dfm['y_true'].values.astype(float)
dist = dfm['dist'].values.astype(float)
s = dfm['fvc_s'].values.astype(float)
a = dfm['fvc_a'].values.astype(float)
l = dfm['fvc_l'].values.astype(float)
q = dfm['fvc_q'].values.astype(float)

sigma_banker_oof = np.maximum(240.0 + 3.0 * np.abs(dist), 70.0)
sigma_banker_oof = np.where(np.abs(dist) > 20.0, np.maximum(sigma_banker_oof, 100.0), sigma_banker_oof)

def grid_best(y, s, l, q, sigma, w_grid=np.arange(0.0, 1.01, 0.05)):
    best_ll, best_w = -1e9, (0.05, 0.05, 0.90)
    for ws in w_grid:
        for wl in w_grid:
            wq = 1.0 - ws - wl
            if wq < 0 or wq > 1: continue
            pred = ws*s + wl*l + wq*q
            ll = laplace_ll_np(y, pred, sigma)
            if ll > best_ll: best_ll, best_w = ll, (ws, wl, wq)
    return best_ll, best_w

bins = [(0.0,5.0),(5.0,15.0),(15.0,1e9)]
weights = {}
for lo,hi in bins:
    m = (np.abs(dist)>lo) & (np.abs(dist)<=hi)
    if not np.any(m):
        weights[(lo,hi)] = (0.05,0.05,0.90)
    else:
        _, w = grid_best(y[m], s[m], l[m], q[m], sigma_banker_oof[m])
        weights[(lo,hi)] = w
print('[ParamSigma] OOF weights by bin (S/L/Q):', {k: tuple(round(x,2) for x in v) for k,v in weights.items()})

# Build blended OOF FVC using bin weights
fvc_blend_oof = np.zeros_like(y)
for (lo,hi), (ws, wl, wq) in weights.items():
    m = (np.abs(dist)>lo) & (np.abs(dist)<=hi)
    if np.any(m): fvc_blend_oof[m] = ws*s[m] + wl*l[m] + wq*q[m]

# Parametric sigma grid with progressive floors and banker floor
a_grid = [80, 90, 100, 110, 120, 140]
b_grid = [1.4, 1.6, 1.8, 2.0]
best = (-1e9, None, None)
for a0 in a_grid:
    for b0 in b_grid:
        sig = a0 + b0 * np.abs(dist)
        sig = np.maximum(sig, 70.0)
        sig = np.where(np.abs(dist) > 20.0, np.maximum(sig, 100.0), sig)
        sig = np.where(np.abs(dist) > 30.0, np.maximum(sig, 130.0), sig)
        sig = np.where(np.abs(dist) > 40.0, np.maximum(sig, 160.0), sig)
        sig = np.maximum(sig, sigma_banker_oof)
        ll = laplace_ll_np(y, fvc_blend_oof, sig)
        if ll > best[0]: best = (ll, a0, b0)
ll_banker_oof = laplace_ll_np(y, fvc_blend_oof, sigma_banker_oof)
print(f"[ParamSigma] Best OOF (a,b)=({best[1]},{best[2]:.2f}) LL={best[0]:.5f} vs BANKER={ll_banker_oof:.5f} Δ={best[0]-ll_banker_oof:+.5f}")
adopt_sigma = (best[0] >= ll_banker_oof + 0.002)

# --- Build TEST: distance-aware blend with learned weights; hybrid FVC monotonicity; parametric sigma ---
def load_fvc(path):
    return pd.read_csv(path).set_index('Patient_Week').loc[ss['Patient_Week'],'FVC'].astype(float).values

fvc_s_test = load_fvc('submission_slope_anchor_banker_wA60.csv') if pd.io.common.file_exists('submission_slope_anchor_banker_wA60.csv') else load_fvc('submission_slope_anchor_banker.csv')
try:
    fvc_l_test = load_fvc('submission_lme_banker.csv')
except Exception:
    # build LME banker quickly if missing
    base_tr = prepare_baseline_table(train)
    trn_l = train.drop(columns=['Age','Sex','SmokingStatus'], errors='ignore').merge(base_tr[['Patient','Base_Week','Base_FVC','Percent_at_base','Age','Sex','SmokingStatus']], on='Patient', how='left')
    trn_l['Weeks_Passed'] = (trn_l['Weeks'] - trn_l['Base_Week']).astype(float)/10.0
    trn_l = trn_l[trn_l['Weeks_Passed'] >= 0].copy()
    age_mean, age_std = trn_l['Age'].mean(), trn_l['Age'].std()+1e-9
    pc_mean, pc_std   = trn_l['Percent_at_base'].mean(), trn_l['Percent_at_base'].std()+1e-9
    trn_l['Age_std'] = (trn_l['Age'] - age_mean)/age_std; trn_l['Percent_at_base_std'] = (trn_l['Percent_at_base'] - pc_mean)/pc_std
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        md = smf.mixedlm('FVC ~ 1 + Weeks_Passed + I(Weeks_Passed**2) + Age_std + C(Sex) + C(SmokingStatus) + Percent_at_base_std + Age_std:Percent_at_base_std',
                          data=trn_l, groups=trn_l['Patient'], re_formula='~Weeks_Passed')
        mdf = md.fit(method='lbfgs', reml=True, maxiter=500, disp=False)
    grid = ss.copy()
    parts = grid['Patient_Week'].str.rsplit('_', n=1, expand=True)
    grid['Patient'] = parts[0]; grid['Weeks'] = parts[1].astype(int)
    test_bl = test[['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
    grid = grid.merge(test_bl, on='Patient', how='left')
    grid['Weeks_Passed'] = (grid['Weeks'] - grid['Base_Week']).astype(float)/10.0
    grid['Age_std'] = (grid['Age'].astype(float) - age_mean) / (age_std)
    grid['Percent_at_base_std'] = (grid['Percent_at_base'].astype(float) - pc_mean) / (pc_std)
    fvc_l_test = mdf.predict(grid).astype(float).values

# Quantile blended test FVC; rebuild test blend with weights from OOF
grid_te = ss.copy()
parts = grid_te['Patient_Week'].str.rsplit('_', n=1, expand=True)
grid_te['Patient'] = parts[0]; grid_te['Weeks'] = parts[1].astype(int)
test_base = test[['Patient','Weeks','FVC','Percent']].rename(columns={'Weeks':'Base_Week','FVC':'Base_FVC','Percent':'Percent_at_base'})
grid_te = grid_te.merge(test_base, on='Patient', how='left')
abs_dist_te = np.abs((grid_te['Weeks'] - grid_te['Base_Week']).astype(float).values)

fvc_q_test = pd.read_csv('submission_quantile_lgbm_v2.csv').set_index('Patient_Week').loc[ss['Patient_Week'],'FVC'].astype(float).values

fvc_blend_test = np.zeros_like(fvc_q_test)
for (lo,hi), (ws, wl, wq) in weights.items():
    m = (abs_dist_te > lo) & (abs_dist_te <= hi)
    fvc_blend_test[m] = ws*fvc_s_test[m] + wl*fvc_l_test[m] + wq*fvc_q_test[m]

# Hybrid FVC monotonicity: tol +25 for |dist|<=10, strict for >10; then pin dist==0; clip
dist_te = (grid_te['Weeks'] - grid_te['Base_Week']).astype(float).values
base_fvc_te = grid_te['Base_FVC'].astype(float).values
fvc_blend_test = np.where(dist_te == 0.0, base_fvc_te, np.clip(fvc_blend_test, 500, 6000))
df_out = pd.DataFrame({'Patient': grid_te['Patient'].values, 'Weeks': grid_te['Weeks'].values.astype(int), 'FVC': fvc_blend_test, 'abs_dist': abs_dist_te})
def enforce_hybrid_mono(g, tol_short=25.0, thr=10.0):
    g = g.sort_values('Weeks').copy()
    f = g['FVC'].values.astype(float)
    d = g['abs_dist'].values.astype(float)
    for i in range(len(f)-2, -1, -1):
        tol = tol_short if d[i] <= thr else 0.0
        f[i] = min(f[i], f[i+1] + tol)
    g['FVC'] = f
    return g
df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_hybrid_mono)
fvc_final = df_out['FVC'].values.astype(float)
fvc_final = np.where(abs_dist_te == 0.0, base_fvc_te, fvc_final)

# Parametric sigma on TEST
a_best, b_best = best[1], best[2]
sigma_param = a_best + b_best * abs_dist_te
sigma_param = np.maximum(sigma_param, 70.0)
sigma_param = np.where(abs_dist_te > 20.0, np.maximum(sigma_param, 100.0), sigma_param)
sigma_param = np.where(abs_dist_te > 30.0, np.maximum(sigma_param, 130.0), sigma_param)
sigma_param = np.where(abs_dist_te > 40.0, np.maximum(sigma_param, 160.0), sigma_param)
sigma_banker_te = np.maximum(240.0 + 3.0 * abs_dist_te, 70.0)
sigma_banker_te = np.where(abs_dist_te > 20.0, np.maximum(sigma_banker_te, 100.0), sigma_banker_te)
sigma_te = np.maximum(sigma_param, sigma_banker_te) if adopt_sigma else sigma_banker_te

# Per-patient monotone in |dist| for sigma
df_sig = pd.DataFrame({'Patient': grid_te['Patient'].values, 'Weeks': grid_te['Weeks'].values.astype(int), 'dist': abs_dist_te, 'Sigma': sigma_te.astype(float)})
def enforce_sigma_monotone(g):
    g = g.sort_values('dist').copy()
    g['Sigma'] = np.maximum.accumulate(g['Sigma'].values)
    return g
df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)
sigma_final = df_sig['Sigma'].values.astype(float)

sub = pd.DataFrame({'Patient_Week': ss['Patient_Week'], 'FVC': fvc_final, 'Confidence': sigma_final})
sub.to_csv('submission_distance_blend_paramSigma_hybridMono.csv', index=False)
if adopt_sigma:
    sub.to_csv('submission.csv', index=False)
    print(f"Saved submission_distance_blend_paramSigma_hybridMono.csv and set submission.csv. Adopt_sigma={adopt_sigma} | OOF gain={best[0]-ll_banker_oof:+.5f} | Elapsed {time.time()-t0:.1f}s")
else:
    print(f"Saved submission_distance_blend_paramSigma_hybridMono.csv. Adopt_sigma={adopt_sigma}; submission.csv unchanged. OOF gain={best[0]-ll_banker_oof:+.5f} | Elapsed {time.time()-t0:.1f}s")

[ParamSigma] OOF weights by bin (S/L/Q): {(0.0, 5.0): (0.05, 0.0, 0.95), (5.0, 15.0): (0.0, 0.0, 1.0), (15.0, 1000000000.0): (0.05, 0.05, 0.9)}
[ParamSigma] Best OOF (a,b)=(80,1.40) LL=-6.95557 vs BANKER=-6.95557 Δ=+0.00000
Saved submission_distance_blend_paramSigma_hybridMono.csv and set submission.csv. Adopt_sigma=False | OOF gain=+0.00000 | Elapsed 3.7s


  df_out = df_out.groupby('Patient', as_index=False, group_keys=False).apply(enforce_hybrid_mono)
  df_sig = df_sig.groupby('Patient', as_index=False, group_keys=False).apply(enforce_sigma_monotone)
