# Plan: Google Brain - Ventilator Pressure Prediction

Objectives:
- Establish reliable CV mirroring test conditions; build strong baseline fast; iterate to medal.

Validation:
- GroupKFold by breath_id to prevent leakage (train/test split is by breath).
- 5 folds, deterministic seed; log per-fold times and scores; save OOF for ensembling.
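
The validation scheme above can be sketched as follows. This is a minimal illustration on synthetic breath ids, not the final pipeline: the 20 x 80 layout, the seed value, and the names `shuffled_groups` and `fold_of_row` are assumptions made for the example. Relabelling the groups with a seeded permutation is one way to get a reproducible, shuffled assignment without relying on version-specific `GroupKFold` options.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Synthetic stand-in for the training rows: 20 breaths x 80 steps (assumed layout).
n_breaths, n_steps = 20, 80
breath_id = np.repeat(np.arange(n_breaths), n_steps)

# Relabel breaths with a seeded permutation so the fold assignment is
# shuffled but reproducible (the seed value is an arbitrary choice).
rng = np.random.default_rng(42)
perm = rng.permutation(n_breaths)
shuffled_groups = perm[breath_id]

fold_of_row = np.full(breath_id.shape[0], -1, dtype=int)
X_dummy = np.zeros((breath_id.shape[0], 1))  # GroupKFold only needs the row count
gkf = GroupKFold(n_splits=5)
for fold, (trn_idx, val_idx) in enumerate(gkf.split(X_dummy, groups=shuffled_groups)):
    # No breath may appear on both sides of a split.
    assert set(breath_id[trn_idx]).isdisjoint(breath_id[val_idx])
    fold_of_row[val_idx] = fold

print(np.bincount(fold_of_row))  # validation rows per fold
```

Storing `fold_of_row` once and reusing it for every model keeps the OOF predictions aligned for later blending.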

Baseline v0:
- Simple feature-engineered tabular model (XGBoost/CatBoost on GPU) trained on per-time-step rows.
- Features: u_in lag/lead (1..3), rolling stats (mean/std/max/min over small windows), cumulative sums, time_diff, area under u_in curve, R/C one-hot, interaction terms (u_in * R/C), u_in changes, step index (normalized), breathwise stats.
- Train target: pressure.
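
A few of the features listed above could be computed along these lines. This is a hedged sketch, not the full FE v0: the function name `add_basic_features`, the output column names, and the zero-fill at breath boundaries are assumptions; the input columns (`breath_id`, `time_step`, `u_in`, `u_out`, `R`, `C`) follow the competition's data format. Every shifted or cumulative feature is grouped by `breath_id` so nothing leaks across breaths.

```python
import pandas as pd

def add_basic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add per-breath lag/lead, cumulative, and interaction features (illustrative subset)."""
    df = df.sort_values(["breath_id", "time_step"]).reset_index(drop=True)
    g = df.groupby("breath_id")

    # Lags and leads of u_in within each breath; boundaries are zero-filled (assumption).
    for k in (1, 2, 3):
        df[f"u_in_lag{k}"] = g["u_in"].shift(k).fillna(0.0)
        df[f"u_in_lead{k}"] = g["u_in"].shift(-k).fillna(0.0)

    # Step-to-step changes and elapsed time between steps.
    df["u_in_diff1"] = g["u_in"].diff().fillna(0.0)
    df["time_diff"] = g["time_step"].diff().fillna(0.0)

    # Cumulative flow and a rectangle-rule area under the u_in curve.
    df["u_in_cumsum"] = g["u_in"].cumsum()
    df["area"] = (df["u_in"] * df["time_diff"]).groupby(df["breath_id"]).cumsum()

    # Small-window rolling mean; the breath_id index level is dropped to realign rows.
    df["u_in_roll_mean3"] = (
        g["u_in"].rolling(3, min_periods=1).mean().reset_index(level=0, drop=True)
    )

    # Normalized step index in [0, 1] and a breath-level statistic.
    n_steps = g["u_in"].transform("count")
    df["step_norm"] = g.cumcount() / (n_steps - 1).clip(lower=1)
    df["u_in_breath_max"] = g["u_in"].transform("max")

    # Simple interactions with the lung attributes.
    df["u_in_x_R"] = df["u_in"] * df["R"]
    df["u_in_x_C"] = df["u_in"] * df["C"]
    return df

# Tiny made-up frame: two breaths of four steps each.
demo = pd.DataFrame({
    "breath_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "time_step": [0.0, 0.1, 0.2, 0.3] * 2,
    "u_in": [0.0, 1.0, 2.0, 3.0, 4.0, 4.0, 0.0, 0.0],
    "u_out": [0, 0, 1, 1] * 2,
    "R": [5] * 4 + [20] * 4,
    "C": [10] * 4 + [50] * 4,
})
feats = add_basic_features(demo)
print(feats[["breath_id", "u_in", "u_in_lag1", "u_in_cumsum", "step_norm"]])
```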

Seq NN v1:
- GRU/LSTM over per-breath sequences (length 80); inputs: u_in, u_out, R, C plus engineered per-step features; regression head.
- Loss: SmoothL1; early stopping on CV; AMP + GPU.
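
Before any recurrent model can be trained, the per-step rows have to be reshaped into fixed-length sequences. The sketch below shows one way to do that; the helper name `to_sequences` and the demo values are assumptions, and it deliberately stops short of the model itself.

```python
import numpy as np
import pandas as pd

SEQ_LEN = 80  # steps per breath, as stated in the plan

def to_sequences(df: pd.DataFrame, feature_cols, seq_len: int = SEQ_LEN) -> np.ndarray:
    """Return a float32 array of shape (n_breaths, seq_len, n_features)."""
    df = df.sort_values(["breath_id", "time_step"])
    counts = df.groupby("breath_id").size()
    if not (counts == seq_len).all():
        raise ValueError("every breath must have exactly seq_len rows")
    values = df[list(feature_cols)].to_numpy(dtype=np.float32)
    return values.reshape(counts.shape[0], seq_len, len(feature_cols))

# Made-up data: three breaths with unordered ids and constant R, C.
demo = pd.DataFrame({
    "breath_id": np.repeat([7, 3, 5], SEQ_LEN),
    "time_step": np.tile(np.arange(SEQ_LEN) * 0.03, 3),
    "u_in": np.random.default_rng(0).random(3 * SEQ_LEN),
    "u_out": np.tile((np.arange(SEQ_LEN) >= 30).astype(int), 3),
    "R": 20,
    "C": 50,
})
X = to_sequences(demo, ["u_in", "u_out", "R", "C"])
print(X.shape)
```

The resulting array is ordered by sorted `breath_id`, so fold indices for the sequence model should be derived per breath rather than per row.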

Advanced FE v2:
- Physics-inspired: compute an RC-dynamics proxy, e.g. dp_est = (u_in - leak)*k - integrate(u_out), plus cumulative RC charge, expiratory markers, and plateaus.
- Breath-level globals appended to each step: R, C, sums, areas, peaks, counts.

Ensembling:
- Blend tabular (XGB/CB) and GRU predictions; weight by OOF performance.
- Calibrate blending weights by minimizing the metric on OOF predictions (MAE by default; the provided dice-hausdorff-combo function if one is available).
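
For two models, the OOF-based weight calibration can be as simple as a grid search over one convex weight. The following is a sketch under that assumption; `best_blend_weight`, the grid size, and the synthetic OOF arrays are illustrative, not part of the plan.

```python
import numpy as np

def best_blend_weight(y_true, oof_a, oof_b, n_grid: int = 101):
    """Grid-search w in [0, 1] minimizing MAE of w * oof_a + (1 - w) * oof_b."""
    weights = np.linspace(0.0, 1.0, n_grid)
    maes = [np.mean(np.abs(y_true - (w * oof_a + (1.0 - w) * oof_b))) for w in weights]
    i = int(np.argmin(maes))
    return float(weights[i]), float(maes[i])

# Synthetic OOF predictions standing in for the tabular and GRU models.
rng = np.random.default_rng(0)
y = rng.normal(20.0, 5.0, size=2000)
oof_tab = y + rng.normal(0.0, 1.0, size=y.size)
oof_gru = y + rng.normal(0.0, 0.7, size=y.size)

w, mae = best_blend_weight(y, oof_tab, oof_gru)
print(f"weight on tabular model: {w:.2f}, blended OOF MAE: {mae:.4f}")
```

Because the grid includes both endpoints, the blended OOF MAE can never be worse than the better single model; whether the gain transfers to the test set still needs checking.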

Metric handling:
- Primary dev metric: MAE on pressure, the standard metric for this competition (scored on inspiratory steps only, u_out == 0); it serves as the proxy if no dice-hausdorff-combo implementation is available locally.
- Once MAE CV is strong, validate with the provided metric function if the environment supplies one.
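
The MAE proxy can be restricted to inspiratory steps, which is how the original Kaggle competition scored submissions. A minimal sketch follows; the function name `inspiratory_mae` is an assumption.

```python
import numpy as np

def inspiratory_mae(y_true, y_pred, u_out) -> float:
    """MAE over rows where u_out == 0 (inspiratory phase); other rows are ignored."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = np.asarray(u_out) == 0
    if not mask.any():
        raise ValueError("no inspiratory steps (u_out == 0) to score")
    return float(np.abs(y_true[mask] - y_pred[mask]).mean())

# The third row is expiratory, so its large error does not count.
score = inspiratory_mae([10.0, 20.0, 30.0], [11.0, 18.0, 0.0], [0, 0, 1])
print(score)
```

Models can still be trained on all 80 steps; the mask only decides which rows enter the score.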

Pipeline steps:
1) Env check (GPU) and package setup (torch cu121; xgboost/catboost).
2) EDA: shapes, nulls, unique R/C, breath length consistency, target distribution.
3) FE v0; cache features to feather/parquet; fit scalers strictly within each CV fold.
4) Baseline XGB/CB OOF + test preds; save artifacts.
5) Seq GRU v1 with small model; OOF + test; sanity plots.
6) Blend and generate submission.csv; iterate weights.
7) Error analysis by breath segment (high-residual buckets) → FE v2.

Runtime discipline:
- Subsample smoke runs first (10k breaths, 2 folds) before full training.
- Print fold/time logs; save OOF/test npy for reproducibility.
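
Subsampling for smoke runs should pick whole breaths, not rows, so every sampled sequence stays intact. A small sketch, assuming a hypothetical helper `sample_breaths` and an arbitrary seed:

```python
import numpy as np

def sample_breaths(breath_ids, n_breaths: int, seed: int = 42) -> np.ndarray:
    """Return a boolean row mask that keeps n_breaths randomly chosen whole breaths."""
    breath_ids = np.asarray(breath_ids)
    unique_ids = np.unique(breath_ids)
    rng = np.random.default_rng(seed)
    keep = rng.choice(unique_ids, size=min(n_breaths, unique_ids.size), replace=False)
    return np.isin(breath_ids, keep)

# Synthetic ids: 100 breaths x 80 steps; keep 10 breaths for a smoke run.
ids = np.repeat(np.arange(100), 80)
mask = sample_breaths(ids, 10)
print(mask.sum(), "rows kept from", np.unique(ids[mask]).size, "breaths")
```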

Next actions:
A) Run GPU sanity check (nvidia-smi).
B) Quick data load + EDA.
C) Implement FE v0 and XGB baseline with 5-fold GroupKFold.
D) Request expert review after baseline CV results; then proceed to GRU and blend.

In [1]:
import subprocess, time
print('Checking GPU with nvidia-smi...', flush=True)
t0 = time.time()
# '|| true' forces a zero shell exit code even if nvidia-smi fails, so read the printed output.
ret = subprocess.run(['bash','-lc','nvidia-smi || true'], stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
print(ret.stdout)
print(f'Elapsed: {time.time()-t0:.2f}s', flush=True)

Checking GPU with nvidia-smi...


Failed to initialize NVML: Unknown Error

Elapsed: 0.01s
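
Because the check above always exits with status 0, a failure like this one is only visible in the printed text. A hedged alternative is to read the exit status of `nvidia-smi` directly and fall back to CPU; the helper name `gpu_visible` and the timeout are assumptions, and a passing check still does not guarantee that a given framework can use the GPU.

```python
import shutil
import subprocess

def gpu_visible(timeout_s: float = 10.0) -> bool:
    """Return True only if nvidia-smi is on PATH and exits with status 0."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        proc = subprocess.run(
            [exe],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # A failed driver/NVML initialization typically yields a non-zero status.
    return proc.returncode == 0

device = "cuda" if gpu_visible() else "cpu"
print(f"Selected device: {device}")
```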
