# INGV - Volcanic Eruption Prediction: Plan to Medal

## Objectives
- Build a fast, reliable baseline with strong CV.
- Iterate feature engineering (time + frequency) and GPU models (XGBoost/CatBoost).
- Ensemble diverse seeds/models once CV is stable.
- Submit only once CV is stable, then check that CV tracks the LB.

## Data Understanding & Pipeline
- Inspect train.csv schema (ID, target column name, any meta).
- Load a few train/*.csv segments to confirm sensor columns, sampling rate, length.
- Mirror test-time pipeline: per-file features -> model -> predict.
- Cache features to disk (parquet) for reuse.

## Validation
- Avoid leakage: fit all transforms within folds.
- Determine grouping (by volcano/station/day if present in train.csv; else robust KFold with file-wise split).
- Fix a deterministic CV (e.g., 5 folds GroupKFold if group available; else 5-fold KFold by file).
- Track OOF MAE and per-fold times.
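
A minimal sketch of the deterministic file-wise split described above, assuming no grouping column turns out to be available. The helper names `assign_folds` and `oof_mae` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np

def assign_folds(segment_ids, n_splits=5, seed=42):
    """Map each segment_id to a fold index in [0, n_splits), reproducibly."""
    ids = np.sort(np.asarray(segment_ids))        # make the result independent of input order
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(ids))
    folds = np.empty(len(ids), dtype=int)
    folds[perm] = np.arange(len(ids)) % n_splits  # near-equal fold sizes
    return dict(zip(ids.tolist(), folds.tolist()))

def oof_mae(y_true, oof_pred):
    """Mean absolute error over out-of-fold predictions."""
    y_true = np.asarray(y_true, dtype=np.float64)
    oof_pred = np.asarray(oof_pred, dtype=np.float64)
    return float(np.mean(np.abs(y_true - oof_pred)))
```

Storing the fold map next to the cached features keeps OOF scores comparable across feature versions and models.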

## Baseline v1
- Subsample run: 200 files, 2 folds to smoke-test.
- Simple features per channel:
  - Global: mean, std, min, max, q01/q05/q25/q50/q75/q95/q99, iqr, skew, kurtosis, zero crossings, RMS.
  - Rolling windows (coarse, e.g., 10–20 chunks): chunk means/stds and their stats.
  - Frequency: FFT bandpowers over log-spaced bands; spectral centroid, roll-off, spectral flatness.
- Model: XGBoost regressor with GPU (tree_method='hist', device='cuda').
- Early stopping, 2000 rounds max, patience 100.
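
As a rough illustration of the frequency bullet above, the sketch below computes FFT bandpowers over log-spaced bands and a power-weighted spectral centroid using NumPy only. The function names and the band range are assumptions made for the example. The extractor later in this notebook uses a Welch PSD with fixed octave-style bands instead.

```python
import numpy as np

def log_band_edges(f_lo, f_hi, n_bands):
    """Return n_bands + 1 logarithmically spaced band edges in Hz."""
    return np.geomspace(f_lo, f_hi, n_bands + 1)

def power_spectrum(x, fs):
    """One-sided raw power spectrum of the mean-removed signal."""
    x = np.asarray(x, dtype=np.float64)
    x = x - x.mean()
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return freqs, power

def fft_bandpowers(x, fs, edges):
    """Sum spectral power inside each [lo, hi) band."""
    freqs, power = power_spectrum(x, fs)
    out = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (freqs >= lo) & (freqs < hi)
        out.append(float(power[mask].sum()))
    return np.array(out)

def spectral_centroid(x, fs):
    """Power-weighted mean frequency; 0.0 for a flat (all-zero) signal."""
    freqs, power = power_spectrum(x, fs)
    total = power.sum()
    return float((freqs * power).sum() / total) if total > 0 else 0.0
```

Log spacing gives finer resolution at low frequencies, where band edges would otherwise be too coarse to separate slow tremor-like energy from the rest.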

## Feature Set v2
- Add robust features:
  - Denoised stats (median filtered).
  - Autocorr features, dominant period via FFT peak.
  - Per-axis + cross-axis correlations if multiple channels.
- Try CatBoost GPU as alternative; blend with XGB.
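
The autocorrelation and dominant-period bullets above could be sketched as follows. This `autocorr_features` is a hypothetical helper: the extractor cell below calls a function of that name behind `USE_ACF`, but its definition is not shown in this notebook, and the explicit `fs` argument and the output keys here are assumptions.

```python
import numpy as np

def autocorr_features(x, fs, max_sec=5.0):
    """Hypothetical helper: ACF peak after the first zero crossing, plus FFT dominant period."""
    x = np.asarray(x, dtype=np.float64)
    x = x - x.mean()
    n = len(x)
    out = {'acf_peak_lag_s': 0.0, 'acf_peak_val': 0.0, 'dom_period_s': 0.0}
    if n < 4 or not np.any(x):
        return out
    # Autocorrelation via FFT; zero-padding to >= 2n-1 avoids circular wrap-around
    nfft = 1 << (2 * n - 1).bit_length()
    spec = np.fft.rfft(x, nfft)
    acf = np.fft.irfft(spec * np.conj(spec), nfft)[:n]
    acf = acf / acf[0]
    max_lag = min(n - 1, int(max_sec * fs))
    window = acf[:max_lag + 1]
    neg = np.flatnonzero(window < 0)
    if neg.size:
        # Highest ACF value after the first negative lag
        start = int(neg[0])
        k = start + int(np.argmax(window[start:]))
        out['acf_peak_lag_s'] = k / fs
        out['acf_peak_val'] = float(window[k])
    # Dominant period from the largest non-DC bin of the power spectrum
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    kmax = 1 + int(np.argmax(power[1:]))
    out['dom_period_s'] = float(1.0 / freqs[kmax])
    return out
```

Searching only after the first negative lag skips the trivial peak at lag 0, so the reported lag reflects a repeating pattern and not the signal's self-similarity at zero shift.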

## Efficiency
- Verify GPU available first.
- Parallel feature extraction (joblib) with chunked reads.
- Log progress and elapsed times per fold and per 100 files.
- Cache train/test features and OOF predictions.

## Ensembling
- Blend XGB/CatBoost (and seeds). Weight by 1/MAE^p on OOF.
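
One way to read the `1/MAE^p` weighting above is sketched here. The function names and the default `p=2.0` are assumptions for illustration; `p` would normally be tuned on OOF predictions.

```python
import numpy as np

def inverse_mae_weights(oof_maes, p=2.0):
    """Weights proportional to 1 / MAE**p, normalised to sum to 1."""
    maes = np.asarray(oof_maes, dtype=np.float64)
    w = 1.0 / np.power(maes, p)
    return w / w.sum()

def blend(preds, weights):
    """Weighted average of per-model prediction vectors (one row per model)."""
    return np.average(np.vstack(preds), axis=0, weights=weights)
```

A larger `p` pushes the blend toward the best single model, while `p=0` reduces to a plain average.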

## Error Analysis
- Inspect OOF residuals by file stats (e.g., signal energy, station/volcano if available).
- Iterate on top error buckets.
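
A possible shape for the residual inspection above, assuming an OOF table with the `oof` and `y` columns that the modeling cells save, plus one per-file statistic. The column name `signal_energy` in the usage note and the helper name are hypothetical.

```python
import numpy as np
import pandas as pd

def residuals_by_bucket(oof_df, stat_col, n_buckets=5):
    """Summarise absolute OOF error per quantile bucket of a per-file statistic."""
    df = oof_df.copy()
    df['abs_err'] = (df['y'] - df['oof']).abs()
    # Equal-count buckets; duplicate edges are dropped if the stat has many ties
    df['bucket'] = pd.qcut(df[stat_col], q=n_buckets, labels=False, duplicates='drop')
    out = df.groupby('bucket')['abs_err'].agg(['count', 'mean', 'median'])
    return out.sort_values('mean', ascending=False)
```

For example, `residuals_by_bucket(oof_table, 'signal_energy')` lists the buckets with the largest mean error first, which is where feature work is most likely to pay off.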

## Deliverables
- submission.csv with predictions for test IDs.
- Saved features, OOF, model configs.

Next:
1. Verify GPU and environment.
2. Inspect train.csv and a sample file.
3. Implement feature extractor skeleton.
4. Build CV and baseline XGB.

In [1]:
# Setup, GPU check, and schema inspection
import os, sys, json, time, glob, math, random, gc, warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
from pathlib import Path

print('Python', sys.version)
print('CWD:', os.getcwd())

# Check that xgboost imports and, if torch is installed, whether CUDA is visible.
# This does not by itself confirm that xgboost can train on the GPU.
gpu_info = {}
try:
    import xgboost as xgb
    gpu_info['xgboost_version'] = xgb.__version__
    print('xgboost', xgb.__version__)
except Exception as e:
    print('xgboost import failed:', e)

try:
    import torch
    gpu_info['torch'] = {
        'available': torch.cuda.is_available(),
        'count': torch.cuda.device_count(),
        'name': torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
    }
    print('GPU Available (torch):', gpu_info['torch']['available'])
    if gpu_info['torch']['available']:
        print('GPU Count:', gpu_info['torch']['count'])
        print('GPU Name:', gpu_info['torch']['name'])
except Exception as e:
    print('torch check failed:', e)

# Paths
DATA_DIR = Path('.')
TRAIN_DIR = DATA_DIR / 'train'
TEST_DIR = DATA_DIR / 'test'

# Inspect train.csv and sample_submission.csv
train_meta_path = DATA_DIR / 'train.csv'
ss_path = DATA_DIR / 'sample_submission.csv'
assert train_meta_path.exists(), 'train.csv not found'
assert ss_path.exists(), 'sample_submission.csv not found'

train_meta = pd.read_csv(train_meta_path)
print('train.csv shape:', train_meta.shape)
print('train.csv columns:', list(train_meta.columns))
print(train_meta.head(3))

ss = pd.read_csv(ss_path)
print('sample_submission shape:', ss.shape)
print('sample_submission head:')
print(ss.head())

# Peek a few files to understand sensor columns and length
train_files = sorted(glob.glob(str(TRAIN_DIR / '*.csv')))[:5]
print('Sample train files:', [Path(f).name for f in train_files])

def peek_file(fp, n=5):
    df = pd.read_csv(fp, nrows=n)
    return df

for fp in train_files:
    df_head = peek_file(fp, n=5)
    print('File:', Path(fp).name, 'shape(head)=', df_head.shape, 'columns=', list(df_head.columns))
    break

# Read full length of one small file to gauge memory/length (use first file)
t0 = time.time()
probe_fp = train_files[0] if train_files else None
if probe_fp:
    df_probe = pd.read_csv(probe_fp)
    print('Probe file:', Path(probe_fp).name, 'shape=', df_probe.shape, 'dtypes:', df_probe.dtypes.to_dict())
    print('Elapsed to read probe: %.2fs' % (time.time()-t0))
    # Basic stats on columns
    print('NA fraction per col:', df_probe.isna().mean().to_dict())
else:
    print('No train files found!')

# Utility: define band edges according to sampling rate later after we detect Fs and signal length
def describe_signal_layout(df):
    cols = list(df.columns)
    n_rows = len(df)
    return {'n_rows': n_rows, 'cols': cols}

if probe_fp:
    print('Layout:', describe_signal_layout(df_probe))

print('SETUP DONE')

Python 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0]
CWD: /app/agent_run_states/predict-volcanic-eruptions-ingv-oe-spray-20250912-042104


xgboost 2.1.4
torch check failed: No module named 'torch'
train.csv shape: (3987, 2)
train.csv columns: ['segment_id', 'time_to_eruption']
   segment_id  time_to_eruption
0  1410693848          20109998
1   987159268           7202883
2  1990984540          28138930
sample_submission shape: (444, 2)
sample_submission head:
   segment_id  time_to_eruption
0   951290289                 0
1   508758258                 0
2  1566132188                 0
3  1891418251                 0
4  1968343855                 0
Sample train files: ['1000015382.csv', '1000554676.csv', '1000745424.csv', '1001461087.csv', '1001732002.csv']
File: 1000015382.csv shape(head)= (5, 10) columns= ['sensor_1', 'sensor_2', 'sensor_3', 'sensor_4', 'sensor_5', 'sensor_6', 'sensor_7', 'sensor_8', 'sensor_9', 'sensor_10']
Probe file: 1000015382.csv shape= (60001, 10) dtypes: {'sensor_1': dtype('float64'), 'sensor_2': dtype('float64'), 'sensor_3': dtype('float64'), 'sensor_4': dtype('float64'), 'sensor_5': dtype('float

In [2]:
# Feature extraction utilities and smoke feature build
import os, time, glob, math, gc
import numpy as np
import pandas as pd
from pathlib import Path
from typing import List, Dict, Tuple

# Ensure deps
def ensure_imports():
    import importlib, subprocess, sys
    def pip_install(pkg):
        print(f'Installing {pkg} ...')
        subprocess.run([sys.executable, '-m', 'pip', 'install', pkg], check=True)
    for mod, pkg in [('scipy', 'scipy'), ('joblib', 'joblib'), ('sklearn', 'scikit-learn')]:
        try:
            importlib.import_module(mod)
        except ImportError:
            pip_install(pkg)
    # re-import after potential install
    global signal, welch, RobustScaler, sosfiltfilt, butter, periodogram
    from scipy import signal
    from scipy.signal import welch, sosfiltfilt, butter, periodogram
    from sklearn.preprocessing import RobustScaler

ensure_imports()
from joblib import Parallel, delayed

RNG = np.random.default_rng(42)

# Sampling rate detection from a probe file (assume constant across files)
probe_fp = sorted(glob.glob(str(TRAIN_DIR / '*.csv')))[0]
probe_n = len(pd.read_csv(probe_fp))
Fs = max(1, round((probe_n - 1) / 600))  # approx samples per second, 600 seconds window
print('Detected Fs ~', Fs, 'Hz from n_rows=', probe_n)
nyq = Fs / 2.0

# Bands (Hz), capped below Nyquist
raw_bands = [(0.5,2),(2,4),(4,8),(8,16),(16,32),(32,48)]
BANDS = [(lo, min(hi, nyq*0.96)) for (lo,hi) in raw_bands if lo < nyq*0.96]
print('Using bands:', BANDS)

# Controls
USE_ACF = False  # drop autocorr for v1 full run per expert advice

def linear_detrend(x: np.ndarray) -> np.ndarray:
    return signal.detrend(x, type='linear', overwrite_data=False)

def butter_highpass_sos(x: np.ndarray, cutoff=1.0, order=3) -> np.ndarray:
    if cutoff >= nyq:
        return x
    sos = butter(order, Wn=cutoff/nyq, btype='highpass', output='sos')
    try:
        return sosfiltfilt(sos, x)
    except ValueError:
        # Series too short for the filter's default padding: return it unfiltered
        return x

def robust_scale_per_series(x: np.ndarray) -> np.ndarray:
    med = np.nanmedian(x)
    iqr = np.nanpercentile(x, 75) - np.nanpercentile(x, 25)
    if iqr == 0 or not np.isfinite(iqr):
        return x - med
    return (x - med) / iqr

def zero_crossing_rate(x: np.ndarray) -> float:
    x = np.nan_to_num(x, nan=0.0)
    return float(((x[:-1] * x[1:]) < 0).mean())

def line_length(x: np.ndarray) -> float:
    return float(np.nanmean(np.abs(np.diff(x))))

def hjorth_params(x: np.ndarray) -> Tuple[float,float,float]:
    x = np.nan_to_num(x, nan=0.0)
    var0 = np.var(x)
    dx = np.diff(x)
    var1 = np.var(dx) if len(dx)>0 else 0.0
    ddx = np.diff(dx)
    var2 = np.var(ddx) if len(ddx)>0 else 0.0
    activity = var0
    mobility = math.sqrt(var1/var0) if var0>0 else 0.0
    complexity = math.sqrt((var2/var1)) / mobility if (var1>0 and mobility>0) else 0.0
    return float(activity), float(mobility), float(complexity)

def spectral_features(x: np.ndarray) -> Dict[str, float]:
    # Welch PSD
    x = np.nan_to_num(x, nan=0.0).astype(np.float32)
    try:
        f, Pxx = welch(x, fs=Fs, nperseg=min(len(x), 1024), noverlap=512, detrend='constant')
    except Exception:
        # Fallback: simple periodogram
        f, Pxx = periodogram(x, fs=Fs, scaling='density')
    Pxx = np.maximum(Pxx, 1e-20)
    total_power = np.trapz(Pxx, f)
    # bandpowers
    feats = {}
    for i,(lo,hi) in enumerate(BANDS):
        mask = (f>=lo) & (f<hi)
        bp = np.trapz(Pxx[mask], f[mask]) if mask.any() else 0.0
        feats[f'bandpower_{i}_{lo:.1f}_{hi:.1f}'] = float(bp)
    # ratio of highest to lowest band (keys derived from BANDS, not hard-coded)
    if len(BANDS)>=2:
        low = feats.get(f'bandpower_0_{BANDS[0][0]:.1f}_{BANDS[0][1]:.1f}', 0.0)
        high = feats.get(f'bandpower_{len(BANDS)-1}_{BANDS[-1][0]:.1f}_{BANDS[-1][1]:.1f}', 0.0)
        feats['bp_ratio_high_low'] = float(high/(low+1e-9))
    # centroid
    centroid = float(np.sum(f*Pxx)/np.sum(Pxx)) if total_power>0 else 0.0
    feats['spec_centroid'] = centroid
    # roll-off 95%
    cumsum = np.cumsum(Pxx) / np.sum(Pxx)
    idx95 = np.searchsorted(cumsum, 0.95)
    feats['spec_rolloff95'] = float(f[min(idx95, len(f)-1)]) if len(f)>0 else 0.0
    # flatness (geometric/arith mean)
    geo = float(np.exp(np.mean(np.log(Pxx))))
    ari = float(np.mean(Pxx))
    feats['spec_flatness'] = float(geo/(ari+1e-12))
    # entropy
    p = Pxx/np.sum(Pxx)
    feats['spec_entropy'] = float(-np.sum(p*np.log(p+1e-12)))
    # dominant peak
    idx = int(np.argmax(Pxx)) if len(Pxx)>0 else 0
    feats['peak_freq'] = float(f[idx]) if len(f)>0 else 0.0
    feats['peak_amp'] = float(Pxx[idx]) if len(Pxx)>0 else 0.0
    return feats

def chunk_features(x: np.ndarray, n_chunks: int = 10) -> Dict[str, float]:
    n = len(x)
    feats = {}
    if n_chunks <= 1 or n < n_chunks:
        return feats
    idxs = np.linspace(0, n, n_chunks+1, dtype=int)
    means, stds, rmss = [], [], []
    for i in range(n_chunks):
        seg = x[idxs[i]:idxs[i+1]]
        if len(seg)==0:
            m=0; s=0; r=0
        else:
            m = float(np.nanmean(seg))
            s = float(np.nanstd(seg))
            r = float(np.sqrt(np.nanmean(seg**2)))
        means.append(m); stds.append(s); rmss.append(r)
    means = np.array(means); stds = np.array(stds); rmss = np.array(rmss)
    # summarize across chunks
    for arr, name in [(means,'mean'), (stds,'std'), (rmss,'rms')]:
        feats[f'chunks_{name}_mean'] = float(np.nanmean(arr))
        feats[f'chunks_{name}_std'] = float(np.nanstd(arr))
        feats[f'chunks_{name}_min'] = float(np.nanmin(arr))
        feats[f'chunks_{name}_max'] = float(np.nanmax(arr))
        # slope via linear fit
        x_idx = np.arange(len(arr))
        try:
            slope = float(np.polyfit(x_idx, arr, 1)[0])
        except Exception:
            slope = 0.0
        feats[f'chunks_{name}_slope'] = slope
        feats[f'chunks_{name}_last_first_delta'] = float(arr[-1] - arr[0])
    # energy fraction in last 3 chunks
    energy = rmss**2
    denom = float(energy.sum()) + 1e-9
    feats['chunks_energy_last3_frac'] = float(energy[-3:].sum()/denom) if len(energy)>=3 else 0.0
    return feats

def global_stats(x: np.ndarray) -> Dict[str, float]:
    x = np.array(x, dtype=np.float32)
    feats = {}
    qs = [1,5,25,50,75,95,99]
    qv = np.nanpercentile(x, qs)
    feats.update({f'q{q}': float(v) for q,v in zip(qs, qv)})
    feats['mean'] = float(np.nanmean(x))
    feats['std'] = float(np.nanstd(x))
    feats['min'] = float(np.nanmin(x))
    feats['max'] = float(np.nanmax(x))
    feats['iqr'] = float(np.nanpercentile(x,75) - np.nanpercentile(x,25))
    feats['rms'] = float(np.sqrt(np.nanmean(x**2)))
    feats['mad'] = float(np.nanmedian(np.abs(x - np.nanmedian(x))))
    with np.errstate(all='ignore'):
        feats['skew'] = float(pd.Series(x).skew())
        feats['kurt'] = float(pd.Series(x).kurt())
    feats['zcr'] = zero_crossing_rate(x)
    feats['line_len'] = line_length(x)
    a,m,c = hjorth_params(x)
    feats['hj_activity'] = a; feats['hj_mobility'] = m; feats['hj_complexity'] = c
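    # Fraction of non-finite samples; this is 0 whenever the caller has already filled NaNs,
    # as compute_features_for_file does before calling global_stats
    feats['nan_frac'] = float(np.mean(~np.isfinite(x)))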
    return feats

def compute_features_for_file(fp: str, segment_id: int) -> Dict[str, float]:
    t0 = time.time()
    df = pd.read_csv(fp, dtype=np.float32)
    cols = df.columns.tolist()
    # Handle NaNs
    df = df.replace([np.inf, -np.inf], np.nan)
    df = df.fillna(0.0)
    # Cross-channel correlations
    corr_feats = {}
    try:
        corr = df.corr().values
        # upper triangle without diag
        idx = np.triu_indices_from(corr, k=1)
        corr_vals = corr[idx]
        corr_feats['xcorr_mean'] = float(np.nanmean(corr_vals))
        corr_feats['xcorr_std'] = float(np.nanstd(corr_vals))
        corr_feats['xcorr_max'] = float(np.nanmax(corr_vals))
        corr_feats['xcorr_min'] = float(np.nanmin(corr_vals))
    except Exception:
        corr_feats['xcorr_mean']=corr_feats['xcorr_std']=corr_feats['xcorr_max']=corr_feats['xcorr_min']=0.0
    feats = {**{ 'segment_id': segment_id }, **corr_feats}
    # Per-channel features
    for col in cols:
        x = df[col].values.astype(np.float32)
        x = linear_detrend(x)
        x = butter_highpass_sos(x, cutoff=1.0, order=3)
        # optional robust scale per series (keep unitless features stable)
        # x = robust_scale_per_series(x)
        g = global_stats(x)
        ch = f'{col}'
        for k,v in g.items(): feats[f'{ch}_{k}'] = v
        spec = spectral_features(x)
        for k,v in spec.items(): feats[f'{ch}_{k}'] = v
        chf = chunk_features(x, n_chunks=10)
        for k,v in chf.items(): feats[f'{ch}_{k}'] = v
        if USE_ACF:
            acf = autocorr_features(x, max_sec=5.0)
            for k,v in acf.items(): feats[f'{ch}_{k}'] = v
    # timing
    feats['_proc_seconds'] = float(time.time()-t0)
    return feats

def build_features_for_ids(ids: List[int], split: str, n_jobs: int = 4, max_items: int = None) -> pd.DataFrame:
    if max_items is not None:
        ids = ids[:max_items]
    base_dir = TRAIN_DIR if split=='train' else TEST_DIR
    fps = [str(base_dir / f'{i}.csv') for i in ids]
    assert all([Path(fp).exists() for fp in fps]), 'Some files missing'
    print(f'[{split}] Extracting features for {len(ids)} files with {n_jobs} jobs...')
    t0 = time.time()
    results = Parallel(n_jobs=n_jobs, prefer='threads')(delayed(compute_features_for_file)(fp, seg_id) for fp, seg_id in zip(fps, ids))
    df = pd.DataFrame(results)
    print(f'[{split}] Done. Shape={df.shape}. Elapsed {time.time()-t0:.1f}s')
    return df

# Smoke build: 400 train + all test later
train_ids = train_meta['segment_id'].tolist()
test_ids = pd.read_csv('sample_submission.csv')['segment_id'].tolist()

SMOKE_N = 400
train_feats_smoke = build_features_for_ids(train_ids, 'train', n_jobs=6, max_items=SMOKE_N)
train_feats_smoke.to_parquet('train_features_smoke.parquet', index=False)
print('Saved train_features_smoke.parquet')

test_feats_smoke = build_features_for_ids(test_ids, 'test', n_jobs=6, max_items=200)  # limit for speed; full later
test_feats_smoke.to_parquet('test_features_smoke.parquet', index=False)
print('Saved test_features_smoke.parquet')

gc.collect(); print('SMOKE FEATURE EXTRACTION COMPLETE')

Detected Fs ~ 100 Hz from n_rows= 60001
Using bands: [(0.5, 2), (2, 4), (4, 8), (8, 16), (16, 32), (32, 48)]
[train] Extracting features for 400 files with 6 jobs...


[train] Done. Shape=(400, 566). Elapsed 504.1s
Saved train_features_smoke.parquet
[test] Extracting features for 200 files with 6 jobs...


In [6]:
# Modeling: XGBoost GPU baseline on smoke features with KFold (temporary) using xgb.train
import time, gc
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error
import xgboost as xgb
from pathlib import Path

t0 = time.time()
train_feats_path = 'train_features_smoke.parquet'
assert Path(train_feats_path).exists(), 'train_features_smoke.parquet not found. Build features first.'
train_feats = pd.read_parquet(train_feats_path)
print('Loaded train features:', train_feats.shape)

# Merge target
train_df = train_feats.merge(train_meta[['segment_id','time_to_eruption']], on='segment_id', how='left')
print('Merged with target:', train_df.shape, 'missing targets:', train_df['time_to_eruption'].isna().sum())

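# Features/target
# NOTE: '_proc_seconds' (per-file extraction time) is not a signal feature but is not in drop_cols,
# so it is part of the feature matrix in this run.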
drop_cols = ['segment_id', 'time_to_eruption']
feat_cols = [c for c in train_df.columns if c not in drop_cols]
X = train_df[feat_cols].astype(np.float32).values
y = train_df['time_to_eruption'].values.astype(np.float32)
print('Feature matrix:', X.shape, 'Target shape:', y.shape)

# KFold (temporary until we infer proper groups)
n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
oof = np.zeros(len(train_df), dtype=np.float32)
models = []

params = {
    'tree_method': 'hist',
    'device': 'cuda',
    'max_depth': 8,
    'learning_rate': 0.04,
    'subsample': 0.75,
    'colsample_bytree': 0.7,
    'min_child_weight': 9,
    'reg_alpha': 0.5,
    'reg_lambda': 6.0,
    'objective': 'reg:absoluteerror',
    'eval_metric': 'mae',
    'seed': 42
}

num_boost_round = 4000
esr = 250

for fold, (trn_idx, val_idx) in enumerate(kf.split(X, y)):
    t_fold = time.time()
    print(f'Fold {fold+1}/{n_splits} - train {len(trn_idx)} val {len(val_idx)}')
    X_tr, y_tr = X[trn_idx], y[trn_idx]
    X_va, y_va = X[val_idx], y[val_idx]
    dtrain = xgb.DMatrix(X_tr, label=y_tr)
    dvalid = xgb.DMatrix(X_va, label=y_va)
    evals = [(dtrain, 'train'), (dvalid, 'valid')]
    booster = xgb.train(
        params=params,
        dtrain=dtrain,
        num_boost_round=num_boost_round,
        evals=evals,
        early_stopping_rounds=esr,
        verbose_eval=200
    )
    # Use best_iteration from early stopping; add +1 since iteration_range is exclusive of end
    best_iter = getattr(booster, 'best_iteration', None)
    if best_iter is not None:
        preds = booster.predict(dvalid, iteration_range=(0, best_iter + 1))
    else:
        preds = booster.predict(dvalid)
    oof[val_idx] = preds.astype(np.float32)
    mae = mean_absolute_error(y_va, preds)
    print(f'Fold {fold+1} MAE: {mae:,.0f} | best_iter={best_iter} | elapsed {time.time()-t_fold:.1f}s')
    models.append(booster)
    gc.collect()

oof_mae = mean_absolute_error(y, oof)
print(f'OOF MAE (KFold, smoke): {oof_mae:,.0f}')
print('Total modeling time: %.1fs' % (time.time()-t0))

# Save OOF for tracking
pd.DataFrame({'segment_id': train_df['segment_id'], 'oof': oof, 'y': y}).to_csv('oof_smoke.csv', index=False)
print('Saved oof_smoke.csv')

Loaded train features: (400, 566)
Merged with target: (400, 567) missing targets: 0
Feature matrix: (400, 565) Target shape: (400,)
Fold 1/5 - train 320 val 80
[0]	train-mae:10771263.39687	valid-mae:12057508.26250


[200]	train-mae:2241880.19062	valid-mae:8217171.87812


[400]	train-mae:1430950.44687	valid-mae:8027088.26562


[600]	train-mae:1134374.70391	valid-mae:7896189.62812


[800]	train-mae:1016807.18809	valid-mae:7878095.45625


[1000]	train-mae:950226.04766	valid-mae:7877420.52187


[1200]	train-mae:913312.54004	valid-mae:7853254.15937


[1400]	train-mae:865172.35469	valid-mae:7850800.11562


[1600]	train-mae:833669.07871	valid-mae:7833513.45937


[1800]	train-mae:799967.25820	valid-mae:7825075.12812


[2000]	train-mae:778541.44883	valid-mae:7820125.47187


[2200]	train-mae:751737.51992	valid-mae:7808549.63125


[2400]	train-mae:717185.39482	valid-mae:7779025.78750


[2600]	train-mae:698471.64658	valid-mae:7773014.90625


[2800]	train-mae:677099.66621	valid-mae:7761023.86250


[3000]	train-mae:661154.48301	valid-mae:7757002.35938


[3200]	train-mae:646850.59736	valid-mae:7763307.18125


[3316]	train-mae:637879.51064	valid-mae:7759282.78125


Fold 1 MAE: 7,752,507 | best_iter=3067 | elapsed 20.9s
Fold 2/5 - train 320 val 80


[0]	train-mae:10884712.80000	valid-mae:11557859.46250


[200]	train-mae:1972414.13867	valid-mae:7016213.89687


[400]	train-mae:1280609.04238	valid-mae:7034757.12500


[600]	train-mae:1056318.05273	valid-mae:6975958.63125


[800]	train-mae:974493.12852	valid-mae:6976011.10313


[1000]	train-mae:891084.98662	valid-mae:6967420.00625


[1089]	train-mae:872635.85752	valid-mae:6967847.63750


Fold 2 MAE: 6,966,051 | best_iter=840 | elapsed 7.0s
Fold 3/5 - train 320 val 80
[0]	train-mae:10947865.20000	valid-mae:11231243.97500


[200]	train-mae:1872362.40488	valid-mae:6685098.28750


[400]	train-mae:1175171.73379	valid-mae:6506978.30938


[600]	train-mae:938047.83457	valid-mae:6474971.59687


[800]	train-mae:853043.20068	valid-mae:6492019.63438


[848]	train-mae:831046.90918	valid-mae:6479554.95000


Fold 3 MAE: 6,474,742 | best_iter=599 | elapsed 5.3s
Fold 4/5 - train 320 val 80
[0]	train-mae:10933355.78750	valid-mae:11484721.80000


[200]	train-mae:1917420.82187	valid-mae:8280821.61875


[400]	train-mae:1256486.66992	valid-mae:8096406.41875


[600]	train-mae:1014684.13574	valid-mae:8079449.51250


[707]	train-mae:944154.42754	valid-mae:8080435.73125


Fold 4 MAE: 8,072,845 | best_iter=457 | elapsed 4.4s
Fold 5/5 - train 320 val 80
[0]	train-mae:11146637.83437	valid-mae:10506216.40000


[200]	train-mae:2016210.62500	valid-mae:5453072.30000


[400]	train-mae:1337177.23135	valid-mae:5317999.45000


[600]	train-mae:1091180.05674	valid-mae:5290912.74375


[800]	train-mae:1002907.26348	valid-mae:5281947.68750


[1000]	train-mae:921244.12676	valid-mae:5268951.65313


[1200]	train-mae:815355.05488	valid-mae:5242442.84375


[1400]	train-mae:703308.37461	valid-mae:5270824.15313


[1426]	train-mae:695759.46719	valid-mae:5271523.86250


Fold 5 MAE: 5,233,041 | best_iter=1177 | elapsed 9.1s
OOF MAE (KFold, smoke): 6,899,838
Total modeling time: 47.2s
Saved oof_smoke.csv


In [7]:
# GroupKFold CV sanity-check on smoke features (no leakage)
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold
from sklearn.metrics import mean_absolute_error
import xgboost as xgb
from pathlib import Path
import time, gc

t0 = time.time()
# Load smoke features and targets
train_feats_path = 'train_features_smoke.parquet'
assert Path(train_feats_path).exists(), 'Missing train_features_smoke.parquet'
train_feats = pd.read_parquet(train_feats_path)
df_meta = train_meta[['segment_id','time_to_eruption']].copy()

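# Build groups: sort by target and start a new group at every increase in time_to_eruption.
# Each distinct target value gets its own group, so with (near-)unique targets this yields
# one group per row and GroupKFold behaves like a plain file-wise split.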
tmp = df_meta.sort_values('time_to_eruption').reset_index(drop=True)
tmp['group'] = (tmp['time_to_eruption'].diff().fillna(1) > 0).cumsum().astype(int)
groups_map = tmp.set_index('segment_id')['group']

# Merge features with target and groups
df = train_feats.merge(df_meta, on='segment_id', how='left')
df['group'] = df['segment_id'].map(groups_map)
assert df['group'].notna().all(), 'Grouping failed for some rows'
print('Data shape:', df.shape, '| unique groups:', df['group'].nunique())

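# NOTE: '_proc_seconds' (per-file extraction time) is not dropped here, so it is used as a feature in this run
drop_cols = ['segment_id','time_to_eruption','group']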
feat_cols = [c for c in df.columns if c not in drop_cols]
X = df[feat_cols].astype(np.float32).values
y = df['time_to_eruption'].astype(np.float32).values
groups = df['group'].values.astype(int)

gkf = GroupKFold(n_splits=5)
oof = np.zeros(len(df), dtype=np.float32)
models = []

params = {
    'tree_method': 'hist',
    'device': 'cuda',
    'max_depth': 8,
    'learning_rate': 0.04,
    'subsample': 0.75,
    'colsample_bytree': 0.7,
    'min_child_weight': 9,
    'reg_alpha': 0.5,
    'reg_lambda': 6.0,
    'objective': 'reg:absoluteerror',
    'eval_metric': 'mae',
    'seed': 42
}
num_boost_round = 4000
esr = 250

for fold, (trn_idx, val_idx) in enumerate(gkf.split(X, y, groups=groups)):
    t_fold = time.time()
    print(f'[GroupKFold] Fold {fold+1}/5 | train {len(trn_idx)} val {len(val_idx)} | groups train {len(np.unique(groups[trn_idx]))} val {len(np.unique(groups[val_idx]))}')
    dtrain = xgb.DMatrix(X[trn_idx], label=y[trn_idx])
    dvalid = xgb.DMatrix(X[val_idx], label=y[val_idx])
    booster = xgb.train(
        params=params,
        dtrain=dtrain,
        num_boost_round=num_boost_round,
        evals=[(dtrain,'train'),(dvalid,'valid')],
        early_stopping_rounds=esr,
        verbose_eval=200
    )
    best_iter = getattr(booster, 'best_iteration', None)
    preds = booster.predict(dvalid, iteration_range=(0, best_iter+1)) if best_iter is not None else booster.predict(dvalid)
    oof[val_idx] = preds.astype(np.float32)
    mae = mean_absolute_error(y[val_idx], preds)
    print(f'Fold {fold+1} MAE: {mae:,.0f} | best_iter={best_iter} | elapsed {time.time()-t_fold:.1f}s')
    models.append(booster)
    gc.collect()

oof_mae = mean_absolute_error(y, oof)
print(f'OOF MAE (GroupKFold, smoke): {oof_mae:,.0f}')
print('Elapsed: %.1fs' % (time.time()-t0))
pd.DataFrame({'segment_id': df['segment_id'], 'group': groups, 'oof_groupkfold': oof, 'y': y}).to_csv('oof_smoke_groupkfold.csv', index=False)
print('Saved oof_smoke_groupkfold.csv')

Data shape: (400, 568) | unique groups: 400
[GroupKFold] Fold 1/5 | train 320 val 80 | groups train 320 val 80
[0]	train-mae:10972174.19063	valid-mae:11153262.85000


[200]	train-mae:2027805.37969	valid-mae:6858036.78750


[400]	train-mae:1262405.79746	valid-mae:6732352.31250


[600]	train-mae:1005253.85283	valid-mae:6657007.12812


[800]	train-mae:898596.05713	valid-mae:6618894.47500


[1000]	train-mae:823482.88311	valid-mae:6594702.48125


[1200]	train-mae:779602.87744	valid-mae:6571923.55312


[1400]	train-mae:746336.09297	valid-mae:6567170.58750


[1600]	train-mae:714133.68193	valid-mae:6547479.67500


[1800]	train-mae:694226.50791	valid-mae:6542255.30625


[2000]	train-mae:676580.30791	valid-mae:6537485.02813


[2200]	train-mae:650567.82021	valid-mae:6538591.81250


[2400]	train-mae:605071.04380	valid-mae:6525320.77813


[2600]	train-mae:578619.60764	valid-mae:6521650.82812


[2800]	train-mae:561023.33330	valid-mae:6514998.05625


[3000]	train-mae:549055.14314	valid-mae:6514254.05625


[3200]	train-mae:538478.40867	valid-mae:6511817.54063


[3400]	train-mae:523684.65679	valid-mae:6508668.10625


[3600]	train-mae:509224.76643	valid-mae:6505688.95937


[3736]	train-mae:504328.09624	valid-mae:6506478.82188


Fold 1 MAE: 6,503,348 | best_iter=3486 | elapsed 23.9s


[GroupKFold] Fold 2/5 | train 320 val 80 | groups train 320 val 80
[0]	train-mae:11083110.73438	valid-mae:10695187.40000


[200]	train-mae:2053779.98379	valid-mae:6419084.64375


[400]	train-mae:1302076.62607	valid-mae:6389420.90781


[554]	train-mae:1096443.33652	valid-mae:6367755.40156


Fold 2 MAE: 6,357,028 | best_iter=304 | elapsed 3.5s
[GroupKFold] Fold 3/5 | train 320 val 80 | groups train 320 val 80
[0]	train-mae:10921313.98125	valid-mae:11129080.80000


[200]	train-mae:2042984.30703	valid-mae:6971588.17500


[400]	train-mae:1370740.44961	valid-mae:6861166.20937


[600]	train-mae:1160274.64297	valid-mae:6846004.06875


[800]	train-mae:1082300.02471	valid-mae:6825338.11250


[1000]	train-mae:982310.23984	valid-mae:6783100.63438


[1200]	train-mae:924562.00508	valid-mae:6766355.49688


[1400]	train-mae:868166.41133	valid-mae:6731829.68438


[1600]	train-mae:827703.55332	valid-mae:6724811.99375


[1800]	train-mae:778218.85840	valid-mae:6717404.62188


[2000]	train-mae:744190.74238	valid-mae:6704594.52187


[2200]	train-mae:717485.32070	valid-mae:6693859.89687


[2400]	train-mae:692441.93145	valid-mae:6694691.33125


[2599]	train-mae:672705.53984	valid-mae:6697231.53437


Fold 3 MAE: 6,687,608 | best_iter=2350 | elapsed 16.5s
[GroupKFold] Fold 4/5 | train 320 val 80 | groups train 320 val 80
[0]	train-mae:10889311.74687	valid-mae:11343677.45000


[200]	train-mae:1980544.47930	valid-mae:7788245.80000


[400]	train-mae:1287210.65400	valid-mae:7638119.60703


[600]	train-mae:1106964.25186	valid-mae:7618815.18984


[800]	train-mae:1005556.73984	valid-mae:7600965.63438


[1000]	train-mae:940986.91777	valid-mae:7584368.21641


[1200]	train-mae:887864.21230	valid-mae:7560587.83906


[1400]	train-mae:839791.57227	valid-mae:7553723.66953


[1600]	train-mae:804906.99717	valid-mae:7562101.73906


[1733]	train-mae:783116.85225	valid-mae:7560125.98203


Fold 4 MAE: 7,550,699 | best_iter=1483 | elapsed 11.0s
[GroupKFold] Fold 5/5 | train 320 val 80 | groups train 320 val 80
[0]	train-mae:11028167.53750	valid-mae:11141533.33750


[200]	train-mae:2080156.45000	valid-mae:7714866.28906


[400]	train-mae:1294914.45137	valid-mae:7414221.20312


[600]	train-mae:1034762.07686	valid-mae:7379992.76562


[800]	train-mae:925981.86270	valid-mae:7353077.61250


[990]	train-mae:884655.40742	valid-mae:7349995.57812


Fold 5 MAE: 7,348,069 | best_iter=741 | elapsed 6.2s
OOF MAE (GroupKFold, smoke): 6,889,350
Elapsed: 61.5s
Saved oof_smoke_groupkfold.csv
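
The `unique groups: 400` line above follows from how the group labels are built: after sorting by target, a new group starts at every strict increase, so each distinct `time_to_eruption` value ends up in its own group. A toy sketch of the same rule, with made-up target values and a hypothetical helper name:

```python
import pandas as pd

def target_change_groups(targets):
    """Group id per row after sorting: increments at every strict increase in the target."""
    s = pd.Series(targets).sort_values().reset_index(drop=True)
    return (s.diff().fillna(1) > 0).cumsum().astype(int)

# Toy targets with one tie: sorted order is 10, 20, 20, 30, 40
g = target_change_groups([30, 10, 20, 20, 40])
print(g.tolist(), '| unique groups:', g.nunique())
```

Only tied targets share a group under this rule, which is why the GroupKFold OOF MAE here (6,889,350) lands close to the shuffled KFold result (6,899,838).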


In [9]:
# Full feature extraction (v1) for all train/test with batching + caching
import os, time, gc, math
import numpy as np
import pandas as pd
from pathlib import Path

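# Limit thread oversubscription for SciPy/NumPy/BLAS.
# These variables are typically read when the libraries first load, so they are most reliable
# when set before NumPy/SciPy are imported (e.g., at the top of the notebook).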
os.environ['OMP_NUM_THREADS'] = os.environ.get('OMP_NUM_THREADS', '1')
os.environ['MKL_NUM_THREADS'] = os.environ.get('MKL_NUM_THREADS', '1')
os.environ['NUMEXPR_NUM_THREADS'] = os.environ.get('NUMEXPR_NUM_THREADS', '1')

train_ids_full = train_meta['segment_id'].tolist()
test_ids_full = pd.read_csv('sample_submission.csv')['segment_id'].tolist()

train_feat_path = Path('train_features_v1.parquet')
test_feat_path = Path('test_features_v1.parquet')

parts_dir_train = Path('train_features_v1_parts'); parts_dir_train.mkdir(exist_ok=True)
parts_dir_test = Path('test_features_v1_parts'); parts_dir_test.mkdir(exist_ok=True)

def extract_in_batches(ids, split: str, out_parts_dir: Path, final_path: Path, batch_size: int = 400, n_jobs: int = 8):
    t0 = time.time()
    n = len(ids)
    n_batches = math.ceil(n / batch_size)
    print(f'[{split}] Total {n} ids → {n_batches} batches of size {batch_size}')
    completed_parts = []
    for b in range(n_batches):
        start = b * batch_size
        end = min(n, (b+1) * batch_size)
        part_path = out_parts_dir / f'part_{b:03d}.parquet'
        if part_path.exists():
            print(f'[{split}] Skip batch {b+1}/{n_batches} ({start}:{end}) existing {part_path.name}')
            completed_parts.append(part_path)
            continue
        batch_ids = ids[start:end]
        print(f'[{split}] Batch {b+1}/{n_batches} ({start}:{end}) extracting...')
        t_batch = time.time()
        df_part = build_features_for_ids(batch_ids, split=split, n_jobs=n_jobs, max_items=None)
        df_part.to_parquet(part_path, index=False)
        completed_parts.append(part_path)
        print(f'[{split}] Saved {part_path.name} | shape={df_part.shape} | elapsed {time.time()-t_batch:.1f}s | total {time.time()-t0:.1f}s')
        del df_part
        gc.collect()
    # Concatenate parts into final parquet
    print(f'[{split}] Concatenating {len(completed_parts)} parts into {final_path} ...')
    dfs = [pd.read_parquet(p) for p in sorted(completed_parts)]
    df_all = pd.concat(dfs, axis=0, ignore_index=True)
    df_all.to_parquet(final_path, index=False)
    print(f'[{split}] Saved {final_path} | shape={df_all.shape} | total elapsed {time.time()-t0:.1f}s')
    del dfs, df_all
    gc.collect()

start_all = time.time()
# 1) Train features first (to enable modeling ASAP)
if not train_feat_path.exists():
    print('Building full train features v1 in batches ...')
    extract_in_batches(train_ids_full, split='train', out_parts_dir=parts_dir_train, final_path=train_feat_path, batch_size=400, n_jobs=8)
else:
    print('Full train features already exist:', train_feat_path)

# 2) Test features next (also batched)
if not test_feat_path.exists():
    print('Building full test features v1 in batches ...')
    extract_in_batches(test_ids_full, split='test', out_parts_dir=parts_dir_test, final_path=test_feat_path, batch_size=222, n_jobs=8)
else:
    print('Full test features already exist:', test_feat_path)

print('Full feature extraction v1 (batched) complete. Elapsed: %.1f s' % (time.time()-start_all))
gc.collect()

Building full train features v1 in batches ...
[train] Total 3987 ids → 10 batches of size 400
[train] Batch 1/10 (0:400) extracting...
[train] Extracting features for 400 files with 8 jobs...


[train] Done. Shape=(400, 566). Elapsed 496.5s
[train] Saved part_000.parquet | shape=(400, 566) | elapsed 496.6s | total 496.6s
[train] Batch 2/10 (400:800) extracting...
[train] Extracting features for 400 files with 8 jobs...


[train] Done. Shape=(400, 566). Elapsed 496.2s
[train] Saved part_001.parquet | shape=(400, 566) | elapsed 496.3s | total 993.0s
[train] Batch 3/10 (800:1200) extracting...
[train] Extracting features for 400 files with 8 jobs...


[train] Done. Shape=(400, 566). Elapsed 494.7s
[train] Saved part_002.parquet | shape=(400, 566) | elapsed 494.8s | total 1487.9s
[train] Batch 4/10 (1200:1600) extracting...
[train] Extracting features for 400 files with 8 jobs...


[train] Done. Shape=(400, 566). Elapsed 493.4s
[train] Saved part_003.parquet | shape=(400, 566) | elapsed 493.5s | total 1981.5s
[train] Batch 5/10 (1600:2000) extracting...
[train] Extracting features for 400 files with 8 jobs...


[train] Done. Shape=(400, 566). Elapsed 499.2s
[train] Saved part_004.parquet | shape=(400, 566) | elapsed 499.3s | total 2480.9s
[train] Batch 6/10 (2000:2400) extracting...
[train] Extracting features for 400 files with 8 jobs...


[train] Done. Shape=(400, 566). Elapsed 499.7s
[train] Saved part_005.parquet | shape=(400, 566) | elapsed 499.8s | total 2980.8s
[train] Batch 7/10 (2400:2800) extracting...
[train] Extracting features for 400 files with 8 jobs...


[train] Done. Shape=(400, 566). Elapsed 494.0s
[train] Saved part_006.parquet | shape=(400, 566) | elapsed 494.1s | total 3475.0s
[train] Batch 8/10 (2800:3200) extracting...
[train] Extracting features for 400 files with 8 jobs...


[train] Done. Shape=(400, 566). Elapsed 497.0s
[train] Saved part_007.parquet | shape=(400, 566) | elapsed 497.1s | total 3972.2s
[train] Batch 9/10 (3200:3600) extracting...
[train] Extracting features for 400 files with 8 jobs...


[train] Done. Shape=(400, 566). Elapsed 490.8s
[train] Saved part_008.parquet | shape=(400, 566) | elapsed 490.9s | total 4463.2s
[train] Batch 10/10 (3600:3987) extracting...
[train] Extracting features for 387 files with 8 jobs...


[train] Done. Shape=(387, 566). Elapsed 476.5s
[train] Saved part_009.parquet | shape=(387, 566) | elapsed 476.6s | total 4939.8s
[train] Concatenating 10 parts into train_features_v1.parquet ...


[train] Saved train_features_v1.parquet | shape=(3987, 566) | total elapsed 4940.5s
Building full test features v1 in batches ...
[test] Total 444 ids → 2 batches of size 222
[test] Batch 1/2 (0:222) extracting...
[test] Extracting features for 222 files with 8 jobs...


[test] Done. Shape=(222, 566). Elapsed 271.7s
[test] Saved part_000.parquet | shape=(222, 566) | elapsed 271.8s | total 271.8s
[test] Batch 2/2 (222:444) extracting...
[test] Extracting features for 222 files with 8 jobs...


[test] Done. Shape=(222, 566). Elapsed 273.4s
[test] Saved part_001.parquet | shape=(222, 566) | elapsed 273.5s | total 545.3s
[test] Concatenating 2 parts into test_features_v1.parquet ...


[test] Saved test_features_v1.parquet | shape=(444, 566) | total elapsed 545.6s
Full feature extraction v1 (batched) complete. Elapsed: 5486.3 s


0
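
Before training on the cached parquet files, a quick integrity check can catch duplicated or missing segments, mismatched columns between train and test, and non-finite feature values. The following is a minimal sketch, not part of the original run: `check_feature_frames` and the demo frames are hypothetical names with made-up values, and it assumes every column other than `segment_id` is numeric.

```python
import numpy as np
import pandas as pd


def check_feature_frames(train_df, test_df, train_ids, test_ids, id_col="segment_id"):
    """Return a dict of basic integrity checks for two cached feature tables."""
    report = {}
    for name, df, ids in (("train", train_df, train_ids), ("test", test_df, test_ids)):
        # All non-id columns are assumed to be numeric features.
        values = df.drop(columns=[id_col]).to_numpy(dtype=np.float64)
        report[name] = {
            "n_rows": len(df),
            "duplicate_ids": int(df[id_col].duplicated().sum()),
            "missing_ids": len(set(ids) - set(df[id_col])),
            "same_order": df[id_col].tolist() == list(ids),
            "non_finite_cells": int((~np.isfinite(values)).sum()),
        }
    report["same_columns"] = list(train_df.columns) == list(test_df.columns)
    return report


# Tiny synthetic demo (made-up values, not the real v1 features).
demo_train = pd.DataFrame({"segment_id": [1, 2, 3], "f0": [0.1, np.nan, 0.3], "f1": [1.0, 2.0, 3.0]})
demo_test = pd.DataFrame({"segment_id": [10, 11], "f0": [0.5, 0.6], "f1": [4.0, np.inf]})
demo_report = check_feature_frames(demo_train, demo_test, [1, 2, 3], [11, 10])
print(demo_report)
```

In this notebook the call would presumably look like `check_feature_frames(pd.read_parquet(train_feat_path), pd.read_parquet(test_feat_path), train_ids_full, test_ids_full)`. The `same_order` flag matters most for the test table, because predictions are later matched to `sample_submission.csv` rows.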

In [None]:
# Full training with GroupKFold on v1 features + test inference and submission
import os, time, gc
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import GroupKFold
from sklearn.metrics import mean_absolute_error
import xgboost as xgb

t0 = time.time()
train_feat_path = Path('train_features_v1.parquet')
test_feat_path = Path('test_features_v1.parquet')
if not train_feat_path.exists():
    print('train_features_v1.parquet not found yet. Run the feature extraction cell first.')
elif not test_feat_path.exists():
    print('test_features_v1.parquet not found yet. Build test features, then re-run this cell.')
else:
    print('Loading full features...')
    train_feats = pd.read_parquet(train_feat_path)
    test_feats = pd.read_parquet(test_feat_path)
    print('Train feats:', train_feats.shape, 'Test feats:', test_feats.shape)
    # Merge target
    df_meta = train_meta[['segment_id','time_to_eruption']].copy()
    df = train_feats.merge(df_meta, on='segment_id', how='left')
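    # Build groups: sort by target and start a new group at every strict increase,
    # so only segments with an identical time_to_eruption share a group.
    # When almost all targets are distinct this yields about one group per segment,
    # and GroupKFold then behaves much like a plain KFold.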
    tmp = df_meta.sort_values('time_to_eruption').reset_index(drop=True)
    tmp['group'] = (tmp['time_to_eruption'].diff().fillna(1) > 0).cumsum().astype(int)
    groups_map = tmp.set_index('segment_id')['group']
    df['group'] = df['segment_id'].map(groups_map).astype(int)
    print('Unique groups:', df['group'].nunique())

    # Prepare matrices
    drop_cols = ['segment_id','time_to_eruption','group']
    feat_cols = [c for c in df.columns if c not in drop_cols]
    X = df[feat_cols].astype(np.float32).values
    y = df['time_to_eruption'].astype(np.float32).values
    groups = df['group'].values.astype(int)
    X_test = test_feats[feat_cols].astype(np.float32).values

    # XGBoost params
    params = {
        'tree_method': 'hist',  # XGBoost >= 2.0 style; 'gpu_hist' is deprecated there
        'device': 'cuda',
        'max_depth': 8,
        'learning_rate': 0.04,
        'subsample': 0.75,
        'colsample_bytree': 0.7,
        'min_child_weight': 9,
        'reg_alpha': 0.5,
        'reg_lambda': 6.0,
        'objective': 'reg:absoluteerror',
        'eval_metric': 'mae',
        'seed': 42
    }
    num_boost_round = 5000
    esr = 250

    gkf = GroupKFold(n_splits=5)
    oof = np.zeros(len(df), dtype=np.float32)
    test_preds_folds = []

    for fold, (trn_idx, val_idx) in enumerate(gkf.split(X, y, groups=groups)):
        t_fold = time.time()
        print(f'[FULL] Fold {fold+1}/5 | train {len(trn_idx)} val {len(val_idx)}')
        dtrain = xgb.DMatrix(X[trn_idx], label=y[trn_idx])
        dvalid = xgb.DMatrix(X[val_idx], label=y[val_idx])
        booster = xgb.train(
            params=params,
            dtrain=dtrain,
            num_boost_round=num_boost_round,
            evals=[(dtrain,'train'),(dvalid,'valid')],
            early_stopping_rounds=esr,
            verbose_eval=200
        )
        best_iter = getattr(booster, 'best_iteration', None)
        preds_val = booster.predict(dvalid, iteration_range=(0, best_iter+1)) if best_iter is not None else booster.predict(dvalid)
        oof[val_idx] = preds_val.astype(np.float32)
        mae = mean_absolute_error(y[val_idx], preds_val)
        print(f'Fold {fold+1} MAE: {mae:,.0f} | best_iter={best_iter} | elapsed {time.time()-t_fold:.1f}s')
        # Test prediction for this fold
        dtest = xgb.DMatrix(X_test)
        preds_test = booster.predict(dtest, iteration_range=(0, best_iter+1)) if best_iter is not None else booster.predict(dtest)
        test_preds_folds.append(preds_test.astype(np.float32))
        gc.collect()

    oof_mae = mean_absolute_error(y, oof)
    print(f'OOF MAE (GroupKFold, full): {oof_mae:,.0f}')
    pd.DataFrame({'segment_id': df['segment_id'], 'oof_full': oof, 'y': y}).to_csv('oof_full_xgb.csv', index=False)
    # Average test preds across folds
    test_pred = np.mean(np.vstack(test_preds_folds), axis=0)
    # Clip predictions
    y_max = float(df['time_to_eruption'].max())
    test_pred = np.clip(test_pred, 0.0, y_max)

    # Build submission
    ss = pd.read_csv('sample_submission.csv')
    sub = ss[['segment_id']].copy()
    # test_pred follows the row order of test_feats, so attach segment_id there first,
    # then merge onto sample_submission to guarantee its row order.
    if 'segment_id' in test_feats.columns:
        pred_df = pd.DataFrame({'segment_id': test_feats['segment_id'].values, 'time_to_eruption': test_pred})
        sub = sub.merge(pred_df, on='segment_id', how='left')
        assert sub['time_to_eruption'].notna().all(), 'Some test segment_ids have no prediction'
    else:
        # Fallback: assume same order as ss
        sub['time_to_eruption'] = test_pred
    sub.to_csv('submission.csv', index=False)
    print('Saved submission.csv')
    print('Total training+inference elapsed: %.1fs' % (time.time()-t0))

Loading full features...
Train feats: (3987, 566) Test feats: (444, 566)
Unique groups: 3986
[FULL] Fold 1/5 | train 3189 val 798
[0]	train-mae:11283527.24208	valid-mae:11296145.01504


[200]	train-mae:1624283.76376	valid-mae:3441694.37798
