# Plan

Goals:
- Verify GPU and environment
- Inspect provided artifacts and define train/test splits
- Establish fast baseline (tabular: tracking + baseline helmets); defer video modeling unless needed
- Build robust CV mirroring test (game/time/fold discipline), avoid leakage
- Train quick baseline (XGBoost GPU if possible), produce OOF and test preds
- Iterate with feature engineering and model ensembling

Milestones (request expert review at each):
1) Plan + environment check
2) Data audit/EDA + fold strategy
3) Baseline features + baseline model
4) Error analysis + FE v1
5) Model tuning / blend
6) Finalize submission

Metric: Matthews correlation coefficient (MCC) on the test set. Submission: submission.csv.

Assumption: the prepared artifacts already include extracted tracking and helmet features, so we start with a tabular approach. We will log progress, cache OOF and test predictions, and fix random seeds for determinism.
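
Since the decision threshold is later tuned for MCC, a minimal pure-Python sketch of the metric may be a useful reference. The helper names `mcc_from_counts` and `mcc_at_threshold` and the toy labels are illustrative assumptions, not part of the notebook, which relies on scikit-learn's `matthews_corrcoef` for the real scoring.

```python
import math

def mcc_from_counts(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: score 0.0 when a marginal is empty (zero denominator).
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def mcc_at_threshold(y_true, y_prob, thr):
    """Binarize probabilities (p >= thr is positive) and score MCC."""
    tp = fp = fn = tn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= thr else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return mcc_from_counts(tp, fp, fn, tn)

# Toy data (hypothetical): two positives, four negatives.
y_true = [1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.4, 0.6, 0.2, 0.1, 0.05]
for thr in (0.3, 0.5):
    print(thr, round(mcc_at_threshold(y_true, y_prob, thr), 4))
```

On this toy data the score at threshold 0.5 is 0.25, and lowering the threshold to 0.3 raises it, which is why the threshold is swept on OOF predictions rather than fixed at 0.5.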

In [1]:
import os, sys, subprocess, time, json, random
import numpy as np
import pandas as pd

def run(cmd):
    print('>>', ' '.join(cmd), flush=True)
    return subprocess.run(cmd, check=False, text=True, capture_output=True).stdout

start = time.time()
print('Env check...')
print(run(['bash','-lc','nvidia-smi || true']))

print('Python:', sys.version)
print('CWD:', os.getcwd())

files = sorted(os.listdir('.'))
print('Files:', files)

def info(df, name):
    print(f'[{name}] shape={df.shape}')
    print('cols:', list(df.columns)[:20], ('... total %d cols' % len(df.columns) if len(df.columns)>20 else ''))

train_labels = pd.read_csv('train_labels.csv')
train_track = pd.read_csv('train_player_tracking.csv')
train_helm = pd.read_csv('train_baseline_helmets.csv')
train_vmeta = pd.read_csv('train_video_metadata.csv')
test_track = pd.read_csv('test_player_tracking.csv')
test_helm = pd.read_csv('test_baseline_helmets.csv')
test_vmeta = pd.read_csv('test_video_metadata.csv')

info(train_labels, 'train_labels')
info(train_track, 'train_player_tracking')
info(train_helm, 'train_baseline_helmets')
info(train_vmeta, 'train_video_metadata')
info(test_track, 'test_player_tracking')
info(test_helm, 'test_baseline_helmets')
info(test_vmeta, 'test_video_metadata')

print('Label distribution:')
lbl_col = None
for c in train_labels.columns:
    if c.lower() in ('contact','is_contact','contact_label','label'):
        lbl_col = c; break
print('label_col:', lbl_col)
if lbl_col is not None:
    print(train_labels[lbl_col].value_counts(dropna=False))
else:
    print('No obvious label col found; will inspect later head:')
    print(train_labels.head(3))

print('Sample submission head:')
try:
    ss = pd.read_csv('sample_submission.csv')
    print(ss.head())
except Exception as e:
    print('no sample_submission.csv:', e)

elapsed = time.time()-start
print(f'Env+EDA done in {elapsed:.2f}s', flush=True)

Env check...
>> bash -lc nvidia-smi || true


Thu Sep 25 23:47:32 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.06             Driver Version: 550.144.06     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A10-24Q                 On  |   00000002:00:00.0 Off |                    0 |
| N/A   N/A    P0             N/A /  N/A  |     182MiB /  24512MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

[train_labels] shape=(4258375, 7)
cols: ['contact_id', 'game_play', 'datetime', 'step', 'nfl_player_id_1', 'nfl_player_id_2', 'contact'] 
[train_player_tracking] shape=(1225299, 17)
cols: ['game_play', 'game_key', 'play_id', 'nfl_player_id', 'datetime', 'step', 'team', 'position', 'jersey_number', 'x_position', 'y_position', 'speed', 'distance', 'direction', 'orientation', 'acceleration', 'sa'] 
[train_baseline_helmets] shape=(3412208, 12)
cols: ['game_play', 'game_key', 'play_id', 'view', 'video', 'frame', 'nfl_player_id', 'player_label', 'left', 'width', 'top', 'height'] 
[train_video_metadata] shape=(432, 7)
cols: ['game_play', 'game_key', 'play_id', 'view', 'start_time', 'end_time', 'snap_time'] 
[test_player_tracking] shape=(127754, 17)
cols: ['game_play', 'game_key', 'play_id', 'nfl_player_id', 'datetime', 'step', 'team', 'position', 'jersey_number', 'x_position', 'y_position', 'speed', 'distance', 'direction', 'orientation', 'acceleration', 'sa'] 
[test_baseline_helmets] shape=(

In [3]:
import re, itertools, math, time
from sklearn.model_selection import GroupKFold

print('Setting up CV and keys...')
t0 = time.time()

# Parse contact_id -> (game_play, step, p1, p2)
def parse_contact_id(cid: str):
    # format: {game_play}_{step}_{p1}_{p2}
    # game_play itself has an underscore: e.g., 58187_001341
    parts = cid.split('_')
    # Expect 5 parts: [g1, g2, step, p1, p2]
    if len(parts) != 5:
        raise ValueError(f'Unexpected contact_id format: {cid}')
    game_play = parts[0] + '_' + parts[1]
    step = int(parts[2])
    p1 = parts[3]; p2 = parts[4]
    # canonicalize pair (handle 'G' ground) keep as strings
    if p1 == 'G' or p2 == 'G':
        p1c, p2c = ('G', p2) if p1 == 'G' else ('G', p1)
    else:
        a, b = int(p1), int(p2)
        p1c, p2c = (str(a), str(b)) if a <= b else (str(b), str(a))
    return game_play, step, p1c, p2c

# Quick sanity on sample_submission format
ss = pd.read_csv('sample_submission.csv')
g, s, a, b = parse_contact_id(ss.loc[0, 'contact_id'])
print('Parsed sample row:', g, s, a, b)

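# Build GroupKFold on train_labels grouped by game_play (one group per play).
# Plays from the same game can still land in different folds; group by game_key instead if that counts as leakage.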
unique_gp = train_labels[['game_play']].drop_duplicates().reset_index(drop=True)
groups = unique_gp['game_play'].values
gkf = GroupKFold(n_splits=5)
fold_map = {}
for fold, (tr_idx, va_idx) in enumerate(gkf.split(unique_gp, groups=groups)):
    for idx in va_idx:
        fold_map[unique_gp.loc[idx, 'game_play']] = fold
folds_df = pd.DataFrame({'game_play': list(fold_map.keys()), 'fold': list(fold_map.values())})
folds_df.to_csv('folds_game_play.csv', index=False)
print('Folds saved:', folds_df['fold'].value_counts().sort_index().to_dict())

# Ensure key dtypes align and canonicalize player pair in training labels
train_labels['pid1'] = train_labels['nfl_player_id_1'].astype(str)
train_labels['pid2'] = train_labels['nfl_player_id_2'].astype(str)
def canon_pair(p1, p2):
    if p1 == 'G' or p2 == 'G':
        return ('G', p2) if p1 == 'G' else ('G', p1)
    a, b = int(p1), int(p2)
    return (str(a), str(b)) if a <= b else (str(b), str(a))
cp = [canon_pair(p1, p2) for p1, p2 in zip(train_labels['pid1'], train_labels['pid2'])]
train_labels['p1'] = [x[0] for x in cp]
train_labels['p2'] = [x[1] for x in cp]

# Attach fold to labels
train_labels = train_labels.merge(folds_df, on='game_play', how='left')
assert train_labels['fold'].notna().all(), 'Missing fold assignment'
print('Labels+folds shape:', train_labels.shape)

# Basic index for tracking per step (reduced columns for speed)
trk_cols = ['game_play','step','nfl_player_id','team','position','x_position','y_position','speed','acceleration','direction','orientation']
train_track_idx = train_track[trk_cols].copy()
test_track_idx = test_track[trk_cols].copy()
for df in (train_track_idx, test_track_idx):
    df['nfl_player_id'] = df['nfl_player_id'].astype(int)

print('Prepared tracking indices:', train_track_idx.shape, test_track_idx.shape)
print(f'Setup done in {time.time()-t0:.2f}s', flush=True)

Setting up CV and keys...
Parsed sample row: 58187_001341 0 47795 52650
Folds saved: {0: 44, 1: 43, 2: 43, 3: 43, 4: 43}


Labels+folds shape: (4258375, 12)
Prepared tracking indices: (1225299, 11) (127754, 11)
Setup done in 4.52s
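
The key handling above can be sketched as a round trip between `contact_id` strings and an order-independent join key. This is a hedged illustration: the helper names are hypothetical, and it assumes ground rows are written with the player first and `G` last, which should be checked against `sample_submission.csv`. Unlike the cell above, it keeps `G` in the second slot so the key can be turned back into a `contact_id` directly.

```python
def split_contact_id(contact_id):
    """Split '{game}_{play}_{step}_{p1}_{p2}' into (game_play, step, p1, p2)."""
    game, play, step, p1, p2 = contact_id.split('_')
    return f'{game}_{play}', int(step), p1, p2

def canonical_key(game_play, step, p1, p2):
    """Order-independent key: numeric ids ascending, 'G' (ground) kept last."""
    if 'G' in (p1, p2):
        player = p2 if p1 == 'G' else p1
        return game_play, step, player, 'G'
    # Compare as integers: as strings, '10000' would sort before '9999'.
    a, b = sorted((p1, p2), key=int)
    return game_play, step, a, b

def make_contact_id(game_play, step, p1, p2):
    """Rebuild the id string from a canonical key."""
    return f'{game_play}_{step}_{p1}_{p2}'

cid = '58187_001341_0_47795_52650'
key = canonical_key(*split_contact_id(cid))
print(key, make_contact_id(*key) == cid)
```

Comparing ids as integers matters whenever ids have different digit counts, because a string comparison would then order the pair differently from the numeric canonicalization used for the label and feature tables.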


In [4]:
import math, time
from itertools import combinations

print('Building candidate pairs and minimal features (r=3.0 yd)...')
t0 = time.time()

def cosd(a):
    return math.cos(math.radians(a)) if pd.notna(a) else 0.0
def sind(a):
    return math.sin(math.radians(a)) if pd.notna(a) else 0.0
def heading_diff(a, b):
    if pd.isna(a) or pd.isna(b):
        return np.nan
    d = (a - b + 180) % 360 - 180
    return abs(d)

def build_pairs_for_group(gdf, r=3.0):
    rows = []
    arr = gdf[['nfl_player_id','team','position','x_position','y_position','speed','acceleration','direction']].values
    n = arr.shape[0]
    for i, j in combinations(range(n), 2):
        pid_i, team_i, pos_i, xi, yi, si, ai, diri = arr[i]
        pid_j, team_j, pos_j, xj, yj, sj, aj, dirj = arr[j]
        dx = xj - xi; dy = yj - yi
        dist = math.hypot(dx, dy)
        if dist > r:
            continue
        # canonicalize pair ids as strings
        a = int(pid_i); b = int(pid_j)
        p1, p2 = (str(a), str(b)) if a <= b else (str(b), str(a))
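        # velocities from speed+direction. ASSUMPTION: direction is in degrees counterclockwise from +x (east).
        # NFL tracking angles are often described as clockwise from +y; if that holds here, swap cos/sin. Verify against the data dictionary.
        vxi = si * cosd(diri); vyi = si * sind(diri)
        vxj = sj * cosd(dirj); vyj = sj * sind(dirj)
        rvx = vxj - vxi; rvy = vyj - vyi
        if dist > 0:
            ux = dx / dist; uy = dy / dist
            # projection of relative velocity on the i->j unit vector = d(dist)/dt; negative when the players are approaching
            closing = rvx * ux + rvy * uy
        else:
            closing = 0.0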
        hd = heading_diff(diri, dirj)
        rows.append((p1, p2, dist, dx, dy, si, sj, ai, aj, closing, abs(closing), hd, int(team_i == team_j), str(team_i), str(team_j), str(pos_i), str(pos_j)))
    if not rows:
        return pd.DataFrame(columns=['p1','p2','distance','rel_dx','rel_dy','speed1','speed2','accel1','accel2','closing','abs_closing','abs_d_heading','same_team','team1','team2','pos1','pos2'])
    df = pd.DataFrame(rows, columns=['p1','p2','distance','rel_dx','rel_dy','speed1','speed2','accel1','accel2','closing','abs_closing','abs_d_heading','same_team','team1','team2','pos1','pos2'])
    return df

def build_feature_table(track_df, r=3.0):
    feats = []
    cnt = 0
    last_log = time.time()
    for (gp, step), gdf in track_df.groupby(['game_play','step'], sort=False):
        f = build_pairs_for_group(gdf, r=r)
        if not f.empty:
            f.insert(0, 'step', step)
            f.insert(0, 'game_play', gp)
            feats.append(f)
        cnt += 1
        if cnt % 500 == 0:
            now = time.time()
            print(f' processed {cnt} groups in {now - last_log:.1f}s; total elapsed {now - t0:.1f}s', flush=True)
            last_log = now
    if feats:
        return pd.concat(feats, ignore_index=True)
    return pd.DataFrame(columns=['game_play','step','p1','p2','distance','rel_dx','rel_dy','speed1','speed2','accel1','accel2','closing','abs_closing','abs_d_heading','same_team','team1','team2','pos1','pos2'])

# Build train features (radius 3.0 yds)
train_feats = build_feature_table(train_track_idx, r=3.0)
print('Train feats shape:', train_feats.shape)
train_feats.to_parquet('train_pairs_v1.parquet', index=False)

# Merge labels to get target
key_cols = ['game_play','step','p1','p2']
lab_cols = key_cols + ['contact']
train_supervised = train_feats.merge(train_labels[lab_cols], on=key_cols, how='left')
missing = train_supervised['contact'].isna().mean()
print(f'Label NaN rate after merge: {missing:.3f}')
train_supervised = train_supervised.dropna(subset=['contact'])
train_supervised['contact'] = train_supervised['contact'].astype(int)
print('Supervised rows:', train_supervised.shape)
train_supervised.to_parquet('train_supervised_v1.parquet', index=False)

# Build test features
test_feats = build_feature_table(test_track_idx, r=3.0)
print('Test feats shape:', test_feats.shape)
test_feats.to_parquet('test_pairs_v1.parquet', index=False)

print(f'All done in {time.time()-t0:.1f}s', flush=True)

Building candidate pairs and minimal features (r=3.0 yd)...


 processed 500 groups in 0.9s; total elapsed 0.9s


 processed 1000 groups in 0.6s; total elapsed 1.4s


 processed 1500 groups in 0.7s; total elapsed 2.1s


 processed 2000 groups in 0.6s; total elapsed 2.7s


 processed 2500 groups in 0.6s; total elapsed 3.3s


 processed 3000 groups in 0.6s; total elapsed 3.8s


 processed 3500 groups in 0.8s; total elapsed 4.6s


 processed 4000 groups in 0.6s; total elapsed 5.2s


 processed 4500 groups in 0.6s; total elapsed 5.8s


 processed 5000 groups in 0.6s; total elapsed 6.3s


 processed 5500 groups in 0.6s; total elapsed 6.9s


 processed 6000 groups in 0.9s; total elapsed 7.8s


 processed 6500 groups in 0.6s; total elapsed 8.3s


 processed 7000 groups in 0.6s; total elapsed 8.9s


 processed 7500 groups in 0.6s; total elapsed 9.5s


 processed 8000 groups in 0.6s; total elapsed 10.0s


 processed 8500 groups in 0.6s; total elapsed 10.6s


 processed 9000 groups in 0.8s; total elapsed 11.4s


 processed 9500 groups in 0.6s; total elapsed 11.9s


 processed 10000 groups in 0.6s; total elapsed 12.5s


 processed 10500 groups in 0.6s; total elapsed 13.1s


 processed 11000 groups in 0.5s; total elapsed 13.6s


 processed 11500 groups in 0.5s; total elapsed 14.2s


 processed 12000 groups in 0.9s; total elapsed 15.1s


 processed 12500 groups in 0.6s; total elapsed 15.7s


 processed 13000 groups in 0.6s; total elapsed 16.2s


 processed 13500 groups in 0.5s; total elapsed 16.8s


 processed 14000 groups in 0.6s; total elapsed 17.3s


 processed 14500 groups in 0.6s; total elapsed 17.9s


 processed 15000 groups in 0.6s; total elapsed 18.5s


 processed 15500 groups in 0.5s; total elapsed 19.0s


 processed 16000 groups in 0.6s; total elapsed 19.6s


 processed 16500 groups in 1.1s; total elapsed 20.7s


 processed 17000 groups in 0.5s; total elapsed 21.3s


 processed 17500 groups in 0.6s; total elapsed 21.8s


 processed 18000 groups in 0.5s; total elapsed 22.4s


 processed 18500 groups in 0.6s; total elapsed 22.9s


 processed 19000 groups in 0.6s; total elapsed 23.5s


 processed 19500 groups in 0.5s; total elapsed 24.0s


 processed 20000 groups in 0.6s; total elapsed 24.6s


 processed 20500 groups in 0.6s; total elapsed 25.1s


 processed 21000 groups in 0.6s; total elapsed 25.7s


 processed 21500 groups in 1.2s; total elapsed 26.9s


 processed 22000 groups in 0.6s; total elapsed 27.5s


 processed 22500 groups in 0.6s; total elapsed 28.0s


 processed 23000 groups in 0.6s; total elapsed 28.6s


 processed 23500 groups in 0.6s; total elapsed 29.1s


 processed 24000 groups in 0.6s; total elapsed 29.7s


 processed 24500 groups in 0.6s; total elapsed 30.3s


 processed 25000 groups in 0.6s; total elapsed 30.8s


 processed 25500 groups in 0.6s; total elapsed 31.4s


 processed 26000 groups in 0.6s; total elapsed 32.0s


 processed 26500 groups in 0.6s; total elapsed 32.5s


 processed 27000 groups in 0.6s; total elapsed 33.1s


 processed 27500 groups in 0.6s; total elapsed 33.7s


 processed 28000 groups in 1.1s; total elapsed 34.8s


 processed 28500 groups in 0.6s; total elapsed 35.3s


 processed 29000 groups in 0.6s; total elapsed 35.9s


 processed 29500 groups in 0.6s; total elapsed 36.5s


 processed 30000 groups in 0.6s; total elapsed 37.1s


 processed 30500 groups in 0.6s; total elapsed 37.6s


 processed 31000 groups in 0.6s; total elapsed 38.2s


 processed 31500 groups in 0.6s; total elapsed 38.7s


 processed 32000 groups in 0.5s; total elapsed 39.3s


 processed 32500 groups in 0.5s; total elapsed 39.8s


 processed 33000 groups in 0.5s; total elapsed 40.4s


 processed 33500 groups in 0.6s; total elapsed 40.9s


 processed 34000 groups in 0.6s; total elapsed 41.5s


 processed 34500 groups in 0.6s; total elapsed 42.1s


 processed 35000 groups in 0.5s; total elapsed 42.6s


 processed 35500 groups in 0.6s; total elapsed 43.2s


 processed 36000 groups in 1.4s; total elapsed 44.6s


 processed 36500 groups in 0.6s; total elapsed 45.2s


 processed 37000 groups in 0.6s; total elapsed 45.7s


 processed 37500 groups in 0.6s; total elapsed 46.3s


 processed 38000 groups in 0.6s; total elapsed 46.9s


 processed 38500 groups in 0.6s; total elapsed 47.4s


 processed 39000 groups in 0.6s; total elapsed 48.0s


 processed 39500 groups in 0.5s; total elapsed 48.5s


 processed 40000 groups in 0.6s; total elapsed 49.1s


 processed 40500 groups in 0.6s; total elapsed 49.6s


 processed 41000 groups in 0.5s; total elapsed 50.2s


 processed 41500 groups in 0.6s; total elapsed 50.8s


 processed 42000 groups in 0.5s; total elapsed 51.3s


 processed 42500 groups in 0.6s; total elapsed 51.9s


 processed 43000 groups in 0.6s; total elapsed 52.4s


 processed 43500 groups in 0.6s; total elapsed 53.0s


 processed 44000 groups in 0.5s; total elapsed 53.6s


 processed 44500 groups in 0.6s; total elapsed 54.1s


 processed 45000 groups in 0.5s; total elapsed 54.7s


 processed 45500 groups in 1.6s; total elapsed 56.3s


 processed 46000 groups in 0.6s; total elapsed 56.8s


 processed 46500 groups in 0.6s; total elapsed 57.4s


 processed 47000 groups in 0.6s; total elapsed 57.9s


 processed 47500 groups in 0.6s; total elapsed 58.5s


 processed 48000 groups in 0.6s; total elapsed 59.1s


 processed 48500 groups in 0.6s; total elapsed 59.6s


 processed 49000 groups in 0.6s; total elapsed 60.2s


 processed 49500 groups in 0.5s; total elapsed 60.7s


 processed 50000 groups in 0.5s; total elapsed 61.3s


 processed 50500 groups in 0.6s; total elapsed 61.9s


 processed 51000 groups in 0.6s; total elapsed 62.4s


 processed 51500 groups in 0.6s; total elapsed 63.0s


 processed 52000 groups in 0.6s; total elapsed 63.6s


 processed 52500 groups in 0.6s; total elapsed 64.2s


 processed 53000 groups in 0.5s; total elapsed 64.7s


 processed 53500 groups in 0.6s; total elapsed 65.3s


 processed 54000 groups in 0.6s; total elapsed 65.8s


 processed 54500 groups in 0.6s; total elapsed 66.4s


 processed 55000 groups in 0.5s; total elapsed 66.9s


 processed 55500 groups in 0.6s; total elapsed 67.5s


Train feats shape: (1641668, 19)


Label NaN rate after merge: 0.746
Supervised rows: (416574, 20)


 processed 500 groups in 0.6s; total elapsed 76.5s


 processed 1000 groups in 0.6s; total elapsed 77.0s


 processed 1500 groups in 0.6s; total elapsed 77.6s


 processed 2000 groups in 0.6s; total elapsed 78.2s


 processed 2500 groups in 0.6s; total elapsed 78.8s


 processed 3000 groups in 0.5s; total elapsed 79.3s


 processed 3500 groups in 0.6s; total elapsed 79.9s


 processed 4000 groups in 0.6s; total elapsed 80.4s


 processed 4500 groups in 0.6s; total elapsed 81.0s


 processed 5000 groups in 0.5s; total elapsed 81.5s


 processed 5500 groups in 0.6s; total elapsed 82.1s


Test feats shape: (191559, 19)
All done in 82.9s
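
The `closing` feature above is the relative velocity projected onto the line joining the two players. The sketch below works one case by hand; the function names and the `clockwise_from_north` switch are illustrative assumptions, added because the angle convention of the `direction` column should be confirmed before trusting the sign and magnitude of this feature.

```python
import math

def velocity(speed, direction_deg, clockwise_from_north=False):
    """Convert speed and heading to (vx, vy) under one of two angle conventions."""
    rad = math.radians(direction_deg)
    if clockwise_from_north:
        # 0 deg points along +y and angles grow clockwise.
        return speed * math.sin(rad), speed * math.cos(rad)
    # 0 deg points along +x and angles grow counterclockwise (as in the cell above).
    return speed * math.cos(rad), speed * math.sin(rad)

def separation_rate(pos_i, pos_j, vel_i, vel_j):
    """d(distance)/dt between two players; negative means they are approaching."""
    dx, dy = pos_j[0] - pos_i[0], pos_j[1] - pos_i[1]
    dist = math.hypot(dx, dy)
    if dist == 0:
        return 0.0
    rvx, rvy = vel_j[0] - vel_i[0], vel_j[1] - vel_i[1]
    return (rvx * dx + rvy * dy) / dist

# Player i at the origin moving along +x at 2 yd/s, player j two yards ahead
# moving along -x at 1 yd/s: the gap shrinks at 3 yd/s.
v_i = velocity(2.0, 0.0)
v_j = velocity(1.0, 180.0)
rate = separation_rate((0.0, 0.0), (2.0, 0.0), v_i, v_j)
print(round(rate, 6))
```

With this sign convention a head-on approach yields a negative value, so `abs_closing` captures the magnitude while `closing` keeps the direction of change.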


In [13]:
import time, math, subprocess, sys
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold
from sklearn.metrics import matthews_corrcoef

# Ensure xgboost is available; print version
try:
    import xgboost as xgb
except Exception as e:
    print('Installing xgboost...', e)
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'xgboost==2.1.1'], check=True)
    import xgboost as xgb
print('xgboost version:', getattr(xgb, '__version__', 'unknown'))

def mcc_score(y_true, y_prob, thr):
    y_pred = (y_prob >= thr).astype(int)
    return matthews_corrcoef(y_true, y_pred)

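# Note: the message says pos-exp, but the logged pos rate (0.1023) equals the pre-expansion rate; confirm which labels this file carries.
print('Loading supervised train (W5+pos-exp+helm) and test features...')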
train_sup = pd.read_parquet('train_supervised_w5_helm.parquet')
test_feats = pd.read_parquet('test_pairs_w5_helm.parquet')
folds_df = pd.read_csv('folds_game_play.csv')
print('train_sup:', train_sup.shape, 'test_feats:', test_feats.shape)

# Attach folds
train_sup = train_sup.merge(folds_df, on='game_play', how='left')
assert train_sup['fold'].notna().all()

# Fill NaNs for helmet features: no-helmet => large distance, 0 views
for df in (train_sup, test_feats):
    if 'px_dist_norm_min' in df.columns:
        df['px_dist_norm_min'] = df['px_dist_norm_min'].fillna(1.0)
    if 'views_both_present' in df.columns:
        df['views_both_present'] = df['views_both_present'].fillna(0).astype(float)

# Feature set: base + temporal windows (past-5) + counts + trend + helmet
feat_cols = [
    'distance','rel_dx','rel_dy','speed1','speed2','accel1','accel2','closing','abs_closing','abs_d_heading','same_team',
    'dist_min_p5','dist_mean_p5','dist_max_p5','dist_std_p5',
    'abs_close_min_p5','abs_close_mean_p5','abs_close_max_p5','abs_close_std_p5',
    'cnt_dist_lt15_p5','cnt_dist_lt20_p5','cnt_dist_lt25_p5',
    'dist_delta_p5',
    'px_dist_norm_min','views_both_present'
]
missing_feats = [c for c in feat_cols if c not in train_sup.columns]
if missing_feats:
    raise RuntimeError(f'Missing features: {missing_feats}')

X = train_sup[feat_cols].astype(float).values
y = train_sup['contact'].astype(int).values
groups = train_sup['game_play'].values

print('Pos rate:', y.mean())

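# Note: this re-splits rows with GroupKFold by game_play; the saved `fold` column merged above is checked but not used here.
gkf = GroupKFold(n_splits=5)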
oof = np.zeros(len(train_sup), dtype=float)
models = []  # list of (booster, best_iteration)
start = time.time()

for fold, (tr_idx, va_idx) in enumerate(gkf.split(X, y, groups=groups)):
    t0 = time.time()
    X_tr, y_tr = X[tr_idx], y[tr_idx]
    X_va, y_va = X[va_idx], y[va_idx]
    # class imbalance handling
    neg = (y_tr == 0).sum(); pos = (y_tr == 1).sum()
    spw = max(1.0, neg / max(1, pos))
    print(f'Fold {fold}: train {len(tr_idx)} (pos {pos}), valid {len(va_idx)} (pos {(y_va==1).sum()}), scale_pos_weight={spw:.1f}', flush=True)
    dtrain = xgb.DMatrix(X_tr, label=y_tr)
    dvalid = xgb.DMatrix(X_va, label=y_va)
    params = {
        'tree_method': 'hist',
        'device': 'cuda',
        'max_depth': 8,
        'eta': 0.05,
        'subsample': 0.9,
        'colsample_bytree': 0.7,
        'min_child_weight': 8,
        'lambda': 1.5,
        'alpha': 0.0,
        'objective': 'binary:logistic',
        'eval_metric': 'logloss',
        'scale_pos_weight': float(spw),
        'seed': 42
    }
    evals = [(dtrain, 'train'), (dvalid, 'valid')]
    booster = xgb.train(
        params=params,
        dtrain=dtrain,
        num_boost_round=3000,
        evals=evals,
        early_stopping_rounds=100,
        verbose_eval=False
    )
    # best iteration
    if hasattr(booster, 'best_iteration') and booster.best_iteration is not None:
        best_it = int(booster.best_iteration)
    else:
        best_it = int(booster.num_boosted_rounds()) - 1
    p = booster.predict(dvalid, iteration_range=(0, best_it + 1))
    oof[va_idx] = p
    models.append((booster, best_it))
    print(f' Fold {fold} done in {time.time()-t0:.1f}s; best_iteration={best_it}', flush=True)

print('OOF threshold sweep for MCC...')
best_thr, best_mcc = 0.5, -1.0
for thr in np.linspace(0.01, 0.99, 99):
    m = mcc_score(y, oof, thr)
    if m > best_mcc:
        best_mcc, best_thr = m, thr
print(f'Best OOF MCC={best_mcc:.5f} at thr={best_thr:.3f}')

# Predict test with same features
Xt = test_feats[feat_cols].astype(float).values
dtest = xgb.DMatrix(Xt)
pt = np.zeros(len(test_feats), dtype=float)
for i, (booster, best_it) in enumerate(models):
    t0 = time.time()
    pt += booster.predict(dtest, iteration_range=(0, best_it + 1))
    print(f' Inference model {i} took {time.time()-t0:.1f}s')
pt /= len(models)

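# Optional temporal smoothing: centered 3-row rolling max of probs per (game_play,p1,p2).
# This spreads a high prob to neighboring rows (any-of-3), not a 2-of-3 vote; best_thr was tuned on unsmoothed OOF.
pred_tmp = test_feats[['game_play','step','p1','p2']].copy()
pred_tmp['prob'] = pt
pred_tmp = pred_tmp.sort_values(['game_play','p1','p2','step'])
grp = pred_tmp.groupby(['game_play','p1','p2'], sort=False)
pred_tmp['prob_smooth'] = grp['prob'].transform(lambda s: s.rolling(3, center=True, min_periods=1).max())
# Restore the original test_feats row order so probs line up with the contact_id built below
pt_smooth = pred_tmp['prob_smooth'].reindex(test_feats.index).values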

# Build contact_id for test pairs
cid = test_feats['game_play'].astype(str) + '_' + test_feats['step'].astype(str) + '_' + test_feats['p1'].astype(str) + '_' + test_feats['p2'].astype(str)
pred_df = pd.DataFrame({'contact_id': cid, 'contact_prob': pt_smooth})

# Map to sample_submission; fill missing with 0.0 prob
ss = pd.read_csv('sample_submission.csv')
sub = ss.merge(pred_df, on='contact_id', how='left')
sub['contact_prob'] = sub['contact_prob'].fillna(0.0)
sub['contact'] = (sub['contact_prob'] >= best_thr).astype(int)
sub[['contact_id','contact']].to_csv('submission.csv', index=False)
print('Saved submission.csv')
print('Head:\n', sub.head())
print('Done. Total time:', f'{time.time()-start:.1f}s', flush=True)

xgboost version: 2.1.4
Loading supervised train (W5+pos-exp+helm) and test features...
train_sup: (416574, 34) test_feats: (191559, 33)
Pos rate: 0.10227714643736767


Fold 0: train 333273 (pos 33885), valid 83301 (pos 8721), scale_pos_weight=8.8


 Fold 0 done in 14.0s; best_iteration=2107


Fold 1: train 333268 (pos 34206), valid 83306 (pos 8400), scale_pos_weight=8.7


 Fold 1 done in 15.2s; best_iteration=2320


Fold 2: train 333273 (pos 34570), valid 83301 (pos 8036), scale_pos_weight=8.6


 Fold 2 done in 13.7s; best_iteration=2125


Fold 3: train 333286 (pos 33688), valid 83288 (pos 8918), scale_pos_weight=8.9


 Fold 3 done in 13.8s; best_iteration=2088


Fold 4: train 333196 (pos 34075), valid 83378 (pos 8531), scale_pos_weight=8.8


 Fold 4 done in 12.4s; best_iteration=1913


OOF threshold sweep for MCC...


Best OOF MCC=0.69162 at thr=0.550
 Inference model 0 took 0.1s
 Inference model 1 took 0.1s
 Inference model 2 took 0.0s


 Inference model 3 took 0.0s
 Inference model 4 took 0.0s


Saved submission.csv
Head:
                    contact_id  contact  contact_prob
0  58187_001341_0_47795_52650        0           0.0
1  58187_001341_0_47795_47804        0           0.0
2  58187_001341_0_47795_52863        0           0.0
3  58187_001341_0_47795_52574        0           0.0
4  58187_001341_0_47795_52483        0           0.0
Done. Total time: 77.2s
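
The smoothing step above sorts rows by pair and step, so the smoothed values have to be written back in the original row order before they are paired with `contact_id`. A minimal pure-Python sketch of that idea follows; `smooth_max3` and the toy rows are hypothetical, and the window mirrors a centered 3-row rolling max with `min_periods=1`.

```python
from collections import defaultdict

def smooth_max3(keys, steps, probs):
    """Centered 3-row rolling max per key, returned in the input row order."""
    rows_by_key = defaultdict(list)
    for i, k in enumerate(keys):
        rows_by_key[k].append(i)
    out = [0.0] * len(probs)
    for idx in rows_by_key.values():
        idx.sort(key=lambda i: steps[i])          # time order within one pair
        for pos, i in enumerate(idx):
            window = idx[max(0, pos - 1): pos + 2]  # previous, current, next row
            out[i] = max(probs[j] for j in window)  # write back at the original row
    return out

# Rows deliberately interleaved and out of time order.
keys = ['A', 'B', 'A', 'B', 'A']
steps = [2, 0, 0, 1, 1]
probs = [0.1, 0.2, 0.9, 0.3, 0.4]
print(smooth_max3(keys, steps, probs))
```

Because each result is stored at the index of its source row, the output aligns with any other column taken from the unsorted table. A rolling max is a dilation: one high probability lifts its neighbors, which tends to raise recall at a fixed threshold.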


In [9]:
# Temporal window features (past-only W=5) + positive expansion (±1) and rebuild supervised/train/test tables
import pandas as pd, numpy as np, time

t0 = time.time()
print('Loading base pair features...')
train_pairs = pd.read_parquet('train_pairs_v1.parquet')
test_pairs = pd.read_parquet('test_pairs_v1.parquet')
print('Loaded train_pairs:', train_pairs.shape, 'test_pairs:', test_pairs.shape)

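def add_window_feats(df: pd.DataFrame, W: int = 5):
    # Past-only windows over the last W rows of each (game_play, p1, p2) series.
    # Rows exist only for steps where the pair was within the candidate radius, so W rows can span more than W steps.
    df = df.sort_values(['game_play','p1','p2','step']).copy()
    grp = df.groupby(['game_play','p1','p2'], sort=False)
    # rolling on distance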
    df['dist_min_p5'] = grp['distance'].rolling(W, min_periods=1).min().reset_index(level=[0,1,2], drop=True)
    df['dist_mean_p5'] = grp['distance'].rolling(W, min_periods=1).mean().reset_index(level=[0,1,2], drop=True)
    df['dist_max_p5'] = grp['distance'].rolling(W, min_periods=1).max().reset_index(level=[0,1,2], drop=True)
    df['dist_std_p5'] = grp['distance'].rolling(W, min_periods=1).std().reset_index(level=[0,1,2], drop=True)
    # abs_closing rolling
    df['abs_close_min_p5'] = grp['abs_closing'].rolling(W, min_periods=1).min().reset_index(level=[0,1,2], drop=True)
    df['abs_close_mean_p5'] = grp['abs_closing'].rolling(W, min_periods=1).mean().reset_index(level=[0,1,2], drop=True)
    df['abs_close_max_p5'] = grp['abs_closing'].rolling(W, min_periods=1).max().reset_index(level=[0,1,2], drop=True)
    df['abs_close_std_p5'] = grp['abs_closing'].rolling(W, min_periods=1).std().reset_index(level=[0,1,2], drop=True)
    # counts distance under thresholds
    for thr, name in [(1.5,'lt15'), (2.0,'lt20'), (2.5,'lt25')]:
        key = f'cnt_dist_{name}_p5'
        df[key] = grp['distance'].apply(lambda s: s.lt(thr).rolling(W, min_periods=1).sum()).reset_index(level=[0,1,2], drop=True)
    # trend over 5: distance[t] - distance[t-5]
    df['dist_delta_p5'] = df['distance'] - grp['distance'].shift(W)
    return df

print('Adding window features to train...')
train_w = add_window_feats(train_pairs, W=5)
print('Adding window features to test...')
test_w = add_window_feats(test_pairs, W=5)

train_w.to_parquet('train_pairs_w5.parquet', index=False)
test_w.to_parquet('test_pairs_w5.parquet', index=False)
print('Saved windowed pairs parquet.')

# Rebuild supervised set using INNER JOIN to label domain; then apply positive expansion (±1) within existing rows
key_cols = ['game_play','step','p1','p2']
lab_cols = key_cols + ['contact']
labels_min = train_labels[lab_cols].copy()
sup = labels_min.merge(train_w, on=key_cols, how='inner')
print('Supervised(inner) shape:', sup.shape, 'pos rate:', sup['contact'].mean())

# Positive expansion: set contact=1 at step±1 for rows present in sup
pos = sup[sup['contact'] == 1][['game_play','p1','p2','step']].copy()
pos_m1 = pos.copy(); pos_m1['step'] = pos_m1['step'] - 1
pos_p1 = pos.copy(); pos_p1['step'] = pos_p1['step'] + 1
pos_exp = pd.concat([pos_m1, pos_p1], axis=0, ignore_index=True).drop_duplicates()
pos_exp['flag_pos_exp'] = 1
sup = sup.merge(pos_exp, on=['game_play','p1','p2','step'], how='left')
sup.loc[sup['flag_pos_exp'] == 1, 'contact'] = 1
sup = sup.drop(columns=['flag_pos_exp'])
print('After positive expansion, pos rate:', sup['contact'].mean())

sup.to_parquet('train_supervised_w5.parquet', index=False)
print('Saved train_supervised_w5.parquet. Total time {:.1f}s'.format(time.time()-t0), flush=True)

Loading base pair features...
Loaded train_pairs: (1641668, 19) test_pairs: (191559, 19)
Adding window features to train...


Adding window features to test...


Saved windowed pairs parquet.


Supervised(inner) shape: (416574, 32) pos rate: 0.10227714643736767


After positive expansion, pos rate: 0.11705963406261552


Saved train_supervised_w5.parquet. Total time 43.2s
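
The window features above count rows, and a pair only has rows for steps where it fell within the candidate radius. The sketch below contrasts a row-count window with a step-based one on a single pair whose series has a gap. Both helpers and the toy series are hypothetical illustrations, not the notebook's implementation.

```python
def last_rows_min(values, window=5):
    """Min over the last `window` rows, regardless of how far apart in time they are."""
    return [min(values[max(0, i - window + 1): i + 1]) for i in range(len(values))]

def past_steps_min(steps, values, window=5):
    """Min over rows whose step lies in (step - window, step]; steps must be ascending."""
    out = []
    start = 0
    for i, s in enumerate(steps):
        # Drop rows that are `window` or more steps older than the current row.
        while steps[start] <= s - window:
            start += 1
        out.append(min(values[start:i + 1]))
    return out

# One pair: close at steps 0-2, out of range for a while, back in range at steps 10-11.
steps = [0, 1, 2, 10, 11]
dist = [0.5, 1.0, 2.0, 2.5, 2.8]
print(last_rows_min(dist))          # row-based window reaches back across the gap
print(past_steps_min(steps, dist))  # step-based window forgets the stale rows
```

On this series the row-based minimum at step 10 still reports the 0.5 yd distance from ten steps earlier, while the step-based version reports 2.5 yd. Which behavior is preferable is a modeling choice; the point is only that the two differ whenever a pair's series has gaps.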


In [12]:
# Helmet proximity features v1: map frames->steps via snap_frame, aggregate per (game_play, step, view, player), merge to pairs, compute min normalized pixel distance across views
import pandas as pd, numpy as np, time
from math import sqrt

t0 = time.time()
FPS = 59.94
print('Loading helmet and metadata CSVs...')
train_helm = pd.read_csv('train_baseline_helmets.csv')
test_helm = pd.read_csv('test_baseline_helmets.csv')
train_vmeta = pd.read_csv('train_video_metadata.csv')
test_vmeta = pd.read_csv('test_video_metadata.csv')
print('Helm train/test:', train_helm.shape, test_helm.shape)

def prep_meta(vmeta: pd.DataFrame):
    vm = vmeta.copy()
    # parse times to seconds since midnight; columns may be numeric, numeric-like strings, or timestamp strings
    for c in ['start_time','snap_time']:
        if np.issubdtype(vm[c].dtype, np.number):
            continue
        ts = pd.to_datetime(vm[c], errors='coerce')
        if ts.notna().any():
            # timestamp strings: keep only the time-of-day offset in seconds
            vm[c] = (ts - ts.dt.floor('D')).dt.total_seconds().astype(float)
        else:
            # nothing parsed as a timestamp: treat as numeric-like strings
            vm[c] = pd.to_numeric(vm[c], errors='coerce')
    vm['snap_frame'] = ((vm['snap_time'] - vm['start_time']) * FPS).round().astype('Int64')
    return vm[['game_play','view','snap_frame']].drop_duplicates()

meta_tr = prep_meta(train_vmeta)
meta_te = prep_meta(test_vmeta)

def dedup_and_step(helm: pd.DataFrame, meta: pd.DataFrame):
    df = helm[['game_play','view','frame','nfl_player_id','left','top','width','height']].copy()
    df = df.dropna(subset=['nfl_player_id'])
    df['nfl_player_id'] = df['nfl_player_id'].astype(int).astype(str)
    df['area'] = df['width'] * df['height']
    df['cx'] = df['left'] + 0.5 * df['width']
    df['cy'] = df['top'] + 0.5 * df['height']
    # dedup per (gp,view,frame,player) by largest area
    df = df.sort_values(['game_play','view','frame','nfl_player_id','area'], ascending=[True,True,True,True,False])
    df = df.drop_duplicates(['game_play','view','frame','nfl_player_id'], keep='first')
    # map to step using snap_frame
    df = df.merge(meta, on=['game_play','view'], how='left')
    # step = round((frame - snap_frame)/6); 6 approximates FPS / 10 = 5.994 frames per 0.1 s tracking step
    df['step'] = ((df['frame'] - df['snap_frame']).astype('float') / 6.0).round().astype('Int64')
    df = df.dropna(subset=['step'])
    df['step'] = df['step'].astype(int)
    # expand to target_step in {step-1, step, step+1}
    dm1 = df.copy(); dm1['target_step'] = dm1['step'] - 1
    d0 = df.copy(); d0['target_step'] = d0['step']
    dp1 = df.copy(); dp1['target_step'] = dp1['step'] + 1
    d = pd.concat([dm1, d0, dp1], ignore_index=True)
    # aggregate per (gp,view,target_step,player)
    agg = d.groupby(['game_play','view','target_step','nfl_player_id'], sort=False).agg(
        cx_mean=('cx','mean'),
        cy_mean=('cy','mean'),
        h_mean=('height','mean'),
        cnt=('cx','size')
    ).reset_index().rename(columns={'target_step':'step'})
    return agg

print('Preparing per-step helmet aggregates...')
h_tr = dedup_and_step(train_helm, meta_tr)
h_te = dedup_and_step(test_helm, meta_te)
print('Agg helmets train/test:', h_tr.shape, h_te.shape)

def merge_helmet_to_pairs(pairs_path: str, h_agg: pd.DataFrame, out_path: str):
    pairs = pd.read_parquet(pairs_path)
    ha = h_agg[['game_play','step','view','nfl_player_id','cx_mean','cy_mean','h_mean']].copy()
    # self-join per (gp,step,view) to compute per-view pair distances
    a = ha.rename(columns={'nfl_player_id':'p1','cx_mean':'cx1','cy_mean':'cy1','h_mean':'h1'})
    b = ha.rename(columns={'nfl_player_id':'p2','cx_mean':'cx2','cy_mean':'cy2','h_mean':'h2'})
    merged = a.merge(b, on=['game_play','step','view'], how='inner')
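    # keep ordered pairs p1 < p2; compare numerically so the order matches pair keys built from int ids
    # (string comparison would disagree whenever the two ids have different digit counts)
    mask = merged['p1'].astype(int) < merged['p2'].astype(int)
    merged = merged[mask]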
    merged['px_dist'] = np.sqrt((merged['cx1'] - merged['cx2'])**2 + (merged['cy1'] - merged['cy2'])**2)
    merged['px_dist_norm'] = merged['px_dist'] / np.sqrt(np.maximum(1e-6, merged['h1'] * merged['h2']))
    agg = merged.groupby(['game_play','step','p1','p2'], as_index=False).agg(
        px_dist_norm_min=('px_dist_norm','min'),
        views_both_present=('px_dist_norm', lambda s: int(s.notna().sum()))
    )
    out = pairs.merge(agg, on=['game_play','step','p1','p2'], how='left')
    out.to_parquet(out_path, index=False)
    print('Saved', out_path, 'shape', out.shape, 'with helmet cols')

# Build and save merged pairs with helmet features
print('Merging helmet features into train pairs...')
merge_helmet_to_pairs('train_pairs_w5.parquet', h_tr, 'train_pairs_w5_helm.parquet')
print('Merging helmet features into test pairs...')
merge_helmet_to_pairs('test_pairs_w5.parquet', h_te, 'test_pairs_w5_helm.parquet')

# Rebuild supervised with helmet features via inner-join to labels
pairs_tr_helm = pd.read_parquet('train_pairs_w5_helm.parquet')
key_cols = ['game_play','step','p1','p2']
sup = train_labels[key_cols + ['contact']].merge(pairs_tr_helm, on=key_cols, how='inner')
sup.to_parquet('train_supervised_w5_helm.parquet', index=False)
print('Saved train_supervised_w5_helm.parquet', sup.shape)
print('Helmet feature build done in {:.1f}s'.format(time.time()-t0))

Loading helmet and metadata CSVs...


Helm train/test: (3412208, 12) (371408, 12)
Preparing per-step helmet aggregates...


Agg helmets train/test: (620840, 8) (67667, 8)
Merging helmet features into train pairs...


Saved train_pairs_w5_helm.parquet shape (1641668, 33) with helmet cols
Merging helmet features into test pairs...


Saved test_pairs_w5_helm.parquet shape (191559, 33) with helmet cols


Saved train_supervised_w5_helm.parquet (416574, 34)
Helmet feature build done in 215.9s
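
The frame-to-step mapping used in the cell above can be looked at in isolation. The sketch below assumes 59.94 fps video (the `FPS` value this notebook sets) and the same divisor of 6 frames per tracking step as `dedup_and_step`. The helper names `snap_frame` and `frame_to_step` are hypothetical and exist only for illustration.

```python
FPS = 59.94            # video frame rate, same value the notebook uses for FPS
FRAMES_PER_STEP = 6.0  # divisor used in dedup_and_step

def snap_frame(start_time_s: float, snap_time_s: float, fps: float = FPS) -> int:
    """Frame index of the snap, given clip start and snap times in seconds."""
    return round((snap_time_s - start_time_s) * fps)

def frame_to_step(frame: int, snap_frame_idx: int) -> int:
    """Tracking step for a video frame; step 0 is the snap."""
    return round((frame - snap_frame_idx) / FRAMES_PER_STEP)

# Hypothetical clip where the snap happens 5 s after the clip starts
sf = snap_frame(0.0, 5.0)
steps = [frame_to_step(f, sf) for f in (294, 300, 306, 312)]
print(sf, steps)
```

Because each helmet row is then copied to `step - 1`, `step` and `step + 1` before aggregation, a per-step mean effectively pools detections from a three-step window.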


In [14]:
# Add TTC/delta/relative/helmet-dynamics features to pairs, then rebuild supervised via INNER JOIN and apply ±1 positive expansion
import pandas as pd, numpy as np, time, math

t0 = time.time()
print('Loading pairs with W5 and helmet features...')
tr = pd.read_parquet('train_pairs_w5_helm.parquet')
te = pd.read_parquet('test_pairs_w5_helm.parquet')
print('train pairs:', tr.shape, 'test pairs:', te.shape)

def add_dyn_feats(df: pd.DataFrame) -> pd.DataFrame:
    df = df.sort_values(['game_play','p1','p2','step']).copy()
    grp = df.groupby(['game_play','p1','p2'], sort=False)
    # Impute helmet base before dynamics
    if 'px_dist_norm_min' in df.columns:
        df['px_dist_norm_min'] = df['px_dist_norm_min'].fillna(1.0)
    if 'views_both_present' in df.columns:
        df['views_both_present'] = df['views_both_present'].fillna(0).astype(float)

    # 1) TTC (approach-aware)
    # approaching when closing < 0 (players moving toward each other along line of centers)
    df['approaching_flag'] = (df['closing'] < 0).astype(int)
    denom = (-df['closing']).clip(lower=1e-3)
    ttc_raw = df['distance'] / denom
    ttc_raw = ttc_raw.where(df['approaching_flag'] == 1, 10.0)
    df['ttc_raw'] = ttc_raw.astype(float)
    df['ttc_clip'] = df['ttc_raw'].clip(0, 5)
    df['ttc_log'] = np.log1p(df['ttc_clip'])
    df['inv_ttc'] = 1.0 / (1.0 + df['ttc_clip'])

    # 2) Deltas against earlier steps of the same pair (NaN at the start of a pair; filled with 0 below)
    df['d_dist_1'] = df['distance'] - grp['distance'].shift(1)
    df['d_dist_2'] = df['distance'] - grp['distance'].shift(2)
    df['d_dist_5'] = df['distance'] - grp['distance'].shift(5)
    df['d_close_1'] = df['closing'] - grp['closing'].shift(1)
    df['d_absclose_1'] = df['abs_closing'] - grp['abs_closing'].shift(1)
    df['d_speed1_1'] = df['speed1'] - grp['speed1'].shift(1)
    df['d_speed2_1'] = df['speed2'] - grp['speed2'].shift(1)
    df['d_accel1_1'] = df['accel1'] - grp['accel1'].shift(1)
    df['d_accel2_1'] = df['accel2'] - grp['accel2'].shift(1)
    # small smoothers
    df['rm3_d_dist_1'] = grp['d_dist_1'].transform(lambda s: s.rolling(3, min_periods=1).mean())
    df['rm3_d_close_1'] = grp['d_close_1'].transform(lambda s: s.rolling(3, min_periods=1).mean())
    # fill deltas NaN with 0
    for c in ['d_dist_1','d_dist_2','d_dist_5','d_close_1','d_absclose_1','d_speed1_1','d_speed2_1','d_accel1_1','d_accel2_1','rm3_d_dist_1','rm3_d_close_1']:
        if c in df.columns:
            df[c] = df[c].fillna(0.0)

    # 3) Relative motion
    df['rel_speed'] = (df['speed2'] - df['speed1']).astype(float)
    df['abs_rel_speed'] = df['rel_speed'].abs()
    df['rel_accel'] = (df['accel2'] - df['accel1']).astype(float)
    df['abs_rel_accel'] = df['rel_accel'].abs()
    # Jerk per player
    df['jerk1'] = grp['accel1'].diff().fillna(0.0)
    df['jerk2'] = grp['accel2'].diff().fillna(0.0)

    # 4) Helmet dynamics (cheap)
    if 'px_dist_norm_min' in df.columns:
        df['d_px_norm_1'] = df['px_dist_norm_min'] - grp['px_dist_norm_min'].shift(1)
        df['d_px_norm_1'] = df['d_px_norm_1'].fillna(0.0)
        df['cnt_px_lt006_p3'] = grp['px_dist_norm_min'].transform(lambda s: s.lt(0.06).rolling(3, min_periods=1).sum()).astype(float)
        df['cnt_px_lt008_p3'] = grp['px_dist_norm_min'].transform(lambda s: s.lt(0.08).rolling(3, min_periods=1).sum()).astype(float)
    else:
        df['d_px_norm_1'] = 0.0
        df['cnt_px_lt006_p3'] = 0.0
        df['cnt_px_lt008_p3'] = 0.0

    return df

print('Adding dynamic features to train...')
tr_dyn = add_dyn_feats(tr)
print('Adding dynamic features to test...')
te_dyn = add_dyn_feats(te)

tr_dyn.to_parquet('train_pairs_w5_helm_dyn.parquet', index=False)
te_dyn.to_parquet('test_pairs_w5_helm_dyn.parquet', index=False)
print('Saved dyn pairs: train', tr_dyn.shape, 'test', te_dyn.shape)

# Rebuild supervised via INNER JOIN and apply ±1 positive expansion within supervised only
key_cols = ['game_play','step','p1','p2']
lab_cols = key_cols + ['contact']
labels_min = train_labels[lab_cols].copy()
sup = labels_min.merge(tr_dyn, on=key_cols, how='inner')
print('Supervised (inner) before expansion:', sup.shape, 'pos rate:', sup['contact'].mean())

# Positive expansion ±1: flip existing rows at t-1 and t+1 to contact=1 when present
pos = sup.loc[sup['contact'] == 1, ['game_play','p1','p2','step']]
pos_m1 = pos.copy(); pos_m1['step'] = pos_m1['step'] - 1
pos_p1 = pos.copy(); pos_p1['step'] = pos_p1['step'] + 1
pos_exp = pd.concat([pos_m1, pos_p1], ignore_index=True).drop_duplicates()
pos_exp['flag_pos_exp'] = 1
sup = sup.merge(pos_exp, on=['game_play','p1','p2','step'], how='left')
sup.loc[sup['flag_pos_exp'] == 1, 'contact'] = 1
sup.drop(columns=['flag_pos_exp'], inplace=True)
print('After positive expansion: pos rate:', sup['contact'].mean())

sup.to_parquet('train_supervised_w5_helm_dyn.parquet', index=False)
print('Saved train_supervised_w5_helm_dyn.parquet', sup.shape, 'Elapsed {:.1f}s'.format(time.time()-t0))

Loading pairs with W5 and helmet features...
train pairs: (1641668, 33) test pairs: (191559, 33)
Adding dynamic features to train...


Adding dynamic features to test...


Saved dyn pairs: train (1641668, 58) test (191559, 58)


Supervised (inner) before expansion: (416574, 59) pos rate: 0.10227714643736767


After positive expansion: pos rate: 0.11705963406261552


Saved train_supervised_w5_helm_dyn.parquet (416574, 59) Elapsed 22.4s
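
As a worked illustration of the time-to-contact (TTC) features built in `add_dyn_feats`, the sketch below applies the same formulas to a single pair at a single step. The function `ttc_features` and the input values are hypothetical; the constants (`1e-3` floor, sentinel `10.0`, clip to `[0, 5]`) are the ones used in the cell above.

```python
import math

def ttc_features(distance: float, closing: float) -> dict:
    """TTC features for one pair at one step, mirroring the formulas in add_dyn_feats."""
    approaching = closing < 0        # negative closing speed means the gap is shrinking
    denom = max(-closing, 1e-3)      # floor avoids dividing by ~0
    ttc_raw = distance / denom if approaching else 10.0  # sentinel for non-approaching pairs
    ttc_clip = min(max(ttc_raw, 0.0), 5.0)
    return {
        'approaching_flag': int(approaching),
        'ttc_raw': ttc_raw,
        'ttc_clip': ttc_clip,
        'ttc_log': math.log1p(ttc_clip),
        'inv_ttc': 1.0 / (1.0 + ttc_clip),
    }

closing_in = ttc_features(distance=2.0, closing=-4.0)   # gap of 2.0 shrinking at 4.0 per second
moving_apart = ttc_features(distance=2.0, closing=1.5)  # gap growing
print(closing_in['ttc_raw'], moving_apart['ttc_clip'])
```

Non-approaching pairs all collapse to the clipped value 5.0, so `approaching_flag` is what lets the model tell "far from contact" apart from "not approaching at all".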


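The ±1 positive expansion can be checked on a toy table. This is a minimal sketch with made-up keys and labels, following the same merge-and-flip steps as the cell above; only rows that already exist in the supervised table are relabelled, and no rows are added.

```python
import pandas as pd

# Toy supervised table for one pair with a single labelled contact at step 3
sup = pd.DataFrame({
    'game_play': ['g1'] * 6,
    'p1': ['10'] * 6,
    'p2': ['20'] * 6,
    'step': [0, 1, 2, 3, 4, 5],
    'contact': [0, 0, 0, 1, 0, 0],
})
keys = ['game_play', 'p1', 'p2', 'step']

# Shift the positive rows to t-1 and t+1, then flip any existing rows at those keys
pos = sup.loc[sup['contact'] == 1, keys]
neighbours = pd.concat(
    [pos.assign(step=pos['step'] - 1), pos.assign(step=pos['step'] + 1)],
    ignore_index=True,
).drop_duplicates()
neighbours['flag_pos_exp'] = 1

expanded = sup.merge(neighbours, on=keys, how='left')
expanded.loc[expanded['flag_pos_exp'] == 1, 'contact'] = 1
expanded = expanded.drop(columns=['flag_pos_exp'])

print(expanded['contact'].tolist())
```

Any metric computed against the expanded `contact` column is scored against these widened labels rather than the original annotations.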
In [18]:
# Train XGBoost on dyn features, smooth OOF, dual thresholds (same vs opp), predict test
import time, math, subprocess, sys
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold
from sklearn.metrics import matthews_corrcoef

try:
    import xgboost as xgb
except Exception as e:
    print('Installing xgboost...', e)
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'xgboost==2.1.1'], check=True)
    import xgboost as xgb
print('xgboost version:', getattr(xgb, '__version__', 'unknown'))

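def mcc_from_counts(tp, tn, fp, fn):
    # Cast to float64 first: with ~4e5 rows the four-way product exceeds the int64 range,
    # wraps around, and produces invalid values under the sqrt.
    tp, tn, fp, fn = (np.asarray(v, dtype=np.float64) for v in (tp, tn, fp, fn))
    num = tp * tn - fp * fn
    den = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    den = np.where(den == 0, 1.0, den)
    return num / den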

def fast_dual_threshold_mcc(y_true, prob, same_flag, grid_points=151):
    # Build cohort arrays
    res = {}
    for cohort in (0, 1):
        mask = (same_flag == cohort)
        y_c = y_true[mask].astype(int)
        p_c = prob[mask].astype(float)
        n = len(y_c)
        if n == 0:
            res[cohort] = {
                'k_grid': np.array([0], dtype=int),
                'tp': np.array([0], dtype=float),
                'fp': np.array([0], dtype=float),
                'tn': np.array([0], dtype=float),
                'fn': np.array([0], dtype=float),
                'thr_vals': np.array([1.0], dtype=float)
            }
            continue
        order = np.argsort(-p_c)  # descending by prob
        y_sorted = y_c[order]
        p_sorted = p_c[order]
        cum_pos = np.concatenate([[0], np.cumsum(y_sorted)])  # length n+1
        # Grid of k = number predicted positives (top-k rule)
        k_grid = np.unique(np.linspace(0, n, num=min(grid_points, n + 1), dtype=int))
        tp = cum_pos[k_grid]
        fp = k_grid - tp
        P = y_sorted.sum()
        N = n - P
        fn = P - tp
        tn = N - fp
        thr_vals = np.where(k_grid == 0, 1.0 + 1e-6, p_sorted[np.maximum(0, k_grid - 1)])
        res[cohort] = {'k_grid': k_grid, 'tp': tp, 'fp': fp, 'tn': tn, 'fn': fn, 'thr_vals': thr_vals}

    # Combine cohorts: iterate small grids and compute MCC from summed counts
    tp0, fp0, tn0, fn0, thr0 = res[0]['tp'], res[0]['fp'], res[0]['tn'], res[0]['fn'], res[0]['thr_vals']
    tp1, fp1, tn1, fn1, thr1 = res[1]['tp'], res[1]['fp'], res[1]['tn'], res[1]['fn'], res[1]['thr_vals']
    best = (-1.0, 0.5, 0.5)
    for i in range(len(thr0)):
        tp_i = tp0[i]; fp_i = fp0[i]; tn_i = tn0[i]; fn_i = fn0[i]
        tp_sum = tp_i + tp1
        fp_sum = fp_i + fp1
        tn_sum = tn_i + tn1
        fn_sum = fn_i + fn1
        m_arr = mcc_from_counts(tp_sum, tn_sum, fp_sum, fn_sum)
        j = int(np.argmax(m_arr))
        m = float(m_arr[j])
        if m > best[0]:
            best = (m, float(thr0[i]), float(thr1[j]))
    return best  # (best_mcc, thr_opp(=cohort0), thr_same(=cohort1))

print('Loading supervised dyn train and dyn test features...')
train_sup = pd.read_parquet('train_supervised_w5_helm_dyn.parquet')
test_feats = pd.read_parquet('test_pairs_w5_helm_dyn.parquet')
folds_df = pd.read_csv('folds_game_play.csv')
print('train_sup:', train_sup.shape, 'test_feats:', test_feats.shape)

# Attach folds
train_sup = train_sup.merge(folds_df, on='game_play', how='left')
assert train_sup['fold'].notna().all()

# Ensure helmet imputations (already done earlier, but safe)
for df in (train_sup, test_feats):
    if 'px_dist_norm_min' in df.columns:
        df['px_dist_norm_min'] = df['px_dist_norm_min'].fillna(1.0)
    if 'views_both_present' in df.columns:
        df['views_both_present'] = df['views_both_present'].fillna(0).astype(float)

# Build feature columns: use numeric columns excluding keys/label
drop_cols = {'contact','game_play','step','p1','p2','team1','team2','pos1','pos2','fold'}
feat_cols = [c for c in train_sup.columns if c not in drop_cols and pd.api.types.is_numeric_dtype(train_sup[c])]
print('Using', len(feat_cols), 'features')

X_all = train_sup[feat_cols].astype(float).values
y_all = train_sup['contact'].astype(int).values
groups = train_sup['game_play'].values
same_flag_all = train_sup['same_team'].astype(int).values if 'same_team' in train_sup.columns else np.zeros(len(train_sup), dtype=int)

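# Folds are regenerated here with GroupKFold on game_play; the 'fold' column merged above is only asserted, not used for splitting
gkf = GroupKFold(n_splits=5)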
oof = np.full(len(train_sup), np.nan, dtype=float)
models = []
start = time.time()

for fold, (tr_idx, va_idx) in enumerate(gkf.split(X_all, y_all, groups=groups)):
    t0 = time.time()
    X_tr, y_tr = X_all[tr_idx], y_all[tr_idx]
    X_va, y_va = X_all[va_idx], y_all[va_idx]
    neg = (y_tr == 0).sum(); pos = (y_tr == 1).sum()
    spw = max(1.0, neg / max(1, pos))
    print(f'Fold {fold}: train {len(tr_idx)} (pos {pos}), valid {len(va_idx)} (pos {(y_va==1).sum()}), spw={spw:.2f}', flush=True)
    dtrain = xgb.DMatrix(X_tr, label=y_tr)
    dvalid = xgb.DMatrix(X_va, label=y_va)
    params = {
        'tree_method': 'hist',
        'device': 'cuda',
        'max_depth': 7,
        'eta': 0.05,
        'subsample': 0.9,
        'colsample_bytree': 0.8,
        'min_child_weight': 10,
        'lambda': 1.5,
        'alpha': 0.1,
        'gamma': 0.1,
        'objective': 'binary:logistic',
        'eval_metric': 'logloss',
        'scale_pos_weight': float(spw),
        'seed': 42 + fold
    }
    evals = [(dtrain, 'train'), (dvalid, 'valid')]
    booster = xgb.train(
        params=params,
        dtrain=dtrain,
        num_boost_round=4000,
        evals=evals,
        early_stopping_rounds=200,
        verbose_eval=False
    )
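    best_it = getattr(booster, 'best_iteration', None)
    if best_it is None:  # explicit None check: `or` would also discard a legitimate best_iteration of 0
        best_it = booster.num_boosted_rounds() - 1
    best_it = int(best_it)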
    oof[va_idx] = booster.predict(dvalid, iteration_range=(0, best_it + 1))
    models.append((booster, best_it))
    print(f' Fold {fold} done in {time.time()-t0:.1f}s; best_iteration={best_it}', flush=True)

# Smooth OOF per (gp,p1,p2) with centered rolling-max window=3
oof_df = train_sup[['game_play','p1','p2','step','views_both_present']].copy()
oof_df['oof'] = oof
oof_df = oof_df.sort_values(['game_play','p1','p2','step'])
grp = oof_df.groupby(['game_play','p1','p2'], sort=False)
oof_df['oof_smooth'] = grp['oof'].transform(lambda s: s.rolling(3, center=True, min_periods=1).max())
oof_smooth = oof_df['oof_smooth'].values

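# Align labels and flags to the sorted oof_df row order
# (indexing positions by index label is valid because train_sup has a default RangeIndex after the merge;
#  'contact' includes the ±1 expanded positives, so the OOF MCC below is scored against expanded labels)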
idx_ord = oof_df.index.to_numpy()
y_sorted = train_sup['contact'].astype(int).to_numpy()[idx_ord]
if 'same_team' in train_sup.columns:
    same_flag_sorted = train_sup['same_team'].fillna(0).astype(int).to_numpy()[idx_ord]
else:
    same_flag_sorted = np.zeros(len(oof_df), dtype=int)
vb_sorted = (oof_df['views_both_present'].to_numpy() > 0).astype(int)

# Two cohorts by views_both_present, each with dual thresholds (same vs opp)
thr_dict = {}  # (vb)->(thr_opp, thr_same)
for vb in (0, 1):
    mask = (vb_sorted == vb)
    if mask.sum() == 0:
        thr_dict[vb] = (0.77, 0.77)  # empty cohort: fall back to a fixed default threshold
        continue
    best_mcc_sub, thr_opp_sub, thr_same_sub = fast_dual_threshold_mcc(y_sorted[mask], oof_smooth[mask], same_flag_sorted[mask], grid_points=151)
    if not np.isfinite(best_mcc_sub) or best_mcc_sub < 0:
        thrs = np.linspace(0.01, 0.99, 99)
        m_list = []
        for t in thrs:
            pred = (oof_smooth[mask] >= t).astype(int)
            m_list.append(matthews_corrcoef(y_sorted[mask], pred))
        j = int(np.argmax(m_list))
        thr_opp_sub = thr_same_sub = float(thrs[j])
    thr_dict[vb] = (float(thr_opp_sub), float(thr_same_sub))
print('Thresholds by views flag:', thr_dict)

# Evaluate combined OOF MCC with 4 thresholds
thr_arr = np.empty(len(oof_df), dtype=float)
for vb in (0, 1):
    mask = (vb_sorted == vb)
    t_opp, t_same = thr_dict[vb]
    thr_arr[mask] = np.where(same_flag_sorted[mask] == 1, t_same, t_opp)
pred_oof = (oof_smooth >= thr_arr).astype(int)
oof_mcc_all = matthews_corrcoef(y_sorted, pred_oof)
print(f'OOF MCC with 4 thresholds: {oof_mcc_all:.5f}')

# Inference on test and smoothing
Xt = test_feats[feat_cols].astype(float).values
dtest = xgb.DMatrix(Xt)
pt = np.zeros(len(test_feats), dtype=float)
for i, (booster, best_it) in enumerate(models):
    t0 = time.time()
    pt += booster.predict(dtest, iteration_range=(0, best_it + 1))
    print(f' Inference model {i} took {time.time()-t0:.1f}s', flush=True)
pt /= max(1, len(models))
pred_tmp = test_feats[['game_play','step','p1','p2','views_both_present']].copy()
pred_tmp['prob'] = pt
pred_tmp = pred_tmp.sort_values(['game_play','p1','p2','step'])
grp_t = pred_tmp.groupby(['game_play','p1','p2'], sort=False)
pred_tmp['prob_smooth'] = grp_t['prob'].transform(lambda s: s.rolling(3, center=True, min_periods=1).max())

# Apply 4 thresholds by same_team and views_both_present on test
# pred_tmp was re-sorted above, so every per-row array is taken in pred_tmp's order (via its index), not test_feats' order
if 'same_team' in test_feats.columns:
    same_flag_test = test_feats.loc[pred_tmp.index, 'same_team'].fillna(0).astype(int).to_numpy()
else:
    same_flag_test = np.zeros(len(pred_tmp), dtype=int)
vb_test = (pred_tmp['views_both_present'].to_numpy() > 0).astype(int)
thr_arr_test = np.empty(len(pred_tmp), dtype=float)
for vb in (0, 1):
    mask = (vb_test == vb)
    t_opp, t_same = thr_dict[vb]
    thr_arr_test[mask] = np.where(same_flag_test[mask] == 1, t_same, t_opp)
pred_bin = (pred_tmp['prob_smooth'].values >= thr_arr_test).astype(int)

# Build submission safely (avoid column clash); keys come from pred_tmp so they line up with pred_bin
cid = (pred_tmp['game_play'].astype(str) + '_' + pred_tmp['step'].astype(str) + '_' +
       pred_tmp['p1'].astype(str) + '_' + pred_tmp['p2'].astype(str))
pred_df = pd.DataFrame({'contact_id': cid.to_numpy(), 'pred_contact': pred_bin})
ss = pd.read_csv('sample_submission.csv')
sub = ss.copy()
sub['contact'] = sub['contact_id'].map(pred_df.set_index('contact_id')['pred_contact']).fillna(0).astype(int)
sub[['contact_id','contact']].to_csv('submission.csv', index=False)
print('Saved submission.csv')
print('Done. Total time:', f'{time.time()-start:.1f}s', flush=True)

xgboost version: 2.1.4
Loading supervised dyn train and dyn test features...
train_sup: (416574, 59) test_feats: (191559, 58)
Using 50 features


Fold 0: train 333273 (pos 38783), valid 83301 (pos 9981), spw=7.59


 Fold 0 done in 22.7s; best_iteration=2724


Fold 1: train 333268 (pos 39102), valid 83306 (pos 9662), spw=7.52


 Fold 1 done in 24.9s; best_iteration=3085


Fold 2: train 333273 (pos 39536), valid 83301 (pos 9228), spw=7.43


 Fold 2 done in 22.7s; best_iteration=2732


Fold 3: train 333286 (pos 38584), valid 83288 (pos 10180), spw=7.64


 Fold 3 done in 23.1s; best_iteration=2751


Fold 4: train 333196 (pos 39051), valid 83378 (pos 9713), spw=7.53


 Fold 4 done in 23.5s; best_iteration=2720


  den = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))


Thresholds by views flag: {0: (0.36875900626182556, 0.40483367443084717), 1: (0.77, 0.77)}
OOF MCC with 4 thresholds: 0.70980


 Inference model 0 took 0.1s


 Inference model 1 took 0.1s


 Inference model 2 took 0.1s


 Inference model 3 took 0.1s


 Inference model 4 took 0.1s


Saved submission.csv
Done. Total time: 126.2s
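
The stray `den = np.sqrt(...)` line in the output above is the source line echoed by a NumPy warning raised inside `mcc_from_counts`. A likely cause, given integer counts at this data size, is int64 overflow in the four-way product. The sketch below uses hypothetical counts at roughly the scale of this training set to show that the exact product does not fit in int64, and that computing in float64 avoids the problem. `mcc_from_counts_float` is an illustrative name, not a function from the notebook.

```python
import numpy as np

def mcc_from_counts_float(tp, tn, fp, fn):
    """MCC from confusion counts, computed in float64 to avoid integer overflow."""
    tp, tn, fp, fn = (np.asarray(v, dtype=np.float64) for v in (tp, tn, fp, fn))
    num = tp * tn - fp * fn
    den = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    den = np.where(den == 0, 1.0, den)  # degenerate case: define MCC as 0
    return num / den

# Hypothetical counts (assumption: about 4e5 rows, about 12% positives)
tp, fp, fn, tn = (np.int64(v) for v in (40_000, 10_000, 8_000, 358_000))

# Python ints have arbitrary precision, so this is the exact denominator product
exact_product = int(tp + fp) * int(tp + fn) * int(tn + fp) * int(tn + fn)
print('fits in int64:', exact_product <= np.iinfo(np.int64).max)

mcc = float(mcc_from_counts_float(tp, tn, fp, fn))
print('MCC:', mcc)
```

The final OOF MCC printed above comes from `matthews_corrcoef`, so it is not affected by this; what overflow can distort is the threshold search that relies on the count-based MCC.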


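The log above reports `(0.77, 0.77)` for views flag 1, which is the value the code assigns only when that cohort has no rows. That would mean no supervised row has `views_both_present > 0`. One way to check how many pair rows actually received helmet features is a left merge with `indicator=True`. The frames below are toy stand-ins for the pair table and the per-pair helmet aggregate, not the real artifacts.

```python
import pandas as pd

keys = ['game_play', 'step', 'p1', 'p2']

# Toy stand-ins (assumption: same key columns as the notebook's pair tables)
pairs = pd.DataFrame({'game_play': ['g1', 'g1', 'g1'], 'step': [0, 1, 2],
                      'p1': ['10', '10', '10'], 'p2': ['20', '20', '20']})
helm_agg = pd.DataFrame({'game_play': ['g1'], 'step': [1], 'p1': ['10'], 'p2': ['20'],
                         'px_dist_norm_min': [0.4], 'views_both_present': [2]})

# Keys must agree in dtype and formatting (e.g. '10' vs '10.0') for rows to match
print(pairs[keys].dtypes.astype(str).to_dict())
print(helm_agg[keys].dtypes.astype(str).to_dict())

# _merge is 'both' where a helmet row matched and 'left_only' where it did not
checked = pairs.merge(helm_agg, on=keys, how='left', indicator=True)
coverage = (checked['_merge'] == 'both').mean()
print(f'rows with helmet features: {coverage:.1%}')
```

If coverage on the real tables is near zero, the imputed helmet columns carry no signal and the mismatch is worth tracing back through the step mapping and the id formatting.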
In [19]:
# Rebuild full pipeline with candidate radius r=3.5 and save *_r35 artifacts (pairs -> W5 -> helmets -> dyn -> supervised + ±1 expansion)
import pandas as pd, numpy as np, time, math
from itertools import combinations

t0 = time.time()
print('Rebuilding pipeline with r=3.5 ...')

# 1) Build pairs at r=3.5 from the tracking tables indexed in earlier cells (train_track_idx/test_track_idx), using local r-parameterised copies of the pair builders
def build_pairs_for_group_r(gdf, r=3.5):
    rows = []
    arr = gdf[['nfl_player_id','team','position','x_position','y_position','speed','acceleration','direction']].values
    n = arr.shape[0]
    for i, j in combinations(range(n), 2):
        pid_i, team_i, pos_i, xi, yi, si, ai, diri = arr[i]
        pid_j, team_j, pos_j, xj, yj, sj, aj, dirj = arr[j]
        dx = xj - xi; dy = yj - yi
        dist = math.hypot(dx, dy)
        if dist > r:
            continue
        a = int(pid_i); b = int(pid_j)
        p1, p2 = (str(a), str(b)) if a <= b else (str(b), str(a))
        vxi = si * math.cos(math.radians(diri)) if not pd.isna(diri) else 0.0
        vyi = si * math.sin(math.radians(diri)) if not pd.isna(diri) else 0.0
        vxj = sj * math.cos(math.radians(dirj)) if not pd.isna(dirj) else 0.0
        vyj = sj * math.sin(math.radians(dirj)) if not pd.isna(dirj) else 0.0
        rvx = vxj - vxi; rvy = vyj - vyi
        if dist > 0:
            ux = dx / dist; uy = dy / dist
            closing = rvx * ux + rvy * uy
        else:
            closing = 0.0
        # heading difference abs
        if pd.isna(diri) or pd.isna(dirj):
            hd = np.nan
        else:
            d = (diri - dirj + 180) % 360 - 180
            hd = abs(d)
        rows.append((p1, p2, dist, dx, dy, si, sj, ai, aj, closing, abs(closing), hd, int(team_i == team_j), str(team_i), str(team_j), str(pos_i), str(pos_j)))
    if not rows:
        return pd.DataFrame(columns=['p1','p2','distance','rel_dx','rel_dy','speed1','speed2','accel1','accel2','closing','abs_closing','abs_d_heading','same_team','team1','team2','pos1','pos2'])
    return pd.DataFrame(rows, columns=['p1','p2','distance','rel_dx','rel_dy','speed1','speed2','accel1','accel2','closing','abs_closing','abs_d_heading','same_team','team1','team2','pos1','pos2'])

def build_feature_table_r(track_df, r=3.5):
    feats = []
    cnt = 0
    last = time.time()
    for (gp, step), gdf in track_df.groupby(['game_play','step'], sort=False):
        f = build_pairs_for_group_r(gdf, r=r)
        if not f.empty:
            f.insert(0, 'step', step)
            f.insert(0, 'game_play', gp)
            feats.append(f)
        cnt += 1
        if cnt % 500 == 0:
            now = time.time()
            print(f' processed {cnt} steps; +{now-last:.1f}s; total {now-t0:.1f}s', flush=True)
            last = now
    if feats:
        return pd.concat(feats, ignore_index=True)
    return pd.DataFrame(columns=['game_play','step','p1','p2','distance','rel_dx','rel_dy','speed1','speed2','accel1','accel2','closing','abs_closing','abs_d_heading','same_team','team1','team2','pos1','pos2'])

print('Building train pairs r=3.5 ...')
train_pairs_r35 = build_feature_table_r(train_track_idx, r=3.5)
print('train_pairs_r35:', train_pairs_r35.shape)
train_pairs_r35.to_parquet('train_pairs_r35.parquet', index=False)
print('Building test pairs r=3.5 ...')
test_pairs_r35 = build_feature_table_r(test_track_idx, r=3.5)
print('test_pairs_r35:', test_pairs_r35.shape)
test_pairs_r35.to_parquet('test_pairs_r35.parquet', index=False)

# 2) Add W5 past-only features with a local copy of the window-feature helper
def add_window_feats_local(df: pd.DataFrame, W: int = 5):
    df = df.sort_values(['game_play','p1','p2','step']).copy()
    grp = df.groupby(['game_play','p1','p2'], sort=False)
    df['dist_min_p5'] = grp['distance'].rolling(W, min_periods=1).min().reset_index(level=[0,1,2], drop=True)
    df['dist_mean_p5'] = grp['distance'].rolling(W, min_periods=1).mean().reset_index(level=[0,1,2], drop=True)
    df['dist_max_p5'] = grp['distance'].rolling(W, min_periods=1).max().reset_index(level=[0,1,2], drop=True)
    df['dist_std_p5'] = grp['distance'].rolling(W, min_periods=1).std().reset_index(level=[0,1,2], drop=True)
    df['abs_close_min_p5'] = grp['abs_closing'].rolling(W, min_periods=1).min().reset_index(level=[0,1,2], drop=True)
    df['abs_close_mean_p5'] = grp['abs_closing'].rolling(W, min_periods=1).mean().reset_index(level=[0,1,2], drop=True)
    df['abs_close_max_p5'] = grp['abs_closing'].rolling(W, min_periods=1).max().reset_index(level=[0,1,2], drop=True)
    df['abs_close_std_p5'] = grp['abs_closing'].rolling(W, min_periods=1).std().reset_index(level=[0,1,2], drop=True)
    for thr, name in [(1.5,'lt15'), (2.0,'lt20'), (2.5,'lt25')]:
        key = f'cnt_dist_{name}_p5'
        df[key] = grp['distance'].apply(lambda s: s.lt(thr).rolling(W, min_periods=1).sum()).reset_index(level=[0,1,2], drop=True)
    df['dist_delta_p5'] = df['distance'] - grp['distance'].shift(W)
    return df

print('Adding W5 features (train/test)...')
train_w_r35 = add_window_feats_local(train_pairs_r35, W=5)
test_w_r35 = add_window_feats_local(test_pairs_r35, W=5)
train_w_r35.to_parquet('train_pairs_w5_r35.parquet', index=False)
test_w_r35.to_parquet('test_pairs_w5_r35.parquet', index=False)

# 3) Helmet merge: recompute aggregates (dedup + step mapping) and merge into pairs
FPS = 59.94
def prep_meta(vmeta: pd.DataFrame):
    vm = vmeta.copy()
    for c in ['start_time','snap_time']:
        if np.issubdtype(vm[c].dtype, np.number):
            continue
        ts = pd.to_datetime(vm[c], errors='coerce')
        if ts.notna().any():
            vm[c] = (ts - ts.dt.floor('D')).dt.total_seconds().astype(float)
        else:
            vm[c] = pd.to_numeric(vm[c], errors='coerce')
    vm['snap_frame'] = ((vm['snap_time'] - vm['start_time']) * FPS).round().astype('Int64')
    return vm[['game_play','view','snap_frame']].drop_duplicates()

print('Loading helmets and video metadata for r=3.5 merge...')
train_helm_df = pd.read_csv('train_baseline_helmets.csv')
test_helm_df = pd.read_csv('test_baseline_helmets.csv')
train_vmeta_df = pd.read_csv('train_video_metadata.csv')
test_vmeta_df = pd.read_csv('test_video_metadata.csv')
meta_tr = prep_meta(train_vmeta_df); meta_te = prep_meta(test_vmeta_df)

def dedup_and_step(helm: pd.DataFrame, meta: pd.DataFrame):
    df = helm[['game_play','view','frame','nfl_player_id','left','top','width','height']].copy()
    df = df.dropna(subset=['nfl_player_id'])
    df['nfl_player_id'] = df['nfl_player_id'].astype(int).astype(str)
    df['area'] = df['width'] * df['height']
    df['cx'] = df['left'] + 0.5 * df['width']
    df['cy'] = df['top'] + 0.5 * df['height']
    df = df.sort_values(['game_play','view','frame','nfl_player_id','area'], ascending=[True,True,True,True,False]).drop_duplicates(['game_play','view','frame','nfl_player_id'], keep='first')
    df = df.merge(meta, on=['game_play','view'], how='left')
    df['step'] = ((df['frame'] - df['snap_frame']).astype('float') / 6.0).round().astype('Int64')
    df = df.dropna(subset=['step']); df['step'] = df['step'].astype(int)
    dm1 = df.copy(); dm1['target_step'] = dm1['step'] - 1
    d0 = df.copy(); d0['target_step'] = d0['step']
    dp1 = df.copy(); dp1['target_step'] = dp1['step'] + 1
    d = pd.concat([dm1, d0, dp1], ignore_index=True)
    agg = d.groupby(['game_play','view','target_step','nfl_player_id'], sort=False).agg(
        cx_mean=('cx','mean'), cy_mean=('cy','mean'), h_mean=('height','mean'), cnt=('cx','size')
    ).reset_index().rename(columns={'target_step':'step'})
    return agg

print('Preparing helmet aggregates...')
h_tr = dedup_and_step(train_helm_df, meta_tr)
h_te = dedup_and_step(test_helm_df, meta_te)
print('Helmet agg shapes:', h_tr.shape, h_te.shape)

def merge_helmet_to_pairs_df(pairs: pd.DataFrame, h_agg: pd.DataFrame):
    ha = h_agg[['game_play','step','view','nfl_player_id','cx_mean','cy_mean','h_mean']].copy()
    a = ha.rename(columns={'nfl_player_id':'p1','cx_mean':'cx1','cy_mean':'cy1','h_mean':'h1'})
    b = ha.rename(columns={'nfl_player_id':'p2','cx_mean':'cx2','cy_mean':'cy2','h_mean':'h2'})
    merged = a.merge(b, on=['game_play','step','view'], how='inner')
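    # compare ids numerically so the p1 < p2 order matches the int-ordered pair keys built above
    merged = merged[merged['p1'].astype(int) < merged['p2'].astype(int)]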
    merged['px_dist'] = np.sqrt((merged['cx1'] - merged['cx2'])**2 + (merged['cy1'] - merged['cy2'])**2)
    merged['px_dist_norm'] = merged['px_dist'] / np.sqrt(np.maximum(1e-6, merged['h1'] * merged['h2']))
    agg = merged.groupby(['game_play','step','p1','p2'], as_index=False).agg(
        px_dist_norm_min=('px_dist_norm','min'),
        views_both_present=('px_dist_norm', lambda s: int(s.notna().sum()))
    )
    out = pairs.merge(agg, on=['game_play','step','p1','p2'], how='left')
    return out

print('Merging helmets into pairs (train/test) ...')
train_pairs_w5_helm_r35 = merge_helmet_to_pairs_df(train_w_r35, h_tr)
test_pairs_w5_helm_r35 = merge_helmet_to_pairs_df(test_w_r35, h_te)
train_pairs_w5_helm_r35.to_parquet('train_pairs_w5_helm_r35.parquet', index=False)
test_pairs_w5_helm_r35.to_parquet('test_pairs_w5_helm_r35.parquet', index=False)

# 4) Add dynamic features (TTC/deltas/rel/helmet dynamics)
def add_dyn_feats(df: pd.DataFrame) -> pd.DataFrame:
    df = df.sort_values(['game_play','p1','p2','step']).copy()
    grp = df.groupby(['game_play','p1','p2'], sort=False)
    if 'px_dist_norm_min' in df.columns: df['px_dist_norm_min'] = df['px_dist_norm_min'].fillna(1.0)
    if 'views_both_present' in df.columns: df['views_both_present'] = df['views_both_present'].fillna(0).astype(float)
    df['approaching_flag'] = (df['closing'] < 0).astype(int)
    denom = (-df['closing']).clip(lower=1e-3)
    ttc_raw = df['distance'] / denom
    ttc_raw = ttc_raw.where(df['approaching_flag'] == 1, 10.0)
    df['ttc_raw'] = ttc_raw.astype(float)
    df['ttc_clip'] = df['ttc_raw'].clip(0, 5)
    df['ttc_log'] = np.log1p(df['ttc_clip'])
    df['inv_ttc'] = 1.0 / (1.0 + df['ttc_clip'])
    df['d_dist_1'] = df['distance'] - grp['distance'].shift(1)
    df['d_dist_2'] = df['distance'] - grp['distance'].shift(2)
    df['d_dist_5'] = df['distance'] - grp['distance'].shift(5)
    df['d_close_1'] = df['closing'] - grp['closing'].shift(1)
    df['d_absclose_1'] = df['abs_closing'] - grp['abs_closing'].shift(1)
    df['d_speed1_1'] = df['speed1'] - grp['speed1'].shift(1)
    df['d_speed2_1'] = df['speed2'] - grp['speed2'].shift(1)
    df['d_accel1_1'] = df['accel1'] - grp['accel1'].shift(1)
    df['d_accel2_1'] = df['accel2'] - grp['accel2'].shift(1)
    df['rm3_d_dist_1'] = grp['d_dist_1'].transform(lambda s: s.rolling(3, min_periods=1).mean())
    df['rm3_d_close_1'] = grp['d_close_1'].transform(lambda s: s.rolling(3, min_periods=1).mean())
    for c in ['d_dist_1','d_dist_2','d_dist_5','d_close_1','d_absclose_1','d_speed1_1','d_speed2_1','d_accel1_1','d_accel2_1','rm3_d_dist_1','rm3_d_close_1']:
        df[c] = df[c].fillna(0.0)
    df['rel_speed'] = (df['speed2'] - df['speed1']).astype(float)
    df['abs_rel_speed'] = df['rel_speed'].abs()
    df['rel_accel'] = (df['accel2'] - df['accel1']).astype(float)
    df['abs_rel_accel'] = df['rel_accel'].abs()
    df['jerk1'] = grp['accel1'].diff().fillna(0.0)
    df['jerk2'] = grp['accel2'].diff().fillna(0.0)
    if 'px_dist_norm_min' in df.columns:
        df['d_px_norm_1'] = df['px_dist_norm_min'] - grp['px_dist_norm_min'].shift(1)
        df['d_px_norm_1'] = df['d_px_norm_1'].fillna(0.0)
        df['cnt_px_lt006_p3'] = grp['px_dist_norm_min'].transform(lambda s: s.lt(0.06).rolling(3, min_periods=1).sum()).astype(float)
        df['cnt_px_lt008_p3'] = grp['px_dist_norm_min'].transform(lambda s: s.lt(0.08).rolling(3, min_periods=1).sum()).astype(float)
    else:
        df['d_px_norm_1'] = 0.0; df['cnt_px_lt006_p3'] = 0.0; df['cnt_px_lt008_p3'] = 0.0
    return df

print('Adding dyn features (train/test) ...')
tr_dyn_r35 = add_dyn_feats(train_pairs_w5_helm_r35)
te_dyn_r35 = add_dyn_feats(test_pairs_w5_helm_r35)
tr_dyn_r35.to_parquet('train_pairs_w5_helm_dyn_r35.parquet', index=False)
te_dyn_r35.to_parquet('test_pairs_w5_helm_dyn_r35.parquet', index=False)

# 5) Supervised via INNER JOIN to labels then ±1 positive expansion
key_cols = ['game_play','step','p1','p2']
lab_cols = key_cols + ['contact']
labels_min = train_labels[lab_cols].copy()
sup_r35 = labels_min.merge(tr_dyn_r35, on=key_cols, how='inner')
print('Supervised(inner) r=3.5 before expansion:', sup_r35.shape, 'pos rate:', sup_r35['contact'].mean())
pos = sup_r35.loc[sup_r35['contact'] == 1, ['game_play','p1','p2','step']]
pos_m1 = pos.copy(); pos_m1['step'] = pos_m1['step'] - 1
pos_p1 = pos.copy(); pos_p1['step'] = pos_p1['step'] + 1
pos_exp = pd.concat([pos_m1, pos_p1], ignore_index=True).drop_duplicates()
pos_exp['flag_pos_exp'] = 1
sup_r35 = sup_r35.merge(pos_exp, on=['game_play','p1','p2','step'], how='left')
sup_r35.loc[sup_r35['flag_pos_exp'] == 1, 'contact'] = 1
sup_r35.drop(columns=['flag_pos_exp'], inplace=True)
print('After positive expansion (r=3.5): pos rate:', sup_r35['contact'].mean())
sup_r35.to_parquet('train_supervised_w5_helm_dyn_r35.parquet', index=False)

print('Done r=3.5 rebuild in {:.1f}s'.format(time.time()-t0), flush=True)

Rebuilding pipeline with r=3.5 ...
Building train pairs r=3.5 ...


 processed 500 steps; +0.7s; total 0.7s


 processed 1000 steps; +0.5s; total 1.2s


 processed 1500 steps; +0.8s; total 2.0s


 processed 2000 steps; +0.6s; total 2.6s


 processed 2500 steps; +0.6s; total 3.1s


 processed 3000 steps; +0.6s; total 3.7s


 processed 3500 steps; +0.6s; total 4.2s


 processed 4000 steps; +0.8s; total 5.1s


 processed 4500 steps; +0.6s; total 5.6s


 processed 5000 steps; +0.6s; total 6.2s


 processed 5500 steps; +0.6s; total 6.7s


 processed 6000 steps; +0.5s; total 7.3s


 processed 6500 steps; +0.5s; total 7.8s


 processed 7000 steps; +0.6s; total 8.4s


 processed 7500 steps; +0.9s; total 9.3s


 processed 8000 steps; +0.5s; total 9.8s


 processed 8500 steps; +0.6s; total 10.4s


 processed 9000 steps; +0.6s; total 10.9s


 processed 9500 steps; +0.6s; total 11.5s


 processed 10000 steps; +0.6s; total 12.1s


 processed 10500 steps; +0.6s; total 12.7s


 processed 11000 steps; +0.5s; total 13.2s


 processed 11500 steps; +0.5s; total 13.7s


 processed 12000 steps; +1.0s; total 14.7s


 processed 12500 steps; +0.6s; total 15.3s


 processed 13000 steps; +0.6s; total 15.8s


 processed 13500 steps; +0.5s; total 16.4s


 processed 14000 steps; +0.6s; total 16.9s


 processed 14500 steps; +0.6s; total 17.5s


 processed 15000 steps; +0.5s; total 18.1s


 processed 15500 steps; +0.5s; total 18.6s


 processed 16000 steps; +0.6s; total 19.2s


 processed 16500 steps; +0.6s; total 19.7s


 processed 17000 steps; +1.0s; total 20.8s


 processed 17500 steps; +0.6s; total 21.3s


 processed 18000 steps; +0.5s; total 21.9s


 processed 18500 steps; +0.6s; total 22.4s


 processed 19000 steps; +0.6s; total 23.0s


 processed 19500 steps; +0.5s; total 23.5s


 processed 20000 steps; +0.5s; total 24.1s


 processed 20500 steps; +0.5s; total 24.6s


 processed 21000 steps; +0.6s; total 25.2s


 processed 21500 steps; +0.5s; total 25.7s


 processed 22000 steps; +0.5s; total 26.3s


 processed 22500 steps; +0.6s; total 26.8s


 processed 23000 steps; +0.5s; total 27.4s


 processed 23500 steps; +1.2s; total 28.5s


 processed 24000 steps; +0.6s; total 29.1s


 processed 24500 steps; +0.6s; total 29.6s


 processed 25000 steps; +0.6s; total 30.2s


 processed 25500 steps; +0.6s; total 30.8s


 processed 26000 steps; +0.6s; total 31.3s


 processed 26500 steps; +0.6s; total 31.9s


 processed 27000 steps; +0.6s; total 32.4s


 processed 27500 steps; +0.6s; total 33.0s


 processed 28000 steps; +0.6s; total 33.6s


 processed 28500 steps; +0.6s; total 34.1s


 processed 29000 steps; +0.6s; total 34.7s


 processed 29500 steps; +0.6s; total 35.3s


 processed 30000 steps; +0.6s; total 35.9s


 processed 30500 steps; +0.6s; total 36.4s


 processed 31000 steps; +1.4s; total 37.8s


 processed 31500 steps; +0.6s; total 38.4s


 processed 32000 steps; +0.5s; total 38.9s


 processed 32500 steps; +0.5s; total 39.5s


 processed 33000 steps; +0.5s; total 40.0s


 processed 33500 steps; +0.6s; total 40.6s


 processed 34000 steps; +0.6s; total 41.2s


 processed 34500 steps; +0.6s; total 41.7s


 processed 35000 steps; +0.5s; total 42.3s


 processed 35500 steps; +0.6s; total 42.8s


 processed 36000 steps; +0.6s; total 43.4s


 processed 36500 steps; +0.6s; total 43.9s


 processed 37000 steps; +0.6s; total 44.5s


 processed 37500 steps; +0.6s; total 45.1s


 processed 38000 steps; +0.6s; total 45.6s


 processed 38500 steps; +0.6s; total 46.2s


 processed 39000 steps; +0.6s; total 46.8s


 processed 39500 steps; +0.5s; total 47.3s


 processed 40000 steps; +1.4s; total 48.7s


 processed 40500 steps; +0.6s; total 49.3s


 processed 41000 steps; +0.5s; total 49.9s


 processed 41500 steps; +0.6s; total 50.4s


 processed 42000 steps; +0.5s; total 51.0s


 processed 42500 steps; +0.6s; total 51.5s


 processed 43000 steps; +0.6s; total 52.1s


 processed 43500 steps; +0.6s; total 52.7s


 processed 44000 steps; +0.5s; total 53.2s


 processed 44500 steps; +0.5s; total 53.8s


 processed 45000 steps; +0.5s; total 54.3s


 processed 45500 steps; +0.6s; total 54.9s


 processed 46000 steps; +0.6s; total 55.4s


 processed 46500 steps; +0.5s; total 56.0s


 processed 47000 steps; +0.5s; total 56.5s


 processed 47500 steps; +0.6s; total 57.1s


 processed 48000 steps; +0.6s; total 57.7s


 processed 48500 steps; +0.6s; total 58.2s


 processed 49000 steps; +0.6s; total 58.8s


 processed 49500 steps; +0.5s; total 59.3s


 processed 50000 steps; +0.5s; total 59.9s


 processed 50500 steps; +0.6s; total 60.5s


 processed 51000 steps; +0.6s; total 61.0s


 processed 51500 steps; +1.7s; total 62.7s


 processed 52000 steps; +0.6s; total 63.3s


 processed 52500 steps; +0.6s; total 63.9s


 processed 53000 steps; +0.6s; total 64.4s


 processed 53500 steps; +0.6s; total 65.0s


 processed 54000 steps; +0.6s; total 65.6s


 processed 54500 steps; +0.6s; total 66.1s


 processed 55000 steps; +0.6s; total 66.7s


 processed 55500 steps; +0.6s; total 67.2s


train_pairs_r35: (2051428, 19)


Building test pairs r=3.5 ...


 processed 500 steps; +0.6s; total 73.6s


 processed 1000 steps; +0.6s; total 74.2s


 processed 1500 steps; +0.6s; total 74.8s


 processed 2000 steps; +0.6s; total 75.3s


 processed 2500 steps; +0.6s; total 75.9s


 processed 3000 steps; +0.5s; total 76.5s


 processed 3500 steps; +0.5s; total 77.0s


 processed 4000 steps; +0.6s; total 77.6s


 processed 4500 steps; +0.6s; total 78.1s


 processed 5000 steps; +0.5s; total 78.7s


 processed 5500 steps; +0.6s; total 79.2s


test_pairs_r35: (237673, 19)


Adding W5 features (train/test)...


Loading helmets and video metadata for r=3.5 merge...


Preparing helmet aggregates...


Helmet agg shapes: (620840, 8) (67667, 8)
Merging helmets into pairs (train/test) ...


Adding dyn features (train/test) ...


Supervised(inner) r=3.5 before expansion: (524248, 59) pos rate: 0.0812878637591369


After positive expansion (r=3.5): pos rate: 0.09304947276861333


Done r=3.5 rebuild in 369.6s
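
The ±1 positive expansion in step 5 can be illustrated on a toy frame. This is a minimal sketch, not the notebook's code: the helper name `expand_positives` and the toy rows are assumptions made for illustration. As in the cell above, it only relabels rows that already exist in the supervised set and never adds rows.

```python
import pandas as pd

def expand_positives(sup, keys, step_col="step", label_col="contact"):
    """Relabel rows one step before/after a positive as positive (rows are never added)."""
    pos = sup.loc[sup[label_col] == 1, keys + [step_col]]
    # Shift the positive steps by -1 and +1 to get the candidate neighbours.
    shifted = [pos.assign(**{step_col: pos[step_col] + d}) for d in (-1, 1)]
    pos_exp = pd.concat(shifted, ignore_index=True).drop_duplicates()
    pos_exp["_flag"] = 1
    # Left merge keeps every supervised row and its original order.
    out = sup.merge(pos_exp, on=keys + [step_col], how="left")
    out.loc[out["_flag"] == 1, label_col] = 1
    return out.drop(columns="_flag")

# Hypothetical pair with a single labelled contact at step 2 and a gap at step 5.
toy = pd.DataFrame({
    "game_play": ["gp"] * 6,
    "p1": ["a"] * 6,
    "p2": ["b"] * 6,
    "step": [0, 1, 2, 3, 4, 6],
    "contact": [0, 0, 1, 0, 0, 0],
})
expanded = expand_positives(toy, keys=["game_play", "p1", "p2"])
print(expanded[["step", "contact"]])
```

Because the expansion goes through a left merge, a neighbour step that is missing from the supervised set is simply skipped, which is why the positive rate rises only modestly in the log above.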


In [30]:
# Train on r=3.5 dyn features, smooth OOF, dual thresholds (same vs opp), predict test
import time, sys, subprocess, numpy as np, pandas as pd
from sklearn.model_selection import GroupKFold
from sklearn.metrics import matthews_corrcoef

try:
    import xgboost as xgb
except Exception as e:
    print('Installing xgboost...', e)
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'xgboost==2.1.1'], check=True)
    import xgboost as xgb
print('xgboost version:', getattr(xgb, '__version__', 'unknown'))

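def mcc_from_counts(tp, tn, fp, fn):
    # Cast to float64 first: the product of four int64 counts overflows at this data size
    tp, tn, fp, fn = (np.asarray(a, dtype=np.float64) for a in (tp, tn, fp, fn))
    num = tp * tn - fp * fn
    den = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    den = np.where(den == 0, 1.0, den)
    return num / den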

def fast_dual_threshold_mcc(y_true, prob, same_flag, grid_points=151):
    res = {}
    for cohort in (0, 1):
        mask = (same_flag == cohort)
        y_c = y_true[mask].astype(int)
        p_c = prob[mask].astype(float)
        n = len(y_c)
        if n == 0:
            res[cohort] = {'k_grid': np.array([0], int), 'tp': np.array([0.0]), 'fp': np.array([0.0]), 'tn': np.array([0.0]), 'fn': np.array([0.0]), 'thr_vals': np.array([1.0])}
            continue
        order = np.argsort(-p_c)
        y_sorted = y_c[order]
        p_sorted = p_c[order]
        cum_pos = np.concatenate([[0], np.cumsum(y_sorted)])
        k_grid = np.unique(np.linspace(0, n, num=min(grid_points, n + 1), dtype=int))
        tp = cum_pos[k_grid]
        fp = k_grid - tp
        P = y_sorted.sum(); N = n - P
        fn = P - tp; tn = N - fp
        thr_vals = np.where(k_grid == 0, 1.0 + 1e-6, p_sorted[np.maximum(0, k_grid - 1)])
        res[cohort] = {'k_grid': k_grid, 'tp': tp, 'fp': fp, 'tn': tn, 'fn': fn, 'thr_vals': thr_vals}
    tp0, fp0, tn0, fn0, thr0 = res[0]['tp'], res[0]['fp'], res[0]['tn'], res[0]['fn'], res[0]['thr_vals']
    tp1, fp1, tn1, fn1, thr1 = res[1]['tp'], res[1]['fp'], res[1]['tn'], res[1]['fn'], res[1]['thr_vals']
    best = (-1.0, 0.5, 0.5)
    for i in range(len(thr0)):
        tp_sum = tp0[i] + tp1; fp_sum = fp0[i] + fp1; tn_sum = tn0[i] + tn1; fn_sum = fn0[i] + fn1
        m_arr = mcc_from_counts(tp_sum, tn_sum, fp_sum, fn_sum)
        j = int(np.argmax(m_arr)); m = float(m_arr[j])
        if m > best[0]:
            best = (m, float(thr0[i]), float(thr1[j]))
    return best  # (best_mcc, thr_opp, thr_same)

print('Loading r=3.5 supervised dyn train and dyn test features...')
train_sup = pd.read_parquet('train_supervised_w5_helm_dyn_r35.parquet')
test_feats = pd.read_parquet('test_pairs_w5_helm_dyn_r35.parquet')
folds_df = pd.read_csv('folds_game_play.csv')
print('train_sup:', train_sup.shape, 'test_feats:', test_feats.shape)

train_sup = train_sup.merge(folds_df, on='game_play', how='left')
assert train_sup['fold'].notna().all()

for df in (train_sup, test_feats):
    if 'px_dist_norm_min' in df.columns: df['px_dist_norm_min'] = df['px_dist_norm_min'].fillna(1.0)
    if 'views_both_present' in df.columns: df['views_both_present'] = df['views_both_present'].fillna(0).astype(float)

drop_cols = {'contact','game_play','step','p1','p2','team1','team2','pos1','pos2','fold'}
feat_cols = [c for c in train_sup.columns if c not in drop_cols and pd.api.types.is_numeric_dtype(train_sup[c])]
print('Using', len(feat_cols), 'features')

X_all = train_sup[feat_cols].astype(float).values
y_all = train_sup['contact'].astype(int).values
groups = train_sup['game_play'].values
same_flag_all = train_sup['same_team'].astype(int).values if 'same_team' in train_sup.columns else np.zeros(len(train_sup), dtype=int)

gkf = GroupKFold(n_splits=5)
oof = np.full(len(train_sup), np.nan, dtype=float)
models = []
start = time.time()

for fold, (tr_idx, va_idx) in enumerate(gkf.split(X_all, y_all, groups=groups)):
    t0 = time.time()
    X_tr, y_tr = X_all[tr_idx], y_all[tr_idx]
    X_va, y_va = X_all[va_idx], y_all[va_idx]
    neg = (y_tr == 0).sum(); pos = (y_tr == 1).sum()
    spw = max(1.0, neg / max(1, pos))
    print(f'Fold {fold}: train {len(tr_idx)} (pos {pos}), valid {len(va_idx)} (pos {(y_va==1).sum()}), spw={spw:.2f}', flush=True)
    dtrain = xgb.DMatrix(X_tr, label=y_tr)
    dvalid = xgb.DMatrix(X_va, label=y_va)
    params = {
        'tree_method': 'hist', 'device': 'cuda', 'max_depth': 7, 'eta': 0.05, 'subsample': 0.9,
        'colsample_bytree': 0.8, 'min_child_weight': 10, 'lambda': 1.5, 'alpha': 0.1, 'gamma': 0.1,
        'objective': 'binary:logistic', 'eval_metric': 'logloss', 'scale_pos_weight': float(spw), 'seed': 42 + fold
    }
    booster = xgb.train(params=params, dtrain=dtrain, num_boost_round=4000, evals=[(dtrain,'train'),(dvalid,'valid')],
                        early_stopping_rounds=200, verbose_eval=False)
    best_it = int(getattr(booster, 'best_iteration', None) or booster.num_boosted_rounds() - 1)
    oof[va_idx] = booster.predict(dvalid, iteration_range=(0, best_it + 1))
    models.append((booster, best_it))
    print(f' Fold {fold} done in {time.time()-t0:.1f}s; best_iteration={best_it}', flush=True)

# Smooth OOF per (gp,p1,p2) with centered rolling-max window=3
oof_df = train_sup[['game_play','p1','p2','step']].copy()
oof_df['oof'] = oof
oof_df = oof_df.sort_values(['game_play','p1','p2','step'])
grp = oof_df.groupby(['game_play','p1','p2'], sort=False)
oof_df['oof_smooth'] = grp['oof'].transform(lambda s: s.rolling(3, center=True, min_periods=1).max())
oof_smooth = oof_df['oof_smooth'].values
idx_ord = oof_df.index.to_numpy()
y_sorted = train_sup['contact'].astype(int).to_numpy()[idx_ord]
same_sorted = (train_sup['same_team'].fillna(0).astype(int).to_numpy()[idx_ord]) if 'same_team' in train_sup.columns else np.zeros(len(oof_df), dtype=int)

# Dual thresholds by same_team
best_mcc, thr_opp, thr_same = fast_dual_threshold_mcc(y_sorted, oof_smooth, same_sorted, grid_points=151)
if not np.isfinite(best_mcc) or best_mcc < 0:
    thrs = np.linspace(0.01, 0.99, 99)
    m_list = [matthews_corrcoef(y_sorted, (oof_smooth >= t).astype(int)) for t in thrs]
    j = int(np.argmax(m_list)); best_mcc = float(m_list[j]); thr_opp = thr_same = float(thrs[j])
print(f'Best OOF MCC (dual thresholds)={best_mcc:.5f} | thr_same={thr_same:.4f}, thr_opp={thr_opp:.4f}')

# Inference on test and smoothing
Xt = test_feats[feat_cols].astype(float).values
dtest = xgb.DMatrix(Xt)
pt = np.zeros(len(test_feats), dtype=float)
for i, (booster, best_it) in enumerate(models):
    t0 = time.time()
    pt += booster.predict(dtest, iteration_range=(0, best_it + 1))
    print(f' Inference model {i} took {time.time()-t0:.1f}s', flush=True)
pt /= max(1, len(models))
pred_tmp = test_feats[['game_play','step','p1','p2']].copy()
pred_tmp['prob'] = pt
pred_tmp = pred_tmp.sort_values(['game_play','p1','p2','step'])
grp_t = pred_tmp.groupby(['game_play','p1','p2'], sort=False)
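pred_tmp['prob_smooth'] = grp_t['prob'].transform(lambda s: s.rolling(3, center=True, min_periods=1).max())
pred_tmp = pred_tmp.sort_index()  # restore test_feats row order so thresholds and contact_ids below stay aligned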

# Apply dual thresholds by same_team on test
same_flag_test = test_feats['same_team'].astype(int).values if 'same_team' in test_feats.columns else np.zeros(len(test_feats), dtype=int)
thr_arr_test = np.where(same_flag_test == 1, thr_same, thr_opp)
pred_bin = (pred_tmp['prob_smooth'].values >= thr_arr_test).astype(int)

# Build submission
cid = (test_feats['game_play'].astype(str) + '_' + test_feats['step'].astype(str) + '_' +
       test_feats['p1'].astype(str) + '_' + test_feats['p2'].astype(str))
pred_df = pd.DataFrame({'contact_id': cid, 'pred_contact': pred_bin})
ss = pd.read_csv('sample_submission.csv')
sub = ss.copy()
sub['contact'] = sub['contact_id'].map(pred_df.set_index('contact_id')['pred_contact']).fillna(0).astype(int)
sub[['contact_id','contact']].to_csv('submission.csv', index=False)
print('Saved submission.csv')
print('Done. Total time:', f'{time.time()-start:.1f}s', flush=True)

xgboost version: 2.1.4
Loading r=3.5 supervised dyn train and dyn test features...
train_sup: (524248, 59) test_feats: (237673, 58)
Using 50 features


Fold 0: train 419413 (pos 38437), valid 104835 (pos 10344), spw=9.91


 Fold 0 done in 29.7s; best_iteration=3094


Fold 1: train 419331 (pos 39249), valid 104917 (pos 9532), spw=9.68


 Fold 1 done in 30.7s; best_iteration=3182


Fold 2: train 419417 (pos 39309), valid 104831 (pos 9472), spw=9.67


 Fold 2 done in 32.0s; best_iteration=3270


Fold 3: train 419430 (pos 39012), valid 104818 (pos 9769), spw=9.75


 Fold 3 done in 30.1s; best_iteration=2995


Fold 4: train 419401 (pos 39117), valid 104847 (pos 9664), spw=9.72


 Fold 4 done in 28.3s; best_iteration=2873


  den = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))


Best OOF MCC (dual thresholds)=0.71744 | thr_same=0.7800, thr_opp=0.7800


 Inference model 0 took 0.2s


 Inference model 1 took 0.1s


 Inference model 2 took 0.1s


 Inference model 3 took 0.1s


 Inference model 4 took 0.1s


Saved submission.csv
Done. Total time: 163.5s
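
The stray `den = np.sqrt(...)` line in the output above looks like the echo of a NumPy RuntimeWarning whose message was lost. One plausible cause, stated here as an assumption rather than a confirmed diagnosis, is that the product of four integer counts exceeds the int64 range at this data size. The sketch below uses hypothetical confusion counts at roughly the scale of this run and compares the wrapped int64 product with an exact Python-integer product; `mcc_float` is an illustrative helper, not part of the notebook.

```python
import math
import numpy as np

# Hypothetical confusion counts summing to 524,000 rows (assumed, not taken from the run).
tp, fp, fn, tn = 40_000, 10_000, 9_000, 465_000

# Exact denominator product using Python's arbitrary-precision integers.
exact_prod = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)

# The same product on int64 arrays wraps around once it exceeds 2**63 - 1.
a = np.array([tp, fp, fn, tn], dtype=np.int64)
tp_a, fp_a, fn_a, tn_a = (a[i:i + 1] for i in range(4))
wrapped_prod = int(((tp_a + fp_a) * (tp_a + fn_a) * (tn_a + fp_a) * (tn_a + fn_a))[0])

def mcc_float(tp, tn, fp, fn):
    """MCC from counts, computed in float64 to avoid integer overflow."""
    tp, tn, fp, fn = (np.asarray(v, dtype=np.float64) for v in (tp, tn, fp, fn))
    den = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    den = np.where(den == 0, 1.0, den)
    return (tp * tn - fp * fn) / den

exact_mcc = (tp * tn - fp * fn) / math.sqrt(exact_prod)
print("exceeds int64:", exact_prod > np.iinfo(np.int64).max)
print("int64 product matches exact:", wrapped_prod == exact_prod)
print("float64 MCC:", float(mcc_float(tp, tn, fp, fn)), "exact MCC:", exact_mcc)
```

If the dual-threshold sweep hits such an overflow, every candidate MCC can become NaN, the `m > best[0]` comparison never succeeds, and the single-threshold fallback takes over. That would be consistent with `thr_same` and `thr_opp` both printing as 0.7800.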


In [43]:
# Player-Ground (G) head: build per-player features, train XGB, smooth+threshold, and merge G preds into submission
import time, math, sys, subprocess, numpy as np, pandas as pd
from sklearn.model_selection import GroupKFold
from sklearn.metrics import matthews_corrcoef

try:
    import xgboost as xgb
except Exception as e:
    print('Installing xgboost for G head...', e)
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'xgboost==2.1.1'], check=True)
    import xgboost as xgb
print('xgboost version (G head):', getattr(xgb, '__version__', 'unknown'))

t0 = time.time()
print('Building per-player features for G...')

# 1) Base per-player tracking features (past-only dynamics)
trk_cols = ['game_play','step','nfl_player_id','team','position','x_position','y_position','speed','acceleration','direction','orientation']
tr_trk = pd.read_csv('train_player_tracking.csv', usecols=trk_cols).copy()
te_trk = pd.read_csv('test_player_tracking.csv', usecols=trk_cols).copy()
for df in (tr_trk, te_trk):
    df['nfl_player_id'] = df['nfl_player_id'].astype(int).astype(str)

def circ_diff_deg(a, b):
    d = (a - b + 180.0) % 360.0 - 180.0
    return np.abs(d)

def build_player_dyn(df: pd.DataFrame) -> pd.DataFrame:
    df = df.sort_values(['game_play','nfl_player_id','step']).copy()
    grp = df.groupby(['game_play','nfl_player_id'], sort=False)
    # basic deltas
    df['d_speed_1'] = grp['speed'].diff(1)
    df['d_speed_3'] = df['speed'] - grp['speed'].shift(3)
    df['d_accel_1'] = grp['acceleration'].diff(1)
    df['jerk'] = grp['acceleration'].diff(1)  # NOTE: identical to d_accel_1 (first difference of acceleration)
    # rolling stats
    for col in ['speed','acceleration']:
        s = grp[col]
        df[f'{col}_min_p3'] = s.rolling(3, min_periods=1).min().reset_index(level=[0,1], drop=True)
        df[f'{col}_mean_p3'] = s.rolling(3, min_periods=1).mean().reset_index(level=[0,1], drop=True)
        df[f'{col}_std_p3'] = s.rolling(3, min_periods=1).std().reset_index(level=[0,1], drop=True)
        df[f'{col}_min_p5'] = s.rolling(5, min_periods=1).min().reset_index(level=[0,1], drop=True)
        df[f'{col}_mean_p5'] = s.rolling(5, min_periods=1).mean().reset_index(level=[0,1], drop=True)
        df[f'{col}_std_p5'] = s.rolling(5, min_periods=1).std().reset_index(level=[0,1], drop=True)
    # direction vs orientation
    df['dir_orient_diff'] = circ_diff_deg(df['direction'].fillna(0.0), df['orientation'].fillna(0.0))
    # boundary context
    df['dist_to_sideline'] = np.minimum(df['y_position'], 53.3 - df['y_position'])
    df['near_sideline'] = ((df['y_position'] <= 2.0) | (df['y_position'] >= 51.3)).astype(int)
    df['near_goal'] = ((df['x_position'] <= 3.0) | (df['x_position'] >= 117.0)).astype(int)
    # fill deltas
    for c in ['d_speed_1','d_speed_3','d_accel_1','jerk','speed_std_p3','speed_std_p5','acceleration_std_p3','acceleration_std_p5']:
        if c in df.columns:
            df[c] = df[c].fillna(0.0)
    return df

tr_p = build_player_dyn(tr_trk)
te_p = build_player_dyn(te_trk)

# 2) Opponent context from r=3.5 pairs
tr_pairs = pd.read_parquet('train_pairs_r35.parquet')
te_pairs = pd.read_parquet('test_pairs_r35.parquet')

def pairs_to_player_ctx(pairs: pd.DataFrame) -> pd.DataFrame:
    # Build per-player rows from both sides
    a = pairs[['game_play','step','p1','distance']].rename(columns={'p1':'nfl_player_id'})
    b = pairs[['game_play','step','p2','distance']].rename(columns={'p2':'nfl_player_id'})
    u = pd.concat([a, b], ignore_index=True)
    g = u.groupby(['game_play','step','nfl_player_id'], sort=False)
    out = g['distance'].agg(min_opp_dist='min').reset_index()
    # counts within thresholds: recompute by applying thresholds before groupby for speed
    for thr, name in [(1.5,'lt15'), (2.0,'lt20'), (2.5,'lt25')]:
        u[name] = (u['distance'] < thr).astype(int)
        cnt = u.groupby(['game_play','step','nfl_player_id'], sort=False)[name].sum().rename(f'cnt_opp_{name}')
        out = out.merge(cnt.reset_index(), on=['game_play','step','nfl_player_id'], how='left')
    return out

tr_ctx = pairs_to_player_ctx(tr_pairs)
te_ctx = pairs_to_player_ctx(te_pairs)

# 3) Helmet per-player aggregates and deltas
train_helm = pd.read_csv('train_baseline_helmets.csv')
test_helm = pd.read_csv('test_baseline_helmets.csv')
train_vmeta = pd.read_csv('train_video_metadata.csv')
test_vmeta = pd.read_csv('test_video_metadata.csv')
FPS = 59.94
def prep_meta(vmeta: pd.DataFrame):
    vm = vmeta.copy()
    for c in ['start_time','snap_time']:
        if not np.issubdtype(vm[c].dtype, np.number):
            ts = pd.to_datetime(vm[c], errors='coerce')
            vm[c] = (ts - ts.dt.floor('D')).dt.total_seconds().astype(float)
    vm['snap_frame'] = ((vm['snap_time'] - vm['start_time']) * FPS).round().astype('Int64')
    return vm[['game_play','view','snap_frame']].drop_duplicates()
meta_tr = prep_meta(train_vmeta)
meta_te = prep_meta(test_vmeta)

def helm_player_agg(helm: pd.DataFrame, meta: pd.DataFrame) -> pd.DataFrame:
    df = helm[['game_play','view','frame','nfl_player_id','left','top','width','height']].copy()
    df = df.dropna(subset=['nfl_player_id'])
    df['nfl_player_id'] = df['nfl_player_id'].astype(int).astype(str)
    df['area'] = df['width'] * df['height']
    df['cx'] = df['left'] + 0.5 * df['width']
    df['cy'] = df['top'] + 0.5 * df['height']
    df = df.sort_values(['game_play','view','frame','nfl_player_id','area'], ascending=[True,True,True,True,False])
    df = df.drop_duplicates(['game_play','view','frame','nfl_player_id'], keep='first')
    df = df.merge(meta, on=['game_play','view'], how='left')
    df['step'] = ((df['frame'] - df['snap_frame']).astype('float') / 6.0).round().astype('Int64')
    df = df.dropna(subset=['step'])
    df['step'] = df['step'].astype(int)
    # expand ±1 to align tolerance
    dm1 = df.copy(); dm1['target_step'] = dm1['step'] - 1
    d0 = df.copy(); d0['target_step'] = d0['step']
    dp1 = df.copy(); dp1['target_step'] = dp1['step'] + 1
    d = pd.concat([dm1, d0, dp1], ignore_index=True)
    agg = d.groupby(['game_play','target_step','nfl_player_id'], sort=False).agg(
        cy_mean=('cy','mean'), h_mean=('height','mean'), cnt=('cx','size')
    ).reset_index().rename(columns={'target_step':'step'})
    # deltas per player
    agg = agg.sort_values(['game_play','nfl_player_id','step'])
    g = agg.groupby(['game_play','nfl_player_id'], sort=False)
    agg['d_cy_1'] = g['cy_mean'].diff(1).fillna(0.0)
    agg['d_h_1'] = g['h_mean'].diff(1).fillna(0.0)
    return agg

h_tr_p = helm_player_agg(train_helm, meta_tr)
h_te_p = helm_player_agg(test_helm, meta_te)

# 4) Merge contexts into per-player frames
def merge_all(base: pd.DataFrame, ctx: pd.DataFrame, helm: pd.DataFrame) -> pd.DataFrame:
    df = base.merge(ctx, on=['game_play','step','nfl_player_id'], how='left')
    df = df.merge(helm, on=['game_play','step','nfl_player_id'], how='left')
    # fill context NaNs
    fill0 = ['min_opp_dist','cnt_opp_lt15','cnt_opp_lt20','cnt_opp_lt25','cy_mean','h_mean','d_cy_1','d_h_1']
    for c in fill0:
        if c in df.columns:
            df[c] = df[c].fillna(0.0)
    return df

tr_feat_p = merge_all(tr_p, tr_ctx, h_tr_p)
te_feat_p = merge_all(te_p, te_ctx, h_te_p)
print('Per-player train/test feature shapes:', tr_feat_p.shape, te_feat_p.shape)

# 5) Supervision for G: inner-join to labels where one pid is G, with ±1 expansion within supervised only
labels = pd.read_csv('train_labels.csv', usecols=['contact_id','game_play','step','nfl_player_id_1','nfl_player_id_2','contact'])
labels['pid1'] = labels['nfl_player_id_1'].astype(str)
labels['pid2'] = labels['nfl_player_id_2'].astype(str)
mask_g = (labels['pid1'] == 'G') | (labels['pid2'] == 'G')
g_labels = labels.loc[mask_g, ['game_play','step','pid1','pid2','contact']].copy()
g_labels['player'] = np.where(g_labels['pid1'] == 'G', g_labels['pid2'], g_labels['pid1'])
g_labels = g_labels[['game_play','step','player','contact']]
sup_g = g_labels.merge(tr_feat_p.rename(columns={'nfl_player_id':'player'}), on=['game_play','step','player'], how='inner')
print('G supervised inner shape:', sup_g.shape, 'pos rate:', sup_g['contact'].mean())
pos = sup_g.loc[sup_g['contact'] == 1, ['game_play','step','player']]
pos_m1 = pos.copy(); pos_m1['step'] = pos_m1['step'] - 1
pos_p1 = pos.copy(); pos_p1['step'] = pos_p1['step'] + 1
pos_exp = pd.concat([pos_m1, pos_p1], ignore_index=True).drop_duplicates()
pos_exp['flag_pos_exp'] = 1
sup_g = sup_g.merge(pos_exp, on=['game_play','step','player'], how='left')
sup_g.loc[sup_g['flag_pos_exp'] == 1, 'contact'] = 1
sup_g.drop(columns=['flag_pos_exp'], inplace=True)
print('G after ±1 expansion pos rate:', sup_g['contact'].mean())

# 6) Train small XGB with GroupKFold by game_play
drop_cols = {'contact','game_play','step','player','team','position','nfl_player_id'}
feat_cols = [c for c in sup_g.columns if c not in drop_cols and pd.api.types.is_numeric_dtype(sup_g[c])]
print('G feature count:', len(feat_cols))

X_all = sup_g[feat_cols].astype(float).values
y_all = sup_g['contact'].astype(int).values
groups = sup_g['game_play'].values
gkf = GroupKFold(n_splits=5)
oof = np.full(len(sup_g), np.nan, float)
models = []

for fold, (tr_idx, va_idx) in enumerate(gkf.split(X_all, y_all, groups=groups)):
    t1 = time.time()
    X_tr, y_tr = X_all[tr_idx], y_all[tr_idx]
    X_va, y_va = X_all[va_idx], y_all[va_idx]
    neg = (y_tr == 0).sum(); posc = (y_tr == 1).sum()
    spw = max(1.0, neg / max(1, posc))
    print(f'G Fold {fold}: train {len(tr_idx)} (pos {posc}), valid {len(va_idx)} (pos {(y_va==1).sum()}), spw={spw:.2f}', flush=True)
    dtrain = xgb.DMatrix(X_tr, label=y_tr)
    dvalid = xgb.DMatrix(X_va, label=y_va)
    params = {
        'tree_method': 'hist', 'device': 'cuda', 'max_depth': 6, 'eta': 0.05,
        'subsample': 0.9, 'colsample_bytree': 0.8, 'min_child_weight': 10,
        'lambda': 1.5, 'alpha': 0.0, 'objective': 'binary:logistic', 'eval_metric': 'logloss',
        'scale_pos_weight': float(spw), 'seed': 42 + fold
    }
    booster = xgb.train(params, dtrain, num_boost_round=2000, evals=[(dtrain,'train'),(dvalid,'valid')],
                        early_stopping_rounds=100, verbose_eval=False)
    best_it = int(getattr(booster, 'best_iteration', None) or booster.num_boosted_rounds() - 1)
    oof[va_idx] = booster.predict(dvalid, iteration_range=(0, best_it + 1))
    models.append((booster, best_it))
    print(f' G Fold {fold} done in {time.time()-t1:.1f}s; best_it={best_it}', flush=True)

# 7) Smooth OOF with centered rolling-max window=5 per (gp,player)
oof_df = sup_g[['game_play','player','step']].copy()
oof_df['oof'] = oof
oof_df = oof_df.sort_values(['game_play','player','step'])
grp_o = oof_df.groupby(['game_play','player'], sort=False)
oof_df['oof_smooth'] = grp_o['oof'].transform(lambda s: s.rolling(5, center=True, min_periods=1).max())
oof_smooth = oof_df['oof_smooth'].values
y_sorted = sup_g.loc[oof_df.index, 'contact'].astype(int).values

# Min-duration filter: flag a step only if >= 2 of the 3 steps in its centered window are positive
def apply_min_dur(bin_arr, gp, pl):
    df = pd.DataFrame({'gp': gp, 'pl': pl, 'b': bin_arr})
    # transform preserves the input row order, so the result stays aligned with bin_arr
    out = df.groupby(['gp','pl'], sort=False)['b'].transform(lambda s: (s.rolling(3, center=True, min_periods=1).sum() >= 2).astype(int))
    return out.to_numpy()

# threshold sweep for G
best_thr, best_mcc = 0.58, -1.0
thr_grid = np.linspace(0.4, 0.8, 41)
gp_arr = oof_df['game_play'].values
pl_arr = oof_df['player'].values
for thr in thr_grid:
    pred0 = (oof_smooth >= thr).astype(int)
    pred = apply_min_dur(pred0, gp_arr, pl_arr)
    m = matthews_corrcoef(y_sorted, pred)
    if m > best_mcc:
        best_mcc, best_thr = float(m), float(thr)
print(f'G OOF MCC={best_mcc:.5f} at thr={best_thr:.2f} (after smooth+min_duration)')

# 8) Inference on test
Xt = te_feat_p[feat_cols].astype(float).values
dtest = xgb.DMatrix(Xt)
pt = np.zeros(len(te_feat_p), dtype=float)
for i, (booster, best_it) in enumerate(models):
    t1 = time.time()
    pt += booster.predict(dtest, iteration_range=(0, best_it + 1))
    print(f' G Inference model {i} took {time.time()-t1:.1f}s')
pt /= max(1, len(models))
pred_tmp = te_feat_p[['game_play','step','nfl_player_id']].rename(columns={'nfl_player_id':'player'}).copy()
pred_tmp['prob'] = pt
pred_tmp = pred_tmp.sort_values(['game_play','player','step'])
grp_t = pred_tmp.groupby(['game_play','player'], sort=False)
pred_tmp['prob_smooth'] = grp_t['prob'].transform(lambda s: s.rolling(5, center=True, min_periods=1).max())
bin0 = (pred_tmp['prob_smooth'].values >= best_thr).astype(int)
bin1 = apply_min_dur(bin0, pred_tmp['game_play'].values, pred_tmp['player'].values)
pred_tmp['pred_bin'] = bin1.astype(int)

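# 9) Build G contact_id and overwrite in submission
# NOTE: sample_submission.csv uses the '<player>_G' orientation, so the G-first ids built here match no rows
# (ones before == after in the output); a later cell redoes the overwrite with G-second ids.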
g_cid = (pred_tmp['game_play'].astype(str) + '_' + pred_tmp['step'].astype(str) + '_' + 'G' + '_' + pred_tmp['player'].astype(str))
g_pred_df = pd.DataFrame({'contact_id': g_cid, 'contact': pred_tmp['pred_bin'].astype(int)})

sub = pd.read_csv('submission.csv')
before_ones = int(sub['contact'].sum())
sub = sub.merge(g_pred_df.rename(columns={'contact': 'contact_g'}), on='contact_id', how='left')
sub['contact'] = sub['contact_g'].fillna(sub['contact']).astype(int)
sub = sub[['contact_id','contact']]
after_ones = int(sub['contact'].sum())
sub.to_csv('submission.csv', index=False)
print(f'Overwrote G rows in submission. ones before={before_ones}, after={after_ones}')
print('G head done in {:.1f}s'.format(time.time()-t0))

xgboost version (G head): 2.1.4
Building per-player features for G...


Per-player train/test feature shapes: (1225299, 40) (127754, 40)


G supervised inner shape: (370351, 41) pos rate: 0.041106949893479426
G after ±1 expansion pos rate: 0.04376658899260431
G feature count: 35


G Fold 0: train 296428 (pos 12821), valid 73923 (pos 3388), spw=22.12


 G Fold 0 done in 9.2s; best_it=1470


G Fold 1: train 296519 (pos 13057), valid 73832 (pos 3152), spw=21.71


 G Fold 1 done in 8.5s; best_it=1371


G Fold 2: train 296365 (pos 13719), valid 73986 (pos 2490), spw=20.60


 G Fold 2 done in 8.7s; best_it=1434


G Fold 3: train 296365 (pos 12501), valid 73986 (pos 3708), spw=22.71


 G Fold 3 done in 7.5s; best_it=1193


G Fold 4: train 295727 (pos 12738), valid 74624 (pos 3471), spw=22.22


 G Fold 4 done in 10.1s; best_it=1629


G OOF MCC=0.53119 at thr=0.73 (after smooth+min_duration)
 G Inference model 0 took 0.0s
 G Inference model 1 took 0.0s
 G Inference model 2 took 0.0s
 G Inference model 3 took 0.0s
 G Inference model 4 took 0.0s


Overwrote G rows in submission. ones before=8903, after=8903
G head done in 137.7s
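
The post-processing in steps 7 and 8 (centered rolling max, then a minimum-duration filter) can be seen on a toy series. This is a sketch under assumptions: the probabilities and the 0.5 cut are invented for illustration, a window of 3 is used instead of the G head's 5 to keep the example short, and the filter is applied to the raw binary signal so that its effect is visible on its own.

```python
import pandas as pd

# Hypothetical per-player probabilities over consecutive steps.
toy = pd.DataFrame({
    "player": ["a"] * 7 + ["b"] * 3,
    "step": list(range(7)) + list(range(3)),
    "prob": [0.1, 0.9, 0.1, 0.1, 0.8, 0.85, 0.1, 0.2, 0.2, 0.95],
})
toy = toy.sort_values(["player", "step"])

# Centered rolling max: each step takes the highest probability among itself and its neighbours.
toy["prob_smooth"] = toy.groupby("player", sort=False)["prob"].transform(
    lambda s: s.rolling(3, center=True, min_periods=1).max()
)

# Min-duration filter: flag a step when at least two of the three steps in its
# centered window are positive. Isolated single-step positives are dropped.
toy["raw_bin"] = (toy["prob"] >= 0.5).astype(int)
toy["kept"] = toy.groupby("player", sort=False)["raw_bin"].transform(
    lambda s: (s.rolling(3, center=True, min_periods=1).sum() >= 2).astype(int)
)
print(toy)
```

Note that the window-sum rule is not purely a deletion filter: a pattern such as 1, 0, 1 would also set the middle step to 1, so it fills one-step gaps as well as removing isolated spikes.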


In [22]:
# Diagnose G contact_id alignment and coverage
import pandas as pd, numpy as np
ss = pd.read_csv('sample_submission.csv')
tokens = ss['contact_id'].str.split('_', n=4, expand=True)
tokens.columns = ['g1','g2','step','a','b']
g_first_mask = tokens['a'] == 'G'
g_second_mask = tokens['b'] == 'G'
print('Sample G-first rows:', int(g_first_mask.sum()))
print('Sample G-second rows:', int(g_second_mask.sum()))

# Build our predicted G ids in both orientations from te_feat_p (available from Cell 11)
assert 'te_feat_p' in globals(), 'te_feat_p missing; re-run Cell 11 first.'
pred_base = te_feat_p[['game_play','step','nfl_player_id']].copy()
pred_base['nfl_player_id'] = pred_base['nfl_player_id'].astype(str)
cid_g_first = pred_base['game_play'].astype(str) + '_' + pred_base['step'].astype(str) + '_G_' + pred_base['nfl_player_id']
cid_g_second = pred_base['game_play'].astype(str) + '_' + pred_base['step'].astype(str) + '_' + pred_base['nfl_player_id'] + '_G'
our_g_first = set(cid_g_first.values)
our_g_second = set(cid_g_second.values)

ss_ids = set(ss['contact_id'].values)
inter_first = len(our_g_first & ss_ids)
inter_second = len(our_g_second & ss_ids)
print('Intersect counts -> G-first:', inter_first, '| G-second:', inter_second)
print('Total our per-player rows:', len(pred_base), 'Total ss rows:', len(ss))

# Show a few examples of existing G ids in sample
print('Sample G-first example:', ss.loc[g_first_mask, 'contact_id'].head(3).tolist())
print('Sample G-second example:', ss.loc[g_second_mask, 'contact_id'].head(3).tolist())

Sample G-first rows: 0
Sample G-second rows: 40282
Intersect counts -> G-first: 0 | G-second: 40282
Total our per-player rows: 127754 Total ss rows: 463243
Sample G-first example: []
Sample G-second example: ['58187_001341_0_47795_G', '58187_001341_0_41300_G', '58187_001341_0_52650_G']
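
The counts above show that every ground row in `sample_submission.csv` places `G` in the last slot. A small helper can enforce that orientation wherever ids are built. This is a sketch under assumptions: the name `make_contact_id` is hypothetical, and ordering the two player ids numerically mirrors the `a <= b` rule used by the pair builder later in this notebook rather than anything confirmed by the sample file.

```python
def make_contact_id(game_play, step, a, b):
    """Return '<game_play>_<step>_<p1>_<p2>' with 'G' always in the last slot."""
    a, b = str(a), str(b)
    if a == 'G' or b == 'G':
        # Ground contact: the player id goes first, 'G' second.
        player = b if a == 'G' else a
        return f'{game_play}_{step}_{player}_G'
    # Player-player contact: assume ascending numeric order of the ids.
    p1, p2 = sorted((a, b), key=int)
    return f'{game_play}_{step}_{p1}_{p2}'

gid = make_contact_id('58187_001341', 0, 'G', 47795)
pid = make_contact_id('58187_001341', 0, 52650, 41300)
```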


In [44]:
# Fix G contact_id orientation to '<player>_G' and overwrite submission accordingly
import pandas as pd
assert 'pred_tmp' in globals(), 'pred_tmp missing; re-run Cell 11 first to compute G per-player predictions.'

# Build G-second contact_ids: {game_play}_{step}_{player}_G
g_cid_second = (pred_tmp['game_play'].astype(str) + '_' + pred_tmp['step'].astype(str) + '_' + pred_tmp['player'].astype(str) + '_G')
g_pred_second = pd.DataFrame({'contact_id': g_cid_second, 'contact_g': pred_tmp['pred_bin'].astype(int).values})

sub = pd.read_csv('submission.csv')
before_ones = int(sub['contact'].sum())
sub = sub.merge(g_pred_second, on='contact_id', how='left')
sub['contact'] = sub['contact_g'].fillna(sub['contact']).astype(int)
sub = sub[['contact_id','contact']]
after_ones = int(sub['contact'].sum())
sub.to_csv('submission.csv', index=False)
print(f'Applied G-second overwrite. ones before={before_ones}, after={after_ones}, delta={after_ones-before_ones}')

Applied G-second overwrite. ones before=8903, after=9121, delta=218
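
The overwrite above reports only the net change in positives, which does not show how many rows were actually matched. The sketch below is one way to make the same merge stricter. It assumes `contact_id` is unique in both frames (enforced with the pandas `validate` argument), and the function name `overwrite_contacts` is hypothetical.

```python
import pandas as pd

def overwrite_contacts(sub: pd.DataFrame, override: pd.DataFrame):
    # validate='one_to_one' raises MergeError if contact_id is duplicated on either side,
    # which would otherwise silently duplicate submission rows.
    merged = sub.merge(override, on='contact_id', how='left', validate='one_to_one')
    matched = int(merged['contact_g'].notna().sum())
    # Keep the existing value wherever no override row matched.
    merged['contact'] = merged['contact_g'].fillna(merged['contact']).astype(int)
    return merged[['contact_id', 'contact']], matched

# Toy frames (ids are illustrative only).
sub_demo = pd.DataFrame({'contact_id': ['a_G', 'b_G', 'c_1_2'], 'contact': [0, 1, 1]})
ovr_demo = pd.DataFrame({'contact_id': ['a_G', 'b_G', 'zzz'], 'contact_g': [1, 0, 1]})
out_demo, n_matched = overwrite_contacts(sub_demo, ovr_demo)
```

Logging `n_matched` next to the before/after counts would have exposed the earlier G-first orientation problem directly, since zero rows would have matched.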


In [24]:
# Multi-seed bagging: PP (r=3.5 dyn) + G head; rebuild submission with averaged probs and calibrated thresholds
import time, numpy as np, pandas as pd, sys, subprocess
from sklearn.model_selection import GroupKFold
from sklearn.metrics import matthews_corrcoef

try:
    import xgboost as xgb
except Exception as e:
    print('Installing xgboost...', e)
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'xgboost==2.1.1'], check=True)
    import xgboost as xgb
print('xgboost version (bagging):', getattr(xgb, '__version__', 'unknown'))

def mcc_from_counts(tp, tn, fp, fn):
    num = tp * tn - fp * fn
    den = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    den = np.where(den == 0, 1.0, den)
    return num / den

def fast_dual_threshold_mcc(y_true, prob, same_flag, grid_points=151):
    res = {}
    for cohort in (0, 1):
        mask = (same_flag == cohort)
        y_c = y_true[mask].astype(int); p_c = prob[mask].astype(float)
        n = len(y_c)
        if n == 0:
            res[cohort] = {'k_grid': np.array([0], int), 'tp': np.array([0.0]), 'fp': np.array([0.0]), 'tn': np.array([0.0]), 'fn': np.array([0.0]), 'thr_vals': np.array([1.0])}
            continue
        order = np.argsort(-p_c)
        y_sorted = y_c[order]; p_sorted = p_c[order]
        cum_pos = np.concatenate([[0], np.cumsum(y_sorted)])
        k_grid = np.unique(np.linspace(0, n, num=min(grid_points, n + 1), dtype=int))
        tp = cum_pos[k_grid]; fp = k_grid - tp
        P = y_sorted.sum(); N = n - P
        fn = P - tp; tn = N - fp
        thr_vals = np.where(k_grid == 0, 1.0 + 1e-6, p_sorted[np.maximum(0, k_grid - 1)])
        res[cohort] = {'k_grid': k_grid, 'tp': tp, 'fp': fp, 'tn': tn, 'fn': fn, 'thr_vals': thr_vals}
    tp0, fp0, tn0, fn0, thr0 = res[0]['tp'], res[0]['fp'], res[0]['tn'], res[0]['fn'], res[0]['thr_vals']
    tp1, fp1, tn1, fn1, thr1 = res[1]['tp'], res[1]['fp'], res[1]['tn'], res[1]['fn'], res[1]['thr_vals']
    best = (-1.0, 0.5, 0.5)
    for i in range(len(thr0)):
        tp_sum = tp0[i] + tp1; fp_sum = fp0[i] + fp1; tn_sum = tn0[i] + tn1; fn_sum = fn0[i] + fn1
        m_arr = mcc_from_counts(tp_sum, tn_sum, fp_sum, fn_sum)
        j = int(np.argmax(m_arr)); m = float(m_arr[j])
        if m > best[0]:
            best = (m, float(thr0[i]), float(thr1[j]))
    return best

t_all = time.time()
# ====================
# 1) PP multi-seed bagging (r=3.5 dyn features)
# ====================
print('PP bagging: loading artifacts...')
train_sup = pd.read_parquet('train_supervised_w5_helm_dyn_r35.parquet')
test_feats = pd.read_parquet('test_pairs_w5_helm_dyn_r35.parquet')
folds_df = pd.read_csv('folds_game_play.csv')
train_sup = train_sup.merge(folds_df, on='game_play', how='left')
for df in (train_sup, test_feats):
    if 'px_dist_norm_min' in df.columns: df['px_dist_norm_min'] = df['px_dist_norm_min'].fillna(1.0)
    if 'views_both_present' in df.columns: df['views_both_present'] = df['views_both_present'].fillna(0).astype(float)
drop_cols = {'contact','game_play','step','p1','p2','team1','team2','pos1','pos2','fold'}
feat_cols = [c for c in train_sup.columns if c not in drop_cols and pd.api.types.is_numeric_dtype(train_sup[c])]
X_all = train_sup[feat_cols].astype(float).values
y_all = train_sup['contact'].astype(int).values
groups = train_sup['game_play'].values
same_all = (train_sup['same_team'].fillna(0).astype(int).values) if 'same_team' in train_sup.columns else np.zeros(len(train_sup), int)
gkf = GroupKFold(n_splits=5)
seeds = [42, 1337, 2025]
oof_list = []
test_preds_list = []
for s in seeds:
    print(f' PP seed {s} ...')
    oof = np.full(len(train_sup), np.nan, float)
    models = []
    for fold, (tr_idx, va_idx) in enumerate(gkf.split(X_all, y_all, groups=groups)):
        t0 = time.time()
        X_tr, y_tr = X_all[tr_idx], y_all[tr_idx]
        X_va, y_va = X_all[va_idx], y_all[va_idx]
        neg = (y_tr == 0).sum(); posc = (y_tr == 1).sum()
        spw = max(1.0, neg / max(1, posc))
        dtrain = xgb.DMatrix(X_tr, label=y_tr); dvalid = xgb.DMatrix(X_va, label=y_va)
        params = {'tree_method': 'hist','device': 'cuda','max_depth': 7,'eta': 0.05,'subsample': 0.9,'colsample_bytree': 0.8,
                  'min_child_weight': 10,'lambda': 1.5,'alpha': 0.1,'gamma': 0.1,'objective': 'binary:logistic','eval_metric': 'logloss',
                  'scale_pos_weight': float(spw),'seed': int(s + fold)}
        booster = xgb.train(params, dtrain, num_boost_round=3500, evals=[(dtrain,'train'),(dvalid,'valid')], early_stopping_rounds=200, verbose_eval=False)
        best_it = int(getattr(booster, 'best_iteration', None) or booster.num_boosted_rounds() - 1)
        oof[va_idx] = booster.predict(dvalid, iteration_range=(0, best_it + 1))
        models.append((booster, best_it))
        print(f'  PP seed {s} fold {fold} done in {time.time()-t0:.1f}s; best_it={best_it}', flush=True)
    # Smooth OOF per (gp,p1,p2)
    oof_df = train_sup[['game_play','p1','p2','step']].copy()
    oof_df['oof'] = oof
    oof_df = oof_df.sort_values(['game_play','p1','p2','step'])
    grp_s = oof_df.groupby(['game_play','p1','p2'], sort=False)
    oof_df['oof_smooth'] = grp_s['oof'].transform(lambda s_: s_.rolling(3, center=True, min_periods=1).max())
    oof_list.append((oof_df.index.values, oof_df['oof_smooth'].values))
    # Test preds for this seed
    Xt = test_feats[feat_cols].astype(float).values
    dtest = xgb.DMatrix(Xt)
    pt = np.zeros(len(test_feats), float)
    for i, (booster, best_it) in enumerate(models):
        t1 = time.time(); pt += booster.predict(dtest, iteration_range=(0, best_it + 1));
        print(f'   PP seed {s} inference model {i} {time.time()-t1:.1f}s')
    pt /= max(1, len(models))
    pred_tmp = test_feats[['game_play','step','p1','p2']].copy(); pred_tmp['prob'] = pt
    pred_tmp = pred_tmp.sort_values(['game_play','p1','p2','step'])
    grp_t = pred_tmp.groupby(['game_play','p1','p2'], sort=False)
    pred_tmp['prob_smooth'] = grp_t['prob'].transform(lambda s_: s_.rolling(3, center=True, min_periods=1).max())
    test_preds_list.append(pred_tmp['prob_smooth'].values)

# Align and average OOF across seeds (indices should be identical order by construction)
idx_ord = oof_list[0][0]
oof_avg = np.mean([x[1] for x in oof_list], axis=0)
y_sorted = train_sup['contact'].astype(int).to_numpy()[idx_ord]
same_sorted = same_all[idx_ord]
best_mcc, thr_opp, thr_same = fast_dual_threshold_mcc(y_sorted, oof_avg, same_sorted, grid_points=151)
print(f'PP bagged OOF MCC={best_mcc:.5f} | thr_same={thr_same:.4f}, thr_opp={thr_opp:.4f}')

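# Average test probs across seeds and threshold.
# NOTE: pt_bag follows the sorted row order of pred_tmp, while same_flag_test and cid_pp below
# use test_feats' original row order; the two agree only if test_feats is already sorted by
# (game_play, p1, p2, step).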
pt_bag = np.mean(np.vstack(test_preds_list), axis=0)
pred_tmp_final = test_feats[['game_play','step','p1','p2']].copy()
pred_tmp_final = pred_tmp_final.sort_values(['game_play','p1','p2','step'])
same_flag_test = test_feats['same_team'].astype(int).values if 'same_team' in test_feats.columns else np.zeros(len(test_feats), int)
thr_arr_test = np.where(same_flag_test == 1, thr_same, thr_opp)
pred_bin_pp = (pt_bag >= thr_arr_test).astype(int)
cid_pp = (test_feats['game_play'].astype(str) + '_' + test_feats['step'].astype(str) + '_' + test_feats['p1'].astype(str) + '_' + test_feats['p2'].astype(str))
pred_df_pp = pd.DataFrame({'contact_id': cid_pp, 'contact_pp': pred_bin_pp})

# ====================
# 2) G head multi-seed bagging (requires sup_g and te_feat_p from Cell 11; raises if they are missing)
# ====================
print('G bagging...')
if 'sup_g' not in globals() or 'te_feat_p' not in globals() or 'feat_cols' not in globals():
    raise RuntimeError('G features not in memory. Re-run Cell 11 first.')
labels_g = sup_g[['game_play','player','step','contact']].copy()
feat_cols_g = [c for c in sup_g.columns if c not in {'contact','game_play','step','player','team','position','nfl_player_id'} and pd.api.types.is_numeric_dtype(sup_g[c])]
Xg = sup_g[feat_cols_g].astype(float).values; yg = sup_g['contact'].astype(int).values; groups_g = sup_g['game_play'].values
gkf_g = GroupKFold(n_splits=5)
oof_list_g = []; test_list_g = []
for s in seeds:
    print(f' G seed {s} ...')
    oofg = np.full(len(sup_g), np.nan, float)
    models_g = []
    for fold, (tr_idx, va_idx) in enumerate(gkf_g.split(Xg, yg, groups=groups_g)):
        t0 = time.time()
        X_tr, y_tr = Xg[tr_idx], yg[tr_idx]
        X_va, y_va = Xg[va_idx], yg[va_idx]
        neg = (y_tr == 0).sum(); posc = (y_tr == 1).sum()
        spw = max(1.0, neg / max(1, posc))
        dtrain = xgb.DMatrix(X_tr, label=y_tr); dvalid = xgb.DMatrix(X_va, label=y_va)
        params = {'tree_method': 'hist','device': 'cuda','max_depth': 6,'eta': 0.05,'subsample': 0.9,'colsample_bytree': 0.8,
                  'min_child_weight': 10,'lambda': 1.5,'alpha': 0.0,'objective': 'binary:logistic','eval_metric': 'logloss','scale_pos_weight': float(spw),'seed': int(s + fold)}
        booster = xgb.train(params, dtrain, num_boost_round=2000, evals=[(dtrain,'train'),(dvalid,'valid')], early_stopping_rounds=100, verbose_eval=False)
        best_it = int(getattr(booster, 'best_iteration', None) or booster.num_boosted_rounds() - 1)
        oofg[va_idx] = booster.predict(dvalid, iteration_range=(0, best_it + 1))
        models_g.append((booster, best_it))
        print(f'   G seed {s} fold {fold} {time.time()-t0:.1f}s; best_it={best_it}')
    # Smooth OOF with centered rolling-max window=5
    oof_df_g = sup_g[['game_play','player','step']].copy()
    oof_df_g['oof'] = oofg
    oof_df_g = oof_df_g.sort_values(['game_play','player','step'])
    grp_go = oof_df_g.groupby(['game_play','player'], sort=False)
    oof_df_g['oof_smooth'] = grp_go['oof'].transform(lambda s_: s_.rolling(5, center=True, min_periods=1).max())
    oof_list_g.append((oof_df_g.index.values, oof_df_g['oof_smooth'].values))
    # Test
    Xt_g = te_feat_p[feat_cols_g].astype(float).values
    dtest_g = xgb.DMatrix(Xt_g)
    ptg = np.zeros(len(te_feat_p), float)
    for i, (booster, best_it) in enumerate(models_g):
        t1 = time.time(); ptg += booster.predict(dtest_g, iteration_range=(0, best_it + 1));
        print(f'    G seed {s} infer model {i} {time.time()-t1:.1f}s')
    ptg /= max(1, len(models_g))
    pred_g_tmp = te_feat_p[['game_play','step','nfl_player_id']].rename(columns={'nfl_player_id':'player'}).copy()
    pred_g_tmp['prob'] = ptg
    pred_g_tmp = pred_g_tmp.sort_values(['game_play','player','step'])
    grp_gt = pred_g_tmp.groupby(['game_play','player'], sort=False)
    pred_g_tmp['prob_smooth'] = grp_gt['prob'].transform(lambda s_: s_.rolling(5, center=True, min_periods=1).max())
    test_list_g.append(pred_g_tmp['prob_smooth'].values)

# Align and average OOF for G
idx_ord_g = oof_list_g[0][0]
oof_avg_g = np.mean([x[1] for x in oof_list_g], axis=0)
yg_sorted = sup_g.loc[idx_ord_g, 'contact'].astype(int).values
gp_arr = sup_g.loc[idx_ord_g, 'game_play'].values
pl_arr = sup_g.loc[idx_ord_g, 'player'].values
def apply_min_dur(bin_arr, gp, pl):
    df = pd.DataFrame({'gp': gp, 'pl': pl, 'b': bin_arr})
    df = df.groupby(['gp','pl'], sort=False)['b'].apply(lambda s: (s.rolling(3, center=True, min_periods=1).sum() >= 2).astype(int))
    return df.values
best_thr_g, best_mcc_g = 0.60, -1.0
thr_grid = np.linspace(0.4, 0.8, 41)
for thr in thr_grid:
    pred0 = (oof_avg_g >= thr).astype(int)
    pred = apply_min_dur(pred0, gp_arr, pl_arr)
    m = matthews_corrcoef(yg_sorted, pred)
    if m > best_mcc_g:
        best_mcc_g, best_thr_g = float(m), float(thr)
print(f'G bagged OOF MCC={best_mcc_g:.5f} at thr={best_thr_g:.2f}')

# Average test probs and threshold + min_duration for G
ptg_bag = np.mean(np.vstack(test_list_g), axis=0)
pred_g_final = te_feat_p[['game_play','step','nfl_player_id']].rename(columns={'nfl_player_id':'player'}).copy()
pred_g_final = pred_g_final.sort_values(['game_play','player','step'])
pred_g_final['prob_bag'] = ptg_bag
bin0 = (pred_g_final['prob_bag'].values >= best_thr_g).astype(int)
bin1 = apply_min_dur(bin0, pred_g_final['game_play'].values, pred_g_final['player'].values)
pred_g_final['pred_bin'] = bin1.astype(int)
g_cid_second = (pred_g_final['game_play'].astype(str) + '_' + pred_g_final['step'].astype(str) + '_' + pred_g_final['player'].astype(str) + '_G')
g_pred_second = pd.DataFrame({'contact_id': g_cid_second, 'contact_g': pred_g_final['pred_bin'].astype(int).values})

# ====================
# 3) Build submission: PP then overwrite G-second
# ====================
ss = pd.read_csv('sample_submission.csv')
sub = ss.merge(pred_df_pp, on='contact_id', how='left')
sub['contact'] = sub['contact_pp'].fillna(0).astype(int)
sub = sub.drop(columns=['contact_pp'])
before_ones = int(sub['contact'].sum())
sub = sub.merge(g_pred_second, on='contact_id', how='left')
sub['contact'] = sub['contact_g'].fillna(sub['contact']).astype(int)
sub = sub[['contact_id','contact']]
after_ones = int(sub['contact'].sum())
sub.to_csv('submission.csv', index=False)
print(f'Final submission saved. PP ones={before_ones}, after G overwrite={after_ones}. Total time {time.time()-t_all:.1f}s')

xgboost version (bagging): 2.1.4
PP bagging: loading artifacts...


 PP seed 42 ...


  PP seed 42 fold 0 done in 29.8s; best_it=3094


  PP seed 42 fold 1 done in 30.6s; best_it=3182


  PP seed 42 fold 2 done in 32.0s; best_it=3270


  PP seed 42 fold 3 done in 30.1s; best_it=2995


  PP seed 42 fold 4 done in 27.2s; best_it=2873


   PP seed 42 inference model 0 0.1s
   PP seed 42 inference model 1 0.1s


   PP seed 42 inference model 2 0.1s
   PP seed 42 inference model 3 0.1s


   PP seed 42 inference model 4 0.1s


 PP seed 1337 ...


  PP seed 1337 fold 0 done in 30.9s; best_it=3211


  PP seed 1337 fold 1 done in 31.6s; best_it=3291


  PP seed 1337 fold 2 done in 32.2s; best_it=3408


  PP seed 1337 fold 3 done in 29.5s; best_it=2955


  PP seed 1337 fold 4 done in 26.9s; best_it=2747


   PP seed 1337 inference model 0 0.1s
   PP seed 1337 inference model 1 0.1s


   PP seed 1337 inference model 2 0.1s
   PP seed 1337 inference model 3 0.1s


   PP seed 1337 inference model 4 0.1s


 PP seed 2025 ...


  PP seed 2025 fold 0 done in 28.8s; best_it=3011


  PP seed 2025 fold 1 done in 31.7s; best_it=3338


  PP seed 2025 fold 2 done in 31.3s; best_it=3189


  PP seed 2025 fold 3 done in 29.1s; best_it=2925


  PP seed 2025 fold 4 done in 27.5s; best_it=2898


   PP seed 2025 inference model 0 0.1s
   PP seed 2025 inference model 1 0.1s


   PP seed 2025 inference model 2 0.1s
   PP seed 2025 inference model 3 0.1s


   PP seed 2025 inference model 4 0.1s


  den = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))


PP bagged OOF MCC=-1.00000 | thr_same=0.5000, thr_opp=0.5000
G bagging...


 G seed 42 ...


   G seed 42 fold 0 9.2s; best_it=1470


   G seed 42 fold 1 8.5s; best_it=1371


   G seed 42 fold 2 8.7s; best_it=1434


   G seed 42 fold 3 7.5s; best_it=1193


   G seed 42 fold 4 9.7s; best_it=1629


    G seed 42 infer model 0 0.0s
    G seed 42 infer model 1 0.0s
    G seed 42 infer model 2 0.0s
    G seed 42 infer model 3 0.0s
    G seed 42 infer model 4 0.0s
 G seed 1337 ...


   G seed 1337 fold 0 9.6s; best_it=1545


   G seed 1337 fold 1 9.3s; best_it=1518


   G seed 1337 fold 2 8.8s; best_it=1458


   G seed 1337 fold 3 8.5s; best_it=1373


   G seed 1337 fold 4 9.2s; best_it=1481


    G seed 1337 infer model 0 0.0s
    G seed 1337 infer model 1 0.0s
    G seed 1337 infer model 2 0.0s
    G seed 1337 infer model 3 0.0s
    G seed 1337 infer model 4 0.0s
 G seed 2025 ...


   G seed 2025 fold 0 8.8s; best_it=1413


   G seed 2025 fold 1 7.9s; best_it=1264


   G seed 2025 fold 2 9.4s; best_it=1566


   G seed 2025 fold 3 8.2s; best_it=1308


   G seed 2025 fold 4 9.5s; best_it=1537


    G seed 2025 infer model 0 0.0s
    G seed 2025 infer model 1 0.0s
    G seed 2025 infer model 2 0.0s
    G seed 2025 infer model 3 0.0s
    G seed 2025 infer model 4 0.0s


G bagged OOF MCC=0.53627 at thr=0.79


Final submission saved. PP ones=8829, after G overwrite=10572. Total time 666.3s
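
The reported `PP bagged OOF MCC=-1.00000` with both thresholds at 0.5 is exactly the initial value of `best`, so the search never accepted a candidate. A likely explanation, stated here as an assumption rather than something verified in this run, is that the confusion counts are integer arrays, the four-way product inside `den` exceeds the int64 range and wraps, the square root of a negative wrapped value is NaN (the stray `den = np.sqrt(...)` line in the output looks like the tail of that warning), and `np.argmax` then selects the NaN. The sketch below reproduces the mechanism with hypothetical counts and shows a float64 variant named `mcc_from_counts_f64` (a hypothetical name).

```python
import numpy as np

# Hypothetical confusion counts at the scale of a large pair table (not from this run).
tp = np.array([40_000], dtype=np.int64)
fp = np.array([200_000], dtype=np.int64)
fn = np.array([20_000], dtype=np.int64)
tn = np.array([5_000_000], dtype=np.int64)

# The exact product is about 3.8e23, beyond the int64 maximum (about 9.2e18),
# so the integer product wraps silently and no longer matches the float product.
den_int = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
den_f64 = (tp + fp).astype(np.float64) * (tp + fn) * (tn + fp) * (tn + fn)

# np.argmax returns the position of a NaN when one is present, and NaN > x is False,
# so one NaN per row is enough to leave `best` at its initial value.
row = np.array([0.2, np.nan, 0.5])
nan_pos = int(np.argmax(row))       # position of the NaN
best_pos = int(np.nanargmax(row))   # position of the largest finite value

def mcc_from_counts_f64(tp, tn, fp, fn):
    # Cast first so every product is computed in float64.
    tp, tn, fp, fn = (np.asarray(a, dtype=np.float64) for a in (tp, tn, fp, fn))
    num = tp * tn - fp * fn
    den = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Return 0 where the denominator is 0 (a degenerate confusion matrix).
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

m = mcc_from_counts_f64(tp, tn, fp, fn)
```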


In [29]:
# PP bagging with strict key alignment; robust thresholds; read G preds from prior submission; rebuild submission
import time, numpy as np, pandas as pd, sys, subprocess
from sklearn.model_selection import GroupKFold
from sklearn.metrics import matthews_corrcoef

try:
    import xgboost as xgb
except Exception as e:
    print('Installing xgboost...', e)
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'xgboost==2.1.1'], check=True)
    import xgboost as xgb
print('xgboost version (pp-align):', getattr(xgb, '__version__', 'unknown'))

def fast_dual_threshold_mcc(y_true, prob, same_flag, grid_points=256):
    import numpy as np
    y = np.asarray(y_true, dtype=np.int64)
    p = np.asarray(prob, dtype=np.float64)
    s = np.asarray(same_flag, dtype=np.int8)
    mask = np.isfinite(y) & np.isfinite(p) & np.isfinite(s)
    y, p, s = y[mask], p[mask], s[mask]

    def cohort_counts(yc, pc, G):
        n = yc.size
        if n == 0:
            return dict(tp=np.array([0], np.int64), fp=np.array([0], np.int64),
                        tn=np.array([0], np.int64), fn=np.array([0], np.int64),
                        thr=np.array([1.0], np.float64))
        order = np.argsort(-pc, kind='mergesort')
        ys, ps = yc[order], pc[order]
        P = int(ys.sum()); N = n - P
        step = max(1, n // max(1, (G - 1)))
        k = np.arange(0, n + 1, step, dtype=np.int64)
        if k[-1] != n: k = np.append(k, n)
        cum = np.concatenate(([0], np.cumsum(ys, dtype=np.int64)))
        tp = cum[k]; fp = k - tp; fn = P - tp; tn = N - fp
        thr = np.where(k == 0, 1.0 + 1e-6, ps[k - 1])
        return dict(tp=tp, fp=fp, tn=tn, fn=fn, thr=thr)

    a = cohort_counts(y[s == 0], p[s == 0], grid_points)
    b = cohort_counts(y[s == 1], p[s == 1], grid_points)

    tp = a['tp'][:, None] + b['tp'][None, :]
    fp = a['fp'][:, None] + b['fp'][None, :]
    tn = a['tn'][:, None] + b['tn'][None, :]
    fn = a['fn'][:, None] + b['fn'][None, :]

    with np.errstate(invalid='ignore', divide='ignore'):
        num = tp * tn - fp * fn
        den = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
        den = np.where(den > 0, np.sqrt(den), np.nan)
        mcc = num / den

    if not np.isfinite(mcc).any():
        return -1.0, 0.5, 0.5
    i, j = np.unravel_index(np.nanargmax(mcc), mcc.shape)
    return float(mcc[i, j]), float(a['thr'][i]), float(b['thr'][j])

t0 = time.time()
print('Loading r=3.5 dyn artifacts...')
train_sup = pd.read_parquet('train_supervised_w5_helm_dyn_r35.parquet')
test_feats = pd.read_parquet('test_pairs_w5_helm_dyn_r35.parquet')
folds_df = pd.read_csv('folds_game_play.csv')
train_sup = train_sup.merge(folds_df, on='game_play', how='left')
for df in (train_sup, test_feats):
    if 'px_dist_norm_min' in df.columns: df['px_dist_norm_min'] = df['px_dist_norm_min'].fillna(1.0)
    if 'views_both_present' in df.columns: df['views_both_present'] = df['views_both_present'].fillna(0).astype(float)

drop_cols = {'contact','game_play','step','p1','p2','team1','team2','pos1','pos2','fold'}
feat_cols = [c for c in train_sup.columns if c not in drop_cols and pd.api.types.is_numeric_dtype(train_sup[c])]
gkf = GroupKFold(n_splits=5)
groups = train_sup['game_play'].values
y_all = train_sup['contact'].astype(int).values
same_all = train_sup['same_team'].fillna(0).astype(int).values if 'same_team' in train_sup.columns else np.zeros(len(train_sup), int)

seeds = [42,1337,2025]
oof_frames = []
test_frames = []

for s in seeds:
    print(f'PP aligned bagging seed {s} ...')
    X_all = train_sup[feat_cols].astype(float).values
    oof = np.full(len(train_sup), np.nan, float)
    models = []
    for fold, (tr_idx, va_idx) in enumerate(gkf.split(X_all, y_all, groups=groups)):
        t1 = time.time()
        X_tr, y_tr = X_all[tr_idx], y_all[tr_idx]
        X_va, y_va = X_all[va_idx], y_all[va_idx]
        neg = (y_tr == 0).sum(); posc = (y_tr == 1).sum()
        spw = max(1.0, neg / max(1, posc))
        dtrain = xgb.DMatrix(X_tr, label=y_tr); dvalid = xgb.DMatrix(X_va, label=y_va)
        params = {'tree_method':'hist','device':'cuda','max_depth':7,'eta':0.05,'subsample':0.9,'colsample_bytree':0.8,
                  'min_child_weight':10,'lambda':1.5,'alpha':0.1,'gamma':0.1,'objective':'binary:logistic','eval_metric':'logloss',
                  'scale_pos_weight': float(spw), 'seed': int(s + fold)}
        booster = xgb.train(params, dtrain, num_boost_round=3500, evals=[(dtrain,'train'),(dvalid,'valid')], early_stopping_rounds=200, verbose_eval=False)
        best_it = int(getattr(booster, 'best_iteration', None) or booster.num_boosted_rounds() - 1)
        oof[va_idx] = booster.predict(dvalid, iteration_range=(0, best_it + 1))
        models.append((booster, best_it))
        print(f'  seed {s} fold {fold} done in {time.time()-t1:.1f}s; best_it={best_it}', flush=True)
    # Smooth and attach keys for alignment
    oof_df = train_sup[['game_play','p1','p2','step']].copy()
    oof_df['oof'] = oof
    oof_df = oof_df.sort_values(['game_play','p1','p2','step'])
    grp = oof_df.groupby(['game_play','p1','p2'], sort=False)
    oof_df['oof_smooth'] = grp['oof'].transform(lambda s_: s_.rolling(3, center=True, min_periods=1).max())
    oof_df['key'] = (oof_df['game_play'].astype(str) + '|' + oof_df['p1'].astype(str) + '|' + oof_df['p2'].astype(str) + '|' + oof_df['step'].astype(str))
    oof_frames.append(oof_df[['key','oof_smooth']].rename(columns={'oof_smooth': f'oof_{s}'}))
    # Test probs for this seed
    Xt = test_feats[feat_cols].astype(float).values
    dtest = xgb.DMatrix(Xt)
    pt = np.zeros(len(test_feats), float)
    for i, (booster, best_it) in enumerate(models):
        t1 = time.time(); pt += booster.predict(dtest, iteration_range=(0, best_it + 1));
        print(f'   seed {s} test model {i} {time.time()-t1:.1f}s')
    pt /= max(1, len(models))
    pred_tmp = test_feats[['game_play','p1','p2','step']].copy()
    pred_tmp['prob'] = pt
    pred_tmp = pred_tmp.sort_values(['game_play','p1','p2','step'])
    grp_t = pred_tmp.groupby(['game_play','p1','p2'], sort=False)
    pred_tmp['prob_smooth'] = grp_t['prob'].transform(lambda s_: s_.rolling(3, center=True, min_periods=1).max())
    pred_tmp['key'] = (pred_tmp['game_play'].astype(str) + '|' + pred_tmp['p1'].astype(str) + '|' + pred_tmp['p2'].astype(str) + '|' + pred_tmp['step'].astype(str))
    test_frames.append(pred_tmp[['key','prob_smooth']].rename(columns={'prob_smooth': f'pt_{s}'}))

# Align by key and average (outer to avoid row loss), fill missing to neutral 0.5
oof_join = oof_frames[0]
for df in oof_frames[1:]:
    oof_join = oof_join.merge(df, on='key', how='outer')
oof_join = oof_join.fillna(0.5)
oof_join['oof_avg'] = oof_join[[c for c in oof_join.columns if c.startswith('oof_')]].mean(axis=1)

keys_split = oof_join['key'].str.split('|', expand=True)
keys_split.columns = ['game_play','p1','p2','step']
oof_join = pd.concat([oof_join[['key','oof_avg']], keys_split], axis=1)
oof_join['step'] = oof_join['step'].astype(int)

# Bring labels and same_flag aligned to keys
lab_df = train_sup[['game_play','p1','p2','step','contact','same_team']].copy()
lab_df['key'] = (lab_df['game_play'].astype(str) + '|' + lab_df['p1'].astype(str) + '|' + lab_df['p2'].astype(str) + '|' + lab_df['step'].astype(str))
eval_df = oof_join.merge(lab_df[['key','contact','same_team']], on='key', how='inner')
y_sorted = eval_df['contact'].astype(int).to_numpy()
p_sorted = eval_df['oof_avg'].astype(float).to_numpy()
same_sorted = eval_df['same_team'].fillna(0).astype(int).to_numpy() if 'same_team' in eval_df.columns else np.zeros(len(eval_df), int)

# Robust dual-threshold search with fallback
best_mcc, thr_opp, thr_same = fast_dual_threshold_mcc(y_sorted, p_sorted, same_sorted, grid_points=256)
if (not np.isfinite(best_mcc)) or best_mcc < 0.6:
    thrs = np.linspace(0.6, 0.9, 31)
    m_list = [matthews_corrcoef(y_sorted, (p_sorted >= t).astype(int)) for t in thrs]
    j = int(np.argmax(m_list))
    thr_opp = thr_same = float(thrs[j])  # fallback single threshold
    best_mcc = float(m_list[j])
print(f'PP aligned bagged OOF MCC={best_mcc:.5f} | thr_same={thr_same:.4f}, thr_opp={thr_opp:.4f}')

# Test: align and average (outer), fill to 0.5
test_join = test_frames[0]
for df in test_frames[1:]:
    test_join = test_join.merge(df, on='key', how='outer')
test_join = test_join.fillna(0.5)
test_join['pt_avg'] = test_join[[c for c in test_join.columns if c.startswith('pt_')]].mean(axis=1)
tj = test_join.copy()
split_t = tj['key'].str.split('|', expand=True)
split_t.columns = ['game_play','p1','p2','step']
tj = pd.concat([tj[['key','pt_avg']], split_t], axis=1)
tj['step'] = tj['step'].astype(int)

# Merge same_team for thresholding
st = test_feats[['game_play','p1','p2','step','same_team']].copy()
st['key'] = (st['game_play'].astype(str) + '|' + st['p1'].astype(str) + '|' + st['p2'].astype(str) + '|' + st['step'].astype(str))
tj = tj.merge(st[['key','same_team']], on='key', how='left')
same_flag_test = tj['same_team'].fillna(0).astype(int).to_numpy()
thr_arr_test = np.where(same_flag_test == 1, thr_same, thr_opp)
tj['pred_bin'] = (tj['pt_avg'].to_numpy() >= thr_arr_test).astype(int)
cid_pp = (tj['game_play'].astype(str) + '_' + tj['step'].astype(str) + '_' + tj['p1'].astype(str) + '_' + tj['p2'].astype(str))
pred_df_pp = pd.DataFrame({'contact_id': cid_pp, 'contact_pp': tj['pred_bin'].astype(int).values})

# Build final submission: PP then overwrite G-second, read G from prior submission or CSV
ss = pd.read_csv('sample_submission.csv')
sub = ss.merge(pred_df_pp, on='contact_id', how='left')
sub['contact'] = sub['contact_pp'].fillna(0).astype(int)
sub = sub.drop(columns=['contact_pp'])
before_ones = int(sub['contact'].sum())

# Option B: read G rows from existing submission.csv
try:
    prev_sub = pd.read_csv('submission.csv')
    g_pred_second = prev_sub[prev_sub['contact_id'].str.endswith('_G')][['contact_id','contact']].rename(columns={'contact':'contact_g'})
except Exception as e:
    print('Warning: could not read previous submission for G; defaulting to no G overwrite:', e)
    g_pred_second = pd.DataFrame(columns=['contact_id','contact_g'])

sub = sub.merge(g_pred_second, on='contact_id', how='left')
sub['contact'] = sub['contact_g'].fillna(sub['contact']).astype(int)
sub = sub[['contact_id','contact']]
after_ones = int(sub['contact'].sum())
sub.to_csv('submission.csv', index=False)
print(f'Final aligned-bag submission saved. PP ones={before_ones}, after G overwrite={after_ones}. Took {time.time()-t0:.1f}s')

xgboost version (pp-align): 2.1.4
Loading r=3.5 dyn artifacts...
PP aligned bagging seed 42 ...


  seed 42 fold 0 done in 29.7s; best_it=3094


  seed 42 fold 1 done in 30.8s; best_it=3182


  seed 42 fold 2 done in 32.0s; best_it=3270


  seed 42 fold 3 done in 30.1s; best_it=2995


  seed 42 fold 4 done in 28.3s; best_it=2873


   seed 42 test model 0 0.1s
   seed 42 test model 1 0.1s


   seed 42 test model 2 0.1s
   seed 42 test model 3 0.1s


   seed 42 test model 4 0.1s


PP aligned bagging seed 1337 ...


  seed 1337 fold 0 done in 30.9s; best_it=3211


  seed 1337 fold 1 done in 31.7s; best_it=3291


  seed 1337 fold 2 done in 32.2s; best_it=3408


  seed 1337 fold 3 done in 29.6s; best_it=2955


  seed 1337 fold 4 done in 27.3s; best_it=2747


   seed 1337 test model 0 0.2s
   seed 1337 test model 1 0.2s


   seed 1337 test model 2 0.2s
   seed 1337 test model 3 0.1s


   seed 1337 test model 4 0.1s


PP aligned bagging seed 2025 ...


  seed 2025 fold 0 done in 28.9s; best_it=3011


  seed 2025 fold 1 done in 31.8s; best_it=3338


  seed 2025 fold 2 done in 31.3s; best_it=3189


  seed 2025 fold 3 done in 29.3s; best_it=2925


  seed 2025 fold 4 done in 28.7s; best_it=2898


   seed 2025 test model 0 0.1s
   seed 2025 test model 1 0.2s


   seed 2025 test model 2 0.1s
   seed 2025 test model 3 0.1s


   seed 2025 test model 4 0.1s


PP aligned bagged OOF MCC=2122.94447 | thr_same=0.0048, thr_opp=0.1520


Final aligned-bag submission saved. PP ones=15681, after G overwrite=17733. Took 470.3s
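
MCC is bounded in [-1, 1], so the reported `PP aligned bagged OOF MCC=2122.94447` cannot be a valid score, and the thresholds selected with it should not be trusted. One plausible cause (an assumption, not confirmed here) is that `den` is still an int64 product: the `np.where(den > 0, ...)` guard removes negative wrap-arounds but lets small positive wrapped values through, which inflates the ratio. An independent recheck at the chosen thresholds would catch this. The sketch below uses Python integers, which do not overflow; the function name `mcc_at_thresholds` is hypothetical.

```python
import math
import numpy as np

def mcc_at_thresholds(y, p, same, thr_opp, thr_same):
    # Apply the per-cohort thresholds exactly as the submission step does.
    y = np.asarray(y).astype(bool)
    thr = np.where(np.asarray(same) == 1, thr_same, thr_opp)
    pred = np.asarray(p, dtype=float) >= thr
    # Convert counts to Python ints so the four-way product is exact.
    tp = int(np.sum(pred & y)); fp = int(np.sum(pred & ~y))
    fn = int(np.sum(~pred & y)); tn = int(np.sum(~pred & ~y))
    den2 = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    return 0.0 if den2 == 0 else (tp * tn - fp * fn) / math.sqrt(den2)

# Toy data (illustrative only): three opposite-team rows, three same-team rows.
y_demo = [1, 1, 0, 0, 1, 0]
p_demo = [0.9, 0.4, 0.3, 0.8, 0.7, 0.1]
same_demo = [0, 0, 0, 1, 1, 1]
m_demo = mcc_at_thresholds(y_demo, p_demo, same_demo, thr_opp=0.5, thr_same=0.6)
```

Comparing this value with the `best_mcc` returned by the grid search, and asserting that both lie in [-1, 1], is a cheap guard before writing `submission.csv`.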


In [34]:
# Rebuild full pipeline with candidate radius r=4.0 and save *_r40 artifacts
import pandas as pd, numpy as np, time, math
from itertools import combinations

t0 = time.time()
print('Rebuilding pipeline with r=4.0 ...')

def build_pairs_for_group_r(gdf, r=4.0):
    rows = []
    arr = gdf[['nfl_player_id','team','position','x_position','y_position','speed','acceleration','direction']].values
    n = arr.shape[0]
    for i, j in combinations(range(n), 2):
        pid_i, team_i, pos_i, xi, yi, si, ai, diri = arr[i]
        pid_j, team_j, pos_j, xj, yj, sj, aj, dirj = arr[j]
        dx = xj - xi; dy = yj - yi
        dist = math.hypot(dx, dy)
        if dist > r:
            continue
        a = int(pid_i); b = int(pid_j)
        p1, p2 = (str(a), str(b)) if a <= b else (str(b), str(a))
        vxi = si * math.cos(math.radians(diri)) if not pd.isna(diri) else 0.0
        vyi = si * math.sin(math.radians(diri)) if not pd.isna(diri) else 0.0
        vxj = sj * math.cos(math.radians(dirj)) if not pd.isna(dirj) else 0.0
        vyj = sj * math.sin(math.radians(dirj)) if not pd.isna(dirj) else 0.0
        rvx = vxj - vxi; rvy = vyj - vyi
        if dist > 0:
            ux = dx / dist; uy = dy / dist
            closing = rvx * ux + rvy * uy
        else:
            closing = 0.0
        if pd.isna(diri) or pd.isna(dirj):
            hd = np.nan
        else:
            d = (diri - dirj + 180) % 360 - 180
            hd = abs(d)
        rows.append((p1, p2, dist, dx, dy, si, sj, ai, aj, closing, abs(closing), hd, int(team_i == team_j), str(team_i), str(team_j), str(pos_i), str(pos_j)))
    if not rows:
        return pd.DataFrame(columns=['p1','p2','distance','rel_dx','rel_dy','speed1','speed2','accel1','accel2','closing','abs_closing','abs_d_heading','same_team','team1','team2','pos1','pos2'])
    return pd.DataFrame(rows, columns=['p1','p2','distance','rel_dx','rel_dy','speed1','speed2','accel1','accel2','closing','abs_closing','abs_d_heading','same_team','team1','team2','pos1','pos2'])

def build_feature_table_r(track_df, r=4.0):
    feats = []
    cnt = 0
    last = time.time()
    for (gp, step), gdf in track_df.groupby(['game_play','step'], sort=False):
        f = build_pairs_for_group_r(gdf, r=r)
        if not f.empty:
            f.insert(0, 'step', step)
            f.insert(0, 'game_play', gp)
            feats.append(f)
        cnt += 1
        if cnt % 500 == 0:
            now = time.time()
            print(f' processed {cnt} steps; +{now-last:.1f}s; total {now-t0:.1f}s', flush=True)
            last = now
    if feats:
        return pd.concat(feats, ignore_index=True)
    return pd.DataFrame(columns=['game_play','step','p1','p2','distance','rel_dx','rel_dy','speed1','speed2','accel1','accel2','closing','abs_closing','abs_d_heading','same_team','team1','team2','pos1','pos2'])

print('Building train pairs r=4.0 ...')
train_pairs_r40 = build_feature_table_r(train_track_idx, r=4.0)
print('train_pairs_r40:', train_pairs_r40.shape)
train_pairs_r40.to_parquet('train_pairs_r40.parquet', index=False)
print('Building test pairs r=4.0 ...')
test_pairs_r40 = build_feature_table_r(test_track_idx, r=4.0)
print('test_pairs_r40:', test_pairs_r40.shape)
test_pairs_r40.to_parquet('test_pairs_r40.parquet', index=False)

def add_window_feats_local(df: pd.DataFrame, W: int = 5):
    df = df.sort_values(['game_play','p1','p2','step']).copy()
    grp = df.groupby(['game_play','p1','p2'], sort=False)
    df['dist_min_p5'] = grp['distance'].rolling(W, min_periods=1).min().reset_index(level=[0,1,2], drop=True)
    df['dist_mean_p5'] = grp['distance'].rolling(W, min_periods=1).mean().reset_index(level=[0,1,2], drop=True)
    df['dist_max_p5'] = grp['distance'].rolling(W, min_periods=1).max().reset_index(level=[0,1,2], drop=True)
    df['dist_std_p5'] = grp['distance'].rolling(W, min_periods=1).std().reset_index(level=[0,1,2], drop=True)
    df['abs_close_min_p5'] = grp['abs_closing'].rolling(W, min_periods=1).min().reset_index(level=[0,1,2], drop=True)
    df['abs_close_mean_p5'] = grp['abs_closing'].rolling(W, min_periods=1).mean().reset_index(level=[0,1,2], drop=True)
    df['abs_close_max_p5'] = grp['abs_closing'].rolling(W, min_periods=1).max().reset_index(level=[0,1,2], drop=True)
    df['abs_close_std_p5'] = grp['abs_closing'].rolling(W, min_periods=1).std().reset_index(level=[0,1,2], drop=True)
    for thr, name in [(1.5,'lt15'), (2.0,'lt20'), (2.5,'lt25')]:
        key = f'cnt_dist_{name}_p5'
        df[key] = grp['distance'].apply(lambda s: s.lt(thr).rolling(W, min_periods=1).sum()).reset_index(level=[0,1,2], drop=True)
    df['dist_delta_p5'] = df['distance'] - grp['distance'].shift(W)
    return df

print('Adding W5 features (train/test) for r=4.0 ...')
train_w_r40 = add_window_feats_local(train_pairs_r40, W=5)
test_w_r40 = add_window_feats_local(test_pairs_r40, W=5)
train_w_r40.to_parquet('train_pairs_w5_r40.parquet', index=False)
test_w_r40.to_parquet('test_pairs_w5_r40.parquet', index=False)

FPS = 59.94
def prep_meta(vmeta: pd.DataFrame):
    vm = vmeta.copy()
    for c in ['start_time','snap_time']:
        if np.issubdtype(vm[c].dtype, np.number):
            continue
        ts = pd.to_datetime(vm[c], errors='coerce')
        if ts.notna().any():
            vm[c] = (ts - ts.dt.floor('D')).dt.total_seconds().astype(float)
        else:
            vm[c] = pd.to_numeric(vm[c], errors='coerce')
    vm['snap_frame'] = ((vm['snap_time'] - vm['start_time']) * FPS).round().astype('Int64')
    return vm[['game_play','view','snap_frame']].drop_duplicates()

print('Loading helmets and video metadata...')
train_helm_df = pd.read_csv('train_baseline_helmets.csv')
test_helm_df = pd.read_csv('test_baseline_helmets.csv')
train_vmeta_df = pd.read_csv('train_video_metadata.csv')
test_vmeta_df = pd.read_csv('test_video_metadata.csv')
meta_tr = prep_meta(train_vmeta_df); meta_te = prep_meta(test_vmeta_df)

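def dedup_and_step(helm: pd.DataFrame, meta: pd.DataFrame):
    # Keep the largest box per (game_play, view, frame, player), map video frames to tracking steps,
    # then average each helmet over steps t-1, t, t+1 so small frame/step misalignments still match.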
    df = helm[['game_play','view','frame','nfl_player_id','left','top','width','height']].copy()
    df = df.dropna(subset=['nfl_player_id'])
    df['nfl_player_id'] = df['nfl_player_id'].astype(int).astype(str)
    df['area'] = df['width'] * df['height']
    df['cx'] = df['left'] + 0.5 * df['width']
    df['cy'] = df['top'] + 0.5 * df['height']
    df = df.sort_values(['game_play','view','frame','nfl_player_id','area'], ascending=[True,True,True,True,False]).drop_duplicates(['game_play','view','frame','nfl_player_id'], keep='first')
    df = df.merge(meta, on=['game_play','view'], how='left')
    df['step'] = ((df['frame'] - df['snap_frame']).astype('float') / 6.0).round().astype('Int64')
    df = df.dropna(subset=['step']); df['step'] = df['step'].astype(int)
    dm1 = df.copy(); dm1['target_step'] = dm1['step'] - 1
    d0 = df.copy(); d0['target_step'] = df['step']
    dp1 = df.copy(); dp1['target_step'] = df['step'] + 1
    d = pd.concat([dm1, d0, dp1], ignore_index=True)
    agg = d.groupby(['game_play','view','target_step','nfl_player_id'], sort=False).agg(
        cx_mean=('cx','mean'), cy_mean=('cy','mean'), h_mean=('height','mean'), cnt=('cx','size')
    ).reset_index().rename(columns={'target_step':'step'})
    return agg

print('Preparing helmet aggregates...')
h_tr = dedup_and_step(train_helm_df, meta_tr)
h_te = dedup_and_step(test_helm_df, meta_te)
print('Helmet agg shapes:', h_tr.shape, h_te.shape)

def merge_helmet_to_pairs_df(pairs: pd.DataFrame, h_agg: pd.DataFrame):
    ha = h_agg[['game_play','step','view','nfl_player_id','cx_mean','cy_mean','h_mean']].copy()
    a = ha.rename(columns={'nfl_player_id':'p1','cx_mean':'cx1','cy_mean':'cy1','h_mean':'h1'})
    b = ha.rename(columns={'nfl_player_id':'p2','cx_mean':'cx2','cy_mean':'cy2','h_mean':'h2'})
    merged = a.merge(b, on=['game_play','step','view'], how='inner')
    merged = merged[merged['p1'] < merged['p2']]
    merged['px_dist'] = np.sqrt((merged['cx1'] - merged['cx2'])**2 + (merged['cy1'] - merged['cy2'])**2)
    merged['px_dist_norm'] = merged['px_dist'] / np.sqrt(np.maximum(1e-6, merged['h1'] * merged['h2']))
    agg = merged.groupby(['game_play','step','p1','p2'], as_index=False).agg(
        px_dist_norm_min=('px_dist_norm','min'),
        views_both_present=('px_dist_norm', lambda s: int(s.notna().sum()))
    )
    out = pairs.merge(agg, on=['game_play','step','p1','p2'], how='left')
    return out

print('Merging helmets into pairs (train/test) ...')
train_pairs_w5_helm_r40 = merge_helmet_to_pairs_df(train_w_r40, h_tr)
test_pairs_w5_helm_r40 = merge_helmet_to_pairs_df(test_w_r40, h_te)
train_pairs_w5_helm_r40.to_parquet('train_pairs_w5_helm_r40.parquet', index=False)
test_pairs_w5_helm_r40.to_parquet('test_pairs_w5_helm_r40.parquet', index=False)

def add_dyn_feats(df: pd.DataFrame) -> pd.DataFrame:
    df = df.sort_values(['game_play','p1','p2','step']).copy()
    grp = df.groupby(['game_play','p1','p2'], sort=False)
    if 'px_dist_norm_min' in df.columns: df['px_dist_norm_min'] = df['px_dist_norm_min'].fillna(1.0)
    if 'views_both_present' in df.columns: df['views_both_present'] = df['views_both_present'].fillna(0).astype(float)
    df['approaching_flag'] = (df['closing'] < 0).astype(int)
    denom = (-df['closing']).clip(lower=1e-3)
    ttc_raw = df['distance'] / denom
    ttc_raw = ttc_raw.where(df['approaching_flag'] == 1, 10.0)
    df['ttc_raw'] = ttc_raw.astype(float)
    df['ttc_clip'] = df['ttc_raw'].clip(0, 5)
    df['ttc_log'] = np.log1p(df['ttc_clip'])
    df['inv_ttc'] = 1.0 / (1.0 + df['ttc_clip'])
    df['d_dist_1'] = df['distance'] - grp['distance'].shift(1)
    df['d_dist_2'] = df['distance'] - grp['distance'].shift(2)
    df['d_dist_5'] = df['distance'] - grp['distance'].shift(5)
    df['d_close_1'] = df['closing'] - grp['closing'].shift(1)
    df['d_absclose_1'] = df['abs_closing'] - grp['abs_closing'].shift(1)
    df['d_speed1_1'] = df['speed1'] - grp['speed1'].shift(1)
    df['d_speed2_1'] = df['speed2'] - grp['speed2'].shift(1)
    df['d_accel1_1'] = df['accel1'] - grp['accel1'].shift(1)
    df['d_accel2_1'] = df['accel2'] - grp['accel2'].shift(1)
    df['rm3_d_dist_1'] = grp['d_dist_1'].transform(lambda s: s.rolling(3, min_periods=1).mean())
    df['rm3_d_close_1'] = grp['d_close_1'].transform(lambda s: s.rolling(3, min_periods=1).mean())
    for c in ['d_dist_1','d_dist_2','d_dist_5','d_close_1','d_absclose_1','d_speed1_1','d_speed2_1','d_accel1_1','d_accel2_1','rm3_d_dist_1','rm3_d_close_1']:
        df[c] = df[c].fillna(0.0)
    df['rel_speed'] = (df['speed2'] - df['speed1']).astype(float)
    df['abs_rel_speed'] = df['rel_speed'].abs()
    df['rel_accel'] = (df['accel2'] - df['accel1']).astype(float)
    df['abs_rel_accel'] = df['rel_accel'].abs()
    df['jerk1'] = grp['accel1'].diff().fillna(0.0)
    df['jerk2'] = grp['accel2'].diff().fillna(0.0)
    if 'px_dist_norm_min' in df.columns:
        df['d_px_norm_1'] = df['px_dist_norm_min'] - grp['px_dist_norm_min'].shift(1)
        df['d_px_norm_1'] = df['d_px_norm_1'].fillna(0.0)
        df['cnt_px_lt006_p3'] = grp['px_dist_norm_min'].transform(lambda s: s.lt(0.06).rolling(3, min_periods=1).sum()).astype(float)
        df['cnt_px_lt008_p3'] = grp['px_dist_norm_min'].transform(lambda s: s.lt(0.08).rolling(3, min_periods=1).sum()).astype(float)
    else:
        df['d_px_norm_1'] = 0.0; df['cnt_px_lt006_p3'] = 0.0; df['cnt_px_lt008_p3'] = 0.0
    return df

print('Adding dyn features (train/test) ...')
tr_dyn_r40 = add_dyn_feats(train_pairs_w5_helm_r40)
te_dyn_r40 = add_dyn_feats(test_pairs_w5_helm_r40)
tr_dyn_r40.to_parquet('train_pairs_w5_helm_dyn_r40.parquet', index=False)
te_dyn_r40.to_parquet('test_pairs_w5_helm_dyn_r40.parquet', index=False)

key_cols = ['game_play','step','p1','p2']
lab_cols = key_cols + ['contact']
labels_min = train_labels[lab_cols].copy()
sup_r40 = labels_min.merge(tr_dyn_r40, on=key_cols, how='inner')
print('Supervised(inner) r=4.0 before expansion:', sup_r40.shape, 'pos rate:', sup_r40['contact'].mean())
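# Label dilation: the steps right before and after each labeled contact are also marked positive
# (only rows already present in sup_r40 are affected, because of the left merge below).
pos = sup_r40.loc[sup_r40['contact'] == 1, ['game_play','p1','p2','step']]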
pos_m1 = pos.copy(); pos_m1['step'] = pos_m1['step'] - 1
pos_p1 = pos.copy(); pos_p1['step'] = pos_p1['step'] + 1
pos_exp = pd.concat([pos_m1, pos_p1], ignore_index=True).drop_duplicates()
pos_exp['flag_pos_exp'] = 1
sup_r40 = sup_r40.merge(pos_exp, on=['game_play','p1','p2','step'], how='left')
sup_r40.loc[sup_r40['flag_pos_exp'] == 1, 'contact'] = 1
sup_r40.drop(columns=['flag_pos_exp'], inplace=True)
print('After positive expansion (r=4.0): pos rate:', sup_r40['contact'].mean())
sup_r40.to_parquet('train_supervised_w5_helm_dyn_r40.parquet', index=False)

print('Done r=4.0 rebuild in {:.1f}s'.format(time.time()-t0), flush=True)

Rebuilding pipeline with r=4.0 ...
Building train pairs r=4.0 ...
 processed 500 steps; +0.7s; total 0.7s
 processed 1000 steps; +0.8s; total 1.5s
 processed 1500 steps; +0.6s; total 2.1s
 processed 2000 steps; +0.6s; total 2.7s
 processed 2500 steps; +0.6s; total 3.3s
 processed 3000 steps; +0.6s; total 3.8s
 processed 3500 steps; +0.6s; total 4.4s
 processed 4000 steps; +0.9s; total 5.3s
 processed 4500 steps; +0.6s; total 5.8s
 processed 5000 steps; +0.6s; total 6.4s
 processed 5500 steps; +0.6s; total 7.0s
 processed 6000 steps; +0.6s; total 7.6s
 processed 6500 steps; +0.6s; total 8.1s
 processed 7000 steps; +0.6s; total 8.7s
 processed 7500 steps; +1.0s; total 9.7s
 processed 8000 steps; +0.6s; total 10.2s
 processed 8500 steps; +0.6s; total 10.8s
 processed 9000 steps; +0.6s; total 11.4s
 processed 9500 steps; +0.6s; total 12.0s
 processed 10000 steps; +0.6s; total 12.5s
 processed 10500 steps; +0.6s; total 13.2s
 processed 11000 steps; +0.6s; total 13.7s
 processed 11500 steps; +1.1s; total 14.8s
 processed 12000 steps; +0.6s; total 15.4s
 processed 12500 steps; +0.6s; total 15.9s
 processed 13000 steps; +0.6s; total 16.5s
 processed 13500 steps; +0.6s; total 17.1s
 processed 14000 steps; +0.6s; total 17.6s
 processed 14500 steps; +0.6s; total 18.3s
 processed 15000 steps; +0.6s; total 18.8s
 processed 15500 steps; +0.6s; total 19.4s
 processed 16000 steps; +0.6s; total 19.9s
 processed 16500 steps; +0.6s; total 20.5s
 processed 17000 steps; +1.1s; total 21.6s
 processed 17500 steps; +0.6s; total 22.2s
 processed 18000 steps; +0.6s; total 22.7s
 processed 18500 steps; +0.6s; total 23.3s
 processed 19000 steps; +0.6s; total 23.8s
 processed 19500 steps; +0.6s; total 24.4s
 processed 20000 steps; +0.6s; total 25.0s
 processed 20500 steps; +0.6s; total 25.5s
 processed 21000 steps; +0.6s; total 26.1s
 processed 21500 steps; +0.6s; total 26.7s
 processed 22000 steps; +0.6s; total 27.2s
 processed 22500 steps; +0.6s; total 27.8s
 processed 23000 steps; +1.2s; total 29.0s
 processed 23500 steps; +0.6s; total 29.5s
 processed 24000 steps; +0.6s; total 30.1s
 processed 24500 steps; +0.6s; total 30.7s
 processed 25000 steps; +0.6s; total 31.3s
 processed 25500 steps; +0.6s; total 31.9s
 processed 26000 steps; +0.6s; total 32.4s
 processed 26500 steps; +0.6s; total 33.0s
 processed 27000 steps; +0.6s; total 33.6s
 processed 27500 steps; +0.6s; total 34.1s
 processed 28000 steps; +0.6s; total 34.7s
 processed 28500 steps; +0.6s; total 35.3s
 processed 29000 steps; +0.6s; total 35.9s
 processed 29500 steps; +0.6s; total 36.5s
 processed 30000 steps; +0.6s; total 37.1s
 processed 30500 steps; +0.6s; total 37.7s
 processed 31000 steps; +1.4s; total 39.0s
 processed 31500 steps; +0.6s; total 39.6s
 processed 32000 steps; +0.6s; total 40.2s
 processed 32500 steps; +0.6s; total 40.7s
 processed 33000 steps; +0.6s; total 41.3s
 processed 33500 steps; +0.6s; total 41.9s
 processed 34000 steps; +0.6s; total 42.4s
 processed 34500 steps; +0.6s; total 43.0s
 processed 35000 steps; +0.6s; total 43.6s
 processed 35500 steps; +0.6s; total 44.1s
 processed 36000 steps; +0.6s; total 44.7s
 processed 36500 steps; +0.6s; total 45.3s
 processed 37000 steps; +0.6s; total 45.8s
 processed 37500 steps; +0.6s; total 46.4s
 processed 38000 steps; +0.6s; total 47.0s
 processed 38500 steps; +0.6s; total 47.6s
 processed 39000 steps; +0.6s; total 48.1s
 processed 39500 steps; +0.6s; total 48.7s
 processed 40000 steps; +1.6s; total 50.2s
 processed 40500 steps; +0.6s; total 50.8s
 processed 41000 steps; +0.6s; total 51.4s
 processed 41500 steps; +0.6s; total 52.0s
 processed 42000 steps; +0.6s; total 52.5s
 processed 42500 steps; +0.6s; total 53.1s
 processed 43000 steps; +0.6s; total 53.6s
 processed 43500 steps; +0.6s; total 54.2s
 processed 44000 steps; +0.6s; total 54.8s
 processed 44500 steps; +0.6s; total 55.4s
 processed 45000 steps; +0.6s; total 55.9s
 processed 45500 steps; +0.6s; total 56.5s
 processed 46000 steps; +0.6s; total 57.1s
 processed 46500 steps; +0.6s; total 57.6s
 processed 47000 steps; +0.6s; total 58.2s
 processed 47500 steps; +0.6s; total 58.8s
 processed 48000 steps; +0.6s; total 59.3s
 processed 48500 steps; +0.6s; total 59.9s
 processed 49000 steps; +0.6s; total 60.5s
 processed 49500 steps; +0.6s; total 61.0s
 processed 50000 steps; +0.6s; total 61.6s
 processed 50500 steps; +0.6s; total 62.2s
 processed 51000 steps; +0.6s; total 62.7s
 processed 51500 steps; +1.7s; total 64.4s
 processed 52000 steps; +0.6s; total 65.0s
 processed 52500 steps; +0.6s; total 65.6s
 processed 53000 steps; +0.6s; total 66.1s
 processed 53500 steps; +0.6s; total 66.7s
 processed 54000 steps; +0.6s; total 67.3s
 processed 54500 steps; +0.6s; total 67.8s
 processed 55000 steps; +0.6s; total 68.4s
 processed 55500 steps; +0.6s; total 69.0s
train_pairs_r40: (2430943, 19)
Building test pairs r=4.0 ...
 processed 500 steps; +0.6s; total 75.6s
 processed 1000 steps; +0.6s; total 76.2s
 processed 1500 steps; +0.6s; total 76.7s
 processed 2000 steps; +0.6s; total 77.4s
 processed 2500 steps; +0.6s; total 78.0s
 processed 3000 steps; +0.6s; total 78.5s
 processed 3500 steps; +0.6s; total 79.1s
 processed 4000 steps; +0.6s; total 79.6s
 processed 4500 steps; +0.6s; total 80.2s
 processed 5000 steps; +0.5s; total 80.7s
 processed 5500 steps; +0.6s; total 81.3s
test_pairs_r40: (278492, 19)
Adding W5 features (train/test) for r=4.0 ...
Loading helmets and video metadata...
Preparing helmet aggregates...
Helmet agg shapes: (620840, 8) (67667, 8)
Merging helmets into pairs (train/test) ...
Adding dyn features (train/test) ...
Supervised(inner) r=4.0 before expansion: (634192, 59) pos rate: 0.06721308373489417
After positive expansion (r=4.0): pos rate: 0.07693884501854328
Done r=4.0 rebuild in 386.5s
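
The `closing` and `ttc_clip` features built in the cell above are easiest to sanity-check on a single hand-made pair. The sketch below is a minimal scalar version, not the notebook's code: `closing_speed` and `ttc_clip` are hypothetical helper names, NaN directions are not handled, and the angle convention (degrees from the +x axis) simply mirrors the cell above. Whether that convention matches the tracking data's definition of `direction` is an assumption worth checking against the data dictionary.

```python
import math

def closing_speed(xi, yi, si, diri, xj, yj, sj, dirj):
    """Return (closing, dist) for players i and j.

    closing is the velocity of j relative to i, projected on the unit vector
    from i to j; a negative value means the gap is shrinking.
    Assumption (same as the cell above): direction is in degrees from the +x axis.
    """
    dx, dy = xj - xi, yj - yi
    dist = math.hypot(dx, dy)
    if dist == 0:
        return 0.0, dist
    vxi, vyi = si * math.cos(math.radians(diri)), si * math.sin(math.radians(diri))
    vxj, vyj = sj * math.cos(math.radians(dirj)), sj * math.sin(math.radians(dirj))
    closing = ((vxj - vxi) * dx + (vyj - vyi) * dy) / dist
    return closing, dist

def ttc_clip(dist, closing, eps=1e-3, cap=5.0):
    """Time to contact capped at `cap`; pairs that are not approaching get the cap."""
    if closing >= 0:
        return cap
    return min(cap, dist / max(eps, -closing))

# Toy pair: i at the origin moving along +x at speed 2, j standing still 3 units ahead.
closing, dist = closing_speed(0.0, 0.0, 2.0, 0.0, 3.0, 0.0, 0.0, 0.0)
ttc = ttc_clip(dist, closing)
print(closing, dist, ttc)  # -2.0 3.0 1.5
```

A pair that is moving apart (`closing >= 0`) ends up at the cap of 5, which matches the cell's behavior of assigning 10.0 and then clipping to [0, 5].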


In [35]:
# Train on r=4.0 dyn features, smooth OOF, dual thresholds (same vs opp), predict test
import time, sys, subprocess, numpy as np, pandas as pd
from sklearn.model_selection import GroupKFold
from sklearn.metrics import matthews_corrcoef

try:
    import xgboost as xgb
except Exception as e:
    print('Installing xgboost...', e)
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'xgboost==2.1.1'], check=True)
    import xgboost as xgb
print('xgboost version:', getattr(xgb, '__version__', 'unknown'))

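def mcc_from_counts(tp, tn, fp, fn):
    # Work in float64: the four-way product of int64 counts overflows at this data size.
    tp, tn, fp, fn = (np.asarray(v, dtype=np.float64) for v in (tp, tn, fp, fn))
    num = tp * tn - fp * fn
    den = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    den = np.where(den == 0, 1.0, den)
    return num / den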

def fast_dual_threshold_mcc(y_true, prob, same_flag, grid_points=151):
    res = {}
    for cohort in (0, 1):
        mask = (same_flag == cohort)
        y_c = y_true[mask].astype(int)
        p_c = prob[mask].astype(float)
        n = len(y_c)
        if n == 0:
            res[cohort] = {'k_grid': np.array([0], int), 'tp': np.array([0.0]), 'fp': np.array([0.0]), 'tn': np.array([0.0]), 'fn': np.array([0.0]), 'thr_vals': np.array([1.0])}
            continue
        order = np.argsort(-p_c)
        y_sorted = y_c[order]
        p_sorted = p_c[order]
        cum_pos = np.concatenate([[0], np.cumsum(y_sorted)])
        k_grid = np.unique(np.linspace(0, n, num=min(grid_points, n + 1), dtype=int))
        tp = cum_pos[k_grid]
        fp = k_grid - tp
        P = y_sorted.sum(); N = n - P
        fn = P - tp; tn = N - fp
        thr_vals = np.where(k_grid == 0, 1.0 + 1e-6, p_sorted[np.maximum(0, k_grid - 1)])
        res[cohort] = {'k_grid': k_grid, 'tp': tp, 'fp': fp, 'tn': tn, 'fn': fn, 'thr_vals': thr_vals}
    tp0, fp0, tn0, fn0, thr0 = res[0]['tp'], res[0]['fp'], res[0]['tn'], res[0]['fn'], res[0]['thr_vals']
    tp1, fp1, tn1, fn1, thr1 = res[1]['tp'], res[1]['fp'], res[1]['tn'], res[1]['fn'], res[1]['thr_vals']
    best = (-1.0, 0.5, 0.5)
    for i in range(len(thr0)):
        tp_sum = tp0[i] + tp1; fp_sum = fp0[i] + fp1; tn_sum = tn0[i] + tn1; fn_sum = fn0[i] + fn1
        m_arr = mcc_from_counts(tp_sum, tn_sum, fp_sum, fn_sum)
        j = int(np.argmax(m_arr)); m = float(m_arr[j])
        if m > best[0]:
            best = (m, float(thr0[i]), float(thr1[j]))
    return best  # (best_mcc, thr_opp, thr_same)

print('Loading r=4.0 supervised dyn train and dyn test features...')
train_sup = pd.read_parquet('train_supervised_w5_helm_dyn_r40.parquet')
test_feats = pd.read_parquet('test_pairs_w5_helm_dyn_r40.parquet')
folds_df = pd.read_csv('folds_game_play.csv')
print('train_sup:', train_sup.shape, 'test_feats:', test_feats.shape)

train_sup = train_sup.merge(folds_df, on='game_play', how='left')
assert train_sup['fold'].notna().all()

for df in (train_sup, test_feats):
    if 'px_dist_norm_min' in df.columns: df['px_dist_norm_min'] = df['px_dist_norm_min'].fillna(1.0)
    if 'views_both_present' in df.columns: df['views_both_present'] = df['views_both_present'].fillna(0).astype(float)

drop_cols = {'contact','game_play','step','p1','p2','team1','team2','pos1','pos2','fold'}
feat_cols = [c for c in train_sup.columns if c not in drop_cols and pd.api.types.is_numeric_dtype(train_sup[c])]
print('Using', len(feat_cols), 'features')

X_all = train_sup[feat_cols].astype(float).values
y_all = train_sup['contact'].astype(int).values
groups = train_sup['game_play'].values
same_flag_all = train_sup['same_team'].astype(int).values if 'same_team' in train_sup.columns else np.zeros(len(train_sup), dtype=int)

gkf = GroupKFold(n_splits=5)
oof = np.full(len(train_sup), np.nan, dtype=float)
models = []
start = time.time()

for fold, (tr_idx, va_idx) in enumerate(gkf.split(X_all, y_all, groups=groups)):
    t0 = time.time()
    X_tr, y_tr = X_all[tr_idx], y_all[tr_idx]
    X_va, y_va = X_all[va_idx], y_all[va_idx]
    neg = (y_tr == 0).sum(); pos = (y_tr == 1).sum()
    spw = max(1.0, neg / max(1, pos))
    print(f'Fold {fold}: train {len(tr_idx)} (pos {pos}), valid {len(va_idx)} (pos {(y_va==1).sum()}), spw={spw:.2f}', flush=True)
    dtrain = xgb.DMatrix(X_tr, label=y_tr)
    dvalid = xgb.DMatrix(X_va, label=y_va)
    params = {
        'tree_method': 'hist', 'device': 'cuda', 'max_depth': 7, 'eta': 0.05, 'subsample': 0.9,
        'colsample_bytree': 0.8, 'min_child_weight': 10, 'lambda': 1.5, 'alpha': 0.1, 'gamma': 0.1,
        'objective': 'binary:logistic', 'eval_metric': 'logloss', 'scale_pos_weight': float(spw), 'seed': 42 + fold
    }
    booster = xgb.train(params=params, dtrain=dtrain, num_boost_round=4000, evals=[(dtrain,'train'),(dvalid,'valid')],
                        early_stopping_rounds=200, verbose_eval=False)
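    # best_iteration can legitimately be 0, so test for None rather than falsiness
    best_it = getattr(booster, 'best_iteration', None)
    best_it = int(booster.num_boosted_rounds() - 1 if best_it is None else best_it)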
    oof[va_idx] = booster.predict(dvalid, iteration_range=(0, best_it + 1))
    models.append((booster, best_it))
    print(f' Fold {fold} done in {time.time()-t0:.1f}s; best_iteration={best_it}', flush=True)

# Smooth OOF per (gp,p1,p2) with centered rolling-max window=3
oof_df = train_sup[['game_play','p1','p2','step']].copy()
oof_df['oof'] = oof
oof_df = oof_df.sort_values(['game_play','p1','p2','step'])
grp = oof_df.groupby(['game_play','p1','p2'], sort=False)
oof_df['oof_smooth'] = grp['oof'].transform(lambda s: s.rolling(3, center=True, min_periods=1).max())
oof_smooth = oof_df['oof_smooth'].values
idx_ord = oof_df.index.to_numpy()
y_sorted = train_sup['contact'].astype(int).to_numpy()[idx_ord]
same_sorted = (train_sup['same_team'].fillna(0).astype(int).to_numpy()[idx_ord]) if 'same_team' in train_sup.columns else np.zeros(len(oof_df), dtype=int)

best_mcc, thr_opp, thr_same = fast_dual_threshold_mcc(y_sorted, oof_smooth, same_sorted, grid_points=151)
if not np.isfinite(best_mcc) or best_mcc < 0:
    thrs = np.linspace(0.6, 0.9, 31)
    m_list = [matthews_corrcoef(y_sorted, (oof_smooth >= t).astype(int)) for t in thrs]
    j = int(np.argmax(m_list)); best_mcc = float(m_list[j]); thr_opp = thr_same = float(thrs[j])
print(f'Best OOF MCC (dual thresholds)={best_mcc:.5f} | thr_same={thr_same:.4f}, thr_opp={thr_opp:.4f}')

# Inference on test and smoothing
Xt = test_feats[feat_cols].astype(float).values
dtest = xgb.DMatrix(Xt)
pt = np.zeros(len(test_feats), dtype=float)
for i, (booster, best_it) in enumerate(models):
    t0 = time.time()
    pt += booster.predict(dtest, iteration_range=(0, best_it + 1))
    print(f' Inference model {i} took {time.time()-t0:.1f}s', flush=True)
pt /= max(1, len(models))
pred_tmp = test_feats[['game_play','step','p1','p2']].copy()
pred_tmp['prob'] = pt
pred_tmp = pred_tmp.sort_values(['game_play','p1','p2','step'])
grp_t = pred_tmp.groupby(['game_play','p1','p2'], sort=False)
pred_tmp['prob_smooth'] = grp_t['prob'].transform(lambda s: s.rolling(3, center=True, min_periods=1).max())

# Apply dual thresholds by same_team on test
same_flag_test = test_feats['same_team'].astype(int).values if 'same_team' in test_feats.columns else np.zeros(len(test_feats), dtype=int)
thr_arr_test = np.where(same_flag_test == 1, thr_same, thr_opp)
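# pred_tmp was re-sorted by pair/step above; restore test_feats row order before comparing with thr_arr_test
pred_bin = (pred_tmp['prob_smooth'].reindex(test_feats.index).to_numpy() >= thr_arr_test).astype(int)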

# Build submission
cid = (test_feats['game_play'].astype(str) + '_' + test_feats['step'].astype(str) + '_' + test_feats['p1'].astype(str) + '_' + test_feats['p2'].astype(str))
pred_df = pd.DataFrame({'contact_id': cid, 'pred_contact': pred_bin})
ss = pd.read_csv('sample_submission.csv')
sub = ss.copy()
sub['contact'] = sub['contact_id'].map(pred_df.set_index('contact_id')['pred_contact']).fillna(0).astype(int)
sub[['contact_id','contact']].to_csv('submission.csv', index=False)
print('Saved submission.csv')
print('Done. Total time:', f'{time.time()-start:.1f}s', flush=True)

xgboost version: 2.1.4
Loading r=4.0 supervised dyn train and dyn test features...
train_sup: (634192, 59) test_feats: (278492, 58)
Using 50 features
Fold 0: train 507383 (pos 38276), valid 126809 (pos 10518), spw=12.26
 Fold 0 done in 34.9s; best_iteration=3253
Fold 1: train 507362 (pos 38941), valid 126830 (pos 9853), spw=12.03
 Fold 1 done in 40.1s; best_iteration=3632
Fold 2: train 507403 (pos 39474), valid 126789 (pos 9320), spw=11.85
 Fold 2 done in 37.9s; best_iteration=3326
Fold 3: train 507360 (pos 39212), valid 126832 (pos 9582), spw=11.94
 Fold 3 done in 37.4s; best_iteration=3446
Fold 4: train 507260 (pos 39273), valid 126932 (pos 9521), spw=11.92
 Fold 4 done in 37.3s; best_iteration=3468
  den = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
Best OOF MCC (dual thresholds)=0.72401 | thr_same=0.7900, thr_opp=0.7900
 Inference model 0 took 0.2s
 Inference model 1 took 0.2s
 Inference model 2 took 0.2s
 Inference model 3 took 0.2s
 Inference model 4 took 0.2s
Saved submission.csv
Done. Total time: 195.8s
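
With roughly 634k rows, the MCC denominator is a product of four counts that does not fit in int64, so it has to be computed in floating point. A wrapped product can go negative, `np.sqrt` then yields NaN, and a threshold search built on it drops to its fallback. The `np.sqrt` line echoed in the log above and the two identical thresholds (0.79 is a point of the fallback grid `np.linspace(0.6, 0.9, 31)`) would be consistent with that, though the log alone does not prove it. A minimal sketch follows; `mcc_from_counts_safe` is a hypothetical name, and the split of the counts into tp/fp/fn/tn is made up for illustration (only the totals, 634192 rows and 48794 positives, come from the log).

```python
import numpy as np

def mcc_from_counts_safe(tp, tn, fp, fn):
    """MCC from confusion counts, computed in float64; returns 0 where undefined."""
    tp, tn, fp, fn = (np.asarray(v, dtype=np.float64) for v in (tp, tn, fp, fn))
    num = tp * tn - fp * fn
    den = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

# Illustrative counts: totals match the log above, the tp/fp/fn/tn split is invented.
tp = np.array([30000], dtype=np.int64)
fp = np.array([10000], dtype=np.int64)
fn = np.array([18794], dtype=np.int64)
tn = np.array([575398], dtype=np.int64)

# Exact product with Python integers vs. the same product in int64 array arithmetic.
exact = int(tp[0] + fp[0]) * int(tp[0] + fn[0]) * int(tn[0] + fp[0]) * int(tn[0] + fn[0])
wrapped = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)  # wraps around silently

print(exact > np.iinfo(np.int64).max)  # True
print(int(wrapped[0]) == exact)        # False
print(mcc_from_counts_safe(tp, tn, fp, fn))
```

Casting the counts once at the top of the function keeps the rest of the formula unchanged and works for scalars as well as for the broadcast count grids used in a dual-threshold search.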


In [38]:
# PP multi-seed bagging (r=4.0) with strict alignment and float-safe dual-thresholds; then optional G overwrite from prior submission
import time, numpy as np, pandas as pd, sys, subprocess
from sklearn.model_selection import GroupKFold
from sklearn.metrics import matthews_corrcoef

try:
    import xgboost as xgb
except Exception as e:
    print('Installing xgboost...', e)
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'xgboost==2.1.4'], check=True)
    import xgboost as xgb
print('xgboost version (pp-bag-r40):', getattr(xgb, '__version__', 'unknown'))

def fast_dual_threshold_mcc(y_true, prob, same_flag, grid_points=256):
    import numpy as np
    y = np.asarray(y_true, dtype=np.int64)
    p = np.asarray(prob, dtype=np.float64)
    s = np.asarray(same_flag, dtype=np.int8)
    mask = np.isfinite(y) & np.isfinite(p) & np.isfinite(s)
    y, p, s = y[mask], p[mask], s[mask]

    def cohort_counts(yc, pc, G):
        n = yc.size
        if n == 0:
            return dict(tp=np.array([0], np.float64), fp=np.array([0], np.float64),
                        tn=np.array([0], np.float64), fn=np.array([0], np.float64),
                        thr=np.array([1.0], np.float64))
        order = np.argsort(-pc, kind='mergesort')
        ys, ps = yc[order], pc[order]
        P = float(ys.sum()); N = float(n - ys.sum())
        # top-k grid
        step = max(1, n // max(1, (G - 1)))
        k = np.arange(0, n + 1, step, dtype=np.int64)
        if k[-1] != n: k = np.append(k, n)
        cum = np.concatenate(([0], np.cumsum(ys, dtype=np.int64)))
        tp = cum[k].astype(np.float64); fp = (k - cum[k]).astype(np.float64)
        fn = P - tp; tn = N - fp
        thr = np.where(k == 0, 1.0 + 1e-6, ps[np.maximum(0, k - 1)])
        return dict(tp=tp, fp=fp, tn=tn, fn=fn, thr=thr)

    a = cohort_counts(y[s == 0], p[s == 0], grid_points)
    b = cohort_counts(y[s == 1], p[s == 1], grid_points)

    tp = a['tp'][:, None] + b['tp'][None, :]
    fp = a['fp'][:, None] + b['fp'][None, :]
    tn = a['tn'][:, None] + b['tn'][None, :]
    fn = a['fn'][:, None] + b['fn'][None, :]

    with np.errstate(invalid='ignore', divide='ignore'):
        num = tp * tn - fp * fn
        den = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
        den = np.where(den > 0, np.sqrt(den), np.nan)
        mcc = num / den

    if not np.isfinite(mcc).any():
        return -1.0, 0.79, 0.79
    i, j = np.unravel_index(np.nanargmax(mcc), mcc.shape)
    return float(mcc[i, j]), float(a['thr'][i]), float(b['thr'][j])

t0 = time.time()
print('PP bagging r=4.0: loading artifacts...')
train_sup = pd.read_parquet('train_supervised_w5_helm_dyn_r40.parquet')
test_feats = pd.read_parquet('test_pairs_w5_helm_dyn_r40.parquet')
folds_df = pd.read_csv('folds_game_play.csv')
train_sup = train_sup.merge(folds_df, on='game_play', how='left')
assert train_sup['fold'].notna().all()
for df in (train_sup, test_feats):
    if 'px_dist_norm_min' in df.columns: df['px_dist_norm_min'] = df['px_dist_norm_min'].fillna(1.0)
    if 'views_both_present' in df.columns: df['views_both_present'] = df['views_both_present'].fillna(0).astype(float)

drop_cols = {'contact','game_play','step','p1','p2','team1','team2','pos1','pos2','fold'}
feat_cols = [c for c in train_sup.columns if c not in drop_cols and pd.api.types.is_numeric_dtype(train_sup[c])]
print('Using', len(feat_cols), 'features')

# Canonical key order used by ALL seeds
key_df = train_sup[['game_play','p1','p2','step']].copy()
key_df = key_df.assign(row=np.arange(len(key_df)))
ord_idx = key_df.sort_values(['game_play','p1','p2','step']).index.to_numpy()

gkf = GroupKFold(n_splits=5)
groups = train_sup['game_play'].values
y_all = train_sup['contact'].astype(int).values
same_all = train_sup['same_team'].fillna(0).astype(int).values if 'same_team' in train_sup.columns else np.zeros(len(train_sup), np.int8)

seeds = [42, 1337, 2025]
oof_s_list = []
test_s_list = []

for s in seeds:
    print(f' PP r=4.0 seed {s} ...', flush=True)
    X_all = train_sup[feat_cols].astype(float).values
    oof = np.full(len(train_sup), np.nan, float)
    models = []
    for fold, (tr_idx, va_idx) in enumerate(gkf.split(X_all, y_all, groups=groups)):
        t1 = time.time()
        X_tr, y_tr = X_all[tr_idx], y_all[tr_idx]
        X_va, y_va = X_all[va_idx], y_all[va_idx]
        neg = (y_tr == 0).sum(); posc = (y_tr == 1).sum()
        spw = max(1.0, neg / max(1, posc))
        dtrain = xgb.DMatrix(X_tr, label=y_tr); dvalid = xgb.DMatrix(X_va, label=y_va)
        params = {'tree_method':'hist','device':'cuda','max_depth':7,'eta':0.05,'subsample':0.9,'colsample_bytree':0.8,
                  'min_child_weight':10,'lambda':1.5,'alpha':0.1,'gamma':0.1,'objective':'binary:logistic','eval_metric':'logloss',
                  'scale_pos_weight': float(spw), 'seed': int(s + fold)}
        booster = xgb.train(params, dtrain, num_boost_round=3800, evals=[(dtrain,'train'),(dvalid,'valid')], early_stopping_rounds=200, verbose_eval=False)
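        # best_iteration can legitimately be 0, so test for None rather than falsiness
        best_it = getattr(booster, 'best_iteration', None)
        best_it = int(booster.num_boosted_rounds() - 1 if best_it is None else best_it)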
        oof[va_idx] = booster.predict(dvalid, iteration_range=(0, best_it + 1))
        models.append((booster, best_it))
        print(f'   seed {s} fold {fold} done in {time.time()-t1:.1f}s; best_it={best_it}', flush=True)
    # Smooth OOF on canonical order
    df = train_sup[['game_play','p1','p2','step']].iloc[ord_idx].copy()
    df['oof'] = oof[ord_idx]
    df = df.sort_values(['game_play','p1','p2','step'])
    grp = df.groupby(['game_play','p1','p2'], sort=False)
    df['oof_smooth'] = grp['oof'].transform(lambda s_: s_.rolling(3, center=True, min_periods=1).max())
    oof_s_list.append(df['oof_smooth'].to_numpy())

    # Test predictions, smooth on canonical order for test
    Xt = test_feats[feat_cols].astype(float).values
    dtest = xgb.DMatrix(Xt)
    pt = np.zeros(len(test_feats), float)
    for i, (booster, best_it) in enumerate(models):
        t1 = time.time(); pt += booster.predict(dtest, iteration_range=(0, best_it + 1));
        print(f'    seed {s} test model {i} {time.time()-t1:.1f}s', flush=True)
    pt /= max(1, len(models))
    dt = test_feats[['game_play','p1','p2','step']].copy()
    dt = dt.sort_values(['game_play','p1','p2','step'])
    dt['prob'] = pt[dt.index.values]  # already aligned but safe
    grp_t = dt.groupby(['game_play','p1','p2'], sort=False)
    dt['prob_smooth'] = grp_t['prob'].transform(lambda s_: s_.rolling(3, center=True, min_periods=1).max())
    test_s_list.append(dt['prob_smooth'].to_numpy())

# Average OOF across seeds in the same canonical order
oof_avg = np.mean(np.vstack(oof_s_list), axis=0)
y_sorted = train_sup['contact'].astype(int).to_numpy()[ord_idx]
same_sorted = train_sup['same_team'].fillna(0).astype(int).to_numpy()[ord_idx] if 'same_team' in train_sup.columns else np.zeros_like(y_sorted, np.int8)
best_mcc, thr_opp, thr_same = fast_dual_threshold_mcc(y_sorted, oof_avg, same_sorted, grid_points=256)
if (not np.isfinite(best_mcc)) or best_mcc < 0:
    thrs = np.linspace(0.7, 0.85, 31)
    m_list = [matthews_corrcoef(y_sorted, (oof_avg >= t).astype(int)) for t in thrs]
    j = int(np.argmax(m_list)); best_mcc = float(m_list[j]); thr_opp = thr_same = float(thrs[j])
print(f'PP bagged r=4.0 OOF MCC={best_mcc:.5f} | thr_same={thr_same:.4f}, thr_opp={thr_opp:.4f}')

# Average test probs across seeds and threshold
pt_bag = np.mean(np.vstack(test_s_list), axis=0)
dt_keys = test_feats[['game_play','p1','p2','step']].copy().sort_values(['game_play','p1','p2','step'])
# Pull same_team in the same sorted row order as pt_bag (dt_keys.index holds the original row labels)
if 'same_team' in test_feats.columns:
    same_flag_test_arr = test_feats.loc[dt_keys.index, 'same_team'].fillna(0).astype(int).to_numpy()
else:
    same_flag_test_arr = np.zeros(len(dt_keys), int)
thr_arr_test = np.where(same_flag_test_arr == 1, thr_same, thr_opp)
pred_bin_sorted = (pt_bag >= thr_arr_test).astype(int)

# Map back to contact_id
cid_sorted = (dt_keys['game_play'].astype(str) + '_' + dt_keys['step'].astype(str) + '_' + dt_keys['p1'].astype(str) + '_' + dt_keys['p2'].astype(str))
pred_df_pp = pd.DataFrame({'contact_id': cid_sorted.values, 'contact_pp': pred_bin_sorted})

# Build submission from sample and PP preds
ss = pd.read_csv('sample_submission.csv')
sub = ss.merge(pred_df_pp, on='contact_id', how='left')
sub['contact'] = sub['contact_pp'].fillna(0).astype(int)
sub = sub.drop(columns=['contact_pp'])
pp_ones = int(sub['contact'].sum())
print('PP bagged ones before G overwrite:', pp_ones)

# Optional: overwrite G-second rows from previous submission if available
try:
    prev_sub = pd.read_csv('submission.csv')
    g_pred_second = prev_sub[prev_sub['contact_id'].str.endswith('_G')][['contact_id','contact']].rename(columns={'contact':'contact_g'})
    sub = sub.merge(g_pred_second, on='contact_id', how='left')
    sub['contact'] = sub['contact_g'].fillna(sub['contact']).astype(int)
    sub = sub[['contact_id','contact']]
    after_ones = int(sub['contact'].sum())
    print(f'Applied prior G overwrite. ones after={after_ones}, delta={after_ones-pp_ones}')
except Exception as e:
    print('No prior submission with G rows found; skipping G overwrite.', e)
    sub = sub[['contact_id','contact']]

sub.to_csv('submission.csv', index=False)
print('Final PP-bagged (r=4.0) submission saved. Took {:.1f}s'.format(time.time()-t0))

xgboost version (pp-bag-r40): 2.1.4
PP bagging r=4.0: loading artifacts...


Using 50 features
 PP r=4.0 seed 42 ...


   seed 42 fold 0 done in 34.9s; best_it=3253


   seed 42 fold 1 done in 39.7s; best_it=3632


   seed 42 fold 2 done in 37.8s; best_it=3326


   seed 42 fold 3 done in 37.3s; best_it=3446


   seed 42 fold 4 done in 37.2s; best_it=3468


    seed 42 test model 0 0.2s


    seed 42 test model 1 0.2s


    seed 42 test model 2 0.2s


    seed 42 test model 3 0.2s


    seed 42 test model 4 0.2s


 PP r=4.0 seed 1337 ...


   seed 1337 fold 0 done in 38.1s; best_it=3385


   seed 1337 fold 1 done in 39.6s; best_it=3608


   seed 1337 fold 2 done in 36.0s; best_it=3140


   seed 1337 fold 3 done in 38.0s; best_it=3378


   seed 1337 fold 4 done in 39.1s; best_it=3609


    seed 1337 test model 0 0.2s


    seed 1337 test model 1 0.2s


    seed 1337 test model 2 0.2s


    seed 1337 test model 3 0.2s


    seed 1337 test model 4 0.2s


 PP r=4.0 seed 2025 ...


   seed 2025 fold 0 done in 39.0s; best_it=3453


   seed 2025 fold 1 done in 38.4s; best_it=3408


   seed 2025 fold 2 done in 37.9s; best_it=3284


   seed 2025 fold 3 done in 40.6s; best_it=3573


   seed 2025 fold 4 done in 37.1s; best_it=3388


    seed 2025 test model 0 0.2s


    seed 2025 test model 1 0.2s


    seed 2025 test model 2 0.2s


    seed 2025 test model 3 0.2s


    seed 2025 test model 4 0.2s


PP bagged r=4.0 OOF MCC=0.72502 | thr_same=0.7626, thr_opp=0.7713


PP bagged ones before G overwrite: 6810


Applied prior G overwrite. ones after=8862, delta=2052


Final PP-bagged (r=4.0) submission saved. Took 586.2s
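
The probability smoothing used in the cell above can be checked on a toy frame. This is a minimal sketch with made-up values (the frame `toy` and its probabilities are hypothetical, not taken from the artifacts). It only illustrates how a centered window of 3 with `min_periods=1` spreads each peak to its neighbouring rows inside one (game_play, p1, p2) group, and that one pair's scores never leak into another pair.

```python
import pandas as pd

# Hypothetical toy data: two player pairs in one play, with five and three steps.
toy = pd.DataFrame({
    'game_play': ['g_1'] * 8,
    'p1': ['10'] * 5 + ['11'] * 3,
    'p2': ['20'] * 5 + ['21'] * 3,
    'step': [0, 1, 2, 3, 4, 0, 1, 2],
    'prob': [0.1, 0.9, 0.2, 0.1, 0.1, 0.8, 0.1, 0.1],
})

# Same pattern as the notebook: sort on the canonical key, then smooth per pair.
toy = toy.sort_values(['game_play', 'p1', 'p2', 'step'])
grp = toy.groupby(['game_play', 'p1', 'p2'], sort=False)
toy['prob_smooth'] = grp['prob'].transform(
    lambda s: s.rolling(3, center=True, min_periods=1).max()
)
print(toy[['p1', 'step', 'prob', 'prob_smooth']])
```

The window runs over rows, not over step values, so this only behaves like a temporal filter when each pair has one row per consecutive step.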


In [39]:
# Post-processing: apply PP hysteresis (a centered 2-of-3 majority vote over consecutive rows of each pair) to the binary predictions in submission.csv; keep G rows intact
import pandas as pd, numpy as np
from pathlib import Path

sub_path = Path('submission.csv')
assert sub_path.exists(), 'submission.csv not found. Run bagging cell first.'
sub = pd.read_csv(sub_path)
before_ones = int(sub['contact'].sum())

# Parse contact_id
tok = sub['contact_id'].str.split('_', n=4, expand=True)
tok.columns = ['g1','g2','step','a','b']
tok['game_play'] = tok['g1'] + '_' + tok['g2']
tok['step'] = tok['step'].astype(int)

# Identify PP vs G rows (a G row's contact_id ends with _G, so its last token b == 'G')
is_g = tok['b'] == 'G'
pp_df = pd.DataFrame({
    'game_play': tok['game_play'],
    'p1': tok['a'],
    'p2': tok['b'],
    'step': tok['step'],
    'contact': sub['contact'].values
})

# Keep only player-player pairs (neither side is G). In sample, G appears only as second token.
pp_mask = (~is_g) & (pp_df['p1'] != 'G')
pp = pp_df.loc[pp_mask, ['game_play','p1','p2','step','contact']].copy()

# Apply centered rolling majority (2-of-3) per (game_play, p1, p2)
pp = pp.sort_values(['game_play','p1','p2','step'])
grp = pp.groupby(['game_play','p1','p2'], sort=False)['contact']
pp['contact_hyst'] = grp.transform(lambda s: (s.rolling(3, center=True, min_periods=1).sum() >= 2).astype(int))

# Map back to submission
pp_key = (pp['game_play'] + '_' + pp['step'].astype(str) + '_' + pp['p1'] + '_' + pp['p2'])
sub_key = (tok['game_play'] + '_' + tok['step'].astype(str) + '_' + tok['a'] + '_' + tok['b'])
map_h = pd.Series(pp['contact_hyst'].values, index=pp_key.values)
sub_h = sub_key.map(map_h)

# Overwrite only PP rows with hysteresis; keep G rows untouched
sub.loc[pp_mask, 'contact'] = sub_h.loc[pp_mask].fillna(sub.loc[pp_mask, 'contact']).astype(int)

after_ones = int(sub['contact'].sum())
sub.to_csv('submission.csv', index=False)
print(f'Applied PP hysteresis (2-of-3). ones before={before_ones}, after={after_ones}, delta={after_ones-before_ones}')

Applied PP hysteresis (2-of-3). ones before=8862, after=8885, delta=23
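
To make the effect of the 2-of-3 filter concrete, here is a small sketch on hand-written 0/1 sequences. The helper name `majority_2_of_3` and the sequences are hypothetical; the rolling expression is the same one used in the cell above. It shows three behaviours: an isolated positive is removed, a one-row gap inside a run is filled, and at the ends of a sequence the window holds only two rows, so both must be 1.

```python
import pandas as pd

def majority_2_of_3(bits):
    """Centered vote over a 0/1 sequence: keep 1 where the window sum is >= 2."""
    s = pd.Series(bits)
    voted = s.rolling(3, center=True, min_periods=1).sum() >= 2
    return voted.astype(int).tolist()

isolated = majority_2_of_3([0, 0, 1, 0, 0])       # lone positive is dropped
gap = majority_2_of_3([0, 1, 1, 0, 1, 1, 0])      # single-row gap is filled
edge_single = majority_2_of_3([1, 0, 0, 0, 1])    # lone positives at the ends are dropped
edge_pair = majority_2_of_3([1, 1, 0])            # a pair at the start survives

print(isolated, gap, edge_single, edge_pair)
```

The net change of +23 reported above means that, on this submission, slightly more gaps were filled than isolated positives removed.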


In [40]:
# PP r=4.0 multi-seed bagging with 4 thresholds (same/opponent x distance<=1.8), strict alignment; optional G overwrite; save submission
import time, numpy as np, pandas as pd, sys, subprocess
from sklearn.model_selection import GroupKFold
from sklearn.metrics import matthews_corrcoef

try:
    import xgboost as xgb
except Exception as e:
    print('Installing xgboost...', e)
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'xgboost==2.1.4'], check=True)
    import xgboost as xgb
print('xgboost version (pp-bag-r40-dist4):', getattr(xgb, '__version__', 'unknown'))

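def fast_dual_threshold_mcc(y_true, prob, same_flag, grid_points=256):
    """Grid-search one threshold per cohort (same_flag == 0 and same_flag == 1) to maximize the pooled MCC.

    Candidate thresholds are the scores at every (n // (grid_points - 1))-th rank of each cohort's
    descending sort; callers apply them as prob >= threshold.
    Returns (best_mcc, thr for same_flag == 0, thr for same_flag == 1), or (-1.0, 0.79, 0.79) if no MCC is finite.
    """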
    y = np.asarray(y_true, dtype=np.int64)
    p = np.asarray(prob, dtype=np.float64)
    s = np.asarray(same_flag, dtype=np.int8)
    mask = np.isfinite(y) & np.isfinite(p) & np.isfinite(s)
    y, p, s = y[mask], p[mask], s[mask]
    def cohort_counts(yc, pc, G):
        n = yc.size
        if n == 0:
            return dict(tp=np.array([0], np.float64), fp=np.array([0], np.float64),
                        tn=np.array([0], np.float64), fn=np.array([0], np.float64),
                        thr=np.array([1.0], np.float64))
        order = np.argsort(-pc, kind='mergesort')
        ys, ps = yc[order], pc[order]
        P = float(ys.sum()); N = float(n - ys.sum())
        step = max(1, n // max(1, (G - 1)))
        k = np.arange(0, n + 1, step, dtype=np.int64)
        if k[-1] != n: k = np.append(k, n)
        cum = np.concatenate(([0], np.cumsum(ys, dtype=np.int64)))
        tp = cum[k].astype(np.float64); fp = (k - cum[k]).astype(np.float64)
        fn = P - tp; tn = N - fp
        thr = np.where(k == 0, 1.0 + 1e-6, ps[np.maximum(0, k - 1)])
        return dict(tp=tp, fp=fp, tn=tn, fn=fn, thr=thr)
    a = cohort_counts(y[s == 0], p[s == 0], grid_points)
    b = cohort_counts(y[s == 1], p[s == 1], grid_points)
    tp = a['tp'][:, None] + b['tp'][None, :]
    fp = a['fp'][:, None] + b['fp'][None, :]
    tn = a['tn'][:, None] + b['tn'][None, :]
    fn = a['fn'][:, None] + b['fn'][None, :]
    with np.errstate(invalid='ignore', divide='ignore'):
        num = tp * tn - fp * fn
        den = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
        den = np.where(den > 0, np.sqrt(den), np.nan)
        mcc = num / den
    if not np.isfinite(mcc).any():
        return -1.0, 0.79, 0.79
    i, j = np.unravel_index(np.nanargmax(mcc), mcc.shape)
    return float(mcc[i, j]), float(a['thr'][i]), float(b['thr'][j])

t0 = time.time()
print('PP bagging r=4.0 (4-thr by distance): loading artifacts...')
train_sup = pd.read_parquet('train_supervised_w5_helm_dyn_r40.parquet')
test_feats = pd.read_parquet('test_pairs_w5_helm_dyn_r40.parquet')
folds_df = pd.read_csv('folds_game_play.csv')
train_sup = train_sup.merge(folds_df, on='game_play', how='left')
assert train_sup['fold'].notna().all()
for df in (train_sup, test_feats):
    if 'px_dist_norm_min' in df.columns: df['px_dist_norm_min'] = df['px_dist_norm_min'].fillna(1.0)
    if 'views_both_present' in df.columns: df['views_both_present'] = df['views_both_present'].fillna(0).astype(float)

drop_cols = {'contact','game_play','step','p1','p2','team1','team2','pos1','pos2','fold'}
feat_cols = [c for c in train_sup.columns if c not in drop_cols and pd.api.types.is_numeric_dtype(train_sup[c])]
print('Using', len(feat_cols), 'features')

# Canonical order
key_df = train_sup[['game_play','p1','p2','step']].copy()
ord_idx = key_df.sort_values(['game_play','p1','p2','step']).index.to_numpy()

gkf = GroupKFold(n_splits=5)
groups = train_sup['game_play'].values
y_all = train_sup['contact'].astype(int).values
same_all = train_sup['same_team'].fillna(0).astype(int).values if 'same_team' in train_sup.columns else np.zeros(len(train_sup), np.int8)

seeds = [42, 1337, 2025]
oof_list = []
test_list = []

for s in seeds:
    print(f' PP r=4.0 seed {s} ...', flush=True)
    X_all = train_sup[feat_cols].astype(float).values
    oof = np.full(len(train_sup), np.nan, float)
    models = []
    for fold, (tr_idx, va_idx) in enumerate(gkf.split(X_all, y_all, groups=groups)):
        t1 = time.time()
        X_tr, y_tr = X_all[tr_idx], y_all[tr_idx]
        X_va, y_va = X_all[va_idx], y_all[va_idx]
        neg = (y_tr == 0).sum(); posc = (y_tr == 1).sum()
        spw = max(1.0, neg / max(1, posc))
        dtrain = xgb.DMatrix(X_tr, label=y_tr); dvalid = xgb.DMatrix(X_va, label=y_va)
        params = {'tree_method':'hist','device':'cuda','max_depth':7,'eta':0.05,'subsample':0.9,'colsample_bytree':0.8,
                  'min_child_weight':10,'lambda':1.5,'alpha':0.1,'gamma':0.1,'objective':'binary:logistic','eval_metric':'logloss',
                  'scale_pos_weight': float(spw), 'seed': int(s + fold)}
        booster = xgb.train(params, dtrain, num_boost_round=3800, evals=[(dtrain,'train'),(dvalid,'valid')], early_stopping_rounds=200, verbose_eval=False)
        best_it = int(getattr(booster, 'best_iteration', None) or booster.num_boosted_rounds() - 1)
        oof[va_idx] = booster.predict(dvalid, iteration_range=(0, best_it + 1))
        models.append((booster, best_it))
        print(f'   seed {s} fold {fold} done in {time.time()-t1:.1f}s; best_it={best_it}', flush=True)
    # store unsmoothed OOF in canonical order
    oof_list.append(oof[ord_idx])
    # Test predictions (unsmoothed) in canonical order of test
    Xt = test_feats[feat_cols].astype(float).values
    dtest = xgb.DMatrix(Xt)
    pt = np.zeros(len(test_feats), float)
    for i, (booster, best_it) in enumerate(models):
        t1 = time.time()
        pt += booster.predict(dtest, iteration_range=(0, best_it + 1))
        print(f'    seed {s} test model {i} {time.time()-t1:.1f}s', flush=True)
    pt /= max(1, len(models))
    dt = test_feats[['game_play','p1','p2','step']].copy()
    dt = dt.sort_values(['game_play','p1','p2','step'])
    test_list.append(pt[dt.index.values])

# Average across seeds, then smooth by (gp,p1,p2) with window=3 centered
oof_avg = np.mean(np.vstack(oof_list), axis=0)
ts_keys = train_sup[['game_play','p1','p2','step']].iloc[ord_idx].copy()
df_o = ts_keys.copy(); df_o['prob'] = oof_avg
df_o = df_o.sort_values(['game_play','p1','p2','step'])
grp = df_o.groupby(['game_play','p1','p2'], sort=False)
df_o['prob_smooth'] = grp['prob'].transform(lambda s_: s_.rolling(3, center=True, min_periods=1).max())
oof_smooth = df_o['prob_smooth'].to_numpy()

y_sorted = train_sup['contact'].astype(int).to_numpy()[ord_idx]
same_sorted = train_sup['same_team'].fillna(0).astype(int).to_numpy()[ord_idx] if 'same_team' in train_sup.columns else np.zeros_like(y_sorted, np.int8)
dist_sorted = train_sup['distance'].astype(float).to_numpy()[ord_idx] if 'distance' in train_sup.columns else np.full_like(y_sorted, 10.0, float)
bin_sorted = (dist_sorted <= 1.8).astype(np.int8)  # 1=close, 0=far

# 4 thresholds: by distance bin (close/far) x same_team.
# Each bin is tuned on its own MCC, which need not maximize the pooled MCC evaluated below.
thr_dict = {}  # bin -> (thr_opp, thr_same)
for b in (0, 1):
    mask = (bin_sorted == b)
    if mask.sum() == 0:
        thr_dict[b] = (0.79, 0.79)
        continue
    mcc_b, t_opp_b, t_same_b = fast_dual_threshold_mcc(y_sorted[mask], oof_smooth[mask], same_sorted[mask], grid_points=256)
    if (not np.isfinite(mcc_b)) or mcc_b < 0:
        thrs = np.linspace(0.7, 0.85, 31)
        ml = [matthews_corrcoef(y_sorted[mask], (oof_smooth[mask] >= t).astype(int)) for t in thrs]
        j = int(np.argmax(ml)); t_opp_b = t_same_b = float(thrs[j])
    thr_dict[b] = (float(t_opp_b), float(t_same_b))
print('Thresholds by dist bin (0=far,1=close):', thr_dict)

# Evaluate combined OOF MCC
thr_arr = np.empty(len(oof_smooth), dtype=float)
for b in (0, 1):
    m = (bin_sorted == b)
    t_opp, t_same = thr_dict[b]
    thr_arr[m] = np.where(same_sorted[m] == 1, t_same, t_opp)
pred_oof = (oof_smooth >= thr_arr).astype(int)
oof_mcc_all = matthews_corrcoef(y_sorted, pred_oof)
print(f'OOF MCC with 4 thresholds (dist bins): {oof_mcc_all:.5f}')

# Test: average across seeds, then smooth, then apply 4 thresholds
pt_avg = np.mean(np.vstack(test_list), axis=0)
dt_keys = test_feats[['game_play','p1','p2','step']].copy().sort_values(['game_play','p1','p2','step'])
df_t = dt_keys.copy()
df_t['prob'] = pt_avg
grp_t = df_t.groupby(['game_play','p1','p2'], sort=False)
df_t['prob_smooth'] = grp_t['prob'].transform(lambda s_: s_.rolling(3, center=True, min_periods=1).max())
# Pull same_team and distance in the same sorted row order as df_t (dt_keys.index holds the original row labels)
if 'same_team' in test_feats.columns:
    same_flag_test_arr = test_feats.loc[dt_keys.index, 'same_team'].fillna(0).astype(int).to_numpy()
else:
    same_flag_test_arr = np.zeros(len(df_t), int)
if 'distance' in test_feats.columns:
    dist_test = test_feats.loc[dt_keys.index, 'distance'].astype(float).to_numpy()
else:
    dist_test = np.full(len(df_t), 10.0, float)
bin_test = (dist_test <= 1.8).astype(np.int8)
thr_arr_test = np.where(same_flag_test_arr == 1,
                        np.where(bin_test == 1, thr_dict[1][1], thr_dict[0][1]),
                        np.where(bin_test == 1, thr_dict[1][0], thr_dict[0][0]))
pred_bin_sorted = (df_t['prob_smooth'].to_numpy() >= thr_arr_test).astype(int)

# Build submission and optional G overwrite from prior submission.csv
cid_sorted = (df_t['game_play'].astype(str) + '_' + df_t['step'].astype(str) + '_' + df_t['p1'].astype(str) + '_' + df_t['p2'].astype(str))
pred_df_pp = pd.DataFrame({'contact_id': cid_sorted.values, 'contact_pp': pred_bin_sorted})
ss = pd.read_csv('sample_submission.csv')
sub = ss.merge(pred_df_pp, on='contact_id', how='left')
sub['contact'] = sub['contact_pp'].fillna(0).astype(int)
sub = sub.drop(columns=['contact_pp'])
pp_ones = int(sub['contact'].sum())
print('PP bagged (r40, 4-thr) ones before G overwrite:', pp_ones)
try:
    prev_sub = pd.read_csv('submission.csv')
    g_pred_second = prev_sub[prev_sub['contact_id'].str.endswith('_G')][['contact_id','contact']].rename(columns={'contact':'contact_g'})
    sub = sub.merge(g_pred_second, on='contact_id', how='left')
    sub['contact'] = sub['contact_g'].fillna(sub['contact']).astype(int)
    sub = sub[['contact_id','contact']]
    after_ones = int(sub['contact'].sum())
    print(f'Applied prior G overwrite. ones after={after_ones}, delta={after_ones-pp_ones}')
except Exception as e:
    print('No prior submission with G rows found; skipping G overwrite.', e)
    sub = sub[['contact_id','contact']]

sub.to_csv('submission.csv', index=False)
print('Saved submission.csv. Took {:.1f}s'.format(time.time()-t0))

xgboost version (pp-bag-r40-dist4): 2.1.4
PP bagging r=4.0 (4-thr by distance): loading artifacts...


Using 50 features
 PP r=4.0 seed 42 ...


   seed 42 fold 0 done in 36.7s; best_it=3253


   seed 42 fold 1 done in 39.7s; best_it=3632


   seed 42 fold 2 done in 37.9s; best_it=3326


   seed 42 fold 3 done in 38.8s; best_it=3446


   seed 42 fold 4 done in 37.2s; best_it=3468


    seed 42 test model 0 0.2s


    seed 42 test model 1 0.2s


    seed 42 test model 2 0.2s


    seed 42 test model 3 0.2s


    seed 42 test model 4 0.2s


 PP r=4.0 seed 1337 ...


   seed 1337 fold 0 done in 38.1s; best_it=3385


   seed 1337 fold 1 done in 39.7s; best_it=3608


   seed 1337 fold 2 done in 36.0s; best_it=3140


   seed 1337 fold 3 done in 38.1s; best_it=3378


   seed 1337 fold 4 done in 39.1s; best_it=3609


    seed 1337 test model 0 0.2s


    seed 1337 test model 1 0.2s


    seed 1337 test model 2 0.2s


    seed 1337 test model 3 0.2s


    seed 1337 test model 4 0.2s


 PP r=4.0 seed 2025 ...


   seed 2025 fold 0 done in 39.2s; best_it=3453


   seed 2025 fold 1 done in 38.7s; best_it=3408


   seed 2025 fold 2 done in 38.1s; best_it=3284


   seed 2025 fold 3 done in 40.8s; best_it=3573


   seed 2025 fold 4 done in 37.2s; best_it=3388


    seed 2025 test model 0 0.2s


    seed 2025 test model 1 0.2s


    seed 2025 test model 2 0.2s


    seed 2025 test model 3 0.2s


    seed 2025 test model 4 0.2s


Thresholds by dist bin (0=far,1=close): {0: (0.5131852726141611, 1.000001), 1: (0.8557006319363912, 0.7985922495524088)}
OOF MCC with 4 thresholds (dist bins): 0.72028


PP bagged (r40, 4-thr) ones before G overwrite: 6097


Applied prior G overwrite. ones after=8149, delta=2052


Saved submission.csv. Took 584.3s
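
The 4-threshold run reports a pooled OOF MCC of 0.72028, below the 0.72502 from the dual-threshold cell. Two differences may contribute, though neither is verified here: this cell smooths after averaging seeds rather than per seed, and each distance bin is tuned on its own MCC, which is not the same objective as the pooled MCC. The sketch below shows one way to tune all cohort thresholds against the pooled score directly. It is an illustration under stated assumptions, not the notebook's method: `mcc_from_preds`, `pooled_mcc` and `tune_pooled` are hypothetical helpers, the data is synthetic, and the coordinate search is a simple swapped-in technique.

```python
import numpy as np

def mcc_from_preds(y, pred):
    """Matthews correlation from binary labels and predictions (0.0 if undefined)."""
    tp = float(np.sum((y == 1) & (pred == 1)))
    tn = float(np.sum((y == 0) & (pred == 0)))
    fp = float(np.sum((y == 0) & (pred == 1)))
    fn = float(np.sum((y == 1) & (pred == 0)))
    den = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    return (tp * tn - fp * fn) / np.sqrt(den) if den > 0 else 0.0

def pooled_mcc(y, p, cohort, thr):
    """MCC over all rows, with each row's threshold looked up by its cohort id."""
    return mcc_from_preds(y, (p >= thr[cohort]).astype(int))

def tune_pooled(y, p, cohort, n_cohorts, grid, init=0.5, sweeps=2):
    """Coordinate search: change one cohort's threshold at a time and keep a
    candidate only when it improves the pooled MCC."""
    thr = np.full(n_cohorts, init, dtype=float)
    best = pooled_mcc(y, p, cohort, thr)
    for _ in range(sweeps):
        for c in range(n_cohorts):
            for t in grid:
                trial = thr.copy()
                trial[c] = t
                score = pooled_mcc(y, p, cohort, trial)
                if score > best:
                    best, thr = score, trial
    return best, thr

# Synthetic stand-in data (hypothetical; not the competition features).
rng = np.random.default_rng(0)
n = 4000
cohort = rng.integers(0, 4, size=n)  # e.g. cohort = 2 * distance_bin + same_team
y = (rng.random(n) < 0.15).astype(int)
p = np.clip(0.25 + 0.45 * y + 0.05 * cohort + rng.normal(0, 0.15, n), 0, 1)

baseline = pooled_mcc(y, p, cohort, np.full(4, 0.5))
best, thr = tune_pooled(y, p, cohort, 4, np.linspace(0.3, 0.9, 25))
print(f'shared 0.5: {baseline:.4f} | tuned: {best:.4f} | thresholds: {np.round(thr, 3)}')
```

Because a candidate is accepted only when the pooled score improves, the tuned result can never fall below the starting point. Seeding the search with the dual-threshold values would therefore give a 4-threshold OOF score at least as high as the dual-threshold one.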


In [41]:
# Blend r=3.5 and r=4.0 PP bagged probabilities (0.6/0.4), smooth, dual-threshold, apply PP hysteresis, keep prior G overwrite
import time, numpy as np, pandas as pd, sys, subprocess
from sklearn.model_selection import GroupKFold
from sklearn.metrics import matthews_corrcoef

try:
    import xgboost as xgb
except Exception as e:
    print('Installing xgboost...', e)
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'xgboost==2.1.4'], check=True)
    import xgboost as xgb
print('xgboost version (blend r35+r40):', getattr(xgb, '__version__', 'unknown'))

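def fast_dual_threshold_mcc(y_true, prob, same_flag, grid_points=256):
    """Grid-search one threshold per cohort (same_flag == 0 and same_flag == 1) to maximize the pooled MCC.

    Candidate thresholds are the scores at every (n // (grid_points - 1))-th rank of each cohort's
    descending sort; callers apply them as prob >= threshold.
    Returns (best_mcc, thr for same_flag == 0, thr for same_flag == 1), or (-1.0, 0.79, 0.79) if no MCC is finite.
    """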
    y = np.asarray(y_true, dtype=np.int64)
    p = np.asarray(prob, dtype=np.float64)
    s = np.asarray(same_flag, dtype=np.int8)
    mask = np.isfinite(y) & np.isfinite(p) & np.isfinite(s)
    y, p, s = y[mask], p[mask], s[mask]
    def cohort_counts(yc, pc, G):
        n = yc.size
        if n == 0:
            return dict(tp=np.array([0], np.float64), fp=np.array([0], np.float64), tn=np.array([0], np.float64), fn=np.array([0], np.float64), thr=np.array([1.0], np.float64))
        order = np.argsort(-pc, kind='mergesort')
        ys, ps = yc[order], pc[order]
        P = float(ys.sum()); N = float(n - ys.sum())
        step = max(1, n // max(1, (G - 1)))
        k = np.arange(0, n + 1, step, dtype=np.int64)
        if k[-1] != n: k = np.append(k, n)
        cum = np.concatenate(([0], np.cumsum(ys, dtype=np.int64)))
        tp = cum[k].astype(np.float64); fp = (k - cum[k]).astype(np.float64)
        fn = P - tp; tn = N - fp
        thr = np.where(k == 0, 1.0 + 1e-6, ps[np.maximum(0, k - 1)])
        return dict(tp=tp, fp=fp, tn=tn, fn=fn, thr=thr)
    a = cohort_counts(y[s == 0], p[s == 0], grid_points)
    b = cohort_counts(y[s == 1], p[s == 1], grid_points)
    tp = a['tp'][:, None] + b['tp'][None, :]
    fp = a['fp'][:, None] + b['fp'][None, :]
    tn = a['tn'][:, None] + b['tn'][None, :]
    fn = a['fn'][:, None] + b['fn'][None, :]
    with np.errstate(invalid='ignore', divide='ignore'):
        num = tp * tn - fp * fn
        den = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
        den = np.where(den > 0, np.sqrt(den), np.nan)
        mcc = num / den
    if not np.isfinite(mcc).any():
        return -1.0, 0.79, 0.79
    i, j = np.unravel_index(np.nanargmax(mcc), mcc.shape)
    return float(mcc[i, j]), float(a['thr'][i]), float(b['thr'][j])

def run_bag(radius_tag, train_sup_path, test_feats_path, seeds=(42,1337,2025)):
    print(f'PP bagging {radius_tag}: load...')
    train_sup = pd.read_parquet(train_sup_path)
    test_feats = pd.read_parquet(test_feats_path)
    folds_df = pd.read_csv('folds_game_play.csv')
    train_sup = train_sup.merge(folds_df, on='game_play', how='left')
    assert train_sup['fold'].notna().all()
    for df in (train_sup, test_feats):
        if 'px_dist_norm_min' in df.columns: df['px_dist_norm_min'] = df['px_dist_norm_min'].fillna(1.0)
        if 'views_both_present' in df.columns: df['views_both_present'] = df['views_both_present'].fillna(0).astype(float)
    drop_cols = {'contact','game_play','step','p1','p2','team1','team2','pos1','pos2','fold'}
    feat_cols = [c for c in train_sup.columns if c not in drop_cols and pd.api.types.is_numeric_dtype(train_sup[c])]
    print(f' Using {len(feat_cols)} features')
    key_df = train_sup[['game_play','p1','p2','step']].copy()
    ord_idx = key_df.sort_values(['game_play','p1','p2','step']).index.to_numpy()
    gkf = GroupKFold(n_splits=5)
    groups = train_sup['game_play'].values
    y_all = train_sup['contact'].astype(int).values
    oof_list = []; test_list = []
    for s in seeds:
        print(f'  {radius_tag} seed {s} ...', flush=True)
        X_all = train_sup[feat_cols].astype(float).values
        oof = np.full(len(train_sup), np.nan, float)
        models = []
        for fold, (tr_idx, va_idx) in enumerate(gkf.split(X_all, y_all, groups=groups)):
            t1 = time.time()
            X_tr, y_tr = X_all[tr_idx], y_all[tr_idx]
            X_va, y_va = X_all[va_idx], y_all[va_idx]
            neg = (y_tr == 0).sum(); posc = (y_tr == 1).sum()
            spw = max(1.0, neg / max(1, posc))
            dtrain = xgb.DMatrix(X_tr, label=y_tr); dvalid = xgb.DMatrix(X_va, label=y_va)
            params = {'tree_method':'hist','device':'cuda','max_depth':7,'eta':0.05,'subsample':0.9,'colsample_bytree':0.8,
                      'min_child_weight':10,'lambda':1.5,'alpha':0.1,'gamma':0.1,'objective':'binary:logistic','eval_metric':'logloss',
                      'scale_pos_weight': float(spw), 'seed': int(s + fold)}
            booster = xgb.train(params, dtrain, num_boost_round=3800, evals=[(dtrain,'train'),(dvalid,'valid')], early_stopping_rounds=200, verbose_eval=False)
            best_it = int(getattr(booster, 'best_iteration', None) or booster.num_boosted_rounds() - 1)
            oof[va_idx] = booster.predict(dvalid, iteration_range=(0, best_it + 1))
            models.append((booster, best_it))
            print(f'   {radius_tag} seed {s} fold {fold} {time.time()-t1:.1f}s; best_it={best_it}', flush=True)
        oof_list.append(oof[ord_idx])
        Xt = test_feats[feat_cols].astype(float).values
        dtest = xgb.DMatrix(Xt)
        pt = np.zeros(len(test_feats), float)
        for i, (booster, best_it) in enumerate(models):
            t1 = time.time()
            pt += booster.predict(dtest, iteration_range=(0, best_it + 1))
            print(f'    {radius_tag} seed {s} test model {i} {time.time()-t1:.1f}s', flush=True)
        pt /= max(1, len(models))
        dt = test_feats[['game_play','p1','p2','step']].copy().sort_values(['game_play','p1','p2','step'])
        test_list.append(pt[dt.index.values])
    # pack outputs and keys
    keys_tr = train_sup[['game_play','p1','p2','step']].iloc[ord_idx].copy()
    keys_te = test_feats[['game_play','p1','p2','step']].copy().sort_values(['game_play','p1','p2','step'])
    return keys_tr, np.mean(np.vstack(oof_list), axis=0), keys_te, np.mean(np.vstack(test_list), axis=0), train_sup, test_feats, ord_idx

t0 = time.time()
print('Running bagging for r=3.5 and r=4.0...')
keys_tr35, oof35, keys_te35, pt35, tr35, te35, ord35 = run_bag('r3.5', 'train_supervised_w5_helm_dyn_r35.parquet', 'test_pairs_w5_helm_dyn_r35.parquet')
keys_tr40, oof40, keys_te40, pt40, tr40, te40, ord40 = run_bag('r4.0', 'train_supervised_w5_helm_dyn_r40.parquet', 'test_pairs_w5_helm_dyn_r40.parquet')

# Ensure same key domains for train to blend OOF: inner-join on keys
kt35 = keys_tr35.copy(); kt35['key'] = kt35['game_play'] + '|' + kt35['p1'] + '|' + kt35['p2'] + '|' + kt35['step'].astype(str)
kt40 = keys_tr40.copy(); kt40['key'] = kt40['game_play'] + '|' + kt40['p1'] + '|' + kt40['p2'] + '|' + kt40['step'].astype(str)
df35 = kt35[['key']].copy(); df35['p35'] = oof35
df40 = kt40[['key']].copy(); df40['p40'] = oof40
blend_tr = df35.merge(df40, on='key', how='inner')
w35, w40 = 0.6, 0.4
blend_tr['p_blend'] = w35 * blend_tr['p35'] + w40 * blend_tr['p40']

# Smooth blended OOF by (gp,p1,p2)
keys_split = blend_tr['key'].str.split('|', expand=True)
keys_split.columns = ['game_play','p1','p2','step']
tmp = keys_split.copy()
tmp['step'] = tmp['step'].astype(int)
tmp['p'] = blend_tr['p_blend'].values
tmp = tmp.sort_values(['game_play','p1','p2','step'])
grp = tmp.groupby(['game_play','p1','p2'], sort=False)
tmp['p_smooth'] = grp['p'].transform(lambda s: s.rolling(3, center=True, min_periods=1).max())
y_map = tr40[['game_play','p1','p2','step','contact','same_team']].copy()
y_map['key'] = (y_map['game_play'] + '|' + y_map['p1'] + '|' + y_map['p2'] + '|' + y_map['step'].astype(str))
eval_df = tmp.copy()
eval_df['key'] = (eval_df['game_play'] + '|' + eval_df['p1'] + '|' + eval_df['p2'] + '|' + eval_df['step'].astype(str))
eval_df = eval_df.merge(y_map[['key','contact','same_team']], on='key', how='inner')
# Read probabilities and labels from the same merged frame so they stay row-aligned even if the merge changes the row count
oof_blend_smooth = eval_df['p_smooth'].to_numpy()
y_sorted = eval_df['contact'].astype(int).to_numpy()
same_sorted = eval_df['same_team'].fillna(0).astype(int).to_numpy() if 'same_team' in eval_df.columns else np.zeros(len(eval_df), np.int8)

best_mcc, thr_opp, thr_same = fast_dual_threshold_mcc(y_sorted, oof_blend_smooth, same_sorted, grid_points=256)
if (not np.isfinite(best_mcc)) or best_mcc < 0:
    thrs = np.linspace(0.7, 0.85, 31)
    m_list = [matthews_corrcoef(y_sorted, (oof_blend_smooth >= t).astype(int)) for t in thrs]
    j = int(np.argmax(m_list)); best_mcc = float(m_list[j]); thr_opp = thr_same = float(thrs[j])
print(f'Blended OOF MCC={best_mcc:.5f} | thr_same={thr_same:.4f}, thr_opp={thr_opp:.4f}')

# Test: align r3.5 and r4.0 by keys then blend
kte35 = keys_te35.copy(); kte35['key'] = kte35['game_play'] + '|' + kte35['p1'] + '|' + kte35['p2'] + '|' + kte35['step'].astype(str)
kte40 = keys_te40.copy(); kte40['key'] = kte40['game_play'] + '|' + kte40['p1'] + '|' + kte40['p2'] + '|' + kte40['step'].astype(str)
te35_df = kte35[['key']].copy(); te35_df['p35'] = pt35
te40_df = kte40[['key']].copy(); te40_df['p40'] = pt40
blend_te = te40_df.merge(te35_df, on='key', how='inner')  # use inner to ensure both available
blend_te['p_blend'] = w35 * blend_te['p35'] + w40 * blend_te['p40']
ks = blend_te['key'].str.split('|', expand=True); ks.columns = ['game_play','p1','p2','step']
df_t = ks.copy(); df_t['step'] = df_t['step'].astype(int); df_t['prob'] = blend_te['p_blend'].values
df_t = df_t.sort_values(['game_play','p1','p2','step'])
grp_t = df_t.groupby(['game_play','p1','p2'], sort=False)
df_t['prob_smooth'] = grp_t['prob'].transform(lambda s: s.rolling(3, center=True, min_periods=1).max())

# Apply dual thresholds by same_team on test
st = te40[['game_play','p1','p2','step','same_team']].copy()
st['key'] = (st['game_play'] + '|' + st['p1'] + '|' + st['p2'] + '|' + st['step'].astype(str))
df_t['key'] = (df_t['game_play'] + '|' + df_t['p1'] + '|' + df_t['p2'] + '|' + df_t['step'].astype(str))
df_t = df_t.merge(st[['key','same_team']], on='key', how='left')
same_flag_test = df_t['same_team'].fillna(0).astype(int).to_numpy() if 'same_team' in df_t.columns else np.zeros(len(df_t), int)
thr_arr_test = np.where(same_flag_test == 1, thr_same, thr_opp)
df_t['pred_bin'] = (df_t['prob_smooth'].to_numpy() >= thr_arr_test).astype(int)

# Build submission from sample and PP preds
cid_sorted = (df_t['game_play'].astype(str) + '_' + df_t['step'].astype(str) + '_' + df_t['p1'].astype(str) + '_' + df_t['p2'].astype(str))
pred_df_pp = pd.DataFrame({'contact_id': cid_sorted.values, 'contact_pp': df_t['pred_bin'].astype(int).values})
ss = pd.read_csv('sample_submission.csv')
sub = ss.merge(pred_df_pp, on='contact_id', how='left')
sub['contact'] = sub['contact_pp'].fillna(0).astype(int)
sub = sub.drop(columns=['contact_pp'])
pp_ones = int(sub['contact'].sum())
print('PP blended (r35*0.6 + r40*0.4) ones before G overwrite:', pp_ones)

# Optional: overwrite G-second rows from previous submission if available
try:
    prev_sub = pd.read_csv('submission.csv')
    g_pred_second = prev_sub[prev_sub['contact_id'].str.endswith('_G')][['contact_id','contact']].rename(columns={'contact':'contact_g'})
    sub = sub.merge(g_pred_second, on='contact_id', how='left')
    sub['contact'] = sub['contact_g'].fillna(sub['contact']).astype(int)
    sub = sub[['contact_id','contact']]
    after_ones = int(sub['contact'].sum())
    print(f'Applied prior G overwrite. ones after={after_ones}, delta={after_ones-pp_ones}')
except Exception as e:
    print('No prior submission with G rows found; skipping G overwrite.', e)
    sub = sub[['contact_id','contact']]

# Apply PP hysteresis (2-of-3) on PP rows only
tok = sub['contact_id'].str.split('_', n=4, expand=True)
tok.columns = ['g1','g2','step','a','b']
tok['game_play'] = tok['g1'] + '_' + tok['g2']
tok['step'] = tok['step'].astype(int)
is_g = tok['b'] == 'G'
pp_mask = (~is_g) & (tok['a'] != 'G')
pp_df = pd.DataFrame({'game_play': tok['game_play'], 'p1': tok['a'], 'p2': tok['b'], 'step': tok['step'], 'contact': sub['contact'].values})
pp = pp_df.loc[pp_mask, ['game_play','p1','p2','step','contact']].copy().sort_values(['game_play','p1','p2','step'])
grp_pp = pp.groupby(['game_play','p1','p2'], sort=False)['contact']
pp['contact_hyst'] = grp_pp.transform(lambda s: (s.rolling(3, center=True, min_periods=1).sum() >= 2).astype(int))
pp_key = (pp['game_play'] + '_' + pp['step'].astype(str) + '_' + pp['p1'] + '_' + pp['p2'])
sub_key = (tok['game_play'] + '_' + tok['step'].astype(str) + '_' + tok['a'] + '_' + tok['b'])
map_h = pd.Series(pp['contact_hyst'].values, index=pp_key.values)
sub_h = sub_key.map(map_h)
before_ones_all = int(sub['contact'].sum())
sub.loc[pp_mask, 'contact'] = sub_h.loc[pp_mask].fillna(sub.loc[pp_mask, 'contact']).astype(int)
after_ones_all = int(sub['contact'].sum())
print(f'Applied PP hysteresis. ones before={before_ones_all}, after={after_ones_all}, delta={after_ones_all-before_ones_all}')

sub.to_csv('submission.csv', index=False)
print('Blended submission saved. Total time {:.1f}s'.format(time.time()-t0))

xgboost version (blend r35+r40): 2.1.4
Running bagging for r=3.5 and r=4.0...
PP bagging r3.5: load...


 Using 50 features
  r3.5 seed 42 ...


   r3.5 seed 42 fold 0 29.7s; best_it=3094


   r3.5 seed 42 fold 1 30.2s; best_it=3182


   r3.5 seed 42 fold 2 32.0s; best_it=3270


   r3.5 seed 42 fold 3 29.4s; best_it=2995


   r3.5 seed 42 fold 4 27.2s; best_it=2873


    r3.5 seed 42 test model 0 0.1s


    r3.5 seed 42 test model 1 0.1s


    r3.5 seed 42 test model 2 0.1s


    r3.5 seed 42 test model 3 0.1s


    r3.5 seed 42 test model 4 0.1s


  r3.5 seed 1337 ...


   r3.5 seed 1337 fold 0 30.9s; best_it=3211


   r3.5 seed 1337 fold 1 31.8s; best_it=3291


   r3.5 seed 1337 fold 2 33.3s; best_it=3408


   r3.5 seed 1337 fold 3 29.7s; best_it=2955


   r3.5 seed 1337 fold 4 27.0s; best_it=2747


    r3.5 seed 1337 test model 0 0.2s


    r3.5 seed 1337 test model 1 0.2s


    r3.5 seed 1337 test model 2 0.2s


    r3.5 seed 1337 test model 3 0.1s


    r3.5 seed 1337 test model 4 0.1s


  r3.5 seed 2025 ...


   r3.5 seed 2025 fold 0 28.9s; best_it=3011


   r3.5 seed 2025 fold 1 32.2s; best_it=3338


   r3.5 seed 2025 fold 2 31.3s; best_it=3189


   r3.5 seed 2025 fold 3 29.3s; best_it=2925


   r3.5 seed 2025 fold 4 28.6s; best_it=2898


    r3.5 seed 2025 test model 0 0.1s


    r3.5 seed 2025 test model 1 0.2s


    r3.5 seed 2025 test model 2 0.1s


    r3.5 seed 2025 test model 3 0.1s


    r3.5 seed 2025 test model 4 0.1s


PP bagging r4.0: load...


 Using 50 features
  r4.0 seed 42 ...


   r4.0 seed 42 fold 0 35.0s; best_it=3253


   r4.0 seed 42 fold 1 39.9s; best_it=3632


   r4.0 seed 42 fold 2 38.1s; best_it=3326


   r4.0 seed 42 fold 3 37.7s; best_it=3446


   r4.0 seed 42 fold 4 37.8s; best_it=3468


    r4.0 seed 42 test model 0 0.2s


    r4.0 seed 42 test model 1 0.2s


    r4.0 seed 42 test model 2 0.2s


    r4.0 seed 42 test model 3 0.2s


    r4.0 seed 42 test model 4 0.2s


  r4.0 seed 1337 ...


   r4.0 seed 1337 fold 0 37.7s; best_it=3385


   r4.0 seed 1337 fold 1 39.9s; best_it=3608


   r4.0 seed 1337 fold 2 36.1s; best_it=3140


   r4.0 seed 1337 fold 3 38.1s; best_it=3378


   r4.0 seed 1337 fold 4 38.8s; best_it=3609


    r4.0 seed 1337 test model 0 0.2s


    r4.0 seed 1337 test model 1 0.2s


    r4.0 seed 1337 test model 2 0.2s


    r4.0 seed 1337 test model 3 0.2s


    r4.0 seed 1337 test model 4 0.2s


  r4.0 seed 2025 ...


   r4.0 seed 2025 fold 0 38.3s; best_it=3453


   r4.0 seed 2025 fold 1 39.1s; best_it=3408


   r4.0 seed 2025 fold 2 36.0s; best_it=3284


   r4.0 seed 2025 fold 3 40.9s; best_it=3573


   r4.0 seed 2025 fold 4 37.6s; best_it=3388


    r4.0 seed 2025 test model 0 0.2s


    r4.0 seed 2025 test model 1 0.2s


    r4.0 seed 2025 test model 2 0.2s


    r4.0 seed 2025 test model 3 0.2s


    r4.0 seed 2025 test model 4 0.2s


Blended OOF MCC=0.72128 | thr_same=0.7433, thr_opp=0.7328


PP blended (r35*0.6 + r40*0.4) ones before G overwrite: 7044


Applied prior G overwrite. ones after=9096, delta=2052


Applied PP hysteresis. ones before=9096, after=9121, delta=25


Blended submission saved. Total time 1043.1s
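
The 2-of-3 hysteresis applied to the PP rows above can be illustrated on a toy series. This is a minimal sketch rather than the pipeline code: the helper name `hysteresis_2of3` and the toy frame are assumptions made for illustration. It shows that isolated positives are dropped while a one-row gap inside a run is filled.

```python
import pandas as pd

def hysteresis_2of3(df, keys, col='contact'):
    """Keep a 1 only where at least 2 of the 3 centered rows in the group are 1."""
    df = df.sort_values(keys + ['step'])
    vote = df.groupby(keys, sort=False)[col].transform(
        lambda s: (s.rolling(3, center=True, min_periods=1).sum() >= 2).astype(int)
    )
    return df.assign(contact_hyst=vote)

# Hypothetical single pair with two isolated positives (steps 1 and 4)
# and a run with a one-row gap (steps 4..7).
toy = pd.DataFrame({
    'game_play': ['gp'] * 8,
    'p1': ['A'] * 8,
    'p2': ['B'] * 8,
    'step': range(8),
    'contact': [0, 1, 0, 0, 1, 0, 1, 1],
})
out = hysteresis_2of3(toy, ['game_play', 'p1', 'p2'])
print(out[['step', 'contact', 'contact_hyst']].to_string(index=False))
```

Note that an integer rolling window counts rows, not steps, so if a pair is missing some steps the three "neighbors" may be more than one step apart.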


In [42]:
# G head with ±2 label expansion; smooth + 2-of-3 hysteresis; overwrite *_G rows in submission
import time, math, sys, subprocess, numpy as np, pandas as pd
from sklearn.model_selection import GroupKFold
from sklearn.metrics import matthews_corrcoef

try:
    import xgboost as xgb
except Exception as e:
    print('Installing xgboost for G head...', e)
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'xgboost==2.1.4'], check=True)
    import xgboost as xgb
print('xgboost version (G head ±2):', getattr(xgb, '__version__', 'unknown'))

t0 = time.time()
print('G head ±2: building per-player features (reuse fast pipeline) ...')

# Base tracking
trk_cols = ['game_play','step','nfl_player_id','team','position','x_position','y_position','speed','acceleration','direction','orientation']
tr_trk = pd.read_csv('train_player_tracking.csv', usecols=trk_cols).copy()
te_trk = pd.read_csv('test_player_tracking.csv', usecols=trk_cols).copy()
for df in (tr_trk, te_trk):
    df['nfl_player_id'] = df['nfl_player_id'].astype(int).astype(str)

def circ_diff_deg(a, b):
    d = (a - b + 180.0) % 360.0 - 180.0
    return np.abs(d)

def build_player_dyn(df: pd.DataFrame) -> pd.DataFrame:
    df = df.sort_values(['game_play','nfl_player_id','step']).copy()
    grp = df.groupby(['game_play','nfl_player_id'], sort=False)
    df['d_speed_1'] = grp['speed'].diff(1)
    df['d_speed_3'] = df['speed'] - grp['speed'].shift(3)
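    df['d_accel_1'] = grp['acceleration'].diff(1)
    # NOTE: 'jerk' is computed exactly like 'd_accel_1' (first difference of acceleration), so the two columns are duplicates
    df['jerk'] = grp['acceleration'].diff(1)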
    for col in ['speed','acceleration']:
        s = grp[col]
        df[f'{col}_min_p3'] = s.rolling(3, min_periods=1).min().reset_index(level=[0,1], drop=True)
        df[f'{col}_mean_p3'] = s.rolling(3, min_periods=1).mean().reset_index(level=[0,1], drop=True)
        df[f'{col}_std_p3'] = s.rolling(3, min_periods=1).std().reset_index(level=[0,1], drop=True)
        df[f'{col}_min_p5'] = s.rolling(5, min_periods=1).min().reset_index(level=[0,1], drop=True)
        df[f'{col}_mean_p5'] = s.rolling(5, min_periods=1).mean().reset_index(level=[0,1], drop=True)
        df[f'{col}_std_p5'] = s.rolling(5, min_periods=1).std().reset_index(level=[0,1], drop=True)
    df['dir_orient_diff'] = circ_diff_deg(df['direction'].fillna(0.0), df['orientation'].fillna(0.0))
    df['dist_to_sideline'] = np.minimum(df['y_position'], 53.3 - df['y_position'])
    df['near_sideline'] = ((df['y_position'] <= 2.0) | (df['y_position'] >= 51.3)).astype(int)
    df['near_goal'] = ((df['x_position'] <= 3.0) | (df['x_position'] >= 117.0)).astype(int)
    for c in ['d_speed_1','d_speed_3','d_accel_1','jerk','speed_std_p3','speed_std_p5','acceleration_std_p3','acceleration_std_p5']:
        if c in df.columns:
            df[c] = df[c].fillna(0.0)
    return df

tr_p = build_player_dyn(tr_trk)
te_p = build_player_dyn(te_trk)

# Opponent context from r=3.5 pairs (prebuilt files exist)
tr_pairs = pd.read_parquet('train_pairs_r35.parquet')
te_pairs = pd.read_parquet('test_pairs_r35.parquet')

def pairs_to_player_ctx(pairs: pd.DataFrame) -> pd.DataFrame:
    a = pairs[['game_play','step','p1','distance']].rename(columns={'p1':'nfl_player_id'})
    b = pairs[['game_play','step','p2','distance']].rename(columns={'p2':'nfl_player_id'})
    u = pd.concat([a, b], ignore_index=True)
    g = u.groupby(['game_play','step','nfl_player_id'], sort=False)
    out = g['distance'].agg(min_opp_dist='min').reset_index()
    for thr, name in [(1.5,'lt15'), (2.0,'lt20'), (2.5,'lt25')]:
        u[name] = (u['distance'] < thr).astype(int)
        cnt = u.groupby(['game_play','step','nfl_player_id'], sort=False)[name].sum().rename(f'cnt_opp_{name}')
        out = out.merge(cnt.reset_index(), on=['game_play','step','nfl_player_id'], how='left')
    return out

tr_ctx = pairs_to_player_ctx(tr_pairs)
te_ctx = pairs_to_player_ctx(te_pairs)

# Helmet per-player aggregates and deltas
train_helm = pd.read_csv('train_baseline_helmets.csv')
test_helm = pd.read_csv('test_baseline_helmets.csv')
train_vmeta = pd.read_csv('train_video_metadata.csv')
test_vmeta = pd.read_csv('test_video_metadata.csv')
FPS = 59.94
def prep_meta(vmeta: pd.DataFrame):
    vm = vmeta.copy()
    for c in ['start_time','snap_time']:
        if not np.issubdtype(vm[c].dtype, np.number):
            ts = pd.to_datetime(vm[c], errors='coerce')
            vm[c] = (ts - ts.dt.floor('D')).dt.total_seconds().astype(float)
    vm['snap_frame'] = ((vm['snap_time'] - vm['start_time']) * FPS).round().astype('Int64')
    return vm[['game_play','view','snap_frame']].drop_duplicates()
meta_tr = prep_meta(train_vmeta)
meta_te = prep_meta(test_vmeta)

def helm_player_agg(helm: pd.DataFrame, meta: pd.DataFrame) -> pd.DataFrame:
    df = helm[['game_play','view','frame','nfl_player_id','left','top','width','height']].copy()
    df = df.dropna(subset=['nfl_player_id'])
    df['nfl_player_id'] = df['nfl_player_id'].astype(int).astype(str)
    df['area'] = df['width'] * df['height']
    df['cx'] = df['left'] + 0.5 * df['width']
    df['cy'] = df['top'] + 0.5 * df['height']
    df = df.sort_values(['game_play','view','frame','nfl_player_id','area'], ascending=[True,True,True,True,False]).drop_duplicates(['game_play','view','frame','nfl_player_id'], keep='first')
    df = df.merge(meta, on=['game_play','view'], how='left')
    df['step'] = ((df['frame'] - df['snap_frame']).astype('float') / 6.0).round().astype('Int64')
    df = df.dropna(subset=['step']); df['step'] = df['step'].astype(int)
    # expand ±1 to align tolerance with steps
    dm1 = df.copy(); dm1['target_step'] = dm1['step'] - 1
    d0 = df.copy(); d0['target_step'] = d0['step']
    dp1 = df.copy(); dp1['target_step'] = df['step'] + 1
    d = pd.concat([dm1, d0, dp1], ignore_index=True)
    agg = d.groupby(['game_play','target_step','nfl_player_id'], sort=False).agg(
        cy_mean=('cy','mean'), h_mean=('height','mean'), cnt=('cx','size')
    ).reset_index().rename(columns={'target_step':'step'})
    agg = agg.sort_values(['game_play','nfl_player_id','step'])
    g = agg.groupby(['game_play','nfl_player_id'], sort=False)
    agg['d_cy_1'] = g['cy_mean'].diff(1).fillna(0.0)
    agg['d_h_1'] = g['h_mean'].diff(1).fillna(0.0)
    return agg

h_tr_p = helm_player_agg(train_helm, meta_tr)
h_te_p = helm_player_agg(test_helm, meta_te)

def merge_all(base: pd.DataFrame, ctx: pd.DataFrame, helm: pd.DataFrame) -> pd.DataFrame:
    df = base.merge(ctx, on=['game_play','step','nfl_player_id'], how='left')
    df = df.merge(helm, on=['game_play','step','nfl_player_id'], how='left')
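    # NOTE: a missing min_opp_dist means no pair row was found for this player-step, yet 0.0 reads as "zero distance";
    # a large sentinel may be a safer fill, but 0.0 is kept here so the logged results below stay reproducible
    for c in ['min_opp_dist','cnt_opp_lt15','cnt_opp_lt20','cnt_opp_lt25','cy_mean','h_mean','d_cy_1','d_h_1']: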
        if c in df.columns:
            df[c] = df[c].fillna(0.0)
    return df

tr_feat_p = merge_all(tr_p, tr_ctx, h_tr_p)
te_feat_p = merge_all(te_p, te_ctx, h_te_p)
print('Per-player train/test feature shapes:', tr_feat_p.shape, te_feat_p.shape)

# Supervision for G with ±2 expansion
labels = pd.read_csv('train_labels.csv', usecols=['contact_id','game_play','step','nfl_player_id_1','nfl_player_id_2','contact'])
labels['pid1'] = labels['nfl_player_id_1'].astype(str); labels['pid2'] = labels['nfl_player_id_2'].astype(str)
mask_g = (labels['pid1'] == 'G') | (labels['pid2'] == 'G')
g_labels = labels.loc[mask_g, ['game_play','step','pid1','pid2','contact']].copy()
g_labels['player'] = np.where(g_labels['pid1'] == 'G', g_labels['pid2'], g_labels['pid1'])
g_labels = g_labels[['game_play','step','player','contact']]
sup_g = g_labels.merge(tr_feat_p.rename(columns={'nfl_player_id':'player'}), on=['game_play','step','player'], how='inner')
print('G supervised inner shape:', sup_g.shape, 'pos rate:', sup_g['contact'].mean())
pos = sup_g.loc[sup_g['contact'] == 1, ['game_play','step','player']]
ex = [pos.assign(step=pos['step'] + d) for d in (-2,-1,1,2)]
pos_exp = pd.concat(ex, ignore_index=True).drop_duplicates()
pos_exp['flag_pos_exp'] = 1
sup_g = sup_g.merge(pos_exp, on=['game_play','step','player'], how='left')
sup_g.loc[sup_g['flag_pos_exp'] == 1, 'contact'] = 1
sup_g.drop(columns=['flag_pos_exp'], inplace=True)
print('G after ±2 expansion pos rate:', sup_g['contact'].mean())

# Train XGB with GKF
drop_cols = {'contact','game_play','step','player','team','position','nfl_player_id'}
feat_cols = [c for c in sup_g.columns if c not in drop_cols and pd.api.types.is_numeric_dtype(sup_g[c])]
print('G feature count:', len(feat_cols))
X_all = sup_g[feat_cols].astype(float).values
y_all = sup_g['contact'].astype(int).values
groups = sup_g['game_play'].values
gkf = GroupKFold(n_splits=5)
oof = np.full(len(sup_g), np.nan, float)
models = []
for fold, (tr_idx, va_idx) in enumerate(gkf.split(X_all, y_all, groups=groups)):
    t1 = time.time()
    X_tr, y_tr = X_all[tr_idx], y_all[tr_idx]
    X_va, y_va = X_all[va_idx], y_all[va_idx]
    neg = (y_tr == 0).sum(); posc = (y_tr == 1).sum()
    spw = max(1.0, neg / max(1, posc))
    print(f'G±2 Fold {fold}: train {len(tr_idx)} (pos {posc}), valid {len(va_idx)} (pos {(y_va==1).sum()}), spw={spw:.2f}', flush=True)
    dtrain = xgb.DMatrix(X_tr, label=y_tr)
    dvalid = xgb.DMatrix(X_va, label=y_va)
    params = {'tree_method':'hist','device':'cuda','max_depth':6,'eta':0.05,'subsample':0.9,'colsample_bytree':0.8,
              'min_child_weight':10,'lambda':1.5,'alpha':0.0,'objective':'binary:logistic','eval_metric':'logloss',
              'scale_pos_weight': float(spw), 'seed': 2025 + fold}
    booster = xgb.train(params, dtrain, num_boost_round=2000, evals=[(dtrain,'train'),(dvalid,'valid')], early_stopping_rounds=100, verbose_eval=False)
    bi = getattr(booster, 'best_iteration', None)
    best_it = int(bi) if bi is not None else booster.num_boosted_rounds() - 1  # explicit None check so best_iteration == 0 is not discarded
    oof[va_idx] = booster.predict(dvalid, iteration_range=(0, best_it + 1))
    models.append((booster, best_it))
    print(f' G±2 Fold {fold} done in {time.time()-t1:.1f}s; best_it={best_it}', flush=True)

# Smooth OOF (roll-max(5)) and apply 2-of-3 hysteresis for evaluation thresholding
oof_df = sup_g[['game_play','player','step']].copy()
oof_df['oof'] = oof
oof_df = oof_df.sort_values(['game_play','player','step'])
grp_o = oof_df.groupby(['game_play','player'], sort=False)
oof_df['oof_smooth'] = grp_o['oof'].transform(lambda s: s.rolling(5, center=True, min_periods=1).max())
oof_smooth = oof_df['oof_smooth'].values
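# NOTE: these are the ±2-expanded labels, so the OOF MCC below is scored against expanded, not original, contact labels
y_sorted = sup_g.loc[oof_df.index, 'contact'].astype(int).values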

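def apply_min_dur(bin_arr, gp, pl):
    # 2-of-3 vote within each (game_play, player). transform returns values aligned to the input rows,
    # so the result does not depend on the groups being contiguous in the input.
    df = pd.DataFrame({'gp': gp, 'pl': pl, 'b': bin_arr})
    out = df.groupby(['gp','pl'], sort=False)['b'].transform(lambda s: (s.rolling(3, center=True, min_periods=1).sum() >= 2).astype(int))
    return out.values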

best_thr, best_mcc = 0.6, -1.0
thr_grid = np.linspace(0.4, 0.85, 46)
gp_arr = oof_df['game_play'].values
pl_arr = oof_df['player'].values
for thr in thr_grid:
    pred0 = (oof_smooth >= thr).astype(int)
    pred = apply_min_dur(pred0, gp_arr, pl_arr)
    m = matthews_corrcoef(y_sorted, pred)
    if m > best_mcc:
        best_mcc, best_thr = float(m), float(thr)
print(f'G±2 OOF MCC={best_mcc:.5f} at thr={best_thr:.2f}')

# Inference on test
Xt = te_feat_p[feat_cols].astype(float).values
dtest = xgb.DMatrix(Xt)
pt = np.zeros(len(te_feat_p), dtype=float)
for i, (booster, best_it) in enumerate(models):
    t1 = time.time()
    pt += booster.predict(dtest, iteration_range=(0, best_it + 1))
    print(f' G±2 Inference model {i} took {time.time()-t1:.1f}s')
pt /= max(1, len(models))
pred_tmp = te_feat_p[['game_play','step','nfl_player_id']].rename(columns={'nfl_player_id':'player'}).copy()
pred_tmp['prob'] = pt
pred_tmp = pred_tmp.sort_values(['game_play','player','step'])
grp_t = pred_tmp.groupby(['game_play','player'], sort=False)
pred_tmp['prob_smooth'] = grp_t['prob'].transform(lambda s: s.rolling(5, center=True, min_periods=1).max())
bin0 = (pred_tmp['prob_smooth'].values >= best_thr).astype(int)
bin1 = apply_min_dur(bin0, pred_tmp['game_play'].values, pred_tmp['player'].values)
pred_tmp['pred_bin'] = bin1.astype(int)

# Build G contact_id with player_G (second token is G) and overwrite submission
g_cid_second = (pred_tmp['game_play'].astype(str) + '_' + pred_tmp['step'].astype(str) + '_' + pred_tmp['player'].astype(str) + '_G')
g_pred_second = pd.DataFrame({'contact_id': g_cid_second, 'contact': pred_tmp['pred_bin'].astype(int)})

sub = pd.read_csv('submission.csv')
before_ones = int(sub['contact'].sum())
# Left-merge the G predictions and fall back to the existing value where no G prediction exists
sub = sub.merge(g_pred_second.rename(columns={'contact':'contact_g'}), on='contact_id', how='left')
sub['contact'] = sub['contact_g'].fillna(sub['contact']).astype(int)
sub = sub[['contact_id','contact']]
after_ones = int(sub['contact'].sum())
sub.to_csv('submission.csv', index=False)
print(f'G±2 overwrite done. ones before={before_ones}, after={after_ones}, delta={after_ones-before_ones}')
print('G±2 head done in {:.1f}s'.format(time.time()-t0))

xgboost version (G head ±2): 2.1.4
G head ±2: building per-player features (reuse fast pipeline) ...


Per-player train/test feature shapes: (1225299, 40) (127754, 40)


G supervised inner shape: (370351, 41) pos rate: 0.041106949893479426
G after ±2 expansion pos rate: 0.04628312060720776
G feature count: 35


G±2 Fold 0: train 296428 (pos 13560), valid 73923 (pos 3581), spw=20.86


 G±2 Fold 0 done in 9.2s; best_it=1475


G±2 Fold 1: train 296519 (pos 13800), valid 73832 (pos 3341), spw=20.49


 G±2 Fold 1 done in 8.4s; best_it=1352


G±2 Fold 2: train 296365 (pos 14505), valid 73986 (pos 2636), spw=19.43


 G±2 Fold 2 done in 8.2s; best_it=1344


G±2 Fold 3: train 296365 (pos 13238), valid 73986 (pos 3903), spw=21.39


 G±2 Fold 3 done in 9.1s; best_it=1472


G±2 Fold 4: train 295727 (pos 13461), valid 74624 (pos 3680), spw=20.97


 G±2 Fold 4 done in 9.2s; best_it=1488


G±2 OOF MCC=0.53988 at thr=0.80
 G±2 Inference model 0 took 0.0s
 G±2 Inference model 1 took 0.0s
 G±2 Inference model 2 took 0.0s
 G±2 Inference model 3 took 0.0s
 G±2 Inference model 4 took 0.0s


G±2 overwrite done. ones before=9121, after=8903, delta=-218
G±2 head done in 146.4s
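
The ±2 label expansion used for the G head can be shown on a toy frame. This is a sketch under stated assumptions, not the cell's exact code: `expand_positives` and the toy data are hypothetical names introduced here. As in the cell above, only rows that already exist are relabeled, because the shifted positives are joined back with a left merge.

```python
import pandas as pd

def expand_positives(labels, keys, radius=2):
    """Mark existing rows within `radius` steps of a positive row (same keys) as positive."""
    pos = labels.loc[labels['contact'] == 1, keys + ['step']]
    shifted = [pos.assign(step=pos['step'] + d)
               for d in range(-radius, radius + 1) if d != 0]
    near = pd.concat(shifted, ignore_index=True).drop_duplicates()
    near['near_pos'] = 1
    out = labels.merge(near, on=keys + ['step'], how='left')
    out.loc[out['near_pos'] == 1, 'contact'] = 1
    return out.drop(columns=['near_pos'])

# Hypothetical player with a single positive at step 3.
toy = pd.DataFrame({
    'game_play': ['gp'] * 8,
    'player': ['7'] * 8,
    'step': range(8),
    'contact': [0, 0, 0, 1, 0, 0, 0, 0],
})
expanded = expand_positives(toy, ['game_play', 'player'])
print(expanded[['step', 'contact']].to_string(index=False))
```

In the run logged above, the same idea raised the positive rate from about 0.0411 to about 0.0463.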


In [45]:
# PP r=4.0 bagging with per-step cap=2 per player (on smoothed probs) -> dual thresholds -> 2-of-3 hysteresis; keep prior G overwrite
import time, numpy as np, pandas as pd, sys, subprocess
from sklearn.model_selection import GroupKFold
from sklearn.metrics import matthews_corrcoef

try:
    import xgboost as xgb
except Exception as e:
    print('Installing xgboost...', e)
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'xgboost==2.1.4'], check=True)
    import xgboost as xgb
print('xgboost version (pp-bag-r40-cap2):', getattr(xgb, '__version__', 'unknown'))

def fast_dual_threshold_mcc(y_true, prob, same_flag, grid_points=256):
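    """Grid-search one threshold per cohort (same_flag == 0 and same_flag == 1) to maximize the pooled MCC.

    Returns (best_mcc, thr for same_flag == 0, thr for same_flag == 1); the caller below unpacks this as (mcc, thr_opp, thr_same).
    """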
    y = np.asarray(y_true, dtype=np.int64)
    p = np.asarray(prob, dtype=np.float64)
    s = np.asarray(same_flag, dtype=np.int8)
    mask = np.isfinite(y) & np.isfinite(p) & np.isfinite(s)
    y, p, s = y[mask], p[mask], s[mask]
    def cohort_counts(yc, pc, G):
        n = yc.size
        if n == 0:
            return dict(tp=np.array([0], np.float64), fp=np.array([0], np.float64), tn=np.array([0], np.float64), fn=np.array([0], np.float64), thr=np.array([1.0], np.float64))
        order = np.argsort(-pc, kind='mergesort')
        ys, ps = yc[order], pc[order]
        P = float(ys.sum()); N = float(n - ys.sum())
        step = max(1, n // max(1, (G - 1)))
        k = np.arange(0, n + 1, step, dtype=np.int64)
        if k[-1] != n: k = np.append(k, n)
        cum = np.concatenate(([0], np.cumsum(ys, dtype=np.int64)))
        tp = cum[k].astype(np.float64); fp = (k - cum[k]).astype(np.float64)
        fn = P - tp; tn = N - fp
        thr = np.where(k == 0, 1.0 + 1e-6, ps[np.maximum(0, k - 1)])
        return dict(tp=tp, fp=fp, tn=tn, fn=fn, thr=thr)
    a = cohort_counts(y[s == 0], p[s == 0], grid_points)
    b = cohort_counts(y[s == 1], p[s == 1], grid_points)
    tp = a['tp'][:, None] + b['tp'][None, :]
    fp = a['fp'][:, None] + b['fp'][None, :]
    tn = a['tn'][:, None] + b['tn'][None, :]
    fn = a['fn'][:, None] + b['fn'][None, :]
    with np.errstate(invalid='ignore', divide='ignore'):
        num = tp * tn - fp * fn
        den = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
        den = np.where(den > 0, np.sqrt(den), np.nan)
        mcc = num / den
    if not np.isfinite(mcc).any():
        return -1.0, 0.79, 0.79
    i, j = np.unravel_index(np.nanargmax(mcc), mcc.shape)
    return float(mcc[i, j]), float(a['thr'][i]), float(b['thr'][j])

t0 = time.time()
print('PP bagging r=4.0 with cap=2: loading artifacts...')
train_sup = pd.read_parquet('train_supervised_w5_helm_dyn_r40.parquet')
test_feats = pd.read_parquet('test_pairs_w5_helm_dyn_r40.parquet')
folds_df = pd.read_csv('folds_game_play.csv')
train_sup = train_sup.merge(folds_df, on='game_play', how='left')
assert train_sup['fold'].notna().all()
for df in (train_sup, test_feats):
    if 'px_dist_norm_min' in df.columns: df['px_dist_norm_min'] = df['px_dist_norm_min'].fillna(1.0)
    if 'views_both_present' in df.columns: df['views_both_present'] = df['views_both_present'].fillna(0).astype(float)

drop_cols = {'contact','game_play','step','p1','p2','team1','team2','pos1','pos2','fold'}
feat_cols = [c for c in train_sup.columns if c not in drop_cols and pd.api.types.is_numeric_dtype(train_sup[c])]
print('Using', len(feat_cols), 'features')

# Canonical order
ord_idx = train_sup[['game_play','p1','p2','step']].sort_values(['game_play','p1','p2','step']).index.to_numpy()
gkf = GroupKFold(n_splits=5)
groups = train_sup['game_play'].values
y_all = train_sup['contact'].astype(int).values
same_all = train_sup['same_team'].fillna(0).astype(int).values if 'same_team' in train_sup.columns else np.zeros(len(train_sup), np.int8)
seeds = [42,1337,2025]
oof_s_list = []; test_s_list = []

for s in seeds:
    print(f' PP r=4.0 seed {s} ...', flush=True)
    X_all = train_sup[feat_cols].astype(float).values
    oof = np.full(len(train_sup), np.nan, float)
    models = []
    for fold, (tr_idx, va_idx) in enumerate(gkf.split(X_all, y_all, groups=groups)):
        t1 = time.time()
        X_tr, y_tr = X_all[tr_idx], y_all[tr_idx]
        X_va, y_va = X_all[va_idx], y_all[va_idx]
        neg = (y_tr == 0).sum(); posc = (y_tr == 1).sum()
        spw = max(1.0, neg / max(1, posc))
        dtrain = xgb.DMatrix(X_tr, label=y_tr); dvalid = xgb.DMatrix(X_va, label=y_va)
        params = {'tree_method':'hist','device':'cuda','max_depth':7,'eta':0.05,'subsample':0.9,'colsample_bytree':0.8,
                  'min_child_weight':10,'lambda':1.5,'alpha':0.1,'gamma':0.1,'objective':'binary:logistic','eval_metric':'logloss',
                  'scale_pos_weight': float(spw), 'seed': int(s + fold)}
        booster = xgb.train(params, dtrain, num_boost_round=3800, evals=[(dtrain,'train'),(dvalid,'valid')], early_stopping_rounds=200, verbose_eval=False)
        bi = getattr(booster, 'best_iteration', None)
        best_it = int(bi) if bi is not None else booster.num_boosted_rounds() - 1  # explicit None check so best_iteration == 0 is not discarded
        oof[va_idx] = booster.predict(dvalid, iteration_range=(0, best_it + 1))
        models.append((booster, best_it))
        print(f'   seed {s} fold {fold} done in {time.time()-t1:.1f}s; best_it={best_it}', flush=True)
    # Smooth OOF on canonical order
    df = train_sup[['game_play','p1','p2','step']].iloc[ord_idx].copy()
    df['oof'] = oof[ord_idx]
    df = df.sort_values(['game_play','p1','p2','step'])
    grp = df.groupby(['game_play','p1','p2'], sort=False)
    df['oof_smooth'] = grp['oof'].transform(lambda s_: s_.rolling(3, center=True, min_periods=1).max())
    oof_s_list.append(df['oof_smooth'].to_numpy())

    # Test predictions per seed
    Xt = test_feats[feat_cols].astype(float).values
    dtest = xgb.DMatrix(Xt)
    pt = np.zeros(len(test_feats), float)
    for i, (booster, best_it) in enumerate(models):
        t1 = time.time(); pt += booster.predict(dtest, iteration_range=(0, best_it + 1));
        print(f'    seed {s} test model {i} {time.time()-t1:.1f}s', flush=True)
    pt /= max(1, len(models))
    dt = test_feats[['game_play','p1','p2','step']].copy().sort_values(['game_play','p1','p2','step'])
    dt['prob'] = pt[dt.index.values]
    grp_t = dt.groupby(['game_play','p1','p2'], sort=False)
    dt['prob_smooth'] = grp_t['prob'].transform(lambda s_: s_.rolling(3, center=True, min_periods=1).max())
    test_s_list.append(dt['prob_smooth'].to_numpy())

# Average OOF across seeds
oof_avg = np.mean(np.vstack(oof_s_list), axis=0)
y_sorted = train_sup['contact'].astype(int).to_numpy()[ord_idx]
same_sorted = train_sup['same_team'].fillna(0).astype(int).to_numpy()[ord_idx] if 'same_team' in train_sup.columns else np.zeros_like(y_sorted, np.int8)
best_mcc, thr_opp, thr_same = fast_dual_threshold_mcc(y_sorted, oof_avg, same_sorted, grid_points=256)
if (not np.isfinite(best_mcc)) or best_mcc < 0:
    thrs = np.linspace(0.7, 0.85, 31)
    m_list = [matthews_corrcoef(y_sorted, (oof_avg >= t).astype(int)) for t in thrs]
    j = int(np.argmax(m_list)); best_mcc = float(m_list[j]); thr_opp = thr_same = float(thrs[j])
print(f'PP bagged r=4.0 OOF MCC={best_mcc:.5f} | thr_same={thr_same:.4f}, thr_opp={thr_opp:.4f}')

# Average test probs, then apply cap=2 per (game_play, step, player) using smoothed probs
pt_bag = np.mean(np.vstack(test_s_list), axis=0)
df_t = test_feats[['game_play','p1','p2','step']].copy().sort_values(['game_play','p1','p2','step']).reset_index(drop=True)
df_t['prob_smooth'] = pt_bag
df_t['row_id'] = np.arange(len(df_t))

# Build long frame for players (both sides) to rank and keep top-2 per player-step
long1 = df_t[['game_play','step','p1','prob_smooth','row_id']].rename(columns={'p1':'player','prob_smooth':'prob'})
long2 = df_t[['game_play','step','p2','prob_smooth','row_id']].rename(columns={'p2':'player','prob_smooth':'prob'})
df_long = pd.concat([long1, long2], ignore_index=True)
df_long = df_long.sort_values(['game_play','step','player','prob'], ascending=[True, True, True, False])
df_long['rank'] = df_long.groupby(['game_play','step','player'], sort=False)['prob'].rank(method='first', ascending=False)
kept_rows = set(df_long.loc[df_long['rank'] <= 2, 'row_id'].tolist())
keep_mask = df_t['row_id'].isin(kept_rows).to_numpy()
df_t.loc[~keep_mask, 'prob_smooth'] = 0.0
print('Applied cap=2 per player-step. Kept rows:', int(keep_mask.sum()), 'out of', len(keep_mask))

# Threshold by same_team
same_flag_test = test_feats[['game_play','p1','p2','step','same_team']].copy()
same_flag_test = same_flag_test.merge(df_t[['game_play','p1','p2','step','row_id']], on=['game_play','p1','p2','step'], how='right').sort_values('row_id')
same_flag_arr = same_flag_test['same_team'].fillna(0).astype(int).to_numpy() if 'same_team' in same_flag_test.columns else np.zeros(len(df_t), int)
thr_arr_test = np.where(same_flag_arr == 1, thr_same, thr_opp)
df_t['pred_bin'] = (df_t['prob_smooth'].to_numpy() >= thr_arr_test).astype(int)

# Apply 2-of-3 hysteresis per (gp,p1,p2)
df_h = df_t[['game_play','p1','p2','step','pred_bin']].copy().sort_values(['game_play','p1','p2','step'])
grp_h = df_h.groupby(['game_play','p1','p2'], sort=False)['pred_bin']
df_h['pred_hyst'] = grp_h.transform(lambda s: (s.rolling(3, center=True, min_periods=1).sum() >= 2).astype(int))
df_t = df_t.merge(df_h[['game_play','p1','p2','step','pred_hyst']], on=['game_play','p1','p2','step'], how='left')

# Build submission from sample and PP preds (after hysteresis), then overwrite G from prior submission.csv
cid_sorted = (df_t['game_play'].astype(str) + '_' + df_t['step'].astype(str) + '_' + df_t['p1'].astype(str) + '_' + df_t['p2'].astype(str))
pred_df_pp = pd.DataFrame({'contact_id': cid_sorted.values, 'contact_pp': df_t['pred_hyst'].astype(int).values})
ss = pd.read_csv('sample_submission.csv')
sub = ss.merge(pred_df_pp, on='contact_id', how='left')
sub['contact'] = sub['contact_pp'].fillna(0).astype(int)
sub = sub.drop(columns=['contact_pp'])
pp_ones = int(sub['contact'].sum())
print('PP (r40 bag+cap2+hyst) ones before G overwrite:', pp_ones)
try:
    prev_sub = pd.read_csv('submission.csv')
    g_pred_second = prev_sub[prev_sub['contact_id'].str.endswith('_G')][['contact_id','contact']].rename(columns={'contact':'contact_g'})
    sub = sub.merge(g_pred_second, on='contact_id', how='left')
    sub['contact'] = sub['contact_g'].fillna(sub['contact']).astype(int)
    sub = sub[['contact_id','contact']]
    after_ones = int(sub['contact'].sum())
    print(f'Applied prior G overwrite. ones after={after_ones}, delta={after_ones-pp_ones}')
except Exception as e:
    print('No prior submission with G rows found; skipping G overwrite.', e)
    sub = sub[['contact_id','contact']]

sub.to_csv('submission.csv', index=False)
print('Saved submission.csv. Took {:.1f}s'.format(time.time()-t0))

xgboost version (pp-bag-r40-cap2): 2.1.4
PP bagging r=4.0 with cap=2: loading artifacts...


Using 50 features
 PP r=4.0 seed 42 ...


   seed 42 fold 0 done in 34.9s; best_it=3253


   seed 42 fold 1 done in 39.7s; best_it=3632


   seed 42 fold 2 done in 37.9s; best_it=3326


   seed 42 fold 3 done in 37.4s; best_it=3446


   seed 42 fold 4 done in 37.3s; best_it=3468


    seed 42 test model 0 0.2s


    seed 42 test model 1 0.2s


    seed 42 test model 2 0.2s


    seed 42 test model 3 0.2s


    seed 42 test model 4 0.2s


 PP r=4.0 seed 1337 ...


   seed 1337 fold 0 done in 38.2s; best_it=3385


   seed 1337 fold 1 done in 39.7s; best_it=3608


   seed 1337 fold 2 done in 36.0s; best_it=3140


   seed 1337 fold 3 done in 38.0s; best_it=3378


   seed 1337 fold 4 done in 39.1s; best_it=3609


    seed 1337 test model 0 0.2s


    seed 1337 test model 1 0.2s


    seed 1337 test model 2 0.2s


    seed 1337 test model 3 0.2s


    seed 1337 test model 4 0.2s


 PP r=4.0 seed 2025 ...


   seed 2025 fold 0 done in 38.8s; best_it=3453


   seed 2025 fold 1 done in 38.3s; best_it=3408


   seed 2025 fold 2 done in 37.9s; best_it=3284


   seed 2025 fold 3 done in 40.7s; best_it=3573


   seed 2025 fold 4 done in 37.2s; best_it=3388


    seed 2025 test model 0 0.2s


    seed 2025 test model 1 0.2s


    seed 2025 test model 2 0.2s


    seed 2025 test model 3 0.2s


    seed 2025 test model 4 0.2s


PP bagged r=4.0 OOF MCC=0.72502 | thr_same=0.7626, thr_opp=0.7713


Applied cap=2 per player-step. Kept rows: 122291 out of 278492


PP (r40 bag+cap2+hyst) ones before G overwrite: 6565


Applied prior G overwrite. ones after=8617, delta=2052


Saved submission.csv. Took 587.3s
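
The cap=2 step above keeps a pair row if it ranks in the top 2 for either of its two players, so it is a union rule rather than a strict per-player limit. The sketch below illustrates this on a toy step; `cap_per_player_step` and the data are hypothetical and only mirror the ranking logic of the cell, under the assumption that each row is one (p1, p2) pair at one step.

```python
import numpy as np
import pandas as pd

def cap_per_player_step(pairs, k=2):
    """Zero the probability of pair rows that are not in the top-k for either player."""
    pairs = pairs.reset_index(drop=True).copy()
    pairs['row_id'] = np.arange(len(pairs))
    # Long frame: each pair row appears once per participating player.
    long = pd.concat([
        pairs[['game_play', 'step', 'p1', 'prob', 'row_id']].rename(columns={'p1': 'player'}),
        pairs[['game_play', 'step', 'p2', 'prob', 'row_id']].rename(columns={'p2': 'player'}),
    ], ignore_index=True)
    long['rank'] = long.groupby(['game_play', 'step', 'player'])['prob'].rank(
        method='first', ascending=False)
    keep = pairs['row_id'].isin(long.loc[long['rank'] <= k, 'row_id'])
    pairs.loc[~keep, 'prob'] = 0.0
    return pairs.drop(columns=['row_id'])

# Hypothetical step with four players and all six pairs present.
toy = pd.DataFrame({
    'game_play': ['gp'] * 6,
    'step': [0] * 6,
    'p1': ['A', 'A', 'A', 'B', 'B', 'C'],
    'p2': ['B', 'C', 'D', 'C', 'D', 'D'],
    'prob': [0.9, 0.8, 0.7, 0.6, 0.5, 0.4],
})
capped = cap_per_player_step(toy)
print(capped.to_string(index=False))
```

Only the C-D row is zeroed here. Player A still has three surviving rows because A-D is in the top 2 for player D, which is one reason the cap removes fewer rows than a strict per-player limit would.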


In [46]:
# PP r=4.0 bagging with fold-median dual thresholds, matching CV post-proc to test (smooth -> cap2), then hysteresis and G overwrite
import time, numpy as np, pandas as pd, sys, subprocess
from sklearn.model_selection import GroupKFold
from sklearn.metrics import matthews_corrcoef

try:
    import xgboost as xgb
except Exception as e:
    print('Installing xgboost...', e)
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'xgboost==2.1.4'], check=True)
    import xgboost as xgb
print('xgboost version (pp-bag-r40-fold-median):', getattr(xgb, '__version__', 'unknown'))

def fast_dual_threshold_mcc(y_true, prob, same_flag, grid_points=256):
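    """Grid-search one threshold per cohort (same_flag == 0 and same_flag == 1) to maximize the pooled MCC.

    Returns (best_mcc, thr for same_flag == 0, thr for same_flag == 1).
    """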
    y = np.asarray(y_true, dtype=np.int64)
    p = np.asarray(prob, dtype=np.float64)
    s = np.asarray(same_flag, dtype=np.int8)
    mask = np.isfinite(y) & np.isfinite(p) & np.isfinite(s)
    y, p, s = y[mask], p[mask], s[mask]
    def cohort_counts(yc, pc, G):
        n = yc.size
        if n == 0:
            return dict(tp=np.array([0], np.float64), fp=np.array([0], np.float64), tn=np.array([0], np.float64), fn=np.array([0], np.float64), thr=np.array([1.0], np.float64))
        order = np.argsort(-pc, kind='mergesort')
        ys, ps = yc[order], pc[order]
        P = float(ys.sum()); N = float(n - ys.sum())
        step = max(1, n // max(1, (G - 1)))
        k = np.arange(0, n + 1, step, dtype=np.int64)
        if k[-1] != n: k = np.append(k, n)
        cum = np.concatenate(([0], np.cumsum(ys, dtype=np.int64)))
        tp = cum[k].astype(np.float64); fp = (k - cum[k]).astype(np.float64)
        fn = P - tp; tn = N - fp
        thr = np.where(k == 0, 1.0 + 1e-6, ps[np.maximum(0, k - 1)])
        return dict(tp=tp, fp=fp, tn=tn, fn=fn, thr=thr)
    a = cohort_counts(y[s == 0], p[s == 0], grid_points)
    b = cohort_counts(y[s == 1], p[s == 1], grid_points)
    tp = a['tp'][:, None] + b['tp'][None, :]
    fp = a['fp'][:, None] + b['fp'][None, :]
    tn = a['tn'][:, None] + b['tn'][None, :]
    fn = a['fn'][:, None] + b['fn'][None, :]
    with np.errstate(invalid='ignore', divide='ignore'):
        num = tp * tn - fp * fn
        den = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
        den = np.where(den > 0, np.sqrt(den), np.nan)
        mcc = num / den
    if not np.isfinite(mcc).any():
        return -1.0, 0.79, 0.79
    i, j = np.unravel_index(np.nanargmax(mcc), mcc.shape)
    return float(mcc[i, j]), float(a['thr'][i]), float(b['thr'][j])

t0 = time.time()
print('Fold-median thresholds run: loading r=4.0 supervised dyn and test features...')
train_sup = pd.read_parquet('train_supervised_w5_helm_dyn_r40.parquet')
test_feats = pd.read_parquet('test_pairs_w5_helm_dyn_r40.parquet')
folds_df = pd.read_csv('folds_game_play.csv')
train_sup = train_sup.merge(folds_df, on='game_play', how='left')
assert train_sup['fold'].notna().all()
for df in (train_sup, test_feats):
    if 'px_dist_norm_min' in df.columns: df['px_dist_norm_min'] = df['px_dist_norm_min'].fillna(1.0)
    if 'views_both_present' in df.columns: df['views_both_present'] = df['views_both_present'].fillna(0).astype(float)

drop_cols = {'contact','game_play','step','p1','p2','team1','team2','pos1','pos2','fold'}
feat_cols = [c for c in train_sup.columns if c not in drop_cols and pd.api.types.is_numeric_dtype(train_sup[c])]
print('Using', len(feat_cols), 'features')

# Canonical order
ord_idx = train_sup[['game_play','p1','p2','step']].sort_values(['game_play','p1','p2','step']).index.to_numpy()
gkf = GroupKFold(n_splits=5)
groups = train_sup['game_play'].values
y_all = train_sup['contact'].astype(int).values
same_all = train_sup['same_team'].fillna(0).astype(int).values if 'same_team' in train_sup.columns else np.zeros(len(train_sup), np.int8)
fold_arr = train_sup['fold'].astype(int).to_numpy()

seeds = [42,1337,2025]
oof_s_list = []
test_s_list = []

for s in seeds:
    print(f' PP r=4.0 seed {s} ...', flush=True)
    X_all = train_sup[feat_cols].astype(float).values
    oof = np.full(len(train_sup), np.nan, float)
    models = []
    for fold, (tr_idx, va_idx) in enumerate(gkf.split(X_all, y_all, groups=groups)):
        t1 = time.time()
        X_tr, y_tr = X_all[tr_idx], y_all[tr_idx]
        X_va, y_va = X_all[va_idx], y_all[va_idx]
        neg = (y_tr == 0).sum(); posc = (y_tr == 1).sum()
        spw = max(1.0, neg / max(1, posc))
        dtrain = xgb.DMatrix(X_tr, label=y_tr); dvalid = xgb.DMatrix(X_va, label=y_va)
        params = {'tree_method':'hist','device':'cuda','max_depth':7,'eta':0.05,'subsample':0.9,'colsample_bytree':0.8,
                  'min_child_weight':10,'lambda':1.5,'alpha':0.1,'gamma':0.1,'objective':'binary:logistic','eval_metric':'logloss',
                  'scale_pos_weight': float(spw), 'seed': int(s + fold)}
        booster = xgb.train(params, dtrain, num_boost_round=3800, evals=[(dtrain,'train'),(dvalid,'valid')], early_stopping_rounds=200, verbose_eval=False)
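        # Explicit None check so that a best_iteration of 0 is not treated as missing
        best_it = getattr(booster, 'best_iteration', None)
        best_it = int(best_it) if best_it is not None else booster.num_boosted_rounds() - 1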
        oof[va_idx] = booster.predict(dvalid, iteration_range=(0, best_it + 1))
        models.append((booster, best_it))
        print(f'   seed {s} fold {fold} done in {time.time()-t1:.1f}s; best_it={best_it}', flush=True)
    # Smooth OOF on canonical order
    df = train_sup[['game_play','p1','p2','step']].iloc[ord_idx].copy()
    df['oof'] = oof[ord_idx]
    df = df.sort_values(['game_play','p1','p2','step'])
    grp = df.groupby(['game_play','p1','p2'], sort=False)
    df['oof_smooth'] = grp['oof'].transform(lambda s_: s_.rolling(3, center=True, min_periods=1).max())
    oof_s_list.append(df['oof_smooth'].to_numpy())

    # Test predictions per seed with smoothing
    Xt = test_feats[feat_cols].astype(float).values
    dtest = xgb.DMatrix(Xt)
    pt = np.zeros(len(test_feats), float)
    for i, (booster, best_it) in enumerate(models):
        t1 = time.time()
        pt += booster.predict(dtest, iteration_range=(0, best_it + 1))
        print(f'    seed {s} test model {i} {time.time()-t1:.1f}s', flush=True)
    pt /= max(1, len(models))
    dt = test_feats[['game_play','p1','p2','step']].copy().sort_values(['game_play','p1','p2','step'])
    dt['prob'] = pt[dt.index.values]
    grp_t = dt.groupby(['game_play','p1','p2'], sort=False)
    dt['prob_smooth'] = grp_t['prob'].transform(lambda s_: s_.rolling(3, center=True, min_periods=1).max())
    test_s_list.append(dt['prob_smooth'].to_numpy())

# Average OOF across seeds in canonical order
oof_avg = np.mean(np.vstack(oof_s_list), axis=0)
keys_tr_sorted = train_sup[['game_play','p1','p2','step']].iloc[ord_idx].copy().reset_index(drop=True)
y_sorted = train_sup['contact'].astype(int).to_numpy()[ord_idx]
same_sorted = train_sup['same_team'].fillna(0).astype(int).to_numpy()[ord_idx] if 'same_team' in train_sup.columns else np.zeros_like(y_sorted, np.int8)
fold_sorted = fold_arr[ord_idx]

# Apply cap=2 by (game_play, step, player) on smoothed OOF probs BEFORE thresholding
df_o = keys_tr_sorted.copy()
df_o['prob'] = oof_avg
df_o['row_id'] = np.arange(len(df_o))
long1 = df_o[['game_play','step','p1','prob','row_id']].rename(columns={'p1':'player'})
long2 = df_o[['game_play','step','p2','prob','row_id']].rename(columns={'p2':'player'})
df_long = pd.concat([long1, long2], ignore_index=True)
df_long = df_long.sort_values(['game_play','step','player','prob'], ascending=[True, True, True, False])
df_long['rank'] = df_long.groupby(['game_play','step','player'], sort=False)['prob'].rank(method='first', ascending=False)
kept_rows = set(df_long.loc[df_long['rank'] <= 2, 'row_id'].tolist())
keep_mask_all = df_o['row_id'].isin(kept_rows).to_numpy()
oof_cap = oof_avg.copy()
oof_cap[~keep_mask_all] = 0.0
print('Applied cap=2 to OOF. Kept rows:', int(keep_mask_all.sum()), 'of', len(keep_mask_all))

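# Per-fold threshold optimization on capped OOF, then median across folds.
# Folds here come from folds_game_play.csv, which is not guaranteed to match the GroupKFold splits used for training above.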
thr_opp_f = []; thr_same_f = []
for k in sorted(np.unique(fold_sorted)):
    m = (fold_sorted == k)
    mcc_k, t_opp_k, t_same_k = fast_dual_threshold_mcc(y_sorted[m], oof_cap[m], same_sorted[m], grid_points=256)
    if (not np.isfinite(mcc_k)) or mcc_k < 0:
        thrs = np.linspace(0.7, 0.85, 31)
        ml = [matthews_corrcoef(y_sorted[m], (oof_cap[m] >= t).astype(int)) for t in thrs]
        j = int(np.argmax(ml)); t_opp_k = t_same_k = float(thrs[j])
    thr_opp_f.append(float(t_opp_k)); thr_same_f.append(float(t_same_k))
    print(f' Fold {k} thresholds: thr_opp={t_opp_k:.4f}, thr_same={t_same_k:.4f}')
thr_opp = float(np.median(thr_opp_f)); thr_same = float(np.median(thr_same_f))
print(f'Final fold-median thresholds: thr_opp={thr_opp:.4f}, thr_same={thr_same:.4f}')

# Test: average probs across seeds, smooth, then cap=2, then apply median thresholds
pt_bag = np.mean(np.vstack(test_s_list), axis=0)
df_t = test_feats[['game_play','p1','p2','step']].copy().sort_values(['game_play','p1','p2','step']).reset_index(drop=True)
df_t['prob_smooth'] = pt_bag
df_t['row_id'] = np.arange(len(df_t))
long1t = df_t[['game_play','step','p1','prob_smooth','row_id']].rename(columns={'p1':'player','prob_smooth':'prob'})
long2t = df_t[['game_play','step','p2','prob_smooth','row_id']].rename(columns={'p2':'player','prob_smooth':'prob'})
df_long_t = pd.concat([long1t, long2t], ignore_index=True)
df_long_t = df_long_t.sort_values(['game_play','step','player','prob'], ascending=[True, True, True, False])
df_long_t['rank'] = df_long_t.groupby(['game_play','step','player'], sort=False)['prob'].rank(method='first', ascending=False)
kept_rows_t = set(df_long_t.loc[df_long_t['rank'] <= 2, 'row_id'].tolist())
keep_mask_t = df_t['row_id'].isin(kept_rows_t).to_numpy()
df_t.loc[~keep_mask_t, 'prob_smooth'] = 0.0
print('Applied cap=2 on test. Kept rows:', int(keep_mask_t.sum()), 'of', len(keep_mask_t))

# Align same_team to df_t row order; treat every pair as opponents if the column is missing
if 'same_team' in test_feats.columns:
    same_flag_test = test_feats[['game_play','p1','p2','step','same_team']].merge(df_t[['game_play','p1','p2','step','row_id']], on=['game_play','p1','p2','step'], how='right').sort_values('row_id')
    same_flag_arr = same_flag_test['same_team'].fillna(0).astype(int).to_numpy()
else:
    same_flag_arr = np.zeros(len(df_t), int)
thr_arr_test = np.where(same_flag_arr == 1, thr_same, thr_opp)
df_t['pred_bin'] = (df_t['prob_smooth'].to_numpy() >= thr_arr_test).astype(int)

# Apply 2-of-3 hysteresis per (game_play, p1, p2) on the binary predictions
df_h = df_t[['game_play','p1','p2','step','pred_bin']].copy().sort_values(['game_play','p1','p2','step'])
grp_h = df_h.groupby(['game_play','p1','p2'], sort=False)['pred_bin']
df_h['pred_hyst'] = grp_h.transform(lambda s: (s.rolling(3, center=True, min_periods=1).sum() >= 2).astype(int))
df_t = df_t.merge(df_h[['game_play','p1','p2','step','pred_hyst']], on=['game_play','p1','p2','step'], how='left')

# Build submission from sample_submission.csv and the PP predictions (after hysteresis), then overwrite the `_G` rows from the prior submission.csv
cid_sorted = (df_t['game_play'].astype(str) + '_' + df_t['step'].astype(str) + '_' + df_t['p1'].astype(str) + '_' + df_t['p2'].astype(str))
pred_df_pp = pd.DataFrame({'contact_id': cid_sorted.values, 'contact_pp': df_t['pred_hyst'].astype(int).values})
ss = pd.read_csv('sample_submission.csv')
sub = ss.merge(pred_df_pp, on='contact_id', how='left')
sub['contact'] = sub['contact_pp'].fillna(0).astype(int)
sub = sub.drop(columns=['contact_pp'])
pp_ones = int(sub['contact'].sum())
print('PP (r40 bag + fold-median thr + cap2 + hyst) ones before G overwrite:', pp_ones)
try:
    prev_sub = pd.read_csv('submission.csv')
    g_pred_second = prev_sub[prev_sub['contact_id'].str.endswith('_G')][['contact_id','contact']].rename(columns={'contact':'contact_g'})
    sub = sub.merge(g_pred_second, on='contact_id', how='left')
    sub['contact'] = sub['contact_g'].fillna(sub['contact']).astype(int)
    sub = sub[['contact_id','contact']]
    after_ones = int(sub['contact'].sum())
    print(f'Applied prior G overwrite. ones after={after_ones}, delta={after_ones-pp_ones}')
except Exception as e:
    print('No prior submission with G rows found; skipping G overwrite.', e)
    sub = sub[['contact_id','contact']]

sub.to_csv('submission.csv', index=False)
print('Saved submission.csv. Took {:.1f}s'.format(time.time()-t0))

xgboost version (pp-bag-r40-fold-median): 2.1.4
Fold-median thresholds run: loading r=4.0 supervised dyn and test features...
Using 50 features
 PP r=4.0 seed 42 ...
   seed 42 fold 0 done in 36.8s; best_it=3253
   seed 42 fold 1 done in 39.7s; best_it=3632
   seed 42 fold 2 done in 37.9s; best_it=3326
   seed 42 fold 3 done in 38.8s; best_it=3446
   seed 42 fold 4 done in 37.1s; best_it=3468
    seed 42 test model 0 0.2s
    seed 42 test model 1 0.2s
    seed 42 test model 2 0.2s
    seed 42 test model 3 0.2s
    seed 42 test model 4 0.2s
 PP r=4.0 seed 1337 ...
   seed 1337 fold 0 done in 38.1s; best_it=3385
   seed 1337 fold 1 done in 39.6s; best_it=3608
   seed 1337 fold 2 done in 36.0s; best_it=3140
   seed 1337 fold 3 done in 38.1s; best_it=3378
   seed 1337 fold 4 done in 39.1s; best_it=3609
    seed 1337 test model 0 0.2s
    seed 1337 test model 1 0.2s
    seed 1337 test model 2 0.2s
    seed 1337 test model 3 0.2s
    seed 1337 test model 4 0.2s
 PP r=4.0 seed 2025 ...
   seed 2025 fold 0 done in 39.0s; best_it=3453
   seed 2025 fold 1 done in 38.4s; best_it=3408
   seed 2025 fold 2 done in 37.7s; best_it=3284
   seed 2025 fold 3 done in 40.5s; best_it=3573
   seed 2025 fold 4 done in 36.9s; best_it=3388
    seed 2025 test model 0 0.2s
    seed 2025 test model 1 0.2s
    seed 2025 test model 2 0.2s
    seed 2025 test model 3 0.2s
    seed 2025 test model 4 0.2s
Applied cap=2 to OOF. Kept rows: 308871 of 634192
 Fold 0 thresholds: thr_opp=0.7721, thr_same=0.8034
 Fold 1 thresholds: thr_opp=0.8178, thr_same=0.7708
 Fold 2 thresholds: thr_opp=0.8713, thr_same=0.6954
 Fold 3 thresholds: thr_opp=0.7253, thr_same=0.7283
 Fold 4 thresholds: thr_opp=0.8006, thr_same=0.8531
Final fold-median thresholds: thr_opp=0.8006, thr_same=0.7708
Applied cap=2 on test. Kept rows: 122291 of 278492
PP (r40 bag + fold-median thr + cap2 + hyst) ones before G overwrite: 6356
Applied prior G overwrite. ones after=8408, delta=2052
Saved submission.csv. Took 590.4s
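
The cap step in the run above ranks every pair from both players' sides and keeps a pair if either ranking is within the cap, so a single player can still end up with more than `cap` surviving pairs. A minimal sketch of that behavior on toy data follows; the helper name `cap_top_k_per_player` and the toy frame are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

def cap_top_k_per_player(df, k=2, prob_col="prob"):
    """Zero every pair probability that is not in the top k for at least one of its players.

    Hypothetical helper for illustration; expects columns game_play, step, p1, p2 and prob_col.
    """
    df = df.reset_index(drop=True)
    row_id = np.arange(len(df))
    # One long row per (pair, player) so each pair is ranked from both players' viewpoints.
    sides = []
    for col in ("p1", "p2"):
        sides.append(pd.DataFrame({
            "game_play": df["game_play"],
            "step": df["step"],
            "player": df[col],
            "prob": df[prob_col].astype(float),
            "row_id": row_id,
        }))
    long = pd.concat(sides, ignore_index=True)
    # Rank 1 is the highest probability within each (game_play, step, player) group.
    rank = long.groupby(["game_play", "step", "player"])["prob"].rank(method="first", ascending=False)
    kept = np.zeros(len(df), dtype=bool)
    kept[long.loc[rank <= k, "row_id"].to_numpy()] = True
    capped = df[prob_col].to_numpy(dtype=float).copy()
    capped[~kept] = 0.0
    return capped, kept

# Toy data (made up): player A is involved in three pairs at the same step.
toy = pd.DataFrame({
    "game_play": ["g1"] * 4,
    "step": [0] * 4,
    "p1": ["A", "A", "A", "B"],
    "p2": ["B", "C", "D", "C"],
    "prob": [0.9, 0.8, 0.7, 0.1],
})
capped, kept = cap_top_k_per_player(toy, k=1)
print(kept.tolist())
print(capped.tolist())
```

With `k=1`, only the B-C pair is zeroed: A-C and A-D survive because they are the top pair for C and D respectively, even though A already has a higher-ranked pair. If a strict per-player cap is wanted, the keep rule would need to require the rank condition for both players instead of either one.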


In [48]:
# PP r=4.0 bagging with fold-median dual thresholds and cap=3 per player-step (smooth->cap3), then hysteresis and G overwrite
import time, numpy as np, pandas as pd, sys, subprocess
from sklearn.model_selection import GroupKFold
from sklearn.metrics import matthews_corrcoef

try:
    import xgboost as xgb
except Exception as e:
    print('Installing xgboost...', e)
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'xgboost==2.1.4'], check=True)
    import xgboost as xgb
print('xgboost version (pp-bag-r40-fold-median-cap3):', getattr(xgb, '__version__', 'unknown'))

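def fast_dual_threshold_mcc(y_true, prob, same_flag, grid_points=256):
    """Search separate thresholds for opponent (same_flag == 0) and same-team (same_flag == 1)
    pairs on a rank-based grid of about `grid_points` cut points per cohort, maximizing the pooled MCC.
    Returns (best_mcc, thr_opp, thr_same), or (-1.0, 0.79, 0.79) if no finite MCC exists."""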
    y = np.asarray(y_true, dtype=np.int64)
    p = np.asarray(prob, dtype=np.float64)
    s = np.asarray(same_flag, dtype=np.int8)
    mask = np.isfinite(y) & np.isfinite(p) & np.isfinite(s)
    y, p, s = y[mask], p[mask], s[mask]
    def cohort_counts(yc, pc, G):
        n = yc.size
        if n == 0:
            return dict(tp=np.array([0], np.float64), fp=np.array([0], np.float64), tn=np.array([0], np.float64), fn=np.array([0], np.float64), thr=np.array([1.0], np.float64))
        order = np.argsort(-pc, kind='mergesort')
        ys, ps = yc[order], pc[order]
        P = float(ys.sum()); N = float(n - ys.sum())
        step = max(1, n // max(1, (G - 1)))
        k = np.arange(0, n + 1, step, dtype=np.int64)
        if k[-1] != n: k = np.append(k, n)
        cum = np.concatenate(([0], np.cumsum(ys, dtype=np.int64)))
        tp = cum[k].astype(np.float64); fp = (k - cum[k]).astype(np.float64)
        fn = P - tp; tn = N - fp
        thr = np.where(k == 0, 1.0 + 1e-6, ps[np.maximum(0, k - 1)])
        return dict(tp=tp, fp=fp, tn=tn, fn=fn, thr=thr)
    a = cohort_counts(y[s == 0], p[s == 0], grid_points)
    b = cohort_counts(y[s == 1], p[s == 1], grid_points)
    tp = a['tp'][:, None] + b['tp'][None, :]
    fp = a['fp'][:, None] + b['fp'][None, :]
    tn = a['tn'][:, None] + b['tn'][None, :]
    fn = a['fn'][:, None] + b['fn'][None, :]
    with np.errstate(invalid='ignore', divide='ignore'):
        num = tp * tn - fp * fn
        den = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
        den = np.where(den > 0, np.sqrt(den), np.nan)
        mcc = num / den
    if not np.isfinite(mcc).any():
        return -1.0, 0.79, 0.79
    i, j = np.unravel_index(np.nanargmax(mcc), mcc.shape)
    return float(mcc[i, j]), float(a['thr'][i]), float(b['thr'][j])

t0 = time.time()
print('Fold-median thresholds (cap=3) run: loading r=4.0 supervised dyn and test features...')
train_sup = pd.read_parquet('train_supervised_w5_helm_dyn_r40.parquet')
test_feats = pd.read_parquet('test_pairs_w5_helm_dyn_r40.parquet')
folds_df = pd.read_csv('folds_game_play.csv')
train_sup = train_sup.merge(folds_df, on='game_play', how='left')
assert train_sup['fold'].notna().all()
for df in (train_sup, test_feats):
    if 'px_dist_norm_min' in df.columns: df['px_dist_norm_min'] = df['px_dist_norm_min'].fillna(1.0)
    if 'views_both_present' in df.columns: df['views_both_present'] = df['views_both_present'].fillna(0).astype(float)

drop_cols = {'contact','game_play','step','p1','p2','team1','team2','pos1','pos2','fold'}
feat_cols = [c for c in train_sup.columns if c not in drop_cols and pd.api.types.is_numeric_dtype(train_sup[c])]
print('Using', len(feat_cols), 'features')

ord_idx = train_sup[['game_play','p1','p2','step']].sort_values(['game_play','p1','p2','step']).index.to_numpy()
gkf = GroupKFold(n_splits=5)
groups = train_sup['game_play'].values
y_all = train_sup['contact'].astype(int).values
same_all = train_sup['same_team'].fillna(0).astype(int).values if 'same_team' in train_sup.columns else np.zeros(len(train_sup), np.int8)
fold_arr = train_sup['fold'].astype(int).to_numpy()

seeds = [42,1337,2025]
oof_s_list = []; test_s_list = []

for s in seeds:
    print(f' PP r=4.0 seed {s} ...', flush=True)
    X_all = train_sup[feat_cols].astype(float).values
    oof = np.full(len(train_sup), np.nan, float)
    models = []
    for fold, (tr_idx, va_idx) in enumerate(gkf.split(X_all, y_all, groups=groups)):
        t1 = time.time()
        X_tr, y_tr = X_all[tr_idx], y_all[tr_idx]
        X_va, y_va = X_all[va_idx], y_all[va_idx]
        neg = (y_tr == 0).sum(); posc = (y_tr == 1).sum()
        spw = max(1.0, neg / max(1, posc))
        dtrain = xgb.DMatrix(X_tr, label=y_tr); dvalid = xgb.DMatrix(X_va, label=y_va)
        params = {'tree_method':'hist','device':'cuda','max_depth':7,'eta':0.05,'subsample':0.9,'colsample_bytree':0.8,
                  'min_child_weight':10,'lambda':1.5,'alpha':0.1,'gamma':0.1,'objective':'binary:logistic','eval_metric':'logloss',
                  'scale_pos_weight': float(spw), 'seed': int(s + fold)}
        booster = xgb.train(params, dtrain, num_boost_round=3800, evals=[(dtrain,'train'),(dvalid,'valid')], early_stopping_rounds=200, verbose_eval=False)
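        # Explicit None check so that a best_iteration of 0 is not treated as missing
        best_it = getattr(booster, 'best_iteration', None)
        best_it = int(best_it) if best_it is not None else booster.num_boosted_rounds() - 1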
        oof[va_idx] = booster.predict(dvalid, iteration_range=(0, best_it + 1))
        models.append((booster, best_it))
        print(f'   seed {s} fold {fold} done in {time.time()-t1:.1f}s; best_it={best_it}', flush=True)
    df = train_sup[['game_play','p1','p2','step']].iloc[ord_idx].copy()
    df['oof'] = oof[ord_idx]
    df = df.sort_values(['game_play','p1','p2','step'])
    grp = df.groupby(['game_play','p1','p2'], sort=False)
    df['oof_smooth'] = grp['oof'].transform(lambda s_: s_.rolling(3, center=True, min_periods=1).max())
    oof_s_list.append(df['oof_smooth'].to_numpy())

    Xt = test_feats[feat_cols].astype(float).values
    dtest = xgb.DMatrix(Xt)
    pt = np.zeros(len(test_feats), float)
    for i, (booster, best_it) in enumerate(models):
        t1 = time.time()
        pt += booster.predict(dtest, iteration_range=(0, best_it + 1))
        print(f'    seed {s} test model {i} {time.time()-t1:.1f}s', flush=True)
    pt /= max(1, len(models))
    dt = test_feats[['game_play','p1','p2','step']].copy().sort_values(['game_play','p1','p2','step'])
    dt['prob'] = pt[dt.index.values]
    grp_t = dt.groupby(['game_play','p1','p2'], sort=False)
    dt['prob_smooth'] = grp_t['prob'].transform(lambda s_: s_.rolling(3, center=True, min_periods=1).max())
    test_s_list.append(dt['prob_smooth'].to_numpy())

oof_avg = np.mean(np.vstack(oof_s_list), axis=0)
keys_tr_sorted = train_sup[['game_play','p1','p2','step']].iloc[ord_idx].copy().reset_index(drop=True)
y_sorted = train_sup['contact'].astype(int).to_numpy()[ord_idx]
same_sorted = train_sup['same_team'].fillna(0).astype(int).to_numpy()[ord_idx] if 'same_team' in train_sup.columns else np.zeros_like(y_sorted, np.int8)
fold_sorted = fold_arr[ord_idx]

# cap=3 on OOF
df_o = keys_tr_sorted.copy()
df_o['prob'] = oof_avg
df_o['row_id'] = np.arange(len(df_o))
long1 = df_o[['game_play','step','p1','prob','row_id']].rename(columns={'p1':'player'})
long2 = df_o[['game_play','step','p2','prob','row_id']].rename(columns={'p2':'player'})
df_long = pd.concat([long1, long2], ignore_index=True)
df_long = df_long.sort_values(['game_play','step','player','prob'], ascending=[True, True, True, False])
df_long['rank'] = df_long.groupby(['game_play','step','player'], sort=False)['prob'].rank(method='first', ascending=False)
kept_rows = set(df_long.loc[df_long['rank'] <= 3, 'row_id'].tolist())
keep_mask_all = df_o['row_id'].isin(kept_rows).to_numpy()
oof_cap = oof_avg.copy(); oof_cap[~keep_mask_all] = 0.0
print('Applied cap=3 to OOF. Kept rows:', int(keep_mask_all.sum()), 'of', len(keep_mask_all))

thr_opp_f = []; thr_same_f = []
for k in sorted(np.unique(fold_sorted)):
    m = (fold_sorted == k)
    mcc_k, t_opp_k, t_same_k = fast_dual_threshold_mcc(y_sorted[m], oof_cap[m], same_sorted[m], grid_points=256)
    if (not np.isfinite(mcc_k)) or mcc_k < 0:
        thrs = np.linspace(0.7, 0.85, 31)
        ml = [matthews_corrcoef(y_sorted[m], (oof_cap[m] >= t).astype(int)) for t in thrs]
        j = int(np.argmax(ml)); t_opp_k = t_same_k = float(thrs[j])
    thr_opp_f.append(float(t_opp_k)); thr_same_f.append(float(t_same_k))
    print(f' Fold {k} thresholds: thr_opp={t_opp_k:.4f}, thr_same={t_same_k:.4f}')
thr_opp = float(np.median(thr_opp_f)); thr_same = float(np.median(thr_same_f))
print(f'Final fold-median thresholds (cap3): thr_opp={thr_opp:.4f}, thr_same={thr_same:.4f}')

pt_bag = np.mean(np.vstack(test_s_list), axis=0)
df_t = test_feats[['game_play','p1','p2','step']].copy().sort_values(['game_play','p1','p2','step']).reset_index(drop=True)
df_t['prob_smooth'] = pt_bag
df_t['row_id'] = np.arange(len(df_t))
long1t = df_t[['game_play','step','p1','prob_smooth','row_id']].rename(columns={'p1':'player','prob_smooth':'prob'})
long2t = df_t[['game_play','step','p2','prob_smooth','row_id']].rename(columns={'p2':'player','prob_smooth':'prob'})
df_long_t = pd.concat([long1t, long2t], ignore_index=True)
df_long_t = df_long_t.sort_values(['game_play','step','player','prob'], ascending=[True, True, True, False])
df_long_t['rank'] = df_long_t.groupby(['game_play','step','player'], sort=False)['prob'].rank(method='first', ascending=False)
kept_rows_t = set(df_long_t.loc[df_long_t['rank'] <= 3, 'row_id'].tolist())
keep_mask_t = df_t['row_id'].isin(kept_rows_t).to_numpy()
df_t.loc[~keep_mask_t, 'prob_smooth'] = 0.0
print('Applied cap=3 on test. Kept rows:', int(keep_mask_t.sum()), 'of', len(keep_mask_t))

# Align same_team to df_t row order; treat every pair as opponents if the column is missing
if 'same_team' in test_feats.columns:
    same_flag_test = test_feats[['game_play','p1','p2','step','same_team']].merge(df_t[['game_play','p1','p2','step','row_id']], on=['game_play','p1','p2','step'], how='right').sort_values('row_id')
    same_flag_arr = same_flag_test['same_team'].fillna(0).astype(int).to_numpy()
else:
    same_flag_arr = np.zeros(len(df_t), int)
thr_arr_test = np.where(same_flag_arr == 1, thr_same, thr_opp)
df_t['pred_bin'] = (df_t['prob_smooth'].to_numpy() >= thr_arr_test).astype(int)

df_h = df_t[['game_play','p1','p2','step','pred_bin']].copy().sort_values(['game_play','p1','p2','step'])
grp_h = df_h.groupby(['game_play','p1','p2'], sort=False)['pred_bin']
df_h['pred_hyst'] = grp_h.transform(lambda s: (s.rolling(3, center=True, min_periods=1).sum() >= 2).astype(int))
df_t = df_t.merge(df_h[['game_play','p1','p2','step','pred_hyst']], on=['game_play','p1','p2','step'], how='left')

cid_sorted = (df_t['game_play'].astype(str) + '_' + df_t['step'].astype(str) + '_' + df_t['p1'].astype(str) + '_' + df_t['p2'].astype(str))
pred_df_pp = pd.DataFrame({'contact_id': cid_sorted.values, 'contact_pp': df_t['pred_hyst'].astype(int).values})
ss = pd.read_csv('sample_submission.csv')
sub = ss.merge(pred_df_pp, on='contact_id', how='left')
sub['contact'] = sub['contact_pp'].fillna(0).astype(int)
sub = sub.drop(columns=['contact_pp'])
pp_ones = int(sub['contact'].sum())
print('PP (r40 bag + fold-median thr cap3 + hyst) ones before G overwrite:', pp_ones)
try:
    prev_sub = pd.read_csv('submission.csv')
    g_pred_second = prev_sub[prev_sub['contact_id'].str.endswith('_G')][['contact_id','contact']].rename(columns={'contact':'contact_g'})
    sub = sub.merge(g_pred_second, on='contact_id', how='left')
    sub['contact'] = sub['contact_g'].fillna(sub['contact']).astype(int)
    sub = sub[['contact_id','contact']]
    after_ones = int(sub['contact'].sum())
    print(f'Applied prior G overwrite. ones after={after_ones}, delta={after_ones-pp_ones}')
except Exception as e:
    print('No prior submission with G rows found; skipping G overwrite.', e)
    sub = sub[['contact_id','contact']]

sub.to_csv('submission.csv', index=False)
print('Saved submission.csv. Took {:.1f}s'.format(time.time()-t0))

xgboost version (pp-bag-r40-fold-median-cap3): 2.1.4
Fold-median thresholds (cap=3) run: loading r=4.0 supervised dyn and test features...
Using 50 features
 PP r=4.0 seed 42 ...
   seed 42 fold 0 done in 36.7s; best_it=3253
   seed 42 fold 1 done in 39.7s; best_it=3632
   seed 42 fold 2 done in 37.9s; best_it=3326
   seed 42 fold 3 done in 38.8s; best_it=3446
   seed 42 fold 4 done in 37.3s; best_it=3468
    seed 42 test model 0 0.2s
    seed 42 test model 1 0.2s
    seed 42 test model 2 0.2s
    seed 42 test model 3 0.2s
    seed 42 test model 4 0.2s
 PP r=4.0 seed 1337 ...
   seed 1337 fold 0 done in 38.1s; best_it=3385
   seed 1337 fold 1 done in 39.7s; best_it=3608
   seed 1337 fold 2 done in 36.0s; best_it=3140
   seed 1337 fold 3 done in 38.1s; best_it=3378
   seed 1337 fold 4 done in 38.9s; best_it=3609
    seed 1337 test model 0 0.2s
    seed 1337 test model 1 0.2s
    seed 1337 test model 2 0.2s
    seed 1337 test model 3 0.2s
    seed 1337 test model 4 0.2s
 PP r=4.0 seed 2025 ...
   seed 2025 fold 0 done in 38.8s; best_it=3453
   seed 2025 fold 1 done in 38.1s; best_it=3408
   seed 2025 fold 2 done in 37.8s; best_it=3284
   seed 2025 fold 3 done in 40.7s; best_it=3573
   seed 2025 fold 4 done in 37.2s; best_it=3388
    seed 2025 test model 0 0.2s
    seed 2025 test model 1 0.2s
    seed 2025 test model 2 0.2s
    seed 2025 test model 3 0.2s
    seed 2025 test model 4 0.2s
Applied cap=3 to OOF. Kept rows: 400456 of 634192
 Fold 0 thresholds: thr_opp=0.7890, thr_same=0.8594
 Fold 1 thresholds: thr_opp=0.8231, thr_same=0.7933
 Fold 2 thresholds: thr_opp=0.8570, thr_same=0.6249
 Fold 3 thresholds: thr_opp=0.7145, thr_same=0.7097
 Fold 4 thresholds: thr_opp=0.8042, thr_same=0.8685
Final fold-median thresholds (cap3): thr_opp=0.8042, thr_same=0.7933
Applied cap=3 on test. Kept rows: 162678 of 278492
PP (r40 bag + fold-median thr cap3 + hyst) ones before G overwrite: 6488
Applied prior G overwrite. ones after=8540, delta=2052
Saved submission.csv. Took 590.7s
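
Both runs above finish with a 2-of-3 rule over each pair's binary predictions, implemented as a centered rolling sum with `min_periods=1`. The sketch below applies the same rolling expression to a single toy sequence to show what it does at isolated positives, single-step gaps, and sequence edges. The function name `hysteresis_2_of_3` and the toy inputs are assumptions for illustration only.

```python
import pandas as pd

def hysteresis_2_of_3(pred_bin):
    """Set a step to 1 when at least 2 steps in its centered 3-step window are 1.

    At the first and last step the window holds only 2 steps, so both must be 1.
    Hypothetical helper for illustration; operates on one pair's time-ordered predictions.
    """
    s = pd.Series(pred_bin, dtype="int64")
    return (s.rolling(3, center=True, min_periods=1).sum() >= 2).astype(int).tolist()

# An isolated positive (index 1) is removed and a one-step gap (index 2) is filled.
print(hysteresis_2_of_3([0, 1, 0, 1, 1, 0]))
# A lone positive at the left edge is removed; a run of two at the right edge is kept.
print(hysteresis_2_of_3([1, 0, 0, 0, 1, 1]))
```

Because the rule can switch a 0 to 1 as well as a 1 to 0, it behaves like a 3-step majority filter. That is one reason thresholds tuned before this step may not be the best ones after it.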


In [49]:
# PP r=4.0 bagging with thresholds optimized AFTER hysteresis per fold (cap=2), fold-median thresholds, apply same chain on test, then G overwrite
import time, numpy as np, pandas as pd, sys, subprocess
from sklearn.model_selection import GroupKFold
from sklearn.metrics import matthews_corrcoef

try:
    import xgboost as xgb
except Exception as e:
    print('Installing xgboost...', e)
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'xgboost==2.1.4'], check=True)
    import xgboost as xgb
print('xgboost version (pp-bag-r40-thr-after-hyst):', getattr(xgb, '__version__', 'unknown'))

def apply_hyst_per_pair(df_bin: pd.DataFrame) -> np.ndarray:
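    # df_bin must have columns: game_play, p1, p2, step, pred_bin.
    # The result is in (game_play, p1, p2, step)-sorted order, so it aligns with the caller's labels only if df_bin is already sorted that way.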
    df_h = df_bin.sort_values(['game_play','p1','p2','step']).copy()
    grp = df_h.groupby(['game_play','p1','p2'], sort=False)['pred_bin']
    df_h['pred_hyst'] = grp.transform(lambda s: (s.rolling(3, center=True, min_periods=1).sum() >= 2).astype(int))
    return df_h['pred_hyst'].to_numpy()

t0 = time.time()
print('Loading r=4.0 supervised dyn train and test features...')
train_sup = pd.read_parquet('train_supervised_w5_helm_dyn_r40.parquet')
test_feats = pd.read_parquet('test_pairs_w5_helm_dyn_r40.parquet')
folds_df = pd.read_csv('folds_game_play.csv')
train_sup = train_sup.merge(folds_df, on='game_play', how='left')
assert train_sup['fold'].notna().all()
for df in (train_sup, test_feats):
    if 'px_dist_norm_min' in df.columns: df['px_dist_norm_min'] = df['px_dist_norm_min'].fillna(1.0)
    if 'views_both_present' in df.columns: df['views_both_present'] = df['views_both_present'].fillna(0).astype(float)

drop_cols = {'contact','game_play','step','p1','p2','team1','team2','pos1','pos2','fold'}
feat_cols = [c for c in train_sup.columns if c not in drop_cols and pd.api.types.is_numeric_dtype(train_sup[c])]
print('Using', len(feat_cols), 'features')

# Canonical sorted order for alignment
ord_idx = train_sup[['game_play','p1','p2','step']].sort_values(['game_play','p1','p2','step']).index.to_numpy()
gkf = GroupKFold(n_splits=5)
groups = train_sup['game_play'].values
y_all = train_sup['contact'].astype(int).values
same_all = train_sup['same_team'].fillna(0).astype(int).values if 'same_team' in train_sup.columns else np.zeros(len(train_sup), np.int8)
fold_arr = train_sup['fold'].astype(int).to_numpy()

seeds = [42,1337,2025]
oof_s_list = []; test_s_list = []

for s in seeds:
    print(f' PP r=4.0 seed {s} ...', flush=True)
    X_all = train_sup[feat_cols].astype(float).values
    oof = np.full(len(train_sup), np.nan, float)
    models = []
    for fold, (tr_idx, va_idx) in enumerate(gkf.split(X_all, y_all, groups=groups)):
        t1 = time.time()
        X_tr, y_tr = X_all[tr_idx], y_all[tr_idx]
        X_va, y_va = X_all[va_idx], y_all[va_idx]
        neg = (y_tr == 0).sum(); posc = (y_tr == 1).sum()
        spw = max(1.0, neg / max(1, posc))
        dtrain = xgb.DMatrix(X_tr, label=y_tr); dvalid = xgb.DMatrix(X_va, label=y_va)
        params = {'tree_method':'hist','device':'cuda','max_depth':7,'eta':0.05,'subsample':0.9,'colsample_bytree':0.8,
                  'min_child_weight':10,'lambda':1.5,'alpha':0.1,'gamma':0.1,'objective':'binary:logistic','eval_metric':'logloss',
                  'scale_pos_weight': float(spw), 'seed': int(s + fold)}
        booster = xgb.train(params, dtrain, num_boost_round=3800, evals=[(dtrain,'train'),(dvalid,'valid')], early_stopping_rounds=200, verbose_eval=False)
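        # Explicit None check so that a best_iteration of 0 is not treated as missing
        best_it = getattr(booster, 'best_iteration', None)
        best_it = int(best_it) if best_it is not None else booster.num_boosted_rounds() - 1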
        oof[va_idx] = booster.predict(dvalid, iteration_range=(0, best_it + 1))
        models.append((booster, best_it))
        print(f'   seed {s} fold {fold} done in {time.time()-t1:.1f}s; best_it={best_it}', flush=True)
    # Smooth OOF on canonical order
    df = train_sup[['game_play','p1','p2','step']].iloc[ord_idx].copy()
    df['oof'] = oof[ord_idx]
    df = df.sort_values(['game_play','p1','p2','step'])
    grp = df.groupby(['game_play','p1','p2'], sort=False)
    df['oof_smooth'] = grp['oof'].transform(lambda s_: s_.rolling(3, center=True, min_periods=1).max())
    oof_s_list.append(df['oof_smooth'].to_numpy())

    # Test predictions and smoothing
    Xt = test_feats[feat_cols].astype(float).values
    dtest = xgb.DMatrix(Xt)
    pt = np.zeros(len(test_feats), float)
    for i, (booster, best_it) in enumerate(models):
        t1 = time.time()
        pt += booster.predict(dtest, iteration_range=(0, best_it + 1))
        print(f'    seed {s} test model {i} {time.time()-t1:.1f}s', flush=True)
    pt /= max(1, len(models))
    dt = test_feats[['game_play','p1','p2','step']].copy().sort_values(['game_play','p1','p2','step'])
    dt['prob'] = pt[dt.index.values]
    grp_t = dt.groupby(['game_play','p1','p2'], sort=False)
    dt['prob_smooth'] = grp_t['prob'].transform(lambda s_: s_.rolling(3, center=True, min_periods=1).max())
    test_s_list.append(dt['prob_smooth'].to_numpy())

# Average OOF across seeds in canonical order
oof_avg = np.mean(np.vstack(oof_s_list), axis=0)
keys_tr_sorted = train_sup[['game_play','p1','p2','step']].iloc[ord_idx].copy().reset_index(drop=True)
y_sorted = train_sup['contact'].astype(int).to_numpy()[ord_idx]
same_sorted = train_sup['same_team'].fillna(0).astype(int).to_numpy()[ord_idx] if 'same_team' in train_sup.columns else np.zeros_like(y_sorted, np.int8)
fold_sorted = fold_arr[ord_idx]

# Apply cap=2 on OOF probs before thresholding
df_o = keys_tr_sorted.copy()
df_o['prob'] = oof_avg
df_o['row_id'] = np.arange(len(df_o))
long1 = df_o[['game_play','step','p1','prob','row_id']].rename(columns={'p1':'player'})
long2 = df_o[['game_play','step','p2','prob','row_id']].rename(columns={'p2':'player'})
df_long = pd.concat([long1, long2], ignore_index=True)
df_long = df_long.sort_values(['game_play','step','player','prob'], ascending=[True, True, True, False])
df_long['rank'] = df_long.groupby(['game_play','step','player'], sort=False)['prob'].rank(method='first', ascending=False)
kept_rows = set(df_long.loc[df_long['rank'] <= 2, 'row_id'].tolist())
keep_mask_all = df_o['row_id'].isin(kept_rows).to_numpy()
oof_cap = oof_avg.copy(); oof_cap[~keep_mask_all] = 0.0
print('Applied cap=2 to OOF. Kept rows:', int(keep_mask_all.sum()), 'of', len(keep_mask_all))

# Optimize thresholds AFTER hysteresis per fold
thr_grid = np.round(np.linspace(0.70, 0.85, 16), 3)
thr_best = []
for k in sorted(np.unique(fold_sorted)):
    m = (fold_sorted == k)
    df_k = keys_tr_sorted.loc[m, ['game_play','p1','p2','step']].copy()
    df_k['prob'] = oof_cap[m]
    df_k['same'] = same_sorted[m]
    y_k = y_sorted[m]
    # Cap is already applied; only the thresholds vary below, and hysteresis runs after thresholding
    best_m, best_to, best_ts = -1.0, 0.78, 0.78
    # Hoist the same-team flags out of the grid loops for speed
    same_arr = df_k['same'].to_numpy()
    for to in thr_grid:
        for ts in thr_grid:
            thr_arr = np.where(same_arr == 1, ts, to)
            pred_bin = (df_k['prob'].to_numpy() >= thr_arr).astype(int)
            df_tmp = df_k[['game_play','p1','p2','step']].copy()
            df_tmp['pred_bin'] = pred_bin
            pred_h = apply_hyst_per_pair(df_tmp)
            mcc = matthews_corrcoef(y_k, pred_h)
            if mcc > best_m:
                best_m, best_to, best_ts = float(mcc), float(to), float(ts)
    thr_best.append((best_to, best_ts))
    print(f' Fold {k} best after-hyst MCC={best_m:.5f} thr_opp={best_to:.3f} thr_same={best_ts:.3f}')

thr_best = np.array(thr_best, float)
thr_opp_med = float(np.median(thr_best[:, 0]))
thr_same_med = float(np.median(thr_best[:, 1]))
print(f'Fold-median thresholds after hysteresis (cap2): thr_opp={thr_opp_med:.4f}, thr_same={thr_same_med:.4f}')

# Test: average probs across seeds, smooth, cap=2, then apply median thresholds, then hysteresis
pt_bag = np.mean(np.vstack(test_s_list), axis=0)
df_t = test_feats[['game_play','p1','p2','step']].copy().sort_values(['game_play','p1','p2','step']).reset_index(drop=True)
df_t['prob_smooth'] = pt_bag
df_t['row_id'] = np.arange(len(df_t))
long1t = df_t[['game_play','step','p1','prob_smooth','row_id']].rename(columns={'p1':'player','prob_smooth':'prob'})
long2t = df_t[['game_play','step','p2','prob_smooth','row_id']].rename(columns={'p2':'player','prob_smooth':'prob'})
df_long_t = pd.concat([long1t, long2t], ignore_index=True)
df_long_t = df_long_t.sort_values(['game_play','step','player','prob'], ascending=[True, True, True, False])
df_long_t['rank'] = df_long_t.groupby(['game_play','step','player'], sort=False)['prob'].rank(method='first', ascending=False)
kept_rows_t = set(df_long_t.loc[df_long_t['rank'] <= 2, 'row_id'].tolist())
keep_mask_t = df_t['row_id'].isin(kept_rows_t).to_numpy()
df_t.loc[~keep_mask_t, 'prob_smooth'] = 0.0
print('Applied cap=2 on test. Kept rows:', int(keep_mask_t.sum()), 'of', len(keep_mask_t))

same_flag_test = test_feats[['game_play','p1','p2','step','same_team']].copy()
same_flag_test = same_flag_test.merge(df_t[['game_play','p1','p2','step','row_id']], on=['game_play','p1','p2','step'], how='right').sort_values('row_id')
same_arr_t = same_flag_test['same_team'].fillna(0).astype(int).to_numpy() if 'same_team' in same_flag_test.columns else np.zeros(len(df_t), int)
thr_arr_t = np.where(same_arr_t == 1, thr_same_med, thr_opp_med)
df_t['pred_bin'] = (df_t['prob_smooth'].to_numpy() >= thr_arr_t).astype(int)

df_tmp_t = df_t[['game_play','p1','p2','step','pred_bin']].copy()
pred_h_t = apply_hyst_per_pair(df_tmp_t)
df_t['pred_hyst'] = pred_h_t.astype(int)

# Build submission from player-player (PP) predictions, then overwrite ground-contact (G) rows from the prior submission.csv (no PP leakage in CV)
cid_sorted = (df_t['game_play'].astype(str) + '_' + df_t['step'].astype(str) + '_' + df_t['p1'].astype(str) + '_' + df_t['p2'].astype(str))
pred_df_pp = pd.DataFrame({'contact_id': cid_sorted.values, 'contact_pp': df_t['pred_hyst'].astype(int).values})
ss = pd.read_csv('sample_submission.csv')
sub = ss.merge(pred_df_pp, on='contact_id', how='left')
sub['contact'] = sub['contact_pp'].fillna(0).astype(int)
sub = sub.drop(columns=['contact_pp'])
pp_ones = int(sub['contact'].sum())
print('PP (r40 bag + thr-after-hyst cap2) ones before G overwrite:', pp_ones)
try:
    prev_sub = pd.read_csv('submission.csv')
    g_pred_second = prev_sub[prev_sub['contact_id'].str.endswith('_G')][['contact_id','contact']].rename(columns={'contact':'contact_g'})
    sub = sub.merge(g_pred_second, on='contact_id', how='left')
    sub['contact'] = sub['contact_g'].fillna(sub['contact']).astype(int)
    sub = sub[['contact_id','contact']]
    after_ones = int(sub['contact'].sum())
    print(f'Applied prior G overwrite. ones after={after_ones}, delta={after_ones-pp_ones}')
except Exception as e:
    print('No prior submission with G rows found; skipping G overwrite.', e)
    sub = sub[['contact_id','contact']]

sub.to_csv('submission.csv', index=False)
print('Saved submission.csv. Took {:.1f}s'.format(time.time()-t0))

xgboost version (pp-bag-r40-thr-after-hyst): 2.1.4
Loading r=4.0 supervised dyn train and test features...
Using 50 features
 PP r=4.0 seed 42 ...
   seed 42 fold 0 done in 36.8s; best_it=3253
   seed 42 fold 1 done in 39.7s; best_it=3632
   seed 42 fold 2 done in 37.8s; best_it=3326
   seed 42 fold 3 done in 38.9s; best_it=3446
   seed 42 fold 4 done in 37.2s; best_it=3468
    seed 42 test model 0 0.2s
    seed 42 test model 1 0.2s
    seed 42 test model 2 0.2s
    seed 42 test model 3 0.2s
    seed 42 test model 4 0.2s
 PP r=4.0 seed 1337 ...
   seed 1337 fold 0 done in 38.1s; best_it=3385
   seed 1337 fold 1 done in 39.6s; best_it=3608
   seed 1337 fold 2 done in 36.0s; best_it=3140
   seed 1337 fold 3 done in 38.0s; best_it=3378
   seed 1337 fold 4 done in 38.9s; best_it=3609
    seed 1337 test model 0 0.2s
    seed 1337 test model 1 0.2s
    seed 1337 test model 2 0.2s
    seed 1337 test model 3 0.2s
    seed 1337 test model 4 0.2s
 PP r=4.0 seed 2025 ...
   seed 2025 fold 0 done in 38.9s; best_it=3453
   seed 2025 fold 1 done in 38.4s; best_it=3408
   seed 2025 fold 2 done in 37.9s; best_it=3284
   seed 2025 fold 3 done in 40.6s; best_it=3573
   seed 2025 fold 4 done in 37.1s; best_it=3388
    seed 2025 test model 0 0.2s
    seed 2025 test model 1 0.2s
    seed 2025 test model 2 0.2s
    seed 2025 test model 3 0.2s
    seed 2025 test model 4 0.2s
Applied cap=2 to OOF. Kept rows: 308871 of 634192
 Fold 0 best after-hyst MCC=0.71295 thr_opp=0.790 thr_same=0.830
 Fold 1 best after-hyst MCC=0.74333 thr_opp=0.820 thr_same=0.780
 Fold 2 best after-hyst MCC=0.73521 thr_opp=0.850 thr_same=0.700
 Fold 3 best after-hyst MCC=0.73375 thr_opp=0.720 thr_same=0.700
 Fold 4 best after-hyst MCC=0.72896 thr_opp=0.770 thr_same=0.840
Fold-median thresholds after hysteresis (cap2): thr_opp=0.7900, thr_same=0.7800
Applied cap=2 on test. Kept rows: 122291 of 278492
PP (r40 bag + thr-after-hyst cap2) ones before G overwrite: 6418
Applied prior G overwrite. ones after=8470, delta=2052
Saved submission.csv. Took 1585.4s
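
The thresholds above are scored after `apply_hyst_per_pair`, so it helps to see what that step does to a binary sequence. The following is a minimal sketch on a hypothetical toy pair (the data and the name `majority_filter_per_pair` are illustrative, not part of the pipeline): a prediction survives only if at least 2 of the 3 steps in its centered window are positive. As in the notebook's helper, the result comes back in sorted key order, so the caller is expected to pass frames already in canonical order.

```python
import numpy as np
import pandas as pd

def majority_filter_per_pair(df_bin: pd.DataFrame) -> np.ndarray:
    """3-step centered majority vote per (game_play, p1, p2) pair.

    Mirrors the logic of apply_hyst_per_pair: output is in sorted key order.
    """
    df_h = df_bin.sort_values(['game_play', 'p1', 'p2', 'step']).copy()
    grp = df_h.groupby(['game_play', 'p1', 'p2'], sort=False)['pred_bin']
    # Count positives in a centered window of 3 steps (2 steps at sequence edges)
    votes = grp.transform(lambda s: s.rolling(3, center=True, min_periods=1).sum())
    return (votes >= 2).astype(int).to_numpy()

# Hypothetical toy sequence for one pair over 7 consecutive steps
toy = pd.DataFrame({
    'game_play': ['g1'] * 7,
    'p1': ['A'] * 7,
    'p2': ['B'] * 7,
    'step': range(7),
    'pred_bin': [0, 1, 0, 1, 1, 0, 1],
})
smoothed = majority_filter_per_pair(toy)
print(smoothed)
```

In this toy case the isolated positive at step 1 is dropped, the gaps at steps 2 and 5 are filled, and the positive at the last step is dropped because an edge window holds only two steps and both must be positive.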


In [50]:
# PP r=4.0 bagging with distance-aware caps (3/2/1) before thresholding, thresholds optimized AFTER hysteresis per fold, fold-median thresholds; apply same chain on test; G overwrite
import time, numpy as np, pandas as pd, sys, subprocess
from sklearn.model_selection import GroupKFold
from sklearn.metrics import matthews_corrcoef

try:
    import xgboost as xgb
except Exception as e:
    print('Installing xgboost...', e)
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'xgboost==2.1.4'], check=True)
    import xgboost as xgb
print('xgboost version (pp-r40-cap321-thr-after-hyst):', getattr(xgb, '__version__', 'unknown'))

def apply_hyst_per_pair(df_bin: pd.DataFrame) -> np.ndarray:
    df_h = df_bin.sort_values(['game_play','p1','p2','step']).copy()
    grp = df_h.groupby(['game_play','p1','p2'], sort=False)['pred_bin']
    df_h['pred_hyst'] = grp.transform(lambda s: (s.rolling(3, center=True, min_periods=1).sum() >= 2).astype(int))
    return df_h['pred_hyst'].to_numpy()

t0 = time.time()
print('Loading r=4.0 supervised dyn train and test features...')
train_sup = pd.read_parquet('train_supervised_w5_helm_dyn_r40.parquet')
test_feats = pd.read_parquet('test_pairs_w5_helm_dyn_r40.parquet')
folds_df = pd.read_csv('folds_game_play.csv')
train_sup = train_sup.merge(folds_df, on='game_play', how='left')
assert train_sup['fold'].notna().all()
for df in (train_sup, test_feats):
    if 'px_dist_norm_min' in df.columns: df['px_dist_norm_min'] = df['px_dist_norm_min'].fillna(1.0)
    if 'views_both_present' in df.columns: df['views_both_present'] = df['views_both_present'].fillna(0).astype(float)

drop_cols = {'contact','game_play','step','p1','p2','team1','team2','pos1','pos2','fold'}
feat_cols = [c for c in train_sup.columns if c not in drop_cols and pd.api.types.is_numeric_dtype(train_sup[c])]
print('Using', len(feat_cols), 'features')

# Canonical order
ord_idx = train_sup[['game_play','p1','p2','step']].sort_values(['game_play','p1','p2','step']).index.to_numpy()
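# NOTE: models are trained on GroupKFold splits, while the per-fold threshold search below
# uses the 'fold' column from folds_game_play.csv; the two assignments are not checked for agreement here.
gkf = GroupKFold(n_splits=5)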
groups = train_sup['game_play'].values
y_all = train_sup['contact'].astype(int).values
same_all = train_sup['same_team'].fillna(0).astype(int).values if 'same_team' in train_sup.columns else np.zeros(len(train_sup), np.int8)
fold_arr = train_sup['fold'].astype(int).to_numpy()

seeds = [42,1337,2025]
oof_s_list = []; test_s_list = []

for s in seeds:
    print(f' PP r=4.0 seed {s} ...', flush=True)
    X_all = train_sup[feat_cols].astype(float).values
    oof = np.full(len(train_sup), np.nan, float)
    models = []
    for fold, (tr_idx, va_idx) in enumerate(gkf.split(X_all, y_all, groups=groups)):
        t1 = time.time()
        X_tr, y_tr = X_all[tr_idx], y_all[tr_idx]
        X_va, y_va = X_all[va_idx], y_all[va_idx]
        neg = (y_tr == 0).sum(); posc = (y_tr == 1).sum()
        spw = max(1.0, neg / max(1, posc))
        dtrain = xgb.DMatrix(X_tr, label=y_tr); dvalid = xgb.DMatrix(X_va, label=y_va)
        params = {'tree_method':'hist','device':'cuda','max_depth':7,'eta':0.05,'subsample':0.9,'colsample_bytree':0.8,
                  'min_child_weight':10,'lambda':1.5,'alpha':0.1,'gamma':0.1,'objective':'binary:logistic','eval_metric':'logloss',
                  'scale_pos_weight': float(spw), 'seed': int(s + fold)}
        booster = xgb.train(params, dtrain, num_boost_round=3800, evals=[(dtrain,'train'),(dvalid,'valid')], early_stopping_rounds=200, verbose_eval=False)
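        # Use an explicit None check so that a legitimate best_iteration of 0 is not discarded
        best_it = getattr(booster, 'best_iteration', None)
        best_it = int(best_it) if best_it is not None else booster.num_boosted_rounds() - 1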
        oof[va_idx] = booster.predict(dvalid, iteration_range=(0, best_it + 1))
        models.append((booster, best_it))
        print(f'   seed {s} fold {fold} done in {time.time()-t1:.1f}s; best_it={best_it}', flush=True)
    # Smooth OOF on canonical order
    df = train_sup[['game_play','p1','p2','step']].iloc[ord_idx].copy()
    df['oof'] = oof[ord_idx]
    df = df.sort_values(['game_play','p1','p2','step'])
    grp = df.groupby(['game_play','p1','p2'], sort=False)
    df['oof_smooth'] = grp['oof'].transform(lambda s_: s_.rolling(3, center=True, min_periods=1).max())
    oof_s_list.append(df['oof_smooth'].to_numpy())

    # Test predictions and smoothing
    Xt = test_feats[feat_cols].astype(float).values
    dtest = xgb.DMatrix(Xt)
    pt = np.zeros(len(test_feats), float)
    for i, (booster, best_it) in enumerate(models):
        t1 = time.time()
        pt += booster.predict(dtest, iteration_range=(0, best_it + 1))
        print(f'    seed {s} test model {i} {time.time()-t1:.1f}s', flush=True)
    pt /= max(1, len(models))
    dt = test_feats[['game_play','p1','p2','step']].copy().sort_values(['game_play','p1','p2','step'])
    dt['prob'] = pt[dt.index.values]
    grp_t = dt.groupby(['game_play','p1','p2'], sort=False)
    dt['prob_smooth'] = grp_t['prob'].transform(lambda s_: s_.rolling(3, center=True, min_periods=1).max())
    test_s_list.append(dt['prob_smooth'].to_numpy())

# Average OOF across seeds in canonical order
oof_avg = np.mean(np.vstack(oof_s_list), axis=0)
keys_tr_sorted = train_sup[['game_play','p1','p2','step','distance']].iloc[ord_idx].copy().reset_index(drop=True)
y_sorted = train_sup['contact'].astype(int).to_numpy()[ord_idx]
same_sorted = train_sup['same_team'].fillna(0).astype(int).to_numpy()[ord_idx] if 'same_team' in train_sup.columns else np.zeros_like(y_sorted, np.int8)
fold_sorted = fold_arr[ord_idx]

# Distance-aware caps (<=1.6: top-3, 1.6-2.4: top-2, >2.4: top-1) applied on smoothed probs BEFORE thresholding
df_o = keys_tr_sorted.copy()
df_o['prob'] = oof_avg
df_o['row_id'] = np.arange(len(df_o))
df_o['bin'] = np.where(df_o['distance'] <= 1.6, 0, np.where(df_o['distance'] <= 2.4, 1, 2))
cap_map = {0:3, 1:2, 2:1}
# long format for both players with distance bin carried
long1 = df_o[['game_play','step','p1','prob','row_id','bin']].rename(columns={'p1':'player'})
long2 = df_o[['game_play','step','p2','prob','row_id','bin']].rename(columns={'p2':'player'})
df_long = pd.concat([long1, long2], ignore_index=True)
df_long = df_long.sort_values(['game_play','step','player','bin','prob'], ascending=[True, True, True, True, False])
df_long['rank_in_bin'] = df_long.groupby(['game_play','step','player','bin'], sort=False)['prob'].rank(method='first', ascending=False)
keep_rows = []
for b, cap in cap_map.items():
    keep_rows.append(df_long.loc[(df_long['bin'] == b) & (df_long['rank_in_bin'] <= cap), 'row_id'])
kept_rows = set(pd.concat(keep_rows).tolist())
keep_mask_all = df_o['row_id'].isin(kept_rows).to_numpy()
oof_cap = oof_avg.copy(); oof_cap[~keep_mask_all] = 0.0
print('Applied distance-aware caps (3/2/1) to OOF. Kept rows:', int(keep_mask_all.sum()), 'of', len(keep_mask_all))

# Optimize thresholds AFTER hysteresis per fold on capped OOF
thr_grid = np.round(np.linspace(0.70, 0.85, 16), 3)
thr_best = []
for k in sorted(np.unique(fold_sorted)):
    m = (fold_sorted == k)
    df_k = keys_tr_sorted.loc[m, ['game_play','p1','p2','step']].copy()
    df_k['prob'] = oof_cap[m]
    df_k['same'] = same_sorted[m]
    y_k = y_sorted[m]
    best_m, best_to, best_ts = -1.0, 0.78, 0.78
    same_arr = df_k['same'].to_numpy()
    prob_arr = df_k['prob'].to_numpy()
    for to in thr_grid:
        for ts in thr_grid:
            thr_arr = np.where(same_arr == 1, ts, to)
            pred_bin = (prob_arr >= thr_arr).astype(int)
            df_tmp = df_k[['game_play','p1','p2','step']].copy()
            df_tmp['pred_bin'] = pred_bin
            pred_h = apply_hyst_per_pair(df_tmp)
            mcc = matthews_corrcoef(y_k, pred_h)
            if mcc > best_m:
                best_m, best_to, best_ts = float(mcc), float(to), float(ts)
    thr_best.append((best_to, best_ts))
    print(f' Fold {k} best after-hyst MCC={best_m:.5f} thr_opp={best_to:.3f} thr_same={best_ts:.3f}')

thr_best = np.array(thr_best, float)
thr_opp_med = float(np.median(thr_best[:, 0]))
thr_same_med = float(np.median(thr_best[:, 1]))
print(f'Fold-median thresholds after hysteresis (cap 3/2/1): thr_opp={thr_opp_med:.4f}, thr_same={thr_same_med:.4f}')

# Test: average probs, smooth, apply distance-aware caps, then median thresholds, then hysteresis
pt_bag = np.mean(np.vstack(test_s_list), axis=0)
df_t = test_feats[['game_play','p1','p2','step','distance']].copy().sort_values(['game_play','p1','p2','step']).reset_index(drop=True)
df_t['prob_smooth'] = pt_bag
df_t['row_id'] = np.arange(len(df_t))
df_t['bin'] = np.where(df_t['distance'] <= 1.6, 0, np.where(df_t['distance'] <= 2.4, 1, 2))
long1t = df_t[['game_play','step','p1','prob_smooth','row_id','bin']].rename(columns={'p1':'player','prob_smooth':'prob'})
long2t = df_t[['game_play','step','p2','prob_smooth','row_id','bin']].rename(columns={'p2':'player','prob_smooth':'prob'})
df_long_t = pd.concat([long1t, long2t], ignore_index=True)
df_long_t = df_long_t.sort_values(['game_play','step','player','bin','prob'], ascending=[True, True, True, True, False])
df_long_t['rank_in_bin'] = df_long_t.groupby(['game_play','step','player','bin'], sort=False)['prob'].rank(method='first', ascending=False)
keep_rows_t = []
for b, cap in cap_map.items():
    keep_rows_t.append(df_long_t.loc[(df_long_t['bin'] == b) & (df_long_t['rank_in_bin'] <= cap), 'row_id'])
kept_rows_t = set(pd.concat(keep_rows_t).tolist())
keep_mask_t = df_t['row_id'].isin(kept_rows_t).to_numpy()
df_t.loc[~keep_mask_t, 'prob_smooth'] = 0.0
print('Applied distance-aware caps (3/2/1) on test. Kept rows:', int(keep_mask_t.sum()), 'of', len(keep_mask_t))

same_flag_test = test_feats[['game_play','p1','p2','step','same_team']].copy()
same_flag_test = same_flag_test.merge(df_t[['game_play','p1','p2','step','row_id']], on=['game_play','p1','p2','step'], how='right').sort_values('row_id')
same_arr_t = same_flag_test['same_team'].fillna(0).astype(int).to_numpy() if 'same_team' in same_flag_test.columns else np.zeros(len(df_t), int)
thr_arr_t = np.where(same_arr_t == 1, thr_same_med, thr_opp_med)
df_t['pred_bin'] = (df_t['prob_smooth'].to_numpy() >= thr_arr_t).astype(int)

df_tmp_t = df_t[['game_play','p1','p2','step','pred_bin']].copy()
pred_h_t = apply_hyst_per_pair(df_tmp_t)
df_t['pred_hyst'] = pred_h_t.astype(int)

# Build submission from player-player (PP) predictions, then overwrite ground-contact (G) rows from the prior submission.csv
cid_sorted = (df_t['game_play'].astype(str) + '_' + df_t['step'].astype(str) + '_' + df_t['p1'].astype(str) + '_' + df_t['p2'].astype(str))
pred_df_pp = pd.DataFrame({'contact_id': cid_sorted.values, 'contact_pp': df_t['pred_hyst'].astype(int).values})
ss = pd.read_csv('sample_submission.csv')
sub = ss.merge(pred_df_pp, on='contact_id', how='left')
sub['contact'] = sub['contact_pp'].fillna(0).astype(int)
sub = sub.drop(columns=['contact_pp'])
pp_ones = int(sub['contact'].sum())
print('PP (r40 bag + thr-after-hyst cap3/2/1) ones before G overwrite:', pp_ones)
try:
    prev_sub = pd.read_csv('submission.csv')
    g_pred_second = prev_sub[prev_sub['contact_id'].str.endswith('_G')][['contact_id','contact']].rename(columns={'contact':'contact_g'})
    sub = sub.merge(g_pred_second, on='contact_id', how='left')
    sub['contact'] = sub['contact_g'].fillna(sub['contact']).astype(int)
    sub = sub[['contact_id','contact']]
    after_ones = int(sub['contact'].sum())
    print(f'Applied prior G overwrite. ones after={after_ones}, delta={after_ones-pp_ones}')
except Exception as e:
    print('No prior submission with G rows found; skipping G overwrite.', e)
    sub = sub[['contact_id','contact']]

sub.to_csv('submission.csv', index=False)
print('Saved submission.csv. Took {:.1f}s'.format(time.time()-t0))

xgboost version (pp-r40-cap321-thr-after-hyst): 2.1.4
Loading r=4.0 supervised dyn train and test features...
Using 50 features
 PP r=4.0 seed 42 ...
   seed 42 fold 0 done in 36.7s; best_it=3253
   seed 42 fold 1 done in 39.7s; best_it=3632
   seed 42 fold 2 done in 37.9s; best_it=3326
   seed 42 fold 3 done in 38.9s; best_it=3446
   seed 42 fold 4 done in 37.3s; best_it=3468
    seed 42 test model 0 0.2s
    seed 42 test model 1 0.2s
    seed 42 test model 2 0.2s
    seed 42 test model 3 0.2s
    seed 42 test model 4 0.2s
 PP r=4.0 seed 1337 ...
   seed 1337 fold 0 done in 38.2s; best_it=3385
   seed 1337 fold 1 done in 39.6s; best_it=3608
   seed 1337 fold 2 done in 36.0s; best_it=3140
   seed 1337 fold 3 done in 38.0s; best_it=3378
   seed 1337 fold 4 done in 38.8s; best_it=3609
    seed 1337 test model 0 0.2s
    seed 1337 test model 1 0.2s
    seed 1337 test model 2 0.2s
    seed 1337 test model 3 0.2s
    seed 1337 test model 4 0.2s
 PP r=4.0 seed 2025 ...
   seed 2025 fold 0 done in 38.8s; best_it=3453
   seed 2025 fold 1 done in 38.2s; best_it=3408
   seed 2025 fold 2 done in 37.7s; best_it=3284
   seed 2025 fold 3 done in 40.5s; best_it=3573
   seed 2025 fold 4 done in 37.2s; best_it=3388
    seed 2025 test model 0 0.2s
    seed 2025 test model 1 0.2s
    seed 2025 test model 2 0.2s
    seed 2025 test model 3 0.2s
    seed 2025 test model 4 0.2s
Applied distance-aware caps (3/2/1) to OOF. Kept rows: 440634 of 634192
 Fold 0 best after-hyst MCC=0.71215 thr_opp=0.790 thr_same=0.830
 Fold 1 best after-hyst MCC=0.74012 thr_opp=0.820 thr_same=0.780
 Fold 2 best after-hyst MCC=0.73435 thr_opp=0.850 thr_same=0.700
 Fold 3 best after-hyst MCC=0.73583 thr_opp=0.720 thr_same=0.700
 Fold 4 best after-hyst MCC=0.73013 thr_opp=0.770 thr_same=0.840
Fold-median thresholds after hysteresis (cap 3/2/1): thr_opp=0.7900, thr_same=0.7800
Applied distance-aware caps (3/2/1) on test. Kept rows: 189763 of 278492
PP (r40 bag + thr-after-hyst cap3/2/1) ones before G overwrite: 6642
Applied prior G overwrite. ones after=8694, delta=2052
Saved submission.csv. Took 1596.0s
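
The distance-aware cap in the cell above ranks each player's candidate pairs within a distance bin and keeps a row if it falls inside the cap for either of its two players. The sketch below reproduces that rule on hypothetical toy data; the function name `distance_aware_cap` and its `edges`/`caps` parameters are illustrative, and `np.digitize` stands in for the nested `np.where` binning. Because ranking is per bin, one player can retain up to 3 + 2 + 1 pairs per step, which is consistent with more rows surviving here than under the flat cap=2.

```python
import numpy as np
import pandas as pd

def distance_aware_cap(df: pd.DataFrame, edges=(1.6, 2.4), caps=(3, 2, 1)) -> np.ndarray:
    """Boolean keep-mask: True if the row is within the per-bin cap for p1 or p2."""
    d = df.reset_index(drop=True).copy()
    d['row_id'] = np.arange(len(d))
    # right=True: distance <= 1.6 -> bin 0, (1.6, 2.4] -> bin 1, > 2.4 -> bin 2
    d['bin'] = np.digitize(d['distance'].to_numpy(), edges, right=True)
    cols = ['game_play', 'step', 'prob', 'row_id', 'bin']
    long = pd.concat([
        d[cols + ['p1']].rename(columns={'p1': 'player'}),
        d[cols + ['p2']].rename(columns={'p2': 'player'}),
    ], ignore_index=True)
    # Rank each player's pairs by probability within (game_play, step, bin)
    rank = long.groupby(['game_play', 'step', 'player', 'bin'])['prob'].rank(
        method='first', ascending=False)
    cap = long['bin'].map(dict(enumerate(caps)))
    kept = long.loc[rank <= cap, 'row_id'].unique()
    return d['row_id'].isin(kept).to_numpy()

# Hypothetical toy step: A-C is far away and is the second choice of both A and C
toy = pd.DataFrame({
    'game_play': ['g1'] * 4,
    'step': [0] * 4,
    'p1': ['A', 'A', 'C', 'A'],
    'p2': ['B', 'C', 'D', 'E'],
    'distance': [3.0, 3.2, 2.6, 1.0],
    'prob': [0.90, 0.80, 0.95, 0.20],
})
mask = distance_aware_cap(toy)
print(mask)
```

Here A-C is removed because the far bin allows one pair per player and both A and C have a stronger far candidate, while the low-probability A-E pair is kept since it sits alone in the near bin; thresholding, not the cap, is what later rejects it.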


In [52]:
# Rebuild full pipeline with candidate radius r=4.5 and save *_r45 artifacts
import pandas as pd, numpy as np, time, math
from itertools import combinations

t0 = time.time()
print('Rebuilding pipeline with r=4.5 ...')

def build_pairs_for_group_r(gdf, r=4.5):
    rows = []
    arr = gdf[['nfl_player_id','team','position','x_position','y_position','speed','acceleration','direction']].values
    n = arr.shape[0]
    for i, j in combinations(range(n), 2):
        pid_i, team_i, pos_i, xi, yi, si, ai, diri = arr[i]
        pid_j, team_j, pos_j, xj, yj, sj, aj, dirj = arr[j]
        dx = xj - xi; dy = yj - yi
        dist = math.hypot(dx, dy)
        if dist > r:
            continue
        a = int(pid_i); b = int(pid_j)
        p1, p2 = (str(a), str(b)) if a <= b else (str(b), str(a))
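        # Velocity components assume `direction` is in degrees measured counterclockwise from the +x axis;
        # adjust the cos/sin mapping if the tracking data uses a different convention
        vxi = si * math.cos(math.radians(diri)) if not pd.isna(diri) else 0.0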
        vyi = si * math.sin(math.radians(diri)) if not pd.isna(diri) else 0.0
        vxj = sj * math.cos(math.radians(dirj)) if not pd.isna(dirj) else 0.0
        vyj = sj * math.sin(math.radians(dirj)) if not pd.isna(dirj) else 0.0
        rvx = vxj - vxi; rvy = vyj - vyi
        if dist > 0:
            ux = dx / dist; uy = dy / dist
            closing = rvx * ux + rvy * uy
        else:
            closing = 0.0
        if pd.isna(diri) or pd.isna(dirj):
            hd = np.nan
        else:
            d = (diri - dirj + 180) % 360 - 180
            hd = abs(d)
        rows.append((p1, p2, dist, dx, dy, si, sj, ai, aj, closing, abs(closing), hd, int(team_i == team_j), str(team_i), str(team_j), str(pos_i), str(pos_j)))
    if not rows:
        return pd.DataFrame(columns=['p1','p2','distance','rel_dx','rel_dy','speed1','speed2','accel1','accel2','closing','abs_closing','abs_d_heading','same_team','team1','team2','pos1','pos2'])
    return pd.DataFrame(rows, columns=['p1','p2','distance','rel_dx','rel_dy','speed1','speed2','accel1','accel2','closing','abs_closing','abs_d_heading','same_team','team1','team2','pos1','pos2'])

def build_feature_table_r(track_df, r=4.5):
    feats = []
    cnt = 0
    last = time.time()
    for (gp, step), gdf in track_df.groupby(['game_play','step'], sort=False):
        f = build_pairs_for_group_r(gdf, r=r)
        if not f.empty:
            f.insert(0, 'step', step)
            f.insert(0, 'game_play', gp)
            feats.append(f)
        cnt += 1
        if cnt % 500 == 0:
            now = time.time()
            print(f' processed {cnt} steps; +{now-last:.1f}s; total {now-t0:.1f}s', flush=True)
            last = now
    if feats:
        return pd.concat(feats, ignore_index=True)
    return pd.DataFrame(columns=['game_play','step','p1','p2','distance','rel_dx','rel_dy','speed1','speed2','accel1','accel2','closing','abs_closing','abs_d_heading','same_team','team1','team2','pos1','pos2'])

print('Building train pairs r=4.5 ...')
train_pairs_r45 = build_feature_table_r(train_track_idx, r=4.5)
print('train_pairs_r45:', train_pairs_r45.shape)
train_pairs_r45.to_parquet('train_pairs_r45.parquet', index=False)
print('Building test pairs r=4.5 ...')
test_pairs_r45 = build_feature_table_r(test_track_idx, r=4.5)
print('test_pairs_r45:', test_pairs_r45.shape)
test_pairs_r45.to_parquet('test_pairs_r45.parquet', index=False)

def add_window_feats_local(df: pd.DataFrame, W: int = 5):
    df = df.sort_values(['game_play','p1','p2','step']).copy()
    grp = df.groupby(['game_play','p1','p2'], sort=False)
    df['dist_min_p5'] = grp['distance'].rolling(W, min_periods=1).min().reset_index(level=[0,1,2], drop=True)
    df['dist_mean_p5'] = grp['distance'].rolling(W, min_periods=1).mean().reset_index(level=[0,1,2], drop=True)
    df['dist_max_p5'] = grp['distance'].rolling(W, min_periods=1).max().reset_index(level=[0,1,2], drop=True)
    df['dist_std_p5'] = grp['distance'].rolling(W, min_periods=1).std().reset_index(level=[0,1,2], drop=True)
    df['abs_close_min_p5'] = grp['abs_closing'].rolling(W, min_periods=1).min().reset_index(level=[0,1,2], drop=True)
    df['abs_close_mean_p5'] = grp['abs_closing'].rolling(W, min_periods=1).mean().reset_index(level=[0,1,2], drop=True)
    df['abs_close_max_p5'] = grp['abs_closing'].rolling(W, min_periods=1).max().reset_index(level=[0,1,2], drop=True)
    df['abs_close_std_p5'] = grp['abs_closing'].rolling(W, min_periods=1).std().reset_index(level=[0,1,2], drop=True)
    for thr, name in [(1.5,'lt15'), (2.0,'lt20'), (2.5,'lt25')]:
        key = f'cnt_dist_{name}_p5'
        # transform keeps the original row index, so no index-level reshaping is needed
        df[key] = grp['distance'].transform(lambda s, thr=thr: s.lt(thr).rolling(W, min_periods=1).sum())
    df['dist_delta_p5'] = df['distance'] - grp['distance'].shift(W)
    return df

print('Adding W5 features (train/test) for r=4.5 ...')
train_w_r45 = add_window_feats_local(train_pairs_r45, W=5)
test_w_r45 = add_window_feats_local(test_pairs_r45, W=5)
train_w_r45.to_parquet('train_pairs_w5_r45.parquet', index=False)
test_w_r45.to_parquet('test_pairs_w5_r45.parquet', index=False)

FPS = 59.94
def prep_meta(vmeta: pd.DataFrame):
    vm = vmeta.copy()
    for c in ['start_time','snap_time']:
        if np.issubdtype(vm[c].dtype, np.number):
            continue
        ts = pd.to_datetime(vm[c], errors='coerce')
        if ts.notna().any():
            vm[c] = (ts - ts.dt.floor('D')).dt.total_seconds().astype(float)
        else:
            vm[c] = pd.to_numeric(vm[c], errors='coerce')
    vm['snap_frame'] = ((vm['snap_time'] - vm['start_time']) * FPS).round().astype('Int64')
    return vm[['game_play','view','snap_frame']].drop_duplicates()

print('Loading helmets and video metadata...')
train_helm_df = pd.read_csv('train_baseline_helmets.csv')
test_helm_df = pd.read_csv('test_baseline_helmets.csv')
train_vmeta_df = pd.read_csv('train_video_metadata.csv')
test_vmeta_df = pd.read_csv('test_video_metadata.csv')
meta_tr = prep_meta(train_vmeta_df); meta_te = prep_meta(test_vmeta_df)

def dedup_and_step(helm: pd.DataFrame, meta: pd.DataFrame):
    df = helm[['game_play','view','frame','nfl_player_id','left','top','width','height']].copy()
    df = df.dropna(subset=['nfl_player_id'])
    df['nfl_player_id'] = df['nfl_player_id'].astype(int).astype(str)
    df['area'] = df['width'] * df['height']
    df['cx'] = df['left'] + 0.5 * df['width']
    df['cy'] = df['top'] + 0.5 * df['height']
    df = df.sort_values(['game_play','view','frame','nfl_player_id','area'], ascending=[True,True,True,True,False]).drop_duplicates(['game_play','view','frame','nfl_player_id'], keep='first')
    df = df.merge(meta, on=['game_play','view'], how='left')
    # About 6 video frames per tracking step (59.94 fps video; a 10 Hz tracking rate is assumed here)
    df['step'] = ((df['frame'] - df['snap_frame']).astype('float') / 6.0).round().astype('Int64')
    df = df.dropna(subset=['step']); df['step'] = df['step'].astype(int)
    dm1 = df.copy(); dm1['target_step'] = dm1['step'] - 1
    d0 = df.copy(); d0['target_step'] = df['step']
    dp1 = df.copy(); dp1['target_step'] = df['step'] + 1
    d = pd.concat([dm1, d0, dp1], ignore_index=True)
    agg = d.groupby(['game_play','view','target_step','nfl_player_id'], sort=False).agg(
        cx_mean=('cx','mean'), cy_mean=('cy','mean'), h_mean=('height','mean'), cnt=('cx','size')
    ).reset_index().rename(columns={'target_step':'step'})
    return agg

print('Preparing helmet aggregates...')
h_tr = dedup_and_step(train_helm_df, meta_tr)
h_te = dedup_and_step(test_helm_df, meta_te)
print('Helmet agg shapes:', h_tr.shape, h_te.shape)

def merge_helmet_to_pairs_df(pairs: pd.DataFrame, h_agg: pd.DataFrame):
    ha = h_agg[['game_play','step','view','nfl_player_id','cx_mean','cy_mean','h_mean']].copy()
    a = ha.rename(columns={'nfl_player_id':'p1','cx_mean':'cx1','cy_mean':'cy1','h_mean':'h1'})
    b = ha.rename(columns={'nfl_player_id':'p2','cx_mean':'cx2','cy_mean':'cy2','h_mean':'h2'})
    merged = a.merge(b, on=['game_play','step','view'], how='inner')
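    # Compare IDs numerically: the pair builder orders p1/p2 by integer ID, and string
    # comparison would disagree whenever the two IDs have different digit counts
    merged = merged[merged['p1'].astype(int) < merged['p2'].astype(int)]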
    merged['px_dist'] = np.sqrt((merged['cx1'] - merged['cx2'])**2 + (merged['cy1'] - merged['cy2'])**2)
    merged['px_dist_norm'] = merged['px_dist'] / np.sqrt(np.maximum(1e-6, merged['h1'] * merged['h2']))
    agg = merged.groupby(['game_play','step','p1','p2'], as_index=False).agg(
        px_dist_norm_min=('px_dist_norm','min'),
        views_both_present=('px_dist_norm', 'count')  # number of views with a non-null distance
    )
    out = pairs.merge(agg, on=['game_play','step','p1','p2'], how='left')
    return out

print('Merging helmets into pairs (train/test) ...')
train_pairs_w5_helm_r45 = merge_helmet_to_pairs_df(train_w_r45, h_tr)
test_pairs_w5_helm_r45 = merge_helmet_to_pairs_df(test_w_r45, h_te)
train_pairs_w5_helm_r45.to_parquet('train_pairs_w5_helm_r45.parquet', index=False)
test_pairs_w5_helm_r45.to_parquet('test_pairs_w5_helm_r45.parquet', index=False)

def add_dyn_feats(df: pd.DataFrame) -> pd.DataFrame:
    df = df.sort_values(['game_play','p1','p2','step']).copy()
    grp = df.groupby(['game_play','p1','p2'], sort=False)
    if 'px_dist_norm_min' in df.columns: df['px_dist_norm_min'] = df['px_dist_norm_min'].fillna(1.0)
    if 'views_both_present' in df.columns: df['views_both_present'] = df['views_both_present'].fillna(0).astype(float)
    df['approaching_flag'] = (df['closing'] < 0).astype(int)
    denom = (-df['closing']).clip(lower=1e-3)
    ttc_raw = df['distance'] / denom
    ttc_raw = ttc_raw.where(df['approaching_flag'] == 1, 10.0)
    df['ttc_raw'] = ttc_raw.astype(float)
    df['ttc_clip'] = df['ttc_raw'].clip(0, 5)
    df['ttc_log'] = np.log1p(df['ttc_clip'])
    df['inv_ttc'] = 1.0 / (1.0 + df['ttc_clip'])
    df['d_dist_1'] = df['distance'] - grp['distance'].shift(1)
    df['d_dist_2'] = df['distance'] - grp['distance'].shift(2)
    df['d_dist_5'] = df['distance'] - grp['distance'].shift(5)
    df['d_close_1'] = df['closing'] - grp['closing'].shift(1)
    df['d_absclose_1'] = df['abs_closing'] - grp['abs_closing'].shift(1)
    df['d_speed1_1'] = df['speed1'] - grp['speed1'].shift(1)
    df['d_speed2_1'] = df['speed2'] - grp['speed2'].shift(1)
    df['d_accel1_1'] = df['accel1'] - grp['accel1'].shift(1)
    df['d_accel2_1'] = df['accel2'] - grp['accel2'].shift(1)
    df['rm3_d_dist_1'] = grp['d_dist_1'].transform(lambda s: s.rolling(3, min_periods=1).mean())
    df['rm3_d_close_1'] = grp['d_close_1'].transform(lambda s: s.rolling(3, min_periods=1).mean())
    for c in ['d_dist_1','d_dist_2','d_dist_5','d_close_1','d_absclose_1','d_speed1_1','d_speed2_1','d_accel1_1','d_accel2_1','rm3_d_dist_1','rm3_d_close_1']:
        df[c] = df[c].fillna(0.0)
    df['rel_speed'] = (df['speed2'] - df['speed1']).astype(float)
    df['abs_rel_speed'] = df['rel_speed'].abs()
    df['rel_accel'] = (df['accel2'] - df['accel1']).astype(float)
    df['abs_rel_accel'] = df['rel_accel'].abs()
    df['jerk1'] = grp['accel1'].diff().fillna(0.0)
    df['jerk2'] = grp['accel2'].diff().fillna(0.0)
    if 'px_dist_norm_min' in df.columns:
        df['d_px_norm_1'] = df['px_dist_norm_min'] - grp['px_dist_norm_min'].shift(1)
        df['d_px_norm_1'] = df['d_px_norm_1'].fillna(0.0)
        df['cnt_px_lt006_p3'] = grp['px_dist_norm_min'].transform(lambda s: s.lt(0.06).rolling(3, min_periods=1).sum()).astype(float)
        df['cnt_px_lt008_p3'] = grp['px_dist_norm_min'].transform(lambda s: s.lt(0.08).rolling(3, min_periods=1).sum()).astype(float)
    else:
        df['d_px_norm_1'] = 0.0; df['cnt_px_lt006_p3'] = 0.0; df['cnt_px_lt008_p3'] = 0.0
    return df

print('Adding dyn features (train/test) ...')
tr_dyn_r45 = add_dyn_feats(train_pairs_w5_helm_r45)
te_dyn_r45 = add_dyn_feats(test_pairs_w5_helm_r45)
tr_dyn_r45.to_parquet('train_pairs_w5_helm_dyn_r45.parquet', index=False)
te_dyn_r45.to_parquet('test_pairs_w5_helm_dyn_r45.parquet', index=False)

key_cols = ['game_play','step','p1','p2']
lab_cols = key_cols + ['contact']
labels_min = train_labels[lab_cols].copy()
sup_r45 = labels_min.merge(tr_dyn_r45, on=key_cols, how='inner')
print('Supervised(inner) r=4.5 before expansion:', sup_r45.shape, 'pos rate:', sup_r45['contact'].mean())
pos = sup_r45.loc[sup_r45['contact'] == 1, ['game_play','p1','p2','step']]
pos_m1 = pos.copy(); pos_m1['step'] = pos_m1['step'] - 1
pos_p1 = pos.copy(); pos_p1['step'] = pos_p1['step'] + 1
pos_exp = pd.concat([pos_m1, pos_p1], ignore_index=True).drop_duplicates()
pos_exp['flag_pos_exp'] = 1
sup_r45 = sup_r45.merge(pos_exp, on=['game_play','p1','p2','step'], how='left')
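# Note: this overwrites `contact` in place, so any metric computed later from this table
# is scored against the expanded labels, not the original ones.
sup_r45.loc[sup_r45['flag_pos_exp'] == 1, 'contact'] = 1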
sup_r45.drop(columns=['flag_pos_exp'], inplace=True)
print('After positive expansion (r=4.5): pos rate:', sup_r45['contact'].mean())
sup_r45.to_parquet('train_supervised_w5_helm_dyn_r45.parquet', index=False)

print('Done r=4.5 rebuild in {:.1f}s'.format(time.time()-t0), flush=True)

Rebuilding pipeline with r=4.5 ...
Building train pairs r=4.5 ...
 processed 500 steps; +0.7s; total 0.7s
 processed 1000 steps; +0.6s; total 1.3s
 processed 1500 steps; +0.6s; total 1.9s
 processed 2000 steps; +0.6s; total 2.5s
 processed 2500 steps; +0.6s; total 3.1s
 processed 3000 steps; +0.6s; total 3.6s
 processed 3500 steps; +0.6s; total 4.2s
 processed 4000 steps; +0.6s; total 4.8s
 processed 4500 steps; +0.6s; total 5.4s
 processed 5000 steps; +0.6s; total 6.0s
 processed 5500 steps; +0.6s; total 6.6s
 processed 6000 steps; +0.6s; total 7.1s
 processed 6500 steps; +1.6s; total 8.7s
 processed 7000 steps; +0.6s; total 9.3s
 processed 7500 steps; +0.6s; total 9.9s
 processed 8000 steps; +0.6s; total 10.4s
 processed 8500 steps; +0.6s; total 11.0s
 processed 9000 steps; +0.6s; total 11.6s
 processed 9500 steps; +0.6s; total 12.2s
 processed 10000 steps; +0.6s; total 12.8s
 processed 10500 steps; +0.6s; total 13.4s
 processed 11000 steps; +0.6s; total 13.9s
 processed 11500 steps; +0.5s; total 14.5s
 processed 12000 steps; +0.6s; total 15.1s
 processed 12500 steps; +0.6s; total 15.7s
 processed 13000 steps; +0.6s; total 16.2s
 processed 13500 steps; +0.6s; total 16.8s
 processed 14000 steps; +0.6s; total 17.4s
 processed 14500 steps; +0.6s; total 18.0s
 processed 15000 steps; +0.6s; total 18.6s
 processed 15500 steps; +0.6s; total 19.2s
 processed 16000 steps; +1.7s; total 20.9s
 processed 16500 steps; +0.6s; total 21.4s
 processed 17000 steps; +0.6s; total 22.0s
 processed 17500 steps; +0.6s; total 22.6s
 processed 18000 steps; +0.6s; total 23.1s
 processed 18500 steps; +0.6s; total 23.7s
 processed 19000 steps; +0.6s; total 24.3s
 processed 19500 steps; +0.6s; total 24.8s
 processed 20000 steps; +0.6s; total 25.4s
 processed 20500 steps; +0.6s; total 26.0s
 processed 21000 steps; +0.6s; total 26.6s
 processed 21500 steps; +0.6s; total 27.1s
 processed 22000 steps; +0.6s; total 27.7s
 processed 22500 steps; +0.6s; total 28.3s
 processed 23000 steps; +0.6s; total 28.8s
 processed 23500 steps; +0.6s; total 29.4s
 processed 24000 steps; +0.6s; total 30.0s
 processed 24500 steps; +0.6s; total 30.6s
 processed 25000 steps; +0.6s; total 31.1s
 processed 25500 steps; +0.6s; total 31.7s
 processed 26000 steps; +0.6s; total 32.3s
 processed 26500 steps; +0.6s; total 32.9s
 processed 27000 steps; +0.6s; total 33.5s
 processed 27500 steps; +1.9s; total 35.3s
 processed 28000 steps; +0.6s; total 35.9s
 processed 28500 steps; +0.6s; total 36.5s
 processed 29000 steps; +0.6s; total 37.1s
 processed 29500 steps; +0.6s; total 37.7s
 processed 30000 steps; +0.6s; total 38.3s
 processed 30500 steps; +0.6s; total 38.9s
 processed 31000 steps; +0.6s; total 39.4s
 processed 31500 steps; +0.6s; total 40.0s
 processed 32000 steps; +0.6s; total 40.6s
 processed 32500 steps; +0.6s; total 41.1s
 processed 33000 steps; +0.6s; total 41.7s
 processed 33500 steps; +0.6s; total 42.3s
 processed 34000 steps; +0.6s; total 42.9s
 processed 34500 steps; +0.6s; total 43.5s
 processed 35000 steps; +0.6s; total 44.0s
 processed 35500 steps; +0.6s; total 44.6s
 processed 36000 steps; +0.6s; total 45.2s
 processed 36500 steps; +0.6s; total 45.8s
 processed 37000 steps; +0.6s; total 46.4s
 processed 37500 steps; +0.6s; total 46.9s
 processed 38000 steps; +0.6s; total 47.5s
 processed 38500 steps; +0.6s; total 48.1s
 processed 39000 steps; +0.6s; total 48.7s
 processed 39500 steps; +0.6s; total 49.2s
 processed 40000 steps; +0.6s; total 49.8s
 processed 40500 steps; +0.6s; total 50.4s
 processed 41000 steps; +0.6s; total 51.0s
 processed 41500 steps; +2.1s; total 53.1s
 processed 42000 steps; +0.6s; total 53.6s
 processed 42500 steps; +0.6s; total 54.2s
 processed 43000 steps; +0.6s; total 54.8s
 processed 43500 steps; +0.6s; total 55.4s
 processed 44000 steps; +0.6s; total 56.0s
 processed 44500 steps; +0.6s; total 56.5s
 processed 45000 steps; +0.6s; total 57.1s
 processed 45500 steps; +0.6s; total 57.7s
 processed 46000 steps; +0.6s; total 58.3s
 processed 46500 steps; +0.6s; total 58.8s
 processed 47000 steps; +0.6s; total 59.4s
 processed 47500 steps; +0.6s; total 60.0s
 processed 48000 steps; +0.6s; total 60.6s
 processed 48500 steps; +0.6s; total 61.1s
 processed 49000 steps; +0.6s; total 61.7s
 processed 49500 steps; +0.6s; total 62.3s
 processed 50000 steps; +0.6s; total 62.9s
 processed 50500 steps; +0.6s; total 63.5s
 processed 51000 steps; +0.6s; total 64.0s
 processed 51500 steps; +0.6s; total 64.6s
 processed 52000 steps; +0.6s; total 65.2s
 processed 52500 steps; +0.6s; total 65.8s
 processed 53000 steps; +0.6s; total 66.4s
 processed 53500 steps; +0.6s; total 67.0s
 processed 54000 steps; +0.6s; total 67.6s
 processed 54500 steps; +0.6s; total 68.1s
 processed 55000 steps; +0.6s; total 68.7s
 processed 55500 steps; +0.6s; total 69.3s
train_pairs_r45: (2828916, 19)
Building test pairs r=4.5 ...
 processed 500 steps; +0.6s; total 76.8s
 processed 1000 steps; +0.6s; total 77.4s
 processed 1500 steps; +0.6s; total 78.0s
 processed 2000 steps; +0.6s; total 78.6s
 processed 2500 steps; +0.6s; total 79.2s
 processed 3000 steps; +0.6s; total 79.8s
 processed 3500 steps; +0.6s; total 80.3s
 processed 4000 steps; +0.6s; total 80.9s
 processed 4500 steps; +0.6s; total 81.5s
 processed 5000 steps; +0.6s; total 82.0s
 processed 5500 steps; +0.6s; total 82.6s
test_pairs_r45: (319769, 19)
Adding W5 features (train/test) for r=4.5 ...
Loading helmets and video metadata...
Preparing helmet aggregates...
Helmet agg shapes: (620840, 8) (67667, 8)
Merging helmets into pairs (train/test) ...
Adding dyn features (train/test) ...
Supervised(inner) r=4.5 before expansion: (745624, 59) pos rate: 0.05718431810134867
After positive expansion (r=4.5): pos rate: 0.06546060749117517
Done r=4.5 rebuild in 395.7s
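
The ±1-step positive expansion in the rebuild cell rewrites `contact` directly, which is why the positive rate rises from about 0.057 to about 0.065. Below is a minimal sketch of the same idea on a toy frame, assuming one wanted to keep the untouched labels next to the expanded ones. The names `expand_positives`, `radius`, and `contact_orig` are hypothetical and do not appear in the notebook.

```python
import pandas as pd

KEYS = ["game_play", "p1", "p2", "step"]

def expand_positives(sup: pd.DataFrame, radius: int = 1) -> pd.DataFrame:
    """Label rows within `radius` steps of a positive as positive.

    The original labels are preserved in `contact_orig` (hypothetical column)
    so that metrics can still be computed against them.
    """
    out = sup.copy()
    out["contact_orig"] = out["contact"]
    pos = out.loc[out["contact"] == 1, KEYS]
    # Shift the step of every positive by each non-zero offset in [-radius, radius].
    shifted = []
    for d in range(-radius, radius + 1):
        if d == 0:
            continue
        tmp = pos.copy()
        tmp["step"] = tmp["step"] + d
        shifted.append(tmp)
    if not shifted or pos.empty:
        return out
    pos_exp = pd.concat(shifted, ignore_index=True).drop_duplicates()
    pos_exp["flag_pos_exp"] = 1
    # Left merge keeps every original row (and its order); neighbours get the flag.
    out = out.merge(pos_exp, on=KEYS, how="left")
    out.loc[out["flag_pos_exp"] == 1, "contact"] = 1
    return out.drop(columns=["flag_pos_exp"])

toy = pd.DataFrame({
    "game_play": ["g1"] * 5,
    "p1": ["a"] * 5,
    "p2": ["b"] * 5,
    "step": [0, 1, 2, 3, 4],
    "contact": [0, 0, 1, 0, 0],
})
expanded = expand_positives(toy)
print(expanded[["step", "contact_orig", "contact"]])
```

With a `contact_orig` column available, training could use `contact` while threshold search and MCC reporting use `contact_orig`; whether that changes the selected thresholds here is untested.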


In [53]:
# PP r=4.5 bagging with thresholds optimized AFTER hysteresis per fold (cap=2), fold-median thresholds, identical test chain, then G overwrite
import time, numpy as np, pandas as pd, sys, subprocess
from sklearn.model_selection import GroupKFold
from sklearn.metrics import matthews_corrcoef

try:
    import xgboost as xgb
except Exception as e:
    print('Installing xgboost...', e)
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'xgboost==2.1.4'], check=True)
    import xgboost as xgb
print('xgboost version (pp-bag-r45-thr-after-hyst):', getattr(xgb, '__version__', 'unknown'))

def apply_hyst_per_pair(df_bin: pd.DataFrame) -> np.ndarray:
    df_h = df_bin.sort_values(['game_play','p1','p2','step']).copy()
    grp = df_h.groupby(['game_play','p1','p2'], sort=False)['pred_bin']
    df_h['pred_hyst'] = grp.transform(lambda s: (s.rolling(3, center=True, min_periods=1).sum() >= 2).astype(int))
    return df_h['pred_hyst'].to_numpy()

t0 = time.time()
print('Loading r=4.5 supervised dyn train and test features...')
train_sup = pd.read_parquet('train_supervised_w5_helm_dyn_r45.parquet')
test_feats = pd.read_parquet('test_pairs_w5_helm_dyn_r45.parquet')
folds_df = pd.read_csv('folds_game_play.csv')
train_sup = train_sup.merge(folds_df, on='game_play', how='left')
assert train_sup['fold'].notna().all()
for df in (train_sup, test_feats):
    if 'px_dist_norm_min' in df.columns: df['px_dist_norm_min'] = df['px_dist_norm_min'].fillna(1.0)
    if 'views_both_present' in df.columns: df['views_both_present'] = df['views_both_present'].fillna(0).astype(float)

drop_cols = {'contact','game_play','step','p1','p2','team1','team2','pos1','pos2','fold'}
feat_cols = [c for c in train_sup.columns if c not in drop_cols and pd.api.types.is_numeric_dtype(train_sup[c])]
print('Using', len(feat_cols), 'features')

# Canonical order
ord_idx = train_sup[['game_play','p1','p2','step']].sort_values(['game_play','p1','p2','step']).index.to_numpy()
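# Note: these GroupKFold splits are computed independently of the `fold` column from folds_game_play.csv;
# that column is used below only to partition OOF rows for the threshold search, and the two assignments need not coincide.
gkf = GroupKFold(n_splits=5)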
groups = train_sup['game_play'].values
y_all = train_sup['contact'].astype(int).values
same_all = train_sup['same_team'].fillna(0).astype(int).values if 'same_team' in train_sup.columns else np.zeros(len(train_sup), np.int8)
fold_arr = train_sup['fold'].astype(int).to_numpy()

seeds = [42,1337,2025]
oof_s_list = []; test_s_list = []

for s in seeds:
    print(f' PP r=4.5 seed {s} ...', flush=True)
    X_all = train_sup[feat_cols].astype(float).values
    oof = np.full(len(train_sup), np.nan, float)
    models = []
    for fold, (tr_idx, va_idx) in enumerate(gkf.split(X_all, y_all, groups=groups)):
        t1 = time.time()
        X_tr, y_tr = X_all[tr_idx], y_all[tr_idx]
        X_va, y_va = X_all[va_idx], y_all[va_idx]
        neg = (y_tr == 0).sum(); posc = (y_tr == 1).sum()
        spw = max(1.0, neg / max(1, posc))
        dtrain = xgb.DMatrix(X_tr, label=y_tr); dvalid = xgb.DMatrix(X_va, label=y_va)
        params = {'tree_method':'hist','device':'cuda','max_depth':7,'eta':0.05,'subsample':0.9,'colsample_bytree':0.8,
                  'min_child_weight':10,'lambda':1.5,'alpha':0.1,'gamma':0.1,'objective':'binary:logistic','eval_metric':'logloss',
                  'scale_pos_weight': float(spw), 'seed': int(s + fold)}
        booster = xgb.train(params, dtrain, num_boost_round=3800, evals=[(dtrain,'train'),(dvalid,'valid')], early_stopping_rounds=200, verbose_eval=False)
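        # Check for None explicitly: `x or default` would also discard a legitimate best_iteration of 0.
        best_it = getattr(booster, 'best_iteration', None)
        best_it = int(best_it) if best_it is not None else booster.num_boosted_rounds() - 1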
        oof[va_idx] = booster.predict(dvalid, iteration_range=(0, best_it + 1))
        models.append((booster, best_it))
        print(f'   seed {s} fold {fold} done in {time.time()-t1:.1f}s; best_it={best_it}', flush=True)
    # Smooth OOF on canonical order
    df = train_sup[['game_play','p1','p2','step']].iloc[ord_idx].copy()
    df['oof'] = oof[ord_idx]
    df = df.sort_values(['game_play','p1','p2','step'])
    grp = df.groupby(['game_play','p1','p2'], sort=False)
    df['oof_smooth'] = grp['oof'].transform(lambda s_: s_.rolling(3, center=True, min_periods=1).max())
    oof_s_list.append(df['oof_smooth'].to_numpy())

    # Test predictions and smoothing
    Xt = test_feats[feat_cols].astype(float).values
    dtest = xgb.DMatrix(Xt)
    pt = np.zeros(len(test_feats), float)
    for i, (booster, best_it) in enumerate(models):
        t1 = time.time()
        pt += booster.predict(dtest, iteration_range=(0, best_it + 1))
        print(f'    seed {s} test model {i} {time.time()-t1:.1f}s', flush=True)
    pt /= max(1, len(models))
    dt = test_feats[['game_play','p1','p2','step']].copy().sort_values(['game_play','p1','p2','step'])
    dt['prob'] = pt[dt.index.values]
    grp_t = dt.groupby(['game_play','p1','p2'], sort=False)
    dt['prob_smooth'] = grp_t['prob'].transform(lambda s_: s_.rolling(3, center=True, min_periods=1).max())
    test_s_list.append(dt['prob_smooth'].to_numpy())

# Average OOF across seeds in canonical order
oof_avg = np.mean(np.vstack(oof_s_list), axis=0)
keys_tr_sorted = train_sup[['game_play','p1','p2','step']].iloc[ord_idx].copy().reset_index(drop=True)
y_sorted = train_sup['contact'].astype(int).to_numpy()[ord_idx]
same_sorted = train_sup['same_team'].fillna(0).astype(int).to_numpy()[ord_idx] if 'same_team' in train_sup.columns else np.zeros_like(y_sorted, np.int8)
fold_sorted = fold_arr[ord_idx]

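# Apply cap=2 to OOF probs before thresholding: for each (game_play, step, player) keep only the 2 highest-probability pairs.
# A pair survives if it ranks in the top 2 for either of its players; all other probs are zeroed.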
df_o = keys_tr_sorted.copy()
df_o['prob'] = oof_avg
df_o['row_id'] = np.arange(len(df_o))
long1 = df_o[['game_play','step','p1','prob','row_id']].rename(columns={'p1':'player'})
long2 = df_o[['game_play','step','p2','prob','row_id']].rename(columns={'p2':'player'})
df_long = pd.concat([long1, long2], ignore_index=True)
df_long = df_long.sort_values(['game_play','step','player','prob'], ascending=[True, True, True, False])
df_long['rank'] = df_long.groupby(['game_play','step','player'], sort=False)['prob'].rank(method='first', ascending=False)
kept_rows = set(df_long.loc[df_long['rank'] <= 2, 'row_id'].tolist())
keep_mask_all = df_o['row_id'].isin(kept_rows).to_numpy()
oof_cap = oof_avg.copy(); oof_cap[~keep_mask_all] = 0.0
print('Applied cap=2 to OOF. Kept rows:', int(keep_mask_all.sum()), 'of', len(keep_mask_all))

# Optimize thresholds AFTER hysteresis per fold
thr_grid = np.round(np.linspace(0.70, 0.85, 16), 3)
thr_best = []
for k in sorted(np.unique(fold_sorted)):
    m = (fold_sorted == k)
    df_k = keys_tr_sorted.loc[m, ['game_play','p1','p2','step']].copy()
    df_k['prob'] = oof_cap[m]
    df_k['same'] = same_sorted[m]
    y_k = y_sorted[m]
    best_m, best_to, best_ts = -1.0, 0.78, 0.78
    same_arr = df_k['same'].to_numpy()
    prob_arr = df_k['prob'].to_numpy()
    for to in thr_grid:
        for ts in thr_grid:
            thr_arr = np.where(same_arr == 1, ts, to)
            pred_bin = (prob_arr >= thr_arr).astype(int)
            df_tmp = df_k[['game_play','p1','p2','step']].copy()
            df_tmp['pred_bin'] = pred_bin
            pred_h = apply_hyst_per_pair(df_tmp)
            mcc = matthews_corrcoef(y_k, pred_h)
            if mcc > best_m:
                best_m, best_to, best_ts = float(mcc), float(to), float(ts)
    thr_best.append((best_to, best_ts))
    print(f' Fold {k} best after-hyst MCC={best_m:.5f} thr_opp={best_to:.3f} thr_same={best_ts:.3f}')

thr_best = np.array(thr_best, float)
thr_opp_med = float(np.median(thr_best[:, 0]))
thr_same_med = float(np.median(thr_best[:, 1]))
print(f'Fold-median thresholds after hysteresis (r=4.5, cap2): thr_opp={thr_opp_med:.4f}, thr_same={thr_same_med:.4f}')

# Test: per-seed smoothed probs averaged across seeds, then cap=2, then fold-median thresholds, then hysteresis
pt_bag = np.mean(np.vstack(test_s_list), axis=0)
df_t = test_feats[['game_play','p1','p2','step']].copy().sort_values(['game_play','p1','p2','step']).reset_index(drop=True)
df_t['prob_smooth'] = pt_bag
df_t['row_id'] = np.arange(len(df_t))
long1t = df_t[['game_play','step','p1','prob_smooth','row_id']].rename(columns={'p1':'player','prob_smooth':'prob'})
long2t = df_t[['game_play','step','p2','prob_smooth','row_id']].rename(columns={'p2':'player','prob_smooth':'prob'})
df_long_t = pd.concat([long1t, long2t], ignore_index=True)
df_long_t = df_long_t.sort_values(['game_play','step','player','prob'], ascending=[True, True, True, False])
df_long_t['rank'] = df_long_t.groupby(['game_play','step','player'], sort=False)['prob'].rank(method='first', ascending=False)
kept_rows_t = set(df_long_t.loc[df_long_t['rank'] <= 2, 'row_id'].tolist())
keep_mask_t = df_t['row_id'].isin(kept_rows_t).to_numpy()
df_t.loc[~keep_mask_t, 'prob_smooth'] = 0.0
print('Applied cap=2 on test. Kept rows:', int(keep_mask_t.sum()), 'of', len(keep_mask_t))

same_flag_test = test_feats[['game_play','p1','p2','step','same_team']].copy()
same_flag_test = same_flag_test.merge(df_t[['game_play','p1','p2','step','row_id']], on=['game_play','p1','p2','step'], how='right').sort_values('row_id')
same_arr_t = same_flag_test['same_team'].fillna(0).astype(int).to_numpy() if 'same_team' in same_flag_test.columns else np.zeros(len(df_t), int)
thr_arr_t = np.where(same_arr_t == 1, thr_same_med, thr_opp_med)
df_t['pred_bin'] = (df_t['prob_smooth'].to_numpy() >= thr_arr_t).astype(int)

df_tmp_t = df_t[['game_play','p1','p2','step','pred_bin']].copy()
pred_h_t = apply_hyst_per_pair(df_tmp_t)
df_t['pred_hyst'] = pred_h_t.astype(int)

# Build submission with PP, then overwrite G rows from prior submission
cid_sorted = (df_t['game_play'].astype(str) + '_' + df_t['step'].astype(str) + '_' + df_t['p1'].astype(str) + '_' + df_t['p2'].astype(str))
pred_df_pp = pd.DataFrame({'contact_id': cid_sorted.values, 'contact_pp': df_t['pred_hyst'].astype(int).values})
ss = pd.read_csv('sample_submission.csv')
sub = ss.merge(pred_df_pp, on='contact_id', how='left')
sub['contact'] = sub['contact_pp'].fillna(0).astype(int)
sub = sub.drop(columns=['contact_pp'])
pp_ones = int(sub['contact'].sum())
print('PP (r45 bag + thr-after-hyst cap2) ones before G overwrite:', pp_ones)
try:
    prev_sub = pd.read_csv('submission.csv')
    g_pred_second = prev_sub[prev_sub['contact_id'].str.endswith('_G')][['contact_id','contact']].rename(columns={'contact':'contact_g'})
    sub = sub.merge(g_pred_second, on='contact_id', how='left')
    sub['contact'] = sub['contact_g'].fillna(sub['contact']).astype(int)
    sub = sub[['contact_id','contact']]
    after_ones = int(sub['contact'].sum())
    print(f'Applied prior G overwrite. ones after={after_ones}, delta={after_ones-pp_ones}')
except Exception as e:
    print('No prior submission with G rows found; skipping G overwrite.', e)
    sub = sub[['contact_id','contact']]

sub.to_csv('submission.csv', index=False)
print('Saved submission.csv. Took {:.1f}s'.format(time.time()-t0))

xgboost version (pp-bag-r45-thr-after-hyst): 2.1.4
Loading r=4.5 supervised dyn train and test features...
Using 50 features
 PP r=4.5 seed 42 ...
   seed 42 fold 0 done in 46.2s; best_it=3688
   seed 42 fold 1 done in 44.1s; best_it=3754
   seed 42 fold 2 done in 43.0s; best_it=3466
   seed 42 fold 3 done in 39.5s; best_it=3177
   seed 42 fold 4 done in 44.4s; best_it=3799
    seed 42 test model 0 0.2s
    seed 42 test model 1 0.2s
    seed 42 test model 2 0.2s
    seed 42 test model 3 0.2s
    seed 42 test model 4 0.2s
 PP r=4.5 seed 1337 ...
   seed 1337 fold 0 done in 46.4s; best_it=3799
   seed 1337 fold 1 done in 44.8s; best_it=3799
   seed 1337 fold 2 done in 43.1s; best_it=3467
   seed 1337 fold 3 done in 37.8s; best_it=2982
   seed 1337 fold 4 done in 45.2s; best_it=3777
    seed 1337 test model 0 0.2s
    seed 1337 test model 1 0.2s
    seed 1337 test model 2 0.2s
    seed 1337 test model 3 0.2s
    seed 1337 test model 4 0.2s
 PP r=4.5 seed 2025 ...
   seed 2025 fold 0 done in 46.7s; best_it=3798
   seed 2025 fold 1 done in 45.1s; best_it=3796
   seed 2025 fold 2 done in 43.8s; best_it=3519
   seed 2025 fold 3 done in 42.3s; best_it=3358
   seed 2025 fold 4 done in 44.4s; best_it=3716
    seed 2025 test model 0 0.2s
    seed 2025 test model 1 0.2s
    seed 2025 test model 2 0.2s
    seed 2025 test model 3 0.2s
    seed 2025 test model 4 0.2s
Applied cap=2 to OOF. Kept rows: 333340 of 745624
 Fold 0 best after-hyst MCC=0.71485 thr_opp=0.820 thr_same=0.810
 Fold 1 best after-hyst MCC=0.74275 thr_opp=0.850 thr_same=0.810
 Fold 2 best after-hyst MCC=0.73672 thr_opp=0.840 thr_same=0.740
 Fold 3 best after-hyst MCC=0.73139 thr_opp=0.770 thr_same=0.710
 Fold 4 best after-hyst MCC=0.73835 thr_opp=0.790 thr_same=0.830
Fold-median thresholds after hysteresis (r=4.5, cap2): thr_opp=0.8200, thr_same=0.8100
Applied cap=2 on test. Kept rows: 131297 of 319769
PP (r45 bag + thr-after-hyst cap2) ones before G overwrite: 6260
Applied prior G overwrite. ones after=8312, delta=2052
Saved submission.csv. Took 1810.4s
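
The cell above tunes thresholds after the 3-step majority filter rather than before it. The sketch below shows that ordering on a single toy pair, under stated assumptions: `majority_filter`, `mcc`, and `best_threshold_after_filter` are hypothetical helpers, the frame has a unique index, and `mcc` is a small NumPy stand-in for `sklearn.metrics.matthews_corrcoef`. Unlike the notebook's `apply_hyst_per_pair`, which returns values in sorted order, this version returns them aligned to the input rows.

```python
import numpy as np
import pandas as pd

PAIR = ["game_play", "p1", "p2"]

def majority_filter(df: pd.DataFrame, col: str = "pred_bin") -> np.ndarray:
    """Centered 3-step vote per pair: output 1 where at least 2 of the window values are 1."""
    ordered = df.sort_values(PAIR + ["step"])
    voted = ordered.groupby(PAIR, sort=False)[col].transform(
        lambda s: (s.rolling(3, center=True, min_periods=1).sum() >= 2).astype(int)
    )
    # Re-align to the caller's row order (assumes a unique index).
    return voted.reindex(df.index).to_numpy()

def mcc(y_true, y_pred) -> float:
    """Matthews correlation coefficient for binary labels; 0.0 when undefined."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(((y_true == 1) & (y_pred == 1)).sum())
    tn = int(((y_true == 0) & (y_pred == 0)).sum())
    fp = int(((y_true == 0) & (y_pred == 1)).sum())
    fn = int(((y_true == 1) & (y_pred == 0)).sum())
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def best_threshold_after_filter(df: pd.DataFrame, y_true, grid):
    """Pick the threshold whose filtered predictions score the highest MCC."""
    best_thr, best_score = None, -1.0
    for thr in grid:
        tmp = df[PAIR + ["step"]].copy()
        tmp["pred_bin"] = (df["prob"].to_numpy() >= thr).astype(int)
        score = mcc(y_true, majority_filter(tmp))
        if score > best_score:
            best_thr, best_score = float(thr), float(score)
    return best_thr, best_score

toy = pd.DataFrame({
    "game_play": ["g1"] * 8, "p1": ["a"] * 8, "p2": ["b"] * 8,
    "step": list(range(8)),
    "prob": [0.1, 0.2, 0.9, 0.8, 0.85, 0.2, 0.75, 0.1],
})
y = np.array([0, 0, 1, 1, 1, 0, 0, 0])
best_thr, best_score = best_threshold_after_filter(toy, y, grid=[0.7, 0.8])
print(best_thr, round(best_score, 4))
```

At threshold 0.7 the lone spike at step 6 is enough for the filter to fill in step 5, producing a false positive; at 0.8 the filtered output matches the labels. Scoring the raw thresholded predictions would not show this interaction.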


In [54]:
# PP r=4.5 bagging with distance-aware caps (3/2/1) and thresholds optimized AFTER hysteresis per fold; fold-median thresholds; identical test chain; then G overwrite
import time, numpy as np, pandas as pd, sys, subprocess
from sklearn.model_selection import GroupKFold
from sklearn.metrics import matthews_corrcoef

try:
    import xgboost as xgb
except Exception as e:
    print('Installing xgboost...', e)
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'xgboost==2.1.4'], check=True)
    import xgboost as xgb
print('xgboost version (pp-bag-r45-cap321-thr-after-hyst):', getattr(xgb, '__version__', 'unknown'))

def apply_hyst_per_pair(df_bin: pd.DataFrame) -> np.ndarray:
    df_h = df_bin.sort_values(['game_play','p1','p2','step']).copy()
    grp = df_h.groupby(['game_play','p1','p2'], sort=False)['pred_bin']
    df_h['pred_hyst'] = grp.transform(lambda s: (s.rolling(3, center=True, min_periods=1).sum() >= 2).astype(int))
    return df_h['pred_hyst'].to_numpy()

t0 = time.time()
print('Loading r=4.5 supervised dyn train and test features...')
train_sup = pd.read_parquet('train_supervised_w5_helm_dyn_r45.parquet')
test_feats = pd.read_parquet('test_pairs_w5_helm_dyn_r45.parquet')
folds_df = pd.read_csv('folds_game_play.csv')
train_sup = train_sup.merge(folds_df, on='game_play', how='left')
assert train_sup['fold'].notna().all()
for df in (train_sup, test_feats):
    if 'px_dist_norm_min' in df.columns: df['px_dist_norm_min'] = df['px_dist_norm_min'].fillna(1.0)
    if 'views_both_present' in df.columns: df['views_both_present'] = df['views_both_present'].fillna(0).astype(float)

drop_cols = {'contact','game_play','step','p1','p2','team1','team2','pos1','pos2','fold'}
feat_cols = [c for c in train_sup.columns if c not in drop_cols and pd.api.types.is_numeric_dtype(train_sup[c])]
print('Using', len(feat_cols), 'features')

# Canonical order
ord_idx = train_sup[['game_play','p1','p2','step']].sort_values(['game_play','p1','p2','step']).index.to_numpy()
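# Note: these GroupKFold splits are computed independently of the `fold` column from folds_game_play.csv;
# that column is used below only to partition OOF rows for the threshold search, and the two assignments need not coincide.
gkf = GroupKFold(n_splits=5)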
groups = train_sup['game_play'].values
y_all = train_sup['contact'].astype(int).values
same_all = train_sup['same_team'].fillna(0).astype(int).values if 'same_team' in train_sup.columns else np.zeros(len(train_sup), np.int8)
fold_arr = train_sup['fold'].astype(int).to_numpy()

seeds = [42,1337,2025]
oof_s_list = []; test_s_list = []

for s in seeds:
    print(f' PP r=4.5 seed {s} ...', flush=True)
    X_all = train_sup[feat_cols].astype(float).values
    oof = np.full(len(train_sup), np.nan, float)
    models = []
    for fold, (tr_idx, va_idx) in enumerate(gkf.split(X_all, y_all, groups=groups)):
        t1 = time.time()
        X_tr, y_tr = X_all[tr_idx], y_all[tr_idx]
        X_va, y_va = X_all[va_idx], y_all[va_idx]
        neg = (y_tr == 0).sum(); posc = (y_tr == 1).sum()
        spw = max(1.0, neg / max(1, posc))
        dtrain = xgb.DMatrix(X_tr, label=y_tr); dvalid = xgb.DMatrix(X_va, label=y_va)
        params = {'tree_method':'hist','device':'cuda','max_depth':7,'eta':0.05,'subsample':0.9,'colsample_bytree':0.8,
                  'min_child_weight':10,'lambda':1.5,'alpha':0.1,'gamma':0.1,'objective':'binary:logistic','eval_metric':'logloss',
                  'scale_pos_weight': float(spw), 'seed': int(s + fold)}
        booster = xgb.train(params, dtrain, num_boost_round=3800, evals=[(dtrain,'train'),(dvalid,'valid')], early_stopping_rounds=200, verbose_eval=False)
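        # Check for None explicitly: `x or default` would also discard a legitimate best_iteration of 0.
        best_it = getattr(booster, 'best_iteration', None)
        best_it = int(best_it) if best_it is not None else booster.num_boosted_rounds() - 1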
        oof[va_idx] = booster.predict(dvalid, iteration_range=(0, best_it + 1))
        models.append((booster, best_it))
        print(f'   seed {s} fold {fold} done in {time.time()-t1:.1f}s; best_it={best_it}', flush=True)
    # Smooth OOF on canonical order
    df = train_sup[['game_play','p1','p2','step']].iloc[ord_idx].copy()
    df['oof'] = oof[ord_idx]
    df = df.sort_values(['game_play','p1','p2','step'])
    grp = df.groupby(['game_play','p1','p2'], sort=False)
    df['oof_smooth'] = grp['oof'].transform(lambda s_: s_.rolling(3, center=True, min_periods=1).max())
    oof_s_list.append(df['oof_smooth'].to_numpy())

    # Test predictions and smoothing
    Xt = test_feats[feat_cols].astype(float).values
    dtest = xgb.DMatrix(Xt)
    pt = np.zeros(len(test_feats), float)
    for i, (booster, best_it) in enumerate(models):
        t1 = time.time()
        pt += booster.predict(dtest, iteration_range=(0, best_it + 1))
        print(f'    seed {s} test model {i} {time.time()-t1:.1f}s', flush=True)
    pt /= max(1, len(models))
    dt = test_feats[['game_play','p1','p2','step']].copy().sort_values(['game_play','p1','p2','step'])
    idx_sorted = test_feats[['game_play','p1','p2','step']].sort_values(['game_play','p1','p2','step']).index.values
    pt_sorted = pt[idx_sorted]
    dt['prob'] = pt_sorted
    grp_t = dt.groupby(['game_play','p1','p2'], sort=False)
    dt['prob_smooth'] = grp_t['prob'].transform(lambda s_: s_.rolling(3, center=True, min_periods=1).max())
    test_s_list.append(dt['prob_smooth'].to_numpy())

# Average OOF across seeds in canonical order
oof_avg = np.mean(np.vstack(oof_s_list), axis=0)
keys_tr_sorted = train_sup[['game_play','p1','p2','step','distance']].iloc[ord_idx].copy().reset_index(drop=True)
y_sorted = train_sup['contact'].astype(int).to_numpy()[ord_idx]
same_sorted = train_sup['same_team'].fillna(0).astype(int).to_numpy()[ord_idx] if 'same_team' in train_sup.columns else np.zeros_like(y_sorted, np.int8)
fold_sorted = fold_arr[ord_idx]

# Distance-aware caps (<=1.6: top-3, 1.6-2.4: top-2, >2.4: top-1) applied on smoothed OOF BEFORE thresholding
df_o = keys_tr_sorted.copy()
df_o['prob'] = oof_avg
df_o['row_id'] = np.arange(len(df_o))
df_o['bin'] = np.where(df_o['distance'] <= 1.6, 0, np.where(df_o['distance'] <= 2.4, 1, 2))
cap_map = {0:3, 1:2, 2:1}
long1 = df_o[['game_play','step','p1','prob','row_id','bin']].rename(columns={'p1':'player'})
long2 = df_o[['game_play','step','p2','prob','row_id','bin']].rename(columns={'p2':'player'})
df_long = pd.concat([long1, long2], ignore_index=True)
df_long = df_long.sort_values(['game_play','step','player','bin','prob'], ascending=[True, True, True, True, False])
df_long['rank_in_bin'] = df_long.groupby(['game_play','step','player','bin'], sort=False)['prob'].rank(method='first', ascending=False)
keep_rows = []
for b, cap in cap_map.items():
    keep_rows.append(df_long.loc[(df_long['bin'] == b) & (df_long['rank_in_bin'] <= cap), 'row_id'])
kept_rows = set(pd.concat(keep_rows).tolist())
keep_mask_all = df_o['row_id'].isin(kept_rows).to_numpy()
oof_cap = oof_avg.copy(); oof_cap[~keep_mask_all] = 0.0
print('Applied distance-aware caps (3/2/1) to OOF. Kept rows:', int(keep_mask_all.sum()), 'of', len(keep_mask_all))

# Optimize thresholds AFTER hysteresis per fold
thr_grid = np.round(np.linspace(0.70, 0.85, 16), 3)
thr_best = []
for k in sorted(np.unique(fold_sorted)):
    m = (fold_sorted == k)
    df_k = keys_tr_sorted.loc[m, ['game_play','p1','p2','step']].copy()
    df_k['prob'] = oof_cap[m]
    df_k['same'] = same_sorted[m]
    y_k = y_sorted[m]
    best_m, best_to, best_ts = -1.0, 0.78, 0.78
    same_arr = df_k['same'].to_numpy()
    prob_arr = df_k['prob'].to_numpy()
    for to in thr_grid:
        for ts in thr_grid:
            thr_arr = np.where(same_arr == 1, ts, to)
            pred_bin = (prob_arr >= thr_arr).astype(int)
            df_tmp = df_k[['game_play','p1','p2','step']].copy()
            df_tmp['pred_bin'] = pred_bin
            pred_h = apply_hyst_per_pair(df_tmp)
            mcc = matthews_corrcoef(y_k, pred_h)
            if mcc > best_m:
                best_m, best_to, best_ts = float(mcc), float(to), float(ts)
    thr_best.append((best_to, best_ts))
    print(f' Fold {k} best after-hyst MCC={best_m:.5f} thr_opp={best_to:.3f} thr_same={best_ts:.3f}')

thr_best = np.array(thr_best, float)
thr_opp_med = float(np.median(thr_best[:, 0]))
thr_same_med = float(np.median(thr_best[:, 1]))
print(f'Fold-median thresholds after hysteresis (r=4.5 cap3/2/1): thr_opp={thr_opp_med:.4f}, thr_same={thr_same_med:.4f}')

# Test: per-seed smoothed probs averaged across seeds, then distance-aware caps, then fold-median thresholds, then hysteresis
pt_bag = np.mean(np.vstack(test_s_list), axis=0)
df_t = test_feats[['game_play','p1','p2','step','distance']].copy().sort_values(['game_play','p1','p2','step']).reset_index(drop=True)
df_t['prob_smooth'] = pt_bag
df_t['row_id'] = np.arange(len(df_t))
df_t['bin'] = np.where(df_t['distance'] <= 1.6, 0, np.where(df_t['distance'] <= 2.4, 1, 2))
long1t = df_t[['game_play','step','p1','prob_smooth','row_id','bin']].rename(columns={'p1':'player','prob_smooth':'prob'})
long2t = df_t[['game_play','step','p2','prob_smooth','row_id','bin']].rename(columns={'p2':'player','prob_smooth':'prob'})
df_long_t = pd.concat([long1t, long2t], ignore_index=True)
df_long_t = df_long_t.sort_values(['game_play','step','player','bin','prob'], ascending=[True, True, True, True, False])
df_long_t['rank_in_bin'] = df_long_t.groupby(['game_play','step','player','bin'], sort=False)['prob'].rank(method='first', ascending=False)
keep_rows_t = []
for b, cap in cap_map.items():
    keep_rows_t.append(df_long_t.loc[(df_long_t['bin'] == b) & (df_long_t['rank_in_bin'] <= cap), 'row_id'])
kept_rows_t = set(pd.concat(keep_rows_t).tolist())
keep_mask_t = df_t['row_id'].isin(kept_rows_t).to_numpy()
df_t.loc[~keep_mask_t, 'prob_smooth'] = 0.0
print('Applied distance-aware caps (3/2/1) on test. Kept rows:', int(keep_mask_t.sum()), 'of', len(keep_mask_t))

same_flag_test = test_feats[['game_play','p1','p2','step','same_team']].copy()
same_flag_test = same_flag_test.merge(df_t[['game_play','p1','p2','step','row_id']], on=['game_play','p1','p2','step'], how='right').sort_values('row_id')
same_arr_t = same_flag_test['same_team'].fillna(0).astype(int).to_numpy() if 'same_team' in same_flag_test.columns else np.zeros(len(df_t), int)
thr_arr_t = np.where(same_arr_t == 1, thr_same_med, thr_opp_med)
df_t['pred_bin'] = (df_t['prob_smooth'].to_numpy() >= thr_arr_t).astype(int)

df_tmp_t = df_t[['game_play','p1','p2','step','pred_bin']].copy()
pred_h_t = apply_hyst_per_pair(df_tmp_t)
df_t['pred_hyst'] = pred_h_t.astype(int)

# Build submission with PP, then overwrite G rows from prior submission
cid_sorted = (df_t['game_play'].astype(str) + '_' + df_t['step'].astype(str) + '_' + df_t['p1'].astype(str) + '_' + df_t['p2'].astype(str))
pred_df_pp = pd.DataFrame({'contact_id': cid_sorted.values, 'contact_pp': df_t['pred_hyst'].astype(int).values})
ss = pd.read_csv('sample_submission.csv')
sub = ss.merge(pred_df_pp, on='contact_id', how='left')
sub['contact'] = sub['contact_pp'].fillna(0).astype(int)
sub = sub.drop(columns=['contact_pp'])
pp_ones = int(sub['contact'].sum())
print('PP (r45 bag + cap3/2/1 thr-after-hyst) ones before G overwrite:', pp_ones)
try:
    prev_sub = pd.read_csv('submission.csv')
    g_pred_second = prev_sub[prev_sub['contact_id'].str.endswith('_G')][['contact_id','contact']].rename(columns={'contact':'contact_g'})
    sub = sub.merge(g_pred_second, on='contact_id', how='left')
    sub['contact'] = sub['contact_g'].fillna(sub['contact']).astype(int)
    sub = sub[['contact_id','contact']]
    after_ones = int(sub['contact'].sum())
    print(f'Applied prior G overwrite. ones after={after_ones}, delta={after_ones-pp_ones}')
except Exception as e:
    print('No prior submission with G rows found; skipping G overwrite.', e)
    sub = sub[['contact_id','contact']]

sub.to_csv('submission.csv', index=False)
print('Saved submission.csv. Took {:.1f}s'.format(time.time()-t0))

xgboost version (pp-bag-r45-cap321-thr-after-hyst): 2.1.4
Loading r=4.5 supervised dyn train and test features...
Using 50 features
 PP r=4.5 seed 42 ...
   seed 42 fold 0 done in 46.2s; best_it=3688
   seed 42 fold 1 done in 44.0s; best_it=3754
   seed 42 fold 2 done in 43.0s; best_it=3466
   seed 42 fold 3 done in 39.4s; best_it=3177
   seed 42 fold 4 done in 44.4s; best_it=3799
    seed 42 test model 0 0.2s
    seed 42 test model 1 0.2s
    seed 42 test model 2 0.2s
    seed 42 test model 3 0.2s
    seed 42 test model 4 0.2s
 PP r=4.5 seed 1337 ...
   seed 1337 fold 0 done in 46.3s; best_it=3799
   seed 1337 fold 1 done in 44.6s; best_it=3799
   seed 1337 fold 2 done in 42.8s; best_it=3467
   seed 1337 fold 3 done in 37.7s; best_it=2982
   seed 1337 fold 4 done in 45.1s; best_it=3777
    seed 1337 test model 0 0.2s
    seed 1337 test model 1 0.2s
    seed 1337 test model 2 0.2s
    seed 1337 test model 3 0.2s
    seed 1337 test model 4 0.2s
 PP r=4.5 seed 2025 ...
   seed 2025 fold 0 done in 46.7s; best_it=3798
   seed 2025 fold 1 done in 45.1s; best_it=3796
   seed 2025 fold 2 done in 43.8s; best_it=3519
   seed 2025 fold 3 done in 42.3s; best_it=3358
   seed 2025 fold 4 done in 44.2s; best_it=3716
    seed 2025 test model 0 0.2s
    seed 2025 test model 1 0.2s
    seed 2025 test model 2 0.2s
    seed 2025 test model 3 0.2s
    seed 2025 test model 4 0.2s
Applied distance-aware caps (3/2/1) to OOF. Kept rows: 461191 of 745624
 Fold 0 best after-hyst MCC=0.71594 thr_opp=0.820 thr_same=0.810
 Fold 1 best after-hyst MCC=0.73929 thr_opp=0.850 thr_same=0.810
 Fold 2 best after-hyst MCC=0.73686 thr_opp=0.840 thr_same=0.740
 Fold 3 best after-hyst MCC=0.73407 thr_opp=0.770 thr_same=0.710
 Fold 4 best after-hyst MCC=0.73886 thr_opp=0.790 thr_same=0.830
Fold-median thresholds after hysteresis (r=4.5 cap3/2/1): thr_opp=0.8200, thr_same=0.8100
Applied distance-aware caps (3/2/1) on test. Kept rows: 195880 of 319769
PP (r45 bag + cap3/2/1 thr-after-hyst) ones before G overwrite: 6459
Applied prior G overwrite. ones after=8511, delta=2052
Saved submission.csv. Took 1808.0s
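
The distance-aware cap in the cell above ranks each player's pairs within a distance bin and keeps a pair if it falls inside the bin's cap for either of its two players. A compact sketch of that rule on toy data follows. The function `distance_aware_cap` and its `caps`/`edges` parameters are hypothetical names; the bin edges 1.6 and 2.4 and the caps 3/2/1 are taken from the cell. `np.digitize(..., right=True)` is used here as an assumed equivalent of the nested `np.where` binning.

```python
import numpy as np
import pandas as pd

def distance_aware_cap(df: pd.DataFrame, caps=(3, 2, 1), edges=(1.6, 2.4)) -> pd.DataFrame:
    """Zero the prob of pairs outside every involved player's top-k for their distance bin."""
    out = df.reset_index(drop=True).copy()
    out["row_id"] = np.arange(len(out))
    # right=True gives: distance <= 1.6 -> bin 0, (1.6, 2.4] -> bin 1, > 2.4 -> bin 2.
    out["bin"] = np.digitize(out["distance"].to_numpy(), edges, right=True)
    cols = ["game_play", "step", "prob", "row_id", "bin"]
    # One long row per (pair, player) so each pair is ranked from both players' views.
    long = pd.concat(
        [
            out[cols + ["p1"]].rename(columns={"p1": "player"}),
            out[cols + ["p2"]].rename(columns={"p2": "player"}),
        ],
        ignore_index=True,
    )
    long["rank"] = long.groupby(["game_play", "step", "player", "bin"])["prob"].rank(
        method="first", ascending=False
    )
    cap = long["bin"].map(dict(enumerate(caps)))
    # Union rule: a pair is kept if it is within the cap for at least one of its players.
    kept = set(long.loc[long["rank"] <= cap, "row_id"])
    out["prob_cap"] = np.where(out["row_id"].isin(kept), out["prob"], 0.0)
    return out.drop(columns=["row_id"])

toy = pd.DataFrame({
    "game_play": ["g1"] * 4,
    "step": [0] * 4,
    "p1": ["a", "c", "a", "a"],
    "p2": ["b", "d", "c", "e"],
    "distance": [3.0, 3.0, 3.0, 1.0],
    "prob": [0.9, 0.8, 0.5, 0.3],
})
capped = distance_aware_cap(toy)
print(capped[["p1", "p2", "bin", "prob", "prob_cap"]])
```

In the toy frame the far pair `a`-`c` is zeroed because both `a` and `c` have a higher-probability far pair, while the close pair `a`-`e` survives despite its low probability because it is ranked only against other close pairs. The union rule is also why the kept-row counts in the logs exceed what a strict per-player cap would give.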
