# Plan: Multi-modal Gesture Recognition (MMRGC)

Objectives:
- Establish GPU-enabled environment and robust, reproducible pipeline.
- Build fast baseline → improve via feature engineering and modeling.
- Achieve medal-level Levenshtein distance.

Milestones & Expert Checkpoints:
1) Planning (this doc) → Request expert feedback on medal-winning strategies and pitfalls.
2) Environment check: verify GPU; install correct cu121 PyTorch stack if needed.
3) Data audit:
   - Inspect training.csv/test.csv formats and required submission schema.
   - Inventory archives (training*.tar.gz, validation*.tar.gz, test.tar.gz) and contents (e.g., Sample*_data.mat).
   - Verify mapping between Video.Labels in .mat and training.csv sequences.
4) Baseline data loader:
   - Implement reader to parse per-sample modalities (skeleton, depth/RGB features if available) and labels.
   - Cache parsed features to disk (npz/parquet) to iterate quickly.
5) Validation protocol:
   - User-independent splits mirroring challenge (use provided validation sets if aligned).
   - Deterministic KFold/GroupKFold (group by subject/session). Save folds to disk.
6) Baseline model:
   - Sequence model on skeleton features first (GRU/LSTM/TemporalConv) with CTC/seq2seq.
   - Alt: classical per-frame classifier + Viterbi/DP decoding into sequences.
   - Quick smoke-run on subsample; enable mixed precision; early stopping.
7) Evaluation:
   - Compute Levenshtein distance on validation (OOF). Log confusion/error buckets.
8) Feature engineering:
   - Temporal deltas, velocities, joint angles, distances, normalized by body size.
   - Optional: fuse audio/RGB-depth derived features if present (late fusion).
9) Model improvements:
   - BiGRU/TemporalConvNet; SpecAug/time mask; label smoothing.
   - Calibration and decoding tweaks (beam search, penalties).
10) Ensembling:
   - Blend diverse seeds/architectures; average logits then decode.
11) Inference & Submission:
   - Generate test predictions; ensure submission.csv format matches sample.
   - Sanity-check file before submit.

Logging/Discipline:
- Print progress and elapsed time per fold.
- Cache features/logits; avoid recompute.
- Change one thing at a time; track deltas.

Next Actions:
1) Run environment and GPU check; list files; peek CSV heads.
2) Request expert review of plan and ask for medal-winning strategy specifics.

In [1]:
import os, sys, time, json, shutil, tarfile, zipfile, subprocess, pandas as pd
from pathlib import Path

def run(cmd):
    print("$", " ".join(cmd), flush=True)
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=False)
        print(out.stdout, flush=True)
        if out.stderr:
            print(out.stderr, file=sys.stderr, flush=True)
        return out.returncode
    except Exception as e:
        print(f"ERROR running {cmd}: {e}")
        return -1

print("=== GPU CHECK (nvidia-smi) ===", flush=True)
run(['bash','-lc','nvidia-smi || true'])

print("=== List files in CWD ===", flush=True)
for p in sorted(Path('.').iterdir()):
    try:
        sz = p.stat().st_size
    except Exception:
        sz = -1
    print(f"{p.name}\t{sz}")

def head_csv(path, n=3):
    try:
        df = pd.read_csv(path)
        print(f"\n--- {path} shape={df.shape} ---")
        print(df.head(n))
    except Exception as e:
        print(f"Failed to read {path}: {e}")

head_csv('training.csv', 5)
head_csv('test.csv', 5)
head_csv('randomPredictions.csv', 5)

print("\n=== Inspect sample_code_mmrgc.zip entries (first 20) ===")
try:
    with zipfile.ZipFile('sample_code_mmrgc.zip') as z:
        names = z.namelist()
        for name in names[:20]:
            print(name)
        print(f"Total entries: {len(names)}")
except Exception as e:
    print(f"Zip inspect failed: {e}")

def list_tarfirst(tarpath, k=10):
    print(f"\n=== List first {k} members of {tarpath} ===")
    try:
        with tarfile.open(tarpath, 'r:*') as tf:
            for i, m in enumerate(tf):
                if i>=k: break
                print(m.name)
    except Exception as e:
        print(f"Tar inspect failed for {tarpath}: {e}")

list_tarfirst('training1.tar.gz', 10)
list_tarfirst('validation1.tar.gz', 10)
list_tarfirst('test.tar.gz', 10)

print("\n=== Done env/data audit ===")

=== GPU CHECK (nvidia-smi) ===


$ bash -lc nvidia-smi || true


Mon Sep 29 04:13:08 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.06             Driver Version: 550.144.06     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A10-24Q                 On  |   00000002:00:00.0 Off |                    0 |
| N/A   N/A    P0             N/A /  N/A  |     182MiB /  24512MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

=== List files in CWD ===


.00_eda_and_planning_kernel_state.json	182
00_eda_and_planning.ipynb	6375
agent_metadata	4096
description.md	21508
devel01-40.7z	2177920674
docker_run.log	43548
randomPredictions.csv	5332
requirements.txt	2021
sample_code_mmrgc.zip	7708
task.txt	3949
test.csv	478
test.tar.gz	2041016729
training.csv	16513
training1.tar.gz	4370421093
training2.tar.gz	1755486450
training3.tar.gz	2300959544
valid_all_files_combined.7z	961765673
validation1.tar.gz	2909694856
validation2.tar.gz	3456269325
validation3.tar.gz	3253929930

--- training.csv shape=(297, 2) ---
   Id                                           Sequence
0   1  2 14 20 6 7 3 1 13 18 5 12 16 15 4 9 10 8 17 1...
1   3  12 3 18 14 16 20 5 2 4 1 10 6 9 19 15 17 11 13...
2   4  13 1 8 18 7 17 16 9 5 10 11 4 20 3 19 2 14 6 1...
3   5  10 4 7 13 19 15 9 11 17 1 8 5 18 3 12 16 14 2 ...
4   6  14 15 10 16 11 2 20 8 7 9 1 19 17 18 6 4 13 3 ...

--- test.csv shape=(95, 1) ---
    Id
0  300
1  301
2  302
3  303
4  304

--- randomPredictions.csv sh

./Sample00005.zip


./Sample00006.zip
./Sample00007.zip
./Sample00008.zip
./Sample00009.zip


./Sample00010.zip
./Sample00011.zip

=== List first 10 members of validation1.tar.gz ===
Sample00410.zip
Sample00411.zip
Sample00412.zip


Sample00413.zip
Sample00414.zip
Sample00415.zip
Sample00416.zip
Sample00417.zip
Sample00418.zip


Sample00420.zip

=== List first 10 members of test.tar.gz ===
./Sample00300.zip
./Sample00301.zip
./Sample00302.zip
./Sample00303.zip
./Sample00304.zip
./Sample00305.zip
./Sample00306.zip


./Sample00307.zip
./Sample00308.zip
./Sample00309.zip

=== Done env/data audit ===


In [2]:
import io, tarfile, zipfile, sys, time
from pathlib import Path

print("=== Inspect a couple of MAT files inside training/validation archives ===", flush=True)
targets = [
    ("training1.tar.gz", ["./Sample00001.zip", "./Sample00003.zip"]),
    ("validation1.tar.gz", ["Sample00410.zip"])
]

def ensure_scipy():
    try:
        import scipy.io as sio  # noqa
        return True
    except Exception:
        import subprocess, sys as _sys
        print("Installing scipy...", flush=True)
        rc = subprocess.run([_sys.executable, "-m", "pip", "install", "scipy", "--quiet"], check=False).returncode
        print("pip rc=", rc, flush=True)
        try:
            import scipy.io as sio  # noqa
            return True
        except Exception as e:
            print("Failed to import scipy after install:", e, flush=True)
            return False

ok_scipy = ensure_scipy()
if ok_scipy:
    import scipy.io as sio
else:
    sio = None

def inspect_zip_bytes(zb: bytes, label_hint: str = ""):
    with zipfile.ZipFile(io.BytesIO(zb)) as zf:
        names = zf.namelist()
        print(f"ZIP has {len(names)} entries. First 15:")
        for n in names[:15]:
            print("  ", n)
        # pick a *_data.mat if present
        mat_name = None
        for n in names:
            if n.lower().endswith("_data.mat") or n.lower().endswith(".mat"):
                mat_name = n
                break
        if mat_name and ok_scipy:
            with zf.open(mat_name) as f:
                b = f.read()
                try:
                    md = sio.loadmat(io.BytesIO(b), squeeze_me=True, struct_as_record=False)
                except TypeError:
                    # Older scipy may not accept BytesIO; write to tmp
                    tmp = Path("_tmp_inspect.mat")
                    tmp.write_bytes(b)
                    md = sio.loadmat(str(tmp), squeeze_me=True, struct_as_record=False)
                    try: tmp.unlink()
                    except Exception: pass
            print(f"MAT keys: {sorted([k for k in md.keys() if not k.startswith('__')])}")
            # Try common fields
            for key in ("Video", "video", "Labels", "labels", "Gesture", "gesture"):
                if key in md:
                    v = md[key]
                    print(f"Field {key}: type={type(v)}")
                    # Attempt to explore nested struct
                    try:
                        attrs = [a for a in dir(v) if not a.startswith('_')]
                        print(f"  attrs(sample): {attrs[:12]}")
                        # Look for Labels inside Video
                        for sub in ("Labels", "labels", "numFrames", "nframes", "fps", "SubjectID", "user", "Acquisition"):
                            if hasattr(v, sub):
                                sv = getattr(v, sub)
                                try:
                                    shp = getattr(sv, "shape", None)
                                except Exception:
                                    shp = None
                                print(f"  {key}.{sub}: type={type(sv)}, shape={shp}")
                    except Exception as e:
                        print("  could not introspect struct:", e)

for tarpath, members in targets:
    if not Path(tarpath).exists():
        print(f"Missing {tarpath}")
        continue
    print(f"\n-- TAR {tarpath} --", flush=True)
    with tarfile.open(tarpath, 'r:*') as tf:
        tf_members = {m.name: m for m in tf}
        for m in members:
            cand = m if m in tf_members else (m.lstrip('./') if m.lstrip('./') in tf_members else None)
            if not cand:
                print(f"Member {m} not found")
                continue
            print(f"Reading {cand} ...", flush=True)
            fobj = tf.extractfile(tf_members[cand])
            if not fobj:
                print("  cannot extract file object")
                continue
            data = fobj.read()
            print(f"  bytes: {len(data):,}")
            try:
                inspect_zip_bytes(data, label_hint=cand)
            except zipfile.BadZipFile:
                print("  Not a ZIP; skipping.")

print("\n=== Done MAT inspection probe ===", flush=True)

=== Inspect a couple of MAT files inside training/validation archives ===



-- TAR training1.tar.gz --


Reading ./Sample00001.zip ...


  bytes: 44,147,690
ZIP has 5 entries. First 15:
   Sample00001_color.mp4
   Sample00001_depth.mp4
   Sample00001_user.mp4
   Sample00001_data.mat
   Sample00001_audio.wav
MAT keys: ['Video']
Field Video: type=<class 'scipy.io.matlab._mio5_params.mat_struct'>
  attrs(sample): ['FrameRate', 'Frames', 'Labels', 'MaxDepth', 'NumFrames']
  Video.Labels: type=<class 'numpy.ndarray'>, shape=(20,)
Reading ./Sample00003.zip ...


  bytes: 39,357,003
ZIP has 5 entries. First 15:
   Sample00003_color.mp4
   Sample00003_depth.mp4
   Sample00003_user.mp4
   Sample00003_data.mat
   Sample00003_audio.wav


MAT keys: ['Video']
Field Video: type=<class 'scipy.io.matlab._mio5_params.mat_struct'>
  attrs(sample): ['FrameRate', 'Frames', 'Labels', 'MaxDepth', 'NumFrames']
  Video.Labels: type=<class 'numpy.ndarray'>, shape=(20,)

-- TAR validation1.tar.gz --


Reading Sample00410.zip ...


  bytes: 25,534,061
ZIP has 5 entries. First 15:
   Sample00410_color.mp4
   Sample00410_depth.mp4
   Sample00410_user.mp4
   Sample00410_audio.wav
   Sample00410_data.mat
MAT keys: ['Video']
Field Video: type=<class 'scipy.io.matlab._mio5_params.mat_struct'>
  attrs(sample): ['FrameRate', 'Frames', 'Labels', 'MaxDepth', 'NumFrames']
  Video.Labels: type=<class 'numpy.ndarray'>, shape=(0,)

=== Done MAT inspection probe ===


In [3]:
import io, tarfile, zipfile
from pathlib import Path

print("=== Deep inspect one MAT: fields, shapes, label structure ===", flush=True)
tarpath = "training1.tar.gz"
member = "./Sample00001.zip"

with tarfile.open(tarpath, 'r:*') as tf:
    tf_members = {m.name: m for m in tf}
    cand = member if member in tf_members else member.lstrip('./')
    fobj = tf.extractfile(tf_members[cand])
    data = fobj.read()
with zipfile.ZipFile(io.BytesIO(data)) as zf:
    # find mat
    mat_name = [n for n in zf.namelist() if n.lower().endswith('_data.mat') or n.lower().endswith('.mat')][0]
    b = zf.read(mat_name)
import scipy.io as sio
md = sio.loadmat(io.BytesIO(b), squeeze_me=True, struct_as_record=False)
V = md['Video']
def safe_shape(x):
    try: return getattr(x, 'shape', None)
    except Exception: return None
print("Video has attrs:", [a for a in dir(V) if not a.startswith('_')])
for fld in ("NumFrames","FrameRate","Frames","Labels","MaxDepth"):
    if hasattr(V, fld):
        val = getattr(V, fld)
        print(f"Video.{fld}: type={type(val)}, shape={safe_shape(val)}")
        if fld=="Frames":
            try:
                # Try to peek one frame entry
                fr0 = val[0] if hasattr(val, '__getitem__') else None
                print("  Frames[0] type=", type(fr0))
                if hasattr(fr0, 'shape'):
                    print("  Frames[0].shape=", fr0.shape)
            except Exception as e:
                print("  Could not index Frames:", e)
        if fld=="Labels":
            try:
                L = val
                print("  Labels len:", len(L))
                if len(L)>0:
                    l0 = L[0]
                    print("  Label[0] type:", type(l0))
                    # Try common fields of a label struct
                    if hasattr(l0, '__dict__') or hasattr(l0, 'dtype'):
                        try:
                            print("  Label[0] dir:", [a for a in dir(l0) if not a.startswith('_')][:15])
                        except Exception:
                            pass
                    # If it's an array like [start end class]
                    try:
                        import numpy as np
                        arr = np.array(l0)
                        print("  Label[0] as array:", arr, arr.shape)
                    except Exception as e:
                        print("  Could not array-ize label:", e)
            except Exception as e:
                print("  Could not inspect Labels:", e)
print("=== Done deep inspect ===")

=== Deep inspect one MAT: fields, shapes, label structure ===


Video has attrs: ['FrameRate', 'Frames', 'Labels', 'MaxDepth', 'NumFrames']
Video.NumFrames: type=<class 'int'>, shape=None
Video.FrameRate: type=<class 'int'>, shape=None
Video.Frames: type=<class 'numpy.ndarray'>, shape=(1254,)
  Frames[0] type= <class 'scipy.io.matlab._mio5_params.mat_struct'>
Video.Labels: type=<class 'numpy.ndarray'>, shape=(20,)
  Labels len: 20
  Label[0] type: <class 'scipy.io.matlab._mio5_params.mat_struct'>
  Label[0] dir: ['Begin', 'End', 'Name']
  Label[0] as array: <scipy.io.matlab._mio5_params.mat_struct object at 0x73bc5557b490> ()
Video.MaxDepth: type=<class 'int'>, shape=None
=== Done deep inspect ===


In [5]:
import io, tarfile, zipfile
from pathlib import Path
import numpy as np
import scipy.io as sio

print("=== Inspect first frame struct fields and shapes ===", flush=True)
tarpath = "training1.tar.gz"
member = "./Sample00001.zip"
with tarfile.open(tarpath, 'r:*') as tf:
    tf_members = {m.name: m for m in tf}
    cand = member if member in tf_members else member.lstrip('./')
    data = tf.extractfile(tf_members[cand]).read()
with zipfile.ZipFile(io.BytesIO(data)) as zf:
    mat_name = [n for n in zf.namelist() if n.lower().endswith('_data.mat') or n.lower().endswith('.mat')][0]
    b = zf.read(mat_name)
md = sio.loadmat(io.BytesIO(b), squeeze_me=True, struct_as_record=False)
V = md['Video']
frames = V.Frames
print("NumFrames:", getattr(V, 'NumFrames', None), "FrameRate:", getattr(V, 'FrameRate', None))
fr0 = frames[0]
attrs = [a for a in dir(fr0) if not a.startswith('_')]
print("Frame[0] attrs (first 40):", attrs[:40])

def show_attr(obj, name):
    try:
        val = getattr(obj, name)
    except Exception as e:
        print(f"  {name}: <error {e}>")
        return
    shp = getattr(val, 'shape', None)
    typ = type(val)
    info = None
    if isinstance(val, (np.ndarray, list, tuple)):
        try:
            if isinstance(val, np.ndarray) and val.size>0:
                info = f"dtype={val.dtype}, min={val.min()}, max={val.max()}" if np.issubdtype(val.dtype, np.number) else f"dtype={val.dtype}"
        except Exception:
            info = None
    print(f"  {name}: type={typ}, shape={shp}, {info}")

# Probe common fields that might exist in ChaLearn frames
for name in ("Depth", "User", "Map", "Skeleton", "RGB", "Audio", "LeftHand", "RightHand", "PointCloud", "XYZ", "Coordinates"):
    if hasattr(fr0, name):
        show_attr(fr0, name)

# If Skeleton exists as nested struct/array, peek deeper and print key fields
if hasattr(fr0, 'Skeleton'):
    sk = getattr(fr0, 'Skeleton')
    try:
        print("Skeleton dir:", [a for a in dir(sk) if not a.startswith('_')][:30])
        for sub in ("JointType", "PixelPosition", "WorldPosition", "WorldRotation"):
            if hasattr(sk, sub):
                val = getattr(sk, sub)
                print(f"  Skeleton.{sub} type={type(val)}, shape={getattr(val, 'shape', None)}")
                if isinstance(val, np.ndarray):
                    # Show small preview of shape details
                    try:
                        print("    ndim=", val.ndim, "dtype=", val.dtype)
                        if val.ndim>=1:
                            print("    first element type:", type(val.flat[0]))
                    except Exception as e:
                        print("    preview error:", e)
    except Exception as e:
        print("Skeleton inspect error:", e)

# Inspect first label fully
L = V.Labels
print("Labels count:", len(L))
if len(L)>0:
    l0 = L[0]
    print("Label[0] fields:", [a for a in dir(l0) if not a.startswith('_')])
    try:
        print("  Begin:", getattr(l0, 'Begin', None), "End:", getattr(l0, 'End', None), "Name:", getattr(l0, 'Name', None))
    except Exception as e:
        print("  Could not print label fields:", e)
print("=== Done frame inspection ===")

=== Inspect first frame struct fields and shapes ===


NumFrames: 1254 FrameRate: 20
Frame[0] attrs (first 40): ['Skeleton']
  Skeleton: type=<class 'scipy.io.matlab._mio5_params.mat_struct'>, shape=None, None
Skeleton dir: ['JointType', 'PixelPosition', 'WorldPosition', 'WorldRotation']
  Skeleton.JointType type=<class 'numpy.ndarray'>, shape=(20,)
    ndim= 1 dtype= object
    first element type: <class 'numpy.ndarray'>
  Skeleton.PixelPosition type=<class 'numpy.ndarray'>, shape=(20, 2)
    ndim= 2 dtype= uint8
    first element type: <class 'numpy.uint8'>
  Skeleton.WorldPosition type=<class 'numpy.ndarray'>, shape=(20, 3)
    ndim= 2 dtype= uint8
    first element type: <class 'numpy.uint8'>
  Skeleton.WorldRotation type=<class 'numpy.ndarray'>, shape=(20, 4)
    ndim= 2 dtype= uint8
    first element type: <class 'numpy.uint8'>
Labels count: 20
Label[0] fields: ['Begin', 'End', 'Name']
  Begin: 1 End: 79 Name: vieniqui
=== Done frame inspection ===


In [6]:
import io, tarfile, zipfile, json, math, gc
from pathlib import Path
import numpy as np
import pandas as pd

print("=== Build archive index and cache small subset of skeleton features ===", flush=True)
CWD = Path('.')
TRAIN_TARS = [CWD/'training1.tar.gz', CWD/'training2.tar.gz', CWD/'training3.tar.gz']
VAL_TARS = [CWD/'validation1.tar.gz', CWD/'validation2.tar.gz', CWD/'validation3.tar.gz']
TEST_TAR = CWD/'test.tar.gz'

def build_tar_index(tar_paths):
    idx = {}  # name -> (tarpath, TarInfo)
    for tp in tar_paths:
        if not tp.exists():
            continue
        with tarfile.open(tp, 'r:*') as tf:
            for m in tf:
                if not m.isreg():
                    continue
                nm = m.name.lstrip('./')
                if nm.endswith('.zip') and nm.startswith('Sample'):
                    idx[nm] = (tp, m)
    return idx

train_idx = build_tar_index(TRAIN_TARS)
val_idx = build_tar_index(VAL_TARS)
test_idx = build_tar_index([TEST_TAR])
print(f"Index sizes: train={len(train_idx)}, val={len(val_idx)}, test={len(test_idx)}")

def id_to_zipname(sample_id: int) -> str:
    return f"Sample{sample_id:05d}.zip"

def load_mat_from_zip(tarpath: Path, tarinfo: tarfile.TarInfo):
    with tarfile.open(tarpath, 'r:*') as tf:
        fobj = tf.extractfile(tarinfo)
        if fobj is None:
            raise RuntimeError("Failed to extract tar member")
        data = fobj.read()
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        mat_name = None
        for n in zf.namelist():
            ln = n.lower()
            if ln.endswith('_data.mat') or ln.endswith('.mat'):
                mat_name = n; break
        if mat_name is None:
            raise RuntimeError("No MAT file found in zip")
        b = zf.read(mat_name)
    import scipy.io as sio
    md = sio.loadmat(io.BytesIO(b), squeeze_me=True, struct_as_record=False)
    return md

def extract_skeleton_xy(md):
    V = md['Video']
    frames = V.Frames  # ndarray of mat_struct, len T
    T = frames.shape[0]
    # Each frame has Skeleton.PixelPosition (20,2) uint8; use that as base feature
    D = 20*2
    X = np.zeros((T, D), dtype=np.float32)
    for t in range(T):
        fr = frames[t]
        sk = getattr(fr, 'Skeleton')
        px = getattr(sk, 'PixelPosition')  # (20,2) uint8
        arr = np.asarray(px, dtype=np.float32)
        X[t] = arr.reshape(-1)
    # Normalize per-frame: center and scale
    mu = X.reshape(T, 20, 2).mean(axis=1, keepdims=False)  # (T,2)
    Xc = X.reshape(T, 20, 2) - mu[:, None, :]
    # scale by RMS distance to center to be size-invariant
    rms = np.sqrt((Xc**2).sum(axis=(1,2)) / (20*2))  # (T,)
    rms[rms == 0] = 1.0
    Xn = (Xc / rms[:, None, None]).reshape(T, D)
    return Xn, int(getattr(V, 'FrameRate', 20)), int(getattr(V, 'NumFrames', Xn.shape[0]))

def temporal_features(X, stride=2):
    # Downsample by stride, then compute velocities and accelerations on downsampled sequence
    Xds = X[::stride].astype(np.float32)
    V = np.diff(Xds, axis=0, prepend=Xds[:1])
    A = np.diff(V, axis=0, prepend=V[:1])
    return np.concatenate([Xds, V, A], axis=1)

def cache_one(sample_id: int, split: str, outdir: Path):
    if split=='train':
        idx = train_idx
    elif split=='val':
        idx = val_idx
    elif split=='test':
        idx = test_idx
    else:
        raise ValueError('split must be train/val/test')
    zipname = id_to_zipname(sample_id)
    if zipname not in idx:
        raise KeyError(f"{zipname} not found in index for split={split}")
    tarpath, tarinfo = idx[zipname]
    md = load_mat_from_zip(tarpath, tarinfo)
    X, fps, nframes = extract_skeleton_xy(md)
    # Build features
    Xf = temporal_features(X, stride=2)  # ~10 fps
    meta = dict(fps=fps, nframes=nframes, stride=2)
    outdir.mkdir(parents=True, exist_ok=True)
    np.savez_compressed(outdir/f"{sample_id}.npz", X=Xf, meta=json.dumps(meta))
    del md, X, Xf; gc.collect()

train_df = pd.read_csv('training.csv')
test_df = pd.read_csv('test.csv')
print(train_df.head(2))
print(test_df.head(2))

features_dir = Path('features')
small_ids = train_df['Id'].head(8).tolist()
t0 = time.time()
for i, sid in enumerate(small_ids):
    st = time.time()
    cache_one(int(sid), 'train', features_dir/'train')
    dt = time.time()-st
    print(f"cached train id={sid} ({i+1}/{len(small_ids)}) in {dt:.2f}s", flush=True)
print(f"Subset caching done in {time.time()-t0:.2f}s")
print("List cached files:")
for p in sorted((features_dir/'train').glob('*.npz'))[:5]:
    print("  ", p.name)
print("=== Done subset caching ===")

=== Build archive index and cache small subset of skeleton features ===


Index sizes: train=298, val=287, test=95
   Id                                           Sequence
0   1  2 14 20 6 7 3 1 13 18 5 12 16 15 4 9 10 8 17 1...
1   3  12 3 18 14 16 20 5 2 4 1 10 6 9 19 15 17 11 13...
    Id
0  300
1  301


cached train id=1 (1/8) in 0.23s


cached train id=3 (2/8) in 0.26s


cached train id=4 (3/8) in 0.34s


cached train id=5 (4/8) in 0.37s


cached train id=6 (5/8) in 0.42s


cached train id=7 (6/8) in 0.46s


cached train id=8 (7/8) in 0.53s


cached train id=9 (8/8) in 0.59s


Subset caching done in 3.21s
List cached files:
   1.npz
   3.npz
   4.npz
   5.npz
   6.npz
=== Done subset caching ===


In [7]:
import io, tarfile, zipfile, json, math, gc, sys, time
from pathlib import Path
import numpy as np
import pandas as pd
import scipy.io as sio

print("=== Switch caching to 3D WorldPosition with torso-centering and shoulder-width scaling ===", flush=True)

# Reuse indices if present from prior cell; else rebuild
CWD = Path('.')
TRAIN_TARS = [CWD/'training1.tar.gz', CWD/'training2.tar.gz', CWD/'training3.tar.gz']
VAL_TARS = [CWD/'validation1.tar.gz', CWD/'validation2.tar.gz', CWD/'validation3.tar.gz']
TEST_TAR = CWD/'test.tar.gz'

def build_tar_index(tar_paths):
    idx = {}
    for tp in tar_paths:
        if not tp.exists(): continue
        with tarfile.open(tp, 'r:*') as tf:
            for m in tf:
                if not m.isreg(): continue
                nm = m.name.lstrip('./')
                if nm.endswith('.zip') and nm.startswith('Sample'):
                    idx[nm] = (tp, m)
    return idx

try:
    train_idx
except NameError:
    train_idx = build_tar_index(TRAIN_TARS)
    val_idx = build_tar_index(VAL_TARS)
    test_idx = build_tar_index([TEST_TAR])

def id_to_zipname(sample_id: int) -> str:
    return f"Sample{sample_id:05d}.zip"

def load_mat_from_zip(tarpath: Path, tarinfo: tarfile.TarInfo):
    with tarfile.open(tarpath, 'r:*') as tf:
        fobj = tf.extractfile(tarinfo); data = fobj.read()
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        mat_name = None
        for n in zf.namelist():
            ln = n.lower()
            if ln.endswith('_data.mat') or ln.endswith('.mat'): mat_name = n; break
        b = zf.read(mat_name)
    md = sio.loadmat(io.BytesIO(b), squeeze_me=True, struct_as_record=False)
    return md

def get_joint_names(md):
    V = md['Video']
    fr0 = V.Frames[0]
    jt = getattr(fr0.Skeleton, 'JointType')
    names = []
    for j in jt:
        # Elements may be numpy arrays or strings/bytes
        if isinstance(j, np.ndarray):
            v = j
            try:
                s = ''.join(chr(int(x)) for x in v.flatten())
            except Exception:
                try:
                    s = v.tobytes().decode(errors='ignore')
                except Exception:
                    s = str(v)
        else:
            s = str(j)
        s = s.strip().replace('\x00','')
        names.append(s)
    return names

def infer_indices(names):
    # Build a case-insensitive map
    lower = {n.lower(): i for i,n in enumerate(names)}
    def find_any(keys):
        for k in keys:
            if k in lower: return lower[k]
        return None
    idx = {}
    idx['shoulder_left']  = find_any(['shoulderleft','leftshoulder','lshoulder'])
    idx['shoulder_right'] = find_any(['shoulderright','rightshoulder','rshoulder'])
    idx['hip_left']       = find_any(['hipleft','lefthip','lhip'])
    idx['hip_right']      = find_any(['hipright','righthip','rhip'])
    idx['hip_center']     = find_any(['hipcenter','centership','spinebase','base'])
    return idx

def extract_world3d(md):
    V = md['Video']
    frames = V.Frames
    T = frames.shape[0]
    X = np.zeros((T, 20, 3), dtype=np.float32)
    for t in range(T):
        wp = getattr(frames[t].Skeleton, 'WorldPosition')  # (20,3) uint8
        X[t] = np.asarray(wp, dtype=np.float32)
    names = get_joint_names(md)
    idxs = infer_indices(names)
    # Center on hip_center if available, else midpoint of hips
    if idxs.get('hip_center') is not None:
        center = X[:, idxs['hip_center'], :]
    else:
        hl = X[:, idxs['hip_left'], :] if idxs.get('hip_left') is not None else X[:, 0, :]
        hr = X[:, idxs['hip_right'], :] if idxs.get('hip_right') is not None else X[:, 1, :]
        center = (hl + hr) / 2.0
    Xc = X - center[:, None, :]
    # Scale by shoulder width; fallback to shoulder-hip
    if idxs.get('shoulder_left') is not None and idxs.get('shoulder_right') is not None:
        sl = X[:, idxs['shoulder_left'], :]; sr = X[:, idxs['shoulder_right'], :]
        scale = np.linalg.norm(sl - sr, axis=1)  # (T,)
    else:
        # Fallback: distance hip_left to shoulder_left if available else RMS
        if idxs.get('hip_left') is not None and idxs.get('shoulder_left') is not None:
            scale = np.linalg.norm(X[:, idxs['hip_left'], :] - X[:, idxs['shoulder_left'], :], axis=1)
        else:
            scale = np.sqrt((Xc**2).sum(axis=(1,2)) / (20*3))
    scale[scale <= 1e-6] = 1.0
    Xn = Xc / scale[:, None, None]
    return Xn.reshape(T, -1), int(getattr(V,'FrameRate',20)), int(getattr(V,'NumFrames', X.shape[0]))

def temporal_features(X, stride=2):
    Xds = X[::stride].astype(np.float32)
    V = np.diff(Xds, axis=0, prepend=Xds[:1])
    A = np.diff(V, axis=0, prepend=V[:1])
    return np.concatenate([Xds, V, A], axis=1)

def cache_one_world3d(sample_id: int, split: str, outdir: Path):
    idx = train_idx if split=='train' else (val_idx if split=='val' else test_idx)
    zipname = id_to_zipname(sample_id)
    tarpath, tarinfo = idx[zipname]
    md = load_mat_from_zip(tarpath, tarinfo)
    X, fps, nframes = extract_world3d(md)
    Xf = temporal_features(X, stride=2)  # ~10 fps
    meta = dict(fps=fps, nframes=nframes, stride=2, feat='world3d_pos+vel+acc')
    outdir.mkdir(parents=True, exist_ok=True)
    np.savez_compressed(outdir/f"{sample_id}.npz", X=Xf, meta=json.dumps(meta))
    del md, X, Xf; gc.collect()

# Smoke cache a few samples with 3D features
train_df = pd.read_csv('training.csv')
small_ids = train_df['Id'].head(6).tolist()
outdir = Path('features3d')/'train'
t0 = time.time()
for i, sid in enumerate(small_ids):
    st = time.time()
    cache_one_world3d(int(sid), 'train', outdir)
    print(f"[3D] cached train id={sid} ({i+1}/{len(small_ids)}) in {time.time()-st:.2f}s", flush=True)
print(f"[3D] subset caching done in {time.time()-t0:.2f}s; list:")
for p in sorted(outdir.glob('*.npz'))[:6]:
    print('  ', p.name)
print("=== Done 3D subset caching ===")

=== Switch caching to 3D WorldPosition with torso-centering and shoulder-width scaling ===


[3D] cached train id=1 (1/6) in 0.23s


[3D] cached train id=3 (2/6) in 0.26s


[3D] cached train id=4 (3/6) in 0.35s


[3D] cached train id=5 (4/6) in 0.37s


[3D] cached train id=6 (5/6) in 0.43s


[3D] cached train id=7 (6/6) in 0.47s


[3D] subset caching done in 2.12s; list:
   1.npz
   3.npz
   4.npz
   5.npz
   6.npz
   7.npz
=== Done 3D subset caching ===


In [8]:
import time, json, gc
from pathlib import Path
import numpy as np
import pandas as pd

print("=== Full TRAIN caching: 3D world pos + vel/acc at ~10 fps ===", flush=True)

# Reuse helpers and indices from previous cells: cache_one_world3d, train_idx, id_to_zipname, etc.
train_df = pd.read_csv('training.csv')
train_ids = train_df['Id'].astype(int).tolist()
outdir = Path('features3d')/'train'
outdir.mkdir(parents=True, exist_ok=True)

total = len(train_ids)
t0 = time.time()
done = 0
skipped = 0
for i, sid in enumerate(train_ids, 1):
    outp = outdir/f"{sid}.npz"
    if outp.exists():
        skipped += 1
        if i % 20 == 0:
            dt = time.time() - t0
            rate = (i)/(dt+1e-9)
            eta = (total - i)/max(rate,1e-6)
            print(f"[train] {i}/{total} (skip={skipped}) elapsed={dt/60:.1f}m eta={eta/60:.1f}m", flush=True)
        continue
    st = time.time()
    try:
        cache_one_world3d(int(sid), 'train', outdir)
        done += 1
    except Exception as e:
        print(f"[WARN] failed id={sid}: {e}", flush=True)
        continue
    if (i % 10) == 0 or i == total:
        dt = time.time() - t0
        rate = (i)/(dt+1e-9)
        eta = (total - i)/max(rate,1e-6)
        print(f"[train] {i}/{total} cached={done} skip={skipped} last={time.time()-st:.2f}s elapsed={dt/60:.1f}m eta={eta/60:.1f}m", flush=True)
    gc.collect()

print(f"=== TRAIN caching done: cached={done}, skipped={skipped}, total={total}, elapsed={(time.time()-t0)/60:.2f}m ===", flush=True)

=== Full TRAIN caching: 3D world pos + vel/acc at ~10 fps ===


[train] 10/297 cached=4 skip=6 last=0.70s elapsed=0.0m eta=1.2m


[train] 20/297 cached=14 skip=6 last=1.21s elapsed=0.2m eta=2.9m


[train] 30/297 cached=24 skip=6 last=1.71s elapsed=0.5m eta=4.1m


[train] 40/297 cached=34 skip=6 last=2.28s elapsed=0.8m eta=5.1m


[train] 50/297 cached=44 skip=6 last=3.11s elapsed=1.3m eta=6.2m


[train] 60/297 cached=54 skip=6 last=3.82s elapsed=1.8m eta=7.3m


[train] 70/297 cached=64 skip=6 last=4.51s elapsed=2.5m eta=8.3m


[train] 80/297 cached=74 skip=6 last=5.21s elapsed=3.4m eta=9.1m


[train] 90/297 cached=84 skip=6 last=5.48s elapsed=4.3m eta=9.8m


[train] 100/297 cached=94 skip=6 last=0.20s elapsed=5.0m eta=9.9m


[train] 110/297 cached=104 skip=6 last=0.41s elapsed=5.1m eta=8.6m


[train] 120/297 cached=114 skip=6 last=0.60s elapsed=5.2m eta=7.6m


[train] 130/297 cached=124 skip=6 last=0.84s elapsed=5.3m eta=6.8m


[train] 140/297 cached=134 skip=6 last=1.05s elapsed=5.5m eta=6.1m


[train] 150/297 cached=144 skip=6 last=1.27s elapsed=5.7m eta=5.5m


[train] 160/297 cached=154 skip=6 last=1.47s elapsed=5.9m eta=5.0m


[train] 170/297 cached=164 skip=6 last=1.67s elapsed=6.2m eta=4.6m


[train] 180/297 cached=174 skip=6 last=1.86s elapsed=6.5m eta=4.2m


[train] 190/297 cached=184 skip=6 last=2.06s elapsed=6.8m eta=3.8m


[train] 200/297 cached=194 skip=6 last=0.22s elapsed=7.1m eta=3.4m


[train] 210/297 cached=204 skip=6 last=0.45s elapsed=7.1m eta=2.9m


[train] 220/297 cached=214 skip=6 last=0.67s elapsed=7.2m eta=2.5m


[train] 230/297 cached=224 skip=6 last=1.19s elapsed=7.4m eta=2.1m


[train] 240/297 cached=234 skip=6 last=1.70s elapsed=7.6m eta=1.8m


[train] 250/297 cached=244 skip=6 last=2.00s elapsed=7.9m eta=1.5m


[train] 260/297 cached=254 skip=6 last=2.21s elapsed=8.3m eta=1.2m


[train] 270/297 cached=264 skip=6 last=2.40s elapsed=8.7m eta=0.9m


[train] 280/297 cached=274 skip=6 last=2.67s elapsed=9.1m eta=0.6m


[train] 290/297 cached=284 skip=6 last=2.91s elapsed=9.6m eta=0.2m


[train] 297/297 cached=291 skip=6 last=3.08s elapsed=9.9m eta=0.0m


=== TRAIN caching done: cached=291, skipped=6, total=297, elapsed=9.94m ===


In [9]:
import os, sys, subprocess, shutil
from pathlib import Path

print("=== Install PyTorch cu121 stack and sanity check GPU ===", flush=True)
def pip(*args):
    print(">", *args, flush=True)
    subprocess.run([sys.executable, "-m", "pip", *args], check=True)

# Uninstall any stray torch stacks (ignore errors)
for pkg in ("torch","torchvision","torchaudio"):
    subprocess.run([sys.executable, "-m", "pip", "uninstall", "-y", pkg], check=False)

# Clean possible shadow dirs (idempotent)
for d in (
    "/app/.pip-target/torch",
    "/app/.pip-target/torchvision",
    "/app/.pip-target/torchaudio",
    "/app/.pip-target/torch-2.4.1.dist-info",
    "/app/.pip-target/torchvision-0.19.1.dist-info",
    "/app/.pip-target/torchaudio-2.4.1.dist-info",
    "/app/.pip-target/torchgen",
    "/app/.pip-target/functorch",
):
    if os.path.exists(d):
        print("Removing", d, flush=True)
        shutil.rmtree(d, ignore_errors=True)

# Install exact cu121 stack
pip("install",
    "--index-url", "https://download.pytorch.org/whl/cu121",
    "--extra-index-url", "https://pypi.org/simple",
    "torch==2.4.1", "torchvision==0.19.1", "torchaudio==2.4.1")

# Freeze constraints for subsequent installs
Path("constraints.txt").write_text("torch==2.4.1\ntorchvision==0.19.1\ntorchaudio==2.4.1\n")

# Sanity check
import torch
print("torch:", torch.__version__, "built CUDA:", getattr(torch.version, "cuda", None))
print("CUDA available:", torch.cuda.is_available())
assert str(getattr(torch.version, "cuda", "")).startswith("12.1"), f"Wrong CUDA build: {torch.version.cuda}"
assert torch.cuda.is_available(), "CUDA not available"
print("GPU:", torch.cuda.get_device_name(0))
print("=== Torch install OK ===")

=== Install PyTorch cu121 stack and sanity check GPU ===






> install --index-url https://download.pytorch.org/whl/cu121 --extra-index-url https://pypi.org/simple torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1




Looking in indexes: https://download.pytorch.org/whl/cu121, https://pypi.org/simple


Collecting torch==2.4.1
  Downloading https://download.pytorch.org/whl/cu121/torch-2.4.1%2Bcu121-cp311-cp311-linux_x86_64.whl (799.0 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 799.0/799.0 MB 543.7 MB/s eta 0:00:00


Collecting torchvision==0.19.1
  Downloading https://download.pytorch.org/whl/cu121/torchvision-0.19.1%2Bcu121-cp311-cp311-linux_x86_64.whl (7.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.1/7.1 MB 536.7 MB/s eta 0:00:00


Collecting torchaudio==2.4.1
  Downloading https://download.pytorch.org/whl/cu121/torchaudio-2.4.1%2Bcu121-cp311-cp311-linux_x86_64.whl (3.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.4/3.4 MB 439.9 MB/s eta 0:00:00


Collecting nvidia-cusparse-cu12==12.1.0.106
  Downloading nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl (196.0 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 196.0/196.0 MB 205.4 MB/s eta 0:00:00


Collecting sympy
  Downloading sympy-1.14.0-py3-none-any.whl (6.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 516.2 MB/s eta 0:00:00


Collecting fsspec
  Downloading fsspec-2025.9.0-py3-none-any.whl (199 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199.3/199.3 KB 514.1 MB/s eta 0:00:00


Collecting nvidia-curand-cu12==10.3.2.106
  Downloading nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.5/56.5 MB 228.3 MB/s eta 0:00:00
Collecting nvidia-nccl-cu12==2.20.5
  Downloading nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl (176.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 176.2/176.2 MB 199.3 MB/s eta 0:00:00


Collecting nvidia-cuda-cupti-cu12==12.1.105
  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.1/14.1 MB 227.7 MB/s eta 0:00:00


Collecting networkx
  Downloading networkx-3.5-py3-none-any.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 482.9 MB/s eta 0:00:00


Collecting filelock
  Downloading filelock-3.19.1-py3-none-any.whl (15 kB)
Collecting triton==3.0.0
  Downloading triton-3.0.0-1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (209.4 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 209.4/209.4 MB 198.5 MB/s eta 0:00:00


Collecting nvidia-cuda-runtime-cu12==12.1.105
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 823.6/823.6 KB 308.2 MB/s eta 0:00:00
Collecting nvidia-cudnn-cu12==9.1.0.70
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl (664.8 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 664.8/664.8 MB 209.8 MB/s eta 0:00:00


Collecting nvidia-cublas-cu12==12.1.3.1
  Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 410.6/410.6 MB 198.9 MB/s eta 0:00:00


Collecting nvidia-cufft-cu12==11.0.2.54
  Downloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.6/121.6 MB 210.9 MB/s eta 0:00:00


Collecting nvidia-nvtx-cu12==12.1.105
  Downloading nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (99 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99.1/99.1 KB 430.1 MB/s eta 0:00:00
Collecting typing-extensions>=4.8.0
  Downloading typing_extensions-4.15.0-py3-none-any.whl (44 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.6/44.6 KB 415.2 MB/s eta 0:00:00


Collecting nvidia-cuda-nvrtc-cu12==12.1.105
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 119.8 MB/s eta 0:00:00


Collecting nvidia-cusolver-cu12==11.4.5.107
  Downloading nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl (124.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 124.2/124.2 MB 257.6 MB/s eta 0:00:00


Collecting jinja2
  Downloading jinja2-3.1.6-py3-none-any.whl (134 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 134.9/134.9 KB 508.2 MB/s eta 0:00:00


Collecting pillow!=8.3.*,>=5.3.0
  Downloading pillow-11.3.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (6.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.6/6.6 MB 484.5 MB/s eta 0:00:00


Collecting numpy
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.3/18.3 MB 252.9 MB/s eta 0:00:00


Collecting nvidia-nvjitlink-cu12
  Downloading nvidia_nvjitlink_cu12-12.9.86-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.7/39.7 MB 198.3 MB/s eta 0:00:00


Collecting MarkupSafe>=2.0
  Downloading markupsafe-3.0.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (22 kB)
Collecting mpmath<1.4,>=1.1.0
  Downloading mpmath-1.3.0-py3-none-any.whl (536 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 KB 526.8 MB/s eta 0:00:00


Installing collected packages: mpmath, typing-extensions, sympy, pillow, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, numpy, networkx, MarkupSafe, fsspec, filelock, triton, nvidia-cusparse-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch, torchvision, torchaudio


Successfully installed MarkupSafe-3.0.3 filelock-3.19.1 fsspec-2025.9.0 jinja2-3.1.6 mpmath-1.3.0 networkx-3.5 numpy-1.26.4 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.20.5 nvidia-nvjitlink-cu12-12.9.86 nvidia-nvtx-cu12-12.1.105 pillow-11.3.0 sympy-1.14.0 torch-2.4.1+cu121 torchaudio-2.4.1+cu121 torchvision-0.19.1+cu121 triton-3.0.0 typing-extensions-4.15.0


torch: 2.4.1+cu121 built CUDA: 12.1
CUDA available: True
GPU: NVIDIA A10-24Q
=== Torch install OK ===


In [10]:
import os, json, math, time, random, gc
from pathlib import Path
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence
from torch.utils.data import Dataset, DataLoader

print("=== Train BiGRU+CTC on cached 3D features (train split with small val) ===", flush=True)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
random.seed(42); np.random.seed(42); torch.manual_seed(42);

features_dir = Path('features3d')/'train'
train_df = pd.read_csv('training.csv')

id2seq = {int(r.Id): [int(x) for x in str(r.Sequence).strip().split()] for _, r in train_df.iterrows()}

class_n = 21  # 0 is blank, labels 1..20

def load_npz(sample_id: int):
    p = features_dir/f"{sample_id}.npz"
    d = np.load(p)
    X = d['X'].astype(np.float32)  # (T, D)
    return X

class SeqDataset(Dataset):
    def __init__(self, ids):
        self.ids = ids
    def __len__(self): return len(self.ids)
    def __getitem__(self, idx):
        sid = self.ids[idx]
        X = load_npz(sid)
        y = np.array(id2seq[sid], dtype=np.int64)  # (L,) tokens 1..20
        # Truncate very long sequences for speed (keep up to 1200 frames after DS)
        if X.shape[0] > 1200: X = X[:1200]
        return torch.from_numpy(X), torch.from_numpy(y), sid

def collate(batch):
    xs, ys, sids = zip(*batch)
    x_lens = torch.tensor([x.shape[0] for x in xs], dtype=torch.int32)
    y_lens = torch.tensor([y.shape[0] for y in ys], dtype=torch.int32)
    x_pad = pad_sequence(xs, batch_first=False)  # (T, B, D)
    y_cat = torch.cat(ys, dim=0)  # concat targets for CTC
    return x_pad, x_lens, y_cat, y_lens, sids

all_ids = [int(x) for x in train_df['Id'].tolist()]
random.shuffle(all_ids)
val_ratio = 0.15
val_n = max(30, int(len(all_ids)*val_ratio))
val_ids = all_ids[:val_n]
tr_ids = all_ids[val_n:]
print(f"Train videos: {len(tr_ids)}, Val videos: {len(val_ids)}")

train_ds = SeqDataset(tr_ids)
val_ds = SeqDataset(val_ids)

def make_loader(ds, bs=16, shuffle=True):
    return DataLoader(ds, batch_size=bs, shuffle=shuffle, num_workers=2, pin_memory=True, collate_fn=collate)

train_loader = make_loader(train_ds, bs=24, shuffle=True)
val_loader = make_loader(val_ds, bs=24, shuffle=False)

class BiGRUCTC(nn.Module):
    def __init__(self, in_dim, hidden=256, layers=2, num_classes=21, dropout=0.2):
        super().__init__()
        self.rnn = nn.GRU(input_size=in_dim, hidden_size=hidden, num_layers=layers,
                          dropout=dropout, bidirectional=True)
        self.proj = nn.Linear(hidden*2, num_classes)
    def forward(self, x, x_lens):  # x: (T,B,D)
        packed = pack_padded_sequence(x, x_lens.cpu(), enforce_sorted=False)
        out, _ = self.rnn(packed)
        out, _ = pad_packed_sequence(out)  # (T,B,2H)
        logits = self.proj(out)  # (T,B,C)
        return logits

def ctc_greedy_decode(logits):
    # logits: (T,B,C)
    with torch.no_grad():
        pred = logits.argmax(dim=-1)  # (T,B)
        pred = pred.cpu().numpy()
    seqs = []
    T, B = pred.shape
    for b in range(B):
        last = -1
        out = []
        for t in range(T):
            p = int(pred[t, b])
            if p != last:
                if p != 0:  # skip blank
                    out.append(p)
                last = p
        seqs.append(out)
    return seqs

def levenshtein(a, b):
    # a, b: lists of ints
    n, m = len(a), len(b)
    if n==0: return m
    if m==0: return n
    dp = list(range(m+1))
    for i in range(1, n+1):
        prev = dp[0]
        dp[0] = i
        ai = a[i-1]
        for j in range(1, m+1):
            tmp = dp[j]
            cost = 0 if ai==b[j-1] else 1
            dp[j] = min(dp[j]+1, dp[j-1]+1, prev+cost)
            prev = tmp
    return dp[m]

def evaluate(model, loader):
    model.eval()
    total_lev = 0.0; total = 0;
    with torch.no_grad():
        for xb, x_lens, y_cat, y_lens, sids in loader:
            xb = xb.to(device)
            x_lens = x_lens.to(device)
            logits = model(xb, x_lens)  # (T,B,C)
            seqs = ctc_greedy_decode(logits)
            # split y_cat into per-sample
            ys = []
            off = 0
            for L in y_lens.tolist():
                ys.append(y_cat[off:off+L].tolist()); off += L
            for p, t in zip(seqs, ys):
                # Optionally enforce min-length to reduce insertions (cheap post-process)
                total_lev += levenshtein(p, t)
                total += 1
    return total_lev/total if total>0 else math.inf

D_sample = np.load(next(iter(features_dir.glob('*.npz'))))['X'].shape[1]
print("Feature dim:", D_sample)
model = BiGRUCTC(in_dim=D_sample, hidden=256, layers=2, num_classes=class_n, dropout=0.2).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())

def train_epoch(ep):
    model.train()
    t0 = time.time()
    total_loss = 0.0; nb = 0
    for it, (xb, x_lens, y_cat, y_lens, sids) in enumerate(train_loader):
        xb = xb.to(device)
        y_cat = y_cat.to(device)
        x_lens = x_lens.to(device, non_blocking=True)
        y_lens = y_lens.to(device, non_blocking=True)
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast(enabled=torch.cuda.is_available()):
            logits = model(xb, x_lens)  # (T,B,C)
            log_probs = logits.log_softmax(dim=-1)
            # CTC expects (T,B,C)
            loss = ctc_loss(log_probs, y_cat, x_lens, y_lens)
        scaler.scale(loss).backward()
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer); scaler.update()
        total_loss += loss.item(); nb += 1
        if (it+1) % 20 == 0:
            print(f"ep{ep} it{it+1} loss={total_loss/nb:.4f} elapsed={time.time()-t0:.1f}s", flush=True)
    return total_loss/max(nb,1)

best_val = math.inf; best_state = None; patience = 3; bad = 0
max_epochs = 6
for ep in range(1, max_epochs+1):
    tr_loss = train_epoch(ep)
    val_lev = evaluate(model, val_loader)
    print(f"Epoch {ep}: train_loss={tr_loss:.4f} val_lev={val_lev:.4f}", flush=True)
    if val_lev < best_val - 1e-4:
        best_val = val_lev; best_state = {k:v.cpu() for k,v in model.state_dict().items()}; bad = 0
        print(f"  New best val_lev={best_val:.4f}", flush=True)
    else:
        bad += 1
        if bad >= patience:
            print("Early stopping.", flush=True); break

if best_state is not None:
    model.load_state_dict(best_state)
torch.save(model.state_dict(), 'model_ctc_bgru.pth')
print("=== Training complete. Saved model_ctc_bgru.pth ===")

=== Train BiGRU+CTC on cached 3D features (train split with small val) ===


Train videos: 253, Val videos: 44
Feature dim: 180


  scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())


  with torch.cuda.amp.autocast(enabled=torch.cuda.is_available()):


Epoch 1: train_loss=51.8156 val_lev=18.8409


  New best val_lev=18.8409


Epoch 2: train_loss=3.7155 val_lev=18.8409


Epoch 3: train_loss=3.3192 val_lev=18.8409


Epoch 4: train_loss=3.1670 val_lev=18.8409


In [11]:
import time, json, gc
from pathlib import Path
import numpy as np
import pandas as pd

print("=== Cache TEST features: 3D world pos + vel/acc at ~10 fps ===", flush=True)

# Reuse helpers from earlier cells: cache_one_world3d, test_idx, id_to_zipname
test_df = pd.read_csv('test.csv')
test_ids = test_df['Id'].astype(int).tolist()
outdir = Path('features3d')/'test'
outdir.mkdir(parents=True, exist_ok=True)

total = len(test_ids)
t0 = time.time()
done = 0
skipped = 0
for i, sid in enumerate(test_ids, 1):
    outp = outdir/f"{sid}.npz"
    if outp.exists():
        skipped += 1
        if i % 10 == 0:
            dt = time.time() - t0
            rate = (i)/(dt+1e-9)
            eta = (total - i)/max(rate,1e-6)
            print(f"[test] {i}/{total} (skip={skipped}) elapsed={dt/60:.1f}m eta={eta/60:.1f}m", flush=True)
        continue
    st = time.time()
    try:
        cache_one_world3d(int(sid), 'test', outdir)
        done += 1
    except Exception as e:
        print(f"[WARN] test id={sid} failed: {e}", flush=True)
        continue
    if (i % 10) == 0 or i == total:
        dt = time.time() - t0
        rate = (i)/(dt+1e-9)
        eta = (total - i)/max(rate,1e-6)
        print(f"[test] {i}/{total} cached={done} skip={skipped} last={time.time()-st:.2f}s elapsed={dt/60:.1f}m eta={eta/60:.1f}m", flush=True)
    gc.collect()

print(f"=== TEST caching done: cached={done}, skipped={skipped}, total={total}, elapsed={(time.time()-t0)/60:.2f}m ===", flush=True)

=== Cache TEST features: 3D world pos + vel/acc at ~10 fps ===


[test] 10/95 cached=10 skip=0 last=0.43s elapsed=0.1m eta=0.6m


[test] 20/95 cached=20 skip=0 last=0.69s elapsed=0.2m eta=0.7m


[test] 30/95 cached=30 skip=0 last=0.91s elapsed=0.3m eta=0.7m


[test] 40/95 cached=40 skip=0 last=1.25s elapsed=0.5m eta=0.7m


[test] 50/95 cached=50 skip=0 last=1.51s elapsed=0.8m eta=0.7m


[test] 60/95 cached=60 skip=0 last=1.79s elapsed=1.1m eta=0.6m


[test] 70/95 cached=70 skip=0 last=2.16s elapsed=1.4m eta=0.5m


[test] 80/95 cached=80 skip=0 last=2.30s elapsed=1.8m eta=0.3m


[test] 90/95 cached=90 skip=0 last=2.53s elapsed=2.2m eta=0.1m


[test] 95/95 cached=95 skip=0 last=2.64s elapsed=2.4m eta=0.0m


=== TEST caching done: cached=95, skipped=0, total=95, elapsed=2.43m ===


In [16]:
import math, json, time
from pathlib import Path
import numpy as np
import pandas as pd
import torch
import torch.nn as nn

print("=== Inference on TEST: greedy CTC with fallback to class-ranking; write submission.csv ===", flush=True)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
test_df = pd.read_csv('test.csv')
test_ids = test_df['Id'].astype(int).tolist()
feat_dir = Path('features3d')/'test'

class BiGRUCTC(nn.Module):
    def __init__(self, in_dim, hidden=256, layers=2, num_classes=21, dropout=0.2):
        super().__init__()
        self.rnn = nn.GRU(input_size=in_dim, hidden_size=hidden, num_layers=layers,
                          dropout=dropout, bidirectional=True)
        self.proj = nn.Linear(hidden*2, num_classes)
    def forward(self, x, x_lens):
        from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
        packed = pack_padded_sequence(x, x_lens.cpu(), enforce_sorted=False)
        out, _ = self.rnn(packed)
        out, _ = pad_packed_sequence(out)
        logits = self.proj(out)
        return logits

def load_feat(sample_id: int):
    p = feat_dir/f"{sample_id}.npz"
    d = np.load(p)
    X = d['X'].astype(np.float32)
    if X.shape[0] > 1200: X = X[:1200]
    return X

def ctc_greedy(logits):
    # logits: (T,C) tensor
    pred = logits.argmax(dim=-1).cpu().numpy().tolist()
    out = []
    last = -1
    for p in pred:
        if p != last:
            if p != 0:
                out.append(int(p))
            last = p
    return out

def fallback_rank(logits):
    # logits: (T,C) tensor; compute per-class mean score and rank 1..20
    with torch.no_grad():
        lp = logits[:, 1:21].mean(dim=0)  # exclude blank
        order = torch.argsort(lp, descending=True).cpu().numpy().tolist()
    seq = [int(i+1) for i in order[:20]]
    return seq

def ensure_len20(seq, logits):
    # If seq invalid (len!=20 or duplicates or out of range), use fallback ranking
    ok = (len(seq) == 20) and all(1 <= s <= 20 for s in seq) and (len(set(seq)) == 20)
    if ok: return seq
    return fallback_rank(logits)

# Load model with inferred input dim from one train npz
train_any = next(iter((Path('features3d')/'train').glob('*.npz')))
in_dim = np.load(train_any)['X'].shape[1]
model = BiGRUCTC(in_dim=in_dim, hidden=256, layers=2, num_classes=21, dropout=0.2).to(device)
state = torch.load('model_ctc_bgru.pth', map_location=device)
model.load_state_dict(state)
model.eval()

pred_rows = []
t0 = time.time()
for i, sid in enumerate(test_ids, 1):
    X = load_feat(sid)  # (T,D)
    xb = torch.from_numpy(X).to(device)  # (T,D)
    xb = xb.unsqueeze(1)  # (T,1,D)
    x_lens = torch.tensor([xb.shape[0]], dtype=torch.int32, device=device)
    with torch.no_grad():
        logits = model(xb, x_lens)  # (T,1,C)
        logits = logits[:,0,:]  # (T,C)
    seq = ctc_greedy(logits)
    seq = ensure_len20(seq, logits)
    pred_rows.append({'Id': sid, 'Sequence': ' '.join(str(x) for x in seq)})
    if i % 10 == 0 or i == len(test_ids):
        print(f"[infer] {i}/{len(test_ids)} elapsed={(time.time()-t0)/60:.1f}m", flush=True)

sub = pd.DataFrame(pred_rows, columns=['Id','Sequence'])
sub.to_csv('submission.csv', index=False)
print("Wrote submission.csv; head:\n", sub.head())
print("=== Inference done ===")

=== Inference on TEST: greedy CTC with fallback to class-ranking; write submission.csv ===


[infer] 10/95 elapsed=0.0m


  state = torch.load('model_ctc_bgru.pth', map_location=device)


[infer] 20/95 elapsed=0.0m


[infer] 30/95 elapsed=0.0m


[infer] 40/95 elapsed=0.0m


[infer] 50/95 elapsed=0.0m


[infer] 60/95 elapsed=0.0m


[infer] 70/95 elapsed=0.0m


[infer] 80/95 elapsed=0.0m


[infer] 90/95 elapsed=0.0m


[infer] 95/95 elapsed=0.0m


Wrote submission.csv; head:
     Id                                           Sequence
0  300  9 18 20 16 2 5 10 15 14 13 7 17 4 11 3 19 1 8 ...
1  301  9 18 16 20 2 5 10 15 14 13 7 17 4 3 11 19 1 8 ...
2  302  9 18 16 20 2 5 10 15 14 13 7 17 4 3 11 19 1 8 ...
3  303  9 18 16 20 2 5 10 15 7 14 13 17 4 3 11 19 1 8 ...
4  304  9 18 16 20 2 5 10 15 7 14 13 17 4 3 11 19 1 8 ...
=== Inference done ===


In [15]:
import math, time
from pathlib import Path
import numpy as np
import pandas as pd
import torch
import torch.nn as nn

print("=== Inference (CTC beam) with penalties and constraints -> submission.csv ===", flush=True)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
test_df = pd.read_csv('test.csv')
test_ids = test_df['Id'].astype(int).tolist()
feat_dir = Path('features3d')/'test'

# Hyperparams (tune later):
beam_width = 50
temperature = 1.4  # divide logits by this before log_softmax
insertion_penalty = 0.6  # subtract only when appending a non-blank token
min_run_len = 6  # frames at ~10 fps
blank_prob_thresh = 0.985  # prune non-blank expansions on very-blank frames
top_k_nonblank = 8  # per-frame top-k for non-blank expansions

class BiGRUCTC(nn.Module):
    def __init__(self, in_dim, hidden=256, layers=2, num_classes=21, dropout=0.2):
        super().__init__()
        self.rnn = nn.GRU(input_size=in_dim, hidden_size=hidden, num_layers=layers,
                          dropout=dropout, bidirectional=True)
        self.proj = nn.Linear(hidden*2, num_classes)
    def forward(self, x, x_lens):
        from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
        packed = pack_padded_sequence(x, x_lens.cpu(), enforce_sorted=False)
        out, _ = self.rnn(packed)
        out, _ = pad_packed_sequence(out)
        logits = self.proj(out)
        return logits

def load_feat(sample_id: int):
    p = feat_dir/f"{sample_id}.npz"
    d = np.load(p)
    X = d['X'].astype(np.float32)
    if X.shape[0] > 1200: X = X[:1200]
    return X

def runs_from_path(tokens, timesteps):
    # returns list of (token, start_t, end_t) inclusive on time axis (end_t included)
    runs = []
    if len(tokens) == 0:
        return runs
    cur_tok = tokens[0]
    start_idx = 0
    for i in range(1, len(tokens)+1):
        nxt = tokens[i] if i < len(tokens) else None
        if nxt != cur_tok:
            # segment covers indices [start_idx, i-1] in tokens/timesteps
            t_start = timesteps[start_idx]
            t_end = timesteps[i-1]
            runs.append((cur_tok, int(t_start), int(t_end)))
            if nxt is None: break
            cur_tok = nxt
            start_idx = i
    return runs

def collapse_and_prune(tokens, timesteps, min_len_frames):
    # collapse identical consecutive tokens, convert to runs, then drop short runs by frame duration
    runs = runs_from_path(tokens, timesteps)
    kept = []
    for tok, t0, t1 in runs:
        if tok == 0 or tok is None:
            continue
        duration = (t1 - t0 + 1)
        if duration >= min_len_frames:
            kept.append((tok, t0, t1))
    return kept  # list of (tok, t0, t1)

def enforce_exact_20(runs, lp):
    # runs: list of (tok, t0, t1), lp: (T,C) log-probs
    # Step 1: order-preserving trim if > 20 using mean score over time span
    if len(runs) > 20:
        # score by mean log-prob over the token's segment
        scores = []
        for tok, t0, t1 in runs:
            seg = lp[t0:t1+1, tok]
            scores.append(float(seg.mean().item()))
        # get indices sorted by score desc, keep top 20 but preserve original temporal order
        keep_idx = set([i for i,_ in sorted(enumerate(scores), key=lambda x: x[1], reverse=True)[:20]])
        runs = [r for i, r in enumerate(runs) if i in keep_idx]
        runs.sort(key=lambda x: x[1])  # sort by start time to preserve order
    # Step 2: if < 20, insert missing classes by peak time; preserve order
    have = [tok for tok, _, _ in runs]
    need = [c for c in range(1, 21) if c not in have]
    if len(runs) < 20 and len(need) > 0:
        peaks = []
        for c in need:
            t_star = int(torch.argmax(lp[:, c]).item())
            peaks.append((c, t_star))
        # insert as tiny runs [t*, t*] then merge and sort
        for c, t_star in peaks:
            runs.append((c, t_star, t_star))
        runs.sort(key=lambda x: x[1])
        # if still > 20 due to excessive insertions (unlikely), trim by per-run mean score
        if len(runs) > 20:
            scores = [float(lp[t0:t1+1, tok].mean().item()) for tok, t0, t1 in runs]
            keep_idx = set([i for i,_ in sorted(enumerate(scores), key=lambda x: x[1], reverse=True)[:20]])
            runs = [r for i, r in enumerate(runs) if i in keep_idx]
            runs.sort(key=lambda x: x[1])
    # Deduplicate by first occurrence (should already be unique)
    seen = set(); seq = []
    for tok, _, _ in runs:
        if tok not in seen:
            seq.append(tok); seen.add(tok)
    # final guard
    if len(seq) > 20: seq = seq[:20]
    # if somehow <20 (very weak logits), fill by global ranking
    if len(seq) < 20:
        # rank by mean nonblank
        mean_nonblank = lp[:, 1:21].mean(dim=0)
        order = torch.argsort(mean_nonblank, descending=True).cpu().numpy().tolist()
        for idx in order:
            c = idx + 1
            if c not in seq:
                seq.append(c)
                if len(seq) == 20: break
    return seq

def simple_beam_decode(logits):
    # logits: (T,C) tensor on device; C=21, blank=0
    # Apply temperature and convert to log-probs
    lp = (logits / temperature).log_softmax(dim=-1)  # (T,C)
    T, C = lp.shape
    # Beams: (logp, last_token, tokens_list, timesteps_list)
    beams = [(0.0, 0, [], [])]
    for t in range(T):
        frame = lp[t]  # (C,)
        p_blank = torch.exp(frame[0]).item()
        new_beams = []
        # Always allow blank transition (stay)
        for logp, last, toks, ts in beams:
            stay_lp = logp + frame[0].item()
            new_beams.append((stay_lp, last, toks, ts))
        # If frame is highly blank, skip non-blank expansions
        if p_blank < blank_prob_thresh:
            # Expand non-blank with per-frame top-k
            vals, idxs = torch.topk(frame[1:], k=min(top_k_nonblank, C-1))
            vals = vals.tolist(); idxs = idxs.tolist()
            for logp, last, toks, ts in beams:
                for v, idx in zip(vals, idxs):
                    c = idx + 1
                    # Enforce CTC no-repeat without blank: if last == c, discourage by not adding
                    if last == c:
                        continue
                    nl = logp + v - insertion_penalty  # apply insertion penalty only on non-blank
                    new_beams.append((nl, c, toks + [c], ts + [t]))
        # prune to beam_width
        new_beams.sort(key=lambda x: x[0], reverse=True)
        beams = new_beams[:beam_width]
    # pick best beam and post-process
    best = max(beams, key=lambda x: x[0])
    _, _, toks, ts = best
    kept_runs = collapse_and_prune(toks, ts, min_run_len)
    seq = enforce_exact_20(kept_runs, lp)
    return seq

# Load model
train_any = next(iter((Path('features3d')/'train').glob('*.npz')))
in_dim = np.load(train_any)['X'].shape[1]
model = BiGRUCTC(in_dim=in_dim, hidden=256, layers=2, num_classes=21, dropout=0.2).to(device)
state = torch.load('model_ctc_bgru.pth', map_location=device)
model.load_state_dict(state); model.eval()

pred_rows = []
t0 = time.time()
for i, sid in enumerate(test_ids, 1):
    X = load_feat(sid)  # (T,D)
    xb = torch.from_numpy(X).to(device).unsqueeze(1)  # (T,1,D)
    x_lens = torch.tensor([xb.shape[0]], dtype=torch.int32, device=device)
    with torch.no_grad():
        logits = model(xb, x_lens)[:,0,:]  # (T,C)
    seq = simple_beam_decode(logits)
    pred_rows.append({'Id': sid, 'Sequence': ' '.join(str(x) for x in seq)})
    if i % 10 == 0 or i == len(test_ids):
        print(f"[beam infer] {i}/{len(test_ids)} elapsed={(time.time()-t0)/60:.1f}m", flush=True)

sub = pd.DataFrame(pred_rows, columns=['Id','Sequence'])
sub.to_csv('submission.csv', index=False)
print('Wrote submission.csv; head:\n', sub.head())
print("=== Beam inference done ===")

=== Inference (CTC beam) with penalties and constraints -> submission.csv ===


  state = torch.load('model_ctc_bgru.pth', map_location=device)


[beam infer] 10/95 elapsed=0.1m


[beam infer] 20/95 elapsed=0.2m


[beam infer] 30/95 elapsed=0.2m


[beam infer] 40/95 elapsed=0.3m


[beam infer] 50/95 elapsed=0.4m


[beam infer] 60/95 elapsed=0.5m


[beam infer] 70/95 elapsed=0.6m


[beam infer] 80/95 elapsed=0.7m


[beam infer] 90/95 elapsed=0.8m


[beam infer] 95/95 elapsed=0.8m


Wrote submission.csv; head:
     Id                                           Sequence
0  300  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1...
1  301  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1...
2  302  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1...
3  303  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1...
4  304  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1...
=== Beam inference done ===


In [17]:
import math, time, random
from pathlib import Path
import numpy as np
import pandas as pd
import torch
import torch.nn as nn

print("=== Validate beam decoder on internal holdout and tune hyperparams ===", flush=True)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
feat_dir_tr = Path('features3d')/'train'
train_df = pd.read_csv('training.csv')
id2seq = {int(r.Id): [int(x) for x in str(r.Sequence).strip().split()] for _, r in train_df.iterrows()}

# Deterministic split to mirror earlier run
all_ids = [int(x) for x in train_df['Id'].tolist()]
random.seed(42); np.random.seed(42)
random.shuffle(all_ids)
val_ratio = 0.15
val_n = max(30, int(len(all_ids)*val_ratio))
val_ids = all_ids[:val_n]
print(f"Val videos: {len(val_ids)}")

def load_feat_tr(sample_id: int):
    p = feat_dir_tr/f"{sample_id}.npz"
    d = np.load(p)
    X = d['X'].astype(np.float32)
    if X.shape[0] > 1200: X = X[:1200]
    return X

class BiGRUCTC(nn.Module):
    def __init__(self, in_dim, hidden=256, layers=2, num_classes=21, dropout=0.2):
        super().__init__()
        self.rnn = nn.GRU(input_size=in_dim, hidden_size=hidden, num_layers=layers,
                          dropout=dropout, bidirectional=True)
        self.proj = nn.Linear(hidden*2, num_classes)
    def forward(self, x, x_lens):
        from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
        packed = pack_padded_sequence(x, x_lens.cpu(), enforce_sorted=False)
        out, _ = self.rnn(packed)
        out, _ = pad_packed_sequence(out)
        logits = self.proj(out)
        return logits

def levenshtein(a, b):
    n, m = len(a), len(b)
    if n==0: return m
    if m==0: return n
    dp = list(range(m+1))
    for i in range(1, n+1):
        prev = dp[0]; dp[0] = i; ai = a[i-1]
        for j in range(1, m+1):
            tmp = dp[j]
            cost = 0 if ai==b[j-1] else 1
            dp[j] = min(dp[j]+1, dp[j-1]+1, prev+cost)
            prev = tmp
    return dp[m]

# Reuse decoder/utilities from Cell 12 if present; otherwise import minimal versions
assert 'simple_beam_decode' in globals(), "Run Cell 12 once before this tuning cell."
assert 'enforce_exact_20' in globals() and 'collapse_and_prune' in globals(), "Run Cell 12 first."

# Build model
in_dim = np.load(next(iter(feat_dir_tr.glob('*.npz'))))['X'].shape[1]
model = BiGRUCTC(in_dim=in_dim, hidden=256, layers=2, num_classes=21, dropout=0.2).to(device)
state = torch.load('model_ctc_bgru.pth', map_location=device)
model.load_state_dict(state); model.eval()

def decode_val_once():
    tot_lev = 0.0; n = 0; lens = []; uniq_ok = 0
    t0 = time.time()
    for i, sid in enumerate(val_ids, 1):
        X = load_feat_tr(sid)
        xb = torch.from_numpy(X).to(device).unsqueeze(1)
        x_lens = torch.tensor([xb.shape[0]], dtype=torch.int32, device=device)
        with torch.no_grad():
            logits = model(xb, x_lens)[:,0,:]
        seq = simple_beam_decode(logits)
        tgt = id2seq[sid]
        tot_lev += levenshtein(seq, tgt)
        n += 1
        lens.append(len(seq))
        uniq_ok += int(len(seq)==20 and len(set(seq))==20 and all(1<=x<=20 for x in seq))
        if (i % 10) == 0 or i==len(val_ids):
            print(f"  [val] {i}/{len(val_ids)} elapsed={(time.time()-t0)/60:.1f}m", flush=True)
    return (tot_lev/n if n else math.inf), (sum(lens)/max(n,1)), (uniq_ok/max(n,1))

# Small grid per expert advice (keep fast):
cfgs = []
for bw in [40, 50]:
    for temp in [1.3, 1.4]:
        for pen in [0.6, 0.8]:
            for mlen in [6, 8]:
                for bth in [0.985, 0.99]:
                    cfgs.append(dict(beam_width=bw, temperature=temp, insertion_penalty=pen, min_run_len=mlen, blank_prob_thresh=bth))

results = []
print(f"Testing {len(cfgs)} configs on {len(val_ids)} videos", flush=True)
for ci, cfg in enumerate(cfgs, 1):
    # assign globals used by simple_beam_decode
    beam_width = cfg['beam_width']
    temperature = cfg['temperature']
    insertion_penalty = cfg['insertion_penalty']
    min_run_len = cfg['min_run_len']
    blank_prob_thresh = cfg['blank_prob_thresh']
    print(f"[{ci}/{len(cfgs)}] cfg={cfg}", flush=True)
    lev, avg_len, uniq = decode_val_once()
    results.append((lev, avg_len, uniq, cfg))
    print(f"  -> val_lev={lev:.4f} avg_len={avg_len:.2f} uniq_ok={uniq:.2f}", flush=True)

results.sort(key=lambda x: x[0])
print("=== Top configs ===")
for r in results[:5]:
    print(f"lev={r[0]:.4f} avg_len={r[1]:.2f} uniq={r[2]:.2f} cfg={r[3]}")
best = results[0]
print("BEST:", best)

=== Validate beam decoder on internal holdout and tune hyperparams ===


Val videos: 44
Testing 32 configs on 44 videos


[1/32] cfg={'beam_width': 40, 'temperature': 1.3, 'insertion_penalty': 0.6, 'min_run_len': 6, 'blank_prob_thresh': 0.985}


  state = torch.load('model_ctc_bgru.pth', map_location=device)


  [val] 10/44 elapsed=0.1m


  [val] 20/44 elapsed=0.1m


  [val] 30/44 elapsed=0.2m


  [val] 40/44 elapsed=0.3m


  [val] 44/44 elapsed=0.3m


  -> val_lev=18.2727 avg_len=20.00 uniq_ok=1.00


[2/32] cfg={'beam_width': 40, 'temperature': 1.3, 'insertion_penalty': 0.6, 'min_run_len': 6, 'blank_prob_thresh': 0.99}


  [val] 10/44 elapsed=0.1m


  [val] 20/44 elapsed=0.1m


  [val] 30/44 elapsed=0.2m


  [val] 40/44 elapsed=0.3m


  [val] 44/44 elapsed=0.3m


  -> val_lev=18.2727 avg_len=20.00 uniq_ok=1.00


[3/32] cfg={'beam_width': 40, 'temperature': 1.3, 'insertion_penalty': 0.6, 'min_run_len': 8, 'blank_prob_thresh': 0.985}


  [val] 10/44 elapsed=0.1m


  [val] 20/44 elapsed=0.1m


  [val] 30/44 elapsed=0.2m


  [val] 40/44 elapsed=0.3m


  [val] 44/44 elapsed=0.3m


  -> val_lev=18.2727 avg_len=20.00 uniq_ok=1.00


[4/32] cfg={'beam_width': 40, 'temperature': 1.3, 'insertion_penalty': 0.6, 'min_run_len': 8, 'blank_prob_thresh': 0.99}


  [val] 10/44 elapsed=0.1m


  [val] 20/44 elapsed=0.1m


  [val] 30/44 elapsed=0.2m


  [val] 40/44 elapsed=0.3m


  [val] 44/44 elapsed=0.3m


  -> val_lev=18.2727 avg_len=20.00 uniq_ok=1.00


[5/32] cfg={'beam_width': 40, 'temperature': 1.3, 'insertion_penalty': 0.8, 'min_run_len': 6, 'blank_prob_thresh': 0.985}


  [val] 10/44 elapsed=0.1m


  [val] 20/44 elapsed=0.1m


  [val] 30/44 elapsed=0.2m


  [val] 40/44 elapsed=0.3m


  [val] 44/44 elapsed=0.3m


  -> val_lev=18.2727 avg_len=20.00 uniq_ok=1.00


[6/32] cfg={'beam_width': 40, 'temperature': 1.3, 'insertion_penalty': 0.8, 'min_run_len': 6, 'blank_prob_thresh': 0.99}


  [val] 10/44 elapsed=0.1m


  [val] 20/44 elapsed=0.1m


  [val] 30/44 elapsed=0.2m


  [val] 40/44 elapsed=0.3m


  [val] 44/44 elapsed=0.3m


  -> val_lev=18.2727 avg_len=20.00 uniq_ok=1.00


[7/32] cfg={'beam_width': 40, 'temperature': 1.3, 'insertion_penalty': 0.8, 'min_run_len': 8, 'blank_prob_thresh': 0.985}


  [val] 10/44 elapsed=0.1m


  [val] 20/44 elapsed=0.1m


  [val] 30/44 elapsed=0.2m


  [val] 40/44 elapsed=0.3m


  [val] 44/44 elapsed=0.3m


  -> val_lev=18.2727 avg_len=20.00 uniq_ok=1.00


[8/32] cfg={'beam_width': 40, 'temperature': 1.3, 'insertion_penalty': 0.8, 'min_run_len': 8, 'blank_prob_thresh': 0.99}


  [val] 10/44 elapsed=0.1m


  [val] 20/44 elapsed=0.1m


  [val] 30/44 elapsed=0.2m


  [val] 40/44 elapsed=0.3m


  [val] 44/44 elapsed=0.3m


  -> val_lev=18.2727 avg_len=20.00 uniq_ok=1.00


[9/32] cfg={'beam_width': 40, 'temperature': 1.4, 'insertion_penalty': 0.6, 'min_run_len': 6, 'blank_prob_thresh': 0.985}


  [val] 10/44 elapsed=0.1m


  [val] 20/44 elapsed=0.1m


  [val] 30/44 elapsed=0.2m


  [val] 40/44 elapsed=0.3m


  [val] 44/44 elapsed=0.3m


  -> val_lev=18.2955 avg_len=20.00 uniq_ok=1.00


[10/32] cfg={'beam_width': 40, 'temperature': 1.4, 'insertion_penalty': 0.6, 'min_run_len': 6, 'blank_prob_thresh': 0.99}


  [val] 10/44 elapsed=0.1m


  [val] 20/44 elapsed=0.1m


  [val] 30/44 elapsed=0.2m


  [val] 40/44 elapsed=0.3m


  [val] 44/44 elapsed=0.3m


  -> val_lev=18.2955 avg_len=20.00 uniq_ok=1.00


[11/32] cfg={'beam_width': 40, 'temperature': 1.4, 'insertion_penalty': 0.6, 'min_run_len': 8, 'blank_prob_thresh': 0.985}


  [val] 10/44 elapsed=0.1m


  [val] 20/44 elapsed=0.1m


  [val] 30/44 elapsed=0.2m


  [val] 40/44 elapsed=0.3m


  [val] 44/44 elapsed=0.3m


  -> val_lev=18.2955 avg_len=20.00 uniq_ok=1.00


[12/32] cfg={'beam_width': 40, 'temperature': 1.4, 'insertion_penalty': 0.6, 'min_run_len': 8, 'blank_prob_thresh': 0.99}


  [val] 10/44 elapsed=0.1m


  [val] 20/44 elapsed=0.1m


  [val] 30/44 elapsed=0.2m


  [val] 40/44 elapsed=0.3m


  [val] 44/44 elapsed=0.3m


  -> val_lev=18.2955 avg_len=20.00 uniq_ok=1.00


[13/32] cfg={'beam_width': 40, 'temperature': 1.4, 'insertion_penalty': 0.8, 'min_run_len': 6, 'blank_prob_thresh': 0.985}


  [val] 10/44 elapsed=0.1m


  [val] 20/44 elapsed=0.1m


  [val] 30/44 elapsed=0.2m


  [val] 40/44 elapsed=0.3m


  [val] 44/44 elapsed=0.3m


  -> val_lev=18.2955 avg_len=20.00 uniq_ok=1.00


[14/32] cfg={'beam_width': 40, 'temperature': 1.4, 'insertion_penalty': 0.8, 'min_run_len': 6, 'blank_prob_thresh': 0.99}


  [val] 10/44 elapsed=0.1m


  [val] 20/44 elapsed=0.1m


  [val] 30/44 elapsed=0.2m


  [val] 40/44 elapsed=0.3m


  [val] 44/44 elapsed=0.3m


  -> val_lev=18.2955 avg_len=20.00 uniq_ok=1.00


[15/32] cfg={'beam_width': 40, 'temperature': 1.4, 'insertion_penalty': 0.8, 'min_run_len': 8, 'blank_prob_thresh': 0.985}


  [val] 10/44 elapsed=0.1m


  [val] 20/44 elapsed=0.1m


  [val] 30/44 elapsed=0.2m


  [val] 40/44 elapsed=0.3m


  [val] 44/44 elapsed=0.3m


  -> val_lev=18.2955 avg_len=20.00 uniq_ok=1.00


[16/32] cfg={'beam_width': 40, 'temperature': 1.4, 'insertion_penalty': 0.8, 'min_run_len': 8, 'blank_prob_thresh': 0.99}


  [val] 10/44 elapsed=0.1m


  [val] 20/44 elapsed=0.1m


  [val] 30/44 elapsed=0.2m


  [val] 40/44 elapsed=0.3m


  [val] 44/44 elapsed=0.3m


  -> val_lev=18.2955 avg_len=20.00 uniq_ok=1.00


[17/32] cfg={'beam_width': 50, 'temperature': 1.3, 'insertion_penalty': 0.6, 'min_run_len': 6, 'blank_prob_thresh': 0.985}


  [val] 10/44 elapsed=0.1m


  [val] 20/44 elapsed=0.2m


  [val] 30/44 elapsed=0.3m


  [val] 40/44 elapsed=0.3m


  [val] 44/44 elapsed=0.4m


  -> val_lev=18.2727 avg_len=20.00 uniq_ok=1.00


[18/32] cfg={'beam_width': 50, 'temperature': 1.3, 'insertion_penalty': 0.6, 'min_run_len': 6, 'blank_prob_thresh': 0.99}


  [val] 10/44 elapsed=0.1m


  [val] 20/44 elapsed=0.2m


KeyboardInterrupt: 

In [19]:
import os, math, time, random, gc
from pathlib import Path
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence
from torch.utils.data import Dataset, DataLoader

print("=== Train 3x384 BiGRU+CTC with input projection, LN, time masking, feature standardization ===", flush=True)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
random.seed(42); np.random.seed(42); torch.manual_seed(42);
if torch.cuda.is_available():
    torch.backends.cudnn.benchmark = True

feat_dir = Path('features3d')/'train'
train_df = pd.read_csv('training.csv')
id2seq = {int(r.Id): [int(x) for x in str(r.Sequence).strip().split()] for _, r in train_df.iterrows()}
all_ids = [int(x) for x in train_df['Id'].tolist()]
random.shuffle(all_ids)
val_ratio = 0.15
val_n = max(30, int(len(all_ids)*val_ratio))
val_ids = all_ids[:val_n]
tr_ids = all_ids[val_n:]
print(f"Train videos: {len(tr_ids)}, Val videos: {len(val_ids)}")

# Compute and cache global mean/std over TRAIN features (across time and samples)
scaler_path = Path('feature_scaler.npz')
if scaler_path.exists():
    sc = np.load(scaler_path)
    mu, sigma = sc['mean'].astype(np.float32), sc['std'].astype(np.float32)
else:
    sum_vec = None; sumsq_vec = None; count = 0
    t0 = time.time()
    for i, sid in enumerate(tr_ids, 1):
        d = np.load(feat_dir/f"{sid}.npz")['X'].astype(np.float32)
        if d.shape[0] > 1200: d = d[:1200]
        if sum_vec is None:
            sum_vec = np.zeros(d.shape[1], np.float32)
            sumsq_vec = np.zeros(d.shape[1], np.float32)
        sum_vec += d.sum(axis=0)
        sumsq_vec += (d*d).sum(axis=0)
        count += d.shape[0]
        if (i % 50) == 0 or i == len(tr_ids):
            print(f"[scaler] {i}/{len(tr_ids)} frames_accum={count} elapsed={(time.time()-t0)/60:.1f}m", flush=True)
    mu = sum_vec / max(count,1)
    var = np.maximum(sumsq_vec / max(count,1) - mu*mu, 1e-6)
    sigma = np.sqrt(var).astype(np.float32)
    np.savez_compressed(scaler_path, mean=mu, std=sigma)
print("Scaler stats:", mu.shape, sigma.shape, "std min/max:", float(sigma.min()), float(sigma.max()))

def load_npz_std(sample_id: int):
    d = np.load(feat_dir/f"{sample_id}.npz")
    X = d['X'].astype(np.float32)
    if X.shape[0] > 1200: X = X[:1200]
    X = (X - mu) / sigma
    return X

class SeqDataset(Dataset):
    def __init__(self, ids): self.ids = ids
    def __len__(self): return len(self.ids)
    def __getitem__(self, idx):
        sid = self.ids[idx]
        X = load_npz_std(sid)
        y = np.array(id2seq[sid], dtype=np.int64)
        return torch.from_numpy(X), torch.from_numpy(y), sid

def collate(batch):
    xs, ys, sids = zip(*batch)
    x_lens = torch.tensor([x.shape[0] for x in xs], dtype=torch.int32)
    y_lens = torch.tensor([y.shape[0] for y in ys], dtype=torch.int32)
    x_pad = pad_sequence(xs, batch_first=False)  # (T,B,D)
    y_cat = torch.cat(ys, dim=0)
    return x_pad, x_lens, y_cat, y_lens, sids

train_loader = DataLoader(SeqDataset(tr_ids), batch_size=24, shuffle=True, num_workers=2, pin_memory=True, collate_fn=collate)
val_loader   = DataLoader(SeqDataset(val_ids), batch_size=24, shuffle=False, num_workers=2, pin_memory=True, collate_fn=collate)

class InputProj(nn.Module):
    def __init__(self, in_dim, hid):
        super().__init__()
        self.lin = nn.Linear(in_dim, hid)
        self.ln = nn.LayerNorm(hid)
        self.act = nn.ReLU(inplace=True)
    def forward(self, x):  # x: (T,B,D)
        y = self.lin(x)
        y = self.ln(y)
        return self.act(y)

class BiGRUCTCStrong(nn.Module):
    def __init__(self, in_dim, proj=256, hidden=384, layers=3, num_classes=21, dropout=0.3):
        super().__init__()
        self.inp = InputProj(in_dim, proj)
        self.rnn = nn.GRU(input_size=proj, hidden_size=hidden, num_layers=layers, dropout=dropout, bidirectional=True)
        self.proj = nn.Linear(hidden*2, num_classes)
    def forward(self, x, x_lens):
        x = self.inp(x)
        packed = pack_padded_sequence(x, x_lens.cpu(), enforce_sorted=False)
        out, _ = self.rnn(packed)
        out, _ = pad_packed_sequence(out)
        return self.proj(out)

def time_mask(x, max_width=16, nmask=2, p=0.5):
    # x: (T,B,D), inplace mask
    if random.random() > p: return x
    T = x.size(0)
    for _ in range(nmask):
        w = random.randint(1, max_width)
        t0 = random.randint(0, max(0, T - w))
        x[t0:t0+w] = 0
    return x

def ctc_greedy(logits):
    pred = logits.argmax(dim=-1).cpu().numpy()  # (T,B)
    T,B = pred.shape
    seqs = []
    for b in range(B):
        out = []; last = -1
        for t in range(T):
            p = int(pred[t,b])
            if p != last:
                if p != 0: out.append(p)
                last = p
        seqs.append(out)
    return seqs

def fallback_rank_framewise(logits_b):
    # logits_b: (T,C) for one sample
    with torch.no_grad():
        lp = logits_b[:,1:21].mean(dim=0)
        order = torch.argsort(lp, descending=True).cpu().numpy().tolist()
    return [int(i+1) for i in order[:20]]

def ensure_len20_list(seq, logits_b):
    ok = (len(seq)==20) and (len(set(seq))==20) and all(1<=s<=20 for s in seq)
    if ok: return seq
    return fallback_rank_framewise(logits_b)

def levenshtein(a, b):
    n, m = len(a), len(b)
    if n==0: return m
    if m==0: return n
    dp = list(range(m+1))
    for i in range(1, n+1):
        prev = dp[0]; dp[0] = i; ai = a[i-1]
        for j in range(1, m+1):
            tmp = dp[j]
            dp[j] = min(dp[j]+1, dp[j-1]+1, prev + (0 if ai==b[j-1] else 1))
            prev = tmp
    return dp[m]

D_sample = np.load(next(iter(feat_dir.glob('*.npz'))))['X'].shape[1]
model = BiGRUCTCStrong(in_dim=D_sample, proj=256, hidden=384, layers=3, num_classes=21, dropout=0.3).to(device).float()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
# Use scaler but keep training in full float32 to avoid dtype mismatch
scaler = torch.amp.GradScaler('cuda' if torch.cuda.is_available() else 'cpu', enabled=False)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=15)

def train_epoch(ep):
    model.train()
    t0 = time.time(); tot=0.0; nb=0
    for it, (xb, x_lens, y_cat, y_lens, sids) in enumerate(train_loader):
        xb = xb.to(device, non_blocking=True).float()
        # Time masking augmentation
        xb = time_mask(xb, max_width=16, nmask=2, p=0.5)
        y_cat = y_cat.to(device, non_blocking=True)
        x_lens = x_lens.to(device, non_blocking=True)
        y_lens = y_lens.to(device, non_blocking=True)
        optimizer.zero_grad(set_to_none=True)
        logits = model(xb, x_lens)
        log_probs = logits.log_softmax(dim=-1)
        loss = ctc_loss(log_probs, y_cat, x_lens, y_lens)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        tot += loss.item(); nb += 1
        if (it+1) % 20 == 0:
            print(f"ep{ep} it{it+1} loss={tot/max(nb,1):.4f} elapsed={time.time()-t0:.1f}s", flush=True)
    scheduler.step()
    return tot/max(nb,1)

def evaluate_fast(model):
    model.eval()
    total_lev = 0.0; total = 0
    t0 = time.time()
    with torch.no_grad():
        for xb, x_lens, y_cat, y_lens, sids in val_loader:
            xb = xb.to(device, non_blocking=True).float()
            x_lens = x_lens.to(device, non_blocking=True)
            logits = model(xb, x_lens)  # (T,B,C)
            T,B,C = logits.shape
            seqs = ctc_greedy(logits)
            # split targets
            ys = []; off=0
            for L in y_lens.tolist(): ys.append(y_cat[off:off+L].tolist()); off+=L
            # ensure sequences length 20 via fallback ranking using per-sample logits
            for b in range(B):
                seq = ensure_len20_list(seqs[b], logits[:,b,:])
                tgt = ys[b]
                total_lev += levenshtein(seq, tgt)
                total += 1
    print(f"  [val] evaluated {total} samples in {(time.time()-t0)/60:.2f}m", flush=True)
    return total_lev / max(total,1)

best_val = math.inf; best_state = None; patience = 3; bad = 0
max_epochs = 15
for ep in range(1, max_epochs+1):
    tr_loss = train_epoch(ep)
    val_lev = evaluate_fast(model)
    print(f"Epoch {ep}: train_loss={tr_loss:.4f} val_lev={val_lev:.4f} lr={scheduler.get_last_lr()[0]:.6f}", flush=True)
    if val_lev < best_val - 1e-4:
        best_val = val_lev; best_state = {k:v.detach().cpu() for k,v in model.state_dict().items()}; bad = 0
        print(f"  New best val_lev={best_val:.4f}", flush=True)
    else:
        bad += 1
        if bad >= patience:
            print("Early stopping.", flush=True); break

if best_state is not None:
    model.load_state_dict(best_state)
torch.save(model.state_dict(), 'model_ctc_bgru_v2.pth')
print("=== Training complete. Saved model_ctc_bgru_v2.pth; best val_lev=", best_val)

# Note: After this finishes, re-run Cell 12 with the new checkpoint and tuned beam or fast greedy+fallback.

=== Train 3x384 BiGRU+CTC with input projection, LN, time masking, feature standardization ===


Train videos: 253, Val videos: 44
Scaler stats: (180,) (180,) std min/max: 0.0011892978800460696 0.9245076775550842


  [val] evaluated 44 samples in 0.00m


Epoch 1: train_loss=23.5679 val_lev=18.2500 lr=0.000297


  New best val_lev=18.2500


  [val] evaluated 44 samples in 0.00m


Epoch 2: train_loss=8.4269 val_lev=18.4773 lr=0.000287


  [val] evaluated 44 samples in 0.00m


  [val] evaluated 44 samples in 0.00m


Epoch 4: train_loss=6.7487 val_lev=18.1136 lr=0.000250


  New best val_lev=18.1136


  [val] evaluated 44 samples in 0.00m


Epoch 5: train_loss=6.4927 val_lev=18.0455 lr=0.000225


  New best val_lev=18.0455


  [val] evaluated 44 samples in 0.00m


Epoch 6: train_loss=6.6235 val_lev=17.9091 lr=0.000196


  New best val_lev=17.9091


  [val] evaluated 44 samples in 0.00m


Epoch 7: train_loss=6.3430 val_lev=18.2727 lr=0.000166


  [val] evaluated 44 samples in 0.00m


Epoch 8: train_loss=5.3549 val_lev=18.0227 lr=0.000134


  [val] evaluated 44 samples in 0.00m


Epoch 9: train_loss=5.1989 val_lev=17.7727 lr=0.000104


  New best val_lev=17.7727


  [val] evaluated 44 samples in 0.00m


Epoch 10: train_loss=5.6111 val_lev=18.0227 lr=0.000075


  [val] evaluated 44 samples in 0.00m


Epoch 11: train_loss=5.0718 val_lev=17.9773 lr=0.000050


  [val] evaluated 44 samples in 0.00m


Epoch 12: train_loss=5.9004 val_lev=18.0455 lr=0.000029


Early stopping.


=== Training complete. Saved model_ctc_bgru_v2.pth; best val_lev= 17.772727272727273


In [20]:
import time, json
from pathlib import Path
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

print("=== Inference on TEST with stronger 3x384 BiGRU (greedy+fallback) -> submission.csv ===", flush=True)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
feat_dir = Path('features3d')/'test'
test_ids = pd.read_csv('test.csv')['Id'].astype(int).tolist()

# Load feature standardization stats used in training
sc = np.load('feature_scaler.npz')
mu = sc['mean'].astype(np.float32)
sigma = sc['std'].astype(np.float32)

def load_feat_std(sample_id: int):
    d = np.load(feat_dir/f"{sample_id}.npz")
    X = d['X'].astype(np.float32)
    if X.shape[0] > 1200: X = X[:1200]
    return (X - mu) / sigma

class InputProj(nn.Module):
    def __init__(self, in_dim, hid):
        super().__init__()
        self.lin = nn.Linear(in_dim, hid)
        self.ln = nn.LayerNorm(hid)
        self.act = nn.ReLU(inplace=True)
    def forward(self, x):
        y = self.lin(x); y = self.ln(y); return self.act(y)

class BiGRUCTCStrong(nn.Module):
    def __init__(self, in_dim, proj=256, hidden=384, layers=3, num_classes=21, dropout=0.3):
        super().__init__()
        self.inp = InputProj(in_dim, proj)
        self.rnn = nn.GRU(input_size=proj, hidden_size=hidden, num_layers=layers, dropout=dropout, bidirectional=True)
        self.proj = nn.Linear(hidden*2, num_classes)
    def forward(self, x, x_lens):
        x = self.inp(x)
        packed = pack_padded_sequence(x, x_lens.cpu(), enforce_sorted=False)
        out, _ = self.rnn(packed)
        out, _ = pad_packed_sequence(out)
        return self.proj(out)

def ctc_greedy_one(logits_T_C: torch.Tensor):
    pred = logits_T_C.argmax(dim=-1).tolist()
    out = []; last = -1
    for p in pred:
        if p != last:
            if p != 0: out.append(int(p))
            last = p
    return out

def fallback_rank_framewise(logits_T_C: torch.Tensor):
    with torch.no_grad():
        lp = logits_T_C[:,1:21].mean(dim=0)
        order = torch.argsort(lp, descending=True).tolist()
    return [int(i+1) for i in order[:20]]

def ensure_len20_list(seq, logits_T_C):
    ok = (len(seq)==20) and (len(set(seq))==20) and all(1<=s<=20 for s in seq)
    if ok: return seq
    return fallback_rank_framewise(logits_T_C)

# Build and load model
in_dim = np.load(next(iter((Path('features3d')/'train').glob('*.npz'))))['X'].shape[1]
model = BiGRUCTCStrong(in_dim=in_dim, proj=256, hidden=384, layers=3, num_classes=21, dropout=0.3).to(device).float()
state = torch.load('model_ctc_bgru_v2.pth', map_location=device)
model.load_state_dict(state); model.eval()

rows = []; t0 = time.time()
for i, sid in enumerate(test_ids, 1):
    X = load_feat_std(sid)
    xb = torch.from_numpy(X).to(device).unsqueeze(1)  # (T,1,D)
    x_lens = torch.tensor([xb.shape[0]], dtype=torch.int32, device=device)
    with torch.no_grad():
        logits = model(xb.float(), x_lens)[:,0,:]  # (T,C)
    seq = ctc_greedy_one(logits)
    seq = ensure_len20_list(seq, logits)
    rows.append({'Id': sid, 'Sequence': ' '.join(str(x) for x in seq)})
    if i % 10 == 0 or i == len(test_ids):
        print(f"[infer v2] {i}/{len(test_ids)} elapsed={(time.time()-t0)/60:.1f}m", flush=True)

sub = pd.DataFrame(rows, columns=['Id','Sequence'])
sub.to_csv('submission.csv', index=False)
print('Wrote submission.csv; head:\n', sub.head())
print('=== Inference v2 done ===')

=== Inference on TEST with stronger 3x384 BiGRU (greedy+fallback) -> submission.csv ===


  state = torch.load('model_ctc_bgru_v2.pth', map_location=device)


[infer v2] 10/95 elapsed=0.0m


[infer v2] 20/95 elapsed=0.0m


[infer v2] 30/95 elapsed=0.0m


[infer v2] 40/95 elapsed=0.0m


[infer v2] 50/95 elapsed=0.0m


[infer v2] 60/95 elapsed=0.0m


[infer v2] 70/95 elapsed=0.0m


[infer v2] 80/95 elapsed=0.0m


[infer v2] 90/95 elapsed=0.0m


[infer v2] 95/95 elapsed=0.1m


Wrote submission.csv; head:
     Id                                           Sequence
0  300  17 16 11 18 13 12 8 14 9 20 15 10 1 5 6 4 3 2 ...
1  301  17 16 11 18 13 12 8 14 15 9 20 10 1 5 6 4 3 2 ...
2  302  17 16 11 18 13 12 8 14 15 9 20 10 1 5 6 4 3 2 ...
3  303  17 16 11 13 18 12 8 14 15 9 20 10 5 1 6 4 3 2 ...
4  304  17 16 11 13 18 12 8 14 15 9 20 10 5 1 6 4 3 2 ...
=== Inference v2 done ===


In [21]:
import io, tarfile, zipfile, json, math, gc, os, tempfile
from pathlib import Path
import numpy as np
import pandas as pd
import scipy.io as sio

print("=== Recache v2: dequantized 3D skeleton + extra scalars + vel/acc @20fps ===", flush=True)

CWD = Path('.')
TRAIN_TARS = [CWD/'training1.tar.gz', CWD/'training2.tar.gz', CWD/'training3.tar.gz']
VAL_TARS = [CWD/'validation1.tar.gz', CWD/'validation2.tar.gz', CWD/'validation3.tar.gz']
TEST_TAR = CWD/'test.tar.gz'

def build_tar_index(tar_paths):
    idx = {}
    for tp in tar_paths:
        if not tp.exists(): continue
        with tarfile.open(tp, 'r:*') as tf:
            for m in tf:
                if m.isreg():
                    nm = m.name.lstrip('./')
                    if nm.endswith('.zip') and nm.startswith('Sample'):
                        idx[nm] = (tp, m)
    return idx

train_idx = build_tar_index(TRAIN_TARS)
val_idx = build_tar_index(VAL_TARS)
test_idx = build_tar_index([TEST_TAR])

def id_to_zipname(sample_id: int) -> str:
    return f"Sample{sample_id:05d}.zip"

def read_mat_bytes_from_zip(tarpath: Path, tarinfo: tarfile.TarInfo) -> bytes:
    with tarfile.open(tarpath, 'r:*') as tf:
        fobj = tf.extractfile(tarinfo); data = fobj.read()
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        mat_name = None
        for n in zf.namelist():
            ln = n.lower()
            if ln.endswith('_data.mat') or ln.endswith('.mat'): mat_name = n; break
        b = zf.read(mat_name)
    return b

def load_video_struct(mat_bytes: bytes):
    md = sio.loadmat(io.BytesIO(mat_bytes), squeeze_me=True, struct_as_record=False)
    return md['Video']

# Joint index map (0-based) per expert advice
HIP_CENTER=0; SPINE=1; SHOULDER_CENTER=2; HEAD=3;
SHOULDER_L=4; ELBOW_L=5; WRIST_L=6; HAND_L=7;
SHOULDER_R=8; ELBOW_R=9; WRIST_R=10; HAND_R=11;
HIP_L=12; KNEE_L=13; ANKLE_L=14; FOOT_L=15;
HIP_R=16; KNEE_R=17; ANKLE_R=18; FOOT_R=19

def dequantize_u8(arr_u8: np.ndarray) -> np.ndarray:
    # Map uint8 to approx [-1,1]
    return (arr_u8.astype(np.float32) - 128.0) / 128.0

def ema_smooth_scale(scales: np.ndarray, alpha: float = 0.7) -> np.ndarray:
    s = np.empty_like(scales, dtype=np.float32)
    if len(scales) == 0: return s
    s[0] = scales[0]
    for t in range(1, len(scales)):
        s[t] = alpha * s[t-1] + (1.0 - alpha) * scales[t]
    s[s <= 1e-6] = 1.0
    return s

def elbow_angle(a, b, c):
    # angle at b between vectors a-b and c-b
    v1 = a - b; v2 = c - b
    num = np.sum(v1*v2, axis=-1)
    d1 = np.linalg.norm(v1, axis=-1) + 1e-8
    d2 = np.linalg.norm(v2, axis=-1) + 1e-8
    cosang = np.clip(num/(d1*d2), -1.0, 1.0)
    return np.arccos(cosang).astype(np.float32)  # radians

def build_features_from_video(V):
    T = int(getattr(V, 'NumFrames', 0))
    frames = V.Frames
    # Collect 3D u8 positions
    X = np.zeros((T, 20, 3), dtype=np.float32)
    for t in range(T):
        wp = getattr(frames[t].Skeleton, 'WorldPosition')  # (20,3) uint8
        X[t] = dequantize_u8(np.asarray(wp))
    # Center on hip center and scale by shoulder width with EMA smoothing
    center = X[:, HIP_CENTER, :]  # (T,3)
    Xc = X - center[:, None, :]
    shoulder_width = np.linalg.norm(X[:, SHOULDER_L, :] - X[:, SHOULDER_R, :], axis=1)
    scale = ema_smooth_scale(shoulder_width, alpha=0.7)
    Xn = Xc / scale[:, None, None]
    # Extra scalars per frame (computed on normalized coords)
    HL = Xn[:, HAND_L, :]; HR = Xn[:, HAND_R, :]
    SHL = Xn[:, SHOULDER_L, :]; SHR = Xn[:, SHOULDER_R, :]
    HD = Xn[:, HEAD, :]; HC = Xn[:, HIP_CENTER, :]
    WRL = Xn[:, WRIST_L, :]; WRI = Xn[:, WRIST_R, :]
    ELL = Xn[:, ELBOW_L, :]; ELR = Xn[:, ELBOW_R, :]
    # distances
    d_hands = np.linalg.norm(HL - HR, axis=1)[:, None]
    d_hl_head = np.linalg.norm(HL - HD, axis=1)[:, None]
    d_hr_head = np.linalg.norm(HR - HD, axis=1)[:, None]
    d_hl_shl = np.linalg.norm(HL - SHL, axis=1)[:, None]
    d_hr_shr = np.linalg.norm(HR - SHR, axis=1)[:, None]
    d_hl_hip = np.linalg.norm(HL - HC, axis=1)[:, None]
    d_hr_hip = np.linalg.norm(HR - HC, axis=1)[:, None]
    # speeds (magnitudes) and vertical velocities (z') for hands
    def temporal_diff(x):
        v = np.diff(x, axis=0, prepend=x[:1])
        return v
    v_hl = temporal_diff(HL); v_hr = temporal_diff(HR)
    sp_hl = np.linalg.norm(v_hl, axis=1)[:, None]; sp_hr = np.linalg.norm(v_hr, axis=1)[:, None]
    vz_hl = v_hl[:, 2:3]; vz_hr = v_hr[:, 2:3]
    # elbow angles
    ang_l = elbow_angle(SHL, ELL, WRL)[:, None]
    ang_r = elbow_angle(SHR, ELR, WRI)[:, None]
    scalars = np.concatenate([d_hands, d_hl_head, d_hr_head, d_hl_shl, d_hr_shr, d_hl_hip, d_hr_hip, sp_hl, sp_hr, vz_hl, vz_hr, ang_l, ang_r], axis=1)
    # Base per-frame vector: flattened joints + scalars
    base = np.concatenate([Xn.reshape(T, -1), scalars], axis=1).astype(np.float32)
    # Derivatives
    V1 = np.diff(base, axis=0, prepend=base[:1])
    A1 = np.diff(V1, axis=0, prepend=V1[:1])
    Xf = np.concatenate([base, V1, A1], axis=1).astype(np.float32)
    meta = dict(fps=int(getattr(V, 'FrameRate', 20)), nframes=T, stride=1, feat='world3d_dequant_norm+scalars+vel+acc')
    return Xf, meta

def cache_one_v2(sample_id: int, split: str, outdir: Path):
    idx = train_idx if split=='train' else (val_idx if split=='val' else test_idx)
    zipname = id_to_zipname(sample_id)
    if zipname not in idx: raise KeyError(f"{zipname} not in index for split={split}")
    tarpath, tarinfo = idx[zipname]
    mat_bytes = read_mat_bytes_from_zip(tarpath, tarinfo)
    V = load_video_struct(mat_bytes)
    Xf, meta = build_features_from_video(V)
    outdir.mkdir(parents=True, exist_ok=True)
    np.savez_compressed(outdir/f"{sample_id}.npz", X=Xf, meta=json.dumps(meta))

def recache_split(split: str, ids: list, outdir: Path):
    total = len(ids); t0 = time.time(); done=0; skip=0
    outdir.mkdir(parents=True, exist_ok=True)
    for i, sid in enumerate(ids, 1):
        p = outdir/f"{sid}.npz"
        if p.exists():
            skip += 1
        else:
            st = time.time()
            try:
                cache_one_v2(int(sid), split, outdir); done += 1
            except Exception as e:
                print(f"[WARN] {split} id={sid} failed: {e}", flush=True); continue
        if (i % 10)==0 or i==total:
            dt = time.time()-t0; rate = i/max(dt,1e-9); eta = (total-i)/max(rate,1e-6)
            print(f"[{split} v2] {i}/{total} cached={done} skip={skip} elapsed={dt/60:.1f}m eta={eta/60:.1f}m", flush=True)
        gc.collect()
    print(f"=== {split.upper()} v2 caching done: cached={done} skip={skip} total={total} elapsed={(time.time()-t0)/60:.2f}m ===", flush=True)

# Run recache for train and test into features3d_v2/
train_ids = pd.read_csv('training.csv')['Id'].astype(int).tolist()
test_ids = pd.read_csv('test.csv')['Id'].astype(int).tolist()
recache_split('train', train_ids, Path('features3d_v2')/'train')
recache_split('test', test_ids, Path('features3d_v2')/'test')
print("=== Recache v2 complete ===")

=== Recache v2: dequantized 3D skeleton + extra scalars + vel/acc @20fps ===


[train v2] 10/297 cached=10 skip=0 elapsed=0.1m eta=2.6m


[train v2] 20/297 cached=20 skip=0 elapsed=0.3m eta=3.8m


[train v2] 30/297 cached=30 skip=0 elapsed=0.5m eta=4.8m


[train v2] 40/297 cached=40 skip=0 elapsed=0.9m eta=5.7m


[train v2] 50/297 cached=50 skip=0 elapsed=1.4m eta=6.7m


[train v2] 60/297 cached=60 skip=0 elapsed=2.0m eta=7.8m


[train v2] 70/297 cached=70 skip=0 elapsed=2.7m eta=8.7m


[train v2] 80/297 cached=80 skip=0 elapsed=3.5m eta=9.6m


[train v2] 90/297 cached=90 skip=0 elapsed=4.4m eta=10.2m


[train v2] 100/297 cached=100 skip=0 elapsed=5.2m eta=10.3m


[train v2] 110/297 cached=110 skip=0 elapsed=5.3m eta=9.0m


[train v2] 120/297 cached=120 skip=0 elapsed=5.4m eta=7.9m


[train v2] 130/297 cached=130 skip=0 elapsed=5.5m eta=7.1m


[train v2] 140/297 cached=140 skip=0 elapsed=5.7m eta=6.4m


[train v2] 150/297 cached=150 skip=0 elapsed=5.9m eta=5.8m


[train v2] 160/297 cached=160 skip=0 elapsed=6.1m eta=5.3m


[train v2] 170/297 cached=170 skip=0 elapsed=6.4m eta=4.8m


[train v2] 180/297 cached=180 skip=0 elapsed=6.7m eta=4.4m


[train v2] 190/297 cached=190 skip=0 elapsed=7.1m eta=4.0m


[train v2] 200/297 cached=200 skip=0 elapsed=7.4m eta=3.6m


[train v2] 210/297 cached=210 skip=0 elapsed=7.4m eta=3.1m


[train v2] 220/297 cached=220 skip=0 elapsed=7.5m eta=2.6m


[train v2] 230/297 cached=230 skip=0 elapsed=7.7m eta=2.2m


[train v2] 240/297 cached=240 skip=0 elapsed=8.0m eta=1.9m


[train v2] 250/297 cached=250 skip=0 elapsed=8.3m eta=1.6m


[train v2] 260/297 cached=260 skip=0 elapsed=8.7m eta=1.2m


[train v2] 270/297 cached=270 skip=0 elapsed=9.1m eta=0.9m


[train v2] 280/297 cached=280 skip=0 elapsed=9.5m eta=0.6m


[train v2] 290/297 cached=290 skip=0 elapsed=10.0m eta=0.2m


[train v2] 297/297 cached=297 skip=0 elapsed=10.4m eta=0.0m


=== TRAIN v2 caching done: cached=297 skip=0 total=297 elapsed=10.36m ===


[test v2] 10/95 cached=10 skip=0 elapsed=0.1m eta=0.5m


[test v2] 20/95 cached=20 skip=0 elapsed=0.2m eta=0.6m


[test v2] 30/95 cached=30 skip=0 elapsed=0.3m eta=0.7m


[test v2] 40/95 cached=40 skip=0 elapsed=0.5m eta=0.7m


[test v2] 50/95 cached=50 skip=0 elapsed=0.7m eta=0.7m


[test v2] 60/95 cached=60 skip=0 elapsed=1.0m eta=0.6m


[test v2] 70/95 cached=70 skip=0 elapsed=1.4m eta=0.5m


[test v2] 80/95 cached=80 skip=0 elapsed=1.7m eta=0.3m


[test v2] 90/95 cached=90 skip=0 elapsed=2.2m eta=0.1m


[test v2] 95/95 cached=95 skip=0 elapsed=2.4m eta=0.0m


=== TEST v2 caching done: cached=95 skip=0 total=95 elapsed=2.38m ===


=== Recache v2 complete ===


In [22]:
import io, tarfile, zipfile, json, time, gc
from pathlib import Path
import numpy as np
import pandas as pd
import scipy.io as sio

print("=== Build per-frame labels for train (v2) from Video.Labels and training.csv sequences ===", flush=True)

CWD = Path('.')
TRAIN_TARS = [CWD/'training1.tar.gz', CWD/'training2.tar.gz', CWD/'training3.tar.gz']

def build_tar_index(tar_paths):
    idx = {}
    for tp in tar_paths:
        if not tp.exists(): continue
        with tarfile.open(tp, 'r:*') as tf:
            for m in tf:
                if m.isreg():
                    nm = m.name.lstrip('./')
                    if nm.endswith('.zip') and nm.startswith('Sample'):
                        idx[nm] = (tp, m)
    return idx

train_idx = build_tar_index(TRAIN_TARS)

def id_to_zipname(sample_id: int) -> str:
    return f"Sample{sample_id:05d}.zip"

def read_mat_bytes_from_zip(tarpath: Path, tarinfo: tarfile.TarInfo) -> bytes:
    with tarfile.open(tarpath, 'r:*') as tf:
        fobj = tf.extractfile(tarinfo); data = fobj.read()
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        mat_name = None
        for n in zf.namelist():
            ln = n.lower()
            if ln.endswith('_data.mat') or ln.endswith('.mat'): mat_name = n; break
        b = zf.read(mat_name)
    return b

def load_video_struct(mat_bytes: bytes):
    md = sio.loadmat(io.BytesIO(mat_bytes), squeeze_me=True, struct_as_record=False)
    return md['Video']

train_df = pd.read_csv('training.csv')
id2seq = {int(r.Id): [int(x) for x in str(r.Sequence).strip().split()] for _, r in train_df.iterrows()}
train_ids = train_df['Id'].astype(int).tolist()

labels_dir = Path('labels3d_v2')/'train'
labels_dir.mkdir(parents=True, exist_ok=True)

def cache_labels_one(sample_id: int):
    zipname = id_to_zipname(sample_id)
    if zipname not in train_idx:
        raise KeyError(f"{zipname} not in train index")
    tarpath, tarinfo = train_idx[zipname]
    mat_bytes = read_mat_bytes_from_zip(tarpath, tarinfo)
    V = load_video_struct(mat_bytes)
    T = int(getattr(V, 'NumFrames', 0))
    y = np.zeros(T, dtype=np.int16)
    labels = V.Labels  # array of 20 structs with Begin, End, Name
    seq = id2seq[sample_id]  # list of 20 gesture IDs
    K = min(len(labels), len(seq))
    for k in range(K):
        lab = labels[k]
        b = int(getattr(lab, 'Begin', 1)); e = int(getattr(lab, 'End', b))
        b = max(1, b); e = min(T, e)
        if e >= b:
            y[b-1:e] = int(seq[k])  # 1..20
    np.save(labels_dir/f"{sample_id}.npy", y)

t0 = time.time(); done=0; skip=0
for i, sid in enumerate(train_ids, 1):
    outp = labels_dir/f"{sid}.npy"
    if outp.exists():
        skip += 1
    else:
        try:
            cache_labels_one(int(sid)); done += 1
        except Exception as e:
            print(f"[WARN] label cache failed id={sid}: {e}", flush=True)
            continue
    if (i % 10)==0 or i==len(train_ids):
        dt = time.time()-t0; rate = i/max(dt,1e-9); eta = (len(train_ids)-i)/max(rate,1e-6)
        print(f"[labels v2] {i}/{len(train_ids)} cached={done} skip={skip} elapsed={dt/60:.1f}m eta={eta/60:.1f}m", flush=True)
    gc.collect()
print(f"=== Label caching done: cached={done} skip={skip} total={len(train_ids)} elapsed={(time.time()-t0)/60:.2f}m ===", flush=True)

=== Build per-frame labels for train (v2) from Video.Labels and training.csv sequences ===


[labels v2] 10/297 cached=10 skip=0 elapsed=0.1m eta=2.4m


[labels v2] 20/297 cached=20 skip=0 elapsed=0.3m eta=3.5m


[labels v2] 30/297 cached=30 skip=0 elapsed=0.5m eta=4.5m


[labels v2] 40/297 cached=40 skip=0 elapsed=0.8m eta=5.4m


[labels v2] 50/297 cached=50 skip=0 elapsed=1.3m eta=6.4m


[labels v2] 60/297 cached=60 skip=0 elapsed=1.9m eta=7.5m


[labels v2] 70/297 cached=70 skip=0 elapsed=2.6m eta=8.4m


[labels v2] 80/297 cached=80 skip=0 elapsed=3.4m eta=9.3m


[labels v2] 90/297 cached=90 skip=0 elapsed=4.3m eta=10.0m


[labels v2] 100/297 cached=100 skip=0 elapsed=5.1m eta=10.0m


[labels v2] 110/297 cached=110 skip=0 elapsed=5.1m eta=8.8m


[labels v2] 120/297 cached=120 skip=0 elapsed=5.2m eta=7.7m


[labels v2] 130/297 cached=130 skip=0 elapsed=5.4m eta=6.9m


[labels v2] 140/297 cached=140 skip=0 elapsed=5.5m eta=6.2m


[labels v2] 150/297 cached=150 skip=0 elapsed=5.7m eta=5.6m


[labels v2] 160/297 cached=160 skip=0 elapsed=6.0m eta=5.1m


[labels v2] 170/297 cached=170 skip=0 elapsed=6.2m eta=4.7m


[labels v2] 180/297 cached=180 skip=0 elapsed=6.6m eta=4.3m


[labels v2] 190/297 cached=190 skip=0 elapsed=6.9m eta=3.9m


[labels v2] 200/297 cached=200 skip=0 elapsed=7.2m eta=3.5m


[labels v2] 210/297 cached=210 skip=0 elapsed=7.2m eta=3.0m


[labels v2] 220/297 cached=220 skip=0 elapsed=7.3m eta=2.6m


[labels v2] 230/297 cached=230 skip=0 elapsed=7.5m eta=2.2m


[labels v2] 240/297 cached=240 skip=0 elapsed=7.7m eta=1.8m


[labels v2] 250/297 cached=250 skip=0 elapsed=8.1m eta=1.5m


[labels v2] 260/297 cached=260 skip=0 elapsed=8.4m eta=1.2m


[labels v2] 270/297 cached=270 skip=0 elapsed=8.8m eta=0.9m


[labels v2] 280/297 cached=280 skip=0 elapsed=9.2m eta=0.6m


[labels v2] 290/297 cached=290 skip=0 elapsed=9.7m eta=0.2m


[labels v2] 297/297 cached=297 skip=0 elapsed=10.1m eta=0.0m


=== Label caching done: cached=297 skip=0 total=297 elapsed=10.06m ===


In [23]:
import math, time, random, gc
from pathlib import Path
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

print("=== Per-frame CE model (dilated 1D CNN) on features3d_v2 + labels3d_v2 ===", flush=True)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
random.seed(42); np.random.seed(42); torch.manual_seed(42)
if torch.cuda.is_available(): torch.backends.cudnn.benchmark = True

feat_tr_dir = Path('features3d_v2')/'train'
feat_te_dir = Path('features3d_v2')/'test'
lab_tr_dir  = Path('labels3d_v2')/'train'

train_df = pd.read_csv('training.csv')
all_ids = train_df['Id'].astype(int).tolist()
random.shuffle(all_ids)
val_ratio = 0.15
val_n = max(30, int(len(all_ids)*val_ratio))
val_ids = all_ids[:val_n]
tr_ids = all_ids[val_n:]
print(f"Train videos: {len(tr_ids)}, Val videos: {len(val_ids)}")

def load_feat(sample_id: int):
    d = np.load(feat_tr_dir/f"{sample_id}.npz")
    X = d['X'].astype(np.float32)  # (T,D)
    return X

def load_lab(sample_id: int):
    y = np.load(lab_tr_dir/f"{sample_id}.npy").astype(np.int64)  # (T,)
    return y

class FrameDataset(Dataset):
    def __init__(self, ids, max_T=1800):
        self.ids = list(ids); self.max_T = max_T
    def __len__(self): return len(self.ids)
    def __getitem__(self, idx):
        sid = self.ids[idx]
        X = load_feat(sid); y = load_lab(sid)
        T = min(len(X), len(y))
        X = X[:T]; y = y[:T]
        if T > self.max_T:
            X = X[:self.max_T]; y = y[:self.max_T]; T = self.max_T
        return torch.from_numpy(X), torch.from_numpy(y), int(sid)

def collate(batch):
    xs, ys, sids = zip(*batch)
    T_max = max(x.shape[0] for x in xs)
    D = xs[0].shape[1]
    B = len(xs)
    xb = torch.zeros(B, T_max, D, dtype=torch.float32)
    yb = torch.zeros(B, T_max, dtype=torch.long)
    mask = torch.zeros(B, T_max, dtype=torch.bool)
    for i,(x,y) in enumerate(zip(xs,ys)):
        T = x.shape[0]
        xb[i,:T] = x
        yb[i,:T] = y
        mask[i,:T] = True
    return xb, yb, mask, list(sids)

train_ds = FrameDataset(tr_ids, max_T=1800)
val_ds   = FrameDataset(val_ids, max_T=1800)
train_loader = DataLoader(train_ds, batch_size=12, shuffle=True, num_workers=2, pin_memory=True, collate_fn=collate)
val_loader   = DataLoader(val_ds, batch_size=12, shuffle=False, num_workers=2, pin_memory=True, collate_fn=collate)

D_in = np.load(next(iter(feat_tr_dir.glob('*.npz'))))['X'].shape[1]

class DilatedTCN(nn.Module):
    def __init__(self, d_in, channels=96, layers=10, num_classes=21, dropout=0.3):
        super().__init__()
        self.inp = nn.Conv1d(d_in, channels, kernel_size=1)
        blocks = []
        dil = 1
        for i in range(layers):
            blocks.append(nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=3, padding=dil, dilation=dil),
                nn.GroupNorm(num_groups=8, num_channels=channels),
                nn.ReLU(inplace=True),
                nn.Dropout(dropout),
                nn.Conv1d(channels, channels, kernel_size=1),
                nn.GroupNorm(num_groups=8, num_channels=channels),
                nn.ReLU(inplace=True),
            ))
            dil = min(dil*2, 512)
        self.blocks = nn.ModuleList(blocks)
        self.head = nn.Conv1d(channels, num_classes, kernel_size=1)
    def forward(self, x_b_t_d, mask_b_t=None):
        # x: (B,T,D) -> (B,C,T)
        x = x_b_t_d.transpose(1,2)
        h = self.inp(x)
        for blk in self.blocks:
            res = h
            h = blk(h)
            h = h + res
        logits = self.head(h)  # (B,C,T)
        return logits.transpose(1,2)  # (B,T,C)

model = DilatedTCN(d_in=D_in, channels=96, layers=10, num_classes=21, dropout=0.3).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)

def ce_loss_ignore_bg(logits_b_t_c, y_b_t, mask_b_t):
    # logits: (B,T,C), y: (B,T), mask True for valid frames. Ignore_index=0
    B,T,C = logits_b_t_c.shape
    logits = logits_b_t_c.reshape(B*T, C)
    targets = y_b_t.reshape(B*T)
    valid = mask_b_t.reshape(B*T)
    # Compute loss only on valid & label>0
    valid_fg = valid & (targets > 0)
    if valid_fg.sum() == 0:
        return logits.new_zeros([])
    loss = F.cross_entropy(logits[valid_fg], targets[valid_fg], reduction='mean')
    return loss

def decode_sequence_from_frame_probs(probs_b_t_c):
    # probs: (B,T,C) softmax over C; class 0 is background. Return list of sequences (len 20) per batch.
    B,T,C = probs_b_t_c.shape
    seqs = []
    # optional temporal smoothing with average pool over time
    smoothed = F.avg_pool1d(probs_b_t_c.transpose(1,2), kernel_size=7, stride=1, padding=3).transpose(1,2)
    for b in range(B):
        p = smoothed[b]  # (T,C)
        peaks = []
        for c in range(1,21):
            t_star = int(torch.argmax(p[:,c]).item())
            peaks.append((c, t_star))
        peaks.sort(key=lambda x: x[1])
        seq = [c for c,_ in peaks]
        seqs.append(seq)
    return seqs

def levenshtein(a, b):
    n, m = len(a), len(b)
    if n==0: return m
    if m==0: return n
    dp = list(range(m+1))
    for i in range(1, n+1):
        prev = dp[0]; dp[0] = i; ai = a[i-1]
        for j in range(1, m+1):
            tmp = dp[j]
            dp[j] = min(dp[j]+1, dp[j-1]+1, prev + (0 if ai==b[j-1] else 1))
            prev = tmp
    return dp[m]

id2seq = {int(r.Id): [int(x) for x in str(r.Sequence).strip().split()] for _, r in train_df.iterrows()}

def eval_val():
    model.eval()
    tot=0; cnt=0; t0=time.time()
    with torch.no_grad():
        for xb, yb, mask, sids in val_loader:
            xb = xb.to(device); yb = yb.to(device); mask = mask.to(device)
            logits = model(xb, mask)  # (B,T,C)
            probs = logits.softmax(dim=-1)
            seqs = decode_sequence_from_frame_probs(probs)
            for sid, seq in zip(sids, seqs):
                tgt = id2seq[sid]
                tot += levenshtein(seq, tgt); cnt += 1
    print(f"  [val] {cnt} vids evaluated in {(time.time()-t0)/60:.2f}m", flush=True)
    return tot/max(cnt,1)

best = math.inf; best_state=None; patience=3; bad=0; max_epochs=20
for ep in range(1, max_epochs+1):
    model.train(); t0=time.time(); nb=0; tot_loss=0.0
    for it, (xb, yb, mask, sids) in enumerate(train_loader):
        xb = xb.to(device); yb = yb.to(device); mask = mask.to(device)
        opt.zero_grad(set_to_none=True)
        logits = model(xb, mask)  # (B,T,C)
        loss = ce_loss_ignore_bg(logits, yb, mask)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        opt.step()
        tot_loss += float(loss.item()); nb += 1
        if (it+1)%20==0:
            print(f"ep{ep} it{it+1} loss={tot_loss/nb:.4f} elapsed={time.time()-t0:.1f}s", flush=True)
    val_lev = eval_val()
    print(f"Epoch {ep}: train_loss={tot_loss/max(nb,1):.4f} val_lev={val_lev:.4f}", flush=True)
    if val_lev < best - 1e-4:
        best = val_lev; best_state = {k:v.detach().cpu() for k,v in model.state_dict().items()}; bad=0
        print(f"  New best val_lev={best:.4f}", flush=True)
    else:
        bad += 1
        if bad>=patience:
            print("Early stopping.", flush=True); break

if best_state is not None: model.load_state_dict(best_state)
torch.save(model.state_dict(), 'model_ce_tcn_v2.pth')
print("Saved model_ce_tcn_v2.pth; best val_lev=", best)

print("=== Inference TEST with CE model (peak-time sort) -> submission.csv ===", flush=True)
test_ids = pd.read_csv('test.csv')['Id'].astype(int).tolist()

rows=[]; t0=time.time()
model.eval()
with torch.no_grad():
    for i, sid in enumerate(test_ids, 1):
        d = np.load(feat_te_dir/f"{sid}.npz"); X = d['X'].astype(np.float32)
        if X.shape[0] > 1800: X = X[:1800]
        xb = torch.from_numpy(X).unsqueeze(0).to(device)  # (1,T,D)
        mask = torch.ones(1, xb.shape[1], dtype=torch.bool, device=device)
        logits = model(xb, mask)[0]  # (T,C)
        probs = logits.softmax(dim=-1).unsqueeze(0)  # (1,T,C)
        seq = decode_sequence_from_frame_probs(probs)[0]
        rows.append({'Id': int(sid), 'Sequence': ' '.join(str(x) for x in seq)})
        if i%10==0 or i==len(test_ids):
            print(f"[infer CE] {i}/{len(test_ids)} elapsed={(time.time()-t0)/60:.1f}m", flush=True)

sub = pd.DataFrame(rows, columns=['Id','Sequence'])
sub.to_csv('submission.csv', index=False)
print('Wrote submission.csv; head:\n', sub.head())
print('=== CE pipeline done ===')

=== Per-frame CE model (dilated 1D CNN) on features3d_v2 + labels3d_v2 ===


Train videos: 253, Val videos: 44


ep1 it20 loss=3.4702 elapsed=4.4s


  [val] 44 vids evaluated in 0.01m


ep2 it20 loss=2.7898 elapsed=2.3s


  [val] 44 vids evaluated in 0.00m


Epoch 2: train_loss=2.7797 val_lev=15.3409


  New best val_lev=15.3409


ep3 it20 loss=2.4210 elapsed=1.5s


  [val] 44 vids evaluated in 0.00m


Epoch 3: train_loss=2.4131 val_lev=12.8409


  New best val_lev=12.8409


ep4 it20 loss=2.1485 elapsed=1.5s


  [val] 44 vids evaluated in 0.00m


Epoch 4: train_loss=2.1501 val_lev=11.7727


  New best val_lev=11.7727


ep5 it20 loss=1.9746 elapsed=1.0s


  [val] 44 vids evaluated in 0.00m


Epoch 5: train_loss=1.9515 val_lev=9.9091


  New best val_lev=9.9091


ep6 it20 loss=1.8270 elapsed=0.9s


  [val] 44 vids evaluated in 0.00m


Epoch 6: train_loss=1.8068 val_lev=9.1136


  New best val_lev=9.1136


ep7 it20 loss=1.6773 elapsed=1.0s


  [val] 44 vids evaluated in 0.00m


Epoch 7: train_loss=1.6813 val_lev=8.0909


  New best val_lev=8.0909


ep8 it20 loss=1.5661 elapsed=1.2s


  [val] 44 vids evaluated in 0.00m


Epoch 8: train_loss=1.5550 val_lev=7.2727


  New best val_lev=7.2727


ep9 it20 loss=1.4629 elapsed=1.0s


  [val] 44 vids evaluated in 0.00m


Epoch 9: train_loss=1.4420 val_lev=6.8409


  New best val_lev=6.8409


ep10 it20 loss=1.3680 elapsed=0.9s


  [val] 44 vids evaluated in 0.00m


Epoch 10: train_loss=1.3534 val_lev=7.0227


ep11 it20 loss=1.3136 elapsed=1.1s


  [val] 44 vids evaluated in 0.00m


Epoch 11: train_loss=1.3133 val_lev=6.7955


  New best val_lev=6.7955


ep12 it20 loss=1.2645 elapsed=0.9s


  [val] 44 vids evaluated in 0.00m


Epoch 12: train_loss=1.2627 val_lev=6.2955


  New best val_lev=6.2955


ep13 it20 loss=1.2082 elapsed=1.0s


  [val] 44 vids evaluated in 0.00m


Epoch 13: train_loss=1.2088 val_lev=5.8636


  New best val_lev=5.8636


ep14 it20 loss=1.1473 elapsed=0.9s


  [val] 44 vids evaluated in 0.00m


Epoch 14: train_loss=1.1720 val_lev=5.3182


  New best val_lev=5.3182


ep15 it20 loss=1.0832 elapsed=0.8s


  [val] 44 vids evaluated in 0.00m


Epoch 15: train_loss=1.1719 val_lev=5.4545


ep16 it20 loss=1.0814 elapsed=0.8s


  [val] 44 vids evaluated in 0.00m


Epoch 16: train_loss=1.0737 val_lev=5.1818


  New best val_lev=5.1818


ep17 it20 loss=1.0425 elapsed=0.8s


  [val] 44 vids evaluated in 0.00m


Epoch 17: train_loss=1.0108 val_lev=5.8864


ep18 it20 loss=0.9959 elapsed=0.8s


  [val] 44 vids evaluated in 0.00m


Epoch 18: train_loss=1.0098 val_lev=5.4091


ep19 it20 loss=0.9714 elapsed=0.8s


  [val] 44 vids evaluated in 0.00m


Epoch 19: train_loss=0.9486 val_lev=4.8409


  New best val_lev=4.8409


ep20 it20 loss=0.9346 elapsed=0.9s


  [val] 44 vids evaluated in 0.00m


Epoch 20: train_loss=0.9092 val_lev=5.0227


Saved model_ce_tcn_v2.pth; best val_lev= 4.840909090909091
=== Inference TEST with CE model (peak-time sort) -> submission.csv ===


[infer CE] 10/95 elapsed=0.0m


[infer CE] 20/95 elapsed=0.0m


[infer CE] 30/95 elapsed=0.0m


[infer CE] 40/95 elapsed=0.0m


[infer CE] 50/95 elapsed=0.0m


[infer CE] 60/95 elapsed=0.0m


[infer CE] 70/95 elapsed=0.0m


[infer CE] 80/95 elapsed=0.1m


[infer CE] 90/95 elapsed=0.1m


[infer CE] 95/95 elapsed=0.1m


Wrote submission.csv; head:
     Id                                           Sequence
0  300  5 9 7 1 2 18 3 8 4 20 13 12 14 11 6 16 19 15 1...
1  301  10 1 5 4 6 2 11 14 13 19 15 7 9 20 12 8 18 3 1...
2  302  1 17 16 12 5 19 13 15 20 18 11 3 4 6 8 14 10 9...
3  303  13 4 3 10 14 5 19 15 20 17 1 11 16 8 18 7 12 6...
4  304  8 1 7 12 18 13 9 2 11 3 20 19 15 5 14 6 17 16 ...
=== CE pipeline done ===


In [29]:
import math, time, random, gc
from pathlib import Path
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F

print("=== Refined decoder: smoothing + duration prior + NMS + CoM refinement ===", flush=True)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

feat_tr_dir = Path('features3d_v2')/'train'
feat_te_dir = Path('features3d_v2')/'test'
lab_tr_dir  = Path('labels3d_v2')/'train'

# Rebuild the same val split as Cell 18
train_df = pd.read_csv('training.csv')
all_ids = train_df['Id'].astype(int).tolist()
random.seed(42); np.random.seed(42)
random.shuffle(all_ids)
val_ratio = 0.15
val_n = max(30, int(len(all_ids)*val_ratio))
val_ids = all_ids[:val_n]
tr_ids = all_ids[val_n:]
print(f"Train videos: {len(tr_ids)}, Val videos: {len(val_ids)}")

D_in = np.load(next(iter((Path('features3d_v2')/'train').glob('*.npz'))))['X'].shape[1]

class DilatedTCN(nn.Module):
    def __init__(self, d_in, channels=96, layers=10, num_classes=21, dropout=0.3):
        super().__init__()
        self.inp = nn.Conv1d(d_in, channels, kernel_size=1)
        blocks = []
        dil = 1
        for _ in range(layers):
            blocks.append(nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=3, padding=dil, dilation=dil),
                nn.GroupNorm(num_groups=8, num_channels=channels),
                nn.ReLU(inplace=True),
                nn.Dropout(dropout),
                nn.Conv1d(channels, channels, kernel_size=1),
                nn.GroupNorm(num_groups=8, num_channels=channels),
                nn.ReLU(inplace=True),
            ))
            dil = min(dil*2, 512)
        self.blocks = nn.ModuleList(blocks)
        self.head = nn.Conv1d(channels, num_classes, kernel_size=1)
    def forward(self, x_b_t_d):
        x = x_b_t_d.transpose(1,2)
        h = self.inp(x)
        for blk in self.blocks:
            res = h
            h = blk(h)
            h = h + res
        logits = self.head(h)
        return logits.transpose(1,2)

def load_feat(sample_id: int, split='train', max_T=1800):
    p = (feat_tr_dir if split=='train' else feat_te_dir)/f"{sample_id}.npz"
    d = np.load(p)
    X = d['X'].astype(np.float32)
    if X.shape[0] > max_T: X = X[:max_T]
    return X

def compute_class_median_durations():
    # Per-class durations from labels (frames at 20 fps)
    dur_by_c = {c: [] for c in range(1,21)}
    ids = train_df['Id'].astype(int).tolist()
    t0=time.time()
    for i, sid in enumerate(ids, 1):
        y = np.load(lab_tr_dir/f"{sid}.npy").astype(np.int16)
        # durations: since exactly one occurrence per class, just count frames
        for c in range(1,21):
            cnt = int((y==c).sum())
            if cnt > 0: dur_by_c[c].append(cnt)
        if (i%50)==0 or i==len(ids):
            print(f"  [dur] {i}/{len(ids)} elapsed={(time.time()-t0)/60:.1f}m", flush=True)
    med = {}
    for c in range(1,21):
        if len(dur_by_c[c])==0:
            med[c] = 13  # sensible default
        else:
            med[c] = int(np.median(dur_by_c[c]))
        med[c] = int(np.clip(med[c], 9, 25))
    return med

def avg_pool_probs(p_t_c: torch.Tensor, k: int = 13) -> torch.Tensor:
    # p_t_c: (T,C)
    x = p_t_c.unsqueeze(0).transpose(1,2)  # (1,C,T)
    y = F.avg_pool1d(x, kernel_size=k, stride=1, padding=k//2)
    return y.transpose(1,2).squeeze(0)

def duration_integral(p_t_c: torch.Tensor, k: int) -> torch.Tensor:
    # Convolve per-class probs with box kernel of size k
    T,C = p_t_c.shape
    x = p_t_c.unsqueeze(0).transpose(1,2)  # (1,C,T)
    weight = torch.ones(C, 1, k, device=p_t_c.device, dtype=p_t_c.dtype) / float(k)
    y = F.conv1d(x, weight, padding=k//2, groups=C)  # (1,C,T)
    return y.transpose(1,2).squeeze(0)  # (T,C)

def nms1d(scores: np.ndarray, radius: int = 12) -> int:
    # Return top-1 peak index with simple NMS (since we only need one peak per class)
    t0 = int(np.argmax(scores))
    return t0

def refine_com(p: torch.Tensor, t_star: int, w: int = 5) -> float:
    # center-of-mass refinement within ±w
    T = p.shape[0]
    a = max(0, t_star - w); b = min(T-1, t_star + w)
    idx = torch.arange(a, b+1, device=p.device, dtype=p.dtype)
    seg = p[a:b+1]
    s = seg.sum() + 1e-8
    com = (idx * seg).sum() / s
    return float(com.item())

def decode_video_probs(p_t_c: torch.Tensor, med_k: dict, pool_k: int = 13, nms_radius: int = 12) -> list:
    # p_t_c: (T,C), C=21 with bg at 0
    # 1) mild smoothing
    p_s = avg_pool_probs(p_t_c, k=pool_k)
    # 2) duration prior via per-class integral
    T,C = p_s.shape
    # build per-class duration integrals
    integ = torch.empty_like(p_s)
    for c in range(C):
        k = med_k.get(c, 13) if c!=0 else 13
        integ[:, c] = duration_integral(p_s[:, c:c+1], k=k).squeeze(1)
    peaks = []
    for c in range(1,21):
        s = integ[:, c].cpu().numpy()
        t_star = nms1d(s, radius=nms_radius)
        t_ref = refine_com(p_s[:, c], t_star, w=5)
        peaks.append((c, t_ref, float(integ[int(round(t_ref)) if 0<=int(round(t_ref))<T else t_star, c].item())))
    # 3) sort by refined time; break ties by higher integral score
    peaks.sort(key=lambda x: (x[1], -x[2]))
    seq = [c for c,_,_ in peaks]
    # fallback if needed
    if len(set(seq))<20:
        meanp = p_s[:,1:21].mean(dim=0)
        order = torch.argsort(meanp, descending=True).cpu().numpy().tolist()
        seq = [int(i+1) for i in order[:20]]
    return seq

print("Computing per-class median durations...", flush=True)
med_k = compute_class_median_durations()
print({k: med_k[k] for k in list(med_k.keys())[:5]}, "...", flush=True)

# Load CE model
model = DilatedTCN(d_in=D_in, channels=96, layers=10, num_classes=21, dropout=0.3).to(device)
state = torch.load('model_ce_tcn_v2.pth', map_location=device)
model.load_state_dict(state); model.eval()

def eval_val_refined():
    tot=0; cnt=0; t0=time.time()
    with torch.no_grad():
        for i, sid in enumerate(val_ids, 1):
            X = load_feat(sid, split='train', max_T=1800)
            xb = torch.from_numpy(X).unsqueeze(0).to(device)  # (1,T,D)
            logits = model(xb)[0]  # (T,C)
            probs = logits.softmax(dim=-1)
            seq = decode_video_probs(probs, med_k, pool_k=13, nms_radius=12)
            tgt = [int(x) for x in str(train_df.loc[train_df['Id']==sid, 'Sequence'].iloc[0]).split()]
            # Levenshtein
            n,m=len(seq),len(tgt)
            if n==0:
                lev = m
            else:
                dp = list(range(m+1))
                for ii in range(1, n+1):
                    prev = dp[0]; dp[0] = ii; ai = seq[ii-1]
                    for jj in range(1, m+1):
                        tmp = dp[jj]
                        dp[jj] = min(dp[jj]+1, dp[jj-1]+1, prev + (0 if ai==tgt[jj-1] else 1))
                        prev = tmp
                lev = dp[m]
            tot += lev; cnt += 1
            if (i%10)==0 or i==len(val_ids):
                print(f"  [val refined] {i}/{len(val_ids)} elapsed={(time.time()-t0)/60:.2f}m", flush=True)
    return tot/max(cnt,1)

val_lev = eval_val_refined()
print(f"Refined decoder val_lev={val_lev:.4f} (normalized ~{val_lev/20:.5f})", flush=True)

print("=== Inference TEST with refined decoder -> submission.csv ===", flush=True)
test_ids = pd.read_csv('test.csv')['Id'].astype(int).tolist()
rows=[]; t0=time.time()
with torch.no_grad():
    for i, sid in enumerate(test_ids, 1):
        X = load_feat(sid, split='test', max_T=1800)
        xb = torch.from_numpy(X).unsqueeze(0).to(device)
        logits = model(xb)[0]
        probs = logits.softmax(dim=-1)
        seq = decode_video_probs(probs, med_k, pool_k=13, nms_radius=12)
        rows.append({'Id': int(sid), 'Sequence': ' '.join(str(x) for x in seq)})
        if (i%10)==0 or i==len(test_ids):
            print(f"  [test refined] {i}/{len(test_ids)} elapsed={(time.time()-t0)/60:.1f}m", flush=True)
sub = pd.DataFrame(rows, columns=['Id','Sequence'])
sub.to_csv('submission.csv', index=False)
print('Wrote submission.csv; head:\n', sub.head())
print('=== Refined decoding complete ===')

=== Refined decoder: smoothing + duration prior + NMS + CoM refinement ===


Train videos: 253, Val videos: 44
Computing per-class median durations...


  [dur] 50/297 elapsed=0.0m


  [dur] 100/297 elapsed=0.0m


  [dur] 150/297 elapsed=0.0m


  [dur] 200/297 elapsed=0.0m


  [dur] 250/297 elapsed=0.0m


  [dur] 297/297 elapsed=0.0m


{1: 25, 2: 25, 3: 25, 4: 25, 5: 25} ...


  [val refined] 10/44 elapsed=0.00m


  state = torch.load('model_ce_tcn_v2.pth', map_location=device)


  [val refined] 20/44 elapsed=0.00m


  [val refined] 30/44 elapsed=0.01m


  [val refined] 40/44 elapsed=0.01m


  [val refined] 44/44 elapsed=0.01m


Refined decoder val_lev=4.5000 (normalized ~0.22500)


=== Inference TEST with refined decoder -> submission.csv ===


  [test refined] 10/95 elapsed=0.0m


  [test refined] 20/95 elapsed=0.0m


  [test refined] 30/95 elapsed=0.0m


  [test refined] 40/95 elapsed=0.0m


  [test refined] 50/95 elapsed=0.0m


  [test refined] 60/95 elapsed=0.0m


  [test refined] 70/95 elapsed=0.0m


  [test refined] 80/95 elapsed=0.0m


  [test refined] 90/95 elapsed=0.0m


  [test refined] 95/95 elapsed=0.0m


Wrote submission.csv; head:
     Id                                           Sequence
0  300  5 9 7 1 2 18 3 8 4 20 13 12 15 14 11 6 16 19 1...
1  301  10 1 5 4 6 2 11 14 13 19 15 7 9 20 12 8 18 3 1...
2  302  1 17 16 12 5 19 13 15 20 18 11 3 4 6 8 14 10 9...
3  303  13 4 3 10 14 5 19 15 20 17 1 11 16 8 18 7 12 6...
4  304  8 1 7 12 18 13 9 2 11 3 20 19 15 5 14 6 17 16 ...
=== Refined decoding complete ===


In [25]:
import math, time, random, gc, os
from pathlib import Path
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

print("=== MS-TCN multi-stage CE training with temporal smoothing and refined decoding ===", flush=True)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if torch.cuda.is_available():
    torch.backends.cudnn.benchmark = True

feat_tr_dir = Path('features3d_v2')/'train'
feat_te_dir = Path('features3d_v2')/'test'
lab_tr_dir  = Path('labels3d_v2')/'train'

# Deterministic split matching prior cells
train_df = pd.read_csv('training.csv')
all_ids = train_df['Id'].astype(int).tolist()
random.seed(42); np.random.seed(42)
random.shuffle(all_ids)
val_ratio = 0.15
val_n = max(30, int(len(all_ids)*val_ratio))
val_ids = all_ids[:val_n]
tr_ids = all_ids[val_n:]
print(f"Train videos: {len(tr_ids)}, Val videos: {len(val_ids)}")

def load_feat(sample_id: int, split='train', max_T=1800):
    p = (feat_tr_dir if split=='train' else feat_te_dir)/f"{sample_id}.npz"
    d = np.load(p)
    X = d['X'].astype(np.float32)
    if X.shape[0] > max_T: X = X[:max_T]
    return X

def load_lab(sample_id: int, max_T=1800):
    y = np.load(lab_tr_dir/f"{sample_id}.npy").astype(np.int64)
    if y.shape[0] > max_T: y = y[:max_T]
    return y

class FrameDataset(Dataset):
    def __init__(self, ids, max_T=1800):
        self.ids = list(ids); self.max_T=max_T
    def __len__(self): return len(self.ids)
    def __getitem__(self, idx):
        sid = int(self.ids[idx])
        X = load_feat(sid, 'train', self.max_T)
        y = load_lab(sid, self.max_T)
        T = min(len(X), len(y))
        X = X[:T]; y = y[:T]
        return torch.from_numpy(X), torch.from_numpy(y), sid

def collate(batch):
    xs, ys, sids = zip(*batch)
    Tm = max(x.shape[0] for x in xs); D = xs[0].shape[1]; B=len(xs)
    xb = torch.zeros(B, Tm, D, dtype=torch.float32)
    yb = torch.zeros(B, Tm, dtype=torch.long)
    mask = torch.zeros(B, Tm, dtype=torch.bool)
    for i,(x,y) in enumerate(zip(xs,ys)):
        T=len(x); xb[i,:T]=x; yb[i,:T]=y; mask[i,:T]=True
    return xb, yb, mask, list(sids)

train_loader = DataLoader(FrameDataset(tr_ids, 1800), batch_size=12, shuffle=True, num_workers=2, pin_memory=True, collate_fn=collate)
val_loader   = DataLoader(FrameDataset(val_ids, 1800), batch_size=12, shuffle=False, num_workers=2, pin_memory=True, collate_fn=collate)

D_in = np.load(next(iter((feat_tr_dir).glob('*.npz'))))['X'].shape[1]

class DilatedResBlock(nn.Module):
    def __init__(self, ch, dilation, drop=0.3, groups=8, k=3):
        super().__init__()
        pad = dilation
        self.conv1 = nn.Conv1d(ch, ch, kernel_size=k, padding=pad, dilation=dilation)
        self.gn1 = nn.GroupNorm(groups, ch)
        self.conv2 = nn.Conv1d(ch, ch, kernel_size=1)
        self.gn2 = nn.GroupNorm(groups, ch)
        self.drop = nn.Dropout(drop)
    def forward(self, x):
        h = self.conv1(x); h = self.gn1(h); h = F.relu(h, inplace=True); h = self.drop(h)
        h = self.conv2(h); h = self.gn2(h); h = F.relu(h, inplace=True)
        return x + h

class Stage(nn.Module):
    def __init__(self, in_ch, ch=96, layers=10, drop=0.3):
        super().__init__()
        self.inp = nn.Conv1d(in_ch, ch, kernel_size=1)
        blocks = []; dil=1
        for _ in range(layers):
            blocks.append(DilatedResBlock(ch, dil, drop=drop))
            dil = min(dil*2, 512)
        self.blocks = nn.ModuleList(blocks)
        self.head = nn.Conv1d(ch, 21, kernel_size=1)
    def forward(self, x):  # x: (B,Fin,T)
        h = self.inp(x)
        for blk in self.blocks:
            h = blk(h)
        logits = self.head(h)  # (B,21,T)
        return logits

class MSTCN(nn.Module):
    def __init__(self, d_in, stages=4, ch=96, layers=10, drop=0.3, concat_feat=True):
        super().__init__()
        self.concat_feat = concat_feat
        self.stages = nn.ModuleList()
        fin = d_in
        self.input_proj = nn.Conv1d(d_in, d_in, kernel_size=1)  # identity-capable
        for s in range(stages):
            in_ch = (fin + 21) if (s>0 and concat_feat) else (21 if s>0 else fin)
            self.stages.append(Stage(in_ch, ch=ch, layers=layers, drop=drop))
    def forward(self, x_b_t_d):  # (B,T,D)
        x = x_b_t_d.transpose(1,2)  # (B,D,T)
        x = self.input_proj(x)
        logits_list = []
        prev_logits = None
        for i, st in enumerate(self.stages):
            if i == 0:
                inp = x
            else:
                if self.concat_feat:
                    inp = torch.cat([x, prev_logits], dim=1)
                else:
                    inp = prev_logits
            l = st(inp)  # (B,21,T)
            prev_logits = l
            logits_list.append(l.transpose(1,2))  # (B,T,21)
        return logits_list  # list of (B,T,21)

# Losses
def ce_ignore_bg_with_ls(logits, targets, mask, label_smoothing=0.05):
    # logits: (B,T,C), targets: (B,T), mask: (B,T) True valid; ignore targets==0 (bg)
    B,T,C = logits.shape
    valid = mask & (targets > 0)
    if valid.sum() == 0:
        return logits.new_zeros([])
    # gather valid logits and targets
    lg = logits[valid]  # (N,C)
    y = targets[valid]  # (N,)
    if label_smoothing > 0:
        with torch.no_grad():
            true_dist = torch.zeros_like(lg).scatter_(1, y.unsqueeze(1), 1.0)
            true_dist = true_dist * (1 - label_smoothing) + label_smoothing / (lg.size(1) - 1)
        logp = F.log_softmax(lg, dim=-1)
        return F.kl_div(logp, true_dist, reduction='batchmean')
    else:
        return F.cross_entropy(lg, y, reduction='mean')

def temporal_mse(probs, mask):
    # probs: (B,T,C), encourage p_t ~ p_{t-1} on valid frames, ignore bg explicitly not needed
    diff = (probs[:,1:,:] - probs[:,:-1,:])**2  # (B,T-1,C)
    m = (mask[:,1:] & mask[:,:-1]).float().unsqueeze(-1)
    num = (diff * m).sum()
    den = m.sum().clamp_min(1.0)
    return num / den

# Augmentations
def time_mask(x, mask, n=2, wmin=8, wmax=16, p=0.5):
    if random.random() > p: return x, mask
    B,T,D = x.shape
    for _ in range(n):
        w = random.randint(wmin, wmax)
        t0 = random.randint(0, max(0, T-w))
        x[:, t0:t0+w, :] = 0.0
        mask[:, t0:t0+w] = mask[:, t0:t0+w]  # mask unchanged (still valid frames)
    return x, mask

def frame_drop(x, y, mask, p=0.2):
    # randomly drop a small number of frames per sample, keep length by duplicating neighbors
    if random.random() > p: return x, y, mask
    B,T,D = x.shape
    for b in range(B):
        valid_T = int(mask[b].sum().item())
        if valid_T < 3: continue
        drop_t = random.randint(1, max(1, int(0.02*valid_T)))
        for _ in range(drop_t):
            t = random.randint(1, valid_T-2)
            x[b,t] = (x[b,t-1] + x[b,t+1]) * 0.5
            y[b,t] = y[b,t-1]
    return x, y, mask

def jitter(x, sigma=0.01):
    return x + torch.randn_like(x)*sigma

# Refined decoder utilities (reuse from cell 19, inlined for safety)
def avg_pool_probs(p_t_c: torch.Tensor, k: int = 13) -> torch.Tensor:
    x = p_t_c.unsqueeze(0).transpose(1,2)
    y = F.avg_pool1d(x, kernel_size=k, stride=1, padding=k//2)
    return y.transpose(1,2).squeeze(0)

def duration_integral_single(p_t: torch.Tensor, k: int) -> torch.Tensor:
    x = p_t.view(1,1,-1)
    w = torch.ones(1,1,k, device=p_t.device, dtype=p_t.dtype) / float(k)
    y = F.conv1d(x, w, padding=k//2)
    return y.view(-1)

def refine_com(p: torch.Tensor, t_star: int, w: int = 5) -> float:
    T = p.shape[0]
    a = max(0, t_star - w); b = min(T-1, t_star + w)
    idx = torch.arange(a, b+1, device=p.device, dtype=p.dtype)
    seg = p[a:b+1]
    s = seg.sum() + 1e-8
    com = (idx * seg).sum() / s
    return float(com.item())

def compute_class_median_durations():
    dur_by_c = {c: [] for c in range(1,21)}
    ids = train_df['Id'].astype(int).tolist()
    for sid in ids:
        y = np.load(lab_tr_dir/f"{sid}.npy").astype(np.int16)
        for c in range(1,21):
            cnt = int((y==c).sum());
            if cnt>0: dur_by_c[c].append(cnt)
    med = {}
    for c in range(1,21):
        med[c] = int(np.clip(np.median(dur_by_c[c]) if len(dur_by_c[c])>0 else 13, 9, 25))
    return med

med_k = compute_class_median_durations()

def decode_video_probs_refined(p_t_c: torch.Tensor, pool_k=13):
    # p_t_c: (T,C), C=21; class 0 bg
    p_s = avg_pool_probs(p_t_c, k=pool_k)
    T,C = p_s.shape
    scores = torch.empty_like(p_s)
    for c in range(C):
        k = med_k.get(c, 13) if c!=0 else 13
        if c==0:
            scores[:,c] = p_s[:,c]
        else:
            scores[:,c] = duration_integral_single(p_s[:,c], k=k)
    peaks = []
    for c in range(1,21):
        t_star = int(torch.argmax(scores[:,c]).item())
        t_ref = refine_com(p_s[:,c], t_star, w=5)
        t_idx = int(round(t_ref));
        t_idx = min(max(t_idx, 0), T-1)
        peaks.append((c, t_ref, float(scores[t_idx, c].item())))
    peaks.sort(key=lambda x: (x[1], -x[2]))
    return [c for c,_,_ in peaks]

id2seq = {int(r.Id): [int(x) for x in str(r.Sequence).strip().split()] for _, r in train_df.iterrows()}

def levenshtein(a, b):
    n, m = len(a), len(b)
    if n==0: return m
    if m==0: return n
    dp = list(range(m+1))
    for i in range(1, n+1):
        prev = dp[0]; dp[0] = i; ai = a[i-1]
        for j in range(1, m+1):
            tmp = dp[j]
            dp[j] = min(dp[j]+1, dp[j-1]+1, prev + (0 if ai==b[j-1] else 1))
            prev = tmp
    return dp[m]

def eval_val(model):
    model.eval()
    tot=0; cnt=0; t0=time.time()
    with torch.no_grad():
        for xb, yb, mask, sids in val_loader:
            xb = xb.to(device); mask = mask.to(device)
            logits_list = model(xb)  # list of (B,T,C)
            # use final stage probs
            probs = logits_list[-1].softmax(dim=-1)  # (B,T,C)
            for b, sid in enumerate(sids):
                T = int(mask[b].sum().item())
                p = probs[b,:T,:]
                seq = decode_video_probs_refined(p, pool_k=13)
                tgt = id2seq[int(sid)]
                tot += levenshtein(seq, tgt); cnt += 1
    print(f"  [val decode] {cnt} vids in {(time.time()-t0)/60:.2f}m", flush=True)
    return tot/max(cnt,1)

def train_seed(seed=0, epochs=18, patience=3, ch=96, stages=4, layers=10, ls=0.05, lambda_t=0.2, concat_feat=True):
    random.seed(seed); np.random.seed(seed); torch.manual_seed(seed)
    model = MSTCN(d_in=D_in, stages=stages, ch=ch, layers=layers, drop=0.3, concat_feat=concat_feat).to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=15)
    best = math.inf; best_state=None; bad=0
    stage_w = [0.3,0.3,0.3,1.0][:stages]
    for ep in range(1, epochs+1):
        model.train(); t0=time.time(); nb=0; tot_loss=0.0
        for it, (xb, yb, mask, sids) in enumerate(train_loader):
            xb = xb.to(device); yb = yb.to(device); mask = mask.to(device)
            # augment
            xb, mask = time_mask(xb, mask, n=2, wmin=8, wmax=16, p=0.5)
            xb, yb, mask = frame_drop(xb, yb, mask, p=0.2)
            xb = jitter(xb, sigma=0.01)
            opt.zero_grad(set_to_none=True)
            logits_list = model(xb)  # list len=stages of (B,T,21)
            loss_ce = 0.0
            probs_last = None
            for s, lg in enumerate(logits_list):
                loss_ce = loss_ce + stage_w[min(s, len(stage_w)-1)] * ce_ignore_bg_with_ls(lg, yb, mask, label_smoothing=ls)
                probs_last = lg.softmax(dim=-1)
            loss_t = temporal_mse(probs_last, mask) * lambda_t
            loss = loss_ce + loss_t
            loss.backward()
            nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            opt.step()
            tot_loss += float(loss.item()); nb += 1
            if (it+1)%20==0:
                print(f"ep{ep} it{it+1} loss={tot_loss/nb:.4f} elapsed={time.time()-t0:.1f}s", flush=True)
        sched.step()
        val_lev = eval_val(model)
        print(f"Seed{seed} Epoch {ep}: train_loss={tot_loss/max(nb,1):.4f} val_lev={val_lev:.4f} lr={sched.get_last_lr()[0]:.6f}", flush=True)
        if val_lev < best - 1e-4:
            best = val_lev; best_state = {k:v.detach().cpu() for k,v in model.state_dict().items()}; bad=0
            print(f"  New best (seed {seed}) val_lev={best:.4f}", flush=True)
        else:
            bad += 1
            if bad >= patience:
                print("Early stopping.", flush=True); break
    if best_state is not None:
        model.load_state_dict(best_state)
    out_path = f"model_mstcn_s{seed}.pth"
    torch.save(model.state_dict(), out_path)
    print(f"Saved {out_path}; best val_lev={best:.4f}")
    return out_path, best

# Train 3 seeds sequentially
seeds = [0,1,2]
ckpts = []; scores=[]
for s in seeds:
    p, sc = train_seed(seed=s, epochs=18, patience=3, ch=96, stages=4, layers=10, ls=0.05, lambda_t=0.2, concat_feat=True)
    ckpts.append(p); scores.append(sc)
print("Seed scores:", list(zip(seeds, scores)))

# Optionally, ensemble on VAL to verify gain and then run TEST ensemble
def ensemble_val_and_test(ckpt_paths):
    models = []
    for p in ckpt_paths:
        m = MSTCN(d_in=D_in, stages=4, ch=96, layers=10, drop=0.3, concat_feat=True).to(device)
        m.load_state_dict(torch.load(p, map_location=device)); m.eval(); models.append(m)
    # VAL
    tot=0; cnt=0; t0=time.time()
    with torch.no_grad():
        for xb, yb, mask, sids in val_loader:
            xb = xb.to(device); mask = mask.to(device)
            probs_ens = None
            for m in models:
                logits = m(xb)[-1]  # (B,T,C)
                p = logits.softmax(dim=-1)
                probs_ens = p if probs_ens is None else (probs_ens + p)
            probs_ens = probs_ens / len(models)
            for b, sid in enumerate(sids):
                T = int(mask[b].sum().item())
                seq = decode_video_probs_refined(probs_ens[b,:T,:], pool_k=13)
                tgt = id2seq[int(sid)]
                tot += levenshtein(seq, tgt); cnt += 1
    val_lev = tot/max(cnt,1)
    print(f"Ensemble VAL Levenshtein={val_lev:.4f} (norm ~{val_lev/20:.5f}), evaluated {cnt} vids in {(time.time()-t0)/60:.2f}m")
    # TEST inference and submission
    test_ids = pd.read_csv('test.csv')['Id'].astype(int).tolist()
    rows=[]; t0=time.time()
    with torch.no_grad():
        for i, sid in enumerate(test_ids, 1):
            X = load_feat(int(sid), split='test', max_T=1800)
            xb = torch.from_numpy(X).unsqueeze(0).to(device)
            probs_ens = None
            for m in models:
                logits = m(xb)[-1][0]  # (T,C)
                p = logits.softmax(dim=-1)
                probs_ens = p if probs_ens is None else (probs_ens + p)
            probs_ens = probs_ens / len(models)
            seq = decode_video_probs_refined(probs_ens, pool_k=13)
            rows.append({'Id': int(sid), 'Sequence': ' '.join(str(x) for x in seq)})
            if (i%10)==0 or i==len(test_ids):
                print(f"  [test ens] {i}/{len(test_ids)} elapsed={(time.time()-t0)/60:.1f}m", flush=True)
    sub = pd.DataFrame(rows, columns=['Id','Sequence'])
    sub.to_csv('submission.csv', index=False)
    print('Wrote submission.csv; head:\n', sub.head())

# After training completes, run ensemble_val_and_test(ckpts) in a subsequent execution if desired.
print("=== MS-TCN setup complete; training launched for seeds ===")

=== MS-TCN multi-stage CE training with temporal smoothing and refined decoding ===


Train videos: 253, Val videos: 44


ep1 it20 loss=5.8197 elapsed=3.4s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 1: train_loss=5.7760 val_lev=17.7273 lr=0.000297


  New best (seed 0) val_lev=17.7273


ep2 it20 loss=4.8505 elapsed=2.4s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 2: train_loss=4.8603 val_lev=17.1364 lr=0.000287


  New best (seed 0) val_lev=17.1364


ep3 it20 loss=4.4697 elapsed=1.8s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 3: train_loss=4.4398 val_lev=15.5909 lr=0.000271


  New best (seed 0) val_lev=15.5909


ep4 it20 loss=4.0490 elapsed=2.2s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 4: train_loss=4.0128 val_lev=13.7727 lr=0.000250


  New best (seed 0) val_lev=13.7727


ep5 it20 loss=3.6743 elapsed=2.1s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 5: train_loss=3.6420 val_lev=12.4318 lr=0.000225


  New best (seed 0) val_lev=12.4318


ep6 it20 loss=3.3811 elapsed=1.6s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 6: train_loss=3.4042 val_lev=11.5000 lr=0.000196


  New best (seed 0) val_lev=11.5000


ep7 it20 loss=3.1541 elapsed=2.0s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 7: train_loss=3.1402 val_lev=11.4318 lr=0.000166


  New best (seed 0) val_lev=11.4318


ep8 it20 loss=3.0225 elapsed=1.8s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 8: train_loss=2.9932 val_lev=10.3182 lr=0.000134


  New best (seed 0) val_lev=10.3182


ep9 it20 loss=2.8482 elapsed=1.7s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 9: train_loss=2.8160 val_lev=9.3182 lr=0.000104


  New best (seed 0) val_lev=9.3182


ep10 it20 loss=2.7327 elapsed=1.7s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 10: train_loss=2.7251 val_lev=9.2500 lr=0.000075


  New best (seed 0) val_lev=9.2500


ep11 it20 loss=2.6681 elapsed=1.5s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 11: train_loss=2.6431 val_lev=9.0455 lr=0.000050


  New best (seed 0) val_lev=9.0455


ep12 it20 loss=2.6100 elapsed=1.5s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 12: train_loss=2.6019 val_lev=8.9545 lr=0.000029


  New best (seed 0) val_lev=8.9545


ep13 it20 loss=2.5490 elapsed=1.4s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 13: train_loss=2.5416 val_lev=8.9318 lr=0.000013


  New best (seed 0) val_lev=8.9318


ep14 it20 loss=2.5104 elapsed=1.9s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 14: train_loss=2.5212 val_lev=8.7045 lr=0.000003


  New best (seed 0) val_lev=8.7045


ep15 it20 loss=2.5166 elapsed=1.4s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 15: train_loss=2.5446 val_lev=8.7727 lr=0.000000


ep16 it20 loss=2.5043 elapsed=1.4s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 16: train_loss=2.5064 val_lev=8.7727 lr=0.000003


ep17 it20 loss=2.5038 elapsed=1.4s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 17: train_loss=2.5036 val_lev=8.6364 lr=0.000013


  New best (seed 0) val_lev=8.6364


ep18 it20 loss=2.4864 elapsed=1.7s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 18: train_loss=2.4934 val_lev=8.4091 lr=0.000029


  New best (seed 0) val_lev=8.4091


Saved model_mstcn_s0.pth; best val_lev=8.4091


ep1 it20 loss=6.0162 elapsed=1.5s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 1: train_loss=5.9412 val_lev=17.9773 lr=0.000297


  New best (seed 1) val_lev=17.9773


ep2 it20 loss=4.8995 elapsed=1.3s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 2: train_loss=4.8880 val_lev=16.5000 lr=0.000287


  New best (seed 1) val_lev=16.5000


ep3 it20 loss=4.3934 elapsed=1.5s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 3: train_loss=4.3701 val_lev=15.2045 lr=0.000271


  New best (seed 1) val_lev=15.2045


ep4 it20 loss=3.9950 elapsed=1.5s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 4: train_loss=3.9589 val_lev=13.7273 lr=0.000250


  New best (seed 1) val_lev=13.7273


ep5 it20 loss=3.6042 elapsed=1.3s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 5: train_loss=3.5903 val_lev=12.5909 lr=0.000225


  New best (seed 1) val_lev=12.5909


ep6 it20 loss=3.3214 elapsed=1.4s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 6: train_loss=3.3323 val_lev=11.6591 lr=0.000196


  New best (seed 1) val_lev=11.6591


ep7 it20 loss=3.1208 elapsed=1.4s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 7: train_loss=3.1247 val_lev=11.0682 lr=0.000166


  New best (seed 1) val_lev=11.0682


ep8 it20 loss=2.9295 elapsed=1.8s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 8: train_loss=2.9262 val_lev=9.8636 lr=0.000134


  New best (seed 1) val_lev=9.8636


ep9 it20 loss=2.7755 elapsed=1.4s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 9: train_loss=2.7838 val_lev=9.9091 lr=0.000104


ep10 it20 loss=2.6603 elapsed=1.6s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 10: train_loss=2.6420 val_lev=9.4091 lr=0.000075


  New best (seed 1) val_lev=9.4091


ep11 it20 loss=2.5681 elapsed=1.5s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 11: train_loss=2.5553 val_lev=9.1818 lr=0.000050


  New best (seed 1) val_lev=9.1818


ep12 it20 loss=2.5154 elapsed=1.3s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 12: train_loss=2.4691 val_lev=8.7955 lr=0.000029


  New best (seed 1) val_lev=8.7955


ep13 it20 loss=2.4683 elapsed=1.3s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 13: train_loss=2.4512 val_lev=8.3182 lr=0.000013


  New best (seed 1) val_lev=8.3182


ep14 it20 loss=2.4317 elapsed=1.7s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 14: train_loss=2.4244 val_lev=8.5227 lr=0.000003


ep15 it20 loss=2.4105 elapsed=1.4s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 15: train_loss=2.4095 val_lev=8.3409 lr=0.000000


ep16 it20 loss=2.4181 elapsed=1.4s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 16: train_loss=2.3833 val_lev=8.3409 lr=0.000003


Early stopping.


Saved model_mstcn_s1.pth; best val_lev=8.3182


ep1 it20 loss=6.0658 elapsed=1.4s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 1: train_loss=5.9872 val_lev=17.8864 lr=0.000297


  New best (seed 2) val_lev=17.8864


ep2 it20 loss=4.8765 elapsed=1.3s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 2: train_loss=4.8706 val_lev=16.2273 lr=0.000287


  New best (seed 2) val_lev=16.2273


ep3 it20 loss=4.4174 elapsed=1.3s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 3: train_loss=4.4294 val_lev=14.5000 lr=0.000271


  New best (seed 2) val_lev=14.5000


ep4 it20 loss=4.0161 elapsed=1.4s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 4: train_loss=4.0090 val_lev=13.0909 lr=0.000250


  New best (seed 2) val_lev=13.0909


ep5 it20 loss=3.6612 elapsed=1.6s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 5: train_loss=3.6524 val_lev=11.8182 lr=0.000225


  New best (seed 2) val_lev=11.8182


ep6 it20 loss=3.3170 elapsed=1.4s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 6: train_loss=3.3090 val_lev=10.4318 lr=0.000196


  New best (seed 2) val_lev=10.4318


ep7 it20 loss=3.0512 elapsed=1.5s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 7: train_loss=3.0383 val_lev=9.8636 lr=0.000166


  New best (seed 2) val_lev=9.8636


ep8 it20 loss=2.8271 elapsed=1.7s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 8: train_loss=2.8518 val_lev=9.1591 lr=0.000134


  New best (seed 2) val_lev=9.1591


ep9 it20 loss=2.6718 elapsed=1.4s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 9: train_loss=2.6690 val_lev=9.2045 lr=0.000104


ep10 it20 loss=2.5676 elapsed=1.6s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 10: train_loss=2.5371 val_lev=8.7500 lr=0.000075


  New best (seed 2) val_lev=8.7500


ep11 it20 loss=2.4912 elapsed=1.6s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 11: train_loss=2.6466 val_lev=8.6818 lr=0.000050


  New best (seed 2) val_lev=8.6818


ep12 it20 loss=2.4345 elapsed=1.4s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 12: train_loss=2.4175 val_lev=8.5227 lr=0.000029


  New best (seed 2) val_lev=8.5227


ep13 it20 loss=2.4044 elapsed=1.3s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 13: train_loss=2.4229 val_lev=8.4545 lr=0.000013


  New best (seed 2) val_lev=8.4545


ep14 it20 loss=2.3559 elapsed=1.5s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 14: train_loss=2.3273 val_lev=8.4318 lr=0.000003


  New best (seed 2) val_lev=8.4318


ep15 it20 loss=2.3471 elapsed=1.4s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 15: train_loss=2.3249 val_lev=8.2955 lr=0.000000


  New best (seed 2) val_lev=8.2955


ep16 it20 loss=2.3719 elapsed=1.7s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 16: train_loss=2.3625 val_lev=8.2955 lr=0.000003


ep17 it20 loss=2.3549 elapsed=1.3s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 17: train_loss=2.3444 val_lev=8.4318 lr=0.000013


ep18 it20 loss=2.3443 elapsed=1.4s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 18: train_loss=2.3099 val_lev=8.1591 lr=0.000029


  New best (seed 2) val_lev=8.1591


Saved model_mstcn_s2.pth; best val_lev=8.1591
Seed scores: [(0, 8.409090909090908), (1, 8.318181818181818), (2, 8.159090909090908)]
=== MS-TCN setup complete; training launched for seeds ===


In [26]:
print("=== Ensembling 3 MS-TCN seeds and generating submission ===", flush=True)
ckpts = ["model_mstcn_s0.pth", "model_mstcn_s1.pth", "model_mstcn_s2.pth"]
for p in ckpts:
    assert Path(p).exists(), f"Missing checkpoint {p}"
ensemble_val_and_test(ckpts)
print("=== Ensemble complete ===")

=== Ensembling 3 MS-TCN seeds and generating submission ===


  m.load_state_dict(torch.load(p, map_location=device)); m.eval(); models.append(m)


Ensemble VAL Levenshtein=7.7045 (norm ~0.38523), evaluated 44 vids in 0.01m


  [test ens] 10/95 elapsed=0.0m


  [test ens] 20/95 elapsed=0.0m


  [test ens] 30/95 elapsed=0.0m


  [test ens] 40/95 elapsed=0.0m


  [test ens] 50/95 elapsed=0.0m


  [test ens] 60/95 elapsed=0.0m


  [test ens] 70/95 elapsed=0.0m


  [test ens] 80/95 elapsed=0.1m


  [test ens] 90/95 elapsed=0.1m


  [test ens] 95/95 elapsed=0.1m


Wrote submission.csv; head:
     Id                                           Sequence
0  300  5 9 7 1 2 18 3 8 4 20 13 12 15 14 11 10 6 16 1...
1  301  15 12 3 1 5 4 6 2 10 11 20 7 13 19 9 8 18 14 1...
2  302  1 17 16 3 5 7 19 13 20 18 12 11 4 10 6 8 14 15...
3  303  13 18 12 4 3 11 15 16 19 20 17 10 5 8 7 1 6 2 ...
4  304  8 7 1 2 14 12 18 13 9 11 10 3 20 19 5 15 6 17 ...
=== Ensemble complete ===


In [28]:
import time
from pathlib import Path
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F

print("=== Ensemble CE-TCN (best) + MS-TCN (best seed) with refined decoder ===", flush=True)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

feat_tr_dir = Path('features3d_v2')/'train'
feat_te_dir = Path('features3d_v2')/'test'
lab_tr_dir  = Path('labels3d_v2')/'train'

# Recreate split and utilities from previous cells
train_df = pd.read_csv('training.csv')
all_ids = train_df['Id'].astype(int).tolist()
import random, numpy as np
random.seed(42); np.random.seed(42)
random.shuffle(all_ids)
val_ratio = 0.15
val_n = max(30, int(len(all_ids)*val_ratio))
val_ids = all_ids[:val_n]

def load_feat(sample_id: int, split='train', max_T=1800):
    p = (feat_tr_dir if split=='train' else feat_te_dir)/f"{sample_id}.npz"
    d = np.load(p)
    X = d['X'].astype(np.float32)
    return X[:max_T] if X.shape[0] > max_T else X

def compute_class_median_durations():
    dur_by_c = {c: [] for c in range(1,21)}
    ids = train_df['Id'].astype(int).tolist()
    for sid in ids:
        y = np.load(lab_tr_dir/f"{sid}.npy").astype(np.int16)
        for c in range(1,21):
            cnt = int((y==c).sum())
            if cnt>0: dur_by_c[c].append(cnt)
    med = {}; import numpy as _np
    for c in range(1,21):
        med[c] = int(_np.clip(_np.median(dur_by_c[c]) if len(dur_by_c[c])>0 else 13, 9, 25))
    return med

def avg_pool_probs(p_t_c: torch.Tensor, k: int = 13) -> torch.Tensor:
    x = p_t_c.unsqueeze(0).transpose(1,2)
    y = F.avg_pool1d(x, kernel_size=k, stride=1, padding=k//2)
    return y.transpose(1,2).squeeze(0)

def duration_integral_single(p_t: torch.Tensor, k: int) -> torch.Tensor:
    x = p_t.view(1,1,-1)
    w = torch.ones(1,1,k, device=p_t.device, dtype=p_t.dtype) / float(k)
    y = F.conv1d(x, w, padding=k//2)
    return y.view(-1)

def refine_com(p: torch.Tensor, t_star: int, w: int = 5) -> float:
    T = p.shape[0]
    a = max(0, t_star - w); b = min(T-1, t_star + w)
    idx = torch.arange(a, b+1, device=p.device, dtype=p.dtype)
    seg = p[a:b+1]
    s = seg.sum() + 1e-8
    com = (idx * seg).sum() / s
    return float(com.item())

med_k = compute_class_median_durations()

def decode_video_probs_refined(p_t_c: torch.Tensor, pool_k=13):
    p_s = avg_pool_probs(p_t_c, k=pool_k)
    T,C = p_s.shape
    scores = torch.empty_like(p_s)
    for c in range(C):
        k = med_k.get(c, 13) if c!=0 else 13
        if c==0: scores[:,c] = p_s[:,c]
        else:   scores[:,c] = duration_integral_single(p_s[:,c], k=k)
    peaks = []
    for c in range(1,21):
        t_star = int(torch.argmax(scores[:,c]).item())
        t_ref = refine_com(p_s[:,c], t_star, w=5)
        t_idx = int(round(t_ref)); t_idx = min(max(t_idx, 0), T-1)
        peaks.append((c, t_ref, float(scores[t_idx, c].item())))
    peaks.sort(key=lambda x: (x[1], -x[2]))
    return [c for c,_,_ in peaks]

# Model defs must match trained checkpoints
D_in = np.load(next(iter((feat_tr_dir).glob('*.npz'))))['X'].shape[1]

class DilatedTCN(nn.Module):
    def __init__(self, d_in, channels=96, layers=10, num_classes=21, dropout=0.3):
        super().__init__()
        self.inp = nn.Conv1d(d_in, channels, kernel_size=1)
        blocks = []; dil=1
        for _ in range(layers):
            blocks.append(nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=3, padding=dil, dilation=dil),
                nn.GroupNorm(num_groups=8, num_channels=channels),
                nn.ReLU(inplace=True),
                nn.Dropout(dropout),
                nn.Conv1d(channels, channels, kernel_size=1),
                nn.GroupNorm(num_groups=8, num_channels=channels),
                nn.ReLU(inplace=True),
            ))
            dil = min(dil*2, 512)
        self.blocks = nn.ModuleList(blocks)
        self.head = nn.Conv1d(channels, num_classes, kernel_size=1)
    def forward(self, x_b_t_d):
        x = x_b_t_d.transpose(1,2)
        h = self.inp(x)
        for blk in self.blocks:
            res = h; h = blk(h); h = h + res
        logits = self.head(h)
        return logits.transpose(1,2)

class DilatedResBlock(nn.Module):
    def __init__(self, ch, dilation, drop=0.3, groups=8, k=3):
        super().__init__()
        self.conv1 = nn.Conv1d(ch, ch, kernel_size=k, padding=dilation, dilation=dilation)
        self.gn1 = nn.GroupNorm(groups, ch)
        self.conv2 = nn.Conv1d(ch, ch, kernel_size=1)
        self.gn2 = nn.GroupNorm(groups, ch)
        self.drop = nn.Dropout(drop)
    def forward(self, x):
        h = self.conv1(x); h = self.gn1(h); h = F.relu(h, inplace=True); h = self.drop(h)
        h = self.conv2(h); h = self.gn2(h); h = F.relu(h, inplace=True)
        return x + h

class Stage(nn.Module):
    def __init__(self, in_ch, ch=96, layers=10, drop=0.3):
        super().__init__()
        self.inp = nn.Conv1d(in_ch, ch, kernel_size=1)
        blocks = []; dil=1
        for _ in range(layers):
            blocks.append(DilatedResBlock(ch, dil, drop=drop))
            dil = min(dil*2, 512)
        self.blocks = nn.ModuleList(blocks)
        self.head = nn.Conv1d(ch, 21, kernel_size=1)
    def forward(self, x):
        h = self.inp(x)
        for blk in self.blocks:
            h = blk(h)
        return self.head(h)

class MSTCN(nn.Module):
    def __init__(self, d_in, stages=4, ch=96, layers=10, drop=0.3, concat_feat=True):
        super().__init__()
        self.concat_feat = concat_feat
        self.stages = nn.ModuleList()
        xproj = nn.Conv1d(d_in, d_in, kernel_size=1)
        self.input_proj = xproj
        fin = d_in
        for s in range(stages):
            in_ch = (fin + 21) if (s>0 and concat_feat) else (21 if s>0 else fin)
            self.stages.append(Stage(in_ch, ch=ch, layers=layers, drop=drop))
    def forward(self, x_b_t_d):
        x = x_b_t_d.transpose(1,2)
        x = self.input_proj(x)
        logits_list = []; prev=None
        for i, st in enumerate(self.stages):
            inp = x if i==0 else (torch.cat([x, prev], dim=1) if self.concat_feat else prev)
            prev = st(inp)
            logits_list.append(prev.transpose(1,2))
        return logits_list

# Load best CE-TCN
ce_model = DilatedTCN(d_in=D_in, channels=96, layers=10, num_classes=21, dropout=0.3).to(device)
ce_model.load_state_dict(torch.load('model_ce_tcn_v2.pth', map_location=device)); ce_model.eval()
# Load best MS-TCN seed (seed2 best)
ms_model = MSTCN(d_in=D_in, stages=4, ch=96, layers=10, drop=0.3, concat_feat=True).to(device)
ms_model.load_state_dict(torch.load('model_mstcn_s2.pth', map_location=device)); ms_model.eval()

id2seq = {int(r.Id): [int(x) for x in str(r.Sequence).strip().split()] for _, r in train_df.iterrows()}

def levenshtein(a, b):
    n, m = len(a), len(b)
    if n==0: return m
    if m==0: return n
    dp = list(range(m+1))
    for i in range(1, n+1):
        prev = dp[0]; dp[0] = i; ai = a[i-1]
        for j in range(1, m+1):
            tmp = dp[j]
            dp[j] = min(dp[j]+1, dp[j-1]+1, prev + (0 if ai==b[j-1] else 1))
            prev = tmp
    return dp[m]

def eval_val_ensemble(w_ce=0.8, w_ms=0.2):
    tot=0; cnt=0; t0=time.time()
    with torch.no_grad():
        for sid in val_ids:
            X = load_feat(sid, 'train', 1800)
            xb = torch.from_numpy(X).unsqueeze(0).to(device)
            p_ce = ce_model(xb)[0].softmax(dim=-1)
            p_ms = ms_model(xb)[-1][0].softmax(dim=-1)
            probs = (w_ce*p_ce + w_ms*p_ms)
            seq = decode_video_probs_refined(probs, pool_k=13)
            tgt = id2seq[int(sid)]
            tot += levenshtein(seq, tgt); cnt += 1
            if (cnt % 10)==0 or cnt==len(val_ids):
                print(f"  [val ens] {cnt}/{len(val_ids)}", flush=True)
    return tot/max(cnt,1)

val_lev = eval_val_ensemble(0.8, 0.2)
print(f"Ensemble (CE 0.8 + MS 0.2) VAL Levenshtein={val_lev:.4f} (norm ~{val_lev/20:.5f})")

print("=== TEST inference for CE+MS ensemble ===", flush=True)
test_ids = pd.read_csv('test.csv')['Id'].astype(int).tolist()
rows=[]; t0=time.time()
with torch.no_grad():
    for i, sid in enumerate(test_ids, 1):
        X = load_feat(int(sid), 'test', 1800)
        xb = torch.from_numpy(X).unsqueeze(0).to(device)
        p_ce = ce_model(xb)[0].softmax(dim=-1)
        p_ms = ms_model(xb)[-1][0].softmax(dim=-1)
        probs = (0.8*p_ce + 0.2*p_ms)
        seq = decode_video_probs_refined(probs, pool_k=13)
        rows.append({'Id': int(sid), 'Sequence': ' '.join(str(x) for x in seq)})
        if (i%10)==0 or i==len(test_ids):
            print(f"  [test ens ce+ms] {i}/{len(test_ids)} elapsed={(time.time()-t0)/60:.1f}m", flush=True)
sub = pd.DataFrame(rows, columns=['Id','Sequence'])
sub.to_csv('submission.csv', index=False)
print('Wrote submission.csv; head:\n', sub.head())
print('=== CE+MS ensemble complete ===')

=== Ensemble CE-TCN (best) + MS-TCN (best seed) with refined decoder ===


  ce_model.load_state_dict(torch.load('model_ce_tcn_v2.pth', map_location=device)); ce_model.eval()
  ms_model.load_state_dict(torch.load('model_mstcn_s2.pth', map_location=device)); ms_model.eval()


  [val ens] 10/44


  [val ens] 20/44


  [val ens] 30/44


  [val ens] 40/44


  [val ens] 44/44


Ensemble (CE 0.8 + MS 0.2) VAL Levenshtein=4.6591 (norm ~0.23295)
=== TEST inference for CE+MS ensemble ===


  [test ens ce+ms] 10/95 elapsed=0.0m


  [test ens ce+ms] 20/95 elapsed=0.0m


  [test ens ce+ms] 30/95 elapsed=0.0m


  [test ens ce+ms] 40/95 elapsed=0.0m


  [test ens ce+ms] 50/95 elapsed=0.0m


  [test ens ce+ms] 60/95 elapsed=0.0m


  [test ens ce+ms] 70/95 elapsed=0.0m


  [test ens ce+ms] 80/95 elapsed=0.0m


  [test ens ce+ms] 90/95 elapsed=0.0m


  [test ens ce+ms] 95/95 elapsed=0.0m


Wrote submission.csv; head:
     Id                                           Sequence
0  300  5 9 7 1 2 18 3 8 4 20 13 12 15 14 11 6 16 19 1...
1  301  10 1 5 4 6 2 11 14 13 19 15 7 9 20 12 8 18 3 1...
2  302  1 17 16 12 5 19 13 20 18 11 3 4 10 6 8 14 15 9...
3  303  13 4 3 10 14 5 19 15 20 17 11 16 8 18 7 12 1 6...
4  304  8 1 7 12 18 13 9 2 11 3 20 19 15 5 14 6 17 16 ...
=== CE+MS ensemble complete ===


In [30]:
import math, time, random, gc
from pathlib import Path
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

print("=== MS-TCN++ (probs-only refinement) training + improved decoder (pool_k=15, classwise NMS) ===", flush=True)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if torch.cuda.is_available(): torch.backends.cudnn.benchmark = True

feat_tr_dir = Path('features3d_v2')/'train'
feat_te_dir = Path('features3d_v2')/'test'
lab_tr_dir  = Path('labels3d_v2')/'train'

# Split (match earlier) 
train_df = pd.read_csv('training.csv')
all_ids = train_df['Id'].astype(int).tolist()
random.seed(42); np.random.seed(42)
random.shuffle(all_ids)
val_ratio = 0.15
val_n = max(30, int(len(all_ids)*val_ratio))
val_ids = all_ids[:val_n]
tr_ids = all_ids[val_n:]
print(f"Train videos: {len(tr_ids)}, Val videos: {len(val_ids)}")

def load_feat(sample_id: int, split='train', max_T=1800):
    p = (feat_tr_dir if split=='train' else feat_te_dir)/f"{sample_id}.npz"
    d = np.load(p)
    X = d['X'].astype(np.float32)
    return X[:max_T] if X.shape[0] > max_T else X

def load_lab(sample_id: int, max_T=1800):
    y = np.load(lab_tr_dir/f"{sample_id}.npy").astype(np.int64)
    return y[:max_T] if y.shape[0] > max_T else y

class FrameDataset(Dataset):
    def __init__(self, ids, max_T=1800):
        self.ids = list(ids); self.max_T=max_T
    def __len__(self): return len(self.ids)
    def __getitem__(self, idx):
        sid = int(self.ids[idx])
        X = load_feat(sid, 'train', self.max_T)
        y = load_lab(sid, self.max_T)
        T = min(len(X), len(y))
        return torch.from_numpy(X[:T]), torch.from_numpy(y[:T]), sid

def collate(batch):
    xs, ys, sids = zip(*batch)
    Tm = max(x.shape[0] for x in xs); D = xs[0].shape[1]; B=len(xs)
    xb = torch.zeros(B, Tm, D, dtype=torch.float32)
    yb = torch.zeros(B, Tm, dtype=torch.long)
    mask = torch.zeros(B, Tm, dtype=torch.bool)
    for i,(x,y) in enumerate(zip(xs,ys)):
        T=len(x); xb[i,:T]=x; yb[i,:T]=y; mask[i,:T]=True
    return xb, yb, mask, list(sids)

train_loader = DataLoader(FrameDataset(tr_ids, 1800), batch_size=12, shuffle=True, num_workers=2, pin_memory=True, collate_fn=collate)
val_loader   = DataLoader(FrameDataset(val_ids, 1800), batch_size=12, shuffle=False, num_workers=2, pin_memory=True, collate_fn=collate)

D_in = np.load(next(iter((feat_tr_dir).glob('*.npz'))))['X'].shape[1]

class DilatedResBlock(nn.Module):
    def __init__(self, ch, dilation, drop=0.3, groups=8, k=3):
        super().__init__()
        self.conv1 = nn.Conv1d(ch, ch, kernel_size=k, padding=dilation, dilation=dilation)
        self.gn1 = nn.GroupNorm(groups, ch)
        self.conv2 = nn.Conv1d(ch, ch, kernel_size=1)
        self.gn2 = nn.GroupNorm(groups, ch)
        self.drop = nn.Dropout(drop)
    def forward(self, x):
        h = self.conv1(x); h = self.gn1(h); h = F.relu(h, inplace=True); h = self.drop(h)
        h = self.conv2(h); h = self.gn2(h); h = F.relu(h, inplace=True)
        return x + h

class Stage(nn.Module):
    def __init__(self, in_ch, ch=128, layers=10, drop=0.3):
        super().__init__()
        self.inp = nn.Conv1d(in_ch, ch, kernel_size=1)
        blocks = []; dil=1
        for _ in range(layers):
            blocks.append(DilatedResBlock(ch, dil, drop=drop))
            dil = min(dil*2, 512)
        self.blocks = nn.ModuleList(blocks)
        self.head = nn.Conv1d(ch, 21, kernel_size=1)
    def forward(self, x):
        h = self.inp(x)
        for blk in self.blocks:
            h = blk(h)
        return self.head(h)  # (B,21,T)

class MSTCNPP(nn.Module):
    # Stage 1: features -> logits
    # Stages 2+: probs-only refinement (no feature concat), i.e., input is prev softmax logits as channels
    def __init__(self, d_in, stages=4, ch=128, layers=10, drop=0.3):
        super().__init__()
        self.stages = nn.ModuleList()
        self.input_proj = nn.Conv1d(d_in, d_in, kernel_size=1)
        # Stage 1 takes features
        self.stages.append(Stage(d_in, ch=ch, layers=layers, drop=drop))
        # Subsequent stages take 21-channel probs/logits as input
        for _ in range(stages-1):
            self.stages.append(Stage(21, ch=ch, layers=layers, drop=drop))
    def forward(self, x_b_t_d):
        x = x_b_t_d.transpose(1,2)  # (B,D,T)
        x = self.input_proj(x)
        logits_list = []
        prev = self.stages[0](x)  # (B,21,T)
        logits_list.append(prev.transpose(1,2))
        for s in range(1, len(self.stages)):
            # use probs-only refinement
            probs = prev.softmax(dim=1)  # (B,21,T)
            prev = self.stages[s](probs)  # (B,21,T)
            logits_list.append(prev.transpose(1,2))
        return logits_list  # list of (B,T,21)

def ce_ignore_bg_with_ls(logits, targets, mask, label_smoothing=0.10):
    B,T,C = logits.shape
    valid = mask & (targets > 0)
    if valid.sum() == 0: return logits.new_zeros([])
    lg = logits[valid]; y = targets[valid]
    if label_smoothing > 0:
        with torch.no_grad():
            true_dist = torch.zeros_like(lg).scatter_(1, y.unsqueeze(1), 1.0)
            true_dist = true_dist * (1 - label_smoothing) + label_smoothing / (lg.size(1) - 1)
        logp = F.log_softmax(lg, dim=-1)
        return F.kl_div(logp, true_dist, reduction='batchmean')
    else:
        return F.cross_entropy(lg, y, reduction='mean')

def temporal_mse(probs, mask):
    diff = (probs[:,1:,:] - probs[:,:-1,:])**2
    m = (mask[:,1:] & mask[:,:-1]).float().unsqueeze(-1)
    num = (diff * m).sum(); den = m.sum().clamp_min(1.0)
    return num / den

# Augmentations
def time_mask(x, mask, n=2, wmin=8, wmax=16, p=0.5):
    if random.random() > p: return x, mask
    B,T,D = x.shape
    for _ in range(n):
        w = random.randint(wmin, wmax)
        t0 = random.randint(0, max(0, T-w))
        x[:, t0:t0+w, :] = 0.0
    return x, mask

def jitter(x, sigma=0.01):
    return x + torch.randn_like(x)*sigma

# Decoder utils with pool_k=15, classwise NMS radius and tie-breaks
def avg_pool_probs(p_t_c: torch.Tensor, k: int = 15) -> torch.Tensor:
    x = p_t_c.unsqueeze(0).transpose(1,2)
    y = F.avg_pool1d(x, kernel_size=k, stride=1, padding=k//2)
    return y.transpose(1,2).squeeze(0)

def duration_integral_single(p_t: torch.Tensor, k: int) -> torch.Tensor:
    x = p_t.view(1,1,-1)
    w = torch.ones(1,1,k, device=p_t.device, dtype=p_t.dtype) / float(k)
    y = F.conv1d(x, w, padding=k//2)
    return y.view(-1)

def refine_com(p: torch.Tensor, t_star: int, w: int = 5) -> float:
    T = p.shape[0]
    a = max(0, t_star - w); b = min(T-1, t_star + w)
    idx = torch.arange(a, b+1, device=p.device, dtype=p.dtype)
    seg = p[a:b+1]; s = seg.sum() + 1e-8
    return float(((idx * seg).sum() / s).item())

def compute_class_medians():
    dur_by_c = {c: [] for c in range(1,21)}
    ids = train_df['Id'].astype(int).tolist()
    for sid in ids:
        y = np.load(lab_tr_dir/f"{sid}.npy").astype(np.int16)
        for c in range(1,21):
            cnt = int((y==c).sum());
            if cnt>0: dur_by_c[c].append(cnt)
    med = {}
    for c in range(1,21):
        med[c] = int(np.clip(np.median(dur_by_c[c]) if len(dur_by_c[c])>0 else 13, 9, 25))
    return med

MED_K = compute_class_medians()

def decode_video_probs_refined(p_t_c: torch.Tensor, pool_k=15):
    # classwise radius
    p_s = avg_pool_probs(p_t_c, k=pool_k)
    T,C = p_s.shape
    scores = torch.empty_like(p_s)
    for c in range(C):
        k = MED_K.get(c, 13) if c!=0 else 13
        scores[:,c] = p_s[:,c] if c==0 else duration_integral_single(p_s[:,c], k=k)
    peaks = []
    for c in range(1,21):
        radius = max(10, MED_K.get(c,13)//2)
        s = scores[:,c]
        t_star = int(torch.argmax(s).item())
        t_ref = refine_com(p_s[:,c], t_star, w=5)
        t_idx = int(round(t_ref)); t_idx = min(max(t_idx, 0), T-1)
        local_mean = p_s[max(0,t_idx-radius):min(T,t_idx+radius+1), c].mean().item()
        peaks.append((c, t_ref, float(scores[t_idx, c].item()), float(local_mean)))
    # sort by time, then integral, then local mean prob
    peaks.sort(key=lambda x: (x[1], -x[2], -x[3]))
    return [c for c,_,_,_ in peaks]

id2seq = {int(r.Id): [int(x) for x in str(r.Sequence).strip().split()] for _, r in train_df.iterrows()}

def levenshtein(a,b):
    n,m=len(a),len(b)
    if n==0: return m
    if m==0: return n
    dp=list(range(m+1))
    for i in range(1,n+1):
        prev=dp[0]; dp[0]=i; ai=a[i-1]
        for j in range(1,m+1):
            tmp=dp[j]; dp[j]=min(dp[j]+1, dp[j-1]+1, prev + (0 if ai==b[j-1] else 1)); prev=tmp
    return dp[m]

def eval_val(model):
    model.eval(); tot=0; cnt=0; t0=time.time()
    with torch.no_grad():
        for xb, yb, mask, sids in val_loader:
            xb = xb.to(device); mask = mask.to(device)
            probs = model(xb)[-1].softmax(dim=-1)  # (B,T,C)
            for b, sid in enumerate(sids):
                T = int(mask[b].sum().item())
                seq = decode_video_probs_refined(probs[b,:T,:], pool_k=15)
                tot += levenshtein(seq, id2seq[int(sid)]); cnt += 1
    print(f"  [val decode] {cnt} vids in {(time.time()-t0)/60:.2f}m", flush=True)
    return tot/max(cnt,1)

def train_seed(seed=0, epochs=18, patience=5, ch=128, stages=4, layers=10, ls=0.10, lambda_t=0.15):
    random.seed(seed); np.random.seed(seed); torch.manual_seed(seed)
    model = MSTCNPP(d_in=D_in, stages=stages, ch=ch, layers=layers, drop=0.3).to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=15)
    best = math.inf; best_state=None; bad=0
    stage_w = [0.2,0.3,0.4,1.0][:stages]
    for ep in range(1, epochs+1):
        model.train(); t0=time.time(); nb=0; tot_loss=0.0
        for it, (xb, yb, mask, sids) in enumerate(train_loader):
            xb = xb.to(device); yb = yb.to(device); mask = mask.to(device)
            xb, mask = time_mask(xb, mask, n=2, wmin=8, wmax=16, p=0.5)
            xb = jitter(xb, sigma=0.01)
            opt.zero_grad(set_to_none=True)
            logits_list = model(xb)  # list (B,T,21)
            loss_ce = 0.0; probs_last=None
            for s, lg in enumerate(logits_list):
                loss_ce = loss_ce + stage_w[min(s, len(stage_w)-1)] * ce_ignore_bg_with_ls(lg, yb, mask, label_smoothing=ls)
                probs_last = lg.softmax(dim=-1)
            loss_t = temporal_mse(probs_last, mask) * lambda_t
            loss = loss_ce + loss_t
            loss.backward(); nn.utils.clip_grad_norm_(model.parameters(), 1.0); opt.step()
            tot_loss += float(loss.item()); nb += 1
            if (it+1)%20==0: print(f"ep{ep} it{it+1} loss={tot_loss/nb:.4f} elapsed={time.time()-t0:.1f}s", flush=True)
        sched.step()
        val_lev = eval_val(model)
        print(f"Seed{seed} Epoch {ep}: train_loss={tot_loss/max(nb,1):.4f} val_lev={val_lev:.4f} lr={sched.get_last_lr()[0]:.6f}", flush=True)
        if val_lev < best - 1e-4:
            best = val_lev; best_state = {k:v.detach().cpu() for k,v in model.state_dict().items()}; bad=0
            print(f"  New best (seed {seed}) val_lev={best:.4f}", flush=True)
        else:
            bad += 1
            if bad >= patience: print("Early stopping.", flush=True); break
    if best_state is not None: model.load_state_dict(best_state)
    out_path = f"model_mstcnpp_s{seed}.pth"
    torch.save(model.state_dict(), out_path)
    print(f"Saved {out_path}; best val_lev={best:.4f}")
    return out_path, best

# Train 3 seeds
seeds=[0,1,2]; ckpts=[]; scores=[]
for s in seeds:
    p, sc = train_seed(seed=s, epochs=18, patience=5, ch=128, stages=4, layers=10, ls=0.10, lambda_t=0.15)
    ckpts.append(p); scores.append(sc)
print("MS-TCN++ seed scores:", list(zip(seeds, scores)))

def ensemble_val_and_test_mstcnpp(ckpt_paths, do_test=True):
    models=[]
    for p in ckpt_paths:
        m = MSTCNPP(d_in=D_in, stages=4, ch=128, layers=10, drop=0.3).to(device)
        m.load_state_dict(torch.load(p, map_location=device)); m.eval(); models.append(m)
    tot=0; cnt=0; t0=time.time()
    with torch.no_grad():
        for xb, yb, mask, sids in val_loader:
            xb = xb.to(device); mask = mask.to(device)
            probs_ens=None
            for m in models:
                p = m(xb)[-1].softmax(dim=-1)
                probs_ens = p if probs_ens is None else (probs_ens + p)
            probs_ens = probs_ens/len(models)
            for b, sid in enumerate(sids):
                T = int(mask[b].sum().item())
                seq = decode_video_probs_refined(probs_ens[b,:T,:], pool_k=15)
                tot += levenshtein(seq, id2seq[int(sid)]); cnt += 1
    val_lev = tot/max(cnt,1)
    print(f"MS-TCN++ ensemble VAL Levenshtein={val_lev:.4f} (norm ~{val_lev/20:.5f})", flush=True)
    if not do_test: return val_lev
    # TEST inference
    test_ids = pd.read_csv('test.csv')['Id'].astype(int).tolist()
    rows=[]; t0=time.time()
    with torch.no_grad():
        for i, sid in enumerate(test_ids, 1):
            X = load_feat(int(sid), 'test', 1800)
            xb = torch.from_numpy(X).unsqueeze(0).to(device)
            probs_ens=None
            for m in models:
                p = m(xb)[-1][0].softmax(dim=-1)
                probs_ens = p if probs_ens is None else (probs_ens + p)
            probs_ens = probs_ens/len(models)
            seq = decode_video_probs_refined(probs_ens, pool_k=15)
            rows.append({'Id': int(sid), 'Sequence': ' '.join(str(x) for x in seq)})
            if (i%10)==0 or i==len(test_ids):
                print(f"  [test mstcnpp ens] {i}/{len(test_ids)} elapsed={(time.time()-t0)/60:.1f}m", flush=True)
    sub = pd.DataFrame(rows, columns=['Id','Sequence'])
    sub.to_csv('submission.csv', index=False)
    print('Wrote submission.csv; head:\n', sub.head())

print("=== MS-TCN++ setup complete; training will run now ===")

=== MS-TCN++ (probs-only refinement) training + improved decoder (pool_k=15, classwise NMS) ===


Train videos: 253, Val videos: 44


ep1 it20 loss=5.4824 elapsed=7.5s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 1: train_loss=5.4213 val_lev=18.0000 lr=0.000297


  New best (seed 0) val_lev=18.0000


ep2 it20 loss=4.6555 elapsed=4.1s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 2: train_loss=4.6408 val_lev=18.0455 lr=0.000287


ep3 it20 loss=4.3243 elapsed=3.7s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 3: train_loss=4.2902 val_lev=17.3409 lr=0.000271


  New best (seed 0) val_lev=17.3409


ep4 it20 loss=3.7280 elapsed=3.5s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 4: train_loss=3.7560 val_lev=15.2955 lr=0.000250


  New best (seed 0) val_lev=15.2955


ep5 it20 loss=3.2218 elapsed=2.4s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 5: train_loss=3.2104 val_lev=12.9091 lr=0.000225


  New best (seed 0) val_lev=12.9091


ep6 it20 loss=2.7979 elapsed=2.6s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 6: train_loss=2.7874 val_lev=12.0682 lr=0.000196


  New best (seed 0) val_lev=12.0682


ep7 it20 loss=2.5941 elapsed=3.0s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 7: train_loss=2.5480 val_lev=10.9773 lr=0.000166


  New best (seed 0) val_lev=10.9773


ep8 it20 loss=2.4562 elapsed=1.6s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 8: train_loss=2.4607 val_lev=11.1364 lr=0.000134


ep9 it20 loss=2.3614 elapsed=2.7s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 9: train_loss=2.3683 val_lev=10.5227 lr=0.000104


  New best (seed 0) val_lev=10.5227


ep10 it20 loss=2.2584 elapsed=2.6s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 10: train_loss=2.2584 val_lev=11.3636 lr=0.000075


ep11 it20 loss=2.2435 elapsed=2.6s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 11: train_loss=2.2412 val_lev=9.6591 lr=0.000050


  New best (seed 0) val_lev=9.6591


ep12 it20 loss=2.1435 elapsed=1.6s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 12: train_loss=2.1187 val_lev=8.9318 lr=0.000029


  New best (seed 0) val_lev=8.9318


ep13 it20 loss=2.0830 elapsed=1.6s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 13: train_loss=2.0909 val_lev=9.2727 lr=0.000013


ep14 it20 loss=2.0801 elapsed=2.0s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 14: train_loss=2.0720 val_lev=8.8864 lr=0.000003


  New best (seed 0) val_lev=8.8864


ep15 it20 loss=2.0485 elapsed=1.6s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 15: train_loss=2.0528 val_lev=8.7727 lr=0.000000


  New best (seed 0) val_lev=8.7727


ep16 it20 loss=2.0430 elapsed=1.7s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 16: train_loss=2.0360 val_lev=8.7727 lr=0.000003


ep17 it20 loss=2.0409 elapsed=2.0s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 17: train_loss=2.0380 val_lev=8.7273 lr=0.000013


  New best (seed 0) val_lev=8.7273


ep18 it20 loss=2.0534 elapsed=1.6s


  [val decode] 44 vids in 0.01m


Seed0 Epoch 18: train_loss=2.0419 val_lev=8.8182 lr=0.000029


Saved model_mstcnpp_s0.pth; best val_lev=8.7273


ep1 it20 loss=5.5497 elapsed=2.3s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 1: train_loss=5.4868 val_lev=18.0227 lr=0.000297


  New best (seed 1) val_lev=18.0227


ep2 it20 loss=4.6859 elapsed=2.0s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 2: train_loss=4.6717 val_lev=17.8409 lr=0.000287


  New best (seed 1) val_lev=17.8409


ep3 it20 loss=4.3505 elapsed=1.7s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 3: train_loss=4.3258 val_lev=17.3636 lr=0.000271


  New best (seed 1) val_lev=17.3636


ep4 it20 loss=3.7449 elapsed=1.6s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 4: train_loss=3.6927 val_lev=14.6136 lr=0.000250


  New best (seed 1) val_lev=14.6136


ep5 it20 loss=3.1122 elapsed=1.6s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 5: train_loss=3.0992 val_lev=12.6591 lr=0.000225


  New best (seed 1) val_lev=12.6591


ep6 it20 loss=2.7589 elapsed=1.7s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 6: train_loss=2.7028 val_lev=10.9773 lr=0.000196


  New best (seed 1) val_lev=10.9773


ep7 it20 loss=2.5463 elapsed=1.6s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 7: train_loss=2.5338 val_lev=10.7727 lr=0.000166


  New best (seed 1) val_lev=10.7727


ep8 it20 loss=2.4246 elapsed=1.6s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 8: train_loss=2.3871 val_lev=9.5000 lr=0.000134


  New best (seed 1) val_lev=9.5000


ep9 it20 loss=2.2632 elapsed=2.0s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 9: train_loss=2.2296 val_lev=9.1818 lr=0.000104


  New best (seed 1) val_lev=9.1818


ep10 it20 loss=2.1442 elapsed=2.2s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 10: train_loss=2.1311 val_lev=8.7727 lr=0.000075


  New best (seed 1) val_lev=8.7727


ep11 it20 loss=2.0835 elapsed=1.6s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 11: train_loss=2.0940 val_lev=8.2727 lr=0.000050


  New best (seed 1) val_lev=8.2727


ep12 it20 loss=1.9958 elapsed=1.6s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 12: train_loss=1.9803 val_lev=8.1591 lr=0.000029


  New best (seed 1) val_lev=8.1591


ep13 it20 loss=1.9546 elapsed=2.0s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 13: train_loss=1.9623 val_lev=8.1591 lr=0.000013


ep14 it20 loss=1.9215 elapsed=1.6s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 14: train_loss=1.9180 val_lev=7.9091 lr=0.000003


  New best (seed 1) val_lev=7.9091


ep15 it20 loss=1.8955 elapsed=1.6s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 15: train_loss=1.8813 val_lev=7.9318 lr=0.000000


ep16 it20 loss=1.8928 elapsed=1.6s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 16: train_loss=1.8652 val_lev=7.9318 lr=0.000003


ep17 it20 loss=1.8795 elapsed=1.6s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 17: train_loss=1.8733 val_lev=7.9318 lr=0.000013


ep18 it20 loss=1.8908 elapsed=1.6s


  [val decode] 44 vids in 0.01m


Seed1 Epoch 18: train_loss=1.8744 val_lev=8.0000 lr=0.000029


Saved model_mstcnpp_s1.pth; best val_lev=7.9091


ep1 it20 loss=5.5922 elapsed=1.7s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 1: train_loss=5.5171 val_lev=17.8409 lr=0.000297


  New best (seed 2) val_lev=17.8409


ep2 it20 loss=4.6995 elapsed=1.9s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 2: train_loss=4.6883 val_lev=17.7955 lr=0.000287


  New best (seed 2) val_lev=17.7955


ep3 it20 loss=4.4384 elapsed=2.0s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 3: train_loss=4.4129 val_lev=17.5000 lr=0.000271


  New best (seed 2) val_lev=17.5000


ep4 it20 loss=3.9758 elapsed=1.7s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 4: train_loss=3.9535 val_lev=15.6136 lr=0.000250


  New best (seed 2) val_lev=15.6136


ep5 it20 loss=3.2630 elapsed=1.6s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 5: train_loss=3.2257 val_lev=12.9545 lr=0.000225


  New best (seed 2) val_lev=12.9545


ep6 it20 loss=2.8166 elapsed=2.3s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 6: train_loss=2.7765 val_lev=11.0909 lr=0.000196


  New best (seed 2) val_lev=11.0909


ep7 it20 loss=2.5760 elapsed=1.6s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 7: train_loss=2.5348 val_lev=10.7500 lr=0.000166


  New best (seed 2) val_lev=10.7500


ep8 it20 loss=2.4262 elapsed=1.6s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 8: train_loss=2.3824 val_lev=9.6818 lr=0.000134


  New best (seed 2) val_lev=9.6818


ep9 it20 loss=2.2519 elapsed=1.6s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 9: train_loss=2.2687 val_lev=9.0909 lr=0.000104


  New best (seed 2) val_lev=9.0909


ep10 it20 loss=2.1238 elapsed=1.7s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 10: train_loss=2.1091 val_lev=8.6818 lr=0.000075


  New best (seed 2) val_lev=8.6818


ep11 it20 loss=2.0603 elapsed=1.6s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 11: train_loss=2.0215 val_lev=8.4091 lr=0.000050


  New best (seed 2) val_lev=8.4091


ep12 it20 loss=1.9854 elapsed=1.6s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 12: train_loss=1.9678 val_lev=8.2727 lr=0.000029


  New best (seed 2) val_lev=8.2727


ep13 it20 loss=1.9321 elapsed=1.6s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 13: train_loss=1.9249 val_lev=8.1364 lr=0.000013


  New best (seed 2) val_lev=8.1364


ep14 it20 loss=1.9089 elapsed=1.6s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 14: train_loss=1.9317 val_lev=7.7045 lr=0.000003


  New best (seed 2) val_lev=7.7045


ep15 it20 loss=1.8738 elapsed=2.0s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 15: train_loss=1.8997 val_lev=7.6591 lr=0.000000


  New best (seed 2) val_lev=7.6591


ep16 it20 loss=1.8829 elapsed=1.6s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 16: train_loss=1.8641 val_lev=7.6591 lr=0.000003


ep17 it20 loss=1.8732 elapsed=1.7s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 17: train_loss=1.8596 val_lev=7.6136 lr=0.000013


  New best (seed 2) val_lev=7.6136


ep18 it20 loss=1.8742 elapsed=1.6s


  [val decode] 44 vids in 0.01m


Seed2 Epoch 18: train_loss=1.8584 val_lev=7.3409 lr=0.000029


  New best (seed 2) val_lev=7.3409


Saved model_mstcnpp_s2.pth; best val_lev=7.3409
MS-TCN++ seed scores: [(0, 8.727272727272727), (1, 7.909090909090909), (2, 7.340909090909091)]
=== MS-TCN++ setup complete; training will run now ===


In [31]:
import time, math, random
from pathlib import Path
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F

print("=== CE model: improved refined decoder (pool_k=15, classwise radius, tie-breaks, temp=0.9) ===", flush=True)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
feat_tr_dir = Path('features3d_v2')/'train'
feat_te_dir = Path('features3d_v2')/'test'
lab_tr_dir  = Path('labels3d_v2')/'train'

train_df = pd.read_csv('training.csv')
all_ids = train_df['Id'].astype(int).tolist()
random.seed(42); np.random.seed(42)
random.shuffle(all_ids)
val_ratio = 0.15
val_n = max(30, int(len(all_ids)*val_ratio))
val_ids = all_ids[:val_n]

def load_feat(sample_id: int, split='train', max_T=1800):
    p = (feat_tr_dir if split=='train' else feat_te_dir)/f"{sample_id}.npz"
    d = np.load(p); X = d['X'].astype(np.float32)
    return X[:max_T] if X.shape[0] > max_T else X

def compute_class_median_durations():
    dur_by_c = {c: [] for c in range(1,21)}
    ids = train_df['Id'].astype(int).tolist()
    for sid in ids:
        y = np.load(lab_tr_dir/f"{sid}.npy").astype(np.int16)
        for c in range(1,21):
            cnt = int((y==c).sum())
            if cnt>0: dur_by_c[c].append(cnt)
    med = {}
    for c in range(1,21):
        med[c] = int(np.clip(np.median(dur_by_c[c]) if len(dur_by_c[c])>0 else 13, 9, 25))
    return med

MED_K = compute_class_median_durations()

def avg_pool_probs(p_t_c: torch.Tensor, k: int = 15) -> torch.Tensor:
    x = p_t_c.unsqueeze(0).transpose(1,2)
    y = F.avg_pool1d(x, kernel_size=k, stride=1, padding=k//2)
    return y.transpose(1,2).squeeze(0)

def duration_integral_single(p_t: torch.Tensor, k: int) -> torch.Tensor:
    x = p_t.view(1,1,-1)
    w = torch.ones(1,1,k, device=p_t.device, dtype=p_t.dtype) / float(k)
    y = F.conv1d(x, w, padding=k//2)
    return y.view(-1)

def refine_com(p: torch.Tensor, t_star: int, w: int = 5) -> float:
    T = p.shape[0]
    a = max(0, t_star - w); b = min(T-1, t_star + w)
    idx = torch.arange(a, b+1, device=p.device, dtype=p.dtype)
    seg = p[a:b+1]; s = seg.sum() + 1e-8
    return float(((idx * seg).sum() / s).item())

def decode_video_probs_refined(p_t_c: torch.Tensor, pool_k=15, temp=0.9):
    # temperature sharpening then normalize
    if temp != 1.0:
        p_t_c = (p_t_c ** (1.0/temp)); p_t_c = p_t_c / (p_t_c.sum(dim=-1, keepdim=True) + 1e-8)
    p_s = avg_pool_probs(p_t_c, k=pool_k)
    T,C = p_s.shape
    scores = torch.empty_like(p_s)
    for c in range(C):
        k = MED_K.get(c, 13) if c!=0 else 13
        scores[:,c] = p_s[:,c] if c==0 else duration_integral_single(p_s[:,c], k=k)
    peaks = []
    for c in range(1,21):
        radius = max(10, MED_K.get(c,13)//2)
        s = scores[:,c]
        t_star = int(torch.argmax(s).item())
        t_ref = refine_com(p_s[:,c], t_star, w=5)
        t_idx = int(round(t_ref)); t_idx = min(max(t_idx, 0), T-1)
        local_mean = p_s[max(0,t_idx-radius):min(T,t_idx+radius+1), c].mean().item()
        peaks.append((c, t_ref, float(scores[t_idx, c].item()), float(local_mean)))
    peaks.sort(key=lambda x: (x[1], -x[2], -x[3]))
    return [c for c,_,_,_ in peaks]

D_in = np.load(next(iter((feat_tr_dir).glob('*.npz'))))['X'].shape[1]

class DilatedTCN(nn.Module):
    def __init__(self, d_in, channels=96, layers=10, num_classes=21, dropout=0.3):
        super().__init__()
        self.inp = nn.Conv1d(d_in, channels, kernel_size=1)
        blocks = []; dil=1
        for _ in range(layers):
            blocks.append(nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=3, padding=dil, dilation=dil),
                nn.GroupNorm(num_groups=8, num_channels=channels),
                nn.ReLU(inplace=True),
                nn.Dropout(dropout),
                nn.Conv1d(channels, channels, kernel_size=1),
                nn.GroupNorm(num_groups=8, num_channels=channels),
                nn.ReLU(inplace=True),
            ))
            dil = min(dil*2, 512)
        self.blocks = nn.ModuleList(blocks)
        self.head = nn.Conv1d(channels, num_classes, kernel_size=1)
    def forward(self, x_b_t_d):
        x = x_b_t_d.transpose(1,2)
        h = self.inp(x)
        for blk in self.blocks:
            res = h; h = blk(h); h = h + res
        logits = self.head(h)
        return logits.transpose(1,2)

model = DilatedTCN(d_in=D_in, channels=96, layers=10, num_classes=21, dropout=0.3).to(device)
state = torch.load('model_ce_tcn_v2.pth', map_location=device)
model.load_state_dict(state); model.eval()

id2seq = {int(r.Id): [int(x) for x in str(r.Sequence).strip().split()] for _, r in train_df.iterrows()}

def levenshtein(a,b):
    n,m=len(a),len(b)
    if n==0: return m
    if m==0: return n
    dp=list(range(m+1))
    for i in range(1,n+1):
        prev=dp[0]; dp[0]=i; ai=a[i-1]
        for j in range(1,m+1):
            tmp=dp[j]; dp[j]=min(dp[j]+1, dp[j-1]+1, prev + (0 if ai==b[j-1] else 1)); prev=tmp
    return dp[m]

def eval_val():
    tot=0; cnt=0; t0=time.time()
    with torch.no_grad():
        for sid in val_ids:
            X = load_feat(sid, 'train', 1800)
            xb = torch.from_numpy(X).unsqueeze(0).to(device)
            logits = model(xb)[0]
            probs = logits.softmax(dim=-1)
            seq = decode_video_probs_refined(probs, pool_k=15, temp=0.9)
            tot += levenshtein(seq, id2seq[int(sid)]); cnt += 1
    print(f"  [val improved] {cnt} vids in {(time.time()-t0)/60:.2f}m", flush=True)
    return tot/max(cnt,1)

val_lev = eval_val()
print(f"Improved refined decoder VAL Levenshtein={val_lev:.4f} (norm ~{val_lev/20:.5f})", flush=True)

print("=== TEST inference (CE + improved refined decoder) -> submission.csv ===", flush=True)
test_ids = pd.read_csv('test.csv')['Id'].astype(int).tolist()
rows=[]; t0=time.time()
with torch.no_grad():
    for i, sid in enumerate(test_ids, 1):
        X = load_feat(int(sid), 'test', 1800)
        xb = torch.from_numpy(X).unsqueeze(0).to(device)
        logits = model(xb)[0]
        probs = logits.softmax(dim=-1)
        seq = decode_video_probs_refined(probs, pool_k=15, temp=0.9)
        rows.append({'Id': int(sid), 'Sequence': ' '.join(str(x) for x in seq)})
        if (i%10)==0 or i==len(test_ids):
            print(f"  [test ce+improved] {i}/{len(test_ids)} elapsed={(time.time()-t0)/60:.1f}m", flush=True)
sub = pd.DataFrame(rows, columns=['Id','Sequence'])
sub.to_csv('submission.csv', index=False)
print('Wrote submission.csv; head:\n', sub.head())
print('=== CE improved refined decoding complete ===')

=== CE model: improved refined decoder (pool_k=15, classwise radius, tie-breaks, temp=0.9) ===


  state = torch.load('model_ce_tcn_v2.pth', map_location=device)


  [val improved] 44 vids in 0.01m


Improved refined decoder VAL Levenshtein=4.5455 (norm ~0.22727)


=== TEST inference (CE + improved refined decoder) -> submission.csv ===


  [test ce+improved] 10/95 elapsed=0.0m


  [test ce+improved] 20/95 elapsed=0.0m


  [test ce+improved] 30/95 elapsed=0.0m


  [test ce+improved] 40/95 elapsed=0.0m


  [test ce+improved] 50/95 elapsed=0.0m


  [test ce+improved] 60/95 elapsed=0.0m


  [test ce+improved] 70/95 elapsed=0.0m


  [test ce+improved] 80/95 elapsed=0.0m


  [test ce+improved] 90/95 elapsed=0.0m


  [test ce+improved] 95/95 elapsed=0.0m


Wrote submission.csv; head:
     Id                                           Sequence
0  300  5 9 7 1 2 18 3 8 4 20 13 12 15 14 11 6 16 19 1...
1  301  10 1 5 4 6 2 11 14 13 19 15 7 9 20 12 8 18 3 1...
2  302  1 17 16 12 5 19 13 20 18 11 3 4 6 8 14 15 10 9...
3  303  13 4 3 10 14 5 19 15 20 17 1 11 16 8 18 7 12 6...
4  304  8 1 7 12 18 13 9 2 11 3 20 19 15 5 14 6 17 16 ...
=== CE improved refined decoding complete ===


In [32]:
import math, time, random, gc
from pathlib import Path
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

print("=== Train 3 CE DilatedTCN seeds and ensemble with improved decoder ===", flush=True)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if torch.cuda.is_available(): torch.backends.cudnn.benchmark = True

feat_tr_dir = Path('features3d_v2')/'train'
feat_te_dir = Path('features3d_v2')/'test'
lab_tr_dir  = Path('labels3d_v2')/'train'

# Split same as earlier
train_df = pd.read_csv('training.csv')
all_ids = train_df['Id'].astype(int).tolist()
random.seed(42); np.random.seed(42)
random.shuffle(all_ids)
val_ratio = 0.15
val_n = max(30, int(len(all_ids)*val_ratio))
val_ids = all_ids[:val_n]
tr_ids = all_ids[val_n:]
print(f"Train videos: {len(tr_ids)}, Val videos: {len(val_ids)}")

def load_feat(sample_id: int, split='train', max_T=1800):
    p = (feat_tr_dir if split=='train' else feat_te_dir)/f"{sample_id}.npz"
    d = np.load(p); X = d['X'].astype(np.float32)
    return X[:max_T] if X.shape[0] > max_T else X

def load_lab(sample_id: int, max_T=1800):
    y = np.load(lab_tr_dir/f"{sample_id}.npy").astype(np.int64)
    return y[:max_T] if y.shape[0] > max_T else y

class FrameDataset(Dataset):
    def __init__(self, ids, max_T=1800):
        self.ids=list(ids); self.max_T=max_T
    def __len__(self): return len(self.ids)
    def __getitem__(self, idx):
        sid = int(self.ids[idx])
        X = load_feat(sid, 'train', self.max_T)
        y = load_lab(sid, self.max_T)
        T = min(len(X), len(y))
        return torch.from_numpy(X[:T]), torch.from_numpy(y[:T]), sid

def collate(batch):
    xs, ys, sids = zip(*batch)
    Tm = max(x.shape[0] for x in xs); D = xs[0].shape[1]; B=len(xs)
    xb = torch.zeros(B, Tm, D, dtype=torch.float32)
    yb = torch.zeros(B, Tm, dtype=torch.long)
    mask = torch.zeros(B, Tm, dtype=torch.bool)
    for i,(x,y) in enumerate(zip(xs,ys)):
        T=len(x); xb[i,:T]=x; yb[i,:T]=y; mask[i,:T]=True
    return xb, yb, mask, list(sids)

train_loader = DataLoader(FrameDataset(tr_ids, 1800), batch_size=12, shuffle=True, num_workers=2, pin_memory=True, collate_fn=collate)
val_loader   = DataLoader(FrameDataset(val_ids, 1800), batch_size=12, shuffle=False, num_workers=2, pin_memory=True, collate_fn=collate)

D_in = np.load(next(iter(feat_tr_dir.glob('*.npz'))))['X'].shape[1]

class DilatedTCN(nn.Module):
    def __init__(self, d_in, channels=96, layers=10, num_classes=21, dropout=0.3):
        super().__init__()
        self.inp = nn.Conv1d(d_in, channels, kernel_size=1)
        blocks = []; dil=1
        for _ in range(layers):
            blocks.append(nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=3, padding=dil, dilation=dil),
                nn.GroupNorm(num_groups=8, num_channels=channels),
                nn.ReLU(inplace=True),
                nn.Dropout(dropout),
                nn.Conv1d(channels, channels, kernel_size=1),
                nn.GroupNorm(num_groups=8, num_channels=channels),
                nn.ReLU(inplace=True),
            ))
            dil = min(dil*2, 512)
        self.blocks = nn.ModuleList(blocks)
        self.head = nn.Conv1d(channels, num_classes, kernel_size=1)
    def forward(self, x_b_t_d):
        x = x_b_t_d.transpose(1,2)
        h = self.inp(x)
        for blk in self.blocks:
            res = h; h = blk(h); h = h + res
        logits = self.head(h)
        return logits.transpose(1,2)

def ce_loss_ignore_bg(logits_b_t_c, y_b_t, mask_b_t):
    B,T,C = logits_b_t_c.shape
    logits = logits_b_t_c.reshape(B*T, C)
    targets = y_b_t.reshape(B*T)
    valid_fg = (mask_b_t.reshape(B*T)) & (targets > 0)
    if valid_fg.sum() == 0: return logits.new_zeros([])
    return F.cross_entropy(logits[valid_fg], targets[valid_fg], reduction='mean')

# Improved decoder (pool_k=15, classwise radius, temp=0.9, tie-breaks)
train_df_local = train_df
def compute_class_median_durations():
    dur_by_c = {c: [] for c in range(1,21)}
    ids = train_df_local['Id'].astype(int).tolist()
    for sid in ids:
        y = np.load(lab_tr_dir/f"{sid}.npy").astype(np.int16)
        for c in range(1,21):
            cnt = int((y==c).sum());
            if cnt>0: dur_by_c[c].append(cnt)
    med = {}
    for c in range(1,21):
        med[c] = int(np.clip(np.median(dur_by_c[c]) if len(dur_by_c[c])>0 else 13, 9, 25))
    return med
MED_K = compute_class_median_durations()

def avg_pool_probs(p_t_c: torch.Tensor, k: int = 15) -> torch.Tensor:
    x = p_t_c.unsqueeze(0).transpose(1,2)
    y = F.avg_pool1d(x, kernel_size=k, stride=1, padding=k//2)
    return y.transpose(1,2).squeeze(0)

def duration_integral_single(p_t: torch.Tensor, k: int) -> torch.Tensor:
    x = p_t.view(1,1,-1)
    w = torch.ones(1,1,k, device=p_t.device, dtype=p_t.dtype) / float(k)
    y = F.conv1d(x, w, padding=k//2)
    return y.view(-1)

def refine_com(p: torch.Tensor, t_star: int, w: int = 5) -> float:
    T = p.shape[0]
    a = max(0, t_star - w); b = min(T-1, t_star + w)
    idx = torch.arange(a, b+1, device=p.device, dtype=p.dtype)
    seg = p[a:b+1]; s = seg.sum() + 1e-8
    return float(((idx * seg).sum() / s).item())

def decode_video_probs_refined(p_t_c: torch.Tensor, pool_k=15, temp=0.9):
    if temp != 1.0:
        p_t_c = (p_t_c ** (1.0/temp)); p_t_c = p_t_c / (p_t_c.sum(dim=-1, keepdim=True) + 1e-8)
    p_s = avg_pool_probs(p_t_c, k=pool_k)
    T,C = p_s.shape
    scores = torch.empty_like(p_s)
    for c in range(C):
        k = MED_K.get(c, 13) if c!=0 else 13
        scores[:,c] = p_s[:,c] if c==0 else duration_integral_single(p_s[:,c], k=k)
    peaks = []
    for c in range(1,21):
        radius = max(10, MED_K.get(c,13)//2)
        s = scores[:,c]
        t_star = int(torch.argmax(s).item())
        t_ref = refine_com(p_s[:,c], t_star, w=5)
        t_idx = int(round(t_ref)); t_idx = min(max(t_idx, 0), T-1)
        local_mean = p_s[max(0,t_idx-radius):min(T,t_idx+radius+1), c].mean().item()
        peaks.append((c, t_ref, float(scores[t_idx, c].item()), float(local_mean)))
    peaks.sort(key=lambda x: (x[1], -x[2], -x[3]))
    return [c for c,_,_,_ in peaks]

id2seq = {int(r.Id): [int(x) for x in str(r.Sequence).strip().split()] for _, r in train_df.iterrows()}

def levenshtein(a,b):
    n,m=len(a),len(b)
    if n==0: return m
    if m==0: return n
    dp=list(range(m+1))
    for i in range(1,n+1):
        prev=dp[0]; dp[0]=i; ai=a[i-1]
        for j in range(1,m+1):
            tmp=dp[j]; dp[j]=min(dp[j]+1, dp[j-1]+1, prev + (0 if ai==b[j-1] else 1)); prev=tmp
    return dp[m]

def eval_val(model):
    model.eval(); tot=0; cnt=0; t0=time.time()
    with torch.no_grad():
        for xb, yb, mask, sids in val_loader:
            xb = xb.to(device); mask=mask.to(device)
            logits = model(xb)  # (B,T,C)
            probs = logits.softmax(dim=-1)
            for b, sid in enumerate(sids):
                T = int(mask[b].sum().item())
                seq = decode_video_probs_refined(probs[b,:T,:], pool_k=15, temp=0.9)
                tot += levenshtein(seq, id2seq[int(sid)]); cnt += 1
    print(f"  [val CE] {cnt} vids in {(time.time()-t0)/60:.2f}m", flush=True)
    return tot/max(cnt,1)

def train_ce_seed(seed=0, epochs=20, patience=3, ch=96, layers=10):
    random.seed(seed); np.random.seed(seed); torch.manual_seed(seed)
    model = DilatedTCN(d_in=D_in, channels=ch, layers=layers, num_classes=21, dropout=0.3).to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
    best=math.inf; best_state=None; bad=0
    for ep in range(1, epochs+1):
        model.train(); t0=time.time(); nb=0; tot_loss=0.0
        for it,(xb,yb,mask,sids) in enumerate(train_loader):
            xb=xb.to(device); yb= yb.to(device); mask=mask.to(device)
            opt.zero_grad(set_to_none=True)
            logits = model(xb)
            loss = ce_loss_ignore_bg(logits, yb, mask)
            loss.backward(); nn.utils.clip_grad_norm_(model.parameters(), 1.0); opt.step()
            tot_loss += float(loss.item()); nb += 1
            if (it+1)%20==0: print(f"ep{ep} it{it+1} loss={tot_loss/nb:.4f} elapsed={time.time()-t0:.1f}s", flush=True)
        val_lev = eval_val(model)
        print(f"Seed{seed} Epoch {ep}: train_loss={tot_loss/max(nb,1):.4f} val_lev={val_lev:.4f}", flush=True)
        if val_lev < best - 1e-4:
            best = val_lev; best_state = {k:v.detach().cpu() for k,v in model.state_dict().items()}; bad=0
            print(f"  New best (seed {seed}) val_lev={best:.4f}", flush=True)
        else:
            bad += 1
            if bad >= patience: print("Early stopping.", flush=True); break
    if best_state is not None: model.load_state_dict(best_state)
    outp = f"model_ce_tcn_s{seed}.pth"
    torch.save(model.state_dict(), outp)
    print(f"Saved {outp}; best val_lev={best:.4f}")
    return outp, best

# Train seeds
seeds=[0,1,2]; ckpts=[]; scores=[]
for s in seeds:
    p, sc = train_ce_seed(seed=s, epochs=20, patience=3, ch=96, layers=10)
    ckpts.append(p); scores.append(sc)
print("CE seed scores:", list(zip(seeds, scores)))

def ensemble_val_and_test_ce(ckpt_paths, weights=None):
    models=[]
    for p in ckpt_paths:
        m = DilatedTCN(d_in=D_in, channels=96, layers=10, num_classes=21, dropout=0.3).to(device)
        m.load_state_dict(torch.load(p, map_location=device)); m.eval(); models.append(m)
    if weights is None: weights=[1.0]*len(models)
    sw = float(sum(weights)); weights=[w/sw for w in weights]
    # VAL
    tot=0; cnt=0; t0=time.time()
    with torch.no_grad():
        for xb, yb, mask, sids in val_loader:
            xb=xb.to(device); mask=mask.to(device)
            probs_ens=None
            for w,m in zip(weights, models):
                p = m(xb).softmax(dim=-1)
                probs_ens = (w*p) if probs_ens is None else (probs_ens + w*p)
            for b, sid in enumerate(sids):
                T = int(mask[b].sum().item())
                seq = decode_video_probs_refined(probs_ens[b,:T,:], pool_k=15, temp=0.9)
                tot += levenshtein(seq, id2seq[int(sid)]); cnt += 1
    val_lev = tot/max(cnt,1)
    print(f"CE 3-seed ensemble VAL Levenshtein={val_lev:.4f} (norm ~{val_lev/20:.5f})", flush=True)
    # TEST
    test_ids = pd.read_csv('test.csv')['Id'].astype(int).tolist()
    rows=[]; t0=time.time()
    with torch.no_grad():
        for i, sid in enumerate(test_ids, 1):
            X = load_feat(int(sid), 'test', 1800)
            xb = torch.from_numpy(X).unsqueeze(0).to(device)
            probs_ens=None
            for w,m in zip(weights, models):
                p = m(xb)[0].softmax(dim=-1)
                probs_ens = (w*p) if probs_ens is None else (probs_ens + w*p)
            seq = decode_video_probs_refined(probs_ens, pool_k=15, temp=0.9)
            rows.append({'Id': int(sid), 'Sequence': ' '.join(str(x) for x in seq)})
            if (i%10)==0 or i==len(test_ids):
                print(f"  [test CE ens] {i}/{len(test_ids)} elapsed={(time.time()-t0)/60:.1f}m", flush=True)
    sub = pd.DataFrame(rows, columns=['Id','Sequence'])
    sub.to_csv('submission.csv', index=False)
    print('Wrote submission.csv; head:\n', sub.head())

print("=== CE seeds training complete; to run ensemble, execute this cell or call ensemble_val_and_test_ce(ckpts) in a new cell ===")

=== Train 3 CE DilatedTCN seeds and ensemble with improved decoder ===


Train videos: 253, Val videos: 44


ep1 it20 loss=3.4281 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed0 Epoch 1: train_loss=3.3908 val_lev=17.2500


  New best (seed 0) val_lev=17.2500


ep2 it20 loss=2.8129 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed0 Epoch 2: train_loss=2.7979 val_lev=14.0682


  New best (seed 0) val_lev=14.0682


ep3 it20 loss=2.4472 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed0 Epoch 3: train_loss=2.4370 val_lev=10.3864


  New best (seed 0) val_lev=10.3864


ep4 it20 loss=2.0752 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed0 Epoch 4: train_loss=2.0535 val_lev=9.3864


  New best (seed 0) val_lev=9.3864


ep5 it20 loss=1.8425 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed0 Epoch 5: train_loss=1.8369 val_lev=8.2955


  New best (seed 0) val_lev=8.2955


ep6 it20 loss=1.6757 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed0 Epoch 6: train_loss=1.6922 val_lev=7.8636


  New best (seed 0) val_lev=7.8636


ep7 it20 loss=1.5507 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed0 Epoch 7: train_loss=1.5676 val_lev=7.2273


  New best (seed 0) val_lev=7.2273


ep8 it20 loss=1.4680 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed0 Epoch 8: train_loss=1.4467 val_lev=6.6136


  New best (seed 0) val_lev=6.6136


ep9 it20 loss=1.3741 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed0 Epoch 9: train_loss=1.3875 val_lev=5.9773


  New best (seed 0) val_lev=5.9773


ep10 it20 loss=1.3021 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed0 Epoch 10: train_loss=1.3047 val_lev=6.0227


ep11 it20 loss=1.2392 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed0 Epoch 11: train_loss=1.2707 val_lev=6.0682


ep12 it20 loss=1.1808 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed0 Epoch 12: train_loss=1.1604 val_lev=5.5682


  New best (seed 0) val_lev=5.5682


ep13 it20 loss=1.1572 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed0 Epoch 13: train_loss=1.1314 val_lev=5.6136


ep14 it20 loss=1.0956 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed0 Epoch 14: train_loss=1.0962 val_lev=5.1818


  New best (seed 0) val_lev=5.1818


ep15 it20 loss=1.0652 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed0 Epoch 15: train_loss=1.0386 val_lev=5.2955


ep16 it20 loss=1.0082 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed0 Epoch 16: train_loss=1.0235 val_lev=5.2500


ep17 it20 loss=1.0231 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed0 Epoch 17: train_loss=1.0077 val_lev=4.7727


  New best (seed 0) val_lev=4.7727


ep18 it20 loss=0.9888 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed0 Epoch 18: train_loss=0.9921 val_lev=4.7955


ep19 it20 loss=0.9280 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed0 Epoch 19: train_loss=0.9234 val_lev=4.4545


  New best (seed 0) val_lev=4.4545


ep20 it20 loss=0.9072 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed0 Epoch 20: train_loss=0.9340 val_lev=5.0227


Saved model_ce_tcn_s0.pth; best val_lev=4.4545


ep1 it20 loss=3.4189 elapsed=0.9s


  [val CE] 44 vids in 0.01m


Seed1 Epoch 1: train_loss=3.3810 val_lev=17.2955


  New best (seed 1) val_lev=17.2955


ep2 it20 loss=2.7834 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed1 Epoch 2: train_loss=2.7790 val_lev=13.8182


  New best (seed 1) val_lev=13.8182


ep3 it20 loss=2.4219 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed1 Epoch 3: train_loss=2.3808 val_lev=10.9545


  New best (seed 1) val_lev=10.9545


ep4 it20 loss=2.0586 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed1 Epoch 4: train_loss=2.0177 val_lev=9.3409


  New best (seed 1) val_lev=9.3409


ep5 it20 loss=1.8065 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed1 Epoch 5: train_loss=1.8060 val_lev=8.1818


  New best (seed 1) val_lev=8.1818


ep6 it20 loss=1.6391 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed1 Epoch 6: train_loss=1.6239 val_lev=7.9545


  New best (seed 1) val_lev=7.9545


ep7 it20 loss=1.5237 elapsed=0.9s


  [val CE] 44 vids in 0.01m


Seed1 Epoch 7: train_loss=1.5237 val_lev=7.1591


  New best (seed 1) val_lev=7.1591


ep8 it20 loss=1.4200 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed1 Epoch 8: train_loss=1.4148 val_lev=7.1818


ep9 it20 loss=1.3697 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed1 Epoch 9: train_loss=1.3538 val_lev=7.1364


  New best (seed 1) val_lev=7.1364


ep10 it20 loss=1.3158 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed1 Epoch 10: train_loss=1.3156 val_lev=7.0455


  New best (seed 1) val_lev=7.0455


ep11 it20 loss=1.2497 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed1 Epoch 11: train_loss=1.2390 val_lev=6.2955


  New best (seed 1) val_lev=6.2955


ep12 it20 loss=1.2047 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed1 Epoch 12: train_loss=1.2260 val_lev=6.1136


  New best (seed 1) val_lev=6.1136


ep13 it20 loss=1.1604 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed1 Epoch 13: train_loss=1.1566 val_lev=5.7500


  New best (seed 1) val_lev=5.7500


ep14 it20 loss=1.1068 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed1 Epoch 14: train_loss=1.1232 val_lev=5.9091


ep15 it20 loss=1.0754 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed1 Epoch 15: train_loss=1.0800 val_lev=5.8409


ep16 it20 loss=1.0490 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed1 Epoch 16: train_loss=1.0498 val_lev=5.4091


  New best (seed 1) val_lev=5.4091


ep17 it20 loss=0.9897 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed1 Epoch 17: train_loss=1.0080 val_lev=5.1591


  New best (seed 1) val_lev=5.1591


ep18 it20 loss=0.9634 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed1 Epoch 18: train_loss=0.9550 val_lev=5.1591


ep19 it20 loss=0.9419 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed1 Epoch 19: train_loss=0.9488 val_lev=5.0909


  New best (seed 1) val_lev=5.0909


ep20 it20 loss=0.9321 elapsed=0.9s


  [val CE] 44 vids in 0.01m


Seed1 Epoch 20: train_loss=0.9333 val_lev=4.6136


  New best (seed 1) val_lev=4.6136


Saved model_ce_tcn_s1.pth; best val_lev=4.6136


ep1 it20 loss=3.4203 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed2 Epoch 1: train_loss=3.3915 val_lev=17.7045


  New best (seed 2) val_lev=17.7045


ep2 it20 loss=2.8858 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed2 Epoch 2: train_loss=2.8721 val_lev=14.4545


  New best (seed 2) val_lev=14.4545


ep3 it20 loss=2.4782 elapsed=0.9s


  [val CE] 44 vids in 0.01m


Seed2 Epoch 3: train_loss=2.4531 val_lev=11.6591


  New best (seed 2) val_lev=11.6591


ep4 it20 loss=2.1195 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed2 Epoch 4: train_loss=2.1406 val_lev=9.9318


  New best (seed 2) val_lev=9.9318


ep5 it20 loss=1.8914 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed2 Epoch 5: train_loss=1.8653 val_lev=9.1591


  New best (seed 2) val_lev=9.1591


ep6 it20 loss=1.7092 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed2 Epoch 6: train_loss=1.7235 val_lev=8.4545


  New best (seed 2) val_lev=8.4545


ep7 it20 loss=1.5989 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed2 Epoch 7: train_loss=1.5665 val_lev=7.9773


  New best (seed 2) val_lev=7.9773


ep8 it20 loss=1.4937 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed2 Epoch 8: train_loss=1.4881 val_lev=7.7273


  New best (seed 2) val_lev=7.7273


ep9 it20 loss=1.4162 elapsed=0.9s


  [val CE] 44 vids in 0.01m


Seed2 Epoch 9: train_loss=1.4135 val_lev=6.9091


  New best (seed 2) val_lev=6.9091


ep10 it20 loss=1.3416 elapsed=0.9s


  [val CE] 44 vids in 0.01m


Seed2 Epoch 10: train_loss=1.3444 val_lev=6.5000


  New best (seed 2) val_lev=6.5000


ep11 it20 loss=1.2781 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed2 Epoch 11: train_loss=1.2827 val_lev=6.3409


  New best (seed 2) val_lev=6.3409


ep12 it20 loss=1.2430 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed2 Epoch 12: train_loss=1.2172 val_lev=5.9773


  New best (seed 2) val_lev=5.9773


ep13 it20 loss=1.1911 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed2 Epoch 13: train_loss=1.1720 val_lev=6.1591


ep14 it20 loss=1.1639 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed2 Epoch 14: train_loss=1.1520 val_lev=5.5227


  New best (seed 2) val_lev=5.5227


ep15 it20 loss=1.0988 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed2 Epoch 15: train_loss=1.0987 val_lev=5.4318


  New best (seed 2) val_lev=5.4318


ep16 it20 loss=1.0733 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed2 Epoch 16: train_loss=1.0704 val_lev=5.1364


  New best (seed 2) val_lev=5.1364


ep17 it20 loss=1.0300 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed2 Epoch 17: train_loss=1.0300 val_lev=5.1591


ep18 it20 loss=0.9820 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed2 Epoch 18: train_loss=0.9954 val_lev=5.1136


  New best (seed 2) val_lev=5.1136


ep19 it20 loss=0.9772 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed2 Epoch 19: train_loss=0.9703 val_lev=5.2045


ep20 it20 loss=0.9313 elapsed=0.8s


  [val CE] 44 vids in 0.01m


Seed2 Epoch 20: train_loss=0.9177 val_lev=4.5227


  New best (seed 2) val_lev=4.5227


Saved model_ce_tcn_s2.pth; best val_lev=4.5227
CE seed scores: [(0, 4.454545454545454), (1, 4.613636363636363), (2, 4.5227272727272725)]
=== CE seeds training complete; to run ensemble, execute this cell or call ensemble_val_and_test_ce(ckpts) in a new cell ===


In [33]:
print("=== Ensembling CE 3 seeds with improved decoder and generating submission ===", flush=True)
ckpts = ["model_ce_tcn_s0.pth", "model_ce_tcn_s1.pth", "model_ce_tcn_s2.pth"]
from pathlib import Path
for p in ckpts:
    assert Path(p).exists(), f"Missing checkpoint {p}"
ensemble_val_and_test_ce(ckpts)
print("=== CE 3-seed ensemble complete ===")

=== Ensembling CE 3 seeds with improved decoder and generating submission ===


  m.load_state_dict(torch.load(p, map_location=device)); m.eval(); models.append(m)


CE 3-seed ensemble VAL Levenshtein=4.3864 (norm ~0.21932)


  [test CE ens] 10/95 elapsed=0.0m


  [test CE ens] 20/95 elapsed=0.0m


  [test CE ens] 30/95 elapsed=0.0m


  [test CE ens] 40/95 elapsed=0.0m


  [test CE ens] 50/95 elapsed=0.0m


  [test CE ens] 60/95 elapsed=0.0m


  [test CE ens] 70/95 elapsed=0.0m


  [test CE ens] 80/95 elapsed=0.0m


  [test CE ens] 90/95 elapsed=0.0m


  [test CE ens] 95/95 elapsed=0.0m


Wrote submission.csv; head:
     Id                                           Sequence
0  300  5 9 7 1 2 18 3 8 4 20 13 12 15 14 11 6 16 19 1...
1  301  10 12 3 1 5 4 20 6 2 11 15 13 19 7 9 8 18 14 1...
2  302  1 17 16 12 3 5 19 13 20 18 11 4 6 15 8 14 10 9...
3  303  13 4 12 3 10 14 5 19 15 20 17 1 11 16 8 18 7 6...
4  304  8 1 7 12 18 13 9 2 11 3 20 19 5 14 6 15 17 16 ...
=== CE 3-seed ensemble complete ===


In [34]:
import time, math
from pathlib import Path
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F

print("=== Geometric ensemble: CE (3 seeds) + MS-TCN++ (3 seeds) with refined decoder ===", flush=True)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

feat_tr_dir = Path('features3d_v2')/'train'
feat_te_dir = Path('features3d_v2')/'test'
lab_tr_dir  = Path('labels3d_v2')/'train'

train_df = pd.read_csv('training.csv')
all_ids = train_df['Id'].astype(int).tolist()
import random
random.seed(42); np.random.seed(42)
random.shuffle(all_ids)
val_ratio = 0.15
val_n = max(30, int(len(all_ids)*val_ratio))
val_ids = all_ids[:val_n]

def load_feat(sample_id: int, split='train', max_T=1800):
    p = (feat_tr_dir if split=='train' else feat_te_dir)/f"{sample_id}.npz"
    d = np.load(p)
    X = d['X'].astype(np.float32)
    return X[:max_T] if X.shape[0] > max_T else X

def compute_class_median_durations():
    dur_by_c = {c: [] for c in range(1,21)}
    ids = train_df['Id'].astype(int).tolist()
    for sid in ids:
        y = np.load(lab_tr_dir/f"{sid}.npy").astype(np.int16)
        for c in range(1,21):
            cnt = int((y==c).sum())
            if cnt>0: dur_by_c[c].append(cnt)
    med = {}
    for c in range(1,21):
        med[c] = int(np.clip(np.median(dur_by_c[c]) if len(dur_by_c[c])>0 else 13, 9, 25))
    return med

MED_K = compute_class_median_durations()

def avg_pool_probs(p_t_c: torch.Tensor, k: int = 15) -> torch.Tensor:
    x = p_t_c.unsqueeze(0).transpose(1,2)
    y = F.avg_pool1d(x, kernel_size=k, stride=1, padding=k//2)
    return y.transpose(1,2).squeeze(0)

def duration_integral_single(p_t: torch.Tensor, k: int) -> torch.Tensor:
    x = p_t.view(1,1,-1)
    w = torch.ones(1,1,k, device=p_t.device, dtype=p_t.dtype) / float(k)
    y = F.conv1d(x, w, padding=k//2)
    return y.view(-1)

def refine_com(p: torch.Tensor, t_star: int, w: int = 5) -> float:
    T = p.shape[0]
    a = max(0, t_star - w); b = min(T-1, t_star + w)
    idx = torch.arange(a, b+1, device=p.device, dtype=p.dtype)
    seg = p[a:b+1]; s = seg.sum() + 1e-8
    return float(((idx * seg).sum() / s).item())

def decode_video_probs_refined(p_t_c: torch.Tensor, pool_k=15, temp=0.95):
    if temp != 1.0:
        p_t_c = (p_t_c ** (1.0/temp)); p_t_c = p_t_c / (p_t_c.sum(dim=-1, keepdim=True) + 1e-8)
    p_s = avg_pool_probs(p_t_c, k=pool_k)
    T,C = p_s.shape
    scores = torch.empty_like(p_s)
    for c in range(C):
        k = MED_K.get(c, 13) if c!=0 else 13
        scores[:,c] = p_s[:,c] if c==0 else duration_integral_single(p_s[:,c], k=k)
    peaks = []
    for c in range(1,21):
        radius = max(10, MED_K.get(c,13)//2)
        s = scores[:,c]
        t_star = int(torch.argmax(s).item())
        t_ref = refine_com(p_s[:,c], t_star, w=5)
        t_idx = int(round(t_ref)); t_idx = min(max(t_idx, 0), T-1)
        local_mean = p_s[max(0,t_idx-radius):min(T,t_idx+radius+1), c].mean().item()
        peaks.append((c, t_ref, float(scores[t_idx, c].item()), float(local_mean)))
    peaks.sort(key=lambda x: (x[1], -x[2], -x[3]))
    return [c for c,_,_,_ in peaks]

D_in = np.load(next(iter((feat_tr_dir).glob('*.npz'))))['X'].shape[1]

class DilatedTCN(nn.Module):
    def __init__(self, d_in, channels=96, layers=10, num_classes=21, dropout=0.3):
        super().__init__()
        self.inp = nn.Conv1d(d_in, channels, kernel_size=1)
        blocks = []; dil=1
        for _ in range(layers):
            blocks.append(nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=3, padding=dil, dilation=dil),
                nn.GroupNorm(num_groups=8, num_channels=channels),
                nn.ReLU(inplace=True),
                nn.Dropout(dropout),
                nn.Conv1d(channels, channels, kernel_size=1),
                nn.GroupNorm(num_groups=8, num_channels=channels),
                nn.ReLU(inplace=True),
            ))
            dil = min(dil*2, 512)
        self.blocks = nn.ModuleList(blocks)
        self.head = nn.Conv1d(channels, num_classes, kernel_size=1)
    def forward(self, x_b_t_d):
        x = x_b_t_d.transpose(1,2)
        h = self.inp(x)
        for blk in self.blocks:
            res = h; h = blk(h); h = h + res
        logits = self.head(h)
        return logits.transpose(1,2)

class DilatedResBlock(nn.Module):
    def __init__(self, ch, dilation, drop=0.3, groups=8, k=3):
        super().__init__()
        self.conv1 = nn.Conv1d(ch, ch, kernel_size=k, padding=dilation, dilation=dilation)
        self.gn1 = nn.GroupNorm(groups, ch)
        self.conv2 = nn.Conv1d(ch, ch, kernel_size=1)
        self.gn2 = nn.GroupNorm(groups, ch)
        self.drop = nn.Dropout(drop)
    def forward(self, x):
        h = self.conv1(x); h = self.gn1(h); h = F.relu(h, inplace=True); h = self.drop(h)
        h = self.conv2(h); h = self.gn2(h); h = F.relu(h, inplace=True)
        return x + h

class Stage(nn.Module):
    def __init__(self, in_ch, ch=128, layers=10, drop=0.3):
        super().__init__()
        self.inp = nn.Conv1d(in_ch, ch, kernel_size=1)
        blocks = []; dil=1
        for _ in range(layers):
            blocks.append(DilatedResBlock(ch, dil, drop=drop))
            dil = min(dil*2, 512)
        self.blocks = nn.ModuleList(blocks)
        self.head = nn.Conv1d(ch, 21, kernel_size=1)
    def forward(self, x):
        h = self.inp(x)
        for blk in self.blocks:
            h = blk(h)
        return self.head(h)

class MSTCNPP(nn.Module):
    def __init__(self, d_in, stages=4, ch=128, layers=10, drop=0.3):
        super().__init__()
        self.stages = nn.ModuleList()
        self.input_proj = nn.Conv1d(d_in, d_in, kernel_size=1)
        self.stages.append(Stage(d_in, ch=ch, layers=layers, drop=drop))
        for _ in range(stages-1):
            self.stages.append(Stage(21, ch=ch, layers=layers, drop=drop))
    def forward(self, x_b_t_d):
        x = x_b_t_d.transpose(1,2)
        x = self.input_proj(x)
        logits_list = []
        prev = self.stages[0](x)
        logits_list.append(prev.transpose(1,2))
        for s in range(1, len(self.stages)):
            probs = prev.softmax(dim=1)
            prev = self.stages[s](probs)
            logits_list.append(prev.transpose(1,2))
        return logits_list

id2seq = {int(r.Id): [int(x) for x in str(r.Sequence).strip().split()] for _, r in train_df.iterrows()}

def levenshtein(a, b):
    n, m = len(a), len(b)
    if n==0: return m
    if m==0: return n
    dp = list(range(m+1))
    for i in range(1, n+1):
        prev = dp[0]; dp[0] = i; ai = a[i-1]
        for j in range(1, m+1):
            tmp = dp[j]
            dp[j] = min(dp[j]+1, dp[j-1]+1, prev + (0 if ai==b[j-1] else 1))
            prev = tmp
    return dp[m]

def load_models():
    ce_paths = ["model_ce_tcn_s0.pth", "model_ce_tcn_s1.pth", "model_ce_tcn_s2.pth"]
    ms_paths = ["model_mstcnpp_s0.pth", "model_mstcnpp_s1.pth", "model_mstcnpp_s2.pth"]
    for p in ce_paths + ms_paths:
        assert Path(p).exists(), f"Missing checkpoint {p}"
    ce_models = []
    for p in ce_paths:
        m = DilatedTCN(d_in=D_in, channels=96, layers=10, num_classes=21, dropout=0.3).to(device)
        m.load_state_dict(torch.load(p, map_location=device)); m.eval(); ce_models.append(m)
    ms_models = []
    for p in ms_paths:
        m = MSTCNPP(d_in=D_in, stages=4, ch=128, layers=10, drop=0.3).to(device)
        m.load_state_dict(torch.load(p, map_location=device)); m.eval(); ms_models.append(m)
    return ce_models, ms_models

def geo_mean_ensemble_probs(xb, ce_models, ms_models, w_ce=0.7, w_ms=0.3):
    # compute geometric mean across models and families
    # probs_final ∝ (prod_ce p_ce)^(w_ce/len_ce) * (prod_ms p_ms)^(w_ms/len_ms)
    with torch.no_grad():
        ce_log = None
        for m in ce_models:
            p = m(xb)[0].softmax(dim=-1)
            ce_log = torch.log(p + 1e-8) if ce_log is None else ce_log + torch.log(p + 1e-8)
        ce_log = ce_log / max(len(ce_models),1)
        ms_log = None
        for m in ms_models:
            p = m(xb)[-1][0].softmax(dim=-1)
            ms_log = torch.log(p + 1e-8) if ms_log is None else ms_log + torch.log(p + 1e-8)
        ms_log = ms_log / max(len(ms_models),1)
        log_comb = w_ce*ce_log + w_ms*ms_log
        probs = torch.exp(log_comb); probs = probs / (probs.sum(dim=-1, keepdim=True) + 1e-8)
        return probs

ce_models, ms_models = load_models()

def eval_val_geo():
    tot=0; cnt=0; t0=time.time()
    with torch.no_grad():
        for sid in val_ids:
            X = load_feat(sid, 'train', 1800)
            xb = torch.from_numpy(X).unsqueeze(0).to(device)
            probs = geo_mean_ensemble_probs(xb, ce_models, ms_models, w_ce=0.7, w_ms=0.3)
            seq = decode_video_probs_refined(probs, pool_k=15, temp=0.95)
            tgt = id2seq[int(sid)]
            tot += levenshtein(seq, tgt); cnt += 1
            if (cnt%10)==0 or cnt==len(val_ids):
                print(f"  [val geo] {cnt}/{len(val_ids)}", flush=True)
    return tot/max(cnt,1)

val_lev = eval_val_geo()
print(f"Geo-ensemble (CE^0.7 * MS^0.3) VAL Levenshtein={val_lev:.4f} (norm ~{val_lev/20:.5f})")

print("=== TEST inference for CE+MS geometric ensemble ===", flush=True)
test_ids = pd.read_csv('test.csv')['Id'].astype(int).tolist()
rows=[]; t0=time.time()
with torch.no_grad():
    for i, sid in enumerate(test_ids, 1):
        X = load_feat(int(sid), 'test', 1800)
        xb = torch.from_numpy(X).unsqueeze(0).to(device)
        probs = geo_mean_ensemble_probs(xb, ce_models, ms_models, w_ce=0.7, w_ms=0.3)
        seq = decode_video_probs_refined(probs, pool_k=15, temp=0.95)
        rows.append({'Id': int(sid), 'Sequence': ' '.join(str(x) for x in seq)})
        if (i%10)==0 or i==len(test_ids):
            print(f"  [test geo ens] {i}/{len(test_ids)} elapsed={(time.time()-t0)/60:.1f}m", flush=True)
sub = pd.DataFrame(rows, columns=['Id','Sequence'])
sub.to_csv('submission.csv', index=False)
print('Wrote submission.csv; head:\n', sub.head())
print('=== Geometric ensemble complete ===')

=== Geometric ensemble: CE (3 seeds) + MS-TCN++ (3 seeds) with refined decoder ===


  m.load_state_dict(torch.load(p, map_location=device)); m.eval(); ce_models.append(m)
  m.load_state_dict(torch.load(p, map_location=device)); m.eval(); ms_models.append(m)


  [val geo] 10/44


  [val geo] 20/44


  [val geo] 30/44


  [val geo] 40/44


  [val geo] 44/44


Geo-ensemble (CE^0.7 * MS^0.3) VAL Levenshtein=4.3182 (norm ~0.21591)
=== TEST inference for CE+MS geometric ensemble ===


  [test geo ens] 10/95 elapsed=0.0m


  [test geo ens] 20/95 elapsed=0.0m


  [test geo ens] 30/95 elapsed=0.0m


  [test geo ens] 40/95 elapsed=0.1m


  [test geo ens] 50/95 elapsed=0.1m


  [test geo ens] 60/95 elapsed=0.1m


  [test geo ens] 70/95 elapsed=0.1m


  [test geo ens] 80/95 elapsed=0.1m


  [test geo ens] 90/95 elapsed=0.1m


  [test geo ens] 95/95 elapsed=0.1m


Wrote submission.csv; head:
     Id                                           Sequence
0  300  5 9 7 1 2 18 3 8 4 20 13 12 15 14 11 6 16 19 1...
1  301  10 12 3 1 5 4 20 6 2 11 15 13 19 9 7 8 18 14 1...
2  302  1 17 16 12 3 5 19 13 20 18 11 4 6 15 8 14 10 9...
3  303  13 4 12 3 10 5 19 15 20 17 1 11 16 8 18 7 6 2 ...
4  304  8 1 7 12 18 13 9 2 11 3 20 19 5 14 6 15 17 16 ...
=== Geometric ensemble complete ===


In [35]:
import time, math
from pathlib import Path
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F

print("=== V2 sweep: Geometric ensemble variants (pruned MS seeds, CE-biased weights, small decoder grid) ===", flush=True)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

feat_tr_dir = Path('features3d_v2')/'train'
feat_te_dir = Path('features3d_v2')/'test'
lab_tr_dir  = Path('labels3d_v2')/'train'

train_df = pd.read_csv('training.csv')
all_ids = train_df['Id'].astype(int).tolist()
import random
random.seed(42); np.random.seed(42)
random.shuffle(all_ids)
val_ratio = 0.15
val_n = max(30, int(len(all_ids)*val_ratio))
val_ids = all_ids[:val_n]

def load_feat(sample_id: int, split='train', max_T=1800):
    p = (feat_tr_dir if split=='train' else feat_te_dir)/f"{sample_id}.npz"
    d = np.load(p)
    X = d['X'].astype(np.float32)
    return X[:max_T] if X.shape[0] > max_T else X

def compute_class_median_durations():
    dur_by_c = {c: [] for c in range(1,21)}
    ids = train_df['Id'].astype(int).tolist()
    for sid in ids:
        y = np.load(lab_tr_dir/f"{sid}.npy").astype(np.int16)
        for c in range(1,21):
            cnt = int((y==c).sum())
            if cnt>0: dur_by_c[c].append(cnt)
    med = {}
    for c in range(1,21):
        med[c] = int(np.clip(np.median(dur_by_c[c]) if len(dur_by_c[c])>0 else 13, 9, 25))
    return med

MED_K = compute_class_median_durations()

def avg_pool_probs(p_t_c: torch.Tensor, k: int) -> torch.Tensor:
    x = p_t_c.unsqueeze(0).transpose(1,2)
    y = F.avg_pool1d(x, kernel_size=k, stride=1, padding=k//2)
    return y.transpose(1,2).squeeze(0)

def duration_integral_single(p_t: torch.Tensor, k: int) -> torch.Tensor:
    x = p_t.view(1,1,-1)
    w = torch.ones(1,1,k, device=p_t.device, dtype=p_t.dtype) / float(k)
    y = F.conv1d(x, w, padding=k//2)
    return y.view(-1)

def refine_com(p: torch.Tensor, t_star: int, w: int = 5) -> float:
    T = p.shape[0]
    a = max(0, t_star - w); b = min(T-1, t_star + w)
    idx = torch.arange(a, b+1, device=p.device, dtype=p.dtype)
    seg = p[a:b+1]; s = seg.sum() + 1e-8
    return float(((idx * seg).sum() / s).item())

def decode_video_probs_refined(p_t_c: torch.Tensor, pool_k=15, temp=0.95):
    if temp != 1.0:
        p_t_c = (p_t_c ** (1.0/temp)); p_t_c = p_t_c / (p_t_c.sum(dim=-1, keepdim=True) + 1e-8)
    p_s = avg_pool_probs(p_t_c, k=pool_k)
    T,C = p_s.shape
    scores = torch.empty_like(p_s)
    for c in range(C):
        k = MED_K.get(c, 13) if c!=0 else 13
        scores[:,c] = p_s[:,c] if c==0 else duration_integral_single(p_s[:,c], k=k)
    peaks = []
    for c in range(1,21):
        radius = max(10, MED_K.get(c,13)//2)
        s = scores[:,c]
        t_star = int(torch.argmax(s).item())
        t_ref = refine_com(p_s[:,c], t_star, w=5)
        t_idx = int(round(t_ref)); t_idx = min(max(t_idx, 0), T-1)
        local_mean = p_s[max(0,t_idx-radius):min(T,t_idx+radius+1), c].mean().item()
        peaks.append((c, t_ref, float(scores[t_idx, c].item()), float(local_mean)))
    peaks.sort(key=lambda x: (x[1], -x[2], -x[3]))
    return [c for c,_,_,_ in peaks]

D_in = np.load(next(iter((feat_tr_dir).glob('*.npz'))))['X'].shape[1]

class DilatedTCN(nn.Module):
    def __init__(self, d_in, channels=96, layers=10, num_classes=21, dropout=0.3):
        super().__init__()
        self.inp = nn.Conv1d(d_in, channels, kernel_size=1)
        blocks = []; dil=1
        for _ in range(layers):
            blocks.append(nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=3, padding=dil, dilation=dil),
                nn.GroupNorm(num_groups=8, num_channels=channels),
                nn.ReLU(inplace=True),
                nn.Dropout(dropout),
                nn.Conv1d(channels, channels, kernel_size=1),
                nn.GroupNorm(num_groups=8, num_channels=channels),
                nn.ReLU(inplace=True),
            ))
            dil = min(dil*2, 512)
        self.blocks = nn.ModuleList(blocks)
        self.head = nn.Conv1d(channels, num_classes, kernel_size=1)
    def forward(self, x_b_t_d):
        x = x_b_t_d.transpose(1,2)
        h = self.inp(x)
        for blk in self.blocks:
            res = h; h = blk(h); h = h + res
        logits = self.head(h)
        return logits.transpose(1,2)

class DilatedResBlock(nn.Module):
    def __init__(self, ch, dilation, drop=0.3, groups=8, k=3):
        super().__init__()
        self.conv1 = nn.Conv1d(ch, ch, kernel_size=k, padding=dilation, dilation=dilation)
        self.gn1 = nn.GroupNorm(groups, ch)
        self.conv2 = nn.Conv1d(ch, ch, kernel_size=1)
        self.gn2 = nn.GroupNorm(groups, ch)
        self.drop = nn.Dropout(drop)
    def forward(self, x):
        h = self.conv1(x); h = self.gn1(h); h = F.relu(h, inplace=True); h = self.drop(h)
        h = self.conv2(h); h = self.gn2(h); h = F.relu(h, inplace=True)
        return x + h

class Stage(nn.Module):
    def __init__(self, in_ch, ch=128, layers=10, drop=0.3):
        super().__init__()
        self.inp = nn.Conv1d(in_ch, ch, kernel_size=1)
        blocks = []; dil=1
        for _ in range(layers):
            blocks.append(DilatedResBlock(ch, dil, drop=drop))
            dil = min(dil*2, 512)
        self.blocks = nn.ModuleList(blocks)
        self.head = nn.Conv1d(ch, 21, kernel_size=1)
    def forward(self, x):
        h = self.inp(x)
        for blk in self.blocks:
            h = blk(h)
        return self.head(h)

class MSTCNPP(nn.Module):
    def __init__(self, d_in, stages=4, ch=128, layers=10, drop=0.3):
        super().__init__()
        self.stages = nn.ModuleList()
        self.input_proj = nn.Conv1d(d_in, d_in, kernel_size=1)
        self.stages.append(Stage(d_in, ch=ch, layers=layers, drop=drop))
        for _ in range(stages-1):
            self.stages.append(Stage(21, ch=ch, layers=layers, drop=drop))
    def forward(self, x_b_t_d):
        x = x_b_t_d.transpose(1,2)
        x = self.input_proj(x)
        logits_list = []
        prev = self.stages[0](x)
        logits_list.append(prev.transpose(1,2))
        for s in range(1, len(self.stages)):
            probs = prev.softmax(dim=1)
            prev = self.stages[s](probs)
            logits_list.append(prev.transpose(1,2))
        return logits_list

def levenshtein(a, b):
    n, m = len(a), len(b)
    if n==0: return m
    if m==0: return n
    dp = list(range(m+1))
    for i in range(1, n+1):
        prev = dp[0]; dp[0] = i; ai = a[i-1]
        for j in range(1, m+1):
            tmp = dp[j]
            dp[j] = min(dp[j]+1, dp[j-1]+1, prev + (0 if ai==b[j-1] else 1))
            prev = tmp
    return dp[m]

id2seq = {int(r.Id): [int(x) for x in str(r.Sequence).strip().split()] for _, r in train_df.iterrows()}

def load_models(ce_ckpts, ms_ckpts):
    ce_models = []
    for p in ce_ckpts:
        m = DilatedTCN(d_in=D_in, channels=96, layers=10, num_classes=21, dropout=0.3).to(device)
        m.load_state_dict(torch.load(p, map_location=device)); m.eval(); ce_models.append(m)
    ms_models = []
    for p in ms_ckpts:
        m = MSTCNPP(d_in=D_in, stages=4, ch=128, layers=10, drop=0.3).to(device)
        m.load_state_dict(torch.load(p, map_location=device)); m.eval(); ms_models.append(m)
    return ce_models, ms_models

def geo_mean_probs(xb, ce_models, ms_models, w_ce=0.8, w_ms=0.2):
    with torch.no_grad():
        ce_log=None
        for m in ce_models:
            p = m(xb)[0].softmax(dim=-1)
            ce_log = torch.log(p + 1e-8) if ce_log is None else ce_log + torch.log(p + 1e-8)
        ce_log = ce_log / max(len(ce_models),1)
        ms_log=None
        for m in ms_models:
            p = m(xb)[-1][0].softmax(dim=-1)
            ms_log = torch.log(p + 1e-8) if ms_log is None else ms_log + torch.log(p + 1e-8)
        ms_log = ms_log / max(len(ms_models),1) if len(ms_models)>0 else 0.0
        if len(ms_models)==0:
            log_comb = ce_log
        else:
            log_comb = w_ce*ce_log + w_ms*ms_log
        probs = torch.exp(log_comb); probs = probs / (probs.sum(dim=-1, keepdim=True) + 1e-8)
        return probs

# Configs per expert advice
ce_ckpts = ["model_ce_tcn_s0.pth", "model_ce_tcn_s1.pth", "model_ce_tcn_s2.pth"]
ms_all = {
    's2': "model_mstcnpp_s2.pth",
    's1': "model_mstcnpp_s1.pth",
    's0': "model_mstcnpp_s0.pth",
}
ms_sets = [ ['s2'], ['s2','s1'] ]
weights = [ (0.8,0.2), (0.9,0.1) ]
pool_ks = [13,15]
temps = [0.90, 0.95]

best = (1e9, None)
results = []

for ms_set in ms_sets:
    ms_ckpts = [ms_all[k] for k in ms_set]
    for w_ce, w_ms in weights:
        # load once per weight-set to avoid reloading in pool/temp loop
        ce_models, ms_models = load_models(ce_ckpts, ms_ckpts)
        for pool_k in pool_ks:
            for temp in temps:
                t0=time.time()
                tot=0; cnt=0
                with torch.no_grad():
                    for sid in val_ids:
                        X = load_feat(sid, 'train', 1800)
                        xb = torch.from_numpy(X).unsqueeze(0).to(device)
                        probs = geo_mean_probs(xb, ce_models, ms_models, w_ce=w_ce, w_ms=w_ms)
                        seq = decode_video_probs_refined(probs, pool_k=pool_k, temp=temp)
                        tgt = id2seq[int(sid)]
                        tot += levenshtein(seq, tgt); cnt += 1
                val_lev = tot/max(cnt,1)
                cfg = dict(ms_set=ms_set, w_ce=w_ce, w_ms=w_ms, pool_k=pool_k, temp=temp)
                results.append((val_lev, cfg))
                print(f"  [val V2] lev={val_lev:.4f} cfg={cfg} elapsed={(time.time()-t0):.1f}s", flush=True)
                if val_lev < best[0]:
                    best = (val_lev, cfg)

results.sort(key=lambda x: x[0])
print("=== Top 5 V2 configs ===")
for r in results[:5]:
    print(r)
print("BEST:", best)

# Build TEST submission using best config
best_val, best_cfg = best
print(f"=== TEST inference with BEST V2 cfg: {best_cfg} (val_lev={best_val:.4f}) ===", flush=True)
ms_ckpts = [ms_all[k] for k in best_cfg['ms_set']]
ce_models, ms_models = load_models(ce_ckpts, ms_ckpts)
test_ids = pd.read_csv('test.csv')['Id'].astype(int).tolist()
rows=[]; t0=time.time()
with torch.no_grad():
    for i, sid in enumerate(test_ids, 1):
        X = load_feat(int(sid), 'test', 1800)
        xb = torch.from_numpy(X).unsqueeze(0).to(device)
        probs = geo_mean_probs(xb, ce_models, ms_models, w_ce=best_cfg['w_ce'], w_ms=best_cfg['w_ms'])
        seq = decode_video_probs_refined(probs, pool_k=best_cfg['pool_k'], temp=best_cfg['temp'])
        rows.append({'Id': int(sid), 'Sequence': ' '.join(str(x) for x in seq)})
        if (i%10)==0 or i==len(test_ids):
            print(f"  [test V2] {i}/{len(test_ids)} elapsed={(time.time()-t0)/60:.1f}m", flush=True)
sub = pd.DataFrame(rows, columns=['Id','Sequence'])
# Format assertions
assert len(sub)==95
assert all(len(s.split())==20 and len(set(s.split()))==20 and all(1<=int(t)<=20 for t in s.split()) for s in sub.Sequence), "Submission row format invalid"
sub.to_csv('submission.csv', index=False)
print('Wrote submission.csv with BEST V2 cfg; head:\n', sub.head())
print('=== V2 sweep complete ===')

=== V2 sweep: Geometric ensemble variants (pruned MS seeds, CE-biased weights, small decoder grid) ===


  m.load_state_dict(torch.load(p, map_location=device)); m.eval(); ce_models.append(m)
  m.load_state_dict(torch.load(p, map_location=device)); m.eval(); ms_models.append(m)


  [val V2] lev=4.3636 cfg={'ms_set': ['s2'], 'w_ce': 0.8, 'w_ms': 0.2, 'pool_k': 13, 'temp': 0.9} elapsed=1.2s


  [val V2] lev=4.2500 cfg={'ms_set': ['s2'], 'w_ce': 0.8, 'w_ms': 0.2, 'pool_k': 13, 'temp': 0.95} elapsed=1.1s


  [val V2] lev=4.2955 cfg={'ms_set': ['s2'], 'w_ce': 0.8, 'w_ms': 0.2, 'pool_k': 15, 'temp': 0.9} elapsed=1.2s


  [val V2] lev=4.2273 cfg={'ms_set': ['s2'], 'w_ce': 0.8, 'w_ms': 0.2, 'pool_k': 15, 'temp': 0.95} elapsed=1.2s


  [val V2] lev=4.1364 cfg={'ms_set': ['s2'], 'w_ce': 0.9, 'w_ms': 0.1, 'pool_k': 13, 'temp': 0.9} elapsed=1.1s


  [val V2] lev=4.1818 cfg={'ms_set': ['s2'], 'w_ce': 0.9, 'w_ms': 0.1, 'pool_k': 13, 'temp': 0.95} elapsed=1.1s


  [val V2] lev=4.1364 cfg={'ms_set': ['s2'], 'w_ce': 0.9, 'w_ms': 0.1, 'pool_k': 15, 'temp': 0.9} elapsed=1.1s


  [val V2] lev=4.1818 cfg={'ms_set': ['s2'], 'w_ce': 0.9, 'w_ms': 0.1, 'pool_k': 15, 'temp': 0.95} elapsed=1.1s


  [val V2] lev=4.2273 cfg={'ms_set': ['s2', 's1'], 'w_ce': 0.8, 'w_ms': 0.2, 'pool_k': 13, 'temp': 0.9} elapsed=1.5s


  [val V2] lev=4.2045 cfg={'ms_set': ['s2', 's1'], 'w_ce': 0.8, 'w_ms': 0.2, 'pool_k': 13, 'temp': 0.95} elapsed=1.5s


  [val V2] lev=4.2273 cfg={'ms_set': ['s2', 's1'], 'w_ce': 0.8, 'w_ms': 0.2, 'pool_k': 15, 'temp': 0.9} elapsed=1.5s


  [val V2] lev=4.2045 cfg={'ms_set': ['s2', 's1'], 'w_ce': 0.8, 'w_ms': 0.2, 'pool_k': 15, 'temp': 0.95} elapsed=1.5s


  [val V2] lev=4.1364 cfg={'ms_set': ['s2', 's1'], 'w_ce': 0.9, 'w_ms': 0.1, 'pool_k': 13, 'temp': 0.9} elapsed=1.5s


  [val V2] lev=4.1364 cfg={'ms_set': ['s2', 's1'], 'w_ce': 0.9, 'w_ms': 0.1, 'pool_k': 13, 'temp': 0.95} elapsed=1.5s


  [val V2] lev=4.1364 cfg={'ms_set': ['s2', 's1'], 'w_ce': 0.9, 'w_ms': 0.1, 'pool_k': 15, 'temp': 0.9} elapsed=1.6s


  [val V2] lev=4.1591 cfg={'ms_set': ['s2', 's1'], 'w_ce': 0.9, 'w_ms': 0.1, 'pool_k': 15, 'temp': 0.95} elapsed=1.5s


=== Top 5 V2 configs ===
(4.136363636363637, {'ms_set': ['s2'], 'w_ce': 0.9, 'w_ms': 0.1, 'pool_k': 13, 'temp': 0.9})
(4.136363636363637, {'ms_set': ['s2'], 'w_ce': 0.9, 'w_ms': 0.1, 'pool_k': 15, 'temp': 0.9})
(4.136363636363637, {'ms_set': ['s2', 's1'], 'w_ce': 0.9, 'w_ms': 0.1, 'pool_k': 13, 'temp': 0.9})
(4.136363636363637, {'ms_set': ['s2', 's1'], 'w_ce': 0.9, 'w_ms': 0.1, 'pool_k': 13, 'temp': 0.95})
(4.136363636363637, {'ms_set': ['s2', 's1'], 'w_ce': 0.9, 'w_ms': 0.1, 'pool_k': 15, 'temp': 0.9})
BEST: (4.136363636363637, {'ms_set': ['s2'], 'w_ce': 0.9, 'w_ms': 0.1, 'pool_k': 13, 'temp': 0.9})
=== TEST inference with BEST V2 cfg: {'ms_set': ['s2'], 'w_ce': 0.9, 'w_ms': 0.1, 'pool_k': 13, 'temp': 0.9} (val_lev=4.1364) ===


  [test V2] 10/95 elapsed=0.0m


  [test V2] 20/95 elapsed=0.0m


  [test V2] 30/95 elapsed=0.0m


  [test V2] 40/95 elapsed=0.0m


  [test V2] 50/95 elapsed=0.0m


  [test V2] 60/95 elapsed=0.0m


  [test V2] 70/95 elapsed=0.0m


  [test V2] 80/95 elapsed=0.0m


  [test V2] 90/95 elapsed=0.0m


  [test V2] 95/95 elapsed=0.0m


Wrote submission.csv with BEST V2 cfg; head:
     Id                                           Sequence
0  300  5 9 7 1 2 18 3 8 4 20 13 12 15 14 11 6 16 19 1...
1  301  10 12 3 1 5 4 20 6 2 11 15 13 19 7 9 8 18 14 1...
2  302  1 17 16 12 3 5 7 19 13 20 18 11 4 6 15 8 14 10...
3  303  13 4 12 3 10 5 19 15 20 17 1 11 16 8 18 7 6 2 ...
4  304  8 1 7 12 18 13 9 2 11 3 20 19 5 14 6 15 17 16 ...
=== V2 sweep complete ===


In [36]:
import time, math
from pathlib import Path
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F

print("=== CE-only 3-seed ensemble with simple TTA (frame shifts -1/0/+1) and refined decoder ===", flush=True)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

feat_tr_dir = Path('features3d_v2')/'train'
feat_te_dir = Path('features3d_v2')/'test'
lab_tr_dir  = Path('labels3d_v2')/'train'

train_df = pd.read_csv('training.csv')
all_ids = train_df['Id'].astype(int).tolist()
import random
random.seed(42); np.random.seed(42)
random.shuffle(all_ids)
val_ratio = 0.15
val_n = max(30, int(len(all_ids)*val_ratio))
val_ids = all_ids[:val_n]

def load_feat(sample_id: int, split='train', max_T=1800):
    p = (feat_tr_dir if split=='train' else feat_te_dir)/f"{sample_id}.npz"
    d = np.load(p)
    X = d['X'].astype(np.float32)
    return X[:max_T] if X.shape[0] > max_T else X

def compute_class_median_durations():
    dur_by_c = {c: [] for c in range(1,21)}
    ids = train_df['Id'].astype(int).tolist()
    for sid in ids:
        y = np.load(lab_tr_dir/f"{sid}.npy").astype(np.int16)
        for c in range(1,21):
            cnt = int((y==c).sum())
            if cnt>0: dur_by_c[c].append(cnt)
    med = {}
    for c in range(1,21):
        med[c] = int(np.clip(np.median(dur_by_c[c]) if len(dur_by_c[c])>0 else 13, 9, 25))
    return med

MED_K = compute_class_median_durations()

D_in = np.load(next(iter((feat_tr_dir).glob('*.npz'))))['X'].shape[1]

class DilatedTCN(nn.Module):
    def __init__(self, d_in, channels=96, layers=10, num_classes=21, dropout=0.3):
        super().__init__()
        self.inp = nn.Conv1d(d_in, channels, kernel_size=1)
        blocks = []; dil=1
        for _ in range(layers):
            blocks.append(nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=3, padding=dil, dilation=dil),
                nn.GroupNorm(num_groups=8, num_channels=channels),
                nn.ReLU(inplace=True),
                nn.Dropout(dropout),
                nn.Conv1d(channels, channels, kernel_size=1),
                nn.GroupNorm(num_groups=8, num_channels=channels),
                nn.ReLU(inplace=True),
            ))
            dil = min(dil*2, 512)
        self.blocks = nn.ModuleList(blocks)
        self.head = nn.Conv1d(channels, num_classes, kernel_size=1)
    def forward(self, x_b_t_d):
        x = x_b_t_d.transpose(1,2)
        h = self.inp(x)
        for blk in self.blocks:
            res = h; h = blk(h); h = h + res
        logits = self.head(h)
        return logits.transpose(1,2)

def avg_pool_probs(p_t_c: torch.Tensor, k: int = 15) -> torch.Tensor:
    x = p_t_c.unsqueeze(0).transpose(1,2)
    y = F.avg_pool1d(x, kernel_size=k, stride=1, padding=k//2)
    return y.transpose(1,2).squeeze(0)

def duration_integral_single(p_t: torch.Tensor, k: int) -> torch.Tensor:
    x = p_t.view(1,1,-1)
    w = torch.ones(1,1,k, device=p_t.device, dtype=p_t.dtype) / float(k)
    y = F.conv1d(x, w, padding=k//2)
    return y.view(-1)

def refine_com(p: torch.Tensor, t_star: int, w: int = 5) -> float:
    T = p.shape[0]
    a = max(0, t_star - w); b = min(T-1, t_star + w)
    idx = torch.arange(a, b+1, device=p.device, dtype=p.dtype)
    seg = p[a:b+1]; s = seg.sum() + 1e-8
    return float(((idx * seg).sum() / s).item())

def decode_video_probs_refined(p_t_c: torch.Tensor, pool_k=15, temp=0.9):
    if temp != 1.0:
        p_t_c = (p_t_c ** (1.0/temp)); p_t_c = p_t_c / (p_t_c.sum(dim=-1, keepdim=True) + 1e-8)
    p_s = avg_pool_probs(p_t_c, k=pool_k)
    T,C = p_s.shape
    scores = torch.empty_like(p_s)
    for c in range(C):
        k = MED_K.get(c, 13) if c!=0 else 13
        scores[:,c] = p_s[:,c] if c==0 else duration_integral_single(p_s[:,c], k=k)
    peaks = []
    for c in range(1,21):
        radius = max(10, MED_K.get(c,13)//2)
        s = scores[:,c]
        t_star = int(torch.argmax(s).item())
        t_ref = refine_com(p_s[:,c], t_star, w=5)
        t_idx = int(round(t_ref)); t_idx = min(max(t_idx, 0), T-1)
        local_mean = p_s[max(0,t_idx-radius):min(T,t_idx+radius+1), c].mean().item()
        peaks.append((c, t_ref, float(scores[t_idx, c].item()), float(local_mean)))
    peaks.sort(key=lambda x: (x[1], -x[2], -x[3]))
    return [c for c,_,_,_ in peaks]

def levenshtein(a,b):
    n,m=len(a),len(b)
    if n==0: return m
    if m==0: return n
    dp=list(range(m+1))
    for i in range(1,n+1):
        prev=dp[0]; dp[0]=i; ai=a[i-1]
        for j in range(1,m+1):
            tmp=dp[j]; dp[j]=min(dp[j]+1, dp[j-1]+1, prev + (0 if ai==b[j-1] else 1)); prev=tmp
    return dp[m]

id2seq = {int(r.Id): [int(x) for x in str(r.Sequence).strip().split()] for _, r in train_df.iterrows()}

def load_ce_models(paths):
    models=[]
    for p in paths:
        m = DilatedTCN(d_in=D_in, channels=96, layers=10, num_classes=21, dropout=0.3).to(device)
        m.load_state_dict(torch.load(p, map_location=device)); m.eval(); models.append(m)
    return models

def tta_shift_probs(p_t_c: torch.Tensor, shift: int) -> torch.Tensor:
    # p_t_c: (T,C), shift in {-1,0,+1}; pad edge by repeating border
    if shift == 0:
        return p_t_c
    T, C = p_t_c.shape
    if shift > 0:
        pad = p_t_c[:shift, :]
        return torch.cat([pad, p_t_c[:-shift, :]], dim=0)
    else:
        s = -shift
        pad = p_t_c[-s:, :]
        return torch.cat([p_t_c[s:, :], pad], dim=0)

def ensemble_probs_with_tta(xb, models, shifts=(-1,0,1)) -> torch.Tensor:
    with torch.no_grad():
        probs_sum = None
        for m in models:
            p = m(xb)[0].softmax(dim=-1)  # (T,C)
            # TTA average over shifts
            p_tta = None
            for sh in shifts:
                ps = tta_shift_probs(p, sh)
                p_tta = ps if p_tta is None else (p_tta + ps)
            p_tta = p_tta / float(len(shifts))
            probs_sum = p_tta if probs_sum is None else (probs_sum + p_tta)
        probs = probs_sum / float(len(models))
        probs = probs / (probs.sum(dim=-1, keepdim=True) + 1e-8)
        return probs

ce_ckpts = ["model_ce_tcn_s0.pth", "model_ce_tcn_s1.pth", "model_ce_tcn_s2.pth"]
for p in ce_ckpts: assert Path(p).exists(), f"Missing {p}"
ce_models = load_ce_models(ce_ckpts)

def eval_val_ce_tta(pool_k=15, temp=0.9):
    tot=0; cnt=0; t0=time.time()
    with torch.no_grad():
        for sid in val_ids:
            X = load_feat(int(sid), 'train', 1800)
            xb = torch.from_numpy(X).unsqueeze(0).to(device)
            probs = ensemble_probs_with_tta(xb, ce_models, shifts=(-1,0,1))
            seq = decode_video_probs_refined(probs, pool_k=pool_k, temp=temp)
            tgt = id2seq[int(sid)]
            tot += levenshtein(seq, tgt); cnt += 1
    val_lev = tot/max(cnt,1)
    print(f"CE-only 3-seed + TTA VAL Levenshtein={val_lev:.4f} (norm ~{val_lev/20:.5f})", flush=True)
    return val_lev

val_lev = eval_val_ce_tta(pool_k=15, temp=0.9)

print("=== TEST inference: CE-only 3-seed + TTA (pool_k=15, temp=0.9) -> submission.csv ===", flush=True)
test_ids = pd.read_csv('test.csv')['Id'].astype(int).tolist()
rows=[]; t0=time.time()
with torch.no_grad():
    for i, sid in enumerate(test_ids, 1):
        X = load_feat(int(sid), 'test', 1800)
        xb = torch.from_numpy(X).unsqueeze(0).to(device)
        probs = ensemble_probs_with_tta(xb, ce_models, shifts=(-1,0,1))
        seq = decode_video_probs_refined(probs, pool_k=15, temp=0.9)
        rows.append({'Id': int(sid), 'Sequence': ' '.join(str(x) for x in seq)})
        if (i%10)==0 or i==len(test_ids):
            print(f"  [test CE TTA] {i}/{len(test_ids)} elapsed={(time.time()-t0)/60:.1f}m", flush=True)
sub = pd.DataFrame(rows, columns=['Id','Sequence'])
# Format assertions
assert len(sub)==95
assert all(len(s.split())==20 and len(set(s.split()))==20 and all(1<=int(t)<=20 for t in s.split()) for s in sub.Sequence), "Submission row format invalid"
sub.to_csv('submission.csv', index=False)
print('Wrote submission.csv (CE-only + TTA); head:\n', sub.head())
print('=== CE-only TTA complete ===')

=== CE-only 3-seed ensemble with simple TTA (frame shifts -1/0/+1) and refined decoder ===


  m.load_state_dict(torch.load(p, map_location=device)); m.eval(); models.append(m)


CE-only 3-seed + TTA VAL Levenshtein=4.3636 (norm ~0.21818)


=== TEST inference: CE-only 3-seed + TTA (pool_k=15, temp=0.9) -> submission.csv ===


  [test CE TTA] 10/95 elapsed=0.0m


  [test CE TTA] 20/95 elapsed=0.0m


  [test CE TTA] 30/95 elapsed=0.0m


  [test CE TTA] 40/95 elapsed=0.0m


  [test CE TTA] 50/95 elapsed=0.0m


  [test CE TTA] 60/95 elapsed=0.0m


  [test CE TTA] 70/95 elapsed=0.0m


  [test CE TTA] 80/95 elapsed=0.0m


  [test CE TTA] 90/95 elapsed=0.0m


  [test CE TTA] 95/95 elapsed=0.0m


Wrote submission.csv (CE-only + TTA); head:
     Id                                           Sequence
0  300  5 9 7 1 2 18 3 8 4 20 13 12 15 14 11 6 16 19 1...
1  301  10 12 3 1 5 4 20 6 2 11 15 13 19 7 9 8 18 14 1...
2  302  1 17 16 12 3 5 19 13 20 18 11 4 6 15 8 14 10 9...
3  303  13 4 12 3 10 14 5 19 15 20 17 1 11 16 8 18 7 6...
4  304  8 1 7 12 18 13 9 2 11 3 20 19 5 14 6 15 17 16 ...
=== CE-only TTA complete ===


In [37]:
import time, math
from pathlib import Path
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F

print("=== V3: CE(3) × MS++(s2) geometric ensemble + time-warp TTA + global temperature (grid on VAL) ===", flush=True)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

feat_tr_dir = Path('features3d_v2')/'train'
feat_te_dir = Path('features3d_v2')/'test'
lab_tr_dir  = Path('labels3d_v2')/'train'

train_df = pd.read_csv('training.csv')
all_ids = train_df['Id'].astype(int).tolist()
import random
random.seed(42); np.random.seed(42)
random.shuffle(all_ids)
val_ratio = 0.15
val_n = max(30, int(len(all_ids)*val_ratio))
val_ids = all_ids[:val_n]

def load_feat(sample_id: int, split='train', max_T=1800):
    p = (feat_tr_dir if split=='train' else feat_te_dir)/f"{sample_id}.npz"
    d = np.load(p); X = d['X'].astype(np.float32)
    return X[:max_T] if X.shape[0] > max_T else X

def compute_class_median_durations():
    dur_by_c = {c: [] for c in range(1,21)}
    ids = train_df['Id'].astype(int).tolist()
    for sid in ids:
        y = np.load(lab_tr_dir/f"{sid}.npy").astype(np.int16)
        for c in range(1,21):
            cnt = int((y==c).sum())
            if cnt>0: dur_by_c[c].append(cnt)
    med = {}
    for c in range(1,21):
        med[c] = int(np.clip(np.median(dur_by_c[c]) if len(dur_by_c[c])>0 else 13, 9, 25))
    return med

MED_K = compute_class_median_durations()

D_in = np.load(next(iter((feat_tr_dir).glob('*.npz'))))['X'].shape[1]

class DilatedTCN(nn.Module):
    def __init__(self, d_in, channels=96, layers=10, num_classes=21, dropout=0.3):
        super().__init__()
        self.inp = nn.Conv1d(d_in, channels, kernel_size=1)
        blocks = []; dil=1
        for _ in range(layers):
            blocks.append(nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=3, padding=dil, dilation=dil),
                nn.GroupNorm(num_groups=8, num_channels=channels),
                nn.ReLU(inplace=True),
                nn.Dropout(dropout),
                nn.Conv1d(channels, channels, kernel_size=1),
                nn.GroupNorm(num_groups=8, num_channels=channels),
                nn.ReLU(inplace=True),
            ))
            dil = min(dil*2, 512)
        self.blocks = nn.ModuleList(blocks)
        self.head = nn.Conv1d(channels, num_classes, kernel_size=1)
    def forward(self, x_b_t_d):
        x = x_b_t_d.transpose(1,2)
        h = self.inp(x)
        for blk in self.blocks:
            res = h; h = blk(h); h = h + res
        logits = self.head(h)
        return logits.transpose(1,2)

class DilatedResBlock(nn.Module):
    def __init__(self, ch, dilation, drop=0.3, groups=8, k=3):
        super().__init__()
        self.conv1 = nn.Conv1d(ch, ch, kernel_size=k, padding=dilation, dilation=dilation)
        self.gn1 = nn.GroupNorm(groups, ch)
        self.conv2 = nn.Conv1d(ch, ch, kernel_size=1)
        self.gn2 = nn.GroupNorm(groups, ch)
        self.drop = nn.Dropout(drop)
    def forward(self, x):
        h = self.conv1(x); h = self.gn1(h); h = F.relu(h, inplace=True); h = self.drop(h)
        h = self.conv2(h); h = self.gn2(h); h = F.relu(h, inplace=True)
        return x + h

class Stage(nn.Module):
    def __init__(self, in_ch, ch=128, layers=10, drop=0.3):
        super().__init__()
        self.inp = nn.Conv1d(in_ch, ch, kernel_size=1)
        blocks = []; dil=1
        for _ in range(layers):
            blocks.append(DilatedResBlock(ch, dil, drop=drop))
            dil = min(dil*2, 512)
        self.blocks = nn.ModuleList(blocks)
        self.head = nn.Conv1d(ch, 21, kernel_size=1)
    def forward(self, x):
        h = self.inp(x)
        for blk in self.blocks:
            h = blk(h)
        return self.head(h)

class MSTCNPP(nn.Module):
    def __init__(self, d_in, stages=4, ch=128, layers=10, drop=0.3):
        super().__init__()
        self.stages = nn.ModuleList()
        self.input_proj = nn.Conv1d(d_in, d_in, kernel_size=1)
        self.stages.append(Stage(d_in, ch=ch, layers=layers, drop=drop))
        for _ in range(stages-1):
            self.stages.append(Stage(21, ch=ch, layers=layers, drop=drop))
    def forward(self, x_b_t_d):
        x = x_b_t_d.transpose(1,2)
        x = self.input_proj(x)
        logits_list = []
        prev = self.stages[0](x)
        logits_list.append(prev.transpose(1,2))
        for s in range(1, len(self.stages)):
            probs = prev.softmax(dim=1)
            prev = self.stages[s](probs)
            logits_list.append(prev.transpose(1,2))
        return logits_list

id2seq = {int(r.Id): [int(x) for x in str(r.Sequence).strip().split()] for _, r in train_df.iterrows()}

def levenshtein(a,b):
    n,m=len(a),len(b)
    if n==0: return m
    if m==0: return n
    dp=list(range(m+1))
    for i in range(1,n+1):
        prev=dp[0]; dp[0]=i; ai=a[i-1]
        for j in range(1,m+1):
            tmp=dp[j]; dp[j]=min(dp[j]+1, dp[j-1]+1, prev + (0 if ai==b[j-1] else 1)); prev=tmp
    return dp[m]

def avg_pool_probs(p_t_c: torch.Tensor, k: int) -> torch.Tensor:
    x = p_t_c.unsqueeze(0).transpose(1,2)
    y = F.avg_pool1d(x, kernel_size=k, stride=1, padding=k//2)
    return y.transpose(1,2).squeeze(0)

def duration_integral_single(p_t: torch.Tensor, k: int) -> torch.Tensor:
    x = p_t.view(1,1,-1)
    w = torch.ones(1,1,k, device=p_t.device, dtype=p_t.dtype) / float(k)
    y = F.conv1d(x, w, padding=k//2)
    return y.view(-1)

def refine_com(p: torch.Tensor, t_star: int, w: int = 5) -> float:
    T = p.shape[0]
    a = max(0, t_star - w); b = min(T-1, t_star + w)
    idx = torch.arange(a, b+1, device=p.device, dtype=p.dtype)
    seg = p[a:b+1]; s = seg.sum() + 1e-8
    return float(((idx * seg).sum() / s).item())

def decode_video_probs_refined(p_t_c: torch.Tensor, pool_k=13, temp=1.0):
    # global temp scaling then normalize
    if temp != 1.0:
        p_t_c = (p_t_c ** (1.0/temp)); p_t_c = p_t_c / (p_t_c.sum(dim=-1, keepdim=True) + 1e-8)
    p_s = avg_pool_probs(p_t_c, k=pool_k)
    T,C = p_s.shape
    scores = torch.empty_like(p_s)
    for c in range(C):
        k = MED_K.get(c, 13) if c!=0 else 13
        scores[:,c] = p_s[:,c] if c==0 else duration_integral_single(p_s[:,c], k=k)
    peaks = []
    for c in range(1,21):
        s = scores[:,c]
        t_star = int(torch.argmax(s).item())
        t_ref = refine_com(p_s[:,c], t_star, w=5)
        t_idx = int(round(t_ref)); t_idx = min(max(t_idx, 0), T-1)
        peaks.append([c, t_ref, float(scores[t_idx, c].item())])
    # collision fix: if two t_ref (rounded) collide, shift the later one by +1 frame
    # then re-sort by time and tie-break by integral score
    peaks.sort(key=lambda x: x[1])
    seen_times = set()
    for i in range(len(peaks)):
        t_i = int(round(peaks[i][1]))
        while t_i in seen_times:
            t_i += 1
        seen_times.add(t_i)
        # clamp to [0, T-1]
        t_i = min(max(t_i, 0), T-1)
        peaks[i][1] = float(t_i)
    peaks.sort(key=lambda x: (x[1], -x[2]))
    return [c for c,_,_ in peaks]

def load_models():
    ce_paths = ["model_ce_tcn_s0.pth", "model_ce_tcn_s1.pth", "model_ce_tcn_s2.pth"]
    ms_path = "model_mstcnpp_s2.pth"
    for p in ce_paths + [ms_path]:
        assert Path(p).exists(), f"Missing checkpoint {p}"
    ce_models = []
    for p in ce_paths:
        m = DilatedTCN(d_in=D_in, channels=96, layers=10, num_classes=21, dropout=0.3).to(device)
        m.load_state_dict(torch.load(p, map_location=device)); m.eval(); ce_models.append(m)
    ms_model = MSTCNPP(d_in=D_in, stages=4, ch=128, layers=10, drop=0.3).to(device)
    ms_model.load_state_dict(torch.load(ms_path, map_location=device)); ms_model.eval()
    return ce_models, ms_model

def geo_mean_probs_ce_ms(xb, ce_models, ms_model, w_ce=0.9):
    w_ms = 1.0 - w_ce
    with torch.no_grad():
        ce_log=None
        for m in ce_models:
            p = m(xb)[0].softmax(dim=-1)
            ce_log = torch.log(p + 1e-8) if ce_log is None else ce_log + torch.log(p + 1e-8)
        ce_log = ce_log / max(len(ce_models),1)
        p_ms = ms_model(xb)[-1][0].softmax(dim=-1)
        ms_log = torch.log(p_ms + 1e-8)
        log_comb = w_ce*ce_log + w_ms*ms_log
        probs = torch.exp(log_comb); probs = probs / (probs.sum(dim=-1, keepdim=True) + 1e-8)
        return probs

def time_warp_probs(p_t_c: torch.Tensor, factor: float) -> torch.Tensor:
    # p_t_c: (T,C); resample to int(T*factor) then back to T; linear interp, no corner align
    T, C = p_t_c.shape
    tgt_len = max(1, int(round(T*factor)))
    x = p_t_c.T.unsqueeze(0)  # (1,C,T)
    y = F.interpolate(x, size=tgt_len, mode='linear', align_corners=False)  # (1,C,tgt)
    y2 = F.interpolate(y, size=T, mode='linear', align_corners=False)[0].T  # (T,C)
    y2 = y2 / (y2.sum(dim=-1, keepdim=True) + 1e-8)
    return y2

def apply_tta_timewarp(p_t_c: torch.Tensor, factors=(0.9,1.0,1.1)) -> torch.Tensor:
    # average warped versions, always warp from original
    acc=None
    for s in factors:
        ps = time_warp_probs(p_t_c, s)
        acc = ps if acc is None else (acc + ps)
    out = acc / float(len(factors))
    out = out / (out.sum(dim=-1, keepdim=True) + 1e-8)
    return out

ce_models, ms_model = load_models()

temps = [0.85, 0.90, 0.95, 1.00]
pool_ks = [13, 15]
w_ces = [0.90, 0.95]

best = (1e9, None)
results = []
t0_all = time.time()
with torch.no_grad():
    for w_ce in w_ces:
        for pool_k in pool_ks:
            for temp in temps:
                tot=0; cnt=0; t0=time.time()
                for sid in val_ids:
                    X = load_feat(int(sid), 'train', 1800)
                    xb = torch.from_numpy(X).unsqueeze(0).to(device)
                    probs = geo_mean_probs_ce_ms(xb, ce_models, ms_model, w_ce=w_ce)  # (T,C)
                    probs = apply_tta_timewarp(probs, factors=(0.9,1.0,1.1))
                    seq = decode_video_probs_refined(probs, pool_k=pool_k, temp=temp)
                    tot += levenshtein(seq, id2seq[int(sid)]); cnt += 1
                val_lev = tot/max(cnt,1)
                cfg = dict(w_ce=w_ce, pool_k=pool_k, temp=temp)
                results.append((val_lev, cfg))
                print(f"  [VAL V3] lev={val_lev:.4f} cfg={cfg} elapsed={(time.time()-t0):.1f}s", flush=True)
                if val_lev < best[0]:
                    best = (val_lev, cfg)

results.sort(key=lambda x: x[0])
print("=== Top 5 V3 configs (VAL) ===")
for r in results[:5]:
    print(r)
print("BEST V3:", best, f"total elapsed={(time.time()-t0_all)/60:.2f}m")

best_val, best_cfg = best
print(f"=== TEST inference V3 (w_ce={best_cfg['w_ce']}, pool_k={best_cfg['pool_k']}, temp={best_cfg['temp']}) with time-warp TTA ===", flush=True)
test_ids = pd.read_csv('test.csv')['Id'].astype(int).tolist()
rows=[]; t0=time.time()
with torch.no_grad():
    for i, sid in enumerate(test_ids, 1):
        X = load_feat(int(sid), 'test', 1800)
        xb = torch.from_numpy(X).unsqueeze(0).to(device)
        probs = geo_mean_probs_ce_ms(xb, ce_models, ms_model, w_ce=best_cfg['w_ce'])
        probs = apply_tta_timewarp(probs, factors=(0.9,1.0,1.1))
        seq = decode_video_probs_refined(probs, pool_k=best_cfg['pool_k'], temp=best_cfg['temp'])
        rows.append({'Id': int(sid), 'Sequence': ' '.join(str(x) for x in seq)})
        if (i%10)==0 or i==len(test_ids):
            print(f"  [test V3] {i}/{len(test_ids)} elapsed={(time.time()-t0)/60:.1f}m", flush=True)
sub = pd.DataFrame(rows, columns=['Id','Sequence'])
# Safety checks
assert len(sub)==95
assert all(len(s.split())==20 and len(set(s.split()))==20 and all(1<=int(t)<=20 for t in s.split()) for s in sub.Sequence), "Submission row format invalid"
sub.to_csv('submission.csv', index=False)
print('Wrote submission.csv (V3 TTA+temp); head:\n', sub.head())
print('=== V3 complete ===')

=== V3: CE(3) × MS++(s2) geometric ensemble + time-warp TTA + global temperature (grid on VAL) ===


  m.load_state_dict(torch.load(p, map_location=device)); m.eval(); ce_models.append(m)
  ms_model.load_state_dict(torch.load(ms_path, map_location=device)); ms_model.eval()


  [VAL V3] lev=4.1591 cfg={'w_ce': 0.9, 'pool_k': 13, 'temp': 0.85} elapsed=1.3s


  [VAL V3] lev=4.1364 cfg={'w_ce': 0.9, 'pool_k': 13, 'temp': 0.9} elapsed=1.1s


  [VAL V3] lev=4.1818 cfg={'w_ce': 0.9, 'pool_k': 13, 'temp': 0.95} elapsed=1.1s


  [VAL V3] lev=4.2500 cfg={'w_ce': 0.9, 'pool_k': 13, 'temp': 1.0} elapsed=1.1s


  [VAL V3] lev=4.1364 cfg={'w_ce': 0.9, 'pool_k': 15, 'temp': 0.9} elapsed=1.1s


  [VAL V3] lev=4.1818 cfg={'w_ce': 0.9, 'pool_k': 15, 'temp': 0.95} elapsed=1.1s


  [VAL V3] lev=4.2500 cfg={'w_ce': 0.9, 'pool_k': 15, 'temp': 1.0} elapsed=1.1s


  [VAL V3] lev=4.2273 cfg={'w_ce': 0.95, 'pool_k': 13, 'temp': 0.85} elapsed=1.1s


  [VAL V3] lev=4.2273 cfg={'w_ce': 0.95, 'pool_k': 13, 'temp': 0.9} elapsed=1.1s


  [VAL V3] lev=4.2727 cfg={'w_ce': 0.95, 'pool_k': 13, 'temp': 0.95} elapsed=1.1s


  [VAL V3] lev=4.2500 cfg={'w_ce': 0.95, 'pool_k': 13, 'temp': 1.0} elapsed=1.1s


  [VAL V3] lev=4.2273 cfg={'w_ce': 0.95, 'pool_k': 15, 'temp': 0.85} elapsed=1.1s


  [VAL V3] lev=4.2727 cfg={'w_ce': 0.95, 'pool_k': 15, 'temp': 0.9} elapsed=1.1s


  [VAL V3] lev=4.2955 cfg={'w_ce': 0.95, 'pool_k': 15, 'temp': 0.95} elapsed=1.1s


  [VAL V3] lev=4.2727 cfg={'w_ce': 0.95, 'pool_k': 15, 'temp': 1.0} elapsed=1.1s


=== Top 5 V3 configs (VAL) ===
(4.136363636363637, {'w_ce': 0.9, 'pool_k': 13, 'temp': 0.9})
(4.136363636363637, {'w_ce': 0.9, 'pool_k': 15, 'temp': 0.9})
(4.159090909090909, {'w_ce': 0.9, 'pool_k': 13, 'temp': 0.85})
(4.159090909090909, {'w_ce': 0.9, 'pool_k': 15, 'temp': 0.85})
(4.181818181818182, {'w_ce': 0.9, 'pool_k': 13, 'temp': 0.95})
BEST V3: (4.136363636363637, {'w_ce': 0.9, 'pool_k': 13, 'temp': 0.9}) total elapsed=0.30m
=== TEST inference V3 (w_ce=0.9, pool_k=13, temp=0.9) with time-warp TTA ===


  [test V3] 10/95 elapsed=0.0m


  [test V3] 20/95 elapsed=0.0m


  [test V3] 30/95 elapsed=0.0m


  [test V3] 40/95 elapsed=0.0m


  [test V3] 50/95 elapsed=0.0m


  [test V3] 60/95 elapsed=0.0m


  [test V3] 70/95 elapsed=0.0m


  [test V3] 80/95 elapsed=0.0m


  [test V3] 90/95 elapsed=0.0m


  [test V3] 95/95 elapsed=0.0m


Wrote submission.csv (V3 TTA+temp); head:
     Id                                           Sequence
0  300  5 9 7 1 2 18 3 8 4 20 13 12 15 14 11 6 16 19 1...
1  301  10 12 3 1 5 4 20 6 2 11 15 13 19 7 9 8 18 14 1...
2  302  1 17 16 12 3 5 7 19 13 20 18 11 4 6 15 8 14 10...
3  303  13 4 12 3 10 5 19 15 20 17 1 11 16 8 18 7 6 2 ...
4  304  8 1 7 12 18 13 9 2 11 3 20 19 5 14 6 15 17 16 ...
=== V3 complete ===


In [38]:
import time, math
from pathlib import Path
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F

print("=== V4: Order-first decoder (expectation + pairwise) on CE(3)×MS++(s2) with time-warp TTA ===", flush=True)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

feat_tr_dir = Path('features3d_v2')/'train'
feat_te_dir = Path('features3d_v2')/'test'
lab_tr_dir  = Path('labels3d_v2')/'train'

train_df = pd.read_csv('training.csv')
all_ids = train_df['Id'].astype(int).tolist()
import random
random.seed(42); np.random.seed(42)
random.shuffle(all_ids)
val_ratio = 0.15
val_n = max(30, int(len(all_ids)*val_ratio))
val_ids = all_ids[:val_n]

def load_feat(sample_id: int, split='train', max_T=1800):
    p = (feat_tr_dir if split=='train' else feat_te_dir)/f"{sample_id}.npz"
    d = np.load(p); X = d['X'].astype(np.float32)
    return X[:max_T] if X.shape[0] > max_T else X

def compute_class_median_durations():
    dur_by_c = {c: [] for c in range(1,21)}
    ids = train_df['Id'].astype(int).tolist()
    for sid in ids:
        y = np.load(lab_tr_dir/f"{sid}.npy").astype(np.int16)
        for c in range(1,21):
            cnt = int((y==c).sum());
            if cnt>0: dur_by_c[c].append(cnt)
    med = {}
    for c in range(1,21):
        med[c] = int(np.clip(np.median(dur_by_c[c]) if len(dur_by_c[c])>0 else 13, 9, 25))
    return med

MED_K = compute_class_median_durations()

D_in = np.load(next(iter((feat_tr_dir).glob('*.npz'))))['X'].shape[1]

class DilatedTCN(nn.Module):
    def __init__(self, d_in, channels=96, layers=10, num_classes=21, dropout=0.3):
        super().__init__()
        self.inp = nn.Conv1d(d_in, channels, kernel_size=1)
        blocks = []; dil=1
        for _ in range(layers):
            blocks.append(nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=3, padding=dil, dilation=dil),
                nn.GroupNorm(num_groups=8, num_channels=channels),
                nn.ReLU(inplace=True),
                nn.Dropout(dropout),
                nn.Conv1d(channels, channels, kernel_size=1),
                nn.GroupNorm(num_groups=8, num_channels=channels),
                nn.ReLU(inplace=True),
            ))
            dil = min(dil*2, 512)
        self.blocks = nn.ModuleList(blocks)
        self.head = nn.Conv1d(channels, num_classes, kernel_size=1)
    def forward(self, x_b_t_d):
        x = x_b_t_d.transpose(1,2)
        h = self.inp(x)
        for blk in self.blocks:
            res = h; h = blk(h); h = h + res
        logits = self.head(h)
        return logits.transpose(1,2)

class DilatedResBlock(nn.Module):
    def __init__(self, ch, dilation, drop=0.3, groups=8, k=3):
        super().__init__()
        self.conv1 = nn.Conv1d(ch, ch, kernel_size=k, padding=dilation, dilation=dilation)
        self.gn1 = nn.GroupNorm(groups, ch)
        self.conv2 = nn.Conv1d(ch, ch, kernel_size=1)
        self.gn2 = nn.GroupNorm(groups, ch)
        self.drop = nn.Dropout(drop)
    def forward(self, x):
        h = self.conv1(x); h = self.gn1(h); h = F.relu(h, inplace=True); h = self.drop(h)
        h = self.conv2(h); h = self.gn2(h); h = F.relu(h, inplace=True)
        return x + h

class Stage(nn.Module):
    def __init__(self, in_ch, ch=128, layers=10, drop=0.3):
        super().__init__()
        self.inp = nn.Conv1d(in_ch, ch, kernel_size=1)
        blocks = []; dil=1
        for _ in range(layers):
            blocks.append(DilatedResBlock(ch, dil, drop=drop))
            dil = min(dil*2, 512)
        self.blocks = nn.ModuleList(blocks)
        self.head = nn.Conv1d(ch, 21, kernel_size=1)
    def forward(self, x):
        h = self.inp(x)
        for blk in self.blocks:
            h = blk(h)
        return self.head(h)

class MSTCNPP(nn.Module):
    def __init__(self, d_in, stages=4, ch=128, layers=10, drop=0.3):
        super().__init__()
        self.stages = nn.ModuleList()
        self.input_proj = nn.Conv1d(d_in, d_in, kernel_size=1)
        self.stages.append(Stage(d_in, ch=ch, layers=layers, drop=drop))
        for _ in range(stages-1):
            self.stages.append(Stage(21, ch=ch, layers=layers, drop=drop))
    def forward(self, x_b_t_d):
        x = x_b_t_d.transpose(1,2)
        x = self.input_proj(x)
        logits_list = []
        prev = self.stages[0](x)
        logits_list.append(prev.transpose(1,2))
        for s in range(1, len(self.stages)):
            probs = prev.softmax(dim=1)
            prev = self.stages[s](probs)
            logits_list.append(prev.transpose(1,2))
        return logits_list

def avg_pool_probs(p_t_c: torch.Tensor, k: int = 13) -> torch.Tensor:
    x = p_t_c.unsqueeze(0).transpose(1,2)
    y = F.avg_pool1d(x, kernel_size=k, stride=1, padding=k//2)
    return y.transpose(1,2).squeeze(0)

def duration_integral_single(p_t: torch.Tensor, k: int) -> torch.Tensor:
    x = p_t.view(1,1,-1)
    w = torch.ones(1,1,k, device=p_t.device, dtype=p_t.dtype) / float(k)
    y = F.conv1d(x, w, padding=k//2)
    return y.view(-1)

def smooth_probs(p_t_c: torch.Tensor, pool_k=13, temp=0.9) -> torch.Tensor:
    if temp != 1.0:
        p_t_c = (p_t_c ** (1.0/temp)); p_t_c = p_t_c / (p_t_c.sum(dim=-1, keepdim=True) + 1e-8)
    return avg_pool_probs(p_t_c, k=pool_k)

def order_decoder_expectation(p_t_c: torch.Tensor, pool_k=13, temp=0.9) -> list:
    # Compute expected time per class using duration-integral as weights; sort by expectation
    p_s = smooth_probs(p_t_c, pool_k=pool_k, temp=temp)  # (T,C)
    T, C = p_s.shape
    scores = torch.empty_like(p_s)
    for c in range(C):
        k = MED_K.get(c, 13) if c!=0 else 13
        scores[:,c] = p_s[:,c] if c==0 else duration_integral_single(p_s[:,c], k=k)
    idx = torch.arange(T, device=scores.device, dtype=scores.dtype).unsqueeze(1)  # (T,1)
    exp_t = torch.sum(idx * scores, dim=0) / (torch.sum(scores, dim=0) + 1e-8)  # (C,)
    # classes 1..20 sorted by expected time
    order = torch.argsort(exp_t[1:21]).tolist()
    seq = [int(i+1) for i in order]
    return seq

def order_decoder_pairwise(p_t_c: torch.Tensor, pool_k=13, temp=0.9) -> list:
    # Pairwise dominance: score[i] = sum_t sum_j max(p_i - p_j, 0); rank by score ascending in time
    p_s = smooth_probs(p_t_c, pool_k=pool_k, temp=temp)  # (T,C)
    p = p_s[:,1:21]  # exclude bg -> (T,20)
    T, K = p.shape
    # compute pairwise advantages
    # naive O(T*K*K) is fine: 1800*400 ~ 720k ops per sample
    scores = torch.zeros(K, device=p.device, dtype=p.dtype)
    for i in range(K):
        pi = p[:, i].unsqueeze(1)  # (T,1)
        diff = pi - p  # (T,K)
        diff[:, i] = 0.0
        scores[i] = torch.clamp(diff, min=0).sum()
    order = torch.argsort(scores).tolist()  # smaller score ~ earlier in time (less dominates others later)
    seq = [int(i+1) for i in order]
    return seq

def hybrid_order_decoder(p_t_c: torch.Tensor, pool_k=13, temp=0.9) -> list:
    seq_e = order_decoder_expectation(p_t_c, pool_k=pool_k, temp=temp)
    seq_p = order_decoder_pairwise(p_t_c, pool_k=pool_k, temp=temp)
    # Borda-like fusion: assign ranks and sum; lower sum wins
    rank_e = {c: r for r, c in enumerate(seq_e)}
    rank_p = {c: r for r, c in enumerate(seq_p)}
    classes = list(range(1,21))
    scores = [(c, rank_e.get(c, 0) + rank_p.get(c, 0)) for c in classes]
    scores.sort(key=lambda x: x[1])
    return [c for c,_ in scores]

def levenshtein(a,b):
    n,m=len(a),len(b)
    if n==0: return m
    if m==0: return n
    dp=list(range(m+1))
    for i in range(1,n+1):
        prev=dp[0]; dp[0]=i; ai=a[i-1]
        for j in range(1,m+1):
            tmp=dp[j]; dp[j]=min(dp[j]+1, dp[j-1]+1, prev + (0 if ai==b[j-1] else 1)); prev=tmp
    return dp[m]

id2seq = {int(r.Id): [int(x) for x in str(r.Sequence).strip().split()] for _, r in train_df.iterrows()}

# Loader for best ensemble (CE 3 seeds + MS++ s2), with time-warp TTA on probs
D_in = np.load(next(iter((feat_tr_dir).glob('*.npz'))))['X'].shape[1]

def load_models():
    ce_paths = ["model_ce_tcn_s0.pth", "model_ce_tcn_s1.pth", "model_ce_tcn_s2.pth"]
    ms_path = "model_mstcnpp_s2.pth"
    for p in ce_paths + [ms_path]:
        assert Path(p).exists(), f"Missing checkpoint {p}"
    ce_models = []
    for p in ce_paths:
        m = DilatedTCN(d_in=D_in, channels=96, layers=10, num_classes=21, dropout=0.3).to(device)
        m.load_state_dict(torch.load(p, map_location=device)); m.eval(); ce_models.append(m)
    ms_model = MSTCNPP(d_in=D_in, stages=4, ch=128, layers=10, drop=0.3).to(device)
    ms_model.load_state_dict(torch.load(ms_path, map_location=device)); ms_model.eval()
    return ce_models, ms_model

def geo_mean_probs_ce_ms(xb, ce_models, ms_model, w_ce=0.9):
    w_ms = 1.0 - w_ce
    with torch.no_grad():
        ce_log=None
        for m in ce_models:
            p = m(xb)[0].softmax(dim=-1)
            ce_log = torch.log(p + 1e-8) if ce_log is None else ce_log + torch.log(p + 1e-8)
        ce_log = ce_log / max(len(ce_models),1)
        p_ms = ms_model(xb)[-1][0].softmax(dim=-1)
        ms_log = torch.log(p_ms + 1e-8)
        log_comb = w_ce*ce_log + w_ms*ms_log
        probs = torch.exp(log_comb); probs = probs / (probs.sum(dim=-1, keepdim=True) + 1e-8)
        return probs

def time_warp_probs(p_t_c: torch.Tensor, factor: float) -> torch.Tensor:
    T, C = p_t_c.shape
    tgt_len = max(1, int(round(T*factor)))
    x = p_t_c.T.unsqueeze(0)
    y = F.interpolate(x, size=tgt_len, mode='linear', align_corners=False)
    y2 = F.interpolate(y, size=T, mode='linear', align_corners=False)[0].T
    y2 = y2 / (y2.sum(dim=-1, keepdim=True) + 1e-8)
    return y2

def apply_tta_timewarp(p_t_c: torch.Tensor, factors=(0.9,1.0,1.1)) -> torch.Tensor:
    acc=None
    for s in factors:
        ps = time_warp_probs(p_t_c, s)
        acc = ps if acc is None else (acc + ps)
    out = acc / float(len(factors))
    out = out / (out.sum(dim=-1, keepdim=True) + 1e-8)
    return out

ce_models, ms_model = load_models()

def eval_val_order_decoders(pool_k=13, temp=0.9):
    totE=totP=totH=0; cnt=0; t0=time.time()
    with torch.no_grad():
        for sid in val_ids:
            X = load_feat(int(sid), 'train', 1800)
            xb = torch.from_numpy(X).unsqueeze(0).to(device)
            probs = geo_mean_probs_ce_ms(xb, ce_models, ms_model, w_ce=0.9)
            probs = apply_tta_timewarp(probs, factors=(0.9,1.0,1.1))
            seqE = order_decoder_expectation(probs, pool_k=pool_k, temp=temp)
            seqP = order_decoder_pairwise(probs, pool_k=pool_k, temp=temp)
            seqH = hybrid_order_decoder(probs, pool_k=pool_k, temp=temp)
            tgt = id2seq[int(sid)]
            def lev(a,b):
                n,m=len(a),len(b);
                if n==0: return m
                if m==0: return n
                dp=list(range(m+1))
                for i in range(1,n+1):
                    prev=dp[0]; dp[0]=i; ai=a[i-1]
                    for j in range(1,m+1):
                        tmp=dp[j]; dp[j]=min(dp[j]+1, dp[j-1]+1, prev + (0 if ai==b[j-1] else 1)); prev=tmp
                return dp[m]
            totE += lev(seqE, tgt); totP += lev(seqP, tgt); totH += lev(seqH, tgt); cnt += 1
    print(f"VAL (pool_k={pool_k}, temp={temp}): Expect={totE/max(cnt,1):.4f} Pair={totP/max(cnt,1):.4f} Hybrid={totH/max(cnt,1):.4f}", flush=True)
    return (totE/max(cnt,1), totP/max(cnt,1), totH/max(cnt,1))

# Small grid over pool_k and temp
cands = [(13,0.9),(15,0.9)]
best = (1e9, None, None)  # (lev, (pool,temp), decoder_name)
for pool_k, temp in cands:
    e,p,h = eval_val_order_decoders(pool_k=pool_k, temp=temp)
    for name,lev in (('expect',e),('pair',p),('hybrid',h)):
        if lev < best[0]: best = (lev, (pool_k,temp), name)
print("BEST order-decoder:", best, flush=True)

# Build TEST submission using best order-decoder
best_lev, (pool_k, temp), name = best
print(f"=== TEST inference V4 using {name} (pool_k={pool_k}, temp={temp}) ===", flush=True)
test_ids = pd.read_csv('test.csv')['Id'].astype(int).tolist()
rows=[]; t0=time.time()
with torch.no_grad():
    for i, sid in enumerate(test_ids, 1):
        X = load_feat(int(sid), 'test', 1800)
        xb = torch.from_numpy(X).unsqueeze(0).to(device)
        probs = geo_mean_probs_ce_ms(xb, ce_models, ms_model, w_ce=0.9)
        probs = apply_tta_timewarp(probs, factors=(0.9,1.0,1.1))
        if name=='expect':
            seq = order_decoder_expectation(probs, pool_k=pool_k, temp=temp)
        elif name=='pair':
            seq = order_decoder_pairwise(probs, pool_k=pool_k, temp=temp)
        else:
            seq = hybrid_order_decoder(probs, pool_k=pool_k, temp=temp)
        rows.append({'Id': int(sid), 'Sequence': ' '.join(str(x) for x in seq)})
        if (i%10)==0 or i==len(test_ids):
            print(f"  [test V4 order] {i}/{len(test_ids)} elapsed={(time.time()-t0)/60:.1f}m", flush=True)
sub = pd.DataFrame(rows, columns=['Id','Sequence'])
assert len(sub)==95
assert all(len(s.split())==20 and len(set(s.split()))==20 and all(1<=int(t)<=20 for t in s.split()) for s in sub.Sequence), "Submission row format invalid"
sub.to_csv('submission.csv', index=False)
print('Wrote submission.csv (V4 order-first); head:\n', sub.head())
print('=== V4 order-first complete ===')

=== V4: Order-first decoder (expectation + pairwise) on CE(3)×MS++(s2) with time-warp TTA ===


  m.load_state_dict(torch.load(p, map_location=device)); m.eval(); ce_models.append(m)
  ms_model.load_state_dict(torch.load(ms_path, map_location=device)); ms_model.eval()


VAL (pool_k=13, temp=0.9): Expect=12.1591 Pair=18.0000 Hybrid=16.1818


VAL (pool_k=15, temp=0.9): Expect=12.1818 Pair=17.9773 Hybrid=16.2045


BEST order-decoder: (12.159090909090908, (13, 0.9), 'expect')


=== TEST inference V4 using expect (pool_k=13, temp=0.9) ===


  [test V4 order] 10/95 elapsed=0.0m


  [test V4 order] 20/95 elapsed=0.0m


  [test V4 order] 30/95 elapsed=0.0m


  [test V4 order] 40/95 elapsed=0.0m


  [test V4 order] 50/95 elapsed=0.0m


  [test V4 order] 60/95 elapsed=0.0m


  [test V4 order] 70/95 elapsed=0.0m


  [test V4 order] 80/95 elapsed=0.0m


  [test V4 order] 90/95 elapsed=0.0m


  [test V4 order] 95/95 elapsed=0.0m


Wrote submission.csv (V4 order-first); head:
     Id                                           Sequence
0  300  9 5 1 7 18 2 8 3 20 12 4 16 13 15 14 11 10 6 1...
1  301  5 10 4 6 1 12 20 11 3 15 2 19 13 7 14 9 8 18 1...
2  302  17 16 5 12 19 1 13 20 3 18 11 7 4 6 15 10 2 8 ...
3  303  13 4 10 15 12 19 5 3 8 11 20 17 18 1 16 14 6 7...
4  304  8 7 1 13 18 2 12 9 3 11 14 20 19 10 15 5 17 6 ...
=== V4 order-first complete ===


In [39]:
import time, math
from pathlib import Path
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F

print("=== V4b: CE-only (3 seeds) + time-warp TTA + global temperature grid ===", flush=True)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

feat_tr_dir = Path('features3d_v2')/'train'
feat_te_dir = Path('features3d_v2')/'test'
lab_tr_dir  = Path('labels3d_v2')/'train'

train_df = pd.read_csv('training.csv')
all_ids = train_df['Id'].astype(int).tolist()
import random
random.seed(42); np.random.seed(42)
random.shuffle(all_ids)
val_ratio = 0.15
val_n = max(30, int(len(all_ids)*val_ratio))
val_ids = all_ids[:val_n]

def load_feat(sample_id: int, split='train', max_T=1800):
    p = (feat_tr_dir if split=='train' else feat_te_dir)/f"{sample_id}.npz"
    d = np.load(p); X = d['X'].astype(np.float32)
    return X[:max_T] if X.shape[0] > max_T else X

def compute_class_median_durations():
    dur_by_c = {c: [] for c in range(1,21)}
    ids = train_df['Id'].astype(int).tolist()
    for sid in ids:
        y = np.load(lab_tr_dir/f"{sid}.npy").astype(np.int16)
        for c in range(1,21):
            cnt = int((y==c).sum())
            if cnt>0: dur_by_c[c].append(cnt)
    med = {}
    for c in range(1,21):
        med[c] = int(np.clip(np.median(dur_by_c[c]) if len(dur_by_c[c])>0 else 13, 9, 25))
    return med

MED_K = compute_class_median_durations()

D_in = np.load(next(iter((feat_tr_dir).glob('*.npz'))))['X'].shape[1]

class DilatedTCN(nn.Module):
    def __init__(self, d_in, channels=96, layers=10, num_classes=21, dropout=0.3):
        super().__init__()
        self.inp = nn.Conv1d(d_in, channels, kernel_size=1)
        blocks = []; dil=1
        for _ in range(layers):
            blocks.append(nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=3, padding=dil, dilation=dil),
                nn.GroupNorm(num_groups=8, num_channels=channels),
                nn.ReLU(inplace=True),
                nn.Dropout(dropout),
                nn.Conv1d(channels, channels, kernel_size=1),
                nn.GroupNorm(num_groups=8, num_channels=channels),
                nn.ReLU(inplace=True),
            ))
            dil = min(dil*2, 512)
        self.blocks = nn.ModuleList(blocks)
        self.head = nn.Conv1d(channels, num_classes, kernel_size=1)
    def forward(self, x_b_t_d):
        x = x_b_t_d.transpose(1,2)
        h = self.inp(x)
        for blk in self.blocks:
            res = h; h = blk(h); h = h + res
        logits = self.head(h)
        return logits.transpose(1,2)

def avg_pool_probs(p_t_c: torch.Tensor, k: int) -> torch.Tensor:
    x = p_t_c.unsqueeze(0).transpose(1,2)
    y = F.avg_pool1d(x, kernel_size=k, stride=1, padding=k//2)
    return y.transpose(1,2).squeeze(0)

def duration_integral_single(p_t: torch.Tensor, k: int) -> torch.Tensor:
    x = p_t.view(1,1,-1)
    w = torch.ones(1,1,k, device=p_t.device, dtype=p_t.dtype) / float(k)
    y = F.conv1d(x, w, padding=k//2)
    return y.view(-1)

def refine_com(p: torch.Tensor, t_star: int, w: int = 5) -> float:
    T = p.shape[0]
    a = max(0, t_star - w); b = min(T-1, t_star + w)
    idx = torch.arange(a, b+1, device=p.device, dtype=p.dtype)
    seg = p[a:b+1]; s = seg.sum() + 1e-8
    return float(((idx * seg).sum() / s).item())

def decode_video_probs_refined(p_t_c: torch.Tensor, pool_k=13, temp=0.9):
    if temp != 1.0:
        p_t_c = (p_t_c ** (1.0/temp)); p_t_c = p_t_c / (p_t_c.sum(dim=-1, keepdim=True) + 1e-8)
    p_s = avg_pool_probs(p_t_c, k=pool_k)
    T,C = p_s.shape
    scores = torch.empty_like(p_s)
    for c in range(C):
        k = MED_K.get(c, 13) if c!=0 else 13
        scores[:,c] = p_s[:,c] if c==0 else duration_integral_single(p_s[:,c], k=k)
    peaks = []
    for c in range(1,21):
        radius = max(10, MED_K.get(c,13)//2)
        s = scores[:,c]
        t_star = int(torch.argmax(s).item())
        t_ref = refine_com(p_s[:,c], t_star, w=5)
        t_idx = int(round(t_ref)); t_idx = min(max(t_idx, 0), T-1)
        local_mean = p_s[max(0,t_idx-radius):min(T,t_idx+radius+1), c].mean().item()
        peaks.append((c, t_ref, float(scores[t_idx, c].item()), float(local_mean)))
    peaks.sort(key=lambda x: (x[1], -x[2], -x[3]))
    return [c for c,_,_,_ in peaks]

def levenshtein(a,b):
    n,m=len(a),len(b)
    if n==0: return m
    if m==0: return n
    dp=list(range(m+1))
    for i in range(1,n+1):
        prev=dp[0]; dp[0]=i; ai=a[i-1]
        for j in range(1,m+1):
            tmp=dp[j]; dp[j]=min(dp[j]+1, dp[j-1]+1, prev + (0 if ai==b[j-1] else 1)); prev=tmp
    return dp[m]

id2seq = {int(r.Id): [int(x) for x in str(r.Sequence).strip().split()] for _, r in train_df.iterrows()}

def load_ce_models(paths):
    models=[]
    for p in paths:
        m = DilatedTCN(d_in=D_in, channels=96, layers=10, num_classes=21, dropout=0.3).to(device)
        m.load_state_dict(torch.load(p, map_location=device)); m.eval(); models.append(m)
    return models

def time_warp_probs(p_t_c: torch.Tensor, factor: float) -> torch.Tensor:
    T, C = p_t_c.shape
    tgt_len = max(1, int(round(T*factor)))
    x = p_t_c.T.unsqueeze(0)
    y = F.interpolate(x, size=tgt_len, mode='linear', align_corners=False)
    y2 = F.interpolate(y, size=T, mode='linear', align_corners=False)[0].T
    y2 = y2 / (y2.sum(dim=-1, keepdim=True) + 1e-8)
    return y2

def apply_tta_timewarp(p_t_c: torch.Tensor, factors=(0.9,1.0,1.1)) -> torch.Tensor:
    acc=None
    for s in factors:
        ps = time_warp_probs(p_t_c, s)
        acc = ps if acc is None else (acc + ps)
    out = acc / float(len(factors))
    out = out / (out.sum(dim=-1, keepdim=True) + 1e-8)
    return out

def ensemble_ce_probs(xb, models) -> torch.Tensor:
    with torch.no_grad():
        probs_sum=None
        for m in models:
            p = m(xb)[0].softmax(dim=-1)
            probs_sum = p if probs_sum is None else (probs_sum + p)
        probs = probs_sum / float(len(models))
        probs = probs / (probs.sum(dim=-1, keepdim=True) + 1e-8)
        return probs

ce_ckpts = ["model_ce_tcn_s0.pth", "model_ce_tcn_s1.pth", "model_ce_tcn_s2.pth"]
for p in ce_ckpts: assert Path(p).exists(), f"Missing {p}"
ce_models = load_ce_models(ce_ckpts)

temps = [0.85, 0.90, 0.95, 1.00]
pool_ks = [13, 15]

best=(1e9, None)
results=[]
with torch.no_grad():
    for pool_k in pool_ks:
        for temp in temps:
            tot=0; cnt=0; t0=time.time()
            for sid in val_ids:
                X = load_feat(int(sid), 'train', 1800)
                xb = torch.from_numpy(X).unsqueeze(0).to(device)
                probs = ensemble_ce_probs(xb, ce_models)
                probs = apply_tta_timewarp(probs, factors=(0.9,1.0,1.1))
                seq = decode_video_probs_refined(probs, pool_k=pool_k, temp=temp)
                tot += levenshtein(seq, id2seq[int(sid)]); cnt += 1
            val_lev = tot/max(cnt,1)
            cfg = dict(pool_k=pool_k, temp=temp)
            results.append((val_lev, cfg))
            print(f"  [VAL V4b] lev={val_lev:.4f} cfg={cfg} elapsed={(time.time()-t0):.1f}s", flush=True)
            if val_lev < best[0]: best=(val_lev, cfg)

results.sort(key=lambda x: x[0])
print("=== Top V4b configs ===")
for r in results[:5]:
    print(r)
print("BEST V4b:", best)

best_val, best_cfg = best
print(f"=== TEST inference V4b CE-only with TTA using cfg={best_cfg} ===", flush=True)
test_ids = pd.read_csv('test.csv')['Id'].astype(int).tolist()
rows=[]; t0=time.time()
with torch.no_grad():
    for i, sid in enumerate(test_ids, 1):
        X = load_feat(int(sid), 'test', 1800)
        xb = torch.from_numpy(X).unsqueeze(0).to(device)
        probs = ensemble_ce_probs(xb, ce_models)
        probs = apply_tta_timewarp(probs, factors=(0.9,1.0,1.1))
        seq = decode_video_probs_refined(probs, pool_k=best_cfg['pool_k'], temp=best_cfg['temp'])
        rows.append({'Id': int(sid), 'Sequence': ' '.join(str(x) for x in seq)})
        if (i%10)==0 or i==len(test_ids):
            print(f"  [test V4b] {i}/{len(test_ids)} elapsed={(time.time()-t0)/60:.1f}m", flush=True)
sub = pd.DataFrame(rows, columns=['Id','Sequence'])
assert len(sub)==95
assert all(len(s.split())==20 and len(set(s.split()))==20 and all(1<=int(t)<=20 for t in s.split()) for s in sub.Sequence), "Submission row format invalid"
sub.to_csv('submission.csv', index=False)
print('Wrote submission.csv (V4b CE-only TTA+temp); head:\n', sub.head())
print('=== V4b complete ===')

=== V4b: CE-only (3 seeds) + time-warp TTA + global temperature grid ===


  m.load_state_dict(torch.load(p, map_location=device)); m.eval(); models.append(m)


  [VAL V4b] lev=4.3182 cfg={'pool_k': 13, 'temp': 0.85} elapsed=0.7s


  [VAL V4b] lev=4.3636 cfg={'pool_k': 13, 'temp': 0.9} elapsed=0.7s


  [VAL V4b] lev=4.3182 cfg={'pool_k': 13, 'temp': 0.95} elapsed=0.7s


  [VAL V4b] lev=4.2955 cfg={'pool_k': 13, 'temp': 1.0} elapsed=0.7s


  [VAL V4b] lev=4.3182 cfg={'pool_k': 15, 'temp': 0.85} elapsed=0.7s


  [VAL V4b] lev=4.3636 cfg={'pool_k': 15, 'temp': 0.9} elapsed=0.7s


  [VAL V4b] lev=4.3409 cfg={'pool_k': 15, 'temp': 0.95} elapsed=0.8s


  [VAL V4b] lev=4.3409 cfg={'pool_k': 15, 'temp': 1.0} elapsed=0.8s


=== Top V4b configs ===
(4.295454545454546, {'pool_k': 13, 'temp': 1.0})
(4.318181818181818, {'pool_k': 13, 'temp': 0.85})
(4.318181818181818, {'pool_k': 13, 'temp': 0.95})
(4.318181818181818, {'pool_k': 15, 'temp': 0.85})
(4.340909090909091, {'pool_k': 15, 'temp': 0.95})
BEST V4b: (4.295454545454546, {'pool_k': 13, 'temp': 1.0})
=== TEST inference V4b CE-only with TTA using cfg={'pool_k': 13, 'temp': 1.0} ===


  [test V4b] 10/95 elapsed=0.0m


  [test V4b] 20/95 elapsed=0.0m


  [test V4b] 30/95 elapsed=0.0m


  [test V4b] 40/95 elapsed=0.0m


  [test V4b] 50/95 elapsed=0.0m


  [test V4b] 60/95 elapsed=0.0m


  [test V4b] 70/95 elapsed=0.0m


  [test V4b] 80/95 elapsed=0.0m


  [test V4b] 90/95 elapsed=0.0m


  [test V4b] 95/95 elapsed=0.0m


Wrote submission.csv (V4b CE-only TTA+temp); head:
     Id                                           Sequence
0  300  5 9 7 1 2 18 3 8 4 20 13 12 15 14 11 6 16 19 1...
1  301  10 12 3 1 5 4 20 6 2 11 15 13 19 7 9 8 18 14 1...
2  302  1 17 16 12 3 5 19 13 20 18 11 4 6 15 8 14 10 9...
3  303  13 4 12 3 10 14 5 19 15 20 17 1 11 16 8 18 7 6...
4  304  8 1 7 12 18 13 9 2 11 3 20 19 5 14 6 15 17 16 ...
=== V4b complete ===
