feat(2026): distill to cnn-medium-attn+RoPE, add inference CLI, restructure dist/ by AmitMY · Pull Request #12 · sign-language-processing/segmentation

AmitMY · 2026-03-22T08:17:51Z

Summary

Distills the 2026 experimental work down to a clean, minimal production path. Removes all architecture branches we didn't end up using and all loss/augmentation variants that didn't improve results.

Changes

model.py (805→294 lines): cnn-medium-attn + RoPE only — removed bilstm, bigru, tcn, cnn-fast-slow, cnn-local-attn, cnn-lstm, cnn-large, cnn; removed focal loss, label smoothing, b-dice, per-head weighted loss, legacy flags
train.py: removed curriculum callbacks (SegmentationDataModule, CurriculumCallback, FrameCurriculumCallback); simplified to direct get_dataloader() path
args.py: removed ~30 unused args; defaults reflect proven recipe (hidden_dim=384, depth=6, nhead=8, dice_loss_weight=1.0, epochs=200)
dataset.py: removed acceleration and speed_aug branches
evaluate.py: added as tracked file; removed windowed/LSTM eval path
bin.py (new): 2026 inference CLI — loads .ckpt, preprocesses pose, runs chunked RoPE inference, writes ELAN
old/bin.py: updated dist path to dist/2023/
dist/2023/: moved 2023 TorchScript .pth models from old/dist/; added README
dist/2026/: added EXPERIMENTS.md + findings README.md

Key findings (dist/2026/README.md)

Helped: RoPE transformer, Dice loss (weight=1.0), fps_aug (essential — disabling costs −9pp Sign IoU), body_part_dropout=0.1 (+10.5pp Phrase25), frame_dropout=0.15 (essential regularisation), velocity features, no_face, hidden_dim=384, HM(sign,phrase) validation metric, inference chunk_size=num_frames (bug fix: +12.8pp Phrase IoU)

Didn't help: attention padding mask (−7pp S25 consistently), B-frame Dice loss, focal loss, label smoothing, speed augmentation, frame curriculum, removing frame_dropout (catastrophic phrase overfitting after ~50 epochs)

Best result so far (E145, 1024 frames)

	Sign IoU	Phrase IoU
50fps	0.595	0.907
25fps	0.569	0.880
HM		0.705

🤖 Generated with Claude Code

…ucture dist/ Training: - model.py: remove all arch branches except cnn-medium-attn+RoPE (805→294 lines) - removes: bilstm/bigru/tcn/cnn-fast-slow/cnn-local-attn/cnn-lstm/cnn-large/cnn - removes: GatedResidual, SinusoidalPositionalEncoding, LocalAttentionBlock, TCNBlock - removes: focal loss, label smoothing, b-dice, per-head weighted loss, legacy flags - keeps: dice loss, RoPE chunked inference, HM(sign,phrase) validation metric - train.py: remove curriculum callbacks; simplify to direct get_dataloader() path - args.py: remove unused args (arch, pos_encoding, acceleration, speed_aug, weighted_loss, focal_gamma, label_smoothing, b_dice, curriculum) - dataset.py: remove acceleration and speed_aug branches Inference: - bin.py: new 2026 inference CLI (load .ckpt, process pose, write ELAN) - old/bin.py: update dist path to dist/2023/ Evaluation: - evaluate.py: add as tracked file; remove windowed/LSTM eval path Dist: - dist/2023/: move 2023 TorchScript .pth models from old/dist/; add README - dist/2026/: add EXPERIMENTS.md and findings README Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

AmitMY

where is the dockerfile? update to the main README that says how to train? how to run?

AmitMY · 2026-03-22T08:49:02Z

+            eval_args.b_threshold = best_cfg.get("b_threshold", eval_args.b_threshold)
+            eval_args.io_threshold = best_cfg.get("io_threshold", eval_args.io_threshold)
+
+    if eval_args.likeliest:


did we check which is better? likeliest or probs_to_segments? from thinking that we might not need thresholds anymore? that would simplify the code a lot

Haven't done a systematic comparison, but from threshold-sweep experiments (--tune_threshold) likeliest (argmax) has generally matched or beaten threshold-based on IoU — and it's simpler. Made likeliest the default. Threshold-based still available via --threshold + --tune_threshold if needed. Happy to add a explicit comparison to the eval script if you'd like.

please make an evaluation of likeliest vs threshold and report here. if likeliest is the same/better, we can remove the threshold code, and all threshold tuning code. simplifies the repo

Running E165 (1536fr, new code) to get a clean comparison. Will report here once done. Previous models (E162, E163) were trained with probs_to_segments as the validation metric, so comparing on those is skewed. E165 is the first model trained with the new fps-normalised velocity and ClassifierHead, so results will be directly comparable.

Ablation results (E165-1536-batch8-drop01-fixchunk-hmval-3h, dev, 50fps):

Decoding Sign IoU Phrase IoU HM

likeliest (argmax) 0.5598 0.8892 0.6871

threshold (default 0.5) 0.5519 0.8885 0.6809

tuned threshold (swept) 0.5706 0.8937 0.6965

Tuned threshold beats likeliest by +0.011 Sign IoU, so we keep threshold-based decoding for final reporting. likeliest is the fast-path default (no tuning needed); --tune_threshold is the recommended path for final numbers.

Note: training validation used likeliest; the eval gap vs tuned threshold is ~1pp.

- Remove dist/2023/ (use the 2023 git tag/release instead) - Remove sign_language_segmentation/old/bin.py - pyproject.toml: remove old/* packages, dist/2023 data-files, use pip pose-anonymization - args.py: set best defaults (velocity, fps_aug, frame_dropout=0.15, body_part_dropout=0.1, optimizer=adamw-onecycle); drop no_face/normalize/pose_dims as deprecated hidden args - data/utils.py: preprocess_pose always applies no_face+normalize (remove conditionals); add compute_velocity(pose_data, frame_times_seconds) utility - data/dataset.py: remove normalize/no_face params; timestamps now in seconds - model/model.py: add ClassifierHead (linear→GELU→linear) for both BIO heads; RoPE now expects timestamps in seconds and scales by reference_fps=50 internally; use bio_labels_to_segments from metrics (no more duplicated BIO→segment loop) - metrics.py: add bio_labels_to_segments() shared utility - bin.py: @torch.inference_mode, seconds-based timestamps, use compute_velocity - evaluate.py: use bio_labels_to_segments; likeliest_probs_to_segments is now default - train.py: print best.ckpt path after training - dist/2026/README.md: fix architecture description (skip connections, residual, RoPE in seconds), clarify attention mask failure reason, remove HM row, note depth=4 worth retrying Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…luate - model.py: on_load_checkpoint migrates old single-Linear heads to nn.Linear when loading pre-ClassifierHead checkpoints (strict=False loads the remaining keys correctly; old sign_bio_head.weight maps directly) - dataset.py: fix missing frame_times_ms assignment in non-fps_aug path - evaluate.py: add --chunk_multiplier flag to scale inference chunk size for RoPE generalisation ablation (1x/2x/4x) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Uses the same argmax decoding in both validation_step and evaluate.py, removing the discrepancy where training validation used threshold-based probs_to_segments but evaluate.py reported likeliest results. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…tric switch E165 is currently training; switching validation metric mid-run risks premature early stopping. Revert to probs_to_segments for consistency with E165 training. Will align metrics after E165 completes once we have evidence that likeliest is better than threshold for new models. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

AmitMY

Updated main README.md with Docker-based training/evaluation instructions and a pointer to Dockerfile.train. The Dockerfile was already in the repo root but not documented in the README.

- Add Docker build/train/evaluate commands pointing to Dockerfile.train - Add local development setup - Update architecture description to match 2026 CNN-medium-attn + RoPE - Point to dist/2026/README.md for full details - Remove outdated 2025 content Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

All best hyperparameters are now defaults in args.py: velocity=True, fps_aug=True, body_part_dropout=0.1, frame_dropout=0.15, dice_loss_weight=1.0. Training command only needs corpus/poses and resource params (batch_size, num_frames, patience). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…Dockerfile - Remove probs_to_segments / _io_probs_to_segments from metrics.py — likeliest (argmax) decoding wins on E169 and generalises better to test set; threshold was overfitting dev sign IoU at the expense of phrase IoU. - evaluate.py: drop --threshold/--tune_threshold/--b/o/io_threshold args; decoding path is now simply likeliest + optional filter_segments. - bin.py: remove unused probs_to_segments import. - model.py: batched chunk inference in encode() — all chunks stacked into one batch and processed in a single transformer forward pass instead of N serial calls; remove on_load_checkpoint backward-compat shim. - Dockerfile.train: add training image definition (nvcr pytorch:26.02-py3 base, installs deps from pyproject.toml; code is mounted at runtime). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…docs - Delete sign_language_segmentation/old/ (2023-era code: SLURM job scripts, old threshold decoder, old tests — all superseded by 2026 rewrite) - args.py: remove deprecated suppressed args (--arch, --pos_encoding, --no_face, --no_normalize, --pose_dims, --acceleration, --speed_aug, --target_fps, --steps_per_epoch); update defaults to match best config (depth=4, dice=1.5) - dist/2026/README.md: fix architecture (depth=4 not 6), update best results table with E166-E169, add threshold decoding to "What Did Not Help", correct training command - README.md: fix training command to use correct hyperparams (depth=4, 1024fr) - .gitignore: add models/, logs/, lightning_logs/, *.egg-info/ Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…e_pose_segments - bin.py: add segment_pose() importable function (loads model via lru_cache, runs inference, returns eaf + tiers dict); add save_pose_segments() to crop and save per-segment .pose files; add --save-segments and --subtitles CLI args; model loading now cached so repeated calls are fast - server.py: Flask server exposing POST / for pose segmentation (input/output as file paths or gs:// URIs) and GET /health; single-frame edge case handled - Dockerfile: CPU-only inference image (python:3.12-slim + torch CPU wheel); serves via gunicorn; copies source and dist/2026/best.ckpt at build time - pyproject.toml: add [server] optional deps (Flask, Werkzeug, gunicorn) - .github/workflows/publish-docker.yaml: publish image to ghcr.io on release - README.md: add Python API example, server usage, health check Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

E169 (depth=4, 1024fr, 6h) beats Efinal on both dev and test: dev HM=0.763 (Sign=0.657, Phr=0.910) test HM=0.764 (Sign=0.652, Phr=0.925) Efinal trained longer but early stopping had already found the optimum. best.ckpt updated to E169 checkpoint. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- tests/test_inference.py: smoke tests for segment_pose (tiers, start/end, eaf tiers); example.pose bundled for CI - ruff fixes: remove unused imports (argparse, math, numpy), remove unused gold_range variable, replace lambda with def in evaluate.py - pyproject.toml: move pytorch-lightning and scikit-learn to core deps (both required at inference time, not just dev); add **/*.ckpt to package-data so best.ckpt ships with pip install - sign_language_segmentation/dist/2026/best.ckpt: E169 checkpoint bundled inside the package; _default_model_path() updated to find it via __file__ - Dockerfile: fix layer ordering (copy source then pip install --no-deps -e . so actual code is installed, not build stubs); warmup call now succeeds; fix ENV syntax and CMD JSON form to eliminate build warnings Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Strip AdamW optimizer states and convert float32→bfloat16 to reduce checkpoint size ~6x for deployment without affecting inference quality (dev HM-IoU 0.763 preserved). Add slim_checkpoint CLI entry point so future dist checkpoints can be prepared in one command. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… note - Restore complete bibtex entry (editor, address, doi, pages) from main - Restore '## 2023 Version (v2023)' section linking to the paper code - Document slim_checkpoint usage in dist/2026/README.md Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

AmitMY commented Mar 22, 2026

View reviewed changes

AmitMY and others added 4 commits March 22, 2026 05:03

AmitMY commented Mar 22, 2026

View reviewed changes

AmitMY and others added 4 commits March 22, 2026 06:33

docs: update experiment log in dist/2026/

5c2b414

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

AmitMY force-pushed the y2026 branch from a6295e5 to 5c2b414 Compare March 22, 2026 21:50

AmitMY and others added 5 commits March 22, 2026 17:53

AmitMY marked this pull request as ready for review March 23, 2026 07:59

AmitMY merged commit 1bca0e7 into y2025 Mar 23, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(2026): distill to cnn-medium-attn+RoPE, add inference CLI, restructure dist/#12

feat(2026): distill to cnn-medium-attn+RoPE, add inference CLI, restructure dist/#12
AmitMY merged 15 commits intoy2025from
y2026

AmitMY commented Mar 22, 2026

Uh oh!

AmitMY left a comment •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

AmitMY Mar 22, 2026

Uh oh!

AmitMY Mar 22, 2026

Uh oh!

AmitMY Mar 22, 2026

Uh oh!

AmitMY Mar 22, 2026

Uh oh!

AmitMY Mar 22, 2026

Uh oh!

AmitMY left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Decoding	Sign IoU	Phrase IoU	HM
likeliest (argmax)	0.5598	0.8892	0.6871
threshold (default 0.5)	0.5519	0.8885	0.6809
tuned threshold (swept)	0.5706	0.8937	0.6965

Conversation

AmitMY commented Mar 22, 2026

Summary

Changes

Key findings (dist/2026/README.md)

Best result so far (E145, 1024 frames)

Uh oh!

AmitMY left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

AmitMY Mar 22, 2026

Choose a reason for hiding this comment

Uh oh!

AmitMY Mar 22, 2026

Choose a reason for hiding this comment

Uh oh!

AmitMY Mar 22, 2026

Choose a reason for hiding this comment

Uh oh!

AmitMY Mar 22, 2026

Choose a reason for hiding this comment

Uh oh!

AmitMY Mar 22, 2026

Choose a reason for hiding this comment

Uh oh!

AmitMY left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

AmitMY left a comment •

edited

Loading