Conversation
…ucture dist/ Training: - model.py: remove all arch branches except cnn-medium-attn+RoPE (805→294 lines) - removes: bilstm/bigru/tcn/cnn-fast-slow/cnn-local-attn/cnn-lstm/cnn-large/cnn - removes: GatedResidual, SinusoidalPositionalEncoding, LocalAttentionBlock, TCNBlock - removes: focal loss, label smoothing, b-dice, per-head weighted loss, legacy flags - keeps: dice loss, RoPE chunked inference, HM(sign,phrase) validation metric - train.py: remove curriculum callbacks; simplify to direct get_dataloader() path - args.py: remove unused args (arch, pos_encoding, acceleration, speed_aug, weighted_loss, focal_gamma, label_smoothing, b_dice, curriculum) - dataset.py: remove acceleration and speed_aug branches Inference: - bin.py: new 2026 inference CLI (load .ckpt, process pose, write ELAN) - old/bin.py: update dist path to dist/2023/ Evaluation: - evaluate.py: add as tracked file; remove windowed/LSTM eval path Dist: - dist/2023/: move 2023 TorchScript .pth models from old/dist/; add README - dist/2026/: add EXPERIMENTS.md and findings README Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| eval_args.b_threshold = best_cfg.get("b_threshold", eval_args.b_threshold) | ||
| eval_args.io_threshold = best_cfg.get("io_threshold", eval_args.io_threshold) | ||
|
|
||
| if eval_args.likeliest: |
There was a problem hiding this comment.
did we check which is better? likeliest or probs_to_segments? from thinking that we might not need thresholds anymore? that would simplify the code a lot
There was a problem hiding this comment.
Haven't done a systematic comparison, but from threshold-sweep experiments (--tune_threshold) likeliest (argmax) has generally matched or beaten threshold-based on IoU — and it's simpler. Made likeliest the default. Threshold-based still available via --threshold + --tune_threshold if needed. Happy to add a explicit comparison to the eval script if you'd like.
There was a problem hiding this comment.
please make an evaluation of likeliest vs threshold and report here. if likeliest is the same/better, we can remove the threshold code, and all threshold tuning code. simplifies the repo
There was a problem hiding this comment.
Running E165 (1536fr, new code) to get a clean comparison. Will report here once done. Previous models (E162, E163) were trained with probs_to_segments as the validation metric, so comparing on those is skewed. E165 is the first model trained with the new fps-normalised velocity and ClassifierHead, so results will be directly comparable.
There was a problem hiding this comment.
Ablation results (E165-1536-batch8-drop01-fixchunk-hmval-3h, dev, 50fps):
| Decoding | Sign IoU | Phrase IoU | HM |
|---|---|---|---|
| likeliest (argmax) | 0.5598 | 0.8892 | 0.6871 |
| threshold (default 0.5) | 0.5519 | 0.8885 | 0.6809 |
| tuned threshold (swept) | 0.5706 | 0.8937 | 0.6965 |
Tuned threshold beats likeliest by +0.011 Sign IoU, so we keep threshold-based decoding for final reporting. likeliest is the fast-path default (no tuning needed); --tune_threshold is the recommended path for final numbers.
Note: training validation used likeliest; the eval gap vs tuned threshold is ~1pp.
- Remove dist/2023/ (use the 2023 git tag/release instead) - Remove sign_language_segmentation/old/bin.py - pyproject.toml: remove old/* packages, dist/2023 data-files, use pip pose-anonymization - args.py: set best defaults (velocity, fps_aug, frame_dropout=0.15, body_part_dropout=0.1, optimizer=adamw-onecycle); drop no_face/normalize/pose_dims as deprecated hidden args - data/utils.py: preprocess_pose always applies no_face+normalize (remove conditionals); add compute_velocity(pose_data, frame_times_seconds) utility - data/dataset.py: remove normalize/no_face params; timestamps now in seconds - model/model.py: add ClassifierHead (linear→GELU→linear) for both BIO heads; RoPE now expects timestamps in seconds and scales by reference_fps=50 internally; use bio_labels_to_segments from metrics (no more duplicated BIO→segment loop) - metrics.py: add bio_labels_to_segments() shared utility - bin.py: @torch.inference_mode, seconds-based timestamps, use compute_velocity - evaluate.py: use bio_labels_to_segments; likeliest_probs_to_segments is now default - train.py: print best.ckpt path after training - dist/2026/README.md: fix architecture description (skip connections, residual, RoPE in seconds), clarify attention mask failure reason, remove HM row, note depth=4 worth retrying Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…luate - model.py: on_load_checkpoint migrates old single-Linear heads to nn.Linear when loading pre-ClassifierHead checkpoints (strict=False loads the remaining keys correctly; old sign_bio_head.weight maps directly) - dataset.py: fix missing frame_times_ms assignment in non-fps_aug path - evaluate.py: add --chunk_multiplier flag to scale inference chunk size for RoPE generalisation ablation (1x/2x/4x) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Uses the same argmax decoding in both validation_step and evaluate.py, removing the discrepancy where training validation used threshold-based probs_to_segments but evaluate.py reported likeliest results. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tric switch E165 is currently training; switching validation metric mid-run risks premature early stopping. Revert to probs_to_segments for consistency with E165 training. Will align metrics after E165 completes once we have evidence that likeliest is better than threshold for new models. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AmitMY
left a comment
There was a problem hiding this comment.
Updated main README.md with Docker-based training/evaluation instructions and a pointer to Dockerfile.train. The Dockerfile was already in the repo root but not documented in the README.
- Add Docker build/train/evaluate commands pointing to Dockerfile.train - Add local development setup - Update architecture description to match 2026 CNN-medium-attn + RoPE - Point to dist/2026/README.md for full details - Remove outdated 2025 content Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All best hyperparameters are now defaults in args.py: velocity=True, fps_aug=True, body_part_dropout=0.1, frame_dropout=0.15, dice_loss_weight=1.0. Training command only needs corpus/poses and resource params (batch_size, num_frames, patience). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Dockerfile - Remove probs_to_segments / _io_probs_to_segments from metrics.py — likeliest (argmax) decoding wins on E169 and generalises better to test set; threshold was overfitting dev sign IoU at the expense of phrase IoU. - evaluate.py: drop --threshold/--tune_threshold/--b/o/io_threshold args; decoding path is now simply likeliest + optional filter_segments. - bin.py: remove unused probs_to_segments import. - model.py: batched chunk inference in encode() — all chunks stacked into one batch and processed in a single transformer forward pass instead of N serial calls; remove on_load_checkpoint backward-compat shim. - Dockerfile.train: add training image definition (nvcr pytorch:26.02-py3 base, installs deps from pyproject.toml; code is mounted at runtime). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…docs - Delete sign_language_segmentation/old/ (2023-era code: SLURM job scripts, old threshold decoder, old tests — all superseded by 2026 rewrite) - args.py: remove deprecated suppressed args (--arch, --pos_encoding, --no_face, --no_normalize, --pose_dims, --acceleration, --speed_aug, --target_fps, --steps_per_epoch); update defaults to match best config (depth=4, dice=1.5) - dist/2026/README.md: fix architecture (depth=4 not 6), update best results table with E166-E169, add threshold decoding to "What Did Not Help", correct training command - README.md: fix training command to use correct hyperparams (depth=4, 1024fr) - .gitignore: add models/, logs/, lightning_logs/, *.egg-info/ Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e_pose_segments - bin.py: add segment_pose() importable function (loads model via lru_cache, runs inference, returns eaf + tiers dict); add save_pose_segments() to crop and save per-segment .pose files; add --save-segments and --subtitles CLI args; model loading now cached so repeated calls are fast - server.py: Flask server exposing POST / for pose segmentation (input/output as file paths or gs:// URIs) and GET /health; single-frame edge case handled - Dockerfile: CPU-only inference image (python:3.12-slim + torch CPU wheel); serves via gunicorn; copies source and dist/2026/best.ckpt at build time - pyproject.toml: add [server] optional deps (Flask, Werkzeug, gunicorn) - .github/workflows/publish-docker.yaml: publish image to ghcr.io on release - README.md: add Python API example, server usage, health check Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
E169 (depth=4, 1024fr, 6h) beats Efinal on both dev and test: dev HM=0.763 (Sign=0.657, Phr=0.910) test HM=0.764 (Sign=0.652, Phr=0.925) Efinal trained longer but early stopping had already found the optimum. best.ckpt updated to E169 checkpoint. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- tests/test_inference.py: smoke tests for segment_pose (tiers, start/end, eaf tiers); example.pose bundled for CI - ruff fixes: remove unused imports (argparse, math, numpy), remove unused gold_range variable, replace lambda with def in evaluate.py - pyproject.toml: move pytorch-lightning and scikit-learn to core deps (both required at inference time, not just dev); add **/*.ckpt to package-data so best.ckpt ships with pip install - sign_language_segmentation/dist/2026/best.ckpt: E169 checkpoint bundled inside the package; _default_model_path() updated to find it via __file__ - Dockerfile: fix layer ordering (copy source then pip install --no-deps -e . so actual code is installed, not build stubs); warmup call now succeeds; fix ENV syntax and CMD JSON form to eliminate build warnings Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Strip AdamW optimizer states and convert float32→bfloat16 to reduce checkpoint size ~6x for deployment without affecting inference quality (dev HM-IoU 0.763 preserved). Add slim_checkpoint CLI entry point so future dist checkpoints can be prepared in one command. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… note - Restore complete bibtex entry (editor, address, doi, pages) from main - Restore '## 2023 Version (v2023)' section linking to the paper code - Document slim_checkpoint usage in dist/2026/README.md Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Distills the 2026 experimental work down to a clean, minimal production path. Removes all architecture branches we didn't end up using and all loss/augmentation variants that didn't improve results.
Changes
cnn-medium-attn+ RoPE only — removed bilstm, bigru, tcn, cnn-fast-slow, cnn-local-attn, cnn-lstm, cnn-large, cnn; removed focal loss, label smoothing, b-dice, per-head weighted loss, legacy flagsget_dataloader()pathaccelerationandspeed_augbranches.ckpt, preprocesses pose, runs chunked RoPE inference, writes ELANdist/2023/.pthmodels fromold/dist/; added READMEEXPERIMENTS.md+ findingsREADME.mdKey findings (dist/2026/README.md)
Helped: RoPE transformer, Dice loss (weight=1.0), fps_aug (essential — disabling costs −9pp Sign IoU), body_part_dropout=0.1 (+10.5pp Phrase25), frame_dropout=0.15 (essential regularisation), velocity features, no_face, hidden_dim=384, HM(sign,phrase) validation metric, inference chunk_size=num_frames (bug fix: +12.8pp Phrase IoU)
Didn't help: attention padding mask (−7pp S25 consistently), B-frame Dice loss, focal loss, label smoothing, speed augmentation, frame curriculum, removing frame_dropout (catastrophic phrase overfitting after ~50 epochs)
Best result so far (E145, 1024 frames)
🤖 Generated with Claude Code