
WIP feat(2025): major refactor to train directly from database, new model architecture #7

Merged
AmitMY merged 17 commits into main from y2025
Mar 23, 2026

Conversation


AmitMY commented Apr 1, 2025

No description provided.

AmitMY and others added 17 commits January 6, 2025 11:00
…ucture dist/

Training:
- model.py: remove all arch branches except cnn-medium-attn+RoPE (805→294 lines)
  - removes: bilstm/bigru/tcn/cnn-fast-slow/cnn-local-attn/cnn-lstm/cnn-large/cnn
  - removes: GatedResidual, SinusoidalPositionalEncoding, LocalAttentionBlock, TCNBlock
  - removes: focal loss, label smoothing, b-dice, per-head weighted loss, legacy flags
  - keeps: dice loss, RoPE chunked inference, HM(sign,phrase) validation metric
- train.py: remove curriculum callbacks; simplify to direct get_dataloader() path
- args.py: remove unused args (arch, pos_encoding, acceleration, speed_aug,
  weighted_loss, focal_gamma, label_smoothing, b_dice, curriculum)
- dataset.py: remove acceleration and speed_aug branches

Inference:
- bin.py: new 2026 inference CLI (load .ckpt, process pose, write ELAN)
- old/bin.py: update dist path to dist/2023/

Evaluation:
- evaluate.py: add as tracked file; remove windowed/LSTM eval path

Dist:
- dist/2023/: move 2023 TorchScript .pth models from old/dist/; add README
- dist/2026/: add EXPERIMENTS.md and findings README
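The kept dice loss mentioned above is the standard soft-Dice formulation. As a reference, a minimal sketch for a per-frame binary mask (the textbook formula, not code lifted from the repository) could look like:

```python
import numpy as np

def dice_loss(probs: np.ndarray, targets: np.ndarray, eps: float = 1e-6) -> float:
    """Soft Dice loss over per-frame probabilities in [0, 1].

    probs and targets are (frames,) arrays; eps guards empty masks.
    A sketch of the standard formulation, not the repository's exact code.
    """
    intersection = (probs * targets).sum()
    return 1.0 - (2.0 * intersection + eps) / (probs.sum() + targets.sum() + eps)
```

A perfect prediction drives the loss to 0; fully disjoint masks drive it toward 1.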

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove dist/2023/ (use the 2023 git tag/release instead)
- Remove sign_language_segmentation/old/bin.py
- pyproject.toml: remove old/* packages, dist/2023 data-files, use pip pose-anonymization
- args.py: set best defaults (velocity, fps_aug, frame_dropout=0.15, body_part_dropout=0.1,
  optimizer=adamw-onecycle); drop no_face/normalize/pose_dims as deprecated hidden args
- data/utils.py: preprocess_pose always applies no_face+normalize (remove conditionals);
  add compute_velocity(pose_data, frame_times_seconds) utility
- data/dataset.py: remove normalize/no_face params; timestamps now in seconds
- model/model.py: add ClassifierHead (linear→GELU→linear) for both BIO heads;
  RoPE now expects timestamps in seconds and scales by reference_fps=50 internally;
  use bio_labels_to_segments from metrics (no more duplicated BIO→segment loop)
- metrics.py: add bio_labels_to_segments() shared utility
- bin.py: @torch.inference_mode, seconds-based timestamps, use compute_velocity
- evaluate.py: use bio_labels_to_segments; likeliest_probs_to_segments is now default
- train.py: print best.ckpt path after training
- dist/2026/README.md: fix architecture description (skip connections, residual,
  RoPE in seconds), clarify attention mask failure reason, remove HM row,
  note depth=4 worth retrying
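The shared bio_labels_to_segments() utility replaces the duplicated BIO-to-segment loop. A plausible sketch (the 0=O/1=B/2=I encoding and lenient handling of a stray I are assumptions, not taken from the repository):

```python
from typing import List, Tuple

def bio_labels_to_segments(labels: List[int]) -> List[Tuple[int, int]]:
    """Convert per-frame BIO labels (assumed 0=O, 1=B, 2=I) into
    half-open (start, end) frame spans. Hypothetical sketch."""
    segments = []
    start = None
    for i, label in enumerate(labels):
        if label == 1:  # B: close any open segment, start a new one
            if start is not None:
                segments.append((start, i))
            start = i
        elif label == 2:  # I: continue; leniently open on a stray I
            if start is None:
                start = i
        else:  # O: close the open segment
            if start is not None:
                segments.append((start, i))
                start = None
    if start is not None:
        segments.append((start, len(labels)))
    return segments
```

Two adjacent B labels yield two segments, which is exactly the case a shared utility keeps consistent between training validation and evaluation.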

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…luate

- model.py: on_load_checkpoint migrates old single-Linear heads into the new
  ClassifierHead layout when loading pre-ClassifierHead checkpoints
  (strict=False loads the remaining keys correctly; the old
  sign_bio_head.weight maps directly)
- dataset.py: fix missing frame_times_ms assignment in non-fps_aug path
- evaluate.py: add --chunk_multiplier flag to scale inference chunk size
  for RoPE generalisation ablation (1x/2x/4x)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Uses the same argmax decoding in both validation_step and evaluate.py,
removing the discrepancy where training validation used threshold-based
probs_to_segments but evaluate.py reported likeliest results.
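The argmax ("likeliest") decoding is parameter-free, in contrast to threshold decoding, which needs per-class cutoffs tuned on dev. A minimal sketch (the O/B/I class order is an assumption for illustration):

```python
import numpy as np

def likeliest_decode(probs: np.ndarray) -> np.ndarray:
    """Argmax decoding: pick the highest-probability BIO class per frame,
    with no tunable threshold. probs has shape (frames, n_classes)."""
    return probs.argmax(axis=-1)

# Threshold decoding would instead compare e.g. probs[:, 1] > b_threshold
# per class -- the per-dataset tuning this change removes.
```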

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tric switch

E165 is currently training; switching validation metric mid-run risks
premature early stopping. Revert to probs_to_segments for consistency
with E165 training. Will align metrics after E165 completes once we
have evidence that likeliest is better than threshold for new models.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add Docker build/train/evaluate commands pointing to Dockerfile.train
- Add local development setup
- Update architecture description to match 2026 CNN-medium-attn + RoPE
- Point to dist/2026/README.md for full details
- Remove outdated 2025 content

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All best hyperparameters are now defaults in args.py:
velocity=True, fps_aug=True, body_part_dropout=0.1, frame_dropout=0.15,
dice_loss_weight=1.0. Training command only needs corpus/poses and
resource params (batch_size, num_frames, patience).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Dockerfile

- Remove probs_to_segments / _io_probs_to_segments from metrics.py — likeliest
  (argmax) decoding wins on E169 and generalises better to test set; threshold
  was overfitting dev sign IoU at the expense of phrase IoU.
- evaluate.py: drop --threshold/--tune_threshold/--b/o/io_threshold args;
  decoding path is now simply likeliest + optional filter_segments.
- bin.py: remove unused probs_to_segments import.
- model.py: batched chunk inference in encode() — all chunks stacked into one
  batch and processed in a single transformer forward pass instead of N serial
  calls; remove on_load_checkpoint backward-compat shim.
- Dockerfile.train: add training image definition (nvcr pytorch:26.02-py3 base,
  installs deps from pyproject.toml; code is mounted at runtime).
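The batched chunk inference in encode() can be pictured as follows; numpy stands in for the PyTorch transformer here, and the zero-padding of the tail chunk is an assumption:

```python
import numpy as np

def encode_chunked(model, x: np.ndarray, chunk: int) -> np.ndarray:
    """Split a long (frames, dim) sequence into equal-length chunks,
    pad the tail, run ONE batched call instead of N serial ones,
    then re-flatten and trim the padding. Hypothetical sketch."""
    frames, dim = x.shape
    pad = (-frames) % chunk
    if pad:
        x = np.concatenate([x, np.zeros((pad, dim), dtype=x.dtype)])
    batch = x.reshape(-1, chunk, dim)  # (n_chunks, chunk, dim)
    out = model(batch)                 # single forward pass over all chunks
    return out.reshape(-1, out.shape[-1])[:frames]
```

On a GPU the single stacked forward amortizes kernel-launch and memory-transfer overhead that N serial calls would pay repeatedly.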

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…docs

- Delete sign_language_segmentation/old/ (2023-era code: SLURM job scripts,
  old threshold decoder, old tests — all superseded by 2026 rewrite)
- args.py: remove deprecated suppressed args (--arch, --pos_encoding, --no_face,
  --no_normalize, --pose_dims, --acceleration, --speed_aug, --target_fps,
  --steps_per_epoch); update defaults to match best config (depth=4, dice=1.5)
- dist/2026/README.md: fix architecture (depth=4 not 6), update best results
  table with E166-E169, add threshold decoding to "What Did Not Help",
  correct training command
- README.md: fix training command to use correct hyperparams (depth=4, 1024fr)
- .gitignore: add models/, logs/, lightning_logs/, *.egg-info/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e_pose_segments

- bin.py: add segment_pose() importable function (loads model via lru_cache,
  runs inference, returns eaf + tiers dict); add save_pose_segments() to crop
  and save per-segment .pose files; add --save-segments and --subtitles CLI args;
  model loading now cached so repeated calls are fast
- server.py: Flask server exposing POST / for pose segmentation (input/output
  as file paths or gs:// URIs) and GET /health; single-frame edge case handled
- Dockerfile: CPU-only inference image (python:3.12-slim + torch CPU wheel);
  serves via gunicorn; copies source and dist/2026/best.ckpt at build time
- pyproject.toml: add [server] optional deps (Flask, Werkzeug, gunicorn)
- .github/workflows/publish-docker.yaml: publish image to ghcr.io on release
- README.md: add Python API example, server usage, health check
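The cached model loading is a one-decorator pattern; a self-contained sketch (the loader below is a counting stand-in for the real checkpoint load):

```python
from functools import lru_cache

LOAD_CALLS = {"n": 0}

def expensive_load(path: str):
    # Stand-in for the real checkpoint load; counts invocations.
    LOAD_CALLS["n"] += 1
    return {"path": path}

@lru_cache(maxsize=1)
def load_model(path: str):
    """Repeated calls with the same checkpoint path hit the cache,
    so only the first call pays the loading cost."""
    return expensive_load(path)
```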

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
E169 (depth=4, 1024fr, 6h) beats Efinal on both dev and test:
  dev  HM=0.763 (Sign=0.657, Phr=0.910)
  test HM=0.764 (Sign=0.652, Phr=0.925)

Efinal trained longer but early stopping had already found the optimum.
best.ckpt updated to E169 checkpoint.
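HM here reads as the harmonic mean of the two IoU scores (the formula is assumed from the metric's name); it reproduces the dev row above to rounding:

```python
def hm(sign_iou: float, phrase_iou: float) -> float:
    """Harmonic mean of sign and phrase IoU -- the HM(sign, phrase)
    validation metric quoted above. Formula assumed from the name."""
    return 2 * sign_iou * phrase_iou / (sign_iou + phrase_iou)

# hm(0.657, 0.910) gives ~0.763, matching the E169 dev HM.
```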

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- tests/test_inference.py: smoke tests for segment_pose (tiers, start/end,
  eaf tiers); example.pose bundled for CI
- ruff fixes: remove unused imports (argparse, math, numpy), remove unused
  gold_range variable, replace lambda with def in evaluate.py
- pyproject.toml: move pytorch-lightning and scikit-learn to core deps
  (both required at inference time, not just dev); add **/*.ckpt to
  package-data so best.ckpt ships with pip install
- sign_language_segmentation/dist/2026/best.ckpt: E169 checkpoint bundled
  inside the package; _default_model_path() updated to find it via __file__
- Dockerfile: fix layer ordering (copy source then pip install --no-deps -e .
  so actual code is installed, not build stubs); warmup call now succeeds;
  fix ENV syntax and CMD JSON form to eliminate build warnings

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Strip AdamW optimizer states and convert float32→bfloat16 to reduce
checkpoint size ~6x for deployment without affecting inference quality
(dev HM-IoU 0.763 preserved). Add slim_checkpoint CLI entry point so
future dist checkpoints can be prepared in one command.
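The slimming step could look like the sketch below; the key names follow Lightning's usual checkpoint layout and are assumptions, not the repository's exact code:

```python
import torch

def slim_checkpoint(ckpt: dict) -> dict:
    """Drop optimizer/scheduler state and cast float32 weights to
    bfloat16 for a deployment-only checkpoint. Hypothetical sketch."""
    slim = {k: v for k, v in ckpt.items()
            if k not in ("optimizer_states", "lr_schedulers")}
    slim["state_dict"] = {
        name: (t.to(torch.bfloat16) if t.dtype == torch.float32 else t)
        for name, t in ckpt["state_dict"].items()
    }
    return slim
```

Optimizer state holds two extra float32 copies of every parameter under AdamW, and float32 to bfloat16 halves the rest, which is where the roughly 6x size reduction comes from.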

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… note

- Restore complete bibtex entry (editor, address, doi, pages) from main
- Restore '## 2023 Version (v2023)' section linking to the paper code
- Document slim_checkpoint usage in dist/2026/README.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@AmitMY AmitMY merged commit b7c3299 into main Mar 23, 2026
ziv-lazarov-nagish added a commit that referenced this pull request Apr 20, 2026
Completes the publish pipeline on top of the pure-helpers layer:

- `publish/utils.py`: append `_eval_single`, `run_evaluation`, `check_regression`,
  `promote`.
- `publish/publish.py`: 8-step orchestrator — convert → find manifest → eval →
  regression check → save manifest → model card → upload to `weekly` branch →
  promote tag.
- `publish/__init__.py`: re-export `publish` and `main`.
- `datasets/common.py`: rename `_ensure_datasets_registered` →
  `ensure_datasets_registered` (now a public API so `run_evaluation` can call
  it without reaching into a private name).
- `pyproject.toml`: add `[hf]` (huggingface_hub>=0.20.0) and `[publish]`
  (`[hf]` + `[train]`) optional-dependency groups.
- `.env.example`: append HF section (`HF_TOKEN`, `HF_MODEL_REPO`,
  `HF_MODEL_REVISION`, `XDG_CACHE_HOME` for sparks cache).
- `.gitignore`: add `dist/`, `wandb/`, `*.log`.

Review-comment fixes shipped here:
- #1: dropped unused `os`, `datetime`, `UTC` imports from `publish.py`.
- #4: `_ensure_datasets_registered` → `ensure_datasets_registered` (public).
- #5: `class EvalArgs: pass` → `argparse.Namespace(...)`.
- #6: `--regression-threshold` default `0.02` → `0.005`.
- #7: `promote()` now raises `ValueError` on unresolved revision instead of
  silently passing the ref string through to `create_tag`.

11 new tests in `test_publish_cli.py` cover:
- `check_regression`: no_baseline (no tags / download failure), pass
  (within threshold), fail (beyond threshold).
- `promote`: tag found, branch found, unresolved raises `ValueError`.
- `publish()` integration with every HF + eval boundary mocked: skip-eval +
  no-promote, skip-eval + promote, eval + regression pass + promote, eval +
  regression fail (no promote).
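The no_baseline/pass/fail contract that these tests exercise could be sketched as follows (names and return values assumed from the test descriptions above):

```python
from typing import Optional

def check_regression(candidate: dict, baseline: Optional[dict],
                     threshold: float = 0.005) -> str:
    """Regression gate sketch: 'no_baseline' when no previous metrics
    could be fetched, 'pass' when every shared metric is within
    `threshold` of the baseline, 'fail' otherwise. Hypothetical names."""
    if baseline is None:
        return "no_baseline"
    for metric, old in baseline.items():
        new = candidate.get(metric)
        if new is not None and old - new > threshold:
            return "fail"
    return "pass"
```

Only drops beyond the threshold block promotion; improvements and within-noise changes pass.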
ziv-lazarov-nagish added a commit that referenced this pull request Apr 20, 2026
Adds the library-level functions the publish CLI will orchestrate. No CLI
entry point yet — that lands in a follow-up PR.

- `publish/utils.py`: append `_eval_single`, `run_evaluation`,
  `check_regression`, `promote`. `run_evaluation` uses `argparse.Namespace`
  instead of an ad-hoc class, and imports `ensure_datasets_registered` as
  a public name. `promote` raises `ValueError` when the revision can't be
  resolved to a commit instead of silently passing the string through.
- `datasets/common.py`: rename `_ensure_datasets_registered` →
  `ensure_datasets_registered` so external callers can depend on it.
- `pyproject.toml`: add `[hf]` optional group (`huggingface_hub>=0.20.0`).
- `.env.example`: append HF section (`HF_TOKEN`, `HF_MODEL_REPO`,
  `HF_MODEL_REVISION`, `XDG_CACHE_HOME` for sparks cache).

Review comments addressed here: #4 (private → public), #5 (Namespace),
#7 (raise ValueError).

Tests: 7 new cases in `tests/test_publish_cli.py` covering
`check_regression` (no_baseline / download-fail / pass / fail) and
`promote` (tag-hit / branch-hit / unresolved → raises). All boundaries
mocked — no network calls.
ziv-lazarov-nagish added a commit that referenced this pull request Apr 26, 2026
* feat: publish HF ops + evaluation helpers

Adds the library-level functions the publish CLI will orchestrate. No CLI
entry point yet — that lands in a follow-up PR.

- `publish/utils.py`: append `_eval_single`, `run_evaluation`,
  `check_regression`, `promote`. `run_evaluation` uses `argparse.Namespace`
  instead of an ad-hoc class, and imports `ensure_datasets_registered` as
  a public name. `promote` raises `ValueError` when the revision can't be
  resolved to a commit instead of silently passing the string through.
- `datasets/common.py`: rename `_ensure_datasets_registered` →
  `ensure_datasets_registered` so external callers can depend on it.
- `pyproject.toml`: add `[hf]` optional group (`huggingface_hub>=0.20.0`).
- `.env.example`: append HF section (`HF_TOKEN`, `HF_MODEL_REPO`,
  `HF_MODEL_REVISION`, `XDG_CACHE_HOME` for sparks cache).

Review comments addressed here: #4 (private → public), #5 (Namespace),
#7 (raise ValueError).

Tests: 7 new cases in `tests/test_publish_cli.py` covering
`check_regression` (no_baseline / download-fail / pass / fail) and
`promote` (tag-hit / branch-hit / unresolved → raises). All boundaries
mocked — no network calls.

* refactor: publish HF ops review follow-ups — narrow except, strict loading, hm_IoU in regression

- check_regression: narrow `except Exception` to HfHubHTTPError (404-only); auth/network errors now surface
- drop `strict=False` on load_from_checkpoint — silent partial-load was a landmine for ckpt key drift
- align hparam fallbacks (fps_aug, velocity) with training defaults from args.py (both True); comment the why
- quality_percentile: assert equality across all manifests; raise ValueError on mismatch (previously silent "first wins")
- add hm_IoU to regression-check metrics (hm_IoU regressions previously didn't block promotion)
- drop dangling `# TODO: add slack notifications`
- rename tests/test_publish_cli.py → tests/test_publish_hf_ops.py (tests cover HF ops, not the CLI)
- tests: replace RuntimeError mock with real HfHubHTTPError(response.status_code=404); add test_non_404_download_error_propagates

* ci: install [hf] extra so publish tests can import huggingface_hub
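The narrowed except in check_regression can be pictured as below. The HfHubHTTPError class is redefined as a minimal stand-in so the sketch is self-contained; the real one lives in huggingface_hub and carries a `response` with an HTTP status code:

```python
class HfHubHTTPError(Exception):
    """Minimal stand-in for huggingface_hub's HfHubHTTPError."""
    def __init__(self, message, response=None):
        super().__init__(message)
        self.response = response

def fetch_baseline(download):
    """404 (no previous weekly metrics) means 'no baseline yet';
    anything else -- auth failures, server errors -- must surface
    rather than be swallowed into no_baseline. Hypothetical sketch."""
    try:
        return download()
    except HfHubHTTPError as e:
        if e.response is not None and e.response.status_code == 404:
            return None
        raise
```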
ziv-lazarov-nagish added a commit that referenced this pull request Apr 26, 2026
* feat: publish CLI — evaluation, regression, promotion, orchestrator

Completes the publish pipeline on top of the pure-helpers layer:

- `publish/utils.py`: append `_eval_single`, `run_evaluation`, `check_regression`,
  `promote`.
- `publish/publish.py`: 8-step orchestrator — convert → find manifest → eval →
  regression check → save manifest → model card → upload to `weekly` branch →
  promote tag.
- `publish/__init__.py`: re-export `publish` and `main`.
- `datasets/common.py`: rename `_ensure_datasets_registered` →
  `ensure_datasets_registered` (now a public API so `run_evaluation` can call
  it without reaching into a private name).
- `pyproject.toml`: add `[hf]` (huggingface_hub>=0.20.0) and `[publish]`
  (`[hf]` + `[train]`) optional-dependency groups.
- `.env.example`: append HF section (`HF_TOKEN`, `HF_MODEL_REPO`,
  `HF_MODEL_REVISION`, `XDG_CACHE_HOME` for sparks cache).
- `.gitignore`: add `dist/`, `wandb/`, `*.log`.

Review-comment fixes shipped here:
- #1: dropped unused `os`, `datetime`, `UTC` imports from `publish.py`.
- #4: `_ensure_datasets_registered` → `ensure_datasets_registered` (public).
- #5: `class EvalArgs: pass` → `argparse.Namespace(...)`.
- #6: `--regression-threshold` default `0.02` → `0.005`.
- #7: `promote()` now raises `ValueError` on unresolved revision instead of
  silently passing the ref string through to `create_tag`.

11 new tests in `test_publish_cli.py` cover:
- `check_regression`: no_baseline (no tags / download failure), pass
  (within threshold), fail (beyond threshold).
- `promote`: tag found, branch found, unresolved raises `ValueError`.
- `publish()` integration with every HF + eval boundary mocked: skip-eval +
  no-promote, skip-eval + promote, eval + regression pass + promote, eval +
  regression fail (no promote).

* fix: pass repo_id to generate_model_card after #29 API change

* fix: split CLI-orchestrator tests back out of test_publish_hf_ops.py

Rebase side-effect: after #31 renamed test_publish_cli.py → test_publish_hf_ops.py,
git's rename detection merged #30's CLI-orchestrator additions into the HF-ops file.
Restore the intended separation — TestPublishIntegration lives in test_publish_cli.py,
TestCheckRegression and TestPromote stay in test_publish_hf_ops.py.