feat: HuggingFace publish pipeline with safetensors model format#22
ziv-lazarov-nagish wants to merge 15 commits into nagish
Conversation
```python
if __name__ == "__main__":
    main()
```
Generally I think this file can be cleaner (not in this PR).
Force-pushed from 0efa6eb to 9470171
Add deployment pipeline for publishing models to HuggingFace Hub:
- SafeTensors model loading with `.ckpt` fallback in `bin.py`
- Flexible model resolution: `MODEL_PATH` env > `HF_MODEL_REPO` > baked-in
- `publish_model` CLI: convert, evaluate, regression check, push, tag, promote
- Model card template with placeholder substitution
- HF/publish optional dependency groups and entry point
- Upload to 'weekly' branch, tag with vMAJOR.MINOR.PATCH on promotion
- Auto-bump patch version from latest HF tag (`--bump minor/major`)
- Read `quality_percentile` from `split_manifest.json` during eval
- Add `hm_IoU` to eval results and model card
- Regression check compares against latest semver tag
- `HF_MODEL_REVISION` now required (no silent default)
Evaluation now runs on each dataset individually + combined, for both dev and test splits. Model card shows all results. Regression check uses combined test metrics (unchanged behavior).
Architecture and training config rendered as single-row transposed tables (keys as columns). Eval results in one table with dataset rows, dev/test sub-rows, combined bolded, IoU columns first.
Move model_card_template.md into the publish package where it belongs. Make _ARCH_KEYS and _TRAIN_KEYS local to generate_model_card().
Replace arbitrary semver bumping with date-based tags. Same-day publishes get an incremented suffix (vYYYY.MM.DD.1, .2, etc). Remove --bump arg.
Use **/*.md to match model_card_template.md in its new publish/ location. Remove .ckpt from package-data — checkpoints are CLI inputs, not bundled runtime resources.
…_model evaluate_model now returns hm_IoU computed per item (nagish PR #23). Wrapping it with _add_hm_iou would overwrite that with the less accurate average-of-averages metric.
Force-pushed from f7b4c49 to b376fe8
Force-pushed from 7f58c72 to edeafd7
```python
import json
import os
import tempfile
from datetime import datetime, UTC
```
os, datetime, and UTC are unused in this file — datetime/UTC moved to utils.py with get_next_version, and os was never referenced. Dropping these keeps the import block clean.
(created by Claude)
```yaml
---
language: dgs
```
`language: dgs` is hardcoded, but the publish pipeline accepts arbitrary `--datasets` (and `all`). When the model is trained on non-DGS corpora, this frontmatter will be wrong on the HF model page. Either plumb the language list through `generate_model_card` as a field derived from the dataset registry, or drop the field entirely and let the `{{dataset_section}}` below describe coverage.
(created by Claude)
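As a sketch of the first option, deriving the frontmatter languages from the selected datasets might look like this; the registry shape and the per-dataset `language` field are assumptions, not the actual registry:

```python
def card_languages(dataset_names, registry):
    """Distinct language codes of the selected datasets, for the card frontmatter."""
    return sorted({registry[name]["language"] for name in dataset_names})

# Illustrative registry mapping dataset name -> metadata
REGISTRY = {
    "dgs_corpus": {"language": "dgs"},
    "bsl_corpus": {"language": "bsl"},
}
```

`generate_model_card` could then emit `language: [dgs, bsl]` (or omit the key when the set is empty) instead of the hardcoded value.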
```python
    return model


@lru_cache(maxsize=1)
```
`@lru_cache` keys by `(model_dir, device)`, but when `model_dir` comes from `resolve_model_path()` via `HF_MODEL_REPO`, the actual content depends on `HF_MODEL_REVISION` too. If anything in the process ever changes `HF_MODEL_REVISION` between calls (long-running server, test fixtures), this cache silently returns a stale model. Either include the revision in the cache key (e.g. cache on `resolve_model_path()`'s output, which bakes the revision into the snapshot dir), or document that the cache is process-lifetime and the env must not change.
(created by Claude)
```python
    Returns nested dict: {dataset_name: {split: {metric: value}}}.
    Top-level keys include each individual dataset, plus "combined".
    """
    from sign_language_segmentation.datasets.common import DATASET_REGISTRY, _ensure_datasets_registered
```
Importing `_ensure_datasets_registered` reaches into a private symbol (leading underscore) of `datasets.common`. That couples `publish/` to an implementation detail that may be renamed or removed without warning. Expose a public registration entry point in `datasets.common` (or make dataset registration happen at import time so consumers don't need to call it explicitly) and have `publish/utils.py` use that.
(created by Claude)
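A hedged sketch of what a public registration entry point in `datasets.common` could look like; the name `ensure_datasets_registered`, the module-level flag, and the registry contents are assumptions for illustration:

```python
DATASET_REGISTRY: dict = {}
_registered = False  # private guard; consumers never touch this

def ensure_datasets_registered() -> None:
    """Public, idempotent registration hook: safe to call from any consumer,
    repeated calls are no-ops."""
    global _registered
    if not _registered:
        # Real registration of each dataset would happen here.
        DATASET_REGISTRY["dgs_corpus"] = object()
        _registered = True
```

`publish/utils.py` would then import only public names, and the private helper stays free to change.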
```python
quality_percentile = qp
break


class EvalArgs:
```
Defining an empty class `EvalArgs` just to attach attributes is awkward and loses type checking. `argparse.Namespace(corpus=corpus, poses=poses, target_fps=None, quality_percentile=quality_percentile)` is a one-liner drop-in, and a `@dataclass` would be better still if `get_dataloader` is willing to accept one.
(created by Claude)
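The suggested drop-in, sketched with illustrative values; the field names mirror the attributes the empty `EvalArgs` class was being used to carry:

```python
import argparse

# One-liner replacement for the attribute-bag class: Namespace supports
# attribute access and a readable repr out of the box.
eval_args = argparse.Namespace(
    corpus="dgs_corpus",      # illustrative value
    poses="holistic",         # illustrative value
    target_fps=None,
    quality_percentile=0.5,   # illustrative value
)
```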
```python
parser.add_argument(
    "--regression-threshold",
    type=float,
    default=0.02,
```
A default tolerance of 2 IoU points is quite loose — real regressions of that size will pass the gate unnoticed, especially once they stack up over a few weekly releases. Worth measuring the run-to-run variance of `sign_IoU`/`sentence_IoU` across a few identical retrains and setting the threshold to ~2–3× that standard deviation. Until you have that data, consider dropping the default to 0.005 and overriding it via the flag when known-noisy experiments are being published.
(created by Claude)
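The threshold-from-variance idea can be sketched like this; the helper name, the 3× multiplier, and the 0.005 floor are assumptions taken from the comment above, not measured values:

```python
import statistics

def suggested_threshold(iou_runs, k=3.0, floor=0.005):
    """Regression threshold as k * run-to-run stdev of identical retrains,
    never below a small floor so a zero-variance sample doesn't disable the gate."""
    return max(k * statistics.stdev(iou_runs), floor)
```

Once a few identical retrains have been logged, this gives a data-driven default in place of the hardcoded 0.02.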
```python
target_sha = branch.target_commit
break
if target_sha is None:
    target_sha = revision
```
If neither the tag-scan nor the branch-scan resolves `revision` to a sha, this silently falls back to passing `revision` as-is to `create_tag`. When `revision="weekly"` is the normal path and the branch lookup already covers it, reaching this fallback means something is wrong — but we'll still attempt a `create_tag` call with a possibly-invalid sha, and whatever error HF returns will be confusing. Raise a clear `ValueError(f"Could not resolve revision {revision} to a commit")` here instead.
(created by Claude)
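A minimal sketch of the suggested fail-fast behavior, with the tag and branch lookups simplified to dicts for illustration (the real code iterates over `refs` from the HF API):

```python
def resolve_sha(revision, tag_shas, branch_shas):
    """Resolve a revision name to a commit sha, failing loudly instead of
    letting an unresolved name flow into create_tag."""
    sha = tag_shas.get(revision) or branch_shas.get(revision)
    if sha is None:
        raise ValueError(f"Could not resolve revision {revision!r} to a commit")
    return sha
```

The error now points at the actual problem (an unresolvable revision) rather than whatever the remote API reports downstream.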
```diff
-doc_ids = sorted(
-    d.name for d in videos_dir.iterdir() if d.is_dir()
-)
+doc_ids = sorted(d.name for d in videos_dir.iterdir() if d.is_dir())
```
This file has whitespace-only ruff reflows (this line, and several more below) that are unrelated to the publish pipeline. Consider splitting them into a separate formatting PR against nagish so this PR's diff stays focused on the publishing feature.
(created by Claude)
No tests were added for the new publish pipeline code.

Happy to scaffold these as a follow-up if helpful. (created by Claude)
A more general note: this PR is hard to review end-to-end because there's no way to exercise the pipeline without actually pushing to HuggingFace. Even a small amount of test coverage would go a long way.

Not a blocker, but the combination of "no tests" + "touches a remote service" + "regression-gated promotion logic" is the class of code where bugs are both easy to introduce and expensive to notice in production. (created by Claude)
Video explaining how I reviewed this PR: https://www.loom.com/share/39a93593fca74d46a63bd17ca41b6d26 |
I did write tests for the flow, but wanted to make them a separate PR to not make this one too big and overwhelming (>1000-line positive diff). Can add them here.
in my mind: tests do not count as diff. you can have 1000 lines of tests and 10 lines of code, and that is a small PR |
Summary
- `.env.example` with documented sections for all required credentials

Changes since last review

- Moved `model_card_template.md` into the `publish/` package
- Made `_ARCH_KEYS`/`_TRAIN_KEYS` local to `generate_model_card()`
- Removed `.ckpt` from bundled resources
- Added `.env.example`

Test plan

- Ran with `--skip-eval` to verify HF upload