feat: HuggingFace publish pipeline with safetensors model format#22
ziv-lazarov-nagish wants to merge 15 commits into nagish
Conversation
```python
if __name__ == "__main__":
    main()
```
Generally I think this file can be cleaner (not in this PR).
Force-pushed from 0efa6eb to 9470171
Add deployment pipeline for publishing models to HuggingFace Hub:
- SafeTensors model loading with `.ckpt` fallback in `bin.py`
- Flexible model resolution: `MODEL_PATH` env > `HF_MODEL_REPO` > baked-in
- `publish_model` CLI: convert, evaluate, regression check, push, tag, promote
- Model card template with placeholder substitution
- HF/publish optional dependency groups and entry point
- Upload to 'weekly' branch, tag with vMAJOR.MINOR.PATCH on promotion
- Auto-bump patch version from latest HF tag (`--bump minor/major`)
- Read `quality_percentile` from `split_manifest.json` during eval
- Add `hm_IoU` to eval results and model card
- Regression check compares against latest semver tag
- `HF_MODEL_REVISION` now required (no silent default)
Evaluation now runs on each dataset individually + combined, for both dev and test splits. Model card shows all results. Regression check uses combined test metrics (unchanged behavior).
Architecture and training config rendered as single-row transposed tables (keys as columns). Eval results in one table with dataset rows, dev/test sub-rows, combined bolded, IoU columns first.
Move model_card_template.md into the publish package where it belongs. Make _ARCH_KEYS and _TRAIN_KEYS local to generate_model_card().
Replace arbitrary semver bumping with date-based tags. Same-day publishes get an incremented suffix (vYYYY.MM.DD.1, .2, etc). Remove --bump arg.
Use **/*.md to match model_card_template.md in its new publish/ location. Remove .ckpt from package-data — checkpoints are CLI inputs, not bundled runtime resources.
…_model evaluate_model now returns hm_IoU computed per item (nagish PR #23). Wrapping it with _add_hm_iou would overwrite that with the less accurate average-of-averages metric.
Force-pushed from f7b4c49 to b376fe8
Force-pushed from 7f58c72 to edeafd7
```python
import json
import os
import tempfile
from datetime import datetime, UTC
```
os, datetime, and UTC are unused in this file — datetime/UTC moved to utils.py with get_next_version, and os was never referenced. Dropping these keeps the import block clean.
(created by Claude)
```yaml
---
language: dgs
```
`language: dgs` is hardcoded, but the publish pipeline accepts arbitrary `--datasets` (and `all`). When the model is trained on non-DGS corpora, this frontmatter will be wrong on the HF model page. Either plumb the language list through `generate_model_card` as a field derived from the dataset registry, or drop the field entirely and let the `{{dataset_section}}` below describe coverage.
(created by Claude)
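As a sketch of the first option, deriving the frontmatter languages from the selected datasets might look like this; the registry shape and the per-dataset `language` field are assumptions, not the actual registry:

```python
def card_languages(dataset_names, registry):
    """Distinct language codes of the selected datasets, for the card frontmatter."""
    return sorted({registry[name]["language"] for name in dataset_names})

# Illustrative registry mapping dataset name -> metadata
REGISTRY = {
    "dgs_corpus": {"language": "dgs"},
    "bsl_corpus": {"language": "bsl"},
}
```

`generate_model_card` could then emit `language: [dgs, bsl]` (or omit the key when the set is empty) instead of the hardcoded value.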
```python
    return model


@lru_cache(maxsize=1)
```
`@lru_cache` keys by `(model_dir, device)`, but when `model_dir` comes from `resolve_model_path()` via `HF_MODEL_REPO`, the actual content depends on `HF_MODEL_REVISION` too. If anything in the process ever changes `HF_MODEL_REVISION` between calls (long-running server, test fixtures), this cache silently returns a stale model. Either include the revision in the cache key (e.g. cache on `resolve_model_path()`'s output, which bakes the revision into the snapshot dir), or document that the cache is process-lifetime and the env must not change.
(created by Claude)
```python
    Returns nested dict: {dataset_name: {split: {metric: value}}}.
    Top-level keys include each individual dataset, plus "combined".
    """
    from sign_language_segmentation.datasets.common import DATASET_REGISTRY, _ensure_datasets_registered
```
Importing `_ensure_datasets_registered` reaches into a private symbol (leading underscore) of `datasets.common`. That couples `publish/` to an implementation detail that may be renamed or removed without warning. Expose a public registration entry point in `datasets.common` (or make dataset registration happen at import time so consumers don't need to call it explicitly) and have `publish/utils.py` use that.
(created by Claude)
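A hedged sketch of what a public registration entry point in `datasets.common` could look like; the name `ensure_datasets_registered`, the module-level flag, and the registry contents are assumptions for illustration:

```python
DATASET_REGISTRY: dict = {}
_registered = False  # private guard; consumers never touch this

def ensure_datasets_registered() -> None:
    """Public, idempotent registration hook: safe to call from any consumer,
    repeated calls are no-ops."""
    global _registered
    if not _registered:
        # Real registration of each dataset would happen here.
        DATASET_REGISTRY["dgs_corpus"] = object()
        _registered = True
```

`publish/utils.py` would then import only public names, and the private helper stays free to change.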
```python
quality_percentile = qp
break


class EvalArgs:
```
Defining an empty class `EvalArgs` just to attach attributes is awkward and loses type checking. `argparse.Namespace(corpus=corpus, poses=poses, target_fps=None, quality_percentile=quality_percentile)` is a one-liner drop-in, and a `@dataclass` would be better still if `get_dataloader` is willing to accept one.
(created by Claude)
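The suggested drop-in, sketched with illustrative values; the field names mirror the attributes the empty `EvalArgs` class was being used to carry:

```python
import argparse

# One-liner replacement for the attribute-bag class: Namespace supports
# attribute access and a readable repr out of the box.
eval_args = argparse.Namespace(
    corpus="dgs_corpus",      # illustrative value
    poses="holistic",         # illustrative value
    target_fps=None,
    quality_percentile=0.5,   # illustrative value
)
```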
```python
parser.add_argument(
    "--regression-threshold",
    type=float,
    default=0.02,
```
A default tolerance of 2 IoU points is quite loose — real regressions of that size will pass the gate unnoticed, especially once they stack up over a few weekly releases. Worth measuring the run-to-run variance of `sign_IoU`/`sentence_IoU` across a few identical retrains and setting the threshold to ~2–3× that standard deviation. Until you have that data, consider dropping the default to 0.005 and overriding it via the flag when known-noisy experiments are being published.
(created by Claude)
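The threshold-from-variance idea can be sketched like this; the helper name, the 3× multiplier, and the 0.005 floor are assumptions taken from the comment above, not measured values:

```python
import statistics

def suggested_threshold(iou_runs, k=3.0, floor=0.005):
    """Regression threshold as k * run-to-run stdev of identical retrains,
    never below a small floor so a zero-variance sample doesn't disable the gate."""
    return max(k * statistics.stdev(iou_runs), floor)
```

Once a few identical retrains have been logged, this gives a data-driven default in place of the hardcoded 0.02.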
```python
target_sha = branch.target_commit
break
if target_sha is None:
    target_sha = revision
```
If neither the tag-scan nor the branch-scan resolves `revision` to a sha, this silently falls back to passing `revision` as-is to `create_tag`. When `revision="weekly"` is the normal path and the branch lookup already covers it, reaching this fallback means something is wrong — but we'll still attempt a `create_tag` call with a possibly-invalid sha, and whatever error HF returns will be confusing. Raise a clear `ValueError(f"Could not resolve revision {revision} to a commit")` here instead.
(created by Claude)
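A minimal sketch of the suggested fail-fast behavior, with the tag and branch lookups simplified to dicts for illustration (the real code iterates over `refs` from the HF API):

```python
def resolve_sha(revision, tag_shas, branch_shas):
    """Resolve a revision name to a commit sha, failing loudly instead of
    letting an unresolved name flow into create_tag."""
    sha = tag_shas.get(revision) or branch_shas.get(revision)
    if sha is None:
        raise ValueError(f"Could not resolve revision {revision!r} to a commit")
    return sha
```

The error now points at the actual problem (an unresolvable revision) rather than whatever the remote API reports downstream.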
```diff
-doc_ids = sorted(
-    d.name for d in videos_dir.iterdir() if d.is_dir()
-)
+doc_ids = sorted(d.name for d in videos_dir.iterdir() if d.is_dir())
```
This file has whitespace-only ruff reflows (this line, and several more below) that are unrelated to the publish pipeline. Consider splitting them into a separate formatting PR against nagish so this PR's diff stays focused on the publishing feature.
(created by Claude)
No tests were added for the new publish pipeline code.

Happy to scaffold these as a follow-up if helpful. (created by Claude)
A more general note: this PR is hard to review end-to-end because there's no way to exercise the pipeline without actually pushing to HuggingFace. Even a small amount of test coverage would go a long way.

Not a blocker, but the combination of "no tests" + "touches a remote service" + "regression-gated promotion logic" is the class of code where bugs are both easy to introduce and expensive to notice in production. (created by Claude)
Video explaining how I reviewed this PR: https://www.loom.com/share/39a93593fca74d46a63bd17ca41b6d26 |
I did write tests for the flow, but wanted to make them a separate PR to not make this one too big and overwhelming (>1000-line positive diff). Can add them here.
in my mind: tests do not count as diff. you can have 1000 lines of tests and 10 lines of code, and that is a small PR |
Summary
- `.env.example` with documented sections for all required credentials

Changes since last review

- Moved `model_card_template.md` into the `publish/` package
- Made `_ARCH_KEYS`/`_TRAIN_KEYS` local to `generate_model_card()`
- Removed `.ckpt` from bundled resources
- Added `.env.example`

Test plan

- Ran with `--skip-eval` to verify HF upload