
feat: HuggingFace publish pipeline with safetensors model format#22

Closed
ziv-lazarov-nagish wants to merge 15 commits into nagish from publish-pipeline

Conversation

Contributor

@ziv-lazarov-nagish commented Apr 15, 2026

Summary

  • HuggingFace publish pipeline: convert .ckpt to safetensors, evaluate, regression check, push to Hub
  • Model card generation with architecture/training config tables and per-dataset eval metrics
  • Date-based version tags (vYYYY.MM.DD) with same-day suffix support
  • Weekly branch workflow with promotion on regression pass
  • .env.example with documented sections for all required credentials

Changes since last review

  • Moved model card template into publish/ package
  • Inlined _ARCH_KEYS/_TRAIN_KEYS into generate_model_card()
  • Switched from arbitrary semver to date-based version tags
  • Fixed package-data globs, dropped .ckpt from bundled resources
  • Added comments to .env.example

Test plan

  • 61 unit tests pass
  • End-to-end publish with --skip-eval to verify HF upload
  • Publish with eval + regression check against tagged model

@ziv-lazarov-nagish marked this pull request as draft April 15, 2026 21:17
@ziv-lazarov-nagish marked this pull request as ready for review April 15, 2026 21:18
@ziv-lazarov-nagish marked this pull request as draft April 16, 2026 07:47
@ziv-lazarov-nagish marked this pull request as ready for review April 16, 2026 09:41
Comment thread sign_language_segmentation/datasets/dgs/dataset.py Outdated
Comment thread sign_language_segmentation/publish/utils.py Outdated
Comment thread sign_language_segmentation/publish/utils.py Outdated


if __name__ == "__main__":
main()

Generally I think this file can be cleaner (not in this PR).

Comment thread .env.example
Comment thread .env.example
Comment thread pyproject.toml Outdated
@ziv-lazarov-nagish force-pushed the publish-pipeline branch 2 times, most recently from 0efa6eb to 9470171 on April 18, 2026 17:39
Comment thread sign_language_segmentation/datasets/dgs/dataset.py Outdated
Comment thread pyproject.toml Outdated
Add deployment pipeline for publishing models to HuggingFace Hub:
- SafeTensors model loading with .ckpt fallback in bin.py
- Flexible model resolution: MODEL_PATH env > HF_MODEL_REPO > baked-in
- publish_model CLI: convert, evaluate, regression check, push, tag, promote
- Model card template with placeholder substitution
- HF/publish optional dependency groups and entry point
- Upload to 'weekly' branch, tag with vMAJOR.MINOR.PATCH on promotion
- Auto-bump patch version from latest HF tag (--bump minor/major)
- Read quality_percentile from split_manifest.json during eval
- Add hm_IoU to eval results and model card
- Regression check compares against latest semver tag
- HF_MODEL_REVISION now required (no silent default)
Evaluation now runs on each dataset individually + combined, for both
dev and test splits. Model card shows all results. Regression check
uses combined test metrics (unchanged behavior).

Architecture and training config rendered as single-row transposed
tables (keys as columns). Eval results in one table with dataset rows,
dev/test sub-rows, combined bolded, IoU columns first.

Move model_card_template.md into the publish package where it belongs.
Make _ARCH_KEYS and _TRAIN_KEYS local to generate_model_card().

Replace arbitrary semver bumping with date-based tags. Same-day publishes
get an incremented suffix (vYYYY.MM.DD.1, .2, etc). Remove --bump arg.

Use **/*.md to match model_card_template.md in its new publish/ location.
Remove .ckpt from package-data; checkpoints are CLI inputs, not bundled
runtime resources.
…_model

evaluate_model now returns hm_IoU computed per item (nagish PR #23).
Wrapping it with _add_hm_iou would overwrite that with the less accurate
average-of-averages metric.
import json
import os
import tempfile
from datetime import datetime, UTC

os, datetime, and UTC are unused in this file — datetime/UTC moved to utils.py with get_next_version, and os was never referenced. Dropping these keeps the import block clean.

(created by Claude)

@@ -0,0 +1,40 @@
---
language: dgs

language: dgs is hardcoded, but the publish pipeline accepts arbitrary --datasets (and all). When the model is trained on non-DGS corpora this frontmatter will be wrong on the HF model page. Either plumb the language list through generate_model_card as a derived field from the dataset registry, or drop the field entirely and let the {{dataset_section}} below describe coverage.

(created by Claude)
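One way to derive the frontmatter instead of hardcoding it, sketched below. The `DATASET_LANGUAGES` mapping and the ISO 639-3 codes are assumptions for illustration; the real dataset registry may expose language metadata differently.

```python
# Hypothetical sketch: derive the model-card `language` field from the
# datasets actually passed via --datasets, instead of hardcoding "dgs".
DATASET_LANGUAGES = {"dgs": "gsg", "bsl": "bfi"}  # assumed ISO 639-3 codes

def language_frontmatter(datasets: list[str]) -> str:
    """Return a YAML `language:` fragment, or "" if nothing is known."""
    langs = sorted({DATASET_LANGUAGES[d] for d in datasets if d in DATASET_LANGUAGES})
    if not langs:
        return ""  # omit the field rather than emit a wrong value
    if len(langs) == 1:
        return f"language: {langs[0]}"
    return "language:\n" + "\n".join(f"  - {lang}" for lang in langs)
```

Omitting the field on unknown datasets keeps the HF model page honest until the registry carries real language metadata.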

return model


@lru_cache(maxsize=1)

@lru_cache keys by (model_dir, device), but when model_dir comes from resolve_model_path() via HF_MODEL_REPO, the actual content depends on HF_MODEL_REVISION too. If anything in the process ever changes HF_MODEL_REVISION between calls (long-running server, test fixtures), this cache returns a stale model silently. Either include the revision in the cache key (e.g. cache on resolve_model_path()'s output, which bakes the revision into the snapshot dir), or document that the cache is process-lifetime and the env must not change.

(created by Claude)
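A minimal sketch of the "cache on the resolved path" option. `resolve_model_path` and `load_model` are stand-ins for the PR's helpers (names assumed); the stub just mirrors the property that the revision ends up in the resolved snapshot directory.

```python
import os
from functools import lru_cache
from pathlib import Path

def resolve_model_path() -> Path:
    # Stub: the real code would call snapshot_download(); the point is
    # that HF_MODEL_REVISION is baked into the directory it returns.
    return Path("/models") / os.environ.get("HF_MODEL_REVISION", "main")

def load_model(model_dir: str, device: str) -> str:
    return f"model@{model_dir}:{device}"  # placeholder for actual loading

@lru_cache(maxsize=4)
def _load_cached(resolved_dir: str, device: str) -> str:
    return load_model(resolved_dir, device)

def get_model(device: str = "cpu") -> str:
    # Caching on the *resolved* dir means a revision change is a cache
    # miss, not a silently stale model.
    return _load_cached(str(resolve_model_path()), device)
```

With this shape, two calls under different `HF_MODEL_REVISION` values hit different cache entries instead of returning the first-loaded model.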

Returns nested dict: {dataset_name: {split: {metric: value}}}.
Top-level keys include each individual dataset, plus "combined".
"""
from sign_language_segmentation.datasets.common import DATASET_REGISTRY, _ensure_datasets_registered

Importing _ensure_datasets_registered reaches into a private symbol (leading underscore) of datasets.common. That couples publish/ to an implementation detail that may be renamed or removed without warning. Expose a public registration entry point in datasets.common (or make dataset registration happen at import time so consumers don't need to call it explicitly) and have publish/utils.py use that.

(created by Claude)
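The public entry point could be as small as this sketch (names assumed; the real `_ensure_datasets_registered` imports the dataset modules rather than inserting a placeholder):

```python
# Sketch of a public accessor in datasets/common.py, so publish/ never
# needs to import the private _ensure_datasets_registered directly.
DATASET_REGISTRY: dict[str, object] = {}

def _ensure_datasets_registered() -> None:
    # Existing private implementation; registers datasets on first call.
    if not DATASET_REGISTRY:
        DATASET_REGISTRY["dgs"] = object()  # placeholder registration

def get_dataset_registry() -> dict[str, object]:
    """Public accessor: guarantees registration has happened."""
    _ensure_datasets_registered()
    return DATASET_REGISTRY
```

Callers then use `get_dataset_registry()` and the private helper stays free to change.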

quality_percentile = qp
break

class EvalArgs:

Defining an empty class EvalArgs just to attach attributes is awkward and loses type checking. argparse.Namespace(corpus=corpus, poses=poses, target_fps=None, quality_percentile=quality_percentile) is a one-liner drop-in, and a @dataclass would be better still if get_dataloader is willing to accept one.

(created by Claude)

parser.add_argument(
"--regression-threshold",
type=float,
default=0.02,

A default tolerance of 2 IoU points is quite loose — real regressions of that size will pass the gate unnoticed, especially once you stack them over a few weekly releases. Worth measuring the run-to-run variance of sign_IoU/sentence_IoU across a few identical retrains and setting the threshold to ~2–3× that standard deviation. Until you have that data, consider dropping the default to 0.005 and overriding via flag when known-noisy experiments are being published.

(created by Claude)
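The variance-based calibration suggested above can be sketched in a few lines. The IoU numbers below are made up for illustration; the real inputs would be combined-test metrics from a few identical retrains.

```python
import statistics

# Hypothetical: combined test sign_IoU from 4 identical retrains.
retrain_ious = [0.712, 0.708, 0.715, 0.710]

# Set the regression threshold to ~3x the run-to-run standard deviation,
# so the gate flags drops larger than normal training noise.
noise_sd = statistics.stdev(retrain_ious)
threshold = round(3 * noise_sd, 4)
```

With noise on this order, the derived threshold lands well under the current 0.02 default.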

target_sha = branch.target_commit
break
if target_sha is None:
target_sha = revision

If neither the tag-scan nor the branch-scan resolves revision to a sha, this silently falls back to passing revision as-is to create_tag. When revision="weekly" is the normal path and the branch lookup already covers it, reaching this fallback means something is wrong — but we'll still attempt a create_tag call with a possibly-invalid sha, and whatever error HF returns will be confusing. Raise a clear ValueError(f"Could not resolve revision {revision} to a commit") here instead.

(created by Claude)
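A sketch of the resolution step with the explicit failure. The tag/branch scans are reduced to plain dict lookups here (the real code iterates `HfApi.list_repo_refs` results), so only the control flow is the point:

```python
def resolve_revision_sha(revision: str,
                         tags: dict[str, str],
                         branches: dict[str, str]) -> str:
    """Map a revision name to a commit sha via tag refs, then branch refs."""
    if revision in tags:
        return tags[revision]
    if revision in branches:
        return branches[revision]
    # Fail loudly instead of passing an unresolved name to create_tag().
    raise ValueError(f"Could not resolve revision {revision!r} to a commit")
```

An immediate ValueError points at the actual problem, rather than whatever error the Hub returns for a bad sha.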

doc_ids = sorted(
d.name for d in videos_dir.iterdir() if d.is_dir()
)
doc_ids = sorted(d.name for d in videos_dir.iterdir() if d.is_dir())

This file has whitespace-only ruff reflows (this line, and several more below) that are unrelated to the publish pipeline. Consider splitting them into a separate formatting PR against nagish so this PR's diff stays focused on the publishing feature.

(created by Claude)

@AmitMY
Contributor

AmitMY commented Apr 19, 2026

No tests were added for publish/utils.py (+385 lines). The pure-function pieces are easy to cover and would catch real bugs:

  • _parse_version — valid vYYYY.MM.DD, valid vYYYY.MM.DD.N, rejected garbage.
get_next_version — no existing tags, tag from a prior date, tag already exists for today (suffix increments correctly: .1, then .2).
  • _get_test_metrics — nested multi-dataset format (pick combined.test), single-dataset nested (pick first), legacy flat format (return as-is).

Happy to scaffold these as a follow-up if helpful.

(created by Claude)

@AmitMY
Contributor

AmitMY commented Apr 19, 2026

A more general note: this PR is hard to review end-to-end because there's no way to exercise the pipeline without actually pushing to HuggingFace, and the control flow in publish() (8 numbered steps with branches on skip_eval, no_promote, regression_status) has a lot of implicit state interactions that are easier to reason about in test cases than from reading the code.

Even a small amount of test coverage for publish/utils.py would meaningfully help:

  • The pure functions (_parse_version, get_next_version, _get_test_metrics, _build_config_table, _build_eval_section) are trivial to test and self-documenting once tested.
  • The HF-touching functions (get_latest_version, check_regression, promote) can be covered with unittest.mock stubs for HfApi — much cheaper than spinning up a real repo to verify behaviour.
  • A happy-path integration test for publish() using a fake HfApi would turn the 8-step workflow into something executable.

Not a blocker, but the combination of "no tests" + "touches a remote service" + "regression-gated promotion logic" is the class of code where bugs are both easy to introduce and expensive to notice in production.

(created by Claude)
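The mock-based approach for the HF-touching pieces can be sketched like this. `check_regression_sketch` and `get_baseline_metrics` are hypothetical stand-ins, not the PR's actual function or a real HfApi method — the point is only that a MagicMock makes the gate logic testable offline:

```python
from unittest.mock import MagicMock

def check_regression_sketch(api, repo_id: str,
                            new_metrics: dict, threshold: float) -> bool:
    """Pass if no metric drops more than `threshold` below the baseline."""
    # Hypothetical call: the real code would fetch eval results from the
    # latest tagged revision on the Hub.
    baseline = api.get_baseline_metrics(repo_id)
    return all(new_metrics[k] >= baseline[k] - threshold for k in baseline)

# A stubbed "HfApi" returns canned baseline metrics; no network needed.
api = MagicMock()
api.get_baseline_metrics.return_value = {"sign_IoU": 0.71, "sentence_IoU": 0.65}

ok = check_regression_sketch(
    api, "nagish/segmentation",
    {"sign_IoU": 0.72, "sentence_IoU": 0.648}, threshold=0.005,
)
```

The same pattern (stub HfApi, drive publish() end to end) would make the 8-step workflow executable in CI.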

@AmitMY
Contributor

AmitMY commented Apr 19, 2026

Video explaining how I reviewed this PR: https://www.loom.com/share/39a93593fca74d46a63bd17ca41b6d26

@ziv-lazarov-nagish
Contributor Author

I did write tests for the flow, but wanted to make them a separate PR so this one doesn't get too big and overwhelming (>1000 positive diff). Can add them here.

@AmitMY
Contributor

AmitMY commented Apr 20, 2026

> I did write tests for the flow, but wanted to make it as a separate PR to not make it too big and overwhelming (>1000 positive diff).. can add them here

in my mind: tests do not count as diff. you can have 1000 lines of tests and 10 lines of code, and that is a small PR

