
feat: annotation platform dataset and multi-dataset training #17

Merged
ziv-lazarov-nagish merged 8 commits into nagish from feat/annotation-platform-dataset
Apr 14, 2026


ziv-lazarov-nagish commented Apr 14, 2026

Summary

  • Add annotation platform dataset: sync from Convex DB, quality scoring, hash-based train/dev/test splits
  • Register as platform via dataset registry (--datasets platform or --datasets dgs,platform)
  • --quality_percentile filters annotations by their quality score
  • --annotations_path arg for cache location
  • Sync CLI: fetches completed tasks (by completedAt), scores with latest model, caches results
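
The hash-based split mentioned above can be sketched as follows. This is a minimal illustration and not the code in this PR: the function name `assign_split`, the choice of SHA-256, and the default 80/10/10 ratios are all assumptions. The point is that the split depends only on the ID, so it stays stable when new annotations are synced.

```python
import hashlib


def assign_split(video_id: str, dev_pct: int = 10, test_pct: int = 10) -> str:
    """Deterministically map an ID to 'train', 'dev', or 'test'.

    Hypothetical helper: the name, hash function, and ratios are assumptions.
    """
    digest = hashlib.sha256(video_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "dev"
    return "train"


# The same ID always lands in the same split, regardless of what else is in the dataset.
splits = {vid: assign_split(vid) for vid in ["video-a", "video-b", "video-c"]}
```

Unlike a seeded random shuffle, this assignment does not change for existing items when the list of annotations grows.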

Changed files

  • sign_language_segmentation/datasets/annotation_platform/: dataset, sync, and tests
  • sign_language_segmentation/args.py: --annotations_path, --quality_percentile
  • sign_language_segmentation/datasets/common.py: register platform in _ensure_datasets_registered
  • sign_language_segmentation/evaluate.py: --annotations_path arg
  • pyproject.toml: [platform] optional dependency group (httpx, python-dotenv)
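
As a rough sketch of what percentile-based quality filtering could look like: the exact semantics of `--quality_percentile` are not spelled out here, so this assumes that higher scores are better and that items at or above the given percentile are kept. The function name `filter_by_quality` and the nearest-rank percentile definition are assumptions for illustration.

```python
import math


def filter_by_quality(scores: dict[str, float], quality_percentile: float) -> dict[str, float]:
    """Keep items whose score is at or above the given percentile (0-100).

    Hypothetical helper; assumes higher score means better annotation quality.
    """
    if not scores or quality_percentile <= 0:
        return dict(scores)
    ordered = sorted(scores.values())
    # Nearest-rank percentile: the smallest score with at least p% of scores at or below it.
    rank = max(1, math.ceil(quality_percentile / 100 * len(ordered)))
    threshold = ordered[min(rank, len(ordered)) - 1]
    return {key: value for key, value in scores.items() if value >= threshold}


scores = {"a": 0.1, "b": 0.5, "c": 0.9, "d": 0.7}
kept = filter_by_quality(scores, quality_percentile=50)
```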

Test plan

  • ruff check . passes
  • pytest passes (61 tests)
  • --datasets platform fine-tune (hm_iou: 0.469)
  • --datasets dgs,platform from scratch (hm_iou: 0.391)
  • --datasets dgs,platform fine-tune (hm_iou: 0.457)


AmitMY left a comment


Multi-dataset support should be available in main. Maybe we pass datasets = 'dgs,platform' etc. in args.
Ideally, we should then have a register_dataset("dgs", some_function) so we can register datasets; then it's not necessary to have specific code for each dataset everywhere.

@ziv-lazarov-nagish

Multi-dataset support should be available in main. Maybe we pass datasets = 'dgs,platform' etc. in args. Ideally, we should then have a register_dataset("dgs", some_function) so we can register datasets; then it's not necessary to have specific code for each dataset everywhere.

I'll implement the comma-separated dataset names, but I'm not sure I understand the register_dataset part. How will that work?
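
One way the register_dataset idea could work, as a minimal sketch: a module-level dictionary maps dataset names to factory functions, and the comma-separated argument is resolved against it. Only `register_dataset` and the names "dgs" and "platform" come from the discussion; `build_datasets`, the factory signature, and the placeholder factories are assumptions, and the merged implementation may differ.

```python
from typing import Callable, Dict, List

# A factory takes a split name and returns a dataset (a plain list stands in here).
DatasetFactory = Callable[[str], list]

_DATASET_REGISTRY: Dict[str, DatasetFactory] = {}


def register_dataset(name: str, factory: DatasetFactory) -> None:
    """Register a dataset factory under a short name."""
    if name in _DATASET_REGISTRY:
        raise ValueError(f"dataset {name!r} is already registered")
    _DATASET_REGISTRY[name] = factory


def build_datasets(names: str, split: str) -> List[list]:
    """Resolve a comma-separated string such as 'dgs,platform' (hypothetical helper)."""
    requested = [n.strip() for n in names.split(",") if n.strip()]
    unknown = [n for n in requested if n not in _DATASET_REGISTRY]
    if unknown:
        raise KeyError(f"unknown datasets: {unknown}; known: {sorted(_DATASET_REGISTRY)}")
    return [_DATASET_REGISTRY[n](split) for n in requested]


# Placeholder factories; real ones would construct the actual dataset objects.
register_dataset("dgs", lambda split: [f"dgs-{split}-0", f"dgs-{split}-1"])
register_dataset("platform", lambda split: [f"platform-{split}-0"])

# Training code only deals with names and never branches on a specific dataset.
combined = [item for ds in build_datasets("dgs,platform", split="train") for item in ds]
```

With this shape, adding a new dataset means adding one `register_dataset` call next to its definition, with no changes to the training or evaluation scripts.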

Add annotation platform dataset backed by Convex DB sync:
- sync.py: fetch completed annotations (completedAt + approved/submitted)
- dataset.py: hash-based train/dev/test splits, quality filtering
- Multi-dataset dispatch (--dataset platform/combined)
- --val_dgs flag to mix DGS dev into validation
- --annotations_path, --quality_percentile args
- [platform] optional dep group (httpx, python-dotenv)
- Split manifest tracking for reproducibility

Replace broken --score boolean flag (defaulted to True, making sync
unreachable) with proper subcommands. Update run_hpo.sh to use the
CLI directly instead of python -c workaround.

Add --dataset flag (dgs/platform) and --annotations_path arg so the
evaluation script can run on both DGS and annotation platform data.

Remove approved status filter: all production workflows have no
review stage, so every completed status is treated as good.

Remove subcommands: single CLI that syncs then scores by default.
Add --no_score to skip scoring. Skip sync if cache file exists,
skip already-scored videos. Fix gpu->cuda device mapping for
checkpoint loading.
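
The "skip sync if cache file exists, skip already-scored videos" behaviour could be sketched roughly as below. This is illustrative only: `sync_and_score`, the `fetch_tasks` and `score_video` callables, and the JSON cache layout are assumptions, and the `no_score` parameter merely mirrors the `--no_score` flag.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List


def sync_and_score(
    cache_path: Path,
    fetch_tasks: Callable[[], List[dict]],
    score_video: Callable[[dict], float],
    no_score: bool = False,
) -> Dict[str, dict]:
    """Sync once, then score only records that have no score yet (hypothetical helper)."""
    if cache_path.exists():
        # A cache file is present, so the remote fetch is skipped entirely.
        records = json.loads(cache_path.read_text())
    else:
        records = {task["id"]: task for task in fetch_tasks()}

    if not no_score:
        for record in records.values():
            if "score" not in record:  # already-scored videos are left untouched
                record["score"] = score_video(record)

    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(records, indent=2))
    return records
```

Running the function twice against the same cache path should fetch and score only on the first call; the second call reads the cache and finds nothing left to score.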

ziv-lazarov-nagish force-pushed the feat/annotation-platform-dataset branch from 967b76a to 9e344cd (April 14, 2026 14:01)
ziv-lazarov-nagish changed the base branch from nagish to y2026 (April 14, 2026 14:08)
ziv-lazarov-nagish changed the base branch from y2026 to nagish (April 14, 2026 14:08)
ziv-lazarov-nagish merged commit 1df4e1c into nagish (Apr 14, 2026)
2 checks passed