feat: annotation platform dataset and multi-dataset training by ziv-lazarov-nagish · Pull Request #17 · sign-language-processing/segmentation

ziv-lazarov-nagish · 2026-04-14T07:26:03Z

Summary

Add annotation platform dataset: sync from Convex DB, quality scoring, hash-based train/dev/test splits
Register as platform via dataset registry (--datasets platform or --datasets dgs,platform)
--quality_percentile filtering for annotation quality
--annotations_path arg for cache location
Sync CLI: fetches completed tasks (by completedAt), scores with latest model, caches results
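
The hash-based train/dev/test split mentioned above can be sketched as follows. This is a minimal illustration, not the code in this PR: the function name `assign_split`, the 80/10/10 proportions, and the choice of SHA-256 are all assumptions.

```python
import hashlib


def assign_split(video_id: str, dev_pct: int = 10, test_pct: int = 10) -> str:
    """Deterministically map an ID to 'train', 'dev', or 'test' (hypothetical helper)."""
    # Use a stable digest; Python's built-in hash() is salted per process for strings.
    digest = hashlib.sha256(video_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in [0, 100)
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "dev"
    return "train"
```

Because the split depends only on the ID, newly synced annotations never move existing items between splits.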

Changed files

sign_language_segmentation/datasets/annotation_platform/ — dataset, sync, and tests
sign_language_segmentation/args.py — --annotations_path, --quality_percentile
sign_language_segmentation/datasets/common.py — register platform in _ensure_datasets_registered
sign_language_segmentation/evaluate.py — --annotations_path arg
pyproject.toml — [platform] optional group (httpx, python-dotenv)
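
For orientation, the flags named above might be declared and used roughly like this. The flag names come from this PR; the defaults, the `quality` field, and the interpretation of `--quality_percentile` as "drop items scoring below this percentile" are assumptions made for the sketch.

```python
import argparse


def parse_dataset_names(value: str) -> list:
    """Split a comma-separated value such as 'dgs,platform' into names."""
    return [name.strip() for name in value.split(",") if name.strip()]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Defaults below are illustrative, not taken from args.py.
    parser.add_argument("--datasets", type=parse_dataset_names, default=["dgs"])
    parser.add_argument("--annotations_path", type=str, default=None)
    parser.add_argument("--quality_percentile", type=float, default=0.0)
    return parser


def filter_by_quality(items: list, percentile: float) -> list:
    """Keep items whose 'quality' is at or above the given percentile (assumed semantics)."""
    if not items or percentile <= 0:
        return list(items)
    scores = sorted(item["quality"] for item in items)
    # Nearest-rank threshold, clamped to the last index.
    rank = min(len(scores) - 1, int(len(scores) * percentile / 100))
    threshold = scores[rank]
    return [item for item in items if item["quality"] >= threshold]
```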

Test plan

ruff check . passes
pytest passes (61 tests)
--datasets platform fine-tune (hm_iou: 0.469)
--datasets dgs,platform from scratch (hm_iou: 0.391)
--datasets dgs,platform fine-tune (hm_iou: 0.457)

AmitMY

multi dataset support should be available in main - maybe we pass datasets = 'dgs,platform' etc in args -
then ideally, we should have a register_dataset("dgs", some_function) and we can register datasets - then not necessary to have specific code for each dataset everywhere

ziv-lazarov-nagish · 2026-04-14T13:02:39Z

multi dataset support should be available in main - maybe we pass datasets = 'dgs,platform' etc in args - then ideally, we should have a register_dataset("dgs", some_function) and we can register datasets - then not necessary to have specific code for each dataset everywhere

i'll implement the comma-separated datasets names but i'm not sure i understand the register_dataset part. how will that work?
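
One way the register_dataset idea could work is a name-to-factory dictionary that the training code looks up instead of branching on each dataset. The sketch below is illustrative only: `build_datasets`, the `split` keyword, and the toy factories are hypothetical, and the registry that eventually landed (see #18) may look different.

```python
from typing import Callable, Dict, List

DatasetFactory = Callable[..., List[str]]

# Global name -> factory mapping (hypothetical; not the project's actual registry).
_REGISTRY: Dict[str, DatasetFactory] = {}


def register_dataset(name: str, factory: DatasetFactory) -> None:
    """Register a factory under a short name such as 'dgs' or 'platform'."""
    if name in _REGISTRY:
        raise ValueError(f"dataset {name!r} is already registered")
    _REGISTRY[name] = factory


def build_datasets(names: str, split: str) -> List[str]:
    """Build every dataset named in a comma-separated string and concatenate them."""
    examples: List[str] = []
    for name in (n.strip() for n in names.split(",")):
        if name not in _REGISTRY:
            raise KeyError(f"unknown dataset {name!r}; known: {sorted(_REGISTRY)}")
        examples.extend(_REGISTRY[name](split=split))
    return examples


# Toy factories standing in for the real dataset constructors.
register_dataset("dgs", lambda split: [f"dgs-{split}-0"])
register_dataset("platform", lambda split: [f"platform-{split}-0"])
```

With this in place, `build_datasets("dgs,platform", "train")` needs no dataset-specific code in the caller; adding a dataset is a single `register_dataset` call.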

Add annotation platform dataset backed by Convex DB sync:
- sync.py: fetch completed annotations (completedAt + approved/submitted)
- dataset.py: hash-based train/dev/test splits, quality filtering
- Multi-dataset dispatch (--dataset platform/combined)
- --val_dgs flag to mix DGS dev into validation
- --annotations_path, --quality_percentile args
- [platform] optional dep group (httpx, python-dotenv)
- Split manifest tracking for reproducibility

AmitMY reviewed Apr 14, 2026

Comment thread sign_language_segmentation/train.py Outdated

Comment thread sign_language_segmentation/train.py Outdated

ziv-lazarov-nagish mentioned this pull request Apr 14, 2026

refactor: dataset registry for multi-dataset training #18

Merged

3 tasks

ziv-lazarov-nagish added 8 commits April 14, 2026 09:50

fix: run_hpo.sh - missing backslash, wrong extra, remove invalid --score flag

d3b479b

refactor: use subcommands for sync CLI (sync/score)

b159e4e

Replace broken --score boolean flag (defaulted to True, making sync unreachable) with proper subcommands. Update run_hpo.sh to use the CLI directly instead of python -c workaround.

feat: add platform dataset support to evaluate.py

6891f0d

Add --dataset flag (dgs/platform) and --annotations_path arg so the evaluation script can run on both DGS and annotation platform data.

fix: use completedAt-only filter for task completion

a9f16db

Remove approved status filter — all production workflows have no review stage, so every completed status is treated as good.
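
The completedAt-only rule described in this commit amounts to a filter like the one below. It is a sketch under assumptions: tasks are taken to be plain dictionaries, and every field other than `completedAt` is made up for illustration.

```python
def completed_tasks(tasks: list) -> list:
    """Return tasks that have a completedAt value, regardless of their status."""
    return [task for task in tasks if task.get("completedAt") is not None]


# Hypothetical task records; only 'completedAt' is a field named in this PR.
example_tasks = [
    {"id": "t1", "status": "submitted", "completedAt": 1713081600000},
    {"id": "t2", "status": "in_progress", "completedAt": None},
    {"id": "t3", "status": "approved", "completedAt": 1713085200000},
]
```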

feat: sync scores by default, skip if cache exists

ce925c9

Remove subcommands — single CLI that syncs then scores by default. Add --no_score to skip scoring. Skip sync if cache file exists, skip already-scored videos. Fix gpu->cuda device mapping for checkpoint loading.
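
The sync-then-score flow with cache skipping, as described in this commit, could be structured along these lines. This is a hedged sketch rather than the PR's sync module: `sync_and_score`, `resolve_device`, the JSON cache layout, and the `score` field are hypothetical, and the `fetch` and `score` callables stand in for the Convex query and the model.

```python
import json
from pathlib import Path
from typing import Callable, List


def resolve_device(name: str) -> str:
    """Map the 'gpu' alias to the 'cuda' device string; pass other names through."""
    return "cuda" if name == "gpu" else name


def sync_and_score(cache_path, fetch: Callable[[], List[dict]],
                   score: Callable[[dict], float], no_score: bool = False) -> List[dict]:
    """Sync unless a cache file exists, then score only records that lack a score."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Cache hit: skip the remote sync entirely.
        records = json.loads(cache_path.read_text())
    else:
        records = fetch()
    if not no_score:
        for record in records:
            if record.get("score") is None:  # skip already-scored videos
                record["score"] = score(record)
    cache_path.write_text(json.dumps(records))
    return records
```

A second call with the same cache path neither refetches nor rescores, which is the behavior the commit message describes.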

refactor: remove --val_dgs flag and DGS dev mixing logic

d26e2e1

chore: remove run_hpo.sh from tracking, keep local

9e344cd

ziv-lazarov-nagish force-pushed the feat/annotation-platform-dataset branch from 967b76a to 9e344cd Compare April 14, 2026 14:01

ziv-lazarov-nagish changed the base branch from nagish to y2026 April 14, 2026 14:08

ziv-lazarov-nagish changed the base branch from y2026 to nagish April 14, 2026 14:08

AmitMY approved these changes Apr 14, 2026

ziv-lazarov-nagish merged commit 1df4e1c into nagish Apr 14, 2026
2 checks passed

ziv-lazarov-nagish merged 8 commits into nagish from feat/annotation-platform-dataset