
feat: add SignTube dataset#21

Merged
ziv-lazarov-nagish merged 14 commits into nagish from feat/signtube-dataset
Apr 26, 2026

Conversation

Contributor

@ziv-lazarov-nagish ziv-lazarov-nagish commented Apr 15, 2026

Summary

  • Add SignTube as a third training dataset alongside DGS and the annotation platform
  • SignTubeSegmentationDataset — reads annotations_cache.json, hash-based splits, registers as "signtube"
  • sync.py — fetches annotations from SignTube's Postgres DB, resolves pose files on NAS via md5 lookup, writes the JSON cache

Depends on #20.

What this PR actually does

Wires up a new source of poses + ground-truth boundaries for training:

  • Ground truth lives in SignTube's PostgreSQL captions table. Each row is one timestamped annotation: (videoId, start, end, language, text). Split by language:

    • language ∈ {"Sgnw", "hns"} → sign-level annotations (primary segmentation target)
    • anything else → sentence-level annotations (secondary target)
  • Poses (MediaPipe Holistic .pose files) live on NAS at /mnt/nas/GCS/sign-mediapipe-holistic-poses/{md5}.pose, keyed by the md5 of the source video file. SignTube's DB refers to videos by its own videoId strings, not md5 — so we bridge the two via /mnt/nas/transformations/videos/video_list.csv, which maps sign-tube/{videoId}.mp4 → md5Hash.
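The videoId → md5 bridge can be sketched roughly like this. The `md5Hash` column name and the `sign-tube/{videoId}.mp4` prefix come from the PR description; the path column name and the helper name are assumptions, and the real `_build_signtube_md5_lookup` may differ:

```python
import csv
import io

# Hypothetical sketch of the videoId -> md5 bridge. The "path" column
# name is an assumption; "md5Hash" is confirmed by the PR description.
SIGNTUBE_PREFIX = "sign-tube/"

def build_md5_lookup(csv_file) -> dict:
    """Map SignTube videoId -> md5 hash of the source video file."""
    lookup = {}
    for row in csv.DictReader(csv_file):
        path = row["path"]
        if not path.startswith(SIGNTUBE_PREFIX):
            continue  # rows for other video sources are ignored
        video_id = path[len(SIGNTUBE_PREFIX):].removesuffix(".mp4")
        lookup[video_id] = row["md5Hash"]
    return lookup

sample = io.StringIO(
    "path,md5Hash\n"
    "sign-tube/abc123.mp4,d41d8cd98f00b204e9800998ecf8427e\n"
    "other-source/x.mp4,ffffffffffffffffffffffffffffffff\n"
)
print(build_md5_lookup(sample))
# {'abc123': 'd41d8cd98f00b204e9800998ecf8427e'}
```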

How it fits together

            sync.py                         dataset.py
  ┌─────────────────────┐           ┌──────────────────────┐
  │  DB captions table  │           │  annotations_cache   │
  │  + video_list.csv   │ ────────▶ │       .json          │ ──▶ SignTubeSegmentationDataset
  │  + NAS pose files   │           │  (pose_path → NAS)   │     (splits, BIO, pose tensor)
  └─────────────────────┘           └──────────────────────┘

Sync (one-time, when annotations change):

  1. Query captions table → group rows by videoId
  2. Build videoId → md5 map from video_list.csv
  3. For each video: look up md5 → find NAS .pose file → read fps + total_frames from pose header
  4. Write JSON cache, keyed by DB videoId, with pose_path pointing straight at NAS (no copy, no symlink)
  5. Videos are skipped with a logged reason if: no md5 in CSV, no pose on NAS, pose file corrupted, or total_frames < 2
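The sync steps above can be sketched as the following skeleton. The NAS path and the cache fields (pose_path, fps, total_frames, signs) come from the PR description, but the function signature and skip-reason labels are assumptions — this is illustrative, not the actual sync.py:

```python
from collections import Counter
from pathlib import Path

POSE_DIR = Path("/mnt/nas/GCS/sign-mediapipe-holistic-poses")

def build_cache(videos, md5_lookup, read_pose_meta):
    """videos: {videoId: [annotation rows]} grouped from the captions table.
    read_pose_meta: callable(path) -> (fps, total_frames), or None on failure."""
    cache, skips = {}, Counter()
    for video_id, rows in videos.items():
        md5 = md5_lookup.get(video_id)
        if md5 is None:
            skips["no_md5"] += 1  # step 2 miss: not in video_list.csv
            continue
        pose_path = POSE_DIR / f"{md5}.pose"
        meta = read_pose_meta(pose_path)
        if meta is None:
            skips["pose_missing_or_unreadable"] += 1
            continue
        fps, total_frames = meta
        if total_frames < 2:
            skips["frames_lt_2"] += 1
            continue
        # step 4: pose_path points straight at NAS, key is the DB videoId
        cache[video_id] = {"pose_path": str(pose_path), "fps": fps,
                           "total_frames": total_frames, "signs": rows}
    return cache, skips

# Toy usage with a stubbed pose-metadata reader:
videos = {"v1": [{"start": 0.0, "end": 1.0, "text": "HI"}], "v2": [], "v3": []}
md5s = {"v1": "aaa", "v3": "ccc"}
fake_meta = lambda p: (25.0, 100) if "aaa" in str(p) else None
cache, skips = build_cache(videos, md5s, fake_meta)
print(sorted(cache), dict(skips))
# ['v1'] {'no_md5': 1, 'pose_missing_or_unreadable': 1}
```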

Training (every run):

  • SignTubeSegmentationDataset loads the cache, assigns each video to train/dev/test by hashing its videoId (deterministic, seedable, 80/10/10 default)
  • Each __getitem__ returns the usual {pose, timestamps, bio, ...} dict, identical shape to DGS/platform datasets — so the rest of the training pipeline is unchanged
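The hash-based split can be sketched like this. The helper is named split_bucket() in a follow-up commit, but its exact signature and hash recipe here are assumptions — the PR only states that the split is deterministic, seedable, and 80/10/10 by default:

```python
import hashlib

def split_bucket(video_id: str, seed: int = 0,
                 ratios=(0.8, 0.1, 0.1)) -> str:
    # Hash the (seed, videoId) pair so the assignment is deterministic
    # across runs but reshuffles when the seed changes.
    digest = hashlib.md5(f"{seed}:{video_id}".encode()).hexdigest()
    point = int(digest, 16) / 16**32  # maps the digest into [0, 1)
    if point < ratios[0]:
        return "train"
    if point < ratios[0] + ratios[1]:
        return "dev"
    return "test"
```

Because the bucket depends only on (seed, videoId), adding or removing videos never moves an existing video between splits, which is what makes the split stable across syncs.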

Changed files

  • sign_language_segmentation/datasets/signtube/ — dataset, sync, SQL query
  • sign_language_segmentation/datasets/common.py — register signtube in _ensure_datasets_registered
  • sign_language_segmentation/tests/test_signtube.py — new, 15 tests (splits, schema, defensive checks, pose_path fallback, _is_sign_annotation, _build_cache)
  • .env.example — document DB_NAME / DB_USER / DB_PASS / DB_HOST
  • pyproject.toml — [signtube] optional dep group (psycopg, python-dotenv, tqdm)
  • .gitignore — exclude *.pose files

Verified

  • ruff check . clean
  • pytest — 76/76 pass (15 new signtube tests)
  • 1-epoch --datasets signtube smoke: val_sign_iou=0.867, val_hm_iou=0.710, splits 419/45/51
  • --datasets dgs,platform,signtube combined training works (earlier run)

@ziv-lazarov-nagish ziv-lazarov-nagish changed the title from "feat: add SignTube dataset backed by PostgreSQL annotations" to "feat: add SignTube dataset" Apr 15, 2026
@ziv-lazarov-nagish ziv-lazarov-nagish marked this pull request as ready for review April 16, 2026 09:40
Contributor

@AmitMY AmitMY left a comment


could you please first extract some changes? like for example the annotation platform changes

Comment thread sign_language_segmentation/args.py Outdated
Comment thread sign_language_segmentation/evaluate.py Outdated
args=eval_args,
batch_size=1,
num_frames=999999,
persistent_workers=False,
Contributor


i wonder why false?

Contributor Author


That's the default value in the DataLoader class; it wasn't explicitly specified here before this change.

Contributor


so why specify it?

@ziv-lazarov-nagish
Contributor Author

Opening a dedicated PR for the .cache directory. Moving this PR to draft since it depends on those changes.

@ziv-lazarov-nagish ziv-lazarov-nagish marked this pull request as draft April 16, 2026 13:32
@ziv-lazarov-nagish ziv-lazarov-nagish force-pushed the feat/signtube-dataset branch 2 times, most recently from df97553 to 2f891c6 on April 19, 2026 12:31
@ziv-lazarov-nagish ziv-lazarov-nagish marked this pull request as ready for review April 19, 2026 14:10
Comment thread sign_language_segmentation/datasets/signtube/sync.py Outdated
Comment thread sign_language_segmentation/datasets/signtube/sync.py Outdated
Comment thread sign_language_segmentation/datasets/signtube/sync.py Outdated
ziv-lazarov-nagish added a commit that referenced this pull request Apr 20, 2026
…gging

Address PR #21 review feedback:
- dataset.py: raise ValueError on cache missing 'videos' key; skip videos
  with total_frames < 2
- sync.py: log video_id on GCS-miss and too-few-frames skip paths for
  easier cache-drift diagnosis
- .env.example: document DB_NAME/DB_USER/DB_PASS/DB_HOST required by sync
- tests/test_signtube.py: 13 new tests covering splits, schema, defensive
  checks, pose_path fallback, and _is_sign_annotation

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment thread pyproject.toml Outdated
@AmitMY
Contributor

AmitMY commented Apr 22, 2026

This was written by Claude. Please run /review on your PRs.


Review: PR #21 — feat: add SignTube dataset

Overview

Adds SignTube as a third training source alongside DGS and the annotation platform. Ships a SignTubeSegmentationDataset that reads annotations_cache.json (hash-split train/dev/test, same output shape as the other datasets), and a sync.py that queries SignTube's Postgres captions table, bridges videoId → md5 via /mnt/nas/transformations/videos/video_list.csv, and resolves poses on NAS. 13 tests in test_signtube.py, [signtube] extras group, *.sql added to package-data.

Code correctness

  • gcsfs is listed in [signtube] extras but never imported. pyproject.toml:31 pulls in gcsfs, but the current sync.py only uses psycopg + csv + pose_format. This looks like a refactor remnant from an earlier GCS-fallback version. Either drop the dep or bring back the fallback — today pip install .[signtube] downloads a dep that goes untouched. (Inline comment posted.)
  • Merge-conflict hazard with PR #31 (feat: publish HF ops + evaluation helpers). Both PRs edit datasets/common.py in the _ensure_datasets_registered function, and PR #31 renames it to ensure_datasets_registered. Whichever lands second needs a trivial rebase (move the signtube import line and adopt the public name).
  • WHERE "start" != 0 in captions.sql has no why-comment. A legitimate caption starting at t=0 would be silently dropped. This is almost certainly "skip unfilled/placeholder rows" but SQL is the wrong place to lose that context — add an inline -- comment, or filter on a sentinel column if one exists (NULL, an is_draft flag, etc.).
  • _is_sign_annotation hardcodes ("Sgnw", "hns") (sync.py:75) with no comment on what these mean. If SignTube adds another sign-language code (say ASL SignWriting with a different tag), it'll silently land in the sentence bucket and skew training. At minimum a module-level _SIGN_LANGUAGE_CODES = (...) constant with a link/note to the source-of-truth.
  • Per-video schema in dataset.py is not validated beyond the top-level videos key. The ValueError: missing 'videos' key check (dataset.py:46) catches a corrupted cache, but video_data["fps"] / video_data["total_frames"] / video_data["signs"] will KeyError on a partial cache entry. Given sync.py always writes all four, this is a minor concern, but a one-line schema check (or .get(..., default) + skip) would make the dataset robust to manual edits.
  • _build_signtube_md5_lookup collapses duplicates silently. If video_list.csv has two rows with the same sign-tube/{id}.mp4 but different md5s, last-write wins with no warning. Probably can't happen today, but a if lookup.setdefault(key, h) != h: warn(...) would turn a silent landmine into a loud one.
  • _build_cache's per-skip prints will be noisy at scale. For tens of thousands of videos with missing md5s or missing poses, the log is a wall of lines. Consider aggregating skip counts by reason and printing a summary (123 skipped: no md5, 47 skipped: no pose, 3 skipped: <2 frames).
  • Bare except Exception on Pose.read (sync.py:112). Fine for a sync script — pose-format can raise a variety of errors and you want to continue — but narrowing to (struct.error, EOFError, ValueError) or similar would keep a CTRL-C from being swallowed as "file corrupted".
  • psycopg.Error re-raise includes the exception string. psycopg doesn't embed passwords in error messages by default, so this is probably safe, but worth a quick mental check: never log the connection dict (os.environ["DB_PASS"]) anywhere.
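The duplicate-md5 guard suggested above could look like this minimal sketch (add_md5 is a hypothetical helper name, not code from the PR):

```python
import warnings

def add_md5(lookup: dict, key: str, md5: str) -> None:
    # setdefault keeps the first md5 seen for a key; a later row that
    # disagrees triggers a loud warning instead of silently winning.
    existing = lookup.setdefault(key, md5)
    if existing != md5:
        warnings.warn(f"conflicting md5 for {key}: kept {existing}, ignored {md5}")
```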

Conventions & style

  • Inheritance from BaseSegmentationDataset matches DGS/platform. glosses vs signs naming in the item dict is an existing convention (base-class contract names them glosses, caller reads signs from cache, dataset maps between them) — inconsistent but consistent with the rest of the codebase, so not a blocker here.
  • The pose_path fallback (cache's absolute path → CACHE_DIR/signtube/poses/{video_id}.pose) is a nice decoupling: sync can run on a NAS-mounted machine and training can run on one without. Keep it.
  • captions.sql living next to the code with inline URLs per prefix is great documentation — good pattern.
  • psycopg.rows.dict_row row factory — correct choice for the query.

Test coverage

  • 13 tests cover the dataset's public behaviour well: splits cover all videos, are disjoint, are deterministic, manifest schema, __getitem__ shape, missing-pose skip, too-few-frames skip, corrupted-cache raise, pose-path fallback. Good.
  • Gap: no tests for sync.py beyond _is_sign_annotation. _build_signtube_md5_lookup and _build_cache carry real logic (prefix filter, md5 resolution, pose metadata read, skip reasons) and are entirely untested. Mocking csv.DictReader + Pose.read would cover most of the risky branches. Even one happy-path test + one "no md5" + one "pose missing" would be a meaningful addition.
  • test_getitem_returns_expected_format has pytest.skip("no items in train split for this seed") — a seed-dependent soft skip. With 10 items and an 80% train rate the probability of skipping is ~2×10⁻⁷, so realistically fine, but a soft-skip is a test-smell (the test can green without actually asserting anything). Either bump to 30–50 fixture videos, or pick a seed/ratio that guarantees ≥1 item in TRAIN.
  • test_split_is_deterministic checks same-config-same-output, which is necessary but weak. A companion assertion that a different split_seed produces a different partition would lock in the contract that splits are seedable-not-fixed.
  • No test asserts that build_datasets("signtube", ...) actually resolves the registered class — registration could silently fail and only trip at train time. A one-line assert "signtube" in DATASET_REGISTRY test would catch it.
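The suggested registration test is essentially a one-line assert; sketched here against a toy registry, since the real DATASET_REGISTRY lives in datasets/common.py and its decorator API may differ:

```python
# Toy registry standing in for datasets/common.py (names are illustrative).
DATASET_REGISTRY = {}

def register_dataset(name):
    def decorator(cls):
        DATASET_REGISTRY[name] = cls
        return cls
    return decorator

@register_dataset("signtube")
class SignTubeSegmentationDataset:
    pass

def test_signtube_is_registered():
    # Catches a silently-failing registration before it trips at train time.
    assert "signtube" in DATASET_REGISTRY
    assert DATASET_REGISTRY["signtube"] is SignTubeSegmentationDataset

test_signtube_is_registered()
```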

Performance / security

  • _fetch_annotations does fetchall() into memory. Fine at SignTube's current scale; worth noting if the table grows beyond ~1M rows.
  • Pose.read(path.read_bytes(), pose_body=EmptyPoseBody) is the right optimization — we only need the header metadata, not the full body.
  • DB_PASS is read from env and never written back out; connection is scoped with with. No leak concerns.
  • No user-controlled paths beyond args.output (CLI) and annotations_path (caller). Fine.

Suggested actions before merge

  1. Drop gcsfs from [signtube] extras (or reinstate the GCS fallback in sync.py).
  2. Rebase on top of PR #31 (feat: publish HF ops + evaluation helpers) once it lands — the _ensure_datasets_registered → ensure_datasets_registered rename will conflict otherwise.
  3. Add an -- comment to captions.sql explaining WHERE "start" != 0.
  4. Extract ("Sgnw", "hns") into a named constant in sync.py with a comment on where the list comes from.
  5. Add at least a happy-path mocked test for _build_cache (md5 found, pose exists, valid metadata → cache entry) and a no-md5 skip test.
  6. Harden test_getitem_returns_expected_format so it can't soft-skip (more fixtures or a pinned seed+ratio).
  7. Aggregate per-skip prints into a summary rather than a per-video log line.
  8. Optional: per-video schema validation in dataset.py (warn-and-skip on missing keys) instead of KeyError.

Solid addition — the dataset class is clean, the base-class contract is respected, and the split/fallback logic is well-tested. The gcsfs orphan and the SQL comment are the two I'd not merge without fixing; everything else is polish.

ziv-lazarov-nagish added a commit that referenced this pull request Apr 23, 2026
…tests

Follow-up to PR #21 review feedback:

- drop `gcsfs` from [signtube] extras (unused since the NAS/md5 refactor)
- narrow `Pose.read` except to (struct.error, EOFError, ValueError) — corrupt/
  truncated/malformed pose files get skipped with a per-file log line, real
  bugs (AttributeError/ImportError/etc.) still surface
- replace per-skip prints + int counter with a Counter keyed by reason; print
  aggregated summary at end of sync (no_md5 / pose_missing / pose_unreadable
  / frames_lt_2)
- remove seed-dependent `pytest.skip` in test_getitem_returns_expected_format
  in favor of a hard assert with an invariant comment
- add TestBuildCache with happy-path and no-md5 skip tests
Comment thread sign_language_segmentation/datasets/signtube/sync.py Outdated
Comment thread sign_language_segmentation/datasets/signtube/dataset.py
…valuate

- Move get_dataloader from train.py to datasets/common.py with explicit
  args parameter (no global state dependency)
- evaluate.py uses shared get_dataloader instead of manual build_datasets + DataLoader
- Fix duplicate manifest in train.py (train + val collected same data)
- Add split_bucket() and assign_split() to common.py
- Add _init_split_tracking(), _track_and_filter(), get_split_manifest()
  to BaseSegmentationDataset — removes duplication across all datasets
- DGS: replace hardcoded splits.json with hash-based splitting
- Platform: use base class helpers, remove local _split_bucket
- Remove duplicate manifest collection in train.py (base class now
  tracks all split IDs regardless of which split was loaded)
- SignTubeSegmentationDataset: reads annotations_cache.json, hash-based
  splits via BaseSegmentationDataset, registers as "signtube"
- sync.py: queries SignTube PostgreSQL DB, downloads poses from GCS,
  writes local cache
- captions.sql: query for sign/sentence annotations across SignTube sources
- --signtube_annotations_path CLI arg in args.py and evaluate.py
- [signtube] optional dep group (psycopg, gcsfs, python-dotenv, tqdm)
ziv-lazarov-nagish and others added 5 commits April 26, 2026 04:24
- Drop GCS download path and gcsfs import; sync is NAS-only now
- New _build_signtube_md5_lookup reads /mnt/nas/transformations/videos/video_list.csv,
  mapping signtube video_id -> md5Hash
- pose_path in cache points straight at /mnt/nas/GCS/sign-mediapipe-holistic-poses/{md5}.pose
  (no copy, no symlink)
- Cache keys remain the DB videoId

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@ziv-lazarov-nagish ziv-lazarov-nagish merged commit cf9ece8 into nagish Apr 26, 2026
2 checks passed