
feat: add SignTube dataset#21

Merged
ziv-lazarov-nagish merged 14 commits into nagish from feat/signtube-dataset
Apr 26, 2026

Conversation

Contributor

@ziv-lazarov-nagish ziv-lazarov-nagish commented Apr 15, 2026

Summary

  • Add SignTube as a third training dataset alongside DGS and the annotation platform
  • SignTubeSegmentationDataset — reads annotations_cache.json, hash-based splits, registers as "signtube"
  • sync.py — fetches annotations from SignTube's Postgres DB, resolves pose files on NAS via md5 lookup, writes the JSON cache

Depends on #20.

What this PR actually does

Wires up a new source of poses + ground-truth boundaries for training:

  • Ground truth lives in SignTube's PostgreSQL captions table. Each row is one timestamped annotation: (videoId, start, end, language, text). Split by language:

    • language ∈ {"Sgnw", "hns"} → sign-level annotations (primary segmentation target)
    • anything else → sentence-level annotations (secondary target)
  • Poses (MediaPipe Holistic .pose files) live on NAS at /mnt/nas/GCS/sign-mediapipe-holistic-poses/{md5}.pose, keyed by the md5 of the source video file. SignTube's DB refers to videos by its own videoId strings, not md5 — so we bridge the two via /mnt/nas/transformations/videos/video_list.csv, which maps sign-tube/{videoId}.mp4 → md5Hash.
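The videoId → md5 bridge can be sketched roughly like this. The `md5Hash` column name and the `sign-tube/{videoId}.mp4` prefix come from the PR description; the path column name and the helper name are assumptions, and the real `_build_signtube_md5_lookup` may differ:

```python
import csv
import io

# Hypothetical sketch of the videoId -> md5 bridge. The "path" column
# name is an assumption; "md5Hash" is confirmed by the PR description.
SIGNTUBE_PREFIX = "sign-tube/"

def build_md5_lookup(csv_file) -> dict:
    """Map SignTube videoId -> md5 hash of the source video file."""
    lookup = {}
    for row in csv.DictReader(csv_file):
        path = row["path"]
        if not path.startswith(SIGNTUBE_PREFIX):
            continue  # rows for other video sources are ignored
        video_id = path[len(SIGNTUBE_PREFIX):].removesuffix(".mp4")
        lookup[video_id] = row["md5Hash"]
    return lookup

sample = io.StringIO(
    "path,md5Hash\n"
    "sign-tube/abc123.mp4,d41d8cd98f00b204e9800998ecf8427e\n"
    "other-source/x.mp4,ffffffffffffffffffffffffffffffff\n"
)
print(build_md5_lookup(sample))
# {'abc123': 'd41d8cd98f00b204e9800998ecf8427e'}
```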

How it fits together

            sync.py                         dataset.py
  ┌─────────────────────┐           ┌──────────────────────┐
  │  DB captions table  │           │  annotations_cache   │
  │  + video_list.csv   │ ────────▶ │       .json          │ ──▶ SignTubeSegmentationDataset
  │  + NAS pose files   │           │  (pose_path → NAS)   │     (splits, BIO, pose tensor)
  └─────────────────────┘           └──────────────────────┘

Sync (one-time, when annotations change):

  1. Query captions table → group rows by videoId
  2. Build videoId → md5 map from video_list.csv
  3. For each video: look up md5 → find NAS .pose file → read fps + total_frames from pose header
  4. Write JSON cache, keyed by DB videoId, with pose_path pointing straight at NAS (no copy, no symlink)
  5. Videos are skipped with a logged reason if: no md5 in CSV, no pose on NAS, pose file corrupted, or total_frames < 2
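The sync steps above can be sketched as the following skeleton. The NAS path and the cache fields (pose_path, fps, total_frames, signs) come from the PR description, but the function signature and skip-reason labels are assumptions — this is illustrative, not the actual sync.py:

```python
from collections import Counter
from pathlib import Path

POSE_DIR = Path("/mnt/nas/GCS/sign-mediapipe-holistic-poses")

def build_cache(videos, md5_lookup, read_pose_meta):
    """videos: {videoId: [annotation rows]} grouped from the captions table.
    read_pose_meta: callable(path) -> (fps, total_frames), or None on failure."""
    cache, skips = {}, Counter()
    for video_id, rows in videos.items():
        md5 = md5_lookup.get(video_id)
        if md5 is None:
            skips["no_md5"] += 1  # step 2 miss: not in video_list.csv
            continue
        pose_path = POSE_DIR / f"{md5}.pose"
        meta = read_pose_meta(pose_path)
        if meta is None:
            skips["pose_missing_or_unreadable"] += 1
            continue
        fps, total_frames = meta
        if total_frames < 2:
            skips["frames_lt_2"] += 1
            continue
        # step 4: pose_path points straight at NAS, key is the DB videoId
        cache[video_id] = {"pose_path": str(pose_path), "fps": fps,
                           "total_frames": total_frames, "signs": rows}
    return cache, skips

# Toy usage with a stubbed pose-metadata reader:
videos = {"v1": [{"start": 0.0, "end": 1.0, "text": "HI"}], "v2": [], "v3": []}
md5s = {"v1": "aaa", "v3": "ccc"}
fake_meta = lambda p: (25.0, 100) if "aaa" in str(p) else None
cache, skips = build_cache(videos, md5s, fake_meta)
print(sorted(cache), dict(skips))
# ['v1'] {'no_md5': 1, 'pose_missing_or_unreadable': 1}
```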

Training (every run):

  • SignTubeSegmentationDataset loads the cache, assigns each video to train/dev/test by hashing its videoId (deterministic, seedable, 80/10/10 default)
  • Each __getitem__ returns the usual {pose, timestamps, bio, ...} dict, identical shape to DGS/platform datasets — so the rest of the training pipeline is unchanged
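The hash-based split can be sketched like this. The helper is named split_bucket() in a follow-up commit, but its exact signature and hash recipe here are assumptions — the PR only states that the split is deterministic, seedable, and 80/10/10 by default:

```python
import hashlib

def split_bucket(video_id: str, seed: int = 0,
                 ratios=(0.8, 0.1, 0.1)) -> str:
    # Hash the (seed, videoId) pair so the assignment is deterministic
    # across runs but reshuffles when the seed changes.
    digest = hashlib.md5(f"{seed}:{video_id}".encode()).hexdigest()
    point = int(digest, 16) / 16**32  # maps the digest into [0, 1)
    if point < ratios[0]:
        return "train"
    if point < ratios[0] + ratios[1]:
        return "dev"
    return "test"
```

Because the bucket depends only on (seed, videoId), adding or removing videos never moves an existing video between splits, which is what makes the split stable across syncs.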

Changed files

  • sign_language_segmentation/datasets/signtube/ — dataset, sync, SQL query
  • sign_language_segmentation/datasets/common.py — register signtube in _ensure_datasets_registered
  • sign_language_segmentation/tests/test_signtube.py — new, 15 tests (splits, schema, defensive checks, pose_path fallback, _is_sign_annotation, _build_cache)
  • .env.example — document DB_NAME / DB_USER / DB_PASS / DB_HOST
  • pyproject.toml — [signtube] optional dep group (psycopg, python-dotenv, tqdm)
  • .gitignore — exclude *.pose files

Verified

  • ruff check . clean
  • pytest — 76/76 pass (15 new signtube tests)
  • 1-epoch --datasets signtube smoke: val_sign_iou=0.867, val_hm_iou=0.710, splits 419/45/51
  • --datasets dgs,platform,signtube combined training works (earlier run)

@ziv-lazarov-nagish ziv-lazarov-nagish changed the title from "feat: add SignTube dataset backed by PostgreSQL annotations" to "feat: add SignTube dataset" Apr 15, 2026
@ziv-lazarov-nagish ziv-lazarov-nagish marked this pull request as ready for review April 16, 2026 09:40
Contributor

@AmitMY AmitMY left a comment


could you please first extract some changes? like for example the annotation platform changes

Comment thread sign_language_segmentation/args.py Outdated
Comment thread sign_language_segmentation/evaluate.py Outdated
args=eval_args,
batch_size=1,
num_frames=999999,
persistent_workers=False,
Contributor


i wonder why false?

Contributor Author


That's the default value in the DataLoader class; it wasn't explicitly specified here before this change.

Contributor


so why specify it?

@ziv-lazarov-nagish
Contributor Author

Opening a dedicated PR for the .cache directory. Moving this PR to draft since it depends on those changes.

@ziv-lazarov-nagish ziv-lazarov-nagish marked this pull request as draft April 16, 2026 13:32
@ziv-lazarov-nagish ziv-lazarov-nagish force-pushed the feat/signtube-dataset branch 2 times, most recently from df97553 to 2f891c6 on April 19, 2026 12:31
@ziv-lazarov-nagish ziv-lazarov-nagish marked this pull request as ready for review April 19, 2026 14:10
Comment thread sign_language_segmentation/datasets/signtube/sync.py Outdated
Comment thread sign_language_segmentation/datasets/signtube/sync.py Outdated
Comment thread sign_language_segmentation/datasets/signtube/sync.py Outdated
ziv-lazarov-nagish added a commit that referenced this pull request Apr 20, 2026
…gging

Address PR #21 review feedback:
- dataset.py: raise ValueError on cache missing 'videos' key; skip videos
  with total_frames < 2
- sync.py: log video_id on GCS-miss and too-few-frames skip paths for
  easier cache-drift diagnosis
- .env.example: document DB_NAME/DB_USER/DB_PASS/DB_HOST required by sync
- tests/test_signtube.py: 13 new tests covering splits, schema, defensive
  checks, pose_path fallback, and _is_sign_annotation

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment thread pyproject.toml Outdated
@AmitMY
Contributor

AmitMY commented Apr 22, 2026

This was written by Claude. Please run /review on your PRs.


Review: PR #21 — feat: add SignTube dataset

Overview

Adds SignTube as a third training source alongside DGS and the annotation platform. Ships a SignTubeSegmentationDataset that reads annotations_cache.json (hash-split train/dev/test, same output shape as the other datasets), and a sync.py that queries SignTube's Postgres captions table, bridges videoId → md5 via /mnt/nas/transformations/videos/video_list.csv, and resolves poses on NAS. 13 tests in test_signtube.py, [signtube] extras group, *.sql added to package-data.

Code correctness

  • gcsfs is listed in [signtube] extras but never imported. pyproject.toml:31 pulls in gcsfs, but the current sync.py only uses psycopg + csv + pose_format. This looks like a refactor remnant from an earlier GCS-fallback version. Either drop the dep or bring back the fallback — today pip install .[signtube] downloads a dep that goes untouched. (Inline comment posted.)
  • Merge-conflict hazard with PR #31 (feat: publish HF ops + evaluation helpers). Both PRs edit datasets/common.py in the _ensure_datasets_registered function, and PR #31 renames it to ensure_datasets_registered. Whichever lands second needs a trivial rebase (move the signtube import line and adopt the public name).
  • WHERE "start" != 0 in captions.sql has no why-comment. A legitimate caption starting at t=0 would be silently dropped. This is almost certainly "skip unfilled/placeholder rows" but SQL is the wrong place to lose that context — add an inline -- comment, or filter on a sentinel column if one exists (NULL, an is_draft flag, etc.).
  • _is_sign_annotation hardcodes ("Sgnw", "hns") (sync.py:75) with no comment on what these mean. If SignTube adds another sign-language code (say ASL SignWriting with a different tag), it'll silently land in the sentence bucket and skew training. At minimum a module-level _SIGN_LANGUAGE_CODES = (...) constant with a link/note to the source-of-truth.
  • Per-video schema in dataset.py is not validated beyond the top-level videos key. The ValueError: missing 'videos' key check (dataset.py:46) catches a corrupted cache, but video_data["fps"] / video_data["total_frames"] / video_data["signs"] will KeyError on a partial cache entry. Given sync.py always writes all four, this is a minor concern, but a one-line schema check (or .get(..., default) + skip) would make the dataset robust to manual edits.
  • _build_signtube_md5_lookup collapses duplicates silently. If video_list.csv has two rows with the same sign-tube/{id}.mp4 but different md5s, last-write wins with no warning. Probably can't happen today, but a if lookup.setdefault(key, h) != h: warn(...) would turn a silent landmine into a loud one.
  • _build_cache's per-skip prints will be noisy at scale. For tens of thousands of videos with missing md5s or missing poses, the log is a wall of lines. Consider aggregating skip counts by reason and printing a summary (123 skipped: no md5, 47 skipped: no pose, 3 skipped: <2 frames).
  • Bare except Exception on Pose.read (sync.py:112). Fine for a sync script — pose-format can raise a variety of errors and you want to continue — but narrowing to (struct.error, EOFError, ValueError) or similar would keep a CTRL-C from being swallowed as "file corrupted".
  • psycopg.Error re-raise includes the exception string. psycopg doesn't embed passwords in error messages by default, so this is probably safe, but worth a quick mental check: never log the connection dict (os.environ["DB_PASS"]) anywhere.
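The duplicate-md5 guard suggested above could look like this minimal sketch (add_md5 is a hypothetical helper name, not code from the PR):

```python
import warnings

def add_md5(lookup: dict, key: str, md5: str) -> None:
    # setdefault keeps the first md5 seen for a key; a later row that
    # disagrees triggers a loud warning instead of silently winning.
    existing = lookup.setdefault(key, md5)
    if existing != md5:
        warnings.warn(f"conflicting md5 for {key}: kept {existing}, ignored {md5}")
```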

Conventions & style

  • Inheritance from BaseSegmentationDataset matches DGS/platform. glosses vs signs naming in the item dict is an existing convention (base-class contract names them glosses, caller reads signs from cache, dataset maps between them) — inconsistent but consistent with the rest of the codebase, so not a blocker here.
  • The pose_path fallback (cache's absolute path → CACHE_DIR/signtube/poses/{video_id}.pose) is a nice decoupling: sync can run on a NAS-mounted machine and training can run on one without. Keep it.
  • captions.sql living next to the code with inline URLs per prefix is great documentation — good pattern.
  • psycopg.rows.dict_row row factory — correct choice for the query.

Test coverage

  • 13 tests cover the dataset's public behaviour well: splits cover all videos, are disjoint, are deterministic, manifest schema, __getitem__ shape, missing-pose skip, too-few-frames skip, corrupted-cache raise, pose-path fallback. Good.
  • Gap: no tests for sync.py beyond _is_sign_annotation. _build_signtube_md5_lookup and _build_cache carry real logic (prefix filter, md5 resolution, pose metadata read, skip reasons) and are entirely untested. Mocking csv.DictReader + Pose.read would cover most of the risky branches. Even one happy-path test + one "no md5" + one "pose missing" would be a meaningful addition.
  • test_getitem_returns_expected_format has pytest.skip("no items in train split for this seed") — a seed-dependent soft skip. With 10 items and an 80% train rate the probability of skipping is ~2×10⁻⁷, so realistically fine, but a soft-skip is a test-smell (the test can green without actually asserting anything). Either bump to 30–50 fixture videos, or pick a seed/ratio that guarantees ≥1 item in TRAIN.
  • test_split_is_deterministic checks same-config-same-output, which is necessary but weak. A companion assertion that a different split_seed produces a different partition would lock in the contract that splits are seedable-not-fixed.
  • No test asserts that build_datasets("signtube", ...) actually resolves the registered class — registration could silently fail and only trip at train time. A one-line assert "signtube" in DATASET_REGISTRY test would catch it.
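The suggested registration test is essentially a one-line assert; sketched here against a toy registry, since the real DATASET_REGISTRY lives in datasets/common.py and its decorator API may differ:

```python
# Toy registry standing in for datasets/common.py (names are illustrative).
DATASET_REGISTRY = {}

def register_dataset(name):
    def decorator(cls):
        DATASET_REGISTRY[name] = cls
        return cls
    return decorator

@register_dataset("signtube")
class SignTubeSegmentationDataset:
    pass

def test_signtube_is_registered():
    # Catches a silently-failing registration before it trips at train time.
    assert "signtube" in DATASET_REGISTRY
    assert DATASET_REGISTRY["signtube"] is SignTubeSegmentationDataset

test_signtube_is_registered()
```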

Performance / security

  • _fetch_annotations does fetchall() into memory. Fine at SignTube's current scale; worth noting if the table grows beyond ~1M rows.
  • Pose.read(path.read_bytes(), pose_body=EmptyPoseBody) is the right optimization — we only need the header metadata, not the full body.
  • DB_PASS is read from env and never written back out; connection is scoped with with. No leak concerns.
  • No user-controlled paths beyond args.output (CLI) and annotations_path (caller). Fine.

Suggested actions before merge

  1. Drop gcsfs from [signtube] extras (or reinstate the GCS fallback in sync.py).
  2. Rebase on top of PR #31 (feat: publish HF ops + evaluation helpers) once it lands — the _ensure_datasets_registered → ensure_datasets_registered rename will conflict otherwise.
  3. Add an -- comment to captions.sql explaining WHERE "start" != 0.
  4. Extract ("Sgnw", "hns") into a named constant in sync.py with a comment on where the list comes from.
  5. Add at least a happy-path mocked test for _build_cache (md5 found, pose exists, valid metadata → cache entry) and a no-md5 skip test.
  6. Harden test_getitem_returns_expected_format so it can't soft-skip (more fixtures or a pinned seed+ratio).
  7. Aggregate per-skip prints into a summary rather than a per-video log line.
  8. Optional: per-video schema validation in dataset.py (warn-and-skip on missing keys) instead of KeyError.

Solid addition — the dataset class is clean, the base-class contract is respected, and the split/fallback logic is well-tested. The gcsfs orphan and the SQL comment are the two I'd not merge without fixing; everything else is polish.

ziv-lazarov-nagish added a commit that referenced this pull request Apr 23, 2026
…tests

Follow-up to PR #21 review feedback:

- drop `gcsfs` from [signtube] extras (unused since the NAS/md5 refactor)
- narrow `Pose.read` except to (struct.error, EOFError, ValueError) — corrupt/
  truncated/malformed pose files get skipped with a per-file log line, real
  bugs (AttributeError/ImportError/etc.) still surface
- replace per-skip prints + int counter with a Counter keyed by reason; print
  aggregated summary at end of sync (no_md5 / pose_missing / pose_unreadable
  / frames_lt_2)
- remove seed-dependent `pytest.skip` in test_getitem_returns_expected_format
  in favor of a hard assert with an invariant comment
- add TestBuildCache with happy-path and no-md5 skip tests
Comment thread sign_language_segmentation/datasets/signtube/sync.py Outdated
Comment thread sign_language_segmentation/datasets/signtube/dataset.py
…valuate

- Move get_dataloader from train.py to datasets/common.py with explicit
  args parameter (no global state dependency)
- evaluate.py uses shared get_dataloader instead of manual build_datasets + DataLoader
- Fix duplicate manifest in train.py (train + val collected same data)
- Add split_bucket() and assign_split() to common.py
- Add _init_split_tracking(), _track_and_filter(), get_split_manifest()
  to BaseSegmentationDataset — removes duplication across all datasets
- DGS: replace hardcoded splits.json with hash-based splitting
- Platform: use base class helpers, remove local _split_bucket
- Remove duplicate manifest collection in train.py (base class now
  tracks all split IDs regardless of which split was loaded)
- SignTubeSegmentationDataset: reads annotations_cache.json, hash-based
  splits via BaseSegmentationDataset, registers as "signtube"
- sync.py: queries SignTube PostgreSQL DB, downloads poses from GCS,
  writes local cache
- captions.sql: query for sign/sentence annotations across SignTube sources
- --signtube_annotations_path CLI arg in args.py and evaluate.py
- [signtube] optional dep group (psycopg, gcsfs, python-dotenv, tqdm)
ziv-lazarov-nagish and others added 5 commits April 26, 2026 04:24
- Drop GCS download path and gcsfs import; sync is NAS-only now
- New _build_signtube_md5_lookup reads /mnt/nas/transformations/videos/video_list.csv,
  mapping signtube video_id -> md5Hash
- pose_path in cache points straight at /mnt/nas/GCS/sign-mediapipe-holistic-poses/{md5}.pose
  (no copy, no symlink)
- Cache keys remain the DB videoId

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@ziv-lazarov-nagish ziv-lazarov-nagish merged commit cf9ece8 into nagish Apr 26, 2026
2 checks passed