refactor: restructure data layer into datasets/ and utils/ #15

Merged: ziv-lazarov-nagish merged 5 commits into main from refactor/data-layer on Apr 14, 2026

Conversation

@ziv-lazarov-nagish
Contributor

Summary

  • Move monolithic data/dataset.py into a modular datasets/ package with a shared base class (BaseSegmentationDataset), enums (Split, DatasetType), and common utilities (collate_fn, load_and_augment)
  • Extract DGS corpus dataset into datasets/dgs/
  • Extract BIO label creation into utils/bio.py and pose preprocessing into utils/pose.py
  • Update all import paths across model, metrics, bin, evaluate, and train
  • Pin numpy<2, tensorflow<2.19, tensorflow-datasets==4.9.2 for compatibility with sign_language_datasets
  • Bump requires-python to >=3.11 (needed for StrEnum), CI workflows to Python 3.11
  • Use setuptools auto-discovery instead of hardcoded package list
  • Add test_common.py (23 tests) and test_utils.py (11 tests)

Pure structural refactor — no behavioral changes to training or inference.

Test plan

  • uv run --python 3.11 --extra dev ruff check . passes
  • uv run --python 3.11 --extra dev pytest — 37 tests pass
  • Training runs successfully with DGS dataset (verified 1 epoch)

🤖 Generated with Claude Code
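A shared collate function for pose data typically pads variable-length sequences to a common length before stacking. A minimal NumPy sketch, assuming each batch item is a (pose, labels) pair; the shapes and the -100 padding value are illustrative, not the repo's actual collate_fn.

```python
import numpy as np

def collate_fn(batch):
    """Pad variable-length (pose, labels) pairs to the batch max length.

    Assumed item shapes: pose is [T, D] float, labels is [T] int.
    -100 is a common ignore-index for padded label positions.
    """
    poses, labels = zip(*batch)
    max_len = max(p.shape[0] for p in poses)
    dim = poses[0].shape[1]
    padded_poses = np.zeros((len(batch), max_len, dim), dtype=np.float32)
    padded_labels = np.full((len(batch), max_len), -100, dtype=np.int64)
    for i, (pose, lab) in enumerate(zip(poses, labels)):
        padded_poses[i, : pose.shape[0]] = pose    # copy real frames
        padded_labels[i, : lab.shape[0]] = lab     # rest stays padded
    return padded_poses, padded_labels
```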

Move monolithic data/dataset.py into a modular structure:
- datasets/common.py: shared base class, enums, collate_fn, load_and_augment
- datasets/dgs/: DGS corpus dataset (extracted from data/dataset.py)
- utils/bio.py: BIO label creation (extracted from data/utils.py)
- utils/pose.py: pose preprocessing and velocity (renamed from data/utils.py)

Update all import paths across model, metrics, bin, evaluate, train.
Pin numpy<2, tensorflow<2.19, tensorflow-datasets==4.9.2 for compat.
Bump requires-python to >=3.11 (StrEnum), CI workflows to Python 3.11.
Use setuptools auto-discovery instead of hardcoded package list.
Add test_common.py and test_utils.py.
Only DGS exists in this branch — other dataset types will be
added when their implementations land.
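For readers unfamiliar with BIO tagging, here is a minimal sketch of what the label creation in utils/bio.py might look like. The function name, tag encoding (O=0, B=1, I=2), and (start, end) segment format are assumptions for illustration, not the actual implementation.

```python
import numpy as np

# Tag encoding assumed for illustration:
# O = outside any segment, B = first frame of a segment, I = inside.
O, B, I = 0, 1, 2

def create_bio_labels(segments, num_frames):
    """Turn (start, end) frame spans into per-frame BIO tags."""
    labels = np.full(num_frames, O, dtype=np.int64)
    for start, end in segments:       # end is exclusive
        labels[start] = B             # mark the segment boundary
        labels[start + 1:end] = I     # remaining frames of the segment
    return labels
```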
Comment thread on pyproject.toml (outdated):

    "tensorflow<2.19",
    "tensorflow-datasets==4.9.2",
    "lxml",
    "webvtt-py",
Contributor:
Unclear why we have these dependencies here. Let's maybe split these into optional dependency groups, e.g. one called [dgs] for the DGS-related ones, etc.

Contributor (Author):
They're required by the sign_language_datasets module. Moved them to a [dgs] optional dependency group.

sign_language_datasets, tensorflow, tensorflow-datasets, lxml, and
webvtt-py are only needed for DGS corpus training, not inference.
Move them to a [dgs] optional dependency group so the base install
stays lightweight.
Users can now install all training deps with a single --extra train
instead of specifying --extra dgs separately.
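The resulting pyproject.toml groups might look roughly like this. The [dgs] contents are taken from the diff context above; the aggregate train group and the project name "segmentation-model" are assumptions, since neither appears verbatim in this thread.

```toml
[project.optional-dependencies]
dgs = [
    "sign_language_datasets",
    "tensorflow<2.19",
    "tensorflow-datasets==4.9.2",
    "lxml",
    "webvtt-py",
]
# Hypothetical aggregate group: a self-referential extra pulls in [dgs]
# so users can install everything for training with --extra train.
train = [
    "segmentation-model[dgs]",
]
```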
Comment thread on pyproject.toml (outdated):

    "tensorflow<2.19",
    "tensorflow-datasets==4.9.2",
    "lxml",
    "webvtt-py",
Contributor:
Why this one? Who is using it?

Contributor (Author):
It's all for sign_language_datasets; some of these aren't declared as dependencies in the upstream repository.

@ziv-lazarov-nagish merged commit da281da into main on Apr 14, 2026. 2 checks passed.