refactor: restructure data layer into datasets/ and utils/ #15
Merged
ziv-lazarov-nagish merged 5 commits into main, Apr 14, 2026
Move monolithic data/dataset.py into a modular structure:
- datasets/common.py: shared base class, enums, collate_fn, load_and_augment
- datasets/dgs/: DGS corpus dataset (extracted from data/dataset.py)
- utils/bio.py: BIO label creation (extracted from data/utils.py)
- utils/pose.py: pose preprocessing and velocity (renamed from data/utils.py)

Update all import paths across model, metrics, bin, evaluate, train.
Pin numpy<2, tensorflow<2.19, tensorflow-datasets==4.9.2 for compat.
Bump requires-python to >=3.11 (StrEnum), CI workflows to Python 3.11.
Use setuptools auto-discovery instead of hardcoded package list.
Add test_common.py and test_utils.py.
Only DGS exists in this branch — other dataset types will be added when their implementations land.
AmitMY
requested changes
Apr 13, 2026
"tensorflow<2.19",
"tensorflow-datasets==4.9.2",
"lxml",
"webvtt-py",
Contributor
unclear why we have these dependencies here? let's maybe split these into optional dependency groups. one called [dgs] for the dgs related ones, etc
Contributor
Author
they're relevant for the sign_language_datasets module. moved them to a dgs optional dependency
sign_language_datasets, tensorflow, tensorflow-datasets, lxml, and webvtt-py are only needed for DGS corpus training, not inference. Move them to a [dgs] optional dependency group so the base install stays lightweight.
Users can now install all training deps with a single --extra train instead of specifying --extra dgs separately.
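One consequence of moving these packages into an optional group is that code paths needing them should fail with a clear message when the extra is not installed. The sketch below shows one way that could look; require_optional is a hypothetical helper and is not part of this PR, and the exact install command is deliberately left out because it depends on the package name.

```python
import importlib.util


def require_optional(module_name: str, extra: str) -> None:
    """Raise a descriptive ImportError if an optional dependency is missing.

    Hypothetical helper: `module_name` is the top-level import name and
    `extra` is the optional dependency group that provides it.
    """
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is required for this feature but is not installed. "
            f"Install the package with the '{extra}' extra."
        )


if __name__ == "__main__":
    # A stdlib module is always present, so this call passes silently.
    require_optional("json", "dgs")
    try:
        require_optional("sign_language_datasets", "dgs")
        print("DGS dependencies are available")
    except ImportError as err:
        print(err)
```

Calling such a check at the top of the DGS dataset module would keep the base install importable for inference while pointing training users at the right extra.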
AmitMY
reviewed
Apr 13, 2026
"tensorflow<2.19",
"tensorflow-datasets==4.9.2",
"lxml",
"webvtt-py",
Contributor
why this one? who is using it?
Contributor
Author
it's all for the sign_language_datasets. some of them aren't defined in the upstream repository.
AmitMY
approved these changes
Apr 13, 2026
Summary
- Restructure data/dataset.py into a modular datasets/ package with a shared base class (BaseSegmentationDataset), enums (Split, DatasetType), and common utilities (collate_fn, load_and_augment)
- Extract the DGS corpus dataset into datasets/dgs/
- Extract BIO label creation into utils/bio.py and pose preprocessing into utils/pose.py
- Pin numpy<2, tensorflow<2.19, tensorflow-datasets==4.9.2 for compatibility with sign_language_datasets
- Bump requires-python to >=3.11 (needed for StrEnum), CI workflows to Python 3.11
- Add test_common.py (23 tests) and test_utils.py (11 tests)

Pure structural refactor: no behavioral changes to training or inference.
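For readers unfamiliar with BIO label creation, the general idea can be sketched as follows: every frame gets O (outside any segment), B (first frame of a segment), or I (inside a segment). This is a generic illustration of the scheme, not the code in utils/bio.py; the function name build_bio, the half-open (start, end) frame convention, and the string labels are all assumptions.

```python
def build_bio(num_frames: int, segments: list[tuple[int, int]]) -> list[str]:
    """Hypothetical sketch: per-frame BIO labels from (start, end) segments.

    Segments are assumed to be half-open frame ranges [start, end).
    """
    labels = ["O"] * num_frames
    for start, end in segments:
        # Clamp to the valid frame range and skip empty segments.
        start, end = max(start, 0), min(end, num_frames)
        if start >= end:
            continue
        labels[start] = "B"  # first frame of the segment
        for i in range(start + 1, end):
            labels[i] = "I"  # remaining frames inside the segment
    return labels


if __name__ == "__main__":
    print(build_bio(8, [(1, 4), (5, 7)]))
```

A real implementation would typically emit integer class ids in an array rather than strings, but the labeling logic would follow the same pattern.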
Test plan
- uv run --python 3.11 --extra dev ruff check . passes
- uv run --python 3.11 --extra dev pytest: 37 tests pass

🤖 Generated with Claude Code