refactor: restructure data layer into datasets/ and utils/ #15

Merged: ziv-lazarov-nagish merged 5 commits into main from refactor/data-layer on Apr 14, 2026

Conversation

@ziv-lazarov-nagish
Contributor

Summary

  • Move monolithic data/dataset.py into a modular datasets/ package with a shared base class (BaseSegmentationDataset), enums (Split, DatasetType), and common utilities (collate_fn, load_and_augment)
  • Extract DGS corpus dataset into datasets/dgs/
  • Extract BIO label creation into utils/bio.py and pose preprocessing into utils/pose.py
  • Update all import paths across model, metrics, bin, evaluate, and train
  • Pin numpy<2, tensorflow<2.19, tensorflow-datasets==4.9.2 for compatibility with sign_language_datasets
  • Bump requires-python to >=3.11 (needed for StrEnum), CI workflows to Python 3.11
  • Use setuptools auto-discovery instead of hardcoded package list
  • Add test_common.py (23 tests) and test_utils.py (11 tests)

Pure structural refactor — no behavioral changes to training or inference.

Test plan

  • uv run --python 3.11 --extra dev ruff check . passes
  • uv run --python 3.11 --extra dev pytest — 37 tests pass
  • Training runs successfully with DGS dataset (verified 1 epoch)

🤖 Generated with Claude Code
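A shared collate function for pose data typically pads variable-length sequences to a common length before stacking. A minimal NumPy sketch, assuming each batch item is a (pose, labels) pair; the shapes and the -100 padding value are illustrative, not the repo's actual collate_fn.

```python
import numpy as np

def collate_fn(batch):
    """Pad variable-length (pose, labels) pairs to the batch max length.

    Assumed item shapes: pose is [T, D] float, labels is [T] int.
    -100 is a common ignore-index for padded label positions.
    """
    poses, labels = zip(*batch)
    max_len = max(p.shape[0] for p in poses)
    dim = poses[0].shape[1]
    padded_poses = np.zeros((len(batch), max_len, dim), dtype=np.float32)
    padded_labels = np.full((len(batch), max_len), -100, dtype=np.int64)
    for i, (pose, lab) in enumerate(zip(poses, labels)):
        padded_poses[i, : pose.shape[0]] = pose    # copy real frames
        padded_labels[i, : lab.shape[0]] = lab     # rest stays padded
    return padded_poses, padded_labels
```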

Move monolithic data/dataset.py into a modular structure:
- datasets/common.py: shared base class, enums, collate_fn, load_and_augment
- datasets/dgs/: DGS corpus dataset (extracted from data/dataset.py)
- utils/bio.py: BIO label creation (extracted from data/utils.py)
- utils/pose.py: pose preprocessing and velocity (renamed from data/utils.py)

Update all import paths across model, metrics, bin, evaluate, train.
Pin numpy<2, tensorflow<2.19, tensorflow-datasets==4.9.2 for compat.
Bump requires-python to >=3.11 (StrEnum), CI workflows to Python 3.11.
Use setuptools auto-discovery instead of hardcoded package list.
Add test_common.py and test_utils.py.
Only DGS exists in this branch — other dataset types will be
added when their implementations land.
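For readers unfamiliar with BIO tagging, here is a minimal sketch of what the label creation in utils/bio.py might look like. The function name, tag encoding (O=0, B=1, I=2), and (start, end) segment format are assumptions for illustration, not the actual implementation.

```python
import numpy as np

# Tag encoding assumed for illustration:
# O = outside any segment, B = first frame of a segment, I = inside.
O, B, I = 0, 1, 2

def create_bio_labels(segments, num_frames):
    """Turn (start, end) frame spans into per-frame BIO tags."""
    labels = np.full(num_frames, O, dtype=np.int64)
    for start, end in segments:       # end is exclusive
        labels[start] = B             # mark the segment boundary
        labels[start + 1:end] = I     # remaining frames of the segment
    return labels
```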
Comment thread on pyproject.toml (outdated):

    "tensorflow<2.19",
    "tensorflow-datasets==4.9.2",
    "lxml",
    "webvtt-py",
Contributor:
Unclear why we have these dependencies here. Let's maybe split these into optional dependency groups, e.g. one called [dgs] for the DGS-related ones, etc.

Contributor (Author):
They're required by the sign_language_datasets module. Moved them to a [dgs] optional dependency group.

sign_language_datasets, tensorflow, tensorflow-datasets, lxml, and
webvtt-py are only needed for DGS corpus training, not inference.
Move them to a [dgs] optional dependency group so the base install
stays lightweight.
Users can now install all training deps with a single --extra train
instead of specifying --extra dgs separately.
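The resulting pyproject.toml groups might look roughly like this. The [dgs] contents are taken from the diff context above; the aggregate train group and the project name "segmentation-model" are assumptions, since neither appears verbatim in this thread.

```toml
[project.optional-dependencies]
dgs = [
    "sign_language_datasets",
    "tensorflow<2.19",
    "tensorflow-datasets==4.9.2",
    "lxml",
    "webvtt-py",
]
# Hypothetical aggregate group: a self-referential extra pulls in [dgs]
# so users can install everything for training with --extra train.
train = [
    "segmentation-model[dgs]",
]
```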
Comment thread on pyproject.toml (outdated):

    "tensorflow<2.19",
    "tensorflow-datasets==4.9.2",
    "lxml",
    "webvtt-py",
Contributor:
Why this one? Who is using it?

Contributor (Author):
It's all for sign_language_datasets; some of these aren't declared as dependencies in the upstream repository.

@ziv-lazarov-nagish merged commit da281da into main on Apr 14, 2026. 2 checks passed.