Skip to content

refactor: consolidate split logic in base class#20

Merged
ziv-lazarov-nagish merged 1 commit intonagishfrom
refactor/hash-based-splits
Apr 16, 2026
Merged

refactor: consolidate split logic in base class#20
ziv-lazarov-nagish merged 1 commit intonagishfrom
refactor/hash-based-splits

Conversation

@ziv-lazarov-nagish
Copy link
Copy Markdown
Contributor

@ziv-lazarov-nagish ziv-lazarov-nagish commented Apr 15, 2026

Summary

  • Add split_bucket() and assign_split() to common.py — deterministic hash-based split assignment
  • Add _init_split_tracking(), _track_and_filter(), get_split_manifest() to BaseSegmentationDataset — removes duplication across all dataset classes
  • DGS: keep fixed splits.json (research-standard split), migrate to base class tracking infra and pathlib
  • Platform: use assign_split() + base class helpers, remove local _split_bucket
  • Remove unused collate_fn/DataLoader imports and duplicate manifest collection in train.py

Depends on #19.

Changed files

  • datasets/common.pysplit_bucket, assign_split, base class split tracking
  • datasets/dgs/dataset.py — use base class tracking while keeping fixed splits.json, migrate to pathlib
  • datasets/annotation_platform/dataset.py — use assign_split + base class helpers
  • train.py — remove unused imports and duplicate manifest
  • tests/ — update imports and dummy dataset

Test plan

  • ruff check . passes
  • pytest passes (61 tests)
  • 1-epoch DGS training smoke test — splits match original (545 train, 9 dev, 14 test)
  • 1-epoch dgs,platform training smoke test — both datasets load correctly

@ziv-lazarov-nagish ziv-lazarov-nagish mentioned this pull request Apr 15, 2026
4 tasks
@ziv-lazarov-nagish ziv-lazarov-nagish marked this pull request as ready for review April 16, 2026 09:40
Copy link
Copy Markdown
Contributor

@AmitMY AmitMY left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the split we use for dgs is a fixed one, used in research (not only our research)
using the split.json is preferred here - because it allows us to compare to our previous work, and to other's work as it comes. as i understand here we would create a new split

@ziv-lazarov-nagish
Copy link
Copy Markdown
Contributor Author

ziv-lazarov-nagish commented Apr 16, 2026

the split we use for dgs is a fixed one, used in research (not only our research) using the split.json is preferred here - because it allows us to compare to our previous work, and to other's work as it comes. as i understand here we would create a new split

correct. i was thinking on a single way to split it across all SegmentationDataset classes for reproducibility. i'll make an exception for the DGS one

@ziv-lazarov-nagish ziv-lazarov-nagish force-pushed the refactor/hash-based-splits branch from 7f15fe5 to c96a608 Compare April 16, 2026 12:37
- Add split_bucket() and assign_split() to common.py for hash-based splits
- Add _init_split_tracking(), _track_and_filter(), get_split_manifest()
  to BaseSegmentationDataset — removes duplication across all datasets
- DGS: keep fixed splits.json (research-standard split), migrate to
  base class tracking infra and pathlib
- Platform: use assign_split() + base class helpers
- Remove unused collate_fn/DataLoader imports and duplicate manifest
  collection in train.py
@ziv-lazarov-nagish ziv-lazarov-nagish force-pushed the refactor/hash-based-splits branch from c96a608 to e83c8f0 Compare April 16, 2026 13:09
@ziv-lazarov-nagish ziv-lazarov-nagish changed the title refactor: hash-based splits and consolidate split logic in base class refactor: consolidate split logic in base class Apr 16, 2026
@ziv-lazarov-nagish ziv-lazarov-nagish merged commit 85d75ca into nagish Apr 16, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants