Skip to content

Feat: Supply size string for n_obs_per_dataset#159

Merged
felix0097 merged 37 commits intomainfrom
ff/automatic-size-estimation
Mar 18, 2026
Merged

Feat: Supply size string for n_obs_per_dataset#159
felix0097 merged 37 commits intomainfrom
ff/automatic-size-estimation

Conversation

@felix0097
Copy link
Copy Markdown
Collaborator

No description provided.

@felix0097 felix0097 requested a review from ilan-gold March 10, 2026 13:47
@felix0097 felix0097 self-assigned this Mar 10, 2026
@felix0097 felix0097 added enhancement New feature or request skip-gpu-ci Whether gpu ci should be skipped labels Mar 10, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 10, 2026

Codecov Report

❌ Patch coverage is 81.81818% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.82%. Comparing base (0c6f025) to head (a365acf).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/annbatch/io.py 81.81% 12 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #159      +/-   ##
==========================================
- Coverage   93.56%   90.82%   -2.75%     
==========================================
  Files          11       11              
  Lines         886      937      +51     
==========================================
+ Hits          829      851      +22     
- Misses         57       86      +29     
Files with missing lines Coverage Δ
src/annbatch/io.py 90.06% <81.81%> (-2.12%) ⬇️

... and 3 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ilan-gold
Copy link
Copy Markdown
Collaborator

@felix0097 please fix the diff - i thnk you may need to merge main or rebase.

Copy link
Copy Markdown
Collaborator

@ilan-gold ilan-gold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Comment thread src/annbatch/io.py Outdated
Comment thread src/annbatch/io.py
Comment thread src/annbatch/io.py Outdated
Comment thread src/annbatch/io.py
@felix0097 felix0097 requested a review from ilan-gold March 13, 2026 13:35
Comment thread src/annbatch/io.py Outdated
Comment thread src/annbatch/io.py
Comment thread src/annbatch/io.py
Comment thread src/annbatch/io.py Outdated
Comment thread src/annbatch/io.py Outdated
felix0097 and others added 3 commits March 16, 2026 16:40
Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>
Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>
@review-notebook-app
Copy link
Copy Markdown

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@felix0097 felix0097 requested a review from ilan-gold March 16, 2026 16:05
Comment thread src/annbatch/io.py Outdated
Comment thread src/annbatch/io.py Outdated
Comment thread tests/test_preshuffle.py
Comment thread tests/test_preshuffle.py Outdated
@felix0097 felix0097 requested a review from ilan-gold March 17, 2026 10:49
Comment thread CHANGELOG.md Outdated
Comment on lines +20 to +22
- The ``sparse_chunk_size``, ``sparse_shard_size``, ``dense_chunk_size``, and ``dense_shard_size`` parameters of {func}`annbatch.write_sharded` have been replaced by ``n_obs_per_chunk`` (number of observations per chunk, automatically converted to element counts for sparse arrays) and ``shard_size`` (number of observations per shard or a size string). The corresponding parameters in {meth}`annbatch.DatasetCollection.add_adatas` are ``n_obs_per_chunk`` and ``shard_size``.
- `shard_size` in {meth}`annbatch.DatasetCollection.add_adatas` and `shard_size` in {func}`annbatch.write_sharded` now accept a human-readable size string (e.g. ``'1GB'``, ``'512MB'``) in addition to an integer number of observations. When a string is provided, the observation count is derived independently for each array element from its uncompressed bytes-per-row so that every shard stays close to the target size.
- ``dataset_size`` in {meth}`annbatch.DatasetCollection.add_adatas` now accepts a human-readable size string (e.g. ``'20GB'``, ``'512MB'``) in addition to an integer number of observations. When a string is provided, the per-row byte size is estimated from the on-disk metadata of the input datasets during validation and used to derive the observation count. The default has changed from ``2_097_152`` to ``'20GB'``.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe split these between "Breaking" header and "Features" header

@felix0097 felix0097 merged commit fac9795 into main Mar 18, 2026
12 checks passed
@felix0097 felix0097 deleted the ff/automatic-size-estimation branch March 18, 2026 10:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request skip-gpu-ci Whether gpu ci should be skipped

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants