Conversation

@quentinblampey
Contributor

Closes #134

I updated the dtype behavior for dask to fix #134.
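
For context, here is a minimal sketch of the kind of dtype handling this touches - illustrative only, not the PR's actual code (`da.reduction` is real Dask API, but the example array and reduction are made up):

```python
# Illustrative sketch, not the PR's actual code.
import dask.array as da
import numpy as np

x = da.random.randint(0, 10, size=(100, 100), chunks=(10, 100)).astype(np.int32)

# A custom reduction can state the output dtype explicitly,
# so min/max keeps the input dtype instead of Dask's default guess:
result = da.reduction(
    x,
    chunk=lambda block, axis, keepdims: block.min(axis=axis, keepdims=keepdims),
    aggregate=lambda block, axis, keepdims: block.min(axis=axis, keepdims=keepdims),
    axis=0,
    dtype=x.dtype,  # keep int32
)
assert result.compute().dtype == np.int32
```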

I also added support for DiskArray in mean_var - I think we just needed to always use np.power instead of the ** notation. Unless you had a specific reason to use **, @flying-sheep?
I think it's very inefficient, though, since it loads the result of the power operation directly into memory (at least, that's my understanding, but I may be wrong). Ideally it would only be in memory after the mean reduction, but maybe there is no other way to do that - I'm not familiar enough with h5py Datasets.
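
To spell out the concern, a rough sketch under my assumptions (the file name "data.h5" and dataset name "X" are made up, and this is not the actual mean_var code):

```python
# Rough sketch of the memory concern; "data.h5" and "X" are made-up names.
import h5py
import numpy as np

with h5py.File("data.h5", "r") as f:
    dset = f["X"]  # h5py.Dataset - the data stays on disk

    # np.power(dset, 2) coerces the Dataset into a full in-memory ndarray,
    # so the squared copy exists in RAM before any reduction runs:
    #   sq_mean = np.power(dset, 2).mean(axis=0)

    # Reducing block by block would keep only one slab in memory at a time:
    n = dset.shape[0]
    acc = np.zeros(dset.shape[1], dtype=np.float64)
    for start in range(0, n, 1024):
        block = dset[start : start + 1024]  # reads one slab as an ndarray
        acc += np.power(block, 2).sum(axis=0)
    sq_mean = acc / n
```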

I wanted to add some tests, but I don't understand all the details of the test setup. Are there any instructions or a CONTRIBUTING.md file I could use to run and update the tests?

@codecov

codecov bot commented Oct 29, 2025

Codecov Report

❌ Patch coverage is 61.53846% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 47.13%. Comparing base (6af5d9a) to head (a963725).

Files with missing lines                      Patch %   Lines
src/fast_array_utils/stats/_generic_ops.py    25.00%    3 Missing ⚠️
src/fast_array_utils/stats/_utils.py          60.00%    2 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (6af5d9a) and HEAD (a963725).

HEAD has 5 uploads less than BASE
Flag   BASE (6af5d9a)   HEAD (a963725)
       7                2
Additional details and impacted files
@@             Coverage Diff             @@
##             main     #135       +/-   ##
===========================================
- Coverage   99.13%   47.13%   -52.00%     
===========================================
  Files          19       15        -4     
  Lines         464      367       -97     
===========================================
- Hits          460      173      -287     
- Misses          4      194      +190     


@codspeed-hq

codspeed-hq bot commented Oct 29, 2025

CodSpeed Performance Report

Merging #135 will degrade performance by 22.41%

Comparing quentinblampey:mean_var_h5 (a963725) with main (6af5d9a)

Summary

❌ 8 regressions
✅ 224 untouched

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark                                                        BASE     HEAD   Change
test_stats_benchmark[numpy.ndarray-1d-all-int32-max]             1.6 ms   2 ms   -22.31%
test_stats_benchmark[numpy.ndarray-1d-all-int32-min]             1.6 ms   2 ms   -22.31%
test_stats_benchmark[scipy.sparse.csc_array-1d-all-int32-max]    1.6 ms   2 ms   -22.39%
test_stats_benchmark[scipy.sparse.csc_array-1d-all-int32-min]    1.6 ms   2 ms   -22.39%
test_stats_benchmark[scipy.sparse.csc_array-2d-all-int32-max]    1.6 ms   2 ms   -22.41%
test_stats_benchmark[scipy.sparse.csc_array-2d-all-int32-min]    1.6 ms   2 ms   -22.41%
test_stats_benchmark[scipy.sparse.csr_array-2d-all-int32-max]    1.6 ms   2 ms   -22.39%
test_stats_benchmark[scipy.sparse.csr_array-2d-all-int32-min]    1.6 ms   2 ms   -22.39%

@flying-sheep changed the title from "Support DiskArray on mean_var and support Dask in min/max" to "Support Dask in min/max" on Oct 30, 2025
@flying-sheep
Member

flying-sheep commented Oct 30, 2025

OK, as the failing benchmarks show, the ** had a reason.
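
My understanding of that reason (which may not be the whole story): NumPy special-cases small scalar exponents in ndarray.__pow__, so x ** 2 dispatches to the fast np.square loop, while np.power(x, 2) goes through the generic power ufunc. A rough way to see the difference (timings are machine-dependent):

```python
# Rough micro-benchmark of the two spellings; numbers vary by machine.
import timeit

import numpy as np

x = np.random.randint(0, 100, size=1_000_000, dtype=np.int32)

t_pow = timeit.timeit(lambda: x ** 2, number=200)           # __pow__ fast path
t_func = timeit.timeit(lambda: np.power(x, 2), number=200)  # generic power ufunc
print(f"x ** 2:         {t_pow:.3f}s")
print(f"np.power(x, 2): {t_func:.3f}s")
```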

Also you’re right: just creating a copy of the full-size Dataset isn’t a good idea; we’d have to do something smarter.

Therefore let’s just keep this PR single-purpose for min/max Dask support and do the disk stuff in a separate PR. I reverted the disk stuff here and fixed mypy.


@flying-sheep added the run-gpu-ci label ("Apply this label to run GPU CI once") on Oct 30, 2025
@flying-sheep
Member

flying-sheep commented Oct 30, 2025

OK, it looks like there are more issues with min/max - I’m sorry I didn’t catch that before making the release.

@quentinblampey
Contributor Author

quentinblampey commented Oct 30, 2025

> OK, as the failing benchmarks show, the ** had a reason.
> Also you’re right: just creating a copy of the full-size Dataset isn’t a good idea, we’d have to do something smarter.

Yes, it seems so...
No problem, we can handle that in a future PR! I'll think more about it later.
