
Conversation

@nbren12
Contributor

@nbren12 nbren12 commented Oct 30, 2025

Writing to sharded arrays was up to 10x slower for largish chunk sizes because the _ShardBuilder object makes many calls to np.concatenate. This commit coalesces these into a single concatenate call, improving write performance by a factor of 10 on the benchmarking script in #3560.

Added a new core.Buffer.combine API
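
For intuition, a minimal NumPy sketch (not the PR's actual code) of the pathology: growing a buffer with repeated np.concatenate calls re-copies every byte accumulated so far, so assembling a shard from n chunks copies O(n^2) bytes in total, whereas collecting the chunks and concatenating once copies each byte exactly once. The single combine call plays the role of build_shard_fast below.

import numpy as np

# Quadratic: each iteration re-copies all bytes accumulated so far.
def build_shard_slow(chunks: list[np.ndarray]) -> np.ndarray:
    shard = np.empty(0, dtype=np.uint8)
    for chunk in chunks:
        shard = np.concatenate([shard, chunk])
    return shard

# Linear: one concatenate at the end copies each byte exactly once.
def build_shard_fast(chunks: list[np.ndarray]) -> np.ndarray:
    return np.concatenate(chunks)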

Resolves #3560


TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions github-actions bot added the needs release notes label (automatically applied to PRs which haven't added release notes) Oct 30, 2025
@nbren12
Contributor Author

nbren12 commented Oct 30, 2025

Benchmarking results
Before:

================================================================================
Zarr Sharding Benchmark
================================================================================
Zarr version: 3.1.4.dev34+gb3e9aed30
Array shape: (1000, 100000)
Chunks: (1, 100000)
Shard chunks: (100, 1000000)
Data type: <class 'numpy.float32'>
Total data size: 381.47 MB
Number of iterations: 2
================================================================================

Generating test data...
Test data generated: 381.47 MB

================================================================================
BENCHMARK: Without Sharding
================================================================================

Write (no sharding):
  Mean time: 0.5280 ± 0.0161 seconds
  Throughput: 722.44 MB/s
  Times: ['0.5442', '0.5119']

Read (no sharding, full):
  Mean time: 0.3106 ± 0.0060 seconds
  Throughput: 1228.23 MB/s
  Times: ['0.3165', '0.3046']

Read (no sharding, partial):
  Mean time: 0.0267 ± 0.0001 seconds
  Throughput: 142.88 MB/s
  Times: ['0.0266', '0.0268']

================================================================================
BENCHMARK: With Sharding
================================================================================

Write (with sharding):
  Mean time: 4.2136 ± 0.1268 seconds
  Throughput: 90.53 MB/s
  Times: ['4.3404', '4.0867']

Read (with sharding, full):
  Mean time: 0.3533 ± 0.0036 seconds
  Throughput: 1079.64 MB/s
  Times: ['0.3569', '0.3498']

Read (with sharding, partial):
  Mean time: 0.0317 ± 0.0003 seconds
  Throughput: 120.33 MB/s
  Times: ['0.0320', '0.0314']

================================================================================
COMPARISON
================================================================================
Write: 0.13x speedup with sharding (or 7.98x slower)
Read: 0.88x speedup with sharding (or 1.14x slower)

================================================================================
Benchmark complete!
================================================================================

After:

================================================================================
Zarr Sharding Benchmark
================================================================================
Zarr version: 3.1.4.dev34+gb3e9aed30
Array shape: (1000, 100000)
Chunks: (1, 100000)
Shard chunks: (100, 1000000)
Data type: <class 'numpy.float32'>
Total data size: 381.47 MB
Number of iterations: 2
================================================================================

Generating test data...
Test data generated: 381.47 MB

================================================================================
BENCHMARK: Without Sharding
================================================================================

Write (no sharding):
  Mean time: 0.5144 ± 0.0080 seconds
  Throughput: 741.65 MB/s
  Times: ['0.5223', '0.5064']

Read (no sharding, full):
  Mean time: 0.3128 ± 0.0018 seconds
  Throughput: 1219.45 MB/s
  Times: ['0.3147', '0.3110']

Read (no sharding, partial):
  Mean time: 0.0287 ± 0.0002 seconds
  Throughput: 133.12 MB/s
  Times: ['0.0288', '0.0285']

================================================================================
BENCHMARK: With Sharding
================================================================================

Write (with sharding):
  Mean time: 0.6107 ± 0.0017 seconds
  Throughput: 624.61 MB/s
  Times: ['0.6125', '0.6090']

Read (with sharding, full):
  Mean time: 0.2908 ± 0.0030 seconds
  Throughput: 1311.91 MB/s
  Times: ['0.2938', '0.2877']

Read (with sharding, partial):
  Mean time: 0.0238 ± 0.0006 seconds
  Throughput: 160.59 MB/s
  Times: ['0.0244', '0.0231']

================================================================================
COMPARISON
================================================================================
Write: 0.84x speedup with sharding (or 1.19x slower)
Read: 1.08x speedup with sharding (or 0.93x slower)

================================================================================
Benchmark complete!
================================================================================

Signed-off-by: Noah D. Brenowitz <nbren12@gmail.com>
@d-v-b
Contributor

d-v-b commented Oct 30, 2025

thanks so much for tackling this problem!

@nbren12
Contributor Author

nbren12 commented Oct 30, 2025

hmmm, lots of failing tests. I'm having some trouble grokking the implementation of the sharding codec.

@d-v-b
Contributor

d-v-b commented Oct 30, 2025

I'll have a look. I don't know this section of the codebase very well, but maybe I can figure something out. And BTW if you see any opportunities to refactor / simplify the logic here, feel free to move in that direction. I think making this code easier to understand would be a big win.

@d-v-b
Contributor

d-v-b commented Oct 30, 2025

diff --git a/src/zarr/codecs/sharding.py b/src/zarr/codecs/sharding.py
index 0e36f50f..31d68027 100644
--- a/src/zarr/codecs/sharding.py
+++ b/src/zarr/codecs/sharding.py
@@ -227,6 +227,7 @@ class _ShardBuilder(_ShardReader, ShardMutableMapping):
     buffers: list[Buffer]
     index: _ShardIndex
     buf: Buffer
+    _chunk_buffers: dict[tuple[int, ...], Buffer]  # Map chunk_coords to buffers
 
     @classmethod
     def merge_with_morton_order(
@@ -255,13 +256,24 @@ class _ShardBuilder(_ShardReader, ShardMutableMapping):
         obj = cls()
         obj.buf = buffer_prototype.buffer.create_zero_length()
         obj.buffers = []
+        obj._chunk_buffers = {}
         obj.index = _ShardIndex.create_empty(chunks_per_shard)
         return obj
 
+    def __getitem__(self, chunk_coords: tuple[int, ...]) -> Buffer:
+        # Override to use _chunk_buffers instead of self.buf
+        if chunk_coords in self._chunk_buffers:
+            return self._chunk_buffers[chunk_coords]
+        raise KeyError
+
     def __setitem__(self, chunk_coords: tuple[int, ...], value: Buffer) -> None:
         chunk_start = sum(len(buf) for buf in self.buffers)
         chunk_length = len(value)
-        self.buffers.append(value)
+        # Store the buffer for later retrieval
+        self._chunk_buffers[chunk_coords] = value
+        # Only append non-empty buffers to avoid messing up offset calculations
+        if chunk_length > 0:
+            self.buffers.append(value)
         self.index.set_chunk_slice(chunk_coords, slice(chunk_start, chunk_start + chunk_length))
 
     def __delitem__(self, chunk_coords: tuple[int, ...]) -> None:
@@ -280,7 +292,8 @@ class _ShardBuilder(_ShardReader, ShardMutableMapping):
             self.buffers.insert(0, index_bytes)
         else:
             self.buffers.append(index_bytes)
-        self.buf = self.buf.combine(self.buffers)

makes this work for me locally
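
The invariant this patch relies on, shown here as a standalone sketch with simplified names (buffers and index stand in for the _ShardBuilder attributes): a chunk's byte range in the final shard is fully determined by the lengths of the buffers appended before it, so the index can be built without concatenating anything until the very end.

buffers: list[bytes] = []
index: dict[tuple[int, ...], slice] = {}

def set_chunk(chunk_coords: tuple[int, ...], value: bytes) -> None:
    chunk_start = sum(len(b) for b in buffers)
    if value:  # empty chunks are indexed but never appended
        buffers.append(value)
    index[chunk_coords] = slice(chunk_start, chunk_start + len(value))

set_chunk((0, 0), b"aaaa")
set_chunk((0, 1), b"bb")
shard = b"".join(buffers)  # single copy at the end
assert shard[index[(0, 1)]] == b"bb"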

@d-v-b
Contributor

d-v-b commented Oct 30, 2025

@nbren12 please take a look at https://github.com/d-v-b/zarr-python/blob/chore/sharding-refactor/src/zarr/codecs/sharding.py, it's a refactor of the sharding logic that might be easier to reason about.

remove inheritance, hide the index attribute  and remove some indirection

Signed-off-by: Noah D. Brenowitz <nbren12@gmail.com>
just use dicts

Signed-off-by: Noah D. Brenowitz <nbren12@gmail.com>
@nbren12
Contributor Author

nbren12 commented Oct 30, 2025

Thanks @d-v-b. I ended up pursuing an alternative refactor. I removed the various shard builder objects and things seem to be working now.

@nbren12
Contributor Author

nbren12 commented Oct 30, 2025

@d-v-b Hopefully my latest commit fixes the tests.

For a future PR, the partial write case could definitely use further optimization. When the index is stored at the end, it should be possible to write just the new chunks rather than the whole shard.
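
A rough sketch of that idea, assuming a shard laid out as chunk data followed by a trailing index; encode_index and the toy index format below are made up for illustration and are not zarr's actual API (the real sharding index is more involved, e.g. checksummed):

import struct

def encode_index(index: dict[tuple[int, ...], tuple[int, int]]) -> bytes:
    # Toy format: little-endian (offset, length) pairs in insertion order.
    return b"".join(struct.pack("<QQ", off, ln) for off, ln in index.values())

def append_chunk(shard_path: str, chunk_coords: tuple[int, ...],
                 new_chunk: bytes,
                 index: dict[tuple[int, ...], tuple[int, int]],
                 old_index_size: int) -> None:
    with open(shard_path, "r+b") as f:
        f.seek(0, 2)                           # seek to end of file
        data_end = f.tell() - old_index_size   # old index begins here
        f.seek(data_end)
        f.write(new_chunk)                     # new chunk overwrites the old index
        index[chunk_coords] = (data_end, len(new_chunk))
        f.write(encode_index(index))           # rewrite the small index at the new end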

@codecov

codecov bot commented Oct 30, 2025

Codecov Report

❌ Patch coverage is 81.81818% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.87%. Comparing base (b3e9aed) to head (ec5ea9b).
⚠️ Report is 1 commit behind head on main.

Files with missing lines         Patch %   Lines
src/zarr/codecs/sharding.py      86.11%    5 Missing ⚠️
src/zarr/core/buffer/core.py      0.00%    3 Missing ⚠️
src/zarr/core/buffer/cpu.py      85.71%    1 Missing ⚠️
src/zarr/core/buffer/gpu.py      88.88%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3561      +/-   ##
==========================================
+ Coverage   61.77%   61.87%   +0.10%     
==========================================
  Files          85       85              
  Lines       10165    10134      -31     
==========================================
- Hits         6279     6270       -9     
+ Misses       3886     3864      -22     
Files with missing lines         Coverage Δ
src/zarr/core/buffer/cpu.py      41.37% <85.71%> (+3.19%) ⬆️
src/zarr/core/buffer/gpu.py      43.37% <88.88%> (+2.12%) ⬆️
src/zarr/core/buffer/core.py     30.34% <0.00%> (-0.65%) ⬇️
src/zarr/codecs/sharding.py      62.10% <86.11%> (+3.02%) ⬆️

@d-v-b
Contributor

d-v-b commented Oct 31, 2025

@d-v-b Hopefully my latest commit fixes the tests.

For a future PR, the partial write case could definitely use further optimization. When the index is stored at the end, it should be possible to write just the new chunks rather than the whole shard.

Awesome work! could you write up a release note? If not, I'm happy to write one.

@github-actions github-actions bot removed the needs release notes label Oct 31, 2025
@nbren12
Contributor Author

nbren12 commented Oct 31, 2025

could you write up a release note?

Done. Thanks for the instructions.

@d-v-b
Contributor

d-v-b commented Oct 31, 2025

are these results expected?

main:

uv run sharding_benchmark.py --num-iterations 2 --shape 1000 100000 --chunks 1 100000 --shard-chunks 100 1000000
================================================================================
Zarr Sharding Benchmark
================================================================================
Zarr version: 3.1.3
Array shape: (1000, 100000)
Chunks: (1, 100000)
Shard chunks: (100, 1000000)
Data type: <class 'numpy.float32'>
Total data size: 381.47 MB
Number of iterations: 2
================================================================================

Generating test data...
Test data generated: 381.47 MB

================================================================================
BENCHMARK: Without Sharding
================================================================================

Write (no sharding):
  Mean time: 0.5822 ± 0.1003 seconds
  Throughput: 655.21 MB/s
  Times: ['0.4819', '0.6825']

Read (no sharding, full):
  Mean time: 0.2232 ± 0.0080 seconds
  Throughput: 1709.10 MB/s
  Times: ['0.2152', '0.2312']

Read (no sharding, partial):
  Mean time: 0.0224 ± 0.0009 seconds
  Throughput: 170.46 MB/s
  Times: ['0.0233', '0.0215']

================================================================================
BENCHMARK: With Sharding
================================================================================

Write (with sharding):
  Mean time: 4.6304 ± 0.1074 seconds
  Throughput: 82.38 MB/s
  Times: ['4.7377', '4.5230']

Read (with sharding, full):
  Mean time: 0.3334 ± 0.0207 seconds
  Throughput: 1144.08 MB/s
  Times: ['0.3128', '0.3541']

Read (with sharding, partial):
  Mean time: 0.0295 ± 0.0010 seconds
  Throughput: 129.33 MB/s
  Times: ['0.0305', '0.0285']

================================================================================
COMPARISON
================================================================================
Write: 0.13x speedup with sharding (or 7.95x slower)
Read: 0.67x speedup with sharding (or 1.49x slower)

================================================================================
Benchmark complete!

This PR:

================================================================================
Zarr Sharding Benchmark
================================================================================
Zarr version: 3.1.3
Array shape: (1000, 100000)
Chunks: (1, 100000)
Shard chunks: (100, 1000000)
Data type: <class 'numpy.float32'>
Total data size: 381.47 MB
Number of iterations: 2
================================================================================

Generating test data...
Test data generated: 381.47 MB

================================================================================
BENCHMARK: Without Sharding
================================================================================

Write (no sharding):
  Mean time: 0.6010 ± 0.1220 seconds
  Throughput: 634.68 MB/s
  Times: ['0.4791', '0.7230']

Read (no sharding, full):
  Mean time: 0.2336 ± 0.0053 seconds
  Throughput: 1633.29 MB/s
  Times: ['0.2283', '0.2388']

Read (no sharding, partial):
  Mean time: 0.0232 ± 0.0014 seconds
  Throughput: 164.22 MB/s
  Times: ['0.0246', '0.0219']

================================================================================
BENCHMARK: With Sharding
================================================================================

Write (with sharding):
  Mean time: 5.0533 ± 0.3548 seconds
  Throughput: 75.49 MB/s
  Times: ['4.6985', '5.4080']

Read (with sharding, full):
  Mean time: 0.3293 ± 0.0362 seconds
  Throughput: 1158.36 MB/s
  Times: ['0.3655', '0.2931']

Read (with sharding, partial):
  Mean time: 0.0281 ± 0.0000 seconds
  Throughput: 135.62 MB/s
  Times: ['0.0281', '0.0282']

================================================================================
COMPARISON
================================================================================
Write: 0.12x speedup with sharding (or 8.41x slower)
Read: 0.71x speedup with sharding (or 1.41x slower)

================================================================================
Benchmark complete!
================================================================================

@nbren12
Contributor Author

nbren12 commented Oct 31, 2025

No. We expect the PR to improve the write performance. Here is what I get when running locally. Is it possible your uv is running the installed version of zarr rather than this code?

[nbrenowitz@fb7510c-lcedt zarr-python] :)
$ git log HEAD^..HEAD
commit ec5ea9b8db2685d6c57c7025512da9a976adc416 (HEAD -> main, fork/main)
Author: Noah D. Brenowitz <nbren12@gmail.com>
Date:   Fri Oct 31 10:52:38 2025 -0700

    add release note
[nbrenowitz@fb7510c-lcedt zarr-python] :)
$ pip install -e .
Obtaining file:///home/nbrenowitz/workspace/zarr-python
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Installing backend dependencies ... done
  Preparing editable metadata (pyproject.toml) ... done
Requirement already satisfied: donfig>=0.8 in ./.venv/lib/python3.12/site-packages (from zarr==3.1.4.dev42+gec5ea9b8d) (0.8.1.post1)
Requirement already satisfied: google-crc32c>=1.5 in ./.venv/lib/python3.12/site-packages (from zarr==3.1.4.dev42+gec5ea9b8d) (1.7.1)
Requirement already satisfied: numcodecs>=0.14 in ./.venv/lib/python3.12/site-packages (from zarr==3.1.4.dev42+gec5ea9b8d) (0.16.3)
Requirement already satisfied: numpy>=1.26 in ./.venv/lib/python3.12/site-packages (from zarr==3.1.4.dev42+gec5ea9b8d) (2.3.4)
Requirement already satisfied: packaging>=22.0 in ./.venv/lib/python3.12/site-packages (from zarr==3.1.4.dev42+gec5ea9b8d) (25.0)
Requirement already satisfied: typing-extensions>=4.9 in ./.venv/lib/python3.12/site-packages (from zarr==3.1.4.dev42+gec5ea9b8d) (4.15.0)
Requirement already satisfied: pyyaml in ./.venv/lib/python3.12/site-packages (from donfig>=0.8->zarr==3.1.4.dev42+gec5ea9b8d) (6.0.3)
Building wheels for collected packages: zarr
  Building editable for zarr (pyproject.toml) ... done
  Created wheel for zarr: filename=zarr-3.1.4.dev42+gec5ea9b8d-py3-none-any.whl size=5999 sha256=2421b26e486f5cf7885a8ce2627375ce3516a17abdba6328b988423c98e14c48
  Stored in directory: /tmp/pip-ephem-wheel-cache-te45oy0n/wheels/8a/7b/81/d4b2f55c665f369cecf02b44be6f230c026266e34be13d85d9
Successfully built zarr
Installing collected packages: zarr
  Attempting uninstall: zarr
    Found existing installation: zarr 3.1.4.dev42+gec5ea9b8d
    Uninstalling zarr-3.1.4.dev42+gec5ea9b8d:
      Successfully uninstalled zarr-3.1.4.dev42+gec5ea9b8d
Successfully installed zarr-3.1.4.dev42+gec5ea9b8d
[nbrenowitz@fb7510c-lcedt zarr-python] :)
$ python3 benchmark_zarr_sharding.py --num-iterations 2 --shape 1000 100000 --chunks 1 100000 --shard-chunks 100 1000000
================================================================================
Zarr Sharding Benchmark
================================================================================
Zarr version: 3.1.4.dev42+gec5ea9b8d
Array shape: (1000, 100000)
Chunks: (1, 100000)
Shard chunks: (100, 1000000)
Data type: <class 'numpy.float32'>
Total data size: 381.47 MB
Number of iterations: 2
================================================================================

Generating test data...
Test data generated: 381.47 MB

================================================================================
BENCHMARK: Without Sharding
================================================================================

Write (no sharding):
  Mean time: 0.5058 ± 0.0169 seconds
  Throughput: 754.12 MB/s
  Times: ['0.5227', '0.4890']

Read (no sharding, full):
  Mean time: 0.3185 ± 0.0078 seconds
  Throughput: 1197.87 MB/s
  Times: ['0.3263', '0.3106']

Read (no sharding, partial):
  Mean time: 0.0267 ± 0.0001 seconds
  Throughput: 143.01 MB/s
  Times: ['0.0265', '0.0268']

================================================================================
BENCHMARK: With Sharding
================================================================================

Write (with sharding):
  Mean time: 0.8496 ± 0.0296 seconds
  Throughput: 449.00 MB/s
  Times: ['0.8200', '0.8792']

Read (with sharding, full):
  Mean time: 0.4111 ± 0.0063 seconds
  Throughput: 927.87 MB/s
  Times: ['0.4174', '0.4049']

Read (with sharding, partial):
  Mean time: 0.0509 ± 0.0000 seconds
  Throughput: 74.88 MB/s
  Times: ['0.0509', '0.0510']

================================================================================
COMPARISON
================================================================================
Write: 0.60x speedup with sharding (or 1.68x slower)
Read: 0.77x speedup with sharding (or 1.29x slower)

================================================================================
Benchmark complete!
================================================================================
[nbrenowitz@fb7510c-lcedt zarr-python] :)
$ python
Python 3.12.9 (main, Mar 17 2025, 21:01:58) [Clang 20.1.0 ] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import zarr
>>> zarr.__path__
['/home/nbrenowitz/workspace/zarr-python/src/zarr']

@nbren12
Contributor Author

nbren12 commented Oct 31, 2025

Yah. The version in your benchmark output is marked as zarr==3.1.3, whereas in mine it is Zarr version: 3.1.4.dev42+gec5ea9b8d. This suggests you aren't running the benchmark against the code in this PR.

@d-v-b
Contributor

d-v-b commented Oct 31, 2025

my bad, I forgot the --with-editable flag for uv run 🤦
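
For anyone else benchmarking a checkout, the invocation that actually picks up the local code would be along these lines:

uv run --with-editable . sharding_benchmark.py --num-iterations 2 --shape 1000 100000 --chunks 1 100000 --shard-chunks 100 1000000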

@d-v-b d-v-b merged commit e6ef2b1 into zarr-developers:main Oct 31, 2025
31 checks passed
@d-v-b
Contributor

d-v-b commented Oct 31, 2025

thanks @nbren12!

@nbren12
Contributor Author

nbren12 commented Oct 31, 2025

Thanks @d-v-b for the quick feedback and merge. It's my first zarr contribution in a while, and it seemed pretty seamless.

@nbren12 nbren12 deleted the main branch October 31, 2025 21:20