
Conversation

@nbren12
Contributor

@nbren12 nbren12 commented Oct 30, 2025

Writing to sharded arrays was up to 10x slower for largish chunk sizes because the _ShardBuilder object makes many calls to np.concatenate. This commit coalesces these into a single concatenate call, improving write performance by a factor of 10 on the benchmarking script in #3560.

Added a new core.Buffer.combine API
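
For intuition, a minimal NumPy sketch (not the PR's actual code) of the pathology: growing a buffer with repeated np.concatenate calls re-copies every byte accumulated so far, so assembling a shard from n chunks copies O(n^2) bytes in total, whereas collecting the chunks and concatenating once copies each byte exactly once. The single combine call plays the role of build_shard_fast below.

import numpy as np

# Quadratic: each iteration re-copies all bytes accumulated so far.
def build_shard_slow(chunks: list[np.ndarray]) -> np.ndarray:
    shard = np.empty(0, dtype=np.uint8)
    for chunk in chunks:
        shard = np.concatenate([shard, chunk])
    return shard

# Linear: one concatenate at the end copies each byte exactly once.
def build_shard_fast(chunks: list[np.ndarray]) -> np.ndarray:
    return np.concatenate(chunks)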

Resolves #3560


TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions github-actions bot added the needs release notes label (automatically applied to PRs which haven't added release notes) Oct 30, 2025
@nbren12
Contributor Author

nbren12 commented Oct 30, 2025

Benchmarking results
Before:

================================================================================
Zarr Sharding Benchmark
================================================================================
Zarr version: 3.1.4.dev34+gb3e9aed30
Array shape: (1000, 100000)
Chunks: (1, 100000)
Shard chunks: (100, 1000000)
Data type: <class 'numpy.float32'>
Total data size: 381.47 MB
Number of iterations: 2
================================================================================

Generating test data...
Test data generated: 381.47 MB

================================================================================
BENCHMARK: Without Sharding
================================================================================

Write (no sharding):
  Mean time: 0.5280 ± 0.0161 seconds
  Throughput: 722.44 MB/s
  Times: ['0.5442', '0.5119']

Read (no sharding, full):
  Mean time: 0.3106 ± 0.0060 seconds
  Throughput: 1228.23 MB/s
  Times: ['0.3165', '0.3046']

Read (no sharding, partial):
  Mean time: 0.0267 ± 0.0001 seconds
  Throughput: 142.88 MB/s
  Times: ['0.0266', '0.0268']

================================================================================
BENCHMARK: With Sharding
================================================================================

Write (with sharding):
  Mean time: 4.2136 ± 0.1268 seconds
  Throughput: 90.53 MB/s
  Times: ['4.3404', '4.0867']

Read (with sharding, full):
  Mean time: 0.3533 ± 0.0036 seconds
  Throughput: 1079.64 MB/s
  Times: ['0.3569', '0.3498']

Read (with sharding, partial):
  Mean time: 0.0317 ± 0.0003 seconds
  Throughput: 120.33 MB/s
  Times: ['0.0320', '0.0314']

================================================================================
COMPARISON
================================================================================
Write: 0.13x speedup with sharding (or 7.98x slower)
Read: 0.88x speedup with sharding (or 1.14x slower)

================================================================================
Benchmark complete!
================================================================================

After:

================================================================================
Zarr Sharding Benchmark
================================================================================
Zarr version: 3.1.4.dev34+gb3e9aed30
Array shape: (1000, 100000)
Chunks: (1, 100000)
Shard chunks: (100, 1000000)
Data type: <class 'numpy.float32'>
Total data size: 381.47 MB
Number of iterations: 2
================================================================================

Generating test data...
Test data generated: 381.47 MB

================================================================================
BENCHMARK: Without Sharding
================================================================================

Write (no sharding):
  Mean time: 0.5144 ± 0.0080 seconds
  Throughput: 741.65 MB/s
  Times: ['0.5223', '0.5064']

Read (no sharding, full):
  Mean time: 0.3128 ± 0.0018 seconds
  Throughput: 1219.45 MB/s
  Times: ['0.3147', '0.3110']

Read (no sharding, partial):
  Mean time: 0.0287 ± 0.0002 seconds
  Throughput: 133.12 MB/s
  Times: ['0.0288', '0.0285']

================================================================================
BENCHMARK: With Sharding
================================================================================

Write (with sharding):
  Mean time: 0.6107 ± 0.0017 seconds
  Throughput: 624.61 MB/s
  Times: ['0.6125', '0.6090']

Read (with sharding, full):
  Mean time: 0.2908 ± 0.0030 seconds
  Throughput: 1311.91 MB/s
  Times: ['0.2938', '0.2877']

Read (with sharding, partial):
  Mean time: 0.0238 ± 0.0006 seconds
  Throughput: 160.59 MB/s
  Times: ['0.0244', '0.0231']

================================================================================
COMPARISON
================================================================================
Write: 0.84x speedup with sharding (or 1.19x slower)
Read: 1.08x speedup with sharding (or 0.93x slower)

================================================================================
Benchmark complete!
================================================================================

Signed-off-by: Noah D. Brenowitz <nbren12@gmail.com>
@d-v-b
Contributor

d-v-b commented Oct 30, 2025

thanks so much for tackling this problem!

@nbren12
Contributor Author

nbren12 commented Oct 30, 2025

hmmm, lots of failing tests. I'm having some trouble grokking the implementation of the sharding codec.

@d-v-b
Contributor

d-v-b commented Oct 30, 2025

I'll have a look. I don't know this section of the codebase very well, but maybe I can figure something out. And BTW if you see any opportunities to refactor / simplify the logic here, feel free to move in that direction. I think making this code easier to understand would be a big win.

@d-v-b
Contributor

d-v-b commented Oct 30, 2025

diff --git a/src/zarr/codecs/sharding.py b/src/zarr/codecs/sharding.py
index 0e36f50f..31d68027 100644
--- a/src/zarr/codecs/sharding.py
+++ b/src/zarr/codecs/sharding.py
@@ -227,6 +227,7 @@ class _ShardBuilder(_ShardReader, ShardMutableMapping):
     buffers: list[Buffer]
     index: _ShardIndex
     buf: Buffer
+    _chunk_buffers: dict[tuple[int, ...], Buffer]  # Map chunk_coords to buffers
 
     @classmethod
     def merge_with_morton_order(
@@ -255,13 +256,24 @@ class _ShardBuilder(_ShardReader, ShardMutableMapping):
         obj = cls()
         obj.buf = buffer_prototype.buffer.create_zero_length()
         obj.buffers = []
+        obj._chunk_buffers = {}
         obj.index = _ShardIndex.create_empty(chunks_per_shard)
         return obj
 
+    def __getitem__(self, chunk_coords: tuple[int, ...]) -> Buffer:
+        # Override to use _chunk_buffers instead of self.buf
+        if chunk_coords in self._chunk_buffers:
+            return self._chunk_buffers[chunk_coords]
+        raise KeyError
+
     def __setitem__(self, chunk_coords: tuple[int, ...], value: Buffer) -> None:
         chunk_start = sum(len(buf) for buf in self.buffers)
         chunk_length = len(value)
-        self.buffers.append(value)
+        # Store the buffer for later retrieval
+        self._chunk_buffers[chunk_coords] = value
+        # Only append non-empty buffers to avoid messing up offset calculations
+        if chunk_length > 0:
+            self.buffers.append(value)
         self.index.set_chunk_slice(chunk_coords, slice(chunk_start, chunk_start + chunk_length))
 
     def __delitem__(self, chunk_coords: tuple[int, ...]) -> None:
@@ -280,7 +292,8 @@ class _ShardBuilder(_ShardReader, ShardMutableMapping):
             self.buffers.insert(0, index_bytes)
         else:
             self.buffers.append(index_bytes)
-        self.buf = self.buf.combine(self.buffers)

makes this work for me locally
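
The invariant this patch relies on, shown here as a standalone sketch with simplified names (buffers and index stand in for the _ShardBuilder attributes): a chunk's byte range in the final shard is fully determined by the lengths of the buffers appended before it, so the index can be built without concatenating anything until the very end.

buffers: list[bytes] = []
index: dict[tuple[int, ...], slice] = {}

def set_chunk(chunk_coords: tuple[int, ...], value: bytes) -> None:
    chunk_start = sum(len(b) for b in buffers)
    if value:  # empty chunks are indexed but never appended
        buffers.append(value)
    index[chunk_coords] = slice(chunk_start, chunk_start + len(value))

set_chunk((0, 0), b"aaaa")
set_chunk((0, 1), b"bb")
shard = b"".join(buffers)  # single copy at the end
assert shard[index[(0, 1)]] == b"bb"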

@d-v-b
Contributor

d-v-b commented Oct 30, 2025

@nbren12 please take a look at https://github.com/d-v-b/zarr-python/blob/chore/sharding-refactor/src/zarr/codecs/sharding.py, it's a refactor of the sharding logic that might be easier to reason about.

remove inheritance, hide the index attribute  and remove some indirection

Signed-off-by: Noah D. Brenowitz <nbren12@gmail.com>
just use dicts

Signed-off-by: Noah D. Brenowitz <nbren12@gmail.com>
@nbren12
Contributor Author

nbren12 commented Oct 30, 2025

Thanks @d-v-b. I ended up pursuing an alternative refactor. I removed the various shard builder objects and things seem to be working now.

@nbren12
Contributor Author

nbren12 commented Oct 30, 2025

@d-v-b Hopefully my latest commit fixes the tests.

For a future PR, the partial write case could definitely use further optimization. When the index is stored at the end, it should be possible to write just the new chunks rather than the whole shard.
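
A rough sketch of that idea, assuming a shard laid out as chunk data followed by a trailing index; encode_index and the toy index format below are made up for illustration and are not zarr's actual API (the real sharding index is more involved, e.g. checksummed):

import struct

def encode_index(index: dict[tuple[int, ...], tuple[int, int]]) -> bytes:
    # Toy format: little-endian (offset, length) pairs in insertion order.
    return b"".join(struct.pack("<QQ", off, ln) for off, ln in index.values())

def append_chunk(shard_path: str, chunk_coords: tuple[int, ...],
                 new_chunk: bytes,
                 index: dict[tuple[int, ...], tuple[int, int]],
                 old_index_size: int) -> None:
    with open(shard_path, "r+b") as f:
        f.seek(0, 2)                           # seek to end of file
        data_end = f.tell() - old_index_size   # old index begins here
        f.seek(data_end)
        f.write(new_chunk)                     # new chunk overwrites the old index
        index[chunk_coords] = (data_end, len(new_chunk))
        f.write(encode_index(index))           # rewrite the small index at the new end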

@codecov

codecov bot commented Oct 30, 2025

Codecov Report

❌ Patch coverage is 81.81818% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.87%. Comparing base (b3e9aed) to head (ec5ea9b).
⚠️ Report is 1 commit behind head on main.

Files with missing lines         Patch %   Lines
src/zarr/codecs/sharding.py      86.11%    5 Missing ⚠️
src/zarr/core/buffer/core.py      0.00%    3 Missing ⚠️
src/zarr/core/buffer/cpu.py      85.71%    1 Missing ⚠️
src/zarr/core/buffer/gpu.py      88.88%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3561      +/-   ##
==========================================
+ Coverage   61.77%   61.87%   +0.10%     
==========================================
  Files          85       85              
  Lines       10165    10134      -31     
==========================================
- Hits         6279     6270       -9     
+ Misses       3886     3864      -22     
Files with missing lines         Coverage Δ
src/zarr/core/buffer/cpu.py      41.37% <85.71%> (+3.19%) ⬆️
src/zarr/core/buffer/gpu.py      43.37% <88.88%> (+2.12%) ⬆️
src/zarr/core/buffer/core.py     30.34% <0.00%> (-0.65%) ⬇️
src/zarr/codecs/sharding.py      62.10% <86.11%> (+3.02%) ⬆️

@d-v-b
Contributor

d-v-b commented Oct 31, 2025

@d-v-b Hopefully my latest commit fixes the tests.

For a future PR, the partial write case could definitely use further optimization. When the index is stored at the end, it should be possible to write just the new chunks rather than the whole shard.

Awesome work! could you write up a release note? If not, I'm happy to write one.

@github-actions github-actions bot removed the needs release notes label Oct 31, 2025
@nbren12
Contributor Author

nbren12 commented Oct 31, 2025

could you write up a release note?

Done. Thanks for the instructions.

@d-v-b
Contributor

d-v-b commented Oct 31, 2025

are these results expected?

main:

uv run sharding_benchmark.py --num-iterations 2 --shape 1000 100000 --chunks 1 100000 --shard-chunks 100 1000000
================================================================================
Zarr Sharding Benchmark
================================================================================
Zarr version: 3.1.3
Array shape: (1000, 100000)
Chunks: (1, 100000)
Shard chunks: (100, 1000000)
Data type: <class 'numpy.float32'>
Total data size: 381.47 MB
Number of iterations: 2
================================================================================

Generating test data...
Test data generated: 381.47 MB

================================================================================
BENCHMARK: Without Sharding
================================================================================

Write (no sharding):
  Mean time: 0.5822 ± 0.1003 seconds
  Throughput: 655.21 MB/s
  Times: ['0.4819', '0.6825']

Read (no sharding, full):
  Mean time: 0.2232 ± 0.0080 seconds
  Throughput: 1709.10 MB/s
  Times: ['0.2152', '0.2312']

Read (no sharding, partial):
  Mean time: 0.0224 ± 0.0009 seconds
  Throughput: 170.46 MB/s
  Times: ['0.0233', '0.0215']

================================================================================
BENCHMARK: With Sharding
================================================================================

Write (with sharding):
  Mean time: 4.6304 ± 0.1074 seconds
  Throughput: 82.38 MB/s
  Times: ['4.7377', '4.5230']

Read (with sharding, full):
  Mean time: 0.3334 ± 0.0207 seconds
  Throughput: 1144.08 MB/s
  Times: ['0.3128', '0.3541']

Read (with sharding, partial):
  Mean time: 0.0295 ± 0.0010 seconds
  Throughput: 129.33 MB/s
  Times: ['0.0305', '0.0285']

================================================================================
COMPARISON
================================================================================
Write: 0.13x speedup with sharding (or 7.95x slower)
Read: 0.67x speedup with sharding (or 1.49x slower)

================================================================================
Benchmark complete!

This PR:

================================================================================
Zarr Sharding Benchmark
================================================================================
Zarr version: 3.1.3
Array shape: (1000, 100000)
Chunks: (1, 100000)
Shard chunks: (100, 1000000)
Data type: <class 'numpy.float32'>
Total data size: 381.47 MB
Number of iterations: 2
================================================================================

Generating test data...
Test data generated: 381.47 MB

================================================================================
BENCHMARK: Without Sharding
================================================================================

Write (no sharding):
  Mean time: 0.6010 ± 0.1220 seconds
  Throughput: 634.68 MB/s
  Times: ['0.4791', '0.7230']

Read (no sharding, full):
  Mean time: 0.2336 ± 0.0053 seconds
  Throughput: 1633.29 MB/s
  Times: ['0.2283', '0.2388']

Read (no sharding, partial):
  Mean time: 0.0232 ± 0.0014 seconds
  Throughput: 164.22 MB/s
  Times: ['0.0246', '0.0219']

================================================================================
BENCHMARK: With Sharding
================================================================================

Write (with sharding):
  Mean time: 5.0533 ± 0.3548 seconds
  Throughput: 75.49 MB/s
  Times: ['4.6985', '5.4080']

Read (with sharding, full):
  Mean time: 0.3293 ± 0.0362 seconds
  Throughput: 1158.36 MB/s
  Times: ['0.3655', '0.2931']

Read (with sharding, partial):
  Mean time: 0.0281 ± 0.0000 seconds
  Throughput: 135.62 MB/s
  Times: ['0.0281', '0.0282']

================================================================================
COMPARISON
================================================================================
Write: 0.12x speedup with sharding (or 8.41x slower)
Read: 0.71x speedup with sharding (or 1.41x slower)

================================================================================
Benchmark complete!
================================================================================

@nbren12
Contributor Author

nbren12 commented Oct 31, 2025

No. We expect the PR to improve the write performance. Here is what I get when running locally. Is it possible your uv is running the installed version of zarr rather than this code?

[nbrenowitz@fb7510c-lcedt zarr-python] :)
$ git log HEAD^..HEAD
commit ec5ea9b8db2685d6c57c7025512da9a976adc416 (HEAD -> main, fork/main)
Author: Noah D. Brenowitz <nbren12@gmail.com>
Date:   Fri Oct 31 10:52:38 2025 -0700

    add release note
[nbrenowitz@fb7510c-lcedt zarr-python] :)
$ pip install -e .
Obtaining file:///home/nbrenowitz/workspace/zarr-python
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Installing backend dependencies ... done
  Preparing editable metadata (pyproject.toml) ... done
Requirement already satisfied: donfig>=0.8 in ./.venv/lib/python3.12/site-packages (from zarr==3.1.4.dev42+gec5ea9b8d) (0.8.1.post1)
Requirement already satisfied: google-crc32c>=1.5 in ./.venv/lib/python3.12/site-packages (from zarr==3.1.4.dev42+gec5ea9b8d) (1.7.1)
Requirement already satisfied: numcodecs>=0.14 in ./.venv/lib/python3.12/site-packages (from zarr==3.1.4.dev42+gec5ea9b8d) (0.16.3)
Requirement already satisfied: numpy>=1.26 in ./.venv/lib/python3.12/site-packages (from zarr==3.1.4.dev42+gec5ea9b8d) (2.3.4)
Requirement already satisfied: packaging>=22.0 in ./.venv/lib/python3.12/site-packages (from zarr==3.1.4.dev42+gec5ea9b8d) (25.0)
Requirement already satisfied: typing-extensions>=4.9 in ./.venv/lib/python3.12/site-packages (from zarr==3.1.4.dev42+gec5ea9b8d) (4.15.0)
Requirement already satisfied: pyyaml in ./.venv/lib/python3.12/site-packages (from donfig>=0.8->zarr==3.1.4.dev42+gec5ea9b8d) (6.0.3)
Building wheels for collected packages: zarr
  Building editable for zarr (pyproject.toml) ... done
  Created wheel for zarr: filename=zarr-3.1.4.dev42+gec5ea9b8d-py3-none-any.whl size=5999 sha256=2421b26e486f5cf7885a8ce2627375ce3516a17abdba6328b988423c98e14c48
  Stored in directory: /tmp/pip-ephem-wheel-cache-te45oy0n/wheels/8a/7b/81/d4b2f55c665f369cecf02b44be6f230c026266e34be13d85d9
Successfully built zarr
Installing collected packages: zarr
  Attempting uninstall: zarr
    Found existing installation: zarr 3.1.4.dev42+gec5ea9b8d
    Uninstalling zarr-3.1.4.dev42+gec5ea9b8d:
      Successfully uninstalled zarr-3.1.4.dev42+gec5ea9b8d
Successfully installed zarr-3.1.4.dev42+gec5ea9b8d
[nbrenowitz@fb7510c-lcedt zarr-python] :)
$ python3 benchmark_zarr_sharding.py --num-iterations 2 --shape 1000 100000 --chunks 1 100000 --shard-chunks 100 1000000
================================================================================
Zarr Sharding Benchmark
================================================================================
Zarr version: 3.1.4.dev42+gec5ea9b8d
Array shape: (1000, 100000)
Chunks: (1, 100000)
Shard chunks: (100, 1000000)
Data type: <class 'numpy.float32'>
Total data size: 381.47 MB
Number of iterations: 2
================================================================================

Generating test data...
Test data generated: 381.47 MB

================================================================================
BENCHMARK: Without Sharding
================================================================================

Write (no sharding):
  Mean time: 0.5058 ± 0.0169 seconds
  Throughput: 754.12 MB/s
  Times: ['0.5227', '0.4890']

Read (no sharding, full):
  Mean time: 0.3185 ± 0.0078 seconds
  Throughput: 1197.87 MB/s
  Times: ['0.3263', '0.3106']

Read (no sharding, partial):
  Mean time: 0.0267 ± 0.0001 seconds
  Throughput: 143.01 MB/s
  Times: ['0.0265', '0.0268']

================================================================================
BENCHMARK: With Sharding
================================================================================

Write (with sharding):
  Mean time: 0.8496 ± 0.0296 seconds
  Throughput: 449.00 MB/s
  Times: ['0.8200', '0.8792']

Read (with sharding, full):
  Mean time: 0.4111 ± 0.0063 seconds
  Throughput: 927.87 MB/s
  Times: ['0.4174', '0.4049']

Read (with sharding, partial):
  Mean time: 0.0509 ± 0.0000 seconds
  Throughput: 74.88 MB/s
  Times: ['0.0509', '0.0510']

================================================================================
COMPARISON
================================================================================
Write: 0.60x speedup with sharding (or 1.68x slower)
Read: 0.77x speedup with sharding (or 1.29x slower)

================================================================================
Benchmark complete!
================================================================================
[nbrenowitz@fb7510c-lcedt zarr-python] :)
$ python
Python 3.12.9 (main, Mar 17 2025, 21:01:58) [Clang 20.1.0 ] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import zarr
>>> zarr.__path__
['/home/nbrenowitz/workspace/zarr-python/src/zarr']

@nbren12
Contributor Author

nbren12 commented Oct 31, 2025

Yah. The version in your benchmark output is marked as zarr==3.1.3, whereas in mine it is Zarr version: 3.1.4.dev42+gec5ea9b8d. This suggests you aren't running the benchmark against the code in this PR.

@d-v-b
Contributor

d-v-b commented Oct 31, 2025

my bad, I forgot the --with-editable flag for uv run 🤦
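
For anyone else benchmarking a checkout, the invocation that actually picks up the local code would be along these lines:

uv run --with-editable . sharding_benchmark.py --num-iterations 2 --shape 1000 100000 --chunks 1 100000 --shard-chunks 100 1000000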

@d-v-b d-v-b merged commit e6ef2b1 into zarr-developers:main Oct 31, 2025
31 checks passed
@d-v-b
Contributor

d-v-b commented Oct 31, 2025

thanks @nbren12!

@nbren12
Contributor Author

nbren12 commented Oct 31, 2025

Thanks @d-v-b for the quick feedback and merge. It's my first zarr contribution in a while, and it seemed pretty seamless.

@nbren12 nbren12 deleted the main branch October 31, 2025 21:20