34 commits
49d5ee8
First working version of Zstd codec on the GPU
akshaysubr Feb 25, 2025
d548adc
Adding nvcomp to the GPU dependency list
akshaysubr Feb 25, 2025
a8c0db3
Updating codec pipeline batch size for GPU codecs to enable parallelism
akshaysubr Feb 27, 2025
69aa274
Making encode and decode async
akshaysubr Jul 3, 2025
10e1bc9
Removing custom awaitable in favor of event synchronize in an async t…
akshaysubr Jul 4, 2025
771c0c1
Merge remote-tracking branch 'upstream/main' into gpu-codecs
TomAugspurger Jul 9, 2025
ec07100
Sync convert methods
TomAugspurger Jul 9, 2025
d1c37a3
test coverage
TomAugspurger Jul 9, 2025
69ea74e
loosen dtype restriction
TomAugspurger Jul 9, 2025
1b85fdc
fixed Buffer.__add__
TomAugspurger Jul 9, 2025
f5c7814
Added whatsnew
TomAugspurger Jul 11, 2025
7671274
Merge remote-tracking branch 'upstream/main' into gpu-codecs
TomAugspurger Jul 11, 2025
d558ef8
Merge remote-tracking branch 'upstream/main' into gpu-codecs
TomAugspurger Jul 14, 2025
f16d730
Look up the codec implementation
TomAugspurger Jul 14, 2025
f0db57d
Merge branch 'main' into gpu-codecs
TomAugspurger Jul 18, 2025
048ad48
test coverage
TomAugspurger Jul 18, 2025
2282cb9
Test coverage for uninitialized chunks
TomAugspurger Jul 18, 2025
c6460b5
coverage
TomAugspurger Jul 18, 2025
3b5e294
doc update
TomAugspurger Jul 18, 2025
dd825dc
lint
TomAugspurger Jul 18, 2025
f89b232
@gpu_test
TomAugspurger Jul 18, 2025
7a4b037
wip test stuff
TomAugspurger Jul 21, 2025
398b4d1
doc updates
TomAugspurger Jul 23, 2025
dd69543
added failing compatibility test
TomAugspurger Jul 23, 2025
76f7560
added a matching test
TomAugspurger Jul 24, 2025
8b5b3f1
Some buffer coverage
TomAugspurger Jul 24, 2025
7af3a16
coverage
TomAugspurger Jul 24, 2025
996fbc0
update error message
TomAugspurger Jul 28, 2025
d24d027
private
TomAugspurger Jul 28, 2025
090349c
Merge remote-tracking branch 'upstream/main' into gpu-codecs
TomAugspurger Jul 28, 2025
83c53b0
Merge remote-tracking branch 'upstream/main' into gpu-codecs
TomAugspurger Sep 29, 2025
eb50521
test fixup
TomAugspurger Sep 29, 2025
de3b577
doc fix
TomAugspurger Sep 29, 2025
ac14838
doc fix
TomAugspurger Sep 29, 2025
7 changes: 7 additions & 0 deletions changes/2863.feature.md
@@ -0,0 +1,7 @@
Added GPU-accelerated Zstd Codec

This adds support for decoding with the Zstd Codec on NVIDIA GPUs using the
nvidia-nvcomp library.

With `zarr.config.enable_gpu()`, buffers will be decoded using the GPU
and the output will reside in device memory.
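
A rough sketch of what this change note describes, assuming a CUDA device with `cupy` and `nvidia-nvcomp-cu12` installed (the host-array write shown here is an assumption; the PR itself adds GPU decode/encode for Zstd):

```python
# Sketch only: with GPU support enabled, reads return device-resident
# cupy.ndarray objects, and Zstd-compressed chunks are decoded via nvcomp.
import numpy as np
import zarr

zarr.config.enable_gpu()

store = zarr.storage.MemoryStore()
z = zarr.create_array(store=store, shape=(100, 100), chunks=(10, 10), dtype="float32")
z[:] = np.arange(100 * 100, dtype="float32").reshape(100, 100)  # host data in

out = z[:10, :10]
print(type(out))  # expected: <class 'cupy.ndarray'>, backed by device memory
```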
19 changes: 19 additions & 0 deletions docs/user-guide/config.md
@@ -39,6 +39,25 @@ first register the implementations in the registry and then select them in the c
For example, an implementation of the bytes codec in a class `'custompackage.NewBytesCodec'`,
requires the value of `codecs.bytes.name` to be `'custompackage.NewBytesCodec'`.

## Codecs

Zarr and zarr-python split the logical codec definition from the implementation.
The Zarr metadata serialized in the store specifies just the codec name and
configuration. To resolve the specific implementation (the Python class used at
runtime to encode or decode data), zarr-python looks up the codec name in the
codec registry.

For example, after calling `zarr.config.enable_gpu()`, an nvcomp-based
codec will be used:

```python
>>> with zarr.config.enable_gpu():
... print(zarr.config.get('codecs.zstd'))
zarr.codecs.gpu.NvcompZstdCodec
```
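
Continuing the example, a hedged sketch of pointing a codec name at a different implementation; `custompackage.MyZstdCodec` is a hypothetical class used only for illustration, and the config key follows the `codecs.zstd` entry shown above:

```python
# Hypothetical: register an alternative Zstd implementation, then select it in
# the config. "custompackage.MyZstdCodec" does not exist; it stands in for a
# user-provided codec class.
import zarr
from zarr.registry import register_codec

from custompackage import MyZstdCodec  # hypothetical import

register_codec("zstd", MyZstdCodec)
zarr.config.set({"codecs.zstd": "custompackage.MyZstdCodec"})
print(zarr.config.get("codecs.zstd"))  # 'custompackage.MyZstdCodec'
```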

## Default Configuration

This is the current default configuration:

```python exec="true" session="config" source="above" result="ansi"
15 changes: 6 additions & 9 deletions docs/user-guide/gpu.md
@@ -2,15 +2,6 @@

Zarr can use GPUs to accelerate your workload by running `zarr.Config.enable_gpu`.

!!! note
`zarr-python` currently supports reading the ndarray data into device (GPU)
memory as the final stage of the codec pipeline. Data will still be read into
or copied to host (CPU) memory for encoding and decoding.

In the future, codecs will be available compressing and decompressing data on
the GPU, avoiding the need to move data between the host and device for
compression and decompression.

## Reading data into device memory

[`zarr.config`][] configures Zarr to use GPU memory for the data
@@ -29,3 +20,9 @@ type(z[:10, :10])
```

Note that the output type is a `cupy.ndarray` rather than a NumPy array.

For supported codecs, data will be decoded using the GPU via the [nvcomp] library.
See [runtime-configuration][] for more details. Issues and feature requests for NVIDIA nvCOMP can be reported in the nvcomp [issue tracker].

[nvcomp]: https://docs.nvidia.com/cuda/nvcomp/samples/python_samples.html
[issue tracker]: https://github.com/NVIDIA/CUDALibrarySamples/issues
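
A sketch of the write-on-CPU, read-on-GPU pattern this supports (assumes `cupy` and `nvidia-nvcomp-cu12` are installed; the `zarr.open_array` call is illustrative):

```python
# Sketch: data written with the default CPU pipeline is read back with the GPU
# pipeline, so Zstd decompression runs on the device via nvcomp.
import numpy as np
import zarr

store = zarr.storage.MemoryStore()
z = zarr.create_array(store=store, shape=(100, 100), chunks=(10, 10), dtype="float32")
z[:] = np.random.default_rng(0).random((100, 100), dtype="float32")

with zarr.config.enable_gpu():
    gpu_z = zarr.open_array(store=store)
    print(type(gpu_z[:10, :10]))  # expected: <class 'cupy.ndarray'>
```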
34 changes: 34 additions & 0 deletions docs/user-guide/gpu.rst
@@ -0,0 +1,34 @@
.. _user-guide-gpu:

Using GPUs with Zarr
====================

Zarr can use GPUs to accelerate your workload by running
:meth:`zarr.config.enable_gpu`.

Reading data into device memory
-------------------------------

:meth:`zarr.config.enable_gpu` configures Zarr to use GPU memory for the data
buffers used internally by Zarr.

.. code-block:: python

>>> import zarr
>>> import cupy as cp # doctest: +SKIP
>>> zarr.config.enable_gpu() # doctest: +SKIP
>>> store = zarr.storage.MemoryStore() # doctest: +SKIP
>>> z = zarr.create_array( # doctest: +SKIP
... store=store, shape=(100, 100), chunks=(10, 10), dtype="float32",
... )
>>> type(z[:10, :10]) # doctest: +SKIP
cupy.ndarray

Note that the output type is a ``cupy.ndarray`` rather than a NumPy array.

For supported codecs, data will be decoded using the GPU via the `nvcomp`_
library. See :ref:`user-guide-config` for more details. Issues and feature requests
for NVIDIA nvCOMP can be reported in the `nvcomp issue tracker`_.

.. _nvcomp: https://docs.nvidia.com/cuda/nvcomp/samples/python_samples.html
.. _nvcomp issue tracker: https://github.com/NVIDIA/CUDALibrarySamples/issues
1 change: 1 addition & 0 deletions pyproject.toml
@@ -67,6 +67,7 @@ remote = [
]
gpu = [
"cupy-cuda12x",
"nvidia-nvcomp-cu12",
]
cli = ["typer"]
# Development extras
2 changes: 2 additions & 0 deletions src/zarr/codecs/__init__.py
@@ -3,6 +3,7 @@
from zarr.codecs.blosc import BloscCname, BloscCodec, BloscShuffle
from zarr.codecs.bytes import BytesCodec, Endian
from zarr.codecs.crc32c_ import Crc32cCodec
from zarr.codecs.gpu import NvcompZstdCodec
from zarr.codecs.gzip import GzipCodec
from zarr.codecs.numcodecs import (
BZ2,
@@ -41,6 +42,7 @@
"Crc32cCodec",
"Endian",
"GzipCodec",
"NvcompZstdCodec",
"ShardingCodec",
"ShardingCodecIndexLocation",
"TransposeCodec",
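
Because the codec is now exported from `zarr.codecs`, it can presumably also be passed explicitly when creating an array; a sketch (the `compressors=` keyword is assumed from the zarr-python 3 `create_array` API, and a CUDA environment is required):

```python
# Illustrative only: the GPU codec is normally selected automatically by
# zarr.config.enable_gpu(); passing it explicitly should also work.
import zarr
from zarr.codecs import NvcompZstdCodec

zarr.config.enable_gpu()
z = zarr.create_array(
    store=zarr.storage.MemoryStore(),
    shape=(64, 64),
    chunks=(16, 16),
    dtype="uint8",
    compressors=NvcompZstdCodec(level=0, checksum=False),
)
```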
176 changes: 176 additions & 0 deletions src/zarr/codecs/gpu.py
@@ -0,0 +1,176 @@
from __future__ import annotations

import asyncio
from dataclasses import dataclass
from functools import cached_property
from typing import TYPE_CHECKING

import numpy as np

from zarr.abc.codec import BytesBytesCodec
from zarr.core.common import JSON, parse_named_configuration
from zarr.registry import register_codec

if TYPE_CHECKING:
from collections.abc import Iterable
from typing import Self

from zarr.core.array_spec import ArraySpec
from zarr.core.buffer import Buffer

try:
import cupy as cp
except ImportError: # pragma: no cover
cp = None

try:
from nvidia import nvcomp
except ImportError: # pragma: no cover
nvcomp = None


def _parse_zstd_level(data: JSON) -> int:
if isinstance(data, int):
if data >= 23:
raise ValueError(f"Value must be less than or equal to 22. Got {data} instead.")
return data
raise TypeError(f"Got value with type {type(data)}, but expected an int.")


def _parse_checksum(data: JSON) -> bool:
if isinstance(data, bool):
return data
raise TypeError(f"Expected bool. Got {type(data)}.")


@dataclass(frozen=True)
class NvcompZstdCodec(BytesBytesCodec):
is_fixed_size = True

level: int = 0
checksum: bool = False

def __init__(self, *, level: int = 0, checksum: bool = False) -> None:
# TODO: Set CUDA device appropriately here and also set CUDA stream
Review comment (Contributor): Agreed with leaving devices / streams as a TODO for now.

I want to enable users to overlap host-to-device memcpys with compute operations (like decode, but their own compute operations as well), but I'm not sure yet what that API will look like.

If you have any thoughts on how best to do this I'd love to hear them, and write them up as an issue.

Review comment (Contributor): #3271 for planning on devices and streams.


level_parsed = _parse_zstd_level(level)
checksum_parsed = _parse_checksum(checksum)

object.__setattr__(self, "level", level_parsed)
object.__setattr__(self, "checksum", checksum_parsed)

@classmethod
def from_dict(cls, data: dict[str, JSON]) -> Self:
_, configuration_parsed = parse_named_configuration(data, "zstd")
return cls(**configuration_parsed) # type: ignore[arg-type]

def to_dict(self) -> dict[str, JSON]:
return {
"name": "zstd",
"configuration": {"level": self.level, "checksum": self.checksum},
}

@cached_property
def _zstd_codec(self) -> nvcomp.Codec:
device = cp.cuda.Device() # Select the current default device
stream = cp.cuda.get_current_stream() # Use the current default stream
return nvcomp.Codec(
algorithm="Zstd",
bitstream_kind=nvcomp.BitstreamKind.RAW,
device_id=device.id,
cuda_stream=stream.ptr,
)

def _convert_to_nvcomp_arrays(
self,
chunks_and_specs: Iterable[tuple[Buffer | None, ArraySpec]],
) -> tuple[list[nvcomp.Array], list[int]]:
none_indices = [i for i, (b, _) in enumerate(chunks_and_specs) if b is None]
filtered_inputs = [b.as_array_like() for b, _ in chunks_and_specs if b is not None]
# TODO: add CUDA stream here
return nvcomp.as_arrays(filtered_inputs), none_indices

def _convert_from_nvcomp_arrays(
self,
arrays: Iterable[nvcomp.Array],
chunks_and_specs: Iterable[tuple[Buffer | None, ArraySpec]],
) -> Iterable[Buffer | None]:
return [
spec.prototype.buffer.from_array_like(cp.array(a, dtype=np.dtype("B"), copy=False))
if a
else None
for a, (_, spec) in zip(arrays, chunks_and_specs, strict=True)
]

async def decode(
self,
chunks_and_specs: Iterable[tuple[Buffer | None, ArraySpec]],
) -> Iterable[Buffer | None]:
"""Decodes a batch of chunks.
Chunks can be None in which case they are ignored by the codec.

Parameters
----------
chunks_and_specs : Iterable[tuple[Buffer | None, ArraySpec]]
Ordered set of encoded chunks with their accompanying chunk spec.

Returns
-------
Iterable[Buffer | None]
"""
chunks_and_specs = list(chunks_and_specs)

# Convert to nvcomp arrays
filtered_inputs, none_indices = self._convert_to_nvcomp_arrays(chunks_and_specs)

outputs = self._zstd_codec.decode(filtered_inputs) if len(filtered_inputs) > 0 else []

# Record an event on the current stream so the pending decode work can be awaited
event = cp.cuda.Event()
event.record()
# Wait for decode to complete in a separate async thread
await asyncio.to_thread(event.synchronize)

for index in none_indices:
outputs.insert(index, None)

return self._convert_from_nvcomp_arrays(outputs, chunks_and_specs)

async def encode(
self,
chunks_and_specs: Iterable[tuple[Buffer | None, ArraySpec]],
) -> Iterable[Buffer | None]:
"""Encodes a batch of chunks.
Chunks can be None in which case they are ignored by the codec.

Parameters
----------
chunks_and_specs : Iterable[tuple[Buffer | None, ArraySpec]]
Ordered set of to-be-encoded chunks with their accompanying chunk spec.

Returns
-------
Iterable[Buffer | None]
"""
# TODO: Make this actually async
chunks_and_specs = list(chunks_and_specs)

# Convert to nvcomp arrays
filtered_inputs, none_indices = self._convert_to_nvcomp_arrays(chunks_and_specs)

outputs = self._zstd_codec.encode(filtered_inputs) if len(filtered_inputs) > 0 else []

# Record an event on the current stream so the pending encode work can be awaited
event = cp.cuda.Event()
event.record()
# Wait for encode to complete in a separate async thread
await asyncio.to_thread(event.synchronize)

for index in none_indices:
outputs.insert(index, None)

return self._convert_from_nvcomp_arrays(outputs, chunks_and_specs)

def compute_encoded_size(self, _input_byte_length: int, _chunk_spec: ArraySpec) -> int:
raise NotImplementedError


register_codec("zstd", NvcompZstdCodec)
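
To make the flow above concrete, a sketch of the underlying nvcomp calls that `_zstd_codec`, `encode`, and `decode` wrap (assumes a CUDA device with `cupy` and `nvidia-nvcomp` importable; this is an illustration, not part of the diff):

```python
# Rough sketch of the batched nvcomp round trip used by NvcompZstdCodec.
import cupy as cp
from nvidia import nvcomp

codec = nvcomp.Codec(
    algorithm="Zstd",
    bitstream_kind=nvcomp.BitstreamKind.RAW,
    device_id=cp.cuda.Device().id,
    cuda_stream=cp.cuda.get_current_stream().ptr,
)

# A batch of byte chunks already resident in device memory.
chunks = [cp.random.randint(0, 255, size=4096, dtype=cp.uint8) for _ in range(4)]
arrays = nvcomp.as_arrays(chunks)        # wrap device buffers without copying
encoded = codec.encode(arrays)           # batched Zstd compression on the GPU
decoded = codec.decode(encoded)          # batched decompression on the GPU

# Synchronize before comparing on the host, mirroring the event-based wait above.
cp.cuda.get_current_stream().synchronize()
roundtrip = cp.array(decoded[0], dtype=cp.uint8, copy=False)
assert bool((roundtrip == chunks[0]).all())
```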
6 changes: 3 additions & 3 deletions src/zarr/core/array.py
@@ -28,7 +28,6 @@
from zarr.codecs._v2 import V2Codec
from zarr.codecs.bytes import BytesCodec
from zarr.codecs.vlen_utf8 import VLenBytesCodec, VLenUTF8Codec
from zarr.codecs.zstd import ZstdCodec
from zarr.core._info import ArrayInfo
from zarr.core.array_spec import ArrayConfig, ArrayConfigLike, parse_array_config
from zarr.core.attributes import Attributes
@@ -128,6 +127,7 @@
_parse_array_array_codec,
_parse_array_bytes_codec,
_parse_bytes_bytes_codec,
get_codec_class,
get_pipeline_class,
)
from zarr.storage._common import StorePath, ensure_no_existing_node, make_store_path
@@ -5036,9 +5036,9 @@ def default_compressors_v3(dtype: ZDType[Any, Any]) -> tuple[BytesBytesCodec, ..
"""
Given a data type, return the default compressors for that data type.

This is just a tuple containing ``ZstdCodec``
This is just a tuple containing an instance of the default "zstd" codec class.
"""
return (ZstdCodec(),)
return (cast(BytesBytesCodec, get_codec_class("zstd")()),)
Review comment (Contributor): Why is the extra cast needed now?

Review comment (Contributor): get_codec_class returns type[Codec] but this function specifically returns a tuple[BytesBytesCodec].



def default_serializer_v3(dtype: ZDType[Any, Any]) -> ArrayBytesCodec:
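
A small sketch of the lookup that `default_compressors_v3` now performs; the behavior under `enable_gpu()` follows the registry/config description in the docs above:

```python
# Sketch: the "zstd" codec name resolves to whichever implementation the
# current config selects.
import zarr
from zarr.registry import get_codec_class

print(get_codec_class("zstd"))      # default CPU implementation
with zarr.config.enable_gpu():
    print(get_codec_class("zstd"))  # expected: zarr.codecs.gpu.NvcompZstdCodec
```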
28 changes: 13 additions & 15 deletions src/zarr/core/buffer/gpu.py
@@ -8,9 +8,6 @@
cast,
)

import numpy as np
import numpy.typing as npt

from zarr.core.buffer import core
from zarr.core.buffer.core import ArrayLike, BufferPrototype, NDArrayLike
from zarr.errors import ZarrUserWarning
@@ -23,8 +20,9 @@
from collections.abc import Iterable
from typing import Self

from zarr.core.common import BytesLike
import numpy.typing as npt

from zarr.core.common import BytesLike
try:
import cupy as cp
except ImportError:
@@ -54,14 +52,14 @@ class Buffer(core.Buffer):

def __init__(self, array_like: ArrayLike) -> None:
if cp is None:
raise ImportError(
raise ImportError( # pragma: no cover
"Cannot use zarr.buffer.gpu.Buffer without cupy. Please install cupy."
)

if array_like.ndim != 1:
raise ValueError("array_like: only 1-dim allowed")
if array_like.dtype != np.dtype("B"):
raise ValueError("array_like: only byte dtype allowed")
if array_like.dtype.itemsize != 1:
Review comment (Contributor): The new tests in test_nvcomp.py were failing without this change.

I'd like to get us to a point where we don't care as much about the details of the buffer passed in here. This is an OK start I think.

Review comment (Contributor): what exactly does this check for? It's not clear to me why any numpy array that can be viewed as bytes wouldn't be allowed in here

Review comment (Contributor): and same for the dimensionality check, since any N-dimensional numpy array can be viewed as a 1D array.

Review comment (Contributor): Yeah, I'm not really sure...

I agree that the actual data we store internally here needs to be a byte dtype. Just doing cp.asarray(input).view("b") seems pretty reasonable to me.

Review comment (Contributor): i'm not even convinced that we need Buffer / NDBuffer, when Buffer is just a special case of NDBuffer where there's 1 dimension and the data type is bytes

Review comment (Contributor): we could even express this formally by:

  • making NDBuffer generic with two type parameters (number of dimensions and dtype)
  • having APIs that insist on consuming a Buffer instead insist on consuming NDBuffer[Literal[1], np.uint8]

(super out of scope for this PR ofc)

Review comment (Contributor, author): "It's not clear to me why any numpy array that can be viewed as bytes wouldn't be allowed in here"

I think this is mainly because NDBuffer objects don't need to be contiguous, but Buffer objects must be contiguous in memory which might be important when we send those out to codecs that expect a contiguous memory slice.

But I agree that we can probably merge those two and make Buffer a specialization of NDBuffer.

raise ValueError("array_like: only dtypes with itemsize=1 allowed")

if not hasattr(array_like, "__cuda_array_interface__"):
# Slow copy based path for arrays that don't support the __cuda_array_interface__
@@ -108,13 +106,13 @@ def as_numpy_array(self) -> npt.NDArray[Any]:
return cast("npt.NDArray[Any]", cp.asnumpy(self._data))

def __add__(self, other: core.Buffer) -> Self:
other_array = other.as_array_like()
assert other_array.dtype == np.dtype("B")
gpu_other = Buffer(other_array)
gpu_other_array = gpu_other.as_array_like()
return self.__class__(
cp.concatenate((cp.asanyarray(self._data), cp.asanyarray(gpu_other_array)))
)
other_array = cp.asanyarray(other.as_array_like())
left = self._data
if left.dtype != other_array.dtype:
other_array = other_array.view(left.dtype)

buffer = cp.concatenate([left, other_array])
return type(self)(buffer)


class NDBuffer(core.NDBuffer):
@@ -144,7 +142,7 @@ class NDBuffer(core.NDBuffer):

def __init__(self, array: NDArrayLike) -> None:
if cp is None:
raise ImportError(
raise ImportError( # pragma: no cover
"Cannot use zarr.buffer.gpu.NDBuffer without cupy. Please install cupy."
)

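
A sketch of the relaxed `Buffer` dtype handling and the reworked `__add__` (assumes `cupy` is installed; the behavior follows the diff above):

```python
# Sketch: any 1-D device array with an itemsize-1 dtype is accepted; __add__
# views the other buffer's bytes as this buffer's dtype before concatenating.
import cupy as cp
from zarr.core.buffer import gpu

a = gpu.Buffer(cp.zeros(16, dtype="uint8"))
b = gpu.Buffer(cp.ones(16, dtype="int8"))  # non-"B" byte-sized dtype is now allowed
combined = a + b
print(combined.as_array_like().nbytes)     # 32
```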