# Introduction to Zarr

This notebook provides a brief introduction to Zarr and how to
use it in cloud environments for scalable, chunked, and compressed data storage.
Zarr is a storage format with implementations in several languages. In this tutorial, we look at some features of the `zarr-python` library and at how Zarr stores can be opened with `xarray`.

## What is Zarr?

The Zarr data format is an open, community-maintained format designed for efficient, scalable storage of large N-dimensional arrays. It stores data as compressed and chunked arrays in a format well-suited to parallel processing and cloud-native workflows.

### Zarr Data Organization:
- **Arrays**: N-dimensional arrays that can be chunked and compressed.
- **Groups**: A container for organizing multiple arrays and other groups with a hierarchical structure.
- **Metadata**: JSON-like metadata describing the arrays and groups, including information about dimensions, data types, and compression.
- **Dimensions and Shape**: Arrays can have any number of dimensions, and their shape is defined by the number of elements in each dimension.
- **Coordinates & Indexing**: Zarr supports coordinate arrays for each dimension, allowing for efficient indexing and slicing.

The diagram below shows the structure of a Zarr store:
![Structure of a Zarr store](https://learning.nceas.ucsb.edu/2025-04-arctic/images/zarr-chunks.png)


### Zarr Fundamentals
A Zarr array has the following important properties:
- **Shape**: The dimensions of the array.
- **Dtype**: The data type of each element (e.g., float32).
- **Attributes**: Metadata stored as key-value pairs (e.g., units, description).
- **Compressors**: Algorithms used to compress each chunk (e.g., Blosc, Zlib).


#### Example: Creating and Inspecting a Zarr Array

In [1]:
import zarr
# overwrite=True replaces any array left in test.zarr by an earlier run,
# which would otherwise raise ContainsArrayError.
z = zarr.create(shape=(40, 50), chunks=(10, 10), dtype='f8', store='test.zarr', overwrite=True)
z

In [None]:
z.info

Type               : Array
Zarr format        : 3
Data type          : DataType.float64
Fill value         : 0.0
Shape              : (40, 50)
Chunk shape        : (10, 10)
Order              : C
Read-only          : False
Store type         : LocalStore
Filters            : ()
Serializer         : BytesCodec(endian=<Endian.little: 'little'>)
Compressors        : (ZstdCodec(level=0, checksum=False),)
No. bytes          : 16000 (15.6K)
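
The sizes reported by `z.info` follow directly from the shape, chunk shape, and data type. The sketch below reproduces that arithmetic in plain Python; the variable names are illustrative and are not part of the `zarr-python` API.

```python
import math

shape = (40, 50)    # array shape from the example above
chunks = (10, 10)   # chunk shape from the example above
itemsize = 8        # bytes per float64 element

# Number of chunks along each dimension (a trailing chunk may be partial).
grid = tuple(math.ceil(s / c) for s, c in zip(shape, chunks))
n_chunks = math.prod(grid)

# Uncompressed size of the whole array and of a single chunk, in bytes.
nbytes = math.prod(shape) * itemsize
chunk_nbytes = math.prod(chunks) * itemsize

print(grid, n_chunks, nbytes, chunk_nbytes)  # (4, 5) 20 16000 800
```

The 16000 bytes match the `No. bytes` line above, and the array is split into 20 chunks of 800 uncompressed bytes each.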

In [None]:
z.fill_value

np.float64(0.0)

No data has been written to the array yet. If we try to access the data, we get the fill value:

In [None]:
z[0, 0]


array(0.)
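
One way to picture this behaviour: a chunk that has never been written does not exist in the store, and reads that land in it return the fill value instead. The following is a minimal conceptual sketch, not how `zarr-python` is implemented; `stored_chunks` and `read_element` are hypothetical names.

```python
fill_value = 0.0
chunk_shape = (10, 10)
stored_chunks = {}  # hypothetical stand-in for a store: chunk index -> 2D list


def read_element(i, j):
    """Return element (i, j), falling back to the fill value for missing chunks."""
    key = (i // chunk_shape[0], j // chunk_shape[1])
    chunk = stored_chunks.get(key)
    if chunk is None:
        return fill_value
    return chunk[i % chunk_shape[0]][j % chunk_shape[1]]


before = read_element(0, 0)                               # no chunk stored yet
stored_chunks[(0, 0)] = [[1.0] * 10 for _ in range(10)]   # "write" chunk (0, 0)
after = read_element(0, 0)                                # now read from the chunk
elsewhere = read_element(35, 42)                          # chunk (3, 4) is still missing

print(before, after, elsewhere)  # 0.0 1.0 0.0
```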

Data is assigned with NumPy-style indexing. Each assignment is written to the store immediately.

In [None]:
import numpy as np
z[:] = 1
z[0, :] = np.arange(50)
z[:]

array([[ 0.,  1.,  2., ..., 47., 48., 49.],
       [ 1.,  1.,  1., ...,  1.,  1.,  1.],
       [ 1.,  1.,  1., ...,  1.,  1.,  1.],
       ...,
       [ 1.,  1.,  1., ...,  1.,  1.,  1.],
       [ 1.,  1.,  1., ...,  1.,  1.,  1.],
       [ 1.,  1.,  1., ...,  1.,  1.,  1.]], shape=(40, 50))

##### Attributes

We can attach arbitrary metadata to our Array via attributes:

In [None]:
z.attrs['units'] = 'm/s'
z.attrs['standard_name'] = 'wind_speed'
print(dict(z.attrs))

{'units': 'm/s', 'standard_name': 'wind_speed'}
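
Because attributes are stored as JSON next to the array metadata, their values need to be JSON-serializable. The sketch below uses only the standard `json` module to illustrate that constraint; the surrounding document layout is simplified and the `levels` attribute is a made-up example.

```python
import json

attrs = {"units": "m/s", "standard_name": "wind_speed"}

# Round-trip the attributes through JSON, as a store would on write and read.
encoded = json.dumps({"attributes": attrs}, indent=2)
decoded = json.loads(encoded)["attributes"]

# A Python set has no JSON representation, so it cannot be stored directly.
try:
    json.dumps({"attributes": {"levels": {1, 2, 3}}})
    serializable = True
except TypeError:
    serializable = False

print(decoded == attrs, serializable)  # True False
```

Converting such values to lists or strings before assigning them is usually enough.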


### Zarr Data Storage

Zarr can be stored in memory, on disk, or in cloud storage systems like Amazon S3.

Let's look under the hood. _The ability to look inside a Zarr store and understand what is there is a deliberate design decision._

In [None]:
z.store

LocalStore('file://test.zarr')

In [None]:
!tree -a test.zarr | head

test.zarr
├── c
│   ├── 0
│   │   ├── 0
│   │   ├── 1
│   │   ├── 2
│   │   ├── 3
│   │   └── 4
│   ├── 1
│   │   ├── 0
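
The directory names in this listing are chunk coordinates: each chunk lives under `c/<row>/<column>`. As a rough sketch, assuming the default chunk key encoding with a `/` separator, the key that holds a given element can be computed as follows (`chunk_key` is an illustrative helper, not a `zarr-python` function).

```python
def chunk_key(index, chunk_shape, separator="/"):
    """Return the storage key of the chunk that contains the element at `index`."""
    chunk_coords = [i // c for i, c in zip(index, chunk_shape)]
    return separator.join(["c"] + [str(k) for k in chunk_coords])


key = chunk_key((5, 37), (10, 10))        # element in row 5, column 37
last_key = chunk_key((39, 49), (10, 10))  # last element of the (40, 50) array

print(key, last_key)  # c/0/3 c/3/4
```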


To create groups in your store, use the `create_group` method after creating a root group. Here, we’ll create two groups, `temp` and `precip`.

In [None]:
root = zarr.group()
temp = root.create_group('temp')
precip = root.create_group('precip')
t2m = temp.create_array('t2m', shape=(100,100), chunks=(10,10), dtype='i4')
prcp = precip.create_array('prcp', shape=(1000,1000), chunks=(10,10), dtype='i4')

Groups and the arrays inside them can be accessed by name or path, and arrays can then be indexed as usual.



In [None]:
root['temp']
root['temp/t2m'][:, 3]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)

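To get an overview of your dataset, use the `tree` and `info` methods. Note that `Group.tree` requires the optional `rich` package (`pip install rich`); without it, the call raises an `ImportError`, as shown below.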



In [None]:
root.tree()


ImportError: 'rich' is required for Group.tree

In [None]:

root.info

Name        : 
Type        : Group
Zarr format : 3
Read-only   : False
Store type  : MemoryStore

#### Compressors
A number of different compressors can be used with Zarr. The built-in options include Blosc, Zstandard, and Gzip. Additional compressors are available through the [NumCodecs](https://numcodecs.readthedocs.io) package, which supports LZ4, Zlib, BZ2, and LZMA. 

Let's check the compressor we used when creating the array:

In [None]:
z.compressors

(ZstdCodec(level=0, checksum=False),)

If you don’t specify a compressor, by default Zarr uses the Zstandard compressor.

How much space was saved by compression?


In [None]:
z.info_complete()

Type               : Array
Zarr format        : 3
Data type          : DataType.float64
Fill value         : 0.0
Shape              : (40, 50)
Chunk shape        : (10, 10)
Order              : C
Read-only          : False
Store type         : LocalStore
Filters            : ()
Serializer         : BytesCodec(endian=<Endian.little: 'little'>)
Compressors        : (ZstdCodec(level=0, checksum=False),)
No. bytes          : 16000 (15.6K)
No. bytes stored   : 1216
Storage ratio      : 13.2
Chunks Initialized : 20
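
The storage ratio is the uncompressed size divided by the number of bytes actually stored. The sketch below checks the reported figure and compresses a comparable buffer with the standard library's `zlib`, used here only as a stand-in for Zstandard; it also compresses the whole buffer at once, whereas Zarr compresses each chunk separately, so the resulting ratio will differ from the one above.

```python
import struct
import zlib

# Ratio reported by info_complete(): uncompressed bytes / stored bytes.
reported_ratio = 16000 / 1216

# Build 40 * 50 little-endian float64 values, all equal to 1.0.
n_elements = 40 * 50
raw = struct.pack(f"<{n_elements}d", *([1.0] * n_elements))

# Compress the buffer and compute the same kind of ratio.
compressed = zlib.compress(raw)
ratio = len(raw) / len(compressed)

print(round(reported_ratio, 1), len(raw), ratio > 1)  # 13.2 16000 True
```

Highly repetitive data like this compresses very well; real measurements usually compress less.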

You can pass `compressors=None` to `create_array` when creating a Zarr array to turn off compression. This is useful for debugging or when you want to store data without compression.

```{info}
`.info_complete()` extends the `.info` summary with the number of bytes stored, the storage ratio, and the number of initialized chunks. Because it has to inspect the stored chunks, it is slower for larger arrays.
```

#### Consolidated Metadata
Zarr supports consolidated metadata, which allows you to store all metadata in a single file. This can improve performance when reading metadata, especially for large datasets.
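
To see why this helps, consider how many reads are needed to discover every array in a group. The sketch below uses a hypothetical dict-backed `CountingStore` and a simplified metadata layout (the `name` field is made up for illustration); it is a conceptual model, not the `zarr-python` implementation.

```python
import json


class CountingStore:
    """Hypothetical dict-backed store that counts read requests."""

    def __init__(self):
        self.data = {}
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self.data[key]


counting_store = CountingStore()
array_names = ["a", "b", "c"]
counting_store.data["zarr.json"] = json.dumps({"node_type": "group"})
for name in array_names:
    doc = {"node_type": "array", "name": name}
    counting_store.data[f"{name}/zarr.json"] = json.dumps(doc)

# Without consolidation: one read for the group plus one per array.
json.loads(counting_store.get("zarr.json"))
separate = {n: json.loads(counting_store.get(f"{n}/zarr.json")) for n in array_names}
reads_without = counting_store.reads

# Consolidate: copy every child's metadata into the root document.
root_doc = {"node_type": "group", "consolidated_metadata": {"metadata": separate}}
counting_store.data["zarr.json"] = json.dumps(root_doc)

# With consolidation: a single read returns the metadata of all arrays.
counting_store.reads = 0
root = json.loads(counting_store.get("zarr.json"))
consolidated_view = root["consolidated_metadata"]["metadata"]
reads_with = counting_store.reads

print(reads_without, reads_with)  # 4 1
```

On object storage each read is a network round trip, so the saving grows with the number of arrays and groups.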

So far we have only worked with single-array Zarr stores. In this next example, we will create a Zarr store with multiple arrays and then consolidate its metadata. The speedup is most significant with remote storage, which we will see in the following example on accessing cloud storage.

In [None]:
store = zarr.storage.MemoryStore()
group = zarr.create_group(store=store)
group.create_array(shape=(1,), name='a', dtype='float64')
group.create_array(shape=(2, 2), name='b', dtype='float64')
group.create_array(shape=(3, 3, 3), name='c', dtype='float64')
zarr.consolidate_metadata(store)



<Group memory://5213924096>

Now, if we open that group, the Group’s metadata has a `zarr.core.group.ConsolidatedMetadata` that can be used:

In [None]:
consolidated = zarr.open_group(store=store)
consolidated_metadata = consolidated.metadata.consolidated_metadata.metadata
from pprint import pprint
pprint(dict(sorted(consolidated_metadata.items())))

{'a': ArrayV3Metadata(shape=(1,),
                      data_type=<DataType.float64: 'float64'>,
                      chunk_grid=RegularChunkGrid(chunk_shape=(1,)),
                      chunk_key_encoding=DefaultChunkKeyEncoding(name='default',
                                                                 separator='/'),
                      fill_value=np.float64(0.0),
                      codecs=(BytesCodec(endian=<Endian.little: 'little'>),
                              ZstdCodec(level=0, checksum=False)),
                      attributes={},
                      dimension_names=None,
                      zarr_format=3,
                      node_type='array',
                      storage_transformers=()),
 'b': ArrayV3Metadata(shape=(2, 2),
                      data_type=<DataType.float64: 'float64'>,
                      chunk_grid=RegularChunkGrid(chunk_shape=(2, 2)),
                      chunk_key_encoding=DefaultChunkKeyEncoding(name='default',
                     

## Object Storage as a Zarr Store

Zarr’s layout (many files/chunks per array) maps naturally onto object storage, such as Amazon S3, Google Cloud Storage, or Azure Blob Storage. Each chunk is stored as a separate object, so different chunks can be read or written independently and in parallel.
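
Because every chunk is its own object, a client can request several chunks at the same time. The following is a minimal sketch using a thread pool and a plain dictionary as a hypothetical stand-in for a bucket; a real client would issue one GET request per key in `fetch`.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a bucket: object key -> chunk bytes.
bucket = {f"c/{i}/{j}": bytes([i, j]) for i in range(4) for j in range(5)}


def fetch(key):
    """Fetch one chunk object and return it together with its key."""
    return key, bucket[key]


# Reading the first 10 rows of a (40, 50) array with (10, 10) chunks
# only needs the five chunks in chunk row 0.
wanted = [f"c/0/{j}" for j in range(5)]

with ThreadPoolExecutor(max_workers=4) as pool:
    fetched = dict(pool.map(fetch, wanted))

print(sorted(fetched))
```

Only 5 of the 20 objects are touched, and none of the requests has to wait for another.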



Here are some examples of Zarr stores on the cloud:

* [Zarr data in Microsoft's Planetary Computer](https://planetarycomputer.microsoft.com/catalog?filter=zarr)
* [Zarr data from Google](https://console.cloud.google.com/marketplace/browse?filter=solution-type:dataset&_ga=2.226354714.1000882083.1692116148-1788942020.1692116148&pli=1&q=zarr)
* [Amazon Sustainability Data Initiative available from Registry of Open Data on AWS](https://registry.opendata.aws/collab/asdi/) - Enter "Zarr" in the Search input box.
* [Pangeo-Forge Data Catalog](https://pangeo-forge.org/catalog)


### Xarray and Zarr

Xarray has built-in support for reading and writing Zarr data. You can open a Zarr store as an Xarray dataset with the `xarray.open_zarr()` function or, as in the cell below, with `xarray.open_dataset()` and `engine='zarr'`.



In [15]:
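import xarray as xr

store = 'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/gpcp-feedstock/gpcp.zarr'

# chunks={} loads each variable lazily as a Dask array using the store's own chunking
ds = xr.open_dataset(store, engine='zarr', chunks={}, consolidated=True)
ds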

Dask array summaries from the dataset display, in the order shown:

| Shape | Chunk shape | Bytes (array) | Bytes (chunk) | Dask graph | Data type |
|---|---|---|---|---|---|
| (180, 2) | (180, 2) | 1.41 kiB | 1.41 kiB | 1 chunks in 2 graph layers | float32 numpy.ndarray |
| (360, 2) | (360, 2) | 2.81 kiB | 2.81 kiB | 1 chunks in 2 graph layers | float32 numpy.ndarray |
| (9226, 2) | (200, 2) | 144.16 kiB | 3.12 kiB | 47 chunks in 2 graph layers | datetime64[ns] numpy.ndarray |
| (9226, 180, 360) | (200, 180, 360) | 2.23 GiB | 49.44 MiB | 47 chunks in 2 graph layers | float32 numpy.ndarray |


In [13]:
ds.precip

| Shape | Chunk shape | Bytes (array) | Bytes (chunk) | Dask graph | Data type |
|---|---|---|---|---|---|
| (9226, 180, 360) | (200, 180, 360) | 2.23 GiB | 49.44 MiB | 47 chunks in 2 graph layers | float32 numpy.ndarray |
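
The chunk layout reported for `precip` determines how much data a selection has to read. As a rough sketch in plain Python (the helper `time_chunks_touched` is illustrative, not part of Xarray or Zarr), the number of chunks touched by a slice along the time dimension can be worked out like this:

```python
import math

shape = (9226, 180, 360)   # (time, latitude, longitude) from the display above
chunks = (200, 180, 360)   # chunk shape from the display above
itemsize = 4               # bytes per float32 element

n_time_chunks = math.ceil(shape[0] / chunks[0])
chunk_mib = math.prod(chunks) * itemsize / 2**20


def time_chunks_touched(start, stop, chunk_len=chunks[0]):
    """Return the chunk indices along time that overlap the range [start, stop)."""
    return range(start // chunk_len, (stop - 1) // chunk_len + 1)


one_step = list(time_chunks_touched(0, 1))            # a single time step
first_365 = list(time_chunks_touched(0, 365))         # the first 365 time steps
all_steps = len(time_chunks_touched(0, shape[0]))     # the full time axis

print(n_time_chunks, round(chunk_mib, 2), one_step, first_365, all_steps)
# 47 49.44 [0] [0, 1] 47
```

Reading one time step still pulls in a whole chunk of about 49 MiB, while a reduction over the full time axis has to read all 47 chunks.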


::::{admonition} Exercise
:class: tip

Can you calculate the mean precipitation over the time dimension in the GPCP dataset and plot it?

:::{admonition} Solution
:class: dropdown

```python
ds.precip.mean(dim='time').plot()

```
:::
::::

In the next exercise, you will use Xarray and Zarr to open a CMIP6 dataset.

## Additional Resources

- [Zarr Documentation](https://zarr.readthedocs.io/en/stable/)
- [Cloud Optimized Geospatial Formats](https://guide.cloudnativegeo.org/zarr/zarr-in-practice.html)
- [Scalable and Computationally Reproducible Approaches to Arctic Research](https://learning.nceas.ucsb.edu/2025-04-arctic/sections/zarr.html)
