# Intro to Zarr

This notebook gives a brief introduction to Zarr and how to use it in cloud environments for scalable, chunked, and compressed data storage.
Zarr is a storage format with implementations in several languages. This tutorial walks through the format using the `zarr-python` library and shows how Zarr files can be opened with `xarray`.

## What is Zarr?

The Zarr data format is an open, community-maintained format designed for efficient, scalable storage of large N-dimensional arrays. It stores data as compressed and chunked arrays in a format well-suited to parallel processing and cloud-native workflows.

### Zarr Data Organization:
- **Arrays**: N-dimensional arrays that can be chunked and compressed.
- **Groups**: A container for organizing multiple arrays and other groups with a hierarchical structure.
- **Metadata**: JSON-like metadata describing the arrays and groups, including information about dimensions, data types, and compression.
- **Dimensions and Shape**: Arrays can have any number of dimensions, and their shape is defined by the number of elements in each dimension.
- **Coordinates & Indexing**: Coordinate arrays for each dimension can be stored alongside the data (a convention used by tools such as `xarray`), and arrays support efficient indexing and slicing.

The diagram below shows the structure of a Zarr file:
![EarthData](https://learning.nceas.ucsb.edu/2025-04-arctic/images/zarr-chunks.png)


### Zarr Fundamentals
A Zarr array has the following important properties:
- **Shape**: The dimensions of the array.
- **Dtype**: The data type of each element (e.g., float32).
- **Attributes**: Metadata stored as key-value pairs (e.g., units, description).
- **Compressors**: Algorithms used to compress each chunk (e.g., Blosc, Zlib).


#### Example: Creating and Inspecting a Zarr Array

In [1]:
import zarr

# Create a 40x50 float64 array split into 10x10 chunks, stored in the directory test.zarr
z = zarr.create(shape=(40, 50), chunks=(10, 10), dtype='f8', store='test.zarr')
z

<zarr.core.Array (40, 50) float64>

In [2]:
z.info

| Property | Value |
|---|---|
| Type | zarr.core.Array |
| Data type | float64 |
| Shape | (40, 50) |
| Chunk shape | (10, 10) |
| Order | C |
| Read-only | False |
| Compressor | Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0) |
| Store type | zarr.storage.DirectoryStore |
| No. bytes | 16000 (15.6K) |
| No. bytes stored | 337 |
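
The `No. bytes` figure and the size of the chunk grid follow directly from the shape, chunk shape, and data type reported above. The arithmetic can be sketched in plain Python (no Zarr required; the variable names are illustrative only):

```python
import math

shape = (40, 50)    # array shape reported by z.info
chunks = (10, 10)   # chunk shape reported by z.info
itemsize = 8        # bytes per float64 element

# Uncompressed size: number of elements times bytes per element.
nbytes = math.prod(shape) * itemsize

# Chunks per dimension, rounded up so partial edge chunks are counted.
grid = tuple(math.ceil(s / c) for s, c in zip(shape, chunks))
n_chunks = math.prod(grid)

print(nbytes, f"({nbytes / 1024:.1f}K)")  # 16000 (15.6K)
print(grid, n_chunks)                     # (4, 5) 20
```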


In [3]:
z.fill_value

0.0

No data has been written to the array yet. If we try to access the data, we get the fill value:

In [4]:
z[0, 0]


0.0
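
One way to picture this behaviour: a chunk that has never been written has no entry in the store, and the reader substitutes the fill value. The sketch below models that idea with a plain dictionary. The `read_element` helper and the dictionary layout are hypothetical and are not `zarr-python` internals.

```python
# Toy model: the store maps chunk keys to chunk contents.
# Nothing has been written yet, so the store holds no chunks.
store = {}
fill_value = 0.0
chunks = (10, 10)

def read_element(store, i, j):
    """Return element (i, j), falling back to the fill value for missing chunks."""
    key = f"{i // chunks[0]}.{j // chunks[1]}"
    chunk = store.get(key)
    if chunk is None:
        return fill_value
    return chunk[i % chunks[0]][j % chunks[1]]

before_write = read_element(store, 0, 0)
print(before_write)  # 0.0

# After "writing" chunk 0.0, reads inside it return stored data,
# while chunks that were never written still yield the fill value.
store["0.0"] = [[1.0] * chunks[1] for _ in range(chunks[0])]
print(read_element(store, 0, 0))    # 1.0
print(read_element(store, 35, 42))  # 0.0
```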

Data is assigned to the array with NumPy-style indexing, and the assignment is written to the store immediately.

In [6]:
z[:] = 1
z.info

| Property | Value |
|---|---|
| Type | zarr.core.Array |
| Data type | float64 |
| Shape | (40, 50) |
| Chunk shape | (10, 10) |
| Order | C |
| Read-only | False |
| Compressor | Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0) |
| Store type | zarr.storage.DirectoryStore |
| No. bytes | 16000 (15.6K) |
| No. bytes stored | 1277 (1.2K) |
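
The gap between `No. bytes` (16000) and `No. bytes stored` (1277) comes from per-chunk compression. The sketch below illustrates the effect on a single 10x10 chunk of ones using `zlib` from the Python standard library. Note that this swaps in zlib for the Blosc/LZ4 compressor the array above actually uses, so the exact byte counts will differ.

```python
import struct
import zlib

# One 10x10 chunk of float64 ones, serialized as little-endian bytes ('<f8').
n_elements = 10 * 10
raw = struct.pack(f"<{n_elements}d", *([1.0] * n_elements))

# Compress the chunk; highly repetitive data shrinks considerably.
compressed = zlib.compress(raw, 5)

print(len(raw))                    # 800
print(len(compressed) < len(raw))  # True

# Decompressing recovers the original bytes exactly (lossless).
restored = zlib.decompress(compressed)
```

Because each chunk is compressed independently, a reader only needs to fetch and decompress the chunks that overlap the requested slice.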


##### Attributes

We can attach arbitrary metadata to our Array via attributes:

In [8]:
z.attrs['units'] = 'm/s'
z.attrs['standard_name'] = 'wind_speed'
print(dict(z.attrs))

{'standard_name': 'wind_speed', 'units': 'm/s'}


### Zarr Data Storage

Zarr data can be stored in memory, on disk, or in cloud object storage such as Amazon S3.

Let's look under the hood. _The ability to look inside a Zarr store and understand what is there is a deliberate design decision._

In [9]:
z.store

<zarr.storage.DirectoryStore at 0x107530650>

In [10]:
!tree -a test.zarr | head

test.zarr
├── .zarray
├── .zattrs
├── 0.0
├── 0.1
├── 0.2
├── 0.3
├── 0.4
├── 1.0
├── 1.1
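
The file names in the listing are chunk keys: in this store, each chunk is named by its position in the chunk grid, with the indices joined by a `.`. As a rough sketch, the full set of expected keys for a (40, 50) array with (10, 10) chunks can be enumerated in plain Python (the variable names are illustrative):

```python
import itertools
import math

shape = (40, 50)
chunks = (10, 10)

# Number of chunks along each dimension.
grid = [math.ceil(s / c) for s, c in zip(shape, chunks)]

# Join the grid indices of each chunk with "." to form its key.
keys = [".".join(map(str, idx)) for idx in itertools.product(*map(range, grid))]

print(len(keys))  # 20
print(keys[:7])   # ['0.0', '0.1', '0.2', '0.3', '0.4', '1.0', '1.1']
```

The first seven keys match the chunk files shown by `tree`, which is truncated by `head`.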


In [11]:
import json
with open('test.zarr/.zarray') as fp:
    print(json.load(fp))

{'chunks': [10, 10], 'compressor': {'blocksize': 0, 'clevel': 5, 'cname': 'lz4', 'id': 'blosc', 'shuffle': 1}, 'dtype': '<f8', 'fill_value': 0.0, 'filters': None, 'order': 'C', 'shape': [40, 50], 'zarr_format': 2}
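
The `dtype` entry `'<f8'` uses NumPy's type-string notation: a byte-order character, a kind character, and an item size in bytes. A minimal sketch of decoding it by hand, using a trimmed copy of the metadata above (the variable names and the byte-order lookup table are illustrative):

```python
import json

# Metadata copied from the .zarray output above, trimmed to the fields used here.
meta = json.loads('{"chunks": [10, 10], "dtype": "<f8", "shape": [40, 50], "zarr_format": 2}')

# First character: byte order. Second: kind ('f' is floating point). Rest: bytes per element.
byte_order = {"<": "little-endian", ">": "big-endian", "|": "not applicable"}[meta["dtype"][0]]
kind = meta["dtype"][1]
itemsize = int(meta["dtype"][2:])

# Uncompressed size of one full chunk.
chunk_nbytes = meta["chunks"][0] * meta["chunks"][1] * itemsize

print(byte_order, kind, itemsize)  # little-endian f 8
print(chunk_nbytes)                # 800
```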


In [12]:
with open('test.zarr/.zattrs') as fp:
    print(json.load(fp))

{'standard_name': 'wind_speed', 'units': 'm/s'}
