# Zarr Tutorial

👋 Welcome to the Zarr tutorial! This tutorial was created for the 2022 [Cloud-Native Geo Event](https://schedule.cloudnativegeo.org/). It borrows heavily from the [tutorial section of the Zarr docs](https://zarr.readthedocs.io/en/stable/tutorial.html).

_Level: intermediate Python programmers. Ideally you are already a bit familiar with NumPy and Xarray._

This will be a live-coding tutorial. You will get the most out of the tutorial if you follow along in a blank notebook and type the code in yourself. 

**Learning Goals**

By the end of this tutorial, you should be able to:

- Identify the fundamental data structures in Zarr (Groups and Arrays) and the key properties of Arrays (shape, dtype, chunks, attributes)
- Create Arrays and Groups in local files or in S3
- Create and edit attributes (metadata)
- Read and write data into Arrays
- Evaluate the tradeoffs of different Array chunking strategies
- Read and write NetCDF-style data to Zarr using Xarray
- Do parallel processing on Xarray / Zarr data using Dask

## Review: NumPy

Zarr's Python API borrows many concepts from NumPy, so we first review the basics of NumPy.

Creating an array:
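
A minimal sketch of creating an array; the shape, dtype, and the name `a` are arbitrary choices for illustration:

```python
import numpy as np

# A 4 x 5 array of 32-bit integers holding the values 0..19.
a = np.arange(20, dtype="int32").reshape(4, 5)
print(a.shape, a.dtype)
```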

This array lives in memory.

How much memory does the array use?
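
One way to answer this is the `nbytes` attribute, which is the number of elements times the size of one element. The array here is an illustrative stand-in:

```python
import numpy as np

a = np.arange(20, dtype="int32").reshape(4, 5)

# 20 elements * 4 bytes per int32 element
print(a.nbytes)             # 80
print(a.size * a.itemsize)  # 80
```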

Getting a piece of data with slicing:
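
A short sketch of slicing, using an illustrative 4 x 5 array:

```python
import numpy as np

a = np.arange(20, dtype="int32").reshape(4, 5)

row = a[0, :]        # first row: [0 1 2 3 4]
block = a[1:3, 2:4]  # rows 1-2, columns 2-3: [[7 8], [12 13]]
print(row)
print(block)
```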

Create a new array and assign to it:
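
A sketch of creating an empty (zero-filled) array and assigning into it with slices; the shape and values are arbitrary:

```python
import numpy as np

b = np.zeros((4, 5), dtype="float64")

b[0, :] = 1.0  # set the whole first row
b[:, 0] = 2.0  # set the whole first column (overwrites b[0, 0])
print(b)
```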

## Zarr Fundamentals

A Zarr array has four important properties:
- Shape
- Dtype
- Chunks
- Attributes

An additional property of arrays we will not delve into here is _filters / compressors_.

No data has been written to the array yet. If we try to access data, we will just get the fill value:

This is how we assign data to the array. When we do this, the data is written to the store immediately.

### Attributes

We can attach arbitrary metadata to our Array via attributes:

### Under the hood

Where / how is our data actually stored?
Let's look under the hood. _The ability to look inside a Zarr store and understand what is there is a deliberate design decision._

### Choosing Chunks

The main parameter we control when creating Zarr Arrays is the chunk shape.
When selecting chunks, we need to keep in mind two constraints:
- Writes can be concurrent (come from different processes simultaneously) if they do not touch the same chunks. _This enables massively parallel writing in the cloud._
- When reading data, if any piece of the chunk is needed, the entire chunk has to be loaded. (This will be relaxed in Zarr V3.)

Here we will compare two different chunking strategies.

There is no universally perfect chunk size or shape.
We need to consider:
- Access patterns for data
- Latency & throughput of storage device
- Constraints on number of files / objects (don't want a billion files)

### Resizing Arrays

It is trivial to change the shape of Zarr arrays.
If you make your array smaller, you will lose data of course.

### Compressors

A big part of the performance of Zarr is due to its support for compression of individual chunks. Zarr by default supports 20 different compression codecs. These live in a separate package called [numcodecs](https://numcodecs.readthedocs.io/en/stable/). The default compressor is the [Blosc](https://www.blosc.org/) meta-compressor. It's easy to add a new compressor or filter.

For the sake of time, we've decided to skip going into detail on compressors. You can read the [Zarr docs](https://zarr.readthedocs.io/en/stable/tutorial.html#compressors) for more information. The default compressor usually works well for most applications.

### Groups

To keep many arrays together, we can organize them in groups.

## Zarr in the Cloud

### Writing to and Reading from Cloud Object Storage

Zarr can store data in any storage system that can be represented as a key-value store. Here are some examples:

- A directory on your filesystem
- A ZipFile
- A Redis database
- A cloud object store (e.g. S3, GCS, Azure Blob Storage)

In the cell below, we will access an S3 bucket with read-write credentials. These credentials will be disabled after the tutorial.


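```python
import os
import uuid

# Read-write credentials for the tutorial bucket, taken from the environment
# so that the secret is not stored in the notebook.
storage_kwargs = {
    "key": os.environ["AWS_ACCESS_KEY_ID"],
    "secret": os.environ["AWS_SECRET_ACCESS_KEY"],
}
my_folder = f"s3://pangeo-ogc-demo/{uuid.uuid4().hex}"
```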

Now we create a store in that location and store a group in it.

### Consolidating Metadata

Listing directories can be slow (or impossible) on certain storage devices. Zarr offers the ability to consolidate the metadata for an entire group into a single object.

## Zarr + Xarray (+ Dask)

Confession: I almost never use the `zarr` library directly. I nearly always read and write Zarr via [Xarray](https://xarray.dev/) because I like Xarray's data model. (Xarray's data model is the NetCDF data model).

> Xarray is an open source project and Python package that introduces labels in the form of dimensions, coordinates, and attributes on top of raw NumPy-like arrays, which allows for more intuitive, more concise, and less error-prone user experience.
> 
> Xarray includes a large and growing library of domain-agnostic functions for advanced analytics and visualization with these data structures.


### Quick Review of Xarray
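
As a brief sketch of the Xarray data model, here is a small synthetic Dataset with one variable, three labeled dimensions, and attributes. All names and values are made up for illustration:

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
data = 280.0 + rng.standard_normal((30, 4, 5))

ds = xr.Dataset(
    data_vars={
        "temperature": (("time", "lat", "lon"), data, {"units": "K"}),
    },
    coords={
        "time": np.arange(30),
        "lat": np.linspace(-45.0, 45.0, 4),
        "lon": np.linspace(0.0, 270.0, 5),
    },
    attrs={"title": "synthetic demo dataset"},
)

# Select by coordinate label rather than by integer position.
at_lat = ds["temperature"].sel(lat=15.0)
# Reduce over a named dimension.
time_mean = ds["temperature"].mean("time")
print(ds)
```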

### Writing Zarr from Xarray

First we chunk the dataset. This accomplishes two things.
- Allows parallel processing using [Dask](https://dask.org/) (Not necessary for this small-data example but very useful for big data)
- Automatically maps Dask chunks to Zarr chunks when writing.

### Reading from S3

Here is some syntactic sugar to automatically map the Zarr chunks to Dask chunks when reading. 

### Let's Look under the hood

## Example Datasets

### Work with Existing CMIP6 cloud data

```python
cat_url = "https://cmip6-pds.s3-us-west-2.amazonaws.com/pangeo-cmip6.csv"
```
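
A sketch of how the catalog could be searched with pandas. The helpers `load_catalog` and `find_stores` are hypothetical, and the column names (`zstore` for the store URL, plus facet columns such as `variable_id`, `experiment_id`, and `table_id`) are assumptions about the CSV layout, so check them against the file before relying on them:

```python
import pandas as pd

cat_url = "https://cmip6-pds.s3-us-west-2.amazonaws.com/pangeo-cmip6.csv"


def load_catalog(url=cat_url):
    """Download the catalog CSV into a DataFrame (needs network access)."""
    return pd.read_csv(url)


def find_stores(catalog, **facets):
    """Return the store URLs of rows matching every column=value pair."""
    mask = pd.Series(True, index=catalog.index)
    for column, value in facets.items():
        mask &= catalog[column] == value
    return catalog.loc[mask, "zstore"].tolist()


# Example usage (not run here because it downloads the catalog):
#   catalog = load_catalog()
#   urls = find_stores(catalog, variable_id="tas", experiment_id="historical")
```

Each returned URL points at a Zarr store that can then be opened with Xarray.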

### MUR SST

https://registry.opendata.aws/mur/

> A global, gap-free, gridded, daily 1 km Sea Surface Temperature (SST) dataset created by merging multiple Level-2 satellite SST datasets. Those input datasets include the NASA Advanced Microwave Scanning Radiometer-EOS (AMSR-E), the JAXA Advanced Microwave Scanning Radiometer 2 (AMSR-2) on GCOM-W1, the Moderate Resolution Imaging Spectroradiometers (MODIS) on the NASA Aqua and Terra platforms, the US Navy microwave WindSat radiometer, the Advanced Very High Resolution Radiometer (AVHRR) on several NOAA satellites, and in situ SST observations from the NOAA iQuam project. Data are available from 2002 to present in Zarr format. The original source of the MUR data is the NASA JPL Physical Oceanography DAAC.

```python
mur_url = 's3://mur-sst/zarr/'
```