# VirtualiZarr 2.0 + Icechunk demo on the Cross-Calibrated Multi-Platform project

This notebook was derived from [Dean Henze's VirtualiZarr 1.0 tutorial](https://podaac.github.io/tutorials/notebooks/Advanced_cloud/virtualizarr_recipes.html).

## Summary

The functionalities of VirtualiZarr (with earthaccess and Icechunk) covered in this notebook are:

1. **Getting data file endpoints in Earthdata Cloud**, which VirtualiZarr needs in order to create virtual datasets.
2. **Virtualizing 1 day and 1 year of a ~750 GB data set and writing to Icechunk**. The data set used is the Level 4 global gridded 6-hourly wind product from the Cross-Calibrated Multi-Platform project (https://doi.org/10.5067/CCMP-6HW10M-L4V31), available on PO.DAAC. This section also covers speeding up the virtualization using parallel computing with a local Dask cluster.
3. **Appending new data to an existing Icechunk store**. When new data files become available (e.g. forward-streaming datasets), they can be appended to an existing Icechunk store without re-processing the entire record. Icechunk also provides version control, so you can always read from any previous snapshot.

## Requirements, prerequisite knowledge, learning outcomes

#### Requirements to run this notebook

* Earthdata login account: An Earthdata Login account is required to access data from the NASA Earthdata system. Please visit https://urs.earthdata.nasa.gov to register and manage your Earthdata Login account.

* Compute environment: This notebook is meant to be run in the cloud (AWS instance running in us-west-2).

#### Prerequisite knowledge

* This notebook covers VirtualiZarr functionality but does not present the high-level ideas behind it. For an understanding of virtual datasets and how they are meant to enhance in-cloud access to file formats that are not cloud optimized (such as netCDF, HDF), please see [the VirtualiZarr documentation](https://virtualizarr.readthedocs.io/en/latest/) and [the Icechunk documentation](https://icechunk.io/).

* Familiarity with the `earthaccess` and `Xarray` packages. Familiarity with directly accessing NASA Earthdata in the cloud. 

* The Cookbook notebook on [Dask basics](https://podaac.github.io/tutorials/notebooks/Advanced_cloud/basic_dask.html) is handy for those new to parallel computing.

#### Compute environment

The [readme](README.md) contains instructions for using `uv` for a reproducible environment.

#### Learning Outcomes

This notebook serves both as a pedagogical resource for learning several key workflows and as a quick reference guide. Readers will learn how to combine the VirtualiZarr, earthaccess, and Icechunk packages to create virtual Zarr stores from NASA Earthdata.

## Import Packages

Dependencies are declared in `pyproject.toml`. See the [README](README.md) for setup instructions.

In [1]:
# Parallel computing
import multiprocessing
import warnings

import dask.array as da

# Data handling
import earthaccess

# Icechunk - versioned virtual Zarr store
import icechunk

# Other
import xarray as xr
from dask import delayed
from dask.distributed import Client, WorkerPlugin
from obspec_utils.registry import ObjectStoreRegistry

# Object store access (replaces fsspec)
from obstore.auth.earthdata import NasaEarthdataCredentialProvider
from obstore.store import S3Store

# VirtualiZarr
import virtualizarr as vz
from virtualizarr import open_virtual_dataset

# Suppress Numcodecs v3 compatibility warnings:
warnings.filterwarnings(
    "ignore",
    message="Numcodecs codecs are not in the Zarr version 3 specification.*",
    category=UserWarning,
)

## Other Setup

In [2]:
xr.set_options(  # display options for xarray objects
    display_expand_attrs=False,
    display_expand_coords=True,
    display_expand_data=True,
)

<xarray.core.options.set_options at 0x7fd94c69eba0>

## 1. Get Data File S3 endpoints in Earthdata Cloud 
The first step is to find the S3 endpoints to the files. We use `earthaccess` for authentication and data discovery, and `obstore` with `NasaEarthdataCredentialProvider` for S3 access (replacing the older fsspec-based approach). The credential provider handles automatic credential refresh, so you no longer need to manually re-fetch credentials after they expire.
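
To make the idea of automatic credential refresh concrete, here is a minimal sketch of the general pattern: cache temporary credentials and fetch new ones shortly before they expire. This is not obstore's implementation. `RefreshingCredentials`, `fetch`, and the five-minute margin are hypothetical names and values chosen purely for illustration, and the fetch function is a stand-in that makes no network calls.

```python
from datetime import datetime, timedelta, timezone


class RefreshingCredentials:
    """Hypothetical helper: caches credentials and re-fetches them near expiry."""

    def __init__(self, fetch, margin=timedelta(minutes=5), now=None):
        self._fetch = fetch  # callable returning a dict with an "expires_at" datetime
        self._margin = margin
        self._now = now or (lambda: datetime.now(timezone.utc))
        self._cached = None

    def get(self):
        # Re-fetch when nothing is cached or the cached credentials are about to expire.
        if (
            self._cached is None
            or self._now() >= self._cached["expires_at"] - self._margin
        ):
            self._cached = self._fetch()
        return self._cached


# Stand-in for a request to an S3 credentials endpoint (no network access here).
calls = []


def fake_fetch():
    calls.append(1)
    return {
        "access_key_id": "EXAMPLE",
        "expires_at": datetime.now(timezone.utc) + timedelta(hours=1),
    }


provider = RefreshingCredentials(fake_fetch)
provider.get()
provider.get()  # served from the cache, so no second fetch happens
print(f"Fetches performed: {len(calls)}")
```

The practical benefit is that long-running jobs keep working after the first set of temporary credentials expires, without any manual re-login step in the notebook.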

In [3]:
# Get Earthdata creds
earthaccess.login()

<earthaccess.auth.Auth at 0x7fd97d2c9fd0>

In [4]:
# Locate CCMP file information / metadata:
granule_info = earthaccess.search_data(
    short_name="CCMP_WINDS_10M6HR_L4_V3.1",
)

In [5]:
# Extract S3 credentials endpoint and bucket URL from search results:
s3_credentials_endpoint = granule_info[0].get_s3_credentials_endpoint()
first_link = granule_info[0].data_links(access="direct")[0]
bucket = "/".join(
    first_link.split("/")[:3]
)  # e.g., "s3://podaac-ops-cumulus-protected"
print(f"S3 credentials endpoint: {s3_credentials_endpoint}")
print(f"Bucket: {bucket}")

# Set up obstore with NASA Earthdata credential auto-refresh:
credential_provider = NasaEarthdataCredentialProvider(s3_credentials_endpoint)
store = S3Store.from_url(bucket, credential_provider=credential_provider)
registry = ObjectStoreRegistry({bucket: store})

S3 credentials endpoint: https://archive.podaac.earthdata.nasa.gov/s3credentials
Bucket: s3://podaac-ops-cumulus-protected
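
The bucket root can also be derived with the standard library's URL parser instead of string splitting. The sketch below is an equivalent alternative, not a change to the notebook's approach; `bucket_root` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def bucket_root(s3_url: str) -> str:
    """Return the 's3://bucket' prefix of an S3 object URL."""
    parts = urlparse(s3_url)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"Not an S3 URL: {s3_url!r}")
    return f"{parts.scheme}://{parts.netloc}"


link = (
    "s3://podaac-ops-cumulus-protected/CCMP_WINDS_10M6HR_L4_V3.1/"
    "CCMP_Wind_Analysis_19930102_V03.1_L4.nc"
)
print(bucket_root(link))

# Same result as the split-based expression used in the cell above:
assert bucket_root(link) == "/".join(link.split("/")[:3])
```

Unlike the split-based expression, this version raises an error when it is handed something that is not an `s3://` URL.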


In [6]:
# Get S3 endpoints for all files:
data_s3links = [g.data_links(access="direct")[0] for g in granule_info]
print(f"Total files: {len(data_s3links)}")
data_s3links[0:3]

Total files: 11917


['s3://podaac-ops-cumulus-protected/CCMP_WINDS_10M6HR_L4_V3.1/CCMP_Wind_Analysis_19930102_V03.1_L4.nc',
 's3://podaac-ops-cumulus-protected/CCMP_WINDS_10M6HR_L4_V3.1/CCMP_Wind_Analysis_19930103_V03.1_L4.nc',
 's3://podaac-ops-cumulus-protected/CCMP_WINDS_10M6HR_L4_V3.1/CCMP_Wind_Analysis_19930105_V03.1_L4.nc']
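
The first three file names listed above jump from 3 January to 5 January 1993, which suggests the record may not contain one file per calendar day. Below is a small sketch for spotting such gaps from the file names alone. `granule_date` and `missing_days` are hypothetical helpers, and the regular expression assumes the `CCMP_Wind_Analysis_YYYYMMDD_` naming pattern seen in the output.

```python
import re
from datetime import date, datetime, timedelta


def granule_date(url: str) -> date:
    """Parse the YYYYMMDD date embedded in a CCMP file name."""
    match = re.search(r"CCMP_Wind_Analysis_(\d{8})_", url)
    if match is None:
        raise ValueError(f"No date found in {url!r}")
    return datetime.strptime(match.group(1), "%Y%m%d").date()


def missing_days(urls: list[str]) -> list[date]:
    """Return calendar days between the first and last file that have no file."""
    present = {granule_date(u) for u in urls}
    day, last = min(present), max(present)
    gaps = []
    while day < last:
        day += timedelta(days=1)
        if day not in present:
            gaps.append(day)
    return gaps


prefix = "s3://podaac-ops-cumulus-protected/CCMP_WINDS_10M6HR_L4_V3.1/"
sample = [
    prefix + "CCMP_Wind_Analysis_19930102_V03.1_L4.nc",
    prefix + "CCMP_Wind_Analysis_19930103_V03.1_L4.nc",
    prefix + "CCMP_Wind_Analysis_19930105_V03.1_L4.nc",
]
print(missing_days(sample))
```

In the notebook, calling `missing_days(data_s3links[:365])` would apply the same check to the files used for the year-long example. This matters because "the first 365 files" and "the first calendar year" are not necessarily the same thing.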

## 2. Virtualize data files and write to Icechunk

### 2.1 First day
VirtualiZarr's `open_virtual_dataset` creates a virtual dataset from a single file. In VirtualiZarr 2.x, you provide:
- A `registry` (ObjectStoreRegistry) that handles S3 access credentials.
- A `parser` (e.g. `HDFParser`) that knows how to read the file format.
- `loadable_variables` - coordinate variables to load into memory for indexing.

The virtual dataset is then written to an Icechunk store, which provides versioned, cloud-native access to the data.

In [7]:
# Coordinate variables to load into memory (needs to be modified per dataset):
coord_vars = ["latitude", "longitude", "time"]

# Configure the HDF parser for NetCDF4/HDF5 files:
parser = vz.parsers.HDFParser()

In [8]:
%%time
# Create virtual dataset for the first data file:
virtual_ds_example = open_virtual_dataset(
    url=data_s3links[0],
    registry=registry,
    parser=parser,
    loadable_variables=coord_vars,
)
print(virtual_ds_example)

<xarray.Dataset> Size: 66MB
Dimensions:    (latitude: 720, longitude: 1440, time: 4)
Coordinates:
  * latitude   (latitude) float32 3kB -89.88 -89.62 -89.38 ... 89.38 89.62 89.88
  * longitude  (longitude) float32 6kB 0.125 0.375 0.625 ... 359.4 359.6 359.9
  * time       (time) datetime64[ns] 32B 1993-01-02 ... 1993-01-02T18:00:00
Data variables:
    uwnd       (time, latitude, longitude) float32 17MB ManifestArray<shape=(...
    vwnd       (time, latitude, longitude) float32 17MB ManifestArray<shape=(...
    ws         (time, latitude, longitude) float32 17MB ManifestArray<shape=(...
    nobs       (time, latitude, longitude) float32 17MB ManifestArray<shape=(...
Attributes: (54)
CPU times: user 99.6 ms, sys: 0 ns, total: 99.6 ms
Wall time: 2.04 s


The virtual dataset can be written to an Icechunk store. Icechunk is a versioned, transactional data store for Zarr that stores virtual references alongside any loaded data. First we set up a local Icechunk repository with a virtual chunk container that tells Icechunk how to read the actual data from the S3 bucket:

In [9]:
# Set up Icechunk repository with virtual chunk container for the S3 bucket:
storage = icechunk.local_filesystem_storage("./ccmp_icechunk_example")

config = icechunk.RepositoryConfig.default()
config.set_virtual_chunk_container(
    icechunk.VirtualChunkContainer(
        bucket + "/",
        icechunk.s3_store(region="us-west-2"),
    )
)

# Get S3 credentials for Icechunk to read virtual chunks at read-time:
s3_creds = earthaccess.get_s3_credentials(daac="PODAAC")
credentials = icechunk.containers_credentials(
    {
        bucket + "/": icechunk.s3_credentials(
            access_key_id=s3_creds["accessKeyId"],
            secret_access_key=s3_creds["secretAccessKey"],
            session_token=s3_creds["sessionToken"],
        )
    }
)

repo = icechunk.Repository.open_or_create(
    storage=storage,
    config=config,
    authorize_virtual_chunk_access=credentials,
)

# Write the virtual dataset to Icechunk:
session = repo.writable_session("main")
virtual_ds_example.vz.to_icechunk(session.store)
snapshot_id = session.commit("First day of CCMP data")
print(f"Committed snapshot: {snapshot_id}")

2026-02-11T00:16:50.740153Z  WARN icechunk::storage::object_store: The LocalFileSystem storage is not safe for concurrent commits. If more than one thread/process will attempt to commit at the same time, prefer using object stores.
    at icechunk/src/storage/object_store.rs:81

Committed snapshot: QAWYQ1D2KXF7RBKWF3P0


In [10]:
# Read the data back from the Icechunk store using xarray:
read_session = repo.readonly_session(branch="main")
data_example = xr.open_zarr(
    store=read_session.store,
    zarr_format=3,
    consolidated=False,
)
print(data_example)

<xarray.Dataset> Size: 66MB
Dimensions:    (time: 4, latitude: 720, longitude: 1440)
Coordinates:
  * time       (time) datetime64[ns] 32B 1993-01-02 ... 1993-01-02T18:00:00
  * latitude   (latitude) float32 3kB -89.88 -89.62 -89.38 ... 89.38 89.62 89.88
  * longitude  (longitude) float32 6kB 0.125 0.375 0.625 ... 359.4 359.6 359.9
Data variables:
    nobs       (time, latitude, longitude) float32 17MB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    uwnd       (time, latitude, longitude) float32 17MB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    ws         (time, latitude, longitude) float32 17MB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    vwnd       (time, latitude, longitude) float32 17MB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
Attributes: (54)


In [11]:
# Virtual references are compact -- check the in-memory size:
print(f"Size of virtual references: {virtual_ds_example.vz.nbytes:,} bytes")
print(f"Size of actual data (if loaded): {virtual_ds_example.nbytes:,} bytes")

Size of virtual references: 9,184 bytes
Size of actual data (if loaded): 66,363,872 bytes
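
The reported data size can be reproduced from the shape and dtypes in the dataset repr shown earlier: four `float32` data variables on a (4, 720, 1440) grid, plus the three coordinate arrays. The sketch below is plain arithmetic on the numbers printed in this notebook and does not touch the data itself.

```python
# Shape and dtype sizes taken from the dataset repr printed earlier.
n_time, n_lat, n_lon = 4, 720, 1440
float32_bytes, datetime64_bytes = 4, 8
n_data_vars = 4  # uwnd, vwnd, ws, nobs

data_bytes = n_data_vars * n_time * n_lat * n_lon * float32_bytes
coord_bytes = (n_lat + n_lon) * float32_bytes + n_time * datetime64_bytes
total_bytes = data_bytes + coord_bytes

reference_bytes = 9_184  # size of the virtual dataset printed above
print(f"Data size: {total_bytes:,} bytes")
print(f"Ratio of the two printed sizes: about {total_bytes / reference_bytes:,.0f}x")
```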


> **Note:** You can also export to Kerchunk JSON or Parquet formats for backwards compatibility:
> ```python
> virtual_ds_example.vz.to_kerchunk('output.json', format='json')
> virtual_ds_example.vz.to_kerchunk('output.parq', format='parquet')
> ```
> However, Icechunk is the recommended format as it supports versioning, appending, and does not require a separate reference file.

### 2.2 First year
We create a virtual dataset for each data file in the year individually, then combine them into a single virtual dataset for the year and write it to Icechunk.

For us, virtualization of a single file takes about 0.7 seconds, so processing a year of files sequentially would take about 4.25 minutes. One can easily accomplish this with a for-loop:

```python
virtual_ds_list = [
    open_virtual_dataset(
        url=p, registry=registry, parser=parser,
        loadable_variables=coord_vars,
    )
    for p in data_s3links[:365]  # first year only
]
```
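
The time estimate quoted above is simple arithmetic, and the same calculation shows why a sequential loop becomes less attractive for the full record. The sketch below uses the approximate per-file time from the text and the file count printed in Section 1; actual timings will vary with network conditions and instance type.

```python
seconds_per_file = 0.7  # approximate per-file virtualization time quoted above
files_in_first_year = 365
files_in_full_record = 11_917  # "Total files" printed in Section 1

year_minutes = files_in_first_year * seconds_per_file / 60
full_hours = files_in_full_record * seconds_per_file / 3600

print(f"First year, sequential: ~{year_minutes:.2f} minutes")
print(f"Full record, sequential: ~{full_hours:.1f} hours")
```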

However, we speed things up using basic parallel computing. 

> **Note:** VirtualiZarr also provides `open_virtual_mfdataset` which handles the open-and-combine workflow in a single call. It accepts `parallel="dask"` to use dask.delayed internally. The manual approach shown below gives you full control over the Dask cluster configuration.

#### 2.2.1 Method 1: parallelize using Dask local cluster
An `m6i.4xlarge` AWS EC2 instance has 16 CPUs and should have enough memory to use all of them at once. If you are working on a different VM type, change `n_workers` in the call to `Client()` below as needed.

In [12]:
# Check how many CPUs are on this VM:
print("CPU count =", multiprocessing.cpu_count())

CPU count = 4
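
One way to avoid hard-coding the worker count is to derive it from the CPU count. The sketch below uses a common heuristic (leave one CPU free for the scheduler and the notebook kernel), which is an assumption on our part rather than a rule from the Dask documentation. `suggested_n_workers` is a hypothetical helper.

```python
import multiprocessing


def suggested_n_workers(cpu_count: int, reserve: int = 1) -> int:
    """Leave `reserve` CPUs free, but always return at least one worker."""
    return max(1, cpu_count - reserve)


n_workers = suggested_n_workers(multiprocessing.cpu_count())
print(f"Suggested n_workers on this machine = {n_workers}")
print(f"Suggested n_workers on 16 CPUs = {suggested_n_workers(16)}")
```

The result could then be passed as `Client(n_workers=n_workers, threads_per_worker=1)`. Because virtualization is dominated by waiting on S3 requests, using more workers than CPUs can still pay off, so treat this as a starting point to tune.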


In [13]:
# Start up cluster and print some information about it:
client = Client(n_workers=15, threads_per_worker=1)


class SuppressWarningsPlugin(WorkerPlugin):
    def setup(self, worker):
        import warnings

        warnings.filterwarnings(
            "ignore",
            message="Numcodecs codecs are not in the Zarr version 3 specification.*",
            category=UserWarning,
        )


client.register_plugin(SuppressWarningsPlugin())
client

Client
Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: /user/maxrjones/virtualizarr/proxy/8787/status
Workers: 15, Total threads: 15, Total memory: 29.08 GiB
Status: running, Using processes: True
Each worker: 1 thread, 1.94 GiB memory


In [14]:
%%time
# Create individual virtual datasets in parallel using dask.delayed:
open_vds_par = delayed(open_virtual_dataset)
tasks = [
    open_vds_par(
        url=p,
        registry=registry,
        parser=parser,
        loadable_variables=coord_vars,
    )
    for p in data_s3links[:365]  # First year only!
]
virtual_ds_list = list(
    da.compute(*tasks)
)  # xr.combine_nested() needs a list, not a tuple.

CPU times: user 6.66 s, sys: 1.35 s, total: 8.01 s
Wall time: 1min 11s


Using the individual references to create the combined reference is fast and does not require parallel computing.

In [15]:
%%time
# Create the combined reference
virtual_ds_combined = xr.combine_nested(
    virtual_ds_list,
    concat_dim="time",
    coords="minimal",
    compat="override",
    combine_attrs="drop_conflicts",
)

CPU times: user 448 ms, sys: 30.1 ms, total: 478 ms
Wall time: 576 ms


In [16]:
# Write the combined virtual dataset to a fresh Icechunk repository:
storage_1yr = icechunk.local_filesystem_storage("./ccmp_icechunk_1year")
repo_1yr = icechunk.Repository.open_or_create(
    storage=storage_1yr,
    config=config,
    authorize_virtual_chunk_access=credentials,
)

session_1yr = repo_1yr.writable_session("main")
virtual_ds_combined.vz.to_icechunk(session_1yr.store)
snapshot_1yr = session_1yr.commit("First year of CCMP data (365 files)")
print(f"Committed snapshot: {snapshot_1yr}")

2026-02-11T00:18:07.647336Z  WARN icechunk::storage::object_store: The LocalFileSystem storage is not safe for concurrent commits. If more than one thread/process will attempt to commit at the same time, prefer using object stores.
    at icechunk/src/storage/object_store.rs:81

Committed snapshot: DZX5W4VPF9K22S2GQ6DG


In [17]:
%%time
# Test lazy loading from the Icechunk store:
read_session_1yr = repo_1yr.readonly_session(branch="main")
data_icechunk = xr.open_zarr(
    store=read_session_1yr.store,
    zarr_format=3,
    consolidated=False,
)
print(data_icechunk)

<xarray.Dataset> Size: 24GB
Dimensions:    (time: 1460, latitude: 720, longitude: 1440)
Coordinates:
  * time       (time) datetime64[ns] 12kB 1993-01-02 ... 1994-01-04T18:00:00
  * latitude   (latitude) float32 3kB -89.88 -89.62 -89.38 ... 89.38 89.62 89.88
  * longitude  (longitude) float32 6kB 0.125 0.375 0.625 ... 359.4 359.6 359.9
Data variables:
    vwnd       (time, latitude, longitude) float32 6GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    uwnd       (time, latitude, longitude) float32 6GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    ws         (time, latitude, longitude) float32 6GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    nobs       (time, latitude, longitude) float32 6GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
Attributes: (47)
CPU times: user 104 ms, sys: 44.6 ms, total: 149 ms
Wall time: 104 ms


## 3. Appending new data to an existing Icechunk store

One of the key advantages of Icechunk over Kerchunk reference files is native support for appending new data. When new CCMP files become available (e.g., forward-streaming daily data), you can append them to the existing Icechunk store without re-processing the entire record.

Icechunk also provides version control -- each append creates a new snapshot, and you can always read from any previous snapshot.

### 3.1 Append an additional day

We virtualize a file that was not included in the previous year-long dataset and append it to the existing Icechunk store using `append_dim="time"`.
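
Appending along `time` only makes sense when the new file's timestamps come after those already in the store. A minimal guard for that condition is sketched below. `check_append_order` is a hypothetical helper, not part of VirtualiZarr or Icechunk, and the two timestamps are copied from the dataset outputs in this notebook.

```python
from datetime import datetime


def check_append_order(last_in_store: datetime, first_in_new: datetime) -> None:
    """Raise if the new data would not extend the time axis forward."""
    if first_in_new <= last_in_store:
        raise ValueError(
            f"New data starts at {first_in_new}, which is not after "
            f"the store's last time step {last_in_store}"
        )


last_in_store = datetime(1994, 1, 4, 18)  # last time step of the year-long store
first_in_new = datetime(1994, 1, 5, 0)  # first time step of the 366th file
check_append_order(last_in_store, first_in_new)  # passes silently
print("Append order OK")
```

The check only requires that time moves forward, not that the new file starts exactly six hours after the last stored step, because the record can contain missing days.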

In [18]:
%%time
# Create virtual dataset for the 366th file (the day after our year-long dataset):
vds_extraday = open_virtual_dataset(
    url=data_s3links[365],
    registry=registry,
    parser=parser,
    loadable_variables=coord_vars,
)
print(vds_extraday)

<xarray.Dataset> Size: 66MB
Dimensions:    (latitude: 720, longitude: 1440, time: 4)
Coordinates:
  * latitude   (latitude) float32 3kB -89.88 -89.62 -89.38 ... 89.38 89.62 89.88
  * longitude  (longitude) float32 6kB 0.125 0.375 0.625 ... 359.4 359.6 359.9
  * time       (time) datetime64[ns] 32B 1994-01-05 ... 1994-01-05T18:00:00
Data variables:
    uwnd       (time, latitude, longitude) float32 17MB ManifestArray<shape=(...
    vwnd       (time, latitude, longitude) float32 17MB ManifestArray<shape=(...
    ws         (time, latitude, longitude) float32 17MB ManifestArray<shape=(...
    nobs       (time, latitude, longitude) float32 17MB ManifestArray<shape=(...
Attributes: (54)
CPU times: user 87.3 ms, sys: 38.7 ms, total: 126 ms
Wall time: 797 ms


In [19]:
%%time
# Append the extra day to the existing Icechunk store:
append_session = repo_1yr.writable_session("main")
vds_extraday.vz.to_icechunk(append_session.store, append_dim="time")
snapshot_appended = append_session.commit("Appended one additional day")
print(f"Committed append snapshot: {snapshot_appended}")

Committed append snapshot: MSE26GS6JMWNGNCRWYG0
CPU times: user 58.3 ms, sys: 28.6 ms, total: 86.8 ms
Wall time: 108 ms


In [20]:
# Verify the appended data:
read_session_appended = repo_1yr.readonly_session(branch="main")
data_appended = xr.open_zarr(
    store=read_session_appended.store,
    zarr_format=3,
    consolidated=False,
)
print(f"Time dimension after append: {data_appended.sizes['time']}")
print(data_appended)

Time dimension after append: 1464
<xarray.Dataset> Size: 24GB
Dimensions:    (time: 1464, latitude: 720, longitude: 1440)
Coordinates:
  * time       (time) datetime64[ns] 12kB 1993-01-02 ... 1994-01-05T18:00:00
  * latitude   (latitude) float32 3kB -89.88 -89.62 -89.38 ... 89.38 89.62 89.88
  * longitude  (longitude) float32 6kB 0.125 0.375 0.625 ... 359.4 359.6 359.9
Data variables:
    nobs       (time, latitude, longitude) float32 6GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    vwnd       (time, latitude, longitude) float32 6GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    ws         (time, latitude, longitude) float32 6GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    uwnd       (time, latitude, longitude) float32 6GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
Attributes: (54)


### 3.2 Version history

Icechunk maintains a full version history. You can read from any previous snapshot, for example to compare the original year-long dataset with the appended version:

In [21]:
# Access the original year-long snapshot (before the append):
read_session_original = repo_1yr.readonly_session(snapshot_id=snapshot_1yr)
data_original = xr.open_zarr(
    store=read_session_original.store,
    zarr_format=3,
    consolidated=False,
)
print(f"Original dataset time steps: {data_original.sizes['time']}")
print(f"Appended dataset time steps: {data_appended.sizes['time']}")

Original dataset time steps: 1460
Appended dataset time steps: 1464
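
The two counts are consistent with the data's 6-hourly resolution: each daily file contributes four time steps. A quick arithmetic check, using only numbers that appear in this notebook (`expected_time_steps` is a hypothetical helper):

```python
HOURS_PER_DAY = 24
hours_per_step = 6  # CCMP is a 6-hourly product
steps_per_file = HOURS_PER_DAY // hours_per_step


def expected_time_steps(n_files: int) -> int:
    """Number of time steps for `n_files` daily files of 6-hourly data."""
    return n_files * steps_per_file


print(expected_time_steps(365))  # files in the original snapshot
print(expected_time_steps(366))  # after appending one more file
```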


## 4. Cleanup

Close the credential provider and Dask client when done.

In [22]:
# Close the credential provider (releases background refresh thread):
credential_provider.close()

# Shut down the Dask client:
client.close()