# VirtualiZarr and Coiled - Building a virtual dataset of Terraclimate

This notebook is an example of using VirtualiZarr together with the Python distributed processing framework [Coiled](https://www.coiled.io/) to generate references in parallel using [serverless functions](https://docs.coiled.io/user_guide/functions.html).
- **Note:** Running this notebook requires a Coiled account.


## The dataset
For this example, we are going to create a virtual Zarr store from the [Terraclimate](https://www.climatologylab.org/terraclimate.html) dataset. Terraclimate is a monthly dataset spanning 66 years (1958 to 2023) and containing 14 climate and water balance variables. It is stored as 924 individual NetCDF4 files, one per variable per year. When represented as an Xarray dataset, it is over 1TB in size.

## Parallelizing `virtualizarr` reference generation with coiled serverless functions
Coiled serverless functions allow us to easily spin up hundreds of small compute instances, which are great for individual file reference generation. We were able to process 924 netCDF files into a 1TB virtual xarray dataset in 9 minutes for ~$0.24.
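
As a rough back-of-the-envelope check, the figures quoted above can be turned into a throughput and a per-file cost. This is simple arithmetic on the numbers in the text, not a new benchmark, and the variable names are only illustrative.

```python
# Figures quoted in the paragraph above
n_files = 924
wall_minutes = 9
total_cost_usd = 0.24

# Derived rates
files_per_minute = n_files / wall_minutes
cost_per_file_usd = total_cost_usd / n_files

print(f"{files_per_minute:.0f} files per minute")
print(f"${cost_per_file_usd:.5f} per file")
```

That works out to roughly 100 files per minute at a small fraction of a cent per file.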

## Installation and environment

You should install the Python requirements in a clean virtual environment of your choice. Each coiled serverless function will re-use this environment, so it's best to start with a clean slate.

```bash
pip install "virtualizarr[icechunk,hdf]" coiled ipykernel bokeh
```

## Imports


In [None]:
import coiled
import icechunk
import numpy as np
import xarray as xr

from virtualizarr import open_virtual_dataset

## Create the Terraclimate variable and year URL combinations
`14 variables * 66 years = 924 NetCDF files`





In [None]:
tvars = [
    "aet",
    "def",
    "pet",
    "ppt",
    "q",
    "soil",
    "srad",
    "swe",
    "tmax",
    "tmin",
    "vap",
    "ws",
    "vpd",
    "PDSI",
]
min_year = 1958
max_year = 2023
time_list = np.arange(min_year, max_year + 1, 1)

combinations = [
    f"https://climate.northwestknowledge.net/TERRACLIMATE-DATA/TerraClimate_{var}_{year}.nc"
    for year in time_list
    for var in tvars
]
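
Before sending these URLs to remote workers, it can be worth confirming that the list has the expected shape. The sketch below rebuilds the same list with the standard library only and checks the count; `parse_url` is a hypothetical helper introduced here for illustration and assumes the `TerraClimate_<var>_<year>.nc` naming pattern used above.

```python
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import urlparse

tvars = [
    "aet", "def", "pet", "ppt", "q", "soil", "srad",
    "swe", "tmax", "tmin", "vap", "ws", "vpd", "PDSI",
]
years = range(1958, 2023 + 1)
base = "https://climate.northwestknowledge.net/TERRACLIMATE-DATA"
urls = [f"{base}/TerraClimate_{var}_{year}.nc" for year in years for var in tvars]


def parse_url(url):
    """Hypothetical helper: split 'TerraClimate_<var>_<year>.nc' into (var, year)."""
    stem = PurePosixPath(urlparse(url).path).stem
    _, var, year = stem.split("_")
    return var, int(year)


# Count how many yearly files each variable contributes
files_per_var = Counter(parse_url(url)[0] for url in urls)

assert len(urls) == len(tvars) * len(years) == 924
assert len(set(urls)) == len(urls)  # no duplicate URLs
assert all(count == 66 for count in files_per_var.values())
```

A check like this does not confirm that every file exists on the server; it only guards against mistakes in the variable list or year range.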

## Define the coiled serverless function

### Serverless function setup notes:
- This Coiled function is tailored to AWS
- `vm_type=["t4g.small"]` - This is a small instance type; you shouldn't need large machines for reference generation
- `spot_policy="spot_with_fallback"` uses cheaper spot instances where available and falls back to on-demand instances otherwise. Spot instances can be reclaimed mid-task, which is one reason to allow retries when mapping
- `arm=True` uses VMs with ARM architecture, which are cheaper
- `idle_timeout="10 minutes"` workers will shut down after 10 minutes of inactivity
- `n_workers=[10, 100]` adaptive scaling between 10 and 100 workers
- `name` [optional] lets you keep track of your cluster in the Coiled dashboard

More details can be found in the [serverless function API](https://docs.coiled.io/user_guide/functions.html#api).

In [None]:
@coiled.function(
    region="us-west-2",
    vm_type=["t4g.small"],
    spot_policy="spot_with_fallback",
    arm=True,
    idle_timeout="10 minutes",
    n_workers=[10, 100],
    name="parallel_reference_generation",
)
def process(filename):
    vds = open_virtual_dataset(
        filename,
        decode_times=True,
        loadable_variables=["time", "lat", "lon", "crs"],
    )
    return vds


# process.map distributes the input file URLs across Coiled functions
# retries=10 retries each failed task up to 10 times, which helps when the data server behaves inconsistently
# Only the first two URLs are processed here; pass `combinations` to process all 924 files
results = process.map(combinations[0:2], retries=10)
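
To get a feel for the map-with-retries pattern without a Coiled account, here is a local analogy built on the standard library. It is only a sketch: `with_retries` and `flaky_process` are hypothetical names, the thread pool stands in for remote workers, and none of this reflects how Coiled implements `retries` internally.

```python
from concurrent.futures import ThreadPoolExecutor


def with_retries(func, retries):
    """Wrap func so that a failing call is attempted up to `retries` more times."""

    def wrapper(item):
        for attempt in range(retries + 1):
            try:
                return func(item)
            except Exception:
                # Re-raise once the retry budget is used up
                if attempt == retries:
                    raise

    return wrapper


attempts = {}


def flaky_process(filename):
    """Hypothetical stand-in for `process`: the first call for each file fails."""
    attempts[filename] = attempts.get(filename, 0) + 1
    if attempts[filename] == 1:
        raise ConnectionError(f"transient failure for {filename}")
    return f"refs({filename})"


files = [f"file_{i}.nc" for i in range(4)]

# Map the wrapped function over the inputs; results come back in input order
with ThreadPoolExecutor(max_workers=2) as pool:
    local_results = list(pool.map(with_retries(flaky_process, retries=3), files))

print(local_results)
```

Each simulated file fails once and succeeds on the second attempt, so the map completes even though every task hit a transient error.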


## Combine references into virtual dataset

In [None]:
# extract generator values into a list
vds_list = list(results)

# combine individual refs into a virtual Xarray dataset
mds = xr.combine_by_coords(
    vds_list, coords="minimal", compat="override", combine_attrs="drop"
)

mds

In [None]:
print(f"{mds.nbytes / 1e12:.2f} TB")

## Save the virtual dataset to Icechunk

Now that we have this virtual dataset, we can write it to Icechunk. 

In this example we're creating a local Icechunk store, but you could configure it for cloud storage.

In [None]:
local_storage_config = icechunk.local_filesystem_storage("./terraclimate")
repo = icechunk.Repository.open_or_create(local_storage_config)
session = repo.writable_session("main")

In [None]:
mds.virtualize.to_icechunk(store=session.store)

## Open the Icechunk store with Xarray

**Warning:** Calling `to_zarr` on this dataset will try to write out 1TB of data.


In [None]:
combined_ds = xr.open_zarr(session.store, consolidated=False, zarr_format=3)

In [None]:
combined_ds