(resource:intake_esm)=
# Accessing CMIP6 data with intake-esm


This notebook demonstrates how to access Google Cloud CMIP6 data using intake-esm.

Intake-esm is a data cataloging utility built on top of intake, pandas, and
xarray. Intake-esm aims to facilitate:

- the discovery of Earth’s climate and weather datasets.
- the ingestion of these datasets into xarray dataset containers.

Its basic usage is shown below. To begin, let's import `intake`, along with `xarray` and `pandas`, which are used later in the notebook:

In [1]:
import intake
import xarray as xr
import pandas as pd

## Load the catalog

At import time, the intake-esm plugin is registered in intake’s registry as
`esm_datastore` and can be accessed with the `intake.open_esm_datastore()` function.
Use the `intake_esm.tutorial.get_url()` function to access smaller, subsetted catalogs for tutorial purposes.
The cell below leaves that call commented out and points at the full Pangeo CMIP6 catalog instead.

In [2]:
import intake_esm
#url = intake_esm.tutorial.get_url('google_cmip6')
#print(url)
url = "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json"

In [None]:
cat = intake.open_esm_datastore(url)
cat


The summary above tells us that this catalog contains 514818 data assets.
We can get more information on the individual data assets contained in the
catalog by looking at the underlying dataframe created when we load the catalog:

In [None]:
cat.df.head()

The first data asset listed in the catalog contains:

- the surface pressure (variable_id='ps'), as a function of latitude, longitude, time,

- the high resolution version of the CMCC climate model (source_id='CMCC-CM2-HR4'),

- the `highresSST-present` experiment (experiment_id='highresSST-present'),

- developed by the Euro-Mediterranean Centre on Climate Change (institution_id='CMCC'),

- run as part of the High Resolution Model Intercomparison Project (activity_id='HighResMIP').

It is located in Google Cloud Storage at 'gs://cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/highresSST-present/r1i1p1f1/Amon/ps/gn/v20170706/'.
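
The store path itself encodes most of these attributes. The sketch below splits the path into named parts. It is only an illustration: `parse_store_path` and `FACETS` are hypothetical names, and the order of the path components is an assumption inferred from this one example path, not something intake-esm guarantees.

```python
# Hypothetical helper: split a CMIP6 Google Cloud store path into facets.
# ASSUMPTION: the component order below is inferred from the example path.
FACETS = [
    "activity_id",
    "institution_id",
    "source_id",
    "experiment_id",
    "member_id",
    "table_id",
    "variable_id",
    "grid_label",
    "version",
]


def parse_store_path(path, prefix="gs://cmip6/CMIP6/"):
    """Return a dict mapping facet names to the matching path components."""
    if not path.startswith(prefix):
        raise ValueError(f"unexpected prefix in {path!r}")
    parts = path[len(prefix):].strip("/").split("/")
    if len(parts) != len(FACETS):
        raise ValueError(f"expected {len(FACETS)} components, got {len(parts)}")
    return dict(zip(FACETS, parts))


facets = parse_store_path(
    "gs://cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/highresSST-present/"
    "r1i1p1f1/Amon/ps/gn/v20170706/"
)
for name, value in facets.items():
    print(f"{name:>15}: {value}")
```

In practice the catalog dataframe already holds these values as columns, so parsing the path is mainly useful as a sanity check.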

## Finding unique entries

To get unique values for given columns in the catalog, intake-esm provides a
{py:meth}`~intake_esm.core.esm_datastore.unique` method.

Let's query the data catalog to see what models (`source_id`), experiments
(`experiment_id`) and temporal frequencies (`table_id`) are available.

In [None]:
unique = cat.unique()

In [None]:
unique['variable_id'][:1]

In [None]:
experiments = unique['experiment_id']

In [None]:
experiments.sort()
#experiments

In [None]:
#unique['variable_id']
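
To make the idea concrete without loading the real catalog, here is a plain-Python sketch of what "unique values per column" means. The rows are toy values chosen for illustration, `unique_values` is a hypothetical helper, and the sorting is a choice made here; none of this is intake-esm's own implementation.

```python
# Toy catalog rows; illustrative values, not read from the real catalog.
rows = [
    {"source_id": "CanESM5", "experiment_id": "historical", "table_id": "Oyr"},
    {"source_id": "CanESM5", "experiment_id": "ssp585", "table_id": "Oyr"},
    {"source_id": "MRI-ESM2-0", "experiment_id": "historical", "table_id": "Amon"},
]


def unique_values(rows):
    """Collect the sorted unique values found in each column."""
    columns = {}
    for row in rows:
        for column, value in row.items():
            columns.setdefault(column, set()).add(value)
    return {column: sorted(values) for column, values in columns.items()}


toy_unique = unique_values(rows)
print(toy_unique["source_id"])
print(toy_unique["experiment_id"])
print(toy_unique["table_id"])
```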

## Search for specific datasets

The {py:meth}`~intake_esm.core.esm_datastore.search` method allows the user to
perform a query on a catalog using keyword arguments. The keyword argument names
must match column names in the catalog. The search method returns a
subset of the catalog with all the entries that match the provided query.

In the example below, we are going to search for the following:

- variable_id: `o2` which stands for
  `mole_concentration_of_dissolved_molecular_oxygen_in_sea_water`
- experiment_id: ['historical', 'ssp585']:
  - historical: all forcing of the recent past.
  - ssp585: emission-driven RCP8.5 based on SSP5.
- table_id: `Oyr` which stands for annual mean variables on the ocean grid.
- grid_label: `gn` which stands for data reported on a model's native grid.

The cell below runs a variant of this query: it restricts `experiment_id` to
`historical` and `source_id` to `MRI-ESM2-0`, and leaves the `variable_id`
filter commented out.

```{note}
For more details on the CMIP6 vocabulary, please check this
[website](http://clipc-services.ceda.ac.uk/dreq/index.html), and
[Core Controlled Vocabularies (CVs) for use in CMIP6](https://github.com/WCRP-CMIP/CMIP6_CVs)
GitHub repository.
```

In [None]:
cat_subset = cat.search(
  #  activity_id=["historical"],
  #  institution_id = ["CMCC"],
    experiment_id = ["historical"],
    source_id = ["MRI-ESM2-0"],
    table_id="Oyr",
  #  variable_id=["ph", "fgco2", "co3" , "dissic" , "chl", "co3satcalc" , "co3satarag", "calc", "tos"  ],  
    grid_label="gn",
)

cat_subset.keys()

#fgco2 ph co3 co3abio dissic

 #'ScenarioMIP.CMCC.CMCC-ESM2.ssp585.Oyr.gn',

# temperature feedbacks -> low sensitivity and high sensitivity models

# CMIP.CCCma.CanESM5.1pctCO2.Oyr.gn',
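
The matching rule behind a keyword search can be sketched in plain Python. This is a simplified stand-in under stated assumptions, not intake-esm's code: the `search` function and the toy rows below are hypothetical, a scalar must equal the column value, a list matches any of its items, and any richer matching the real method may offer is ignored.

```python
# Toy catalog rows; illustrative values, not read from the real catalog.
rows = [
    {"source_id": "MRI-ESM2-0", "experiment_id": "historical",
     "table_id": "Oyr", "grid_label": "gn"},
    {"source_id": "MRI-ESM2-0", "experiment_id": "ssp585",
     "table_id": "Oyr", "grid_label": "gn"},
    {"source_id": "CanESM5", "experiment_id": "historical",
     "table_id": "Oyr", "grid_label": "gn"},
    {"source_id": "MRI-ESM2-0", "experiment_id": "historical",
     "table_id": "Amon", "grid_label": "gn"},
]


def search(rows, **query):
    """Keep the rows whose columns match every keyword in the query."""

    def matches(row):
        for column, wanted in query.items():
            if column not in row:
                raise KeyError(f"{column!r} is not a catalog column")
            # Treat a scalar as a one-item list of allowed values.
            allowed = wanted if isinstance(wanted, (list, tuple, set)) else [wanted]
            if row[column] not in allowed:
                return False
        return True

    return [row for row in rows if matches(row)]


subset = search(
    rows,
    experiment_id=["historical"],
    source_id=["MRI-ESM2-0"],
    table_id="Oyr",
    grid_label="gn",
)
print(len(subset), "matching row(s)")
```

Every keyword narrows the result, so the conditions combine as a logical AND across columns and a logical OR within one column's list.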




## Load datasets using `to_dataset_dict()`

Intake-esm implements convenience utilities for loading the query results into
higher level xarray datasets. The logic for merging/concatenating the query
results into higher level xarray datasets is provided in the input JSON file and
is available under the `.esmcat.aggregation_control` property of the catalog:

In [None]:
cat.esmcat.aggregation_control

To load data assets into xarray datasets, we need to use the
{py:meth}`~intake_esm.core.esm_datastore.to_dataset_dict` method. This method
returns a dictionary of aggregate xarray datasets as the name hints.

In [None]:
dset_dict = cat_subset.to_dataset_dict(
    xarray_open_kwargs={"consolidated": True, "decode_times": True, "use_cftime": True}
)

In [None]:
list(dset_dict)[:10]
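
The dictionary keys look like `CMIP.CMCC.CMCC-ESM2.historical.Oyr.gn`. The sketch below shows one way such keys can arise: rows that share the same values in a set of grouping columns are collected under one dot-joined key. The `GROUPBY` list is an assumption inferred from the keys that appear in this notebook, and the rows are toy values; the catalog's own aggregation settings are the authoritative source.

```python
from collections import defaultdict

# ASSUMPTION: grouping columns inferred from keys such as
# 'CMIP.CMCC.CMCC-ESM2.historical.Oyr.gn' seen in this notebook.
GROUPBY = [
    "activity_id",
    "institution_id",
    "source_id",
    "experiment_id",
    "table_id",
    "grid_label",
]

# Toy catalog rows; illustrative values, not read from the real catalog.
rows = [
    {"activity_id": "CMIP", "institution_id": "CMCC", "source_id": "CMCC-ESM2",
     "experiment_id": "historical", "table_id": "Oyr", "grid_label": "gn",
     "variable_id": "ph"},
    {"activity_id": "CMIP", "institution_id": "CMCC", "source_id": "CMCC-ESM2",
     "experiment_id": "historical", "table_id": "Oyr", "grid_label": "gn",
     "variable_id": "co3"},
    {"activity_id": "ScenarioMIP", "institution_id": "CMCC", "source_id": "CMCC-ESM2",
     "experiment_id": "ssp585", "table_id": "Oyr", "grid_label": "gn",
     "variable_id": "ph"},
]


def group_assets(rows, groupby=GROUPBY, sep="."):
    """Map each dot-joined key to the variables of the rows that share it."""
    groups = defaultdict(list)
    for row in rows:
        key = sep.join(row[column] for column in groupby)
        groups[key].append(row["variable_id"])
    return dict(groups)


groups = group_assets(rows)
for key, variables in groups.items():
    print(key, variables)
```

Under this reading, several variables from the same model run end up in a single dataset, which is why the dictionary has fewer entries than the catalog subset has rows.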

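We can access a particular dataset by indexing the dictionary with one of its keys:

In [None]:
key = next(iter(dset_dict))  # first key; replace with any key listed above
ds = dset_dict[key]
ds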

## Use custom preprocessing functions

When comparing many models, it is often necessary to preprocess the datasets
(e.g. rename certain variables) before running an analysis step. The `preprocess`
argument lets the user pass a function, which is executed for each loaded asset
before combining datasets.

In [None]:
cat_pp = cat.search(
    experiment_id=["historical"],
    table_id="Oyr",
    variable_id=["ph", "fgco2", "co3" , "dissic" , "chl", "co3satcalc" , "co3satarag", "calc", "opottemptend", "osalttend"],
    grid_label="gn",
    source_id = ["CESM2"],
    member_id=["r1i1p1f1"]
)

pd.set_option('display.max_rows', None)

cat_pp.df

#'ScenarioMIP.CMCC.CMCC-ESM2.ssp585.Oyr.gn'

In [None]:
dset_dict_raw = cat_pp.to_dataset_dict(xarray_open_kwargs={"consolidated": True})

for k, ds in dset_dict_raw.items():
    print(f"dataset key={k}\n\tdimensions={sorted(list(ds.dims))}\n")

```{note}
Models do not all follow the same naming scheme. We can define a little
helper function and pass it to `.to_dataset_dict()` to fix this. For
demonstration purposes we will focus on the vertical level dimension, which is
called `lev` in `CanESM5` and `olevel` in `IPSL-CM6A-LR`. The query above
selects only `CESM2`, so the helper changes a dataset only if it has an
`olevel` dimension.
```

In [None]:
def helper_func(ds):
    """Rename `olevel` dim to `lev`"""
    ds = ds.copy()
    # a short example
    if "olevel" in ds.dims:
        ds = ds.rename({"olevel": "lev"})
    return ds

In [None]:
dset_dict_fixed = cat_pp.to_dataset_dict(xarray_open_kwargs={"consolidated": True}, preprocess=helper_func)

for k, ds in dset_dict_fixed.items():
    print(f"dataset key={k}\n\tdimensions={sorted(list(ds.dims))}\n")

This was just an example for one dimension.

```{note}
Check out the [xMIP package](https://github.com/jbusecke/xMIP)
for a full renaming function for all available CMIP6 models and some other
utilities.
```

## Load datasets into an xarray-datatree using `to_datatree()`

We can also load our data into an [xarray-datatree](https://xarray-datatree.readthedocs.io/en/latest/) object using the following:

In [None]:
helper_func

In [None]:
tree = cat_pp.to_datatree(xarray_open_kwargs={"consolidated": True}, preprocess=helper_func)


In [None]:
!conda list | grep tree

## Saving a dataset to disk

Once you have a dataset that you will keep using, you'll want to save it to a local drive for
further work. To save one dictionary entry to disk, use `to_netcdf`.
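
When saving several entries, deriving the file name from the dictionary key keeps each file traceable to its dataset. The helper below is a small sketch of that idea; `output_path` is a hypothetical name and the naming scheme is only one possible convention.

```python
from pathlib import Path


def output_path(key, directory=".", suffix=".nc"):
    """Build a file path from a dataset key, keeping the key readable."""
    # Keep the dots in the key; only guard against path separators.
    safe = key.replace("/", "_").replace("\\", "_")
    return Path(directory) / f"{safe}{suffix}"


path = output_path("CMIP.CMCC.CMCC-ESM2.historical.Oyr.gn")
print(path)
```

A call such as `dset_dict[key].to_netcdf(output_path(key))` would then write one file per key, and passing `suffix=".zarr"` gives a matching name for `to_zarr`.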

In [None]:
print(f"{k=}")

In [None]:
print(dset_dict.keys())


In [None]:
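write_it = True
if write_it:
    # Clear xarray's cache of open file handles before writing.
    xr.backends.file_manager.FILE_CACHE.clear()
    # The key must be present in dset_dict; check dset_dict.keys() above.
    dset_dict['CMIP.CMCC.CMCC-ESM2.historical.Oyr.gn'].to_netcdf("hist_CESM2.nc", mode="w")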

In [None]:
from copy import deepcopy
test = deepcopy(dset_dict["CMIP.CMCC.CMCC-ESM2.historical.Oyr.gn"])
test.to_zarr("CMCC_ESM2_hist.zarr")

In [None]:
write_it = True
if write_it:
    xr.backends.file_manager.FILE_CACHE.clear()
    dset_dict[k].to_netcdf("CMCC.CMCC-ESM2.nc",'w')

In [None]:
test_infile = xr.open_dataset("CMCC_ESM2_hist.zarr", engine="zarr")

In [None]:
test

In [None]:
! pwd