(resource:intake_esm)=
# Accessing CMIP6 data with intake-esm


This notebook demonstrates how to access Google Cloud CMIP6 data using intake-esm.

Intake-esm is a data cataloging utility built on top of intake, pandas, and
xarray. Intake-esm aims to facilitate:

- the discovery of Earth’s climate and weather datasets.
- the ingestion of these datasets into xarray dataset containers.

Its basic usage is shown below. To begin, let's import `intake`, along with `xarray` and `pandas`:

In [5]:
import intake
import xarray as xr
import pandas as pd

## Load the catalog

At import time, the intake-esm plugin is registered in intake’s registry as
`esm_datastore` and can be accessed with the `intake.open_esm_datastore()` function.
The `intake_esm.tutorial.get_url()` function returns URLs for smaller, subsetted catalogs
intended for tutorials; here we point directly at the full Pangeo CMIP6 catalog instead.

In [6]:
import intake_esm
# Smaller tutorial catalog (not used here):
# url = intake_esm.tutorial.get_url("google_cmip6")
# print(url)
url = "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json"

In [7]:
cat = intake.open_esm_datastore(url)




The summary above tells us that this catalog contains 514818 data assets.
We can get more information on the individual data assets contained in the
catalog by looking at the underlying dataframe created when we load the catalog:
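
As a rough illustration of what inspecting that dataframe looks like, the sketch below builds a tiny stand-in with pandas instead of downloading the real catalog. The name `catalog_df` is an assumption, the first two rows mirror search results shown later in this notebook, and the third row is only illustrative. With the real catalog loaded, the same pandas calls would apply to `cat.df`.

```python
import pandas as pd

# Stand-in for `cat.df`; the real dataframe has one row per data asset.
catalog_df = pd.DataFrame(
    {
        "activity_id": ["ScenarioMIP"] * 3,
        "institution_id": ["CMCC"] * 3,
        "source_id": ["CMCC-ESM2"] * 3,
        "experiment_id": ["ssp126", "ssp126", "ssp585"],  # last row is illustrative
        "member_id": ["r1i1p1f1"] * 3,
        "table_id": ["Oyr"] * 3,
        "variable_id": ["fgco2", "ph", "ph"],
        "grid_label": ["gn"] * 3,
    }
)

# Look at the first rows, then count the distinct values in each column.
print(catalog_df.head())
n_unique = catalog_df.nunique()
print(n_unique)
```

Counting distinct values per column is a quick way to see which experiments, variables, and models a catalog covers before writing a query.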

## Search for specific datasets

The {py:meth}`~intake_esm.core.esm_datastore.search` method allows the user to
perform a query on a catalog using keyword arguments. The keyword argument names
must match column names in the catalog. The search method returns a
subset of the catalog with all the entries that match the provided query.
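
Conceptually, each keyword argument restricts one column to a set of allowed values, and a row is kept only if it passes every restriction. The pandas sketch below mimics that behaviour on a small stand-in table. The function `search_like` and the table rows are hypothetical, and this is not intake-esm's actual implementation.

```python
import pandas as pd


def search_like(df, **query):
    """Keep rows whose columns match every keyword (values may be lists)."""
    mask = pd.Series(True, index=df.index)
    for column, wanted in query.items():
        if column not in df.columns:
            raise KeyError(f"{column!r} is not a column of the table")
        # Accept either a single string or a list of allowed values.
        values = [wanted] if isinstance(wanted, str) else list(wanted)
        mask &= df[column].isin(values)
    return df[mask]


# Hypothetical miniature catalog with only three of the real columns.
table = pd.DataFrame(
    {
        "experiment_id": ["ssp126", "ssp126", "ssp585", "historical"],
        "variable_id": ["ph", "fgco2", "ph", "o2"],
        "table_id": ["Oyr", "Oyr", "Oyr", "Oyr"],
    }
)

subset = search_like(
    table,
    experiment_id=["ssp126"],
    variable_id=["ph", "fgco2"],
    table_id="Oyr",
)
print(subset)
```

Values within one keyword act as alternatives, while different keywords must all hold at once, so only the two `ssp126` rows survive here.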

In the example below, we are going to search for the following:

- institution_id: `CMCC`, the modeling center that produced the data.
- experiment_id: `ssp126`, a future scenario described as an update of RCP2.6
  based on SSP1.
- table_id: `Oyr`, which stands for annual mean variables on the ocean grid.
- variable_id: `ph` (seawater pH) and `fgco2` (surface downward flux of CO2
  into the ocean).
- grid_label: `gn`, which stands for data reported on a model's native grid.

```{note}
For more details on the CMIP6 vocabulary, please check this
[website](http://clipc-services.ceda.ac.uk/dreq/index.html), and
[Core Controlled Vocabularies (CVs) for use in CMIP6](https://github.com/WCRP-CMIP/CMIP6_CVs)
GitHub repository.
```

In [13]:

cat_subset = cat.search(
    # activity_id=["historical"],
    institution_id=["CMCC"],
    experiment_id=["ssp126"],
    # source_id=["MRI-ESM2-0"],
    table_id="Oyr",
    variable_id=["ph", "fgco2"],
    grid_label="gn",
)

cat_subset.keys()

# Other variables of interest: fgco2, ph, co3, co3abio, dissic
# Other dataset keys of interest:
#   'ScenarioMIP.CMCC.CMCC-ESM2.ssp585.Oyr.gn'
#   'CMIP.CCCma.CanESM5.1pctCO2.Oyr.gn'
# Temperature feedbacks -> low-sensitivity and high-sensitivity models






['ScenarioMIP.CMCC.CMCC-ESM2.ssp126.Oyr.gn']

## Load datasets using `to_dataset_dict()`

Intake-esm implements convenience utilities for loading the query results into
higher-level xarray datasets. The logic for merging and concatenating the query
results into these datasets is defined in the input JSON file and is available
under the `.esmcat.aggregation_control` attribute of the catalog:

In [14]:
cat.esmcat.aggregation_control

AggregationControl(variable_column_name='variable_id', groupby_attrs=['activity_id', 'institution_id', 'source_id', 'experiment_id', 'table_id', 'grid_label'], aggregations=[Aggregation(type=<AggregationType.union: 'union'>, attribute_name='variable_id', options={}), Aggregation(type=<AggregationType.join_new: 'join_new'>, attribute_name='member_id', options={'coords': 'minimal', 'compat': 'override'}), Aggregation(type=<AggregationType.join_new: 'join_new'>, attribute_name='dcpp_init_year', options={'coords': 'minimal', 'compat': 'override'})])
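
To give a feel for what these aggregation types mean, the sketch below reproduces them by hand on synthetic data: `join_new` is approximated with `xr.concat` along a new `member_id` dimension, and `union` with `xr.merge` across variables. The helpers `make_member` and `join_new`, the member labels, and the numbers are made up for illustration, and intake-esm's internals may differ in detail.

```python
import numpy as np
import xarray as xr

members = ["r1i1p1f1", "r2i1p1f1"]  # hypothetical ensemble members


def make_member(name, value):
    """Build a tiny one-variable dataset standing in for a single asset."""
    return xr.Dataset(
        {name: ("time", np.full(3, value))},
        coords={"time": [2015, 2016, 2017]},
    )


def join_new(name, values):
    """Stack per-member datasets along a new `member_id` dimension."""
    stacked = xr.concat([make_member(name, v) for v in values], dim="member_id")
    return stacked.assign_coords(member_id=members)


# union: combine the different variables into a single dataset.
combined = xr.merge([join_new("ph", [8.0, 8.1]), join_new("fgco2", [1.0, 2.0])])
print(combined)
```

The result holds both variables, each with a `member_id` dimension in front of `time`, which is the shape of dataset that the grouped catalog entries are combined into.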

To load data assets into xarray datasets, we need to use the
{py:meth}`~intake_esm.core.esm_datastore.to_dataset_dict` method. As the name
suggests, this method returns a dictionary of aggregated xarray datasets.

In [15]:
dset_dict = cat_subset.to_dataset_dict(
    xarray_open_kwargs={"consolidated": True, "decode_times": True, "use_cftime": True}
)


--> The keys in the returned dictionary of datasets are constructed as follows:
	'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'


In [16]:
list(dset_dict)[:10]

['ScenarioMIP.CMCC.CMCC-ESM2.ssp126.Oyr.gn']
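
The key format reported above can be reproduced by joining the grouping attributes of a catalog row with dots. The sketch below uses the `groupby_attrs` list from the aggregation control output and one row from this notebook's search results; the names `row` and `key` are just local illustrations.

```python
# Grouping attributes, as listed in the aggregation control output.
groupby_attrs = [
    "activity_id",
    "institution_id",
    "source_id",
    "experiment_id",
    "table_id",
    "grid_label",
]

# One catalog row, reduced to the columns used for grouping.
row = {
    "activity_id": "ScenarioMIP",
    "institution_id": "CMCC",
    "source_id": "CMCC-ESM2",
    "experiment_id": "ssp126",
    "table_id": "Oyr",
    "grid_label": "gn",
}

# Join the attribute values in order, separated by dots.
key = ".".join(row[attr] for attr in groupby_attrs)
print(key)  # ScenarioMIP.CMCC.CMCC-ESM2.ssp126.Oyr.gn
```

Columns that are not grouping attributes, such as `variable_id` and `member_id`, do not appear in the key because they are aggregated inside the dataset.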

We can access a particular dataset as follows:

In [17]:
ds = dset_dict["ScenarioMIP.CMCC.CMCC-ESM2.ssp126.Oyr.gn"]
ds

<xarray.Dataset> Size: 2GB
Dimensions:             (member_id: 1, dcpp_init_year: 1, time: 86, i: 292,
                         j: 362, bnds: 2, vertices: 4, lev: 50)
Coordinates:
  * i                   (i) int32 1kB 0 1 2 3 4 5 6 ... 286 287 288 289 290 291
  * j                   (j) int32 1kB 0 1 2 3 4 5 6 ... 356 357 358 359 360 361
    latitude            (i, j) float64 846kB dask.array<chunksize=(292, 362), meta=np.ndarray>
    longitude           (i, j) float64 846kB dask.array<chunksize=(292, 362), meta=np.ndarray>
  * time                (time) object 688B 2015-07-02 12:00:00 ... 2100-07-02...
    time_bnds           (time, bnds) object 1kB dask.array<chunksize=(86, 2), meta=np.ndarray>
    vertices_latitude   (i, j, vertices) float64 3MB dask.array<chunksize=(292, 362, 4), meta=np.ndarray>
    vertices_longitude  (i, j, vertices) float64 3MB dask.array<chunksize=(292, 362, 4), meta=np.ndarray>
  * member_id           (

## Use custom preprocessing functions

When comparing many models, it is often necessary to preprocess the datasets
(e.g. rename certain variables) before running an analysis step. The `preprocess`
argument lets the user pass a function, which is executed for each loaded asset
before the datasets are combined.

In [20]:
cat_pp = cat.search(
    experiment_id=["ssp126"],
    table_id="Oyr",
    variable_id=["ph", "fgco2"],
    grid_label="gn",
    source_id=["CMCC-ESM2"],
    member_id=["r1i1p1f1"],
)

pd.set_option("display.max_rows", None)  # show every row of the result

cat_pp.df
# Other variables of interest: "dissicabio", "phabio", "co3abio", "fgco2abio"
# Other dataset key of interest: 'ScenarioMIP.CMCC.CMCC-ESM2.ssp585.Oyr.gn'



|   | activity_id | institution_id | source_id | experiment_id | member_id | table_id | variable_id | grid_label | zstore | dcpp_init_year | version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ScenarioMIP | CMCC | CMCC-ESM2 | ssp126 | r1i1p1f1 | Oyr | fgco2 | gn | gs://cmip6/CMIP6/ScenarioMIP/CMCC/CMCC-ESM2/ss... |  | 20210126 |
| 1 | ScenarioMIP | CMCC | CMCC-ESM2 | ssp126 | r1i1p1f1 | Oyr | ph | gn | gs://cmip6/CMIP6/ScenarioMIP/CMCC/CMCC-ESM2/ss... |  | 20210126 |


In [21]:
dset_dict_raw = cat_pp.to_dataset_dict(xarray_open_kwargs={"consolidated": True})

for k, ds in dset_dict_raw.items():
    print(f"dataset key={k}\n\tdimensions={sorted(ds.dims)}\n")


--> The keys in the returned dictionary of datasets are constructed as follows:
	'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'


dataset key=ScenarioMIP.CMCC.CMCC-ESM2.ssp126.Oyr.gn
	dimensions=['bnds', 'dcpp_init_year', 'i', 'j', 'lev', 'member_id', 'time', 'vertices']



```{note}
Different models can follow different naming schemes. We can define a little
helper function and pass it to `.to_dataset_dict()` to fix this. For
demonstration purposes we will focus on the vertical level dimension, which is
called `lev` in `CanESM5` (and in `CMCC-ESM2`, as shown above) and `olevel` in
`IPSL-CM6A-LR`.
```

In [22]:
def helper_func(ds):
    """Rename `olevel` dim to `lev`"""
    ds = ds.copy()
    # a short example
    if "olevel" in ds.dims:
        ds = ds.rename({"olevel": "lev"})
    return ds

In [23]:
dset_dict_fixed = cat_pp.to_dataset_dict(xarray_open_kwargs={"consolidated": True}, preprocess=helper_func)

for k, ds in dset_dict_fixed.items():
    print(f"dataset key={k}\n\tdimensions={sorted(ds.dims)}\n")


--> The keys in the returned dictionary of datasets are constructed as follows:
	'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'


dataset key=ScenarioMIP.CMCC.CMCC-ESM2.ssp126.Oyr.gn
	dimensions=['bnds', 'dcpp_init_year', 'i', 'j', 'lev', 'member_id', 'time', 'vertices']



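This was just an example for one dimension. Because `CMCC-ESM2` already names
its vertical dimension `lev`, the dimensions listed here are unchanged.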
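
To see the rename actually take effect, the sketch below applies the same idea to two small synthetic datasets, one using `olevel` and one using `lev`. The function `rename_olevel` repeats the helper's logic so the block runs on its own, and `olevel_ds` and `lev_ds` are made-up stand-ins rather than real model output.

```python
import numpy as np
import xarray as xr


def rename_olevel(ds):
    """Rename the `olevel` dimension to `lev` when it is present."""
    if "olevel" in ds.dims:
        ds = ds.rename({"olevel": "lev"})
    return ds


# Synthetic datasets that differ only in the name of the vertical dimension.
olevel_ds = xr.Dataset({"ph": (("time", "olevel"), np.zeros((2, 3)))})
lev_ds = xr.Dataset({"ph": (("time", "lev"), np.zeros((2, 3)))})

for label, dataset in {"olevel input": olevel_ds, "lev input": lev_ds}.items():
    fixed = rename_olevel(dataset)
    print(f"{label}: dimensions={sorted(fixed.dims)}")
```

Both inputs end up with the dimensions `lev` and `time`, which is what allows datasets from differently named models to be compared or combined afterwards.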

```{note}
Check out the [xMIP package](https://github.com/jbusecke/xMIP)
for a full renaming function for all available CMIP6 models and some other
utilities.
```

## Load datasets into an xarray-datatree using `to_datatree()`

We can also load our data into an [xarray-datatree](https://xarray-datatree.readthedocs.io/en/latest/) object using the following:

In [24]:
helper_func

<function __main__.helper_func(ds)>

In [25]:
! pip install xarray-datatree



In [26]:
tree = cat_pp.to_datatree(xarray_open_kwargs={"consolidated": True}, preprocess=helper_func)


ImportError: .to_datatree() requires the xarray-datatree package to be installed. To proceed please install xarray-datatree using:  `python -m pip install xarray-datatree` or `conda install -c conda-forge xarray-datatree`.

In [None]:
!conda list | grep tree

## Saving a dataset to disk

Once you have a case you will continue to use, you'll want to save it to a local drive for
further work. To save one dictionary entry to disk, use `to_netcdf`.

In [None]:
print(f"{k=}")

In [None]:
print(dset_dict.keys())


In [None]:
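small_ds = dset_dict["ScenarioMIP.CMCC.CMCC-ESM2.ssp126.Oyr.gn"]  # example
# The ssp126 run covers 2015-2100, so select within that range. Keep only the
# first vertical level and average over ensemble members.
small_ds = small_ds.sel(
    time=slice("2015", "2100"), lev=small_ds.lev.values[0]
).mean(dim="member_id")

small_ds

In [None]:
write_it = True
if write_it:
    # Clear xarray's cache of open file handles before writing.
    xr.backends.file_manager.FILE_CACHE.clear()
    small_ds.to_netcdf("ssp126_CMCC-ESM2_2015_2100.nc")

In [None]:
# Read the file back to check that it was written correctly.
test = xr.open_dataset("ssp126_CMCC-ESM2_2015_2100.nc")

In [None]:
test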

In [None]:
! pwd