# Accessing and Visualizing CMIP6 Data via Amazon Web Services
### Authors

Samantha Stevenson sstevenson@ucsb.edu

### Table of Contents

[Goals](#purpose)

[Import Packages](#path)

[Read in Data](#data_io)

[Create a Time Series](#time_series)

[Plot a Map of the Mean State](#map)


<a id='purpose'></a> 
## **Goals**

In this tutorial, we will be using the database of Coupled Model Intercomparison Project phase 6 (CMIP6) output hosted by Amazon Web Services to create basic visualizations. 

The steps in this tutorial build on previous tutorials:
- [Time Series Plots](https://github.com/climate-datalab/Time-Series-Plots)
- [Map Plots](https://github.com/climate-datalab/Map-Plots)

Basically: we'll be doing exactly the same things we did in those tutorials, but accessing data via the cloud instead of downloading the files to a local machine/server! Please refer back to those materials if you would like additional detail on the code used to generate time series and map plots.

<a id='path'></a> 
## **Import Packages**

As always, we begin by importing the packages needed for our analysis. The packages that are new for this tutorial are `intake` and `intake-esm`, which make it easier to find complicated datasets stored on a remote server and load them into Python objects that can be used for analysis. The `intake` package is designed to be flexible for all sorts of data science applications, but on its own it doesn't always work well with climate model output. That's where `intake-esm` comes in: it lets `intake` work smoothly with climate model output such as netCDF files by integrating with `xarray` and `pandas`. Note that the code below only needs `import intake`; once `intake-esm` is installed, its functions are available through `intake`.

More detail on how intake and intake-esm work can be found at:
- The [intake Read the Docs page](https://intake.readthedocs.io/en/latest/scope2.html)
- The [intake-esm Read the Docs page](https://intake-esm.readthedocs.io/en/v2021.8.17/user-guide/index.html)
- This handy [Youtube explainer](https://www.youtube.com/watch?v=QVogieGP4Jw)

To install `intake-esm`, you can use either conda or pip:

`conda install -c conda-forge intake-esm`

`pip install intake-esm`
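
Before running the imports, it can be handy to check which of the packages used in this tutorial are already installed. The snippet below is a minimal sketch using only the Python standard library; the helper name `find_missing` is made up for this example, and the list of module names assumes that `intake-esm` is importable as `intake_esm`.

```python
from importlib.util import find_spec

# Import names of the packages used in this tutorial. Note that the
# intake-esm package is assumed to be importable as "intake_esm".
REQUIRED = ["xarray", "matplotlib", "cartopy", "intake", "intake_esm"]


def find_missing(module_names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


missing = find_missing(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All packages for this tutorial are installed.")
```

Anything reported as missing can then be installed with one of the commands above.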


In [1]:
import xarray as xr
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import intake

<a id='data_io'></a> 
## **Read in Data** 

The next step is to read the data we'd like for our analysis into Python. Here we will NOT be downloading any files to a local machine! Instead, we'll rely on one of the various catalogs of climate model output hosted on cloud computing servers. This one is a set of CMIP6 output maintained by Amazon Web Services. 

You can find more information on the data catalog here:

[Blog post: CMIP6 provided through the Amazon Sustainability Data Initiative](https://aws.amazon.com/blogs/publicsector/now-available-cmip6-dataset-foster-climate-innovation-study-impact-future-climate-conditions/)

[Registry of Open Data (AWS)](https://registry.opendata.aws/cmip6/)

### **Use intake to open data catalog**

Let's first take a look at the whole data catalog to get a sense of what's in there! 

The `intake-esm` package provides a function called `open_esm_datastore`, which reads the JSON file describing the contents of the CMIP6 data holdings. The result is a "catalog" object that can be queried within Python to grab just the part of it a user is interested in.

More details on `open_esm_datastore`:

[Read the Docs "Loading a Catalog" page](https://intake-esm.readthedocs.io/en/v2021.8.17/user-guide/overview.html#loading-a-catalog)

(note that the link above uses a different data catalog than the one we're working with here, but the principle is the same!)

In [2]:
# Open the CMIP6 data catalog, store as a variable
catalog = intake.open_esm_datastore('https://cmip6-pds.s3.amazonaws.com/pangeo-cmip6.json')

In [10]:
# Print the catalog to get a summary of its contents
catalog

| | unique |
|---|---|
| activity_id | 18 |
| institution_id | 36 |
| source_id | 88 |
| experiment_id | 170 |
| member_id | 657 |
| table_id | 37 |
| variable_id | 709 |
| grid_label | 10 |
| zstore | 522217 |
| dcpp_init_year | 60 |


### **Search the catalog for a specific dataset**

As you can see from the code block above, there is an enormous amount of data in this catalog! We definitely don't want to look at the entire thing all at once.

Let's do an example of pulling the data we used in previous tutorials, this time from the cloud. The example data file from the [Time Series Plots](https://github.com/climate-datalab/Time-Series-Plots) and [Map Plots](https://github.com/climate-datalab/Map-Plots) repositories is:

`tas_Amon_CanESM5_historical_r10i1p1f1_gn_185001-201412.nc`

We can break this down to extract the fields we'll need to search the data catalog properly. If you need more detail on how to do this, also refer to the [filename decoder](http://climate-datalab.org/filename-decoder/) on the Climate DataLab website!

#### **Characteristics of this file**:
- _Variable_: This is a surface air temperature, or "tas", variable.
- _Realm_: Surface air temperature is generated by the atmosphere component of a climate model ("A"), and the information in this particular file is averaged monthly ("mon").
- _Model_: The name of the model is "CanESM5", which is short for the Canadian Earth System Model version 5.
- _Experiment_: This is a "historical" simulation, which in this file runs from January 1850 through December 2014.
- _Ensemble member_: The name of this ensemble member is "r10i1p1f1".
- _Grid_: This output is provided on the model's _native grid_ ("gn"), instead of doing any kind of interpolating to a different grid.
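
The breakdown above can also be done in code. Here is a minimal sketch that splits a CMIP6 filename into its fields, assuming the file follows the usual underscore-separated pattern (variable, table, model, experiment, ensemble member, grid, time range). The function name `decode_filename` and the `time_range` key are made up for this example; the other keys are chosen to match the column names in the catalog summary.

```python
# Field order assumed from the standard CMIP6 filename pattern:
# <variable>_<table>_<model>_<experiment>_<member>_<grid>_<time range>.nc
FIELDS = ["variable_id", "table_id", "source_id", "experiment_id",
          "member_id", "grid_label", "time_range"]


def decode_filename(filename):
    """Split a CMIP6 filename into a dictionary of its fields."""
    # Drop the ".nc" extension if it is present
    stem = filename[:-3] if filename.endswith(".nc") else filename
    parts = stem.split("_")
    if len(parts) != len(FIELDS):
        raise ValueError(f"Expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


fields = decode_filename("tas_Amon_CanESM5_historical_r10i1p1f1_gn_185001-201412.nc")
for name, value in fields.items():
    print(f"{name}: {value}")
```

Files that don't follow this pattern (for example, time-independent fields with no time range) will raise an error in this sketch rather than being decoded incorrectly.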

#### **How to search the catalog for these things**:

In order to find the equivalent information in the CMIP6 AWS catalog, we need to know what the appropriate search terms are. The terminology that AWS uses is slightly different from the way that things are specified on the CMIP6 website, because why make things too easy...?

You can see the fields that are listed in the AWS catalog in the catalog summary displayed above. Here is a translation chart to explain the most important ones (the fields that you'll generally be searching over):

- `activity_id`: This is the name of the "activity", or overall model intercomparison project (MIP), you're interested in. There are a lot of these, and you don't need to worry about most of them right now! (The idea behind the MIPs is explained in the [CMIP and other MIPs](https://climate-datalab.org/cmip-and-sub-mips/) page on the Climate DataLab website).

   For most applications, the ones you'll want are `CMIP` and `ScenarioMIP`. The `CMIP` activity is where the data for the historical period (1850-2014) is located, and the `ScenarioMIP` activity contains all the future projections (2015-2100).
  
- `source_id`: This is the name of the actual climate model you're interested in. In our case, we want CanESM5! 

- `institution_id`: This is the name of the "institution", or modeling center, which ran a given simulation. _Don't worry too much about this one_, because you can just search by the name of the model itself and get the same result. But for reference here: the modeling center which created CanESM5 is the Canadian Centre for Climate Modelling and Analysis (`CCCma`). See the [Models vs Modeling Centers](https://climate-datalab.org/models-vs-modeling-centers/) explainer on the Climate DataLab site if you're curious about how this works!

- `experiment_id`: This is the name of the specific type of "experiment" included in CMIP or ScenarioMIP. The ones you'll want here are `historical` (which is part of the CMIP "activity"), and one of the SSP future scenarios (which are part of ScenarioMIP). 

  You can pick which futures you're interested in! The main four scenarios used for CMIP6 are `ssp126`, `ssp245`, `ssp370`, and `ssp585`. Of these four, higher numbers after `ssp` mean more overall warming. Technically, the first digit identifies the socioeconomic pathway (SSP1 through SSP5), and the last two digits give the approximate radiative forcing reached by 2100 in W/m² (so `ssp585` is SSP5 with a forcing of about 8.5 W/m²).

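- `member_id`: This is the name of the ensemble member. In our case, we want `r10i1p1f1`.

- `table_id`: This is the name of the output table, which combines the model component and the averaging period. In our case, we want `Amon` (monthly output from the atmosphere).

- `variable_id`: This is the short name of the variable. In our case, we want `tas` (surface air temperature).

- `grid_label`: This is the grid the output is provided on. In our case, we want `gn` (the model's native grid).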
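
Two of the labels above pack several pieces of information into a short string. The sketch below unpacks them, assuming the usual CMIP6 conventions: scenario names of the form `ssp` plus three digits (pathway number, then forcing in tenths of W/m²), and ensemble member names of the form `r<N>i<N>p<N>f<N>` (realization, initialization, physics, and forcing indices). The function names are made up for this example, and less common labels that don't fit these patterns will raise an error.

```python
import re


def decode_ssp(experiment_id):
    """Return (pathway number, forcing in W/m^2) for labels like 'ssp585'."""
    match = re.fullmatch(r"ssp(\d)(\d\d)", experiment_id)
    if match is None:
        raise ValueError(f"Not a simple SSP label: {experiment_id!r}")
    pathway = int(match.group(1))
    forcing = int(match.group(2)) / 10  # e.g. "85" -> 8.5 W/m^2
    return pathway, forcing


def decode_member(member_id):
    """Split a label like 'r10i1p1f1' into its four integer indices."""
    match = re.fullmatch(r"r(\d+)i(\d+)p(\d+)f(\d+)", member_id)
    if match is None:
        raise ValueError(f"Not a standard member label: {member_id!r}")
    names = ["realization", "initialization", "physics", "forcing"]
    return dict(zip(names, (int(g) for g in match.groups())))


for label in ["ssp126", "ssp245", "ssp370", "ssp585"]:
    pathway, forcing = decode_ssp(label)
    print(f"{label}: SSP{pathway}, about {forcing} W/m^2 by 2100")

print(decode_member("r10i1p1f1"))
```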


In [7]:
variable_id = 'tas'  # Surface air temperature
table_id = 'Amon'  # Monthly data from the atmosphere

grid = 'gn'  # Model's native grid

# Valid values for institution_id, experiment_id, and source_id are listed in https://github.com/WCRP-CMIP/CMIP6_CVs
experiment_ids = ['historical', 'ssp126', 'ssp245', 'ssp370', 'ssp585']
activity_ids = ['ScenarioMIP', 'CMIP']  # Search scenarios & historical data only
source_id = ['CanESM5']  # Name of the climate model


In [9]:
# Search the catalog for entries matching all of the criteria defined above
res = catalog.search(activity_id=activity_ids, experiment_id=experiment_ids, source_id=source_id,
                     table_id=table_id, variable_id=variable_id, grid_label=grid)
display(res.df)

| | activity_id | institution_id | source_id | experiment_id | member_id | table_id | variable_id | grid_label | zstore | dcpp_init_year | version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ScenarioMIP | CCCma | CanESM5 | ssp370 | r24i1p1f1 | Amon | tas | gn | s3://cmip6-pds/CMIP6/ScenarioMIP/CCCma/CanESM5... | | 20190429 |
| 1 | ScenarioMIP | CCCma | CanESM5 | ssp370 | r25i1p2f1 | Amon | tas | gn | s3://cmip6-pds/CMIP6/ScenarioMIP/CCCma/CanESM5... | | 20190429 |
| 2 | ScenarioMIP | CCCma | CanESM5 | ssp370 | r24i1p2f1 | Amon | tas | gn | s3://cmip6-pds/CMIP6/ScenarioMIP/CCCma/CanESM5... | | 20190429 |
| 3 | ScenarioMIP | CCCma | CanESM5 | ssp585 | r21i1p2f1 | Amon | tas | gn | s3://cmip6-pds/CMIP6/ScenarioMIP/CCCma/CanESM5... | | 20190429 |
| 4 | ScenarioMIP | CCCma | CanESM5 | ssp585 | r22i1p1f1 | Amon | tas | gn | s3://cmip6-pds/CMIP6/ScenarioMIP/CCCma/CanESM5... | | 20190429 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 260 | ScenarioMIP | CCCma | CanESM5 | ssp245 | r12i1p2f1 | Amon | tas | gn | s3://cmip6-pds/CMIP6/ScenarioMIP/CCCma/CanESM5... | | 20190429 |
| 261 | ScenarioMIP | CCCma | CanESM5 | ssp245 | r11i1p2f1 | Amon | tas | gn | s3://cmip6-pds/CMIP6/ScenarioMIP/CCCma/CanESM5... | | 20190429 |
| 262 | ScenarioMIP | CCCma | CanESM5 | ssp126 | r14i1p1f1 | Amon | tas | gn | s3://cmip6-pds/CMIP6/ScenarioMIP/CCCma/CanESM5... | | 20190429 |
| 263 | ScenarioMIP | CCCma | CanESM5 | ssp245 | r12i1p1f1 | Amon | tas | gn | s3://cmip6-pds/CMIP6/ScenarioMIP/CCCma/CanESM5... | | 20190429 |
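
The search above combines several keyword arguments at once. As a rough mental model, a row of the catalog is kept only if it matches _every_ keyword, and when a keyword is given a list, _any_ of the listed values is acceptable. The toy example below illustrates that logic in plain Python. It is an illustration written for this tutorial, not the actual `intake-esm` implementation, and the handful of rows is a small made-up table rather than real catalog contents.

```python
def matches(row, query):
    """True if the row satisfies every field in the query."""
    for field, wanted in query.items():
        # A single value is treated like a one-element list
        allowed = wanted if isinstance(wanted, (list, tuple)) else [wanted]
        if row[field] not in allowed:
            return False
    return True


def search(rows, **query):
    """Keep only the rows that match all of the keyword criteria."""
    return [row for row in rows if matches(row, query)]


# A tiny, hypothetical stand-in for the catalog table
rows = [
    {"source_id": "CanESM5", "experiment_id": "historical", "variable_id": "tas"},
    {"source_id": "CanESM5", "experiment_id": "ssp585", "variable_id": "tas"},
    {"source_id": "CanESM5", "experiment_id": "historical", "variable_id": "pr"},
    {"source_id": "CESM2", "experiment_id": "historical", "variable_id": "tas"},
]

hits = search(rows, source_id="CanESM5", variable_id="tas",
              experiment_id=["historical", "ssp585"])
for row in hits:
    print(row)
```

By the same logic, passing additional keywords to `catalog.search` (for example a single `experiment_id` and `member_id`) should narrow the results further.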
