# Accessing CMIP6 Data via Amazon Web Services
### Authors

Samantha Stevenson sstevenson@ucsb.edu

### Table of Contents

[Goals](#purpose)

[Import Packages](#path)

[Read in Data](#data_io)

[Plot a Time Series](#time_series)

<a id='purpose'></a> 
## **Goals**

In this tutorial, we will be using the database of Coupled Model Intercomparison Project phase 6 (CMIP6) output hosted by Amazon Web Services to create a basic time series visualization. 

The steps in this tutorial build on the various skills we learned in previous tutorials:
- [Read in Data and Plot a Time Series](https://github.com/climate-datalab/Time-Series-Plots/blob/main/1.%20Read%20in%20Climate%20Data%20%2B%20Plot%20a%20Regionally%20Averaged%20Time%20Series.ipynb)
  (regional averaging, time series plotting)
  
- [Mean-State and Seasonal Average Maps](https://github.com/climate-datalab/Map-Plots/blob/main/2.%20Mean-State%20and%20Seasonal%20Difference%20Plots.ipynb)
  (concatenating xarray objects)

Basically: we'll be doing a lot of the same things we did in those tutorials, but accessing data via the cloud instead of downloading the files to a local machine/server! Please refer back to those materials if you would like additional detail.

<a id='path'></a> 
## **Import Packages**

As always, we begin by importing the necessary packages for our analysis. The packages that are new for this tutorial are:
- `intake` 
- `intake-esm`
- `s3fs`

The idea behind `intake` is to provide a unified interface to remote data: a consistent API regardless of where the data lives or what format it's stored in. It relies on "catalogs" of data on the remote server, which contain inventories of all the data available and the locations in which it's stored. `intake` also interfaces really well with packages like pandas and xarray - basically, it lets you synthesize a bunch of data on a server and read it in quickly as an easy-to-manipulate object within Python.

In addition to `intake`, `intake-esm` is also needed to parse the CMIP6 data catalogs we're working with today. `intake-esm` is a plugin that layers on top of `intake` - so it actually requires that `intake` be installed in order to function. `intake-esm` provides additional tools to search, filter, and load netCDF information (or, as we'll see later, "zarr" format data) and understands the metadata structure associated with CMIP6 and many other ensembles of climate information.

The final new package we'll need is `s3fs`, which provides a file system interface to the [Amazon Simple Storage Service (S3)](https://aws.amazon.com/s3/). This allows a user to read and write files directly from the S3 server, and integrates with xarray and intake. 

More detail on how intake, intake-esm, and s3fs work can be found at:
- The [intake Read the Docs page](https://intake.readthedocs.io/en/latest/scope2.html)
- The [intake-esm Read the Docs page](https://intake-esm.readthedocs.io/en/v2021.8.17/user-guide/index.html)
- This handy [Youtube explainer](https://www.youtube.com/watch?v=QVogieGP4Jw)
- The [s3fs Read the Docs page](https://s3fs.readthedocs.io/en/latest/)

To install these packages, you can use either pip or conda, as usual.
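
Before running the import cell, it can help to confirm that everything is actually installed. The snippet below is a minimal sketch using only the standard library; the helper name `missing_packages` is made up for this example, and it assumes `intake-esm` is importable under the module name `intake_esm`.

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


# Module names to look for (intake-esm is imported as intake_esm)
needed = ["xarray", "matplotlib", "intake", "intake_esm", "s3fs"]
print("Missing packages:", missing_packages(needed))
```

An empty list means you're ready to go; anything listed still needs to be installed with pip or conda.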

In [1]:
import xarray as xr
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import intake
import s3fs

<a id='data_io'></a> 
## **Read in Data** 

The next step is to read the data we'd like for our analysis into Python. Here we will NOT be downloading any files to a local machine! Instead, we'll rely on one of the various catalogs of climate model output hosted on cloud computing servers. This one is a set of CMIP6 output maintained by Amazon Web Services. 

You can find more information on the data catalog here:

[Blog post: CMIP6 provided through the Amazon Sustainability Data Initiative](https://aws.amazon.com/blogs/publicsector/now-available-cmip6-dataset-foster-climate-innovation-study-impact-future-climate-conditions/)

[Registry of Open Data (AWS)](https://registry.opendata.aws/cmip6/)

### **Use intake to open data catalog**

Let's first take a look at the whole data catalog to get a sense of what's in there! 

The `intake-esm` package provides a function called `open_esm_datastore`, which reads the JSON file describing the contents of the CMIP6 data holdings. Because `intake-esm` is a plugin, the function is called through `intake` itself (`intake.open_esm_datastore`). The result is a "catalog" object that can be queried within Python to grab just the part of it a user is interested in.

More details on `open_esm_datastore`:

[Read the Docs "Loading a Catalog" page](https://intake-esm.readthedocs.io/en/v2021.8.17/user-guide/overview.html#loading-a-catalog)

(note that the link above uses a different data catalog than the one we're working with here, but the principle is the same!)

_**The CMIP6 data catalog is quite large, so this code block may take a minute to run:**_

In [2]:
# Open the CMIP6 data catalog, store as a variable
catalog = intake.open_esm_datastore('https://cmip6-pds.s3.amazonaws.com/pangeo-cmip6.json')

In [3]:
# Print the catalog to get a summary of its contents
catalog

| | unique |
|---|---|
| activity_id | 18 |
| institution_id | 36 |
| source_id | 88 |
| experiment_id | 170 |
| member_id | 657 |
| table_id | 37 |
| variable_id | 709 |
| grid_label | 10 |
| zstore | 522217 |
| dcpp_init_year | 60 |


### **Search the catalog for a specific dataset**

As you can see from the code block above, there is an enormous amount of data in this catalog! We definitely don't want to look at the entire thing all at once.

Let's do an example of pulling the data we used in previous tutorials, this time from the cloud. The example data file from the [Time Series Plots](https://github.com/climate-datalab/Time-Series-Plots) and [Map Plots](https://github.com/climate-datalab/Map-Plots) repositories is:

`tas_Amon_CanESM5_historical_r10i1p1f1_gn_185001-201412.nc`

We can break this down to extract the fields we'll need to search the data catalog properly. If you need more detail on how to do this, also refer to the [filename decoder](http://climate-datalab.org/filename-decoder/) on the Climate DataLab website!

#### **Characteristics of this file**:
- _Variable_: This is a surface air temperature, or "tas", variable.
- _Realm_: Surface air temperature is generated by the atmosphere component of a climate model ("A"), and the information in this particular file is averaged monthly ("mon").
- _Model_: The name of the model is "CanESM5", which is short for the Canadian Earth System Model version 5.
- _Ensemble member_: The name of this ensemble member is "r10i1p1f1".
- _Grid_: This output is provided on the model's _native grid_ ("gn"), instead of doing any kind of interpolating to a different grid.
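
The breakdown above can also be done programmatically. Here is a minimal sketch that splits the file name on underscores and labels each piece with the catalog field it corresponds to. The function name `parse_cmip6_filename` is made up for this example, and it assumes the file name has exactly the seven fields shown here (files without a time range would need different handling).

```python
def parse_cmip6_filename(filename):
    """Split a CMIP6 file name into its underscore-separated fields."""
    # Drop the .nc extension if present
    stem = filename[:-3] if filename.endswith(".nc") else filename
    keys = ["variable_id", "table_id", "source_id", "experiment_id",
            "member_id", "grid_label", "time_range"]
    values = stem.split("_")
    if len(values) != len(keys):
        raise ValueError(f"expected {len(keys)} fields, found {len(values)}")
    return dict(zip(keys, values))


fields = parse_cmip6_filename(
    "tas_Amon_CanESM5_historical_r10i1p1f1_gn_185001-201412.nc"
)
for key, value in fields.items():
    print(f"{key:>14}: {value}")
```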

#### **How to search the catalog for these things**:

In order to find the equivalent information in the CMIP6 AWS catalog, we need to know what the appropriate search terms are. The terminology that AWS uses is slightly different from the way that things are specified on the CMIP6 website, because why make things too easy....?

You can see the fields that are listed in the AWS catalog in the catalog summary displayed above. Here is a translation chart to explain the most important ones (the fields that you'll generally be searching over):

- `activity_id`: This is the name of the "activity", or overall model intercomparison project (MIP), you're interested in. There are a lot of these, and you don't need to worry about most of them right now! (The idea behind the MIPs is explained in the [CMIP and other MIPs](https://climate-datalab.org/cmip-and-sub-mips/) page on the Climate DataLab website).

   For most applications, the ones you'll want are `CMIP` and `ScenarioMIP`. The `CMIP` activity is where the data for the historical period (1850-2014) is located, and the `ScenarioMIP` activity contains all the future projections (2015-2100).
  
- `source_id`: This is the name of the actual climate model you're interested in. In our case, we want CanESM5! 

- `institution_id`: This is the name of the "institution", or modeling center, which ran a given simulation. _Don't worry too much about this one_, because you can just search by the name of the model itself and get the same result. But for reference here: the modeling center which created CanESM5 is the Canadian Centre for Climate Modelling and Analysis (CCCma). See the [Models vs Modeling Centers](https://climate-datalab.org/models-vs-modeling-centers/) explainer on the Climate DataLab site if you're curious about how this works!

- `experiment_id`: This is the name of the specific type of "experiment" included in CMIP or ScenarioMIP. The ones you'll want here are `historical` (which is part of the CMIP "activity"), and one of the SSP future scenarios (which are part of ScenarioMIP). 

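  You can pick which futures you're interested in! The main four scenarios used for CMIP6 are `ssp126`, `ssp245`, `ssp370`, and `ssp585`. For these four, higher numbers after `ssp` mean more overall warming. Technically, the first digit identifies the Shared Socioeconomic Pathway (SSP1, SSP2, SSP3, or SSP5), and the last two digits give the approximate radiative forcing in the year 2100 in W/m² (2.6, 4.5, 7.0, or 8.5): roughly, the extra energy trapped in the climate system relative to pre-industrial conditions.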

- `member_id`: This is the name of the individual ensemble member run for a given "experiment" and "source_id". In our case, we're looking for a member_id of r10i1p1f1.

- `table_id`: This is equivalent to the "realm" terminology used on the Earth System Grid; basically, which portion of the Earth do you want to be looking at? They're in different tables in the cloud database. In this case, we want the monthly averages for atmospheric variables, which is a table id of "Amon".

- `variable_id`: This is the name of the individual variable you're interested in visualizing. In this case, we're interested in surface air temperature, or "tas".
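
To make the `activity_id` / `experiment_id` relationship concrete, here is a small sketch with two helper functions (both names are made up for this example). The first returns the activity that holds a given experiment, for the two cases used in this tutorial. The second decodes a scenario name, assuming it is one of the four main SSPs: the first digit is the pathway number, and the last two digits are the nominal radiative forcing in 2100 (in W/m²) multiplied by ten.

```python
def activity_for(experiment_id):
    """Return the activity_id holding an experiment (two cases only)."""
    if experiment_id == "historical":
        return "CMIP"
    if experiment_id.startswith("ssp"):
        return "ScenarioMIP"
    raise ValueError(f"not handled in this sketch: {experiment_id}")


def decode_ssp(experiment_id):
    """Split e.g. 'ssp370' into (pathway number, forcing in W/m^2)."""
    digits = experiment_id[3:]          # '370'
    pathway = int(digits[0])            # 3
    forcing = int(digits[1:]) / 10      # 70 / 10 = 7.0
    return pathway, forcing


for exp in ["historical", "ssp126", "ssp245", "ssp370", "ssp585"]:
    print(exp, "->", activity_for(exp))

print(decode_ssp("ssp370"))
```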


In [4]:
# Specify search terms to query catalog for CanESM5 data
# activity_id: which project do you want? CMIP = historical data, ScenarioMIP = future projections
activity_ids = ['ScenarioMIP', 'CMIP'] 

# source_id: which model do you want? 
source_id = ['CanESM5']

# experiment_id: what experimental configuration do you want? Here we want historical and the four main SSPs
experiment_ids = ['historical', 'ssp126', 'ssp245', 'ssp370', 'ssp585']

# member_id: which ensemble member do you want? Here we want r10i1p1f1
member_id = 'r10i1p1f1'

# table_id: which part of the Earth system and time resolution do you want? Here we want monthly atmosphere data
table_id = 'Amon' 

# variable_id: which climate variable do you want? Here we want surface air temperature
variable_id = 'tas' 

#### **Display catalog search results**

The code block above specifies the search terms to use to get the `r10i1p1f1` member of the CanESM5 historical and SSP ensembles. To actually retrieve the information, we use the `.search` functionality that `catalog` type objects possess. 

The code block below parses through the full CMIP6 catalog and retrieves only entries that satisfy our search criteria:

In [5]:
# Search through catalog, store results in "res" variable
res = catalog.search(activity_id=activity_ids, source_id=source_id, experiment_id=experiment_ids, 
                     member_id=member_id, table_id=table_id, variable_id=variable_id)

# Display data frame associated with results
display(res.df)

| | activity_id | institution_id | source_id | experiment_id | member_id | table_id | variable_id | grid_label | zstore | dcpp_init_year | version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CMIP | CCCma | CanESM5 | historical | r10i1p1f1 | Amon | tas | gn | s3://cmip6-pds/CMIP6/CMIP/CCCma/CanESM5/histor... | | 20190429 |
| 1 | ScenarioMIP | CCCma | CanESM5 | ssp585 | r10i1p1f1 | Amon | tas | gn | s3://cmip6-pds/CMIP6/ScenarioMIP/CCCma/CanESM5... | | 20190429 |
| 2 | ScenarioMIP | CCCma | CanESM5 | ssp370 | r10i1p1f1 | Amon | tas | gn | s3://cmip6-pds/CMIP6/ScenarioMIP/CCCma/CanESM5... | | 20190429 |
| 3 | ScenarioMIP | CCCma | CanESM5 | ssp126 | r10i1p1f1 | Amon | tas | gn | s3://cmip6-pds/CMIP6/ScenarioMIP/CCCma/CanESM5... | | 20190429 |
| 4 | ScenarioMIP | CCCma | CanESM5 | ssp245 | r10i1p1f1 | Amon | tas | gn | s3://cmip6-pds/CMIP6/ScenarioMIP/CCCma/CanESM5... | | 20190429 |


Now we're in business! The search above returned five results:

- one historical simulation ("experiment_id" = "historical")
- four future projection simulations ("experiment_id" = "ssp585", "ssp370", "ssp126", or "ssp245")
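
If you're wondering how the search terms combine: a list of values means "match any of these", and the different keywords must all match at once. The toy function below is only an illustration of that logic on a hand-made list of dictionaries; it is not how `intake-esm` is implemented, and the rows (including the model name `OtherModel`) are made up.

```python
def search(records, **query):
    """Keep the records whose fields satisfy every query term.

    A list-valued term matches if the field equals any item in the list;
    a single value must match exactly.
    """
    def matches(record):
        for field, wanted in query.items():
            allowed = wanted if isinstance(wanted, (list, tuple, set)) else [wanted]
            if record.get(field) not in allowed:
                return False
        return True

    return [record for record in records if matches(record)]


# Toy catalog rows (made up for illustration, not real catalog entries)
toy_rows = [
    {"source_id": "CanESM5", "experiment_id": "historical", "variable_id": "tas"},
    {"source_id": "CanESM5", "experiment_id": "ssp370", "variable_id": "pr"},
    {"source_id": "OtherModel", "experiment_id": "historical", "variable_id": "tas"},
]

hits = search(toy_rows, source_id=["CanESM5"],
              experiment_id=["historical", "ssp370"], variable_id="tas")
print(hits)
```

Only the first toy row survives: the second has the wrong variable, and the third has the wrong model.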

We can now follow a procedure similar to the one we used in previous tutorials, to read in the data as xarray objects and make our time series plot.

### **Read information and store as an xarray object**

Let's first read in the data for just the historical simulation. To do this, we'll use the following command:

`xr.open_zarr(res.df['zstore'][0], storage_options={'anon': True}) `

This is part of the `xarray` package, and is designed to allow us to read in information stored as _zarr stores_. 

_**What's a zarr store??**_

`zarr`, like netCDF, is a self-describing data storage format, meaning that all the metadata and coordinate information you need to understand the data is packaged up with the data itself. However, it's been optimized for accessibility via cloud/parallelized servers, which is why many of the cloud-based data catalogs contain data stored in zarr format. _**Essentially, zarr is the cloud-optimized version of a netCDF file!**_

The historical data file is the first one returned in our catalog search above. So we want to retrieve the first value (index 0) in the `zstore` column of the `res.df` dataframe, which will tell Python how to retrieve the relevant information.
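
Each `zstore` entry is an `s3://` address: the first part names the S3 "bucket" and the rest is the path inside it. As a minimal sketch of that anatomy, the helpers below (names made up for this example) split such an address with the standard library and rebuild the matching `https://` form. The example uses the catalog JSON file we opened earlier, on the assumption that its `s3://` address follows the same bucket/path pattern as its web URL.

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Return (bucket, key) for an s3:// address."""
    parsed = urlparse(url)
    if parsed.scheme != "s3":
        raise ValueError(f"not an s3:// address: {url}")
    return parsed.netloc, parsed.path.lstrip("/")


def to_https_url(url):
    """Rebuild the bucket-style https address for a public S3 object."""
    bucket, key = split_s3_url(url)
    return f"https://{bucket}.s3.amazonaws.com/{key}"


example = "s3://cmip6-pds/pangeo-cmip6.json"
print(split_s3_url(example))
print(to_https_url(example))
```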

The final thing we need to pass to `open_zarr` is a flag that tells it to ignore any login information - since this is a publicly available database, we don't need it. That's what the `storage_options={'anon': True}` argument is doing!

In [22]:
# Read in just the historical data file as a demo
hist_data = xr.open_zarr(res.df['zstore'][0], storage_options={'anon': True})

We can print out the data to see what it looks like:

In [23]:
print(hist_data)

<xarray.Dataset>
Dimensions:    (lat: 64, bnds: 2, lon: 128, time: 1980)
Coordinates:
    height     float64 ...
  * lat        (lat) float64 -87.86 -85.1 -82.31 -79.53 ... 82.31 85.1 87.86
    lat_bnds   (lat, bnds) float64 dask.array<chunksize=(64, 2), meta=np.ndarray>
  * lon        (lon) float64 0.0 2.812 5.625 8.438 ... 348.8 351.6 354.4 357.2
    lon_bnds   (lon, bnds) float64 dask.array<chunksize=(128, 2), meta=np.ndarray>
  * time       (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
    time_bnds  (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
Dimensions without coordinates: bnds
Data variables:
    tas        (time, lat, lon) float32 dask.array<chunksize=(600, 64, 128), meta=np.ndarray>
Attributes: (12/56)
    CCCma_model_hash:            55f484f90aff0e32c5a8e92a42c6b9ae7ffe6224
    CCCma_parent_runid:          rc3.1-pictrl
    CCCma_pycmor_hash:           33c30511acc319a98240633965a04ca99c26427e
    CCCma_runid:                 rc3.1-his10
   

Sure enough, this results in an xarray Dataset that looks essentially identical to the one we got when we downloaded the CanESM5 historical data manually! (Compare with the results of the tutorials in the Time Series Plots repo for yourself if you like.)

Some of the dimensions might be listed in a different order, but that doesn't matter to xarray, since it knows how to find them based on their names. Now that you've read in the data, you can do anything with it that you would with any other xarray dataset!

Now let's read in a second data file, one that goes with one of the SSP future projection simulations: say, SSP3-7.0. Looking at the data table above, we see that this is the third entry - so we grab the location of the third file and feed it to `xr.open_zarr` as we did for the historical simulation:

In [24]:
# Get data for SSP370
ssp370_data = xr.open_zarr(res.df['zstore'][2], storage_options={'anon': True})
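
Picking rows by position works, but it depends on the order in which the search results happen to come back. A slightly more robust option is to look the row up by its `experiment_id`. The sketch below shows the idea on plain dictionaries; the helper name `zstore_for` and the placeholder paths are made up for this example.

```python
def zstore_for(records, experiment_id):
    """Return the zstore path for the one row matching experiment_id."""
    matches = [row["zstore"] for row in records
               if row["experiment_id"] == experiment_id]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one match for {experiment_id!r}, "
            f"found {len(matches)}"
        )
    return matches[0]


# Stand-in rows with placeholder paths (the real ones come from res.df)
rows = [
    {"experiment_id": "historical", "zstore": "s3://placeholder/historical/"},
    {"experiment_id": "ssp585", "zstore": "s3://placeholder/ssp585/"},
    {"experiment_id": "ssp370", "zstore": "s3://placeholder/ssp370/"},
]

print(zstore_for(rows, "ssp370"))
```

With the real search results, the same function could be called as `zstore_for(res.df.to_dict("records"), "ssp370")`.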

<a id='time_series'></a> 
## **Plot a Time Series**

Now that the data have been read in, we can use them to plot a time series. Let's redo the example from our [first tutorial](https://github.com/climate-datalab/Time-Series-Plots/blob/main/1.%20Read%20in%20Climate%20Data%20%2B%20Plot%20a%20Regionally%20Averaged%20Time%20Series.ipynb): a regionally averaged temperature plot for New York City.

First, we'll *also* concatenate the historical and SSP information into a single xarray object, to make the plotting simpler:

In [25]:
# Concatenate historical and future projection data
canesm5_data = xr.concat([hist_data, ssp370_data], dim="time")
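
A quick sanity check after concatenating is to compare the length of the time axis with the number of months you expect. The sketch below does the arithmetic; the helper name `n_months` is made up, and the SSP date range (January 2015 through December 2100) is an assumption based on the plot title used later in this tutorial.

```python
def n_months(first_year, last_year):
    """Number of monthly time steps from Jan of first_year to Dec of last_year."""
    return (last_year - first_year + 1) * 12


n_hist = n_months(1850, 2014)   # matches time: 1980 in the printout above
n_ssp = n_months(2015, 2100)    # assumed range for the SSP3-7.0 run
print(n_hist, n_ssp, n_hist + n_ssp)
```

You can compare the total against `canesm5_data.sizes["time"]`; a mismatch would suggest the two pieces overlap, leave a gap, or cover a different period than assumed.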

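Recall that CanESM5 uses a non-standard 365-day calendar (no leap years), so xarray reads its time values in as generic date objects rather than `datetime64` (note the `object` dtype on `time` in the printout above). To get the plotting to work correctly, we have to convert the time format to `datetime64`: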

In [27]:
# Convert time to datetime64 format
time = canesm5_data.time.astype('datetime64[ns]')

Now we follow the rest of the steps from the previous tutorial to define lat/lon bounds, mask the data, and compute a regional average:

In [28]:
# Define min/max bounds for region of interest (NYC)
lat_min, lat_max = 40, 41.5
lon_min, lon_max = 285.5, 287

# Define logical mask: True when lat/lon inside the valid ranges, False elsewhere
tas_NYC_lat = (canesm5_data.lat >= lat_min) & (canesm5_data.lat <= lat_max)
tas_NYC_lon = (canesm5_data.lon >= lon_min) & (canesm5_data.lon <= lon_max)

# Find points where the mask value is True, drop all other points
tas_NYC = canesm5_data.where(tas_NYC_lat & tas_NYC_lon, drop=True)

# Average over lat, lon dimensions to get a time series
tas_NYC = tas_NYC.mean(dim=["lat", "lon"])
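
Note that the longitude bounds above are in degrees east on a 0 to 360 scale, matching the `lon` coordinate in the dataset printout (0.0 to 357.2). If you start from the more familiar -180 to 180 convention, you need to convert first. Here is a minimal sketch; the helper names are made up, and the NYC coordinates (about 40.7°N, 74.0°W) are approximate.

```python
def to_0_360(lon_deg):
    """Convert a longitude from the -180..180 convention to 0..360."""
    return lon_deg % 360


def in_box(lat, lon, lat_min, lat_max, lon_min, lon_max):
    """True if the point lies inside the lat/lon box (bounds inclusive)."""
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max


# Western-hemisphere bounds of -74.5 and -73 become 285.5 and 287
print(to_0_360(-74.5), to_0_360(-73))

# Approximate NYC location, converted to the 0..360 convention
nyc_lat, nyc_lon = 40.7, to_0_360(-74.0)
print(in_box(nyc_lat, nyc_lon, 40, 41.5, 285.5, 287))
```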

And finally, we generate our plot!

In [None]:
fig, ax = plt.subplots(figsize=(20, 8))
ax.plot(time, tas_NYC.tas, label='Near-Surface Air Temperature', color='b')
ax.set_title("Time Series of NYC Near-Surface Air Temperature (1850 to 2100)", fontsize=20)
ax.set_xlabel("Time", fontsize=20)
ax.set_ylabel("Temperature (K)", fontsize=20)
ax.legend(fontsize=20)
ax.grid()
plt.show()

Great job! Now it should be clearer that you can use **EITHER** the manual download **OR** the cloud computing solution to access the same datasets, and do all the same analysis tasks. The ability to quickly pull data down from the cloud makes it much easier to carry out complicated analyses, so it's a great skill to have!