# Workbook for Accessing CMIP6 Data via Amazon Web Services
### Authors

Samantha Stevenson sstevenson@ucsb.edu

### Table of Contents

[Goals](#purpose)

[Import Packages](#import)

[Load and Query the CMIP6 AWS Catalog](#data_io)

[Plot a Time Series](#time_series)

<a id='purpose'></a> 
# Goals

This is the companion "workbook" for the tutorials in the "CMIP6 AWS" Climate DataLab repository. Apart from a few starter cells, it contains comments rather than finished code, so you can use it as a space to create your own workflow based on the steps in the tutorials. We have provided an overall structure for the workflow, but you are encouraged to customize your code using snippets from the tutorials in this or other repositories as you like.

Happy coding!

<a id='import'></a> 
# Import Packages

Import all packages necessary for this tutorial: the main ones you'll need are matplotlib, xarray, intake, and s3fs. 

_Note: make sure you've installed the intake-esm plugin for the intake package, or that you're working on a server where this was done for you!_

In [2]:
# Import necessary packages
import xarray as xr
import matplotlib.pyplot as plt
import intake
import s3fs
import pandas as pd

<a id='data_io'></a> 
# Load and Query the CMIP6 AWS Catalog

Now use the steps laid out in the [Accessing CMIP6 Data via AWS](https://github.com/climate-datalab/CMIP6_AWS/blob/main/CMIP6_timeseries_map.ipynb) tutorial to load your data using intake!

In [3]:
# Open the CMIP6 data catalog, store as a variable
catalog = intake.open_esm_datastore('https://cmip6-pds.s3.amazonaws.com/pangeo-cmip6.json')

Once the catalog is loaded, you can query it using the search terms listed below:

- _activity_id_: which project do you want? CMIP = historical data, ScenarioMIP = future projections

- _source_id_: which model do you want? 

- _experiment_id_: what experimental configuration do you want? Here we want historical and the four main SSPs

- _member_id_: which ensemble member do you want? Here we want r10i1p1f1

- _table_id_: which part of the Earth system and time resolution do you want? Here we want monthly atmosphere data

- _variable_id_: which climate variable do you want? Here we want surface air temperature


### How do I figure out what I'm looking at??

Try the following to get a sense for what's available in the catalog:

1) **List all unique models participating in the "CMIP" activity** ("activity_id=CMIP")
  
  (_see the section "Example: finding all the unique model names contributing to a given activity" in Tutorial 1_)

In [18]:
# Search through catalog, find all historical simulations
# ("activity_id=CMIP", "experiment_id=historical")
catalog.search(activity_id = 'CMIP', experiment_id = 'historical')

# Convert to a data frame
df = catalog.search(activity_id = 'CMIP', experiment_id = 'historical').df

display(df)

# Get unique model names in the set of search results
df.source_id.unique()

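# Print list of model names
# Note: catalog.df is the full, unfiltered catalog, so this line prints every
# model in the catalog; use df.source_id.unique() to print only the models
# in the search results
print(f"Model Name: {catalog.df.source_id.unique()}")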


| | activity_id | institution_id | source_id | experiment_id | member_id | table_id | variable_id | grid_label | zstore | dcpp_init_year | version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CMIP | NOAA-GFDL | GFDL-CM4 | historical | r1i1p1f1 | AERmon | ua | gr1 | s3://cmip6-pds/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/h... | | 20180301 |
| 1 | CMIP | NOAA-GFDL | GFDL-CM4 | historical | r1i1p1f1 | AERmon | toz | gr1 | s3://cmip6-pds/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/h... | | 20180301 |
| 2 | CMIP | NOAA-GFDL | GFDL-CM4 | historical | r1i1p1f1 | AERmon | so2 | gr1 | s3://cmip6-pds/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/h... | | 20180301 |
| 3 | CMIP | NOAA-GFDL | GFDL-CM4 | historical | r1i1p1f1 | AERmon | rsutcsaf | gr1 | s3://cmip6-pds/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/h... | | 20180301 |
| 4 | CMIP | NOAA-GFDL | GFDL-CM4 | historical | r1i1p1f1 | AERmon | rsutaf | gr1 | s3://cmip6-pds/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/h... | | 20180301 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 86091 | CMIP | EC-Earth-Consortium | EC-Earth3-Veg | historical | r1i1p1f1 | Amon | uas | gr | s3://cmip6-pds/CMIP6/CMIP/EC-Earth-Consortium/... | | 20211207 |
| 86092 | CMIP | EC-Earth-Consortium | EC-Earth3-Veg | historical | r1i1p1f1 | Amon | va | gr | s3://cmip6-pds/CMIP6/CMIP/EC-Earth-Consortium/... | | 20211207 |
| 86093 | CMIP | EC-Earth-Consortium | EC-Earth3-Veg | historical | r1i1p1f1 | Amon | wap | gr | s3://cmip6-pds/CMIP6/CMIP/EC-Earth-Consortium/... | | 20211207 |
| 86094 | CMIP | EC-Earth-Consortium | EC-Earth3-Veg | historical | r1i1p1f1 | Amon | tas | gr | s3://cmip6-pds/CMIP6/CMIP/EC-Earth-Consortium/... | | 20211207 |


Model Name: ['CMCC-CM2-HR4' 'EC-Earth3P-HR' 'HadGEM3-GC31-MM' 'HadGEM3-GC31-HM'
 'HadGEM3-GC31-LM' 'EC-Earth3P' 'ECMWF-IFS-LR' 'ECMWF-IFS-HR'
 'HadGEM3-GC31-LL' 'CMCC-CM2-VHR4' 'GFDL-CM4' 'GFDL-AM4' 'IPSL-CM6A-LR'
 'E3SM-1-0' 'CNRM-CM6-1' 'GFDL-ESM4' 'GFDL-CM4C192' 'GFDL-ESM2M'
 'GFDL-OM4p5B' 'GISS-E2-1-G' 'GISS-E2-1-H' 'CNRM-ESM2-1' 'BCC-CSM2-MR'
 'BCC-ESM1' 'MIROC6' 'AWI-CM-1-1-MR' 'EC-Earth3-LR' 'IPSL-CM6A-ATM-HR'
 'CESM2' 'CESM2-WACCM' 'CNRM-CM6-1-HR' 'MRI-ESM2-0' 'CanESM5'
 'SAM0-UNICON' 'GISS-E2-1-G-CC' 'UKESM1-0-LL' 'EC-Earth3' 'EC-Earth3-Veg'
 'FGOALS-f3-L' 'CanESM5-CanOE' 'INM-CM4-8' 'INM-CM5-0' 'NESM3'
 'MPI-ESM-1-2-HAM' 'CAMS-CSM1-0' 'MPI-ESM1-2-LR' 'MPI-ESM1-2-HR'
 'MRI-AGCM3-2-H' 'MRI-AGCM3-2-S' 'MCM-UA-1-0' 'INM-CM5-H' 'KACE-1-0-G'
 'NorESM2-LM' 'FGOALS-f3-H' 'FGOALS-g3' 'MIROC-ES2L' 'FIO-ESM-2-0'
 'NorCPM1' 'NorESM1-F' 'MPI-ESM1-2-XR' 'CESM1-1-CAM5-CMIP5' 'E3SM-1-1'
 'KIOST-ESM' 'NorESM2-MM' 'ACCESS-CM2' 'ACCESS-ESM1-5' 'CESM2-FV2'
 'GISS-E2-2-G' 'CESM2-WACCM-FV2' 'GISS-

2) **List all unique ensemble members** associated with the "historical" simulations ("experiment_id=historical") run with CanESM5 ("source_id=CanESM5")

   _(see the section "Example: finding all the historical simulations with a given model" in Tutorial 1)_

In [19]:
# Search through catalog, find all historical simulations with CanESM5
# ("activity_id=CMIP", "experiment_id=historical", "source_id=CanESM5")
catalog.search(activity_id = 'CMIP', experiment_id = 'historical', source_id = 'CanESM5')

# Convert to a data frame
df = catalog.search(activity_id = 'CMIP', experiment_id = 'historical', source_id = 'CanESM5').df

# Print all unique ensemble members ("member_id")
df.member_id.unique()

array(['r24i1p1f1', 'r25i1p1f1', 'r14i1p1f1', 'r2i1p1f1', 'r17i1p1f1',
       'r10i1p1f1', 'r13i1p1f1', 'r7i1p1f1', 'r6i1p1f1', 'r5i1p1f1',
       'r3i1p1f1', 'r22i1p1f1', 'r23i1p1f1', 'r8i1p1f1', 'r11i1p1f1',
       'r12i1p1f1', 'r15i1p1f1', 'r19i1p1f1', 'r16i1p1f1', 'r1i1p1f1',
       'r9i1p1f1', 'r18i1p1f1', 'r4i1p1f1', 'r21i1p1f1', 'r20i1p1f1',
       'r11i1p2f1', 'r10i1p2f1', 'r7i1p2f1', 'r9i1p2f1', 'r8i1p2f1',
       'r4i1p2f1', 'r40i1p2f1', 'r3i1p2f1', 'r6i1p2f1', 'r24i1p2f1',
       'r13i1p2f1', 'r12i1p2f1', 'r5i1p2f1', 'r31i1p2f1', 'r30i1p2f1',
       'r32i1p2f1', 'r29i1p2f1', 'r28i1p2f1', 'r2i1p2f1', 'r22i1p2f1',
       'r23i1p2f1', 'r26i1p2f1', 'r27i1p2f1', 'r25i1p2f1', 'r37i1p2f1',
       'r38i1p2f1', 'r39i1p2f1', 'r35i1p2f1', 'r34i1p2f1', 'r36i1p2f1',
       'r33i1p2f1', 'r1i1p2f1', 'r18i1p2f1', 'r19i1p2f1', 'r14i1p2f1',
       'r15i1p2f1', 'r17i1p2f1', 'r16i1p2f1', 'r21i1p2f1', 'r20i1p2f1'],
      dtype=object)

### **Find a specific file**

Let's do an example of pulling the data we used in previous tutorials, this time from the cloud. The example data file from the [Time Series Plots](https://github.com/climate-datalab/Time-Series-Plots) and [Map Plots](https://github.com/climate-datalab/Map-Plots) repositories is:

`tas_Amon_CanESM5_historical_r10i1p1f1_gn_185001-201412.nc`

We can break this down to extract the fields we'll need to search the data catalog properly. If you need more detail on how to do this, also refer to the [filename decoder](http://climate-datalab.org/filename-decoder/) on the Climate DataLab website!

#### **Characteristics of this file (corresponding fields in the CMIP6 catalog are in parentheses)**:
- _Variable ("variable_id")_: This is a surface air temperature, or "tas", variable.
- _Realm ("table_id")_: Surface air temperature is generated by the atmosphere component of a climate model ("A"), and the information in this particular file is averaged monthly ("mon").
- _Model ("source_id")_: The name of the model is "CanESM5", which is short for the Canadian Earth System Model version 5.
- _Experiment ("experiment_id")_: The name of the model experiment being run. The file above comes from a _historical_ simulation. Since we're also interested in future projections, the search below also requests the associated SSP experiments.
- _Ensemble member ("member_id")_: The name of this ensemble member is "r10i1p1f1".
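
The breakdown above can also be done programmatically. The snippet below is a minimal sketch, not the Climate DataLab filename decoder: the helper `decode_filename` and the `FIELDS` list are hypothetical names, and the code assumes the standard CMIP6 file naming order with fields separated by underscores.

```python
from pathlib import Path

# Assumed field order in a CMIP6 file name (hypothetical helper list)
FIELDS = [
    "variable_id",
    "table_id",
    "source_id",
    "experiment_id",
    "member_id",
    "grid_label",
    "time_range",
]


def decode_filename(filename):
    """Split a CMIP6-style file name into its underscore-separated fields."""
    # Path.stem drops the ".nc" extension before splitting
    parts = Path(filename).stem.split("_")
    return dict(zip(FIELDS, parts))


info = decode_filename("tas_Amon_CanESM5_historical_r10i1p1f1_gn_185001-201412.nc")
for field, value in info.items():
    print(f"{field}: {value}")
```

The first five fields map directly onto catalog search terms; the grid label and time range are not needed for the search in this workbook.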

In [None]:
# Specify search terms to query catalog for CanESM5 data
# activity_id: which project do you want? CMIP = historical data, ScenarioMIP = future projections

# source_id: which model do you want? 

# experiment_id: what experimental configuration do you want? Here we want historical and the four main SSPs

# member_id: which ensemble member do you want? Here we want r10i1p1f1

# table_id: which part of the Earth system and time resolution do you want? Here we want monthly atmosphere data

# variable_id: which climate variable do you want? Here we want surface air temperature
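
One way to fill in the cell above is to collect the search terms in a dictionary. This is a sketch under stated assumptions: the name `query` is hypothetical, and "the four main SSPs" is taken to mean `ssp126`, `ssp245`, `ssp370`, and `ssp585`. Adjust the list if your tutorial uses a different set.

```python
# Hypothetical query dictionary; keys are the catalog fields described above
query = {
    # CMIP = historical data, ScenarioMIP = future projections
    "activity_id": ["CMIP", "ScenarioMIP"],
    # Model name
    "source_id": "CanESM5",
    # Historical run plus the assumed four main SSPs
    "experiment_id": ["historical", "ssp126", "ssp245", "ssp370", "ssp585"],
    # Ensemble member
    "member_id": "r10i1p1f1",
    # Monthly atmosphere data
    "table_id": "Amon",
    # Surface air temperature
    "variable_id": "tas",
}

for field, value in query.items():
    print(f"{field}: {value}")

# With the catalog opened earlier in this workbook, the search would then be:
# res = catalog.search(**query)
```

Passing a list for a field asks the catalog for entries matching any of the listed values.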


In [None]:
# Search through catalog, store results in "res" variable

# Display data frame associated with results

# Extract data for the historical period, store as a separate xarray Dataset

# Extract data for an SSP 
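
The extraction steps above might look something like the sketch below. It assumes that the results data frame (`res.df` in the workbook's naming) has `experiment_id` and `zstore` columns, as in the catalog output shown earlier. The helpers `first_store` and `open_store` are hypothetical, the toy data frame uses placeholder paths rather than real catalog entries, and `ssp370` is just one example SSP. Opening a real store also requires `s3fs` and network access, so `open_store` is defined but not called here.

```python
import pandas as pd
import xarray as xr


def first_store(df, experiment_id):
    """Return the zstore path of the first row matching one experiment."""
    rows = df[df.experiment_id == experiment_id]
    if rows.empty:
        raise ValueError(f"No entries found for experiment '{experiment_id}'")
    return rows.zstore.iloc[0]


def open_store(path):
    """Open a Zarr store from a public S3 bucket as an xarray Dataset."""
    # anon=True requests anonymous access (no AWS credentials)
    return xr.open_zarr(path, storage_options={"anon": True})


# Toy stand-in for res.df; the paths are placeholders, not real catalog entries
toy_df = pd.DataFrame(
    {
        "experiment_id": ["historical", "ssp370"],
        "zstore": ["s3://example-bucket/historical/", "s3://example-bucket/ssp370/"],
    }
)

hist_path = first_store(toy_df, "historical")
ssp_path = first_store(toy_df, "ssp370")
print(hist_path, ssp_path)

# With real search results, the Datasets would then be opened with, e.g.:
# hist_data = open_store(first_store(res.df, "historical"))
# ssp_data = open_store(first_store(res.df, "ssp370"))
```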

<a id='time_series'></a> 
# Plot a Time Series

Once the data have been loaded, you can use them to generate a time series, following the same steps used in previous tutorials!

In [None]:
# Concatenate historical and future projection data

# Convert time to datetime64 format
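
As a rough illustration of these two steps, the sketch below uses small synthetic Datasets in place of the real historical and SSP data. The helper `toy_tas` and all values are made up for the example. The time conversion assumes the time axis may have been decoded as cftime objects (common for models with non-standard calendars); if it is already `datetime64`, nothing needs converting.

```python
import numpy as np
import pandas as pd
import xarray as xr


def toy_tas(start, n_months, value):
    """Build a small stand-in Dataset with a monthly time axis (synthetic values)."""
    time = pd.date_range(start, periods=n_months, freq="MS")
    return xr.Dataset(
        {"tas": ("time", np.full(n_months, value))},
        coords={"time": time},
    )


hist_data = toy_tas("2014-01-01", 12, 287.0)  # stand-in for historical data
ssp_data = toy_tas("2015-01-01", 12, 288.0)   # stand-in for one SSP

# Concatenate historical and future projection data along the time dimension
data = xr.concat([hist_data, ssp_data], dim="time")

# Convert time to datetime64 only if it was decoded as cftime objects
# (the toy data is already datetime64, so this branch is skipped here)
if isinstance(data.indexes["time"], xr.CFTimeIndex):
    data["time"] = data.indexes["time"].to_datetimeindex()

print(data.time.values[[0, -1]])
```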


In [None]:
# Define min/max bounds for region of interest 


# Define logical mask: True when lat/lon inside the valid ranges, False elsewhere

# Find points where the mask value is True, drop all other points

# Average over lat, lon dimensions to get a time series
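
The masking and averaging steps can be sketched on a synthetic grid as follows. Everything here is an assumption for illustration: the region bounds, the coordinate names `lat` and `lon`, the 0 to 360 longitude convention, and the made-up temperature field. The average is a simple unweighted mean over the selected grid points, as described in the comments above.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic grid and data standing in for a real surface air temperature field
lat = np.arange(-90, 91, 30.0)
lon = np.arange(0, 360, 30.0)
time = pd.date_range("2000-01-01", periods=3, freq="MS")
values = (
    lat[None, :, None]
    + np.arange(time.size)[:, None, None]
    + np.zeros((1, 1, lon.size))
)
tas = xr.DataArray(
    values,
    dims=("time", "lat", "lon"),
    coords={"time": time, "lat": lat, "lon": lon},
    name="tas",
)

# Define min/max bounds for region of interest (hypothetical values)
lat_min, lat_max = 30, 60
lon_min, lon_max = 240, 300

# Logical mask: True when lat/lon are inside the valid ranges, False elsewhere
mask = (
    (tas.lat >= lat_min)
    & (tas.lat <= lat_max)
    & (tas.lon >= lon_min)
    & (tas.lon <= lon_max)
)

# Keep points where the mask is True, drop all other points
region = tas.where(mask, drop=True)

# Average over lat, lon dimensions to get a time series
time_series = region.mean(dim=["lat", "lon"])
print(time_series.values)
```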


In [None]:
# Plot the resulting time series
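
A basic plot might look like the sketch below. The time series here is synthetic, standing in for the regional average computed in the previous cell, and the labels, title, and styling are only suggestions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic monthly time series (made-up values, in kelvin)
time = pd.date_range("1850-01-01", periods=120, freq="MS")
values = 287.0 + 0.01 * np.arange(time.size)

# Plot the time series with labeled axes
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(time, values, color="tab:red", label="synthetic example")
ax.set_xlabel("Time")
ax.set_ylabel("Surface air temperature (K)")
ax.set_title("Regional mean surface air temperature")
ax.legend()
fig.tight_layout()
```

With real data, replace `time` and `values` with the time coordinate and values of your own regional average.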