# Plotting Time Series from Multiple Ensemble Members
## (using Regions Defined by Shape Files!)
### Authors

Samantha Stevenson sstevenson@ucsb.edu

### Table of Contents

[Goals](#purpose)

[Import Packages](#path)

[Load and Query the CMIP6 AWS Catalog](#load)

[Read in Data as an Xarray Object](#xarray)

[Define a Region Using Shapefiles](#shapefiles)

[Plot Time Series](#time_series)

<a id='purpose'></a> 
## **Goals**

In this tutorial, we will be reading in the database of Coupled Model Intercomparison Project phase 6 (CMIP6) output hosted by Amazon Web Services and exploring its contents. 

The steps in this tutorial build on the skills we learned in previous tutorials:
- [Read in Data and Plot a Time Series](https://github.com/climate-datalab/Time-Series-Plots/blob/main/1.%20Read%20in%20Climate%20Data%20%2B%20Plot%20a%20Regionally%20Averaged%20Time%20Series.ipynb)
  (regional averaging, time series plotting)
- [Opening and Querying the CMIP6 AWS Database](https://github.com/climate-datalab/CMIP6_AWS/blob/main/1.%20Opening%20and%20Querying%20the%20CMIP6%20Catalog.ipynb)  (data access via Amazon Web Services)

Basically: we'll be doing a lot of the same things we did in those tutorials, but this time extending the plots to include information from multiple _ensemble members_ and multiple climate models! Please refer back to those materials if you would like additional detail.

<a id='path'></a> 
## **Import Packages**

As always, we begin by importing the necessary packages for our analysis. This tutorial assumes you're starting with an environment in which `intake`, `intake-esm`, and `s3fs` are already installed - for details on those packages, see the [CMIP6 AWS repo](https://github.com/climate-datalab/CMIP6_AWS)!

We'll also need a new package for this tutorial: `geopandas`. [Geopandas](https://geopandas.org/en/stable/index.html) is designed to facilitate working with geospatial data in Python; it layers the functionality of Pandas with the shape-handling abilities of Shapely to allow users to perform operations on geometric objects. 

Last but not least: we'll also import the coordinate reference system handling functionality from Cartopy (`cartopy.crs`; for more details see the [Cartopy CRS docs page](https://scitools.org.uk/cartopy/docs/latest/getting_started/crs.html)). This will allow us to reproject geospatial data onto a given CRS using Geopandas later on! 

In [None]:
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
import intake
import s3fs
import geopandas as gpd
import cartopy.crs as ccrs
from shapely.geometry import Point

<a id='shapefiles'></a> 
## **Define a Region Using Shapefiles**

Now that the data have been read in, we can use them to plot a time series. In previous tutorials, we specified lat/lon ranges using a rectangular box, but we can do better now! A common need in analyzing geospatial data is to select regions with irregular boundaries. This is often done using shapefiles, which specify the lat/lon coordinates of the boundary around a given region. 

There are many sources of shapefiles around the Internet: here we'll work with the [California Geographic Boundaries](https://catalog.data.gov/dataset/ca-geographic-boundaries) datasets. These contain information for state, county, and local place boundaries - to make sure we have a large enough region, let's use the state boundary. 

The shape file for the California state boundary was downloaded from the link above and is stored in this repo (see folder "ca_state"). It can be read in using the Geopandas `.read_file()` method!

While we're at it, let's also reproject the file to use a specific coordinate reference system - in this case, plain latitude/longitude coordinates (EPSG:4326), which are the coordinates the Plate Carree projection plots directly. This isn't strictly required since the shape file does contain a default CRS, but we _will_ need to make sure in a minute that our CRS is consistent between the shape file and the climate model data, so we might as well explicitly include a reprojection step just to make sure we don't forget to check!

In [None]:
# Read in shapefile for the CA state boundary
gdf = gpd.read_file('ca_state/CA_State.shp')

# Look at default CRS for the shape file
print(gdf.crs)

# Reproject the shapefile to lat/lon coordinates (EPSG:4326)
gdf = gdf.to_crs(epsg=4326)
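
A quick way to confirm that two layers share a CRS is to compare their `.crs` attributes before joining them. The sketch below uses a made-up rectangle and a single made-up point (neither comes from the CA shape file or the model grid) purely to illustrate the check; `region` and `pts` are hypothetical names.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical region, deliberately stored in Web Mercator (EPSG:3857)
region = gpd.GeoDataFrame(
    geometry=[box(-125, 32, -114, 40)], crs="EPSG:4326"
).to_crs(epsg=3857)

# Hypothetical grid point in plain lat/lon (EPSG:4326)
pts = gpd.GeoDataFrame(geometry=[Point(-120, 36)], crs="EPSG:4326")

# The two layers do not share a CRS yet
match_before = region.crs == pts.crs

# Reproject the region onto the CRS of the points, then check again
region = region.to_crs(pts.crs)
match_after = region.crs == pts.crs

print(match_before, match_after)
```

If the comparison comes back false, reprojecting one layer onto the other's CRS (as done here with `to_crs`) is usually the simplest fix.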

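Now that we have our shape file, the next task is to take the lat and lon coordinates from the climate model grid, and figure out which of those points lie within the boundaries of the shape (in this case, the California state borders). This requires a few steps:

- Converting the model longitudes from the 0 to 360 convention (where used) to the -180 to 180 convention used by the shape file
- Building 2D lat/lon arrays and turning each grid point into a Shapely `Point`
- Using a spatial join to find the points that fall inside the shape
- Reshaping the result into a 2D mask on the model grid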

In [None]:
# Convert longitudes from 0-360 to -180-180 so they match the shape file
lon_vals = ens_data.lon.values
lon_vals = np.where(lon_vals > 180, lon_vals - 360, lon_vals)

# Make 2D lat, lon
lon2d, lat2d = np.meshgrid(lon_vals, ens_data.lat.values)

# Create a GeoDataFrame from the xarray dataset's coordinates
points = [Point(lon, lat) for lon, lat in zip(lon2d.flatten(), lat2d.flatten())]
points_gdf = gpd.GeoDataFrame(geometry=points, crs="EPSG:4326")

# Print the points to see what they look like
#print(points_gdf)

# Spatial join to find points within the shapefile
joined = gpd.sjoin(points_gdf, gdf, how="inner", predicate="intersects")

# Create a mask based on the spatial join
mask = np.isin(np.arange(points_gdf.shape[0]), joined.index)
mask_2d = mask.reshape(lat2d.shape)

#masked_data = temp_data.where(mask_2d)
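
To sanity-check the masking steps above, it can help to run them on a tiny grid where the answer is obvious. The sketch below is only an illustration: the 3x3 grid and the rectangular `region` are made up (they stand in for `ens_data` and the CA shape file), and only one grid point is placed inside the rectangle on purpose.

```python
import numpy as np
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical model coordinates, with longitudes in the 0-360 convention
lon_vals = np.array([230.0, 240.0, 250.0])
lat_vals = np.array([30.0, 36.0, 42.0])

# Convert longitudes to the -180 to 180 convention
lon_vals = np.where(lon_vals > 180, lon_vals - 360, lon_vals)

# Hypothetical rectangular region standing in for the shape file
region = gpd.GeoDataFrame(geometry=[box(-125, 32, -114, 40)], crs="EPSG:4326")

# Build one Point per grid cell
lon2d, lat2d = np.meshgrid(lon_vals, lat_vals)
points = [Point(x, y) for x, y in zip(lon2d.flatten(), lat2d.flatten())]
points_gdf = gpd.GeoDataFrame(geometry=points, crs="EPSG:4326")

# Keep the points that fall inside the region, then rebuild the 2D mask
joined = gpd.sjoin(points_gdf, region, how="inner", predicate="intersects")
mask_2d = np.isin(np.arange(len(points_gdf)), joined.index).reshape(lat2d.shape)

print(mask_2d)
```

Only the center grid point (lon -120, lat 36) lies inside the rectangle, so the mask should be `True` there and `False` everywhere else.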


In [None]:
indices = np.argwhere(mask_2d)

print(indices)
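
Once a 2D mask exists, one possible way to use it is to blank out everything outside the region and average what remains at each time step. The sketch below assumes a synthetic `temp` array and a hand-made mask (both hypothetical, not the tutorial's `ens_data`), and it takes a simple unweighted mean; an area-weighted mean may be preferable for larger regions.

```python
import numpy as np
import xarray as xr

# Synthetic grid (hypothetical values, not model output)
lat = np.array([30.0, 36.0, 42.0])
lon = np.array([-130.0, -120.0, -110.0])
time = np.arange(4)

# Field whose value is t + 10*i + j at time t, lat index i, lon index j
values = (
    time[:, None, None]
    + 10 * np.arange(3)[None, :, None]
    + np.arange(3)[None, None, :]
).astype(float)
temp = xr.DataArray(
    values,
    coords={"time": time, "lat": lat, "lon": lon},
    dims=("time", "lat", "lon"),
)

# Hand-made mask with two grid cells inside the "region"
mask_2d = np.zeros((3, 3), dtype=bool)
mask_2d[1, 1] = True
mask_2d[1, 2] = True

# (lat index, lon index) pairs of the cells inside the region
print(np.argwhere(mask_2d))

# Wrap the mask in a DataArray so it aligns with temp by dimension name
mask_da = xr.DataArray(mask_2d, coords={"lat": lat, "lon": lon}, dims=("lat", "lon"))

# Cells outside the region become NaN and are skipped by the mean
masked = temp.where(mask_da)
regional_mean = masked.mean(dim=("lat", "lon"))

print(regional_mean.values)
```

Wrapping the mask in a `DataArray` with named dimensions lets xarray broadcast it across the time axis automatically.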

In order to work with this shape file in combination with our climate model information, we need to define a _coordinate reference system (CRS)_ for both. This is essentially a framework for locating different spatial points on the surface of Earth; more information on coordinate reference systems can be found [here](https://www.earthdatascience.org/courses/earth-analytics/spatial-data-r/intro-to-coordinate-reference-systems/).

The code block below reprojects the data in the CA shape file to latitude/longitude coordinates, which are what the [Plate Carree](https://pro.arcgis.com/en/pro-app/latest/help/mapping/properties/plate-carree.htm) projection plots directly, and assigns the same CRS to the grid points built from the climate model. 

_Note: the `epsg=4326` syntax below refers to entry 4326 in the [EPSG database](https://epsg.org/home.html), which numbers many projections/CRS. Strictly speaking, 4326 is the WGS 84 geographic (latitude/longitude) CRS rather than a projection; it is the natural match for Plate Carree, which plots longitude and latitude directly as x and y._
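
To see how a given EPSG code is described, it can be looked up with `pyproj`, the library Geopandas relies on for CRS handling. This is just a small inspection sketch; `crs_4326` is a name chosen here for illustration.

```python
from pyproj import CRS

# Look up EPSG code 4326 and inspect what kind of CRS it is
crs_4326 = CRS.from_epsg(4326)

print(crs_4326.name)
print(crs_4326.is_geographic, crs_4326.is_projected)
```

The same attributes are available on `gdf.crs`, since Geopandas stores the CRS as a `pyproj.CRS` object.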

In [None]:
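# Reproject the shapefile to lat/lon coordinates (EPSG:4326)
gdf = gdf.to_crs(epsg=4326)

# Convert longitudes from 0-360 to -180-180 so they match the shape file
lon_vals = ens_data.lon.values
lon_vals = np.where(lon_vals > 180, lon_vals - 360, lon_vals)

# Make 2D lat, lon
lon2d, lat2d = np.meshgrid(lon_vals, ens_data.lat.values)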

# Create a GeoDataFrame from the xarray dataset's coordinates
points = [Point(lon, lat) for lon, lat in zip(lon2d.flatten(), lat2d.flatten())]
points_gdf = gpd.GeoDataFrame(geometry=points, crs="EPSG:4326")

# Print the points to see what they look like
print(points_gdf)

# Spatial join to find points within the shapefile
joined = gpd.sjoin(points_gdf, gdf, how="inner", predicate="intersects")

# Create a mask based on the spatial join
mask = np.isin(np.arange(points_gdf.shape[0]), joined.index)
mask_2d = mask.reshape(lat2d.shape)

#masked_data = temp_data.where(mask_2d)
