# Python and NetCDF climate data

Pandas has a rich toolbox for working with data, but the Pandas dataframe is limited to two dimensions (rows and columns). It's possible to work around this limitation through multi-indexes, looping, and other coding techniques, but that quickly becomes tiresome and complicated. Thus, when working with multidimensional data, such as the lat/long/time series generated by many climate models, we could use a better tool. 

Following this session, we'll look at that tool, namely the `xarray` package, but first we should familiarize ourselves with the multi-dimensional data format. For this, we'll use the Python `netCDF4` package to read in a netCDF file, examine its properties and structure, and do some basic multidimensional analysis. We'll also review how `NumPy` and `Pandas` can work with multidimensional data, and see that working with such data often requires ***reducing*** the data to two dimensions. 

In this notebook, we work through an example using downscaled CMIP5 hydrology projections ([link](http://gdo-dcp.ucllnl.org/downscaled_cmip_projections/techmemo/BCSD5HydrologyMemo.pdf)). These data include monthly estimates of runoff, precipitation, evapotranspiration, and soil moisture content at a 1/8th degree spatial resolution across the US for the period of 1950 to 2099. Estimates are provided for 21 different climate projection ensembles applied to the Variable Infiltration Capacity (VIC) Macroscale Hydrologic Model ([link](http://vic.readthedocs.io/en/master/)); see the PDF document for a complete list. For demonstration purposes, this project uses the National Center for Atmospheric Research CCSM4 2.6 projection ensembles as the base data for water supply figures. 

### The multi-dimensional dataset
Climate model data are good examples of multi-dimensional data.

In the example here, we'll use monthly runoff predictions at a 1/8-degree spatial resolution for a single sample year (2000). Thus, our data has 4 **dimensions**: `latitude`, `longitude`, `time`, and `total_runoff`. *(Actually, the dataset has a 5th "dimension" as well, but this only includes data on the geographic coordinate reference system and has only one value, so we'll ignore it.)* 

Resources: 
* http://www.ceda.ac.uk/static/media/uploads/ncas-reading-2015/10_read_netcdf_python.pdf

## 0. Fetching the netCDF datafile
The code cell below fetches the monthly hydro runoff data that we'll be analyzing. 

In [5]:
#Import the os module and urllib's request submodule
import os
from urllib import request
#Set the location and filename of the data to fetch
baseURL = 'ftp://gdo-dcp.ucllnl.org/pub/dcp/archive/cmip5/hydro/BCSD_mon_VIC_nc/ccsm4_rcp26_r1i1p1/'
fileName = 'conus_c5.ccsm4_rcp26_r1i1p1.monthly.total_runoff.2000.nc'
#Store the file in the data folder (created if needed), where section 1 reads it from
outFolder = os.path.join('..', 'data')
os.makedirs(outFolder, exist_ok=True)
outFileName = os.path.join(outFolder, fileName)
#Check if the file has already been downloaded; if not then get it
if not os.path.exists(outFileName):
    print("Fetching the dataset")
    request.urlretrieve(baseURL + fileName, outFileName)
    print("Done!")
else:
    print("Data already downloaded")

Fetching the dataset
Done!


## 1. Import the `.nc` data file into our Python script as a netCDF4 `dataset` object
Before exploring the `xarray` package, we'll first examine our data using the **netCDF4** package. This package is installed with ArcGIS Pro, so there is no need to install it. It allows us to read our `.nc` file into a NetCDF4 `dataset` object that we can manipulate programmatically. 

Documentation on the NetCDF4 package is here: http://unidata.github.io/netcdf4-python/;  it displays the various *properties* and *methods* of the `dataset` object. 

In [None]:
#Import package to read netCDF file
import netCDF4

In [None]:
#Read the file into a netCDF dataset object
fileName = '../data/conus_c5.ccsm4_rcp26_r1i1p1.monthly.total_runoff.2000.nc'
dataset = netCDF4.Dataset(fileName)

In [None]:
#Confirm that the `dataset` variable points to a netCDF4 dataset object
type(dataset)

In [None]:
#Show some documentation on the `dataset` object
?dataset

In [None]:
#Show the file_format property of the object
dataset.file_format

## 2. Explore our netCDF dataset
With our dataset object created, we can explore a bit about the data within it and its structure. Specifically, we'll examine what <u>dimensions</u>, <u>variables</u>, and <u>attributes</u> are contained within the dataset. 

### 2.1 Dimensions
The dimensions in our dataset are accessed as a Python **dictionary**, which is a collection of values referenced by specific keys. Here, each dimension is listed as a *key*, and to get information about that dimension, we "look up" its definition in the dictionary: 

In [None]:
#Show the full dimensions dictionary
dataset.dimensions

In [None]:
#Show just the keys, or dimensions, in the dataset
dataset.dimensions.keys()

In [None]:
#Show the values associated with the `time` dimension
dataset.dimensions['time']

► *<u>Now you try it</u>: Show the values associated with the "latitude" dimension. How many items are in this dimension? How many in the 'longitude'?*

In [None]:
#Display the values associated with the `latitude` dimension
dataset.dimensions['latitude']

In [None]:
#Display the values associated with the `longitude` dimension
dataset.dimensions['longitude']

Summing up, we see our dataset has 3 dimensions: 
* `time` with 12 values,
* `latitude` with 222 values, and
* `longitude` with 462 values

-> **Dimensions** in a netCDF file, then, describe the axes used to reference individual values in our dataset. **Variables**, as we'll see next, contain the actual values.  
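
To make the distinction concrete, here is a minimal sketch that uses plain NumPy rather than the netCDF file itself. The dictionary `dims` and the array `values` are illustrative names, and the array is filled with zeros instead of real runoff estimates.

```python
import numpy as np

# Dimension names and sizes, mirroring the summary above
dims = {"time": 12, "latitude": 222, "longitude": 462}

# A stand-in "variable": one value for every combination of the three axes
# (zeros here; the real file holds runoff estimates)
values = np.zeros((dims["time"], dims["latitude"], dims["longitude"]))

print(values.ndim)   # number of axes
print(values.shape)  # size along each axis
print(values.size)   # total number of cells
```

The dimensions only say how long each axis is; the variable is the block of numbers laid out along those axes.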

### 2.2 Variables
Like dimensions, the dataset's **variables** are accessed as dictionary objects. First let's list all the variables contained in our dataset by exposing the "keys" included in the dictionary. 

In [None]:
#Show the 'key' or names of the variables
dataset.variables.keys()

We see that there is a variable for each of the three dimensions, plus a few more, including the *total_runoff* variable:
* `time` 
* `latitude`
* `bounds_latitude`
* `longitude`
* `bounds_longitude`
* `total_runoff`

In [None]:
#Show attributes of the time variable
dataset.variables['time']

From above we can see the data type and units of the `time` variable. We also see the size of the variable: its shape attribute reveals it's a 1D array (i.e. a vector) with 12 values, one for each month!

► *<u>Now you try it</u>: What are the units and shape of the "latitude" and "longitude" variables? The "total_runoff" variable?*

In [None]:
#Show attributes of the total_runoff variable
dataset.variables['total_runoff']

### 2.3 Attributes
NetCDF datasets have both global attributes and attributes associated with each variable. Here's how we explore which attributes are included in each and how to access the information stored in each attribute.

#### 2.3.1 Global attributes
Listing the dataset's **global** attributes is done a bit differently than the process for dimensions and variables. Here, the `ncattrs()` method returns a list of the dataset's global attributes. Then we can display more information on any of these attributes using the `getattr()` function or, more simply, dot notation. 

In [None]:
#List the dataset's attributes
dataset.ncattrs()

In [None]:
#Display information stored in the `description` attribute
dataset.description

#### 2.3.2 Variable attributes
Now we'll focus on the attributes of a single variable in our dataset. We'll choose the `total_runoff` variable. 

In [None]:
#List the runoff variable's attributes
dataset.variables['total_runoff'].ncattrs()

In [None]:
#Reveal the information associated with the "units" attribute
dataset.variables['total_runoff'].units

*There's a lot going on in the above statements. They work, but they can be hard to read for the newbie. One of the advantages of Python code is its readability, but this only works if coders write the code to be readable. So, let's rewrite the above statement so that it's more readable. It appears less "efficient", but sometimes readability is better than terseness.*

In [None]:
#Pull the total_runoff variable into a Python variable called "runoff"
runoff = dataset.variables['total_runoff']
#Now list its attributes
runoff.ncattrs()

With the `runoff` Python variable established, we can use it to display attribute contents directly

In [None]:
runoff._FillValue

---
## 3. Working with the data
With some familiarity of what's in the dataset, we can now start manipulating and visualizing the data...

### 3.1 Import the variables into netCDF4 ***variable*** objects
Let's now break our dataset into its component variables so that we can work with each more easily. Here, we assign Python variables to the four dataset variables: `time`, `latitude`, `longitude`, and `total_runoff`. (We'll ignore the `bounds_latitude` and `bounds_longitude` variables for now.) Then, we'll quickly examine some properties of these variable objects. 

In [None]:
#Read the variables in NETCDF file
ncTime = dataset.variables['time']
ncLon = dataset.variables['longitude']
ncLat = dataset.variables['latitude']
ncRunoff = dataset.variables['total_runoff']

In [None]:
#Confirm that these objects are netCDF4 variables
type(ncRunoff)

In [None]:
#Display what we can do with a variable object
?ncRunoff

In [None]:
#What if we just display everything about the variable?
ncRunoff

► *<u>Now you try it</u>: Recalling how we revealed the attributes of the `total_runoff` variable above, what are the units of the `time` variable? The `lat` and `lon` variables?*

In [None]:
#show the units of the ncTime variable


In [None]:
#show the units of the ncLat variable


In [None]:
#show the units of the ncLon variable


**Variable shape**: These four variables are all <u>arrays</u>, i.e. a series of values set across one or more dimensions. We can examine the size and dimensions of each variable via its `shape` property:

In [None]:
ncTime.shape

In [None]:
ncLat.shape

In [None]:
ncLon.shape

In [None]:
ncRunoff.shape

**Note** that the sizes of the three dimensions of the `total_runoff` variable correspond to the sizes of the `time`, `lat`, and `lon` variables, respectively. <u>*This gives a more tangible sense of how these data are structured and how we can manipulate our data.*</u>

More specifically, we see that the data in the `total_runoff` variable is 3 dimensional. The first dimension is time, and the other two dimensions are x-y coordinates in space. So we can <u>envision our data as a stack of runoff maps, with each layer in the stack holding the runoff values for a single time snapshot.</u> This helps us in subsetting our data...

What may not be clear here is: *what then are the `time`, `lat`, and `lon` variables, if everything is held in the one `total_runoff` variable?* This has everything to do with the fact that the actual data (i.e., the amount of runoff) are referenced by their *index position* in the array, as we'll see in a moment. It'd be confusing to explain this in more detail at the moment. Instead, let's move a bit further with the data and then we'll come back to this. 


---
### 3.2 Extracting data from our variables

#### 3.2.1 Extract a single value (i.e. single location-time value)
The `runoff` variable has 3 axes: `time`, `latitude`, and `longitude`, with sizes of 12, 222, and 462, respectively. The value at a given time/geographic coordinate is the predicted runoff at that specific time/location. We can extract a specific runoff value by specifying a `time`, `lat`, and `lon` value:

In [None]:
#Show the runoff for a specified time,lat,lon coordinate
ncRunoff[0,100,200]

The interpolated runoff at that time/location is `2.703` mm/month. 

<u>But what time and location did we actually specify??</u> *What is time=0?? Likewise, a latitude of "100" and a longitude of "200" are not really geographic coordinates.* 

These values are actually pointers to the **positions** in the runoff array, that is, the 1st time slice, the 101st "latitude" row, and the 201st "longitude" column, if you think of our runoff array as a stack of lat/long tables, with a layer for each time. (Also note that Python indices are zero-based, meaning the values start at zero, not 1.)

**So how, then, do we extract data for a known time and/or location?** Well, those data are contained in the other arrays, with positions corresponding to the axes in the runoff array:

In [None]:
print(ncTime[0])
print(ncLat[100])
print(ncLon[200])

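So, we see that the runoff value extracted above is associated with the location (`40.71824°N`, `254.15625°E`) and the time "`32864`"??!? (Longitudes here are measured eastward from 0° to 360°, so this point lies at roughly 105.84°W.)

That time value is actually a count of days since a reference date, which is recorded in the variable's `units` attribute (part of the dataset's metadata):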

In [None]:
ncTime.units

And we have tools to convert this into a more readable format:

In [None]:
#Create a new array, converting the numeric day counts to date-time values
ncTime2 = netCDF4.num2date(ncTime[:],ncTime.units)
ncTime2[0]

And we see the first time slice is Jan 1, 2000. 
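
The idea behind that conversion can be sketched with Python's standard library alone. The reference date and day offsets below are hypothetical, chosen only for illustration; the real reference date is whatever `ncTime.units` reports, and `num2date` additionally handles non-standard calendars, which this sketch does not.

```python
from datetime import datetime, timedelta

# Hypothetical reference date and offsets, for illustration only;
# the real reference date is whatever ncTime.units reports
reference = datetime(1950, 1, 1)
day_offsets = [0, 31, 59]

# Each date is simply the reference date plus the stored number of days
dates = [reference + timedelta(days=d) for d in day_offsets]
for d in dates:
    print(d.strftime("%Y-%m-%d"))
```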

So now what we have to do is somehow cross-reference the *values* in the `time2`, `lat`, and `lon` arrays with the *axes* in the runoff array. In doing so, we can more easily extract the values we want and run various analyses on the data. 
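
One simple way to do this cross-referencing is to search a coordinate array for the value nearest to the coordinate we care about and use its position as the index. The sketch below assumes regularly spaced coordinate arrays with made-up starting values (not the real file's values), and `nearest_index` is a hypothetical helper, not part of netCDF4 or NumPy.

```python
import numpy as np

# Hypothetical coordinate arrays on a regular 1/8-degree grid
# (illustrative values, not read from the real file)
lats = 25.0625 + 0.125 * np.arange(222)
lons = 235.3125 + 0.125 * np.arange(462)

def nearest_index(coords, target):
    """Return the position of the coordinate value closest to target."""
    return int(np.abs(coords - target).argmin())

lat_idx = nearest_index(lats, 36.1)
lon_idx = nearest_index(lons, 281.1)   # longitude expressed in degrees east

print(lat_idx, lats[lat_idx])
print(lon_idx, lons[lon_idx])
```

With those two positions in hand, an expression such as `ncRunoff[0, lat_idx, lon_idx]` would return the value for the grid cell nearest the requested location.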

---
## 4. Python's NumPy (numeric Python) package

### 4.1 Converting the netCDF4 variable into NumPy arrays
The process to do this involves converting each of our NetCDF variables into numeric arrays, which then allows us to use Python's **NumPy** package to do exactly what we need to do...

In [None]:
#Import the numpy package, calling it "np" to save typing
import numpy as np

In [None]:
#Convert the NetCDF variables to numpy arrays 
arrTime = ncTime2[:].astype(np.datetime64)
arrLat = ncLat[:]
arrLon = ncLon[:]
arrRunoff = ncRunoff[:]

In [None]:
#Show the type of object created
type(arrRunoff)

### 4.2 Working with NumPy arrays
The netCDF variables were converted to Numpy *masked arrays*. This means that, in addition to the N-dimensional numeric array, we have an associated n-dimensional array, but one with boolean values that indicate whether the data should be used in calculations or not.  
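
Here is a small sketch of how a masked array behaves, using a made-up 2 x 3 grid in which `1e20` plays the role of a "no data" fill value. The numbers are illustrative and are not taken from the runoff file.

```python
import numpy as np

# Small made-up grid; 1e20 plays the role of a fill value marking "no data"
raw = np.array([[1.0, 2.0, 1e20],
                [4.0, 1e20, 6.0]])

# Mask every cell equal to the fill value
masked = np.ma.masked_equal(raw, 1e20)

print(masked.mask)            # True where the value is ignored
print(masked.count())         # number of valid (unmasked) cells
print(masked.mean())          # mean of the valid cells only
print(masked.filled(np.nan))  # swap masked cells for NaN
```

Because calculations skip the masked cells, ocean or out-of-domain cells in a gridded dataset do not distort statistics such as the mean.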

In [None]:
#Show the shape of the runoff array
arrRunoff.shape

Numpy arrays work much like the netCDF variables in terms of querying values using their position...

In [None]:
arrRunoff[0,100,200]

In [None]:
arrTime[0]

In [None]:
arrLat[100]

In [None]:
arrLon[200]

### 4.3 Subsetting our data
With our variable stored as NumPy arrays, we can use NumPy methods to slice our data across time and/or space. We'll begin simply by just [blindly] using raw position values (i.e. not actual times or lat/long coordinates) to do first a time series plot and then a simple map for a single time slice. 

#### 4.3.1 Time series for one location
To extract runoff across all times for one location (here we'll choose the `lat` at position `100` and the `lon` at position `200`), we use a colon to say "grab all values" along the first axis (`time`), then specific values for the `lat` and `lon` axes. 

In [None]:
#Grab all the runoff data at position lat=100 and lon=200
arrRunoffLocX = arrRunoff[:,100,200]

In [None]:
#What is the shape of the result? It should be (12,): one value per month
arrRunoffLocX.shape

In [None]:
#Show the first 10 values in the resulting array
arrRunoffLocX[0:10].data

#### 4.3.2 Plotting - with NumPy & <u>MatPlotLib</u>
We use another library for plotting. Python has a few, but as you are likely familiar with Matlab, we'll use `matplotlib`, which was designed to resemble Matlab's plotting procedures. 

In [None]:
#Import the pyplot subpackage and enable inline plotting
from matplotlib import pyplot as plt
%matplotlib inline

In [None]:
#Plot as a time series
fig = plt.figure(figsize=(15,3))
plt.plot(arrRunoffLocX);

In [None]:
#Plot time series for three points along a longitudinal transect
fig = plt.figure(figsize=(15,3))
plt.plot(arrRunoff[:,50,200])
plt.plot(arrRunoff[:,100,200])
plt.plot(arrRunoff[:,150,200])
plt.xlabel("month")
plt.ylabel("runoff (mm)")
plt.legend([50,100,150])
plt.show();

#### 4.3.3 Creating a location matrix for a single point in time, or average across all times

In [None]:
#Pull runoff data for all lat/lon positions for the last record in the time axis
arrRunoffRecent = arrRunoff[-1,:,:]
arrRunoffRecent.shape

In [None]:
#Or, compute the mean across all months (time = axis zero)
arrRunoffMean = arrRunoff.mean(axis=0)
arrRunoffMean.shape

In [None]:
#Or, compute the mean across all time values (time = axis zero) for a spatial subset
arrRunoffMeanSubset = arrRunoff[:,150:200,0:75].mean(axis=0)
arrRunoffMeanSubset.shape
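
The slicing and averaging patterns used above can be tried without the data file. The sketch below builds a random array with the same (time, lat, lon) layout as the runoff array; `cube` and the other names are illustrative, and the values are random numbers rather than runoff.

```python
import numpy as np

# Synthetic stand-in with the same (time, lat, lon) layout as the runoff array
rng = np.random.default_rng(0)
cube = rng.random((12, 222, 462))

series = cube[:, 100, 200]                         # all times at one cell -> 1d
last_map = cube[-1, :, :]                          # one time slice -> 2d
mean_map = cube.mean(axis=0)                       # average over time -> 2d
subset_mean = cube[:, 150:200, 0:75].mean(axis=0)  # average over a spatial window

for name, arr in [("series", series), ("last_map", last_map),
                  ("mean_map", mean_map), ("subset_mean", subset_mean)]:
    print(name, arr.shape)
```

In each case the result has fewer axes than the original: fixing an index or averaging along an axis removes that axis from the output.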

#### 4.3.4 Plotting in two dimensions
Matplotlib's `imshow` function allows us to plot images, i.e., data stored with X/Y coordinates, as we have. 

In [None]:
fig = plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
#origin="lower" places row 0 at the bottom of the image
plt.imshow(arrRunoffMean,origin="lower",cmap="YlGnBu")
plt.title("USA")
plt.subplot(1,2,2)
plt.imshow(arrRunoffMeanSubset,origin="lower",cmap="YlGnBu")
plt.title("Pacific NW")
plt.colorbar();

---
## 5. Pandas
**Numpy** works great for computations on n-dimensional arrays, but the axes are still a bit befuddling as they can only be accessed via position, not actual time stamps or actual lat-long coordinates. You can see that easily in the map we just created: the coordinates are *image* coordinates, not geographic ones (e.g. degrees north and degrees east). 

Numpy has a companion Python package called **Pandas** (a contraction of *Pan*el *Da*ta) that overcomes this issue by allowing us to store and manipulate our data in **data frames**. The key difference [for us] between *Numpy's ndarrays* and *Pandas' dataframes* is the ability to name and label columns and rows so we can move beyond just integer indices. 

The drawback, however, is that dataframes require data to be stored in two dimensions, i.e., in rows and columns. (There are ways around that, using something called a hierarchical multiindex, but more on that later...). 
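
For the curious, here is a minimal sketch of what that multi-index workaround can look like. It uses a tiny made-up cube (2 times x 3 latitudes x 4 longitudes) with illustrative coordinate values; none of the names or numbers come from the runoff file.

```python
import numpy as np
import pandas as pd

# Tiny made-up cube: 2 times x 3 latitudes x 4 longitudes
cube = np.arange(24, dtype=float).reshape(2, 3, 4)
times = pd.to_datetime(["2000-01-01", "2000-02-01"])
lats = [35.0, 35.125, 35.25]
lons = [280.0, 280.125, 280.25, 280.375]

# One row per (time, lat, lon) combination, in the same order as the flattened cube
index = pd.MultiIndex.from_product([times, lats, lons],
                                   names=["time", "lat", "lon"])
long_series = pd.Series(cube.ravel(), index=index, name="runoff")

print(long_series.head())

# Look up one value by its labels rather than its position
value = long_series.loc[(pd.Timestamp("2000-02-01"), 35.125, 280.25)]
print(value)
```

This keeps all three axes in one object, but every operation now has to reason about three index levels, which is part of why it becomes tiresome for larger grids.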

### 5.1 Reducing and converting our NumPy arrays to Pandas dataframes
Anyway, let's convert our Numpy arrays to Pandas dataframes. Of course, before we can do that, we need to ***reduce*** our data to two or fewer dimensions. We've done that above, creating the time series dataset (by reducing the lat and lon dimensions to a single location), and by creating the mapped dataset (by reducing the time dimension to a single point in time or by averaging all time values into one value.)  

In [None]:
#Import pandas
import pandas as pd

#### 5.1.1 Full process from netCDF4 variable to 2d Pandas dataframes
Let's review the whole process...

In [None]:
#Pull the runoff variable from the netCDF dataset
runoff = dataset.variables['total_runoff']

In [None]:
#Convert the netCDF variable to a 3-d numpy array
arrRunoff = runoff[:]
arrRunoff.shape

In [None]:
#Reduce the 3d array to a 1d array
arrTimeSeries = arrRunoff[:,100,200]
arrTimeSeries.shape

In [None]:
#Reduce the 3d array to a 2d array, collapsing the time dimension into average values
arrMeanRunoff = arrRunoff.mean(axis=0)
arrMeanRunoff.shape

In [None]:
#Convert the 1d time series array to a dataframe
dfTimeSeries=pd.DataFrame(arrTimeSeries)
dfTimeSeries.shape

In [None]:
#Display the first 5 records
dfTimeSeries.head()

In [None]:
#Convert the 2d mean values to a dataframe
dfMeanRunoff=pd.DataFrame(arrMeanRunoff)
dfMeanRunoff.shape

In [None]:
#Display mean runoff 
fig = plt.figure(figsize=(15,5))
plt.imshow(dfMeanRunoff,origin="lower",cmap="YlGnBu");

#### 5.1.2 Adding column names and indices to our dataframes
***♦ Note that the rows and columns remain as simple integers, not actual times or lat/long coordinates***.

By adding row and column labels to our dataset, using the `time`, `lat`, and `lon` variable values as those row and column names, we can then manipulate our data using those values (as compared to just integers in numpy arrays). 

**Note**: row labels are called indices in Pandas.

* First, we import the variable values into Pandas *series* objects. Series are just 1d arrays. 

In [None]:
#Convert the time, lat, and lon variables to Pandas series objects (series b/c 1 dimension)
ncTime = dataset.variables['time']
seriesTime = pd.Series(netCDF4.num2date(ncTime[:],ncTime.units))
seriesLat = pd.Series(dataset.variables['latitude'][:])
seriesLon = pd.Series(dataset.variables['longitude'][:])
print(seriesTime.shape,seriesLat.shape,seriesLon.shape)

Now we can assign those values to the dataframe's column names and row indices...

In [None]:
#Set the index of the time series dataframe to the actual times
dfTimeSeries.index = seriesTime
dfTimeSeries.head(3)

In [None]:
#Add column names and an index to the dfMeanRunoff dataframe
dfMeanRunoff.columns = seriesLon
dfMeanRunoff.index = seriesLat
dfMeanRunoff.head()
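
Once rows and columns carry coordinate labels, values can be selected by coordinate with `.loc`. The sketch below uses a made-up 3 x 4 grid with illustrative coordinates standing in for the mean-runoff dataframe.

```python
import numpy as np
import pandas as pd

# Made-up 3 x 4 grid standing in for the mean-runoff dataframe
lats = [35.0, 35.125, 35.25]
lons = [280.0, 280.125, 280.25, 280.375]
df = pd.DataFrame(np.arange(12, dtype=float).reshape(3, 4),
                  index=pd.Index(lats, name="lat"),
                  columns=pd.Index(lons, name="lon"))

# Single cell by its coordinates
cell = df.loc[35.125, 280.25]
print(cell)

# Rectangular window by coordinate ranges (label slices include both endpoints)
window = df.loc[35.0:35.125, 280.125:280.375]
print(window)
```

Selecting a single cell this way requires an exact match on floating-point labels, so with real coordinates a range slice, or a nearest-value lookup done beforehand, tends to be more forgiving.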

### 5.2 Subsetting data in Pandas dataframes
#### 5.2.1 Selecting a *slice* of time
With our data in a Pandas dataframe and the actual times set as the index, we can select, subset, or **slice** our data on *actual times*.

In [None]:
#Subset the data from the start of 2000 up to (but not including) 2004; our sample file only covers 2000
df_subset = dfTimeSeries[(dfTimeSeries.index >= '2000') & 
                         (dfTimeSeries.index < '2004')]

In [None]:
#Plot the subset
fig = plt.figure(figsize=(15,3))
plt.plot(df_subset);

♦ **AND** we get the actual dates on the axis! 
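
When the index is a true Pandas `DatetimeIndex`, the same kind of selection can also be written with date strings inside `.loc`. The sketch below assumes such an index and uses a made-up monthly series for the sample year; the values are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up monthly series for the sample year, indexed by month-start dates
months = pd.date_range("2000-01-01", periods=12, freq="MS")
ts = pd.DataFrame({"runoff": np.arange(12, dtype=float)}, index=months)

# Partial-string slicing on a DatetimeIndex: March through June, inclusive
spring = ts.loc["2000-03":"2000-06"]
print(spring)
```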

---

## 6. Using `xarray`
Ok. We are starting to get a handle on the basics of wrangling climate data with NetCDF4/NumPy/Pandas. However, before going any deeper into this rabbit hole, it's time to move onto something that makes this all MUCH easier. Turns out, someone coded a new Python package called **`xarray`** that includes a number of functions and methods tailored exactly to our needs. 

So, let's move to the next notebook!