# ATMS 305: Module 8 Lecture 1
Data visualization using xarray and matplotlib
---

In the last exercise, you learned how to use `matplotlib` to visualize data.  In this exercise, we will put it all together: 
- read files
- subset files using fancy indexing
- plot the results

We're going to use a new Python package called `xarray`.  It is a high-level interface for reading and writing netCDF files that will make your life easier.  Let's go!  

In [None]:
%pylab inline
import xarray as xr

We need some additional packages for this assignment.  Run this cell every time you start up the notebook so you can access netCDF4 datasets remotely.

In [None]:
!pip install netcdf4
!pip install pydap

Ahhhhhhhh....all better.  Let's get to work!

## Using xarray to read a netCDF file

netCDF4 is a common file format for storing gridded binary data in the atmospheric sciences.  It is a self-describing format, meaning that each file contains the data along with the coordinates and metadata needed to use it.  We will start with a simple example: maps of surface temperature anomalies from NASA GISS (served from NOAA ESRL).  This particular dataset is available online for streaming through a service called OPeNDAP, which means we don't even have to download the data.  You can also download netCDF files to your local computer.
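
To see what "self-describing" looks like in practice, here is a minimal sketch that builds a tiny dataset in memory instead of reading a real file.  The grid, the values, and the name `toy` are all made up for illustration; only the variable name `air` is borrowed from the dataset we use below.

```python
import numpy as np
import xarray as xr

# Hypothetical 2 x 3 grid of temperature anomalies (made-up values, not real data)
lat = np.array([39.0, 41.0])
lon = np.array([269.0, 271.0, 273.0])
data = np.array([[0.1, 0.2, 0.3],
                 [0.4, 0.5, 0.6]])

# The data, its dimension names, its coordinates and its metadata travel together
toy = xr.Dataset(
    data_vars={"air": (("lat", "lon"), data, {"units": "degC"})},
    coords={"lat": lat, "lon": lon},
    attrs={"title": "toy example, not real data"},
)

print(toy)                         # summary of dimensions, coordinates and variables
print(toy["air"].dims)             # names of the dimensions of 'air'
print(toy["air"].attrs["units"])   # metadata attached to the variable
```

A dataset opened from a netCDF file prints the same kind of summary, which is why typing the dataset name on its own line is a good first step.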

It can't be much easier to read data than this...xarray handles a lot of the dirty work for you.  We can load local files as well as files on the internet, like this OPeNDAP dataset.  `xarray` lets you give either a local file path or a URL!

In [None]:
nc=xr.open_dataset('https://www.esrl.noaa.gov/psd/thredds/dodsC/Datasets/gistemp/combined/250km/air.2x2.250.mon.anom.comb.nc')
nc

# Quick look at the values of the longitude coordinate
plt.plot(nc.lon)

In [None]:
ncvar = nc['air']
ncvar

This data is a gridded time series of monthly surface temperature anomalies at 2-degree resolution, starting in 1880.  Let's average over both space dimensions (lat is axis 1 and lon is axis 2).  We can use `np.mean` and its `axis` keyword (very handy) for this purpose.

This will yield a time series of globally averaged temperature anomalies!  Note that this simple mean counts every grid cell equally, even though cells near the poles cover less area than cells near the equator.

In [None]:
np.mean(ncvar, axis=(1,2))

In [None]:
plt.figure()
plt.plot(ncvar.time,np.mean(ncvar,axis=(1,2)))
plt.xlabel('Year')
plt.ylabel('Temperature anomaly deg C')
plt.title("It's getting hot up in here!")
plt.show()
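
Since `xarray` knows the names of the dimensions, the same average can be written with `dim=` instead of axis numbers, and it can also be weighted by grid cell area.  The sketch below uses a synthetic array (the name `toy` and its values are made up) in place of `ncvar`, and assumes that weighting each latitude row by the cosine of latitude is a good enough stand-in for cell area.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the 'air' variable with dimensions (time, lat, lon)
lat = np.arange(-89.0, 90.0, 2.0)
lon = np.arange(1.0, 360.0, 2.0)
time = np.arange(3)

# Made-up field that grows from 0 at the equator to about 1 at the poles
values = np.abs(lat)[None, :, None] / 90.0 * np.ones((time.size, lat.size, lon.size))
toy = xr.DataArray(values, coords={"time": time, "lat": lat, "lon": lon},
                   dims=("time", "lat", "lon"))

# Unweighted mean over the named space dimensions (same result as axis=(1, 2))
simple_mean = toy.mean(dim=("lat", "lon"))

# Area-weighted mean: rows near the poles count less than rows near the equator
weights = np.cos(np.deg2rad(toy["lat"]))
weighted_mean = toy.weighted(weights).mean(dim=("lat", "lon"))

print(simple_mean.values)
print(weighted_mean.values)
```

Because the made-up field is largest near the poles, the weighted mean comes out smaller than the unweighted one.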

We can use the `sel` method to pick out data by coordinate value instead of by index.  With `method='nearest'`, it finds the grid point closest to the longitude and latitude we ask for, so we can grab the point closest to Champaign-Urbana.  We can give it a list of points.  Here we will give it one.

In [None]:
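# Champaign-Urbana is near 40.1 N, 88.2 W; adding 360 converts the western
# longitude to the 0-360 degrees east convention
nc_cmi=nc.sel(lon=-88.2+360.,lat=40.1, method='nearest')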
nc_cmi
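
Here is a small sketch of how nearest-point and box selection behave, using a synthetic 2-degree grid (the name `toy`, the grid, and the box limits are made up for illustration).  It assumes longitudes are stored as 0 to 360 degrees east.

```python
import numpy as np
import xarray as xr

# Synthetic 2-degree grid with longitudes in degrees east (0 to 360)
lat = np.arange(-89.0, 90.0, 2.0)
lon = np.arange(1.0, 360.0, 2.0)
toy = xr.DataArray(np.zeros((lat.size, lon.size)),
                   coords={"lat": lat, "lon": lon}, dims=("lat", "lon"))

# A point near 40.1 N, 88.2 W, given with a western (negative) longitude
target_lat, target_lon_west = 40.1, -88.2
target_lon = target_lon_west % 360   # convert to degrees east

# Nearest grid point to the target
point = toy.sel(lat=target_lat, lon=target_lon, method="nearest")
print(float(point["lat"]), float(point["lon"]))

# A box around the point: slice() keeps every grid point within the limits
box = toy.sel(lat=slice(35, 45), lon=slice(265, 275))
print(box.shape)
```

Nearest-point selection returns a single grid cell, while `slice` keeps the whole block of cells inside the box.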

In [None]:
import numpy as np
import matplotlib.pyplot as plt

plt.plot(nc_cmi.time,nc_cmi['air'])
plt.xlabel('Year')
plt.ylabel('Temperature anomaly deg C')
plt.title("It's not getting as hot in Champaign")

## Saving to file - easy as np.pi
Want to save the file as a netCDF file?  No problem!

In [None]:
nc_cmi.to_netcdf('nc_cmi.nc')

## Calculating time averages

Let's say we want to average the monthly time series data into annually averaged data.

There are a number of ways to do this.  `xarray` offers time sampling capabilities, similar to `pandas`.  First though, let's do it the hard way.

In [None]:
nc_cmi

In [None]:
# How many years do we have?

ntimes=np.shape(nc_cmi['time'])
print(ntimes[0]/12.)

It looks like we don't have an evenly divisible number of months, so we don't have complete years.
Let's start at the beginning and loop by 12.

In [None]:
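nyears=int(np.floor(ntimes[0]/12.))  # number of complete years (np.int was removed from NumPy)
averages=np.zeros(nyears)

for i in np.arange(nyears):
    # the end of a slice is exclusive, so i*12:(i+1)*12 covers all 12 months of year i
    averages[i]=np.mean(nc_cmi['air'][i*12:(i+1)*12])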
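
The same "hard way" can be done without a loop by reshaping.  This is a minimal sketch on made-up numbers (the array `monthly` is hypothetical, not the Champaign data): keep only the complete years, reshape to (years, 12), and average along the month axis.

```python
import numpy as np

# 30 made-up monthly values: two complete years plus six extra months
monthly = np.arange(30, dtype=float)

# divmod gives the number of complete years and the leftover months
nyears, leftover = divmod(monthly.size, 12)

# Drop the incomplete final year, then average each row of 12 months
complete = monthly[:nyears * 12]
annual = complete.reshape(nyears, 12).mean(axis=1)

# The loop version gives the same numbers
looped = np.array([monthly[i*12:(i+1)*12].mean() for i in range(nyears)])

print(nyears, leftover)
print(annual)
```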

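Save the Champaign time series to a new file.  Note that this writes `nc_cmi` (the monthly data); the `averages` array we just computed is a plain NumPy array and is not part of it.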

In [None]:
site='Champaign_annualavgs'

nc_cmi.to_netcdf(site+'_data.nc')

Read it back in to check!

In [None]:
nc_cmi2 = xr.open_dataset('Champaign_annualavgs_data.nc')
nc_cmi2

Now the easy way.  `xarray` and `pandas` share the same interface for resampling and grouping time series conveniently.  The documentation is available at: http://xarray.pydata.org/en/stable/time-series.html#resampling-and-grouped-operations.  The frequency codes for resampling are the same as in `pandas`.  See http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases


In [None]:
# 'YS' means year start ('AS' is an older alias for the same thing)
nc_cmi3=nc_cmi.resample(time='YS').mean()
nc_cmi3
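
To check what resampling does, it helps to try it on numbers where the answer is easy to work out by hand.  The sketch below uses a made-up two-year monthly series (the name `monthly` and its values are hypothetical) and also shows `groupby('time.year')` as an alternative way to get annual means.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two years of made-up monthly values: 0, 1, ..., 23
time = pd.date_range("2000-01-01", periods=24, freq="MS")
monthly = xr.DataArray(np.arange(24, dtype=float),
                       coords={"time": time}, dims="time", name="air")

# Annual means labelled by the start of each year
annual = monthly.resample(time="YS").mean()
print(annual.values)

# Alternative: group by calendar year; the result is indexed by 'year'
by_year = monthly.groupby("time.year").mean()
print(by_year["year"].values)
```

Both approaches give one value per year; they differ only in how the result is labelled (a timestamp versus an integer year).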

How easy is that?  Easy as `np.pi`?  You can also resample with other time frequencies, or in space, or change how you do the calculation (e.g., calculate the median instead of the mean).
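
As a sketch of those options, here is a made-up one-year monthly series (the values are hypothetical, with one deliberate outlier in March) resampled to quarters with both the mean and the median.

```python
import numpy as np
import pandas as pd
import xarray as xr

# One year of made-up monthly values; March (30.0) is a deliberate outlier
time = pd.date_range("2000-01-01", periods=12, freq="MS")
values = np.array([1.0, 2.0, 30.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0])
monthly = xr.DataArray(values, coords={"time": time}, dims="time", name="air")

# 'QS' means quarter start: Jan-Mar, Apr-Jun, Jul-Sep, Oct-Dec
quarterly_mean = monthly.resample(time="QS").mean()
quarterly_median = monthly.resample(time="QS").median()

print(quarterly_mean.values)
print(quarterly_median.values)
```

The outlier pulls the first quarter's mean well above its median, while the other quarters agree.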

Now save to a file:

In [None]:
site='Champaign_annualavgs'

nc_cmi3.to_netcdf(site+'_data2.nc')

Make a (nice) plot!

In [None]:
plt.figure(figsize=(11,8.5)) #create a new figure

plt.plot(nc_cmi['time'],np.squeeze(nc_cmi['air']),'b',alpha=0.5)
plt.plot(nc_cmi3['time'],np.squeeze(nc_cmi3['air']),'r',linewidth=2.0)
plt.legend(['Monthly averages','Annual averages'])
plt.xlabel('Year')
plt.ylabel('Temperature Anomaly (degrees C)')
plt.title('GISTEMP Temperature Anomalies near Champaign, IL')
plt.show()