# Working with NetCDF data

NetCDF is shorthand for Network Common Data Form and is frequently used to distribute large amounts of array-like data. This notebook explains the basics of the data structure and shows how an online NetCDF dataset can be downloaded and visualised using the xarray package.

Let's import the required packages first.

In [1]:
import requests
import urllib.request

import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xarray as xr

The task at hand consists of determining the mean monthly rainfall over the Ovens catchment based on data in a NetCDF file. First, let's get the catchment boundary using the same API as in the previous session.

In [2]:
url = "https://services-ap1.arcgis.com/ypkPEy1AmwPKGNNv/arcgis/rest/services/River_Regions_Source_view/FeatureServer/0/query"

rest_params = {
    "outFields": "*",
    "where": "rivregname='OVENS RIVER'",
    "f": "geojson",
}

response = requests.get(url, params=rest_params)
response.raise_for_status()  # Stop here if the server returned an error status

Convert to a GeoDataFrame...

In [3]:
# Get the data part from the response
data = response.json()
# Convert the features to a GeoDataFrame
gdf_ov = gpd.GeoDataFrame.from_features(data["features"], crs=4326)

Then we'll download the NetCDF file from the following data folder: https://thredds.nci.org.au/thredds/catalog/zv2/agcd/v2-0-2/precip/total/r005/01month/catalog.html. There are lots of other interesting data sets here too: https://thredds.nci.org.au/thredds/catalog/catalog.html.

We use the `urlretrieve` function to actually download the file to our computer.

In [4]:
url = "https://thredds.nci.org.au/thredds/fileServer/zv2/agcd/v2-0-2/precip/total/r005/01month/agcd_v2_precip_total_r005_monthly_2010.nc"
local_filename = "agcd_v2_precip_total_r005_monthly_2010.nc"
urllib.request.urlretrieve(url, local_filename);
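When a notebook is re-run, the download cell fetches the same file again. A minimal sketch of one way to avoid this is shown below; the helper name `download_if_missing` is hypothetical and not part of `urllib`, and it assumes that a file with the same name on disk is a complete earlier download.

```python
import os
import urllib.request


def download_if_missing(url, filename):
    """Download `url` to `filename` unless that file already exists locally.

    Hypothetical helper: it only checks that the file exists, not that
    its contents are complete or up to date.
    """
    if os.path.exists(filename):
        return filename  # Reuse the copy that is already on disk
    urllib.request.urlretrieve(url, filename)
    return filename
```

Calling `download_if_missing(url, local_filename)` in place of `urlretrieve` would then skip the download on later runs.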

The downloaded NetCDF file can be opened using `open_dataset` from xarray. Typing the variable name on the last line will display the contents. For a detailed explanation of the data structures, see https://docs.xarray.dev/en/stable/user-guide/data-structures.html. For now it is worth noting that the Dataset stores the dimensions (time, lat and lon) as well as the coordinates for those dimensions. The Dataset also contains two variables: 'precip' and 'crs'. The latter only stores the coordinate system; the actual rainfall values per pixel are in 'precip'.

In [None]:
ds = xr.open_dataset(local_filename)

ds
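To make the relation between dimensions, coordinates and variables more concrete, the sketch below builds a tiny Dataset by hand. The grid, the values and the name `demo_ds` are made up for illustration; only the dimension and variable names mirror those of the downloaded file.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical 2 x 3 x 4 grid (time, lat, lon) with made-up rainfall values
time = pd.date_range("2010-01-01", periods=2, freq="MS")
lat = np.array([-37.0, -36.5, -36.0])
lon = np.array([146.0, 146.5, 147.0, 147.5])
values = np.arange(24, dtype=float).reshape(2, 3, 4)

demo_ds = xr.Dataset(
    # Each variable is given as (dimension names, array of values)
    data_vars={"precip": (("time", "lat", "lon"), values)},
    # Each coordinate holds the labels along one dimension
    coords={"time": time, "lat": lat, "lon": lon},
)

print(dict(demo_ds.sizes))      # Dimension names and their lengths
print(list(demo_ds.data_vars))  # Variables stored in the Dataset
print(demo_ds["precip"].dims)   # Dimensions of a single variable
```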

To get an idea of the data, let's plot `precip` using the convenience method `plot`. Note how the `sel` method is used to select only the data for the last available month. Just like for a Pandas column, the variable of interest can be selected by placing its name between square brackets. The `plot` method of the GeoDataFrame is overlain to show the location of the Ovens catchment.

In [None]:
fig, ax = plt.subplots()
ds.sel(time=ds.time[-1])["precip"].plot(ax=ax)

gdf_ov.plot(ax=ax, fc="none", ec="w"); # No fill, white edgecolor
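The `sel` method selects by coordinate label, whereas its counterpart `isel` selects by integer position. The sketch below compares the two on a small made-up DataArray (`demo_da` and its values are assumptions for illustration) and also shows the `method="nearest"` option of `sel`, which picks the grid cell closest to a requested location.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Made-up 3 x 2 x 2 array (time, lat, lon), for illustration only
time = pd.date_range("2010-01-01", periods=3, freq="MS")
demo_da = xr.DataArray(
    np.arange(12, dtype=float).reshape(3, 2, 2),
    dims=("time", "lat", "lon"),
    coords={"time": time, "lat": [-37.0, -36.5], "lon": [146.0, 146.5]},
    name="precip",
)

by_label = demo_da.sel(time=time[-1])  # Select by coordinate label
by_position = demo_da.isel(time=-1)    # Select by integer position

# Time series for the grid cell nearest to a requested location
nearest = demo_da.sel(lat=-36.6, lon=146.1, method="nearest")

print(by_label.dims)
print(nearest.values)
```

Both `by_label` and `by_position` hold the last time slice, so the time dimension is dropped and only lat and lon remain.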

Let's say we want to use the data from the NetCDF file to get the mean rainfall per month in the Ovens catchment. To do this we need to create a mask, which will be used to hide all the data points in the Dataset that are outside the catchment. A few steps are required. First we'll create a point for every grid cell and add these points to a GeoDataFrame `gdf_pts` that contains no actual data, only the points themselves. Then we'll use geopandas' powerful `sjoin` function to determine which points are within the polygon of the catchment. The index of the resulting rows tells us which points fall inside the catchment, and with it we can build the mask: a Boolean array that is True where a point is inside the catchment and False elsewhere.

In [7]:
x, y = np.meshgrid(ds["lon"], ds["lat"])
da_shape = x.shape
x = x.ravel()
y = y.ravel()

pts = gpd.points_from_xy(x, y)
gdf_pts = gpd.GeoDataFrame(geometry=pts, crs=4326)
within = gdf_pts.sjoin(gdf_ov, predicate='within')
idx = within.index.to_list()
mask = np.full(x.shape, False)
mask[idx] = True
mask = mask.reshape(da_shape)
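The step from flat point indices back to a two-dimensional mask can be hard to picture. The sketch below repeats it on a made-up 3 x 4 grid using NumPy only. A simple bounding-box test stands in for the spatial join, which is an assumption made purely to keep the example small.

```python
import numpy as np

# Made-up 3 x 4 grid of cell centres
lon = np.array([146.0, 146.5, 147.0, 147.5])
lat = np.array([-37.0, -36.5, -36.0])
x, y = np.meshgrid(lon, lat)  # Both have shape (len(lat), len(lon))
grid_shape = x.shape
x_flat, y_flat = x.ravel(), y.ravel()

# Stand-in for the spatial join: flat indices of the points inside a box
inside = (x_flat > 146.2) & (x_flat < 147.2) & (y_flat > -36.8)
idx = np.flatnonzero(inside)

demo_mask = np.full(x_flat.shape, False)
demo_mask[idx] = True
demo_mask = demo_mask.reshape(grid_shape)  # Rows follow lat, columns follow lon

print(idx)
print(demo_mask)
```

Because `ravel` and `reshape` use the same row-major ordering, each flat index ends up at the grid position of the point it came from.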

The next code cell contains a loop that selects one time slice from the Dataset at each step. Selecting the variable 'precip' results in a DataArray, to which the mask can be applied. All values of the DataArray outside the catchment become NaN; only those within the catchment boundaries keep their finite values. The `nanmean` function is then used to calculate the mean rainfall for these pixels, and the calculated value is appended to the list `rain_lst`.

In [8]:
rain_lst = []
for t in ds["time"]:
    dsi = ds.sel(time=t)
    da = dsi["precip"]
    da = da.where(mask)

    mm_rain = np.nanmean(da.to_numpy())

    rain_lst.append(mm_rain)
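As an alternative to the loop, xarray can apply the mask to all time slices at once and average over the spatial dimensions. The sketch below shows this on made-up data (`demo_precip` and `demo_mask` are hypothetical). It assumes the mask is wrapped in a DataArray with `lat` and `lon` dimensions, so that xarray can match it to the data by dimension name.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Made-up data: 2 time steps on a 2 x 2 grid
time = pd.date_range("2010-01-01", periods=2, freq="MS")
lat = [-37.0, -36.5]
lon = [146.0, 146.5]
demo_precip = xr.DataArray(
    np.array([[[1.0, 2.0], [3.0, 4.0]], [[10.0, 20.0], [30.0, 40.0]]]),
    dims=("time", "lat", "lon"),
    coords={"time": time, "lat": lat, "lon": lon},
    name="precip",
)

# Boolean mask with named dimensions, True for cells inside the area of interest
demo_mask = xr.DataArray(
    np.array([[True, False], [True, True]]),
    dims=("lat", "lon"),
    coords={"lat": lat, "lon": lon},
)

# Cells outside the mask become NaN and are skipped by mean()
monthly_mean = demo_precip.where(demo_mask).mean(dim=["lat", "lon"])
print(monthly_mean.values)
```

The result is a DataArray with one value per time step, which keeps the time coordinate attached to the means.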

To view the last DataArray in QGIS we can save it as a NetCDF file:

In [9]:
da.to_netcdf("da.nc")

We'll then plot the data in a bar chart.

In [None]:
fig, ax = plt.subplots()

ax.bar(ds["time"], rain_lst);

Hmm, the result isn't particularly pleasing to the eye. Converting the data to a Pandas DataFrame makes it possible to use the `daysinmonth` attribute of the DatetimeIndex to get the number of days in each month, with which we can set the width of the bars. The resulting plot looks a lot better!

In [None]:
df = pd.DataFrame(
    index=ds["time"].to_index(),  # DatetimeIndex built from the time coordinate
    data={
        "rainfall": rain_lst,
    },
)

fig, ax = plt.subplots()
bar_widths = df.index.daysinmonth
ax.bar(df.index, df["rainfall"], width=bar_widths, ec='w');
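To see what `daysinmonth` returns, the short sketch below evaluates it for a made-up DatetimeIndex covering the twelve months of 2010. Matplotlib measures date axes in days, which is why these numbers can be passed as bar widths.

```python
import pandas as pd

# Made-up index with the first day of each month in 2010
months = pd.date_range("2010-01-01", periods=12, freq="MS")

# Number of days in each month (2010 is not a leap year)
print(list(months.daysinmonth))
```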