# Explore the National Water Model Reanalysis
Use [Xarray](http://xarray.pydata.org/en/stable/), [Dask](https://dask.org) and [hvPlot](https://hvplot.holoviz.org) from the [HoloViz](https://holoviz.org) tool suite to explore the National Water Model Reanalysis Version 2. We read from a cloud-optimized [Zarr](https://zarr.readthedocs.io/en/stable/) dataset that is part of the [AWS Open Data Program](https://aws.amazon.com/opendata/), and we use a Dask cluster to parallelize both the computation and the reading of data chunks.

In [None]:
import xarray as xr
import fsspec
import numpy as np

In [None]:
import hvplot.pandas
import hvplot.xarray
import geoviews as gv
from holoviews.operation.datashader import rasterize
import cartopy.crs as ccrs

### Start a Dask cluster
This is not required, but it speeds up computations. One can start a local cluster by just doing:
```
from dask.distributed import Client
client = Client()
```
but there are [many other ways to set up Dask clusters](https://docs.dask.org/en/latest/setup.html) that can scale larger than this. 

Since we used [Qhub](https://www.quansight.com/post/announcing-qhub) to install JupyterHub with a Dask Gateway running on Kubernetes, we can start a Dask cluster (with a specified environment and worker profile), scale it, and connect to it as follows:

In [None]:
from dask_gateway import Gateway
from dask.distributed import Client
gateway = Gateway()
# see Gateway options to use in new_cluster by doing: gateway.cluster_options()
cluster = gateway.new_cluster(environment='pangeo', profile='Pangeo Worker')  
cluster.scale(20)
client = Client(cluster)
cluster
#client.close();cluster.shutdown()   # shutdown client and cluster

Open Zarr datasets in Xarray using a mapper from fsspec.  We use `anon=True` for free-access public buckets like the AWS Open Data Program, and `requester_pays=True` for requester-pays public buckets. 

In [None]:
url = 's3://noaa-nwm-retro-v2-zarr-pds'

In [None]:
%%time
ds = xr.open_zarr(fsspec.get_mapper(url, anon=True), consolidated=True)

In [None]:
ds

In [None]:
var='streamflow'

In [None]:
print(f'Variable size: {ds[var].nbytes/1e12:.1f} TB')
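
The size reported above is derived from array metadata (shape and data type) alone, so no data needs to be read to compute it. A minimal sketch of the arithmetic, assuming a hypothetical shape and 8-byte values (neither is taken from the dataset; `estimated_nbytes` is an illustrative helper, not part of Xarray):

```python
import math

def estimated_nbytes(shape, itemsize):
    """Bytes needed for a dense array: number of elements times bytes per element."""
    return math.prod(shape) * itemsize

# Hypothetical values for illustration only, not the real dataset dimensions.
shape = (1000, 2_000_000)   # (time, feature_id)
itemsize = 8                # bytes per 64-bit value

size_tb = estimated_nbytes(shape, itemsize) / 1e12
print(f"Variable size: {size_tb:.3f} TB")
```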

In [None]:
# NOTE: the assignment target is missing in the source; the name below is assumed
ds_streamflow = ds[[var]]

In [None]:
idx = (ds.latitude > 41.0) & (ds.latitude < 51.0) & (ds.longitude > -75.0) & (ds.longitude < -62.0)

In [None]:
ds_subset = ds.isel(feature_id=idx) 
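
The bounding-box selection above combines four comparisons into one boolean mask along `feature_id`. The sketch below shows the same logic on a few synthetic coordinates with plain NumPy; the coordinate values are made up for illustration. Depending on the Xarray version, a lazy (Dask-backed) mask may need to be evaluated before it is used as an indexer, which is an assumption worth checking in your environment.

```python
import numpy as np

# Synthetic coordinates for five hypothetical stream reaches.
latitude = np.array([40.0, 42.5, 45.0, 50.5, 52.0])
longitude = np.array([-70.0, -76.0, -68.0, -63.0, -65.0])

# Same strict inequalities as the bounding box used in the notebook.
inside = (latitude > 41.0) & (latitude < 51.0) & (longitude > -75.0) & (longitude < -62.0)

# Integer positions of the selected reaches, usable for positional indexing.
positions = np.flatnonzero(inside)
print(inside)
print(positions)
```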

In [None]:
ds_out = ds_subset[[var]].isel(time=slice(0,672))

In [None]:
print(f'Variable size: {ds_out.nbytes/1e9:.1f} GB')

In [None]:
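# Build per-variable netCDF encoding: zlib compression plus the source chunking.
# The loop variable is named `name` so it does not overwrite `var` defined earlier.
encoding = {}
for name in ds_out.variables:
    encoding[name] = dict(zlib=True, complevel=5,
                          fletcher32=False, shuffle=False,
                          chunksizes=ds[name].encoding['chunks'])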

In [None]:
ds_out.to_netcdf('subset.nc', mode='w', encoding=encoding)
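
The encoding above reuses the chunk sizes of the full source arrays. netCDF-4 generally rejects a chunk size larger than the corresponding fixed dimension, so if a subset is shorter than a source chunk along some dimension, the chunk sizes may need to be clamped first. A minimal sketch, using a hypothetical helper `clamp_chunks` and made-up numbers:

```python
def clamp_chunks(chunks, shape):
    """Limit each chunk length to the length of the matching dimension."""
    return tuple(min(c, s) for c, s in zip(chunks, shape))

# Hypothetical numbers: chunking inherited from a large source array,
# applied to a subset that is smaller along the second dimension.
source_chunks = (672, 30000)
subset_shape = (672, 12000)

print(clamp_chunks(source_chunks, subset_shape))
```

The clamped tuple could then be passed as `chunksizes` in the per-variable encoding.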

In [None]:
df = var_mean.to_pandas().to_frame()

The dataframe contains only streamflow, so add longitude and latitude as columns.

In [None]:
df = df.assign(latitude=ds_streamflow_subset['latitude'])
df = df.assign(longitude=ds_streamflow_subset['longitude'])
df.rename(columns={0: "transport"}, inplace=True)

In [None]:
p = df.hvplot.points('longitude', 'latitude', crs=ccrs.PlateCarree(),
                     c='transport', colorbar=True, size=14)

We don't want to plot all 93,754 points individually, so aggregate to 0.02 degree resolution and rasterize with datashader. Use a log scale for visualization since streamflow has a large dynamic range.

In [None]:
g = rasterize(p, aggregator='mean', x_sampling=0.02, y_sampling=0.02, width=500).opts(tools=['hover'], 
                aspect='equal', logz=True, cmap='viridis', clim=(1e-2, np.nan))
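
To make the aggregation step concrete, the sketch below averages point values within 0.02 degree cells using plain NumPy. It is a conceptual stand-in for a mean aggregator, not datashader's implementation; `grid_mean` and the sample points are hypothetical.

```python
import numpy as np

def grid_mean(lon, lat, values, cell=0.02):
    """Average point values within square lon/lat cells of width `cell` degrees."""
    # Integer cell index of each point along each axis.
    ix = np.floor((lon - lon.min()) / cell).astype(int)
    iy = np.floor((lat - lat.min()) / cell).astype(int)
    total = np.zeros((iy.max() + 1, ix.max() + 1))
    count = np.zeros_like(total)
    # Accumulate sums and counts per cell; np.add.at handles repeated indices.
    np.add.at(total, (iy, ix), values)
    np.add.at(count, (iy, ix), 1)
    # Cells that received no points end up as NaN (0 / 0).
    with np.errstate(invalid="ignore", divide="ignore"):
        return total / count

# Four made-up points; the first two fall in the same cell.
lon = np.array([-70.000, -69.995, -69.970, -70.000])
lat = np.array([44.000, 44.005, 44.000, 44.030])
flow = np.array([1.0, 3.0, 100.0, 10.0])

grid = grid_mean(lon, lat, flow)
print(grid)
# A log transform compresses the large range of values for display.
print(np.log10(grid))
```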

Plot the rasterized streamflow data on an OpenStreetMap tile service basemap.

In [None]:
g * gv.tile_sources.OSM

In [None]:
ds_sub=ds_streamflow_subset.sel(time=slice('2000-01-01',None))

In [None]:
#ds_sub.to_netcdf('subset.nc')

In [None]:
# Shut down the client and the cluster to release resources
client.close()
cluster.shutdown()