# Explore the National Water Model Reanalysis
Use [Xarray](http://xarray.pydata.org/en/stable/), [Dask](https://dask.org), and [hvPlot](https://hvplot.holoviz.org) from the [HoloViz](https://holoviz.org) tool suite to explore the National Water Model Reanalysis Version 2. We read from a cloud-optimized [Zarr](https://zarr.readthedocs.io/en/stable/) dataset that is part of the [AWS Open Data Program](https://aws.amazon.com/opendata/), and we use a Dask cluster to parallelize both the computation and the reading of data chunks.

In [None]:
import xarray as xr
import fsspec
import numpy as np

In [None]:
import hvplot.pandas
import hvplot.xarray
import geoviews as gv
from holoviews.operation.datashader import rasterize

### Start a Dask cluster
This is not required, but it speeds up computations. Here we start a local Dask cluster that uses the cores available on your computer. On a platform such as the Pangeo BinderHub, you could start a Dask Gateway cluster instead. There are [many other ways to set up Dask clusters](https://docs.dask.org/en/latest/setup.html).

In [None]:
# Use a Dask local cluster using the CPUs on your computer:
from dask.distributed import Client
client = Client()
client
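If you want explicit control over the size of the local cluster, `Client` accepts sizing arguments. The following is a minimal sketch rather than part of the workflow above: the worker counts are arbitrary, the name `demo_client` is made up for illustration, and `processes=False` (thread-based workers in the current process) is an assumption chosen to keep the example lightweight.

```python
from dask.distributed import Client

# Assumption: two single-threaded workers, run as threads in this process.
demo_client = Client(n_workers=2, threads_per_worker=1, processes=False)

# Submit a trivial task to confirm that the cluster is working.
total = demo_client.submit(sum, [1, 2, 3]).result()
print(total)

demo_client.close()
```

For the heavier reductions later in this notebook, process-based workers (the default of `Client()`) may be a better fit than threads.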

Open the Zarr dataset in Xarray using a mapper from fsspec. We use `anon=True` for free-access public buckets like those in the AWS Open Data Program; for requester-pays public buckets we would use `requester_pays=True` instead.

In [None]:
fs = fsspec.filesystem('s3', anon=True)
url = 's3://noaa-nwm-retro-v2-zarr-pds'
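The two access modes mentioned above differ only in the keyword arguments passed to the S3 filesystem. As a rough sketch, the hypothetical helper below (`s3_storage_options` is not part of fsspec or s3fs) builds those arguments. It assumes that requester-pays access needs credentials, so anonymous access is switched off in that case.

```python
def s3_storage_options(requester_pays=False):
    """Hypothetical helper: build keyword arguments for an S3 filesystem."""
    if requester_pays:
        # Requester-pays requests are billed to the caller, so they must be
        # authenticated; anonymous access is turned off.
        return {"anon": False, "requester_pays": True}
    # Free-access public buckets can be read without credentials.
    return {"anon": True}


public_opts = s3_storage_options()
paid_opts = s3_storage_options(requester_pays=True)
print(public_opts, paid_opts)
```

The result would be used as `fsspec.filesystem('s3', **public_opts)`, which is equivalent to the call above for the public bucket.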

In [None]:
%%time
ds = xr.open_dataset(fs.get_mapper(url), engine='zarr', chunks={},
                     backend_kwargs=dict(consolidated=True))

In [None]:
var = 'streamflow'

In [None]:
ds[var]

In [None]:
def plot_nwm_field(da, label=None):
    # Convert the 1D DataArray to a pandas DataFrame so we can use hvplot.points.
    # Naming the column here avoids relying on the default column name.
    df = da.to_pandas().to_frame(name='transport')
    # The DataFrame holds only the data values, so add latitude and longitude
    # as columns (taken from the dataset `ds` opened above)
    df = df.assign(latitude=ds['latitude'])
    df = df.assign(longitude=ds['longitude'])
    p = df.hvplot.points('longitude', 'latitude', geo=True,
                         c='transport', colorbar=True, size=14, label=label)
    # We don't want to plot all 2.7M points individually, so aggregate
    # to 0.02 degree resolution and rasterize with datashader.
    # Use a log scale for visualization since streamflow has a large dynamic range.
    g = rasterize(p, aggregator='mean', x_sampling=0.02, y_sampling=0.02, width=500).opts(
        tools=['hover'], aspect='equal', logz=True, cmap='viridis', clim=(1e-2, np.nan))
    return g * gv.tile_sources.OSM
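To make the aggregation step in the function above more concrete, here is a conceptual sketch of mean aggregation onto a 0.02 degree grid, written with plain pandas. The points and values are made up, and the grid is simply aligned to multiples of 0.02. Datashader chooses its own grid from the plot extent, so this illustrates the idea, not what `rasterize` does internally.

```python
import numpy as np
import pandas as pd

# Made-up points: the first two fall into the same 0.02 degree cell.
points = pd.DataFrame({
    "longitude": [-90.005, -90.015, -90.031, -89.999],
    "latitude": [30.001, 30.011, 30.005, 30.019],
    "transport": [1.0, 3.0, 10.0, 100.0],
})

cell = 0.02  # grid resolution in degrees
points["ix"] = np.floor(points["longitude"] / cell).astype(int)
points["iy"] = np.floor(points["latitude"] / cell).astype(int)

# One mean value per occupied grid cell, analogous to aggregator='mean'.
binned = points.groupby(["iy", "ix"])["transport"].mean()

# A log scale compresses the large dynamic range, analogous to logz=True.
log_binned = np.log10(binned)
print(binned)
print(log_binned)
```

Four points collapse into three cells here. With millions of river reaches, the same reduction is what keeps the plot responsive.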

### Read and plot data for all the stations at a specific time

In [None]:
%%time
select_time = '2017-06-01 00:00:00'
da = ds[var].sel(time=select_time)
plot_nwm_field(da, label=f'{var}:{select_time}')
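The label-based selection used above can be tried on a tiny synthetic array without any network access. This is a sketch under stated assumptions: the array `toy`, its values, and the `feature_id` labels are invented, and only the dimension layout (time by feature) mirrors the dataset.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Four hourly time steps for three made-up river reaches.
times = pd.date_range("2017-06-01 00:00", periods=4, freq=pd.Timedelta(hours=1))
toy = xr.DataArray(
    np.arange(12.0).reshape(4, 3),
    dims=("time", "feature_id"),
    coords={"time": times, "feature_id": [101, 102, 103]},
    name="streamflow",
)

# Select all features at one time stamp by label.
snapshot = toy.sel(time="2017-06-01 02:00:00")
print(snapshot.values)
```

Selecting by label means you do not need to know the integer position of the time step in the array.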

### Read and plot data for entire time series at a specific location 
As an example, we pick the location with the largest streamflow at the time selected above.

In [None]:
%%time
imax = da.argmax().values
ds[var][:,imax].hvplot(grid=True)
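The same pattern (find the position of the maximum in a snapshot, then pull the full time series at that position) can be checked on a small synthetic array. The data below are invented for illustration. `isel` is shown as a name-based alternative to positional indexing, on the assumption that the feature dimension is called `feature_id`.

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2017-06-01 00:00", periods=4, freq=pd.Timedelta(hours=1))
toy = xr.DataArray(
    np.arange(12.0).reshape(4, 3),
    dims=("time", "feature_id"),
    coords={"time": times, "feature_id": [101, 102, 103]},
    name="streamflow",
)

# Snapshot at the first time step, then the position of its largest value.
snapshot = toy.sel(time=times[0])
imax = int(snapshot.argmax().values)

# Positional indexing, as in the cell above ...
series = toy[:, imax]
# ... and the equivalent selection by dimension name.
series_by_name = toy.isel(feature_id=imax)
print(imax, series.values)
```

Selecting by dimension name does not depend on the order of the dimensions, which can make the intent easier to read.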

### Compute mean discharge during June 2017 on all rivers

In [None]:
da = ds[var].sel(time=slice('2017-06-01 00:00', '2017-06-30 23:00'))

In [None]:
da

In [None]:
%%time
var_mean = da.mean(dim='time').compute()
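The time-window mean can be verified by hand on a small synthetic array. Note that label-based slices in Xarray include both endpoints. The array below is invented for illustration and is held in memory, so no `.compute()` call is needed. On the Dask-backed dataset, `.compute()` is what triggers the actual reads and reduction.

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2017-06-01 00:00", periods=4, freq=pd.Timedelta(hours=1))
toy = xr.DataArray(
    np.arange(12.0).reshape(4, 3),
    dims=("time", "feature_id"),
    coords={"time": times, "feature_id": [101, 102, 103]},
    name="streamflow",
)

# Both 01:00 and 02:00 are included in the selection.
subset = toy.sel(time=slice("2017-06-01 01:00", "2017-06-01 02:00"))

# Average over time, leaving one value per feature.
toy_mean = subset.mean(dim="time")
print(subset.sizes["time"], toy_mean.values)
```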

In [None]:
plot_nwm_field(var_mean, 'Mean Streamflow: June 2017')

In [None]:
client.close()