## FSSPEC download for NWM RouteLink file for developing topologic relationships
This notebook demonstrates accessing the topological definition used by the National Water Model (NWM) channel routing simulation. The methods applied here use Zarr (via kerchunk) and fsspec to retrieve the file header and then only the topology-defining fields: "link" and "to". Building the dataframe directly from these elements of the remote file saves a 200 MB download and takes considerably less time than downloading the full file and working from local storage.

The key here is to note which operations take a long time:
* The initial `SingleHdf5ToZarr` step is about 1 second
* The `.translate()` operation (inline in our example) is about 8 seconds
* Opening the dataset from the translated .json object is only a few milliseconds
* Reading the "link" and "to" fields into a pandas dataframe is 11 seconds

That last step would be a lot longer if all variables were downloaded.

In [2]:

# Suppress the output of the pip install for display sanity...
# If there are problems, be sure to uncomment and check the output!
!pip install fsspec kerchunk zarr xarray[complete]



In [3]:
import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

fs = fsspec.filesystem("http")

rl_nwm_url = "https://www.nco.ncep.noaa.gov/pmb/codes/nwprod/nwm.v3.0.13/parm/domain/RouteLink_CONUS.nc"
with fs.open(rl_nwm_url) as f:
    %time    rl_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=0).translate()

    # Key example here:
    # https://fsspec.github.io/kerchunk/test_example.html


CPU times: user 55.8 ms, sys: 21.1 ms, total: 76.9 ms
Wall time: 1.13 s


The `kerchunk`-ing example that we started with had a number of other parameters...
perhaps some might be reintroduced to make the data access even speedier!
e.g., ...
```py
fs = fsspec.filesystem('ftp', host="https://www.nco.ncep.noaa.gov/pmb")

with fs.open(rl_nwm_url, mode='rb', anon=True, default_fill_cache=False, default_cache_type='first') as f:
```
 ...

One thing that I specifically explored was the size of the `inline_threshold` setting. Smaller values definitely provided better results, though not a massive improvement: 9 seconds overall vs. 11 or so.
```py
    %time    rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url).translate() # 11.1 s
    %time    rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=30000).translate() # 11.3 s
    %time    rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=300).translate() # 11.2 s
    %time    rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=10).translate() # 11.3 s
    %time    rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=2).translate() # 9.8 s
    %time    rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=1).translate() # 9.85 s
    %time    rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=0).translate() # 9.83 s
    %time    rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=-1).translate() # 9.54 s
```
Inlining the `.translate()` call vs. splitting seemed to be about equal, with inlining having the additional advantage of omitting the unused intermediate output.
```py
    %time    rl_h5 = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=0)
    %time    rl_t = rl_h5.translate() # This translate MUST happen inside the context block
```
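For anyone repeating this comparison outside IPython, where `%time` is unavailable, a small harness along the following lines could loop over candidate `inline_threshold` values. This is only a sketch: `time_thresholds`, `translate_fn`, and `fake_translate` are hypothetical names, and the stand-in function replaces the real `SingleHdf5ToZarr(...).translate()` call so that the example runs without network access.

```python
import time


def time_thresholds(translate_fn, thresholds):
    """Time translate_fn(threshold) for each threshold; return {threshold: seconds}."""
    timings = {}
    for threshold in thresholds:
        start = time.perf_counter()
        translate_fn(threshold)
        timings[threshold] = time.perf_counter() - start
    return timings


# Hypothetical stand-in for the real call. In the notebook this would be
#   lambda t: SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=t).translate()
# and it would have to run inside the `with fs.open(...)` block.
def fake_translate(threshold):
    return {"version": 1, "refs": {}, "threshold": threshold}


timings = time_thresholds(fake_translate, [300, 10, 0])
for threshold, seconds in timings.items():
    print(f"inline_threshold={threshold}: {seconds:.6f} s")
```

Timings of a remote read vary with network conditions, so repeating each measurement a few times would give a fairer comparison than the single runs recorded above.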
    

In [4]:
backend_args = {
    "consolidated": False,
    "storage_options": {
        "fo": rl_t,
        # Adding these options returns a properly dimensioned but otherwise null dataframe
        # "remote_protocol": "https",
        # "remote_options": {'anon':True}
    },
}
%time ds = xr.open_dataset("reference://", engine="zarr", backend_kwargs=backend_args,)

CPU times: user 1.38 s, sys: 24.5 ms, total: 1.41 s
Wall time: 241 ms


In [50]:
ds.lat.shape

(2776734,)

In [53]:
ds

In [6]:
subslice = [
    "link",
    "to",
    "gages",
]
%time df = ds[subslice].to_dataframe().astype({"link": int, "to": int,})

CPU times: user 904 ms, sys: 362 ms, total: 1.27 s
Wall time: 17.2 s


In [7]:
tt = b'       02465000'

In [8]:
df[df["gages"]==tt]



| feature_id | link | to | gages | lat | lon |
|---|---|---|---|---|---|
| 1839848 | 18229923 | 18206548 | b' 02465000' | 33.208752 | -87.592606 |


## Create a topology
With the downloaded RouteLink data, we can generate the topology of the CONUS river network.

In [9]:
df = df.set_index("link")
df

| link | to | gages | lat | lon |
|---|---|---|---|---|
| 6635572 | 6635570 | b' ' | 46.228783 | -96.540199 |
| 6635590 | 6635600 | b' ' | 46.213486 | -96.530647 |
| 6635598 | 6635636 | b' ' | 46.201508 | -96.505341 |
| 6635622 | 6635620 | b' ' | 46.200523 | -96.615021 |
| 6635626 | 6635624 | b' ' | 46.195522 | -96.637161 |
| ... | ... | ... | ... | ... |
| 15456832 | 25371895 | b' ' | 44.979626 | -74.654648 |
| 25371895 | 0 | b' ' | 44.996113 | -74.648621 |
| 15448486 | 0 | b' ' | 44.994370 | -74.504646 |
| 25293410 | 0 | b' ' | 44.998062 | -74.673912 |


In [10]:
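# NOT used yet. Note that `nhd_io` is not imported anywhere in this notebook,
# so it must be imported before this function can be called.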
def build_mask(param_df, mask_file_path, mask_key):
    data_mask = nhd_io.read_mask(
        mask_file_path,
    )

    return param_df.filter(data_mask.iloc[:, [mask_key]], axis=0)

In [11]:
import nhd_network
from functools import partial


def replace_downstreams(data, downstream_col, terminal_code):
    ds0_mask = data[downstream_col] == terminal_code
    new_data = data.copy()
    new_data.loc[ds0_mask, downstream_col] = ds0_mask.index[ds0_mask]

    # Also set negative any nodes in downstream col not in data.index
    new_data.loc[~data[downstream_col].isin(data.index), downstream_col] *= -1
    return new_data


def organize_independent_networks(connections):
    rconn = nhd_network.reverse_network(connections)
    independent_networks = nhd_network.reachable_network(rconn)
    reaches_bytw = {}
    for tw, net in independent_networks.items():
        path_func = partial(nhd_network.split_at_junction, net)
        reaches_bytw[tw] = nhd_network.dfs_decomposition(net, path_func)

    return independent_networks, reaches_bytw, rconn

In [12]:
df = df.sort_index()
df = replace_downstreams(df, "to", 0)

In [13]:
connections = nhd_network.extract_connections(df, "to")

In [14]:
independent_networks, reaches_bytw, rconn = organize_independent_networks(connections)

In [31]:
list(rconn.keys())[3000]

85620

In [44]:
rconn[list(rconn.keys())[400000]]

[3935264]

### So, what?
At this point we have several objects representing the U.S. stream network (or another country or region, if you snuck a mask or a different base RouteLink file in there!).
* `connections` is a dictionary mapping each `link` to its `to` downstream link id. Every connection that
pointed to a null downstream (i.e., it flows off the map into the ocean or into an interior terminal basin) has
been adjusted to point to a dummy id that is the negative of the last valid segment id.
* `rconn` is the reverse of the `connections` dictionary. The `connections` DAG is strictly
coalescing, so each segment has a single downstream. In `rconn`, by contrast, the value list for a
junction contains one upstream `link` id for each incoming channel.
* `independent_networks` groups the `rconn` dictionary into sets of connections that are related
topologically to a single tailwater. THIS IS NOT ORDERED (except by whatever falls out of the original reversal of the `connections` dictionary).
* `reaches_bytw` is perhaps the most mysterious. For each tailwater, it is a topologically ordered list
of lists that, if traversed in order, guarantees that each leaf of the tailwater DAG is touched
before any junction downstream of it is traversed.
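
These objects can be illustrated on a toy network in plain Python. The sketch below does not use `nhd_network`: the helpers `reverse_network` and `reachable_from` are illustrative re-implementations written for this example (assumptions, not the library's code), and the segment ids are made up.

```python
# Toy "link -> to" table; 0 is the terminal code, as in RouteLink.
to_of = {1: 3, 2: 3, 3: 4, 4: 0, 10: 11, 11: 0}

# Mimic the effect of replace_downstreams: a segment that drains to the
# terminal code points at a dummy tailwater id, the negative of its own id.
connections = {link: [-link if to == 0 else to] for link, to in to_of.items()}


def reverse_network(conn):
    """Invert a {segment: [downstream ids]} dict into {segment: [upstream ids]}."""
    rconn = {}
    for src, dsts in conn.items():
        rconn.setdefault(src, [])  # headwaters keep an empty upstream list
        for dst in dsts:
            rconn.setdefault(dst, []).append(src)
    return rconn


def reachable_from(rconn, tailwater):
    """Collect every segment upstream of a tailwater, with its upstream list."""
    net, stack = {}, [tailwater]
    while stack:
        node = stack.pop()
        net[node] = rconn[node]
        stack.extend(rconn[node])
    return net


rconn = reverse_network(connections)
# Tailwaters are the dummy ids: present in rconn but never a source segment.
tailwaters = [node for node in rconn if node not in connections]
independent_networks = {tw: reachable_from(rconn, tw) for tw in tailwaters}

print(connections[4])  # [-4]
print(rconn[3])  # [1, 2]
print({tw: len(net) for tw, net in independent_networks.items()})  # {-4: 5, -11: 3}
```

Note that each network's length counts the dummy tailwater entry as well as the real segments.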

### Example:
We can run a simple script to find the networks of any given size. We choose a size of 8 so that we can diagram things more easily.

In [15]:
d = independent_networks
for k in sorted(d, key=lambda k: len(d[k]), reverse=True):
    if len(d[k]) == 8:
        print(k, len(d[k]))

-12058618 8
-23920468 8
-11206384 8
-19856634 8
-19856494 8
-10026014 8
-2161574 8
-19855962 8
-19855930 8
-3340794 8
-14678498 8
-19855772 8
-10680626 8
-10680610 8
-3340566 8
-13891828 8
-22673446 8
-20379679 8
-22673386 8
-946010074 8
-14415727 8
-11335517 8
-24311527 8
-10679724 8
-9500000 8
-11138376 8
-9499968 8
-9499946 8
-9499940 8
-17429694 8
-17495193 8
-11334709 8
-10482659 8
-20640658 8
-19854180 8
-2683730 8
-8188725 8
-21754632 8
-19854084 8
-25227996 8
-11334313 8
-9499220 8
-9499176 8
-20640206 8
-10678584 8
-2683140 8
-11202810 8
-10678522 8
-2683114 8
-10678500 8
-9629693 8
-16052090 8
-96988956 8
-9629309 8
-20311523 8
-24505642 8
-20245744 8
-16051220 8
-10349490 8
-20310811 8
-1174016 8
-20638204 8
-11332031 8
-9496702 8
-5302245 8
-7071696 8
-8316733 8
-12969788 8
-11396924 8
-20571637 8
-21095547 8
-17687535 8
-9888572 8
-21422878 8
-20636416 8
-10674854 8
-10674802 8
-22602315 8
-11854335 8
-26927610 8
-10674640 8
-11854049 8
-10477663 8
-20832324 8
-2547737 8
...

From the list, we choose a random network that has only 8 channel segments, -20427622.
If we examine the original dataframe, we can learn the lat-lon coordinates of our segment...

In [16]:
segment = -20427622
print(segment, len(independent_networks[segment]))

print(
    df.loc[-segment]
)  # remember, we need the -segment because we've labeled the tailwaters with a dummy downstream terminal value

-20427622 8
to                -20427622
gages    b'               '
lat               33.045074
lon             -112.275116
Name: 20427622, dtype: object


... and plot its position on a map:



In [None]:
from shapely.geometry import Point
import geopandas as gpd
import geodatasets  # assumption: installed separately (pip install geodatasets)

_df = df.loc[[-segment]]
_geometry = [Point(xy) for xy in zip(_df["lon"], _df["lat"])]
_gdf = gpd.GeoDataFrame(_df, geometry=_geometry, crs="EPSG:4326")

# geopandas.datasets was deprecated and then removed in GeoPandas 1.0, so
# gpd.datasets.get_path("naturalearth_lowres") raises an AttributeError there.
# The base map is taken from geodatasets instead; the original data is at
# https://www.naturalearthdata.com/downloads/110m-cultural-vectors/.
world = gpd.read_file(geodatasets.get_path("naturalearth.land"))
_gdf.plot(ax=world.plot(figsize=(10, 6)), marker="o", color="red", markersize=15)

Or, we can use a slightly more sophisticated map to discover that we have chosen an interior basin near Phoenix.

In [20]:
import folium

# create a map
this_map = folium.Map(prefer_canvas=True)


def plotDot(point):
    """Add a CircleMarker for one dataframe row to this_map.

    input: a Series that contains numeric fields named `lat` and `lon`"""
    folium.CircleMarker(location=[point.lat, point.lon], radius=8, weight=5).add_to(
        this_map
    )


# Select the tailwater segment (double brackets keep a one-row DataFrame), then
# use df.apply(, axis=1) to "iterate" through every row in the dataframe
_df = df.loc[[-segment]]
_df.apply(plotDot, axis=1)


# Set the zoom to the maximum possible
this_map.fit_bounds(this_map.get_bounds())

# Save the map to an HTML file
this_map.save("simple_dot_plot.html")

this_map

So with that context for our tiny little drainage, we look at the connections and rconn dictionary results.

In [None]:
print(f"{connections[-segment]} the last item in the DAG points to our tailwater")
print(
    f"{rconn[segment]} ... and the upstream looking connection from the tailwater points to the last element of the DAG"
)

[-20427622] the last item in the DAG points to our tailwater
[20427622] ... and the upstream looking connection from the tailwater points to the last element of the DAG


The `independent_networks` dictionary will show us which are the 8 segments in the DAG, each with their corresponding upstream neighbor or neighbors, which looks like the following for our example ...

```
independent_networks[segment]
{20427706: [20429532],
 20429540: [],
 20427622: [20427704, 20427706],
 20429612: [],
 20429616: [],
 20427704: [20429540],
 -20427622: [20427622],
 20429532: [20429612, 20429616]}
 ```

In [None]:
print(segment)
independent_networks[segment]

-20427622


{20427706: [20429532],
 20429540: [],
 20427622: [20427704, 20427706],
 20429612: [],
 20429616: [],
 20427704: [20429540],
 -20427622: [20427622],
 20429532: [20429612, 20429616]}

... and the `reaches_bytw` object gives the order of these in reaches between junctions, using DFS ordering reversed so that it starts at the leaves.

For our example, this looks like
```
reaches_bytw[-20427622]

[[20429540, 20427704],
 [20429612],
 [20429616],
 [20429532, 20427706],
 [20427622, -20427622]]

```

In [None]:
print(segment)
reaches_bytw[segment]

-20427622


[[20429540, 20427704],
 [20429612],
 [20429616],
 [20429532, 20427706],
 [20427622, -20427622]]

You'll have to look at this for a minute, but trust me, you can derive the following topological
map from those two pieces of information. (Technically, you would only need the `independent_networks`
information, but it's nice to get confirmation from the other object.)
```
upstream...

             20429612     20429616
                 ├────────────┘
 20429540    20429532
    │            │
 20427704    20427706
    ├────────────┘
 20427622
    │
-20427622

downstream...
```
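
One way to confirm that reading is to check the ordering property mechanically: every segment should appear only after all of its upstream neighbors. The sketch below copies the two outputs shown above into literals and tests that property. `is_upstream_first` is a hypothetical helper written for this illustration, not part of `nhd_network`.

```python
# independent_networks[segment], copied from the output above
net = {
    20427706: [20429532],
    20429540: [],
    20427622: [20427704, 20427706],
    20429612: [],
    20429616: [],
    20427704: [20429540],
    -20427622: [20427622],
    20429532: [20429612, 20429616],
}
# reaches_bytw[segment], copied from the output above
reaches = [
    [20429540, 20427704],
    [20429612],
    [20429616],
    [20429532, 20427706],
    [20427622, -20427622],
]


def is_upstream_first(net, reaches):
    """Return True if each segment is listed after all of its upstream segments."""
    seen = set()
    for reach in reaches:
        for seg in reach:
            if any(up not in seen for up in net[seg]):
                return False
            seen.add(seg)
    # Every segment in the network must also have been visited exactly once.
    return seen == set(net)


print(is_upstream_first(net, reaches))  # True
print(is_upstream_first(net, reaches[::-1]))  # False
```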

We can chain the segment IDs together -- this can be useful for instance if we want to query all IDs in a given basin.

In [None]:
from itertools import chain

list(chain(*reaches_bytw[segment]))

[20429540,
 20427704,
 20429612,
 20429616,
 20429532,
 20427706,
 20427622,
 -20427622]
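
Because the dummy tailwater id is negative and is not a `link` in the dataframe index, it usually needs to be dropped before the chained ids are used for a lookup. A minimal sketch, with the `reaches` literal copied from the output above and `basin_ids` as an illustrative name:

```python
from itertools import chain

# reaches_bytw[segment], copied from the output above
reaches = [
    [20429540, 20427704],
    [20429612],
    [20429616],
    [20429532, 20427706],
    [20427622, -20427622],
]

# Flatten the list of reaches and keep only real (positive) segment ids.
basin_ids = [seg for seg in chain.from_iterable(reaches) if seg > 0]

print(len(basin_ids))  # 7
print(basin_ids[-1])  # 20427622
```

In the notebook, `df.loc[basin_ids]` should then return the rows for the seven real segments of this basin, since `df` is indexed by `link` at this point.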