# Complementary extra codes: Group basins, find nested catchments and number of gauges upstream

Author: Thiago Nascimento (thiago.nascimento@eawag.ch)

This notebook complements the EStreams publication. The code has three parts: it first assigns each catchment to a group based on its connectivity (e.g., Rhine, Danube); it then lists all nested catchments within each catchment; and finally it computes the number of gauges upstream of each catchment outlet.

* Note that this code enables not only the reproduction of the current database but also its extension to new catchment areas.
* Additionally, the user should download the original raw data and place it in the folder of the same name before running this code.
* The original third-party data used were not made available in this repository due to redistribution and storage-space constraints.

## Requirements
**Python:**

* Python>=3.6
* Jupyter
* geopandas=0.10.2
* numpy
* os
* pandas
* shapely
* networkx
* tqdm

Check the GitHub repository for an environment.yml (for conda environments) or requirements.txt (pip) file.

**Files:**

* results/estreams_catchments.shp 
* results/estreams_gauging_stations_duplicates.csv

**Directory:**

* Clone the GitHub directory locally
* Place any third-party data in their respective directories.
* ONLY update the "PATH" variable in the section "Configurations", setting it to the relative path to the EStreams directory.

# Import modules

In [None]:
import pandas as pd
import numpy as np
import tqdm
import os
import geopandas as gpd
import networkx as nx
from shapely.geometry import Polygon, Point
import time

# Configurations

In [None]:
# Only editable variable:
# Relative path to your local directory
PATH = "../../.."

* #### Users should NOT change anything in the code below this point.

In [None]:
# Non-editable variables:
PATH_OUTPUT = "results/"

# Set the directory:
os.chdir(PATH)

# Import data
## Catchment boundaries

In [None]:
catchment_boundaries = gpd.read_file('results/estreams_catchments.shp')
catchment_boundaries

## Network information

In [None]:
network_EU = pd.read_csv('results/estreams_gauging_stations_duplicates.csv', encoding='utf-8')
network_EU.set_index("basin_id", inplace = True)
network_EU

## Subset of the catchments to be used

In [None]:
# Work on a copy so that columns added later do not modify the original GeoDataFrame
catchments = catchment_boundaries.iloc[:, :].copy()

network = network_EU.copy()
catchments

## Make a buffer around the catchments
* We can either make the buffer here, or load an already buffered version (made using QGIS), which is faster.
* Buffering in Python may take a considerable while. Interestingly, if one makes the buffer first for a subset and then for the complete list, it processes faster.

In [None]:
# First we assign a tolerance to handle catchments whose delineation falls
# slightly outside the enclosing catchment.
# The tolerance is expressed in the units of the layer's CRS.
# This code may take a while.

start_time = time.time()
tolerance = 0.01
catchments_buffer = catchments.copy()
catchments_buffer['geometry'] = catchments['geometry'].buffer(tolerance)
end_time = time.time()

# Print the total time elapsed
print("Elapsed time: {:.1f} seconds".format(end_time - start_time))
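
The effect of the tolerance can be illustrated with a minimal sketch. The two polygons below are hypothetical toy geometries (not EStreams data): the sub-catchment pokes 0.005 units outside its parent, so a strict `within` test fails, while the same test against the buffered parent succeeds.

```python
from shapely.geometry import Polygon

# Hypothetical toy geometries: a parent catchment and a sub-catchment whose
# delineation extends 0.005 units below the parent's southern boundary.
parent = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])
child = Polygon([(2, -0.005), (4, -0.005), (4, 3), (2, 3)])

tolerance = 0.01  # same units as the coordinates

# Strict test: fails because part of the child lies outside the parent
strict = child.within(parent)

# Relaxed test: the buffered parent absorbs the small delineation mismatch
relaxed = child.within(parent.buffer(tolerance))

print(strict, relaxed)
```

The tolerance should stay small relative to the catchment sizes, otherwise neighbouring catchments that are not nested could be matched as well.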

# Processing
## Nested catchments groups

* First we classify the catchments according to whether they can be nested within one another.
* At the end we have groups (main watersheds) to which each sub-catchment is assigned.
* For example, watershed_group == 1 corresponds to the Rhine.

In [None]:
# Nested catchments:
# Initialize an empty list to store nested catchments
nested_catchments = []

# Iterate over each catchment
for index, catchment in tqdm.tqdm(catchments.iterrows()):
    # Get the geometry of the current catchment
    geom = catchment['geometry']
    
    # Iterate over other catchments to check if they are nested
    for index2, other_catchment in catchments_buffer.iterrows():
        # Skip the same catchment
        if index == index2:
            continue
        
        other_geom = other_catchment['geometry']
        
        # Check if the current catchment is completely within the other catchment
        if geom.within(other_geom):
            nested_catchments.append((catchment.basin_id, other_catchment.basin_id))
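
The double loop above compares every catchment with every other one, which grows quadratically with the number of catchments. As a possible alternative, the same pairs can be obtained with a spatial join, which uses a spatial index internally. The sketch below runs on hypothetical toy data and assumes a geopandas version that accepts the `predicate` keyword (0.10 or later); it is not the code used to build the published tables.

```python
import geopandas as gpd
from shapely.geometry import box

# Hypothetical toy catchments: B lies inside A, C is isolated
toy = gpd.GeoDataFrame(
    {"basin_id": ["A", "B", "C"]},
    geometry=[box(0, 0, 10, 10), box(1, 1, 3, 3), box(20, 20, 22, 22)],
)

# Buffered copy, as in the notebook
toy_buffer = toy.copy()
toy_buffer["geometry"] = toy.geometry.buffer(0.01)

# Keep the (left, right) pairs where the left geometry is within the right one
pairs = gpd.sjoin(toy, toy_buffer, how="inner", predicate="within")

# Every catchment is within its own buffer, so drop the self-matches
pairs = pairs[pairs["basin_id_left"] != pairs["basin_id_right"]]

nested_pairs = list(zip(pairs["basin_id_left"], pairs["basin_id_right"]))
print(nested_pairs)
```

Each tuple has the same orientation as in the loop: the first element is the catchment lying within the second.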

In [None]:
# Create the big-groups (main watershed):
# Initialize an empty graph
G = nx.Graph()

# Add nodes for each catchment
for index, catchment in catchments.iterrows():
    G.add_node(catchment['basin_id'])

# Add edges for nested catchments
for nested_pair in nested_catchments:
    G.add_edge(nested_pair[0], nested_pair[1])

# Find connected components
groups = list(nx.connected_components(G))

# Assign groups to catchments
group_assignment = {}
for i, group in enumerate(groups):
    for catchment_id in group:
        group_assignment[catchment_id] = i + 1  # Assigning group numbers starting from 1

# Update the catchments GeoDataFrame with the group assignments
catchments['watershed_group'] = catchments['basin_id'].map(group_assignment)
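
How the graph step turns nested pairs into groups can be seen on a small example. The basin identifiers and pairs below are hypothetical; catchments connected through any chain of nesting end up in the same group, and a catchment with no nesting relation forms a group of its own.

```python
import networkx as nx

# Hypothetical basin identifiers and nested pairs (inner, outer)
basin_ids = ["A", "B", "C", "D", "E", "F"]
nested_pairs = [("B", "A"), ("C", "A"), ("E", "D")]

# One node per catchment, one edge per nesting relation
G = nx.Graph()
G.add_nodes_from(basin_ids)
G.add_edges_from(nested_pairs)

# Each connected component is one watershed group, numbered from 1
group_assignment = {}
for i, group in enumerate(nx.connected_components(G), start=1):
    for basin_id in group:
        group_assignment[basin_id] = i

print(group_assignment)
```

Here A, B and C share one group, D and E another, and F stays alone.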

In [None]:
catchments.head(5)

In [None]:
catchments[catchments.watershed_group == 1]

In [None]:
nested_catchments_df = pd.DataFrame(nested_catchments)
nested_catchments_df.columns = ["catchment_1", "catchment_2"]
nested_catchments_df

In [None]:
# Writing .xlsx files with pandas requires the openpyxl package
nested_catchments_df.to_excel("results/extras/nested_catchments_assignment_one2one.xlsx")

## Nested catchments within 
* Here we provide the list of nested catchments within each catchment. 

In [None]:
# Create Point geometries from the snapped gauge coordinates:
geometry = [Point(lon, lat) for lon, lat in zip(network['lon_snap'], network['lat_snap'])]

# Create a GeoDataFrame
network = gpd.GeoDataFrame(network, geometry=geometry)

# Set the coordinate reference system (CRS)
# Here the coordinates are taken to be in WGS84 (EPSG:4326)
network.crs = 'EPSG:4326'

In [None]:
# List to store the results
catchments_nested = []

# Iterate through each catchment's geometry
for i, catchment in tqdm.tqdm(catchments.iterrows()):
    # Find the network points located within the current catchment's geometry
    network_in_catchment = network[network.within(catchment.geometry)]

    # Get the indices (basin_id) of the network points within the current catchment's geometry
    indices = network_in_catchment.index.tolist()

    # Append the list of indices to the results list
    catchments_nested.append(indices)

In [None]:
# Convert the list of lists to a pandas DataFrame
catchments_nested_df = pd.DataFrame({'nested_catchments': catchments_nested})

# Set the index of the DataFrame to be the index of the catchments GeoDataFrame
catchments_nested_df.index = catchments.basin_id

# Check each row and replace empty lists with the index value.
# An empty list may occur when the outlet coordinates fall slightly outside the shapefile.
for index, row in catchments_nested_df.iterrows():
    if not row['nested_catchments']:
        catchments_nested_df.at[index, 'nested_catchments'] = [index]  # Replace the empty list with the index as a list
          
catchments_nested_df

In [None]:
# Here we make sure that the outlet is within the list:
# Ensure that the basin_id is in the nested_catchments
for basin_id in catchments_nested_df.index:
    if basin_id not in catchments_nested_df.at[basin_id, 'nested_catchments']:
        catchments_nested_df.at[basin_id, 'nested_catchments'].append(basin_id)
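
The two safeguards above matter because a point lying exactly on a polygon boundary, or slightly outside it, is not `within` the polygon. A minimal sketch with hypothetical gauges and catchments shows the point-in-polygon search together with the fallback that adds the outlet itself:

```python
from shapely.geometry import Point, box

# Hypothetical gauges; G1 sits exactly on the boundary of its own catchment
gauges = {
    "G1": Point(5, 0),
    "G2": Point(5, 5),
    "G3": Point(20, 20),
}

# Hypothetical catchment polygons, keyed by the basin_id of their outlet gauge
catchment_polygons = {
    "G1": box(0, 0, 10, 10),
    "G2": box(4, 4, 6, 6),
}

nested = {}
for basin_id, polygon in catchment_polygons.items():
    # Gauges strictly inside the catchment polygon
    inside = [gid for gid, point in gauges.items() if point.within(polygon)]

    # Make sure the outlet itself is always part of the list
    if basin_id not in inside:
        inside.append(basin_id)

    nested[basin_id] = inside

print(nested)
```

Without the fallback, the list for `G1` would miss its own outlet even though the catchment is defined by it.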

In [None]:
network.loc[catchments_nested_df.loc["AT000001", "nested_catchments"]]

In [None]:
catchments_nested_df.to_csv("results/extras/estreams_gauging_stations_nested_catchments.csv")

## Number of unique gauges upstream
* Here we compute the number of gauges upstream.
* A headwater catchment gets the value 1, while a downstream catchment with two gauges inside it (not counting the outlet) gets the value 3.
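
The counting convention can be sketched on a small table of (catchment, gauge) pairs, ignoring duplicates for the moment. The identifiers are hypothetical: `D` is a downstream catchment containing its own outlet and two further gauges, and `H` is a headwater catchment containing only its outlet.

```python
import pandas as pd

# Hypothetical result of a spatial join: one row per gauge found in a catchment
pairs = pd.DataFrame({
    "basin_id_left": ["D", "D", "D", "H"],     # catchment
    "basin_id_right": ["D", "U1", "U2", "H"],  # gauge located inside it
})

# Exclude the outlet itself, then count the remaining gauges per catchment
upstream = pairs[pairs["basin_id_left"] != pairs["basin_id_right"]]
counts = upstream["basin_id_left"].value_counts() + 1  # add the outlet back

# Catchments without any other gauge are absent from the counts: set them to 1
gauges_upstream = counts.reindex(["D", "H"]).fillna(1).astype(int)
print(gauges_upstream)
```

Excluding the outlet first and adding 1 afterwards gives the same result whether or not the outlet point falls inside its own polygon.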

In [None]:
# Assign the index to the shapefile:
catchments.set_index("basin_id", inplace = True)

# Keep one field with the same name:
catchments["basin_id"] = catchments.index

In [None]:
# Create one field with the same name as the index:
network["basin_id"] = network.index
network

In [None]:
# Create Point geometries from the snapped gauge coordinates:
geometry = [Point(lon, lat) for lon, lat in zip(network['lon_snap'], network['lat_snap'])]

# Create a GeoDataFrame
network = gpd.GeoDataFrame(network, geometry=geometry)

# Set the coordinate reference system (CRS)
# Here the coordinates are taken to be in WGS84 (EPSG:4326)
network.crs = 'EPSG:4326'

### Apply the count taking into account some filters:
**Points to pay attention to:**

* The outlet is occasionally located slightly outside the shapefile.
* The catchment outlet may have a duplicate within the shapefile.
* Gauges within the shapefile may also have duplicates.

**Solution:**

* We exclude the outlet from the count, and add 1 to the count at the end for all catchments.
* We filter out the gauges within the catchment shapefile that are listed as suspected duplicates (duplicated_suspect) of the catchment outlet.
* We count the number of duplicates n. When n is even, we simply divide it by 2 and subtract the result at the end: count = count - (n/2). If n is odd, we do count = count - ((n - 1)/2 + 1). The reason is that two duplicates refer to each other, so removing both of them would delete the pair entirely.
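
The outlet-duplicate filter can be sketched on a few rows. The table below is hypothetical: the `duplicated_suspect` column holds a comma-separated string of basin identifiers (or is missing), and gauge `U2` is flagged as a suspected duplicate of the outlet `D`, so it is dropped before counting.

```python
import pandas as pd

def parse_duplicated_suspect(suspect):
    """Split the comma-separated string of suspects into a list of basin_ids."""
    if pd.isna(suspect):
        return []
    return suspect.split(", ")

# Hypothetical gauges found inside catchment D (outlet already excluded)
rows = pd.DataFrame({
    "basin_id_left": ["D", "D", "D"],
    "basin_id_right": ["U1", "U2", "U3"],
    "duplicated_suspect": [None, "D", "X1, X2"],
})

rows["duplicated_suspect_ids"] = rows["duplicated_suspect"].apply(parse_duplicated_suspect)

# Keep a gauge only if the catchment outlet is not among its suspected duplicates
keep = rows.apply(
    lambda row: row["basin_id_left"] not in row["duplicated_suspect_ids"], axis=1
)
kept = rows[keep]

print(kept["basin_id_right"].tolist())
```

Gauge `U3` is kept because its suspected duplicates are other gauges, not the outlet; such cases are handled by the parity adjustment instead.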

In [None]:
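# Spatial join to count geometries within the catchments shapefile
# ("predicate" replaces the deprecated "op" keyword from geopandas 0.10 onwards)
joined = gpd.sjoin(catchments, network, how='inner', predicate='intersects')

# Exclude geometries with the same "basin_id" as in the network GeoDataFrame (exclude the outlet).
# The copy avoids a SettingWithCopyWarning when adding a column below:
joined_filtered = joined[joined['basin_id_left'] != joined['basin_id_right']].copy()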

# Here we create a function to deal with the duplicates of the outlet when they happen to be within:
# Parse the "duplicated_suspect" column to extract individual basin_ids
def parse_duplicated_suspect(suspect):
    if pd.isna(suspect):
        return []
    else:
        return suspect.split(', ')

joined_filtered['duplicated_suspect_ids'] = joined_filtered['duplicated_suspect'].apply(parse_duplicated_suspect)

# Exclude basin IDs from the count when there are duplicated suspects
def exclude_duplicated_suspects(row):
    if len(row['duplicated_suspect_ids']) > 0:
        return row['basin_id_left'] not in row['duplicated_suspect_ids']
    else:
        return True

joined_filtered = joined_filtered[joined_filtered.apply(exclude_duplicated_suspects, axis=1)]

# Count the number of geometries for each unique "basin_id" in the catchments shapefile
count_per_basin = joined_filtered['basin_id_left'].value_counts()

# Count the number of non-null values in the "duplicated_suspect" column for each basin ID
duplicates_count = joined_filtered.groupby('basin_id_left')['duplicated_suspect'].count()

# Adjust the count based on the number of duplicates within each catchment
for basin_id, count in duplicates_count.items():
    if count % 2 == 0:
        count_per_basin[basin_id] -= count // 2
    else:
        count_per_basin[basin_id] -= (count - 1) // 2
        count_per_basin[basin_id] += 1

# Here we add 1 station to include the outlet
count_per_basin += 1

network["gauges_upstream"] = np.nan      
network["gauges_upstream"] = count_per_basin

# Catchments with no other gauge inside them get NaN; set them to 1 (the outlet only):
network['gauges_upstream'] = network['gauges_upstream'].fillna(1)

network.head(10)

## Assign the new values to the network:

In [None]:
network_EU['watershed_group'] = catchments['watershed_group']
network_EU

In [None]:
network_EU['gauges_upstream'] = network['gauges_upstream'].astype(int)
network_EU

In [None]:
network_EU[network_EU.watershed_group == 1]

In [None]:
network_EU['nested_catchments'] = catchments_nested_df['nested_catchments']
network_EU

## Save the data

In [None]:
# Save the dataframe:
network_EU.to_csv('results/extras/estreams_gauging_stations_nested.csv',  encoding='utf-8')
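
Since the `nested_catchments` column holds Python lists, `to_csv` writes them as their string representation. When the file is read back, the lists can be restored with a converter. The sketch below uses a hypothetical two-row table and an in-memory buffer instead of the real file:

```python
import ast
import io

import pandas as pd

# Hypothetical table with a list-valued column, indexed by basin_id
df = pd.DataFrame(
    {"nested_catchments": [["A", "B"], ["B"]]},
    index=pd.Index(["A", "B"], name="basin_id"),
)

# Write to an in-memory buffer (stands in for the CSV file on disk)
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)

# Read back, turning the stored strings into lists again
restored = pd.read_csv(
    buffer,
    index_col="basin_id",
    converters={"nested_catchments": ast.literal_eval},
)

print(restored.loc["A", "nested_catchments"])
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this purpose.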

## End