# Complementary extra codes: List unique catchments

Author: Thiago Nascimento (thiago.nascimento@eawag.ch)

This notebook complements the EStreams publication and can be used to filter potentially duplicated catchments within the dataset. It takes the 'estreams_gauging_stations.csv' file as input and returns a final list containing only _unique_ catchments, keeping the one with the _longest_ time series of records when duplicates exist.

* Note that this code enables not only the replicability of the current database but also the extrapolation to new catchment areas. 
* Additionally, the user should download the original raw data and place it in the folder of the same name before running this code.
* The original third-party data used were not made available in this repository due to redistribution and storage-space reasons.  

## Requirements
**Python:**

* ast
* Python>=3.6
* Jupyter
* os
* pandas
* warnings

Check the GitHub repository for an environment.yml (for conda environments) or requirements.txt (pip) file.

**Files:**

* data/streamflow/estreams_gauging_stations.csv

**Directory:**

* Clone the GitHub repository locally.
* Place any third-party data in its respective directory.
* ONLY update the "PATH" variable in the section "Configurations" with the relative path to your local EStreams directory.

## Observations
- Users may not want the gauge with the longest time series, but rather the most recent one, for example. In that case, they would need to adjust the selection criterion in the current code.
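
As a minimal sketch of such an adjustment, the snippet below picks the gauge whose record ends latest instead of the one with the most years. The toy table, the basin IDs `A1`/`A2`, and the helper `pick_most_recent` are hypothetical and only illustrate the idea; they are not part of the EStreams code.

```python
import pandas as pd

# Hypothetical toy metadata; the real table has many more columns.
toy = pd.DataFrame(
    {
        "end_date": pd.to_datetime(["1991-12-31", "2020-12-31"]),
        "num_years": [42, 16],
    },
    index=pd.Index(["A1", "A2"], name="basin_id"),
)


def pick_most_recent(df, group):
    """Return the basin_id in `group` whose record ends latest."""
    # Ignore IDs that are not present in the table.
    present = [basin_id for basin_id in group if basin_id in df.index]
    return df.loc[present, "end_date"].idxmax()


# A1 has the longer record, but A2 is the more recent one.
most_recent = pick_most_recent(toy, ["A1", "A2"])
longest = toy["num_years"].idxmax()
```

In the filtering loop further down, the comparison on `num_years` would be swapped for a comparison on `end_date` to get the same effect.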

# Import modules

In [None]:
import pandas as pd
import os
import warnings
import ast


# Configurations

In [None]:
# Relative path to your local directory
PATH = "../../.."
#PATH = r"/Users/thiagomedeirosdonascimento/Library/CloudStorage/OneDrive-Personal/PhD/Eawag/Papers/Paper1_Database/Database/EStreams/"

# Set the directory:
os.chdir(PATH)

warnings.simplefilter(action='ignore', category=Warning)

# Import data

### - Network information

In [None]:
network_estreams = pd.read_csv('data/streamflow/estreams_gauging_stations.csv', encoding='utf-8')
network_estreams.set_index("basin_id", inplace = True)

# Convert 'start_date' and 'end_date' to datetime
network_estreams['start_date'] = pd.to_datetime(network_estreams['start_date'])
network_estreams['end_date'] = pd.to_datetime(network_estreams['end_date'])

# Parse the duplicated_suspect and nested_catchments columns from strings into Python lists.
# Missing values in duplicated_suspect are left untouched (no chained assignment needed):
network_estreams["duplicated_suspect"] = network_estreams["duplicated_suspect"].apply(
    lambda value: ast.literal_eval(value) if isinstance(value, str) else value)
network_estreams["nested_catchments"] = network_estreams["nested_catchments"].apply(ast.literal_eval)

network_estreams.head()

- Duplicated suspects deletion
    - In this step, whenever a catchment has duplicated suspects, we keep only the catchment with the longest time series.
    - For example, FR001479 has 23 years of measurements, from 1969 to 1991, and has two duplicated suspects: [FR001477, FR001478].
    - After filtering, we aim to keep only FR001479 in our final list, since it has the longest record of the three.
    - This reduces the number of potential duplicates in the final set of time series.
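
The FR001479 example can be sketched on a toy table. This is a condensed illustration of the same idea, not the notebook's exact implementation: the helper `keep_longest`, the extra gauge `XX000001`, and every `num_years` value except the 23 years of FR001479 are invented for the example.

```python
import pandas as pd

# Toy metadata: only FR001479's 23 years comes from the text; other values are invented.
toy = pd.DataFrame(
    {
        "num_years": [10, 15, 23, 40],
        "duplicated_suspect": [
            ["FR001478", "FR001479"],
            ["FR001477", "FR001479"],
            ["FR001477", "FR001478"],
            None,  # gauge without suspected duplicates
        ],
    },
    index=pd.Index(["FR001477", "FR001478", "FR001479", "XX000001"], name="basin_id"),
)


def keep_longest(df):
    """Keep one gauge per duplicate group: the one with the largest num_years."""
    seen, keep = set(), []
    for basin_id, suspects in df["duplicated_suspect"].items():
        if not isinstance(suspects, list):
            keep.append(basin_id)  # no suspected duplicates, always kept
            continue
        if basin_id in seen:
            continue  # already handled as part of an earlier group
        group = [basin_id] + [s for s in suspects if s in df.index and s not in seen]
        keep.append(df.loc[group, "num_years"].idxmax())
        seen.update([basin_id, *suspects])
    return df.loc[keep]


unique = keep_longest(toy)
```

Here `unique` retains FR001479 from the group of three, plus the gauge that had no suspects.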

In [None]:
# Here we make a copy of the original network metadata 
network_estreams_filtered = network_estreams.copy()

In [None]:
# Step 1: Filter rows where `duplicated_suspect` is not NaN
filtered_df = network_estreams_filtered[network_estreams_filtered['duplicated_suspect'].notna()]

# Step 2: Create a dictionary to store the maximum `num_years` for each group (current row and corresponding row(s))
max_num_years_dict = {}
processed_indices = set()  # Set to keep track of processed indices

# Iterate through each row in the filtered DataFrame
for index, row in filtered_df.iterrows():
    # Check if the current index has already been processed
    if index in processed_indices:
        continue  # Skip processing this row

    # Get the `duplicated_suspect` values, assuming it might be a list or a single string index
    duplicate_indices = row['duplicated_suspect']

    # If `duplicated_suspect` is a string, convert it to a list of strings and strip whitespace
    if isinstance(duplicate_indices, str):
        duplicate_indices = [dup.strip() for dup in duplicate_indices.split(',')]  # Split and strip whitespace

    # Initialize the maximum `num_years` as the current row's `num_years`
    max_num_years = row['num_years']
    max_index = index  # Start with the current index as the max index

    # Compare `num_years` of the current row with each duplicate index
    for dup_index in duplicate_indices:
        # Check if the duplicate index has already been processed
        if dup_index in processed_indices:
            continue  # Skip processing this duplicate index

        try:
            # Get `num_years` of the duplicate row
            num_years_duplicate = network_estreams_filtered.loc[dup_index, 'num_years']

            # Compare the `num_years` and update max values if necessary
            if num_years_duplicate > max_num_years:
                max_num_years = num_years_duplicate
                max_index = dup_index  # Update max index
        except KeyError:
            # Handle KeyError if the duplicate index is not found in the DataFrame
            continue

    # Store the maximum `num_years` and corresponding index in the dictionary
    max_num_years_dict[max_index] = max_num_years

    # Add the indices to the processed set
    processed_indices.add(index)
    for dup_index in duplicate_indices:
        processed_indices.add(dup_index)

# Step 3: Filter the DataFrame to keep only the rows with the indices in max_num_years_dict keys
result_df = network_estreams_filtered.loc[list(max_num_years_dict.keys())]

# Step 4: Get the indices of the rows in `result_df`
result_df_indices = set(result_df.index)

# Step 5: Get the indices of rows without duplicates (where `duplicated_suspect` is NaN)
no_duplicates_indices = set(network_estreams_filtered[network_estreams_filtered['duplicated_suspect'].isna()].index)

# Step 6: Combine the indices from `result_df` and rows without duplicates
indices_to_keep = list(result_df_indices.union(no_duplicates_indices))

# Step 7: Filter `network_estreams_filtered` using the combined indices
network_estreams_filtered = network_estreams_filtered.loc[indices_to_keep]

### - Check the results

In [None]:
network_estreams_filtered

In [None]:
# Check the number of "unique" gauges
len(network_estreams_filtered)
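
One possible sanity check, sketched here under the assumption that `duplicated_suspect` holds Python lists after parsing, is to look for kept gauges that still list another kept gauge as a suspect. The helper `remaining_duplicate_pairs` and the toy IDs are hypothetical.

```python
import pandas as pd


def remaining_duplicate_pairs(df):
    """Return (kept_id, suspect_id) pairs where both gauges are still in df."""
    pairs = []
    for basin_id, suspects in df["duplicated_suspect"].items():
        if isinstance(suspects, list):
            pairs.extend((basin_id, s) for s in suspects if s in df.index)
    return pairs


# Toy table: A still points at B (both kept); C points at Z, which was removed.
toy = pd.DataFrame(
    {"duplicated_suspect": [["B"], None, ["Z"]]},
    index=pd.Index(["A", "B", "C"], name="basin_id"),
)
leftover = remaining_duplicate_pairs(toy)
```

On the real table this would be called as `remaining_duplicate_pairs(network_estreams_filtered)`. A non-empty result does not necessarily mean an error; it can also occur when suspect relations are not listed symmetrically in the metadata.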

In [None]:
# Export the metadata list (create the output folder first if it does not exist)
os.makedirs("results/extras", exist_ok=True)
network_estreams_filtered.to_csv("results/extras/estreams_attributes.csv", encoding='utf-8')

## End