# BAFU dataset extraction

Author: Thiago Nascimento (thiago.nascimento@eawag.ch)

This notebook retrieves and concatenates the NADUF dataset. The samples have varying time resolutions: they are not necessarily daily or hourly, but were collected in different campaigns.

The output is one file per catchment (similar to CAMELS_CH), with 44 columns:

* mean_discharge(m3/s)
* total_discharge(Mio m3)
* temperature_BAFU(°C)	
* pH_BAFU()	
* conductivity_25C_BAFU(µS/cm)	
* oxygen(mg/l)	
* oxygen_saturation(%)	
* pH_lab()	
* conductivity_20C_lab(µS/cm)
* total_hardness(mmol/l)
* alkalinity(mmol/l)	
* calcium(mg/l)	
* magnesium(mg/l)	
* nitrate(mg N/l)	
* total_nitrogen(mg N/l)	
* DRP(mg P/l)	
* total_phosphorus(mg P/l)	
* total_phosphorus_filtered(mg P/l)	
* chloride(mg/l)	
* fluoride(mg/l)	
* bromide(mg/l)	
* silicate(mg H4SiO4/l)	
* sulphate(mg SO4/l)	
* sodium(mg/l)	
* potassium(mg/l)	
* iron(mg/l)	
* TOC(mg C/l)	
* DOC(mg C/l)	
* suspended_material(mg/l)	
* chromium(µg/l)	
* zinc(µg/l)	
* copper(µg/l)	
* cadmium(µg/l)	
* lead(µg/l)	
* nickel(µg/l)	
* mercury(µg/l)	
* barium(µg/l)	
* strontium(µg/l)	
* arsenic(µg/l)	
* manganese(µg/l)	
* NP(µg/l)	
* NP1EO(µg/l)	
* NP2EO(µg/l)	
* NP3EO(µg/l)
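
As a rough illustration of how one of the per-catchment files could be read back, the sketch below loads a small in-memory stand-in for an output file. The three columns and their values are a hypothetical excerpt, not real output; the Latin-1 encoding and the `date` index mirror the export step at the end of this notebook.

```python
import io

import pandas as pd

# Hypothetical two-row excerpt standing in for one per-catchment output file
csv_text = (
    "date,mean_discharge(m3/s),temperature(°C),zinc(µg/l)\n"
    "1982-11-15 06:00:00,5.3139,7.271,10.4067\n"
    "1982-11-29 06:00:00,9.0462,4.8267,\n"
)

# Encode as Latin-1 to mimic the encoding used when the files are written
buffer = io.BytesIO(csv_text.encode("latin-1"))

# Use the date column as a parsed DatetimeIndex
station = pd.read_csv(buffer, encoding="latin-1", index_col="date", parse_dates=True)
print(station.dtypes)
```

With a real file, `buffer` would simply be replaced by the path to the CSV.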

## Requirements
**Python:**

* Python>=3.6
* Jupyter
* geopandas=0.10.2
* numpy
* os
* pandas=2.1.3
* scipy=1.9.0
* tqdm

Check the GitHub repository for an environment.yml (for conda environments) or requirements.txt (pip) file.

**Files:**

* 


**Directory:**

* Clone the GitHub directory locally
* Place any third-party data in its respective directory.
* ONLY update the "PATH" variable in the "Configurations" section with the relative path to the EStreams directory.


## References
* 
## Observations
* 

# Import modules

In [1]:
import pandas as pd
import geopandas as gpd
import numpy as np
import tqdm
import os
import glob
import warnings
import re

# Configurations

In [2]:
# Only editable variables:
# Relative path to your local directory
PATH = ".."
# Suppress all warnings
warnings.filterwarnings("ignore")

* #### The users should NOT change anything in the code below here. 

In [3]:
# Non-editable variables:
PATH_OUTPUT = "results/interval_samples/"

# Set the directory:
os.chdir(PATH)

In [4]:
os.getcwd()

'c:\\Users\\nascimth\\Documents\\Thiago\\Eawag\\Python\\Scripts\\CAMELS_CH_chem'
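
The export step later writes into a `NADUF` subfolder of `PATH_OUTPUT`, and `to_csv` does not create missing directories. A minimal sketch of creating that folder beforehand is shown below; `project_root` is a hypothetical temporary directory used only so the example can run anywhere, whereas in the notebook the working directory is already set by `os.chdir(PATH)`.

```python
import os
import tempfile

# Hypothetical stand-in for the project root (the notebook uses os.chdir(PATH) instead)
project_root = tempfile.mkdtemp()

PATH_OUTPUT = "results/interval_samples/"
output_dir = os.path.join(project_root, PATH_OUTPUT, "NADUF")

# exist_ok=True makes the call safe to repeat on later runs
os.makedirs(output_dir, exist_ok=True)
print(os.path.isdir(output_dir))
```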

# Import data
* FULL dataset

In [5]:
# Full dataset of interval (time-series)
dataset_naduf = pd.read_excel(r"data/NADUF/naduf_data_1981-2020_v5.xlsx")
dataset_naduf

Unnamed: 0,naduf_id,status_number,remark,year,date_end,duration,mean_discharge,total_discharge,temperature_BAFU,pH_BAFU,...,zinc,copper,cadmium,lead,nickel,mercury,barium,strontium,arsenic,manganese
0,1181,1,,1982,1982-11-15 06:00:00,336.000000,5.313851,6.427634,7.271006,,...,10.406677,3.019255,0.073556,1.790373,,,,,,
1,1181,1,,1982,1982-11-29 06:00:00,336.000000,9.046227,10.942316,4.826679,,...,,,,,,,,,,
2,1181,1,,1982,1982-12-13 05:30:00,335.000000,10.864181,13.102202,4.872490,,...,,,,,,,,,,
3,1181,1,,1982,1982-12-27 05:30:00,336.000000,27.653205,33.449317,3.792989,,...,26.11415,2.798573,0.025071,1.349287,,,,,,
4,1181,1,,1983,1983-01-10 05:55:00,337.000000,12.789252,15.515920,3.064523,,...,28.858242,2.764272,0.026786,1.332136,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14464,6169,1,,2020,2020-10-26 10:35:00,336.833333,10.570123,12.817331,8.946462,,...,,,,,,,,,,
14465,6169,1,,2020,2020-11-09 10:15:00,335.666667,15.549123,18.789560,8.916996,,...,,,,,,,,,,
14466,6169,1,,2020,2020-11-23 10:05:00,335.833333,6.439759,7.785669,6.657077,,...,,,,,,,,,,
14467,6169,1,,2020,2020-12-07 10:05:00,336.000000,4.323554,5.229770,3.561146,,...,,,,,,,,,,


* Network

In [6]:
# Network NADUF
network_naduf = pd.read_excel(r"data/CAMELS_CH_chem_stations_short_v2.xlsx", sheet_name='naduf')
network_naduf

Unnamed: 0,naduf_id,naduf_station,naduf_water_body,longitude_LV03,lattitude_LV03,area_camels_CH [km2],hydro_naduf_distance [km],waterqual_id
0,1837,Porte du Scex,Rhône,557660,133280,5239.4,0.0,ok
1,1833,Brugg,Aare,657000,259360,11681.3,0.0,ok
2,1835,Mellingen,Reuss,662830,252580,3385.8,0.0,ok
3,1823,Andelfingen,Thur,693510,272500,1701.6,0.0,ok
4,1842,Martina,Inn,830640,197190,1937.5,0.0,ok
5,1840,Riazzino,Ticino,713670,113500,1613.3,10.2,ok
6,1832,Hagneck,Aare,580680,211650,5111.9,0.0,ok
7,1827,"Rheinfelden, Messstation",Rhein,627190,267840,34479.4,0.0,ok
8,1828,"Münchenstein, Hofmatt",Birs,613570,263080,887.3,0.0,ok
9,4409,Appenzell,Sitter,749040,244220,74.4,0.0,ok


In [7]:
len(dataset_naduf.naduf_id.unique())

29

In [8]:
dataset_naduf.naduf_id.unique()

array([1181, 1246, 1821, 1822, 1823, 1824, 1825, 1826, 1828, 1829, 1830,
       1831, 1832, 1833, 1835, 1836, 1837, 1838, 1840, 1842, 2044, 2045,
       2046, 2064, 2078, 3717, 4409, 4879, 6169], dtype=int64)

Observations
- 1827 is not present in the dataset. 
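
One way to find such mismatches without reading the arrays by eye is a set difference between the two ID columns. The sketch below uses short toy lists in place of `network_naduf.naduf_id` and `dataset_naduf.naduf_id`; the variable names are illustrative only.

```python
import pandas as pd

# Toy stand-ins for network_naduf.naduf_id and dataset_naduf.naduf_id
network_ids = pd.Series([1837, 1833, 1827, 1828, 4409])
dataset_ids = pd.Series([1181, 1828, 1833, 1837, 4409, 1181])

# Stations listed in the network but without any records in the dataset
missing_in_dataset = sorted(set(network_ids.tolist()) - set(dataset_ids.tolist()))

# Stations with records that are not part of the network table
extra_in_dataset = sorted(set(dataset_ids.tolist()) - set(network_ids.tolist()))

print(missing_in_dataset)
print(extra_in_dataset)
```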

### Renaming the columns

In [9]:
column_rename_dict = {
    'naduf_id': 'naduf_id', 
    'status_number': 'status_number', 
    'remark':'remark' , 
    'year':'year', 
    'date_end':'date', 
    'duration':'duration',
    'mean_discharge': 'mean_discharge(m3/s)',
    'total_discharge': 'total_discharge(Miom3)',
    'temperature_BAFU': 'temperature(°C)',
    'pH_BAFU': 'pH(-)',
    'conductivity_25C_BAFU': 'conductivity_25C(µS/cm)',
    'oxygen': 'oxygen(mg/l)',
    'oxygensaturation': 'oxygen_saturation(%)',
    'pH_lab': 'pH_lab(-)',
    'conductivity_20C_lab': 'conductivity_20C_lab(µS/cm)',
    'total_hardness': 'total_hardness(mmol/l)',
    'alkalinity': 'alkalinity(mmol/l)',
    'calcium': 'calcium(mg/l)',
    'magnesium': 'magnesium(mg/l)',
    'nitrate': 'nitrate(mgN/l)',
    'total_nitrogen': 'total_nitrogen(mgN/l)',
    'DRP': 'DRP(mgP/l)',
    'total_phosphorus': 'total_phosphorus(mgP/l)',
    'total_phosphorus_filtered': 'total_phosphorus_filtered(mgP/l)',
    'chloride': 'chloride(mg/l)',
    'fluoride': 'fluoride(mg/l)',
    'bromide': 'bromide(mg/l)',
    'silicate': 'silicate(mgH4SiO4/l)',
    'sulphate': 'sulphate(mgSO4/l)',
    'sodium': 'sodium(mg/l)',
    'potassium': 'potassium(mg/l)',
    'iron': 'iron(mg/l)',
    'TOC': 'TOC(mgC/l)',
    'DOC': 'DOC(mgC/l)',
    'suspended_material': 'suspended_material(mg/l)',
    'chromium': 'chromium(µg/l)',
    'zinc': 'zinc(µg/l)',
    'copper': 'copper(µg/l)',
    'cadmium': 'cadmium(µg/l)',
    'lead': 'lead(µg/l)',
    'nickel': 'nickel(µg/l)',
    'mercury': 'mercury(µg/l)',
    'barium': 'barium(µg/l)',
    'strontium': 'strontium(µg/l)',
    'arsenic': 'arsenic(µg/l)',
    'manganese': 'manganese(µg/l)'
}

In [10]:
# Rename columns based on the dictionary
dataset_naduf.rename(columns=column_rename_dict, inplace=True)

In [11]:
dataset_naduf.columns

Index(['naduf_id', 'status_number', 'remark', 'year', 'date', 'duration',
       'mean_discharge(m3/s)', 'total_discharge(Miom3)', 'temperature(°C)',
       'pH(-)', 'conductivity_25C(µS/cm)', 'oxygen(mg/l)',
       'oxygen_saturation(%)', 'pH_lab(-)', 'conductivity_20C_lab(µS/cm)',
       'total_hardness(mmol/l)', 'alkalinity(mmol/l)', 'calcium(mg/l)',
       'magnesium(mg/l)', 'nitrate(mgN/l)', 'total_nitrogen(mgN/l)',
       'DRP(mgP/l)', 'total_phosphorus(mgP/l)',
       'total_phosphorus_filtered(mgP/l)', 'chloride(mg/l)', 'fluoride(mg/l)',
       'bromide(mg/l)', 'silicate(mgH4SiO4/l)', 'sulphate(mgSO4/l)',
       'sodium(mg/l)', 'potassium(mg/l)', 'iron(mg/l)', 'TOC(mgC/l)',
       'DOC(mgC/l)', 'suspended_material(mg/l)', 'chromium(µg/l)',
       'zinc(µg/l)', 'copper(µg/l)', 'cadmium(µg/l)', 'lead(µg/l)',
       'nickel(µg/l)', 'mercury(µg/l)', 'barium(µg/l)', 'strontium(µg/l)',
       'arsenic(µg/l)', 'manganese(µg/l)'],
      dtype='object')
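
By default `DataFrame.rename` silently skips dictionary keys that do not match any column, so a misspelled key goes unnoticed. The sketch below lists the unmatched keys before renaming; the frame, the mapping, and the deliberately wrong key `oxygen_sat` are toy values for illustration.

```python
import pandas as pd

# Toy frame and mapping; 'oxygen_sat' is a deliberately wrong key
df = pd.DataFrame({"naduf_id": [1181], "date_end": ["1982-11-15"], "oxygen": [10.2]})
rename_map = {
    "date_end": "date",
    "oxygen": "oxygen(mg/l)",
    "oxygen_sat": "oxygen_saturation(%)",
}

# Keys of the mapping that match no column and would be ignored by rename
unmatched = [key for key in rename_map if key not in df.columns]
print(unmatched)

df = df.rename(columns=rename_map)
print(list(df.columns))
```

Passing `errors="raise"` to `rename` is a stricter alternative that stops with a `KeyError` instead.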

In [12]:
dataset_naduf = dataset_naduf[['naduf_id', 'date',
       'mean_discharge(m3/s)',
       'temperature(°C)', 'pH(-)', 'conductivity_25C(µS/cm)',
       'oxygen(mg/l)', 'oxygen_saturation(%)', 'pH_lab(-)',
       'conductivity_20C_lab(µS/cm)', 'total_hardness(mmol/l)',
       'alkalinity(mmol/l)', 'calcium(mg/l)', 'magnesium(mg/l)',
       'nitrate(mgN/l)', 'total_nitrogen(mgN/l)', 'DRP(mgP/l)',
       'total_phosphorus(mgP/l)', 'total_phosphorus_filtered(mgP/l)',
       'chloride(mg/l)', 'fluoride(mg/l)', 'bromide(mg/l)',
       'silicate(mgH4SiO4/l)', 'sulphate(mgSO4/l)', 'sodium(mg/l)',
       'potassium(mg/l)', 'iron(mg/l)', 'TOC(mgC/l)', 'DOC(mgC/l)',
       'suspended_material(mg/l)', 'chromium(µg/l)', 'zinc(µg/l)',
       'copper(µg/l)', 'cadmium(µg/l)', 'lead(µg/l)', 'nickel(µg/l)',
       'mercury(µg/l)', 'barium(µg/l)', 'strontium(µg/l)', 'arsenic(µg/l)',
       'manganese(µg/l)']].copy()  # explicit copy so later assignments do not act on a view
dataset_naduf

Unnamed: 0,naduf_id,date,mean_discharge(m3/s),temperature(°C),pH(-),conductivity_25C(µS/cm),oxygen(mg/l),oxygen_saturation(%),pH_lab(-),conductivity_20C_lab(µS/cm),...,zinc(µg/l),copper(µg/l),cadmium(µg/l),lead(µg/l),nickel(µg/l),mercury(µg/l),barium(µg/l),strontium(µg/l),arsenic(µg/l),manganese(µg/l)
0,1181,1982-11-15 06:00:00,5.313851,7.271006,,,,,8.480000,367.066770,...,10.406677,3.019255,0.073556,1.790373,,,,,,
1,1181,1982-11-29 06:00:00,9.046227,4.826679,,,,,8.430766,356.188458,...,,,,,,,,,,
2,1181,1982-12-13 05:30:00,10.864181,4.872490,,,,,8.356344,348.381529,...,,,,,,,,,,
3,1181,1982-12-27 05:30:00,27.653205,3.792989,,,,,8.474643,338.992271,...,26.11415,2.798573,0.025071,1.349287,,,,,,
4,1181,1983-01-10 05:55:00,12.789252,3.064523,,,,,8.281437,274.674792,...,28.858242,2.764272,0.026786,1.332136,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14464,6169,2020-10-26 10:35:00,10.570123,8.946462,,,,,8.410000,317.222586,...,,,,,,,,,,
14465,6169,2020-11-09 10:15:00,15.549123,8.916996,,,,,8.390000,313.269589,...,,,,,,,,,,
14466,6169,2020-11-23 10:05:00,6.439759,6.657077,,,,,8.380000,342.636797,...,,,,,,,,,,
14467,6169,2020-12-07 10:05:00,4.323554,3.561146,,,,,8.400031,383.225945,...,,,,,,,,,,


In [13]:
# Convert to datetime. The timestamps include a time of day, so no date-only
# format string is passed and pandas infers the format.
dataset_naduf["date"] = pd.to_datetime(dataset_naduf["date"])
dataset_naduf

Unnamed: 0,naduf_id,date,mean_discharge(m3/s),temperature(°C),pH(-),conductivity_25C(µS/cm),oxygen(mg/l),oxygen_saturation(%),pH_lab(-),conductivity_20C_lab(µS/cm),...,zinc(µg/l),copper(µg/l),cadmium(µg/l),lead(µg/l),nickel(µg/l),mercury(µg/l),barium(µg/l),strontium(µg/l),arsenic(µg/l),manganese(µg/l)
0,1181,1982-11-15 06:00:00,5.313851,7.271006,,,,,8.480000,367.066770,...,10.406677,3.019255,0.073556,1.790373,,,,,,
1,1181,1982-11-29 06:00:00,9.046227,4.826679,,,,,8.430766,356.188458,...,,,,,,,,,,
2,1181,1982-12-13 05:30:00,10.864181,4.872490,,,,,8.356344,348.381529,...,,,,,,,,,,
3,1181,1982-12-27 05:30:00,27.653205,3.792989,,,,,8.474643,338.992271,...,26.11415,2.798573,0.025071,1.349287,,,,,,
4,1181,1983-01-10 05:55:00,12.789252,3.064523,,,,,8.281437,274.674792,...,28.858242,2.764272,0.026786,1.332136,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14464,6169,2020-10-26 10:35:00,10.570123,8.946462,,,,,8.410000,317.222586,...,,,,,,,,,,
14465,6169,2020-11-09 10:15:00,15.549123,8.916996,,,,,8.390000,313.269589,...,,,,,,,,,,
14466,6169,2020-11-23 10:05:00,6.439759,6.657077,,,,,8.380000,342.636797,...,,,,,,,,,,
14467,6169,2020-12-07 10:05:00,4.323554,3.561146,,,,,8.400031,383.225945,...,,,,,,,,,,


In [17]:
# Function to round numbers and preserve symbols
def round_values(val):
    if isinstance(val, str):  # Handle string values with symbols
        if val.startswith('>') or val.startswith('<'):
            symbol = val[0]  # Extract the symbol ('>' or '<')
            try:
                number = float(val[1:])  # Convert the rest to a float
                return f"{symbol}{round(number, 4)}"
            except ValueError:  # Handle cases where conversion might fail
                return val
        else:
            try:
                return str(round(float(val), 4))  # Round plain string numbers
            except ValueError:
                return val  # Return original value if conversion fails
    elif isinstance(val, (int, float)):  # Handle numeric values
        return round(val, 4)
    return val  # Return unchanged if it's neither string nor numeric
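
The function above keeps values such as `<0.5` as strings, which leaves the affected columns with a mixed type. If numeric columns are needed downstream, one possible alternative (not the approach used in this notebook) is to split each value into a qualifier and a number. The helper `split_censored` and the sample values below are hypothetical.

```python
import pandas as pd


def split_censored(val):
    """Return (qualifier, number) for values such as '<0.5', '>100', '3.2' or 3.2."""
    if isinstance(val, str):
        text = val.strip()
        qualifier = text[0] if text[:1] in ("<", ">") else ""
        try:
            return qualifier, float(text[len(qualifier):])
        except ValueError:
            # Non-numeric text becomes NaN without a qualifier
            return "", float("nan")
    if isinstance(val, (int, float)):
        return "", float(val)
    return "", float("nan")


# Hypothetical column mixing censored, plain, and non-numeric entries
raw = pd.Series(["<0.5", "1.23456", 2.5, "n.a.", ">100"], name="zinc(µg/l)")

parts = raw.map(split_censored)
result = pd.DataFrame(parts.tolist(), columns=["qualifier", "value"], index=raw.index)
print(result)
```

Keeping the qualifier in its own column retains the detection-limit information while the `value` column stays purely numeric.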

In [22]:
for code in tqdm.tqdm(network_naduf.naduf_id):

    # Select one station; .copy() avoids modifying a view of dataset_naduf.
    # Stations without records (e.g., 1827) produce an empty file.
    dataset = dataset_naduf[dataset_naduf["naduf_id"] == code].copy()
    dataset = dataset.set_index("date").drop(columns=["naduf_id"])
    dataset.index.name = "date"

    # Round to 4 decimals while preserving the '<' and '>' symbols
    dataset = dataset.applymap(round_values)

    # Alternative (disabled): strip '<' and '>' and coerce everything to numeric,
    # turning the remaining non-numeric entries into NaN:
    #dataset = dataset.applymap(lambda x: str(x).replace('<', '') if isinstance(x, str) else x)
    #dataset = dataset.applymap(lambda x: str(x).replace('>', '') if isinstance(x, str) else x)
    #dataset = dataset.apply(pd.to_numeric, errors='coerce')
    #dataset = dataset.round(4)

    # PATH_OUTPUT already ends with a separator, so os.path.join avoids a double slash
    dataset.to_csv(os.path.join(PATH_OUTPUT, "NADUF", "CAMELS_CH_chem_intervals_" + str(code) + ".csv"), encoding='latin')
    

100%|██████████| 24/24 [00:00<00:00, 25.31it/s]


Observations
- We have 24 stations in total (one is empty: 1827)
- So far, the intervals are variable (not resampled)
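
To see how variable the sampling intervals are, the gap between consecutive timestamps can be computed per station. The sketch below is one way to do this on a toy frame standing in for `dataset_naduf`; the `interval_days` column name is an assumption made for this example.

```python
import pandas as pd

# Toy stand-in for dataset_naduf with two stations
toy = pd.DataFrame({
    "naduf_id": [1181, 1181, 1181, 6169, 6169],
    "date": pd.to_datetime([
        "1982-11-15 06:00", "1982-11-29 06:00", "1982-12-13 05:30",
        "2020-11-23 10:05", "2020-12-07 10:05",
    ]),
})

toy = toy.sort_values(["naduf_id", "date"])

# Time elapsed since the previous sample of the same station, in days
toy["interval_days"] = toy.groupby("naduf_id")["date"].diff().dt.total_seconds() / 86400

# Shortest and longest interval per station
summary = toy.groupby("naduf_id")["interval_days"].agg(["min", "max"])
print(summary)
```

The first sample of each station has no predecessor, so its interval is NaN and is ignored by `min` and `max`.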

# End