# Meteorological time-series extraction: Part B

Author: Thiago Nascimento (thiago.nascimento@eawag.ch)

This notebook is part of the EStreams publication and was used to extract and aggregate the meteorological time series from the E-OBS dataset. In Part B, we extract the data from the NetCDF files and export them as individual intermediate files.

* Note that this code enables not only the replication of the current database but also its extension to new catchment areas.
* Additionally, the user should download the original raw data and place them in the folder of the same name before running this code.
* The original third-party data are not made available in this repository for redistribution and storage-space reasons.

## Requirements
**Python:**

* Python>=3.6
* Jupyter
* geopandas=0.10.2
* glob
* netCDF4
* numpy
* os
* pandas
* tqdm
* concurrent

Check the GitHub repository for an environment.yml (for conda environments) or requirements.txt (pip) file.

**Files:**

* data/shapefiles/estreams_catchments.shp
* data/meteorology/eobs/{rr, tg, tn, tx, pp, hu, fg, qq}_ens_mean_0.25deg_reg_v28.0e.nc. https://www.ecad.eu/download/ensembles/download.php (Last access: 27 November 2023)
* data/meteorology/eobs/pet_hargreaves_025deg_v280e.nc. Derived Hargreaves daily potential evapotranspiration. https://github.com/pyet-org/pyet (Last access: 27 November 2023)
* In the output directory it is important to have one folder for each variable to be exported.
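
The per-variable folders can be created up front. The sketch below is one possible way to do it, not part of the original workflow: `make_variable_folders` is a hypothetical helper, and the demonstration writes to a temporary directory. In this notebook the per-variable path is built as `PATH_preprocessing + chosen_variable + "/"`, so the same idea would apply to that directory.

```python
import os
import tempfile

# E-OBS variable codes taken from this notebook
variables = ["rr", "tg", "tn", "tx", "pp", "hu", "fg", "qq", "pet", "pet_iceland"]

def make_variable_folders(base_dir, names):
    """Hypothetical helper: create one sub-folder per variable under base_dir."""
    created = []
    for name in names:
        folder = os.path.join(base_dir, name)
        os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
        created.append(folder)
    return created

# Demonstration in a temporary directory; replace it with your own output directory
with tempfile.TemporaryDirectory() as tmp:
    folders = make_variable_folders(tmp, variables)
    print(len(folders), "folders created")
```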

**Directory:**

* Clone the GitHub directory locally
* ONLY update the "PATH" variable in the section "Configurations" with your relative path to the EStreams directory.

## References

* Cornes, R., G. van der Schrier, E.J.M. van den Besselaar, and P.D. Jones. 2018: An Ensemble Version of the E-OBS Temperature and Precipitation Datasets, J. Geophys. Res. Atmos., 123. doi:10.1029/2017JD028200

## Licenses
* EOBS: "The ECA&D data policy applies. These observational data are strictly for use in non-commercial research and non-commercial education projects only. Scientific results based on these data must be submitted for publication in the open literature without any delay linked to commercial objectives" https://www.ecad.eu/download/ensembles/download.php#guidance (Last access: 27 November 2023)

## Observations
#### E-OBS filenames

* rr = Total daily precipitation
* tg = Mean daily temperature
* tn = Minimum daily temperature
* tx = Maximum daily temperature
* pp = Mean daily air pressure at sea level
* hu = Mean daily relative humidity
* fg = Mean daily wind speed at 10 m
* qq = Total daily global radiation

# Import modules

In [None]:
import os
import numpy as np
import pandas as pd
import geopandas as gpd
import tqdm
import time
import glob
import netCDF4 as nc
from concurrent.futures import ThreadPoolExecutor
from utils.meteorology import *

# Configurations

In [None]:
# Only editable variables:
# Relative path to your local directory
PATH = "../../.."
# Set the number of workers for parallel processing
num_workers = 5

# Chunk size for reading NetCDF data
chunk_size = 100  # Adjust this value based on your available memory

# Choose the variable
chosen_variable = "pet"  # Variable to be processed ["rr", "tg", "tn", "tx", "pp", "hu", "fg", "qq", "pet", "pet_iceland"]
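
The cells shown in this notebook do not use `chunk_size` directly, so how it is consumed is not visible here. As a rough illustration of what chunked reading usually means, the sketch below splits an axis into consecutive blocks; `chunk_slices` is a hypothetical helper, and chunking along the time axis is an assumption.

```python
def chunk_slices(n_items, chunk_size):
    """Yield consecutive slices that cover range(n_items) in blocks of chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_items, chunk_size):
        yield slice(start, min(start + chunk_size, n_items))

# Example: 250 time steps read in blocks of 100
time_steps = list(range(250))  # stand-in for the time axis of a NetCDF variable
blocks = [time_steps[s] for s in chunk_slices(len(time_steps), 100)]
print([len(b) for b in blocks])  # [100, 100, 50]
```

A smaller `chunk_size` lowers peak memory use at the cost of more read operations.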

#### Users should NOT change anything in the code below this point.

In [None]:
# Non-editable variables:
PATH_preprocessing = "data/meteorology/eobs/preprocessing/"
PATH_netcdfs = "data/meteorology/eobs/"
PATH_OUTPUT = "results/timeseries/meteorology/catchments"
PATH_OUTPUT_2 = "results/timeseries/meteorology"
PATH_shapefile = "data/shapefiles/estreams_catchments.shp"
variables = ["rr", "tg", "tn", "tx", "pp", "hu", "fg", "qq", "pet", "pet_iceland"] # Eobs variables

# Set the directory:
os.chdir(PATH)

# Import data
## Catchment boundaries

In [None]:
catchment_boundaries = gpd.read_file(PATH_shapefile)
catchment_boundaries.head()

In [None]:
print("The total number of catchments to be processed is:", len(catchment_boundaries))

# Reproject to WGS-84

In [None]:
# Reproject the catchment geometries to EPSG:4326 (WGS 84)
catchment_boundaries = catchment_boundaries.to_crs(epsg=4326)

# Data Extraction

## Variable to be processed
* For processing reasons, PET for Iceland is stored in a separate NetCDF file.

In [None]:
# Create a dictionary to map variables to their corresponding file names
variable_file_mapping = {
    "rr": "rr_ens_mean_0.25deg_reg_v28.0e.nc",
    "tg": "tg_ens_mean_0.25deg_reg_v28.0e.nc",
    "tn": "tn_ens_mean_0.25deg_reg_v28.0e.nc",
    "tx": "tx_ens_mean_0.25deg_reg_v28.0e.nc",
    "pp": "pp_ens_mean_0.25deg_reg_v28.0e.nc",
    "hu": "hu_ens_mean_0.25deg_reg_v28.0e.nc",
    "fg": "fg_ens_mean_0.25deg_reg_v28.0e.nc",
    "qq": "qq_ens_mean_0.25deg_reg_v28.0e.nc",
    "pet": "pet_hargreaves_025deg_v280e.nc",
    "pet_iceland": "pet_hargreaves_iceland_025deg_v280e.nc"
}

# Create a dictionary to map variables to their corresponding variable names
variable_name_mapping = {
    "rr": "rr",
    "tg": "tg",
    "tn": "tn",
    "tx": "tx",
    "pp": "pp",
    "hu": "hu",
    "fg": "fg",
    "qq": "qq",
    "pet": "Hargreaves",
    "pet_iceland": "Hargreaves"
}

# Get the file name for the chosen variable
nc_file = variable_file_mapping.get(chosen_variable)
nc_variable_name = variable_name_mapping.get(chosen_variable)

if nc_file is not None:
    # Generate variable_name and path_preprocessing based on the chosen variable
    variable_name = nc_variable_name
    path_preprocessing = PATH_preprocessing + chosen_variable + "/"

    # Read NetCDF data
    with nc.Dataset(PATH_netcdfs + nc_file, mode='r', format='NETCDF4') as nc_dataset:
        latitude = nc_dataset["latitude"][:]
        longitude = nc_dataset["longitude"][:]
        values = nc_dataset[variable_name][:]

        # Print variables for checking (while the dataset is still open)
        print(f"Variables in NetCDF file for {chosen_variable}: {list(nc_dataset.variables.keys())}")
else:
    print("Invalid variable choice.")
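
With an invalid choice, the cell above only prints a message, so later cells would likely fail on undefined names such as `latitude`. One possible alternative is to fail early with an explicit error. The sketch below uses a subset of the mappings from this notebook; `resolve_variable` is a hypothetical helper, not part of the original code.

```python
# Subset of the mappings defined in this notebook
variable_file_mapping = {
    "rr": "rr_ens_mean_0.25deg_reg_v28.0e.nc",
    "pet": "pet_hargreaves_025deg_v280e.nc",
}
variable_name_mapping = {"rr": "rr", "pet": "Hargreaves"}

def resolve_variable(choice):
    """Return (file name, NetCDF variable name) or raise if the choice is unknown."""
    try:
        return variable_file_mapping[choice], variable_name_mapping[choice]
    except KeyError:
        valid = ", ".join(sorted(variable_file_mapping))
        raise ValueError(
            f"Invalid variable choice {choice!r}; expected one of: {valid}"
        ) from None

print(resolve_variable("pet"))  # ('pet_hargreaves_025deg_v280e.nc', 'Hargreaves')
```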

## Calculate pixels extent

In [None]:
# Calculate pixel extents using vectorized operations
lon_idx, lat_idx = np.meshgrid(range(len(longitude)), range(len(latitude)))
pixel_extents = np.stack(
    (longitude[lon_idx], latitude[lat_idx], longitude[lon_idx] + 0.25, latitude[lat_idx] + 0.25),
    axis=-1
)
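
To make the layout of the extents concrete, the following pure-Python sketch builds the same `(min_lon, min_lat, max_lon, max_lat)` tuples for a tiny grid and selects the pixels that overlap a bounding box. The coordinates and the bounding box are made up, `pixel_extents` and `overlaps` are hypothetical helpers, and how `process_catchment` actually uses the grid is not shown in this notebook.

```python
CELL = 0.25  # grid spacing in degrees, as in the cell above

def pixel_extents(longitudes, latitudes, cell=CELL):
    """Return {(lat_index, lon_index): (min_lon, min_lat, max_lon, max_lat)}."""
    return {
        (i, j): (lon, lat, lon + cell, lat + cell)
        for i, lat in enumerate(latitudes)
        for j, lon in enumerate(longitudes)
    }

def overlaps(extent, bbox):
    """True if two (min_x, min_y, max_x, max_y) boxes share a non-zero area."""
    return (extent[0] < bbox[2] and bbox[0] < extent[2]
            and extent[1] < bbox[3] and bbox[1] < extent[3])

# Tiny illustrative grid (made-up coordinates) and a made-up catchment bounding box
lons = [8.0, 8.25, 8.5]
lats = [47.0, 47.25]
catchment_bbox = (8.3, 47.1, 8.6, 47.2)

hits = [idx for idx, ext in pixel_extents(lons, lats).items()
        if overlaps(ext, catchment_bbox)]
print(hits)  # [(0, 1), (0, 2)]
```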

## Extract catchment names

In [None]:
# Extract catchment names. 
if chosen_variable == "pet_iceland":
    # Extract catchment names for Iceland 
    desired_substring = 'ISGR'
    subset_catchment = catchment_boundaries[catchment_boundaries['basin_id'].str.contains(desired_substring)]
    catchmentnames = subset_catchment.basin_id.tolist()
else:
    catchmentnames = catchment_boundaries.basin_id.tolist()

catchmentnames

## Parallel processing of the catchment polygons
* This process may take a while depending on the machine and the configurations adopted.
* The output of this phase is one CSV file per catchment for each variable.
* The data will be concatenated later.

In [None]:
start = time.time()
with ThreadPoolExecutor(max_workers=num_workers) as executor:
    futures = [executor.submit(process_catchment, catchmentname, catchment_boundaries, values, latitude, longitude, path_preprocessing, variable_name = chosen_variable)
               for catchmentname in tqdm.tqdm(catchmentnames)]
    
    # Wait for all futures to complete
    for future in futures:
        future.result()

end = time.time()
print("Total time:", end - start)
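
In the cell above, `future.result()` re-raises the first exception from a worker, and the progress bar follows task submission rather than completion. A possible variation, sketched below, collects results as they finish and records failures per catchment. `process_one` is a hypothetical stand-in for `process_catchment`, and the catchment IDs are made up.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_one(name):
    """Hypothetical stand-in for process_catchment; fails for one made-up ID."""
    if name == "BAD0001":
        raise RuntimeError("simulated failure")
    return f"{name}.csv"

def run_all(names, max_workers=5):
    """Run process_one for every name; return (results, errors) keyed by name."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_name = {executor.submit(process_one, n): n for n in names}
        for future in as_completed(future_to_name):
            name = future_to_name[future]
            try:
                results[name] = future.result()
            except Exception as exc:  # keep going and report failures at the end
                errors[name] = exc
    return results, errors

results, errors = run_all(["AT000001", "BAD0001", "CH000002"])
print(sorted(results), sorted(errors))
```

Wrapping `as_completed(...)` in `tqdm.tqdm(..., total=len(names))` would make the progress bar track finished catchments.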

# End