# Soil attributes extraction
Author: Thiago Nascimento (thiago.nascimento@eawag.ch)

This notebook is part of the EStreams publication and was used to extract and aggregate the soil classes from the European Soil Database Derived data (ESDD).

* Note that this code enables not only the replication of the current database but also its extension to new catchment areas.
* Additionally, the user should download the original raw data and place it in the folder of the same name before running this code.
* The original third-party data are not provided in this repository due to redistribution and storage-space restrictions.

## Requirements
**Python:**
* Python>=3.6
* Jupyter
* geopandas=0.10.2
* numpy
* os (Python standard library)
* pandas
* rasterio
* tqdm
* warnings (Python standard library)

Check the GitHub repository for an environment.yml (for conda environments) or requirements.txt (pip) file.

**Files:**
* data/soils/{topsoil, subsoil}/{variable}.tif. ESDD rasters (Topsoil - T and Subsoil - S) downloaded and converted to TIF files. Set the CRS to EPSG:3035 during the conversion. Available at: https://esdac.jrc.ec.europa.eu/content/european-soil-database-derived-data (Last access 23 November 2023)
* data/shapefiles/estreams_catchments.shp

**Directory:**
* Clone the GitHub repository locally.
* Place any third-party data in their respective directory.
* ONLY update the "PATH" variable in the section "Configurations" with the relative path to your local EStreams directory.

## References

* Hiederer, R. 2013. Mapping Soil Properties for Europe - Spatial Representation of Soil Database Attributes. Luxembourg: Publications Office of the European Union - 2013 - 47pp. EUR26082EN Scientific and Technical Research series, ISSN 1831-9424, doi:10.2788/94128

* Hiederer, R. 2013. Mapping Soil Typologies - Spatial Decision Support Applied to European Soil Database. Luxembourg: Publications Office of the European Union - 2013 - 147pp. EUR25932EN Scientific and Technical Research series, ISSN 1831-9424, doi:10.2788/8728

* Panagos, P., Van Liedekerke, M., Borrelli, P., Köninger, J., Ballabio, C., Orgiazzi, A., Lugato, E., Liakos, L., Hervas, J., Jones, A., Montanarella, L. 2022. European Soil Data Centre 2.0: Soil data and knowledge in support of the EU policies. European Journal of Soil Science, 73(6), e13315. doi:10.1111/ejss.13315

* Panagos P., Van Liedekerke M., Jones A., Montanarella L., “European Soil Data Centre: Response to European policy support and public data requirements”; (2012) Land Use Policy, 29 (2), pp. 329-338. doi:10.1016/j.landusepol.2011.07.003

* European Soil Data Centre (ESDAC), esdac.jrc.ec.europa.eu, European Commission, Joint Research Centre

## License

* Open source, but the original (unmodified) data may not be redistributed: https://esdac.jrc.ec.europa.eu/content/european-soil-database-derived-data (Last access 23 November 2023)


## Observations
#### Soil classes 

1. Depth available to roots:	STU_EU_DEPTH_ROOTS	(cm)
2. Clay content:	STU_EU_T_CLAY,	STU_EU_S_CLAY	(%)
3. Sand content:	STU_EU_T_SAND,	STU_EU_S_SAND	(%)
4. Silt content:	STU_EU_T_SILT,	STU_EU_S_SILT	(%)
5. Organic carbon content:	STU_EU_T_OC,	STU_EU_S_OC	(%)
6. Bulk density:	STU_EU_T_BD,	STU_EU_S_BD	(g cm-3)
7. Coarse Fragments:	STU_EU_T_GRAVEL,	STU_EU_S_GRAVEL	(%)
8. Total available water content from PTR:	SMU_EU_T_TAWC,	SMU_EU_S_TAWC	(mm)
9. Total available water content from PTF:	STU_EU_T_TAWC,	STU_EU_S_TAWC	(mm)

# Import modules

In [None]:
import geopandas as gpd
import numpy as np
import pandas as pd
import tqdm
import os
import rasterio
from rasterio.features import geometry_mask
import warnings

# Configurations

In [None]:
# Only editable variables:
# Relative path to your local directory
PATH = "../../.."
# Suppress all warnings
warnings.filterwarnings("ignore")

#### Users should NOT change anything in the code below this point.

In [None]:
# Non-editable variables:
PATH_OUTPUT = "results/staticattributes/"

# Set the directory:
os.chdir(PATH)

# Import data
## Catchment boundaries

In [None]:
catchment_boundaries = gpd.read_file('data/shapefiles/estreams_catchments.shp')
catchment_boundaries

In [None]:
print("The total number of catchments to be processed is:", len(catchment_boundaries))

## Soil type rasters

In [None]:
# Topsoil:
filenames_topsoil =['data/soils/topsoil/stu_eu_depth_roots.tif',
            'data/soils/topsoil/smu_eu_t_tawc.tif',
            'data/soils/topsoil/stu_eu_t_sand.tif',
            'data/soils/topsoil/stu_eu_t_silt.tif',
            'data/soils/topsoil/stu_eu_t_clay.tif',
            'data/soils/topsoil/stu_eu_t_gravel.tif',
            'data/soils/topsoil/stu_eu_t_bd.tif',
            'data/soils/topsoil/stu_eu_t_oc.tif']
# Subsoil:
filenames_subsoil =['data/soils/topsoil/stu_eu_depth_roots.tif',
            'data/soils/subsoil/stu_eu_s_tawc.tif',
            'data/soils/subsoil/stu_eu_s_sand.tif',
            'data/soils/subsoil/stu_eu_s_silt.tif',
            'data/soils/subsoil/stu_eu_s_clay.tif',
            'data/soils/subsoil/stu_eu_s_gravel.tif',
            'data/soils/subsoil/stu_eu_s_bd.tif',
            'data/soils/subsoil/stu_eu_s_oc.tif']
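
Because the ESDD rasters are not shipped with the repository, it can help to confirm that every expected file is in place before starting the long extraction loop. The sketch below is one possible check and is not part of the original workflow; the helper name `find_missing_files` is hypothetical.

```python
import os


def find_missing_files(paths):
    """Return the paths that do not point to an existing file."""
    return [path for path in paths if not os.path.isfile(path)]


# Example with two of the raster paths used in this notebook
example_paths = [
    "data/soils/topsoil/stu_eu_depth_roots.tif",
    "data/soils/subsoil/stu_eu_s_sand.tif",
]

missing = find_missing_files(example_paths)
if missing:
    print("Missing raster files:")
    for path in missing:
        print("  " + path)
else:
    print("All raster files found.")
```

The same helper could be applied to `filenames_topsoil` and `filenames_subsoil` after the working directory has been set.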

## Reproject to a projected coordinate system

In [None]:
# Define the target CRS to ETRS89 LAEA
target_crs = 'EPSG:3035'

# Reproject the GeoDataFrame to the target CRS
catchment_boundaries_reprojected = catchment_boundaries.to_crs(target_crs)

## Computation processes

In [None]:
# Initialize an empty DataFrame to store the results
soil_attributes_df = pd.DataFrame()

# Define the column-name prefixes, in the same order in which the rasters are read:
prefix_values = ["root_dep_", "soil_tawc_", "soil_fra_sand_", "soil_fra_silt_", "soil_fra_clay_",
                "soil_fra_grav_", "soil_bd_", "soil_oc_"]
i = 0
for filename in filenames_topsoil:
    
    # Create lists to store the results
    avg_values = []
    max_values = []
    min_values = []
    percentile_5th = []
    percentile_25th = []
    median = []
    percentile_75th = []
    percentile_90th = []

    # Load the raster file and read the band once (it is the same for every catchment)
    with rasterio.open(filename) as src:
        raster_values = src.read(1, masked=True)

        for idx, geom in tqdm.tqdm(catchment_boundaries_reprojected.iterrows()):

            # Default to NaN; overwritten below only if valid pixels are found
            avg_value = max_value = min_value = np.nan
            p5 = p25 = med = p75 = p90 = np.nan

            # Process only geometries that are present, non-empty and valid
            geometry = geom['geometry']
            if geometry is not None and not geometry.is_empty and geometry.is_valid:
                # Create a mask for the geometry (True inside the catchment)
                mask = geometry_mask([geometry], out_shape=src.shape, transform=src.transform, invert=True)

                # Keep the pixels within the geometry and drop nodata pixels,
                # so that masked values cannot leak into the percentiles
                values = raster_values[mask].compressed()

                # Calculate statistics only if there are valid values in the 'values' array
                if len(values) > 0:
                    avg_value = np.mean(values)
                    max_value = np.max(values)
                    min_value = np.min(values)
                    p5 = np.percentile(values, 5)
                    p25 = np.percentile(values, 25)
                    med = np.percentile(values, 50)  # 50th percentile (median)
                    p75 = np.percentile(values, 75)
                    p90 = np.percentile(values, 90)

            # Store the results in the lists
            avg_values.append(avg_value)
            max_values.append(max_value)
            min_values.append(min_value)
            percentile_5th.append(p5)
            percentile_25th.append(p25)
            median.append(med)
            percentile_75th.append(p75)
            percentile_90th.append(p90)

    # Create a DataFrame to store the results for this file
    data = {
        'basin_id': catchment_boundaries_reprojected['basin_id'],
        'mean': avg_values,
        'max': max_values,
        'min': min_values,
        'p05': percentile_5th,
        'p25': percentile_25th,
        'med': median,
        'p75': percentile_75th,
        'p90': percentile_90th
    }
    results_df = pd.DataFrame(data)
    results_df.set_index("basin_id", inplace=True)
    results_df = results_df.add_prefix(prefix_values[i])

    # Concatenate the results with the final DataFrame
    soil_attributes_df = pd.concat([soil_attributes_df, results_df], axis=1)
    i = i + 1
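
When a raster defines a nodata value, `src.read(1, masked=True)` returns a NumPy masked array. `np.percentile` is not documented as mask-aware, so a cautious approach is to drop the masked pixels before computing the statistics. The sketch below shows this on a small made-up array; the function name `summarize_pixels` and the -9999 nodata value are assumptions, not part of the original notebook.

```python
import numpy as np


def summarize_pixels(pixels):
    """Summary statistics of a (possibly masked) array, ignoring masked pixels."""
    # np.ma.compressed returns the non-masked values as a plain 1-D array
    valid = np.ma.compressed(pixels)
    names = ["mean", "max", "min", "p05", "p25", "med", "p75", "p90"]
    if valid.size == 0:
        # No valid pixels: every statistic is NaN
        return dict.fromkeys(names, np.nan)

    p05, p25, med, p75, p90 = np.percentile(valid, [5, 25, 50, 75, 90])
    return {
        "mean": float(np.mean(valid)),
        "max": float(np.max(valid)),
        "min": float(np.min(valid)),
        "p05": float(p05),
        "p25": float(p25),
        "med": float(med),
        "p75": float(p75),
        "p90": float(p90),
    }


# Three valid pixels and one nodata pixel (assumed nodata value: -9999)
demo_pixels = np.ma.masked_equal([10.0, 20.0, 30.0, -9999.0], -9999.0)
demo_stats = summarize_pixels(demo_pixels)
print(demo_stats)

# A catchment with no valid pixels yields NaN for every statistic
empty_stats = summarize_pixels(np.ma.masked_all((3,)))
print(empty_stats)
```

Returning a dictionary keyed by the same suffixes as the output columns (`mean`, `max`, `min`, `p05`, ...) keeps the statistics and their names together.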

In [None]:
soil_attributes_df

In [None]:
# Here we sort the index:
soil_attributes_df = soil_attributes_df.sort_index(axis=0)
soil_attributes_df

In [None]:
# Round the data to 3 decimals
soil_attributes_df = soil_attributes_df.astype(float).round(3)
soil_attributes_df

# Data export

In [None]:
# Export the final dataset:
soil_attributes_df.to_csv(PATH_OUTPUT+"estreams_soil_attributes.csv")
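
`to_csv` does not create missing folders, so the export fails if `results/staticattributes/` does not exist yet. A possible safeguard, shown here on a tiny made-up table written to a temporary directory, is to call `os.makedirs` with `exist_ok=True` first. The basin IDs and values below are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-catchment table with one prefixed statistic
demo_df = pd.DataFrame(
    {"root_dep_mean": [85.1234, 102.5678]},
    index=pd.Index(["basin_a", "basin_b"], name="basin_id"),
)

# A temporary directory stands in for the repository root
with tempfile.TemporaryDirectory() as tmp_dir:
    output_dir = os.path.join(tmp_dir, "results", "staticattributes")
    os.makedirs(output_dir, exist_ok=True)  # no error if it already exists

    output_file = os.path.join(output_dir, "estreams_soil_attributes.csv")
    demo_df.round(3).to_csv(output_file)

    # Read the file back to confirm the round trip
    reloaded = pd.read_csv(output_file, index_col="basin_id")

print(reloaded)
```

In the notebook itself, the equivalent call would be `os.makedirs(PATH_OUTPUT, exist_ok=True)` placed before the export line.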

# End