# Soil attributes extraction
Author: Thiago Nascimento (thiago.nascimento@eawag.ch)

This notebook is part of the EStreams publication and was used to extract and aggregate the soil classes from the European Soil Database Derived data (ESDD).

* Note that this code enables not only the replication of the current database but also its extension to new catchment areas.
* Before running this code, the user should download the original raw data and place it in the folder of the same name.
* The original third-party data are not included in this repository because of redistribution restrictions and storage-space limits.

## Requirements
**Python:**
* Python>=3.6
* Jupyter
* geopandas=0.10.2
* numpy
* os
* pandas
* rasterio
* tqdm
* warnings

Check the GitHub repository for an environment.yml (for conda environments) or requirements.txt (pip) file.

**Files:**
* data/soils/{topsoil, subsoil}/{variable}.tif. ESDD rasters downloaded and converted to TIF files (topsoil: T, subsoil: S). Set the CRS to EPSG:3035 when converting. Available at: https://esdac.jrc.ec.europa.eu/content/european-soil-database-derived-data (last accessed 23 November 2023)
* data/shapefiles/estreams_catchments.shp

**Directory:**
* Clone the GitHub repository locally.
* Place any third-party data in their respective directories.
* ONLY update the "PATH" variable in the section "Configurations" with the relative path to your local EStreams directory.
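
Before running the notebook, it can help to confirm that the inputs listed under "Files" are where the code expects them. The following is a minimal sketch under stated assumptions: `find_missing_inputs` and `REQUIRED_FILES` are hypothetical names that are not part of the EStreams code, and only three of the input files are listed as examples.

```python
from pathlib import Path

# Example inputs taken from this notebook; extend the list with the other rasters as needed.
REQUIRED_FILES = [
    "data/shapefiles/estreams_catchments.shp",
    "data/soils/topsoil/stu_eu_depth_roots.tif",
    "data/soils/topsoil/stu_eu_t_clay.tif",
]


def find_missing_inputs(root, required=REQUIRED_FILES):
    """Return the entries of `required` that are not files under `root` (hypothetical helper)."""
    root = Path(root)
    return [rel for rel in required if not (root / rel).is_file()]
```

Calling `find_missing_inputs(PATH)` with the value set in "Configurations" returns an empty list when every listed file is in place.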

## References

* Hiederer, R. 2013. Mapping Soil Properties for Europe - Spatial Representation of Soil Database Attributes. Luxembourg: Publications Office of the European Union - 2013 - 47pp. EUR26082EN Scientific and Technical Research series, ISSN 1831-9424, doi:10.2788/94128

* Hiederer, R. 2013. Mapping Soil Typologies - Spatial Decision Support Applied to European Soil Database. Luxembourg: Publications Office of the European Union - 2013 - 147pp. EUR25932EN Scientific and Technical Research series, ISSN 1831-9424, doi:10.2788/8728

* Panagos, P., Van Liedekerke, M., Borrelli, P., Köninger, J., Ballabio, C., Orgiazzi, A., Lugato, E., Liakos, L., Hervas, J., Jones, A., Montanarella, L. 2022. European Soil Data Centre 2.0: Soil data and knowledge in support of the EU policies. European Journal of Soil Science, 73(6), e13315. doi:10.1111/ejss.13315

* Panagos, P., Van Liedekerke, M., Jones, A., Montanarella, L. 2012. European Soil Data Centre: Response to European policy support and public data requirements. Land Use Policy, 29(2), 329-338. doi:10.1016/j.landusepol.2011.07.003

* European Soil Data Centre (ESDAC), esdac.jrc.ec.europa.eu, European Commission, Joint Research Centre

## License

* Open source, but redistribution of the original (unmodified) data is not permitted: https://esdac.jrc.ec.europa.eu/content/european-soil-database-derived-data (last accessed 23 November 2023)


## Observations
#### Soil classes 

1. Depth available to roots:	STU_EU_DEPTH_ROOTS	(cm)
2. Clay content:	STU_EU_T_CLAY,	STU_EU_S_CLAY	(%)
3. Sand content:	STU_EU_T_SAND,	STU_EU_S_SAND	(%)
4. Silt content:	STU_EU_T_SILT,	STU_EU_S_SILT	(%)
5. Organic carbon content:	STU_EU_T_OC,	STU_EU_S_OC	(%)
6. Bulk density:	STU_EU_T_BD,	STU_EU_S_BD	(g cm-3)
7. Coarse Fragments:	STU_EU_T_GRAVEL,	STU_EU_S_GRAVEL	(%)
8. Total available water content from PTR:	SMU_EU_T_TAWC,	SMU_EU_S_TAWC	(mm)
9. Total available water content from PTF:	STU_EU_T_TAWC,	STU_EU_S_TAWC	(mm)

# Import modules

In [1]:
import geopandas as gpd
import numpy as np
import pandas as pd
import tqdm
import os
import rasterio
from rasterio.features import geometry_mask
import warnings

# Configurations

In [2]:
# Only editable variables:
# Relative path to your local directory
PATH = "../../.."
# Suppress all warnings
warnings.filterwarnings("ignore")
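
The global filter above silences every warning for the rest of the session. As an alternative, the standard-library context manager `warnings.catch_warnings` limits the suppression to one block. This is only a sketch of that design choice, not what the notebook does; `noisy_step` is a hypothetical stand-in for any call that emits a warning.

```python
import warnings


def noisy_step():
    """Hypothetical stand-in for a library call that emits a warning."""
    warnings.warn("example warning", UserWarning)
    return 42


# Warnings are ignored only inside this block; the previous filters are restored on exit.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy_step()
```

Scoping the filter this way keeps unrelated warnings visible elsewhere in the notebook.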

#### Users should NOT change anything in the code below.

In [3]:
# Non-editable variables:
PATH_OUTPUT = "results/staticattributes/"

# Set the directory:
os.chdir(PATH)

# Import data
## Catchment boundaries

In [4]:
catchment_boundaries = gpd.read_file('data/shapefiles/estreams_catchments.shp')
catchment_boundaries

,id,area_km2,outlet_lat,outlet_lng,name,area_offic,layer,path,Code,basin_id,area_calc,geometry
0,FR003159,37,47.488,7.393,A100003001,38.6,FR003159,C:/Users/nascimth/Documents/Thiago/Eawag/Pytho...,FR003159,FR003159,37.183,"POLYGON ((7.30374 47.49375, 7.30708 47.49375, ..."
1,FR003160,227,47.626,7.239,A105003001,233.0,FR003160,C:/Users/nascimth/Documents/Thiago/Eawag/Pytho...,FR003160,FR003160,226.962,"POLYGON ((7.22291 47.63458, 7.22374 47.63458, ..."
2,FR003161,14,47.586,7.384,A106000101,15.0,FR003161,C:/Users/nascimth/Documents/Thiago/Eawag/Pytho...,FR003161,FR003161,13.595,"POLYGON ((7.38791 47.59041, 7.39874 47.59041, ..."
3,FR003162,70,47.622,7.275,A107020001,70.0,FR003162,C:/Users/nascimth/Documents/Thiago/Eawag/Pytho...,FR003162,FR003162,70.152,"POLYGON ((7.28375 47.60958, 7.28291 47.60958, ..."
4,FR003163,330,47.653,7.265,A108003001,325.0,FR003163,C:/Users/nascimth/Documents/Thiago/Eawag/Pytho...,FR003163,FR003163,330.158,"POLYGON ((7.22958 47.65291, 7.23208 47.65291, ..."
...,...,...,...,...,...,...,...,...,...,...,...,...
1967,HR000314,135,44.202,16.069,7267,,HR000314,C:/Users/nascimth/Documents/Thiago/Eawag/Pytho...,HR000314,HR000314,135.462,"POLYGON ((16.01458 44.21375, 16.01375 44.21375..."
1968,HR000315,458,44.162,15.858,7236,,HR000315,C:/Users/nascimth/Documents/Thiago/Eawag/Pytho...,HR000315,HR000315,457.864,"POLYGON ((15.89625 44.07791, 15.89374 44.07791..."
1969,HR000316,514,44.162,15.849,7237,,HR000316,C:/Users/nascimth/Documents/Thiago/Eawag/Pytho...,HR000316,HR000316,514.369,"POLYGON ((15.84208 44.15458, 15.84208 44.15458..."
1970,HR000317,185,45.334,14.452,6077,,HR000317,C:/Users/nascimth/Documents/Thiago/Eawag/Pytho...,HR000317,HR000317,184.733,"POLYGON ((14.51875 45.36708, 14.51875 45.36791..."


In [5]:
print("The total number of catchments to be processed is:", len(catchment_boundaries))

The total number of catchments to be processed is: 1972


## Soil type rasters

In [6]:
# Topsoil:
filenames_topsoil =['data/soils/topsoil/stu_eu_depth_roots.tif',
            'data/soils/topsoil/smu_eu_t_tawc.tif',
            'data/soils/topsoil/stu_eu_t_sand.tif',
            'data/soils/topsoil/stu_eu_t_silt.tif',
            'data/soils/topsoil/stu_eu_t_clay.tif',
            'data/soils/topsoil/stu_eu_t_gravel.tif',
            'data/soils/topsoil/stu_eu_t_bd.tif',
            'data/soils/topsoil/stu_eu_t_oc.tif']
# Subsoil:
filenames_subsoil =['data/soils/topsoil/stu_eu_depth_roots.tif',
            'data/soils/subsoil/stu_eu_s_tawc.tif',
            'data/soils/subsoil/stu_eu_s_sand.tif',
            'data/soils/subsoil/stu_eu_s_silt.tif',
            'data/soils/subsoil/stu_eu_s_clay.tif',
            'data/soils/subsoil/stu_eu_s_gravel.tif',
            'data/soils/subsoil/stu_eu_s_bd.tif',
            'data/soils/subsoil/stu_eu_s_oc.tif']
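
The paths above follow the `data/soils/{topsoil, subsoil}/{variable}.tif` layout described under "Files". As a hedged illustration, they could also be derived from the ESDD variable names. `raster_path` and `LAYER_DIRS` are hypothetical names, and the rule that a variable without a `_t_` or `_s_` infix (such as `STU_EU_DEPTH_ROOTS`) lives in the topsoil folder is an assumption read off the lists above.

```python
from pathlib import PurePosixPath

# Layer folder implied by the "_t_" / "_s_" infix of an ESDD variable name.
LAYER_DIRS = {"t": "topsoil", "s": "subsoil"}


def raster_path(variable, root="data/soils", default_layer="topsoil"):
    """Build '<root>/<layer>/<variable>.tif' for an ESDD variable name (hypothetical helper)."""
    name = variable.lower()
    parts = name.split("_")
    # e.g. ['stu', 'eu', 't', 'clay']: the third token is the layer code, if present
    layer = LAYER_DIRS.get(parts[2], default_layer) if len(parts) > 2 else default_layer
    return str(PurePosixPath(root) / layer / f"{name}.tif")


print(raster_path("STU_EU_S_CLAY"))
```

Note that the computation loop in this notebook iterates over `filenames_topsoil`.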

## Reproject to projected coordinates system

In [7]:
# Define the target CRS as ETRS89 LAEA (EPSG:3035), matching the rasters
target_crs = 'EPSG:3035'

# Reproject the GeoDataFrame to the target CRS
catchment_boundaries_reprojected = catchment_boundaries.to_crs(target_crs)

## Computation processes

In [8]:
# Initialize an empty DataFrame to store the results
soil_attributes_df = pd.DataFrame()

# Define the column-name prefixes, in the same order as the raster files are read:
prefix_values = ["root_dep_", "soil_tawc_", "soil_fra_sand_", "soil_fra_silt_", "soil_fra_clay_",
                "soil_fra_grav_", "soil_bd_", "soil_oc_"]
i = 0
for filename in filenames_topsoil:
    
    # Create lists to store the results
    avg_values = []
    max_values = []
    min_values = []
    percentile_5th = []
    percentile_25th = []
    median = []
    percentile_75th = []
    percentile_90th = []

    # Load your raster file
    with rasterio.open(filename) as src:
        # Read the first band once per raster; nodata cells come back masked
        band = src.read(1, masked=True)

        for idx, geom in tqdm.tqdm(catchment_boundaries_reprojected.iterrows()):
            geometry = geom['geometry']

            # Empty or invalid geometries contribute no values
            # (this also avoids reusing the values of the previous catchment)
            if geometry is None or geometry.is_empty or not geometry.is_valid:
                values = np.array([])

            else:
                # Create a mask for the geometry (True inside the catchment)
                mask = geometry_mask([geometry], out_shape=src.shape, transform=src.transform, invert=True)

                # Keep the raster values within the geometry and drop nodata cells,
                # because np.percentile does not honour the mask
                values = band[mask].compressed()

            # Calculate statistics only if there are valid values in the 'values' array
            if len(values) > 0:
                avg_value = np.mean(values)
                max_value = np.max(values)
                min_value = np.min(values)
                p5 = np.percentile(values, 5)
                p25 = np.percentile(values, 25)
                med = np.percentile(values, 50)  # 50th percentile (median)
                p75 = np.percentile(values, 75)
                p90 = np.percentile(values, 90)

            else:
                # No valid values: set every statistic to NaN
                avg_value = max_value = min_value = np.nan
                p5 = p25 = med = p75 = p90 = np.nan

            # Store the results in the lists
            avg_values.append(avg_value)
            max_values.append(max_value)
            min_values.append(min_value)
            percentile_5th.append(p5)
            percentile_25th.append(p25)
            median.append(med)
            percentile_75th.append(p75)
            percentile_90th.append(p90)

    # Create a DataFrame to store the results for this file
    data = {
        'basin_id': catchment_boundaries_reprojected['basin_id'],
        'mean': avg_values,
        'max': max_values,
        'min': min_values,
        'p05': percentile_5th,
        'p25': percentile_25th,
        'med': median,
        'p75': percentile_75th,
        'p90': percentile_90th
    }
    results_df = pd.DataFrame(data)
    results_df.set_index("basin_id", inplace=True)
    results_df = results_df.add_prefix(prefix_values[i])

    # Concatenate the results with the final DataFrame
    soil_attributes_df = pd.concat([soil_attributes_df, results_df], axis=1)
    i = i + 1

1972it [00:38, 50.71it/s]
1972it [00:48, 40.73it/s]
1972it [00:47, 41.57it/s]
1972it [00:47, 41.71it/s]
1972it [00:47, 41.73it/s]
1972it [00:38, 51.83it/s]
1972it [00:47, 41.56it/s]
1972it [00:47, 41.43it/s]
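
The per-catchment statistics computed in the loop above can be sketched as one small helper. This is a minimal sketch rather than the notebook's code: `summarize` and `STAT_NAMES` are hypothetical names, and the 3x3 array, its -9999 nodata value, and the boolean catchment mask are made up for illustration. It also shows one way to keep masked (nodata) cells out of the percentiles, since `np.percentile` is not mask-aware.

```python
import numpy as np

STAT_NAMES = ["mean", "max", "min", "p05", "p25", "med", "p75", "p90"]


def summarize(values):
    """Return the eight statistics, or NaN for each when no valid cell exists."""
    # np.ma.compressed drops masked (nodata) cells and returns a plain 1-D array.
    valid = np.ma.compressed(values).astype(float)
    if valid.size == 0:
        return dict.fromkeys(STAT_NAMES, np.nan)
    p5, p25, med, p75, p90 = np.percentile(valid, [5, 25, 50, 75, 90])
    return {
        "mean": valid.mean(), "max": valid.max(), "min": valid.min(),
        "p05": p5, "p25": p25, "med": med, "p75": p75, "p90": p90,
    }


# Toy raster with one nodata cell, and a mask that is True inside the "catchment".
raster = np.ma.masked_equal(
    np.array([[10, 30, 50], [-9999, 130, 130], [10, 10, 10]]), -9999
)
inside = np.array([[True, True, True], [True, True, False], [False, False, False]])
stats = summarize(raster[inside])
```

Here the catchment covers the cells 10, 30, 50, one nodata cell, and 130, so the statistics are computed from four valid values.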


In [9]:
soil_attributes_df

basin_id,root_dep_mean,root_dep_max,root_dep_min,root_dep_p05,root_dep_p25,root_dep_med,root_dep_p75,root_dep_p90,soil_tawc_mean,soil_tawc_max,...,soil_bd_p75,soil_bd_p90,soil_oc_mean,soil_oc_max,soil_oc_min,soil_oc_p05,soil_oc_p25,soil_oc_med,soil_oc_p75,soil_oc_p90
FR003159,83.684211,130.0,30.0,30.0,50.0,50.0,130.0,130.0,53.030270,53.619366,...,1.400,1.410,1.145790,2.13,0.65,0.8165,0.83,0.915,1.0000,2.13
FR003160,120.796460,130.0,30.0,50.0,130.0,130.0,130.0,130.0,53.995136,56.878792,...,1.400,1.405,0.914690,2.13,0.65,0.7400,0.83,0.830,0.9675,1.05
FR003161,130.000000,130.0,130.0,130.0,130.0,130.0,130.0,130.0,56.864426,56.878792,...,1.245,1.250,0.938667,1.17,0.87,0.8700,0.87,0.870,0.9350,1.17
FR003162,130.000000,130.0,130.0,130.0,130.0,130.0,130.0,130.0,56.840694,56.878792,...,1.250,1.250,0.956479,1.17,0.87,0.8700,0.87,0.870,1.1700,1.17
FR003163,123.696970,130.0,30.0,50.0,130.0,130.0,130.0,130.0,54.796230,56.878792,...,1.400,1.400,0.934061,2.13,0.65,0.7400,0.83,0.830,1.0000,1.17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
HR000314,23.157895,100.0,10.0,10.0,10.0,10.0,10.0,100.0,14.727452,48.174320,...,1.380,1.400,1.578647,2.13,0.98,0.9800,1.61,1.610,1.6100,1.61
HR000315,38.087912,100.0,10.0,10.0,10.0,10.0,100.0,100.0,15.687418,49.032417,...,1.400,1.410,1.519385,2.13,0.83,0.8300,1.39,1.610,1.6100,2.13
HR000316,42.250489,100.0,10.0,10.0,10.0,10.0,100.0,100.0,16.974653,49.032417,...,1.400,1.410,1.480450,2.13,0.83,0.8300,0.98,1.610,1.6100,2.13
HR000317,49.945355,100.0,10.0,10.0,10.0,30.0,100.0,100.0,32.106415,50.274513,...,1.410,1.410,1.538525,2.13,0.65,0.7500,0.98,1.610,2.1300,2.13
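
The wide table above results from prefixing each per-raster result and joining the results column-wise on `basin_id`. A minimal sketch of that step follows, reduced to two statistics for two catchments, with values copied from the output above. The variable names `root_depth`, `organic_c`, and `wide` are illustrative only.

```python
import pandas as pd

basin_ids = pd.Index(["FR003159", "FR003160"], name="basin_id")

# Two per-raster result tables that share the same statistic names.
root_depth = pd.DataFrame({"mean": [83.684211, 120.796460], "max": [130.0, 130.0]}, index=basin_ids)
organic_c = pd.DataFrame({"mean": [1.145790, 0.914690], "max": [2.13, 2.13]}, index=basin_ids)

# Prefix each table so the column names stay unique, then align on the shared index.
wide = pd.concat(
    [root_depth.add_prefix("root_dep_"), organic_c.add_prefix("soil_oc_")],
    axis=1,
)
print(wide.columns.tolist())
```

Because `pd.concat(..., axis=1)` aligns on the index, each row of `wide` holds all attributes of one catchment.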


In [10]:
# Here we sort the index:
soil_attributes_df = soil_attributes_df.sort_index(axis=0)
soil_attributes_df

basin_id,root_dep_mean,root_dep_max,root_dep_min,root_dep_p05,root_dep_p25,root_dep_med,root_dep_p75,root_dep_p90,soil_tawc_mean,soil_tawc_max,...,soil_bd_p75,soil_bd_p90,soil_oc_mean,soil_oc_max,soil_oc_min,soil_oc_p05,soil_oc_p25,soil_oc_med,soil_oc_p75,soil_oc_p90
FR003159,83.684211,130.0,30.0,30.0,50.0,50.0,130.0,130.0,53.030270,53.619366,...,1.400,1.410,1.145790,2.13,0.65,0.8165,0.83,0.915,1.0000,2.13
FR003160,120.796460,130.0,30.0,50.0,130.0,130.0,130.0,130.0,53.995136,56.878792,...,1.400,1.405,0.914690,2.13,0.65,0.7400,0.83,0.830,0.9675,1.05
FR003161,130.000000,130.0,130.0,130.0,130.0,130.0,130.0,130.0,56.864426,56.878792,...,1.245,1.250,0.938667,1.17,0.87,0.8700,0.87,0.870,0.9350,1.17
FR003162,130.000000,130.0,130.0,130.0,130.0,130.0,130.0,130.0,56.840694,56.878792,...,1.250,1.250,0.956479,1.17,0.87,0.8700,0.87,0.870,1.1700,1.17
FR003163,123.696970,130.0,30.0,50.0,130.0,130.0,130.0,130.0,54.796230,56.878792,...,1.400,1.400,0.934061,2.13,0.65,0.7400,0.83,0.830,1.0000,1.17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
HR000313,31.693548,100.0,10.0,10.0,10.0,10.0,30.0,100.0,14.749300,49.032417,...,1.400,1.410,1.561936,2.13,0.83,0.9800,1.61,1.610,1.6100,2.13
HR000314,23.157895,100.0,10.0,10.0,10.0,10.0,10.0,100.0,14.727452,48.174320,...,1.380,1.400,1.578647,2.13,0.98,0.9800,1.61,1.610,1.6100,1.61
HR000315,38.087912,100.0,10.0,10.0,10.0,10.0,100.0,100.0,15.687418,49.032417,...,1.400,1.410,1.519385,2.13,0.83,0.8300,1.39,1.610,1.6100,2.13
HR000316,42.250489,100.0,10.0,10.0,10.0,10.0,100.0,100.0,16.974653,49.032417,...,1.400,1.410,1.480450,2.13,0.83,0.8300,0.98,1.610,1.6100,2.13


In [11]:
# Round the data to 3 decimals
soil_attributes_df = soil_attributes_df.astype(float).round(3)
soil_attributes_df

basin_id,root_dep_mean,root_dep_max,root_dep_min,root_dep_p05,root_dep_p25,root_dep_med,root_dep_p75,root_dep_p90,soil_tawc_mean,soil_tawc_max,...,soil_bd_p75,soil_bd_p90,soil_oc_mean,soil_oc_max,soil_oc_min,soil_oc_p05,soil_oc_p25,soil_oc_med,soil_oc_p75,soil_oc_p90
FR003159,83.684,130.0,30.0,30.0,50.0,50.0,130.0,130.0,53.030,53.619,...,1.400,1.410,1.146,2.13,0.65,0.816,0.83,0.915,1.000,2.13
FR003160,120.796,130.0,30.0,50.0,130.0,130.0,130.0,130.0,53.995,56.879,...,1.400,1.405,0.915,2.13,0.65,0.740,0.83,0.830,0.968,1.05
FR003161,130.000,130.0,130.0,130.0,130.0,130.0,130.0,130.0,56.864,56.879,...,1.245,1.250,0.939,1.17,0.87,0.870,0.87,0.870,0.935,1.17
FR003162,130.000,130.0,130.0,130.0,130.0,130.0,130.0,130.0,56.841,56.879,...,1.250,1.250,0.956,1.17,0.87,0.870,0.87,0.870,1.170,1.17
FR003163,123.697,130.0,30.0,50.0,130.0,130.0,130.0,130.0,54.796,56.879,...,1.400,1.400,0.934,2.13,0.65,0.740,0.83,0.830,1.000,1.17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
HR000313,31.694,100.0,10.0,10.0,10.0,10.0,30.0,100.0,14.749,49.032,...,1.400,1.410,1.562,2.13,0.83,0.980,1.61,1.610,1.610,2.13
HR000314,23.158,100.0,10.0,10.0,10.0,10.0,10.0,100.0,14.727,48.174,...,1.380,1.400,1.579,2.13,0.98,0.980,1.61,1.610,1.610,1.61
HR000315,38.088,100.0,10.0,10.0,10.0,10.0,100.0,100.0,15.687,49.032,...,1.400,1.410,1.519,2.13,0.83,0.830,1.39,1.610,1.610,2.13
HR000316,42.250,100.0,10.0,10.0,10.0,10.0,100.0,100.0,16.975,49.032,...,1.400,1.410,1.480,2.13,0.83,0.830,0.98,1.610,1.610,2.13


# Data export

In [12]:
# Export the final dataset:
soil_attributes_df.to_csv(PATH_OUTPUT+"estreams_soil_attributes.csv")
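
Because `basin_id` is the index, `to_csv` writes it as the first column of the file. The sketch below shows how the exported table could be read back with that column restored as the index. It is an illustration under assumptions: it writes to an in-memory buffer rather than to `results/staticattributes/`, and the two-row table is a reduced copy of the rounded output shown earlier.

```python
import io

import pandas as pd

table = pd.DataFrame(
    {"root_dep_mean": [83.684, 120.796], "soil_oc_mean": [1.146, 0.915]},
    index=pd.Index(["FR003159", "FR003160"], name="basin_id"),
)

# Write to an in-memory buffer so that nothing touches the disk.
buffer = io.StringIO()
table.to_csv(buffer)
buffer.seek(0)

# index_col restores basin_id as the index when reading the file back.
restored = pd.read_csv(buffer, index_col="basin_id")
```

The same `index_col="basin_id"` argument would apply when reading `estreams_soil_attributes.csv` from disk.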

# End