# Soil attributes extraction
Author: Thiago Nascimento (thiago.nascimento@eawag.ch)

This notebook is part of the EStreams publication and was used to extract and aggregate the soil classes from the European Soil Database Derived data (ESDD).

* Note that this code enables not only the replication of the current database but also its extension to new catchment areas.
* Additionally, the user should download the original raw data and place it in the folder of the same name before running this code.
* The original third-party data used are not available in this repository due to redistribution and storage-space restrictions.

## Requirements
**Python:**
* Python>=3.6
* Jupyter
* geopandas=0.10.2
* numpy
* os
* pandas
* rasterio
* tqdm
* warnings

Check the GitHub repository for an environment.yml (for conda environments) or requirements.txt (pip) file.

**Files:**
* data/soils/{topsoil, subsoil}/{variable}.tif. ESDD rasters downloaded and converted to TIF files (Topsoil - T and Subsoil - S). Set the CRS to EPSG:3035 during the conversion. Available at: https://esdac.jrc.ec.europa.eu/content/european-soil-database-derived-data (Last access 23 November 2023)
* data/shapefiles/estreams_catchments.shp

**Directory:**
* Clone the GitHub repository locally.
* Place any third-party data in their respective directories.
* ONLY update the "PATH" variable in the section "Configurations" with the relative path to your local EStreams directory.
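
Before running the notebook, it can help to confirm that the inputs listed under "Files" are reachable from "PATH". The following is a minimal sketch and not part of the original notebook: the helper name `find_missing_inputs` is hypothetical, and the two example paths are taken from the file list above.

```python
from pathlib import Path


def find_missing_inputs(root, relative_paths):
    """Return the relative paths that do not exist under `root`.

    `find_missing_inputs` is a hypothetical helper, not part of EStreams.
    """
    root = Path(root)
    return [p for p in relative_paths if not (root / p).is_file()]


# Example paths taken from the "Files" list; extend as needed.
expected = [
    "data/shapefiles/estreams_catchments.shp",
    "data/soils/topsoil/stu_eu_depth_roots.tif",
]

# "../../.." mirrors the default PATH used in the "Configurations" section.
missing = find_missing_inputs("../../..", expected)
if missing:
    print("Missing input files:", missing)
```

An empty result suggests the directory layout matches what the notebook expects.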

## References

* Hiederer, R. 2013. Mapping Soil Properties for Europe - Spatial Representation of Soil Database Attributes. Luxembourg: Publications Office of the European Union - 2013 - 47pp. EUR26082EN Scientific and Technical Research series, ISSN 1831-9424, doi:10.2788/94128

* Hiederer, R. 2013. Mapping Soil Typologies - Spatial Decision Support Applied to European Soil Database. Luxembourg: Publications Office of the European Union - 2013 - 147pp. EUR25932EN Scientific and Technical Research series, ISSN 1831-9424, doi:10.2788/8728

* Panagos, P., Van Liedekerke, M., Borrelli, P., Köninger, J., Ballabio, C., Orgiazzi, A., Lugato, E., Liakos, L., Hervas, J., Jones, A., Montanarella, L. 2022. European Soil Data Centre 2.0: Soil data and knowledge in support of the EU policies. European Journal of Soil Science, 73(6), e13315. doi:10.1111/ejss.13315

* Panagos P., Van Liedekerke M., Jones A., Montanarella L., “European Soil Data Centre: Response to European policy support and public data requirements”; (2012) Land Use Policy, 29 (2), pp. 329-338. doi:10.1016/j.landusepol.2011.07.003

* European Soil Data Centre (ESDAC), esdac.jrc.ec.europa.eu, European Commission, Joint Research Centre

## License

* Open source, but no-redistribution of the original (non-modified) data: https://esdac.jrc.ec.europa.eu/content/european-soil-database-derived-data (Last access 23 November 2023)


## Observations
#### Soil classes 

1. Depth available to roots:	STU_EU_DEPTH_ROOTS	(cm)
2. Clay content:	STU_EU_T_CLAY,	STU_EU_S_CLAY	(%)
3. Sand content:	STU_EU_T_SAND,	STU_EU_S_SAND	(%)
4. Silt content:	STU_EU_T_SILT,	STU_EU_S_SILT	(%)
5. Organic carbon content:	STU_EU_T_OC,	STU_EU_S_OC	(%)
6. Bulk density:	STU_EU_T_BD,	STU_EU_S_BD	(g cm-3)
7. Coarse Fragments:	STU_EU_T_GRAVEL,	STU_EU_S_GRAVEL	(%)
8. Total available water content from PTR:	SMU_EU_T_TAWC,	SMU_EU_S_TAWC	(mm)
9. Total available water content from PTF:	STU_EU_T_TAWC,	STU_EU_S_TAWC	(mm)

# Import modules

In [1]:
import geopandas as gpd
import numpy as np
import pandas as pd
import tqdm as tqdm
import os
import rasterio
from rasterio.features import geometry_mask
import warnings

# Configurations

In [2]:
# Only editable variables:
# Relative path to your local directory
PATH = "../../.."
# Suppress all warnings
warnings.filterwarnings("ignore")
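
Suppressing all warnings keeps the output tidy, but it can also hide messages that matter, such as deprecation notices. As a hedged alternative, which is not the approach used in this notebook, warnings can be silenced only around a specific call with the standard-library `warnings.catch_warnings` context manager. The function `noisy_step` below is a hypothetical stand-in for any call that emits warnings.

```python
import warnings


def noisy_step():
    """Hypothetical stand-in for a call that emits a warning."""
    warnings.warn("example warning", UserWarning)
    return 42


# Silence warnings only inside this block; the previous filters are
# restored automatically when the block exits.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy_step()

print(result)
```

Outside the `with` block, warnings are reported again according to the filters that were active before.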

#### Users should NOT change anything in the code below this point.

In [3]:
# Non-editable variables:
PATH_OUTPUT = "results/staticattributes/"

# Set the directory:
os.chdir(PATH)

# Import data
## Catchment boundaries

In [4]:
catchment_boundaries = gpd.read_file('data/shapefiles/estreams_catchments.shp')
catchment_boundaries

Unnamed: 0,id,area_km2,outlet_lat,outlet_lng,name,area_offic,layer,path,area_diff,area_calc,basin_id,geometry
0,HUGR020,9600.0,46.785,21.142,6444410,9011,HUGR020,C:/Users/nascimth/Documents/Thiago/Eawag/Pytho...,6.536,9595.794,HUGR020,"POLYGON ((21.13208 46.77291, 21.13208 46.77375..."
1,HUGR021,189000.0,46.423,18.896,6442080,189538,HUGR021,C:/Users/nascimth/Documents/Thiago/Eawag/Pytho...,-0.284,188597.11,HUGR021,"POLYGON ((18.91708 46.41791, 18.91708 46.41625..."
2,HUGR022,28500.0,48.126,22.34,6444304,29057,HUGR022,C:/Users/nascimth/Documents/Thiago/Eawag/Pytho...,-1.917,28507.473,HUGR022,"POLYGON ((22.32875 48.10875, 22.32791 48.10875..."
3,HUGR023,188000.0,46.627,18.869,6442060,189092,HUGR023,C:/Users/nascimth/Documents/Thiago/Eawag/Pytho...,-0.577,188286.167,HUGR023,"POLYGON ((18.89041 46.62875, 18.88875 46.62708..."
4,HUGR025,1210.0,47.662,19.683,6444240,1222,HUGR025,C:/Users/nascimth/Documents/Thiago/Eawag/Pytho...,-0.982,1206.441,HUGR025,"POLYGON ((19.68124 47.66875, 19.68291 47.66875..."
5,HUGR026,110.0,46.891,20.498,6444420,26647,HUGR026,C:/Users/nascimth/Documents/Thiago/Eawag/Pytho...,-99.587,109.639,HUGR026,"POLYGON ((20.49958 46.93125, 20.49958 46.93125..."
6,HUGR027,4490.0,48.497,21.229,6444330,4515,HUGR027,C:/Users/nascimth/Documents/Thiago/Eawag/Pytho...,-0.554,4494.402,HUGR027,"POLYGON ((21.23458 48.49708, 21.23208 48.49708..."
7,HUGR028,5770.0,46.883,18.141,6442110,5884,HUGR028,C:/Users/nascimth/Documents/Thiago/Eawag/Pytho...,-1.937,5773.506,HUGR028,"POLYGON ((16.56458 46.93291, 16.56541 46.93291..."
8,HUGR029,185000.0,47.495,19.048,6442040,184893,HUGR029,C:/Users/nascimth/Documents/Thiago/Eawag/Pytho...,0.058,184810.677,HUGR029,"POLYGON ((19.11291 47.48291, 19.11125 47.48458..."
9,HUGR030,13000.0,46.419,16.695,6446100,13033,HUGR030,C:/Users/nascimth/Documents/Thiago/Eawag/Pytho...,-0.253,13044.665,HUGR030,"POLYGON ((16.63625 46.45625, 16.63541 46.45625..."


In [5]:
print("The total number of catchments to be processed is:", len(catchment_boundaries))

The total number of catchments to be processed is: 33


## Soil type rasters

In [6]:
# Topsoil:
filenames_topsoil =['data/soils/topsoil/stu_eu_depth_roots.tif',
            'data/soils/topsoil/smu_eu_t_tawc.tif',
            'data/soils/topsoil/stu_eu_t_sand.tif',
            'data/soils/topsoil/stu_eu_t_silt.tif',
            'data/soils/topsoil/stu_eu_t_clay.tif',
            'data/soils/topsoil/stu_eu_t_gravel.tif',
            'data/soils/topsoil/stu_eu_t_bd.tif',
            'data/soils/topsoil/stu_eu_t_oc.tif']
# Subsoil:
filenames_subsoil =['data/soils/topsoil/stu_eu_depth_roots.tif',
            'data/soils/subsoil/stu_eu_s_tawc.tif',
            'data/soils/subsoil/stu_eu_s_sand.tif',
            'data/soils/subsoil/stu_eu_s_silt.tif',
            'data/soils/subsoil/stu_eu_s_clay.tif',
            'data/soils/subsoil/stu_eu_s_gravel.tif',
            'data/soils/subsoil/stu_eu_s_bd.tif',
            'data/soils/subsoil/stu_eu_s_oc.tif']

## Reproject to projected coordinates system

In [8]:
# Define the target CRS to ETRS89 LAEA
target_crs = 'EPSG:3035'

# Reproject the GeoDataFrame to the target CRS
catchment_boundaries_reprojected = catchment_boundaries.to_crs(target_crs)

## Computation processes

In [9]:
# Initialize an empty DataFrame to store the results
soil_attributes_df = pd.DataFrame()

# Define prefixes for the output column names, following the order in which the rasters are read:
prefix_values = ["root_dep_", "soil_tawc_", "soil_fra_sand_", "soil_fra_silt_", "soil_fra_clay_",
                "soil_fra_grav_", "soil_bd_", "soil_oc_"]
i = 0
for filename in filenames_topsoil:
    
    # Create lists to store the results
    avg_values = []
    max_values = []
    min_values = []
    percentile_5th = []
    percentile_25th = []
    median = []
    percentile_75th = []
    percentile_90th = []

    # Load your raster file
    with rasterio.open(filename) as src:
        for idx, geom in tqdm.tqdm(catchment_boundaries_reprojected.iterrows()):
            
            # Check if the geometry is empty or invalid
            if geom['geometry'] is None or geom['geometry'].is_empty or not geom['geometry'].is_valid:
                # No pixels can be extracted. Use an empty array so that the
                # statistics step below assigns NaN, instead of reusing the
                # values left over from the previous catchment.
                values = np.array([])
            
            else:
                # Create a mask for the geometry
                mask = geometry_mask([geom['geometry']], out_shape=src.shape, transform=src.transform, invert=True)

                # Read the values within the geometry from the raster
                values = src.read(1, masked=True)
                values = values[mask]

            # Calculate statistics only if there are valid values in the 'values' array
            if len(values) > 0:
                avg_value = np.mean(values)
                max_value = np.max(values)
                min_value = np.min(values)
                p5 = np.percentile(values, 5)
                p25 = np.percentile(values, 25)
                med = np.percentile(values, 50)  # 50th percentile (median)
                p75 = np.percentile(values, 75)
                p90 = np.percentile(values, 90)
            
            else:
                # Handle the case when there are no valid values (e.g., by setting them to NaN or a specific value)
                avg_value = np.nan
                max_value = np.nan
                min_value = np.nan
                p5 = np.nan
                p25 = np.nan
                med = np.nan
                p75 = np.nan
                p90 = np.nan

            # Store the results in the lists
            avg_values.append(avg_value)
            max_values.append(max_value)
            min_values.append(min_value)
            percentile_5th.append(p5)
            percentile_25th.append(p25)
            median.append(med)
            percentile_75th.append(p75)
            percentile_90th.append(p90)

    # Create a DataFrame to store the results for this file
    data = {
        'basin_id': catchment_boundaries_reprojected['basin_id'],
        'mean': avg_values,
        'max': max_values,
        'min': min_values,
        'p05': percentile_5th,
        'p25': percentile_25th,
        'med': median,
        'p75': percentile_75th,
        'p90': percentile_90th
    }
    results_df = pd.DataFrame(data)
    results_df.set_index("basin_id", inplace=True)
    results_df = results_df.add_prefix(prefix_values[i])

    # Concatenate the results with the final DataFrame
    soil_attributes_df = pd.concat([soil_attributes_df, results_df], axis=1)
    i = i + 1

33it [00:00, 44.30it/s]
33it [00:01, 31.16it/s]
33it [00:00, 34.00it/s]
33it [00:00, 33.77it/s]
33it [00:00, 35.36it/s]
33it [00:00, 43.29it/s]
33it [00:00, 35.74it/s]
33it [00:01, 32.26it/s]
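
The per-catchment statistics computed in the loop above can be factored into a small function. The sketch below is an illustration under stated assumptions rather than the notebook's code: the helper name `summarize_values` is hypothetical, and it drops masked (nodata) pixels before computing percentiles, because `np.percentile` may not honour the mask of a masked array. Applying it to the published workflow could therefore change results wherever nodata pixels fall inside a catchment.

```python
import numpy as np

STAT_NAMES = ["mean", "max", "min", "p05", "p25", "med", "p75", "p90"]


def summarize_values(values):
    """Return the eight summary statistics as a dict (hypothetical helper).

    Masked pixels are dropped first; an empty input yields NaN everywhere.
    """
    # np.ma.compressed accepts both masked and plain arrays.
    data = np.ma.compressed(values).astype(float)
    if data.size == 0:
        return {name: np.nan for name in STAT_NAMES}
    p05, p25, med, p75, p90 = np.percentile(data, [5, 25, 50, 75, 90])
    return {
        "mean": data.mean(), "max": data.max(), "min": data.min(),
        "p05": p05, "p25": p25, "med": med, "p75": p75, "p90": p90,
    }


# Toy example: the value 999 is flagged as nodata and ignored.
pixels = np.ma.masked_equal([10, 20, 30, 40, 999], 999)
print(summarize_values(pixels))
```

Returning a dictionary keyed by statistic name keeps the NaN fallback and the regular case in one place, which avoids repeating the eight assignments.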


In [9]:
soil_attributes_df

basin_id,root_dep_mean,root_dep_max,root_dep_min,root_dep_p05,root_dep_p25,root_dep_med,root_dep_p75,root_dep_p90,soil_tawc_mean,soil_tawc_max,...,soil_bd_p75,soil_bd_p90,soil_oc_mean,soil_oc_max,soil_oc_min,soil_oc_p05,soil_oc_p25,soil_oc_med,soil_oc_p75,soil_oc_p90
HUGR020,116.760534,130,30,50.0,130.0,130.0,130.0,130.0,54.962193,69.633072,...,1.41,1.41,1.251793,5.27,0.58,0.74,0.99,1.05,1.4,2.1
HUGR021,100.318194,130,0,30.0,50.0,130.0,130.0,130.0,50.630196,117.386131,...,1.41,1.59,1.726546,39.400002,0.0,0.6,0.83,1.14,1.65,2.13
HUGR022,113.039271,130,0,40.0,130.0,130.0,130.0,130.0,55.189938,117.386131,...,1.41,1.43,1.486505,33.630001,0.0,0.74,0.99,1.4,1.45,2.4
HUGR023,100.269005,130,0,30.0,50.0,130.0,130.0,130.0,50.621159,117.386131,...,1.41,1.59,1.725779,39.400002,0.0,0.6,0.83,1.14,1.65,2.13
HUGR025,108.184846,130,30,50.0,70.0,130.0,130.0,130.0,52.90646,58.30217,...,1.4,1.41,1.326744,4.54,0.39,0.65,0.74,1.0,1.81,2.72
HUGR026,121.578947,130,50,50.0,130.0,130.0,130.0,130.0,54.200985,58.265205,...,1.22,1.248,1.828421,2.94,1.05,1.05,1.05,1.2,2.94,2.94
HUGR027,117.794838,130,30,30.0,130.0,130.0,130.0,130.0,52.820526,57.932434,...,1.41,1.43,1.365834,2.59,0.7,0.7,1.0,1.06,1.45,2.22
HUGR028,104.935065,130,0,0.0,130.0,130.0,130.0,130.0,51.703587,85.371735,...,1.4,1.7,6.244774,39.400002,0.0,0.0,0.74,0.98,2.13,33.630001
HUGR029,99.820263,130,0,30.0,50.0,130.0,130.0,130.0,50.545139,117.386131,...,1.41,1.59,1.71709,39.400002,0.0,0.6,0.83,1.14,1.65,2.13
HUGR030,68.690951,130,0,30.0,50.0,70.0,70.0,130.0,51.202457,117.386131,...,1.41,1.43,1.562495,33.630001,0.0,0.74,1.05,1.45,1.45,2.18
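
The column names in the table above combine a per-variable prefix with a statistic name. A minimal sketch of that naming step, pairing prefixes with results via `zip` instead of a manual counter, could look like this. The basin ids, the numbers, and the reduced set of statistics are made up for illustration only.

```python
import pandas as pd

# Made-up per-variable results, indexed by basin_id (illustrative only).
index = pd.Index(["A001", "A002"], name="basin_id")
per_variable = [
    pd.DataFrame({"mean": [1.0, 2.0], "max": [3.0, 4.0]}, index=index),
    pd.DataFrame({"mean": [5.0, 6.0], "max": [7.0, 8.0]}, index=index),
]
prefixes = ["root_dep_", "soil_tawc_"]

# Prefix each block of statistics, then join the blocks column-wise.
combined = pd.concat(
    [df.add_prefix(prefix) for prefix, df in zip(prefixes, per_variable)],
    axis=1,
)
print(list(combined.columns))
```

Because `zip` walks both lists together, a prefix can no longer drift out of step with its raster if the counter update is forgotten.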


In [10]:
# Here we sort the index:
soil_attributes_df = soil_attributes_df.sort_index(axis=0)
soil_attributes_df

basin_id,root_dep_mean,root_dep_max,root_dep_min,root_dep_p05,root_dep_p25,root_dep_med,root_dep_p75,root_dep_p90,soil_tawc_mean,soil_tawc_max,...,soil_bd_p75,soil_bd_p90,soil_oc_mean,soil_oc_max,soil_oc_min,soil_oc_p05,soil_oc_p25,soil_oc_med,soil_oc_p75,soil_oc_p90
HUGR019,115.246405,130,0,30.0,130.0,130.0,130.0,130.0,53.911449,117.386131,...,1.41,1.49,1.719948,39.400002,0.0,0.7,0.98,1.2,1.45,2.22
HUGR020,116.760534,130,30,50.0,130.0,130.0,130.0,130.0,54.962193,69.633072,...,1.41,1.41,1.251793,5.27,0.58,0.74,0.99,1.05,1.4,2.1
HUGR021,100.318194,130,0,30.0,50.0,130.0,130.0,130.0,50.630196,117.386131,...,1.41,1.59,1.726546,39.400002,0.0,0.6,0.83,1.14,1.65,2.13
HUGR022,113.039271,130,0,40.0,130.0,130.0,130.0,130.0,55.189938,117.386131,...,1.41,1.43,1.486505,33.630001,0.0,0.74,0.99,1.4,1.45,2.4
HUGR023,100.269005,130,0,30.0,50.0,130.0,130.0,130.0,50.621159,117.386131,...,1.41,1.59,1.725779,39.400002,0.0,0.6,0.83,1.14,1.65,2.13
HUGR024,117.797656,130,30,30.0,130.0,130.0,130.0,130.0,51.931255,57.932434,...,1.41,1.58,1.297789,2.59,0.7,0.7,0.83,1.45,1.45,2.13
HUGR025,108.184846,130,30,50.0,70.0,130.0,130.0,130.0,52.90646,58.30217,...,1.4,1.41,1.326744,4.54,0.39,0.65,0.74,1.0,1.81,2.72
HUGR026,121.578947,130,50,50.0,130.0,130.0,130.0,130.0,54.200985,58.265205,...,1.22,1.248,1.828421,2.94,1.05,1.05,1.05,1.2,2.94,2.94
HUGR027,117.794838,130,30,30.0,130.0,130.0,130.0,130.0,52.820526,57.932434,...,1.41,1.43,1.365834,2.59,0.7,0.7,1.0,1.06,1.45,2.22
HUGR028,104.935065,130,0,0.0,130.0,130.0,130.0,130.0,51.703587,85.371735,...,1.4,1.7,6.244774,39.400002,0.0,0.0,0.74,0.98,2.13,33.630001


In [10]:
# Round the data to 3 decimals
soil_attributes_df = soil_attributes_df.astype(float).round(3)
soil_attributes_df

basin_id,root_dep_mean,root_dep_max,root_dep_min,root_dep_p05,root_dep_p25,root_dep_med,root_dep_p75,root_dep_p90,soil_tawc_mean,soil_tawc_max,...,soil_bd_p75,soil_bd_p90,soil_oc_mean,soil_oc_max,soil_oc_min,soil_oc_p05,soil_oc_p25,soil_oc_med,soil_oc_p75,soil_oc_p90
HUGR020,116.761,130.0,30.0,50.0,130.0,130.0,130.0,130.0,54.962,69.633,...,1.41,1.41,1.252,5.27,0.58,0.74,0.99,1.05,1.4,2.1
HUGR021,100.318,130.0,0.0,30.0,50.0,130.0,130.0,130.0,50.63,117.386,...,1.41,1.59,1.727,39.4,0.0,0.6,0.83,1.14,1.65,2.13
HUGR022,113.039,130.0,0.0,40.0,130.0,130.0,130.0,130.0,55.19,117.386,...,1.41,1.43,1.487,33.63,0.0,0.74,0.99,1.4,1.45,2.4
HUGR023,100.269,130.0,0.0,30.0,50.0,130.0,130.0,130.0,50.621,117.386,...,1.41,1.59,1.726,39.4,0.0,0.6,0.83,1.14,1.65,2.13
HUGR025,108.185,130.0,30.0,50.0,70.0,130.0,130.0,130.0,52.906,58.302,...,1.4,1.41,1.327,4.54,0.39,0.65,0.74,1.0,1.81,2.72
HUGR026,121.579,130.0,50.0,50.0,130.0,130.0,130.0,130.0,54.201,58.265,...,1.22,1.248,1.828,2.94,1.05,1.05,1.05,1.2,2.94,2.94
HUGR027,117.795,130.0,30.0,30.0,130.0,130.0,130.0,130.0,52.821,57.932,...,1.41,1.43,1.366,2.59,0.7,0.7,1.0,1.06,1.45,2.22
HUGR028,104.935,130.0,0.0,0.0,130.0,130.0,130.0,130.0,51.704,85.372,...,1.4,1.7,6.245,39.4,0.0,0.0,0.74,0.98,2.13,33.63
HUGR029,99.82,130.0,0.0,30.0,50.0,130.0,130.0,130.0,50.545,117.386,...,1.41,1.59,1.717,39.4,0.0,0.6,0.83,1.14,1.65,2.13
HUGR030,68.691,130.0,0.0,30.0,50.0,70.0,70.0,130.0,51.202,117.386,...,1.41,1.43,1.562,33.63,0.0,0.74,1.05,1.45,1.45,2.18


# Data export

In [None]:
# Export the final dataset:
soil_attributes_df.to_csv(PATH_OUTPUT+"estreams_soil_attributes.csv")
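
`DataFrame.to_csv` does not create missing folders, so the export fails if `results/staticattributes/` does not yet exist. The following is a minimal sketch of a guard, assuming it is acceptable to create the folder on the fly. The helper name `export_csv` is hypothetical, and the demo writes a one-row table to a temporary directory rather than to the real output folder.

```python
import os
import tempfile

import pandas as pd


def export_csv(df, output_dir, filename):
    """Create `output_dir` if needed and write `df` there (hypothetical helper)."""
    os.makedirs(output_dir, exist_ok=True)
    target = os.path.join(output_dir, filename)
    df.to_csv(target)
    return target


# Demo with a throwaway one-row table in a temporary directory.
demo = pd.DataFrame({"root_dep_mean": [116.761]},
                    index=pd.Index(["HUGR020"], name="basin_id"))
with tempfile.TemporaryDirectory() as tmp:
    written = export_csv(demo, os.path.join(tmp, "results", "staticattributes"),
                         "estreams_soil_attributes.csv")
    file_was_written = os.path.isfile(written)
print(file_was_written)
```

In the notebook itself, the equivalent call would pass `soil_attributes_df` and `PATH_OUTPUT` in place of the demo arguments.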

# End