**Script Description:** This script extracts all available variables from the BIS-4D datasets at the NOBV locations and writes them to a single CSV file.

**File Name:** 01_09_Extract_BIS_4D_Data.ipynb

**Date:** 2025

**Created by:** Rob Alamgir

#### Import the relevant packages

In [15]:
import os
import glob
import numpy as np   # needed later for np.clip and np.where
import rasterio
import geopandas as gpd
import pandas as pd

#### Import the relevant data files

In [18]:
data_dir = "C:/Data_MSc_Thesis/BIS_4D_Selected/"
tif_files = glob.glob(os.path.join(data_dir, "*.tif"))

point_data_path = "C:/Data_MSc_Thesis/NOBV_Site_Data/NOBV_EC_Tower_Data_Final.csv"
point_data = pd.read_csv(point_data_path)
point_data.head()

Unnamed: 0,Site_no,Location_No,Site_ID,EPSG_4326_WGS_84_Longitude_X,EPSG_4326_WGS_84_Latitude_Y,EPSG_32631_WGS 84_X_m,EPSG_32631_WGS 84_Y_m,Elevation_m
0,1,1,ALB_MS,5.902334,53.05356,694512.5721,5882167.358,1.1
1,2,1,ALB_RF,5.904631,53.053385,694667.2798,5882154.181,1.1
2,3,2,AMM,5.903505,53.031374,694691.0225,5879703.421,1.1
3,4,2,AMR,5.902991,53.032245,694652.6416,5879798.861,1.1
4,5,3,ANK_PT,5.097471,52.253916,643168.4419,5791352.667,-1.4


In [17]:
tif_files  # List all .tif files in the directory

['C:/Data_MSc_Thesis/BIS_4D_Selected\\BD_gcm3_d_0_5_QRF_pred_mean.tif',
 'C:/Data_MSc_Thesis/BIS_4D_Selected\\BD_gcm3_d_5_15_QRF_pred_mean.tif',
 'C:/Data_MSc_Thesis/BIS_4D_Selected\\clay_per_d_0_5_QRF_pred_mean_processed.tif',
 'C:/Data_MSc_Thesis/BIS_4D_Selected\\clay_per_d_5_15_QRF_pred_mean_processed.tif',
 'C:/Data_MSc_Thesis/BIS_4D_Selected\\sand_per_d_0_5_QRF_pred_mean_processed.tif',
 'C:/Data_MSc_Thesis/BIS_4D_Selected\\sand_per_d_5_15_QRF_pred_mean_processed.tif',
 'C:/Data_MSc_Thesis/BIS_4D_Selected\\silt_per_d_0_5_QRF_pred_mean_processed.tif',
 'C:/Data_MSc_Thesis/BIS_4D_Selected\\silt_per_d_5_15_QRF_pred_mean_processed.tif',
 'C:/Data_MSc_Thesis/BIS_4D_Selected\\SOM_per_2020_d_0_5_QRF_pred_mean.tif',
 'C:/Data_MSc_Thesis/BIS_4D_Selected\\SOM_per_2020_d_5_15_QRF_pred_mean.tif',
 'C:/Data_MSc_Thesis/BIS_4D_Selected\\SOM_per_2023_d_0_5_QRF_pred_mean.tif',
 'C:/Data_MSc_Thesis/BIS_4D_Selected\\SOM_per_2023_d_5_15_QRF_pred_mean.tif']

#### Pre-process the datasets

In [21]:
point_data.rename(columns={"EPSG_4326_WGS_84_Longitude_X": "Longitude",
                           "EPSG_4326_WGS_84_Latitude_Y": "Latitude"}, inplace=True)

# Convert dataframe to a GeoDataFrame
gdf = gpd.GeoDataFrame(point_data, geometry=gpd.points_from_xy(point_data.Longitude, point_data.Latitude), crs="EPSG:4326")
gdf = gdf.to_crs("EPSG:28992")   # Reproject to match raster CRS (EPSG:28992)

# Extract reprojected coordinates
gdf["Reproj_X"] = gdf.geometry.x
gdf["Reproj_Y"] = gdf.geometry.y

gdf.info()
#gdf.head(12)

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype   
---  ------                 --------------  -----   
 0   Site_no                21 non-null     int64   
 1   Location_No            21 non-null     int64   
 2   Site_ID                21 non-null     object  
 3   Longitude              21 non-null     float64 
 4   Latitude               21 non-null     float64 
 5   EPSG_32631_WGS 84_X_m  21 non-null     float64 
 6   EPSG_32631_WGS 84_Y_m  21 non-null     float64 
 7   Elevation_m            21 non-null     float64 
 8   geometry               21 non-null     geometry
 9   Reproj_X               21 non-null     float64 
 10  Reproj_Y               21 non-null     float64 
dtypes: float64(7), geometry(1), int64(2), object(1)
memory usage: 1.9+ KB
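A quick sanity check after reprojection can catch swapped longitude/latitude or a wrong CRS before any raster is sampled. The sketch below assumes an approximate validity window for RD New (EPSG:28992) of roughly -7,000 to 300,000 m in X and 289,000 to 629,000 m in Y. Both the bounds and the helper name `in_rd_new_bounds` are illustrative assumptions and not part of the original notebook.

```python
# Approximate extent of the RD New grid (EPSG:28992), in metres.
# These bounds are an assumption for illustration; adjust them if needed.
RD_X_RANGE = (-7_000.0, 300_000.0)
RD_Y_RANGE = (289_000.0, 629_000.0)

def in_rd_new_bounds(x, y):
    """Return True if (x, y) lies inside the assumed RD New window."""
    return (RD_X_RANGE[0] <= x <= RD_X_RANGE[1]
            and RD_Y_RANGE[0] <= y <= RD_Y_RANGE[1])

# Reprojected coordinates of site ALB_MS, as printed in the notebook output.
print(in_rd_new_bounds(189540.226441, 563087.9994))

# Unprojected degrees (longitude, latitude) fall outside the window.
print(in_rd_new_bounds(5.902334, 53.05356))
```

In the notebook, the same check could be applied to every row, for example with `gdf.apply(lambda r: in_rd_new_bounds(r.Reproj_X, r.Reproj_Y), axis=1).all()`.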


#### Loop through each raster and extract values

In [22]:
# Function to extract raster values (band 1) at given coordinates
def extract_raster_values(raster_path, points_gdf):
    with rasterio.open(raster_path) as src:
        coords = list(zip(points_gdf["Reproj_X"], points_gdf["Reproj_Y"]))
        # src.sample yields one array per point, with one element per band.
        # Take band 1 directly: a truthiness test would turn a valid 0 into None.
        values = [val[0] for val in src.sample(coords)]
    return values


for tif_file in tif_files:
    # Column name = file name without the .tif extension, plus a "_values" suffix
    raster_name = os.path.splitext(os.path.basename(tif_file))[0]
    gdf[f"{raster_name}_values"] = extract_raster_values(tif_file, gdf)
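Conceptually, sampling a raster at a point means converting the map coordinate into a row and column index and reading that cell. The sketch below illustrates the idea for a north-up raster with square pixels. The grid origin, the 25 m pixel size, and the helper name `world_to_pixel` are illustrative assumptions, not values read from the BIS-4D files; `rasterio` performs this lookup internally.

```python
import math

def world_to_pixel(x, y, x_origin, y_origin, pixel_size):
    """Map a coordinate to (row, col) for a north-up raster.

    (x_origin, y_origin) is the upper-left corner of the upper-left pixel.
    """
    col = math.floor((x - x_origin) / pixel_size)
    row = math.floor((y_origin - y) / pixel_size)   # rows increase southwards
    return row, col

# Hypothetical grid: upper-left corner at (0, 625000), 25 m pixels.
row, col = world_to_pixel(189540.226441, 563087.9994, 0.0, 625000.0, 25.0)
print(row, col)
```

Two towers that fall in the same pixel receive identical values, which is worth keeping in mind when nearby sites are compared.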

In [23]:
# Define column renaming dictionary
rename_dict = {
    "BD_gcm3_d_0_5_QRF_pred_mean_values": "BD_0_5",
    "BD_gcm3_d_5_15_QRF_pred_mean_values": "BD_5_15",
    "clay_per_d_0_5_QRF_pred_mean_processed_values": "Clay_0_5",
    "clay_per_d_5_15_QRF_pred_mean_processed_values": "Clay_5_15",
    "SOM_per_2020_d_0_5_QRF_pred_mean_values": "SOM_2020_0_5",
    "SOM_per_2020_d_5_15_QRF_pred_mean_values": "SOM_2020_5_15",
    "SOM_per_2023_d_0_5_QRF_pred_mean_values": "SOM_2023_0_5",
    "SOM_per_2023_d_5_15_QRF_pred_mean_values": "SOM_2023_5_15",
    "sand_per_d_0_5_QRF_pred_mean_processed_values": "Sand_0_5",
    "sand_per_d_5_15_QRF_pred_mean_processed_values": "Sand_5_15",
    "silt_per_d_0_5_QRF_pred_mean_processed_values": "Silt_0_5",
    "silt_per_d_5_15_QRF_pred_mean_processed_values": "Silt_5_15"
}

gdf.rename(columns=rename_dict, inplace=True)   # Rename columns in the GeoDataFrame
gdf.drop(columns=["geometry"], inplace=True)    # Drop the geometry column before the CSV export
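Because the raster file names follow a regular pattern, the short column names could also be derived programmatically, so that a file added later does not silently miss the renaming dictionary. This is a sketch under the assumption that every column name looks like `<variable>_<unit>[_<year>]_d_<top>_<bottom>_...`; the helper `short_name` is hypothetical and not part of the original notebook.

```python
import re

# <variable>_<gcm3|per>[_<year>]_d_<top>_<bottom>_...
_PATTERN = re.compile(
    r"^(?P<var>[A-Za-z]+)_(?:gcm3|per)"
    r"(?:_(?P<year>\d{4}))?"
    r"_d_(?P<top>\d+)_(?P<bottom>\d+)_"
)

def short_name(column):
    """Derive a short name such as 'Clay_0_5'; return the input if no match."""
    m = _PATTERN.match(column)
    if m is None:
        return column
    # Keep acronyms (BD, SOM) as they are, capitalise the rest (clay -> Clay).
    var = m["var"] if m["var"].isupper() else m["var"].capitalize()
    parts = [var]
    if m["year"]:
        parts.append(m["year"])
    parts += [m["top"], m["bottom"]]
    return "_".join(parts)

for name in ("BD_gcm3_d_0_5_QRF_pred_mean_values",
             "clay_per_d_5_15_QRF_pred_mean_processed_values",
             "SOM_per_2023_d_5_15_QRF_pred_mean_values"):
    print(name, "->", short_name(name))
```

The explicit dictionary remains the easier option to audit; a derived mapping mainly helps when the list of rasters changes.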

The **porosity** (`Porosity_BIS4D_SOM`) is estimated from **bulk density** (BD) and **soil organic matter** (SOM) for the 5 to 15 cm depth layer (`BD_5_15` and `SOM_2023_5_15`), using a mixing model for particle density that follows Rühlmann et al. (2006).

#### 1. Component Densities
- **Mineral particle density**:  
  $$ \text{PD}_{\text{mineral}} = 2.646 \, \text{g/cm}^3 $$
- **Organic particle density**:  
  $$ \text{PD}_{\text{organic}} = 1.350 \, \text{g/cm}^3 $$

#### 2. SOM Fraction
The SOM percentage is first converted to a **mass fraction**:  
$$ \text{SOM}_{\text{fraction}} = \frac{\text{SOM} (\%) }{100} $$  
The **mineral fraction** is:  
$$ \text{Mineral}_{\text{fraction}} = 1 - \text{SOM}_{\text{fraction}} $$

#### 3. Estimated Particle Density
The **estimated particle density** is calculated using a weighted harmonic mean:  
$$ \text{PD}_{\text{estimated}} = \frac{1}{\left( \frac{\text{SOM}_{\text{fraction}}}{\text{PD}_{\text{organic}}} + \frac{\text{Mineral}_{\text{fraction}}}{\text{PD}_{\text{mineral}}} \right)} $$

#### 4. Porosity Calculation
Finally, **porosity** is computed as:  
$$ \text{Porosity} = 1 - \left( \frac{\text{BD}}{\text{PD}_{\text{estimated}}} \right) $$

#### 5. Validity Conditions
Porosity is calculated only when all of the following conditions are met; otherwise it is set to `NaN`:
- `BD` and `PD_estimated` are not null
- `BD > 0`
- `PD_estimated > 0`
- `BD < PD_estimated`
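The steps above can be checked by hand for a single site. The sketch below applies the mixing model to the `BD_5_15` and `SOM_2023_5_15` values printed for site ALB_MS in the notebook output. The function name `estimate_porosity` is an illustrative assumption, and returning `None` stands in for the `NaN` used in the notebook.

```python
PD_MINERAL = 2.646  # g/cm^3
PD_ORGANIC = 1.350  # g/cm^3

def estimate_porosity(bd, som_percent):
    """Return (PD_estimated, porosity); porosity is None if a validity check fails."""
    # Clamp SOM to [0, 100] % and convert it to a mass fraction.
    som_fraction = min(max(som_percent, 0.0), 100.0) / 100.0
    mineral_fraction = 1.0 - som_fraction
    # Weighted harmonic mean of the two component densities.
    pd_estimated = 1.0 / (som_fraction / PD_ORGANIC + mineral_fraction / PD_MINERAL)
    # Validity conditions: BD > 0 and BD < PD_estimated.
    if bd <= 0 or bd >= pd_estimated:
        return pd_estimated, None
    return pd_estimated, 1.0 - bd / pd_estimated

# Inputs for site ALB_MS as printed in the notebook output.
pd_est, porosity = estimate_porosity(bd=0.911710, som_percent=8.572524)
print(round(pd_est, 3), round(porosity, 3))   # 2.445 0.627
```

The result agrees with the `PD_Estimated` and `Porosity_BIS4D_SOM` values reported for ALB_MS, which gives a simple regression check for the vectorised version.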

In [26]:
# Define component particle densities (based on literature, e.g., Rühlmann et al. 2006)
PD_Mineral = 2.646  # g/cm^3
PD_Organic = 1.350  # g/cm^3
print(f"Using PD_Mineral = {PD_Mineral} g/cm^3 and PD_Organic = {PD_Organic} g/cm^3")

bd_column = 'BD_5_15'
som_column = 'SOM_2023_5_15'
pd_est_column_name = 'PD_Estimated' 
porosity_column_name = 'Porosity_BIS4D_SOM' 

# Calculate Estimated Particle Density (PD_Estimated) based on SOM 
if som_column in gdf.columns:
    # Convert SOM % to mass fraction: missing SOM is treated as 0 % and values are clamped to [0, 100]
    som_fraction = np.clip(gdf[som_column].fillna(0), 0, 100) / 100.0
    mineral_fraction = 1.0 - som_fraction
    # Calculate estimated PD using the mixing model PD_est = 1 / [ (SOM_frac / PD_org) + (Min_frac / PD_min) ]
    denominator = (som_fraction / PD_Organic) + (mineral_fraction / PD_Mineral)
    gdf[pd_est_column_name] = np.where(
        denominator > 1e-9, # Avoid division by zero
        1.0 / denominator,
        np.nan              # Assign NaN if calculation is problematic
    )
else:
    print(f"Error: SOM column '{som_column}' not found in gdf. Cannot estimate PD based on SOM.")
    gdf[pd_est_column_name] = np.nan   # Set estimated PD to NaN if SOM is missing

# Calculate Porosity using Estimated Particle Density 
if bd_column in gdf.columns and pd_est_column_name in gdf.columns:
    # Calculate Porosity = 1 - (BD / PD_estimated)
    gdf[porosity_column_name] = np.where(
        pd.notna(gdf[bd_column]) & pd.notna(gdf[pd_est_column_name]) & \
        (gdf[bd_column] > 0) & (gdf[pd_est_column_name] > 0) & \
        (gdf[bd_column] < gdf[pd_est_column_name]),          # BD must be less than PD
        1.0 - (gdf[bd_column] / gdf[pd_est_column_name]),
        np.nan                                               # Assign NaN otherwise
    )
    # Show the inputs and results for every site (only columns that exist)
    display_cols = [col for col in ['Site_ID', bd_column, som_column, pd_est_column_name, porosity_column_name] if col in gdf.columns]
    print(gdf[display_cols].head(21))
else:
    print(f"Error: Column '{bd_column}' or '{pd_est_column_name}' not found in gdf. Cannot calculate porosity.")

Using PD_Mineral = 2.646 g/cm^3 and PD_Organic = 1.35 g/cm^3
    Site_ID   BD_5_15  SOM_2023_5_15  PD_Estimated  Porosity_BIS4D_SOM
0    ALB_MS  0.911710       8.572524      2.444802            0.627082
1    ALB_RF  0.908030       7.614959      2.465745            0.631742
2       AMM  0.810291      23.857643      2.152911            0.623630
3       AMR  0.814070      24.262232      2.146129            0.620680
4    ANK_PT  0.845598      31.292482      2.034746            0.584421
5    ASD_MP  0.694556      28.035803      2.084870            0.666859
6       BUO  0.918417       7.150495      2.476033            0.629077
7       BUW  1.150595       7.139084      2.476287            0.535355
8       CAM  0.851328      36.718536      1.956380            0.564845
9       DEM  0.618595      34.088989      1.993590            0.689708
10      HOC  0.914042       7.929507      2.458826            0.628261
11      HOH  0.918894       7.900739      2.459457            0.626384
12   ILP_PT  0.8

In [27]:
#gdf.info()
gdf.head(12)

Unnamed: 0,Site_no,Location_No,Site_ID,Longitude,Latitude,EPSG_32631_WGS 84_X_m,EPSG_32631_WGS 84_Y_m,Elevation_m,Reproj_X,Reproj_Y,...,Sand_0_5,Sand_5_15,Silt_0_5,Silt_5_15,SOM_2020_0_5,SOM_2020_5_15,SOM_2023_0_5,SOM_2023_5_15,PD_Estimated,Porosity_BIS4D_SOM
0,1,1,ALB_MS,5.902334,53.05356,694512.5721,5882167.358,1.1,189540.226441,563087.9994,...,23.344248,21.167017,36.525433,37.886478,7.128063,8.446636,6.866517,8.572524,2.444802,0.627082
1,2,1,ALB_RF,5.904631,53.053385,694667.2798,5882154.181,1.1,189694.395452,563069.672616,...,24.490902,22.932985,36.252594,37.92939,6.683191,7.866906,6.311794,7.614959,2.465745,0.631742
2,3,2,AMM,5.903505,53.031374,694691.0225,5879703.421,1.1,189636.423177,560619.697916,...,40.598873,37.945511,32.501518,34.526012,22.591763,21.922325,23.386671,23.857643,2.152911,0.62363
3,4,2,AMR,5.902991,53.032245,694652.6416,5879798.861,1.1,189601.24816,560716.354952,...,36.05582,35.372974,34.436462,38.006378,21.611992,23.177069,21.361217,24.262232,2.146129,0.62068
4,5,3,ANK_PT,5.097471,52.253916,643168.4419,5791352.667,-1.4,135216.304078,474025.870157,...,59.78759,60.178284,32.695492,32.490849,32.450554,31.433214,32.792191,31.292482,2.034746,0.584421
5,6,4,ASD_MP,4.739599,52.475256,618150.9361,5815322.395,-2.0,111000.265319,498810.39108,...,20.951584,15.46284,37.814907,37.549713,30.139071,28.023767,29.654894,28.035803,2.08487,0.666859
6,7,5,BUO,5.87357,53.100143,692377.3494,5887269.906,-1.3,187576.53642,568258.526888,...,21.747456,20.608526,34.705997,35.249905,6.567414,7.478717,6.731217,7.150495,2.476033,0.629077
7,8,5,BUW,5.862276,53.096044,691639.6611,5886783.816,-1.0,186823.092788,567797.350434,...,18.037428,17.188889,41.579384,42.065186,5.794461,7.227769,5.721648,7.139084,2.476287,0.535355
8,9,6,CAM,6.579765,53.154907,739334.9552,5895489.504,0.5,234774.504797,574903.342314,...,40.493046,40.540417,37.765743,38.148533,35.09264,38.643532,33.85619,36.718536,1.95638,0.564845
9,10,7,DEM,4.946176,52.201298,632999.5877,5785212.722,-2.0,124849.9535,468223.621202,...,50.325432,50.210812,29.516975,28.107689,36.229919,33.966351,37.140511,34.088989,1.99359,0.689708


#### Export the final dataframe

In [28]:
output_path = "C:/Data_MSc_Thesis/BIS_4D_Selected/NOBV_Point_Data_Extracted_V1.csv"
gdf.to_csv(output_path, index=False)

print(f"Data has been successfully exported to '{output_path}'")

Data has been successfully exported to 'C:/Data_MSc_Thesis/BIS_4D_Selected/NOBV_Point_Data_Extracted_V1.csv'
