# BiLSTM Model – CO₂ Prediction with GroupKFold CV

This notebook implements a Bidirectional LSTM (BiLSTM) deep learning model to predict surface seawater CO₂ concentrations (μatm) using multivariate time series data. The workflow includes preprocessing, windowing, GroupKFold cross-validation, feature importance (permutation method), and prediction export.

## Highlights of this version:

• **Input data:** 'Deception_2025_CO2_ocean_meteo_seismic.csv' should include oceanographic (seawater temperature and salinity), meteorological (solar radiation, air temperature and wind speed), geophysical (tidal height, seismic events), and spatio-temporal variables (Date, latitude and longitude).

• **Grouping:** Based on latitude bins to account for spatial structure during GroupKFold.

• **Model:** BiLSTM architecture trained per fold with early stopping.

• **Evaluation metrics:** RMSE and R² computed for each fold.

• **Feature importance:** Permutation-based increase in RMSE for each input variable.

• **Best fold selection:** Identified based on lowest RMSE.

## Outputs:

• `permutation_importance_fold{n}.csv` → RMSE increase per variable, one file per fold.

• `Deception_2025_CO2_prediction.csv` → Original spatio-temporal data and measured CO₂, plus a new column 'CO2_predicted' from the best-performing fold.

### Notes:
• Predicted values are descaled (inverse of z-score) before saving.

• Best fold predictions are aligned with the original input data.

• The model is ready for technical validation and integration into data descriptor papers or repositories.

---

**Author:** Susana Flecha, Instituto de Ciencias Marinas de Andalucía-Consejo Superior de Investigaciones Científicas (CSIC)

**Last updated:** September 2025

**More info:** This work has been carried out thanks to the data obtained from the DICHOSO project ([https://doi.org/10.20351/29HE20240312](https://doi.org/10.20351/29HE20240312)). This work was funded by the DICHOSO project (PID2021-125783OB-100). SF is a staff member hired under the Generation D initiative, promoted by Red.es, an organization affiliated with the Ministry for Digital Transformation and the Civil Service, for attracting and retaining talent through grants and training contracts, financed by the Recovery, Transformation, and Resilience Plan through the European Union’s Next Generation funds. This work contributes to the CSIC Interdisciplinary Thematic Platform, OCEANS+, and the Conexión PolarCSIC hub.


## Library Imports and Dependencies

This section imports all the necessary libraries required for the CO₂ prediction pipeline. Each library serves a specific purpose in the workflow:

### Core Data Processing Libraries:
- **pandas**: Data manipulation and analysis, particularly for handling CSV files and time series data
- **numpy**: Numerical computing operations, array manipulations, and mathematical functions

### Deep Learning Framework:
- **tensorflow**: Primary deep learning framework for building and training the BiLSTM model
- **tensorflow.keras**: High-level API for neural network construction, including:
  - `Sequential`: For building linear stack of layers
  - `Bidirectional`, `LSTM`: Core components for the bidirectional LSTM architecture
  - `Dense`, `Dropout`: Fully connected layers and regularization
  - `l2`: L2 regularization to prevent overfitting

### Machine Learning Utilities:
- **sklearn.model_selection.GroupKFold**: Cross-validation strategy that ensures groups (spatial locations) don't appear in both training and validation sets
- **sklearn.preprocessing.StandardScaler**: Feature scaling to normalize input variables
- **sklearn.metrics**: Model evaluation metrics (RMSE, R²)

### Model Interpretability and Visualization:
- **matplotlib.pyplot**: Plotting and visualization capabilities
- **copy**: Python utility for creating deep copies of objects

### Jupyter Environment Setup:
- `%matplotlib inline`: Magic command to display plots directly in the notebook

**Technical Note**: The TensorFlow optimization messages indicate that the library is configured to use available CPU instructions for better performance.


In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.model_selection import GroupKFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, Dense, Dropout
from tensorflow.keras.regularizers import l2
import matplotlib.pyplot as plt
import copy
%matplotlib inline


2025-09-11 17:41:44.062296: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2, in other operations, rebuild TensorFlow with the appropriate compiler flags.


## Reproducibility Configuration

This section establishes reproducible results by setting random seeds for all stochastic processes in the pipeline.

### Purpose and Importance:
Deep learning models, particularly neural networks, involve multiple sources of randomness:
- **Weight initialization**: Random starting values for neural network parameters
- **Data shuffling**: Random ordering of training samples
- **Dropout layers**: Random neuron deactivation during training
- **Permutation importance**: Random shuffling of feature values when scoring each variable (the `GroupKFold` splits themselves are deterministic by default)

### Implementation Details:
1. **NumPy random seed**: Controls randomness in data preprocessing and scientific computing operations
2. **TensorFlow random seed**: Ensures consistent neural network training across runs
3. **Seed value (42)**: Chosen as a conventional value in machine learning for reproducible experiments

### Benefits:
- **Scientific reproducibility**: Results can be exactly replicated by other researchers
- **Model comparison**: Fair evaluation of different model configurations
- **Debugging**: Consistent behavior helps identify and fix issues
- **Validation**: Enables verification of model improvements over baseline results

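**Note**: Fixing the seeds makes runs repeatable on the same software and hardware, but it does not guarantee bit-identical results across platforms or library versions, because some TensorFlow operations can be non-deterministic. The reported metrics also reflect a single seed; repeating the run with several seeds gives a sense of the variance.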
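
As a small illustration of what seeding provides, the sketch below reseeds NumPy's global generator and checks that the same permutation comes out each time. The helper name `seeded_permutation` is hypothetical, and the example covers NumPy only; TensorFlow operations are seeded separately with `tf.random.set_seed`.

```python
import numpy as np

def seeded_permutation(seed, n=10):
    """Return a permutation of range(n) drawn right after reseeding NumPy's global RNG."""
    np.random.seed(seed)
    return np.random.permutation(n)

first = seeded_permutation(42)
second = seeded_permutation(42)

# Same seed, same draw: the two permutations are identical.
print(np.array_equal(first, second))
# Each draw is still a valid permutation of 0..9.
print(sorted(first.tolist()) == list(range(10)))
```

The same mechanism is what makes the per-fold permutation importance repeatable when a fold-specific seed is set before shuffling.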


In [2]:
# Set random seed for reproducibility
seed = 42
np.random.seed(seed)
tf.random.set_seed(seed)

## Data Loading and Temporal Feature Engineering

This section loads the multivariate environmental dataset and creates normalized temporal features essential for time series modeling.

### Input Data Structure:
The dataset `Deception_2025_CO2_ocean_meteo_seismic.csv` contains:
- **Oceanographic variables**: Seawater temperature, salinity
- **Meteorological variables**: Solar radiation, air temperature, wind speed
- **Geophysical variables**: Tidal elevation, seismic events
- **Spatio-temporal variables**: Date, latitude, longitude
- **Target variable**: CO₂ concentrations (μatm)

### Data Loading Process:
1. **CSV import**: Uses semicolon separator (`;`) as specified in the data format
2. **Date parsing**: Converts string dates to pandas datetime objects for temporal operations
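   - *Warning note*: The parsing warning in the cell output indicates that pandas had to infer the date format rather than use an explicit one; the parsed dates should be spot-checked, and passing `format=` (or `dayfirst=True` for day-first dates) to `pd.to_datetime` removes the ambiguity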
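
A minimal sketch of explicit date parsing follows. The timestamp strings and the `%d/%m/%Y %H:%M` format are assumptions made for illustration; the actual format in the CSV may differ and should be checked against the file.

```python
import pandas as pd

# Hypothetical day-first timestamps; the real file's format may differ.
raw = pd.Series(["03/02/2025 14:30", "04/02/2025 00:10"])

# An explicit format avoids format inference and the warning that comes with it.
parsed = pd.to_datetime(raw, format="%d/%m/%Y %H:%M")

print(parsed.dt.day.tolist(), parsed.dt.month.tolist(), parsed.dt.hour.tolist())
```

With the format stated, "03/02/2025" is read as 3 February rather than 2 March, which matters for the month and day-of-year features derived later.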

### Temporal Feature Engineering:
Creates normalized temporal features to capture cyclical patterns:

#### 1. **Hour normalization** (`hour / 23.0`):
   - **Range**: 0.0 to 1.0 representing 24-hour cycle
   - **Purpose**: Captures diurnal CO₂ variations due to photosynthesis and respiration patterns
   - **Biological relevance**: Marine CO₂ levels fluctuate with light availability

#### 2. **Month normalization** (`(month - 1) / 11.0`):
   - **Range**: 0.0 to 1.0 representing annual cycle
   - **Purpose**: Captures seasonal CO₂ variations in Antarctic waters
   - **Environmental relevance**: Seasonal changes in ice cover, biological productivity, and temperature

#### 3. **Day of year normalization** (`dayofyear / 365.0`):
   - **Range**: 1/365 to 1.0 (366/365 on the last day of a leap year), representing yearly progression
   - **Purpose**: Provides fine-grained seasonal information beyond monthly patterns
   - **Advantage**: Captures gradual transitions and specific timing of environmental events

### Normalization Benefits:
- **Neural network compatibility**: Scaled features (0-1 range) improve training stability
- **Equal weight**: Prevents temporal features from dominating due to scale differences
- **Caveat**: Linear scaling does not encode periodicity; hour 23 and hour 0 map to opposite ends of the range even though they are adjacent in time

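**Technical Note**: These temporal features are intended to describe short-term (daily) and long-term (seasonal) patterns in CO₂ dynamics. In this version they are computed but not included in the model's feature list, so they influence the BiLSTM only if they are added to `features`.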
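
To make the three normalizations concrete, the sketch below applies the same formulas to a few illustrative timestamps (not taken from the dataset).

```python
import pandas as pd

# Illustrative timestamps only, not taken from the dataset.
dates = pd.to_datetime(pd.Series([
    "2025-01-01 00:00", "2025-01-01 23:00", "2025-12-31 12:00",
]))

temporal = pd.DataFrame({
    "hour": dates.dt.hour / 23.0,            # 0.0 at midnight, 1.0 at 23:00
    "month": (dates.dt.month - 1) / 11.0,    # 0.0 in January, 1.0 in December
    "doy": dates.dt.dayofyear / 365.0,       # 1/365 on 1 January
})
print(temporal)
```

The first two rows are 23 hours apart and sit at opposite ends of the `hour` range, while 23:00 and the following midnight would be one hour apart yet equally far in this encoding.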


In [3]:
# Load data
df = pd.read_csv("Deception_2025_CO2_ocean_meteo_seismic.csv", sep=";")
df['Date'] = pd.to_datetime(df['Date'])

# Normalized temporal variables
df['hour'] = df['Date'].dt.hour / 23.0
df['month'] = (df['Date'].dt.month - 1) / 11.0
df['doy'] = df['Date'].dt.dayofyear / 365.0


  df['Date'] = pd.to_datetime(df['Date'])


## Feature Selection and Target Variable Definition

This section defines the predictor variables (features) and target variable for the CO₂ prediction model, establishing the core relationship to be learned.

### Selected Features (Input Variables):
The model uses 10 carefully chosen environmental and geophysical variables:

#### **Oceanographic Features**:
1. **Seawater_Temperature**: Direct influence on CO₂ solubility (higher temperature = lower solubility)
2. **Salinity**: Affects CO₂ solubility and chemical equilibrium in seawater

#### **Spatial Features**:
3. **Latitude**: Geographic position affecting solar radiation, ice dynamics, and biological activity
4. **Longitude**: Spatial variation in oceanographic conditions around Deception Island

#### **Meteorological Features**:
5. **Solar_radiation**: Drives photosynthesis, affecting biological CO₂ uptake/release
6. **Wind_speed**: Influences air-sea CO₂ exchange rates through surface turbulence
7. **Air_temperature**: Correlates with seawater temperature and atmospheric CO₂ partial pressure

#### **Geophysical Features**:
8. **Tidal_elevation**: Affects water mass mixing and CO₂ transport
9. **Seismic_events**: Volcanic activity can release CO₂, particularly relevant at Deception Island
10. **Seismic_events_10_min_avg**: Smoothed seismic activity to capture sustained geological influence

### Target Variable:
- **CO2**: Surface seawater CO₂ concentration in μatm (microatmospheres)
  - **Scientific significance**: Key parameter for ocean acidification and carbon cycle studies
  - **Range**: Typically 200-600 μatm in marine environments
  - **Measurement**: Represents partial pressure of CO₂ in equilibrium with seawater

### Data Quality Considerations:

#### **Missing Value Handling**:
- **Current approach**: The `dropna()` call is commented out, so all rows are kept
- **Caveat**: Keras LSTM layers do not handle `NaN` inputs; a single missing value in a window propagates to the prediction and the loss, so this setting assumes the selected columns contain no missing values
- **Alternative**: Enable `dropna()` (or impute) if any feature or target column has missing values
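
One way to gauge the impact of missing values before training is to count how many sliding windows would contain at least one `NaN`. The sketch below does this on a toy matrix; `count_nan_windows` is a hypothetical helper, and the indexing assumes the same `range(len(X_raw) - window_size)` loop used later in the notebook.

```python
import numpy as np

def count_nan_windows(X_raw, window_size):
    """Count sliding windows that contain at least one NaN in any feature."""
    n_windows = len(X_raw) - window_size
    row_has_nan = np.isnan(X_raw).any(axis=1)
    return sum(
        bool(row_has_nan[i:i + window_size].any()) for i in range(n_windows)
    )

# Toy feature matrix: 8 rows, 2 features, one missing value in row 3.
X_toy = np.arange(16, dtype=float).reshape(8, 2)
X_toy[3, 1] = np.nan

n_affected = count_nan_windows(X_toy, window_size=3)
print(n_affected)
```

A single missing cell contaminates every window that overlaps its row, so the number of affected samples can be up to `window_size` times the number of incomplete rows.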

#### **Outlier Filtering**:
- **CO₂ threshold filter**: Optional filter for CO₂ < 400 μatm (currently disabled)
- **Purpose**: Removes potential outliers or extreme values
- **Consideration**: Deception Island's volcanic nature may produce legitimately high CO₂ values

### Feature Engineering Strategy:
- **No additional temporal features**: The previously created normalized temporal variables (hour, month, doy) are not included in the main feature set
- **Raw environmental data**: Focus on direct environmental measurements rather than derived features
- **Multivariate approach**: Leverages interactions between different environmental factors

**Modeling Rationale**: This feature set captures the primary physical, chemical, and biological drivers of CO₂ variability in Antarctic coastal waters, providing comprehensive environmental context for accurate predictions.


In [4]:
# Define predictor columns and target variable
features = ["Seawater_Temperature", "Salinity", "Latitude", "Longitude", "Solar_radiation",
            "Tidal_elevation", "Wind_speed", "Air_temperature", "Seismic_events", "Seismic_events_10_min_avg"]

target_var = "CO2"

# Eliminate NaNs
# df = df.dropna(subset=features + [target_var])

# CO2 lower than 400 filter (uncomment if desired)
# df = df[df[target_var] < 400]


## Time Series Windowing and Sequence Preparation

This section transforms the time series data into sequences suitable for LSTM modeling through a sliding window approach, essential for capturing temporal dependencies in CO₂ dynamics.

### Windowing Methodology:

#### **Window Size Configuration**:
- **Window size**: 10 time steps
- **Rationale**: Captures short to medium-term temporal patterns while maintaining computational efficiency
- **Time span**: Represents 10 consecutive measurements in the time series
- **Balance**: Long enough to capture temporal trends, short enough to avoid overly complex patterns

#### **Sliding Window Process**:
The algorithm creates overlapping sequences where each training example consists of:
1. **Input sequence (X)**: 10 consecutive time steps of environmental features
2. **Target value (y)**: CO₂ concentration at the next time step (step 11)
3. **Group identifier**: Spatial grouping based on latitude for cross-validation

### Data Preparation Steps:

#### **1. Raw Data Extraction**:
- **X_raw**: Feature matrix containing all environmental variables
- **y_raw**: Target vector containing CO₂ concentrations
- **Format**: Converted to NumPy arrays for efficient processing

#### **2. Sequence Generation Loop**:
```python
for i in range(len(X_raw) - window_size):
    X.append(X_raw[i:i + window_size])     # 10 time steps of features
    y.append(y_raw[i + window_size])       # Next time step CO₂ value
    groups.append(int(df.iloc[i + window_size]["Latitude"] * 500))  # Latitude bin of the target row
```

#### **3. Spatial Grouping Strategy**:
- **Group calculation**: `int(Latitude * 500)`, evaluated on the target row of each window
- **Purpose**: Creates discrete spatial bins for GroupKFold cross-validation
- **Rationale**: Keeps samples whose target falls in the same latitude bin on the same side of every train/validation split; because consecutive windows overlap, neighbouring samples in different bins can still share input rows
- **Scale factor (500)**: Gives bins 0.002° of latitude wide (roughly 220 m)
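
The binning can be checked by hand on a few values. The latitudes below are illustrative (chosen to lie near Deception Island, not read from the dataset), and the 111 km per degree figure is the usual approximation for latitude.

```python
# Illustrative latitudes; not taken from the dataset.
latitudes = [-62.9701, -62.9705, -62.9721]

# int() truncates toward zero, so the first two values share a bin.
bins = [int(lat * 500) for lat in latitudes]
print(bins)

bin_width_deg = 1 / 500                 # 0.002 degrees of latitude
bin_width_m = bin_width_deg * 111_000   # about 111 km per degree of latitude
print(round(bin_width_m))
```

Points less than about 220 m apart in latitude can still fall into different bins when they straddle a bin edge, which is a general property of fixed-width binning.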

### Output Data Structures:

#### **X (Input sequences)**:
- **Shape**: (n_sequences, window_size, n_features) = (n_sequences, 10, 10)
- **Content**: 3D array where each sequence contains 10 time steps of 10 environmental features
- **Purpose**: Provides temporal context for CO₂ prediction

#### **y (Target values)**:
- **Shape**: (n_sequences, 1)
- **Content**: CO₂ concentration to be predicted for each sequence
- **Alignment**: Corresponds to the time step immediately following each input sequence

#### **groups (Spatial identifiers)**:
- **Shape**: (n_sequences,)
- **Content**: Integer group identifiers for spatial cross-validation
- **Function**: Ensures spatial independence between training and validation sets

### DataFrame Alignment:
- **df_aligned**: Original DataFrame adjusted to match the windowed data
- **Adjustment**: Removes first `window_size` rows since they cannot form complete sequences
- **Purpose**: Maintains correspondence between windowed data and original timestamps/metadata
- **Index reset**: Ensures clean indexing for downstream operations

### LSTM Compatibility:
This windowing approach creates the 3D input structure (samples, time_steps, features) required by LSTM layers, enabling the model to:
- **Learn temporal patterns**: Understand how environmental conditions evolve over time
- **Capture dependencies**: Identify relationships between past conditions and future CO₂ levels
- **Leverage sequence information**: Utilize the ordering and timing of environmental changes

**Technical Note**: The windowing process reduces the total number of samples by `window_size`, but each remaining sample now contains rich temporal information essential for accurate CO₂ prediction.
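
The shapes described above can be verified on toy data. The sketch below builds the windows with the same loop as the notebook and, as an optional alternative that is not part of the original workflow, with NumPy's `sliding_window_view`; both give identical arrays.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

window_size = 3
# Toy data: 7 time steps, 2 features; the target is a single column.
X_raw = np.arange(14, dtype=float).reshape(7, 2)
y_raw = np.arange(7, dtype=float).reshape(7, 1)

# Loop form, as in the notebook.
n_windows = len(X_raw) - window_size
X_loop = np.array([X_raw[i:i + window_size] for i in range(n_windows)])
y_loop = np.array([y_raw[i + window_size] for i in range(n_windows)])

# Vectorized form: windows along axis 0, reordered to (samples, time, features).
# The last window is dropped because no target row follows it.
X_fast = sliding_window_view(X_raw, window_size, axis=0).transpose(0, 2, 1)[:-1]
y_fast = y_raw[window_size:]

print(X_loop.shape, y_loop.shape)
```

With 7 rows and a window of 3, four samples remain, each pairing three consecutive rows of features with the target from the row that follows.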


In [5]:
# Create X, y, groups
X_raw = df[features].values
y_raw = df[[target_var]].values

# Create windows without scaling
window_size = 10
X, y, groups = [], [], []

for i in range(len(X_raw) - window_size):
    X.append(X_raw[i:i + window_size])
    y.append(y_raw[i + window_size])
    groups.append(int(df.iloc[i + window_size]["Latitude"] * 500))

X = np.array(X)
y = np.array(y)
groups = np.array(groups)

# ✅ Align the original DataFrame with X, y, groups (due to windowing)
df_aligned = df.iloc[window_size:].reset_index(drop=True)


## BiLSTM Model Architecture Definition

This section defines the Bidirectional Long Short-Term Memory (BiLSTM) neural network architecture optimized for CO₂ time series prediction.

### Model Architecture Overview:

The model follows a deep learning architecture specifically designed for sequential environmental data:

#### **Layer 1: First Bidirectional LSTM**
- **Configuration**: `Bidirectional(LSTM(64, return_sequences=True))`
- **Units**: 64 LSTM cells in each direction (forward + backward) = 128 total outputs
- **Bidirectional advantage**: Processes sequences in both temporal directions
  - **Forward pass**: Learns from past → present patterns
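  - **Backward pass**: Reads the same input window in reverse (last step → first step); it never sees data beyond the window
- **return_sequences=True**: Outputs full sequence for next layer input
- **L2 regularization (0.001)**: Penalizes large input-kernel weights (`kernel_regularizer`) to limit overfitting; the recurrent weights are not regularized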

#### **Layer 2: Dropout Regularization**
- **Rate**: 30% of neurons randomly deactivated during training
- **Purpose**: Prevents overfitting and improves generalization
- **Mechanism**: Forces model to learn robust patterns rather than memorizing training data

#### **Layer 3: Second Bidirectional LSTM**
- **Configuration**: `Bidirectional(LSTM(64, return_sequences=True))`
- **Function**: Learns higher-level temporal abstractions from first layer
- **Depth benefit**: Captures complex, multi-scale temporal dependencies
- **Same regularization**: L2(0.001) for consistent overfitting prevention

#### **Layer 4: Second Dropout**
- **Consistent regularization**: Same 30% dropout rate
- **Stacked approach**: Multiple dropout layers for robust regularization

#### **Layer 5: Final Bidirectional LSTM**
- **Configuration**: `Bidirectional(LSTM(32))`
- **Reduced units**: 32 cells (64 total outputs) for computational efficiency
- **return_sequences=False**: Outputs only final time step representation
- **Feature compression**: Distills temporal information into compact representation

#### **Layer 6: Dense Layer**
- **Units**: 64 neurons with ReLU activation
- **Purpose**: Non-linear transformation of LSTM features
- **ReLU advantage**: Cheap to compute and less prone to vanishing gradients than saturating activations

#### **Layer 7: Output Layer**
- **Units**: 1 neuron (single CO₂ prediction)
- **Activation**: Linear (no activation function)
- **Output**: Continuous CO₂ concentration value
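
The size of this architecture can be estimated by hand. The sketch below counts trainable parameters layer by layer, assuming the standard LSTM parameterization with biases (four gates, each with input, recurrent and bias weights) that Keras uses by default. The helper names are hypothetical, and `model.summary()` on the actual model remains the reference.

```python
def lstm_params(input_dim, units):
    """Parameters of one LSTM: 4 gates, each with input, recurrent and bias weights."""
    return 4 * (units * (input_dim + units) + units)

def bilstm_params(input_dim, units):
    """A bidirectional wrapper holds two independent LSTMs."""
    return 2 * lstm_params(input_dim, units)

def dense_params(input_dim, units):
    """Weights plus one bias per unit."""
    return input_dim * units + units

n_features = 10
layers = {
    "bilstm_1": bilstm_params(n_features, 64),  # outputs 128 values per step
    "bilstm_2": bilstm_params(128, 64),         # outputs 128 values per step
    "bilstm_3": bilstm_params(128, 32),         # outputs 64 values
    "dense": dense_params(64, 64),
    "output": dense_params(64, 1),
}
total = sum(layers.values())
print(layers, total)
```

Dropout layers add no parameters, so under these assumptions the network has on the order of 180,000 trainable weights, most of them in the second BiLSTM layer.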

### Model Compilation Settings:

#### **Optimizer: RMSprop**
- **Learning rate**: 0.0005 (conservative for stable training)
- **Advantage**: Adaptive learning rates, good for RNN training
- **Stability**: Helps prevent gradient explosion in deep recurrent networks

#### **Loss Function: Mean Squared Error (MSE)**
- **Purpose**: Penalizes prediction errors quadratically
- **Suitability**: Standard choice for regression problems
- **Behavior**: Heavily penalizes large errors, encouraging accurate predictions

#### **Metrics: Root Mean Squared Error (RMSE)**
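- **Units**: Standardized (z-score) units during training, because the target is scaled per fold; multiply by the training-set standard deviation of CO₂ to express it in μatm
- **Interpretability**: Once converted to μatm, directly comparable to CO₂ measurement precision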
- **Monitoring**: Real-time training progress assessment

### Design Rationale:

#### **Why Bidirectional LSTMs?**
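1. **Full window context**: Each time step's representation draws on both earlier and later steps within the 10-step input window
2. **Pattern recognition**: Better identification of temporal patterns
3. **Environmental relevance**: The state at a given step is interpreted in light of the conditions before and after it in the window; all steps still precede the predicted time step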

#### **Why Multiple LSTM Layers?**
1. **Hierarchical learning**: Different layers capture different temporal scales
2. **Complex patterns**: Deep architecture handles non-linear environmental relationships
3. **Feature abstraction**: Progressive refinement of temporal representations

#### **Why This Specific Configuration?**
1. **64→64→32 progression**: Gradual feature compression while maintaining complexity
2. **Regularization balance**: L2 + Dropout prevents overfitting without under-training
3. **Conservative learning rate**: Ensures stable convergence for time series data

### Expected Model Behavior:
- **Input**: Sequences of 10 time steps × 10 environmental features
- **Processing**: Multi-scale temporal pattern extraction
- **Output**: Single CO₂ concentration prediction (μatm)
- **Strength**: Captures both short-term fluctuations and longer-term trends

**Technical Note**: This architecture balances model complexity with training stability, making it well-suited for the challenging task of predicting CO₂ dynamics in the variable Antarctic marine environment.


In [6]:
# Build and compile the BiLSTM model
def create_model(input_shape):
    model = Sequential([
        Bidirectional(LSTM(64, return_sequences=True, kernel_regularizer=l2(0.001)), input_shape=input_shape),
        Dropout(0.3),
        Bidirectional(LSTM(64, return_sequences=True, kernel_regularizer=l2(0.001))),
        Dropout(0.3),
        Bidirectional(LSTM(32, kernel_regularizer=l2(0.001))),
        Dense(64, activation='relu'),
        Dense(1, activation='linear')
    ])
    model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.0005),
                  loss='mse',
                  metrics=[tf.keras.metrics.RootMeanSquaredError()])
    return model


## GroupKFold Cross-Validation and Model Training

This section implements a comprehensive cross-validation strategy with model training, evaluation, and feature importance analysis specifically designed for spatial-temporal data.

### Cross-Validation Strategy: GroupKFold

#### **Why GroupKFold?**
- **Spatial independence**: Prevents data leakage by ensuring spatially related samples don't appear in both training and validation sets
- **Realistic evaluation**: Mimics real-world scenario where predictions are needed for new spatial locations
- **Group definition**: Uses latitude-based spatial bins to define independent groups
- **Splits**: 5-fold cross-validation for robust performance estimation

#### **Validation Strategy Benefits**:
1. **Unbiased performance estimates**: Each fold tests on truly independent spatial regions
2. **Generalization assessment**: Evaluates model's ability to predict at new locations
3. **Robustness testing**: Multiple folds reveal model stability across different spatial configurations
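
The group-separation property can be checked directly. The sketch below runs `GroupKFold` on toy arrays with ten hypothetical latitude bins and counts, for each fold, how many groups appear on both sides of the split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 20 samples spread over 10 hypothetical latitude bins.
X_toy = np.zeros((20, 3, 2))
y_toy = np.zeros(20)
groups_toy = np.repeat(np.arange(10), 2)

gkf = GroupKFold(n_splits=5)
overlaps = []
n_val_total = 0
for train_idx, val_idx in gkf.split(X_toy, y_toy, groups_toy):
    shared = set(groups_toy[train_idx]) & set(groups_toy[val_idx])
    overlaps.append(len(shared))
    n_val_total += len(val_idx)

print(overlaps, n_val_total)
```

Every sample is validated exactly once, and no group is split between training and validation in any fold.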

### Training Pipeline for Each Fold:

#### **Step 1: Data Splitting**
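- **Training set**: Samples whose latitude bins are assigned to the other four folds (roughly 80% of samples)
- **Validation set**: Samples whose latitude bins are assigned to the held-out fold (roughly 20% of samples)
- **Preservation**: Index order is kept within each subset; batches are shuffled during training, but the time order inside each window is unchanged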

#### **Step 2: Feature Scaling (Critical for Neural Networks)**
```python
# Separate scalers for features and target
scaler_X = StandardScaler()  # Features: mean=0, std=1
scaler_y = StandardScaler()  # Target: normalized CO₂ values
```

**Scaling Process**:
1. **Fit on training data only**: Prevents data leakage from validation set
2. **3D array handling**: Reshapes sequences for StandardScaler compatibility
3. **Transform both sets**: Applies same scaling to training and validation data
4. **Target scaling**: Normalizes CO₂ values for stable neural network training
5. **Scaler preservation**: Stores scalers for inverse transformation
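
The 3D reshaping step can be sketched on synthetic windows as follows. The array sizes and random values are placeholders; the reshape pattern is the one used in the training cell.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic windows with shape (samples, time steps, features).
X_train_raw = rng.normal(loc=5.0, scale=2.0, size=(50, 10, 3))
X_val_raw = rng.normal(loc=5.0, scale=2.0, size=(20, 10, 3))

n_features = X_train_raw.shape[2]
scaler_X = StandardScaler()

# Flatten samples and time steps into rows so that each feature is one column.
X_train = scaler_X.fit_transform(
    X_train_raw.reshape(-1, n_features)
).reshape(X_train_raw.shape)
# The validation set reuses the training statistics; the scaler is never fitted on it.
X_val = scaler_X.transform(
    X_val_raw.reshape(-1, n_features)
).reshape(X_val_raw.shape)

print(X_train.mean(axis=(0, 1)), X_train.std(axis=(0, 1)))
```

After scaling, each training feature has mean 0 and standard deviation 1 across all samples and time steps, while the validation features are only approximately standardized.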

#### **Step 3: Model Training Configuration**
- **Architecture**: Fresh BiLSTM model instance for each fold
- **Input shape**: `(window_size, n_features)` = (10, 10)
- **Training epochs**: Maximum 200 epochs
- **Batch size**: 32 samples per batch for efficient training
- **Data shuffling**: Enabled to prevent batch-level biases

#### **Step 4: Early Stopping Strategy**
- **Monitor**: Validation loss (`val_loss`)
- **Patience**: 15 epochs without improvement
- **Restoration**: Automatically restores weights from best epoch
- **Purpose**: Prevents overfitting while maximizing training effectiveness
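
The patience rule can be illustrated without training a model. The function below is a simplified sketch of the logic, not the Keras implementation, and the loss values are made up.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (best_epoch, stopped_epoch) as 0-based indices.

    Simplified rule: stop once `patience` consecutive epochs fail to
    improve on the best validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Made-up validation losses for illustration.
losses = [1.00, 0.80, 0.70, 0.75, 0.72, 0.71, 0.90]
best_epoch, stopped_epoch = simulate_early_stopping(losses, patience=3)
print(best_epoch, stopped_epoch)
```

With `restore_best_weights=True`, the weights kept after stopping are those from the best epoch rather than from the epoch at which training halted.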

### Model Evaluation Metrics:

#### **Primary Metric: RMSE (Root Mean Squared Error)**
- **Calculation**: `sqrt(mean((y_true - y_pred)²))`
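- **Units**: The per-fold `val_rmse` is computed on the scaled target (z-score units); the permutation-importance baseline RMSE is computed after inverse scaling (μatm)
- **Interpretation**: Typical prediction error magnitude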
- **Advantage**: Penalizes large errors more heavily than small ones
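
Because the target is z-scored, an RMSE computed on scaled values differs from the RMSE in μatm by a factor equal to the standard deviation used for scaling (the `scale_` attribute of a fitted `StandardScaler`). The sketch below checks this on synthetic values.

```python
import numpy as np

# Synthetic CO2-like values in μatm, for illustration only.
y_true = np.array([380.0, 395.0, 410.0, 402.0, 388.0])
y_pred = np.array([384.0, 390.0, 405.0, 407.0, 391.0])

def rmse(a, b):
    """Root mean squared error between two 1D arrays."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

# z-score both arrays with the same mean and (population) standard deviation.
mean, std = y_true.mean(), y_true.std()
rmse_scaled = rmse((y_true - mean) / std, (y_pred - mean) / std)
rmse_uatm = rmse(y_true, y_pred)

print(rmse_scaled, rmse_uatm, rmse_scaled * std)
```

Multiplying the scaled RMSE by the standard deviation recovers the RMSE in μatm, so per-fold scores in scaled units are only comparable across folds after this conversion.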

### Feature Importance Analysis (Permutation Method):

#### **Methodology**:
1. **Baseline RMSE**: Calculate model performance on original validation data
2. **Feature permutation**: Randomly shuffle each feature's values across validation samples, independently at each time step of the window
3. **Degraded performance**: Measure RMSE increase after permutation
4. **Importance score**: `Permuted_RMSE - Baseline_RMSE`
5. **Interpretation**: Higher values indicate more important features

#### **Permutation Process**:
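```python
for i, var in enumerate(features):
    X_val_perm = np.array(X_val, copy=True)
    for t in range(window_size):
        # Shuffle feature i across validation samples at time step t
        X_val_perm[:, t, i] = np.random.permutation(X_val_perm[:, t, i])
    y_perm_pred = scaler_y.inverse_transform(model.predict(X_val_perm)).flatten()
    perm_rmse = np.sqrt(mean_squared_error(y_val_inv, y_perm_pred))
    importance_fold[var] = perm_rmse - baseline_rmse  # RMSE increase in μatm
```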

#### **Output**: 
- **Per-fold CSV files**: `permutation_importance_fold{n}.csv`
- **Contents**: Variable names and RMSE increase values
- **Units**: μatm (same as prediction error)
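
The behaviour of the permutation score can be seen on a toy problem in which the answer is known in advance. In the sketch below, `toy_predict` is a hypothetical stand-in for `model.predict` that uses only feature 0, so shuffling feature 1 should leave the RMSE unchanged.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, window_size, n_features = 200, 5, 2
X_val = rng.normal(size=(n_samples, window_size, n_features))

def toy_predict(X):
    """Stand-in for model.predict: uses only feature 0 at the last time step."""
    return 2.0 * X[:, -1, 0]

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

y_val = toy_predict(X_val)  # noise-free target, so the baseline RMSE is 0
baseline = rmse(y_val, toy_predict(X_val))

importance = {}
for i in range(n_features):
    X_perm = X_val.copy()
    for t in range(window_size):
        # Shuffle feature i across samples, independently at each time step.
        X_perm[:, t, i] = rng.permutation(X_perm[:, t, i])
    importance[i] = rmse(y_val, toy_predict(X_perm)) - baseline

print(importance)
```

The feature the toy model relies on gets a positive score and the ignored feature gets exactly zero. With a trained network, scores near zero or slightly negative would similarly suggest that a variable contributes little on that fold.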

### Best Model Selection:

#### **Selection Criteria**:
- **Metric**: Lowest validation RMSE across all folds
- **Storage**: Results kept for the best fold (the trained model itself is not stored):
  - True validation values (`y_true`, in scaled units)
  - Predicted validation values (`y_pred`, in scaled units)
  - Fold identifier
  - Validation indices
  - Scaler object for inverse transformation

#### **Purpose**:
- **Final predictions**: Use best-performing model for output generation
- **Model deployment**: Identify optimal configuration for production use
- **Performance reporting**: The best fold is a best-case figure; the mean and spread across folds give a less optimistic summary
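
Selecting the best fold and summarizing all folds can be sketched as below. The RMSE values are hypothetical placeholders, not results from this notebook.

```python
import numpy as np

# Hypothetical per-fold validation RMSE values, not results from this notebook.
val_rmse_list = [0.52, 0.47, 0.61, 0.44, 0.58]

best_fold = int(np.argmin(val_rmse_list))  # 0-based index of the lowest RMSE
mean_rmse = float(np.mean(val_rmse_list))
std_rmse = float(np.std(val_rmse_list))

print(f"Best fold: {best_fold + 1} (RMSE = {val_rmse_list[best_fold]:.2f})")
print(f"Mean ± std across folds: {mean_rmse:.3f} ± {std_rmse:.3f}")
```

Reporting both numbers side by side makes clear how much the best fold differs from typical performance.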

### Data Tracking and Storage:

#### **Stored Information per Fold**:
- **Validation RMSE**: Performance metric for each fold
- **Training history**: Loss curves and metrics over epochs
- **Predictions**: True vs. predicted values for analysis
- **Validation indices**: Sample identification for result alignment
- **Scalers**: For proper inverse transformation

#### **DataFrame Alignment**:
- **df_val**: Rows of `df_aligned` selected with the validation indices of the final fold (which is not necessarily the best fold)
- **Purpose**: Maintains correspondence between predictions and original data
- **Applications**: Result interpretation and spatial analysis
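
The alignment between predictions and the original rows can be sketched on a toy table. The frame, indices and prediction values below are stand-ins for `df_aligned`, a fold's `val_idx` and its de-scaled predictions.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for df_aligned and one fold's validation results.
df_aligned = pd.DataFrame({"CO2": [400.0, 410.0, 405.0, 398.0, 402.0]})
val_idx = np.array([1, 4])
y_pred_uatm = np.array([409.0, 401.5])  # hypothetical de-scaled predictions

# Positional selection with val_idx, then attach predictions in the same order.
df_val = df_aligned.iloc[val_idx].copy()
df_val["CO2_predicted"] = y_pred_uatm

print(df_val)
```

This works because `df_aligned` was reset to a 0-based index after windowing, so positions in `X`, `y` and `df_aligned` refer to the same samples.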

### Technical Considerations:

#### **Scaling Strategy**:
- **Fold-specific scaling**: Each fold uses independent scalers to prevent information leakage
- **Target scaling**: Essential for neural network convergence and stability
- **Inverse transformation**: Required for interpretable results in original units

#### **Memory Management**:
- **Sequential processing**: One fold at a time to manage memory usage
- **Result accumulation**: Stores only essential information per fold
- **Model disposal**: The `model` variable is reassigned on each fold; no explicit `tf.keras.backend.clear_session()` call is made

#### **Reproducibility**:
- **Permutation seeds**: `42 + fold` ensures consistent importance calculations
- **Training shuffle**: Controlled randomness for reproducible training

**Expected Outcome**: This comprehensive validation approach provides robust, unbiased estimates of model performance and feature importance, ensuring the BiLSTM model's reliability for CO₂ prediction in new spatial locations.


In [7]:
# Cross-validation with GroupKFold

gkf = GroupKFold(n_splits=5)
val_rmse_list = []
all_y_true = []
all_y_pred = []
history_list = []
scalers_y = []  
val_indices_per_fold = []
best_result = None

for fold, (train_idx, val_idx) in enumerate(gkf.split(X, y, groups)):
    print(f"\n📦 Fold {fold + 1}")

    # Split data into training and validation sets
    X_train_raw, X_val_raw = X[train_idx], X[val_idx]
    y_train_raw, y_val_raw = y[train_idx], y[val_idx]

    # ⚠️ Fold scaling
    scaler_X = StandardScaler()
    scaler_y = StandardScaler()

    X_train = scaler_X.fit_transform(X_train_raw.reshape(-1, X.shape[2])).reshape(X_train_raw.shape)
    X_val   = scaler_X.transform(X_val_raw.reshape(-1, X.shape[2])).reshape(X_val_raw.shape)

    y_train = scaler_y.fit_transform(y_train_raw.reshape(-1, 1)).flatten()
    y_val   = scaler_y.transform(y_val_raw.reshape(-1, 1)).flatten()

    scalers_y.append(scaler_y)  # Keep for the inverse transform later
    
    val_indices_per_fold.append(val_idx)

    # Model creation and training
    model = create_model((window_size, X.shape[2]))

    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor='val_loss', patience=15, restore_best_weights=True
    )

    history = model.fit(
        X_train, y_train,
        epochs=200, batch_size=32,
        validation_data=(X_val, y_val),
        callbacks=[early_stop],
        verbose=1, shuffle=True
    )

    history_list.append(history.history)

    y_pred = model.predict(X_val)
    # RMSE on the scaled target (z-score units), not μatm
    val_rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    val_rmse_list.append(val_rmse)
    all_y_true.append(y_val)
    all_y_pred.append(y_pred)


    print("📊 Calculating permutation importance...")

    # 🧪 Set a fold-specific seed for the permutations
    np.random.seed(42 + fold)

    # Base prediction in real units
    y_val_inv = scaler_y.inverse_transform(y_val.reshape(-1, 1))
    y_pred_inv = scaler_y.inverse_transform(model.predict(X_val)).flatten()
    baseline_rmse = np.sqrt(mean_squared_error(y_val_inv, y_pred_inv))

    # Permutation importance per variable
    importance_fold = {}
    for i, var in enumerate(features):
        X_val_perm = np.array(X_val, copy=True)
        for t in range(window_size):
            X_val_perm[:, t, i] = np.random.permutation(X_val_perm[:, t, i])

        # Permutation importance in real units
        y_perm_pred = scaler_y.inverse_transform(model.predict(X_val_perm)).flatten()
        perm_rmse = np.sqrt(mean_squared_error(y_val_inv, y_perm_pred))
        importance_fold[var] = perm_rmse - baseline_rmse

    # 🔽 Save results CSV
    df_perm_fold = pd.DataFrame({
        "variable": list(importance_fold.keys()),
        "RMSE_increase": list(importance_fold.values())
    })
    csv_name = f"permutation_importance_fold{fold+1}.csv"
    df_perm_fold.to_csv(csv_name, index=False)
    print(f"✅ Saved: {csv_name}")

  
    last_val_idx = val_idx

    if best_result is None or val_rmse < min(val_rmse_list[:-1]):
        best_result = {
            "y_true": y_val,
            "y_pred": y_pred,
            "fold": fold,
            "val_idx": val_idx,
            "scaler_y": scaler_y
        }

df_val = df_aligned.iloc[last_val_idx].copy()  # ✅ Correctly aligned DataFrame for the last fold


📦 Fold 1


2025-09-11 17:43:28.339658: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'gradients/split_2_grad/concat/split_2/split_dim' with dtype int32
	 [[{{node gradients/split_2_grad/concat/split_2/split_dim}}]]
2025-09-11 17:43:28.340488: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'gradients/split_grad/concat/split/split_dim' with dtype int32
	 [[{{node gradients/split_grad/concat/split/split_dim}}]]
2025-09-11 17:43:28.340990: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You mus

Epoch 1/200



Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 15/200
Epoch 16/200
Epoch 17/200
Epoch 18/200
Epoch 19/200
Epoch 20/200
Epoch 21/200
Epoch 22/200
Epoch 23/200
Epoch 24/200
Epoch 25/200
Epoch 26/200
Epoch 27/200
Epoch 28/200
Epoch 29/200
Epoch 30/200
Epoch 31/200
Epoch 32/200
Epoch 33/200
Epoch 34/200
Epoch 35/200
Epoch 36/200
Epoch 37/200
Epoch 38/200
Epoch 39/200
Epoch 40/200
Epoch 41/200
Epoch 42/200



📊 Calculating permutation importance...
✅ Saved: permutation_importance_fold1.csv

📦 Fold 2



Epoch 1/200



Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 15/200
Epoch 16/200
Epoch 17/200
Epoch 18/200
Epoch 19/200
Epoch 20/200
Epoch 21/200
Epoch 22/200
Epoch 23/200
Epoch 24/200
Epoch 25/200
Epoch 26/200
Epoch 27/200
Epoch 28/200
Epoch 29/200
Epoch 30/200
Epoch 31/200
Epoch 32/200
Epoch 33/200
Epoch 34/200
Epoch 35/200
Epoch 36/200
Epoch 37/200
Epoch 38/200
Epoch 39/200
Epoch 40/200
Epoch 41/200
Epoch 42/200
Epoch 43/200
Epoch 44/200
Epoch 45/200
Epoch 46/200
Epoch 47/200
Epoch 48/200
Epoch 49/200
Epoch 50/200
Epoch 51/200
Epoch 52/200
Epoch 53/200
Epoch 54/200
Epoch 55/200
Epoch 56/200
Epoch 57/200
Epoch 58/200
Epoch 59/200
Epoch 60/200
Epoch 61/200
Epoch 62/200
Epoch 63/200
Epoch 64/200
Epoch 65/200
Epoch 66/200
Epoch 67/200
Epoch 68/200
Epoch 69/200
Epoch 70/200
Epoch 71/200
Epoch 72/200
Epoch 73/200
Epoch 74/200
Epoch 75/200
Epoch 76/200
Epoch 77/200
Epoch 78/200
Epoch 7


📊 Calculating permutation importance...
✅ Saved: permutation_importance_fold2.csv

📦 Fold 3



Epoch 1/200



Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 15/200
Epoch 16/200
Epoch 17/200
Epoch 18/200
Epoch 19/200
Epoch 20/200
Epoch 21/200
Epoch 22/200
Epoch 23/200
Epoch 24/200
Epoch 25/200
Epoch 26/200
Epoch 27/200
Epoch 28/200
Epoch 29/200
Epoch 30/200
Epoch 31/200
Epoch 32/200
Epoch 33/200
Epoch 34/200
Epoch 35/200
Epoch 36/200
Epoch 37/200
Epoch 38/200
Epoch 39/200
Epoch 40/200
Epoch 41/200
Epoch 42/200
Epoch 43/200
Epoch 44/200
Epoch 45/200
Epoch 46/200
Epoch 47/200
Epoch 48/200
Epoch 49/200
Epoch 50/200
Epoch 51/200
Epoch 52/200



📊 Calculating permutation importance...
✅ Saved: permutation_importance_fold3.csv

📦 Fold 4



Epoch 1/200



Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 15/200
Epoch 16/200
Epoch 17/200
Epoch 18/200
Epoch 19/200
Epoch 20/200
Epoch 21/200
Epoch 22/200
Epoch 23/200
Epoch 24/200
Epoch 25/200
Epoch 26/200
Epoch 27/200



📊 Calculating permutation importance...
✅ Saved: permutation_importance_fold4.csv

📦 Fold 5



Epoch 1/200



Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 15/200
Epoch 16/200
Epoch 17/200
Epoch 18/200
Epoch 19/200
Epoch 20/200
Epoch 21/200
Epoch 22/200
Epoch 23/200
Epoch 24/200
Epoch 25/200
Epoch 26/200
Epoch 27/200
Epoch 28/200
Epoch 29/200
Epoch 30/200
Epoch 31/200
Epoch 32/200
Epoch 33/200
Epoch 34/200
Epoch 35/200
Epoch 36/200
Epoch 37/200
Epoch 38/200
Epoch 39/200
Epoch 40/200
Epoch 41/200
Epoch 42/200
Epoch 43/200
Epoch 44/200
Epoch 45/200
Epoch 46/200
Epoch 47/200
Epoch 48/200
Epoch 49/200
Epoch 50/200
Epoch 51/200
Epoch 52/200
Epoch 53/200
Epoch 54/200
Epoch 55/200
Epoch 56/200
Epoch 57/200
Epoch 58/200
Epoch 59/200
Epoch 60/200
Epoch 61/200
Epoch 62/200
Epoch 63/200
Epoch 64/200
Epoch 65/200
Epoch 66/200
Epoch 67/200
Epoch 68/200
Epoch 69/200
Epoch 70/200
Epoch 71/200
Epoch 72/200
Epoch 73/200
Epoch 74/200
Epoch 75/200
Epoch 76/200
Epoch 77/200
Epoch 78/200
Epoch 7


📊 Calculating permutation importance...
✅ Saved: permutation_importance_fold5.csv


## Results Processing and Prediction Export

This final section processes the best-performing model results and exports predictions in a format suitable for scientific analysis and validation.

### Best Model Selection and Data Retrieval:

#### **Best Model Identification**:
- **Selection criteria**: Fold with lowest validation RMSE across all 5 folds
- **Retrieved data**: Complete model state from the optimal fold including:
  - **Validation predictions** (`best_y_pred`): Model outputs in scaled space
  - **True validation values** (`best_y_true`): Actual CO₂ measurements in scaled space
  - **Validation indices** (`best_val_idx`): Sample positions for data alignment
  - **Scaler object** (`best_scaler_y`): For inverse transformation to original units
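  - **Fold identifier** (`best_fold_num`): Zero-based fold index from the cross-validation loop, kept for traceability and reporting (add 1 to match the fold numbers in the `permutation_importance_fold*.csv` file names)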
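
The selection rule above can be sketched in isolation. This is a minimal illustration, not the notebook's own loop: the `fold_results` list, its `rmse` key, and the numeric values are hypothetical stand-ins for the per-fold records.

```python
def select_best_fold(fold_results):
    """Return the fold record with the lowest validation RMSE (None if empty)."""
    best = None
    for record in fold_results:
        # Strict "<" keeps the earliest fold when two folds tie
        if best is None or record["rmse"] < best["rmse"]:
            best = record
    return best


# Hypothetical per-fold summaries (illustrative values, not notebook results)
fold_results = [
    {"fold": 0, "rmse": 0.42},
    {"fold": 1, "rmse": 0.31},
    {"fold": 2, "rmse": 0.37},
]

best = select_best_fold(fold_results)
print(f"Best fold (1-based): {best['fold'] + 1}, RMSE = {best['rmse']:.2f}")
```

Storing the RMSE inside each record means the comparison does not depend on the state of a separate list of scores.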

### Data Transformation and Alignment:

#### **Inverse Scaling Process**:
```python
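# Transform from normalized space back to original CO₂ units (μatm)
# reshape(-1, 1) gives the 2D shape the scaler expects; flatten() returns a 1D array
y_true_real = best_scaler_y.inverse_transform(best_y_true.reshape(-1, 1)).flatten()
y_pred_real = best_scaler_y.inverse_transform(best_y_pred.reshape(-1, 1)).flatten()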
```

**Purpose and Benefits**:
1. **Original units**: Converts normalized values back to μatm for scientific interpretation
2. **Correct scaling**: Uses the exact scaler from the best fold to ensure accuracy
3. **Consistency**: Maintains the same transformation used during training
4. **Interpretability**: Results are directly comparable to measured CO₂ concentrations
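
To make the arithmetic explicit, the sketch below reproduces a z-score transform and its inverse in plain Python. It assumes the target scaler is a z-score scaler that uses the population standard deviation (as scikit-learn's `StandardScaler` does); the helper names and CO₂ values are illustrative only.

```python
from statistics import fmean, pstdev


def fit_zscore(values):
    """Return (mean, std) with the population standard deviation."""
    return fmean(values), pstdev(values)


def transform(values, mean, std):
    """Scale values to zero mean and unit variance."""
    return [(v - mean) / std for v in values]


def inverse_transform(scaled, mean, std):
    """Undo the z-score: x = z * std + mean."""
    return [z * std + mean for z in scaled]


co2 = [380.0, 395.5, 410.2, 402.7]  # illustrative values in μatm
mean, std = fit_zscore(co2)
scaled = transform(co2, mean, std)
restored = inverse_transform(scaled, mean, std)
print(restored)
```

The round trip only recovers the original values when the same mean and standard deviation are used in both directions, which is why the scaler fitted in the best fold is the one stored and reused.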

#### **Spatio-Temporal Alignment**:
- **Data source**: Uses `df_aligned` (windowing-adjusted original dataset)
- **Index matching**: `best_val_idx` ensures correct correspondence between predictions and metadata
- **Preserved information**: Maintains original timestamps and spatial coordinates
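
Why an aligned copy of the data is needed can be illustrated with a toy windowing routine. The convention used here (each window of length `window` predicts the row that immediately follows it) is an assumption made for the sketch; the notebook's windowing step may label windows differently.

```python
def make_windows(rows, window):
    """Build sliding windows and record the original row index of each target."""
    inputs, target_rows = [], []
    for start in range(len(rows) - window):
        inputs.append(rows[start:start + window])
        target_rows.append(start + window)  # row whose value is being predicted
    return inputs, target_rows


rows = list(range(10))  # stand-in for 10 original records
inputs, target_rows = make_windows(rows, window=3)

# One metadata row per window; plays the role of df_aligned in this sketch
aligned = [rows[i] for i in target_rows]

val_idx = [0, 4, 6]  # hypothetical validation positions from a CV split
val_rows = [aligned[i] for i in val_idx]
print(val_rows)
```

Because windowing yields fewer samples than original rows, validation indices refer to positions in the aligned table rather than in the raw file.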

### Output Dataset Structure:

#### **Generated DataFrame (`df_best_fold`)**:
Contains five essential columns for comprehensive analysis:

1. **Date**: 
   - **Format**: Timestamp of each measurement
   - **Purpose**: Temporal analysis and time series validation
   - **Applications**: Seasonal pattern analysis, temporal trend identification

2. **Latitude**: 
   - **Units**: Decimal degrees
   - **Purpose**: Spatial location for geographic analysis
   - **Applications**: Spatial interpolation, geographic pattern analysis

3. **Longitude**: 
   - **Units**: Decimal degrees
   - **Purpose**: Complete spatial coordinates
   - **Applications**: Coastal mapping, spatial gradient analysis

4. **CO2_real**: 
   - **Units**: μatm (microatmospheres)
   - **Content**: Original measured CO₂ concentrations
   - **Purpose**: Ground truth for model validation
   - **Quality**: Direct measurements from field instruments

5. **CO2_predicted**: 
   - **Units**: μatm (microatmospheres)
   - **Content**: BiLSTM model predictions
   - **Purpose**: Model performance assessment and scientific analysis
   - **Validation**: Represents best-fold model capabilities

### File Export Specifications:

#### **Output File**: `Deception_2025_CO2_prediction.csv`
- **Format**: Comma-separated values (CSV) for universal compatibility
- **Index**: Not included (`index=False`) for clean data structure
- **Size**: Contains only validation samples from the best-performing fold
- **Quality**: Predictions come from the fold with the lowest validation RMSE; samples from the other folds are not included

#### **Scientific Applications**:
1. **Model validation**: Compare predicted vs. actual CO₂ concentrations
2. **Performance metrics**: Calculate RMSE, R², MAE, and other regression metrics
3. **Spatial analysis**: Examine prediction accuracy across different locations
4. **Temporal analysis**: Assess model performance over time periods
5. **Uncertainty quantification**: Analyze prediction residuals and error patterns
6. **Scientific publication**: Provide validation data for research papers
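
As a starting point for the first two applications, the following sketch computes RMSE, MAE, and R² from the `CO2_real` and `CO2_predicted` columns using only the standard library. The rows in `sample` are made-up placeholders, and the helper names are not part of the notebook.

```python
import csv
import io
import math


def read_predictions(handle):
    """Read the CO2_real and CO2_predicted columns from an open CSV handle."""
    y_true, y_pred = [], []
    for row in csv.DictReader(handle):
        y_true.append(float(row["CO2_real"]))
        y_pred.append(float(row["CO2_predicted"]))
    return y_true, y_pred


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and R² for paired observations and predictions."""
    n = len(y_true)
    residuals = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


# In-memory stand-in for the exported file (made-up rows, not real data)
sample = io.StringIO(
    "Date,Latitude,Longitude,CO2_real,CO2_predicted\n"
    "2025-01-01,-62.97,-60.65,400.0,402.0\n"
    "2025-01-02,-62.98,-60.66,410.0,408.0\n"
    "2025-01-03,-62.99,-60.67,420.0,421.0\n"
)

y_true, y_pred = read_predictions(sample)
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

To run it on the exported file, pass `open("Deception_2025_CO2_prediction.csv", newline="")` to `read_predictions` instead of the in-memory sample.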

### Data Quality Assurance:

#### **Validation Checks**:
- **Null check**: `if best_result is not None` ensures valid model results exist
- **Alignment**: Predictions are matched to dates and coordinates through `best_val_idx` applied to `df_aligned`; the export cell relies on this indexing and does not check it explicitly
- **Unit consistency**: All CO₂ values in original measurement units (μatm)
- **Data integrity**: Maintains chronological and spatial relationships
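
A few explicit checks could be run before export. The function below is a hypothetical helper, not part of the notebook; it only tests properties that must hold for positional indexing into the aligned table to be meaningful.

```python
import math


def check_export(n_rows_aligned, val_idx, y_true, y_pred):
    """Raise ValueError if predictions cannot be aligned with the metadata rows."""
    if not (len(val_idx) == len(y_true) == len(y_pred)):
        raise ValueError("val_idx, y_true and y_pred must have the same length")
    if any(i < 0 or i >= n_rows_aligned for i in val_idx):
        raise ValueError("val_idx contains positions outside the aligned table")
    if len(set(val_idx)) != len(val_idx):
        raise ValueError("val_idx contains duplicate positions")
    if any(math.isnan(v) for v in list(y_true) + list(y_pred)):
        raise ValueError("NaN found in true or predicted values")
    return True


# Illustrative call with made-up values
ok = check_export(
    n_rows_aligned=5,
    val_idx=[0, 2, 4],
    y_true=[400.0, 410.0, 420.0],
    y_pred=[402.0, 408.0, 421.0],
)
print(ok)
```

In the notebook's terms, the arguments would correspond to `len(df_aligned)`, `best_val_idx`, `y_true_real`, and `y_pred_real`.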

#### **Output Confirmation**:
- **Success message**: "✅ Saved File with Best Fold Predictions"
- **Content summary**: Confirms inclusion of date, coordinates, and CO₂ data
- **File verification**: Output file ready for downstream analysis

### Integration with Research Workflow:

#### **Downstream Applications**:
1. **Statistical analysis**: Performance metrics and error analysis
2. **Visualization**: Time series plots, spatial maps, scatter plots
3. **Model comparison**: Benchmark against other prediction methods
4. **Scientific validation**: Peer review and publication preparation
5. **Operational use**: Real-time CO₂ monitoring applications

#### **Research Value**:
- **Reproducibility**: Complete prediction dataset for validation
- **Transparency**: Clear provenance from best-performing model fold
- **Scientific rigor**: Proper scaling and alignment procedures
- **Accessibility**: Standard CSV format for broad compatibility

**Final Outcome**: This section completes the CO₂ prediction pipeline by exporting the best-fold predictions in original units (μatm) alongside the corresponding measurements, dates, and coordinates, ready for technical validation and downstream analysis.


In [8]:
# Save predictions from the best fold aligned with the spatio-temporal input file
if best_result is not None:
    # Get the data from the best fold
    best_y_true = best_result["y_true"]
    best_y_pred = best_result["y_pred"]
    best_val_idx = best_result["val_idx"]
    best_scaler_y = best_result["scaler_y"]
    best_fold_num = best_result["fold"]

    # Invert the scaling to recover values in original units (μatm)
    y_true_real = best_scaler_y.inverse_transform(best_y_true.reshape(-1, 1)).flatten()
    y_pred_real = best_scaler_y.inverse_transform(best_y_pred.reshape(-1, 1)).flatten()

    # Create DataFrame with date, coordinates, real CO2, and predicted CO2
    df_best_fold = pd.DataFrame({
        'Date': df_aligned.iloc[best_val_idx]['Date'].values,
        'Latitude': df_aligned.iloc[best_val_idx]['Latitude'].values,
        'Longitude': df_aligned.iloc[best_val_idx]['Longitude'].values,
        'CO2_real': y_true_real,
        'CO2_predicted': y_pred_real
    })

    # Save the file
    df_best_fold.to_csv("Deception_2025_CO2_prediction.csv", index=False)
    print("✅ Saved File with Best Fold Predictions (Date, Coordinates, Real CO2, Predicted CO2).")

✅ Saved File with Best Fold Predictions (Date, Coordinates, Real CO2, Predicted CO2).
