# LSTM Model Training for Oil, Water, and Gas Production Prediction

This notebook imports necessary functions and libraries to preprocess data and train an LSTM model for oil, water, and gas production prediction. The model uses sequential data and aims to predict future production trends based on historical data.

## Imported Libraries and Functions:

- **`preprocess_data`**: Custom function, imported from the local `function` module, that prepares and preprocesses the data.
- **`os`, `glob`**: Used to navigate directories and work with file paths.
- **`pandas`, `numpy`**: Data handling libraries for loading, manipulating, and processing datasets.
- **`Sequential`, `LSTM`, `Dense`**: Keras components to build and define the LSTM neural network.
- **`train_test_split`**: Splits data into training and testing sets for model validation.


In [1]:
# Import the custom preprocessing helper from the local function module
from function import preprocess_data

# Standard library and third-party imports
import os
import glob
import pandas as pd
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from sklearn.model_selection import train_test_split



# Data Preprocessing for Multiple Reservoirs

This code processes data from multiple reservoirs, prepares it for training an LSTM model, and scales the data for model input.

## Key Preprocessing Steps:

1. **Reservoir Folders**: The code retrieves a list of reservoir folders from the training dataset using the `glob` module.
2. **File Paths**: For each reservoir, paths to the `state.csv` and `production.csv` files are constructed, representing static reservoir rock characteristics and day-by-day production data, respectively.
3. **Data Loading and Processing**:
   - The `state.csv` and `production.csv` files are loaded into DataFrames.
   - The `Date` column in the production data is converted into a `datetime` format for consistency.
   - The production data is **sorted by date** and **grouped by date**, taking the **mean of the production values** for each day so that multiple entries for the same date collapse into a single row. The `Date` column is dropped afterwards, leaving only numeric columns.
4. **Reservoir Rock Characteristics**:
   - The mean values of the static rock characteristics (from the `state.csv` file) are calculated.
   - These mean values are then **concatenated to the production data**, providing static characteristics alongside the dynamic production data for model input.
5. **Lag Feature Calculation**:
   - For the three production variables (oil, water, gas), **lag features** are created to incorporate the historical trends of production. Lag features are calculated for:
     - **1-day lag**
     - **3-day lag**
     - **7-day lag**
6. **Data Scaling**:
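   - Inside `preprocess_data`, the combined dataset (including lag features and reservoir rock characteristics) is **scaled**. The function returns one input scaler and one target scaler per reservoir, which the loop collects in the lists `scalers_X` and `scalers_y` so that predictions can later be converted back to the original scale.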
7. **Sequence Creation**:
   - After scaling, the data is divided into **sequences of 7 time steps** (default value) to capture temporal dependencies. Each sequence consists of 7 consecutive days of production data along with the corresponding rock characteristics.
   
The final result of this preprocessing is a set of sequences, each 7 time steps long and combining production and static features, ready for training the LSTM model.
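
The body of `preprocess_data` is not shown in this notebook, so the following is only a minimal sketch of steps 5 and 7 under stated assumptions. The column names (`oil`, `water`, `gas`, `porosity`), the helper names `add_lags` and `make_sequences`, and the choice of the day right after each window as the target are all hypothetical. Scaling (step 6) is left out for brevity.

```python
import numpy as np
import pandas as pd


def add_lags(df, cols, lags=(1, 3, 7)):
    """Append lagged copies of `cols` and drop the rows that have no history yet."""
    out = df.copy()
    for col in cols:
        for lag in lags:
            out[f"{col}_lag{lag}"] = out[col].shift(lag)
    return out.dropna().reset_index(drop=True)


def make_sequences(X, y, time_steps=7):
    """Cut X into overlapping windows; the target is the row after each window (assumption)."""
    X_seq, y_seq = [], []
    for start in range(len(X) - time_steps):
        X_seq.append(X[start:start + time_steps])
        y_seq.append(y[start + time_steps])
    return np.array(X_seq), np.array(y_seq)


# Toy data: 20 days of made-up production plus one static rock property.
targets = ["oil", "water", "gas"]
prod = pd.DataFrame({c: np.arange(20, dtype=float) + k for k, c in enumerate(targets)})
prod["porosity"] = 0.2  # hypothetical static feature, constant over time

features = add_lags(prod, targets)
X_seq, y_seq = make_sequences(features.to_numpy(), features[targets].to_numpy())
print(X_seq.shape, y_seq.shape)
```

In this toy case there are 13 feature columns (3 production, 1 static, 9 lags). With 17 feature columns, each window would have the `(7, 17)` shape that the model in this notebook expects.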


In [2]:
all_X, all_y = [], []
scalers_X, scalers_y = [], []

# Get all reservoir folders (sorted, because glob returns paths in arbitrary order)
reservoir_folders = sorted(glob.glob(os.path.join('../../dataset/training', 'Reservoir*')))

for reservoir_folder in reservoir_folders:
    # Extract reservoir_id from folder name
    reservoir_id = os.path.basename(reservoir_folder)

    # Define paths to state and production files
    state_path = os.path.join(reservoir_folder, 'state.csv')
    production_path = os.path.join(reservoir_folder, 'production.csv')
    
    # Check if files exist before processing
    if os.path.exists(state_path) and os.path.exists(production_path):
        # Load the datasets
        status_df = pd.read_csv(state_path)
        production_df = pd.read_csv(production_path)
        # Ensure the 'Date' column is in datetime format
        production_df['Date'] = pd.to_datetime(production_df['Date'])

        # Sort by date, average duplicate dates into one row, then drop the date column
        production_df = production_df.sort_values(by='Date')
        production_df = production_df.groupby(['Date']).mean().reset_index()
        production_df = production_df.drop(columns='Date')

        X_sequences, y_sequences, X_scaled, y_scaled, scaler_X, scaler_y = preprocess_data(status_df, production_df)
        all_X.append(X_sequences)
        all_y.append(y_sequences)
        scalers_X.append(scaler_X)
        scalers_y.append(scaler_y)

X_combined = np.concatenate(all_X, axis=0)
y_combined = np.concatenate(all_y, axis=0)
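
The date handling inside the loop can be checked on a toy frame. The values below are made up for illustration, and only two production columns are included; the real `production.csv` layout is not shown in this notebook.

```python
import pandas as pd

# Three toy records, two of which share the same date.
toy = pd.DataFrame({
    "Date": ["2020-01-02", "2020-01-01", "2020-01-01"],
    "oil": [30.0, 10.0, 20.0],
    "water": [3.0, 1.0, 2.0],
})
toy["Date"] = pd.to_datetime(toy["Date"])

# Same chain as in the loop: sort, average duplicates, restore Date as a column.
daily = toy.sort_values(by="Date").groupby(["Date"]).mean().reset_index()
print(daily)

# Dropping Date leaves a purely numeric frame for the preprocessing step.
numeric = daily.drop(columns="Date")
print(list(numeric.columns))
```

The two records for 2020-01-01 collapse into one row holding their mean. Since `groupby` sorts its keys by default, the explicit `sort_values` call acts as a safeguard rather than a requirement.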

# LSTM Model Training and Saving

This code splits the preprocessed dataset into training and testing sets, builds an LSTM model, trains it on the training data, and then saves the trained model for future use.
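
The two nested splits used here (20% held out for testing, then 20% of the remainder held out for validation) leave 64% of the sequences for weight updates. The sketch below works through that arithmetic. The helper `split_sizes` is illustrative, the rounding is meant to mirror the usual behaviour of `train_test_split` (test share rounded up) and of Keras `validation_split` (training share rounded down), and the total of 15,500 sequences is a hypothetical figure picked only because it reproduces the 310 and 97 steps shown in this notebook's progress bars. The actual number of sequences is not stated.

```python
import math


def split_sizes(n_samples, test_pct=20, val_pct=20, batch_size=32):
    """Sample counts after a test split followed by a validation split."""
    n_test = math.ceil(n_samples * test_pct / 100)   # test share, rounded up
    n_train = n_samples - n_test                     # passed to model.fit
    n_fit = n_train * (100 - val_pct) // 100         # used for weight updates
    n_val = n_train - n_fit                          # tail held out for validation
    return {
        "test": n_test,
        "fit": n_fit,
        "val": n_val,
        "fit_steps_per_epoch": math.ceil(n_fit / batch_size),
        "predict_steps": math.ceil(n_test / batch_size),
    }


sizes = split_sizes(15_500)  # hypothetical total, see the note above
print(sizes)
```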

## Steps:

1. **Data Splitting**:
   - The dataset is split into training (80%) and testing (20%) sets using the `train_test_split` function from `sklearn`. The `random_state=42` ensures reproducibility of the results.

2. **LSTM Model Construction**:
   - A sequential model is built using Keras. It includes:
     - An LSTM layer with 50 units and `input_shape` set to `(7, 17)` (7 time steps, 17 features). 
     - A Dense output layer with 3 units corresponding to the 3 output variables (oil, water, and gas production).
   - The model uses the **Adam** optimizer and **mean squared error** as the loss function.

3. **Model Training**:
   - The model is trained for 50 epochs with a batch size of 32.
   - `validation_split=0.2` holds out 20% of the training data for validation (Keras takes the last 20% of the samples, before any shuffling), so the loss on data not used for weight updates is reported after every epoch.

4. **Model Saving**:
   - After training, the model is saved to the file `models/LSTM_model.h5`, which can be loaded later for predictions or further training.
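
The size of the network described in step 2 follows from the layer shapes. As a rough check, the standard parameter-count formulas for an LSTM layer (four gates, each with input weights, recurrent weights, and a bias) and for a Dense layer can be evaluated directly. The helper names are illustrative and not part of Keras.

```python
def lstm_params(units, features):
    """Trainable parameters of one LSTM layer with biases enabled."""
    # 4 gates x (input weights + recurrent weights + bias) per unit
    return 4 * (units * (features + units) + units)


def dense_params(inputs, outputs):
    """Trainable parameters of a fully connected layer with biases."""
    return inputs * outputs + outputs


lstm = lstm_params(units=50, features=17)   # 13600
dense = dense_params(inputs=50, outputs=3)  # 153
total = lstm + dense                        # 13753
print(lstm, dense, total)
```

Note that the number of time steps (7) does not enter the count, because the same LSTM weights are reused at every step.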


In [3]:
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_combined, y_combined, test_size=0.2, random_state=42)

# Build and train the model
model = Sequential()
model.add(LSTM(units=50, return_sequences=False, input_shape=(7, 17)))
model.add(Dense(3))
model.compile(optimizer='adam', loss='mean_squared_error')

# Train the model
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.2)

# Save the model (create the target directory first if it does not exist)
os.makedirs('models', exist_ok=True)
model.save('models/LSTM_model.h5')


Epoch 1/50
310/310 ━━━━━━━━━━━━━━━━━━━━ 4s 5ms/step - loss: 0.0797 - val_loss: 0.0011
Epoch 2/50
310/310 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - loss: 8.2630e-04 - val_loss: 3.9371e-04
Epoch 3/50
310/310 ━━━━━━━━━━━━━━━━━━━━ 2s 5ms/step - loss: 3.7088e-04 - val_loss: 2.3585e-04
Epoch 4/50
310/310 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - loss: 2.4982e-04 - val_loss: 1.8856e-04
Epoch 5/50
310/310 ━━━━━━━━━━━━━━━━━━━━ 2s 5ms/step - loss: 1.9990e-04 - val_loss: 2.2398e-04
Epoch 6/50
310/310 ━━━━━━━━━━━━━━━━━━━━ 2s 5ms/step - loss: 2.1876e-04 - val_loss: 1.3570e-04
Epoch 7/50
310/310 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - loss: 1.9759e-04 - val_loss: 1.9263e-04
Epoch 8/50
310/310 ━━━━━━━━━━━━━━━━━━━━ 2s 5ms/step - loss: 1.6549e-04 - val_l



# Model Evaluation

This section evaluates the performance of the trained LSTM model on the test dataset using several metrics: Mean Squared Error (MSE), Mean Absolute Error (MAE), and the R² score. 

## Steps:

1. **Model Prediction**:
   - The model predicts production values (`y_pred`) using the test data (`X_test`).
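   - Both `y_test` and `y_pred` are reshaped to 2-D arrays with one row per sample and one column per target so they can be compared.

2. **Inverse Scaling**:
   - Since the target variables (`y_test` and `y_pred`) were scaled during preprocessing, they are transformed back to their original scale with the `inverse_transform` method of a scaler from `scalers_y`. The code uses `scalers_y[0]`, the scaler returned for the first reservoir processed, for every test row, so rows from other reservoirs are restored exactly only if their scalers have the same parameters.
   - This allows error metrics to be computed in the original production units rather than on scaled values.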

3. **Error Metrics Calculation**:
   - **Mean Squared Error (MSE)**: The average of the squared differences between actual and predicted values. Lower is better. Because errors are squared, large misses dominate and the value is in squared units.
   - **Mean Absolute Error (MAE)**: The average of the absolute differences between actual and predicted values. It is in the same units as the target, so it reads directly as a typical error size.
   - **R² Score (Coefficient of Determination)**: The share of the variance in the actual values that the predictions explain. A score of 1 indicates a perfect fit, 0 means the model does no better than always predicting the mean, and negative values mean it does worse.

4. **Print Results**:
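   - The computed metrics are printed to show how well the model performed on the test data. With three output columns, each scikit-learn function averages its metric uniformly over the columns by default, so every printed number mixes oil, water, and gas.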
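
The three metrics can also be computed by hand from their definitions, which makes their relationship easier to see. The sketch below uses four made-up values rather than the notebook's data, and the variable names are illustrative.

```python
import numpy as np

# Made-up actual and predicted values for a single target.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_hat

mse = np.mean(errors ** 2)     # average squared error -> 0.375
mae = np.mean(np.abs(errors))  # average absolute error -> 0.5

# R² compares the model's squared error with that of always predicting the mean.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot     # about 0.9486

print(mse, mae, r2)
```

Taking the square root of the MSE gives the RMSE, which is in the same units as the MAE and is therefore easier to compare with it.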



In [4]:
# Evaluate the model
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_pred = model.predict(X_test)
y_test_flattened = y_test.reshape(-1, y_test.shape[-1])
y_pred_flattened = y_pred.reshape(-1, y_pred.shape[-1])

y_test_rescaled = scalers_y[0].inverse_transform(y_test_flattened)
y_pred_rescaled = scalers_y[0].inverse_transform(y_pred_flattened)

# Calculate error metrics
mse = mean_squared_error(y_test_rescaled, y_pred_rescaled)
mae = mean_absolute_error(y_test_rescaled, y_pred_rescaled)
r2 = r2_score(y_test_rescaled, y_pred_rescaled)

print(f'Mean Squared Error: {mse}')
print(f'Mean Absolute Error: {mae}')
print(f'R² Score: {r2}')

97/97 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step
Mean Squared Error: 22837023611.819126
Mean Absolute Error: 82605.80888133099
R² Score: 0.9999580846302782
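
`preprocess_data` returns a separate target scaler for each reservoir, presumably fitted on that reservoir alone, so inverting every test row with `scalers_y[0]` restores original units exactly only for rows from the first reservoir. One possible way to keep rows and scalers matched is to carry a reservoir index alongside the targets. The sketch below illustrates the idea with a hypothetical `MinMax` stand-in class and made-up data rather than the notebook's actual scalers.

```python
import numpy as np


class MinMax:
    """Tiny stand-in for a fitted per-reservoir target scaler (hypothetical)."""

    def __init__(self, y):
        self.lo, self.hi = y.min(axis=0), y.max(axis=0)

    def transform(self, y):
        return (y - self.lo) / (self.hi - self.lo)

    def inverse_transform(self, y):
        return y * (self.hi - self.lo) + self.lo


rng = np.random.default_rng(0)

# Two toy reservoirs whose production differs by orders of magnitude.
raw = [rng.uniform(0, 10, size=(5, 3)), rng.uniform(0, 1000, size=(4, 3))]
scalers = [MinMax(y) for y in raw]
scaled = [s.transform(y) for s, y in zip(scalers, raw)]

y_all = np.concatenate(scaled)
# Remember which reservoir every row came from.
res_idx = np.concatenate([np.full(len(y), i) for i, y in enumerate(scaled)])

# Shuffle the rows and their reservoir index together, as a random split would.
perm = rng.permutation(len(y_all))
y_shuffled, idx_shuffled = y_all[perm], res_idx[perm]

# Invert each row with the scaler of its own reservoir.
restored = np.empty_like(y_shuffled)
for i, scaler in enumerate(scalers):
    mask = idx_shuffled == i
    restored[mask] = scaler.inverse_transform(y_shuffled[mask])

print(np.allclose(restored, np.concatenate(raw)[perm]))
```

In the notebook itself, the same index array could be passed to `train_test_split` together with `X_combined` and `y_combined`, since that function accepts any number of arrays and splits them consistently.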
