# XGBoost Model Training for Production Prediction

This section demonstrates how to train an XGBoost model on reservoir production data to predict oil, water, and gas production.

## Steps:

1. **Imports**:
   - **`os` and `glob`**: Used to handle file paths and retrieve reservoir folders from the dataset.
   - **`pandas`**: For data manipulation and loading datasets.
   - **`xgboost`**: The XGBoost library is imported to build and train the model.
   - **`train_test_split`**: Splits the dataset into training and testing sets.
   - **`mean_squared_error`, `mean_absolute_error`, `r2_score`**: Metrics to evaluate model performance.

2. **Training the XGBoost Model**:
   - After data preprocessing and splitting the data into training and testing sets, the XGBoost model is trained to predict oil, water, and gas production using production and rock characteristic features.

3. **Evaluation**:
   - After training, the model is evaluated using common regression metrics, such as:
     - **Mean Squared Error (MSE)**: Measures the average squared difference between actual and predicted values.
     - **Mean Absolute Error (MAE)**: Measures the average absolute difference between actual and predicted values.
     - **R² Score**: Evaluates how well the model fits the data.


In [1]:
import os
import glob
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Combining Status and Production Data for Multiple Reservoirs

This code combines the status (rock characteristics) and production (day-by-day production) data from multiple reservoirs into two consolidated DataFrames for further model training or analysis.

## Steps:

1. **Initializing DataFrames**:
   - Empty lists (`all_status_dfs` and `all_production_dfs`) are initialized to store the status and production DataFrames for each reservoir.

2. **Retrieving Reservoir Folders**:
   - Using the `glob` module, the code retrieves paths to all reservoir folders from the dataset.

3. **Loading and Processing Data**:
   - For each reservoir:
     - The paths to the `state.csv` (status data) and `production.csv` (production data) files are generated.
     - Both datasets are loaded into DataFrames if the corresponding files exist.
     - The production data is grouped by the `Date` column, and mean values are computed for each day.
     - The `day` column is reset to start from 1 for each reservoir, representing sequential days.
     - A `reservoir_id` column is added to both the status and production DataFrames to track the reservoir source.
     - The `Date` column is dropped from the production data after processing.

4. **Combining Data**:
   - All individual status and production DataFrames are concatenated into two combined DataFrames (`combined_status_df` and `combined_production_df`).
   - These combined DataFrames contain all the data from multiple reservoirs.

5. **Saving Data**:
   - The combined DataFrames are saved to CSV files (`combined_state.csv` and `combined_production.csv`) for future use.


In [6]:
# Initialize lists to store DataFrames
all_status_dfs = []
all_production_dfs = []

# Get all reservoir folders
reservoir_folders = glob.glob(os.path.join('../../dataset/training', 'Reservoir*'))

# Loop through each reservoir folder
for reservoir_folder in reservoir_folders:
    # Extract reservoir_id from folder name
    reservoir_id = os.path.basename(reservoir_folder)

    # Define paths to state and production files
    state_path = os.path.join(reservoir_folder, 'state.csv')
    production_path = os.path.join(reservoir_folder, 'production.csv')
    
    # Check if the state and production files exist
    if os.path.exists(state_path) and os.path.exists(production_path):
        # Load the datasets
        status_df = pd.read_csv(state_path)
        production_df = pd.read_csv(production_path)

        # Group production data by 'Date' and compute mean values
        production_df = production_df.groupby(['Date']).mean().reset_index()

        # Reset the 'day' column starting from 1 for each reservoir
        production_df['day'] = range(1, len(production_df) + 1)
        
        # Add reservoir_id column to both datasets
        status_df['reservoir_id'] = reservoir_id
        production_df['reservoir_id'] = reservoir_id
        
        # Drop the 'Date' column from the production data
        production_df = production_df.drop(columns=['Date'])
        
        # Append DataFrames to lists
        all_status_dfs.append(status_df)
        all_production_dfs.append(production_df)

# Concatenate all status and production DataFrames
combined_status_df = pd.concat(all_status_dfs, ignore_index=True)
combined_production_df = pd.concat(all_production_dfs, ignore_index=True)

# Save the combined DataFrames to CSV files
combined_status_df.to_csv('preprocess_data/combined_train_state.csv', index=False)
combined_production_df.to_csv('preprocess_data/combined_train_production.csv', index=False)

# Reloading and Merging Combined Datasets

This section of the code reloads the previously saved combined datasets, aggregates the status data, and merges it with the production data.

## Steps:

1. **Reload Combined Datasets**:
   - The combined status data (`combined_state.csv`) and production data (`combined_production.csv`) are reloaded into DataFrames (`status_df` and `production_df`).

2. **Aggregate Status Data**:
   - The status data is aggregated by `reservoir_id` to compute the mean values of the following columns:
     - **X**
     - **Y**
     - **Depth**
     - **PERMX** (permeability in X direction)
     - **PERMY** (permeability in Y direction)
     - **PERMZ** (permeability in Z direction)
     - **PORO** (porosity)
     - **Transmissibility**
   - The aggregation is done using the `agg()` function, which computes the mean of each column for each reservoir. The aggregated data is stored in `status_df_agg`.

3. **Merge Aggregated Status Data with Production Data**:
   - The aggregated status data (`status_df_agg`) is merged with the production data (`production_df`) based on the `reservoir_id`.
   - This merge is performed using a left join, ensuring that all rows from the production data are preserved, and relevant status data is added where available.
   - The result is stored in `combined_df`, which contains both the production data and the aggregated status data for each reservoir.


In [4]:
# Reload the combined datasets
status_df = pd.read_csv('preprocess_data/combined_state.csv')
production_df = pd.read_csv('preprocess_data/combined_production.csv')

# Aggregate status data by reservoir_id and compute mean values
status_df_agg = status_df.groupby('reservoir_id').agg({
    'X': 'mean',
    'Y': 'mean',
    'Depth': 'mean',
    'PERMX': 'mean',
    'PERMY': 'mean',
    'PERMZ': 'mean',
    'PORO': 'mean',
    'Transmissibility': 'mean'
}).reset_index()

# Merge aggregated status data with production data on reservoir_id
combined_df = pd.merge(production_df, status_df_agg, on='reservoir_id', how='left')


# Training and Evaluating the XGBoost Model

This section prepares the data for model training, trains an XGBoost model, saves the model, and evaluates its performance.

## Steps:

1. **Prepare Features and Targets**:
   - **Features (`X`)**: Includes columns related to reservoir rock characteristics and the `day` column from `combined_df`:
     - **X**
     - **Y**
     - **Depth**
     - **PERMX** (permeability in X direction)
     - **PERMY** (permeability in Y direction)
     - **PERMZ** (permeability in Z direction)
     - **PORO** (porosity)
     - **Transmissibility**
     - **day**
   - **Targets (`y`)**: Includes columns for cumulative production values:
     - **Oil production cumulative**
     - **Water production cumulative**
     - **Gas production cumulative**

2. **Split Data**:
   - The data is split into training (80%) and testing (20%) sets using `train_test_split` with a random state for reproducibility.

3. **Train XGBoost Model**:
   - An XGBoost regressor (`xgb.XGBRegressor`) is instantiated with `objective='reg:squarederror'` to handle regression tasks.
   - The model is trained on the training data (`X_train` and `y_train`).

4. **Save the Model**:
   - The trained model is saved to a file named `xgboost_model.json` for future use.

5. **Make Predictions**:
   - Predictions are made on the test set (`X_test`) using the trained model.

6. **Calculate Metrics**:
   - **Mean Squared Error (MSE)**: Measures the average squared difference between actual and predicted values.
   - **Mean Absolute Error (MAE)**: Measures the average absolute difference between actual and predicted values.
   - **R² Score**: Evaluates how well the model fits the test data.
   - These metrics are computed using `mean_squared_error`, `mean_absolute_error`, and `r2_score`, respectively.

7. **Print Metrics**:
   - The computed MSE, MAE, and R² Score are printed to assess the model's performance.


In [5]:
# Prepare features (X) and target (y)
X = combined_df[['X', 'Y', 'Depth', 'PERMX', 'PERMY', 'PERMZ', 'PORO', 'Transmissibility', 'day']]
y = combined_df[['Oil production cumulative', 'Water production cumulative', 'Gas production cumulative']]

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the XGBoost model
model = xgb.XGBRegressor(objective='reg:squarederror')
model.fit(X_train, y_train)

# Save the trained model
model.save_model('models/xgboost_model.json')

# Predict on the test set
y_pred = model.predict(X_test)

# Calculate metrics
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print metrics
print(f'Mean Squared Error: {mse}')
print(f'Mean Absolute Error: {mae}')
print(f'R² Score: {r2}')

Mean Squared Error: 96835819895.33015
Mean Absolute Error: 183489.12765300297
R² Score: 0.9997407793998718
