<img src="../images/logo.png" alt="slb" style= "width: 1700px"/>

# ⚡️   - Tutorial 1: Predicting Cumulative Production

💡 The objective of this exercise is to learn how to apply supervised learning to predict an output variable from multi-dimensional data

📋 The goal is to predict production data using completion parameters from ~600 wells targeting an unconventional reservoir. In addition, we have available the location (lat, long) and total depth (tvd) of the wells, which can be used as a geology proxy. 

👌 Once we have completed our prediction, we will also investigate which feature (variable) contributes the most to well production. For this purpose we will use the `SHAP` library

In [None]:
# Import required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
import sys

## 🏁 Step 1: Data Load and Pre-processing

In [None]:
# Read the well dataset (well_data_raw.csv) and store it in a dataframe named 'well_data'

well_data= pd.read_csv(...)

In [None]:
# Explore the content of the well_data dataframe

well_data

In [None]:
# Create a scatter plot to display the well location (lat & long), total depth (tvd), and the cumulative gas production



In [None]:
# Display and analyze the distribution of missing values in the dataset

import missingno as msno



✍️ `missingno` is a nice library to visualize the distribution of null values. 

Note that visualizing null values can also reveal some relationships between the data

In [None]:
# Another plot to visualize the total number of missing values per column is the bar plot. Let's display it



🧹 Now let's do some data clean up 🧹

In [None]:
# Create a cleaned dataset named 'well_data_clean' by performing the following operations on 'well_data':

# 1) Make a copy of the original dataset (to be safe! 🤓)
# 2) Set the column 'well_id' as the index 
# 3) Remove the rows where 'fluid'= 'oil'
# 4) Remove missing values
# 5) Remove the column 'fluid'
# 6) Add a new column 'target_bin' and define 5 bins for the 'cum_gas_prod' -> ["very_low", "low", "mid", "high", "very_high"]
# 7) Assign the data type in the 'target_bin' column as object



# Let's print the new dataset



❓ Do you know the meaning of the tilde (~) operator in python? We just used it in the step above 👆

<details>
<summary> 💡 Hint </summary>
    
👌 In Python, the tilde (~) is the bitwise NOT operator. When it is applied to a pandas boolean Series (a mask), it acts element-wise as a logical NOT: every `True` becomes `False` and vice versa

In the code above, the tilde inverts the boolean mask returned by the `str.contains()` method, so that we keep the rows that do not match

</details>
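To make the hint concrete, here is a minimal sketch on a made-up table. The values are hypothetical and only the column names mirror the tutorial's; it shows the mask inversion on a pandas Series and, for contrast, what `~` does to a plain integer.

```python
import pandas as pd

# Hypothetical toy data: column names mirror the tutorial, values are made up
toy = pd.DataFrame(
    {"well_id": ["W1", "W2", "W3", "W4"],
     "fluid": ["gas", "oil", "gas", "oil"]}
).set_index("well_id")

# Boolean mask: True where the 'fluid' column contains the string 'oil'
is_oil = toy["fluid"].str.contains("oil")

# ~ flips every True/False in the mask, so we keep the rows that are NOT oil
toy_gas = toy[~is_oil]
print(toy_gas)

# On a plain Python integer, ~ is a bitwise NOT: ~x equals -x - 1
print(~5)  # -6
```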

In [None]:
# Let's print the count of values in each bin ("target_bin")
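As a reference for the binning and counting steps, here is a small sketch on made-up production values. It assumes quantile (equal-count) bins via `pd.qcut`; the tutorial does not say whether equal-count or equal-width bins are intended, so treat that choice as an assumption.

```python
import pandas as pd

labels = ["very_low", "low", "mid", "high", "very_high"]

# Made-up production values, for illustration only
prod = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
                 name="cum_gas_prod")

# pd.qcut builds equal-count (quantile) bins; pd.cut would build equal-width bins
target_bin = pd.qcut(prod, q=5, labels=labels).astype(object)

# Count how many values fall in each bin
print(target_bin.value_counts())
```

With quantile bins every label receives the same number of wells (as far as ties allow), which keeps the colored groups balanced in the later plots.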



## 🏁 Step 2: Data Standardization

In [None]:
# First create a subset of the 'well_data_clean' dataframe to include only the numerical data




# Standardize the dataset and add back the non-numerical column -> 'target_bin'



In [None]:
# Display the standardized dataset 'well_data_clean'
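For reference, the standardization step can be sketched on a toy table as follows. The data and the `stages` column are hypothetical; the pattern is to scale only the numeric columns and then re-attach the non-numeric `target_bin` column.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data, not the tutorial's wells
toy = pd.DataFrame(
    {"tvd": [2000.0, 2500.0, 3000.0],
     "stages": [10.0, 20.0, 30.0],
     "target_bin": ["low", "mid", "high"]},
    index=["W1", "W2", "W3"],
)

# Keep only the numeric columns for scaling
numeric = toy.select_dtypes(include="number")

# StandardScaler returns a NumPy array, so rebuild the DataFrame with the
# original column names and index
scaled = pd.DataFrame(
    StandardScaler().fit_transform(numeric),
    columns=numeric.columns,
    index=numeric.index,
)

# Add back the non-numeric column
scaled["target_bin"] = toy["target_bin"]
print(scaled)
```

After scaling, each numeric column has mean 0 and unit (population) standard deviation, so features measured in very different units become comparable.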



## 🏁 Step 3: Reshape the DataFrame for Data Exploration

👉 To be able to display all the columns in the same plot, we need to do some sort of reshaping on our dataframe.

In the cell below we will use the `.melt()` function to create the following shape:

**index** ------   **target_bin** ------   **variable** ------  **value**

✍️ The `.melt()` function reshapes a DataFrame from wide to long format: one or more columns are kept as identifiers, and all other columns are unpivoted into a pair of `variable` and `value` columns
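Here is a minimal sketch of the wide-to-long reshape on a two-row table. The data and the `proppant` column are hypothetical; `target_bin` plays the role of the identifier column.

```python
import pandas as pd

# Hypothetical wide table: two numeric columns plus the identifier 'target_bin'
wide = pd.DataFrame(
    {"tvd": [0.5, -1.2],
     "proppant": [1.1, 0.3],
     "target_bin": ["low", "high"]},
    index=["W1", "W2"],
)

# Every column except the identifier becomes a (variable, value) pair
value_cols = [c for c in wide.columns if c != "target_bin"]
long_df = pd.melt(wide, id_vars="target_bin", value_vars=value_cols)

print(long_df)
```

Each of the 2 rows and 2 numeric columns yields one row in the long table, so the result has 4 rows. Note that `pd.melt` discards the original index by default.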

In [None]:
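# Change the DataFrame format from wide to long using the .melt() function
# Note: list.remove() returns None, so the value columns are built with a list comprehension instead

value_columns = [col for col in well_data_clean.columns if col != "target_bin"]
well_data_melt = pd.melt(well_data_clean, id_vars="target_bin", value_vars=value_columns)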


# Visualize the melted dataframe. Take your time to analyze the content of rows and columns 🧐

well_data_melt

👌 Having the dataframe in this format will facilitate the plotting in the next step!

## 🏁 Step 4: Statistical Plotting

In [None]:
import seaborn as sns

# Create a violin plot to visualize the distribution of all variables in the dataset ('well_data_melt')




In [None]:
# Create a swarmplot and color the points according to the 'target_bin' variable



## 🏁 Step 5: Defining the Variables to be Used as 'Target' and 'Features'

In [None]:
# Let's print out the list of column names for the well_data_clean dataframe



In [None]:
# Define the list of variables to be used as features




# Define the cumulative gas production as the target variable 'cum_12_gas_prod', call it 'target'



## 🏁 Step 6: Split the Dataset into 'train set' and 'test set'

In [None]:
from sklearn.model_selection import train_test_split

# Split the full dataset ('well_data') into a test set (30%) and a train set (70%)



# ✍️ 'X' refers to the features, and 'y' refers to the target

# ✍️ 'test_size= 0.3' represents the proportion of data points that will be in the test set (30%)
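As a reference for the 30:70 split, here is a minimal sketch on synthetic data (not the tutorial's wells). The `random_state` argument is an assumption added here so that the split is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# test_size=0.3 puts 30% of the rows in the test set;
# random_state is an assumption added for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)  # (70, 3) (30, 3)
```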

## 🏁 Step 7: Train a LightGBM Model and Evaluate its Performance using Cross Validation

👇 We are going to try a powerful tree-based method -> `lightgbm`

📋 Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample

📌 In k-fold cross-validation, k refers to the number of groups (folds) into which the data sample is split
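To see what the folds look like, here is a small sketch with 10 placeholder samples and k = 5. The `shuffle` and `random_state` settings are assumptions added for a reproducible example.

```python
import numpy as np
from sklearn.model_selection import KFold

# 10 placeholder samples so the fold sizes are easy to read
X = np.arange(10).reshape(-1, 1)

# shuffle and random_state are assumptions added for reproducibility
cv = KFold(n_splits=5, shuffle=True, random_state=42)

fold_sizes = []
test_indices = []
for fold, (train_idx, test_idx) in enumerate(cv.split(X), start=1):
    fold_sizes.append((len(train_idx), len(test_idx)))
    test_indices.extend(test_idx.tolist())
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)}")
```

Every sample lands in the test part of exactly one fold, so each of the 5 runs trains on 80% of the data and is scored on the remaining 20%.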

In [None]:
from lightgbm import LGBMRegressor
from yellowbrick.model_selection import CVScores
from sklearn.model_selection import KFold

# Create a cross-validation strategy. Define the number of folds (k=5) to split the data




# Define the regression model using a LightGBM regressor (LGBMRegressor) and the CVScores visualizer





# Fit the data to the visualizer



☝ The plot above shows the performance of the LGBM model on each of the 5 cross-validation folds.

Since we defined *k* = 5, for each run, 20% of the data is set aside for testing. Once the model is trained, its score is shown as a bar in the chart.

Usually we apply this method on the entire data. But we can also check the performance on the test data 👇

## 🏁 Step 8: Use *yellowbrick* to Visualize the Performance of the LightGBM Model on the Test Set

In [None]:
from yellowbrick.regressor import PredictionError

# Generate a prediction error plot to evaluate the LGBM model

'LGBM: Light Gradient Boosting Machine'


'y = Actual Target Value'
'ŷ = Predicted Target value'

## 🏁 Step 9: Train and Evaluate an XGBoost Model

👇 Now let's try another *hackathon-winning* model -> `XGBRegressor`

In [None]:
from xgboost import XGBRegressor

# Generate the prediction error plot for the XGBoost model

'XGBoost: Extreme Gradient Boosting'


# 'y = Actual Target Value'
# 'ŷ = Predicted Target value'

## 🏁 Step 10: Generate a Residuals Plot

✍️ Residuals, in the context of regression models, are the difference between the observed value of the target variable (y) and the predicted value (ŷ). In other words, the error of the prediction
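The definition above can be written out in a few lines. The numbers below are made up for illustration only.

```python
import numpy as np

# Made-up observed (y) and predicted (ŷ) values
y_true = np.array([10.0, 12.0, 9.0, 15.0])
y_pred = np.array([11.0, 11.5, 9.0, 13.0])

# Residual = observed value - predicted value
residuals = y_true - y_pred
print(residuals)

# A mean residual close to zero suggests no systematic over- or under-prediction
print(residuals.mean())  # 0.375
```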

In [None]:
from yellowbrick.regressor import ResidualsPlot

# Now let's create a residual plot for the LGBM model



☝ We do not want to see any trend in the residual plot. The randomness in the residuals seems to indicate that our model is performing well. 

👌 We can also see from the histogram on the right that our error is normally distributed around zero, which also generally indicates a well fitted model

## 🏁 Step 11: Plot the Learning Curve for the LGBM Model


💡 A learning curve shows the relationship of the training score versus the cross validated test score for an estimator with a varying number of training samples

⚠️ If the training score is much greater than the validation score, then the model probably requires more training data in order to generalize more effectively
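The two scores described above can also be computed directly with scikit-learn's `learning_curve`. The sketch below uses synthetic data and, as a stand-in for the LGBM model, scikit-learn's `GradientBoostingRegressor`; both are assumptions made to keep the example self-contained.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, learning_curve

# Synthetic regression data (a stand-in for the well dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200)

# Evenly spaced fractions of the available training data, from 30% to 100%
sizes = np.linspace(0.3, 1.0, 5)

train_sizes, train_scores, cv_scores = learning_curve(
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    X, y,
    train_sizes=sizes,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="r2",
)

# Average over the 5 folds for each training-set size
gap = train_scores.mean(axis=1) - cv_scores.mean(axis=1)
for n, g in zip(train_sizes, gap):
    print(f"n_train={n}: train minus CV score = {g:.3f}")
```

A gap that shrinks as the number of training samples grows is the pattern that suggests more data would help the model generalize.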

In [None]:
from yellowbrick.model_selection import LearningCurve

# Create an evenly spaced array that will be used as training instances




# Create the learning curve visualizer




# Fit the learning curve to our well data 



💥 The training score measures how well the model fits the training data, while the cross-validation score measures how well the model generalizes to new data

👉 The learning curve above suggests that providing more data for training improves the model performance (score)

👉 The separation between training and CV curves in the learning curve plot indicates that the model is overfitting the training data. It means that the model is fitting the training data too closely and not generalizing well to new data.
    
<br>
📋 To improve the model we can:  

1- Provide more data for training

2- Improve our feature engineering

3- Try using a different model


## 🏁 Step 12: Feature Importance

💡 The feature engineering process involves selecting, manipulating, and transforming raw data into features that can be used to produce a valid model. 

📋 Generally, a model that has fewer features is preferred even if the score is slightly lower

In [None]:
from yellowbrick.model_selection.importances import FeatureImportances
from lightgbm import LGBMRegressor

# Title case the features for a better display



# Define the visualizer




# Fit and show the feature importance plot



## 🏁 Step 13: Use SHAP Values to Explain How a ML Model Works

**SHAP**: Shapley Additive Explanations

💡 SHAP values quantify the contribution of each feature to the final prediction of the model

🔖 A positive SHAP value means that, for that sample, the feature's value pushes the prediction above the model's average (baseline) prediction. A negative SHAP value pushes it below. The sign alone does not tell us whether increasing the feature would increase the prediction

In [None]:
import shap

# Train an XGBoost model



# Compute SHAP values



In [None]:
# Generate a beeswarm plot



👆 From the plot above: 

🔑 Note that features are also ordered by their effect on prediction

🔑 Each point represents a row from the dataset

🔑 The colors represent the feature values, not to be confused with the shap values. If the value of a feature is high -> pink

🔑 In SHAP, the most important feature is the one that has the highest mean absolute SHAP value across all samples in the dataset
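The ideas in the notes above can be checked by hand in a simple case. The sketch below does not use the `shap` library or a tree model: it assumes a hypothetical linear model with independent features, for which the SHAP value of feature i is known to be w_i * (x_i - mean(x_i)). The feature names and weights are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
feature_names = ["tvd", "proppant", "stages"]  # hypothetical names

# Hypothetical linear model: f(x) = b + w . x
w = np.array([0.5, 2.0, -1.0])
b = 3.0
pred = b + X @ w

# For a linear model with independent features, the SHAP value of
# feature i for one sample is w_i * (x_i - mean(x_i))
shap_values = w * (X - X.mean(axis=0))
base_value = pred.mean()

# Additivity: base value + sum of a row's SHAP values = that row's prediction
assert np.allclose(base_value + shap_values.sum(axis=1), pred)

# Global importance: mean absolute SHAP value per feature
importance = np.abs(shap_values).mean(axis=0)
ranking = [feature_names[i] for i in np.argsort(importance)[::-1]]
print(ranking)  # ['proppant', 'stages', 'tvd']
```

Note that `stages` ranks second even though its weight is negative: the ranking uses the magnitude of the SHAP values, not their sign.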

In [None]:
# First, select a random well to investigate. For example the well in row 150



In [None]:
# Let's also print the raw data (before normalization) for the well above



In [None]:
# Generate a waterfall plot for the well selected above



👁‍🗨 The waterfall plot shows the effect that each feature has on the prediction for a given observation, in this case, a well!

📌 If the arrow points to the left, it means that the feature value has a negative impact on the model's prediction. If it points to the right, it means that the feature value has a positive impact on the model's prediction

📌 The length of the bar for each feature represents the magnitude of the impact that feature has on the prediction

📌 The color of the bar represents the direction of the feature's contribution for that particular data point, with red indicating a positive SHAP value (pushing the prediction higher) and blue indicating a negative SHAP value (pushing the prediction lower)

🎯 Well done!