<img src="../images/logo.png" alt="slb" style= "width: 1700px"/>

# ⚡️  - Tutorial 1

## `PART 1`: Clustering Analysis

💡 Clustering refers to grouping similar data points together, based on their attributes or features

This tutorial will be divided into two parts:

✔ In the first part, we will explore various clustering methods using log data from a single well

✔ In the second part, we will use the production dataset to generate a model that predicts the production trend for one well

### 🏁 Step 1: Import Required Libraries

In [None]:
#!pip install scikit-learn==0.23.2

In [None]:
# Import the required libraries

import os
import pandas as pd
from pandas import DataFrame
from numpy import nan as NA
import matplotlib.pyplot as plt
import sys
import numpy as np

### 🏁 Step 2: Import and Explore the Dataset

In [None]:
# Load the dataset from the 'w5.csv' file

w5_logs = pd.read_csv('w5.csv')  # adjust the path if the file lives in another folder

In [None]:
# Generate descriptive statistics of the log data


### 🏁 Step 3: Data Preprocessing -  Standardization

💡 Some clustering models are distance-based algorithms: they use a distance metric to measure the similarity between observations and form clusters. As a result, features with large ranges have a bigger influence on the clustering.

✍🏻 Therefore, it's a good practice to scale the data before applying clustering analysis.
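To make the effect of feature ranges concrete, here is a minimal sketch using three made-up samples with an `AI` and a `VpVs` value. The sample values and the feature ranges used for rescaling are assumptions chosen for illustration, not numbers taken from the w5 dataset.

```python
import numpy as np

# Hypothetical samples with columns [AI, VpVs]; the values are illustrative only
a = np.array([6000.0, 1.6])
b = np.array([6100.0, 2.4])  # AI close to a, VpVs very different
c = np.array([6500.0, 1.6])  # AI further from a, VpVs identical


def euclidean(p, q):
    """Straight-line distance between two feature vectors."""
    return float(np.linalg.norm(p - q))


# Assumed feature ranges used for min-max rescaling (not taken from the w5 data)
range_min = np.array([5000.0, 1.5])
range_max = np.array([9000.0, 2.5])


def rescale(v):
    """Map each feature to [0, 1] using the assumed ranges."""
    return (v - range_min) / (range_max - range_min)


d_ab_raw = euclidean(a, b)
d_ac_raw = euclidean(a, c)
d_ab_scaled = euclidean(rescale(a), rescale(b))
d_ac_scaled = euclidean(rescale(a), rescale(c))

print(f"raw:    d(a,b)={d_ab_raw:.2f}  d(a,c)={d_ac_raw:.2f}")
print(f"scaled: d(a,b)={d_ab_scaled:.2f}  d(a,c)={d_ac_scaled:.2f}")
```

On the raw values, sample `b` looks closer to `a` than `c` does, because the large `VpVs` difference is negligible next to the `AI` units. After rescaling the ordering flips.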

We will explore two methods to scale the dataset: `StandardScaler()` and `MinMaxScaler()`

📋 The **StandardScaler** standardizes data so that each transformed feature has a mean of 0 and a standard deviation of 1
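As a rough illustration of both scalers, the sketch below applies them to a small synthetic table. The column names `AI` and `VpVs` and the random values are assumptions for the example only; the exercise cells below should use `w5_logs` instead.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for the log data (illustrative values only)
rng = np.random.default_rng(0)
logs = pd.DataFrame({
    "AI": rng.normal(7000, 800, 200),
    "VpVs": rng.normal(1.9, 0.15, 200),
})

# StandardScaler: subtract the column mean, divide by the column standard deviation
standard_scaled = pd.DataFrame(
    StandardScaler().fit_transform(logs), columns=logs.columns, index=logs.index
)

# MinMaxScaler: map each column linearly onto the [0, 1] interval
minmax_scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(logs), columns=logs.columns, index=logs.index
)

print(standard_scaled.describe().round(2))
print(minmax_scaled.describe().round(2))
```

Both scalers return NumPy arrays, so wrapping the result in a `DataFrame` keeps the column names and index available for later steps.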

In [None]:
# Use the StandardScaler() function to standardize each column of the w5_logs dataset

from sklearn.preprocessing import StandardScaler

# First define the scaler


# Transform the dataset using the scaler

# w5_logs_scaled.describe().round(2)



In [None]:
# Use the MinMaxScaler() function to rescale each column of the w5_logs dataset to the [0, 1] range

from sklearn.preprocessing import MinMaxScaler

# First define the scaler


# Transform the dataset using the scaler

# w5_logs_scaled.describe().round(2)



In [None]:
# Print the standardized dataset and assign 'DEPTH' as index



### 🏁 Step 4: Clustering Analysis Using K-Means

📋 The KMeans algorithm clusters data by trying to separate samples in n groups of equal variance
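A minimal sketch of the K-Means workflow on synthetic blobs is shown here. The data comes from `make_blobs` and the settings (`n_init=10`, `random_state=42`) are assumptions for reproducibility, not values prescribed by this tutorial.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three well-separated groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

# Fit the model and get one cluster label per sample
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("cluster centers shape:", kmeans.cluster_centers_.shape)
print("inertia (sum of squared distances to the nearest center):", kmeans.inertia_)
```

The same `fit_predict` call works on a pandas DataFrame of log curves, raw or scaled.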

In [None]:
# Extract 3 clusters of data using K-Means. Use the raw data (Not Scaled Data)

from sklearn.cluster import KMeans



# Display the VpVs versus AI using the previously run K-means algorithm


In [None]:
# Now let's use the scaled data. Extract 3 clusters of data using K-Means.

from sklearn.cluster import KMeans




# Display the VpVs versus AI using the previously run K-means algorithm


### 🏁 Step 5: Define a Plotting Function

💥 Since we will explore several clustering methods whose results all require the same kind of plot, it's a good idea to define a plotting function
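One possible shape for such a helper is sketched below. The name `plot_clusters`, its arguments, and the `viridis` colormap are assumptions for illustration; the exercise cell that follows defines its own `format_plot` signature.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_clusters(x, y, labels, title="", xlabel="AI", ylabel="VpVs"):
    """Scatter two features and colour each point by its cluster label."""
    fig, ax = plt.subplots(figsize=(6, 5))
    points = ax.scatter(x, y, c=labels, cmap="viridis", s=15)
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    fig.colorbar(points, ax=ax, label="cluster")
    return ax


# Illustrative data: two made-up features and a made-up label per point
rng = np.random.default_rng(0)
ai = rng.normal(7000, 800, 100)
vpvs = rng.normal(1.9, 0.15, 100)
fake_labels = (ai > 7000).astype(int)

ax = plot_clusters(ai, vpvs, fake_labels, title="K-Means (illustrative)")
```

Returning the `Axes` object lets each clustering step add its own annotations before the figure is displayed.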

In [None]:
# Define a function to plot clustering results

def format_plot(title="",pred=""):
    
   
    plt.show()

### 🏁 Step 6: Identify the Optimum Number of Clusters - Elbow Method 💪🏼

💡 In this step, we are going to use *yellowbrick*, a Python library typically used to QC different steps of an ML workflow

In [None]:
# We can use the elbow plot to decide about the best number of clusters

from yellowbrick.cluster import KElbowVisualizer





# Fit the data to the visualizer



# Draw the elbow plot



📋 **Distortion**: The default score of yellowbrick's `KElbowVisualizer`. It is calculated as the sum of the squared distances from each point to the center of its assigned cluster. Typically, the Euclidean distance metric is used.

💡 Good solutions aren’t those with the lowest score but rather the ones where you notice a more or less abrupt discontinuity in the descent of the score (even just a change in the slope). 

💪🏼 In this example, a good solution seems to be four clusters
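The idea behind the elbow plot can also be reproduced by hand. The sketch below uses plain scikit-learn rather than yellowbrick, on synthetic blobs, and records the K-Means `inertia_` for each candidate number of clusters; the data and the range of `k` values are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data generated around four centers (illustrative only)
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.7, random_state=0)

# Fit one K-Means model per candidate k and store its inertia
ks = list(range(2, 9))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks
]

# The elbow is where the improvement from adding one more cluster drops sharply
improvements = -np.diff(inertias)

for k, score in zip(ks, inertias):
    print(f"k={k}: inertia={score:.1f}")
for k, gain in zip(ks[1:], improvements):
    print(f"k={k - 1} -> k={k}: inertia reduced by {gain:.1f}")
```

Reading the printed reductions from top to bottom, the elbow is the last `k` before the gains become small and roughly constant.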

### 🏁 Step 7: Clustering Analysis Using Gaussian Mixture Models (GMM)

💡 Instead of using a distance-based model (K-means), we will now use a distribution-based model
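A minimal sketch of a distribution-based fit with scikit-learn's `GaussianMixture` follows. The synthetic blobs, the three components, and `covariance_type="full"` are assumptions for the example. Unlike K-Means, the model can also report a membership probability for every component.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=1)

# Fit a mixture of three Gaussians
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=1)
gmm.fit(X)

hard_labels = gmm.predict(X)        # most likely component per sample
soft_labels = gmm.predict_proba(X)  # probability of each component per sample

print(hard_labels[:10])
print(np.round(soft_labels[:3], 3))
```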

In [None]:
# Cluster the log data using a Gaussian Mixture Model (GMM)

# First, train the gaussian model 

from sklearn.mixture import GaussianMixture



# Generate predictions from the gaussian mixture model 



# Plot the result using the format_plot function



### 🏁 Step 8: Clustering Analysis Using the Hierarchical Method

📋 Hierarchical clustering is a general family of clustering algorithms that build nested clusters by merging or splitting them successively. 

The *AgglomerativeClustering* object performs hierarchical clustering using a bottom-up approach: each observation starts in its own cluster, and clusters are successively merged together.


📝 The linkage criterion determines the metric used for the merge strategy:

 - **Ward linkage** minimizes the sum of squared differences within all clusters

 - **Complete linkage** minimizes the maximum distance between observations of pairs of clusters
 
 - **Average linkage** minimizes the average of the distances between all observations of pairs of clusters
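The three linkage criteria listed above can be compared side by side. This is a minimal sketch on synthetic blobs; the data and the choice of three clusters are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic 2-D data (illustrative only)
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.7, random_state=3)

# Run the same bottom-up clustering with each linkage criterion
results = {}
for linkage in ("ward", "complete", "average"):
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    results[linkage] = model.fit_predict(X)
    # bincount shows how many samples ended up in each cluster
    print(linkage, np.bincount(results[linkage]))
```

Each entry of `results` is a NumPy array with one integer label per sample. Label numbering is arbitrary, so compare cluster sizes or plots rather than raw label values.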


In [None]:
from sklearn.cluster import AgglomerativeClustering

# Perform clustering analysis using ward linkage criteria



# 'clust_ward' is a numpy array, explore it!


# Plot the clustering results using the format_plot function



### ----


# Perform clustering analysis using complete linkage criteria



# 'clust_complete' is a numpy array, explore it!


# Plot the clustering results using the format_plot function



### ----

# Perform clustering analysis using average linkage criteria



# 'clust_average' is a numpy array, explore it!


# Plot the clustering results using the format_plot function





### 🏁 Step 9: Clustering Analysis Using DBSCAN Method

📋 DBSCAN - Density-Based Spatial Clustering of Applications with Noise (i.e., outliers). 

💡 The main idea behind DBSCAN is that a data value belongs to a cluster if it is close to many data values from that cluster
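Here is a minimal sketch of DBSCAN on synthetic data with two deliberately isolated points. The `eps` and `min_samples` values, the `make_moons` data, and the two added outliers are assumptions for the example; suitable values for real log data usually need some experimentation.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons plus two points placed far from everything else
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
outliers = np.array([[3.0, 3.0], [-2.5, -2.0]])
X = np.vstack([X, outliers])

# eps: neighbourhood radius, min_samples: neighbours needed to form a dense region
db = DBSCAN(eps=0.2, min_samples=5)
db_labels = db.fit_predict(X)

# Label -1 marks noise, so it is excluded from the cluster count
n_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
n_noise = int(np.sum(db_labels == -1))

print("clusters found:", n_clusters)
print("noise points:", n_noise)
```

The number of clusters is an output of the algorithm rather than an input, and the two isolated points are labelled `-1`.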

In [None]:
from sklearn import cluster

# Train the dbscan model




# Generate predictions using the dbscan model 




# Plot clustering results using the format_plot function



In [None]:
# 📝 Note that DBSCAN does not require specifying the number of clusters, let's review it ('db_pred')



# Noisy data points are given the label -1. They do not belong to any cluster

#  `PART 2`: Prediction of Production Patterns Using Clustering Analysis

In the second part of this tutorial, we will learn how to train a clustering model using the production data from three wells (w2, w3 and w4), and then use this model to predict the production trends of another well (w1)
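The overall train-then-predict pattern can be sketched on synthetic data before working with the real file. In this example the decline curves, the `UWI` well-identifier column, and the helper `synthetic_well` are all assumptions made for illustration; only the well names and the `Monthly Gas` column name come from this tutorial.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def synthetic_well(name, qi, months=60, seed=0):
    """Build a made-up declining gas-rate series for one well (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    t = np.arange(months)
    rate = qi * np.exp(-0.05 * t) + rng.normal(0, 0.02 * qi, months)
    return pd.DataFrame({"UWI": name, "Monthly Gas": rate})


production = pd.concat(
    [synthetic_well(f"w{i}", qi, seed=i) for i, qi in enumerate([900, 1000, 1100, 950], start=1)],
    ignore_index=True,
)

# Hold out well w1 as the test set; train on the remaining wells
production_test = production[production["UWI"] == "w1"]
production_train = production[production["UWI"] != "w1"]

# Fit on the training wells only, then assign clusters to the unseen well
model = KMeans(n_clusters=3, n_init=10, random_state=0)
model.fit(production_train[["Monthly Gas"]])
test_pred = model.predict(production_test[["Monthly Gas"]])

print(pd.Series(test_pred).value_counts().sort_index())
```

The key point is that `fit` only sees the training wells, while `predict` reuses the learned cluster centers on the held-out well.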

## 🏁 Step 10: Read and display the production data

In [None]:
# Read and display the production data (production_data.csv)

production = pd.read_csv(...)


# ⚠️ Make sure that the column 'Prod Date' is treated as date (yyyy-mm-dd)

In [None]:
# Set the column 'Prod Date' as index



In [None]:
# Print the production data to visualize changes



In [None]:
# Investigate the total number of wells in the dataset


## 🏁 Step 11: Create the test set using the production data from Well w1

In [None]:
# Create a new dataframe that contains only the production data from well 'w1', call it 'production_test'



## 🏁 Step 12: Create the training set using the production data from Wells w2, w3 and w4

In [None]:
# Create a new dataframe that contains the production data from the rest of the wells ('w2', 'w3' and 'w4'), call it 'production_train'



In [None]:
# Verify the wells contained in each dataframe ('production_test' & 'production_train')



## 🏁 Step 13: Plot the production data for the train wells

In [None]:
# Now let's plot the monthly gas production versus date for the train wells  



👆 Note that the production from the three wells begins at different times

## 🏁 Step 14: Plot the "combined" production data from the train wells

In [None]:
# Plot the entire production data without discretizing it by well 




👆 In the plot above, the "production peaks" correspond to the order in which the wells appear in the dataset (w3 -> w2 -> w4) 

💡 Now that we have combined the production data from the train wells, the next step will be to train a model to see if we can separate different parts of the time series using another feature. 

For this, we can create a new feature using the .diff() function 👇

## 🏁 Step 15: Create a new variable using the .diff() function

✍🏼 The .diff() function is used to calculate the difference between the values for each row and, by default, the previous row
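A small worked example of `.diff()` on a made-up series is shown here. The values and dates are illustrative assumptions, and filling the first missing difference with 0 is one possible choice, not necessarily the one used in this tutorial.

```python
import pandas as pd

# Made-up monthly gas volumes indexed by production date (illustrative only)
production = pd.DataFrame(
    {"Monthly Gas": [100.0, 140.0, 130.0, 115.0, 105.0]},
    index=pd.date_range("2020-01-01", periods=5, freq="MS", name="Prod Date"),
)

# Each row minus the previous row; the first row has no predecessor, so it is NaN
production["diff"] = production["Monthly Gas"].diff()

# Clustering models cannot handle NaN, so the first value must be filled or dropped
production_filled = production.fillna({"diff": 0.0})

print(production)
print(production_filled)
```

Positive values mark months where production increased and negative values mark declining months, which gives the clustering model a second feature besides the rate itself.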

In [None]:
# Create a function to add a new feature to the production dataset, call it 'preprocess_data'



In [None]:
# Apply the 'preprocess_data' function to the production test and production train dataframes



In [None]:
# Display the resulting production train dataframe 'production_train_processed'



In [None]:
# Display the resulting production test dataframe 'production_test_processed'



## 🏁 Step 16: Plot the Production and diff features

In [None]:
# Plot the production (Monthly Gas) and diff variables in the same graph




## 🏁 Step 17: Identify the optimum number of clusters

In [None]:
# Use the elbow plot on the train dataset to identify the optimum number of clusters

from yellowbrick.cluster import KElbowVisualizer




## 🏁 Step 18: Train a K-Means model on the production data

In [None]:
# Train a K-means clustering model using the production data on the train set (w2, w3, w4)



# 'prod_train_pred' is a numpy array, explore it!



## 🏁 Step 19: Plot the K-means Clusters

In [None]:
# Plot the results of the clustering on the train set (w2, w3, w4)



## 🏁 Step 20: Use the K-means model to predict the Production data for well w1

In [None]:
# Apply the previously trained K-means model to predict the production trend on the test set (w1)




# Plot the results of the clustering on the test set (w1)



## 🏁 Step 21: Plot the Actual Production Data for Well w1

In [None]:
# Now let's plot the actual production data for well 'w1' and compare it with the predicted production clusters



🚀 Well done!

## 🚧 Optional steps beyond this point 🚧

## 🏁 Step 22: Decline Curve Analysis

In [None]:
# Read the production data from a CSV file


# Filter the data for well 1 -> 'w1'


In [None]:
# Rename the columns to 'Date', 'UWI', and 'Qg1'



In [None]:
# Convert the Date and Qg1 columns into numpy arrays



In [None]:
# Create the time range array 'xdataf'
 

# xdataf refers to the range of time (e.g., from 0 to 90 months)

👇 Now let's create a new dataframe combining the required variables for DCA

In [None]:
# Create a new DataFrame with 'zmonth', 'Qg1', and 'Date' columns



### 🚨 ARPS EQUATIONS

<img src="../images/pic7.png" alt="Decline Curve Equation" style= "width: 500px"/>
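As a rough sketch of the fitting procedure, the example below assumes the standard exponential Arps form q(t) = qi · exp(−Di · t) and fits it with `scipy.optimize.curve_fit`. The synthetic rates, the "true" parameters, the initial guess, and the bounds are all assumptions for illustration; the names `exponential_decline`, `qi_fit` and `di_fit` are hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit


def exponential_decline(t, qi, di):
    """Exponential Arps decline: rate at time t for initial rate qi and decline rate di."""
    return qi * np.exp(-di * t)


# Synthetic history: 90 months of noisy exponential decline (illustrative only)
rng = np.random.default_rng(0)
months = np.arange(1, 91, dtype=float)
true_qi, true_di = 1200.0, 0.03
rate = exponential_decline(months, true_qi, true_di) * (1 + rng.normal(0, 0.03, months.size))

# Initial guess and bounds keep the optimizer in a physically sensible region
p0 = [rate.max(), 0.01]
bounds = ([0.0, 0.0], [np.inf, 1.0])
(qi_fit, di_fit), _ = curve_fit(exponential_decline, months, rate, p0=p0, bounds=bounds)

# Extend the fitted curve 100 months beyond the history to get a forecast
forecast_months = np.arange(91, 191, dtype=float)
forecast = exponential_decline(forecast_months, qi_fit, di_fit)

print(f"fitted qi = {qi_fit:.1f}, fitted Di = {di_fit:.4f}")
```

Fitting only the later, boundary-dominated portion of the history amounts to passing a filtered subset of `months` and `rate` to the same `curve_fit` call.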

In [None]:
from scipy.optimize import curve_fit

# 1) Determine the total number of months of historical production and create a range from 1 to 90+



#-------------------------------------------------------------------------------------------


# 2) Determine the bounds (min-max)



#-------------------------------------------------------------------------------------------

# 3) Get the values of months and gas rate



#-------------------------------------------------------------------------------------------

# 4) Define the exponential equation with three variables: time, initial rate, and rate of decline



#-------------------------------------------------------------------------------------------


# 5) Fit the exponential decline equation with the rate observations using curve_fit
# Determine the values of the initial rate and rate of decline.

 # Print the values of qi1 and Di1.

# Generate the predicted gas rate using the fitted parameters


#-------------------------------------------------------------------------------------------

# 6) Plot monthly production and the exponential decline curve with the fitted curve.



7) Repeat the same process as above, but now filter the data starting from the yellow cluster, which indicates that the well has reached boundary-dominated flow.

In [None]:
# Create the fit only with the filtered portion of the gas rate

# Get the length of the zmonth column


# Update the zmonth column with a range from 1 to n1


# Find the maximum gas rate


# Set the initial guess for curve fitting


# Filter the zmonth and Qg1 columns to exclude the first 15 months


# Define the exponential decline function


# Perform the curve fitting to estimate qi1 and Di1


# Generate the predicted gas production using the fitted parameters



# Plot the predicted production and the actual production


In [None]:
# 8) Extract printed initial rate and decline rate values from the previous step



# 9) Select the number of months to forecast.



# 10) Create a data frame with the column 'zmonth,' which includes the total months of history plus the forecasting months



# 11) Starting from the initial rate of the yellow cluster, loop over the months and apply the exponential equation
# with the fitted values to predict the gas rate for the following 100 months.
# Finally, add those predictions to the data frame we created in step 10.





# Plot results


In [None]:
# Let's print the forecasted data


🎯 Well done!