# Prediction on Production of Oil Well with Attention-CNN-LSTM

Authors: S Pan, J Wang, W Zhou

Published in: Journal of Physics: Conference Series (Volume 2030, Paper 012038), 2021. Presented at the ICEECT 2021 conference.

The paper proposes an Attention-CNN-LSTM model for predicting oil well production. A CNN extracts higher-level features from the raw production data and suppresses noise, an LSTM processes the resulting sequence, and an attention mechanism weights the LSTM hidden states. On data from two wells, the authors report that the combined model outperforms the models it is compared against.

## 1. The Problem


Oil well production prediction is crucial for efficient resource management in the petroleum industry. Traditional methods like curve analysis and mathematical modeling are limited in accuracy due to the complexity of external factors affecting production. Machine learning techniques, such as ARIMA, BP neural networks, and SVR, have been used but suffer from limitations like data stability requirements, poor scalability, and susceptibility to local minima. Deep learning approaches, including CNNs and LSTMs, offer better predictive power, but individual models struggle with stability in long-term sequence forecasting. The Attention-CNN-LSTM model is proposed to address these challenges.

## 2. Related work

In the early stage of oilfield development, curve analysis and mathematical modeling methods are widely used.

The authors state that "The traditional machine learning methods generally require that all data should be put into the memory during training". I disagree with them on this point. Some models do need most of the data in memory up front to arrive at good weights, but we can still experiment with partial fitting, where the model is trained on the data chunk by chunk, and get good results.
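To illustrate what partial fitting looks like in practice, here is a minimal sketch using scikit-learn's `SGDRegressor.partial_fit`. The synthetic data, the chunk size of 500 and the number of passes are assumptions chosen for illustration; none of this comes from the paper.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)

# Synthetic linear regression data (illustrative only, not the paper's data)
X = rng.normal(size=(5000, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=5000)

model = SGDRegressor(random_state=0)

# Feed the data in chunks of 500 rows, so only one chunk
# has to be held in memory at a time.
for _ in range(5):  # a few passes over all chunks
    for start in range(0, len(X), 500):
        chunk = slice(start, start + 500)
        model.partial_fit(X[chunk], y[chunk])

print(np.round(model.coef_, 2))
```

In a real out-of-core setting each chunk would be read from disk inside the loop instead of being sliced from an in-memory array.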

Currently, LSTMs are used for oil well production prediction and have achieved good results. However, because of the harsh underground production environment, oil production data usually contains multiple noise components and forms a nonlinear, non-stationary time series. That is why the paper combines a CNN, an LSTM and an attention mechanism to construct its production prediction model. I partially disagree with that as well, since an LSTM alone is enough to handle nonlinear data: its gating mechanism allows it to capture complex dependencies.
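To make the gating argument concrete, here is a minimal NumPy sketch of a single LSTM step in its standard textbook formulation. The weight names (`W`, `U`, `b`), the gate ordering and the random initialisation are assumptions for illustration; this is not the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4n, d), U: (4n, n), b: (4n,); gates stacked as i, f, o, g."""
    n = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:n])          # input gate: how much new information to write
    f = sigmoid(z[n:2 * n])      # forget gate: how much of the old cell state to keep
    o = sigmoid(z[2 * n:3 * n])  # output gate: how much of the cell state to expose
    g = np.tanh(z[3 * n:4 * n])  # candidate values (nonlinear)
    c_t = f * c_prev + i * g     # gated update of the cell state
    h_t = o * np.tanh(c_t)       # hidden state passed to the next step
    return h_t, c_t

rng = np.random.default_rng(0)
d, n = 3, 4  # input size and hidden size (arbitrary)
W = rng.normal(size=(4 * n, d))
U = rng.normal(size=(4 * n, n))
b = np.zeros(4 * n)

h, c = np.zeros(n), np.zeros(n)
for x_t in rng.normal(size=(5, d)):  # run over a short random sequence
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)
```

The sigmoid and tanh nonlinearities, together with the multiplicative gates, are what let the cell model nonlinear dependencies between time steps.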

## 3. Methodology

$$\{\hat{y}_t\}_{t=T+1}^{T+\Delta} = F\left(\{x_t\}_{t=1}^{T}, \{y_t\}_{t=1}^{T} \right)$$

Production prediction for an oil well takes the time series of input features $x_t$ and the actual oil well production $y_t$ over the past $T$ steps as inputs, and constructs a model $F$ that predicts $y$ for the next $\Delta$ steps.
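The formulation above can be sketched as a windowing function that turns the two series into training pairs. The function name `make_windows` and the toy arrays are hypothetical; the paper does not describe its preprocessing at this level of detail.

```python
import numpy as np

def make_windows(x, y, T, delta):
    """Build pairs: past T steps of (x, y) -> next delta values of y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a single feature as one column
    inputs, targets = [], []
    for start in range(len(y) - T - delta + 1):
        past = slice(start, start + T)
        future = slice(start + T, start + T + delta)
        # Features and past production side by side: shape (T, n_features + 1)
        inputs.append(np.column_stack([x[past], y[past]]))
        targets.append(y[future])
    return np.array(inputs), np.array(targets)

x = np.arange(20).reshape(10, 2)  # 10 time steps, 2 toy features
y = np.arange(100, 110)           # 10 toy production values
inputs, targets = make_windows(x, y, T=4, delta=2)
print(inputs.shape, targets.shape)  # (5, 4, 3) (5, 2)
```

Each row of `inputs` plays the role of $\{x_t\}, \{y_t\}$ for $t = 1..T$, and the matching row of `targets` holds the $\Delta$ values the model should predict.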

The model consists of:

- CNN

The input data is passed to the CNN layer first. It can abstract and express the original oil production data at a higher level. As the CNN processes the features of the original data, it mines the correlation between the multi-dimensional inputs and removes noise.

- LSTM

The CNN output is then passed on to the LSTM layers.

- Attention

Attention is used to extract the salient features in the sub-sequences of a long time sequence. It computes a weighted sum over the hidden-state vectors that the LSTM outputs.
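As a rough illustration of that weighted sum, the sketch below scores each LSTM hidden state, turns the scores into weights with a softmax, and averages the hidden states with those weights. The dot-product scoring against a vector `score_vector` is an assumption; the paper's exact scoring function may differ.

```python
import numpy as np

def attention_pool(hidden_states, score_vector):
    """Collapse LSTM outputs of shape (T, d) into one context vector of shape (d,)."""
    scores = hidden_states @ score_vector  # one score per time step, shape (T,)
    scores = scores - scores.max()         # shift for a numerically stable softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    context = weights @ hidden_states      # weighted sum of the hidden states
    return context, weights

rng = np.random.default_rng(0)
H = rng.normal(size=(10, 4))  # 10 time steps, hidden size 4 (toy values)
v = rng.normal(size=4)        # stands in for a learned scoring vector
context, weights = attention_pool(H, v)
print(context.shape, round(float(weights.sum()), 6))
```

Time steps with larger weights contribute more to the context vector, which is how the mechanism emphasises the salient parts of the sequence.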

Finally, we end up with the following structure, Attention-CNN-LSTM:

<img src="./attention_cnn_lstm.png" alt="drawing" width="1000"/>

## 4. Training

The model is trained on data from an oilfield in southern China, which includes the T1 and T2 wells.

The metrics used to judge the model are RMSE, MAE and MAPE.
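For reference, these three metrics can be computed as in the sketch below, using their standard definitions (MAPE is expressed in percent here). The sample values are made up for illustration and are not results from the paper.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent; undefined if y_true contains zeros."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)

y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 190.0, 400.0]
print(rmse(y_true, y_pred), mae(y_true, y_pred), mape(y_true, y_pred))
```

RMSE penalises large errors more heavily than MAE, while MAPE is scale-free, which makes it easier to compare wells with different production levels.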

These are the results reported by the authors:

<img src="./results_comparison.png" alt="drawing" width="1000"/>

The proposed model appears to perform much better than all the other models on both the T1 and T2 datasets.

## 5. Conclusion

According to the paper, Attention-CNN-LSTM is more suitable for predicting time series data such as oil well production than the compared models.

The model seems to extract high-dimensional features correctly using the CNN, and with attention and the LSTM it manages to avoid gradient explosion and pick out the important features.

# Experimentation

In [9]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import LSTM, Conv1D, MaxPooling1D, Dense, Attention, Input
from tensorflow.keras.optimizers import Adam
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor

def generate_nonlinear_data(seq_length=2000):
    """Synthetic noisy series: sine wave plus logarithmic trend plus Gaussian noise."""
    t = np.linspace(0, 100, seq_length)
    y = np.sin(t) + np.log(t + 1) + np.random.normal(scale=0.2, size=seq_length)
    return t, y

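def create_dataset(data, look_back=10):
    """Slice a 1-D series into windows of `look_back` values and the value that follows each."""
    X, y = [], []
    for i in range(len(data) - look_back):
        X.append(data[i:i+look_back])   # input window
        y.append(data[i+look_back])     # next value to predict
    return np.array(X), np.array(y)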

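t, y = generate_nonlinear_data()
# Scale to [0, 1]. Note: the scaler is fit on the whole series, test portion included.
scaler = MinMaxScaler()
y_scaled = scaler.fit_transform(y.reshape(-1, 1)).flatten()

look_back = 10
X, y = create_dataset(y_scaled, look_back)
X = X.reshape(X.shape[0], look_back, 1)  # (samples, time steps, features) for Conv1D/LSTM input

# Chronological 80/20 split, no shuffling because this is a time series
split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]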

def build_bp_model():
    model = MLPRegressor(hidden_layer_sizes=(50, 50), max_iter=500)
    return model

def build_svr_model():
    model = SVR(kernel='rbf')
    return model

def build_lstm_model():
    model = Sequential([
        LSTM(50, return_sequences=True, input_shape=(look_back, 1)),
        LSTM(50),
        Dense(25, activation='relu'),
        Dense(1)
    ])
    model.compile(loss='mse', optimizer=Adam(learning_rate=0.001))
    return model

def build_attention_lstm_model():
    input_layer = Input(shape=(look_back, 1))
    lstm_out = LSTM(50, return_sequences=True)(input_layer)
    
    # Attention requires [query, value] as input
    attention_output = Attention()([lstm_out, lstm_out])  
    
    lstm_out2 = LSTM(50)(attention_output)
    dense_out = Dense(25, activation='relu')(lstm_out2)
    output_layer = Dense(1)(dense_out)
    
    model = Model(inputs=input_layer, outputs=output_layer)
    model.compile(loss='mse', optimizer=Adam(learning_rate=0.001))
    
    return model


def build_cnn_lstm_model():
    model = Sequential([
        Conv1D(filters=64, kernel_size=3, activation='relu', input_shape=(look_back, 1)),
        MaxPooling1D(pool_size=2),
        LSTM(50, return_sequences=True),
        LSTM(50),
        Dense(25, activation='relu'),
        Dense(1)
    ])
    model.compile(loss='mse', optimizer=Adam(learning_rate=0.001))
    return model

def build_attention_cnn_lstm_model():
    inputs = Input(shape=(look_back, 1))
    cnn_out = Conv1D(filters=64, kernel_size=3, activation='relu')(inputs)
    cnn_out = MaxPooling1D(pool_size=2)(cnn_out)
    lstm_out = LSTM(50, return_sequences=True)(cnn_out)
    attention_out = Attention()([lstm_out, lstm_out])
    lstm_out2 = LSTM(50)(attention_out)
    dense_out = Dense(25, activation='relu')(lstm_out2)
    outputs = Dense(1)(dense_out)
    model = Model(inputs, outputs)
    model.compile(loss='mse', optimizer=Adam(learning_rate=0.001))
    return model

bp_model = build_bp_model()
bp_model.fit(X_train.reshape(X_train.shape[0], -1), y_train)
mse_bp = np.mean((bp_model.predict(X_test.reshape(X_test.shape[0], -1)) - y_test) ** 2)

svr_model = build_svr_model()
svr_model.fit(X_train.reshape(X_train.shape[0], -1), y_train)
mse_svr = np.mean((svr_model.predict(X_test.reshape(X_test.shape[0], -1)) - y_test) ** 2)

lstm_model = build_lstm_model()
history_lstm = lstm_model.fit(X_train, y_train, epochs=20, batch_size=16, validation_data=(X_test, y_test), verbose=1)
mse_lstm = lstm_model.evaluate(X_test, y_test)

attention_lstm_model = build_attention_lstm_model()
history_attention_lstm = attention_lstm_model.fit(X_train, y_train, epochs=20, batch_size=16, validation_data=(X_test, y_test), verbose=1)
mse_attention_lstm = attention_lstm_model.evaluate(X_test, y_test)

cnn_lstm_model = build_cnn_lstm_model()
history_cnn_lstm = cnn_lstm_model.fit(X_train, y_train, epochs=20, batch_size=16, validation_data=(X_test, y_test), verbose=1)
mse_cnn_lstm = cnn_lstm_model.evaluate(X_test, y_test)

attention_cnn_lstm_model = build_attention_cnn_lstm_model()
history_attention_cnn_lstm = attention_cnn_lstm_model.fit(X_train, y_train, epochs=20, batch_size=16, validation_data=(X_test, y_test), verbose=1)
mse_attention_cnn_lstm = attention_cnn_lstm_model.evaluate(X_test, y_test)

Epoch 1/20
100/100 ━━━━━━━━━━━━━━━━━━━━ 4s 9ms/step - loss: 0.0783 - val_loss: 0.0032
Epoch 2/20
100/100 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - loss: 0.0027 - val_loss: 0.0027
Epoch 3/20
100/100 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - loss: 0.0026 - val_loss: 0.0032
Epoch 4/20
100/100 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - loss: 0.0029 - val_loss: 0.0023
Epoch 5/20
100/100 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - loss: 0.0026 - val_loss: 0.0023
Epoch 6/20
100/100 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - loss: 0.0026 - val_loss: 0.0023
Epoch 7/20
100/100 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - loss: 0.0025 - val_loss: 0.0028
Epoch 8/20
100/100 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - loss: 0.0027 - val_loss: 0.0024
Epoch 9/20
[output truncated]

100/100 ━━━━━━━━━━━━━━━━━━━━ 3s 8ms/step - loss: 0.1155 - val_loss: 0.0032
Epoch 2/20
100/100 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.0028 - val_loss: 0.0031
Epoch 3/20
100/100 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.0027 - val_loss: 0.0035
Epoch 4/20
100/100 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.0026 - val_loss: 0.0025
Epoch 5/20
100/100 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.0027 - val_loss: 0.0035
Epoch 6/20
100/100 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.0027 - val_loss: 0.0055
Epoch 7/20
100/100 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.0028 - val_loss: 0.0031
Epoch 8/20
100/100 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.0026 - val_loss: 0.0023
Epoch 9/20
[output truncated]

In [10]:
print(f"BP MSE: {mse_bp:.4f}")
print(f"SVR MSE: {mse_svr:.4f}")
print(f"LSTM MSE: {mse_lstm:.4f}")
print(f"Attention-LSTM MSE: {mse_attention_lstm:.4f}")
print(f"CNN-LSTM MSE: {mse_cnn_lstm:.4f}")
print(f"Attention-CNN-LSTM MSE: {mse_attention_cnn_lstm:.4f}")

BP MSE: 0.0026
SVR MSE: 0.0025
LSTM MSE: 0.0017
Attention-LSTM MSE: 0.0027
CNN-LSTM MSE: 0.0019
Attention-CNN-LSTM MSE: 0.0021


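In my experiment, the proposed model performs about as well as the other models rather than clearly better: the plain LSTM reached the lowest test MSE (0.0017), followed by CNN-LSTM (0.0019) and Attention-CNN-LSTM (0.0021). This may be because I do not have the actual data and generated my own synthetic dataset instead. Still, it is a good model.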