# Abstract:

In this Notebook, an LSTM is applied to predict, from the sensor input signals, whether the pump may be shut down (broken) for a period of time.

The inputs of the model are the sensor readings (named sensor_00 to sensor_51 in the data set). The output of the model is a single parameter, the pump operating state: 0 is the shutdown state, 0.5 is recovering and 1 is normal operation.

This Notebook analyzes the correlation between the input parameters (sensors) to determine which of them matter most for the model output, with the aim of building the simplest model: one that requires fewer input parameters while still giving accurate predictions.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import pandas as pd
data =  pd.read_csv("../input/pump-sensor-data/sensor.csv")

# Step 1. Data cleaning

<h4> The author of the data set has reported that the system had 7 failures over one year, which caused serious problems. Thus, predicting when the next failure will occur is very important. </h4>
    (www.kaggle.com/nphantawee/pump-sensor-data) 

In [None]:
data.shape

<h4> The data frame has 55 columns and 220320 rows. Moreover, the measurements have different scales, as shown below. </h4>

In [None]:
data.columns

In [None]:
data.describe().transpose()

In [None]:
data.isnull().sum()

<h4> Let us first remove the columns that are entirely NaN and the columns with zero standard deviation (along with the row index and timestamp columns). </h4>

In [None]:
data.drop(['Unnamed: 0', 'timestamp','sensor_00','sensor_15','sensor_50','sensor_51'],axis=1, inplace=True)
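
The columns dropped above were picked by reading the `describe()` and `isnull()` outputs. As a minimal sketch, assuming a small made-up frame in place of the sensor table, such candidates could also be listed programmatically. The helper name `find_uninformative_columns` is hypothetical and not part of this notebook.

```python
import numpy as np
import pandas as pd


def find_uninformative_columns(frame):
    """Return numeric columns that are entirely NaN or have zero standard deviation."""
    numeric = frame.select_dtypes(include="number")
    all_nan = [c for c in numeric.columns if numeric[c].isna().all()]
    # std() skips NaN by default; an all-NaN column yields NaN, so it is not counted here
    zero_std = [c for c in numeric.columns if numeric[c].std() == 0]
    return all_nan, zero_std


# Toy data standing in for the sensor table (values are made up)
toy = pd.DataFrame({
    "sensor_a": [1.0, 2.0, 3.0, 4.0],
    "sensor_b": [np.nan, np.nan, np.nan, np.nan],
    "sensor_c": [5.0, 5.0, 5.0, 5.0],
})
all_nan_cols, zero_std_cols = find_uninformative_columns(toy)
print(all_nan_cols, zero_std_cols)
```

On the real data, the two returned lists would only be candidates; whether to drop a column remains a judgment call.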

Note that the data set does not say which operating parameter (for example pressure, temperature, flow or vibration) each sensor measures. Nevertheless, quickly judging which sensors are decisive for the operating status of the pump is very important.

In actual operation, pump systems are often equipped with more than one sensor for a single operating parameter, such as pressure, flow rate or temperature, for reasons such as safety, system reliability or confidence in the automation equipment. This can explain the overlap in the measured signals of some sensors, as analyzed below.

In [None]:
import matplotlib.pyplot as plt
data.plot(subplots =True, sharex = True, figsize = (20,50))

# As can be seen, groups of sensors capture the same pattern, for example:

+ (1,2,3), 
+ (4,5,6,7,8,9),
+ (10,11,12), 
+ (14,16,17,18), 
+ (19,20,21,22,23,24), 
+ (25,26,28,29,30,31,32,33), 
+ (34,35), 
+ (38,39,40,41,42,43,45,46,47). 
    
In turn, there are signals that are very noisy and seem to follow no trend in particular.
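
The groups above were read off the plots by eye. A rough sketch of how similar groups might be recovered from the correlation matrix is given here, on made-up signals. The function name `correlated_groups` and the 0.9 threshold are assumptions for illustration, not values used in this notebook.

```python
import numpy as np
import pandas as pd


def correlated_groups(frame, threshold=0.9):
    """Greedily group columns whose absolute Pearson correlation with the
    first member of a group exceeds `threshold`."""
    corr = frame.corr().abs()
    remaining = list(corr.columns)
    groups = []
    while remaining:
        seed = remaining.pop(0)
        members = [seed] + [c for c in remaining if corr.loc[seed, c] > threshold]
        remaining = [c for c in remaining if c not in members]
        groups.append(members)
    return groups


# Two independent made-up base signals, each measured by two "sensors"
rng = np.random.default_rng(0)
base_1 = rng.normal(size=500)
base_2 = rng.normal(size=500)
toy = pd.DataFrame({
    "s1": base_1,
    "s2": base_1 + 0.01 * rng.normal(size=500),
    "s3": base_2,
    "s4": 2.0 * base_2 + 0.01 * rng.normal(size=500),
})
groups = correlated_groups(toy)
print(groups)
```

Pearson correlation only captures linear similarity, so sensors that follow the same trend with a delay or a nonlinear response may still need visual inspection.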

On the basis of that analysis, determining which sensor signals influence the operating state of the pump is important for modeling. To optimize the model, we select its input parameters according to the following hypotheses.

In [None]:
data['machine_status'].value_counts()

The database has 7 BROKEN states, each followed by a RECOVERING phase and a return to the NORMAL operating state. For the sake of simplicity, we assume that about 25% of the data is used to train the model (covering 2 BROKEN states), and the remaining 75% is used to test the predictability of the model based on the input parameters (covering 5 BROKEN states).
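
Before fixing a chronological split, it can help to check how many BROKEN rows fall on each side of it. The sketch below does this on a short made-up status series; the split index `n_train = 8` is hypothetical and stands in for the 50,000 rows used later.

```python
import pandas as pd

# Toy status series standing in for data['machine_status'] (values are made up)
status = pd.Series(
    ["NORMAL"] * 3 + ["BROKEN"] + ["RECOVERING"] * 2 + ["NORMAL"] * 4
    + ["BROKEN"] + ["RECOVERING"] + ["NORMAL"] * 2 + ["BROKEN"] + ["NORMAL"]
)

n_train = 8  # hypothetical split index
is_broken = status == "BROKEN"
broken_in_train = int(is_broken.iloc[:n_train].sum())
broken_in_test = int(is_broken.iloc[n_train:].sum())
print(f"train fraction: {n_train / len(status):.0%}")
print(f"BROKEN rows in train: {broken_in_train}, in test: {broken_in_test}")
```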

For graphical illustration, we assign the BROKEN state the value 0, the RECOVERING state 0.5 and NORMAL operation 1, and store the result in a new column named "Operation".

In [None]:
import numpy as np
conditions = [(data['machine_status'] =='NORMAL'), (data['machine_status'] =='BROKEN'), (data['machine_status'] =='RECOVERING')]
choices = [1, 0, 0.5]
data['Operation'] = np.select(conditions, choices, default=0)

To check whether there are obvious patterns that can be tied to particular periods, we add the "Operation" column to the plots. This can help us define a good dataset to fit the model.

In [None]:
import matplotlib.pyplot as plt
data.plot(subplots =True, sharex = True, figsize = (20,50))

In [None]:
data.columns

# Step 2. Assumptions and LSTM model

As analyzed above, many measurements follow the same trend. We therefore keep only the features of interest and drop the rest, then normalize the features to bring all values into the range [0,1]. Starting by dropping unused features, one can proceed as follows.

# Set 0: 
sensors 4, 6, 7, 8 and 9 are included in the dataset.

In [None]:
df0 = pd.DataFrame(data, columns=['Operation','sensor_04', 'sensor_06', 'sensor_07', 'sensor_08', 'sensor_09'])

# Set 1: 
sensors 1, 4, 10, 14, 19, 25 (the first sensor of each of the first six groups listed above)

In [None]:
df1 = pd.DataFrame(data, columns=['Operation','sensor_01', 'sensor_04', 'sensor_10', 'sensor_14', 'sensor_19', 'sensor_25'])

# Set 2: 
sensors 2, 5, 11, 16, 20, 26 (the second sensor of each of the first six groups listed above)

In [None]:
df2 = pd.DataFrame(data, columns = ['Operation','sensor_02', 'sensor_05', 'sensor_11', 'sensor_16', 'sensor_20', 'sensor_26'])

# Set 3: 
sensors 3, 6, 12, 17, 21, 28 (the third sensor of each of the first six groups listed above)

In [None]:
df3 = pd.DataFrame(data, columns = ['Operation','sensor_03', 'sensor_06', 'sensor_12', 'sensor_17', 'sensor_21', 'sensor_28'])

In [None]:
df0.plot(subplots =True, sharex = True, figsize = (20,20))

This set of time series appears to correlate strongly with the failures of the machine and can be a good indicator of system failure; we will check this for another dataset. For now, the only concern is manipulation and prediction, to test the robustness of classical methods.

In [None]:
df = df0
df.shape

# Step 3. Training the model and implementing the prediction

# Training set:

We choose the first 50,000 data points, which contain 2 broken points, to train the model.

# Testing set:

The remaining roughly 170,000 points, with 5 broken states, are used to test the predictivity of the model.

In [None]:
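def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """Frame a time series as a supervised learning table with lagged inputs."""
    n_vars = 1 if type(data) is list else data.shape[1]
    dff = pd.DataFrame(data)
    cols, names = list(), list()
    # Input sequence (t-n_in, ..., t-1): a positive shift brings past values to row t
    for i in range(n_in, 0, -1):
        cols.append(dff.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # Forecast sequence (t, ..., t+n_out-1): a negative shift brings future values to row t
    for i in range(0, n_out):
        cols.append(dff.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # Assemble the table once, after both loops have finished
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    if dropnan:
        agg.dropna(inplace=True)
    return agg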
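
To illustrate the table that a one-step lag is meant to produce, here is a minimal sketch on a made-up four-row frame, written directly with `DataFrame.shift`. The column names `operation` and `sensor` are placeholders, not columns of the pump data.

```python
import pandas as pd

# Toy frame: 'operation' plays the role of the target, 'sensor' of one input (made-up values)
toy = pd.DataFrame({"operation": [1.0, 1.0, 0.0, 0.5], "sensor": [10.0, 11.0, 3.0, 4.0]})

lagged = toy.shift(1).add_suffix("(t-1)")      # row t now holds the values from t-1
target = toy[["operation"]].add_suffix("(t)")  # value to predict at time t
supervised = pd.concat([lagged, target], axis=1).dropna()
print(supervised)
```

Each remaining row pairs the previous time step's readings with the current state, so the first row, which has no predecessor, is dropped.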

In [None]:
from sklearn.preprocessing import MinMaxScaler

values = df.values
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
reframed = series_to_supervised(scaled, 1, 1)
r = list(range(df.shape[1]+1, 2*df.shape[1]))
reframed.drop(reframed.columns[r], axis=1, inplace=True)
reframed.head()

# Data splitting into train and test data series.
values = reframed.values
n_train_time = 50000
train = values[:n_train_time, :]
test = values[n_train_time:, :]
train_x, train_y = train[:, :-1], train[:, -1]
test_x, test_y = test[:, :-1], test[:, -1]
train_x = train_x.reshape((train_x.shape[0], 1, train_x.shape[1]))
test_x = test_x.reshape((test_x.shape[0], 1, test_x.shape[1]))
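
The cell above fits the `MinMaxScaler` on the whole series before splitting. A possible variant, shown here only as a sketch on made-up numbers, fits the scaler on the training rows alone so that no test statistics enter the scaling, and then reshapes to the `(samples, timesteps, features)` layout that Keras LSTM layers expect. The split index `n_train = 25` is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
values = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # made-up stand-in for df.values
n_train = 25  # hypothetical split index

scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(values[:n_train])  # statistics come from the training rows only
test_scaled = scaler.transform(values[n_train:])       # may fall slightly outside [0, 1]

# One time step per sample: (samples, features) -> (samples, 1, features)
train_x = train_scaled.reshape((train_scaled.shape[0], 1, train_scaled.shape[1]))
print(train_x.shape)
```

With this variant, test values outside the training range map outside [0, 1], which is expected and not an error.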

In [None]:
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dropout
from keras.layers import Dense
from sklearn.metrics import mean_squared_error,r2_score
import matplotlib.pyplot as plt
import numpy as np

model = Sequential()
model.add(LSTM(100, input_shape=(train_x.shape[1], train_x.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')

# Network fitting
history = model.fit(train_x, train_y, epochs=50, batch_size=70, validation_data=(test_x, test_y), verbose=2, shuffle=False)

# Loss history plot
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper right')
plt.show()

size = df.shape[1]

# Prediction test
yhat = model.predict(test_x)
test_x = test_x.reshape((test_x.shape[0], size))

# invert scaling for prediction
inv_yhat = np.concatenate((yhat, test_x[:, 1-size:]), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)
inv_yhat = inv_yhat[:,0]

# invert scaling for actual
test_y = test_y.reshape((len(test_y), 1))
inv_y = np.concatenate((test_y, test_x[:, 1-size:]), axis=1)
inv_y = scaler.inverse_transform(inv_y)
inv_y = inv_y[:,0]

# calculate RMSE
rmse = np.sqrt(mean_squared_error(inv_y, inv_yhat))
print('Test RMSE: %.3f' % rmse)
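
The inverse scaling above rebuilds a full-width array because `inverse_transform` expects as many columns as the scaler was fitted on. As a sketch on made-up data, the same result for the first column can be obtained directly from the fitted `min_` and `scale_` attributes, since `MinMaxScaler` computes `X_scaled = X * scale_ + min_`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(1)
values = rng.uniform(low=0.0, high=100.0, size=(50, 4))  # made-up data; column 0 is the target
scaler = MinMaxScaler(feature_range=(0, 1)).fit(values)
scaled = scaler.transform(values)

# Padding approach: rebuild a full-width array, invert it, keep column 0
padded = np.concatenate((scaled[:, :1], scaled[:, 1:]), axis=1)
via_padding = scaler.inverse_transform(padded)[:, 0]

# Direct approach: invert column 0 by hand from the fitted parameters
direct = (scaled[:, 0] - scaler.min_[0]) / scaler.scale_[0]

print(np.allclose(via_padding, direct))
```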

In [None]:
import numpy as np
# Relative absolute error in percent: total absolute error over total actual value
e = np.round(np.sum(np.abs(inv_y - inv_yhat)) / np.sum(np.abs(inv_y)) * 100, 2)
n_plot = min(160000, len(inv_y))  # number of test points to display
aa = np.arange(n_plot)
plt.figure(figsize=(25,10))
plt.plot(aa, inv_y[:n_plot], marker='.', label="actual")
plt.plot(aa, inv_yhat[:n_plot], 'r', label="prediction (relative absolute error: {} %)".format(e))
plt.ylabel(df.columns[0], size=15)
plt.xlabel('Time', size=15)
plt.legend(fontsize=15)
plt.show()
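
One way to summarize the agreement between actual and predicted values is to report RMSE, mean absolute error and a relative absolute error side by side. The sketch below uses made-up values in the 0 / 0.5 / 1 encoding; the helper `regression_errors` is hypothetical and not part of this notebook.

```python
import numpy as np


def regression_errors(actual, predicted):
    """Return RMSE, MAE and the relative absolute error in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    abs_err = np.abs(actual - predicted)
    rmse = float(np.sqrt(np.mean((actual - predicted) ** 2)))
    mae = float(abs_err.mean())
    # Relative absolute error: total absolute error over total absolute actual value
    rel_pct = float(100.0 * abs_err.sum() / np.abs(actual).sum())
    return rmse, mae, rel_pct


# Made-up values in the notebook's 0 / 0.5 / 1 encoding
actual = [1.0, 1.0, 0.0, 0.5, 1.0]
predicted = [0.9, 1.0, 0.2, 0.5, 0.8]
rmse, mae, rel_pct = regression_errors(actual, predicted)
print(f"RMSE={rmse:.3f}, MAE={mae:.3f}, relative error={rel_pct:.1f} %")
```

Because NORMAL rows dominate the series, all three numbers mostly reflect normal operation and say little about the rare BROKEN rows.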

In [None]:
df = df2
df.shape

In [None]:
df2.plot(subplots =True, sharex = True, figsize = (20,20))

In [None]:
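def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """Frame a time series as a supervised learning table with lagged inputs."""
    n_vars = 1 if type(data) is list else data.shape[1]
    dff = pd.DataFrame(data)
    cols, names = list(), list()
    # Input sequence (t-n_in, ..., t-1): a positive shift brings past values to row t
    for i in range(n_in, 0, -1):
        cols.append(dff.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # Forecast sequence (t, ..., t+n_out-1): a negative shift brings future values to row t
    for i in range(0, n_out):
        cols.append(dff.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # Assemble the table once, after both loops have finished
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    if dropnan:
        agg.dropna(inplace=True)
    return agg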

In [None]:
from sklearn.preprocessing import MinMaxScaler

values = df.values
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
reframed = series_to_supervised(scaled, 1, 1)
r = list(range(df.shape[1]+1, 2*df.shape[1]))
reframed.drop(reframed.columns[r], axis=1, inplace=True)
reframed.head()

# Data splitting into train and test data series.
values = reframed.values
n_train_time = 50000
train = values[:n_train_time, :]
test = values[n_train_time:, :]
train_x, train_y = train[:, :-1], train[:, -1]
test_x, test_y = test[:, :-1], test[:, -1]
train_x = train_x.reshape((train_x.shape[0], 1, train_x.shape[1]))
test_x = test_x.reshape((test_x.shape[0], 1, test_x.shape[1]))

In [None]:
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dropout
from keras.layers import Dense
from sklearn.metrics import mean_squared_error,r2_score
import matplotlib.pyplot as plt
import numpy as np

model = Sequential()
model.add(LSTM(100, input_shape=(train_x.shape[1], train_x.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')

# Network fitting
history = model.fit(train_x, train_y, epochs=150, batch_size=70, validation_data=(test_x, test_y), verbose=2, shuffle=False)

# Loss history plot
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper right')
plt.show()

size = df.shape[1]

# Prediction test
yhat = model.predict(test_x)
test_x = test_x.reshape((test_x.shape[0], size))

# invert scaling for prediction
inv_yhat = np.concatenate((yhat, test_x[:, 1-size:]), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)
inv_yhat = inv_yhat[:,0]

# invert scaling for actual
test_y = test_y.reshape((len(test_y), 1))
inv_y = np.concatenate((test_y, test_x[:, 1-size:]), axis=1)
inv_y = scaler.inverse_transform(inv_y)
inv_y = inv_y[:,0]

# calculate RMSE
rmse = np.sqrt(mean_squared_error(inv_y, inv_yhat))
print('Test RMSE: %.3f' % rmse)

In [None]:
import numpy as np
aa=[x for x in range(170000)]
plt.figure(figsize=(25,10)) 
plt.plot(aa, inv_y[:170000], marker='.', label="actual")
plt.plot(aa, inv_yhat[:170000], 'r', label="prediction with the model")
plt.ylabel(df.columns[0], size=15)
plt.xlabel('Time', size=15)
plt.legend(fontsize=15)
plt.show()
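
The network outputs a continuous value, while the pump state is encoded as 0, 0.5 or 1. If a discrete state is wanted, one simple option, shown here as a sketch on made-up outputs and not used anywhere in this notebook, is to map each prediction to the nearest encoded state. The names `STATES` and `snap_to_states` are hypothetical.

```python
import numpy as np

STATES = np.array([0.0, 0.5, 1.0])  # BROKEN, RECOVERING, NORMAL in the notebook's encoding


def snap_to_states(predicted):
    """Map each continuous prediction to the closest of the three encoded states."""
    predicted = np.asarray(predicted, dtype=float)
    nearest = np.abs(predicted[:, None] - STATES[None, :]).argmin(axis=1)
    return STATES[nearest]


predicted = np.array([0.93, 0.10, 0.48, 1.20, -0.05])  # made-up model outputs
snapped = snap_to_states(predicted)
print(snapped)
```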

# Step 4. Conclusions

In this Notebook, we focused on analyzing the raw data and looking for a logical approach. Building a predictive maintenance plan for pump system operation cannot rely on raw data alone; it also needs the specific operating quantities behind each signal, such as pressure, temperature, flow or vibration of the pump system. Such information makes it possible to apply knowledge of industrial equipment operation to the data analysis when building predictive maintenance models.

This notebook also identifies sensor signals that have similar characteristics or reflect similar operating parameters of the pump system. On that basis, it is necessary to select a suitable, simple, but relevant set of signals that accurately reflects the operating characteristics of the pump system. Choosing the right data not only reduces the computational cost but also makes it possible to build a predictive model with high accuracy and reliability.

Since I am a beginner, there may be mistakes in the methodology, figures or model ideas, as well as in the construction of this Notebook. I would be happy to receive your comments and feedback.

Thank you very much.