# Data Quality Monitoring with Autoencoders at CMS

<div class="alert alert-block alert-info">
  <b>Credits:</b> A. Papanastassiou (INFN-Firenze), B. Camaiani (INFN-Firenze)<br>
    Fifth ML-INFN Hackathon, <a href=https://agenda.infn.it/event/37650/overview> link </a>
</div>

## LSTM Autoencoder

LSTM Autoencoders are built using another type of layer, called the LSTM (Long Short-Term Memory) layer. LSTM layers can learn the dynamics encoded in the temporal ordering of input sequences, and they use an internal memory to retain information across long input sequences. This is possible because LSTM is a type of Recurrent Neural Network (RNN), in which each unit is applied once per time step and can be pictured as multiple copies of the same cell, each passing its state to the next:

<div align="center">
<img src="https://alpapana.web.cern.ch/LSTM%20cell2.PNG" width="500"/>
</div>
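
To make the recurrence concrete, here is a minimal NumPy sketch of a single LSTM cell applied step by step to a short sequence. It uses the standard LSTM gate equations; the weight names (`W`, `U`, `b`), the gate ordering inside the weight matrices and the toy sizes are assumptions chosen for illustration, not the internals of the Keras layer used in this notebook.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (n_in, 4*n_units), U: (n_units, 4*n_units), b: (4*n_units,)
    The four blocks hold the input, forget, candidate and output terms.
    """
    n_units = h_prev.shape[-1]
    z = x_t @ W + h_prev @ U + b
    i = sigmoid(z[..., 0 * n_units:1 * n_units])  # input gate
    f = sigmoid(z[..., 1 * n_units:2 * n_units])  # forget gate
    g = np.tanh(z[..., 2 * n_units:3 * n_units])  # candidate cell state
    o = sigmoid(z[..., 3 * n_units:4 * n_units])  # output gate
    c_t = f * c_prev + i * g      # internal memory: keep part of the old state, add new content
    h_t = o * np.tanh(c_t)        # hidden state passed on to the next time step
    return h_t, c_t

# Toy sizes (hypothetical): 3 input features, 4 units, 5 time steps
rng = np.random.default_rng(0)
n_in, n_units, time_steps = 3, 4, 5
W = rng.normal(scale=0.1, size=(n_in, 4 * n_units))
U = rng.normal(scale=0.1, size=(n_units, 4 * n_units))
b = np.zeros(4 * n_units)

x = rng.normal(size=(time_steps, n_in))
h = np.zeros(n_units)
c = np.zeros(n_units)
hidden_states = []
for x_t in x:
    # the same cell (same weights) is reused at every time step
    h, c = lstm_step(x_t, h, c, W, U, b)
    hidden_states.append(h)
hidden_states = np.array(hidden_states)
print(hidden_states.shape)
```

The same weights are reused at every step, which is what the "multiple copies of the same cell" picture represents; only the states `h` and `c` change along the sequence.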

We can build two kinds of LSTM autoencoders. The first is the Undercomplete one, as before: the number of nodes per layer again decreases and then increases, with the complication that the output of each layer at every time step must be passed to the following layer (using return_sequences=True in the layer definition). The latent layer returns a single vector, so a RepeatVector layer is used to provide one copy of it per time step to the following decoding layer:

<div align="center">
<img src="https://alpapana.web.cern.ch/LSTMAE.PNG" width="650"/>
</div>
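
The shape bookkeeping described above can be followed with plain NumPy arrays. This is only an emulation under stated assumptions: the random array stands in for the hidden states an LSTM layer would produce, and the sizes (8 windows, 50 time steps, 32 latent units) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
batch, time_steps, latent_units = 8, 50, 32

# Stand-in for the hidden states of an LSTM layer at every time step.
# With return_sequences=True the layer returns all of them.
all_hidden = rng.normal(size=(batch, time_steps, latent_units))

# With return_sequences=False only the output at the last time step is returned:
# one latent vector per window.
latent = all_hidden[:, -1, :]

# RepeatVector(n=time_steps) copies that vector once per time step,
# so the decoder again receives a 3D (batch, time_steps, units) input.
repeated = np.repeat(latent[:, np.newaxis, :], time_steps, axis=1)

print(all_hidden.shape, latent.shape, repeated.shape)
```

Every time step of `repeated` holds the same latent vector; the decoding LSTM layers are what turn these identical copies back into a sequence.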

The second kind is the so-called "Sparse" autoencoder, where the encoding is performed using a Dropout layer that randomly sets input units to 0 with a frequency of $rate$ at each step during training. Inputs not set to 0 are scaled up by $1/(1 - rate)$ so that the expected sum over all inputs is unchanged. \
Both models need the input to be reshaped so that each layer sees a window of consecutive samples rather than one sample at a time. We will first reshape the input, then rescale it.
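
As a small illustration of the windowing idea, the sketch below turns a toy 2D array (samples × features) into overlapping windows. `make_windows` is a hypothetical helper written for this example, not the function used in this notebook, and it keeps every complete window.

```python
import numpy as np

def make_windows(values, time_steps):
    """Stack overlapping windows of `time_steps` consecutive rows.

    values: array of shape (n_samples, n_features)
    returns: array of shape (n_samples - time_steps + 1, time_steps, n_features)
    """
    n_windows = len(values) - time_steps + 1
    return np.stack([values[i:i + time_steps] for i in range(n_windows)])

# Toy input: 6 samples with 2 features each
values = np.arange(12, dtype=float).reshape(6, 2)
windows = make_windows(values, time_steps=3)

print(windows.shape)   # (4, 3, 2)
print(windows[1])      # rows 1, 2 and 3 of the original array
```

Window `i` starts at sample `i`, so the first time step of each window, `windows[:, 0, :]`, recovers the original samples in order.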

In [None]:
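# This function creates the new reshaped input for the LSTM layer:
# overlapping windows of time_steps consecutive rows,
# returned with shape (n_windows, time_steps, n_features)

def reshape(X, time_steps=50):
    X1 = []
    for i in range(len(X) - time_steps-1):
        # .loc slices by label and includes both endpoints, so this selects
        # time_steps rows provided X has a default 0..N-1 integer index
        t = X.loc[i:(i + time_steps-1)].values
        X1.append(t)
    return np.array(X1)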

In [None]:
# apply the function to the training dataset and apply the usual rescaling

df_binsm_window=reshape(df_bins_train)
x_train_w=np.array(df_binsm_window, dtype=np.float64)

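# min and max over all windows, computed separately for each (time step, feature) position
min_val = tf.reduce_min(x_train_w,axis=0)
max_val = tf.reduce_max(x_train_w,axis=0)
data_w = (x_train_w - min_val) / (max_val - min_val)
# positions where max == min give 0/0 = NaN: set them to 0
data_w = np.where(np.isnan(data_w), 0, data_w)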

data_w=np.array(data_w, dtype=np.float64)

In [None]:
data_w.shape

### Sparse LSTM Autoencoder
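
Before defining the model, the behaviour of a Dropout layer during training can be sketched in NumPy. This is a simplified stand-in under stated assumptions: `inverted_dropout` is a hypothetical helper, not the Keras implementation, and the all-ones input is chosen only to make the rescaling easy to read.

```python
import numpy as np

def inverted_dropout(x, rate, rng):
    """Zero each entry with probability `rate`, scale the survivors by 1/(1 - rate)."""
    keep_mask = rng.random(x.shape) >= rate
    return np.where(keep_mask, x / (1.0 - rate), 0.0)

rng = np.random.default_rng(42)
x = np.ones(100_000)
dropped = inverted_dropout(x, rate=0.2, rng=rng)

zero_fraction = np.mean(dropped == 0.0)   # close to rate
mean_after = dropped.mean()               # close to the original mean of 1
print(zero_fraction, mean_after)
```

With `rate=0.2`, each surviving entry becomes 1.25, so the average over many inputs stays close to its original value even though about a fifth of them are zeroed. At prediction time no units are dropped.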

In [None]:
# Define the sparse LSTM autoencoder 

autoencoder_LSTMs = keras.Sequential()
# Encoder: without return_sequences the LSTM returns only its last output,
# so each window is encoded into a one-dimensional feature vector.
autoencoder_LSTMs.add(keras.layers.LSTM(units=64, input_shape=(data_w.shape[1],data_w.shape[2])))
# Dropout randomly sets input units to 0 with a frequency of rate at each step during training, which helps prevent overfitting.
autoencoder_LSTMs.add(keras.layers.Dropout(rate=0.2))
# Repeat the encoded vector once per time step (the previous LSTM returns no sequence).
autoencoder_LSTMs.add(keras.layers.RepeatVector(n=data_w.shape[1]))
autoencoder_LSTMs.add(keras.layers.LSTM(units=64, return_sequences=True))
autoencoder_LSTMs.add(keras.layers.Dropout(rate=0.2))
# TimeDistributed applies the same Dense layer to every time step, giving one output vector of length data_w.shape[2] per step.
autoencoder_LSTMs.add(keras.layers.TimeDistributed(keras.layers.Dense(units=data_w.shape[2])))
autoencoder_LSTMs.compile(loss='mae', optimizer='adam')
autoencoder_LSTMs.summary()

In [None]:
# train the model
history=autoencoder_LSTMs.fit(data_w, data_w, epochs=100, batch_size=batch_size, shuffle=False)

In [None]:
plt.plot(history.history['loss'])
plt.xlabel('epochs')
plt.ylabel('MAE')


In [None]:
df_binsm_test_window=reshape(df_bins_test_3, time_steps=50)
x_test_w=np.array(df_binsm_test_window, dtype=np.float64)

min_val = tf.reduce_min(x_test_w,axis=0)
max_val = tf.reduce_max(x_test_w,axis=0)
data_test_w = (x_test_w - min_val) / (max_val - min_val)
data_test_w = np.where(np.isnan(data_test_w), 0, data_test_w)

data_test_w=np.array(data_test_w, dtype=np.float64)

In [None]:
test_predictions_w = autoencoder_LSTMs.predict(data_test_w)
test_predictions_uw=test_predictions_w[:,0,:]
test_data_uw=data_test_w[:,0,:]

In [None]:
mae_LSTMs = tf.math.reduce_mean(tf.math.abs(test_data_uw - test_predictions_uw), axis = 1)

In [None]:
plt.plot(mae_LSTMs)
plt.xlabel('LS')
plt.ylabel('MAE')
plt.title('MAE between input and output of the AE for testing data run3')

In [None]:
LS_sparse=anomalies_finder(mae_LSTMs, 99.7, n=2)

In [None]:
l_sparse=np.sum(x_test_3, axis=0)

cleaned_x_test_sparse =np.delete(x_test_3, [x-zeros for x in df_s.index[df_s[0].apply(lambda x: x in LS_sparse)].tolist()], axis=0)
s_sparse=np.sum(cleaned_x_test_sparse, axis=0)

In [None]:
#Plot l and s and check if the anomalous LSs have disappeared
plt.plot(l_sparse,ds = 'steps-mid',linewidth=1, label='uncleaned test run')
plt.plot(s_sparse,ds = 'steps-mid',linewidth=1, label='cleaned test run')

plt.yscale("log")
plt.legend()
plt.xlabel('METSig')

In [None]:
plt.plot(test_data_uw[161,:],ds = 'steps-mid',linewidth=1, label='input')
plt.plot(test_predictions_uw[161,:],ds = 'steps-mid',linewidth=1, label='output')
plt.legend()

### Undercomplete LSTM Autoencoder

In [None]:
autoencoder_LSTMu = keras.Sequential()
autoencoder_LSTMu.add(keras.layers.LSTM(units=64,input_shape=(data_w.shape[1],data_w.shape[2]),return_sequences=True))
autoencoder_LSTMu.add(keras.layers.LSTM(32, activation='relu', return_sequences=False)) #no return_sequences --> encoding
autoencoder_LSTMu.add(keras.layers.RepeatVector(n=data_w.shape[1]))
autoencoder_LSTMu.add(keras.layers.LSTM(32, activation='relu', return_sequences=True))
autoencoder_LSTMu.add(keras.layers.LSTM(64, activation='relu', return_sequences=True))
autoencoder_LSTMu.add(keras.layers.TimeDistributed(keras.layers.Dense(units=data_w.shape[2])))
autoencoder_LSTMu.compile(loss='mae', optimizer='adam')
autoencoder_LSTMu.summary()

In [None]:
#train the model
history=autoencoder_LSTMu.fit(data_w, data_w, epochs=100, batch_size=batch_size, shuffle=False)

In [None]:
plt.plot(history.history['loss'])
plt.xlabel('epochs')
plt.ylabel('MAE')


In [None]:
test_predictions_w = autoencoder_LSTMu.predict(data_test_w)
test_predictions_uw=test_predictions_w[:,0,:]
test_data_uw=data_test_w[:,0,:]

In [None]:
mae_LSTMu = tf.math.reduce_mean(tf.math.abs(test_data_uw - test_predictions_uw), axis = 1)

In [None]:
plt.plot(mae_LSTMu)
plt.xlabel('LS')
plt.ylabel('MAE')
plt.title('MAE between input and output of the AE for testing data run3')

In [None]:
LS_5=anomalies_finder(mae_LSTMu, 99.7, n=2)

In [None]:
l_under=np.sum(x_test_3, axis=0)

cleaned_x_test_under =np.delete(x_test_3, [x-zeros for x in df_s.index[df_s[0].apply(lambda x: x in LS_5)].tolist()], axis=0)
s_under=np.sum(cleaned_x_test_under, axis=0)

In [None]:
#Plot l and s and check if the anomalous LSs have disappeared
plt.plot(l_under,ds = 'steps-mid',linewidth=1, label='uncleaned test run')
plt.plot(s_under,ds = 'steps-mid',linewidth=1, label='cleaned test run')

plt.yscale("log")
plt.legend()
plt.xlabel('METSig')

LSTM Autoencoders appear to capture smaller anomalies more effectively than the Dense Undercomplete model in run3. Additionally, they detected empty or nearly empty LSs within the run that the dense model failed to identify. Further optimization may improve the dense model, but overall the LSTM Autoencoders are expected to perform better.