# Anomaly Detection

Anomaly (or outlier) detection is the task of finding patterns that do not conform to expected behavior. Common approaches are based on statistical properties, clustering, classification, Principal Component Analysis (PCA), or subsampling. In this notebook we will look at an **autoencoder network** for anomaly detection. An autoencoder is a neural-network-based, unsupervised learning model that learns low-dimensional features capturing some of the structure underlying high-dimensional input data.

## Lab Scenario

Groundwater level is an important metric, especially for agricultural states such as Iowa. One of the metrics the [U.S. Geological Survey (USGS)](https://www.usgs.gov/) monitors is **depth to water level in feet below the land**. In this lab we will use a synthetic dataset that models certain scenarios for Des Moines, Iowa. The three key weather-related metrics we will be using are:

- water-level (depth to water level in feet below the land)
- temperature
- humidity

The data is generated daily using realistic monthly averages for Des Moines, Iowa, for the years 2016–2019.

We will use three copies of the dataset for the years 2016–2019:

1. Normal conditions for the region.
2. A sudden, heavy rainfall in mid-May 2019 that abnormally raises groundwater levels.
3. A gradual buildup of dry conditions over June and July 2019 that lowers groundwater levels.

## Outline

1. **Introduction to the datasets**: Understand the patterns in the three datasets – normal, sudden and gradual.

2. **Define and train the Autoencoder Network**: Use Keras to define and train the autoencoder model.

3. **Establish criteria for anomalies**: Define approaches and thresholds for detecting anomalies based on the trained autoencoder model.

4. **Predict anomalies**: Using the trained autoencoder model, make predictions on the sudden and gradual datasets to identify anomalies.

5. **Principal Component Analysis**: Apply PCA on the encoded dataset and visualize the data representation at lower dimensions.

### Import required libraries 

In [1]:
import pandas as pd
import numpy as np
np.random.seed(293)
import math
from IPython.display import display, HTML, Image, SVG
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_colwidth', -1)
print("pandas version: {} numpy version: {}".format(pd.__version__, np.__version__))

import sklearn
from sklearn import preprocessing
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn_pandas import DataFrameMapper

import keras
import tensorflow
from keras.layers import Input, Dropout
from keras.layers.core import Dense 
from keras.models import Model, Sequential, load_model
from keras import regularizers
from keras.models import model_from_json

from numpy.random import seed
from tensorflow import set_random_seed

print("keras version: {} tensorflow version: {} sklearn version: {}".format(keras.__version__, 
                                                                        tensorflow.__version__, sklearn.__version__))

%matplotlib notebook
import matplotlib.pyplot as plt
import seaborn as sns

print('importing libraries done!')

pandas version: 0.25.1 numpy version: 1.16.4


Using TensorFlow backend.


keras version: 2.2.5 tensorflow version: 1.14.0 sklearn version: 0.20.3
importing libraries done!


**Helper method to display a pandas dataframe**

In [2]:
def display_dataframe(df_in):
    s = df_in.style.set_properties(**{'text-align': 'left'})
    s.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])
    display(HTML(s.render()))

## Introduction to the Datasets

**Load the three datasets**

In [3]:
normal_url = ('https://quickstartsws9073123377.blob.core.windows.net/'
              'azureml-blobstore-0d1c4218-a5f9-418b-bf55-902b65277b85/anomaly_detection/normal.xlsx')

sudden_url = ('https://quickstartsws9073123377.blob.core.windows.net/'
              'azureml-blobstore-0d1c4218-a5f9-418b-bf55-902b65277b85/anomaly_detection/sudden.xlsx')

gradual_url = ('https://quickstartsws9073123377.blob.core.windows.net/'
               'azureml-blobstore-0d1c4218-a5f9-418b-bf55-902b65277b85/anomaly_detection/gradual.xlsx')

normal_df = pd.read_excel(normal_url)
sudden_df = pd.read_excel(sudden_url)
gradual_df = pd.read_excel(gradual_url)

### Pair Plots

Visualize the pairwise relationships in the three datasets for the three key metrics: (1) temperature, (2) humidity, and (3) water-level.

#### Normal Dataset

The water level appears to follow a negatively skewed distribution with a peak at about 20 feet. It is important to note that a **higher water-level number represents dry conditions** and a **lower water-level number represents wet conditions**.

There are no natural clusters between water level and temperature, whereas there are two distinct clusters between water level and humidity.

In [4]:
cols = ['temperature', 'humidity', 'water_level']
sns.pairplot(normal_df[cols])

<seaborn.axisgrid.PairGrid at 0x7f2d1bd45e10>

#### Sudden Dataset

Note that in this dataset we have introduced a sudden heavy rainfall that temporarily raised groundwater levels from mid-May 2019 until the end of May 2019.

There is no visible change in the pairwise distributions.

In [5]:
sns.pairplot(sudden_df[cols])

<seaborn.axisgrid.PairGrid at 0x7f2d1bc64ef0>

#### Gradual Dataset

Note that this dataset emulates a gradual buildup of dry conditions over June and July 2019 that unseasonably dropped the water levels.

There is no visible change in the pairwise distributions.

In [6]:
sns.pairplot(gradual_df[cols])

<seaborn.axisgrid.PairGrid at 0x7f2d1417ad30>

### Water Level Plots

#### Box and Whisker Plot

In a box and whisker plot, the ends of the box are the upper and lower quartiles, so the box spans the interquartile range. The median is marked by a horizontal line inside the box. The whiskers are the two lines outside the box that indicate variability beyond the upper and lower quartiles. Outliers are plotted as individual points.
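
The quantities a box plot draws can also be computed directly. The following is a minimal sketch, assuming the common convention of whiskers that extend to the most extreme points within 1.5 times the interquartile range of the box. The function name `box_plot_stats` and the sample readings are hypothetical and are not taken from the lab datasets.

```python
import numpy as np

def box_plot_stats(values, whis=1.5):
    """Return quartiles, whisker ends, and outliers for a 1-D sample."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    # Fences mark how far the whiskers are allowed to reach
    low_fence, high_fence = q1 - whis * iqr, q3 + whis * iqr
    inside = values[(values >= low_fence) & (values <= high_fence)]
    outliers = values[(values < low_fence) | (values > high_fence)]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": inside.min(), "whisker_high": inside.max(),
        "outliers": outliers.tolist(),
    }

# Hypothetical depth readings (feet) for one month, with one extreme value
sample = [18.0, 19.0, 20.0, 20.0, 21.0, 22.0, 23.0, 35.0]
box_stats = box_plot_stats(sample)
print(box_stats)
```

In this made-up sample the reading of 35.0 falls beyond the upper fence, so it would be drawn as an individual outlier point while the upper whisker stops at 23.0.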

The plot shows that what counts as normal changes from month to month. The summer months of June and July are relatively wet from a groundwater-level perspective, presumably due to precipitation in earlier months. Maintaining reliable monthly water levels is critical to agricultural needs.

In this visualization, there is no obvious evidence of the induced anomalies in the sudden dataset. The gradual dataset does show a noticeably bigger spread for the months of June and July.

In [7]:
f, axes = plt.subplots(3, 1, sharey=True, sharex=True, figsize=(6, 7))

ax1 = sns.boxplot(x="month_name", y="water_level", data=normal_df, ax=axes[0])
ax2 = sns.boxplot(x="month_name", y="water_level", data=sudden_df, ax=axes[1])
ax3 = sns.boxplot(x="month_name", y="water_level", data=gradual_df, ax=axes[2])

ax1.set_title('Normal Dataset')
ax2.set_title('Sudden Dataset')
ax3.set_title('Gradual Dataset')
ax1.set_xlabel('')
ax2.set_xlabel('')
ax3.set_xlabel('')
ax1.set_ylabel('')
ax2.set_ylabel('Water Level')
ax3.set_ylabel('')

f.tight_layout(rect=[0, 0.03, 1, 0.95])

#### Daily Trend Plot

This plot shows daily `Water Level` reading for years 2016-2019 (1461 days).

In this historic view of the data, you can observe the induced anomalies (around day 1230) in the sudden and gradual datasets. In the sudden dataset, the `Water Level` suddenly rises, and in the gradual dataset, the `Water Level` gradually drops over a period of time.

In this notebook we are going to develop an anomaly detection model that allows us to detect such anomalies in real time.

In [8]:
f, ax = plt.subplots(3, 1, sharey=True, sharex=True, figsize=(6, 7))
ax[0].plot(normal_df.water_level)
ax[1].plot(sudden_df.water_level)
ax[2].plot(gradual_df.water_level)
ax[0].set_title('Normal Dataset')
ax[1].set_title('Sudden Dataset')
ax[2].set_title('Gradual Dataset')
ax[1].set_ylabel('Water Level')
f.tight_layout(rect=[0, 0.03, 1, 0.95])

## Define and Train the Autoencoder Network

### Preprocess Input Data

Select **month**, **temperature**, **humidity**, and **water level** as our features for the network.

In [9]:
feature_cols = ['month', 'temperature', 'humidity', 'water_level']
categorical = ['month']
numerical = ['temperature', 'humidity', 'water_level']

numeric_transformations = [([f], Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', MinMaxScaler())])) for f in numerical]
    
categorical_transformations = [([f], OneHotEncoder(handle_unknown='ignore', sparse=False)) for f in categorical]

transformations = categorical_transformations + numeric_transformations

clf = Pipeline(steps=[('preprocessor', DataFrameMapper(transformations))])

X_train = clf.fit_transform(normal_df[feature_cols])
np.random.shuffle(X_train)
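
To make the shape of the network input concrete, here is a rough sketch of what the preprocessing does to a single row: the month is one-hot encoded and each numeric feature is min-max scaled. It assumes all 12 months occur in the training data (giving 12 one-hot columns, 15 columns in total). The function `encode_row` and the minima and maxima below are hypothetical; in the notebook these values are learned when `clf` is fitted.

```python
import numpy as np

def encode_row(month, numeric_values, numeric_min, numeric_max):
    """One-hot encode the month (1-12) and min-max scale the numeric features."""
    one_hot = np.zeros(12)
    one_hot[month - 1] = 1.0
    values = np.asarray(numeric_values, dtype=float)
    lo = np.asarray(numeric_min, dtype=float)
    hi = np.asarray(numeric_max, dtype=float)
    scaled = (values - lo) / (hi - lo)  # min-max scaling: (x - min) / (max - min)
    # Categorical columns come first, matching the order of `transformations`
    return np.concatenate([one_hot, scaled])

# Hypothetical per-feature minima and maxima (temperature, humidity, water_level)
numeric_min = [0.0, 50.0, 5.0]
numeric_max = [100.0, 90.0, 30.0]

row = encode_row(month=5, numeric_values=[52.0, 63.1, 13.3],
                 numeric_min=numeric_min, numeric_max=numeric_max)
print(row.shape)  # (15,)
print(row)
```

Because every column ends up in the range 0 to 1, reconstruction errors across features are on a comparable scale.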

### Define the Autoencoder Network Architecture

In [10]:
seed(10)
set_random_seed(10)
act_func = 'elu'

input_ = Input(shape=(X_train.shape[1],))
x = Dense(100, activation=act_func)(input_)
x = Dense(50, activation=act_func)(x)
encoder = Dense(20, activation=act_func, name='feature_vector')(x)
x = Dense(50, activation=act_func)(encoder)
x = Dense(100, activation=act_func)(x)
output_ = Dense(X_train.shape[1], activation=act_func)(x)

model = Model(input_, output_)
opt = keras.optimizers.Adam(lr=0.0001)
model.compile(loss='mse', optimizer=opt)

encoder_model = Model(inputs=model.input, outputs=model.get_layer('feature_vector').output)
encoder_model.compile(loss='mse', optimizer='adam')
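
As a sanity check on the architecture, the number of trainable parameters can be worked out by hand. This is a small sketch that assumes the input has 15 columns (12 one-hot month columns plus 3 scaled numeric features); the helper `dense_params` is illustrative only.

```python
def dense_params(n_in, n_out):
    """Weights plus biases for a fully connected layer."""
    return n_in * n_out + n_out

# Assumed input width: 12 one-hot month columns + 3 scaled numeric features
n_features = 15
layer_sizes = [n_features, 100, 50, 20, 50, 100, n_features]

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer)  # [1600, 5050, 1020, 1050, 5100, 1515]
print(total)      # 15335
```

If the assumption about the input width holds, calling `model.summary()` in the notebook should report the same totals.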


### Train the Autoencoder Model

In [11]:
epochs = 100
batch_size = 16

history = model.fit(X_train, X_train, batch_size=batch_size, epochs=epochs, validation_split=0.05, verbose=1)

Train on 1387 samples, validate on 74 samples
Epoch 1/100
...
Epoch 100/100
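
The sample counts in the training log follow from `validation_split=0.05` applied to the 1461 daily rows. The short sketch below reproduces that arithmetic, assuming the split point is the integer part of `n_samples * (1 - validation_split)`.

```python
import math

n_samples = 1461          # daily rows for 2016-2019
validation_split = 0.05
batch_size = 16

# The last fraction of the arrays passed to fit() is held out for validation
n_train = int(n_samples * (1.0 - validation_split))
n_val = n_samples - n_train
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_val, steps_per_epoch)  # 1387 74 87
```

Since `X_train` was shuffled before calling `fit`, the held-out rows are a random subset of the normal dataset rather than the final weeks of 2019.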


### Review the Model Training Loss

In [12]:
plt.figure()
plt.plot(history.history['loss'][15:])
plt.show()

**Save the models**

In [13]:
model.save('anomaly_detection_full_model.h5')
encoder_model.save('anomaly_detection_encoder_model.h5')

**Load the models**

Uncomment and run this cell if you want to load previously trained models.

In [14]:
#model = load_model('anomaly_detection_full_model.h5')
#encoder_model = load_model('anomaly_detection_encoder_model.h5')

## Establish Criteria for Anomalies

The autoencoder network is trained on normal data: it first compresses the input and then reconstructs it. During training, the network learns the interactions between the input variables under normal conditions and learns to reconstruct them back to their original values. The reconstruction error is the error in reproducing the original input values. We will use `Mean Absolute Error` as our measure of the reconstruction error. The basic idea behind anomaly detection is that the reconstruction error of the trained network will be higher for anomalous inputs than what is typically observed with normal data.

Thus, one of the parameters we need to understand is the **threshold for the reconstruction error** that identifies anomalous input data.
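
The idea can be illustrated with a toy example. The sketch below computes a per-row mean absolute error between inputs and their reconstructions and compares it with a cutoff. The arrays, the function name `reconstruction_mae`, and the threshold are all made up for illustration and do not come from the trained model.

```python
import numpy as np

def reconstruction_mae(x, x_reconstructed):
    """Mean absolute error per row between inputs and their reconstructions."""
    x = np.asarray(x, dtype=float)
    x_reconstructed = np.asarray(x_reconstructed, dtype=float)
    return np.mean(np.abs(x_reconstructed - x), axis=1)

# Two hypothetical scaled inputs and made-up reconstructions
x = np.array([[0.50, 0.30, 0.40],
              [0.50, 0.30, 0.90]])
x_hat = np.array([[0.51, 0.29, 0.40],
                  [0.52, 0.33, 0.60]])

errors = reconstruction_mae(x, x_hat)
threshold = 0.05  # illustrative only
print(errors)
print(errors > threshold)
```

The first row is reconstructed closely and stays under the cutoff, while the second row, whose last feature is reconstructed poorly, is flagged.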

**Compute reconstruction errors for the normal dataset**

Next, we will make predictions on the normal dataset, compute the reconstruction error for each individual set of inputs, and look at the upper and lower bounds for the reconstruction errors.

In [15]:
X_train = clf.transform(normal_df[feature_cols]) # Keep the order, X_train used for training was shuffled
X_pred = model.predict(X_train)
loss_mae = np.mean(np.abs(X_pred-X_train), axis = 1)
normal_df['loss_mae'] = loss_mae
stats = normal_df.loss_mae.describe()
whis = 2.0
upper_bound = (whis* (stats['75%'] - stats['25%']) + stats['75%'])
lower_bound = (-whis* (stats['75%'] - stats['25%']) + stats['25%'])
print(('Mean Absolute Error: lower bound: {}, upper bound: {}').format(lower_bound, upper_bound))

Mean Absolute Error: lower bound: -0.0003853121893124217, upper bound: 0.004824729858529223
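
The two printed bounds are enough to recover the quartiles they were built from, which gives a feel for where typical errors sit. This is a small worked sketch using only the numbers printed above and the `whis = 2.0` factor from the cell; the variable names are illustrative.

```python
whis_used = 2.0
printed_lower = -0.0003853121893124217
printed_upper = 0.004824729858529223

# upper - lower = (q3 + whis*iqr) - (q1 - whis*iqr) = (1 + 2*whis) * iqr
iqr = (printed_upper - printed_lower) / (1 + 2 * whis_used)
q3 = printed_upper - whis_used * iqr
q1 = printed_lower + whis_used * iqr

print("IQR: {:.6f}, Q1: {:.6f}, Q3: {:.6f}".format(iqr, q1, q3))
# IQR: 0.001042, Q1: 0.001699, Q3: 0.002741
```

Half of the normal-data errors therefore lie roughly between 0.0017 and 0.0027. The negative lower bound has no practical role, since a mean absolute error cannot be negative.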


**Visualize the reconstruction errors for the normal dataset**

The computed upper bound is about `0.0048`, so a rounded threshold value of `0.005` appears to be a reasonable cutoff to identify anomalous input data.

In [16]:
upper_bound = 0.005

In [17]:
f, ax = plt.subplots(2, 1, sharey=False, sharex=False, figsize=(6, 7))

upper_boundary = upper_bound * np.ones(len(loss_mae))

ax[0].plot(loss_mae)
ax[0].plot(upper_boundary, color='r')
sns.boxplot(y=loss_mae, whis=whis, ax = ax[1])

ax[0].set_title('Normal Dataset - Line Plot')
ax[1].set_title('Normal Dataset - Box Plot')
ax[0].set_ylabel('Mean Absolute Error')
ax[1].set_ylabel('Mean Absolute Error')

Text(0, 0.5, 'Mean Absolute Error')

**Visualize the reconstruction errors for the sudden and gradual datasets**

The line plots show the anomalous regions in the two respective datasets. For the sudden dataset there is an almost instantaneous spike, and for the gradual dataset there is a ramp up to the peak error value.

In [18]:
X_sudden = clf.transform(sudden_df[feature_cols])
X_sudden_pred = model.predict(X_sudden)
loss_mae_sudden = np.mean(np.abs(X_sudden-X_sudden_pred), axis = 1)

X_gradual = clf.transform(gradual_df[feature_cols])
X_gradual_pred = model.predict(X_gradual)
loss_mae_gradual = np.mean(np.abs(X_gradual-X_gradual_pred), axis = 1)

sudden_df['loss_mae'] = loss_mae_sudden
gradual_df['loss_mae'] = loss_mae_gradual

In [19]:
f, ax = plt.subplots(2, 1, sharey=True, sharex=True, figsize=(6, 7))

upper_boundary = upper_bound * np.ones(len(loss_mae_sudden))

ax[0].plot(loss_mae_sudden)
ax[0].plot(upper_boundary, color='r')
ax[1].plot(loss_mae_gradual)
ax[1].plot(upper_boundary, color='r')

ax[0].set_title('Sudden Dataset')
ax[1].set_title('Gradual Dataset')
ax[0].set_ylabel('Mean Absolute Error')
ax[1].set_ylabel('Mean Absolute Error')

Text(0, 0.5, 'Mean Absolute Error')

**Zoom in to review the trends in the anomalous period**

Let’s review the reconstruction errors during the period of May 2019 to August 2019.

In [20]:
sudden_test_df = sudden_df.loc[lambda d: (d.date.dt.year == 2019) & 
                                ((d.date.dt.month == 5) | (d.date.dt.month == 6) | 
                                 (d.date.dt.month == 7) | (d.date.dt.month == 8)), :]

gradual_test_df = gradual_df.loc[lambda d: (d.date.dt.year == 2019) & 
                                ((d.date.dt.month == 5) | (d.date.dt.month == 6) | 
                                 (d.date.dt.month == 7) | (d.date.dt.month == 8)), :]

In [21]:
f, ax = plt.subplots(2, 1, sharey=True, sharex=True, figsize=(6, 7))

upper_boundary = upper_bound * np.ones(len(sudden_test_df))

ax[0].plot(sudden_test_df.date, sudden_test_df.loss_mae.values)
ax[0].plot(sudden_test_df.date, upper_boundary, color='r')
ax[1].plot(gradual_test_df.date, gradual_test_df.loss_mae.values)  # use the gradual dataset's own dates
ax[1].plot(gradual_test_df.date, upper_boundary, color='r')
ax[0].set_title('Sudden Dataset')
ax[1].set_title('Gradual Dataset')
ax[0].set_ylabel('Mean Absolute Error')
ax[1].set_ylabel('Mean Absolute Error')
plt.xticks(fontsize=8, rotation=45);

**For the gradual case, the question is: is there a lower error threshold we can monitor to detect the potential anomaly earlier?**

#### Consecutive Counts Metric

The `Consecutive Counts Metric` is the number of consecutive errors that are at or above a given threshold in a real-time feed of time series data. The threshold here is lower than the threshold monitored for point anomalies. The idea is that under normal conditions you may have occasional points above the lower threshold, but that trend will not persist in subsequent readings unless the errors are gradually trending upward toward anomalous conditions.

*Note that this is just one example of how to predict gradual anomalies earlier; you will often have to either change or fine-tune your approach to minimize false positives.*
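
A toy example may make the metric easier to picture. The sketch below works on a plain list rather than a dataframe; the function name `run_lengths_above` and the error values are hypothetical.

```python
def run_lengths_above(values, threshold):
    """For each reading, the number of consecutive readings up to and
    including it that are at or above the threshold (0 if it is below)."""
    counts = []
    run = 0
    for value in values:
        run = run + 1 if value >= threshold else 0
        counts.append(run)
    return counts

# Hypothetical reconstruction errors: an isolated blip, then a sustained rise
errors = [0.0010, 0.0040, 0.0020, 0.0031, 0.0033, 0.0036, 0.0041, 0.0047]
run_counts = run_lengths_above(errors, threshold=0.003)
print(run_counts)
# [0, 1, 0, 1, 2, 3, 4, 5]
```

The isolated blip resets to zero on the next reading, whereas the sustained rise keeps accumulating, which is the behavior the metric relies on.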

In [22]:
def consecutive_counts(df, col_name, threshold, start_index = 0):
    """For each row from start_index on, count the run of consecutive
    values >= threshold that ends at that row (0 if the row is below)."""
    above = (df[col_name] >= threshold).values
    answer = []
    count = 0
    for i, is_above in enumerate(above):
        # Extend the current run, or reset it when the value drops below the threshold
        count = count + 1 if is_above else 0
        if i >= start_index:
            answer.append(count)
    return answer

In this example, we will use a lower threshold value of **0.003** to compute the consecutive counts metric.

In [23]:
consecutive_threshold = 0.003
normal_df['consecutive_counts'] = consecutive_counts(normal_df, 'loss_mae', consecutive_threshold)
sudden_df['consecutive_counts'] = consecutive_counts(sudden_df, 'loss_mae', consecutive_threshold)
gradual_df['consecutive_counts'] = consecutive_counts(gradual_df, 'loss_mae', consecutive_threshold)

## Predict Anomalies

With the two established thresholds, **0.005** for point anomalies and **0.003** for gradual (consecutive-counts-based) anomalies, we will add two types of predictions to our datasets: standard point anomalies (`anomaly_std`) and anomalies based on the consecutive counts metric (`anomaly_cc`). For consecutive counts, we will treat **5** consecutive readings at or above the lower threshold as the start of anomalous conditions.

In [24]:
normal_df['anomaly_std'] = normal_df.loss_mae > upper_bound
sudden_df['anomaly_std'] = sudden_df.loss_mae > upper_bound
gradual_df['anomaly_std'] = gradual_df.loss_mae > upper_bound

consecutive_counts_bound = 5
normal_df['anomaly_cc'] = normal_df.consecutive_counts >= consecutive_counts_bound
sudden_df['anomaly_cc'] = sudden_df.consecutive_counts >= consecutive_counts_bound
gradual_df['anomaly_cc'] = gradual_df.consecutive_counts >= consecutive_counts_bound
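
For a real-time feed, the same two rules can be applied one reading at a time. The following is a sketch under the thresholds used in this notebook (0.005, 0.003, and 5 consecutive readings); the class `StreamingAnomalyMonitor` and the error stream are hypothetical and are not part of the lab code.

```python
class StreamingAnomalyMonitor:
    """Flags each incoming reconstruction error using two rules:
    a point rule (error above upper_bound) and a run rule
    (min_run consecutive errors at or above lower_threshold)."""

    def __init__(self, upper_bound=0.005, lower_threshold=0.003, min_run=5):
        self.upper_bound = upper_bound
        self.lower_threshold = lower_threshold
        self.min_run = min_run
        self.run = 0  # length of the current run at or above lower_threshold

    def update(self, error):
        self.run = self.run + 1 if error >= self.lower_threshold else 0
        return {
            "anomaly_std": error > self.upper_bound,
            "anomaly_cc": self.run >= self.min_run,
        }

monitor = StreamingAnomalyMonitor()
stream = [0.0020, 0.0031, 0.0034, 0.0036, 0.0039, 0.0044, 0.0052]
flags = [monitor.update(e) for e in stream]
for e, f in zip(stream, flags):
    print(e, f)
```

In this made-up stream the run rule fires on the sixth reading, one step before the point rule fires on the seventh. Keeping only a running counter means no history has to be stored between readings.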

**Review Anomalies in the Normal dataset**

The dataset shows several point anomalies (`anomaly_std`) and one instance of a consecutive-counts-based anomaly (`anomaly_cc`) around May 16, 2017, that lasts for two days.

In [25]:
display_dataframe(normal_df[(normal_df.anomaly_std == True) | (normal_df.anomaly_cc == True)])

| Unnamed: 0 | date | year | month | month_name | day | temperature | humidity | water_level | loss_mae | consecutive_counts | anomaly_std | anomaly_cc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | 2016-01-17 00:00:00 | 2016 | 1 | Jan | 17 | 13.9 | 70.9 | 27.5 | 0.00561974 | 1 | True | False |
| 68 | 2016-03-09 00:00:00 | 2016 | 3 | Mar | 9 | 28.8 | 67.8 | 22.6 | 0.00508701 | 1 | True | False |
| 87 | 2016-03-28 00:00:00 | 2016 | 3 | Mar | 28 | 34.9 | 68.0 | 9.7 | 0.00536313 | 1 | True | False |
| 91 | 2016-04-01 00:00:00 | 2016 | 4 | Apr | 1 | 47.1 | 61.2 | 17.1 | 0.00534971 | 1 | True | False |
| 92 | 2016-04-02 00:00:00 | 2016 | 4 | Apr | 2 | 38.3 | 61.4 | 22.0 | 0.0069166 | 2 | True | False |
| 115 | 2016-04-25 00:00:00 | 2016 | 4 | Apr | 25 | 45.1 | 62.8 | 26.2 | 0.00505636 | 2 | True | False |
| 144 | 2016-05-24 00:00:00 | 2016 | 5 | May | 24 | 54.2 | 62.8 | 19.0 | 0.00511678 | 1 | True | False |
| 386 | 2017-01-21 00:00:00 | 2017 | 1 | Jan | 21 | 21.4 | 70.3 | 8.4 | 0.006694 | 2 | True | False |
| 445 | 2017-03-21 00:00:00 | 2017 | 3 | Mar | 21 | 34.3 | 68.1 | 24.7 | 0.00532163 | 1 | True | False |
| 501 | 2017-05-16 00:00:00 | 2017 | 5 | May | 16 | 52.0 | 63.1 | 13.3 | 0.00300168 | 5 | False | True |


**Review Anomalies in the Sudden dataset**

The dataset shows a sudden jump in the reconstruction error (`loss_mae`) on May 16, 2019, which persists until the end of May. The `anomaly_cc` flag starts, as expected, on the 4th day after `anomaly_std`.

In [26]:
display_dataframe(sudden_df[(sudden_df.anomaly_std == True) | (sudden_df.anomaly_cc == True)])

| Unnamed: 0 | date | year | month | month_name | day | temperature | humidity | water_level | loss_mae | consecutive_counts | anomaly_std | anomaly_cc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | 2016-01-17 00:00:00 | 2016 | 1 | Jan | 17 | 13.9 | 70.9 | 27.5 | 0.00561974 | 1 | True | False |
| 68 | 2016-03-09 00:00:00 | 2016 | 3 | Mar | 9 | 28.8 | 67.8 | 22.6 | 0.00508701 | 1 | True | False |
| 87 | 2016-03-28 00:00:00 | 2016 | 3 | Mar | 28 | 34.9 | 68.0 | 9.7 | 0.00536313 | 1 | True | False |
| 91 | 2016-04-01 00:00:00 | 2016 | 4 | Apr | 1 | 47.1 | 61.2 | 17.1 | 0.00534971 | 1 | True | False |
| 92 | 2016-04-02 00:00:00 | 2016 | 4 | Apr | 2 | 38.3 | 61.4 | 22.0 | 0.0069166 | 2 | True | False |
| 115 | 2016-04-25 00:00:00 | 2016 | 4 | Apr | 25 | 45.1 | 62.8 | 26.2 | 0.00505636 | 2 | True | False |
| 144 | 2016-05-24 00:00:00 | 2016 | 5 | May | 24 | 54.2 | 62.8 | 19.0 | 0.00511678 | 1 | True | False |
| 386 | 2017-01-21 00:00:00 | 2017 | 1 | Jan | 21 | 21.4 | 70.3 | 8.4 | 0.006694 | 2 | True | False |
| 445 | 2017-03-21 00:00:00 | 2017 | 3 | Mar | 21 | 34.3 | 68.1 | 24.7 | 0.00532163 | 1 | True | False |
| 501 | 2017-05-16 00:00:00 | 2017 | 5 | May | 16 | 52.0 | 63.1 | 13.3 | 0.00300168 | 5 | False | True |


**Review Anomalies in the Gradual dataset**

The `anomaly_cc` flag starts on June 9, 2019, almost 22 days before the reconstruction error (`loss_mae`) exceeds the normal threshold.
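
A lead time like this can be measured as the gap between the first date flagged by each rule. The sketch below uses made-up daily flags: the June 9 start mirrors the date stated above, while the July 1 start for `anomaly_std` is an assumption chosen for illustration and is not read from the notebook output. The helper `first_flag_date` is hypothetical.

```python
from datetime import date, timedelta

def first_flag_date(dates, flags):
    """Date of the first True flag, or None if nothing is flagged."""
    for d, flagged in zip(dates, flags):
        if flagged:
            return d
    return None

# Hypothetical daily flags for June 1 - July 10, 2019 (not the notebook's output)
dates = [date(2019, 6, 1) + timedelta(days=i) for i in range(40)]
anomaly_cc = [d >= date(2019, 6, 9) for d in dates]
anomaly_std = [d >= date(2019, 7, 1) for d in dates]  # assumed start date

lead = first_flag_date(dates, anomaly_std) - first_flag_date(dates, anomaly_cc)
print(lead.days)  # 22
```

When applying the same idea to `gradual_df`, the comparison would need to be restricted to the 2019 anomalous window, because isolated point anomalies also occur earlier in the data.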

In [27]:
display_dataframe(gradual_df[(gradual_df.anomaly_std == True) | (gradual_df.anomaly_cc == True)])

| Unnamed: 0 | date | year | month | month_name | day | temperature | humidity | water_level | loss_mae | consecutive_counts | anomaly_std | anomaly_cc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | 2016-01-17 00:00:00 | 2016 | 1 | Jan | 17 | 13.9 | 70.9 | 27.5 | 0.00561974 | 1 | True | False |
| 68 | 2016-03-09 00:00:00 | 2016 | 3 | Mar | 9 | 28.8 | 67.8 | 22.6 | 0.00508701 | 1 | True | False |
| 87 | 2016-03-28 00:00:00 | 2016 | 3 | Mar | 28 | 34.9 | 68.0 | 9.7 | 0.00536313 | 1 | True | False |
| 91 | 2016-04-01 00:00:00 | 2016 | 4 | Apr | 1 | 47.1 | 61.2 | 17.1 | 0.00534971 | 1 | True | False |
| 92 | 2016-04-02 00:00:00 | 2016 | 4 | Apr | 2 | 38.3 | 61.4 | 22.0 | 0.0069166 | 2 | True | False |
| 115 | 2016-04-25 00:00:00 | 2016 | 4 | Apr | 25 | 45.1 | 62.8 | 26.2 | 0.00505636 | 2 | True | False |
| 144 | 2016-05-24 00:00:00 | 2016 | 5 | May | 24 | 54.2 | 62.8 | 19.0 | 0.00511678 | 1 | True | False |
| 386 | 2017-01-21 00:00:00 | 2017 | 1 | Jan | 21 | 21.4 | 70.3 | 8.4 | 0.006694 | 2 | True | False |
| 445 | 2017-03-21 00:00:00 | 2017 | 3 | Mar | 21 | 34.3 | 68.1 | 24.7 | 0.00532163 | 1 | True | False |
| 501 | 2017-05-16 00:00:00 | 2017 | 5 | May | 16 | 52.0 | 63.1 | 13.3 | 0.00300168 | 5 | False | True |


### Visualize Anomalies in the Observed Water Levels

Next, we will visualize the anomalies in the measured water levels during the anomalous period (May-August 2019).

As you can observe, **anomaly_std** works best for **sudden anomalies**, whereas **anomaly_cc** works best for **gradual anomalies**.

In [28]:
sudden_test_df = sudden_df.loc[lambda d: (d.date.dt.year == 2019) & 
                                ((d.date.dt.month == 5) | (d.date.dt.month == 6) | 
                                 (d.date.dt.month == 7) | (d.date.dt.month == 8)), :]

gradual_test_df = gradual_df.loc[lambda d: (d.date.dt.year == 2019) & 
                                ((d.date.dt.month == 5) | (d.date.dt.month == 6) | 
                                 (d.date.dt.month == 7) | (d.date.dt.month == 8)), :]

In [29]:
f, ax = plt.subplots(2, 2, sharey=True, sharex=True, figsize=(10, 8))

colors_s_1 = ['red' if value == True else 'blue' for value in sudden_test_df.anomaly_std.values]
size_s_1 = [10 if value == True else 5 for value in sudden_test_df.anomaly_std.values]
colors_s_2 = ['red' if value == True else 'blue' for value in sudden_test_df.anomaly_cc.values]
size_s_2 = [10 if value == True else 5 for value in sudden_test_df.anomaly_cc.values]

colors_g_1 = ['red' if value == True else 'blue' for value in gradual_test_df.anomaly_std.values]
size_g_1 = [10 if value == True else 5 for value in gradual_test_df.anomaly_std.values]
colors_g_2 = ['red' if value == True else 'blue' for value in gradual_test_df.anomaly_cc.values]
size_g_2 = [10 if value == True else 5 for value in gradual_test_df.anomaly_cc.values]

ax[0][0].scatter(sudden_test_df.date, sudden_test_df.water_level, s = size_s_1, c = colors_s_1)
ax[0][1].scatter(gradual_test_df.date, gradual_test_df.water_level, s = size_g_1, c = colors_g_1)
ax[1][0].scatter(sudden_test_df.date, sudden_test_df.water_level, s = size_s_2, c = colors_s_2)
ax[1][1].scatter(gradual_test_df.date, gradual_test_df.water_level, s = size_g_2, c = colors_g_2)

ax[0][0].set_title('Sudden Dataset - anomaly_std')
ax[1][0].set_title('Sudden Dataset - anomaly_cc')
ax[0][1].set_title('Gradual Dataset - anomaly_std')
ax[1][1].set_title('Gradual Dataset - anomaly_cc')
ax[0][0].set_ylabel('Water Level')
ax[1][0].set_ylabel('Water Level')

from matplotlib.patches import Patch
from matplotlib.lines import Line2D

legend_elements = [Line2D([0], [0], marker='o', color='w', label='Normal', markerfacecolor='b', markersize=5), 
                  Line2D([0], [0], marker='o', color='w', label='Anomaly', markerfacecolor='r', markersize=5)]

ax[0][0].legend(handles=legend_elements, frameon=False)
#ax[0][1].legend(handles=legend_elements, frameon=False)
#ax[1][0].legend(handles=legend_elements, frameon=False)
#ax[1][1].legend(handles=legend_elements, frameon=False)

f.tight_layout(rect=[0, 0.03, 1, 0.95])

## Principal Component Analysis

Generate the top N principal components of the encoded representation of the input data for the sudden and gradual datasets, each restricted to its anomalous period.

In [30]:
sudden_anomalies = sudden_df.loc[lambda d: (d.date.dt.year == 2019) & (d.date.dt.month == 5), :]

gradual_anomalies = gradual_df.loc[lambda d: (d.date.dt.year == 2019) & 
                                   ((d.date.dt.month == 5) | (d.date.dt.month == 6) | 
                                    (d.date.dt.month == 7)), :]
sudden_anomalies_encoded = encoder_model.predict(clf.transform(sudden_anomalies[feature_cols]))
gradual_anomalies_encoded = encoder_model.predict(clf.transform(gradual_anomalies[feature_cols]))

Generate principal components for **N = [2, 3, 4, 5]**

In [31]:
pca_components = [2, 3, 4, 5]
sudden_anomalies_pca = []
gradual_anomalies_pca = []

def pca_analysis(data, results, anomaly_type):
    # Fit a separate PCA for each component count and collect the projections
    for comp in pca_components: 
        pca = PCA(n_components = comp)
        pca_result = pca.fit_transform(data)
        print('{} - Cumulative explained variation for {} principal components: {}'.format(
            anomaly_type, comp, np.sum(pca.explained_variance_ratio_)))
        results.append(pca_result)

pca_analysis(sudden_anomalies_encoded, sudden_anomalies_pca, 'Sudden anomalies')
pca_analysis(gradual_anomalies_encoded, gradual_anomalies_pca, 'Gradual anomalies')

Sudden anomalies - Cumulative explained variation for 2 principal components: 0.9961084723472595
Sudden anomalies - Cumulative explained variation for 3 principal components: 0.9999396204948425
Sudden anomalies - Cumulative explained variation for 4 principal components: 0.9999906420707703
Sudden anomalies - Cumulative explained variation for 5 principal components: 0.9999988675117493
Gradual anomalies - Cumulative explained variation for 2 principal components: 0.9979403018951416
Gradual anomalies - Cumulative explained variation for 3 principal components: 0.9992831349372864
Gradual anomalies - Cumulative explained variation for 4 principal components: 0.9998731017112732
Gradual anomalies - Cumulative explained variation for 5 principal components: 0.9999879598617554
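
The cumulative explained variation reported above can be understood through the singular values of the centered data. The sketch below is a plain NumPy illustration of that relationship, not the scikit-learn implementation; the function name `explained_variance_ratio` and the synthetic 20-dimensional encodings are assumptions made for the example.

```python
import numpy as np

def explained_variance_ratio(data):
    """Fraction of total variance captured by each principal component."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    # Squared singular values of the centered data give the component variances
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2 / (len(data) - 1)
    return variances / variances.sum()

# Hypothetical 20-dimensional encodings that mostly vary along two directions
rng = np.random.RandomState(0)
latent = rng.normal(size=(60, 2))
mixing = rng.normal(size=(2, 20))
encoded = latent @ mixing + 0.01 * rng.normal(size=(60, 20))

ratio = explained_variance_ratio(encoded)
print(np.cumsum(ratio)[:5])
```

When nearly all of the variance is captured by the first two or three components, as in the output above, a 2-D or 3-D plot loses very little information about the encoded points.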


### Visualize the Principal Components for N = 3

Visualize the top 3 principal components of the encoded representation of the input data.

The 3-D plots show a clean separation between normal and anomalous points in the sudden dataset. For the gradual dataset, the separation is gradual at first and then becomes clean.

In [49]:
X_embedded1 = pd.DataFrame(sudden_anomalies_pca[1], columns=['X','Y', 'Z'])
X_embedded1['State'] = np.where(sudden_anomalies.anomaly_std, 'Failure', 'Normal')

X_embedded2 = pd.DataFrame(gradual_anomalies_pca[1], columns=['X','Y', 'Z'])
X_embedded2['State'] = np.where(gradual_anomalies.anomaly_cc, 'Failure', 'Normal')

from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(10, 5))
ax1 = fig.add_subplot(121, projection='3d')
ax2 = fig.add_subplot(122, projection='3d')

ax1.set_title('Sudden anomalies readings', y=-0.12)
ax2.set_title('Gradual anomalies readings', y=-0.12)

colors_1 = ['red' if value == 'Failure' else 'blue' for value in X_embedded1.State.values]
ax1.scatter(X_embedded1.X.values, X_embedded1.Y.values, X_embedded1.Z.values, c=colors_1)

colors_2 = ['red' if value == 'Failure' else 'blue' for value in X_embedded2.State.values]
ax2.scatter(X_embedded2.X.values, X_embedded2.Y.values, X_embedded2.Z.values, c=colors_2)
#start, end = ax2.get_xlim()
#start, end = ax2.get_ylim()
ax2.xaxis.set_ticks(np.arange(-.7, 1.2, 0.4))
ax2.yaxis.set_ticks(np.arange(-.8, 0.8, 0.3))

from matplotlib.patches import Patch
from matplotlib.lines import Line2D

legend_elements = [Line2D([0], [0], marker='o', color='w', label='Normal', markerfacecolor='b', markersize=5), 
                  Line2D([0], [0], marker='o', color='w', label='Anomaly', markerfacecolor='r', markersize=5)]

ax1.legend(handles=legend_elements, loc='upper left', frameon=False)
ax2.legend(handles=legend_elements, loc='upper left', frameon=False)

plt.show()
