# RealTraffic - anomaly detection

In this notebook, data analysis and anomaly detection were performed on the RealTraffic data from the [Numenta Anomaly Benchmark](https://github.com/numenta/NAB).

The dataset consists of data from 5 sensors that measure traffic on different road sections. The sensors report traffic using several metrics:
- occupancy - the average number of vehicles
- speed - average speed
- travel time - average travel time

Note: not all metrics are available at each sensor.

## Data load and analysis
The work was started by reading and inspecting the data.

In [None]:
import pandas as pd
import numpy as np
import os
import re
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import norm

In [None]:
cd ../input/nab/realTraffic/realTraffic

In [None]:
data = pd.DataFrame(columns=['sensor', 'metric', 'timestamp', 'value'])

In [None]:
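frames = []
for file in os.listdir():
    if file.endswith('.csv'):
        temp = pd.read_csv(file)
        #File names follow the "<metric>_<sensor>.csv" pattern
        sensor_name = re.split(r"[_.]", file)
        temp['metric'] = sensor_name[0]
        temp['sensor'] = sensor_name[1]
        frames.append(temp)

#DataFrame.append was removed in pandas 2.0, so all files are concatenated in one call
data = pd.concat(frames, ignore_index=True)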

Preparing a summary of the readings:

In [None]:
#agg() has no 'on' argument; the summarized columns are selected explicitly instead
sensors_summary = data.groupby(['sensor', 'metric'])[['timestamp', 'value']].agg(['min', 'max', 'count'])
sensors_summary

Conclusions:
- Only two sensors have two metrics. They can be combined to provide more accurate insights.
- The rest of the sensors measure only one metric.
- The timestamp is the key that links multiple metrics from the same sensor.
- However, the timestamp range differs between metrics of the same sensor. This has to be taken into account when merging the data.
- Ranges for the same metric look similar across different sensors.
- The data doesn't have labels indicating anomalies, so this is an unsupervised case.
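
The point about differing timestamp ranges can be illustrated with a minimal sketch. The toy frame below is made up for illustration (the sensor name `s1` and all values are hypothetical); it only shows that an inner join on `timestamp` keeps the readings present for both metrics.

```python
import pandas as pd

# Toy readings for one hypothetical sensor whose two metrics cover different time ranges
toy = pd.DataFrame({
    'sensor': ['s1'] * 7,
    'metric': ['speed'] * 4 + ['occupancy'] * 3,
    'timestamp': ['2015-09-01 10:00:00', '2015-09-01 10:05:00',
                  '2015-09-01 10:10:00', '2015-09-01 10:15:00',
                  '2015-09-01 10:05:00', '2015-09-01 10:10:00',
                  '2015-09-01 10:20:00'],
    'value': [60.0, 62.0, 58.0, 61.0, 5.0, 7.0, 6.0],
})

speed = toy[toy['metric'] == 'speed']
occupancy = toy[toy['metric'] == 'occupancy']

# An inner join keeps only the timestamps that exist for both metrics
merged = pd.merge(speed, occupancy, on='timestamp', how='inner',
                  suffixes=('_speed', '_occupancy'))
print(merged[['timestamp', 'value_speed', 'value_occupancy']])
print('rows kept:', len(merged), 'of', len(speed), 'speed readings')
```

Readings that exist for only one metric are dropped by the join, so the merged frame can be noticeably shorter than either input.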

The distribution of the features matters for most typical anomaly detection algorithms.

In [None]:
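for sensor, metric in sensors_summary.index:
    plt.figure(figsize=(7, 5))
    values = data[(data['sensor']==sensor) & (data['metric']==metric)]['value']
    #distplot is deprecated in seaborn; histplot with a KDE overlay is the closest replacement
    ax = sns.histplot(values, kde=True, stat='density')
    ax.set_title('Distribution for the {} metric for the {} sensor'.format(metric, sensor))
    plt.xlabel('Value')
    plt.show()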

The data distribution is far from normal. This would hurt the performance of most anomaly detection algorithms.
The log1p function, i.e. log(1 + x), was used to fix the issue. It's commonly used to reduce right skew and bring non-negative data closer to a normal distribution.
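
The effect of log1p on a skewed sample can be sketched as follows. The sample is synthetic (a log-normal draw chosen as an assumption, not the sensor data), and skewness is used here as one possible measure of distance from a symmetric shape.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Synthetic right-skewed, non-negative sample (an assumption, not the sensor data)
sample = rng.lognormal(mean=3.0, sigma=0.8, size=5000)

transformed = np.log1p(sample)    # log(1 + x), also defined at x = 0
restored = np.expm1(transformed)  # inverse of log1p, used later to map results back

skew_before = skew(sample)
skew_after = skew(transformed)
print('skewness before: {:.2f}, after: {:.2f}'.format(skew_before, skew_after))
```

Because `np.expm1` inverts the transformation, any boundary found on the transformed scale can be mapped back to the original units.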

In [None]:
data['transformed'] = np.log1p(data['value'])

In [None]:
data_stats = pd.DataFrame(columns=['sensor', 'metric', 'old_mu', 'new_mu', 'old_sigma', 'new_sigma'])

In [None]:
for sensor, metric in sensors_summary.index:
    #Fetching data
    sub_data = data[(data['sensor']==sensor) & (data['metric']==metric)]
    
    #Plotting distributions before and after side by side
    fig, ax = plt.subplots(1, 2, figsize=(14, 6))
    sns.histplot(sub_data['value'], kde=True, stat='density', ax=ax[0])
    ax[0].set_title('{}: {} - Original distribution'.format(sensor, metric))
    #Labels are set on each axis explicitly; plt.xlabel would only affect the last subplot
    ax[0].set_xlabel('Original value')
    sns.histplot(sub_data['transformed'], kde=True, stat='density', color='g', ax=ax[1])
    ax[1].set_title('{}: {} - Transformed distribution'.format(sensor, metric))
    ax[1].set_xlabel('Transformed value')
    plt.show()
    
    #Fitting a normal distribution once per column and saving mu and sigma into the stats df
    old_mu, old_sigma = norm.fit(sub_data['value'])
    new_mu, new_sigma = norm.fit(sub_data['transformed'])
    data_stats.loc[len(data_stats)] = {'sensor': sensor, 'metric': metric, 'old_mu': old_mu, 'new_mu': new_mu, 'old_sigma': old_sigma, 'new_sigma': new_sigma}

In [None]:
data_stats

The comparison before and after the log1p transformation shows a large decrease in the mean and standard deviation values, which suggests the transformation function was well chosen.

The highest values are for the TravelTime metric. If further improvement is needed in the future, a dedicated transformation function could be found for this metric.
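
One candidate for such a metric-specific transformation is Box-Cox, which estimates a power parameter from the data. This is only a sketch of the idea, not something the notebook evaluated: the sample below is synthetic, and Box-Cox requires strictly positive values, which is assumed here and would have to be checked for the real readings.

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(1)
# Synthetic, strictly positive, heavily right-skewed stand-in for travel times
travel_time = rng.lognormal(mean=5.0, sigma=1.0, size=5000)

# boxcox returns the transformed values and the lambda estimated by maximum likelihood
bc_values, bc_lambda = boxcox(travel_time)

skew_raw = skew(travel_time)
skew_log1p = skew(np.log1p(travel_time))
skew_boxcox = skew(bc_values)
print('lambda: {:.3f}'.format(bc_lambda))
print('skewness raw: {:.3f}, log1p: {:.3f}, Box-Cox: {:.3f}'.format(
    skew_raw, skew_log1p, skew_boxcox))
```

Whether Box-Cox actually beats log1p depends on the data, so the skewness values should be compared per metric before switching.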

## Anomaly detection

The plan for this chapter:
- 2D sensors (sensors with two metrics)
    - Plot the data on a 2D plot
    - For one of the sensors find the best algorithm for AD
    - Plot decision boundary
    - Perform anomaly detection on the second sensor
- 1D sensors
    - Perform AD for 1D
    - Plot boundary 


The work started by plotting the data for the t4013 sensor in the original and transformed scale.

In [None]:
#Fetching the data
data_t4013 = pd.merge(data[(data['sensor']=='t4013') & (data['metric']=='speed')], data[(data['sensor']=='t4013') & (data['metric']=='occupancy')], on='timestamp').drop_duplicates().filter(['value_x', 'value_y'])
data_t4013.columns = ['speed','occupancy']
data_t4013.reset_index(drop=True, inplace=True)

data_6005 = pd.merge(data[(data['sensor']=='6005') & (data['metric']=='speed')], data[(data['sensor']=='6005') & (data['metric']=='occupancy')], on='timestamp').drop_duplicates().filter(['value_x', 'value_y'])
data_6005.columns = ['speed','occupancy']
data_6005.reset_index(drop=True, inplace=True)

In [None]:
def plot_2d(data, title):
    plt.figure(figsize=(10,8))
    #Recent seaborn versions require x and y to be passed as keyword arguments
    ax = sns.scatterplot(x=data['occupancy'], y=data['speed'])
    ax.set_title(title)
    plt.xlabel('Average occupancy')
    plt.ylabel('Average speed')
    plt.show()

In [None]:
plot_2d(data_t4013, 'Sensor readings for the t4013 sensor in the original scale')
plot_2d(data_6005, 'Sensor readings for the 6005 sensor in the original scale')

Readings from both sensors show the same pattern for the main cluster. The t4013 sensor has more outliers, so it was used to find the best model for anomaly detection.

The data in the transformed scale is visualized below:

In [None]:
data_t4013 = pd.merge(data[(data['sensor']=='t4013') & (data['metric']=='speed')], data[(data['sensor']=='t4013') & (data['metric']=='occupancy')], on='timestamp').drop_duplicates().filter(['transformed_x', 'transformed_y'])
data_t4013.columns = ['speed','occupancy']
data_t4013.reset_index(drop=True, inplace=True)

data_6005 = pd.merge(data[(data['sensor']=='6005') & (data['metric']=='speed')], data[(data['sensor']=='6005') & (data['metric']=='occupancy')], on='timestamp').drop_duplicates().filter(['transformed_x', 'transformed_y'])
data_6005.columns = ['speed','occupancy']
data_6005.reset_index(drop=True, inplace=True)

In [None]:
plot_2d(data_t4013, 'Sensor readings for the t4013 sensor in the transformed scale')
plot_2d(data_6005, 'Sensor readings for the 6005 sensor in the transformed scale')

On the transformed scale, the decision boundary is easier to imagine and has a simpler shape that might be fitted by an ellipse. So it was decided to try the EllipticEnvelope first.

The model fits an ellipse to the dataset, assuming the normal data follows a Gaussian distribution. The position and size of the ellipse are determined so that the selected contamination - the proportion of abnormal observations in the dataset - falls outside it.
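
The role of the contamination parameter can be checked on synthetic data. The sketch below uses a made-up correlated Gaussian cloud (an assumption, not the sensor readings) and shows that the share of flagged points follows the chosen contamination, and that the flagged points are those with the largest Mahalanobis distances.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)
# Synthetic correlated 2D Gaussian cloud (not the sensor data)
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.6], [0.6, 1.0]], size=2000)

model = EllipticEnvelope(contamination=0.05, random_state=0).fit(X)
labels = model.predict(X)          # 1 = inlier, -1 = outlier
flagged = np.mean(labels == -1)

# Squared Mahalanobis distances to the robust centre of the fitted ellipse
distances = model.mahalanobis(X)
min_outlier = distances[labels == -1].min()
max_inlier = distances[labels == 1].max()

print('fraction flagged: {:.3f}'.format(flagged))
print('largest inlier distance:   {:.2f}'.format(max_inlier))
print('smallest outlier distance: {:.2f}'.format(min_outlier))
```

Every outlier lies farther from the centre than every inlier, which is why the boundary is a single ellipse and cannot follow more irregular shapes.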

In [None]:
from sklearn.covariance import EllipticEnvelope

In [None]:
def update_pred_labels(pred):
    for i in range(len(pred)):
        if pred[i] == -1:
            pred[i] = 'abnormal'
        else:
            pred[i] = 'normal'

In [None]:
t4013_elliptic = EllipticEnvelope(contamination=0.05)
t4013_elliptic.fit(data_t4013)

In [None]:
pred = pd.Series(t4013_elliptic.predict(data_t4013))
update_pred_labels(pred)
print(pred.value_counts())
sns.scatterplot(x=data_t4013['occupancy'], y=data_t4013['speed'], hue=pred)

Fitting the model for various contamination values:

In [None]:
for cutoff in np.arange(0.03, 0.12, 0.01):
    t4013_elliptic = EllipticEnvelope(contamination=cutoff)
    t4013_elliptic.fit(data_t4013)
    pred = pd.Series(t4013_elliptic.predict(data_t4013))
    update_pred_labels(pred)
    print(pred.value_counts())
    ax = sns.scatterplot(x=data_t4013['occupancy'], y=data_t4013['speed'], hue=pred)
    ax.set_title('Elliptic for contamination = {:.2f}'.format(cutoff))
    plt.show()

Based on visual inspection of the plots, the best value for the cutoff is 0.06.

In [None]:
import matplotlib.patches as patches

#Retrain the model
cutoff = 0.06
t4013_elliptic = EllipticEnvelope(contamination=cutoff)
t4013_elliptic.fit(data_t4013)
pred = pd.Series(t4013_elliptic.predict(data_t4013))
update_pred_labels(pred)

#Plot the result
plt.figure(figsize=(8, 6))
ax = sns.scatterplot(x=data_t4013['occupancy'], y=data_t4013['speed'], hue=pred)
ax.set_title('Elliptic for contamination = {:.2f}'.format(cutoff))

#Create a rectangle
ax.add_patch(patches.Rectangle(
        xy=(0.38, 3.85),
        width=0.4,
        height=0.55,
        linewidth=1,
        color = 'r',
        fill = False))

The model fails to fit the dataset in multiple places, especially in the region marked by the red rectangle. This is caused by the data being too far from an ideal normal distribution - even after the transformation.

The IsolationForest is known for its ability to adapt to complex shapes, so it was selected as the next model.
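
How the IsolationForest scores points can be sketched on a toy example. The two clusters below are synthetic and only stand in for a non-elliptical "normal" region; the probe points are hypothetical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Two synthetic clusters form a non-elliptical normal region (not the sensor data)
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(500, 2))
cluster_b = rng.normal(loc=[3.0, 3.0], scale=0.3, size=(500, 2))
X = np.vstack([cluster_a, cluster_b])

forest = IsolationForest(n_estimators=200, contamination=0.05, random_state=0).fit(X)

# score_samples: the lower the score, the easier the point is to isolate
probe = np.array([[0.0, 0.0],     # centre of a cluster
                  [1.5, 1.5],     # gap between the clusters
                  [8.0, -5.0]])   # far from everything
scores = forest.score_samples(probe)
probe_labels = forest.predict(probe)
print('scores:', scores.round(3))
print('labels:', probe_labels)
```

Because the model isolates points with axis-aligned random splits rather than fitting a shape, its boundary can follow several clusters at once, but it can also look blocky near the edges.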

In [None]:
from sklearn.ensemble import IsolationForest

In [None]:
for contamination in [0.015, 0.05]:
    
    #Training the model
    t4013_isolation = IsolationForest(n_estimators=1010, max_features=2, contamination=contamination, bootstrap=False)
    t4013_isolation.fit(data_t4013)

    #Predicting the result
    pred = pd.Series(t4013_isolation.predict(data_t4013))
    update_pred_labels(pred)

    #Plotting
    print(pred.value_counts())
    plt.figure(figsize=(10, 7))
    ax = sns.scatterplot(x=data_t4013['occupancy'], y=data_t4013['speed'], hue=pred)
    ax.set_title('IsolationForest for the t4013 sensor with contamination {}'.format(contamination))
    
    #Create left rectangle
    ax.add_patch(patches.Rectangle(
        xy=(0.38, 3.85),
        width=0.4,
        height=0.55,
        linewidth=1,
        color = 'r',
        fill = False))
    
    #Create right rectangle
    ax.add_patch(patches.Rectangle(
        xy=(0.76, 4.275),
        width=0.25,
        height=0.1,
        linewidth=1,
        color = 'k',
        fill = False))

Despite trying various hyperparameters, it was impossible to classify most of the items in the red rectangle as normal and the items in the black rectangle as abnormal.

Taking into consideration some other misclassifications in the results, it was decided to try another model - OneClassSVM.

In [None]:
from sklearn.svm import OneClassSVM

As the SVM model is based on distance calculations, it was decided to rescale the data to improve the SVM model performance.
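
What the rescaling does can be seen in a short sketch. The two columns are synthetic and their scales are hypothetical; the point is only that after standardization both features contribute comparably to distances, and that the scaling can be undone.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features with different spreads (hypothetical values)
X = np.column_stack([rng.normal(4.0, 0.2, size=1000),
                     rng.normal(1.5, 1.0, size=1000)])

scaler = StandardScaler()
X_sc = scaler.fit_transform(X)

print('means after scaling:', X_sc.mean(axis=0).round(6))
print('stds after scaling: ', X_sc.std(axis=0).round(6))

# inverse_transform expects a 2D array and maps scaled values back to the input scale
X_back = scaler.inverse_transform(X_sc)
```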

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
data_t4013_sc = sc.fit_transform(data_t4013)

In [None]:
for cutoff in np.arange(0.02, 0.06, 0.01):
    t4013_SVM = OneClassSVM(nu=cutoff)
    t4013_SVM.fit(data_t4013_sc)
    pred = pd.Series(t4013_SVM.predict(data_t4013_sc))
    update_pred_labels(pred)
    print(pred.value_counts())
    plt.figure(figsize=(9,6))
    ax = sns.scatterplot(x=data_t4013['occupancy'], y=data_t4013['speed'], hue=pred)
    ax.set_title('One-Class SVM for nu = {:.3f}'.format(cutoff))
    plt.show()

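The SVM model seems to fit the dataset best of all the models tested. A nu of 0.03 seems to fit the problem best. In OneClassSVM, nu plays the role of the contamination: it is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors.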
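
The relation between nu and the share of flagged points can be checked on synthetic data. The sketch below uses a standard normal cloud (an assumption, not the sensor readings); the flagged share is expected to stay close to nu, though the solver's tolerance can shift it slightly.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))  # synthetic data that is already standardized

results = {}
for nu in [0.02, 0.03, 0.05]:
    model = OneClassSVM(nu=nu, kernel='rbf', gamma='scale').fit(X)
    flagged = np.mean(model.predict(X) == -1)   # share of training points marked abnormal
    support = len(model.support_) / len(X)      # share of points used as support vectors
    results[nu] = (flagged, support)
    print('nu={:.2f}  flagged={:.3f}  support vectors={:.3f}'.format(nu, flagged, support))
```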

The next step was to plot the decision boundary.

In [None]:
data_t4013_sc.shape

In [None]:
t4013_SVM = OneClassSVM(nu=0.03, kernel='rbf', tol=1e-10, gamma='auto')
t4013_SVM.fit(data_t4013_sc)

In [None]:
pred = pd.Series(t4013_SVM.predict(data_t4013_sc))
update_pred_labels(pred)
print(pred.value_counts())

plt.figure(figsize=(10,8))
ax = sns.scatterplot(x=data_t4013_sc[:, 1], y=data_t4013_sc[:, 0], hue=pred)
ax.set_title('The decision boundary for the One-Class SVM model on scaled data for the t4013 sensor')
ax.set(xlabel='occupancy', ylabel='speed')
plt.xlim(-3, 3)
plt.ylim(-3, 3)
xx, yy = np.meshgrid(np.linspace(-3, 3, 300),
                     np.linspace(-3, 3, 300))
Z = t4013_SVM.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contour(yy, xx, Z, levels=[0], linewidths=1, colors='black')

plt.show()

A single abnormal spot can be found inside the boundary of the main cluster. This is a downside of the SVM model. Modifying the hyperparameters to avoid the issue leads to a deterioration of the decision boundary. The spot could be removed by adding data at the spot's location.
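
The boundary can also be derived from the continuous `decision_function` output instead of the -1/1 predictions, which gives the contour routine a smooth surface to cut at level 0. This is a sketch on synthetic standardized data, not a rerun of the analysis; the grid size of 100 is an arbitrary choice.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))  # synthetic standardized data, not the sensor readings
model = OneClassSVM(nu=0.03, kernel='rbf', gamma='auto').fit(X)

xx, yy = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
grid = np.c_[xx.ravel(), yy.ravel()]

# Continuous score: positive inside the boundary, negative outside
Z = model.decision_function(grid).reshape(xx.shape)
Z_labels = model.predict(grid).reshape(xx.shape)

# With matplotlib, plt.contour(xx, yy, Z, levels=[0]) would draw the boundary itself
inside_share = np.mean(Z > 0)
agreement = np.mean((Z > 0) == (Z_labels == 1))
print('share of the grid inside the boundary: {:.2f}'.format(inside_share))
print('agreement with predict: {:.3f}'.format(agreement))
```

The continuous scores also make it possible to see how deep an isolated abnormal spot is, which helps when deciding whether it matters.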

Plotting the decision boundary on the original values:

In [None]:
#Mapping the whole grid back to the original scale in one call:
#inverse_transform needs a 2D array, so the grid is passed as (n_points, 2)
xx_yy = sc.inverse_transform(np.c_[xx.ravel(), yy.ravel()])
#Undoing the log1p transformation
xx_yy = np.expm1(xx_yy)

In [None]:
plt.figure(figsize=(10,8))
ax = sns.scatterplot(x=np.expm1(data_t4013['occupancy']), y=np.expm1(data_t4013['speed']), hue=pred)
ax.set_title('The decision boundary for the One-Class SVM model on the original data for the t4013 sensor')
ax.set(xlabel='occupancy', ylabel='speed')
plt.xlim(-5, 50)
plt.ylim(10, 85)

plt.contour(xx_yy[:,1].reshape(300, 300), xx_yy[:,0].reshape(300, 300), Z, levels=[0], linewidths=1, colors='black')

plt.show()

The steps were repeated for the second sensor with two metrics: 6005.

In [None]:
sc = StandardScaler()
data_6005_sc = sc.fit_transform(data_6005)

s6005_SimpleSVM = OneClassSVM(nu=0.03, tol=1e-8)
s6005_SimpleSVM.fit(data_6005_sc)

In [None]:
pred = pd.Series(s6005_SimpleSVM.predict(data_6005_sc))
update_pred_labels(pred)
print(pred.value_counts())

plt.figure(figsize=(8,6))
ax = sns.scatterplot(x=data_6005_sc[:, 1], y=data_6005_sc[:, 0], hue=pred)
ax.set_title('The decision boundary for the One-Class SVM model on the transformed data for the 6005 sensor')
ax.set(xlabel='occupancy', ylabel='speed')
plt.xlim(-3, 3)
plt.ylim(-4, 3)
xx, yy = np.meshgrid(np.linspace(-4, 3, 300),
                     np.linspace(-4, 3, 300))
Z = s6005_SimpleSVM.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contour(yy, xx, Z, levels=[0], linewidths=1, colors='black')

plt.show()

In [None]:
#Mapping the whole grid back to the original scale in one call:
#inverse_transform needs a 2D array, so the grid is passed as (n_points, 2)
xx_yy = sc.inverse_transform(np.c_[xx.ravel(), yy.ravel()])
#Undoing the log1p transformation
xx_yy = np.expm1(xx_yy)

In [None]:
plt.figure(figsize=(8,6))
ax = sns.scatterplot(x=np.expm1(data_6005['occupancy']), y=np.expm1(data_6005['speed']), hue=pred)
ax.set_title('The decision boundary for the One-Class SVM model on the original data for the 6005 sensor')
ax.set(xlabel='occupancy', ylabel='speed')
plt.xlim(-5, 25)
plt.ylim(15, 115)

plt.contour(xx_yy[:,1].reshape(300, 300), xx_yy[:,0].reshape(300, 300), Z, levels=[0], linewidths=1, colors='black')

plt.show()

The abnormal spots in the main normal cluster are noticeable, but in general, the decision boundary is marked correctly.  

The rest of the sensors have only one metric. In this case, model selection is less crucial. EllipticEnvelope was selected as it makes the decision boundary easy to read, which was important for illustration purposes.
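
In one dimension the ellipse reduces to an interval, so the limits can also be computed directly from the fitted model instead of being searched for on a grid. The sketch below uses synthetic data and assumes that scikit-learn's `location_`, `covariance_` and `offset_` attributes behave as documented, i.e. that inliers satisfy (x - location)^2 / variance <= -offset_.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)
# Synthetic 1D readings on the transformed scale (not the sensor data)
values = rng.normal(loc=4.0, scale=0.3, size=3000).reshape(-1, 1)

model = EllipticEnvelope(contamination=0.05, random_state=0).fit(values)

# Closed-form interval implied by the squared Mahalanobis distance threshold
center = model.location_[0]
variance = model.covariance_[0, 0]
half_width = np.sqrt(-model.offset_ * variance)
low, high = center - half_width, center + half_width

print('transformed scale: [{:.3f}, {:.3f}]'.format(low, high))
# expm1 is monotonic, so the limits map directly back to the original scale
print('original scale:    [{:.3f}, {:.3f}]'.format(np.expm1(low), np.expm1(high)))
```

A grid search gives the same limits up to the grid resolution, so the closed form is mainly useful as a cross-check.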

In [None]:
all_single_sensors = ['387', '451', '7578']
single_sensors_boundaries = pd.DataFrame(columns=['sensor', 'metric', 'low', 'high'])

for single_sensor in all_single_sensors:
    #Fetching data and fitting the model
    single_data = np.array(data[data['sensor']==single_sensor]['transformed']).reshape(-1, 1)
    single_metric = data[data['sensor']==single_sensor]['metric'].iloc[0]
    single_model = EllipticEnvelope(contamination=0.05)
    single_model.fit(single_data)
    pred = single_model.predict(single_data)
    
    #Finding the decision boundary, in this 1D case low and high values are enough
    xx = np.linspace(single_data.min(), single_data.max(), 50).reshape(-1, 1)
    pred_xx = single_model.predict(xx)
    #Defaulting to the data range in case no anomalies are found on one side
    low, high = xx[0][0], xx[-1][0]
    for i in range(len(xx)-1):
        if pred_xx[i] < pred_xx[i+1]:
            low = xx[i][0]
        if pred_xx[i] > pred_xx[i+1]:
            high = xx[i][0]
            break
    
    #Saving boundary values into table (DataFrame.append was removed in pandas 2.0)
    single_sensors_boundaries.loc[len(single_sensors_boundaries)] = {'sensor':single_sensor, 'metric':single_metric, 'low':np.expm1(low), 'high':np.expm1(high)}
    
    #Plotting boundary using transformed data
    plt.figure(figsize=(8,6))
    #distplot is deprecated in seaborn; histplot with a KDE overlay is the closest replacement
    ax = sns.histplot(single_data.ravel(), kde=True, stat='density')
    ax.axvline(low, 0, 0.3, color='#FF6600', linewidth=3)
    ax.axvline(high, 0, 0.3, color='#FF6600', linewidth=3)
    ax.set_title('Anomaly boundaries on the transformed data for the {} sensor'.format(single_sensor))
    ax.set(xlabel=single_metric)

    #Plotting boundary using original scale
    plt.figure(figsize=(8,6))
    ax = sns.histplot(np.expm1(single_data).ravel(), kde=True, stat='density')
    ax.axvline(np.expm1(low), 0, 0.4, color='#FF6600', linewidth=3)
    ax.axvline(np.expm1(high), 0, 0.4, color='#FF6600', linewidth=3)
    ax.set_title('Anomaly boundaries on original data for the {} sensor'.format(single_sensor))
    ax.set(xlabel=single_metric)

The way the boundaries are determined is easier to understand when looking at the transformed data. However, the boundaries on the original data may be more useful in real-world scenarios.

Please note that detecting anomalies on a single measurement should not be considered as a highly reliable method. More metrics on these sensors would increase confidence in the detection mechanism.

The table with the low and high boundaries for normal samples:

In [None]:
single_sensors_boundaries

Speed is the only metric shared with the 2D sensors. The boundary values for this metric look comparable between the 1D and 2D analyses.