
# Anomaly / Outlier Detection


DBSCAN (density-based spatial clustering of applications with noise) is a density-based clustering algorithm. Given a set of points in some space, it groups together points that are closely packed (points with many nearby neighbors) and marks as outliers the points that lie alone in low-density regions (points whose nearest neighbors are too far away).

DBSCAN is one of the most common clustering algorithms and one of the most cited in the scientific literature.
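To make "closely packed" concrete, here is a minimal sketch of how points can be split into core, border, and noise points by counting neighbors within a radius `eps`. The helper `classify_points` and the six sample points are made up for illustration; this is not scikit-learn's implementation, only the neighbor-counting idea behind it.

```python
import numpy as np

def classify_points(X, eps, min_samples):
    """Label each point as 'core', 'border' or 'noise' (illustrative helper only)."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances, shape (n, n)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # A point's neighborhood includes the point itself,
    # matching how scikit-learn's DBSCAN counts min_samples
    within_eps = dists <= eps
    is_core = within_eps.sum(axis=1) >= min_samples
    # A border point is not core but lies within eps of at least one core point
    near_core = (within_eps & is_core[None, :]).any(axis=1)
    return np.where(is_core, 'core', np.where(near_core, 'border', 'noise'))

# Hypothetical data: a small square of four points, one point at its edge, one far away
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
                   [0.25, 0.1], [5.0, 5.0]])
kinds = classify_points(points, eps=0.2, min_samples=4)
print(kinds)
```

Core and border points end up in clusters; noise points are what DBSCAN reports as outliers.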

In [None]:
import numpy as np
import pandas as pd

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
# Set the default figure size via pyplot instead of the discouraged pylab module
plt.rcParams['figure.figsize'] = (15, 10)

In [None]:
from sklearn.cluster import DBSCAN

Read data

In [None]:
df_raw = pd.read_csv('./data/PM_train.csv')
df_raw.head()

Select the columns `s2` and `s3` from the raw data

In [None]:
df = df_raw.loc[:, ['s2', 's3']]

Instantiate the model

In [None]:
model = DBSCAN(eps=0.8, min_samples=10)

Fit model to the dataframe

In [None]:
model.fit(df)

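Get the model's labels. DBSCAN assigns the label `-1` to points it considers noise (outliers); all other labels are cluster indices.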

In [None]:
labels = model.labels_
labels

Get number of found outliers

In [None]:
n_outliers = (labels == -1).sum()
n_outliers
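The same label array also tells us how many clusters were found. The sketch below uses small synthetic data (two tight blobs and one far-away point, all made up for illustration) so it runs without `PM_train.csv`; the names `X_demo` and `demo_labels` are not part of the notebook above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Synthetic data: two tight blobs plus a single isolated point
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.05, size=(50, 2))
blob_b = rng.normal(loc=[3.0, 3.0], scale=0.05, size=(50, 2))
lonely = np.array([[10.0, 10.0]])
X_demo = np.vstack([blob_a, blob_b, lonely])

demo_labels = DBSCAN(eps=0.5, min_samples=5).fit(X_demo).labels_

# Noise points carry the label -1 and do not count as a cluster
n_noise = int((demo_labels == -1).sum())
n_clusters = len(set(demo_labels)) - (1 if -1 in demo_labels else 0)
print('clusters:', n_clusters, 'noise points:', n_noise)
```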

Plot result

In [None]:
plt.scatter(df['s2'], df['s3'], s=10, c=labels)

plt.title('Number of found outliers: %d' % n_outliers)
plt.show()

# Clustering

Let's try again with some toy data

In [None]:
from sklearn import datasets

In [None]:
X_moons, y_moons = datasets.make_moons(n_samples=1000, shuffle=True, noise=0.1, random_state=0)
df_moons = pd.DataFrame(X_moons)

In [None]:
plt.scatter(df_moons[0], df_moons[1], c=y_moons)
plt.show()

Try to find outliers using `DBSCAN`

In [None]:
model_moons = DBSCAN(eps=0.15, min_samples=10)  # try 0.15
model_moons.fit(df_moons)

Count how many points received each label

In [None]:
from collections import Counter
Counter(model_moons.labels_)

In [None]:
plt.scatter(df_moons[0], df_moons[1], c=model_moons.labels_)
plt.show()

Sweep `eps` over a range of values to see for which range DBSCAN finds two large clusters and few outliers

In [None]:
# One row per eps value: number of outliers and sizes of the first two clusters
tmp = pd.DataFrame(columns=['outliers', 'cluster0', 'cluster1'])

for eps in np.arange(0.01, 0.3, 0.01):
    labels_eps = DBSCAN(eps=eps, min_samples=10).fit(df_moons).labels_
    tmp.loc[eps] = [(labels_eps == -1).sum(), (labels_eps == 0).sum(), (labels_eps == 1).sum()]

tmp.plot()
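A complementary heuristic for choosing `eps` is the k-distance curve: for every point, compute the distance to its `min_samples`-th nearest neighbor (counting the point itself), sort these distances, and look for the "knee" where they start to rise sharply. The sketch below is one way to compute that curve with `NearestNeighbors`; it regenerates the moons data so it is self-contained, and the names `X_demo` and `k_distances` are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import NearestNeighbors

X_demo, _ = datasets.make_moons(n_samples=1000, noise=0.1, random_state=0)
min_samples = 10

# Querying with the training data returns each point as its own first
# neighbor (distance 0), which matches DBSCAN counting the point itself
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)

# Last column: the smallest eps at which each point would be a core point
k_distances = np.sort(distances[:, -1])

# A few quantiles give a rough feel for plausible eps values
print(np.quantile(k_distances, [0.5, 0.9, 0.99]))
```

Plotting `k_distances` with `plt.plot(k_distances)` shows the curve; values of `eps` near the knee are a reasonable starting point, not a guarantee of good clusters.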

The quality of the found clusters depends heavily on the chosen clustering algorithm.

In [None]:
from sklearn.cluster import KMeans

model_moons = KMeans(n_clusters=2).fit(df_moons)

plt.scatter(df_moons[0], df_moons[1], c=model_moons.labels_)
plt.show()
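Because `make_moons` also returns the true moon membership, the visual comparison can be backed by a number. The sketch below scores both algorithms with the adjusted Rand index (1.0 means perfect agreement with the true labels, values near 0 mean chance-level agreement). The parameter values mirror the cells above, except `n_init` and `random_state`, which are assumptions added for reproducibility; the resulting scores depend on `eps` and on the noise level.

```python
from sklearn import datasets
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import adjusted_rand_score

X_demo, y_true = datasets.make_moons(n_samples=1000, shuffle=True, noise=0.1, random_state=0)

# K-means assumes roughly convex clusters; DBSCAN follows dense regions of any shape
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_demo)
dbscan_labels = DBSCAN(eps=0.15, min_samples=10).fit_predict(X_demo)

# DBSCAN noise points (label -1) are treated as their own group by the score
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
print('ARI k-means: %.3f' % ari_kmeans)
print('ARI DBSCAN:  %.3f' % ari_dbscan)
```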