**Detecting Outliers**

Source: https://chrisalbon.com/machine_learning/preprocessing_structured_data/detecting_outliers/

**Preliminaries**

In [1]:
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs

**Create Data**

In [2]:
# Create simulated data
X, _ = make_blobs(n_samples=10, n_features=2, centers=1, random_state=1)

# Replace the first observation's values with extreme values
X[0, 0] = 10000
X[0, 1] = 10000

In [3]:
X

array([[  1.00000000e+04,   1.00000000e+04],
       [ -2.76017908e+00,   5.55121358e+00],
       [ -1.61734616e+00,   4.98930508e+00],
       [ -5.25790464e-01,   3.30659860e+00],
       [  8.52518583e-02,   3.64528297e+00],
       [ -7.94152277e-01,   2.10495117e+00],
       [ -1.34052081e+00,   4.15711949e+00],
       [ -1.98197711e+00,   4.02243551e+00],
       [ -2.18773166e+00,   3.33352125e+00],
       [ -1.97451969e-01,   2.34634916e+00]])

**Detect Outliers**

Elliptic envelope assumes the data is normally distributed and, based on that assumption, "draws" an ellipse around the data, classifying any observation inside the ellipse as an inlier (labeled as 1) and any observation outside the ellipse as an outlier (labeled as -1). A major limitation of this approach is the need to specify a contamination parameter, which is the proportion of observations that are outliers, a value that we don't know.
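To make the "ellipse" idea concrete, here is a minimal NumPy-only sketch. It is not what `EllipticEnvelope` does internally: it assumes the plain sample mean and covariance rather than a robust estimate, and it hardcodes the ten observations printed above. Each point gets a squared Mahalanobis distance from the center, and the most distant `contamination` share of points is treated as lying outside the ellipse.

```python
import numpy as np

# The ten observations printed above, copied by hand
# (the first row is the planted extreme observation).
X = np.array([
    [1.00000000e+04, 1.00000000e+04],
    [-2.76017908e+00, 5.55121358e+00],
    [-1.61734616e+00, 4.98930508e+00],
    [-5.25790464e-01, 3.30659860e+00],
    [8.52518583e-02, 3.64528297e+00],
    [-7.94152277e-01, 2.10495117e+00],
    [-1.34052081e+00, 4.15711949e+00],
    [-1.98197711e+00, 4.02243551e+00],
    [-2.18773166e+00, 3.33352125e+00],
    [-1.97451969e-01, 2.34634916e+00],
])

# Assumption of this sketch: plain (non-robust) center and covariance.
center = X.mean(axis=0)
cov = np.cov(X, rowvar=False)

# Squared Mahalanobis distance of every observation from the center.
diff = X - center
d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)

# The ellipse is the distance contour that leaves out roughly
# the `contamination` share of the observations.
contamination = 0.1
threshold = np.quantile(d2, 1 - contamination)

# Follow the same labeling convention: 1 = inlier, -1 = outlier.
labels = np.where(d2 > threshold, -1, 1)
print(labels)
```

Because the threshold is a quantile of the distances, the number of flagged points is driven by `contamination` rather than by the data, which is the limitation described above.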

In [4]:
# Create detector
outlier_detector = EllipticEnvelope(contamination=0.1)

# Fit detector
outlier_detector.fit(X)

# Predict outliers
outlier_detector.predict(X)

array([-1,  1,  1,  1,  1,  1,  1,  1,  1,  1])
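Since the true share of outliers is unknown, it can help to see how the result shifts with the guess. The sketch below refits the detector for a few contamination values on the same simulated data. The values `0.1`, `0.2`, `0.3` and the fixed `random_state=0` are choices made for this illustration, not part of the original example.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs

# Same simulated data as above, with the first observation made extreme.
X, _ = make_blobs(n_samples=10, n_features=2, centers=1, random_state=1)
X[0] = [10000, 10000]

# Refit with several guesses for the outlier share. random_state is fixed
# only so that repeated runs of this sketch produce the same fit.
results = {}
for contamination in (0.1, 0.2, 0.3):
    detector = EllipticEnvelope(contamination=contamination, random_state=0)
    results[contamination] = detector.fit(X).predict(X)

# Count how many observations each setting labels as outliers (-1).
counts = {c: int((labels == -1).sum()) for c, labels in results.items()}
for c, labels in results.items():
    print(c, counts[c], labels)
```

A larger contamination value flags more observations even though the data has not changed, so ordinary points in the cluster start to be labeled -1 alongside the one extreme observation.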