# Outliers
Outliers are extreme values that deviate markedly from the other observations. There are two types:
1. Univariate    [outliers in a single feature]
2. Multivariate  [outliers in n-dimensional space, i.e. across n features considered jointly]
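
As a minimal illustration of the univariate case, the sketch below flags values in a single feature using the 1.5 × IQR rule. Both the rule and the sample values are assumptions chosen for illustration; neither comes from this notebook.

```python
import numpy as np

# Hypothetical single-feature sample; the last value is deliberately extreme.
values = np.array([2.1, 2.4, 2.2, 2.5, 2.3, 9.8])

# Interquartile range of the feature
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Common rule of thumb: anything beyond 1.5 * IQR from the quartiles is suspect
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

print(values[is_outlier])
```

A rule like this looks at one feature at a time, so it can miss observations that are only unusual when several features are considered together. That is the multivariate case the rest of this notebook deals with.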


# Load the libraries

In [3]:
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs


# Create data

In [15]:
# make_blobs generates isotropic Gaussian blobs for clustering.
X, Y = make_blobs(n_samples=10,       # Number of samples to generate
                  n_features=2,       # Number of features for each sample
                  centers=1,          # Number of centres to generate
                  random_state=1)     # Seed for the random number generator

# X: the generated samples, shape (10, 2)
# Y: integer cluster label for each sample
        

# Introduce the outliers

In [17]:
X[0, 0] = 50000                                   # Overwrite both features of the first observation
X[0, 1] = 50000                                   # with extreme values to create an obvious outlier.

print(X)

[[ 5.00000000e+04  5.00000000e+04]
 [-2.76017908e+00  5.55121358e+00]
 [-1.61734616e+00  4.98930508e+00]
 [-5.25790464e-01  3.30659860e+00]
 [ 8.52518583e-02  3.64528297e+00]
 [-7.94152277e-01  2.10495117e+00]
 [-1.34052081e+00  4.15711949e+00]
 [-1.98197711e+00  4.02243551e+00]
 [-2.18773166e+00  3.33352125e+00]
 [-1.97451969e-01  2.34634916e+00]]


# Detect the Outliers

In [21]:
'''
    Here we use EllipticEnvelope, which assumes that the data is NORMALLY DISTRIBUTED (Gaussian). On the basis of this
    assumption, it fits a robust covariance estimate and draws an ellipse around the central bulk of the data. Any
    observation inside the ellipse is an INLIER (labelled 1) and any observation outside the ellipse is an OUTLIER
    (labelled -1).

    Limitation: we must supply a 'contamination' parameter, the proportion of observations expected to be outliers.
    We generally do not know this proportion in advance, so the chosen value is somewhat arbitrary.
'''
# Create an instance of the EllipticEnvelope class - this is the outlier detector
eeObj = EllipticEnvelope(contamination=0.01)

# Fit the outlier detector on the data
eeObj.fit(X)

# Predict a label for each observation: 1 = inlier, -1 = outlier
eeObj.predict(X)                                # [-1,  1,  1,  1,  1,  1,  1,  1,  1,  1]
                                                # The first entry, the observation we overwrote above, is flagged
                                                # as an outlier and can be removed before further analysis.

array([-1,  1,  1,  1,  1,  1,  1,  1,  1,  1])
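
The two points made above, that the flagged observation can be removed and that `contamination` is an arbitrary choice, can be sketched as follows. This is a self-contained sketch rather than part of the original notebook: the names `detector`, `labels`, `X_clean` and `flagged_counts`, the `random_state=0` argument and the extra contamination values 0.2 and 0.3 are all assumptions added for illustration.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs

# Rebuild the same data as above, including the injected outlier
X, _ = make_blobs(n_samples=10, n_features=2, centers=1, random_state=1)
X[0, 0] = 50000
X[0, 1] = 50000

# random_state is fixed only to make the sketch repeatable (an assumption)
detector = EllipticEnvelope(contamination=0.01, random_state=0)
labels = detector.fit_predict(X)        # 1 = inlier, -1 = outlier

# Keep only the rows labelled as inliers
X_clean = X[labels == 1]
print(X_clean.shape)

# The contamination value directly controls how many points get flagged,
# regardless of how many genuine outliers the data contains.
flagged_counts = {}
for contamination in (0.01, 0.2, 0.3):
    sweep = EllipticEnvelope(contamination=contamination, random_state=0)
    flagged_counts[contamination] = int((sweep.fit_predict(X) == -1).sum())
print(flagged_counts)
```

Raising `contamination` makes the detector flag more observations even though only one extreme point was injected. This is why the value should be chosen with some knowledge of the data rather than left to guesswork.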