## [Isolation Forests](#top)

Isolation forest is an anomaly detection algorithm. It detects anomalies by isolating them (measuring how easily a data point can be separated from the rest of the data), rather than by modelling the normal points.

### Steps

1. Randomly select a feature and randomly select a value for that feature within its range.

2. If the observation’s feature value falls above (below) the selected value, then this value becomes the new min (max) of that feature’s range.

3. Check if at least one other observation has values in the range of each feature in the dataset, where some ranges were altered via step 2. If no, then the observation is isolated.

4. Repeat steps 1–3 until the observation is isolated. The number of times you had to go through these steps is the isolation number. The lower the number, the more anomalous the observation is.
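The steps above can be sketched for a single observation as follows. This is a minimal illustration on synthetic data, not the scikit-learn implementation (which builds many trees on subsamples and averages path lengths); the helper name `isolation_number`, the `max_steps` cap, and the demo data are assumptions made for this example.

```python
import numpy as np

def isolation_number(X, idx, rng, max_steps=1000):
    """Count the random splits needed to isolate row `idx` of X (one random run)."""
    lo = X.min(axis=0).astype(float)  # current lower bound of each feature's range
    hi = X.max(axis=0).astype(float)  # current upper bound of each feature's range
    point = X[idx]
    for step in range(1, max_steps + 1):
        # step 1: pick a random feature and a random value within its current range
        f = rng.integers(X.shape[1])
        split = rng.uniform(lo[f], hi[f])
        # step 2: shrink the range to the side that contains the observation
        if point[f] >= split:
            lo[f] = split
        else:
            hi[f] = split
        # step 3: does any other observation still fall inside every range?
        inside = np.all((X >= lo) & (X <= hi), axis=1)
        inside[idx] = False
        if not inside.any():
            return step  # step 4: isolated after `step` repetitions
    return max_steps

rng = np.random.default_rng(0)
# synthetic data: 200 clustered points plus one far-away point (the last row)
X_demo = np.vstack([rng.normal(size=(200, 2)), [[8.0, 8.0]]])
outlier_idx = len(X_demo) - 1
inlier_idx = int(np.argmin(np.linalg.norm(X_demo[:-1], axis=1)))  # most central point

# average over many random runs, since a single run is noisy
mean_outlier = np.mean([isolation_number(X_demo, outlier_idx, rng) for _ in range(100)])
mean_inlier = np.mean([isolation_number(X_demo, inlier_idx, rng) for _ in range(100)])
print('outlier: %.1f, central point: %.1f' % (mean_outlier, mean_inlier))
```

The far-away point should need noticeably fewer splits on average than the central point, which is the property the algorithm exploits.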


### Data Preprocessing:

In [1]:
# import necessary libraries:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# load the dataset
df  = pd.read_csv('../linear-regression/dataset/household_power_consumption/household_power_consumption.txt',
                    sep=';', parse_dates={'dt' : ['Date', 'Time']}, 
                    infer_datetime_format=True, low_memory=False, 
                    na_values=['nan','?'], index_col='dt')

df.shape

df.isnull().sum()

# Drop the null values:
df = df.copy()
df = df.dropna()

df.info()
df.isnull().sum()

# retrieve the array
data = df.values

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2049280 entries, 2006-12-16 17:24:00 to 2010-11-26 21:02:00
Data columns (total 7 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   Global_active_power    float64
 1   Global_reactive_power  float64
 2   Voltage                float64
 3   Global_intensity       float64
 4   Sub_metering_1         float64
 5   Sub_metering_2         float64
 6   Sub_metering_3         float64
dtypes: float64(7)
memory usage: 125.1 MB


### Split into input and output columns, then into train and test sets

In [3]:
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# summarize the shape of the dataset
print(X.shape, y.shape)

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize the shape of the train and test sets
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(2049280, 6) (2049280,)
(1373017, 6) (676263, 6) (1373017,) (676263,)


#### Baseline Model Performance

We will fit a linear regression model on the training dataset, make predictions on the test dataset, and evaluate those predictions using the **mean absolute error (*MAE*)**.

In [4]:
# evaluate model on the raw dataset

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

MAE: 4.009
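For intuition, MAE is simply the average of the absolute differences between the predictions and the true values. The following is a minimal sketch on four made-up numbers (the arrays `y_true_demo` and `y_pred_demo` are illustrative and are not taken from the power consumption dataset) that checks a by-hand calculation against `mean_absolute_error`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# made-up values for illustration only (not from the dataset)
y_true_demo = np.array([3.0, 0.0, 2.0, 7.0])
y_pred_demo = np.array([2.5, 0.0, 2.0, 8.0])

# by hand: the absolute errors are 0.5, 0.0, 0.0 and 1.0, so the mean is 1.5 / 4
mae_by_hand = np.mean(np.abs(y_true_demo - y_pred_demo))
mae_sklearn = mean_absolute_error(y_true_demo, y_pred_demo)

print('by hand: %.3f, scikit-learn: %.3f' % (mae_by_hand, mae_sklearn))
```

Both values come out to 0.375. The MAE is in the same units as the target column, which makes it easy to compare across the models in this notebook.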



The scikit-learn library provides a number of built-in automatic methods for identifying outliers in data.


Once an outlier detection method is defined, it is fit on the *training dataset*. The fitted detector then predicts which examples in the training dataset are *outliers* and which are not (so-called *inliers*). The outliers are then **removed** from the training dataset, and the regression model is fit on the remaining examples and evaluated on the entire test dataset.

It would be invalid to fit the outlier detection method on the entire dataset (training and test data together), as this would result in data leakage. That is, the detector would have access to data (or information about the data) in the test set, which is not used to train the model. This may result in an optimistic estimate of model performance.

We could attempt to detect outliers on “*new data*” such as the test set prior to making a prediction, but then what do we do if outliers are detected?

One approach might be to return a “None” indicating that the model is unable to make a prediction on those outlier cases. This might be an interesting extension to explore that may be appropriate for the project.
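A minimal sketch of that idea, on synthetic data, might look as follows. The helper `predict_or_none`, the synthetic arrays, and the choice of `contamination=0.05` are all assumptions made for illustration; the detector is fit on the training rows only, so nothing about the new data leaks into it.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# synthetic training data (illustrative only): y depends linearly on two features
X_tr = rng.normal(size=(300, 2))
y_tr = 3.0 * X_tr[:, 0] - X_tr[:, 1] + rng.normal(scale=0.1, size=300)

# fit the detector on the training rows only, then drop the flagged rows
detector = IsolationForest(contamination=0.05, random_state=0).fit(X_tr)
keep = detector.predict(X_tr) == 1
regressor = LinearRegression().fit(X_tr[keep], y_tr[keep])

def predict_or_none(X_new):
    """Return a prediction per row, or None where the detector flags an outlier."""
    flags = detector.predict(X_new)    # 1 for inliers, -1 for outliers
    preds = regressor.predict(X_new)
    return [float(p) if f == 1 else None for p, f in zip(preds, flags)]

# one typical row and one row far outside the training data
X_new = np.array([[0.1, -0.2], [25.0, 25.0]])
results = predict_or_none(X_new)
print(results)
```

Whether refusing to predict is acceptable depends on the application; an alternative would be to return the prediction together with a warning flag.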

## Isolation Forest

Isolation Forest, or iForest for short, is a tree-based anomaly detection algorithm.

Rather than modeling the normal data, it works by isolating anomalies, which are both few in number and different in the feature space.

… our proposed method takes advantage of two anomalies’ quantitative properties: i) they are the minority consisting of fewer instances and ii) they have attribute-values that are very different from those of normal instances.

— Isolation Forest, 2008.

The scikit-learn library provides an implementation of Isolation Forest in the IsolationForest class.

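Perhaps the most important hyperparameter in the model is the “contamination” argument, which sets the expected proportion of outliers in the dataset and so determines the threshold used to flag them. It is either the string 'auto' (the default in recent scikit-learn releases) or a value in the range (0, 0.5]. Here we set it explicitly to 0.1.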

```
# identify outliers in the training dataset
iso = IsolationForest(contamination=0.1)
yhat = iso.fit_predict(X_train)
```

Once identified, we can remove the outliers from the training dataset.

```
# select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]
```
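To see what the “contamination” argument does in practice, the sketch below fits the detector on a small synthetic dataset with three different values and reports the fraction of rows labelled -1. The data, the seed, and the names `X_demo` and `fractions` are assumptions made for this example; the fraction flagged should land close to the contamination value in each case.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# synthetic data for illustration only
X_demo = rng.normal(size=(1000, 3))

fractions = {}
for c in (0.01, 0.05, 0.1):
    # same random_state each time, so only the threshold changes
    labels = IsolationForest(contamination=c, random_state=42).fit_predict(X_demo)
    fractions[c] = float(np.mean(labels == -1))
    print('contamination=%.2f -> fraction flagged=%.3f' % (c, fractions[c]))
```

This also explains the shapes printed in the full example: with `contamination=0.1`, roughly 10% of the training rows are removed regardless of whether they are genuinely anomalous.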


- Tying all these together, the complete example of evaluating the linear model on the household power consumption dataset with outliers identified and removed with isolation forest is listed below:

In [6]:
# evaluate model performance with outliers removed using isolation forest
# import necessary libraries:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import IsolationForest
from sklearn.metrics import mean_absolute_error

# load the dataset
df  = pd.read_csv('../linear-regression/dataset/household_power_consumption/household_power_consumption.txt',
                    sep=';', parse_dates={'dt' : ['Date', 'Time']}, 
                    infer_datetime_format=True, low_memory=False, 
                    na_values=['nan','?'], index_col='dt')

df.shape

df.isnull().sum()

# Drop the null values:
df = df.copy()
df = df.dropna()

df.info()
df.isnull().sum()

# retrieve the array
data = df.values

# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize the shape of the training dataset
print(X_train.shape, y_train.shape)

# identify outliers in the training dataset
iso = IsolationForest(contamination=0.1)
yhat = iso.fit_predict(X_train)

# select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]

# summarize the shape of the updated training dataset
print(X_train.shape, y_train.shape)

# fit the model
model = LinearRegression()
model.fit(X_train, y_train)

# evaluate the model
yhat = model.predict(X_test)

# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2049280 entries, 2006-12-16 17:24:00 to 2010-11-26 21:02:00
Data columns (total 7 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   Global_active_power    float64
 1   Global_reactive_power  float64
 2   Voltage                float64
 3   Global_intensity       float64
 4   Sub_metering_1         float64
 5   Sub_metering_2         float64
 6   Sub_metering_3         float64
dtypes: float64(7)
memory usage: 125.1 MB
(1373017, 6) (1373017,)
(1235716, 6) (1235716,)
MAE: 4.616


* As a result, we can see that the model identified and removed *137,301 outliers* and achieved an MAE of about 4.616. This is an increase in error over the baseline score of about 4.009, so removing these examples made the model worse on this dataset.

## [Minimum Covariance Determinant](#second)

If the input variables have a Gaussian distribution, then simple statistical methods can be used to detect outliers.

For example, if the dataset has two input variables and both are Gaussian, then the feature space forms a multi-dimensional Gaussian and knowledge of this distribution can be used to identify values far from the distribution.

This approach can be generalized by defining a hypersphere (ellipsoid) that covers the normal data, and data that falls outside this shape is considered an outlier. An efficient implementation of this technique for multivariate data is known as the Minimum Covariance Determinant, or MCD for short.

The Minimum Covariance Determinant (MCD) method is a highly robust estimator of multivariate location and scatter, for which a fast algorithm is available. It also serves as a convenient and efficient tool for outlier detection.

— Minimum Covariance Determinant and Extensions, 2017.

The scikit-learn library provides access to this method via the EllipticEnvelope class.

It provides the “contamination” argument that defines the expected ratio of outliers to be observed in practice. In this case, we will set it to a value of 0.01, found with a little trial and error.

```
# identify outliers in the training dataset
ee = EllipticEnvelope(contamination=0.01)
yhat = ee.fit_predict(X_train)
```
Once identified, the outliers can be removed from the training dataset as we did in the prior method.
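As a small, self-contained illustration of the ellipsoid idea, the sketch below draws correlated Gaussian data, appends two points that lie against the direction of the correlation, and checks that `EllipticEnvelope` flags them. The synthetic data, the seed, and the variable names are assumptions made for this example; `mahalanobis` returns squared Mahalanobis distances under the robust fit.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(7)
# synthetic, positively correlated Gaussian data (illustrative only)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
X_gauss = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=500)
# two points that lie against the direction of the correlation
X_far = np.array([[4.0, -4.0], [-5.0, 5.0]])
X_demo = np.vstack([X_gauss, X_far])

ee = EllipticEnvelope(contamination=0.01, random_state=7)
labels = ee.fit_predict(X_demo)   # 1 for inliers, -1 for outliers
d2 = ee.mahalanobis(X_demo)       # squared Mahalanobis distance of each row

print('labels of the two appended points:', labels[-2:])
print('median squared distance: %.2f' % np.median(d2))
print('squared distances of the appended points:', d2[-2:])

# remove the flagged rows, as in the prior method
X_clean = X_demo[labels != -1]
print(X_demo.shape, X_clean.shape)
```

Note that neither appended point is extreme on any single feature by a wide margin; it is the combination of values that places them outside the fitted ellipse.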


- Tying all of this together, the complete example of identifying and removing outliers from the household power consumption dataset using the elliptical envelope (minimum covariance determinant) method is listed below:

In [7]:
# evaluate model performance with outliers removed using elliptical envelope

# import necessary libraries:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.covariance import EllipticEnvelope
from sklearn.metrics import mean_absolute_error

# load the dataset
df  = pd.read_csv('../linear-regression/dataset/household_power_consumption/household_power_consumption.txt',
                    sep=';', parse_dates={'dt' : ['Date', 'Time']}, 
                    infer_datetime_format=True, low_memory=False, 
                    na_values=['nan','?'], index_col='dt')

df.shape

df.isnull().sum()

# Drop the null values:
df = df.copy()
df = df.dropna()

df.info()
df.isnull().sum()

# retrieve the array
data = df.values

# split into input and output elements
X, y = data[:, :-1], data[:, -1]

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize the shape of the training dataset
print(X_train.shape, y_train.shape)

# identify outliers in the training dataset
ee = EllipticEnvelope(contamination=0.01)
yhat = ee.fit_predict(X_train)

# select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]

# summarize the shape of the updated training dataset
print(X_train.shape, y_train.shape)

# fit the model
model = LinearRegression()
model.fit(X_train, y_train)

# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2049280 entries, 2006-12-16 17:24:00 to 2010-11-26 21:02:00
Data columns (total 7 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   Global_active_power    float64
 1   Global_reactive_power  float64
 2   Voltage                float64
 3   Global_intensity       float64
 4   Sub_metering_1         float64
 5   Sub_metering_2         float64
 6   Sub_metering_3         float64
dtypes: float64(7)
memory usage: 125.1 MB
(1373017, 6) (1373017,)
(1359286, 6) (1359286,)
MAE: 4.004


* In this case, we can see that the elliptical envelope method identified and removed only 13,731 outliers, resulting in a small drop in MAE from 4.009 with the baseline to 4.004.



## [Local Outlier Factor](#third)

A simple approach to identifying outliers is to locate those examples that are far from the other examples in the feature space.

This can work well for feature spaces with low dimensionality (few features), although it can become less reliable as the number of features is increased, referred to as the curse of dimensionality.

The local outlier factor, or LOF for short, is a technique that attempts to harness the idea of nearest neighbors for outlier detection. Each example is assigned a score of how isolated it is, or how likely it is to be an outlier, based on how sparse its local neighborhood is compared with those of its neighbors. Examples with the largest scores are more likely to be outliers.

We introduce a local outlier (LOF) for each object in the dataset, indicating its degree of outlier-ness.

— LOF: Identifying Density-based Local Outliers, 2000.

The scikit-learn library provides an implementation of this approach in the LocalOutlierFactor class.

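The model provides the “contamination” argument, which is the expected proportion of outliers in the dataset. In recent scikit-learn releases it defaults to 'auto', which applies the threshold proposed in the original paper; here we leave it at the default.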

```
# identify outliers in the training dataset
lof = LocalOutlierFactor()
yhat = lof.fit_predict(X_train)
```
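The scores themselves can be inspected through the `negative_outlier_factor_` attribute, which stores the negated LOF of each training example. The sketch below uses a small synthetic cluster plus one isolated point; the data, the seed, and the explicit `n_neighbors=20` are assumptions made for illustration. Note that in this default mode the class offers `fit_predict` only, so it labels the data it was fit on.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(3)
# synthetic data: one tight cluster plus a single isolated point (the last row)
X_cluster = rng.normal(loc=0.0, scale=0.5, size=(200, 2))
X_lonely = np.array([[6.0, 6.0]])
X_demo = np.vstack([X_cluster, X_lonely])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X_demo)          # 1 for inliers, -1 for outliers
scores = -lof.negative_outlier_factor_    # larger value = more outlying

print('label of the isolated point:', labels[-1])
print('index of the largest LOF score:', int(np.argmax(scores)))
print('number of rows flagged:', int(np.sum(labels == -1)))
```

Scores close to 1 indicate a point whose neighborhood is about as dense as those of its neighbors; the isolated point should score far higher.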

In [8]:
# evaluate model performance with outliers removed using local outlier factor

# import necessary libraries:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import mean_absolute_error

# load the dataset
df  = pd.read_csv('../linear-regression/dataset/household_power_consumption/household_power_consumption.txt',
                    sep=';', parse_dates={'dt' : ['Date', 'Time']}, 
                    infer_datetime_format=True, low_memory=False, 
                    na_values=['nan','?'], index_col='dt')

df.shape

df.isnull().sum()

# Drop the null values:
df = df.copy()
df = df.dropna()

df.info()
df.isnull().sum()

# retrieve the array
data = df.values

# split into input and output elements
X, y = data[:, :-1], data[:, -1]

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize the shape of the training dataset
print(X_train.shape, y_train.shape)

# identify outliers in the training dataset
lof = LocalOutlierFactor()
yhat = lof.fit_predict(X_train)

# select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]

# summarize the shape of the updated training dataset
print(X_train.shape, y_train.shape)

# fit the model
model = LinearRegression()
model.fit(X_train, y_train)

# evaluate the model
yhat = model.predict(X_test)

# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

(1373017, 6) (1373017,)
(1340943, 6) (1340943,)
MAE: 4.007


In this case, we can see that the local outlier factor method identified and removed 32,074 outliers, resulting in a drop in MAE from 4.009 with the baseline to 4.007. This is slightly better than the baseline and far better than isolation forest (4.616), but not quite as good as the elliptical envelope (4.004), suggesting that each method identified and removed a different set of outliers.