## Outlier Engineering


An outlier is a data point that differs significantly from the rest of the data. “An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.” [D. Hawkins. Identification of Outliers, Chapman and Hall, 1980].

Statistics such as the mean and variance are very susceptible to outliers. In addition, **some Machine Learning models are sensitive to outliers** which may decrease their performance. Thus, depending on which algorithm we wish to train, we often remove outliers from our variables.
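To make the sensitivity of the mean and variance concrete, here is a minimal sketch on made-up numbers (the values are illustrative and not taken from the Titanic dataset). It compares the mean, variance and median of a small sample with and without one extreme value.

```python
import numpy as np

# Toy sample (hypothetical values) and the same sample plus one extreme value
values = np.array([10.0, 11.0, 12.0, 13.0, 14.0])
with_outlier = np.append(values, 500.0)

# The mean and the variance react strongly to the single extreme value
mean_clean, mean_out = values.mean(), with_outlier.mean()
var_clean, var_out = values.var(), with_outlier.var()

# The median barely moves
median_clean, median_out = np.median(values), np.median(with_outlier)

print(f"mean:     {mean_clean:.2f} -> {mean_out:.2f}")
print(f"variance: {var_clean:.2f} -> {var_out:.2f}")
print(f"median:   {median_clean:.2f} -> {median_out:.2f}")
```

A single observation is enough to move the mean from 12 to roughly 93, while the median only shifts from 12 to 12.5.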

We discussed in section 3 of this course how to identify outliers. In this section, we discuss how to process them before training our machine learning models.


## How can we pre-process outliers?

- Trimming: remove the outliers from our dataset
- Treat outliers as missing data, and proceed with any missing data imputation technique
- Discretisation: outliers are placed in border bins together with higher or lower values of the distribution
- Censoring: capping the variable distribution at a maximum and/or minimum value
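The four options above can be sketched with plain pandas. This is only an illustration on a toy series: the threshold `upper`, the bin edges and the choice of median imputation are all assumptions made for the example, not recommendations.

```python
import numpy as np
import pandas as pd

s = pd.Series([5.0, 7.0, 8.0, 9.0, 250.0], name="x")
upper = 100.0  # hypothetical limit above which we call a value an outlier

# Trimming: drop the rows that hold outliers
trimmed = s[s <= upper]

# Treat as missing: replace outliers with NaN, then impute (median, as an example)
as_missing = s.where(s <= upper)
imputed = as_missing.fillna(as_missing.median())

# Discretisation: open-ended border bins absorb the extreme values
binned = pd.cut(s, bins=[-np.inf, 6, 8, np.inf], labels=["low", "mid", "high"])

# Censoring: cap the values at the limit
capped = s.clip(upper=upper)
```

Trimming shortens the dataset, whereas the other three options keep every row.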

**Censoring** is also known as:

- top and bottom coding
- winsorisation
- capping


## Censoring or Capping

**Censoring**, or **capping**, means capping the maximum and/or minimum of a distribution at an arbitrary value. In other words, values bigger than the chosen maximum or smaller than the chosen minimum are **censored**, that is, replaced by that limit.

Capping can be done at both tails, or just one of the tails, depending on the variable and the user.
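As a rough illustration of one-tail versus two-tail capping, pandas' `Series.clip` accepts a `lower` and/or an `upper` bound. The series and the limits 0 and 50 below are made up for the example.

```python
import pandas as pd

s = pd.Series([-40.0, 2.0, 5.0, 9.0, 300.0])

# Cap only the right tail: values above 50 become 50
right_only = s.clip(upper=50)

# Cap only the left tail: values below 0 become 0
left_only = s.clip(lower=0)

# Cap both tails at once
both = s.clip(lower=0, upper=50)
```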

Check my talk at [PyData](https://www.youtube.com/watch?v=KHGGlozsRtA) for an example of capping used in a finance company.

The numbers at which to cap the distribution can be determined:

- arbitrarily
- using the inter-quartile range proximity rule
- using the Gaussian approximation
- using quantiles
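The three data-driven options can be sketched as small helper functions. The function names and the default parameters (1.5 times the IQR, 3 standard deviations, 5th and 95th percentiles) are common conventions assumed here for illustration; they are not prescribed by this section.

```python
import numpy as np
import pandas as pd

def iqr_caps(s, fold=1.5):
    """Inter-quartile range proximity rule: Q1 - fold*IQR, Q3 + fold*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - fold * iqr, q3 + fold * iqr

def gaussian_caps(s, fold=3.0):
    """Gaussian approximation: mean -/+ fold * standard deviation."""
    # pandas' std() uses the sample standard deviation (ddof=1)
    return s.mean() - fold * s.std(), s.mean() + fold * s.std()

def quantile_caps(s, lower=0.05, upper=0.95):
    """Quantiles: cap at the chosen lower and upper percentiles."""
    return s.quantile(lower), s.quantile(upper)

# Toy variable: the numbers 1 to 100
s = pd.Series(np.arange(1, 101, dtype=float))

iqr_low, iqr_high = iqr_caps(s)
gauss_low, gauss_high = gaussian_caps(s)
q_low, q_high = quantile_caps(s)
```

Each helper returns a `(lower, upper)` pair that can be passed to `s.clip(lower=..., upper=...)`.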


### Advantages

- does not remove data

### Limitations

- distorts the distributions of the variables
- distorts the relationships among variables
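A small, made-up example may help to see both limitations. Below, `y` is exactly twice `x`, so the two variables are perfectly correlated; capping `x` alone shrinks its spread and weakens that relationship. The data and the cap of 10 are assumptions for the sketch.

```python
import pandas as pd

# Toy data where y = 2 * x, including one extreme observation
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 100.0],
    "y": [2.0, 4.0, 6.0, 8.0, 200.0],
})

# Cap x at a hypothetical maximum of 10, leaving y untouched
capped = df.assign(x=df["x"].clip(upper=10))

# The distribution of x changes: its standard deviation shrinks
std_before, std_after = df["x"].std(), capped["x"].std()

# The relationship between x and y changes: the correlation drops below 1
corr_before = df["x"].corr(df["y"])
corr_after = capped["x"].corr(capped["y"])

print(f"std of x:   {std_before:.2f} -> {std_after:.2f}")
print(f"corr(x, y): {corr_before:.3f} -> {corr_after:.3f}")
```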


## In this Demo

We will see how to perform capping with arbitrary values using the Titanic dataset.

## Important

When capping, we tend to cap values in both the train and test sets. It is important to remember that the capping values MUST be derived from the train set, and those same values must then be used to cap the variables in the test set.

I will not do that in this demo, but please keep that in mind when setting up your pipelines.
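For reference, the train/test discipline could look roughly like the sketch below. The two small data frames stand in for a real split, and deriving the cap from the 75th percentile of the train set is only an assumption for the example; any other rule would follow the same pattern.

```python
import pandas as pd

# Hypothetical train and test sets (made-up fares)
train = pd.DataFrame({"fare": [7.0, 8.0, 10.0, 13.0, 26.0, 30.0, 80.0, 500.0]})
test = pd.DataFrame({"fare": [5.0, 15.0, 600.0]})

# Learn the capping value from the train set ONLY
upper_cap = train["fare"].quantile(0.75)

# Apply that same value to both sets; nothing is re-estimated on the test set
train_capped = train.assign(fare=train["fare"].clip(upper=upper_cap))
test_capped = test.assign(fare=test["fare"].clip(upper=upper_cap))
```

Estimating the cap on the test set, or on train and test combined, would leak information from the test data into the pre-processing step.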

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from feature_engine import missing_data_imputers as msi
from feature_engine import outlier_removers as outr

In [2]:
# function to load the titanic dataset

def load_titanic():
    data = pd.read_csv('../titanic.csv')
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    # assign the result back: inplace fillna on a selected column is unreliable in recent pandas
    data['embarked'] = data['embarked'].fillna('C')
    return data

In [3]:
data = load_titanic()
data.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C,S,,,"Montreal, PQ / Chesterville, ON"


## ArbitraryOutlierCapper

The ArbitraryOutlierCapper caps the minimum and maximum values by a value determined by the user. 
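To show what capping with user-defined dictionaries amounts to, here is a plain-pandas sketch. The helper `cap_with_dicts` is hypothetical and written only for this illustration; it is not the library's implementation, and the toy data frame stands in for the Titanic data.

```python
import pandas as pd

def cap_with_dicts(df, max_caps=None, min_caps=None):
    """Hypothetical helper: cap each listed column at user-chosen limits."""
    out = df.copy()  # leave the input data frame untouched
    for col, cap in (max_caps or {}).items():
        out[col] = out[col].clip(upper=cap)  # right-tail capping
    for col, cap in (min_caps or {}).items():
        out[col] = out[col].clip(lower=cap)  # left-tail capping
    return out

# Toy data with the same column names used in this demo
toy = pd.DataFrame({"age": [2.0, 30.0, 80.0], "fare": [7.25, 151.55, 512.3292]})

capped = cap_with_dicts(
    toy,
    max_caps={"age": 50, "fare": 200},
    min_caps={"age": 10, "fare": 100},
)
```

Columns that do not appear in either dictionary are returned unchanged.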

In [4]:
# let's find out the maximum age and maximum fare in the Titanic dataset

data.age.max(), data.fare.max()

(80.0, 512.3292)

In [5]:
capper = outr.ArbitraryOutlierCapper(max_capping_dict = {'age':50, 'fare':200},
                                     min_capping_dict = None)
capper.fit(data)

ArbitraryOutlierCapper(max_capping_dict=None, min_capping_dict=None)

In [6]:
capper.right_tail_caps_

{'age': 50, 'fare': 200}

In [7]:
capper.left_tail_caps_

{}

In [8]:
temp = capper.transform(data)

temp.age.max(), temp.fare.max()

(50.0, 200.0)

### Minimum capping

In [9]:
capper = outr.ArbitraryOutlierCapper(max_capping_dict=None,
                                     min_capping_dict={
                                         'age': 10,
                                         'fare': 100
                                     })
capper.fit(data)

ArbitraryOutlierCapper(max_capping_dict=None, min_capping_dict=None)

In [10]:
capper.variables

['age', 'fare']

In [11]:
capper.right_tail_caps_

{}

In [18]:
capper.left_tail_caps_

{'age': 10, 'fare': 100}

In [12]:
temp = capper.transform(data)

temp.age.min(), temp.fare.min()

(10.0, 100.0)

### Both ends capping

In [13]:
capper = outr.ArbitraryOutlierCapper(max_capping_dict={
                                     'age': 50, 'fare': 200},
                                     min_capping_dict={
                                     'age': 10, 'fare': 100})
capper.fit(data)

ArbitraryOutlierCapper(max_capping_dict=None, min_capping_dict=None)

In [14]:
capper.right_tail_caps_

{'age': 50, 'fare': 200}

In [15]:
capper.left_tail_caps_

{'age': 10, 'fare': 100}

In [16]:
temp = capper.transform(data)

temp.age.min(), temp.fare.min()

(10.0, 100.0)

In [17]:
temp.age.max(), temp.fare.max()

(50.0, 200.0)