## Outlier Engineering


An outlier is a data point that differs significantly from the rest of the data. “An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.” [D. Hawkins. Identification of Outliers, Chapman and Hall, 1980].

Statistics such as the mean and variance are very susceptible to outliers. In addition, **some Machine Learning models are sensitive to outliers**, which may decrease their performance. Thus, depending on which algorithm we wish to train, we often remove or otherwise treat the outliers in our variables.

Earlier we discussed how to identify outliers. In this notebook, we will discuss how we can process them to train our machine learning models.


## How can we pre-process outliers?

- Trimming: remove the outliers from our dataset
- Treat outliers as missing data, and proceed with any missing data imputation technique
- Discretisation: outliers are placed in the border bins together with the higher or lower values of the distribution
- Censoring: capping the variable distribution at a max and / or minimum value

**Censoring** is also known as:

- top and bottom coding
- winsorisation
- capping
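
The four pre-processing options listed in this section can be sketched with plain pandas. This is a minimal illustration, not the notebook's own method: the toy values, the upper limit of 100, the bin edges and the choice of median imputation are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy variable with one obvious outlier (hypothetical values).
s = pd.Series([22.0, 25.0, 31.0, 28.0, 35.0, 400.0], name="fare")
upper = 100.0  # arbitrary upper limit, chosen only for illustration

# Trimming: drop the observations above the limit.
trimmed = s[s <= upper]

# Treat as missing: replace outliers with NaN, then impute (median here).
as_missing = s.where(s <= upper)
imputed = as_missing.fillna(as_missing.median())

# Discretisation: the outlier falls into the open-ended top bin.
binned = pd.cut(s, bins=[-np.inf, 25, 30, np.inf], labels=["low", "mid", "high"])

# Censoring: values above the limit are set to the limit.
capped = s.clip(upper=upper)
```

Only trimming changes the number of rows; the other three keep every observation and change its value or representation instead.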


## Censoring or Capping

**Censoring**, or **capping**, means capping the maximum and / or minimum of a distribution at an arbitrary value. In other words, values bigger or smaller than the arbitrarily determined ones are **censored**.

Capping can be done at both tails, or just one of the tails, depending on the variable and the user.

The numbers at which to cap the distribution can be determined:

- arbitrarily
- using the inter-quartile range proximity rule
- using the Gaussian approximation
- using quantiles
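
As a rough sketch of the three data-driven options in the list above, the limits could be computed with pandas as follows. The multipliers (1.5 times the IQR, 3 standard deviations) and the 5th / 95th percentiles are common conventions assumed here for illustration, and the series is made-up data.

```python
import pandas as pd

# Hypothetical right-skewed variable, for illustration only.
s = pd.Series([5.0, 7.0, 8.0, 8.0, 9.0, 10.0, 11.0, 12.0, 14.0, 90.0])

# Inter-quartile range proximity rule: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_caps = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Gaussian approximation: mean minus / plus 3 standard deviations.
gauss_caps = (s.mean() - 3 * s.std(), s.mean() + 3 * s.std())

# Quantiles: here the 5th and 95th percentiles.
quantile_caps = (s.quantile(0.05), s.quantile(0.95))

# Apply one pair of limits, for example the IQR-based ones.
capped = s.clip(lower=iqr_caps[0], upper=iqr_caps[1])
```

The three rules generally give different limits for the same variable, so the choice should follow the shape of the distribution: the Gaussian approximation is only sensible for roughly normal variables, while the IQR rule and quantiles are more robust for skewed ones.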


### Advantages

- does not remove data

### Limitations

- distorts the distributions of the variables
- distorts the relationships among variables
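
Both limitations can be seen on a small made-up example. In the sketch below, `x` and `y` are hypothetical variables with an exact linear relationship; capping `x` shifts its mean and weakens its correlation with `y`.

```python
import pandas as pd

# Hypothetical data in which y is exactly 2 * x.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 50.0],
    "y": [2.0, 4.0, 6.0, 8.0, 100.0],
})

# Cap x at an arbitrary maximum of 5; y is left untouched.
capped = df.assign(x=df["x"].clip(upper=5.0))

mean_before, mean_after = df["x"].mean(), capped["x"].mean()
corr_before = df["x"].corr(df["y"])
corr_after = capped["x"].corr(capped["y"])

print(f"mean of x: {mean_before} -> {mean_after}")
print(f"corr(x, y): {corr_before:.3f} -> {corr_after:.3f}")
```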


## In this Demo

We will see how to perform capping with arbitrary values using the Titanic dataset.

## Important

When capping, we usually cap the values in both the train and the test set. The capping values MUST be derived from the train set only, and those same values are then used to cap the variables in the test set.

Please keep that in mind when setting up your pipelines.
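
A minimal sketch of that rule, assuming a data-driven limit (the 95th percentile, chosen only for illustration) and two small hypothetical data frames standing in for the train and test sets:

```python
import pandas as pd

# Hypothetical train and test sets.
train = pd.DataFrame({"age": [22.0, 35.0, 41.0, 58.0, 63.0]})
test = pd.DataFrame({"age": [19.0, 47.0, 80.0]})

# Learn the limit from the train set only.
upper = train["age"].quantile(0.95)

# Apply the same learned limit to both sets.
train_capped = train["age"].clip(upper=upper)
test_capped = test["age"].clip(upper=upper)
```

The test set never contributes to the limit, so a value such as 80 in the test set is capped at whatever the train set dictates.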

# Applying capping, this time with arbitrary values that we select based on our domain knowledge of the subject matter

### Importing ArbitraryOutlierCapper from feature_engine.outliers

In [34]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from feature_engine.outliers import ArbitraryOutlierCapper

In [35]:
# from feature_engine.outliers import missing_data_imputers as msi
# from feature_engine.outliers import outlier_removers as outr


In [36]:
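# function to load the titanic dataset

def load_titanic():
    data = pd.read_csv('titanic.csv')
    # keep only the first character of the cabin (missing cabins become 'n', from 'nan')
    data['cabin'] = data['cabin'].astype(str).str[0]
    # treat passenger class as a categorical (object) variable
    data['pclass'] = data['pclass'].astype('O')
    # fill missing ports of embarkation; assign back rather than using a chained inplace call
    data['embarked'] = data['embarked'].fillna('C')
    return data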

In [37]:
data = load_titanic()
data.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C,S,,,"Montreal, PQ / Chesterville, ON"


In [38]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     1309 non-null   object 
 1   survived   1309 non-null   int64  
 2   name       1309 non-null   object 
 3   sex        1309 non-null   object 
 4   age        1046 non-null   float64
 5   sibsp      1309 non-null   int64  
 6   parch      1309 non-null   int64  
 7   ticket     1309 non-null   object 
 8   fare       1308 non-null   float64
 9   cabin      1309 non-null   object 
 10  embarked   1309 non-null   object 
 11  boat       486 non-null    object 
 12  body       121 non-null    float64
 13  home.dest  745 non-null    object 
dtypes: float64(3), int64(3), object(8)
memory usage: 143.3+ KB


In [39]:
data.describe()

Unnamed: 0,survived,age,sibsp,parch,fare,body
count,1309.0,1046.0,1309.0,1309.0,1308.0,121.0
mean,0.381971,29.881135,0.498854,0.385027,33.295479,160.809917
std,0.486055,14.4135,1.041658,0.86556,51.758668,97.696922
min,0.0,0.1667,0.0,0.0,0.0,1.0
25%,0.0,21.0,0.0,0.0,7.8958,72.0
50%,0.0,28.0,0.0,0.0,14.4542,155.0
75%,1.0,39.0,1.0,0.0,31.275,256.0
max,1.0,80.0,8.0,9.0,512.3292,328.0


## ArbitraryOutlierCapper

- The ArbitraryOutlierCapper caps the minimum and / or maximum values of a variable at values determined by the user.

In [40]:
# let's find out the maximum age and maximum fare in the Titanic dataset

data.age.max(), data.fare.max()

(80.0, 512.3292)

## Now we will arbitrarily cap age at 50 years and fare at 200

### Maximum Capping

In [41]:
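# missing values are filled with 0 here to keep the demo simple; in practice, impute them properly first
capper = ArbitraryOutlierCapper(max_capping_dict={'age': 50, 'fare': 200},
                                min_capping_dict=None)
capper.fit(data.fillna(0))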

In [42]:
capper.right_tail_caps_

{'age': 50, 'fare': 200}

In [43]:
capper.left_tail_caps_

{}

In [44]:
temp = capper.transform(data.fillna(0))

temp.age.max(), temp.fare.max()

(50.0, 200.0)

### Minimum capping

In [45]:
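# with these limits, transform raises every age below 10 (including the zeros
# filled in for missing ages) to 10, and every fare below 100 to 100
capper = ArbitraryOutlierCapper(max_capping_dict=None,
                                min_capping_dict={'age': 10, 'fare': 100})
capper.fit(data.fillna(0))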

In [46]:
capper.right_tail_caps_

{}

In [47]:
capper.left_tail_caps_

{'age': 10, 'fare': 100}

In [48]:
temp = capper.transform(data.fillna(0))

temp.age.min(), temp.fare.min()

(10.0, 100.0)

### Both ends capping

In [49]:
capper = ArbitraryOutlierCapper(max_capping_dict={'age': 50, 'fare': 200},
                                min_capping_dict={'age': 10, 'fare': 100})

capper.fit(data.fillna(0))

In [50]:
capper.right_tail_caps_

{'age': 50, 'fare': 200}

In [51]:
capper.left_tail_caps_

{'age': 10, 'fare': 100}

In [52]:
temp = capper.transform(data.fillna(0))

temp.age.min(), temp.fare.min()

(10.0, 100.0)

In [53]:
temp.age.max(), temp.fare.max()

(50.0, 200.0)
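
The both-ends result above can be cross-checked with plain pandas. The sketch below is not the feature_engine implementation: `cap` is a hypothetical helper and the rows are made-up values in the style of the Titanic data. It also shows a side effect of `fillna(0)`: the filled-in zeros are lifted to the minimum cap, whereas clipping the raw data leaves missing values missing.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the Titanic data, with missing values.
df = pd.DataFrame({
    "age": [0.9167, 29.0, np.nan, 80.0],
    "fare": [7.25, 151.55, 512.3292, np.nan],
})
max_caps = {"age": 50, "fare": 200}
min_caps = {"age": 10, "fare": 100}


def cap(frame, min_caps, max_caps):
    """Clip each listed column to its limits; NaN values are left as NaN."""
    out = frame.copy()
    for col in set(min_caps) | set(max_caps):
        out[col] = out[col].clip(lower=min_caps.get(col), upper=max_caps.get(col))
    return out


# Clipping the raw data: missing values stay missing.
capped = cap(df, min_caps, max_caps)

# Filling with 0 first: the zeros are then raised to the minimum cap.
capped_filled = cap(df.fillna(0), min_caps, max_caps)
```

If the missing ages should not end up at the lower limit, impute them with a dedicated technique before capping instead of filling with 0.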