
## Outlier Engineering


An outlier is a data point that differs significantly from the rest of the data. “An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.” [D. Hawkins. Identification of Outliers, Chapman and Hall, 1980].

Statistics such as the mean and variance are very sensitive to outliers. In addition, **some machine learning models are sensitive to outliers**, which may decrease their performance. Thus, depending on the algorithm we wish to train, we often remove outliers from our variables.

We discussed in section 3 of this course how to identify outliers. In this section, we discuss how to process them before training our machine learning models.


## How can we pre-process outliers?

- Trimming: remove the outliers from our dataset
- Treat outliers as missing data, and proceed with any missing data imputation technique
- Discretisation: outliers are placed in border bins together with higher or lower values of the distribution
- Censoring: capping the variable distribution at a max and / or minimum value
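
The four options above can be sketched on a toy variable with plain pandas. This is only an illustration: the values, the boundary of 100 and the bin edges are made up, and median imputation is just one of many possible imputation techniques.

```python
import numpy as np
import pandas as pd

# Toy variable with one obvious outlier (500); values are made up for illustration.
s = pd.Series([7.0, 8.0, 10.0, 12.0, 15.0, 500.0], name="fare")
upper = 100.0  # hypothetical, arbitrarily chosen boundary

# Trimming: drop the observations that exceed the boundary.
trimmed = s[s <= upper]

# Missing data: flag outliers as NaN, then impute (here with the median).
as_missing = s.where(s <= upper)
imputed = as_missing.fillna(as_missing.median())

# Discretisation: the outlier falls into the open-ended top bin.
binned = pd.cut(s, bins=[-np.inf, 10, 20, np.inf], labels=["low", "mid", "high"])

# Censoring: replace values above the boundary with the boundary itself.
capped = s.clip(upper=upper)
```

Note that only trimming changes the number of observations; the other three keep every row and change the value instead.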

**Censoring** is also known as:

- top and bottom coding
- winsorisation
- capping


## Censoring or Capping

**Censoring**, or **capping**, means capping the maximum and/or minimum of a distribution at an arbitrary value. In other words, values bigger or smaller than the arbitrarily determined ones are **censored**.

Capping can be done at both tails, or just one of the tails, depending on the variable and the user.

See my talk at [PyData](https://www.youtube.com/watch?v=KHGGlozsRtA) for an example of capping used in a finance company.

The numbers at which to cap the distribution can be determined:

- arbitrarily
- using the inter-quartile range proximity rule
- using the Gaussian approximation
- using quantiles
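
As a rough sketch, the four approaches could be computed with pandas as follows. The data are made up, and the specific constants (1.5 times the IQR, 3 standard deviations, the 5th and 95th percentiles) are common conventions assumed here for illustration, not values prescribed by this section.

```python
import pandas as pd

# Made-up values for illustration only.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])

# Arbitrary: limits picked by the user, for example from domain knowledge.
arbitrary_caps = (0.0, 50.0)

# Inter-quartile range proximity rule: quartiles minus / plus 1.5 * IQR (assumed factor).
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_caps = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Gaussian approximation: mean minus / plus 3 standard deviations (assumed factor).
gaussian_caps = (s.mean() - 3 * s.std(), s.mean() + 3 * s.std())

# Quantiles: here the 5th and 95th percentiles (assumed choice).
quantile_caps = (s.quantile(0.05), s.quantile(0.95))

# Whichever pair is chosen, capping itself is a single clip call.
capped = s.clip(lower=iqr_caps[0], upper=iqr_caps[1])
```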


### Advantages

- does not remove data

### Limitations

- distorts the distributions of the variables
- distorts the relationships among variables
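
Both limitations can be seen on a small made-up example. In the sketch below, `y` is exactly twice `x`, so the two columns are perfectly correlated before capping; the cap of 10 is an arbitrary choice for illustration.

```python
import pandas as pd

# Made-up data: y = 2 * x, so x and y are perfectly correlated.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 50.0],
    "y": [2.0, 4.0, 6.0, 8.0, 100.0],
})

# Cap x at an arbitrary value of 10; y is left untouched.
capped = df.assign(x=df["x"].clip(upper=10.0))

# The spread of x shrinks once the extreme value is capped...
std_before, std_after = df["x"].std(), capped["x"].std()

# ...and the linear relationship with y is no longer exact.
corr_before = df["x"].corr(df["y"])
corr_after = capped["x"].corr(capped["y"])

print(std_before, std_after)
print(corr_before, corr_after)
```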



## Important

When capping, we tend to cap values in both the train and the test set. It is important to remember that the capping values MUST be derived from the train set, and that those same values are then used to cap the variables in the test set.
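
A minimal sketch of this rule in plain pandas is shown below. The train and test values are made up, and using the 5th and 95th percentiles as limits is an assumption for illustration; the point is only that the limits are computed once, from the train set.

```python
import pandas as pd

# Made-up train and test values for illustration only.
train = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0, 80.0]})
test = pd.DataFrame({"Age": [5.0, 41.0, 95.0]})

# Learn the capping values from the train set only
# (5th and 95th percentiles are an assumed choice).
lower = train["Age"].quantile(0.05)
upper = train["Age"].quantile(0.95)

# Apply the same values to both sets; nothing is re-estimated on the test set.
train_capped = train.assign(Age=train["Age"].clip(lower=lower, upper=upper))
test_capped = test.assign(Age=test["Age"].clip(lower=lower, upper=upper))
```

With a transformer that follows the fit / transform pattern, the same rule translates to calling `fit` on the train set and `transform` on both sets.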


In [1]:
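# the module paths used below (missing_data_imputers, outlier_removers)
# match feature_engine 0.3.1, the version installed in this notebook
!pip install feature_engine==0.3.1
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from feature_engine import missing_data_imputers as msi
from feature_engine import outlier_removers as outr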

Successfully installed feature-engine-0.3.1 numpydoc-0.9.2


In [0]:
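# function to load the titanic dataset

def load_titanic():
    data = pd.read_csv('titanic_train.csv')
    # keep only the first character of Cabin (missing values become 'n', from 'nan')
    data['Cabin'] = data['Cabin'].astype(str).str[0]
    # treat passenger class as a categorical variable
    data['Pclass'] = data['Pclass'].astype('O')
    # fill missing ports of embarkation; assign back instead of chained inplace=True
    data['Embarked'] = data['Embarked'].fillna('C')
    return data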

In [3]:
data = load_titanic()
data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | n | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | n | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | n | S |


## ArbitraryOutlierCapper

The ArbitraryOutlierCapper caps the minimum and/or maximum values of each variable at values determined by the user.

In [5]:
# let's find out the maximum Age and maximum Fare in the Titanic dataset

data.Age.max(), data.Fare.max()

(80.0, 512.3292)

In [9]:
capper = outr.ArbitraryOutlierCapper(max_capping_dict = {'Age':50, 'Fare':200},
                                     min_capping_dict = None)
capper.fit(data)



ArbitraryOutlierCapper(max_capping_dict=None, min_capping_dict=None)

In [10]:
capper.right_tail_caps_

{'Age': 50, 'Fare': 200}

In [11]:
capper.left_tail_caps_

{}

In [13]:
temp = capper.transform(data)

temp.Age.max(), temp.Fare.max()

(50.0, 200.0)

### Minimum capping

In [15]:
capper = outr.ArbitraryOutlierCapper(max_capping_dict=None,
                                     min_capping_dict={
                                         'Age': 10,
                                         'Fare': 100
                                     })
capper.fit(data)



ArbitraryOutlierCapper(max_capping_dict=None, min_capping_dict=None)

In [16]:
capper.variables

['Age', 'Fare']

In [17]:
capper.right_tail_caps_

{}

In [18]:
capper.left_tail_caps_

{'Age': 10, 'Fare': 100}

In [19]:
temp = capper.transform(data)

temp.Age.min(), temp.Fare.min()

(10.0, 100.0)

### Capping at both ends

In [20]:
capper = outr.ArbitraryOutlierCapper(max_capping_dict={
                                     'Age': 50, 'Fare': 200},
                                     min_capping_dict={
                                     'Age': 10, 'Fare': 100})
capper.fit(data)



ArbitraryOutlierCapper(max_capping_dict=None, min_capping_dict=None)

In [21]:
capper.right_tail_caps_

{'Age': 50, 'Fare': 200}

In [22]:
capper.left_tail_caps_

{'Age': 10, 'Fare': 100}

In [23]:
temp = capper.transform(data)

temp.Age.min(), temp.Fare.min()

(10.0, 100.0)

In [24]:
temp.Age.max(), temp.Fare.max()

(50.0, 200.0)
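
For comparison, a similar result can be obtained with plain pandas. The sketch below is not the library's implementation: `cap_values` is a hypothetical helper written for this example, and the three rows are made up, reusing a few Age and Fare values seen earlier in the notebook.

```python
import pandas as pd

# Small made-up frame; no CSV file is needed to run this sketch.
df = pd.DataFrame({
    "Age": [22.0, 38.0, 80.0],
    "Fare": [7.25, 71.2833, 512.3292],
})

def cap_values(frame, max_caps=None, min_caps=None):
    """Hypothetical helper: cap columns at user-supplied limits."""
    out = frame.copy()  # leave the input frame untouched
    for col, value in (max_caps or {}).items():
        out[col] = out[col].clip(upper=value)  # right-tail cap
    for col, value in (min_caps or {}).items():
        out[col] = out[col].clip(lower=value)  # left-tail cap
    return out

capped = cap_values(
    df,
    max_caps={"Age": 50, "Fare": 200},
    min_caps={"Age": 10, "Fare": 100},
)
```

Passing only `max_caps` or only `min_caps` mirrors the one-tailed cases shown earlier in this section.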