# Outlier removers

In this notebook, I will show you how to use the outlier removers in feature_engine: the Windsorizer and the ArbitraryOutlierCapper.

For the demo, I will use the Titanic dataset, available on [Kaggle](https://www.kaggle.com/c/titanic/data).

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from feature_engine import outlier_removers as outr

In [2]:
# function to load the Titanic dataset and keep only the first letter of the variable Cabin
# we will reload the dataset several times during the demo

def load_titanic():
    data = pd.read_csv('titanic.csv')
    # missing cabins become the string 'nan', so their first letter is 'n'
    data['Cabin'] = data['Cabin'].astype(str).str[0]
    data['Pclass'] = data['Pclass'].astype('O')
    # assign the result back instead of relying on inplace=True on a column selection
    data['Embarked'] = data['Embarked'].fillna('C')
    return data

In [3]:
data = load_titanic()
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,n,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,n,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,n,S


## Windsorizer

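The Windsorizer caps variables at maximum and/or minimum values that it learns from the data. The `distribution` parameter selects how the caps are found: a Gaussian approximation based on the mean and standard deviation, or, for skewed variables, the inter-quartile range proximity rule. The `tail` parameter selects whether to cap the right tail, the left tail or both, and `fold` controls how far from the centre of the distribution the caps sit.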
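
The caps printed later in this notebook are consistent with two common rules: mean plus or minus `fold` times the standard deviation for the Gaussian case, and the first quartile minus (or third quartile plus) `fold` times the inter-quartile range for the skewed case. The sketch below reproduces those rules with plain pandas on a made-up series. `compute_caps` is a hypothetical helper written for this illustration, not feature_engine's own code, and it assumes pandas' default sample standard deviation.

```python
import pandas as pd


def compute_caps(s: pd.Series, distribution: str = "gaussian", fold: float = 3):
    """Return (left_cap, right_cap) for a numeric series.

    Hypothetical helper for illustration; not feature_engine's implementation.
    """
    if distribution == "gaussian":
        # both caps are measured from the mean; NaN values are skipped by pandas
        low_ref = high_ref = s.mean()
        spread = s.std()  # sample standard deviation (ddof=1)
    elif distribution == "skewed":
        # caps are measured from the quartiles, using the inter-quartile range
        low_ref, high_ref = s.quantile(0.25), s.quantile(0.75)
        spread = high_ref - low_ref
    else:
        raise ValueError(f"unknown distribution: {distribution!r}")
    return low_ref - fold * spread, high_ref + fold * spread


toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
gauss_caps = compute_caps(toy, "gaussian", fold=1)
skew_caps = compute_caps(toy, "skewed", fold=1)
print(gauss_caps, skew_caps)
```

A larger `fold` pushes both caps further out, so fewer values are treated as outliers.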

In [4]:
# let's find out the maximum Age and maximum Fare in the Titanic dataset
data.Age.max(), data.Fare.max()

(80.0, 512.32920000000001)

### Gaussian distribution and right tail

In [5]:
windsoriser = outr.Windsorizer(distribution='gaussian', tail='right', fold=3, variables=['Age', 'Fare'])
windsoriser.fit(data)

Windsorizer(distribution='gaussian', fold=3, tail='right',
      variables=['Age', 'Fare'])

In [6]:
# here we can find the maximum caps allowed
windsoriser.right_tail_caps_

{'Age': 73.27860964406095, 'Fare': 181.2844937601173}

In [7]:
# this dictionary is empty, because we selected only right tail
windsoriser.left_tail_caps_

{}

In [8]:
data = windsoriser.transform(data)

# let's check the new maximum Age and maximum Fare in the Titanic dataset
data.Age.max(), data.Fare.max()

(73.27860964406095, 181.2844937601173)
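
Capping the right tail amounts to replacing every value above the cap with the cap itself. A rough pandas equivalent, on made-up data and not necessarily how feature_engine does it internally, is `Series.clip`. The cap value here is an illustrative assumption. Note that `clip` leaves missing values untouched.

```python
import numpy as np
import pandas as pd

# made-up ages, including one missing value
ages = pd.Series([22.0, 38.0, np.nan, 80.0, 35.0], name="Age")
right_cap = 73.0  # illustrative cap, e.g. a rounded mean + 3 * std

# values above the cap are replaced by the cap; everything else is kept
capped = ages.clip(upper=right_cap)
print(capped.tolist())
```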

### Gaussian distribution, both tails

In [9]:
data = load_titanic()
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,n,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,n,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,n,S


In [10]:
windsoriser = outr.Windsorizer(distribution='gaussian', tail='both', fold=1, variables='Fare')
windsoriser.fit(data)

Windsorizer(distribution='gaussian', fold=1, tail='both', variables=['Fare'])

In [11]:
windsoriser.left_tail_caps_

{'Fare': -17.489220628606304}

In [12]:
windsoriser.right_tail_caps_

{'Fare': 81.8976365657555}

In [13]:
# cap Fare at both tails and check the new maximum
temp = windsoriser.transform(data)
temp.Fare.max()

81.897636565755505
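
The cells above follow a fit/transform pattern: the caps are learned once in `fit`, stored in `left_tail_caps_` and `right_tail_caps_`, and then applied in `transform`. The toy class below sketches that pattern for the Gaussian rule only. `SimpleGaussianCapper` is a hypothetical name used for illustration and makes no claim about how feature_engine is implemented. Its point is that caps learned on one dataset can be applied unchanged to another.

```python
import pandas as pd


class SimpleGaussianCapper:
    """Toy capper: learns mean +/- fold * std in fit, clips in transform.

    Hypothetical class for illustration only.
    """

    def __init__(self, variables, fold=3):
        self.variables = list(variables)
        self.fold = fold

    def fit(self, X: pd.DataFrame):
        mean = X[self.variables].mean()
        std = X[self.variables].std()
        # store the learned caps as {variable: value} dictionaries
        self.left_tail_caps_ = (mean - self.fold * std).to_dict()
        self.right_tail_caps_ = (mean + self.fold * std).to_dict()
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        X = X.copy()  # leave the caller's frame untouched
        for var in self.variables:
            X[var] = X[var].clip(self.left_tail_caps_[var], self.right_tail_caps_[var])
        return X


train = pd.DataFrame({"Fare": [0.0, 10.0, 20.0, 30.0, 40.0]})
new = pd.DataFrame({"Fare": [-50.0, 25.0, 500.0]})

toy_capper = SimpleGaussianCapper(variables=["Fare"], fold=1).fit(train)
capped_new = toy_capper.transform(new)
print(toy_capper.left_tail_caps_, toy_capper.right_tail_caps_)
print(capped_new)
```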

### Skewed distribution, left tail

In [14]:
data = load_titanic()
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,n,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,n,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,n,S


In [15]:
windsoriser = outr.Windsorizer(distribution='skewed', tail='left', fold=1, variables=['Age', 'Fare'])
windsoriser.fit(data)

Windsorizer(distribution='skewed', fold=1, tail='left',
      variables=['Age', 'Fare'])

In [16]:
# right tail dictionary is empty, because we selected only left tail
windsoriser.right_tail_caps_

{}

In [17]:
windsoriser.left_tail_caps_

{'Age': 2.25, 'Fare': -15.179200000000002}

In [18]:
temp = windsoriser.transform(data)
temp.Age.min(), temp.Fare.min()

(2.25, 0.0)
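
The minimum Fare is unchanged because its left cap (about -15.18) lies below every observed fare, so there is nothing to cap. Before transforming, it can help to count how many values a given cap would actually touch. The sketch below does this on made-up fares. `n_capped` is a hypothetical helper, not a feature_engine function.

```python
import pandas as pd


def n_capped(s: pd.Series, left_cap=None, right_cap=None) -> int:
    """Count values that fall outside the given caps (NaN never counts)."""
    below = (s < left_cap).sum() if left_cap is not None else 0
    above = (s > right_cap).sum() if right_cap is not None else 0
    return int(below + above)


fares = pd.Series([0.0, 7.25, 8.05, 53.1, 71.28])

# a negative left cap cannot affect non-negative data
untouched = n_capped(fares, left_cap=-15.18)
# a cap inside the observed range does change some values
touched = n_capped(fares, left_cap=7.5)
print(untouched, touched)
```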

## ArbitraryOutlierCapper

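The ArbitraryOutlierCapper caps variables at maximum and/or minimum values chosen by the user. The caps are passed as dictionaries that map each variable to its cap, through the parameters `max_capping_dict` and `min_capping_dict`.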
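
As a rough mental model, dictionary-based capping can be sketched with plain pandas: loop over each dictionary and clip the matching column. `cap_with_dicts` below is a hypothetical helper on made-up data; it borrows the parameter names used in this notebook but is not the library's implementation.

```python
import pandas as pd


def cap_with_dicts(df, max_capping_dict=None, min_capping_dict=None):
    """Clip each listed column at user-chosen values (hypothetical helper)."""
    out = df.copy()  # do not modify the input frame
    for var, cap in (max_capping_dict or {}).items():
        out[var] = out[var].clip(upper=cap)
    for var, cap in (min_capping_dict or {}).items():
        out[var] = out[var].clip(lower=cap)
    return out


toy = pd.DataFrame({"Age": [2.0, 30.0, 80.0], "Fare": [5.0, 150.0, 512.0]})
capped = cap_with_dicts(
    toy,
    max_capping_dict={"Age": 50, "Fare": 200},
    min_capping_dict={"Age": 10, "Fare": 100},
)
print(capped)
```

Passing `None` for either dictionary simply skips capping on that side, which mirrors the maximum-only and minimum-only examples in this section.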

In [19]:
data = load_titanic()
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,n,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,n,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,n,S


In [20]:
# let's find out the maximum Age and maximum Fare in the Titanic dataset
data.Age.max(), data.Fare.max()

(80.0, 512.32920000000001)

### Maximum capping

In [21]:
capper = outr.ArbitraryOutlierCapper(max_capping_dict={'Age': 50, 'Fare': 200}, min_capping_dict=None)
capper.fit(data)

ArbitraryOutlierCapper(max_capping_dict={'Fare': 200, 'Age': 50},
            min_capping_dict=None)

In [22]:
capper.right_tail_caps_

{'Age': 50, 'Fare': 200}

In [23]:
capper.left_tail_caps_

{}

In [24]:
temp = capper.transform(data)
temp.Age.max(), temp.Fare.max()

(50.0, 200.0)

### Minimum capping

In [25]:
capper = outr.ArbitraryOutlierCapper(max_capping_dict=None, min_capping_dict={'Age': 10, 'Fare': 100})
capper.fit(data)

ArbitraryOutlierCapper(max_capping_dict=None,
            min_capping_dict={'Fare': 100, 'Age': 10})

In [26]:
capper.variables

['Fare', 'Age']

In [27]:
capper.right_tail_caps_

{}

In [28]:
temp = capper.transform(data)
temp.Age.min(), temp.Fare.min()

(10.0, 100.0)

### Both ends capping

In [29]:
capper = outr.ArbitraryOutlierCapper(
    max_capping_dict={'Age': 50, 'Fare': 200},
    min_capping_dict={'Age': 10, 'Fare': 100},
)
capper.fit(data)

ArbitraryOutlierCapper(max_capping_dict={'Fare': 200, 'Age': 50},
            min_capping_dict={'Fare': 100, 'Age': 10})

In [30]:
capper.right_tail_caps_

{'Age': 50, 'Fare': 200}

In [31]:
capper.left_tail_caps_

{'Age': 10, 'Fare': 100}

In [32]:
temp = capper.transform(data)
temp.Age.min(), temp.Fare.min()

(10.0, 100.0)

In [33]:
temp.Age.max(), temp.Fare.max()

(50.0, 200.0)