# Feature Engine - Unit 06 - Handle Outliers

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Handle outliers using `Winsorizer`, `ArbitraryOutlierCapper` or `OutlierTrimmer`



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Package for Learning

We load our typical packages.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
from sklearn.pipeline import Pipeline

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Handle Outliers


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> These techniques aim to cap outliers based on a calculation or an arbitrary value. In addition, you may drop the outliers from the dataset. It is important to use the business context to manage outliers. For example:
* If your variable is Age and you see a value of 400, it likely indicates an error when collecting the data. You may cap the outlier at `Q3 + 1.5 * IQR`, replace it with an arbitrary number, or drop the row. The practical decision depends on your business context. Luckily, we can code and check the effect of multiple possibilities before deciding on the most suitable option to handle the outlier.
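
As a minimal sketch of these three options, the snippet below uses plain pandas on a small made-up `age` sample. The sample values and the replacement number 80 are assumptions for illustration only; they do not come from any dataset used in this unit.

```python
import pandas as pd

# Hypothetical ages; 400 stands in for a data-entry error
ages = pd.Series([22, 25, 29, 31, 35, 38, 41, 47, 52, 400], name="age")

# Upper boxplot limit: Q3 + 1.5 * IQR
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
upper_limit = q3 + 1.5 * iqr

# Option 1: cap the outlier at the calculated limit
capped = ages.clip(upper=upper_limit)

# Option 2: replace the outlier with an arbitrary number (80 is an assumption)
replaced = ages.where(ages <= upper_limit, 80)

# Option 3: drop the row that holds the outlier
dropped = ages[ages <= upper_limit]

print(f"Upper limit: {upper_limit}")
print(f"Max after capping: {capped.max()}, after replacing: {replaced.max()}")
print(f"Rows before dropping: {len(ages)}, after: {len(dropped)}")
```

Comparing the three results side by side is a cheap way to check the effect of each option before committing to one.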


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> Dropping rows with outliers should be the last option, since the data collection process requires energy, time and money from some team, either yours or another. Also, outliers may indicate that your data is changing its behaviour and that you have collected the first samples of this new behaviour.



We will study the following transformers
* Winsorizer
* ArbitraryOutlierCapper
* OutlierTrimmer

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Winsorizer

It caps the maximum and/or minimum values of a continuous variable at capping values that it calculates from the data using a specific method. The documentation is found [here](https://feature-engine.readthedocs.io/en/1.1.x/outliers/Winsorizer.html#)
* The arguments are:
  * `variables`: the variables with outliers you are interested in. If you don't pass anything, it will consider all numerical variables.
  * `tail`: whether to cap outliers on the right tail, the left tail or both tails.
  * `fold`: the number that multiplies the IQR to calculate the capping values. The documentation says the recommended values are 1.5 or 3 for the IQR proximity rule.
  * `capping_method`: we will consider `'iqr'`. With `fold=1.5`, the right tail is capped at the 75th quantile + 1.5 * IQR and the left tail at the 25th quantile - 1.5 * IQR.
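
To get a feel for `fold`, the following sketch computes IQR-based capping values in plain pandas for `fold=1.5` and `fold=3`. The helper `iqr_caps` and the sample fares are hypothetical. The code mirrors the IQR proximity rule described above and is not the library's own implementation.

```python
import pandas as pd


def iqr_caps(series, fold):
    """Return (lower_cap, upper_cap) using the IQR proximity rule (hypothetical helper)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - fold * iqr, q3 + fold * iqr


# Made-up fares with two large values in the right tail
fares = pd.Series([7, 8, 8, 10, 13, 15, 26, 31, 80, 512], name="fare")

caps = {fold: iqr_caps(fares, fold) for fold in (1.5, 3)}
n_capped_by_fold = {}
for fold, (lower, upper) in caps.items():
    # Count how many values fall outside the caps for this fold
    n_capped_by_fold[fold] = int(((fares < lower) | (fares > upper)).sum())
    print(f"fold={fold}: caps from {lower} to {upper}, "
          f"{n_capped_by_fold[fold]} value(s) would be capped")
```

A larger `fold` widens the range between the caps, so fewer values are treated as outliers.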



from feature_engine.outliers import Winsorizer

We will consider the Titanic dataset for this exercise. It holds passenger records from the Titanic's final journey. We will consider the variables `age` and `fare`.

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
df = pd.read_csv(url).filter(['age','fare'])
# df = sns.load_dataset('titanic').filter(['age', 'fare'])
print(df.shape)
df.head()

We will assess the distribution of the variables with a custom function that plots a combined histogram and boxplot. The function code was used in the Descriptive Statistics unit, so it should be familiar to you. On top of that, we added code to report the limits beyond which the boxplot considers a data point an outlier: we calculate the [IQR](https://en.wikipedia.org/wiki/Interquartile_range) and the lower (Q1 - 1.5 x IQR) and upper (Q3 + 1.5 x IQR) limits of the boxplot.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's comment on the plots in terms of outliers:
* `Age` has a few outliers in the right tail (the right side of the plot)
* `Fare` has multiple outliers in the right tail


def plot_histogram_and_boxplot(df):
  """Plot a boxplot above a histogram for each column and print the boxplot outlier limits."""
  for col in df.columns:
    # Two stacked axes: a short boxplot on top and a taller histogram below
    fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(6, 6), gridspec_kw={"height_ratios": (.15, .85)})
    sns.boxplot(data=df, x=col, ax=axes[0])
    sns.histplot(data=df, x=col, kde=True, ax=axes[1])
    fig.suptitle(f"{col} Distribution - Boxplot and Histogram")
    plt.show()

    # Boxplot limits: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR
    q1, q3 = df[col].quantile(q=0.25), df[col].quantile(q=0.75)
    IQR = q3 - q1
    print(
        f"This is the range where a datapoint is not an outlier: from "
        f"{q1 - 1.5*IQR:.2f} to "
        f"{q3 + 1.5*IQR:.2f}")
    print("\n")

plot_histogram_and_boxplot(df)
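
To complement the visual check, the number of data points outside the boxplot limits can also be counted per variable. The sketch below is one possible way to do it; the helper `count_iqr_outliers` and the small demo DataFrame are hypothetical and stand in for the Titanic data.

```python
import pandas as pd


def count_iqr_outliers(df, fold=1.5):
    """Count, per column, the values outside Q1 - fold*IQR and Q3 + fold*IQR (hypothetical helper)."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the limits with the DataFrame columns; NaN compares as False
    outside = (df < q1 - fold * iqr) | (df > q3 + fold * iqr)
    return outside.sum()


# Made-up values, for illustration only
df_demo = pd.DataFrame({
    "age": [22, 25, 29, 31, 35, 38, 41, 47, 52, 400],
    "fare": [7, 8, 8, 10, 13, 15, 26, 31, 80, 512],
})

outlier_counts = count_iqr_outliers(df_demo)
print(outlier_counts)
```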

We create a pipeline with two steps: `DropMissingData()`, since there should be no missing data, and then `Winsorizer()` on both variables, using `'iqr'` as the capping method and a fold of 1.5 on both tails.


from feature_engine.imputation import DropMissingData
pipeline = Pipeline([
      ( 'drop_na', DropMissingData() ),
      ( 'winsorizer_iqr', Winsorizer(capping_method='iqr', fold=1.5, tail='both', variables=['age', 'fare']) )
])

df_transformed = pipeline.fit_transform(df)
df_transformed.head()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Note the **capping points** are calculated from the data when the transformer is fitted with the iqr method. We inspect the cap values with the attributes `.right_tail_caps_` and `.left_tail_caps_`. We first check the right tail caps

pipeline['winsorizer_iqr'].right_tail_caps_

Then the left tail caps

pipeline['winsorizer_iqr'].left_tail_caps_

For each variable, we will check the histogram and boxplot before and after the transformation
  * Note the ranges have changed
  * The outliers on the right tail were capped at `Q3 + 1.5 * IQR`

print("========= Before Transformation ========= \n")
plot_histogram_and_boxplot(df)
print("\n\n ========= After Transformation =========")
plot_histogram_and_boxplot(df=df_transformed)

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Arbitrary Outlier Capper

It caps the maximum or minimum values of a variable at an arbitrary value indicated by the user. The documentation is found [here](https://feature-engine.readthedocs.io/en/1.1.x/outliers/ArbitraryOutlierCapper.html)
* The arguments are `max_capping_dict` and `min_capping_dict`, where you pass in a dictionary that maps each variable to the limit (max or min) at which you want to cap it
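
Conceptually, capping with a dictionary of limits can be sketched in plain pandas as below. This is an illustration of the idea under assumed, made-up data, not the transformer's implementation; only the dictionary name `max_capping_dict` is borrowed from the argument described above.

```python
import pandas as pd

# Made-up values, for illustration only
df_demo = pd.DataFrame({
    "age": [22, 25, 29, 31, 35, 38, 41, 47, 52, 400],
    "fare": [7, 8, 8, 10, 13, 15, 26, 31, 80, 512],
})

# Variable -> maximum allowed value
max_capping_dict = {"fare": 40, "age": 50}

capped = df_demo.copy()
for col, cap in max_capping_dict.items():
    # Values above the cap are replaced by the cap; other values are untouched
    capped[col] = capped[col].clip(upper=cap)

# Count how many rows now sit exactly at each cap
rows_at_cap = {col: int((capped[col] == cap).sum())
               for col, cap in max_capping_dict.items()}

print(capped.max())
print(rows_at_cap)
```

Every value above a cap ends up at exactly the cap value, which is why a capped variable tends to show a concentration of observations at that number.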

from feature_engine.outliers import ArbitraryOutlierCapper

We will consider the Titanic dataset for this exercise. It holds passenger records from the Titanic's final journey. We will consider the variables `age` and `fare`.

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
df = pd.read_csv(url).filter(['age','fare'])
# df = sns.load_dataset('titanic').filter(['age', 'fare'])
print(df.shape)
df.head()

We will assess the distribution of the variables with a custom function by plotting a combined histogram and a boxplot. 

plot_histogram_and_boxplot(df)

We create a pipeline with two steps: `DropMissingData()`, since there should be no missing data, and then `ArbitraryOutlierCapper()`, where we set 40 as the max cap for fare and 50 as the max cap for age. We use these numbers so you can clearly see the effect in the histograms. In the workplace, you should reflect on the number you select for the cap.

from feature_engine.imputation import DropMissingData
pipeline = Pipeline([
      ( 'drop_na', DropMissingData() ),
      ( 'arb', ArbitraryOutlierCapper(max_capping_dict={'fare':40 , 'age':50}) )
])

df_transformed = pipeline.fit_transform(df)
df_transformed.head()

Unlike the iqr method, the caps here are not calculated from the data; they are the values we chose.
* For each variable, we check the histogram and boxplot before and after the transformation, so you can see the behaviour we described
  * Note the ranges have changed
  * Note that after applying the transformation, all values above the caps became 40 for fare and 50 for age. We note a "peak" in the fare histogram around 40 and a "peak" in the age histogram around 50 due to the transformation

print("========= Before Transformation ========= \n")
plot_histogram_and_boxplot(df)
print("\n\n ========= After Transformation =========")
plot_histogram_and_boxplot(df=df_transformed)

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Outlier Trimmer

It removes observations with outliers from the data. The documentation is found [here](https://feature-engine.readthedocs.io/en/1.1.x/outliers/OutlierTrimmer.html). The arguments are the variables you want to apply the transformer to; if you don't pass `variables`, it will consider all numerical variables. There are also `capping_method`, `tail` and `fold`, which have the same meaning as in `Winsorizer()`. We will consider `capping_method='iqr'`, `tail='both'` and `fold=1.5`.

from feature_engine.outliers import OutlierTrimmer

We will consider the Titanic dataset for this exercise. It holds passenger records from the Titanic's final journey. We will consider the variables `age` and `fare`.

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
df = pd.read_csv(url).filter(['age','fare'])
# df = sns.load_dataset('titanic').filter(['age', 'fare'])
print(df.shape)
df.head()

We will assess the distribution of the variables with a custom function by plotting a combined histogram and a boxplot. In addition, we will calculate how many rows the dataset has when it includes outliers

print(f"* The dataset has {len(df)} rows, considering outliers.\n\n")
plot_histogram_and_boxplot(df)

We create a pipeline with 2 steps: `DropMissingData()` since there should be no missing data. Then `OutlierTrimmer()`, where capping_method='iqr', fold=1.5, tail='both', and variables=['age', 'fare']. We `.fit_transform()` the data

from feature_engine.imputation import DropMissingData
pipeline = Pipeline([
      ( 'drop_na', DropMissingData() ),
      ( 'out_trimmer', OutlierTrimmer(capping_method='iqr', fold=1.5, tail='both', variables=['age', 'fare']) )
])

df_transformed = pipeline.fit_transform(df)
df_transformed.head()

We notice the dataset length has decreased, since we removed the observations that were considered outliers in either variable

print(f"* The dataset has {len(df)} rows, considering outliers.")
print(f"* Once it is transformed with OutlierTrimmer, dataset has {len(df_transformed)} rows")

But that doesn't mean the new dataset will not have outliers. The boxplot limits are recalculated on the trimmed data, so under the new distribution a few data points may again fall outside the limits. The difference is that now you will have far fewer outliers for your model.
* Note the range has changed, as we may expect. The distribution shape is the same in the region where there were no outliers (as we may expect as well)
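
The following sketch illustrates why trimmed data can still show outliers. It trims a made-up series with the IQR rule in plain pandas and then recalculates the limits on the trimmed data. The helper `iqr_limits` and the values are hypothetical, chosen so that the effect is visible.

```python
import pandas as pd


def iqr_limits(series, fold=1.5):
    """Return (lower, upper) limits using the IQR proximity rule (hypothetical helper)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - fold * iqr, q3 + fold * iqr


# Made-up values, for illustration only
values = pd.Series([10, 11, 12, 12, 13, 13, 14, 14, 15, 20, 30, 40], name="fare")

# First pass: remove the observations outside the limits
lower, upper = iqr_limits(values)
trimmed = values[values.between(lower, upper)]

# Second pass: limits recalculated on the trimmed data are narrower
new_lower, new_upper = iqr_limits(trimmed)
remaining = trimmed[~trimmed.between(new_lower, new_upper)]

print(f"Rows before trimming: {len(values)}, after: {len(trimmed)}")
print(f"Original limits: {lower} to {upper}")
print(f"Recalculated limits: {new_lower} to {new_upper}")
print(f"Values flagged under the recalculated limits: {remaining.tolist()}")
```

Removing the extreme values shrinks the IQR, so a value that was inside the original limits can fall outside the recalculated ones.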


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> Please consider this route as a last resort, after carefully reflecting on why your original data had outliers in the first place.


print("========= Before Transformation ========= \n")
plot_histogram_and_boxplot(df)
print("\n\n ========= After Transformation =========")
plot_histogram_and_boxplot(df=df_transformed)
%matplotlib inline

