# Feature Engine - Unit 04 - Handle Numerical Variable Transformation

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Handle Numerical Variable Transformation, using Log Transformer, Reciprocal Transformer, Power Transformer, Box Cox and Yeo Johnson Transformer



## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Package for Learning

We load our typical packages.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
from sklearn.pipeline import Pipeline

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Handle Numerical Variable Transformation

The techniques presented here apply mathematical functions to numerical variables. The idea is to change the shape of the variable's distribution, ideally bringing it closer to a [normal distribution](https://en.wikipedia.org/wiki/Normal_distribution). We will study the following transformers:
* LogTransformer
* ReciprocalTransformer
* PowerTransformer
* BoxCoxTransformer
* YeoJohnsonTransformer

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> We will do exercises with all of the transformers. You don't have to memorize the specific mathematical function for each transformer. Instead, you should be aware that we apply mathematical functions to numerical data. Later on, we will show a custom function that displays a report on numerical transformations, giving you criteria to select the most suitable transformer for your data.

* We will use the pingouin package to run a Q-Q plot to visually check how close to the normal distribution a given variable is.


import pingouin as pg
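
To make the idea behind a Q-Q plot concrete, here is a minimal sketch that computes the points by hand instead of plotting them. It pairs the ordered sample with the quantiles a normal distribution would produce. The helper name `qq_points`, the plotting positions `(i - 0.5) / n` and the toy samples are assumptions made for illustration, and pingouin may compute its plot differently.

```python
from statistics import NormalDist

import numpy as np


def qq_points(sample):
    """Return (theoretical normal quantiles, ordered standardized sample)."""
    values = np.sort(np.asarray(sample, dtype=float))
    n = len(values)
    # Plotting positions (i - 0.5) / n keep probabilities strictly inside (0, 1)
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    standardized = (values - values.mean()) / values.std(ddof=1)
    return theoretical, standardized


# Hypothetical samples: one drawn from a normal distribution, one right-skewed
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=10, scale=2, size=500)
skewed_sample = rng.exponential(scale=2, size=500)

for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    theoretical, observed = qq_points(sample)
    # A correlation close to 1 means the points sit close to a straight line
    r = np.corrcoef(theoretical, observed)[0, 1]
    print(f"{name}: correlation between theoretical and observed quantiles = {r:.3f}")
```

When the sample is close to normal, the points fall near the diagonal line. A skewed sample bends away from it, which is what we look for in the plots in this unit.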

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Log Transformer

It applies the [natural logarithm](https://en.wikipedia.org/wiki/Natural_logarithm) (base e) or the base 10 logarithm to numerical variables. The function documentation is [here](https://feature-engine.readthedocs.io/en/1.1.x/transformation/LogTransformer.html).
* As we may expect, the transformer can't handle zero or negative values.
* The arguments are `variables`, the list of variables you want to transform, and `base` (either 'e' or '10'). If you don't pass a `variables` list, the transformer considers all numerical variables.
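
As a rough illustration of the underlying math, the sketch below applies both logarithm bases with NumPy to a small, made-up DataFrame. It does not use the transformer itself. The toy values and the names `toy`, `with_zero` and `can_apply_log` are assumptions for this example only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the Boston dataset
toy = pd.DataFrame({"x": [1.0, 10.0, 100.0, 1000.0]})

toy["log_e"] = np.log(toy["x"])      # natural logarithm (base e)
toy["log_10"] = np.log10(toy["x"])   # base 10 logarithm
print(toy)

# Zero or negative values have no real logarithm, so we check before transforming
with_zero = pd.Series([0.0, 1.0, 2.0])
can_apply_log = bool((with_zero > 0).all())
print(f"Log applicable to with_zero: {can_apply_log}")
```

Notice how the logarithm compresses large values much more than small ones. That is why it tends to reduce a long right tail.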

from feature_engine import transformation as vt

We will consider the Boston dataset from the [scikit-learn datasets](https://scikit-learn.org/stable/datasets/toy_dataset.html). It shows house prices in Boston.
* We use 4 lines of code to unpack the dataset into a DataFrame with all the features and the target variable. In the next lesson, we will investigate how to use sklearn functionalities. For now, we just need its dataset.
* For this exercise, we are not interested in making sense of the variables' meaning and business impact. We're looking for numerical variables to transform, so we will consider only a subset of the variables.

from sklearn import datasets
# Note: load_boston() was removed in scikit-learn 1.2, so this cell needs an earlier version
boston_data = datasets.load_boston()
df = pd.DataFrame(boston_data.data, columns=boston_data.feature_names)
df['target'] = pd.Series(boston_data.target)

df = df.filter(['DIS','LSTAT', 'target'])
df.head()

We assess the histogram and QQ plot by looping over the variables. We create custom functions for this task since we will repeat it across different transformers. The first calculates skewness and kurtosis. The second plots a histogram and QQ plot for each numerical variable in the DataFrame.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> A quick recap
* Skewness measures the asymmetry of the data. A distribution is symmetric when it looks the same to the left and right of the center point, as if mirrored around it. Positive skewness happens when the tail on the right side is longer. Negative skewness is the opposite.
* Kurtosis relates to the tails of the distribution and is a measure of outliers. The pandas `.kurtosis()` method used below reports excess kurtosis, so a normal distribution scores around 0. Negative kurtosis indicates the distribution has thin tails. Positive kurtosis indicates that the distribution is peaked and has thick tails.
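
The recap above can be checked on a few tiny samples. This is only a sketch. The three series are made-up values chosen to show the sign of each statistic, and they are not part of the lesson's dataset.

```python
import pandas as pd

# Hypothetical toy samples chosen only to show the sign of each statistic
symmetric = pd.Series([1, 2, 3, 4, 5])
right_tail = pd.Series([1, 1, 2, 2, 3, 20])        # one large value stretches the right tail
left_tail = pd.Series([-20, -3, -2, -2, -1, -1])   # mirror image of right_tail

samples = [("symmetric", symmetric), ("right_tail", right_tail), ("left_tail", left_tail)]
for name, s in samples:
    print(f"{name:>10} | skewness: {s.skew():.2f} | kurtosis: {s.kurtosis():.2f}")
```

The symmetric sample has skewness of about zero and negative kurtosis. The two samples with a single extreme value show skewness of opposite signs and positive kurtosis.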




def calculate_skew_kurtosis(df,col, moment):
  print(f"{moment}  | skewness: {df[col].skew().round(2)} | kurtosis: {df[col].kurtosis().round(2)}")


def distribution_before_applying_transformer(df):
  for col in df.columns:
    print(f"*** {col} ***")
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12,4))
    sns.histplot(data=df, x=col, kde=True, ax=axes[0])
    axes[0].set_title("Histogram")
    pg.qqplot(df[col], dist='norm',ax=axes[1])
    plt.tight_layout()
    plt.show()
    calculate_skew_kurtosis(df, col, 'before applying transformation')
    print("\n")

distribution_before_applying_transformer(df)

We set the pipeline with this transformer: `vt.LogTransformer()`. Then we `.fit_transform()` the pipeline, assigning the result to `df_transformed`

pipeline = Pipeline([
      ( 'log', vt.LogTransformer() )
  ])

df_transformed = pipeline.fit_transform(df)
df_transformed.head()

We now compare the distribution of the variables before and after applying the transformer. We create a custom function for that. It plots the histogram and QQplot for the same variable before and after applying the transformer.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Note: When transforming variables, the summary statistics may change. Here we consider only skewness and kurtosis. What matters is that the transformed variable ends up closer to a normal distribution than the original.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's comment on the plots
* `DIS` decreased skewness, but its kurtosis increased and changed from positive to negative. The QQ plot is closer to the diagonal line after transformation, but it is still "bent"
* `LSTAT` decreased skewness and changed from positive to negative. Its kurtosis decreased and changed from positive to negative. The QQ plot is closer to the diagonal line after transformation
* `target` decreased skewness and kurtosis. Skewness changed from positive to negative, and the QQ plot is still "bent"

In general, this transformation helped bring these variables closer to a normal distribution, as we can see by comparing the distribution shape and QQ plot before and after applying the transformer.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> However, we have more mathematical functions at our disposal to test.
* That leads to another question: which mathematical function should I apply to my variable? We prepared a custom function that applies all the available transformations to a given variable, so you get a report to help decide which transformer to use.
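
To give a feel for what such a report could look like, here is a hypothetical sketch. It is not the custom function mentioned above. The name `skewness_report`, the set of candidate transformations and the simulated sample are all assumptions, and it uses plain NumPy functions instead of the feature-engine transformers.

```python
import numpy as np
import pandas as pd


def skewness_report(series):
    """Apply a few candidate transformations and report skewness and kurtosis."""
    candidates = {
        "original": lambda s: s,
        "log_e": np.log,
        "reciprocal": lambda s: 1 / s,
        "power_0.5": lambda s: s ** 0.5,
    }
    rows = []
    for name, func in candidates.items():
        transformed = func(series)
        rows.append({
            "transformation": name,
            "skewness": round(transformed.skew(), 2),
            "kurtosis": round(transformed.kurtosis(), 2),
        })
    report = pd.DataFrame(rows)
    # Smaller absolute skewness suggests a more symmetric distribution
    return report.sort_values("skewness", key=abs).reset_index(drop=True)


# Hypothetical right-skewed, strictly positive sample
rng = np.random.default_rng(1)
sample = pd.Series(rng.lognormal(mean=0.0, sigma=0.8, size=1000), name="x")
print(skewness_report(sample))
```

Sorting by absolute skewness is only one possible criterion. In practice you would also look at the histogram and QQ plot before choosing a transformer.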


def compare_distributions_before_and_after_applying_transformer(df, df_transformed, method):

  for col in df.columns:
    print(f"*** {col} ***")
    fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(9,6))

    sns.histplot(data=df, x=col, kde=True, ax=axes[0,0])
    axes[0,0].set_title(f'Before {method}')
    pg.qqplot(df[col], dist='norm',ax=axes[0,1])
    
    sns.histplot(data=df_transformed, x=col, kde=True, ax=axes[1,0])
    axes[1,0].set_title(f'After {method}')
    pg.qqplot(df_transformed[col], dist='norm',ax=axes[1,1])
    
    plt.tight_layout()
    plt.show()

    calculate_skew_kurtosis(df,col, moment='before transformation')
    calculate_skew_kurtosis(df_transformed,col, moment='after transformation')
    print("\n")


compare_distributions_before_and_after_applying_transformer(df, df_transformed, method='Log Transformer')

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Reciprocal Transformer

This technique applies the reciprocal transformation 1 / x to numerical variables. As we may expect, it can't handle a variable that contains zero. The function documentation is found [here](https://feature-engine.readthedocs.io/en/1.1.x/transformation/ReciprocalTransformer.html).
* The argument is `variables`. If you don't pass a `variables` list, the transformer considers all numerical variables.
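
The math behind this transformer can be sketched in a couple of lines of pandas. The toy values and variable names below are assumptions for illustration. The snippet does not call the transformer itself.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"x": [0.5, 1.0, 2.0, 4.0]})
toy["reciprocal"] = 1 / toy["x"]
print(toy)

# A zero in the variable makes 1 / x undefined, so we check first
with_zero = pd.Series([0.0, 1.0, 2.0])
can_apply_reciprocal = bool((with_zero != 0).all())
print(f"Reciprocal applicable to with_zero: {can_apply_reciprocal}")
```

Note that the reciprocal reverses the order of positive values: the largest `x` becomes the smallest transformed value.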

from feature_engine import transformation as vt

We consider the Boston dataset with the same variables from the previous exercise

from sklearn import datasets
boston_data = datasets.load_boston()
df = pd.DataFrame(boston_data.data,columns=boston_data.feature_names)
df['target'] = pd.Series(boston_data.target)

df = df.filter(['DIS','LSTAT', 'target'])
df.head()

We assess the distribution using the previous custom function

distribution_before_applying_transformer(df)

We set the pipeline with this transformer: `vt.ReciprocalTransformer()`. Then we `.fit_transform()` the pipeline, assigning the result to `df_transformed`

pipeline = Pipeline([
      ( 'reciprocal', vt.ReciprocalTransformer() )
  ])

df_transformed = pipeline.fit_transform(df)
df_transformed.head()

We compare the histograms and QQ plots before and after applying the transformer

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's comment on the plots
* `DIS` decreased skewness, and kurtosis changed from positive to negative. The QQ plots look similar, so the transformation doesn't appear to have made any progress
* `LSTAT` increased both skewness and kurtosis. The transformation doesn't appear to have made any progress
* `target` increased both skewness and kurtosis. The transformation doesn't appear to have made any progress


compare_distributions_before_and_after_applying_transformer(df, df_transformed, method='ReciprocalTransformer')

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Power Transformer

It applies power or exponential transformations to numerical variables. The documentation is found [here](https://feature-engine.readthedocs.io/en/1.1.x/transformation/PowerTransformer.html).
* The arguments are `variables`, the list of variables you want to transform, and `exp`, the exponent (power) to apply, which defaults to 0.5. If you don't pass a `variables` list, the transformer considers all numerical variables.
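
With the default exponent of 0.5, the transformation amounts to a square root. The sketch below shows that with plain pandas on made-up values. The names `toy` and `exp` are used here only for illustration, and the transformer itself is not called.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"x": [1.0, 4.0, 9.0, 100.0]})

exp = 0.5  # same value as the transformer's default exponent
toy["power"] = toy["x"] ** exp
print(toy)
```

As with the logarithm, large values are pulled in more than small ones, although less aggressively.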

from feature_engine import transformation as vt

We consider the Boston dataset with the same variables from the previous exercise

from sklearn import datasets
boston_data = datasets.load_boston()
df = pd.DataFrame(boston_data.data,columns=boston_data.feature_names)
df['target'] = pd.Series(boston_data.target)

df = df.filter(['DIS','LSTAT', 'target'])
df.head()

We assess the distribution using the previous custom function

distribution_before_applying_transformer(df)

We set the pipeline with this transformer: `vt.PowerTransformer()`. Then we `.fit_transform()` the pipeline, assigning the result to `df_transformed`

pipeline = Pipeline([
      ('pt', vt.PowerTransformer() )
  ])

df_transformed = pipeline.fit_transform(df)
df_transformed.head()

We compare the histograms and QQ plots before and after applying the transformer

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's comment on the plots
* `DIS` decreased skewness, and kurtosis changed from positive to negative. The QQ plots look similar, so the transformation doesn't appear to have made any progress
* `LSTAT` decreased skewness, and kurtosis changed from positive to negative. The QQ plot suggests an improvement
* `target` decreased skewness and kurtosis. Comparing the QQ plots before and after the transformation, it appears to have made minor progress, since the blue dots are close to the diagonal line


compare_distributions_before_and_after_applying_transformer(df, df_transformed, method='PowerTransformer')

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Box Cox Transformer

It applies the Box-Cox transformation to numerical variables. A mathematical formulation can be found [here](https://www.statisticshowto.com/box-cox-transformation/). The data must be positive for this transformer. The documentation for the function is found [here](https://feature-engine.readthedocs.io/en/1.1.x/transformation/BoxCoxTransformer.html).
* The argument is `variables`. If you don't pass a `variables` list, the transformer considers all numerical variables.
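
For reference, the Box-Cox formula itself is short: `(x**lambda - 1) / lambda` when lambda is not zero, and `log(x)` when lambda is zero. The sketch below implements it with NumPy for a lambda fixed by hand. This is an assumption made for illustration, since in practice lambda is estimated from the data. The function name `box_cox` and the toy values are also hypothetical.

```python
import numpy as np


def box_cox(x, lmbda):
    """Box-Cox transformation for a fixed, hand-picked lambda."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lmbda == 0:
        return np.log(x)
    return (x ** lmbda - 1) / lmbda


values = [1.0, 4.0, 9.0]
print(box_cox(values, lmbda=0.5))  # equivalent to 2 * (sqrt(x) - 1)
print(box_cox(values, lmbda=0))    # equivalent to the natural logarithm
```

Different lambda values reproduce the log and power transformations up to a shift and scale, which is why Box-Cox can be seen as a family that includes them.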

from feature_engine import transformation as vt

We consider the Boston dataset with the same variables from the previous exercise

from sklearn import datasets
boston_data = datasets.load_boston()
df = pd.DataFrame(boston_data.data,columns=boston_data.feature_names)
df['target'] = pd.Series(boston_data.target)

df = df.filter(['DIS','LSTAT', 'target'])
df.head()

We assess the distribution using the previous custom function

distribution_before_applying_transformer(df)

We set the pipeline with this transformer: `vt.BoxCoxTransformer()`. Then we `.fit_transform()` the pipeline, assigning the result to `df_transformed`

pipeline = Pipeline([
      ('bct', vt.BoxCoxTransformer() )
  ])

df_transformed = pipeline.fit_transform(df)
df_transformed.head()

We compare the histograms and QQ plots before and after applying the transformer

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's comment on the plots
* `DIS` decreased skewness, and kurtosis changed from positive to negative. The QQ plots look similar, so the transformation doesn't appear to have made any progress
* `LSTAT` decreased skewness, and kurtosis changed from positive to negative. The QQ plot suggests an improvement
* `target` decreased skewness and kurtosis. Comparing the QQ plots before and after the transformation, it appears to have made minor progress, since the blue dots are close to the diagonal line


compare_distributions_before_and_after_applying_transformer(df, df_transformed, method='BoxCoxTransformer')

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Yeo Johnson Transformer

It applies the Yeo-Johnson transformation. More information on the mathematical formulation can be found [here](https://statisticaloddsandends.wordpress.com/2021/02/19/the-box-cox-and-yeo-johnson-transformations-for-continuous-variables/). The documentation for the function is found [here](https://feature-engine.readthedocs.io/en/1.1.x/transformation/YeoJohnsonTransformer.html).
* The argument is a list of `variables`. If you don't pass a `variables` list, the transformer considers all numerical variables.
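
The Yeo-Johnson formula extends Box-Cox with a separate branch for negative values, so it also accepts zeros and negatives. Below is a minimal NumPy sketch for a lambda fixed by hand. The fixed lambda, the function name `yeo_johnson` and the toy values are assumptions for illustration, since in practice lambda is estimated from the data.

```python
import numpy as np


def yeo_johnson(x, lmbda):
    """Yeo-Johnson transformation for a fixed, hand-picked lambda."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0

    # Branch for non-negative values
    if lmbda != 0:
        out[pos] = ((x[pos] + 1) ** lmbda - 1) / lmbda
    else:
        out[pos] = np.log1p(x[pos])

    # Branch for negative values
    if lmbda != 2:
        out[~pos] = -(((-x[~pos] + 1) ** (2 - lmbda)) - 1) / (2 - lmbda)
    else:
        out[~pos] = -np.log1p(-x[~pos])
    return out


values = [-2.0, 0.0, 3.0]
print(yeo_johnson(values, lmbda=1))    # lambda = 1 leaves the values unchanged
print(yeo_johnson(values, lmbda=0.5))
```

With lambda equal to 1 the transformation is the identity, which is a handy sanity check for any implementation.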

from feature_engine import transformation as vt

We consider the Boston dataset with the same variables from the previous exercise

from sklearn import datasets
boston_data = datasets.load_boston()
df = pd.DataFrame(boston_data.data,columns=boston_data.feature_names)
df['target'] = pd.Series(boston_data.target)

df = df.filter(['DIS','LSTAT', 'target'])
df.head()

We assess the distribution using the previous custom function

distribution_before_applying_transformer(df)

We set the pipeline with this transformer: `vt.YeoJohnsonTransformer()`. Then we `.fit_transform()` the pipeline, assigning the result to `df_transformed`

pipeline = Pipeline([
      ('yj', vt.YeoJohnsonTransformer() )
  ])

df_transformed = pipeline.fit_transform(df)
df_transformed.head()

We compare the histograms and QQ plots before and after applying the transformer

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's comment on the plots
* `DIS` decreased skewness, and kurtosis increased and changed from positive to negative. The QQ plot shows progress, since the blue dots are closer to the diagonal line
* `LSTAT`: same as above
* `target` decreased skewness and kurtosis. The QQ plot after transformation looks better than before transformation


compare_distributions_before_and_after_applying_transformer(df, df_transformed, method='YeoJohnsonTransformer')

