# Feature Engine - Unit 01 - Feature Engine Transformers and Pipeline

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%201%20-%20Lesson%20Learning%20Outcome.png"> Lesson Learning Outcome

* **The Feature-engine lesson consists of nine units.**
* By the end of this lesson, you should be able to:
  * Use multiple estimators to handle missing data, encode categorical variables, transform numerical variables, split continuous variables into independent, separate variables, handle outliers and detect correlated features.
  * Create your own estimator and assemble it in your ML pipeline.
  * Use a custom function for data cleaning and feature engineering workflow.

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Learn and use feature-engine Transformers and Pipelines.



---

* Feature-engine is a Python library with multiple built-in transformers which are used to engineer a dataset’s variables. The transformers learn parameters from the data and then transform the data.   
<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Question%20mark%20icon.png"> **Why do we study feature-engine?**
  * Because it gathers, in one centralised library, a wide range of built-in transformers for engineering your variables.
  * In addition, the transformers are compatible with the Scikit-learn pipeline. They take in a Pandas DataFrame and return a Pandas DataFrame, which is handy whether your project is in the research or the production phase.


---
* Are there other Python libraries with built-in transformers?
  * Yes, other Python libraries, like Scikit-learn, contain built-in transformers. We are using feature-engine for the reasons stated above.
  * We encourage you to explore additional libraries for these tasks over the course of your career as a data practitioner.


  ---  




## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%203%20-%20Additional%20Learning%20Context.png"> Additional Learning Context

* We encourage you to:
  * Add **code cells to try out** other possibilities, e.g. play around with parameter values in a function/method, or consider additional function parameters.
  * Also, **add your comments** in the cells. It can help you to consolidate your learning. 

* Parameters in a given function/method
  * As you may expect, a given function in a package may contain multiple parameters. 
  * Some are mandatory to declare, some have pre-defined values, and some are optional. We will cover the parameters most commonly used in Data Science for a particular function/method.
  * However, you may seek additional information in the respective package documentation, where you will find instructions on how to use a given function/method. The studied packages are open source, so this documentation is public.
  * **For feature-engine, the link is [here](https://feature-engine.readthedocs.io/en/1.1.x/)**.

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

We need to install `feature-engine`

And load our typical packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Feature Engine Transformers and Pipeline


Feature-engine has multiple built-in transformers which are used to engineer a dataset’s variables. 
* The transformers learn parameters from the data and then transform the data.
* The transformers become particularly useful when assigned to an ML pipeline. But what is an ML pipeline?

An ML pipeline is a sequence of tasks that are performed when training a machine learning model. 


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Distinct transformers are arranged in series into an ML pipeline, which helps to streamline its implementation.

We will use a Scikit-learn capability to create a pipeline. 
* Even though Scikit-learn is in an upcoming lesson, we will use its Pipeline class in this lesson to better understand feature-engine capabilities. 


We will import the `Pipeline` class from sklearn. Its documentation is [here](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).

from sklearn.pipeline import Pipeline

The idea is to define a set of steps that will be executed in a pipeline.
* You pass in a list of steps. Each step is a tuple containing the step name and the object that performs the step (a transformer or a model). Let's create a fictitious and non-runnable pipeline with 3 steps.
* It is important to define a step name, since you will later want to access a specific step for a particular task (like checking what the transformer has learned, assessing feature importance from an ML model, etc.).

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> Warning: The cell below won't work! The idea is to show how steps are set and arranged in a pipeline.

first_pipeline = Pipeline([
     ('first_step', FunctionThatExecutesSomething(arguments)),
     ('imagine_a_feature_engineering_step', function_feat_eng()),
     ('this_is_a_ML_model', ml_model(argument_a, argument_b))
     ])

A set of estimators is arranged in series when training a machine learning model.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> In an ML pipeline, typically, the last step will be the ML model, and the preceding steps will prepare the data for the model.

You will use a set of methods in the pipeline, so you can:
* Learn the data parameters and afterwards transform the data.
* Train the ML model and run predictions.


For the moment, we just need to understand `.fit()` and `.transform()`.
* The first learns from the data. The second transforms the data. We will see in later sections how to do this.
* In addition, a later Scikit-learn lesson will cover this topic in more detail.
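
As a runnable counterpart to the fictitious pipeline above, here is a minimal sketch. It assumes scikit-learn's own `SimpleImputer` and `StandardScaler` as the steps (chosen only for illustration, they are not part of this lesson) and a small made-up DataFrame. It shows named steps, `.fit()`, `.transform()` and bracket access to a step.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Made-up data, for illustration only
toy_df = pd.DataFrame({'height': [1.0, 2.0, np.nan, 3.0],
                       'weight': [10.0, np.nan, 30.0, 20.0]})

# Each step is a (name, object) tuple; the step names are our own choice
toy_pipeline = Pipeline([
    ('fill_missing', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

toy_pipeline.fit(toy_df)                          # learn from the data
toy_transformed = toy_pipeline.transform(toy_df)  # apply what was learned

# Access a step by name to inspect what it learned
learned_medians = toy_pipeline['fill_missing'].statistics_
print(learned_medians)
print(type(toy_transformed), toy_transformed.shape)
```

Note that these scikit-learn transformers return a NumPy array by default, not a DataFrame.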

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> 
In this lesson, you will see a combination of using feature-engine transformers "alone" and assembled in a pipeline, so you can get used to different coding situations.



# Feature-engine - Unit 02 - Handle Missing Data

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Learn how to handle Missing Data on numerical and categorical data.



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

And load our typical packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
from sklearn.pipeline import Pipeline

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Handle Missing Data

Feature-engine imputes missing data with values learned from the data or arbitrary values set by the user, either for numerical or categorical variables. We will study:
* Mean Median Imputer
* Arbitrary Number Imputer
* Categorical Imputer
* Drop Missing Data

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Mean Median Imputer

It replaces missing data with the mean or median value of the variable. It works only with numerical variables. The documentation link is [here](https://feature-engine.readthedocs.io/en/1.1.x/imputation/MeanMedianImputer.html)


* Pass a list of variables to be imputed. Alternatively, this imputer can automatically select all numerical variables.
* The imputer first calculates the mean/median values of the variables (with the `.fit()` method). It then replaces the missing data with the learned value (with the `.transform()` method).
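
The two-step behaviour described above can be sketched with plain pandas. This is only an illustration of the idea, not feature-engine's implementation, and the DataFrame and variable names are made up.

```python
import numpy as np
import pandas as pd

# Made-up data, for illustration only
toy_df = pd.DataFrame({'length': [10.0, np.nan, 30.0, 50.0],
                       'depth': [1.0, 2.0, np.nan, 9.0]})
variables = ['length', 'depth']

# "fit": learn one median per variable (missing values are skipped)
learned_medians = toy_df[variables].median().to_dict()

# "transform": replace missing values with the learned medians
toy_imputed = toy_df.fillna(value=learned_medians)

print(learned_medians)
print(toy_imputed)
```

Keeping the learned values separate from the transformation is what lets you learn from one dataset and apply the same values to another.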

from feature_engine.imputation import MeanMedianImputer

How do you know if you should impute the mean or the median?
* You should assess the numerical distribution plot. If it is normally distributed (bell curve shape), you can replace missing values using ``mean``. Otherwise, replace using ``median``.
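
To make this rule of thumb concrete, the sketch below compares the mean and the median on two made-up series. In the symmetric one they agree, while in the skewed one a single extreme value pulls the mean away from the bulk of the data and leaves the median unchanged.

```python
import pandas as pd

# Made-up values, for illustration only
symmetric = pd.Series([1, 2, 3, 4, 5])
skewed = pd.Series([1, 2, 3, 4, 100])   # one extreme value

for name, values in [('symmetric', symmetric), ('skewed', skewed)]:
    print(f"{name}: mean={values.mean():.1f}, "
          f"median={values.median():.1f}, skew={values.skew():.2f}")
```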

Let's load the 'penguins' dataset.

df = sns.load_dataset('penguins')
df = df.sample(frac=0.5, random_state=5)
df.head()

Let's check missing data levels with `.isnull().sum()`

df.isnull().sum()

We assess the distribution. We will replace the values with the median.

sns.set_style('whitegrid')
for col in ['bill_length_mm' , 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']:
  sns.histplot(data=df, x=col, kde=True)
  plt.show()
  print('\n')

We load and set the transformer. The arguments are:
* imputation_method: either mean or median
* variables: list of numerical variables to apply the method to. If you don't pass in anything here, the transformer will consider all numerical variables.

from feature_engine.imputation import MeanMedianImputer
imputer = MeanMedianImputer(imputation_method='median',
                            variables=['bill_length_mm' , 'bill_depth_mm',
                                       'flipper_length_mm', 'body_mass_g'])

We use the `.fit()` method, so the transformer can learn the median values from the selected variables. The argument is the dataset you want to learn from.

imputer.fit(df)

As a confirmation step, let's check the learned values with the attribute `.imputer_dict_`

imputer.imputer_dict_

We now transform the data, which means we replace the missing data of each variable according to its respective learned median value. We use `.transform()` method. The argument is the dataset you want to transform.

df = imputer.transform(df)

Check the output, it is a DataFrame.
* At first, we may think this is a minor detail. However, other feature engineering libraries, like scikit-learn, return a NumPy array from ``.transform()`` by default.

print(type(df))
df.head()

Let's check missing levels on `['bill_length_mm' , 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']`. The missing values were replaced with the median values.

df.isnull().sum()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's use an example where we arrange a transformer in a pipeline. We will use this approach from now on. 
* First, we reload the dataset with missing data


df = sns.load_dataset('penguins')
df = df.sample(frac=0.5, random_state=5)
df.isnull().sum()

We set up a pipeline with one step, which we name 'median'. The step uses `MeanMedianImputer()` with the arguments we saw earlier.

pipeline = Pipeline([
      ( 'median',  MeanMedianImputer(imputation_method='median',
                                     variables=['bill_length_mm' , 'bill_depth_mm',
                                                'flipper_length_mm', 'body_mass_g']) )
])
pipeline

We fit the pipeline. That means we will execute all the tasks in the pipeline.
* In this example, the pipeline has one step that learns the median value from the selected variables.

pipeline.fit(df)

We then transform the dataset.

df = pipeline.transform(df)

And check for missing data.

df.isnull().sum()

If we want to check the values learned by the median imputer, we have to access the step. Using bracket notation, we write the step name.

pipeline['median']

We then use the respective attribute from the transformer, in this case, `.imputer_dict_`

pipeline['median'].imputer_dict_

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Arbitrary Number Imputer

It replaces missing data in numerical variables with an arbitrary number determined by the user. The function documentation is [here](https://feature-engine.readthedocs.io/en/1.1.x/imputation/ArbitraryNumberImputer.html)
* The arguments are the variables and the number to be imputed


from feature_engine.imputation import ArbitraryNumberImputer

Let's use the 'penguins' dataset and check missing data levels.

df = sns.load_dataset('penguins')
df = df.sample(frac=0.5, random_state=5)
df.isnull().sum()

We set the pipeline. Imagine you conducted the same data analysis as before and decided (with no particular basis) to impute `-100` for `bill_length_mm` and `-500` for the remaining numerical variables with missing data.
* The values we chose here are arbitrary. In a project, this imputation can relate to a particular business context. For example, imagine the variable is Age and you know from long-term experience that a missing value should be replaced with, say, 25.

pipeline = Pipeline([
      ( 'bill_length_mm',  ArbitraryNumberImputer(arbitrary_number=-100,
                                                  variables=['bill_length_mm']) ),

      ( 'other_variables',  ArbitraryNumberImputer(arbitrary_number=-500,
                                                   variables=['bill_depth_mm',
                                                              'flipper_length_mm',
                                                              'body_mass_g']) )

])
pipeline

We fit the pipeline with the df.

pipeline.fit(df)

We then transform the dataset.

df = pipeline.transform(df)

And check for missing data.

df.isnull().sum()

If we want to check the values used for the arbitrary imputation, we have to access the step. Using bracket notation, we write the step name. We first check `bill_length_mm`.

pipeline['bill_length_mm'].imputer_dict_

Then for the remaining variables.

pipeline['other_variables'].imputer_dict_
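
As a cross-check of the idea, the same kind of imputation can be sketched with plain pandas, using a dictionary that maps each variable to its arbitrary number. The tiny DataFrame is made up, and this is an illustration rather than feature-engine's implementation.

```python
import numpy as np
import pandas as pd

# Made-up rows, for illustration only
toy_df = pd.DataFrame({'bill_length_mm': [39.1, np.nan, 40.3],
                       'body_mass_g': [3750.0, 3800.0, np.nan]})

# One arbitrary number per variable
arbitrary_values = {'bill_length_mm': -100, 'body_mass_g': -500}
toy_imputed = toy_df.fillna(value=arbitrary_values)

print(toy_imputed)
```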

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Categorical Imputer

It replaces missing data in categorical variables by an arbitrary value (typically with the label 'missing') or by the most frequent category. The documentation is found [here](https://feature-engine.readthedocs.io/en/1.1.x/imputation/CategoricalImputer.html)
* How do we select between the most frequent category or arbitrary value imputation?
  * It will depend on your business context and the missing levels. If you believe there is a hidden pattern behind why data is missing in this categorical variable, you can replace it with 'missing' and expect that an algorithm may find and use that pattern for predictions.
  * Alternatively, if the missing levels are very low, you can, in theory, replace them with the most frequent category without jeopardising the analysis.
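
The two options can be compared side by side with a plain pandas sketch. The series below is made up, and `fillna()` and `mode()` stand in for the transformer purely to illustrate the effect on the label counts.

```python
import numpy as np
import pandas as pd

# Made-up values, for illustration only
sex = pd.Series(['Male', 'Female', np.nan, 'Male', np.nan], name='sex')

# Option 1: flag the gaps with an explicit label
with_label = sex.fillna('Missing')

# Option 2: use the most frequent category (the mode)
most_frequent = sex.mode()[0]
with_frequent = sex.fillna(most_frequent)

print(with_label.value_counts())
print(with_frequent.value_counts())
```

With the explicit label, the gaps stay visible as their own category. With the most frequent category, they are absorbed into an existing one.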


from feature_engine.imputation import CategoricalImputer

##### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Replace with 'Missing'

Let's use the 'penguins' dataset and check missing data levels.

df = sns.load_dataset('penguins')
df = df.sample(n=50, random_state=1)
df.isnull().sum()

Let's assess `sex` frequency with `.value_counts()`.

df['sex'].value_counts()

We will first replace it with a 'missing' label. We set the transformer in a pipeline by defining its name and CategoricalImputer as the function. The parameters are the imputation method, the value to be filled and the variables.

pipeline = Pipeline([
      ( 'categorical_imputer', CategoricalImputer(imputation_method='missing',
                                                  fill_value='Missing',
                                                  variables=['sex']) )
])
pipeline

For learning purposes, we can use `.fit_transform()`, so we can speed up the process of fitting and transforming the data. We assign the result to df

df = pipeline.fit_transform(df)

We check again `sex` distribution with `.value_counts()`. Now ``Missing`` is a label in this variable.


df['sex'].value_counts()

##### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Replace with the most frequent

We will reload the 'penguins' dataset and use the other method: impute with the most frequent.

df = sns.load_dataset('penguins')

We set the pipeline and `.fit_transform()`
* CategoricalImputer now has the imputation method as frequent.

pipeline = Pipeline([
      ( 'categorical_imputer', CategoricalImputer(imputation_method='frequent',
                                                  variables=['sex']) )
])


df = pipeline.fit_transform(df)

We check again `sex` distribution with `.value_counts()`.
* In the full dataset, Male originally had 168 rows. The count has now increased, since the missing values were imputed with the most frequent category.


df['sex'].value_counts()

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Drop Missing Data

It deletes rows with missing values, similar to the pandas `DataFrame.dropna()` method. It can handle numerical and categorical variables.
* The argument is the list of variables for which missing values should be removed. When you don't set the variables list explicitly, the transformer drops every row that has missing data in any variable. The documentation link is [here](https://feature-engine.readthedocs.io/en/1.1.x/imputation/DropMissingData.html).
* In theory, you should treat dropping missing data as a last resort, since there was an effort and cost to collect the data. However, if the imputation methods will not serve you and your missing data levels are low, you can, in theory, remove the missing data without jeopardising the analysis.

from feature_engine.imputation import DropMissingData

As usual, let's consider the 'penguins' dataset and check missing data levels. We notice the dataset has 344 rows. The missing data level looks to be insignificant. Most variables with missing data have 2 missing values, and one has 11.

df = sns.load_dataset('penguins')
print(f"{df.shape} \n")
df.isnull().sum()

We set the pipeline with this transformer - we don't pass in any variables since we are interested in dropping all missing data. Then we `.fit_transform()` the data.

pipeline = Pipeline([
      ( 'drop_na', DropMissingData() )
])


df = pipeline.fit_transform(df)

We check again how many rows the data has and the missing levels.
* We notice the data now has 333 rows; before, it had 344 rows.

print(f"{df.shape} \n")
df.isnull().sum()
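
For comparison, here is a minimal pandas sketch of the two behaviours described above: dropping rows with missing data in any variable, and dropping only the rows with missing data in selected variables. The DataFrame is made up, and `dropna()` is used as the pandas equivalent.

```python
import numpy as np
import pandas as pd

# Made-up rows, for illustration only
toy_df = pd.DataFrame({'sex': ['Male', np.nan, 'Female', 'Male'],
                       'body_mass_g': [3750.0, 3800.0, np.nan, 4000.0]})

dropped_any = toy_df.dropna()                 # drop rows with any missing value
dropped_sex = toy_df.dropna(subset=['sex'])   # only consider `sex`

print(toy_df.shape, dropped_any.shape, dropped_sex.shape)
```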

# Feature-Engine - Unit 03 - Handle Categorical Variable Encoding

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Learn how to Handle Categorical Variable Encoding, using One Hot Encoder, Ordinal Encoder and Rare Label Encoder



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

And load our typical packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
from sklearn.pipeline import Pipeline

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Handle Categorical Variable Encoding

A categorical encoder replaces variable labels with a calculated or arbitrary number. We will study:
* One Hot Encoder
* Ordinal Encoder
* Rare Label Encoder

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> One Hot Encoder

This technique replaces the categorical variable with a combination of binary variables (which take the value 0 or 1), where each new binary variable is related to a label from the categorical variable. The function is called `OneHotEncoder()` and its documentation is found [here](https://feature-engine.readthedocs.io/en/1.1.x/encoding/OneHotEncoder.html).
* For example, imagine if our variable is `Colour`, and has three labels: Yellow, Blue and Green
* When you One Hot Encode (OHE) `Colour`, it is replaced by three binary variables `Colour_Yellow`, `Colour_Blue` and `Colour_Green`
* Imagine if a given row of Colour is Yellow. Once One Hot Encoded, this row will be transformed to  Colour_Yellow = 1, Colour_Blue = 0 and Colour_Green = 0.
* There is a concept called a redundant feature. Stop for a moment: do I need three binary variables to represent the variable `Colour`? 
  * The answer is no. If you have two binary variables for Colour, say Colour_Yellow and Colour_Blue, you can represent all possibilities as: 
    * Colour_Yellow = 1 and Colour_Blue = 0, meaning yellow
    * Colour_Yellow = 0  and Colour_Blue = 1, meaning blue
    * Colour_Yellow = 0  and Colour_Blue = 0, meaning green
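
The `Colour` example can be reproduced with a small pandas sketch. It uses `pd.get_dummies()` instead of the feature-engine transformer, so it is an illustration of the concept only. Note that pandas orders the labels alphabetically and its `drop_first=True` option removes the first binary variable, so the dropped label here is Blue rather than Green.

```python
import pandas as pd

# Made-up data matching the Colour example
toy_df = pd.DataFrame({'Colour': ['Yellow', 'Blue', 'Green', 'Yellow']})

# One binary variable per label
full_ohe = pd.get_dummies(toy_df, columns=['Colour'], dtype=int)

# One binary variable fewer: the dropped label is the all-zeros row
reduced_ohe = pd.get_dummies(toy_df, columns=['Colour'],
                             drop_first=True, dtype=int)

print(full_ohe)
print(reduced_ohe)
```

In the reduced version, the row that was Blue has 0 in both remaining columns, which is how the missing binary variable is still represented.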

from feature_engine.encoding import OneHotEncoder

Let's consider only categorical variables from the 'penguins' dataset.

df = sns.load_dataset('penguins').filter(['species', 'island', 'sex'])
df.head()

Let's create the pipeline with two steps (Handle Missing data and categorical encoding), and then use `.fit_transform()`
* Note: we can't encode a categorical variable that has missing data. For this exercise, we drop the missing data using a transformer from the previous unit (`DropMissingData`).
* When using `OneHotEncoder`, we pass a list of the variables that we want to one hot encode.

from feature_engine.imputation import DropMissingData
pipeline = Pipeline([
      ('drop_na', DropMissingData() ),
      ('ohe', OneHotEncoder(variables=['species', 'island', 'sex']) )
])


df = pipeline.fit_transform(df)
df

But what about the redundant feature?
* You just have to pass the argument `drop_last=True` to `OneHotEncoder()`
* But first, we reload the dataset.

df = sns.load_dataset('penguins').filter(['species', 'island', 'sex'])
df.head()

Then set the same pipeline, but now add `drop_last=True`. Compare to the previous transformation and check which binary variables were removed.
* Note there are now only two binary variables each for species and island, and only one binary variable for sex. This reduced set of variables carries the same amount of information as the previous OHE transformation.
* You probably noticed that this transformation has the potential to generate a lot of new columns. That increases the feature space and may increase the chance of overfitting your model. To manage that, you may add, when possible, a feature selection step to your pipeline to select the most relevant features in your dataset. Don't worry. This topic will be covered in the next lesson.

pipeline = Pipeline([
      ( 'drop_na', DropMissingData() ),
      ('ohe', OneHotEncoder(variables=['species', 'island', 'sex'], drop_last=True) )
])


df = pipeline.fit_transform(df)
df

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Ordinal Encoder

It replaces categories with ordinal numbers, like 0, 1, 2, 3 etc.  
* The numbers can be assigned on a first-seen basis.
* You can pass in a list of variables to encode, otherwise it will encode all categorical variables.

The function is `OrdinalEncoder()` and its documentation is found [here](https://feature-engine.readthedocs.io/en/1.1.x/encoding/OrdinalEncoder.html)

The encoding method can be set to `ordered` or `arbitrary`.
* When set to `ordered`, the categories are numbered in ascending order, based on the target mean value per category. Your ML task must therefore contain a target (as in regression or classification). If it is a clustering task, for example, the transformer will not work with this method.
* When set to `arbitrary`, the categories are numbered arbitrarily.

For the teaching examples in this course, when we use this transformer, we will set the encoding method to `arbitrary`. A similar transformer from scikit-learn, which will be covered in an upcoming lesson, likewise numbers categories without using the target. There are multiple packages that can engineer your variables, including both scikit-learn and feature-engine. You may try different options in your personal project or the workplace. After all, this transformation is part of your feature engineering strategy and, as we studied, there is no fixed recipe when engineering your variables. It is a trial-and-error approach.
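
The difference between the two encoding methods can be sketched with plain pandas. This is an illustration under our own assumptions (a made-up DataFrame with a made-up `target` column, and first-seen numbering for the arbitrary case), not feature-engine's implementation.

```python
import pandas as pd

# Made-up data, for illustration only
toy_df = pd.DataFrame({'island': ['Dream', 'Biscoe', 'Dream', 'Torgersen', 'Biscoe'],
                       'target': [10, 30, 20, 5, 50]})

# "arbitrary": number the labels in the order they first appear
arbitrary_map = {label: i for i, label in enumerate(toy_df['island'].unique())}

# "ordered": number the labels by ascending mean of the target
target_means = toy_df.groupby('island')['target'].mean().sort_values()
ordered_map = {label: i for i, label in enumerate(target_means.index)}

print(arbitrary_map)
print(ordered_map)
```

The ordered mapping cannot be built without the `target` column, which is why that method requires a supervised task.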

from feature_engine.encoding import OrdinalEncoder

Let's consider categorical variables from the 'penguins' dataset.

df = sns.load_dataset('penguins').filter(['species', 'island', 'sex'])
df.head()

Let's create the pipeline with two steps (Handle Missing data and ordinal encoding) and then use `.fit_transform()`
* We will not pass any variable list argument to `OrdinalEncoder()`, which means we will encode all categorical variables. We set `encoding_method='arbitrary'`.

from feature_engine.imputation import DropMissingData
pipeline = Pipeline([
      ( 'drop_na', DropMissingData() ),
      ('ordinal_encoder', OrdinalEncoder(encoding_method='arbitrary') )
])

df = pipeline.fit_transform(df)
df

Let's check the frequencies and label names.
* We use a for loop on DataFrame columns and print the variable name + the value counts for that variable.
* Note the labels were replaced by numbers. For example, Male and Female were replaced by 0 and 1.

for col in df.columns.to_list():
  print(f"{col} \n{df[col].value_counts()} \n\n")

Let's check the encoder dictionary, to see how the transformer mapped the labels to numbers.

pipeline['ordinal_encoder'].encoder_dict_

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Rare Label Encoder

This encoder groups infrequent categories in a new category called 'Rare' (or another name you define).
* For example, if your variable is Fruit, and the labels banana, grape and apple each account for less than 6% of the rows, all these labels will be replaced by 'Rare'. That helps to decrease the chance of a model overfitting.
* The function is `RareLabelEncoder()` and its documentation is found [here](https://feature-engine.readthedocs.io/en/1.1.x/encoding/RareLabelEncoder.html). The arguments are:
  * `tol`: the tolerance, or the minimum frequency a label should have to be considered frequent. Categories with frequencies lower than `tol` will be replaced by 'Rare'.
  * `n_categories`: The minimum number of categories a variable should have for the encoder to find frequent labels. If the variable contains fewer categories, all of them will be considered frequent.
  * `variables`: list of variables that you would like to apply this transformation to. If you don't pass anything, it will select all categorical variables.
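
The Fruit example above can be sketched with plain pandas to show what the tolerance does. The series and the 6% tolerance are made up to match the example, and this is an illustration rather than feature-engine's implementation.

```python
import pandas as pd

# Made-up data: 20 rows, where banana, grape and apple are 5% each
fruit = pd.Series(['orange'] * 10 + ['pear'] * 7 + ['banana', 'grape', 'apple'],
                  name='fruit')
tol = 0.06

# Relative frequency of each label
frequencies = fruit.value_counts(normalize=True)
frequent_labels = frequencies[frequencies >= tol].index

# Keep frequent labels, replace everything else with 'Rare'
fruit_grouped = fruit.where(fruit.isin(frequent_labels), 'Rare')

print(fruit_grouped.value_counts())
```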

from feature_engine.encoding import RareLabelEncoder

Let's consider a few variables from the Titanic dataset. It holds passengers' records from the first and indeed last Titanic voyage.  
* Note we are converting the variables to 'object' with `.astype()`, since they are stored as numbers even though we want to treat them as categories.

df = sns.load_dataset('titanic').filter(['parch', 'sibsp']).astype('object')
print(df.shape)
df.head()

Let's assess missing levels.

df.isnull().sum()

Now let's check the label frequencies for each variable.
* We loop over each variable and count its label frequencies using `.value_counts(normalize=True)`.
* We note that there are some labels which are infrequent, like 6 for parch.

for col in df.columns.to_list():
  print(f"{col} \n{df[col].value_counts(normalize=True)} \n\n")

Let's create the pipeline with three steps (drop missing data, followed by two rare label encoders), and then use `.fit_transform()`. We show here the use case where we perform rare label encoding more than once.
* The first RareLabelEncoder deals with parch and sets the tolerance to 10% (this is an arbitrary number, used to explain the concept). In the end, any parch label that is less frequent than 10% will be replaced by 'Rare'.
* The second RareLabelEncoder deals with sibsp and sets the tolerance to 8% (again, an arbitrary number to illustrate the concept). In the end, any sibsp label that is less frequent than 8% will be replaced by 'Rare'.
* Note: you can perform this technique with a set of variables. We created the example with single variables and different tolerances to illustrate the concept. In the workplace, the `tol` level will be selected based on the business context.
* We set ``n_categories=2`` so that the encoder looks for rare labels even though these variables have few distinct labels.

from feature_engine.imputation import DropMissingData
pipeline = Pipeline([
      ( 'drop_na', DropMissingData() ),
      ('rle_parch', RareLabelEncoder(tol=0.1,
                                     n_categories=2,
                                     variables=['parch']) ), 
      ('rle_sibsp', RareLabelEncoder(tol=0.08,
                                     n_categories=2,
                                     variables=['sibsp']) )
])

df = pipeline.fit_transform(df)
df.head()

Now let's check the label frequencies for each variable again.
* Note the labels were grouped into a label called 'Rare' according to the rules defined in the pipeline.

for col in df.columns.to_list():
  print(f"{col} \n{df[col].value_counts(normalize=True)} \n\n")

But, you may think, my variable is still categorical. What should I do?
* The answer is to arrange an Ordinal Encoder or OHE after the rare label encoder, so your categorical variables can be properly encoded.
* Just as an example, let's reload the data and inspect the label frequencies.

df = sns.load_dataset('titanic').filter(['parch', 'sibsp']).astype('object')
for col in df.columns.to_list():
  print(f"{col} \n{df[col].value_counts(normalize=True)} \n\n")

In one cell, we will do the following tasks:
* Create a pipeline with four steps: drop missing data, two rare label encoders and an ordinal encoder.
* Then we fit and transform the data.
* Finally, we loop over the variables to check labels frequencies.

from feature_engine.imputation import DropMissingData
pipeline = Pipeline([
      ( 'drop_na', DropMissingData() ),
      ('rle_parch', RareLabelEncoder(tol=0.1,
                                     n_categories=2,
                                     variables=['parch']) ), 
      ('rle_sibsp', RareLabelEncoder(tol=0.08,
                                     n_categories=2,
                                     variables=['sibsp']) ),
      ('ordinal_encoder', OrdinalEncoder(encoding_method='arbitrary',
                                         variables= ['parch', 'sibsp']) )
])

df = pipeline.fit_transform(df)

for col in df.columns.to_list():
  print(f"{col} \n{df[col].value_counts(normalize=True)} \n\n")
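The idea behind the pipeline above, grouping infrequent labels and then encoding the result as integers, can be sketched with plain pandas. This is only an illustration on a made-up `parch`-like column with an assumed tolerance of 0.25; it is not how feature-engine implements its encoders.

```python
import pandas as pd

# Hypothetical toy column; the values are made up for illustration.
s = pd.Series(['0'] * 7 + ['1'] * 2 + ['2'], name='parch')

# Step 1: group labels whose relative frequency is below the tolerance.
tol = 0.25  # assumed tolerance, chosen only for this toy example
freq = s.value_counts(normalize=True)
frequent = freq[freq >= tol].index
grouped = s.where(s.isin(frequent), 'Rare')

# Step 2: map each remaining label to an arbitrary integer,
# in order of first appearance.
mapping = {label: code for code, label in enumerate(grouped.unique())}
encoded = grouped.map(mapping)

print(grouped.value_counts().to_dict())
print(mapping)
print(encoded.tolist())
```

After the two steps, the column holds integers only, which is the form most ML algorithms expect.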

# Feature Engine - Unit 04 - Handle Numerical Variable Transformation

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Handle Numerical Variable Transformation, using Log Transformer, Reciprocal Transformer, Power Transformer, Box Cox and Yeo Johnson Transformer



## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

And load our typical packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
from sklearn.pipeline import Pipeline

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Handle Numerical Variable Transformation

The techniques presented here apply mathematical transformations to numerical variables. The idea is to change the variable's distribution so that, ideally, it becomes close to a [normal distribution](https://en.wikipedia.org/wiki/Normal_distribution). We will study the following transformers:
* LogTransformer
* ReciprocalTransformer
* PowerTransformer
* BoxCoxTransformer
* YeoJohnsonTransformer

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> We will do exercises with all of the transformers. You don't have to memorise the specific mathematical function for each transformer. Instead, you should be aware that we apply mathematical functions to numerical data. Later on, we will show a custom function that displays a report on numerical transformations, giving you criteria to select the most suitable transformer for your data.

* We will use the pingouin package to run a Q-Q plot to visually check how close to the normal distribution a given variable is.


import pingouin as pg

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Log Transformer

It applies the [natural logarithm](https://en.wikipedia.org/wiki/Natural_logarithm) (base e) or the base 10 logarithm to numerical variables. The function documentation is [here](https://feature-engine.readthedocs.io/en/1.1.x/transformation/LogTransformer.html).
* The transformer, as we may expect, can't handle zero or negative values.
* The first argument is `variables`, the list of variables you want to apply the method to. If you don't pass in a list of variables, the transformer considers all numerical variables. The next argument is `base` (either 'e' or '10').
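As a rough illustration of what a log transformation does to a right-skewed variable, the sketch below applies NumPy's logarithms directly to made-up log-normal data. The data and variable names are assumptions for this example and are not part of the lesson's dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical right-skewed, strictly positive data (log-normal draws).
x = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

log_e = np.log(x)     # natural logarithm (base e)
log_10 = np.log10(x)  # base 10 logarithm

print(f"skewness before:        {x.skew():.2f}")
print(f"skewness after log (e): {log_e.skew():.2f}")
print(f"skewness after log 10:  {log_10.skew():.2f}")
```

Both bases differ only by a constant factor, so they change the scale of the result but not the shape of its distribution.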

from feature_engine import transformation as vt

We will consider the Boston dataset from [scikit learn datasets](https://scikit-learn.org/stable/datasets/toy_dataset.html). It shows house prices in Boston.
* We use four lines of code to unpack the dataset into a DataFrame with all the features and the target variable. In the next lesson, we will investigate how to use sklearn functionalities. For now, we just need its dataset.
* Note that `load_boston()` was removed in scikit-learn 1.2, so the cell below needs an earlier scikit-learn version.
* For this exercise, we are not interested in making sense of the variables' meaning and business impact. We're looking for numerical variables to transform, so we will consider only a subset of the variables.

from sklearn import datasets
boston_data = datasets.load_boston()
df = pd.DataFrame(boston_data.data,columns=boston_data.feature_names)
df['target'] = pd.Series(boston_data.target)

df = df.filter(['DIS','LSTAT', 'target'])
df.head()

We assess the histogram and QQ plot by looping over the variables. We create custom functions for this task since we will repeat it across different transformers. The first will calculate skewness and kurtosis. The second plots a histogram and QQ plot for a given numerical variable.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> A quick recap
* Skewness is the asymmetry of the data. A distribution is symmetric when it looks the same to the left and right of the centre point, that is, it is horizontally mirrored. Positive skewness happens when the tail on the right side is longer. Negative skewness is the opposite.
* Kurtosis relates to the tails of the distribution. It is a measure of outliers in the distribution. The pandas `.kurtosis()` method used here reports excess kurtosis, so a normal distribution scores around zero. A negative kurtosis indicates the distribution has thin tails. Positive kurtosis indicates that the distribution is peaked and has thick tails.
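To make the recap concrete, here is a small sketch that computes both statistics with pandas on three made-up samples: a roughly normal one, a right-skewed one and a flat one. The samples are assumptions chosen only to show the sign of each statistic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
symmetric = pd.Series(rng.normal(size=5000))          # roughly normal
right_skewed = pd.Series(rng.exponential(size=5000))  # long right tail
flat = pd.Series(rng.uniform(size=5000))              # thin tails

samples = [('normal', symmetric), ('exponential', right_skewed), ('uniform', flat)]
for name, s in samples:
    print(f"{name:12s} | skewness: {s.skew():.2f} | kurtosis: {s.kurtosis():.2f}")
```

The normal sample should score near zero on both, the exponential sample should show positive skewness and positive kurtosis, and the uniform sample should show negative kurtosis.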




def calculate_skew_kurtosis(df,col, moment):
  print(f"{moment}  | skewness: {df[col].skew().round(2)} | kurtosis: {df[col].kurtosis().round(2)}")


def distribution_before_applying_transformer(df):
  for col in df.columns:
    print(f"*** {col} ***")
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12,4))
    sns.histplot(data=df, x=col, kde=True, ax=axes[0])
    axes[0].set_title("Histogram")
    pg.qqplot(df[col], dist='norm',ax=axes[1])
    plt.tight_layout()
    plt.show()
    calculate_skew_kurtosis(df,col,'before applying transformation')
    print("\n")

distribution_before_applying_transformer(df)

We set the pipeline with this transformer: `vt.LogTransformer()`. Then we `.fit_transform()` the pipeline, assigning the result to `df_transformed`

pipeline = Pipeline([
      ( 'log', vt.LogTransformer() )
  ])

df_transformed = pipeline.fit_transform(df)
df_transformed.head()

We now compare the distribution of the variables before and after applying the transformer. We create a custom function for that. It plots the histogram and QQ plot for the same variable before and after applying the transformer.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Note: When transforming variables, the summary statistics may change. We will consider here only skewness and kurtosis. What matters is that the transformed variable ends up closer to a normal distribution.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's comment on the plots
* `DIS` decreased skewness, but its kurtosis increased and changed from positive to negative. The QQ plot is closer to the diagonal line after transformation, but it is still "bent".
* `LSTAT` decreased skewness and changed from positive to negative. Its kurtosis decreased and changed from positive to negative. The QQ plot is closer to the diagonal line after transformation.
* `target` decreased skewness and kurtosis. Skewness changed from positive to negative, but the QQ plot is still "bent".

We can say that, in general, this transformation helped these variables become closer to a normal distribution when we compare the distribution shape and QQ plot before and after applying the transformer.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> However, we have more mathematical functions at our disposal to test. 
* That leads to another question: which mathematical function should I apply to my variable? We prepared a custom function that applies all of these transformations to a given variable, giving you a report to help decide which transformer to apply.


def compare_distributions_before_and_after_applying_transformer(df, df_transformed, method):

  for col in df.columns:
    print(f"*** {col} ***")
    fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10,8))

    sns.histplot(data=df, x=col, kde=True, ax=axes[0,0])
    axes[0,0].set_title(f'Before {method}')
    pg.qqplot(df[col], dist='norm',ax=axes[0,1])
    
    sns.histplot(data=df_transformed, x=col, kde=True, ax=axes[1,0])
    axes[1,0].set_title(f'After {method}')
    pg.qqplot(df_transformed[col], dist='norm',ax=axes[1,1])
    
    plt.tight_layout()
    plt.show()

    calculate_skew_kurtosis(df,col, moment='before transformation')
    calculate_skew_kurtosis(df_transformed,col, moment='after transformation')
    print("\n")


compare_distributions_before_and_after_applying_transformer(df, df_transformed, method='Log Transformer')

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Reciprocal Transformer

This technique applies the reciprocal transformation 1 / x to numerical variables. As we may expect, it can't handle a variable that contains zero. The function documentation is found [here](https://feature-engine.readthedocs.io/en/1.1.x/transformation/ReciprocalTransformer.html)
* The argument is `variables`. In cases where you don't pass in a list of variables, the transformer considers all numerical variables.
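A minimal sketch of the reciprocal transformation with plain pandas, on made-up values, shows two properties worth keeping in mind: it reverses the ordering of positive values, and a zero has no finite reciprocal.

```python
import numpy as np
import pandas as pd

# Hypothetical positive values, made up for illustration.
x = pd.Series([0.5, 1.0, 2.0, 4.0])
reciprocal = 1 / x
print(reciprocal.tolist())  # [2.0, 1.0, 0.5, 0.25]

# A zero has no finite reciprocal: plain pandas returns inf for it.
with_zero = pd.Series([0.0, 2.0])
print(np.isinf(1 / with_zero).tolist())  # [True, False]
```

Because large values become small and vice versa, keep this reversal in mind when interpreting a transformed variable.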

from feature_engine import transformation as vt

We consider the Boston dataset with the same variables from the previous exercise.

from sklearn import datasets
boston_data = datasets.load_boston()
df = pd.DataFrame(boston_data.data,columns=boston_data.feature_names)
df['target'] = pd.Series(boston_data.target)

df = df.filter(['DIS','LSTAT', 'target'])
df.head()

We assess the distribution using the previous custom function.

distribution_before_applying_transformer(df)

We set the pipeline with this transformer: `vt.ReciprocalTransformer()`. Then we `.fit_transform()` the pipeline, assigning the result to `df_transformed`

pipeline = Pipeline([
      ( 'reciprocal', vt.ReciprocalTransformer() )
  ])

df_transformed = pipeline.fit_transform(df)
df_transformed.head()

We compare the histograms and QQ plots before and after applying the transformers.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's comment on the plots.
* `DIS` decreased skewness, and kurtosis changed from positive to negative. The QQ plots look similar, so the transformation doesn't appear to have made any progress.
* `LSTAT` increased both skewness and kurtosis. The transformation doesn't appear to have made any progress.
* `target` increased both skewness and kurtosis. The transformation doesn't appear to have made any progress.


compare_distributions_before_and_after_applying_transformer(df, df_transformed, method='ReciprocalTransformer')

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Power Transformer

It applies power or exponential transformations to the numerical variable. The documentation is found [here](https://feature-engine.readthedocs.io/en/1.1.x/transformation/PowerTransformer.html)
* The first argument is `variables`, the list of variables you want to apply the method to. If you don't pass in a list of variables, the transformer considers all numerical variables. `exp` is the exponent the variable is raised to, and the default is 0.5, which is the square root.
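As a hedged illustration of a power transformation with an exponent of 0.5, the sketch below raises made-up right-skewed data to that power with NumPy and compares the skewness before and after. The data is an assumption for this example only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# Hypothetical right-skewed, non-negative data.
x = pd.Series(rng.exponential(scale=10.0, size=2000))

exp = 0.5  # same value as the default exponent mentioned above
powered = np.power(x, exp)

print(f"skewness before: {x.skew():.2f}")
print(f"skewness after:  {powered.skew():.2f}")
```

An exponent below 1 compresses large values more than small ones, which is why the right tail shrinks.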

from feature_engine import transformation as vt

We consider the Boston dataset with the same variables from the previous exercise.

from sklearn import datasets
boston_data = datasets.load_boston()
df = pd.DataFrame(boston_data.data,columns=boston_data.feature_names)
df['target'] = pd.Series(boston_data.target)

df = df.filter(['DIS','LSTAT', 'target'])
df.head()

We assess the distribution using the previous custom function.

distribution_before_applying_transformer(df)

We set the pipeline with this transformer: `vt.PowerTransformer()`. Then we `.fit_transform()` the pipeline, assigning the result to `df_transformed`

pipeline = Pipeline([
      ('pt', vt.PowerTransformer() )
  ])

df_transformed = pipeline.fit_transform(df)
df_transformed.head()

We compare the histograms and QQ plots before and after applying the transformers.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's comment on the plots.
* `DIS` decreased skewness, and kurtosis changed from positive to negative. The QQ plots look similar, so the transformation doesn't appear to have made any progress.
* `LSTAT` decreased skewness, and kurtosis changed from positive to negative. The QQ plot shows an improvement.
* `target` decreased skewness and kurtosis. It appears to have made minor progress when comparing the QQ plot before and after the transformation, since the blue dots are closer to the diagonal line.


compare_distributions_before_and_after_applying_transformer(df, df_transformed, method='PowerTransformer')

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Box Cox Transformer

It applies the Box-Cox transformation to numerical variables. A mathematical formulation can be found [here](https://www.statisticshowto.com/box-cox-transformation/). The transformer requires the data to be positive. The documentation for the function is found [here](https://feature-engine.readthedocs.io/en/1.1.x/transformation/BoxCoxTransformer.html).
* The argument is `variables`. In cases where you don't pass in a list of variables, the transformer considers all numerical variables. 
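For readers who want to see the formula in action, here is a minimal sketch of the Box-Cox function for a fixed lambda. The helper `box_cox` and the hand-picked lambda values are assumptions for illustration; in practice the lambda is estimated from the data rather than set by hand.

```python
import numpy as np

def box_cox(x, lmbda):
    """Box-Cox transform of strictly positive values for a fixed lambda."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive data")
    if lmbda == 0:
        return np.log(x)              # limiting case when lambda is 0
    return (x ** lmbda - 1) / lmbda

x = np.array([1.0, 2.0, 4.0])
print(box_cox(x, 0))    # identical to the natural log
print(box_cox(x, 1))    # x - 1: only a shift, the shape is unchanged
print(box_cox(x, 0.5))  # behaves like a rescaled square root
```

Seen this way, the log and power transformations from the earlier sections are special cases of one family indexed by lambda.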

from feature_engine import transformation as vt

We consider the Boston dataset with the same variables from the previous exercise.

from sklearn import datasets
boston_data = datasets.load_boston()
df = pd.DataFrame(boston_data.data,columns=boston_data.feature_names)
df['target'] = pd.Series(boston_data.target)

df = df.filter(['DIS','LSTAT', 'target'])
df.head()

We assess the distribution using the previous custom function.

distribution_before_applying_transformer(df)

We set the pipeline with this transformer: `vt.BoxCoxTransformer()`. Then we `.fit_transform()` the pipeline, assigning the result to `df_transformed`

pipeline = Pipeline([
      ('bct', vt.BoxCoxTransformer() )
  ])

df_transformed = pipeline.fit_transform(df)
df_transformed.head()

We compare the histograms and QQ plots before and after applying the transformers.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's comment on the plots.
* `DIS` decreased skewness, and kurtosis changed from positive to negative. The QQ plots look similar, so the transformation doesn't appear to have made any progress.
* `LSTAT` decreased skewness, and kurtosis changed from positive to negative. The QQ plot shows an improvement.
* `target` decreased skewness and kurtosis. It appears to have made minor progress when comparing the QQ plot before and after the transformation, since the blue dots are closer to the diagonal line.


compare_distributions_before_and_after_applying_transformer(df, df_transformed, method='BoxCoxTransformer')

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Yeo Johnson Transformer

It applies the Yeo-Johnson transformation. More information on the mathematical formulation can be found [here](https://statisticaloddsandends.wordpress.com/2021/02/19/the-box-cox-and-yeo-johnson-transformations-for-continuous-variables/). The documentation for the function is found [here](https://feature-engine.readthedocs.io/en/1.1.x/transformation/YeoJohnsonTransformer.html).
* The argument is a list of `variables`. In cases where you don't pass in a list of variables, the transformer considers all numerical variables. 
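The sketch below writes out the Yeo-Johnson function for a fixed lambda. Unlike Box-Cox, it is defined for zero and negative values as well. The helper `yeo_johnson` and the lambda values are assumptions for illustration; in practice the lambda is estimated from the data.

```python
import numpy as np

def yeo_johnson(x, lmbda):
    """Yeo-Johnson transform for a fixed lambda; accepts any real values."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0

    # Non-negative values: a Box-Cox-like transform of (x + 1).
    if lmbda != 0:
        out[pos] = ((x[pos] + 1) ** lmbda - 1) / lmbda
    else:
        out[pos] = np.log1p(x[pos])

    # Negative values: mirrored transform of (-x + 1) with power (2 - lambda).
    if lmbda != 2:
        out[~pos] = -(((-x[~pos] + 1) ** (2 - lmbda)) - 1) / (2 - lmbda)
    else:
        out[~pos] = -np.log1p(-x[~pos])
    return out

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(yeo_johnson(x, 1))    # lambda = 1 leaves the data unchanged
print(yeo_johnson(x, 0.5))  # compresses the right side, stretches the left
```

The transform keeps the ordering of the values for any lambda, so it reshapes the distribution without shuffling observations.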

from feature_engine import transformation as vt

We consider the Boston dataset with the same variables from the previous exercise.

from sklearn import datasets
boston_data = datasets.load_boston()
df = pd.DataFrame(boston_data.data,columns=boston_data.feature_names)
df['target'] = pd.Series(boston_data.target)

df = df.filter(['DIS','LSTAT', 'target'])
df.head()

We assess the distribution using the previous custom function.

distribution_before_applying_transformer(df)

We set the pipeline with this transformer: `vt.YeoJohnsonTransformer()`. Then we `.fit_transform()` the pipeline, assigning the result to `df_transformed`

pipeline = Pipeline([
      ('yj', vt.YeoJohnsonTransformer() )
  ])

df_transformed = pipeline.fit_transform(df)
df_transformed.head()

We compare the histograms and QQ plots before and after applying the transformers.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's comment on the plots.
* `DIS` decreased skewness, and kurtosis increased and changed from positive to negative. The QQ plots look to have made progress since the blue dots are closer to the diagonal line.
* `LSTAT`: same as above.
* `target` decreased skewness and kurtosis. The QQ plot after the transformation looks better than before the transformation.


compare_distributions_before_and_after_applying_transformer(df, df_transformed, method='YeoJohnsonTransformer')



# Feature Engine - Unit 05 - Handle Variable Discretization

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Handle Variable Discretization using Equal Frequency discretizer, Equal Width discretizer or Arbitrary discretizer



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

And load our typical packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
from sklearn.pipeline import Pipeline

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Handle Variable Discretization

This technique consists of transforming continuous numerical variables into discrete variables. The discrete variables will contain intervals related to the numerical distribution. The interval will be decided based on the frequency or width. We will study:
* EqualFrequencyDiscretiser
* EqualWidthDiscretiser
* ArbitraryDiscretiser



<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> When should I consider using them? We can consider these use cases:
* Your feature may have an unusual numerical distribution; by discretizing the variable, the model may make better sense of the resulting categorical distribution.
* You have a continuous target variable, and you are not successful in fitting a model to the dataset. You can then discretize the target variable and convert the ML task to classification, since your target variable is now categorical. The expectation is that this gives us more options for finding a model that fits the data.


#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Equal Frequency

It divides continuous numerical variables into contiguous equal frequency intervals, that is, intervals containing approximately the same proportion of observations. The function documentation is found [here](https://feature-engine.readthedocs.io/en/1.1.x/discretisation/EqualFrequencyDiscretiser.html).
* The arguments are `variables`, the variables to apply the method to (if you don't pass anything, it will select all numerical variables), and `q`, the desired number of equal frequency intervals (or quantiles).
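Equal frequency binning can be sketched with `pd.qcut`, which splits a variable at its quantiles. The data below is made up. Note that `pd.qcut` uses the observed minimum and maximum as the outer limits, which is a detail of this sketch and not necessarily how the transformer reports its bins.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical skewed target, made up for illustration.
target = pd.Series(rng.exponential(scale=20, size=1000), name='target')

q = 5
# labels=False returns the bin index (0 .. q-1) instead of an Interval.
binned, edges = pd.qcut(target, q=q, labels=False, retbins=True)

print(edges.round(2))                               # q + 1 interval limits
print(binned.value_counts().sort_index().tolist())  # observations per bin
```

The interval limits are unevenly spaced because they follow the data, while the counts per bin stay roughly the same.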


from feature_engine.discretisation import EqualFrequencyDiscretiser

We will use the target variable from the Boston dataset

from sklearn import datasets
boston_data = datasets.load_boston()
df = pd.DataFrame(boston_data.target, columns=['target'])
df.head()

We assess the distribution with `sns.histplot()`

sns.histplot(data=df, x='target', kde=True)
plt.show()

We create a pipeline with `EqualFrequencyDiscretiser()` on the target variable and look for five bins. We then `.fit_transform()` the data
* In the workplace, you will need criteria to select a value for `q`. It may make sense to have 3 or 6. You can also run multiple simulations and assess the results for different values of `q`


pipeline = Pipeline([
      ('efd', EqualFrequencyDiscretiser(q=5, variables=['target'] ))
])

df_transformed = pipeline.fit_transform(df)

We access the `efd` step and check the bins the transformer calculated with `.binner_dict_`

pipeline['efd'].binner_dict_

Finally, we plot the new target distribution. As we may expect, all intervals have approximately the same frequency
* Note the bar in the plot where the target is zero; it corresponds to the numerical interval from -inf to 15.3. The same reading applies to the remaining bars


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The upside of using this technique, considering you are using the target variable, is that your target variable for the classification task will be already balanced, which means the labels have similar frequencies.

sns.countplot(data=df_transformed, x='target')
plt.show()

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Equal Width

This technique divides continuous numerical variables into intervals of the same width. Note that the count of observations per interval may vary. The function documentation is found [here](https://feature-engine.readthedocs.io/en/1.1.x/discretisation/EqualWidthDiscretiser.html).
* The arguments are `variables`, the variables to apply the method to (if you don't pass anything, it will select all numerical variables), and `bins`, the number of equal-width intervals you want.
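A minimal sketch of equal width binning: the limits are evenly spaced between the minimum and maximum, and `pd.cut` assigns each value to an interval. The toy values are made up to show how uneven the counts can become.

```python
import numpy as np
import pandas as pd

# Hypothetical target values with a few large observations.
target = pd.Series([5, 7, 8, 9, 10, 11, 12, 30, 45, 50], name='target', dtype=float)

bins = 3
# Evenly spaced limits between the observed minimum and maximum.
edges = np.linspace(target.min(), target.max(), bins + 1)
binned = pd.cut(target, bins=edges, labels=False, include_lowest=True)

print(edges.tolist())                               # [5.0, 20.0, 35.0, 50.0]
print(binned.value_counts().sort_index().tolist())  # [7, 1, 2]
```

Every interval is 15 units wide, yet most observations land in the first one, which is the typical behaviour with skewed data.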

from feature_engine.discretisation import EqualWidthDiscretiser

We will use the target variable from the Boston dataset

from sklearn import datasets
boston_data = datasets.load_boston()
df = pd.DataFrame(boston_data.target, columns=['target'])
df.head()

We assess its distribution

sns.histplot(data=df, x='target', kde=True)
plt.show()

We create a pipeline with `EqualWidthDiscretiser()` on the target variable and look for six bins. We then `.fit_transform()` the data
* In the workplace, you will need criteria to select a number of `bins`. It may make sense to have 3 or 6. You can also run multiple simulations and assess the results for different numbers of `bins`


pipeline = Pipeline([
      ('ewd', EqualWidthDiscretiser(bins=6, variables=['target']) )
])

df_transformed = pipeline.fit_transform(df)

We access the `ewd` step and check the bins the transformer calculated with `.binner_dict_`

pipeline['ewd'].binner_dict_

Finally, we plot the new target distribution. As we may expect, the intervals have the same width but different frequencies
* Note the bar in the plot where the target is zero; it corresponds to the numerical interval from -inf to 12.5. The same reading applies to the remaining bars


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The downside of using this technique, considering you are using the target variable, is that your target variable for the classification task will likely not be balanced.

sns.countplot(data=df_transformed, x='target')
plt.show()

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Arbitrary Discretiser

It divides continuous numerical variables into intervals whose limits are determined by the user. The documentation is found [here](https://feature-engine.readthedocs.io/en/1.1.x/discretisation/ArbitraryDiscretiser.html). The argument used is:
* `binning_dict`, a dictionary that maps each variable you want to discretize to its interval limits.

* You may use this technique when the company is comfortable with how to map the numerical values to ranges. For example, imagine if the variable is Revenue from a given purchase. The business is comfortable assuming that Revenue smaller than 100 is small, between 100 and 1000 is medium and greater than 1000 is big. You can also conduct separate analyses with other custom ranges to question current assumptions and/or look for other criteria to discretize the data
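The Revenue example can be sketched with `pd.cut` and user-defined limits. The revenue values are made up, and treating a revenue of exactly 100 as medium (`right=False`) is an assumption, since the rule above does not say which side the boundary belongs to.

```python
import numpy as np
import pandas as pd

# Hypothetical revenue values, made up for illustration.
revenue = pd.Series([20, 99, 100, 450, 999, 1500, 8000], name='revenue')

limits = [-np.inf, 100, 1000, np.inf]
labels = ['small', 'medium', 'big']

# right=False makes each interval closed on the left: [100, 1000) is medium.
sized = pd.cut(revenue, bins=limits, labels=labels, right=False)
print(sized.tolist())
# ['small', 'small', 'medium', 'medium', 'medium', 'big', 'big']
```

Using -inf and +inf as the outer limits guarantees that every value, however extreme, falls into some interval.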

from feature_engine.discretisation import ArbitraryDiscretiser

We will use the target variable from the Boston dataset

from sklearn import datasets
boston_data = datasets.load_boston()
df = pd.DataFrame(boston_data.target, columns=['target'])
df.head()

We assess its distribution

sns.histplot(data=df, x='target', kde=True)
plt.show()

We create a pipeline with `ArbitraryDiscretiser()` on the target variable, defining four intervals with limits at 10, 20 and 40. We then `.fit_transform()` the data
* In the workplace, you will need criteria to select the interval limits. You can also run multiple simulations and assess the results for different limits


import numpy as np # we import NumPy to set -inf and +inf
pipeline = Pipeline([
      ( 'arbd', ArbitraryDiscretiser(binning_dict={'target':[-np.inf,10,20,40,np.inf]}) )
])

df_transformed = pipeline.fit_transform(df)

We access the `arbd` step and check the bins we created with `.binner_dict_`

pipeline['arbd'].binner_dict_

Finally, we plot the new target distribution. As we may expect, the intervals have different frequencies, since we set the limits ourselves
* Note the bar in the plot where the target is zero; it corresponds to the numerical interval from -inf to 10. The same reading applies to the remaining bars


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The upside of this technique is that we set the intervals we are comfortable with. The downside is that the categorical distribution may be imbalanced.

sns.countplot(data=df_transformed, x='target')
plt.show()

# Feature Engine - Unit 06 - Handle Outlier

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Handle Outlier using Winsorizer, Arbitrary capper or Outlier Trimmer



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

And load our typical packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
from sklearn.pipeline import Pipeline

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Handle Outlier


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> These techniques aim to cap outliers based on a calculation or an arbitrary value. In addition, you may drop the outliers from the dataset. It is important to use the business context to manage outliers. For example:
* If your variable is Age and you see a value of 400, that may mean an error when collecting the data. You may cap the outlier with a `Q3 + 1.5 * IQR` value, replace it with an arbitrary number, or drop the row. The practical decision depends on your business context. Luckily we can code and check the effect of multiple possibilities before deciding the most suitable option to handle the outlier.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png">  Dropping rows that contain outliers should be the last option, since the data collection process costs some team (yours or another) energy, time and money. Also, outliers may indicate that your data is changing its behaviour and that you have collected the first samples of this new behaviour.



We will study the following transformers
* Winsorizer
* ArbitraryOutlierCapper
* OutlierTrimmer

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Winsorizer

It caps the maximum and/or minimum values of a continuous variable. It calculates the capping values using specific methods. The documentation is found [here](https://feature-engine.readthedocs.io/en/1.1.x/outliers/Winsorizer.html#)
* The arguments are:
  * `variables`: the variables with outliers you are interested in (if you don't pass anything, it will consider all numerical variables).
  * `tail`: whether to cap outliers on the right tail, the left tail or both.
  * `fold`: the number that multiplies the IQR to calculate the capping values. The documentation says recommended values are 1.5 or 3 for the IQR proximity rule.
  * `capping_method`: we will consider `'iqr'`, which caps at the 75th quantile + 1.5 x IQR for the right tail and at the 25th quantile - 1.5 x IQR for the left tail.
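The IQR capping rule can be sketched by hand with pandas. The ages below are made up and include the value of 400 from the earlier example; the sketch computes the two limits and clips the data to them, which illustrates the rule and is not the transformer's own code.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including an obvious data collection error.
age = pd.Series([22, 25, 27, 30, 31, 33, 35, 38, 40, 400], name='age', dtype=float)

q1, q3 = age.quantile(0.25), age.quantile(0.75)
iqr = q3 - q1
fold = 1.5

lower = q1 - fold * iqr  # left tail cap
upper = q3 + fold * iqr  # right tail cap

# Values beyond a cap are replaced by the cap; no rows are dropped.
capped = age.clip(lower=lower, upper=upper)

print(f"caps: {lower} to {upper}")
print(capped.tolist())
```

The value of 400 is pulled down to the upper cap while every other row is left untouched, so the dataset keeps its original size.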



from feature_engine.outliers import Winsorizer

We will consider the titanic data for this exercise. It holds passenger records from the Titanic's final journey. We will consider the variables `age` and `fare`

df = sns.load_dataset('titanic').filter(['age', 'fare'])
print(df.shape)
df.head()

We will assess the distribution of the variables with a custom function by plotting a combined histogram and a boxplot. The function code was used in the Descriptive Statistics unit, so it should be familiar to you. On top of that, we added more code to report the limits beyond which the boxplot considers a data point an outlier: we calculate the [IQR](https://en.wikipedia.org/wiki/Interquartile_range) and the lower (Q1 - 1.5 x IQR) and upper (Q3 + 1.5 x IQR) limits of the boxplot.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's comment on the plots in terms of outliers:
* `age` has a few outliers on the right tail (the right side of the plot)
* `fare` has multiple outliers on the right tail


def plot_histogram_and_boxplot(df):
  for col in df.columns:
    fig, axes = plt.subplots(nrows=2 ,ncols=1 ,figsize=(7,7), gridspec_kw={"height_ratios": (.15, .85)})
    sns.boxplot(data=df, x=col, ax=axes[0])
    sns.histplot(data=df, x=col, kde=True, ax=axes[1])
    fig.suptitle(f"{col} Distribution - Boxplot and Histogram")
    plt.show()

    IQR = df[col].quantile(q=0.75) - df[col].quantile(q=0.25)
    print(
        f"This is the range where a datapoint is not an outlier: from "
        f"{(df[col].quantile(q=0.25) - 1.5*IQR).round(2)} to "
        f"{(df[col].quantile(q=0.75) + 1.5*IQR).round(2)}")
    print("\n")

plot_histogram_and_boxplot(df)

We create a pipeline with two steps: `DropMissingData()` (since there should be no missing data), then `Winsorizer()` on both variables, using iqr as the capping method and fold as 1.5 on both tails.


from feature_engine.imputation import DropMissingData
pipeline = Pipeline([
      ( 'drop_na', DropMissingData() ),
      ( 'winsorizer_iqr', Winsorizer(capping_method='iqr', fold=1.5, tail='both', variables=['age', 'fare']) )
])

df_transformed = pipeline.fit_transform(df)
df_transformed.head()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Note the **capping points** change when you apply iqr. We check the cap values with the attributes `.right_tail_caps_` and `.left_tail_caps_`. We first check the right tail caps

pipeline['winsorizer_iqr'].right_tail_caps_

Then left tail caps

pipeline['winsorizer_iqr'].left_tail_caps_

For each variable, we will check the histogram and boxplot before and after the transformation
  * Note the ranges have changed
  * The outliers on the right tail were capped at `Q3 + 1.5 * IQR`

print("========= Before Transformation ========= \n")
plot_histogram_and_boxplot(df)
print("\n\n ========= After Transformation =========")
plot_histogram_and_boxplot(df=df_transformed)

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Arbitrary Outlier Capper

It caps a variable's maximum or minimum values at an arbitrary value indicated by the user. The function documentation is found [here](https://feature-engine.readthedocs.io/en/1.1.x/outliers/ArbitraryOutlierCapper.html)
* The arguments are `max_capping_dict` and `min_capping_dict`, where you pass in a dictionary mapping each variable to the limit (max or min) at which you want to cap it
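As a rough picture of what a maximum capping dictionary does, the sketch below applies the same idea with plain pandas on a made-up DataFrame. The data and the names `toy`, `max_caps` and `capped` are assumptions for illustration; the transformer itself is not used here.

```python
import pandas as pd

# Toy data (hypothetical values, not the Titanic dataset)
toy = pd.DataFrame({
    "fare": [10.0, 35.0, 80.0, 500.0],
    "age": [22.0, 49.0, 51.0, 70.0],
})

# Same shape as a max capping dictionary: variable -> upper limit
max_caps = {"fare": 40, "age": 50}

# Cap each listed variable at its limit; values below the limit are untouched
capped = toy.copy()
for col, cap in max_caps.items():
    capped[col] = capped[col].clip(upper=cap)

print(capped)
```

Every value above a limit ends up equal to that limit, which is what produces the "peak" at the cap in a histogram.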

from feature_engine.outliers import ArbitraryOutlierCapper

We will consider the titanic data for this exercise. It holds passengers' records from the Titanic's final journey. We will consider the variables `age` and `fare`

df = sns.load_dataset('titanic').filter(['age', 'fare'])
print(df.shape)
df.head()

We will assess the distribution of the variables with a custom function by plotting a combined histogram and a boxplot. 

plot_histogram_and_boxplot(df)

We create a pipeline with two steps: ``DropMissingData()`` (since there should be no missing data), then ``ArbitraryOutlierCapper()``, and set 40 as the max cap for fare and 50 as the max cap for age. We use these numbers so you can clearly see the effect in the histograms. In the workplace, you should reflect on the selected number for the cap.

from feature_engine.imputation import DropMissingData
pipeline = Pipeline([
      ( 'drop_na', DropMissingData() ),
      ( 'arb', ArbitraryOutlierCapper(max_capping_dict={'fare':40 , 'age':50}) )
])

df_transformed = pipeline.fit_transform(df)
df_transformed.head()

The caps here are the arbitrary values we set, not values learned from the data.
* For each variable, we check the histogram and boxplot, before and after the transformation, so you can see the behaviour we described
  * Note the ranges have changed
  * Note that after applying the transformation, all values above the caps became 40 for fare and 50 for age. We note a "peak" in the fare histogram around 40 and a "peak" in the age histogram around 50 

print("========= Before Transformation ========= \n")
plot_histogram_and_boxplot(df)
print("\n\n ========= After Transformation =========")
plot_histogram_and_boxplot(df=df_transformed)

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Outlier Trimmer

It removes observations with outliers from the data. The documentation is found [here](https://feature-engine.readthedocs.io/en/1.1.x/outliers/OutlierTrimmer.html). The arguments are the variables you want to apply the transformer to. If you don't pass variables, it will select all numerical variables. There are also `capping_method`, `tail` and `fold`, which have the same meaning as in `Winsorizer()`. We will consider capping_method='iqr', tail='both' and fold=1.5

from feature_engine.outliers import OutlierTrimmer

We will consider the titanic data for this exercise. It holds passengers' records from the Titanic's final journey. We will consider the variables `age` and `fare`

df = sns.load_dataset('titanic').filter(['age', 'fare'])
print(df.shape)
df.head()

We will assess the distribution of the variables with a custom function by plotting a combined histogram and a boxplot. In addition, we will calculate how many rows the dataset has when it includes outliers

print(f"* The dataset has {len(df)} rows, considering outliers.\n\n")
plot_histogram_and_boxplot(df)

We create a pipeline with two steps: `DropMissingData()` (since there should be no missing data), then `OutlierTrimmer()`, where capping_method='iqr', fold=1.5, tail='both', and variables=['age', 'fare']. We `.fit_transform()` the data

from feature_engine.imputation import DropMissingData
pipeline = Pipeline([
      ( 'drop_na', DropMissingData() ),
      ( 'out_trimmer', OutlierTrimmer(capping_method='iqr', fold=1.5, tail='both', variables=['age', 'fare']) )
])

df_transformed = pipeline.fit_transform(df)
df_transformed.head()

We notice the dataset length has decreased since we removed the observations from both variables, which were considered outliers

print(f"* The dataset has {len(df)} rows, considering outliers.")
print(f"* Once it is transformed with OutlierTrimmer, dataset has {len(df_transformed)} rows")

But that doesn't mean the new dataset will not have outliers. Under the new distribution, the remaining data may still contain a few points that a boxplot flags as outliers. The difference is that now you will have fewer outliers for your model. 
* Note the range has changed, as we may expect. The distribution shape is the same in the area where there are no outliers (as we may expect as well)
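The sketch below illustrates why trimmed data can still show outliers: once the most extreme rows are removed, the quartiles shift and the IQR limits are recomputed on the new distribution. The values and the helper `iqr_bounds` are hypothetical and use plain pandas, not the transformer.

```python
import pandas as pd

def iqr_bounds(series, fold=1.5):
    """Return the (lower, upper) limits of the IQR proximity rule."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - fold * iqr, q3 + fold * iqr

# Toy data (an assumption for illustration)
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 20, 21, 100, 101])

# First pass: drop the rows outside the limits learned from the full data
lower, upper = iqr_bounds(values)
trimmed = values[values.between(lower, upper)]

# Second look: limits recomputed on the trimmed data are narrower
new_lower, new_upper = iqr_bounds(trimmed)
newly_flagged = trimmed[~trimmed.between(new_lower, new_upper)]

print(f"rows before: {len(values)}, rows after trimming: {len(trimmed)}")
print(f"values a boxplot would flag after trimming: {newly_flagged.tolist()}")
```

In this toy case, 100 and 101 are removed in the first pass, and 20 and 21 become the new boxplot outliers.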


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> Please consider this route as a last resort after carefully reflecting on why your original data had outliers in the first place.

print("========= Before Transformation ========= \n")
plot_histogram_and_boxplot(df)
print("\n\n ========= After Transformation =========")
plot_histogram_and_boxplot(df=df_transformed)

# Feature Engine - Unit 07 -  Drop Features & Smart Correlated Features

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Learn how to apply Drop Features transformer & Smart Correlated Features transformer



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

And load our typical packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
from sklearn.pipeline import Pipeline

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">Drop Features

It drops a list of variables indicated by the developer. The function documentation is [here](https://feature-engine.readthedocs.io/en/1.1.x/selection/DropFeatures.html). The argument is the features you want to drop.

from feature_engine.selection import DropFeatures

We will use the penguin dataset. It has records for three different species of penguins collected from 3 islands in the Palmer Archipelago, Antarctica

df = sns.load_dataset('penguins')
df.head()

We will set the pipeline with `DropFeatures()`, and we want to drop the variables 'sex' and 'island'. We chose these arbitrarily, just for the exercise.
* In the workplace, you may consider the context. For example, your variable might be CustomerID, which is typically a combination of letters and numbers with high cardinality. You can often only get a little information out of it. Therefore, you may drop this variable.
* Other use cases could be when you create variables combining others, for example, 'distance' and 'time'; you may create a variable 'speed' when dividing one by another. After that, you may discard 'distance' and 'time'
* After setting the pipeline, we `.fit_transform()` the data
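The 'distance' and 'time' use case mentioned above can be sketched with plain pandas. The DataFrame and its values are made up for illustration; in a pipeline, the final drop is the step `DropFeatures()` would take care of.

```python
import pandas as pd

# Toy data (hypothetical): distance and time for two trips
toy = pd.DataFrame({"distance": [100.0, 150.0], "time": [2.0, 3.0]})

# Create the combined variable first...
toy["speed"] = toy["distance"] / toy["time"]

# ...then discard the variables it was built from
toy = toy.drop(columns=["distance", "time"])

print(toy)
```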

pipeline = Pipeline([
      ( 'drop_features', DropFeatures(features_to_drop = ['sex', 'island']) )
])

df_transformed = pipeline.fit_transform(df)
df_transformed.head()

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Smart Correlated Features


According to the documentation, this transformer finds groups of correlated features. It then selects, from each group, one feature following a given criterion: the feature with the least missing values, the feature with the most unique values, or the feature with the highest variance. The documentation is found [here](https://feature-engine.readthedocs.io/en/1.1.x/selection/SmartCorrelatedSelection.html)
* The arguments we will use are `variables`, which is the list of variables to evaluate (if you don't pass anything, it will consider all numerical variables in the dataset); `method`, the correlation method (like 'pearson' or 'spearman'); and `threshold`, which according to the documentation is the correlation threshold above which a feature will be deemed correlated with another one and removed from the dataset.
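To build intuition for the threshold and the variance criterion, here is a simplified sketch with plain pandas on made-up data. It only compares pairs of features, whereas the transformer works with groups, so treat it as an approximation of the idea and not the library's algorithm. The names `toy`, `correlated_pairs` and `to_drop` are assumptions.

```python
import pandas as pd

# Toy data (hypothetical): 'a' and 'b' move together, 'c' does not
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 1, 4, 2, 3],
})

threshold = 0.6
corr = toy.corr(method="pearson").abs()
cols = list(corr.columns)

# Pairs whose absolute Pearson correlation is above the threshold
correlated_pairs = [
    (cols[i], cols[j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if corr.iloc[i, j] > threshold
]

# Variance criterion: within each pair, keep the higher-variance feature
# and mark the lower-variance one for removal
to_drop = [min(pair, key=lambda col: toy[col].var()) for pair in correlated_pairs]

print(correlated_pairs)
print(to_drop)
```

Here only 'a' and 'b' exceed the threshold, and 'a' is marked for removal because its variance is lower.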

from feature_engine.selection import SmartCorrelatedSelection

We will use the tips dataset. It holds records for waiter tips based on the day of the week, time of day, total bill, gender, if it is a table of smokers or not, and how many people were at the table.

df = sns.load_dataset('tips')
df.head()

When you load the dataset from Seaborn, the categorical variables have the data type 'category'. For ML tasks, and more specifically for this exercise, the data type should be 'object'.

df.info()

We change the data type to `'object'` by looping over all the variables whose current data type is `'category'`

for col in df.select_dtypes(include='category').columns:
  df[col] = df[col].astype('object')

df.info()

We check for missing data. 
* There is no missing data

df.isnull().sum()

``SmartCorrelatedSelection()`` transformer works on numerical data; therefore, we must encode the existing categorical variables. We do that in this exercise with ``OrdinalEncoder()``. Then we add ``SmartCorrelatedSelection()``, where we don't pass the variables, meaning we want all numerical variables to be evaluated. We set the method as Pearson, the threshold as 0.6 and selection_method as the variance. A threshold of 0.6 means that any variable correlations that are at least moderate will be considered and subject to removal

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> **A Big warning**: the tips dataset is intended to be used in a regression task where you are interested in predicting tips. When working on a project, the tips variable wouldn't be a feature but a target. Here we left it in on purpose as a feature just for the sake of the exercise.

from feature_engine.encoding import OrdinalEncoder
pipeline = Pipeline([
      ('ordinal_encoder', OrdinalEncoder(encoding_method='arbitrary') ),
      ( 'SmartCorrelatedSelection', SmartCorrelatedSelection(method="pearson",
                                                             threshold=0.6,
                                                             selection_method="variance",))
])

df_transformed = pipeline.fit_transform(df)

We can check which sets of features were marked as correlated (using the rules we set in the previous pipeline). We do that by accessing the pipeline step and using the attribute `.correlated_feature_sets_`

pipeline['SmartCorrelatedSelection'].correlated_feature_sets_

We check which variables were removed with the attribute `.features_to_drop_`

pipeline['SmartCorrelatedSelection'].features_to_drop_

Alternatively, we inspect `df_transformed` and, as expected, the variables were removed

df_transformed.head()

 <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> **Additional warning**: This transformer is applied to the features when you set up the pipeline for your ML task. It is typically one of the last steps of feature engineering since it requires pre-processed data.
 



# Feature Engine - Unit 08 - Create your own transformer

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Create your own transformer that can be arranged into a pipeline



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

And load our typical packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
from sklearn.pipeline import Pipeline

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Create your own transformer

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> What if, for your existing project, you couldn't find a built-in transformer that satisfies your project needs?
* You can create a transformer. Your custom transformer will be a Python Class, which is a topic you are already familiar with!

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Before defining your custom transformer, note that all transformers in scikit-learn (and scikit-learn compatible libraries, like feature-engine) are implemented as Python classes, each with its own attributes and methods. 
* Our custom transformer must also be implemented as a class with the same methods, like fit(), transform() and fit_transform(). We will write fit() and transform() ourselves and inherit the rest, such as fit_transform(), from two scikit-learn base classes: TransformerMixin and BaseEstimator. 


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> For that, we will need two base classes from Scikit-learn. 
* `BaseEstimator`: According to the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html), it is a "base class for all estimators in scikit-learn". We will not focus on the technical aspects, only the frame, as it contains the core of what a transformer should have.
* `TransformerMixin`: According to the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html), it is a Mixin class for all transformers in scikit-learn.

from sklearn.base import BaseEstimator, TransformerMixin

In feature-engine (and scikit-learn), we have a transformer that replaces the missing value with the mean. But let's imagine it didn't, and we want to create `MyCustomTransformerForMeanImputation()`
* Let's follow along with the code's comments to understand the steps

import pandas as pd # to use .mean()

# We will define three methods for the class: __init__, fit and transform
# The fit_transform() will be inherited since we are using BaseEstimator and TransformerMixin

# Define your transformer name, and as an argument inherit the base classes
class MyCustomTransformerForMeanImputation(BaseEstimator, TransformerMixin):

  #### Here, you define the variables you need to parse when you initialize the class
  def __init__(self, variables):
    # We make sure the variables will be a list, even if only one element
    if not isinstance(variables, list): 
      self.variables = [variables]
    else: self.variables = variables

  #### Here is where the learning happens. We perform the operation we are interested in
  #### In this case, calculate the mean
  def fit(self, X, y=None):
   
    # We want to keep the mean value in a dictionary
    self.imputer_dict_ = {}
      
    # loop over each variable, calculate the mean and save it in the dictionary.  
    for feature in self.variables:
        self.imputer_dict_[feature] = X[feature].mean()
    
    return self

  #### Here, you transform the variables based on what you learned in the .fit()
  #### You can transform into the train set, test set or real-time data
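  def transform(self, X):
    # work on a copy so the DataFrame passed in is not modified
    X = X.copy()
    # loop over the variables and .fillna() in a given feature based on the 
    # mean of a given feature
    for feature in self.variables:
      # assign the result back: an inplace fillna on a selected column
      # may not update X in recent pandas versions
      X[feature] = X[feature].fillna(self.imputer_dict_[feature])
      
    return X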

You may create a custom transformer where ``.fit()`` doesn't need to learn anything. For example, imagine you want to apply the upper case method to all the variables. You don't need to learn that from the data; you just need to execute it.
* Let's create this transformer and call it `ConvertUpperCase()`

# The comments relate to the new concepts for this exercise

class ConvertUpperCase(BaseEstimator, TransformerMixin):
  def __init__(self, variables):
    if not isinstance(variables, list): 
      self.variables = [variables]
    else: self.variables = variables

  # We don't need to learn anything here; we just return self
  # We need to do that anyway to be compatible with scikit-learn format
  def fit(self, X, y=None):
      return self

  # Here, we convert the variables using a method called .upper()
  # We loop over all the variables, check if it is an object, and then use a lambda function...
  # ...to apply .upper() to all rows
  def transform(self, X):
    for feature in self.variables:
      if X[feature].dtype == 'object':
        X[feature] = X[feature].apply(lambda x: x.upper())
      else:
        print(f"Warning: {feature} data type should be object to use ConvertUpperCase()")

    return X
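
The transformer above applies `.upper()` row by row, which assumes every value is a string. As an alternative sketch, the pandas `.str.upper()` accessor does the same conversion and leaves missing values as missing. The series below is made up for illustration.

```python
import pandas as pd

# Toy column (hypothetical) with one missing value
sex = pd.Series(["male", "female", None], dtype="object")

# Vectorised upper-casing; the missing entry stays missing instead of raising an error
sex_upper = sex.str.upper()

print(sex_upper)
```

This is one reason the imputation step is placed before the upper-case step in a pipeline when you keep the row-by-row version.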

We will use the penguin dataset. It has records for three different species of penguins collected from 3 islands in the Palmer Archipelago, Antarctica. 
* We check for missing data

df = sns.load_dataset('penguins')
df.isnull().sum()

And inspect the DataFrame

df.head()

We are interested in:
* Cleaning the missing data with `MyCustomTransformerForMeanImputation()` on the numerical variables and `CategoricalImputer()` for categorical variables
* Next, we want to make all words from the 'sex' column upper case. We will use our own transformer: ConvertUpperCase()


We set the pipeline using these rules in three steps. Then we run `.fit_transform()`
* Once we inspect the data with .head(), we notice the `'sex'` variable has all letters in upper case!

from feature_engine.imputation import CategoricalImputer

pipeline = Pipeline([
      ( 'custom_transf', MyCustomTransformerForMeanImputation(variables=['bill_length_mm',
                                                                         'bill_depth_mm',
                                                                         'flipper_length_mm',
                                                                         'body_mass_g'] )),
                     
      ( 'categorical_imputer', CategoricalImputer(imputation_method='missing',
                                                  fill_value='Missing',
                                                  variables=['sex']) ),
      
      ('upper_case' , ConvertUpperCase(variables=['sex'])),
])

df_transformed = pipeline.fit_transform(df)
df_transformed.head()

Let's check if the numerical data is cleaned
* It is cleaned!

df_transformed.isnull().sum()

We now check the mean values from the original data

df[['bill_length_mm' , 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']].mean()

And the learned mean values from the `MyCustomTransformerForMeanImputation()` dictionary. We access the 'custom_transf' step and check the attribute `.imputer_dict_`, which is the dictionary where we stored the mean values in the `.fit()` method 

pipeline['custom_transf'].imputer_dict_
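
As a sketch of why the means are learned in `.fit()` and reused in `.transform()`, the example below fits a compact mean imputer on one toy DataFrame and applies it to another. `ToyMeanImputer`, `train` and `test` are hypothetical names and values for illustration; the class is a shortened variant of the transformer above, not a replacement for it.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ToyMeanImputer(BaseEstimator, TransformerMixin):
    """Compact mean imputer, for illustration only."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # Learn one mean per variable from the data passed to fit()
        self.imputer_dict_ = {var: X[var].mean() for var in self.variables}
        return self

    def transform(self, X):
        # Fill missing values with the learned means, on a copy of the input
        X = X.copy()
        for var in self.variables:
            X[var] = X[var].fillna(self.imputer_dict_[var])
        return X

# Toy train and test sets (made-up values)
train = pd.DataFrame({"mass": [10.0, 20.0, np.nan, 30.0]})
test = pd.DataFrame({"mass": [np.nan, 100.0]})

imputer = ToyMeanImputer(variables=["mass"])
imputer.fit(train)                    # the mean is learned from the train set only
test_filled = imputer.transform(test)

print(imputer.imputer_dict_)
print(test_filled)
```

The missing value in `test` is filled with the mean learned from `train`, not with the mean of `test`, which is the behaviour you want when transforming a test set or real-time data.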

# Feature Engine - Unit 09 - Custom functions for Data Cleaning and Feature Engineering Workflow

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Understand and use custom functions for data cleaning and feature engineering workflow, using feature-engine transformers



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

And load our typical packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
from sklearn.pipeline import Pipeline

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Custom functions for Data Cleaning and Feature Engineering Workflow

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> You probably noticed the exercises from the previous units took time and energy. There is no fixed recipe, only guidelines.
* This is the reason that data practitioners spend a lot of energy and time in data cleaning and feature engineering the variables


<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Result.png
"> We created a custom function, made with specific feature-engine transformers, to help you be more effective during the Data Cleaning and Feature Engineering stage. We will instruct you on how we expect you to use and interpret it.

* We will present two functions to you now, and we will use them in Walkthrough Project 02.
  * `DataCleaningEffect()`
  * `FeatureEngineeringAnalysis()`


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> These custom functions were delivered specially for this specialisation. The functions' logic and usability were tested and reviewed extensively; however, bugs may appear.



---

* We will not focus on explaining the code itself but focus on the functionality and instruct how we could use it

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png
">
 `DataCleaningEffect()`
* Function objective: assess the effect of cleaning the data, when:
  * imputing the mean, median or an arbitrary number in a numerical variable
  * replacing missing values with 'Missing' or the most frequent category in a categorical variable
* Parameters: `df_original`: data not cleaned, `df_cleaned`: cleaned data, `variables_applied_with_method`: variables where you applied a given method

  * It is understandable if, at first, you don't understand all the code used in the function below. The point is to make sense of the pseudo-code and understand the function parameters.

import seaborn as sns
sns.set(style="whitegrid")
import matplotlib.pyplot as plt

def DataCleaningEffect(df_original,df_cleaned,variables_applied_with_method):

  flag_count=1 # Indicate plot number
  
  # distinguish between numerical and categorical variables
  categorical_variables = df_original.select_dtypes(exclude=['number']).columns 

  # scan over the variables where you applied the method:
    # if the variable is numerical, plot a histogram
    # if the variable is categorical, plot a barplot
  for set_of_variables in [variables_applied_with_method]:
    print("\n=====================================================================================")
    print(f"* Distribution Effect Analysis After Data Cleaning Method in the following variables:")
    print(f"{set_of_variables} \n\n")
  

    for var in set_of_variables:
      if var in categorical_variables:  # it is categorical variable: barplot
        
        df1 = pd.DataFrame({"Type":"Original","Value":df_original[var]})
        df2 = pd.DataFrame({"Type":"Cleaned","Value":df_cleaned[var]})
        dfAux = pd.concat([df1, df2], axis=0)
        fig , axes = plt.subplots(figsize=(15, 5))
        sns.countplot(hue='Type', data=dfAux, x="Value",palette=['#432371',"#FAAE7B"])
        axes.set(title=f"Distribution Plot {flag_count}: {var}")
        plt.xticks(rotation=90)
        plt.legend() 

      else: # it is numerical variable: histogram

        fig , axes = plt.subplots(figsize=(10, 5))
        sns.histplot(data=df_original, x=var, color="#432371", label='Original', kde=True,element="step", ax=axes)
        sns.histplot(data=df_cleaned, x=var, color="#FAAE7B", label='Cleaned', kde=True,element="step", ax=axes)
        axes.set(title=f"Distribution Plot {flag_count}: {var}")
        plt.legend() 

      plt.show()
      flag_count+= 1
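
The plots can be complemented with a quick numeric comparison. The sketch below builds a made-up `df_original` and `df_cleaned` pair (median imputation on a single column) and places their summary statistics side by side. The data and the name `summary` are assumptions for illustration and are not part of the function above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) with two missing values
df_original = pd.DataFrame({"age": [20.0, 30.0, np.nan, 40.0, np.nan, 90.0]})

# Clean it with median imputation
df_cleaned = df_original.copy()
df_cleaned["age"] = df_cleaned["age"].fillna(df_original["age"].median())

# Side-by-side summary statistics before and after cleaning
summary = pd.DataFrame({
    "original": df_original["age"].describe(),
    "cleaned": df_cleaned["age"].describe(),
})

print(summary)
```

In this toy case the count goes up and the standard deviation goes down after imputation, which is the kind of distribution change the plots are meant to reveal.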

---


<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png
">
 `FeatureEngineeringAnalysis()`
* Function objective: apply a set of transformers, defined by the user, for a given set of variables
* Parameters: `df`: data, `analysis_type`: one of `['numerical', 'ordinal_encoder', 'outlier_winsorizer']`


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> You should pass variables with the proper data types for your analysis. For example, you should pass only numerical variables when selecting 'numerical' for analysis_type

  * It is understandable if, at first, you don't understand all the code used in the function below. The point is to make sense of the pseudo-code and understand the function parameters.

from feature_engine import transformation as vt
from feature_engine.outliers import Winsorizer
from feature_engine.encoding import OrdinalEncoder
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
sns.set(style="whitegrid")
import warnings
warnings.filterwarnings('ignore')



def FeatureEngineeringAnalysis(df,analysis_type=None):


  """
  - used for quick feature engineering on numerical and categorical variables
  to decide which transformation can better transform the distribution shape 
  - Once transformed, use a reporting tool, like pandas-profiling, to evaluate distributions

  """
  check_missing_values(df)
  allowed_types= ['numerical', 'ordinal_encoder',  'outlier_winsorizer']
  check_user_entry_on_analysis_type(analysis_type, allowed_types)
  list_column_transformers = define_list_column_transformers(analysis_type)
  
  
  # Loop over each variable and engineer the data according to the analysis type
  df_feat_eng = pd.DataFrame([])
  for column in df.columns:
    # create additional columns (column_method) to apply the methods
    df_feat_eng = pd.concat([df_feat_eng, df[column]], axis=1)
    for method in list_column_transformers:
      df_feat_eng[f"{column}_{method}"] = df[column]
      
    # Apply transformers in their respective column_transformers
    df_feat_eng, list_applied_transformers = apply_transformers(analysis_type, df_feat_eng, column)

    # For each variable, assess how the transformations perform
    transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng)

  return df_feat_eng


def check_user_entry_on_analysis_type(analysis_type, allowed_types):
  ### Check analysis type
  if analysis_type is None:
    raise SystemExit(f"You should pass analysis_type parameter as one of the following options: {allowed_types}")
  if analysis_type not in allowed_types:
    raise SystemExit(f"analysis_type argument should be one of these options: {allowed_types}")

def check_missing_values(df):
  if df.isna().sum().sum() != 0:
    raise SystemExit(
        "There are missing values in your dataset. Please handle that before getting into feature engineering.")



def define_list_column_transformers(analysis_type):
  ### Set suffix columns according to analysis_type
  if analysis_type=='numerical':
    list_column_transformers = ["log_e","log_10","reciprocal", "power","box_cox","yeo_johnson"]
  
  elif analysis_type=='ordinal_encoder':
    list_column_transformers = ["ordinal_encoder"]

  elif analysis_type=='outlier_winsorizer':
    list_column_transformers = ['iqr']

  return list_column_transformers



def apply_transformers(analysis_type, df_feat_eng, column):


  for col in df_feat_eng.select_dtypes(include='category').columns:
    df_feat_eng[col] = df_feat_eng[col].astype('object')


  if analysis_type=='numerical':
    df_feat_eng,list_applied_transformers = FeatEngineering_Numerical(df_feat_eng,column)
  
  elif analysis_type=='outlier_winsorizer':
    df_feat_eng,list_applied_transformers = FeatEngineering_OutlierWinsorizer(df_feat_eng,column)

  elif analysis_type=='ordinal_encoder':
    df_feat_eng,list_applied_transformers = FeatEngineering_CategoricalEncoder(df_feat_eng,column)

  return df_feat_eng,list_applied_transformers



def transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng):
  # For each variable, assess how the transformations perform
  print(f"* Variable Analyzed: {column}")
  print(f"* Applied transformation: {list_applied_transformers} \n")
  for col in [column] + list_applied_transformers:
    
    if analysis_type!='ordinal_encoder':
      DiagnosticPlots_Numerical(df_feat_eng, col)
    
    else:
      if col == column: 
        DiagnosticPlots_Categories(df_feat_eng, col)
      else:
        DiagnosticPlots_Numerical(df_feat_eng, col)

    print("\n")



def DiagnosticPlots_Categories(df_feat_eng, col):
  plt.figure(figsize=(20, 5))
  sns.countplot(data=df_feat_eng, x=col,palette=['#432371'],order = df_feat_eng[col].value_counts().index)
  plt.xticks(rotation=90) 
  plt.suptitle(f"{col}", fontsize=30,y=1.05)        
  plt.show();
  print("\n")



def DiagnosticPlots_Numerical(df, variable):
  fig, axes = plt.subplots(1, 3, figsize=(20, 6))
  sns.histplot(data=df, x=variable, kde=True,element="step",ax=axes[0]) 
  stats.probplot(df[variable], dist="norm", plot=axes[1])
  sns.boxplot(x=df[variable],ax=axes[2])
  
  axes[0].set_title('Histogram')
  axes[1].set_title('QQ Plot')
  axes[2].set_title('Boxplot')
  fig.suptitle(f"{variable}", fontsize=30,y=1.05)
  plt.show();


def FeatEngineering_CategoricalEncoder(df_feat_eng,column):
  list_methods_worked = []
  try:  
    encoder= OrdinalEncoder(encoding_method='arbitrary', variables = [f"{column}_ordinal_encoder"])
    df_feat_eng = encoder.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_ordinal_encoder")
  
  except: 
    df_feat_eng.drop([f"{column}_ordinal_encoder"],axis=1,inplace=True)
    
  return df_feat_eng,list_methods_worked


def FeatEngineering_OutlierWinsorizer(df_feat_eng,column):
  list_methods_worked = []

  ### Winsorizer iqr
  try: 
    disc=Winsorizer(
        capping_method='iqr', tail='both', fold=1.5, variables = [f"{column}_iqr"])
    df_feat_eng = disc.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_iqr")
  except: 
    df_feat_eng.drop([f"{column}_iqr"],axis=1,inplace=True)


  return df_feat_eng,list_methods_worked




def FeatEngineering_Numerical(df_feat_eng,column):

  list_methods_worked = []

  ### LogTransformer base e
  try: 
    lt = vt.LogTransformer(variables = [f"{column}_log_e"])
    df_feat_eng = lt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_log_e")
  except: 
    df_feat_eng.drop([f"{column}_log_e"],axis=1,inplace=True)

  ### LogTransformer base 10
  try: 
    lt = vt.LogTransformer(variables = [f"{column}_log_10"],base='10')
    df_feat_eng = lt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_log_10")
  except: 
    df_feat_eng.drop([f"{column}_log_10"],axis=1,inplace=True)

  ### ReciprocalTransformer
  try:
    rt = vt.ReciprocalTransformer(variables = [f"{column}_reciprocal"])
    df_feat_eng =  rt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_reciprocal")
  except:
    df_feat_eng.drop([f"{column}_reciprocal"],axis=1,inplace=True)

  ### PowerTransformer
  try:
    pt = vt.PowerTransformer(variables = [f"{column}_power"])
    df_feat_eng = pt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_power")
  except:
    df_feat_eng.drop([f"{column}_power"],axis=1,inplace=True)

  ### BoxCoxTransformer
  try:
    bct = vt.BoxCoxTransformer(variables = [f"{column}_box_cox"])
    df_feat_eng = bct.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_box_cox")
  except:
    df_feat_eng.drop([f"{column}_box_cox"],axis=1,inplace=True)


  ### YeoJohnsonTransformer
  try:
    yjt = vt.YeoJohnsonTransformer(variables = [f"{column}_yeo_johnson"])
    df_feat_eng = yjt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_yeo_johnson")
  except:
    df_feat_eng.drop([f"{column}_yeo_johnson"],axis=1,inplace=True)


  return df_feat_eng,list_methods_worked

We will present the use cases and interpretations so that you can conduct your data cleaning and feature engineering steps more effectively.

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Handle missing data

We assume that, at this point in your workplace project, you have already conducted an initial EDA of your data and know which variables require missing data handling.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png"> In this Data Cleaning exercise, we will follow these steps:

* 1 - Select an imputation method
* 2 - Select variables to apply the method to
* 3 - Create a separate DataFrame to apply the method
* 4 - Assess the effect on the variable distribution

Let's consider the Titanic dataset. It holds passenger records from the ship's only voyage.

df = sns.load_dataset('titanic').drop(['alive'],axis=1)
df.head()

We inspect the dataset and notice that some variables have the `'category'` data type. 
* Typically, categorical variables are stored as `'object'`, but sometimes the data is stored as `'category'` instead. 
* The feature-engine library handles a categorical variable properly when its data type is `'object'`. 

df.info()

We will loop over the variables whose data type is `'category'` and convert each one to `'object'`.

for col in df.select_dtypes(include='category'):
  df[col] = df[col].astype('object')
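
As a minimal sketch of the same conversion on a toy DataFrame (the `toy` data below is made up for illustration and is not the Titanic dataset), we can confirm that only the `'category'` column changes type and that missing values are preserved:

```python
import pandas as pd

# Hypothetical toy data: one 'category' column and one numerical column
toy = pd.DataFrame({
    "deck": pd.Categorical(["A", "B", None, "A"]),
    "fare": [7.25, 71.28, 8.05, 53.10],
})

# Convert every 'category' column to 'object'; other columns are untouched
for col in toy.select_dtypes(include="category"):
    toy[col] = toy[col].astype("object")

print(toy.dtypes)
```

The loop iterates over column names, since iterating a DataFrame yields its column labels.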

We check for missing data. 
* Both numerical and categorical variables have missing data.

df.isna().sum()
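
Absolute counts are easier to judge next to the share of rows they represent. The sketch below, on made-up toy data, shows one way to report both and keep only the variables that actually have missing data; the `summary` table is our own illustration, not a feature-engine function.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with different levels of missing data
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": ["C", None, None, None],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

missing_count = toy.isna().sum()                  # rows with missing data per variable
missing_pct = (toy.isna().mean() * 100).round(1)  # same information as a percentage

summary = pd.DataFrame({"count": missing_count, "pct": missing_pct})
# Keep only variables with missing data, highest share first
summary = summary[summary["count"] > 0].sort_values("pct", ascending=False)
print(summary)
```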

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Numerical

Using the methods we covered, you may impute with the mean, the median or an arbitrary number.
* For our exercise, we will assume we conducted an EDA and selected the median.

1 - Select an imputation method

from feature_engine.imputation import MeanMedianImputer

2 - Select the variables to apply the method to
* You have to make sure you are using numerical variables

variables_method = ['age']
variables_method

3 - Create a separate DataFrame to apply the method

imputer = MeanMedianImputer(imputation_method='median', variables=variables_method)
df_method = imputer.fit_transform(df)



4 - Assess the effect on the variable distribution
* The function plots the distribution before and after applying the method on the same Axes. This gives you insight into how different your variable would look after cleaning.
* We notice the "peak" in the variable distribution after median imputation, since all missing values receive the same value.

DataCleaningEffect(df_original=df,
                   df_cleaned=df_method,
                   variables_applied_with_method=variables_method)
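
To see where the "peak" comes from, here is a pandas-only sketch of median imputation on a made-up `age` series. It mirrors the idea of the imputer rather than its implementation: every missing value is replaced by the same number, so the count at the median grows and the spread shrinks.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with three missing values
age = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, np.nan], name="age")

median_age = age.median()             # missing values are skipped by default
age_imputed = age.fillna(median_age)  # every gap receives the same value

# Count how many observations sit exactly at the median before and after
n_at_median_before = int((age == median_age).sum())
n_at_median_after = int((age_imputed == median_age).sum())

print(f"median: {median_age}")
print(f"values at the median: {n_at_median_before} -> {n_at_median_after}")
print(f"standard deviation: {age.std():.2f} -> {age_imputed.std():.2f}")
```

The more missing data a variable has, the taller this artificial peak becomes, which is why the before and after plot is worth checking.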

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Categorical

In this exercise, we will replace missing data in categorical variables with the label 'Missing'.

1 - Select an imputation method

from feature_engine.imputation import CategoricalImputer

2 - Select variables to apply the method to
* You have to make sure you are using categorical variables

variables_method = ['embarked', 'deck', 'embark_town']
variables_method

3 - Create a separate DataFrame to apply the method

imputer = CategoricalImputer(imputation_method='missing',fill_value='Missing',
                             variables=variables_method)

df_method = imputer.fit_transform(df)

4 - Assess the effect on the variable distribution
* It was probably not a good idea to apply this method to these variables.
  * For deck, we might consider dropping the variable, since its level of missing data is high.
  * For embarked and embark_town, we may consider imputing with the most frequent category, since their levels of missing data are low.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> This exercise is only meant to give an idea of how this function works in practice.

DataCleaningEffect(df_original=df,
                   df_cleaned=df_method,
                   variables_applied_with_method=variables_method)
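
The two options discussed above can be compared on a small made-up series. This is a pandas-only sketch of the idea, not the imputer's implementation: adding a 'Missing' label creates a new category, while imputing with the most frequent category only enlarges an existing one.

```python
import pandas as pd

# Hypothetical embarkation codes with two missing values
embarked = pd.Series(["S", "C", None, "S", "Q", "S", None],
                     name="embarked", dtype="object")

# Option 1: flag missing data with its own label
with_label = embarked.fillna("Missing")

# Option 2: replace missing data with the most frequent category
most_frequent = embarked.mode()[0]
with_frequent = embarked.fillna(most_frequent)

print(with_label.value_counts())
print(with_frequent.value_counts())
```

When the level of missing data is low, the second option avoids creating a rare category with very few observations.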

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Feature Engineering

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png"> In this Feature Engineering exercise, we will follow these steps:

* 1 - Select variable(s)
* 2 - Create a separate DataFrame for the variable(s)
* 3 - Assess engineered variables distribution 


In your career, you will develop your own preferences and methods for data cleaning and feature engineering. As a starting point, we suggest beginning the feature engineering workflow by:
* Looking for categorical encoding
* Looking for handling outliers
* Looking for numerical transformation

---

Let's recap our dataset

df.head()

We can check missing data levels

df.isna().sum()

In the last section, we didn't impute any missing data in the original DataFrame (df); we just checked how it would look if we applied a given imputer.
* For the next exercise, we create a quick pipeline to manage missing data: it drops the feature with a lot of missing data (deck), imputes age with the median, and drops the rows that still have missing data.

from sklearn.pipeline import Pipeline
from feature_engine.selection import DropFeatures
from feature_engine.imputation import MeanMedianImputer
from feature_engine.imputation import DropMissingData

data_cleaning_pipeline = Pipeline([
      ( 'DropFeatures', DropFeatures(features_to_drop=['deck']) ),
      ( 'MeanMedianImputer', MeanMedianImputer(imputation_method='median', variables=['age']) ),
      ( 'DropMissingData', DropMissingData()),
])

df = data_cleaning_pipeline.fit_transform(df)

We check missing data levels again
* We are good to go for feature engineering

df.isna().sum()
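
The order of the three steps matters. The pandas-only sketch below reproduces the same logic on made-up toy data (it is an illustration of the idea, not of how the pipeline is implemented) and counts how many rows are lost.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'deck' is mostly missing, 'age' and 'embarked' only partly
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "embarked": ["S", "C", None, "S", "Q"],
    "deck": [None, "C", None, None, "E"],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

cleaned = (
    toy.drop(columns=["deck"])                                     # 1. drop the sparse feature
       .assign(age=lambda d: d["age"].fillna(d["age"].median()))   # 2. median imputation for age
       .dropna()                                                   # 3. drop rows still missing data
)

rows_dropped = len(toy) - len(cleaned)
print(f"rows dropped: {rows_dropped} of {len(toy)}")
print(f"rows left if we had only called dropna(): {len(toy.dropna())}")
```

Dropping the sparse feature and imputing first means the final step removes only the few rows that still have gaps, instead of most of the dataset.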

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Ordinal Encoder

Again, we assume that at this point in your workplace project, you have already done an EDA on the variables, so you know which variables to encode.

1 - Select variable(s)

variables_engineering= ['sex', 'embarked', 'who', 'embark_town']
variables_engineering

2 - Create a separate DataFrame for these variables

df_engineering = df[variables_engineering].copy()
df_engineering.head(3)

3 - Assess engineered variables distribution 
* When we encode a category as a number, the resulting variable will not be normally distributed. Its data type is numerical discrete (not continuous), and that is fine.

df_engineering = FeatureEngineeringAnalysis(df=df_engineering,analysis_type='ordinal_encoder')
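
To make the encoding concrete, here is a pandas-only sketch on a made-up series. It assigns integers in order of first appearance, which is one possible arbitrary assignment and an assumption of this sketch; the exact integers a given encoder assigns may differ.

```python
import pandas as pd

# Hypothetical categorical variable
who = pd.Series(["man", "woman", "woman", "child", "man"], name="who")

# factorize numbers each category in order of first appearance
codes, categories = pd.factorize(who)
mapping = {category: code for code, category in enumerate(categories)}

who_encoded = who.map(mapping)
print(mapping)
print(who_encoded.tolist())
```

The integers are labels only; their order carries no meaning unless the categories themselves are ordered.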

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Outlier 

Again, we assume that at this point in your workplace project, you have already done an EDA on the variables, so you know which variables to consider in this outlier analysis.

1 - Select variable(s)

variables_engineering = ['age', 'fare']
variables_engineering

2 - Create a separate DataFrame for the variable(s)

df_engineering = df[variables_engineering].copy()
df_engineering.head(3)

3 - Assess engineered variables distribution 
* For both variables, capping outliers with the IQR method did not make the distribution normal, but it did bring it closer to normal, which tends to help an ML model. You would therefore consider this step in your pipeline when age and fare are features. 

df_engineering = FeatureEngineeringAnalysis(df=df_engineering.dropna(),
                                            analysis_type='outlier_winsorizer')
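
The IQR capping rule with `fold=1.5` can be sketched with pandas alone. The values below are made up, and the sketch shows the arithmetic of the rule rather than the transformer's code: values outside Q1 - 1.5 * IQR and Q3 + 1.5 * IQR are pulled back to those limits, and no rows are removed.

```python
import pandas as pd

# Hypothetical fares with two extreme values on the right tail
fare = pd.Series([7.0, 8.0, 8.0, 10.0, 13.0, 15.0, 26.0, 30.0, 80.0, 512.0],
                 name="fare")

q1, q3 = fare.quantile(0.25), fare.quantile(0.75)
iqr = q3 - q1
fold = 1.5

lower = q1 - fold * iqr   # values below this limit are raised to it
upper = q3 + fold * iqr   # values above this limit are lowered to it

fare_capped = fare.clip(lower=lower, upper=upper)
n_capped = int((fare_capped != fare).sum())

print(f"limits: [{lower}, {upper}]")
print(f"values capped: {n_capped}")
```

Because the limits are learned from the data, in a real project they would be computed on the train set and then applied unchanged to new data.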

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Numerical

Again, we assume that at this point in your workplace project, you have already done an EDA on the variables, so you know which variables to try numerical transformations on.

1 - Select variable(s)

variables_engineering= ['fare']
variables_engineering

2 - Create a separate DataFrame for the variable(s)

df_engineering = df[variables_engineering].copy()
df_engineering.head(3)

3 - Assess engineered variables distribution 

* The function tries to transform a variable with each of the following transformers: Log (base e and base 10), Power, Reciprocal, Box Cox and Yeo Johnson. When a given transformation cannot be computed (e.g. the log transformation doesn't work for zero or negative values), the function skips that transformation for that variable.
* For fare, it was only possible to apply Power Transformer and Yeo Johnson.
* Yeo Johnson produces a distribution with fewer outliers; even though it is still not normally distributed, it is better than before. We would consider this transformer for the fare feature.

df_engineering = FeatureEngineeringAnalysis(df=df_engineering,analysis_type='numerical')
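
The reason only some transformations work for fare can be sketched from their mathematical domains. The helper `applicable_transformations` below is hypothetical and written for this illustration; its rules are the domains of the functions (with a square root assumed for the power transformation), not feature-engine's internal checks. The values are made up but, like fare, include a zero.

```python
import numpy as np
from scipy import stats

def applicable_transformations(values):
    """Return the transformations whose mathematical domain covers all values."""
    x = np.asarray(values, dtype=float)
    rules = {
        "log_e": (x > 0).all(),        # log needs strictly positive values
        "log_10": (x > 0).all(),
        "reciprocal": (x != 0).all(),  # 1/x is undefined at zero
        "power": (x >= 0).all(),       # assuming a square root (exponent 0.5)
        "box_cox": (x > 0).all(),      # Box Cox needs strictly positive values
        "yeo_johnson": True,           # defined for any real value
    }
    return [name for name, ok in rules.items() if ok]

# Hypothetical fare-like values: non-negative, right-skewed, with a zero
fare_like = np.array([0.0, 7.25, 8.05, 13.0, 26.0, 71.28, 512.33])
print(applicable_transformations(fare_like))

# Yeo Johnson via SciPy, to compare skewness before and after
fare_yj, fitted_lambda = stats.yeojohnson(fare_like)
skew_before = stats.skew(fare_like)
skew_after = stats.skew(fare_yj)
print(f"skewness: {skew_before:.2f} -> {skew_after:.2f}")
```

A single zero is enough to rule out the log, reciprocal and Box Cox transformations, which matches what we observed for fare.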

---