# Feature Engine - Unit 08 - Create your own transformer

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Create your own transformer that can be arranged into a pipeline



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Package for Learning

We load our usual packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
from sklearn.pipeline import Pipeline

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Create your own transformer

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> What if, for your existing project, you couldn't find a built-in transformer that satisfies your project needs?
* You can create your own. A custom transformer is a Python class, which is a topic you are already familiar with!

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Before we define a custom transformer, note that all transformers in scikit-learn (and scikit-learn compatible libraries, like feature-engine) are implemented as Python classes, each with its own attributes and methods. 
* Our custom transformer must therefore be a class that exposes the same methods, such as fit(), transform() and fit_transform(). We will write fit() and transform() ourselves and inherit the remaining methods from two scikit-learn base classes: TransformerMixin and BaseEstimator. 


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> For that, we will need 2 base classes from scikit-learn. 
* `BaseEstimator`: According to the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html), it is a "base class for all estimators in scikit-learn". We will not go into its internals; what matters here is that it provides `get_params()` and `set_params()`, which pipelines and other scikit-learn tools rely on.
* `TransformerMixin`: According to the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html), it is a mixin class for all transformers in scikit-learn. It provides `fit_transform()`, which calls `fit()` and then `transform()`.

from sklearn.base import BaseEstimator, TransformerMixin
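
To see what each base class contributes, here is a minimal sketch. `DoNothingTransformer` and `toy_df` are hypothetical names used only for illustration; the transformer learns nothing and changes nothing, so whatever else it can do comes from the base classes.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


# Hypothetical transformer for illustration: it learns nothing and changes nothing
class DoNothingTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, variables=None):
        self.variables = variables

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.copy()


# toy data, made up for this example
toy_df = pd.DataFrame({'a': [1, 2, 3]})
transformer = DoNothingTransformer(variables=['a'])

# fit_transform() is not defined above; it is inherited from TransformerMixin
result = transformer.fit_transform(toy_df)

# get_params() is inherited from BaseEstimator and reads the __init__ arguments
params = transformer.get_params()
print(params)
```

The class only defines `__init__`, `fit` and `transform`, yet `fit_transform()` and `get_params()` are both available.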

Feature-engine (and scikit-learn) already has a transformer that replaces missing values with the mean. But let's imagine it didn't, and that we want to create `MyCustomTransformerForMeanImputation()`
* Follow the comments in the code to understand the steps

import pandas as pd # to use .mean()

# we will define 3 methods for the class: __init__, fit and transform
# fit_transform() is inherited from TransformerMixin; get_params() and set_params() from BaseEstimator

# define your transformer name and as an argument inherit the base classes
class MyCustomTransformerForMeanImputation(BaseEstimator, TransformerMixin):

  #### here you define the variables you need to pass when you initialize the class
  def __init__(self, variables):
    # we make sure the variable will be a list, even if only one element
    if not isinstance(variables, list): 
      self.variables = [variables]
    else: self.variables = variables

  #### here is where the learning happens. We perform the operation we are interested in,
  #### in this case, calculate the mean
  def fit(self, X, y=None):
   
    # we want to keep the mean value in a dictionary
    self.imputer_dict_ = {}
      
    # loop over each variable, calculate the mean and save in the dictionary.  
    for feature in self.variables:
      self.imputer_dict_[feature] = X[feature].mean()
    
    return self

  #### here you transform the variables based on what you learned in the .fit()
  #### You can transform into the train set, test set or real-time data
  def transform(self, X):
    # work on a copy so the DataFrame passed in is not modified
    X = X.copy()
    # loop over the variables and fill the missing values of each feature
    # with the mean learned for that feature in .fit()
    for feature in self.variables:
      X[feature] = X[feature].fillna(self.imputer_dict_[feature])

    return X
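
The split between `.fit()` and `.transform()` matters most when the data you transform is not the data you learned from. The sketch below illustrates this under stated assumptions: `MeanImputerSketch` is a hypothetical, condensed version of the class above, and the `train` and `test` DataFrames are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


# Hypothetical, condensed version of the mean imputer, for illustration only
class MeanImputerSketch(BaseEstimator, TransformerMixin):
    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # learn the mean of each variable from the data passed to fit()
        self.imputer_dict_ = {feature: X[feature].mean() for feature in self.variables}
        return self

    def transform(self, X):
        # work on a copy and fill missing values with the learned means
        X = X.copy()
        for feature in self.variables:
            X[feature] = X[feature].fillna(self.imputer_dict_[feature])
        return X


# toy data, made up for this example
train = pd.DataFrame({'weight': [10.0, 20.0, np.nan, 30.0]})
test = pd.DataFrame({'weight': [np.nan, 100.0]})

imputer = MeanImputerSketch(variables=['weight'])
imputer.fit(train)

train_imputed = imputer.transform(train)
test_imputed = imputer.transform(test)

print(imputer.imputer_dict_)  # mean learned from the train set
print(test_imputed)           # the missing value is filled with the train mean
```

The missing value in `test` is filled with the mean learned from `train`, not with the mean of `test` itself.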

You may create a custom transformer that has nothing to learn in `.fit()`. For example, imagine you want to convert the text in some variables to upper case. There is nothing to learn from the data; you just need to execute the conversion.
* Let's create this transformer and call it `ConvertUpperCase()`

# The comments relate to the new concepts for this exercise

class ConvertUpperCase(BaseEstimator, TransformerMixin):
  def __init__(self, variables):
    if not isinstance(variables, list): 
      self.variables = [variables]
    else: self.variables = variables

  # we don't need to learn anything here, we just return self
  # we need to do that anyway to be compatible with scikit-learn format
  def fit(self, X, y=None):
      return self

  # here we convert the variables using a method called .upper()
  # We loop over all the variables, check if each one is an object, then use a lambda function...
  # ...to apply .upper() to all rows
  def transform(self, X):
    # work on a copy so the DataFrame passed in is not modified
    X = X.copy()
    for feature in self.variables:
      if X[feature].dtype == 'object':
        X[feature] = X[feature].apply(lambda x: x.upper())
      else:
        print(f"Warning: {feature} data type should be object to use ConvertUpperCase()")

    return X
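
The lambda in `transform()` assumes every value in the column is a string. The sketch below, on a made-up column, shows what happens when a missing value is present and one possible alternative, the pandas `.str` accessor. It is an illustration, not a replacement for the class above.

```python
import numpy as np
import pandas as pd

# toy column with a missing value, made up for this example
sex = pd.Series(['Male', np.nan, 'Female'], dtype='object')

# .apply(lambda x: x.upper()) fails on the missing value, because NaN is a float
try:
    sex.apply(lambda x: x.upper())
    apply_failed = False
except AttributeError:
    apply_failed = True

# the vectorised .str accessor upper-cases the strings and leaves NaN as NaN
sex_upper = sex.str.upper()

print(apply_failed)
print(sex_upper.tolist())
```

This is one reason to impute missing categorical values before `ConvertUpperCase()` runs in a pipeline.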

We will use the penguins dataset. It has records for 3 different species of penguins, collected from 3 islands in the Palmer Archipelago, Antarctica. 
* We check for missing data

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv"
df = pd.read_csv(url)
# df = sns.load_dataset('penguins')
df.isnull().sum()

And inspect the DataFrame

df.head()

We want to:
* impute the missing data with `MyCustomTransformerForMeanImputation()` on the numerical variables and `CategoricalImputer()` on the categorical variables
* then convert all words in the 'sex' column to upper case with our own transformer, `ConvertUpperCase()`


We build a pipeline with 3 steps that follow these rules. Then we run `.fit_transform()`
* When we inspect the data with `.head()`, we see that the `'sex'` variable has all letters in upper case!

from feature_engine.imputation import CategoricalImputer

pipeline = Pipeline([
      ( 'custom_transf', MyCustomTransformerForMeanImputation(variables=['bill_length_mm',
                                                                         'bill_depth_mm',
                                                                         'flipper_length_mm',
                                                                         'body_mass_g'] )),
                     
      ( 'categorical_imputer', CategoricalImputer(imputation_method='missing',
                                                  fill_value='Missing',
                                                  variables=['sex']) ),
      
      ('upper_case' , ConvertUpperCase(variables=['sex'])),
])

df_transformed = pipeline.fit_transform(df)
df_transformed.head()

Let's check whether any missing data remains
* None remains: the data is cleaned!

df_transformed.isnull().sum()

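We now check the mean values of the numerical variables in `df`. Mean imputation does not change a column's mean, so these are the values the transformer should have learned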

df[['bill_length_mm' , 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']].mean()

And here are the mean values learned by `MyCustomTransformerForMeanImputation()`. We access the 'custom_transf' step and check its `.imputer_dict_` attribute, which is the dictionary where the `.fit()` method stored the mean values

pipeline['custom_transf'].imputer_dict_
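
Once a pipeline is fitted, its learned values can be reused on rows it has never seen. The sketch below shows this on toy data. `MeanImputerSketch`, `UpperCaseSketch`, `toy_pipeline` and both DataFrames are hypothetical stand-ins made up for the example, and the imputer is placed last here only to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


# Hypothetical, condensed stand-ins for the two custom transformers
class UpperCaseSketch(BaseEstimator, TransformerMixin):
    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        for v in self.variables:
            X[v] = X[v].str.upper()
        return X


class MeanImputerSketch(BaseEstimator, TransformerMixin):
    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        self.imputer_dict_ = {v: X[v].mean() for v in self.variables}
        return self

    def transform(self, X):
        # DataFrame.fillna accepts a dict of {column: value} and returns a new DataFrame
        return X.fillna(value=self.imputer_dict_)


toy_pipeline = Pipeline([
    ('upper_case', UpperCaseSketch(variables=['sex'])),
    ('mean_imputer', MeanImputerSketch(variables=['mass'])),
])

# toy data, made up for this example
seen = pd.DataFrame({'mass': [1.0, 3.0, np.nan], 'sex': ['male', 'female', 'male']})
unseen = pd.DataFrame({'mass': [np.nan], 'sex': ['female']})

toy_pipeline.fit(seen)

# a fitted step can be reached by name, either by indexing or through named_steps
learned = toy_pipeline['mean_imputer'].imputer_dict_
same_step = toy_pipeline.named_steps['mean_imputer']

# transform() reuses the learned mean; nothing is re-fitted on the new rows
unseen_transformed = toy_pipeline.transform(unseen)
print(learned)
print(unseen_transformed)
```

The missing `mass` in `unseen` is filled with the mean learned from `seen`, and its `sex` value is upper-cased by the first step.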

