<img src="images/logo.png" align='right' width=250px>

# Custom Transformer in Scikit-Learn

Scikit-Learn can be extended with functionality that is not included as standard in the library. 

By the end of this notebook you will be able to:

- Explain the benefits of using custom sklearn transformers to determine when you should use them
- Develop a custom transformer for a specific use case, adhering to the OOP design in scikit-learn
- Assess the functionality of a custom transformer on data and the impact that it has on model performance
- Examine the design principles of OOP in the scikit-learn framework to discover key patterns including attribute & method naming conventions

This notebook covers:
- [The benefits of custom scikit-learn Transformers](#benefits)
- [Building a custom Transformer on date columns](#date)
- [Building a custom Transformer to bin data](#binning)
- [The design of OOP in Sklearn](#design)

In [None]:
import pandas as pd

In [None]:
# Data cleaning
from sklearn.model_selection import train_test_split

# Pipelines
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Transformers
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

# Model
from sklearn.ensemble import RandomForestClassifier

# Metrics
from sklearn.metrics import roc_auc_score

## Data preparation

For this notebook, the stroke dataset will be used. 

In [None]:
stroke = pd.read_csv('data/stroke.csv').rename(columns=str.lower)

Let's prepare the data for machine learning by creating a feature matrix and target vector and performing a train-test split.

In [None]:
# Columns to treat
drop_cols = ['id']
target = 'stroke'

def create_Xy(df, drop_cols, target_col):
    df = df.drop(columns=drop_cols)
    return (
        df.drop(columns=target_col),
        df[target_col]
    )
    
# New feature matrix
X, y = (
    stroke
    .pipe(create_Xy, 
          drop_cols=drop_cols, 
          target_col=target,
          )
)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size = 0.25,
                                                    random_state = 123,
                                                    stratify = y,
                                                    )

## Creating a state feature

There is a column in the data, `address`, that isn't useful as it stands: it is too granular, with a unique value for every patient:

In [None]:
print(f"N unique addresses: {stroke['address'].nunique()}") 
print(f"N patients in the data: {stroke['id'].nunique()}")

So this column either needs to be dropped or treated. 

It is possible to extract a useful bit of information that could be predictive - the location (state) of the patient:


In [None]:
(
    stroke
    .assign(state = lambda df: df['address'].str.split().str.get(-2))
).head()
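To see why position `-2` picks out the state, here is a minimal sketch on two made-up addresses. The address format (street, city, then a two-letter state followed by a ZIP code) is an assumption for illustration only; the real `address` column may be laid out differently.

```python
import pandas as pd

# Hypothetical addresses: the format is assumed, not taken from the stroke data
toy = pd.DataFrame({
    "address": [
        "12 Oak Street, Springfield, IL 62704",
        "7 Pine Road, Austin, TX 73301",
    ]
})

# Split on whitespace, then take the second-to-last token (the state)
toy_states = toy["address"].str.split().str.get(-2)
print(toy_states.tolist())
```

If the state were not consistently the second-to-last token, the extraction would silently return the wrong word, so it is worth checking the extracted values before relying on them.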

This can be made into a pandas pipeline to make it easier to apply to multiple (`X_train` and `X_test`) dataframes:

In [None]:
def get_word_from_string(df, column, output_col, splitter=' ', word = -1, drop_column=True):
    """by default will split on whitespaces and return the last word
    """
    df = df.assign(**{f"{output_col}": df[column].str.split(splitter).str.get(word)})
    if drop_column:
        df = df.drop(columns=column)

    return df

In [None]:
X_train_state = X_train.pipe(get_word_from_string, 'address', 'state', word=-2)
X_test_state = X_test.pipe(get_word_from_string, 'address', 'state', word=-2)

## Implementing scikit-learn separately

Although this treatment of the `address` column is as necessary as using an imputer or one-hot encoding columns, at the moment, it is happening outside of scikit-learn.

After the transformation has been performed, we can use a scikit-learn pipeline to perform the remaining preprocessing (treating `state` as a categorical feature) and the modeling:

In [None]:
categorical_cols = ['work_type', 'smoking_status', 'who', 'gender', 'residence_type', 'state']
numeric_cols = ['age', 'hypertension', 'heart_disease', 'ever_married', 'avg_glucose_level', 'bmi']
missing_cols = ['age', 'bmi']

onehot = Pipeline(steps = [
    ('onehot', OneHotEncoder(drop = "if_binary", sparse_output=False)),
])

impute = Pipeline(steps = [
    ('impute', SimpleImputer(strategy ='mean')),
])

preprocessor = ColumnTransformer(transformers = [
    ('onehot', onehot, categorical_cols),
    ('impute', impute, numeric_cols)
], remainder = 'passthrough')

base_model = RandomForestClassifier(class_weight='balanced',
                                    max_depth=5,
                                    random_state=123,
                                    )

base_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', base_model)
])

base_pipeline.fit(X_train_state, y_train)


# Find the probabilities of stroke for AUC evaluation
y_train_probs = base_pipeline.predict_proba(X_train_state)[:,1]
y_test_probs = base_pipeline.predict_proba(X_test_state)[:,1]

print(f'AUC train: {roc_auc_score(y_train, y_train_probs)}')
print(f'AUC test: {roc_auc_score(y_test, y_test_probs)}')

### A better way

Currently, we are using a mix of pandas and scikit-learn to perform feature engineering.

However, a more structured and scalable approach is possible: because the modeling already happens in scikit-learn, the preprocessing can be performed there too, so that every step lives in a single pipeline.

# 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 

## Building a custom sklearn Transformer

To build your own custom Transformer, you will be extending the sklearn library. To do this, the concept of parent/child classes in OOP is very important as there are two classes you need to inherit from:
- [BaseEstimator](https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator): All sklearn Transformers (and Models) are built upon this fundamental base class. 
- [TransformerMixin](https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin): All Transformers are built upon this Mixin class. It's what links the `.fit()` and `.transform()` methods to make the `.fit_transform()` method!

*The design principles of sklearn, including the difference between Transformers, Models and Estimators will be covered later in this notebook!*

Let's first see how to build your own Transformer:

In [None]:
# import the required parent classes

from sklearn.base import BaseEstimator, TransformerMixin

#### A custom Transformer requires the following methods:

- `__init__`
   - Initialises the class when it is instantiated and can take optional parameters
   - Returns nothing (`__init__` must return `None`); it should only store its parameters as attributes
- `fit`
   - Learns whatever is needed to make the transformation (e.g. calculating the imputation value for `SimpleImputer`)
   - Must return the object itself (`return self`) to align with the sklearn OOP design
- `transform`
   - Performs the actual transformation of the data (e.g. filling the missing values with the value calculated in `.fit()`)
   - Must return the transformed data, both to adhere to the design pattern and so that the output can be used by later steps in the pipeline
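The three methods above can be sketched on a deliberately tiny example. `ColumnMeanFiller` is a hypothetical class written for illustration (it is not part of scikit-learn), and the `bmi` values are made up; it learns each column's mean in `fit` and uses it to fill missing values in `transform`.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted


class ColumnMeanFiller(BaseEstimator, TransformerMixin):
    """Hypothetical example: fill missing values with each column's mean."""

    def __init__(self):
        # No parameters to store in this sketch; __init__ returns None
        pass

    def fit(self, X, y=None):
        # Learn the column means; the trailing underscore marks a fitted attribute
        self.means_ = X.mean()
        return self

    def transform(self, X):
        # Raises NotFittedError if fit has not been called yet
        check_is_fitted(self)
        return X.fillna(self.means_)


toy = pd.DataFrame({"bmi": [20.0, np.nan, 30.0]})
filled = ColumnMeanFiller().fit(toy).transform(toy)
print(filled)
```

The missing value is replaced by 25.0, the mean of the two observed values. Storing what was learned in an attribute ending in `_` is the scikit-learn naming convention that `check_is_fitted` relies on.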


In [None]:
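from sklearn.utils.validation import check_is_fitted

class WordExtractor(BaseEstimator, TransformerMixin):
    """Split every string column on `splitter` and keep the word at position `word`."""

    def __init__(self, splitter, word, features_out):
        # Only store the parameters here. A truthiness check such as `not word`
        # would wrongly reject word=0 (the first word), so no validation is done.
        self.splitter = splitter
        self.word = word
        self.features_out = features_out

    def fit(self, X, y=None):
        # Nothing is learned from the data; record the number of input columns
        # so that check_is_fitted can tell the transformer has been fitted
        self.n_features_in_ = X.shape[1]
        return self

    def transform(self, X, y=None):
        # check if the transformer has been fitted
        check_is_fitted(self)

        # apply the split-and-select to each column of the DataFrame
        return X.apply(lambda col: col.str.split(self.splitter).str.get(self.word))

    def get_feature_names_out(self, input_features=None):
        return self.features_out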

In [None]:
word_extractor = WordExtractor(splitter = ' ', word = -2, features_out = ['state'])

Now the extractor can be used as you would use any Transformer from sklearn.

Note that the transformer can only be used on string columns, so it is required to select only string columns when using the Transformer in isolation.

In [None]:
word_extractor.fit(X_train[['address']])
word_extractor.transform(X_train[['address']]).head()

This is now a Transformer in sklearn, which means you can use the standard practices of the sklearn library, for example using `.fit_transform()` instead of `.fit()` followed by `.transform()`:

In [None]:
word_extractor.fit_transform(X_train[['address']]).head()

This is despite the Transformer having no `fit_transform` method included in the class. 

<mark>**Questions**</mark>

1. Why is `fit_transform` available as a method when it wasn't written as part of the `WordExtractor` class?

<details>
  <summary><span style="color:blue">Show answer</span></summary>
    
`WordExtractor` has inherited from the TransformerMixin, which includes the `fit_transform` method.
This can be seen in the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html) and in the [source code](https://github.com/scikit-learn/scikit-learn/blob/9e38cd00d032f777312e639477f1f52f3ea4b3b7/sklearn/base.py#L1006).

</details>

2. Can you think of any other methods or functionalities that we could use with this Transformer?

<details>
  <summary><span style="color:blue">Show answer</span></summary>
    
One option is to incorporate this Transformer in the `ColumnTransformer` to only apply it to the desired column.

</details>
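Beyond `fit_transform`, the parent classes provide other functionality for free. The sketch below uses a hypothetical `Suffixer` transformer (invented for illustration, operating on a plain list of strings) to show `get_params` and `set_params`, which come from `BaseEstimator`, and `clone`, which relies on them.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone


class Suffixer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that appends a suffix to every string in a list."""

    def __init__(self, suffix="!"):
        self.suffix = suffix

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [value + self.suffix for value in X]


suffixer = Suffixer(suffix="?")

# get_params is inherited from BaseEstimator and reads the __init__ arguments
params = suffixer.get_params()
print(params)

# clone builds an unfitted copy from those parameters; set_params updates them
suffixer_copy = clone(suffixer).set_params(suffix="!!")

# fit_transform is inherited from TransformerMixin
result = suffixer_copy.fit_transform(["a", "b"])
print(result)
```

Because `get_params` and `set_params` work out of the box, a custom Transformer written this way can also be cloned by `Pipeline` and `ColumnTransformer`, which is what makes the next section possible.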

## Writing a full pipeline with a Custom Transformer

Transformers in scikit-learn are designed to be applied to the entire dataset, unless passed through the ColumnTransformer.

So... can ColumnTransformer be used with this new custom word extractor Transformer?

<details>
  <summary><span style="color:blue">🚨 Spoiler alert!</span></summary>
    
Yes it can. Since the Transformer has been built using the scikit-learn standard, it absolutely can!

</details>


In [None]:
string_columns = ['address']

word_extractor_ct = ColumnTransformer(transformers = [
    ('word_extractor', word_extractor, string_columns),
], remainder = 'passthrough').set_output(transform="pandas")

word_extractor_ct.fit_transform(X_train).head()

The column names are a little over the top, when all that's been changed is `address` to `state`.

The `get_feature_names_out` method shows the names that are currently being generated:

In [None]:
word_extractor_ct.get_feature_names_out()

The ColumnTransformer can be updated so that this feature name is used by adding the parameter

`verbose_feature_names_out=False`

In [None]:
string_columns = ['address']

word_extractor_ct = ColumnTransformer(transformers = [
    ('word_extractor', word_extractor, string_columns),
], remainder = 'passthrough', verbose_feature_names_out=False).set_output(transform="pandas")

word_extractor_ct.fit_transform(X_train).head()

### Adding this to the full Pipeline

In [None]:
categorical_cols = ['work_type', 'smoking_status', 'who', 'gender', 'residence_type', 'state']
numeric_cols = ['age', 'hypertension', 'heart_disease', 'ever_married', 'avg_glucose_level', 'bmi']
missing_cols = ['age', 'bmi']

word_extractor_ct = ColumnTransformer(transformers = [
    ('word_extractor', word_extractor, string_columns),
], remainder = 'passthrough', verbose_feature_names_out=False).set_output(transform="pandas")

onehot = Pipeline(steps = [
    ('onehot', OneHotEncoder(drop = "if_binary", sparse_output=False)),
])

impute = Pipeline(steps = [
    ('impute', SimpleImputer(strategy ='mean')),
])

preprocessor = ColumnTransformer(transformers = [
    ('onehot', onehot, categorical_cols),
    ('impute', impute, numeric_cols)
], remainder = 'passthrough')

forest_model = RandomForestClassifier(class_weight='balanced',
                                    max_depth=5,
                                    random_state=123,
                                    )

base_pipeline = Pipeline(steps=[
    ('word_extractor', word_extractor_ct),
    ('preprocessor', preprocessor),
    ('model', forest_model)
])

base_pipeline.fit(X_train, y_train)


# Find the probabilities of stroke for AUC evaluation
y_train_probs = base_pipeline.predict_proba(X_train)[:,1]
y_test_probs = base_pipeline.predict_proba(X_test)[:,1]

print(f'AUC train: {roc_auc_score(y_train, y_train_probs)}')
print(f'AUC test: {roc_auc_score(y_test, y_test_probs)}')

## Conclusion

Building a **custom transformer** to extract substrings from a column, and integrating it through `ColumnTransformer`, shows how flexible and extensible the Scikit-Learn `Pipeline` is.

You can now carry out feature engineering and data preprocessing inside the pipeline, without a separate pandas step beforehand. Using transformers and pipelines makes the code more modular and the machine learning workflow easier to scale.