<a href="https://colab.research.google.com/github/stepthom/869_course/blob/main/pipeline/custom_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Custom Transformers

- Stephen W. Thomas
- Used for MMA 869, MMAI 869, and GMMA 869

For an overview of pipelines in general, read this tutorial first: [slides_pipeline.ipynb](https://github.com/stepthom/869_course/blob/main/pipeline/slides_pipeline.ipynb)

scikit-learn has many [great transformers](https://scikit-learn.org/stable/data_transforms.html) for your pipeline, but sometimes you need something custom.

This tutorial shows how to create custom transformers for your pipeline.



In [1]:
import pandas as pd
import numpy as np

# Suppress scientific notation when printing numpy arrays
np.set_printoptions(suppress=True)

# Make Jupyter display the result of every expression in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
import sklearn
print('The scikit-learn version is {}.'.format(sklearn.__version__))

The scikit-learn version is 0.22.2.post1.


In [3]:
df = pd.read_csv('https://raw.githubusercontent.com/stepthom/869_course/main/data/generated_german.csv')

# To keep this tutorial simple, and make things easy to print, let's only work 
# with a few features.
keep_features = ['City', 'Sex', 'Duration', 'Amount', 'BadCredit']

df = df[keep_features]
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   City       1000 non-null   object
 1   Sex        1000 non-null   object
 2   Duration   1000 non-null   int64 
 3   Amount     1000 non-null   int64 
 4   BadCredit  1000 non-null   int64 
dtypes: int64(3), object(2)
memory usage: 39.2+ KB


In [4]:
df.head()

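|   | City | Sex | Duration | Amount | BadCredit |
|---|------|-----|----------|--------|-----------|
| 0 | North Judithbury | M | 6 | 2104 | 0 |
| 1 | East Jill | F | 48 | 10712 | 1 |
| 2 | West Michael | M | 12 | 3773 | 0 |
| 3 | North Judithbury | M | 42 | 14188 | 0 |
| 4 | Ramirezstad | M | 24 | 8766 | 1 |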


In [5]:
from sklearn.model_selection import train_test_split

X = df.drop(['BadCredit'], axis=1)
y = df['BadCredit']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [6]:
from sklearn.base import BaseEstimator, TransformerMixin

# All custom transformers have the following basic structure.
# - The transformer is a class that inherits from BaseEstimator and TransformerMixin.
# - It has at least three methods: __init__(), fit(), and transform().
# - It can have more helper methods.
# - __init__() is called when you first create the transformer. You can
#   pass in custom parameter values here.
# - __init__() returns nothing.
# - fit() must return self.
# - transform() must return a DataFrame or a NumPy array.


class MyNumericCapper(BaseEstimator, TransformerMixin):

    # Caps a given numeric column(s) to a certain number
    # E.g., any value greater than max_val becomes max_val
    #
    # Note that there was a design choice: either have the user
    # pass in the names of the columns to operate on (which I've done here),
    # or just operate on all the columns (and have the user be responsible for
    # passing in a subset of the dataframe). There are pros and cons to each,
    # and there's not a single best answer.

    def __init__(self, num_cols=None, max_val=0):
        self.num_cols = num_cols
        self.max_val = max_val
        
    def fit(self, X, y=None):
        # Nothing to do during fit in this simple transformer!
        return self

    def transform(self, X, y=None):

        # It's common practice to copy X, so we don't accidentally modify the 
        # original and cause downstream confusion
        _X = X.copy()
        
        for col in self.num_cols:
            # clip() caps values above max_val and leaves missing values as NaN,
            # so a downstream imputer can still handle them.
            _X[col] = _X[col].clip(upper=self.max_val)
        
        return _X
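
The structure described in the comments above pays off because the two base classes give you several methods for free. Here is a minimal sketch of that, assuming a hypothetical `TinyCapper` class (a condensed stand-in for `MyNumericCapper`, not part of the original notebook) and a made-up three-row dataframe:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone


class TinyCapper(BaseEstimator, TransformerMixin):
    # Hypothetical, condensed stand-in for MyNumericCapper.
    def __init__(self, num_cols=None, max_val=0):
        self.num_cols = num_cols
        self.max_val = max_val

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        _X = X.copy()
        for col in self.num_cols:
            _X[col] = _X[col].clip(upper=self.max_val)
        return _X


# Made-up toy data, for illustration only.
toy = pd.DataFrame({'Duration': [6, 48, 60], 'Amount': [2104, 10712, 12305]})
capper = TinyCapper(num_cols=['Duration'], max_val=42)

# From BaseEstimator: parameters are discovered from the __init__ signature.
params = capper.get_params()

# From TransformerMixin: fit_transform() runs fit() and then transform().
capped = capper.fit_transform(toy)

# clone() builds a fresh, unfitted copy with the same parameters.
capper_copy = clone(capper)

print(params)
print(capped)
```

This only works because `__init__()` stores each argument unchanged under an attribute of the same name; that is what lets `get_params()` and `clone()` find the values again.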

In [7]:
# Another example, slightly more complicated.

class MyCategoryCoalescer(BaseEstimator, TransformerMixin):
    # Coalesces (smushes/condenses) rare levels of a categorical 
    # feature into "__OTHER__".
    #
    # Will leave the `keep_top` most frequent levels unchanged; the rest
    # will be changed to `"__OTHER__"`.
    #
    # Note that there was a design choice: either have the user
    # pass in the names of the columns to operate on (which I've done here),
    # or just operate on all the columns (and have the user be responsible for
    # passing in a subset of the dataframe). There are pros and cons to each,
    # and there's not a single best answer.
    
    def __init__(self, cat_cols=[], keep_top=25):
        self.cat_cols = cat_cols
        self.keep_top = keep_top
        
        # For each cat_col, this dict will hold a list of the most-frequent 
        # levels
        self.top_n_values = {}
            
    def get_top_n_values(self, X, col, n=5):
        # A helper function to do the actual work.

        # Get the sorted value counts for the column
        vc = X[col].value_counts(sort=True, ascending=False)

        # Get the actual values. Slicing also handles the case where the
        # column has fewer than n levels.
        top_values = list(vc.index[:n])

        # Debug printing.
        print("Top n={} values for column={}:".format(n, col))
        print(top_values)
        return top_values
    
    def fit(self, X, y=None):

        # Find the top n values for each categorical column
        for col in self.cat_cols:
            self.top_n_values[col] = self.get_top_n_values(X, col, n=self.keep_top)
        return self
    
    def transform(self, X, y=None):
        _X = X.copy()
        _X[self.cat_cols] = _X[self.cat_cols].astype('category')
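        for c in self.cat_cols:
            # A categorical column only accepts known categories, so register
            # "__OTHER__" before assigning it.
            _X[c] = _X[c].cat.add_categories('__OTHER__')
            # Any level that was not learned during fit() becomes "__OTHER__".
            _X.loc[~_X[c].isin(self.top_n_values[c]), c] = "__OTHER__"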
        return _X
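
To see the fit/transform split in isolation, here is a small sketch of the same coalescing idea on made-up data. It uses `Series.where()` instead of the categorical `.loc` assignment used in the class above, and the `train` and `test` series are hypothetical. The point is that the frequent levels are learned from the training data only and then applied unchanged to any other data:

```python
import pandas as pd

# Made-up data: 'E' never appears in the training series.
train = pd.Series(['A', 'A', 'A', 'B', 'B', 'C', 'D'], name='City')
test = pd.Series(['A', 'C', 'E'], name='City')

# "fit": learn the 2 most frequent levels from the training data only.
top_levels = list(train.value_counts().index[:2])

# "transform": keep the learned levels and replace everything else.
train_out = train.where(train.isin(top_levels), '__OTHER__')
test_out = test.where(test.isin(top_levels), '__OTHER__')

print(top_levels)
print(train_out.tolist())
print(test_out.tolist())
```

Levels that are rare in training (`'C'`) and levels never seen in training (`'E'`) both end up as `'__OTHER__'` in the test data.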

# The Pipeline

Now we create a pipeline that uses the above custom transformers.

In [8]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.compose import make_column_selector

numeric_transformer = Pipeline(steps=[
                                      
    # Here is where we add our custom capper!
    ('capper', MyNumericCapper(num_cols=['Duration'], max_val=42)),
    ('imputer', SimpleImputer()),

    # Not going to scale in this example, so we can visually inspect the results
    # of the capper more easily.
    #('scaler', StandardScaler()),
    ])

categorical_transformer = Pipeline(steps=[
                                          
      # Here is where we add our custom coalescer!
      ('coalescer', MyCategoryCoalescer(cat_cols=['City'], keep_top=5)),
      ('encoder', OneHotEncoder(handle_unknown='ignore')),
      ])

preprocessor1 = Pipeline(steps=[
      ('ct', ColumnTransformer(
        transformers=[
            ('cat', categorical_transformer, make_column_selector(dtype_include=['category', object])),
            ('num', numeric_transformer, make_column_selector(dtype_include=['number'])),
            ],
            remainder = 'passthrough', 
            sparse_threshold=0)),
    ])

# Note: we would normally include a classifier/regressor in our pipeline,
# but since this tutorial is just for learning about data preprocessing etc., 
# we will exclude it.

pipe1 = Pipeline(steps=[('preprocessor', preprocessor1)])
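
The `ColumnTransformer` above decides which columns go to which sub-pipeline by calling the two column selectors on the incoming dataframe. As a rough sketch of what that looks like, you can call the selectors directly on a made-up dataframe. The `toy` data and its explicit dtypes are assumptions chosen so the example does not depend on how pandas infers string columns:

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Made-up toy data with explicit dtypes for the two text columns.
toy = pd.DataFrame({
    'City': ['East Jill', 'Ramirezstad'],
    'Sex': ['F', 'M'],
    'Duration': [48, 24],
    'Amount': [10712, 8766],
}).astype({'City': 'category', 'Sex': object})

cat_selector = make_column_selector(dtype_include=['category', object])
num_selector = make_column_selector(dtype_include=['number'])

# A selector is a callable that returns the matching column names.
cat_cols = cat_selector(toy)
num_cols = num_selector(toy)

print(cat_cols)
print(num_cols)
```

Note that every selected categorical column is sent through the categorical sub-pipeline, so `Sex` is one-hot encoded too, even though the coalescer is configured to touch only `City`.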

In [9]:
# The pipe is created! Let's fit it to the training data.

pipe1 = pipe1.fit(X_train, y_train)

# Now, let's use the pipe to transform the training and testing data.
X_train_transformed = pipe1.transform(X_train)
X_test_transformed  = pipe1.transform(X_test)

Top n=5 values for column=City:
['North Judithbury', 'East Jill', 'New Roberttown', 'East Jessetown', 'Lisatown']


In [10]:
# Brilliant! Now, let's inspect the results to see what happened. Let's look at the first row of data.

# First, let's look at the "raw" data.
print("Raw Data:")
print(X_train.iloc[0, :])

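# Now, let's look at the same row, now transformed. The 10 columns are:
# 6 one-hot columns for City (the 5 kept cities plus "__OTHER__"),
# 2 one-hot columns for Sex, then Duration and Amount.
# Note that Duration has been capped from 60 to 42.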
print("\nTransformed Data:")
print(X_train_transformed[0,:])

Raw Data:
City        North Judithbury
Sex                        M
Duration                  60
Amount                 12305
Name: 29, dtype: object

Transformed Data:
[    0.     0.     0.     0.     1.     0.     0.     1.    42. 12305.]


In [11]:
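# Now, let's inspect the results of another row. North Noahstad is not one of
# the 5 most frequent cities, so it is coalesced into "__OTHER__" (the last
# City column). Duration (21) is below the cap, so it is unchanged.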

print("Raw Data:")
print(X_train.iloc[3, :])

print("\nTransformed Data:")
print(X_train_transformed[3,:])

Raw Data:
City        North Noahstad
Sex                      F
Duration                21
Amount                9005
Name: 557, dtype: object

Transformed Data:
[   0.    0.    0.    0.    0.    1.    1.    0.   21. 9005.]
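
If you are unsure which one-hot column corresponds to which level, the fitted encoder's `categories_` attribute lists the levels in column order. Here is a minimal sketch on a made-up dataframe (the `toy` data is hypothetical and the encoder is fitted on its own, outside the pipeline above):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up toy data, stored as plain Python objects.
toy = pd.DataFrame({
    'City': ['North Judithbury', '__OTHER__', 'East Jill'],
    'Sex': ['M', 'F', 'M'],
}, dtype=object)

encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(toy).toarray()

# categories_ holds one sorted array of levels per input column. The
# output columns follow this order, one input column after another.
city_levels, sex_levels = [c.tolist() for c in encoder.categories_]

print(city_levels)
print(sex_levels)
print(encoded[0])
```

The levels are sorted, and `"__OTHER__"` sorts after names that start with a capital letter, which is consistent with it being the last City column in the output above.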
