<a href="https://colab.research.google.com/github/stepthom/869_course/blob/main/piepline/custom_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Custom Transformers

- Stephen W. Thomas
- Used for MMA 869, MMAI 869, and GMMA 869

Read first: [slides_pipeline.ipynb](https://github.com/stepthom/869_course/blob/main/pipeline/slides_pipeline.ipynb)

Scikit-learn has many [great transformers](https://scikit-learn.org/stable/data_transforms.html) for your pipeline, but sometimes you need something custom.

This tutorial shows how to create custom transformers for your pipeline.



In [1]:
import pandas as pd
import numpy as np

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
import sklearn
print('The scikit-learn version is {}.'.format(sklearn.__version__))

The scikit-learn version is 0.22.2.post1.


In [3]:
df = pd.read_csv('https://raw.githubusercontent.com/stepthom/869_course/main/data/generated_german.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 58 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   UserID                                  1000 non-null   object 
 1   FirstName                               1000 non-null   object 
 2   LastName                                1000 non-null   object 
 3   DateOfBirth                             1000 non-null   object 
 4   Sex                                     1000 non-null   object 
 5   Street                                  1000 non-null   object 
 6   City                                    1000 non-null   object 
 7   LicensePlate                            1000 non-null   object 
 8   Married                                 1000 non-null   float64
 9   NumberPets                              1000 non-null   float64
 10  Duration                                1000 non-null   int64

In [6]:
df.head()

Unnamed: 0,UserID,FirstName,LastName,DateOfBirth,Sex,Street,City,LicensePlate,Married,NumberPets,Duration,Amount,InstallmentRatePercentage,ResidenceDuration,NumberExistingCredits,NumberPeopleMaintenance,OwnCar,ForeignWorker,CheckingAccountStatus.lt.0,CheckingAccountStatus.0.to.200,CheckingAccountStatus.gt.200,CheckingAccountStatus.none,CreditHistory.NoCredit.AllPaid,CreditHistory.ThisBank.AllPaid,CreditHistory.PaidDuly,CreditHistory.Delay,CreditHistory.Critical,Purpose.NewCar,Purpose.UsedCar,Purpose.Furniture.Equipment,Purpose.Radio.Television,Purpose.DomesticAppliance,Purpose.Repairs,Purpose.Education,Purpose.Vacation,Purpose.Retraining,Purpose.Business,Purpose.Other,OtherDebtorsGuarantors.None,OtherDebtorsGuarantors.CoApplicant,OtherDebtorsGuarantors.Guarantor,Property.RealEstate,Property.Insurance,Property.CarOther,Property.Unknown,OtherInstallmentPlans.Bank,OtherInstallmentPlans.Stores,OtherInstallmentPlans.None,Housing.Rent,Housing.Own,Housing.ForFree,Job.UnemployedUnskilled,Job.UnskilledResident,Job.SkilledEmployee,Job.Management.SelfEmp.HighlyQualified,EmploymentDuration,SavingsAccountBonds,BadCredit
0,218-84-8180,Christopher,Gray,1953-09-02,M,503 Linda Locks,North Judithbury,395C,0.0,0.0,6,2104,4,4,2,1,0,1,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,1,0,0,0,1,0,7.0,281.0,0
1,643-21-6917,Jennifer,Rocha,1999-09-30,F,42388 Burgess Meadow Suite 532,East Jill,012 PCY,1.0,0.0,48,10712,2,2,1,1,1,1,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,1,0,0,0,1,0,2.0,0.0,1
2,520-14-4890,Kyle,Cruz,1973-03-01,M,480 Erin Plain Suite 514,West Michael,7-F0482,0.0,2.0,12,3773,2,3,1,2,1,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,1,0,0,1,0,0,4.0,0.0,0
3,081-11-7963,Ryan,Romero,1975-10-17,M,52880 Burns Creek,North Judithbury,30Z J39,0.0,1.0,42,14188,2,4,1,2,1,1,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,1,0,0,1,0,4.0,0.0,0
4,463-16-4062,Robert,Spence,1969-05-15,M,78248 Brandt Plains,Ramirezstad,3-46578,0.0,0.0,24,8766,3,4,2,2,1,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,2.0,0.0,1


In [7]:
from sklearn.model_selection import train_test_split

X = df.drop(['BadCredit'], axis=1)
y = df['BadCredit']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
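from sklearn.base import BaseEstimator, TransformerMixin

class MyAgeGetter(BaseEstimator, TransformerMixin):
    # Despite its name, this transformer wraps an imputer (e.g., SimpleImputer)
    # so that the imputed numeric columns stay inside a DataFrame.

    def __init__(self, num_cols=None, imputer=None):
        self.num_cols = num_cols
        self.imputer = imputer

    def fit(self, X, y=None):
        # Learn the imputation values from the numeric columns only
        self.imputer.fit(X[self.num_cols], y)
        return self

    def transform(self, X, y=None):
        # Work on a copy so the caller's DataFrame is not modified
        _X = X.copy()

        _X[self.num_cols] = self.imputer.transform(_X[self.num_cols])

        # Preserve data types of original dataframe
        for col in self.num_cols:
            _X[col] = _X[col].astype(X.dtypes[col])

        return _X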
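
The wrapper above exists because scikit-learn imputers return a NumPy array, while the rest of this tutorial works with DataFrames. The following is a minimal usage sketch on toy data, not the course's reference implementation. The class name `ImputerWrapper`, the toy columns, and the use of `clone` (so the caller's imputer is left unfitted) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.impute import SimpleImputer


class ImputerWrapper(BaseEstimator, TransformerMixin):
    # Hypothetical name: a compact variant of the imputer wrapper idea.

    def __init__(self, num_cols=None, imputer=None):
        self.num_cols = num_cols
        self.imputer = imputer

    def fit(self, X, y=None):
        # Fit a clone so the imputer passed in by the caller stays untouched
        self.imputer_ = clone(self.imputer).fit(X[self.num_cols], y)
        return self

    def transform(self, X, y=None):
        _X = X.copy()
        # Write the imputed values back into the DataFrame columns
        _X[self.num_cols] = self.imputer_.transform(_X[self.num_cols])
        return _X


# Toy data (made up): one numeric column with a missing value
toy = pd.DataFrame({"NumberPets": [0.0, 2.0, np.nan], "City": ["A", "B", "C"]})

wrapper = ImputerWrapper(num_cols=["NumberPets"],
                         imputer=SimpleImputer(strategy="median"))
filled = wrapper.fit_transform(toy)
print(filled)
```

The result is still a DataFrame with the original column names, which makes it easy to chain further DataFrame-based transformers after it.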


In [10]:
from sklearn.base import BaseEstimator, TransformerMixin

# All custom transformers have the following basic structure.
# - It is a class that inherits from BaseEstimator and TransformerMixin
# - It has at least three methods: __init__(), fit(), and transform()
# - It can have more helper methods
# - The __init__() is called when you first create the transformer. You can
#   pass in custom parameter values here.
# - The __init__() method returns nothing.
# - The fit() method must return self.
# - The transform() method must return a DataFrame or a NumPy array.


class MyNumericCapper(BaseEstimator, TransformerMixin):

    # Caps a given numeric column(s) to a certain number
    # E.g., any value greater than max_val becomes max_val

    def __init__(self, num_cols=None, max_val=0):
        self.num_cols = num_cols
        self.max_val = max_val
        
    def fit(self, X, y=None):
        # Nothing to do during fit in this simple transformer!
        return self

    def transform(self, X, y=None):

        # It's common practice to copy X, so we don't accidentally modify the 
        # original and cause downstream confusion
        _X = X.copy()
        
        for col in self.num_cols:
            _X[col] =  _X[col].apply(lambda x: x if x <= self.max_val else self.max_val)
        
        return _X
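
As a possible alternative to the row-by-row `apply`, the same capping can be sketched with the vectorized pandas `clip` method. The class name `ClipCapper` and the toy data are hypothetical. One behavioral difference to be aware of: `clip` leaves missing values as `NaN`, whereas the lambda version above replaces them with `max_val` (because `NaN <= max_val` is `False`).

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ClipCapper(BaseEstimator, TransformerMixin):
    # Hypothetical vectorized variant of a numeric capper.

    def __init__(self, num_cols=None, max_val=0):
        self.num_cols = num_cols
        self.max_val = max_val

    def fit(self, X, y=None):
        # Nothing to learn: the cap is supplied by the user
        return self

    def transform(self, X, y=None):
        _X = X.copy()
        # Cap every listed column at max_val in one vectorized call
        _X[self.num_cols] = _X[self.num_cols].clip(upper=self.max_val)
        return _X


# Toy data (made up) with one value above the cap and one missing value
toy = pd.DataFrame({"Duration": [6.0, 60.0, np.nan],
                    "Amount": [2104, 12305, 3773]})

capped = ClipCapper(num_cols=["Duration"], max_val=42).fit_transform(toy)
print(capped)
```

Only `Duration` is touched; `Amount` passes through unchanged, and the original `toy` DataFrame is not modified.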

In [26]:
# Another example, slightly more complicated.

class MyCategoryCoalescer(BaseEstimator, TransformerMixin):
    # Coalesces (smushes/condenses) rare levels of a categorical 
    # feature into "__OTHER__".
    #
    # Will leave the `keep_top` most frequent levels unchanged; the rest
    # will be changed to `"__OTHER__"`.
    def __init__(self, cat_cols=[], keep_top=25):
        self.cat_cols = cat_cols
        self.keep_top = keep_top
        
        # For each cat_col, this dict will hold a list of the most-frequent 
        # levels
        self.top_n_values = {}
            
    def get_top_n_values(self, X, col, n=5):
        # A helper function to do the actual work.

        # Get the sorted value counts for the column
        vc = X[col].value_counts(sort=True, ascending=False)

        # Get the actual values, keeping at most the n most frequent
        vals = list(vc.index)
        return vals[:n]
    
    def fit(self, X, y=None):
        # Find the top n values for each categorical column
        for col in self.cat_cols:
            self.top_n_values[col] = self.get_top_n_values(X, col, n=self.keep_top)
        return self
    
    def transform(self, X, y=None):
        _X = X.copy()
        _X[self.cat_cols] = _X[self.cat_cols].astype('category')
        for c in self.cat_cols:
            _X[c] = _X[c].cat.add_categories('__OTHER__')
            _X.loc[~_X[c].isin(self.top_n_values[c]), c] = "__OTHER__"
        return _X
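
To see why the top levels are learned in `fit()` and only applied in `transform()`, here is a small sketch on toy data. It is a simplified variant, not the class above: the name `TopLevelKeeper`, the learned attribute `top_levels_`, and the use of `Series.where` are assumptions chosen for brevity. The key point is that a level never seen during fitting (here `"Z"`) is mapped to `"__OTHER__"` at transform time.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class TopLevelKeeper(BaseEstimator, TransformerMixin):
    # Hypothetical compact variant of a category coalescer.

    def __init__(self, cat_cols=None, keep_top=25):
        self.cat_cols = cat_cols
        self.keep_top = keep_top

    def fit(self, X, y=None):
        # Learn the most frequent levels from the data passed to fit() only
        self.top_levels_ = {
            col: list(X[col].value_counts().index[: self.keep_top])
            for col in self.cat_cols
        }
        return self

    def transform(self, X, y=None):
        _X = X.copy()
        for col in self.cat_cols:
            keep = _X[col].isin(self.top_levels_[col])
            # Keep frequent levels; replace everything else
            _X[col] = _X[col].where(keep, "__OTHER__")
        return _X


# Toy data (made up): "A" and "B" are the two most frequent levels
train = pd.DataFrame({"City": ["A", "A", "A", "B", "B", "C"]})
test = pd.DataFrame({"City": ["A", "C", "Z"]})

coalescer = TopLevelKeeper(cat_cols=["City"], keep_top=2).fit(train)
train_out = coalescer.transform(train)
test_out = coalescer.transform(test)
print(test_out)
```

Because the levels come from the training data alone, the test set cannot leak information into the preprocessing step.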

# The Pipeline

Now we create a pipeline that uses the above custom transformers.

In [32]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score

# For our first pipeline, we're going to keep it very simple.

numeric_features = ['Amount', 'Duration', 'NumberPets', 
                    'ResidenceDuration', 'Married', 'EmploymentDuration']

categorical_features = ['City', 'Sex']

drop_features = ['UserID', 'DateOfBirth', 'FirstName', "LastName", 
                 'Street', 'LicensePlate']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ])

categorical_transformer = Pipeline(steps=[
      ('encoder', OneHotEncoder(handle_unknown='ignore')),
      ])

preprocessor1 = Pipeline(steps=[
      ('capper', MyNumericCapper(num_cols=['Duration'], max_val=42)),
      ('coalescer', MyCategoryCoalescer(cat_cols=['City'], keep_top=3)),
      ('ct', ColumnTransformer(
        transformers=[
            #('num', numeric_transformer, numeric_features),
            #('cat', categorical_transformer, categorical_features),
            ('drop', 'drop', drop_features)],
            remainder = 'passthrough', 
            sparse_threshold=0)),
    ])

# Note: we would normally include a classifier/regressor in our pipeline,
# but since this tutorial is just for learning about data preprocessing etc., 
# we will exclude it.

pipe1 = Pipeline(steps=[('preprocessor', preprocessor1)])
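
For reference, here is a minimal sketch of what a pipeline with a classifier as its final step could look like. It uses randomly generated toy data rather than the course dataset, and the step names (`"preprocessor"`, `"clf"`) and hyperparameters are illustrative assumptions, not recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (randomly generated, for illustration only)
rng = np.random.RandomState(42)
toy = pd.DataFrame({
    "Amount": rng.randint(500, 15000, size=40),
    "Duration": rng.randint(6, 60, size=40),
    "Sex": rng.choice(["M", "F"], size=40),
})
toy_y = rng.randint(0, 2, size=40)

# Scale the numeric columns and one-hot encode the categorical column
preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[("imputer", SimpleImputer()),
                            ("scaler", StandardScaler())]),
     ["Amount", "Duration"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
])

# The classifier is simply the last step of the pipeline
clf_pipe = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=42)),
])

clf_pipe.fit(toy, toy_y)
preds = clf_pipe.predict(toy)
print(preds[:5])
```

With a final estimator in place, you call `predict()` on the pipeline instead of `transform()`, and the preprocessing steps are applied automatically.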

In [33]:
# The pipe is created! Let's fit it to the training data.

pipe1 = pipe1.fit(X_train, y_train)

# Now, let's use the pipe to transform the training and testing data.
X_train_transformed = pipe1.transform(X_train)
X_test_transformed  = pipe1.transform(X_test)
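
By default, a `ColumnTransformer` returns a NumPy array, so the column names are lost after the transform. When the only operations are dropping columns and passing the rest through, the surviving names can be rebuilt by hand. The sketch below shows this on toy data; the variable names and the toy columns are hypothetical, and the approach assumes passed-through columns keep their original order.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer

# Toy data (made up) standing in for the real feature table
toy = pd.DataFrame({
    "UserID": ["218-84-8180", "643-21-6917"],
    "City": ["North Judithbury", "East Jill"],
    "Duration": [6, 48],
})
drop_cols = ["UserID"]

ct = ColumnTransformer(
    transformers=[("drop", "drop", drop_cols)],
    remainder="passthrough",
    sparse_threshold=0,
)
arr = ct.fit_transform(toy)  # a plain NumPy array, no column names

# Columns that were not dropped, in their original order
kept = [c for c in toy.columns if c not in drop_cols]
out = pd.DataFrame(arr, columns=kept, index=toy.index)
print(out)
```

This only works for the drop/passthrough case; once encoders or other transformers add or reorder columns, the names have to come from the transformers themselves.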

In [34]:
# Brilliant! Now, let's inspect the results to see what happened. Let's look at the first row of data.

# First, let's look at the "raw" data.
print("Raw Data:")
print(X_train.iloc[0, :])

# Now, let's look at the same row, now transformed.
print("\nTransformed Data:")
print(X_train_transformed[0,:])

Raw Data:
UserID                                                  757-62-4531
FirstName                                                   Matthew
LastName                                                       Wood
DateOfBirth                                              1959-05-16
Sex                                                               M
Street                                    8557 Parker Fort Apt. 351
City                                               North Judithbury
LicensePlate                                                233-RYW
Married                                                           0
NumberPets                                                        1
Duration                                                         60
Amount                                                        12305
InstallmentRatePercentage                                         3
ResidenceDuration                                                 4
NumberExistingCredits                 

In [36]:
# Brilliant! Now, let's inspect the results of another row.
# First, let's look at the "raw" data.
print("Raw Data:")
print(X_train.iloc[3, :])

# Now, let's look at the same row, now transformed.
print("\nTransformed Data:")
print(X_train_transformed[3,:])

Raw Data:
UserID                                                  082-46-9183
FirstName                                                      Lisa
LastName                                                    Hopkins
DateOfBirth                                              1991-11-14
Sex                                                               F
Street                                    4394 Mark Union Suite 375
City                                                 North Noahstad
LicensePlate                                                765-DWH
Married                                                           1
NumberPets                                                        1
Duration                                                         21
Amount                                                         9005
InstallmentRatePercentage                                         1
ResidenceDuration                                                 4
NumberExistingCredits                 

In [19]:
X_train.head(1).T

| | 29 |
|---|---|
| UserID | 757-62-4531 |
| FirstName | Matthew |
| LastName | Wood |
| DateOfBirth | 1959-05-16 |
| Sex | M |
| Street | 8557 Parker Fort Apt. 351 |
| City | North Judithbury |
| LicensePlate | 233-RYW |
| Married | 0 |
| NumberPets | 1 |
