# Playground for Pipeline Slides

- Stephen W. Thomas
- Used for MMA 869, MMAI 869, and GMMA 869

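Pipelines are awesome: they chain preprocessing steps and an estimator into a single object, so the same transformations are applied consistently during training, cross-validation, and prediction.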

Think of pipelines as Legos, or building blocks. You have lots of small blocks that you can combine in many different ways to create almost anything you wish. 

Pipelines can be tricky to learn, since they have so much flexibility, but once you get the hang of the basics, you'll never go back.

Great documentation can be found on scikit-learn's website:

https://scikit-learn.org/stable/modules/compose.html
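
To make the building-block idea concrete before touching the course data, here is a minimal sketch of a two-step pipeline. It assumes synthetic data from `make_classification` and a `LogisticRegression` classifier; both are stand-ins chosen only for illustration and are not used elsewhere in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration (not the course dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Two blocks snapped together: a scaler followed by a classifier.
demo_pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])

# fit() fits each step in order; predict() sends data through the same steps.
demo_pipe.fit(X_demo, y_demo)
demo_preds = demo_pipe.predict(X_demo)
print(demo_preds.shape)
```

Swapping either block (a different scaler, a different classifier) leaves the rest of the code unchanged, which is the main appeal.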

In [1]:
import pandas as pd
import numpy as np

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
import sklearn
print('The scikit-learn version is {}.'.format(sklearn.__version__))

The scikit-learn version is 0.22.2.post1.


In [3]:
import os
os.getcwd()

'/content'

In [4]:
df = pd.read_csv('https://raw.githubusercontent.com/stepthom/869_course/main/data/generated_german.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 58 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   UserID                                  1000 non-null   object 
 1   FirstName                               1000 non-null   object 
 2   LastName                                1000 non-null   object 
 3   DateOfBirth                             1000 non-null   object 
 4   Sex                                     1000 non-null   object 
 5   Street                                  1000 non-null   object 
 6   City                                    1000 non-null   object 
 7   LicensePlate                            1000 non-null   object 
 8   Married                                 1000 non-null   float64
 9   NumberPets                              1000 non-null   float64
 10  Duration                                1000 non-null   int64

In [5]:
df.head()

Unnamed: 0,UserID,FirstName,LastName,DateOfBirth,Sex,Street,City,LicensePlate,Married,NumberPets,Duration,Amount,InstallmentRatePercentage,ResidenceDuration,NumberExistingCredits,NumberPeopleMaintenance,OwnCar,ForeignWorker,CheckingAccountStatus.lt.0,CheckingAccountStatus.0.to.200,CheckingAccountStatus.gt.200,CheckingAccountStatus.none,CreditHistory.NoCredit.AllPaid,CreditHistory.ThisBank.AllPaid,CreditHistory.PaidDuly,CreditHistory.Delay,CreditHistory.Critical,Purpose.NewCar,Purpose.UsedCar,Purpose.Furniture.Equipment,Purpose.Radio.Television,Purpose.DomesticAppliance,Purpose.Repairs,Purpose.Education,Purpose.Vacation,Purpose.Retraining,Purpose.Business,Purpose.Other,OtherDebtorsGuarantors.None,OtherDebtorsGuarantors.CoApplicant,OtherDebtorsGuarantors.Guarantor,Property.RealEstate,Property.Insurance,Property.CarOther,Property.Unknown,OtherInstallmentPlans.Bank,OtherInstallmentPlans.Stores,OtherInstallmentPlans.None,Housing.Rent,Housing.Own,Housing.ForFree,Job.UnemployedUnskilled,Job.UnskilledResident,Job.SkilledEmployee,Job.Management.SelfEmp.HighlyQualified,EmploymentDuration,SavingsAccountBonds,BadCredit
0,218-84-8180,Christopher,Gray,1953-09-02,M,503 Linda Locks,North Judithbury,395C,0.0,0.0,6,2104,4,4,2,1,0,1,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,1,0,0,0,1,0,7.0,281.0,0
1,643-21-6917,Jennifer,Rocha,1999-09-30,F,42388 Burgess Meadow Suite 532,East Jill,012 PCY,1.0,0.0,48,10712,2,2,1,1,1,1,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,1,0,0,0,1,0,2.0,0.0,1
2,520-14-4890,Kyle,Cruz,1973-03-01,M,480 Erin Plain Suite 514,West Michael,7-F0482,0.0,2.0,12,3773,2,3,1,2,1,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,1,0,0,1,0,0,4.0,0.0,0
3,081-11-7963,Ryan,Romero,1975-10-17,M,52880 Burns Creek,North Judithbury,30Z J39,0.0,1.0,42,14188,2,4,1,2,1,1,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,1,0,0,1,0,4.0,0.0,0
4,463-16-4062,Robert,Spence,1969-05-15,M,78248 Brandt Plains,Ramirezstad,3-46578,0.0,0.0,24,8766,3,4,2,2,1,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,2.0,0.0,1


In [6]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(df.describe().T)

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Married | 1000.0 | 0.402 | 0.490547 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| NumberPets | 1000.0 | 1.074 | 0.805713 | 0.0 | 0.0 | 1.0 | 2.0 | 2.0 |
| Duration | 1000.0 | 20.903 | 12.058814 | 4.0 | 12.0 | 18.0 | 24.0 | 72.0 |
| Amount | 1000.0 | 5888.252 | 5080.928343 | 450.0 | 2458.0 | 4175.0 | 7150.25 | 33163.0 |
| InstallmentRatePercentage | 1000.0 | 2.973 | 1.118715 | 1.0 | 2.0 | 3.0 | 4.0 | 4.0 |
| ResidenceDuration | 1000.0 | 2.845 | 1.103718 | 1.0 | 2.0 | 3.0 | 4.0 | 4.0 |
| NumberExistingCredits | 1000.0 | 1.407 | 0.577654 | 1.0 | 1.0 | 1.0 | 2.0 | 4.0 |
| NumberPeopleMaintenance | 1000.0 | 1.155 | 0.362086 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| OwnCar | 1000.0 | 0.596 | 0.490943 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| ForeignWorker | 1000.0 | 0.963 | 0.188856 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [8]:
from sklearn.model_selection import train_test_split

X = df.drop(['BadCredit'], axis=1)
y = df['BadCredit']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Simple Pipeline 1

In [9]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score

# For our very first pipeline, we're going to keep it very simple.
# We're going to define some features to keep, and we'll drop the rest.
# We won't do any preprocessing, and we'll just pass the features straight on
# to an estimator/classifier.  
# That's it!

keep_features = ['Amount', 'Duration']

clf = RandomForestClassifier(random_state=42)

preprocessor1 = ColumnTransformer(
    transformers=[
        ('keep_and_do_nothing', 'passthrough', keep_features),
        ],
        remainder = 'drop')

pipe1 = Pipeline(steps=[("preprocessor", preprocessor1), ("clf", clf)])

scores1 = cross_val_score(pipe1, X_train, y_train, 
                          scoring='f1_macro', cv=10, n_jobs=-1)
print(scores1)
print(np.mean(scores1))

[0.5        0.60286817 0.54403383 0.59272727 0.59329693 0.51566952
 0.47525343 0.43452381 0.66437834 0.5826087 ]
0.5505359994404786


In [12]:
# What did the features look like after preprocessing?
pipe1 = pipe1.fit(X_train, y_train)
_tmp = pipe1.named_steps['preprocessor'].transform(X_train)
_tmp

array([[12305,    60],
       [ 4174,    21],
       [ 2225,     6],
       ...,
       [10447,    24],
       [ 2671,    12],
       [ 1678,     6]])
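
The same keep-some, drop-the-rest behavior can be seen on a tiny hand-made frame. This is a minimal sketch: the `toy` DataFrame and its values are invented for illustration, and only the column names echo the real data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer

# A made-up three-row frame; the values are illustrative only.
toy = pd.DataFrame({
    "Amount": [1000, 2500, 400],
    "Duration": [12, 24, 6],
    "City": ["A", "B", "A"],
})

# 'passthrough' keeps the listed columns untouched;
# remainder='drop' discards every column that is not listed.
toy_ct = ColumnTransformer(
    transformers=[("keep", "passthrough", ["Amount", "Duration"])],
    remainder="drop",
)
toy_out = np.asarray(toy_ct.fit_transform(toy))
print(toy_out)
```

Note that the result is a plain array: the column names are gone, and `City` never reaches the classifier.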

# Simple Pipeline 2

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score

numeric_features = ['Amount', 'Duration', 'NumberPets', 
                    'ResidenceDuration', 'Married', 'EmploymentDuration']

categorical_features = ['City', 'Sex']

drop_features = ['UserID', 'DateOfBirth', 'FirstName', "LastName", 
                 'Street', 'LicensePlate']

clf = RandomForestClassifier(random_state=42)

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ])

categorical_transformer = Pipeline(steps=[
      ('encoder', OneHotEncoder(handle_unknown='ignore')),
      ])

preprocessor2 = Pipeline(steps=[
      ('ct', ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features),
            ('drop', 'drop', drop_features)],
            remainder = 'passthrough', 
            sparse_threshold=0)),
    ])

pipe2 = Pipeline(steps=[('preprocessor', preprocessor2),  ('clf', clf)])

scores2 = cross_val_score(pipe2, X_train, y_train, 
                          scoring='f1_macro', cv=10, n_jobs=-1)
print(scores2)
print(np.mean(scores2))
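
Why `handle_unknown='ignore'`? During cross-validation, a validation fold can contain a city that never appeared in the training folds. The sketch below, using invented city names, suggests what the encoder does in that case: the unseen category is encoded as a row of all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Invented city names, for illustration only.
train_cities = pd.DataFrame({"City": ["Kingston", "Toronto", "Kingston"]})
new_cities = pd.DataFrame({"City": ["Toronto", "Ottawa"]})  # "Ottawa" was never seen

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train_cities)

# The encoder returns a sparse matrix by default; convert it to inspect it.
encoded = enc.transform(new_cities).toarray()
print(enc.categories_)
print(encoded)
```

With the default `handle_unknown='error'`, the `transform` call on `new_cities` would fail instead.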

# Pipeline 3

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import KernelPCA, PCA, TruncatedSVD
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score

numeric_features = ['Amount', 'Duration', 'NumberPets', 
                    'ResidenceDuration', 'Married', 'EmploymentDuration']

categorical_features = ['City', 'Sex']

drop_features = ['UserID', 'FirstName', "LastName", 
                 'Street', 'LicensePlate']

# A custom transformer that takes in a date feature stored as 'YYYY-MM-DD'
# strings (e.g., date of birth) and calculates the age in years as of 2021.
def get_age_years(feature):
    birth_years = np.array([int(instance[0:4]) for instance in feature], dtype=float)
    return (2021 - birth_years).reshape(-1, 1)

clf = RandomForestClassifier(random_state=42)

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ])

categorical_transformer = Pipeline(steps=[
      ('encoder', OneHotEncoder(handle_unknown='ignore')),
      ])

preprocessor3 = Pipeline(steps=[
      ('ct', ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('amount_log', FunctionTransformer(np.log10, validate=False), ['Amount']),
            ('cat', categorical_transformer, categorical_features),
            ('age', FunctionTransformer(get_age_years, validate=False), 'DateOfBirth'),
            ('drop', 'drop', drop_features)],
            remainder = 'passthrough', 
            sparse_threshold=0)),
    ('pca', PCA(n_components=10)),
    ])

pipe3 = Pipeline(steps=[('preprocessor', preprocessor3),  ('clf', clf)])

scores3 = cross_val_score(pipe3, X_train, y_train, 
                          scoring='f1_macro', cv=10, n_jobs=-1)
print(scores3)
print(np.mean(scores3))
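
The age calculation can also be written with pandas date parsing instead of string slicing. The sketch below is one possible variant, not a replacement for the cell above: the helper name `birth_year_to_age` and its `reference_year` parameter are hypothetical, and it assumes the dates are ISO-style strings like those shown in `df.head()`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def birth_year_to_age(dates, reference_year=2021):
    # Hypothetical helper: parse the date strings, then subtract the birth year.
    # Flattening first lets it accept a Series or a one-column DataFrame.
    years = pd.to_datetime(np.asarray(dates).ravel()).year
    return (reference_year - years).to_numpy().reshape(-1, 1)

age_transformer = FunctionTransformer(birth_year_to_age, validate=False)

# The first three birth dates from df.head() above.
sample = pd.DataFrame({"DateOfBirth": ["1953-09-02", "1999-09-30", "1973-03-01"]})
ages = age_transformer.fit_transform(sample)
print(ages.ravel())
```

Like the original, this only compares years, so it ignores whether the birthday has already passed in the reference year.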

# A Pipeline That's A Little More Complicated...

Add a little feature selection, class imbalance handling, and hyperparameter tuning...

In [None]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV

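scaler = StandardScaler()
dt = DecisionTreeClassifier(random_state=223)
rfe = RFE(estimator=dt, n_features_to_select=10)

# Text columns that a decision tree cannot use directly. They are dropped here
# for simplicity; numeric_features is reused from the cells above.
text_features = ['UserID', 'FirstName', 'LastName', 'Street', 'City',
                 'LicensePlate', 'Sex']


def is_old(feature):
    # The data has no Age column, so derive the age (as of 2021) from the
    # birth year, i.e., the first four characters of DateOfBirth.
    feature = np.array(feature).ravel()
    return np.array([2021 - int(sample[0:4]) > 60 for sample in feature]).reshape(-1, 1)

# Note: StandardScaler passes missing values through. If any column contains
# NaNs, add a SimpleImputer step as in the earlier pipelines.
column_trans = ColumnTransformer(
     [('scale', scaler, numeric_features),
      ('is_old', FunctionTransformer(is_old, validate=False), ['DateOfBirth']),
      ('amount_log', FunctionTransformer(np.log10, validate=False), ['Amount']),
      ('drop', 'drop', text_features),
      ],
     remainder='passthrough')

# The ColumnTransformer outputs each transformer's columns in the order listed,
# followed by the passthrough (remainder) columns. Build the list of feature
# names in that same order so that it lines up with the RFE results later.
used_features = set(numeric_features + text_features + ['DateOfBirth'])
remainder_features = [c for c in X_train.columns if c not in used_features]
feature_names = numeric_features + ['is_old', 'amount_log'] + remainder_features

# Named pipe4 so that it does not overwrite pipe2 from above.
pipe4 = Pipeline([('features', column_trans), ('rfe', rfe), ('dt', dt)])

param_grid = {
    'features__scale__with_mean': [True, False],
    'features__scale__with_std': [True, False],
    'rfe__n_features_to_select': [None, 5, 10, 20, len(feature_names)],
    'dt__max_depth': [None, 3, 10],
    'dt__criterion': ('gini', 'entropy'), 
    'dt__max_features': [None, 'sqrt'],  # 'sqrt' is what 'auto' meant for classifiers
    'dt__max_leaf_nodes': [None, 10],
    'dt__class_weight': [None, 'balanced'],
}

search = GridSearchCV(pipe4, param_grid, cv=3, n_jobs=3, scoring='f1_micro', return_train_score=True, verbose=2)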
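
The keys in a parameter grid follow a naming rule: the step name, two underscores, then the parameter name, repeated for each level of nesting. Here is a small sketch of that rule on a deliberately tiny pipeline; the synthetic data and the names `small_pipe` and `small_grid` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X_small, y_small = make_classification(n_samples=200, n_features=8, random_state=0)

small_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("dt", DecisionTreeClassifier(random_state=0)),
])

# get_params() lists every tunable name as "<step name>__<parameter name>".
tunable = sorted(small_pipe.get_params().keys())
print([name for name in tunable if name.startswith("dt__max")])

# A 2 x 2 grid: four candidate settings, each evaluated with 3-fold CV.
small_grid = {
    "scale__with_mean": [True, False],
    "dt__max_depth": [None, 3],
}
small_search = GridSearchCV(small_pipe, small_grid, cv=3, scoring="f1_micro")
small_search.fit(X_small, y_small)
print(small_search.best_params_)
```

When a key in the grid is rejected as an invalid parameter, printing `get_params().keys()` like this is usually the quickest way to find the right spelling.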

In [None]:
search.fit(X_train, y_train)

In [None]:
search.score(X_test, y_test)

In [None]:
search.best_params_

In [None]:
# What did the features look like after preprocessing?
feature_processing_obj = search.best_estimator_.named_steps['features']

features_train = feature_processing_obj.transform(X_train)
features_train.shape
features_train[0:10, 0:10]

In [None]:
# Which features were selected by RFE?
rfe_obj = search.best_estimator_.named_steps['rfe']

for i in range(len(feature_names)):
    if rfe_obj.support_[i]:
        print('Feature {} ({}), Selected {}, Rank: {}'.format(i, feature_names[i], rfe_obj.support_[i], rfe_obj.ranking_[i]))
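
To see what `support_` and `ranking_` contain without waiting for the full grid search, here is a small standalone sketch. It assumes synthetic data and made-up feature names (`f0` to `f5`), so the selected features themselves mean nothing.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with made-up feature names, for illustration only.
X_rfe, y_rfe = make_classification(n_samples=300, n_features=6,
                                   n_informative=3, random_state=0)
names = ["f0", "f1", "f2", "f3", "f4", "f5"]

selector = RFE(estimator=DecisionTreeClassifier(random_state=0),
               n_features_to_select=3)
selector.fit(X_rfe, y_rfe)

# support_ is a boolean mask over the input features;
# ranking_ is 1 for every selected feature and larger for eliminated ones.
selected = [name for name, keep in zip(names, selector.support_) if keep]
print(selected)
print(selector.ranking_)
```

Both arrays are indexed by the columns that reach the RFE step, which is why the list of feature names has to match the preprocessor's output order.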

In [None]:
# Print out the results of hyperparameter tuning

def cv_results_to_df(cv_results):
    results = pd.DataFrame(list(cv_results['params']))
    results['mean_fit_time'] = cv_results['mean_fit_time']
    results['mean_score_time'] = cv_results['mean_score_time']
    results['mean_train_score'] = cv_results['mean_train_score']
    results['std_train_score'] = cv_results['std_train_score']
    results['mean_test_score'] = cv_results['mean_test_score']
    results['std_test_score'] = cv_results['std_test_score']
    results['rank_test_score'] = cv_results['rank_test_score']

    results = results.sort_values(['mean_test_score'], ascending=False)
    return results

results = cv_results_to_df(search.cv_results_)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(results)
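
As an alternative to copying columns one by one, `cv_results_` is a dictionary of equal-length arrays, so it can be handed to `pd.DataFrame` directly. The sketch below assumes a small stand-in search on synthetic data; the names `demo_search` and `summary` are illustrative only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# A small stand-in search on synthetic data, for illustration only.
X_cv, y_cv = make_classification(n_samples=200, n_features=8, random_state=0)
demo_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 4, None]},
    cv=3, scoring="f1_micro", return_train_score=True,
)
demo_search.fit(X_cv, y_cv)

# One row per candidate setting; each parameter also gets a "param_<name>" column.
results_all = pd.DataFrame(demo_search.cv_results_)
summary = results_all[
    ["params", "mean_train_score", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(summary)
```

A large gap between `mean_train_score` and `mean_test_score` for a setting is a hint that it overfits.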
