# Get Feature Names out of a Pipeline

This was an exercise in getting the feature names out of a pipeline that performs one-hot encoding (OHE). This matters if you want to analyse feature importance after modelling. To add complexity, I applied oversampling with SMOTENC in a later step of the pipeline, which requires passing the indices of the one-hot-encoded categorical columns.

There seem to be three possible solutions:
1. OHE outside of the pipeline, so you have direct access to the feature names (the easiest way and quite ok, I think)
2. OHE inside the pipeline, and you infer the feature names (pragmatic, but you had better make sure you got it right)
3. OHE inside the pipeline, and you get the feature names out of it (or out of a cloned pipeline ... as was necessary here)

I worked on solution 3 in this notebook. That was something of a hassle because the sklearn ColumnTransformer (or FeatureUnion) object only returns the feature names if every transformer within it provides the method get_feature_names(). Unfortunately some, like StandardScaler, do not (yet). The workaround was to build a second pipeline just to get the feature names. There I substituted the StandardScaler with a custom 'PassthroughTransformer' that passes the data through unchanged but has the necessary get_feature_names() method (see [here](https://stackoverflow.com/questions/53382322/adding-get-feature-names-to-columntransformer-pipeline) for background info).
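The `custom_transformers` module is not shown in this notebook, so the following is only a minimal sketch of what such a transformer could look like. The class body, the `feature_names_` attribute and the toy `demo` frame are assumptions for illustration, not the original implementation.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class PassthroughTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for transform.PassthroughTransformer."""

    def fit(self, X, y=None):
        # ColumnTransformer hands over a DataFrame slice when columns are
        # selected by name, so the column labels can be stored here.
        if hasattr(X, "columns"):
            self.feature_names_ = list(X.columns)
        else:
            # Fallback for plain arrays (assumed to have a .shape attribute).
            self.feature_names_ = [f"x{i}" for i in range(X.shape[1])]
        return self

    def transform(self, X):
        # Pass the data through unchanged.
        return X

    def get_feature_names(self):
        return self.feature_names_


# Toy data, only to show the behaviour.
demo = pd.DataFrame({"Company": [1, 2], "Time": [1, 1]})
passthrough = PassthroughTransformer().fit(demo)
print(passthrough.get_feature_names())
```

Because the transformer neither adds nor reorders columns, the names it reports line up with the columns a StandardScaler would have produced in the same position.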

_NOTE: the original data is not provided for this notebook_

In [2]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split,GridSearchCV, \
    cross_val_score, StratifiedKFold, validation_curve, learning_curve

from imblearn.over_sampling import SMOTENC
from imblearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.metrics import make_scorer, classification_report, confusion_matrix, fbeta_score

import custom_transformers as transform
import cleaning_functions as clean

# visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(); sns.set_style('whitegrid')
%matplotlib inline  

# display of all columns in df - check if pd option below isn't better
from IPython.display import display
pd.options.display.max_columns = None

### Check data

In [3]:
XX = pd.read_csv('Financial Distress.csv')

In [4]:
XX.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3670 entries, 0 to 3669
Columns: 127 entries, Company to x124
dtypes: float64(114), int64(13)
memory usage: 3.6 MB


In [5]:
XX.head(2)

Unnamed: 0,Company,Time,Financial Distress,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19,x20,x21,x22,x23,x24,x25,x26,x27,x28,x29,x30,x31,x32,x33,x34,x35,x36,x37,x38,x39,x40,x41,x42,x43,x44,x45,x46,x47,x48,x49,x50,x51,x52,x53,x54,x55,x56,x57,x58,x59,x60,x61,x62,x63,x64,x65,x66,x67,x68,x69,x70,x71,x72,x73,x74,x75,x76,x77,x78,x79,x80,x81,x82,x83,x84,x85,x86,x87,x88,x89,x90,x91,x92,x93,x94,x95,x96,x97,x98,x99,x100,x101,x102,x103,x104,x105,x106,x107,x108,x109,x110,x111,x112,x113,x114,x115,x116,x117,x118,x119,x120,x121,x122,x123,x124
0,1,1,0.010636,1.281,0.022934,0.87454,1.2164,0.06094,0.18827,0.5251,0.018854,0.18279,0.006449,0.85822,2.0058,0.12546,6.9706,4.6512,0.0501,2.1984,0.018265,0.024978,0.027264,1.4173,9.5554,0.14872,0.66995,214.76,12.641,6.4607,0.043835,0.20459,0.35179,8.3161,0.28922,0.76606,2.5825,77.4,0.026722,1.6307,0.015016,0.005478,0.1273,9.6951,-0.73622,0.98559,0.18016,1.5006,0.026224,7.0513,1174.9,5.3399,0.85128,12.837,0.061737,0.1809,209.87,-0.58255,0.47101,0.1099,0.0,0.0,0.22009,0.13076,0.14952,0.19518,0.1075,1224.5,1.0422,4.892,6.7291,0.5386,104.41,0.49844,2.3224,300.0,0.14653,1.0214,24.402,-47.071,129,1200.0,-0.4623,391.0,2870000.0,8990000000.0,31,31400000000.0,9.98e-09,25.75,0.19693,74.25,38.44,15.93,0.0,0.0,74.25,1,2,0,5,0,0,0.8,7.1241,15.381,3.2702,17.872,34.692,30.087,12.8,7991.4,364.95,15.8,61.476,4.0,36.0,85.437,27.07,26.102,16.0,16.0,0.2,22,0.06039,30,49
1,1,2,-0.45597,1.27,0.006454,0.82067,1.0049,-0.01408,0.18104,0.62288,0.006423,0.035991,0.001795,0.85152,-0.48644,0.17933,4.5764,3.7521,-0.014011,2.4575,0.027558,0.028804,0.041102,1.1801,7.2952,0.056026,0.67048,38.242,12.877,5.5506,0.26548,0.15019,0.41763,9.5276,0.41561,0.81699,2.6033,95.947,0.00758,0.83754,0.027425,0.045434,0.13774,5.6035,-0.64385,1.3019,0.046857,1.0095,0.007864,4.6022,1062.5,3.7389,0.94397,12.881,-0.000565,0.056298,250.14,-0.47477,0.38599,0.36933,0.0,0.0,0.0,-0.042671,-0.051995,-0.063643,-0.042465,-252.83,-0.23795,-2.0869,-0.98939,-0.23212,-10.857,-0.18801,0.90531,100.0,0.4039,1.8484,25.588,88.667,229,1964.0,3.5409,126.0,371000.0,541000000.0,27,724000000.0,5.32e-08,26.78,0.2299,73.22,42.86,15.94,0.0,0.0,73.22,1,2,0,5,0,0,0.6,7.4166,7.105,14.321,18.77,124.76,26.124,11.8,8322.8,0.1896,15.6,24.579,0.0,36.0,107.09,31.31,30.194,17.0,16.0,0.4,22,0.010636,31,50


In [6]:
# make sure there's no missing data in the set
assert XX.isnull().sum().sum() == 0, "NaN present"

### Prepare Data

In [7]:
def create_Xy(df):
    """Separate target variable from features."""

    X = df.copy()
    y = df['Financial Distress']  # read from the argument, not the global XX
    X = X.drop(['Financial Distress'], axis=1)
    
    return X, y

In [8]:
# call function and check results
X, y = create_Xy(XX)
X.shape

(3670, 126)

In [9]:
# prepare target feature
y = np.array([0 if i > -0.50 else 1 for i in y])

# check result
unique, counts = np.unique(y, return_counts=True)
display(np.asarray((unique, counts)).T)

array([[   0, 3535],
       [   1,  135]], dtype=int64)

In [10]:
"""define numerical and categorical features"""

cat_features = ['x121', 'x99']  # orig: x95, x97, x99, x100, x121

X = X.astype(float)  # Column types are defaulted to floats
X[cat_features] = X[cat_features].astype('category') # categorical are set to cat

num_features = list(X.columns)
for feature in cat_features:
    num_features.remove(feature)

assert (len(num_features) + len(cat_features)) == X.shape[1] # safety check


# define number of one-hot-encoded cat features for SMOTENC sampler (and print results)
number_cat = 0
for feature in cat_features:
    values = X[feature].nunique()
    print(feature, ", number of unique values (categories): ", values)
    number_cat += values
print("\nTotal number of one-hot-encoded cat features: ", number_cat)   

x121 , number of unique values (categories):  37
x99 , number of unique values (categories):  2

Total number of one-hot-encoded cat features:  39


**Note:** We will need `cat_features` and `num_features` to select the respective columns in the ColumnTransformer in the Pipeline. The total number of OHE features is needed so that we can pass the correct categorical indices to the SMOTENC sampler.
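The index arithmetic behind this note can be sketched as follows. The sketch assumes that the one-hot encoder is the first transformer in the ColumnTransformer, so that its output columns occupy the leading positions, and that every category counted by `nunique()` also occurs in the training split. The names `n_categories`, `n_ohe_columns` and `cat_indices` are illustrative only.

```python
# Category counts as printed above (x121: 37, x99: 2).
n_categories = {"x121": 37, "x99": 2}

# One output column per category, summed over all categorical features.
n_ohe_columns = sum(n_categories.values())

# If the OHE block comes first, its columns are simply 0 .. n_ohe_columns - 1.
cat_indices = list(range(n_ohe_columns))

print(cat_indices[0], cat_indices[-1], len(cat_indices))  # 0 38 39
```

If the encoder were listed after the numerical transformer, the indices would have to be shifted by the number of numerical columns instead.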

In [11]:
"""split data into train and test"""

indices = np.arange(y.shape[0])
X_train, X_test, y_train, y_test, idx_train, idx_test = \
    train_test_split(X, y, indices, stratify=y, test_size=0.3,random_state=42)

# check results
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(2569, 126)
(1101, 126)
(2569,)
(1101,)


### Build Pipeline

In [65]:
"""assemble pipeline (define function)"""

def build_pipe(X_train, y_train, clf, sampler):
    """Build a pipeline for preprocessing (including oversampling)
    and classification.
    
    ARGUMENTS:
        X_train: training features (df or array)
        y_train: training labels (df or array)
        clf: classifier (sk-learn model object)
        sampler: sampler (imblearn sampling class)
        
    RETURNS:
        preprocessor: ColumnTransformer object (unfitted)
        full_pipe: pipeline object
    """
 
    preprocessor = ColumnTransformer([
            ('ohe', OneHotEncoder(), cat_features),
            ('scaling', StandardScaler(), num_features),
            ])
    
    full_pipe = Pipeline([
        ('preprocessor', preprocessor),
        ('sample', sampler),
        ('clf', clf)])
    
    return preprocessor, full_pipe

In [66]:
"""initialize  classifier and SMOTENC sampler, build pipeline""" 

rg = LogisticRegression(class_weight = { 0:1, 1:1 }, random_state = 42, solver = 'saga',
                        max_iter=100, n_jobs=-1, intercept_scaling=1, C=0.02, penalty='l1')

gbc = GradientBoostingClassifier()

sampler = SMOTENC(categorical_features=list(range(0,number_cat)), n_jobs=-1)

preprocessor, full_pipe = build_pipe(X_train, y_train, gbc, sampler)

### Fit and Tune

In [67]:
def fit_pipe(X_train, y_train, pipe, scorer, cv=StratifiedKFold(3)):
    """Fit training data to a pipeline with GridSearchCV
    for best parameter tuning.
    
    ARGUMENTS:
        X_train: training features (df or array)
        y_train: training labels (df or array)
        pipe: pipeline (sk-learn pipeline object)
        scorer: evaluation metric for validation
        cv: type of CV, default is StratifiedKFold(3)
        
    RETURNS:
        grid: grid search object
        grid_results: dict with grid search results
    """
    parameters = {'clf__learning_rate': [0.05], 
#                   'clf__class_weight':   [{ 0:1, 1:11 }, { 0:1, 1:8 }]
                 }

    # use a separate name so the `cv` argument is not shadowed
    grid_search = GridSearchCV(pipe, param_grid=parameters, scoring=scorer, n_jobs=-1,
                               cv=cv, error_score='raise', return_train_score=False, verbose=1)

    grid = grid_search.fit(X_train, y_train)
    grid_results = grid.cv_results_

    return grid, grid_results

In [68]:
# call the function and evaluate on fbeta score

scorer = make_scorer(fbeta_score, beta=4)
cv = 3

grid, grid_results = fit_pipe(X_train, y_train, full_pipe, scorer, cv=cv)

Fitting 3 folds for each of 1 candidates, totalling 3 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:   16.3s finished
In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.


In [16]:
len(grid.best_estimator_.named_steps['clf'].feature_importances_)

163
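The classifier reports one importance value per preprocessed column. Once a list of feature names of the same length is available, the two can be paired for inspection. The sketch below uses made-up names and importance values purely for illustration; `feature_names`, `importances` and `ranked` are hypothetical and do not come from the fitted model.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the real feature names and importances.
feature_names = ["ohe__x0_1.0", "ohe__x1_0.0", "pass__Company", "pass__x1"]
importances = np.array([0.05, 0.10, 0.25, 0.60])

# The pairing is only meaningful if both have the same length and order.
assert len(feature_names) == len(importances)

ranked = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print(ranked.head(3))
```

With the real objects, `importances` would be `grid.best_estimator_.named_steps['clf'].feature_importances_`.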

In [24]:
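# in case you want to see the preprocessed data (before it is passed to the sampler)
# note: GridSearchCV fits clones, so take the fitted preprocessor from the best estimator
fitted_preprocessor = grid.best_estimator_.named_steps['preprocessor']
preprocessed_data = pd.DataFrame(fitted_preprocessor.transform(X_train))
preprocessed_data.shape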

(2569, 163)

### Alternative Preprocessor / Pipeline to get the feature names

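StandardScaler and the built-in passthrough option of ColumnTransformer do not yet provide get_feature_names(). That is why I substituted the StandardScaler with a custom PassthroughTransformer. Since neither of them changes the number or order of the columns, the resulting names also apply to the output of the original preprocessor.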


In [56]:
"""assemble pipeline (define function)"""

def build_feature_pipe(X_train, y_train, clf, sampler):
    """Build a pipeline for preprocessing (including oversampling)
    and classification.
    
    ARGUMENTS:
        X_train: training features (df or array)
        y_train: training labels (df or array)
        clf: classifier (sk-learn model object)
        sampler: sampler (imblearn sampling class)
        
    RETURNS:
        preprocessor: ColumnTransformer object (unfitted)
        full_pipe: pipeline object
    """
 
    preprocessor = ColumnTransformer([
            ('ohe', OneHotEncoder(), cat_features),
            ('pass', transform.PassthroughTransformer(), num_features), # new step, does not change data
            ])
    
    full_pipe = Pipeline([
        ('preprocessor', preprocessor),
        ('sample', sampler),
        ('clf', clf)])
    
    return preprocessor, full_pipe

In [74]:
"""fit and get feature names"""
feature_preprocessor, feature_pipe = build_feature_pipe(X_train, y_train, gbc, sampler)
feature_preprocessor.fit(X_train)
feature_preprocessor.get_feature_names()

In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.


['ohe__x0_1.0',
 'ohe__x0_2.0',
 'ohe__x0_3.0',
 'ohe__x0_4.0',
 'ohe__x0_5.0',
 'ohe__x0_6.0',
 'ohe__x0_7.0',
 'ohe__x0_8.0',
 'ohe__x0_9.0',
 'ohe__x0_10.0',
 'ohe__x0_11.0',
 'ohe__x0_12.0',
 'ohe__x0_13.0',
 'ohe__x0_14.0',
 'ohe__x0_15.0',
 'ohe__x0_16.0',
 'ohe__x0_17.0',
 'ohe__x0_18.0',
 'ohe__x0_19.0',
 'ohe__x0_20.0',
 'ohe__x0_21.0',
 'ohe__x0_22.0',
 'ohe__x0_23.0',
 'ohe__x0_24.0',
 'ohe__x0_25.0',
 'ohe__x0_26.0',
 'ohe__x0_27.0',
 'ohe__x0_28.0',
 'ohe__x0_29.0',
 'ohe__x0_30.0',
 'ohe__x0_31.0',
 'ohe__x0_32.0',
 'ohe__x0_33.0',
 'ohe__x0_34.0',
 'ohe__x0_35.0',
 'ohe__x0_36.0',
 'ohe__x0_37.0',
 'ohe__x1_0.0',
 'ohe__x1_1.0',
 'pass__Company',
 'pass__Time',
 'pass__x1',
 'pass__x2',
 'pass__x3',
 'pass__x4',
 'pass__x5',
 'pass__x6',
 'pass__x7',
 'pass__x8',
 'pass__x9',
 'pass__x10',
 'pass__x11',
 'pass__x12',
 'pass__x13',
 'pass__x14',
 'pass__x15',
 'pass__x16',
 'pass__x17',
 'pass__x18',
 'pass__x19',
 'pass__x20',
 'pass__x21',
 'pass__x22',
 'pass__x23',
 '

---