# Feature Engineering

In this notebook we test out several feature engineering techniques. In particular, we will try out the following features:

1. Feature Selection
2. Row statistics (static features)
3. TargetEncoding
4. KMeans Clustering

In each case we will compare it with the baseline LightGBM model and score it using cross-validation. For each technique we use the following parameters:

* `n_estimators = 10000` with `early_stopping_rounds = 150`
* `learning_rate = 0.03`
* `random_state = 0` to ensure reproducible results

In [1]:
# Global variables for testing changes to this notebook quickly
NUM_TREES = 10000
EARLY_STOP = 150
NUM_FOLDS = 3
RANDOM_SEED = 0
SUBMIT = True

In [2]:
# Essential imports
import numpy as np
import pandas as pd
import matplotlib
import pyarrow
import time
import os
import gc

# feature engineering
import scipy.stats as stats
from category_encoders import MEstimateEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from functools import partial

# Model evaluation
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.feature_selection import mutual_info_classif

# LightGBM
from lightgbm import LGBMClassifier, plot_importance

# Mute warnings
import warnings
warnings.filterwarnings('ignore')

# display options
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

## Loading Function

We create a function that recreates the training and holdout sets since some of our methods may overwrite the original data and we need a reproducible way to get the same data.

In [3]:
# Generate training and holdout set
def get_training_data():
    train = pd.read_feather("../data/train.feather")

    train, holdout = train_test_split(
        train,
        train_size = 500000,
        stratify = train['target'],
        shuffle = True,
        random_state = RANDOM_SEED,
    )

    train.reset_index(drop = True, inplace = True)
    holdout.reset_index(drop = True, inplace = True)
    
    return train, holdout

In [4]:
%%time
train, holdout = get_training_data()

# save important features
features = [x for x in train.columns if x not in ['id','target']]

Wall time: 4.51 s


## Scoring Function

For each feature engineering technique we create a function that accepts the training, test and validation data as arguments and returns the appropriately transformed data (taking care to avoid leakage). This function is passed to a scoring function as the argument `preprocessing`,

In [5]:
def score_lightgbm(preprocessing = None):
    start = time.time()
    holdout_preds = np.zeros((holdout.shape[0],))
    print('')
    
    skf = StratifiedKFold(n_splits = NUM_FOLDS, shuffle = True, random_state = 0)
    for fold, (train_idx, valid_idx) in enumerate(skf.split(train, train['target'])):
        
        # train, valid split for cross-validation
        X_train, y_train = train[features].iloc[train_idx].copy(), train['target'].iloc[train_idx].copy()
        X_valid, y_valid = train[features].iloc[valid_idx].copy(), train['target'].iloc[valid_idx].copy()
        X_test, y_test = holdout[features].copy(), holdout['target'].copy()
        
        # preprocessing function should return a copy
        if preprocessing:
            try:
                X_train, X_valid, X_test = preprocessing(X_train, X_valid, X_test, y_train)
            except:
                X_train, X_valid, X_test = preprocessing(X_train, X_valid, X_test)
        
        # model with params
        model = LGBMClassifier(
            n_estimators = NUM_TREES,
            random_state = RANDOM_SEED,
            learning_rate = 0.03,
        )

        model.fit(
            X_train, y_train,
            eval_set = [(X_valid, y_valid)],
            eval_metric = 'auc',
            early_stopping_rounds = EARLY_STOP,
            verbose = False,
        )

        holdout_preds += model.predict_proba(X_test)[:,1] / NUM_FOLDS
        valid_preds = model.predict_proba(X_valid)[:,1]
        
        fold_auc = roc_auc_score(y_valid, valid_preds)
        print(f"Fold {fold} (AUC):", fold_auc)
        
    end = time.time()
    return roc_auc_score(holdout['target'], holdout_preds), round(end-start, 2), model

# 0. Baseline (LightGBM)

We start with computing a baseline score for LightGBM using the raw data with no feature engineering.

In [6]:
baseline_score, baseline_time, model = score_lightgbm()

print("\nTraining Time:", baseline_time)
print("Holdout (AUC):", baseline_score)


Fold 0 (AUC): 0.8546680181750979
Fold 1 (AUC): 0.8550275353925132
Fold 2 (AUC): 0.8535062714075843

Training Time: 565.08
Holdout (AUC): 0.8562177395328208


# 1. Feature Selection

In this section we experiment with dropping certain features deemed unimportant by various feature selection techniques. We consider two methods for determining unimportant features:

* LightGBM feature importance
* Mutual Information

In [7]:
# Data structure for comparing
data = dict(
    scores = [baseline_score],
    times = [baseline_time]
)
index = ["Baseline"]

## 1.1 Feature Importance

We define a bad feature as one with a feature importance below 3 using the building `feature_importance_` attribute:

In [8]:
# Determine good columns
good_columns = list()
for score, col in zip(model.feature_importances_, train[features].columns):
    if score >= 3:
        good_columns.append(col)

In [9]:
def feature_selection_importance(X_train, X_valid, X_test):
    return X_train[good_columns], X_valid[good_columns], X_test[good_columns]

In [10]:
# Feature selection with 'feature importance'
print(f'Removed {len(features) - len(good_columns)} features.')
fi_score, fi_time, model = score_lightgbm(feature_selection_importance)

del model
gc.collect()

print("\nTraining Time:", fi_time)
print("Holdout (AUC):", fi_score)

data['times'].append(fi_time)
data['scores'].append(fi_score)
index.append('Feature Importance')

Removed 14 features.

Fold 0 (AUC): 0.8545783781620102
Fold 1 (AUC): 0.8549845451883575
Fold 2 (AUC): 0.853505431880075

Training Time: 533.42
Holdout (AUC): 0.856134647566317


## 1.2 Mutual Information

In this section we remove features which have zero [mutual information](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html#sklearn.feature_selection.mutual_info_classif) scores.

In [11]:
def remove_uninformative(X_train, X_valid, X_test, y_train, verbose = False):
    
    # 0. categoricals
    binary_features = [X_train[x].dtype.name.startswith("int") for x in X_train.columns]
    
    # 1. Determine uninformative columns
    scores =  mutual_info_classif(
        X_train, y_train,
        discrete_features = binary_features,
    )
    cols = [x for i, x in enumerate(X_train.columns) if scores[i] == 0]
    
    # 2. Drop the uninformative columns
    X_train.drop(cols, axis = 1, inplace = True)
    X_valid.drop(cols, axis = 1, inplace = True)
    X_test.drop(cols, axis = 1, inplace = True)
    
    if verbose:
        print("Dropped columns:", *cols)
    
    return X_train, X_valid, X_test

In [12]:
mi_score, mi_time, model = score_lightgbm(remove_uninformative)

del model
gc.collect()

print("\nTraining Time:", mi_time)
print("Holdout (AUC):", mi_score)

data['times'].append(mi_time)
data['scores'].append(mi_score)
index.append('Mutual Information')


Fold 0 (AUC): 0.8479612625036981
Fold 1 (AUC): 0.849321014621545
Fold 2 (AUC): 0.8467167107565394

Training Time: 1664.57
Holdout (AUC): 0.8515919097064109


# 1. Row Statistics

In this section, we calculate several row statistics as features and see which (if any) result in improvements over the original features.

In [13]:
def create_row_stats(data):
    cont_cols, cat_cols = list(), list()
    for col in data.columns:
        if data[col].dtype.name.startswith("int"):
            cat_cols.append(col)
        else:
            cont_cols.append(col)
    new_data = data.copy()
    new_data['binary_count'] = data[cat_cols].sum(axis=1)
    new_data['binary_std'] = data[cat_cols].std(axis=1)
    new_data['min'] = data[cont_cols].min(axis=1)
    new_data['std'] = data[cont_cols].std(axis=1)
    new_data['max'] = data[cont_cols].max(axis=1)
    new_data['median'] = data[cont_cols].median(axis=1)
    new_data['mean'] = data[cont_cols].mean(axis=1)
    #new_data['var'] = data[cont_cols].var(axis=1)
    #new_data['sum'] = data[cont_cols].sum(axis=1)
    #new_data['sem'] = data[cont_cols].sem(axis=1)
    new_data['skew'] = data[cont_cols].skew(axis=1)
    new_data['median_abs_dev'] = stats.median_abs_deviation(data[cont_cols], axis=1)
    new_data['zscore'] = (np.abs(stats.zscore(data[cont_cols]))).sum(axis=1)
    return new_data

def row_stats(X_train, X_valid, X_test, y_train):
    X_train = create_row_stats(X_train)
    X_valid = create_row_stats(X_valid)
    X_test = create_row_stats(X_test)
    return X_train, X_valid, X_test

In [14]:
features = [x for x in train.columns if x not in ['id','target']]

In [15]:
stats_score, stats_time, model = score_lightgbm(row_stats)

print("\nTraining Time:", stats_time)
print("Holdout (AUC):", stats_score)

data['times'].append(stats_time)
data['scores'].append(stats_score)
index.append('Row Stats')


Fold 0 (AUC): 0.8546835329755162
Fold 1 (AUC): 0.8549890739744684
Fold 2 (AUC): 0.8535805096876286

Training Time: 654.35
Holdout (AUC): 0.8562796163752812


We see that our model found some of these variables decently important for training however there is no noticable benefit to the overall model accuracy and a much slower training time.

# 2. Target Encoding

In this section, we target encode all the binary variables. Target encoding is generally used for higher cardinality categorical data but we'll try it here anyways.

In [16]:
%%time
train, holdout = get_training_data()

features = [x for x in train.columns if x not in ['id','target']]
binary_features = [x for x in features if train[x].dtype.name.startswith("int")]

Wall time: 3.68 s


In [17]:
def target_encode(X_train, X_valid, X_test, y_train):
    encoder = MEstimateEncoder(
        cols = binary_features,
        m = 1.0,
    )
    X_train = encoder.fit_transform(X_train, y_train)
    X_valid = encoder.transform(X_valid)
    X_test = encoder.transform(X_test)
    return X_train, X_valid, X_test

In [18]:
target_score, target_time, model = score_lightgbm(target_encode)

# don't need the model
del model
gc.collect()

print("\nTraining Time:", target_time)
print("Holdout (AUC):", target_score)

data['times'].append(target_time)
data['scores'].append(target_score)
index.append('Target Encoding')


Fold 0 (AUC): 0.8546217074851257
Fold 1 (AUC): 0.8549852078743252
Fold 2 (AUC): 0.8535062714075843

Training Time: 698.17
Holdout (AUC): 0.856247962553375


As said before target encoding is best done with high cardinality variables so it's not particularly surprising that this didn't improve our models. It also significantly slowed down training time.

# 3. KMeans Clustering

We test cluster labels as categorical features and cluster distances as numerical features separately and see if either results in better models.

## 3.1 Cluster Labels

In [19]:
def generate_cluster_labels(X_train, X_valid, X_test, name, features, scale = True):
    
    # 1. normalize based on training data
    if scale:
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X_train[features])
        X_valid_scaled = scaler.transform(X_valid[features])
        X_test_scaled = scaler.transform(X_test[features])
    else:
        # no scaling
        X_scaled = X_train[features]
        X_valid_scaled = X_valid[features]
        X_test_scaled = X_test[features]
    
    # 2. create cluster labels (use predict)
    kmeans = KMeans(
        n_clusters = 10, 
        n_init = 10, 
        random_state = RANDOM_SEED
    )
    X_train[name + "_Cluster"] = kmeans.fit_predict(X_scaled)
    X_valid[name + "_Cluster"] = kmeans.predict(X_valid_scaled)
    X_test[name + "_Cluster"] = kmeans.predict(X_test_scaled)
         
    return X_train, X_valid, X_test

In [20]:
def cluster_label_features(X_train, X_valid, X_test, y_train):
    # get variables correlated with target
    corr = train.corr()
    corr = corr.loc['target':'target']
    corr = corr.drop(['id','target'],axis=1)
    corr = abs(corr)
    corr = corr.sort_values(by='target',axis=1, ascending=False)
    cols = [x for x in corr.columns][:15]
    return generate_cluster_labels(X_train, X_valid, X_test, "Top15", cols)

In [21]:
clusterlabel_score, clusterlabel_time, model = score_lightgbm(cluster_label_features)

# don't need the model
del model
gc.collect()

print("\nTraining Time:", clusterlabel_time)
print("Holdout (AUC):", clusterlabel_score)

data['times'].append(clusterlabel_time)
data['scores'].append(clusterlabel_score)
index.append("Cluster Labels")


Fold 0 (AUC): 0.854570248666942
Fold 1 (AUC): 0.8548990131946711
Fold 2 (AUC): 0.8534960185158745

Training Time: 780.64
Holdout (AUC): 0.856102446735894


## 3.2 Cluster Distances

In [22]:
def generate_cluster_distances(X_train, X_valid, X_test, name, features, scale = True):
    
    # 1. normalize based on training data
    if scale:
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X_train[features])
        X_valid_scaled = scaler.transform(X_valid[features])
        X_test_scaled = scaler.transform(X_test[features])
    else:
        # no scaling
        X_scaled = X_train[features]
        X_valid_scaled = X_valid[features]
        X_test_scaled = X_test[features]
    
    # 2. generate cluster distances (use transform)
    kmeans = KMeans(n_clusters = 10, n_init = 10, random_state=0)
    X_cd = kmeans.fit_transform(X_scaled)
    X_valid_cd = kmeans.transform(X_valid_scaled)
    X_test_cd = kmeans.transform(X_test_scaled)
    
    # 3. column labels
    X_cd = pd.DataFrame(X_cd, columns=[name + "_Centroid_" + str(i) for i in range(X_cd.shape[1])])
    X_valid_cd = pd.DataFrame(X_valid_cd, columns=[name + "_Centroid_" + str(i) for i in range(X_valid_cd.shape[1])])
    X_test_cd = pd.DataFrame(X_test_cd, columns=[name + "_Centroid_" + str(i) for i in range(X_test_cd.shape[1])])    
    
    return X_train.join(X_cd), X_valid.join(X_valid_cd), X_test.join(X_test_cd)

In [23]:
def cluster_distance_features(X_train, X_valid, X_test, y_train):
    # get variables correlated with target
    corr = train.corr()
    corr = corr.loc['target':'target']
    corr = corr.drop(['id','target'],axis=1)
    corr = abs(corr)
    corr = corr.sort_values(by='target',axis=1, ascending=False)
    cols = [x for x in corr.columns][:15]
    return generate_cluster_distances(X_train, X_valid, X_test, "Top15", cols)

In [24]:
clusterdist_score, clusterdist_time, model = score_lightgbm(cluster_distance_features)

# don't need the model
del model
gc.collect()

print("\nTraining Time:", clusterdist_time)
print("Holdout (AUC):", clusterdist_score)

data['times'].append(clusterdist_time)
data['scores'].append(clusterdist_score)
index.append('Cluster Distances')


Fold 0 (AUC): 0.8545575676338328
Fold 1 (AUC): 0.854876412463984
Fold 2 (AUC): 0.8534671379055443

Training Time: 860.15
Holdout (AUC): 0.8561483535152661


# Evaluation

In [25]:
pd.DataFrame(data = data, index = index).T

Unnamed: 0,Baseline,Feature Importance,Mutual Information,Row Stats,Target Encoding,Cluster Labels,Cluster Distances
scores,0.856218,0.856135,0.851592,0.85628,0.856248,0.856102,0.856148
times,565.08,533.42,1664.57,654.35,698.17,780.64,860.15


None of these methods appear particularly promising as they either provide no/little gain and/or increase the training time significantly but we may experiment with using some of these methods for ensembling to increase the variance.