# Tabular Playground Series - May 2022: Feature Boundary Investigation

Based on the amazing work seen across these notebooks:
- https://www.kaggle.com/code/ambrosm/tpsmay22-eda-which-makes-sense by @ambrosm
- https://www.kaggle.com/code/wti200/analysing-interactions-with-shap by @wti200
- https://www.kaggle.com/code/cabaxiom/tps-may-22-eda-lgbm-model by @cabaxiom

it seemed that the standard approach to the feature interaction issue was to create ternary features
that explicitly defined boundaries:

`X["f_21_f_02"] = (X.f_02 + X.f_21 > 5.2).astype('int') - (X.f_02 + X.f_21 < -5.3).astype('int')`

I was curious if defining a simpler feature would allow the classifier to fit an optimal boundary:

`X["f_21_f_02"] = X.f_02 + X.f_21`

thus performing as well (or better) than the more elaborate interaction features.

So, here we fit classifiers to both sets of features, compare performance, and analyze the outcome.
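
To make the two encodings concrete, here is a minimal sketch on made-up numbers (the arrays below are toy values, not competition data). It applies the same thresholds as the `f_21_f_02` feature and shows what the classifier sees in each case.

```python
import numpy as np

# Toy values standing in for f_02 and f_21 (made up for illustration).
f_02 = np.array([3.0, -4.0, 0.5, 2.5, -2.0])
f_21 = np.array([2.5, -1.5, 0.25, 2.5, -3.0])

# "No boundaries" variant: the classifier receives the raw sum.
raw_sum = f_02 + f_21

# "Boundaries" variant: +1 above the upper threshold, -1 below the lower one, 0 otherwise.
ternary = (raw_sum > 5.2).astype(int) - (raw_sum < -5.3).astype(int)

print(raw_sum)
print(ternary)
```

The sums 5.0 and -5.0 sit just inside the band, so the ternary feature collapses them to 0, while the raw sum leaves it to the classifier to decide where to split.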

\[NOTE: I used a GPU and 200 trees for training\]

In [None]:
from pathlib import Path
from warnings import simplefilter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

simplefilter("ignore")

RANDOM_STATE=42

### Load training and test data

In [None]:
train = pd.read_csv('../input/tabular-playground-series-may-2022/train.csv')
train = train.set_index('id').sort_index()
test = pd.read_csv('../input/tabular-playground-series-may-2022/test.csv')
test = test.set_index('id').sort_index()
display(train.head(2))

# Make features

Create the interaction features either as simple sums (allowing the classifier to fit a boundary) or as ternary features with explicit boundaries.

In [None]:
def make_features(X_in, boundaries=True):
    """
    generate features for incoming dataframe
    
    boundaries: specifies whether interaction features incorporate explicit boundaries
    
    returns: dataframe with features
    """
    
    # start with float and int features
    # (explicit copy so that adding columns below never touches X_in)
    X = X_in.select_dtypes(['float64','int64']).copy()
 
    # manufacture features from f_27:
    # - feature for each character position, with ordinal-encoding (10 features)
    # - feature with total number of distinct characters
    for i in range(10):
        X[f"f_27_{i}"] = X_in["f_27"].str[i].apply(ord) - ord("A")
    # computed once, outside the loop (it does not depend on i)
    X["f_27_count"] = X_in["f_27"].apply(lambda s: len(set(s)))
        
    # interaction features:
    # if boundaries==True, create 3 ternary features based on explicit boundaries
    if boundaries: 
        X["f_21_f_02"] = (X.f_02 + X.f_21 > 5.2).astype('int') - (X.f_02 + X.f_21 < -5.3).astype('int')
        X["f_26_f_00_f_01"] = (X.f_01 + X.f_00 + X.f_26 > 5.0).astype('int') - (X.f_01 + X.f_00 + X.f_26 < -5.0).astype('int')
        X["f_22_f_05"] = (X.f_22 + X.f_05 > 5.1).astype('int') - (X.f_22 + X.f_05 < -5.4).astype('int')
    else:
        X["f_21_f_02"] = X.f_02 + X.f_21 
        X["f_26_f_00_f_01"] = X.f_01 + X.f_00 + X.f_26
        X["f_22_f_05"] = X.f_22 + X.f_05

    return X
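
As a quick illustration of the `f_27` handling in `make_features`, the sketch below applies the same per-position ordinal encoding and distinct-character count to two made-up strings. The strings and the `encoded` frame are hypothetical and only serve to show the shape of the output.

```python
import pandas as pd

# Toy 10-character strings in the style of f_27 (made up for illustration).
f_27 = pd.Series(["ABABABABAB", "AACDEFGHIJ"], name="f_27")

encoded = pd.DataFrame(index=f_27.index)
for i in range(10):
    # Ordinal code of the character at position i: 'A' -> 0, 'B' -> 1, ...
    encoded[f"f_27_{i}"] = f_27.str[i].apply(ord) - ord("A")

# Number of distinct characters in each string.
encoded["f_27_count"] = f_27.apply(lambda s: len(set(s)))

print(encoded)
```

Each string yields 10 positional features plus one count feature, so `f_27` contributes 11 columns in total.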

# Benchmark Alternative Feature Sets

We aren't trying to optimize for score:
- build a simple, quick-to-train classifier that's reasonably performant
- run once with explicit boundary features, again with boundaryless features
- use cross-validation to check that the result is stable across folds
- compare performance of the 2 feature sets
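
The protocol above can be tried without a GPU on synthetic data. The sketch below is only an approximation of the setup: it swaps in scikit-learn's `DecisionTreeClassifier` for XGBoost, and the data, the threshold of 3.0, and all variable names are assumptions made for illustration. What it shares with the benchmark is the design: the same `StratifiedKFold` splitter scores both feature sets, so the folds are identical and the per-fold scores are directly comparable.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 2000

# Two synthetic inputs whose sum drives the target (toy data, not the competition set).
a = 2 * rng.normal(size=n)
b = 2 * rng.normal(size=n)
s = a + b

# Positive class is more likely above +3.0 and less likely below -3.0.
p = np.where(s > 3.0, 0.8, np.where(s < -3.0, 0.2, 0.5))
y_toy = (rng.random(n) < p).astype(int)

# Feature set 1: explicit ternary boundary feature. Feature set 2: raw sum.
X_ternary = np.column_stack([a, b, (s > 3.0).astype(int) - (s < -3.0).astype(int)])
X_sum = np.column_stack([a, b, s])

# Reuse one splitter so both feature sets are scored on identical folds.
cv = StratifiedKFold(n_splits=3)

def score(X):
    clf = DecisionTreeClassifier(max_depth=4, random_state=0)
    return cross_val_score(clf, X, y_toy, cv=cv, scoring="roc_auc")

scores_ternary = score(X_ternary)
scores_sum = score(X_sum)
print("ternary:", scores_ternary.mean(), "raw sum:", scores_sum.mean())
```

The numbers this prints say nothing about the competition data; the sketch only shows how the comparison is wired.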

In [None]:
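# change tree_method (e.g. to 'hist') if you don't want to use a GPU;
# note: XGBoost 2.0+ deprecates 'gpu_hist' in favor of tree_method='hist' with device='cuda'
# change n_estimators if you want to experiment with different AUCs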

def make_xgb(random_state=RANDOM_STATE):
    return XGBClassifier(n_estimators=200,
                         objective='binary:logistic',
                         eval_metric='auc',
                         random_state=random_state,
                         tree_method='gpu_hist'
                        )

In [None]:
%%time

y = train.target

skf = StratifiedKFold(n_splits=3)

xgb1 = make_xgb()
X = make_features(train.drop(columns=['target']), boundaries=True)   
scores1 = cross_val_score(xgb1, X, y, cv=skf, scoring="roc_auc", verbose=2)

xgb2 = make_xgb()
X = make_features(train.drop(columns=['target']), boundaries=False)   
scores2 = cross_val_score(xgb2, X, y, cv=skf, scoring="roc_auc", verbose=2)


In [None]:
mean1, std1 = np.mean(scores1), np.std(scores1)
mean2, std2 = np.mean(scores2), np.std(scores2)
print(f"boundaries: AUC={mean1:.5f} (std={std1:.5f})")
print(f"no boundaries: AUC={mean2:.5f} (std={std2:.5f})")
print(f"AUC gain from explicit boundaries: {mean1-mean2:.5f}")

#### Results
While this gap doesn't look large, it would separate 1st place from 300th place in the TPS May 22 competition!

# Analysis

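First, let's examine scatterplots of the interacting features. Each sum is plotted against a random x-coordinate, which acts as jitter so the points spread out horizontally.

We see well-defined boundaries for all 3 interactions.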

In [None]:
X["f_00 + f_01 + f_26"] = X["f_00"] + X["f_01"] + X["f_26"]
X["f_02 + f_21"] = X["f_02"] + X["f_21"]
X["f_05 + f_22"] = X["f_05"] + X["f_22"]
X["random"] = np.random.randn(len(X))

f,axs = plt.subplots(1,3, figsize=(20,10), sharey=True)
sns.scatterplot(data = X, y="f_00 + f_01 + f_26", x="random", hue=y, s=2, ax=axs[0])
sns.scatterplot(data = X, y="f_02 + f_21", x="random", hue=y, s=2, ax=axs[1])
sns.scatterplot(data = X, y="f_05 + f_22", x="random", hue=y, s=2, ax=axs[2])
plt.show()

#### Explicit boundaries

Fit the model with the explicit boundary feature set so we can examine the features in the trees. 

Inspecting the tree dump, we see for example that `f26+f00+f01` is the split criterion for 441 nodes.

In [None]:
xgb1 = make_xgb()
X = make_features(train.drop(columns=['target']), boundaries=True)   
xgb1.fit(X,y)

trees1 = xgb1.get_booster().trees_to_dataframe()
nodes = trees1[trees1.Feature=='f_26_f_00_f_01'].shape[0]
print(f"total nodes with f26+f00+f01: {nodes}")

#### No Boundaries

Fit the model with the no boundary feature set.

We inspect the tree dump and now see that `f26+f00+f01` is the criterion for 704 nodes.

In [None]:
xgb2 = make_xgb()
X = make_features(train.drop(columns=['target']), boundaries=False)   
xgb2.fit(X,y)

trees2 = xgb2.get_booster().trees_to_dataframe()
nodes = trees2[trees2.Feature=='f_26_f_00_f_01'].shape[0]
print(f"total nodes with f26+f00+f01: {nodes}")

Furthermore, we see that the early trees have split values close to the explicit boundary values.

However, later trees (that are targeting subsets of the training instances) have split values diverging
from the explicit boundaries.

(Here we are looking for -5.0 and 5.0)

In [None]:
display(trees2[(trees2.Tree==0) & (trees2.Feature=='f_26_f_00_f_01')])
display(trees2[(trees2.Tree==50) & (trees2.Feature=='f_26_f_00_f_01')])

Checking f22+f05, we see a similar outcome with earlier vs later trees.

(Here we are looking for 5.1 and -5.4)

In [None]:
display(trees2[(trees2.Tree==0) & (trees2.Feature=='f_22_f_05')])
display(trees2[(trees2.Tree==50) & (trees2.Feature=='f_22_f_05')])
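
Rather than eyeballing individual trees, the same check could be summarized per tree. The sketch below is a rough illustration on a hand-made frame that mimics the `Tree`, `Feature`, and `Split` columns of `trees_to_dataframe()`. The split values are invented, and `toy_trees`, `boundaries`, and `per_tree` are hypothetical names. For each tree it measures how far the feature's splits lie from the nearest explicit boundary.

```python
import pandas as pd

# Hand-made stand-in for trees_to_dataframe() output; values are illustrative only.
toy_trees = pd.DataFrame({
    "Tree":    [0, 0, 0, 50, 50, 50],
    "Feature": ["f_26_f_00_f_01", "f_26_f_00_f_01", "Leaf",
                "f_26_f_00_f_01", "f_05", "Leaf"],
    "Split":   [5.01, -4.98, float("nan"), 3.2, 1.0, float("nan")],
})

# Explicit boundaries used for this interaction in the ternary feature.
boundaries = [-5.0, 5.0]

# Keep only the nodes that split on the interaction feature.
splits = toy_trees[toy_trees.Feature == "f_26_f_00_f_01"].copy()

# Distance from each split value to the nearest explicit boundary.
splits["dist"] = splits["Split"].apply(lambda v: min(abs(v - b) for b in boundaries))

# Largest distance per tree: small means the tree stays near the boundaries.
per_tree = splits.groupby("Tree")["dist"].max()
print(per_tree)
```

Applied to `trees2` in place of `toy_trees`, a small value for early trees and a larger one for later trees would match the pattern described above.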

# Conclusion

At first blush, it would seem simpler to create the interaction features as plain sums without explicit boundary definitions, letting
the classifier fit the boundaries itself.

But we see a slight decrease in performance. A plausible explanation is that the raw sum gives the classifier too much flexibility:
it is free to split on the feature anywhere in its range, which invites overfitting.
