# Synopsis

This notebook shows the results of reducing every feature in this dataset to 0 or 1, where 1 replaces all positive values. I call this "positive encoding." The results show that this feature engineering is useful for logistic regression on this dataset. This dataset is a special case, and there may be very few other datasets that would benefit from this type of encoding.
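
As a minimal sketch of what positive encoding does, the toy frame below is reduced to 0/1 indicators. The column names and values are hypothetical and are not taken from the competition data.

```python
import pandas as pd

# Hypothetical toy data; names and values are made up for illustration only.
toy = pd.DataFrame({"feature_a": [0, 3, 0, 12], "feature_b": [1, 0, 0, 7]})

# Positive encoding: 1 where the value is positive, 0 otherwise.
toy_pos = (toy > 0).astype("int32")
print(toy_pos)
```

Every positive value, whether 1 or 12, collapses to the same 1, which is the information loss discussed in this notebook.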

# Background

I originally checked this just to see how much information the simplification preserved, and also in preparation for using it in combination with the original features. The combination would allow linear models to distinguish between zeros and low positive numbers, a distinction that appeared to have some relevance in my original exploratory analysis.
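
To illustrate why the combination could matter for a linear model, here is a small sketch with hand-picked weights. The feature values, weights, and bias are all hypothetical and not fitted to any data. With the raw feature alone, the linear predictor changes by the same amount for every unit step; adding the indicator gives it a separate jump between 0 and 1.

```python
import numpy as np

x = np.array([0, 1, 2, 3])        # hypothetical count feature
pos = (x > 0).astype(float)       # positive-encoded copy of the same feature

# Hand-picked, hypothetical weights (not fitted to any data).
w_x, w_pos, bias = 0.1, 1.5, -1.0

z_raw = bias + w_x * x                    # raw feature only: equal steps everywhere
z_both = bias + w_x * x + w_pos * pos     # raw + indicator: extra jump from 0 to 1

print("steps, raw only:       ", np.diff(z_raw))
print("steps, raw + indicator:", np.diff(z_both))
```

The first step of `z_both` is larger than the rest by exactly `w_pos`, which is the zero-versus-positive distinction the raw feature cannot express on its own.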

I was surprised to find that this one simple encoding, which loses so much information, allowed a basic logistic regression to score as well as a basic LightGBM model.

Since tree-based models like LightGBM are already free to treat any one value as special, it is no surprise that LightGBM does not gain from this encoding. It is unexpected, though, that it loses very little despite having all positive values of each feature compressed together.
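
The point about trees can be sketched as follows, assuming the features are non-negative integers (an assumption about the data, with hypothetical values below). A single threshold placed anywhere between 0 and 1 partitions the rows exactly as the positive-encoded indicator does, so a tree can recover the encoding with one split.

```python
import numpy as np

# Hypothetical non-negative integer feature (an assumption about the data).
x = np.array([0, 0, 1, 4, 0, 9])

indicator = x > 0        # the positive encoding
tree_split = x > 0.5     # a threshold a tree could choose between 0 and 1

same_partition = np.array_equal(indicator, tree_split)
print(same_partition)
```

What the tree gives up under positive encoding is any further split above that threshold, for example separating 1 from 9.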

# Setup

In [1]:
import copy

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from pandas.io.formats import style

import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns

import sklearn
import sklearn.preprocessing as sk_prep
import sklearn.model_selection as sk_ms
import sklearn.feature_selection as sk_fs
import sklearn.pipeline as sk_pipe
import sklearn.compose as sk_comp
import sklearn.base as sk_base
import sklearn.ensemble as sk_ens
import sklearn.metrics as sk_met
import sklearn.linear_model as sk_lm
import sklearn.tree as sk_tree
import sklearn.svm as sk_svm
import sklearn.decomposition as sk_de
import category_encoders as ce

from scipy import stats

import lightgbm as lgbm

In [1]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [1]:
DATA_DIR = '/kaggle/input/tabular-playground-series-jun-2021'
RANDOM_STATE = 9003

# Load Data

In [1]:
train_set = pd.read_csv(os.path.join(DATA_DIR, 'train.csv'))
train_data = train_set.iloc[:, 1:-1] # Feature columns

train_y = train_set['target']

print(train_set.shape)
train_set

In [1]:
# Fold preparation

cv_folds = pd.Series(index=train_data.index, dtype='int64')

kf = sk_ms.KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

In [1]:
loss_list_1 = []

# Models with no FE for Comparison

In [1]:
%%time

# CV

model = sk_lm.LogisticRegression(solver='lbfgs', max_iter=1000)

lr_cv_scores = sk_ms.cross_val_score(
    model, train_data, train_y, 
    cv=kf,
    scoring=sk_met.make_scorer(sk_met.log_loss, greater_is_better=False, needs_proba=True)
) * -1

lr_loss = np.mean(lr_cv_scores)
print(f'Mean: {lr_loss} Folds: {lr_cv_scores}')

In [1]:
%%time

model = lgbm.LGBMClassifier(random_state=RANDOM_STATE)

# CV

lgbm_cv_scores = sk_ms.cross_val_score(
    model, train_data, train_y, 
    cv=kf,
    scoring=sk_met.make_scorer(sk_met.log_loss, greater_is_better=False, needs_proba=True)
) * -1

lgbm_loss = np.mean(lgbm_cv_scores)
print(f'Mean: {lgbm_loss} Folds: {lgbm_cv_scores}')

In [1]:
loss_list_1.append(['None', lr_loss, lgbm_loss])

# Positive Encoding

In [1]:
# Create an array with True/False for positive/zero
tr1_ar = train_data.values > 0
# Convert data to 1/0 and put in a DataFrame with the original column names
tr1_X = pd.DataFrame(tr1_ar.astype('int32'), columns=train_data.columns, index=train_data.index)

tr2_X, val2_X, tr2_y, val2_y = sk_ms.train_test_split(tr1_X, train_y, test_size=0.3, random_state=RANDOM_STATE)

tr2_X

In [1]:
%%time

# CV

model = sk_lm.LogisticRegression(solver='lbfgs', max_iter=1000)

lr_cv_scores = sk_ms.cross_val_score(
    model, tr1_X, train_y, 
    cv=kf,
    scoring=sk_met.make_scorer(sk_met.log_loss, greater_is_better=False, needs_proba=True)
) * -1

lr_loss = np.mean(lr_cv_scores)
print(f'Mean: {lr_loss} Folds: {lr_cv_scores}')

In [1]:
%%time

# CV

model = lgbm.LGBMClassifier(random_state=RANDOM_STATE)

lgbm_cv_scores = sk_ms.cross_val_score(
    model, tr1_X, train_y, 
    cv=kf,
    scoring=sk_met.make_scorer(sk_met.log_loss, greater_is_better=False, needs_proba=True)
) * -1

lgbm_loss = np.mean(lgbm_cv_scores)
print(f'Mean: {lgbm_loss} Folds: {lgbm_cv_scores}')

In [1]:
loss_list_1.append(['PosEnc', lr_loss, lgbm_loss])

# Positive Encoding Plus Original

In [1]:
tr1_ar = train_data.values > 0
# This time prepend "pos_" to the column names to distinguish them from the original columns
tr1_X = pd.DataFrame(tr1_ar.astype('int32'), columns='pos_' + train_data.columns, index=train_data.index)
tr3_X = pd.concat([train_data, tr1_X], axis=1)

tr2_X, val2_X, tr2_y, val2_y = sk_ms.train_test_split(tr3_X, train_y, test_size=0.3, random_state=RANDOM_STATE)

In [1]:
%%time

# CV

model = sk_lm.LogisticRegression(solver='lbfgs', max_iter=1000)

lr_cv_scores = sk_ms.cross_val_score(
    model, tr3_X, train_y, 
    cv=kf,
    scoring=sk_met.make_scorer(sk_met.log_loss, greater_is_better=False, needs_proba=True)
) * -1

lr_loss = np.mean(lr_cv_scores)
print(f'Mean: {lr_loss} Folds: {lr_cv_scores}')

In [1]:
%%time

# CV

model = lgbm.LGBMClassifier(random_state=RANDOM_STATE)

lgbm_cv_scores = sk_ms.cross_val_score(
    model, tr3_X, train_y, 
    cv=kf,
    scoring=sk_met.make_scorer(sk_met.log_loss, greater_is_better=False, needs_proba=True)
) * -1

lgbm_loss = np.mean(lgbm_cv_scores)
print(f'Mean: {lgbm_loss} Folds: {lgbm_cv_scores}')

In [1]:
loss_list_1.append(['PosEnc+Orig', lr_loss, lgbm_loss])

# Summary

In [1]:
loss_df = pd.DataFrame(loss_list_1, columns=['Method', 'LR_Loss', 'LGBM_Loss'])
# cmap = matplotlib.colors.Colormap('viridis')
style.Styler(loss_df, precision=4).background_gradient(cmap='viridis', vmin=1.70, vmax=1.85)

We see from this table that positive encoding substantially improves logistic regression but does little for LightGBM. Adding the original values back, so that no information is lost, does not help much with either model.