# 1 Introduction

This EDA explores the data available for the Tabular Playground Series - October 2021 competition. Simple data exploration is performed, as well as preliminary modeling.

## 1.1 Evaluation Criteria

The goal for this competition is to maximize the ROC AUC score. This means building classifiers or regression models that output a probability for the `target` class based on the features included.
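
As a rough illustration of what the metric rewards, ROC AUC can be read as the share of (positive, negative) pairs in which the positive record gets the higher score. The sketch below uses a hypothetical helper, `pairwise_auc`, and made-up toy labels and scores; it is not part of the competition code and is far too slow for a million rows.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score against every negative score.
    diff = pos[:, None] - neg[None, :]
    # A correctly ordered pair counts 1, a tie counts 0.5.
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Toy data (illustrative only): 3 of the 4 positive/negative pairs are ordered correctly.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))  # 0.75
```

Because only the ordering of the scores matters, any monotonic rescaling of the predicted probabilities leaves the score unchanged.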

In [None]:
import pandas as pd
import numpy as np
import gc

train = pd.read_csv("../input/tabular-playground-series-oct-2021/train.csv")
test = pd.read_csv("../input/tabular-playground-series-oct-2021/test.csv")

In [None]:
def cat_column_info(column):
    num_categories = train[column].nunique()
    print("------> {} <------".format(column))
    print("--: train - type {}".format(train[column].dtype))
    print("--: test  - type {}".format(test[column].dtype))
    print("--: train - # categories {}".format(train[column].nunique()))
    print("--: test  - # categories {}".format(test[column].nunique()))
    if num_categories < 10:
        if train[column].dtype == "int64":
            print("--: train - values {}".format(np.sort(train[column].unique())))
            print("--: test  - values {}".format(np.sort(test[column].unique())))
        else:
            print("--: train - values {}".format(train[column].unique()))
            print("--: test  - values {}".format(test[column].unique()))
    print("--: train - NaN count {}".format(train[column].isnull().values.sum()))
    print("--: test  - NaN count {}".format(test[column].isnull().values.sum()))
    print("")

def cont_column_info(column):
    print("------> {} <------".format(column))
    print("--: train - type {}".format(train[column].dtype))
    print("--: test  - type {}".format(test[column].dtype))
    print("--: train - min {}".format(train[column].min()))
    print("--: test  - min {}".format(test[column].min()))
    print("--: train - max {}".format(train[column].max()))
    print("--: test  - max {}".format(test[column].max()))    
    print("--: train - NaN count {}".format(train[column].isnull().values.sum()))
    print("--: test  - NaN count {}".format(test[column].isnull().values.sum()))
    print("")
    
print(": Train shape {}".format(train.shape))
print(": Test shape {}".format(test.shape))
print("")

## 1.3 Training and Testing Files

Our input data consists of:

* `train.csv` - 2.2 GB in size, containing 287 columns and 1,000,000 rows
* `test.csv` - 1.1 GB in size, containing 286 columns and 500,000 rows

One main observation here is the sheer size of the data we are looking at. While 2.2 GB fits in memory, model training may exert pressure on the Kaggle 16 GB CPU memory and GPU memory limitations. We should definitely explore what column formats are at play, and whether running functions to [reduce memory usage](https://www.kaggle.com/gemartin/load-data-reduce-memory-usage) on Pandas dataframes can ease pressure on memory.
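
A minimal sketch of the downcasting idea, assuming the columns are plain integers and floats. The helper name `downcast_numeric` and the demo frame are hypothetical, and this is not the linked notebook's implementation. Note that storing floats as `float32` keeps only about 7 significant digits.

```python
import numpy as np
import pandas as pd

def downcast_numeric(df):
    """Return a copy of df with numeric columns stored in smaller dtypes."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Picks the smallest integer type that can hold the values.
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # Picks float32 when the values survive the conversion closely enough.
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# Synthetic stand-in for a binary column and a pre-scaled continuous column.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "flag": rng.integers(0, 2, size=1000),
    "value": rng.random(1000),
})
small = downcast_numeric(demo)
print(demo.memory_usage(deep=True).sum(), "->", small.memory_usage(deep=True).sum())
print(small.dtypes)
```

Whether the precision loss is acceptable depends on the model; tree-based models are usually insensitive to it, but that is worth checking rather than assuming.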

# 2 Features

## 2.1 `id` Column

The `id` column is an `int64` integer column that contains unique record indicators ranging from 0 to 999,999. Like most Tabular Series, this is simply an identifier for the record and is likely not going to be of use for modelling purposes.
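
One way to back up the claim that a column is a pure row identifier is to check that it is unique and gap-free. The helper `looks_like_row_id` and the small demo frame below are hypothetical and only sketch the idea.

```python
import pandas as pd

def looks_like_row_id(series):
    """True if the column is a gap-free run of unique values one unit apart."""
    return bool(
        series.is_unique
        and series.max() - series.min() + 1 == len(series)
    )

# Toy frame (illustrative only).
demo = pd.DataFrame({"id": range(5), "f0": [0.2, 0.4, 0.1, 0.9, 0.5]})
print(looks_like_row_id(demo["id"]))
print(looks_like_row_id(demo["f0"]))
```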

## 2.2 `target` Column

The `target` column contains the class targets we are attempting to predict. We should look first to see what class breakdown we have.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')
sns_params = {"palette": "bwr_r"}

# Keep the counts as a Series so the code does not depend on how value_counts() names its column
counts = train["target"].value_counts().sort_index()
ax = sns.barplot(x=counts.index, y=counts.values, **sns_params)
for p in ax.patches:
    ax.text(x=p.get_x()+(p.get_width()/2), y=p.get_height(), s="{:,d}".format(round(p.get_height())), ha="center")
_ = ax.set_title("Class Balance", fontsize=15)
_ = ax.set_ylabel("Number of Records", fontsize=15)
_ = ax.set_xlabel("Class", fontsize=15)

del(counts)
_ = gc.collect()

The predicted class is well balanced, with little to no skew. This is interesting as it gives us a lot of training data per class to look at.
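
The same balance check can be expressed as class shares rather than a plot. This is a small sketch on a made-up target column; `imbalance_ratio` is an illustrative name, not something used elsewhere in this notebook.

```python
import pandas as pd

# Toy target column (illustrative only).
demo_target = pd.Series([0, 1, 1, 0, 1, 0, 0, 1], name="target")

# normalize=True returns class shares instead of raw counts.
shares = demo_target.value_counts(normalize=True).sort_index()
print(shares)

# Ratio of the largest to the smallest class share; 1.0 means perfectly balanced.
imbalance_ratio = shares.max() / shares.min()
print(imbalance_ratio)
```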

## 2.3 `fX` Columns

There are 285 feature columns named `f0` through `f284`. 


## 2.4 Null Values

First things first, we should check to see if we are missing any values in the columns, as this was an interesting feature from the competition last month.

In [None]:
# Count the number of null values that occur in each row
train["null_count"] = train.isnull().sum(axis=1)

# Group the null counts
counts = train.groupby("null_count")["target"].count().to_dict()
null_data = {"{} Null Value(s)".format(k) : v for k, v in counts.items() if k < 6}

# Plot the null count results
pie, ax = plt.subplots(figsize=[20, 10])
colors = sns.color_palette("bwr_r")[0:5]
plt.pie(x=null_data.values(), autopct="%.2f%%", explode=[0.05]*len(null_data.keys()), labels=null_data.keys(), pctdistance=0.5, colors=colors)
_ = plt.title("Percentage of Null Values Per Row (Train Data)", fontsize=14)

del(counts)
del(null_data)
_ = gc.collect()

In [None]:
# Count the number of null values that occur in each row
test["null_count"] = test.isnull().sum(axis=1)

# Group the null counts
counts = test.groupby("null_count")["null_count"].count().to_dict()
null_data = {"{} Null Value(s)".format(k) : v for k, v in counts.items() if k < 6}

# Plot the null count results
pie, ax = plt.subplots(figsize=[20, 10])
plt.pie(x=null_data.values(), autopct="%.2f%%", explode=[0.05]*len(null_data.keys()), labels=null_data.keys(), pctdistance=0.5, colors=colors)
_ = plt.title("Percentage of Null Values Per Row (Test Data)", fontsize=14)

del(counts)
del(null_data)
del(test)
_ = gc.collect()

With this competition, we're not seeing any missing values. This means we don't have to worry about imputing or creating new features based on null values.
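
When the expectation is that nothing is missing, a direct check can be quicker to read than a pie chart. The sketch below runs on a small made-up frame that does contain nulls, so the counts are visible.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing values (illustrative only).
demo = pd.DataFrame({
    "f0": [0.1, np.nan, 0.3, 0.4],
    "f1": [np.nan, np.nan, 0.7, 0.8],
})

# Missing values per column.
nulls_per_column = demo.isnull().sum()

# How many rows have 0, 1, 2, ... missing values.
nulls_per_row = demo.isnull().sum(axis=1).value_counts().sort_index()

# Single yes/no answer for the whole frame.
has_missing = bool(demo.isnull().values.any())

print(nulls_per_column)
print(nulls_per_row)
print(has_missing)
```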

## 2.5 Continuous Columns

Columns `f0` through `f241` (with the exception of `f22` and `f43`) are all of type `float64`. All columns are scaled between 0 and 1. This is interesting, since the data is already pre-scaled for use with a neural network. 
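
A claim like "all float columns lie between 0 and 1" can be verified in a few lines. The sketch below uses a synthetic frame with hypothetical values; the column names only mimic the competition's naming.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: two continuous columns and one binary column.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "f0": rng.random(100),
    "f1": rng.random(100),
    "f22": rng.integers(0, 2, size=100),
})

# Select the float columns and compute their observed range.
float_cols = demo.select_dtypes(include="float64").columns
ranges = demo[float_cols].agg(["min", "max"]).T

# Flag any column that leaves the unit interval.
within_unit_interval = (ranges["min"] >= 0) & (ranges["max"] <= 1)
print(ranges.assign(within_unit_interval=within_unit_interval))
```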

## 2.6 Categorical Columns

Columns `f22`, `f43`, and `f242` through `f284` are all of type `int64`. Value counts suggest that these columns are likely one-hot encoded variables of some form. Let's take a look at how they line up with the target variable.

In [None]:
cat_features = ["f22", "f43"]
cat_features.extend(["f{}".format(x) for x in range(242, 285)])

In [None]:
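# 45 categorical features do not fit an 11 x 4 grid, so size the grid from the list
n_cols = 4
n_rows = int(np.ceil(len(cat_features) / n_cols))
fig, axs = plt.subplots(n_rows, n_cols, figsize=(4*n_cols, 3*n_rows), squeeze=False, sharey=True)

for ax, feature in zip(axs.ravel(), cat_features):
    # reset_index(name=...) names the count column regardless of pandas version
    x = train[[feature, "target"]].value_counts().sort_index().reset_index(name="# of Samples")
    sns.barplot(x=feature, y="# of Samples", hue="target", data=x, ax=ax, **sns_params)
    ax.set_xlabel(feature)
    del(x)

# Hide the panels left over at the end of the grid
for ax in axs.ravel()[len(cat_features):]:
    ax.set_visible(False)
plt.tight_layout()
plt.show()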

_ = gc.collect()

Here we start to see some interesting information. Most of the categorical variables aren't very informative when it comes to discriminating the target variable. The one exception is feature `f22`, where a value of `0` is strongly associated with a `target` of `1`, while a value of `1` is strongly associated with a `target` of `0`. The remaining categorical features do not show any strong relationship with the target. This suggests that, as a baseline, we could drop most of the other categorical features if dimensionality is a problem.
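
The relationship between a binary feature and the target can also be summarized numerically with a row-normalized cross-tabulation. The sketch below uses made-up values, so the rates shown are not the competition's.

```python
import pandas as pd

# Toy data (illustrative only): f22 == 0 mostly goes with target == 1.
demo = pd.DataFrame({
    "f22":    [0, 0, 0, 0, 1, 1, 1, 1],
    "target": [1, 1, 1, 0, 0, 0, 0, 1],
})

# normalize="index" makes each row sum to 1, giving the target share per feature value.
rates = pd.crosstab(demo["f22"], demo["target"], normalize="index")
print(rates)
```

A feature with no discriminating power would show nearly identical rows in this table.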

## 2.7 Categorical Feature Relationships

Given that most of the categorical features are `0` or `1`, it is reasonable to hypothesize that they may be one-hot encoded categoricals. The question is whether we can determine which feature columns were broken out into one-hot counterparts, and whether we could reasonably recombine them into a single categorical field. This may be worthwhile given the large number of categorical features: if several of them belong to a single category, we may want to recreate that category and try other categorical encodings in place of one-hot.

To check for one-hot dependencies, we look at the categorical columns and check for pairs of columns that contain mutually exclusive information. For example, if columns `f242` and `f243` were broken out from a categorical column with values `A` and `B` into two new columns such as `A_present` and `B_present`, then we would never expect to see both `A_present` and `B_present` having the value `1` at the same time.

In [None]:
temp_features = cat_features.copy()
temp_features.remove("f22")
temp_features.remove("f43")

df = pd.DataFrame(train[temp_features])
related_features = []
for x in range(242, 285):
    for y in range(x+1, 285):
        if len(df[(df["f{}".format(x)] == 1) & (df["f{}".format(y)] == 1)]) == 0:
            related_features.append(("f{}".format(x), "f{}".format(y)))

if len(related_features) == 0:
    print("-> Found no one-hot dependencies between categorical features")
else:
    print("-> The following features may be dependent on one another:")
    for (x, y) in related_features:
        # The tuples already hold full column names such as "f242"
        print("---> {} and {}".format(x, y))

As we can see, there appear to be no links between categorical columns, which means that the binary columns are not categorically related to each other.
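
The pairwise check above filters the dataframe once per column pair. A possibly faster alternative, sketched here under the assumption that the columns hold only `0` and `1`, is a single matrix product: entry `(i, j)` of `X.T @ X` counts the rows where both columns are `1`. The helper name `never_co_occur` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

def never_co_occur(df):
    """Return column pairs that are never both 1 in the same row."""
    values = df.to_numpy(dtype=np.int64)
    # co_occurrence[i, j] = number of rows where column i and column j are both 1.
    co_occurrence = values.T @ values
    cols = list(df.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if co_occurrence[i, j] == 0:
                pairs.append((cols[i], cols[j]))
    return pairs

# Toy frame (illustrative only): "a" and "b" are never 1 together.
demo = pd.DataFrame({
    "a": [1, 0, 0, 1],
    "b": [0, 1, 0, 0],
    "c": [1, 1, 0, 0],
})
print(never_co_occur(demo))  # [('a', 'b')]
```

Mutual exclusivity is only a necessary condition: columns from a true one-hot group would also have at most one `1` per row across the whole group.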

## 2.8 Sum of Binary Features Per Row

One other aspect to look at is whether the sum of the binary features carries any information about `target`.

In [None]:
train["binary_count"] = train[cat_features].sum(axis=1)

# reset_index(name=...) names the count column regardless of pandas version
x = train[["binary_count", "target"]].value_counts().sort_index().reset_index(name="# of Samples")
fig, ax = plt.subplots(figsize=(20, 20))
_ = sns.barplot(x="binary_count", y="# of Samples", hue="target", data=x, **sns_params)
_ = plt.xlabel("Number of Binary Features = 1", fontsize=15)

del(x)
_ = gc.collect()

We are starting to see a little separation based on target value. For example, there is a higher likelihood of a `target` value of `1` if there are 11 binary features set to `1`. We may want to include this information in our models. It looks as though the probability of having a `target` value of `1` is higher if there are fewer than 15 binary features set. Let's zoom in on this a little more.

In [None]:
train["binary_15_16"] = train["binary_count"].apply(lambda x: 15 if x <= 15 else 16)

# reset_index(name=...) names the count column regardless of pandas version
x = train[["binary_15_16", "target"]].value_counts().sort_index().reset_index(name="# of Samples")
fig, ax = plt.subplots(figsize=(20, 20))
_ = sns.barplot(x="binary_15_16", y="# of Samples", hue="target", data=x, **sns_params)
_ = plt.xlabel("Number of Binary Features", fontsize=15)

del(x)
_ = gc.collect()

There isn't a huge amount of distinction between them. If we bin them a little more, perhaps there may be more differentiation.

In [None]:
def bin_count(x):
    # Map a count to the upper edge of its 5-wide bin; anything above 25 goes to the 30 bin
    for upper in (5, 10, 15, 20, 25):
        if x <= upper:
            return upper
    return 30

train["binary_5_10_15_20_25_30"] = train["binary_count"].apply(bin_count)

# reset_index(name=...) names the count column regardless of pandas version
x = train[["binary_5_10_15_20_25_30", "target"]].value_counts().sort_index().reset_index(name="# of Samples")
fig, ax = plt.subplots(figsize=(20, 20))
_ = sns.barplot(x="binary_5_10_15_20_25_30", y="# of Samples", hue="target", data=x, **sns_params)
_ = plt.xlabel("Number of Binary Features", fontsize=15)

del(x)
_ = gc.collect()

Again, it looks like generating bins isn't going to help us any more than simply counting the number of binary values that occur in each row.
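
Raw counts per bin mix two things: how many rows fall in the bin and how the target splits within it. A sketch of an alternative view is the mean of `target` per count or per bin, which gives the positive-class share directly. The data below is made up, and the bin edges simply mirror the 5-wide bins used above, with 45 as an assumed upper limit (the number of binary columns).

```python
import pandas as pd

# Toy data (illustrative only).
demo = pd.DataFrame({
    "binary_count": [3, 7, 7, 12, 12, 18, 18, 22],
    "target":       [1, 1, 0, 1, 1, 0, 0, 0],
})

# Positive-class share and row count for each exact count value.
rate_by_count = demo.groupby("binary_count")["target"].agg(["mean", "size"])
print(rate_by_count)

# Same idea with 5-wide bins; include_lowest=True keeps a count of 0 in the first bin.
bins = pd.cut(demo["binary_count"], bins=[0, 5, 10, 15, 20, 25, 45], include_lowest=True)
rate_by_bin = demo.groupby(bins, observed=False)["target"].mean()
print(rate_by_bin)
```

Reporting `size` next to `mean` helps avoid over-reading rates that come from very few rows.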

## 2.9 P-Value Testing

While looking at features visually will tell us some interesting information, we can also use p-value testing to see if a feature has a net impact on a simple regression model. 

In [None]:
from statsmodels.regression.linear_model import OLS
from statsmodels.tools.tools import add_constant

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import gc

cont_features = ["f{}".format(x) for x in range(242)]
cont_features.remove("f22")
cont_features.remove("f43")

features = []
features.extend(cat_features)
features.extend(cont_features)
x = add_constant(train[features])
model = OLS(train["target"], x).fit()

In [None]:
pvalues = pd.DataFrame(model.pvalues)
pvalues.reset_index(inplace=True)
pvalues.rename(columns={0: "pvalue", "index": "feature"}, inplace=True)
pvalues.style.background_gradient(cmap='YlOrRd')

In [None]:
del(model)
del(x)
_ = gc.collect()

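The null hypothesis for each coefficient is that the feature has no linear effect on `target`, meaning its coefficient is zero. A p-value greater than 0.05 means we fail to reject that hypothesis at the 5% level, so the regression gives no evidence that the feature matters. These features are candidates for removal, with the caveat that a linear test can miss non-linear effects that a tree-based model would pick up. Let's list them below.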

In [None]:
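# Collect features whose p-value exceeds 0.05; skip the intercept ("const") added by add_constant
features_to_drop = []
for index, row in pvalues.iterrows():
    if row["pvalue"] > 0.05 and row["feature"] != "const":
        features_to_drop.append(row["feature"])
features_to_drop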
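
To make the p-value logic concrete, here is a rough sketch that computes coefficient p-values by hand on synthetic data. It assumes a normal approximation to the t distribution, which is only reasonable with many more rows than columns, and it is not how `statsmodels` computes them internally. The helper `ols_p_values` and the variables `informative` and `noise` are hypothetical.

```python
import math
import numpy as np

def ols_p_values(X, y):
    """Two-sided p-values for OLS coefficients, intercept first (normal approximation)."""
    X = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    dof = X.shape[0] - X.shape[1]
    # Residual variance and the coefficient covariance matrix.
    sigma2 = residuals @ residuals / dof
    cov = sigma2 * np.linalg.inv(X.T @ X)
    t_stats = beta / np.sqrt(np.diag(cov))
    # erfc(|t| / sqrt(2)) is the two-sided tail probability of a standard normal.
    return np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_stats])

rng = np.random.default_rng(0)
n = 5000
informative = rng.random(n)
noise = rng.random(n)
# Synthetic binary target that depends only on `informative`.
y = (rng.random(n) < informative).astype(float)

p_const, p_informative, p_noise = ols_p_values(np.column_stack([informative, noise]), y)
print(p_informative, p_noise)
```

The informative column should come out with a p-value near zero, while the noise column has no systematic reason to fall below 0.05.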

# 3 Simple Models

Given we know a little about the distribution of data, we should establish a set of baseline models to understand what kind of performance we can get from models.

## 3.1 LightGBM

We'll start with a simple LightGBM model and see how our features work out from there.

In [None]:
cont_features = ["f{}".format(x) for x in range(242)]
cont_features.remove("f22")
cont_features.remove("f43")

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix

target = train["target"]
cv_rounds = 3

k_fold = StratifiedKFold(
    n_splits=cv_rounds,
    random_state=2021,
    shuffle=True,
)

features = []
features.extend(cat_features)
features.extend(cont_features)

train_preds = np.zeros(len(train.index), )
train_probas = np.zeros(len(train.index), )

for fold, (train_index, test_index) in enumerate(k_fold.split(train[features], target)):
    x_train = train[features].iloc[train_index]
    y_train = target.iloc[train_index]

    x_valid = train[features].iloc[test_index]
    y_valid = target.iloc[test_index]

    model = LGBMClassifier(
        random_state=2021,
        n_estimators=2000,
        verbose=-1,
        metric="auc",
    )
    model.fit(
        x_train,
        y_train,
        eval_set=[(x_valid, y_valid)],
        early_stopping_rounds=50,
        verbose=0,
    )

    train_oof_preds = model.predict(x_valid)
    train_oof_probas = model.predict_proba(x_valid)[:, -1]
    train_preds[test_index] = train_oof_preds
    train_probas[test_index] = train_oof_probas
    
    print("-- Fold {}:".format(fold+1))
    print("{}".format(classification_report(y_valid, train_oof_preds)))

print("-- Overall:")
print("{}".format(classification_report(target, train_preds)))
print("-- ROC AUC: {}".format(roc_auc_score(target, train_probas)))

train["unmodified_preds"] = train_preds
train["unmodified_probas"] = train_probas

# Show the confusion matrix
confusion = confusion_matrix(train["target"], train["unmodified_preds"])
ax = sns.heatmap(confusion, annot=True, fmt=",d")
_ = ax.set_title("Confusion Matrix for LGB Classifier (Unmodified Dataset)", fontsize=15)
_ = ax.set_ylabel("Actual Class")
_ = ax.set_xlabel("Predicted Class")

del(train_preds)
del(train_probas)
del(confusion)
_ = gc.collect()

Looking across folds, the results are stable, which is good. Overall precision and recall are fairly high for both the positive and negative classes, although the model has more trouble capturing instances of the positive class.
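
The fold loop above is reused with different feature lists in the following sections. One way to avoid the repetition is a small helper that returns out-of-fold probabilities. The sketch below is hypothetical: `oof_probas` is not part of this notebook, and it uses scikit-learn's `LogisticRegression` on synthetic data as a stand-in for the LightGBM model so that it runs quickly on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def oof_probas(make_model, X, y, n_splits=3, seed=2021):
    """Out-of-fold positive-class probabilities from stratified K-fold CV."""
    k_fold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    probas = np.zeros(len(y))
    for train_index, valid_index in k_fold.split(X, y):
        # A fresh model per fold, so no fold sees its own validation rows.
        model = make_model()
        model.fit(X[train_index], y[train_index])
        probas[valid_index] = model.predict_proba(X[valid_index])[:, 1]
    return probas

# Synthetic data (illustrative only).
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
demo_probas = oof_probas(lambda: LogisticRegression(max_iter=1000), X_demo, y_demo)
print("OOF ROC AUC: {:.4f}".format(roc_auc_score(y_demo, demo_probas)))
```

Passing a factory such as `lambda: LGBMClassifier(...)` instead would keep the same structure; early stopping would still need to be wired in separately.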

## 3.2 LightGBM Dropping Uninformative Categorical Features

Given what we know about the categorical features, `f22` is the only one that appears informative. We can rebuild our model and drop the other categoricals.

In [None]:
target = train["target"]
cv_rounds = 3

k_fold = StratifiedKFold(
    n_splits=cv_rounds,
    random_state=2021,
    shuffle=True,
)

features = ["f22"]
features.extend(cont_features)

train_preds = np.zeros(len(train.index), )
train_probas = np.zeros(len(train.index), )

for fold, (train_index, test_index) in enumerate(k_fold.split(train[features], target)):
    x_train = train[features].iloc[train_index]
    y_train = target.iloc[train_index]

    x_valid = train[features].iloc[test_index]
    y_valid = target.iloc[test_index]

    model = LGBMClassifier(
        random_state=2021,
        n_estimators=2000,
        verbose=-1,
        metric="auc",
    )
    model.fit(
        x_train,
        y_train,
        eval_set=[(x_valid, y_valid)],
        early_stopping_rounds=50,
        verbose=0,
    )

    train_oof_preds = model.predict(x_valid)
    train_oof_probas = model.predict_proba(x_valid)[:, -1]
    train_preds[test_index] = train_oof_preds
    train_probas[test_index] = train_oof_probas
    
    print("-- Fold {}:".format(fold+1))
    print("{}".format(classification_report(y_valid, train_oof_preds)))

print("-- Overall:")
print("{}".format(classification_report(target, train_preds)))
print("-- ROC AUC: {}".format(roc_auc_score(target, train_probas)))

train["drop_cat_features_preds"] = train_preds
train["drop_cat_features_probas"] = train_probas

# Show the confusion matrix
confusion = confusion_matrix(train["target"], train["drop_cat_features_preds"])
ax = sns.heatmap(confusion, annot=True, fmt=",d")
_ = ax.set_title("Confusion Matrix for LGB Classifier (Dropping Most Categoricals)", fontsize=15)
_ = ax.set_ylabel("Actual Class")
_ = ax.set_xlabel("Predicted Class")

del(train_preds)
del(train_probas)
_ = gc.collect()

We're seeing a slight decrease in our ROC AUC score without some of the categoricals. However, as expected, the impact is quite minimal, suggesting that `f22` has a lot of pull.

## 3.3 Using Binary Counts

The next interesting feature we should look at is the binary count across each row. The data discovery portion above found that there may be some correlation between the number of binary features set to `1` in a row and its relation to `target`.

In [None]:
target = train["target"]
cv_rounds = 3

k_fold = StratifiedKFold(
    n_splits=cv_rounds,
    random_state=2021,
    shuffle=True,
)

features = []
features.extend(cat_features)
features.extend(cont_features)
features.append("binary_count")

train_preds = np.zeros(len(train.index), )
train_probas = np.zeros(len(train.index), )

for fold, (train_index, test_index) in enumerate(k_fold.split(train[features], target)):
    x_train = train[features].iloc[train_index]
    y_train = target.iloc[train_index]

    x_valid = train[features].iloc[test_index]
    y_valid = target.iloc[test_index]

    model = LGBMClassifier(
        random_state=2021,
        n_estimators=2000,
        verbose=-1,
        metric="auc",
    )
    model.fit(
        x_train,
        y_train,
        eval_set=[(x_valid, y_valid)],
        early_stopping_rounds=50,
        verbose=0,
    )

    train_oof_preds = model.predict(x_valid)
    train_oof_probas = model.predict_proba(x_valid)[:, -1]
    train_preds[test_index] = train_oof_preds
    train_probas[test_index] = train_oof_probas
    
    print("-- Fold {}:".format(fold+1))
    print("{}".format(classification_report(y_valid, train_oof_preds)))

print("-- Overall:")
print("{}".format(classification_report(target, train_preds)))
print("-- ROC AUC: {}".format(roc_auc_score(target, train_probas)))

train["bincount_preds"] = train_preds
train["bincount_probas"] = train_probas

# Show the confusion matrix
confusion = confusion_matrix(train["target"], train["bincount_preds"])
ax = sns.heatmap(confusion, annot=True, fmt=",d")
_ = ax.set_title("Confusion Matrix for LGB Classifier (Binary Counts)", fontsize=15)
_ = ax.set_ylabel("Actual Class")
_ = ax.set_xlabel("Predicted Class")

del(train_preds)
del(train_probas)
_ = gc.collect()

While we see a small amount of lift using binary counts, it doesn't impact ROC AUC scores by a large amount when compared to the default model.

## 3.4 Removing Features

Let's take a look at what happens when we remove features as indicated by their p-value.

In [None]:
target = train["target"]
cv_rounds = 3

k_fold = StratifiedKFold(
    n_splits=cv_rounds,
    random_state=2021,
    shuffle=True,
)

features = []
features.extend(cat_features)
features.extend(cont_features)
features.append("binary_count")

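# Drop the flagged features; names that are not in the list (such as "const") are ignored
features = [feature for feature in features if feature not in features_to_drop]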

train_preds = np.zeros(len(train.index), )
train_probas = np.zeros(len(train.index), )

for fold, (train_index, test_index) in enumerate(k_fold.split(train[features], target)):
    x_train = train[features].iloc[train_index]
    y_train = target.iloc[train_index]

    x_valid = train[features].iloc[test_index]
    y_valid = target.iloc[test_index]

    model = LGBMClassifier(
        random_state=2021,
        n_estimators=2000,
        verbose=-1,
        metric="auc",
    )
    model.fit(
        x_train,
        y_train,
        eval_set=[(x_valid, y_valid)],
        early_stopping_rounds=50,
        verbose=0,
    )

    train_oof_preds = model.predict(x_valid)
    train_oof_probas = model.predict_proba(x_valid)[:, -1]
    train_preds[test_index] = train_oof_preds
    train_probas[test_index] = train_oof_probas
    
    print("-- Fold {}:".format(fold+1))
    print("{}".format(classification_report(y_valid, train_oof_preds)))

print("-- Overall:")
print("{}".format(classification_report(target, train_preds)))
print("-- ROC AUC: {}".format(roc_auc_score(target, train_probas)))

train["bincount_remove_preds"] = train_preds
train["bincount_remove_probas"] = train_probas

# Show the confusion matrix
confusion = confusion_matrix(train["target"], train["bincount_remove_preds"])
ax = sns.heatmap(confusion, annot=True, fmt=",d")
_ = ax.set_title("Confusion Matrix for LGB Classifier (Feature Removal + Binary Counts)", fontsize=15)
_ = ax.set_ylabel("Actual Class")
_ = ax.set_xlabel("Predicted Class")

del(train_preds)
del(train_probas)
_ = gc.collect()

## 3.5 Comparison of Approaches

In [None]:
bar, ax = plt.subplots(figsize=(20, 10))
ax = sns.barplot(
    x=["Unmodified", "Dropping Features", "Binary Counts", "Binary Counts + Removed"],
    y=[
        float(roc_auc_score(target, train["unmodified_probas"])),
        roc_auc_score(target, train["drop_cat_features_probas"]),
        roc_auc_score(target, train["bincount_probas"]),
        roc_auc_score(target, train["bincount_remove_probas"]),
    ],
    **sns_params
)
_ = ax.set_title("ROC AUC Score Based on Approach", fontsize=15)
_ = ax.set_xlabel("Approach")
_ = ax.set_ylabel("ROC AUC Score")
_ = ax.set(ylim=(0.84, 0.86))
for p in ax.patches:
    height = p.get_height()
    ax.text(
        x=p.get_x()+(p.get_width()/2),
        y=height,
        s="{:.4f}".format(height),
        ha="center"
    )

We observe a few interesting results. First, blindly dropping the other categorical features in favor of `f22` lowers our ROC AUC score. Binary counts, on the other hand, fail to provide a significant amount of lift. Finally, removing features based on p-values leaves the overall score unchanged.

# More to come...