# 1 Introduction

This EDA explores the data available for the Tabular Playground Series - November 2021 competition. Simple data exploration is performed, as well as preliminary modeling.

## 1.1 Evaluation Criteria

The goal of this competition is to maximize the ROC AUC score. This means building classifiers or regression models that predict the probability of the positive `target` class from the included features.
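
As a rough illustration of what the metric rewards, ROC AUC can be read as the fraction of (positive, negative) record pairs in which the positive record receives the higher score. The sketch below computes it that way on made-up labels and scores. The helper name `pairwise_auc` is hypothetical, and in practice `sklearn.metrics.roc_auc_score` (used later in this notebook) should be preferred.

```python
from itertools import product


def pairwise_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    # A tie between a positive and a negative score counts as half a win
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities, for illustration only
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(y_true, y_score)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```

Because only the ordering of the scores matters, any monotonic rescaling of the predicted probabilities leaves the score unchanged.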

In [None]:
import pandas as pd
import numpy as np
import gc

train = pd.read_csv("../input/tabular-playground-series-nov-2021/train.csv")
test = pd.read_csv("../input/tabular-playground-series-nov-2021/test.csv")

In [None]:
print("Train shape: {}".format(train.shape))
print("Test shape: {}".format(test.shape))

## 1.2 Training and Testing Files

Our input data consists of:

* `train.csv` - 521 MB in size, containing 102 columns and 600,000 rows
* `test.csv` - 468 MB in size, containing 101 columns and 540,000 rows

The main observation is that while the roughly 1.0 GB of combined data fits in memory, model training may put pressure on Kaggle's 16 GB CPU memory limit and on GPU memory. We should check which column dtypes are in play, and whether running functions to [reduce memory usage](https://www.kaggle.com/gemartin/load-data-reduce-memory-usage) on pandas dataframes can ease that pressure.
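
A minimal sketch of the idea, assuming numeric columns only: downcast each column to the smallest dtype that pandas considers safe and compare memory before and after. The helper `downcast_numeric` is a hypothetical name and is not the function from the linked kernel. The demo frame is synthetic.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with integer and float columns downcast where possible."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # Float downcasting trades precision for memory
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Synthetic stand-in for the competition data
demo = pd.DataFrame({
    "id": np.arange(1000, dtype="int64"),
    "f0": np.random.default_rng(0).normal(size=1000),
    "target": np.random.default_rng(1).integers(0, 2, size=1000),
})

small = downcast_numeric(demo)
before = demo.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(small.dtypes)
print("bytes before: {:,d}, after: {:,d}".format(before, after))
```

Whether the lost float precision matters depends on the model, so it is worth comparing validation scores with and without downcasting.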

# 2 Features

## 2.1 `id` Column

The `id` column is an `int64` integer column that contains unique record identifiers ranging from 0 to 599,999. As in most Tabular Playground Series competitions, this is simply an identifier for the record and is unlikely to be of use for modelling purposes.
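
One way to confirm that reading is to check that the ids are unique and form a contiguous range. The sketch below does this on a toy frame that stands in for `train`.

```python
import pandas as pd

# Toy frame standing in for `train`; the real file has ids 0..599,999
demo = pd.DataFrame({"id": [0, 1, 2, 3, 4]})

is_unique = bool(demo["id"].is_unique)
# A contiguous range means max - min + 1 equals the number of rows
is_contiguous = is_unique and bool(demo["id"].max() - demo["id"].min() + 1 == len(demo))
print(is_unique, is_contiguous)
```

If both checks hold, the column carries no information beyond row order and can be left out of the feature list.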

## 2.2 `target` Column

The `target` column contains the class targets we are attempting to predict. We should look first to see what class breakdown we have.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')
sns_params = {"palette": "bwr_r"}

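# value_counts() names its result "count" in pandas >= 2.0, so plot the raw
# values instead of relying on a "target" column existing
counts = train["target"].value_counts().sort_index()
ax = sns.barplot(x=counts.index, y=counts.values, **sns_params)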
for p in ax.patches:
    ax.text(x=p.get_x()+(p.get_width()/2), y=p.get_height(), s="{:,d}".format(round(p.get_height())), ha="center")
_ = ax.set_title("Class Balance", fontsize=15)
_ = ax.set_ylabel("Number of Records", fontsize=15)
_ = ax.set_xlabel("Class", fontsize=15)

del(counts)
_ = gc.collect()

The target classes are well balanced, with little to no skew. This is helpful, as it gives us plenty of training data for each class.
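
The balance can also be expressed numerically rather than read off the chart. This is a small sketch on toy labels: `value_counts(normalize=True)` gives the class proportions, and the ratio of the largest to the smallest proportion is one simple imbalance measure.

```python
import pandas as pd

# Toy labels standing in for train["target"]
target = pd.Series([0, 1, 1, 0, 1, 0], name="target")

proportions = target.value_counts(normalize=True).sort_index()
imbalance_ratio = proportions.max() / proportions.min()
print(proportions)
print("imbalance ratio: {:.2f}".format(imbalance_ratio))
```

A ratio close to 1 means the classes are balanced and no resampling or class weighting is needed.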

## 2.3 `Fx` Columns

Feature columns are `f0` through `f99`. All are continuous. The display below is credited to [@subinium](https://www.kaggle.com/subinium) (see their [Simple EDA](https://www.kaggle.com/subinium/tps-oct-simple-eda) for the October TPS competition).

In [None]:
features = ["f{}".format(x) for x in range(100)]

train[features].describe().T.style.bar(subset=['mean'], color='#7BCC70')\
    .background_gradient(subset=['std'], cmap='Reds')\
    .background_gradient(subset=['50%'], cmap='coolwarm')

Right away there are some interesting observations. Features `f2` and `f35` have values that buck the trend of having a mean around 0 or 2, so we should dive deeper into those two features. Another thing to note is that we aren't seeing any categorical features masquerading as continuous (i.e. features that have a min of 0 and a max of 1 with no values in between).
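
The check for binary features hiding among continuous ones can be automated instead of read off the summary table. A minimal sketch, assuming a hypothetical helper `find_binary_columns` and a toy frame:

```python
import pandas as pd


def find_binary_columns(df: pd.DataFrame) -> list:
    """Return the columns whose non-null values are all 0 or 1."""
    return [col for col in df.columns if df[col].dropna().isin([0, 1]).all()]


# Toy frame: one genuinely continuous column and one 0/1 flag stored as float
demo = pd.DataFrame({
    "f_cont": [0.0, 0.25, 0.9, 1.0],
    "f_flag": [0.0, 1.0, 1.0, 0.0],
})

binary_cols = find_binary_columns(demo)
print(binary_cols)
```

On the competition data this would be called as `find_binary_columns(train[features])`, and an empty list would support the observation above.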

## 2.4 Null Values

We should also check to see if we are missing any values in the columns.

In [None]:
# Count the number of null values that occur in each row
train["null_count"] = train.isnull().sum(axis=1)

# Group the null counts
counts = train.groupby("null_count")["target"].count().to_dict()
null_data = {"{} Null Value(s)".format(k) : v for k, v in counts.items() if k < 6}

# Plot the null count results
pie, ax = plt.subplots(figsize=[20, 10])
colors = sns.color_palette("bwr_r")[0:5]
plt.pie(x=null_data.values(), autopct="%.2f%%", explode=[0.05]*len(null_data.keys()), labels=null_data.keys(), pctdistance=0.5, colors=colors)
_ = plt.title("Percentage of Null Values Per Row (Train Data)", fontsize=14)

del(counts)
del(null_data)
_ = gc.collect()

In [None]:
# Count the number of null values that occur in each row
test["null_count"] = test.isnull().sum(axis=1)

# Group the null counts
counts = test.groupby("null_count")["null_count"].count().to_dict()
null_data = {"{} Null Value(s)".format(k) : v for k, v in counts.items() if k < 6}

# Plot the null count results
pie, ax = plt.subplots(figsize=[20, 10])
plt.pie(x=null_data.values(), autopct="%.2f%%", explode=[0.05]*len(null_data.keys()), labels=null_data.keys(), pctdistance=0.5, colors=colors)
_ = plt.title("Percentage of Null Values Per Row (Test Data)", fontsize=14)

del(counts)
del(null_data)
_ = gc.collect()

With this competition, we're not seeing any missing values. This means we don't have to worry about imputing or creating new features based on null values.
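
The same conclusion can be reached without a chart by counting nulls directly. The toy frame below deliberately contains missing values so that the counts are non-zero. On the competition data every count would be expected to be zero.

```python
import numpy as np
import pandas as pd

# Toy frame with two missing values, for illustration only
demo = pd.DataFrame({"f0": [0.1, np.nan, 0.3], "f1": [1.0, 2.0, np.nan]})

nulls_per_column = demo.isnull().sum()
total_nulls = int(nulls_per_column.sum())
rows_with_nulls = int(demo.isnull().any(axis=1).sum())
print(nulls_per_column.to_dict(), total_nulls, rows_with_nulls)
```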

## 2.5 P-Value Testing

While looking at features visually tells us some interesting information, we can also use p-value testing to see whether a feature has a statistically significant effect in a simple regression model. This method is controversial in that it may not give a correct picture of which features are informative. For each feature, the null hypothesis is that it has no effect on `target` (its regression coefficient is zero). A p-value greater than 0.05 means we fail to reject that hypothesis, so the feature can potentially be flagged for removal.

In [None]:
from statsmodels.regression.linear_model import OLS
from statsmodels.tools.tools import add_constant

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

x = add_constant(train[features])
model = OLS(train["target"], x).fit()

In [None]:
pvalues = pd.DataFrame(model.pvalues)
pvalues.reset_index(inplace=True)
pvalues.rename(columns={0: "pvalue", "index": "feature"}, inplace=True)
pvalues.style.background_gradient(cmap='YlOrRd')

In [None]:
del(model)
del(x)
_ = gc.collect()

# Flag every term whose p-value exceeds the 0.05 significance level
features_to_drop = pvalues.loc[pvalues["pvalue"] > 0.05, "feature"].tolist()
features_to_drop

We may be able to drop features `f0`, `f38`, `f52`, `f72`, and `f92`. We'll probably want to use more advanced feature selection techniques such as [SHAP](https://github.com/slundberg/shap) to guide feature selection more reliably.

## 2.6 Spearman Correlation

We should also check which variables are correlated with one another. We'll check the Spearman correlation first, since it does not make assumptions about distribution types or linearity. Spearman correlation values range from -1 to +1. Values near either extreme indicate a strong negative or positive correlation, while those around 0 indicate no correlation.
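
To make the linearity point concrete: Spearman correlation is the Pearson correlation computed on ranks, so a monotonic but non-linear relationship still scores a perfect 1. The toy series below are made up for illustration.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])
y = x ** 3  # monotonic, but strongly non-linear

# Pearson measures linear association on the raw values
pearson = x.corr(y)
# Spearman is Pearson applied to the ranks of the values
spearman = x.rank().corr(y.rank())

print("pearson = {:.3f}, spearman = {:.3f}".format(pearson, spearman))
```

Here the Pearson value comes out at about 0.94 while the rank-based value is 1, which is why Spearman is a reasonable first pass when the shape of the relationship is unknown.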

In [None]:
from matplotlib.colors import SymLogNorm

# Correlate every feature with every other feature and with the target
columns_to_check = features + ["target"]
correlation_matrix = train[columns_to_check].corr(method="spearman")

f, ax = plt.subplots(figsize=(20, 20))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
_ = sns.heatmap(
    correlation_matrix,
    # Hide the upper triangle and diagonal, which duplicate the lower triangle
    mask=np.triu(np.ones_like(correlation_matrix, dtype=bool)),
    cmap=cmap,
    center=0,
    square=True,
    linewidths=.1,
    cbar_kws={"shrink": .2},
    # Symmetric log scaling makes weak correlations near zero visible
    norm=SymLogNorm(linthresh=0.03, linscale=0.03, vmin=-1.0, vmax=1.0, base=10),
)

#### Target Correlations

The following features are correlated positively to the target:

* `f8`, `f27`, `f34`, `f41`, `f43`, `f50`, and `f57`

The following features are correlated negatively to the target:

* `f55`, `f71`, `f80`, and `f91`

#### Feature Correlations

The following pairs appear to be correlated:

* `f4` and `f75` - these share a positive correlation.
* `f9` and `f21` - these share a positive correlation.
* `f18` and `f89` - these share a positive correlation.
* `f20` and `f21` - these share a positive correlation.
* `f23` and `f33` - these share a positive correlation.
* `f27` and `f31` - these share a negative correlation. Since `f27` is strongly correlated to the target, however, we may not want to drop either one.
* `f46` and `f52` - these share a positive correlation.
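
Pairs like the ones listed above can also be pulled out of a correlation matrix programmatically rather than by eye. The sketch below is one way to do it, assuming a hypothetical helper `top_correlated_pairs` and a toy frame. It keeps the strict lower triangle so each pair appears once, then sorts by absolute correlation.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(corr: pd.DataFrame, n: int = 10) -> pd.Series:
    """Return the n pairs with the largest absolute correlation."""
    # Strict lower triangle: drops the diagonal and the mirrored duplicates
    mask = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Toy frame: `b` is a monotonic function of `a`, `c` is only loosely related
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [1, 4, 9, 16, 25],
    "c": [2, 1, 4, 3, 5],
})

top = top_correlated_pairs(demo.corr(method="spearman"), n=1)
print(top)
```

On the notebook's data this would be called with `correlation_matrix`, optionally after dropping the `target` row and column to look at feature pairs only.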

# 3 Simple Models

Given we know a little about the distribution of data, we should establish a set of baseline models to understand what kind of performance we can get from models.

## 3.1 LightGBM

We'll start with a simple LightGBM model and see how our features work out from there.

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from lightgbm import LGBMClassifier, early_stopping, log_evaluation

target = train["target"]
cv_rounds = 3

k_fold = StratifiedKFold(
    n_splits=cv_rounds,
    random_state=2021,
    shuffle=True,
)

# Out-of-fold class predictions and positive-class probabilities
train_preds = np.zeros(len(train.index), )
train_probas = np.zeros(len(train.index), )

for fold, (train_index, test_index) in enumerate(k_fold.split(train[features], target)):
    x_train = train[features].iloc[train_index]
    y_train = target.iloc[train_index]

    x_valid = train[features].iloc[test_index]
    y_valid = target.iloc[test_index]

    model = LGBMClassifier(
        random_state=2021,
        n_estimators=2000,
        verbose=-1,
        metric="auc",
    )
    # LightGBM 4.0 removed the `early_stopping_rounds` and `verbose` arguments
    # of fit(); the equivalent callbacks are used instead (LightGBM >= 3.3).
    model.fit(
        x_train,
        y_train,
        eval_set=[(x_valid, y_valid)],
        callbacks=[
            early_stopping(stopping_rounds=50, verbose=False),
            log_evaluation(period=0),
        ],
    )

    train_oof_preds = model.predict(x_valid)
    train_oof_probas = model.predict_proba(x_valid)[:, -1]
    train_preds[test_index] = train_oof_preds
    train_probas[test_index] = train_oof_probas
    
    print("-- Fold {}:".format(fold+1))
    print("{}".format(classification_report(y_valid, train_oof_preds)))

print("-- Overall:")
print("{}".format(classification_report(target, train_preds)))
print("-- ROC AUC: {}".format(roc_auc_score(target, train_probas)))

train["unmodified_preds"] = train_preds
train["unmodified_probas"] = train_probas

# Show the confusion matrix
confusion = confusion_matrix(train["target"], train["unmodified_preds"])
ax = sns.heatmap(confusion, annot=True, fmt=",d")
_ = ax.set_title("Confusion Matrix for LGB Classifier (Unmodified Dataset)", fontsize=15)
_ = ax.set_ylabel("Actual Class")
_ = ax.set_xlabel("Predicted Class")

del(train_preds)
del(train_probas)
del(confusion)
_ = gc.collect()

Looking across folds, we are seeing stability, which is good. Our overall precision and recall metrics are fairly high for both the positive and negative classes.
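
Fold stability can be quantified by recording the ROC AUC of each fold and looking at the spread. The sketch below shows the bookkeeping on synthetic data from `make_classification`, with `LogisticRegression` as a lightweight placeholder for the LightGBM model. Both are assumptions made for illustration and are not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the competition data
X, y = make_classification(n_samples=600, n_features=10, random_state=2021)

k_fold = StratifiedKFold(n_splits=3, shuffle=True, random_state=2021)
fold_aucs = []
for train_index, valid_index in k_fold.split(X, y):
    # Placeholder model; swap in the LGBMClassifier from the loop above
    model = LogisticRegression(max_iter=1000).fit(X[train_index], y[train_index])
    probas = model.predict_proba(X[valid_index])[:, 1]
    fold_aucs.append(roc_auc_score(y[valid_index], probas))

print("AUC per fold:", np.round(fold_aucs, 4))
print("mean = {:.4f}, std = {:.4f}".format(np.mean(fold_aucs), np.std(fold_aucs)))
```

A small standard deviation relative to the differences between candidate models suggests that the cross-validation estimate is reliable enough to compare them.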

## 3.2 LightGBM Dropping Uninformative Features

Let's use the results of our p-value test and drop the features it flagged as uninformative.

In [None]:
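# Re-imported so that this cell also runs on its own
from lightgbm import early_stopping, log_evaluation

target = train["target"]
cv_rounds = 3

k_fold = StratifiedKFold(
    n_splits=cv_rounds,
    random_state=2021,
    shuffle=True,
)

# Keep only the features that the p-value test did not flag for removal
new_features = [f for f in features if f not in features_to_drop]

train_preds = np.zeros(len(train.index), )
train_probas = np.zeros(len(train.index), )

for fold, (train_index, test_index) in enumerate(k_fold.split(train[new_features], target)):
    x_train = train[new_features].iloc[train_index]
    y_train = target.iloc[train_index]

    x_valid = train[new_features].iloc[test_index]
    y_valid = target.iloc[test_index]

    model = LGBMClassifier(
        random_state=2021,
        n_estimators=2000,
        verbose=-1,
        metric="auc",
    )
    # LightGBM 4.0 removed the `early_stopping_rounds` and `verbose` arguments
    # of fit(); the equivalent callbacks are used instead (LightGBM >= 3.3).
    model.fit(
        x_train,
        y_train,
        eval_set=[(x_valid, y_valid)],
        callbacks=[
            early_stopping(stopping_rounds=50, verbose=False),
            log_evaluation(period=0),
        ],
    )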

    train_oof_preds = model.predict(x_valid)
    train_oof_probas = model.predict_proba(x_valid)[:, -1]
    train_preds[test_index] = train_oof_preds
    train_probas[test_index] = train_oof_probas
    
    print("-- Fold {}:".format(fold+1))
    print("{}".format(classification_report(y_valid, train_oof_preds)))

print("-- Overall:")
print("{}".format(classification_report(target, train_preds)))
print("-- ROC AUC: {}".format(roc_auc_score(target, train_probas)))

train["dropped_preds"] = train_preds
train["dropped_probas"] = train_probas

# Show the confusion matrix
confusion = confusion_matrix(train["target"], train["dropped_preds"])
ax = sns.heatmap(confusion, annot=True, fmt=",d")
_ = ax.set_title("Confusion Matrix for LGB Classifier (Dropped Features)", fontsize=15)
_ = ax.set_ylabel("Actual Class")
_ = ax.set_xlabel("Predicted Class")

del(train_preds)
del(train_probas)
del(confusion)
_ = gc.collect()
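
Since both runs store their out-of-fold probabilities on `train`, the two feature sets can be compared directly on the competition metric. The sketch below shows the comparison on a toy frame. The column names match the ones used above, but the values are made up and are not results from this notebook.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Toy out-of-fold predictions; on the real data, use `train` instead of `oof`
oof = pd.DataFrame({
    "target": [0, 0, 1, 1],
    "unmodified_probas": [0.1, 0.4, 0.35, 0.8],
    "dropped_probas": [0.1, 0.3, 0.35, 0.8],
})

scores = {
    col: roc_auc_score(oof["target"], oof[col])
    for col in ["unmodified_probas", "dropped_probas"]
}
delta = scores["dropped_probas"] - scores["unmodified_probas"]
print(scores)
print("AUC change after dropping features: {:+.4f}".format(delta))
```

A change smaller than the fold-to-fold variation would suggest that the dropped features were neither helping nor hurting.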

# More to come...