# Tabular Data Classification and Baseline with EDA

**Table of Contents:**

1. [Load Data and Inspect Top Level Features](#load)
2. [Exploratory Data Analysis (EDA)](#eda)
3. [Data Preparation and Preprocessing](#data-preprocessing)
4. [Model Training and Evaluation](#model-training)
    - 4.1. [Basic Analysis using Random Forest](#random-forest)
    - 4.2. [CatBoost Classification Model](#catboost)
5. [Test set predictions](#test-predictions)

In [None]:
!pip install pydotplus

In [None]:
import gc
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns

from catboost import CatBoostClassifier, cv, Pool
from collections import defaultdict

from imblearn.over_sampling import SMOTE
from IPython.display import Image
from pydotplus import graph_from_dot_data
        
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, BaggingClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, KFold, cross_val_score, cross_val_predict
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score, precision_score, roc_curve, \
                            recall_score, confusion_matrix, classification_report, \
                            auc, precision_recall_curve
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression, Perceptron, RidgeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

---

<a id="load"></a>
## 1. Load Data and Inspect Top Level Features

In [None]:
data_dir = "/kaggle/input/tabular-playground-series-mar-2021/"
train_df = pd.read_csv(os.path.join(data_dir, "train.csv"))
test_df = pd.read_csv(os.path.join(data_dir, "test.csv"))
train_df.head()

In [None]:
train_df.info()

In [None]:
train_df.isna().sum().sum()

Good, we have absolutely no null or missing values at all!

---

<a id="eda"></a>
## 2. Exploratory Data Analysis (EDA)

Since we've got a range of both categorical and numerical features we should briefly explore these for insights.

### 2.1 Analysis of Categorical Features

In [None]:
def custom_countplot(data_df, col_name, ax=None, annotate=True):
    """ Plot seaborn countplot for selected dataframe col """
    c_plot = sns.countplot(x=col_name, data=data_df, ax=ax)
    if annotate:
        for g in c_plot.patches:
            c_plot.annotate(f"{g.get_height()}",
                            (g.get_x()+g.get_width()/3,
                             g.get_height()+60))

We can plot all of our categorical columns using this basic helper function:

In [None]:
cat_cols = [x for x in train_df.columns.values if x.startswith('cat')]
n = len(cat_cols)

print(f"Number of categorical columns: {n}")

In [None]:
n = len(cat_cols)

fig, axs = plt.subplots(5, 4, figsize=(18,10))
axs = axs.flatten()

# iterate through each col and plot
for i, col_name in enumerate(cat_cols):
    custom_countplot(train_df, col_name, ax=axs[i], annotate=False)
    axs[i].set_xlabel(f"{col_name}", weight = 'bold')
    axs[i].set_ylabel('Count', weight='bold')
    
    # only apply y label to left-most plots
    if (i not in [0, 4, 8, 12, 16]):
        axs[i].set_ylabel('')
        
plt.tight_layout()
plt.show()

Let's also look at our target output:

In [None]:
train_df['target'].value_counts().plot.bar()
plt.show()

In [None]:
train_df['target'].value_counts(normalize=True)

So our target classes are slightly imbalanced. We could consider correcting for this with imbalanced-data techniques such as resampling (for example SMOTE, which is imported above) or class weighting.
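
As a rough illustration of the class-weighting option, the sketch below computes "balanced" class weights with scikit-learn. The array `y_demo` is a made-up stand-in for `train_df['target']`, and its 3:1 ratio is an assumption for the example only, not the ratio in our data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# hypothetical binary target with a 3:1 imbalance (stand-in for train_df['target'])
y_demo = np.array([0] * 75 + [1] * 25)

# 'balanced' weights are n_samples / (n_classes * class_count)
classes = np.unique(y_demo)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)
class_weights = dict(zip(classes.tolist(), weights.tolist()))
print(class_weights)
```

A dictionary like this can be passed to estimators that accept a `class_weight` argument, such as `RandomForestClassifier`. Whether it actually helps the AUC here is something we would need to check on the validation set.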

### 2.2 Analysis of Numerical Features

In [None]:
num_cols = [x for x in train_df.columns.values if x.startswith('cont')]
num_n = len(num_cols)

print(f"Number of numerical columns: {num_n}")

In [None]:
def target_boxplot(y_val_col, x_val_col, data_df, figsize=(9,6), ax=None, name="Boxplot"):
    """ Custom boxplot function - plot a chosen value against target x col """
    
    # remember whether we created the figure, so we only title/show our own
    own_fig = ax is None
    if own_fig:
        fig, ax = plt.subplots(figsize=figsize)
    b_plot = sns.boxplot(x=x_val_col, y=y_val_col, data=data_df, ax=ax)
    
    medians = data_df.groupby(x_val_col)[y_val_col].median()
    vert_offset = data_df[y_val_col].median() * 0.05 
    
    # annotate each box with its median (assumes integer classes 0..k-1 match the xticks)
    for xtick in b_plot.get_xticks():
        b_plot.text(xtick, medians[xtick] + vert_offset, f"{medians[xtick]:.3f}", 
                horizontalalignment='center',size='small',color='w',weight='semibold')
    
    if own_fig:
        ax.set_title(f"{name}", weight='bold')
        plt.show()

In [None]:
num_n = len(num_cols)

fig, axs = plt.subplots(3, 4, figsize=(16,9))
axs = axs.flatten()

# iterate through each col and plot
for i, col_name in enumerate(num_cols):
    
    target_boxplot(col_name, 'target', train_df, 
               name=f'{col_name}', ax=axs[i])
    
    axs[i].set_xlabel(f"{col_name}", weight = 'bold')
    axs[i].set_ylabel('Value', weight='bold')
    
    # only apply y label to left-most plots (4 columns per row)
    if i % 4 != 0:
        axs[i].set_ylabel('')
        
plt.tight_layout()
plt.show()

In [None]:
sns.pairplot(train_df.loc[:20000, num_cols], height=3, plot_kws={'alpha':0.2})
plt.show()

In [None]:
# find the correlation between our variables
corr = train_df.loc[:, num_cols].corr()

plt.figure(figsize=(12,8))
sns.heatmap(corr, annot=True)
plt.show()

We have a few fairly strong correlations between our variables, which might be worth exploring further and seeing if we can reduce any unnecessary redundancy. For this it will be worth experimenting with some dimensionality reduction and/or feature augmentation techniques.

Despite this, we don't have a huge number of numerical features, and so we don't want to throw away any useful information about our dependent output variable if we can avoid it. Therefore, we will keep all features as they are during preprocessing for this notebook.
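
If we did want to follow up on the stronger correlations, a small helper like the one below could list the feature pairs above a chosen threshold. This is only a sketch: `top_correlated_pairs`, the synthetic `demo_df` and the 0.5 threshold are all assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, threshold=0.5):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # keep only the upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# synthetic stand-in for train_df.loc[:, num_cols]
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo_df = pd.DataFrame({
    "cont_a": a,
    "cont_b": a + rng.normal(scale=0.1, size=500),  # strongly tied to cont_a
    "cont_c": rng.normal(size=500),                 # independent noise
})

strong_pairs = top_correlated_pairs(demo_df, threshold=0.5)
print(strong_pairs)
```

On the real data we would call this with `train_df.loc[:, num_cols]` and use the resulting pairs as candidates for dimensionality reduction or feature combination.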

---

<a id="data-preprocessing"></a>
## 3. Preprocessing

We need to suitably encode our categorical variables, and standardise (if required) our numerical features. We can also engineer some additional features if we're feeling curious.

For a simple categorical encoding, we can either map each category to an integer label or perform one-hot encoding. The latter produces a much wider dataset and hence introduces more complexity, but with tabular models this often comes with the benefit of improved performance. This is not a hard rule, however, and results vary from problem to problem, so it is worth exploring both approaches and seeing which works best for your application.
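
To make the size difference concrete, here is a minimal sketch on a toy frame. The column names `cat_x` and `cat_y` are made up, and `OrdinalEncoder` is used here as a convenient multi-column stand-in for integer label encoding.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# toy categorical data: cat_x has 3 levels, cat_y has 2 levels
demo_cats = pd.DataFrame({"cat_x": ["A", "B", "C", "A"],
                          "cat_y": ["u", "v", "u", "u"]})

# integer encoding: one column per original feature
int_enc = OrdinalEncoder().fit_transform(demo_cats)

# one-hot encoding: one column per (feature, category) pair
oh_enc = OneHotEncoder(handle_unknown="ignore").fit_transform(demo_cats).toarray()

print(int_enc.shape, oh_enc.shape)
```

Integer encoding keeps the two original columns, while one-hot encoding expands them to five (3 + 2). With many high-cardinality features this growth can become significant.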

In [None]:
class DataProcessor(object):
    def __init__(self):
        self.encoder = None
        self.standard_scaler = None
        self.num_cols = None
        self.cat_cols = None
        
    def preprocess(self, data_df, train=True, one_hot_encode=False,
                   add_pca_feats=False):
        """ Preprocess train / test as required """
        
        # if training, fit our transformers
        if train:
            self.train_ids = data_df.loc[:, 'id']
            train_cats = data_df.loc[:, data_df.dtypes == object]
            self.cat_cols = train_cats.columns
            
            # if selected, one hot encode our cat features
            if one_hot_encode:
                self.encoder = OneHotEncoder(handle_unknown='ignore')
                oh_enc = self.encoder.fit_transform(train_cats).toarray()
                # get_feature_names_out() needs scikit-learn >= 1.0 (older releases: get_feature_names())
                train_cats_enc = pd.DataFrame(oh_enc, columns=self.encoder.get_feature_names_out())
                self.final_cat_cols = list(train_cats_enc.columns)
            
            # otherwise just encode our cat feats with ints
            else:
                # encode all of our categorical variables
                self.encoder = defaultdict(LabelEncoder)
                train_cats_enc = train_cats.apply(lambda x: 
                                                  self.encoder[x.name].fit_transform(x))
                self.final_cat_cols = list(self.cat_cols)
            
            
            # standardise all numerical columns
            train_num = data_df.loc[:, data_df.dtypes != object].drop(columns=['target', 'id'])
            self.num_cols = train_num.columns
            self.standard_scaler = StandardScaler()
            train_num_std = self.standard_scaler.fit_transform(train_num)
            
            # add pca reduced num feats if selected, else just combine num + cat feats
            if add_pca_feats:
                pca_feats = self._return_num_pca(train_num_std)
                self.final_num_feats = list(self.num_cols)+list(self.pca_cols)
                
                
                X = pd.DataFrame(np.hstack((train_cats_enc, train_num_std, pca_feats)), 
                        columns=list(self.final_cat_cols)+list(self.num_cols)+list(self.pca_cols))
            else:   
                self.final_num_feats = list(self.num_cols)
                X = pd.DataFrame(np.hstack((train_cats_enc, train_num_std)), 
                        columns=list(self.final_cat_cols)+list(self.num_cols))
        
        # otherwise, treat as test data
        else:
            # transform categorical and numerical data
            self.test_ids = data_df.loc[:, 'id']
            cat_data = data_df.loc[:, self.cat_cols]
        
            if one_hot_encode:
                oh_enc = self.encoder.transform(cat_data).toarray()
                # get_feature_names_out() needs scikit-learn >= 1.0 (older releases: get_feature_names())
                cats_enc = pd.DataFrame(oh_enc, columns=self.encoder.get_feature_names_out())
            else:
                cats_enc = cat_data.apply(lambda x: self.encoder[x.name].transform(x))
                
            # transform test numerical data
            num_data = data_df.loc[:, self.num_cols]
            num_std = self.standard_scaler.transform(num_data)
            
            if add_pca_feats:
                pca_feats = self._return_num_pca(num_std, train=False)
                
                X = pd.DataFrame(np.hstack((cats_enc, num_std, pca_feats)), 
                        columns=list(self.final_cat_cols)+list(self.num_cols)+list(self.pca_cols))
            
            else:
                X = pd.DataFrame(np.hstack((cats_enc, num_std)), 
                        columns=list(self.final_cat_cols)+list(self.num_cols)) 
        return X
    
    def _return_num_pca(self, num_df, n_components=0.85, train=True):
        """ return dim reduced numerical features using PCA """
        if train:
            self.pca = PCA(n_components=n_components)
            num_rd = self.pca.fit_transform(num_df)
            
            # create new col names for our reduced features
            self.pca_cols = [f"pca_{x}" for x in range(num_rd.shape[1])]
            
        else:
            num_rd = self.pca.transform(num_df)
        
        return pd.DataFrame(num_rd, columns=self.pca_cols)
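
The same fit-on-train, transform-on-test pattern can also be expressed with scikit-learn's `ColumnTransformer`. The sketch below is an alternative to the class above rather than a replacement for it. The toy frames and their column names are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy frames standing in for train_df / test_df (column names are hypothetical)
toy_train = pd.DataFrame({"cat0": ["A", "B", "A", "C"],
                          "cont0": [0.1, 0.4, 0.35, 0.8]})
toy_test = pd.DataFrame({"cat0": ["B", "D"],   # "D" was never seen in training
                         "cont0": [0.2, 0.6]})

col_tf = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["cat0"]),
    ("num", StandardScaler(), ["cont0"]),
], sparse_threshold=0.0)  # always return a dense array

X_toy_train = col_tf.fit_transform(toy_train)   # fit on training data only
X_toy_test = col_tf.transform(toy_test)         # reuse the fitted transformers
print(X_toy_train.shape, X_toy_test.shape)
```

Note how the unseen category "D" is encoded as all zeros thanks to `handle_unknown="ignore"`. A plain `LabelEncoder`, by contrast, raises an error on labels it did not see during fitting, which is worth keeping in mind when using the integer-encoding path on test data.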

Let's transform our data into a form suitable for training various models. This includes encoding our categorical variables and standardising our numerical variables.

Our preprocessing function supports both encodings through the `one_hot_encode` argument: `True` gives one-hot encoding and `False` gives simple integer encoding. One-hot encoding yields a larger number of feature columns and therefore more complexity, but many models perform better with it, so it is worth trying both.

We'll mainly be using tree-based methods in this notebook, which are often reported to be fairly insensitive to this choice, so treat the setting as something to experiment with. The cell below uses one-hot encoding and leaves the PCA features switched off; set `one_hot_encode=False` for the simpler integer encoding.

In [None]:
data_proc = DataProcessor()

# one-hot encode categorical features; PCA features are switched off here
X_train_full = data_proc.preprocess(train_df, one_hot_encode=True, add_pca_feats=False)
y_train_full = train_df.loc[:, 'target']

X_test = data_proc.preprocess(test_df, train=False, one_hot_encode=True, add_pca_feats=False)

print(f"X_train_full: {X_train_full.shape} \ny_train_full: {y_train_full.shape} \nX_test: {X_test.shape}")

Let's split the full training set further into training and validation sets for model training, optimisation and evaluation:

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, 
                                                  test_size=0.2, random_state=12, stratify=y_train_full)
print(f"X_train: {X_train.shape} \ny_train: {y_train.shape} \nX_val: {X_val.shape} \ny_val: {y_val.shape}")
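
The `stratify` argument is what keeps the class ratio the same in both splits. A minimal check on synthetic labels (the 80/20 class ratio is an assumption for the example) could look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic labels: 80 negatives, 20 positives
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.2,
                                    random_state=12, stratify=y_toy)

# fraction of positives in each split
print(y_tr.mean(), y_va.mean())
```

Both splits keep the 20% positive rate of the full toy set. The same comparison with `y_train.mean()` and `y_val.mean()` is a quick sanity check on our real split.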

---

<a id="model-training"></a>
## 4. Exploring our dataset using different models

<a id="random-forest"></a>
### 4.1 Random Forest Analysis

In [None]:
def show_tree_graph(tree_model, feature_names):
    """ Output a decision tree to notebook """
    draw_data = export_graphviz(tree_model, filled=True, 
                                rounded=True, feature_names=feature_names, 
                                out_file=None, rotate=True, class_names=True)
    graph = graph_from_dot_data(draw_data)

    return Image(graph.create_png())

In [None]:
rf_clf = RandomForestClassifier(n_estimators=100, max_depth=3)
rf_clf.fit(X_train, y_train)

In [None]:
show_tree_graph(rf_clf.estimators_[0], list(X_train.columns))

Let's train our random forest again, but this time without limiting the depth of our trees:

In [None]:
rf_clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
%time rf_clf.fit(X_train, y_train)

The great thing with random forests is the ease of being able to see the relative importance of our features used for making predictions:

In [None]:
def feature_importances(rf_model, dataframe):
    """ Return dataframe of feat importances from random forest model """
    return pd.DataFrame({'columns' : dataframe.columns, 
                         'importance' : rf_model.feature_importances_}
                       ).sort_values('importance', ascending=False)

In [None]:
importances = feature_importances(rf_clf, X_train)
TOP_N = 45

plt.figure(figsize=(14,6))
sns.barplot(x="columns", y="importance", data=importances[:TOP_N])
plt.ylabel("Feature Importances", weight='bold')
plt.xlabel("Features", weight='bold')
plt.title("Random Forest Feature Importances", weight='bold')
plt.xticks(rotation=90)
plt.show()
print(importances[:TOP_N])
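
The importances above are the impurity-based values stored on the fitted forest, which can favour continuous or high-cardinality features. As a cross-check we could also compute permutation importances on held-out data. The sketch below does this on synthetic data (the dataset, `toy_rf` and all parameter values are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# synthetic data: with shuffle=False the 2 informative features are columns 0 and 1
X_syn, y_syn = make_classification(n_samples=600, n_features=6, n_informative=2,
                                   n_redundant=0, n_clusters_per_class=1,
                                   shuffle=False, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.3,
                                      random_state=0, stratify=y_syn)

toy_rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_a, y_a)

# shuffle each column of the held-out set in turn and record the drop in ROC AUC
perm = permutation_importance(toy_rf, X_b, y_b, scoring="roc_auc",
                              n_repeats=5, random_state=0)
ranking = perm.importances_mean.argsort()[::-1]
print(ranking)
```

For our model the equivalent call would use `rf_clf`, `X_val` and `y_val`, although with many one-hot columns this can be slow.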

Let's now make some predictions on our validation set to get a rough idea of the performance of our model:

In [None]:
val_preds = rf_clf.predict(X_val)
val_acc = accuracy_score(y_val, val_preds)

In [None]:
print(f"Random Forest accuracy on validation set: {val_acc}\n")
# scikit-learn metrics expect (y_true, y_pred); swapping them exchanges precision and recall
print(classification_report(y_val, val_preds))

These metrics are hard to interpret from the raw values alone, but they do highlight a severe limitation of our model. Let's plot a confusion matrix, which will help illustrate what this is.

In [None]:
def plot_confusion_matrix(true_y, pred_y, title='Confusion Matrix', figsize=(8,6)):
    """ Custom function for plotting a confusion matrix for predicted results """
    conf_matrix = confusion_matrix(true_y, pred_y)
    conf_df = pd.DataFrame(conf_matrix, columns=np.unique(true_y), index = np.unique(true_y))
    conf_df.index.name = 'Actual'
    conf_df.columns.name = 'Predicted'
    plt.figure(figsize = figsize)
    plt.title(title)
    sns.set(font_scale=1.4)
    sns.heatmap(conf_df, cmap="Blues", annot=True, 
                annot_kws={"size": 16}, fmt='g')
    plt.show()
    return

In [None]:
plot_confusion_matrix(y_val, val_preds)
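
To connect the confusion matrix back to the classification report, the sketch below derives precision and recall for the positive class directly from the four cells. The label arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# hypothetical true labels and predictions
y_true_cm = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred_cm = np.array([0, 0, 0, 1, 1, 0, 0, 0])

# rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true_cm, y_pred_cm).ravel()

recall = tp / (tp + fn)      # share of actual positives we found
precision = tp / (tp + fp)   # share of predicted positives that were right
print(tn, fp, fn, tp, precision, recall)
```

In this toy case the model finds only one of the three positives, so recall is low even though most predictions are correct overall.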

In [None]:
def plot_roc_curve(y_train, y_train_probs, y_val, y_val_probs, figsize=(8,8)):
    """ Helper function to plot the ROC AUC from given labels """
    # obtain true positive and false positive rates for roc_auc
    fpr, tpr, thresholds = roc_curve(y_train, y_train_probs[:, 1], pos_label=1)
    roc_auc = auc(fpr, tpr)

    # obtain true positive and false positive rates for roc_auc
    val_fpr, val_tpr, val_thresholds = roc_curve(y_val, y_val_probs[:, 1], pos_label=1)
    val_roc_auc = auc(val_fpr, val_tpr)

    plt.figure(figsize=figsize)
    plt.plot(fpr, tpr, label=f"Train ROC AUC = {roc_auc}", color='blue')
    plt.plot(val_fpr, val_tpr, label=f"Val ROC AUC = {val_roc_auc}", color='red')
    plt.plot([0,1], [0, 1], label="Random Guessing", 
             linestyle=":", color='grey', alpha=0.6)
    plt.plot([0, 0, 1], [0, 1, 1], label="Perfect Performance", 
             linestyle="--", color='black', alpha=0.6)
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("Receiver Operating Characteristic", weight='bold')
    plt.legend(loc='best')
    plt.show()

In [None]:
# obtain prediction probabilities for trg and val
y_val_probs = rf_clf.predict_proba(X_val)
y_trg_probs = rf_clf.predict_proba(X_train)

# plot our ROC curve
plot_roc_curve(y_train, y_trg_probs, y_val, y_val_probs)

The AUC is the metric we're trying to maximise for this competition, and therefore we should seek to obtain a model that scores particularly well in this regard.
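
If we only need the number rather than the plot, `roc_auc_score` gives the same value as integrating the ROC curve ourselves. A small sketch with made-up scores:

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# hypothetical labels and predicted positive-class probabilities
y_true_toy = np.array([0, 0, 1, 1])
y_score_toy = np.array([0.1, 0.4, 0.35, 0.8])

# area under the curve computed from the curve points
fpr_toy, tpr_toy, _ = roc_curve(y_true_toy, y_score_toy)
auc_from_curve = auc(fpr_toy, tpr_toy)

# the same quantity computed directly
auc_direct = roc_auc_score(y_true_toy, y_score_toy)
print(auc_from_curve, auc_direct)
```

Here three of the four positive/negative pairs are ranked correctly, giving an AUC of 0.75. Because the AUC depends only on how the scores rank the samples, it should be computed from probabilities (or scores) rather than hard class predictions.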

Just for interest, we can also inspect the precision-recall curve for our models:

In [None]:
def plot_prec_rec_curve(y_train, y_train_probs, y_val, y_val_probs, figsize=(14,6)):
    """ Helper function to plot precision-recall curves from given labels """
    # obtain precision and recall values for the training set
    prec, rec, thresholds = precision_recall_curve(y_train, 
                                                   y_train_probs[:, 1], 
                                                   pos_label=1)
    prec_rec_auc = auc(rec, prec)

    # obtain precision and recall values for the validation set
    val_prec, val_rec, val_thresholds = precision_recall_curve(y_val, 
                                                               y_val_probs[:, 1], 
                                                               pos_label=1)
    val_prec_rec_auc = auc(val_rec, val_prec)

    plt.figure(figsize=figsize)
    # recall on the x-axis, precision on the y-axis (matching the axis labels)
    plt.plot(rec, prec, 
             label=f"Train Precision-Recall AUC = {prec_rec_auc}", color='blue')
    plt.plot(val_rec, val_prec, 
             label=f"Val Precision-Recall AUC = {val_prec_rec_auc}", color='red')
    # a perfect classifier keeps precision at 1 for every recall level
    plt.plot([0, 1, 1], [1, 1, 0], label="Perfect Performance", 
             linestyle="--", color='black', alpha=0.6)
    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.title("Precision-Recall Curve", weight='bold')
    plt.legend(loc='best')
    plt.show()

In [None]:
# plot our precision recall curve
plot_prec_rec_curve(y_train, y_trg_probs, y_val, y_val_probs)

<a id="catboost"></a>
### 4.2 CatBoost Classifier

Training can take a long time on CPU, so we select GPU as the task type for our CatBoost model (this requires a GPU-enabled runtime):

In [None]:
cb_learn_rate = 0.006
n_iterations = 15000
early_stop_rounds = 400

cb_params = {'iterations' : n_iterations, 'learning_rate' : cb_learn_rate, 
             'task_type' : 'GPU', 'random_seed' : 13, 'verbose' : 500}

# CPU fallback: use these parameters instead if no GPU is available
#cb_params = {'iterations' : n_iterations, 'learning_rate' : cb_learn_rate, 
#             'random_seed' : 13, 'verbose' : 500}

In [None]:
cb_clf = CatBoostClassifier(**cb_params)

A nice feature of CatBoost is the option of adding an interactive plot of training, which allows us to analyse the performance in real time:

In [None]:
cb_clf.fit(X_train, y_train, eval_set=(X_val, y_val), 
           use_best_model=True, plot=True, 
           early_stopping_rounds=early_stop_rounds)

Let's make some predictions on the validation set and compare them to our previous random forest model:

In [None]:
val_preds = cb_clf.predict(X_val)
val_acc = accuracy_score(y_val, val_preds)

In [None]:
print(f"CatBoost accuracy on validation set: {val_acc}\n")
# scikit-learn metrics expect (y_true, y_pred); swapping them exchanges precision and recall
print(classification_report(y_val, val_preds))

In [None]:
plot_confusion_matrix(y_val, val_preds)

Finally, let's look at the ROC AUC for the CatBoost model:

In [None]:
# obtain prediction probabilities for trg and val
y_val_probs = cb_clf.predict_proba(X_val)
y_trg_probs = cb_clf.predict_proba(X_train)

# plot our ROC curve
plot_roc_curve(y_train, y_trg_probs, y_val, y_val_probs)

---

<a id="test-predictions"></a>
## 5. Test set predictions

With some basic models under our belt, let's make some predictions on the test set and submit these:

In [None]:
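# the competition metric is AUC, so we submit positive-class probabilities rather than hard labels
test_preds = cb_clf.predict_proba(X_test)[:, 1]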

In [None]:
submission_df = pd.read_csv(os.path.join(data_dir, "sample_submission.csv"))
submission_df['target'] = test_preds
submission_df.to_csv('submission.csv', index=False)
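
Before uploading, it can be worth running a couple of basic checks on the submission. The helper below is a hypothetical sketch (`build_submission` is not part of the notebook above), assuming the submission needs an `id` column and a `target` column of probabilities:

```python
import numpy as np
import pandas as pd

def build_submission(ids, probs):
    """Assemble an id/target frame and run basic sanity checks."""
    probs = np.asarray(probs, dtype=float)
    if len(ids) != len(probs):
        raise ValueError("ids and probs must have the same length")
    if np.isnan(probs).any() or (probs < 0).any() or (probs > 1).any():
        raise ValueError("probs must be finite values in [0, 1]")
    return pd.DataFrame({"id": np.asarray(ids), "target": probs})

# toy usage with made-up ids and probabilities
toy_sub = build_submission([5, 6, 7], [0.12, 0.9, 0.5])
print(toy_sub)
```

With our real objects this would be called with `data_proc.test_ids` and the predicted test probabilities.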