# Problem description

In this challenge, we have to predict a binary target using a number of continuous and categorical features. There are 19 categorical features and 11 continuous features. Further, there are 300000 samples in training data and 200000 samples in test data.

In [None]:
# Install Data Analysis Baseline Library for automated data analysis
!pip install dabl 

# Import necessary libraries

In [None]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

import dabl
import itertools

from pandas_profiling import ProfileReport

# Preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Classifiers
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import roc_auc_score, confusion_matrix  # plot_roc_curve was unused and is removed in recent scikit-learn releases

import warnings 
warnings.filterwarnings('ignore') # silence warnings

# Load data

In [None]:
train = pd.read_csv("../input/tabular-playground-series-mar-2021/train.csv")
test = pd.read_csv("../input/tabular-playground-series-mar-2021/test.csv")
sample = pd.read_csv("../input/tabular-playground-series-mar-2021/sample_submission.csv")

In [None]:
train.head()

In [None]:
train.info()

In [None]:
test.head()

# EDA with DABL

In [None]:
dabl.plot(train, "target")

# Encode categorical features

In [None]:
cat_features = [feat for feat in train.columns if feat[:3]=='cat'] # select categorical features

In [None]:
def check_new_labels_in_test(train_set, test_set, cat_features=cat_features):
    """
    Report, for each categorical feature, how many labels appear in the test set but not in the train set.
    """
    
    df = pd.DataFrame()
    for feat in cat_features:
        set_tr = set(train_set[feat].values)
        set_te = set(test_set[feat].values)
        diff = set_te - set_tr    # get new labels present in test set and not in train set
        df.loc[0, feat] = len(diff)
    print(df)

check_new_labels_in_test(train, test)

The cat10 feature has 8 labels in the test set that are absent from the train set. This needs to be handled before encoding the categorical features, because a LabelEncoder fitted on the train set raises a ValueError when asked to transform labels it has not seen.
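
As an alternative to refitting on merged data, unseen test labels can share a single placeholder code. The sketch below is only an illustration of that idea, not the approach used in this notebook: it assumes pandas is available, and the helper name `encode_with_train_categories` and the toy columns are hypothetical.

```python
import pandas as pd

def encode_with_train_categories(train_col, test_col):
    """Integer-encode both columns using only the labels seen in train.

    Labels that appear only in the test column receive the code -1.
    """
    categories = sorted(train_col.unique())
    train_codes = pd.Categorical(train_col, categories=categories).codes
    test_codes = pd.Categorical(test_col, categories=categories).codes
    return train_codes, test_codes

# Toy data (hypothetical): the label "D" appears only in the test column
toy_train = pd.Series(["A", "B", "C", "A"])
toy_test = pd.Series(["B", "D", "A"])

train_codes, test_codes = encode_with_train_categories(toy_train, toy_test)
print(train_codes.tolist())
print(test_codes.tolist())
```

Whether a shared placeholder or a merged fit works better depends on the model; tree-based models can usually cope with either.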

In [None]:
def encode_cat_features(train_set, test_set, cat_features=cat_features): 
    
    """
    This function fits LabelEncoder on train data and transforms train and test data. If a feature contains new labels in test data,
    the encoder is refitted on the merged train and test values of that feature before transforming both sets.
    Note that train_set and test_set are modified in place.
    """
    le = LabelEncoder()

    for feat in cat_features:
        le.fit(train_set[feat])
        train_set[feat] = le.transform(train_set[feat])
        try:
            test_set[feat] = le.transform(test_set[feat])
        except ValueError:
            train_set[feat] = le.inverse_transform(train_set[feat]) # get the labels back before merging train and test data
            le.fit(pd.concat([train_set[feat], test_set[feat]], axis=0))
            train_set[feat] = le.transform(train_set[feat])
            test_set[feat] = le.transform(test_set[feat])
    return train_set, test_set        

In [None]:
train_enc, test_enc = encode_cat_features(train, test)
#train_enc[cat_features] = train_enc[cat_features].astype('category') 
y = train_enc['target']

# Selection of K top categorical features

The data contains 19 categorical features. Although we would prefer to have as many features as we can in the hope of getting a reasonably accurate model, it is often the case that the variance in the target is better explained by only a subset of the features. I will select categorical features based on the mutual information between each feature and the target, to see whether this improves model performance over using all the features.
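
To make the criterion concrete, here is a minimal sketch of mutual information between two discrete variables, computed directly from their joint frequencies. It is not the estimator scikit-learn uses internally; the function name `mutual_information` and the toy arrays are hypothetical, and only NumPy is assumed.

```python
import numpy as np

def mutual_information(x, y):
    """Mutual information (in nats) between two discrete 1-D arrays."""
    _, x_idx = np.unique(np.asarray(x), return_inverse=True)
    _, y_idx = np.unique(np.asarray(y), return_inverse=True)
    x_idx, y_idx = x_idx.ravel(), y_idx.ravel()

    # Joint probability table estimated from co-occurrence counts
    joint = np.zeros((x_idx.max() + 1, y_idx.max() + 1))
    np.add.at(joint, (x_idx, y_idx), 1)
    joint /= joint.sum()

    # Product of the marginals: what the joint would be under independence
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    expected = px @ py

    nonzero = joint > 0  # empty cells contribute nothing to the sum
    return float(np.sum(joint[nonzero] * np.log(joint[nonzero] / expected[nonzero])))

# Toy data (hypothetical): one feature copies the target, the other is independent of it
toy_target = np.array([0, 0, 1, 1, 0, 0, 1, 1])
informative = toy_target.copy()
uninformative = np.array([0, 1, 0, 1, 0, 1, 0, 1])

mi_informative = mutual_information(informative, toy_target)
mi_uninformative = mutual_information(uninformative, toy_target)
print(mi_informative, mi_uninformative)
```

A feature that determines the target scores the full entropy of the target, while an independent feature scores zero; real features fall in between.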

In [None]:
def cat_feature_selection(X_cat=train_enc[cat_features], y=y, top_feats=8, print_fs_score=False, train_enc=train_enc):
    """
    Select the top_feats categorical features with the highest mutual information with the target
    """
    fs_mutual_info = mutual_info_classif(X_cat, y, random_state=1)
    if print_fs_score:
        print(fs_mutual_info)
    top_features = fs_mutual_info.argsort()[-top_feats:][::-1]  # indices into X_cat, best first
    X_post_fs = train_enc[X_cat.columns[top_features]]  # select by name instead of relying on column positions
    return X_post_fs

In [None]:
X_cat_fs = cat_feature_selection(print_fs_score=True)
X_cat_fs.head()

Feature selection based on mutual information corroborates the importance of categorical features detected in DABL analysis. Categorical features in decreasing order of importance as per both DABL and Mutual Info are the same: cat16, cat15, cat18, cat1 and so on. 

# Continuous features

There are 11 continuous features. Let's plot a heatmap showing the correlation of these features with the target and with each other.

In [None]:
cont_features = [feat for feat in train.columns if feat[:4]=='cont']
X_cont = pd.concat([train[cont_features], train["target"]], axis=1)

In [None]:
def plot_heatmap(df, width, height):
    """
    Plot heatmap of correlation matrix in specified height and width
    """
    sns.set_style('whitegrid')
    plt.subplots(figsize=(width, height))

    mask = np.zeros_like(df.corr(), dtype=bool)  # np.bool was removed from NumPy; the builtin bool is equivalent
    mask[np.triu_indices_from(mask)] = True

    sns.heatmap(df.corr(), 
                cmap=sns.diverging_palette(250, 15, s=75, l=40,n=9, center="dark"), 
                mask = mask, 
                annot=True, 
                center = 0)
    plt.title("Correlation Heatmap")
    plt.show()

In [None]:
plot_heatmap(X_cont, 18, 12)

There is a moderate correlation between the target and some features, such as cont5 and cont6, and a very weak correlation between the target and others, such as cont0 and cont7. There is also multicollinearity among the independent variables, as some of them are highly correlated with each other: cont1 & cont2, cont0 & cont10, cont7 & cont10 and so on. It is important to reduce multicollinearity because it makes coefficient estimates unstable and hard to interpret, particularly for linear models such as logistic regression.
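
Reading pairs off a heatmap by eye gets tedious, so the same information can be listed programmatically. The following is a minimal sketch on synthetic data: the helper name `correlated_pairs`, the 0.7 threshold and the toy columns are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.7):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is listed once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic data (hypothetical): "b" is a noisy copy of "a", "c" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=500),
    "c": rng.normal(size=500),
})

high_corr = correlated_pairs(toy)
print(high_corr)
```

On the notebook's data the call would be `correlated_pairs(train[cont_features])`, with the threshold chosen to taste.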

To identify the degree of multicollinearity in the data, I will use the Variance Inflation Factor (VIF). For each independent variable, VIF = 1 / (1 - R²), where R² comes from regressing that variable on all the other independent variables. Equivalently, it measures how much the variance of that variable's coefficient is inflated by its correlation with the other variables. A VIF of 1 indicates that the variable is not correlated with the others; as a rule of thumb, a value around 5 indicates moderate collinearity and a value of 10 or more indicates high collinearity. I will try to achieve a VIF score below 10 for the remaining features after removing some highly correlated features.
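
The 1 / (1 - R²) form can be sketched with NumPy alone. This is an illustration on synthetic data rather than a replacement for the statsmodels call used in this notebook: the function name `vif_scores` is hypothetical, and including an intercept in each auxiliary regression is an assumption of this example.

```python
import numpy as np

def vif_scores(X):
    """VIF for each column: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    scores = []
    for j in range(n_cols):
        target = X[:, j]
        # Design matrix: intercept plus every column except column j
        others = np.column_stack([np.ones(n_rows), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        scores.append(1.0 / (1.0 - r_squared))
    return np.array(scores)

# Synthetic data (hypothetical): x2 is nearly a copy of x1, x3 is independent
rng = np.random.default_rng(42)
x1 = rng.normal(size=1000)
x2 = x1 + 0.1 * rng.normal(size=1000)
x3 = rng.normal(size=1000)

scores = vif_scores(np.column_stack([x1, x2, x3]))
print(scores.round(2))
```

The two near-duplicate columns receive a large VIF and the independent column stays close to 1, which is the pattern to look for in the table computed in the next cell.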

In [None]:
vif = pd.DataFrame()
vif["features"] = train[cont_features].columns
vif["vif_Factor"] = [variance_inflation_factor(train[cont_features].values, i) for i in range(train[cont_features].shape[1])]
vif

The high VIF scores of some features confirm that the independent variables are strongly correlated with each other.

In [None]:
vif_feats = ['cont3','cont4','cont5','cont6','cont8', 'cont9']
vif = pd.DataFrame()
vif["features"] = train[vif_feats].columns
vif["vif_Factor"] = [variance_inflation_factor(train[vif_feats].values, i) for i in range(train[vif_feats].shape[1])]
vif

This subset of features has better VIF scores than the complete set of features. I will see if this leads to an improvement in model performance.

In [None]:
X_cont_fs = pd.concat([train[vif_feats], train["target"]],axis=1)
plot_heatmap(X_cont_fs, 10, 7)

# Modelling with all features

In [None]:
# Select independent and dependent variables
X = train_enc.iloc[:,1:-1]
y = train_enc.iloc[:,-1]

In [None]:
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=111) # split data into multiple folds for reliable results

In [None]:
def run_model(model,skf=skf, X=X, y=y, get_pred=False, plot_cm=False, cmap=plt.cm.Reds, **params): 
    """
    Train the given model class with the chosen parameters on each training fold and report ROC AUC on the matching validation fold.
    If plot_cm is True, plot the confusion matrix on the full data using the model fitted on the last fold.
    """
    cv_score = [] # container to compute mean cv scores
    i = 1

    for train_idx, val_idx in skf.split(X, y):  # stratified split of train data
        print(f"{i} of KFold {skf.n_splits}")
        X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]  # split() yields positional indices
        y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

        modl = model(**params)
        modl.fit(X_train, y_train)
        score = roc_auc_score(y_val, modl.predict_proba(X_val)[:, 1])  # ROC AUC is computed from scores, not hard labels
        print(f"ROC AUC Score: {score:.4f}")
        cv_score.append(score)
        i+=1

    print(f"CV Scores: {[round(val, 4) for val in cv_score]} \n Mean CV Score: {np.mean(cv_score):.4f}")
    
    if plot_cm:
        # plot confusion matrix
        pred_all_y = modl.predict_proba(X)[:,1] # predict all targets for plotting confusion matrix 
        plt.figure(figsize=(8,5))
        cm = confusion_matrix(y,np.where(pred_all_y > 0.5, 1, 0))
        plt.imshow(cm, interpolation='nearest', cmap=cmap)
        plt.title(f"Confusion Matrix - {modl.__class__.__name__}")
        plt.colorbar()
        classes = [0,1]
        tick_marks = np.arange(len(classes))
        plt.xticks(tick_marks, classes, rotation=0)
        plt.yticks(tick_marks, classes)
        plt.tight_layout()
        plt.ylabel('True class')
        plt.xlabel('Predicted class')
        plt.grid(False)
        thresh = cm.max() / 2.
        for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
            plt.text(j, i, cm[i, j],
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")
        plt.show()        
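
ROC AUC measures how well a model ranks positive samples above negative ones, so it is best computed from predicted probabilities rather than from thresholded class labels. The following is a minimal sketch of that effect on hand-made data; the function name `pairwise_auc` and the toy arrays are hypothetical, and the function uses the pairwise definition of AUC rather than calling scikit-learn.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # every positive score minus every negative score
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Toy data (hypothetical)
y_true = np.array([0, 0, 1, 1, 0, 1])
proba = np.array([0.10, 0.40, 0.35, 0.80, 0.55, 0.60])
labels = (proba > 0.5).astype(int)  # hard labels throw away the ranking within each class

auc_proba = pairwise_auc(y_true, proba)
auc_labels = pairwise_auc(y_true, labels)
print(f"AUC from probabilities: {auc_proba:.4f}")
print(f"AUC from hard labels:   {auc_labels:.4f}")
```

The two numbers differ because thresholding collapses the scores into two values and turns many ordered pairs into ties.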


# Logistic Regression with default parameters

In [None]:
params = {'C':1}
run_model(LogisticRegression, plot_cm=True, **params)

# XGBoost with default parameters

In [None]:
run_model(XGBClassifier, plot_cm=True)  # no extra keyword arguments, so the classifier uses its defaults

XGBoost classifier has done a marginally better job classifying the target as compared to Logistic regression (both with default parameters).

# CatBoostClassifier

In [None]:
params = {'verbose':False}
run_model(CatBoostClassifier, plot_cm=True, **params)

We see a slight improvement in ROC AUC score with CatBoostClassifier over XGBClassifier.

# Applying feature selection for continuous and categorical features

I will now train the model on a subset of the features. From the previous stage, I keep 14 features: the 8 categorical features with the highest mutual information with the target and the 6 continuous features retained after the VIF check. Let's see how the results vary when we use only these 14 features instead of the initial 30.

In [None]:
X_fs = pd.concat([train_enc[X_cat_fs.columns], train_enc[vif_feats]],axis=1)

In [None]:
params = {'verbose':False}
run_model(CatBoostClassifier, X=X_fs, plot_cm=True, **params) # using CatBoostClassifier as it gave best results

At first glance, the ROC AUC score dropped after feature selection. However, the drop is a minuscule 0.0076, whereas the number of features fell from 30 to 14, a reduction of 53.33%, which is quite significant. This supports our assumption that, for this dataset, only a subset of the given features contributes most of what the model learns. The absence of the remaining features hardly had an effect. One can also compare the confusion matrices and see how similar they are.

# Model selection, training and submission of predictions

In [None]:
# model will be trained on entire training set
X_train = train_enc.iloc[:, 1:-1]
y_train = train_enc['target']

X_test = test_enc.iloc[:, 1:]
test_ids = test_enc['id']

# Choosing CatBoostClassifier as it gave best results with default parameters
cbc = CatBoostClassifier(verbose=False)
cbc.fit(X_train, y_train)
test_pred = cbc.predict_proba(X_test)[:, 1] # test predictions as positive-class probabilities, in line with the ROC AUC metric used above

# generate solution for submission
sub = pd.DataFrame()
sub['id'] = test_ids
sub['target'] = test_pred
sub.to_csv('My submission.csv', index=False)

# Next steps

In the next segment, I will try hyperparameter tuning to improve the results...

If you like my work, kindly upvote. Thanks for reading through.