# Kaggle Tabular Playground Nov Data: Logistic regression

This notebook works with the Kaggle Tabular Playground Series November 2021 data and produces predictions for the 'target' variable.
The data was synthetically generated by a GAN trained on a real-world dataset used to identify spam emails from features extracted from each email. It contains 100 features with continuous values and one binary target variable (0 or 1).
The following steps were followed:

- detailed exploratory data analysis
- model construction and evaluation
- hyperparameter tuning
- producing predictions



## EDA

In this section the train and test data sets are read and analysed, and basic data checks are performed.

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import glob
import os
import matplotlib.pyplot as plt
from xgboost import XGBClassifier, plot_importance
from sklearn.multioutput import MultiOutputRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split, KFold


import warnings
warnings.filterwarnings('ignore')

filename1 = r'/kaggle/input/tabular-playground-series-nov-2021/train.csv'
filename2 = r'/kaggle/input/tabular-playground-series-nov-2021/test.csv'


df_train = pd.read_csv(filename1, index_col=None, header=0)
df_test = pd.read_csv(filename2, index_col=None, header=0)
#df=df.iloc[1:,:]
print(len(df_train), 'rows in training dataset')
print(df_train.head())
print(len(df_test), 'rows in test dataset')
print(df_test.head())

In [None]:
df_train.describe()

In [None]:
df_test.describe()

The summary statistics show that the features are continuous and not normalised. Note that feature 'f2' takes significantly larger values than the other features, so scaling may be required depending on the modelling technique adopted.
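
One way to make such scale differences concrete is to rank the features by their spread. The sketch below uses a small synthetic frame; the column values and magnitudes are invented for illustration and are not taken from the competition data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_train: three features, one on a much larger scale.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "f0": rng.normal(0.0, 1.0, 1000),
    "f1": rng.normal(0.0, 1.0, 1000),
    "f2": rng.normal(0.0, 500.0, 1000),  # deliberately large spread (assumption)
})

# Rank features by standard deviation; features on a larger scale come first.
spread = toy.std().sort_values(ascending=False)

# Express each spread relative to the median spread across all features.
relative_spread = spread / spread.median()
print(relative_spread)
```

On the real data the same two lines could be applied to `df_train.drop(columns=['id', 'target'])` to list the features whose scale stands out.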

The next step is to check the columns for null values.

In [None]:
df_train.isnull().sum()

In [None]:
df_test.isnull().sum()

The preliminary checks detect no missing values, so the analysis moves on to distribution and feature correlation checks.

### Distribution and correlation checks


In [None]:
Y=df_train['target']

In [None]:
# Select the float feature columns (np.float was removed in NumPy 1.24, so use np.float64)
mask = df_train.dtypes == np.float64
float_cols = df_train.columns[mask]
fig, axes = plt.subplots(len(float_cols) // 4, 4, figsize=(22, 40))
# Overlay train and test histograms per feature to compare their distributions
for col, axis in zip(float_cols, axes.reshape(-1)):
    sns.histplot(df_train[col], ax=axis, kde=True, bins=100, label=f'train_{col}')
    sns.histplot(df_test[col], color='red', ax=axis, kde=True, bins=100, label=f'test_{col}')
    axis.legend()
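
The overlaid histograms give a visual comparison of the train and test distributions. A possible numeric complement is a two-sample Kolmogorov-Smirnov test per feature using `scipy.stats.ks_2samp`. The sketch below runs on synthetic arrays rather than the competition files, and the size of the shift is an assumption chosen for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_col = rng.normal(0.0, 1.0, 2000)
test_same = rng.normal(0.0, 1.0, 2000)     # drawn from the same distribution
test_shifted = rng.normal(1.0, 1.0, 2000)  # mean shifted by one standard deviation

# The KS statistic is the largest gap between the two empirical CDFs.
stat_same = ks_2samp(train_col, test_same).statistic
stat_shifted = ks_2samp(train_col, test_shifted).statistic

print(f"KS statistic, same distribution: {stat_same:.3f}")
print(f"KS statistic, shifted test set:  {stat_shifted:.3f}")
```

A feature whose statistic is much larger than the rest would be a candidate for closer inspection.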
    


In [None]:
dfcorr = df_train.corr()
ndf = dfcorr.loc[dfcorr.max(axis=1) > 0.50, dfcorr.max(axis=0) > 0.50]

sns.heatmap(ndf)
plt.show()

No strong correlation between features is detected.
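
Because every diagonal entry of a correlation matrix equals 1, a filter such as `dfcorr.max(axis=1) > 0.50` is true for every row. A minimal sketch of a check that ignores the diagonal is shown below on synthetic data; the column names and the correlated pair are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(500, 4)), columns=["a", "b", "c", "d"])
toy["e"] = toy["a"] + 0.1 * rng.normal(size=500)  # strongly correlated with "a"

corr = toy.corr().abs()

# Blank out the diagonal so self-correlation does not dominate the maximum.
off_diag = corr.mask(np.eye(len(corr), dtype=bool))
max_corr = off_diag.max(axis=1)

# Features that have at least one strongly correlated partner.
strong = max_corr[max_corr > 0.50].index.tolist()
print(strong)
```

An empty `strong` list on the real data would support the conclusion that no pair of columns is strongly correlated.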

In [None]:
dfcorr['target']

In [None]:
Y_train=df_train['target']


### Target Distribution check


In [None]:
## Target distribution
pie, ax = plt.subplots(figsize=[18,8])
# Pass the axes via ax= (the original x=ax did not attach the plot to the axes explicitly)
df_train.groupby('target').size().plot(kind='pie', autopct='%.1f', colors=sns.color_palette('pastel')[0:2], ax=ax, title='Target distribution')
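
The class shares shown in the pie chart also give a baseline for reading accuracy figures: always predicting the majority class achieves an accuracy equal to the larger share. A small sketch on a made-up target column (the values are illustrative, not the competition labels):

```python
import pandas as pd

# Toy target column; the real one would be df_train['target'].
toy_target = pd.Series([0, 1, 1, 0, 1, 1, 0, 1], name="target")

# Fraction of rows in each class, ordered by class label.
share = toy_target.value_counts(normalize=True).sort_index()

# Accuracy of a model that always predicts the most frequent class.
majority_baseline = share.max()
print(share)
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```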

### Skewness Check and Normalisation

In [None]:
# Create a list of float columns to check for skewness
mask = df_train.dtypes == np.float64
float_cols = df_train.columns[mask]
skew_limit = 0.75
skew_vals = df_train[float_cols].skew()

skew_cols = (skew_vals
             .sort_values(ascending=False)
             .to_frame()
             .rename(columns={0:'Skew'})
             .query('abs(Skew) > {0}'.format(skew_limit)))

skew_cols = skew_cols.index.to_list()

In [None]:
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# apply a robust scaler (median and IQR based) to the skewed columns
scaler = RobustScaler()
df_train[skew_cols] = scaler.fit_transform(df_train[skew_cols])
df_test[skew_cols] = scaler.transform(df_test[skew_cols])
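
It may help to spell out what `RobustScaler` does to a skewed column. With default settings it subtracts the median and divides by the interquartile range, which is a linear rescaling and therefore leaves the skewness unchanged. The sketch below checks this on synthetic exponential data; the `np.log1p` comparison is an addition for illustration and is not part of this notebook's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
x = pd.DataFrame({"skewed": rng.exponential(scale=2.0, size=5000)})

scaled = RobustScaler().fit_transform(x)

# Manual equivalent of the default RobustScaler: (x - median) / IQR.
q1, med, q3 = np.percentile(x["skewed"], [25, 50, 75])
manual = (x["skewed"].to_numpy() - med) / (q3 - q1)

skew_before = x["skewed"].skew()
skew_after = pd.Series(scaled[:, 0]).skew()

# A non-linear transform changes the shape of the distribution, for comparison.
skew_log = np.log1p(x["skewed"]).skew()

print(f"skew before: {skew_before:.3f}, after RobustScaler: {skew_after:.3f}, after log1p: {skew_log:.3f}")
```

So the scaler above makes the skewed columns comparable in scale and less sensitive to outliers, but a transform such as a logarithm would be needed if the goal were to reduce skewness itself.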

## Modelling and splitting the data into 5 folds

In [None]:
df_train=df_train.drop('id', axis=1)
df_test=df_test.drop('id', axis=1)

XGBoost was chosen for modelling, as it is a powerful classifier for binary classification problems. The training set is split into 5 folds and an XGBClassifier is trained on each split. **Note** that tree-based models do not require scaling or normalisation, so the robust scaling applied to the skewed columns above is not needed by the tree models that follow.
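
The claim that tree-based models are insensitive to feature scaling can be illustrated with a single decision tree. The sketch below uses scikit-learn's `DecisionTreeClassifier` as a simplified stand-in for a boosted ensemble, with synthetic integer-valued features and a power-of-two scale factor chosen so that the arithmetic stays exact; all of these are assumptions for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Integer-valued features keep the floating point arithmetic exact in this toy case.
X = rng.integers(0, 100, size=(400, 5)).astype(float)
y = (X[:, 0] + X[:, 1] > 100).astype(int)

X_scaled = X * 1024.0  # rescale every feature by a constant positive factor

tree_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X[:300], y[:300])
tree_scaled = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_scaled[:300], y[:300])

# Splits depend only on the ordering of values, so both trees partition the data alike.
pred_raw = tree_raw.predict(X[300:])
pred_scaled = tree_scaled.predict(X_scaled[300:])
same_predictions = bool(np.array_equal(pred_raw, pred_scaled))
print(same_predictions)
```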

In [None]:
xgb_classifier = XGBClassifier(booster='gbtree',
                               max_depth=4,
                               objective='binary:logistic',
                               n_estimators=1000,
                               learning_rate=0.1)

KFoldseed = 1
cv = KFold(n_splits=5, shuffle=True, random_state=KFoldseed)
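
As a quick illustration of what this 5-fold split does, the sketch below runs `KFold` on a tiny array (10 rows, invented for the example) and checks that every row is held out exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy rows
cv = KFold(n_splits=5, shuffle=True, random_state=1)

test_indices = []
for train_idx, test_idx in cv.split(X):
    # Within a fold, training and held-out indices never overlap.
    assert set(train_idx).isdisjoint(test_idx)
    test_indices.extend(test_idx.tolist())

# Across the 5 folds every row is held out exactly once.
held_out_once = sorted(test_indices) == list(range(len(X)))
print(held_out_once)
```

For a binary target, `StratifiedKFold` is a common alternative that keeps the class ratio similar in each fold; this notebook uses plain `KFold`.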

In [None]:
# separate predictors and target in the data frame
# (the id column was already dropped above; target is the last column)
x_df = df_train[df_train.columns[:-1]]
y_df = df_train[df_train.columns[-1:]]
print(x_df.shape)
print(y_df.shape)

In [None]:
y_df.columns


In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
column='target'
sigma_target_list = []
sigma_target = np.std(y_df[column])
sigma_target_list.append(sigma_target)

cv_train_accuracy= []
cv_test_accuracy = []
cv_train_roc_auc= []
cv_test_roc_auc = []

for train_idx, test_idx in cv.split(x_df):

    x_train, x_test = x_df.iloc[train_idx], x_df.iloc[test_idx]
    y_train, y_test = y_df[column].iloc[train_idx], y_df[column].iloc[test_idx]


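    # XGBoost >= 2.0 no longer accepts eval_metric / early_stopping_rounds in fit(),
    # so they are set on the estimator instead (supported since XGBoost 1.6)
    xgb_classifier.set_params(eval_metric='auc', early_stopping_rounds=250)
    fitted_model = xgb_classifier.fit(x_train, y_train,
                                      eval_set=[(x_train, y_train), (x_test, y_test)],
                                      verbose=False)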

    pred_train = fitted_model.predict(x_train)
    pred_train_prob = fitted_model.predict_proba(x_train)[:,1]

    cv_train_accuracy.append(accuracy_score(y_train.values, pred_train))
    cv_train_roc_auc.append(roc_auc_score(y_train, pred_train_prob))

    pred_test = fitted_model.predict(x_test)
    pred_test_prob = fitted_model.predict_proba(x_test)[:,1]
    cv_test_accuracy.append(accuracy_score(y_test.values, pred_test))
    cv_test_roc_auc.append(roc_auc_score(y_test.values, pred_test_prob))
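
The loop passes predicted probabilities, not hard labels, to `roc_auc_score`, because AUC measures how well the scores rank positives above negatives. A minimal sketch of that definition on four made-up points, compared with scikit-learn's result:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Fraction of (positive, negative) pairs ranked correctly; ties count as half.
diff = pos[:, None] - neg[None, :]
pairwise_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

sklearn_auc = roc_auc_score(y_true, y_score)
print(pairwise_auc, sklearn_auc)
```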


    

Beyond the scores themselves, feature importance is also of interest: if hyperparameter optimization is performed later with the goal of improving the results, eliminating low-importance features may play an important role.
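
A rough sketch of importance-based feature elimination is given below. It uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost, synthetic data in which only one feature carries signal, and the same 0.02 threshold as the plot in this notebook; the data and the model choice are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 5)), columns=[f"f{i}" for i in range(5)])
y = (X["f0"] > 0).astype(int)  # only f0 carries signal in this toy data

model = GradientBoostingClassifier(n_estimators=20, max_depth=2, random_state=0).fit(X, y)

# Importances are normalised to sum to 1; sort so the strongest features come first.
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)

# Keep only features above the threshold and build the reduced design matrix.
keep = importances[importances > 0.02].index.tolist()
X_reduced = X[keep]
print(importances)
```

Any such reduction would need to be validated by re-running the cross-validation on the reduced feature set.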

In [None]:
evals_result = fitted_model.evals_result()
plt.plot(np.arange(len(evals_result['validation_0']['auc'])), evals_result['validation_0']['auc'], label='Training Set')
plt.plot(np.arange(len(evals_result['validation_1']['auc'])), evals_result['validation_1']['auc'], label='Testing Set')
plt.xlabel('Iteration')
plt.ylabel('AUC')
plt.title('Learning Curve for Target', fontweight='bold')
plt.legend()

#plot feature importance per target
feature_importances = pd.DataFrame(fitted_model.feature_importances_, index = x_df.columns, columns=['importance']).sort_values('importance')
print(feature_importances[feature_importances['importance']>0.02])
pos_importance = feature_importances[feature_importances['importance']>0.02]
pos_importance.plot(kind = 'barh',title=f'Target')
plt.show()
plt.clf()

In [None]:
#Print model report:
print("\nModel Report")
print("XGBoost Mean Accuracy (Train) : %.4g" % np.mean(cv_train_accuracy))
print("XGBoost Mean AUC Score (Train): %f" % np.mean(cv_train_roc_auc))

print("XGBoost Mean Accuracy (Test) : %.4g" % np.mean(cv_test_accuracy))
print("XGBoost Mean AUC Score (Test): %f" % np.mean(cv_test_roc_auc))



## Confusion Matrix

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support
# Creates a confusion matrix for the last CV fold only
# (y_test and pred_test hold the values from the final loop iteration)
cm = confusion_matrix(y_test, pred_test) 

# Transform to df for easier plotting
cm_df = pd.DataFrame(cm,
                     index = ['0','1',], 
                     columns = ['0','1'])

plt.figure(figsize=(5.5,4))
sns.heatmap(cm_df, annot=True,fmt='g')
plt.title('XGBoost \nAccuracy:{0:.3f}'.format(accuracy_score(y_test, pred_test)))
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
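
The four cells of a binary confusion matrix are enough to derive precision and recall alongside accuracy. A minimal sketch on ten made-up labels (not the notebook's predictions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(precision, recall, accuracy)
```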

## Improvements : LightGBM Model

XGBoost takes a long time to run in this example, so a LightGBM classifier was tested to reduce the execution time. LightGBM not only runs faster but also gives a slightly better result on both the training and testing sets.

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
column='target'
sigma_target_list = []
sigma_target = np.std(y_df[column])
sigma_target_list.append(sigma_target)
    
cv_train_accuracy= []
cv_test_accuracy = []
cv_train_roc_auc= []
cv_test_roc_auc = []
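preds = np.zeros(len(df_test))
# Use a distinct name so the lightgbm module alias `lgb` is not overwritten
lgb_classifier = lgb.LGBMClassifier(learning_rate=0.04, max_depth=3, n_estimators=5000)

for train_idx, test_idx in cv.split(x_df):

    x_train, x_test = x_df.iloc[train_idx], x_df.iloc[test_idx]
    y_train, y_test = y_df[column].iloc[train_idx], y_df[column].iloc[test_idx]

    # LightGBM >= 4.0 takes early stopping and logging as callbacks
    # rather than early_stopping_rounds / verbose arguments to fit()
    fitted_model = lgb_classifier.fit(x_train, y_train,
                                      eval_set=[(x_train, y_train), (x_test, y_test)],
                                      eval_metric='auc',
                                      callbacks=[lgb.early_stopping(stopping_rounds=250, verbose=False),
                                                 lgb.log_evaluation(period=0)])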

    pred_train = fitted_model.predict(x_train)
    pred_train_prob = fitted_model.predict_proba(x_train)[:,1]
   
    cv_train_accuracy.append(accuracy_score(y_train.values, pred_train))
    cv_train_roc_auc.append(roc_auc_score(y_train, pred_train_prob))
   
    pred_test = fitted_model.predict(x_test)
    pred_test_prob = fitted_model.predict_proba(x_test)[:,1]
    preds += fitted_model.predict_proba(df_test)[:,1]/5
    cv_test_accuracy.append(accuracy_score(y_test.values, pred_test))
    cv_test_roc_auc.append(roc_auc_score(y_test.values, pred_test_prob))
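
The line `preds += ... / 5` averages the test-set probabilities of the five fold models. The sketch below isolates that pattern, using scikit-learn's `LogisticRegression` on synthetic data as a stand-in for LightGBM; the data, the model and the variable names are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
X_new = rng.normal(size=(50, 3))  # stands in for the unlabelled test set

n_splits = 5
cv = KFold(n_splits=n_splits, shuffle=True, random_state=1)
avg_proba = np.zeros(len(X_new))

for train_idx, _ in cv.split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    # Each fold's model contributes an equal share of the final probability.
    avg_proba += model.predict_proba(X_new)[:, 1] / n_splits

print(avg_proba[:5])
```

Dividing by `n_splits` rather than a literal 5 keeps the average correct if the number of folds is changed.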

In [None]:
#Print model report:
print("\nModel Report")
print("LightGBM Mean Accuracy (Train) : %.4g" % np.mean(cv_train_accuracy))
print("LightGBM Mean AUC Score (Train): %f" % np.mean(cv_train_roc_auc))

print("LightGBM Mean Accuracy (Test) : %.4g" % np.mean(cv_test_accuracy))
print("LightGBM Mean AUC Score (Test): %f" % np.mean(cv_test_roc_auc))

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support
# Creates a confusion matrix for the last CV fold only
# (y_test and pred_test hold the values from the final loop iteration)
cm = confusion_matrix(y_test, pred_test) 

# Transform to df for easier plotting
cm_df = pd.DataFrame(cm,
                     index = ['0','1',], 
                     columns = ['0','1'])

plt.figure(figsize=(5.5,4))
sns.heatmap(cm_df, annot=True,fmt='g')
plt.title('LightGBM \nAccuracy:{0:.3f}'.format(accuracy_score(y_test, pred_test)))
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()

As we can see, using the LightGBM model has improved the score and decreased the difference in AUC between the training and testing sets.

## Improvements : TabNet Model

TabNet (https://arxiv.org/pdf/1908.07442.pdf), introduced by Google in 2019, is a neural network that was reported to outperform leading tree-based models across a variety of benchmarks. It is also considered more explainable than boosted tree models and can be used without any feature preprocessing. It was therefore interesting to try TabNet on this problem and see whether the score could be improved. For this test the labelled data set was split into two sets, training (70%) and validation (30%), for TabNet to train on.

In [None]:
!pip install pytorch-tabnet wget


In [None]:
import pytorch_tabnet
from pytorch_tabnet.tab_model import TabNetClassifier
import numpy as np
import torch
np.random.seed(8)
tabnetclass = TabNetClassifier(optimizer_fn=torch.optim.Adam,
                       optimizer_params=dict(lr=2e-2),
                       scheduler_params={"step_size":50, # how to use learning rate scheduler
                                         "gamma":0.8},
                       scheduler_fn=torch.optim.lr_scheduler.StepLR,
                       mask_type='entmax' # "sparsemax"
                      )



In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
x_train, x_val, y_train, y_val = train_test_split(x_df, y_df, test_size=0.30, random_state=8)
x_train_np= x_train.to_numpy()
y_train_np= y_train.to_numpy().ravel()
x_val_np = x_val.to_numpy()
y_val_np = y_val.to_numpy().ravel()

tabnetclass.fit(
    x_train_np,y_train_np,
    eval_set=[(x_train_np, y_train_np), (x_val_np, y_val_np)],
    eval_name=['train', 'valid'],
    eval_metric=['auc','accuracy'],
    max_epochs=200 , patience=20,
    batch_size=1024, virtual_batch_size=128,
    num_workers=0,
    weights=1,
    drop_last=False
)

## Submission

In [None]:
filename3 = r'/kaggle/input/tabular-playground-series-nov-2021/sample_submission.csv'

df3 = pd.read_csv(filename3, index_col=None, header=0)

df3['target']=preds
df3.to_csv('submission.csv', index=False)
df3
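
Before uploading, a few sanity checks on the submission frame can catch shape or range problems. The sketch below uses toy stand-ins for `sample_submission.csv` and `preds`; the ids and probabilities are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the sample submission file and the averaged predictions.
sample = pd.DataFrame({"id": [600000, 600001, 600002], "target": [0.5, 0.5, 0.5]})
preds = np.array([0.12, 0.87, 0.45])

submission = sample.copy()
submission["target"] = preds

checks = {
    "row_count_matches": len(submission) == len(sample),
    "columns_match": list(submission.columns) == list(sample.columns),
    "no_missing": not submission["target"].isnull().any(),
    "probabilities_in_range": bool(submission["target"].between(0, 1).all()),
}
print(checks)
```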
