XGBoost has dominated predictions on tabular data for quite some time, and the ML/AI community is itching to get past the baseline it sets. I was recently introduced to the TabNet architecture. In this notebook, I do a quick, plain-vanilla comparison between the two algorithms on a 5M-row sample. I have barely changed the default parameters; here I just want to see whether both give comparable results and whether an ensemble could eventually give a better one.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score,roc_curve,classification_report

import xgboost as xgb
from xgboost import plot_importance
from xgboost.sklearn import XGBClassifier
import riiideducation

import torch
from pytorch_tabnet.tab_model import TabNetClassifier
from torch.optim.lr_scheduler import ReduceLROnPlateau

In [None]:
# pip install pytorch_tabnet

In [None]:
train = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/train.csv',
                   usecols=[1, 2, 3, 4, 5, 7, 8, 9],
                   dtype={'timestamp': 'int64',
                          'user_id': 'int32',
                          'content_id': 'int16',
                          'content_type_id': 'int8',
                          'task_container_id': 'int16',
                          'answered_correctly':'int8',
                          'prior_question_elapsed_time': 'float32',
                          'prior_question_had_explanation': 'boolean'}
                   )

In [None]:
# Remove lectures and additional processing
train = train[train.content_type_id == 0]

train = train.sort_values(['timestamp'], ascending=True)
train.drop(['timestamp','content_type_id'], axis=1, inplace=True)

In [None]:
# Read Questions and Lectures
questions = pd.read_csv('../input/riiid-test-answer-prediction/questions.csv')
lectures = pd.read_csv('../input/riiid-test-answer-prediction/lectures.csv')

In [None]:
# Merge train with Questions
train = pd.merge(train, questions, left_on = 'content_id', right_on = 'question_id', how = 'left')

In [None]:
# Indicator for a user's first question bundle, where prior_question_elapsed_time is null
train['firstQindicator'] = train['prior_question_elapsed_time'].isnull().astype('int8')
train['prior_question_elapsed_time'] = train['prior_question_elapsed_time'].fillna(0)

In [None]:
train.head()

In [None]:
# Remove unused columns
del train['question_id']
del train['bundle_id']
del train['correct_answer']
del train['tags']

In [None]:
import gc
gc.collect()

In [None]:
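# Column was read as nullable 'boolean', so comparing to the string 'True' does not pick out the True rows; fill NA with False and cast to 0/1
train['prior_question_had_explanation'] = train['prior_question_had_explanation'].fillna(False).astype('int8')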

In [None]:
# Sample 5M records
train = train.sample(n=5000000, random_state=42)  # fixed seed so the sample is reproducible

In [None]:
# train test split
xtrain, xvalid, ytrain, yvalid = train_test_split(train.drop(['answered_correctly'],axis=1), 
                                                  train['answered_correctly'],
                                                  random_state=42, 
                                                  test_size=0.2, 
                                                  shuffle=True)

In [None]:
# Train XGB Classifier
clf_xgb = xgb.XGBClassifier()
clf_xgb.fit(xtrain, ytrain)

In [None]:
# Predict using XGB Classifier
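pred_xgb = clf_xgb.predict(xvalid)
print('\t\t\tCLASSIFICATION METRICS: XGBOOST\n')
print(metrics.classification_report(yvalid, pred_xgb))
# ROC AUC is a ranking metric: score the positive-class probabilities, not the hard labels
proba_xgb = clf_xgb.predict_proba(xvalid)[:, 1]
score = roc_auc_score(yvalid, proba_xgb)
print('ROC AUC is: {}'.format(score))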
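
A note on the AUC figure: ROC AUC measures how well scores rank correct answers above incorrect ones, so it is usually computed on predicted probabilities rather than on thresholded 0/1 labels. The sketch below is only an illustration on made-up labels and scores (not the Riiid data); `pairwise_auc` is a hypothetical helper written for this example, not a library function.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

# Toy data, assumed purely for illustration
y_true = np.array([0, 0, 1, 1, 0, 1])
proba = np.array([0.10, 0.40, 0.35, 0.80, 0.55, 0.90])
labels = (proba >= 0.5).astype(int)  # what a 0.5-threshold predict() would return

auc_from_proba = pairwise_auc(y_true, proba)
auc_from_labels = pairwise_auc(y_true, labels)
print(auc_from_proba, auc_from_labels)
```

On this toy example the thresholded labels give a lower AUC than the probabilities, because thresholding collapses many distinct scores into ties. The size of the gap on real data will vary.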

In [None]:
# Tabnet object
clf_tabnet = TabNetClassifier()

In [None]:
# Fit TabNet model
clf_tabnet.fit(
    X_train=xtrain.values, y_train=ytrain.values,
    eval_set=[(xvalid.values, yvalid.values)]
)

In [None]:
# Predict using TabNet
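pred_tabnet = clf_tabnet.predict(xvalid.values)
print('\t\t\tCLASSIFICATION METRICS: TabNet\n')
print(metrics.classification_report(yvalid, pred_tabnet))
# ROC AUC is a ranking metric: score the positive-class probabilities, not the hard labels
proba_tabnet = clf_tabnet.predict_proba(xvalid.values)[:, 1]
score = roc_auc_score(yvalid, proba_tabnet)
print('ROC AUC is: {}'.format(score))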

I think XGBoost still has an edge. Because the models are trained on partial data, with only minimal feature engineering and no hyperparameter tuning, the results are well below the baselines reported in other kernels. As a next step, I will focus on these areas and see how far each algorithm can be pushed. **Stay tuned...**
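
On the ensemble idea raised at the top: one simple option, not tried in this notebook, is a weighted average of the two models' positive-class probabilities. The sketch below assumes both inputs are 1-D arrays of probabilities for the same rows; `blend_probabilities` and the default weight of 0.5 are hypothetical choices for illustration, and the toy arrays are made up.

```python
import numpy as np

def blend_probabilities(p_a, p_b, weight_a=0.5):
    """Weighted average of two probability vectors (hypothetical helper)."""
    p_a = np.asarray(p_a, dtype=float)
    p_b = np.asarray(p_b, dtype=float)
    if p_a.shape != p_b.shape:
        raise ValueError("both probability arrays must have the same shape")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    return weight_a * p_a + (1.0 - weight_a) * p_b

# Toy probabilities standing in for the two models' outputs
p_xgb_toy = np.array([0.2, 0.7, 0.6])
p_tabnet_toy = np.array([0.4, 0.9, 0.2])
blended = blend_probabilities(p_xgb_toy, p_tabnet_toy, weight_a=0.5)
print(blended)
```

In this notebook the inputs would be `clf_xgb.predict_proba(xvalid)[:, 1]` and `clf_tabnet.predict_proba(xvalid.values)[:, 1]`, and the blend could be scored with `roc_auc_score` like the individual models. The weight would need tuning on held-out data; whether the blend actually beats XGBoost alone here is an open question.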