## Kaggle Playground Series S6E2
In this notebook, I use `fastai`'s Tabular learner to train a model for the Kaggle Playground Series competition [S6E2](https://www.kaggle.com/competitions/playground-series-s6e2). The Playground Series is a practice competition held every month, and the task for this edition (February 2026) is to predict the likelihood of heart disease. As a first pass, I fit a logistic regression to the problem, which can be viewed [here](https://github.com/tejas-kale/kaggle-notebooks/blob/main/competitions/playground_s6e2/01_logistic_regression.ipynb). I am now experimenting with the Tabular learner because I am currently working through Jeremy Howard's excellent course [Practical Deep Learning for Coders](https://course.fast.ai).

**Note**: This notebook is not a tutorial on the Tabular learner. To learn more about it, read the [official tutorial](https://docs.fast.ai/tutorial.tabular.html).

## Installation and imports

In [1]:
# https://github.com/AnswerDotAI/fastprogress/issues/118#issuecomment-3761073870
%pip -q install -U fastai==2.8.5 fastprogress==1.0.5 ipywidgets

In [30]:
import pandas as pd
from fastai.tabular.all import (
    Path,
    TabularDataLoaders,
    Categorify,
    Normalize,
    FillMissing,
    CategoryBlock,
    tabular_learner,
    RocAucBinary,
)

In [None]:
from google.colab import drive

drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Data
The data is stored in my Google Drive.

In [7]:
DATA_DIR = Path("/content/drive/MyDrive/Colab Notebooks/data/kaggle_playground_s6e2")

In [27]:
df_train = pd.read_csv(DATA_DIR / "train.csv")
df_test = pd.read_csv(DATA_DIR / "test.csv")
df_submission = pd.read_csv(DATA_DIR / "sample_submission.csv")

In [15]:
# Encode the target as 1 for "Presence" and 0 otherwise.
df_train["Heart Disease Indicator"] = (
    df_train["Heart Disease"] == "Presence"
).astype(int)

In [20]:
dls = TabularDataLoaders.from_df(
    df_train,
    path=DATA_DIR,
    y_names="Heart Disease Indicator",
    y_block=CategoryBlock,
    cat_names=[
        "Sex",
        "Chest pain type",
        "FBS over 120",
        "Exercise angina",
        "Thallium",
        "EKG results",
        "Number of vessels fluro",
        "Slope of ST",
    ],
    cont_names=["Age", "BP", "Cholesterol", "Max HR", "ST depression"],
    procs=[Categorify, FillMissing, Normalize],
)

* `Categorify` converts categorical variables into numeric indices.
* `FillMissing` fills missing values in the continuous variables with the median by default. This dataset has no missing values.
* `Normalize` standardises the continuous variables (subtracts the mean and divides by the standard deviation).
* `TabularDataLoaders.from_df` randomly splits the data into training and validation sets in an 80:20 ratio by default.
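
As a rough illustration of what the three procs do to a single column, here is a plain-Python sketch. It is not fastai's implementation: the helper names (`categorify`, `fill_missing`, `normalize`) and the toy values are made up for this example, and using the population standard deviation is an assumption of the sketch. Reserving index 0 for missing categories mirrors how `Categorify` behaves, which is why raw `Sex` values of 0 and 1 appear as 1 and 2 in the encoded output of `show_results` further down.

```python
from statistics import mean, median, pstdev


def categorify(values):
    """Map each distinct category to an integer index, keeping 0 for missing."""
    categories = sorted({v for v in values if v is not None})
    index = {cat: i for i, cat in enumerate(categories, start=1)}
    return [index.get(v, 0) for v in values]


def fill_missing(values):
    """Replace missing entries (None) with the median of the observed ones."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def normalize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Toy columns, invented for illustration only.
thallium = [3, 7, 6, 3, None]
age = [40.0, None, 60.0, 50.0]

thallium_codes = categorify(thallium)  # [1, 3, 2, 1, 0]
age_filled = fill_missing(age)         # [40.0, 50.0, 60.0, 50.0]
age_scaled = normalize(age_filled)     # zero mean, unit standard deviation
```

In fastai, the category maps, medians, means and standard deviations are learned from the training split and then reused for the validation and test data, so the sketch corresponds to the "fit on train" step only.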

In [21]:
dls.show_batch()

Unnamed: 0,Sex,Chest pain type,FBS over 120,Exercise angina,Thallium,EKG results,Number of vessels fluro,Slope of ST,Age,BP,Cholesterol,Max HR,ST depression,Heart Disease Indicator
0,1,4,0,0,3,2,0,2,39.0,112.0,229.999999,129.999999,1.6,1
1,0,4,0,0,3,2,0,2,43.0,109.999999,265.0,148.0,-8.923432e-09,0
2,1,2,0,0,7,0,0,1,43.0,150.0,244.0,154.0,-8.923432e-09,0
3,1,4,0,1,7,0,2,2,65.0,130.0,283.0,124.999999,0.5,1
4,1,4,0,0,3,2,0,1,60.0,134.0,213.000001,175.0,1.6,0
5,1,3,0,0,3,0,0,1,47.0,130.0,187.999999,148.0,-8.923432e-09,0
6,1,4,0,1,7,2,1,2,54.0,140.0,233.0,157.0,1.2,1
7,1,4,1,1,6,2,1,2,62.0,99.999999,267.0,152.0,1.2,1
8,0,3,1,0,3,2,1,1,54.0,150.0,286.000001,181.0,0.8,0
9,1,4,0,1,7,0,2,2,58.0,120.0,229.0,152.0,2.0,1


## Train model

In [31]:
learner = tabular_learner(dls, metrics=RocAucBinary())
learner.fit_one_cycle(1)

epoch,train_loss,valid_loss,roc_auc_score,time
0,0.285715,0.275682,0.953035,01:20


In [32]:
learner.show_results()

Unnamed: 0,Sex,Chest pain type,FBS over 120,Exercise angina,Thallium,EKG results,Number of vessels fluro,Slope of ST,Age,BP,Cholesterol,Max HR,ST depression,Heart Disease Indicator,Heart Disease Indicator_pred
0,2.0,4.0,1.0,2.0,3.0,3.0,4.0,2.0,0.830205,-0.035244,-1.426133,-1.400091,1.774196,1.0,1.0
1,2.0,4.0,1.0,1.0,3.0,3.0,4.0,1.0,-0.986623,0.365182,1.3059,-0.511452,-0.124455,1.0,1.0
2,2.0,4.0,1.0,1.0,1.0,3.0,1.0,1.0,-0.259892,0.498658,-0.446165,0.481734,-0.757338,0.0,0.0
3,1.0,3.0,1.0,1.0,3.0,1.0,1.0,1.0,-1.349989,-0.836096,2.523436,1.004463,-0.757338,0.0,0.0
4,1.0,4.0,1.0,2.0,1.0,3.0,3.0,1.0,0.466839,-0.836096,1.216812,0.011278,1.563235,1.0,1.0
5,1.0,4.0,1.0,1.0,1.0,3.0,2.0,1.0,-0.623257,-0.035244,1.60286,1.004463,-0.757338,0.0,0.0
6,2.0,4.0,1.0,2.0,3.0,1.0,4.0,1.0,0.466839,-1.369998,1.098028,0.220369,-0.757338,1.0,1.0
7,2.0,3.0,1.0,1.0,1.0,3.0,1.0,1.0,-0.259892,-0.035244,-0.327381,0.481734,-0.757338,0.0,0.0
8,2.0,4.0,1.0,2.0,1.0,3.0,1.0,1.0,-0.259892,-0.702621,-0.030421,-1.609183,0.086507,1.0,1.0


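The tabular learner reaches an area under the ROC curve (AUC) of `0.953035`, slightly higher than the `0.9528` observed with logistic regression. The comparison is not like-for-like, though: the logistic regression figure is the mean over 5 cross-validation splits, whereas this one comes from a single validation split.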
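
One way to put the two models on an equal footing would be to average the tabular learner's AUC over five folds as well. The sketch below only builds the fold indices with the standard library; `kfold_indices` is a hypothetical helper, and the seed and toy row count are arbitrary. The folds are not stratified by the target, which is a simplification.

```python
import random


def kfold_indices(n_rows, n_folds=5, seed=42):
    """Shuffle row positions once and deal them into n_folds validation sets."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    # Fold i takes every n_folds-th shuffled position, starting at offset i.
    return [sorted(positions[i::n_folds]) for i in range(n_folds)]


folds = kfold_indices(n_rows=23)  # toy size; use len(df_train) in practice
for i, valid_idx in enumerate(folds):
    print(f"fold {i}: {len(valid_idx)} validation rows")
```

Each list of validation indices could then be passed as `valid_idx` to `TabularDataLoaders.from_df` in place of the default random split, with a fresh learner trained per fold and the five AUC values averaged.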

## Submit predictions

In [25]:
dl = learner.dls.test_dl(df_test)
preds, _ = learner.get_preds(dl=dl)

In [29]:
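# Column 1 holds the predicted probability of class 1 ("Presence").
# Converting to NumPy keeps plain floats in the submission DataFrame.
presence_probs = preds[:, 1].numpy()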

In [None]:
df_submission = df_submission.drop(columns=["Heart Disease"])
df_submission["Heart Disease"] = presence_probs
df_submission.to_csv("submission.csv", index=False)

The model scored `0.9511` on Kaggle's private leaderboard, better than the `0.95089` scored by the logistic regression model.