# TabZilla Metadataset Tutorial

This notebook demonstrates how to analyze our experimental results, including some of the results from our paper.

### First Things First

1. Please download the TabZilla results dataset `metadataset_clean.csv`, and the dataset meta-features `metafeatures_clean.csv` from our Google Drive folder [here](https://drive.google.com/drive/folders/1cHisTmruPHDCYVOYnaqvTdybLngMkB8R?usp=sharing), and place them in the same directory as this notebook.
2. Run this notebook in a Python (3.11+) environment with `pandas` installed.

### Read the datasets

In [2]:
import pandas as pd

metadataset_df = pd.read_csv("./metadataset_clean.csv")
metafeatures_df = pd.read_csv("./metafeatures_clean.csv")

# 1. Explore our experiment results (`metadataset_clean.csv`)

The most important columns in this dataset are:
- `dataset_fold_id`: the name of the "dataset fold". Each dataset is split into 10 train/test/validation splits for these experiments.
- `dataset_name`: the name of the dataset, not including the fold.
- `alg_name`: the name of the algorithm.
- `hparam_source`: the set of hyperparameters used with the algorithm.

Each row contains results for a single algorithm trained on the training set (80% of the dataset) and then evaluated on the validation and test sets (10% each).
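As a quick sanity check on the fold structure described above, the number of distinct folds per dataset can be counted from the `dataset_name` and `dataset_fold_id` columns. The sketch below is illustrative only: the helper name `folds_per_dataset` and the toy frame (including its made-up fold ID strings) are assumptions, not part of the TabZilla code or data.

```python
import pandas as pd


def folds_per_dataset(df: pd.DataFrame) -> pd.Series:
    """Count the distinct dataset folds recorded for each dataset."""
    return df.groupby("dataset_name")["dataset_fold_id"].nunique()


# Toy frame with made-up names and fold IDs, only to show the call.
# In this notebook, pass metadataset_df instead.
toy_df = pd.DataFrame(
    {
        "dataset_name": ["toy_a", "toy_a", "toy_a", "toy_b"],
        "dataset_fold_id": ["toy_a_f0", "toy_a_f1", "toy_a_f1", "toy_b_f0"],
        "alg_name": ["CatBoost", "CatBoost", "XGBoost", "CatBoost"],
    }
)

fold_counts = folds_per_dataset(toy_df)
print(fold_counts)
```

Calling `folds_per_dataset(metadataset_df)` on the real data should show whether every dataset has the expected 10 folds.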

This file includes the following metrics for each of the three splits (train, validation, and test):
- Log Loss
- AUC
- Accuracy
- F1 Score
- runtime ("time")

These columns follow the naming convention `{metric}__{split}`. For example, the column `Log Loss__val` is the log loss calculated on the validation set, and `time__test` is the runtime to evaluate the test set.
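The naming convention can be captured in a small helper that builds column names from a metric and a split. This is a minimal sketch: the helper `metric_column` is hypothetical, and only the labels "Log Loss", "Accuracy", "time", "val", and "test" are confirmed by the text and example in this notebook. The spellings "AUC", "F1", and "train" are assumptions and should be checked against `metadataset_df.columns`.

```python
from itertools import product

# Metric and split labels as they are assumed to appear in column names.
# "AUC", "F1", and "train" are guesses; verify them against the real columns.
METRICS = ["Log Loss", "AUC", "Accuracy", "F1", "time"]
SPLITS = ["train", "val", "test"]


def metric_column(metric: str, split: str) -> str:
    """Build a column name following the "{metric}__{split}" convention."""
    return f"{metric}__{split}"


# Every metric/split combination, in METRICS-major order.
all_metric_columns = [metric_column(m, s) for m, s in product(METRICS, SPLITS)]
print(all_metric_columns)
```

A list built this way can be filtered with `[c for c in all_metric_columns if c in metadataset_df.columns]` before indexing, so that a wrong guess about a label does not raise a `KeyError`.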

For example, here are the test accuracy and training time results for CatBoost using default hyperparameters, for all folds of the dataset "openml__jungle_chess_2pcs_raw_endgame_complete__167119":

In [11]:
metadataset_df.loc[
    (metadataset_df["alg_name"] == "CatBoost") & 
    (metadataset_df["hparam_source"] == "default") &
    (metadataset_df["dataset_name"] == "openml__jungle_chess_2pcs_raw_endgame_complete__167119"),
    [
        "dataset_fold_id", 
        "alg_name", 
        "hparam_source", 
        "Accuracy__test", 
        "training_time"]
]

Unnamed: 0,dataset_fold_id,alg_name,hparam_source,Accuracy__test,training_time
562260,openml__jungle_chess_2pcs_raw_endgame_complete...,CatBoost,default,0.82954,2.370031
562549,openml__jungle_chess_2pcs_raw_endgame_complete...,CatBoost,default,0.827755,1.214042
562838,openml__jungle_chess_2pcs_raw_endgame_complete...,CatBoost,default,0.836234,1.482092
563127,openml__jungle_chess_2pcs_raw_endgame_complete...,CatBoost,default,0.842481,1.507916
563416,openml__jungle_chess_2pcs_raw_endgame_complete...,CatBoost,default,0.844712,1.197246
563705,openml__jungle_chess_2pcs_raw_endgame_complete...,CatBoost,default,0.846274,1.122574
563994,openml__jungle_chess_2pcs_raw_endgame_complete...,CatBoost,default,0.829763,1.198574
564283,openml__jungle_chess_2pcs_raw_endgame_complete...,CatBoost,default,0.830879,1.222639
564572,openml__jungle_chess_2pcs_raw_endgame_complete...,CatBoost,default,0.841589,1.208539
564861,openml__jungle_chess_2pcs_raw_endgame_complete...,CatBoost,default,0.840661,1.198245
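Per-fold results like these are often summarized as a mean and standard deviation across folds. The sketch below shows one way to do that with `DataFrame.agg`. The helper name `summarize_over_folds` and the toy values are assumptions made for illustration; they are not results from the metadataset.

```python
import pandas as pd


def summarize_over_folds(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return the mean and sample standard deviation of `columns` over all rows.

    Assumes `df` has already been filtered so that each row is one fold of
    the same dataset, algorithm, and hyperparameter source.
    """
    return df[columns].agg(["mean", "std"])


# Toy values, made up to show the call. In this notebook, pass the filtered
# selection from the previous cell instead.
toy_folds = pd.DataFrame(
    {
        "Accuracy__test": [0.80, 0.90],
        "training_time": [1.0, 3.0],
    }
)

summary = summarize_over_folds(toy_folds, ["Accuracy__test", "training_time"])
print(summary)
```

The result has one row for the mean and one for the standard deviation, with one column per metric. Note that `std` here is the pandas default (sample standard deviation, `ddof=1`).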


In [8]:
metadataset_df["alg_name"].unique()

array(['CatBoost', 'DecisionTree', 'DeepFM', 'KNN', 'LightGBM',
       'LinearModel', 'MLP', 'RandomForest', 'STG', 'SVM', 'TabNet',
       'TabTransformer', 'VIME', 'XGBoost', 'rtdl_MLP', 'rtdl_ResNet',
       'DANet', 'NAM', 'NODE', 'SAINT', 'rtdl_FTTransformer',
       'TabPFNModel'], dtype=object)