snap-stanford/relbench-user-study

RelBench User Study

Website | Relbench Repo | Paper | Mailing List

Description

This repo hosts the code used in the RelBench User Study. The study benchmarks a classical ML model (LightGBM) trained on features that a data scientist engineered by hand in SQL. Its purpose is to provide a comparison point for the performance of Relational Deep Learning (RDL) models on the RelBench benchmark.

For details on the user study, see Section 5 and Appendix C of the RelBench Paper.

Structure

At the top level there are two noteworthy Python files:

  • train_gbdt.py: A script that runs hyperparameter tuning for LightGBM trained on the hand-engineered features for each task.
  • utils.py: A set of utility functions used throughout the study, most notably a function that sets up a DuckDB instance of each dataset (see below).

In addition, the directories at the top level correspond to datasets in RelBench. Within each dataset directory you will find a dataset-level exploratory data analysis (EDA) notebook, and subdirectories for each task. In turn, each task directory contains a feats.sql file containing the features engineered by a data scientist, and a notebook with dataset/model validation code.
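To give a sense of what a feats.sql file might contain, here is a minimal sketch of a SQL feature query driven from Python. Everything specific in it is an assumption made for illustration: the `review` table, its columns, the cutoff date, and the `build_features` helper are hypothetical and are not taken from this repo. It also uses Python's built-in `sqlite3` with an in-memory table as a stand-in so that it runs without any setup, whereas the study itself runs its SQL against DuckDB.

```python
import sqlite3

# Hypothetical toy schema: one row per review (NOT the actual rel-amazon schema).
SCHEMA = "CREATE TABLE review (customer_id INTEGER, rating REAL, review_time TEXT)"

ROWS = [
    (1, 5.0, "2015-01-10"),
    (1, 3.0, "2015-03-02"),
    (2, 4.0, "2015-02-20"),
    (1, 1.0, "2015-09-01"),  # later than the example cutoff, so it is excluded
]

# Illustrative feature query: per-customer aggregates computed only from
# rows strictly before a cutoff date, so no later information leaks in.
FEATS_SQL = """
SELECT
    customer_id,
    COUNT(*)         AS num_reviews,
    AVG(rating)      AS avg_rating,
    MAX(review_time) AS last_review_time
FROM review
WHERE review_time < :cutoff
GROUP BY customer_id
ORDER BY customer_id
"""


def build_features(cutoff):
    """Return one feature row per customer, using reviews before `cutoff`."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute(SCHEMA)
        con.executemany("INSERT INTO review VALUES (?, ?, ?)", ROWS)
        return con.execute(FEATS_SQL, {"cutoff": cutoff}).fetchall()
    finally:
        con.close()


if __name__ == "__main__":
    for row in build_features("2015-06-01"):
        print(row)
```

The resulting rows (one per entity, one column per feature) are the kind of flat table a gradient-boosted model such as LightGBM can be trained on.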

Setup

Install dependencies (pip install -r requirements.txt).

For a given dataset (e.g., rel-amazon), set up a local DuckDB instance by running:

python -c "import utils; utils.db_setup('rel-amazon', 'amazon/amazon.db');"

Once you've set up a local DuckDB instance you should be able to run all the notebooks and any additional SQL you desire.

Training a LightGBM

Assuming you have set up the DuckDB instance as described above, you can train a LightGBM model with the following command:

python train_gbdt.py --dataset rel-amazon --task user-churn --generate_feats
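
If you want to launch the command above for several dataset/task pairs, one option is to build the argument list in Python. The sketch below is an illustration rather than part of the repo: `gbdt_command` and `RUNS` are hypothetical names, and it uses only the flags shown above (`--dataset`, `--task`, `--generate_feats`). Whether `--generate_feats` can be left out on later runs is an assumption that this README does not confirm.

```python
import shlex
import sys


def gbdt_command(dataset, task, generate_feats=True, python=sys.executable):
    """Build the argument list for one train_gbdt.py run (hypothetical helper)."""
    cmd = [python, "train_gbdt.py", "--dataset", dataset, "--task", task]
    if generate_feats:
        # Assumption: the flag is optional; the README only shows it being passed.
        cmd.append("--generate_feats")
    return cmd


# Only this dataset/task pair appears in the README; extend the list as needed.
RUNS = [("rel-amazon", "user-churn")]

if __name__ == "__main__":
    for dataset, task in RUNS:
        cmd = gbdt_command(dataset, task)
        # Print the command; to execute it, call subprocess.run(cmd, check=True)
        # from the repository root after setting up the DuckDB instance.
        print(shlex.join(cmd))
```

Passing the arguments as a list rather than a single string avoids shell-quoting problems if a dataset or task name ever contains unusual characters.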