First, `pip install` RelBench if you haven't already:

In [2]:
!pip install relbench



# Load database

All it takes is one line!


In [70]:
from relbench.datasets import get_dataset

dataset = get_dataset(
    name="rel-f1", process=True
)  # other options to try include 'rel-amazon', 'rel-stack'

making Database object from raw files...
done in 0.12 seconds.
reindexing pkeys and fkeys...
done in 0.01 seconds.
caching Database object to /Users/joshuarobinson/Library/Caches/relbench/rel-f1/db...
done in 0.02 seconds.
use process=False to load from cache.


Use `process=True` the first time you load a particular dataset to automatically download the data from its original source onto your machine. From then on you can set `process=False` for faster loading from cache.
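
The build-once, load-from-cache behaviour described above can be pictured with a small standard-library sketch. This is an illustration of the general pattern only, not RelBench's implementation: `load_or_build`, `build_toy_db`, and the cache file name are all hypothetical, and falling back to a rebuild when the cache is missing is an assumption made here for convenience.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_build(cache_file: Path, build, process: bool):
    """Hypothetical helper: rebuild and cache when process=True, else read the cache."""
    if process or not cache_file.exists():
        obj = build()  # slow path: in a real setting, download and parse raw files
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        with cache_file.open("wb") as f:
            pickle.dump(obj, f)
        return obj
    with cache_file.open("rb") as f:  # fast path: read the cached object
        return pickle.load(f)


build_calls = []


def build_toy_db():
    build_calls.append(1)  # count how often the slow path runs
    return {"races": [{"raceId": 0, "year": 1950}]}


cache_file = Path(tempfile.mkdtemp()) / "toy-db" / "db.pkl"
first = load_or_build(cache_file, build_toy_db, process=True)    # builds and caches
second = load_or_build(cache_file, build_toy_db, process=False)  # loads from cache
print(len(build_calls), first == second)
```

The second call never touches `build_toy_db`, which is why loading from cache is faster.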


Now that we have loaded the database, let's start poking around to see what's inside. To start, let's check the full list of attributes the dataset has...


In [32]:
dir(dataset)

['__annotations__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_full_db',
 'cache_dir',
 'db',
 'db_dir',
 'get_task',
 'make_db',
 'max_eval_time_frames',
 'name',
 'pack_db',
 'task_cls_dict',
 'task_cls_list',
 'task_names',
 'test_timestamp',
 'train_start_timestamp',
 'val_timestamp',
 'validate_and_correct_db']

A lot of this list can be ignored, especially the `__blah__` attributes. There are, however, a number of attributes that we _do_ care about. 
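
One quick way to cut such a listing down is to drop every name that starts with an underscore. The sketch below uses a made-up `ToyDataset` class so that it runs on its own; the same list comprehension could be applied to any object. Note that this filter also hides single-underscore attributes, so a name such as `_full_db` would not appear in the result.

```python
class ToyDataset:
    """Made-up stand-in for a dataset object; not the RelBench class."""

    name = "toy"
    _full_db = None  # private by convention, so it is filtered out below

    def get_task(self, task_name):
        return task_name


def public_attrs(obj):
    # Keep only names without a leading underscore (drops dunders and private attrs).
    return [attr for attr in dir(obj) if not attr.startswith("_")]


attrs = public_attrs(ToyDataset())
print(attrs)
```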



# Val / Test cutoffs

We can check the val/test time cutoffs as follows:

In [60]:
print(dataset.val_timestamp)
print(dataset.test_timestamp)

2005-01-01 00:00:00
2010-01-01 00:00:00


This shows that data before 2005 is used for training, between 2005 and 2010 for validation, and after 2010 for testing. 

Note that it is a RelBench design choice to make the validation and test cutoffs a dataset property, _not_ a task-specific property. In other words, all tasks for a given database use the same time splits.
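
To make the split rule concrete, here is a minimal sketch that assigns a timestamp to a split using the two cutoffs printed above. The function `split_of` is hypothetical, and how a timestamp exactly equal to a cutoff is treated is an assumption of this sketch, not something stated by RelBench.

```python
from datetime import datetime

VAL_TIMESTAMP = datetime(2005, 1, 1)   # value printed for dataset.val_timestamp
TEST_TIMESTAMP = datetime(2010, 1, 1)  # value printed for dataset.test_timestamp


def split_of(ts: datetime) -> str:
    """Assumed boundary handling: a timestamp equal to a cutoff goes to the later split."""
    if ts < VAL_TIMESTAMP:
        return "train"
    if ts < TEST_TIMESTAMP:
        return "val"
    return "test"


for ts in (datetime(1994, 3, 30), datetime(2009, 11, 1, 11), datetime(2013, 3, 16)):
    print(ts.date(), split_of(ts))
```

Because the cutoffs belong to the dataset, the same function would apply unchanged to every task defined on it.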




# Accessing the raw data


Next we check out `dataset.db`, which holds the data itself...

In [51]:
dataset.db

Database()

This returns a RelBench `Database` object. So let's go one layer deeper and check what's inside this...

In [47]:
dir(dataset.db)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'load',
 'max_timestamp',
 'min_timestamp',
 'reindex_pkeys_and_fkeys',
 'save',
 'table_dict',
 'upto']

With this we can double check the full timespan of the database:

In [22]:
print(dataset.db.max_timestamp)
print(dataset.db.min_timestamp)

2009-11-01 11:00:00
1950-05-13 00:00:00


1950 is the first season for F1! So we have data for the full history of F1. Note that `max_timestamp` (2009-11-01) is the last timestamp before the test cutoff of 2010-01-01, i.e. the end of the validation data. Data from after 2009 is used for testing, but is hidden from `dataset.db`. To see the full database including test data you can instead use `dataset._full_db`, but we advise caution when using this to avoid inadvertent time leakage. For instance, we can check the final cutoff for test data by calling:

In [24]:
print(dataset._full_db.max_timestamp)

2023-11-26 13:00:00
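
Conceptually, hiding the test period amounts to keeping only rows whose timestamp falls before a cutoff. The sketch below illustrates that idea on plain Python dicts; `rows_upto` is a hypothetical helper and is not the signature of the `upto` method listed above. The first two rows reuse names and dates from the `races` output, while the third is a made-up placeholder for a test-period race.

```python
from datetime import datetime


def rows_upto(rows, time_col, cutoff):
    """Keep rows whose timestamp is at or before the cutoff (illustrative only)."""
    return [row for row in rows if row[time_col] <= cutoff]


races = [
    {"name": "British Grand Prix", "date": datetime(1950, 5, 13)},
    {"name": "Abu Dhabi Grand Prix", "date": datetime(2009, 11, 1, 11)},
    {"name": "placeholder test-period race", "date": datetime(2023, 11, 26, 13)},
]

visible = rows_upto(races, "date", cutoff=datetime(2010, 1, 1))
latest_visible = max(row["date"] for row in visible)
print(latest_visible)  # latest timestamp that survives the filter
```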


Next let's check out the `dataset.db.table_dict`, which contains the raw tables.

More info on the schemas for F1 and all other datasets can be found at https://relbench.stanford.edu/.

In [56]:
dataset.db.table_dict

{'races': Table(df=
      raceId  year  round  circuitId                  name                date  \
 0         0  1950      1          8    British Grand Prix 1950-05-13 00:00:00   
 1         1  1950      2          5     Monaco Grand Prix 1950-05-21 00:00:00   
 2         2  1950      3         18      Indianapolis 500 1950-05-30 00:00:00   
 3         3  1950      4         65      Swiss Grand Prix 1950-06-04 00:00:00   
 4         4  1950      5         12    Belgian Grand Prix 1950-06-18 00:00:00   
 ..      ...   ...    ...        ...                   ...                 ...   
 815     815  2009     13         13    Italian Grand Prix 2009-09-13 12:00:00   
 816     816  2009     14         14  Singapore Grand Prix 2009-09-27 12:00:00   
 817     817  2009     15         21   Japanese Grand Prix 2009-10-04 05:00:00   
 818     818  2009     16         17  Brazilian Grand Prix 2009-10-18 16:00:00   
 819     819  2009     17         23  Abu Dhabi Grand Prix 2009-11-01 11:00:00

So `dataset.db.table_dict` is a dict, and we can check the full list of tables in the F1 database by checking out the dict keys.

In [58]:
dataset.db.table_dict.keys()

dict_keys(['races', 'circuits', 'drivers', 'results', 'standings', 'constructors', 'constructor_results', 'constructor_standings', 'qualifying'])

That's 9 tables total! Let's look more closely at one of them.

In [68]:
table = dataset.db.table_dict["drivers"]
table

Table(df=
     driverId        driverRef code  forename     surname        dob  \
0           0         hamilton  HAM     Lewis    Hamilton 1985-01-07   
1           1         heidfeld  HEI      Nick    Heidfeld 1977-05-10   
2           2          rosberg  ROS      Nico     Rosberg 1985-06-27   
3           3           alonso  ALO  Fernando      Alonso 1981-07-29   
4           4       kovalainen  KOV    Heikki  Kovalainen 1981-10-19   
..        ...              ...  ...       ...         ...        ...   
852       852  mick_schumacher  MSC      Mick  Schumacher 1999-03-22   
853       853             zhou  ZHO    Guanyu        Zhou 1999-05-30   
854       854         de_vries  DEV      Nyck    de Vries 1995-02-06   
855       855          piastri  PIA     Oscar     Piastri 2001-04-06   
856       856         sargeant  SAR     Logan    Sargeant 2000-12-31   

    nationality  
0       British  
1        German  
2        German  
3       Spanish  
4       Finnish  
..          ...  

The `drivers` table stores information on all F1 drivers who have ever competed in a race. Note that the table comes with several pieces of information:
- The table itself, `table.df`, which is simply a Pandas DataFrame.
- The primary key column, `table.pkey_col`, which indicates that the `driverId` column holds the primary key for this particular table in the database.
- The primary time column, `table.time_col`, which, if the table stores events, records the time each event happened. Drivers are non-temporal entities, so here `table.time_col=None`.
- The tables that foreign keys point to, `table.fkey_col_to_pkey_table`. If the table has any foreign key columns, this dict indicates which table each foreign key corresponds to. Again, in the case of drivers this is not applicable.
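
To see how these four pieces fit together, here is a small standard-library sketch. `ToyTable` and `follow_fkey` are hypothetical names that only mirror the attributes listed above, with a list of dicts standing in for the DataFrame; they are not the RelBench classes. The two driver rows come from the output above, and the result rows are made up.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyTable:
    """Illustrative stand-in for a table plus its metadata; not the RelBench class."""

    rows: list                      # stands in for the DataFrame in `df`
    pkey_col: Optional[str] = None  # primary key column, if any
    time_col: Optional[str] = None  # None for non-temporal tables
    fkey_col_to_pkey_table: dict = field(default_factory=dict)  # fkey column -> table


drivers = ToyTable(
    rows=[{"driverId": 0, "surname": "Hamilton"}, {"driverId": 1, "surname": "Heidfeld"}],
    pkey_col="driverId",
)
# Made-up result rows that point at the drivers table through `driverId`.
results = ToyTable(
    rows=[{"resultId": 0, "driverId": 1}, {"resultId": 1, "driverId": 0}],
    pkey_col="resultId",
    fkey_col_to_pkey_table={"driverId": "drivers"},
)
tables = {"drivers": drivers, "results": results}


def follow_fkey(table, row, fkey_col):
    """Return the row in the referenced table whose primary key matches the foreign key."""
    target = tables[table.fkey_col_to_pkey_table[fkey_col]]
    return next(r for r in target.rows if r[target.pkey_col] == row[fkey_col])


linked = follow_fkey(results, results.rows[0], "driverId")
print(linked["surname"])
```

The foreign key dict is what lets a row in one table be linked to the entity it refers to in another.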

We can start to explore the data a little, e.g., check out the youngest and oldest ever F1 drivers, spanning 3 centuries!

In [80]:
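# idxmax/idxmin return index labels, so look the rows up with .loc rather than .iloc
print(table.df.loc[table.df["dob"].idxmax()])  # youngest driver (latest date of birth)
print(table.df.loc[table.df["dob"].idxmin()])  # oldest driver (earliest date of birth)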

driverId                       855
driverRef                  piastri
code                           PIA
forename                     Oscar
surname                    Piastri
dob            2001-04-06 00:00:00
nationality             Australian
Name: 855, dtype: object
driverId                       741
driverRef                etancelin
code                            \N
forename                  Philippe
surname                  Étancelin
dob            1896-12-28 00:00:00
nationality                 French
Name: 741, dtype: object


Going back to the `table.time_col` and `table.fkey_col_to_pkey_table`, the `results` table contains a non-trivial example.

In [98]:
dataset.db.table_dict["results"]

Table(df=
       resultId  raceId  driverId  constructorId  number  grid  position  \
0             0       0       660            152    18.0    21      11.0   
1             1       0       790            149     8.0    12       NaN   
2             2       0       579             49     1.0     3       NaN   
3             3       0       661            149     9.0    10       NaN   
4             4       0       789            152    17.0     7       NaN   
...         ...     ...       ...            ...     ...   ...       ...   
20318     20318     819         1              1     6.0     8       5.0   
20319     20319     819        21             22    23.0     4       4.0   
20320     20320     819        17             22    22.0     5       3.0   
20321     20321     819        16              8    14.0     3       2.0   
20322     20322     819         2              2    16.0     9       9.0   

       positionOrder  points  laps  milliseconds  fastestLap  rank  statusId 

Here we start to notice certain data artifacts that might be good to keep in mind for later when doing ML modeling. For instance, the `milliseconds` and `fastestLap` columns seem to only have been collected for more recent races, with `NaN` features for earlier races.
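
A quick way to quantify this kind of artifact is to measure the fraction of missing values per year. The sketch below does so on a handful of made-up rows using only the standard library; `missing_fraction_by_year` is a hypothetical helper, and the numbers are illustrative rather than taken from the `results` table.

```python
import math

# Made-up rows for illustration; real values live in the `results` table.
rows = [
    {"year": 1950, "milliseconds": math.nan, "fastestLap": math.nan},
    {"year": 1950, "milliseconds": math.nan, "fastestLap": math.nan},
    {"year": 2009, "milliseconds": 5000000.0, "fastestLap": 54.0},
    {"year": 2009, "milliseconds": math.nan, "fastestLap": 49.0},
]


def missing_fraction_by_year(rows, col):
    """Map each year to the fraction of its rows where `col` is NaN."""
    totals, missing = {}, {}
    for row in rows:
        year = row["year"]
        totals[year] = totals.get(year, 0) + 1
        missing[year] = missing.get(year, 0) + int(math.isnan(row[col]))
    return {year: missing[year] / totals[year] for year in totals}


ms_missing = missing_fraction_by_year(rows, "milliseconds")
lap_missing = missing_fraction_by_year(rows, "fastestLap")
print(ms_missing)
print(lap_missing)
```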

# Loading a task

Each RelBench dataset comes with multiple pre-defined predictive tasks. For any given RelBench dataset, you can check all the associated tasks with:

In [81]:
dataset.task_names

['driver-position', 'driver-dnf', 'driver-top3']

Check out https://relbench.stanford.edu/ for detailed descriptions of what each task is. As an example, let's use `driver-top3` where the task is, for a given driver and a given timestamp, to predict whether that driver will finish in the top 3 in some race in the next 30 days.

The task itself is instantiated by calling:

In [86]:
task = dataset.get_task("driver-top3", process=True)

Ground truth train / val / test labels are computed by calling `task.train_table` etc. Each task table contains triples (timestamp, Id, label) indicating the timepoint at which the prediction is made, the entity the label is associated with, and the label itself. The task table also indicates which database table it is "attached" to, in this case the `drivers` table.

In [90]:
task.train_table

Table(df=
           date  driverId  qualifying
0    2004-08-04        40           0
1    2004-08-04        45           0
2    2004-08-04        43           0
3    2004-06-05        17           1
4    2004-06-05         9           0
...         ...       ...         ...
1348 1994-03-30        80           0
1349 1994-03-30        48           0
1350 1994-03-30        77           0
1351 1994-02-28        43           0
1352 1994-02-28        56           0

[1353 rows x 3 columns],
  fkey_col_to_pkey_table={'driverId': 'drivers'},
  pkey_col=None,
  time_col=date)
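
To build intuition for where such (timestamp, Id, label) triples could come from, here is a minimal sketch of a "top 3 in the next 30 days" label. It is not RelBench's actual label computation: `top3_label`, the result rows, and the choice of a half-open window after the timestamp are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Made-up result rows: (race date, driverId, finishing position).
results = [
    (datetime(2004, 6, 13), 17, 1),
    (datetime(2004, 6, 20), 9, 7),
    (datetime(2004, 8, 15), 17, 2),
]


def top3_label(results, driver_id, timestamp, horizon=timedelta(days=30)):
    """1 if the driver has a top-3 finish in (timestamp, timestamp + horizon], else 0."""
    return int(
        any(
            d == driver_id and timestamp < date <= timestamp + horizon and pos <= 3
            for date, d, pos in results
        )
    )


timestamp = datetime(2004, 6, 5)
task_rows = [
    (timestamp, driver_id, top3_label(results, driver_id, timestamp))
    for driver_id in (17, 9)
]
print(task_rows)
```

The key point is that the label only looks at events after the timestamp, while a model making the prediction should only see data up to it.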

The test table is handled differently, with the labels being hidden by default.

In [92]:
task.test_table

Table(df=
          date  driverId
0   2013-03-16       153
1   2013-03-16        19
2   2012-10-17       808
3   2012-10-17       818
4   2012-10-17       817
..         ...       ...
721 2010-07-30        14
722 2010-06-30       154
723 2010-06-30        14
724 2010-05-01        14
725 2010-05-01       154

[726 rows x 2 columns],
  fkey_col_to_pkey_table={'driverId': 'drivers'},
  pkey_col=None,
  time_col=date)

We have carefully designed the standardized evaluation protocol (see: XXX) so that the test labels themselves are only ever used under the hood of RelBench. Users should never need to see them, which reduces the risk of data leakage. If strictly needed, test labels can be retrieved by calling:

In [97]:
task._full_test_table

Table(df=
          date  driverId  qualifying
0   2013-03-16       153           0
1   2013-03-16        19           1
2   2012-10-17       808           0
3   2012-10-17       818           0
4   2012-10-17       817           0
..         ...       ...         ...
721 2010-07-30        14           0
722 2010-06-30       154           0
723 2010-06-30        14           0
724 2010-05-01        14           0
725 2010-05-01       154           0

[726 rows x 3 columns],
  fkey_col_to_pkey_table={'driverId': 'drivers'},
  pkey_col=None,
  time_col=date)
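
The idea of scoring predictions without exposing labels can be sketched as an object that stores the labels privately and only ever returns a metric. `HiddenLabelEvaluator` below is a hypothetical illustration of that pattern, not RelBench's evaluation API; the five labels are the first five rows of the output above.

```python
class HiddenLabelEvaluator:
    """Hypothetical sketch: holds test labels privately and only returns a metric."""

    def __init__(self, labels):
        self._labels = list(labels)  # kept private by convention

    def accuracy(self, predictions):
        if len(predictions) != len(self._labels):
            raise ValueError("expected one prediction per test row")
        correct = sum(int(p == y) for p, y in zip(predictions, self._labels))
        return correct / len(self._labels)


evaluator = HiddenLabelEvaluator([0, 1, 0, 0, 0])
score = evaluator.accuracy([0, 0, 0, 0, 0])  # an all-zeros baseline
print(score)
```

With this design, user code passes in predictions and gets back a score, so there is no need to read the labels directly.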

Now that we have explored the data and task, the next step is to train an ML model on the data. See XXX for our GNN-based approach!