# Benchmarks

We benchmark GPflux's Deep GP on several UCI datasets.
The code to run the experiments can be found in `benchmarking/main.py`, and the results are stored in `benchmarking/runs/*.json`. This notebook aggregates and plots the outcomes.

In [1]:
import glob
import json

import numpy as np
import pandas as pd

In [2]:
LOGS = "../../benchmarking/runs/*.json"

data = []
for path in glob.glob(LOGS):
    with open(path) as json_file:
        data.append(json.load(json_file))

df = pd.DataFrame.from_records(data)
df = df.rename(columns={"model_name": "model"})

In [3]:
table = df.groupby(["dataset", "model"]).agg(
    {
        "split": "count",
        **{metric: ["mean", "std"] for metric in ["mse", "nlpd"]},
    }
)

We report the mean and standard deviation of the mean squared error (MSE) and the negative log predictive density (NLPD) across 5 different train/test splits. Each split uses 90% of the data for training and the remaining 10% for testing. The output is normalised to have zero mean and unit variance.
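To make the two metrics concrete, here is a minimal sketch of how MSE and NLPD could be computed for a single test split. It assumes the model returns a Gaussian predictive mean and variance per test point; the function names `mse` and `gaussian_nlpd` are hypothetical and not taken from `benchmarking/main.py`, which may compute these quantities differently.

```python
import numpy as np


def mse(y_true, mean):
    """Mean squared error between targets and predictive means."""
    y_true = np.asarray(y_true, dtype=float)
    mean = np.asarray(mean, dtype=float)
    return float(np.mean((y_true - mean) ** 2))


def gaussian_nlpd(y_true, mean, var):
    """Average negative log density of y_true under N(mean, var).

    Assumption: the predictive distribution is Gaussian and factorises
    over test points, so the log density is evaluated elementwise.
    """
    y_true = np.asarray(y_true, dtype=float)
    mean = np.asarray(mean, dtype=float)
    var = np.asarray(var, dtype=float)
    log_density = -0.5 * (np.log(2.0 * np.pi * var) + (y_true - mean) ** 2 / var)
    return float(-np.mean(log_density))


# Toy values for illustration only; these are not benchmark results.
y_test = np.array([0.0, 1.0])
pred_mean = np.array([0.0, 0.0])
pred_var = np.array([1.0, 1.0])

print("mse:", mse(y_test, pred_mean))
print("nlpd:", gaussian_nlpd(y_test, pred_mean, pred_var))
```

Lower is better for both metrics. Unlike MSE, NLPD also penalises predictive variances that are too small or too large, which is why it can be negative when the model is both accurate and confident.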

In [4]:
table

| dataset | model | split (count) | mse (mean) | mse (std) | nlpd (mean) | nlpd (std) |
|---|---|---|---|---|---|---|
| Concrete | dgp-1 | 5 | 0.103785 | 0.014586 | 0.526873 | 0.231547 |
| Concrete | dgp-2 | 5 | 0.093612 | 0.003917 | 0.388471 | 0.200387 |
| Concrete | dgp-3 | 5 | 0.103213 | 0.019258 | 0.624335 | 0.409077 |
| Energy | dgp-1 | 5 | 0.003866 | 0.00166 | -0.991852 | 0.065885 |
| Energy | dgp-2 | 5 | 0.004071 | 0.001542 | -1.089672 | 0.039099 |
| Energy | dgp-3 | 5 | 0.004063 | 0.001521 | -1.091651 | 0.039407 |
| Kin8mn | dgp-1 | 5 | 0.098581 | 0.006733 | 0.263775 | 0.019575 |
| Kin8mn | dgp-2 | 5 | 0.061714 | 0.002321 | 0.040491 | 0.026879 |
| Kin8mn | dgp-3 | 5 | 0.064156 | 0.002981 | 0.144311 | 0.045383 |
| Power | dgp-1 | 5 | 0.056407 | 0.004272 | -0.009102 | 0.045228 |
