The initial release of ArtEvalBench (v0.9, PR #15) contains a single artifact, Wasabi, whose results can ultimately be summarized as a single integer: the number of bugs triggered by the current attempt.
The goal of this feature request is to add an artifact that produces a more diverse set of results and outputs, including time series used for plots and figures. Such outputs require a more elaborate "results reproduced"/"experiment runs" evaluator oracle than a single-integer comparison.
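
To illustrate what a "results reproduced" check for time-series output might involve, here is a minimal sketch. It is not part of ArtEvalBench: the function name `series_reproduced` and its parameters (`rel_tol`, `abs_tol`, `min_fraction`) are hypothetical, and a point-wise tolerance comparison is only one of several plausible oracle designs.

```python
import math
from typing import Sequence


def series_reproduced(
    reference: Sequence[float],
    observed: Sequence[float],
    rel_tol: float = 0.05,      # hypothetical default: 5% relative tolerance
    abs_tol: float = 1e-9,      # hypothetical default: guards values near zero
    min_fraction: float = 1.0,  # hypothetical: share of points that must match
) -> bool:
    """Return True if `observed` matches `reference` point by point.

    Assumes both series are sampled at the same points, so they can be
    compared index by index without resampling or alignment.
    """
    # Series of different lengths (or empty ones) cannot be compared point-wise.
    if len(reference) != len(observed) or not reference:
        return False

    # Count the points that agree within the given tolerances.
    matches = sum(
        math.isclose(r, o, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, o in zip(reference, observed)
    )
    return matches / len(reference) >= min_fraction
```

A caller could, for example, accept a run when every point is within 5% of the published series (`series_reproduced(paper_values, run_values)`), or relax `min_fraction` for noisy measurements. A real oracle would likely also need to handle differing sampling rates and multiple series per figure.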