Hi, it's nice to come across a cross-library/model benchmark like this!
When looking at evaluations for structured output libraries, I feel like "valid response" is a very low bar when used on its own as a metric, and I think adding accuracy-related metrics would make these benchmarks more informative.
I fully acknowledge that this is a bit on the ornery side, but since it took only a few lines of code (it was very easy to do in this repo!), I wanted to submit a demo PR for a new "framework" that uses polyfactory to generate valid responses directly from the response model, with 100% reliability and a latency of 0.000 (maybe 0.001 on a bad day).
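For context, here's a minimal sketch of the idea, assuming a pydantic response model; the `ExtractedPerson` model and factory names are illustrative, not taken from this repo:

```python
from pydantic import BaseModel
from polyfactory.factories.pydantic_factory import ModelFactory


# Illustrative response model -- in the benchmark, this would be whatever
# schema the task expects the model to fill in.
class ExtractedPerson(BaseModel):
    name: str
    age: int


# polyfactory fabricates valid instances of the model, so every "response"
# is schema-valid by construction -- no LLM call needed at all.
class ExtractedPersonFactory(ModelFactory[ExtractedPerson]):
    __model__ = ExtractedPerson


response = ExtractedPersonFactory.build()
print(response)  # a perfectly valid, perfectly meaningless ExtractedPerson
```

The generated values are nonsense, which is exactly the point: a 100% valid-response rate says nothing about whether the content is correct, hence the suggestion for accuracy-related metrics.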
I'd potentially be interested in contributing additional metrics/tasks in the future, in particular named entity recognition!