
chore: Support custom scoring summary functions #1953

Merged 45 commits from tim/support_custom_scorers into master on Jul 18, 2024

Conversation

@tssweeney (Collaborator) commented Jul 16, 2024

Note to Reviewers: Yes, this PR touches many lines, but the functional change is quite small. The vast majority of the changes come from reworking the underlying data model to support different key structures for summaries and scores. I also took the opportunity to clean up a lot of code from the first MVP, add A LOT of comments, reorganize code to decouple unrelated files, and rename various symbols more aptly.

Problem: When building the evaluation comparison page, I made an invalid assumption about the shape of scoring data and summary data. Namely, I assumed that the shape of the summary data was DEFINED by the shape of the score data, where each score "leaf" is expanded into a dictionary in the "summary", and that dictionary has a regular form for boolean or float values respectively. However, this is ONLY true for metrics that are autosummarized. It is perfectly possible for the user to define their own summary function. Jason did exactly that here: https://wandb.ai/jzhao/resume-bot-eval/weave/compare-evaluations?evaluationCallIds=%5B%22fecc2462-10c1-4a7b-a0b9-4e1726e5618d%22%2C%224bc16671-95b6-4044-94d4-b34417b44868%22%5D where he defined a score as {"correct": boolean} but summarized it as {"f1": float, "precision": float, "recall": float}. This is perfectly possible with our Evaluation framework, but was not supported in the UI.
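To make the mismatch concrete, here is a toy summarizer in the shape Jason's example describes: per-row scores shaped {correct: boolean} collapse into a summary shaped {f1, precision, recall}. This is an illustrative sketch only; the `RowScore` type and the `label` ground-truth field are hypothetical, not part of the Weave API.

```typescript
// Per-row score plus a hypothetical ground-truth label for illustration.
type RowScore = {correct: boolean; label: boolean};

// A custom summary whose shape is unrelated to the per-row score shape.
const summarize = (rows: RowScore[]) => {
  const tp = rows.filter(r => r.correct && r.label).length;
  const fp = rows.filter(r => r.correct && !r.label).length;
  const fn = rows.filter(r => !r.correct && r.label).length;
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const f1 =
    precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return {f1, precision, recall};
};
```

A UI that assumes the summary keys mirror the score keys ("correct") would find nothing to render here, which is exactly the bug this PR fixes.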

Solution: The major change here is to decouple our metric definitions. Specifically, our data model currently has scorerMetricDimensions: {[metricDimensionId: string]: ScorerMetricDimension};, but this is not sufficient to describe the world. Now we have two fields:

  // ScoreMetrics define the metrics associated with each individual prediction.
  scoreMetrics: MetricDefinitionMap;

  // SummaryMetrics define the metrics associated with the evaluation as a whole,
  // often aggregated from the scoreMetrics.
  summaryMetrics: MetricDefinitionMap;

Pretty much all the remaining changes are supporting the model change and making the score handling more robust and centralized. I am also being a lot more strict with "Metric" referring to a general value, "SummaryMetric" being a metric summarizing the eval, and "ScoreMetric" being a metric that relates to a specific input/output pair.
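The decoupling above can be sketched as follows. The fields of `MetricDefinition` here are assumptions for illustration, not the actual Weave types; the point is only that the two maps need not mirror each other.

```typescript
// Hypothetical shape of a metric definition; the real type in the PR may differ.
type MetricDefinition = {unit: 'boolean' | 'float'; path: string[]};
type MetricDefinitionMap = {[metricId: string]: MetricDefinition};

type EvaluationComparisonData = {
  // Metrics attached to each individual prediction.
  scoreMetrics: MetricDefinitionMap;
  // Metrics attached to the evaluation as a whole.
  summaryMetrics: MetricDefinitionMap;
};

// With a custom summary function, score and summary keys can be disjoint:
const data: EvaluationComparisonData = {
  scoreMetrics: {correct: {unit: 'boolean', path: ['correct']}},
  summaryMetrics: {
    f1: {unit: 'float', path: ['f1']},
    precision: {unit: 'float', path: ['precision']},
    recall: {unit: 'float', path: ['recall']},
  },
};
```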

A few Before and Afters:
(Four screenshots, captured 2024-07-17: before/after views of the evaluation comparison page.)

**Tests:** I added a large unit test asserting the shapes of data the UI assumes, to protect against future breakage.

Here are a bunch of good examples for all sorts of eval situations for manual testing:

@tssweeney tssweeney marked this pull request as ready for review July 18, 2024 00:26
@tssweeney tssweeney requested a review from a team as a code owner July 18, 2024 00:26
Review threads on weave/tests/test_evaluations.py (outdated, resolved).

// Helpers

const moveItemToFront = (arr: any[], item: any) => {
I put a moveToFront in EmojiDetails, maybe they can join forces.
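The PR's actual implementation of the helper is not shown in this excerpt; one plausible sketch matching the `moveItemToFront(arr, item)` signature above, written non-mutating so it is safe for React state, would be:

```typescript
// Return a copy of `arr` with `item` moved to the front.
// If `item` is absent or already first, return `arr` unchanged.
const moveItemToFront = <T>(arr: T[], item: T): T[] => {
  const idx = arr.indexOf(item);
  if (idx <= 0) {
    return arr;
  }
  return [arr[idx], ...arr.slice(0, idx), ...arr.slice(idx + 1)];
};
```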

@tssweeney tssweeney merged commit 2a1c72c into master Jul 18, 2024
25 checks passed
@tssweeney tssweeney deleted the tim/support_custom_scorers branch July 18, 2024 06:21
@github-actions github-actions bot locked and limited conversation to collaborators Jul 18, 2024
2 participants