chore: Support custom scoring summary functions #1953
Merged
Conversation
This was referenced Jul 17, 2024
jamie-rasmussen approved these changes Jul 18, 2024
Review comments (outdated, resolved) on:
- ...3/pages/CompareEvaluationsPage/sections/ComparisonDefinitionSection/EvaluationDefinition.tsx
- ...-js/src/components/PagePanelComponents/Home/Browse3/pages/CompareEvaluationsPage/ecpState.ts
- ...-js/src/components/PagePanelComponents/Home/Browse3/pages/CompareEvaluationsPage/ecpState.ts
// Helpers

const moveItemToFront = (arr: any[], item: any) => {

Review comment: I put a moveToFront in EmojiDetails, maybe they can join forces.
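The reviewer's suggestion of sharing the helper with the existing moveToFront in EmojiDetails could look something like the sketch below: a single generically typed version replacing the `any[]` signature. This is my illustration, not the PR's actual code.

```typescript
// Hypothetical shared helper: returns a new array with the first occurrence
// of `item` moved to the front. If `item` is absent or already first, the
// array is returned unchanged. Generic <T> replaces the original `any[]`.
function moveItemToFront<T>(arr: T[], item: T): T[] {
  const index = arr.indexOf(item);
  if (index <= 0) {
    return arr;
  }
  return [arr[index], ...arr.slice(0, index), ...arr.slice(index + 1)];
}

console.log(moveItemToFront(['a', 'b', 'c'], 'c')); // logs ['c', 'a', 'b']
```

Returning a new array (rather than splicing in place) keeps the helper safe to use with React state.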
Review comments (outdated, resolved) on:
- ...-js/src/components/PagePanelComponents/Home/Browse3/pages/CompareEvaluationsPage/ecpTypes.ts
- ...se3/pages/CompareEvaluationsPage/sections/ExampleCompareSection/exampleCompareSectionUtil.ts
- ...ePanelComponents/Home/Browse3/pages/wfReactInterface/tsDataModelHooksEvaluationComparison.ts
Note to Reviewers: Yes, this PR has many lines of change, but functionality-wise it is quite minimal. The vast majority of the changes result from changing the underlying data model to support different key structures for summaries and scores. I also took the opportunity to clean up a lot of code from the first MVP, add A LOT of comments, and re-organize code to decouple unrelated files. I also took this chance to rename various symbols more aptly.
Problem: When building the evaluation comparison page, I made an invalid assumption about the shape of scoring data and summary data. Namely: I thought that the shape of summary data was DEFINED by the shape of the score data, where each score "leaf" is split into a dictionary in the "summary", and that dictionary has a regular form for boolean or floating-point values respectively. However, this is ONLY true for metrics that are "auto-summarized". It is perfectly possible for the user to define their own summary function. Jason did this here: https://wandb.ai/jzhao/resume-bot-eval/weave/compare-evaluations?evaluationCallIds=%5B%22fecc2462-10c1-4a7b-a0b9-4e1726e5618d%22%2C%224bc16671-95b6-4044-94d4-b34417b44868%22%5D where he defined a score as `{"correct": boolean}`, but summarized it as `{"f1": float, "precision": float, "recall": float}`. This is perfectly possible with our Evaluation framework, but was not supported in the UI.

Solution: The major change here is to decouple our metric definitions. Specifically, our data model currently has `scorerMetricDimensions: {[metricDimensionId: string]: ScorerMetricDimension}`, but this is not sufficient to describe the world. Now we have two fields:

Pretty much all the remaining changes support the model change and make score handling more robust and centralized. I am also being a lot more strict with "Metric" referring to a general value, "SummaryMetric" being a metric summarizing the eval, and "ScoreMetric" being a metric that relates to a specific input/output pair.
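The decoupling described above can be sketched as follows. The exact field and type definitions are not shown in this excerpt, so everything below is a hypothetical illustration built from the "SummaryMetric"/"ScoreMetric" terminology the PR introduces; the real field names and shapes may differ.

```typescript
// Hypothetical sketch of the decoupled metric model. Names like
// `summaryMetrics`, `scoreMetrics`, and `valueType` are illustrative
// assumptions, not the PR's actual fields.
type MetricValueType = 'boolean' | 'number';

// A metric summarizing an entire evaluation run (e.g. f1, precision).
type SummaryMetric = {metricId: string; valueType: MetricValueType};

// A metric attached to a specific input/output pair (e.g. correct: boolean).
type ScoreMetric = {metricId: string; valueType: MetricValueType};

// Instead of one map that assumes summaries mirror scores, the two kinds
// of metric are tracked independently:
type EvaluationMetrics = {
  summaryMetrics: {[metricId: string]: SummaryMetric};
  scoreMetrics: {[metricId: string]: ScoreMetric};
};

// With this split, Jason's custom-summary case is representable: the
// summary keys (f1) need not match the score keys (correct).
const metrics: EvaluationMetrics = {
  summaryMetrics: {f1: {metricId: 'f1', valueType: 'number'}},
  scoreMetrics: {correct: {metricId: 'correct', valueType: 'boolean'}},
};
```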
A few Before and Afters:
**Tests:** I added a big unit test to assert the correct shapes of data assumed by the UI, to protect against breakage in the future.
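The core assumption that test guards against can be shown in a few lines. This is a minimal sketch of the shape mismatch, not the actual unit test; `sameKeys` is a hypothetical helper.

```typescript
// The old model assumed a summary's keys mirrored the score's keys.
// With a custom summary function (as in Jason's eval), that fails:
const score = {correct: true}; // per-example score: {"correct": boolean}
const summary = {f1: 0.8, precision: 0.75, recall: 0.86}; // user-defined summary

// Hypothetical helper: do two objects have the same key set?
const sameKeys = (a: object, b: object): boolean =>
  JSON.stringify(Object.keys(a).sort()) === JSON.stringify(Object.keys(b).sort());

console.log(sameKeys(score, summary)); // false: summary keys are user-defined
```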
Good Manual Tests: here are a bunch of good examples for all sorts of eval situations for manual testing:
- Unit Test
- Anish
- Lavanya
- Jason
- Shawn
- Adam