
chore: Support custom scoring summary functions #1953

Merged 45 commits from tim/support_custom_scorers into master on Jul 18, 2024

Conversation

@tssweeney (Collaborator) commented Jul 16, 2024

Note to Reviewers: Yes, this PR touches many lines, but the functional change is quite small. The vast majority of the changes come from reworking the underlying data model to support different key structures for summaries and scores. I also took the opportunity to clean up a lot of code from the first MVP, add A LOT of comments, reorganize code to decouple unrelated files, and rename various symbols more aptly.

Problem: When building the evaluation comparison page, I made an invalid assumption about the shape of scoring data and summary data. Namely, I assumed that the shape of the summary data was DEFINED by the shape of the score data, where each score "leaf" is expanded into a dictionary in the "summary", and that dictionary has a regular form for boolean or float values respectively. However, this is ONLY true for metrics that are autosummarized. It is perfectly possible for the user to define their own summary function. Jason did exactly that here: https://wandb.ai/jzhao/resume-bot-eval/weave/compare-evaluations?evaluationCallIds=%5B%22fecc2462-10c1-4a7b-a0b9-4e1726e5618d%22%2C%224bc16671-95b6-4044-94d4-b34417b44868%22%5D where he defined a score as {"correct": boolean} but summarized it as {"f1": float, "precision": float, "recall": float}. This is perfectly possible with our Evaluation framework, but was not supported in the UI.
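To make the mismatch concrete, here is a toy summarizer in the shape Jason's example describes: per-row scores shaped {correct: boolean} collapse into a summary shaped {f1, precision, recall}. This is an illustrative sketch only; the `RowScore` type and the `label` ground-truth field are hypothetical, not part of the Weave API.

```typescript
// Per-row score plus a hypothetical ground-truth label for illustration.
type RowScore = {correct: boolean; label: boolean};

// A custom summary whose shape is unrelated to the per-row score shape.
const summarize = (rows: RowScore[]) => {
  const tp = rows.filter(r => r.correct && r.label).length;
  const fp = rows.filter(r => r.correct && !r.label).length;
  const fn = rows.filter(r => !r.correct && r.label).length;
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const f1 =
    precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return {f1, precision, recall};
};
```

A UI that assumes the summary keys mirror the score keys ("correct") would find nothing to render here, which is exactly the bug this PR fixes.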

Solution: The major change here is to decouple our metric definitions. Specifically, our data model currently has scorerMetricDimensions: {[metricDimensionId: string]: ScorerMetricDimension};, but this is not sufficient to describe the world. Now we have two fields:

  // ScoreMetrics define the metrics associated with each individual prediction.
  scoreMetrics: MetricDefinitionMap;

  // SummaryMetrics define the metrics associated with the evaluation as a whole,
  // often aggregated from the scoreMetrics.
  summaryMetrics: MetricDefinitionMap;

Pretty much all the remaining changes are supporting the model change and making the score handling more robust and centralized. I am also being a lot more strict with "Metric" referring to a general value, "SummaryMetric" being a metric summarizing the eval, and "ScoreMetric" being a metric that relates to a specific input/output pair.
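The decoupling above can be sketched as follows. The fields of `MetricDefinition` here are assumptions for illustration, not the actual Weave types; the point is only that the two maps need not mirror each other.

```typescript
// Hypothetical shape of a metric definition; the real type in the PR may differ.
type MetricDefinition = {unit: 'boolean' | 'float'; path: string[]};
type MetricDefinitionMap = {[metricId: string]: MetricDefinition};

type EvaluationComparisonData = {
  // Metrics attached to each individual prediction.
  scoreMetrics: MetricDefinitionMap;
  // Metrics attached to the evaluation as a whole.
  summaryMetrics: MetricDefinitionMap;
};

// With a custom summary function, score and summary keys can be disjoint:
const data: EvaluationComparisonData = {
  scoreMetrics: {correct: {unit: 'boolean', path: ['correct']}},
  summaryMetrics: {
    f1: {unit: 'float', path: ['f1']},
    precision: {unit: 'float', path: ['precision']},
    recall: {unit: 'float', path: ['recall']},
  },
};
```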

A few Before and Afters:
(Four screenshots, captured 2024-07-17: before/after views of the evaluation comparison page.)

**Tests:** I added a large unit test asserting the shapes of data the UI assumes, to protect against future breakage.

Here are a bunch of good examples for all sorts of eval situations for manual testing:

@tssweeney tssweeney marked this pull request as ready for review July 18, 2024 00:26
@tssweeney tssweeney requested a review from a team as a code owner July 18, 2024 00:26
Review threads on weave/tests/test_evaluations.py (outdated, resolved).

// Helpers

const moveItemToFront = (arr: any[], item: any) => {
I put a moveToFront in EmojiDetails, maybe they can join forces.
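The PR's actual implementation of the helper is not shown in this excerpt; one plausible sketch matching the `moveItemToFront(arr, item)` signature above, written non-mutating so it is safe for React state, would be:

```typescript
// Return a copy of `arr` with `item` moved to the front.
// If `item` is absent or already first, return `arr` unchanged.
const moveItemToFront = <T>(arr: T[], item: T): T[] => {
  const idx = arr.indexOf(item);
  if (idx <= 0) {
    return arr;
  }
  return [arr[idx], ...arr.slice(0, idx), ...arr.slice(idx + 1)];
};
```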

@tssweeney tssweeney merged commit 2a1c72c into master Jul 18, 2024
25 checks passed
@tssweeney tssweeney deleted the tim/support_custom_scorers branch July 18, 2024 06:21
@github-actions github-actions bot locked and limited conversation to collaborators Jul 18, 2024
2 participants