65 changes: 41 additions & 24 deletions models/tables/evaluate-models.mdx
@@ -1,25 +1,30 @@
---
title: Evaluate models with W&B Weave and W&B Tables
description: Learn how to evaluate machine learning models using W&B Weave and Tables.
keywords: [scorers, judges, predictions table, error analysis, model registry]
---

This page shows you two complementary ways to evaluate models tracked in W&B: use W&B Weave for LLM and GenAI evaluations, and use W&B Tables for prediction analysis across runs and epochs.

## Evaluate models with Weave

[W&B Weave](/weave) is a purpose-built toolkit for evaluating LLMs and GenAI applications. It provides comprehensive evaluation capabilities including scorers, judges, and detailed tracing to help you understand and improve model performance. Weave integrates with W&B Models, allowing you to evaluate models stored in your Model Registry.
[W&B Weave](/weave) is a purpose-built toolkit for evaluating LLMs and GenAI applications. It provides evaluation capabilities including scorers, judges, and detailed tracing to help you understand and improve model performance. Weave integrates with W&B Models so you can evaluate models stored in your Model Registry.

<Frame>
<img src="/images/weave/evals.png" alt="Weave evaluation dashboard showing model performance metrics and traces" />
</Frame>

### Key features for model evaluation

* **Scorers and judges**: Pre-built and custom evaluation metrics for accuracy, relevance, coherence, and more
* **Evaluation datasets**: Structured test sets with ground truth for systematic evaluation
* **Model versioning**: Track and compare different versions of your models
* **Detailed tracing**: Debug model behavior with complete input/output traces
* **Cost tracking**: Monitor API costs and token usage across evaluations
Weave provides the following capabilities for model evaluation:

* **Scorers and judges**: Pre-built and custom evaluation metrics for accuracy, relevance, coherence, and more.
* **Evaluation datasets**: Structured test sets with ground truth for systematic evaluation.
* **Model versioning**: Track and compare different versions of your models.
* **Detailed tracing**: Debug model behavior with complete input/output traces.
* **Cost tracking**: Monitor API costs and token usage across evaluations.

### Getting started: Evaluate a model from W&B Registry
### Evaluate a model from W&B Registry

Download a model from W&B Models Registry and evaluate it using Weave:

@@ -69,12 +74,14 @@ results = await evaluation.evaluate(model)

### Integrate Weave evaluations with W&B Models

To connect Weave evaluation results with the models and runs you track in W&B, use the integration workflow described next.

The [Models and Weave Integration Demo](/weave/cookbooks/Models_and_Weave_Integration_Demo) walks through the complete workflow:

1. **Load models from Registry**: Download fine-tuned models stored in W&B Models Registry
2. **Create evaluation pipelines**: Build comprehensive evaluations with custom scorers
3. **Log results back to W&B**: Connect evaluation metrics to your model runs
4. **Version evaluated models**: Save improved models back to the Registry
1. **Load models from Registry**: Download fine-tuned models stored in W&B Models Registry.
2. **Create evaluation pipelines**: Build evaluations with custom scorers.
3. **Log results back to W&B**: Connect evaluation metrics to your model runs.
4. **Version evaluated models**: Save improved models back to the Registry.

Log evaluation results to both Weave and W&B Models:

@@ -93,17 +100,19 @@ wandb.run.config.update({
### Advanced Weave features

#### Custom scorers and judges
Create sophisticated evaluation metrics tailored to your use case:

Create evaluation metrics tailored to your use case:

```python
@weave.op()
def llm_judge_scorer(expected: str, output: str, judge_model) -> dict:
async def llm_judge_scorer(expected: str, output: str, judge_model) -> dict:
    prompt = f"Is this answer correct? Expected: {expected}, Got: {output}"
    judgment = await judge_model.predict(prompt)
    return {"judge_score": judgment}
```
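
As a self-contained illustration of the async judge pattern, the following sketch runs the same scorer logic without Weave. `StubJudge` and its `predict` method are hypothetical stand-ins for a real judge model, and the `@weave.op()` decorator is omitted so the snippet runs with only the standard library.

```python
import asyncio


class StubJudge:
    """Hypothetical judge model: a stand-in for an LLM client (assumption)."""

    async def predict(self, prompt: str) -> str:
        # A real judge would call an LLM. This stub compares the two values
        # embedded in the prompt so the example stays deterministic.
        expected, got = prompt.split("Expected: ")[1].split(", Got: ")
        return "yes" if expected.strip() == got.strip() else "no"


async def llm_judge_scorer(expected: str, output: str, judge_model) -> dict:
    # Same body as the scorer in this section, minus the @weave.op() decorator.
    prompt = f"Is this answer correct? Expected: {expected}, Got: {output}"
    judgment = await judge_model.predict(prompt)
    return {"judge_score": judgment}


def run_scorer(expected: str, output: str) -> dict:
    # asyncio.run drives the coroutine from synchronous code.
    return asyncio.run(llm_judge_scorer(expected, output, StubJudge()))


print(run_scorer("Paris", "Paris"))
print(run_scorer("Paris", "Lyon"))
```

Because the scorer awaits the judge, it must be declared `async def`. A plain `def` containing `await` is a syntax error in Python.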

#### Batch evaluations

Evaluate multiple model versions or configurations:

```python
@@ -119,18 +128,21 @@ for model in models:

### Next steps

For more information, see the following:

* [Complete Weave evaluation tutorial](/weave/tutorial-eval/)
* [Models and Weave integration example](/weave/cookbooks/Models_and_Weave_Integration_Demo)



## Evaluate models with tables
## Evaluate models with Tables

Use W&B Tables to:
* **Compare model predictions**: View side-by-side comparisons of how different models perform on the same test set
* **Track prediction changes**: Monitor how predictions evolve across training epochs or model versions
* **Analyze errors**: Filter and query to find commonly misclassified examples and error patterns
* **Visualize rich media**: Display images, audio, text, and other media types alongside predictions and metrics
W&B Tables let you log structured predictions and inspect them interactively in the UI. Use W&B Tables to:

* **Compare model predictions**: View side-by-side comparisons of how different models perform on the same test set.
* **Track prediction changes**: Monitor how predictions evolve across training epochs or model versions.
* **Analyze errors**: Filter and query to find commonly misclassified examples and error patterns.
* **Visualize rich media**: Display images, audio, text, and other media types alongside predictions and metrics.

<Frame>
![Example of predictions table showing model outputs alongside ground truth labels](/images/data_vis/tables_sample_predictions.png)
@@ -170,6 +182,7 @@ run.log({"evaluation_results": eval_table})
### Advanced table workflows

#### Compare multiple models

Log evaluation tables from different models to the same key for direct comparison:

```python
@@ -189,6 +202,7 @@ with wandb.init(project="model-comparison", name="model_b") as run:
</Frame>

#### Track predictions over time

Log tables at different training epochs to visualize improvement:

```python
@@ -206,18 +220,21 @@ for epoch in range(num_epochs):

### Interactive analysis in the W&B UI

Once logged, you can:
1. **Filter results**: Click on column headers to filter by prediction accuracy, confidence thresholds, or specific classes
2. **Compare tables**: Select multiple table versions to see side-by-side comparisons
3. **Query data**: Use the query bar to find specific patterns (for example, `"correct" = false AND "confidence" > 0.8`)
4. **Group and aggregate**: Group by predicted class to see per-class accuracy metrics
After you log your tables, the W&B UI provides several ways to explore the results. You can:

* **Filter results**: Click column headers to filter by prediction accuracy, confidence thresholds, or specific classes.
* **Compare tables**: Select multiple table versions to see side-by-side comparisons.
* **Query data**: Use the query bar to find specific patterns (for example, `"correct" = false AND "confidence" > 0.8`).
* **Group and aggregate**: Group by predicted class to see per-class accuracy metrics.
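
To make the query and grouping steps concrete, the sketch below applies the same logic to a handful of rows in plain Python. The rows and the column names `predicted`, `correct`, and `confidence` are made up for illustration. In W&B you would run the equivalent filter in the table's query bar rather than in code.

```python
from collections import defaultdict

# Hypothetical evaluation rows (illustrative data, not real results).
rows = [
    {"predicted": "cat", "correct": True, "confidence": 0.95},
    {"predicted": "cat", "correct": False, "confidence": 0.91},
    {"predicted": "dog", "correct": False, "confidence": 0.55},
    {"predicted": "dog", "correct": True, "confidence": 0.88},
]


def confident_errors(rows, threshold=0.8):
    # Mirrors the filter: "correct" = false AND "confidence" > 0.8
    return [r for r in rows if not r["correct"] and r["confidence"] > threshold]


def accuracy_by_class(rows):
    # Group by predicted class, then compute the fraction of correct rows.
    totals, hits = defaultdict(int), defaultdict(int)
    for r in rows:
        totals[r["predicted"]] += 1
        hits[r["predicted"]] += int(r["correct"])
    return {label: hits[label] / totals[label] for label in totals}


print(confident_errors(rows))
print(accuracy_by_class(rows))
```

High-confidence errors are often the most informative rows to inspect first, since they mark cases where the model is wrong and sure of itself.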

<Frame>
![Interactive filtering and querying of evaluation results in W&B Tables](/images/data_vis/wandb_demo_filter_on_a_table.png)
</Frame>

### Example: Error analysis with enriched tables

The following example creates a mutable table, logs initial predictions, then adds confidence and error type columns for deeper analysis:

```python
# Create a mutable table to add analysis columns
eval_table = wandb.Table(