Merged
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
# Claude Code settings
.claude/
6 changes: 3 additions & 3 deletions datasets/quick-start.mdx
Expand Up @@ -2,6 +2,9 @@
title: "Quick Start"
---

Datasets are simple data tables for managing the data you use in experiments and evaluations of your AI applications.
They are available through the SDK and let you create versioned snapshots for reproducible testing.

<Frame>
<img
className="block dark:hidden"
Expand All @@ -10,9 +13,6 @@ title: "Quick Start"
<img className="hidden dark:block" src="/img/dataset/dataset-list-dark.png" />
</Frame>

Datasets are simple data tables that you can use to manage your data for experiments and evaluation of your AI applications.
Datasets are available in the SDK, and they enable you to create versioned snapshots for reproducible testing.

<Steps>
<Step title="Create a new dataset">

Expand Down
47 changes: 47 additions & 0 deletions evaluators/custom-evaluator.mdx
@@ -0,0 +1,47 @@
---
title: "Custom Evaluators"
description: "Define an evaluator tailored to your specific needs"
---

Create your own evaluator to match your specific needs. You can start right away with custom criteria for full flexibility, or use one of our recommended formats as a starting point.


<Frame>
<img
className="block dark:hidden"
src="/img/evaluator/eval-custom-light.png"
/>
<img className="hidden dark:block" src="/img/evaluator/eval-custom-dark.png" />
</Frame>

## Do It Yourself

This option lets you write the evaluator prompt from scratch by adding the desired messages (System, Assistant, User, or Developer) and configuring the model along with its settings.

<Frame>
<img
className="block dark:hidden"
src="/img/evaluator/eval-do-it-yourself-light.png"
/>
<img className="hidden dark:block" src="/img/evaluator/eval-do-it-yourself-dark.png" />
</Frame>

## Generate Evaluator

Traceloop can generate the evaluator prompt automatically when you click the **Generate Evaluator** button.
To enable the button, map the column you want to evaluate (such as an LLM response) and add any additional data columns required for prompt creation.
Then describe the evaluator’s purpose and reference the relevant data columns in the description.

The system generates a prompt template that you can edit and customize as needed.


## Test Evaluator

Before creating an evaluator, you can test it on existing Playground data.
This allows you to refine and correct the evaluator prompt before saving the final version.

## Execute Evaluator

Evaluators can be executed in [playground columns](../playgrounds/columns/column-management) and in [experiments through the SDK](../experiments/running-from-code).


82 changes: 82 additions & 0 deletions evaluators/evaluator-library.mdx
@@ -0,0 +1,82 @@
---
title: "Evaluator Library"
description: "Select from pre-built quality checks or create custom evaluators to systematically assess AI outputs"
---

The Evaluator Library provides a comprehensive collection of pre-built quality checks designed to systematically assess AI outputs. You can choose from existing evaluators or create custom ones tailored to your specific needs.

<Frame>
<img
className="block dark:hidden"
src="/img/evaluator/eval-library-light.png"
/>
<img className="hidden dark:block" src="/img/evaluator/eval-library-dark.png" />
</Frame>

## Made by Traceloop

Traceloop provides several pre-configured evaluators for common assessment tasks:

### Content Analysis Evaluators

**Character Count**
- Analyze response length and verbosity
- Helps ensure responses meet length requirements

**Character Count Ratio**
- Measure the ratio of output characters to input characters
- Useful for assessing response proportionality

**Word Count**
- Ensure appropriate response detail level
- Track output length consistency

**Word Count Ratio**
- Measure the ratio of output words to input words
- Compare input/output verbosity
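
The count-based checks above boil down to simple arithmetic. A minimal sketch of that arithmetic (the function names here are illustrative, not the actual Traceloop implementation):

```python
def char_count_ratio(input_text: str, output_text: str) -> float:
    # Ratio of output characters to input characters; guard against empty input.
    return len(output_text) / max(len(input_text), 1)

def word_count_ratio(input_text: str, output_text: str) -> float:
    # Ratio of output words to input words, using whitespace tokenization.
    return len(output_text.split()) / max(len(input_text.split()), 1)
```

A ratio near 1.0 means the response is roughly as verbose as the input; values far above or below that can flag over- or under-expansion.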

### Quality Assessment Evaluators

**Answer Relevancy**
- Verify responses address the query
- Ensure AI outputs stay on topic

**Faithfulness**
- Detect hallucinations and verify facts
- Maintain accuracy and truthfulness

### Safety & Security Evaluators

**PII Detection**
- Identify personal information in responses
- Protect user privacy and data security

**Profanity Detection**
- Monitor for inappropriate language
- Maintain content quality standards

**Secrets Detection**
- Monitor for sensitive information leakage
- Prevent accidental exposure of credentials

## Custom Evaluators

In addition to the pre-built evaluators, you can create custom evaluators with:

### Inputs
- **string**: Text-based input parameters
- Support for multiple input types

### Outputs
- **results**: String-based evaluation results
- **pass**: Boolean indicator for pass/fail status
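
To make this input/output contract concrete, here is one way a custom evaluator's result could be modeled. The class and field names are illustrative assumptions, not the Traceloop SDK's actual types (`pass` is a reserved word in Python, so the boolean is named `passed` here):

```python
from dataclasses import dataclass

@dataclass
class EvaluatorResult:
    results: str   # string-based evaluation result (explanation of the judgment)
    passed: bool   # boolean pass/fail status

def max_length_evaluator(text: str, max_chars: int = 280) -> EvaluatorResult:
    # Example custom check: pass when the text fits within a character budget.
    ok = len(text) <= max_chars
    return EvaluatorResult(
        results=f"{len(text)} characters (limit {max_chars})",
        passed=ok,
    )
```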

## Usage

1. Browse the available evaluators in the library
2. Select evaluators that match your assessment needs
3. Configure input parameters as required
4. Use the "Use evaluator" button to integrate into your workflow
5. Monitor outputs and pass/fail status for systematic quality assessment

The Evaluator Library streamlines the process of implementing comprehensive AI output assessment, ensuring consistent quality and safety standards across your applications.
41 changes: 41 additions & 0 deletions evaluators/intro.mdx
@@ -0,0 +1,41 @@
---
title: "Introduction"
description: "Evaluating workflows and LLM outputs"
---

The evaluation library is a core feature of Traceloop, providing comprehensive tools to assess LLM outputs, data quality, and performance across various dimensions. Whether you need automated scoring or human judgment, the evaluation system has you covered.

## Why Do We Need Evaluators?

LLM agents are more complex than single-turn completions.
They operate across multiple steps, use tools, and depend on context and external systems like memory or APIs. This complexity introduces new failure modes: agents may hallucinate tools, get stuck in loops, or produce final answers that hide earlier mistakes.

Evaluators make these issues visible by checking correctness, relevance, task completion, tool usage, memory retention, safety, and style. They ensure outputs remain consistent even when dependencies shift and provide a structured way to measure reliability. Evaluation is continuous, extending into production through automated tests, drift detection, quality gates, and online monitoring.
In short, evaluators turn outputs into trustworthy systems by providing measurable and repeatable checks that give teams confidence to deploy at scale.

## Evaluator Types

The system supports:
- **Custom evaluators** - Create your own evaluation logic tailored to specific needs
- **Built-in evaluators** - Pre-configured evaluators by Traceloop for common assessment tasks

In the Evaluator Library, select the evaluator you want to define.
You can either create a custom evaluator by clicking **New Evaluator** or choose one of the prebuilt **Made by Traceloop** evaluators.

<Frame>
<img
className="block dark:hidden"
src="/img/evaluator/eval-library-light.png"
/>
<img className="hidden dark:block" src="/img/evaluator/eval-library-dark.png" />
</Frame>

Clicking an existing evaluator shows its input and output schema. This information is valuable when executing the evaluator [through the SDK](../experiments/running-from-code).

## Where to Use Evaluators

Evaluators can be used in three main contexts within Traceloop:

- **[Playgrounds](../playgrounds/quick-start)** - Test and iterate on your evaluators interactively, compare different configurations, and validate evaluation logic before deployment
- **[Experiments](../experiments/introduction)** - Run systematic evaluations across datasets programmatically using the SDK, track performance metrics over time, and easily compare experiment results
- **[Monitors](../monitoring/introduction)** - Continuously evaluate your LLM applications in production with real-time monitoring and alerting on quality degradation
87 changes: 87 additions & 0 deletions evaluators/made-by-traceloop.mdx
@@ -0,0 +1,87 @@
---
title: "Made by Traceloop"
description: "Pre-configured evaluators by Traceloop for common assessment tasks"
---

The Evaluator Library provides a comprehensive collection of pre-built quality checks designed to systematically assess AI outputs.

Each evaluator comes with a predefined input and output schema. When using an evaluator, you’ll need to map your data to its input schema.
<Frame>
<img
className="block dark:hidden"
src="/img/evaluator/eval-made-by-traceloop-light.png"
/>
<img className="hidden dark:block" src="/img/evaluator/eval-made-by-traceloop-dark.png" />
</Frame>

## Evaluator Types

<CardGroup cols={3}>
<Card title="Character Count" icon="text">
Analyze response length and verbosity to ensure outputs meet specific length requirements.
</Card>

<Card title="Character Count Ratio" icon="hashtag">
Measure the ratio of characters to the input to assess response proportionality and expansion.
</Card>

<Card title="Word Count" icon="align-left">
Ensure appropriate response detail level by tracking the total number of words in outputs.
</Card>

<Card title="Word Count Ratio" icon="hashtag">
Measure the ratio of words to the input to compare input/output verbosity and expansion patterns.
</Card>

<Card title="Answer Relevancy" icon="bullseye">
Verify responses address the query to ensure AI outputs stay on topic and remain relevant.
</Card>

<Card title="Faithfulness" icon="circle-check">
Detect hallucinations and verify facts to maintain accuracy and truthfulness in AI responses.
</Card>

<Card title="PII Detection" icon="shield">
Identify personal information exposure to protect user privacy and ensure data security compliance.
</Card>

<Card title="Profanity Detection" icon="triangle-exclamation">
Flag inappropriate language use to maintain content quality standards and professional communication.
</Card>

<Card title="Secrets Detection" icon="lock">
Monitor for credential and key leaks to prevent accidental exposure of sensitive information.
</Card>

<Card title="SQL Validation" icon="database">
Validate SQL queries to ensure proper syntax and structure in database-related AI outputs.
</Card>

<Card title="JSON Validation" icon="code">
Validate JSON responses to ensure proper formatting and structure in API-related outputs.
</Card>

<Card title="Regex Validation" icon="asterisk">
Validate regex patterns to ensure correct regular expression syntax and functionality.
</Card>

<Card title="Placeholder Regex" icon="asterisk">
Validate placeholder regex patterns to ensure proper template and variable replacement structures.
</Card>

<Card title="Semantic Similarity" icon="hashtag">
Validate semantic similarity between expected and actual responses to measure content alignment.
</Card>

<Card title="Agent Goal Accuracy" icon="bullseye">
Validate agent goal accuracy to ensure AI systems achieve their intended objectives effectively.
</Card>

<Card title="Topic Adherence" icon="hashtag">
Validate topic adherence to ensure responses stay focused on the specified subject matter.
</Card>

<Card title="Measure Perplexity" icon="hashtag">
Measure text perplexity from logprobs to assess the predictability and coherence of generated text.
</Card>
</CardGroup>
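
As one example of the math behind these checks, perplexity can be computed directly from per-token log-probabilities as the exponential of the negative mean logprob. A minimal sketch, assuming natural-log probabilities as returned by most LLM APIs (not the actual Traceloop implementation):

```python
import math

def perplexity_from_logprobs(logprobs: list[float]) -> float:
    # Perplexity = exp(-mean(logprob)); lower values mean more predictable text.
    if not logprobs:
        raise ValueError("need at least one token logprob")
    return math.exp(-sum(logprobs) / len(logprobs))
```

For a sequence where every token has probability 1/4 (logprob = -ln 4), the perplexity is exactly 4.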
Binary file added img/evaluator/eval-custom-dark.png
Binary file added img/evaluator/eval-custom-light.png
Binary file added img/evaluator/eval-do-it-yourself-dark.png
Binary file added img/evaluator/eval-do-it-yourself-light.png
Binary file added img/evaluator/eval-library-dark.png
Binary file added img/evaluator/eval-library-light.png
Binary file added img/evaluator/eval-made-by-traceloop-dark.png
Binary file added img/evaluator/eval-made-by-traceloop-light.png
Binary file added img/playground/play-action-dark.png
Binary file added img/playground/play-action-light.png
Binary file added img/playground/play-column-list-dark.png
Binary file added img/playground/play-column-list-light.png
Binary file added img/playground/play-column-options-dark.png
Binary file added img/playground/play-column-options-light.png
Binary file added img/playground/play-column-settings-dark.png
Binary file added img/playground/play-column-settings-light.png
Binary file added img/playground/play-full-table-dark.png
Binary file added img/playground/play-full-table-light.png
Binary file added img/playground/play-json-dark.png
Binary file added img/playground/play-json-light.png
Binary file added img/playground/play-list-dark.png
Binary file added img/playground/play-list-light.png
Binary file added img/playground/play-multi-select-dark.png
Binary file added img/playground/play-multi-select-light.png
Binary file added img/playground/play-number-col-dark.png
Binary file added img/playground/play-number-col-light.png
Binary file added img/playground/play-number-col-summary-dark.png
Binary file added img/playground/play-number-col-summary-light.png
Binary file added img/playground/play-prompt-column-dark.png
Binary file added img/playground/play-prompt-column-light.png
Binary file added img/playground/play-prompt-write-dark.png
Binary file added img/playground/play-prompt-write-light.png
Binary file added img/playground/play-single-select-dark.png
Binary file added img/playground/play-single-select-light.png
18 changes: 18 additions & 0 deletions mint.json
Expand Up @@ -151,10 +151,28 @@
"group": "Prompt Management",
"pages": ["prompts/quick-start", "prompts/registry", "prompts/sdk-usage"]
},
{
"group": "Playgrounds",
"pages": [
"playgrounds/quick-start",
{
"group": "Columns",
"pages": [
"playgrounds/columns/data-columns",
"playgrounds/columns/prompt",
"playgrounds/columns/column-management"
]
}
]
},
{
"group": "Datasets",
"pages": ["datasets/quick-start", "datasets/sdk-usage"]
},
{
"group": "Evaluators",
"pages": ["evaluators/intro", "evaluators/custom-evaluator", "evaluators/made-by-traceloop"]
},
{
"group": "Experiments",
"pages": ["experiments/introduction", "experiments/result-overview", "experiments/running-from-code"]
Expand Down
54 changes: 54 additions & 0 deletions playgrounds/columns/column-management.mdx
@@ -0,0 +1,54 @@
---
title: "Column Management"
description: "Learn the general functionality shared by all columns"
---

Columns in the Playground can be reordered, edited, or deleted at any time to adapt your workspace as your analysis evolves. Understanding how to manage columns effectively helps you maintain organized and efficient playgrounds.

## Columns Settings
Column Settings lets you hide specific columns from the Playground and reorder them as needed. To open the settings, click the Playground Action button and select **Column Settings**.
<img
className="block dark:hidden"
src="/img/playground/play-action-light.png"
/>
<img className="hidden dark:block" src="/img/playground/play-action-dark.png" />

To change the column order, drag the six-dot handle on the right side of each column to move it into the desired position.

To hide a column, toggle its switch in the menu.

<Info>
Columns can also be reordered by dragging them to your desired position in the playground.
</Info>

<img
className="block dark:hidden"
src="/img/playground/play-column-settings-light.png"
style={{maxWidth: '600px'}}
/>
<img
className="hidden dark:block"
src="/img/playground/play-column-settings-dark.png"
style={{maxWidth: '600px'}}
/>


## Columns Actions

Each column has a menu that lets you manage and customize it. From this menu, you can:
- Rename the column directly by editing its title
- Edit the column configuration
- Duplicate the column to create a copy with the same settings
- Delete the column if it’s no longer needed


<img
className="block dark:hidden"
src="/img/playground/play-column-options-light.png"
style={{maxWidth: '350px'}}
/>
<img
className="hidden dark:block"
src="/img/playground/play-column-options-dark.png"
style={{maxWidth: '350px'}}
/>