docs(evaluators + playgrounds): add #105
Merged
20 commits
4ba030f init (nina-kollman)
df97a46 quick start (nina-kollman)
4030d60 order (nina-kollman)
8c0e4ee number (nina-kollman)
c947827 wip (nina-kollman)
def0e95 add structure (nina-kollman)
063ce92 prompt done (nina-kollman)
ee21732 delete (nina-kollman)
9119ad3 comm (nina-kollman)
c841f7e eval (nina-kollman)
ad1a99a intro (nina-kollman)
0a0ef42 add made by (nina-kollman)
5b40173 made by traceloop (nina-kollman)
9504601 comm (nina-kollman)
2940087 play comm (nina-kollman)
1f11474 why (nina-kollman)
9e5d526 add where (nina-kollman)
2762753 fix (nina-kollman)
63a51bd structire (nina-kollman)
0fe9af1 fix (nina-kollman)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,2 @@ | ||
| # Claude Code settings | ||
| .claude/ |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,47 @@ | ||
| --- | ||
| title: "Custom Evaluators" | ||
| description: "Define an evaluator for your specific needs" | ||
| --- | ||
|
|
||
| Create your own evaluator to match your specific needs. You can start right away with custom criteria for full flexibility, or use one of our recommended formats as a starting point. | ||
|
|
||
|
|
||
| <Frame> | ||
| <img | ||
| className="block dark:hidden" | ||
| src="/img/evaluator/eval-custom-light.png" | ||
| /> | ||
| <img className="hidden dark:block" src="/img/evaluator/eval-custom-dark.png" /> | ||
| </Frame> | ||
|
|
||
| ## Do It Yourself | ||
|
|
||
| This option lets you write the evaluator prompt from scratch by adding the desired messages (System, Assistant, User, or Developer) and configuring the model along with its settings. | ||
|
|
||
| <Frame> | ||
| <img | ||
| className="block dark:hidden" | ||
| src="/img/evaluator/eval-do-it-yourself-light.png" | ||
| /> | ||
| <img className="hidden dark:block" src="/img/evaluator/eval-do-it-yourself-dark.png" /> | ||
| </Frame> | ||
|
|
||
| ## Generate Evaluator | ||
|
|
||
| Traceloop can configure the evaluator prompt automatically when you click the **Generate Evaluator** button. | ||
| To enable the button, map the column you want to evaluate (such as an LLM response) and add any additional data columns required for prompt creation. | ||
| Describe the evaluator’s purpose and reference the relevant data columns in the description. | ||
|
|
||
| The system generates a prompt template that you can edit and customize as needed. | ||
|
|
||
|
|
||
| ## Test Evaluator | ||
|
|
||
| Before saving a new evaluator, you can test it on existing Playground data. | ||
| This allows you to refine and correct the evaluator prompt before saving the final version. | ||
|
|
||
| ## Execute Evaluator | ||
|
|
||
| Evaluators can be executed in [playground columns](../playgrounds/columns/column-management) and in [experiments through the SDK](../experiments/running-from-code). | ||
|
|
||
|
|
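
The "Do It Yourself" and "Generate Evaluator" sections above describe a prompt built from role-tagged messages that reference data columns. The following is only a rough sketch of that idea, not Traceloop's implementation: the `Message` class, the `render_prompt` helper, the `{column}` placeholder syntax, and the column names `question` and `llm_response` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str     # "system", "assistant", "user", or "developer"
    content: str  # may reference columns as {column_name} (hypothetical syntax)


def render_prompt(messages, row):
    """Fill each message's placeholders with values from one data row."""
    return [
        {"role": m.role, "content": m.content.format(**row)}
        for m in messages
    ]


# Hypothetical evaluator prompt that checks relevancy of an LLM response.
template = [
    Message("system", "You are a strict grader. Answer PASS or FAIL."),
    Message(
        "user",
        "Question: {question}\nResponse: {llm_response}\nIs the response relevant?",
    ),
]

# One row of mapped data; the keys stand in for mapped Playground columns.
row = {"question": "What is 2 + 2?", "llm_response": "4"}
rendered = render_prompt(template, row)
```

Keeping the template separate from the row data mirrors the workflow described above: the template can be edited and re-tested against existing rows before the final version is saved.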
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,82 @@ | ||
| --- | ||
| title: "Evaluator Library" | ||
| description: "Select from pre-built quality checks or create custom evaluators to systematically assess AI outputs" | ||
| --- | ||
|
|
||
| The Evaluator Library provides a comprehensive collection of pre-built quality checks designed to systematically assess AI outputs. You can choose from existing evaluators or create custom ones tailored to your specific needs. | ||
|
|
||
| <Frame> | ||
| <img | ||
| className="block dark:hidden" | ||
| src="/img/evaluator/eval-library-light.png" | ||
| /> | ||
| <img className="hidden dark:block" src="/img/evaluator/eval-library-dark.png" /> | ||
| </Frame> | ||
|
|
||
| ## Made by Traceloop | ||
|
|
||
| Traceloop provides several pre-configured evaluators for common assessment tasks: | ||
|
|
||
| ### Content Analysis Evaluators | ||
|
|
||
| **Character Count** | ||
| - Analyze response length and verbosity | ||
| - Helps ensure responses meet length requirements | ||
|
|
||
| **Character Count Ratio** | ||
| - Measure the ratio of output characters to input characters | ||
| - Useful for assessing response proportionality | ||
|
|
||
| **Word Count** | ||
| - Ensure appropriate response detail level | ||
| - Track output length consistency | ||
|
|
||
| **Word Count Ratio** | ||
| - Measure the ratio of output words to input words | ||
| - Compare input/output verbosity | ||
|
|
||
| ### Quality Assessment Evaluators | ||
|
|
||
| **Answer Relevancy** | ||
| - Verify responses address the query | ||
| - Ensure AI outputs stay on topic | ||
|
|
||
| **Faithfulness** | ||
| - Detect hallucinations and verify facts | ||
| - Maintain accuracy and truthfulness | ||
|
|
||
| ### Safety & Security Evaluators | ||
|
|
||
| **PII Detection** | ||
| - Identify personal information in responses | ||
| - Protect user privacy and data security | ||
|
|
||
| **Profanity Detection** | ||
| - Monitor for inappropriate language | ||
| - Maintain content quality standards | ||
|
|
||
| **Secrets Detection** | ||
| - Monitor for sensitive information leakage | ||
| - Prevent accidental exposure of credentials | ||
|
|
||
| ## Custom Evaluators | ||
|
|
||
| In addition to the pre-built evaluators, you can create custom evaluators with: | ||
|
|
||
| ### Inputs | ||
| - **string**: Text-based input parameters | ||
| - Support for multiple input types | ||
|
|
||
| ### Outputs | ||
| - **results**: String-based evaluation results | ||
| - **pass**: Boolean indicator for pass/fail status | ||
|
|
||
| ## Usage | ||
|
|
||
| 1. Browse the available evaluators in the library | ||
| 2. Select evaluators that match your assessment needs | ||
| 3. Configure input parameters as required | ||
| 4. Use the "Use evaluator" button to integrate into your workflow | ||
| 5. Monitor outputs and pass/fail status for systematic quality assessment | ||
|
|
||
| The Evaluator Library streamlines the process of implementing comprehensive AI output assessment, ensuring consistent quality and safety standards across your applications. |
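
To make the custom evaluator contract above concrete (string inputs, a string `results` output, and a boolean `pass` output), here is a minimal sketch built around a word count ratio check. It is an illustration under stated assumptions rather than Traceloop's actual logic: the `EvaluatorResult` class, the `word_count_ratio` function, the output-to-input direction of the ratio, and the `max_ratio` threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EvaluatorResult:
    results: str  # string-based evaluation result
    passed: bool  # pass/fail flag ("pass" is a reserved word in Python)


def word_count_ratio(input_text: str, output_text: str,
                     max_ratio: float = 3.0) -> EvaluatorResult:
    """Compare output verbosity to input verbosity by whitespace-split words."""
    input_words = len(input_text.split())
    output_words = len(output_text.split())
    if input_words == 0:
        # Assumption: an empty input cannot be scored, so the check fails.
        return EvaluatorResult(results="input has no words", passed=False)
    ratio = output_words / input_words
    return EvaluatorResult(results=f"{ratio:.2f}", passed=ratio <= max_ratio)


# 3 output words over 4 input words gives a ratio of 0.75.
result = word_count_ratio("Summarize this short text", "A brief summary")
```

Returning both a human-readable string and a boolean keeps the detail available for inspection while still allowing a simple pass/fail filter downstream.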
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,41 @@ | ||
| --- | ||
| title: "Introduction" | ||
| description: "Evaluating workflows and LLM outputs" | ||
| --- | ||
|
|
||
| The evaluation library is a core feature of Traceloop, providing comprehensive tools to assess LLM outputs, data quality, and performance across various dimensions. Whether you need automated scoring or human judgment, the evaluation system has you covered. | ||
|
|
||
| ## Why Do We Need Evaluators? | ||
|
|
||
| LLM agents are more complex than single-turn completions. | ||
| They operate across multiple steps, use tools, and depend on context and external systems like memory or APIs. This complexity introduces new failure modes: agents may hallucinate tools, get stuck in loops, or produce final answers that hide earlier mistakes. | ||
|
|
||
| Evaluators make these issues visible by checking correctness, relevance, task completion, tool usage, memory retention, safety, and style. They ensure outputs remain consistent even when dependencies shift and provide a structured way to measure reliability. Evaluation is continuous, extending into production through automated tests, drift detection, quality gates, and online monitoring. | ||
| In short, evaluators make LLM systems trustworthy by providing measurable, repeatable checks that give teams the confidence to deploy at scale. | ||
|
|
||
| ## Evaluator Types | ||
|
|
||
| The system supports: | ||
| - **Custom evaluators** - Create your own evaluation logic tailored to specific needs | ||
| - **Built-in evaluators** - Pre-configured evaluators by Traceloop for common assessment tasks | ||
|
|
||
| In the Evaluator Library, select the evaluator you want to define. | ||
| You can either create a custom evaluator by clicking **New Evaluator** or choose one of the prebuilt **Made by Traceloop** evaluators. | ||
|
|
||
| <Frame> | ||
| <img | ||
| className="block dark:hidden" | ||
| src="/img/evaluator/eval-library-light.png" | ||
| /> | ||
| <img className="hidden dark:block" src="/img/evaluator/eval-library-dark.png" /> | ||
| </Frame> | ||
|
|
||
| Clicking an existing evaluator shows its input and output schema. You need this schema to execute the evaluator [through the SDK](../experiments/running-from-code). | ||
|
|
||
| ## Where to Use Evaluators | ||
|
|
||
| Evaluators can be used in three main contexts within Traceloop: | ||
|
|
||
| - **[Playgrounds](../playgrounds/quick-start)** - Test and iterate on your evaluators interactively, compare different configurations, and validate evaluation logic before deployment | ||
| - **[Experiments](../experiments/introduction)** - Run systematic evaluations across datasets programmatically using the SDK, track performance metrics over time, and easily compare experiment results | ||
| - **[Monitors](../monitoring/introduction)** - Continuously evaluate your LLM applications in production with real-time monitoring and alerting on quality degradation | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,87 @@ | ||
| --- | ||
| title: "Made by Traceloop" | ||
| description: "Pre-configured evaluators by Traceloop for common assessment tasks" | ||
| --- | ||
|
|
||
| The Evaluator Library provides a comprehensive collection of pre-built quality checks designed to systematically assess AI outputs. | ||
|
|
||
| Each evaluator comes with a predefined input and output schema. When using an evaluator, you’ll need to map your data to its input schema. | ||
| <Frame> | ||
| <img | ||
| className="block dark:hidden" | ||
| src="/img/evaluator/eval-made-by-traceloop-light.png" | ||
| /> | ||
| <img className="hidden dark:block" src="/img/evaluator/eval-made-by-traceloop-dark.png" /> | ||
| </Frame> | ||
|
|
||
| ## Evaluator Types | ||
|
|
||
| <CardGroup cols={3}> | ||
| <Card title="Character Count" icon="text"> | ||
| Analyze response length and verbosity to ensure outputs meet specific length requirements. | ||
| </Card> | ||
|
|
||
| <Card title="Character Count Ratio" icon="hashtag"> | ||
| Measure the ratio of output characters to input characters to assess response proportionality and expansion. | ||
| </Card> | ||
|
|
||
| <Card title="Word Count" icon="align-left"> | ||
| Ensure appropriate response detail level by tracking the total number of words in outputs. | ||
| </Card> | ||
|
|
||
| <Card title="Word Count Ratio" icon="hashtag"> | ||
| Measure the ratio of output words to input words to compare input/output verbosity and expansion patterns. | ||
| </Card> | ||
|
|
||
| <Card title="Answer Relevancy" icon="bullseye"> | ||
| Verify responses address the query to ensure AI outputs stay on topic and remain relevant. | ||
| </Card> | ||
|
|
||
| <Card title="Faithfulness" icon="circle-check"> | ||
| Detect hallucinations and verify facts to maintain accuracy and truthfulness in AI responses. | ||
| </Card> | ||
|
|
||
| <Card title="PII Detection" icon="shield"> | ||
| Identify personal information exposure to protect user privacy and ensure data security compliance. | ||
| </Card> | ||
|
|
||
| <Card title="Profanity Detection" icon="triangle-exclamation"> | ||
| Flag inappropriate language use to maintain content quality standards and professional communication. | ||
| </Card> | ||
|
|
||
| <Card title="Secrets Detection" icon="lock"> | ||
| Monitor for credential and key leaks to prevent accidental exposure of sensitive information. | ||
| </Card> | ||
|
|
||
| <Card title="SQL Validation" icon="database"> | ||
| Validate SQL queries to ensure proper syntax and structure in database-related AI outputs. | ||
| </Card> | ||
|
|
||
| <Card title="JSON Validation" icon="code"> | ||
| Validate JSON responses to ensure proper formatting and structure in API-related outputs. | ||
| </Card> | ||
|
|
||
| <Card title="Regex Validation" icon="asterisk"> | ||
| Validate regex patterns to ensure correct regular expression syntax and functionality. | ||
| </Card> | ||
|
|
||
| <Card title="Placeholder Regex" icon="asterisk"> | ||
| Validate placeholder regex patterns to ensure proper template and variable replacement structures. | ||
| </Card> | ||
|
|
||
| <Card title="Semantic Similarity" icon="hashtag"> | ||
| Validate semantic similarity between expected and actual responses to measure content alignment. | ||
| </Card> | ||
|
|
||
| <Card title="Agent Goal Accuracy" icon="bullseye"> | ||
| Validate agent goal accuracy to ensure AI systems achieve their intended objectives effectively. | ||
| </Card> | ||
|
|
||
| <Card title="Topic Adherence" icon="hashtag"> | ||
| Validate topic adherence to ensure responses stay focused on the specified subject matter. | ||
| </Card> | ||
|
|
||
| <Card title="Measure Perplexity" icon="hashtag"> | ||
| Measure text perplexity from logprobs to assess the predictability and coherence of generated text. | ||
| </Card> | ||
| </CardGroup> |
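
The "Measure Perplexity" card above mentions computing perplexity from logprobs. As a minimal sketch, assuming the logprobs are natural-log token probabilities, perplexity is the exponential of the negative mean log probability. The `perplexity` function name is hypothetical, and this is the standard textbook formula rather than a confirmed description of how the Traceloop evaluator computes it.

```python
import math


def perplexity(logprobs):
    """Return exp(-mean(logprobs)) for natural-log token probabilities."""
    if not logprobs:
        raise ValueError("logprobs must not be empty")
    return math.exp(-sum(logprobs) / len(logprobs))


# Four tokens, each predicted with probability 0.25: the model is as
# uncertain as a uniform choice among 4 options, so perplexity is about 4.
uniform = [math.log(0.25)] * 4
score = perplexity(uniform)
```

Lower values mean the text was more predictable to the model; a perplexity of 1 corresponds to every token being predicted with probability 1.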
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,54 @@ | ||
| --- | ||
| title: "Column Management" | ||
| description: "Learn the general functionality available for all columns" | ||
| --- | ||
|
|
||
| Columns in the Playground can be reordered, edited, or deleted at any time to adapt your workspace as your analysis evolves. Understanding how to manage columns effectively helps you maintain organized and efficient playgrounds. | ||
|
|
||
| ## Column Settings | ||
| Column Settings lets you hide specific columns from the Playground and reorder them as needed. To open the settings, click the Playground Action button and select Column Settings. | ||
| <img | ||
| className="block dark:hidden" | ||
| src="/img/playground/play-action-light.png" | ||
| /> | ||
| <img className="hidden dark:block" src="/img/playground/play-action-dark.png" /> | ||
|
|
||
| To change the column order, drag a column into the desired position using the six-dot handle on its right side. | ||
|
|
||
| To hide a column, toggle its switch in the menu. | ||
|
|
||
| <Info> | ||
| Columns can also be reordered by dragging them to the desired position directly in the playground. | ||
| </Info> | ||
|
|
||
| <img | ||
| className="block dark:hidden" | ||
| src="/img/playground/play-column-settings-light.png" | ||
| style={{maxWidth: '600px'}} | ||
| /> | ||
| <img | ||
| className="hidden dark:block" | ||
| src="/img/playground/play-column-settings-dark.png" | ||
| style={{maxWidth: '600px'}} | ||
| /> | ||
|
|
||
|
|
||
| ## Column Actions | ||
|
|
||
| Each column has a menu that lets you manage and customize it. From this menu, you can: | ||
| - Rename the column directly by editing its title | ||
| - Edit the column configuration | ||
| - Duplicate the column to create a copy with the same settings | ||
| - Delete the column if it’s no longer needed | ||
|
|
||
|
|
||
| <img | ||
| className="block dark:hidden" | ||
| src="/img/playground/play-column-options-light.png" | ||
| style={{maxWidth: '350px'}} | ||
| /> | ||
| <img | ||
| className="hidden dark:block" | ||
| src="/img/playground/play-column-options-dark.png" | ||
| style={{maxWidth: '350px'}} | ||
| /> |