The Evaluator Library provides a comprehensive collection of pre-built quality checks.

Traceloop provides several pre-configured evaluators for common assessment tasks:

---

### Agent Evaluators

**Agent Efficiency** <sup>(beta)</sup>
- Evaluates how efficiently an agent completes a task by detecting redundant steps, unnecessary tool calls, loops, or poor reasoning
- Returns a 0-1 score
- *Implementation: Custom GPT-4o prompt*

**Agent Flow Quality** <sup>(beta)</sup>
- Checks whether the agent satisfies all user-defined behavioral or logical conditions, using strict full-condition matching
- Returns the score as the ratio of passed conditions
- *Implementation: Custom GPT-4o prompt*

**Agent Goal Accuracy**
- Determines whether the agent actually achieved the user's goal, with or without a reference (expected) answer
- Supports both reference-based and reference-free evaluation
- *Implementation: Ragas AgentGoalAccuracy metrics*

**Agent Goal Completeness**
- Extracts user intents across a conversation and evaluates how many were fulfilled end-to-end
- Automatically determines fulfillment rate
- *Implementation: DeepEval ConversationCompletenessMetric*

**Agent Tool Error Detector** <sup>(beta)</sup>
- Detects incorrect tool usage (bad params, failed API calls, unexpected behavior) in agent trajectories
- Returns pass/fail
- *Implementation: Custom GPT-4o prompt*

---

### Answer Quality Evaluators

**Answer Completeness**
- Measures how thoroughly the answer uses relevant context, using a rubric from "barely uses context" to "fully covers it"
- Normalized to 0-1 score
- *Implementation: Ragas RubricsScore metric*

**Answer Correctness**
- Evaluates factual correctness by combining semantic similarity with a model-based correctness check against the ground truth
- Returns combined 0-1 score
- *Implementation: Ragas AnswerCorrectness + AnswerSimilarity*

**Answer Relevancy**
- Determines whether the answer meaningfully responds to the question
- Outputs pass/fail
- *Implementation: Ragas answer_relevancy metric*

**Faithfulness**
- Ensures all claims in the answer are grounded in the provided context and not hallucinated
- Binary pass/fail
- *Implementation: Ragas Faithfulness metric*
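
Outside of Traceloop, these Ragas metrics can also be run directly. A minimal sketch, assuming the classic `ragas` 0.1-style `evaluate` API and the `datasets` package (the sample question, answer, and contexts are illustrative, and the actual evaluator wiring inside Traceloop may differ):

```python
# Minimal sketch: scoring answer relevancy and faithfulness with Ragas directly.
# Assumes an OpenAI API key is available in the environment.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

data = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["Paris has been the capital of France since 508 AD."]],
})

# Each metric is scored per row; pass/fail thresholds are applied on top of the raw scores.
result = evaluate(data, metrics=[answer_relevancy, faithfulness])
print(result)
```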

**Semantic Similarity**
- Computes embedding-based similarity between generated text and a reference answer
- Returns 0-1 score
- *Implementation: Ragas SemanticSimilarity metric*
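
Under the hood this reduces to cosine similarity between embedding vectors. A sketch of the idea (not the Ragas implementation itself), assuming the embeddings have already been computed:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Embedding-based similarity in [-1, 1]; typically rescaled to 0-1."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings for a generated answer and a reference answer.
generated = np.array([0.12, 0.87, 0.33])
reference = np.array([0.10, 0.90, 0.30])
score = (cosine_similarity(generated, reference) + 1) / 2  # map to a 0-1 score
```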

---

### Conversation Evaluators

**Conversation Quality**
- Overall conversation score combining relevancy (40%), completeness (40%), and memory retention (20%) over multiple turns
- Returns weighted combined score
- *Implementation: DeepEval TurnRelevancy + ConversationCompleteness + KnowledgeRetention*
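
The weighting itself is simple arithmetic. A sketch of how the combined score could be assembled from the three DeepEval sub-scores (the function and parameter names are illustrative, not DeepEval's API):

```python
def conversation_quality(relevancy: float, completeness: float, retention: float) -> float:
    """Weighted blend of three 0-1 sub-scores: 40% relevancy, 40% completeness, 20% retention."""
    return 0.4 * relevancy + 0.4 * completeness + 0.2 * retention

# Example: strong relevancy and completeness, weaker memory retention.
score = conversation_quality(0.9, 0.85, 0.6)  # 0.82
```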

**Intent Change**
- Detects if the conversation stayed on the original intent or drifted into unrelated topics
- Higher score = better adherence to original topic
- *Implementation: Ragas TopicAdherenceScore (precision mode)*

**Topic Adherence**
- Measures how well conversation messages stay aligned with specified allowed topics
- Returns 0-1 score
- *Implementation: Ragas TopicAdherenceScore*

**Context Relevance**
- Rates whether retrieved context actually contains the information needed to answer the question
- Score = relevant statements / total statements
- *Implementation: DeepEval ContextualRelevancyMetric*

**Instruction Adherence**
- Evaluates how closely the model followed system-level or user instructions
- Returns 0-1 adherence score
- *Implementation: DeepEval PromptAlignmentMetric*

---

### Safety & Security Evaluators

**PII Detector**
- Detects names, addresses, emails, and other personal identifiers in text, with optional redaction
- Pass/fail based on confidence threshold
- *Implementation: Microsoft Presidio Analyzer*
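
A minimal sketch of how Presidio's analyzer is typically called; the threshold and the redaction step around it are assumptions, not Traceloop's exact configuration:

```python
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()
text = "Contact Jane Doe at jane.doe@example.com"

# Each result carries an entity type (e.g. EMAIL_ADDRESS), a text span, and a confidence score.
findings = analyzer.analyze(text=text, language="en")
has_pii = any(f.score >= 0.5 for f in findings)  # 0.5 is an illustrative threshold
```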

**Secrets Detector**
- Identifies hardcoded secrets such as API keys, tokens, passwords, etc.
- Binary pass/fail with optional redaction
- *Implementation: Yelp detect-secrets*

**Profanity Detector**
- Checks whether text contains offensive or profane language
- Binary pass/fail
- *Implementation: profanity-check library*
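
A sketch of the underlying library call, assuming the `profanity-check` package, which exposes `predict` and `predict_prob`:

```python
from profanity_check import predict, predict_prob

texts = ["Have a great day!", "You are an idiot."]

labels = predict(texts)        # array of 0/1 flags per text
scores = predict_prob(texts)   # probability that each text is offensive
failed = [t for t, flag in zip(texts, labels) if flag == 1]
```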

**Prompt Injection Detector**
- Flags attempts to override system behavior or inject malicious instructions
- Binary pass/fail based on threshold
- *Implementation: AWS SageMaker endpoint running DeBERTa-v3 model*

**Toxicity Detector**
- Classifies toxic content into categories such as threat, insult, obscenity, and hate speech
- Binary pass/fail based on threshold
- *Implementation: AWS SageMaker unitary/toxic-bert model*

**Sexism Detector**
- Detects sexist language and gender-based discrimination or bias
- Binary pass/fail based on threshold
- *Implementation: AWS SageMaker unitary/toxic-bert model*

---

### Format Validators

**JSON Validator**
- Validates that output is valid JSON and optionally matches a schema
- Binary pass/fail
- *Implementation: Python json and jsonschema*
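
A sketch of the two checks, assuming the standard `json` module and the `jsonschema` package (the schema shown is illustrative):

```python
import json
from jsonschema import validate, ValidationError

schema = {"type": "object", "required": ["answer"], "properties": {"answer": {"type": "string"}}}

def validate_json(output: str, schema: dict | None = None) -> bool:
    try:
        parsed = json.loads(output)                       # must be valid JSON at all
        if schema is not None:
            validate(instance=parsed, schema=schema)      # and optionally match the schema
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

validate_json('{"answer": "42"}', schema)  # True
```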

**SQL Validator**
- Checks whether generated text is syntactically valid PostgreSQL SQL
- Binary pass/fail
- *Implementation: pglast Postgres parser*
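
A sketch of a syntax check with `pglast`, which wraps the real PostgreSQL parser; the exact exception class varies by version, so this catches broadly:

```python
import pglast

def is_valid_postgres(sql: str) -> bool:
    try:
        pglast.parse_sql(sql)   # raises a parse error on invalid syntax
        return True
    except Exception:
        return False

is_valid_postgres("SELECT id, name FROM users WHERE active = true")  # True
is_valid_postgres("SELEC id FROM users")                             # False
```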

**Regex Validator**
- Validates whether text matches (or deliberately does not match) a regex pattern
- Supports case-insensitive, multiline, and dotall flags
- *Implementation: Python re*
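
A sketch of how the flag handling might look, using only the standard `re` module (the parameter names are illustrative):

```python
import re

def regex_check(text: str, pattern: str, *, must_match: bool = True,
                ignore_case: bool = False, multiline: bool = False, dotall: bool = False) -> bool:
    flags = 0
    if ignore_case:
        flags |= re.IGNORECASE
    if multiline:
        flags |= re.MULTILINE
    if dotall:
        flags |= re.DOTALL
    matched = re.search(pattern, text, flags) is not None
    return matched if must_match else not matched

regex_check("Order #12345 confirmed", r"#\d{5}")             # True
regex_check("no emails here", r"\S+@\S+", must_match=False)  # True
```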

**Placeholder Regex**
- Similar to the Regex Validator, but dynamically injects a placeholder value into the pattern before matching
- Useful for dynamic pattern validation
- *Implementation: Python re*

---

### Text Metrics

**Word Count**
- Counts number of words in generated output
- Returns integer count
- *Implementation: Python string split*

**Word Count Ratio**
- Compares output word count to input word count
- Useful for measuring expansion/compression
- *Implementation: Python string operations*

**Char Count**
- Counts number of characters in the generated text
- Returns integer count
- *Implementation: Python len()*

**Char Count Ratio**
- Output character count divided by input character count
- Returns float ratio
- *Implementation: Python len()*
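
All four text metrics above reduce to basic string operations; a minimal sketch:

```python
def text_metrics(input_text: str, output_text: str) -> dict:
    """Basic length metrics: word/character counts plus output-to-input ratios."""
    in_words, out_words = input_text.split(), output_text.split()
    return {
        "word_count": len(out_words),
        "word_count_ratio": len(out_words) / max(len(in_words), 1),
        "char_count": len(output_text),
        "char_count_ratio": len(output_text) / max(len(input_text), 1),
    }

text_metrics("Summarize the report", "The report shows revenue grew 12% year over year.")
```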

**Perplexity**
- Computes perplexity using provided logprobs to quantify model confidence on its output
- Lower = more confident predictions
- *Implementation: Mathematical calculation exp(-avg_log_prob)*
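
The calculation is the standard exponential of the negative mean token log-probability. A sketch, assuming the per-token logprobs have already been extracted from the model response:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """exp(-average log-probability); lower values mean more confident predictions."""
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(-avg_logprob)

perplexity([-0.1, -0.3, -0.05, -0.2])  # ~1.18, a fairly confident sequence
```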

---

### Specialized Evaluators

**LLM as a Judge**
- Fully flexible LLM-based evaluator using arbitrary prompts and variables; returns JSON directly from the model
- Configurable model, temperature, etc.
- *Implementation: Custom OpenAI API call*
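
A minimal sketch of the pattern, assuming the official `openai` Python client; the prompt, criteria, and JSON shape shown here are illustrative, not Traceloop's actual template:

```python
import json
from openai import OpenAI

client = OpenAI()

def llm_judge(output: str, criteria: str, model: str = "gpt-4o", temperature: float = 0.0) -> dict:
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        response_format={"type": "json_object"},  # ask the model to return JSON directly
        messages=[
            {"role": "system", "content": "You are an evaluator. Respond with JSON: "
                                          '{"score": <0-1 float>, "reason": "<short explanation>"}'},
            {"role": "user", "content": f"Criteria: {criteria}\n\nOutput to evaluate:\n{output}"},
        ],
    )
    return json.loads(response.choices[0].message.content)

llm_judge("The capital of France is Paris.", "Answer must be factually correct and concise.")
```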

**Tone Detection**
- Classifies emotional tone (joy, anger, sadness, fear, neutral, etc.) based on text
- Returns detected tone and confidence score
- *Implementation: AWS SageMaker emotion-distilroberta model*

**Uncertainty**
- Generates a response with token-level logprobs and calculates uncertainty using max surprisal
- Returns answer + uncertainty score
- *Implementation: GPT-4o-mini with logprobs enabled*
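
A sketch of the max-surprisal idea using the `openai` client with logprobs enabled; the specific aggregation shown is an assumption:

```python
import math
from openai import OpenAI

client = OpenAI()

def answer_with_uncertainty(question: str) -> tuple[str, float]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
        logprobs=True,  # return per-token log-probabilities
    )
    choice = response.choices[0]
    # Surprisal of a token is -log p(token); the most "surprising" token drives the score.
    max_surprisal = max(-t.logprob for t in choice.logprobs.content)
    return choice.message.content, max_surprisal

answer, uncertainty = answer_with_uncertainty("What year did the French Revolution begin?")
```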

---

## Custom Evaluators

In addition to the pre-built evaluators, you can create custom evaluators tailored to your needs:
- Create custom metric evaluations
- Define your own evaluation logic and scoring

### Inputs
- **string**: Text-based input parameters
- Support for multiple input types

### Outputs
- **results**: String-based evaluation results
- **pass**: Boolean indicator for pass/fail status

---

## Usage

1. Browse the available evaluators in the library
4. Use the "Use evaluator" button to integrate into your workflow
5. Monitor outputs and pass/fail status for systematic quality assessment

The Evaluator Library streamlines the process of implementing comprehensive AI output assessment, ensuring consistent quality and safety standards across your applications.