From 909a1555be819ebb8931abaac30af3062c386b1c Mon Sep 17 00:00:00 2001 From: Oz Ben Simhon Date: Mon, 8 Dec 2025 21:45:45 +0200 Subject: [PATCH 1/2] wip --- evaluators/evaluator-library.mdx | 241 ++++++++++++++++++++++++------- 1 file changed, 187 insertions(+), 54 deletions(-) diff --git a/evaluators/evaluator-library.mdx b/evaluators/evaluator-library.mdx index 8f0efb6..8e672b8 100644 --- a/evaluators/evaluator-library.mdx +++ b/evaluators/evaluator-library.mdx @@ -17,83 +17,200 @@ The Evaluator Library provides a comprehensive collection of pre-built quality c Traceloop provides several pre-configured evaluators for common assessment tasks: -### Content Analysis Evaluators +--- -**Character Count** -- Analyze response length and verbosity -- Helps ensure responses meet length requirements +### Agent Evaluators -**Character Count Ratio** -- Measure the ratio of characters to the input -- Useful for assessing response proportionality +**Agent Efficiency** (beta) +- Evaluates how efficiently an agent completes a task by detecting redundant steps, unnecessary tool calls, loops, or poor reasoning +- Returns a 0-1 score +- *Implementation: Custom GPT-4o prompt* -**Word Count** -- Ensure appropriate response detail level -- Track output length consistency +**Agent Flow Quality** (beta) +- Checks whether the agent satisfies all user-defined behavioral or logical conditions; strict full-condition matching +- Returns score as ratio of passed conditions +- *Implementation: Custom GPT-4o prompt* -**Word Count Ratio** -- Measure the ratio of words to the input -- Compare input/output verbosity +**Agent Goal Accuracy** +- Determines if the agent actually achieved the user's goal, with or without a reference expected answer +- Supports both reference-based and reference-free evaluation +- *Implementation: Ragas AgentGoalAccuracy metrics* -### Quality Assessment Evaluators +**Agent Goal Completeness** +- Extracts user intents across a conversation and evaluates how many were fulfilled end-to-end +- Automatically determines fulfillment rate +- *Implementation: DeepEval ConversationCompletenessMetric* + +**Agent Tool Error Detector** (beta) +- Detects incorrect tool usage (bad params, failed API calls, unexpected behavior) in agent trajectories +- Returns pass/fail +- *Implementation: Custom GPT-4o prompt* + +--- + +### Answer Quality Evaluators + +**Answer Completeness** +- Measures how thoroughly the answer uses relevant context, using a rubric from "barely uses context" to "fully covers it" +- Normalized to 0-1 score +- *Implementation: Ragas RubricsScore metric* + +**Answer Correctness** +- Evaluates factual correctness by combining semantic similarity with a correctness model vs ground truth +- Returns combined 0-1 score +- *Implementation: Ragas AnswerCorrectness + AnswerSimilarity* **Answer Relevancy** -- Verify responses address the query -- Ensure AI outputs stay on topic +- Determines whether the answer meaningfully responds to the question +- Outputs pass/fail +- *Implementation: Ragas answer_relevancy metric* **Faithfulness** -- Detect hallucinations and verify facts -- Maintain accuracy and truthfulness +- Ensures all claims in the answer are grounded in the provided context and not hallucinated +- Binary pass/fail +- *Implementation: Ragas Faithfulness metric* + +**Semantic Similarity** +- Computes embedding-based similarity between generated text and a reference answer +- Returns 0-1 score +- *Implementation: Ragas SemanticSimilarity metric* + +--- + +### Conversation Evaluators + 
+**Conversation Quality** +- Overall conversation score combining relevancy (40%), completeness (40%), and memory retention (20%) over multiple turns +- Returns weighted combined score +- *Implementation: DeepEval TurnRelevancy + ConversationCompleteness + KnowledgeRetention* + +**Intent Change** +- Detects if the conversation stayed on the original intent or drifted into unrelated topics +- Higher score = better adherence to original topic +- *Implementation: Ragas TopicAdherenceScore (precision mode)* + +**Topic Adherence** +- Measures how well conversation messages stay aligned with specified allowed topics +- Returns 0-1 score +- *Implementation: Ragas TopicAdherenceScore* + +**Context Relevance** +- Rates whether retrieved context actually contains the information needed to answer the question +- Score = relevant statements / total statements +- *Implementation: DeepEval ContextualRelevancyMetric* + +**Instruction Adherence** +- Evaluates how closely the model followed system-level or user instructions +- Returns 0-1 adherence score +- *Implementation: DeepEval PromptAlignmentMetric* + +--- ### Safety & Security Evaluators -**PII Detection** -- Identify personal information in responses -- Protect user privacy and data security +**PII Detector** +- Detects names, addresses, emails, and other personal identifiers in text; may redact them +- Pass/fail based on confidence threshold +- *Implementation: Microsoft Presidio Analyzer* + +**Secrets Detector** +- Identifies hardcoded secrets such as API keys, tokens, passwords, etc. +- Binary pass/fail with optional redaction +- *Implementation: Yelp detect-secrets* + +**Profanity Detector** +- Checks whether text contains offensive or profane language +- Binary pass/fail +- *Implementation: profanity-check library* -**Profanity Detection** -- Monitor for inappropriate language -- Maintain content quality standards +**Prompt Injection Detector** +- Flags attempts to override system behavior or inject malicious instructions +- Binary pass/fail based on threshold +- *Implementation: AWS SageMaker endpoint running DeBERTa-v3 model* -**Secrets Detection** -- Monitor for sensitive information leakage -- Prevent accidental exposure of credentials +**Toxicity Detector** +- Classifies toxic categories like threat, insult, obscenity, hate speech, etc. 
+- Binary pass/fail based on threshold +- *Implementation: AWS SageMaker unitary/toxic-bert model* -### Formatting Evaluators +**Sexism Detector** +- Detects sexist language or bias specifically toward gender-based discrimination +- Binary pass/fail based on threshold +- *Implementation: AWS SageMaker unitary/toxic-bert model* -**SQL Validation** -- Validate SQL queries -- Ensure syntactically correct SQL output +--- + +### Format Validators -**JSON Validation** -- Validate JSON responses -- Ensure properly formatted JSON structures +**JSON Validator** +- Validates that output is valid JSON and optionally matches a schema +- Binary pass/fail +- *Implementation: Python json and jsonschema* -**Regex Validation** -- Validate regex patterns -- Verify pattern matching requirements +**SQL Validator** +- Checks whether generated text is syntactically valid PostgreSQL SQL +- Binary pass/fail +- *Implementation: pglast Postgres parser* + +**Regex Validator** +- Validates whether text matches (or must not match) a regex with flexible flags +- Supports case sensitivity, multiline, and dotall flags +- *Implementation: Python re* **Placeholder Regex** -- Validate placeholder regex patterns -- Check for expected placeholders in responses +- Similar to regex validator, but dynamically injects a placeholder before matching +- Useful for dynamic pattern validation +- *Implementation: Python re* -### Advanced Quality Evaluators +--- -**Semantic Similarity** -- Validate semantic similarity between texts -- Compare meaning and context alignment +### Text Metrics -**Agent Goal Accuracy** -- Validate agent goal accuracy -- Measure how well agent achieves defined goals +**Word Count** +- Counts number of words in generated output +- Returns integer count +- *Implementation: Python string split* -**Topic Adherence** -- Validate topic adherence -- Ensure responses stay within specified topics +**Word Count Ratio** +- Compares output word count to input word count +- Useful for measuring expansion/compression +- *Implementation: Python string operations* + +**Char Count** +- Counts number of characters in the generated text +- Returns integer count +- *Implementation: Python len()* -**Measure Perplexity** -- Measure text perplexity from logprobs -- Assess response predictability and coherence +**Char Count Ratio** +- Output character count divided by input character count +- Returns float ratio +- *Implementation: Python len()* + +**Perplexity** +- Computes perplexity using provided logprobs to quantify model confidence on its output +- Lower = more confident predictions +- *Implementation: Mathematical calculation exp(-avg_log_prob)* + +--- + +### Specialized Evaluators + +**LLM as a Judge** +- Fully flexible LLM-based evaluator using arbitrary prompts and variables; returns JSON directly from the model +- Configurable model, temperature, etc. +- *Implementation: Custom OpenAI API call* + +**Tone Detection** +- Classifies emotional tone (joy, anger, sadness, fear, neutral, etc.) 
based on text +- Returns detected tone and confidence score +- *Implementation: AWS SageMaker emotion-distilroberta model* + +**Uncertainty** +- Generates a response with token-level logprobs and calculates uncertainty using max surprisal +- Returns answer + uncertainty score +- *Implementation: GPT-4o-mini with logprobs enabled* + +--- ## Custom Evaluators @@ -105,7 +222,9 @@ In addition to the pre-built evaluators, you can create custom evaluators tailor **Custom LLM Judge** - Create custom evaluations using LLM-as-a-judge -- Leverage AI models to assess outputs against custom criteria +- Accepts custom prompt with `{{completion}}`, `{{question}}`, `{{context}}` placeholders +- Returns pass/fail with reason +- *Implementation: Custom GPT-4o prompt with structured output parsing* ### Inputs - **string**: Text-based input parameters @@ -115,6 +234,20 @@ In addition to the pre-built evaluators, you can create custom evaluators tailor - **results**: String-based evaluation results - **pass**: Boolean indicator for pass/fail status +--- + +## Implementation Summary + +| Category | Count | Libraries/Solutions | +|----------|-------|---------------------| +| **Ragas Metrics** | 7 | AgentGoalAccuracy, RubricsScore, AnswerCorrectness, AnswerSimilarity, answer_relevancy, Faithfulness, SemanticSimilarity, TopicAdherenceScore | +| **DeepEval Metrics** | 5 | ConversationCompletenessMetric, TurnRelevancyMetric, KnowledgeRetentionMetric, ContextualRelevancyMetric, PromptAlignmentMetric | +| **SageMaker Models** | 4 | Prompt Injection (DeBERTa-v3), Toxicity (toxic-bert), Sexism (toxic-bert), Tone (distilroberta) | +| **Custom LLM Prompts** | 5 | Agent Efficiency, Agent Flow Quality, Tool Error Detector, LLM Judge, Custom LLM Judge | +| **In-memory/Local** | 12 | JSON/SQL/Regex validators, Word/Char counts, Perplexity, Profanity, PII, Secrets | + +--- + ## Usage 1. Browse the available evaluators in the library @@ -123,4 +256,4 @@ In addition to the pre-built evaluators, you can create custom evaluators tailor 4. Use the "Use evaluator" button to integrate into your workflow 5. Monitor outputs and pass/fail status for systematic quality assessment -The Evaluator Library streamlines the process of implementing comprehensive AI output assessment, ensuring consistent quality and safety standards across your applications. \ No newline at end of file +The Evaluator Library streamlines the process of implementing comprehensive AI output assessment, ensuring consistent quality and safety standards across your applications. 
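
A few of the evaluator descriptions above map directly onto short code sketches. The examples that follow are minimal illustrations under stated assumptions (function names, default thresholds, and pass/fail conventions are made up for clarity), not the library's actual implementation. First, the Conversation Quality evaluator is described as a 40% / 40% / 20% weighted blend of turn relevancy, conversation completeness, and memory retention:

```python
def conversation_quality(relevancy: float, completeness: float,
                         memory_retention: float) -> float:
    """Combine the three per-conversation sub-scores (each in [0, 1])
    using the 40% / 40% / 20% weighting described above."""
    return 0.4 * relevancy + 0.4 * completeness + 0.2 * memory_retention
```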
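
The PII Detector is backed by Microsoft Presidio's analyzer. A sketch of a threshold-based pass/fail check using the standard presidio-analyzer API (the 0.5 default threshold and the "pass means nothing found" convention are assumptions):

```python
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

def pii_detector(text: str, threshold: float = 0.5) -> bool:
    """Pass (True) when no personal identifier is detected above the
    confidence threshold; each finding carries an entity type and a score."""
    findings = analyzer.analyze(text=text, language="en")
    return not any(f.score >= threshold for f in findings)
```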
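
The Profanity Detector is described as using the profanity-check library, which exposes a probability per input string. A minimal sketch, assuming pass means "below the profanity threshold":

```python
from profanity_check import predict_prob

def profanity_detector(text: str, threshold: float = 0.5) -> bool:
    # predict_prob returns the probability that each input string is profane.
    return predict_prob([text])[0] < threshold
```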
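
The JSON Validator (Python json plus jsonschema, with an optional schema) can be sketched with the two libraries directly:

```python
import json
import jsonschema

def json_validator(text: str, schema: dict | None = None) -> bool:
    """Pass when the text parses as JSON and, if a schema is given,
    also conforms to it."""
    try:
        data = json.loads(text)
        if schema is not None:
            jsonschema.validate(instance=data, schema=schema)
        return True
    except (json.JSONDecodeError, jsonschema.ValidationError):
        return False
```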
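
The SQL Validator relies on the pglast Postgres parser, so "valid" here means the text parses as PostgreSQL syntax, not that the referenced tables or columns exist. A sketch assuming pglast's `parse_sql`/`ParseError` API:

```python
import pglast
from pglast.parser import ParseError

def sql_validator(text: str) -> bool:
    try:
        pglast.parse_sql(text)   # raises ParseError on invalid Postgres syntax
        return True
    except ParseError:
        return False
```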
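
The Regex and Placeholder Regex validators are thin wrappers over Python's re module with configurable flags. A possible shape (the parameter names and the `{placeholder}` token are illustrative assumptions):

```python
import re

def regex_validator(text: str, pattern: str, *, must_match: bool = True,
                    case_sensitive: bool = True, multiline: bool = False,
                    dotall: bool = False) -> bool:
    flags = 0
    if not case_sensitive:
        flags |= re.IGNORECASE
    if multiline:
        flags |= re.MULTILINE  # ^ and $ also match at line boundaries
    if dotall:
        flags |= re.DOTALL     # . also matches newlines
    matched = re.search(pattern, text, flags) is not None
    return matched if must_match else not matched

def placeholder_regex_validator(text: str, pattern_template: str,
                                placeholder: str, **kwargs) -> bool:
    # Inject the dynamic value into the pattern before matching.
    pattern = pattern_template.replace("{placeholder}", re.escape(placeholder))
    return regex_validator(text, pattern, **kwargs)
```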
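
The text metrics are plain Python string operations; the ratio variants compare the generated output against the input it was produced from. For example (guarding against empty inputs is an added assumption):

```python
def word_count(text: str) -> int:
    return len(text.split())

def word_count_ratio(output: str, prompt: str) -> float:
    # > 1.0 means the output is wordier than the input (expansion),
    # < 1.0 means it is more compressed.
    return word_count(output) / max(word_count(prompt), 1)

def char_count_ratio(output: str, prompt: str) -> float:
    return len(output) / max(len(prompt), 1)
```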
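
Perplexity is a direct calculation over the token log-probabilities supplied with the completion, exp(-avg_log_prob):

```python
import math

def perplexity(logprobs: list[float]) -> float:
    """logprobs are natural-log token probabilities returned by the model.
    Lower perplexity means the model was more confident in its own output."""
    avg_log_prob = sum(logprobs) / len(logprobs)
    return math.exp(-avg_log_prob)
```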
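
The Uncertainty evaluator generates a response with token-level logprobs and scores uncertainty by maximum surprisal, where the surprisal of a token is its negative log-probability. A sketch using the standard OpenAI Python client; the exact prompt and post-processing inside the evaluator may differ:

```python
from openai import OpenAI

client = OpenAI()

def answer_with_uncertainty(question: str) -> tuple[str, float]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
        logprobs=True,
    )
    choice = resp.choices[0]
    # The most "surprising" token drives the uncertainty score.
    surprisals = [-t.logprob for t in choice.logprobs.content]
    return choice.message.content, max(surprisals)
```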
From 3e2122b4bdc0acfc63f7ff03a0b173f540a9bf64 Mon Sep 17 00:00:00 2001 From: Oz Ben Simhon Date: Mon, 8 Dec 2025 21:47:30 +0200 Subject: [PATCH 2/2] wip --- evaluators/evaluator-library.mdx | 18 ------------------ 1 file changed, 18 deletions(-) diff --git a/evaluators/evaluator-library.mdx b/evaluators/evaluator-library.mdx index 8e672b8..952231a 100644 --- a/evaluators/evaluator-library.mdx +++ b/evaluators/evaluator-library.mdx @@ -220,12 +220,6 @@ In addition to the pre-built evaluators, you can create custom evaluators tailor - Create custom metric evaluations - Define your own evaluation logic and scoring -**Custom LLM Judge** -- Create custom evaluations using LLM-as-a-judge -- Accepts custom prompt with `{{completion}}`, `{{question}}`, `{{context}}` placeholders -- Returns pass/fail with reason -- *Implementation: Custom GPT-4o prompt with structured output parsing* - ### Inputs - **string**: Text-based input parameters - Support for multiple input types @@ -236,18 +230,6 @@ In addition to the pre-built evaluators, you can create custom evaluators tailor --- -## Implementation Summary - -| Category | Count | Libraries/Solutions | -|----------|-------|---------------------| -| **Ragas Metrics** | 7 | AgentGoalAccuracy, RubricsScore, AnswerCorrectness, AnswerSimilarity, answer_relevancy, Faithfulness, SemanticSimilarity, TopicAdherenceScore | -| **DeepEval Metrics** | 5 | ConversationCompletenessMetric, TurnRelevancyMetric, KnowledgeRetentionMetric, ContextualRelevancyMetric, PromptAlignmentMetric | -| **SageMaker Models** | 4 | Prompt Injection (DeBERTa-v3), Toxicity (toxic-bert), Sexism (toxic-bert), Tone (distilroberta) | -| **Custom LLM Prompts** | 5 | Agent Efficiency, Agent Flow Quality, Tool Error Detector, LLM Judge, Custom LLM Judge | -| **In-memory/Local** | 12 | JSON/SQL/Regex validators, Word/Char counts, Perplexity, Profanity, PII, Secrets | - ---- - ## Usage 1. Browse the available evaluators in the library