From 909a1555be819ebb8931abaac30af3062c386b1c Mon Sep 17 00:00:00 2001 From: Oz Ben Simhon Date: Mon, 8 Dec 2025 21:45:45 +0200 Subject: [PATCH 1/2] wip --- evaluators/evaluator-library.mdx | 241 ++++++++++++++++++++++++------- 1 file changed, 187 insertions(+), 54 deletions(-) diff --git a/evaluators/evaluator-library.mdx b/evaluators/evaluator-library.mdx index 8f0efb6..8e672b8 100644 --- a/evaluators/evaluator-library.mdx +++ b/evaluators/evaluator-library.mdx @@ -17,83 +17,200 @@ The Evaluator Library provides a comprehensive collection of pre-built quality c Traceloop provides several pre-configured evaluators for common assessment tasks: -### Content Analysis Evaluators +--- -**Character Count** -- Analyze response length and verbosity -- Helps ensure responses meet length requirements +### Agent Evaluators -**Character Count Ratio** -- Measure the ratio of characters to the input -- Useful for assessing response proportionality +**Agent Efficiency** (beta) +- Evaluates how efficiently an agent completes a task by detecting redundant steps, unnecessary tool calls, loops, or poor reasoning +- Returns a 0-1 score +- *Implementation: Custom GPT-4o prompt* -**Word Count** -- Ensure appropriate response detail level -- Track output length consistency +**Agent Flow Quality** (beta) +- Checks whether the agent satisfies all user-defined behavioral or logical conditions; strict full-condition matching +- Returns score as ratio of passed conditions +- *Implementation: Custom GPT-4o prompt* -**Word Count Ratio** -- Measure the ratio of words to the input -- Compare input/output verbosity +**Agent Goal Accuracy** +- Determines if the agent actually achieved the user's goal, with or without a reference expected answer +- Supports both reference-based and reference-free evaluation +- *Implementation: Ragas AgentGoalAccuracy metrics* -### Quality Assessment Evaluators +**Agent Goal Completeness** +- Extracts user intents across a conversation and evaluates how many were fulfilled end-to-end +- Automatically determines fulfillment rate +- *Implementation: DeepEval ConversationCompletenessMetric* + +**Agent Tool Error Detector** (beta) +- Detects incorrect tool usage (bad params, failed API calls, unexpected behavior) in agent trajectories +- Returns pass/fail +- *Implementation: Custom GPT-4o prompt* + +--- + +### Answer Quality Evaluators + +**Answer Completeness** +- Measures how thoroughly the answer uses relevant context, using a rubric from "barely uses context" to "fully covers it" +- Normalized to 0-1 score +- *Implementation: Ragas RubricsScore metric* + +**Answer Correctness** +- Evaluates factual correctness by combining semantic similarity with a correctness model vs ground truth +- Returns combined 0-1 score +- *Implementation: Ragas AnswerCorrectness + AnswerSimilarity* **Answer Relevancy** -- Verify responses address the query -- Ensure AI outputs stay on topic +- Determines whether the answer meaningfully responds to the question +- Outputs pass/fail +- *Implementation: Ragas answer_relevancy metric* **Faithfulness** -- Detect hallucinations and verify facts -- Maintain accuracy and truthfulness +- Ensures all claims in the answer are grounded in the provided context and not hallucinated +- Binary pass/fail +- *Implementation: Ragas Faithfulness metric* + +**Semantic Similarity** +- Computes embedding-based similarity between generated text and a reference answer +- Returns 0-1 score +- *Implementation: Ragas SemanticSimilarity metric* + +--- + +### Conversation Evaluators + 
+**Conversation Quality** +- Overall conversation score combining relevancy (40%), completeness (40%), and memory retention (20%) over multiple turns +- Returns weighted combined score +- *Implementation: DeepEval TurnRelevancy + ConversationCompleteness + KnowledgeRetention* + +**Intent Change** +- Detects if the conversation stayed on the original intent or drifted into unrelated topics +- Higher score = better adherence to original topic +- *Implementation: Ragas TopicAdherenceScore (precision mode)* + +**Topic Adherence** +- Measures how well conversation messages stay aligned with specified allowed topics +- Returns 0-1 score +- *Implementation: Ragas TopicAdherenceScore* + +**Context Relevance** +- Rates whether retrieved context actually contains the information needed to answer the question +- Score = relevant statements / total statements +- *Implementation: DeepEval ContextualRelevancyMetric* + +**Instruction Adherence** +- Evaluates how closely the model followed system-level or user instructions +- Returns 0-1 adherence score +- *Implementation: DeepEval PromptAlignmentMetric* + +--- ### Safety & Security Evaluators -**PII Detection** -- Identify personal information in responses -- Protect user privacy and data security +**PII Detector** +- Detects names, addresses, emails, and other personal identifiers in text; may redact them +- Pass/fail based on confidence threshold +- *Implementation: Microsoft Presidio Analyzer* + +**Secrets Detector** +- Identifies hardcoded secrets such as API keys, tokens, passwords, etc. +- Binary pass/fail with optional redaction +- *Implementation: Yelp detect-secrets* + +**Profanity Detector** +- Checks whether text contains offensive or profane language +- Binary pass/fail +- *Implementation: profanity-check library* -**Profanity Detection** -- Monitor for inappropriate language -- Maintain content quality standards +**Prompt Injection Detector** +- Flags attempts to override system behavior or inject malicious instructions +- Binary pass/fail based on threshold +- *Implementation: AWS SageMaker endpoint running DeBERTa-v3 model* -**Secrets Detection** -- Monitor for sensitive information leakage -- Prevent accidental exposure of credentials +**Toxicity Detector** +- Classifies toxic categories like threat, insult, obscenity, hate speech, etc. 
+- Binary pass/fail based on threshold +- *Implementation: AWS SageMaker unitary/toxic-bert model* -### Formatting Evaluators +**Sexism Detector** +- Detects sexist language or bias specifically toward gender-based discrimination +- Binary pass/fail based on threshold +- *Implementation: AWS SageMaker unitary/toxic-bert model* -**SQL Validation** -- Validate SQL queries -- Ensure syntactically correct SQL output +--- + +### Format Validators -**JSON Validation** -- Validate JSON responses -- Ensure properly formatted JSON structures +**JSON Validator** +- Validates that output is valid JSON and optionally matches a schema +- Binary pass/fail +- *Implementation: Python json and jsonschema* -**Regex Validation** -- Validate regex patterns -- Verify pattern matching requirements +**SQL Validator** +- Checks whether generated text is syntactically valid PostgreSQL SQL +- Binary pass/fail +- *Implementation: pglast Postgres parser* + +**Regex Validator** +- Validates whether text matches (or must not match) a regex with flexible flags +- Supports case sensitivity, multiline, and dotall flags +- *Implementation: Python re* **Placeholder Regex** -- Validate placeholder regex patterns -- Check for expected placeholders in responses +- Similar to regex validator, but dynamically injects a placeholder before matching +- Useful for dynamic pattern validation +- *Implementation: Python re* -### Advanced Quality Evaluators +--- -**Semantic Similarity** -- Validate semantic similarity between texts -- Compare meaning and context alignment +### Text Metrics -**Agent Goal Accuracy** -- Validate agent goal accuracy -- Measure how well agent achieves defined goals +**Word Count** +- Counts number of words in generated output +- Returns integer count +- *Implementation: Python string split* -**Topic Adherence** -- Validate topic adherence -- Ensure responses stay within specified topics +**Word Count Ratio** +- Compares output word count to input word count +- Useful for measuring expansion/compression +- *Implementation: Python string operations* + +**Char Count** +- Counts number of characters in the generated text +- Returns integer count +- *Implementation: Python len()* -**Measure Perplexity** -- Measure text perplexity from logprobs -- Assess response predictability and coherence +**Char Count Ratio** +- Output character count divided by input character count +- Returns float ratio +- *Implementation: Python len()* + +**Perplexity** +- Computes perplexity using provided logprobs to quantify model confidence on its output +- Lower = more confident predictions +- *Implementation: Mathematical calculation exp(-avg_log_prob)* + +--- + +### Specialized Evaluators + +**LLM as a Judge** +- Fully flexible LLM-based evaluator using arbitrary prompts and variables; returns JSON directly from the model +- Configurable model, temperature, etc. +- *Implementation: Custom OpenAI API call* + +**Tone Detection** +- Classifies emotional tone (joy, anger, sadness, fear, neutral, etc.) 
based on text +- Returns detected tone and confidence score +- *Implementation: AWS SageMaker emotion-distilroberta model* + +**Uncertainty** +- Generates a response with token-level logprobs and calculates uncertainty using max surprisal +- Returns answer + uncertainty score +- *Implementation: GPT-4o-mini with logprobs enabled* + +--- ## Custom Evaluators @@ -105,7 +222,9 @@ In addition to the pre-built evaluators, you can create custom evaluators tailor **Custom LLM Judge** - Create custom evaluations using LLM-as-a-judge -- Leverage AI models to assess outputs against custom criteria +- Accepts custom prompt with `{{completion}}`, `{{question}}`, `{{context}}` placeholders +- Returns pass/fail with reason +- *Implementation: Custom GPT-4o prompt with structured output parsing* ### Inputs - **string**: Text-based input parameters @@ -115,6 +234,20 @@ In addition to the pre-built evaluators, you can create custom evaluators tailor - **results**: String-based evaluation results - **pass**: Boolean indicator for pass/fail status +--- + +## Implementation Summary + +| Category | Count | Libraries/Solutions | +|----------|-------|---------------------| +| **Ragas Metrics** | 7 | AgentGoalAccuracy, RubricsScore, AnswerCorrectness, AnswerSimilarity, answer_relevancy, Faithfulness, SemanticSimilarity, TopicAdherenceScore | +| **DeepEval Metrics** | 5 | ConversationCompletenessMetric, TurnRelevancyMetric, KnowledgeRetentionMetric, ContextualRelevancyMetric, PromptAlignmentMetric | +| **SageMaker Models** | 4 | Prompt Injection (DeBERTa-v3), Toxicity (toxic-bert), Sexism (toxic-bert), Tone (distilroberta) | +| **Custom LLM Prompts** | 5 | Agent Efficiency, Agent Flow Quality, Tool Error Detector, LLM Judge, Custom LLM Judge | +| **In-memory/Local** | 12 | JSON/SQL/Regex validators, Word/Char counts, Perplexity, Profanity, PII, Secrets | + +--- + ## Usage 1. Browse the available evaluators in the library @@ -123,4 +256,4 @@ In addition to the pre-built evaluators, you can create custom evaluators tailor 4. Use the "Use evaluator" button to integrate into your workflow 5. Monitor outputs and pass/fail status for systematic quality assessment -The Evaluator Library streamlines the process of implementing comprehensive AI output assessment, ensuring consistent quality and safety standards across your applications. \ No newline at end of file +The Evaluator Library streamlines the process of implementing comprehensive AI output assessment, ensuring consistent quality and safety standards across your applications. 
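
A few of the evaluator descriptions above map directly onto short code sketches. The examples that follow are minimal illustrations under stated assumptions (function names, default thresholds, and pass/fail conventions are made up for clarity), not the library's actual implementation. First, the Conversation Quality evaluator is described as a 40% / 40% / 20% weighted blend of turn relevancy, conversation completeness, and memory retention:

```python
def conversation_quality(relevancy: float, completeness: float,
                         memory_retention: float) -> float:
    """Combine the three per-conversation sub-scores (each in [0, 1])
    using the 40% / 40% / 20% weighting described above."""
    return 0.4 * relevancy + 0.4 * completeness + 0.2 * memory_retention
```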
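
The PII Detector is backed by Microsoft Presidio's analyzer. A sketch of a threshold-based pass/fail check using the standard presidio-analyzer API (the 0.5 default threshold and the "pass means nothing found" convention are assumptions):

```python
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

def pii_detector(text: str, threshold: float = 0.5) -> bool:
    """Pass (True) when no personal identifier is detected above the
    confidence threshold; each finding carries an entity type and a score."""
    findings = analyzer.analyze(text=text, language="en")
    return not any(f.score >= threshold for f in findings)
```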
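
The Profanity Detector is described as using the profanity-check library, which exposes a probability per input string. A minimal sketch, assuming pass means "below the profanity threshold":

```python
from profanity_check import predict_prob

def profanity_detector(text: str, threshold: float = 0.5) -> bool:
    # predict_prob returns the probability that each input string is profane.
    return predict_prob([text])[0] < threshold
```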
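
The JSON Validator (Python json plus jsonschema, with an optional schema) can be sketched with the two libraries directly:

```python
import json
import jsonschema

def json_validator(text: str, schema: dict | None = None) -> bool:
    """Pass when the text parses as JSON and, if a schema is given,
    also conforms to it."""
    try:
        data = json.loads(text)
        if schema is not None:
            jsonschema.validate(instance=data, schema=schema)
        return True
    except (json.JSONDecodeError, jsonschema.ValidationError):
        return False
```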
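
The SQL Validator relies on the pglast Postgres parser, so "valid" here means the text parses as PostgreSQL syntax, not that the referenced tables or columns exist. A sketch assuming pglast's `parse_sql`/`ParseError` API:

```python
import pglast
from pglast.parser import ParseError

def sql_validator(text: str) -> bool:
    try:
        pglast.parse_sql(text)   # raises ParseError on invalid Postgres syntax
        return True
    except ParseError:
        return False
```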
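
The Regex and Placeholder Regex validators are thin wrappers over Python's re module with configurable flags. A possible shape (the parameter names and the `{placeholder}` token are illustrative assumptions):

```python
import re

def regex_validator(text: str, pattern: str, *, must_match: bool = True,
                    case_sensitive: bool = True, multiline: bool = False,
                    dotall: bool = False) -> bool:
    flags = 0
    if not case_sensitive:
        flags |= re.IGNORECASE
    if multiline:
        flags |= re.MULTILINE  # ^ and $ also match at line boundaries
    if dotall:
        flags |= re.DOTALL     # . also matches newlines
    matched = re.search(pattern, text, flags) is not None
    return matched if must_match else not matched

def placeholder_regex_validator(text: str, pattern_template: str,
                                placeholder: str, **kwargs) -> bool:
    # Inject the dynamic value into the pattern before matching.
    pattern = pattern_template.replace("{placeholder}", re.escape(placeholder))
    return regex_validator(text, pattern, **kwargs)
```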
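
The text metrics are plain Python string operations; the ratio variants compare the generated output against the input it was produced from. For example (guarding against empty inputs is an added assumption):

```python
def word_count(text: str) -> int:
    return len(text.split())

def word_count_ratio(output: str, prompt: str) -> float:
    # > 1.0 means the output is wordier than the input (expansion),
    # < 1.0 means it is more compressed.
    return word_count(output) / max(word_count(prompt), 1)

def char_count_ratio(output: str, prompt: str) -> float:
    return len(output) / max(len(prompt), 1)
```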
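
Perplexity is a direct calculation over the token log-probabilities supplied with the completion, exp(-avg_log_prob):

```python
import math

def perplexity(logprobs: list[float]) -> float:
    """logprobs are natural-log token probabilities returned by the model.
    Lower perplexity means the model was more confident in its own output."""
    avg_log_prob = sum(logprobs) / len(logprobs)
    return math.exp(-avg_log_prob)
```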
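
The Uncertainty evaluator generates a response with token-level logprobs and scores uncertainty by maximum surprisal, where the surprisal of a token is its negative log-probability. A sketch using the standard OpenAI Python client; the exact prompt and post-processing inside the evaluator may differ:

```python
from openai import OpenAI

client = OpenAI()

def answer_with_uncertainty(question: str) -> tuple[str, float]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
        logprobs=True,
    )
    choice = resp.choices[0]
    # The most "surprising" token drives the uncertainty score.
    surprisals = [-t.logprob for t in choice.logprobs.content]
    return choice.message.content, max(surprisals)
```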
From 3e2122b4bdc0acfc63f7ff03a0b173f540a9bf64 Mon Sep 17 00:00:00 2001 From: Oz Ben Simhon Date: Mon, 8 Dec 2025 21:47:30 +0200 Subject: [PATCH 2/2] wip --- evaluators/evaluator-library.mdx | 18 ------------------ 1 file changed, 18 deletions(-) diff --git a/evaluators/evaluator-library.mdx b/evaluators/evaluator-library.mdx index 8e672b8..952231a 100644 --- a/evaluators/evaluator-library.mdx +++ b/evaluators/evaluator-library.mdx @@ -220,12 +220,6 @@ In addition to the pre-built evaluators, you can create custom evaluators tailor - Create custom metric evaluations - Define your own evaluation logic and scoring -**Custom LLM Judge** -- Create custom evaluations using LLM-as-a-judge -- Accepts custom prompt with `{{completion}}`, `{{question}}`, `{{context}}` placeholders -- Returns pass/fail with reason -- *Implementation: Custom GPT-4o prompt with structured output parsing* - ### Inputs - **string**: Text-based input parameters - Support for multiple input types @@ -236,18 +230,6 @@ In addition to the pre-built evaluators, you can create custom evaluators tailor --- -## Implementation Summary - -| Category | Count | Libraries/Solutions | -|----------|-------|---------------------| -| **Ragas Metrics** | 7 | AgentGoalAccuracy, RubricsScore, AnswerCorrectness, AnswerSimilarity, answer_relevancy, Faithfulness, SemanticSimilarity, TopicAdherenceScore | -| **DeepEval Metrics** | 5 | ConversationCompletenessMetric, TurnRelevancyMetric, KnowledgeRetentionMetric, ContextualRelevancyMetric, PromptAlignmentMetric | -| **SageMaker Models** | 4 | Prompt Injection (DeBERTa-v3), Toxicity (toxic-bert), Sexism (toxic-bert), Tone (distilroberta) | -| **Custom LLM Prompts** | 5 | Agent Efficiency, Agent Flow Quality, Tool Error Detector, LLM Judge, Custom LLM Judge | -| **In-memory/Local** | 12 | JSON/SQL/Regex validators, Word/Char counts, Perplexity, Profanity, PII, Secrets | - ---- - ## Usage 1. Browse the available evaluators in the library