# Summary Evaluators 



### Setup

In [7]:
# You can set the environment variables inline (replace the placeholder values with your own keys)
import os
os.environ["MISTRAL_API_KEY"] = "MISTRAL_API_KEY" 
os.environ["LANGSMITH_API_KEY"] = "LANGSMITH_API_KEY"
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_PROJECT"] = "langsmith-academy-mistral"

In [8]:
# Or you can use a .env file
from dotenv import load_dotenv
load_dotenv(dotenv_path="../../.env", override=True)

True

### Task

Our task here is to analyze the toxicity of random statements using Mistral AI, classifying them as `Toxic` or `Not toxic`. 

Take a look at our dataset!

In [9]:
from langsmith import Client

client = Client()
dataset = client.clone_public_dataset(
    "https://smith.langchain.com/public/89ef0d44-a252-4011-8bb8-6a114afc1522/d"
)

print(f"Cloned toxicity dataset: {dataset.name}")
examples = list(client.list_examples(dataset_name=dataset.name))
print(f"Dataset contains {len(examples)} examples")
print(f"Example classes: {set(ex.outputs.get('class', 'Unknown') for ex in examples[:5])}")

Cloned toxicity dataset: Toxicity Analysis
Dataset contains 9 examples
Example classes: {'Not toxic', 'Toxic'}


This is a simple toxicity classifier using Mistral AI!

In [10]:
from langchain_mistralai.chat_models import ChatMistralAI
from langchain_core.messages import SystemMessage, HumanMessage
from pydantic import BaseModel, Field
import json

mistral_client = ChatMistralAI(model="mistral-small-latest", temperature=0.1)

class Toxicity(BaseModel):
    toxicity: str = Field(description="""'Toxic' if the statement is toxic, 'Not toxic' if the statement is not toxic.""")
    confidence: float = Field(description="Confidence score between 0 and 1")

def good_classifier(inputs: dict) -> dict:
    system_prompt = """You are a toxicity classifier. Analyze the given statement and classify it as either 'Toxic' or 'Not toxic'. 
    Consider factors like hate speech, harassment, threats, profanity, and harmful content.
    Respond with a JSON object containing 'toxicity' (string) and 'confidence' (float 0-1)."""
    
    user_prompt = f"Classify this statement: {inputs['statement']}"
    
    messages = [
        SystemMessage(content=system_prompt),
        HumanMessage(content=user_prompt)
    ]
    
    response = mistral_client.invoke(messages)
    
    try:
        # Parse JSON response from Mistral AI
        result = json.loads(response.content)
        toxicity_score = result.get("toxicity", "Not toxic")
        confidence = result.get("confidence", 0.5)
        
        return {
            "class": toxicity_score,
            "confidence": confidence
        }
    except json.JSONDecodeError:
        # Fallback if JSON parsing fails
        content = response.content.lower()
        if "toxic" in content and "not toxic" not in content:
            return {"class": "Toxic", "confidence": 0.3}
        else:
            return {"class": "Not toxic", "confidence": 0.3}
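Chat models sometimes wrap a JSON answer in a Markdown code fence, in which case `json.loads(response.content)` raises and the classifier drops to the keyword fallback. The sketch below shows one way to strip such a fence before parsing. `parse_toxicity_reply` is a hypothetical helper that is not part of the original notebook, and it assumes the reply contains at most one fenced block.

```python
import json
import re
from typing import Optional

# Hypothetical helper (not in the original notebook).
# Matches a Markdown code fence (three backticks), optionally tagged "json",
# and captures whatever sits between the opening and closing fence.
_FENCE = re.compile(r"`{3}(?:json)?\s*(.*?)\s*`{3}", re.DOTALL | re.IGNORECASE)

def parse_toxicity_reply(text: str) -> Optional[dict]:
    """Return {'class', 'confidence'}, or None if no JSON object can be parsed."""
    match = _FENCE.search(text)
    # Prefer the fenced payload when there is one, else try the raw text.
    candidate = match.group(1) if match else text.strip()
    try:
        data = json.loads(candidate)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    # Same keys and defaults as good_classifier uses.
    return {
        "class": data.get("toxicity", "Not toxic"),
        "confidence": data.get("confidence", 0.5),
    }
```

When this helper returns `None`, the keyword-based fallback in `good_classifier` can still be applied to the raw text.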

### Summary Evaluator

These are the fields that summary evaluator functions get access to:
- `inputs: list[dict]`: A list of inputs from the examples in our dataset
- `outputs: list[dict]`: A list of the dict outputs produced from running our target over each input
- `reference_outputs: list[dict]`: A list of reference_outputs from the examples in our dataset
- `runs: list[Run]`: A list of the Run objects from running our target over the dataset.
- `examples: list[Example]`: A list of the full dataset Examples, including the example inputs, outputs (if available), and metadata (if available).
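To make the shape of these functions concrete, here is a minimal sketch of a summary evaluator that uses only two of the fields listed above. The name `accuracy_summary_evaluator` and the sample data are illustrative and not part of the original notebook. The `"class"` key matches what the classifier in this notebook returns.

```python
def accuracy_summary_evaluator(outputs: list[dict], reference_outputs: list[dict]) -> dict:
    """Fraction of examples whose predicted class equals the reference class."""
    if not outputs:
        return {"key": "accuracy", "score": 0.0}
    # True counts as 1 when summed, so this counts the matching pairs.
    correct = sum(
        out.get("class") == ref.get("class")
        for out, ref in zip(outputs, reference_outputs)
    )
    return {"key": "accuracy", "score": correct / len(outputs)}

# Illustrative data: two of the three predictions match the reference.
sample_outputs = [{"class": "Toxic"}, {"class": "Not toxic"}, {"class": "Toxic"}]
sample_references = [{"class": "Toxic"}, {"class": "Toxic"}, {"class": "Toxic"}]
print(accuracy_summary_evaluator(sample_outputs, sample_references))
```

Because the function is plain Python, it can be checked on hand-written lists like these before it is attached to an experiment.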

Now we'll define our summary evaluator! Here, we'll compute the F1 score, which is the harmonic mean of precision and recall.

A metric like this cannot be computed for a single example. It needs the counts of true positives, false positives, and false negatives across every example in the experiment, so our evaluator takes a list of `outputs` and a list of `reference_outputs`.
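As a worked example of the arithmetic, the sketch below computes precision, recall, and F1 from aggregate counts. The helper name `f1_from_counts` and the counts themselves are made up for illustration and do not come from the dataset.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, f1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Illustrative counts: 3 true positives, 1 false positive, 2 false negatives.
precision, recall, f1 = f1_from_counts(tp=3, fp=1, fn=2)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.3f}")
# precision=0.75 recall=0.60 f1=0.667
```

The same value follows from the equivalent form 2·TP / (2·TP + FP + FN) = 6 / 9.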

In [11]:
def f1_score_summary_evaluator(outputs: list[dict], reference_outputs: list[dict]) -> dict:
    true_positives = 0
    false_positives = 0
    false_negatives = 0
    true_negatives = 0
    
    for output_dict, reference_output_dict in zip(outputs, reference_outputs):
        output = output_dict.get("class", "Not toxic")
        reference_output = reference_output_dict.get("class", "Not toxic")
        
        if output == "Toxic" and reference_output == "Toxic":
            true_positives += 1
        elif output == "Toxic" and reference_output == "Not toxic":
            false_positives += 1
        elif output == "Not toxic" and reference_output == "Toxic":
            false_negatives += 1
        elif output == "Not toxic" and reference_output == "Not toxic":
            true_negatives += 1

    # Calculate metrics
    total = len(outputs)
    accuracy = (true_positives + true_negatives) / total if total > 0 else 0.0
    
    if true_positives == 0:
        precision = 0.0
        recall = 0.0
        f1_score = 0.0
    else:
        precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0.0
        recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0.0
        f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0
    
    return {
        "key": "f1_score", 
        "score": f1_score,
        "metadata": {
            "precision": precision,
            "recall": recall,
            "accuracy": accuracy,
            "true_positives": true_positives,
            "false_positives": false_positives,
            "false_negatives": false_negatives,
            "true_negatives": true_negatives,
            "total_examples": total
        }
    }

# Additional custom summary evaluator for confidence analysis
def confidence_summary_evaluator(outputs: list[dict], reference_outputs: list[dict]) -> dict:
    confidences = [output.get("confidence", 0.5) for output in outputs]
    
    if not confidences:
        # Use the same key as the normal return path so results stay comparable
        return {"key": "confidence_analysis", "score": 0.0}
    
    avg_confidence = sum(confidences) / len(confidences)
    min_confidence = min(confidences)
    max_confidence = max(confidences)
    
    return {
        "key": "confidence_analysis",
        "score": avg_confidence,
        "metadata": {
            "average_confidence": avg_confidence,
            "min_confidence": min_confidence,
            "max_confidence": max_confidence,
            "low_confidence_count": sum(1 for c in confidences if c < 0.6)
        }
    }
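`confidence_summary_evaluator` accepts `reference_outputs` but does not use it. One possible extension, sketched here as an assumption rather than as part of the original notebook, is to compare average confidence on correct and incorrect predictions. The name `confidence_gap_summary_evaluator` and the `confidence_gap` key are hypothetical.

```python
from statistics import mean

def confidence_gap_summary_evaluator(outputs: list[dict], reference_outputs: list[dict]) -> dict:
    """Average confidence on correct predictions minus that on incorrect ones."""
    correct, incorrect = [], []
    for out, ref in zip(outputs, reference_outputs):
        # Put each confidence in the bucket for right or wrong predictions.
        bucket = correct if out.get("class") == ref.get("class") else incorrect
        bucket.append(float(out.get("confidence", 0.5)))
    avg_correct = mean(correct) if correct else 0.0
    avg_incorrect = mean(incorrect) if incorrect else 0.0
    return {"key": "confidence_gap", "score": avg_correct - avg_incorrect}
```

A clearly positive gap would suggest that the model is more confident when it is right. A gap near zero or below would suggest that the reported confidence says little about correctness.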


Note that we pass both `f1_score_summary_evaluator` and `confidence_summary_evaluator` through the `summary_evaluators` argument!

In [12]:
results = client.evaluate(
    good_classifier,
    data=dataset,
    summary_evaluators=[f1_score_summary_evaluator, confidence_summary_evaluator],
    experiment_prefix="Mistral Toxicity Classifier",
    metadata={
        "model": "mistral-small-latest",
        "task": "toxicity_classification",
        "provider": "mistral"
    }
)

print("Evaluation completed!")
print(f"Experiment results: {results}")

# Display summary metrics if available
if hasattr(results, 'get_summary_scores'):
    summary_scores = results.get_summary_scores()
    print("\nSummary Metrics:")
    for score in summary_scores:
        print(f"  {score.key}: {score.score}")
        if hasattr(score, 'metadata') and score.metadata:
            print(f"    Details: {score.metadata}")

View the evaluation results for experiment: 'Mistral Toxicity Classifier-1dcca71b' at:
https://smith.langchain.com/o/bd531ccf-4286-4467-99ba-7eab707122af/datasets/91cd1cee-078a-4a28-9775-4359cb132cce/compare?selectedSessions=2b775f19-1c5c-44e8-a3cf-2828bfe8c144




Evaluation completed!
Experiment results: <ExperimentResults Mistral Toxicity Classifier-1dcca71b>



### Custom Tweaks:
1. **Enhanced Classifier**: Added a confidence score alongside the toxicity classification
2. **JSON Parsing**: Parsed the model response as JSON to extract the `toxicity` and `confidence` fields
3. **Additional Summary Evaluator**: Created `confidence_summary_evaluator` to analyze model confidence
4. **Enhanced F1 Evaluator**: Extended to include precision, recall, accuracy, and confusion matrix counts
5. **Dataset Logging**: Printed the cloned dataset's name, example count, and class labels
6. **Metadata Enhancement**: Added experiment metadata for the model, task, and provider
7. **Temperature Control**: Set temperature to 0.1 for more consistent classification results
8. **Error Resilience**: When JSON parsing fails, fell back to keyword matching on the response text with a fixed confidence of 0.3

### Summary Evaluation Enhancements:
1. **Multi-Metric Analysis**: Combined F1 score with confidence analysis for comprehensive evaluation
2. **Confusion Matrix Details**: Extended F1 evaluator to provide full confusion matrix breakdown
3. **Confidence Insights**: Added analysis of model confidence patterns and low-confidence predictions
4. **Result Display**: Printed each summary metric with its metadata when the results object exposes them

### What I Learned:
1. **Summary Evaluators**: Understood how to create evaluators that analyze aggregate results across entire datasets
2. **Classification Metrics**: Learned to implement comprehensive classification evaluation including F1, precision, recall, and accuracy
3. **Confidence Analysis**: Discovered how to analyze model confidence patterns for reliability assessment
4. **JSON Response Handling**: Practiced parsing JSON responses from Mistral AI with a fallback for malformed output
5. **Multi-Evaluator Systems**: Learned to combine multiple summary evaluators for comprehensive analysis


### Classification Task Methodology:
1. **Structured Prompting**: Using clear instructions for consistent classification behavior
2. **Multi-Output Classification**: Combining classification with confidence scoring
3. **Robust Error Handling**: Implementing content-based fallbacks for parsing failures
4. **Comprehensive Evaluation**: Using multiple summary evaluators for thorough analysis
5. **Result Interpretation**: Understanding how to interpret and visualize classification metrics

