# Evaluators 


At a high level, an evaluator judges an invocation of your LLM application against a reference example and returns an evaluation score.

LangSmith represents this process as a function that takes a Run (the LLM app invocation) and an Example (the data point to evaluate against) and returns Feedback (the evaluator's score for that invocation).

![Evaluator](../../images/evaluator.png)

Here is an example of a very simple custom evaluator that compares the output of a model to the expected output in the dataset:

In [8]:
from langsmith.schemas import Example, Run

def correct_label(inputs: dict, reference_outputs: dict, outputs: dict) -> dict:
  score = outputs.get("output") == reference_outputs.get("label")
  return {"score": int(score), "key": "correct_label"}
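
As a quick sanity check, the evaluator can be called on hand-written dictionaries. The inputs and labels below are made up for illustration and do not come from any dataset in this notebook:

```python
def correct_label(inputs: dict, reference_outputs: dict, outputs: dict) -> dict:
    # Exact-match check between the model output and the reference label
    score = outputs.get("output") == reference_outputs.get("label")
    return {"score": int(score), "key": "correct_label"}

# Hypothetical data, for illustration only
match = correct_label(
    inputs={"text": "I loved this movie"},
    reference_outputs={"label": "positive"},
    outputs={"output": "positive"},
)
mismatch = correct_label(
    inputs={"text": "I loved this movie"},
    reference_outputs={"label": "positive"},
    outputs={"output": "negative"},
)
print(match)     # {'score': 1, 'key': 'correct_label'}
print(mismatch)  # {'score': 0, 'key': 'correct_label'}
```

One thing to watch: because both sides use `.get`, a run with no `"output"` key compared against an example with no `"label"` key yields `None == None` and therefore a score of 1.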

## LLM-as-Judge Evaluation

LLM-as-judge evaluators use LLMs to score system output. To use them, you typically encode the grading rules or criteria in the LLM prompt. They can be reference-free (e.g., check whether the system output contains offensive content or adheres to specific criteria), or they can compare the task output to a reference (e.g., check whether the output is factually accurate relative to the reference).
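
The difference between the two modes mostly comes down to what goes into the prompt. The sketch below uses a hypothetical helper, `build_judge_prompt`, which is not part of LangSmith or LangChain. It includes a reference section only when a reference is supplied:

```python
from typing import Optional

def build_judge_prompt(criteria: str, output: str, reference: Optional[str] = None) -> str:
    """Assemble a grading prompt; the reference section is included only when given."""
    parts = [f"Grading criteria: {criteria}"]
    if reference is not None:
        parts.append(f"Reference Response: {reference}")
    parts.append(f"Run Response: {output}")
    return "\n".join(parts)

# Reference-free: the judge only sees the criteria and the output
reference_free = build_judge_prompt(
    criteria="Flag any offensive content.",
    output="Thanks for asking, happy to help!",
)

# Reference-based: the judge also sees the expected answer
reference_based = build_judge_prompt(
    criteria="Is the run response factually consistent with the reference?",
    output="Paris is the capital of France.",
    reference="The capital of France is Paris.",
)

print(reference_free)
print(reference_based)
```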

Here is an example of how you might define an LLM-as-judge evaluator that asks the model for a JSON response and parses it:

In [9]:
# You can set the API key inline
import os
os.environ["MISTRAL_API_KEY"] = "MISTRAL_API_KEY"  # Replace with your Mistral AI API key

In [10]:
# Or you can use a .env file
from dotenv import load_dotenv
load_dotenv(dotenv_path="../../.env", override=True)

True

In [11]:
from langchain_mistralai.chat_models import ChatMistralAI
from langchain_core.messages import SystemMessage, HumanMessage
from pydantic import BaseModel, Field
import json

mistral_client = ChatMistralAI(model="mistral-small-latest", temperature=0.1)

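# Schema documenting the fields we ask the judge to return.
# It is not enforced below: the evaluator parses the raw JSON reply itself.
class Similarity_Score(BaseModel):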
    similarity_score: int = Field(description="Semantic similarity score between 1 and 10, where 1 means unrelated and 10 means identical.")
    reasoning: str = Field(description="Brief explanation for the similarity score.")

# NOTE: This is our evaluator using Mistral AI
def compare_semantic_similarity(inputs: dict, reference_outputs: dict, outputs: dict):
    input_question = inputs["question"]
    reference_response = reference_outputs["output"]
    run_response = outputs["output"]
    
    system_prompt = (
        "You are a semantic similarity evaluator. Compare the meanings of two responses to a question, "
        "Reference Response and Run Response, where the reference is the correct answer, and we are trying to judge if the run response is similar. "
        "Provide a score between 1 and 10, where 1 means completely unrelated, and 10 means identical in meaning. "
        "Also provide brief reasoning for your score. "
        "Return your response in JSON format with 'similarity_score' (integer) and 'reasoning' (string) fields."
    )
    
    user_prompt = f"Question: {input_question}\nReference Response: {reference_response}\nRun Response: {run_response}"
    
    messages = [
        SystemMessage(content=system_prompt),
        HumanMessage(content=user_prompt)
    ]
    
    response = mistral_client.invoke(messages)
    
    try:
        # Parse JSON response from Mistral AI
        result = json.loads(response.content)
        similarity_score = result.get("similarity_score", 5)
        reasoning = result.get("reasoning", "No reasoning provided")
        
        return {
            "score": similarity_score, 
            "key": "similarity",
            "comment": reasoning
        }
    except json.JSONDecodeError:
        # Fallback if JSON parsing fails
        return {
            "score": 5, 
            "key": "similarity",
            "comment": "Could not parse Mistral AI response"
        }


Let's try this out!

NOTE: We purposely made this answer wrong, so we expect to see a low score.

In [12]:
# From Dataset Example - Custom question about Mistral AI
inputs = {
  "question": "What are the main advantages of using Mistral AI models?"
}
reference_outputs = {
  "output": "Mistral AI models offer strong performance with efficient computation, multilingual support, competitive pricing, and good instruction following capabilities. They are suitable for both cloud and on-premise deployment."
}


# From Run - Intentionally different answer to test evaluator
outputs = {
  "output": "Mistral AI is just another language model with no special features or advantages over existing models."
}

similarity_score = compare_semantic_similarity(inputs, reference_outputs, outputs)
print(f"Semantic similarity evaluation: {similarity_score}")

Semantic similarity evaluation: {'score': 5, 'key': 'similarity', 'comment': 'Could not parse Mistral AI response'}
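
The recorded output shows the fallback branch firing: `json.loads` raised, so the score of 5 is a placeholder rather than a judgment of the answer. The raw model reply is not shown in this notebook, so the cause is unknown. One plausible explanation, offered here only as an assumption, is that the model wrapped its JSON in a Markdown code fence. A minimal sketch of a more tolerant parser follows (`parse_judge_json` is a hypothetical helper name):

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way so the example is easy to embed

def parse_judge_json(text: str) -> dict:
    """Parse a judge reply as JSON, tolerating a surrounding Markdown code fence."""
    cleaned = text.strip()
    # Drop an opening fence (optionally followed by a language tag) if present
    cleaned = re.sub(r"^" + FENCE + r"[a-zA-Z]*\s*", "", cleaned)
    # Drop a closing fence if present
    cleaned = re.sub(r"\s*" + FENCE + r"$", "", cleaned)
    return json.loads(cleaned)

# Made-up replies, for illustration only
fenced_reply = FENCE + 'json\n{"similarity_score": 2, "reasoning": "Contradicts the reference."}\n' + FENCE
plain_reply = '{"similarity_score": 9, "reasoning": "Same meaning."}'

print(parse_judge_json(fenced_reply)["similarity_score"])  # 2
print(parse_judge_json(plain_reply)["similarity_score"])   # 9
```

The helper still raises `json.JSONDecodeError` on replies that are not JSON at all, so it could replace the `json.loads(response.content)` call while leaving the existing `except` branch in place.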


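You can also define evaluators that take the Run and Example directly. The sample below passes plain dicts for both, so the function reads their fields with square brackets.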

In [13]:
from langsmith.schemas import Run, Example

def compare_semantic_similarity_v2(root_run: Run, example: Example):
    input_question = example["inputs"]["question"]
    reference_response = example["outputs"]["output"]
    run_response = root_run["outputs"]["output"]
    
    system_prompt = (
        "You are a semantic similarity evaluator. Compare the meanings of two responses to a question, "
        "Reference Response and Run Response, where the reference is the correct answer, and we are trying to judge if the run response is similar. "
        "Provide a score between 1 and 10, where 1 means completely unrelated, and 10 means identical in meaning. "
        "Also provide brief reasoning for your score. "
        "Return your response in JSON format with 'similarity_score' (integer) and 'reasoning' (string) fields."
    )
    
    user_prompt = f"Question: {input_question}\nReference Response: {reference_response}\nRun Response: {run_response}"
    
    messages = [
        SystemMessage(content=system_prompt),
        HumanMessage(content=user_prompt)
    ]
    
    response = mistral_client.invoke(messages)
    
    try:
        # Parse JSON response from Mistral AI
        result = json.loads(response.content)
        similarity_score = result.get("similarity_score", 5)
        reasoning = result.get("reasoning", "No reasoning provided")
        
        return {
            "score": similarity_score, 
            "key": "similarity",
            "comment": reasoning
        }
    except json.JSONDecodeError:
        # Fallback if JSON parsing fails
        return {
            "score": 5, 
            "key": "similarity",
            "comment": "Could not parse Mistral AI response"
        }

In [14]:
sample_run = {
  "name": "Sample Run",
  "inputs": {
    "question": "How does Mistral AI compare to other language models in terms of performance?"
  },
  "outputs": {
    "output": "Mistral AI models are terrible and perform poorly compared to all other language models."
  },
  "is_root": True,
  "status": "success",
  "extra": {
    "metadata": {
      "model": "mistral-small-latest"
    }
  }
}

sample_example = {
  "inputs": {
    "question": "How does Mistral AI compare to other language models in terms of performance?"
  },
  "outputs": {
    "output": "Mistral AI models demonstrate competitive performance with strong reasoning capabilities, efficient computation, and excellent multilingual support, making them a solid choice for various applications."
  },
  "metadata": {
    "dataset_split": [
      "custom_generated",
      "mistral_evaluation"
    ]
  }
}

similarity_score = compare_semantic_similarity_v2(sample_run, sample_example)
print(f"Semantic similarity evaluation: {similarity_score}")

Semantic similarity evaluation: {'score': 5, 'key': 'similarity', 'comment': 'Could not parse Mistral AI response'}
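
`compare_semantic_similarity_v2` indexes its arguments with square brackets, which works for the plain dicts passed in this notebook. Real `Run` and `Example` objects may expose `inputs` and `outputs` as attributes rather than keys. That is an assumption worth checking against the installed `langsmith` version. If it holds, a hypothetical helper such as `get_field` would let the same evaluator accept both shapes:

```python
from types import SimpleNamespace
from typing import Any

def get_field(obj: Any, name: str) -> Any:
    """Read `name` from a dict by key, or from any other object by attribute."""
    if isinstance(obj, dict):
        return obj[name]
    return getattr(obj, name)

dict_run = {"outputs": {"output": "from a dict"}}
# SimpleNamespace is only a stand-in for an attribute-style object, not a real Run
object_run = SimpleNamespace(outputs={"output": "from an object"})

print(get_field(dict_run, "outputs")["output"])    # from a dict
print(get_field(object_run, "outputs")["output"])  # from an object
```

Inside the evaluator, `root_run["outputs"]["output"]` would then become `get_field(root_run, "outputs")["output"]`.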


## Summary

### Key Changes I Made:
1. **API Migration**: Replaced OpenAI client with Mistral AI using `ChatMistralAI` from `langchain-mistralai`
2. **Environment Variables**: Changed from `OPENAI_API_KEY` to `MISTRAL_API_KEY`
3. **Message Format**: Updated to use LangChain message objects (`SystemMessage`, `HumanMessage`) instead of OpenAI's format
4. **Response Parsing**: Implemented JSON parsing for Mistral AI responses instead of OpenAI's structured output format
5. **Custom Examples**: Created new evaluation examples focused on Mistral AI performance and capabilities

### Custom Tweaks:
1. **Enhanced Error Handling**: Added JSON parsing with fallback mechanism for robust evaluation
2. **Reasoning Addition**: Extended evaluator to include reasoning explanations along with scores
3. **Custom Questions**: Created Mistral AI-specific evaluation scenarios instead of generic LangSmith questions
4. **Metadata Updates**: Updated sample metadata to reflect Mistral AI model usage
5. **Temperature Setting**: Added temperature control (0.1) for more consistent evaluation results

### What I Learned:
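1. **LLM-as-Judge Migration**: Adapted the evaluator functions from OpenAI to Mistral AI. Both recorded runs above returned the fallback score of 5 with "Could not parse Mistral AI response", so the response parsing still needs hardening before the scores can be trusted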
2. **JSON Response Handling**: Learned to handle structured responses from Mistral AI using JSON parsing instead of Pydantic models
3. **LangChain Integration**: Understood how to integrate Mistral AI with LangChain message formats for evaluation tasks
4. **Error Resilience**: Implemented fallback mechanisms to ensure evaluators continue working even if response parsing fails
5. **Custom Evaluation Design**: Created domain-specific evaluation examples that test model understanding of Mistral AI capabilities

