# Pairwise Experiments


### Setup

In [10]:
# You can set them inline
import os
os.environ["MISTRAL_API_KEY"] = ""
os.environ["LANGSMITH_API_KEY"] = ""
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_PROJECT"] = "langsmith-academy-mistral"

In [11]:
# Or you can use a .env file
from dotenv import load_dotenv
load_dotenv(dotenv_path="../../.env", override=True)

True

### Task

Let's set up a new task! Here, we have a salesperson named Bob. Bob has a lot of deals, so he wants to summarize what happened in each one based on his meeting transcripts.

Bob is iterating on a few different prompts with Mistral AI that will give him nice, concise summaries of his deals.

Bob has curated a dataset of his deal transcripts, so let's go ahead and load it. You can take a look at the dataset as well if you're curious! Note that this is not a golden dataset: there are no reference outputs here.

In [12]:
from langsmith import Client

client = Client()
dataset = client.clone_public_dataset(
  "https://smith.langchain.com/public/9078d2f1-7bef-4ba7-b795-210a17682ef9/d"
)
print(f"Cloned dataset: {dataset.name}")
print(f"Dataset contains {len(list(client.list_examples(dataset_name=dataset.name)))} examples")

Cloned dataset: Meeting Transcripts
Dataset contains 5 examples


### Experiments

Now, let's run some experiments on this dataset using two different prompts. Let's add an evaluator that tries to score how good our summaries are!

In [13]:
from pydantic import BaseModel, Field
from langchain_mistralai.chat_models import ChatMistralAI
from langchain_core.messages import SystemMessage, HumanMessage
import json

mistral_client = ChatMistralAI(model="mistral-small-latest", temperature=0.1)

SUMMARIZATION_SYSTEM_PROMPT = """You are a judge, aiming to score how well a summary summarizes the content of a transcript. 
Respond with a JSON object containing 'score' (integer 1-5) and 'reasoning' (string explanation)."""

SUMMARIZATION_HUMAN_PROMPT = """
[The Meeting Transcript] {transcript}
[The Start of Summarization] {summary} [The End of Summarization]

Please evaluate this summary on a scale of 1-5, where 1 is a bad summary and 5 is a great summary.
Consider completeness, accuracy, and conciseness."""

class SummarizationScore(BaseModel):
    score: int = Field(description="""A score from 1-5 ranking how good the summarization is for the provided transcript, with 1 being a bad summary, and 5 being a great summary""")
    reasoning: str = Field(description="Brief explanation for the score")
    
def summary_score_evaluator(inputs: dict, outputs: dict) -> dict:
    messages = [
        SystemMessage(content=SUMMARIZATION_SYSTEM_PROMPT),
        HumanMessage(content=SUMMARIZATION_HUMAN_PROMPT.format(
            transcript=inputs["transcript"],
            summary=outputs.get("output", "N/A")
        ))
    ]
    
    response = mistral_client.invoke(messages)
    
    try:
        # Parse JSON response from Mistral AI
        result = json.loads(response.content)
        summary_score = result.get("score", 3)
        reasoning = result.get("reasoning", "No reasoning provided")
        
        return {
            "key": "summary_score", 
            "score": summary_score,
            "comment": reasoning
        }
    except json.JSONDecodeError:
        # Fallback if JSON parsing fails
        return {
            "key": "summary_score", 
            "score": 3,
            "comment": "Could not parse Mistral AI response"
        }
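
One caveat with the `json.loads` call above: chat models sometimes wrap their JSON in a Markdown code fence or add a sentence around it, in which case parsing fails and the evaluator silently returns the fallback score of 3. The sketch below shows one possible way to make the parsing more tolerant. `extract_json_object` is a hypothetical helper (it is not part of LangSmith or LangChain), and it assumes the response contains a single top-level JSON object.

```python
import json
import re
from typing import Optional

# Built from single backticks so this snippet can itself sit inside a fence.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, flags=re.DOTALL)


def extract_json_object(text: str) -> Optional[dict]:
    """Return the first JSON object found in `text`, or None if nothing parses."""
    # If the model wrapped its answer in a Markdown fence, look inside it.
    fenced = FENCED_BLOCK.search(text)
    candidate = fenced.group(1) if fenced else text
    # Keep only the outermost {...} span to skip any surrounding chatter.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(candidate[start : end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None
```

Under that assumption, the evaluator could call `extract_json_object(response.content)` and use the fallback score only when the helper returns `None`.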

First, we'll run our experiment with a good version of our prompt!

In [14]:
# Prompt One: Good Prompt with Mistral AI!
def good_summarizer(inputs: dict):
    messages = [
        HumanMessage(content=f"Concisely summarize this meeting in 3 sentences. Make sure to include all of the important events, key decisions, and action items. Meeting: {inputs['transcript']}")
    ]
    
    response = mistral_client.invoke(messages)
    return response.content

client.evaluate(
    good_summarizer,
    data=dataset,
    evaluators=[summary_score_evaluator],
    experiment_prefix="Good Mistral Summarizer"
)

View the evaluation results for experiment: 'Good Mistral Summarizer-64321797' at:
https://smith.langchain.com/o/bd531ccf-4286-4467-99ba-7eab707122af/datasets/fe28ab41-9df8-47e9-a675-858ad8feeeb1/compare?selectedSessions=a01ab955-7b6d-4a2d-955b-60e18ba30270





| | inputs.transcript | outputs.output | error | feedback.summary_score | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|
| 0 | Bob and Mr. Carter (CLOSED DEAL): Bob: Welcome... | Bob met with Mr. Carter to discuss trading in ... | | 3 | 0.805242 | 33286ea7-9126-476e-8ea4-b36c7928fce8 | 177671d0-236d-458f-9abe-423d4b8715be |
| 1 | Bob and Mr. Patel (CLOSED DEAL): Bob: Hello, M... | Mr. Patel sought a midsize hybrid sedan for hi... | | 3 | 0.774518 | 75ce83fc-ac58-4335-9a6d-6c3aac4aa990 | 27342517-3a01-41b0-b3fd-58fa69a841fc |
| 2 | Bob and Ms. Thompson (NO DEAL): Bob: Hi, Ms. T... | Bob welcomed Ms. Thompson to Ford Motors, disc... | | 3 | 0.971985 | 80f0443e-2fbf-4a54-9bd5-e2969fd32e40 | a73aaee5-4296-4847-a7bd-b3237677f66a |
| 3 | Bob and Ms. Nguyen (NO DEAL): Bob: Good aftern... | 1. **Discussion**: Bob presented Ms. Nguyen wi... | | 3 | 1.358458 | c199cb17-647e-4582-95a3-24dfe092124f | 19d17428-f50b-4b08-a8cd-6c3a376e07bc |
| 4 | Bob and Mr. Johnson (CLOSED DEAL): Bob: Good m... | Bob met with Mr. Johnson to discuss purchasing... | | 3 | 0.794056 | e9730901-3533-4c8a-baa5-0ee5c867990d | cfd871c8-2cb2-456e-bff0-325777e105a4 |


Now, we'll run an experiment with a worse version of our prompt, to highlight the difference.

In [15]:
# Prompt Two: Worse Prompt with Mistral AI!
def bad_summarizer(inputs: dict):
    messages = [
        HumanMessage(content=f"Summarize this in one sentence. {inputs['transcript']}")
    ]
    
    response = mistral_client.invoke(messages)
    return response.content

client.evaluate(
    bad_summarizer,
    data=dataset,
    evaluators=[summary_score_evaluator],
    experiment_prefix="Bad Mistral Summarizer"
)

View the evaluation results for experiment: 'Bad Mistral Summarizer-068a34eb' at:
https://smith.langchain.com/o/bd531ccf-4286-4467-99ba-7eab707122af/datasets/fe28ab41-9df8-47e9-a675-858ad8feeeb1/compare?selectedSessions=d3eeb829-b77f-4985-8f52-73cb03c86d49





| | inputs.transcript | outputs.output | error | feedback.summary_score | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|
| 0 | Bob and Mr. Carter (CLOSED DEAL): Bob: Welcome... | Bob successfully helped Mr. Carter trade in hi... | | 3 | 0.587522 | 33286ea7-9126-476e-8ea4-b36c7928fce8 | 57bd76ae-ea0d-43d7-8e19-82f44dc0efb7 |
| 1 | Bob and Mr. Patel (CLOSED DEAL): Bob: Hello, M... | Bob successfully helped Mr. Patel purchase a F... | | 3 | 0.73252 | 75ce83fc-ac58-4335-9a6d-6c3aac4aa990 | 020e4a99-693c-4be8-a4d6-c134be8f1af8 |
| 2 | Bob and Ms. Thompson (NO DEAL): Bob: Hi, Ms. T... | Bob and Ms. Thompson discussed SUV options, wi... | | 3 | 0.544668 | 80f0443e-2fbf-4a54-9bd5-e2969fd32e40 | 14c58278-34be-46f5-9146-238671876d50 |
| 3 | Bob and Ms. Nguyen (NO DEAL): Bob: Good aftern... | Bob and Ms. Nguyen discussed various car optio... | | 3 | 0.538156 | c199cb17-647e-4582-95a3-24dfe092124f | 91695ced-f95e-43fe-a854-714ac126d4a3 |
| 4 | Bob and Mr. Johnson (CLOSED DEAL): Bob: Good m... | Bob successfully convinced Mr. Johnson to purc... | | 3 | 0.612687 | e9730901-3533-4c8a-baa5-0ee5c867990d | 765b7068-e34d-428d-9a45-efee8c22cb70 |
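
In both result tables above, every row has a `feedback.summary_score` of 3, so the absolute scores do not separate the two prompts in this run. Whether that comes from the parsing fallback or from a judge that is genuinely indifferent cannot be told from the tables alone. A quick check with the values copied from the tables makes the point:

```python
from statistics import mean

# feedback.summary_score values copied from the two result tables above.
good_scores = [3, 3, 3, 3, 3]
bad_scores = [3, 3, 3, 3, 3]

good_mean = mean(good_scores)
bad_mean = mean(bad_scores)

print(f"good prompt mean: {good_mean}, bad prompt mean: {bad_mean}")
print(f"difference: {good_mean - bad_mean}")
```

With no difference in the means, a head-to-head comparison is a reasonable next step for telling the prompts apart.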


### Pairwise Experiment

Let's define a function that will compare our two experiments. These are the fields that pairwise evaluator functions get access to:
- `inputs: dict`: A dictionary of the inputs corresponding to a single example in a dataset.
- `outputs: list[dict]`: A list of the dict outputs produced by each experiment on the given inputs.
- `reference_outputs: dict`: A dictionary of the reference outputs associated with the example, if available.
- `runs: list[Run]`: A list of the full Run objects generated by the experiments on the given example. Use this if you need access to intermediate steps or metadata about each run.
- `example: Example`: The full dataset Example, including the example inputs, outputs (if available), and metadata (if available).
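
To make the shape of these arguments concrete, here is a toy pairwise evaluator that needs no LLM. It is a hypothetical illustration only: `prefer_shorter_summary` is not part of LangSmith, and "shorter is better" is not a sensible quality criterion. It assumes each output dict stores the summary under the `"output"` key, and it returns one score per experiment, in the same order as `outputs`.

```python
def prefer_shorter_summary(inputs: dict, outputs: list[dict]) -> list[int]:
    """Toy pairwise evaluator: 1 for the shorter summary, 0 for the other.

    Hypothetical example for illustration; a tie scores 0 for both.
    """
    # `inputs` is unused here, but a real evaluator would read the transcript from it.
    lengths = [len(o.get("output", "")) for o in outputs]
    if lengths[0] == lengths[1]:
        return [0, 0]
    shorter = 0 if lengths[0] < lengths[1] else 1
    return [1 if i == shorter else 0 for i in range(2)]


example_scores = prefer_shorter_summary(
    {"transcript": "Bob: Welcome..."},
    [{"output": "A long, detailed summary."}, {"output": "Short."}],
)
print(example_scores)
```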

First, let's give our LLM-as-judge some instructions. In our case, we'll use the LLM-as-judge directly to grade which of the two summarizers is more helpful.

It might be hard to grade our summarizers without a ground truth reference, but here, comparing different prompts head to head will give us a sense of which is better!

In [16]:
JUDGE_SYSTEM_PROMPT = """
Please act as an impartial judge and evaluate the quality of the summarizations provided by two AI summarizers to the meeting transcript below.
Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their summarizations. 
Begin your evaluation by comparing the two summarizations and provide a short explanation. 
Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. 
Do not favor certain names of the assistants. 
Be as objective as possible.
Respond with a JSON object containing 'preference' (integer: 1 for Assistant A, 2 for Assistant B, 0 for tie) and 'reasoning' (string explanation)."""

JUDGE_HUMAN_PROMPT = """
[The Meeting Transcript] {transcript}

[The Start of Assistant A's Summarization] {answer_a} [The End of Assistant A's Summarization]

[The Start of Assistant B's Summarization] {answer_b} [The End of Assistant B's Summarization]

Which summarization is better? Consider completeness, accuracy, conciseness, and usefulness for a business context."""

Our function will take in an `inputs` dictionary, and a list of `outputs` dictionaries for the different experiments that we want to compare.

In [17]:
from pydantic import BaseModel, Field
import json

class Preference(BaseModel):
    preference: int = Field(description="""1 if Assistant A answer is better based upon the factors above.
2 if Assistant B answer is better based upon the factors above.
Output 0 if it is a tie.""")
    reasoning: str = Field(description="Brief explanation for the preference choice")
    
def ranked_preference(inputs: dict, outputs: list[dict]) -> list:
    messages = [
        SystemMessage(content=JUDGE_SYSTEM_PROMPT),
        HumanMessage(content=JUDGE_HUMAN_PROMPT.format(
            transcript=inputs["transcript"],
            answer_a=outputs[0].get("output", "N/A"),
            answer_b=outputs[1].get("output", "N/A")
        ))
    ]
    
    response = mistral_client.invoke(messages)
    
    try:
        # Parse JSON response from Mistral AI
        result = json.loads(response.content)
        preference_score = result.get("preference", 0)
        reasoning = result.get("reasoning", "No reasoning provided")
        
        if preference_score == 1:
            scores = [1, 0]
        elif preference_score == 2:
            scores = [0, 1]
        else:
            scores = [0, 0]
            
        return scores
        
    except json.JSONDecodeError:
        # Fallback if JSON parsing fails - return tie
        return [0, 0]
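
The judge prompt asks the model to ignore presentation order, but a prompt alone may not remove position bias. A complementary mitigation, which is not part of the original notebook, is to randomise which summary is shown as Assistant A and then map the verdict back to the original order. In this sketch, `judge_with_random_order` and its `judge` callable are hypothetical names; `judge` stands in for the Mistral call and is assumed to return 1, 2, or 0 as in the prompt above.

```python
import random
from typing import Callable


def judge_with_random_order(
    judge: Callable[[str, str], int],
    answer_first: str,
    answer_second: str,
    rng: random.Random,
) -> list[int]:
    """Show the answers to `judge` in random order; return [first, second] scores.

    `judge(a, b)` must return 1 if a is better, 2 if b is better, 0 for a tie.
    """
    swapped = rng.random() < 0.5
    a, b = (answer_second, answer_first) if swapped else (answer_first, answer_second)
    preference = judge(a, b)
    if preference not in (1, 2):
        return [0, 0]  # tie or unexpected value
    # Undo the swap so the scores line up with the original experiment order.
    winner_is_first = (preference == 1) != swapped
    return [1, 0] if winner_is_first else [0, 1]


def longer_wins(a: str, b: str) -> int:
    """Stub judge used only to exercise the mapping logic."""
    if len(a) == len(b):
        return 0
    return 1 if len(a) > len(b) else 2


print(judge_with_random_order(longer_wins, "a long detailed summary", "short", random.Random(0)))
```

Whatever order the stub judge sees, the returned scores refer to the original first and second answers, which is what a pairwise evaluator needs to return.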

Now let's run our pairwise experiment with `evaluate()`.

In [18]:
from langsmith import evaluate

# Run pairwise experiment comparing the two Mistral AI summarizers
# Use the full experiment names, including the suffixes LangSmith generated for your own runs
evaluate(
    ("Good Mistral Summarizer-54e57cbf", "Bad Mistral Summarizer-70151d73"),  # Replace with the experiment names printed by your runs above
    evaluators=[ranked_preference],
    experiment_prefix="Mistral Pairwise Comparison"
)

View the pairwise evaluation results at:
https://smith.langchain.com/o/bd531ccf-4286-4467-99ba-7eab707122af/datasets/fe28ab41-9df8-47e9-a675-858ad8feeeb1/compare?selectedSessions=4dc84cae-b5e5-4886-a959-0f398eab84f1%2Cfafaa715-c983-417e-910d-1a5bd804a745&comparativeExperiment=f969df8a-15c0-4759-9569-9569ddcee74b





<langsmith.evaluation._runner.ComparativeExperimentResults at 0x789460b841a0>

## Learnings

### Key Changes Made:
1. **Model Provider**: Migrated from OpenAI to Mistral AI using `ChatMistralAI`
2. **Environment Variables**: Changed from `OPENAI_API_KEY` to `MISTRAL_API_KEY`
3. **Message Format**: Updated to use LangChain message objects (`SystemMessage`, `HumanMessage`)
4. **Response Parsing**: Implemented JSON parsing for Mistral AI responses instead of OpenAI's structured output
5. **Project Name**: Updated to `langsmith-academy-mistral`
6. **Experiment Names**: Updated to `Good Mistral Summarizer` and `Bad Mistral Summarizer`

### Custom Tweaks:
1. **Enhanced Prompts**: Improved summarization prompts to include key decisions and action items
2. **JSON Response Handling**: Added robust JSON parsing with fallback mechanisms for both evaluator and preference functions
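3. **Reasoning Addition**: Asked both judges for a reasoning explanation alongside the score; `summary_score_evaluator` attaches it as a comment, while `ranked_preference` parses it but returns only the scores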
4. **Business Context**: Enhanced judge prompts to consider business-specific usefulness criteria
5. **Dataset Info**: Added logging to show dataset cloning information and example count
6. **Temperature Control**: Set temperature to 0.1 for more consistent evaluation results
7. **Error Resilience**: Implemented fallback responses for parsing failures

### Pairwise Comparison Enhancements:
1. **Enhanced Judge Instructions**: Improved judge prompts to consider business context and completeness
2. **Reasoning Capture**: Asked the judge to explain each preference; the reasoning is parsed, though `ranked_preference` currently returns only the scores
3. **Robust Scoring**: Implemented fallback mechanisms to handle JSON parsing errors gracefully
4. **Clear Experiment Naming**: Used descriptive names for better experiment tracking

### What I Learned:
1. **Pairwise Evaluation Migration**: Successfully adapted pairwise comparison systems from OpenAI to Mistral AI
2. **JSON Response Handling**: Learned to handle structured responses from Mistral AI using custom JSON parsing
3. **Multi-Stage Evaluation**: Understood how to follow individually scored experiments with a pairwise comparison of their outputs
4. **Business Application**: Applied AI evaluation to real-world business scenarios (sales meeting summarization)
5. **Robust Error Handling**: Implemented fallback handling so that a malformed judge response does not crash an evaluation run


### Evaluation Methodology Learned:
1. **Individual vs Pairwise**: Understanding when to use individual scoring vs comparative evaluation
2. **Reference-Free Evaluation**: Techniques for evaluating quality without ground truth
3. **Multi-Criteria Assessment**: Considering multiple factors (completeness, accuracy, conciseness, business value)
4. **Consistency**: Using a low temperature (0.1) and error handling for more repeatable results
5. **Experiment Tracking**: Proper naming and metadata for experiment management

This migration demonstrates how to adapt complex evaluation workflows, including pairwise comparisons, when changing language model providers, while maintaining evaluation quality and adding robustness improvements.