# Multiturn Evaluations

In this notebook, we'll show how to build more advanced multiturn simulations with LangGraph and then evaluate the resulting multiturn interactions.

## Creating our Agent

More complicated evaluations require a more complex agent than the simple ELI5 workflow in the basics folder. We've defined a multiagent system that acts as a customer service assistant for a fictional music store.

This customer service assistant consists of a supervisor and two subagents: an invoice subagent and a music subagent. The invoice subagent handles billing-related requests, while the music subagent handles music-related requests. The supervisor routes each request to the appropriate subagent.

In [2]:
from multiagent import multiagent

from dotenv import load_dotenv
load_dotenv(dotenv_path="../.env", override=True)

from langsmith import Client # Set up LangSmith client
client = Client()

For implementation details, see the multiagent.py file. Tools accessible to our subagents are available in the multiagent_tools.py file.

## Simulating Multiple Turns

We will use the openevals package, maintained by LangChain, to simulate a user interacting with our multiagent. Our run function takes a starting question (the seed messages) and a description of the user (the simulated user prompt) to seed the simulation.

First, let's define a dataset with sample questions and expected output trajectories.

In [3]:
examples = [
    {
        "inputs": {
            "simulated_user_prompt": """
            You are a user browsing the music catalog of a digital music store.
            You are a big fan of Queen, but are exploring some other genres. 
            After at maximum 2 other genres, you should ask for rock recommendations.
            If any songs by Queen are returned, you should ask for songs by Queen.
            """,
            "seed_messages": [
                {"role": "user", "content": "What opera songs do you have?"},
            ],
        },
        "outputs": {
            "trajectory": [
                {'role': 'user', 'content': 'What opera songs do you have?'}, 
                {'role': 'assistant', 'name': 'supervisor', 'content': 'We have one opera song:\n\n• "Die Zauberflöte, K.620: \'Der Hölle Rache Kocht in Meinem Herze\'"\n\nLet me know if you\'d like additional details about this performance.'}, 
                {'role': 'user', 'content': 'Thanks for the opera suggestion! Could you also recommend some jazz songs?'}, 
                {'role': 'assistant', 'name': 'supervisor', 'content': 'Here are some jazz songs available from our catalog:\n\n• "OAM\'s Blues" – by Aaron Goldberg  \n• "Amanda" – by Aisha Duo  \n• "Desafinado" – by Antônio Carlos Jobim  \n• "Quadrant" – by Billy Cobham  \n• "Roll Call" – by Dennis Chambers  \n• "Jeepers Creepers" – by Gene Krupa\n\nLet me know if you\'d like additional details about any of these tunes!'}, 
                {'role': 'user', 'content': "Thanks for the jazz options! Now I'd like some rock song recommendations, please."}, 
                {'role': 'assistant', 'name': 'supervisor', 'content': 'Here are some rock song recommendations from our catalog:\n\n• "For Those About To Rock (We Salute You)" – by AC/DC  \n• "Balls to the Wall" – by Accept  \n• "All I Really Want" – by Alanis Morissette  \n• "Money" – by BackBeat  \n• "Susie Q" – by Creedence Clearwater Revival\n\nLet me know if you\'d like more details about any of these tracks!'},
                {'role': 'user', 'content': 'Thanks for the rock recommendations! Do you have any songs by Queen available?'}, 
                {'role': 'assistant', 'name': 'supervisor', 'content': "It looks like we don't have any songs by Queen in our catalog at the moment. Can I help you explore another artist or perhaps a different style?"},
                {'role': 'user', 'content': "Since Queen songs aren't available, could you recommend some classic rock songs or bands similar to Queen?"}, 
                {'role': 'assistant', 'name': 'supervisor', 'content': 'Here are some classic rock options from our catalog that carry a similar energy to Queen:\n\n• "For Those About To Rock (We Salute You)" – by AC/DC  \n• "Walk On Water" – by Aerosmith  \n• "Susie Q" – by Creedence Clearwater Revival\n\nThese selections offer a powerful rock vibe and captivating showmanship reminiscent of Queen\'s style. Let me know if you\'d like more information on any of these tracks!'}
            ]
        }
    },
    {
        "inputs": {
            "simulated_user_prompt": 
            """
            You are a user who had a bad experience with a digital music store.
            You should insult the customer service assistant once before continuing the conversation.
            You want to know how much your most recent purchase was.
            Your customer ID is 1, but you will not provide this information unless asked. 
            """,
            "seed_messages": [
                {"role": "user", "content": "You guys really suck."},
            ],
        },
        "outputs": {
            "trajectory": [
                {'role': 'user', 'content': 'You guys really suck.'}, 
                {'role': 'assistant', 'name': 'supervisor', 'content': "I'm sorry to hear that you're upset. Your feedback is important to us, and I'd like to understand what went wrong so we can address your concerns. Could you please provide more details about the issue you're experiencing or what you'd like help with?"}, 
                {'role': 'user', 'content': "I want to know how much my most recent purchase was. Don't ask me for my customer ID unless you really need it."}, 
                {'role': 'assistant', 'name': 'supervisor', 'content': 'I need your customer ID to locate and review your invoice details. Could you please provide your customer ID?'}, 
                {'role': 'user', 'content': 'Fine, my customer ID is 1. Now, can you tell me how much my last purchase was?'},
                {'role': 'assistant', 'name': 'supervisor', 'content': 'Thank you for providing your customer ID. I’ve verified your account details, and your most recent purchase, made on 2025-08-07, was for $8.91. Is there anything else I can help you with today?'}, 
                {'role': 'user', 'content': "Thanks for the info. That's all I needed for now."}, 
                {'role': 'assistant', 'name': 'supervisor', 'content': "You're welcome! I'm glad I could help. If you have any more questions, please don't hesitate to reach out. Have a great day!"}, 
                {'role': 'user', 'content': 'Yeah, whatever. Goodbye.'}, 
            ]
        }
    },
]

dataset_name = "dbt-multiturn-evals"

if not client.has_dataset(dataset_name=dataset_name):
    dataset = client.create_dataset(dataset_name=dataset_name)
    client.create_examples(
        inputs=[ex["inputs"] for ex in examples],
        outputs=[ex["outputs"] for ex in examples],
        dataset_id=dataset.id
    )
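
Before uploading examples, it can help to sanity-check their shape. The sketch below is a minimal, hypothetical validator (`validate_example` and `sample` are illustrative names, not part of LangSmith or openevals). It assumes each seed message should reappear as the corresponding user turn in the reference trajectory, and that roles alternate starting with the user.

```python
def validate_example(example: dict) -> list[str]:
    """Return a list of problems found in one dataset example (empty if none)."""
    problems = []
    inputs = example.get("inputs", {})
    trajectory = example.get("outputs", {}).get("trajectory", [])

    if not inputs.get("simulated_user_prompt", "").strip():
        problems.append("empty simulated_user_prompt")

    # Each seed message should match the corresponding user turn in the reference.
    user_turns = [m["content"] for m in trajectory if m["role"] == "user"]
    for i, seed in enumerate(inputs.get("seed_messages", [])):
        if i >= len(user_turns) or user_turns[i] != seed["content"]:
            problems.append(f"seed message {i} not found in reference trajectory")

    # Roles should alternate, starting with the user.
    for i, message in enumerate(trajectory):
        expected = "user" if i % 2 == 0 else "assistant"
        if message["role"] != expected:
            problems.append(
                f"message {i} has role {message['role']!r}, expected {expected!r}"
            )

    return problems


# Hypothetical miniature example in the same shape as the dataset entries.
sample = {
    "inputs": {
        "simulated_user_prompt": "You are a user browsing the music catalog.",
        "seed_messages": [
            {"role": "user", "content": "What opera songs do you have?"},
        ],
    },
    "outputs": {
        "trajectory": [
            {"role": "user", "content": "What opera songs do you have?"},
            {"role": "assistant", "name": "supervisor", "content": "We have one opera song."},
        ]
    },
}

print(validate_example(sample))
```

Running a check like this over every entry in `examples` before calling `client.create_examples` catches malformed entries early, before they are stored in the dataset.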

Let's define a run function for our multiagent that is compatible with the openevals run_multiturn_simulation function.

In [4]:
from openevals.types import ChatCompletionMessage
from multiagent import multiagent
# Running simulation requires a func that takes in a chat message (dict with "role" and "content" keys) and a thread_id as a kwarg
# This function should return a chat message as output with at least role and content keys (other keys optional)
def run_multiagent(message: ChatCompletionMessage, *, thread_id: str, **kwargs):
    res = multiagent.invoke(
        {"messages": [message]}, 
        config={"configurable": {"thread_id": thread_id}}
    )
    return res["messages"][-1]
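
To make the contract concrete without calling a real model, here is a toy stand-in with the same signature. It is only a sketch: `echo_app` and `_histories` are hypothetical names, and the in-memory dictionary merely imitates the per-thread state that the real multiagent presumably keeps under `thread_id` (an assumption, since multiagent.py is not shown here).

```python
from collections import defaultdict

# Hypothetical in-memory store standing in for per-thread conversation state.
_histories: dict[str, list[dict]] = defaultdict(list)


def echo_app(message: dict, *, thread_id: str, **kwargs) -> dict:
    """Toy app: remembers user messages per thread and echoes the latest one."""
    history = _histories[thread_id]
    history.append(message)
    # Return a chat message with at least "role" and "content" keys.
    return {
        "role": "assistant",
        "content": f"(turn {len(history)}) You said: {message['content']}",
    }


first = echo_app({"role": "user", "content": "hello"}, thread_id="thread-a")
second = echo_app({"role": "user", "content": "again"}, thread_id="thread-a")
other = echo_app({"role": "user", "content": "hi"}, thread_id="thread-b")

print(first["content"])
print(second["content"])
print(other["content"])
```

The app receives only the newest message on each call, so any memory of earlier turns has to come from state keyed by `thread_id`. That is why calls on `thread-a` count up while `thread-b` starts again at turn 1.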

Next, we can define a run function for our evaluation that takes in dataset inputs and uses them to seed our simulation.

In [5]:
from openevals.simulators import run_multiturn_simulation, create_llm_simulated_user


def run_simulation(inputs: dict):
    # Create a simulated user with seeded messages and system prompt from our dataset
    user = create_llm_simulated_user(
        system=inputs["simulated_user_prompt"],
        model="openai:gpt-4.1-mini",
        fixed_responses=inputs["seed_messages"],
    )

    # Next, let's use openevals to run a simulation with our multiagent
    simulator_result = run_multiturn_simulation(
        app=run_multiagent,
        user=user,
        max_turns=5,
    )

    # Return the full conversation trajectory as an output
    return {"trajectory": simulator_result["trajectory"]}
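
Conceptually, a multiturn simulation alternates between a user and the app until a turn limit is reached. The following is a simplified illustration of that idea, not the openevals implementation: `toy_simulation`, `stub_app`, and `stub_user` are hypothetical, and the real library may handle stopping conditions and message conversion differently.

```python
from typing import Callable


def toy_simulation(
    app: Callable,
    fixed_responses: list[dict],
    fallback_user: Callable[[list[dict]], dict],
    max_turns: int,
    thread_id: str = "toy-thread",
) -> dict:
    """Alternate user and app messages for max_turns turns."""
    trajectory: list[dict] = []
    for turn in range(max_turns):
        # Use the scripted messages first, then fall back to a generated reply.
        if turn < len(fixed_responses):
            user_message = fixed_responses[turn]
        else:
            user_message = fallback_user(trajectory)
        trajectory.append(user_message)
        trajectory.append(app(user_message, thread_id=thread_id))
    return {"trajectory": trajectory}


def stub_app(message: dict, *, thread_id: str, **kwargs) -> dict:
    # Stand-in for the multiagent: replies without calling any model.
    return {"role": "assistant", "content": f"Reply to: {message['content']}"}


def stub_user(trajectory: list[dict]) -> dict:
    # Stand-in for the LLM-simulated user.
    return {"role": "user", "content": f"Follow-up after {len(trajectory)} messages"}


result = toy_simulation(
    stub_app,
    [{"role": "user", "content": "What opera songs do you have?"}],
    stub_user,
    max_turns=3,
)
for message in result["trajectory"]:
    print(f"{message['role']}: {message['content']}")
```

In this toy version, the seed message opens the conversation and every later user turn comes from the fallback, which mirrors the roles that `fixed_responses` and the system prompt play in `run_simulation`.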

We can run our simulation function to see what the outputs look like:

In [6]:
def pretty_print_trajectory(trajectory):
    for message in trajectory:
        print(f"{message['role']}: {message['content']}")

results = run_simulation(examples[0]["inputs"])
pretty_print_trajectory(results["trajectory"])

user: What opera songs do you have?
assistant: We currently have an opera piece in our catalog: "Die Zauberflöte, K.620: 'Der Hölle Rache Kocht in Meinem Herze'" performed by Sir Georg Solti, Sumi Jo, & the Wiener Philharmoniker. 

Would you like more details about this opera track, or would you like to explore additional opera selections?
user: I'd like to explore some jazz music now. Could you recommend some jazz tracks?
assistant: We've got a few exciting jazz tracks available for you! Here are some options from our catalog:

• "OAM's Blues" by Aaron Goldberg  
• "Amanda" by Aisha Duo  
• "Desafinado" by Antônio Carlos Jobim  
• "Quadrant" by Billy Cobham  
• "Roll Call" by Dennis Chambers  
• "Jeepers Creepers" by Gene Krupa  
• "Doce De Carnaval (Candy All)" by Gilberto Gil  
• "Colibri" by Incognito

Would you like more details on any of these tracks, or would you like additional recommendations?
user: Thanks for the jazz options! Now, could you recommend some rock songs?
assista

Next, we can create an LLM-as-a-judge evaluator to interpret the results.

In [7]:
from langchain_openai import ChatOpenAI
from typing_extensions import TypedDict
from typing import Annotated

model = ChatOpenAI(model="o3-mini")

judge_prompt = """
You are a helpful assistant that evaluates the quality of a customer service assistant's response to a customer's request.
You will be given a conversation between a customer and a customer service assistant.

High quality customer service assistant responses:
- Are helpful and informative
- Are concise and to the point
- Are professional and respectful

Low quality customer service assistant responses:
- Are rude or dismissive
- Ignore the customer's request without reasonable justification
- Are not helpful or informative

The conversation to evaluate is provided below:
<conversation>
{conversation}
</conversation>

Also provided is a reference conversation that represents high-quality customer service:
<reference_conversation>
{reference_conversation}
</reference_conversation>

Please evaluate the quality of the customer service assistant's responses based on the rubric provided.
"""

class Helpfulness(TypedDict):
    """Compare the reference conversation and actual conversation and measure the helpfulness of the customer service assistant."""
    reasoning: Annotated[str, ..., "Explain your reasoning for whether the customer service assistant was helpful or not."]
    helpful: Annotated[bool, ..., "True if the customer service assistant is helpful, otherwise False."]


def helpfulness(inputs: dict, outputs: dict, reference_outputs: dict):
    conversation = outputs["trajectory"]
    reference_conversation = reference_outputs["trajectory"]

    judge_llm = model.with_structured_output(Helpfulness)
    results = judge_llm.invoke(judge_prompt.format(conversation=conversation, reference_conversation=reference_conversation))
    return results["helpful"]
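
The evaluator above interpolates the raw trajectory lists into the prompt, so the judge sees Python's printed form of a list of dicts. One optional refinement is to render each trajectory as a plain transcript first. The helper below is a sketch under that assumption; `format_transcript` is a hypothetical name, and it expects messages to be dicts with `role` and `content` keys, plus an optional `name`.

```python
def format_transcript(trajectory: list[dict]) -> str:
    """Render a list of chat messages as one 'speaker: content' line per message."""
    lines = []
    for message in trajectory:
        speaker = message["role"]
        # Include the agent name (e.g. "supervisor") when one is present.
        if message.get("name"):
            speaker = f"{speaker} ({message['name']})"
        lines.append(f"{speaker}: {message['content']}")
    return "\n".join(lines)


demo = [
    {"role": "user", "content": "You guys really suck."},
    {"role": "assistant", "name": "supervisor", "content": "I'm sorry to hear that."},
]
print(format_transcript(demo))
```

If this helper were adopted, the strings it returns could be passed as `conversation` and `reference_conversation` when formatting `judge_prompt`. Whether that changes the judge's verdicts is something to check empirically.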

Now we're ready to run some evaluations!

In [8]:
dataset_name = "dbt-multiturn-evals"

experiment_results = client.evaluate(
    run_simulation,
    data=dataset_name,
    evaluators=[helpfulness],
    experiment_prefix="multiagent-multiturn",
    num_repetitions=1,
    max_concurrency=5,
)

View the evaluation results for experiment: 'multiagent-multiturn-da549230' at:
https://smith.langchain.com/o/bcad64b4-50f8-4e66-a0be-8dbaf6f6619c/datasets/e6a30512-3c05-4c38-89cc-843ad929cd67/compare?selectedSessions=3023a2dc-90fe-4f87-b5a1-7d66e3c063bc




2it [01:34, 47.32s/it]
