# LangSmith Repeat Evaluation

- Author: [Hwayoung Cha](https://github.com/forwardyoung)
- Peer Review: []()
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)


## Overview

> Repeat evaluation measures a model's performance more reliably by running the evaluation several times on the same dataset.

This notebook demonstrates how to use `LangSmith` to add repetitions to an experiment. It covers setting up the evaluation workflow, running the same evaluation with two different models (GPT and Ollama), and reviewing the results for consistency.

Repeating an evaluation is useful in the following cases:

- For larger evaluation sets
- For chains that can generate variable responses
- For evaluations that can produce variable scores (e.g., `llm-as-judge`)

For details on running an evaluation, see [How to run an evaluation](https://docs.smith.langchain.com/evaluation/how_to_guides/evaluate_llm_application#evaluate-on-a-dataset-with-repetitions) in the LangSmith documentation.
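To make the motivation concrete, here is a small simulation of a judge whose verdict varies from run to run. It does not use `LangSmith`; `noisy_judge`, `estimate_score`, and the 0.7 pass probability are assumptions chosen only for illustration. The point is that an average over several repetitions fluctuates less than a single run.

```python
import random
import statistics


def noisy_judge(pass_probability: float, rng: random.Random) -> int:
    """Hypothetical stand-in for a variable evaluator: returns 1 (pass) or 0 (fail)."""
    return 1 if rng.random() < pass_probability else 0


def estimate_score(pass_probability: float, num_repetitions: int, rng: random.Random) -> float:
    """Average the judge's score over `num_repetitions` independent runs."""
    scores = [noisy_judge(pass_probability, rng) for _ in range(num_repetitions)]
    return sum(scores) / len(scores)


rng = random.Random(0)  # fixed seed so the simulation is reproducible
trials = 2000

# Spread of the estimated score when each trial uses 1 run vs. 5 repeated runs.
single_estimates = [estimate_score(0.7, 1, rng) for _ in range(trials)]
repeated_estimates = [estimate_score(0.7, 5, rng) for _ in range(trials)]

spread_single = statistics.pstdev(single_estimates)
spread_repeated = statistics.pstdev(repeated_estimates)

print(f"std of estimate with 1 run : {spread_single:.3f}")
print(f"std of estimate with 5 runs: {spread_repeated:.3f}")
```

The spread for five repetitions should come out clearly smaller than for a single run, which is the effect that repeated evaluation relies on.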

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Performing Repetitive Evaluations with num_repetitions](#performing-repetitive-evaluations-with-num_repetitions)
- [Define a function for RAG performance testing](#define-a-function-for-rag-performance-testing)
- [Repetitive evaluation of RAG using GPT models](#repetitive-evaluation-of-rag-using-gpt-models)
- [Repetitive evaluation of RAG using Ollama models](#repetitive-evaluation-of-rag-using-ollama-models)

## References
- [How to run an evaluation](https://docs.smith.langchain.com/evaluation/how_to_guides/evaluate_llm_application#evaluate-on-a-dataset-with-repetitions)
- [How to evaluate with repetitions](https://docs.smith.langchain.com/evaluation/how_to_guides/repetition)

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides easy-to-use environment setup, functions, and utilities for the tutorials.
- See the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) repository for more details.

In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial




In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_openai",
        "langchain_core",
        "langchain_community",
        "langchain_ollama",
        "faiss-cpu"
    ],
    verbose=False,
    upgrade=False,
)

In [3]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_PROJECT": "Repeat-Evaluations"
    }
)

Environment variables have been set successfully.


Alternatively, you can set `OPENAI_API_KEY` in a `.env` file and load it.

**[Note]** This step is not necessary if you have already set `OPENAI_API_KEY` in the previous step.

In [4]:
# Configuration file to manage API keys as environment variables
from dotenv import load_dotenv

# Load API key information
load_dotenv(override=True)

True

## Performing Repetitive Evaluations with `num_repetitions`

`LangSmith` provides a simple way to run repeated evaluations through the `num_repetitions` parameter of the `evaluate` function. This parameter specifies how many times each example in your dataset is run and evaluated.

When you set `num_repetitions=N`, `LangSmith` will:

- Run each example in your dataset N times.
- Aggregate the results across repetitions, which gives a more reliable measure of your model's performance.

For example, if your dataset has 10 examples and you set `num_repetitions=5`, each example is evaluated 5 times, for a total of 50 runs.
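The run count in this example is simply examples × repetitions. The sketch below mimics that bookkeeping in plain Python. It does not call `LangSmith` and makes no claim about how `LangSmith` schedules runs internally; `plan_runs` and the example IDs are hypothetical names used only for illustration.

```python
from collections import Counter


def plan_runs(example_ids: list[str], num_repetitions: int) -> list[tuple[str, int]]:
    """Return one (example_id, repetition_index) pair per run that would be executed."""
    return [
        (example_id, repetition)
        for repetition in range(num_repetitions)
        for example_id in example_ids
    ]


example_ids = [f"example-{i}" for i in range(10)]  # hypothetical dataset of 10 examples
runs = plan_runs(example_ids, num_repetitions=5)

print(len(runs))  # 10 examples x 5 repetitions = 50 runs

# Every example should appear once per repetition.
runs_per_example = Counter(example_id for example_id, _ in runs)
print(set(runs_per_example.values()))  # {5}
```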

## Define a function for RAG performance testing

In [5]:
# PDFRAG comes from the tutorial's local `myrag` module (it is not installed by the steps above)
from myrag import PDFRAG


# Create a function to generate responses to questions.
def ask_question_with_llm(llm):
    # Create a PDFRAG object
    rag = PDFRAG(
        "data/Newwhitepaper_Agents2.pdf",
        llm,
    )

    # Create a retriever
    retriever = rag.create_retriever()

    # Create a chain
    rag_chain = rag.create_chain(retriever)

    def _ask_question(inputs: dict):
        # Context retrieval for the question
        context = retriever.invoke(inputs["question"])
        # Combine the retrieved documents into a single string.
        context = "\n".join([doc.page_content for doc in context])
        # Return a dictionary containing the question, context, and answer.
        return {
            "question": inputs["question"],
            "context": context,
            "answer": rag_chain.invoke(inputs["question"]),
        }

    return _ask_question
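The function returned by `ask_question_with_llm` is the target that `evaluate` calls for each example: it receives the example's `inputs` dictionary and returns a dictionary of outputs. The sketch below reproduces that shape with stand-ins, so it runs without a PDF, a vector store, or an LLM. `FakeDoc`, `fake_retrieve`, and `fake_chain` are hypothetical and are not part of `myrag`.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved document."""

    page_content: str


def fake_retrieve(question: str) -> list[FakeDoc]:
    # Stand-in for retriever.invoke(question)
    return [FakeDoc("Agents use tools."), FakeDoc("Agents plan with ReAct.")]


def fake_chain(question: str) -> str:
    # Stand-in for rag_chain.invoke(question)
    return f"Stub answer to: {question}"


def ask_question(inputs: dict) -> dict:
    # Same structure as _ask_question: retrieve, join the documents, then answer.
    docs = fake_retrieve(inputs["question"])
    context = "\n".join(doc.page_content for doc in docs)
    return {
        "question": inputs["question"],
        "context": context,
        "answer": fake_chain(inputs["question"]),
    }


outputs = ask_question({"question": "What is an agent?"})
print(sorted(outputs))  # ['answer', 'context', 'question']
```

The evaluator used later in this notebook reads `answer` and `context` from these outputs, so both keys need to be present.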

In addition to a GPT model, this tutorial uses the `llama3.2` model through Ollama for repeated evaluations. Install [`Ollama`](https://ollama.com/) on your local machine and run `ollama pull llama3.2` to download the model before proceeding.

Below is an example of loading and invoking the model:

In [6]:
from langchain_ollama import ChatOllama

# Load the Ollama model
ollama = ChatOllama(model="llama3.2")

# Call the Ollama model
ollama.invoke("hello") 

AIMessage(content='Hello! How can I assist you today?', additional_kwargs={}, response_metadata={'model': 'llama3.2', 'created_at': '2025-01-16T16:13:33.4880891Z', 'done': True, 'done_reason': 'stop', 'total_duration': 4804983300, 'load_duration': 2912003200, 'prompt_eval_count': 26, 'prompt_eval_duration': 1218000000, 'eval_count': 10, 'eval_duration': 673000000, 'message': Message(role='assistant', content='', images=None, tool_calls=None)}, id='run-18e729df-2816-4de2-90f8-f4ac34a4e94b-0', usage_metadata={'input_tokens': 26, 'output_tokens': 10, 'total_tokens': 36})

In [7]:
from langchain_openai import ChatOpenAI

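# Build the RAG question-answering function with a GPT model.
# temperature=1.0 allows the responses to vary between repetitions.
gpt_chain = ask_question_with_llm(ChatOpenAI(model="gpt-4o-mini", temperature=1.0))

# Build the same function with the Ollama model.
ollama_chain = ask_question_with_llm(ChatOllama(model="llama3.2"))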

## Repetitive evaluation of RAG using GPT models

In [8]:
from langsmith.evaluation import evaluate, LangChainStringEvaluator

# Create a chain-of-thought QA evaluator, judged by GPT-4o-mini
cot_qa_evaluator = LangChainStringEvaluator(
    "cot_qa",
    config={"llm": ChatOpenAI(model="gpt-4o-mini", temperature=0)},
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        # The retrieved context (not the dataset's reference answer) serves as the reference
        "reference": run.outputs["context"],
        "input": example.inputs["question"],
    },
)

dataset_name = "RAG_EVAL_DATASET"

# Run the evaluation
evaluate(
    gpt_chain,
    data=dataset_name,
    evaluators=[cot_qa_evaluator],
    experiment_prefix="REPEAT_EVAL",
    # Specify the experiment metadata.
    metadata={
        "variant": "Perform repeat evaluation. GPT-4o-mini model (cot_qa)",
    },
    # Run and evaluate every example three times.
    num_repetitions=3,
)

View the evaluation results for experiment: 'REPEAT_EVAL-b3299690' at:
https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=c23fb470-e26c-410c-8549-395ce25b6a74





|  | inputs.question | outputs.question | outputs.context | outputs.answer | error | reference.answer | feedback.COT Contextual Accuracy | execution_time | example_id | id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | What are the three targeted learnings to enhan... | What are the three targeted learnings to enhan... | Agents\n33\nSeptember 2024\nEnhancing model pe... | The three targeted learnings to enhance model ... |  | The three targeted learning approaches to enha... | 0 | 4.690003 | 0e661de4-636b-425d-8f6e-0a52b8070576 | a5b4a8d3-4563-4121-880b-319de3a6ca61 |
| 1 | What are the key functions of an agent's orche... | What are the key functions of an agent's orche... | implementation of the agent orchestration laye... | The key functions of an agent's orchestration ... |  | The key functions of an agent's orchestration ... | 1 | 3.266062 | 3561c6fe-6ed4-4182-989a-270dcd635f32 | b812d7ef-f43b-4281-8883-49babe0331d8 |
| 2 | List up the name of the authors | List up the name of the authors | Agents\nAuthors: Julia Wiesinger, Patrick Marl... | Julia Wiesinger, Patrick Marlow, Vladimir Vusk... |  | The authors are Julia Wiesinger, Patrick Marlo... | 1 | 2.151092 | b03e98d1-44ad-4142-8dfa-7b0a31a57096 | fd43ff73-fbe9-4462-9548-ccd4699e36a6 |
| 3 | What is Tree-of-thoughts? | What is Tree-of-thoughts? | weaknesses depending on the specific applicati... | Tree-of-thoughts (ToT) is a prompt engineering... |  | Tree-of-thoughts (ToT) is a prompt engineering... | 1 | 2.683703 | be18ec98-ab18-4f30-9205-e75f1cb70844 | a3af1502-c3a8-4ccf-a9b4-f215f7627fab |
| 4 | What is the framework used for reasoning and p... | What is the framework used for reasoning and p... | reasoning frameworks (CoT, ReAct, etc.) to \nf... | The frameworks used for reasoning and planning... |  | The frameworks used for reasoning and planning... | 1 | 4.457662 | eb4b29a7-511c-4f78-a08f-2d5afeb84320 | d0403e56-289b-4d9c-909f-2596e541ba44 |
| 5 | How do agents differ from standalone language ... | How do agents differ from standalone language ... | 1.\t Agents extend the capabilities of languag... | Agents extend the capabilities of language mod... |  | Agents can use tools to access real-time data ... | 1 | 3.897497 | f4a5a0cf-2d2e-4e15-838a-bc8296eb708b | 2cbfba4a-f93e-46a6-90c1-e9c3038615b6 |
| 6 | What are the three targeted learnings to enhan... | What are the three targeted learnings to enhan... | Agents\n33\nSeptember 2024\nEnhancing model pe... | The three targeted learnings to enhance model ... |  | The three targeted learning approaches to enha... | 1 | 4.40145 | 0e661de4-636b-425d-8f6e-0a52b8070576 | 8fa1b780-72e8-4a80-bb6d-b08a7399aae3 |
| 7 | What are the key functions of an agent's orche... | What are the key functions of an agent's orche... | implementation of the agent orchestration laye... | The key functions of an agent's orchestration ... |  | The key functions of an agent's orchestration ... | 1 | 3.007212 | 3561c6fe-6ed4-4182-989a-270dcd635f32 | 6cbbda3f-8145-495a-9840-bc01dbde7f63 |
| 8 | List up the name of the authors | List up the name of the authors | Agents\nAuthors: Julia Wiesinger, Patrick Marl... | The authors are Julia Wiesinger, Patrick Marlo... |  | The authors are Julia Wiesinger, Patrick Marlo... | 1 | 1.867887 | b03e98d1-44ad-4142-8dfa-7b0a31a57096 | 3dfa685d-f963-4224-92de-a587d36d5a05 |
| 9 | What is Tree-of-thoughts? | What is Tree-of-thoughts? | weaknesses depending on the specific applicati... | Tree-of-thoughts (ToT) is a prompt engineering... |  | Tree-of-thoughts (ToT) is a prompt engineering... | 1 | 2.559536 | be18ec98-ab18-4f30-9205-e75f1cb70844 | a2309b4c-d65b-48ec-b779-ffe25ca9de0b |


![13-langsmith-repeat-evaluation-01](./assets/13-langSmith-repeat-evaluation-01.png)
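One way to read a repeated experiment is to group the scores by `example_id` and look for examples whose score changes between repetitions. The sketch below does this for the ten rows displayed above, with example IDs shortened to their first eight characters. It covers only the displayed rows, not the full experiment, and `summarize` is a hypothetical helper rather than a `LangSmith` function.

```python
from collections import defaultdict
from statistics import mean

# (example_id prefix, "COT Contextual Accuracy" score), copied from the rows shown above.
rows = [
    ("0e661de4", 0), ("3561c6fe", 1), ("b03e98d1", 1), ("be18ec98", 1),
    ("eb4b29a7", 1), ("f4a5a0cf", 1), ("0e661de4", 1), ("3561c6fe", 1),
    ("b03e98d1", 1), ("be18ec98", 1),
]


def summarize(rows):
    """Group scores by example; report the mean and whether repetitions disagree."""
    scores_by_example = defaultdict(list)
    for example_id, score in rows:
        scores_by_example[example_id].append(score)
    return {
        example_id: {"mean": mean(scores), "unstable": len(set(scores)) > 1}
        for example_id, scores in scores_by_example.items()
    }


summary = summarize(rows)
unstable = [example_id for example_id, stats in summary.items() if stats["unstable"]]
print(unstable)  # ['0e661de4']
```

In the displayed rows, the same question about the three targeted learnings scored 0 in one repetition and 1 in another, which a single run would not have revealed.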

## Repetitive evaluation of RAG using Ollama models

In [9]:
# Create a chain-of-thought QA evaluator, judged by GPT-4o-mini
cot_qa_evaluator = LangChainStringEvaluator(
    "cot_qa",
    config={"llm": ChatOpenAI(model="gpt-4o-mini", temperature=0)},
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        # The retrieved context (not the dataset's reference answer) serves as the reference
        "reference": run.outputs["context"],
        "input": example.inputs["question"],
    },
)

dataset_name = "RAG_EVAL_DATASET"

# Run the evaluation
evaluate(
    ollama_chain,
    data=dataset_name,
    evaluators=[cot_qa_evaluator],
    experiment_prefix="REPEAT_EVAL",
    # Specify the experiment metadata.
    metadata={
        "variant": "Perform repeat evaluation. Ollama(llama3.2) (cot_qa)",
    },
    # Run and evaluate every example three times.
    num_repetitions=3,
)

View the evaluation results for experiment: 'REPEAT_EVAL-902762c1' at:
https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=fc097110-f9b7-4a1e-88d4-1d5791c5608e





|  | inputs.question | outputs.question | outputs.context | outputs.answer | error | reference.answer | feedback.COT Contextual Accuracy | execution_time | example_id | id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | What are the three targeted learnings to enhan... | What are the three targeted learnings to enhan... | Agents\n33\nSeptember 2024\nEnhancing model pe... | In-context learning and fine-tuning based lear... |  | The three targeted learning approaches to enha... | 0 | 62.753057 | 0e661de4-636b-425d-8f6e-0a52b8070576 | adf296ae-4a55-4a59-a0fc-db38c23bb553 |
| 1 | What are the key functions of an agent's orche... | What are the key functions of an agent's orche... | implementation of the agent orchestration laye... | Based on the provided context, the orchestrati... |  | The key functions of an agent's orchestration ... | 1 | 53.047706 | 3561c6fe-6ed4-4182-989a-270dcd635f32 | f787c085-db6a-477f-84ef-12f97189e02c |
| 2 | List up the name of the authors | List up the name of the authors | Agents\nAuthors: Julia Wiesinger, Patrick Marl... | The names of the authors are:\n\n1. Julia Wies... |  | The authors are Julia Wiesinger, Patrick Marlo... | 0 | 45.92155 | b03e98d1-44ad-4142-8dfa-7b0a31a57096 | 8882d384-8016-43b6-9ff1-d506a8b02041 |
| 3 | What is Tree-of-thoughts? | What is Tree-of-thoughts? | weaknesses depending on the specific applicati... | Tree-of-thoughts (ToT) is a prompt engineering... |  | Tree-of-thoughts (ToT) is a prompt engineering... | 1 | 47.277102 | be18ec98-ab18-4f30-9205-e75f1cb70844 | 1bfb2833-187c-4252-90e8-c0ecef15923e |
| 4 | What is the framework used for reasoning and p... | What is the framework used for reasoning and p... | reasoning frameworks (CoT, ReAct, etc.) to \nf... | The framework mentioned in the context for rea... |  | The frameworks used for reasoning and planning... | 1 | 66.69021 | eb4b29a7-511c-4f78-a08f-2d5afeb84320 | fbcd970e-b0c4-4672-a26e-e5b1888d7e8d |
| 5 | How do agents differ from standalone language ... | How do agents differ from standalone language ... | 1.\t Agents extend the capabilities of languag... | According to the context, agents differ from s... |  | Agents can use tools to access real-time data ... | 1 | 67.281025 | f4a5a0cf-2d2e-4e15-838a-bc8296eb708b | aadb7c12-f60d-483a-9d38-a45601875fd7 |
| 6 | What are the three targeted learnings to enhan... | What are the three targeted learnings to enhan... | Agents\n33\nSeptember 2024\nEnhancing model pe... | In-context learning and Fine-tuning based lear... |  | The three targeted learning approaches to enha... | 0 | 100.027999 | 0e661de4-636b-425d-8f6e-0a52b8070576 | e2555674-21c3-4b13-b8ed-e2e20fab6b5e |
| 7 | What are the key functions of an agent's orche... | What are the key functions of an agent's orche... | implementation of the agent orchestration laye... | Based on the retrieved context, it appears tha... |  | The key functions of an agent's orchestration ... | 1 | 149.409035 | 3561c6fe-6ed4-4182-989a-270dcd635f32 | 58500837-8434-4854-a5e2-391715d10dd7 |
| 8 | List up the name of the authors | List up the name of the authors | Agents\nAuthors: Julia Wiesinger, Patrick Marl... | The names of the authors are:\n\n1. Julia Wies... |  | The authors are Julia Wiesinger, Patrick Marlo... | 1 | 95.070125 | b03e98d1-44ad-4142-8dfa-7b0a31a57096 | 00243009-135d-4838-88e5-414b23a7d075 |
| 9 | What is Tree-of-thoughts? | What is Tree-of-thoughts? | weaknesses depending on the specific applicati... | Tree-of-thoughts (ToT) is a prompt engineering... |  | Tree-of-thoughts (ToT) is a prompt engineering... | 1 | 54.378582 | be18ec98-ab18-4f30-9205-e75f1cb70844 | c7e603ce-1ada-4185-b85b-2ac17d266317 |


![13-langsmith-repeat-evaluation-02](./assets/13-langSmith-repeat-evaluation-02.png)
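Because both experiments ran against the same dataset, the displayed rows can be compared side by side. The sketch below averages the `feedback.COT Contextual Accuracy` and `execution_time` columns for the ten rows shown in each output table. The values are copied from those tables, `compare` is a hypothetical helper, and the result reflects only the displayed rows, not the full three-repetition experiments.

```python
from statistics import mean

# (score, execution_time) pairs copied from the ten rows displayed for each experiment.
displayed_rows = {
    "gpt-4o-mini": [
        (0, 4.690003), (1, 3.266062), (1, 2.151092), (1, 2.683703), (1, 4.457662),
        (1, 3.897497), (1, 4.40145), (1, 3.007212), (1, 1.867887), (1, 2.559536),
    ],
    "llama3.2": [
        (0, 62.753057), (1, 53.047706), (0, 45.92155), (1, 47.277102), (1, 66.69021),
        (1, 67.281025), (0, 100.027999), (1, 149.409035), (1, 95.070125), (1, 54.378582),
    ],
}


def compare(rows_by_model):
    """Return the mean score and mean execution time per model."""
    return {
        model: {
            "mean_score": mean(score for score, _ in rows),
            "mean_time": mean(elapsed for _, elapsed in rows),
        }
        for model, rows in rows_by_model.items()
    }


comparison = compare(displayed_rows)
for model, stats in comparison.items():
    print(f"{model}: mean score {stats['mean_score']:.2f}, mean time {stats['mean_time']:.2f}")
```

With only ten rows per model, the gap between the two averages is a rough indication at best; the experiment comparison view linked in each output is the better place to inspect the complete results.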