## Load Environment Variables

In [None]:
from dotenv import load_dotenv

# Load environment variables
load_dotenv(dotenv_path=".env", override=True)

## Create AI Application

### Setup 

As usual, let's define our prompt and give our application access to the web.

In [None]:
# Initialize web search tool
from langchain_community.tools.tavily_search import TavilySearchResults

web_search_tool = TavilySearchResults(max_results=1)

# Define prompt template
prompt = """You are a professor and expert in explaining complex topics in a way that is easy to understand. 
Your job is to answer the provided question so that even a 5 year old can understand it. 
You have been provided with relevant background context to answer the question.

Question: {question} 

Context: {context}

Answer:"""
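To see what the model actually receives, it can help to fill in the template by hand. The sketch below uses a shortened stand-in template (`prompt_template`, a hypothetical name) with the same `{question}` and `{context}` placeholders; the question and context strings are made up for illustration.

```python
# Shortened stand-in for the notebook's prompt; it keeps the same placeholders.
prompt_template = """Question: {question}

Context: {context}

Answer:"""

# str.format substitutes each named placeholder with the matching keyword argument.
formatted = prompt_template.format(
    question="Why is the sky blue?",
    context="Sunlight scatters off air molecules, and blue light scatters the most.",
)
print(formatted)
```

Because the placeholders are named, the order of the keyword arguments does not matter, and a missing argument raises a `KeyError` instead of silently producing a broken prompt.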

### Define Application Logic

The logic here is the same as in the tracing module. We define a search step to scan the web, and an explain step for an LLM to summarize the web results.

In [3]:
from openai import OpenAI
from langsmith import traceable
from langsmith.wrappers import wrap_openai


# Create Application
openai_client = wrap_openai(OpenAI())

@traceable
def search(question):
    web_docs = web_search_tool.invoke({"query": question})
    web_results = "\n".join([d["content"] for d in web_docs])
    return web_results
    
@traceable
def explain(question, context):
    formatted = prompt.format(question=question, context=context)
    
    completion = openai_client.chat.completions.create(
        messages=[
            {"role": "system", "content": formatted},
            {"role": "user", "content": question},
        ],
        model="o3-mini",
    )
    return completion.choices[0].message.content

@traceable
def eli5(question):
    context = search(question)
    answer = explain(question, context)
    return answer


## Setup for Experiment

Now we're ready to run experiments and test our application's performance on our dataset.

### Import LangSmith Client

First, we'll create a LangSmith client to use the SDK, and specify the dataset we'd like to run our experiment on.

In [4]:
from langsmith import Client

client = Client()
dataset_name = "eli5-golden"

### Upload Dataset to LangSmith

Before running our experiment, we need to upload our dataset to LangSmith. We'll read the CSV file and create a dataset named "eli5-golden".

In [None]:
import pandas as pd
from langsmith import Client

# Read the dataset CSV file
df = pd.read_csv("dataset.csv")

# Convert to LangSmith format
examples = []
for _, row in df.iterrows():
    examples.append({
        "inputs": {"question": row["input_question"]},
        "outputs": {"output": row["output_output"]}
    })

# Create the dataset in LangSmith unless it already exists
try:
    # Reading the dataset raises an error if it does not exist yet
    existing_dataset = client.read_dataset(dataset_name=dataset_name)
    example_count = len(list(client.list_examples(dataset_name=dataset_name)))
    print(f"Dataset '{dataset_name}' already exists with {example_count} examples")
except Exception:
    # Catch Exception rather than using a bare except, so Ctrl-C still interrupts the cell
    # Dataset doesn't exist, create it
    dataset = client.create_dataset(
        dataset_name=dataset_name,
        description="ELI5 (Explain Like I'm 5) dataset for evaluating AI explanations"
    )
    
    # Add examples to the dataset
    client.create_examples(
        inputs=[ex["inputs"] for ex in examples],
        outputs=[ex["outputs"] for ex in examples],
        dataset_id=dataset.id
    )
    
    print(f"Successfully created dataset '{dataset_name}' with {len(examples)} examples")
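The row-to-example conversion can be checked offline before anything is uploaded. The following is a minimal sketch, assuming the CSV has the `input_question` and `output_output` columns used above; the two rows in `csv_text` are invented stand-ins for `dataset.csv`, and the standard-library `csv` module replaces pandas only to keep the example dependency-free.

```python
import csv
import io

# Hypothetical two-row stand-in for dataset.csv (column names taken from the notebook).
csv_text = (
    "input_question,output_output\n"
    "What is gravity?,Gravity pulls things toward the ground.\n"
    "Why is the sky blue?,Blue light bounces around the air the most.\n"
)

# Build one {"inputs": ..., "outputs": ...} record per CSV row.
examples = [
    {
        "inputs": {"question": row["input_question"]},
        "outputs": {"output": row["output_output"]},
    }
    for row in csv.DictReader(io.StringIO(csv_text))
]

# Split into the two parallel lists that are passed as inputs= and outputs=.
inputs = [ex["inputs"] for ex in examples]
outputs = [ex["outputs"] for ex in examples]
print(inputs[0], outputs[0])
```

Keeping the reference answer under the `"output"` key matters later: the evaluators read `reference_outputs["output"]`, so the key chosen here has to match.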


### Define Evaluators

#### Custom Code Evaluator

We'll first define a custom code evaluator. Custom code evaluators are useful for measuring deterministic or closed-ended metrics.

In [6]:
def conciseness(outputs: dict) -> bool:
    # split() with no argument splits on any whitespace run, including newlines
    words = outputs["output"].split()
    return len(words) <= 200

This particular custom code evaluator is a simple Python function that checks whether our application's output is at most 200 words long.
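A quick way to build confidence in an evaluator like this is to call it directly on hand-made outputs. The sketch below redefines the evaluator so it runs on its own, counting words with `str.split()` and no argument; the three sample dictionaries are made up to cover a short answer, an answer exactly at the limit, and one word over it.

```python
def conciseness(outputs: dict) -> bool:
    """Return True when the answer is at most 200 words long."""
    # split() with no argument treats any run of whitespace as one separator.
    words = outputs["output"].split()
    return len(words) <= 200

# Hypothetical sample outputs, shaped like the dict the evaluator receives.
short_answer = {"output": "The sky is blue because air scatters blue light."}
boundary_answer = {"output": " ".join(["word"] * 200)}
long_answer = {"output": " ".join(["word"] * 201)}

for name, sample in [
    ("short", short_answer),
    ("boundary", boundary_answer),
    ("long", long_answer),
]:
    print(name, conciseness(sample))
```

Checking the boundary case explicitly confirms that the limit is inclusive, which is easy to get wrong with `<` versus `<=`.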

#### LLM-as-a-Judge Evaluator

For open-ended metrics, it can be powerful to use an LLM to score the outputs.

Let's use an LLM to check whether our application produces correct outputs. First, let's define a scoring schema for our LLM to adhere to in its response.

In [7]:
from pydantic import BaseModel, Field

# Define a scoring schema that our LLM must adhere to
class CorrectnessScore(BaseModel):
    """Correctness score of the answer when compared to the reference answer."""
    score: int = Field(description="The correctness score of the answer: 0 (incorrect) or 1 (correct)")

We'll define a function to give an LLM our application's outputs, alongside the reference outputs stored in our dataset. 

The LLM will then be able to reference the "right" output to judge if our application's answer meets our accuracy standards.

In [8]:
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage


def correctness(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    prompt = """
    You are an expert data labeler evaluating model outputs for correctness. Your task is to assign a score based on the following rubric:

    <Rubric>
        A correct answer:
        - Provides accurate information
        - Uses suitable analogies and examples
        - Contains no factual errors
        - Is logically consistent

        When scoring, you should penalize:
        - Factual errors
        - Incoherent analogies and examples
        - Logical inconsistencies
    </Rubric>

    <Instructions>
        - Carefully read the input and output
        - Use the reference output to determine if the model output contains errors
        - Focus on whether the model output uses accurate analogies and is logically consistent
    </Instructions>

    <Reminder>
        The analogies in the output do not need to match the reference output exactly. Focus on logical consistency.
    </Reminder>

    <input>
        {}
    </input>

    <output>
        {}
    </output>

    Use the reference outputs below to help you evaluate the correctness of the response:
    <reference_outputs>
        {}
    </reference_outputs>
    """.format(inputs["question"], outputs["output"], reference_outputs["output"])
    structured_llm = ChatOpenAI(model_name="gpt-4o", temperature=0).with_structured_output(CorrectnessScore)
    generation = structured_llm.invoke([HumanMessage(content=prompt)])
    return generation.score == 1


### Define Run Function

We'll define a function to run our application on the example inputs of our dataset. This is the function that will be called when we run our experiment.

In [9]:
# Define a function to run our application on a single dataset example
def run(inputs: dict):
    return eli5(inputs["question"])
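Both evaluators read the answer from `outputs["output"]`, so the answer has to end up under that key. One way to make that contract visible is to return the dictionary explicitly. The sketch below is an illustration under stated assumptions, not the notebook's method: `eli5_stub` and `run_explicit` are hypothetical names, and the stub replaces the real `eli5` so the example runs without any API calls.

```python
def eli5_stub(question: str) -> str:
    """Hypothetical stand-in for eli5 so the sketch runs offline."""
    return f"A simple answer to: {question}"

def run_explicit(inputs: dict) -> dict:
    """Run the application on one example and name the output key explicitly."""
    return {"output": eli5_stub(inputs["question"])}

# Each dataset example supplies its inputs as a dict with a "question" key.
result = run_explicit({"question": "Why is the sky blue?"})
print(result)
```

Returning a dictionary also leaves room for extra fields later, such as the retrieved context, without changing how the evaluators look up the answer.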

## Run Experiment

We have all the necessary components, so let's run our experiment! 

In [None]:
from langsmith import evaluate

evaluate(
    run,
    data=dataset_name,
    evaluators=[correctness, conciseness],
    experiment_prefix="eli5-o3-mini"
)