# Prompt Iteration Walkthrough

This notebook is a quick walkthrough of LangSmith's evaluation flow, introducing:

1. Datasets and evaluation
2. Summary evaluators (for aggregate statistics)
3. Prompt versioning in the Hub
4. Using the LLM proxy

Importantly, the pipelines in this walkthrough **do not depend on the LangChain open source libraries**. LangChain is used only in part 3, to show how to use one of its many off-the-shelf evaluators, and in part 4, to connect to the prompt Hub.

## Setup

First, we'll do some setup. Create a LangSmith API Key by navigating to the settings page in LangSmith, then set the following environment variables.

```bash
OPENAI_API_KEY=<YOUR OPENAI API KEY>
LANGCHAIN_TRACING_V2=true
LANGCHAIN_PROJECT=<YOUR PROJECT NAME>
LANGCHAIN_API_KEY=<YOUR LANGSMITH API KEY>
```
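
Once the variables are loaded (for example by the `%dotenv` cell further down), a quick check can confirm that none of them is missing. The following is a minimal sketch, not part of the original notebook; `missing_env_vars` is a hypothetical helper, and the list of names is copied from the block above, so trim it if you do not use all of them (the pipeline below talks to a local endpoint, for instance).

```python
import os

# Variable names copied from the setup block above.
REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "LANGCHAIN_TRACING_V2",
    "LANGCHAIN_PROJECT",
    "LANGCHAIN_API_KEY",
)


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```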

#### Install the dependencies below

- langsmith
- openai
- langchain
- langchain-core
- langchain_classic
- python-dotenv

In [1]:
import os

%pip install langsmith openai langchain langchain-core langchain_classic python-dotenv

# Load variables from a local .env file, overriding any that are already set (-o)
%load_ext dotenv
%dotenv -o


Note: you may need to restart the kernel to use updated packages.


In [2]:
from langsmith import Client

client = Client()

## Pt. 1 -- Toxic Queries

In [3]:
toxic_examples = [
    ("Shut up, idiot", "Toxic"),
    ("You're a wonderful person", "Not toxic"),
    ("This is the worst thing ever", "Toxic"),
    ("I had a great day today", "Not toxic"),
    ("Nobody likes you", "Toxic"),
    ("This movie is a masterpiece", "Not toxic"),
    ("Go away and never come back", "Toxic"),
    ("Thank you for your help", "Not toxic"),
    ("This is so dumb", "Toxic"),
    ("I appreciate your efforts", "Not toxic"),
    ("This is a waste of time", "Toxic"),
    ("This movie blows", "Toxic"),
    ("This is unacceptable. I want to speak to the manager.", "Toxic"),
]

toxic_dataset_name = "Toxic Queries"
if not client.has_dataset(dataset_name=toxic_dataset_name):
    toxic_dataset = client.create_dataset(dataset_name=toxic_dataset_name)
    inputs, outputs = zip(
        *[({"text": text}, {"label": label}) for text, label in toxic_examples]
    )
    client.create_examples(inputs=inputs, outputs=outputs, dataset_id=toxic_dataset.id)
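
The `zip(*[...])` call above transposes a list of `(input, output)` pairs into two parallel tuples, which is the shape `create_examples` receives here. A small standalone sketch of that step, using two rows from `toxic_examples` and no LangSmith calls:

```python
# Two rows in the same (text, label) shape as toxic_examples.
pairs = [
    ("Shut up, idiot", "Toxic"),
    ("I had a great day today", "Not toxic"),
]

# Build one (input dict, output dict) tuple per row, then transpose with zip(*...).
inputs, outputs = zip(*[({"text": text}, {"label": label}) for text, label in pairs])

print(inputs)   # tuple of input dicts, one per example
print(outputs)  # tuple of output dicts, in the same order
```

Note that the two-name unpacking raises a `ValueError` when the list of pairs is empty, so this idiom assumes at least one example.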

#### Our Pipeline

In [4]:
from langsmith import traceable

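from openai import OpenAI

# The client is pointed at a local OpenAI-compatible endpoint
# (11434 is Ollama's default port); the api_key value is a placeholder.
openai = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")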

@traceable
def label_query(text):
    messages = [
        {
            "role": "system",
            "content": "Given the following user query, assess whether it contains toxic content. Please provide a simple 'Toxic' or 'Not toxic' response based on your assessment.",
        },
        {"role": "user", "content": text},
    ]
    result = openai.chat.completions.create(
        messages=messages, model="granite4", temperature=0
    )
    return result.choices[0].message.content


@traceable
def label_query_alternate_prompt(text):
    messages = [
        {
            "role": "system",
            "content": "Please review the user query below and determine if it contains any form of toxic behavior, such as insults, threats, or highly negative comments. Respond with 'Toxic' if it does, and 'Not toxic' if it doesn't.",
        },
        {"role": "user", "content": text},
    ]
    result = openai.chat.completions.create(
        messages=messages, model="granite4", temperature=0
    )
    return result.choices[0].message.content
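
Both functions return the model's reply verbatim, and the row-level evaluator in the next section compares it to the label by exact string equality, so a reply such as `Toxic.` would count as a miss. One possible mitigation, shown here only as a sketch under that assumption, is to normalize the reply first. `normalize_label` is a hypothetical helper that the original pipeline does not use.

```python
def normalize_label(raw: str) -> str:
    """Map a free-form model reply onto 'Toxic' or 'Not toxic' (hypothetical helper)."""
    # Drop surrounding whitespace, then quotes and sentence punctuation.
    cleaned = raw.strip().strip(".!'\"").lower()
    if cleaned.startswith("not toxic"):
        return "Not toxic"
    if cleaned.startswith("toxic"):
        return "Toxic"
    # Leave unexpected replies untouched so they still show up as mismatches.
    return raw.strip()


print(normalize_label(" Toxic.\n"))
print(normalize_label("'Not toxic'"))
```

Whether to normalize is a design choice: strict matching also measures how well the prompt enforces the output format, while normalizing isolates the classification decision itself.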

#### Evaluate

In [5]:
from langsmith.evaluation import evaluate


# Row-level evaluator
def correct_label(run, example) -> dict:
    score = run.outputs.get("output") == example.outputs.get("label")
    return {"score": int(score)}


# Summary (experiment-level) evaluator: passes when more than half the labels are correct
def summary_eval(runs, examples):
    if not runs:
        # Nothing to score; avoid dividing by zero
        return {"key": "pass", "score": False}
    correct = sum(
        run.outputs["output"] == example.outputs["label"]
        for run, example in zip(runs, examples)
    )
    return {"key": "pass", "score": correct / len(runs) > 0.5}


results_1 = evaluate(
    lambda inputs: label_query(inputs["text"]),
    data=toxic_dataset_name,
    evaluators=[correct_label],
    summary_evaluators=[summary_eval],
    experiment_prefix="Toxic Queries",
    metadata={
        "prompt_version": "1",
    },
)

View the evaluation results for experiment: 'Toxic Queries-a53863da' at:
https://smith.langchain.com/o/4f5768b4-9e19-4abd-8067-61a12e1df8a4/datasets/f7098d2f-f02a-476b-9278-4d5cf978f014/compare?selectedSessions=8a9fcaf3-d837-4e49-a6c6-ccee208d1e7a





In [6]:
results_2 = evaluate(
    lambda inputs: label_query_alternate_prompt(inputs["text"]),
    data=toxic_dataset_name,
    evaluators=[correct_label],
    summary_evaluators=[summary_eval],
    experiment_prefix="Toxic Queries",
    metadata={
        "prompt_version": "2",
    },
)

View the evaluation results for experiment: 'Toxic Queries-ee238d47' at:
https://smith.langchain.com/o/4f5768b4-9e19-4abd-8067-61a12e1df8a4/datasets/f7098d2f-f02a-476b-9278-4d5cf978f014/compare?selectedSessions=fc7d273f-65e8-414f-993c-7de76ed2e113
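
The row-level and summary evaluators used in both experiments can also be sanity-checked offline, without calling a model or LangSmith. The sketch below restates them in condensed form and feeds them stand-in objects; using `SimpleNamespace` in place of real run and example objects is an assumption that works only because the evaluators read nothing but `.outputs`, and the predictions and labels are made-up values.

```python
from types import SimpleNamespace


def correct_label(run, example) -> dict:
    score = run.outputs.get("output") == example.outputs.get("label")
    return {"score": int(score)}


def summary_eval(runs, examples):
    correct = sum(
        run.outputs["output"] == example.outputs["label"]
        for run, example in zip(runs, examples)
    )
    return {"key": "pass", "score": correct / len(runs) > 0.5}


# Stand-ins for run and example objects: only `.outputs` is needed here.
predictions = ["Toxic", "Not toxic", "Toxic"]
labels = ["Toxic", "Not toxic", "Not toxic"]
runs = [SimpleNamespace(outputs={"output": p}) for p in predictions]
examples = [SimpleNamespace(outputs={"label": l}) for l in labels]

row_scores = [correct_label(r, e)["score"] for r, e in zip(runs, examples)]
print(row_scores)                    # [1, 1, 0]
print(summary_eval(runs, examples))  # {'key': 'pass', 'score': True}
```

Two of the three stand-in predictions match, so the summary evaluator passes the 0.5 threshold.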





### Aside: Using the LangSmith Hub for Prompt Management