# LangSmith: Evaluation & Monitoring of Agentic AI

This notebook shows how to use **LangSmith** to evaluate and monitor agentic AI systems, including **tracking prompts, responses, and performance metrics**.

---

## Table of Contents
1. Introduction to LangSmith
2. Why Evaluation & Monitoring Matter
3. Setting Up LangSmith
4. Logging a Run with the `traceable` Decorator
5. Adding a Rating to a Run
6. Retrieving and Monitoring Runs
7. Best Practices

---

## 1. Introduction to LangSmith

**LangSmith** is a tool for **evaluating and monitoring agentic AI systems**.  
It helps you:

- Track prompts and responses from AI agents
- Evaluate outputs for correctness, relevance, and performance
- Monitor agent behavior over time for consistency and improvements

LangSmith works with **LangChain agents** and, through its Python SDK, with plain **LLM calls** wrapped in ordinary functions.

---

## 2. Why Evaluation & Monitoring Matter

When deploying AI agents in production, it's critical to:

1. **Understand agent behavior** – How the agent interprets prompts and executes tasks.
2. **Measure performance** – Accuracy, relevance, or success rate of responses.
3. **Detect failures or drift** – Monitor changes in agent outputs over time.
4. **Optimize prompts and chains** – Improve the agent iteratively using feedback.

Without monitoring, unexpected or harmful outputs can reach users in production without anyone noticing.
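
To make points 2 and 3 concrete, here is a minimal sketch of a drift check that compares the success rate of a recent batch of outputs against a baseline batch. It uses only the standard library; the function names, the pass/fail lists, and the 0.10 tolerance are illustrative assumptions, not part of LangSmith.

```python
def success_rate(outcomes):
    """Fraction of True values in a list of pass/fail outcomes."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(outcomes) / len(outcomes)


def has_drifted(baseline, recent, tolerance=0.10):
    """True when the recent success rate is more than `tolerance` below the baseline."""
    return success_rate(baseline) - success_rate(recent) > tolerance


# Made-up outcomes for illustration: 9 of 10 passed earlier, 7 of 10 passed recently.
baseline_outcomes = [True] * 9 + [False]
recent_outcomes = [True] * 7 + [False] * 3

print(success_rate(baseline_outcomes))                  # 0.9
print(has_drifted(baseline_outcomes, recent_outcomes))  # True
```

In practice the pass/fail values would come from whatever evaluation you attach to each run, such as the feedback scores recorded later in this notebook.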

---

## 3. Setting Up LangSmith

### Installation

In [39]:
#!pip install langsmith

In [2]:
import os

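# Set your LangSmith API key (avoid committing a real key to version control)
os.environ["LANGSMITH_API_KEY"] = "your_key"
# Enable tracing so that @traceable functions send their runs to LangSmith
os.environ["LANGSMITH_TRACING"] = "true"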


In [38]:
from langsmith import Client
from langsmith.run_helpers import traceable

# Initialize the client
client = Client()
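
Runs are only sent to LangSmith when the environment is configured for it. The sketch below checks for the two variables a recent SDK release typically expects, `LANGSMITH_API_KEY` and `LANGSMITH_TRACING`. Treat the variable list as an assumption: older releases used `LANGCHAIN_`-prefixed names, and `missing_settings` is a hypothetical helper, not an SDK function.

```python
import os

# Assumed variable names for a recent LangSmith SDK; adjust for your version.
REQUIRED_SETTINGS = ("LANGSMITH_API_KEY", "LANGSMITH_TRACING")


def missing_settings(env=None):
    """Return the required variable names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


# Example with an explicit dictionary instead of the real environment.
example_env = {"LANGSMITH_API_KEY": "your_key"}
print(missing_settings(example_env))  # ['LANGSMITH_TRACING']
```

Calling `missing_settings()` with no argument inspects the real process environment, which is a quick sanity check to run before the first traced call.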

## 4. Logging a Run with the `traceable` Decorator

You can log a run by decorating a function with `@traceable`. The decorator records the function's **inputs and output** automatically; the run name and **metadata** are optional.


In [17]:
# Example prompt and agent response
prompt_text = "Who is the CEO of Tesla?"
response_text = "Elon Musk"

@traceable(name="QA_Agent_Run", metadata={"model": "GPT-4", "task": "QA"})
def qa_agent(prompt):
    # Simulate agent response
    return response_text

# Call the function to log the run
qa_agent(prompt_text)


'Elon Musk'

### Explanation

- `@traceable`: Decorator to log the function call as a run in LangSmith
- `name`: A human-readable name for this run
- `metadata`: Optional dictionary with model info, task type, or version
- `qa_agent(prompt)`: Function call that triggers the logging
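
To illustrate what a tracing decorator captures, here is a toy stand-in written with the standard library only. It is not LangSmith's implementation: `toy_traceable` and `captured_runs` are hypothetical names, and the records stay in a local list instead of being sent to a server.

```python
import functools

# Hypothetical in-memory store standing in for the LangSmith backend.
captured_runs = []


def toy_traceable(name, metadata=None):
    """Toy tracing decorator: records each call's inputs and output."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            output = func(*args, **kwargs)
            captured_runs.append({
                "name": name,
                "metadata": metadata or {},
                "inputs": {"args": args, "kwargs": kwargs},
                "output": output,
            })
            return output
        return wrapper
    return decorator


@toy_traceable(name="QA_Agent_Run", metadata={"model": "GPT-4", "task": "QA"})
def toy_qa_agent(prompt):
    return "Elon Musk"  # simulated response, as in the cell above


toy_qa_agent("Who is the CEO of Tesla?")
print(captured_runs[0]["name"], "->", captured_runs[0]["output"])
```

The point of the sketch is that the caller never passes inputs or outputs to the decorator explicitly; they are captured from the wrapped call itself.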


### Fetching the Latest Run from LangSmith

In this step, we connect to our LangSmith project and retrieve recent runs.  
Each **run** represents a single tracked execution (for example, a model query or an evaluation).  

We specify the `project_name` (here `"default"`) and call `list_runs()` with `limit=5` to fetch up to five recent runs from that project.  

The code checks whether any runs exist:
- If runs are found, it stores the most recent one in `latest_run` and extracts its unique `run_id`.
- If no runs are available, it simply prints a message indicating that none were found.

This `run_id` can then be used for logging feedback, inspecting results, or linking to the LangSmith dashboard.


In [32]:
# Replace with your project name in LangSmith
project_name = "default"  # or your specific project name

# Fetch up to five recent runs from the project
recent_runs = list(client.list_runs(project_name=project_name, limit=5))

if recent_runs:
    latest_run = recent_runs[0]
    run_id = latest_run.id
    print("Latest run ID:", run_id)
else:
    print("No runs found for this project.")


Latest run ID: 08d3f107-f66e-4ab5-97da-96aed14963ab
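
A project can contain nested child runs (model calls, parsers, prompt templates) alongside top-level ones, so the first item in the list is not necessarily the call you decorated. The sketch below picks the most recent top-level run from a list of records. `RunRecord` is a hypothetical stand-in that mirrors the `id`, `name`, `start_time`, and `parent_run_id` fields of a run, and the sample data is made up.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class RunRecord:
    """Hypothetical stand-in carrying a few fields that a run exposes."""
    id: str
    name: str
    start_time: datetime
    parent_run_id: Optional[str] = None


def latest_root_run(runs):
    """Return the most recently started run that has no parent, or None."""
    roots = [run for run in runs if run.parent_run_id is None]
    return max(roots, key=lambda run: run.start_time, default=None)


# Made-up records: two top-level runs and one child run.
sample_runs = [
    RunRecord("a1", "QA_Agent_Run", datetime(2024, 1, 1, 12, 0)),
    RunRecord("b2", "ChatOpenAI", datetime(2024, 1, 1, 12, 1), parent_run_id="a1"),
    RunRecord("c3", "QA_Agent_Run", datetime(2024, 1, 1, 12, 5)),
]

print(latest_root_run(sample_runs).id)  # c3
```

The same selection can often be pushed to the server by filtering the query itself; check the `list_runs` parameters available in your SDK version.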


## 5. Adding a Rating to a Run

You can rate a run to track performance and correctness.

### Explanation

- `run_id`: The ID of the run to rate
- `key`: The feedback key (e.g., "correctness")
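- `score`: Numerical score; this notebook uses a 0.0–1.0 scale
- `value`: Optional non-numeric value, used here to hold text feedback (`create_feedback` also accepts a separate `comment` argument for free-form notes)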
- Ratings help track agent **accuracy, relevance, and quality** over time
    


In [33]:

# This notebook scores feedback on a 0.0–1.0 scale
client.create_feedback(
    run_id=run_id,
    key="correctness",
    score=1.0,  # 1.0 = excellent, 0.0 = poor
    value="Correct and relevant answer."
)

print("Feedback submitted successfully for run ID:", run_id)


Feedback submitted successfully for run ID: 08d3f107-f66e-4ab5-97da-96aed14963ab
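
If your reviewers rate answers on a different scale, such as 1 to 5 stars, one option is to normalize the rating to the 0.0–1.0 range before submitting it. The helper below is a hypothetical sketch of that conversion using linear scaling; it is not part of the LangSmith SDK.

```python
def normalize_rating(rating, low=1, high=5):
    """Map a rating in [low, high] linearly onto the 0.0-1.0 range."""
    if high <= low:
        raise ValueError("high must be greater than low")
    if not low <= rating <= high:
        raise ValueError(f"rating must be between {low} and {high}")
    return (rating - low) / (high - low)


print(normalize_rating(5))  # 1.0
print(normalize_rating(3))  # 0.5
print(normalize_rating(1))  # 0.0
```

The result can be passed as the `score` argument, which keeps every feedback key on one comparable scale.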


## 6. Retrieving and Monitoring Runs

You can fetch previously logged runs to monitor trends or analyze agent performance.


In [37]:
recent_runs = list(client.list_runs(project_name=project_name, limit=5))

for run in recent_runs:
    print(f"Run ID: {run.id}")
    print(f"Name: {run.name}")
    print(f"Metadata: {run.metadata}")
    print(f"Status: {run.status}")
    print("-" * 50)

Run ID: 08d3f107-f66e-4ab5-97da-96aed14963ab
Name: OpenAIFunctionsAgentOutputParser
Metadata: {'custom-metadata-key': 'custom-metadata-value', 'revision_id': 'my-revision-id', 'ls_run_depth': 2}
Status: success
--------------------------------------------------
Run ID: 22b0a08f-fc7d-46df-8051-9f25ebcaa543
Name: ChatOpenAI
Metadata: {'custom-metadata-key': 'custom-metadata-value', 'revision_id': 'my-revision-id', 'ls_run_depth': 2}
Status: success
--------------------------------------------------
Run ID: f56df947-6260-416e-b670-b651ae5a94a0
Name: ChatPromptTemplate
Metadata: {'custom-metadata-key': 'custom-metadata-value', 'revision_id': 'my-revision-id', 'ls_run_depth': 2}
Status: success
--------------------------------------------------
Run ID: 96505265-d5e6-4629-a5a1-d00be3bde99f
Name: RunnableLambda
Metadata: {'custom-metadata-key': 'custom-metadata-value', 'revision_id': 'my-revision-id', 'ls_run_depth': 3}
Status: success
--------------------------------------------------
...
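
Printing runs one by one does not scale once a project holds many of them. As a minimal sketch of trend monitoring, the function below counts runs per status and computes the share that succeeded. It works on plain dictionaries with made-up data; with real runs you would read `run.status` instead. The `"error"` status in the sample is an assumption about how failed runs are labeled.

```python
from collections import Counter


def status_summary(runs):
    """Count runs per status and compute the share that succeeded."""
    counts = Counter(run["status"] for run in runs)
    total = sum(counts.values())
    success_share = counts.get("success", 0) / total if total else 0.0
    return counts, success_share


# Made-up records for illustration.
sample = [
    {"name": "ChatOpenAI", "status": "success"},
    {"name": "ChatPromptTemplate", "status": "success"},
    {"name": "RunnableLambda", "status": "error"},
    {"name": "ChatOpenAI", "status": "success"},
]

counts, share = status_summary(sample)
print(dict(counts), share)  # {'success': 3, 'error': 1} 0.75
```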

## 7. Best Practices

1. **Log every run** in production for auditing.
2. Include **metadata** (model version, task, environment) for filtering.
3. Regularly **evaluate agent responses** to ensure quality.
4. Use ratings to **iterate on prompts, chains, and tools**.
5. Monitor performance trends via **LangSmith dashboards**.
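
As a sketch of point 4, the code below averages feedback scores per key and lists the keys whose average falls under a threshold, which suggests where prompts or tools need attention first. The dictionaries, the 0.7 threshold, and the function names are illustrative assumptions rather than LangSmith features.

```python
from collections import defaultdict


def mean_score_by_key(feedback):
    """Average the numeric scores for each feedback key."""
    scores = defaultdict(list)
    for item in feedback:
        if item["score"] is not None:
            scores[item["key"]].append(item["score"])
    return {key: sum(values) / len(values) for key, values in scores.items()}


def keys_below(feedback, threshold=0.7):
    """Return, in sorted order, the keys whose mean score is under `threshold`."""
    means = mean_score_by_key(feedback)
    return sorted(key for key, avg in means.items() if avg < threshold)


# Made-up feedback entries on a 0.0-1.0 scale.
sample_feedback = [
    {"key": "correctness", "score": 1.0},
    {"key": "correctness", "score": 0.5},
    {"key": "relevance", "score": 0.5},
    {"key": "relevance", "score": 0.25},
]

print(mean_score_by_key(sample_feedback))  # {'correctness': 0.75, 'relevance': 0.375}
print(keys_below(sample_feedback))         # ['relevance']
```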
