# 🤖 Deep Agents Quickstart with TruLens Evaluation

This notebook shows how to:
1. Create a Deep Agent with search capabilities
2. Instrument it with TruLens using TruGraph (since Deep Agents is built on LangGraph)
3. Evaluate the agent with Agent GPA metrics

Based on: https://docs.langchain.com/oss/python/deepagents/quickstart

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/truera/trulens/blob/main/examples/quickstart/deep_agents_quickstart.ipynb)

## Install Dependencies

In [None]:
!pip install deepagents tavily-python trulens trulens-apps-langgraph trulens-providers-openai langchain-openai -q

## Set Up API Keys

In [None]:
import os

# Set your API keys (or use environment variables)
# os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
# os.environ["TAVILY_API_KEY"] = "your-tavily-api-key"
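
Both keys are read later in the notebook, so it can help to check for them up front rather than hit a `KeyError` at the Tavily client. The following is a minimal sketch; `REQUIRED_KEYS` and `missing_keys` are hypothetical helper names, not part of any library used here.

```python
import os

# Hypothetical helper names; the two key names come from the cell above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "TAVILY_API_KEY")


def missing_keys(required=REQUIRED_KEYS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Report what still needs to be set before running the remaining cells.
missing = missing_keys()
if missing:
    print("Missing API keys:", ", ".join(missing))
```

If the list is non-empty, set the keys in the cell above (or in your shell) and re-run.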

## Step 1: Create the Search Tool

We'll use Tavily for web search capabilities.

In [None]:
from typing import Literal

from tavily import TavilyClient

tavily_client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])


def internet_search(
    query: str,
    max_results: int = 5,
    topic: Literal["general", "news", "finance"] = "general",
    include_raw_content: bool = False,
):
    """Run a web search using Tavily."""
    return tavily_client.search(
        query,
        max_results=max_results,
        include_raw_content=include_raw_content,
        topic=topic,
    )
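
Before handing the tool to an agent, it can be useful to look at a compact view of what a search returns. The sketch below assumes the response is a dict with a `"results"` list whose items carry `"title"`, `"url"`, and `"content"` keys; check that against the response you actually get. `summarize_results` is a hypothetical helper, and the sample data is hand-written rather than real search output.

```python
def summarize_results(response, max_chars=200):
    """Return one compact line per search hit.

    Assumes `response` is a dict with a "results" list whose items have
    "title", "url", and "content" keys; missing keys fall back to "".
    """
    lines = []
    for hit in response.get("results", []):
        snippet = (hit.get("content") or "")[:max_chars]
        lines.append(f"{hit.get('title', '')} ({hit.get('url', '')}): {snippet}")
    return lines


# Hand-written sample in the assumed shape (not real search output).
sample = {
    "results": [
        {
            "title": "LangGraph",
            "url": "https://example.com",
            "content": "A graph framework.",
        }
    ]
}
print(summarize_results(sample))
```

With a real key set, `summarize_results(internet_search("LangGraph"))` would be the equivalent call against live results.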

## Step 2: Create the Deep Agent

Deep Agents automatically handles planning, file system tools, and subagent delegation.
Under the hood, it is built on LangGraph.

In [None]:
from deepagents import create_deep_agent
from langchain.chat_models import init_chat_model

# Use OpenAI as the model provider
model = init_chat_model(model="openai:gpt-4o")

research_instructions = """You are an expert researcher. Your job is to conduct
thorough research and then write a polished report.

You have access to an internet search tool as your primary means of gathering
information.

## `internet_search`
Use this to run an internet search for a given query. You can specify the max
number of results to return, the topic, and whether raw content should be
included.
"""

agent = create_deep_agent(
    model=model, tools=[internet_search], system_prompt=research_instructions
)

## Step 3: Set Up TruLens Session and Agent GPA Metrics

We'll configure Agent GPA (Goal, Plan, Action) feedback functions to evaluate:
- **Answer Relevance**: Is the response relevant to the user's question?
- **Tool Selection**: Did the agent choose appropriate tools for the task?
- **Tool Calling**: Were the tool calls executed correctly with proper arguments?
- **Execution Efficiency**: Did the agent complete the task without unnecessary steps?

Since Deep Agents does explicit planning (via `write_todos`), we also include:
- **Plan Quality**: Is the plan well-structured and appropriate for the task?
- **Plan Adherence**: Did the agent follow its plan?

In [None]:
from trulens.core import Feedback
from trulens.core import TruSession
from trulens.core.feedback.selector import Selector
from trulens.providers.openai import OpenAI as TruOpenAI

# Initialize TruLens session
session = TruSession()
session.reset_database()

# Initialize OpenAI provider for evaluations
provider = TruOpenAI(model_engine="gpt-4o-mini")

# Agent GPA metrics use trace-level selection
trace_selector = {"trace": Selector(trace_level=True)}

# Answer Relevance: Is the answer relevant to the question?
f_answer_relevance = (
    Feedback(provider.relevance_with_cot_reasons, name="Answer Relevance")
    .on_input()
    .on_output()
)

# Tool Selection: Did the agent choose appropriate tools?
f_tool_selection = Feedback(
    provider.tool_selection_with_cot_reasons, name="Tool Selection"
).on(trace_selector)

# Tool Calling: Were tool calls executed correctly?
f_tool_calling = Feedback(
    provider.tool_calling_with_cot_reasons, name="Tool Calling"
).on(trace_selector)

# Execution Efficiency: Did the agent complete efficiently?
f_execution_efficiency = Feedback(
    provider.execution_efficiency_with_cot_reasons, name="Execution Efficiency"
).on(trace_selector)

# Plan Quality: Is the plan well-structured? (Deep Agents does planning)
f_plan_quality = Feedback(
    provider.plan_quality_with_cot_reasons, name="Plan Quality"
).on(trace_selector)

# Plan Adherence: Did the agent follow its plan?
f_plan_adherence = Feedback(
    provider.plan_adherence_with_cot_reasons, name="Plan Adherence"
).on(trace_selector)

# Combine all Agent GPA feedbacks
agent_gpa_feedbacks = [
    f_answer_relevance,
    f_tool_selection,
    f_tool_calling,
    f_execution_efficiency,
    f_plan_quality,
    f_plan_adherence,
]

## Step 4: Wrap with TruGraph

Since Deep Agents is built on LangGraph, we use `TruGraph` to automatically instrument all internal nodes, tool calls, and planning steps.

In [None]:
from trulens.apps.langgraph import TruGraph

# Wrap the Deep Agent with TruGraph
tru_agent = TruGraph(
    agent,
    app_name="DeepAgent",
    app_version="v1",
    feedbacks=agent_gpa_feedbacks,
)

## Step 5: Run the Agent with Evaluation

In [None]:
# Run test queries
test_queries = [
    "What is LangGraph and how does it relate to LangChain?",
    "What are the latest developments in AI agents?",
]

for query in test_queries:
    print(f"Query: {query}")
    print("-" * 50)

    with tru_agent as recording:
        result = agent.invoke({
            "messages": [{"role": "user", "content": query}]
        })

    response = result["messages"][-1].content
    print(f"Response: {response[:500]}...\n")
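
The loop above slices `response` as a string. LangChain message content can also be a list of content blocks, depending on the model and provider, in which case the slice would return list items rather than characters. A minimal sketch of a normalizer follows; `message_text` is a hypothetical helper, and it only handles plain strings and `{"type": "text", ...}` blocks.

```python
def message_text(content):
    """Flatten message content (a str or a list of blocks) to plain text."""
    if isinstance(content, str):
        return content
    parts = []
    for block in content:
        if isinstance(block, str):
            parts.append(block)
        elif isinstance(block, dict) and block.get("type") == "text":
            parts.append(block.get("text", ""))
        # Non-text blocks (images, tool use, etc.) are skipped.
    return "".join(parts)


print(message_text([{"type": "text", "text": "Hello, "}, "world"]))
```

In the loop, this would be used as `response = message_text(result["messages"][-1].content)`.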

## Step 6: View Evaluation Results

Use `retrieve_feedback_results()` to wait for evaluations to complete and get results.

In [None]:
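# Wait for and retrieve feedback results (waits for evaluations to complete).
# Note: `recording` was reassigned on every pass of the loop above, so this
# retrieves results for the last query only.
# The timeout parameter controls how long to wait for results (default: 180 seconds)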
feedback_results = recording.retrieve_feedback_results(timeout=300)
feedback_results

In [None]:
# Get the leaderboard showing evaluation scores across all records
session.get_leaderboard()

## Step 7: Launch the Dashboard

In [None]:
from trulens.dashboard import run_dashboard

run_dashboard(session)

## What Happened?

Your deep agent automatically:
1. **Planned its approach**: Used the built-in `write_todos` tool to break down the research task
2. **Conducted research**: Called the `internet_search` tool to gather information
3. **Managed context**: Used file system tools (`write_file`, `read_file`) to offload large search results
4. **Synthesized a report**: Compiled findings into a coherent response

**TruGraph captured the full trace** including:
- All LangGraph nodes and transitions
- Tool calls (`internet_search`, `write_file`, `read_file`, etc.)
- Planning and reasoning steps
- Subagent invocations
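
For a quick check of which tools were called without opening the dashboard, one option is to tally the `tool_calls` recorded on the returned messages. The sketch below uses stand-in objects so it runs on its own; `count_tool_calls` is a hypothetical helper, and it assumes each tool call is a dict with a `"name"` key. Calls made inside subagents may not appear in the top-level message list, so treat the tally as a lower bound.

```python
from collections import Counter
from types import SimpleNamespace


def count_tool_calls(messages):
    """Count tool calls by tool name across a list of messages."""
    counts = Counter()
    for message in messages:
        # Messages without tool calls (user, tool, final answer) are skipped.
        for call in getattr(message, "tool_calls", None) or []:
            counts[call["name"]] += 1
    return counts


# Stand-in messages for illustration; not real agent output.
fake_messages = [
    SimpleNamespace(tool_calls=[{"name": "write_todos", "args": {}}]),
    SimpleNamespace(
        tool_calls=[
            {"name": "internet_search", "args": {}},
            {"name": "internet_search", "args": {}},
        ]
    ),
    SimpleNamespace(content="final answer"),
]
print(count_tool_calls(fake_messages))
```

Against a real run, the call would be `count_tool_calls(result["messages"])`.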

**Agent GPA metrics evaluated:**
- **Answer Relevance**: How well the response addresses the user's question
- **Tool Selection**: Whether the agent chose the right tools for each step
- **Tool Calling**: Whether tool calls were executed with correct arguments
- **Execution Efficiency**: Whether the agent completed without unnecessary steps
- **Plan Quality**: Whether the agent's plan was well-structured
- **Plan Adherence**: Whether the agent followed its stated plan

## Next Steps

- Add custom feedback functions for domain-specific evaluations
- Customize the agent with more tools
- Compare different agent versions using the leaderboard
- Explore the trace in the dashboard to see all internal steps
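
As a starting point for a domain-specific evaluation, a feedback function can be as simple as a Python callable that returns a score between 0 and 1. The sketch below checks whether a research report includes at least one URL. `cites_sources` is a hypothetical example, and the wiring shown in the trailing comment mirrors the Answer Relevance pattern from Step 3 but is an untested assumption.

```python
import re

# Matches http:// or https:// followed by non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")


def cites_sources(response: str) -> float:
    """Return 1.0 if the response contains at least one URL, else 0.0."""
    return 1.0 if URL_PATTERN.search(response or "") else 0.0


print(cites_sources("See https://example.com for details."))

# Assumed wiring, following the pattern used in Step 3 (not verified here):
# f_cites_sources = Feedback(cites_sources, name="Cites Sources").on_output()
```

A binary score keeps the example short; a graded version could count distinct URLs or check them against the search results.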