# Lesson 5: Measure Agent’s GPA

Goal-Plan-Act (GPA) alignment evaluations give you targeted feedback on where your agent falls short and how to make it more effective.

In [None]:
import os
from dotenv import load_dotenv
import warnings

load_dotenv(override=True)
warnings.filterwarnings("ignore")

os.environ["TRULENS_OTEL_TRACING"] = "1"

<div style="background-color:#fff6ff; padding:13px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px">
<p> 💻 &nbsp; <b>To access <code>requirements.txt</code>, <code>env.template</code>, <code>prompts.py</code>, and <code>helper.py</code> files:</b> 1) click on the <em>"File"</em> option on the top menu of the notebook and then 2) click on <em>"Open"</em>.</p>

<p> ⬇ &nbsp; <b>Download Notebooks:</b> 1) click on the <em>"File"</em> option on the top menu of the notebook and then 2) click on <em>"Download as"</em> and select <em>"Notebook (.ipynb)"</em>.</p>

</div>

## 5.1 Goal-Plan-Act alignment

Agents are most effective when they act in alignment with a high-quality plan. Many common failure modes therefore stem from misalignment between the Goal, the Plan, and the agent's Actions.

With carefully written criteria and a strong LLM judge, you can build evaluators that detect these failure modes and assess separable dimensions of agent quality.
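
To make the idea of "criteria plus an LLM judge" concrete, here is a minimal sketch of how a criteria-based judge prompt could be assembled and how its reply could be turned into a score. The criteria text, the 0-3 scale, and the helper names `build_judge_prompt` and `parse_score` are assumptions for illustration only; the prompts TruLens actually uses may differ.

```python
import re

# Hypothetical criteria text; not the prompt TruLens ships.
PLAN_QUALITY_CRITERIA = (
    "Score 0-3. 3: steps are specific, ordered, and tied to the goal. "
    "0: steps are vague or unrelated to the goal."
)


def build_judge_prompt(criteria: str, trace: str) -> str:
    """Combine scoring criteria and the trace into a single judge prompt."""
    return (
        "You are evaluating an agent.\n"
        f"Criteria: {criteria}\n"
        f"Trace:\n{trace}\n"
        "Answer with 'Score: <0-3>' followed by your reasoning."
    )


def parse_score(judge_reply: str, max_score: int = 3) -> float:
    """Extract 'Score: N' from the judge's reply and normalize it to [0, 1]."""
    match = re.search(r"Score:\s*(\d+)", judge_reply)
    if match is None:
        raise ValueError("judge reply did not contain a score")
    raw = min(int(match.group(1)), max_score)  # clamp out-of-range scores
    return raw / max_score


prompt = build_judge_prompt(PLAN_QUALITY_CRITERIA, "Plan: 1. Pull all leads.")
print(parse_score("Score: 3\nThe plan is specific and ordered."))
```

Keeping the criteria in one string makes it easy to review and revise what the judge is asked to check.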

You will start with some illustrative examples of common failure modes and see how an LLM judge identifies issues in goal-plan-action alignment.

In [None]:
from trulens.providers.openai import OpenAI

gpa_eval_provider = OpenAI(model_engine="gpt-4.1")

### Failure mode 1: Plan Quality

The starting point for an agent is its plan. Without a high-quality plan, the agent has little hope of succeeding. You start by assessing the plan to ensure it is well-structured and aligned with the goal.

To demonstrate, consider the query: "Which sales leads should we prioritize this week, and what specific action items should we take for each?" and the plan below.

In [None]:
goal_and_plan = """
User Query: Which sales leads should we prioritize this week, 
and what specific action items should we take for each?

Plan:

1. Pull all sales leads from the past 12 months from the CRM.

2. For the largest 20 leads, compile any notes, call logs, 
and related tasks from the CRM.

3. Summarize each lead’s current stage in the pipeline.

4. Present the summary and recommendations in a single table.
"""

In [None]:
from trulens.core import Feedback
from trulens.core.feedback.selector import Selector

# Goal-Plan-Act: Plan quality
f_plan_quality = Feedback(
    gpa_eval_provider.plan_quality_with_cot_reasons,
    name="Plan Quality",
).on({
    "trace": Selector(trace_level=True),
})

In [None]:
from helper import display_eval_reason

score, reason = f_plan_quality(goal_and_plan)

print(f"Score: {score} \n")
display_eval_reason(reason['reason'])

<div style="background-color:#fff6ff; padding:13px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px">

**Summary:** 

Why is the first plan low quality?

- Vague selection: "Pull all sales leads from the past 12 months" lacks urgency constraints tied to the goal.
- Weak prioritization: "largest 20" ignores lead score, stage urgency, or upcoming deadlines.
- Missing actionability: no instructions to create specific next actions or owners.
- Output not specific: "single table" without required fields tied to the goal.

What the evaluator flags:
- Missing specific constraints, measurable outputs, and sequencing tied to the goal. The improved plan that follows adds these, which raises the score.

</div>

In [None]:
goal_and_better_plan = """
User Query: Which sales leads should we prioritize this week, 
and what specific action items should we take for each?

Plan:

1. Pull all leads with open opportunities from the CRM that have 
a next action date within the next 14 days or no next action assigned.

2. Filter to leads with deal value > $10k or high lead score.

3. Sort by deal stage urgency (e.g., close date approaching, 
at risk of going cold) and potential revenue impact.

4. For each prioritized lead:

5. Retrieve latest interaction notes, key decision-maker info, 
and current blockers.

6.  Identify overdue or missing action items.

7. Propose specific, high-impact next steps (e.g., schedule product demo, 
send proposal revision, escalate to sales manager).

8. Group recommendations into this week’s priority list with owner 
assignments and deadlines.

9. Present results in a table with columns: Lead Name, Value, Stage, 
Urgency Score, Next Action, Due Date, Owner.
"""

In [None]:
score, reason = f_plan_quality(goal_and_better_plan)

print(f"Score: {score} \n")
display_eval_reason(reason['reason'])

<div style="background-color:#fff6ff; padding:13px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px">

**Bad vs Good (Plan Quality)**

- **Bad**: Vague steps, no thresholds, unclear output, missing prioritization.
- **Good**: Explicit filters and thresholds, prioritization logic, and defined output schema.


In practice, you cannot directly edit the agent's plan or its actions. You can, however, adjust the agent itself to guide it towards higher goal-plan-action alignment. In the next lesson, you will get hands-on practice making such improvements.

</div>
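
To show what "explicit filters and thresholds" and "prioritization logic" mean in practice, the first three steps of the improved plan can be sketched as ordinary code. Everything here is hypothetical: the `Lead` fields, the cutoff of 80 for a "high" lead score, the sort key, and the sample leads are assumptions for illustration, not part of the agent or the CRM.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Lead:
    name: str
    deal_value: float
    lead_score: int                    # hypothetical 0-100 scale
    next_action_date: Optional[date]   # None means no next action assigned
    close_date: date


def prioritize(leads, today, min_value=10_000, high_score=80, window_days=14):
    """Apply plan steps 1-3: date filter, value/score filter, then sort."""
    cutoff = today + timedelta(days=window_days)
    selected = [
        lead for lead in leads
        # Step 1: next action due by the cutoff (overdue counts), or none assigned.
        if (lead.next_action_date is None or lead.next_action_date <= cutoff)
        # Step 2: large deal OR high lead score.
        and (lead.deal_value > min_value or lead.lead_score >= high_score)
    ]
    # Step 3: earliest close date first, then larger deals first.
    return sorted(selected, key=lambda lead: (lead.close_date, -lead.deal_value))


today = date(2025, 1, 6)
leads = [
    Lead("A", 50_000, 40, None, date(2025, 1, 20)),
    Lead("B", 5_000, 90, date(2025, 1, 10), date(2025, 1, 15)),
    Lead("C", 80_000, 95, date(2025, 3, 1), date(2025, 3, 30)),  # next action too far out
    Lead("D", 2_000, 10, None, date(2025, 1, 12)),               # fails value/score filter
]
print([lead.name for lead in prioritize(leads, today)])  # ['B', 'A']
```

A plan written at this level of detail leaves little room for interpretation, which is the kind of specificity the Plan Quality criteria above reward.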

**Quick checklist: Plan Quality**

- Tightly aligned to the user goal.
- Specific selection criteria and thresholds.
- Clear step ordering and ownership when relevant.
- Concrete outputs (schema/columns) and success criteria.
- Uses the right agents/tools for each step.

### Failure mode 2: Plan Adherence

Once a high-quality plan is in place, the agent must follow it. Plan adherence checks whether the agent's actions match the plan. Consider the following execution trace alongside the improved plan from above.

In [None]:
agent_actions = """
[STEP 1] Pulled all open opportunities from the CRM without applying a next action date filter.
[STEP 2] Applied deal value filter only; skipped the lead score filter.
[STEP 3] Sorted leads solely by deal value (descending).
[STEP 4] Retrieved latest notes and contact names but skipped blockers.
[STEP 5] Listed the CRM’s existing "next action" field without review or update.
[STEP 6] Output a table with Lead Name, Value, Stage, and Next Action.
"""

In [None]:
plan_and_agent_actions = goal_and_better_plan + agent_actions

In [None]:
# Goal-Plan-Act: Plan adherence
f_plan_adherence = Feedback(
    gpa_eval_provider.plan_adherence_with_cot_reasons,
    name="Plan Adherence",
).on({
    "trace": Selector(trace_level=True),
})

In [None]:
score, reason = f_plan_adherence(plan_and_agent_actions)

print(f"Score: {score} \n")
display_eval_reason(reason['reason'])

<div style="background-color:#fff6ff; padding:13px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px">

**Summary**

Why does this trace violate the plan?

- Missing the date filter from Step 1.
  - Plan: "next action within 14 days OR no next action"
  - Trace: "Pulled all open opportunities... without applying a next action date filter."
- Partial filter in Step 2.
  - Plan: value > $10k OR high lead score
  - Trace: "Applied deal value filter only; skipped the lead score filter."
- Output mismatch.
  - Plan: include Urgency Score, Due Date, Owner
  - Trace: table has only Lead Name, Value, Stage, Next Action

What the evaluator flags:
- Each plan requirement should appear in the trace. Omissions above directly lower adherence.

</div>
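
The output mismatch above is simple enough to check deterministically, which can complement the LLM judge. The following is a minimal sketch, assuming you can extract the produced column names from the trace; `missing_columns` is a hypothetical helper, not a TruLens function.

```python
# Columns required by step 9 of the improved plan.
PLANNED_COLUMNS = [
    "Lead Name", "Value", "Stage", "Urgency Score",
    "Next Action", "Due Date", "Owner",
]


def missing_columns(planned, produced):
    """Return planned columns absent from the output, in plan order."""
    produced_set = set(produced)
    return [col for col in planned if col not in produced_set]


# Columns reported in [STEP 6] of the non-adherent trace.
trace_columns = ["Lead Name", "Value", "Stage", "Next Action"]
print(missing_columns(PLANNED_COLUMNS, trace_columns))
# ['Urgency Score', 'Due Date', 'Owner']
```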

In [None]:
better_agent_actions = """[STEP 1] Pulled all leads with open 
opportunities and either a next action date within 14 days or no next 
action assigned.
[STEP 2] Filtered to leads with deal value over $10k or high lead score.
[STEP 3] Sorted leads by deal stage urgency and potential revenue impact.
[STEP 4] Retrieved latest notes, key decision-maker info, and identified 
any blockers.
[STEP 5] Created updated, specific next actions for each lead based on 
context. 
[STEP 6] Grouped recommendations into this week’s priority list with owner 
assignments and deadlines.
[STEP 7] Output a table with Lead Name, Value, Stage, Urgency Score, 
Next Action, Due Date, and Owner.
"""

In [None]:
plan_and_better_agent_actions = goal_and_better_plan + better_agent_actions

In [None]:
score, reason = f_plan_adherence(plan_and_better_agent_actions)

print(f"Score: {score} \n")
display_eval_reason(reason['reason'])

### Failure mode 3: Execution Efficiency

Even when acting in logical ways that adhere to a high-quality plan, agents can act in overly defensive ways that reduce efficiency unnecessarily.

Evaluating execution efficiency helps to flag redundancies, preventable mistakes, and excessive error handling.

Consider a new execution trace.

In [None]:
agent_actions = """
[STEP 1] Pulled all leads with open opportunities and either a next 
action date within 14 days or no next action assigned.
    → Retrieved 96 leads.

[STEP 2] Filtered to leads with deal value over $10k or high lead score.
    → Applied filter, yielding 54 leads.

[STEP 3] Sorted leads by deal stage urgency and potential revenue impact.
    → High-value late-stage leads ranked highest.

[STEP 4] Retrieved latest notes, key decision-maker info, and blockers.
    → Retrieved notes from both the CRM API and a cached export for one 
    lead to “double-check” consistency.

[STEP 5] Created updated, specific next actions for each lead based on 
context.
    → Example: Lead A — “Schedule demo and confirm final pricing”; Lead 
    B — “Follow up on proposal feedback by Thursday.”

[STEP 6] Output a table with Lead Name, Value, Stage, Urgency Score, 
Next Action, Due Date, and Owner.
    → Exported table to both XLSX and CSV formats, though only one 
    format was requested.
"""

In [None]:
# Goal-Plan-Act: Execution efficiency of trace
f_execution_efficiency = Feedback(
    gpa_eval_provider.execution_efficiency_with_cot_reasons,
    name="Execution Efficiency",
).on({
    "trace": Selector(trace_level=True),
})

In [None]:
score, reason = f_execution_efficiency(agent_actions)

print(f"Score: {score} \n")
display_eval_reason(reason['reason'])

<div style="background-color:#fff6ff; padding:13px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px">

**Summary**

Why is this trace inefficient?

- Redundant retrieval: double-fetched notes to "double-check."
  - Trace: "Retrieved notes from both the CRM API and a cached export..."
  - Impact: unnecessary network/IO; one source is sufficient.
- Unrequested outputs: exported multiple formats.
  - Trace: "Exported table to both XLSX and CSV..."
  - Impact: violates YAGNI ("you aren't gonna need it"); adds time and clutter.

What the evaluator flags:
- Looks for repeated steps, unnecessary retries, and outputs beyond the plan/request.
- Both points above directly lower the efficiency score.

</div>

<div style="background-color:#fff6ff; padding:13px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px">

**Bad vs Good (Execution Efficiency)**

- **Bad**: Re-applies filters multiple times; double-fetches the same notes to "be safe"; exports to extra formats not requested; retries on transient warnings without need.
- **Good**: Applies each filter exactly once in a single pass; reuses cached results instead of refetching; outputs only the requested format; handles errors proportionally (warn → continue; error → fix once and proceed).

Small rewrite example:
- Bad: "Applied deal value filter, then re-applied to confirm."
- Good: "Applied combined filters: value > $10k OR high lead score (single pass)."

</div>
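
One way to get "reuses cached results instead of refetching" is to memoize the retrieval call. This is a minimal sketch using the standard library's `functools.lru_cache`; `fetch_notes` is a hypothetical stand-in for a CRM API call and returns canned text.

```python
from functools import lru_cache

fetch_count = 0  # counts real calls to the (simulated) CRM


@lru_cache(maxsize=None)
def fetch_notes(lead_id: str) -> str:
    """Stand-in for a CRM API call; hypothetical, returns canned text."""
    global fetch_count
    fetch_count += 1
    return f"notes for {lead_id}"


first = fetch_notes("lead-a")
second = fetch_notes("lead-a")  # served from the cache, no second fetch
print(fetch_count)  # 1
```

Caching like this only suits data that does not change during a single run; if notes can be updated mid-run, the cache would need to be invalidated.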

### Failure mode 4: Logical Inconsistency

Agents' actions can suffer from contradictions, ungrounded assumptions, and logical flaws.

Let's consider a different execution trace.

In [None]:
agent_actions = """
[STEP 1] Pulled all leads with open opportunities and either a next 
action date within 14 days or no next action assigned.
    → Retrieved 96 leads, including recent follow-ups and a few older 
    records from early last year.

[STEP 2] Filtered to leads with deal value over $10k or high lead score.
    → Resulted in 113 leads after applying filters.

[STEP 3] Sorted leads by deal stage urgency and potential revenue impact.
    → Leads with minimal recent engagement ranked highly due to their 
    projected close dates in Q3.

[STEP 4] Retrieved latest notes, key decision-maker info, and blockers.
    → Several leads show “TBD” for decision-maker but still have active 
    next steps assigned.

[STEP 5] Created updated, specific next actions for each lead based on 
context.
    → Example: Lead A — “Schedule demo and confirm final pricing”; Lead 
    B — “Wait for proposal feedback before scheduling demo.”

[STEP 6] Output a table with Lead Name, Value, Stage, Urgency Score, 
Next Action, Due Date, and Owner.
    → Due dates range from last week to the end of the current month.
"""

In [None]:
# Goal-Plan-Act: Logical consistency of trace
f_logical_consistency = Feedback(
    gpa_eval_provider.logical_consistency_with_cot_reasons,
    name="Logical Consistency",
).on({
    "trace": Selector(trace_level=True),
})

In [None]:
score, reason = f_logical_consistency(agent_actions)

print(f"Score: {score} \n")
display_eval_reason(reason['reason'])

<div style="background-color:#fff6ff; padding:13px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px">

**Summary**

Why is this trace inconsistent?

- Count contradiction after filtering.
  - Plan step: filter to leads >$10k OR high lead score.
  - Trace: 96 → 113 leads after filter.
  - Why it's a failure: filters should not increase the set; this implies an inconsistency.
- Action vs state mismatch.
  - Trace: decision-maker is "TBD" but "active next steps" are assigned.
  - Why it's a failure: missing prerequisite info for the assigned action.

What the evaluator flags:
- Numerical sanity across steps; contradictions or impossible transitions.
- The two points above directly reduce the consistency score.

</div>

<div style="background-color:#fff6ff; padding:13px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px">

**Bad vs Good (Logical Consistency)**

- **Bad**: Counts grow after applying stricter filters; assigns next steps when decision-maker is "TBD"; contradicts earlier statements.
- **Good**: Counts decrease or stay the same after filters; next steps match available context; statements remain consistent with prior steps.

Small rewrite example:
- Bad: "Resulted in 113 leads after applying filters to 96 leads."
- Good: "Filtered 96 → 54 leads based on value > $10k OR high lead score."

</div>
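
The count contradiction above is the kind of numerical sanity check that is easy to express in code. The sketch below assumes you have already extracted a (label, count) pair for each step; `find_count_violations` is a hypothetical helper, not part of TruLens.

```python
def find_count_violations(steps):
    """Flag filter steps whose output count exceeds the previous count.

    `steps` is a list of (label, count) pairs in execution order.
    """
    violations = []
    for (_, prev_count), (label, count) in zip(steps, steps[1:]):
        if count > prev_count:
            violations.append(f"{label}: {prev_count} -> {count}")
    return violations


inconsistent = [("pull leads", 96), ("filter value/score", 113)]
consistent = [("pull leads", 96), ("filter value/score", 54)]

print(find_count_violations(inconsistent))  # ['filter value/score: 96 -> 113']
print(find_count_violations(consistent))    # []
```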

### Recap: Common Failure Modes and Fixes

- **Plan Quality**: Ensure the plan is specific, feasible, and tied to the goal. Include ordering and explicit outputs.
- **Plan Adherence**: Execute each plan step as written. Apply filters and produce the exact requested outputs; report any deviations.
- **Execution Efficiency**: Avoid redundant work and overly defensive retries; do only what is necessary to achieve the goal.
- **Logical Consistency**: Keep counts and facts consistent across steps; avoid contradictions and unsupported claims.

Use the checklist above to quickly self-audit traces before running full evaluations.

**Let's apply these evaluations to the data agent.**

## 5.2 Create TruLens session for logging

In [None]:
from trulens.core.session import TruSession
from trulens.core.database.connector.default import DefaultDBConnector

# Initialize a connector backed by a SQLite database in the current working directory
connector = DefaultDBConnector(database_url="sqlite:///default.sqlite")

# Create TruSession with the custom connector
session = TruSession(connector=connector)

## 5.3 Build the graph

In [None]:
from langgraph.graph import START, StateGraph
from helper import State, planner_node, executor_node, cortex_agents_research_node, web_research_node, chart_node, chart_summary_node, synthesizer_node

workflow = StateGraph(State)
workflow.add_node("planner", planner_node)
workflow.add_node("executor", executor_node)
workflow.add_node("web_researcher", web_research_node)
workflow.add_node("cortex_researcher", cortex_agents_research_node)
workflow.add_node("chart_generator", chart_node)
workflow.add_node("chart_summarizer", chart_summary_node)
workflow.add_node("synthesizer", synthesizer_node)

workflow.add_edge(START, "planner")

graph = workflow.compile()

## 5.4 Register the agent with TruLens

<div style="background-color:#f7fff8; padding:15px; border-width:3px; border-color:#e0f0e0; border-style:solid; border-radius:6px"> 
    <p>🚨 &nbsp; In this notebook, you are directly provided with the results obtained during filming. This is to help eliminate waiting time, and to prevent potential rate limit errors that might occur in this learning environment (this learning environment is constrained, and the GPA evaluation metrics consume a significant number of tokens).</p>
</div>

Here's the code that registers the agent with TruLens:
```python
from trulens.apps.langgraph import TruGraph
from helper import f_answer_relevance, f_context_relevance, f_groundedness

tru_recorder = TruGraph(
    graph,
    app_name="Sales Data Agent",
    app_version="L5: Base",
    feedbacks=[
        f_answer_relevance,
        f_context_relevance,
        f_groundedness,
        f_plan_quality,
        f_plan_adherence,
        f_execution_efficiency,
        f_logical_consistency,
    ],
)
```

## 5.5 Record agent usage

<div style="background-color:#f7fff8; padding:15px; border-width:3px; border-color:#e0f0e0; border-style:solid; border-radius:6px"> 
    <p>🚨 &nbsp; <b>Run Results:</b> In this notebook, you are directly provided with the results obtained during filming. This is to help eliminate waiting time, and to prevent potential rate limit errors that might occur in this learning environment (this learning environment is constrained, and the GPA evaluation metrics consume a significant number of tokens).</p>
</div>

**Code for query 1:**
```python
from langchain.schema import HumanMessage

with tru_recorder as recording:
    query = "What are our top 3 client deals? Chart the deal value for each."
    print(f"Query: {query}")
    state = {
        "messages": [HumanMessage(content=query)],
        "user_query": query,
        "enabled_agents": [
            "cortex_researcher", "web_researcher",
            "chart_generator", "chart_summarizer",
            "synthesizer",
        ],
    }
    graph.invoke(state)
    print("--------------------------------")
```

In [None]:
records, feedback = session.get_records_and_feedback()
print(f"Query: {records.iloc[0]['input']}\n")
print(f"Output: {records.iloc[0]['output']}\n")

**Code for query 2:**
```python
with tru_recorder as recording:
    query = "Identify our pending deals, research if they may be experiencing regulatory changes, and using the meeting notes for each customer, provide a new value proposition for each given the regulatory changes."
    print(f"Query: {query}")
    state = {
        "messages": [HumanMessage(content=query)],
        "user_query": query,
        "enabled_agents": [
            "cortex_researcher", "web_researcher",
            "chart_generator", "chart_summarizer",
            "synthesizer",
        ],
    }
    graph.invoke(state)

    print("--------------------------------")
```

In [None]:
print(f"Query: {records.iloc[1]['input']}\n")
print(f"Output: {records.iloc[1]['output']}\n")

**Code for query 3:**
```python
with tru_recorder as recording:
    query = "Identify our largest client deal, then find important topics in the meeting notes with that company, and find a news article related to the important topics discussed."
    print(f"Query: {query}")
    state = {
        "messages": [HumanMessage(content=query)],
        "user_query": query,
        "enabled_agents": [
            "cortex_researcher", "web_researcher",
            "chart_generator", "chart_summarizer",
            "synthesizer",
        ],
    }
    graph.invoke(state)

    print("--------------------------------")
```

In [None]:
print(f"Query: {records.iloc[2]['input']}\n")
print(f"Output: {records.iloc[2]['output']}\n")

## 5.6 Launch the TruLens dashboard

**Note:** Make sure to click on the second link (not the localhost) to open the TruLens dashboard.

In [None]:
import os

from trulens.dashboard import run_dashboard

# Port for the dashboard; stored as an integer
port = 8004
_ = run_dashboard(port=port)
print(os.environ['DLAI_LOCAL_URL'].format(port=port))