<b>CI/CD Testing Pipelines</b>

In [5]:
from IPython.display import Image

In [2]:
from pydantic import BaseModel, Field, ValidationError
import json

To build a CI/CD pipeline for a non-deterministic state machine, you have to stop testing exact output strings and start testing the logical bounds of the agent's trajectory. A common industry approach is programmatic evaluation (often implemented via frameworks like Ragas or TruLens), where a faster, cheaper LLM (the "Judge"), configured to be as deterministic as possible, scores the trace of the main agent's execution.

<b>The Triad of Agentic Evaluation Metrics</b><br>
To test a system that retrieves memory and reasons over it, you must evaluate three distinct mathematical properties of the execution trace. Let $Q$ be the User Query, $C$ be the Context retrieved from FAISS, and $A$ be the Agent's Final Answer.

<b>1. Context Relevance (Testing the Disk Drive) </b> <br>Before we care about what the LLM says, we must test if the vector database returned garbage. Context Relevance measures the signal-to-noise ratio of the retrieved memory blocks.<br>
<i>The Math: </i> The Judge LLM analyzes the context $C$ and identifies the sentences that are strictly necessary to answer $Q$. The score is the fraction of sentences in $C$ that qualify: $$\text{Context Relevance} = \frac{|\text{Relevant Sentences in } C|}{|\text{Total Sentences in } C|}$$ <br>
<i>Why it matters for CI/CD: </i> If this score drops below 0.5 in your pipeline, your FAISS search tool is pulling in too much noise, which will inflate your context window and confuse the agent. You need to adjust your top_k or improve your embedding strategy.
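As a rough illustration of the ratio above, the sketch below assumes the Judge has already labeled each sentence of the retrieved context as necessary or not. The function name `context_relevance` and the list-of-booleans input are hypothetical choices made for this example, not part of any framework.

```python
from typing import Sequence


def context_relevance(relevant_flags: Sequence[bool]) -> float:
    """Fraction of context sentences the Judge marked as necessary for the query."""
    if not relevant_flags:
        # Assumption: an empty retrieval carries no signal, so score it 0.0.
        return 0.0
    return sum(relevant_flags) / len(relevant_flags)


# Hypothetical Judge verdicts for a 4-sentence context: 2 sentences were needed.
flags = [True, False, True, False]
print(context_relevance(flags))  # 0.5, right at the pipeline's warning threshold
```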

<b>2. Faithfulness (Testing for Hallucination) </b> <br>
This is the most critical metric. Faithfulness ensures the agent didn't invent facts outside of its Tier 1 Core Memory or its retrieved Tier 2 Archival Memory. <br>
<i>The Math: </i> The Judge LLM breaks the agent's final answer $A$ into a set of atomic claims $\{c_1, c_2, ..., c_n\}$. It then checks if each claim $c_i$ can be logically inferred from the Context $C$.$$\text{Faithfulness} = \frac{|\text{Claims logically supported by } C|}{|\text{Total Claims in } A|}$$ <br>
<i>Why it matters for CI/CD: </i> If Faithfulness falls below the pipeline's threshold (0.99 in the test below, effectively 100%), the build fails. A score of 0.8 means 20% of the agent's claims are not supported by the retrieved context.
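The claim-level ratio can be sketched as follows, assuming the Judge returns one supported/unsupported verdict per atomic claim. The `faithfulness` helper and the example claims are made up for illustration; returning the unsupported claims alongside the score is a design choice that makes a failing build easier to debug.

```python
from typing import Dict, List, Tuple


def faithfulness(claim_verdicts: Dict[str, bool]) -> Tuple[float, List[str]]:
    """Return (score, unsupported_claims) from per-claim Judge verdicts."""
    if not claim_verdicts:
        # Assumption: an answer with no claims has nothing to hallucinate.
        return 1.0, []
    unsupported = [claim for claim, ok in claim_verdicts.items() if not ok]
    supported_count = len(claim_verdicts) - len(unsupported)
    return supported_count / len(claim_verdicts), unsupported


# Hypothetical verdicts: one claim is backed by the context, one is not.
verdicts = {
    "The timeout is 30 seconds.": True,
    "The timeout is configurable per tenant.": False,
}
score, unsupported = faithfulness(verdicts)
print(score)        # 0.5
print(unsupported)  # ['The timeout is configurable per tenant.']
```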

<b>3. Answer Relevance (Testing the Final Output) </b> <br>
Did the agent actually answer the user's prompt, or did it get distracted by the context it retrieved? <br>
<i> The Math: </i> This is often calculated using embedding distances. We use an embedding model to vectorize the final Answer $\vec{A}$ and the original Query $\vec{Q}$, and calculate their cosine similarity.$$\text{Answer Relevance} = \frac{\vec{A} \cdot \vec{Q}}{||\vec{A}|| \times ||\vec{Q}||}$$ <br>
<i>Why it matters for CI/CD: </i> A highly faithful answer can still be completely irrelevant to the prompt. This ensures the agent's core instruction-following policy remains intact. <br>
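The cosine similarity formula above can be written in a few lines of plain Python. This is a minimal sketch: the toy two-dimensional vectors stand in for real embeddings, and the choice of embedding model is left open because the text does not name one.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same dimension.")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector carries no direction, so score it 0.0.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy stand-ins for the embedded Answer and Query.
answer_vec = [1.0, 0.0]
query_vec = [1.0, 1.0]
print(cosine_similarity(answer_vec, query_vec))  # roughly 0.707
```

In a real pipeline the vectors would come from the same embedding model for both $A$ and $Q$, since similarities across different models are not comparable.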

<b> Implementing the CI/CD Pipeline </b>

In [2]:
import pytest

# 1. The Golden Dataset (Your test cases)
TEST_CASES = [
    {
        "query": "What is the max database timeout?",
        "expected_facts": ["timeout is 30 seconds"],
    },
    # ... 100 more edge cases ...
]

def evaluate_trace_with_judge(query: str, retrieved_context: str, final_answer: str) -> dict:
    """
    Calls a fast, cheap Judge LLM (e.g., Claude 3.5 Haiku or GPT-4o-mini) 
    with a strict prompt to calculate the Triad metrics.
    """
    prompt = f"""
    Given the Query: {query}
    Given the Context: {retrieved_context}
    Given the Answer: {final_answer}
    
    Calculate Faithfulness, Context Relevance, and Answer Relevance as floats between 0.0 and 1.0.
    Return strictly as JSON.
    """
    # ... call Judge LLM ...
    return {"faithfulness": 1.0, "context_relevance": 0.8, "answer_relevance": 0.95}

@pytest.mark.parametrize("test_case", TEST_CASES)
def test_agent_execution_trajectory(test_case):
    # 1. Run the Agent (The Heartbeat Loop we built)
    # We modify the agent's run function (called run_os_agent here) to return the full state_memory trace, not just the final string.
    trace = run_os_agent(test_case["query"])
    
    # 2. Extract variables from the trace
    final_answer = trace[-1]["content"]
    retrieved_context = extract_tool_results_from_trace(trace, tool_name="archival_memory_search")
    
    # 3. Score the execution using the Judge LLM
    scores = evaluate_trace_with_judge(test_case["query"], retrieved_context, final_answer)
    
    # 4. The CI/CD Assertions
    assert scores["faithfulness"] >= 0.99, f"Agent hallucinated! Score: {scores['faithfulness']}"
    assert scores["context_relevance"] >= 0.50, f"FAISS retrieved too much noise. Score: {scores['context_relevance']}"
    assert scores["answer_relevance"] >= 0.85, f"Agent evaded the question. Score: {scores['answer_relevance']}"
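Because the Judge is itself an LLM, its JSON reply should be validated before the assertions run. The sketch below shows one possible way to do that with the `pydantic` and `json` imports from the top of this notebook. The `JudgeScores` model and `parse_judge_response` helper are hypothetical names, and the sketch assumes the Judge replies with a single JSON object holding the three keys used above.

```python
import json

from pydantic import BaseModel, Field, ValidationError


class JudgeScores(BaseModel):
    """Schema for the Judge's reply; every metric must lie in [0.0, 1.0]."""
    faithfulness: float = Field(ge=0.0, le=1.0)
    context_relevance: float = Field(ge=0.0, le=1.0)
    answer_relevance: float = Field(ge=0.0, le=1.0)


def parse_judge_response(raw: str) -> dict:
    """Parse and validate the Judge's raw JSON text, or raise ValueError."""
    try:
        payload = json.loads(raw)
        scores = JudgeScores(**payload)
    except (json.JSONDecodeError, TypeError, ValidationError) as exc:
        # Malformed JSON, a non-object payload, or out-of-range scores all land here.
        raise ValueError(f"Judge returned an unusable response: {exc}") from exc
    return {
        "faithfulness": scores.faithfulness,
        "context_relevance": scores.context_relevance,
        "answer_relevance": scores.answer_relevance,
    }


raw_reply = '{"faithfulness": 1.0, "context_relevance": 0.8, "answer_relevance": 0.95}'
print(parse_judge_response(raw_reply))
```

Failing loudly on a malformed reply keeps a broken Judge from being mistaken for a hallucinating agent.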

<b>Step 3: The Architecture in Action (Control Flow) </b><br>
Let's look at how this plays out inside your heartbeat event loop from the previous step.

User asks a hard question: "What were the key takeaways from that 50-page technical spec I gave you last month?"

LLM Execution 1: The LLM realizes it doesn't know. It calls ArchivalMemorySearch(search_query="technical spec takeaways last month", top_k=5).

Python Execution: Your script runs the FAISS search, grabs the 5 closest text chunks, and formats them into a single string.

Context Update: The Python script appends {"role": "system", "content": "Tool Result: [Match 1] The spec emphasized... [Match 2]..."} to the state_memory list.

The Heartbeat: Because the tool requested a heartbeat, the Python script does not wait for the user. It immediately loops and passes the newly updated context window back to the LLM.

LLM Execution 2: The LLM reads the system message containing the retrieved text, synthesizes the answer, and finally calls SendMessage(message="Based on the spec, the key takeaways were...").

Pause: The agent sets request_heartbeat = False and the thread goes to sleep, waiting for the user.
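The control flow above can be sketched as a small loop. Everything here is a stand-in written for illustration: `fake_llm` and `fake_archival_search` replace the real model and the FAISS lookup, the dictionary shape of a tool call is an assumption, and `max_steps` is an added safety cap not mentioned in the walkthrough.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def fake_llm(state_memory: List[Message]) -> dict:
    """Stand-in LLM: search first, then answer once tool results are in context."""
    has_results = any(m["content"].startswith("Tool Result:") for m in state_memory)
    if not has_results:
        return {
            "tool": "ArchivalMemorySearch",
            "args": {"search_query": "technical spec takeaways last month", "top_k": 5},
            "request_heartbeat": True,  # keep looping without waiting for the user
        }
    return {
        "tool": "SendMessage",
        "args": {"message": "Based on the spec, the key takeaways were..."},
        "request_heartbeat": False,  # hand control back to the user
    }


def fake_archival_search(search_query: str, top_k: int) -> str:
    """Stand-in for the FAISS search: returns top_k placeholder matches as one string."""
    return " ".join(f"[Match {i}] ..." for i in range(1, top_k + 1))


def run_heartbeat_loop(
    user_query: str,
    llm: Callable[[List[Message]], dict] = fake_llm,
    max_steps: int = 5,
) -> List[Message]:
    """Run the agent until it stops requesting heartbeats; return the full trace."""
    state_memory: List[Message] = [{"role": "user", "content": user_query}]
    for _ in range(max_steps):
        call = llm(state_memory)
        if call["tool"] == "ArchivalMemorySearch":
            result = fake_archival_search(**call["args"])
            state_memory.append({"role": "system", "content": f"Tool Result: {result}"})
        elif call["tool"] == "SendMessage":
            state_memory.append({"role": "assistant", "content": call["args"]["message"]})
        if not call["request_heartbeat"]:
            break  # the agent pauses and waits for the next user message
    return state_memory


trace = run_heartbeat_loop("What were the key takeaways from that technical spec?")
for message in trace:
    print(message["role"], "->", message["content"])
```

Returning the whole `state_memory` list rather than only the final message is what lets a test read both the retrieved context and the final answer from the same trace.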