# [STARTER] Udaplay Project

## Part 02 - Agent

In this part of the project, you'll expose your VectorDB to your Agent as a tool.

You're building UdaPlay, an AI Research Agent for the video game industry. The agent will:
1. Answer questions using internal knowledge (RAG)
2. Search the web when needed
3. Maintain conversation state
4. Return structured outputs
5. Store useful information for future use

### Setup

In [1]:
# Only needed for Udacity workspace

import importlib.util
import sys

# Check if 'pysqlite3' is available before importing
if importlib.util.find_spec("pysqlite3") is not None:
    import pysqlite3
    sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

In [2]:
# TODO: Import the necessary libs
# For example: 
# import os

from lib.agents import Agent
from lib.llm import LLM
from lib.messages import UserMessage, SystemMessage, ToolMessage, AIMessage
from lib.tooling import tool, ToolCall, Tool
import chromadb
from dotenv import load_dotenv
from lib.rag import RAG
from lib.evaluation import TestCase, AgentEvaluator, EvaluationResult
from lib.parsers import PydanticOutputParser
from lib.state_machine import StateMachine, Step, EntryPoint, Termination, Run
from lib.vector_db import VectorStoreManager
from lib.memory import ShortTermMemory
import json
from pydantic import BaseModel, Field
from typing import List, TypedDict, Optional, Union, Dict
from tavily import TavilyClient
import os



In [3]:
# TODO: Load environment variables
load_dotenv()

# OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
# TAVILY_API_KEY = os.getenv("TAVILY_API_KEY")

True

### Tools

Build at least 3 tools:
- retrieve_game: To search the vector DB
- evaluate_retrieval: To assess the retrieval performance
- game_web_search: To search the web when the retrieved documents are not good enough


#### Retrieve Game Tool

In [4]:
# TODO: Create retrieve_game tool
# It should use chroma client and collection you created
chroma_client = chromadb.PersistentClient(path="chromadb")
collection = chroma_client.get_collection("udaplay")
rag_llm = LLM(
    model="gpt-4o-mini",
    temperature=0.3,
)
rag = RAG(
    llm=rag_llm,
    vector_store = collection
)

@tool
def retrieve_game(query):
    """
    Semantic search: finds the most similar results in the vector DB.
    args:
    - query: a question about the game industry.

    You'll receive the results as a list. Each element contains:
    - Platform: e.g. Game Boy, PlayStation 5, Xbox 360
    - Name: name of the game
    - YearOfRelease: year when the game was released for that platform
    - Description: additional details about the game
    """

    return rag.invoke(query).get_final_state()['documents']

test_query = "What games were released in 1996?"
test_result_retrieve_game = retrieve_game(test_query)
print(test_result_retrieve_game)


[StateMachine] Starting: __entry__
[StateMachine] Executing step: retrieve
[StateMachine] Executing step: augment
[StateMachine] Executing step: generate
[StateMachine] Terminating: __termination__
["[Nintendo 64] Super Mario 64 (1996) - A groundbreaking 3D platformer that set new standards for the genre, featuring Mario's quest to rescue Princess Peach.", '[PlayStation 1] Gran Turismo (1997) - A realistic racing simulator featuring a wide array of cars and tracks, setting a new standard for the genre.', "[Wii] Wii Sports (2006) - A collection of sports games that utilize the Wii's motion controls, bundled with the console to showcase its capabilities.", '[Game Boy Color] Pokémon Gold and Silver (1999) - Second-generation Pokémon games introducing new regions, Pokémon, and gameplay mechanics.', '[Super Nintendo Entertainment System (SNES)] Super Mario World (1990) - A classic platformer where Mario embarks on a quest to save Princess Toadstool and Dinosaur Land from Bowser.', '[Game Bo

#### Evaluate Retrieval Tool

In [5]:
# TODO: Create evaluate_retrieval tool
# You might use an LLM as judge in this tool to evaluate the performance
# You need to prompt that LLM with something like:
# "Your task is to evaluate if the documents are enough to respond to the query. "
# "Give a detailed explanation, so it's possible to take an action to accept it or not."
# Use EvaluationReport to parse the result
#evaluator = AgentEvaluator()


class EvaluationReport(BaseModel):
    useful: bool = Field(description="Whether the retrieved documents are useful to answer the question")
    description: str = Field(description="Description about the evaluation result")

@tool
def evaluate_retrieval(question, retrieved_docs):
    """
    Based on the user's question and on the list of retrieved documents, 
    it will analyze the usability of the documents to respond to that question. 
    args: 
    - question: original question from user
    - retrieved_docs: retrieved documents most similar to the user query in the Vector Database
    The result includes:
    - useful: whether the documents are useful to answer the question
    - description: description about the evaluation result
    """
    llm_judge = LLM(model="gpt-4o-mini")

    # Use LLM as judge to evaluate the response
    judge_prompt = f"""
    Your task is to evaluate if the documents are enough to respond to the query.
    
    User Query: {question}
    Retrieved Documents: {retrieved_docs}
    
    Provide two outputs:
    1) Whether the documents are enough to respond to the user's query (True or False)
    2) Give a detailed explanation, so it's possible to take an action to accept it or not.

    """
    
    # Use structured output with Pydantic model
    judge_response = llm_judge.invoke(
        input=judge_prompt, 
        response_format=EvaluationReport
    )
    
    # Parse the structured response
    parser = PydanticOutputParser(model_class=EvaluationReport)
    try:
        evaluation = parser.parse(judge_response)
    except Exception as e:
        # Without this early return, `evaluation` would be undefined below.
        print(f"Debug: Structured parsing error: {e}")
        print(f"Debug: Judge response content: {judge_response.content}")
        return False, f"Could not parse the judge response: {e}"

    return evaluation.useful, evaluation.description

test_result_evaluate_retrieval_useful, test_result_evaluate_retrieval_desc = evaluate_retrieval(test_query, test_result_retrieve_game)
print(test_result_evaluate_retrieval_useful)
print(test_result_evaluate_retrieval_desc)

False
The retrieved documents do not provide sufficient information to answer the user's query about games released in 1996. The only relevant game mentioned is 'Super Mario 64,' which was released in 1996, but the other documents list games from different years (e.g., Gran Turismo in 1997, Pokémon Gold and Silver in 1999, etc.). Therefore, the documents fail to comprehensively cover the user's request for games specifically released in 1996.


#### Game Web Search Tool

In [6]:
# TODO: Create game_web_search tool
# Please use Tavily client to search the web


@tool
def game_web_search(question):
    """
    Web search: uses Tavily to search the web for an answer.
    args:
    - question: a question about the game industry.
    """
    api_key = os.getenv("TAVILY_API_KEY")
    client = TavilyClient(api_key=api_key)

    # Perform the search
    search_result = client.search(
        query=question,
        include_answer=True,
        include_raw_content=False,
        include_images=False
    )
    return search_result
print(game_web_search(test_query))

{'query': 'What games were released in 1996?', 'follow_up_questions': None, 'answer': 'In 1996, Super Mario 64, Resident Evil, and NiGHTS Into Dreams were released. These were among the most notable games of that year.', 'images': [], 'results': [{'url': 'https://videogamehistory.fandom.com/wiki/List_of_video_games_released_in_1996', 'title': 'List of video games released in 1996', 'content': "1. '96 Flag Rally · 2. 16 Tales: Vol. 2 · 3. 16 Tales: Vol. 1 · 4. 16 Tales: Vol. 3 · 5. 16 Tales: Vol. 4 · 6. 2Xtreme · 7. 3 Skulls of the Toltecs · 8. 3-D Tetris", 'score': 0.99933845, 'raw_content': None}, {'url': 'https://www.youtube.com/watch?v=2t0Q91Kd1D8', 'title': 'The 26 Best Video Games of 1996 (According to NEXT ... - YouTube', 'content': 'The 26 Best Video Games of 1996 (According to NEXT Generation Magazine)\n\nDefunct Games\n100 likes\n2173 views\n25 Nov 2022\nWhat do the games Duke Nukem 3D, Street Fighter Alpha 2 and Twisted Metal have in common? These three amazing games were all

### Agent

In [7]:
# TODO: Create your Agent abstraction using StateMachine
# Equip with an appropriate model
# Craft a good set of instructions 
# Plug all Tools you developed


class AgentState(TypedDict):
    user_query: str  # The current user query being processed
    instructions: str  # System instructions for the agent
    messages: List[dict]  # List of conversation messages
    current_tool_calls: Optional[List[ToolCall]]  # Current pending tool calls
    session_id: str  # Session identifier carried through every step


class Agent:
    def __init__(self, 
                 model_name: str,
                 instructions: str, 
                 tools: Optional[List[Tool]] = None,
                 temperature: float = 0.7):
        """
        Initialize an Agent instance
        
        Args:
            model_name: Name/identifier of the LLM model to use
            instructions: System instructions for the agent
            tools: Optional list of tools available to the agent
            temperature: Temperature parameter for LLM (default: 0.7)
        """
        self.instructions = instructions
        self.tools = tools if tools else []
        self.model_name = model_name
        self.temperature = temperature
        self.sessions = {}

        # Initialize memory and state machine
        self.memory = ShortTermMemory()
        self.workflow = self._create_state_machine()

    def _prepare_messages_step(self, state: AgentState) -> AgentState:
        """Step logic: Prepare messages for LLM consumption"""
        messages = state.get("messages", [])
        
        # If no messages exist, start with system message
        if not messages:
            messages = [SystemMessage(content=state["instructions"])]
            
        # Add the new user message
        messages.append(UserMessage(content=state["user_query"]))
        
        return {
            "messages": messages,
            "session_id": state["session_id"]
        }

    def _llm_step(self, state: AgentState) -> AgentState:
        """Step logic: Process the current state through the LLM"""

        # Initialize LLM
        llm = LLM(
            model=self.model_name,
            temperature=self.temperature,
            tools=self.tools
        )

        response = llm.invoke(state["messages"])
        tool_calls = response.tool_calls if response.tool_calls else None

        # Create AI message with content and tool calls
        ai_message = AIMessage(content=response.content, tool_calls=tool_calls)
        
        return {
            "messages": state["messages"] + [ai_message],
            "current_tool_calls": tool_calls,
            "session_id": state["session_id"],
        }

    def _tool_step(self, state: AgentState) -> AgentState:
        """Step logic: Execute any pending tool calls"""
        tool_calls = state["current_tool_calls"] or []
        tool_messages = []
        
        for call in tool_calls:
            # Access tool call data correctly
            function_name = call.function.name
            function_args = json.loads(call.function.arguments)
            tool_call_id = call.id
            # Find the matching tool
            tool = next((t for t in self.tools if t.name == function_name), None)
            if tool:
                result = tool(**function_args)
                tool_messages.append(
                    ToolMessage(
                        content=json.dumps(result), 
                        tool_call_id=tool_call_id, 
                        name=function_name, 
                    )
                )
        
        # Clear tool calls and add results to messages
        return {
            "messages": state["messages"] + tool_messages,
            "current_tool_calls": None,
            "session_id": state["session_id"]
        }

    def _create_state_machine(self) -> StateMachine[AgentState]:
        """Create the internal state machine for the agent"""
        machine = StateMachine[AgentState](AgentState)
        
        # Create steps
        entry = EntryPoint[AgentState]()
        message_prep = Step[AgentState]("message_prep", self._prepare_messages_step)
        llm_processor = Step[AgentState]("llm_processor", self._llm_step)
        tool_executor = Step[AgentState]("tool_executor", self._tool_step)
        termination = Termination[AgentState]()
        
        machine.add_steps([entry, message_prep, llm_processor, tool_executor, termination])
        
        # Add transitions
        machine.connect(entry, message_prep)
        machine.connect(message_prep, llm_processor)
        
        # Transition based on whether there are tool calls
        def check_tool_calls(state: AgentState) -> Union[Step[AgentState], str]:
            """Transition logic: Check if there are tool calls"""
            if state.get("current_tool_calls"):
                return tool_executor
            return termination
        
        machine.connect(llm_processor, [tool_executor, termination], check_tool_calls)
        machine.connect(tool_executor, llm_processor)  # Go back to llm after tool execution
        
        return machine

    def invoke(self, query: str, session_id: Optional[str] = None) -> Run:
        """
        Run the agent on a query
        
        Args:
            query: The user's query to process
            session_id: Optional session identifier (uses "default" if None)
            
        Returns:
            The final run object after processing
        """
        session_id = session_id or "default"

        # Create session if it doesn't exist
        self.memory.create_session(session_id)

        # Get previous messages from last run if available
        previous_messages = []
        last_run: Run = self.memory.get_last_object(session_id)
        if last_run:
            last_state = last_run.get_final_state()
            if last_state:
                previous_messages = last_state["messages"]

        initial_state: AgentState = {
            "user_query": query,
            "instructions": self.instructions,
            "messages": previous_messages,
            "current_tool_calls": None,
            "session_id": session_id,
        }

        run_object = self.workflow.run(initial_state)

        # Store the complete run object in memory
        self.memory.add(run_object, session_id)

        return run_object

tools = [ retrieve_game, evaluate_retrieval, game_web_search ]

udaplay_agent = Agent(
    model_name="gpt-4o-mini",
    tools=tools,
    instructions=(
        "You're UdaPlay, an AI Research Agent for the video game industry. "
        "You can answer multistep questions by sequentially calling functions. "
        "You follow a pattern of Thought and Action. "
        "Create a plan of execution: "
        "- Use Thought to describe your thoughts about the question you have been asked. "
        "- Use Action to specify one of the tools available to you. "
        "When you think it's over, return the answer. "
        "Never try to respond directly if the question needs a tool. "
        "But if you don't have a tool available, you can respond directly. "
        
        f"The actions you have are the Tools: {tools}. \n"
    )
)

### Test Questions and Responses

In [8]:
def run_agent_query(question: str, session_id: str):
    """
    Run a query through the UdaPlay agent and print:
    - The user question
    - The final AI answer
    - A detailed tool trace (which tools were called and with what arguments)
    """
    print("\n" + "=" * 80)
    print(f"USER QUERY: {question}")
    print("=" * 80)

    # Invoke the state machine–based agent
    run = udaplay_agent.invoke(query=question, session_id = session_id)

    # Extract the final state (messages, user_query, etc.)
    final_state = run.get_final_state()
    messages = final_state["messages"]

    # --- Final answer (last message content) ---
    final_answer = messages[-1].content
    print("\nFINAL ANSWER:\n")
    print(final_answer)

    # --- Tool trace (what tools were used and what they returned) ---
    print("\nTOOL TRACE:\n")
    for i, msg in enumerate(messages):
        msg_type = type(msg).__name__

        # 1) Tool calls requested by the model (inside AIMessage.tool_calls)
        tool_calls = getattr(msg, "tool_calls", None)
        if tool_calls:
            print(f"Step {i}: MODEL TOOL CALLS ({msg_type})")
            for tc in tool_calls:
                fn = getattr(tc, "function", None)
                tool_name = getattr(fn, "name", "unknown_tool")
                tool_args = getattr(fn, "arguments", "")
                print(f"  → Calling tool: {tool_name} with args: {tool_args}")

        # 2) Tool execution results (ToolMessage objects)
        if msg_type == "ToolMessage":
            tool_name = getattr(msg, "name", "unknown_tool")
            content_preview = msg.content
            if len(content_preview) > 300:
                content_preview = content_preview[:300] + "..."
            print(f"Step {i}: TOOL RESULT from {tool_name}")
            print(f"  {content_preview}")


In [9]:
# Run a small test suite of questions through the agent

test_questions = [
    "When was Pokémon Gold and Silver released?",
    "Which Mario game was the first 3D platformer?",
    "Was Mortal Kombat X ever released on PlayStation 5?",
    "Tell me about Super Mario 64 and Super Mario World.",
    "Which one was released first?",
    "What platforms were they on?"
]

for q in test_questions:
    run_agent_query(q, session_id="123")



USER QUERY: When was Pokémon Gold and Silver released?
[StateMachine] Starting: __entry__
[StateMachine] Executing step: message_prep
[StateMachine] Executing step: llm_processor
[StateMachine] Starting: __entry__
[StateMachine] Executing step: retrieve
[StateMachine] Executing step: augment
[StateMachine] Executing step: generate
[StateMachine] Terminating: __termination__
[StateMachine] Executing step: tool_executor
[StateMachine] Executing step: llm_processor
[StateMachine] Executing step: tool_executor
[StateMachine] Executing step: llm_processor
[StateMachine] Terminating: __termination__

FINAL ANSWER:

Pokémon Gold and Silver were released in 1999.

TOOL TRACE:

Step 2: MODEL TOOL CALLS (AIMessage)
  → Calling tool: retrieve_game with args: {"query":"Pokémon Gold and Silver release date"}
Step 3: TOOL RESULT from retrieve_game
  ["[Game Boy Color] Pok\u00e9mon Gold and Silver (1999) - Second-generation Pok\u00e9mon games introducing new regions, Pok\u00e9mon, and gameplay mecha

## Understanding Tool Call Traces

The UdaPlay agent operates in a multi-step reasoning loop that often involves calling one or more tools.  
The tool trace printed in the previous cells helps illustrate *how* the agent reasons and *why* it selects each tool.

Here is what to look for in the trace:

---

### 1. Model-Initiated Tool Calls
During the reasoning loop, the model may produce an `AIMessage` containing a list of `tool_calls`.  
This indicates that the LLM decided it needs external information.

A typical entry looks like:

→ Calling tool: retrieve_game with args: {"query": "..."}


This shows:
- **which** tool was selected  
- **what** it was asked to do (via the arguments)  
- a glimpse of the agent’s decision-making at that step  

This aligns with modern agent frameworks (OpenAI tool calling, LangChain tools, LangGraph nodes).

---

### 2. Tool Execution Results
After a tool call, the state machine returns a `ToolMessage` showing the actual output of the tool.

Example:

Step 3: TOOL RESULT from retrieve_game

[...]


This includes:
- retrieved documents
- metadata returned by the tool
- structured game data from the vector database or web search

The LLM uses these results to inform its next step.

---

### 3. Multi-Tool Paths
Some questions produce only a single tool call (e.g., retrieving a game description).  
Others may trigger a fallback pattern, such as:

1. `retrieve_game`  
2. `evaluate_retrieval` (model checks whether the results are relevant)  
3. `game_web_search` (if local data was insufficient)

The trace makes this process transparent.
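
The fallback path above can be sketched in plain Python. This is only an illustration under stated assumptions: in the notebook the LLM decides which tool to call next, whereas here the order is hard-coded. The function `answer_with_fallback` and the three `fake_*` stubs are hypothetical stand-ins for the real tools.

```python
from typing import Callable, List, Tuple


def answer_with_fallback(
    query: str,
    retrieve: Callable[[str], List[str]],
    evaluate: Callable[[str, List[str]], Tuple[bool, str]],
    web_search: Callable[[str], List[str]],
) -> Tuple[str, List[str]]:
    """Return (source, documents), preferring local retrieval over the web."""
    docs = retrieve(query)
    useful, _reason = evaluate(query, docs)
    if useful:
        return "vector_db", docs
    # Local results were judged insufficient, so fall back to the web.
    return "web", web_search(query)


# Hypothetical stubs standing in for the real tools.
def fake_retrieve(query: str) -> List[str]:
    return ["[Nintendo 64] Super Mario 64 (1996)"]


def fake_evaluate(query: str, docs: List[str]) -> Tuple[bool, str]:
    useful = "mario" in query.lower() and any("Mario" in d for d in docs)
    return useful, "stub judgement"


def fake_web_search(query: str) -> List[str]:
    return [f"web result for: {query}"]


source, documents = answer_with_fallback(
    "Was Mortal Kombat X ever released on PlayStation 5?",
    fake_retrieve,
    fake_evaluate,
    fake_web_search,
)
print(source, documents)
```

Passing the tools in as callables keeps the sketch testable without a vector store or an API key.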

---

### 4. Why These Traces Matter
Tool call traces are one of the most important diagnostic tools when developing agent systems.  
They reveal:

- whether the agent is selecting the right tool  
- whether the tool APIs are behaving as expected  
- whether the agent can recover from weak retrieval  
- how the model chains information across steps  

These traces mirror debugging workflows used in production agentic systems.

---

### Summary

The printed tool call trace provides a window into the agent’s internal decision-making loop.  
Understanding this trace is essential for debugging, optimization, and future extensions of the agentic architecture.


## Agent Design & Tooling Rationale

This notebook implements the agentic layer of the UdaPlay system.  
The design combines structured state management, tool-based reasoning, short-term session memory, and a controlled reasoning loop.  
This section explains the rationale behind these architectural choices and how they align with real-world agentic AI patterns.

---

### 1. Why Use an Agent Instead of a Single LLM Call?

A single LLM response cannot reliably:
- retrieve domain-specific facts,
- decide when to consult external tools,
- maintain conversational context,
- or integrate multiple data sources.

The UdaPlay agent uses a **multi-step reasoning loop** that allows the model to:
1. interpret the user query  
2. decide which tool(s) to call  
3. incorporate retrieved evidence  
4. synthesize a final answer  

This approach mirrors the architectures used in modern enterprise-grade LLM applications.

---

### 2. Tool-Based Reasoning (RAG + Web Search)

The agent uses three core tools:

#### **1. `retrieve_game`**
Queries the local ChromaDB vector store for semantically relevant games.  
Benefits:
- fast retrieval  
- grounded responses  
- transparent metadata  

#### **2. `evaluate_retrieval`**
Allows the agent to self-assess whether retrieved documents are relevant enough.  
This is a common pattern in real RAG pipelines where retrieval quality must be validated before the LLM relies on it.

#### **3. `game_web_search`**
Provides access to up-to-date or external information when local data is insufficient.  
This hybrid setup (local RAG + external search) is a common pattern in production agent systems.

Together, these tools demonstrate controlled tool selection, fallback behaviors, and information fusion.

---

### 3. State Machine Architecture for Deterministic Control

The agent is implemented as a **state machine**, giving:
- deterministic transitions between reasoning steps,  
- full transparency of intermediate tool calls,  
- debuggable state snapshots (`run.get_final_state()`),  
- support for multi-step tool usage in a controlled flow.

This design matches modern frameworks such as LangGraph, where explicit state definitions and step functions create predictable agent behavior.

---

### 4. Short-Term, Per-Session Memory

The agent now supports **session-based memory**:

- Each session ID corresponds to a dedicated in-memory message buffer.
- Subsequent queries in the same session build on the agent’s prior responses.
- The agent retains context such as previously discussed games or facts.
- Memory resets when a new session ID is used.

This is **short-term memory**, meaning it persists across multiple queries within the same session, but does **not** persist across restarts or different sessions.

This design enables multi-turn conversations while keeping the architecture simple and modular.
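
The idea of a per-session buffer can be illustrated with a small, self-contained sketch. The `SessionBuffer` class below is hypothetical and is not the `lib.memory.ShortTermMemory` API; it only shows how messages keyed by session ID stay isolated from each other.

```python
from collections import defaultdict
from typing import Dict, List


class SessionBuffer:
    """Hypothetical in-memory message buffer keyed by session ID."""

    def __init__(self) -> None:
        self._sessions: Dict[str, List[str]] = defaultdict(list)

    def add(self, session_id: str, message: str) -> None:
        self._sessions[session_id].append(message)

    def history(self, session_id: str) -> List[str]:
        # Return a copy so callers cannot mutate the stored buffer.
        return list(self._sessions.get(session_id, []))


memory = SessionBuffer()
memory.add("a", "Tell me about Super Mario 64 and Super Mario World.")
memory.add("a", "Which one was released first?")
memory.add("b", "When was Gran Turismo released?")

print(memory.history("a"))  # both messages from session "a"
print(memory.history("c"))  # unknown session: empty history
```

Because everything lives in a Python dictionary, the history disappears when the kernel restarts, which matches the short-term behaviour described above.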

---

### 5. Output Format

The final answer is returned as standard natural language text.  
This keeps the notebook:
- easy to read,
- easy to debug,
- and ideal for exploratory development.

Although the system does not yet enforce a structured schema (e.g., via Pydantic), the modular tool design makes it straightforward to add structured output in the future if needed.

---

### 6. Alignment With Real-World Agentic Patterns

The UdaPlay agent implements several patterns that directly match real-world agent design:

- **Think → Decide → Retrieve → Evaluate → Synthesize** loop  
- Hybrid RAG architecture (local + web search)  
- Explicit tool routing  
- Deterministic state transitions  
- Session-scoped conversational memory  
- Transparent tool traces for debugging  
- Clean separation between LLM reasoning and data/tool operations  

These principles reflect the architectures used in production-grade assistants, copilots, and automated research agents across industry.

---

### Summary

The UdaPlay agent is intentionally designed to be:

- **Modular** — tools, memory, and LLM layers separated  
- **Transparent** — full tool trace visibility  
- **Stateful** — session-based memory to support multi-turn conversation  
- **Hybrid** — combining vector search and real-time web search  
- **Extensible** — easy to add new tools, memory modules, or output schemas  

This notebook serves as a simplified but accurate blueprint for building real agentic AI systems.

## Current Limitations of the UdaPlay Agent

Although the UdaPlay agent demonstrates multi-step reasoning, tool use, and short-term session memory, it still has several limitations that are important for understanding the boundaries of the current design.

### 1. No Long-Term or Cross-Session Memory
Session-based memory is supported (via session IDs), but:
- memory does not persist beyond the current session,
- memory is not stored to disk or a database,
- restarting the environment or using a new session ID resets context.

This limits persistent personalization, historical context, and long-running workflows.

---

### 2. No Structured Output Schema
The agent currently returns natural-language responses only.  
While readable, this prevents:
- downstream automation,
- guaranteed field consistency (e.g., game name, year, platform),
- machine-parseable outputs for APIs or monitoring tools.

---

### 3. Retrieval Quality Not Fully Verified
Although an `evaluate_retrieval` tool exists, the agent does not yet enforce:
- minimum similarity thresholds,
- retrieval confidence scoring,
- fallback logic when retrieval is low-quality.

Weak retrievals may still influence the final answer.

---

### 4. Limited Normalization of Web Search Results
Web search outputs vary significantly in formatting.  
The agent does not yet:
- extract consistent fields (release dates, platform lists),
- reconcile conflicting web sources,
- filter noise or irrelevant hits.

This reduces reliability for questions requiring external data.

---

### 5. Lack of Robust Error Handling
The toolchain assumes that:
- API requests succeed,
- tools always return data,
- the vector store is reachable.

Real-world agents need:
- retry mechanisms,
- graceful degradation paths,
- exception-aware decision logic.

---

### 6. No Automated Evaluation Suite
There is currently no quantitative way to measure:
- factual accuracy,
- retrieval quality over time,
- regression after code changes,
- performance across query types.

Testing is manual and qualitative.

---

### 7. No Multi-Agent or Supervisor Routing
The system uses a single agent loop.  
It does not:
- delegate tasks to sub-agents,
- use a supervisor model,
- split tasks into planning/execution phases.

While unnecessary for this scale, multi-agent design is common in advanced agentic workflows.

---

These limitations are typical of early-stage prototypes and clearly define the roadmap toward a production-level agentic system.

## Future Extensions to Evolve the UdaPlay Agent

The UdaPlay agent now supports multi-step reasoning, hybrid tool use, and short-term session-based memory.  
The next stage of evolution is to extend this system into a robust, production-ready agent with long-term memory, stronger data validation, and richer retrieval logic.  
Below are the natural next steps.

---

### 1. Add Long-Term and Cross-Session Memory

Extend the current session-scoped memory into a persistent memory subsystem that survives across sessions and notebook restarts.  
Potential approaches include:

- vector-based episodic memory for past conversations  
- semantic memory storage (facts, user preferences, extracted entities)  
- time-decayed memory scoring  
- memory summarization to avoid runaway growth  

This transforms the agent from a short-term conversational assistant into a continuously learning system.
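
As a first step toward persistence, session histories could be written to disk. The sketch below is an assumption-laden illustration, not part of the notebook: `JsonMemoryStore`, its methods, and the single-file JSON layout are all hypothetical, and a real system would more likely use a database or a vector store.

```python
import json
from pathlib import Path
from typing import Dict, List


class JsonMemoryStore:
    """Hypothetical file-backed store: one JSON file holding all sessions."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def load(self) -> Dict[str, List[dict]]:
        # A missing file simply means nothing has been remembered yet.
        if not self.path.exists():
            return {}
        return json.loads(self.path.read_text(encoding="utf-8"))

    def append(self, session_id: str, role: str, content: str) -> None:
        sessions = self.load()
        sessions.setdefault(session_id, []).append(
            {"role": role, "content": content}
        )
        self.path.write_text(
            json.dumps(sessions, ensure_ascii=False, indent=2),
            encoding="utf-8",
        )
```

A new `JsonMemoryStore` pointed at the same path after a restart would see the earlier messages. Rewriting the whole file on every append is acceptable for a notebook but would not scale to long histories.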

---

### 2. Enforce Structured Output with Schemas

Introduce Pydantic or JSON schema enforcement for the agent’s final answers.  
Benefits:

- guaranteed consistency (e.g., title, platform, year, confidence)  
- easier downstream automation  
- clean integration with APIs or front-end displays  
- explicit validation before responses are returned  

Structured output is essential for production-grade reliability.
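
A minimal sketch of what such a schema might look like is shown below. It uses a standard-library dataclass as a stand-in for a Pydantic model so that it runs without extra dependencies; the `GameAnswer` fields, the year bounds, and `parse_game_answer` are illustrative assumptions rather than the notebook's design.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class GameAnswer:
    """Hypothetical answer schema; field names are illustrative."""

    title: str
    platforms: List[str]
    year: int
    confidence: float

    def __post_init__(self) -> None:
        # Explicit validation before the answer is returned to the caller.
        if not self.title.strip():
            raise ValueError("title must not be empty")
        if not 1950 <= self.year <= 2100:
            raise ValueError(f"implausible release year: {self.year}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


def parse_game_answer(raw_json: str) -> GameAnswer:
    """Parse and validate a JSON string produced by the model."""
    data = json.loads(raw_json)
    # Missing or unexpected keys raise TypeError here.
    return GameAnswer(**data)


answer = parse_game_answer(
    '{"title": "Super Mario 64", "platforms": ["Nintendo 64"], '
    '"year": 1996, "confidence": 0.9}'
)
print(asdict(answer))
```

With Pydantic, the same fields would also get type coercion and richer error messages, which the dataclass version does not provide.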

---

### 3. Improve Retrieval Quality Assurance

Enhance the `evaluate_retrieval` tool into a full “retrieval governance” layer that can:

- apply similarity-score thresholds  
- rerank results using cross-encoders  
- retry retrieval with query rewriting  
- fall back to web search if local results are weak  

This significantly improves accuracy and reduces hallucination risk.
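
The threshold-and-fallback idea can be sketched independently of any vector store. The example assumes that each document comes with a distance score where smaller means more similar; the function names and the `0.5` cutoff are hypothetical, and a suitable threshold depends on the embedding model and distance metric in use.

```python
from typing import List, Tuple


def filter_by_distance(
    documents: List[str],
    distances: List[float],
    max_distance: float = 0.5,
) -> List[Tuple[str, float]]:
    """Keep documents within max_distance, ordered from closest to farthest."""
    if len(documents) != len(distances):
        raise ValueError("documents and distances must have the same length")
    kept = [
        (doc, dist)
        for doc, dist in zip(documents, distances)
        if dist <= max_distance
    ]
    return sorted(kept, key=lambda pair: pair[1])


def needs_web_fallback(kept: List[Tuple[str, float]], min_hits: int = 1) -> bool:
    """Signal a fallback when too few documents survive the threshold."""
    return len(kept) < min_hits


kept = filter_by_distance(
    ["Gran Turismo (1997)", "Super Mario 64 (1996)", "Wii Sports (2006)"],
    [0.7, 0.2, 0.4],
)
print(kept, needs_web_fallback(kept))
```

A check like this is cheap and deterministic, so it could run before the LLM judge rather than replace it.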

---

### 4. Normalize and Reconcile Web Search Data

Add a post-processing layer for `game_web_search` results:

- extract canonical fields (release year, platform list, publisher, genre)  
- remove duplicates  
- merge conflicting information using voting or confidence scoring  
- standardize the final merged result  

This makes hybrid RAG more trustworthy and consistent.
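
One simple way to reconcile conflicting sources is majority voting. The sketch below applies it to release years found in text snippets. It is a deliberately naive heuristic under stated assumptions: any four-digit year between 1950 and 2099 in a snippet counts as a vote, and `vote_release_year` is a hypothetical helper, not part of the notebook.

```python
import re
from collections import Counter
from typing import List, Optional, Tuple

# Matches years from 1950 to 2099 as whole words.
YEAR_PATTERN = re.compile(r"\b(19[5-9]\d|20\d{2})\b")


def vote_release_year(snippets: List[str]) -> Optional[Tuple[int, float]]:
    """Return (year, share_of_votes) for the most frequently mentioned year."""
    votes: Counter = Counter()
    for snippet in snippets:
        # Count each distinct year once per snippet so one source cannot dominate.
        for year in set(YEAR_PATTERN.findall(snippet)):
            votes[int(year)] += 1
    if not votes:
        return None
    year, count = votes.most_common(1)[0]
    return year, count / sum(votes.values())


result = vote_release_year(
    [
        "Super Mario 64 was released in 1996.",
        "Released June 23, 1996 in Japan.",
        "A 1997 European release followed.",
    ]
)
print(result)
```

The vote share can double as a rough confidence score, though it says nothing about the reliability of individual sources.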

---

### 5. Add Robust Error Handling and Failover Logic

Enhance the reliability of tool calls:

- automatic retries for intermittent API failures  
- fallback branches when tools return empty results  
- timeout handling  
- structured errors that feed back into the reasoning loop  

This prepares the agent for real-world, unreliable environments.
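
Automatic retries are often implemented as a small wrapper with exponential backoff. The following is a minimal sketch, assuming the wrapped call raises an exception on failure; `call_with_retries` and its parameters are hypothetical names, and the `sleep` argument exists only so the delay can be replaced in tests.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[Exception], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying with exponential backoff on the given exceptions."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error.
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ... between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

A web search tool could then be invoked as `call_with_retries(lambda: client.search(query=question))`, where `client` is whatever search client the tool already uses. Narrowing `retry_on` to network-related exceptions avoids retrying on programming errors.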

---

### 6. Build an Automated Evaluation Harness

Develop a small benchmarking suite to measure:

- accuracy of release year answers  
- correctness of platform classifications  
- robustness of multi-tool reasoning  
- performance across diverse query types  

Automated evaluation allows regression tracking and systematic improvement.
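
A very small harness along these lines might look like the sketch below. It assumes the agent can be wrapped as a function from question to answer text and scores answers by substring match, which is a crude proxy for correctness. `EvalCase`, `run_eval`, and `stub_agent` are hypothetical names and are unrelated to the classes in `lib.evaluation`.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    """Hypothetical test case: a question and text the answer must contain."""

    question: str
    expected_substring: str


def run_eval(agent_fn: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Return the fraction of cases whose answer contains the expected text."""
    if not cases:
        raise ValueError("at least one case is required")
    passed = 0
    for case in cases:
        answer = agent_fn(case.question)
        if case.expected_substring.lower() in answer.lower():
            passed += 1
    return passed / len(cases)


def stub_agent(question: str) -> str:
    # Stand-in for the real agent so the sketch runs offline.
    if "Pokémon" in question:
        return "Pokémon Gold and Silver were released in 1999."
    return "I don't know."


cases = [
    EvalCase("When was Pokémon Gold and Silver released?", "1999"),
    EvalCase("When was Super Mario 64 released?", "1996"),
]
score = run_eval(stub_agent, cases)
print(f"accuracy: {score:.2f}")
```

Tracking this score across code changes gives a basic regression signal, even before more refined metrics are added.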

---

### 7. Explore Multi-Agent or Supervisor Architectures

As the system grows, consider splitting responsibilities among specialized agents:

- a “retriever agent”  
- a “search agent”  
- a “reasoning/synthesis agent”  
- a planning layer (supervisor) to orchestrate them  

This mirrors cutting-edge agentic architectures used in industry.
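
The routing idea can be illustrated with a toy supervisor. Everything in this sketch is hypothetical: the two agents are placeholder functions, and the keyword-based `route` function stands in for what would more plausibly be an LLM-driven planning step.

```python
from typing import Callable, Dict


def retriever_agent(task: str) -> str:
    # Placeholder: a real retriever agent would query the vector store.
    return f"retriever handled: {task}"


def search_agent(task: str) -> str:
    # Placeholder: a real search agent would call a web search API.
    return f"search handled: {task}"


AGENTS: Dict[str, Callable[[str], str]] = {
    "retriever": retriever_agent,
    "search": search_agent,
}


def route(task: str) -> str:
    """Naive keyword router: recency-related questions go to web search."""
    recency_words = ("latest", "news", "upcoming", "today")
    if any(word in task.lower() for word in recency_words):
        return "search"
    return "retriever"


def supervisor(task: str) -> str:
    """Dispatch the task to the agent chosen by the router."""
    return AGENTS[route(task)](task)


print(supervisor("What is the latest Mario news?"))
print(supervisor("When was Gran Turismo released?"))
```

Keeping the router separate from the agents means the routing policy can be swapped out without touching the agents themselves.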

---

### Summary

Implementing these extensions would evolve the UdaPlay agent from a prototype into a scalable, reliable, and production-level agentic system—capable of maintaining long-term knowledge, handling complex queries, and delivering structured, validated outputs.