<a href="https://colab.research.google.com/github/wjleece/AI-Agents/blob/main/AI_Agents_w_Evals.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [39]:
%pip install anthropic
#%pip install openai
%pip install -q -U google-generativeai
%pip install fuzzywuzzy



In [40]:
#Setup and Imports
import anthropic
import google.generativeai as gemini
import re
import json
import time
import os
import copy
import glob # For finding files matching a pattern
import uuid # For generating unique learning IDs in RAG
from google.colab import userdata
#from openai import OpenAI
from google.colab import drive # For Google Drive mounting
from datetime import datetime
from typing import Dict, List, Any, Optional, Union, Tuple
from fuzzywuzzy import process, fuzz

# LLM API Keys
ANTHROPIC_API_KEY = userdata.get('ANTHROPIC_API_KEY')
#OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')

anthropic_client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)
#openai_client = OpenAI(api_key=OPENAI_API_KEY)
gemini.configure(api_key=GOOGLE_API_KEY)

ANTHROPIC_MODEL_NAME = "claude-3-5-sonnet-latest"
#OPENAI_MODEL_NAME = "gpt-4.1" # Or your preferred GPT-4 class model
EVAL_MODEL_NAME = "gemini-2.5-pro-preview-05-06" # Or your preferred Gemini model
CLASSIFIER_MODEL_NAME = "gemini-1.5-flash-latest" # Fast model for question classification / routing

In [41]:
#Drive Authentication & path mapping

DRIVE_MOUNT_PATH = '/content/drive'

try:
    drive.mount(DRIVE_MOUNT_PATH)
    print(f"Google Drive mounted successfully at {DRIVE_MOUNT_PATH}.")
except Exception as e:
    print(f"Error mounting Google Drive: {e}. RAG features will not work.")

# Set up the default learnings path
DEFAULT_LEARNINGS_DRIVE_SUBPATH = "My Drive/AI/Knowledgebases"  # Your default path
LEARNINGS_DRIVE_BASE_PATH = os.path.join(DRIVE_MOUNT_PATH, DEFAULT_LEARNINGS_DRIVE_SUBPATH)

# Create the directory if it doesn't exist
if not os.path.exists(LEARNINGS_DRIVE_BASE_PATH):
    try:
        os.makedirs(LEARNINGS_DRIVE_BASE_PATH)
        print(f"Created learnings directory: {LEARNINGS_DRIVE_BASE_PATH}")
    except Exception as e:
        print(f"Error creating learnings directory {LEARNINGS_DRIVE_BASE_PATH}: {e}")
else:
    print(f"Using existing learnings directory: {LEARNINGS_DRIVE_BASE_PATH}")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Google Drive mounted successfully at /content/drive.
Using existing learnings directory: /content/drive/My Drive/AI/Knowledgebases
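
The exists-then-create check in the cell above can also be expressed with `os.makedirs(..., exist_ok=True)`. The following is a minimal sketch of that idiom, assuming a hypothetical helper named `ensure_dir` (not part of this notebook) and a temporary directory standing in for the Drive path.

```python
import os
import tempfile

def ensure_dir(path: str) -> bool:
    """Hypothetical helper: create `path` if needed.

    Returns True if the directory was created by this call,
    False if it already existed.
    """
    existed = os.path.isdir(path)
    # exist_ok=True makes the call safe when the directory is already there.
    os.makedirs(path, exist_ok=True)
    return not existed

# Stand-in for LEARNINGS_DRIVE_BASE_PATH, created under a temp directory.
demo_base = tempfile.mkdtemp()
demo_path = os.path.join(demo_base, "AI", "Knowledgebases")
created_first_time = ensure_dir(demo_path)
created_second_time = ensure_dir(demo_path)
print(created_first_time, created_second_time)
```

The return value only distinguishes "created" from "already present", which is the same information the two print branches in the cell report.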


In [42]:
#Specialized System Prompts

# --- Worker AI Prompts ---
worker_base_instructions = """
You are a helpful customer service assistant for an e-commerce system.
Your overriding goal is to be helpful by answering questions and performing actions as requested by a human user.
When responding to the user, use the conversation context to maintain continuity.
- If a user refers to "my order" or similar, use the context to determine which order they're talking about.
- If they mention "that product" or use other references, check the context to determine what they're referring to.
- Always prioritize recent context over older context when resolving references.

The conversation context will be provided to you with each message. This includes:
- Previous questions and answers
- Recently viewed customers, products, and orders
- Recent actions taken (like creating orders, updating products, etc.)
- Relevant Learnings from a knowledge base (if applicable to the current query type).
- **Crucially, results from any tools you use will also be part of the context provided back to you.**

**YOUR RESPONSE TO THE HUMAN USER:**
Your primary role is to communicate effectively and naturally with the human user.
- After you use tools and get their results (which will be shown to you), your final textual response to the user **must be a friendly, conversational, and easy-to-understand summary.**
- **DO NOT output raw data (like JSON strings or complex lists/dictionaries from tool results) directly in your response to the user.**
- Instead, you must **interpret the tool results** and explain the outcome or provide the requested information in natural language.
- For example, if a tool you used returns information like `{"product_name": "Perplexinator", "inventory_count": 1485}`, your response to the user should be something like: "Currently, we have 1485 Perplexinators in stock." or "The Perplexinator has 1485 units available. Would you like to know more?"
- If you perform an action (e.g., creating an order), confirm this action clearly and provide key details in a sentence, for instance: "I've successfully created your order (Order ID: O4) for 10 Widgets."
- Always aim to be helpful, polite, and clear in your language.

REQUESTING CLARIFICATION FROM THE USER:
If you determine that you absolutely need more information from the user to accurately and efficiently fulfill their request or use a tool correctly, you MUST:
1. Formulate a clear, concise question for the user.
2. Prefix your entire response with the exact tag: `CLARIFICATION_REQUESTED:`
   Example: `CLARIFICATION_REQUESTED: To update the order, could you please provide the Order ID?`
3. Do NOT use any tools in the same turn you are requesting clarification. Wait for the user's response.

Keep all other responses friendly, concise, and helpful.
"""

worker_operational_system_prompt = f"""
{worker_base_instructions}

Your current task is OPERATIONAL. Focus on understanding user requests related to e-commerce functions (managing orders, products, customers), using the provided tools accurately, and interacting with the data store.
The "Relevant Learnings from Knowledge Base" provided in your context may contain operational guidelines.
Remember to synthesize tool results into a user-friendly textual response. The detailed tool outputs are logged separately for evaluation.
"""

worker_metacognitive_learnings_system_prompt = f"""
{worker_base_instructions}

Your current task is METACOGNITIVE: SUMMARIZING LEARNINGS AND EXPLAINING YOUR THINKING.
If the user asks you to "summarize your learnings", "what have you learned", "why did you", "is there a better way to" or similar phrases, your response should be based PRIMARILY on the content provided to you under the heading "Relevant Learnings from Knowledge Base (In-Session Cache)" in your current context.
- List the key principles or pieces of information from these provided learnings.
- Do not confuse these explicit learnings with a general summary of your recent actions or the current state of the data store, unless a learning specifically refers to such an action or state.
- If no specific learnings are provided in your context for this type of query, you can state that no specific new learnings have been highlighted for this interaction.
- Avoid using tools for this type of summarization unless a tool is specifically designed to retrieve or process learnings.
"""

# --- Evaluator AI Prompt (unified but guided by query type information) ---
# This prompt is largely the same as the one from worker_prompt_update_learning_summary,
# but we will emphasize the query_type in the main prompt to the evaluator.

evaluator_system_prompt = """
You are Google Gemini, an impartial evaluator assessing the quality of responses from an AI assistant to customer service queries.

You will be provided with:
- The user's query and the TYPE of query it was classified as (e.g., OPERATIONAL, METACOGNITIVE_LEARNINGS_SUMMARY).
- The conversation context (including RAG learnings) that was available to the AI assistant.
- **The AI assistant's final user-facing textual response.**
- **A log of tools called by the AI assistant, including their inputs and raw outputs.**
- A snapshot of the 'Data Store State *Before* AI Action'.
- A snapshot of the 'Data Store State *After* AI Action'.
- Details of any clarification questions the AI assistant asked.

Your primary goal is to assess the AI assistant based on the SPECIFIC TASK it was attempting, as indicated by the query type.

For each interaction, evaluate the assistant's response based on:
1.  **Accuracy**:
    * If OPERATIONAL:
        * How correct and factual is the AI's **user-facing textual response**?
        * Does the textual response accurately reflect the outcomes of the **tool calls** and changes in the datastore?
        * Did its actions (tool calls) correctly process information or modify the datastore as intended? Verify against 'Tool Call Log', 'Before' and 'After' states.
    * If METACOGNITIVE_LEARNINGS_SUMMARY: Did the AI accurately summarize the "Relevant Learnings from Knowledge Base" provided in its context? Was the summary faithful to these learnings?
    * Check for new entity IDs and correct updates if applicable to the query type.

2.  **Efficiency**:
    * Did the assistant achieve its goal with minimal clarifying questions?
    * If OPERATIONAL: Were tool calls used appropriately and efficiently? (Refer to 'Tool Call Log')
    * If METACOGNITIVE_LEARNINGS_SUMMARY: Was the summary direct and to the point based on provided learnings?

3.  **Context Awareness**:
    * Did the assistant correctly use the conversation history and entities?
    * Crucially, did the assistant adhere to the task defined by the query type?
    * Did it correctly use any "Relevant Learnings from Knowledge Base" that were pertinent to the query type?
    * For OPERATIONAL tasks, did the user-facing response make sense given the tool outputs and datastore changes?

4.  **Helpfulness & Clarity (of the user-facing response)**:
    * How well did the assistant address the user's needs *for the identified query type* in its textual response?
    * Was the **user-facing response clear, polite, and easy to understand?** Did it avoid jargon or raw data dumps?
    * Did it provide relevant information in a helpful manner?

Score the response on a scale of 1-10 for each criterion, and provide an overall score. Provide detailed reasoning, EXPLICITLY MENTIONING THE QUERY TYPE and referencing the **user-facing text**, the **tool call log**, and **datastore states** as appropriate.
- For OPERATIONAL queries, heavily reference the 'Tool Call Log' and 'Before'/'After' data store states when assessing the underlying actions, and then assess if the user-facing text accurately and clearly conveys this.
- For METACOGNITIVE_LEARNINGS_SUMMARY, heavily reference the "Relevant Learnings from Knowledge Base" that were provided to the worker.

EVALUATING CLARIFICATION QUESTIONS:
If the worker AI asked for clarification:
- Assess necessity using 'Data Store State *Before* AI Action' and context.
- If necessary and well-phrased, it should NOT negatively impact Efficiency.
- If unnecessary, it SHOULD negatively impact Efficiency.

If you, the evaluator, still have questions, use "CLARIFICATION NEEDED_EVALUATOR:".

DATA STORE CONSISTENCY (Primarily for OPERATIONAL tasks):
When assessing Accuracy for OPERATIONAL tasks, explicitly compare the AI's actions (via tool log and datastore changes) with its claims in the user-facing text.
"""

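The worker prompt above asks the model to prefix any clarification request with the exact tag `CLARIFICATION_REQUESTED:`. A minimal sketch of how calling code might detect that tag follows. The helper name `parse_worker_reply` is hypothetical and is not defined anywhere in this notebook.

```python
from typing import Tuple

# Tag text copied from the worker prompt.
CLARIFICATION_TAG = "CLARIFICATION_REQUESTED:"

def parse_worker_reply(reply: str) -> Tuple[bool, str]:
    """Hypothetical helper: return (is_clarification, text_without_tag)."""
    stripped = reply.strip()
    if stripped.startswith(CLARIFICATION_TAG):
        # Drop the tag and surrounding whitespace, keep only the question.
        return True, stripped[len(CLARIFICATION_TAG):].strip()
    return False, stripped

is_clarification, question = parse_worker_reply(
    "CLARIFICATION_REQUESTED: To update the order, could you please provide the Order ID?"
)
print(is_clarification, question)
```

Checking only the start of the reply matches the prompt's instruction that the tag prefixes the entire response.
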
In [43]:
# Global Data Stores (Initial data - will be managed by the Storage class instance)
# These are initial values. The Storage class will manage them.
initial_customers = {
    "C1": {"name": "John Doe", "email": "john@example.com", "phone": "123-456-7890"},
    "C2": {"name": "Jane Smith", "email": "jane@example.com", "phone": "987-654-3210"}
}

initial_products = {
    "P1": {"name": "Widget A", "description": "A simple widget. Very compact.", "price": 19.99, "inventory_count": 999},
    "P2": {"name": "Gadget B", "description": "A powerful gadget. It spins.", "price": 49.99, "inventory_count": 200},
    "P3": {"name": "Perplexinator", "description": "A perplexing perfunctator", "price": 79.99, "inventory_count": 1483}
}

initial_orders = {
    "O1": {"id": "O1", "product_id": "P1", "product_name": "Widget A", "quantity": 2, "price": 19.99, "status": "Shipped"},
    "O2": {"id": "O2", "product_id": "P2", "product_name": "Gadget B", "quantity": 1, "price": 49.99, "status": "Processing"}
}


In [44]:
# Standalone Anthropic Completion Function (for basic tests)
#def get_completion_anthropic_standalone(prompt: str):
#    message = anthropic_client.messages.create(
#        model=ANTHROPIC_MODEL_NAME,
#        max_tokens=2000,
#        temperature=0.0,
#        system=worker_base_instructions,
#        tools=tools_schemas_list,
#        messages=[
#          {"role": "user", "content": prompt}
#        ]
#    )
#    return message.content[0].text

In [45]:
#prompt_test_anthropic = "Hey there, which AI model do you use for answering questions?"
#print(f"Anthropic Standalone Test: {get_completion_anthropic_standalone(prompt_test_anthropic)}")

In [46]:
#def get_completion_openai_standalone(prompt: str):
#    response = openai_client.chat.completions.create(
#        model=OPENAI_MODEL_NAME,
#        max_tokens=2000,
#        temperature=0.0,
#        tools=tools_schemas_list,
#        messages=[
#            {"role": "system", "content": worker_base_instructions},
#            {"role": "user", "content": prompt}
#        ]
#    )
#    return response.choices[0].message.content

In [47]:
#prompt_test_openai = "Hey there, which AI model do you use for answering questions?"
#print(f"OpenAI Standalone Test: {get_completion_openai_standalone(prompt_test_openai)}")

In [48]:
class Storage:
    """Manages the in-memory e-commerce datastore."""
    def __init__(self):
        self.customers = copy.deepcopy(initial_customers)
        self.products = copy.deepcopy(initial_products)
        self.orders = copy.deepcopy(initial_orders)
        print("Storage initialized with deep copies of initial data.")

    def get_full_datastore_copy(self) -> Dict[str, Any]:
        """Returns a deep copy of the current datastore."""
        return {
            "customers": copy.deepcopy(self.customers),
            "products": copy.deepcopy(self.products),
            "orders": copy.deepcopy(self.orders)
        }
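
The evaluator prompt refers to 'Before' and 'After' datastore snapshots, and `get_full_datastore_copy` returns deep copies suited to that purpose. The sketch below shows one way two such snapshots might be compared. `diff_snapshots` is a hypothetical helper, and the sample data is a stand-in shaped like the notebook's datastore.

```python
import copy
from typing import Any, Dict, List

def diff_snapshots(
    before: Dict[str, Dict[str, Any]],
    after: Dict[str, Dict[str, Any]],
) -> Dict[str, Dict[str, List[str]]]:
    """Hypothetical helper: list added, removed, and changed IDs per table."""
    report: Dict[str, Dict[str, List[str]]] = {}
    for table in sorted(set(before) | set(after)):
        old, new = before.get(table, {}), after.get(table, {})
        added = sorted(k for k in new if k not in old)
        removed = sorted(k for k in old if k not in new)
        changed = sorted(k for k in new if k in old and new[k] != old[k])
        # Only report tables where something actually differs.
        if added or removed or changed:
            report[table] = {"added": added, "removed": removed, "changed": changed}
    return report

# Stand-in snapshots: one inventory change and one new order.
before = {"products": {"P1": {"inventory_count": 999}}, "orders": {}}
after = copy.deepcopy(before)
after["products"]["P1"]["inventory_count"] = 997
after["orders"]["O3"] = {"product_id": "P1", "quantity": 2, "status": "Shipped"}
print(diff_snapshots(before, after))
```

Because the snapshots are deep copies, later tool calls cannot alter the 'Before' state after it has been captured.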

In [49]:
#Definitive list of tool schemas.
tools_schemas_list = [
    {
        "name": "create_customer",
        "description": "Adds a new customer to the database. Includes customer name, email, and (optional) phone number.",
        "input_schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string", "description": "The name of the customer."},
                "email": {"type": "string", "description": "The email address of the customer."},
                "phone": {"type": "string", "description": "The phone number of the customer (optional)."}
            },
            "required": ["name", "email"]
        }
    },
    {
        "name": "get_customer_info",
        "description": "Retrieves customer information based on their customer ID. Returns the customer's name, email, and (optional) phone number.",
        "input_schema": {
            "type": "object",
            "properties": {
                "customer_id": {"type": "string", "description": "The unique identifier for the customer."}
            },
            "required": ["customer_id"]
        }
    },
    {
        "name": "create_product",
        "description": "Adds a new product to the product database. Includes name, description, price, and initial inventory count.",
        "input_schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string", "description": "The name of the product."},
                "description": {"type": "string", "description": "A description of the product."},
                "price": {"type": "number", "description": "The price of the product."},
                "inventory_count": {"type": "integer", "description": "The amount of the product that is currently in inventory."}
            },
            "required": ["name", "description", "price", "inventory_count"]
        }
    },
    {
        "name": "update_product",
        "description": "Updates an existing product with new information. Only fields that are provided will be updated; other fields remain unchanged.",
        "input_schema": {
            "type": "object",
            "properties": {
                "product_id": {"type": "string", "description": "The unique identifier for the product to update."},
                "name": {"type": "string", "description": "The new name for the product (optional)."},
                "description": {"type": "string", "description": "The new description for the product (optional)."},
                "price": {"type": "number", "description": "The new price for the product (optional)."},
                "inventory_count": {"type": "integer", "description": "The new inventory count for the product (optional)."}
            },
            "required": ["product_id"]
        }
    },
    {
        "name": "get_product_info",
        "description": "Retrieves product information based on product ID or product name (with fuzzy matching for misspellings). Returns product details including name, description, price, and inventory count.",
        "input_schema": {
            "type": "object",
            "properties": {
                "product_id_or_name": {"type": "string", "description": "The product ID or name (can be approximate)."}
            },
            "required": ["product_id_or_name"]
        }
    },
    {
        "name": "list_all_products",
        "description": "Lists all available products in the inventory.",
        "input_schema": { "type": "object", "properties": {}, "required": [] }
    },
    {
        "name": "create_order",
        "description": "Creates an order using the product's current price. If requested quantity exceeds available inventory, no order is created and available quantity is returned. Orders can only be created for products that are in stock. Supports specifying products by either ID or name with fuzzy matching for misspellings.",
        "input_schema": {
            "type": "object",
            "properties": {
                "product_id_or_name": {"type": "string", "description": "The ID or name of the product to order (supports fuzzy matching)."},
                "quantity": {"type": "integer", "description": "The quantity of the product in the order."},
                "status": {"type": "string", "description": "The initial status of the order (e.g., 'Processing', 'Shipped')."}
            },
            "required": ["product_id_or_name", "quantity", "status"]
        }
    },
    {
        "name": "get_order_details",
        "description": "Retrieves the details of a specific order based on the order ID. Returns the order ID, product name, quantity, price, and order status.",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string", "description": "The unique identifier for the order."}
            },
            "required": ["order_id"]
        }
    },
    {
        "name": "update_order_status",
        "description": "Updates the status of an order and adjusts inventory accordingly. Changing to \"Shipped\" decreases inventory. Changing to \"Returned\" or \"Canceled\" from \"Shipped\" increases inventory. Status can be \"Processing\", \"Shipped\", \"Delivered\", \"Returned\", or \"Canceled\".",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string", "description": "The unique identifier for the order."},
                "new_status": {
                    "type": "string",
                    "description": "The new status to set for the order.",
                    "enum": ["Processing", "Shipped", "Delivered", "Returned", "Canceled"]
                }
            },
            "required": ["order_id", "new_status"]
        }
    }
]
print(f"Defined {len(tools_schemas_list)} tool schemas.")

Defined 9 tool schemas.
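
Each schema above lists its mandatory arguments under `input_schema.required`. As a rough illustration, a tool call's input could be checked against that list before dispatch. `missing_required_fields` is a hypothetical helper, and `example_schema` is a trimmed copy of the `create_customer` entry.

```python
from typing import Any, Dict, List

def missing_required_fields(schema: Dict[str, Any], tool_input: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: names of required fields absent from tool_input."""
    required = schema.get("input_schema", {}).get("required", [])
    return [field for field in required if field not in tool_input]

# Trimmed copy of the create_customer schema, for illustration only.
example_schema = {
    "name": "create_customer",
    "input_schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "email": {"type": "string"},
            "phone": {"type": "string"},
        },
        "required": ["name", "email"],
    },
}

print(missing_required_fields(example_schema, {"name": "Ann"}))
```

This checks presence only. It does not validate types or enum values.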


In [50]:
# --- Tool Functions (Global for now, passed to ToolExecutor) ---

def create_customer(current_storage: Storage, name: str, email: str, phone: Optional[str] = None): # Simplified for brevity
    new_id = f"C{len(current_storage.customers) + 1}"
    while new_id in current_storage.customers: new_id = f"C{int(new_id[1:]) + 1}"
    current_storage.customers[new_id] = {"name": name, "email": email, "phone": phone}
    print(f"[Tool Exec] create_customer: ID {new_id}")
    return {"status": "success", "customer_id": new_id, "customer": current_storage.customers[new_id]}
def get_customer_info(current_storage: Storage, customer_id: str):
    customer = current_storage.customers.get(customer_id)
    if customer: print(f"[Tool Exec] get_customer_info: ID {customer_id} found."); return {"status": "success", "customer": customer}
    print(f"[Tool Exec] get_customer_info: ID {customer_id} not found."); return {"status": "error", "message": "Customer not found"}
def create_product(current_storage: Storage, name: str, description: str, price: float, inventory_count: int):
    new_id = f"P{len(current_storage.products) + 1}"
    while new_id in current_storage.products: new_id = f"P{int(new_id[1:]) + 1}"
    current_storage.products[new_id] = {"name": name, "description": description, "price": price, "inventory_count": inventory_count}
    print(f"[Tool Exec] create_product: ID {new_id}")
    return {"status": "success", "product_id": new_id, "product": current_storage.products[new_id]}
def update_product(current_storage: Storage, product_id: str, name: Optional[str]=None, description: Optional[str]=None, price: Optional[float]=None, inventory_count: Optional[int]=None):
    if product_id not in current_storage.products: print(f"[Tool Exec] update_product: ID {product_id} not found."); return {"status": "error", "message": "Product not found"}
    product = current_storage.products[product_id]; updated_fields = []
    if name: product["name"] = name; updated_fields.append("name")
    if description: product["description"] = description; updated_fields.append("description")
    if price is not None: product["price"] = price; updated_fields.append("price")  # compare against None so a price of 0 is still applied
    if inventory_count is not None : product["inventory_count"] = inventory_count; updated_fields.append("inventory_count")
    if not updated_fields: print(f"[Tool Exec] update_product: ID {product_id}, no fields updated."); return {"status":"warning", "message":"No fields updated."}
    print(f"[Tool Exec] update_product: ID {product_id}, Updated: {', '.join(updated_fields)}")
    return {"status": "success", "product_id": product_id, "product": product, "updated_fields": updated_fields}
def find_product_by_name(current_storage: Storage, product_name: str, min_similarity: int = 70) -> Tuple[Optional[str], Optional[Dict[str, Any]]]:
    if not product_name: return None, None
    name_id_list = [(p_data["name"], p_id) for p_id, p_data in current_storage.products.items()]
    if not name_id_list: return None, None
    best_match_name_score = process.extractOne(product_name, [item[0] for item in name_id_list], scorer=fuzz.token_sort_ratio)
    if best_match_name_score and best_match_name_score[1] >= min_similarity:
        matched_name = best_match_name_score[0]
        for name_val, pid_val in name_id_list:
            if name_val == matched_name: return pid_val, current_storage.products[pid_val]
    return None, None
def get_product_id(current_storage: Storage, product_identifier: str) -> Optional[str]:
    if product_identifier in current_storage.products: return product_identifier
    product_id, _ = find_product_by_name(current_storage, product_identifier)
    return product_id
def get_product_info(current_storage: Storage, product_id_or_name: str):
    pid = get_product_id(current_storage, product_id_or_name)
    if pid and pid in current_storage.products:
        print(f"[Tool Exec] get_product_info: Found '{product_id_or_name}' as ID '{pid}'.")
        return {"status": "success", "product_id": pid, "product": current_storage.products[pid]}
    print(f"[Tool Exec] get_product_info: Product '{product_id_or_name}' not found.")
    return {"status": "error", "message": f"Product '{product_id_or_name}' not found"}
def list_all_products(current_storage: Storage):
    print(f"[Tool Exec] list_all_products: Found {len(current_storage.products)} products.")
    return {"status": "success", "count": len(current_storage.products), "products": current_storage.products}
def create_order(current_storage: Storage, product_id_or_name: str, quantity: int, status: str):
    actual_product_id = get_product_id(current_storage, product_id_or_name)
    if not actual_product_id: print(f"[Tool Exec] create_order: Product '{product_id_or_name}' not found."); return {"status": "error", "message": f"Product '{product_id_or_name}' not found."}
    product = current_storage.products[actual_product_id]
    # Inventory is only checked and decremented when the order is created as "Shipped".
    if status == "Shipped":
        if product["inventory_count"] < quantity:
            print(f"[Tool Exec] create_order: Insufficient inventory for {product['name']} to ship.")
            return {"status": "error", "message": f"Insufficient inventory for {product['name']} to ship. Available: {product['inventory_count']}"}
        product["inventory_count"] -= quantity
    new_id = f"O{len(current_storage.orders) + 1}"
    while new_id in current_storage.orders: new_id = f"O{int(new_id[1:]) + 1}"
    current_storage.orders[new_id] = {"id": new_id, "product_id": actual_product_id, "product_name": product["name"], "quantity": quantity, "price": product["price"], "status": status}
    print(f"[Tool Exec] create_order: Order {new_id} for {product['name']}. Status: {status}. Inv: {product['inventory_count']}")
    return {"status": "success", "order_id": new_id, "order_details": current_storage.orders[new_id], "remaining_inventory": product["inventory_count"]}
def get_order_details(current_storage: Storage, order_id: str):
    order = current_storage.orders.get(order_id)
    if order: print(f"[Tool Exec] get_order_details: Order {order_id} found."); return {"status": "success", "order_details": order}
    print(f"[Tool Exec] get_order_details: Order {order_id} not found."); return {"status": "error", "message": "Order not found"}
def update_order_status(current_storage: Storage, order_id: str, new_status: str):
    if order_id not in current_storage.orders: print(f"[Tool Exec] update_order_status: Order {order_id} not found."); return {"status": "error", "message": "Order not found"}
    order = current_storage.orders[order_id]; product_id = order["product_id"]; quantity = order["quantity"]
    product = current_storage.products[product_id]; old_status = order["status"]
    if old_status == new_status: print(f"[Tool Exec] update_order_status: Status for {order_id} already {new_status}."); return {"status": "unchanged", "message": "Status is already " + new_status}
    inventory_adjusted = False
    if new_status == "Shipped" and old_status != "Shipped":
        if product["inventory_count"] < quantity: print(f"[Tool Exec] update_order_status: Insufficient inv for {product_id} to ship order {order_id}."); return {"status": "error", "message": "Insufficient inventory to ship."}
        product["inventory_count"] -= quantity; inventory_adjusted = True
    elif old_status == "Shipped" and new_status != "Delivered": # e.g. returned/canceled from shipped; Shipped -> Delivered must not restock
        product["inventory_count"] += quantity; inventory_adjusted = True
    order["status"] = new_status
    print(f"[Tool Exec] update_order_status: Order {order_id} to {new_status}. Inv for {product_id} is {product['inventory_count']}. Adjusted: {inventory_adjusted}")
    return {"status": "success", "order_id": order_id, "order_details": order, "current_inventory": product["inventory_count"], "inventory_adjusted": inventory_adjusted}
print("Tool functions defined.")


Tool functions defined.
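
The `update_order_status` tool description states the inventory rules: moving an order to "Shipped" decreases stock, and moving it from "Shipped" to "Returned" or "Canceled" restores it. The sketch below writes those rules as a pure function so they can be tested in isolation. `inventory_delta` is a hypothetical helper. Treating a move from "Shipped" back to "Processing" as a restock is an assumption made here so that a later re-ship does not deduct twice.

```python
def inventory_delta(old_status: str, new_status: str, quantity: int) -> int:
    """Hypothetical helper: change to apply to inventory_count for a transition."""
    if old_status == new_status:
        return 0  # nothing changes
    if new_status == "Shipped":
        return -quantity  # stock leaves the warehouse
    if old_status == "Shipped" and new_status != "Delivered":
        return quantity  # returned, canceled, or un-shipped: stock comes back
    return 0  # e.g. Shipped -> Delivered, Processing -> Canceled

for old, new in [("Processing", "Shipped"), ("Shipped", "Returned"), ("Shipped", "Delivered")]:
    print(old, "->", new, inventory_delta(old, new, 2))
```

Keeping the rule separate from the datastore mutation makes each transition easy to assert on without building a `Storage` instance.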


In [51]:
class QueryClassifier:
    """Classifies user queries using an LLM."""
    def __init__(self, llm_client):
        self.llm_client = llm_client
        self.classification_prompt_template = """
Classify the following user query into one of these categories: OPERATIONAL, METACOGNITIVE_LEARNINGS_SUMMARY.
Return ONLY the category name.

OPERATIONAL queries are about performing e-commerce tasks, like asking about products, creating orders, or updating customer information.
Examples of OPERATIONAL:
- "Show me all shoes."
- "What's the price of P1?"
- "Create an order for 2 widgets."
- "Update my address."

METACOGNITIVE_LEARNINGS_SUMMARY queries are about the AI's own learning process or knowledge derived from feedback.
Examples of METACOGNITIVE_LEARNINGS_SUMMARY:
- "Summarize your learnings."
- "What have you learned recently?"
- "Tell me about your new knowledge."
- "Why did you do that in the last turn?"
- "Is there a better way to handle X?"

User Query: "{user_message}"
Classification:"""
        print("QueryClassifier initialized with LLM client.")

    def classify(self, user_message: str) -> str:
        """Classifies the user query using the LLM."""
        prompt = self.classification_prompt_template.format(user_message=user_message)
        try:
            response = self.llm_client.generate_content(prompt)
            classification = response.text.strip()
            if classification in ["OPERATIONAL", "METACOGNITIVE_LEARNINGS_SUMMARY"]:
                return classification
            else:
                print(f"[QueryClassifier Warning] LLM returned unexpected classification: '{classification}'. Defaulting to OPERATIONAL.")
                return "OPERATIONAL"
        except Exception as e:
            print(f"[QueryClassifier Error] Failed to classify query using LLM: {e}. Defaulting to OPERATIONAL.")
            return "OPERATIONAL"
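
`classify` accepts the model output only when it matches a category name exactly, so a reply wrapped in quotes, markdown emphasis, or different casing falls through to the default. One possible way to tolerate such variations is sketched below. `normalize_label` and `VALID_LABELS` are hypothetical names, not part of this notebook.

```python
VALID_LABELS = ("OPERATIONAL", "METACOGNITIVE_LEARNINGS_SUMMARY")

def normalize_label(raw: str, default: str = "OPERATIONAL") -> str:
    """Hypothetical helper: map a raw LLM reply onto a known category."""
    # Trim whitespace, then common wrappers (quotes, backticks, asterisks, periods).
    cleaned = raw.strip().strip("\"'`*.").strip().upper()
    for label in VALID_LABELS:
        if cleaned == label:
            return label
    return default

print(normalize_label(' "operational". '))
print(normalize_label("**METACOGNITIVE_LEARNINGS_SUMMARY**\n"))
```

The fallback still defaults to `OPERATIONAL`, mirroring the behavior of `classify` for unrecognized output.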

In [52]:
class ConversationManager:
    """Manages conversation history and context data."""
    def __init__(self):
        self.messages: List[Dict[str, Any]] = []
        self.context_data: Dict[str, Any] = {
            "customers": {}, "products": {}, "orders": {}, "last_action": None
        }
        print("ConversationManager initialized.")

    def add_user_message(self, message: str) -> None:
        self.messages.append({"role": "user", "content": message})

    def add_assistant_message(self, message_content: Union[str, List[Dict[str, Any]]], query_type: str) -> None:
        if isinstance(message_content, str):
            content_to_log = f"[{query_type}]: {message_content}"
        else:
            content_to_log = message_content
        self.messages.append({"role": "assistant", "content": content_to_log})

    def update_entity_in_context(self, entity_type: str, entity_id: str, data: Any) -> None:
        if entity_type in self.context_data:
            self.context_data[entity_type][entity_id] = data
            print(f"[CM_Context Updated] Entity: {entity_type}, ID: {entity_id}")

    def set_last_action(self, action_type: str, action_details: Any) -> None:
        self.context_data["last_action"] = {
            "type": action_type,
            "details": action_details,
            "timestamp": datetime.now().isoformat()
        }
        print(f"[CM_Context Updated] Last Action: {action_type}")

    def get_full_conversation_for_api(self) -> List[Dict[str, Any]]:
        return self.messages.copy()

    def get_context_summary(self) -> str:
        summary_parts = []
        if self.context_data["customers"]: summary_parts.append(f"Recent customers: {list(self.context_data['customers'].keys())}")
        if self.context_data["products"]: summary_parts.append(f"Recent products: {list(self.context_data['products'].keys())}")
        if self.context_data["orders"]: summary_parts.append(f"Recent orders: {list(self.context_data['orders'].keys())}")
        if self.context_data["last_action"]: summary_parts.append(f"Last action type: {self.context_data['last_action']['type']}")
        return "\n".join(summary_parts) if summary_parts else "No specific context items set yet."


In [53]:
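from typing import Callable  # `callable` is a builtin function, not a type; use typing.Callable in annotations

class ToolExecutor:
    def __init__(self, available_tools_dict: Dict[str, Callable[..., Dict[str, Any]]]):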
        self.available_tools = available_tools_dict
        print("ToolExecutor initialized.")

    def execute_tool(self, tool_name: str, tool_input: Dict[str, Any], storage_instance: Storage) -> Dict[str, Any]:
        if tool_name in self.available_tools:
            try:
                tool_function = self.available_tools[tool_name]
                result = tool_function(storage_instance, **tool_input)
                print(f"--- [ToolExecutor] Result for {tool_name}: {json.dumps(result, indent=2, default=str)} ---")
                return result
            except Exception as e:
                print(f"--- [ToolExecutor Error] executing {tool_name}: {e} ---")
                import traceback
                traceback.print_exc()
                return {"status": "error", "message": f"Error executing tool {tool_name}: {str(e)}"}
        print(f"--- [ToolExecutor Error] Tool {tool_name} not found. ---")
        return {"status": "error", "message": f"Tool {tool_name} not found."}
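
The dispatch pattern used by `ToolExecutor` can be tried in isolation. The sketch below is a minimal stand-in, not the notebook's own code: `FakeStorage`, `get_product`, and `dispatch` are hypothetical names introduced only for illustration, and the real tools and `Storage` class live in other cells. It shows the calling convention (storage object first, then the tool input unpacked as keyword arguments) and how failures are turned into error dictionaries rather than raised.

```python
from typing import Any, Callable, Dict


class FakeStorage:
    """Hypothetical stand-in for the notebook's Storage class."""

    def __init__(self) -> None:
        self.products = {"P1": {"name": "Widget", "price": 9.99}}


def get_product(storage: FakeStorage, product_id: str) -> Dict[str, Any]:
    """Hypothetical tool: storage comes first, then keyword arguments."""
    product = storage.products.get(product_id)
    if product is None:
        return {"status": "error", "message": f"Product {product_id} not found."}
    return {"status": "success", "product_id": product_id, "product": product}


def dispatch(tools: Dict[str, Callable[..., Dict[str, Any]]], name: str,
             tool_input: Dict[str, Any], storage: FakeStorage) -> Dict[str, Any]:
    """Look up a tool by name and call it; any failure becomes an error dict."""
    if name not in tools:
        return {"status": "error", "message": f"Tool {name} not found."}
    try:
        return tools[name](storage, **tool_input)
    except Exception as e:  # e.g. TypeError when the model sends wrong argument names
        return {"status": "error", "message": f"Error executing tool {name}: {e}"}


tools = {"get_product": get_product}
storage = FakeStorage()
ok = dispatch(tools, "get_product", {"product_id": "P1"}, storage)
missing = dispatch(tools, "delete_everything", {}, storage)
bad_args = dispatch(tools, "get_product", {"sku": "P1"}, storage)
```

Returning an error dictionary in every failure case lets the calling loop pass the problem back to the model as a tool result instead of aborting the turn.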

In [54]:
# Original Cell 18
class KnowledgeManager:
    def __init__(self, base_path: str, drive_mount_path: str, default_subpath: str, evaluator_llm_instance):
        self.base_drive_path = base_path
        self.drive_mount_path = drive_mount_path
        self.default_drive_subpath = default_subpath
        self.evaluator_llm = evaluator_llm_instance
        self.active_learnings_cache: List[Dict] = self._load_initial_learnings_from_drive()
        self.learnings_updated_this_session_flag: bool = False
        print(f"KnowledgeManager initialized. Loaded {len(self.active_learnings_cache)} initial learnings from {self.base_drive_path}.")

    def _mount_drive_if_needed(self):
        if not os.path.exists(self.drive_mount_path) or not os.listdir(self.drive_mount_path):
            try:
                drive.mount(self.drive_mount_path, force_remount=True)
                print("Drive mounted by KnowledgeManager.")
            except Exception as e:
                print(f"KM: Error mounting Drive: {e}.")

    def _initialize_learnings_path(self):
        if not os.path.exists(self.base_drive_path):
            try:
                os.makedirs(self.base_drive_path)
                print(f"KM: Created learnings directory: {self.base_drive_path}")
            except Exception as e:
                print(f"KM: Error creating learnings directory {self.base_drive_path}: {e}")

    def _get_latest_learnings_filepath(self) -> Optional[str]:
        self._mount_drive_if_needed()
        self._initialize_learnings_path()
        if not os.path.isdir(self.base_drive_path):
            return None
        list_of_files = glob.glob(os.path.join(self.base_drive_path, 'learnings_*.json'))
        return max(list_of_files, key=os.path.getctime) if list_of_files else None

    def _read_learnings_from_file(self, filepath: str) -> List[Dict]:
        if not filepath or not os.path.exists(filepath):
            return []
        try:
            with open(filepath, 'r') as f:
                learnings_list = json.load(f)
            return learnings_list if isinstance(learnings_list, list) else []
        except Exception as e:
            print(f"KM: Error reading/decoding {filepath}: {e}")
            return []

    def _load_initial_learnings_from_drive(self) -> List[Dict]:
        latest_filepath = self._get_latest_learnings_filepath()
        if latest_filepath:
            print(f"KM: Loading initial learnings from: {latest_filepath}")
            return self._read_learnings_from_file(latest_filepath)
        print("KM: No existing learnings file found for initial load.")
        return []

    def persist_active_learnings(self):
        self._mount_drive_if_needed()
        self._initialize_learnings_path()
        if not os.path.isdir(self.base_drive_path):
            print("KM CRITICAL: Learnings directory not available.")
            return
        if not self.active_learnings_cache:
            print("KM: Active learnings cache is empty. Nothing to persist.")
            return

        timestamp_str = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
        new_filepath = os.path.join(self.base_drive_path, f'learnings_{timestamp_str}.json')
        try:
            with open(new_filepath, 'w') as f:
                json.dump(self.active_learnings_cache, f, indent=4)
            print(f"KM: Persisted {len(self.active_learnings_cache)} active learnings to: {new_filepath}")
            self.learnings_updated_this_session_flag = False
        except Exception as e:
            print(f"KM: Error persisting active learnings to {new_filepath}: {e}")

    def get_relevant_learnings_for_prompt(self, query: str, query_type: str,
                                           # Parameter name changed for clarity:
                                           recipient_role: Optional[str] = None,
                                           count: int = 5) -> str:
        if not self.active_learnings_cache:
            return "No specific relevant learnings from knowledge base provided for this query."

        # 1. Filter by recipient_role first
        eligible_learnings: List[Dict] = []
        if recipient_role == "Agent": # Agent gets "AgentAndEvaluator"
            eligible_learnings = [
                entry for entry in self.active_learnings_cache
                if entry.get("learning_target") == "AgentAndEvaluator"
            ]
            # print(f"KM: For Agent, found {len(eligible_learnings)} 'AgentAndEvaluator' learnings before keyword filtering.")
        elif recipient_role == "Evaluator": # Evaluator gets "AgentAndEvaluator" OR "EvaluatorOnly"
            eligible_learnings = [
                entry for entry in self.active_learnings_cache
                if entry.get("learning_target") in ["AgentAndEvaluator", "EvaluatorOnly"]
            ]
            # print(f"KM: For Evaluator, found {len(eligible_learnings)} 'AgentAndEvaluator' or 'EvaluatorOnly' learnings before keyword filtering.")
        else:
            # Default: no role given, or an unknown role. Fall back to what an Agent would see,
            # i.e. only "AgentAndEvaluator" learnings, so "EvaluatorOnly" entries are never exposed.
            # print(f"KM: No specific recipient_role or unknown role '{recipient_role}'. Considering 'AgentAndEvaluator' learnings.")
            eligible_learnings = [
                entry for entry in self.active_learnings_cache
                if entry.get("learning_target") == "AgentAndEvaluator" # Default to what an agent would see
            ]


        # 2. Apply query_type and keyword-based filtering on the eligible_learnings
        learnings_to_consider = []
        if query_type == "METACOGNITIVE_LEARNINGS_SUMMARY":
            # For summary, show all eligible learnings based on role, up to a limit, newest first.
            learnings_to_consider = sorted(eligible_learnings, key=lambda x: x.get('timestamp_created', ''), reverse=True)
        elif query_type == "OPERATIONAL":
            if not eligible_learnings: # No role-specific learnings to filter with keywords
                 return "No specific relevant learnings from knowledge base found for this query based on role."
            keywords = self._extract_keywords(query)
            # print(f"KM: Extracted keywords for operational query: {keywords}")
            learnings_to_consider = [
                entry for entry in eligible_learnings # Filter from already role-filtered list
                if any(kw.lower() in (entry.get("final_learning_statement", "") + " " + " ".join(entry.get("keywords", []))).lower() for kw in keywords)
            ]
            learnings_to_consider.sort(key=lambda x: x.get('timestamp_created', ''), reverse=True)
        else: # Default for other query types or if no specific logic
            learnings_to_consider = sorted(eligible_learnings, key=lambda x: x.get('timestamp_created', ''), reverse=True)

        # 3. Apply count limit
        final_learnings_for_prompt = learnings_to_consider[:count]
        # print(f"KM: Final {len(final_learnings_for_prompt)} learnings for prompt (Recipient: {recipient_role}, QueryType: {query_type})")

        formatted_learnings = [
            f"- Learning (ID: {entry.get('learning_id', 'N/A')[:8]}, Target: {entry.get('learning_target', 'N/A')}): {entry.get('final_learning_statement', str(entry))}"
            for entry in final_learnings_for_prompt
        ]

        if not formatted_learnings:
            return "No specific relevant learnings from knowledge base found for this query after all filters."
        return "\nRelevant Learnings from Knowledge Base (In-Session Cache):\n" + "\n".join(formatted_learnings)

    def _extract_keywords(self, text: str) -> List[str]:
        if not text:
            return ["general"]
        # Words of four or more characters; the raw string needs single backslashes for \b and \w.
        words = re.findall(r'\b\w{4,}\b', text.lower())
        stop_words = {
            "the", "and", "is", "in", "to", "a", "of", "for", "with", "on", "at", "an", "by", "not", "or", "as", "if",
            "it", "its", "this", "that", "these", "those", "was", "were", "be", "been", "being", "have", "has", "had",
            "do", "does", "did", "will", "would", "should", "can", "could", "may", "might", "must",
            "what", "who", "whom", "which", "when", "where", "why", "how",
            "show", "tell", "please", "user", "query", "context", "claude", "anthropic", "gemini", "google",
            "before", "after", "state", "action", "truth", "ground", "learnings", "summarize", "your", "you", "me", "my",
            "provide", "give", "make", "update", "create", "get", "list", "item", "items", "detail", "details"
        }
        extracted = list(set(word for word in words if word not in stop_words and not word.isdigit()))
        return extracted if extracted else ["generic"]

    def synthesize_and_store_learning(self, human_feedback_text: str, user_query_context: str, turn_context_summary: str, learning_target: str):
        print(f"--- KM: Processing New Learning Candidate (Target: {learning_target}): \"{human_feedback_text}\" ---")

        current_feedback_to_process = human_feedback_text
        attempt_count = 0
        max_attempts = 3

        while attempt_count < max_attempts:
            attempt_count += 1
            print(f"KM: Learning Synthesis Attempt {attempt_count}/{max_attempts}")

            evaluator_task_prompt_parts = [
                "You are an AI assistant helping to maintain a knowledge base of 'learnings' from human feedback.",
                f"The human feedback is targeted towards: {learning_target}.",
                f"New Human Feedback to process: \"{current_feedback_to_process}\"",
                f"Original User Query that led to this feedback: \"{user_query_context}\"",
                f"General Conversation Context when feedback was given: \"{turn_context_summary}\"",
                # Parenthesize the conditional so the header is kept even when the cache is empty.
                "Existing ACTIVE learnings (sample of last 3, if any):\n" + (
                    "\n".join(
                        f"  - (ID: {entry.get('learning_id', 'N/A')[:6]}, Target: {entry.get('learning_target', 'N/A')}) {entry.get('final_learning_statement', '')[:100]}..."
                        for entry in self.active_learnings_cache[-3:]
                    ) if self.active_learnings_cache else "  - No existing learnings in cache."
                ),
                "Your Tasks:",
                "1. Analyze the 'New Human Feedback'.",
                "2. Check for CONFLICT or significant REDUNDANCY with existing learnings. Consider general knowledge principles and the stated target of the learnings.",
                "3. If the feedback is new, valuable, non-conflicting, and non-redundant, synthesize it into a concise, actionable 'Finalized Learning Statement'. This statement should be generalizable if possible.",
                "Output Format Instructions:",
                "- If suitable for storing: `FINALIZED_LEARNING: [synthesized statement]`",
                "- If it conflicts: `CONFLICT_DETECTED: [Explanation of the conflict, and if possible, reference key phrases or IDs of conflicting existing learnings]. Proposed statement if you tried to rephrase: [rephrased statement, or original if no rephrase attempt]`",
                "- If it's redundant: `REDUNDANT_LEARNING: [Explanation of redundancy, and if possible, reference key phrases or IDs of the existing learning it's redundant with]. Proposed statement if you tried to rephrase: [rephrased statement, or original if no rephrase attempt]`",
                "- If not actionable/too vague: `NOT_ACTIONABLE: [Explanation]`",
                "Ensure your entire response strictly follows one of these prefixed formats."
            ]
            synthesis_prompt = "\n".join(evaluator_task_prompt_parts)

            try:
                synthesis_response_obj = self.evaluator_llm.generate_content(synthesis_prompt)
                evaluator_synthesis_text = synthesis_response_obj.text.strip()
                print(f"KM: Gemini Learning Synthesis Raw Response:\n{evaluator_synthesis_text}")

                final_statement = None
                conflict_explanation = None
                redundant_explanation = None
                not_actionable_explanation = None

                if evaluator_synthesis_text.startswith("FINALIZED_LEARNING:"):
                    final_statement = evaluator_synthesis_text.replace("FINALIZED_LEARNING:", "", 1).strip()
                elif evaluator_synthesis_text.startswith("CONFLICT_DETECTED:"):
                    conflict_explanation = evaluator_synthesis_text.replace("CONFLICT_DETECTED:", "", 1).strip()
                elif evaluator_synthesis_text.startswith("REDUNDANT_LEARNING:"):
                    redundant_explanation = evaluator_synthesis_text.replace("REDUNDANT_LEARNING:", "", 1).strip()
                elif evaluator_synthesis_text.startswith("NOT_ACTIONABLE:"):
                    not_actionable_explanation = evaluator_synthesis_text.replace("NOT_ACTIONABLE:", "", 1).strip()
                else:
                    print("KM: Gemini learning synthesis response format unexpected. Defaulting to not actionable.")
                    not_actionable_explanation = f"Response format error: {evaluator_synthesis_text}"

                if final_statement:
                    self.active_learnings_cache.append({
                        "learning_id": str(uuid.uuid4()),
                        "timestamp_created": datetime.now().isoformat(),
                        "original_human_input": human_feedback_text,
                        "processed_human_input": current_feedback_to_process,
                        "final_learning_statement": final_statement,
                        "keywords": self._extract_keywords(final_statement + " " + current_feedback_to_process),
                        "status": "active",
                        "learning_target": learning_target
                    })
                    self.learnings_updated_this_session_flag = True
                    print(f"KM: Stored new learning. Cache size: {len(self.active_learnings_cache)}")
                    return

                elif conflict_explanation:
                    print(f"KM: Learning conflict detected by Gemini: {conflict_explanation}")
                    if attempt_count < max_attempts:
                        print("KM: --- CONFLICT RESOLUTION ---")
                        print(f"Original feedback: '{human_feedback_text}'")
                        print(f"Feedback being processed: '{current_feedback_to_process}'")
                        user_choice = input("Conflict detected. (M)odify your feedback, (S)kip storing, or (P)roceed with current version for resynthesis? [M/S/P]: ").strip().upper()
                        if user_choice == 'M':
                            new_feedback = input("Enter your modified feedback: ").strip()
                            if new_feedback:
                                current_feedback_to_process = new_feedback
                                print("KM: Retrying synthesis with modified feedback.")
                                continue
                            else:
                                print("KM: No modification provided. Skipping.")
                                return
                        elif user_choice == 'P':
                            print("KM: User chose to proceed. Retrying synthesis with current feedback version.")
                            continue
                        else:
                            print("KM: Skipping this learning due to unresolved conflict.")
                            return
                    else:
                        print("KM: Max attempts reached for conflict resolution. Skipping this learning.")
                        return

                elif redundant_explanation:
                    print(f"KM: Learning deemed redundant by Gemini: {redundant_explanation}")
                    user_choice_redundant = input("This learning seems redundant. (S)kip storing, or (F)orce store anyway? [S/F]: ").strip().upper()
                    if user_choice_redundant == 'F':
                        proposed_statement_match = re.search(r"Proposed statement.*?:\s*(.*)", redundant_explanation, re.IGNORECASE)
                        if proposed_statement_match and proposed_statement_match.group(1).strip():
                            forced_statement = proposed_statement_match.group(1).strip()
                            print(f"KM: Using LLM's proposed statement due to Force: '{forced_statement}'")
                        else:
                            forced_statement = current_feedback_to_process
                            print(f"KM: No specific proposed statement from LLM. Using current feedback for Force: '{forced_statement}'")

                        self.active_learnings_cache.append({
                            "learning_id": str(uuid.uuid4()),
                            "timestamp_created": datetime.now().isoformat(),
                            "original_human_input": human_feedback_text,
                            "processed_human_input": current_feedback_to_process,
                            "final_learning_statement": forced_statement,
                            "keywords": self._extract_keywords(forced_statement + " " + current_feedback_to_process),
                            "status": "active_forced_redundancy",
                            "learning_target": learning_target,
                            "notes": f"Forced storage despite redundancy. Original LLM note: {redundant_explanation}"
                        })
                        self.learnings_updated_this_session_flag = True
                        print(f"KM: Stored learning (forced despite redundancy). Cache size: {len(self.active_learnings_cache)}")
                        return
                    else:
                        print("KM: Skipping redundant learning.")
                        return

                elif not_actionable_explanation:
                    print(f"KM: Learning deemed not actionable by Gemini: {not_actionable_explanation}")
                    print("KM: Skipping this learning.")
                    return

                else:
                    print("KM: Synthesis resulted in an unhandled state. Skipping.")
                    return

            except Exception as e:
                print(f"KM: Error during learning synthesis attempt {attempt_count}: {e}")
                import traceback
                traceback.print_exc()
                if attempt_count >= max_attempts:
                    print("KM: Max attempts reached due to errors. Skipping this learning.")
                    return
                time.sleep(1)

        print("KM: Could not synthesize learning after maximum attempts. Skipping.")
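
The prefix handling inside `synthesize_and_store_learning` can also be expressed as one small helper. The sketch below is an illustration under assumptions, not the notebook's implementation: `parse_synthesis_response` and `PREFIXES` are hypothetical names. It maps the evaluator's raw text to a `(label, body)` pair and falls back to `NOT_ACTIONABLE` when no known prefix is present, mirroring the default branch in the method above.

```python
from typing import Tuple

# The four labels named in the synthesis prompt's output format instructions.
PREFIXES = (
    "FINALIZED_LEARNING",
    "CONFLICT_DETECTED",
    "REDUNDANT_LEARNING",
    "NOT_ACTIONABLE",
)


def parse_synthesis_response(text: str) -> Tuple[str, str]:
    """Split a prefixed evaluator response into (label, body)."""
    cleaned = text.strip()
    for prefix in PREFIXES:
        marker = prefix + ":"
        if cleaned.startswith(marker):
            # Drop only the leading marker; later occurrences stay in the body.
            return prefix, cleaned[len(marker):].strip()
    # Unknown format: treat as not actionable and keep the raw text for logging.
    return "NOT_ACTIONABLE", f"Response format error: {cleaned}"


label, body = parse_synthesis_response("FINALIZED_LEARNING: Confirm the order ID before updating.")
fallback_label, fallback_body = parse_synthesis_response("Sure, here is my analysis.")
```

A single parsing function keeps the branching on `label` separate from string handling, which makes the four outcomes easier to test without calling the evaluator model.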

In [55]:
class WorkerAgentHandler:
    def __init__(self, llm_client, tool_schemas: List[Dict], tool_executor: ToolExecutor, storage_instance: Storage):
        self.llm_client = llm_client
        self.tool_schemas = tool_schemas
        self.tool_executor = tool_executor
        self.storage = storage_instance
        print("WorkerAgentHandler initialized.")

    def _execute_llm_interaction_loop(self, system_prompt: str, messages_for_api: List[Dict[str, Any]], query_type: str, conversation_manager: ConversationManager) -> Tuple[str, List[Dict]]:
        tools_for_this_call = self.tool_schemas if query_type == "OPERATIONAL" else []
        max_iterations = 5 if query_type == "OPERATIONAL" else 1 # Max tool use iterations for operational, 1 for others

        executed_tool_calls_log: List[Dict] = [] # Log for tool calls in this interaction loop

        for i in range(max_iterations):
            print(f"--- WorkerLLM Calling Anthropic (Iter {i+1}/{max_iterations}, QType: {query_type}) ---")
            current_text_response = "" # Initialize for this iteration
            try:
                response = self.llm_client.messages.create(
                    model=ANTHROPIC_MODEL_NAME,
                    max_tokens=4000,
                    temperature=0.0,
                    system=system_prompt,
                    tools=tools_for_this_call,
                    messages=messages_for_api
                )
            except Exception as e:
                error_message = f"Error communicating with Worker LLM: {e}"
                print(f"LLM API Error: {e}")
                return error_message, executed_tool_calls_log # Return error and any logs so far

            assistant_response_blocks = response.content
            # It's important to add the raw assistant response blocks to the API history
            # This includes text parts and tool_use parts if any.
            messages_for_api.append({"role": "assistant", "content": assistant_response_blocks})

            text_blocks = [block.text for block in assistant_response_blocks if block.type == "text"]
            current_text_response = " ".join(text_blocks).strip()

            if current_text_response.startswith("CLARIFICATION_REQUESTED:"):
                return current_text_response, executed_tool_calls_log # Return immediately for clarification

            tool_calls_to_process = [block for block in assistant_response_blocks if block.type == "tool_use"]

            if not tool_calls_to_process or query_type != "OPERATIONAL":
                # If no tools to call, or not an operational query, this is the final response from the LLM for this loop.
                final_response_text = current_text_response if current_text_response else "Worker AI provided no text content in its final turn."
                return final_response_text, executed_tool_calls_log

            # If there are tool calls to process (and it's an OPERATIONAL query)
            tool_results_for_next_llm_call_content = [] # This will be the content for the next "user" role message (tool results)

            for tool_use_block in tool_calls_to_process:
                tool_name, tool_input, tool_use_id = tool_use_block.name, tool_use_block.input, tool_use_block.id
                print(f"WorkerLLM: Requesting Tool Call: {tool_name}, Input: {tool_input}")

                # Execute the tool
                tool_result_data = self.tool_executor.execute_tool(tool_name, tool_input, self.storage)

                # Log the tool call and its result for the orchestrator/evaluator
                executed_tool_calls_log.append({
                    "tool_name": tool_name,
                    "tool_input": copy.deepcopy(tool_input), # Deepcopy to avoid modification issues
                    "tool_output": copy.deepcopy(tool_result_data)
                })

                # Update conversation manager's context (this was already here)
                # Example: update context based on product/order/customer IDs in tool_result_data
                entity_type_map = {
                    "order_details": "orders", "order_id": "orders",
                    "product": "products", "product_id": "products",
                    "customer": "customers", "customer_id": "customers"
                }
                found_entity_type = "unknown"
                found_entity_id = "unknown_id"
                found_entity_data = tool_result_data

                for key, etype in entity_type_map.items():
                    if key in tool_result_data and tool_result_data[key]:
                        found_entity_type = etype
                        if isinstance(tool_result_data[key], dict) and ("id" in tool_result_data[key] or etype[:-1]+"_id" in tool_result_data[key]): # e.g. order_details might have 'id'
                             found_entity_id = tool_result_data[key].get("id") or tool_result_data[key].get(etype[:-1]+"_id")
                             found_entity_data = tool_result_data[key]
                        elif isinstance(tool_result_data.get(etype[:-1]+"_id"), str): # e.g. direct product_id
                            found_entity_id = tool_result_data.get(etype[:-1]+"_id")
                        break # Take first match for simplicity

                # Try to get ID more robustly if it's directly in tool_result_data
                if found_entity_id == "unknown_id":
                     if "order_id" in tool_result_data: found_entity_id = tool_result_data["order_id"]
                     elif "product_id" in tool_result_data: found_entity_id = tool_result_data["product_id"]
                     elif "customer_id" in tool_result_data: found_entity_id = tool_result_data["customer_id"]

                if found_entity_id != "unknown_id":
                    conversation_manager.update_entity_in_context(
                        entity_type=found_entity_type,
                        entity_id=found_entity_id,
                        data=found_entity_data
                    )
                conversation_manager.set_last_action(f"tool_{tool_name}_Anthropic", {"input": tool_input, "result_summary": tool_result_data.get("status", "unknown_status")})

                tool_results_for_next_llm_call_content.append({
                    "type": "tool_result",
                    "tool_use_id": tool_use_id,
                    "content": json.dumps(tool_result_data) if isinstance(tool_result_data, dict) else str(tool_result_data)
                    # Consider adding an error field from tool_result_data if status is error
                    # "is_error": tool_result_data.get("status") == "error" if isinstance(tool_result_data, dict) else False
                })

            # Add the aggregated tool results as a new "user" message for the next LLM call
            if tool_results_for_next_llm_call_content:
                messages_for_api.append({"role": "user", "content": tool_results_for_next_llm_call_content})
            else: # Should not happen if tool_calls_to_process was non-empty
                print("WorkerLLM: No tool results to append, though tool calls were present. This is unexpected.")


        # If loop finishes (max_iterations reached)
        final_response_text = current_text_response if current_text_response else "Worker AI reached max tool iterations without a final text response."
        return final_response_text, executed_tool_calls_log
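
The commented-out `is_error` idea in the loop above could look roughly like the following. This is a sketch under stated assumptions: `build_tool_result_block` is a hypothetical helper, and it assumes tools signal failure with `{"status": "error", ...}`, as `ToolExecutor` does. The `is_error` field is an optional part of a `tool_result` content block in the Anthropic Messages API.

```python
import json
from typing import Any, Dict


def build_tool_result_block(tool_use_id: str, tool_result: Any) -> Dict[str, Any]:
    """Build one tool_result content block for the next user message."""
    is_dict = isinstance(tool_result, dict)
    block: Dict[str, Any] = {
        "type": "tool_result",
        "tool_use_id": tool_use_id,
        # default=str keeps datetimes and similar values from breaking serialization.
        "content": json.dumps(tool_result, default=str) if is_dict else str(tool_result),
    }
    # Only flag the block when the tool itself reported an error status.
    if is_dict and tool_result.get("status") == "error":
        block["is_error"] = True
    return block


error_block = build_tool_result_block("toolu_1", {"status": "error", "message": "boom"})
success_block = build_tool_result_block("toolu_2", {"status": "success"})
```

Flagging failed calls explicitly gives the model a clearer signal than status text buried in the JSON payload, though whether that changes behavior in this agent is untested here.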

In [56]:
class ResponseEvaluator:
    def __init__(self, evaluator_llm_instance):
        self.evaluator_llm = evaluator_llm_instance # Gemini model
        print("ResponseEvaluator initialized.")

    def evaluate_turn(self, user_message: str, query_type: str, worker_response_text: str,
                      context_summary: str, rag_learnings_provided: str,
                      clarification_interactions: Optional[List[Dict]],
                      initial_datastore_state: Dict[str, Any],
                      final_datastore_state: Dict[str, Any],
                      executed_tool_calls_log: List[Dict]) -> Dict[str, Any]: # Added tool calls log

        initial_ds_prompt = f"Data Store State *Before* AI Action:\n{json.dumps(initial_datastore_state, indent=2, default=str)}"
        final_ds_prompt = f"Data Store State *After* AI Action:\n{json.dumps(final_datastore_state, indent=2, default=str)}"

        clarification_info_prompt = "No worker AI clarifications this turn."
        if clarification_interactions:
            clar_summary = [f"  Q: '{c.get('agent_question', 'N/A')}' -> User A: '{c.get('user_answer', 'N/A')}'" for c in clarification_interactions]
            clarification_info_prompt = "Worker AI Clarification Interactions:\n" + "\n".join(clar_summary)

        tool_log_prompt = "No tools were executed by the Worker AI this turn."
        if executed_tool_calls_log:
            formatted_tool_calls = []
            for i, call in enumerate(executed_tool_calls_log):
                # Limit the size of tool output in the prompt to avoid excessive length
                output_summary = call.get('tool_output', {})
                if isinstance(output_summary, dict):
                    output_summary_str = json.dumps({k: v for k, v in output_summary.items() if k != "products"}, indent=1, default=str) # Exclude full product lists from summary
                    if len(output_summary_str) > 300: # Truncate if still too long
                        output_summary_str = output_summary_str[:300] + "..."
                else:
                    output_summary_str = str(output_summary)
                    if len(output_summary_str) > 300:
                         output_summary_str = output_summary_str[:300] + "..."

                formatted_tool_calls.append(
                    f"  Tool Call {i+1}:\n"
                    f"    Name: {call.get('tool_name')}\n"
                    f"    Input: {json.dumps(call.get('tool_input'), default=str)}\n"
                    f"    Output Summary: {output_summary_str}"
                )
            tool_log_prompt = "Worker AI Tool Calls Executed This Turn:\n" + "\n".join(formatted_tool_calls)

        eval_content_prompt = f"""
User query: {user_message}
Classified Query Type: {query_type}

Context provided to assistant (summary):
{context_summary}

Relevant RAG Learnings provided to assistant:
{rag_learnings_provided}

{initial_ds_prompt}

{tool_log_prompt}

{final_ds_prompt}

{clarification_info_prompt}

Worker AI (Claude) final textual response:
{worker_response_text}
---
INSTRUCTIONS FOR EVALUATOR (You are Gemini {EVAL_MODEL_NAME}):
Based on your system prompt (which emphasizes impartiality and detailed assessment criteria) and the classified query type ({query_type}), please evaluate the AI assistant's response.
- If OPERATIONAL, focus on tool use accuracy (refer to 'Worker AI Tool Calls Executed'), data store changes (Before vs. After), and whether the final textual response aligns with these actions.
- If METACOGNITIVE_LEARNINGS_SUMMARY, focus on whether the AI accurately summarized the 'Relevant RAG Learnings' it was provided.
Provide detailed reasoning for scores (Accuracy, Efficiency, Context Awareness, Helpfulness) and an overall score (1-10).
Explicitly reference the tool call log and datastore states when assessing operational tasks.
"""
        try:
            # Assuming self.evaluator_llm is a GenerativeModel instance configured with the evaluator_system_prompt
            gemini_response_obj = self.evaluator_llm.generate_content(eval_content_prompt)
            evaluation_text = gemini_response_obj.text

            score = self._extract_score(evaluation_text) # Overall score
            # You might want to extract individual criteria scores too if needed later.

            return {
                "anthropic_score": score, # This is the overall score for the worker AI
                "full_evaluation": evaluation_text, # This is Gemini's full textual evaluation
                "clarification_details_evaluator": {"used": False}, # Placeholder, as evaluator doesn't ask for clarification in this setup
                "query_type_evaluated": query_type,
                "raw_evaluation_text": evaluation_text # For direct printing
            }
        except Exception as e:
            print(f"Evaluator: Error during Gemini evaluation: {e}")
            import traceback
            traceback.print_exc()
            return {
                "error": str(e),
                "anthropic_score": 0,
                "full_evaluation": f"Evaluation failed: {e}",
                "clarification_details_evaluator": {},
                "query_type_evaluated": query_type,
                "raw_evaluation_text": f"Evaluation Error: {e}"
            }

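    def _extract_score(self, evaluation_text: str) -> int:
        # Tries each pattern in order, from most to least specific, and returns the last
        # match of the first pattern whose value falls in the 0-10 range.
        # Accepts "Overall Score: X/10", "Overall Score: X", "Overall: X/10", and "Overall: X".
        # Returns 0 if no pattern yields a valid score.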
        patterns = [
            r"Overall Score\s*:\s*(\d{1,2})\s*/\s*10",  # "Overall Score : X/10" or "Overall Score:X/10"
            r"Overall Score\s*:\s*(\d{1,2})",         # "Overall Score : X" or "Overall Score:X"
            r"Overall\s*:\s*(\d{1,2})\s*/\s*10",       # "Overall : X/10"
            r"Overall\s*:\s*(\d{1,2})"                # "Overall : X"
        ]
        for p_str in patterns:
            matches = list(re.finditer(p_str, evaluation_text, re.IGNORECASE | re.DOTALL))
            if matches:
                try:
                    score_value = int(matches[-1].group(1))
                    if 0 <= score_value <= 10: # Ensure score is within expected range
                        return score_value
                except (ValueError, IndexError):
                    continue
        print(f"Evaluator: Could not reliably extract a 0-10 score from text: ...{evaluation_text[-300:]}")
        return 0 # Default score if not found or invalid
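
The patterns above require the digits to follow the colon directly (apart from whitespace), so a Markdown-formatted line such as `**Overall Score:** 8/10` is not matched and falls through to the default of 0. The following is a minimal sketch of a more tolerant variant, not the notebook's implementation. `extract_overall_score` and `_SCORE_RE` are hypothetical names, and returning `None` instead of 0 is an assumption made here so that a parsing failure cannot be mistaken for a genuine score.

```python
import re
from typing import Optional

# Hypothetical pattern: allows Markdown emphasis characters (* or _) around the
# colon, e.g. "**Overall Score:** 8/10" or "**Overall**: 7".
# (?!\d) rejects values with more than two digits such as "100".
_SCORE_RE = re.compile(
    r"overall(?:\s+score)?[\s*_]*:[\s*_]*(\d{1,2})(?!\d)(?:\s*/\s*10)?",
    re.IGNORECASE,
)


def extract_overall_score(evaluation_text: str) -> Optional[int]:
    """Return the last in-range overall score in the text, or None if absent."""
    for match in reversed(list(_SCORE_RE.finditer(evaluation_text))):
        value = int(match.group(1))
        if 0 <= value <= 10:
            return value
    return None
```

A caller that adopts this shape would need to treat `None` as "not scored" and leave such turns out of any average, rather than adding a 0 to the total.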

In [57]:
class AgentOrchestrator:
    def __init__(self):
        # Wire up the classifier, storage, tools, knowledge manager, worker, and evaluator.
        self.classifier_llm_client = gemini.GenerativeModel(model_name=CLASSIFIER_MODEL_NAME)
        self.query_classifier = QueryClassifier(llm_client=self.classifier_llm_client)
        self.storage = Storage()
        self.conversation_manager = ConversationManager()
        self.tool_functions_map = {
            "create_customer": create_customer, "get_customer_info": get_customer_info,
            "create_product": create_product, "update_product": update_product,
            "get_product_info": get_product_info, "list_all_products": list_all_products,
            "create_order": create_order, "get_order_details": get_order_details,
            "update_order_status": update_order_status,
        }
        self.tool_executor = ToolExecutor(self.tool_functions_map)

        knowledge_synthesis_llm = gemini.GenerativeModel(model_name=EVAL_MODEL_NAME)
        self.knowledge_manager = KnowledgeManager(LEARNINGS_DRIVE_BASE_PATH, DRIVE_MOUNT_PATH, DEFAULT_LEARNINGS_DRIVE_SUBPATH, knowledge_synthesis_llm)

        self.worker_agent_handler = WorkerAgentHandler(anthropic_client, tools_schemas_list, self.tool_executor, self.storage)

        main_evaluator_llm = gemini.GenerativeModel(model_name=EVAL_MODEL_NAME, system_instruction=evaluator_system_prompt)
        self.response_evaluator = ResponseEvaluator(evaluator_llm_instance=main_evaluator_llm)

        self.evaluation_results_log: List[Dict] = []
        print("AgentOrchestrator initialized.")

    def _handle_worker_clarification_interaction(self, agent_question_text: str, system_prompt: str,
                                                current_turn_history: List[Dict], query_type: str,
                                                conversation_manager: ConversationManager,
                                                max_attempts: int = 2) -> Tuple[str, List[Dict], List[Dict]]:
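        """
        Relays the worker's clarification question to the user, up to max_attempts times.
        Returns the worker's latest response, the question/answer pairs collected, and
        the tool calls executed during the clarification phase.
        """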
        clarification_interactions = []
        response_text = agent_question_text
        executed_tool_calls_log_clarification_phase: List[Dict] = []

        for attempt in range(max_attempts):
            actual_question = response_text.split("CLARIFICATION_REQUESTED:", 1)[-1].strip() if "CLARIFICATION_REQUESTED:" in response_text else response_text
            print(f"--- Worker AI requests clarification: {actual_question} ---")

            user_clarification = ""
            try:
                user_clarification = input("Your response to Worker AI: ").strip()
            except EOFError:
                print("EOFError encountered during input. Assuming no user clarification.")
                user_clarification = "(User provided no further input due to EOF)"

            # Record an explicit placeholder when the user submits an empty answer.
            if not user_clarification:
                user_clarification = "(User provided no further input)"

            clarification_interactions.append({"agent_question": actual_question, "user_answer": user_clarification})
            current_turn_history.append({"role": "user", "content": user_clarification})

            response_text, tools_log_this_iteration = self.worker_agent_handler._execute_llm_interaction_loop(
                system_prompt, current_turn_history, query_type, conversation_manager
            )
            executed_tool_calls_log_clarification_phase.extend(tools_log_this_iteration)

            if not response_text.startswith("CLARIFICATION_REQUESTED:"):
                return response_text, clarification_interactions, executed_tool_calls_log_clarification_phase

        print("Max clarification attempts reached for worker AI.")
        return response_text, clarification_interactions, executed_tool_calls_log_clarification_phase


    def run_agent_turn(self, user_message: str) -> Dict[str, Any]:
        """
        Handles the agent's processing of a query, up to generating its response.
        Does NOT perform evaluation or solicit feedback on agent's response.
        """
        self.conversation_manager.add_user_message(user_message)
        query_type = self.query_classifier.classify(user_message)
        print(f"--- Orchestrator: Classified Query Type: {query_type} ---")

        context_summary_for_worker = self.conversation_manager.get_context_summary()
        rag_learnings_for_worker = self.knowledge_manager.get_relevant_learnings_for_prompt(
            user_message, query_type, recipient_role="Agent"
        )

        full_worker_prompt = (
            f"{worker_operational_system_prompt if query_type == 'OPERATIONAL' else worker_metacognitive_learnings_system_prompt}\n\n"
            f"Conversation Context Summary (recent entities and last action):\n{context_summary_for_worker}\n\n"
            f"{rag_learnings_for_worker}"
        )

        initial_datastore_state = self.storage.get_full_datastore_copy()
        current_turn_processing_history = self.conversation_manager.get_full_conversation_for_api()

        worker_response_text, executed_tool_calls_log = self.worker_agent_handler._execute_llm_interaction_loop(
            full_worker_prompt, current_turn_processing_history, query_type, self.conversation_manager
        )

        clarification_interactions = []
        if worker_response_text.startswith("CLARIFICATION_REQUESTED:"):
            worker_response_text, clarification_interactions, tools_during_clarif = self._handle_worker_clarification_interaction(
                worker_response_text, full_worker_prompt, current_turn_processing_history, query_type, self.conversation_manager
            )
            executed_tool_calls_log.extend(tools_during_clarif)

        self.conversation_manager.add_assistant_message(worker_response_text, query_type)
        final_datastore_state = self.storage.get_full_datastore_copy() # Capture after agent's actions

        return {
            "user_message": user_message, # Pass through for context
            "query_type": query_type,
            "anthropic_response": worker_response_text,
            "executed_tool_calls": executed_tool_calls_log,
            "context_summary_for_worker": context_summary_for_worker,
            "initial_datastore_state": initial_datastore_state,
            "final_datastore_state": final_datastore_state,
            "clarification_interactions": clarification_interactions
        }

    def handle_feedback_on_worker_response(self, original_user_query: str,
                                           context_summary_for_worker: str,
                                           human_feedback_on_worker: str):
        # Store any feedback other than '' or 'skip' as a learning for both the agent and the evaluator.
        if human_feedback_on_worker.lower() not in ('skip', ''):
            self.knowledge_manager.synthesize_and_store_learning(
                human_feedback_on_worker,
                original_user_query,
                context_summary_for_worker,
                learning_target="AgentAndEvaluator"
            )
            if self.knowledge_manager.learnings_updated_this_session_flag:
                self.knowledge_manager.persist_active_learnings()
        else:
            print("Orchestrator: No feedback provided for worker response or 'skip' entered.")


    def run_evaluation_turn(self, agent_turn_data: Dict[str, Any]) -> Dict[str, Any]:
        """
        Performs evaluation of the agent's turn using the latest RAG learnings.
        """
        user_message = agent_turn_data["user_message"]
        query_type = agent_turn_data["query_type"]
        worker_response_text = agent_turn_data["anthropic_response"]
        context_summary_for_worker = agent_turn_data["context_summary_for_worker"] # This is worker's context
        clarification_interactions = agent_turn_data["clarification_interactions"]
        initial_datastore_state = agent_turn_data["initial_datastore_state"]
        final_datastore_state = agent_turn_data["final_datastore_state"]
        executed_tool_calls_log = agent_turn_data["executed_tool_calls"]

        # CRITICAL: Fetch RAG for Evaluator *NOW*, after agent feedback might have been processed
        rag_learnings_for_evaluator = self.knowledge_manager.get_relevant_learnings_for_prompt(
            user_message, query_type, recipient_role="Evaluator"
        )

        evaluation_result = self.response_evaluator.evaluate_turn(
            user_message, query_type, worker_response_text,
            context_summary_for_worker, # Context summary that was available to worker
            rag_learnings_for_evaluator, # Freshly fetched RAG for evaluator
            clarification_interactions,
            initial_datastore_state,
            final_datastore_state,
            executed_tool_calls_log
        )

        # Log the evaluation result
        self.evaluation_results_log.append({
            "user_message": user_message,
            "query_type": query_type,
            "worker_response": worker_response_text, # From agent_turn_data
            "tool_calls": copy.deepcopy(executed_tool_calls_log), # From agent_turn_data
            "evaluation_details": evaluation_result
        })

        return evaluation_result # Return just the evaluation details

    def handle_feedback_on_evaluation(self, original_user_query: str, worker_response_summary: str,
                                      evaluation_text_summary: str, human_feedback_on_evaluator: str):
        # Store any feedback other than '' or 'skip' as a learning for the evaluator only.
        if human_feedback_on_evaluator.lower() not in ('skip', ''):
            eval_feedback_context = (
                f"Feedback is on an evaluation. Original query: '{original_user_query}'. "
                f"Worker response (summary): '{worker_response_summary[:100]}...'. "
                f"Evaluation (summary): '{evaluation_text_summary[:150]}...'"
            )
            self.knowledge_manager.synthesize_and_store_learning(
                human_feedback_on_evaluator,
                original_user_query,
                eval_feedback_context,
                learning_target="EvaluatorOnly"
            )
            if self.knowledge_manager.learnings_updated_this_session_flag:
                self.knowledge_manager.persist_active_learnings()
        else:
            print("Orchestrator: No feedback provided for evaluator or 'skip' entered.")


    def get_evaluation_log(self) -> List[Dict]:
        return self.evaluation_results_log

    def persist_learnings_on_exit(self):
        if self.knowledge_manager.learnings_updated_this_session_flag:
            print("Orchestrator: Persisting any remaining updated learnings on exit...")
            self.knowledge_manager.persist_active_learnings()
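
The clarification handling in `_handle_worker_clarification_interaction` calls `input()` and the worker handler directly, which makes it hard to exercise without a live session. The sketch below isolates the same bounded question-and-answer loop behind two injected callables so it can be driven by scripted answers. `clarification_loop`, `ask_user`, and `call_worker` are hypothetical names introduced for illustration. Unlike the notebook's method, this version checks for the prefix at the top of each iteration and does not collect tool-call logs.

```python
from typing import Callable, Dict, List, Tuple

CLARIFY_PREFIX = "CLARIFICATION_REQUESTED:"


def clarification_loop(
    first_reply: str,
    ask_user: Callable[[str], str],
    call_worker: Callable[[List[Dict[str, str]]], str],
    history: List[Dict[str, str]],
    max_attempts: int = 2,
) -> Tuple[str, List[Dict[str, str]]]:
    """Relay up to max_attempts clarification questions from the worker to the user."""
    reply = first_reply
    interactions: List[Dict[str, str]] = []
    for _ in range(max_attempts):
        if not reply.startswith(CLARIFY_PREFIX):
            break  # The worker produced a final answer.
        question = reply[len(CLARIFY_PREFIX):].strip()
        # An empty answer is replaced by an explicit placeholder.
        answer = ask_user(question).strip() or "(User provided no further input)"
        interactions.append({"agent_question": question, "user_answer": answer})
        history.append({"role": "user", "content": answer})
        reply = call_worker(history)
    return reply, interactions


# Scripted demo: two clarification rounds, then a final answer.
scripted_answers = iter(["25", "blue"])
scripted_replies = iter(["CLARIFICATION_REQUESTED: Which colour?", "Order placed."])
demo_history: List[Dict[str, str]] = []
final_reply, qa_pairs = clarification_loop(
    "CLARIFICATION_REQUESTED: How many?",
    ask_user=lambda question: next(scripted_answers),
    call_worker=lambda history: next(scripted_replies),
    history=demo_history,
)
```

In an interactive session, `ask_user` could simply wrap `input`, while tests pass a scripted callable as in the demo.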

In [58]:
# Main interactive loop: agent turn, feedback on the agent, evaluation, feedback on the evaluation.

def main():
    print("\nStarting Main Execution with REFACTORED Agent System...\n")
    orchestrator = AgentOrchestrator()

    while True:
        try:
            print("\n" + "=" * 70)
            user_query = input("Enter query (or 'quit' to exit): ").strip()

            if user_query.lower() == 'quit':
                orchestrator.persist_learnings_on_exit()
                print("Exiting system. Learnings persisted if updated.")
                break
            if not user_query:
                continue

            print(f">>> User Query: '{user_query}'")

            # === AGENT ACTION PHASE ===
            agent_turn_data = orchestrator.run_agent_turn(user_query)

            worker_response_text = agent_turn_data.get('anthropic_response', "No worker response found.")
            query_type_from_turn = agent_turn_data.get('query_type', "N/A")
            context_for_worker_feedback = agent_turn_data.get('context_summary_for_worker', "N/A")

            print(f"\n--- Worker AI Final Response (Type: {query_type_from_turn}) ---")
            print(worker_response_text)
            print("--- End of Worker AI Response ---")

            # === FEEDBACK ON AGENT PHASE ===
            try:
                human_feedback_on_worker = input("Orchestrator: Feedback on Worker AI's response? (type or 'skip'): ").strip()
                orchestrator.handle_feedback_on_worker_response(
                    original_user_query=user_query, # user_query is the original query for this turn
                    context_summary_for_worker=context_for_worker_feedback,
                    human_feedback_on_worker=human_feedback_on_worker
                )
            except EOFError:
                print("Orchestrator: Skipping feedback on worker response (EOF).")

            # === EVALUATION PHASE ===
            # Pass the full agent_turn_data to the evaluation method
            evaluation_details = orchestrator.run_evaluation_turn(agent_turn_data)

            raw_eval_text_for_feedback = "No evaluation text."
            if evaluation_details and not evaluation_details.get("error"):
                raw_eval_text_for_feedback = evaluation_details.get("raw_evaluation_text", "No raw evaluation text found.")
                print(f"\n--- Evaluator (Gemini) Assessment (for query: '{user_query}') ---")
                print(raw_eval_text_for_feedback)
                print("--- End of Evaluation ---")
            elif evaluation_details and evaluation_details.get("error"):
                error_message = evaluation_details.get('error', 'Unknown evaluation error.')
                raw_eval_text_for_feedback = f"Evaluation Error: {error_message}"
                print("\n--- Evaluator (Gemini) Assessment Error ---")
                print(raw_eval_text_for_feedback)
                print("--- End of Evaluation Error ---")
            else:
                print("\n--- Evaluator: No evaluation details found for this turn. ---")

            # === FEEDBACK ON EVALUATION PHASE ===
            try:
                human_feedback_on_evaluator = input("Orchestrator: Feedback on Evaluator's assessment? (type or 'skip'): ").strip()
                orchestrator.handle_feedback_on_evaluation(
                    original_user_query=user_query, # user_query for context
                    worker_response_summary=worker_response_text, # worker's actual response
                    evaluation_text_summary=raw_eval_text_for_feedback, # evaluator's actual assessment text
                    human_feedback_on_evaluator=human_feedback_on_evaluator
                )
            except EOFError:
                print("Orchestrator: Skipping feedback on evaluator assessment (EOF).")

        except SystemExit:
            print("System exit requested.")
            orchestrator.persist_learnings_on_exit()
            break
        except EOFError:
            print("\nEOF encountered. Exiting gracefully.")
            orchestrator.persist_learnings_on_exit()
            break
        except KeyboardInterrupt:
            print("\nKeyboard interrupt detected. Exiting.")
            orchestrator.persist_learnings_on_exit()
            break
        except Exception as e:
            print(f"CRITICAL ERROR in main loop: {e}")
            import traceback
            traceback.print_exc()

    # --- Final Evaluation Summary ---
    print("\n" + "=" * 30 + " FINAL EVALUATION SUMMARY " + "=" * 30)
    results_log = orchestrator.get_evaluation_log()
    total_score, num_q_evaluated = 0, 0

    if not results_log:
        print("No queries were processed and evaluated in this session.")
    else:
        for i, turn_data in enumerate(results_log):
            user_msg = turn_data.get('user_message', 'N/A')
            q_type = turn_data.get('query_type', 'N/A')
            # Each log entry stores the evaluator's output under "evaluation_details".
            eval_details_from_log = turn_data.get('evaluation_details', {})

            if isinstance(eval_details_from_log, dict) and not eval_details_from_log.get("error"):
                score = eval_details_from_log.get('anthropic_score', 0)
                total_score += score
                num_q_evaluated += 1
                print(f"Q{i+1}: '{user_msg}' (Type: {q_type}) -> Score: {score}")
            elif isinstance(eval_details_from_log, dict) and eval_details_from_log.get("error"):
                print(f"Q{i+1}: '{user_msg}' (Type: {q_type}) -> Evaluation Error: {eval_details_from_log.get('error')}")
            else:
                # This case might occur if run_evaluation_turn somehow didn't produce a dict.
                print(f"Q{i+1}: '{user_msg}' (Type: {q_type}) -> No valid evaluation details logged or evaluation error.")


    if num_q_evaluated > 0:
        print(f"\nAverage Score over {num_q_evaluated} evaluated queries: {total_score / num_q_evaluated:.2f}")
    else:
        print("\nNo queries were successfully evaluated to calculate an average score.")
    print(f"Total aggregate score for the session: {total_score}")
    print("=" * 70)
    print("Execution Finished.")

# To run:
# main()
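
The final summary in `main()` mixes the score arithmetic with printing. As a rough sketch, the same aggregation can be written as a pure function that is easy to check in isolation. `summarize_evaluations` is a hypothetical helper, not part of the notebook, and it assumes each log entry has the shape appended by `run_evaluation_turn`. One deliberate difference: entries with no `evaluation_details` at all are skipped here, whereas the loop in `main()` would count them with a score of 0.

```python
from typing import Any, Dict, List, Optional, Tuple


def summarize_evaluations(
    results_log: List[Dict[str, Any]],
) -> Tuple[int, int, Optional[float]]:
    """Return (total_score, evaluated_count, average), ignoring errored turns."""
    scores = [
        entry["evaluation_details"].get("anthropic_score", 0)
        for entry in results_log
        if isinstance(entry.get("evaluation_details"), dict)
        and not entry["evaluation_details"].get("error")
    ]
    total = sum(scores)
    # The average is None when nothing was evaluated, avoiding a division by zero.
    average = total / len(scores) if scores else None
    return total, len(scores), average


example_log = [
    {"evaluation_details": {"anthropic_score": 8}},
    {"evaluation_details": {"anthropic_score": 0, "error": "boom"}},
    {"evaluation_details": {"anthropic_score": 6}},
]
total, count, average = summarize_evaluations(example_log)
```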

In [59]:
""" Sample queries:
* Show me all the products available
* I'd like to order 25 Perplexinators, please
* Show me the status of my order
* (If the order is not in Shipped state, then) Please ship my order now
* How many Perplexinators are now left in stock?
* Add a new customer: Bill Leece, bill.leece@mail.com, +1.222.333.4444
* Add a new product: Gizmo X, description: A fancy gizmo, price: 29.99, inventory: 50
* Update Gizzmo's price to 99.99 #Note the misspelling of 'Gizmo'
* When was the last time the Toronto Maple Leafs won the Stanley Cup?
* I need to update our insurance policy, so I need to know the total value of all the products in our inventory. Please tell me this amount.
* Summarize your learnings from our recent interactions.
"""

main()

\nStarting Main Execution with REFACTORED Agent System...\n
QueryClassifier initialized with LLM client.
Storage initialized with deep copies of initial data.
ConversationManager initialized.
ToolExecutor initialized.
KM: Loading initial learnings from: /content/drive/My Drive/AI/Knowledgebases/learnings_20250520_180459_428842.json
KnowledgeManager initialized. Loaded 1 initial learnings from /content/drive/My Drive/AI/Knowledgebases.
WorkerAgentHandler initialized.
ResponseEvaluator initialized.
AgentOrchestrator initialized.
Enter query (or 'quit' to exit): Show me all the products available
>>> User Query: 'Show me all the products available'
--- Orchestrator: Classified Query Type: OPERATIONAL ---
--- WorkerLLM Calling Anthropic (Iter 1/5, QType: OPERATIONAL) ---
WorkerLLM: Requesting Tool Call: list_all_products, Input: {}
[Tool Exec] list_all_products: Found 3 products.
--- [ToolExecutor] Result for list_all_products: {
  "status": "success",
  "count": 3,
  "products": {
    "P1