<a href="https://colab.research.google.com/github/wjleece/AI-Agents/blob/main/AI_Agents_w_Evals.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%pip install anthropic
#%pip install openai
%pip install -q -U google-generativeai
%pip install fuzzywuzzy

Collecting anthropic
  Downloading anthropic-0.51.0-py3-none-any.whl.metadata (25 kB)
Downloading anthropic-0.51.0-py3-none-any.whl (263 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m264.0/264.0 kB[0m [31m3.4 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: anthropic
Successfully installed anthropic-0.51.0
Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl.metadata (4.9 kB)
Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0


In [26]:
#Setup and Imports
import anthropic
import google.generativeai as gemini
import re
import json
import time
import os
import copy
import glob # For finding files matching a pattern
import uuid # For generating unique learning IDs in RAG
from google.colab import userdata
#from openai import OpenAI
from google.colab import drive # For Google Drive mounting
from datetime import datetime
from typing import Dict, List, Any, Optional, Union, Tuple
from fuzzywuzzy import process, fuzz

# LLM API Keys
ANTHROPIC_API_KEY = userdata.get('ANTHROPIC_API_KEY')
#OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')

anthropic_client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)
#openai_client = OpenAI(api_key=OPENAI_API_KEY)
gemini.configure(api_key=GOOGLE_API_KEY)

ANTHROPIC_MODEL_NAME = "claude-3-5-sonnet-latest"
#OPENAI_MODEL_NAME = "gpt-4.1" # Or your preferred GPT-4 class model
EVAL_MODEL_NAME = "gemini-2.5-pro-preview-05-06" # Or your preferred Gemini model


DRIVE_MOUNT_PATH = '/content/drive'

try:
    drive.mount(DRIVE_MOUNT_PATH)
    print(f"Google Drive mounted successfully at {DRIVE_MOUNT_PATH}.")
except Exception as e:
    print(f"Error mounting Google Drive: {e}. RAG features will not work.")

# Set up the default learnings path
DEFAULT_LEARNINGS_DRIVE_SUBPATH = "My Drive/AI/Knowledgebases"  # Your default path
LEARNINGS_DRIVE_BASE_PATH = os.path.join(DRIVE_MOUNT_PATH, DEFAULT_LEARNINGS_DRIVE_SUBPATH)

# Create the directory if it doesn't exist
if not os.path.exists(LEARNINGS_DRIVE_BASE_PATH):
    try:
        os.makedirs(LEARNINGS_DRIVE_BASE_PATH)
        print(f"Created learnings directory: {LEARNINGS_DRIVE_BASE_PATH}")
    except Exception as e:
        print(f"Error creating learnings directory {LEARNINGS_DRIVE_BASE_PATH}: {e}")
else:
    print(f"Using existing learnings directory: {LEARNINGS_DRIVE_BASE_PATH}")

print("Imports and LLM clients initialized. Drive RAG configuration variables set.")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Google Drive mounted successfully at /content/drive.
Using existing learnings directory: /content/drive/My Drive/AI/Knowledgebases
Imports and LLM clients initialized. Drive RAG configuration variables set.


In [27]:
# MODIFIED System Prompts
worker_system_prompt = """
You are a helpful customer service assistant for an e-commerce system.

Your overriding goal is to be helpful by answering questions and performing actions as requested by a human user.

When responding to the user, use the conversation context to maintain continuity.
- If a user refers to "my order" or similar, use the context to determine which order they're talking about.
- If they mention "that product" or use other references, check the context to determine what they're referring to.
- Always prioritize recent context over older context when resolving references.

The conversation context will be provided to you with each message. This includes:
- Previous questions and answers
- Recently viewed customers, products, and orders
- Recent actions taken (like creating orders, updating products, etc.)
- Relevant Learnings from a knowledge base.

REQUESTING CLARIFICATION FROM THE USER:
If you determine that you absolutely need more information from the user to accurately and efficiently fulfill their request or use a tool correctly, you MUST:
1. Formulate a clear, concise question for the user.
2. Prefix your entire response with the exact tag: `CLARIFICATION_REQUESTED:`
   Example: `CLARIFICATION_REQUESTED: To update the order, could you please provide the Order ID?`
3. Do NOT use any tools in the same turn you are requesting clarification. Wait for the user's response.

Keep all other responses friendly, concise, and helpful.
"""

evaluator_system_prompt = """
You are Google Gemini, an impartial evaluator assessing the quality of responses from an AI assistant to customer service queries.

You will be provided with:

- The user's query.
- The conversation context (including RAG learnings) that was available to the AI assistant.
- The AI assistant's final response.
- **A snapshot of the 'Data Store State *Before* AI Action' (customers, products, orders). This represents the state of the world when the AI assistant decided on its actions.**
- **A snapshot of the 'Data Store State *After* AI Action' (customers, products, orders). This represents the state after the AI assistant's actions were executed (e.g., after tool calls).**
- Details of any clarification questions the AI assistant asked the user.

For each interaction, evaluate the assistant's response based on:
1.  **Accuracy**:
    * How correct and factual is the AI's textual response?
    * Did the AI's actions (tool calls) correctly modify the datastore as intended and as claimed in its response?
    * Verify factual claims (prices, inventory, statuses, IDs) by comparing the AI's response and its actions against the 'Data Store State *Before* AI Action' (for what it should have known) and the 'Data Store State *After* AI Action' (for what actually happened).
    * Specifically check if any new entities (orders, customers, products) were created with correct IDs and if existing entities were updated correctly. Note any discrepancies in IDs or values.

2.  **Efficiency**:
    * Did the assistant get to the correct answer and perform the correct actions with minimal clarifying questions?
    * Consider if any questions asked were necessary by checking if the information was available in the 'Data Store State *Before* AI Action', the conversation context, or RAG learnings.
    * Were tool calls used appropriately, or could the AI have inferred information without a tool call?

3.  **Context Awareness**:
    * Did the assistant correctly use the conversation context (history, entities, RAG learnings) to understand references and intent?

4.  **Helpfulness**:
    * How well did the assistant address the user's needs, both in its textual response and through its actions?
    * Was the response clear, and did it provide all relevant information related to the outcome of the user's request?

Score the response on a scale of 1-10 for each criterion, and provide an overall score. Provide detailed reasoning for your scores, explicitly referencing the 'Before' and 'After' data store states when discussing accuracy of actions and information.

EVALUATING CLARIFICATION QUESTIONS ASKED BY THE WORKER AI:
If the worker AI asked for clarification from the user:
- Assess the *necessity* of the question using the 'Data Store State *Before* AI Action' and context.
- Assess the *quality* of the question.
- If the question was necessary and well-phrased, it should NOT negatively impact the Efficiency score.
- If the question was unnecessary, this SHOULD negatively impact the Efficiency score.

If you, the evaluator, still have questions *after* reviewing all provided information, you can ask the human admin for clarification using "CLARIFICATION NEEDED_EVALUATOR:".

DATA STORE CONSISTENCY CHECK (as part of overall evaluation):
Your main evaluation should inherently perform a consistency check. When assessing Accuracy, explicitly compare the AI's stated actions and its textual response with the changes (or lack thereof) between the 'Data Store State *Before* AI Action' and the 'Data Store State *After* AI Action'.
- Does the 'Data Store State *After* AI Action' accurately reflect the outcomes of any tool calls the AI Assistant made (or should have made based on its response)?
- Are there any inconsistencies between the agent's textual response and the 'Data Store State *After* AI Action'?
- Clearly state if your review of the 'Before' and 'After' states causes you to adjust scores, and explain why.
"""


In [28]:
#Initialize stuff to prevent possible caching issues

human_feedback_learnings = {}
tools_schemas_list = []

In [29]:
#Gemini models have different structure than Anthropic and need to be called this way before use to enable generate_content(prompt),
#whereas Anthropic allows model definition + system instructions within messages.create(prompt)

eval_model_instance = gemini.GenerativeModel(
    model_name=EVAL_MODEL_NAME,
    system_instruction=evaluator_system_prompt
)
print("Gemini Evaluator model instance initialized.")

#this instance of Gemini is to get "ground truth" answers by running queries in parallel with Anthropic. The evaluator instance then evaluates the "ground truth" against Anthropic's response.
gemini_actor_model_instance = gemini.GenerativeModel(
    model_name=EVAL_MODEL_NAME, # Using the same underlying model
    system_instruction=worker_system_prompt # But with the worker's system prompt
)
print("Gemini Actor model instance initialized.")

Gemini Evaluator model instance initialized.
Gemini Actor model instance initialized.


In [30]:
# Global Data Stores (Initial data - will be managed by the Storage class instance)
# These are initial values. The Storage class will manage them.
initial_customers = {
    "C1": {"name": "John Doe", "email": "john@example.com", "phone": "123-456-7890"},
    "C2": {"name": "Jane Smith", "email": "jane@example.com", "phone": "987-654-3210"}
}

initial_products = {
    "P1": {"name": "Widget A", "description": "A simple widget. Very compact.", "price": 19.99, "inventory_count": 999},
    "P2": {"name": "Gadget B", "description": "A powerful gadget. It spins.", "price": 49.99, "inventory_count": 200},
    "P3": {"name": "Perplexinator", "description": "A perplexing perfunctator", "price": 79.99, "inventory_count": 1483}
}

initial_orders = {
    "O1": {"id": "O1", "product_id": "P1", "product_name": "Widget A", "quantity": 2, "price": 19.99, "status": "Shipped"},
    "O2": {"id": "O2", "product_id": "P2", "product_name": "Gadget B", "quantity": 1, "price": 49.99, "status": "Processing"}
}


In [31]:
# Standalone Anthropic Completion Function (for basic tests)
def get_completion_anthropic_standalone(prompt: str):
    message = anthropic_client.messages.create(
        model=ANTHROPIC_MODEL_NAME,
        max_tokens=2000,
        temperature=0.0,
        system=worker_system_prompt,
        tools=tools_schemas_list,
        messages=[
          {"role": "user", "content": prompt}
        ]
    )
    return message.content[0].text

In [8]:
prompt_test_anthropic = "Hey there, which AI model do you use for answering questions?"
print(f"Anthropic Standalone Test: {get_completion_anthropic_standalone(prompt_test_anthropic)}")

Anthropic Standalone Test: I am a customer service assistant designed to help with e-commerce related questions and tasks. I aim to be helpful by answering questions and performing actions related to orders, products, customers, and other e-commerce functions. I don't actually have information about which specific AI model I use - I focus on helping users with their e-commerce needs.

Is there something specific about your orders, products, or other e-commerce matters that I can help you with today?


In [9]:
#def get_completion_openai_standalone(prompt: str):
#    response = openai_client.chat.completions.create(
#        model=OPENAI_MODEL_NAME,
#        max_tokens=2000,
#        temperature=0.0,
#        tools=tools_schemas_list,
#        messages=[
#            {"role": "system", "content": worker_system_prompt},
#            {"role": "user", "content": prompt}
#        ]
#    )
#    return response.choices[0].message.content

In [10]:
#prompt_test_openai = "Hey there, which AI model do you use for answering questions?"
#print(f"OpenAI Standalone Test: {get_completion_openai_standalone(prompt_test_openai)}")

In [11]:
def get_completion_eval_standalone(prompt: str):
    # Uses the eval_model_instance which has the system prompt
        response = eval_model_instance.generate_content(prompt)
        return response.text

In [12]:
prompt_test_eval = "Hey there, can you tell me which AI you are and what your key tasks are?"
print(f"Gemini Eval Standalone Test:\n{get_completion_eval_standalone(prompt_test_eval)}")

Gemini Eval Standalone Test:
Okay, I can help with that!

I am a friendly AI assistant here to help with your customer service needs. My key tasks include:
*   Answering questions about our products and services.
*   Helping you place new orders.
*   Checking the status of your existing orders.
*   Processing returns and exchanges.
*   Updating your account information.

Is there anything specific I can help you with today?


In [32]:
# Storage Class
class Storage:
    def __init__(self):
        self.customers = copy.deepcopy(initial_customers)
        self.products = copy.deepcopy(initial_products)
        self.orders = copy.deepcopy(initial_orders)
        self.human_feedback_learnings = human_feedback_learnings

    def get_full_datastore_copy(self) -> Dict[str, Any]:
        """Returns a deep copy of the current datastore."""
        return {
            "customers": copy.deepcopy(self.customers),
            "products": copy.deepcopy(self.products),
            "orders": copy.deepcopy(self.orders)
        }

print("Storage class defined with deepcopy for initial data and get_full_datastore_copy method.")

Storage class defined with deepcopy for initial data and get_full_datastore_copy method.


In [33]:
#Definitive list of tool schemas.
tools_schemas_list = [
    {
        "name": "create_customer",
        "description": "Adds a new customer to the database. Includes customer name, email, and (optional) phone number.",
        "input_schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string", "description": "The name of the customer."},
                "email": {"type": "string", "description": "The email address of the customer."},
                "phone": {"type": "string", "description": "The phone number of the customer (optional)."}
            },
            "required": ["name", "email"]
        }
    },
    {
        "name": "get_customer_info",
        "description": "Retrieves customer information based on their customer ID. Returns the customer's name, email, and (optional) phone number.",
        "input_schema": {
            "type": "object",
            "properties": {
                "customer_id": {"type": "string", "description": "The unique identifier for the customer."}
            },
            "required": ["customer_id"]
        }
    },
    {
        "name": "create_product",
        "description": "Adds a new product to the product database. Includes name, description, price, and initial inventory count.",
        "input_schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string", "description": "The name of the product."},
                "description": {"type": "string", "description": "A description of the product."},
                "price": {"type": "number", "description": "The price of the product."},
                "inventory_count": {"type": "integer", "description": "The amount of the product that is currently in inventory."}
            },
            "required": ["name", "description", "price", "inventory_count"]
        }
    },
    {
        "name": "update_product",
        "description": "Updates an existing product with new information. Only fields that are provided will be updated; other fields remain unchanged.",
        "input_schema": {
            "type": "object",
            "properties": {
                "product_id": {"type": "string", "description": "The unique identifier for the product to update."},
                "name": {"type": "string", "description": "The new name for the product (optional)."},
                "description": {"type": "string", "description": "The new description for the product (optional)."},
                "price": {"type": "number", "description": "The new price for the product (optional)."},
                "inventory_count": {"type": "integer", "description": "The new inventory count for the product (optional)."}
            },
            "required": ["product_id"]
        }
    },
    {
        "name": "get_product_info",
        "description": "Retrieves product information based on product ID or product name (with fuzzy matching for misspellings). Returns product details including name, description, price, and inventory count.",
        "input_schema": {
            "type": "object",
            "properties": {
                "product_id_or_name": {"type": "string", "description": "The product ID or name (can be approximate)."}
            },
            "required": ["product_id_or_name"]
        }
    },
    {
        "name": "list_all_products",
        "description": "Lists all available products in the inventory.",
        "input_schema": { "type": "object", "properties": {}, "required": [] }
    },
    {
        "name": "create_order",
        "description": "Creates an order using the product's current price. If requested quantity exceeds available inventory, no order is created and available quantity is returned. Orders can only be created for products that are in stock. Supports specifying products by either ID or name with fuzzy matching for misspellings.",
        "input_schema": {
            "type": "object",
            "properties": {
                "product_id_or_name": {"type": "string", "description": "The ID or name of the product to order (supports fuzzy matching)."},
                "quantity": {"type": "integer", "description": "The quantity of the product in the order."},
                "status": {"type": "string", "description": "The initial status of the order (e.g., 'Processing', 'Shipped')."}
            },
            "required": ["product_id_or_name", "quantity", "status"]
        }
    },
    {
        "name": "get_order_details",
        "description": "Retrieves the details of a specific order based on the order ID. Returns the order ID, product name, quantity, price, and order status.",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string", "description": "The unique identifier for the order."}
            },
            "required": ["order_id"]
        }
    },
    {
        "name": "update_order_status",
        "description": "Updates the status of an order and adjusts inventory accordingly. Changing to \"Shipped\" decreases inventory. Changing to \"Returned\" or \"Canceled\" from \"Shipped\" increases inventory. Status can be \"Processing\", \"Shipped\", \"Delivered\", \"Returned\", or \"Canceled\".",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string", "description": "The unique identifier for the order."},
                "new_status": {
                    "type": "string",
                    "description": "The new status to set for the order.",
                    "enum": ["Processing", "Shipped", "Delivered", "Returned", "Canceled"]
                }
            },
            "required": ["order_id", "new_status"]
        }
    }
]
print(f"Defined {len(tools_schemas_list)} tool schemas.")

Defined 9 tool schemas.


In [34]:
# Tool Function Definitions
# These tool functions now accept a 'current_storage' argument to operate on a specific Storage instance.

# Customer functions
def create_customer(current_storage: Storage, name: str, email: str, phone: Optional[str] = None) -> Dict[str, Any]:
    """Creates a new customer and adds them to the customer database."""
    new_id = f"C{len(current_storage.customers) + 1}"
    current_storage.customers[new_id] = {"name": name, "email": email, "phone": phone}
    print(f"[Tool Executed] create_customer: ID {new_id}, Name: {name} (in {type(current_storage).__name__})")
    return {"status": "success", "customer_id": new_id, "customer": current_storage.customers[new_id]}

def get_customer_info(current_storage: Storage, customer_id: str) -> Dict[str, Any]:
    """Retrieves information about a customer based on their ID."""
    customer = current_storage.customers.get(customer_id)
    if customer:
        print(f"[Tool Executed] get_customer_info: ID {customer_id} found (in {type(current_storage).__name__}).")
        return {"status": "success", "customer_id": customer_id, "customer": customer}
    print(f"[Tool Executed] get_customer_info: ID {customer_id} not found (in {type(current_storage).__name__}).")
    return {"status": "error", "message": "Customer not found"}

# Product functions
def create_product(current_storage: Storage, name: str, description: str, price: float, inventory_count: int) -> Dict[str, Any]:
    """Creates a new product and adds it to the product database."""
    new_id = f"P{len(current_storage.products) + 1}"
    current_storage.products[new_id] = {
        "name": name,
        "description": description,
        "price": float(price),
        "inventory_count": int(inventory_count)
    }
    print(f"[Tool Executed] create_product: ID {new_id}, Name: {name} (in {type(current_storage).__name__})")
    return {"status": "success", "product_id": new_id, "product": current_storage.products[new_id]}

def update_product(current_storage: Storage, product_id: str, name: Optional[str] = None, description: Optional[str] = None,
                   price: Optional[float] = None, inventory_count: Optional[int] = None) -> Dict[str, Any]:
    """Updates a product with the provided parameters."""
    if product_id not in current_storage.products:
        print(f"[Tool Executed] update_product: ID {product_id} not found (in {type(current_storage).__name__}).")
        return {"status": "error", "message": f"Product {product_id} not found"}

    product = current_storage.products[product_id]
    updated_fields = []

    if name is not None:
        product["name"] = name
        updated_fields.append("name")
    if description is not None:
        product["description"] = description
        updated_fields.append("description")
    if price is not None:
        product["price"] = float(price)
        updated_fields.append("price")
    if inventory_count is not None:
        product["inventory_count"] = int(inventory_count)
        updated_fields.append("inventory_count")

    if not updated_fields:
        print(f"[Tool Executed] update_product: ID {product_id}, no fields updated (in {type(current_storage).__name__}).")
        return {"status": "warning", "message": "No fields were updated.", "product": product}

    print(f"[Tool Executed] update_product: ID {product_id}, Updated fields: {', '.join(updated_fields)} (in {type(current_storage).__name__})")
    return {
        "status": "success",
        "message": f"Product {product_id} updated. Fields: {', '.join(updated_fields)}",
        "product_id": product_id,
        "updated_fields": updated_fields,
        "product": product
    }

def find_product_by_name(current_storage: Storage, product_name: str, min_similarity: int = 70) -> Tuple[Optional[str], Optional[Dict[str, Any]]]:
    """Find a product by name using fuzzy string matching."""
    if not product_name: return None, None

    name_id_list = [(p_data["name"], p_id) for p_id, p_data in current_storage.products.items()]
    if not name_id_list: return None, None

    best_match_name_score = process.extractOne(
        product_name,
        [item[0] for item in name_id_list],
        scorer=fuzz.token_sort_ratio
    )

    if best_match_name_score and best_match_name_score[1] >= min_similarity:
        matched_name = best_match_name_score[0]
        for name, pid_val in name_id_list:
            if name == matched_name:
                print(f"[Tool Helper] find_product_by_name: Matched '{product_name}' to '{matched_name}' (ID: {pid_val}) with score {best_match_name_score[1]} (in {type(current_storage).__name__})")
                return pid_val, current_storage.products[pid_val]

    print(f"[Tool Helper] find_product_by_name: No good match for '{product_name}' (min_similarity: {min_similarity}, Best match: {best_match_name_score}) (in {type(current_storage).__name__})")
    return None, None


def get_product_id(current_storage: Storage, product_identifier: str) -> Optional[str]:
    """Get product ID either directly or by fuzzy matching the name."""
    if product_identifier in current_storage.products:
        return product_identifier
    product_id, _ = find_product_by_name(current_storage, product_identifier)
    return product_id

def get_product_info(current_storage: Storage, product_id_or_name: str) -> Dict[str, Any]:
    """Get information about a product by its ID or name."""
    if product_id_or_name in current_storage.products:
        product = current_storage.products[product_id_or_name]
        print(f"[Tool Executed] get_product_info: Found by ID '{product_id_or_name}' (in {type(current_storage).__name__}).")
        return {"status": "success", "product_id": product_id_or_name, "product": product}

    # Use the modified find_product_by_name that takes current_storage
    product_id_found, product_data = find_product_by_name(current_storage, product_id_or_name)
    if product_id_found and product_data:
        print(f"[Tool Executed] get_product_info: Found by name (fuzzy) '{product_id_or_name}' as ID '{product_id_found}' (in {type(current_storage).__name__}).")
        return {"status": "success", "message": f"Found product matching '{product_id_or_name}'", "product_id": product_id_found, "product": product_data}

    print(f"[Tool Executed] get_product_info: No product found for '{product_id_or_name}' (in {type(current_storage).__name__}).")
    return {"status": "error", "message": f"No product found matching '{product_id_or_name}'"}


def list_all_products(current_storage: Storage) -> Dict[str, Any]:
    """List all available products in the inventory."""
    print(f"[Tool Executed] list_all_products: Found {len(current_storage.products)} products (in {type(current_storage).__name__}).")
    return {"status": "success", "count": len(current_storage.products), "products": dict(current_storage.products)}

# Order functions
def create_order(current_storage: Storage, product_id_or_name: str, quantity: int, status: str) -> Dict[str, Any]:
    """Creates an order using the product's stored price."""
    actual_product_id = get_product_id(current_storage, product_id_or_name) # Pass current_storage

    if not actual_product_id:
        print(f"[Tool Executed] create_order: Product '{product_id_or_name}' not found (in {type(current_storage).__name__}).")
        return {"status": "error", "message": f"Product '{product_id_or_name}' not found."}

    product = current_storage.products[actual_product_id]
    price = product["price"]

    if product["inventory_count"] == 0:
        print(f"[Tool Executed] create_order: Product ID {actual_product_id} is out of stock (in {type(current_storage).__name__}).")
        return {"status": "error", "message": f"{product['name']} is out of stock."}
    if quantity <= 0:
        print(f"[Tool Executed] create_order: Quantity must be positive. Requested: {quantity} (in {type(current_storage).__name__})")
        return {"status": "error", "message": "Quantity must be a positive number."}
    if quantity > product["inventory_count"]:
        print(f"[Tool Executed] create_order: Insufficient inventory for {product['name']} (ID: {actual_product_id}). Available: {product['inventory_count']}, Requested: {quantity} (in {type(current_storage).__name__})")
        return {
            "status": "partial_availability",
            "message": f"Insufficient inventory. Only {product['inventory_count']} units of {product['name']} are available.",
            "available_quantity": product["inventory_count"],
            "requested_quantity": quantity,
            "product_name": product['name']
        }

    if status == "Shipped":
        product["inventory_count"] -= quantity
        print(f"[Tool Executed] create_order: Inventory for {product['name']} (ID: {actual_product_id}) reduced by {quantity} due to 'Shipped' status on creation (in {type(current_storage).__name__}).")

    new_id = f"O{len(current_storage.orders) + 1}"
    current_storage.orders[new_id] = {
        "id": new_id,
        "product_id": actual_product_id,
        "product_name": product["name"],
        "quantity": quantity,
        "price": price,
        "status": status
    }
    print(f"[Tool Executed] create_order: Order {new_id} created for {quantity} of {product['name']} (ID: {actual_product_id}). Status: {status}. Remaining inv: {product['inventory_count']} (in {type(current_storage).__name__})")
    return {
        "status": "success",
        "order_id": new_id,
        "order_details": current_storage.orders[new_id],
        "remaining_inventory": product["inventory_count"]
    }

def get_order_details(current_storage: Storage, order_id: str) -> Dict[str, Any]:
    """Get details of a specific order."""
    order = current_storage.orders.get(order_id)
    if order:
        print(f"[Tool Executed] get_order_details: Order {order_id} found (in {type(current_storage).__name__}).")
        return {"status": "success", "order_id": order_id, "order_details": dict(order)}
    print(f"[Tool Executed] get_order_details: Order {order_id} not found (in {type(current_storage).__name__}).")
    return {"status": "error", "message": "Order not found"}

def update_order_status(current_storage: Storage, order_id: str, new_status: str) -> Dict[str, Any]:
    """Updates the status of an order and adjusts inventory accordingly."""
    if order_id not in current_storage.orders:
        print(f"[Tool Executed] update_order_status: Order {order_id} not found (in {type(current_storage).__name__}).")
        return {"status": "error", "message": "Order not found"}

    order = current_storage.orders[order_id]
    old_status = order["status"]
    product_id = order["product_id"]
    quantity = order["quantity"]

    if old_status == new_status:
        print(f"[Tool Executed] update_order_status: Order {order_id} status unchanged ({old_status}) (in {type(current_storage).__name__}).")
        return {"status": "unchanged", "message": f"Order {order_id} status is already {old_status}", "order_details": dict(order)}

    inventory_adjusted = False
    current_inventory_val = "unknown" # Default if product not found (should not happen if order is valid)

    if product_id in current_storage.products:
        product = current_storage.products[product_id]
        current_inventory_val = product["inventory_count"]

        if new_status == "Shipped" and old_status not in ["Shipped", "Delivered"]:
            if current_inventory_val < quantity:
                print(f"[Tool Executed] update_order_status: Insufficient inventory to ship order {order_id}. Have {current_inventory_val}, need {quantity} (in {type(current_storage).__name__}).")
                return {"status": "error", "message": f"Insufficient inventory to ship. Available: {current_inventory_val}, Required: {quantity}"}
            product["inventory_count"] -= quantity
            inventory_adjusted = True
            current_inventory_val = product["inventory_count"]
            print(f"[Tool Executed] update_order_status: Order {order_id} Shipped. Inv for {product_id} reduced by {quantity} to {current_inventory_val} (in {type(current_storage).__name__}).")
        elif new_status in ["Returned", "Canceled"] and old_status in ["Shipped", "Delivered"]:
            product["inventory_count"] += quantity
            inventory_adjusted = True
            current_inventory_val = product["inventory_count"]
            print(f"[Tool Executed] update_order_status: Order {order_id} {new_status}. Inv for {product_id} increased by {quantity} to {current_inventory_val} (in {type(current_storage).__name__}).")
    else:
        print(f"[Tool Executed] update_order_status: Product {product_id} for order {order_id} not found for inventory adjustment (in {type(current_storage).__name__}).")

    order["status"] = new_status
    print(f"[Tool Executed] update_order_status: Order {order_id} status updated from {old_status} to {new_status} (in {type(current_storage).__name__}).")
    return {
        "status": "success",
        "message": f"Order {order_id} status updated from {old_status} to {new_status}.",
        "order_id": order_id,
        "product_id": product_id,
        "old_status": old_status,
        "new_status": new_status,
        "inventory_adjusted": inventory_adjusted,
        "current_inventory": current_inventory_val,
        "order_details": dict(order)
    }

print("Tool functions defined.")

Tool functions defined.


In [35]:
class ConversationContext:
    def __init__(self):
        self.messages: List[Dict[str, Any]] = []
        self.context_data: Dict[str, Any] = {
            "customers": {}, "products": {}, "orders": {}, "last_action": None
        }
        self.session_start_time = datetime.now()

    def add_user_message(self, message: str) -> None:
        self.messages.append({"role": "user", "content": message})

    def add_assistant_message(self, message_content: Union[str, List[Dict[str, Any]]]) -> None:
        self.messages.append({"role": "assistant", "content": message_content})

    def update_entity_in_context(self, entity_type: str, entity_id: str, data: Any) -> None:
        if entity_type in self.context_data:
            self.context_data[entity_type][entity_id] = data # Store the actual data
            print(f"[Context Updated] Entity: {entity_type}, ID: {entity_id}, Data (type): {type(data)}")

    def set_last_action(self, action_type: str, action_details: Any) -> None:
        self.context_data["last_action"] = {
            "type": action_type,
            "details": action_details,
            "timestamp": datetime.now().isoformat()
        }
        print(f"[Context Updated] Last Action: {action_type}, Details: {json.dumps(action_details, default=str)}")


    def get_full_conversation_for_api(self) -> List[Dict[str, Any]]:
        return self.messages.copy()

    def get_context_summary(self) -> str:
        summary_parts = []
        if self.context_data["customers"]:
            customers_str = ", ".join([f"ID: {cid} (Name: {c.get('name', 'N/A') if isinstance(c, dict) else 'N/A'})" for cid, c in self.context_data["customers"].items()])
            summary_parts.append(f"Recent customers: {customers_str}")
        if self.context_data["products"]:
            products_str = ", ".join([f"ID: {pid} (Name: {p.get('name', 'N/A') if isinstance(p, dict) else 'N/A'})" for pid, p in self.context_data["products"].items()])
            summary_parts.append(f"Recent products: {products_str}")
        if self.context_data["orders"]:
            orders_str = ", ".join([f"ID: {oid} (Product: {o.get('product_name', 'N/A') if isinstance(o, dict) else 'N/A'}, Status: {o.get('status', 'N/A') if isinstance(o, dict) else 'N/A'})" for oid, o in self.context_data["orders"].items()])
            summary_parts.append(f"Recent orders: {orders_str}")

        last_action = self.context_data["last_action"]
        if last_action:
            action_type = last_action['type']
            action_details_summary = "..." # Default summary
            if isinstance(last_action.get('details'), dict):
                action_input = last_action['details'].get('input', {})
                action_result_status = last_action['details'].get('result', {}).get('status')
                action_details_summary = f"Input: {action_input}, Result Status: {action_result_status}"
                if action_result_status == "success":
                    if "order_id" in last_action['details'].get('result', {}):
                         action_details_summary += f", OrderID: {last_action['details']['result']['order_id']}"
                    elif "product_id" in last_action['details'].get('result', {}):
                         action_details_summary += f", ProductID: {last_action['details']['result']['product_id']}"


            summary_parts.append(f"Last action: {action_type} at {last_action['timestamp']} ({action_details_summary})")

        if not summary_parts: return "No specific context items set yet."
        return "\n".join(summary_parts)

    def clear(self) -> None:
        self.messages = []
        self.context_data = {"customers": {}, "products": {}, "orders": {}, "last_action": None}
        self.session_start_time = datetime.now()
        print("[Context Cleared]")

print("ConversationContext class defined.")


ConversationContext class defined.


In [36]:
# MODIFIED AgentEvaluator Class
class AgentEvaluator:
    def __init__(self):
        if 'tools_schemas_list' not in globals() or not tools_schemas_list:
            raise NameError("ERROR: Global 'tools_schemas_list' not found or empty. Please ensure the cell defining it is executed before creating an AgentEvaluator instance.")

        self.conversation_context = ConversationContext()
        self.evaluation_results = []
        self.anthropic_storage = Storage() # Worker agent's e-commerce data
        print("AgentEvaluator initialized with new Storage for Anthropic agent.")

        self.anthropic_tools_schemas = tools_schemas_list

        self.available_tool_functions = {
            "create_customer": create_customer, "get_customer_info": get_customer_info,
            "create_product": create_product, "update_product": update_product,
            "get_product_info": get_product_info, "list_all_products": list_all_products,
            "create_order": create_order, "get_order_details": get_order_details,
            "update_order_status": update_order_status,
        }
        self.active_learnings_cache: List[Dict] = self._load_initial_learnings_from_drive()
        self.learnings_updated_this_session_flag: bool = False
        print(f"AgentEvaluator initialized. Learnings path: {LEARNINGS_DRIVE_BASE_PATH}. Loaded {len(self.active_learnings_cache)} initial learnings.")

    def _mount_drive_if_needed(self):
        if not os.path.exists(DRIVE_MOUNT_PATH) or not os.listdir(DRIVE_MOUNT_PATH):
            try:
                drive.mount(DRIVE_MOUNT_PATH, force_remount=True)
                print(f"Google Drive mounted successfully at {DRIVE_MOUNT_PATH}.")
            except Exception as e:
                print(f"Error mounting Google Drive: {e}. RAG features will not work.")
        # else: print(f"Google Drive already mounted at {DRIVE_MOUNT_PATH}.") # Less verbose

    def _initialize_learnings_path(self): # Simplified as path is now set globally earlier
        global LEARNINGS_DRIVE_BASE_PATH
        if not LEARNINGS_DRIVE_BASE_PATH: # Fallback if somehow not set
             LEARNINGS_DRIVE_BASE_PATH = os.path.join(DRIVE_MOUNT_PATH, DEFAULT_LEARNINGS_DRIVE_SUBPATH)

        if not os.path.exists(LEARNINGS_DRIVE_BASE_PATH):
            try:
                os.makedirs(LEARNINGS_DRIVE_BASE_PATH)
                print(f"Created learnings directory: {LEARNINGS_DRIVE_BASE_PATH}")
            except Exception as e:
                print(f"Error creating learnings directory {LEARNINGS_DRIVE_BASE_PATH}: {e}")
        # else: print(f"Learnings directory found: {LEARNINGS_DRIVE_BASE_PATH}") # Less verbose

    def _get_latest_learnings_filepath_from_drive(self) -> Optional[str]:
        self._mount_drive_if_needed() # Ensure drive is mounted before access
        self._initialize_learnings_path() # Ensure path exists
        if not os.path.isdir(LEARNINGS_DRIVE_BASE_PATH): return None
        list_of_files = glob.glob(os.path.join(LEARNINGS_DRIVE_BASE_PATH, 'learnings_*.json'))
        if not list_of_files: return None
        return max(list_of_files, key=os.path.getctime)

    def _read_learnings_from_drive_file(self, filepath: str) -> List[Dict]:
        if not filepath or not os.path.exists(filepath): return []
        try:
            with open(filepath, 'r') as f: learnings_list = json.load(f)
            return learnings_list if isinstance(learnings_list, list) else []
        except Exception as e:
            print(f"Error reading/decoding learnings file {filepath}: {e}")
            return []

    def _load_initial_learnings_from_drive(self) -> List[Dict]:
        print("[RAG Cache] Initializing: Loading latest learnings from Drive...")
        latest_filepath = self._get_latest_learnings_filepath_from_drive()
        if latest_filepath:
            print(f"[RAG Cache] Loading initial learnings from: {latest_filepath}")
            return self._read_learnings_from_drive_file(latest_filepath)
        print("[RAG Cache] No existing learnings file found on Drive for initial load.")
        return []

    def _persist_active_learnings_to_drive(self):
        self._mount_drive_if_needed()
        self._initialize_learnings_path()
        if not os.path.isdir(LEARNINGS_DRIVE_BASE_PATH):
            print("CRITICAL: Learnings directory not available. Aborting RAG persistence.")
            return
        if not self.active_learnings_cache:
            print("[RAG Cache] Active learnings cache is empty. Nothing to persist.")
            self.learnings_updated_this_session_flag = False
            return
        timestamp_str = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
        new_filepath = os.path.join(LEARNINGS_DRIVE_BASE_PATH, f'learnings_{timestamp_str}.json')
        try:
            with open(new_filepath, 'w') as f: json.dump(self.active_learnings_cache, f, indent=4)
            print(f"[RAG Cache] Successfully persisted {len(self.active_learnings_cache)} active learnings to: {new_filepath}")
            self.learnings_updated_this_session_flag = False
        except Exception as e:
            print(f"Error persisting active learnings to {new_filepath}: {e}")

    def check_relevant_learnings(self, query: str, count: int = 5) -> Optional[str]:
        if not self.active_learnings_cache: return None
        keywords_from_query = self.extract_keywords(query)
        relevant_learning_objects = []
        for entry in self.active_learnings_cache:
            text_to_search = entry.get("final_learning_statement", "") + " " + entry.get("original_human_input", "") + " " + " ".join(entry.get("keywords", []))
            if any(kw.lower() in text_to_search.lower() for kw in keywords_from_query):
                relevant_learning_objects.append(entry)
        relevant_learning_objects.sort(key=lambda x: x.get('timestamp_created', ''), reverse=True)
        formatted_learnings = [f"- Learning (ID: {entry.get('learning_id', 'N/A')[:8]}, Time: {entry.get('timestamp_created', 'N/A')}): {entry.get('final_learning_statement', str(entry))}" for entry in relevant_learning_objects[:count]]
        return "\\nRelevant Learnings from Knowledge Base (In-Session Cache):\\n" + "\\n".join(formatted_learnings) if formatted_learnings else None

    def _update_context_from_tool_results(self, tool_name: str, tool_input: Dict, tool_result: Dict):
        agent_name = "Anthropic" # Assuming only Anthropic for now
        if not isinstance(tool_result, dict):
            print(f"[Context Update Error] Tool result for {tool_name} ({agent_name}) is not a dict: {tool_result}")
            self.conversation_context.set_last_action(f"tool_{tool_name}_{agent_name}", {"input": tool_input, "result": {"status": "error", "message": "Tool result was not a dictionary."}})
            return
        if tool_result.get("status") == "success":
            if "customer_id" in tool_result and "customer" in tool_result and isinstance(tool_result["customer"], dict):
                self.conversation_context.update_entity_in_context("customers", tool_result["customer_id"], tool_result["customer"])
            elif "product_id" in tool_result and "product" in tool_result and isinstance(tool_result["product"], dict):
                self.conversation_context.update_entity_in_context("products", tool_result["product_id"], tool_result["product"])
            elif "order_id" in tool_result and "order_details" in tool_result and isinstance(tool_result["order_details"], dict):
                self.conversation_context.update_entity_in_context("orders", tool_result["order_id"], tool_result["order_details"])
            elif tool_name == "list_all_products" and "products" in tool_result and isinstance(tool_result["products"], dict):
                 for pid, pdata in tool_result["products"].items(): self.conversation_context.update_entity_in_context("products", pid, pdata)
        self.conversation_context.set_last_action(f"tool_{tool_name}_{agent_name}", {"input": tool_input, "result": tool_result})

    def process_tool_call(self, tool_name: str, tool_input: Dict[str, Any]) -> Dict[str, Any]:
        target_storage_instance = self.anthropic_storage
        print(f"--- [Tool Dispatcher] Attempting tool: {tool_name} with input: {json.dumps(tool_input, default=str)} for storage: AnthropicStorage ---")
        if tool_name in self.available_tool_functions:
            function_to_call = self.available_tool_functions[tool_name]
            try:
                result = function_to_call(target_storage_instance, **tool_input)
                print(f"--- [Tool Dispatcher] Result for {tool_name} on AnthropicStorage: {json.dumps(result, indent=2, default=str)} ---")
                return result
            except Exception as e:
                print(f"--- [Tool Dispatcher] Exception for {tool_name} on AnthropicStorage: {e} (Input: {tool_input}) ---")
                import traceback; traceback.print_exc()
                return {"status": "error", "message": f"Error executing {tool_name}: {str(e)}"}
        else:
            print(f"--- [Tool Dispatcher] Tool {tool_name} not found. ---")
            return {"status": "error", "message": f"Tool {tool_name} not found."}

    def get_anthropic_response(self, current_worker_system_prompt: str, conversation_history: List[Dict[str, Any]]) -> str:
        messages_for_api = list(conversation_history) # Use a copy
        try:
            for i in range(5): # Max 5 iterations for tool use
                response = anthropic_client.messages.create(
                    model=ANTHROPIC_MODEL_NAME, max_tokens=4000, temperature=0.0,
                    system=current_worker_system_prompt,
                    tools=self.anthropic_tools_schemas,
                    messages=messages_for_api
                )
                assistant_response_blocks = response.content
                # Append assistant's attempt (text or tool calls) to the history for the *next* turn
                messages_for_api.append({"role": "assistant", "content": assistant_response_blocks})

                text_blocks = [block.text for block in assistant_response_blocks if block.type == "text"]
                final_text_response = " ".join(text_blocks).strip()

                if final_text_response.startswith("CLARIFICATION_REQUESTED:"):
                    return final_text_response # Return immediately for clarification

                tool_calls_to_process = [block for block in assistant_response_blocks if block.type == "tool_use"]
                if not tool_calls_to_process: # No tools to call, this is the final text
                    return final_text_response if final_text_response else "No text content in final Anthropic response."

                tool_results_for_next_call_content = []
                for tool_use_block in tool_calls_to_process:
                    tool_name, tool_input, tool_use_id = tool_use_block.name, tool_use_block.input, tool_use_block.id
                    print(f"Anthropic Tool Call: {tool_name}, Input: {tool_input}, ID: {tool_use_id}")
                    tool_result_data = self.process_tool_call(tool_name, tool_input)
                    self._update_context_from_tool_results(tool_name, tool_input, tool_result_data) # Update context based on this tool call
                    tool_results_for_next_call_content.append({
                        "type": "tool_result", "tool_use_id": tool_use_id,
                        "content": json.dumps(tool_result_data) # Ensure result is JSON string
                    })
                # Append user role message with tool results for the next iteration
                messages_for_api.append({"role": "user", "content": tool_results_for_next_call_content})
            return "Max tool iterations reached for Anthropic."
        except Exception as e:
            print(f"Error in get_anthropic_response: {str(e)}")
            import traceback; traceback.print_exc()
            return f"Error getting Anthropic response: {str(e)}"

    def _handle_worker_clarification(self, agent_question: str, current_prompt: str,
                                     agent_specific_history: List[Dict], max_attempts: int = 2) -> Tuple[str, List[Dict]]:
        agent_name = "Anthropic"
        print(f"\\n--- {agent_name} requests clarification ---")
        print(f"{agent_name}: {agent_question}")
        clarification_turn_details = []
        current_history_for_clarification = list(agent_specific_history)

        for attempt in range(max_attempts):
            user_clarification = input(f"Your response to {agent_name}: ").strip()
            if not user_clarification: user_clarification = "(User provided no input to clarification)"

            clarification_turn_details.append({"agent_question": agent_question, "user_answer": user_clarification})
            # Add user's clarification to the history for the *next* call to the LLM
            current_history_for_clarification.append({"role": "user", "content": user_clarification})

            print(f"Sending clarification back to {agent_name} (Attempt {attempt + 1}/{max_attempts})...")
            # Get new response based on history that *includes* user's clarification
            agent_response_text = self.get_anthropic_response(current_prompt, current_history_for_clarification)

            # The get_anthropic_response already appends its own assistant message to current_history_for_clarification
            # if it makes tool calls or has text. If it asks for clarification again, that's also handled.

            if agent_response_text.startswith("CLARIFICATION_REQUESTED:"):
                agent_question = agent_response_text.split("CLARIFICATION_REQUESTED:", 1)[-1].strip()
                # The assistant's clarification request would have been added to current_history_for_clarification
                # by the get_anthropic_response call.
                if attempt < max_attempts - 1:
                    print(f"\\n--- {agent_name} requests further clarification ---")
                    print(f"{agent_name}: {agent_question}")
                else:
                    print(f"{agent_name} reached max clarification attempts. Proceeding with last question as response.")
                    return agent_response_text, clarification_turn_details # Return the clarification request itself
            else: # AI provided a final response (text or after tool use)
                return agent_response_text, clarification_turn_details

        return agent_response_text, clarification_turn_details # Should be caught by loop logic

    def _get_llm_response_with_clarification_loop(self, system_prompt_with_context_learnings: str,
                                                 base_conversation_history: List[Dict]) -> Tuple[str, List[Dict]]:
        agent_specific_history = list(base_conversation_history) # Start with a copy

        # Initial call to the LLM
        final_agent_response = self.get_anthropic_response(system_prompt_with_context_learnings, agent_specific_history)
        # Note: get_anthropic_response internally modifies agent_specific_history by appending assistant turns and tool results.

        clarification_interactions = []
        if final_agent_response.startswith("CLARIFICATION_REQUESTED:"):
            agent_question = final_agent_response.split("CLARIFICATION_REQUESTED:", 1)[-1].strip()
            # The initial CLARIFICATION_REQUESTED message is already in agent_specific_history (added by get_anthropic_response)

            # Handle clarification loop. This will further modify agent_specific_history.
            final_agent_response, clarification_interactions = self._handle_worker_clarification(
                agent_question, system_prompt_with_context_learnings,
                agent_specific_history # Pass the evolving history
            )
        return final_agent_response, clarification_interactions

    def process_user_request(self, user_message: str) -> Dict[str, Any]:
        print(f"\\n\\n{'='*60}\\nUser Message: {user_message}\\n{'='*60}")
        self.conversation_context.add_user_message(user_message)
        context_summary = self.conversation_context.get_context_summary()
        print(f"Current Context Summary for Agent:\\n{context_summary}\\n{'-'*60}")

        learnings_for_prompt = self.check_relevant_learnings(user_message)
        if learnings_for_prompt: print(f"Relevant Learnings for this turn:\\n{learnings_for_prompt}")
        else: print("No specific relevant learnings found for this turn.")

        current_worker_prompt_with_context_and_learnings = f"{worker_system_prompt}\\n\\nConversation Context:\\n{context_summary}\\n\\n{learnings_for_prompt if learnings_for_prompt else 'No specific past learnings provided for this query.'}"
        self.learnings_updated_this_session_flag = False

        # --- CAPTURE INITIAL DATASTORE STATE ---
        initial_datastore_state_for_eval = self.anthropic_storage.get_full_datastore_copy()
        print(f"--- Captured Initial Datastore State for Evaluation (Products count: {len(initial_datastore_state_for_eval['products'])}) ---")


        print("\\n--- Getting Anthropic's Response ---")
        # Base history for this turn (user message + prior turns)
        anthropic_base_history = self.conversation_context.get_full_conversation_for_api()

        anthropic_final_response, anthropic_clarifications = self._get_llm_response_with_clarification_loop(
            current_worker_prompt_with_context_and_learnings, anthropic_base_history
        )
        # The conversation_context.messages list is updated by add_user_message and add_assistant_message.
        # The anthropic_base_history sent to _get_llm_response_with_clarification_loop is a snapshot.
        # The actual LLM calls in get_anthropic_response and _handle_worker_clarification build upon their own
        # evolving history copy. We need to ensure the main conversation_context.messages reflects the *final* outcome.

        # Add the final assistant response (after all tools and clarifications) to the main context
        self.conversation_context.add_assistant_message(f"[Anthropic Final Text]: {anthropic_final_response}") # Simplified content for context log

        if anthropic_clarifications:
            print(f"[INFO] Anthropic clarification interactions: {json.dumps(anthropic_clarifications, indent=2)}")
            # Log clarifications for learning, if any
            for interaction in anthropic_clarifications:
                 self.process_and_store_new_learning(
                    human_feedback_text=f"User clarification for Anthropic: '{interaction['user_answer']}' (re: agent question: '{interaction['agent_question']}')",
                    user_query_context=user_message,
                    turn_context_summary=context_summary + f"\\nAnthropic asked: {interaction['agent_question']}"
                )

        print(f"\\n--- Anthropic Final Response Text (Post-Tool/Clarification) ---\\n{anthropic_final_response}")

        # --- CAPTURE FINAL DATASTORE STATE ---
        final_datastore_state_for_eval = self.anthropic_storage.get_full_datastore_copy()
        print(f"--- Captured Final Datastore State for Evaluation (Products count: {len(final_datastore_state_for_eval['products'])}) ---")

        evaluation = self.evaluate_responses(
            user_message, anthropic_final_response,
            context_summary, learnings_for_prompt or "",
            anthropic_clarifications,
            initial_datastore_state_for_eval, # NEW
            final_datastore_state_for_eval    # NEW
        )
        self.evaluation_results.append(evaluation)

        # Handle human feedback for learning (evaluator clarification, general learning)
        human_feedback_candidate = ""
        clarif_details_eval = evaluation.get("clarification_details", {})
        if clarif_details_eval.get("used") and clarif_details_eval.get("provided_input") not in ["Skipped by user", "Skipped (non-interactive)"]:
            human_feedback_candidate = clarif_details_eval.get("provided_input")
            print(f"[INFO] Candidate learning from *evaluator* clarification: '{human_feedback_candidate}'")
            self.process_and_store_new_learning(human_feedback_text=human_feedback_candidate, user_query_context=user_message, turn_context_summary=context_summary)
            human_feedback_candidate = "" # Reset after processing

        try:
            human_general_learning = input("Do you want to add any general learnings from this turn? (Type your learning or 'skip'): ").strip()
            if human_general_learning.lower() not in ['skip', ''] :
                human_feedback_candidate = human_general_learning
            elif human_general_learning.lower() == 'skip':
                print("[INFO] Skipping the addition of general learnings for this turn.")
        except EOFError: print("EOFError: Skipping general learning input (non-interactive).")

        if human_feedback_candidate:
            self.process_and_store_new_learning(human_feedback_text=human_feedback_candidate, user_query_context=user_message, turn_context_summary=context_summary)

        if self.learnings_updated_this_session_flag:
            self._persist_active_learnings_to_drive()

        return {"user_message": user_message, "anthropic_response": anthropic_final_response, "evaluation": evaluation}

    # MODIFIED evaluate_responses method
    def evaluate_responses(self, user_message: str, anthropic_response: str,
                       context_summary_for_eval: str, learnings_for_eval: str,
                       anthropic_clarifications: Optional[List[Dict]],
                       initial_datastore_state: Dict[str, Any], # NEW
                       final_datastore_state: Dict[str, Any]    # NEW
                       ) -> Dict[str, Any]:
        print("\\n--- Starting Evaluation by Gemini (with Before/After States) ---")

        # Prepare datastore states for the prompt
        initial_ds_prompt = f"\\nData Store State *Before* AI Action (Initial Ground Truth):\\n{json.dumps(initial_datastore_state, indent=2, default=str)}\\n"
        final_ds_prompt = f"\\nData Store State *After* AI Action (Final Ground Truth):\\n{json.dumps(final_datastore_state, indent=2, default=str)}\\n"

        clarification_summary_for_eval = []
        if anthropic_clarifications:
            clarification_summary_for_eval.append(f"Anthropic Worker AI Clarification Loop: {len(anthropic_clarifications)} question(s) asked.")
            for i, turn in enumerate(anthropic_clarifications):
                clarification_summary_for_eval.append(f"  A{i+1}: Q: '{turn['agent_question']}' -> User A: '{turn['user_answer']}'")
        clarification_info_for_prompt = "\\n".join(clarification_summary_for_eval) if clarification_summary_for_eval else "No clarification questions were asked by the worker AI this turn."

        try:
            eval_prompt_parts = [
                f"User query: {user_message}",
                f"Current context provided to assistant:\\n{context_summary_for_eval}",
                f"Relevant RAG Learnings provided to assistant:\\n{learnings_for_eval if learnings_for_eval else 'None'}",
                initial_ds_prompt, # Use the "Before" state for initial context
                final_ds_prompt,   # Provide the "After" state for verifying outcomes
                f"\\nDetails of Worker AI Clarification Interactions this turn:\\n{clarification_info_for_prompt}\\n",
                f"Anthropic Claude final textual response:\\n{anthropic_response}",
                ("Please evaluate the response based on accuracy (comparing AI's claims and actions against the changes between 'Before' and 'After' data store states), "
                 "efficiency (considering necessity of clarifications based on 'Before' state), context awareness, and helpfulness. "
                 "Provide an overall score (1-10) and detailed reasoning, following your system prompt guidelines. "
                 "Explicitly reference the 'Before' and 'After' states when discussing data modifications or lack thereof.")
            ]
            eval_prompt = "\\n\\n".join(filter(None, eval_prompt_parts))
            # print(f"\nDEBUG: Full Eval Prompt to Gemini:\n{eval_prompt[:2000]}...\n") # For debugging long prompts

            gemini_response_obj = eval_model_instance.generate_content(eval_prompt)
            evaluation_text = gemini_response_obj.text
            print(f"Gemini Raw Evaluation (using Before/After states):\\n{evaluation_text}")

            clarification_details = {"used": False, "needed": "", "provided_input": "", "action_summary": ""}
            if "CLARIFICATION NEEDED_EVALUATOR:" in evaluation_text.upper():
                clarification_details["used"] = True
                clarification_details["needed"] = self.extract_clarification_needed(evaluation_text, tag="CLARIFICATION NEEDED_EVALUATOR:")
                print(f"--- Human Clarification Indicated by Evaluator ---\\nClarification needed: {clarification_details['needed']}")
                try:
                    human_input_for_eval = input(f"Enter human clarification for EVALUATOR (or 'skip'/'quit'): ")
                    if human_input_for_eval.lower() in ['quit', 'exit', 'stop', 'q']: raise SystemExit("User requested exit during evaluation")
                    if human_input_for_eval.lower() != 'skip' and human_input_for_eval.strip():
                        clarification_details["provided_input"] = human_input_for_eval
                        # Note: Actions based on evaluator feedback might need careful consideration of which storage to target.
                        # For now, assuming it might affect anthropic_storage if it's a direct data correction.
                        action_summary = self.process_human_feedback_actions(human_input_for_eval, self.anthropic_storage)
                        clarification_details["action_summary"] = action_summary

                        # If evaluator feedback led to data change, the 'final_datastore_state' might be stale.
                        # For a re-evaluation, we might need to pass the *new* final state.
                        # However, the current prompt structure for re-evaluation is simple.
                        updated_final_ds_for_reeval = self.anthropic_storage.get_full_datastore_copy() # Get potentially updated state

                        updated_eval_prompt = (f"{eval_prompt}\\n\\nHuman clarification provided TO EVALUATOR: {human_input_for_eval}\\n"
                                               f"Action taken based on feedback: {action_summary}\\n"
                                               f"Refreshed Data Store State *After* AI Action (and evaluator feedback action):\\n{json.dumps(updated_final_ds_for_reeval, indent=2, default=str)}\\n"
                                               f"Please re-evaluate, considering this new information and the refreshed 'After' state.")
                        updated_gemini_response = eval_model_instance.generate_content(updated_eval_prompt)
                        evaluation_text = updated_gemini_response.text # This is now the final evaluation text
                        print(f"Gemini Raw Re-Evaluation:\\n{evaluation_text}")
                    else:
                        clarification_details["provided_input"] = "Skipped by user"
                except EOFError: clarification_details["provided_input"] = "Skipped (non-interactive)"

            # The "Data Store Consistency Check" is now integrated into the main evaluation logic
            # by providing both before and after states and prompting the evaluator to use them.
            # The `evaluation_text` should already contain this comprehensive analysis.
            final_evaluation_text = evaluation_text # Use the potentially re-evaluated text

            anthropic_score = self.extract_score(final_evaluation_text) # Extract from the final eval text
            return {"anthropic_score": anthropic_score,
                    "full_evaluation": final_evaluation_text,
                    "clarification_details": clarification_details,
                    "initial_datastore_state_provided_to_eval": initial_datastore_state, # For logging
                    "final_datastore_state_provided_to_eval": final_datastore_state      # For logging
                   }
        except Exception as e:
            print(f"Error in evaluation: {str(e)}")
            import traceback; traceback.print_exc()
            return {"error": f"Error in evaluation: {str(e)}", "anthropic_score": 0,
                    "full_evaluation": f"Evaluation failed: {str(e)}", "clarification_details": {"used": False, "action_summary": ""},
                    "initial_datastore_state_provided_to_eval": initial_datastore_state,
                    "final_datastore_state_provided_to_eval": final_datastore_state
                   }

    def extract_clarification_needed(self, evaluation_text: str, tag: str = "CLARIFICATION NEEDED:") -> str:
        # (Implementation from notebook - seems okay)
        clarification_match = re.search(rf"{tag.upper()}\\s*(.*?)(?:\\n|$)", evaluation_text, re.IGNORECASE | re.DOTALL)
        if clarification_match and clarification_match.group(1).strip(): return clarification_match.group(1).strip()
        lines = evaluation_text.splitlines();
        for i, line in enumerate(lines):
            if tag.upper() in line.upper(): return line.upper().split(tag.upper(), 1)[-1].strip() + "\\n" + "\\n".join(lines[i+1:i+3])
        return "Evaluator indicated clarification needed, but specific question not formatted as expected."

    def process_and_store_new_learning(self, human_feedback_text: str, user_query_context: str, turn_context_summary: str):
        # (Implementation from notebook - largely unchanged for this fix, ensure it uses eval_model_instance)
        print(f"\\n--- Processing New Candidate Learning from Human Feedback: \\\"{human_feedback_text}\\\" ---")
        current_learnings_for_conflict_check = list(self.active_learnings_cache)
        evaluator_task_prompt_parts = [
            "You are an AI assistant helping to maintain a knowledge base of 'learnings' for a customer service AI agent.",
            f"The New Human Feedback is: \\\"{human_feedback_text}\\\"",
            f"This feedback was given in the context of the user query: \\\"{user_query_context}\\\"",
            "\\nHere are some existing ACTIVE learnings from our knowledge base (last 10, if any):"
        ]
        if current_learnings_for_conflict_check:
            for i, el_entry in enumerate(current_learnings_for_conflict_check[-10:]):
                stmt = el_entry.get('final_learning_statement', el_entry.get('original_human_input', 'N/A'))
                evaluator_task_prompt_parts.append(f"  Existing Learning {i+1} (ID: {el_entry.get('learning_id','N/A')[:8]}): \\\"{stmt}\\\"")
        else: evaluator_task_prompt_parts.append("  (No existing active learnings found.)")
        evaluator_task_prompt_parts.extend([
            "\\nYour Tasks:",
            "1. Analyze New Human Feedback.", "2. Compare with Existing ACTIVE Learnings for CONFLICTS or REDUNDANCY.",
            "3. If CONFLICT: Output `CONFLICT_DETECTED:`, explain, ask human to resolve. DO NOT provide 'FINALIZED_LEARNING'.",
            "4. If REDUNDANT: Output `REDUNDANT_LEARNING:`, state which existing learning. DO NOT provide 'FINALIZED_LEARNING'.",
            "5. If new/refining (no major conflict): Synthesize concise, actionable 'Finalized Learning Statement'. Prefix with `FINALIZED_LEARNING:`.",
            "6. If vague/unsuitable: Output `NOT_ACTIONABLE:`, explain. DO NOT provide 'FINALIZED_LEARNING'."
        ])
        evaluator_conflict_check_prompt = "\\n".join(evaluator_task_prompt_parts)
        synthesis_response_obj = eval_model_instance.generate_content(evaluator_conflict_check_prompt)
        evaluator_synthesis_text = synthesis_response_obj.text
        print(f"\\n[RAG Cache] Evaluator response on learning processing:\\n{evaluator_synthesis_text}")

        made_change_to_cache = False; final_learning_statement_to_store = None
        if "CONFLICT_DETECTED:" in evaluator_synthesis_text:
            print(f"\\n🛑 CONFLICT DETECTED BY EVALUATOR 🛑\\n{evaluator_synthesis_text.split('CONFLICT_DETECTED:', 1)[-1].strip()}")
            # Simplified conflict resolution: default to adding new if user doesn't explicitly discard/merge
            human_resolution = input("Resolve conflict: 'use new', 'discard new', 'merge: [text]', or 'keep existing': ").strip()
            if human_resolution.lower() == "use new": final_learning_statement_to_store = human_feedback_text
            elif human_resolution.lower().startswith("merge:"): final_learning_statement_to_store = human_resolution.split("merge:", 1)[-1].strip()
            elif human_resolution.lower() in ["discard new", "keep existing"]: final_learning_statement_to_store = None
            else: print("Resolution unclear. New candidate learning discarded."); final_learning_statement_to_store = None
        elif "FINALIZED_LEARNING:" in evaluator_synthesis_text:
            final_learning_statement_to_store = evaluator_synthesis_text.split("FINALIZED_LEARNING:", 1)[-1].strip()
            print(f"\\n✅ Finalized Learning by Evaluator: \\\"{final_learning_statement_to_store}\\\"")
        elif "REDUNDANT_LEARNING:" in evaluator_synthesis_text or "NOT_ACTIONABLE:" in evaluator_synthesis_text:
            print(f"\\nℹ️ Evaluator: {evaluator_synthesis_text.strip()}")
        else: # Unexpected response
            print("\\n⚠️ Evaluator response format for learning unexpected. Storing raw human feedback."); final_learning_statement_to_store = human_feedback_text

        if final_learning_statement_to_store:
            new_learning_entry = {"learning_id": str(uuid.uuid4()), "timestamp_created": datetime.now().isoformat(), "user_query_context_at_feedback": user_query_context, "original_human_input": human_feedback_text, "final_learning_statement": final_learning_statement_to_store, "keywords": self.extract_keywords(final_learning_statement_to_store + " " + human_feedback_text), "status": "active"}
            self.active_learnings_cache.append(new_learning_entry)
            made_change_to_cache = True
            print(f"[RAG Cache] Learning processed. Cache now has {len(self.active_learnings_cache)} entries.")
        if made_change_to_cache: self.learnings_updated_this_session_flag = True
        else: print("[RAG Cache] No changes made to learnings cache from this feedback.")


    def extract_keywords(self, text: str) -> List[str]:
        # (Implementation from notebook - seems okay)
        if not text: return ["general"]
        words = re.findall(r'\\b\\w{4,}\\b', text.lower())
        stop_words = {"the", "and", "is", "in", "to", "a", "of", "for", "with", "on", "at", "what", "how", "show", "tell", "please", "what's", "i'd", "like", "user", "query", "this", "that", "context", "claude", "anthropic", "before", "after", "state", "action", "truth", "ground"}
        extracted = list(set(word for word in words if word not in stop_words))
        return extracted if extracted else ["generic"]

    def extract_score(self, evaluation_text: str) -> int:
        # (Implementation from notebook - seems okay, might need tuning for new eval format)
        model_name_pattern = "Anthropic" # Or make it more generic if evaluating multiple
        # Try to find score in the last part of the text first, assuming summary is at the end
        search_text_segments = [evaluation_text]
        consistency_check_section_start = evaluation_text.upper().rfind("--- DATA STORE CONSISTENCY CHECK ---") # This marker might be removed from prompt
        if consistency_check_section_start != -1: # If old marker exists, search after it first
             search_text_segments.insert(0, evaluation_text[consistency_check_section_start:])

        patterns = [
            rf"(?:Overall Score|{model_name_pattern}.*?Overall Score).*?[:\\s]*(\\d+)/10",
            rf"{model_name_pattern}.*?Overall Score.*?[:\\s]*(\\d+)",
            rf"Overall Score.*?{model_name_pattern}.*?[:\\s]*(\\d+)",
            rf"{model_name_pattern}.*?score.*[:\\s]+(\\d+)(?:/10)?",
            rf"{model_name_pattern}.*?\\bscore\\b.*?(\\d+)",
            rf"Updated score for {model_name_pattern}.*?[:\\s]*(\\d+)",
            rf"{model_name_pattern}.*?updated overall score.*?[:\\s]*(\\d+)",
            r"Overall Score[:\s]*(\d+)(?:/10)?",
            r"The AI assistant's overall score is (\d+)/10",
            r"Overall[:\s]*(\d+)/10"
        ]
        for text_segment in search_text_segments: # Prioritize later segments
            for p_str in reversed(patterns): # Try more specific patterns first
                matches = list(re.finditer(p_str, text_segment, re.IGNORECASE | re.DOTALL))
                if matches:
                    last_match = matches[-1]
                    if last_match.group(1):
                        try:
                            score = int(last_match.group(1))
                            print(f"Extracted score {score} for '{model_name_pattern}' (or generic) using pattern: {p_str}")
                            return score
                        except ValueError: continue
        print(f"Could not extract score for '{model_name_pattern}' from eval text. Review eval output. Defaulting to 0.")
        return 0

    def process_human_feedback_actions(self, feedback: str, target_storage_for_action: Optional[Storage]) -> str:
        # (Implementation from notebook - ensure it uses self.process_tool_call which now uses self.anthropic_storage)
        action_result_summary = "No specific data action taken based on feedback."
        if not target_storage_for_action: # Should always be self.anthropic_storage if called from eval
            return "Skipping data action: No target storage specified for feedback."

        current_storage_name = "Anthropic's Store" # Since target_storage_for_action is self.anthropic_storage

        # Simplified: Check if feedback is a direct tool call-like instruction
        # Example: "tool:create_order product_id_or_name=P1 quantity=5 status=Shipped"
        if feedback.lower().startswith("tool:"):
            try:
                parts = feedback.split(" ", 1)
                tool_name = parts[0].split(":")[1]
                args_str = parts[1] if len(parts) > 1 else ""
                tool_input = {}
                if args_str:
                    for arg_pair in args_str.split(" "):
                        key, value = arg_pair.split("=")
                        # Attempt to convert to int/float if possible
                        if value.isdigit(): tool_input[key] = int(value)
                        elif re.match(r'^\\d+\\.\\d+$', value): tool_input[key] = float(value)
                        else: tool_input[key] = value

                if tool_name in self.available_tool_functions:
                    print(f"[Human Feedback Action] Attempting tool '{tool_name}' with input: {tool_input} on {current_storage_name}")
                    # Use self.process_tool_call to ensure it uses the correct storage (self.anthropic_storage)
                    # and updates conversation context appropriately.
                    result = self.process_tool_call(tool_name, tool_input)
                    action_result_summary = f"Action executed on {current_storage_name} via human feedback: Tool '{tool_name}'. Result: {result.get('status', 'N/A')}"
                    # Context update is handled by process_tool_call via _update_context_from_tool_results
                else:
                    action_result_summary = f"Tool '{tool_name}' specified in human feedback not found."
            except Exception as e:
                action_result_summary = f"Failed to parse/execute tool from human feedback '{feedback}': {str(e)}"
            print(f"[Human Feedback Action] {action_result_summary}")
            return action_result_summary

        # Original regex-based parsing (can be kept as fallback or for specific commands)
        order_update_match = re.search(r"update\\s+order\\s+(\\w+)\\s+status\\s+to\\s+(\\w+)", feedback, re.IGNORECASE)
        if order_update_match:
            order_id, new_status = order_update_match.groups()
            tool_input = {"order_id": order_id, "new_status": new_status}
            result = self.process_tool_call("update_order_status", tool_input) # Uses self.anthropic_storage
            action_result_summary = f"Action executed on {current_storage_name}: Updated order {order_id} status to {new_status}. Result: {result.get('status', 'N/A')}"
            print(f"[Human Feedback Action] {action_result_summary}")
            return action_result_summary

        # Add more regexes or a more robust command parser if needed

        return action_result_summary

print("MODIFIED AgentEvaluator class defined with Before/After state handling for evaluation.")

MODIFIED AgentEvaluator class defined with Before/After state handling for evaluation.


In [37]:
def main():
    print("\\nStarting Main Execution (Single Agent - Anthropic) with MODIFIED Evaluator...\\n")
    try:
        agent = AgentEvaluator()
    except Exception as e:
        print(f"Failed to initialize AgentEvaluator: {e}")
        import traceback; traceback.print_exc()
        return

    results_log = []
    while True:
        try:
            user_query = input("\\nEnter your query (or 'quit', 'exit', 'stop', 'q' to end): ")
            if user_query.lower() in ['quit', 'exit', 'stop', 'q']:
                print("Exiting the system. Goodbye!")
                if agent.learnings_updated_this_session_flag:
                    print("Persisting final session learnings to Drive...")
                    agent._persist_active_learnings_to_drive()
                break
            if not user_query.strip(): print("Empty query, please enter something."); continue

            result = agent.process_user_request(user_query)
            results_log.append(result)
        except SystemExit as se:
            print(f"System exit requested: {se}")
            if agent.learnings_updated_this_session_flag:
                print("Persisting session learnings to Drive before exiting...")
                agent._persist_active_learnings_to_drive()
            break
        except Exception as e:
            print(f"CRITICAL ERROR processing query '{user_query}': {e}")
            import traceback; traceback.print_exc()
            results_log.append({"user_message": user_query, "anthropic_response": "ERROR",
                                "evaluation": {"anthropic_score": 0, "full_evaluation": f"Critical error: {e}",
                                               "clarification_details": {"used": False, "action_summary": ""}}})
    print("\\n\\n===== EVALUATION SUMMARY =====")
    total_anthropic_score, num_q = 0, 0
    for i, res in enumerate(results_log):
        if not res: print(f"\\nQuery {i+1}: Skipped (empty result)."); continue
        num_q +=1
        print(f"\\nQuery {i+1}: {res.get('user_message', 'N/A')}")
        print(f"  Anthropic Resp: {str(res.get('anthropic_response', 'N/A'))[:250]}...")
        eval_data = res.get('evaluation', {})
        anth_s = eval_data.get('anthropic_score', 0)
        total_anthropic_score += anth_s
        print(f"  Score - Anthropic: {anth_s}")
        # Log initial and final states if available in eval_data for review
        if "initial_datastore_state_provided_to_eval" in eval_data:
            print(f"    Initial DS Orders: {len(eval_data['initial_datastore_state_provided_to_eval'].get('orders', {}))}, Products Inv P3: {eval_data['initial_datastore_state_provided_to_eval'].get('products', {}).get('P3',{}).get('inventory_count', 'N/A')}")
        if "final_datastore_state_provided_to_eval" in eval_data:
            print(f"    Final DS Orders: {len(eval_data['final_datastore_state_provided_to_eval'].get('orders', {}))}, Products Inv P3: {eval_data['final_datastore_state_provided_to_eval'].get('products', {}).get('P3',{}).get('inventory_count', 'N/A')}")

        clarif_details = eval_data.get('clarification_details',{})
        if clarif_details.get('used'):
            print(f"    Evaluator Clarification: Needed='{clarif_details.get('needed', 'N/A')}', Provided='{clarif_details.get('provided_input', 'N/A')}'")
            if clarif_details.get('action_summary'): print(f"    Action from Evaluator Clarification: {clarif_details['action_summary']}")
    print(f"\\n----- Overall Performance -----")
    if num_q > 0: print(f"Avg Anthropic Score: {total_anthropic_score/num_q:.2f}")
    else: print("No queries processed.")
    print(f"Total Anthropic Score: {total_anthropic_score}")
    print(f"\\nLearnings are stored in: {LEARNINGS_DRIVE_BASE_PATH}")
    latest_learnings_file = agent._get_latest_learnings_filepath_from_drive()
    if latest_learnings_file: print(f"Most recent learnings file: {os.path.basename(latest_learnings_file)}")
    else: print("No learnings files found.")
    print("\\nExecution Finished.")

# To run in a new cell:
# main()

In [38]:
""" Sample queries:
* Show me all the products available
* I'd like to order 25 Perplexinators, please
* Show me the status of my order
* (If the order is not in Shipped state, then) Please ship my order now
* How many Perplexinators are now left in stock?
* Add a new customer: Bill Leece, bill.leece@mail.com, +1.222.333.4444
* Add new new product: Gizmo X, description: A fancy gizmo, price: 29.99, inventory: 50
* Update Gizzmo's price to 99.99 #Note the misspelling of 'Gizmo'
* Who won the 2020 US presidential election?
* I need to update our insurance policy, so I need to know the total value of all the products in our inventory. Please tell me this amount.
* Summarize your learnings from our recent interactions.
"""

main()

\nStarting Main Execution (Single Agent - Anthropic) with MODIFIED Evaluator...\n
AgentEvaluator initialized with new Storage for Anthropic agent.
[RAG Cache] Initializing: Loading latest learnings from Drive...
[RAG Cache] Loading initial learnings from: /content/drive/My Drive/AI/Knowledgebases/learnings_20250519_141006_754753.json
AgentEvaluator initialized. Learnings path: /content/drive/My Drive/AI/Knowledgebases. Loaded 2 initial learnings.
\nEnter your query (or 'quit', 'exit', 'stop', 'q' to end): I'd like to order 25 Perplexinators, please
Current Context Summary for Agent:\nNo specific context items set yet.\n------------------------------------------------------------
No specific relevant learnings found for this turn.
--- Captured Initial Datastore State for Evaluation (Products count: 3) ---
\n--- Getting Anthropic's Response ---
Anthropic Tool Call: get_product_info, Input: {'product_id_or_name': 'Perplexinator'}, ID: toolu_01NYsHuAuUBATWBPjTQccucC
--- [Tool Dispatcher] A