# Solve Business Problems with AI - Codename: Hermes

## Objective
Develop a proof-of-concept application to intelligently process email order requests and customer inquiries for a fashion store. The system should accurately categorize emails as either product inquiries or order requests and generate appropriate responses using the product catalog information and current stock status.

You are encouraged to use AI assistants (like ChatGPT or Claude) and any IDE of your choice to develop your solution. Many modern IDEs (such as PyCharm or Cursor) can work with Jupyter notebook files directly.

## Task Description

### Inputs

Google Spreadsheet **[Document](https://docs.google.com/spreadsheets/d/14fKHsblfqZfWj3iAaM2oA51TlYfQlFT4WKo52fVaQ9U)** containing:

- **Products**: List of products with fields including product ID, name, category, stock amount, detailed description, and season.

- **Emails**: Sequential list of emails with fields such as email ID, subject, and body.

### Instructions

- Implement all requirements using advanced Large Language Models (LLMs) to handle complex tasks, process extensive data, and generate accurate outputs effectively.
- Use Retrieval-Augmented Generation (RAG) and vector store techniques where applicable to retrieve relevant information and generate responses.
- You are provided with a temporary OpenAI API key granting access to GPT-4o, which has a token quota. Use it wisely or use your own key if preferred.
- Address the requirements in the order listed. Review them in advance to develop a general implementation plan before starting.
- Your deliverables should include:
   - Code developed within this notebook.
   - A single spreadsheet containing results, organized across separate sheets.
   - Comments detailing your thought process.
- You may use additional libraries (e.g., langchain) to streamline the solution. Use libraries appropriately to align with best practices for AI and LLM tools.
- Use the most suitable AI techniques for each task. Note that solving tasks with traditional programming methods will not earn points, as this assessment evaluates your knowledge of LLM tools and best practices.

### Requirements

#### 1. Classify emails
    
Classify each email as either a _**"product inquiry"**_ or an _**"order request"**_. Ensure that the classification accurately reflects the intent of the email.

**Output**: Populate the **email-classification** sheet with columns: email ID, category.

#### 2. Process order requests
1.   Process orders
  - For each order request, verify product availability in stock.
  - If the order can be fulfilled, create a new order line with the status “created”.
  - If the order cannot be fulfilled due to insufficient stock, create a line with the status “out of stock” and include the requested quantity.
  - Update stock levels after processing each order.
  - Record each product request from the email.
  - **Output**: Populate the **order-status** sheet with columns: email ID, product ID, quantity, status (**_"created"_**, **_"out of stock"_**).

2.   Generate responses
  - Create response emails based on the order processing results:
      - If the order is fully processed, inform the customer and provide product details.
      - If the order cannot be fulfilled or is only partially fulfilled, explain the situation, specify the out-of-stock items, and suggest alternatives or options (e.g., waiting for restock).
  - Ensure the email tone is professional and production-ready.
  - **Output**: Populate the **order-response** sheet with columns: email ID, response.
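
The stock rules in requirement 2 can be sketched as a small bookkeeping step that runs once the order lines have been extracted from an email (for example by an LLM). This is a minimal sketch under stated assumptions: the helper name `process_order_lines`, the in-memory `stock` dictionary, and the sample product IDs are hypothetical and not part of the assignment.

```python
def process_order_lines(email_id, requested_items, stock):
    """Apply the "created" / "out of stock" rule to each requested item.

    requested_items: list of (product_id, quantity) tuples, assumed to have
    been extracted from the email beforehand.
    stock: dict mapping product_id -> units available; mutated in place.
    """
    order_lines = []
    for product_id, quantity in requested_items:
        available = stock.get(product_id, 0)
        if quantity <= available:
            # Fulfil the line and update stock before the next line is checked.
            stock[product_id] = available - quantity
            status = "created"
        else:
            # Insufficient stock: record the requested quantity, leave stock as is.
            status = "out of stock"
        order_lines.append({
            "email ID": email_id,
            "product ID": product_id,
            "quantity": quantity,
            "status": status,
        })
    return order_lines


# Hypothetical data: two units of P001 in stock, none of P002.
stock = {"P001": 2, "P002": 0}
lines = process_order_lines("E1", [("P001", 1), ("P001", 2), ("P002", 1)], stock)
for line in lines:
    print(line)
```

Because stock is updated after every line, the second request for `P001` is evaluated against the reduced quantity rather than the original one.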

#### 3. Handle product inquiry

Customers may ask general open questions.
  - Respond to product inquiries using relevant information from the product catalog.
  - Ensure your solution scales to handle a full catalog of over 100,000 products without exceeding token limits. Avoid including the entire catalog in the prompt.
  - **Output**: Populate the **inquiry-response** sheet with columns: email ID, response.

## Evaluation Criteria
- **Advanced AI Techniques**: The system should use Retrieval-Augmented Generation (RAG) and vector store techniques to retrieve relevant information from data sources and use it to respond to customer inquiries.
- **Tone Adaptation**: The AI should adapt its tone appropriately based on the context of the customer's inquiry. Responses should be informative and enhance the customer experience.
- **Code Completeness**: All functionalities outlined in the requirements must be fully implemented and operational as described.
- **Code Quality and Clarity**: The code should be well-organized, with clear logic and a structured approach. It should be easy to understand and maintain.
- **Presence of Expected Outputs**: All specified outputs must be correctly generated and saved in the appropriate sheets of the output spreadsheet. Ensure the format of each output matches the requirements—do not add extra columns or sheets.
- **Accuracy of Outputs**: The accuracy of the generated outputs is crucial and will significantly impact the evaluation of your submission.

We look forward to seeing your solution and your approach to solving real-world problems with AI technologies.

# Solution

## 1. Environment

### 1.1 Install Dependencies

In [None]:
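# pinecone-client is pinned below 3.0 because section 3.3 uses the legacy `pinecone.init(...)` API, which was removed in 3.0.
!pip install openai httpx==0.27.2 pandas "pinecone-client<3" thefuzz python-Levenshtein openpyxl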

### 1.2 Import Libraries

In [None]:
import os
import json
import time
import pandas as pd
from openai import OpenAI
import pinecone
from thefuzz import process as fuzz_process
from IPython.display import display, Markdown

### 1.3 Parameters

In [None]:
# --- OpenAI Configuration ---
OPENAI_API_KEY = "<OPENAI API KEY: Use one provided by Crossover or your own>"
OPENAI_BASE_URL = 'https://47v4us7kyypinfb5lcligtc3x40ygqbs.lambda-url.us-east-1.on.aws/v1/' # For Crossover provided key
OPENAI_MODEL = "gpt-4o"
OPENAI_EMBEDDING_MODEL = "text-embedding-3-small"

# --- Pinecone Configuration ---
PINECONE_API_KEY = "<YOUR PINECONE API KEY>"
PINECONE_ENVIRONMENT = "<YOUR PINECONE ENVIRONMENT>"
PINECONE_INDEX_NAME = "hermes-products"
PINECONE_EMBEDDING_DIMENSION = 1536 # Matches text-embedding-3-small
PINECONE_METRIC = "cosine"

# --- Data Configuration ---
INPUT_SPREADSHEET_ID = '14fKHsblfqZfWj3iAaM2oA51TlYfQlFT4WKo52fVaQ9U'
OUTPUT_SPREADSHEET_NAME = 'hermes_assignment_output.xlsx'

# --- Prompt Files Configuration ---
REPOSITORY="svallory/crossover-hermes"
PROMPTS_DIR = "prompts/"
DOCS_DIR = "docs/"
SALES_GUIDE_FILENAME = "sales-email-intelligence-guide.md"

# --- Other Constants ---
FUZZY_MATCH_THRESHOLD = 80 # Minimum score for a fuzzy match to be considered
MAX_RETRIES_LLM = 3
RETRY_DELAY_LLM = 5 # seconds

### 1.4 Initialize OpenAI Client

In [None]:
client = None

if OPENAI_API_KEY != "<OPENAI API KEY: Use one provided by Crossover or your own>":
    try:
        client = OpenAI(
            base_url=OPENAI_BASE_URL, # Comment out or remove if using your own key directly with OpenAI's API
            api_key=OPENAI_API_KEY
        )
        # Test connection (optional, can be uncommented)
        # completion = client.chat.completions.create(model=OPENAI_MODEL, messages=[{"role": "user", "content": "Hello!"}])
        # print("OpenAI client initialized successfully.")
        display(Markdown("OpenAI client initialized successfully."))
    except Exception as e:
        display(Markdown(f"**Error initializing OpenAI client:** {e}. Please check your API key and base URL."))
else:
    display(Markdown("**OpenAI API Key not configured.** Please set your API key in section 1.3."))

## 2. Data Loading and Preparation

This section includes functions to load the product catalog and email data from the specified Google Spreadsheet, and prepare them for processing.

In [None]:
def read_data_frame(document_id, sheet_name):
    """Reads a sheet from a Google Spreadsheet into a pandas DataFrame."""
    
    export_link = f"https://docs.google.com/spreadsheets/d/{document_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}"
    
    try:
        df = pd.read_csv(export_link)
        display(Markdown(f"Successfully loaded `{sheet_name}`. Shape: {df.shape}"))
        return df
    except Exception as e:
        display(Markdown(f"**Error loading sheet `{sheet_name}`:** {e}"))
        return pd.DataFrame() # Return empty DataFrame on error

products_df = read_data_frame(INPUT_SPREADSHEET_ID, 'products')
emails_df = read_data_frame(INPUT_SPREADSHEET_ID, 'emails')

# Display first few rows to verify loading
if not products_df.empty:
    display(Markdown("**Products Data (First 3 rows):**"))
    display(products_df.head(3))
if not emails_df.empty:
    display(Markdown("**Emails Data (First 3 rows):**"))
    display(emails_df.head(3))
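
`read_data_frame` interpolates the sheet name straight into the export URL, which works for single-word names such as `products` and `emails`. The sketch below shows one way to build the same URL while percent-encoding the sheet name, so that names containing spaces would also be safe. The helper `build_sheet_csv_url` and the `DOC_ID` placeholder are hypothetical and used only for illustration.

```python
from urllib.parse import quote


def build_sheet_csv_url(document_id, sheet_name):
    """Build the gviz CSV export URL, percent-encoding the sheet name."""
    return (
        f"https://docs.google.com/spreadsheets/d/{document_id}"
        f"/gviz/tq?tqx=out:csv&sheet={quote(sheet_name)}"
    )


# "DOC_ID" is a placeholder, not a real spreadsheet ID.
url = build_sheet_csv_url("DOC_ID", "order status")
print(url)
```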

### 2.1 Data Cleaning and Preparation

Perform the necessary data cleaning and type conversions: product IDs become strings, stock becomes an integer, and price becomes a float. A mutable copy of the inventory is also created for order processing.

In [None]:
if not products_df.empty:
    products_df['product_id'] = products_df['product_id'].astype(str)
    products_df['stock'] = pd.to_numeric(products_df['stock'], errors='coerce').fillna(0).astype(int)
    products_df['price'] = pd.to_numeric(products_df['price'], errors='coerce').fillna(0.0).astype(float)
    # Ensure other text fields are strings and handle NaNs
    for col in ['name', 'category', 'description', 'season']:
        if col in products_df.columns:
            # Fill NaNs first: astype(str) would otherwise turn them into the literal string 'nan'
            products_df[col] = products_df[col].fillna('').astype(str)
    display(Markdown("Product data types checked/converted and NaNs handled."))
    # products_df.info() # For debugging types

if not emails_df.empty:
    emails_df['email_id'] = emails_df['email_id'].astype(str)
    # Ensure other text fields are strings and handle NaNs
    for col in ['subject', 'message_body']:
        if col in emails_df.columns:
            # Fill NaNs first: astype(str) would otherwise turn them into the literal string 'nan'
            emails_df[col] = emails_df[col].fillna('').astype(str)
    display(Markdown("Email data types checked/converted and NaNs handled."))
    # emails_df.info() # For debugging types
    
# Create a mutable copy of inventory for processing
inventory_df = None
if not products_df.empty:
    inventory_df = products_df[['product_id', 'stock']].copy()
    inventory_df.set_index('product_id', inplace=True)
    display(Markdown("Initial inventory DataFrame created for operational use."))
    # display(inventory_df.head()) # For debugging

## 3. Utility Functions and Core Components

This section defines helper functions for loading prompt templates, interacting with LLMs, managing the Pinecone vector store, and other core logic required by the agent pipeline.

### 3.1 Prompt Engineering Utilities

Functions to load prompt templates from files, inject dynamic values (like email content or prior agent results), and include content from other files (e.g., `sales-email-intelligence-guide.md`).

In [None]:

sales_guide_content = "" # Global variable to store sales guide content
try:
    # Adjust path if your notebook is not at the root of the hermes directory structure
    with open(os.path.join(DOCS_DIR, SALES_GUIDE_FILENAME), 'r') as f:
        sales_guide_content = f.read()
    display(Markdown(f"Successfully loaded `{SALES_GUIDE_FILENAME}` from `{DOCS_DIR}`."))
except FileNotFoundError:
    display(Markdown(f"**Warning:** `{SALES_GUIDE_FILENAME}` not found in `{DOCS_DIR}`. Prompts requiring it may fail or be incomplete."))
except Exception as e:
    display(Markdown(f"**Error loading `{SALES_GUIDE_FILENAME}`:** {e}"))

def load_prompt_template(prompt_filename, dynamic_values=None):
    """Loads a prompt template from a file and injects dynamic values and included files."""
    global sales_guide_content # Access the globally loaded sales guide
    try:
        # Adjust path if your notebook is not at the root of the hermes directory structure
        with open(os.path.join(PROMPTS_DIR, prompt_filename), 'r') as f:
            template = f.read()
    except FileNotFoundError:
        display(Markdown(f"**Error:** Prompt file `{prompt_filename}` not found in `{PROMPTS_DIR}`."))
        return None
    except Exception as e:
        display(Markdown(f"**Error reading prompt file `{prompt_filename}`:** {e}"))
        return None

    # Handle << include: sales-email-intelligence-guide.md >> directives
    if "<< include: sales-email-intelligence-guide.md >>" in template:
        if sales_guide_content:
            template = template.replace("<< include: sales-email-intelligence-guide.md >>", sales_guide_content)
        else:
            display(Markdown(f"**Warning:** Prompt `{prompt_filename}` requests `{SALES_GUIDE_FILENAME}`, but it was not loaded. Substitution skipped."))
    
    # Add more general file includes here if needed based on `<< include: filename >>` pattern
    # Example: For including other .md files from DOCS_DIR or PROMPTS_DIR
    # Note: This is a simple replace. More robust parsing for multiple includes might be needed.
    # For now, only the sales guide is explicitly handled.

    # Substitute other dynamic values like << include: email.subject >> or << classification_results >>
    if dynamic_values:
        for key, value in dynamic_values.items():
            # Ensure values that are dicts/lists are converted to JSON strings for inclusion in prompts
            if isinstance(value, (dict, list)):
                value_str = json.dumps(value, indent=2) # Pretty print JSON for readability
            else:
                value_str = str(value)
            template = template.replace(f"<< include: {key} >>", value_str) # For keys like 'email.subject'
            template = template.replace(f"<< {key} >>", value_str)       # For simple keys like 'classification_results'
            
    return template

# Example of loading a prompt (can be uncommented for testing after data load)
# if not emails_df.empty:
#     test_email_data = emails_df.iloc[0]
#     test_prompt_content = load_prompt_template("01-classification-signal-extraction-agent.md", 
#                                        {"email.email_id": test_email_data.get('email_id','test_id'), 
#                                         "email.subject": test_email_data.get('subject','Test Subject'), 
#                                         "email.message": test_email_data.get('message_body','Test message')})
#     if test_prompt_content:
#         display(Markdown("**Sample prompt loaded (first 500 chars):**"))
#         display(Markdown(f"```markdown\n{test_prompt_content[:500]}...\n```"))
#     else:
#         display(Markdown("**Failed to load sample prompt.**"))
# else:
#     display(Markdown("Email data not loaded, skipping prompt load example."))
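
The substitution step of `load_prompt_template` can be tried without any prompt files on disk. The sketch below applies the same two placeholder forms, `<< include: key >>` and `<< key >>`, to an inline template. The helper name `render_template` and the sample template text are hypothetical; the real prompts live in the `prompts/` directory.

```python
import json


def render_template(template, dynamic_values):
    """Replace `<< include: key >>` and `<< key >>` placeholders with values."""
    for key, value in dynamic_values.items():
        # Dicts and lists are serialized as JSON so the LLM sees structured data.
        if isinstance(value, (dict, list)):
            value_str = json.dumps(value, indent=2)
        else:
            value_str = str(value)
        template = template.replace(f"<< include: {key} >>", value_str)
        template = template.replace(f"<< {key} >>", value_str)
    return template


# Hypothetical template using both placeholder forms.
sample_template = (
    "Subject: << include: email.subject >>\n"
    "Signals: << signals >>"
)
rendered = render_template(
    sample_template,
    {"email.subject": "Order for 2 scarves", "signals": ["order"]},
)
print(rendered)
```

Since this is a plain string replacement applied key by key, an email body that happens to contain a literal placeholder for a later key would be substituted as well. That is worth keeping in mind when the inserted values come from untrusted customer text.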

### 3.2 LLM Interaction Utilities

A robust function to call the OpenAI Chat Completion API, including error handling, retries, and optional JSON output parsing.

In [None]:
def call_openai_chat_completion(prompt_content, model=OPENAI_MODEL, temperature=0.2, is_json_output=True):
    """Calls the OpenAI Chat Completion API and returns the message content, optionally parsing JSON."""
    if not client:
        display(Markdown("**Error:** OpenAI client is not initialized. Cannot make API call."))
        return {"error": "OpenAI client not initialized"} if is_json_output else None # Return error dict if JSON expected

    messages = [{"role": "user", "content": prompt_content}]
    
    for attempt in range(MAX_RETRIES_LLM):
        try:
            response_format_arg = {"type": "json_object"} if is_json_output else None
            
            # Ensure response_format is only passed if not None, to avoid issues with models not supporting it or when not expecting JSON
            completion_args = {
                "model": model,
                "messages": messages,
                "temperature": temperature
            }
            if response_format_arg: # Only add if we want JSON and it's configured
                completion_args["response_format"] = response_format_arg
                
            completion = client.chat.completions.create(**completion_args)
            content = completion.choices[0].message.content
            
            if is_json_output:
                try:
                    # Attempt to strip markdown code block fences if present before parsing JSON
                    if content.strip().startswith("```json\n") and content.strip().endswith("\n```"):
                        content_for_json = content.strip()[7:-3].strip()
                    elif content.strip().startswith("```\n") and content.strip().endswith("\n```"):
                        content_for_json = content.strip()[3:-3].strip()
                    else:
                        content_for_json = content
                    return json.loads(content_for_json)
                except json.JSONDecodeError as e:
                    display(Markdown(f"**Warning:** LLM output was not valid JSON (attempt {attempt + 1}): {e}. Raw content (first 500 chars): ```{content[:500]}...```"))
                    if attempt < MAX_RETRIES_LLM - 1:
                        time.sleep(RETRY_DELAY_LLM)
                        continue # Retry if JSON parsing failed
                    return {"error": "JSONDecodeError after retries", "raw_content": content} # Final attempt failed
            return content # Return raw content if not expecting JSON
        except Exception as e:
            display(Markdown(f"**Error calling OpenAI API (attempt {attempt + 1}/{MAX_RETRIES_LLM}):** {e}"))
            if attempt < MAX_RETRIES_LLM - 1:
                time.sleep(RETRY_DELAY_LLM)
            else:
                display(Markdown("Max retries reached. API call failed."))
                return {"error": str(e)} if is_json_output else None # Return error dict if JSON expected
    return {"error": "Max retries reached and call failed consistently"} if is_json_output else None
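
The fence-stripping step inside `call_openai_chat_completion` can be isolated and tested offline. The following is a minimal sketch that uses a regular expression instead of the fixed-offset slicing above; the helper name `parse_llm_json` and the sample replies are hypothetical.

```python
import json
import re

# Three backticks, built programmatically so this snippet stays easy to embed.
FENCE = "`" * 3

# Matches a reply wrapped in a markdown fence, with an optional "json" tag.
FENCE_PATTERN = re.compile(r"^`{3}(?:json)?\s*\n(.*?)\n\s*`{3}$", re.DOTALL)


def parse_llm_json(content):
    """Parse JSON from an LLM reply, tolerating a surrounding markdown fence."""
    text = content.strip()
    match = FENCE_PATTERN.match(text)
    if match:
        text = match.group(1)
    return json.loads(text)  # raises json.JSONDecodeError on invalid JSON


# Hypothetical replies: one fenced, one plain.
fenced_reply = FENCE + "json\n" + '{"category": "order request"}' + "\n" + FENCE
plain_reply = '{"category": "product inquiry"}'

print(parse_llm_json(fenced_reply))
print(parse_llm_json(plain_reply))
```

Letting `json.JSONDecodeError` propagate keeps the retry decision in the caller, which matches how the function above retries on parse failures.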

### 3.3 Pinecone Vector Store Utilities

Functions for initializing Pinecone, creating an index if it doesn't exist, embedding product data (text preparation and batch upsertion), and querying the index for Retrieval-Augmented Generation (RAG).

In [None]:
pinecone_index = None # Global variable for the Pinecone index object

def initialize_pinecone():
    """Initializes Pinecone connection and returns the index object."""
    global pinecone_index
    if PINECONE_API_KEY == "<YOUR PINECONE API KEY>" or PINECONE_ENVIRONMENT == "<YOUR PINECONE ENVIRONMENT>":
        display(Markdown("**Pinecone API Key or Environment not configured.** Please set them in section 1.3. Pinecone features will be disabled."))
        return None
    try:
        pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENVIRONMENT)
        if PINECONE_INDEX_NAME not in pinecone.list_indexes():
            display(Markdown(f"Creating Pinecone index '{PINECONE_INDEX_NAME}' with dimension {PINECONE_EMBEDDING_DIMENSION} and metric '{PINECONE_METRIC}'... This might take a moment."))
            pinecone.create_index(
                name=PINECONE_INDEX_NAME,
                dimension=PINECONE_EMBEDDING_DIMENSION,
                metric=PINECONE_METRIC,
                # pod_type='p1.x1' # As per ITD-007, though this might depend on account type/free tier limitations. Defaulting for wider compatibility.
            )
            # Wait for index to be ready
            wait_time = 0
            max_wait_time = 300 # 5 minutes
            while not pinecone.describe_index(PINECONE_INDEX_NAME).status['ready']:
                display(Markdown("Waiting for Pinecone index to be ready..."))
                time.sleep(10)
                wait_time += 10
                if wait_time >= max_wait_time:
                    display(Markdown("**Error:** Pinecone index did not become ready in time."))
                    return None
            display(Markdown(f"Index '{PINECONE_INDEX_NAME}' created and ready."))
        else:
            display(Markdown(f"Pinecone index '{PINECONE_INDEX_NAME}' already exists."))
        pinecone_index = pinecone.Index(PINECONE_INDEX_NAME)
        display(Markdown("Pinecone initialized and index object retrieved successfully."))
        # display(pinecone_index.describe_index_stats()) # For debugging index status
        return pinecone_index
    except Exception as e:
        display(Markdown(f"**Error initializing Pinecone:** {e}. Check API key, environment, and index name/configuration."))
        pinecone_index = None # Ensure it's None on failure
        return None

def get_openai_embeddings(texts_list, model=OPENAI_EMBEDDING_MODEL):
    """Generates embeddings for a list of texts using OpenAI."""
    if not client:
        display(Markdown("**Error:** OpenAI client not initialized for embeddings."))
        return []
    if not texts_list: # Handle empty list input to avoid API error
        return []
    try:
        # Replace newlines, as recommended by OpenAI for their embedding models to improve performance
        texts_list_cleaned = [str(text).replace("\n", " ") for text in texts_list]
        response = client.embeddings.create(input=texts_list_cleaned, model=model)
        return [item.embedding for item in response.data]
    except Exception as e:
        display(Markdown(f"**Error getting OpenAI embeddings:** {e}"))
        return []

def prepare_embedding_text_for_product(product_row_dict):
    """Prepares a concise text representation of a product for embedding, based on ITD-007."""
    # Ensure all parts are strings to avoid errors with NoneTypes during join/format
    name = str(product_row_dict.get('name', ''))
    description = str(product_row_dict.get('description', ''))
    category = str(product_row_dict.get('category', ''))
    season = str(product_row_dict.get('season', ''))
    return (
        f"Product: {name}\n"
        f"Description: {description}\n"
        f"Category: {category}\n"
        f"Season: {season}"
    )

def embed_and_upsert_products(products_df_to_embed, target_index, batch_size=50):
    """Embeds product data and upserts it into the Pinecone index."""
    if target_index is None or products_df_to_embed.empty:
        display(Markdown("**Error (Embed/Upsert):** Pinecone index not initialized or products DataFrame is empty. Cannot upsert."))
        return
    
    # Optional: Check if index is already populated to avoid re-populating (can be time/cost consuming)
    # try:
    #     stats = target_index.describe_index_stats()
    #     if stats.total_vector_count >= len(products_df_to_embed) * 0.9: # If mostly populated
    #         display(Markdown(f"Pinecone index '{target_index.name}' appears to be already populated with {stats.total_vector_count} vectors. Skipping population. Set RUN_PINECONE_POPULATION to True to force."))
    #         return
    # except Exception as e:
    #     display(Markdown(f"**Warning:** Could not get stats for Pinecone index '{target_index.name}'. Will attempt to populate. Error: {e}"))

    display(Markdown(f"Starting embedding and upsertion of {len(products_df_to_embed)} products to '{PINECONE_INDEX_NAME}'... This may take a while."))
    for i in range(0, len(products_df_to_embed), batch_size):
        batch_df = products_df_to_embed.iloc[i:i+batch_size]
        # Use itertuples for efficiency and convert to dict for prepare_embedding_text_for_product
        texts_to_embed = [prepare_embedding_text_for_product(row._asdict()) for row in batch_df.itertuples(index=False)]
        
        if not texts_to_embed:
            display(Markdown(f"Skipping empty batch at index {i}."))
            continue
            
        embeddings = get_openai_embeddings(texts_to_embed)
        if not embeddings or len(embeddings) != len(batch_df):
            display(Markdown(f"**Warning (Embed/Upsert):** Embedding failed or mismatch for batch starting at index {i}. Expected {len(batch_df)}, got {len(embeddings)}. Skipping batch."))
            continue

        vectors_to_upsert = []
        for j, product_tuple in enumerate(batch_df.itertuples(index=False)):
            product_row_dict = product_tuple._asdict() # Convert NamedTuple to dict
            product_id_str = str(product_row_dict['product_id']) # Ensure ID is string for Pinecone
            
            # Get current stock from the operational inventory_df for metadata
            current_stock = 0 # Default if not found
            if inventory_df is not None and product_id_str in inventory_df.index:
                current_stock = int(inventory_df.loc[product_id_str, 'stock'])
            else:
                current_stock = int(product_row_dict.get('stock', 0)) # Fallback to original stock from products_df
                
            metadata_dict = {
                "product_id": product_id_str,
                "name": str(product_row_dict.get('name', '')),
                "description": str(product_row_dict.get('description', '')),
                "category": str(product_row_dict.get('category', '')),
                "season": str(product_row_dict.get('season', '')),
                "price": float(product_row_dict.get('price', 0.0)),
                "stock": current_stock 
            }
            vectors_to_upsert.append({
                "id": product_id_str, 
                "values": embeddings[j],
                "metadata": metadata_dict
            })
        
        if vectors_to_upsert:
            try:
                target_index.upsert(vectors=vectors_to_upsert)
                num_total_batches = (len(products_df_to_embed) + batch_size - 1) // batch_size
                display(Markdown(f"Upserted batch {i//batch_size + 1}/{num_total_batches}. Products {i+1}-{min(i+batch_size, len(products_df_to_embed))} processed."))
            except Exception as e:
                display(Markdown(f"**Error (Embed/Upsert):** Failed upserting batch to Pinecone: {e}"))
        time.sleep(1) # Basic rate limiting to be polite to APIs
    display(Markdown("Product embedding and upsertion complete."))

def query_pinecone_for_products(query_text, top_k=3, index_to_query=None):
    """Embeds a query and searches the Pinecone index."""
    if index_to_query is None:
        display(Markdown("**Error (Query Pinecone):** Pinecone index not available for querying."))
        return []
    if not query_text or not str(query_text).strip(): # Check for empty or whitespace-only query
        display(Markdown("**Warning (Query Pinecone):** Empty query text provided. Returning no results."))
        return []
            
    query_embedding_list = get_openai_embeddings([str(query_text)]) # Ensure query_text is string
    if not query_embedding_list or not query_embedding_list[0]:
        display(Markdown(f"**Error (Query Pinecone):** Failed to generate embedding for query text: '{str(query_text)[:100]}...'"))
        return []
    
    try:
        results = index_to_query.query(
            vector=query_embedding_list[0],
            top_k=top_k,
            include_metadata=True
        )
        return results.get('matches', [])
    except Exception as e:
        display(Markdown(f"**Error (Query Pinecone):** Failed during Pinecone query: {e}"))
        return []

# Initialize Pinecone (attempt once per session)
pinecone_index = initialize_pinecone()

# Populate Pinecone index with product data 
# This is a potentially long and costly operation. 
# Set RUN_PINECONE_POPULATION to True ONLY if you need to populate or re-populate the index. 
# For subsequent runs, if the index is already populated, this step will be skipped if the below flag is False.
RUN_PINECONE_POPULATION = False # <-- IMPORTANT: Default to False. Set to True only if you intend to (re)populate.

if RUN_PINECONE_POPULATION:
    if pinecone_index and not products_df.empty:
        display(Markdown("**Starting Pinecone population process as RUN_PINECONE_POPULATION is True...**"))
        # Optional: Logic to delete and re-create index if a completely fresh population is always desired.
        # For now, it will upsert, which overwrites existing vectors with the same ID or adds new ones.
        embed_and_upsert_products(products_df, pinecone_index)
    else:
        display(Markdown("Skipping Pinecone population: RUN_PINECONE_POPULATION is True, but Pinecone index or product data is missing/unavailable."))
elif pinecone_index: # If not running population, but index exists, optionally show stats.
    display(Markdown("RUN_PINECONE_POPULATION is False. Assuming Pinecone index is already populated or population is not intended for this run."))
    # try:
    #    stats = pinecone_index.describe_index_stats()
    #    display(Markdown(f"Existing Pinecone index '{pinecone_index.name}' stats: {stats}"))
    # except Exception as e:
    #    display(Markdown(f"Could not fetch stats for existing index '{pinecone_index.name}': {e}"))
else:
    display(Markdown("RUN_PINECONE_POPULATION is False, and Pinecone index is not initialized (e.g. API key missing). RAG features will not work."))
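
To make the retrieval step concrete without a Pinecone account, the sketch below ranks a few toy vectors by cosine similarity, the metric configured in `PINECONE_METRIC`. It is an in-memory illustration only, not how Pinecone is implemented; the two-dimensional vectors, product names, and the helper `top_k_matches` are hypothetical stand-ins for the 1536-dimensional embeddings and the real index query.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_matches(query_vector, vectors, top_k=3):
    """Score every stored vector against the query and return the best top_k."""
    scored = [
        {
            "id": item["id"],
            "score": cosine_similarity(query_vector, item["values"]),
            "metadata": item["metadata"],
        }
        for item in vectors
    ]
    scored.sort(key=lambda m: m["score"], reverse=True)
    return scored[:top_k]


# Toy records in the same id/values/metadata shape used for upserts above.
toy_vectors = [
    {"id": "P1", "values": [1.0, 0.0], "metadata": {"name": "Wool scarf"}},
    {"id": "P2", "values": [0.0, 1.0], "metadata": {"name": "Linen shirt"}},
    {"id": "P3", "values": [0.7, 0.7], "metadata": {"name": "Cotton tee"}},
]
matches = top_k_matches([0.9, 0.1], toy_vectors, top_k=2)
for m in matches:
    print(m["id"], round(m["score"], 3), m["metadata"]["name"])
```

Only the `top_k` matches and their metadata need to be placed in the prompt, which is what keeps token usage bounded as the catalog grows.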

### 3.4 Product Matching Utilities (Fuzzy Matching)

Based on ITD-003 (Product Name/ID Mapping Strategy) and ITD-006 (Product Matching Implementation Strategy), this section uses fuzzy string matching with the `thefuzz` library to map textual product mentions to product IDs from the catalog. It is an algorithmic approach that can supplement, or be called by, the LLM-based Product Matcher Agent.

In [None]:
def resolve_product_mention_fuzzy(mention_text, product_names_list_for_fuzz, product_id_map_for_fuzz, score_threshold=FUZZY_MATCH_THRESHOLD):
    """
    Resolves a product mention using fuzzy matching against product names.
    Returns a dictionary with 'product_id', 'product_name', 'confidence', 'match_method' or None.
    product_id_map_for_fuzz: A dictionary mapping product names to product_ids for quick lookup after match.
    """
    if not mention_text or not product_names_list_for_fuzz:
        return None
    
    # Using extractOne which finds the best match above a certain score_cutoff
    match_details = fuzz_process.extractOne(str(mention_text), product_names_list_for_fuzz, score_cutoff=score_threshold)
    
    if match_details:
        matched_name_str, score_val = match_details
        # Retrieve the product_id using the matched name
        # This assumes product_id_map_for_fuzz is {name: id}
        p_id = product_id_map_for_fuzz.get(matched_name_str)
        if p_id:
            return {
                "product_id": str(p_id), # Ensure product_id is string
                "product_name": matched_name_str,
                # Quantity is often part of the order extraction by LLM, not simple name matching. 
                # Defaulting to 1 or omitting it here, to be filled by a later stage if it's an order item.
                # "quantity": 1, 
                "confidence": score_val / 100.0, # Normalize score to 0.0 - 1.0
                "original_mention": str(mention_text),
                "match_method": "fuzzy_name_match"
            }
        else:
            # This case should be rare if product_id_map_for_fuzz is built correctly from the same source as product_names_list_for_fuzz
            display(Markdown(f"**Fuzzy match warning:** Matched name '{matched_name_str}' (from mention '{mention_text}') not found in product_id_map."))
    return None

# Prepare product names and a map from name to ID for efficient fuzzy matching
product_names_for_fuzzy_matching = []
product_name_to_id_dict = {}
if not products_df.empty:
    product_names_for_fuzzy_matching = products_df['name'].unique().tolist() # Use unique names for matching choices
    # Create a mapping from product name to product ID. If names are not unique, this will take the ID of the last occurrence.
    # For a more robust system with non-unique names, a more complex mapping might be needed (e.g., name to list of IDs).
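    # Bracket access is used for the columns; attribute access on a column called 'name' is easy to misread.
    product_name_to_id_dict = pd.Series(products_df['product_id'].values, index=products_df['name']).to_dict()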
    display(Markdown(f"Product name list ({len(product_names_for_fuzzy_matching)} unique names) and name-to-ID map ({len(product_name_to_id_dict)} entries) prepared for fuzzy matching."))
else:
    display(Markdown("Products DataFrame is empty. Fuzzy matching utilities will not be effective."))

## 4. Email Processing Pipeline Agents

This section defines one function per agent in the pipeline, as outlined in ITD-005 (Agent Architecture). Each agent function typically:
1. Loads its prompt template from the `prompts/` directory.
2. Populates the template with dynamic data (e.g., email content, outputs from previous agents, RAG context).
3. Calls the LLM (via `call_openai_chat_completion`).
4. Parses the structured JSON output from the LLM and returns it.

Error handling and data validation are built into each agent and into the main loop.
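
The steps above can be sketched as a small, generic runner. This is only an illustration under stated assumptions: `render_template`, `run_agent`, and `fake_llm` are hypothetical names, the `{{key}}` placeholder syntax is assumed, and step 1 (reading the template from disk) is skipped. The notebook's own `load_prompt_template` and `call_openai_chat_completion` may behave differently.

```python
import json
from typing import Callable


def render_template(template: str, values: dict) -> str:
    """Fill `{{key}}` placeholders (assumed syntax); dicts and lists are serialized as JSON."""
    rendered = template
    for key, value in values.items():
        text = json.dumps(value, indent=2) if isinstance(value, (dict, list)) else str(value)
        rendered = rendered.replace("{{" + key + "}}", text)
    return rendered


def run_agent(template: str, values: dict, llm: Callable[[str], str], required_keys: tuple) -> dict:
    """Populate the prompt, call the LLM, then parse and check the JSON reply."""
    prompt = render_template(template, values)
    raw_reply = llm(prompt)
    try:
        parsed = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        return {"error": f"Invalid JSON: {exc}", "raw_output": raw_reply[:500]}
    if not isinstance(parsed, dict):
        return {"error": "Expected a JSON object", "raw_output": raw_reply[:500]}
    missing = [key for key in required_keys if key not in parsed]
    if missing:
        return {"error": f"Missing keys: {missing}", "raw_output": raw_reply[:500]}
    return parsed


def fake_llm(prompt: str) -> str:
    """Stand-in for the real model call so the sketch runs offline."""
    return json.dumps({"category": "order request", "confidence": 0.9, "signals": {}})


result = run_agent(
    "Subject: {{email.subject}}\nBody: {{email.message}}",
    {"email.subject": "Order", "email.message": "Two scarves please"},
    fake_llm,
    required_keys=("category", "confidence", "signals"),
)
print(result)
```

Returning an `{"error": ...}` dict instead of raising mirrors the convention the agent functions below use, so the main loop can decide how to handle a failed step.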

### 4.1 Agent 1: Classification & Signal Extraction Agent

**Purpose**: Analyzes each incoming email to determine its primary intent (order request or product inquiry) and to extract all relevant customer signals (product mentions, emotional cues, etc.), following `prompts/01-classification-signal-extraction-agent.md` and `sales-email-intelligence-guide.md`.
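
Because the LLM's JSON is not guaranteed to be well-typed, it can help to normalize the result before other agents consume it. The following is a minimal sketch, assuming the response carries the `category`, `confidence`, and `signals` keys checked in the cell below; `validate_classification` is a hypothetical helper and the sample values are made up.

```python
def validate_classification(response: object) -> dict:
    """Return a cleaned classification dict, or an {"error": ...} dict."""
    if not isinstance(response, dict):
        return {"error": "Response is not a JSON object"}
    missing = [key for key in ("category", "confidence", "signals") if key not in response]
    if missing:
        return {"error": f"Missing keys: {missing}"}
    try:
        # The model may return the confidence as a string such as "0.87".
        confidence = float(response["confidence"])
    except (TypeError, ValueError):
        return {"error": f"Non-numeric confidence: {response['confidence']!r}"}
    if not isinstance(response["signals"], dict):
        return {"error": "signals is not an object"}
    cleaned = dict(response)
    cleaned["confidence"] = min(1.0, max(0.0, confidence))  # Clamp to [0, 1]
    return cleaned


# Made-up example response; the product ID is illustrative only.
sample = {
    "category": "product inquiry",
    "confidence": "0.87",
    "signals": {"product_identification": ["ABC1234"]},
}
checked = validate_classification(sample)
print(checked)
```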

In [None]:
def classification_signal_extraction_agent(email_id_val, email_subject_val, email_message_val):
    """Processes an email for classification and signal extraction using an LLM."""
    display(Markdown(f"**Agent 1 (Classification & Signal Extraction) processing Email ID: {email_id_val}**"))
    prompt_template_filename = "01-classification-signal-extraction-agent.md"
    dynamic_prompt_values = {
        "email.email_id": email_id_val,
        "email.subject": str(email_subject_val), # Ensure string
        "email.message": str(email_message_val)  # Ensure string
    }
    complete_prompt_content = load_prompt_template(prompt_template_filename, dynamic_prompt_values)
    if not complete_prompt_content:
        # load_prompt_template already displays an error
        return {"error": f"Failed to load prompt for Agent 1: {prompt_template_filename}"}
    
    # For debugging the prompt being sent to LLM (optional)
    # display(Markdown(f"**Agent 1 Prompt for {email_id_val} (first 1000 chars):**\n```markdown\n{complete_prompt_content[:1000]}...\n```"))
    
    llm_response = call_openai_chat_completion(complete_prompt_content, is_json_output=True)
    
    # Validate basic structure of LLM response for this agent
    if isinstance(llm_response, dict) and all(key in llm_response for key in ('category', 'confidence', 'signals')):
        # The LLM may return confidence as a string or null; format defensively so ':.2f' cannot raise
        confidence_val = llm_response.get('confidence')
        confidence_str = f"{confidence_val:.2f}" if isinstance(confidence_val, (int, float)) else str(confidence_val)
        display(Markdown(f"Agent 1: Classification for {email_id_val}: **{llm_response.get('category', 'N/A')}**, Confidence: {confidence_str}"))
    elif isinstance(llm_response, dict) and 'error' in llm_response:
        display(Markdown(f"**Error in Agent 1 (Classification) output for {email_id_val}:** {llm_response['error']}. Raw: {str(llm_response.get('raw_content','empty'))[:200]}..."))
        # Return the error dict so the main loop can handle it
    else:
        display(Markdown(f"**Agent 1 (Classification & Signal Extraction) failed or returned unexpected/malformed JSON for Email ID: {email_id_val}. Output: {str(llm_response)[:300]}**"))
        return {"error": "Agent 1 failed or returned malformed JSON", "raw_output": str(llm_response)[:500]}
    return llm_response

### 4.2 Agent 2: Product Matcher Agent

**Purpose**: Identifies the specific products mentioned in the email and separates order items from inquiry items. It builds on the signals from Agent 1 and combines several matching strategies (exact ID, fuzzy name, and, if prompted, vector similarity on descriptions), as specified in `prompts/02-product-matcher-agent.md`.
In the current implementation the LLM performs the matching, guided by a catalog snippet and the extracted signals. The algorithmic fuzzy matching from Section 3.4 can serve as a pre-processing step or as a fallback if the LLM struggles, but it is not on the primary path.
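
As an illustration of the fallback idea, the sketch below ranks catalog names against a raw mention. It deliberately swaps in the standard library's `difflib.SequenceMatcher` rather than the fuzzy-matching library used in Section 3.4, so that it runs without extra dependencies. `fuzzy_candidates` is a hypothetical helper, the catalog is made up, and the name-to-list-of-IDs mapping is an assumption that shows how non-unique product names could be handled.

```python
from difflib import SequenceMatcher


def fuzzy_candidates(mention: str, name_to_ids: dict, cutoff: float = 0.6, limit: int = 3) -> list:
    """Rank catalog names by similarity to a mention; each name may map to several IDs."""
    scored = []
    for name, product_ids in name_to_ids.items():
        score = SequenceMatcher(None, mention.lower(), name.lower()).ratio()
        if score >= cutoff:
            scored.append({
                "product_name": name,
                "product_ids": list(product_ids),
                "confidence": round(score, 2),
                "original_mention": mention,
            })
    # Best match first; keep only the top `limit` candidates.
    scored.sort(key=lambda candidate: candidate["confidence"], reverse=True)
    return scored[:limit]


# Made-up catalog: the name "Silk Scarf" is shared by two product IDs.
catalog = {"Silk Scarf": ["P001", "P002"], "Leather Wallet": ["P003"]}
matches = fuzzy_candidates("silk scarfs", catalog)
print(matches)
```

Candidates found this way could be added to the catalog snippet passed to the LLM, or used only when the LLM leaves a mention unmatched.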

In [None]:
def product_matcher_agent(email_id_val, classification_agent_results, products_catalog_df_ref, pinecone_index_ref):
    """Identifies specific products mentioned in the email using LLM, guided by catalog data and signals."""
    display(Markdown(f"**Agent 2 (Product Matcher) processing Email ID: {email_id_val}**"))
    prompt_template_filename = "02-product-matcher-agent.md"

    if not isinstance(classification_agent_results, dict) or 'signals' not in classification_agent_results:
        display(Markdown("**Error (Agent 2):** Classification results are missing or invalid. Cannot proceed with product matching."))
        return {"error": "Missing or invalid classification_results for Agent 2"}

    # Prepare a snippet of the product catalog to provide context to the LLM
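    product_catalog_context_str = "Product_ID,Name,Category,Description (first 50 chars),Stock\n"
    if not products_catalog_df_ref.empty:
        # Try to provide a more relevant snippet if product_identification signals exist
        identified_mentions = classification_agent_results.get('signals', {}).get('product_identification', [])
        # Start from an empty frame that keeps the catalog's columns, so the later
        # 'product_id' lookups do not raise KeyError when no exact ID match is found
        candidate_products_for_snippet_df = products_catalog_df_ref.iloc[0:0]

        # 1. Exact ID matches from mentions
        if isinstance(identified_mentions, list) and identified_mentions:
            # Keep only plain strings: the signal list may also contain nested objects, which isin() cannot hash
            mentioned_ids = [mention for mention in identified_mentions if isinstance(mention, str)]
            exact_id_matches_df = products_catalog_df_ref[products_catalog_df_ref['product_id'].isin(mentioned_ids)]
            if not exact_id_matches_df.empty:
                candidate_products_for_snippet_df = pd.concat([candidate_products_for_snippet_df, exact_id_matches_df])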

        # 2. Fuzzy name matches for remaining mentions (if any)
        #    This is illustrative; a full fuzzy pre-match could be complex here. LLM will use raw mentions.

        # 3. If few specific candidates, add some general catalog items for broader context
        num_specific_candidates = len(candidate_products_for_snippet_df)
        num_general_needed = max(0, 10 - num_specific_candidates) # Aim for around 10-15 total for snippet
        
        if num_general_needed > 0 and len(products_catalog_df_ref) > num_specific_candidates:
            # Exclude already selected candidates from sampling general ones
            remaining_products_df = products_catalog_df_ref[~products_catalog_df_ref['product_id'].isin(candidate_products_for_snippet_df['product_id'])]
            if not remaining_products_df.empty:
                sample_size = min(num_general_needed, len(remaining_products_df))
                general_sample_df = remaining_products_df.sample(sample_size, random_state=1) # random_state for reproducibility
                candidate_products_for_snippet_df = pd.concat([candidate_products_for_snippet_df, general_sample_df])
        
        df_for_snippet = candidate_products_for_snippet_df.drop_duplicates(subset=['product_id']).head(15) # Limit total snippet size
        
        if df_for_snippet.empty and not products_catalog_df_ref.empty: # Fallback if no candidates found
            df_for_snippet = products_catalog_df_ref.head(10)

        for _, row_data in df_for_snippet.iterrows():
            desc_snip = str(row_data.get('description', ''))[:50].replace('\n', ' ') + "..."
            # Get current stock from live inventory_df for the snippet
            stock_val = 0
            if inventory_df is not None and row_data['product_id'] in inventory_df.index:
                stock_val = inventory_df.loc[row_data['product_id'], 'stock']
            else:
                stock_val = row_data.get('stock',0) # Fallback to original if not in live inventory
            product_catalog_context_str += f"{row_data.get('product_id')},{row_data.get('name')},{row_data.get('category')},{desc_snip},{stock_val}\n"
    else:
        product_catalog_context_str += "No product catalog data available.\n"

    # Vector similarity search is mentioned in the prompt's instructions to the LLM.
    # The LLM is expected to use descriptive phrases if it identifies them from the email signals.
    # For this PoC, we don't have a separate Python step for vector search *within* this agent call based on LLM identifying a phrase.
    # The RAG in Agent 4 (Inquiry Processing) is more direct for vector search against user queries.
    vector_search_info_for_prompt = "(Vector similarity search can be used for descriptive phrases if such are identified from email signals. Currently, prioritize matches from the provided catalog snippet and direct mentions. If a descriptive phrase seems key, note it for potential later vector search if direct matching fails.)"

    dynamic_prompt_values = {
        "classification_results": classification_agent_results, # Full JSON output from Agent 1
        "product_catalog": product_catalog_context_str,
        "vector_embeddings_of_product_descriptions": vector_search_info_for_prompt 
    }
    complete_prompt_content = load_prompt_template(prompt_template_filename, dynamic_prompt_values)
    if not complete_prompt_content:
        return {"error": f"Failed to load prompt for Agent 2: {prompt_template_filename}"}
    
    # display(Markdown(f"**Agent 2 Prompt for {email_id_val} (first 1000 chars):**\n```markdown\n{complete_prompt_content[:1000]}...\n```"))
    llm_response = call_openai_chat_completion(complete_prompt_content, is_json_output=True)

    if isinstance(llm_response, dict) and 'order_items' in llm_response and 'inquiry_items' in llm_response: # Check for key fields
        display(Markdown(f"Agent 2: Product matching for {email_id_val} completed. Order items found: {len(llm_response.get('order_items',[]))}, Inquiry items found: {len(llm_response.get('inquiry_items',[]))}, Unmatched: {len(llm_response.get('unmatched_mentions',[]))}"))
    elif isinstance(llm_response, dict) and 'error' in llm_response:
        display(Markdown(f"**Error in Agent 2 (Product Matcher) output for {email_id_val}:** {llm_response['error']}. Raw: {llm_response.get('raw_content','empty')[:200]}..."))
    else:
        display(Markdown(f"**Agent 2 (Product Matcher) failed or returned unexpected/malformed JSON for Email ID: {email_id_val}. Output: {str(llm_response)[:300]}**"))
        return {"error": "Agent 2 failed or returned malformed JSON", "raw_output": str(llm_response)[:500]}
    return llm_response

### 4.3 Agent 3: Order Processing Agent

**Purpose**: Processes confirmed order items (from Agent 2). It checks inventory (a snapshot of the live `inventory_df` is included in the prompt), determines each item's fulfillment status (`created`, `out_of_stock`, or `partial_fulfillment`), and suggests alternatives for out-of-stock items. The LLM also outputs `inventory_updates` instructions, but stock is only changed by Python code, which validates the LLM's `processed_items` and applies the resulting deductions to the live `inventory_df`. Based on `prompts/03-order-processing-agent.md`.
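
The status rule itself is simple enough to state in code, which makes it usable as a cross-check on the LLM's decision. The sketch below is an assumption about how such a check could look, not the prompt's actual logic; `decide_fulfillment` is a hypothetical helper that reuses the three status strings handled by the inventory update code.

```python
def decide_fulfillment(requested_qty: int, stock: int) -> dict:
    """Derive the fulfillment status for one order line from the stock on hand."""
    if requested_qty <= 0:
        raise ValueError("requested_qty must be positive")
    available = max(0, stock)  # Treat negative stock records as empty
    if available >= requested_qty:
        status, fulfilled = "created", requested_qty
    elif available > 0:
        status, fulfilled = "partial_fulfillment", available
    else:
        status, fulfilled = "out_of_stock", 0
    return {
        "status": status,
        "available_quantity": fulfilled,  # Quantity that can actually be shipped
        "remaining_stock": available - fulfilled,
    }


print(decide_fulfillment(2, 5))  # Enough stock for the full order
print(decide_fulfillment(4, 1))  # Only part of the order can be filled
print(decide_fulfillment(1, 0))  # Nothing in stock
```

Comparing this result with the status the LLM reports for the same item is one way to detect a decision that contradicts the inventory snapshot before any stock is deducted.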

In [None]:
def order_processing_agent(email_id_val, product_matcher_agent_results, current_inventory_snapshot_df, main_products_catalog_df):
    """Processes order items, checks stock, and suggests alternatives using an LLM."""
    display(Markdown(f"**Agent 3 (Order Processing) processing Email ID: {email_id_val}**"))
    prompt_template_filename = "03-order-processing-agent.md"
    
    # Ensure product_matcher_results is a dict, even if it failed or was empty, to avoid errors in .get()
    order_items_list_to_process = product_matcher_agent_results.get('order_items', []) if isinstance(product_matcher_agent_results, dict) else []
    
    if not order_items_list_to_process:
        display(Markdown(f"Agent 3: No order items to process for {email_id_val} based on Product Matcher output."))
        return { "processed_items": [], "unfulfilled_items_for_inquiry": [], "inventory_updates": [] } # Return valid empty structure

    # Prepare current inventory status string for ONLY the items in the order for the LLM's primary focus
    inventory_status_for_prompt_str = "Product_ID,Name,Current_Stock\n"
    # Get unique product IDs from the order items (as strings, in a stable order) to avoid redundant lookups
    ordered_product_ids = sorted({str(item['product_id']) for item in order_items_list_to_process if isinstance(item, dict) and item.get('product_id')})

    for pid_str in ordered_product_ids:
        stock_level = 0 # Default if not found
        product_name_val = "Unknown Product"
        if pid_str in current_inventory_snapshot_df.index:
            stock_level = current_inventory_snapshot_df.loc[pid_str, 'stock']
            # Get product name for better context in prompt
            if pid_str in main_products_catalog_df['product_id'].values:
                product_name_val = main_products_catalog_df.loc[main_products_catalog_df['product_id'] == pid_str, 'name'].iloc[0]
        inventory_status_for_prompt_str += f"{pid_str},{product_name_val},{stock_level}\n"
                
    # Provide a broader catalog snippet for the LLM to find alternatives for out-of-stock items
    # This helps LLM suggest relevant alternatives beyond just the ordered items.
    product_catalog_for_alternatives_str = "Product_ID,Name,Category,Price,Current_Stock\n"
    if not main_products_catalog_df.empty:
        # For this PoC, the LLM picks alternatives from a fixed random sample of up to 30 catalog items.
        # The sample is not filtered by stock or category; a fuller system could prioritize in-stock items from similar categories.
        sample_size = min(30, len(main_products_catalog_df))
        df_for_alt_snippet = main_products_catalog_df.sample(sample_size, random_state=42) if len(main_products_catalog_df) > sample_size else main_products_catalog_df

        for _, row_data in df_for_alt_snippet.iterrows():
            # Get current stock for these potential alternatives from live inventory
            stock_level_alt = 0
            if row_data['product_id'] in current_inventory_snapshot_df.index:
                stock_level_alt = current_inventory_snapshot_df.loc[row_data['product_id'], 'stock']
            product_catalog_for_alternatives_str += f"{row_data.get('product_id')},{row_data.get('name')},{row_data.get('category')},{row_data.get('price')},{stock_level_alt}\n"

    dynamic_prompt_values = {
        # Agent 03 prompt expects `product_matcher_results` which should contain `order_items` array
        "product_matcher_results": {"order_items": order_items_list_to_process}, 
        "current_inventory_status": inventory_status_for_prompt_str, # Stock for items in order
        "product_catalog_for_alternatives": product_catalog_for_alternatives_str # Broader catalog for suggesting alts
    }
    complete_prompt_content = load_prompt_template(prompt_template_filename, dynamic_prompt_values)
    if not complete_prompt_content:
        return {"error": f"Failed to load prompt for Agent 3: {prompt_template_filename}"}
    
    # display(Markdown(f"**Agent 3 Prompt for {email_id_val} (first 1000 chars):**\n```markdown\n{complete_prompt_content[:1000]}...\n```"))
    llm_response = call_openai_chat_completion(complete_prompt_content, is_json_output=True)

    # Validate basic structure of LLM response for this agent
    if isinstance(llm_response, dict) and 'processed_items' in llm_response: # Key field for this agent
        display(Markdown(f"Agent 3: Order processing for {email_id_val} completed by LLM. Items LLM decided on: {len(llm_response.get('processed_items',[]))}"))
    elif isinstance(llm_response, dict) and 'error' in llm_response:
        display(Markdown(f"**Error in Agent 3 (Order Processing) output for {email_id_val}:** {llm_response['error']}. Raw: {llm_response.get('raw_content','empty')[:200]}..."))
    else:
        display(Markdown(f"**Agent 3 (Order Processing) failed or returned unexpected/malformed JSON for Email ID: {email_id_val}. Output: {str(llm_response)[:300]}**"))
        return {"error": "Agent 3 failed or returned malformed JSON", "raw_output": str(llm_response)[:500]}
    return llm_response

def apply_inventory_updates_from_llm_decision(order_processing_llm_results, live_inventory_df_ref):
    """
    Applies inventory updates to the `live_inventory_df_ref` based on the LLM's order processing output.
    This function interprets the `processed_items` from the LLM to determine actual stock deductions.
    It modifies `live_inventory_df_ref` in place.
    Returns a list of dictionaries, where each dict describes an actual update applied.
    """
    if not isinstance(order_processing_llm_results, dict) or 'processed_items' not in order_processing_llm_results:
        display(Markdown("Apply Inventory: Invalid or missing 'processed_items' in LLM results. No updates applied."))
        return [] 
    
    if live_inventory_df_ref is None:
        display(Markdown("Apply Inventory: Live inventory DataFrame is None. Cannot apply updates."))
        return []
        
    actual_applied_inventory_changes_list = []
    llm_processed_items_list = order_processing_llm_results.get('processed_items', [])
    
    def _to_int(value):
        """Convert an LLM-supplied or stored quantity to int; return 0 if it is missing or not numeric."""
        try:
            return int(value)
        except (TypeError, ValueError):
            return 0

    for item_data_dict in llm_processed_items_list:
        if not isinstance(item_data_dict, dict):
            display(Markdown(f"Apply Inventory: Skipping malformed processed item: {str(item_data_dict)[:200]}"))
            continue
        pid_str = str(item_data_dict.get('product_id') or '') # Ensure string ID; a null ID becomes ''
        status_str = item_data_dict.get('status')
        # 'quantity' in processed_items is the original requested quantity by customer for that item
        original_requested_qty = _to_int(item_data_dict.get('quantity'))
        
        if not pid_str:
            display(Markdown(f"Apply Inventory: Missing product_id in processed item data: {item_data_dict}"))
            continue
            
        if pid_str not in live_inventory_df_ref.index:
            display(Markdown(f"**Warning (Apply Inventory):** Product ID '{pid_str}' from order results not found in live inventory. Cannot apply update."))
            continue
            
        # Plain Python int (not numpy.int64) keeps the returned change log JSON-serializable
        current_stock_on_record = _to_int(live_inventory_df_ref.loc[pid_str, 'stock'])
        quantity_to_deduct_from_stock = 0

        if status_str == 'created':
            # LLM confirmed full order for original_requested_qty based on stock it was shown. Deduct that original requested quantity.
            quantity_to_deduct_from_stock = original_requested_qty
        elif status_str == 'partial_fulfillment':
            # LLM decided on partial fulfillment. It should output 'available_quantity' for what was actually fulfilled.
            # This 'available_quantity' from LLM is the amount it believes can be fulfilled and thus deducted.
            quantity_to_deduct_from_stock = _to_int(item_data_dict.get('available_quantity'))
            if quantity_to_deduct_from_stock == 0 and original_requested_qty > 0:
                display(Markdown(f"**Warning (Apply Inventory):** Partial fulfillment for '{pid_str}' but LLM's 'available_quantity' is 0 or missing. No stock deducted for this item based on this status logic."))
        elif status_str == 'out_of_stock':
            quantity_to_deduct_from_stock = 0 # No stock deducted for OOS items
        else:
            display(Markdown(f"**Warning (Apply Inventory):** Unknown status '{status_str}' for product '{pid_str}'. No stock deducted."))
            continue # Skip to next item if status is unclear
            
        if quantity_to_deduct_from_stock > 0:
            if quantity_to_deduct_from_stock > current_stock_on_record:
                display(Markdown(f"**Critical Warning (Apply Inventory):** For product '{pid_str}', LLM/logic determined to fulfill {quantity_to_deduct_from_stock}, but only {current_stock_on_record} is actually available in live inventory. Deducting only the available stock ({current_stock_on_record}). This indicates a potential discrepancy or race condition if multiple processes update stock."))
                quantity_to_deduct_from_stock = current_stock_on_record # Safety: never deduct more than currently available
            
            new_stock_level = current_stock_on_record - quantity_to_deduct_from_stock
            live_inventory_df_ref.loc[pid_str, 'stock'] = new_stock_level # Modify live inventory DataFrame
            
            applied_change_info = {
                "product_id": pid_str,
                "quantity_fulfilled_and_deducted": quantity_to_deduct_from_stock,
                "stock_before_update": current_stock_on_record,
                "stock_after_update": new_stock_level,
                "llm_item_status": status_str
            }
            actual_applied_inventory_changes_list.append(applied_change_info)
            display(Markdown(f"Inventory update for '{pid_str}': {current_stock_on_record} -> {new_stock_level} (deducted: {quantity_to_deduct_from_stock}) based on LLM status '{status_str}'."))
            
    return actual_applied_inventory_changes_list