
# Day 2 – Exercise 8: Advanced Multi‑Step Customer Support Workflow (Enterprise)

## 🎯 Learning Objectives

In this advanced exercise you will:

- **Build** a multi‑step customer support agent that performs classification, retrieval, answer generation, ticket creation, and summarization.
- **Integrate** baseline and LLM‑based classifiers (via RealLight/LiteLLM) to understand different approaches to intent routing.
- **Simulate** retrieval using BM25, dense, and hybrid strategies and track metadata such as document counts and sources.
- **Capture** detailed logs (including latency and cost estimates) and compute metrics like classification accuracy, success rate, and average cost.
- **Analyze** failure patterns and plan improvements for a production MVP.

## 📚 What You'll Learn

1. **Intent Classification:** Compare rule‑based, logistic regression, and LLM classifiers.
2. **Retrieval Strategies:** Route queries to BM25, dense, or hybrid retrievers and track citations.
3. **Multi‑Step Orchestration:** Design a workflow that answers questions, creates tickets when unresolved, and summarizes interactions.
4. **Observability & Metrics:** Log interactions with metrics such as latency and cost to inform system optimization.

**Estimated Time:** 120 minutes



## 🔧 Setup and Installation

Before starting, ensure you have the following packages installed:

- `pandas==2.2.0` for data handling
- `scikit-learn==1.4.0` for the logistic regression classifier
- `litellm==1.39.1` and `openai==1.12.0` for RealLight (LiteLLM) integration

Install them in your environment using pip:
```bash
pip install pandas==2.2.0 scikit-learn==1.4.0 litellm==1.39.1 openai==1.12.0
```

Also, set your `OPENAI_API_KEY` (or other provider key) as an environment variable. **Never hard‑code secrets in notebooks.**


In [39]:
try:
    from elasticsearch import Elasticsearch

    # Initialize Elasticsearch client
    es = Elasticsearch(
        hosts=[{"host": "localhost", "port": 9200, "scheme": "http"}],
        request_timeout=30
    )

    # Quick test: ping the cluster
    if es.ping():
        print("✅ Connected to Elasticsearch")
    else:
        print("❌ Could not connect to Elasticsearch")

except ModuleNotFoundError:
    print("⚠️ Please install the elasticsearch library: pip install elasticsearch")
except Exception as e:
    print("Elasticsearch connection error:", e)


⚠️ Please install the elasticsearch library: pip install elasticsearch


In [40]:
%pip install elasticsearch

Collecting elasticsearch
  Downloading elasticsearch-9.1.1-py3-none-any.whl.metadata (8.3 kB)
Collecting elastic-transport<10,>=9.1.0 (from elasticsearch)
  Downloading elastic_transport-9.1.0-py3-none-any.whl.metadata (3.9 kB)
Downloading elasticsearch-9.1.1-py3-none-any.whl (937 kB)
Downloading elastic_transport-9.1.0-py3-none-any.whl (65 kB)
Installing collected packages: elastic-transport, elasticsearch
Successfully installed elastic-transport-9.1.0 elasticsearch-9.1.1
Note: you may need to restart the kernel to use updated packages.


In [25]:

# Confirm the key is set without printing it or embedding a default value
import os
print("OPENAI_API_KEY set:", bool(os.environ.get("OPENAI_API_KEY")))


## Stage A: Prepare the Dataset

We'll construct a dataset of **12 customer support queries** across multiple categories:

- **Factual:** Questions answerable with static information (e.g., policies)
- **Procedural:** How‑to questions requiring stepwise instructions
- **Exploratory:** Open‑ended or general questions
- **Billing:** Questions about payments and charges
- **Troubleshooting:** Issues requiring deeper investigation (often unresolved)

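Each entry includes a `query`, its annotated `intent`, a `resolved` flag indicating whether it can be answered directly, and a `category` for more fine‑grained analysis. Note that `intent` takes only three values (`factual`, `procedural`, `exploratory`); `billing` and `troubleshooting` appear only in `category`. In production, you'd source these from real ticket data and label them manually.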


In [26]:

import pandas as pd

support_data = [
    {"query": "How do I reset my password?", "intent": "procedural", "resolved": True, "category": "procedural"},
    {"query": "Why was my account suspended?", "intent": "exploratory", "resolved": False, "category": "troubleshooting"},
    {"query": "What is the refund policy?", "intent": "factual", "resolved": True, "category": "factual"},
    {"query": "My order hasn't arrived yet, what should I do?", "intent": "procedural", "resolved": False, "category": "troubleshooting"},
    {"query": "Explain the differences between basic and premium plans.", "intent": "exploratory", "resolved": True, "category": "exploratory"},
    {"query": "How can I update my billing address?", "intent": "procedural", "resolved": True, "category": "billing"},
    {"query": "When will I be charged for my subscription?", "intent": "factual", "resolved": True, "category": "billing"},
    {"query": "I see an unknown charge on my credit card.", "intent": "exploratory", "resolved": False, "category": "billing"},
    {"query": "Steps to troubleshoot a connectivity issue.", "intent": "procedural", "resolved": True, "category": "troubleshooting"},
    {"query": "Provide an overview of your customer loyalty program.", "intent": "exploratory", "resolved": True, "category": "factual"},
    {"query": "How do I cancel my order and get a refund?", "intent": "procedural", "resolved": False, "category": "billing"},
    {"query": "Your site is down, I cannot access anything.", "intent": "exploratory", "resolved": False, "category": "troubleshooting"},
]

support_df = pd.DataFrame(support_data)
support_df


Unnamed: 0,query,intent,resolved,category
0,How do I reset my password?,procedural,True,procedural
1,Why was my account suspended?,exploratory,False,troubleshooting
2,What is the refund policy?,factual,True,factual
3,"My order hasn't arrived yet, what should I do?",procedural,False,troubleshooting
4,Explain the differences between basic and prem...,exploratory,True,exploratory
5,How can I update my billing address?,procedural,True,billing
6,When will I be charged for my subscription?,factual,True,billing
7,I see an unknown charge on my credit card.,exploratory,False,billing
8,Steps to troubleshoot a connectivity issue.,procedural,True,troubleshooting
9,Provide an overview of your customer loyalty p...,exploratory,True,factual


This table forms the basis for our classification and workflow. Note the `category` field which can be used for more granular routing or specialized agents in larger systems.


## Stage B: Intent Classification

To decide how to retrieve information and handle each query, we'll compare three classifiers:

1. **Rule‑Based Classifier:** Simple keyword matching.
2. **Logistic Regression:** Supervised learning on TF‑IDF features.
3. **LLM Classifier via RealLight:** Zero‑shot classification using an LLM (the prompt lists the allowed labels but includes no examples). We provide a function; running it requires a valid API key and network access.

### B.1: Rule‑Based Classifier


In [27]:

# Simple keyword-based intent classifier

def rule_based_intent(query: str) -> str:
    query_lower = query.lower()
    factual_kw = ["what", "when", "policy", "overview"]
    procedural_kw = ["how", "steps", "reset", "update", "cancel", "troubleshoot"]
    billing_kw = ["charge", "billing", "refund", "subscription"]
    if any(kw in query_lower for kw in billing_kw):
        return "billing"
    if any(kw in query_lower for kw in procedural_kw):
        return "procedural"
    if any(kw in query_lower for kw in factual_kw):
        return "factual"
    return "exploratory"

# Apply to dataset
support_df["pred_rule"] = support_df["query"].apply(rule_based_intent)
support_df[["query", "intent", "pred_rule"]]


Unnamed: 0,query,intent,pred_rule
0,How do I reset my password?,procedural,procedural
1,Why was my account suspended?,exploratory,exploratory
2,What is the refund policy?,factual,billing
3,"My order hasn't arrived yet, what should I do?",procedural,factual
4,Explain the differences between basic and prem...,exploratory,exploratory
5,How can I update my billing address?,procedural,billing
6,When will I be charged for my subscription?,factual,billing
7,I see an unknown charge on my credit card.,exploratory,billing
8,Steps to troubleshoot a connectivity issue.,procedural,procedural
9,Provide an overview of your customer loyalty p...,exploratory,factual


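The rule‑based approach works well for obvious keywords but misclassifies ambiguous queries. Two patterns stand out in the table: the classifier returns `billing`, a label that never appears in the annotated `intent` column, so every query containing a billing keyword counts as an error; and incidental words can decide the label (the "what" in "what should I do?" yields `factual` for a query annotated as procedural).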
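
One way to reduce accidental matches is to compare whole words instead of substrings (a check such as `kw in query_lower` also matches "how" inside "show") and to emit only labels that exist in the annotated `intent` column. The sketch below is a hypothetical variant; `rule_based_intent_v2` and its keyword lists are illustrative assumptions, not part of the exercise code.

```python
import re

# Hypothetical keyword lists, restricted to the annotated intent labels.
INTENT_KEYWORDS = {
    "procedural": ["how", "steps", "reset", "update", "cancel", "troubleshoot"],
    "factual": ["what", "when", "policy"],
}

def rule_based_intent_v2(query: str) -> str:
    """Return the first intent whose keyword appears as a whole word."""
    words = set(re.findall(r"[a-z']+", query.lower()))
    for intent, keywords in INTENT_KEYWORDS.items():
        if words & set(keywords):
            return intent
    return "exploratory"  # default when no keyword matches

print(rule_based_intent_v2("What is the refund policy?"))            # factual
print(rule_based_intent_v2("How can I update my billing address?"))  # procedural
print(rule_based_intent_v2("Why was my account suspended?"))         # exploratory
```

Whole-word matching does not resolve genuinely ambiguous queries; it only removes errors caused by partial matches and by labels outside the annotation scheme.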

### B.2: Logistic Regression Classifier

In [28]:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    support_df["query"], support_df["intent"], test_size=0.3, random_state=42, stratify=support_df["intent"]
)

# Vectorize text
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train model
log_reg = LogisticRegression(max_iter=300)
log_reg.fit(X_train_vec, y_train)

# Evaluate
y_pred = log_reg.predict(X_test_vec)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Save full-model predictions for later use
support_df["pred_log_reg"] = log_reg.predict(vectorizer.transform(support_df["query"]))
support_df[["query", "intent", "pred_log_reg"]]


Accuracy: 0.5
              precision    recall  f1-score   support

 exploratory       1.00      0.50      0.67         2
     factual       0.00      0.00      0.00         1
  procedural       0.33      1.00      0.50         1

    accuracy                           0.50         4
   macro avg       0.44      0.50      0.39         4
weighted avg       0.58      0.50      0.46         4





Unnamed: 0,query,intent,pred_log_reg
0,How do I reset my password?,procedural,procedural
1,Why was my account suspended?,exploratory,procedural
2,What is the refund policy?,factual,procedural
3,"My order hasn't arrived yet, what should I do?",procedural,procedural
4,Explain the differences between basic and prem...,exploratory,exploratory
5,How can I update my billing address?,procedural,procedural
6,When will I be charged for my subscription?,factual,procedural
7,I see an unknown charge on my credit card.,exploratory,exploratory
8,Steps to troubleshoot a connectivity issue.,procedural,procedural
9,Provide an overview of your customer loyalty p...,exploratory,exploratory


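The logistic regression classifier captures patterns beyond keywords but still requires labeled data and may struggle with rare intents. With only four held‑out queries, each test error shifts accuracy by 25 points, so the 0.50 above is a rough indication and not a stable estimate. The final table also includes predictions on rows the model was trained on, so it overstates how well the model generalizes.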

### B.3: LLM Classifier via RealLight

In [29]:
from typing import List

# Import litellm if available
try:
    import litellm  # type: ignore
except ImportError:
    litellm = None


def llm_classify_intents(
    queries: List[str], model: str = "gpt-3.5-turbo", temperature: float = 0.0
) -> List[str]:
    """
    Classify intents using RealLight; returns a list of predicted intents.

    Allowed intents:
    - factual
    - procedural
    - exploratory
    - billing
    - troubleshooting
    """
    if litellm is None:
        print("litellm not installed; returning 'exploratory' for all queries.")
        return ["exploratory"] * len(queries)

    results = []
    for q in queries:
        messages = [
            {
                "role": "system",
                "content": (
                    "Classify the user's intent into one of: "
                    "factual, procedural, exploratory, billing, troubleshooting."
                ),
            },
            {
                "role": "user",
                "content": f"Query: {q}\nIntent:",
            },
        ]
        try:
            resp = litellm.completion(
                model=model,
                messages=messages,
                temperature=temperature,
                max_tokens=10,  # a label can span more than one token; 1 may truncate it
            )
            intent = resp.choices[0].message.content.strip().lower()
        except Exception as e:
            print(f"LLM error for '{q}': {e}")
            intent = "exploratory"
        results.append(intent)

    return results


# Example (uncomment if API key is set)
# support_df["pred_llm"] = llm_classify_intents(support_df["query"].tolist())
# support_df[["query", "intent", "pred_llm"]]


LLM classification can capture subtle nuances but incurs latency and cost. We avoid executing it here; provide your API key and uncomment the example to run.
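
Model output is free-form text, so it helps to map it onto the allowed label set before using it for routing. The following is a minimal sketch under that assumption; `normalize_intent` and `ALLOWED_INTENTS` are hypothetical helpers and are not defined elsewhere in this notebook.

```python
ALLOWED_INTENTS = ("factual", "procedural", "exploratory", "billing", "troubleshooting")

def normalize_intent(raw: str, default: str = "exploratory") -> str:
    """Map free-form model output onto one allowed label, else the default."""
    text = raw.strip().lower()
    for label in ALLOWED_INTENTS:
        if label in text:
            return label
    return default

print(normalize_intent("Intent: Billing."))  # billing
print(normalize_intent("I am not sure"))     # exploratory
```

Falling back to `exploratory` mirrors the default that `llm_classify_intents` already uses when a call fails.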


## Stage C: Retrieval Strategies

To answer queries, we'll simulate three retrieval strategies:

- **BM25 Retrieval:** Keyword search (fast for factual queries)
- **Dense Vector Retrieval:** Semantic search (good for exploratory queries)
- **Hybrid Retrieval:** Combines BM25 and dense results (useful for procedural or troubleshooting queries)

Each function returns a tuple of `(text, source, doc_count)` representing the retrieved content, its source identifier, and the number of documents retrieved. In a production system, integrate with Elasticsearch, FAISS, or Chroma to implement these.


In [30]:

# Simulated retrieval functions

def bm25_retrieve(query: str):
    # Simulate retrieving two documents
    return f"[BM25] Found policy and FAQ entries for '{query}'", "bm25_docs", 2


def dense_retrieve(query: str):
    # Simulate retrieving three documents
    return f"[Dense] Found semantically similar articles for '{query}'", "dense_docs", 3


def hybrid_retrieve(query: str):
    # Combine BM25 and dense results
    return f"[Hybrid] Combined results for '{query}'", "hybrid_docs", 5

# Routing logic based on intent (could use categories too)

def select_retriever(intent: str):
    if intent in ("factual", "billing"):
        return bm25_retrieve
    elif intent == "procedural":
        return hybrid_retrieve
    else:
        return dense_retrieve


The retrieval routing maps intents to retrievers. Billing queries often map to factual sources, procedural queries benefit from hybrid search, and exploratory/troubleshooting queries use dense retrieval.
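
The hybrid stub does not specify how BM25 and dense results are combined. One common option is reciprocal rank fusion (RRF), which scores each document by summing `1 / (k + rank)` across the ranked lists. The sketch below assumes each retriever returns an ordered list of document IDs; the IDs and the constant `k = 60` are illustrative choices, not values taken from this exercise.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k: int = 60):
    """Fuse ranked lists of document IDs; a higher fused score ranks first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical document IDs from the two retrievers
bm25_ranking = ["faq_12", "policy_3", "kb_7"]
dense_ranking = ["kb_7", "faq_12", "guide_1"]

print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
# ['faq_12', 'kb_7', 'policy_3', 'guide_1']
```

RRF uses only ranks, so it avoids having to calibrate BM25 scores against embedding similarities.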


## Stage D: Multi‑Step Support Agent

We'll now implement a comprehensive agent that:

1. **Classifies** the query (you can choose which classifier to use).
2. **Retrieves** information using the appropriate strategy.
3. **Generates** an answer via RealLight, including citations.
4. **Determines** whether the issue is resolved; if not, creates a ticket.
5. **Logs** metrics such as latency and cost (simulated).
6. **Summarizes** the conversation.

Each interaction returns a structured dictionary capturing all relevant fields.


In [31]:
import time

# We'll reuse litellm from earlier
# Simulated cost model: assume $0.0005 per token
COST_PER_TOKEN = 0.0005

# Global ticket storage
tickets = []

# LLM answer generation
def generate_answer(query: str, retrieved: str, citations: str) -> tuple[str, int]:
    '''Generate an answer via LLM (RealLight) and return answer and tokens used.'''
    if litellm is None:
        tokens_used = len(retrieved.split()) + len(query.split())
        answer = f"Answer (simulated) using info: {retrieved} (source: {citations})"
        return answer, tokens_used

    messages = [
        {"role": "system", "content": "You are a support assistant. Use the retrieved info to answer with citations in brackets."},
        {"role": "user", "content": f"Query: {query}\nRetrieved: {retrieved}\nCitations: {citations}"}
    ]
    try:
        resp = litellm.completion(model="gpt-3.5-turbo", messages=messages, temperature=0.0, max_tokens=150)
        answer = resp.choices[0].message.content.strip()
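        # Prefer provider-reported usage when present; otherwise approximate with a word count
        usage = getattr(resp, "usage", None)
        tokens_used = getattr(usage, "total_tokens", None) or len(answer.split())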
        return answer, tokens_used
    except Exception as e:
        print(f"LLM error during answer generation: {e}")
        tokens_used = len(retrieved.split()) + len(query.split())
        return f"Answer (fallback) using info: {retrieved}", tokens_used

# Summarization function
def summarize_chat(query: str, answer: str) -> str:
    '''Summarize the Q&A into one concise sentence.'''
    if litellm is None:
        return f"Summary: '{query[:30]}...' answered with '{answer[:30]}...'"
    messages = [
        {"role": "system", "content": "Provide a one-sentence summary of the following customer support interaction."},
        {"role": "user", "content": f"Question: {query}\nAnswer: {answer}"}
    ]
    try:
        resp = litellm.completion(model="gpt-3.5-turbo", messages=messages, temperature=0.2, max_tokens=60)
        return resp.choices[0].message.content.strip()
    except Exception:
        return f"Summary: '{query[:30]}...' answered with '{answer[:30]}...'"

# Ticket creation
def create_support_ticket(query: str, answer: str) -> int:
    '''Create a ticket and return its ID.'''
    ticket_id = len(tickets) + 1
    tickets.append({"ticket_id": ticket_id, "query": query, "answer": answer, "status": "open"})
    return ticket_id

# Determine if unresolved
def is_unresolved(resolved_flag: bool, answer: str) -> bool:
    '''Return True if the resolved flag is False or the answer contains low-confidence cues.'''
    low_confidence_terms = ["sorry", "cannot", "unable", "unknown", "not sure", "unavailable"]
    low_confidence = any(term in answer.lower() for term in low_confidence_terms)
    return (not resolved_flag) or low_confidence

# Master agent function
def run_support_agent(query: str, resolved_flag: bool = True, classifier: str = "rule") -> dict:
    '''Run the full workflow for a single query and return log record.'''
    start_time = time.time()

    # 1. Intent classification
    if classifier == "rule":
        intent = rule_based_intent(query)
    elif classifier == "log_reg":
        intent = log_reg.predict(vectorizer.transform([query]))[0]
    elif classifier == "llm":
        intent = llm_classify_intents([query])[0]
    else:
        intent = rule_based_intent(query)

    # 2. Retrieval
    retriever = select_retriever(intent)
    retrieved_text, source, doc_count = retriever(query)

    # 3. Answer generation
    answer, tokens_used = generate_answer(query, retrieved_text, source)

    # 4. Ticket creation
    ticket_id = None
    if is_unresolved(resolved_flag, answer):
        ticket_id = create_support_ticket(query, answer)

    # 5. Summary
    summary = summarize_chat(query, answer)

    # 6. Compute latency & cost
    latency = time.time() - start_time
    cost = tokens_used * COST_PER_TOKEN

    return {
        "query": query,
        "intent": intent,
        "retrieval_method": retriever.__name__,
        "retrieval_docs": doc_count,
        "answer": answer,
        "citations": source,
        "tokens_used": tokens_used,
        "cost_estimate": round(cost, 5),
        "latency_sec": round(latency, 3),
        "ticket_id": ticket_id,
        "summary": summary,
    }

This function orchestrates the entire support pipeline and returns a comprehensive log record. You can choose between the rule‑based, logistic regression, or LLM classifier via the `classifier` parameter.


## Stage E: Execute the Agent on Multiple Queries

We'll run the agent on all queries in the dataset using two classifiers: the rule‑based approach and the logistic regression model. Feel free to experiment with the LLM classifier if your environment and API key support it.


In [32]:

# Run experiments

logs_rule = []
logs_logreg = []

for _, row in support_df.iterrows():
    logs_rule.append(run_support_agent(row["query"], resolved_flag=row["resolved"], classifier="rule"))
    logs_logreg.append(run_support_agent(row["query"], resolved_flag=row["resolved"], classifier="log_reg"))

log_df_rule = pd.DataFrame(logs_rule)
log_df_logreg = pd.DataFrame(logs_logreg)

print("Rule-based classifier results:")
print(log_df_rule[["query", "intent", "retrieval_method", "ticket_id", "cost_estimate", "latency_sec"]])

print("Logistic regression classifier results:")
print(log_df_logreg[["query", "intent", "retrieval_method", "ticket_id", "cost_estimate", "latency_sec"]])


LLM error during answer generation: litellm.AuthenticationError: AuthenticationError: OpenAIException - The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable

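The output tables list each query alongside the chosen intent, retrieval method, whether a ticket was opened, and estimates for cost and latency. Notice differences between classifiers in intent assignments and routing. The `AuthenticationError` messages show that no API key was available in this run, so `generate_answer` and `summarize_chat` used their fallback paths instead of calling the LLM.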


## Stage F: Metrics & Analysis

We'll compute summary metrics for each classifier, including:

- **Classification Accuracy:** Comparing predicted intents to the annotated intents.
- **Success Rate:** The percentage of queries resolved without ticket creation.
- **Average Latency:** Mean latency in seconds.
- **Average Cost:** Estimated cost per query.

We'll then analyze failure patterns and suggest improvements.


In [33]:

from sklearn.metrics import accuracy_score

# Classification accuracy
acc_rule = accuracy_score(support_df["intent"], support_df["pred_rule"])
acc_log = accuracy_score(support_df["intent"], support_df["pred_log_reg"])

# Metrics computation

def compute_metrics(log_df):
    total = len(log_df)
    ticket_count = log_df["ticket_id"].notnull().sum()  # avoid shadowing the global `tickets` list
    success_rate = 1 - ticket_count / total
    avg_latency = log_df["latency_sec"].mean()
    avg_cost = log_df["cost_estimate"].mean()
    return success_rate, avg_latency, avg_cost

succ_rule, lat_rule, cost_rule = compute_metrics(log_df_rule)
succ_log, lat_log, cost_log = compute_metrics(log_df_logreg)

print(f"Rule-based classifier accuracy: {acc_rule:.2f}")
print(f"Rule-based success rate: {succ_rule:.2f}, avg latency: {lat_rule:.2f}s, avg cost: ${cost_rule:.4f}")

print(f"Logistic regression accuracy: {acc_log:.2f}")
print(f"Logistic regression success rate: {succ_log:.2f}, avg latency: {lat_log:.2f}s, avg cost: ${cost_log:.4f}")


Rule-based classifier accuracy: 0.42
Rule-based success rate: 0.58, avg latency: 0.05s, avg cost: $0.0105
Logistic regression accuracy: 0.75
Logistic regression success rate: 0.58, avg latency: 0.04s, avg cost: $0.0097



### Discussion

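- **Accuracy vs. Success Rate:** The two classifiers differ in accuracy (0.42 vs. 0.75) yet share a success rate of 0.58. In this run, tickets are driven by the `resolved` flag (5 of the 12 queries are marked unresolved, leaving 7/12 ≈ 0.58), not by routing. With real retrieval, misrouted queries would lead to less effective retrieval and more tickets. The 0.75 is also measured on the full dataset, including rows used for training; the held‑out accuracy in Stage B.2 was 0.50.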
- **Latency & Cost:** The logistic regression model has comparable latency and cost since our retrieval stubs and simulated token counts dominate. Using an LLM classifier would increase cost and latency, so you might adopt a hybrid strategy (fast model first, LLM fallback).
- **Failure Patterns:** Inspect misclassified intents and the content of unresolved answers to understand why tickets were created. You might need better keyword coverage or training data.
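
To make the failure-pattern inspection concrete, the sketch below tallies confusions for the rule-based classifier. The pairs are copied from the ten rows displayed in Stage B.1; the variable names are illustrative.

```python
from collections import Counter

# (annotated intent, rule-based prediction) for the ten rows shown in Stage B.1
pairs = [
    ("procedural", "procedural"), ("exploratory", "exploratory"),
    ("factual", "billing"), ("procedural", "factual"),
    ("exploratory", "exploratory"), ("procedural", "billing"),
    ("factual", "billing"), ("exploratory", "billing"),
    ("procedural", "procedural"), ("exploratory", "factual"),
]

# Count each (true, predicted) combination among the errors
confusions = Counter((true, pred) for true, pred in pairs if true != pred)
for (true, pred), count in confusions.most_common():
    print(f"{true} -> {pred}: {count}")

# Which predicted label causes the most errors?
by_predicted = Counter(pred for true, pred in pairs if true != pred)
print(by_predicted)  # Counter({'billing': 4, 'factual': 2})
```

Four of the six errors in these rows come from the `billing` label, which the annotation scheme does not use for `intent`.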

### Improvement Recommendations

1. **Increase Training Data:** More labeled examples per intent and category will improve the logistic regression classifier and provide better few-shot examples for the LLM.
2. **Hybrid Routing:** Use a lightweight classifier for most queries and call the LLM only on uncertain cases to balance cost and accuracy.
3. **Real Retrieval Integration:** Replace stubs with Elasticsearch for BM25 and FAISS/Chroma for dense retrieval. Track document IDs for citations.
4. **Ticketing Integration:** Connect to your actual support ticket system (e.g., via REST API) to open and update tickets.
5. **Observability:** Use tracing and monitoring tools to record latency distributions, cost, and user satisfaction metrics, then set alerts for anomalies.

By iterating on these recommendations, you'll move closer to a production‑ready, enterprise‑grade customer support agent.
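
As a starting point for the observability recommendation, latency can be summarized as percentiles instead of a single mean, since a few slow calls are easy to miss in an average. This is a minimal sketch using only the standard library; `latency_summary` and the sample values are hypothetical.

```python
import statistics

def latency_summary(latencies_sec):
    """Return the median and 95th-percentile latency from a list of seconds."""
    # n=100 yields 99 cut points; index 94 is the 95th percentile
    cuts = statistics.quantiles(latencies_sec, n=100, method="inclusive")
    return {"p50": statistics.median(latencies_sec), "p95": cuts[94]}

sample = [0.04, 0.05, 0.05, 0.06, 0.04, 0.30]  # hypothetical latencies in seconds
print(latency_summary(sample))
```

In this notebook the same function could be applied to `log_df_rule["latency_sec"].tolist()`.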



## ✅ Wrap‑Up

In this advanced exercise, you built and evaluated a multi‑step customer support workflow:

- Created a realistic dataset with multiple categories and resolved/unresolved flags.
- Implemented rule‑based, logistic regression, and LLM‑based classifiers for intent detection.
- Simulated retrieval using BM25, dense, and hybrid strategies, and routed queries accordingly.
- Generated answers via an LLM (RealLight) and simulated token‑based cost and latency.
- Created support tickets for unresolved issues and generated concise interaction summaries.
- Collected detailed logs and computed metrics to compare classifiers and identify improvement areas.

This notebook should serve as a blueprint for building an MVP backend for customer support using RAG. Adapt the retrieval and ticket components to your infrastructure, and continue enhancing classifiers and evaluation as your data grows.


In [34]:

# Quick Install (pin versions)
# !pip install pandas==2.2.0 scikit-learn==1.4.0 litellm==1.39.1 openai==1.12.0



# Day 2 – Exercise 8 (Part B): Multi‑Step Customer Support Workflow – Improvements

## 🎯 Objectives

This continuation builds on the previous exercise and implements the **improvement recommendations** to move closer to a production‑ready support agent. Specifically, you will:

1. **Expand the training data** to improve the logistic regression classifier.
2. **Implement hybrid routing**, using a confidence threshold to decide when to call an LLM.
3. **Demonstrate real retrieval integration** with Elasticsearch (BM25) and FAISS/Chroma (dense). We use placeholders and pseudo‑code where external services are required.
4. **Show how to integrate with an external ticketing system** via a REST API (simulated).
5. **Add observability instrumentation** by collecting and logging latency and cost metrics.

By the end, you'll understand how to enhance the initial prototype into a robust, enterprise‑grade workflow.



> **Note:** This notebook depends on the code and models from Part A. Make sure you've run the first notebook or import the functions defined there (e.g., `rule_based_intent`, `log_reg`, `vectorizer`, `llm_classify_intents`, and the retrieval stubs) before executing this part.


## Stage A: Increase Training Data

A larger, more diverse training set helps the logistic regression classifier generalize better and provides richer examples for LLM fine‑tuning or few‑shot prompts. We'll append additional labeled queries to the existing dataset. In practice, you would source these from your historical support tickets.


In [35]:

# Extend the existing dataset with more queries
additional_queries = [
    {"query": "I need to update my email address.", "intent": "procedural", "resolved": True, "category": "procedural"},
    {"query": "Help! My app keeps crashing on startup.", "intent": "exploratory", "resolved": False, "category": "troubleshooting"},
    {"query": "Do you offer student discounts?", "intent": "factual", "resolved": True, "category": "factual"},
    {"query": "Where can I see my past invoices?", "intent": "factual", "resolved": True, "category": "billing"},
    {"query": "Steps to integrate your API into my website.", "intent": "procedural", "resolved": True, "category": "procedural"},
    {"query": "Explain your data privacy policy.", "intent": "exploratory", "resolved": True, "category": "factual"},
    {"query": "How do I transfer my subscription to another user?", "intent": "procedural", "resolved": False, "category": "billing"},
    {"query": "My payment failed but I have sufficient balance.", "intent": "exploratory", "resolved": False, "category": "billing"},
]

# Append to support_df (ignore_index=True already renumbers the rows)
support_df_ext = pd.concat([support_df, pd.DataFrame(additional_queries)], ignore_index=True)
support_df_ext.tail()


Unnamed: 0,query,intent,resolved,category,pred_rule,pred_log_reg
15,Where can I see my past invoices?,factual,True,billing,,
16,Steps to integrate your API into my website.,procedural,True,procedural,,
17,Explain your data privacy policy.,exploratory,True,factual,,
18,How do I transfer my subscription to another u...,procedural,False,billing,,
19,My payment failed but I have sufficient balance.,exploratory,False,billing,,


With more labeled examples, we can retrain our logistic regression classifier.

In [36]:

# Retrain logistic regression on the extended dataset
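from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report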

X_train_ext, X_test_ext, y_train_ext, y_test_ext = train_test_split(
    support_df_ext["query"], support_df_ext["intent"], test_size=0.3, random_state=42, stratify=support_df_ext["intent"]
)

vectorizer_ext = TfidfVectorizer()
X_train_vec_ext = vectorizer_ext.fit_transform(X_train_ext)
X_test_vec_ext = vectorizer_ext.transform(X_test_ext)

log_reg_ext = LogisticRegression(max_iter=300)
log_reg_ext.fit(X_train_vec_ext, y_train_ext)

# Evaluate
print("Extended Logistic Regression Accuracy:", accuracy_score(y_test_ext, log_reg_ext.predict(X_test_vec_ext)))
print(classification_report(y_test_ext, log_reg_ext.predict(X_test_vec_ext)))


Extended Logistic Regression Accuracy: 0.6666666666666666
              precision    recall  f1-score   support

 exploratory       1.00      0.67      0.80         3
     factual       0.00      0.00      0.00         1
  procedural       0.50      1.00      0.67         2

    accuracy                           0.67         6
   macro avg       0.50      0.56      0.49         6
weighted avg       0.67      0.67      0.62         6





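Held‑out accuracy rises from 0.50 to 0.67 after retraining, but the test set is still only six queries and recall for `factual` remains 0.00, so the gain is tentative. The classifier predicts the three annotated intents only; `billing` and `troubleshooting` exist as categories, not as intent labels.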

## Stage B: Hybrid Routing with Confidence Threshold

To balance cost and accuracy, we route queries through the logistic regression classifier first. If the classifier's confidence (probability) is below a threshold, we fall back to the LLM classifier via RealLight. We'll implement a function that performs this hybrid routing.


In [44]:
import os
import numpy as np

def _has_llm_key() -> bool:
    return any(os.getenv(k) for k in ("OPENAI_API_KEY", "LITELLM_API_KEY", "LLM_API_KEY"))

def hybrid_route(query: str, threshold: float = 0.6, allow_llm_fallback: bool = True):
    """
    Stage B: Hybrid routing with confidence threshold.
    Uses logistic regression when confidence >= threshold.
    Otherwise, routes to LLM (if available & allowed).
    Returns a dict with: intent, route, confidence_ml.
    """
    # ML prediction
    vec = vectorizer_ext.transform([query])
    proba = log_reg_ext.predict_proba(vec)[0]
    idx = int(np.argmax(proba))
    ml_intent = str(log_reg_ext.classes_[idx])
    confidence_ml = float(proba[idx])

    # Route decision
    if (confidence_ml >= threshold) or (not allow_llm_fallback):
        return {
            "intent": ml_intent,
            "route": "ml",
            "confidence_ml": confidence_ml
        }

    # LLM fallback (safe)
    if _has_llm_key():
        try:
            llm_intent = llm_classify_intents([query])[0]
            return {
                "intent": llm_intent,
                "route": "llm",
                "confidence_ml": confidence_ml
            }
        except Exception:
            # On any LLM error, degrade to ML output
            return {
                "intent": ml_intent,
                "route": "ml_fallback_error",
                "confidence_ml": confidence_ml
            }
    else:
        # No key available, degrade to ML output
        return {
            "intent": ml_intent,
            "route": "ml_no_llm",
            "confidence_ml": confidence_ml
        }

# ---- Example execution (Stage B) ----
sample_queries = [
    "I see an unknown charge on my card.",
    "How can I integrate your API?",
    "Explain the cancellation policy.",
]

threshold = 0.6
for q in sample_queries:
    result = hybrid_route(q, threshold=threshold, allow_llm_fallback=True)
    print(q, "->", result)

I see an unknown charge on my card. -> {'intent': 'exploratory', 'route': 'ml_no_llm', 'confidence_ml': 0.3893473193013897}
How can I integrate your API? -> {'intent': 'procedural', 'route': 'ml_no_llm', 'confidence_ml': 0.4925574280939779}
Explain the cancellation policy. -> {'intent': 'exploratory', 'route': 'ml_no_llm', 'confidence_ml': 0.47170916423006964}


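The hybrid classifier uses logistic regression for high‑confidence predictions and falls back to the LLM for ambiguous queries. In the run above, all three queries scored below the 0.6 threshold, but no API key was set, so each was routed as `ml_no_llm` and kept the logistic regression label. Adjust the threshold to balance cost and accuracy: a higher threshold sends more queries to the LLM.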
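
One way to pick a threshold is to replay a small labeled sample at several candidate values and compare the LLM call rate with the resulting accuracy. The sketch below is illustrative only: `sweep_threshold` is a hypothetical helper, and the `samples` records are invented values, not results from this notebook.

```python
from typing import Dict, List, Tuple

# Hypothetical labeled sample: (ml_confidence, ml_label_correct, llm_label_correct).
# These values are made up for illustration; collect real ones from your own logs.
samples: List[Tuple[float, bool, bool]] = [
    (0.92, True, True),
    (0.81, True, True),
    (0.66, False, True),
    (0.55, True, True),
    (0.48, False, True),
    (0.39, False, False),
]

def sweep_threshold(
    samples: List[Tuple[float, bool, bool]], thresholds: List[float]
) -> List[Dict[str, float]]:
    """For each threshold, report how often the LLM is called and the overall accuracy."""
    rows = []
    for t in thresholds:
        llm_calls = 0
        correct = 0
        for conf, ml_ok, llm_ok in samples:
            if conf >= t:
                correct += int(ml_ok)   # ML prediction is kept
            else:
                llm_calls += 1          # query is sent to the LLM
                correct += int(llm_ok)
        rows.append({
            "threshold": t,
            "llm_call_rate": llm_calls / len(samples),
            "accuracy": correct / len(samples),
        })
    return rows

for row in sweep_threshold(samples, [0.4, 0.6, 0.8]):
    print(row)
```

With real data, the LLM call rate is a rough proxy for cost, so the table makes the trade-off explicit before you commit to a value.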


## Stage C: Real Retrieval Integration (Pseudo‑Code)

Replacing stub functions with real retrieval engines requires external services such as **ElasticSearch** for BM25 retrieval and **FAISS/Chroma** for dense retrieval. Because those services may not be available in this environment, the executed cell further down falls back to local TF‑IDF and SVD proxies whenever ElasticSearch or FAISS is missing.

### C.1: ElasticSearch BM25 Retrieval
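
A minimal sketch of BM25 retrieval, assuming an index named `support_docs` with a text field `content` (the same defaults the executed cell uses). Elasticsearch scores `match` queries with BM25 by default. `bm25_search` and `_FakeES` are hypothetical names; the fake client only exists so the sketch runs without a cluster.

```python
from typing import Any, Dict, List

def bm25_search(client: Any, query: str, k: int = 5,
                index_name: str = "support_docs",
                field: str = "content") -> List[Dict[str, Any]]:
    """Run a `match` query and keep document IDs so answers can cite their sources."""
    resp = client.search(
        index=index_name,
        size=k,
        query={"match": {field: {"query": query}}},
    )
    return [
        {
            "id": hit["_id"],
            "score": float(hit["_score"]),
            "snippet": str(hit.get("_source", {}).get(field, ""))[:200],
        }
        for hit in resp["hits"]["hits"]
    ]

class _FakeES:
    """Stand-in for an Elasticsearch client; returns one canned hit."""
    def search(self, **kwargs: Any) -> Dict[str, Any]:
        return {"hits": {"hits": [
            {"_id": "doc-1", "_score": 2.5,
             "_source": {"content": "Refunds are processed within 5 business days."}},
        ]}}

# With a real cluster you would pass a client instead, for example:
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch("http://localhost:9200")
print(bm25_search(_FakeES(), "refund policy", k=3))
```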



### C.2: FAISS/Chroma Dense Retrieval
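
FAISS's `IndexFlatIP` performs exact inner‑product search, which equals cosine similarity when the vectors are L2‑normalized. The sketch below does the same computation in plain Python so it runs without FAISS or an embedding model. `dense_search` is a hypothetical helper, and the three‑dimensional vectors are hand‑made placeholders rather than real embeddings.

```python
import math
from typing import Dict, List, Sequence, Tuple

def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def dense_search(query_vec: Sequence[float],
                 doc_vecs: Dict[str, Sequence[float]],
                 k: int = 5) -> List[Tuple[str, float]]:
    """Brute-force cosine search; returns (doc_id, similarity) pairs, best first."""
    q = l2_normalize(query_vec)
    scored = []
    for doc_id, vec in doc_vecs.items():
        d = l2_normalize(vec)
        scored.append((doc_id, sum(a * b for a, b in zip(q, d))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Placeholder vectors keyed by document ID (assumed, for illustration only).
doc_vecs = {
    "kb_billing": [1.0, 0.0, 0.0],
    "kb_api": [0.0, 1.0, 0.0],
    "kb_mixed": [1.0, 1.0, 0.0],
}
print(dense_search([0.9, 0.1, 0.0], doc_vecs, k=2))
```

In a real system the vectors would come from an embedding model and the loop would be replaced by a vector index, but the document IDs should be carried through in the same way.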


Use these snippets as templates to replace the `bm25_retrieve`, `dense_retrieve`, and `hybrid_retrieve` functions in your production system. Always track **document IDs** so citations can be displayed alongside answers.
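
For `hybrid_retrieve`, the two ranked lists need to be merged. The notebook does not prescribe a fusion method; reciprocal rank fusion (RRF) is one common option because it only needs ranks, not comparable scores. The sketch below is an assumption‑laden example: `rrf_fuse` is a hypothetical helper, `c = 60` is the conventional smoothing constant, and the ID lists are illustrative.

```python
from collections import defaultdict
from typing import Dict, List

def rrf_fuse(rankings: List[List[str]], k: int = 5, c: int = 60) -> List[Dict[str, object]]:
    """Merge ranked ID lists: each list contributes 1 / (c + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (c + rank)
    # Sort by fused score, breaking ties by ID so the order is deterministic.
    fused = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [{"id": doc_id, "score": score} for doc_id, score in fused[:k]]

# Illustrative top-3 IDs from a sparse and a dense retriever.
bm25_ids = ["kb_7", "kb_15", "kb_13"]
dense_ids = ["kb_7", "kb_13", "kb_9"]
print(rrf_fuse([bm25_ids, dense_ids], k=3))
```

Documents that appear in both lists accumulate two contributions, so agreement between retrievers is rewarded without any score normalization.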


In [45]:
# === Stage C: Real Retrieval Indications (pseudocode style, but runnable) ===
# Goal: given a user query, surface retrieval "indications" that tell the router/agent
# whether we likely have good knowledge to answer, need escalation, or should fallback.
# This block is defensive: it works even without Elasticsearch/FAISS.

import os
import numpy as np
import pandas as pd  # needed for the pd.Series default used for corpus_titles below
from typing import List, Dict, Any

# Optional deps (safe imports)
try:
    from elasticsearch import Elasticsearch  # if you already created `es`, we'll reuse it
except Exception:
    Elasticsearch = None

try:
    import faiss  # optional; we'll degrade if missing
    _FAISS_AVAILABLE = True
except Exception:
    faiss = None
    _FAISS_AVAILABLE = False

# Lightweight text features for local retrieval when ES isn't available
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# ---- Build / reuse a tiny local "KB" from your existing dataset ----------------
# We use support_df_ext["query"] as a toy corpus (replace with real docs later).
corpus_texts: List[str] = support_df_ext["query"].astype(str).tolist()
corpus_ids:    List[str] = [f"kb_{i}" for i in range(len(corpus_texts))]
corpus_titles: List[str] = support_df_ext.get("category", pd.Series(["doc"] * len(corpus_texts))).astype(str).tolist()

# Fit TF-IDF for local sparse retrieval proxy (when ES isn't available)
_local_tfidf = TfidfVectorizer(min_df=1, ngram_range=(1, 2))
_corpus_tf   = _local_tfidf.fit_transform(corpus_texts)

# Fit a tiny SVD projection to simulate "dense" vectors (not true embeddings, but works as a proxy)
_svd_dim = min(128, max(16, min(_corpus_tf.shape) // 2))  # choose a reasonable projection dim
_svd     = TruncatedSVD(n_components=_svd_dim, random_state=42)
_corpus_dense = _svd.fit_transform(_corpus_tf).astype("float32")

# Optional: build a FAISS index if faiss is present
_faiss_index = None
if _FAISS_AVAILABLE:
    _faiss_index = faiss.IndexFlatIP(_svd_dim)  # inner-product on L2-normalized vectors ~= cosine
    # Normalize corpus vectors for IP/cosine
    _corpus_dense_norm = _corpus_dense / (np.linalg.norm(_corpus_dense, axis=1, keepdims=True) + 1e-12)
    _faiss_index.add(_corpus_dense_norm)

# Reuse an existing ES client `es` if present; otherwise try to create only if env provided
def _get_es_client():
    es = globals().get("es", None)
    if es is not None:
        return es
    # Create only if env hints exist; else return None silently
    host = os.getenv("ES_HOST")
    if host and Elasticsearch is not None:
        port = int(os.getenv("ES_PORT", "9200"))
        scheme = os.getenv("ES_SCHEME", "http")
        kwargs = {"hosts": [{"host": host, "port": port, "scheme": scheme}], "request_timeout": 15}
        api_key = os.getenv("ES_API_KEY")
        user, pwd = os.getenv("ES_USERNAME"), os.getenv("ES_PASSWORD")
        if api_key:
            kwargs["api_key"] = api_key
        elif user and pwd:
            kwargs["basic_auth"] = (user, pwd)
        try:
            return Elasticsearch(**kwargs)
        except Exception:
            return None
    return None

_es_client = _get_es_client()

# ---- Retrieval helpers --------------------------------------------------------

def _bm25_retrieve(query: str, k: int = 5, index_name: str = "support_docs", field: str = "content") -> List[Dict[str, Any]]:
    """
    Try real ES BM25 if an ES client exists; else fallback to local TF-IDF cosine as a proxy.
    Returns list of dicts: {id, title, score, snippet}
    """
    results = []
    if _es_client is not None:
        try:
            resp = _es_client.search(
                index=index_name,
                size=k,
                query={"match": {field: {"query": query}}},
                _source=[field, "title"]
            )
            hits = resp.get("hits", {}).get("hits", [])
            for h in hits:
                src = h.get("_source", {})
                results.append({
                    "id": h.get("_id"),
                    "title": src.get("title", ""),
                    "score": float(h.get("_score", 0.0)),
                    "snippet": (src.get(field, "")[:200] if isinstance(src.get(field, ""), str) else "")
                })
            return results
        except Exception:
            pass  # fall back to local

    # Local sparse proxy using TF-IDF cosine
    q_tf   = _local_tfidf.transform([query])
    sims   = cosine_similarity(q_tf, _corpus_tf)[0]  # cosine over tf-idf vectors
    idxs   = np.argsort(-sims)[:k]
    for i in idxs:
        results.append({
            "id": corpus_ids[i],
            "title": corpus_titles[i],
            "score": float(sims[i]),
            "snippet": corpus_texts[i][:200]
        })
    return results

def _dense_retrieve(query: str, k: int = 5) -> List[Dict[str, Any]]:
    """
    Dense-ish retrieval via SVD-projected TF-IDF vectors (proxy).
    If FAISS is available, use it; else cosine over numpy.
    Returns list of dicts: {id, title, score, snippet}
    """
    q_tf = _local_tfidf.transform([query])
    q_vec = _svd.transform(q_tf).astype("float32")
    # L2 normalize
    q_vec = q_vec / (np.linalg.norm(q_vec, axis=1, keepdims=True) + 1e-12)

    results = []
    if _faiss_index is not None:
        # Use the same normalized corpus we added
        D, I = _faiss_index.search(q_vec, k)
        idxs, sims = I[0], D[0]
        for i, s in zip(idxs, sims):
            if i == -1:
                continue
            results.append({
                "id": corpus_ids[i],
                "title": corpus_titles[i],
                "score": float(s),  # cosine-like similarity
                "snippet": corpus_texts[i][:200]
            })
        return results

    # Fallback: cosine similarity in numpy
    corpus_norm = _corpus_dense / (np.linalg.norm(_corpus_dense, axis=1, keepdims=True) + 1e-12)
    sims = (q_vec @ corpus_norm.T)[0]  # cosine-like
    idxs = np.argsort(-sims)[:k]
    for i in idxs:
        results.append({
            "id": corpus_ids[i],
            "title": corpus_titles[i],
            "score": float(sims[i]),
            "snippet": corpus_texts[i][:200]
        })
    return results

def compute_retrieval_indications(query: str, k: int = 5, high_sim: float = 0.75) -> Dict[str, Any]:
    """
    Produce retrieval 'indications' for routing decisions.
    - Runs BM25 (or proxy) and dense (or proxy)
    - Derives boolean/qualitative signals to inform Stage D routing.
    """
    bm25 = _bm25_retrieve(query, k=k)
    dense = _dense_retrieve(query, k=k)

    # Normalize scores into 0..1-ish for qualitative comparison (very rough, since BM25 vs cosine differ)
    # For proxies, scores are already ~[0,1]. For ES BM25, min-max normalize on the top-k.
    def _norm(scores):
        if not scores:
            return []
        mn, mx = min(scores), max(scores)
        if mx - mn < 1e-9:
            return [1.0 for _ in scores]
        return [(s - mn) / (mx - mn) for s in scores]

    bm25_scores = [r["score"] for r in bm25]
    dense_scores = [r["score"] for r in dense]
    bm25_norm = _norm(bm25_scores)
    dense_norm = _norm(dense_scores)

    # Overlap signal: IDs that appear in both top-k sets
    bm25_ids = {r["id"] for r in bm25}
    dense_ids = {r["id"] for r in dense}
    overlap_ids = list(bm25_ids.intersection(dense_ids))

    # High-sim signals: any candidate above threshold (only meaningful for cosine-like)
    dense_high = any(s >= high_sim for s in dense_scores if -1e9 < s < 1e9)  # robust

    # Compose indications (these are the "pseudocode" decision hints)
    indications = []
    if overlap_ids:
        indications.append(f"strong_overlap@topk:{len(overlap_ids)}")
    if dense_high:
        indications.append("dense_high_similarity")
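    # Caveat: min-max normalization maps the top hit to 1.0, so this flag is set whenever
    # either retriever returns a result; it does not discriminate between queries.
    if (bm25_norm and max(bm25_norm) >= 0.8) or (dense_norm and max(dense_norm) >= 0.8):
        indications.append("confident_retrieval")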

    # If nothing stands out:
    if not indications:
        indications.append("weak_retrieval_signals")

    return {
        "query": query,
        "indications": indications,            # qualitative flags for routing
        "bm25_topk": bm25,                     # evidence
        "dense_topk": dense,                   # evidence
        "overlap_ids": overlap_ids,            # evidence
        "bm25_scores_norm": bm25_norm,         # for debugging
        "dense_scores_norm": dense_norm        # for debugging
    }

# ---- Demo on a few queries (safe to run) -------------------------------------
_stage_c_queries = [
    "I see an unknown charge on my card.",
    "How can I integrate your API?",
    "Explain the cancellation policy."
]

for q in _stage_c_queries:
    out = compute_retrieval_indications(q, k=5, high_sim=0.75)
    print("\nQuery:", q)
    print("Indications:", out["indications"])
    print("Overlap IDs:", out["overlap_ids"])
    # Print just top-2 titles for signal visibility
    print("BM25 top2:", [(r["title"], round(r["score"], 3)) for r in out["bm25_topk"][:2]])
    print("Dense top2:", [(r["title"], round(r["score"], 3)) for r in out["dense_topk"][:2]])



Query: I see an unknown charge on my card.
Indications: ['strong_overlap@topk:4', 'dense_high_similarity', 'confident_retrieval']
Overlap IDs: ['kb_15', 'kb_7', 'kb_9', 'kb_13']
BM25 top2: [('billing', 0.882), ('billing', 0.094)]
Dense top2: [('billing', 0.999), ('billing', 0.118)]

Query: How can I integrate your API?
Indications: ['strong_overlap@topk:3', 'dense_high_similarity', 'confident_retrieval']
Overlap IDs: ['kb_15', 'kb_16', 'kb_5']
BM25 top2: [('procedural', 0.485), ('billing', 0.297)]
Dense top2: [('procedural', 0.791), ('troubleshooting', 0.523)]

Query: Explain the cancellation policy.
Indications: ['strong_overlap@topk:3', 'confident_retrieval']
Overlap IDs: ['kb_2', 'kb_17', 'kb_4']
BM25 top2: [('exploratory', 0.369), ('factual', 0.303)]
Dense top2: [('exploratory', 0.693), ('factual', 0.596)]



## Stage D: Ticketing Integration (Simulated)

Connecting your agent to a real ticketing system (e.g., JIRA, Zendesk) involves authenticating and sending REST API requests, for example with Python's `requests` library. Substitute your own URL and authentication details.

In this notebook, ticket creation stays simulated (see `_make_ticket` in Stage E). The `In [48]` cell implements the decision policy that chooses between answering from the knowledge base, calling the LLM, asking for clarification, and escalating to a ticket.
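
A minimal sketch of a `create_support_ticket` function that calls a REST API. Everything specific here is an assumption: the URL, the bearer‑token header, and the payload fields are placeholders, and real systems such as JIRA or Zendesk define their own endpoints, auth schemes, and schemas. The `post` parameter is a hypothetical hook so the sketch can run without network access; when omitted, it uses `requests.post`.

```python
from typing import Any, Callable, Dict, Optional

def build_ticket_payload(query: str, intent: str, priority: str = "P3") -> Dict[str, Any]:
    """Assemble a JSON-serializable ticket body (field names are placeholders)."""
    return {
        "summary": query[:140],
        "intent": intent,
        "priority": priority,
        "tags": ["auto-escalated", intent or "unknown"],
    }

def create_support_ticket(query: str, intent: str, url: str, api_token: str,
                          priority: str = "P3",
                          post: Optional[Callable[..., Any]] = None,
                          timeout: float = 10.0) -> Dict[str, Any]:
    """POST the ticket and return the parsed JSON response."""
    if post is None:
        import requests  # imported lazily so the sketch loads without the package
        post = requests.post
    resp = post(
        url,
        json=build_ticket_payload(query, intent, priority),
        headers={"Authorization": f"Bearer {api_token}"},  # assumed auth scheme
        timeout=timeout,
    )
    resp.raise_for_status()  # surface HTTP errors instead of failing silently
    return resp.json()

class _FakeResponse:
    """Stand-in for a successful HTTP response."""
    def raise_for_status(self) -> None:
        pass
    def json(self) -> Dict[str, Any]:
        return {"ticket_id": "TKT-demo", "status": "OPEN"}

def _fake_post(url: str, json: Dict[str, Any], headers: Dict[str, str], timeout: float) -> _FakeResponse:
    return _FakeResponse()

print(create_support_ticket(
    "I was charged twice this month.", "billing",
    url="https://tickets.example.com/api/tickets",  # placeholder URL
    api_token="dummy-token",                        # read from an env var in practice
    post=_fake_post,
))
```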


In [48]:
# === Stage D: Decision Policy & Routing ===
# Combines Stage B (hybrid_route) + Stage C (compute_retrieval_indications)
# to choose an action: ANSWER_FROM_KB, USE_LLM, ASK_CLARIFY, ESCALATE_TICKET.

from typing import Dict, Any, List
import time

def _retrieval_strength(indications: List[str], bm25_norm: List[float], dense_norm: List[float], overlap_cnt: int) -> float:
    """
    Compute a simple retrieval strength score in [0, 1].
    - reward overlap, high normalized scores, and strong indicators
    """
    max_bm25 = max(bm25_norm) if bm25_norm else 0.0
    max_dense = max(dense_norm) if dense_norm else 0.0
    ind_bonus = 0.0
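    # Indications carry a count suffix (e.g. "strong_overlap@topk:4"), so match on the prefix.
    if any(ind.startswith("strong_overlap@topk") for ind in indications):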
        ind_bonus += 0.25
    if "dense_high_similarity" in indications:
        ind_bonus += 0.25
    if "confident_retrieval" in indications:
        ind_bonus += 0.2

    # overlap contribution (cap at 0.3)
    overlap_bonus = min(0.3, 0.1 * overlap_cnt)

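    # base from normalized scores. Because min-max normalization puts the top hit at 1.0,
    # base is already 1.0 whenever both retrievers return results, so the score saturates.
    base = 0.5 * max_bm25 + 0.5 * max_dense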

    score = base + ind_bonus + overlap_bonus
    return max(0.0, min(1.0, score))

def _decide_action(intent: str, confidence_ml: float, retrieval_strength: float) -> str:
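    """
    Rule-based policy, evaluated in order:
      - Escalation-prone intent with retrieval strength < 0.35 and ML confidence < 0.55 -> ESCALATE_TICKET
      - ML confidence >= 0.7 or retrieval strength >= 0.5 -> ANSWER_FROM_KB
      - ML confidence < 0.6 and retrieval strength in [0.35, 0.5) -> USE_LLM
      - ML confidence < 0.5 and retrieval strength < 0.35 -> ASK_CLARIFY
      - Anything else -> USE_LLM
    Note: the logistic regression classifier above emits exploratory/factual/procedural,
    so the escalation keywords only match if your intent labels include them.
    """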
    # Escalation heuristics (customize as needed)
    intent_lower = (intent or "").lower()
    if any(key in intent_lower for key in ["billing", "payment_failed", "escalate", "outage"]):
        if retrieval_strength < 0.35 and confidence_ml < 0.55:
            return "ESCALATE_TICKET"

    # Primary routing
    if (confidence_ml >= 0.7) or (retrieval_strength >= 0.7):
        return "ANSWER_FROM_KB"
    if (confidence_ml < 0.7) and (retrieval_strength >= 0.5):
        return "ANSWER_FROM_KB"  # we trust KB if signals are decent
    if (confidence_ml < 0.6) and (0.35 <= retrieval_strength < 0.5):
        return "USE_LLM"
    if (confidence_ml < 0.5) and (retrieval_strength < 0.35):
        return "ASK_CLARIFY"

    # Default safe option
    return "USE_LLM"

def route_stage_d(query: str, threshold: float = 0.6, allow_llm_fallback: bool = True) -> Dict[str, Any]:
    """
    End-to-end decision:
      1) Get Stage B result (intent + confidence_ml)
      2) Get Stage C retrieval indications
      3) Compute retrieval_strength
      4) Decide action
    Returns: dict with decision details and evidence (top-k hits)
    """
    # Stage B
    stage_b = hybrid_route(query, threshold=threshold, allow_llm_fallback=allow_llm_fallback)
    intent = stage_b.get("intent")
    confidence_ml = float(stage_b.get("confidence_ml", 0.0))
    stage_b_route = stage_b.get("route")

    # Stage C
    stage_c = compute_retrieval_indications(query, k=5, high_sim=0.75)
    indications = stage_c["indications"]
    bm25_norm = stage_c["bm25_scores_norm"]
    dense_norm = stage_c["dense_scores_norm"]
    overlap_cnt = len(stage_c["overlap_ids"])

    # Score + decision
    r_strength = _retrieval_strength(indications, bm25_norm, dense_norm, overlap_cnt)
    action = _decide_action(intent, confidence_ml, r_strength)

    # Reasons
    reasons = []
    reasons.append(f"ml_confidence={confidence_ml:.2f}")
    reasons.append(f"retrieval_strength={r_strength:.2f}")
    if indications:
        reasons.append("indications=" + ",".join(indications))
    if stage_b_route:
        reasons.append(f"stage_b_route={stage_b_route}")

    return {
        "timestamp": int(time.time()),
        "query": query,
        "intent": intent,
        "action": action,                        # ANSWER_FROM_KB | USE_LLM | ASK_CLARIFY | ESCALATE_TICKET
        "confidence_ml": confidence_ml,
        "retrieval_strength": r_strength,
        "reasons": reasons,
        "evidence": {
            "bm25_topk": stage_c["bm25_topk"],
            "dense_topk": stage_c["dense_topk"],
            "overlap_ids": stage_c["overlap_ids"],
        },
    }

# --- Quick smoke test (optional) ---
for q in ["I see an unknown charge on my card.", "How can I integrate your API?", "Explain the cancellation policy."]:
    print(route_stage_d(q))


{'timestamp': 1758346244, 'query': 'I see an unknown charge on my card.', 'intent': 'exploratory', 'action': 'ANSWER_FROM_KB', 'confidence_ml': 0.3893473193013897, 'retrieval_strength': 1.0, 'reasons': ['ml_confidence=0.39', 'retrieval_strength=1.00', 'indications=strong_overlap@topk:4,dense_high_similarity,confident_retrieval', 'stage_b_route=ml_no_llm'], 'evidence': {'bm25_topk': [{'id': 'kb_7', 'title': 'billing', 'score': 0.8818599112376844, 'snippet': 'I see an unknown charge on my credit card.'}, {'id': 'kb_15', 'title': 'billing', 'score': 0.0939132965119851, 'snippet': 'Where can I see my past invoices?'}, {'id': 'kb_13', 'title': 'troubleshooting', 'score': 0.08470904298414908, 'snippet': 'Help! My app keeps crashing on startup.'}, {'id': 'kb_9', 'title': 'factual', 'score': 0.06307361715984454, 'snippet': 'Provide an overview of your customer loyalty program.'}, {'id': 'kb_0', 'title': 'procedural', 'score': 0.02084291019381998, 'snippet': 'How do I reset my password?'}], 'de


## Stage E: Observability & Monitoring

For production systems, it's crucial to capture **traces**, **metrics**, and **logs** for every interaction. Here are some techniques:

- **Tracing:** Use libraries like `opentelemetry` to instrument each step (classification, retrieval, LLM call). Export traces to a backend (e.g., Jaeger or Tempo) for visual inspection.
- **Metrics:** Record latency, error rates, and token usage to a monitoring system (Prometheus, Datadog) and define SLOs (service level objectives).
- **Logging:** Use structured logging (JSON) with context fields (query, intent, ticket ID, cost) and send to a log aggregator.

Below is a simple example using Python's built‑in `logging` module to log metrics to a CSV file. In a real deployment, integrate with a monitoring stack.


In [50]:
# === Stage E: Execute the decision (answering / LLM / clarify / escalate) ===
# Works even without an LLM key; degrades gracefully to KB-based answers or clarify prompts.

import os
import time
from typing import Dict, Any, List

# Reuse helper if already defined; else define a minimal one here
def _has_llm_key() -> bool:
    return any(os.getenv(k) for k in ("OPENAI_API_KEY", "LITELLM_API_KEY", "LLM_API_KEY"))

# Optionally use OpenAI if available; otherwise we’ll fallback gracefully
_openai_available = False
try:
    import openai  # openai>=1.x
    from openai import OpenAI
    _openai_available = True
except Exception:
    _openai_available = False

def _compose_context_from_evidence(evidence: Dict[str, Any], max_chars: int = 1000) -> str:
    # Build a lightweight context string from top retrievals.
    # Snippets are not de-duplicated, so a document returned by both retrievers appears twice.
    parts = []
    for r in (evidence.get("bm25_topk") or [])[:3]:
        title = str(r.get("title", "")).strip()
        snip = str(r.get("snippet", "")).strip()
        piece = f"[{title}] {snip}"
        parts.append(piece)
    for r in (evidence.get("dense_topk") or [])[:3]:
        title = str(r.get("title", "")).strip()
        snip = str(r.get("snippet", "")).strip()
        piece = f"[{title}] {snip}"
        parts.append(piece)
    context = " ".join(parts)
    if len(context) > max_chars:
        context = context[:max_chars] + "..."
    return context if context else "No relevant knowledge base snippets available."

def _answer_from_kb(query: str, evidence: Dict[str, Any]) -> Dict[str, Any]:
    context = _compose_context_from_evidence(evidence)
    # Simple heuristic answer builder (no LLM needed)
    answer = (
        "Based on our knowledge base, here are the most relevant details:\n"
        f"{context}\n"
        "If this doesn’t fully resolve your question, I can provide more details or escalate."
    )
    return {"mode": "ANSWER_FROM_KB", "answer": answer}

def _answer_with_llm(query: str, evidence: Dict[str, Any]) -> Dict[str, Any]:
    context = _compose_context_from_evidence(evidence)
    if not _has_llm_key() or not _openai_available:
        # Graceful fallback if no key or SDK
        fallback = (
            "LLM fallback is unavailable. Using retrieved context instead:\n"
            f"{context}\n"
            "If you need a more specific answer, please clarify your question."
        )
        return {"mode": "USE_LLM_FALLBACK_TO_KB", "answer": fallback}

    try:
        client = OpenAI()  # uses OPENAI_API_KEY from env
        prompt = (
            "You are a helpful support assistant. Use the context to answer concisely and accurately.\n\n"
            f"User query: {query}\n\n"
            f"Context:\n{context}\n\n"
            "Answer:"
        )
        resp = client.chat.completions.create(
            model=os.getenv("OPENAI_MODEL", "gpt-4o-mini"),
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2,
        )
        text = resp.choices[0].message.content.strip()
        return {"mode": "USE_LLM", "answer": text}
    except Exception as e:
        # Any error -> degrade to KB
        fallback = (
            f"LLM error: {e}. Falling back to retrieved context:\n"
            f"{context}"
        )
        return {"mode": "USE_LLM_ERROR_FALLBACK_TO_KB", "answer": fallback}

def _ask_for_clarification(query: str) -> Dict[str, Any]:
    questions = [
        "Could you share a bit more detail about your goal or the exact error message?",
        "Are you using a paid or trial plan, and which product/endpoint is involved?"
    ]
    msg = "I need a bit more information to help you better:\n- " + "\n- ".join(questions)
    return {"mode": "ASK_CLARIFY", "answer": msg, "clarifying_questions": questions}

def _make_ticket(query: str, intent: str, priority: str = "P3") -> Dict[str, Any]:
    ticket = {
        "ticket_id": f"TKT-{int(time.time())}",
        "created_at": int(time.time()),
        "intent": intent,
        "priority": priority,
        "status": "OPEN",
        "summary": query[:140],
        "tags": ["auto-escalated", intent or "unknown"]
    }
    return ticket

def _escalate_ticket(query: str, intent: str, retrieval_strength: float, confidence_ml: float) -> Dict[str, Any]:
    # Simple priority rule
    if confidence_ml < 0.4 and retrieval_strength < 0.3:
        prio = "P1"
    elif confidence_ml < 0.5 or retrieval_strength < 0.4:
        prio = "P2"
    else:
        prio = "P3"
    ticket = _make_ticket(query, intent, priority=prio)
    msg = (
        f"I’ve created a support ticket ({ticket['ticket_id']}) for deeper investigation. "
        "Our team will follow up shortly."
    )
    return {"mode": "ESCALATE_TICKET", "answer": msg, "ticket": ticket}

def execute_stage_e(query: str, threshold: float = 0.6, allow_llm_fallback: bool = True) -> Dict[str, Any]:
    # 1) Run Stage D to get decision and evidence
    decision = route_stage_d(query, threshold=threshold, allow_llm_fallback=allow_llm_fallback)
    action = decision["action"]
    intent = decision.get("intent")
    r_strength = float(decision.get("retrieval_strength", 0.0))
    conf_ml = float(decision.get("confidence_ml", 0.0))
    evidence = decision.get("evidence", {})

    # 2) Execute the action
    if action == "ANSWER_FROM_KB":
        payload = _answer_from_kb(query, evidence)
    elif action == "USE_LLM":
        payload = _answer_with_llm(query, evidence)
    elif action == "ASK_CLARIFY":
        payload = _ask_for_clarification(query)
    elif action == "ESCALATE_TICKET":
        payload = _escalate_ticket(query, intent=intent, retrieval_strength=r_strength, confidence_ml=conf_ml)
    else:
        # Safe default
        payload = _answer_with_llm(query, evidence)

    # 3) Return combined output (decision + action result)
    return {
        "query": query,
        "intent": intent,
        "decision_action": action,
        "confidence_ml": conf_ml,
        "retrieval_strength": r_strength,
        "reasons": decision.get("reasons", []),  # Stage D reasons (confidence, strength, indications, route)
        "result": payload,
    }

# --- Example run (auto-exec) ---
_stage_e_queries = [
    "I see an unknown charge on my card.",
    "How can I integrate your API?",
    "Explain the cancellation policy."
]
for q in _stage_e_queries:
    out = execute_stage_e(q, threshold=0.6, allow_llm_fallback=True)
    print("\n=== Stage E Output ===")
    print("Query:", out["query"])
    print("Intent:", out["intent"])
    print("Action decided:", out["decision_action"])
    print("Confidence ML:", round(out["confidence_ml"], 3))
    print("Retrieval Strength:", round(out["retrieval_strength"], 3))
    print("Mode:", out["result"].get("mode"))
    if "ticket" in out["result"]:
        print("Ticket:", out["result"]["ticket"])
    print("Answer:\n", out["result"]["answer"])



=== Stage E Output ===
Query: I see an unknown charge on my card.
Intent: exploratory
Action decided: ANSWER_FROM_KB
Confidence ML: 0.389
Retrieval Strength: 1.0
Mode: ANSWER_FROM_KB
Answer:
 Based on our knowledge base, here are the most relevant details:
[billing] I see an unknown charge on my credit card. [billing] Where can I see my past invoices? [troubleshooting] Help! My app keeps crashing on startup. [billing] I see an unknown charge on my credit card. [billing] Where can I see my past invoices? [troubleshooting] Help! My app keeps crashing on startup.
If this doesn’t fully resolve your question, I can provide more details or escalate.

=== Stage E Output ===
Query: How can I integrate your API?
Intent: procedural
Action decided: ANSWER_FROM_KB
Confidence ML: 0.493
Retrieval Strength: 1.0
Mode: ANSWER_FROM_KB
Answer:
 Based on our knowledge base, here are the most relevant details:
[procedural] Steps to integrate your API into my website. [billing] How can I update my billing 

Writing metrics to a rotating log file is enough for local experiments. In production, you would forward these logs to a centralized platform and build dashboards and alerts.

## ✅ Summary

In Part B you extended the multi‑step support agent by:

- **Expanding the dataset** with additional labeled queries and retraining the logistic regression classifier.
- Implementing **hybrid routing** that uses the logistic regression classifier and falls back to an LLM for uncertain cases.
- Providing **pseudo‑code** for integrating real retrieval systems (ElasticSearch for BM25 and FAISS/Chroma for dense vectors).
- Showing how to **connect to a ticketing system** via REST APIs, including authentication and payloads.
- Adding **observability instrumentation** for logging metrics like latency and cost.

These enhancements lay the groundwork for a production‑grade support agent. Continue iterating by swapping in actual retrieval engines, connecting to live ticketing systems, and integrating with your observability stack. By monitoring performance and user feedback, you can refine the system and deliver reliable customer support at scale.
