# Context-Aware Customer Support Agent: Demo
**Kaggle Capstone: Stylish Demo Notebook**

This notebook demonstrates:
- Live multi-turn demos using the local agent
- Memory inspection (short & long term)
- Trace-based observability (traces.jl)
- Hybrid evaluation (heuristic + LLM-as-judge)
- Visualizations and summary metrics for judges

> Run cells top → bottom. If imports fail, make sure the notebook is running from the project root (the folder that contains `agent/`) in VS Code / Jupyter.


In [1]:
# Setup & imports
import os, sys, json, time
from IPython.display import Markdown, display, HTML
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Project imports (must run notebook from project root)
from agent.orchestrator import handle_user_message
from agent.memory import retrieve_memories, save_user, init_db
from agent.logger import log_event
from agent.evaluator import hybrid_score

# Notebook styling helpers
# "seaborn-whitegrid" was renamed to "seaborn-v0_8-whitegrid" in Matplotlib 3.6
# and the old name was later removed, so use whichever this install provides.
for _style in ("seaborn-v0_8-whitegrid", "seaborn-whitegrid"):
    if _style in plt.style.available:
        plt.style.use(_style)
        break
%matplotlib inline

display(Markdown("**Environment check:**"))
print("Python:", sys.version.splitlines()[0])
print("Working dir:", os.getcwd())
display(Markdown("---"))


ModuleNotFoundError: No module named 'agent'
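
The `ModuleNotFoundError` above usually means the notebook's working directory is not the project root. As a hedged workaround, the sketch below walks up from the current directory until it finds a folder containing `agent/` and puts that folder on `sys.path`. The helper names `find_project_root` and `ensure_on_sys_path` are hypothetical and not part of the project; the only assumption taken from this notebook is that the `agent` package sits directly under the project root.

```python
import sys
from pathlib import Path


def find_project_root(start, marker="agent"):
    """Walk upward from `start` until a directory containing `marker/` is found."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return None


def ensure_on_sys_path(root):
    """Put `root` at the front of sys.path if it is not already listed."""
    root_str = str(root)
    if root_str not in sys.path:
        sys.path.insert(0, root_str)
    return root_str


root = find_project_root(Path.cwd())
if root is None:
    print("Could not find an 'agent' directory at or above", Path.cwd())
else:
    ensure_on_sys_path(root)
    print("Project root:", root)
```

Running this in a cell before the imports should let `from agent.orchestrator import ...` resolve without restarting the kernel, provided the notebook lives somewhere inside the project tree.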

In [None]:
# Sanity checks
init_db()  # safe to call repeatedly

if not os.path.exists("traces.jl"):
    open("traces.jl","a").close()

if not os.path.exists("evaluation_report.json"):
    # If you haven't run evaluate_batch, create a lightweight placeholder so notebook cells work
    placeholder = []
    with open("evaluation_report.json","w") as f:
        json.dump(placeholder, f)
    print("Created placeholder evaluation_report.json; run evaluate_batch for real results.")

display(Markdown("✅ Sanity checks complete."))


In [None]:
# Trace loader + pretty function
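def load_traces(limit=None):
    """Read traces.jl (one JSON object per line), skipping blank or malformed lines."""
    traces = []
    if os.path.exists("traces.jl"):
        with open("traces.jl", "r") as f:
            for line in f:
                if not line.strip():
                    continue
                try:
                    traces.append(json.loads(line))
                except json.JSONDecodeError:
                    # Ignore partially written or corrupted lines
                    continue
                if limit and len(traces) >= limit:
                    break
    return traces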

def show_trace(trace_id):
    traces = load_traces()
    seq = [t for t in traces if t.get("trace_id") == trace_id]
    if not seq:
        display(Markdown(f"**No trace found for** `{trace_id}`"))
        return
    display(Markdown(f"### Trace `{trace_id}`: {len(seq)} events"))
    for ev in seq:
        ts = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(ev.get("timestamp", 0)))
        kind = ev.get("event","?")
        content = {k:v for k,v in ev.items() if k not in ["timestamp","trace_id","event"]}
        display(Markdown(f"- **{ts}** `{kind}`: `{json.dumps(content)}`"))
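
The loader above treats `traces.jl` as a JSON Lines file: one JSON object per line, each carrying at least `trace_id`, `timestamp`, and `event`. To make that format concrete, the sketch below writes a few made-up events to a temporary file and groups them by `trace_id`. The sample payload fields (`text`, `tool`) and the `user_message` event name are illustrative assumptions; what `agent.logger` really records may differ.

```python
import json
import os
import tempfile
from collections import defaultdict

# Hypothetical events; the real agent.logger may record different fields.
sample_events = [
    {"trace_id": "t1", "timestamp": 1700000000, "event": "user_message", "text": "Where is my order?"},
    {"trace_id": "t1", "timestamp": 1700000001, "event": "tool_result", "tool": "order_lookup"},
    {"trace_id": "t2", "timestamp": 1700000005, "event": "user_message", "text": "I want a refund"},
]


def write_jsonl(path, events):
    """Append one JSON object per line."""
    with open(path, "a") as f:
        for ev in events:
            f.write(json.dumps(ev) + "\n")


def group_by_trace(path):
    """Return {trace_id: [events in file order]}, skipping blank or malformed lines."""
    grouped = defaultdict(list)
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            try:
                ev = json.loads(line)
            except json.JSONDecodeError:
                continue
            grouped[ev.get("trace_id")].append(ev)
    return dict(grouped)


tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "traces_demo.jl")
write_jsonl(demo_path, sample_events)
with open(demo_path, "a") as f:
    f.write("not json\n")  # a malformed line is skipped, not fatal

grouped = group_by_trace(demo_path)
for trace_id, events in grouped.items():
    print(trace_id, [ev["event"] for ev in events])
```

Appending line by line keeps earlier events readable even if a write is interrupted, which is why the reader tolerates a malformed line instead of failing.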


In [None]:
# Single-turn demonstration
display(Markdown("## 🧪 Single-turn demo"))

session_id = "demo_session_1"
user_id = "u001"
# Ensure user exists (will create or overwrite)
save_user(user_id, "Demo User", "demo@example.com", {"notes":"notebook demo"})

inp = "My order A123 is late - where is it?"
display(Markdown(f"**User:** {inp}"))

out = handle_user_message(session_id=session_id, user_id=user_id, user_msg=inp)
display(Markdown(f"**Agent reply:** {out['reply']}"))
display(Markdown(f"**Trace id:** `{out['trace_id']}`"))

# show the trace for this run (most recent trace)
show_trace(out["trace_id"])


In [None]:
# Multi-turn conversation sample
display(Markdown("## 🗣️ Multi-turn conversation demo"))

session_id = "demo_session_2"
user_id = "u002"
save_user(user_id, "Multi Demo", "multidemo@example.com", {})

turns = [
    "What is the status of my order A123?",
    "It says delivered but I never received it.",
    "Please help - I want a refund if it can't be found."
]

for i, msg in enumerate(turns, 1):
    display(Markdown(f"**Turn {i} - User:** {msg}"))
    res = handle_user_message(session_id=session_id, user_id=user_id, user_msg=msg)
    display(Markdown(f"- **Agent:** {res['reply']}  \n- **Trace:** `{res['trace_id']}`"))
    show_trace(res['trace_id'])
    display(Markdown("---"))


In [None]:
# Inspect memories for a user
display(Markdown("## 🧠 Memory inspection"))

target_user = "u002"
m = retrieve_memories(target_user, limit=10)
display(Markdown(f"Memories for **{target_user}** (showing up to 10)"))
for i, mem in enumerate(m,1):
    display(Markdown(f"- [{i}] **{mem['mem_type']}**: {mem['content']}"))
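
The cell above relies on `retrieve_memories` returning records with `mem_type` and `content` keys. How `agent.memory` stores them is not shown in this notebook, so the following is only a minimal sketch of a SQLite-backed store with the same record shape. The table layout and the function names `init_memory_db`, `add_memory_sketch`, and `retrieve_memories_sketch` are assumptions made for illustration.

```python
import sqlite3
import time


def init_memory_db(conn):
    """Create the table if needed; safe to call repeatedly."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS memories (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               user_id TEXT NOT NULL,
               mem_type TEXT NOT NULL,
               content TEXT NOT NULL,
               created_at REAL NOT NULL
           )"""
    )
    conn.commit()


def add_memory_sketch(conn, user_id, mem_type, content):
    """Store one memory row for a user."""
    conn.execute(
        "INSERT INTO memories (user_id, mem_type, content, created_at) VALUES (?, ?, ?, ?)",
        (user_id, mem_type, content, time.time()),
    )
    conn.commit()


def retrieve_memories_sketch(conn, user_id, limit=10):
    """Return the user's newest memories first, as dicts with 'mem_type' and 'content'."""
    rows = conn.execute(
        "SELECT mem_type, content FROM memories WHERE user_id = ? ORDER BY id DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()
    return [{"mem_type": m, "content": c} for m, c in rows]


conn = sqlite3.connect(":memory:")  # in-memory database, nothing touches disk
init_memory_db(conn)
init_memory_db(conn)  # second call is a no-op
add_memory_sketch(conn, "u002", "note", "Asked about order A123")
add_memory_sketch(conn, "u002", "note", "Requested a refund")
add_memory_sketch(conn, "u999", "note", "Unrelated user")

for mem in retrieve_memories_sketch(conn, "u002"):
    print(mem["mem_type"], "-", mem["content"])
```

Filtering by `user_id` in the query is what keeps one user's history from leaking into another user's context.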


In [None]:
# Load evaluation_report.json (created by evaluate_batch)
display(Markdown("## 📈 Evaluation report summary"))

if os.path.exists("evaluation_report.json"):
    with open("evaluation_report.json","r") as f:
        eval_data = json.load(f)
else:
    eval_data = []

# If eval_data is empty, create a small synthetic sample so plots render
if not eval_data:
    eval_data = [
        {"case": "Where is my order A123?", "reply": "I created ticket T1", "scores": {"heuristic":0.6,"llm_resolution":0.5,"llm_helpfulness":0.7,"final_score":0.6}},
        {"case": "I want a refund", "reply": "Refund started", "scores": {"heuristic":0.8,"llm_resolution":0.7,"llm_helpfulness":0.7,"final_score":0.73}}
    ]

# Normalize to DataFrame
rows = []
for r in eval_data:
    sc = r.get("scores",{})
    rows.append({
        "case": r.get("case",""),
        "reply": r.get("reply",""),
        "heuristic": sc.get("heuristic", np.nan),
        "llm_resolution": sc.get("llm_resolution", np.nan),
        "llm_helpfulness": sc.get("llm_helpfulness", np.nan),
        "final_score": sc.get("final_score", np.nan)
    })
df_eval = pd.DataFrame(rows)
display(df_eval.head())
display(Markdown(f"**Avg final score:** {df_eval['final_score'].mean():.3f}"))
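
In the two synthetic rows above, `final_score` matches the unweighted mean of the three component scores (0.6, 0.5, 0.7 gives 0.6; 0.8, 0.7, 0.7 gives about 0.73). Whether `hybrid_score` in `agent.evaluator` combines components this way is not shown in this notebook, so the sketch below is only one plausible reading: a weighted mean with equal weights by default. The function name `combine_scores` and the `weights` parameter are hypothetical.

```python
def combine_scores(scores, weights=None):
    """Weighted mean of the component scores that are present; equal weights by default."""
    components = ("heuristic", "llm_resolution", "llm_helpfulness")
    if weights is None:
        weights = {name: 1.0 for name in components}
    present = [name for name in components if scores.get(name) is not None]
    if not present:
        return float("nan")
    total_weight = sum(weights[name] for name in present)
    return sum(scores[name] * weights[name] for name in present) / total_weight


row1 = {"heuristic": 0.6, "llm_resolution": 0.5, "llm_helpfulness": 0.7}
row2 = {"heuristic": 0.8, "llm_resolution": 0.7, "llm_helpfulness": 0.7}
print(round(combine_scores(row1), 2))
print(round(combine_scores(row2), 2))
```

Passing something like `weights={"heuristic": 2.0, "llm_resolution": 1.0, "llm_helpfulness": 1.0}` would lean the result toward the deterministic heuristic, which can be useful when the LLM judge is noisy.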


In [None]:
# Final score distribution
display(Markdown("## 📊 Final score distribution"))

plt.figure(figsize=(8,4))
plt.hist(df_eval["final_score"], bins=10, edgecolor="k", alpha=0.8)
plt.title("Final Score Distribution")
plt.xlabel("Final score")
plt.ylabel("Count")
plt.axvline(df_eval["final_score"].mean(), color="red", linestyle="--", label=f"Mean: {df_eval['final_score'].mean():.2f}")
plt.legend()
plt.show()


In [None]:
# Component comparison scatter
display(Markdown("## ⚖️ Heuristic vs LLM (resolution)"))

plt.figure(figsize=(7,5))
plt.scatter(df_eval["heuristic"], df_eval["llm_resolution"], s=60, alpha=0.8)
plt.xlabel("Heuristic score")
plt.ylabel("LLM resolution score")
plt.title("Heuristic vs LLM resolution")
plt.grid(True)
plt.show()


In [None]:
# Count tool usage in traces
display(Markdown("## 🔧 Tool usage frequency (from traces)"))

traces = load_traces()
tool_calls = []
for ev in traces:
    if ev.get("event") in ("tool_result", "tool"):
        # Prefer the tool name; fall back to the event type when no name was logged
        tname = ev.get("tool") or ev.get("event")
        tool_calls.append(tname)
tool_counts = pd.Series(tool_calls, dtype="object").value_counts()
if tool_counts.empty:
    display(Markdown("_No tool calls recorded yet._"))
else:
    display(tool_counts.to_frame("count"))
    tool_counts.plot(kind="barh", color="tab:orange")
    plt.title("Tool usage frequency")
    plt.xlabel("Count")
    plt.show()


In [None]:
# Memory impact (simple demonstration)
display(Markdown("## 🧾 Memory impact (demo)"))

# A small A/B comparison: send the same prompt for a user with a stored memory and for one without
test_user = "u050"
prompt = "Where is my order A123?"

# Save a memory for user (simulate prior conversation)
add_mem = {"mem_type":"note","content":"User prefers fast delivery, previously complained about late shipments"}
# Ensure db has this
from agent.memory import add_memory as _add_memory
_add_memory(test_user, add_mem["mem_type"], add_mem["content"])

# With memory
res_with = handle_user_message(session_id="ab_with", user_id=test_user, user_msg=prompt)

# Without memory: use a fresh user ID that has no stored memories
res_without = handle_user_message(session_id="ab_without", user_id="ab_new_user", user_msg=prompt)

display(Markdown(f"- **With memory reply:** {res_with['reply']}"))
display(Markdown(f"- **Without memory reply:** {res_without['reply']}"))




## Architecture summary (short)

- **Orchestrator**: think/act/observe loop; calls tools and records traces.  
- **Tools**: small deterministic functions (order lookup, ticket creation, product lookup).  
- **Memory**: SQLite-backed long-term memory + session events.  
- **Evaluator**: hybrid (heuristic + LLM-as-judge) and batch runner.  
- **Observability**: `traces.jl` with event-level logs for every decision.
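
To illustrate how these pieces could fit together, here is a toy think/act/observe loop that picks a tool, calls it, and records a trace event at each step. It is a sketch under stated assumptions, not the project's orchestrator: the keyword planner, the tool bodies, the `tool_call` and `user_message` event names, and every function name below are hypothetical. Only the returned `reply` / `trace_id` keys and the `trace_id`, `timestamp`, `event` fields mirror what this notebook uses.

```python
import time
import uuid

# Hypothetical deterministic tools standing in for order lookup and ticket creation.
ORDERS = {"A123": "in transit"}


def order_lookup(order_id):
    return {"order_id": order_id, "status": ORDERS.get(order_id, "unknown")}


def create_ticket(summary):
    return {"ticket_id": "T1", "summary": summary}


TOOLS = {"order_lookup": order_lookup, "create_ticket": create_ticket}


def think(user_msg):
    """Toy keyword planner: returns (tool_name, argument)."""
    for word in user_msg.replace("?", " ").split():
        if word in ORDERS:
            return "order_lookup", word
    return "create_ticket", user_msg


def handle_message_sketch(user_msg, trace_log):
    """Run one think/act/observe cycle and append events to trace_log."""
    trace_id = uuid.uuid4().hex

    def record(event, **fields):
        trace_log.append({"trace_id": trace_id, "timestamp": time.time(), "event": event, **fields})

    record("user_message", text=user_msg)
    tool_name, arg = think(user_msg)          # think: choose a tool
    record("tool_call", tool=tool_name, arg=arg)
    result = TOOLS[tool_name](arg)            # act: run the tool
    record("tool_result", tool=tool_name, result=result)  # observe: log what came back
    if tool_name == "order_lookup":
        reply = f"Order {result['order_id']} is {result['status']}."
    else:
        reply = f"I created ticket {result['ticket_id']} for you."
    record("reply", text=reply)
    return {"reply": reply, "trace_id": trace_id}


trace_log = []
out = handle_message_sketch("Where is my order A123?", trace_log)
print(out["reply"])
print([ev["event"] for ev in trace_log])
```

Because every event carries the same `trace_id`, a single user turn can be replayed end to end from the log, which is the property the trace viewer earlier in this notebook depends on.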

**What judges want to see**
- Clear problem statement and demo  
- Traces showing chain-of-thought (tool calls + results)  
- Memory usage examples (personalization)  
- Evaluation metrics and visualization  
- A short demo video + README

---


### Notebook complete

- Save this notebook.
- Run all cells to produce visual outputs.
- Export as HTML (File → Export) and include the HTML + notebook in your Kaggle submission.
