# HabitLedger Demo – Behavioural Money Coach

Welcome to the **HabitLedger** interactive demonstration notebook!

## What is HabitLedger?

HabitLedger is an AI-powered behavioural money coach designed to help users build healthier financial habits through:

- **Behavioural Science Principles**: Uses proven concepts like loss aversion, habit loops, friction reduction, and commitment devices
- **Personalized Interventions**: Suggests specific, actionable strategies tailored to your situation
- **Memory & Tracking**: Remembers your goals, streaks, and struggles across interactions
- **Adaptive Coaching**: Detects patterns and adjusts recommendations over time

## What This Notebook Demonstrates

This notebook showcases:

1. How the agent analyzes user input using keyword-based principle detection
2. How it generates personalized coaching responses with interventions
3. How memory tracks progress across multiple interactions
4. Sample scenarios covering common financial habit challenges

**Note**: This demo uses deterministic keyword matching (no LLM calls yet) to demonstrate the core agent architecture and behaviour analysis logic.
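
To make the idea of deterministic keyword matching concrete, here is a minimal sketch. It is not the project's actual implementation: the database shape (`id` and `keywords` fields), the principle ids, and the `detect_principle` helper are all assumptions made for illustration, and the real `behaviour_principles.json` schema may differ.

```python
from typing import Dict, List, Optional

# Hypothetical principle database; the real JSON file's schema may differ.
EXAMPLE_DB: List[Dict] = [
    {"id": "habit_loops", "keywords": ["routine", "every morning", "whenever"]},
    {"id": "loss_aversion", "keywords": ["afraid", "regret", "anxious"]},
    {"id": "friction_increase", "keywords": ["one-click", "too easy", "impulse"]},
]


def detect_principle(text: str, db: List[Dict]) -> Optional[str]:
    """Return the id of the principle with the most keyword hits, or None."""
    lowered = text.lower()
    best_id, best_hits = None, 0
    for principle in db:
        # Count how many of this principle's keywords occur as substrings.
        hits = sum(1 for kw in principle["keywords"] if kw in lowered)
        if hits > best_hits:  # strict '>' means the first principle wins ties
            best_id, best_hits = principle["id"], hits
    return best_id


print(detect_principle(
    "I keep buying coffee every morning. It's become such a routine.",
    EXAMPLE_DB,
))
```

Because the matching is plain substring search, the same input always yields the same principle, which is what makes the demo reproducible. The trade-off is that phrasing outside the keyword lists goes undetected.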

## How This Notebook Is Organized

1. **Setup**: Import modules, load the behaviour database, and initialize user memory
2. **Single Interaction Demo**: See how the agent responds to one user input
3. **Multiple Interaction Scenarios**: Run five predefined scenarios covering different behavioural principles
4. **Visuals & Structured Summaries**: Placeholder for future visualizations (e.g., streak charts, principle frequency)
5. **Evaluation Notes**: How to assess the agent's performance and relevance
6. **Evaluation: Run Scenario Set**: Run a larger scenario set and summarize the results in a table

## 1. Setup

Import modules, load the behaviour database, and initialize user memory.

In [None]:
import sys
from pathlib import Path

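# Add the parent of the current working directory (the project root) to the
# import path so that the `src` package can be imported
sys.path.insert(0, str(Path.cwd().parent))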

from src.memory import UserMemory
from src.behaviour_engine import analyse_behaviour, load_behaviour_db
from src.coach import run_once

# Load behaviour principles database
behaviour_db = load_behaviour_db("../data/behaviour_principles.json")
print(f"✅ Loaded {len(behaviour_db)} behavioural principles")

# Initialize user memory
user_memory = UserMemory(user_id="demo_user")
user_memory.goals = ["Save $500/month", "Stop impulse buying"]
print(f"✅ Initialized memory for user: {user_memory.user_id}")
print(f"   Goals: {user_memory.goals}")

## 2. Single Interaction Demo

Let's see how the agent responds to a single user input. The agent will:
1. Analyze the input for behavioural patterns
2. Detect the most relevant principle
3. Suggest personalized interventions
4. Update the user's memory

In [None]:
# Example user input
user_input = "I keep buying coffee every morning even though I want to save money. It's become such a routine."

# Run the agent
response = run_once(
    user_input=user_input, memory=user_memory, behaviour_db=behaviour_db
)

print("🤖 Agent Response:")
print(response)

## 3. Multiple Interaction Scenarios

Now let's test the agent with several predefined scenarios that trigger different behavioural principles. This demonstrates how the agent adapts to different situations and tracks progress over time.

In [None]:
# Define multiple test scenarios
scenarios = [
    {
        "name": "Loss Aversion",
        "input": "I'm afraid to check my bank account because I might see I've overspent again. It makes me anxious.",
    },
    {
        "name": "Temptation Bundling",
        "input": "I hate budgeting, it feels like such a chore. I keep putting it off.",
    },
    {
        "name": "Friction Reduction (Negative)",
        "input": "Online shopping is too easy. I just click and buy without thinking. No friction at all.",
    },
    {
        "name": "Micro Habits",
        "input": "Saving $500 a month feels overwhelming. I don't know where to start.",
    },
    {
        "name": "Default Effect",
        "input": "I'm still subscribed to services I never use. I just haven't bothered to cancel them.",
    },
]

# Run through each scenario
print("=" * 60)
for i, scenario in enumerate(scenarios, 1):
    print(f"\n📌 SCENARIO {i}: {scenario['name']}")
    print(f"User: {scenario['input']}")
    print("\n" + "-" * 60)

    response = run_once(
        user_input=scenario["input"], memory=user_memory, behaviour_db=behaviour_db
    )

    print(response)
    print("=" * 60)

## 4. Visuals and Structured Summaries (Placeholder)

This section is a placeholder for future visualizations and analytics:

- **Streak Charts**: Visualize consistency over time
- **Principle Frequency**: Show which behavioural patterns are most common
- **Progress Dashboard**: Track goal completion and intervention effectiveness
- **Behaviour Heatmaps**: Identify patterns by day/time
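
As a rough illustration of the principle frequency idea, the counts can be tallied and printed as a text bar chart without any plotting library. The `interaction_log` structure and its `principle_id` field are hypothetical; the real `UserMemory` may store detected principles under different names.

```python
from collections import Counter

# Hypothetical interaction log; the real UserMemory fields may be named differently.
interaction_log = [
    {"principle_id": "habit_loops"},
    {"principle_id": "loss_aversion"},
    {"principle_id": "habit_loops"},
    {"principle_id": "micro_habits"},
]

# Tally how often each principle was detected.
frequency = Counter(entry["principle_id"] for entry in interaction_log)

# Text bar chart, most frequent principle first.
for principle_id, count in frequency.most_common():
    print(f"{principle_id:<15} {'#' * count} ({count})")
```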

For now, we can inspect the memory state directly:

In [None]:
# Inspect current memory state
import json

memory_state = user_memory.to_dict()
print("📊 Current User Memory State:")
print(json.dumps(memory_state, indent=2, default=str))

## 5. Evaluation Notes

### How to Evaluate HabitLedger

When assessing the agent's performance, consider:

#### 1. **Principle Detection Accuracy**
- Does the agent correctly identify the underlying behavioural principle?
- Are the detected triggers relevant to the user's input?

#### 2. **Intervention Relevance**
- Are the suggested interventions actionable and specific?
- Do they align with the detected principle and user's goals?

#### 3. **Memory & Consistency**
- Does the agent remember goals, streaks, and struggles across interactions?
- Does it reference past context appropriately?

#### 4. **Response Quality**
- Is the tone supportive and non-judgmental?
- Are explanations clear and grounded in behavioural science?

#### 5. **Multi-Step Behaviour**
- Does the agent show adaptive behaviour over multiple interactions?
- Does it recognize patterns and adjust recommendations?
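
The first criterion, principle detection accuracy, can be measured by comparing detections against hand-labelled expectations. The sketch below assumes principle ids such as `loss_aversion` and `habit_loops` and uses a stub detector (`stub_detect`) in place of the real agent; both are placeholders for illustration only.

```python
from typing import Callable, List, Optional, Tuple

LabelledPrompt = Tuple[str, str]  # (prompt, expected principle id)


def detection_accuracy(
    labelled: List[LabelledPrompt],
    detect: Callable[[str], Optional[str]],
) -> float:
    """Fraction of prompts whose detected principle equals the expected one."""
    if not labelled:
        return 0.0
    correct = sum(1 for prompt, expected in labelled if detect(prompt) == expected)
    return correct / len(labelled)


def stub_detect(prompt: str) -> Optional[str]:
    """Hypothetical stand-in detector that only knows about loss aversion."""
    lowered = prompt.lower()
    if "afraid" in lowered or "regret" in lowered:
        return "loss_aversion"
    return None


# Expected ids are assumptions; use the ids from your own principle database.
labelled_prompts = [
    ("I regret buying that expensive gadget last week", "loss_aversion"),
    ("I'm afraid to check my savings account", "loss_aversion"),
    ("Whenever I feel bored, I start browsing shopping apps", "habit_loops"),
]

print(f"Accuracy: {detection_accuracy(labelled_prompts, stub_detect):.2f}")
```

To score the real agent, the stub could be swapped for a small wrapper that calls `analyse_behaviour` and reads `detected_principle_id` from the returned dictionary, as the evaluation cell in Section 6 does.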

### Current Limitations

- **Keyword-based detection**: Not yet using LLM for nuanced pattern recognition
- **No real-time data**: Memory persists within session but doesn't integrate with actual financial data
- **Limited context**: Doesn't yet maintain conversation history beyond memory fields

### Future Improvements

- Integrate LLM-based principle detection for better accuracy
- Add visualizations for streak tracking and progress monitoring
- Implement session summaries and weekly reviews
- Connect to financial APIs for real-time habit tracking

## 6. Evaluation: Run Scenario Set

The cell below runs a set of 20 test scenarios and produces a structured summary table showing:
- Detected principle for each prompt
- Source of detection (ADK or keyword)
- Intervention provided
- Response length

This helps evaluate:
- **Accuracy**: Are the right principles detected?
- **Coverage**: Does the agent handle diverse scenarios?
- **Consistency**: Are responses appropriately detailed?
- **Performance**: Which detection method is used, and how long does each scenario take?

In [None]:
import time

import pandas as pd

# Fresh memory for evaluation
eval_memory = UserMemory(user_id="eval_user")
eval_memory.goals = [{"description": "Build better financial habits"}]

# Comprehensive test scenarios covering all principles
test_scenarios = [
    # Loss Aversion
    "I'm afraid to check my savings account because I might see I've spent too much",
    "I broke my 30-day no-spending streak yesterday and feel terrible about it",
    "I regret buying that expensive gadget last week",
    # Habit Loops
    "Every evening after work I automatically order food delivery when I'm stressed",
    "Whenever I feel bored, I start browsing shopping apps",
    "I always grab coffee at the same cafe during my lunch break",
    # Commitment Devices
    "I need help sticking to my budget, my willpower isn't enough",
    "I keep forgetting to transfer money to my savings account",
    "It's hard to resist temptation when sales pop up",
    # Temptation Bundling
    "Reviewing my expenses is so boring and tedious",
    "I dread looking at my budget spreadsheet",
    # Friction Reduction (for good habits)
    "Tracking my expenses takes too many steps and is confusing",
    "Setting up automatic savings is too complicated",
    # Friction Increase (for bad habits)
    "One-click ordering makes it too easy to impulse buy",
    "I keep ordering food delivery because the app is right on my phone",
    "Online shopping is instant, I don't even think about it",
    # Default Effect
    "I forget to save money each month, I need to automate it",
    "I'm still paying for subscriptions I never use",
    # Micro Habits
    "Saving $1000 a month feels overwhelming and impossible",
    "My financial goals are too big, I don't know where to start",
]

print(f"🧪 Running {len(test_scenarios)} evaluation scenarios...\n")

# Run evaluation
results = []
start_eval_time = time.time()

for idx, prompt in enumerate(test_scenarios, 1):
    # Analyze the prompt
    analysis = analyse_behaviour(prompt, eval_memory, behaviour_db)

    # Extract data
    principle_id = analysis.get("detected_principle_id", "None")
    source = analysis.get("source", "unknown")
    interventions = analysis.get("intervention_suggestions", [])
    intervention_text = interventions[0] if interventions else "No intervention"

    # Get full response
    response = run_once(prompt, eval_memory, behaviour_db)
    response_length = len(response)

    # Truncate for display
    prompt_short = prompt[:60] + "..." if len(prompt) > 60 else prompt
    intervention_short = (
        intervention_text[:50] + "..."
        if len(intervention_text) > 50
        else intervention_text
    )

    results.append(
        {
            "ID": idx,
            "Prompt": prompt_short,
            "Principle": principle_id if principle_id else "None",
            "Source": source,
            "Intervention": intervention_short,
            "Response Len": response_length,
        }
    )

    # Show progress
    if idx % 5 == 0:
        print(f"  ✓ Processed {idx}/{len(test_scenarios)} scenarios")

eval_duration = time.time() - start_eval_time

print(f"\n✅ Evaluation complete in {eval_duration:.2f}s")
print("📊 Summary Table:\n")

# Create DataFrame and display
df_results = pd.DataFrame(results)
print(df_results.to_string(index=False))

# Summary statistics
print("\n📈 Evaluation Metrics:")
print(f"  • Total scenarios: {len(test_scenarios)}")
# Missing principles are stored as the string "None", so compare against that
# string; notna() would count every row as detected.
print(f"  • Principles detected: {(df_results['Principle'] != 'None').sum()}")
print(f"  • ADK detections: {(df_results['Source'] == 'adk').sum()}")
print(f"  • Keyword detections: {(df_results['Source'] == 'keyword').sum()}")
print(f"  • Avg response length: {df_results['Response Len'].mean():.0f} chars")
print(f"  • Total evaluation time: {eval_duration:.2f}s")
print(f"  • Avg time per scenario: {eval_duration/len(test_scenarios):.3f}s")

### Evaluation Metrics Explained

The table above shows results from running the agent on diverse test scenarios. Here's how to interpret the metrics:

#### Qualitative Metrics (Manual Assessment)

1. **Clarity**: Are the agent's responses clear and easy to understand?
   - Check if principle names are explained
   - Verify interventions are actionable and specific

2. **Relevance**: Do detected principles match the user's actual issue?
   - Compare the prompt to the detected principle
   - Assess if interventions address the root cause

3. **Tone**: Is the agent supportive and non-judgmental?
   - Look for encouraging language
   - Check that responses avoid blame or criticism

#### Automated Metrics

1. **Source Distribution**: Shows whether ADK (LLM) or keyword matching was used
   - Higher ADK usage suggests more nuanced detection
   - Keyword fallback ensures robustness

2. **Response Length**: Indicates level of detail in responses
   - Typical range: 300-800 characters
   - Too short may lack detail; too long may be overwhelming

3. **Processing Time**: Measures performance
   - ADK calls are slower (~1-3s) but more accurate
   - Keyword matching is faster (~0.1s) but less nuanced

4. **Coverage**: Percentage of scenarios where a principle was detected
   - Target: >90% principle detection rate
   - "None" detections indicate unclear input or gaps in coverage
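
The automated metrics above can be computed from the result rows in a few lines. The sketch below uses hand-written sample rows with the same column names as the summary table; the values and the `summarise` helper are illustrative assumptions, not output from the agent.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical rows shaped like the evaluation results; values are made up.
rows: List[Dict] = [
    {"Principle": "loss_aversion", "Source": "keyword", "Response Len": 420},
    {"Principle": "habit_loops", "Source": "keyword", "Response Len": 910},
    {"Principle": "None", "Source": "unknown", "Response Len": 150},
    {"Principle": "micro_habits", "Source": "adk", "Response Len": 640},
]


def summarise(rows: List[Dict], min_len: int = 300, max_len: int = 800) -> Dict:
    """Compute coverage, source distribution, and out-of-range response lengths."""
    total = len(rows)
    # A principle counts as detected unless it is the placeholder string "None".
    detected = sum(1 for r in rows if r["Principle"] != "None")
    return {
        "coverage": detected / total if total else 0.0,
        "sources": dict(Counter(r["Source"] for r in rows)),
        # 1-based positions of rows whose length falls outside [min_len, max_len].
        "out_of_range": [
            i for i, r in enumerate(rows, 1)
            if not min_len <= r["Response Len"] <= max_len
        ],
    }


summary = summarise(rows)
print(f"Coverage: {summary['coverage']:.0%}")
print(f"Sources: {summary['sources']}")
print(f"Rows outside 300-800 chars: {summary['out_of_range']}")
```

With the real results, `df_results.to_dict("records")` would give rows in this shape, and a coverage below the 90% target would point to prompts worth inspecting by hand.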