<div align="center">
  <img src="../assets/images/hackathon.png" alt="Holistic AI Hackathon Logo" width="600"/>
</div>

**Event**: [hackathon.holisticai.com](https://hackathon.holisticai.com)

---


# Tutorial 4: Model Monitoring, Cost & Carbon Tracking

**Master comprehensive model monitoring from metrics to cost optimization**

Learn to track performance metrics (latency, tokens, cost, carbon) and compare API vs local model costs.

## What You'll Learn

1. Track latency and response times
2. Count tokens accurately with tiktoken
3. Estimate API costs
4. Monitor carbon emissions (local models)
5. Compare API vs local model costs
6. Analyze multi-agent cost differences
7. Understand real-world tool overhead (search tools)

## Why Monitor Performance?

- **Cost control** - Understand and optimize API expenses
- **Performance optimization** - Find and fix bottlenecks
- **Resource planning** - Plan infrastructure and costs
- **Environmental impact** - Measure carbon emissions for local models

---

## Prerequisites

- Basic Python knowledge
- Recommended: Completed tutorials 01-03
- Time: ~25 minutes
- **Holistic AI Bedrock API** (recommended) - Credentials will be provided during the hackathon event
- **OpenAI API key** (optional alternative) - Get at https://platform.openai.com/api-keys
- **Valyu API key** (optional for Step 5.6) - Get at https://platform.valyu.network/ (free credits)

**API Guide**: [../assets/api-guide.pdf](../assets/api-guide.pdf)

**Note:** This tutorial is completely self-contained and uses only official packages!


## Step 0: Install Dependencies

Run this cell to install all required packages for performance monitoring.

**Note:** `langchain-valyu` is optional and only needed for Step 5.6 (Search Tool Comparison). It will be installed automatically when you run that cell if you have a Valyu API key.


In [None]:
import os
from pathlib import Path
from dotenv import load_dotenv

# Load from .env file in parent directory
env_path = Path('../.env')
if env_path.exists():
    load_dotenv(env_path)
    print("üìÑ Loaded configuration from .env file")
else:
    print("‚ö†Ô∏è  No .env file found - using environment variables")

# Verify API keys
print("\nüîë API Key Status:")
if os.getenv('HOLISTIC_AI_TEAM_ID') and os.getenv('HOLISTIC_AI_API_TOKEN'):
    print("  ‚úÖ Holistic AI Bedrock credentials loaded")
elif os.getenv('OPENAI_API_KEY'):
    print("  ‚ö†Ô∏è  OpenAI API key loaded")
else:
    print("  ‚ö†Ô∏è  No API keys found")

print("\nüìÅ Working directory:", Path.cwd())

# Import Holistic AI Bedrock helper
import sys
try:
    sys.path.insert(0, '../core')
    from react_agent.holistic_ai_bedrock import get_chat_model
    from react_agent.utils import count_tokens, estimate_cost
    print("\n‚úÖ Holistic AI Bedrock helper and utils loaded")
except ImportError:
    print("\n‚ö†Ô∏è  Could not import from core - will use OpenAI only")

# Import official packages
import time
from langgraph.prebuilt import create_react_agent
from langchain_core.messages import HumanMessage

print("\n‚úÖ All imports successful!")


## Step 1: Setup Environment

Set up API keys in `.env` file. See [tutorials/README.md](../README.md#setup) for details.

```bash
HOLISTIC_AI_TEAM_ID=your-team-id-here
HOLISTIC_AI_API_TOKEN=your-api-token-here
```


In [20]:
import os
import time
import uuid
from pathlib import Path
from dotenv import load_dotenv

# ============================================
# OPTION 1: Set API keys directly (Quick Start)
# ============================================
# Uncomment and set your keys here:
# Recommended: Holistic AI Bedrock
# os.environ["HOLISTIC_AI_TEAM_ID"] = "tutorials_api"
# os.environ["HOLISTIC_AI_API_TOKEN"] = "your-token-here"
# Alternative: OpenAI (optional)
# os.environ["OPENAI_API_KEY"] = "your-openai-key-here"
# Optional: Valyu
# os.environ["VALYU_API_KEY"] = "your-valyu-key-here"

# ============================================
# OPTION 2: Load from .env file (Recommended)
# ============================================
env_path = Path('../.env')
if env_path.exists():
    load_dotenv(env_path)
    print("üìÑ Loaded configuration from .env file")
else:
    print("‚ö†Ô∏è  No .env file found - using environment variables or hardcoded keys")

# ============================================
# Import Holistic AI Bedrock helper function
# ============================================
# Import from core module
try:
    import sys
    sys.path.insert(0, '../core')
    from react_agent.holistic_ai_bedrock import HolisticAIBedrockChat, get_chat_model
    print("‚úÖ Holistic AI Bedrock helper function loaded")
except ImportError:
    print("‚ö†Ô∏è  Could not import from core - will use OpenAI only")

# Import official packages
from langgraph.prebuilt import create_react_agent
from langchain_core.messages import HumanMessage
from langchain_core.tools import tool

# Import monitoring tools
import tiktoken
from codecarbon import EmissionsTracker

# Verify API keys
print("\nüîë API Key Status:")
if os.getenv('HOLISTIC_AI_TEAM_ID') and os.getenv('HOLISTIC_AI_API_TOKEN'):
    print("  ‚úÖ Holistic AI Bedrock credentials loaded (will use Bedrock)")
elif os.getenv('OPENAI_API_KEY'):
    print("  ‚ö†Ô∏è  OpenAI API key loaded (Bedrock credentials not set)")
    print("     üí° Tip: Set HOLISTIC_AI_TEAM_ID and HOLISTIC_AI_API_TOKEN to use Bedrock (recommended)")
else:
    print("  ‚ö†Ô∏è  No API keys found")
    print("     Set Holistic AI Bedrock credentials (recommended) or OpenAI key")

# Check Valyu API key (optional - for Step 5.6)
if os.getenv('VALYU_API_KEY'):
    valyu_key = os.getenv('VALYU_API_KEY')[:10] + "..."
    print(f"  ‚úÖ Valyu API key loaded: {valyu_key}")
    print("     Step 5.6 (Search Tool Comparison) will be fully functional!")
else:
    print("  ‚ö†Ô∏è  Valyu API key not found - Step 5.6 will be skipped")
    print("     Get a free key at: https://platform.valyu.network/")
    print("     This is optional - you can continue without it!")

print("\n‚úÖ All imports successful!")

IndentationError: unindent does not match any outer indentation level (<tokenize>, line 38)

---

---

# Performance Monitoring

Track latency, tokens, costs, and carbon emissions.


## Step 2: Create Agent for Monitoring

Create a simple agent that we'll use to test performance monitoring:

In [None]:
# Create a basic agent for performance testing
# Use get_chat_model() - uses Holistic AI Bedrock by default (recommended)
llm = get_chat_model("claude-3-5-sonnet")  # Uses Holistic AI Bedrock (recommended)
agent = create_react_agent(llm, tools=[])  # No tools for faster responses

print("Agent created!")
print("  Model: claude-3-5-sonnet (via Bedrock if available)")
print("  Tools: None (for speed)")
print("  Use case: Performance monitoring")

Agent created!
  Model: claude-3-5-sonnet (via Bedrock if available)
  Tools: None (for speed)
  Use case: Performance monitoring


## Step 3: Setup Token Counting with tiktoken

`tiktoken` is OpenAI's official tokenizer library. It provides accurate token counts for cost estimation.

**Why accurate token counting matters:**
- API pricing is based on tokens, not characters
- Different models use different tokenizers
- Accurate counts = accurate cost estimates

In [None]:
# Initialize tiktoken encoder
# Claude (via Bedrock) uses the same encoding as GPT-4
encoding = tiktoken.encoding_for_model('gpt-5-mini')

# Helper function to count tokens
def count_tokens(text: str) -> int:
    """Count tokens in text using tiktoken."""
    return len(encoding.encode(text))

# Test it
test_text = "Hello, how are you?"
token_count = count_tokens(test_text)

print("tiktoken initialized!")
print(f"  Test: '{test_text}'")
print(f"  Tokens: {token_count}")
print(f"\nTIP: Accurate token counting helps estimate costs accurately!")

tiktoken initialized!
  Test: 'Hello, how are you?'
  Tokens: 6

TIP: Accurate token counting helps estimate costs accurately!


## Step 4: Track Performance (Latency, Tokens, Cost)

Let's create a comprehensive monitoring function that tracks:
- Latency (response time)
- Token usage (input + output)
- Cost estimation (based on actual tokens)
- Throughput (tokens per second)

In [None]:
def track_agent_with_tokens(agent, question: str) -> dict:
    """Run agent and track comprehensive metrics."""
    
    # Count input tokens
    input_tokens = count_tokens(question)
    
    # Run agent and measure time
    start_time = time.time()
    result = agent.invoke({"messages": [HumanMessage(content=question)]})
    elapsed = time.time() - start_time
    
    # Get response and count output tokens
    response = result['messages'][-1].content
    output_tokens = count_tokens(str(response))
    total_tokens = input_tokens + output_tokens
    
    # Calculate cost (Claude (via Bedrock) Nano pricing)
    # Input: $0.15 per 1M tokens, Output: $0.60 per 1M tokens
    input_cost = (input_tokens / 1_000_000) * 0.15
    output_cost = (output_tokens / 1_000_000) * 0.60
    total_cost = input_cost + output_cost
    
    return {
        'time': elapsed,
        'input_tokens': input_tokens,
        'output_tokens': output_tokens,
        'total_tokens': total_tokens,
        'cost': total_cost,
        'tokens_per_second': total_tokens / elapsed if elapsed > 0 else 0,
        'answer': response
    }

print("Monitoring function ready!")
print("  Tracks: latency, tokens (accurate), cost, throughput")

Monitoring function ready!
  Tracks: latency, tokens (accurate), cost, throughput


## Step 5: Test API Performance Monitoring

Let's run a query and see the detailed metrics in action:

In [None]:
# Test query
query = "Explain quantum computing in one sentence."

print(f"Query: {query}\n")

# Track metrics
metrics = track_agent_with_tokens(agent, query)

# Display results
print("="*70)
print("PERFORMANCE METRICS")
print("="*70)
print(f"Latency:          {metrics['time']:.3f}s")
print(f"Input Tokens:     {metrics['input_tokens']}")
print(f"Output Tokens:    {metrics['output_tokens']}")
print(f"Total Tokens:     {metrics['total_tokens']}")
print(f"Tokens/Second:    {metrics['tokens_per_second']:.2f}")
print(f"Estimated Cost:   ${metrics['cost']:.6f}")
print()
print(f"Response: {metrics['answer'][:150]}...")
print("="*70)

Query: Explain quantum computing in one sentence.

PERFORMANCE METRICS
Latency:          2.567s
Input Tokens:     7
Output Tokens:    39
Total Tokens:     46
Tokens/Second:    17.92
Estimated Cost:   $0.000024

Response: Quantum computing harnesses the principles of quantum mechanics (like superposition and entanglement) to perform certain calculations exponentially fa...


## Step 5.5: Multi-Agent Cost Comparison

Let's compare the performance and cost of different agent configurations to understand the impact of tools on your API expenses.

We'll create:
1. **Simple Agent** - No tools (baseline)
2. **Single Tool Agent** - One simple computational tool
3. **Multi-Tool Agent** - Multiple tools with complex logic

This will show you how tool usage affects:
- Latency (response time)
- Token consumption
- Cost per query
- Tokens per second (throughput)

In [None]:
# Create custom tools (inspired by tutorial 02)
@tool
def calculate_fibonacci(n: int) -> int:
    """Calculate the nth Fibonacci number.
    
    Args:
        n: The position in the Fibonacci sequence (must be positive)
        
    Returns:
        The nth Fibonacci number
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if n <= 2:
        return 1
    
    a, b = 1, 1
    for _ in range(n - 2):
        a, b = b, a + b
    return b

@tool
def add_numbers(a: float, b: float) -> float:
    """Add two numbers together.
    
    Args:
        a: First number
        b: Second number
        
    Returns:
        Sum of a and b
    """
    return a + b

@tool
def multiply_numbers(a: float, b: float) -> float:
    """Multiply two numbers together.
    
    Args:
        a: First number
        b: Second number
        
    Returns:
        Product of a and b
    """
    return a * b

print("Custom tools created!")
print("  - calculate_fibonacci: Compute Fibonacci numbers")
print("  - add_numbers: Addition")
print("  - multiply_numbers: Multiplication")

Custom tools created!
  - calculate_fibonacci: Compute Fibonacci numbers
  - add_numbers: Addition
  - multiply_numbers: Multiplication


In [None]:
# Create three different agent configurations
# Use get_chat_model() - uses Holistic AI Bedrock by default (recommended)
llm = get_chat_model("claude-3-5-sonnet")  # Uses Holistic AI Bedrock (recommended)

# Agent 1: No tools (baseline)
agent_no_tools = create_react_agent(llm, tools=[])

# Agent 2: Single simple tool
agent_single_tool = create_react_agent(llm, tools=[calculate_fibonacci])

# Agent 3: Multiple tools (complex interactions possible)
agent_multi_tools = create_react_agent(
    llm, 
    tools=[calculate_fibonacci, add_numbers, multiply_numbers]
)

print("Three agent configurations created:")
print("  1. No tools (baseline)")
print("  2. Single tool (fibonacci)")
print("  3. Multiple tools (fibonacci + math operations)")

Three agent configurations created:
  1. No tools (baseline)
  2. Single tool (fibonacci)
  3. Multiple tools (fibonacci + math operations)


In [None]:
# Now let's compare them with the same queries
test_queries = [
    {
        'name': 'Simple Question',
        'query': 'Explain quantum computing in one sentence.',
        'agents': ['no_tools', 'single_tool', 'multi_tools']
    },
    {
        'name': 'Tool-Required Question',
        'query': 'Use the calculate_fibonacci tool to find the 10th Fibonacci number.',
        'agents': ['single_tool', 'multi_tools']  # Skip no_tools (doesn't have the tool)
    },
    {
        'name': 'Complex Multi-Step',
        'query': 'Calculate the 8th Fibonacci number, then multiply it by 2. Use the tools available.',
        'agents': ['multi_tools']  # Only multi-tools can do this
    }
]

print("Comparing agent configurations across different query types...")
print("="*70)

# Store results for comparison
all_results = []

for test_case in test_queries:
    print(f"\n\nTest Case: {test_case['name']}")
    print(f"Query: {test_case['query'][:60]}...")
    print("-"*70)
    
    for agent_type in test_case['agents']:
        # Select the appropriate agent
        if agent_type == 'no_tools':
            agent = agent_no_tools
            label = "No Tools"
        elif agent_type == 'single_tool':
            agent = agent_single_tool
            label = "Single Tool"
        else:
            agent = agent_multi_tools
            label = "Multi Tools"
        
        # Track metrics
        try:
            metrics = track_agent_with_tokens(agent, test_case['query'])
            
            # Store results
            all_results.append({
                'test_case': test_case['name'],
                'agent_type': label,
                'time': metrics['time'],
                'input_tokens': metrics['input_tokens'],
                'output_tokens': metrics['output_tokens'],
                'total_tokens': metrics['total_tokens'],
                'cost': metrics['cost'],
                'tokens_per_second': metrics['tokens_per_second']
            })
            
            # Display results
            print(f"\n  {label}:")
            print(f"    Latency: {metrics['time']:.3f}s")
            print(f"    Tokens: {metrics['total_tokens']} (in: {metrics['input_tokens']}, out: {metrics['output_tokens']})")
            print(f"    Cost: ${metrics['cost']:.6f}")
            print(f"    Throughput: {metrics['tokens_per_second']:.2f} tok/s")
            print(f"    Response: {metrics['answer'][:80]}...")
            
        except Exception as e:
            print(f"\n  {label}: ERROR - {str(e)[:60]}")

print("\n" + "="*70)

Comparing agent configurations across different query types...


Test Case: Simple Question
Query: Explain quantum computing in one sentence....
----------------------------------------------------------------------

  No Tools:
    Latency: 1.718s
    Tokens: 46 (in: 7, out: 39)
    Cost: $0.000024
    Throughput: 26.77 tok/s
    Response: Quantum computing harnesses the principles of quantum mechanics (like superposit...

  Single Tool:
    Latency: 2.891s
    Tokens: 67 (in: 7, out: 60)
    Cost: $0.000037
    Throughput: 23.18 tok/s
    Response: I apologize, but I don't see any tools available that are related to explaining ...

  Multi Tools:
    Latency: 2.755s
    Tokens: 58 (in: 7, out: 51)
    Cost: $0.000032
    Throughput: 21.05 tok/s
    Response: I apologize, but I don't have any tools available that are specifically designed...


Test Case: Tool-Required Question
Query: Use the calculate_fibonacci tool to find the 10th Fibonacci ...
--------------------------------------

In [None]:
# Production cost estimation
from collections import defaultdict

if all_results:
    print("="*70)
    print("PRODUCTION COST ESTIMATION (claude-3-5-sonnet)")
    print("="*70)
    
    # Calculate average cost per query for each agent type
    by_agent = defaultdict(list)
    for r in all_results:
        by_agent[r['agent_type']].append(r['cost'])
    
    scenarios = [
        ("Small chatbot", 1_000),
        ("Medium app", 10_000),
        ("Large platform", 100_000),
        ("Enterprise", 1_000_000)
    ]
    
    for agent_type, costs in sorted(by_agent.items()):
        avg_cost_per_query = sum(costs) / len(costs)
        
        print(f"\n{agent_type} Agent (${avg_cost_per_query:.8f}/query):")
        print("-"*70)
        
        for scenario_name, queries_per_month in scenarios:
            monthly_cost = avg_cost_per_query * queries_per_month
            print(f"  {scenario_name:<20} ({queries_per_month:>10,} queries/month): ${monthly_cost:>8.2f}")
    
    print("\n" + "="*70)
    print("COST OPTIMIZATION RECOMMENDATIONS")
    print("="*70)
    
    # Get costs for comparison
    no_tool_avg = sum(by_agent.get('No Tools', [0])) / max(len(by_agent.get('No Tools', [1])), 1)
    multi_tool_avg = sum(by_agent.get('Multi Tools', [0])) / max(len(by_agent.get('Multi Tools', [1])), 1)
    
    print("\nIntelligent Routing Strategy:")
    print(f"  - Simple queries (80%): Use No Tools agent")
    print(f"  - Complex queries (20%): Use Multi Tools agent")
    print(f"\nFor 100,000 queries/month:")
    print(f"  - All Multi Tools: ${multi_tool_avg * 100_000:.2f}/month")
    print(f"  - All No Tools: ${no_tool_avg * 100_000:.2f}/month")
    
    mixed_cost = (no_tool_avg * 80_000) + (multi_tool_avg * 20_000)
    savings = (multi_tool_avg * 100_000) - mixed_cost
    savings_pct = (savings / (multi_tool_avg * 100_000) * 100) if multi_tool_avg > 0 else 0
    
    print(f"  - Smart routing: ${mixed_cost:.2f}/month")
    print(f"  - Savings: ${savings:.2f}/month ({savings_pct:.1f}%)")
    
    print("\nKey Takeaway:")
    print("  Route queries intelligently based on complexity to optimize costs!")

else:
    print("Run the comparison cells above first to see cost estimations!")

PRODUCTION COST ESTIMATION (gpt-5-nano)

Multi Tools Agent ($0.00002950/query):
----------------------------------------------------------------------
  Small chatbot        (     1,000 queries/month): $    0.03
  Medium app           (    10,000 queries/month): $    0.29
  Large platform       (   100,000 queries/month): $    2.95
  Enterprise           ( 1,000,000 queries/month): $   29.50

No Tools Agent ($0.00002445/query):
----------------------------------------------------------------------
  Small chatbot        (     1,000 queries/month): $    0.02
  Medium app           (    10,000 queries/month): $    0.24
  Large platform       (   100,000 queries/month): $    2.44
  Enterprise           ( 1,000,000 queries/month): $   24.45

Single Tool Agent ($0.00003765/query):
----------------------------------------------------------------------
  Small chatbot        (     1,000 queries/month): $    0.04
  Medium app           (    10,000 queries/month): $    0.38
  Large platform    

### Production Cost Estimation

Based on the comparison above, let's estimate monthly costs for different workloads:

In [21]:
# Analyze and display comparison results
# Note: defaultdict already imported in cell 18

if all_results:
    print("\n" + "="*70)
    print("COST COMPARISON SUMMARY")
    print("="*70)
    
    # Group by test case
    by_test = defaultdict(list)
    for r in all_results:
        by_test[r['test_case']].append(r)
    
    # Display comparison tables
    for test_name, results in by_test.items():
        print(f"\n{test_name}:")
        print("-"*70)
        print(f"{'Agent Type':<20} {'Time':>10} {'Tokens':>10} {'Cost':>12} {'Tok/s':>10}")
        print("-"*70)
        
        for r in results:
            print(f"{r['agent_type']:<20} {r['time']:>9.3f}s {r['total_tokens']:>10} ${r['cost']:>10.6f} {r['tokens_per_second']:>9.2f}")
        
        # Show cost differences
        if len(results) > 1:
            baseline = results[0]
            for r in results[1:]:
                cost_diff = ((r['cost'] - baseline['cost']) / baseline['cost'] * 100) if baseline['cost'] > 0 else 0
                token_diff = ((r['total_tokens'] - baseline['total_tokens']) / baseline['total_tokens'] * 100) if baseline['total_tokens'] > 0 else 0
                time_diff = ((r['time'] - baseline['time']) / baseline['time'] * 100) if baseline['time'] > 0 else 0
                
                print(f"\n  {r['agent_type']} vs {baseline['agent_type']}:")
                print(f"    Cost: {cost_diff:+.1f}% | Tokens: {token_diff:+.1f}% | Time: {time_diff:+.1f}%")
    
    # Calculate averages by agent type
    print("\n" + "="*70)
    print("AVERAGE METRICS BY AGENT TYPE")
    print("="*70)
    
    by_agent = defaultdict(list)
    for r in all_results:
        by_agent[r['agent_type']].append(r)
    
    print(f"{'Agent Type':<20} {'Avg Time':>12} {'Avg Tokens':>12} {'Avg Cost':>12}")
    print("-"*70)
    
    for agent_type, results in sorted(by_agent.items()):
        avg_time = sum(r['time'] for r in results) / len(results)
        avg_tokens = sum(r['total_tokens'] for r in results) / len(results)
        avg_cost = sum(r['cost'] for r in results) / len(results)
        
        print(f"{agent_type:<20} {avg_time:>11.3f}s {avg_tokens:>12.1f} ${avg_cost:>10.6f}")
    
    print("\n" + "="*70)
    print("KEY INSIGHTS")
    print("="*70)
    print("\n1. Tool Overhead:")
    print("   - Agents with tools use more tokens due to tool descriptions")
    print("   - Tool calling requires additional reasoning tokens")
    print("   - Multi-tool agents have higher baseline token costs")
    
    print("\n2. Cost Scaling:")
    print("   - Simple questions: Tool overhead is noticeable (~20-40% more)")
    print("   - Tool-required tasks: Cost justified by accuracy")
    print("   - Complex multi-step: Tools enable tasks impossible without them")
    
    print("\n3. Performance Trade-offs:")
    print("   - No tools: Fastest, cheapest for simple queries")
    print("   - Single tool: Moderate overhead, reliable for specific tasks")
    print("   - Multi tools: Higher cost, but enables complex reasoning")
    
    print("\n4. Optimization Strategies:")
    print("   - Use separate agents for different use cases")
    print("   - Route simple queries to no-tool agents")
    print("   - Reserve multi-tool agents for complex tasks")
    print("   - Monitor usage patterns to optimize tool selection")
    
else:
    print("\nNo results to analyze - run the comparison cells above first!")


COST COMPARISON SUMMARY

Simple Question:
----------------------------------------------------------------------
Agent Type                 Time     Tokens         Cost      Tok/s
----------------------------------------------------------------------
No Tools                 1.718s         46 $  0.000024     26.77
Single Tool              2.891s         67 $  0.000037     23.18
Multi Tools              2.755s         58 $  0.000032     21.05

  Single Tool vs No Tools:
    Cost: +51.5% | Tokens: +45.7% | Time: +68.2%

  Multi Tools vs No Tools:
    Cost: +29.4% | Tokens: +26.1% | Time: +60.4%

Tool-Required Question:
----------------------------------------------------------------------
Agent Type                 Time     Tokens         Cost      Tok/s
----------------------------------------------------------------------
Single Tool              4.117s         75 $  0.000038     18.22
Multi Tools              4.650s         25 $  0.000008      5.38

  Multi Tools vs Single Tool:
    

### Understanding the Cost Comparison Results

You might notice something counterintuitive in the results above: **Multi Tools appears to have the similar or lower, even lowest average cost**. This seems wrong - shouldn't more tools mean higher overhead?

**What's Actually Happening: Small Sample Size + LLM Variability**

The cost differences you see are primarily due to **statistical noise from small sample sizes**, not real efficiency differences. Here's why:

**1. LLM Output is Highly Variable**

The same query can produce drastically different response lengths:
- "55" (1 token)
- "The 10th Fibonacci number is 55." (10 tokens)
- "55\n\nFor reference, using the common indexing F1 = 1, F2 = 1..." (65 tokens)

Running the same test 10 times shows token counts ranging from **16 to 107 tokens** for identical queries!

**2. Small Sample Size Creates Unreliable Averages**

In our comparison:
- Each agent type: Only **1-3 test queries**
- One lucky "short response" can skew the entire average
- The "Tool-Required Question" happened to get different response styles by chance

**3. Actual Long-Term Average is Similar**

With proper statistical sampling (10+ runs per agent):
- Single Tool: ~40 tokens average
- Multi Tools: ~42 tokens average
- The difference is minimal and Multi Tools is actually slightly higher (as expected)

**The Real Lesson: Statistical Methodology Matters**

For production cost analysis:
- **Run many samples** - At least 50-100 queries per configuration
- **Use diverse queries** - Mix simple, complex, and tool-required questions
- **Set temperature=0** - For more consistent responses during benchmarking
- **Monitor over time** - Real production data beats synthetic tests
- **Quality matters too** - Lower cost means nothing if responses are poor

**What This Comparison DOES Show:**

1. Tool overhead exists but is relatively small for simple tools
2. Response quality varies significantly with configuration
3. You need proper statistical methods to measure real cost differences
4. Always validate findings with production data before making decisions

**For this tutorial:** The comparison demonstrates the monitoring methodology, not definitive cost rankings. In real production, use LangSmith (see Tutorial 05) to track actual usage patterns over thousands of queries.

## Step 5.6: Real-World Example - Search Tool Token Consumption

The simple math tools above show minimal overhead. But what about **real-world tools like web search**? Let's compare token consumption with Valyu search (from Tutorial 01).

This demonstrates the **true cost impact** of agents with tools that return large amounts of data.

In [22]:
# Install langchain-valyu if needed (optional - this cell can be skipped if Valyu key not available)
try:
    from langchain_valyu import ValyuSearchTool
    VALYU_AVAILABLE = True
except ImportError:
    print("Installing langchain-valyu...")
    import subprocess
    subprocess.check_call(['pip', 'install', '-q', 'langchain-valyu'])
    from langchain_valyu import ValyuSearchTool
    VALYU_AVAILABLE = True

# Enhanced token counting function that counts ALL messages
def count_all_messages_tokens(messages) -> dict:
    """Count tokens in all messages including tool calls and returns."""
    total_input = 0
    total_output = 0
    
    for msg in messages:
        msg_type = type(msg).__name__
        
        if msg_type == 'HumanMessage':
            total_input += count_tokens(msg.content)
        
        elif msg_type == 'AIMessage':
            if hasattr(msg, 'tool_calls') and msg.tool_calls:
                # Tool call message counts as output
                tool_call_str = str(msg.tool_calls)
                total_output += count_tokens(tool_call_str)
            if msg.content:
                total_output += count_tokens(msg.content)
        
        elif msg_type == 'ToolMessage':
            # Tool return is input to next LLM call
            total_input += count_tokens(msg.content)
    
    return {
        'input_tokens': total_input,
        'output_tokens': total_output,
        'total_tokens': total_input + total_output
    }

print("Enhanced token counting function ready!")
print("  This counts ALL tokens including tool calls and tool returns")

Enhanced token counting function ready!
  This counts ALL tokens including tool calls and tool returns


In [23]:
# Compare: No Tools vs Simple Tools vs Search Tool
print("="*70)
print("TOKEN CONSUMPTION COMPARISON")
print("="*70)
print()

# Same query for fair comparison
query = "What are the latest developments in quantum computing?"

# Test 1: No tools agent (baseline)
print("TEST 1: No Tools Agent (Baseline)")
print("-"*70)
# Test 1: No tools agent (baseline)
print("-"*70)
# Use get_chat_model() - uses Holistic AI Bedrock by default (recommended)
llm_test = get_chat_model("claude-3-5-sonnet")  # Uses Holistic AI Bedrock (recommended)
agent_baseline = create_react_agent(llm_test, tools=[])

result1 = agent_baseline.invoke({"messages": [HumanMessage(content=query)]})
tokens1 = count_all_messages_tokens(result1['messages'])

print(f"Query: {query}")
print(f"Messages: {len(result1['messages'])}")
print(f"Input tokens: {tokens1['input_tokens']}")
print(f"Output tokens: {tokens1['output_tokens']}")
print(f"TOTAL: {tokens1['total_tokens']} tokens")
print()

# Test 2: Simple math tools
print("TEST 2: Simple Math Tools Agent")
print("-"*70)
agent_math = create_react_agent(llm_test, tools=[calculate_fibonacci, add_numbers, multiply_numbers])

query2 = "Calculate the 10th Fibonacci number, then multiply it by 2."
result2 = agent_math.invoke({"messages": [HumanMessage(content=query2)]})
tokens2 = count_all_messages_tokens(result2['messages'])

print(f"Query: {query2}")
print(f"Messages: {len(result2['messages'])}")
print(f"Input tokens: {tokens2['input_tokens']}")
print(f"Output tokens: {tokens2['output_tokens']}")
print(f"TOTAL: {tokens2['total_tokens']} tokens")
print()

# Test 3: Valyu search tool (if available)
if os.getenv('VALYU_API_KEY'):
    print("TEST 3: Valyu Search Tool Agent")
    print("-"*70)
    
    try:
        search_tool = ValyuSearchTool(valyu_api_key=os.getenv("VALYU_API_KEY"))
        agent_search = create_react_agent(llm_test, tools=[search_tool])
        
        query3 = "What are the latest developments in quantum computing in 2025?"
        result3 = agent_search.invoke({"messages": [HumanMessage(content=query3)]})
        tokens3 = count_all_messages_tokens(result3['messages'])
        
        print(f"Query: {query3}")
        print(f"Messages: {len(result3['messages'])}")
        print(f"Input tokens: {tokens3['input_tokens']}")
        print(f"Output tokens: {tokens3['output_tokens']}")
        print(f"TOTAL: {tokens3['total_tokens']} tokens")
        print()
        
        # Summary comparison
        print("="*70)
        print("SUMMARY: Token Consumption Comparison")
        print("="*70)
        print(f"No Tools:        {tokens1['total_tokens']:>6} tokens (baseline)")
        print(f"Simple Tools:    {tokens2['total_tokens']:>6} tokens ({tokens2['total_tokens']/tokens1['total_tokens']:.1f}x baseline)")
        print(f"Search Tool:     {tokens3['total_tokens']:>6} tokens ({tokens3['total_tokens']/tokens1['total_tokens']:.1f}x baseline)")
        print()
        print("KEY INSIGHT:")
        print(f"  Search tool uses {tokens3['total_tokens'] - tokens1['total_tokens']:,} MORE tokens than no-tool baseline")
        print(f"  That is {((tokens3['total_tokens'] / tokens1['total_tokens']) - 1) * 100:.0f}% increase!")
        print()
        print("WHY?")
        print("  - Search tool returns large results (50-100KB of data)")
        print("  - Agent must process all returned content as input")
        print("  - Multiple reasoning steps to synthesize answer")
        print()
        
        # Cost implication
        input_cost = (tokens3['input_tokens'] / 1_000_000) * 0.15
        output_cost = (tokens3['output_tokens'] / 1_000_000) * 0.60
        total_cost = input_cost + output_cost
        
        baseline_input_cost = (tokens1['input_tokens'] / 1_000_000) * 0.15
        baseline_output_cost = (tokens1['output_tokens'] / 1_000_000) * 0.60
        baseline_cost = baseline_input_cost + baseline_output_cost
        
        print("COST IMPACT (gpt-5-mini pricing):")
        print(f"  No Tools cost:    ${baseline_cost:.6f} per query")
        print(f"  Search Tool cost: ${total_cost:.6f} per query")
        print(f"  Difference:       ${total_cost - baseline_cost:.6f} ({((total_cost/baseline_cost)-1)*100:.0f}% more expensive)")
        print()
        print("For 100,000 queries/month:")
        print(f"  No Tools:    ${baseline_cost * 100_000:.2f}/month")
        print(f"  Search Tool: ${total_cost * 100_000:.2f}/month")
        print(f"  Extra cost:  ${(total_cost - baseline_cost) * 100_000:.2f}/month")
        
    except Exception as e:
        print(f"Could not test search tool: {e}")
        print("This is optional - the comparison demonstrates the monitoring technique")
else:
    print("TEST 3: Valyu Search Tool (SKIPPED)")
    print("-"*70)
    print("No VALYU_API_KEY found - this test is optional")
    print("The key lesson: Tools that return large data (search, RAG, APIs)")
    print("can increase token consumption by 10-50x compared to no-tool agents")

TOKEN CONSUMPTION COMPARISON

TEST 1: No Tools Agent (Baseline)
----------------------------------------------------------------------
TEST 1: No Tools Agent (Baseline)
----------------------------------------------------------------------
Query: What are the latest developments in quantum computing?
Messages: 2
Input tokens: 9
Output tokens: 215
TOTAL: 224 tokens

TEST 2: Simple Math Tools Agent
----------------------------------------------------------------------
Query: Calculate the 10th Fibonacci number, then multiply it by 2.
Messages: 6
Input tokens: 19
Output tokens: 180
TOTAL: 199 tokens

TEST 3: Valyu Search Tool Agent
----------------------------------------------------------------------
Query: What are the latest developments in quantum computing in 2025?
Messages: 4
Input tokens: 29012
Output tokens: 559
TOTAL: 29571 tokens

SUMMARY: Token Consumption Comparison
No Tools:           224 tokens (baseline)
Simple Tools:       199 tokens (0.9x baseline)
Search Tool:      29571

### Key Takeaway: Real Tool Overhead

The search tool comparison reveals the **true cost of agents with real-world tools**:

**Simple math tools** (fibonacci, add, multiply):
- Return small results (1-10 tokens)
- Minimal overhead (~10-50 tokens total per query)
- Cost difference: negligible

**Search tools** (Valyu, Google, APIs):
- Return large results (10,000-50,000 tokens!)
- Massive overhead (20-40x more tokens)
- Cost difference: **$1-3 per 1000 queries**

**This explains why the simple tool comparison showed confusing results** - the tools were TOO simple to show meaningful differences. Real production agents with search, RAG, or API tools have dramatically higher token consumption.

**Production Implications:**
1. **Monitor tool return sizes** - Limit search results, truncate API responses
2. **Use tool-based routing** - Only call expensive tools when necessary
3. **Implement caching** - Cache search results for common queries
4. **Track real costs** - Use LangSmith (see Tutorial 05) to monitor actual production usage

### Analysis: Cost Impact of Tools

Let's analyze the cost differences across agent configurations:

## Step 6: Carbon Tracking with Local Models

### Why Carbon Tracking Only Makes Sense for Local Models

**API Models (OpenAI, Anthropic, etc.):**
- Emissions happen in remote cloud data centers
- You CANNOT measure them with CodeCarbon
- Already optimized by cloud providers
- Not your responsibility or control

**Local Models (Ollama, llama.cpp, etc.):**
- Emissions happen on YOUR hardware
- You CAN measure with CodeCarbon
- You can optimize (smaller models, batching)
- Real trade-off: quality vs energy usage

### Local Model Options with Ollama

For this tutorial, we'll use **Ollama** with the Qwen3 family:

- **qwen3:0.6b** - Ultra-lightweight (522MB) - Great for testing
- **qwen3:1.7b** - Lightweight (1.4GB) - Balanced performance
- **qwen3:4b** - Medium (2.5GB) - Better quality

Let's demonstrate carbon tracking and cost comparison:

In [None]:
# LOCAL MODEL COMPARISON: Different Quantized Versions
# NOTE: Requires Ollama installed - https://ollama.ai/
# Pull models: ollama pull qwen3:0.6b qwen3:1.7b qwen3:4b

try:
    from langchain_ollama import OllamaLLM
    
    def track_local_model(model_name: str, question: str) -> dict:
        """Track local model with carbon and performance metrics."""
        
        # Initialize local model
        local_llm = OllamaLLM(model=model_name)
        
        # Start carbon tracking
        tracker = EmissionsTracker(project_name="local_model_monitoring", log_level="error")
        tracker.start()
        
        # Run model and measure time
        start_time = time.time()
        response = local_llm.invoke(question)
        elapsed = time.time() - start_time
        
        # Stop carbon tracking
        emissions_kg = tracker.stop()
        
        # Estimate tokens (local models don't report exact counts)
        estimated_input_tokens = len(question) // 4
        estimated_output_tokens = len(response) // 4
        estimated_total_tokens = estimated_input_tokens + estimated_output_tokens
        
        return {
            'model': model_name,
            'time': elapsed,
            'estimated_tokens': estimated_total_tokens,
            'tokens_per_second': estimated_total_tokens / elapsed if elapsed > 0 else 0,
            'carbon_kg': emissions_kg,
            'carbon_mg': emissions_kg * 1_000_000 if emissions_kg else 0,
            'cost': 0.0,  # No API cost!
            'electricity_cost': (emissions_kg * 1_000) * 0.00012 if emissions_kg else 0,
            'answer': response
        }
    
    # Test with smallest model
    query = "Explain machine learning in one sentence."
    
    print("LOCAL MODEL CARBON TRACKING")
    print("="*70)
    print(f"Query: {query}\n")
    
    # Test three different sizes
    models_to_test = [
        ("qwen3:0.6b", "522MB", "0.6 billion", "Ultra-lightweight - Fastest"),
        ("qwen3:1.7b", "1.4GB", "1.7 billion", "Balanced - Good speed/quality"),
        ("qwen3:4b", "2.5GB", "4 billion", "Higher quality - Slower"),
    ]
    
    results = []
    
    print("Testing multiple Qwen3 model sizes...\n")
    
    for model_name, size, params, note in models_to_test:
        try:
            print(f"Testing {model_name} ({size}, {params} parameters)")
            print(f"  Note: {note}")
            
            metrics = track_local_model(model_name, query)
            results.append((model_name, size, params, metrics))
            
            print(f"  SUCCESS!")
            print(f"    Time: {metrics['time']:.2f}s")
            print(f"    Carbon: {metrics['carbon_mg']:.2f}mg CO2")
            print(f"    Speed: {metrics['tokens_per_second']:.1f} tokens/sec")
            print(f"    Cost: ${metrics['electricity_cost']:.8f} (electricity only)\n")
            
        except Exception as e:
            error_msg = str(e)
            print(f"  SKIPPED: {error_msg[:80]}")
            
            if "not found" in error_msg.lower():
                print(f"  >> To install: ollama pull {model_name}")
            elif "connection" in error_msg.lower():
                print(f"  >> Make sure Ollama is running: ollama serve")
            
            print(f"  >> This section is optional - continue to Tutorial 05 (Observability with LangSmith)!\n")
    
    # Display results
    if results:
        model_name, size, params, m = results[0]
        print("="*70)
        print("LOCAL MODEL RESULTS")
        print("="*70)
        print(f"\nModel: {model_name} ({size})")
        print(f"  Time: {m['time']:.2f}s")
        print(f"  Carbon: {m['carbon_mg']:.2f}mg CO2")
        print(f"  Speed: {m['tokens_per_second']:.1f} tokens/sec")
        print(f"  Cost: ${m['electricity_cost']:.8f} (electricity)")
        print(f"\nResponse: {m['answer'][:150]}...")
    else:
        print("\n" + "="*70)
        print("LOCAL MODEL SETUP NEEDED")
        print("="*70)
        print("\nTo try carbon tracking:")
        print("  1. Install Ollama: https://ollama.ai/")
        print("  2. Install integration: pip install langchain-ollama")
        print("  3. Pull model: ollama pull qwen3:0.6b")
        print("  4. Re-run this cell")
        print("\nNote: This is optional - continue to Tutorial 05 (Observability with LangSmith)!")

except ImportError:
    print("="*70)
    print("OPTIONAL: Local Model Carbon Tracking")
    print("="*70)
    print("\nThis section requires additional setup:")
    print("  1. Install langchain-ollama: pip install langchain-ollama")
    print("  2. Install Ollama: https://ollama.ai/")
    print("  3. Pull a model: ollama pull qwen3:0.6b")
    print("\nThis is OPTIONAL - continue to Tutorial 05 (Observability with LangSmith)!")



LOCAL MODEL CARBON TRACKING
Query: Explain machine learning in one sentence.

Testing multiple Qwen3 model sizes...

Testing qwen3:0.6b (522MB, 0.6 billion parameters)
  Note: Ultra-lightweight - Fastest
  SUCCESS!
    Time: 5.13s
    Carbon: 16.44mg CO2
    Speed: 10.1 tokens/sec
    Cost: $0.00000197 (electricity only)

Testing qwen3:1.7b (1.4GB, 1.7 billion parameters)
  Note: Balanced - Good speed/quality
  SUCCESS!
    Time: 6.74s
    Carbon: 21.56mg CO2
    Speed: 7.7 tokens/sec
    Cost: $0.00000259 (electricity only)

Testing qwen3:4b (2.5GB, 4 billion parameters)
  Note: Higher quality - Slower
  SUCCESS!
    Time: 24.36s
    Carbon: 77.98mg CO2
    Speed: 4.2 tokens/sec
    Cost: $0.00000936 (electricity only)

LOCAL MODEL RESULTS

Model: qwen3:0.6b (522MB)
  Time: 5.13s
  Carbon: 16.44mg CO2
  Speed: 10.1 tokens/sec
  Cost: $0.00000197 (electricity)

Response: Machine learning is a subset of artificial intelligence that enables systems to learn patterns from data, making pre

## Step 7: Cost Comparison - API vs Local

### Understanding the Trade-offs

**Cloud API (Claude 3.5 Sonnet (via Bedrock)):**
- Input: $0.15 per 1M tokens
- Output: $0.60 per 1M tokens
- Typical query: ~$0.000001-0.000002
- Scales linearly with usage
- No setup required
- Best quality

**Local Models (Qwen3):**
- Model cost: $0 (free download)
- API cost: $0 (runs locally)
- Only cost: Electricity (~$0.12/kWh in US)
- Typical query: ~$0.0000001-0.0000005
- Requires setup and hardware
- Privacy-friendly

### When to Use Each

**Use Cloud APIs when:**
- Low volume (<1,000 queries/month)
- Best quality needed
- Rapid prototyping
- No infrastructure available

**Use Local Models when:**
- High volume (10,000+ queries/month)
- Privacy critical
- Offline operation required
- Cost sensitive

### Real-World Scenarios

**Scenario 1: Chatbot (100,000 queries/month)**
- Claude 3.5 Sonnet (via Bedrock): ~$200/month
- qwen3:0.6b: ~$0.05/month (electricity)
- **Savings: $199.95/month**

**Scenario 2: Low Volume (1,000 queries/month)**
- Claude 3.5 Sonnet (via Bedrock): ~$2/month
- qwen3:0.6b: ~$0.01/month (but setup time may not be worth it)
- **Better to use API for convenience**

---

## Summary

Congratulations! You've mastered performance monitoring, cost tracking, and carbon emissions measurement.

### What You Learned

1. **Latency Tracking** - Measure response times accurately
2. **Token Counting** - Use tiktoken for precise token counts
3. **Cost Estimation** - Calculate API costs based on actual token usage
4. **Multi-Agent Comparison** - Compare different agent configurations
5. **Real-World Tool Impact** - Understand how search tools affect token consumption
6. **Carbon Tracking** - Measure emissions for local models
7. **Cost Comparison** - API vs local model trade-offs

### Key Insights

- tiktoken provides accurate token counts for cost estimation
- Claude 3.5 Sonnet (via Bedrock): $0.15/1M input, $0.60/1M output tokens
- Search tools can increase token consumption by 20-40x
- Local models cost $0 in API fees (only electricity)
- Carbon tracking only makes sense for local models where you control hardware

### Production Best Practices

1. **Monitor tool return sizes** - Limit search results, truncate API responses
2. **Use tool-based routing** - Only call expensive tools when necessary
3. **Implement caching** - Cache search results for common queries
4. **Track real costs** - Monitor actual production usage over time
5. **Optimize prompts** - Reduce tokens to lower costs

### What's Next?

Continue your learning journey:
- **05_observability.ipynb** - Learn deep observability with LangSmith tracing
- **06_benchmark_evaluation.ipynb** - Test agents on PhD-level questions
- **Advanced**: Set up production monitoring dashboards

### Resources

- [Tiktoken Documentation](https://github.com/openai/tiktoken)
- [CodeCarbon Documentation](https://mlco2.github.io/codecarbon/)
- [OpenAI Pricing](https://openai.com/api/pricing/)
