# Module 10.2: Tool Calling & Function Execution

**Duration:** 40 minutes  
**Level:** 3 (Advanced)  
**Prerequisites:** M10.1 ReAct Pattern

## Overview

This notebook implements production-grade tool calling for agentic RAG systems. You'll learn:

- ✅ Building a tool registry with 5+ production tools
- ✅ Implementing sandboxed execution (RestrictedPython)
- ✅ Timeout protection and retry logic
- ✅ Tool result validation
- ✅ Handling 5 common tool execution failures
- ✅ When NOT to use tool calling (critical trade-offs)

In [ ]:
# Initial setup and imports
import sys
import json
import logging
from pathlib import Path

# Add current directory to path
sys.path.insert(0, str(Path.cwd()))

# Import core module
from l2_m10_tool_calling_function_execution import (
    ToolRegistry,
    ToolDefinition,
    ToolCategory,
    SafeToolExecutor,
    ReActAgent,
    register_default_tools,
    tool_registry
)

# Configure logging
logging.basicConfig(level=logging.WARNING)  # Keep output clean

print("✅ Imports successful")
# Expected: No errors, confirmation message

## Section 1: Introduction & Problem Statement

**The Challenge:**

In M10.1, you built a ReAct agent that can reason and act. But its only action is *search*; it cannot **do** things.

Production agents need to:
- Calculate risk scores
- Query databases for policy documents
- Call external APIs to check regulatory databases
- Send Slack notifications when risks are detected
- Generate charts showing compliance trends

**The Problem:**

Giving an LLM the ability to execute arbitrary code is dangerous:
- ❌ Malformed tool calls crash your agent
- ❌ Timeouts lock up your system indefinitely
- ❌ Security holes execute malicious code
- ❌ Invalid results corrupt agent state

**Today's Solution:**

Build a robust tool ecosystem that's **powerful enough to be useful** but **safe enough for production**.

## Section 2: Tool Calling Architecture (5-Step Process)

**How Production Tool Calling Works:**

**Step 1: Tool Definition**  
Define each tool with a schema (what it does, its parameters, and its return type). The schema becomes part of the LLM's system prompt.

**Step 2: Tool Selection**  
The LLM decides which tool to use and outputs structured JSON: `{"tool": "calculator", "args": {"expression": "0.25 * 1000000"}}`

**Step 3: Sandboxed Execution**  
Your code parses the JSON, validates the arguments, and executes the tool in a sandboxed environment with a timeout.

**Step 4: Result Validation**  
The tool returns a result. Before sending it to the LLM, validate that it has the expected type and format. Invalid results trigger retries.

**Step 5: Observation Integration**  
The validated result becomes the next Observation in the ReAct loop, and the agent uses it to continue reasoning.
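
Steps 2 and 3 hinge on turning the LLM's raw text into a checked tool call. The following is a minimal sketch of that parsing step, not the course module's implementation: `TOOL_SCHEMAS` and `parse_tool_call` are hypothetical names, and the schema format (argument name mapped to a Python type) is an assumption made for illustration.

```python
import json
from typing import Any

# Hypothetical schema table: tool name -> required argument names and types.
TOOL_SCHEMAS: dict[str, dict[str, type]] = {
    "calculator": {"expression": str},
}


def parse_tool_call(raw: str) -> tuple[str, dict[str, Any]]:
    """Parse the LLM's JSON output and check it against TOOL_SCHEMAS.

    Raises ValueError with a readable message on any problem, so the caller
    can hand the message back to the LLM as an Observation.
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Malformed JSON: {exc.msg}") from exc

    if not isinstance(call, dict) or not isinstance(call.get("tool"), str):
        raise ValueError("Tool call must be an object with a string 'tool' key")

    name, args = call["tool"], call.get("args", {})
    if name not in TOOL_SCHEMAS:
        raise ValueError(f"Unknown tool: {name!r}")
    if not isinstance(args, dict):
        raise ValueError("'args' must be an object")

    for param, expected_type in TOOL_SCHEMAS[name].items():
        if param not in args:
            raise ValueError(f"Missing required argument: {param!r}")
        if not isinstance(args[param], expected_type):
            raise ValueError(f"Argument {param!r} must be {expected_type.__name__}")
    return name, args


name, args = parse_tool_call(
    '{"tool": "calculator", "args": {"expression": "0.25 * 1000000"}}'
)
print(name, args)
```

Raising a single exception type with a descriptive message keeps the agent loop simple: every rejected call becomes text the LLM can read and correct on the next iteration.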

**Why This Matters for Production:**
- 🛡️ **Safety:** Sandboxing prevents code injection attacks
- ⏱️ **Reliability:** Timeouts prevent hung tools from locking agent
- 🐛 **Debuggability:** Validation catches errors at tool boundary
- 📊 **Observability:** Every tool call is logged with inputs/outputs/timing

## Section 3: Tool Registry Implementation

The Tool Registry is the central catalog of all available tools. It uses Pydantic for schema validation and provides:
- Tool registration with validation
- Tool discovery for LLM context
- Execution statistics tracking
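
To make those three responsibilities concrete, here is a dependency-free sketch of a registry. It is an illustration under stated assumptions, not the module's `ToolRegistry`: `SketchTool` and `SketchRegistry` are hypothetical names, and plain dataclasses stand in for the Pydantic models the real registry uses. The statistics keys (`calls`, `successes`, `total_time_ms`) mirror the ones read later in this notebook.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class SketchTool:
    """Hypothetical stand-in for a tool definition."""
    name: str
    description: str
    func: Callable[..., Any]
    timeout_seconds: float = 30.0
    retry_count: int = 3


@dataclass
class SketchRegistry:
    _tools: dict = field(default_factory=dict)
    _stats: dict = field(default_factory=dict)

    def register(self, tool: SketchTool) -> None:
        # Registration with validation: reject duplicates and bad settings.
        if tool.name in self._tools:
            raise ValueError(f"Tool already registered: {tool.name}")
        if tool.timeout_seconds <= 0:
            raise ValueError("timeout_seconds must be positive")
        self._tools[tool.name] = tool
        self._stats[tool.name] = {"calls": 0, "successes": 0, "total_time_ms": 0.0}

    def describe_for_prompt(self) -> str:
        # Discovery: one line per tool, suitable for pasting into a system prompt.
        return "\n".join(f"- {t.name}: {t.description}" for t in self._tools.values())

    def record(self, name: str, success: bool, elapsed_ms: float) -> None:
        # Statistics: updated by the executor after every call.
        stats = self._stats[name]
        stats["calls"] += 1
        stats["successes"] += int(success)
        stats["total_time_ms"] += elapsed_ms

    def get_stats(self) -> dict:
        return self._stats


registry = SketchRegistry()
registry.register(SketchTool("echo", "Return the input text unchanged.", lambda text: text))
registry.record("echo", success=True, elapsed_ms=1.5)
print(registry.describe_for_prompt())
```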

In [ ]:
# Register default tools and inspect registry
register_default_tools(tool_registry)

# List all registered tools
tools = tool_registry.list_tools()
print(f"Registered {len(tools)} tools:\n")
for tool in tools[:3]:  # Show first 3
    print(f"  • {tool.name} ({tool.category.value})")
    print(f"    Timeout: {tool.timeout_seconds}s, Retries: {tool.retry_count}")

# Expected: 5 tools registered (search, calculator, database, api, slack)

## Section 4: Sandboxed Execution Engine

The `SafeToolExecutor` provides three critical safety layers:

1. **Argument Validation:** Checks arguments match tool schema before execution
2. **Timeout Protection:** Uses ThreadPoolExecutor with configurable timeouts
3. **Retry Logic:** Exponential backoff for transient failures (via tenacity library)

Together with the per-tool input checks, these layers address the 5 common production failures covered in Section 6.
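
The timeout and retry layers can be sketched with the standard library alone. This is an illustration, not `SafeToolExecutor` itself: `run_with_timeout_and_retry` is a hypothetical helper, the backoff is hand-rolled rather than using tenacity, and treating every exception as retryable is a simplifying assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Any, Callable, Optional

# Shared worker pool; a hypothetical module-level default for this sketch.
_pool = ThreadPoolExecutor(max_workers=4)


def run_with_timeout_and_retry(
    func: Callable[..., Any],
    args: dict,
    timeout_s: float = 30.0,
    retries: int = 3,
    base_delay_s: float = 0.5,
) -> Any:
    """Run func(**args) with a per-attempt timeout and exponential backoff."""
    last_error: Optional[Exception] = None
    for attempt in range(retries):
        future = _pool.submit(func, **args)
        try:
            return future.result(timeout=timeout_s)
        except FutureTimeout:
            # The caller stops waiting, but a running thread cannot be killed;
            # cancel() only prevents a task that has not started yet.
            future.cancel()
            last_error = TimeoutError(f"tool exceeded {timeout_s}s")
        except Exception as exc:  # assumption: any tool error is retryable
            last_error = exc
        if attempt < retries - 1:
            time.sleep(base_delay_s * 2**attempt)  # delay doubles each attempt
    raise RuntimeError(f"tool failed after {retries} attempts: {last_error}")


print(run_with_timeout_and_retry(lambda x: x + 1, {"x": 1}, timeout_s=1.0))
```

One design consequence is worth noting: because a thread-based timeout abandons the work rather than stopping it, a timed-out tool may still complete its side effect in the background. Tools that must be truly interruptible need process or container isolation.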

In [ ]:
# Initialize the safe executor
executor = SafeToolExecutor(tool_registry)

# Execute calculator tool safely
result = executor.execute_tool("calculator", {"expression": "2 + 2 * 10"})

print(f"Success: {result.success}")
print(f"Result: {result.result}")
print(f"Execution time: {result.execution_time_ms:.2f}ms")

# Expected: Success=True, Result={'result': 22, 'expression': '2 + 2 * 10'}

## Section 5: Testing the Production Tools

Let's test three of the 5 registered tools (knowledge search, calculator, and database query) to understand their behavior:

In [ ]:
# Test 3 of the 5 tools
print("1. Knowledge Search:")
r1 = executor.execute_tool("knowledge_search", {"query": "tool calling", "top_k": 3})
print(f"   Found {r1.result['total_found']} results" if r1.success else f"   Error: {r1.error}")

print("\n2. Calculator:")
r2 = executor.execute_tool("calculator", {"expression": "10000 * 0.002"})
print(f"   Result: {r2.result['result']}" if r2.success else f"   Error: {r2.error}")

print("\n3. Database Query:")
r3 = executor.execute_tool("database_query", {"query": "SELECT * FROM policies LIMIT 2"})
print(f"   Rows: {r3.result['count']}" if r3.success else f"   Error: {r3.error}")

# Expected: All 3 tools execute successfully

## Section 6: Common Failures & Solutions (Critical Learning)

These are the **5 most common production failures** and how our system handles them:

### Failure 1: Code Injection Attack
**Attack:** Malicious expression like `import os; os.system('rm -rf /')`  
**Mitigation:** Calculator validates allowed characters only (0-9, ., +, -, *, /, (, ), space)
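
A character allowlist is a good first gate, and it can be paired with an AST walk so that `eval` is never called. The sketch below shows one way to do this; `safe_calculate` is a hypothetical function and not the module's calculator, and the 200-character limit is an arbitrary assumption.

```python
import ast
import operator
import re

_ALLOWED = re.compile(r"[0-9+\-*/(). ]+")
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval(node: ast.AST) -> float:
    # Only numeric literals and + - * / are evaluated; everything else
    # (including ** and //, which pass the character check) is rejected.
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval(node.operand))
    raise ValueError("Unsupported expression")


def safe_calculate(expression: str) -> float:
    """Evaluate basic arithmetic without eval(); raise ValueError otherwise."""
    if len(expression) > 200 or not _ALLOWED.fullmatch(expression):
        raise ValueError("Expression contains disallowed characters")
    try:
        tree = ast.parse(expression.strip(), mode="eval")
        return _eval(tree.body)
    except SyntaxError as exc:
        raise ValueError("Expression is not valid arithmetic") from exc
    except ZeroDivisionError as exc:
        raise ValueError("Division by zero") from exc


print(safe_calculate("2 + 2 * 10"))
```

The two checks cover different gaps: the allowlist rejects `import os` before any parsing happens, while the AST walk rejects inputs such as `9**9**9` that are built entirely from allowed characters but could still exhaust CPU or memory.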

### Failure 2: SQL Injection
**Attack:** Query like `SELECT * FROM users; DROP TABLE users;`  
**Mitigation:** Only a single SELECT statement is allowed (stacked statements after `;` are rejected), and values are passed via parameterized statements
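
As an illustration of that mitigation, the sketch below guards a query before it reaches the database. It assumes SQLite via the standard `sqlite3` module, and `run_readonly_query` is a hypothetical helper rather than the module's `database_query` tool. The semicolon check is deliberately conservative: it also rejects a legitimate query whose string literal contains `;`.

```python
import re
import sqlite3
from typing import Any, Sequence


def run_readonly_query(
    conn: sqlite3.Connection, query: str, params: Sequence[Any] = ()
) -> list:
    """Run a single SELECT with bound parameters; reject everything else."""
    statement = query.strip().rstrip(";").strip()
    if ";" in statement:
        raise ValueError("Multiple statements are not allowed")
    if not re.match(r"select\b", statement, flags=re.IGNORECASE):
        raise ValueError("Only SELECT queries are allowed")
    # Values travel separately from the SQL text, so they are never parsed as SQL.
    return conn.execute(statement, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE policies (id INTEGER, title TEXT)")
conn.execute("INSERT INTO policies VALUES (1, 'Data Retention')")
print(run_readonly_query(conn, "SELECT title FROM policies WHERE id = ?", (1,)))
```

A string check like this is only one layer. Connecting with a database account or connection mode that has no write permission gives a stronger guarantee, because it does not depend on parsing the query correctly.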

### Failure 3: Tool Timeout
**Scenario:** External API takes 60s to respond  
**Mitigation:** Timeout enforced (default 30s), returns error instead of hanging

### Failure 4: Invalid Arguments
**Scenario:** LLM generates malformed JSON or missing required params  
**Mitigation:** Schema validation rejects before execution

### Failure 5: Non-serializable Result
**Scenario:** Tool returns Python object instead of JSON  
**Mitigation:** Result validation ensures JSON compatibility
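
One simple way to enforce JSON compatibility is to attempt the encoding at the tool boundary. The following sketch assumes a hypothetical `validate_result` helper and an arbitrary size cap; the module's own validation may check more than this.

```python
import json
from typing import Any


def validate_result(result: Any, max_chars: int = 10_000) -> str:
    """Return the result as a JSON string, or raise ValueError if it cannot be one."""
    try:
        # allow_nan=False rejects NaN and Infinity, which are not valid JSON.
        encoded = json.dumps(result, allow_nan=False)
    except (TypeError, ValueError) as exc:
        raise ValueError(f"Tool result is not JSON-serializable: {exc}") from exc
    if len(encoded) > max_chars:
        raise ValueError(f"Tool result too large: {len(encoded)} characters")
    return encoded


print(validate_result({"result": 22, "expression": "2 + 2 * 10"}))
```

Encoding once here also gives the agent loop a ready-made string to place in the Observation, so the LLM never sees a Python `repr`.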

In [ ]:
# Demonstrate failure handling
print("Testing Failure Scenarios:\n")

# Failure 1: Code injection attempt
r = executor.execute_tool("calculator", {"expression": "import os"})
print(f"1. Code Injection: {'BLOCKED ✅' if not r.success else 'FAILED ❌'}")
print(f"   Error: {(r.error or '')[:60]}...\n")  # guard against error being None

# Failure 2: SQL injection attempt
r = executor.execute_tool("database_query", {"query": "DROP TABLE users"})
print(f"2. SQL Injection: {'BLOCKED ✅' if not r.success else 'FAILED ❌'}")
print(f"   Error: {r.error}\n")

# Failure 4: Invalid arguments
r = executor.execute_tool("knowledge_search", {})  # Missing required 'query'
print(f"4. Invalid Args: {'BLOCKED ✅' if not r.success else 'FAILED ❌'}")

# Expected: All attacks blocked, errors returned gracefully

## Section 7: ReAct Agent Integration

Now let's integrate tool calling into a full ReAct agent loop.

The agent follows the **Thought → Action → Observation** cycle:  
1. **Thought:** LLM reasons about what to do next
2. **Action:** Select tool + arguments (our executor runs it)
3. **Observation:** Tool result feeds back into next iteration
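
The cycle above can be sketched as a short loop. This is an illustration only: `react_loop` and `scripted_llm` are hypothetical, the LLM is replaced by a scripted function so the example runs offline, and the step format (JSON with `thought`, `tool`, `args`, or `answer`) is an assumption. The returned keys mirror the ones this notebook reads from `ReActAgent.run`.

```python
import json
from typing import Any, Callable


def react_loop(
    question: str,
    llm: Callable[[str, list], str],
    tools: dict,
    max_iterations: int = 5,
) -> dict:
    """Run Thought -> Action -> Observation until the LLM returns an answer."""
    trace: list = []
    for iteration in range(1, max_iterations + 1):
        raw = llm(question, trace)  # Thought: the LLM sees all prior observations
        try:
            step = json.loads(raw)
        except json.JSONDecodeError:
            step = {"tool": None}
        if isinstance(step, dict) and "answer" in step:
            return {"success": True, "answer": step["answer"],
                    "iterations": iteration, "trace": trace}
        tool_name = step.get("tool") if isinstance(step, dict) else None
        tool = tools.get(tool_name) if isinstance(tool_name, str) else None
        if tool is None:
            observation: Any = f"Error: unknown or malformed tool call {raw!r}"
        else:
            try:
                observation = tool(**step.get("args", {}))  # Action
            except Exception as exc:
                observation = f"Error: {exc}"  # errors become observations too
        trace.append({"thought": step.get("thought", "") if isinstance(step, dict) else "",
                      "action": tool_name, "observation": observation})
    return {"success": False, "answer": "", "iterations": max_iterations, "trace": trace}


def scripted_llm(question: str, trace: list) -> str:
    # Stand-in for a real model: call a tool once, then answer from its result.
    if not trace:
        return json.dumps({"thought": "I should add the numbers.",
                           "tool": "add", "args": {"a": 2, "b": 3}})
    return json.dumps({"answer": f"The result is {trace[-1]['observation']}"})


response = react_loop("What is 2 + 3?", scripted_llm, {"add": lambda a, b: a + b})
print(response["answer"], response["iterations"])
```

The `max_iterations` cap matters as much as the tool timeout: without it, a model that keeps emitting invalid tool calls would loop indefinitely.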

In [ ]:
# Run ReAct agent with tool calling
agent = ReActAgent(executor)

response = agent.run("How do I implement safe tool calling in production?")

print(f"Success: {response['success']}")
print(f"Iterations: {response['iterations']}")
print(f"\nAnswer: {response['answer'][:100]}...")
print(f"\nTrace steps: {len(response['trace'])}")

# Expected: Agent completes in 1-3 iterations with answer

## Section 8: Trade-offs & When NOT to Use

### Trade-offs Accepted:

| Trade-off | Impact |
|-----------|--------|
| Sandboxing overhead | +50-200ms latency per tool |
| Retry logic | May cause duplicate side effects |
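| Timeout interruption | Long operations are cut off at the deadline (with thread-based timeouts the caller stops waiting, but the worker may keep running) |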
| Restricted Python | Limited library access |

### When NOT to Use Tool Calling:

❌ **Information-only agents** - If you only need search/retrieval, don't add tool complexity  
❌ **Sub-100ms latency requirements** - Sandboxing overhead is too high  
❌ **Non-idempotent tools** - Retry logic can cause duplicate operations  
❌ **Cascading failure dependencies** - When one tool failure breaks others
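
When a non-idempotent tool cannot be avoided, one common workaround is an idempotency key: the caller generates one key per logical action and reuses it across retries. The sketch below assumes a hypothetical `IdempotentTool` wrapper with an in-memory store, which is enough to show the idea but would not survive a process restart.

```python
from typing import Any, Callable


class IdempotentTool:
    """Wrap a side-effecting tool so a repeated key returns the stored result."""

    def __init__(self, func: Callable[..., Any]) -> None:
        self._func = func
        self._completed: dict = {}  # in-memory only; an assumption of this sketch

    def __call__(self, idempotency_key: str, **args: Any) -> Any:
        if idempotency_key in self._completed:
            return self._completed[idempotency_key]  # retry: skip the side effect
        result = self._func(**args)
        self._completed[idempotency_key] = result
        return result


sent: list = []


def send_notification(message: str) -> dict:
    # Stand-in for a real side effect such as posting to a chat channel.
    sent.append(message)
    return {"delivered": True, "message": message}


notify = IdempotentTool(send_notification)
notify("risk-alert-42", message="Risk detected")
notify("risk-alert-42", message="Risk detected")  # simulated retry
print(len(sent))
```

This narrows the duplicate window but does not close it: if the process fails after the side effect and before the result is stored, a retry will repeat the action. Closing that gap requires the downstream service to honor the key itself.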

### Alternative Solutions:

**Pre-Approved Tool Outputs:** Static response database (zero execution risk, but inflexible)  
**Human-in-the-Loop:** Require approval before execution (safer, but slower)  
**Managed Platforms:** Zapier, n8n (vendor lock-in, but managed infrastructure)  
**Container Isolation:** Docker/Podman (stronger isolation, higher resource cost)

## Section 9: Production Considerations

### Cost Breakdown (10K conversations/hour):
- **API calls:** $2,000-5,000/month (external services)
- **Compute:** $500-1,000/month (if self-hosted)
- **Storage:** $200-500/month (execution logs)

### Monitoring Requirements:
✅ Tool success/failure rates  
✅ Execution latency percentiles (p50, p95, p99)  
✅ Cost per tool call  
✅ Error categorization and alerting
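
Latency percentiles can be computed from raw timings with the standard library. This is a small sketch with a hypothetical `latency_percentiles` helper; a production system would more likely rely on its metrics backend's histogram support.

```python
import statistics


def latency_percentiles(samples_ms: list) -> dict:
    """Return p50, p95, and p99 for a list of latency samples in milliseconds."""
    if len(samples_ms) < 2:
        raise ValueError("Need at least two samples")
    # n=100 yields 99 cut points; index i-1 holds the i-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


timings = [12.0, 15.0, 11.0, 240.0, 14.0, 13.0, 16.0, 18.0, 12.5, 950.0]
print(latency_percentiles(timings))
```

Tracking p95 and p99 alongside the median matters for tool calling because a single slow external API can leave the average looking healthy while a meaningful share of requests hit the timeout.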

### Deployment Checklist:
1. Load test with realistic query patterns
2. Implement circuit breakers for failing tools
3. Set up distributed tracing for debugging
4. Create runbooks for common failures
5. Establish SLA targets for agent latency

In [ ]:
# View execution statistics
print("Tool Execution Statistics:\n")
stats = tool_registry.get_stats()

for tool_name, tool_stats in stats.items():
    if tool_stats['calls'] > 0:
        success_rate = (tool_stats['successes'] / tool_stats['calls']) * 100
        avg_time = tool_stats['total_time_ms'] / tool_stats['calls']
        print(f"{tool_name}:")
        print(f"  Calls: {tool_stats['calls']}, Success Rate: {success_rate:.1f}%")
        print(f"  Avg Time: {avg_time:.2f}ms\n")

# Expected: Statistics for all executed tools

## Section 10: Decision Card

### Choose This Approach When:
✅ Agents must take actions beyond retrieval  
✅ You control tool implementations  
✅ Latency targets permit 50-500ms overhead  
✅ You can maintain retry-safe tool design

### Avoid When:
❌ Tool failures cascade unpredictably  
❌ Sub-100ms latency is critical  
❌ External tools lack idempotency guarantees  
❌ You only need information retrieval

### Next Steps:
➡️ **M10.3:** Multi-Agent Orchestration  
➡️ **M10.4:** Conversational RAG (multi-turn memory + tool calling)

---

## Practathon Challenges

**Easy (90 min):** Add 2 custom tools (e.g., weather API, file reader)  
**Medium (2-3 hrs):** Implement circuit breaker pattern for failing tools  
**Hard (5-6 hrs):** Build tool versioning system + performance dashboard

In [ ]:
# Cleanup and summary
executor.shutdown()

print("="*60)
print("Module 10.2 Complete! ✅")
print("="*60)
print("\nYou've learned:")
print("  ✅ Tool registry with Pydantic validation")
print("  ✅ Sandboxed execution with timeouts")
print("  ✅ 5 production tools (search, calc, DB, API, Slack)")
print("  ✅ Handling 5 common failures")
print("  ✅ When NOT to use tool calling (critical!)")
print("\n⚠️  Remember: Tool calling adds 50-500ms overhead")
print("⚠️  Only use when agents need to DO things, not just search")
print("\nNext: M10.3 - Multi-Agent Orchestration")
