# MCP with Local Open Source Models

This notebook demonstrates how to use MCP servers with a **local open source language model** instead of commercial APIs.

## Why Run Models Locally?

✅ **Privacy** - Company data never leaves your machine  
✅ **Cost** - No API fees (one-time compute cost)  
✅ **Offline** - Works without internet after download  
✅ **Control** - Full transparency, can customize  
✅ **Learning** - Understand tool calling mechanics  

## What You'll Learn

1. Load and run **Gemma 2 2B** locally (lightweight and fast!)
2. Connect the model to MCP server functions
3. Implement tool calling from scratch
4. Process natural language queries end-to-end

## Requirements

- GPU recommended (Google Colab free tier works perfectly!)
- ~2-3GB VRAM for model (much lighter than larger models)
- Python 3.10+

---

## 1. Setup & Model Loading

### 1.1 Install Dependencies

In [None]:
# Install required packages (run once)
!pip install -q transformers torch accelerate bitsandbytes

### 1.2 Check GPU Availability

In [None]:
import torch

# Check if GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"🖥️  Using device: {device}")

if device == "cuda":
    print(f"   GPU: {torch.cuda.get_device_name(0)}")
    print(f"   VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
else:
    print("   ⚠️  Note: Running on CPU will be significantly slower")
    print("   Consider using Google Colab with GPU runtime")

### 1.3 About Gemma 2 2B

**Why this model?**  
Gemma 2 2B is Google's lightweight, instruction-tuned model that's perfect for Google Colab. It's specifically optimized for:
- **Speed** - Much faster loading and inference than 7B models
- **Memory efficiency** - Runs smoothly on Colab's free tier GPU
- **Tool calling** - Follows prompt-based tool instructions and emits structured JSON (there is no built-in tool-calling API, so we parse the output ourselves)
- **Quality** - Despite its small size, performs well on instruction-following tasks

**Note:** First run will download ~5GB (cached for future use). With 4-bit quantization, it uses only ~2-3GB VRAM - perfect for Colab!

⚠️ **Important**: Gemma models require authentication - follow the steps in the next section.
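
The download size and VRAM figures above follow from simple arithmetic on the parameter count. The sketch below assumes roughly 2.6 billion parameters for Gemma 2 2B (an approximate figure, not taken from this notebook) and counts only the weights. In practice some layers are typically kept in higher precision, and activations, the KV cache, and CUDA overhead add to the total, which is one plausible explanation for the gap up to the ~2-3GB quoted here.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Return the memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: Gemma 2 2B has roughly 2.6 billion parameters
APPROX_PARAMS = 2.6e9

for label, bits in [("float16 checkpoint", 16), ("4-bit quantized", 4)]:
    size = estimate_weight_memory_gb(APPROX_PARAMS, bits)
    print(f"{label:>20}: ~{size:.1f} GB")
```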

### 1.4 Authentication Setup (Required for Gemma Models)

⚠️ **Important**: Gemma models are gated and require authentication. Follow these steps **before** running the next cell:

#### Step 1: Accept the Gemma License

1. Go to https://huggingface.co/google/gemma-2-2b-it
2. If you don't have a Hugging Face account, click **Sign Up** (it's free)
3. Log in to your account
4. Click the **"Agree and access repository"** button to accept the license terms
5. Wait for approval (usually instant, but may take 1-2 minutes)

#### Step 2: Set Up Authentication in Google Colab

Now you need to authenticate using a Hugging Face token:

**2.1 Create a Hugging Face Token:**
1. Go to https://huggingface.co/settings/tokens
2. Click **"New token"**
3. Give it a name (e.g., "colab-access")
4. Select **"Read"** permission (sufficient for downloading models)
5. Click **"Generate token"**
6. **Copy the token** (you'll need it in the next step)

**2.2 Add Token to Colab Secrets:**
1. In Google Colab, click the **🔑 key icon** on the left sidebar (Secrets)
2. Click **"+ Add new secret"**
3. In the **Name** field, enter: `HF_TOKEN`
4. In the **Value** field, paste your Hugging Face token
5. Toggle **ON** the switch for "Notebook access"
6. Your secret is now saved!

**2.3 Load the Token:**

Run the cell below to load your authentication token from Colab secrets:

In [None]:
# Load Hugging Face token from Colab secrets
try:
    from google.colab import userdata
    HF_TOKEN = userdata.get('HF_TOKEN')
except Exception as e:
    print("❌ Error loading token from Colab secrets!")
    print(f"   Error: {e}")
    print("\n💡 Make sure you:")
    print("   1. Created a Hugging Face token at https://huggingface.co/settings/tokens")
    print("   2. Added it to Colab secrets with name 'HF_TOKEN'")
    print("   3. Enabled 'Notebook access' for the secret")
    HF_TOKEN = None

# Verify token is available (an empty secret counts as missing)
if not HF_TOKEN:
    raise ValueError("❌ No HF_TOKEN found! Please follow the authentication steps above.")

# Report success only after the check; never print the token itself
print("✅ Hugging Face token loaded successfully from Colab secrets!")

### 1.5 Load Gemma 2 2B Model

Now that you're authenticated, let's load the model!

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

print("📥 Loading Gemma 2 2B model...")
print("   This will download ~5GB on first run (cached for future use)")
print("   Please wait 1-2 minutes...\n")

model_name = "google/gemma-2-2b-it"

# Load tokenizer with authentication
print("🔤 Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_name, token=HF_TOKEN)
print("   ✅ Tokenizer loaded")

# Configure 4-bit quantization
print("\n🤖 Loading model with 4-bit quantization...")
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # Enable 4-bit quantization
    bnb_4bit_compute_dtype=torch.float16, # Compute dtype for 4-bit base models
    bnb_4bit_quant_type="nf4",            # Quantization type (nf4 or fp4)
    bnb_4bit_use_double_quant=True        # Nested quantization for better memory efficiency
)

# Load model with quantization config and authentication
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",              # Automatically place on GPU
    low_cpu_mem_usage=True,
    token=HF_TOKEN                  # Pass authentication token
)
print("   ✅ Model loaded")

print("\n" + "="*60)
print("🎉 Gemma 2 2B is ready to use!")
print("="*60)

---

## 2. Import MCP Server Functions

We'll import functions directly from all five MCP servers (same approach as Notebook 1).

In [None]:
# Ticket Server
from ticket_server import (
    search_tickets,
    get_ticket_details,
    get_ticket_metrics,
    find_similar_tickets_to
)

# Customer Server
from customer_server import (
    lookup_customer,
    check_customer_status,
    get_sla_terms,
    list_customer_contacts
)

# Billing Server
from billing_server import (
    get_invoice,
    check_payment_status,
    get_billing_history,
    calculate_outstanding_balance
)

# Knowledge Base Server
from kb_server import (
    search_solutions,
    get_article,
    find_related_articles,
    get_common_fixes
)

# Asset Server
from asset_server import (
    lookup_asset,
    check_warranty,
    get_asset_history
)

print("✅ All MCP server functions imported successfully!")
print("   Total: 19 functions imported")

### 2.1 Create Tool Registry

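We need to describe each tool in a format the model understands. We use the OpenAI-style function calling schema: Gemma 2 2B has no native tool format, but it can read this JSON layout when we paste it into the prompt. Note that the registry below covers 11 of the imported functions, not all of them.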

In [None]:
import json

# Define available tools in OpenAI function calling format
# Most instruction-tuned models understand this standard format
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_tickets",
            "description": "Search for support tickets with optional filters for status, priority, customer_id, or keywords",
            "parameters": {
                "type": "object",
                "properties": {
                    "status": {"type": "string", "description": "Filter by status: open, in_progress, resolved, closed"},
                    "priority": {"type": "string", "description": "Filter by priority: low, medium, high, critical"},
                    "customer_id": {"type": "string", "description": "Filter by customer ID (e.g., CUST-001)"},
                    "keyword": {"type": "string", "description": "Search in subject and description"}
                }
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_ticket_details",
            "description": "Get detailed information about a specific ticket by its ID",
            "parameters": {
                "type": "object",
                "properties": {
                    "ticket_id": {"type": "string", "description": "Ticket ID (e.g., TKT-1001)"}
                },
                "required": ["ticket_id"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_ticket_metrics",
            "description": "Get ticket statistics and metrics for a time period",
            "parameters": {
                "type": "object",
                "properties": {
                    "period": {"type": "string", "description": "Time period: last_7_days, last_30_days, last_90_days"}
                },
                "required": ["period"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "lookup_customer",
            "description": "Look up customer information by customer ID or company name",
            "parameters": {
                "type": "object",
                "properties": {
                    "customer_id": {"type": "string", "description": "Customer ID (e.g., CUST-001)"},
                    "company_name": {"type": "string", "description": "Company name to search for"}
                }
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_sla_terms",
            "description": "Get SLA (Service Level Agreement) terms for a customer",
            "parameters": {
                "type": "object",
                "properties": {
                    "customer_id": {"type": "string", "description": "Customer ID (e.g., CUST-001)"}
                },
                "required": ["customer_id"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_invoice",
            "description": "Get invoice information by invoice ID or customer ID",
            "parameters": {
                "type": "object",
                "properties": {
                    "invoice_id": {"type": "string", "description": "Invoice ID (e.g., INV-1001)"},
                    "customer_id": {"type": "string", "description": "Customer ID to get all invoices"}
                }
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "calculate_outstanding_balance",
            "description": "Calculate total outstanding and overdue balance for a customer",
            "parameters": {
                "type": "object",
                "properties": {
                    "customer_id": {"type": "string", "description": "Customer ID (e.g., CUST-001)"}
                },
                "required": ["customer_id"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_solutions",
            "description": "Search knowledge base articles by keyword or topic",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                    "limit": {"type": "integer", "description": "Maximum number of results (default 5)"}
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_article",
            "description": "Get full knowledge base article by article ID",
            "parameters": {
                "type": "object",
                "properties": {
                    "article_id": {"type": "string", "description": "Article ID (e.g., KB-001)"}
                },
                "required": ["article_id"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "lookup_asset",
            "description": "Look up asset information by asset ID",
            "parameters": {
                "type": "object",
                "properties": {
                    "asset_id": {"type": "string", "description": "Asset ID (e.g., AST-SRV-001)"}
                },
                "required": ["asset_id"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "check_warranty",
            "description": "Check warranty status and details for an asset",
            "parameters": {
                "type": "object",
                "properties": {
                    "asset_id": {"type": "string", "description": "Asset ID (e.g., AST-SRV-001)"}
                },
                "required": ["asset_id"]
            }
        }
    }
]

# Create a mapping from tool names to actual Python functions
tool_map = {
    "search_tickets": search_tickets,
    "get_ticket_details": get_ticket_details,
    "get_ticket_metrics": get_ticket_metrics,
    "lookup_customer": lookup_customer,
    "get_sla_terms": get_sla_terms,
    "get_invoice": get_invoice,
    "calculate_outstanding_balance": calculate_outstanding_balance,
    "search_solutions": search_solutions,
    "get_article": get_article,
    "lookup_asset": lookup_asset,
    "check_warranty": check_warranty
}

print(f"✅ Tool registry created with {len(tools)} tools")
print("\nAvailable tools:")
for tool in tools:
    print(f"  - {tool['function']['name']}")
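
Because the schemas and the Python functions are written by hand in two places, they can drift apart. The following is a minimal sketch of a consistency check, assuming the same `tools` / `tool_map` layout as above. The helper name `find_schema_mismatches` and the demo function are hypothetical stand-ins, not part of the MCP servers.

```python
import inspect


def find_schema_mismatches(tools, tool_map):
    """Compare each tool schema with the signature of its Python function."""
    problems = []
    for tool in tools:
        spec = tool["function"]
        name = spec["name"]
        func = tool_map.get(name)
        if func is None:
            problems.append(f"{name}: no Python function registered")
            continue

        params = inspect.signature(func).parameters
        # Functions that take **kwargs accept any keyword, so skip the name check
        takes_kwargs = any(
            p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
        )
        declared = spec["parameters"].get("properties", {})
        if not takes_kwargs:
            for param_name in declared:
                if param_name not in params:
                    problems.append(
                        f"{name}: '{param_name}' is in the schema but not in the function"
                    )
    return problems


# Hypothetical stand-in for a real MCP server function
def demo_get_ticket_details(ticket_id):
    return {"ticket_id": ticket_id}


demo_tools = [
    {"type": "function", "function": {
        "name": "get_ticket_details",
        "parameters": {"type": "object",
                       "properties": {"ticket_id": {"type": "string"},
                                      "verbose": {"type": "boolean"}}}}},
    {"type": "function", "function": {
        "name": "get_ticket_metrics",
        "parameters": {"type": "object", "properties": {}}}},
]
demo_map = {"get_ticket_details": demo_get_ticket_details}

for problem in find_schema_mismatches(demo_tools, demo_map):
    print(problem)
```

Running a check like this once after editing the registry can catch a renamed parameter before the model ever tries to use it.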

---

## 3. Tool Calling Implementation

### 3.1 How It Works (Brief Review)

You already know the tool calling loop from previous notebooks:

1. **User asks a question** → Model receives query + available tools
2. **Model decides what to do** → Either call tools OR give final answer
3. **Execute tools** → Run Python functions, get results
4. **Feed results back** → Model uses results to answer
5. **Repeat if needed** → Model might call more tools

The key difference here: we're implementing this manually with a local model instead of using an API service.
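
The five steps above can be sketched without any model at all. The example below replaces the LLM with a scripted stand-in (`fake_model`) and the MCP server with a toy function (`fake_search_tickets`); both are hypothetical and exist only to make the control flow visible.

```python
import json
import re


def fake_model(messages):
    """Scripted stand-in for the LLM: request a tool first, then answer."""
    if messages[-1]["content"].startswith("Tool results:"):
        return "There is 1 critical ticket: TKT-1001."
    return ('<tool_call>{"name": "search_tickets", '
            '"arguments": {"priority": "critical"}}</tool_call>')


def fake_search_tickets(priority=None):
    """Toy replacement for the real search_tickets tool."""
    return [{"ticket_id": "TKT-1001", "priority": priority}]


TOOLS = {"search_tickets": fake_search_tickets}


def run_loop(question, max_iterations=3):
    messages = [{"role": "user", "content": question}]             # step 1
    for _ in range(max_iterations):
        reply = fake_model(messages)                                # step 2
        match = re.search(r"<tool_call>(.*?)</tool_call>", reply, re.DOTALL)
        if match is None:
            return reply                                            # final answer
        call = json.loads(match.group(1))
        result = TOOLS[call["name"]](**call["arguments"])           # step 3
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user",                            # step 4
                         "content": f"Tool results:\n{json.dumps(result)}"})
    return "Stopped: too many tool calls."                          # step 5 guard


print(run_loop("What are all the critical priority tickets?"))
```

The real implementation below follows the same shape, with `fake_model` replaced by a call to Gemma.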

### 3.2 Helper Functions

First, let's create helper functions for executing tools and generating model responses.

In [None]:
def execute_tool(tool_name, arguments):
    """
    Execute a tool by name with given arguments.

    Args:
        tool_name: Name of the tool to execute
        arguments: Dictionary of arguments to pass

    Returns:
        Tool execution result (dict or error message)
    """
    if tool_name not in tool_map:
        return {"error": f"Tool '{tool_name}' not found"}

    try:
        # Call the actual Python function
        result = tool_map[tool_name](**arguments)
        return result
    except Exception as e:
        return {"error": f"Tool execution failed: {str(e)}"}


def generate_response(messages, max_new_tokens=128):
    """
    Generate a response from the model given conversation history.

    Args:
        messages: List of message dictionaries with 'role' and 'content'
        max_new_tokens: Maximum tokens to generate

    Returns:
        Generated text response
    """
    # Format messages using the chat template
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

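    # Tokenize. The chat template already inserts <bos>, so don't add it a second time.
    # Use model.device because device_map="auto" decides where the model lives.
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)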

    # Generate
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id
        )

    # Decode only the new tokens (not the input)
    response = tokenizer.decode(
        outputs[0][inputs['input_ids'].shape[1]:],
        skip_special_tokens=True
    )

    return response.strip()


import re


def parse_tool_calls(response):
    """
    Parse tool calls from model response.
    Different models format tool calls differently - this function tries to handle multiple formats.

    Args:
        response: Model's text response

    Returns:
        List of tool call dictionaries, or None if no tool calls
    """
    tool_calls = []

    # Format 1: <tool_call> tags (used by some models like Hermes).
    # Handles several blocks per response; the closing tag is optional so a
    # missing </tool_call> no longer silently chops the last character.
    for block in re.findall(r"<tool_call>(.*?)(?:</tool_call>|$)", response, re.DOTALL):
        try:
            parsed = json.loads(block.strip())
        except json.JSONDecodeError as e:
            print(f"⚠️  Error parsing tool calls: {e}")
            continue
        # A block may hold a single call (dict) or a list of calls
        tool_calls.extend(parsed if isinstance(parsed, list) else [parsed])

    # Format 2 (fallback): bare JSON objects with "name" and "arguments" keys,
    # for models that output JSON directly
    if not tool_calls:
        json_pattern = r'\{[^{}]*"name"\s*:\s*"[^"]+"\s*,\s*"arguments"\s*:\s*\{[^}]*\}[^{}]*\}'
        for match in re.findall(json_pattern, response, re.DOTALL):
            try:
                tool_calls.append(json.loads(match))
            except json.JSONDecodeError:
                continue

    # Keep only well-formed calls
    tool_calls = [c for c in tool_calls if isinstance(c, dict) and "name" in c]
    return tool_calls or None

print("✅ Helper functions defined")

### 3.3 Main Query Function

This implements the complete tool calling loop.

In [None]:
def query_with_tools(user_question, max_iterations=5, verbose=True):
    """
    Process a user query with tool calling capability.

    Args:
        user_question: Natural language question from user
        max_iterations: Maximum number of tool calling rounds
        verbose: Print intermediate steps

    Returns:
        Final answer from the model
    """
    # Build combined message (Gemma doesn't support system role)
    # We combine the system instructions with the user message
    system_instructions = f"""You are a helpful IT support assistant with access to multiple tools.

When you need information to answer a question, use the available tools by generating a tool call in this format:
<tool_call>
{{
  "name": "tool_name",
  "arguments": {{
    "param1": "value1",
    "param2": "value2"
  }}
}}
</tool_call>

You can call multiple tools by providing a list of tool calls.

Available tools:
{json.dumps([t['function'] for t in tools], indent=2)}

After receiving tool results, use them to provide a helpful answer to the user.

---

User question: {user_question}"""

    # Initialize conversation (no system role for Gemma)
    messages = [
        {"role": "user", "content": system_instructions}
    ]

    if verbose:
        print("\n" + "=" * 70)
        print(f"👤 User: {user_question}")
        print("=" * 70)

    # Tool calling loop
    for iteration in range(max_iterations):
        if verbose:
            print(f"\n🔄 Iteration {iteration + 1}")

        # Get model response
        response = generate_response(messages)

        # Check if model wants to call tools
        tool_calls = parse_tool_calls(response)

        if tool_calls:
            # Model wants to use tools
            if verbose:
                print(f"🔧 Model requested {len(tool_calls)} tool call(s)")

            # Add assistant's response to history
            messages.append({"role": "assistant", "content": response})

            # Execute each tool
            tool_results = []
            for tool_call in tool_calls:
                tool_name = tool_call.get("name")
                arguments = tool_call.get("arguments", {})

                if verbose:
                    print(f"\n  Calling: {tool_name}")
                    print(f"  Args: {json.dumps(arguments, indent=4)}")

                # Execute the tool
                result = execute_tool(tool_name, arguments)
                tool_results.append(result)

                if verbose:
                    result_str = json.dumps(result, indent=4)
                    preview = result_str[:200] + "..." if len(result_str) > 200 else result_str
                    print(f"  Result: {preview}")

            # Add tool results to conversation
            tool_message = {
                "role": "user",
                "content": f"Tool results:\n{json.dumps(tool_results, indent=2)}"
            }
            messages.append(tool_message)

            # Continue loop to get next response
            continue

        else:
            # No tool calls - this is the final answer
            if verbose:
                print("\n✅ Final answer received")
                print("\n" + "=" * 70)
                print(f"🤖 Assistant: {response}")
                print("=" * 70)

            return response

    # Hit max iterations
    return "I've reached the maximum number of tool calls. Please try simplifying your question."


print("✅ Main query function updated for Gemma compatibility")
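
`query_with_tools()` works around the missing system role by pasting the instructions into the first user turn. If you already have OpenAI-style message lists that start with a `system` message, the same workaround can be applied generically. This is a minimal sketch; the helper name `fold_system_into_user` is hypothetical and not part of any library.

```python
def fold_system_into_user(messages):
    """Merge a leading 'system' message into the first 'user' message."""
    if not messages or messages[0]["role"] != "system":
        return list(messages)  # nothing to fold

    system, rest = messages[0], list(messages[1:])
    if not rest or rest[0]["role"] != "user":
        # No user turn to merge into: send the instructions as a user turn
        return [{"role": "user", "content": system["content"]}] + rest

    merged = {
        "role": "user",
        "content": f"{system['content']}\n\n---\n\n{rest[0]['content']}",
    }
    return [merged] + rest[1:]


openai_style = [
    {"role": "system", "content": "You are a helpful IT support assistant."},
    {"role": "user", "content": "Get details for ticket TKT-1001"},
]
for message in fold_system_into_user(openai_style):
    print(message["role"], "->", repr(message["content"]))
```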


---

## 4. Examples & Demonstrations

Now let's test the system with various queries!

### Example 1: Simple Single-Tool Query

In [None]:
answer = query_with_tools("What are all the critical priority tickets?")

**What happened:**
1. Model analyzed the question
2. Decided to call `search_tickets` with `priority="critical"`
3. Received the results
4. Formatted a natural language response

### Example 2: Multi-Tool Query

In [None]:
answer = query_with_tools(
    "Show me customer CUST-001's information and their SLA terms"
)

**What happened:**
- Model called TWO tools: `lookup_customer` AND `get_sla_terms`
- Combined the information into a coherent answer

### Example 3: Complex Multi-Step Query

In [None]:
answer = query_with_tools(
    "Which customers have critical tickets? Show their outstanding balances."
)

**What happened:**
1. First tool call: `search_tickets(priority="critical")` to find critical tickets
2. Extracted customer IDs from ticket results
3. Multiple tool calls: `calculate_outstanding_balance` for each customer
4. Correlated the data and formatted the answer

This demonstrates **multi-step reasoning** - the model chains tools together!

### Example 4: Error Handling

In [None]:
answer = query_with_tools("Get details for ticket TKT-9999")

**What happened:**
- Tool returned an error (ticket not found)
- Error included helpful hints (from your MCP server design!)
- Model used the hints to provide a helpful response to the user

### Example 5: Knowledge Base Search

In [None]:
answer = query_with_tools(
    "Find articles about Windows blue screen errors and show me the most relevant one"
)

### Example 6: Asset and Warranty Check

In [None]:
answer = query_with_tools(
    "Check the warranty status for asset AST-SRV-001. Is it still valid?"
)

---

## 5. Try It Yourself!

Now it's your turn. Use the `query_with_tools()` function to ask your own questions.

### Your Custom Query

In [None]:
# Write your own query here!
my_query = "Your question here..."

answer = query_with_tools(my_query)

### Suggested Queries to Try

```python
# Cross-server correlation
query_with_tools("Show me all high priority tickets for customers with overdue invoices")

# Metrics and analytics
query_with_tools("What are the ticket metrics for the last 30 days?")

# Customer insights
query_with_tools("Tell me about customer CUST-002, their tickets, and billing status")

# Knowledge base assistance
query_with_tools("What are common fixes for network connectivity issues?")

# Asset management
query_with_tools("Which assets have expired warranties?")
```

---

## 6. Understanding Local vs API Models

### Key Differences You Experienced

| Aspect | Local Model (Gemma 2 2B) | API Model (OpenAI) |
|--------|---------------------------|--------------------|
| **Privacy** | ✅ All data stays local | ⚠️ Sent to external servers |
| **Cost** | ✅ Free (after setup) | 💰 Pay per token |
| **Speed** | ✅ Fast (especially 2B models) | ✅ Fast (optimized infrastructure) |
| **Quality** | ⚠️ Good for simple tasks, may struggle with complex reasoning | ✅ Excellent reasoning |
| **Offline** | ✅ Works without internet | ❌ Requires connection |
| **Setup** | ⚠️ More complex | ✅ Simple API key |
| **Customization** | ✅ Can fine-tune, modify | ❌ Limited control |
| **Memory** | ✅ 2-3GB VRAM (Colab friendly!) | N/A (runs on their servers) |

### When to Use Local Models

✅ **Good for:**
- Sensitive data (healthcare, finance, legal)
- High-volume queries (cost savings)
- Offline/air-gapped environments
- Learning and experimentation
- Custom fine-tuning needs
- Limited GPU resources (2B models are very efficient!)

⚠️ **Consider APIs when:**
- Need highest quality reasoning
- Complex multi-step tasks
- Want minimal setup complexity
- Need consistent high-quality responses

### Potential Limitations You Might Notice

1. **Tool Selection Errors**: Smaller models (2B) may occasionally:
   - Choose wrong tool
   - Hallucinate tool names
   - Miss required parameters
   - Need more explicit instructions

2. **Response Quality**: May be less natural or detailed than larger models

3. **Complex Reasoning**: Multi-step queries might require more guidance or simpler prompts

4. **Consistency**: May vary more in quality compared to larger models

**Note:** These trade-offs are why hybrid approaches exist - use local for routine queries, API for complex cases. Gemma 2 2B is particularly good for learning and simple tasks!

---

## 7. Advanced Exercises

### Exercise 1: Add Conversation History

Modify `query_with_tools()` to support multi-turn conversations:
```python
# First query
history = []
answer1, history = query_with_tools_stateful("Show me critical tickets", history)

# Follow-up query (model remembers context)
answer2, history = query_with_tools_stateful("What are their customer IDs?", history)
```
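
As a starting point, the state-threading pattern can be sketched on its own. The version below leaves out the tool loop entirely and takes the generation function as a parameter so it runs without a model; `query_stateful` and `echo_generate` are hypothetical names used for illustration only.

```python
def query_stateful(question, history, generate):
    """Append the question, get a reply, and return (answer, new_history)."""
    # Build new lists instead of mutating the caller's history in place
    history = history + [{"role": "user", "content": question}]
    answer = generate(history)
    history = history + [{"role": "assistant", "content": answer}]
    return answer, history


def echo_generate(messages):
    """Stand-in for generate_response(): reports how much context it saw."""
    return f"seen {len(messages)} messages"


history = []
answer1, history = query_stateful("Show me critical tickets", history, echo_generate)
answer2, history = query_stateful("What are their customer IDs?", history, echo_generate)
print(answer1)
print(answer2)
```

The second call sees three messages, which is what lets a real model resolve "their" in the follow-up question.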

### Exercise 2: Implement Confidence Scoring

Add logic to detect when the model is uncertain and ask for clarification:
- Parse for uncertainty phrases ("I'm not sure", "might be", etc.)
- Request confirmation before executing potentially wrong tools
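
A simple way to start is plain phrase matching. The phrase list below is an illustrative assumption, not a tuned set, and `sounds_uncertain` is a hypothetical helper name.

```python
# Assumption: a small hand-picked list; extend it for your own model's habits
UNCERTAINTY_PHRASES = (
    "i'm not sure",
    "i am not sure",
    "might be",
    "not certain",
    "could be",
)


def sounds_uncertain(response, phrases=UNCERTAINTY_PHRASES):
    """Return True if the response contains any uncertainty phrase."""
    # Lowercase and normalize curly apostrophes so "I’m" matches "i'm"
    text = response.lower().replace("\u2019", "'")
    return any(phrase in text for phrase in phrases)


print(sounds_uncertain("I'm not sure which ticket you mean."))
print(sounds_uncertain("Ticket TKT-1001 is critical."))
```

Phrase matching is crude: it will miss hedges that are worded differently and may flag confident sentences that happen to contain a listed phrase.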

### Exercise 3: Tool Call Validation

Add a validation layer:
- Check if tool exists before calling
- Validate parameter types match expected schema
- Provide better error messages to model when validation fails
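
One possible shape for such a layer is sketched below. It checks a parsed tool call against the OpenAI-style schemas used in this notebook and returns error strings that could be fed back to the model. The function name `validate_tool_call` is hypothetical, and only flat parameter types are handled.

```python
# Map JSON Schema type names to Python types
JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def validate_tool_call(tool_call, tools):
    """Return a list of validation errors (an empty list means the call looks valid)."""
    schemas = {t["function"]["name"]: t["function"]["parameters"] for t in tools}
    name = tool_call.get("name")
    if name not in schemas:
        return [f"Unknown tool '{name}'. Available: {sorted(schemas)}"]

    args = tool_call.get("arguments", {})
    if not isinstance(args, dict):
        return ["'arguments' must be a JSON object"]

    schema = schemas[name]
    props = schema.get("properties", {})
    errors = [f"Missing required parameter '{req}'"
              for req in schema.get("required", []) if req not in args]

    for key, value in args.items():
        if key not in props:
            errors.append(f"Unexpected parameter '{key}'")
            continue
        type_name = props[key].get("type")
        expected = JSON_TYPES.get(type_name)
        if expected is None:
            continue  # unknown or missing type: nothing to check
        # bool is a subclass of int in Python, so reject it for non-boolean types
        is_bool_mismatch = isinstance(value, bool) and type_name != "boolean"
        if is_bool_mismatch or not isinstance(value, expected):
            errors.append(
                f"Parameter '{key}' should be {type_name}, got {type(value).__name__}"
            )
    return errors


demo_tools = [{
    "type": "function",
    "function": {
        "name": "search_solutions",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"},
                           "limit": {"type": "integer"}},
            "required": ["query"],
        },
    },
}]

print(validate_tool_call({"name": "search_solutions", "arguments": {"limit": "3"}}, demo_tools))
```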

### Exercise 4: Compare Different Models

Try swapping in a different model to compare:
- **Gemma 2 9B** (`google/gemma-2-9b-it`) - Same family, better quality
- **Qwen2.5-7B-Instruct** - Good alternative in the 7B range
- **Phi-3-mini** (`microsoft/Phi-3-mini-4k-instruct`) - Another efficient small model
- **Llama-3.2-3B** - Meta's small instruction model

Compare:
- Tool selection accuracy
- Response quality
- Speed
- Memory usage

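**Tip**: Just change the `model_name` variable in the model loading cell in Section 1! Gated models need their license accepted on Hugging Face first, and larger models need more VRAM.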

### Exercise 5: Performance Optimization

Optimize for speed:
- Reduce `max_new_tokens` for tool calls (tool calls don't need long responses)
- Implement caching for repeated queries
- Batch multiple tool calls when possible
- Try removing quantization if you have enough VRAM (faster inference)
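
For the caching idea, one option is to cache tool results rather than model output, keyed on the tool name and its arguments. The sketch below assumes the tools are read-only and that their data does not change during the session; `ToolCache` and `slow_lookup` are hypothetical names.

```python
import json


class ToolCache:
    """Wrap a tool_map and reuse results for identical calls."""

    def __init__(self, tool_map):
        self.tool_map = tool_map
        self._cache = {}
        self.hits = 0

    def call(self, name, arguments):
        # sort_keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} share one key
        key = (name, json.dumps(arguments, sort_keys=True))
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        result = self.tool_map[name](**arguments)
        self._cache[key] = result
        return result


calls = []


def slow_lookup(asset_id):
    """Stand-in for a real tool; records each time it actually runs."""
    calls.append(asset_id)
    return {"asset_id": asset_id, "status": "active"}


cache = ToolCache({"lookup_asset": slow_lookup})
cache.call("lookup_asset", {"asset_id": "AST-SRV-001"})
cache.call("lookup_asset", {"asset_id": "AST-SRV-001"})  # served from cache
print(f"tool executed {len(calls)} time(s), cache hits: {cache.hits}")
```

Tools that change state or return live data should bypass a cache like this.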

---

## 8. Summary

### What You Learned

✅ **Local Model Setup**
   - Loading Gemma 2 2B with 4-bit quantization
   - GPU optimization for inference
   - Memory management in Colab (only 2-3GB VRAM needed!)
   - Fast loading and inference times

✅ **Tool Calling Implementation**
   - Creating tool schemas in OpenAI function format
   - Parsing model outputs for tool calls
   - Executing Python functions dynamically
   - Multi-turn tool calling loops
   - Handling different model output formats

✅ **Practical Applications**
   - Privacy-preserving AI for sensitive data
   - Cost-effective alternative to API services
   - Understanding tool calling mechanics
   - Running on free-tier Google Colab

✅ **Trade-offs and Decisions**
   - When to use local vs API models
   - Performance vs quality considerations
   - Memory efficiency with smaller models
   - Real-world deployment scenarios

### Key Takeaways

1. **Privacy Matters**: Local models enable AI on sensitive data without external dependencies

2. **Tool Calling is Universal**: Same concepts work across OpenAI, Anthropic, and open source models

3. **Size vs Performance**: Smaller models (2B) can handle tool calling with the right setup, though with some quality trade-offs

4. **Open Source = Control**: Full transparency and customization options

5. **Implementation Matters**: Prompt engineering and error handling are crucial for success

6. **Efficiency First**: Gemma 2 2B proves you don't always need large models - 2B parameters can be surprisingly capable!

### Next Steps

- **Fine-tune Gemma** on your specific tools/domain for better accuracy
- **Deploy in production** with proper error handling and monitoring  
- **Explore other models** (try Gemma 2 9B for better quality, or Qwen 2.5 for comparison)
- **Build hybrid systems** (local for routine queries, API for complex cases)
- **Optimize prompts** to get better tool selection from smaller models

---

🎉 **Congratulations!** You now know how to build lightweight, privacy-preserving AI assistants with tool calling capabilities using small open source models that run perfectly on Google Colab!