# MCP with Ultra-Lightweight Models - Gemma 3 270M

This notebook demonstrates how to use MCP servers with **Gemma 3 270M**, an ultra-lightweight language model that runs even on CPU!

## Why Use Ultra-Lightweight Models?

✅ **Minimal Resources** - Runs on CPU or tiny GPUs  
✅ **Blazing Fast** - Almost instant responses  
✅ **Privacy** - Company data never leaves your machine  
✅ **Cost** - No API fees, minimal compute requirements  
✅ **Learning** - Perfect for understanding tool calling mechanics  

## What You'll Learn

1. Load and run **Gemma 3 270M** locally (ultra-lightweight!)
2. Connect the model to MCP server functions
3. Implement tool calling from scratch
4. Process natural language queries end-to-end
5. Understand trade-offs of tiny models

## Requirements

- GPU optional (works on CPU too!)
- ~1GB RAM for model with quantization
- Python 3.10+

⚠️ **Note**: With only 270M parameters, this model is best for:
- Simple, straightforward tasks
- Learning and experimentation
- Environments with very limited resources
- Demonstrating concepts

For production use or complex reasoning, consider larger models like Gemma 2 2B or 9B.

---

## 1. Setup & Model Loading

### 1.1 Install Dependencies

In [None]:
# Install required packages (run once)
!pip install -q transformers torch accelerate bitsandbytes

### 1.2 Check GPU Availability

In [None]:
import torch

# Check if GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"🖥️  Using device: {device}")

if device == "cuda":
    print(f"   GPU: {torch.cuda.get_device_name(0)}")
    print(f"   VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
else:
    print("   💡 Running on CPU - Gemma 3 270M is small enough to run on CPU!")
    print("   Inference will be slower but still usable")

### 1.3 About Gemma 3 270M

**Why this model?**  
Gemma 3 270M is Google's ultra-lightweight **base model** that's perfect for resource-constrained environments. It's specifically designed for:
- **Extreme efficiency** - Only 270 million parameters
- **CPU-friendly** - Can run on CPU without GPU
- **Fast inference** - Near-instant responses even on modest hardware
- **32K context window** - Large context for its size
- **Learning-focused** - Great for understanding AI concepts

**Note:** The first run downloads the model weights (roughly 0.5GB at 16-bit precision for 270M parameters), which are then cached for future use. With 4-bit quantization, the model uses less than 1GB of RAM.

⚠️ **Important Distinction**: 
- This is a **base model** (not instruction-tuned like "gemma-2-2b-it")
- Base models are less aligned for chat/instruction following
- We use simple prompt formatting instead of chat templates
- Expect lower quality than instruction-tuned models

⚠️ **Important**: Gemma models require authentication - follow the steps in the next section.

⚠️ **Performance Expectations**:
- ✅ Can demonstrate tool calling concepts
- ⚠️ May struggle even with simple queries
- ⚠️ Less reliable than instruction-tuned models
- ⚠️ Best for learning, not production use
- 💡 **For better results, use a larger instruction-tuned model such as `google/gemma-2-2b-it`**

### 1.4 Authentication Setup (Required for Gemma Models)

⚠️ **Important**: Gemma models are gated and require authentication. Follow these steps **before** running the next cell:

#### Step 1: Accept the Gemma License

1. Go to https://huggingface.co/google/gemma-3-270m
2. If you don't have a Hugging Face account, click **Sign Up** (it's free)
3. Log in to your account
4. Click the **"Agree and access repository"** button to accept the license terms
5. Wait for approval (usually instant, but may take 1-2 minutes)

#### Step 2: Set Up Authentication in Google Colab

Now you need to authenticate using a Hugging Face token:

**2.1 Create a Hugging Face Token:**
1. Go to https://huggingface.co/settings/tokens
2. Click **"New token"**
3. Give it a name (e.g., "colab-access")
4. Select **"Read"** permission (sufficient for downloading models)
5. Click **"Generate token"**
6. **Copy the token** (you'll need it in the next step)

**2.2 Add Token to Colab Secrets:**
1. In Google Colab, click the **🔑 key icon** on the left sidebar (Secrets)
2. Click **"+ Add new secret"**
3. In the **Name** field, enter: `HF_TOKEN`
4. In the **Value** field, paste your Hugging Face token
5. Toggle **ON** the switch for "Notebook access"
6. Your secret is now saved!

**2.3 Load the Token:**

Run the cell below to load your authentication token from Colab secrets:

In [None]:
# Load Hugging Face token from Colab secrets
try:
    from google.colab import userdata
    HF_TOKEN = userdata.get('HF_TOKEN')
    if HF_TOKEN:
        # Avoid echoing any part of the secret into notebook output, which may be shared
        print("✅ Hugging Face token loaded successfully from Colab secrets!")
    else:
        print("⚠️ The HF_TOKEN secret was found but is empty!")
except Exception as e:
    print("❌ Error loading token from Colab secrets!")
    print(f"   Error: {e}")
    print("\n💡 Make sure you:")
    print("   1. Created a Hugging Face token at https://huggingface.co/settings/tokens")
    print("   2. Added it to Colab secrets with name 'HF_TOKEN'")
    print("   3. Enabled 'Notebook access' for the secret")
    HF_TOKEN = None

# Verify token is available
if not HF_TOKEN:
    raise ValueError("❌ No HF_TOKEN found! Please follow the authentication steps above.")
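
Outside Colab, the `google.colab` import above fails, so the cell always ends in the `ValueError`. A minimal sketch of a fallback follows, assuming you have exported the token as an `HF_TOKEN` environment variable yourself. The helper name `load_hf_token` is hypothetical and not part of the original notebook.

```python
import os


def load_hf_token():
    """Return a Hugging Face token from Colab secrets, else from the HF_TOKEN env var."""
    try:
        # Only importable inside Google Colab
        from google.colab import userdata
        token = userdata.get("HF_TOKEN")
        if token:
            return token
    except Exception:
        # Not running in Colab, or the secret is missing / notebook access is off
        pass
    # Fallback (assumption): the user has set HF_TOKEN in the environment
    return os.environ.get("HF_TOKEN")


token = load_hf_token()
print("Token found" if token else "No token found")
```

If you adopt this, `HF_TOKEN = load_hf_token()` could replace the Colab-only lookup while the final `if not HF_TOKEN` check stays as it is.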

### 1.5 Load Gemma 3 270M Model

Now that you're authenticated, let's load the ultra-lightweight model!

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

print("📥 Loading Gemma 3 270M model...")
print("   This will download roughly 0.5GB on first run (cached for future use)")
print("   Please wait 30-60 seconds...\n")

model_name = "google/gemma-3-270m"

# Load tokenizer with authentication
print("🔤 Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_name, token=HF_TOKEN)

# Set padding token (required for generation)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print("   ✅ Tokenizer loaded")
print("   💡 Note: Gemma 3 270M is a base model (not instruction-tuned)")
print("   💡 We'll use simple prompt formatting instead of chat templates")

# 4-bit quantization through bitsandbytes has historically required a CUDA GPU,
# so it is only enabled there. On CPU the model is loaded unquantized, which is
# still small at 270M parameters.
load_kwargs = {"low_cpu_mem_usage": True, "token": HF_TOKEN}

if device == "cuda":
    print("\n🤖 Loading model with 4-bit quantization...")
    load_kwargs["quantization_config"] = BitsAndBytesConfig(
        load_in_4bit=True,                    # Enable 4-bit quantization
        bnb_4bit_compute_dtype=torch.float16, # Dtype used for computation on dequantized weights
        bnb_4bit_quant_type="nf4",            # Quantization type (nf4 or fp4)
        bnb_4bit_use_double_quant=True        # Nested quantization for extra memory savings
    )
    load_kwargs["device_map"] = "auto"        # Place the quantized model on the GPU
else:
    print("\n🤖 Loading model without quantization (CPU)...")

# Load model with authentication (and the quantization config when on GPU)
model = AutoModelForCausalLM.from_pretrained(model_name, **load_kwargs)
print("   ✅ Model loaded")

print("\n" + "="*60)
print("🎉 Gemma 3 270M is ready to use!")
print("   Model size: 270 million parameters")
print(f"   Weights in memory: ~{model.get_memory_footprint() / 1024**3:.2f} GB")
print("="*60)

---

## 2. Import MCP Server Functions

We'll import functions directly from all five MCP servers (same approach as other notebooks).

In [None]:
# Ticket Server
from ticket_server import (
    search_tickets,
    get_ticket_details,
    get_ticket_metrics,
    find_similar_tickets_to
)

# Customer Server
from customer_server import (
    lookup_customer,
    check_customer_status,
    get_sla_terms,
    list_customer_contacts
)

# Billing Server
from billing_server import (
    get_invoice,
    check_payment_status,
    get_billing_history,
    calculate_outstanding_balance
)

# Knowledge Base Server
from kb_server import (
    search_solutions,
    get_article,
    find_related_articles,
    get_common_fixes
)

# Asset Server
from asset_server import (
    lookup_asset,
    check_warranty,
    get_asset_history
)

print("✅ All MCP server functions imported successfully!")
print("   Total: 19 tools imported")

### 2.1 Create Tool Registry

We'll use a **simplified tool set** for the 270M model to avoid overwhelming it. We focus on the most essential tools.

In [None]:
import json

# Define a simplified set of tools for the small model
# Using fewer tools helps the 270M model make better decisions
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_tickets",
            "description": "Search for support tickets by priority or status",
            "parameters": {
                "type": "object",
                "properties": {
                    "priority": {"type": "string", "description": "Filter by priority: low, medium, high, critical"},
                    "status": {"type": "string", "description": "Filter by status: open, in_progress, resolved"}
                }
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_ticket_details",
            "description": "Get details about a specific ticket",
            "parameters": {
                "type": "object",
                "properties": {
                    "ticket_id": {"type": "string", "description": "Ticket ID like TKT-1001"}
                },
                "required": ["ticket_id"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "lookup_customer",
            "description": "Look up customer information",
            "parameters": {
                "type": "object",
                "properties": {
                    "customer_id": {"type": "string", "description": "Customer ID like CUST-001"}
                }
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "check_warranty",
            "description": "Check warranty status for an asset",
            "parameters": {
                "type": "object",
                "properties": {
                    "asset_id": {"type": "string", "description": "Asset ID like AST-SRV-001"}
                },
                "required": ["asset_id"]
            }
        }
    }
]

# Create a mapping from tool names to actual Python functions
tool_map = {
    "search_tickets": search_tickets,
    "get_ticket_details": get_ticket_details,
    "lookup_customer": lookup_customer,
    "check_warranty": check_warranty
}

print(f"✅ Simplified tool registry created with {len(tools)} tools")
print("\n💡 Note: Using fewer tools to help the 270M model make better decisions")
print("\nAvailable tools:")
for tool in tools:
    print(f"  - {tool['function']['name']}")
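
The `required` lists in the registry are only descriptive unless something checks them. As a minimal sketch, a small model's tool call could be validated against the schema before the function runs. The helper `missing_required_args` and the `example_tool` entry are illustrative and not part of the original notebook, although the entry mirrors the shape used above.

```python
def missing_required_args(tool_schema, arguments):
    """Return the required parameter names that are absent from `arguments`."""
    required = tool_schema["function"]["parameters"].get("required", [])
    return [name for name in required if name not in arguments]


# Example entry in the same shape as the registry above
example_tool = {
    "type": "function",
    "function": {
        "name": "get_ticket_details",
        "description": "Get details about a specific ticket",
        "parameters": {
            "type": "object",
            "properties": {"ticket_id": {"type": "string"}},
            "required": ["ticket_id"],
        },
    },
}

print(missing_required_args(example_tool, {}))                          # ['ticket_id']
print(missing_required_args(example_tool, {"ticket_id": "TKT-1001"}))   # []
```

Returning the missing names, rather than raising, makes it easy to feed a short error message back to the model on the next turn.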

---

## 3. Tool Calling Implementation

### 3.1 How It Works

The tool calling loop:

1. **User asks a question** → Model receives query + available tools
2. **Model decides what to do** → Either call tools OR give final answer
3. **Execute tools** → Run Python functions, get results
4. **Feed results back** → Model uses results to answer
5. **Repeat if needed** → Model might call more tools

For a 270M model, we keep prompts simpler and more direct.
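
The five steps above can be sketched without any model at all. In this illustration, `fake_model` and `fake_get_ticket_details` are hypothetical stand-ins (canned responses, canned data) so that the control flow is visible on its own. The real implementation in the next sections uses the loaded model and the MCP server functions instead.

```python
import json


def fake_model(messages):
    """Stand-in for the real model: requests one tool call, then answers."""
    last = messages[-1]["content"]
    if last.startswith("Tool results:"):
        return "Ticket TKT-1001 is open."
    return ('<tool_call>{"name": "get_ticket_details", '
            '"arguments": {"ticket_id": "TKT-1001"}}</tool_call>')


def fake_get_ticket_details(ticket_id):
    """Hypothetical tool returning canned data."""
    return {"ticket_id": ticket_id, "status": "open"}


def run_loop(question, max_iterations=3):
    messages = [{"role": "user", "content": question}]            # step 1
    for _ in range(max_iterations):                                # step 5
        reply = fake_model(messages)                               # step 2
        if "<tool_call>" not in reply:
            return reply                                           # final answer
        payload = reply.split("<tool_call>")[1].split("</tool_call>")[0]
        call = json.loads(payload)
        result = fake_get_ticket_details(**call["arguments"])      # step 3
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user",
                         "content": f"Tool results:\n{json.dumps(result)}"})  # step 4
    return "Maximum iterations reached."


print(run_loop("Get details for ticket TKT-1001"))  # Ticket TKT-1001 is open.
```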

### 3.2 Helper Functions

In [None]:
def execute_tool(tool_name, arguments):
    """
    Execute a tool by name with given arguments.
    
    Args:
        tool_name: Name of the tool to execute
        arguments: Dictionary of arguments to pass
        
    Returns:
        Tool execution result (dict or error message)
    """
    if tool_name not in tool_map:
        return {"error": f"Tool '{tool_name}' not found"}
    
    try:
        # Call the actual Python function
        result = tool_map[tool_name](**arguments)
        return result
    except Exception as e:
        return {"error": f"Tool execution failed: {str(e)}"}


def format_conversation(messages):
    """
    Manually format conversation messages into a simple prompt.
    Since Gemma 3 270M is a base model, we use simple formatting.
    
    Args:
        messages: List of message dictionaries with 'role' and 'content'
        
    Returns:
        Formatted prompt string
    """
    prompt_parts = []
    
    for msg in messages:
        role = msg['role']
        content = msg['content']
        
        if role == 'user':
            prompt_parts.append(f"User: {content}")
        elif role == 'assistant':
            prompt_parts.append(f"Assistant: {content}")
    
    # Add final assistant prompt
    prompt_parts.append("Assistant:")
    
    return "\n\n".join(prompt_parts)


def generate_response(messages, max_new_tokens=128):
    """
    Generate a response from the model given conversation history.
    Uses greedy decoding to avoid sampling issues with small models.
    
    Args:
        messages: List of message dictionaries with 'role' and 'content'
        max_new_tokens: Maximum tokens to generate (reduced for 270M)
        
    Returns:
        Generated text response
    """
    # Manually format the conversation (no chat template for base models)
    prompt = format_conversation(messages)
    
    # Tokenize
    inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True, max_length=1024).to(device)
    
    # Generate using GREEDY DECODING (no sampling) to avoid NaN/Inf issues
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # Use greedy decoding instead of sampling
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
            repetition_penalty=1.1  # Reduce repetition (early_stopping is omitted: it only applies to beam search)
        )
    
    # Decode only the new tokens (not the input)
    response = tokenizer.decode(
        outputs[0][inputs['input_ids'].shape[1]:],
        skip_special_tokens=True
    )
    
    return response.strip()


import re


def parse_tool_calls(response):
    """
    Parse tool calls from model response.
    
    Args:
        response: Model's text response
        
    Returns:
        List of tool call dictionaries, or None if no tool calls
    """
    # Format 1: JSON wrapped in <tool_call> tags
    if "<tool_call>" in response:
        start = response.find("<tool_call>") + len("<tool_call>")
        end = response.find("</tool_call>", start)
        if end == -1:
            end = len(response)  # Closing tag missing: take the rest of the response
        try:
            parsed = json.loads(response[start:end].strip())
            # Accept a single object or a list of objects
            candidates = parsed if isinstance(parsed, list) else [parsed]
            tool_calls = [tc for tc in candidates if isinstance(tc, dict) and "name" in tc]
            if tool_calls:
                return tool_calls
        except json.JSONDecodeError as e:
            # Do not give up yet: fall through to format 2
            print(f"⚠️  Error parsing <tool_call> block: {e}")

    # Format 2: bare JSON objects with "name" and "arguments" keys
    json_pattern = r'\{[^{}]*"name"\s*:\s*"[^"]+"\s*,\s*"arguments"\s*:\s*\{[^}]*\}[^{}]*\}'
    tool_calls = []
    for match in re.findall(json_pattern, response, re.DOTALL):
        try:
            tool_calls.append(json.loads(match))
        except json.JSONDecodeError:
            continue  # Skip fragments that only look like JSON

    return tool_calls or None

print("✅ Helper functions defined (using greedy decoding for stability)")
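
A base model is not trained to stop after a single turn, so with the `User:` / `Assistant:` format it may keep writing and invent further turns of the conversation. One possible mitigation, shown here as a sketch and not as part of the original helpers, is to cut the generated text at the first marker that starts a new turn. The function name `truncate_at_turn_marker` is hypothetical.

```python
def truncate_at_turn_marker(text, markers=("\nUser:", "\nAssistant:")):
    """Cut generated text at the earliest marker that starts a new turn."""
    cut = len(text)
    for marker in markers:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].strip()


raw = "There are 3 critical tickets.\n\nUser: What about open ones?\n\nAssistant: ..."
print(truncate_at_turn_marker(raw))  # There are 3 critical tickets.
```

Applied to the output of `generate_response`, this would keep an invented follow-up turn from being shown as part of the final answer.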

### 3.3 Main Query Function (Optimized for Small Models)

This version uses simpler prompts and instructions optimized for the 270M model.

In [None]:
def query_with_tools(user_question, max_iterations=3, verbose=True):
    """
    Process a user query with tool calling capability.
    Optimized for small base models like Gemma 3 270M.
    
    Args:
        user_question: Natural language question from user
        max_iterations: Maximum number of tool calling rounds (reduced for 270M)
        verbose: Print intermediate steps
        
    Returns:
        Final answer from the model
    """
    # Build very simple, direct instructions for the base model
    # Base models need simpler formatting than instruction-tuned models
    system_instructions = f"""You are an IT support assistant. Available tools:

{json.dumps([t['function'] for t in tools], indent=2)}

To call a tool, output exactly:
<tool_call>
{{"name": "tool_name", "arguments": {{"param": "value"}}}}
</tool_call>

User question: {user_question}"""
    
    # Initialize conversation with simple user/assistant format
    messages = [
        {"role": "user", "content": system_instructions}
    ]
    
    if verbose:
        print("\n" + "="*70)
        print(f"👤 User: {user_question}")
        print("="*70)
    
    # Tool calling loop
    for iteration in range(max_iterations):
        if verbose:
            print(f"\n🔄 Iteration {iteration + 1}")
        
        # Get model response
        response = generate_response(messages, max_new_tokens=200)
        
        # Check if model wants to call tools
        tool_calls = parse_tool_calls(response)
        
        if tool_calls:
            # Model wants to use tools
            if verbose:
                print(f"🔧 Model requested {len(tool_calls)} tool call(s)")
            
            # Add assistant's response to history
            messages.append({"role": "assistant", "content": response})
            
            # Execute each tool
            tool_results = []
            for tool_call in tool_calls:
                tool_name = tool_call.get("name")
                arguments = tool_call.get("arguments", {})
                
                if verbose:
                    print(f"\n  Calling: {tool_name}")
                    print(f"  Args: {json.dumps(arguments, indent=4)}")
                
                # Execute the tool
                result = execute_tool(tool_name, arguments)
                tool_results.append(result)
                
                if verbose:
                    result_str = json.dumps(result, indent=4)
                    preview = result_str[:200] + "..." if len(result_str) > 200 else result_str
                    print(f"  Result: {preview}")
            
            # Add tool results to conversation with simple formatting
            tool_message = {
                "role": "user",
                "content": f"Tool results:\n{json.dumps(tool_results, indent=2)}\n\nPlease provide your answer based on these results."
            }
            messages.append(tool_message)
            
            # Continue loop to get next response
            continue
        
        else:
            # No tool calls - this is the final answer
            if verbose:
                print("\n✅ Final answer received")
                print("\n" + "="*70)
                print(f"🤖 Assistant: {response}")
                print("="*70)
            
            return response
    
    # Hit max iterations
    return "Maximum iterations reached. Please try a simpler question."

print("✅ Main query function defined (optimized for Gemma 3 270M base model)")

---

## 4. Examples & Demonstrations

Let's test with **simple, focused queries** that work well with the 270M model!

### Example 1: Simple Single-Tool Query

In [None]:
answer = query_with_tools("Show me critical priority tickets")

### Example 2: Get Ticket Details

In [None]:
answer = query_with_tools("Get details for ticket TKT-1001")

### Example 3: Customer Lookup

In [None]:
answer = query_with_tools("Look up customer CUST-001")

### Example 4: Warranty Check

In [None]:
answer = query_with_tools("Check warranty for asset AST-SRV-001")

---

## 5. Try It Yourself!

### Tips for Best Results with 270M Model:

✅ **Do:**
- Use simple, direct questions
- Ask for one thing at a time
- Be specific with IDs (e.g., "TKT-1001", "CUST-002")
- Use short queries

⚠️ **Avoid:**
- Complex multi-step questions
- Vague or ambiguous queries
- Asking for analysis or reasoning
- Long conversational prompts

### Your Custom Query

In [None]:
# Write your own simple query here!
my_query = "Show me open tickets"

answer = query_with_tools(my_query)

### Suggested Simple Queries

```python
# Good queries for 270M:
query_with_tools("Find high priority tickets")
query_with_tools("Get ticket TKT-1002 details")
query_with_tools("Look up customer CUST-003")
query_with_tools("Check warranty AST-SRV-002")
query_with_tools("Show open tickets")
```

---

## 6. Understanding Ultra-Lightweight Models

### Gemma 3 270M vs Larger Models

| Aspect | Gemma 3 270M | Gemma 2 2B | GPT-4 |
|--------|--------------|------------|-------|
| **Parameters** | 270M | 2B | Not publicly disclosed |
| **VRAM** | <1GB | 2-3GB | N/A (API) |
| **Speed** | ⚡⚡⚡ Blazing fast | ⚡⚡ Fast | ⚡ Fast |
| **CPU Viable** | ✅ Yes | ⚠️ Slow | N/A (API) |
| **Simple Tasks** | ✅ Good | ✅✅ Very good | ✅✅✅ Excellent |
| **Complex Reasoning** | ❌ Limited | ⚠️ Good | ✅✅✅ Excellent |
| **Cost** | ✅ Free | ✅ Free | 💰💰 Expensive |
| **Privacy** | ✅ 100% local | ✅ 100% local | ❌ Cloud |
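
As a rough cross-check of the memory row, the weights alone take about (number of parameters × bits per parameter ÷ 8) bytes. The sketch below uses the nominal parameter counts from the table and deliberately ignores activations, the KV cache, and quantization and framework overhead, so real usage will be higher than these figures.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Nominal parameter counts taken from the table above
for name, params in [("Gemma 3 270M", 270e6), ("Gemma 2 2B", 2e9)]:
    for bits in (16, 4):
        print(f"{name} @ {bits}-bit: ~{weight_memory_gb(params, bits):.2f} GB")
```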

### When to Use 270M Models

✅ **Perfect for:**
- Learning and experimentation
- Simple tool calling demonstrations
- CPU-only environments
- Extremely resource-constrained setups
- Prototyping and testing
- Single-step queries with clear intent

❌ **Not recommended for:**
- Production applications
- Complex multi-step reasoning
- Nuanced language understanding
- Tasks requiring high accuracy
- Ambiguous or vague queries

### Performance Expectations

**What works well:**
- ✅ Simple information retrieval
- ✅ Direct tool calls with clear parameters
- ✅ Basic question answering
- ✅ Straightforward commands

**Limitations:**
- ⚠️ May select wrong tools occasionally
- ⚠️ Struggles with multi-step reasoning
- ⚠️ Less natural language generation
- ⚠️ May hallucinate or produce inconsistent outputs
- ⚠️ Needs very explicit instructions

---

## 7. Upgrade Path

If you find the 270M model too limited, here's the upgrade path:

### Step Up: Gemma 2 2B
```python
# Just change the model name to:
model_name = "google/gemma-2-2b-it"
```
- 7x more parameters
- Much better reasoning
- Still runs on free Colab
- Handles complex queries
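
Instruction-tuned checkpoints such as the `-it` models usually ship a chat template, and they tend to respond better when prompts are built with it than with plain `User:` / `Assistant:` text. The sketch below assumes a Hugging Face tokenizer and uses its `chat_template` attribute and `apply_chat_template` method. The `build_prompt` helper and the `PlainTokenizer` stand-in are hypothetical and only illustrate the idea.

```python
def build_prompt(tokenizer, messages):
    """Use the tokenizer's chat template when it has one, else plain User/Assistant text."""
    if getattr(tokenizer, "chat_template", None):
        return tokenizer.apply_chat_template(
            messages,
            tokenize=False,              # return a string, not token IDs
            add_generation_prompt=True,  # end with the marker for the model's turn
        )
    # Fallback: the same simple format used for the base model
    labels = {"user": "User", "assistant": "Assistant"}
    parts = [f"{labels[m['role']]}: {m['content']}" for m in messages if m["role"] in labels]
    parts.append("Assistant:")
    return "\n\n".join(parts)


class PlainTokenizer:
    """Stand-in with no chat template, used only to exercise the fallback path."""
    chat_template = None


print(build_prompt(PlainTokenizer(), [{"role": "user", "content": "Show open tickets"}]))
```

With a real instruction-tuned tokenizer, the first branch is taken and the model-specific turn markers are inserted for you.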

### Further Up: Gemma 2 9B
```python
model_name = "google/gemma-2-9b-it"
```
- 33x more parameters than 270M
- Output quality much closer to hosted API models
- Requires GPU with ~10GB VRAM
- Excellent for production use

### Other Options
- **Qwen2.5-7B-Instruct** - Great balance of size/quality
- **Phi-3-mini** - Microsoft's efficient small model
- **Llama-3.2-3B** - Meta's balanced option

---

## 8. Summary

### What You Learned

✅ **Ultra-Lightweight Model Setup**
   - Loading Gemma 3 270M with 4-bit quantization
   - Running on CPU or minimal GPU
   - Memory management with <1GB usage
   - Blazing fast inference

✅ **Tool Calling with Small Models**
   - Simplifying tool sets for better performance
   - Optimizing prompts for small models
   - Managing expectations and limitations
   - Handling simple, direct queries

✅ **Practical Understanding**
   - Trade-offs between model size and capability
   - When to use ultra-lightweight models
   - How to upgrade when needed
   - Real-world deployment considerations

### Key Takeaways

1. **Size Matters**: 270M is incredible for its size, but has real limitations

2. **Perfect for Learning**: Ideal for understanding MCP and tool calling mechanics

3. **Resource Efficient**: Can run anywhere, even on CPU

4. **Know the Limits**: Best for simple, focused tasks

5. **Easy to Upgrade**: One line change to use larger, more capable models

### Next Steps

- **Test the limits** - See what works and what doesn't with 270M
- **Try Gemma 2 2B** - Experience the dramatic improvement
- **Optimize prompts** - Learn how prompt engineering affects small models
- **Build hybrid systems** - Use 270M for simple tasks, larger models for complex ones
- **Fine-tune** - Improve performance on your specific use case
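
The hybrid idea can start with a very simple router. The heuristic below is purely illustrative: the word limit, the hint words, and the `choose_model` name are assumptions, not a tested policy. It sends short, direct queries to the small model and everything else to a larger one.

```python
REASONING_HINTS = ("why", "compare", "explain", "analyze", "summarize")


def choose_model(query, max_words=8):
    """Return 'small' for short, direct queries and 'large' otherwise."""
    words = [w.strip("?.,!").lower() for w in query.split()]
    needs_reasoning = any(hint in words for hint in REASONING_HINTS)
    return "small" if len(words) <= max_words and not needs_reasoning else "large"


print(choose_model("Get details for ticket TKT-1001"))                                   # small
print(choose_model("Compare SLA breaches across all customers and explain the trend"))   # large
```

In practice you would tune such a rule against real queries, or fall back to the larger model whenever the small model's tool call fails to parse.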

---

🎉 **Congratulations!** You now understand how to work with ultra-lightweight AI models and their trade-offs. This knowledge is valuable for building efficient, resource-conscious AI systems!