# Enrutador de Modelo Basado en Intenciones con Foundry Local SDK

**Sistema de Enrutamiento Multi-Modelo Optimizado para CPU**

Este cuaderno demuestra un sistema de enrutamiento inteligente que selecciona autom√°ticamente el mejor modelo de lenguaje peque√±o seg√∫n la intenci√≥n del usuario. Perfecto para escenarios de implementaci√≥n en el borde donde deseas aprovechar m√∫ltiples modelos especializados de manera eficiente.

## üéØ Lo Que Aprender√°s

- **Detecci√≥n de Intenciones**: Clasificar autom√°ticamente los prompts (c√≥digo, resumen, clasificaci√≥n, general)
- **Selecci√≥n Inteligente de Modelos**: Enrutar al modelo m√°s capaz para cada tarea
- **Optimizaci√≥n para CPU**: Modelos eficientes en memoria que funcionan en cualquier hardware
- **Gesti√≥n Multi-Modelo**: Mant√©n m√∫ltiples modelos cargados con `--retain true`
- **Patrones de Producci√≥n**: L√≥gica de reintento, manejo de errores y seguimiento de tokens

## üìã Descripci√≥n del Escenario

Este patr√≥n demuestra:

1. **Detecci√≥n de Intenciones**: Clasificar cada prompt del usuario (c√≥digo, resumen, clasificaci√≥n o general)
2. **Selecci√≥n de Modelos**: Elegir autom√°ticamente el modelo de lenguaje peque√±o m√°s adecuado seg√∫n sus capacidades
3. **Ejecuci√≥n Local**: Enrutar a modelos que se ejecutan localmente a trav√©s del servicio Foundry Local
4. **Interfaz Unificada**: Punto √∫nico de entrada de chat que enruta a m√∫ltiples modelos especializados

**Ideal para**: Implementaciones en el borde con m√∫ltiples modelos especializados donde deseas un enrutamiento inteligente de solicitudes sin selecci√≥n manual de modelos.

## üîß Requisitos Previos

- **Foundry Local** instalado y servicio en ejecuci√≥n
- **Python 3.8+** con pip
- **8GB+ de RAM** (16GB+ recomendado para m√∫ltiples modelos)
- M√≥dulo **workshop_utils** (en ../samples/)

## üöÄ Inicio R√°pido

El cuaderno har√° lo siguiente:
1. Detectar la memoria de tu sistema
2. Recomendar modelos de CPU apropiados
3. Cargar autom√°ticamente los modelos con `--retain true`
4. Verificar que todos los modelos est√©n listos
5. Enrutar prompts de prueba a modelos especializados

**Tiempo estimado de configuraci√≥n**: 5-7 minutos (incluye la carga de modelos)


## üì¶ Paso 1: Instalar Dependencias

Instala el SDK oficial de Foundry Local y las bibliotecas necesarias:

- **foundry-local-sdk**: SDK oficial de Python para la gesti√≥n local de modelos
- **openai**: API compatible con OpenAI para completar chats
- **psutil**: Detecci√≥n y monitoreo de memoria del sistema


In [107]:
# Install core dependencies
!pip install -q foundry-local-sdk openai psutil

## üíª Paso 2: Detecci√≥n de Memoria del Sistema

Detecta la memoria disponible del sistema para determinar qu√© modelos de CPU pueden funcionar de manera eficiente. Esto garantiza una selecci√≥n √≥ptima de modelos para tu hardware.


In [108]:
import psutil

# Get system memory information
total_memory_gb = psutil.virtual_memory().total / (1024**3)
available_memory_gb = psutil.virtual_memory().available / (1024**3)

print('üñ•Ô∏è  System Memory Information')
print('=' * 70)
print(f'Total Memory:     {total_memory_gb:.2f} GB')
print(f'Available Memory: {available_memory_gb:.2f} GB')
print()

# Recommend models based on available memory
# Using model aliases - Foundry Local will automatically select CPU variant
model_aliases = []

if total_memory_gb >= 32:
    model_aliases = ['phi-4-mini', 'phi-3.5-mini', 'qwen2.5-0.5b', 'qwen2.5-coder-0.5b']
    print('‚úÖ High Memory System (32GB+)')
    print('   Can run 3-4 models simultaneously')
elif total_memory_gb >= 16:
    model_aliases = ['phi-4-mini', 'qwen2.5-0.5b', 'phi-3.5-mini']
    print('‚úÖ Medium Memory System (16-32GB)')
    print('   Can run 2-3 models simultaneously')
elif total_memory_gb >= 8:
    model_aliases = ['qwen2.5-0.5b', 'phi-3.5-mini']
    print('‚ö†Ô∏è  Lower Memory System (8-16GB)')
    print('   Recommended: 2 smaller models')
else:
    model_aliases = ['qwen2.5-0.5b']
    print('‚ö†Ô∏è  Limited Memory System (<8GB)')
    print('   Recommended: Use only smallest model')

print()
print('üìã Recommended Model Aliases for Your System:')
for model in model_aliases:
    print(f'   ‚Ä¢ {model}')

print()
print('üí° About Model Aliases:')
print('   ‚úì Use base alias (e.g., phi-4-mini, not phi-4-mini-cpu)')
print('   ‚úì Foundry Local automatically selects CPU variant for your hardware')
print('   ‚úì No GPU required - optimized for CPU inference')
print('   ‚úì Predictable memory usage and consistent performance')
print('=' * 70)

üñ•Ô∏è  System Memory Information
Total Memory:     63.30 GB
Available Memory: 16.19 GB

‚úÖ High Memory System (32GB+)
   Can run 3-4 models simultaneously

üìã Recommended Model Aliases for Your System:
   ‚Ä¢ phi-4-mini
   ‚Ä¢ phi-3.5-mini
   ‚Ä¢ qwen2.5-0.5b
   ‚Ä¢ qwen2.5-coder-0.5b

üí° About Model Aliases:
   ‚úì Use base alias (e.g., phi-4-mini, not phi-4-mini-cpu)
   ‚úì Foundry Local automatically selects CPU variant for your hardware
   ‚úì No GPU required - optimized for CPU inference
   ‚úì Predictable memory usage and consistent performance


## ü§ñ Paso 3: Carga Autom√°tica de Modelos

Esta celda autom√°ticamente:
1. Inicia el servicio Foundry Local (si no est√° en ejecuci√≥n)
2. Carga los modelos recomendados con `--retain true` (mantiene m√∫ltiples modelos en memoria)
3. Verifica que todos los modelos est√©n listos utilizando el SDK

‚è±Ô∏è **Tiempo estimado**: 3-5 minutos para todos los modelos


In [109]:
import subprocess
import time
import sys
import os

# Add samples directory for workshop_utils (Foundry SDK pattern)
sys.path.append(os.path.join('..', 'samples'))

print('üöÄ Automatic Model Loading with SDK Verification')
print('=' * 70)

# Use top 3 recommended models (aliases)
# Foundry will automatically load CPU variants
REQUIRED_MODELS = model_aliases[:3]
print(f'üìã Loading {len(REQUIRED_MODELS)} models: {REQUIRED_MODELS}')
print('üí° Using model aliases - Foundry will load CPU variants automatically')
print()

# Step 1: Ensure Foundry Local service is running
print('üì° Step 1: Checking Foundry Local service...')
try:
    result = subprocess.run(['foundry', 'service', 'status'], 
                          capture_output=True, text=True, timeout=5)
    if result.returncode == 0:
        print('   ‚úÖ Service is already running')
    else:
        print('   ‚öôÔ∏è  Starting Foundry Local service...')
        subprocess.run(['foundry', 'service', 'start'], 
                      capture_output=True, text=True, timeout=30)
        time.sleep(5)
        print('   ‚úÖ Service started')
except Exception as e:
    print(f'   ‚ö†Ô∏è  Could not verify service: {e}')
    print('   üí° Try manually: foundry service start')

# Step 2: Load each model with --retain true
print(f'\nü§ñ Step 2: Loading models with retention...')
for i, model in enumerate(REQUIRED_MODELS, 1):
    print(f'   [{i}/{len(REQUIRED_MODELS)}] Starting {model}...')
    try:
        subprocess.Popen(['foundry', 'model', 'run', model, '--retain', 'true'],
                        stdout=subprocess.DEVNULL,
                        stderr=subprocess.DEVNULL)
        print(f'       ‚úÖ {model} loading in background')
    except Exception as e:
        print(f'       ‚ùå Error starting {model}: {e}')

# Step 3: Verify models are ready
print(f'\n‚úÖ Step 3: Verifying models (this may take 2-3 minutes)...')
print('=' * 70)

try:
    from workshop_utils import get_client
    
    ready_models = []
    max_attempts = 30
    attempt = 0
    
    while len(ready_models) < len(REQUIRED_MODELS) and attempt < max_attempts:
        attempt += 1
        print(f'\n   Attempt {attempt}/{max_attempts}...')
        
        for model in REQUIRED_MODELS:
            if model in ready_models:
                continue
                
            try:
                manager, client, model_id = get_client(model, None)
                response = client.chat.completions.create(
                    model=model_id,
                    messages=[{"role": "user", "content": "test"}],
                    max_tokens=5,
                    temperature=0
                )
                
                if response and response.choices:
                    ready_models.append(model)
                    print(f'   ‚úÖ {model} is READY')
                    
            except Exception as e:
                error_msg = str(e).lower()
                if 'connection' in error_msg or 'timeout' in error_msg:
                    print(f'   ‚è≥ {model} still loading...')
                else:
                    print(f'   ‚ö†Ô∏è  {model} error: {str(e)[:60]}...')
        
        if len(ready_models) == len(REQUIRED_MODELS):
            break
            
        if len(ready_models) < len(REQUIRED_MODELS):
            time.sleep(10)
    
    # Final status
    print('\n' + '=' * 70)
    print(f'üì¶ Final Status: {len(ready_models)}/{len(REQUIRED_MODELS)} models ready')
    
    for model in REQUIRED_MODELS:
        if model in ready_models:
            print(f'   ‚úÖ {model} - READY (retained in memory)')
        else:
            print(f'   ‚ùå {model} - NOT READY')
    
    if len(ready_models) == len(REQUIRED_MODELS):
        print('\nüéâ All models loaded and verified!')
        print('   ‚úÖ Ready for intent-based routing')
    else:
        print(f'\n‚ö†Ô∏è  Some models not ready. Check: foundry model ls')
        
except ImportError as e:
    print(f'\n‚ùå Cannot import workshop_utils: {e}')
    print('   üí° Ensure workshop_utils.py is in ../samples/')
except Exception as e:
    print(f'\n‚ùå Verification error: {e}')

üöÄ Automatic Model Loading with SDK Verification
üìã Loading 3 models: ['phi-4-mini', 'phi-3.5-mini', 'qwen2.5-0.5b']
üí° Using model aliases - Foundry will load CPU variants automatically

üì° Step 1: Checking Foundry Local service...
   ‚úÖ Service is already running

ü§ñ Step 2: Loading models with retention...
   [1/3] Starting phi-4-mini...
       ‚úÖ phi-4-mini loading in background
   [2/3] Starting phi-3.5-mini...
       ‚úÖ phi-3.5-mini loading in background
   [3/3] Starting qwen2.5-0.5b...
       ‚úÖ qwen2.5-0.5b loading in background

‚úÖ Step 3: Verifying models (this may take 2-3 minutes)...

   Attempt 1/30...
   ‚ö†Ô∏è  phi-4-mini error: get_client() takes 1 positional argument but 2 were given...
   ‚ö†Ô∏è  phi-3.5-mini error: get_client() takes 1 positional argument but 2 were given...
   ‚ö†Ô∏è  qwen2.5-0.5b error: get_client() takes 1 positional argument but 2 were given...

   Attempt 2/30...
   ‚ö†Ô∏è  phi-4-mini error: get_client() takes 1 positional argume

## üéØ Paso 4: Configurar la Detecci√≥n de Intenciones y el Cat√°logo de Modelos

Configura el sistema de enrutamiento con:
- **Reglas de Intenci√≥n**: Patrones Regex para clasificar las solicitudes
- **Cat√°logo de Modelos**: Asocia las capacidades de los modelos con categor√≠as de intenci√≥n
- **Sistema de Prioridad**: Determina la selecci√≥n de modelos cuando varios modelos coinciden

**Ventajas de los Modelos CPU**:
- ‚úÖ No requiere GPU
- ‚úÖ Rendimiento consistente
- ‚úÖ Menor consumo de energ√≠a
- ‚úÖ Uso de memoria predecible


In [110]:
import re

# Model capability catalog (maps model aliases to capabilities)
# Use base aliases - Foundry Local will automatically select CPU variants
CATALOG = {
    'phi-4-mini': {
        'capabilities': ['general', 'summarize', 'reasoning'],
        'priority': 3
    },
    'qwen2.5-0.5b': {
        'capabilities': ['classification', 'fast', 'general'],
        'priority': 1
    },
    'phi-3.5-mini': {
        'capabilities': ['code', 'refactor', 'technical'],
        'priority': 2
    },
    'qwen2.5-coder-0.5b': {
        'capabilities': ['code', 'programming', 'debug'],
        'priority': 1
    }
}

# Filter to only include models recommended for this system
CATALOG = {k: v for k, v in CATALOG.items() if k in model_aliases}

print('üìã Active Model Catalog (Hardware-Optimized Aliases)')
print('=' * 70)
print('üí° Using model aliases - Foundry automatically selects CPU variants')
print()
for model, info in CATALOG.items():
    caps = ', '.join(info['capabilities'])
    print(f'   ‚Ä¢ {model}')
    print(f'     Capabilities: {caps}')
    print(f'     Priority: {info["priority"]}')
    print()

# Intent detection rules (regex pattern -> intent label)
INTENT_RULES = [
    (re.compile(r'code|refactor|function|debug|program', re.I), 'code'),
    (re.compile(r'summar|abstract|tl;?dr|brief', re.I), 'summarize'),
    (re.compile(r'classif|categor|label|sentiment', re.I), 'classification'),
    (re.compile(r'explain|teach|describe', re.I), 'general'),
]

def detect_intent(prompt: str) -> str:
    """Detect intent from prompt using regex patterns.
    
    Args:
        prompt: User input text
        
    Returns:
        Intent label: 'code', 'summarize', 'classification', or 'general'
    """
    for pattern, intent in INTENT_RULES:
        if pattern.search(prompt):
            return intent
    return 'general'

def pick_model(intent: str) -> str:
    """Select best model for intent based on capabilities and priority.
    
    Args:
        intent: Detected intent category
        
    Returns:
        Model alias string, or first available model if no match
    """
    candidates = [
        (alias, info['priority']) 
        for alias, info in CATALOG.items() 
        if intent in info['capabilities']
    ]
    
    if candidates:
        # Sort by priority (higher = better)
        candidates.sort(key=lambda x: x[1], reverse=True)
        return candidates[0][0]
    
    # Fallback to first available model
    return list(CATALOG.keys())[0] if CATALOG else None

print('‚úÖ Intent detection and model selection configured')
print('=' * 70)

üìã Active Model Catalog (Hardware-Optimized Aliases)
üí° Using model aliases - Foundry automatically selects CPU variants

   ‚Ä¢ phi-4-mini
     Capabilities: general, summarize, reasoning
     Priority: 3

   ‚Ä¢ qwen2.5-0.5b
     Capabilities: classification, fast, general
     Priority: 1

   ‚Ä¢ phi-3.5-mini
     Capabilities: code, refactor, technical
     Priority: 2

   ‚Ä¢ qwen2.5-coder-0.5b
     Capabilities: code, programming, debug
     Priority: 1

‚úÖ Intent detection and model selection configured

üí° Using model aliases - Foundry automatically selects CPU variants

   ‚Ä¢ phi-4-mini
     Capabilities: general, summarize, reasoning
     Priority: 3

   ‚Ä¢ qwen2.5-0.5b
     Capabilities: classification, fast, general
     Priority: 1

   ‚Ä¢ phi-3.5-mini
     Capabilities: code, refactor, technical
     Priority: 2

   ‚Ä¢ qwen2.5-coder-0.5b
     Capabilities: code, programming, debug
     Priority: 1

‚úÖ Intent detection and model selection configured


## üß™ Paso 5: Probar la Detecci√≥n de Intenciones

Verifica que el sistema de detecci√≥n de intenciones clasifique correctamente los diferentes tipos de solicitudes.


In [111]:
# Test intent detection with sample prompts
test_prompts = [
    'Refactor this Python function for better readability',
    'Summarize the key points of this article',
    'Classify this customer feedback as positive or negative',
    'Explain how edge AI differs from cloud AI',
    'Write a function to calculate fibonacci numbers',
    'Give me a brief overview of small language models'
]

print('üß™ Testing Intent Detection')
print('=' * 70)

for prompt in test_prompts:
    intent = detect_intent(prompt)
    model = pick_model(intent)
    print(f'\nPrompt: {prompt[:50]}...')
    print(f'   Intent: {intent:15s} ‚Üí Model: {model}')

print('\n' + '=' * 70)
print('‚úÖ Intent detection working correctly')

üß™ Testing Intent Detection

Prompt: Refactor this Python function for better readabili...
   Intent: code            ‚Üí Model: phi-3.5-mini

Prompt: Summarize the key points of this article...
   Intent: summarize       ‚Üí Model: phi-4-mini

Prompt: Classify this customer feedback as positive or neg...
   Intent: classification  ‚Üí Model: qwen2.5-0.5b

Prompt: Explain how edge AI differs from cloud AI...
   Intent: general         ‚Üí Model: phi-4-mini

Prompt: Write a function to calculate fibonacci numbers...
   Intent: code            ‚Üí Model: phi-3.5-mini

Prompt: Give me a brief overview of small language models...
   Intent: summarize       ‚Üí Model: phi-4-mini

‚úÖ Intent detection working correctly


## üöÄ Paso 6: Implementar la Funci√≥n de Enrutamiento

Crea la funci√≥n principal de enrutamiento que:
1. Detecte la intenci√≥n a partir del mensaje
2. Seleccione el modelo √≥ptimo
3. Ejecute la solicitud mediante Foundry Local SDK
4. Realice un seguimiento del uso de tokens y errores

**Utiliza el patr√≥n workshop_utils**:
- Reintento autom√°tico con retroceso exponencial
- API compatible con OpenAI
- Seguimiento de tokens y manejo de errores


In [112]:
import os
from workshop_utils import chat_once

# Fix RETRY_BACKOFF environment variable if it has comments
if 'RETRY_BACKOFF' in os.environ:
    retry_val = os.environ['RETRY_BACKOFF'].strip().split()[0]
    try:
        float(retry_val)
        os.environ['RETRY_BACKOFF'] = retry_val
    except ValueError:
        os.environ['RETRY_BACKOFF'] = '1.0'

def route(prompt: str, max_tokens: int = 200, temperature: float = 0.7):
    """Route prompt to appropriate model based on intent.
    
    Pipeline:
    1. Detect intent using regex patterns
    2. Select best model by capability + priority
    3. Execute via Foundry Local SDK
    
    Args:
        prompt: User input text
        max_tokens: Maximum tokens in response
        temperature: Sampling temperature (0-1)
        
    Returns:
        Dict with: intent, model, output, tokens, usage, error
    """
    intent = detect_intent(prompt)
    model_alias = pick_model(intent)
    
    if not model_alias:
        return {
            'intent': intent,
            'model': None,
            'output': '',
            'tokens': None,
            'usage': {},
            'error': 'No suitable model found'
        }
    
    try:
        # Call Foundry Local via workshop_utils
        text, usage = chat_once(
            model_alias,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
            temperature=temperature
        )
        
        # Extract token information
        usage_info = {}
        if usage:
            usage_info['prompt_tokens'] = getattr(usage, 'prompt_tokens', None)
            usage_info['completion_tokens'] = getattr(usage, 'completion_tokens', None)
            usage_info['total_tokens'] = getattr(usage, 'total_tokens', None)
        
        # Estimate if not provided
        if not usage_info.get('total_tokens'):
            est_prompt = len(prompt) // 4
            est_completion = len(text or '') // 4
            usage_info['estimated_tokens'] = est_prompt + est_completion
        
        return {
            'intent': intent,
            'model': model_alias,
            'output': (text or '').strip(),
            'tokens': usage_info.get('total_tokens') or usage_info.get('estimated_tokens'),
            'usage': usage_info,
            'error': None
        }
    
    except Exception as e:
        return {
            'intent': intent,
            'model': model_alias,
            'output': '',
            'tokens': None,
            'usage': {},
            'error': f'{type(e).__name__}: {str(e)}'
        }

print('‚úÖ Routing function ready')
print('   Using Foundry Local SDK via workshop_utils')
print('   Token tracking: Enabled')
print('   Retry logic: Automatic with exponential backoff')

‚úÖ Routing function ready
   Using Foundry Local SDK via workshop_utils
   Token tracking: Enabled
   Retry logic: Automatic with exponential backoff


## üéØ Paso 7: Ejecutar pruebas de enrutamiento

Prueba el sistema de enrutamiento completo con varios prompts para demostrar:
- Detecci√≥n autom√°tica de intenci√≥n
- Selecci√≥n inteligente de modelos
- Enrutamiento entre m√∫ltiples modelos con modelos retenidos
- Seguimiento de tokens y m√©tricas de rendimiento


In [None]:
# Test prompts covering all intent categories
test_cases = [
    {
        'prompt': 'Refactor this Python function to make it more efficient and readable',
        'expected_intent': 'code'
    },
    {
        'prompt': 'Summarize the key benefits of using small language models at the edge',
        'expected_intent': 'summarize'
    },
    {
        'prompt': 'Classify this user feedback: The app is slow but the UI looks great',
        'expected_intent': 'classification'
    },
    {
        'prompt': 'Explain the difference between local and cloud inference',
        'expected_intent': 'general'
    },
    {
        'prompt': 'Write a Python function to calculate the Fibonacci sequence',
        'expected_intent': 'code'
    },
    {
        'prompt': 'Give me a brief overview of the Phi model family',
        'expected_intent': 'summarize'
    }
]

print('üéØ Running Intent-Based Routing Tests')
print('=' * 80)

results = []
for i, test in enumerate(test_cases, 1):
    print(f'\n[{i}/{len(test_cases)}] Testing prompt...')
    print(f'Prompt: {test["prompt"]}')
    
    result = route(test['prompt'], max_tokens=150)
    results.append(result)
    
    print(f'   Expected Intent: {test["expected_intent"]}')
    print(f'   Detected Intent: {result["intent"]} {"‚úÖ" if result["intent"] == test["expected_intent"] else "‚ö†Ô∏è"}')
    print(f'   Selected Model:  {result["model"]}')
    
    if result['error']:
        print(f'   ‚ùå Error: {result["error"]}')
    else:
        output_preview = result['output'][:100] + '...' if len(result['output']) > 100 else result['output']
        print(f'   ‚úÖ Response: {output_preview}')
        
        tokens = result.get('tokens', 0)
        if tokens:
            usage = result.get('usage', {})
            if 'estimated_tokens' in usage:
                print(f'   üìä Tokens: ~{tokens} (estimated)')
            else:
                print(f'   üìä Tokens: {tokens}')

# Summary statistics
print('\n' + '=' * 80)
print('üìä ROUTING SUMMARY')
print('=' * 80)

success_count = sum(1 for r in results if not r['error'])
total_tokens = sum(r.get('tokens', 0) or 0 for r in results if not r['error'])
intent_accuracy = sum(1 for i, r in enumerate(results) if r['intent'] == test_cases[i]['expected_intent'])

print(f'Total Prompts:        {len(results)}')
print(f'‚úÖ Successful:         {success_count}/{len(results)}')
print(f'‚ùå Failed:             {len(results) - success_count}')
print(f'üéØ Intent Accuracy:    {intent_accuracy}/{len(results)} ({intent_accuracy/len(results)*100:.1f}%)')
print(f'üìä Total Tokens Used:  {total_tokens}')

# Model usage distribution
print('\nüìã Model Usage Distribution:')
model_counts = {}
for r in results:
    if r['model']:
        model_counts[r['model']] = model_counts.get(r['model'], 0) + 1

for model, count in sorted(model_counts.items(), key=lambda x: x[1], reverse=True):
    percentage = (count / len(results)) * 100
    print(f'   ‚Ä¢ {model}: {count} requests ({percentage:.1f}%)')

if success_count == len(results):
    print('\nüéâ All routing tests passed successfully!')
else:
    print(f'\n‚ö†Ô∏è  {len(results) - success_count} test(s) failed')
    print('   Check Foundry Local service: foundry service status')
    print('   Verify models loaded: foundry model ls')

print('=' * 80)

üéØ Running Intent-Based Routing Tests

[1/6] Testing prompt...
Prompt: Refactor this Python function to make it more efficient and readable


   Expected Intent: code
   Detected Intent: code ‚úÖ
   Selected Model:  phi-3.5-mini
   ‚úÖ Response: To refactor a Python function for efficiency and readability, I would need to see the specific funct...
   üìä Tokens: ~158 (estimated)

[2/6] Testing prompt...
Prompt: Summarize the key benefits of using small language models at the edge
   Expected Intent: summarize
   Detected Intent: summarize ‚úÖ
   Selected Model:  phi-4-mini
   ‚ùå Error: APIConnectionError: Connection error.

[3/6] Testing prompt...
Prompt: Classify this user feedback: The app is slow but the UI looks great
   Expected Intent: classification
   Detected Intent: classification ‚úÖ
   Selected Model:  qwen2.5-0.5b
   ‚ùå Error: APIConnectionError: Connection error.

[4/6] Testing prompt...
Prompt: Explain the difference between local and cloud inference
   Expected Intent: general
   Detected Intent: general ‚úÖ
   Selected Model:  phi-4-mini
   ‚ùå Error: APIConnectionError: Connection error.

[5/6] Testing p

## üîß Paso 8: Pruebas Interactivas

¬°Prueba tus propios comandos para ver el sistema de enrutamiento en acci√≥n!


In [None]:
# Interactive testing - modify the prompt and run this cell
custom_prompt = "Explain how model quantization reduces memory usage"

print('üéØ Interactive Routing Test')
print('=' * 80)
print(f'Your prompt: {custom_prompt}')
print()

result = route(custom_prompt, max_tokens=200)

print(f'Detected Intent: {result["intent"]}')
print(f'Selected Model:  {result["model"]}')
print()

if result['error']:
    print(f'‚ùå Error: {result["error"]}')
else:
    print('‚úÖ Response:')
    print('-' * 80)
    print(result['output'])
    print('-' * 80)
    
    if result['tokens']:
        print(f'\nüìä Tokens used: {result["tokens"]}')

print('\nüí° Try different prompts to test routing behavior!')

üéØ Interactive Routing Test
Your prompt: Explain how model quantization reduces memory usage

Detected Intent: general
Selected Model:  phi-4-mini

‚úÖ Response:
--------------------------------------------------------------------------------
Model quantization is a technique used to reduce the memory footprint of a machine learning model, particularly deep learning models. It works by converting the high-precision weights of a neural network, typically represented as 32-bit floating-point numbers, into lower-precision representations, such as 8-bit integers or even binary values.


The primary reason for quantization is to decrease the amount of memory required to store the model's parameters. Since floating-point numbers take up more space than integers, by quantizing the weights, we can significantly reduce the model's size. This reduction in size not only saves memory but also can lead to faster computation during inference, as integer operations are generally faster than floatin

## üìä Paso 9: An√°lisis de Rendimiento

Analiza el rendimiento del sistema de enrutamiento y la utilizaci√≥n del modelo.


In [None]:
import time

# Performance benchmark
benchmark_prompts = [
    'Write a hello world function',
    'Summarize: AI at the edge is powerful',
    'Classify: Good product',
    'Explain edge computing'
]

print('‚ö° Performance Benchmark')
print('=' * 80)

timings = []
for prompt in benchmark_prompts:
    start = time.time()
    result = route(prompt, max_tokens=50)
    duration = time.time() - start
    timings.append(duration)
    
    print(f'\nPrompt: {prompt[:40]}...')
    print(f'   Model: {result["model"]}')
    print(f'   Time: {duration:.2f}s')
    if result.get('tokens'):
        print(f'   Tokens: {result["tokens"]}')

print('\n' + '=' * 80)
print('üìä Performance Statistics:')
print(f'   Average response time: {sum(timings)/len(timings):.2f}s')
print(f'   Fastest response:      {min(timings):.2f}s')
print(f'   Slowest response:      {max(timings):.2f}s')
print('\nüí° Note: First request may be slower due to model initialization')
print('=' * 80)

‚ö° Performance Benchmark

Prompt: Write a hello world function...
   Model: phi-3.5-mini
   Time: 3.31s
   Tokens: 60

Prompt: Write a hello world function...
   Model: phi-3.5-mini
   Time: 3.31s
   Tokens: 60

Prompt: Summarize: AI at the edge is powerful...
   Model: phi-4-mini
   Time: 49.67s
   Tokens: 84

Prompt: Summarize: AI at the edge is powerful...
   Model: phi-4-mini
   Time: 49.67s
   Tokens: 84

Prompt: Classify: Good product...
   Model: qwen2.5-0.5b
   Time: 7.21s
   Tokens: 69

Prompt: Classify: Good product...
   Model: qwen2.5-0.5b
   Time: 7.21s
   Tokens: 69

Prompt: Explain edge computing...
   Model: phi-4-mini
   Time: 49.67s
   Tokens: 72

üìä Performance Statistics:
   Average response time: 27.46s
   Fastest response:      3.31s
   Slowest response:      49.67s

üí° Note: First request may be slower due to model initialization

Prompt: Explain edge computing...
   Model: phi-4-mini
   Time: 49.67s
   Tokens: 72

üìä Performance Statistics:
   Average res

## üéì Puntos Clave y Pr√≥ximos Pasos

### ‚úÖ Lo que Has Aprendido

1. **Enrutamiento Basado en Intenciones**: Clasificar autom√°ticamente los prompts y dirigirlos a modelos especializados  
2. **Selecci√≥n Consciente de Memoria**: Elegir modelos de CPU seg√∫n la RAM disponible en el sistema  
3. **Retenci√≥n Multi-Modelo**: Usar `--retain true` para mantener m√∫ltiples modelos cargados  
4. **Patrones de Producci√≥n**: L√≥gica de reintentos, manejo de errores y seguimiento de tokens  
5. **Optimizaci√≥n para CPU**: Implementar de manera eficiente sin necesidad de GPU  

### üöÄ Ideas para Experimentar

1. **Agregar Intenciones Personalizadas**:  
   ```python
   INTENT_RULES.append(
       (re.compile(r'translate|convert', re.I), 'translation')
   )
   ```
  
2. **Cargar Modelos Adicionales**:  
   ```bash
   foundry model run llama-3.2-1b-cpu --retain true
   ```
  
3. **Ajustar la Selecci√≥n de Modelos**:  
   - Modificar los valores de prioridad en CATALOG  
   - A√±adir m√°s etiquetas de capacidad  
   - Implementar estrategias de respaldo  

4. **Monitorear el Rendimiento**:  
   ```python
   import psutil
   print(f"Memory: {psutil.virtual_memory().percent}%")
   ```
  

### üìö Recursos Adicionales

- **Foundry Local SDK**: https://github.com/microsoft/Foundry-Local  
- **Ejemplos del Taller**: ../samples/  
- **Curso de Edge AI**: ../../Module08/  

### üí° Mejores Pr√°cticas

‚úÖ Usa modelos de CPU para un comportamiento consistente entre plataformas  
‚úÖ Verifica siempre la memoria del sistema antes de cargar m√∫ltiples modelos  
‚úÖ Usa `--retain true` para escenarios de enrutamiento  
‚úÖ Implementa un manejo adecuado de errores y reintentos  
‚úÖ Realiza un seguimiento del uso de tokens para optimizar costos y rendimiento  

---

**üéâ ¬°Felicidades!** Has construido un enrutador de modelos basado en intenciones listo para producci√≥n utilizando Foundry Local SDK con modelos optimizados para CPU.



---

**Descargo de responsabilidad**:  
Este documento ha sido traducido utilizando el servicio de traducci√≥n autom√°tica [Co-op Translator](https://github.com/Azure/co-op-translator). Aunque nos esforzamos por garantizar la precisi√≥n, tenga en cuenta que las traducciones automatizadas pueden contener errores o imprecisiones. El documento original en su idioma nativo debe considerarse como la fuente autorizada. Para informaci√≥n cr√≠tica, se recomienda una traducci√≥n profesional realizada por humanos. No nos hacemos responsables de malentendidos o interpretaciones err√≥neas que puedan surgir del uso de esta traducci√≥n.
