# Session 4: SLM vs. LLM Comparison

Compare the latency and sample-response quality of a small language model and a larger model, both running on Foundry Local.


## ⚡ Quick Start

**Memory-optimized setup (updated):**  
1. Models automatically resolve to a hardware-appropriate variant, such as CPU (works on any hardware)  
2. Uses `qwen2.5-3b` instead of 7B (saves ~4 GB RAM)  
3. Automatic port detection (no manual configuration required)  
4. Total memory needed: ~8 GB recommended (models + operating system)  

**Terminal setup (30 seconds):**  
```bash
foundry service start
foundry model run phi-4-mini
foundry model run qwen2.5-3b
```
  
Then run this notebook! 🚀  


### Explanation: Dependency Installation
Installs the minimal required packages (`foundry-local-sdk`, `openai`, `numpy`, and `requests`) for timing, chat requests, and the service diagnostic. The install is idempotent, so it is safe to run repeatedly.


# Scenario
Compare a representative small language model (SLM) with a larger model on a single prompt to illustrate the trade-offs:
- **Latency difference** (wall-clock time in seconds)
- **Token usage** (when available) as a proxy for throughput
- **Qualitative sample output** for a quick sanity check
- **Speedup calculation** to quantify the performance gain
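
The speedup and throughput figures reduce to two divisions. The sketch below uses made-up latencies and token counts purely for illustration; they are not measurements from this notebook, and the helper names are hypothetical.

```python
def speedup(slm_latency_s: float, llm_latency_s: float) -> float:
    """How many times faster the SLM answered (values > 1 favor the SLM)."""
    if slm_latency_s <= 0:
        raise ValueError("SLM latency must be positive")
    return llm_latency_s / slm_latency_s


def tokens_per_second(tokens: int, latency_s: float) -> float:
    """Rough throughput: tokens divided by wall-clock seconds."""
    if latency_s <= 0:
        raise ValueError("latency must be positive")
    return tokens / latency_s


# Hypothetical measurements, for illustration only.
slm = {"latency_s": 2.0, "tokens": 200}
llm = {"latency_s": 8.0, "tokens": 240}

print(f"speedup: {speedup(slm['latency_s'], llm['latency_s']):.2f}x")          # 4.00x
print(f"SLM: {tokens_per_second(slm['tokens'], slm['latency_s']):.1f} tok/s")  # 100.0
print(f"LLM: {tokens_per_second(llm['tokens'], llm['latency_s']):.1f} tok/s")  # 30.0
```

A ratio above 1 means the SLM answered faster. Throughput computed from total tokens also counts prompt tokens, so treat it as a rough indicator.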

**Environment variables:**
- `SLM_ALIAS` - Small language model (default: phi-4-mini, ~4 GB RAM)
- `LLM_ALIAS` - Larger language model (default: qwen2.5-3b, ~3 GB RAM)
- `COMPARE_PROMPT` - Test prompt for the comparison
- `COMPARE_RETRIES` - Retry attempts for resilience (default: 2)
- `FOUNDRY_LOCAL_ENDPOINT` - Overrides the service endpoint (auto-detected if unset)
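
One possible way to read these variables with safe fallbacks is sketched below. The defaults mirror the configuration cell in this notebook. `env_int` is a hypothetical helper, and the notebook cells themselves pass a hard-coded retry count rather than reading `COMPARE_RETRIES`.

```python
import os


def env_int(name: str, default: int) -> int:
    """Read an integer environment variable, falling back on missing or bad input."""
    raw = os.getenv(name)
    if raw is None or not raw.strip():
        return default
    try:
        return int(raw)
    except ValueError:
        return default


SLM = os.getenv("SLM_ALIAS", "phi-4-mini")
LLM = os.getenv("LLM_ALIAS", "qwen2.5-3b")
PROMPT = os.getenv("COMPARE_PROMPT", "List 5 benefits of local AI inference.")
RETRIES = env_int("COMPARE_RETRIES", 2)
# An empty string is treated as "not set" so auto-detection still applies.
ENDPOINT = os.getenv("FOUNDRY_LOCAL_ENDPOINT") or None

print(SLM, LLM, RETRIES, ENDPOINT or "auto-detect")
```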

**How it works (official SDK pattern):**
1. **FoundryLocalManager** initializes and manages the Foundry Local service
2. The service starts automatically if it is not running (no manual setup required)
3. Model aliases are resolved automatically to concrete model IDs
4. Hardware-optimized variants are selected (CUDA, NPU, or CPU)
5. An OpenAI-compatible client runs the chat completions
6. Metrics are captured: latency, tokens, output quality
7. Results are compared to compute the speed ratio

This micro-comparison helps you decide when routing to a larger model is justified for your use case.

**SDK reference:** 
- Python SDK: https://github.com/microsoft/Foundry-Local/tree/main/sdk/python/foundry_local
- Workshop utils: uses the official pattern from ../samples/workshop_utils.py

**Key benefits:**
- ✅ Automatic service discovery and initialization
- ✅ Automatic service start if it is not active
- ✅ Built-in model resolution and caching
- ✅ Hardware optimization (CUDA/NPU/CPU)
- ✅ OpenAI SDK compatibility
- ✅ Robust error handling with retries
- ✅ Local inference (no cloud API required)


## 🚨 Prerequisites: Foundry Local Must Be Running!

**Before you run this notebook**, make sure the Foundry Local service is set up:

### Quick-start commands (run in a terminal):

```bash
# 1. Start the Foundry Local service
foundry service start

# 2. Load the default models used in this comparison (CPU-optimized)
foundry model run phi-4-mini
foundry model run qwen2.5-3b

# 3. Verify models are loaded
foundry model ls

# 4. Check service health
foundry service status
```

### Alternative models (if the default models are unavailable):

```bash
# Even smaller alternatives (if memory is very limited)
foundry model run phi-3.5-mini
foundry model run qwen2.5-0.5b

# Or update the environment variables in this notebook:
# SLM_ALIAS = 'phi-3.5-mini'
# LLM_ALIAS = 'qwen2.5-1.5b'  # Or qwen2.5-0.5b for minimum memory
```

⚠️ **If you skip these steps**, running the notebook cells below raises an `APIConnectionError`.


In [29]:
# Install dependencies
!pip install -q foundry-local-sdk openai numpy requests

### Explanation: Key Imports
Imports timing utilities plus the Foundry Local and OpenAI clients used to fetch model information and run chat completions.


In [30]:
import os, time, json
from foundry_local import FoundryLocalManager
from openai import OpenAI
import sys
sys.path.append('../samples')
from workshop_utils import get_client, chat_once

### Explanation: Aliases & Prompt Setup
Defines environment-driven aliases for the small and larger models, plus a comparison prompt. Adjust the environment variables to experiment with different model families or tasks.


In [31]:
# Default to CPU models for better memory efficiency
SLM = os.getenv('SLM_ALIAS', 'phi-4-mini')  # Auto-selects CPU variant
LLM = os.getenv('LLM_ALIAS', 'qwen2.5-3b')  # Smaller LLM, more memory-friendly
PROMPT = os.getenv('COMPARE_PROMPT', 'List 5 benefits of local AI inference.')
# Endpoint is now managed by FoundryLocalManager - it auto-detects or can be overridden
ENDPOINT = os.getenv('FOUNDRY_LOCAL_ENDPOINT', None)

### 💡 Memory-Optimized Configuration

**This notebook defaults to memory-efficient models:**
- `phi-4-mini` → ~4 GB RAM (Foundry Local automatically selects a variant for your hardware, such as the CPU variant)
- `qwen2.5-3b` → ~3 GB RAM (instead of 7B, which needs ~7 GB or more)

**Automatic port detection:**
- Foundry Local can use different ports (often 55769 or 59959)
- The diagnostic cell below automatically detects the right port
- No manual configuration required!
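
Port detection can be approximated with the standard library alone. The sketch below probes the two ports mentioned above. `find_open_port` is a hypothetical helper, and an open port only shows that something is listening, not that it is Foundry Local.

```python
import socket
from typing import Iterable, Optional


def find_open_port(ports: Iterable[int], host: str = "127.0.0.1",
                   timeout: float = 0.5) -> Optional[int]:
    """Return the first port that accepts a TCP connection, else None."""
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex returns 0 on success instead of raising.
            if sock.connect_ex((host, port)) == 0:
                return port
    return None


port = find_open_port([59959, 55769])
print(f"Listening port: {port}" if port else "No candidate port is open")
```

In practice, `FoundryLocalManager` reports the real endpoint, so a probe like this is only useful as a quick diagnostic.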

**If you have less than 8 GB of RAM, use even smaller models:**
```python
SLM = 'phi-3.5-mini'      # ~2GB
LLM = 'qwen2.5-0.5b'      # ~500MB
```
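
The 8 GB rule of thumb can be written as a small selection helper. This is a sketch only: the footprints are the approximate figures quoted in this notebook, and `pick_pair` is a hypothetical name.

```python
from typing import Tuple

# Approximate footprints in GB, taken from the figures quoted in this notebook.
MODEL_RAM_GB = {
    "phi-4-mini": 4.0,
    "qwen2.5-3b": 3.0,
    "phi-3.5-mini": 2.0,
    "qwen2.5-0.5b": 0.5,
}


def pick_pair(total_ram_gb: float) -> Tuple[str, str]:
    """Return (SLM, LLM) aliases following the 8 GB rule of thumb."""
    if total_ram_gb >= 8:
        return "phi-4-mini", "qwen2.5-3b"
    return "phi-3.5-mini", "qwen2.5-0.5b"


slm, llm = pick_pair(total_ram_gb=6)
print(slm, llm, f"~{MODEL_RAM_GB[slm] + MODEL_RAM_GB[llm]:.1f} GB for models")
```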


In [32]:
# Display current configuration
print("="*60)
print("CURRENT CONFIGURATION")
print("="*60)
print(f"SLM Model:     {SLM}")
print(f"LLM Model:     {LLM}")
print(f"SDK Pattern:   FoundryLocalManager (official)")
print(f"Endpoint:      {ENDPOINT or 'Auto-detect'}")
print(f"Test Prompt:   {PROMPT[:50]}...")
print(f"Retry Count:   2")
print("="*60)
print("\nüí° Using official Foundry SDK pattern from workshop_utils")
print("   ‚Üí FoundryLocalManager handles service lifecycle")
print("   ‚Üí Automatic model resolution and hardware optimization")
print("   ‚Üí OpenAI-compatible API for inference")

CURRENT CONFIGURATION
SLM Model:     phi-4-mini
LLM Model:     qwen2.5-7b
SDK Pattern:   FoundryLocalManager (official)
Endpoint:      Auto-detect
Test Prompt:   List 5 benefits of local AI inference....
Retry Count:   2

💡 Using official Foundry SDK pattern from workshop_utils
   → FoundryLocalManager handles service lifecycle
   → Automatic model resolution and hardware optimization
   → OpenAI-compatible API for inference


### Explanation: Execution Helpers (Foundry SDK Pattern)
Uses the official Foundry Local SDK pattern as documented in the workshop samples:

**Approach:**
- **FoundryLocalManager** - Initializes and manages the Foundry Local service
- **Auto-discovery** - Detects endpoints automatically and manages the service lifecycle
- **Model resolution** - Converts aliases into full model IDs (e.g. phi-4-mini → phi-4-mini-instruct-cpu)
- **Hardware optimization** - Selects the best variant for the available hardware (CUDA, NPU, or CPU)
- **OpenAI client** - Configured with the manager's endpoint for OpenAI-compatible API access

**Resilience features:**
- Retry logic with exponential backoff (retry count set through the `retries` argument)
- Automatic service start if it is not active
- Connection verification after initialization
- Error handling with detailed error reporting
- Model caching to avoid repeated initialization
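
The retry behavior can be isolated into a generic helper. The sketch below is not the notebook's `setup()` function: it re-raises the last error instead of returning error metadata, and `retry_with_backoff` with its injectable `sleep` parameter is a hypothetical name chosen for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(fn: Callable[[], T], retries: int = 3,
                       initial_delay: float = 2.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn until it succeeds, doubling the wait after each failure."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    delay = initial_delay
    for attempt in range(1, retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay *= 2
    raise AssertionError("unreachable")


# Demo: a call that fails twice, then succeeds. Waits are recorded, not slept.
attempts = []
waits = []


def flaky() -> str:
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("service not ready")
    return "connected"


print(retry_with_backoff(flaky, retries=3, sleep=waits.append))  # connected
print(waits)  # [2.0, 4.0]
```

Passing `sleep` as a parameter keeps the helper testable without real delays.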

**Result structure:**
- Latency measurement (wall-clock time)
- Token usage tracking (when available)
- Sample output (truncated for readability)
- Error details for failed requests
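
When the API reports no usage data, the notebook falls back to a rough estimate of about 4 characters per token for English text. A minimal sketch of that fallback, with hypothetical helper names, could look like this:

```python
from typing import Optional, Tuple


def estimate_tokens(text: Optional[str], chars_per_token: int = 4) -> int:
    """Crude token estimate; tolerates None (e.g. an empty completion)."""
    return len(text or "") // chars_per_token


def resolve_token_count(reported_total: Optional[int], prompt: str,
                        completion: Optional[str]) -> Tuple[int, bool]:
    """Return (token_count, is_estimate), preferring the API-reported total."""
    if reported_total:
        return reported_total, False
    return estimate_tokens(prompt) + estimate_tokens(completion), True


print(resolve_token_count(57, "ignored", "ignored"))  # (57, False)
print(resolve_token_count(None, "a" * 40, "b" * 81))  # (30, True)
```

The estimate is only a stand-in; real tokenizers vary by model and language, so estimated and reported counts should not be compared directly.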

This pattern relies on the workshop_utils module, which follows the official SDK pattern.

**SDK reference:**
- Main repository: https://github.com/microsoft/Foundry-Local
- Python SDK: https://github.com/microsoft/Foundry-Local/tree/main/sdk/python/foundry_local
- Workshop utils: ../samples/workshop_utils.py


In [39]:
def setup(alias: str, endpoint: str = None, retries: int = 3):
    """
    Initialize a Foundry Local model connection using official SDK pattern.
    
    This follows the workshop_utils pattern which uses FoundryLocalManager
    to properly initialize the Foundry Local service and resolve models.
    
    Args:
        alias: Model alias (e.g., 'phi-4-mini', 'qwen2.5-3b')
        endpoint: Optional endpoint override (usually auto-detected)
        retries: Number of connection attempts (default: 3)
    
    Returns:
        tuple: (manager, client, model_id, metadata) or (None, None, alias, error_metadata) if failed
    """
    import time
    
    last_err = None
    current_delay = 2  # seconds
    
    for attempt in range(1, retries + 1):
        try:
            print(f"[Init] Connecting to '{alias}' (attempt {attempt}/{retries})...")
            
            # Use the workshop utility which follows the official SDK pattern
            manager, client, model_id = get_client(alias, endpoint=endpoint)
            
            print(f"[OK] Connected to '{alias}' -> {model_id}")
            print(f"     Endpoint: {manager.endpoint}")
            
            return manager, client, model_id, {
                'endpoint': manager.endpoint,
                'resolved': model_id,
                'attempts': attempt,
                'status': 'success'
            }
            
        except Exception as e:
            last_err = e
            error_msg = str(e)
            
            # Provide helpful error messages
            if "connection error" in error_msg.lower() or "connection refused" in error_msg.lower():
                print(f"[ERROR] Cannot connect to Foundry Local service")
                print(f"        → Is the service running? Try: foundry service start")
                print(f"        → Is the model loaded? Try: foundry model run {alias}")
            elif "not found" in error_msg.lower():
                print(f"[ERROR] Model '{alias}' not found in catalog")
                print(f"        → Available models: Run 'foundry model ls' in terminal")
                print(f"        → Download model: Run 'foundry model download {alias}'")
            else:
                print(f"[ERROR] Setup failed: {type(e).__name__}: {error_msg}")
            
            if attempt < retries:
                print(f"[Retry] Waiting {current_delay:.1f}s before retry...")
                time.sleep(current_delay)
                current_delay *= 2  # Exponential backoff
    
    # All retries failed - provide actionable guidance
    print(f"\n❌ Failed to initialize '{alias}' after {retries} attempts")
    print(f"   Last error: {type(last_err).__name__}: {str(last_err)}")
    print(f"\n💡 Troubleshooting steps:")
    print(f"   1. Ensure Foundry Local service is running:")
    print(f"      → foundry service status")
    print(f"      → foundry service start (if not running)")
    print(f"   2. Ensure model is loaded:")
    print(f"      → foundry model run {alias}")
    print(f"   3. Check available models:")
    print(f"      → foundry model ls")
    print(f"   4. Try alternative models if '{alias}' isn't available")
    
    return None, None, alias, {
        'error': f"{type(last_err).__name__}: {str(last_err)}",
        'endpoint': endpoint or 'auto-detect',
        'attempts': retries,
        'status': 'failed'
    }


def run(client, model_id: str, prompt: str, max_tokens: int = 180, temperature: float = 0.5):
    """
    Run inference with the configured model using OpenAI SDK.
    
    Args:
        client: OpenAI client instance (configured for Foundry Local)
        model_id: Model identifier (resolved from alias)
        prompt: Input prompt
        max_tokens: Maximum response tokens (default: 180)
        temperature: Sampling temperature (default: 0.5)
    
    Returns:
        dict: Response with timing, tokens, and content
    """
    import time
    
    start = time.time()
    
    try:
        response = client.chat.completions.create(
            model=model_id,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
            temperature=temperature
        )
        
        elapsed = time.time() - start
        
        # Extract response details
        content = response.choices[0].message.content
        
        # Try to extract token usage from multiple possible locations
        usage_info = {}
        if hasattr(response, 'usage') and response.usage:
            usage_info['prompt_tokens'] = getattr(response.usage, 'prompt_tokens', None)
            usage_info['completion_tokens'] = getattr(response.usage, 'completion_tokens', None)
            usage_info['total_tokens'] = getattr(response.usage, 'total_tokens', None)
        
        # Calculate approximate token count if API doesn't provide it
        # Rough estimate: ~4 characters per token for English text
        if not usage_info.get('total_tokens'):
            estimated_prompt_tokens = len(prompt) // 4
            estimated_completion_tokens = len(content) // 4
            estimated_total = estimated_prompt_tokens + estimated_completion_tokens
            usage_info['estimated_tokens'] = estimated_total
            usage_info['estimated_prompt_tokens'] = estimated_prompt_tokens
            usage_info['estimated_completion_tokens'] = estimated_completion_tokens
        
        return {
            'status': 'success',
            'content': content,
            'elapsed_sec': elapsed,
            'tokens': usage_info.get('total_tokens') or usage_info.get('estimated_tokens'),
            'usage': usage_info,
            'model': model_id
        }
        
    except Exception as e:
        elapsed = time.time() - start
        return {
            'status': 'error',
            'error': f"{type(e).__name__}: {str(e)}",
            'elapsed_sec': elapsed,
            'model': model_id
        }


print("✅ Execution helpers defined: setup(), run()")
print("   → Uses workshop_utils for proper SDK integration")
print("   → setup() initializes with FoundryLocalManager")
print("   → run() executes inference via OpenAI-compatible API")
print("   → Token counting: Uses API data or estimates if unavailable")

✅ Execution helpers defined: setup(), run()
   → Uses workshop_utils for proper SDK integration
   → setup() initializes with FoundryLocalManager
   → run() executes inference via OpenAI-compatible API
   → Token counting: Uses API data or estimates if unavailable


### Explanation: Pre-Flight Self-Test
Runs a simple connection test with FoundryLocalManager for both models. It verifies that:
- The service is reachable
- Models can be initialized
- Aliases resolve to actual model IDs
- The connection is stable before the comparison runs

The `setup()` function uses the official SDK pattern from workshop_utils.


In [34]:
# Simplified diagnostic: Just verify service is accessible
import requests

def check_foundry_service():
    """Quick diagnostic to verify Foundry Local is running."""
    # Try common ports
    endpoints_to_try = [
        "http://localhost:59959",
        "http://127.0.0.1:59959", 
        "http://localhost:55769",
        "http://127.0.0.1:55769",
    ]
    
    print("[Diagnostic] Checking Foundry Local service...")
    
    for endpoint in endpoints_to_try:
        try:
            response = requests.get(f"{endpoint}/health", timeout=2)
            if response.status_code == 200:
                print(f"✅ Service is running at {endpoint}")
                
                # Try to list models
                try:
                    models_response = requests.get(f"{endpoint}/v1/models", timeout=2)
                    if models_response.status_code == 200:
                        models_data = models_response.json()
                        model_count = len(models_data.get('data', []))
                        print(f"✅ Found {model_count} models available")
                        if model_count > 0:
                            print("   Models:", [m.get('id', 'unknown') for m in models_data.get('data', [])[:5]])
                except Exception as e:
                    print(f"⚠️  Could not list models: {e}")
                
                return endpoint
        except requests.exceptions.ConnectionError:
            continue
        except Exception as e:
            print(f"⚠️  Error checking {endpoint}: {e}")
    
    print("\n❌ Foundry Local service not found!")
    print("\n💡 To fix this:")
    print("   1. Open a terminal")
    print("   2. Run: foundry service start")
    print("   3. Run: foundry model run phi-4-mini")
    print("   4. Run: foundry model run qwen2.5-3b")
    print("   5. Re-run this notebook")
    return None

# Run diagnostic
discovered_endpoint = check_foundry_service()

if discovered_endpoint:
    print(f"\n✅ Service detected (will be managed by FoundryLocalManager)")
else:
    print(f"\n⚠️  No service detected - FoundryLocalManager will attempt to start it")

[Diagnostic] Checking Foundry Local service...

❌ Foundry Local service not found!

💡 To fix this:
   1. Open a terminal
   2. Run: foundry service start
   3. Run: foundry model run phi-4-mini
   4. Run: foundry model run qwen2.5-3b
   5. Re-run this notebook

⚠️  No service detected - FoundryLocalManager will attempt to start it


In [35]:
# Quick Fix: Start service and load models from notebook
# Uncomment the commands you need:

# !foundry service start
# !foundry model run phi-4-mini
# !foundry model run qwen2.5-3b
# !foundry model ls

print("⚠️  The commands above are commented out.")
print("Uncomment them if you want to start the service from the notebook.")
print("")
print("💡 Recommended: Run these commands in a separate terminal instead:")
print("   foundry service start")
print("   foundry model run phi-4-mini")
print("   foundry model run qwen2.5-3b")

⚠️  The commands above are commented out.
Uncomment them if you want to start the service from the notebook.

💡 Recommended: Run these commands in a separate terminal instead:
   foundry service start
   foundry model run phi-4-mini
   foundry model run qwen2.5-3b


### 🛠️ Quick Fix: Start Foundry Local from the Notebook (Optional)

If the diagnostic shows that the service is not running, you can try starting it from the notebook by uncommenting the commands in the quick-fix cell.

**Note:** This works best on Windows. On other platforms, use the terminal commands.


### ⚠️ Troubleshooting Connection Problems

If you see `APIConnectionError`, the Foundry Local service may not be running or the models may not be loaded. Try these steps:

**1. Check the service status:**
```bash
# In a terminal (not in notebook):
foundry service status
```

**2. Start the service (if not active):**
```bash
foundry service start
```

**3. Load the required models:**
```bash
# Load the models needed for comparison
foundry model run phi-4-mini
foundry model run qwen2.5-3b

# Or use smaller alternative models:
foundry model run phi-3.5-mini
foundry model run qwen2.5-0.5b
```

**4. Verify that the models are available:**
```bash
foundry model ls
```

**Common problems:**
- ❌ Service not running → Run `foundry service start`
- ❌ Models not loaded → Run `foundry model run <model-name>`
- ❌ Port conflicts → Check whether another service is using the port
- ❌ Firewall blocking → Make sure local connections are allowed

**Quick fix:** Run the diagnostic cell before you run the pre-flight check.


In [36]:
preflight = {}
retries = 2  # Number of retry attempts

for a in (SLM, LLM):
    mgr, c, mid, info = setup(a, endpoint=ENDPOINT, retries=retries)
    # Keep the original status from info (either 'success' or 'failed')
    preflight[a] = info

print('\n[Pre-flight Check]')
for alias, details in preflight.items():
    status_icon = '✅' if details['status'] == 'success' else '❌'
    print(f"  {status_icon} {alias}: {details['status']} - {details.get('resolved', details.get('error', 'unknown'))}")

preflight

[Init] Connecting to 'phi-4-mini' (attempt 1/2)...
[OK] Connected to 'phi-4-mini' -> Phi-4-mini-instruct-cuda-gpu:4
     Endpoint: http://127.0.0.1:59959/v1
[Init] Connecting to 'qwen2.5-7b' (attempt 1/2)...
[OK] Connected to 'qwen2.5-7b' -> qwen2.5-7b-instruct-cuda-gpu:3
     Endpoint: http://127.0.0.1:59959/v1

[Pre-flight Check]
  ✅ phi-4-mini: success - Phi-4-mini-instruct-cuda-gpu:4
  ✅ qwen2.5-7b: success - qwen2.5-7b-instruct-cuda-gpu:3


{'phi-4-mini': {'endpoint': 'http://127.0.0.1:59959/v1',
  'resolved': 'Phi-4-mini-instruct-cuda-gpu:4',
  'attempts': 1,
  'status': 'success'},
 'qwen2.5-7b': {'endpoint': 'http://127.0.0.1:59959/v1',
  'resolved': 'qwen2.5-7b-instruct-cuda-gpu:3',
  'attempts': 1,
  'status': 'success'}}

### ✅ Pre-Flight Check: Model Availability

The pre-flight cell verifies that both models are reachable through the configured endpoint before the comparison runs.


### Explanation: Run the Comparison & Collect Results
Iterates over both aliases using the official Foundry SDK pattern:
1. Initialize each model with `setup()` (uses FoundryLocalManager)
2. Run inference through the OpenAI-compatible API
3. Capture latency, tokens, and sample output
4. Produce a JSON summary with a comparative analysis

This follows the same pattern as the workshop sample in session04/model_compare.py.


In [40]:
results = []
retries = 2  # Number of retry attempts

for alias in (SLM, LLM):
    mgr, client, mid, info = setup(alias, endpoint=ENDPOINT, retries=retries)
    if client:
        r = run(client, mid, PROMPT)
        results.append({'alias': alias, **r})
    else:
        # If setup failed, record error
        results.append({
            'alias': alias,
            'status': 'error',
            'error': info.get('error', 'Setup failed'),
            'elapsed_sec': 0,
            'tokens': None,
            'model': alias
        })

# Display results
print(json.dumps(results, indent=2))

# Quick comparative view
print('\n' + '='*80)
print('COMPARISON SUMMARY')
print('='*80)
print(f"{'Alias':<20} {'Status':<15} {'Latency(s)':<15} {'Tokens':<15}")
print('-'*80)

for row in results:
    status = row.get('status', 'unknown')
    status_icon = '✅' if status == 'success' else '❌'
    latency_str = f"{row.get('elapsed_sec', 0):.3f}" if row.get('elapsed_sec') else 'N/A'
    
    # Handle token display - show if available or indicate estimated
    tokens = row.get('tokens')
    usage = row.get('usage', {})
    if tokens:
        if 'estimated_tokens' in usage:
            tokens_str = f"~{tokens} (est.)"
        else:
            tokens_str = str(tokens)
    else:
        tokens_str = 'N/A'
    
    print(f"{status_icon} {row['alias']:<18} {status:<15} {latency_str:<15} {tokens_str:<15}")

print('-'*80)

# Show detailed token breakdown if available
print("\nDetailed Token Usage:")
for row in results:
    if row.get('status') == 'success' and row.get('usage'):
        usage = row['usage']
        print(f"\n  {row['alias']}:")
        if 'prompt_tokens' in usage and usage['prompt_tokens']:
            print(f"    Prompt tokens:     {usage['prompt_tokens']}")
            print(f"    Completion tokens: {usage['completion_tokens']}")
            print(f"    Total tokens:      {usage['total_tokens']}")
        elif 'estimated_tokens' in usage:
            print(f"    Estimated prompt:     {usage['estimated_prompt_tokens']}")
            print(f"    Estimated completion: {usage['estimated_completion_tokens']}")
            print(f"    Estimated total:      {usage['estimated_tokens']}")
            print(f"    (API did not provide token counts - using ~4 chars/token estimate)")

print('\n' + '='*80)

# Calculate speedup if both succeeded
if len(results) == 2 and all(r.get('status') == 'success' and r.get('elapsed_sec') for r in results):
    speedup = results[1]['elapsed_sec'] / results[0]['elapsed_sec']
    print(f"\nüí° SLM is {speedup:.2f}x faster than LLM for this prompt")
    
    # Compare token throughput if available
    slm_tokens = results[0].get('tokens', 0)
    llm_tokens = results[1].get('tokens', 0)
    if slm_tokens and llm_tokens:
        slm_tps = slm_tokens / results[0]['elapsed_sec']
        llm_tps = llm_tokens / results[1]['elapsed_sec']
        print(f"   SLM throughput: {slm_tps:.1f} tokens/sec")
        print(f"   LLM throughput: {llm_tps:.1f} tokens/sec")
        
elif any(r.get('status') == 'error' for r in results):
    print(f"\n⚠️  Some models failed - check errors above")
    print("   Ensure Foundry Local is running: foundry service start")
    print("   Ensure models are loaded: foundry model run <model-name>")

results

[Init] Connecting to 'phi-4-mini' (attempt 1/2)...
[OK] Connected to 'phi-4-mini' -> Phi-4-mini-instruct-cuda-gpu:4
     Endpoint: http://127.0.0.1:59959/v1
[Init] Connecting to 'qwen2.5-7b' (attempt 1/2)...
[OK] Connected to 'qwen2.5-7b' -> qwen2.5-7b-instruct-cuda-gpu:3
     Endpoint: http://127.0.0.1:59959/v1
[
  {
    "alias": "phi-4-mini",
    "status": "success",
    "content": "1. Reduced Latency: Local AI inference can significantly reduce latency by processing data closer to the source, which is particularly beneficial for real-time applications such as autonomous vehicles or augmented reality.\n\n2. Enhanced Privacy: By keeping data processing local, sensitive information is less likely to be exposed to external networks, thereby enhancing privacy and security.\n\n3. Lower Bandwidth Usage: Local AI inference reduces the n

[{'alias': 'phi-4-mini',
  'status': 'success',
  'content': '1. Reduced Latency: Local AI inference can significantly reduce latency by processing data closer to the source, which is particularly beneficial for real-time applications such as autonomous vehicles or augmented reality.\n\n2. Enhanced Privacy: By keeping data processing local, sensitive information is less likely to be exposed to external networks, thereby enhancing privacy and security.\n\n3. Lower Bandwidth Usage: Local AI inference reduces the need for data transmission over the network, which can save bandwidth and reduce the risk of network congestion.\n\n4. Improved Reliability: Local processing can be more reliable, as it is less dependent on network connectivity. This is particularly important in scenarios where network connectivity is unreliable or intermittent.\n\n5. Scalability: Local AI inference can be easily scaled by adding more local processing units, making it easier to handle increasing data volumes or m

### Interpreting the Results

**Key metrics:**
- **Latency**: Lower is better; it indicates faster response times
- **Tokens**: Tokens processed per request; dividing by latency gives throughput (higher is better)
- **Route**: Confirms which API endpoint was used
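
A single timed request is noisy, and the first call may include model load time. One way to get a steadier latency figure is to discard a warm-up call and report the median of several runs. `time_call` below is a hypothetical helper, and the lambda is a stand-in for a real model call.

```python
import statistics
import time
from typing import Callable, Dict


def time_call(fn: Callable[[], object], runs: int = 5,
              warmup: int = 1) -> Dict[str, float]:
    """Time fn over several runs and summarize wall-clock latency."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    for _ in range(warmup):
        fn()  # discard warm-up calls (model load, cache fill)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "max_s": max(samples),
    }


# Stand-in workload; replace the lambda with a real model call.
stats = time_call(lambda: sum(range(100_000)), runs=5)
print({k: round(v, 5) for k, v in stats.items()})
```

`time.perf_counter()` is used here because it is a monotonic clock intended for measuring short durations.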

**When to use an SLM vs. an LLM:**
- **SLM (Small Language Model)**: Fast responses, low resource usage, well suited to simple tasks
- **LLM (Large Language Model)**: Higher quality and better reasoning; use it when quality matters most

**Next steps:**
1. Try different prompts to see how complexity affects the comparison
2. Experiment with other model pairs
3. Use the workshop router samples (Session 06) to route intelligently based on task complexity
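
To make the routing idea concrete, here is a toy heuristic. It is not the Session 06 implementation: the keyword list, the word threshold, and the name `choose_model` are arbitrary assumptions for illustration.

```python
REASONING_HINTS = ("explain why", "prove", "step by step", "compare", "analyze")


def choose_model(prompt: str, slm: str = "phi-4-mini", llm: str = "qwen2.5-3b",
                 max_slm_words: int = 60) -> str:
    """Pick the larger model for long or reasoning-heavy prompts."""
    text = prompt.lower()
    is_long = len(text.split()) > max_slm_words
    needs_reasoning = any(hint in text for hint in REASONING_HINTS)
    return llm if (is_long or needs_reasoning) else slm


print(choose_model("List 5 benefits of local AI inference."))                    # phi-4-mini
print(choose_model("Compare two caching strategies and explain why one wins."))  # qwen2.5-3b
```

A real router would likely calibrate its thresholds against latency and quality measurements like the ones collected in this notebook.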


In [38]:
# Final Validation Check
print("="*70)
print("VALIDATION SUMMARY")
print("="*70)
print(f"✅ SLM Model: {SLM}")
print(f"✅ LLM Model: {LLM}")
print(f"✅ Using Foundry SDK Pattern: workshop_utils with FoundryLocalManager")
print(f"✅ Pre-flight passed: {all(v['status'] == 'success' for v in preflight.values()) if 'preflight' in dir() else 'Not run yet'}")
print(f"✅ Comparison completed: {len(results) == 2 if 'results' in dir() else 'Not run yet'}")
print(f"✅ Both models responded: {all(r.get('status') == 'success' for r in results) if 'results' in dir() and results else 'Not run yet'}")
print("="*70)

# Check for common configuration issues
issues = []
if 'LLM' in dir() and LLM not in ['qwen2.5-3b', 'qwen2.5-0.5b', 'qwen2.5-1.5b', 'qwen2.5-7b', 'phi-3.5-mini']:
    issues.append(f"⚠️  LLM is '{LLM}' - expected qwen2.5-3b for memory efficiency")
if 'preflight' in dir() and not all(v['status'] == 'success' for v in preflight.values()):
    issues.append("⚠️  Pre-flight check failed - models not accessible")
if 'results' in dir() and results and not all(r.get('status') == 'success' for r in results):
    issues.append("⚠️  Comparison incomplete - check for errors above")

if not issues and 'results' in dir() and results and all(r.get('status') == 'success' for r in results):
    print("🎉 ALL CHECKS PASSED! Notebook completed successfully.")
    print(f"   SLM ({SLM}) vs LLM ({LLM}) comparison completed.")
    if len(results) == 2:
        speedup = results[1]['elapsed_sec'] / results[0]['elapsed_sec'] if results[0]['elapsed_sec'] > 0 else 0
        print(f"   Performance: SLM is {speedup:.2f}x faster")
elif issues:
    print("\n⚠️  Issues detected:")
    for issue in issues:
        print(f"   {issue}")
    print("\n💡 Troubleshooting:")
    print("   1. Ensure service is running: foundry service start")
    # Use the configured aliases so the hint matches the models actually compared
    print(f"   2. Load models: foundry model run {SLM} && foundry model run {LLM}")
    print("   3. Check model list: foundry model ls")
else:
    print("\n💡 Run all cells above first, then re-run this validation.")
print("="*70)

VALIDATION SUMMARY
✅ SLM Model: phi-4-mini
✅ LLM Model: qwen2.5-7b
✅ Using Foundry SDK Pattern: workshop_utils with FoundryLocalManager
✅ Pre-flight passed: True
✅ Comparison completed: True
✅ Both models responded: True
🎉 ALL CHECKS PASSED! Notebook completed successfully.
   SLM (phi-4-mini) vs LLM (qwen2.5-7b) comparison completed.
   Performance: SLM is 5.14x faster



---

**Disclaimer**:  
This document was translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.
