# Session 4 – SLM vs. LLM Comparison

Compare latency and sample-response quality between a small language model and a larger model, both running through Foundry Local.


## ⚡ Quick Start

**Memory-optimized setup (updated):**
1. Models automatically select CPU variants (works on all hardware)
2. Uses `qwen2.5-3b` instead of 7B (saves ~4 GB RAM)
3. Automatic port detection (no manual configuration)
4. Total RAM requirement: ~8 GB recommended (models + OS)

**Terminal setup (30 seconds):**
```bash
foundry service start
foundry model run phi-4-mini
foundry model run qwen2.5-3b
```

Then run this notebook! 🚀


### Explanation: Installing dependencies
Installs the minimal packages (`foundry-local-sdk`, `openai`, `numpy`, `requests`) needed for timing, the service diagnostic, and chat requests. The cell is safe to re-run.


# Scenario
Compare a representative small language model (SLM) with a larger model on a single prompt to illustrate the trade-offs:
- **Latency difference** (wall-clock seconds)
- **Token usage** (if available) as a proxy for throughput
- **Qualitative sample output** for a quick assessment
- **Speedup calculation** to quantify the performance difference

**Environment variables:**
- `SLM_ALIAS` - Small language model (default: phi-4-mini, ~4 GB RAM)
- `LLM_ALIAS` - Larger language model (default: qwen2.5-3b, ~3 GB RAM)
- `COMPARE_PROMPT` - Test prompt for the comparison
- `COMPARE_RETRIES` - Number of retries for robustness (default: 2)
- `FOUNDRY_LOCAL_ENDPOINT` - Overrides the service endpoint (auto-detected if not set)
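
As a minimal sketch, the variables above could be read with their defaults as follows. The `load_config` helper is hypothetical (it is not part of the workshop code); the variable names come from the list above and the default values mirror the notebook's configuration cell.

```python
import os

def load_config(env=None):
    """Read the comparison settings from environment variables, with defaults."""
    env = os.environ if env is None else env
    return {
        "slm": env.get("SLM_ALIAS", "phi-4-mini"),
        "llm": env.get("LLM_ALIAS", "qwen2.5-3b"),
        "prompt": env.get("COMPARE_PROMPT", "List 5 benefits of local AI inference."),
        # Environment values are strings, so the retry count is converted explicitly.
        "retries": int(env.get("COMPARE_RETRIES", "2")),
        # None means "let FoundryLocalManager auto-detect the endpoint".
        "endpoint": env.get("FOUNDRY_LOCAL_ENDPOINT") or None,
    }

# Passing a plain dict makes the helper easy to try without touching os.environ.
config = load_config({"COMPARE_RETRIES": "3"})
print(config["slm"], config["llm"], config["retries"])
```

Accepting an optional mapping instead of always reading `os.environ` keeps the helper easy to exercise in isolation.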

**How it works (official SDK pattern):**
1. **FoundryLocalManager** initializes and manages the Foundry Local service
2. The service starts automatically if it is not already running (no manual setup needed)
3. Models are resolved from aliases to concrete IDs automatically
4. Hardware-optimized variants are selected (CUDA, NPU, or CPU)
5. An OpenAI-compatible client performs the chat completions
6. Metrics are collected: latency, tokens, output quality
7. The results are compared to compute the speedup ratio

This micro-comparison helps you decide when a larger model is justified for your use case.
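
The final step, the speedup ratio, is simply the larger model's latency divided by the smaller model's latency. A small sketch, with a hypothetical helper name and illustrative numbers rather than measurements:

```python
def speedup_ratio(slm_seconds: float, llm_seconds: float) -> float:
    """Return how many times faster the SLM was (LLM latency / SLM latency)."""
    if slm_seconds <= 0 or llm_seconds <= 0:
        raise ValueError("latencies must be positive")
    return llm_seconds / slm_seconds

# Illustrative latencies only.
ratio = speedup_ratio(slm_seconds=1.5, llm_seconds=6.0)
print(f"SLM is {ratio:.2f}x faster")  # prints "SLM is 4.00x faster"
```

A ratio below 1.0 would mean the larger model was actually faster for that prompt, which can happen when the two responses differ a lot in length.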

**SDK reference:**
- Python SDK: https://github.com/microsoft/Foundry-Local/tree/main/sdk/python/foundry_local
- Workshop Utils: uses the official pattern from ../samples/workshop_utils.py

**Key benefits:**
- ✅ Automatic service discovery and initialization
- ✅ Automatic service start if it is not running
- ✅ Built-in model resolution and caching
- ✅ Hardware optimization (CUDA/NPU/CPU)
- ✅ OpenAI SDK compatibility
- ✅ Robust error handling with retries
- ✅ Local inference (no cloud API required)


## 🚨 Prerequisites: Foundry Local Must Be Running!

**Before you run this notebook**, make sure the Foundry Local service is set up:

### Quick-start commands (run in a terminal):

```bash
# 1. Start the Foundry Local service
foundry service start

# 2. Load the default models used in this comparison (CPU-optimized)
foundry model run phi-4-mini
foundry model run qwen2.5-3b

# 3. Verify models are loaded
foundry model ls

# 4. Check service health
foundry service status
```

### Alternative models (if the default models are unavailable):

```bash
# Even smaller alternatives (if memory is very limited)
foundry model run phi-3.5-mini
foundry model run qwen2.5-0.5b

# Or update the environment variables in this notebook:
# SLM_ALIAS = 'phi-3.5-mini'
# LLM_ALIAS = 'qwen2.5-1.5b'  # Or qwen2.5-0.5b for minimum memory
```

⚠️ **If you skip these steps**, you will get an `APIConnectionError` when you run the notebook cells below.


In [29]:
# Install dependencies
!pip install -q foundry-local-sdk openai numpy requests

### Explanation: Core imports
Brings in the timing utilities and the Foundry Local / OpenAI clients used to fetch model information and perform chat completions.


In [30]:
import os, time, json
from foundry_local import FoundryLocalManager
from openai import OpenAI
import sys
sys.path.append('../samples')
from workshop_utils import get_client, chat_once

### Explanation: Aliases and prompt setup
Defines environment-configurable aliases for the smaller and larger models, plus a comparison prompt. Adjust the environment variables to experiment with different model families or tasks.


In [31]:
# Default to CPU models for better memory efficiency
SLM = os.getenv('SLM_ALIAS', 'phi-4-mini')  # Auto-selects CPU variant
LLM = os.getenv('LLM_ALIAS', 'qwen2.5-3b')  # Smaller LLM, more memory-friendly
PROMPT = os.getenv('COMPARE_PROMPT', 'List 5 benefits of local AI inference.')
# Endpoint is now managed by FoundryLocalManager - it auto-detects or can be overridden
ENDPOINT = os.getenv('FOUNDRY_LOCAL_ENDPOINT', None)

### 💡 Memory-optimized configuration

**This notebook uses memory-efficient models by default:**
- `phi-4-mini` → ~4 GB RAM (Foundry Local automatically selects the CPU variant)
- `qwen2.5-3b` → ~3 GB RAM (instead of 7B, which needs ~7 GB or more)

**Automatic port detection:**
- Foundry Local may use different ports (typically 55769 or 59959)
- The diagnostic cell below detects the correct port automatically
- No manual configuration needed!

**If you have limited RAM (<8 GB), use even smaller models:**
```python
SLM = 'phi-3.5-mini'      # ~2GB
LLM = 'qwen2.5-0.5b'      # ~500MB
```
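
The same rule of thumb can be written as a tiny helper. This is only a sketch: `pick_model_pair` is a hypothetical function, and the 8 GB threshold and model aliases are taken from the guidance above, not from any Foundry Local API.

```python
from typing import Tuple

def pick_model_pair(available_ram_gb: float) -> Tuple[str, str]:
    """Return (SLM alias, LLM alias) for a rough RAM budget."""
    if available_ram_gb < 8:
        # Very limited memory: fall back to the smallest pair suggested above.
        return "phi-3.5-mini", "qwen2.5-0.5b"
    # Default memory-efficient pair used by this notebook.
    return "phi-4-mini", "qwen2.5-3b"

SLM_CHOICE, LLM_CHOICE = pick_model_pair(available_ram_gb=6)
print(SLM_CHOICE, LLM_CHOICE)
```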


In [32]:
# Display current configuration
print("="*60)
print("CURRENT CONFIGURATION")
print("="*60)
print(f"SLM Model:     {SLM}")
print(f"LLM Model:     {LLM}")
print(f"SDK Pattern:   FoundryLocalManager (official)")
print(f"Endpoint:      {ENDPOINT or 'Auto-detect'}")
print(f"Test Prompt:   {PROMPT[:50]}...")
print(f"Retry Count:   2")
print("="*60)
print("\n💡 Using official Foundry SDK pattern from workshop_utils")
print("   → FoundryLocalManager handles service lifecycle")
print("   → Automatic model resolution and hardware optimization")
print("   → OpenAI-compatible API for inference")

CURRENT CONFIGURATION
SLM Model:     phi-4-mini
LLM Model:     qwen2.5-7b
SDK Pattern:   FoundryLocalManager (official)
Endpoint:      Auto-detect
Test Prompt:   List 5 benefits of local AI inference....
Retry Count:   2

💡 Using official Foundry SDK pattern from workshop_utils
   → FoundryLocalManager handles service lifecycle
   → Automatic model resolution and hardware optimization
   → OpenAI-compatible API for inference


### Explanation: Execution helpers (Foundry SDK pattern)
Uses the official Foundry Local SDK pattern documented in the workshop samples:

**Approach:**
- **FoundryLocalManager** - Initializes and manages the Foundry Local service
- **Automatic discovery** - Detects the endpoint and handles the service lifecycle
- **Model resolution** - Translates aliases to full model IDs (e.g. phi-4-mini → phi-4-mini-instruct-cpu)
- **Hardware optimization** - Selects the best variant for the available hardware (CUDA, NPU, or CPU)
- **OpenAI client** - Configured with the manager's endpoint for OpenAI-compatible API access
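
For orientation, here is a rough sketch of what a helper such as `get_client` might do internally. The source of `workshop_utils` is not shown in this notebook, so the `connect` function below is an assumption; it relies on the `FoundryLocalManager` attributes `endpoint` and `api_key` and the `get_model_info` method as described in the public Foundry Local Python SDK documentation.

```python
def connect(alias: str):
    """Start or attach to Foundry Local and return (manager, client, model_id)."""
    # Imported lazily so the sketch can be defined without the packages installed.
    from foundry_local import FoundryLocalManager
    from openai import OpenAI

    # Starts the service if needed and loads the model behind the alias.
    manager = FoundryLocalManager(alias)
    # The local endpoint speaks the OpenAI API; the SDK supplies a placeholder key.
    client = OpenAI(base_url=manager.endpoint, api_key=manager.api_key)
    # Resolve the alias to the concrete, hardware-specific model ID.
    model_id = manager.get_model_info(alias).id
    return manager, client, model_id
```

Calling `connect("phi-4-mini")` requires Foundry Local to be installed on the machine. The notebook itself goes through `get_client` instead, which may add caching or endpoint overrides on top of this.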

**Resilience features:**
- Exponential-backoff retry logic (configurable via environment variables)
- Automatic service start if it is not running
- Connection verification after initialization
- Graceful error handling with detailed error reporting
- Model caching to avoid repeated initialization
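
To make the backoff behaviour concrete, the sketch below computes the waits between attempts, assuming the 2-second initial delay and the doubling factor used by `setup()` in this notebook. `backoff_delays` is a hypothetical helper written for illustration.

```python
def backoff_delays(retries: int, initial_delay: float = 2.0, factor: float = 2.0) -> list:
    """Delays (in seconds) slept between attempts; no sleep follows the last attempt."""
    return [initial_delay * factor ** i for i in range(max(retries - 1, 0))]

print(backoff_delays(3))  # [2.0, 4.0]
```

With the notebook's setting of two attempts, a failing model therefore costs one 2-second wait before the final attempt.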

**Result structure:**
- Latency measurement (wall-clock time)
- Token usage tracking (if available)
- Sample output (truncated for readability)
- Error details for failed requests

This pattern relies on the workshop_utils module, which follows the official SDK pattern.

**SDK reference:**
- Main repo: https://github.com/microsoft/Foundry-Local
- Python SDK: https://github.com/microsoft/Foundry-Local/tree/main/sdk/python/foundry_local
- Workshop Utils: ../samples/workshop_utils.py


In [39]:
def setup(alias: str, endpoint: str = None, retries: int = 3):
    """
    Initialize a Foundry Local model connection using official SDK pattern.
    
    This follows the workshop_utils pattern which uses FoundryLocalManager
    to properly initialize the Foundry Local service and resolve models.
    
    Args:
        alias: Model alias (e.g., 'phi-4-mini', 'qwen2.5-3b')
        endpoint: Optional endpoint override (usually auto-detected)
        retries: Number of connection attempts (default: 3)
    
    Returns:
        tuple: (manager, client, model_id, metadata) or (None, None, alias, error_metadata) if failed
    """
    import time
    
    last_err = None
    current_delay = 2  # seconds
    
    for attempt in range(1, retries + 1):
        try:
            print(f"[Init] Connecting to '{alias}' (attempt {attempt}/{retries})...")
            
            # Use the workshop utility which follows the official SDK pattern
            manager, client, model_id = get_client(alias, endpoint=endpoint)
            
            print(f"[OK] Connected to '{alias}' -> {model_id}")
            print(f"     Endpoint: {manager.endpoint}")
            
            return manager, client, model_id, {
                'endpoint': manager.endpoint,
                'resolved': model_id,
                'attempts': attempt,
                'status': 'success'
            }
            
        except Exception as e:
            last_err = e
            error_msg = str(e)
            
            # Provide helpful error messages
            if "Connection error" in error_msg or "connection refused" in error_msg.lower():
                print(f"[ERROR] Cannot connect to Foundry Local service")
                print(f"        → Is the service running? Try: foundry service start")
                print(f"        → Is the model loaded? Try: foundry model run {alias}")
            elif "not found" in error_msg.lower():
                print(f"[ERROR] Model '{alias}' not found in catalog")
                print(f"        → Available models: Run 'foundry model ls' in terminal")
                print(f"        → Download model: Run 'foundry model download {alias}'")
            else:
                print(f"[ERROR] Setup failed: {type(e).__name__}: {error_msg}")
            
            if attempt < retries:
                print(f"[Retry] Waiting {current_delay:.1f}s before retry...")
                time.sleep(current_delay)
                current_delay *= 2  # Exponential backoff
    
    # All retries failed - provide actionable guidance
    print(f"\n❌ Failed to initialize '{alias}' after {retries} attempts")
    print(f"   Last error: {type(last_err).__name__}: {str(last_err)}")
    print(f"\n💡 Troubleshooting steps:")
    print(f"   1. Ensure Foundry Local service is running:")
    print(f"      → foundry service status")
    print(f"      → foundry service start (if not running)")
    print(f"   2. Ensure model is loaded:")
    print(f"      → foundry model run {alias}")
    print(f"   3. Check available models:")
    print(f"      → foundry model ls")
    print(f"   4. Try alternative models if '{alias}' isn't available")
    
    return None, None, alias, {
        'error': f"{type(last_err).__name__}: {str(last_err)}",
        'endpoint': endpoint or 'auto-detect',
        'attempts': retries,
        'status': 'failed'
    }


def run(client, model_id: str, prompt: str, max_tokens: int = 180, temperature: float = 0.5):
    """
    Run inference with the configured model using OpenAI SDK.
    
    Args:
        client: OpenAI client instance (configured for Foundry Local)
        model_id: Model identifier (resolved from alias)
        prompt: Input prompt
        max_tokens: Maximum response tokens (default: 180)
        temperature: Sampling temperature (default: 0.5)
    
    Returns:
        dict: Response with timing, tokens, and content
    """
    import time
    
    start = time.time()
    
    try:
        response = client.chat.completions.create(
            model=model_id,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
            temperature=temperature
        )
        
        elapsed = time.time() - start
        
        # Extract response details
        content = response.choices[0].message.content
        
        # Try to extract token usage from multiple possible locations
        usage_info = {}
        if hasattr(response, 'usage') and response.usage:
            usage_info['prompt_tokens'] = getattr(response.usage, 'prompt_tokens', None)
            usage_info['completion_tokens'] = getattr(response.usage, 'completion_tokens', None)
            usage_info['total_tokens'] = getattr(response.usage, 'total_tokens', None)
        
        # Calculate approximate token count if API doesn't provide it
        # Rough estimate: ~4 characters per token for English text
        if not usage_info.get('total_tokens'):
            estimated_prompt_tokens = len(prompt) // 4
            estimated_completion_tokens = len(content) // 4
            estimated_total = estimated_prompt_tokens + estimated_completion_tokens
            usage_info['estimated_tokens'] = estimated_total
            usage_info['estimated_prompt_tokens'] = estimated_prompt_tokens
            usage_info['estimated_completion_tokens'] = estimated_completion_tokens
        
        return {
            'status': 'success',
            'content': content,
            'elapsed_sec': elapsed,
            'tokens': usage_info.get('total_tokens') or usage_info.get('estimated_tokens'),
            'usage': usage_info,
            'model': model_id
        }
        
    except Exception as e:
        elapsed = time.time() - start
        return {
            'status': 'error',
            'error': f"{type(e).__name__}: {str(e)}",
            'elapsed_sec': elapsed,
            'model': model_id
        }


print("✅ Execution helpers defined: setup(), run()")
print("   → Uses workshop_utils for proper SDK integration")
print("   → setup() initializes with FoundryLocalManager")
print("   → run() executes inference via OpenAI-compatible API")
print("   → Token counting: Uses API data or estimates if unavailable")

✅ Execution helpers defined: setup(), run()
   → Uses workshop_utils for proper SDK integration
   → setup() initializes with FoundryLocalManager
   → run() executes inference via OpenAI-compatible API
   → Token counting: Uses API data or estimates if unavailable


### Explanation: Pre-flight self-test
Performs a simple connectivity check with FoundryLocalManager for both models. It verifies that:
- The service is reachable
- The models can be initialized
- The aliases resolve to actual model IDs
- The connection is stable before the comparison runs

The setup() function uses the official SDK pattern from workshop_utils.


In [34]:
# Simplified diagnostic: Just verify service is accessible
import requests

def check_foundry_service():
    """Quick diagnostic to verify Foundry Local is running."""
    # Try common ports
    endpoints_to_try = [
        "http://localhost:59959",
        "http://127.0.0.1:59959", 
        "http://localhost:55769",
        "http://127.0.0.1:55769",
    ]
    
    print("[Diagnostic] Checking Foundry Local service...")
    
    for endpoint in endpoints_to_try:
        try:
            response = requests.get(f"{endpoint}/health", timeout=2)
            if response.status_code == 200:
                print(f"✅ Service is running at {endpoint}")
                
                # Try to list models
                try:
                    models_response = requests.get(f"{endpoint}/v1/models", timeout=2)
                    if models_response.status_code == 200:
                        models_data = models_response.json()
                        model_count = len(models_data.get('data', []))
                        print(f"✅ Found {model_count} models available")
                        if model_count > 0:
                            print("   Models:", [m.get('id', 'unknown') for m in models_data.get('data', [])[:5]])
                except Exception as e:
                    print(f"⚠️  Could not list models: {e}")
                
                return endpoint
        except requests.exceptions.ConnectionError:
            continue
        except Exception as e:
            print(f"⚠️  Error checking {endpoint}: {e}")
    
    print("\n❌ Foundry Local service not found!")
    print("\n💡 To fix this:")
    print("   1. Open a terminal")
    print("   2. Run: foundry service start")
    print("   3. Run: foundry model run phi-4-mini")
    print("   4. Run: foundry model run qwen2.5-3b")
    print("   5. Re-run this notebook")
    return None

# Run diagnostic
discovered_endpoint = check_foundry_service()

if discovered_endpoint:
    print(f"\n✅ Service detected (will be managed by FoundryLocalManager)")
else:
    print(f"\n⚠️  No service detected - FoundryLocalManager will attempt to start it")

[Diagnostic] Checking Foundry Local service...

❌ Foundry Local service not found!

💡 To fix this:
   1. Open a terminal
   2. Run: foundry service start
   3. Run: foundry model run phi-4-mini
   4. Run: foundry model run qwen2.5-3b
   5. Re-run this notebook

⚠️  No service detected - FoundryLocalManager will attempt to start it


In [35]:
# Quick Fix: Start service and load models from notebook
# Uncomment the commands you need:

# !foundry service start
# !foundry model run phi-4-mini
# !foundry model run qwen2.5-3b
# !foundry model ls

print("⚠️  The commands above are commented out.")
print("Uncomment them if you want to start the service from the notebook.")
print("")
print("💡 Recommended: Run these commands in a separate terminal instead:")
print("   foundry service start")
print("   foundry model run phi-4-mini")
print("   foundry model run qwen2.5-3b")

⚠️  The commands above are commented out.
Uncomment them if you want to start the service from the notebook.

💡 Recommended: Run these commands in a separate terminal instead:
   foundry service start
   foundry model run phi-4-mini
   foundry model run qwen2.5-3b


### 🛠️ Quick fix: Start Foundry Local from the notebook (optional)

If the diagnostic above shows that the service is not running, you can try to start it from here:

**Note:** This works best on Windows. On other platforms, use the terminal commands.


### ⚠️ Troubleshooting connection errors

If you see `APIConnectionError`, the Foundry Local service may not be running or the models may not be loaded. Try the following steps:

**1. Check the service status:**
```bash
# In a terminal (not in notebook):
foundry service status
```

**2. Start the service (if it is not running):**
```bash
foundry service start
```

**3. Load the required models:**
```bash
# Load the default models used in this comparison
foundry model run phi-4-mini
foundry model run qwen2.5-3b

# Or use smaller alternatives:
foundry model run phi-3.5-mini
foundry model run qwen2.5-0.5b
```

**4. Verify that the models are available:**
```bash
foundry model ls
```

**Common issues:**
- ❌ Service not running → Run `foundry service start`
- ❌ Models not loaded → Run `foundry model run <model-name>`
- ❌ Port conflicts → Check whether another service is using the port
- ❌ Firewall blocking → Make sure local connections are allowed

**Quick fix:** Run the diagnostic cell before the pre-flight check.
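
When a port conflict is suspected, a quick TCP probe shows whether anything is listening on the ports this notebook mentions (55769 and 59959). This is a minimal standard-library sketch; `port_in_use` is a hypothetical helper, and a port reported as in use may simply be Foundry Local itself.

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0

for port in (55769, 59959):
    print(port, "in use" if port_in_use(port) else "free")
```

If a port is in use but the `/health` check in the diagnostic cell fails, another process is probably occupying it.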


In [36]:
preflight = {}
retries = int(os.getenv('COMPARE_RETRIES', '2'))  # Retry attempts (default: 2)

for a in (SLM, LLM):
    mgr, c, mid, info = setup(a, endpoint=ENDPOINT, retries=retries)
    # Keep the original status from info (either 'success' or 'failed')
    preflight[a] = info

print('\n[Pre-flight Check]')
for alias, details in preflight.items():
    status_icon = '✅' if details['status'] == 'success' else '❌'
    print(f"  {status_icon} {alias}: {details['status']} - {details.get('resolved', details.get('error', 'unknown'))}")

preflight

[Init] Connecting to 'phi-4-mini' (attempt 1/2)...
[OK] Connected to 'phi-4-mini' -> Phi-4-mini-instruct-cuda-gpu:4
     Endpoint: http://127.0.0.1:59959/v1
[Init] Connecting to 'qwen2.5-7b' (attempt 1/2)...
[OK] Connected to 'qwen2.5-7b' -> qwen2.5-7b-instruct-cuda-gpu:3
     Endpoint: http://127.0.0.1:59959/v1

[Pre-flight Check]
  ✅ phi-4-mini: success - Phi-4-mini-instruct-cuda-gpu:4
  ✅ qwen2.5-7b: success - qwen2.5-7b-instruct-cuda-gpu:3


{'phi-4-mini': {'endpoint': 'http://127.0.0.1:59959/v1',
  'resolved': 'Phi-4-mini-instruct-cuda-gpu:4',
  'attempts': 1,
  'status': 'success'},
 'qwen2.5-7b': {'endpoint': 'http://127.0.0.1:59959/v1',
  'resolved': 'qwen2.5-7b-instruct-cuda-gpu:3',
  'attempts': 1,
  'status': 'success'}}

### ✅ Pre-flight check: Model availability

The pre-flight cell verifies that both models are reachable at the configured endpoint before the comparison runs.


### Explanation: Run the comparison and collect results
Iterates over both aliases using the official Foundry SDK pattern:
1. Initialize each model with setup() (uses FoundryLocalManager)
2. Run inference through the OpenAI-compatible API
3. Record latency, tokens, and sample output
4. Produce a JSON summary with a comparative analysis

This follows the same pattern as the workshop sample in session04/model_compare.py.
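
The JSON summary in step 4 could be assembled along these lines. This is a sketch only: `summarize` is a hypothetical helper, the field names follow the result dictionaries returned by `run()` in this notebook, and the sample values are illustrative rather than measured.

```python
import json

def summarize(results):
    """Build a JSON-serializable summary from the per-model result dictionaries."""
    ok = [r for r in results if r.get("status") == "success"]
    summary = {
        "models": [
            {
                "alias": r["alias"],
                "status": r.get("status"),
                "elapsed_sec": r.get("elapsed_sec"),
                "tokens": r.get("tokens"),
            }
            for r in results
        ],
        "all_succeeded": bool(results) and len(ok) == len(results),
    }
    if ok:
        # Only successful runs have a meaningful latency.
        summary["fastest"] = min(ok, key=lambda r: r["elapsed_sec"])["alias"]
    return summary

# Illustrative values only, not measurements.
sample = [
    {"alias": "phi-4-mini", "status": "success", "elapsed_sec": 1.2, "tokens": 150},
    {"alias": "qwen2.5-3b", "status": "error", "error": "APIConnectionError", "elapsed_sec": 0},
]
print(json.dumps(summarize(sample), indent=2))
```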


In [40]:
results = []
retries = int(os.getenv('COMPARE_RETRIES', '2'))  # Retry attempts (default: 2)

for alias in (SLM, LLM):
    mgr, client, mid, info = setup(alias, endpoint=ENDPOINT, retries=retries)
    if client:
        r = run(client, mid, PROMPT)
        results.append({'alias': alias, **r})
    else:
        # If setup failed, record error
        results.append({
            'alias': alias,
            'status': 'error',
            'error': info.get('error', 'Setup failed'),
            'elapsed_sec': 0,
            'tokens': None,
            'model': alias
        })

# Display results
print(json.dumps(results, indent=2))

# Quick comparative view
print('\n' + '='*80)
print('COMPARISON SUMMARY')
print('='*80)
print(f"{'Alias':<20} {'Status':<15} {'Latency(s)':<15} {'Tokens':<15}")
print('-'*80)

for row in results:
    status = row.get('status', 'unknown')
    status_icon = '✅' if status == 'success' else '❌'
    latency_str = f"{row.get('elapsed_sec', 0):.3f}" if row.get('elapsed_sec') else 'N/A'
    
    # Handle token display - show if available or indicate estimated
    tokens = row.get('tokens')
    usage = row.get('usage', {})
    if tokens:
        if 'estimated_tokens' in usage:
            tokens_str = f"~{tokens} (est.)"
        else:
            tokens_str = str(tokens)
    else:
        tokens_str = 'N/A'
    
    print(f"{status_icon} {row['alias']:<18} {status:<15} {latency_str:<15} {tokens_str:<15}")

print('-'*80)

# Show detailed token breakdown if available
print("\nDetailed Token Usage:")
for row in results:
    if row.get('status') == 'success' and row.get('usage'):
        usage = row['usage']
        print(f"\n  {row['alias']}:")
        if 'prompt_tokens' in usage and usage['prompt_tokens']:
            print(f"    Prompt tokens:     {usage['prompt_tokens']}")
            print(f"    Completion tokens: {usage['completion_tokens']}")
            print(f"    Total tokens:      {usage['total_tokens']}")
        elif 'estimated_tokens' in usage:
            print(f"    Estimated prompt:     {usage['estimated_prompt_tokens']}")
            print(f"    Estimated completion: {usage['estimated_completion_tokens']}")
            print(f"    Estimated total:      {usage['estimated_tokens']}")
            print(f"    (API did not provide token counts - using ~4 chars/token estimate)")

print('\n' + '='*80)

# Calculate speedup if both succeeded
if len(results) == 2 and all(r.get('status') == 'success' and r.get('elapsed_sec') for r in results):
    speedup = results[1]['elapsed_sec'] / results[0]['elapsed_sec']
    print(f"\n💡 SLM is {speedup:.2f}x faster than LLM for this prompt")
    
    # Compare token throughput if available
    slm_tokens = results[0].get('tokens', 0)
    llm_tokens = results[1].get('tokens', 0)
    if slm_tokens and llm_tokens:
        slm_tps = slm_tokens / results[0]['elapsed_sec']
        llm_tps = llm_tokens / results[1]['elapsed_sec']
        print(f"   SLM throughput: {slm_tps:.1f} tokens/sec")
        print(f"   LLM throughput: {llm_tps:.1f} tokens/sec")
        
elif any(r.get('status') == 'error' for r in results):
    print(f"\n⚠️  Some models failed - check errors above")
    print("   Ensure Foundry Local is running: foundry service start")
    print("   Ensure models are loaded: foundry model run <model-name>")

results

[Init] Connecting to 'phi-4-mini' (attempt 1/2)...
[OK] Connected to 'phi-4-mini' -> Phi-4-mini-instruct-cuda-gpu:4
     Endpoint: http://127.0.0.1:59959/v1
[Init] Connecting to 'qwen2.5-7b' (attempt 1/2)...
[OK] Connected to 'qwen2.5-7b' -> qwen2.5-7b-instruct-cuda-gpu:3
     Endpoint: http://127.0.0.1:59959/v1
[
  {
    "alias": "phi-4-mini",
    "status": "success",
    "content": "1. Reduced Latency: Local AI inference can significantly reduce latency by processing data closer to the source, which is particularly beneficial for real-time applications such as autonomous vehicles or augmented reality.\n\n2. Enhanced Privacy: By keeping data processing local, sensitive information is less likely to be exposed to external networks, thereby enhancing privacy and security.\n\n3. Lower Bandwidth Usage: Local AI inference reduces the n

[{'alias': 'phi-4-mini',
  'status': 'success',
  'content': '1. Reduced Latency: Local AI inference can significantly reduce latency by processing data closer to the source, which is particularly beneficial for real-time applications such as autonomous vehicles or augmented reality.\n\n2. Enhanced Privacy: By keeping data processing local, sensitive information is less likely to be exposed to external networks, thereby enhancing privacy and security.\n\n3. Lower Bandwidth Usage: Local AI inference reduces the need for data transmission over the network, which can save bandwidth and reduce the risk of network congestion.\n\n4. Improved Reliability: Local processing can be more reliable, as it is less dependent on network connectivity. This is particularly important in scenarios where network connectivity is unreliable or intermittent.\n\n5. Scalability: Local AI inference can be easily scaled by adding more local processing units, making it easier to handle increasing data volumes or m

### Interpreting the results

**Key metrics:**
- **Latency**: Lower is better; it indicates a faster response time
- **Tokens**: Higher throughput means more tokens processed per second
- **Route**: Confirms which API endpoint was used
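
Throughput is the token count divided by the wall-clock latency. A small sketch with a hypothetical helper and illustrative numbers:

```python
def tokens_per_second(tokens, elapsed_sec):
    """Return throughput in tokens/sec, or None when it cannot be computed."""
    if not tokens or not elapsed_sec or elapsed_sec <= 0:
        return None
    return tokens / elapsed_sec

print(tokens_per_second(180, 2.0))   # 90.0
print(tokens_per_second(None, 2.0))  # None
```

When the API does not report usage and the notebook falls back to its ~4 characters per token estimate, the resulting throughput is only a rough approximation.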

**When to use an SLM vs. an LLM:**
- **SLM (Small Language Model)**: Fast responses, lower resource usage, a good fit for simple tasks
- **LLM (Large Language Model)**: Higher quality, better reasoning; use it when quality matters most

**Next steps:**
1. Try different prompts to see how complexity affects the comparison
2. Experiment with other model pairs
3. Use the workshop router samples (Session 06) to route intelligently based on task complexity
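
A single request is a noisy measurement, so when trying other prompts or model pairs it may help to repeat each call and compare medians. The sketch below uses only the standard library; `median_latency` is a hypothetical helper, and the lambda is a stand-in workload rather than a model call.

```python
import statistics
import time

def median_latency(call, repeats: int = 5) -> float:
    """Run `call` several times and return the median wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        call()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Stand-in workload; in this notebook it could be
# lambda: run(client, model_id, PROMPT)
print(f"{median_latency(lambda: sum(range(10_000))):.6f}s")
```

The median is less sensitive than the mean to a slow first request, which often includes model warm-up.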


In [38]:
# Final Validation Check
print("="*70)
print("VALIDATION SUMMARY")
print("="*70)
print(f"✅ SLM Model: {SLM}")
print(f"✅ LLM Model: {LLM}")
print(f"✅ Using Foundry SDK Pattern: workshop_utils with FoundryLocalManager")
print(f"✅ Pre-flight passed: {all(v['status'] == 'success' for v in preflight.values()) if 'preflight' in dir() else 'Not run yet'}")
print(f"✅ Comparison completed: {len(results) == 2 if 'results' in dir() else 'Not run yet'}")
print(f"✅ Both models responded: {all(r.get('status') == 'success' for r in results) if 'results' in dir() and results else 'Not run yet'}")
print("="*70)

# Check for common configuration issues
issues = []
if 'LLM' in dir() and LLM not in ['qwen2.5-3b', 'qwen2.5-0.5b', 'qwen2.5-1.5b', 'qwen2.5-7b', 'phi-3.5-mini']:
    issues.append(f"⚠️  LLM is '{LLM}' - expected qwen2.5-3b for memory efficiency")
if 'preflight' in dir() and not all(v['status'] == 'success' for v in preflight.values()):
    issues.append("⚠️  Pre-flight check failed - models not accessible")
if 'results' in dir() and results and not all(r.get('status') == 'success' for r in results):
    issues.append("⚠️  Comparison incomplete - check for errors above")

if not issues and 'results' in dir() and results and all(r.get('status') == 'success' for r in results):
    print("🎉 ALL CHECKS PASSED! Notebook completed successfully.")
    print(f"   SLM ({SLM}) vs LLM ({LLM}) comparison completed.")
    if len(results) == 2:
        speedup = results[1]['elapsed_sec'] / results[0]['elapsed_sec'] if results[0]['elapsed_sec'] > 0 else 0
        print(f"   Performance: SLM is {speedup:.2f}x faster")
elif issues:
    print("\n⚠️  Issues detected:")
    for issue in issues:
        print(f"   {issue}")
    print("\n💡 Troubleshooting:")
    print("   1. Ensure service is running: foundry service start")
    print(f"   2. Load models: foundry model run {SLM} && foundry model run {LLM}")
    print("   3. Check model list: foundry model ls")
else:
    print("\n💡 Run all cells above first, then re-run this validation.")
print("="*70)

VALIDATION SUMMARY
✅ SLM Model: phi-4-mini
✅ LLM Model: qwen2.5-7b
✅ Using Foundry SDK Pattern: workshop_utils with FoundryLocalManager
✅ Pre-flight passed: True
✅ Comparison completed: True
✅ Both models responded: True
🎉 ALL CHECKS PASSED! Notebook completed successfully.
   SLM (phi-4-mini) vs LLM (qwen2.5-7b) comparison completed.
   Performance: SLM is 5.14x faster



---

**Disclaimer**:  
This document was translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.
