# LLM Inference Testing

This notebook demonstrates how to set up and test LLM inference with several models, using our processed dataset. It covers two approaches: local models run through Hugging Face Transformers and cloud-hosted models accessed through GitHub Models.

## Overview
- Load and test pre-trained models
- Configure inference parameters  
- Test with our processed data
- Benchmark performance
- Compare different models

In [None]:
# Import Required Libraries
import os
import sys
import json
import time
import subprocess
import pandas as pd
import numpy as np
from pathlib import Path
from typing import List, Dict, Optional
import psutil
import gc
from datetime import datetime

# For Azure AI Inference
try:
    from azure.ai.inference import ChatCompletionsClient
    from azure.ai.inference.models import SystemMessage, UserMessage
    from azure.core.credentials import AzureKeyCredential
    print("✓ Azure AI Inference SDK available")
except ImportError:
    print("Installing Azure AI Inference...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "azure-ai-inference"])
    from azure.ai.inference import ChatCompletionsClient
    from azure.ai.inference.models import SystemMessage, UserMessage
    from azure.core.credentials import AzureKeyCredential

# For OpenAI SDK (alternative)
try:
    from openai import OpenAI
    print("✓ OpenAI SDK available")
except ImportError:
    print("Installing OpenAI SDK...")
    # subprocess and sys are imported at the top, so this branch works
    # even when the Azure SDK was already installed
    subprocess.check_call([sys.executable, "-m", "pip", "install", "openai"])
    from openai import OpenAI

# For local model testing (optional)
try:
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
    print("✓ Transformers available for local models")
    TRANSFORMERS_AVAILABLE = True
except ImportError:
    print("⚠️ Transformers not available - cloud models only")
    TRANSFORMERS_AVAILABLE = False

# Set up paths
BASE_DIR = Path(r"c:\Github\Learn-GenAI\genai_book")
DATA_FILE = BASE_DIR / "llm_data_medium_chunks.jsonl"

print(f"Base directory: {BASE_DIR}")
print(f"Data file: {DATA_FILE}")
print(f"Data file exists: {DATA_FILE.exists()}")

In [None]:
# Load and prepare test data from our ingestion pipeline
print("=== LOADING TEST DATA ===")

def load_processed_data(file_path: Path, max_samples: int = 50) -> List[Dict]:
    """
    Load processed data from JSONL file
    """
    data = []
    if not file_path.exists():
        print(f"❌ Data file not found: {file_path}")
        return data
    
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            for i, line in enumerate(f):
                if i >= max_samples:
                    break
                data.append(json.loads(line))
        
        print(f"✓ Loaded {len(data)} data samples")
        return data
    
    except Exception as e:
        print(f"❌ Error loading data: {e}")
        return []

# Load test data
test_data = load_processed_data(DATA_FILE, max_samples=20)

if test_data:
    # Display data statistics
    categories = [item['category'] for item in test_data]
    category_counts = pd.Series(categories).value_counts()
    
    print(f"\nTest Data Overview:")
    print(f"Total samples: {len(test_data)}")
    print(f"Categories: {list(category_counts.index)}")
    print(f"Category distribution:")
    for cat, count in category_counts.items():
        print(f"  {cat}: {count} samples")
    
    # Show a sample
    sample = test_data[0]
    print(f"\nSample data:")
    print(f"Category: {sample['category']}")
    print(f"Source: {sample['source_file']}")
    print(f"Text preview: {sample['text'][:200]}...")
    print(f"Chunk size: {sample['chunk_size']} words")
else:
    print("❌ No test data available. Please run the data ingestion pipeline first.")
    # Create some sample data for testing
    test_data = [
        {
            'text': 'This is a sample business text about market trends and economic indicators.',
            'category': 'business',
            'source_file': 'sample_business.txt',
            'chunk_size': 12
        },
        {
            'text': 'Technology advances in artificial intelligence and machine learning are transforming industries.',
            'category': 'technology',
            'source_file': 'sample_tech.txt',
            'chunk_size': 11
        }
    ]
    print("✓ Created sample test data for demonstration")
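
The display code above indexes each record by `text`, `category`, `source_file`, and `chunk_size`, so a record missing any of them would raise a `KeyError`. A minimal sketch of a pre-check follows, assuming those four keys are the whole schema; `REQUIRED_KEYS` and `split_valid_records` are hypothetical names, not part of the ingestion pipeline.

```python
from typing import Dict, List, Tuple

# Assumed schema: the four keys the notebook reads from each record.
REQUIRED_KEYS = ("text", "category", "source_file", "chunk_size")

def split_valid_records(records: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Separate records with every required key and non-empty text from the rest."""
    valid, invalid = [], []
    for record in records:
        has_keys = all(key in record for key in REQUIRED_KEYS)
        has_text = has_keys and isinstance(record["text"], str) and record["text"].strip() != ""
        (valid if has_text else invalid).append(record)
    return valid, invalid

# Made-up records for illustration only.
sample_records = [
    {"text": "Market trends.", "category": "business", "source_file": "a.txt", "chunk_size": 2},
    {"text": "Missing fields."},
]
valid, invalid = split_valid_records(sample_records)
print(f"{len(valid)} valid, {len(invalid)} invalid")
```

Running the check right after `load_processed_data` would let malformed records be reported once rather than failing later inside prompt construction.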

In [None]:
# Configure Cloud-based Model for Inference (GitHub Models)
print("=== CLOUD MODEL CONFIGURATION ===")

class CloudLLMInference:
    """
    Cloud-based LLM inference using GitHub Models via Azure AI Inference SDK
    """
    
    def __init__(self, model_name: str = "openai/gpt-4.1-mini"):
        self.endpoint = "https://models.github.ai/inference"
        self.model = model_name
        self.token = os.environ.get("GITHUB_TOKEN")
        
        if not self.token:
            print("❌ GITHUB_TOKEN environment variable not set")
            print("💡 Set your GitHub Personal Access Token:")
            print("   export GITHUB_TOKEN='your_github_pat'")
            self.client = None
        else:
            try:
                self.client = ChatCompletionsClient(
                    endpoint=self.endpoint,
                    credential=AzureKeyCredential(self.token),
                )
                print(f"✓ Connected to GitHub Models")
                print(f"✓ Model: {self.model}")
            except Exception as e:
                print(f"❌ Error connecting to GitHub Models: {e}")
                self.client = None
    
    def generate_text(self, 
                     prompt: str, 
                     system_message: str = "You are a helpful AI assistant.",
                     temperature: float = 0.7,
                     max_tokens: int = 200) -> Dict:
        """
        Generate text using the cloud model
        """
        if not self.client:
            return {"error": "Client not initialized"}
        
        start_time = time.time()
        
        try:
            response = self.client.complete(
                messages=[
                    SystemMessage(system_message),
                    UserMessage(prompt),
                ],
                temperature=temperature,
                max_tokens=max_tokens,
                model=self.model
            )
            
            end_time = time.time()
            
            result = {
                "generated_text": response.choices[0].message.content,
                "model": self.model,
                "prompt": prompt,
                "system_message": system_message,
                "generation_time": end_time - start_time,
                "temperature": temperature,
                "max_tokens": max_tokens,
                "finish_reason": response.choices[0].finish_reason,
                "usage": {
                    "prompt_tokens": response.usage.prompt_tokens if response.usage else None,
                    "completion_tokens": response.usage.completion_tokens if response.usage else None,
                    "total_tokens": response.usage.total_tokens if response.usage else None
                }
            }
            
            return result
            
        except Exception as e:
            return {
                "error": str(e),
                "model": self.model,
                "prompt": prompt
            }

# Initialize cloud model
cloud_model = CloudLLMInference("openai/gpt-4.1-mini")

# Test connection
if cloud_model.client:
    test_result = cloud_model.generate_text("Hello! Can you generate a short response?", max_tokens=50)
    if "error" not in test_result:
        print(f"✓ Test successful!")
        print(f"  Response: {test_result['generated_text'][:100]}...")
        print(f"  Time: {test_result['generation_time']:.2f}s")
        if test_result['usage']['total_tokens']:
            print(f"  Tokens: {test_result['usage']['total_tokens']}")
    else:
        print(f"❌ Test failed: {test_result['error']}")
else:
    print("⚠️ Cloud model not available - set GITHUB_TOKEN environment variable")

In [None]:
# Configure Local Model for Inference (Optional)
print("=== LOCAL MODEL CONFIGURATION ===")

class LocalLLMInference:
    """
    Local LLM inference using Hugging Face transformers
    """
    
    def __init__(self, model_name: str = "microsoft/DialoGPT-small"):
        self.model_name = model_name
        self.model = None
        self.tokenizer = None
        self.pipeline = None
        
        if not TRANSFORMERS_AVAILABLE:
            print("❌ Transformers not available for local models")
            return
            
        try:
            print(f"Loading model: {model_name}")
            print("⚠️ This may take a while for first-time download...")
            
            # Use a small, efficient model for testing
            self.tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side='left')
            
            # Add pad token if not present
            if self.tokenizer.pad_token is None:
                self.tokenizer.pad_token = self.tokenizer.eos_token
            
            self.model = AutoModelForCausalLM.from_pretrained(
                model_name,
                torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
                device_map="auto" if torch.cuda.is_available() else None
            )
            
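            # Create text generation pipeline. The model is already placed
            # (via device_map on GPU, on CPU otherwise), so no device argument
            # is passed; combining device_map with an explicit device can conflict.
            self.pipeline = pipeline(
                "text-generation",
                model=self.model,
                tokenizer=self.tokenizer
            )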
            
            print(f"✓ Local model loaded successfully")
            print(f"✓ Device: {'GPU' if torch.cuda.is_available() else 'CPU'}")
            
        except Exception as e:
            print(f"❌ Error loading local model: {e}")
            print("💡 Trying smaller model...")
            try:
                # Fallback to a tiny model
                self.model_name = "gpt2"
                self.pipeline = pipeline("text-generation", model="gpt2")
                print(f"✓ Fallback model (GPT-2) loaded")
            except Exception as fallback_error:
                print(f"❌ Could not load any local model: {fallback_error}")
    
    def generate_text(self, 
                     prompt: str,
                     max_length: int = 100,
                     temperature: float = 0.7,
                     num_return_sequences: int = 1) -> Dict:
        """
        Generate text using local model
        """
        if not self.pipeline:
            return {"error": "Model not loaded"}
        
        start_time = time.time()
        
        try:
            # Monitor memory before generation
            memory_before = psutil.virtual_memory().used / (1024**3)  # GB
            
            outputs = self.pipeline(
                prompt,
                max_length=max_length,
                temperature=temperature,
                num_return_sequences=num_return_sequences,
                pad_token_id=self.pipeline.tokenizer.eos_token_id,  # also valid on the fallback path, where self.tokenizer is None
                do_sample=True,
                truncation=True
            )
            
            end_time = time.time()
            memory_after = psutil.virtual_memory().used / (1024**3)  # GB
            
            result = {
                "generated_text": outputs[0]["generated_text"],
                "model": self.model_name,
                "prompt": prompt,
                "generation_time": end_time - start_time,
                "temperature": temperature,
                "max_length": max_length,
                "memory_used_gb": memory_after - memory_before,
                "device": "GPU" if torch.cuda.is_available() else "CPU"
            }
            
            return result
            
        except Exception as e:
            return {
                "error": str(e),
                "model": self.model_name,
                "prompt": prompt
            }

# Initialize local model (if available)
if TRANSFORMERS_AVAILABLE:
    print("Initializing local model...")
    local_model = LocalLLMInference("gpt2")  # Using GPT-2 for quick testing
    
    if local_model.pipeline:
        # Test local model
        test_result = local_model.generate_text("The future of AI is", max_length=50)
        if "error" not in test_result:
            print(f"✓ Local model test successful!")
            print(f"  Response: {test_result['generated_text']}")
            print(f"  Time: {test_result['generation_time']:.2f}s")
            print(f"  Device: {test_result['device']}")
        else:
            print(f"❌ Local model test failed: {test_result['error']}")
else:
    print("⚠️ Local models not available - using cloud models only")
    local_model = None

In [None]:
# Test Inference with Our Processed Data
print("=== INFERENCE TESTING ===")

def create_test_prompts(data_samples: List[Dict]) -> List[Dict]:
    """
    Create various test prompts from our processed data
    """
    prompts = []
    
    for sample in data_samples[:5]:  # Test with first 5 samples
        text = sample['text']
        category = sample['category']
        
        # Different types of prompts to test various capabilities
        test_cases = [
            {
                "prompt": f"Summarize this {category} text in one sentence:\n\n{text}",
                "task": "summarization",
                "category": category,
                "source": sample['source_file']
            },
            {
                "prompt": f"What is the main topic of this {category} text?\n\n{text}",
                "task": "topic_identification", 
                "category": category,
                "source": sample['source_file']
            },
            {
                "prompt": f"Based on this {category} text, answer: What insights can be drawn?\n\n{text}",
                "task": "insight_generation",
                "category": category,
                "source": sample['source_file']
            }
        ]
        
        prompts.extend(test_cases)
    
    return prompts

def run_inference_tests(model, prompts: List[Dict], model_name: str) -> List[Dict]:
    """
    Run inference tests on the model
    """
    results = []
    
    for i, prompt_data in enumerate(prompts):
        print(f"Running test {i+1}/{len(prompts)} - {prompt_data['task']}")
        
        # Generate response
        if model_name == "cloud":
            result = model.generate_text(
                prompt_data["prompt"],
                system_message=f"You are an expert in {prompt_data['category']} analysis.",
                temperature=0.7,
                max_tokens=150
            )
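        else:  # local model
            # max_length counts tokens (prompt included), so budget from the
            # tokenized prompt length rather than the whitespace word count
            prompt_tokens = len(model.pipeline.tokenizer.encode(prompt_data["prompt"]))
            result = model.generate_text(
                prompt_data["prompt"],
                max_length=prompt_tokens + 50,
                temperature=0.7
            )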
        
        # Store results
        test_result = {
            **prompt_data,
            **result,
            "test_id": i + 1,
            "model_type": model_name,
            "timestamp": datetime.now().isoformat()
        }
        
        results.append(test_result)
        
        # Brief pause between requests
        time.sleep(0.5)
    
    return results

# Create test prompts
test_prompts = create_test_prompts(test_data)
print(f"Created {len(test_prompts)} test prompts")

# Show sample prompts
print(f"\nSample prompts:")
for i, prompt in enumerate(test_prompts[:2]):
    print(f"\n{i+1}. Task: {prompt['task']}")
    print(f"   Category: {prompt['category']}")
    print(f"   Prompt: {prompt['prompt'][:100]}...")

all_results = []
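
`run_inference_tests` only relies on the model object exposing `generate_text` and returning a dictionary, so the harness can be dry-run without a token or a downloaded model. A minimal sketch follows, assuming the result keys used by `CloudLLMInference`; `StubLLM` is a hypothetical stand-in, and its output is an echo of the prompt, not a real model response.

```python
import time
from typing import Dict

class StubLLM:
    """Hypothetical stand-in that mimics the cloud wrapper's generate_text interface."""

    def __init__(self, model_name: str = "stub-model"):
        self.model = model_name
        self.client = object()  # truthy placeholder so `if model.client:` checks pass

    def generate_text(self, prompt: str,
                      system_message: str = "You are a helpful AI assistant.",
                      temperature: float = 0.7,
                      max_tokens: int = 200) -> Dict:
        start_time = time.time()
        # Echo the first max_tokens words of the prompt instead of calling a model.
        reply = " ".join(prompt.split()[:max_tokens])
        return {
            "generated_text": f"[stub] {reply}",
            "model": self.model,
            "prompt": prompt,
            "system_message": system_message,
            "generation_time": time.time() - start_time,
            "temperature": temperature,
            "max_tokens": max_tokens,
            "finish_reason": "stop",
            "usage": {"prompt_tokens": None, "completion_tokens": None, "total_tokens": None},
        }

stub = StubLLM()
result = stub.generate_text("Summarize this business text in one sentence:", max_tokens=3)
print(result["generated_text"])
```

Passing `StubLLM()` to `run_inference_tests` with the model name `"cloud"` would exercise the loop, the result merging, and the timing fields without network access.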

In [None]:
# Run Cloud Model Inference Tests
if cloud_model.client:
    print("\n=== TESTING CLOUD MODEL ===")
    cloud_results = run_inference_tests(cloud_model, test_prompts[:6], "cloud")  # Test subset for demo
    all_results.extend(cloud_results)
    
    print(f"✓ Completed {len(cloud_results)} cloud model tests")
    
    # Show sample results
    for result in cloud_results[:2]:
        print(f"\n📝 Test: {result['task']} | Category: {result['category']}")
        print(f"🤖 Model: {result['model']}")
        print(f"❓ Prompt: {result['prompt'][:80]}...")
        print(f"💬 Response: {result.get('generated_text', 'Error')[:120]}...")
        print(f"⏱️ Time: {result.get('generation_time', 0):.2f}s")
        if result.get('usage', {}).get('total_tokens'):
            print(f"🎯 Tokens: {result['usage']['total_tokens']}")
else:
    print("⚠️ Skipping cloud model tests - not available")

In [None]:
# Run Local Model Inference Tests
if TRANSFORMERS_AVAILABLE and local_model and local_model.pipeline:
    print("\n=== TESTING LOCAL MODEL ===")
    local_results = run_inference_tests(local_model, test_prompts[:3], "local")  # Fewer tests for local
    all_results.extend(local_results)
    
    print(f"✓ Completed {len(local_results)} local model tests")
    
    # Show sample results
    for result in local_results[:2]:
        print(f"\n📝 Test: {result['task']} | Category: {result['category']}")
        print(f"🤖 Model: {result['model']}")
        print(f"❓ Prompt: {result['prompt'][:80]}...")
        print(f"💬 Response: {result.get('generated_text', 'Error')[:120]}...")
        print(f"⏱️ Time: {result.get('generation_time', 0):.2f}s")
        print(f"💾 Device: {result.get('device', 'Unknown')}")
    
    # Clean up memory
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    gc.collect()
    
else:
    print("⚠️ Skipping local model tests - not available")

In [None]:
# Benchmark Performance Analysis
print("\n=== PERFORMANCE ANALYSIS ===")

def analyze_performance(results: List[Dict]) -> Dict:
    """
    Analyze performance metrics from inference results
    """
    if not results:
        return {"error": "No results to analyze"}
    
    # Separate by model type
    cloud_results = [r for r in results if r.get('model_type') == 'cloud' and 'error' not in r]
    local_results = [r for r in results if r.get('model_type') == 'local' and 'error' not in r]
    
    analysis = {
        "total_tests": len(results),
        "successful_tests": len([r for r in results if 'error' not in r]),
        "failed_tests": len([r for r in results if 'error' in r])
    }
    
    # Cloud model performance
    if cloud_results:
        generation_times = [r['generation_time'] for r in cloud_results]
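        # Skip results whose usage or token count is missing so np.mean never sees None
        token_counts = [r['usage']['total_tokens'] for r in cloud_results
                        if r.get('usage') and r['usage'].get('total_tokens') is not None]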
        
        analysis['cloud_model'] = {
            "test_count": len(cloud_results),
            "avg_generation_time": np.mean(generation_times),
            "min_generation_time": np.min(generation_times),
            "max_generation_time": np.max(generation_times),
            "avg_tokens": np.mean(token_counts) if token_counts else None,
            "total_tokens": sum(token_counts) if token_counts else None
        }
    
    # Local model performance  
    if local_results:
        generation_times = [r['generation_time'] for r in local_results]
        
        analysis['local_model'] = {
            "test_count": len(local_results),
            "avg_generation_time": np.mean(generation_times),
            "min_generation_time": np.min(generation_times),
            "max_generation_time": np.max(generation_times),
            "device": local_results[0].get('device', 'Unknown')
        }
    
    # Task performance
    task_performance = {}
    for result in results:
        if 'error' not in result:
            task = result['task']
            if task not in task_performance:
                task_performance[task] = []
            task_performance[task].append(result['generation_time'])
    
    analysis['task_performance'] = {}
    for task, times in task_performance.items():
        analysis['task_performance'][task] = {
            "avg_time": np.mean(times),
            "test_count": len(times)
        }
    
    return analysis

# Run performance analysis
if all_results:
    performance = analyze_performance(all_results)
    
    print(f"Performance Analysis Results:")
    print(f"Total tests: {performance['total_tests']}")
    print(f"Successful: {performance['successful_tests']}")
    print(f"Failed: {performance['failed_tests']}")
    
    # Cloud model performance
    if 'cloud_model' in performance:
        cloud_perf = performance['cloud_model']
        print(f"\n🌐 Cloud Model Performance:")
        print(f"  Tests: {cloud_perf['test_count']}")
        print(f"  Avg time: {cloud_perf['avg_generation_time']:.2f}s")
        print(f"  Time range: {cloud_perf['min_generation_time']:.2f}s - {cloud_perf['max_generation_time']:.2f}s")
        if cloud_perf.get('avg_tokens'):
            print(f"  Avg tokens: {cloud_perf['avg_tokens']:.0f}")
            print(f"  Total tokens: {cloud_perf['total_tokens']}")
    
    # Local model performance
    if 'local_model' in performance:
        local_perf = performance['local_model']
        print(f"\n💻 Local Model Performance:")
        print(f"  Tests: {local_perf['test_count']}")
        print(f"  Avg time: {local_perf['avg_generation_time']:.2f}s")
        print(f"  Time range: {local_perf['min_generation_time']:.2f}s - {local_perf['max_generation_time']:.2f}s")
        print(f"  Device: {local_perf['device']}")
    
    # Task performance
    print(f"\n📊 Performance by Task:")
    for task, perf in performance['task_performance'].items():
        print(f"  {task}: {perf['avg_time']:.2f}s avg ({perf['test_count']} tests)")

else:
    print("❌ No results available for analysis")
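
Mean, minimum, and maximum can hide tail latency, and raw generation time does not account for response length. The sketch below adds two complementary metrics, assuming result dictionaries shaped like those produced above (`generation_time`, `usage.completion_tokens`). `latency_percentile` and `tokens_per_second` are hypothetical helpers, and the sample numbers are made up for illustration.

```python
import math
from typing import Dict, List, Optional

def latency_percentile(times: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not times:
        raise ValueError("times must not be empty")
    ordered = sorted(times)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def tokens_per_second(results: List[Dict]) -> Optional[float]:
    """Completion tokens divided by generation time, skipping results without usage data."""
    total_tokens, total_time = 0, 0.0
    for r in results:
        tokens = (r.get("usage") or {}).get("completion_tokens")
        if "error" in r or tokens is None:
            continue
        total_tokens += tokens
        total_time += r["generation_time"]
    return total_tokens / total_time if total_time > 0 else None

# Illustrative values only, not measured results.
demo = [
    {"generation_time": 1.0, "usage": {"completion_tokens": 40}},
    {"generation_time": 2.0, "usage": {"completion_tokens": 80}},
    {"generation_time": 4.0, "usage": {"completion_tokens": None}},
    {"error": "timeout"},
]
times = [r["generation_time"] for r in demo if "error" not in r]
print(latency_percentile(times, 50), latency_percentile(times, 95), tokens_per_second(demo))
```

With only a handful of tests per model, as in this notebook, high percentiles simply collapse to the slowest sample, so they become informative only with larger runs.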

In [None]:
# Save Results and Generate Report
print("\n=== SAVING RESULTS ===")

def save_inference_results(results: List[Dict], performance: Dict):
    """
    Save inference results and performance analysis
    """
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    
    # Save detailed results
    results_file = BASE_DIR / f"inference_results_{timestamp}.json"
    with open(results_file, 'w', encoding='utf-8') as f:
        json.dump({
            "timestamp": timestamp,
            "results": results,
            "performance": performance,
            "metadata": {
                "total_tests": len(results),
                "data_source": str(DATA_FILE),
                "models_tested": list(set(r.get('model_type', 'unknown') for r in results))
            }
        }, f, indent=2, ensure_ascii=False)
    
    print(f"✓ Results saved to: {results_file}")
    
    # Create summary CSV
    summary_data = []
    for result in results:
        summary_data.append({
            'test_id': result.get('test_id'),
            'model_type': result.get('model_type'),
            'model': result.get('model'),
            'task': result.get('task'),
            'category': result.get('category'),
            'generation_time': result.get('generation_time'),
            'success': 'error' not in result,
            'tokens_used': result.get('usage', {}).get('total_tokens') if result.get('usage') else None
        })
    
    summary_df = pd.DataFrame(summary_data)
    summary_file = BASE_DIR / f"inference_summary_{timestamp}.csv"
    summary_df.to_csv(summary_file, index=False)
    
    print(f"✓ Summary saved to: {summary_file}")
    
    return results_file, summary_file

# Save results if we have any
if all_results:
    results_file, summary_file = save_inference_results(all_results, performance)
    
    # Display summary table
    print(f"\n📋 INFERENCE TESTING SUMMARY")
    print("=" * 50)
    
    summary_df = pd.read_csv(summary_file)
    
    # Group by model type and task
    if not summary_df.empty:
        print("\nResults by Model and Task:")
        pivot = summary_df.groupby(['model_type', 'task']).agg({
            'generation_time': ['mean', 'count'],
            'success': 'sum'
        }).round(2)
        print(pivot)
        
        print(f"\nSuccess Rate by Model:")
        success_rate = summary_df.groupby('model_type')['success'].agg(['sum', 'count'])
        success_rate['rate'] = (success_rate['sum'] / success_rate['count'] * 100).round(1)
        print(success_rate)

else:
    print("❌ No results to save")

print(f"\n🎉 INFERENCE TESTING COMPLETED!")
print(f"Files saved to: {BASE_DIR}")

## Next Steps & Recommendations

Based on your inference testing results, here are recommendations for different use cases:

### 🌐 **Cloud Models (GitHub Models)**
**Best for:**
- Production applications
- High-quality responses
- Cost-effective scaling
- No local compute requirements

**Recommended models:**
- `openai/gpt-4.1-mini` - Best balance of quality and cost
- `openai/gpt-4.1` - Maximum quality for critical tasks  
- `microsoft/phi-4-mini-instruct` - Efficient for specific tasks

### 💻 **Local Models**
**Best for:**
- Privacy-sensitive applications
- Offline deployment
- Cost control at scale
- Custom fine-tuning

**Recommended approaches:**
- Use quantized models to reduce memory use and latency, usually at some cost in output quality
- Consider GPU acceleration for faster inference
- Fine-tune smaller models on your specific domain

### 🎯 **For Your Use Case**
Based on your processed data categories:

1. **Business Analysis**: Use cloud models for complex reasoning
2. **Content Summarization**: Local models can often handle this well
3. **Domain-specific Tasks**: Consider fine-tuning on your data

### 📈 **Performance Optimization**
- **Batch processing** for multiple requests
- **Caching** for repeated similar queries
- **Prompt engineering** to improve output quality
- **Response streaming** for better user experience
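
Of the techniques above, caching is the simplest to illustrate without a live endpoint. The sketch below is a minimal in-memory cache keyed on the full request, assuming a `generate_text`-style callable that returns a dictionary. `CachedGenerator` and `fake_generate` are hypothetical names. Caching is mainly useful at low temperature, since sampled outputs are otherwise expected to vary between calls.

```python
import hashlib
import json
from typing import Callable, Dict

class CachedGenerator:
    """Wrap a generate function and reuse results for identical requests."""

    def __init__(self, generate: Callable[..., Dict]):
        self._generate = generate
        self._cache: Dict[str, Dict] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt: str, **params) -> str:
        # Hash the prompt together with every parameter so that a change in
        # temperature or max_tokens produces a different cache entry.
        payload = json.dumps({"prompt": prompt, **params}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def __call__(self, prompt: str, **params) -> Dict:
        key = self._key(prompt, **params)
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        result = self._generate(prompt, **params)
        if "error" not in result:  # do not cache failed calls
            self._cache[key] = result
        return result

# Stand-in for a real model call, used only to demonstrate the cache.
def fake_generate(prompt: str, **params) -> Dict:
    return {"generated_text": prompt.upper()}

cached = CachedGenerator(fake_generate)
first = cached("hello", temperature=0.0, max_tokens=50)
second = cached("hello", temperature=0.0, max_tokens=50)   # served from cache
third = cached("hello", temperature=0.0, max_tokens=100)   # different parameters
print(cached.hits, cached.misses)
```

In the notebook, `cloud_model.generate_text` could be passed as the wrapped callable; a persistent store would be needed for the cache to survive kernel restarts.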

### 🔧 **Production Deployment**
```python
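# Example production setup
import os

from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential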

client = ChatCompletionsClient(
    endpoint="https://models.github.ai/inference",
    credential=AzureKeyCredential(os.environ["GITHUB_TOKEN"])
)

def production_inference(text, category):
    response = client.complete(
        messages=[
            SystemMessage(f"You are an expert {category} analyst."),
            UserMessage(f"Analyze: {text}")
        ],
        model="openai/gpt-4.1-mini",
        temperature=0.3,  # Lower for consistency
        max_tokens=200
    )
    return response.choices[0].message.content
```
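
Hosted endpoints can return transient errors such as rate limits, which the function above would surface directly to the caller. The sketch below is a retry wrapper with exponential backoff. It assumes any exception is worth retrying, whereas real code would likely narrow this to specific error types. `with_retries` and `flaky` are hypothetical names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(call: Callable[[], T], max_attempts: int = 3,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run call(), retrying failed attempts after base_delay * 2**attempt seconds."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: propagate the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")

# Demonstration with a function that fails twice before succeeding.
attempts = {"count": 0}
delays = []  # records requested delays instead of actually sleeping

def flaky() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = with_retries(flaky, max_attempts=4, base_delay=0.5, sleep=delays.append)
print(result, attempts["count"], delays)
```

A call such as `with_retries(lambda: production_inference(text, category))` would apply the same policy to the function above. The injectable `sleep` argument exists only to keep the demonstration fast.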