# Tutorial 5: Benchmarking & Performance Optimization

**Level**: Intermediate to Advanced  
**Time**: 30-40 minutes  
**Prerequisites**: Tutorial 1, Tutorial 2, Tutorial 4

## Overview

In this tutorial, you'll learn how to:

1. **Benchmark Pipeline Performance** - Measure throughput and latency
2. **Profile Code** - Identify bottlenecks
3. **Optimize Processing** - Improve real-time performance
4. **Compare Models** - Systematic model evaluation
5. **Monitor Resources** - Track memory usage during processing
6. **Real-Time Guarantees** - Ensure low-latency inference

## Key Concepts

- **Throughput**: Samples processed per second
- **Latency**: Time from input to output
- **Jitter**: Variability in latency from one prediction to the next (reported here as the standard deviation)
- **Profiling**: Measuring where time is spent
- **Optimization**: Reducing computational overhead
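
To make these definitions concrete, here is a minimal sketch that turns a list of per-call timings into throughput, mean latency, and two common jitter measures. The function name `summarize_timings` and the nearest-rank percentile helper are illustrative assumptions, not part of neurOS, and the throughput figure assumes calls run one after another.

```python
import math
import statistics
import time


def summarize_timings(latencies_s):
    """Summarize per-call latencies (in seconds) into the metrics defined above."""
    ordered = sorted(latencies_s)
    n = len(ordered)
    total = sum(ordered)

    def percentile(p):
        # Nearest-rank percentile: smallest value with at least p% of samples at or below it
        rank = max(1, math.ceil(p * n / 100))
        return ordered[rank - 1]

    return {
        "throughput_per_s": n / total,  # calls completed per second of compute
        "mean_latency_ms": 1000 * total / n,
        "jitter_std_ms": 1000 * statistics.pstdev(ordered),
        "jitter_p99_minus_p50_ms": 1000 * (percentile(99) - percentile(50)),
    }


# Time a toy workload standing in for a pipeline call
timings = []
for _ in range(50):
    start = time.perf_counter()
    sum(range(1000))
    timings.append(time.perf_counter() - start)

print(summarize_timings(timings))
```

The standard deviation is easy to compute, while the gap between P99 and the median says more about the occasional slow call that matters in real-time use.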

---

## Section 1: Basic Pipeline Benchmarking

Let's start by measuring the performance of a basic pipeline.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import time
from typing import Dict, List
import pandas as pd

from neuros.pipeline import Pipeline
from neuros.drivers import MockDriver
from neuros.models import SimpleClassifier
from neuros.processing import BandpassFilter, BandPowerExtractor

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

print("✓ Imports successful")

### Define Benchmarking Utilities

In [None]:
class PipelineBenchmark:
    """
    Comprehensive benchmarking suite for neurOS pipelines.
    """
    
    def __init__(self, pipeline: Pipeline):
        self.pipeline = pipeline
        self.results = {}
    
    def measure_latency(self, X: np.ndarray, n_iterations: int = 100) -> Dict:
        """
        Measure prediction latency.
        
        Returns
        -------
        dict
            Statistics including mean, median, std, min, max latency
        """
        latencies = []
        
        # Warm-up (excluded from the statistics)
        for _ in range(10):
            _ = self.pipeline.predict(X[:1])
        
        # Measure one sample at a time, wrapping around if n_iterations > len(X)
        for i in range(n_iterations):
            idx = i % len(X)
            start = time.perf_counter()
            _ = self.pipeline.predict(X[idx:idx+1])
            end = time.perf_counter()
            latencies.append((end - start) * 1000)  # Convert to ms
        
        return {
            'mean_ms': np.mean(latencies),
            'median_ms': np.median(latencies),
            'std_ms': np.std(latencies),
            'min_ms': np.min(latencies),
            'max_ms': np.max(latencies),
            'p95_ms': np.percentile(latencies, 95),
            'p99_ms': np.percentile(latencies, 99),
            'latencies': latencies
        }
    
    def measure_throughput(self, X: np.ndarray, duration: float = 5.0) -> Dict:
        """
        Measure throughput (samples per second).
        
        Parameters
        ----------
        X : ndarray
            Test data
        duration : float
            How long to run the benchmark (seconds)
        
        Returns
        -------
        dict
            Throughput statistics
        """
        samples_processed = 0
        start_time = time.perf_counter()
        
        while (time.perf_counter() - start_time) < duration:
            batch_size = min(32, len(X))
            _ = self.pipeline.predict(X[:batch_size])
            samples_processed += batch_size
        
        elapsed = time.perf_counter() - start_time
        throughput = samples_processed / elapsed
        
        return {
            'samples_per_second': throughput,
            'total_samples': samples_processed,
            'duration_s': elapsed
        }
    
    def measure_batch_performance(self, X: np.ndarray, batch_sizes: List[int]) -> Dict:
        """
        Measure performance across different batch sizes.
        """
        results = {}
        
        for batch_size in batch_sizes:
            latencies = []
            
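            # The slice may be shorter than batch_size if X has too few samples
            batch = X[:batch_size]
            for _ in range(20):
                start = time.perf_counter()
                _ = self.pipeline.predict(batch)
                end = time.perf_counter()
                latencies.append((end - start) * 1000)  # ms
            
            mean_latency_ms = np.mean(latencies)
            results[batch_size] = {
                'mean_latency_ms': mean_latency_ms,
                # Use the actual batch length so throughput is not overstated
                'samples_per_second': len(batch) / (mean_latency_ms / 1000)
            }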
        
        return results
    
    def run_full_benchmark(self, X: np.ndarray) -> Dict:
        """
        Run comprehensive benchmark suite.
        """
        print("Running comprehensive benchmark...\n")
        
        # Latency
        print("1. Measuring latency...")
        latency_stats = self.measure_latency(X, n_iterations=100)
        print(f"   Mean latency: {latency_stats['mean_ms']:.2f} ms")
        print(f"   P95 latency: {latency_stats['p95_ms']:.2f} ms")
        
        # Throughput
        print("\n2. Measuring throughput...")
        throughput_stats = self.measure_throughput(X, duration=3.0)
        print(f"   Throughput: {throughput_stats['samples_per_second']:.0f} samples/sec")
        
        # Batch performance
        print("\n3. Measuring batch performance...")
        batch_sizes = [1, 8, 16, 32, 64]
        batch_stats = self.measure_batch_performance(X, batch_sizes)
        
        self.results = {
            'latency': latency_stats,
            'throughput': throughput_stats,
            'batch': batch_stats
        }
        
        print("\n✓ Benchmark complete")
        return self.results

print("✓ PipelineBenchmark class defined")

### Run Benchmark on Simple Pipeline

In [None]:
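# Create synthetic data. Features and labels are random, so accuracy
# will sit near chance (about 25% for 4 classes); only the timing is meaningful.
np.random.seed(42)
n_samples = 1000
n_channels = 64
n_features = 32

X_test = np.random.randn(n_samples, n_features)
y_test = np.random.randint(0, 4, n_samples)

# Create simple pipeline
driver = MockDriver(n_channels=n_channels, sampling_rate=250)
model = SimpleClassifier(model_type='logistic')

# Train on the first 700 samples; the remaining 300 are held out for benchmarking
model.train(X_test[:700], y_test[:700])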

# Create pipeline
pipeline = Pipeline(driver=driver, model=model)

# Benchmark
benchmark = PipelineBenchmark(pipeline)
results = benchmark.run_full_benchmark(X_test[700:])

### Visualize Latency Distribution

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Latency histogram
latencies = results['latency']['latencies']
ax1.hist(latencies, bins=30, edgecolor='black', alpha=0.7, color='steelblue')
ax1.axvline(results['latency']['mean_ms'], color='red', linestyle='--', 
            linewidth=2, label=f"Mean: {results['latency']['mean_ms']:.2f} ms")
ax1.axvline(results['latency']['p95_ms'], color='orange', linestyle='--', 
            linewidth=2, label=f"P95: {results['latency']['p95_ms']:.2f} ms")
ax1.set_xlabel('Latency (ms)')
ax1.set_ylabel('Frequency')
ax1.set_title('Latency Distribution')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Latency over time
ax2.plot(latencies, linewidth=1, alpha=0.7, color='steelblue')
ax2.axhline(results['latency']['mean_ms'], color='red', linestyle='--', 
            linewidth=2, alpha=0.7, label='Mean')
ax2.set_xlabel('Iteration')
ax2.set_ylabel('Latency (ms)')
ax2.set_title('Latency Over Time')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nLatency Statistics:")
print(f"  Mean: {results['latency']['mean_ms']:.2f} ms")
print(f"  Std:  {results['latency']['std_ms']:.2f} ms")
print(f"  P95:  {results['latency']['p95_ms']:.2f} ms")
print(f"  P99:  {results['latency']['p99_ms']:.2f} ms")

---

## Section 2: Batch Size Optimization

Analyze how batch size affects performance.

In [None]:
# Extract batch performance data
batch_results = results['batch']
batch_sizes = sorted(batch_results.keys())
mean_latencies = [batch_results[bs]['mean_latency_ms'] for bs in batch_sizes]
throughputs = [batch_results[bs]['samples_per_second'] for bs in batch_sizes]

# Visualize
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Latency vs batch size
ax1.plot(batch_sizes, mean_latencies, marker='o', linewidth=2, markersize=8, color='coral')
ax1.set_xlabel('Batch Size')
ax1.set_ylabel('Mean Latency (ms)')
ax1.set_title('Latency vs Batch Size')
ax1.grid(True, alpha=0.3)
ax1.set_xscale('log', base=2)

# Throughput vs batch size
ax2.plot(batch_sizes, throughputs, marker='s', linewidth=2, markersize=8, color='green')
ax2.set_xlabel('Batch Size')
ax2.set_ylabel('Throughput (samples/sec)')
ax2.set_title('Throughput vs Batch Size')
ax2.grid(True, alpha=0.3)
ax2.set_xscale('log', base=2)

plt.tight_layout()
plt.show()

# Find optimal batch size
optimal_idx = np.argmax(throughputs)
optimal_batch_size = batch_sizes[optimal_idx]
optimal_throughput = throughputs[optimal_idx]

print(f"\nOptimal Configuration:")
print(f"  Batch Size: {optimal_batch_size}")
print(f"  Throughput: {optimal_throughput:.0f} samples/sec")
print(f"  Latency: {mean_latencies[optimal_idx]:.2f} ms")
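
Picking the batch size with the highest throughput ignores latency, which usually grows with batch size. The sketch below picks the highest-throughput batch size whose mean latency stays within a budget. `pick_batch_size` and the numbers in `example_results` are hypothetical and only mirror the shape of `results['batch']`.

```python
def pick_batch_size(batch_results, max_latency_ms):
    """Return the highest-throughput batch size whose mean latency fits the budget."""
    feasible = {
        bs: stats
        for bs, stats in batch_results.items()
        if stats["mean_latency_ms"] <= max_latency_ms
    }
    if not feasible:
        return None  # no batch size meets the latency budget
    return max(feasible, key=lambda bs: feasible[bs]["samples_per_second"])


# Hypothetical numbers for illustration only; not measured results
example_results = {
    1: {"mean_latency_ms": 0.5, "samples_per_second": 2000.0},
    8: {"mean_latency_ms": 1.0, "samples_per_second": 8000.0},
    32: {"mean_latency_ms": 3.0, "samples_per_second": 10666.7},
    64: {"mean_latency_ms": 8.0, "samples_per_second": 8000.0},
}

print(pick_batch_size(example_results, max_latency_ms=2.0))  # → 8
```

In the notebook, the same function could be called as `pick_batch_size(results['batch'], max_latency_ms=...)` with a budget suited to your application.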

---

## Section 3: Model Comparison Benchmark

Compare performance across different model types.

In [None]:
from neuros.models import SimpleClassifier

# Define models to compare
models_to_test = {
    'Logistic Regression': SimpleClassifier(model_type='logistic'),
    'SVM (Linear)': SimpleClassifier(model_type='svm'),
    'Random Forest': SimpleClassifier(model_type='random_forest'),
    'k-NN (k=5)': SimpleClassifier(model_type='knn')
}

# Train all models
print("Training models...")
for name, model in models_to_test.items():
    model.train(X_test[:700], y_test[:700])
    print(f"  ✓ {name}")

# Benchmark each model
print("\nBenchmarking models...\n")
comparison_results = {}

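for name, candidate in models_to_test.items():
    # Separate names keep the `pipeline` and `benchmark` objects from Section 1
    # intact, so later sections still profile the original pipeline
    candidate_pipeline = Pipeline(driver=driver, model=candidate)
    candidate_benchmark = PipelineBenchmark(candidate_pipeline)
    
    # Quick benchmark
    latency = candidate_benchmark.measure_latency(X_test[700:], n_iterations=50)
    throughput = candidate_benchmark.measure_throughput(X_test[700:], duration=2.0)
    
    # Evaluate accuracy on the held-out samples
    y_pred = candidate.predict(X_test[700:])
    accuracy = np.mean(y_pred == y_test[700:])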
    
    comparison_results[name] = {
        'latency_ms': latency['mean_ms'],
        'throughput': throughput['samples_per_second'],
        'accuracy': accuracy
    }
    
    print(f"{name}:")
    print(f"  Latency: {latency['mean_ms']:.2f} ms")
    print(f"  Throughput: {throughput['samples_per_second']:.0f} samples/sec")
    print(f"  Accuracy: {accuracy:.2%}")
    print()

### Visualize Model Comparison

In [None]:
# Create comparison DataFrame
df = pd.DataFrame(comparison_results).T
df = df.sort_values('throughput', ascending=False)

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Latency comparison
ax1 = axes[0]
bars1 = ax1.barh(df.index, df['latency_ms'], color='coral')
ax1.set_xlabel('Latency (ms)')
ax1.set_title('Inference Latency')
ax1.grid(True, alpha=0.3, axis='x')

# Add value labels
for i, bar in enumerate(bars1):
    width = bar.get_width()
    ax1.text(width, bar.get_y() + bar.get_height()/2, 
             f'{width:.2f}',
             ha='left', va='center', fontsize=9)

# Throughput comparison
ax2 = axes[1]
bars2 = ax2.barh(df.index, df['throughput'], color='green')
ax2.set_xlabel('Throughput (samples/sec)')
ax2.set_title('Processing Throughput')
ax2.grid(True, alpha=0.3, axis='x')

for i, bar in enumerate(bars2):
    width = bar.get_width()
    ax2.text(width, bar.get_y() + bar.get_height()/2, 
             f'{width:.0f}',
             ha='left', va='center', fontsize=9)

# Accuracy comparison
ax3 = axes[2]
bars3 = ax3.barh(df.index, df['accuracy'], color='steelblue')
ax3.set_xlabel('Accuracy')
ax3.set_title('Prediction Accuracy')
ax3.set_xlim([0, 1])
ax3.grid(True, alpha=0.3, axis='x')

for i, bar in enumerate(bars3):
    width = bar.get_width()
    ax3.text(width, bar.get_y() + bar.get_height()/2, 
             f'{width:.1%}',
             ha='left', va='center', fontsize=9)

plt.tight_layout()
plt.show()

print("\nPerformance-Accuracy Trade-off:")
print(df.to_string())
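
One way to read a performance-accuracy table is to keep only the models that no other model beats on both latency and accuracy (the Pareto front). The sketch below assumes a dictionary shaped like `comparison_results`. The function name and the example numbers are hypothetical.

```python
def pareto_front(results):
    """Return names of models not dominated on (lower latency, higher accuracy)."""
    front = []
    for name, r in results.items():
        dominated = any(
            other["latency_ms"] <= r["latency_ms"]
            and other["accuracy"] >= r["accuracy"]
            and (other["latency_ms"] < r["latency_ms"] or other["accuracy"] > r["accuracy"])
            for other_name, other in results.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical numbers for illustration only; not measured results
example = {
    "model_a": {"latency_ms": 0.2, "accuracy": 0.70},
    "model_b": {"latency_ms": 1.5, "accuracy": 0.85},
    "model_c": {"latency_ms": 2.0, "accuracy": 0.80},
}

print(pareto_front(example))  # → ['model_a', 'model_b']
```

Here `model_c` drops out because `model_b` is both faster and more accurate. Choosing between the remaining models depends on how much latency your application can tolerate.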

---

## Section 4: Code Profiling

Identify performance bottlenecks using profiling.

In [None]:
import cProfile
import pstats
import io
from pstats import SortKey

def profile_pipeline(pipeline: Pipeline, X: np.ndarray, n_iterations: int = 100):
    """
    Profile pipeline execution to find bottlenecks.
    """
    profiler = cProfile.Profile()
    
    # Profile predictions
    profiler.enable()
    for i in range(n_iterations):
        _ = pipeline.predict(X[i:i+1])
    profiler.disable()
    
    # Get stats
    s = io.StringIO()
    stats = pstats.Stats(profiler, stream=s)
    stats.sort_stats(SortKey.CUMULATIVE)
    stats.print_stats(20)  # Top 20 functions
    
    return s.getvalue()

# Profile the pipeline
print("Profiling pipeline (top 20 functions by cumulative time):\n")
profile_output = profile_pipeline(pipeline, X_test[700:], n_iterations=50)
print(profile_output)
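
`cProfile` reports every function call and adds some overhead to each one. When you only need to know which pipeline stage dominates, a lightweight alternative is to time named stages by hand. The `StageTimer` helper below is an illustrative sketch, not a neurOS API, and the two stages are stand-ins for real processing steps.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulate wall-clock time per named stage."""

    def __init__(self):
        self.totals_s = defaultdict(float)
        self.calls = defaultdict(int)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raises
            self.totals_s[name] += time.perf_counter() - start
            self.calls[name] += 1

    def report(self):
        """Return (stage, total seconds, calls) tuples, slowest stage first."""
        rows = [(name, total, self.calls[name]) for name, total in self.totals_s.items()]
        return sorted(rows, key=lambda row: row[1], reverse=True)


timer = StageTimer()
for _ in range(5):
    with timer.stage("filter"):    # stand-in for a slow filtering step
        time.sleep(0.02)
    with timer.stage("classify"):  # stand-in for a fast inference step
        sum(range(100))

for name, total_s, calls in timer.report():
    print(f"{name}: {1000 * total_s / calls:.2f} ms per call over {calls} calls")
```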

---

## Section 5: Memory Profiling

Monitor memory usage during processing.

In [None]:
import psutil
import os

def measure_memory_usage(pipeline: Pipeline, X: np.ndarray, n_iterations: int = 100):
    """
    Measure memory usage during pipeline execution.
    """
    process = psutil.Process(os.getpid())
    
    # Get baseline
    baseline_memory = process.memory_info().rss / 1024 / 1024  # MB
    
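    memory_samples = []
    
    # Run predictions and sample memory every 10 iterations
    for i in range(n_iterations):
        idx = i % len(X)  # wrap around if n_iterations > len(X)
        _ = pipeline.predict(X[idx:idx+1])
        
        if i % 10 == 0:
            current_memory = process.memory_info().rss / 1024 / 1024  # MB
            memory_samples.append(current_memory)
    
    # Take a final reading, then use the largest sampled value as the observed peak
    memory_samples.append(process.memory_info().rss / 1024 / 1024)  # MB
    peak_memory = max(memory_samples)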
    
    return {
        'baseline_mb': baseline_memory,
        'peak_mb': peak_memory,
        'increase_mb': peak_memory - baseline_memory,
        'samples': memory_samples
    }

# Measure memory
memory_stats = measure_memory_usage(pipeline, X_test[700:], n_iterations=100)

print(f"Memory Usage:")
print(f"  Baseline: {memory_stats['baseline_mb']:.1f} MB")
print(f"  Peak: {memory_stats['peak_mb']:.1f} MB")
print(f"  Increase: {memory_stats['increase_mb']:.1f} MB")

# Visualize
plt.figure(figsize=(10, 4))
plt.plot(memory_stats['samples'], marker='o', linewidth=2, markersize=6, color='purple')
plt.axhline(memory_stats['baseline_mb'], color='red', linestyle='--', 
            label=f"Baseline: {memory_stats['baseline_mb']:.1f} MB")
plt.xlabel('Sample Point')
plt.ylabel('Memory Usage (MB)')
plt.title('Memory Usage During Pipeline Execution')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
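
Resident set size (RSS) covers the whole process and can miss short-lived allocations that fall between samples. As a complement, the standard-library `tracemalloc` module records the peak of the allocations it traces. The helper below is a sketch with hypothetical names. Allocations made by native code that bypasses Python's allocators may not show up.

```python
import tracemalloc


def measure_python_allocations(fn, *args, **kwargs):
    """Run fn and return (result, current_bytes, peak_bytes) as seen by tracemalloc."""
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, current, peak


def build_and_discard(n):
    # Allocates a temporary list that is freed when the function returns
    temp = list(range(n))
    return len(temp)


result, current, peak = measure_python_allocations(build_and_discard, 100_000)
print(f"current: {current / 1024:.1f} KiB, peak: {peak / 1024:.1f} KiB")
```

A peak far above the current value points to temporary buffers, which are often a good target for reuse or preallocation.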

---

## Section 6: Real-Time Performance Guarantees

Verify that the pipeline can meet real-time requirements.

In [None]:
def check_realtime_performance(
    pipeline: Pipeline,
    X: np.ndarray,
    sampling_rate: int = 250,
    target_latency_ms: float = 50.0,
    n_iterations: int = 100
) -> Dict:
    """
    Check if pipeline meets real-time performance requirements.
    
    Parameters
    ----------
    sampling_rate : int
        Data sampling rate (Hz)
    target_latency_ms : float
        Maximum acceptable latency (ms)
    """
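    # Interval between consecutive samples at this sampling rate.
    # Reported for reference only; the pass/fail check uses target_latency_ms.
    sample_period_ms = 1000.0 / sampling_rate
    
    # Warm-up so one-time setup costs do not count as violations
    for _ in range(10):
        _ = pipeline.predict(X[:1])
    
    # Measure latencies, wrapping around if n_iterations > len(X)
    latencies = []
    for i in range(n_iterations):
        idx = i % len(X)
        start = time.perf_counter()
        _ = pipeline.predict(X[idx:idx+1])
        end = time.perf_counter()
        latencies.append((end - start) * 1000)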
    
    latencies = np.array(latencies)
    
    # Check constraints
    mean_latency = np.mean(latencies)
    p99_latency = np.percentile(latencies, 99)
    max_latency = np.max(latencies)
    
    violations = np.sum(latencies > target_latency_ms)
    violation_rate = violations / len(latencies)
    
    # Compute headroom
    headroom_percent = ((target_latency_ms - mean_latency) / target_latency_ms) * 100
    
    meets_requirements = (p99_latency < target_latency_ms)
    
    return {
        'sample_period_ms': sample_period_ms,
        'target_latency_ms': target_latency_ms,
        'mean_latency_ms': mean_latency,
        'p99_latency_ms': p99_latency,
        'max_latency_ms': max_latency,
        'violations': violations,
        'violation_rate': violation_rate,
        'headroom_percent': headroom_percent,
        'meets_requirements': meets_requirements,
        'latencies': latencies
    }

# Check real-time performance
rt_stats = check_realtime_performance(
    pipeline,
    X_test[700:],
    sampling_rate=250,
    target_latency_ms=50.0,
    n_iterations=100
)

print("Real-Time Performance Check:")
print(f"  Sample period: {rt_stats['sample_period_ms']:.2f} ms (@ 250 Hz)")
print(f"  Target latency: {rt_stats['target_latency_ms']:.2f} ms")
print(f"\nMeasured Performance:")
print(f"  Mean latency: {rt_stats['mean_latency_ms']:.2f} ms")
print(f"  P99 latency: {rt_stats['p99_latency_ms']:.2f} ms")
print(f"  Max latency: {rt_stats['max_latency_ms']:.2f} ms")
print(f"\nRequirement Check:")
print(f"  Violations: {rt_stats['violations']}/{len(rt_stats['latencies'])} ({rt_stats['violation_rate']:.1%})")
print(f"  Headroom: {rt_stats['headroom_percent']:.1f}%")
print(f"  Status: {'✓ PASS' if rt_stats['meets_requirements'] else '✗ FAIL'}")
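
The check above compares latency with a fixed target, while `sample_period_ms` is only reported. In a streaming setting, a second question is whether inference finishes before the next prediction is due. The sketch below assumes one prediction per hop of `hop_samples` new samples. That parameter and the function name are hypothetical.

```python
def streaming_budget(sampling_rate_hz, hop_samples, p99_latency_ms):
    """Compare tail (P99) compute time with the time between successive predictions."""
    hop_period_ms = 1000.0 * hop_samples / sampling_rate_hz
    utilization = p99_latency_ms / hop_period_ms
    return {
        "hop_period_ms": hop_period_ms,
        "utilization": utilization,      # fraction of each hop spent computing
        "keeps_up": utilization < 1.0,   # False means predictions fall behind the stream
    }


# At 250 Hz, predicting every 25 samples leaves 100 ms between predictions
print(streaming_budget(250, hop_samples=25, p99_latency_ms=12.0))

# Predicting on every sample leaves only 4 ms, so 12 ms of compute cannot keep up
print(streaming_budget(250, hop_samples=1, p99_latency_ms=12.0))
```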

### Visualize Real-Time Compliance

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Latency over time with threshold
ax1.plot(rt_stats['latencies'], linewidth=1, alpha=0.7, color='steelblue', label='Latency')
ax1.axhline(rt_stats['target_latency_ms'], color='red', linestyle='--', 
            linewidth=2, label=f"Target: {rt_stats['target_latency_ms']:.0f} ms")
ax1.axhline(rt_stats['mean_latency_ms'], color='green', linestyle='--', 
            linewidth=2, label=f"Mean: {rt_stats['mean_latency_ms']:.2f} ms")
ax1.set_xlabel('Sample')
ax1.set_ylabel('Latency (ms)')
ax1.set_title('Real-Time Latency Compliance')
ax1.legend()
ax1.grid(True, alpha=0.3)

# CDF (Cumulative Distribution Function)
sorted_latencies = np.sort(rt_stats['latencies'])
cdf = np.arange(1, len(sorted_latencies) + 1) / len(sorted_latencies)
ax2.plot(sorted_latencies, cdf * 100, linewidth=2, color='steelblue')
ax2.axvline(rt_stats['target_latency_ms'], color='red', linestyle='--', 
            linewidth=2, label='Target')
ax2.axhline(99, color='orange', linestyle='--', linewidth=2, alpha=0.7, label='99th percentile')
ax2.set_xlabel('Latency (ms)')
ax2.set_ylabel('Cumulative Probability (%)')
ax2.set_title('Latency CDF')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

## Section 7: Optimization Strategies

Practical tips for improving pipeline performance.

### Strategy 1: Batch Processing

In [None]:
def compare_batch_vs_sequential(pipeline, X, batch_size=32):
    """
    Compare batch vs sequential processing.
    """
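    n_samples = min(100, len(X))
    X_subset = X[:n_samples]
    
    # Warm-up so the first timed loop does not absorb one-time setup costs
    for _ in range(10):
        _ = pipeline.predict(X_subset[:1])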
    
    # Sequential
    start = time.perf_counter()
    for i in range(n_samples):
        _ = pipeline.predict(X_subset[i:i+1])
    sequential_time = time.perf_counter() - start
    
    # Batch
    start = time.perf_counter()
    for i in range(0, n_samples, batch_size):
        batch = X_subset[i:i+batch_size]
        _ = pipeline.predict(batch)
    batch_time = time.perf_counter() - start
    
    speedup = sequential_time / batch_time
    
    return {
        'sequential_time': sequential_time,
        'batch_time': batch_time,
        'speedup': speedup
    }

comparison = compare_batch_vs_sequential(pipeline, X_test[700:], batch_size=32)

print("Batch Processing Optimization:")
print(f"  Sequential time: {comparison['sequential_time']:.3f}s")
print(f"  Batch time (32): {comparison['batch_time']:.3f}s")
print(f"  Speedup: {comparison['speedup']:.2f}x")

### Strategy 2: Feature Caching
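
When the same input window is processed more than once (for example, when comparing several models on identical data), caching extracted features avoids repeating the expensive step. The sketch below is a minimal least-recently-used cache. `FeatureCache` and `band_power` are hypothetical stand-ins, not neurOS APIs.

```python
from collections import OrderedDict


class FeatureCache:
    """Small least-recently-used cache for expensive feature computations."""

    def __init__(self, max_items=128):
        self.max_items = max_items
        self._store = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute_fn):
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = compute_fn()
        self._store[key] = value
        if len(self._store) > self.max_items:
            self._store.popitem(last=False)  # evict the least recently used entry
        return value


def band_power(window):
    # Stand-in for an expensive feature extractor: mean squared amplitude
    return sum(x * x for x in window) / len(window)


cache = FeatureCache(max_items=2)
window = (0.5, -1.0, 2.0, 0.0)  # tuples are hashable, so they can serve as keys
first = cache.get_or_compute(window, lambda: band_power(window))
second = cache.get_or_compute(window, lambda: band_power(window))
print(first, cache.hits, cache.misses)
```

For NumPy arrays, which are not hashable, one option is to use `window.tobytes()` as the key. Caching only pays off when inputs repeat exactly, so it rarely helps on a live stream of new data.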

### Strategy 3: Model Quantization (Concept)

For PyTorch models, quantization can reduce model size and improve inference speed:

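```python
# Dynamic quantization sketch for a PyTorch model.
# `torch_model` is a stand-in; the SimpleClassifier models above are not PyTorch modules.
import torch

torch_model = torch.nn.Sequential(
    torch.nn.Linear(32, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 4),
)
torch_model.eval()

# Dynamic quantization (easiest): Linear layer weights are stored as int8
quantized_model = torch.quantization.quantize_dynamic(
    torch_model, {torch.nn.Linear}, dtype=torch.qint8
)

# Gains depend on the model and hardware, so benchmark before and after.
# Commonly cited ballpark figures for Linear-heavy models on CPU:
# - 2-4x speedup
# - Up to 4x smaller model size (float32 -> int8 weights)
# - Small accuracy loss (often <1%)
```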

---

## Summary

In this tutorial, you learned:

✅ **Benchmark Pipelines** - Measure latency, throughput, and jitter  
✅ **Optimize Batch Size** - Find the best trade-off  
✅ **Compare Models** - Evaluate performance-accuracy trade-offs  
✅ **Profile Code** - Identify bottlenecks with cProfile  
✅ **Monitor Memory** - Track resource usage  
✅ **Real-Time Guarantees** - Verify latency requirements  
✅ **Optimization Strategies** - Batch processing, caching, quantization  

## Key Takeaways

1. **Measure First** - Always profile before optimizing
2. **Batch Wisely** - Larger batches improve throughput but increase latency
3. **Choose Models** - Balance accuracy vs speed for your use case
4. **Monitor Continuously** - Real-time systems need ongoing monitoring
5. **Optimize Judiciously** - Focus on bottlenecks, not premature optimization

## Best Practices

- **For Real-Time BCI**: Keep P99 latency below your target budget (50 ms in Section 6)
- **For Batch Processing**: Maximize throughput with larger batches
- **For Production**: Monitor latency continuously
- **For Research**: Prioritize accuracy, optimize later
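
Continuous monitoring can be as simple as keeping a rolling window of recent latencies and checking the tail against the budget. The `LatencyMonitor` class below is an illustrative sketch, not a neurOS component. The 50 ms budget matches the real-time target used in this tutorial.

```python
import math
from collections import deque


class LatencyMonitor:
    """Track the most recent latencies and flag when P99 exceeds a budget."""

    def __init__(self, budget_ms=50.0, window=1000):
        self.budget_ms = budget_ms
        self.samples = deque(maxlen=window)  # old samples drop off automatically

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def p99_ms(self):
        # Nearest-rank P99; requires at least one recorded sample
        ordered = sorted(self.samples)
        rank = max(1, math.ceil(99 * len(ordered) / 100))
        return ordered[rank - 1]

    def over_budget(self):
        return bool(self.samples) and self.p99_ms() > self.budget_ms


monitor = LatencyMonitor(budget_ms=50.0, window=100)
for _ in range(100):
    monitor.record(10.0)
print(monitor.over_budget())  # → False

# Two slow calls out of the last 100 push P99 over the budget
monitor.record(80.0)
monitor.record(80.0)
print(monitor.over_budget())  # → True
```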

## Next Steps

- **Tutorial 6**: NWB Integration & Real-World Data
- **Advanced**: GPU acceleration with CUDA
- **Advanced**: Distributed processing with Ray
- **Advanced**: Model optimization (pruning, distillation)

---

**Questions or feedback?** Open an issue on GitHub or check the docs at https://neuros.readthedocs.io