# AutoTune ML Trainer - Data Exploration

**Created by Sergie Code - AI Tools for Musicians** 🎵

This notebook provides a comprehensive introduction to audio data exploration and pitch detection for training intelligent AutoTune models.

## Learning Objectives
- Set up the environment and validate dependencies
- Load and explore audio data using Librosa
- Implement pitch detection with CREPE and Librosa
- Visualize audio signals and pitch information
- Prepare data for neural network training

## 1. Environment Setup and Dependencies

First, let's import all necessary libraries and validate our environment.
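
Before running the imports, it can help to see which packages are actually installed. The sketch below is one way to do that with the standard library only. The helper name `check_dependencies` and the package list are assumptions made for illustration; they are not part of the project.

```python
from importlib import metadata
from importlib.util import find_spec


def check_dependencies(names):
    """Map each import name to its installed version, or None if it is missing."""
    report = {}
    for name in names:
        if find_spec(name) is None:
            report[name] = None  # not importable in this environment
            continue
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Importable, but no distribution is registered under this exact name
            report[name] = "unknown"
    return report


# Hypothetical list; adjust it to the packages your setup needs
packages = ["numpy", "librosa", "soundfile", "torch", "crepe", "onnx", "onnxruntime"]
for name, version in check_dependencies(packages).items():
    status = f"installed ({version})" if version else "missing"
    print(f"{name:12s} {status}")
```

Unlike a bare `import`, this reports every missing package in one pass instead of stopping at the first failure.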

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Audio processing
import librosa
import librosa.display
import soundfile as sf

# Machine learning
import torch
import torch.nn as nn
import torchaudio

# Pitch detection
try:
    import crepe
    CREPE_AVAILABLE = True
    print("✅ CREPE available for advanced pitch detection")
except ImportError:
    CREPE_AVAILABLE = False
    print("⚠️ CREPE not available. Install with: pip install crepe")

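# Set up plotting style ('seaborn-v0_8' requires matplotlib >= 3.6)
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'default')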
sns.set_palette("husl")

print(f"🎵 AutoTune ML Trainer Environment")
print(f"PyTorch version: {torch.__version__}")
print(f"Librosa version: {librosa.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

## 2. Audio Data Loading and Preprocessing

Let's start by generating some test audio and learning how to load and process audio files.
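
As a dependency-free illustration of what "loading an audio file" involves, here is a minimal sketch that writes a sine wave to a 16-bit PCM WAV file and reads it back using Python's built-in `wave` module. The helper names `write_wav_pcm16` and `read_wav_pcm16` are hypothetical; in the rest of the notebook, `soundfile` and `librosa.load` do this work and handle many more formats.

```python
import tempfile
import wave
from pathlib import Path

import numpy as np


def write_wav_pcm16(path, audio, sample_rate):
    """Write a mono float signal in [-1, 1] as 16-bit PCM."""
    pcm = np.round(np.clip(audio, -1.0, 1.0) * 32767).astype("<i2")
    with wave.open(str(path), "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())


def read_wav_pcm16(path):
    """Read a mono 16-bit PCM file back as floats in [-1, 1]."""
    with wave.open(str(path), "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        frames = wav_file.readframes(wav_file.getnframes())
    audio = np.frombuffer(frames, dtype="<i2").astype(np.float64) / 32767
    return audio, sample_rate


sr_demo = 44100
t_demo = np.arange(sr_demo) / sr_demo  # exactly one second of sample times
tone = 0.8 * np.sin(2 * np.pi * 440 * t_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    wav_path = Path(tmp_dir) / "a440.wav"
    write_wav_pcm16(wav_path, tone, sr_demo)
    loaded, loaded_sr = read_wav_pcm16(wav_path)

roundtrip_error = np.max(np.abs(loaded - tone))
print(f"{len(loaded)} samples at {loaded_sr} Hz, max round-trip error {roundtrip_error:.2e}")
```

The small round-trip error comes from quantizing to 16 bits, which is worth keeping in mind when comparing a saved file against the array it was generated from.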

In [None]:
# Generate test audio signals
def generate_test_audio(frequency=440, duration=2.0, sample_rate=44100):
    """Generate a simple sine wave for testing."""
    # arange gives an exact 1/sample_rate spacing (linspace would include the endpoint)
    t = np.arange(int(sample_rate * duration)) / sample_rate
    audio = np.sin(2 * np.pi * frequency * t)
    return audio, sample_rate

# Create test signals
test_frequencies = [220, 440, 880]  # A3, A4, A5
test_signals = {}

for freq in test_frequencies:
    audio, sr = generate_test_audio(freq, duration=1.0)
    test_signals[f"A{freq}Hz"] = (audio, sr)
    
print("Generated test audio signals:")
for name, (audio, sr) in test_signals.items():
    print(f"  {name}: {len(audio)} samples, {sr} Hz")

In [None]:
# Visualize test audio signals
fig, axes = plt.subplots(3, 2, figsize=(15, 10))
fig.suptitle('Test Audio Analysis', fontsize=16, fontweight='bold')

for i, (name, (audio, sr)) in enumerate(test_signals.items()):
    # Time domain
    time_axis = np.arange(len(audio)) / sr  # avoids shadowing the `time` module imported later
    axes[i, 0].plot(time_axis[:1000], audio[:1000])  # First 1000 samples
    axes[i, 0].set_title(f'{name} - Time Domain')
    axes[i, 0].set_xlabel('Time (s)')
    axes[i, 0].set_ylabel('Amplitude')
    
    # Frequency domain
    fft = np.fft.fft(audio[:4096])  # Use first 4096 samples
    freqs = np.fft.fftfreq(4096, 1/sr)
    magnitude = np.abs(fft)
    
    axes[i, 1].plot(freqs[:2048], magnitude[:2048])
    axes[i, 1].set_title(f'{name} - Frequency Domain')
    axes[i, 1].set_xlabel('Frequency (Hz)')
    axes[i, 1].set_ylabel('Magnitude')
    axes[i, 1].set_xlim(0, 2000)

plt.tight_layout()
plt.show()

## 3. Pitch Detection with CREPE and Librosa

Now let's compare different pitch detection methods and analyze their accuracy.
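
To build some intuition first: Librosa documents `piptrack` as pitch tracking on a thresholded, parabolically interpolated STFT. The sketch below applies the same idea to a single FFT frame using NumPy only. It is a simplified illustration, not Librosa's implementation; `estimate_pitch_fft` and `cents_error` are hypothetical helper names.

```python
import numpy as np


def estimate_pitch_fft(audio, sr, n_fft=4096):
    """Estimate one pitch from the strongest FFT peak, refined by parabolic interpolation."""
    frame = audio[:n_fft] * np.hanning(n_fft)       # window to reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(frame)) + 1e-12   # small offset avoids log(0)
    k = int(np.argmax(spectrum[1:-1])) + 1          # peak bin, excluding the two edge bins
    a, b, c = np.log(spectrum[k - 1:k + 2])
    offset = 0.5 * (a - c) / (a - 2 * b + c)        # vertex of the fitted parabola, in bins
    return (k + offset) * sr / n_fft


def cents_error(estimated_hz, true_hz):
    """Signed pitch error in cents (100 cents = 1 semitone)."""
    return 1200 * np.log2(estimated_hz / true_hz)


sr_demo = 44100
t_demo = np.arange(sr_demo) / sr_demo
for true_hz in (220.0, 440.0, 880.0):
    tone = np.sin(2 * np.pi * true_hz * t_demo)
    estimate = estimate_pitch_fft(tone, sr_demo)
    print(f"{true_hz:6.1f} Hz -> {estimate:7.2f} Hz ({cents_error(estimate, true_hz):+.2f} cents)")
```

Reporting the error in cents as well as Hz is a design choice: a 2 Hz error is far more audible at 220 Hz than at 880 Hz, and cents make that comparable across notes.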

In [None]:
def detect_pitch_librosa(audio, sr):
    """Detect pitch using Librosa's piptrack."""
    pitches, magnitudes = librosa.piptrack(y=audio, sr=sr, fmin=80, fmax=2000)
    
    # Extract fundamental frequency
    pitch_values = []
    for t in range(pitches.shape[1]):
        index = magnitudes[:, t].argmax()
        pitch = pitches[index, t]
        if pitch > 0:
            pitch_values.append(pitch)
    
    return np.array(pitch_values)

def detect_pitch_crepe(audio, sr):
    """Detect pitch using CREPE (if available)."""
    if not CREPE_AVAILABLE:
        return None, None
    
    # Resample to 16kHz for CREPE
    if sr != 16000:
        audio_16k = librosa.resample(audio, orig_sr=sr, target_sr=16000)
    else:
        audio_16k = audio
    
    time, frequency, confidence, _ = crepe.predict(
        audio_16k, sr=16000, model_capacity='medium', step_size=10
    )
    
    return frequency, confidence

# Test pitch detection on our test signals
pitch_results = {}

for name, (audio, sr) in test_signals.items():
    true_freq = float(name.replace('A', '').replace('Hz', ''))
    
    # Librosa pitch detection
    librosa_pitches = detect_pitch_librosa(audio, sr)
    librosa_mean = np.mean(librosa_pitches) if len(librosa_pitches) > 0 else 0
    
    # CREPE pitch detection
    if CREPE_AVAILABLE:
        crepe_freq, crepe_conf = detect_pitch_crepe(audio, sr)
        # Filter by confidence
        valid_crepe = crepe_freq[crepe_conf > 0.8]
        crepe_mean = np.mean(valid_crepe) if len(valid_crepe) > 0 else 0
    else:
        crepe_mean = 0
    
    pitch_results[name] = {
        'true_frequency': true_freq,
        'librosa_detected': librosa_mean,
        'crepe_detected': crepe_mean,
        'librosa_error': abs(librosa_mean - true_freq),
        'crepe_error': abs(crepe_mean - true_freq) if crepe_mean > 0 else float('inf')
    }

# Display results
results_df = pd.DataFrame(pitch_results).T
print("Pitch Detection Comparison:")
print(results_df.round(2))

In [None]:
# Visualize pitch detection accuracy
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Accuracy comparison
methods = ['Librosa', 'CREPE']
frequencies = [220, 440, 880]
librosa_errors = [results_df.loc[f'A{f}Hz', 'librosa_error'] for f in frequencies]
crepe_errors = [results_df.loc[f'A{f}Hz', 'crepe_error'] for f in frequencies]

x = np.arange(len(frequencies))
width = 0.35

ax1.bar(x - width/2, librosa_errors, width, label='Librosa', alpha=0.8)
if CREPE_AVAILABLE:
    ax1.bar(x + width/2, crepe_errors, width, label='CREPE', alpha=0.8)

ax1.set_xlabel('Test Frequency (Hz)')
ax1.set_ylabel('Absolute Error (Hz)')
ax1.set_title('Pitch Detection Accuracy Comparison')
ax1.set_xticks(x)
ax1.set_xticklabels(frequencies)
ax1.legend()
ax1.grid(True, alpha=0.3)

# Detection confidence over time (using 440Hz signal)
if CREPE_AVAILABLE:
    audio_440, sr_440 = test_signals['A440Hz']
    crepe_freq, crepe_conf = detect_pitch_crepe(audio_440, sr_440)
    
    time_axis = np.linspace(0, len(audio_440)/sr_440, len(crepe_freq))
    ax2.plot(time_axis, crepe_freq, label='Detected Frequency', linewidth=2)
    ax2.axhline(y=440, color='red', linestyle='--', label='True Frequency (440 Hz)')
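    # fill_between draws a single polygon, so alpha cannot vary per frame;
    # plot the confidence curve on a secondary axis instead
    ax2_conf = ax2.twinx()
    ax2_conf.plot(time_axis, crepe_conf, color='gray', alpha=0.6, label='Confidence')
    ax2_conf.set_ylabel('Confidence')
    ax2_conf.set_ylim(0, 1.05)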
    
    ax2.set_xlabel('Time (s)')
    ax2.set_ylabel('Frequency (Hz)')
    ax2.set_title('CREPE Pitch Detection Over Time')
    ax2.legend()
    ax2.grid(True, alpha=0.3)
else:
    ax2.text(0.5, 0.5, 'CREPE not available\nInstall with: pip install crepe', 
             ha='center', va='center', transform=ax2.transAxes, fontsize=12)
    ax2.set_title('CREPE Pitch Detection (Not Available)')

plt.tight_layout()
plt.show()

## 4. Dataset Creation Pipeline

Let's create a pipeline for preparing training data from audio files.
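
One step such a pipeline typically needs is slicing each recording into fixed-size buffers that match the model input. Here is a minimal sketch under stated assumptions: the frame size of 512 matches the buffer size used later in this notebook, while the hop size, the silence threshold, and the helper name `frame_audio` are illustrative choices rather than project settings.

```python
import numpy as np


def frame_audio(audio, frame_size=512, hop_size=256):
    """Split a 1-D signal into overlapping frames of shape (num_frames, frame_size)."""
    if len(audio) < frame_size:
        return np.empty((0, frame_size), dtype=audio.dtype)
    num_frames = 1 + (len(audio) - frame_size) // hop_size
    starts = hop_size * np.arange(num_frames)
    # Each row of `index` holds the sample positions of one frame
    index = starts[:, None] + np.arange(frame_size)[None, :]
    return audio[index]


sr_demo = 44100
tone = np.sin(2 * np.pi * 440 * np.arange(sr_demo) / sr_demo)
frames = frame_audio(tone)

# Drop near-silent frames so the model is not trained on empty buffers
rms = np.sqrt(np.mean(frames ** 2, axis=1))
voiced = frames[rms > 0.01]
print(f"frames: {frames.shape}, kept after silence filter: {voiced.shape}")
```

Trailing samples that do not fill a whole frame are discarded here; padding the last frame with zeros would be an equally reasonable choice.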

In [None]:
# Add project source to path
import sys
sys.path.append('../src')

# Import our custom modules
try:
    from data.dataset_creator import DatasetCreator
    from data.audio_preprocessor import AudioPreprocessor
    from data.pitch_extractor import PitchExtractor
    print("✅ Successfully imported custom modules")
except ImportError as e:
    print(f"⚠️ Could not import custom modules: {e}")
    print("Using fallback implementations...")

# Create sample dataset structure
def create_sample_dataset():
    """Create a small sample dataset for demonstration."""
    
    # Create directories
    sample_dir = Path('../data/sample')
    sample_dir.mkdir(parents=True, exist_ok=True)
    
    # Generate diverse test audio
    sample_rate = 44100
    duration = 3.0
    
    samples = {
        'vocal_low': generate_vibrato_audio(220, duration, sample_rate),
        'vocal_mid': generate_vibrato_audio(440, duration, sample_rate),
        'vocal_high': generate_vibrato_audio(880, duration, sample_rate),
        'guitar_chord': generate_chord_audio([220, 277, 330], duration, sample_rate),
        'noisy_vocal': add_noise_to_audio(*generate_vibrato_audio(440, duration, sample_rate))
    }
    
    # Save audio files
    for name, (audio, sr) in samples.items():
        filepath = sample_dir / f"{name}.wav"
        sf.write(filepath, audio, sr)
        print(f"Created: {filepath}")
    
    return sample_dir

def generate_vibrato_audio(base_freq, duration, sample_rate, vibrato_rate=5, vibrato_depth=0.02):
    """Generate audio with vibrato (frequency modulation)."""
    t = np.arange(int(sample_rate * duration)) / sample_rate
    
    # Apply vibrato to the instantaneous frequency
    vibrato = vibrato_depth * np.sin(2 * np.pi * vibrato_rate * t)
    freq_modulated = base_freq * (1 + vibrato)
    
    # Integrate frequency to get phase; sin(2*pi*f(t)*t) would make the
    # pitch deviation grow with time instead of staying at vibrato_depth
    phase = 2 * np.pi * np.cumsum(freq_modulated) / sample_rate
    audio = np.sin(phase)
    envelope = np.exp(-t * 0.5)  # Decay envelope
    audio *= envelope
    
    return audio, sample_rate

def generate_chord_audio(frequencies, duration, sample_rate):
    """Generate audio with multiple frequencies (chord)."""
    t = np.arange(int(sample_rate * duration)) / sample_rate  # exact 1/sample_rate spacing
    audio = np.zeros_like(t)
    
    for freq in frequencies:
        audio += np.sin(2 * np.pi * freq * t) / len(frequencies)
    
    return audio, sample_rate

def add_noise_to_audio(audio, sample_rate, noise_level=0.05):
    """Add noise to audio signal."""
    noise = np.random.normal(0, noise_level, len(audio))
    return audio + noise, sample_rate

# Create the sample dataset
sample_dataset_path = create_sample_dataset()
print(f"\nSample dataset created at: {sample_dataset_path}")

In [None]:
# Analyze the sample dataset
audio_files = list(sample_dataset_path.glob('*.wav'))
print(f"Found {len(audio_files)} audio files in sample dataset:")

# Load and analyze each file
dataset_analysis = {}

for audio_file in audio_files:
    # Load audio
    audio, sr = librosa.load(audio_file, sr=44100)
    
    # Basic statistics
    duration = len(audio) / sr
    rms_energy = np.sqrt(np.mean(audio**2))
    max_amplitude = np.max(np.abs(audio))
    
    # Pitch detection
    pitch_values = detect_pitch_librosa(audio, sr)
    mean_pitch = np.mean(pitch_values) if len(pitch_values) > 0 else 0
    
    # Spectral features
    spectral_centroid = np.mean(librosa.feature.spectral_centroid(y=audio, sr=sr))
    zero_crossing_rate = np.mean(librosa.feature.zero_crossing_rate(audio))
    
    dataset_analysis[audio_file.stem] = {
        'duration': duration,
        'rms_energy': rms_energy,
        'max_amplitude': max_amplitude,
        'mean_pitch': mean_pitch,
        'spectral_centroid': spectral_centroid,
        'zero_crossing_rate': zero_crossing_rate
    }

# Create analysis DataFrame
analysis_df = pd.DataFrame(dataset_analysis).T
print("\nDataset Analysis:")
print(analysis_df.round(3))

In [None]:
# Visualize dataset characteristics
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Sample Dataset Analysis', fontsize=16, fontweight='bold')

# Plot 1: RMS Energy distribution
axes[0, 0].bar(analysis_df.index, analysis_df['rms_energy'])
axes[0, 0].set_title('RMS Energy by Sample')
axes[0, 0].set_ylabel('RMS Energy')
axes[0, 0].tick_params(axis='x', rotation=45)

# Plot 2: Pitch distribution
axes[0, 1].bar(analysis_df.index, analysis_df['mean_pitch'])
axes[0, 1].set_title('Mean Pitch by Sample')
axes[0, 1].set_ylabel('Frequency (Hz)')
axes[0, 1].tick_params(axis='x', rotation=45)

# Plot 3: Spectral centroid
axes[0, 2].bar(analysis_df.index, analysis_df['spectral_centroid'])
axes[0, 2].set_title('Spectral Centroid by Sample')
axes[0, 2].set_ylabel('Frequency (Hz)')
axes[0, 2].tick_params(axis='x', rotation=45)

# Plot 4-6: Waveforms of first three samples
for i, audio_file in enumerate(audio_files[:3]):
    audio, sr = librosa.load(audio_file, sr=44100)
    time_axis = np.arange(len(audio)) / sr  # avoids shadowing the `time` module
    
    axes[1, i].plot(time_axis, audio, alpha=0.7)
    axes[1, i].set_title(f'Waveform: {audio_file.stem}')
    axes[1, i].set_xlabel('Time (s)')
    axes[1, i].set_ylabel('Amplitude')
    axes[1, i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 5. Neural Network Architecture Implementation

Let's implement a simple pitch correction network architecture.
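
The network in this section is a stack of fully connected layers, so its size can be worked out by hand before any code runs. The arithmetic below uses the default sizes from `SimplePitchCorrectionNet` (512-sample input, 256 hidden units, 3 main layers, 32-dimensional pitch embedding) and assumes 4 bytes per float32 parameter. The helper `linear_params` is a hypothetical name.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


input_size, hidden_size, num_layers, embed_size = 512, 256, 3, 32

counts = {
    "input_norm": 2,  # BatchNorm1d(1): one learnable scale and one shift
    "pitch_embedding": linear_params(2, embed_size),
    "first_layer": linear_params(input_size + embed_size, hidden_size),
    "hidden_layers": (num_layers - 1) * linear_params(hidden_size, hidden_size),
    "audio_output": linear_params(hidden_size, input_size),
    "confidence_output": linear_params(hidden_size, 1),
    "residual_weight": 1,
}

total = sum(counts.values())
size_mb = total * 4 / (1024 * 1024)  # 4 bytes per float32 parameter

for name, count in counts.items():
    print(f"{name:18s} {count:>8,}")
print(f"{'total':18s} {total:>8,}  (~{size_mb:.2f} MB)")
```

The total comes to 403,044 parameters, roughly 1.54 MB, which should agree with what `get_model_info()` reports for the default configuration.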

In [None]:
class SimplePitchCorrectionNet(nn.Module):
    """A simplified version of the pitch correction network for demonstration."""
    
    def __init__(self, input_size=512, hidden_size=256, num_layers=3):
        super(SimplePitchCorrectionNet, self).__init__()
        
        self.input_size = input_size
        self.hidden_size = hidden_size
        
        # Input normalization
        self.input_norm = nn.BatchNorm1d(1)
        
        # Pitch embedding
        self.pitch_embedding = nn.Linear(2, 32)  # target_pitch + correction_strength
        
        # Main network layers
        layers = []
        
        # First layer (audio + pitch features)
        layers.append(nn.Linear(input_size + 32, hidden_size))
        layers.append(nn.GELU())
        layers.append(nn.Dropout(0.1))
        
        # Hidden layers
        for _ in range(num_layers - 1):
            layers.append(nn.Linear(hidden_size, hidden_size))
            layers.append(nn.GELU())
            layers.append(nn.Dropout(0.1))
        
        self.main_network = nn.Sequential(*layers)
        
        # Output layers
        self.audio_output = nn.Linear(hidden_size, input_size)
        self.confidence_output = nn.Sequential(
            nn.Linear(hidden_size, 1),
            nn.Sigmoid()
        )
        
        # Residual connection weight
        self.residual_weight = nn.Parameter(torch.tensor(0.5))
        
    def forward(self, audio_buffer, target_pitch, correction_strength):
        batch_size = audio_buffer.size(0)
        
        # Normalize audio input
        audio_normalized = self.input_norm(audio_buffer.unsqueeze(1)).squeeze(1)
        
        # Create pitch embedding
        target_pitch_norm = torch.log(target_pitch + 1e-8) / 10.0
        pitch_input = torch.cat([target_pitch_norm, correction_strength], dim=1)
        pitch_features = self.pitch_embedding(pitch_input)
        
        # Combine features
        combined_input = torch.cat([audio_normalized, pitch_features], dim=1)
        
        # Process through network
        features = self.main_network(combined_input)
        
        # Generate outputs
        pitch_correction = self.audio_output(features)
        confidence = self.confidence_output(features)
        
        # Apply residual connection
        residual_weight = torch.sigmoid(self.residual_weight)
        corrected_audio = (1 - residual_weight) * audio_buffer + residual_weight * pitch_correction
        
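        # Apply correction strength; it already has shape (batch, 1), so it
        # broadcasts over the samples. An extra unsqueeze(-1) would produce
        # a (batch, batch, samples) output.
        final_audio = (1 - correction_strength) * audio_buffer + \
                     correction_strength * corrected_audio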
        
        return final_audio, confidence
    
    def get_model_info(self):
        total_params = sum(p.numel() for p in self.parameters())
        return {
            'total_parameters': total_params,
            'input_size': self.input_size,
            'hidden_size': self.hidden_size,
            'model_size_mb': total_params * 4 / (1024 * 1024)
        }

# Create and test the model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SimplePitchCorrectionNet().to(device)

print(f"Model created on device: {device}")
print(f"Model info: {model.get_model_info()}")

# Test with dummy data
batch_size = 4
buffer_size = 512

# Create test inputs
test_audio = torch.randn(batch_size, buffer_size).to(device)
test_target_pitch = torch.rand(batch_size, 1).to(device) * 1000 + 100  # 100-1100 Hz
test_correction_strength = torch.rand(batch_size, 1).to(device)

# Forward pass
with torch.no_grad():
    corrected_audio, confidence = model(test_audio, test_target_pitch, test_correction_strength)

print(f"\nTest Results:")
print(f"Input audio shape: {test_audio.shape}")
print(f"Output audio shape: {corrected_audio.shape}")
print(f"Confidence shape: {confidence.shape}")
print(f"Mean confidence: {confidence.mean().item():.3f}")

## 6. Model Training and Validation Setup

Let's set up a basic training loop with loss functions and metrics.
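
To make the loss easier to reason about, here is a NumPy-only sketch of the same three-part idea: waveform MSE, spectral-magnitude MSE, and binary cross-entropy on the confidence. The weights 0.6 / 0.3 / 0.1 mirror the defaults used in this section. The function name `combined_loss`, the use of `rfft` instead of a full FFT, and the clipping constant are assumptions of this sketch.

```python
import numpy as np


def combined_loss(pred_audio, target_audio, pred_conf, target_conf,
                  audio_weight=0.6, spectral_weight=0.3, confidence_weight=0.1):
    """Weighted sum of waveform MSE, spectral-magnitude MSE, and confidence BCE."""
    audio_loss = np.mean((pred_audio - target_audio) ** 2)

    pred_mag = np.abs(np.fft.rfft(pred_audio, axis=-1))
    target_mag = np.abs(np.fft.rfft(target_audio, axis=-1))
    spectral_loss = np.mean((pred_mag - target_mag) ** 2)

    p = np.clip(pred_conf, 1e-7, 1 - 1e-7)  # keeps log() finite
    confidence_loss = -np.mean(
        target_conf * np.log(p) + (1 - target_conf) * np.log(1 - p)
    )

    total = (audio_weight * audio_loss
             + spectral_weight * spectral_loss
             + confidence_weight * confidence_loss)
    return total, {"audio": audio_loss, "spectral": spectral_loss,
                   "confidence": confidence_loss}


rng = np.random.default_rng(0)
target = rng.standard_normal((4, 512))
noisy = target + 0.1 * rng.standard_normal((4, 512))
conf_target = np.full((4, 1), 0.9)

perfect_total, _ = combined_loss(target, target, conf_target, conf_target)
noisy_total, parts = combined_loss(noisy, target, np.full((4, 1), 0.5), conf_target)
print(f"perfect prediction: {perfect_total:.4f}")
print(f"noisy prediction:   {noisy_total:.4f}  {parts}")
```

Note that the loss for a perfect prediction is not zero: with soft confidence targets such as 0.9, binary cross-entropy bottoms out at the entropy of the target, so the total has a small positive floor.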

In [None]:
# Define loss functions
class CombinedLoss(nn.Module):
    """Multi-component loss function for pitch correction."""
    
    def __init__(self, audio_weight=0.6, pitch_weight=0.3, confidence_weight=0.1):
        super(CombinedLoss, self).__init__()
        self.audio_weight = audio_weight
        self.pitch_weight = pitch_weight
        self.confidence_weight = confidence_weight
        
        self.mse_loss = nn.MSELoss()
        self.bce_loss = nn.BCELoss()
        
    def forward(self, predicted_audio, target_audio, predicted_confidence, target_confidence):
        # Audio reconstruction loss
        audio_loss = self.mse_loss(predicted_audio, target_audio)
        
        # Confidence prediction loss
        confidence_loss = self.bce_loss(predicted_confidence, target_confidence)
        
        # Spectral loss (simplified)
        pred_fft = torch.fft.fft(predicted_audio, dim=-1)
        target_fft = torch.fft.fft(target_audio, dim=-1)
        spectral_loss = self.mse_loss(torch.abs(pred_fft), torch.abs(target_fft))
        
        # Combined loss
        total_loss = (self.audio_weight * audio_loss + 
                     self.pitch_weight * spectral_loss + 
                     self.confidence_weight * confidence_loss)
        
        return total_loss, {
            'audio_loss': audio_loss.item(),
            'spectral_loss': spectral_loss.item(),
            'confidence_loss': confidence_loss.item(),
            'total_loss': total_loss.item()
        }

# Training utilities
class ModelTrainer:
    """Simple trainer for the pitch correction model."""
    
    def __init__(self, model, device='cpu'):
        self.model = model
        self.device = device
        self.criterion = CombinedLoss()
        self.optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
        self.history = {'train_loss': [], 'val_loss': []}
        
    def train_step(self, audio_batch, target_batch, pitch_batch, strength_batch, confidence_batch):
        """Single training step."""
        self.model.train()
        self.optimizer.zero_grad()
        
        # Forward pass
        predicted_audio, predicted_confidence = self.model(
            audio_batch, pitch_batch, strength_batch
        )
        
        # Compute loss
        loss, loss_components = self.criterion(
            predicted_audio, target_batch, predicted_confidence, confidence_batch
        )
        
        # Backward pass
        loss.backward()
        self.optimizer.step()
        
        return loss_components
    
    def validate_step(self, audio_batch, target_batch, pitch_batch, strength_batch, confidence_batch):
        """Single validation step."""
        self.model.eval()
        
        with torch.no_grad():
            predicted_audio, predicted_confidence = self.model(
                audio_batch, pitch_batch, strength_batch
            )
            
            loss, loss_components = self.criterion(
                predicted_audio, target_batch, predicted_confidence, confidence_batch
            )
            
        return loss_components

# Create trainer
trainer = ModelTrainer(model, device)
print(f"Trainer created with device: {device}")

# Generate synthetic training data for demonstration
def generate_training_batch(batch_size=16, buffer_size=512):
    """Generate synthetic training data."""
    
    # Input audio (random noise standing in for slightly off-pitch audio)
    input_audio = torch.randn(batch_size, buffer_size)
    
    # Target audio (the input plus small noise, standing in for corrected audio)
    target_audio = input_audio + 0.1 * torch.randn(batch_size, buffer_size)
    
    # Target pitch (random frequencies, 100-1100 Hz)
    target_pitch = torch.rand(batch_size, 1) * 1000 + 100
    
    # Correction strength (uniform in [0, 1))
    correction_strength = torch.rand(batch_size, 1)
    
    # Confidence target (random in [0.5, 1.0); placeholder for a real quality label)
    confidence = torch.rand(batch_size, 1) * 0.5 + 0.5
    
    return input_audio, target_audio, target_pitch, correction_strength, confidence

# Test training step
input_audio, target_audio, target_pitch, correction_strength, confidence = generate_training_batch()

# Move to device
input_audio = input_audio.to(device)
target_audio = target_audio.to(device)
target_pitch = target_pitch.to(device)
correction_strength = correction_strength.to(device)
confidence = confidence.to(device)

# Run training step
loss_components = trainer.train_step(
    input_audio, target_audio, target_pitch, correction_strength, confidence
)

print(f"\nTraining step completed:")
for key, value in loss_components.items():
    print(f"  {key}: {value:.4f}")

## 7. Model Export to ONNX Format

Let's export our model to ONNX format for C++ integration.
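
After any export, a common sanity check is to feed the same input to both runtimes and compare the outputs numerically. The helper below is a small sketch of such a parity check using NumPy only. The name `compare_outputs` and the tolerances are assumptions, and the arrays in the demo are random stand-ins rather than real model outputs.

```python
import numpy as np


def compare_outputs(reference, candidate, atol=1e-5, rtol=1e-4):
    """Summarize how closely two arrays agree, e.g. PyTorch vs. ONNX Runtime outputs."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    if reference.shape != candidate.shape:
        raise ValueError(f"shape mismatch: {reference.shape} vs {candidate.shape}")
    abs_diff = np.abs(reference - candidate)
    return {
        "max_abs_diff": float(abs_diff.max()),
        "mean_abs_diff": float(abs_diff.mean()),
        "allclose": bool(np.allclose(candidate, reference, atol=atol, rtol=rtol)),
    }


# Stand-in data: a reference output and a copy with tiny float32-scale noise
rng = np.random.default_rng(0)
reference_out = rng.standard_normal((1, 512)).astype(np.float32)
close_out = reference_out + 1e-7 * rng.standard_normal((1, 512)).astype(np.float32)
far_out = reference_out + 0.01

print(compare_outputs(reference_out, close_out))
print(compare_outputs(reference_out, far_out))
```

Checking the maximum difference alongside the mean is deliberate: a mean can look fine while a few samples are badly off, and in audio a single bad sample is an audible click.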

In [None]:
def export_model_to_onnx(model, export_path, input_size=512):
    """Export PyTorch model to ONNX format."""
    
    model.eval()
    
    # Create dummy inputs
    dummy_audio = torch.randn(1, input_size)
    dummy_pitch = torch.tensor([[440.0]])  # A4 note
    dummy_strength = torch.tensor([[0.5]])  # 50% correction
    
    # Move to same device as model
    device = next(model.parameters()).device
    dummy_audio = dummy_audio.to(device)
    dummy_pitch = dummy_pitch.to(device)
    dummy_strength = dummy_strength.to(device)
    
    # Export to ONNX
    torch.onnx.export(
        model,
        (dummy_audio, dummy_pitch, dummy_strength),
        export_path,
        export_params=True,
        opset_version=11,
        do_constant_folding=True,
        input_names=['audio_input', 'target_pitch', 'correction_strength'],
        output_names=['corrected_audio', 'confidence'],
        dynamic_axes={
            'audio_input': {0: 'batch_size'},
            'target_pitch': {0: 'batch_size'},
            'correction_strength': {0: 'batch_size'},
            'corrected_audio': {0: 'batch_size'},
            'confidence': {0: 'batch_size'}
        }
    )
    
    print(f"Model exported to: {export_path}")
    
    # Verify export
    try:
        import onnx
        onnx_model = onnx.load(export_path)
        onnx.checker.check_model(onnx_model)
        print("✅ ONNX model validation passed")
        
        # Print model info
        # Use names that do not shadow the built-in input()
        print(f"Model inputs: {[graph_input.name for graph_input in onnx_model.graph.input]}")
        print(f"Model outputs: {[graph_output.name for graph_output in onnx_model.graph.output]}")
        
    except ImportError:
        print("⚠️ ONNX not available for verification")
    except Exception as e:
        print(f"❌ ONNX model validation failed: {e}")

# Export the model
export_dir = Path('../models/exported')
export_dir.mkdir(parents=True, exist_ok=True)
export_path = export_dir / 'pitch_correction_model.onnx'

export_model_to_onnx(model, export_path)

# Test ONNX inference if available
try:
    import onnxruntime as ort
    
    # Create inference session
    ort_session = ort.InferenceSession(str(export_path))
    
    # Prepare test input
    test_audio_np = np.random.randn(1, 512).astype(np.float32)
    test_pitch_np = np.array([[440.0]], dtype=np.float32)
    test_strength_np = np.array([[0.5]], dtype=np.float32)
    
    # Run inference
    ort_inputs = {
        'audio_input': test_audio_np,
        'target_pitch': test_pitch_np,
        'correction_strength': test_strength_np
    }
    
    ort_outputs = ort_session.run(None, ort_inputs)
    
    print(f"\n✅ ONNX Runtime inference successful")
    print(f"Output shapes: {[output.shape for output in ort_outputs]}")
    
    # Compare with PyTorch output
    model.eval()
    with torch.no_grad():
        torch_audio = torch.from_numpy(test_audio_np).to(device)
        torch_pitch = torch.from_numpy(test_pitch_np).to(device)
        torch_strength = torch.from_numpy(test_strength_np).to(device)
        
        torch_outputs = model(torch_audio, torch_pitch, torch_strength)
        torch_audio_out = torch_outputs[0].cpu().numpy()
        torch_conf_out = torch_outputs[1].cpu().numpy()
    
    # Calculate differences
    audio_diff = np.mean(np.abs(ort_outputs[0] - torch_audio_out))
    conf_diff = np.mean(np.abs(ort_outputs[1] - torch_conf_out))
    
    print(f"PyTorch vs ONNX differences:")
    print(f"  Audio output difference: {audio_diff:.6f}")
    print(f"  Confidence output difference: {conf_diff:.6f}")
    
    if audio_diff < 1e-5 and conf_diff < 1e-5:
        print("✅ PyTorch and ONNX outputs match closely")
    else:
        print("⚠️ Significant differences between PyTorch and ONNX outputs")
        
except ImportError:
    print("⚠️ ONNX Runtime not available for testing")
except Exception as e:
    print(f"❌ ONNX Runtime test failed: {e}")

## 8. C++ Engine Integration Guidelines

Before wiring the exported model into the C++ real-time engine, let's check that inference fits the engine's latency and memory targets:
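
The real-time budget follows directly from the buffer size: a buffer of N samples at sample rate sr lasts N / sr seconds, and inference has to finish well inside that window. The sketch below works through the arithmetic and shows why a tail percentile can be a safer pass/fail criterion than the mean. The helper names and the latency values are hypothetical; only the 512-sample buffer, the 44.1 kHz rate, and the 5 ms target come from this notebook.

```python
import math


def buffer_duration_ms(buffer_size, sample_rate):
    """Time span of one audio buffer in milliseconds."""
    return buffer_size / sample_rate * 1000.0


def percentile_nearest_rank(values, percentile):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


def meets_budget(latencies_ms, budget_ms, percentile=95):
    """True if the chosen latency percentile fits within the budget."""
    return percentile_nearest_rank(latencies_ms, percentile) <= budget_ms


for size in (128, 256, 512, 1024):
    print(f"{size:5d} samples @ 44.1 kHz = {buffer_duration_ms(size, 44100):5.2f} ms")

# Hypothetical measurements: mostly fast, with occasional 20 ms spikes
latencies = [1.0] * 90 + [20.0] * 10
mean_latency = sum(latencies) / len(latencies)
print(f"mean = {mean_latency:.1f} ms, p95 = {percentile_nearest_rank(latencies, 95):.1f} ms")
print("meets 5 ms budget at p95:", meets_budget(latencies, 5.0))
```

In this made-up example the mean (2.9 ms) is under the 5 ms target while the 95th percentile (20 ms) is not, and in a real-time audio callback it is the slow calls that cause dropouts.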

In [None]:
# Performance analysis for C++ integration
import time

def benchmark_model_latency(model, num_iterations=100, batch_size=1, buffer_size=512):
    """Benchmark model inference latency."""
    
    model.eval()
    device = next(model.parameters()).device
    
    # Prepare test data
    test_audio = torch.randn(batch_size, buffer_size).to(device)
    test_pitch = torch.rand(batch_size, 1).to(device) * 1000 + 100
    test_strength = torch.rand(batch_size, 1).to(device)
    
    # Warmup
    for _ in range(10):
        with torch.no_grad():
            _ = model(test_audio, test_pitch, test_strength)
    
    # Benchmark
    latencies = []
    
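    for _ in range(num_iterations):
        if device.type == 'cuda':
            torch.cuda.synchronize()  # finish pending GPU work before starting the timer
        start_time = time.perf_counter()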
        
        with torch.no_grad():
            _ = model(test_audio, test_pitch, test_strength)
            
        if device.type == 'cuda':
            torch.cuda.synchronize()
            
        end_time = time.perf_counter()
        latency_ms = (end_time - start_time) * 1000
        latencies.append(latency_ms)
    
    return {
        'mean_latency_ms': np.mean(latencies),
        'std_latency_ms': np.std(latencies),
        'min_latency_ms': np.min(latencies),
        'max_latency_ms': np.max(latencies),
        'p95_latency_ms': np.percentile(latencies, 95)
    }

# Benchmark the model
print("Benchmarking model latency...")
latency_results = benchmark_model_latency(model)

print("\nLatency Benchmark Results:")
for key, value in latency_results.items():
    print(f"  {key}: {value:.3f} ms")

# Check real-time requirements
buffer_duration_ms = 512 / 44100 * 1000  # ~11.6 ms for 512 samples at 44.1kHz
target_latency_ms = 5.0  # Target from ML integration guide

print(f"\nReal-time Analysis:")
print(f"Buffer duration: {buffer_duration_ms:.1f} ms")
print(f"Target latency: {target_latency_ms} ms")
print(f"Mean inference latency: {latency_results['mean_latency_ms']:.1f} ms")

if latency_results['mean_latency_ms'] <= target_latency_ms:
    print("✅ Model meets real-time latency requirements")
else:
    print("❌ Model exceeds real-time latency requirements")
    print("   Consider model optimization or hardware acceleration")

# Model size analysis
model_info = model.get_model_info()
print(f"\nModel Size Analysis:")
print(f"Parameters: {model_info['total_parameters']:,}")
print(f"Model size: {model_info['model_size_mb']:.1f} MB")

# Inference memory estimate (the model weights are already on the GPU,
# so this measures inputs and activations only)
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    memory_before = torch.cuda.memory_allocated() / 1024**2
    
    # Allocate test inputs
    test_audio = torch.randn(1, 512).cuda()
    test_pitch = torch.rand(1, 1).cuda() * 1000 + 100
    test_strength = torch.rand(1, 1).cuda()
    
    with torch.no_grad():
        _ = model(test_audio, test_pitch, test_strength)
    
    # Peak allocation also captures temporaries freed during the forward pass
    memory_after = torch.cuda.max_memory_allocated() / 1024**2
    memory_usage = memory_after - memory_before
    
    print(f"Peak GPU memory during inference: {memory_usage:.1f} MB")
    
    # Check memory requirements
    target_memory_mb = 100  # Target from ML integration guide
    if memory_usage <= target_memory_mb:
        print("✅ Model meets memory requirements")
    else:
        print("❌ Model exceeds memory requirements")

## Summary and Next Steps

This notebook has walked through a prototype pipeline for developing AutoTune ML models, using synthetic audio and random training data as stand-ins for real recordings:

### What We've Accomplished:
1. **Environment Setup** - Validated dependencies and configuration
2. **Audio Processing** - Loaded and analyzed audio data
3. **Pitch Detection** - Compared CREPE and Librosa algorithms
4. **Dataset Creation** - Built sample dataset and analysis pipeline
5. **Neural Networks** - Implemented pitch correction architecture
6. **Training Setup** - Created training loops and loss functions
7. **Model Export** - Exported to ONNX for C++ integration
8. **Performance Analysis** - Benchmarked for real-time requirements

### Next Steps:
1. **Collect Real Data** - Gather vocal and instrumental recordings
2. **Train Full Models** - Use the training scripts in `scripts/`
3. **Evaluate Performance** - Test on real audio samples
4. **Optimize for Production** - Reduce latency and memory usage
5. **Integrate with C++** - Deploy in the real-time AutoTune engine

### For YouTube Content (Sergie Code):
- **Part 1**: Environment setup and audio basics
- **Part 2**: Pitch detection and analysis
- **Part 3**: Neural network architecture
- **Part 4**: Training and evaluation
- **Part 5**: Model export and C++ integration

Happy coding! 🎵