# Audio Processing with DSPy

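This notebook demonstrates how to use DSPy for the language-model side of audio workflows: cleaning raw transcriptions, analyzing and summarizing content, extracting metadata, and generating tags. Speech recognition itself is simulated with a mock service, so no audio files or speech libraries are required to run it.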

Based on the DSPy tutorial: [Audio](https://dspy.ai/tutorials/audio/)

## Setup

Import necessary libraries and configure the environment.

In [None]:
import os
import sys
sys.path.append('../../')

import dspy
from utils import setup_default_lm, print_step, print_result, print_error
from dotenv import load_dotenv
import base64
import io
import wave
import numpy as np
from typing import List, Dict, Optional

# Load environment variables
load_dotenv('../../.env')

# Note: For actual audio processing, you would install:
# pip install speechrecognition pyaudio soundfile librosa
# For this demo, we'll simulate audio processing

## Language Model Configuration

Set up DSPy with a language model for audio analysis tasks.

In [None]:
print_step("Setting up Language Model", "Configuring DSPy for audio processing")

try:
    lm = setup_default_lm(provider="openai", model="gpt-4o", max_tokens=1000)
    dspy.configure(lm=lm)
    print_result("Language model configured successfully!", "Status")
except Exception as e:
    print_error(f"Failed to configure language model: {e}")

## Audio Processing Signatures

Define signatures for various audio processing tasks.

In [None]:
class TranscribeAudio(dspy.Signature):
    """Transcribe audio content to text."""
    
    audio_description = dspy.InputField(desc="Description of audio characteristics (quality, language, etc.)")
    raw_transcription = dspy.InputField(desc="Raw transcription from speech recognition")
    clean_transcription = dspy.OutputField(desc="Cleaned and properly formatted transcription")

class AnalyzeAudioContent(dspy.Signature):
    """Analyze audio content for insights."""
    
    transcription = dspy.InputField(desc="Text transcription of audio content")
    audio_type = dspy.InputField(desc="Type of audio content (interview, lecture, music, etc.)")
    content_analysis = dspy.OutputField(desc="Analysis of audio content including key topics, sentiment, and insights")

class SummarizeAudioContent(dspy.Signature):
    """Create a summary of audio content."""
    
    transcription = dspy.InputField(desc="Full transcription of audio content")
    summary_type = dspy.InputField(desc="Type of summary needed (brief, detailed, bullet points)")
    summary = dspy.OutputField(desc="Structured summary of the audio content")

class ExtractAudioMetadata(dspy.Signature):
    """Extract structured metadata from audio content."""
    
    transcription = dspy.InputField(desc="Audio transcription")
    metadata_type = dspy.InputField(desc="Type of metadata to extract (speakers, topics, timestamps, etc.)")
    extracted_metadata = dspy.OutputField(desc="Structured metadata extracted from audio")

class GenerateAudioTags(dspy.Signature):
    """Generate relevant tags for audio content."""
    
    audio_analysis = dspy.InputField(desc="Analysis of audio content")
    content_type = dspy.InputField(desc="Type of audio content")
    tags = dspy.OutputField(desc="Relevant tags for categorizing and searching the audio content")

## Mock Audio Processing Services

Create mock services to simulate audio processing functionality.

In [None]:
class MockAudioProcessor:
    """Mock audio processing service for demonstration."""
    
    def __init__(self):
        self.processing_history = []
    
    def mock_speech_recognition(self, audio_description: str) -> Dict:
        """Simulate speech recognition on audio."""
        
        # Simulate different quality based on audio characteristics
        quality_indicators = ["clear", "high quality", "studio", "professional"]
        noise_indicators = ["noisy", "background", "poor", "muffled"]
        
        quality_score = 0.7  # Base quality
        
        for indicator in quality_indicators:
            if indicator in audio_description.lower():
                quality_score += 0.1
        
        for indicator in noise_indicators:
            if indicator in audio_description.lower():
                quality_score -= 0.2
        
        quality_score = max(0.1, min(1.0, quality_score))
        
        # Generate mock transcription based on audio type
        if "interview" in audio_description.lower():
            raw_transcription = """
            um so like the project was really interesting and uh we had some challenges 
            but overall I think we achieved our goals you know the team worked really 
            hard and uh yeah it was a good experience
            """
        elif "lecture" in audio_description.lower():
            raw_transcription = """
            today we will discuss the fundamentals of machine learning uh the key concepts
            include supervised learning unsupervised learning and reinforcement learning
            each of these approaches has different applications and use cases
            """
        elif "meeting" in audio_description.lower():
            raw_transcription = """
            alright everyone thank you for joining today's meeting uh we have three main 
            agenda items to cover first the quarterly results second the new product launch
            and third the upcoming team restructuring
            """
        else:
            raw_transcription = """
            this is a sample transcription with some filler words and uh natural speech 
            patterns that would typically appear in real speech recognition output
            """
        
        # Add noise based on quality
        if quality_score < 0.5:
            raw_transcription += " [inaudible] [background noise]"
        
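        # Collapse the indentation and line breaks left by the triple-quoted literals
        normalized_transcription = " ".join(raw_transcription.split())
        
        result = {
            "audio_description": audio_description,
            "quality_score": quality_score,
            "raw_transcription": normalized_transcription,
            "confidence": quality_score,
            "processing_time": "2.3 seconds"
        }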
        
        self.processing_history.append(result)
        return result
    
    def extract_audio_features(self, audio_description: str) -> Dict:
        """Simulate audio feature extraction."""
        
        features = {
            "duration": "3:45",
            "sample_rate": "44.1 kHz",
            "channels": "stereo" if "stereo" in audio_description.lower() else "mono",
            "file_format": "wav",
            "estimated_speakers": 1
        }
        
        if "multiple speakers" in audio_description.lower() or "conversation" in audio_description.lower():
            features["estimated_speakers"] = 2
        elif "meeting" in audio_description.lower() or "conference" in audio_description.lower():
            features["estimated_speakers"] = 4
        
        return features

# Initialize a standalone mock processor for a quick smoke test
# (named separately because `audio_processor` is bound to the DSPy module later)
mock_processor = MockAudioProcessor()

# Test mock speech recognition
test_audio = "Clear studio recording of an interview with professional microphone"
test_result = mock_processor.mock_speech_recognition(test_audio)

print_result(f"Mock recognition result: {test_result}")
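
`extract_audio_features` above returns hard-coded values. As a rough sketch of what a real counterpart might look like for WAV input, the snippet below reads the same kind of fields from a WAV header using only the standard library. The helper names `make_test_wav` and `read_wav_features` are hypothetical and not part of the notebook or of DSPy, and the test tone is synthesized in memory so that no audio file is needed.

```python
import io
import math
import struct
import wave


def make_test_wav(duration_s: float = 0.5, sample_rate: int = 16000, channels: int = 1) -> bytes:
    """Build a small in-memory 16-bit PCM WAV containing a 440 Hz tone."""
    n_frames = int(duration_s * sample_rate)
    frames = bytearray()
    for i in range(n_frames):
        sample = int(32767 * 0.3 * math.sin(2 * math.pi * 440 * i / sample_rate))
        # Repeat the same sample for every channel of this frame
        frames += struct.pack("<h", sample) * channels

    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
    return buffer.getvalue()


def read_wav_features(wav_bytes: bytes) -> dict:
    """Read basic properties from a WAV header (hypothetical helper)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()
        n_frames = wav_file.getnframes()
        sample_width = wav_file.getsampwidth()

    if channels == 1:
        layout = "mono"
    elif channels == 2:
        layout = "stereo"
    else:
        layout = f"{channels} channels"

    return {
        "duration_seconds": round(n_frames / sample_rate, 3),
        "sample_rate_hz": sample_rate,
        "channels": layout,
        "bit_depth": sample_width * 8,
        "file_format": "wav",
    }


features = read_wav_features(make_test_wav())
print(features)
```

Speaker counts cannot be read from a header, so `estimated_speakers` would still need a separate diarization step.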

## Audio Processing Module

Create a comprehensive module for audio processing workflows.

In [None]:
class AudioProcessor(dspy.Module):
    """Comprehensive audio processing module using DSPy."""
    
    def __init__(self):
        super().__init__()
        self.transcribe = dspy.ChainOfThought(TranscribeAudio)
        self.analyze_content = dspy.ChainOfThought(AnalyzeAudioContent)
        self.summarize = dspy.ChainOfThought(SummarizeAudioContent)
        self.extract_metadata = dspy.ChainOfThought(ExtractAudioMetadata)
        self.generate_tags = dspy.ChainOfThought(GenerateAudioTags)
        self.mock_processor = MockAudioProcessor()
    
    def process_audio_complete(self, audio_description: str, audio_type: str = "general"):
        """Complete audio processing pipeline."""
        
        print_step("Complete Audio Processing Pipeline", f"Processing: {audio_description}")
        
        # Step 1: Speech Recognition
        print_step("Step 1: Speech Recognition")
        recognition_result = self.mock_processor.mock_speech_recognition(audio_description)
        raw_transcription = recognition_result["raw_transcription"]
        
        print_result(f"Quality Score: {recognition_result['quality_score']:.2f}")
        print_result(f"Raw Transcription: {raw_transcription}")
        
        # Step 2: Clean Transcription
        print_step("Step 2: Transcription Cleaning")
        clean_result = self.transcribe(
            audio_description=audio_description,
            raw_transcription=raw_transcription
        )
        
        clean_transcription = clean_result.clean_transcription
        print_result(f"Clean Transcription: {clean_transcription}")
        
        # Step 3: Content Analysis
        print_step("Step 3: Content Analysis")
        analysis_result = self.analyze_content(
            transcription=clean_transcription,
            audio_type=audio_type
        )
        
        print_result(analysis_result.content_analysis, "Content Analysis")
        
        # Step 4: Summarization
        print_step("Step 4: Content Summarization")
        summary_result = self.summarize(
            transcription=clean_transcription,
            summary_type="structured summary with key points"
        )
        
        print_result(summary_result.summary, "Summary")
        
        # Step 5: Metadata Extraction
        print_step("Step 5: Metadata Extraction")
        metadata_result = self.extract_metadata(
            transcription=clean_transcription,
            metadata_type="speakers, topics, key information"
        )
        
        print_result(metadata_result.extracted_metadata, "Extracted Metadata")
        
        # Step 6: Tag Generation
        print_step("Step 6: Tag Generation")
        tags_result = self.generate_tags(
            audio_analysis=analysis_result.content_analysis,
            content_type=audio_type
        )
        
        print_result(tags_result.tags, "Generated Tags")
        
        return dspy.Prediction(
            raw_transcription=raw_transcription,
            clean_transcription=clean_transcription,
            content_analysis=analysis_result.content_analysis,
            summary=summary_result.summary,
            metadata=metadata_result.extracted_metadata,
            tags=tags_result.tags,
            quality_score=recognition_result["quality_score"]
        )
    
    def process_audio_fast(self, audio_description: str, focus: str = "transcription"):
        """Fast audio processing for specific tasks."""
        
        print_step("Fast Audio Processing", f"Focus: {focus}")
        
        # Get transcription
        recognition_result = self.mock_processor.mock_speech_recognition(audio_description)
        raw_transcription = recognition_result["raw_transcription"]
        
        # Unknown focus values fall back to the uncleaned transcription
        if focus not in ("transcription", "summary", "analysis"):
            return raw_transcription
        
        # Every supported focus starts from the cleaned transcription
        clean_result = self.transcribe(
            audio_description=audio_description,
            raw_transcription=raw_transcription
        )
        clean_transcription = clean_result.clean_transcription
        
        if focus == "summary":
            summary_result = self.summarize(
                transcription=clean_transcription,
                summary_type="brief summary"
            )
            return summary_result.summary
        
        if focus == "analysis":
            analysis_result = self.analyze_content(
                transcription=clean_transcription,
                audio_type="general"
            )
            return analysis_result.content_analysis
        
        return clean_transcription

# Initialize the audio processor
audio_processor = AudioProcessor()
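
The transcription-cleaning step relies on the language model to remove filler words such as "um" and "uh". For comparison, or as an optional pre-pass before the LM call, a deterministic baseline can be sketched with a regular expression. `strip_fillers` is a hypothetical helper and not part of the notebook; it deliberately leaves ambiguous words such as "like" or "you know" untouched, since those need context to judge.

```python
import re

# Matches standalone "um"/"uh" (including stretched forms such as "umm"),
# an optional trailing comma, and the whitespace that follows.
FILLER_PATTERN = re.compile(r"\b(?:um+|uh+)\b,?\s*", flags=re.IGNORECASE)


def strip_fillers(raw_transcription: str) -> str:
    """Remove unambiguous filler tokens and collapse repeated whitespace."""
    without_fillers = FILLER_PATTERN.sub("", raw_transcription)
    return " ".join(without_fillers.split())


raw = "um so like the project was really interesting and uh we had some challenges"
print(strip_fillers(raw))
# so like the project was really interesting and we had some challenges
```

A baseline like this cannot add punctuation or capitalization, which is where the `TranscribeAudio` predictor is expected to do most of its work.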

## Example 1: Interview Processing

Process an interview recording with comprehensive analysis.

In [None]:
# Interview audio processing
interview_description = "High quality studio recording of a job interview, clear audio, professional setting"
interview_type = "interview"

interview_result = audio_processor.process_audio_complete(
    audio_description=interview_description,
    audio_type=interview_type
)

print_step("Interview Processing Results Summary")
print(f"✓ Audio quality score: {interview_result.quality_score:.2f}")
print("✓ Transcription completed and cleaned")
print("✓ Content analysis generated")
print("✓ Summary created")
print("✓ Metadata extracted")
print("✓ Tags generated for categorization")

## Example 2: Lecture Recording Processing

Process an educational lecture through the same pipeline, passing `audio_type="lecture"` so that the analysis and tags reflect the educational context.

In [None]:
# Lecture audio processing
lecture_description = "University lecture recording, some background noise, single speaker presenting on machine learning"
lecture_type = "lecture"

lecture_result = audio_processor.process_audio_complete(
    audio_description=lecture_description,
    audio_type=lecture_type
)

print_step("Lecture Processing Completed")
print("✓ Educational content analyzed")
print("✓ Key learning points extracted")
print("✓ Lecture summary generated")

## Example 3: Meeting Recording Analysis

Process a business meeting through the same pipeline, passing `audio_type="meeting"` so that the analysis and tags reflect the business context.

In [None]:
# Meeting audio processing
meeting_description = "Business meeting recording with multiple speakers, conference room setting, some echo"
meeting_type = "meeting"

meeting_result = audio_processor.process_audio_complete(
    audio_description=meeting_description,
    audio_type=meeting_type
)

print_step("Meeting Analysis Summary")
print("✓ Multiple speaker content processed")
print("✓ Business context analyzed")
print("✓ Meeting summary with action items generated")
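
The quality scores reported in the three examples come from the mock's keyword heuristic, not from any audio signal. The sketch below re-implements that heuristic as a standalone function (the name `mock_quality_score` is introduced here for illustration) and applies it to the three descriptions used above, which makes it easier to see why each example gets the score it does.

```python
QUALITY_INDICATORS = ["clear", "high quality", "studio", "professional"]
NOISE_INDICATORS = ["noisy", "background", "poor", "muffled"]


def mock_quality_score(audio_description: str) -> float:
    """Standalone copy of the keyword heuristic in MockAudioProcessor."""
    text = audio_description.lower()
    score = 0.7  # base quality

    for indicator in QUALITY_INDICATORS:
        if indicator in text:
            score += 0.1

    for indicator in NOISE_INDICATORS:
        if indicator in text:
            score -= 0.2

    # Clamp to the range [0.1, 1.0]
    return max(0.1, min(1.0, score))


descriptions = {
    "interview": "High quality studio recording of a job interview, clear audio, professional setting",
    "lecture": "University lecture recording, some background noise, single speaker presenting on machine learning",
    "meeting": "Business meeting recording with multiple speakers, conference room setting, some echo",
}

for name, description in descriptions.items():
    print(f"{name}: {mock_quality_score(description):.2f}")
# interview: 1.00  (four quality keywords, clamped at 1.0)
# lecture: 0.50    (one noise keyword: "background")
# meeting: 0.70    (no keywords matched)
```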

## Fast Processing Examples

Demonstrate quick processing for specific tasks.

In [None]:
# Fast transcription only
print_step("Fast Transcription Example")
quick_transcription = audio_processor.process_audio_fast(
    audio_description="Phone call recording, moderate quality",
    focus="transcription"
)
print_result(quick_transcription, "Quick Transcription")

print_step("Fast Summary Example")
quick_summary = audio_processor.process_audio_fast(
    audio_description="Podcast episode about technology trends, clear audio",
    focus="summary"
)
print_result(quick_summary, "Quick Summary")

print_step("Fast Analysis Example")
quick_analysis = audio_processor.process_audio_fast(
    audio_description="Customer service call, some background noise",
    focus="analysis"
)
print_result(quick_analysis, "Quick Analysis")

## Specialized Audio Processing

Create specialized processors for different audio types.

In [None]:
class SpecializedAudioProcessors(dspy.Module):
    """Specialized processors for different audio content types."""
    
    def __init__(self):
        super().__init__()
        self.base_processor = AudioProcessor()
        
        # Specialized signatures
        self.customer_service_analysis = dspy.ChainOfThought(
            "transcription -> customer_satisfaction, issues_identified, resolution_status"
        )
        
        self.educational_content_extraction = dspy.ChainOfThought(
            "transcription -> learning_objectives, key_concepts, quiz_questions"
        )
        
        self.legal_transcription = dspy.ChainOfThought(
            "raw_transcription -> legal_formatted_transcription, speaker_identification, timestamps"
        )
    
    def process_customer_service_call(self, audio_description: str):
        """Process customer service calls with specialized analysis."""
        
        print_step("Customer Service Call Processing")
        
        # Get basic processing
        base_result = self.base_processor.process_audio_complete(
            audio_description=audio_description,
            audio_type="customer_service"
        )
        
        # Specialized customer service analysis
        cs_analysis = self.customer_service_analysis(
            transcription=base_result.clean_transcription
        )
        
        print_result(cs_analysis.customer_satisfaction, "Customer Satisfaction")
        print_result(cs_analysis.issues_identified, "Issues Identified")
        print_result(cs_analysis.resolution_status, "Resolution Status")
        
        # A Prediction keeps its fields in an internal store, so use toDict() rather than __dict__
        return {
            **base_result.toDict(),
            "customer_satisfaction": cs_analysis.customer_satisfaction,
            "issues_identified": cs_analysis.issues_identified,
            "resolution_status": cs_analysis.resolution_status
        }
    
    def process_educational_content(self, audio_description: str):
        """Process educational audio with learning-focused analysis."""
        
        print_step("Educational Content Processing")
        
        base_result = self.base_processor.process_audio_complete(
            audio_description=audio_description,
            audio_type="educational"
        )
        
        edu_analysis = self.educational_content_extraction(
            transcription=base_result.clean_transcription
        )
        
        print_result(edu_analysis.learning_objectives, "Learning Objectives")
        print_result(edu_analysis.key_concepts, "Key Concepts")
        print_result(edu_analysis.quiz_questions, "Generated Quiz Questions")
        
        # A Prediction keeps its fields in an internal store, so use toDict() rather than __dict__
        return {
            **base_result.toDict(),
            "learning_objectives": edu_analysis.learning_objectives,
            "key_concepts": edu_analysis.key_concepts,
            "quiz_questions": edu_analysis.quiz_questions
        }
    
    def process_legal_deposition(self, audio_description: str):
        """Process legal audio with formal transcription requirements."""
        
        print_step("Legal Deposition Processing")
        
        # Get raw transcription
        recognition_result = self.base_processor.mock_processor.mock_speech_recognition(audio_description)
        
        legal_transcription = self.legal_transcription(
            raw_transcription=recognition_result["raw_transcription"]
        )
        
        print_result(legal_transcription.legal_formatted_transcription, "Legal Transcription")
        print_result(legal_transcription.speaker_identification, "Speaker Identification")
        print_result(legal_transcription.timestamps, "Timestamps")
        
        return {
            "legal_transcription": legal_transcription.legal_formatted_transcription,
            "speaker_identification": legal_transcription.speaker_identification,
            "timestamps": legal_transcription.timestamps
        }

# Test specialized processors
specialized = SpecializedAudioProcessors()

# Customer service example
cs_result = specialized.process_customer_service_call(
    "Customer service call, customer sounds frustrated, representative trying to help"
)

print_step("Customer Service Processing Complete")

# Educational content example
edu_result = specialized.process_educational_content(
    "University lecture on data structures and algorithms, clear presentation"
)

print_step("Educational Processing Complete")

# Legal example
legal_result = specialized.process_legal_deposition(
    "Legal deposition recording, formal setting, court reporter present"
)

print_step("Legal Processing Complete")

## Batch Audio Processing

Process multiple audio files in a single call. Items are handled one after another, and each item can select fast or complete processing.

In [None]:
class BatchAudioProcessor(dspy.Module):
    """Process multiple audio files in batch."""
    
    def __init__(self):
        super().__init__()
        self.single_processor = AudioProcessor()
    
    def process_batch(self, audio_batch: List[Dict]) -> List[Dict]:
        """Process a batch of audio files."""
        
        print_step("Batch Audio Processing", f"Processing {len(audio_batch)} audio files")
        
        results = []
        
        for i, audio_item in enumerate(audio_batch):
            print_step(f"Processing Audio {i+1}: {audio_item['name']}")
            
            # Choose processing type based on requirements
            if audio_item.get('fast_mode', False):
                result = self.single_processor.process_audio_fast(
                    audio_description=audio_item['description'],
                    focus=audio_item.get('focus', 'transcription')
                )
                results.append({
                    'name': audio_item['name'],
                    'result': result,
                    'processing_type': 'fast'
                })
            else:
                result = self.single_processor.process_audio_complete(
                    audio_description=audio_item['description'],
                    audio_type=audio_item.get('type', 'general')
                )
                results.append({
                    'name': audio_item['name'],
                    'result': result,
                    'processing_type': 'complete'
                })
        
        return results

# Test batch processing
batch_processor = BatchAudioProcessor()

audio_batch = [
    {
        'name': 'meeting_q1_2024.wav',
        'description': 'Quarterly business meeting, multiple speakers, conference room',
        'type': 'meeting',
        'fast_mode': False
    },
    {
        'name': 'customer_call_001.mp3',
        'description': 'Customer support call, phone quality, billing inquiry',
        'type': 'customer_service',
        'fast_mode': True,
        'focus': 'summary'
    },
    {
        'name': 'lecture_ml_intro.wav',
        'description': 'Introduction to machine learning lecture, clear audio, university setting',
        'type': 'lecture',
        'fast_mode': False
    }
]

batch_results = batch_processor.process_batch(audio_batch)

print_step("Batch Processing Summary")
for result in batch_results:
    print(f"✓ {result['name']}: {result['processing_type']} processing completed")
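
The batch loop above is sequential, and a single failing item would abort the whole batch. One possible variation, sketched here under assumptions, runs items in a thread pool and records errors per item instead of raising. `process_batch_concurrently` and `fake_process` are hypothetical names; `fake_process` stands in for a call such as `AudioProcessor.process_audio_fast`, so the sketch runs without a language model.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List


def process_batch_concurrently(
    audio_batch: List[Dict[str, Any]],
    process_item: Callable[[Dict[str, Any]], Any],
    max_workers: int = 4,
) -> List[Dict[str, Any]]:
    """Process items in parallel threads, isolating failures per item."""

    def safe_process(item: Dict[str, Any]) -> Dict[str, Any]:
        try:
            return {"name": item["name"], "result": process_item(item), "error": None}
        except Exception as exc:
            # Record the failure and let the rest of the batch continue
            return {"name": item["name"], "result": None, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the input
        return list(pool.map(safe_process, audio_batch))


def fake_process(item: Dict[str, Any]) -> str:
    # Hypothetical stand-in for an LM-backed processing call
    if not item.get("description"):
        raise ValueError("missing description")
    return f"processed: {item['description']}"


demo_batch = [
    {"name": "meeting_q1_2024.wav", "description": "Quarterly business meeting"},
    {"name": "broken_item.wav", "description": ""},
    {"name": "lecture_ml_intro.wav", "description": "Introduction to machine learning lecture"},
]

demo_results = process_batch_concurrently(demo_batch, fake_process)
for entry in demo_results:
    status = "failed: " + entry["error"] if entry["error"] else "ok"
    print(f"{entry['name']}: {status}")
```

Threads suit this workload because the time is spent waiting on network calls. Before sharing one DSPy module across threads, it is worth checking the DSPy documentation for its own guidance on parallel execution.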

## Audio Quality Assessment

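Implement a heuristic quality assessment for audio processing results. The component scores are rough proxies (for example, analysis depth is estimated from the length of the analysis text), not measured accuracy.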

In [None]:
class AudioQualityAssessor:
    """Assess quality of audio processing results."""
    
    def __init__(self):
        self.quality_factors = {
            'transcription_accuracy': 0.35,
            'content_completeness': 0.25,
            'analysis_depth': 0.20,
            'metadata_richness': 0.20
        }
    
    def assess_processing_quality(self, processing_result) -> Dict[str, float]:
        """Assess the quality of audio processing results."""
        
        scores = {}
        
        # Transcription accuracy (simulated based on audio quality)
        if hasattr(processing_result, 'quality_score'):
            scores['transcription_accuracy'] = processing_result.quality_score
        else:
            scores['transcription_accuracy'] = 0.7  # Default
        
        # Content completeness (based on presence of different components)
        completeness_score = 0.5  # Base score
        if hasattr(processing_result, 'summary') and processing_result.summary:
            completeness_score += 0.2
        if hasattr(processing_result, 'metadata') and processing_result.metadata:
            completeness_score += 0.2
        if hasattr(processing_result, 'tags') and processing_result.tags:
            completeness_score += 0.1
        
        scores['content_completeness'] = min(1.0, completeness_score)
        
        # Analysis depth (based on content analysis quality)
        if hasattr(processing_result, 'content_analysis'):
            analysis_length = len(str(processing_result.content_analysis))
            scores['analysis_depth'] = min(1.0, analysis_length / 500)  # Normalize
        else:
            scores['analysis_depth'] = 0.3
        
        # Metadata richness
        if hasattr(processing_result, 'metadata'):
            metadata_length = len(str(processing_result.metadata))
            scores['metadata_richness'] = min(1.0, metadata_length / 300)
        else:
            scores['metadata_richness'] = 0.3
        
        # Overall weighted score
        overall_score = sum(scores[factor] * weight for factor, weight in self.quality_factors.items())
        scores['overall'] = overall_score
        
        return scores
    
    def generate_quality_report(self, processing_result) -> str:
        """Generate a quality assessment report."""
        
        scores = self.assess_processing_quality(processing_result)
        
        report = f"""
Audio Processing Quality Report
===============================

Overall Quality Score: {scores['overall']:.2f}/1.00

Component Scores:
- Transcription Accuracy: {scores['transcription_accuracy']:.2f}/1.00
- Content Completeness: {scores['content_completeness']:.2f}/1.00
- Analysis Depth: {scores['analysis_depth']:.2f}/1.00
- Metadata Richness: {scores['metadata_richness']:.2f}/1.00

Quality Level: {self._get_quality_level(scores['overall'])}

Recommendations:
{self._get_recommendations(scores)}
        """
        
        return report.strip()
    
    def _get_quality_level(self, score: float) -> str:
        """Get quality level description."""
        if score >= 0.9:
            return "Excellent"
        elif score >= 0.8:
            return "Very Good"
        elif score >= 0.7:
            return "Good"
        elif score >= 0.6:
            return "Fair"
        else:
            return "Needs Improvement"
    
    def _get_recommendations(self, scores: Dict[str, float]) -> str:
        """Generate improvement recommendations."""
        recommendations = []
        
        if scores['transcription_accuracy'] < 0.7:
            recommendations.append("- Improve audio quality or use better speech recognition")
        
        if scores['content_completeness'] < 0.8:
            recommendations.append("- Ensure all processing components are generating results")
        
        if scores['analysis_depth'] < 0.7:
            recommendations.append("- Enhance content analysis with more detailed insights")
        
        if scores['metadata_richness'] < 0.7:
            recommendations.append("- Extract more comprehensive metadata from audio content")
        
        if not recommendations:
            recommendations.append("- Processing quality is good, consider fine-tuning for specific use cases")
        
        return "\n".join(recommendations)

# Test quality assessment
quality_assessor = AudioQualityAssessor()

# Assess the interview result from earlier
interview_quality_report = quality_assessor.generate_quality_report(interview_result)
print_step("Quality Assessment Report")
print(interview_quality_report)
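
To make the weighting concrete, here is a worked example of the overall score using the same weights and normalization rules as `AudioQualityAssessor`. The input values (a recognition quality of 1.0, a 400-character analysis, and 150 characters of metadata) are made up for illustration and are not output from the notebook.

```python
QUALITY_FACTORS = {
    "transcription_accuracy": 0.35,
    "content_completeness": 0.25,
    "analysis_depth": 0.20,
    "metadata_richness": 0.20,
}


def quality_level(score: float) -> str:
    """Same thresholds as AudioQualityAssessor._get_quality_level."""
    if score >= 0.9:
        return "Excellent"
    if score >= 0.8:
        return "Very Good"
    if score >= 0.7:
        return "Good"
    if score >= 0.6:
        return "Fair"
    return "Needs Improvement"


# Hypothetical inputs for illustration
recognition_quality = 1.0
analysis_chars = 400
metadata_chars = 150

scores = {
    "transcription_accuracy": recognition_quality,
    # Base 0.5, plus 0.2 (summary), 0.2 (metadata), and 0.1 (tags) all present
    "content_completeness": min(1.0, 0.5 + 0.2 + 0.2 + 0.1),
    "analysis_depth": min(1.0, analysis_chars / 500),     # 0.8
    "metadata_richness": min(1.0, metadata_chars / 300),  # 0.5
}

# 0.35*1.0 + 0.25*1.0 + 0.20*0.8 + 0.20*0.5 = 0.86
overall = sum(scores[factor] * weight for factor, weight in QUALITY_FACTORS.items())
print(f"Overall: {overall:.2f} ({quality_level(overall)})")
# Overall: 0.86 (Very Good)
```

Because two of the four components depend only on text length, a verbose but inaccurate analysis would still score well, which is worth keeping in mind when reading the report.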

## Best Practices for Audio Processing with DSPy

### Audio Input Best Practices:

1. **Audio Quality**: Use high-quality recordings when possible
2. **Format Standardization**: Convert to standard formats (WAV, MP3)
3. **Noise Reduction**: Pre-process audio to reduce background noise
4. **Speaker Separation**: Identify and separate multiple speakers
5. **Segmentation**: Break long audio into manageable segments
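
As an illustration of the segmentation point, the following sketch computes start and end times for fixed-length segments with a small overlap, so that words cut at a boundary appear in both neighbouring segments. The function name `segment_bounds` and the default lengths are assumptions chosen for this example, not recommendations from the DSPy tutorial.

```python
from typing import List, Tuple


def segment_bounds(
    duration_s: float,
    segment_s: float = 30.0,
    overlap_s: float = 2.0,
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds covering the whole recording."""
    if segment_s <= 0:
        raise ValueError("segment_s must be positive")
    if not 0 <= overlap_s < segment_s:
        raise ValueError("overlap_s must be non-negative and shorter than segment_s")

    bounds: List[Tuple[float, float]] = []
    step = segment_s - overlap_s
    start = 0.0
    while start < duration_s:
        end = min(start + segment_s, duration_s)
        bounds.append((start, end))
        if end >= duration_s:
            break  # the final segment reached the end of the recording
        start += step
    return bounds


print(segment_bounds(70.0, segment_s=30.0, overlap_s=5.0))
# [(0.0, 30.0), (25.0, 55.0), (50.0, 70.0)]
```

Each pair of bounds could then be used to slice the audio before recognition, with the per-segment transcriptions joined afterwards.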

### Processing Pipeline Best Practices:

1. **Error Handling**: Implement robust error handling for processing failures
2. **Quality Checks**: Validate transcription quality before further processing
3. **Contextual Processing**: Adapt processing based on audio content type
4. **Incremental Processing**: Process in chunks for large files
5. **Fallback Strategies**: Have backup processing methods for poor quality audio
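
Error handling, quality checks, and fallback strategies can be combined in one small routine. The sketch below tries a list of recognizers in order, skips any that raise, and accepts the first result whose confidence clears a threshold. All names here (`transcribe_with_fallback` and the three stub recognizers) are hypothetical, and the result dictionaries only mimic the shape returned by `mock_speech_recognition`.

```python
from typing import Any, Callable, Dict, List, Optional

Recognizer = Callable[[str], Dict[str, Any]]


def transcribe_with_fallback(
    audio_ref: str,
    recognizers: List[Recognizer],
    min_confidence: float = 0.6,
) -> Dict[str, Any]:
    """Return the first sufficiently confident result, else the best one seen."""
    best: Optional[Dict[str, Any]] = None
    errors: List[str] = []

    for recognizer in recognizers:
        try:
            result = recognizer(audio_ref)
        except Exception as exc:
            # Error handling: note the failure and move on to the next backend
            errors.append(f"{recognizer.__name__}: {exc}")
            continue

        # Quality check: accept only results above the confidence threshold
        if result["confidence"] >= min_confidence:
            return {**result, "accepted": True, "errors": errors}
        if best is None or result["confidence"] > best["confidence"]:
            best = result

    if best is None:
        raise RuntimeError("all recognizers failed: " + "; ".join(errors))
    # No backend met the threshold, so flag the best attempt for review
    return {**best, "accepted": False, "errors": errors}


# Hypothetical stub backends for demonstration
def flaky_recognizer(audio_ref: str) -> Dict[str, Any]:
    raise TimeoutError("service unavailable")


def low_quality_recognizer(audio_ref: str) -> Dict[str, Any]:
    return {"raw_transcription": "[inaudible]", "confidence": 0.3}


def robust_recognizer(audio_ref: str) -> Dict[str, Any]:
    return {"raw_transcription": "thank you for joining", "confidence": 0.8}


outcome = transcribe_with_fallback(
    "meeting.wav", [flaky_recognizer, low_quality_recognizer, robust_recognizer]
)
print(outcome["accepted"], outcome["raw_transcription"], outcome["errors"])
```

Returning the best low-confidence attempt with `accepted` set to `False` lets downstream steps decide whether to continue, skip the item, or route it to manual review.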

### Integration Considerations:

- **Real-time vs Batch**: Choose appropriate processing mode
- **Storage**: Manage audio file storage and retrieval efficiently
- **Privacy**: Implement proper privacy controls for sensitive audio
- **Scalability**: Design for handling multiple concurrent audio processing tasks
- **Monitoring**: Track processing quality and performance metrics
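
For the monitoring point, a lightweight starting place is to wrap each pipeline call and record latency, failures, and quality scores in memory. The `ProcessingMonitor` class below is a hypothetical sketch rather than a DSPy feature, and `fake_pipeline` stands in for a call such as `process_audio_complete`.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ProcessingMonitor:
    """Collects per-run latency, failure counts, and quality scores."""

    durations_s: List[float] = field(default_factory=list)
    quality_scores: List[float] = field(default_factory=list)
    failures: int = 0

    def track(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        """Run func, recording its wall-clock duration and whether it raised."""
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            self.failures += 1
            raise
        finally:
            # Runs for both successful and failed calls
            self.durations_s.append(time.perf_counter() - start)

    def record_quality(self, score: float) -> None:
        self.quality_scores.append(score)

    def summary(self) -> Dict[str, Any]:
        runs = len(self.durations_s)
        scored = len(self.quality_scores)
        return {
            "runs": runs,
            "failures": self.failures,
            "mean_duration_s": sum(self.durations_s) / runs if runs else 0.0,
            "mean_quality": sum(self.quality_scores) / scored if scored else None,
        }


def fake_pipeline(description: str) -> Dict[str, Any]:
    # Hypothetical stand-in for an LM-backed pipeline call
    if not description:
        raise ValueError("empty description")
    return {"summary": f"summary of {description}", "quality_score": 0.75}


monitor = ProcessingMonitor()
for description in ["interview, clear audio", "", "lecture, some noise"]:
    try:
        result = monitor.track(fake_pipeline, description)
        monitor.record_quality(result["quality_score"])
    except ValueError:
        pass  # the failure has already been counted by the monitor

print(monitor.summary())
```

In a deployed system the same numbers would typically be sent to a metrics backend instead of being kept in a Python list.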

## Conclusion

This notebook walked through a DSPy-based workflow for the text side of audio processing, with speech recognition simulated by a mock service:

- **Complete Processing Pipeline**: From (mock) speech recognition through transcription cleaning, analysis, summarization, metadata extraction, and tagging
- **Specialized Processors**: Tailored processing for customer service, educational, and legal audio
- **Quality Assessment**: Heuristic scoring of processing results
- **Batch Processing**: Handling multiple audio files in a single call
- **Best Practices**: Guidelines for production-ready audio processing systems

These techniques enable building robust audio processing applications for various domains including:
- Customer service analysis
- Educational content processing
- Legal transcription
- Meeting analysis
- Podcast and media processing