Notebook 2: Individual Modality Processing
=======================================

[Click to view on Google Colab](https://colab.research.google.com/drive/1Q3IlQms7TbHDg2u0jOYMXqgGPVvFbXJw?usp=sharing)

This notebook demonstrates how each modality (text, image, audio) is preprocessed
independently before being used in multimodal AI systems. We'll explore the
transformation pipelines, normalization techniques, and modality-specific challenges.

Learning Objectives:
- Understand preprocessing steps for each modality
- Learn about normalization and standardization techniques
- Explore modality-specific challenges and solutions
- See before/after transformations with dummy data
- Understand why preprocessing is crucial for multimodal AI

### Importing necessary libraries

In [1]:
!pip install numpy

import numpy as np
import re
from typing import Dict, List, Tuple, Any, Optional
from datetime import datetime



---

### Text Data Processing


#### Understanding Text Preprocessing in Multimodal AI


Text preprocessing is the foundation of any text-based AI system. Unlike humans who can easily understand messy, inconsistent text, machine learning models require clean, standardized input data. The `TextProcessor` class demonstrates the essential steps that transform raw text into numerical representations that models can process.

**Key Concepts Covered:**
1. **Text Cleaning Challenges**
   - Raw text comes with inconsistencies: mixed case, extra spaces, punctuation, and special characters.
   - Different cleaning levels serve different purposes: basic cleaning preserves more information, while advanced cleaning creates more uniform input.
   - Empty text and extremely long text require special handling to prevent model errors.
2. **Tokenization Process**
   - Breaking text into meaningful units (tokens) is crucial for model understanding.
   - Special tokens like `<START>`, `<END>`, `<PAD>`, and `<UNK>` serve specific purposes in sequence processing.
   - Tokenization strategy affects how models interpret relationships between words.
3. **Vocabulary Mapping**
   - Converting text tokens to numerical IDs enables mathematical operations.
   - Out-of-vocabulary words are handled gracefully using `<UNK>` tokens.
   - Vocabulary size directly impacts model complexity and memory requirements.
4. **Sequence Standardization**
   - Variable-length text sequences must be standardized for batch processing.
   - Padding ensures all sequences have the same length for efficient computation.
   - Truncation prevents extremely long sequences from dominating processing time.
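
The vocabulary mapping described above can be sketched in a few lines. This is a minimal illustration, not the notebook's own method: the `build_vocabulary` and `encode` helpers, the `min_count` parameter, and the toy corpus are assumptions introduced here, whereas the notebook itself uses a small hard-coded vocabulary.

```python
from collections import Counter
from typing import Dict, List

# Same special tokens and IDs as the notebook's vocabulary
SPECIAL_TOKENS = ['<PAD>', '<UNK>', '<START>', '<END>']


def build_vocabulary(corpus: List[str], min_count: int = 1) -> Dict[str, int]:
    """Assign an ID to every token seen at least min_count times (hypothetical helper)."""
    counts = Counter(token for sentence in corpus for token in sentence.lower().split())
    vocabulary = {token: idx for idx, token in enumerate(SPECIAL_TOKENS)}
    # Most frequent tokens get the smallest IDs; ties are broken alphabetically
    for token, count in sorted(counts.items(), key=lambda item: (-item[1], item[0])):
        if count >= min_count:
            vocabulary[token] = len(vocabulary)
    return vocabulary


def encode(text: str, vocabulary: Dict[str, int]) -> List[int]:
    """Wrap the text in <START>/<END> and map unknown tokens to <UNK>."""
    tokens = ['<START>'] + text.lower().split() + ['<END>']
    return [vocabulary.get(token, vocabulary['<UNK>']) for token in tokens]


toy_corpus = ["hello world", "hello multimodal ai", "text and image data"]
vocab = build_vocabulary(toy_corpus)
print(len(vocab))                           # 12 (4 special tokens + 8 words)
print(encode("hello audio world", vocab))   # [2, 4, 1, 11, 3]
```

Here `audio` never appears in the toy corpus, so it is encoded as `<UNK>` (ID 1). Raising `min_count` shrinks the vocabulary at the cost of mapping more words to `<UNK>`.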
  
**Learning Outcomes:**  
Learners will understand why text can't be fed directly to models, how preprocessing affects model performance, and the trade-offs between different cleaning strategies.
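
As a rough sketch of sequence standardization for a whole batch, the snippet below pads or truncates several ID sequences to one length. The `pad_batch` helper and the boolean mask it returns are assumptions added for illustration; the notebook's `pad_sequence` method handles one sequence at a time and does not produce a mask.

```python
import numpy as np
from typing import List, Tuple

PAD_ID = 0  # matches '<PAD>': 0 in the notebook's vocabulary


def pad_batch(sequences: List[List[int]], max_length: int) -> Tuple[np.ndarray, np.ndarray]:
    """Pad/truncate each sequence to max_length and mark which positions hold real tokens."""
    batch = np.full((len(sequences), max_length), PAD_ID, dtype=np.int32)
    mask = np.zeros((len(sequences), max_length), dtype=bool)
    for row, sequence in enumerate(sequences):
        kept = sequence[:max_length]      # truncate overly long sequences
        batch[row, :len(kept)] = kept     # remaining positions stay as PAD_ID
        mask[row, :len(kept)] = True
    return batch, mask


batch, mask = pad_batch([[2, 10, 11, 3], [2, 14, 3], [2, 4, 19, 1, 1, 1, 3]], max_length=5)
print(batch)
print(mask.sum(axis=1))  # number of real (non-padding) tokens in each row
```

Note that plain truncation, as in the third sequence, also drops the trailing `<END>` token; whether that matters depends on the model consuming the batch.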


In [2]:
class TextProcessor:
    """
    Handles text preprocessing operations including tokenization, normalization,
    and feature extraction. This simulates real-world text processing pipelines.
    """
    
    def __init__(self):
        self.name = "Text Processor"
        # Dummy vocabulary for demonstration (in real systems, this would be much larger)
        self.vocabulary = {
            '<PAD>': 0, 
            '<UNK>': 1, 
            '<START>': 2, 
            '<END>': 3,
            'the': 4, 
            'a': 5, 
            'an': 6, 
            'and': 7, 
            'or': 8, 
            'but': 9,
            'hello': 10, 
            'world': 11, 
            'ai': 12, 
            'multimodal': 13, 
            'text': 14,
            'image': 15, 
            'audio': 16, 
            'data': 17, 
            'processing': 18, 
            'model': 19
        }
        self.max_sequence_length = 50
        
    def create_sample_texts(self) -> Dict[str, str]:
        """
        Creates sample texts with various preprocessing challenges
        
        Returns:
            Dictionary of text samples with different characteristics
        """
        samples = {
            'clean_text': "Hello world this is a clean text sample",
            'messy_text': "  HELLO!!! World???   This has    extra spaces & symbols!!!  ",
            'mixed_case': "ThIs TeXt HaS mIxEd CaSe AnD needs NORMALIZATION",
            'with_numbers': "The model achieved 95.5% accuracy on 1000 test samples in 2023",
            'with_punctuation': "Hello, world! How are you today? I'm fine, thanks.",
            'empty_text': "",
            'very_long_text': " ".join(["word"] * 100)  # Simulate very long text
        }
        return samples
    
    def basic_text_cleaning(self, text: str) -> str:
        """
        Performs basic text cleaning operations
        
        Args:
            text: Raw input text
            
        Returns:
            Cleaned text string
        """
        if not text or not text.strip():
            return ""
            
        # Convert to lowercase
        cleaned = text.lower()
        
        # Remove extra whitespaces
        cleaned = re.sub(r'\s+', ' ', cleaned)
        
        # Remove leading/trailing whitespace
        cleaned = cleaned.strip()
        
        return cleaned
    
    def advanced_text_cleaning(self, text: str) -> str:
        """
        Performs advanced text cleaning including punctuation and special characters
        
        Args:
            text: Input text to clean
            
        Returns:
            Advanced cleaned text
        """
        if not text or not text.strip():
            return ""
            
        # Start with basic cleaning
        cleaned = self.basic_text_cleaning(text)
        
        # Remove punctuation (keep alphanumeric and spaces)
        cleaned = re.sub(r'[^a-zA-Z0-9\s]', '', cleaned)
        
        # Handle numbers (replace with special token)
        cleaned = re.sub(r'\d+', '<NUM>', cleaned)
        
        # Remove extra spaces again after punctuation removal
        cleaned = re.sub(r'\s+', ' ', cleaned).strip()
        
        return cleaned
    
    def tokenize_text(self, text: str) -> List[str]:
        """
        Tokenizes text into individual words/tokens
        
        Args:
            text: Input text to tokenize
            
        Returns:
            List of tokens
        """
        if not text or not text.strip():
            return []
            
        # Simple whitespace tokenization
        tokens = text.split()
        
        # Add start and end tokens
        tokens = ['<START>'] + tokens + ['<END>']
        
        return tokens
    
    def tokens_to_ids(self, tokens: List[str]) -> List[int]:
        """
        Converts tokens to numerical IDs using vocabulary
        
        Args:
            tokens: List of string tokens
            
        Returns:
            List of numerical token IDs
        """
        token_ids = []
        for token in tokens:
            # Use vocabulary lookup, default to <UNK> if not found
            token_id = self.vocabulary.get(token, self.vocabulary['<UNK>'])
            token_ids.append(token_id)
        
        return token_ids
    
    def pad_sequence(self, token_ids: List[int]) -> List[int]:
        """
        Pads or truncates sequence to fixed length
        
        Args:
            token_ids: List of token IDs
            
        Returns:
            Padded/truncated sequence of fixed length
        """
        if len(token_ids) > self.max_sequence_length:
            # Truncate if too long
            return token_ids[:self.max_sequence_length]
        else:
            # Pad if too short
            padding_length = self.max_sequence_length - len(token_ids)
            return token_ids + [self.vocabulary['<PAD>']] * padding_length
    
    def process_text_pipeline(self, text: str, cleaning_level: str = 'basic') -> Dict[str, Any]:
        """
        Complete text processing pipeline
        
        Args:
            text: Raw input text
            cleaning_level: 'basic' or 'advanced' cleaning
            
        Returns:
            Dictionary containing all processing steps and results
        """
        result = {
            'original_text': text,
            'original_length': len(text),
            'original_word_count': len(text.split()) if text else 0
        }
        
        # Step 1: Text cleaning
        if cleaning_level == 'advanced':
            cleaned_text = self.advanced_text_cleaning(text)
        else:
            cleaned_text = self.basic_text_cleaning(text)
        
        result['cleaned_text'] = cleaned_text
        result['cleaned_length'] = len(cleaned_text)
        
        # Step 2: Tokenization
        tokens = self.tokenize_text(cleaned_text)
        result['tokens'] = tokens
        result['token_count'] = len(tokens)
        
        # Step 3: Token to ID conversion
        token_ids = self.tokens_to_ids(tokens)
        result['token_ids'] = token_ids
        
        # Step 4: Sequence padding/truncation
        padded_sequence = self.pad_sequence(token_ids)
        result['padded_sequence'] = padded_sequence
        result['final_length'] = len(padded_sequence)
        
        # Step 5: Convert to numpy array (ready for model input)
        result['model_input'] = np.array(padded_sequence, dtype=np.int32)
        result['input_shape'] = result['model_input'].shape
        
        return result
    
    def demonstrate_text_processing(self):
        """
        Main demonstration function for text processing
        """
        print(f"=== {self.name} Demonstration ===\n")
        
        # Get sample texts
        sample_texts = self.create_sample_texts()
        
        print("1. Text Processing Pipeline Demonstration:")
        print(f"   Vocabulary size: {len(self.vocabulary)}")
        print(f"   Max sequence length: {self.max_sequence_length}\n")
        
        for text_name, text in sample_texts.items():
            print(f"Processing: {text_name}")
            print(f"Original: '{text}'")
            
            # Process with basic cleaning
            basic_result = self.process_text_pipeline(text, 'basic')
            print(f"Basic cleaned: '{basic_result['cleaned_text']}'")
            print(f"Tokens: {basic_result['tokens'][:10]}{'...' if len(basic_result['tokens']) > 10 else ''}")
            print(f"Token IDs: {basic_result['token_ids'][:10]}{'...' if len(basic_result['token_ids']) > 10 else ''}")
            print(f"Final shape: {basic_result['input_shape']}")
            
            # Show advanced cleaning for messy text
            if text_name == 'messy_text':
                advanced_result = self.process_text_pipeline(text, 'advanced')
                print(f"Advanced cleaned: '{advanced_result['cleaned_text']}'")
            
            print("-" * 50)
        
        return sample_texts


print("MULTIMODAL AI - INDIVIDUAL MODALITY PROCESSING")
print("=" * 55)
print(f"Timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")

print("PART 1: TEXT PROCESSING")
print("=" * 30)
text_processor = TextProcessor()
text_processor.demonstrate_text_processing()

print("\n" + "="*60 + "\n")


MULTIMODAL AI - INDIVIDUAL MODALITY PROCESSING
=======================================================
Timestamp: 2025-08-16 15:55:37

PART 1: TEXT PROCESSING
==============================
=== Text Processor Demonstration ===

1. Text Processing Pipeline Demonstration:
   Vocabulary size: 20
   Max sequence length: 50

Processing: clean_text
Original: 'Hello world this is a clean text sample'
Basic cleaned: 'hello world this is a clean text sample'
Tokens: ['<START>', 'hello', 'world', 'this', 'is', 'a', 'clean', 'text', 'sample', '<END>']
Token IDs: [2, 10, 11, 1, 1, 5, 1, 14, 1, 3]
Final shape: (50,)
--------------------------------------------------
Processing: messy_text
Original: '  HELLO!!! World???   This has    extra spaces & symbols!!!  '
Basic cleaned: 'hello!!! world??? this has extra spaces & symbols!!!'
Tokens: ['<START>', 'hello!!!', 'world???', 'this', 'has', 'extra', 'spaces', '&', 'symbols!!!', '<END>']
Token IDs: [2, 1, 1, 1, 1, 1, 1, 1, 1, 3]
Final shape: (50,)
Advanced cleaned: 'hello world this has extra spaces symbols'
--------------------------------------------------
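
The output above shows every word of `messy_text` mapping to ID 1 (`<UNK>`) after basic cleaning, because tokens such as `hello!!!` are not in the vocabulary. One way to quantify the trade-off between cleaning levels is to measure the fraction of unknown tokens. The sketch below is self-contained and makes some assumptions: `unk_rate` is a hypothetical helper, the vocabulary is a subset of the notebook's (the omitted entries do not occur in this sentence), and the `<NUM>` replacement step is left out because the sample contains no digits.

```python
import re
from typing import Dict

# Subset of the notebook's vocabulary, enough for this sentence
VOCABULARY: Dict[str, int] = {
    '<PAD>': 0, '<UNK>': 1, '<START>': 2, '<END>': 3, 'hello': 10, 'world': 11,
}


def basic_clean(text: str) -> str:
    """Lowercase and collapse whitespace."""
    return re.sub(r'\s+', ' ', text.lower()).strip()


def advanced_clean(text: str) -> str:
    """Basic cleaning plus removal of punctuation and symbols."""
    cleaned = re.sub(r'[^a-z0-9\s]', '', basic_clean(text))
    return re.sub(r'\s+', ' ', cleaned).strip()


def unk_rate(text: str, vocabulary: Dict[str, int]) -> float:
    """Fraction of whitespace-separated tokens missing from the vocabulary."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(token not in vocabulary for token in tokens) / len(tokens)


messy = "  HELLO!!! World???   This has    extra spaces & symbols!!!  "
print(f"basic:    {unk_rate(basic_clean(messy), VOCABULARY):.2f}")     # 1.00
print(f"advanced: {unk_rate(advanced_clean(messy), VOCABULARY):.2f}")  # 0.71
```

With advanced cleaning, `hello` and `world` are recovered, so 5 of 7 tokens remain unknown instead of 8 of 8. The remaining unknowns reflect the tiny demonstration vocabulary rather than a cleaning problem.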

---

### Image Data Processing

**Image Preprocessing: From Pixels to Model Input**

Images present unique challenges in multimodal AI due to their high dimensionality and variability in size, format, and quality. The `ImageProcessor` class demonstrates how raw pixel data is transformed into standardized numerical arrays suitable for deep learning models.  
  
**Key Concepts Covered:**  
1. **Dimensional Standardization**
   - Images come in various sizes and aspect ratios, but models expect consistent input dimensions.
   - Resizing operations must balance information preservation with computational efficiency.
   - Channel handling (grayscale vs RGB vs RGBA) requires careful consideration.
2. **Pixel Value Normalization**
   - Raw pixel values (0-255) are scaled to ranges that optimize model training.
   - Standardization using dataset statistics (like ImageNet means/stds) improves model convergence.
   - Normalization prevents certain pixel ranges from dominating the learning process.
3. **Format Conversion Challenges**
   - Converting between color spaces (grayscale to RGB) affects information content.
   - Channel ordering (HWC vs CHW) must match model expectations.
   - Data type conversions (uint8 to float32) are necessary for mathematical operations.
4. **Data Augmentation Principles**
   - Augmentation techniques increase dataset diversity without collecting new data.
   - Transformations must preserve the essential visual information while adding variation.
   - Different augmentation strategies serve different purposes (geometric, photometric, etc.).
5. **Memory and Computational Considerations**
   - Images consume significantly more memory than text.
   - Processing pipeline efficiency affects real-time application performance.
   - Batch dimension addition prepares data for efficient GPU processing.
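
For dimensional standardization, a nearest-neighbor resize can be written directly with NumPy indexing. This is a minimal sketch under stated assumptions: `resize_nearest` is a hypothetical helper that is not part of the notebook's `ImageProcessor`, and production pipelines typically rely on an image library with bilinear or bicubic interpolation instead.

```python
import numpy as np
from typing import Tuple


def resize_nearest(image: np.ndarray, target_size: Tuple[int, int]) -> np.ndarray:
    """Nearest-neighbor resize for (H, W) or (H, W, C) arrays."""
    target_h, target_w = target_size
    source_h, source_w = image.shape[:2]
    # For each output row/column, pick the source index it maps onto
    row_idx = np.arange(target_h) * source_h // target_h
    col_idx = np.arange(target_w) * source_w // target_w
    # Broadcasting the two index arrays selects a (target_h, target_w) grid;
    # a trailing channel axis, if present, is carried along unchanged
    return image[row_idx[:, None], col_idx]


tiny = np.array([[0, 50], [100, 150]], dtype=np.uint8)
print(resize_nearest(tiny, (4, 4)))
# [[  0   0  50  50]
#  [  0   0  50  50]
#  [100 100 150 150]
#  [100 100 150 150]]

rgb = np.random.randint(0, 256, (512, 768, 3), dtype=np.uint8)
print(resize_nearest(rgb, (224, 224)).shape)  # (224, 224, 3)
```

Because the height and width are scaled independently, a 512x768 image is squashed to a square, which is the aspect-ratio trade-off mentioned above.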
  
**Learning Outcomes:**  
Students will appreciate the complexity of image preprocessing, understand why standardization is crucial, and recognize the computational trade-offs in image processing pipelines.  
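
To make the memory trade-off concrete, the sketch below compares the byte sizes of one padded text sequence and one preprocessed image, using the shapes from this notebook (50 token IDs, 224x224 RGB). The batch size of 32 is an assumption chosen only for illustration.

```python
import numpy as np

token_ids = np.zeros(50, dtype=np.int32)                  # one padded text sequence
image_uint8 = np.zeros((224, 224, 3), dtype=np.uint8)     # resized RGB image
image_float32 = image_uint8.astype(np.float32)            # after normalization

print(token_ids.nbytes)       # 200
print(image_uint8.nbytes)     # 150528
print(image_float32.nbytes)   # 602112

# A batch in the (B, C, H, W) layout; batch size 32 is an illustrative assumption
batch = np.zeros((32, 3, 224, 224), dtype=np.float32)
print(f"{batch.nbytes / 1e6:.1f} MB")  # 19.3 MB
```

Converting from uint8 to float32 alone quadruples the footprint, and a single normalized image is about 3000 times larger than a padded text sequence.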
  

In [3]:
class ImageProcessor:
    """
    Handles image preprocessing operations including resizing, normalization,
    and augmentation. This simulates real-world image processing pipelines.
    """
    
    def __init__(self):
        self.name = "Image Processor"
        self.target_size = (224, 224)  # Common input size for many models
        self.normalization_mean = [0.485, 0.456, 0.406]  # ImageNet means
        self.normalization_std = [0.229, 0.224, 0.225]   # ImageNet stds
        
    def create_sample_images(self) -> Dict[str, Dict]:
        """
        Creates sample images with various preprocessing challenges
        
        Returns:
            Dictionary of image samples with different characteristics
        """
        samples = {
            'small_grayscale': {
                'data': np.random.randint(0, 256, (64, 64), dtype=np.uint8),
                'channels': 1,
                'original_size': (64, 64)
            },
            'large_rgb': {
                'data': np.random.randint(0, 256, (512, 768, 3), dtype=np.uint8),
                'channels': 3,
                'original_size': (512, 768)
            },
            'square_rgb': {
                'data': np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8),
                'channels': 3,
                'original_size': (256, 256)
            },
            'very_small': {
                'data': np.random.randint(0, 256, (32, 48, 3), dtype=np.uint8),
                'channels': 3,
                'original_size': (32, 48)
            },
            'high_values': {
                'data': np.random.randint(200, 256, (128, 128, 3), dtype=np.uint8),
                'channels': 3,
                'original_size': (128, 128)
            }
        }
        return samples
    
    def resize_image(self, image: np.ndarray, target_size: Tuple[int, int]) -> np.ndarray:
        """
        Simulates image resizing (simplified version)
        In real applications, this would use proper interpolation
        
        Args:
            image: Input image array
            target_size: Target (height, width)
            
        Returns:
            Resized image array
        """
        original_shape = image.shape
        
        if len(original_shape) == 2:  # Grayscale
            # Placeholder: random pixels at the target size (the input content is not used)
            resized = np.random.randint(0, 256, target_size, dtype=np.uint8)
        else:  # RGB/RGBA
            channels = original_shape[2]
            resized = np.random.randint(0, 256, (*target_size, channels), dtype=np.uint8)
        
        return resized
    
    def normalize_image(self, image: np.ndarray) -> np.ndarray:
        """
        Normalizes image pixel values to [0, 1] range and applies standardization
        
        Args:
            image: Input image array (0-255 range)
            
        Returns:
            Normalized image array
        """
        # Convert to float and normalize to [0, 1]
        normalized = image.astype(np.float32) / 255.0
        
        # Apply channel-wise standardization if RGB
        if len(normalized.shape) == 3 and normalized.shape[2] == 3:
            for i in range(3):
                normalized[:, :, i] = (normalized[:, :, i] - self.normalization_mean[i]) / self.normalization_std[i]
        elif len(normalized.shape) == 2:  # Grayscale
            # Simple standardization for grayscale
            normalized = (normalized - 0.5) / 0.5
        
        return normalized
    
    def handle_grayscale_to_rgb(self, image: np.ndarray) -> np.ndarray:
        """
        Converts grayscale image to RGB by replicating channels
        
        Args:
            image: Grayscale image array
            
        Returns:
            RGB image array
        """
        if len(image.shape) == 2:
            # Replicate grayscale channel to create RGB
            rgb_image = np.stack([image, image, image], axis=2)
            return rgb_image
        return image
    
    def apply_basic_augmentation(self, image: np.ndarray) -> Dict[str, np.ndarray]:
        """
        Applies basic data augmentation techniques (simulated)
        
        Args:
            image: Input image array
            
        Returns:
            Dictionary of augmented images
        """
        augmented = {}
        
        # Original
        augmented['original'] = image.copy()
        
        # Horizontal flip (simulated)
        augmented['horizontal_flip'] = np.fliplr(image)
        
        # Brightness adjustment (simulated)
        bright_factor = 1.2
        augmented['brightness_up'] = np.clip(image * bright_factor, 0, 255).astype(image.dtype)
        
        # Contrast adjustment (simulated)
        contrast_factor = 1.3
        mean_val = np.mean(image)
        augmented['contrast_up'] = np.clip((image - mean_val) * contrast_factor + mean_val, 0, 255).astype(image.dtype)
        
        return augmented
    
    def process_image_pipeline(self, image_data: Dict[str, Any], apply_augmentation: bool = False) -> Dict[str, Any]:
        """
        Complete image processing pipeline
        
        Args:
            image_data: Dictionary containing image data and metadata
            apply_augmentation: Whether to apply data augmentation
            
        Returns:
            Dictionary containing all processing steps and results
        """
        image = image_data['data']
        
        result = {
            'original_shape': image.shape,
            'original_dtype': image.dtype,
            'original_size_bytes': image.nbytes,
            'original_value_range': (int(image.min()), int(image.max()))
        }
        
        # Step 1: Handle grayscale to RGB conversion if needed
        if len(image.shape) == 2:
            image = self.handle_grayscale_to_rgb(image)
            result['converted_to_rgb'] = True
        else:
            result['converted_to_rgb'] = False
        
        result['after_rgb_conversion_shape'] = image.shape
        
        # Step 2: Resize image
        resized_image = self.resize_image(image, self.target_size)
        result['resized_shape'] = resized_image.shape
        result['resize_factor'] = (
            self.target_size[0] / image.shape[0],
            self.target_size[1] / image.shape[1]
        )
        
        # Step 3: Normalize image
        normalized_image = self.normalize_image(resized_image)
        result['normalized_dtype'] = normalized_image.dtype
        result['normalized_value_range'] = (float(normalized_image.min()), float(normalized_image.max()))
        
        # Step 4: Convert to model input format (add batch dimension)
        # Transpose from HWC to CHW format (common in deep learning)
        if len(normalized_image.shape) == 3:
            model_input = np.transpose(normalized_image, (2, 0, 1))  # CHW format
        else:
            model_input = normalized_image
        
        # Add batch dimension
        model_input = np.expand_dims(model_input, axis=0)

        # [C, H, W] -> [B, C, H, W]
        # [3, 224, 224] -> [1, 3, 224, 224]
        
        result['model_input_shape'] = model_input.shape
        result['model_input'] = model_input
        
        # Step 5: Apply augmentation if requested
        if apply_augmentation:
            augmented_images = self.apply_basic_augmentation(resized_image)
            result['augmented_versions'] = len(augmented_images)
            result['augmentation_applied'] = True
        else:
            result['augmentation_applied'] = False
        
        return result
    
    def demonstrate_image_processing(self):
        """
        Main demonstration function for image processing
        """
        print(f"=== {self.name} Demonstration ===\n")
        
        # Get sample images
        sample_images = self.create_sample_images()
        
        print("1. Image Processing Pipeline Demonstration:")
        print(f"   Target size: {self.target_size}")
        print(f"   Normalization mean: {self.normalization_mean}")
        print(f"   Normalization std: {self.normalization_std}\n")
        
        for image_name, image_data in sample_images.items():
            print(f"Processing: {image_name}")
            print(f"Original shape: {image_data['data'].shape}")
            print(f"Original size: {image_data['original_size']}")
            
            # Process image
            result = self.process_image_pipeline(image_data, apply_augmentation=(image_name == 'square_rgb'))
            
            print(f"RGB conversion needed: {result['converted_to_rgb']}")
            print(f"After RGB shape: {result['after_rgb_conversion_shape']}")
            print(f"Resized shape: {result['resized_shape']}")
            print(f"Resize factors: {result['resize_factor']}")
            print(f"Normalized range: ({result['normalized_value_range'][0]:.3f}, {result['normalized_value_range'][1]:.3f})")
            print(f"Final model input shape: {result['model_input_shape']}")
            print(f"Augmentation applied: {result['augmentation_applied']}")
            
            print("-" * 50)
        
        return sample_images


print("PART 2: IMAGE PROCESSING")
print("=" * 30)
image_processor = ImageProcessor()
image_processor.demonstrate_image_processing()

print("\n" + "="*60 + "\n")

PART 2: IMAGE PROCESSING
==============================
=== Image Processor Demonstration ===

1. Image Processing Pipeline Demonstration:
   Target size: (224, 224)
   Normalization mean: [0.485, 0.456, 0.406]
   Normalization std: [0.229, 0.224, 0.225]

Processing: small_grayscale
Original shape: (64, 64)
Original size: (64, 64)
RGB conversion needed: True
After RGB shape: (64, 64, 3)
Resized shape: (224, 224, 3)
Resize factors: (3.5, 3.5)
Normalized range: (-2.118, 2.640)
Final model input shape: (1, 3, 224, 224)
Augmentation applied: False
--------------------------------------------------
Processing: large_rgb
Original shape: (512, 768, 3)
Original size: (512, 768)
RGB conversion needed: False
After RGB shape: (512, 768, 3)
Resized shape: (224, 224, 3)
Resize factors: (0.4375, 0.2916666666666667)
Normalized range: (-2.118, 2.640)
Final model input shape: (1, 3, 224, 224)
Augmentation applied: False
--------------------------------------------------
Processing: square_rgb
Original shape: (256, 256, 3)
Original size: (256, 256)
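
The normalized range of (-2.118, 2.640) reported above follows directly from the ImageNet statistics: a pixel value of 0 scales to 0.0 and a value of 255 scales to 1.0, and each is then standardized per channel. The short check below reproduces those bounds; it assumes only that the image contains at least one 0 and one 255 in the relevant channels, which is very likely for uniformly random data.

```python
import numpy as np

mean = np.array([0.485, 0.456, 0.406])  # ImageNet means (R, G, B)
std = np.array([0.229, 0.224, 0.225])   # ImageNet stds (R, G, B)

lowest = (0.0 - mean) / std    # standardized value of a black pixel, per channel
highest = (1.0 - mean) / std   # standardized value of a fully saturated pixel

print(f"{lowest.min():.3f}")   # -2.118 (red channel)
print(f"{highest.max():.3f}")  # 2.640 (blue channel)
```

The lower bound comes from the red channel and the upper bound from the blue channel, so the printed range spans the extremes across all three channels rather than describing any single one.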

---

### Audio Data Processing

**Audio Signal Processing for AI Applications**

Audio data presents temporal challenges that differ from both text and images. The `AudioProcessor` class demonstrates how continuous audio signals are converted into discrete, standardized representations suitable for machine learning models.
  
**Key Concepts Covered:**  
1. **Temporal Data Characteristics**
   - Audio is inherently time-series data with temporal dependencies.
   - Sample rates determine the resolution of audio capture and affect processing requirements.
   - Duration variability requires standardization strategies similar to text sequence lengths.
2. **Channel and Format Standardization**
   - Stereo to mono conversion affects information content but simplifies processing.
   - Sample rate conversion (resampling) must preserve essential frequency information.
   - Amplitude normalization prevents volume variations from affecting model training.
3. **Spectral Feature Extraction**
   - Time-domain audio signals are often converted to frequency-domain representations.
   - STFT (Short-Time Fourier Transform) captures both temporal and spectral information.
   - Mel spectrograms and MFCCs provide perceptually relevant audio features.
   - Feature extraction transforms 1D audio into 2D representations suitable for various model architectures.
4. **Windowing and Framing**
   - Audio signals are processed in overlapping windows to capture temporal dynamics.
   - Window size and hop length parameters balance temporal resolution with computational efficiency.
   - Windowing functions reduce spectral artifacts in frequency analysis.
5. **Length Standardization**
   - Audio clips of varying lengths must be standardized for batch processing.
   - Padding with silence vs. truncation affects information preservation.
   - Fixed-length processing enables efficient model training and inference.
6. **Signal Quality Considerations**
   - Amplitude normalization prevents clipping and ensures consistent signal levels.
   - Noise handling and filtering improve signal quality for model input.
   - Dynamic range considerations affect model sensitivity to quiet vs. loud sounds.
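
Sample rate conversion can be approximated with linear interpolation, as in the sketch below. This is an illustration under stated assumptions: `resample_linear` is a hypothetical helper for mono signals, and it applies no anti-aliasing filter, so downsampling content above half the target rate would alias. Real pipelines generally use a band-limited or polyphase resampler.

```python
import numpy as np


def resample_linear(audio: np.ndarray, original_sr: int, target_sr: int) -> np.ndarray:
    """Resample a mono signal by linear interpolation (no anti-aliasing filter)."""
    if original_sr == target_sr:
        return audio
    new_length = int(len(audio) * target_sr / original_sr)
    old_times = np.arange(len(audio)) / original_sr   # timestamps of the input samples
    new_times = np.arange(new_length) / target_sr     # timestamps of the output samples
    return np.interp(new_times, old_times, audio)


sr = 44100
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)       # one second of a 440 Hz sine
resampled = resample_linear(tone, sr, 16000)
print(resampled.shape)  # (16000,)
```

The one-second clip keeps its duration while the sample count drops from 44100 to 16000, and the 440 Hz tone is preserved because it lies well below the new Nyquist limit of 8000 Hz.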
  
**Learning Outcomes:**  
Students will understand the unique challenges of temporal data processing, appreciate the complexity of audio feature extraction, and recognize why spectral representations are often preferred over raw audio for AI applications.
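
The framing, windowing, and STFT steps described above can be sketched with NumPy alone. The `stft_magnitude` helper below is a hypothetical illustration using the same window size (1024) and hop length (512) as this notebook. It does not pad the signal at the edges, so its frame count can differ from a `len(audio) // hop_length` estimate.

```python
import numpy as np


def stft_magnitude(audio: np.ndarray, window_size: int = 1024, hop_length: int = 512) -> np.ndarray:
    """Magnitude STFT of a mono signal, shaped (n_freq_bins, n_frames)."""
    if len(audio) < window_size:
        # Zero-pad short clips so that at least one full frame exists
        audio = np.pad(audio, (0, window_size - len(audio)))
    n_frames = 1 + (len(audio) - window_size) // hop_length
    window = np.hanning(window_size)  # Hann window reduces spectral leakage
    frames = np.stack([
        audio[i * hop_length: i * hop_length + window_size] * window
        for i in range(n_frames)
    ])
    spectrum = np.fft.rfft(frames, axis=1)  # one FFT per frame
    return np.abs(spectrum).T


sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000.0 * t)     # one second of a 1 kHz sine
magnitude = stft_magnitude(tone)
print(magnitude.shape)                    # (513, 30)

peak_bin = int(magnitude.mean(axis=1).argmax())
print(peak_bin * sr / 1024)               # 1000.0
```

The 1D signal becomes a 2D array of frequency bins by time frames, and the energy of the test tone concentrates in the bin corresponding to 1000 Hz. This is the kind of structure that image-style model architectures can exploit.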



In [4]:
class AudioProcessor:
    """
    Handles audio preprocessing operations including resampling, windowing,
    and feature extraction. This simulates real-world audio processing pipelines.
    """
    
    def __init__(self):
        self.name = "Audio Processor"
        self.target_sample_rate = 16000  # Common sample rate for speech processing
        self.window_size = 1024  # For spectral analysis
        self.hop_length = 512    # For spectral analysis
        self.n_mels = 128        # Number of mel frequency bins
        
    def create_sample_audio(self) -> Dict[str, Dict]:
        """
        Creates sample audio with various preprocessing challenges
        
        Returns:
            Dictionary of audio samples with different characteristics
        """
        samples = {
            'short_mono_16k': {
                'data': np.random.uniform(-1.0, 1.0, 16000),  # 1 second
                'sample_rate': 16000,
                'channels': 1,
                'duration': 1.0
            },
            'long_stereo_44k': {
                'data': np.random.uniform(-1.0, 1.0, (44100 * 3, 2)),  # 3 seconds stereo
                'sample_rate': 44100,
                'channels': 2,
                'duration': 3.0
            },
            'quiet_audio': {
                'data': np.random.uniform(-0.1, 0.1, 22050),  # Very quiet audio
                'sample_rate': 22050,
                'channels': 1,
                'duration': 1.0
            },
            'loud_audio': {
                'data': np.random.uniform(-0.9, 0.9, 8000),  # Loud audio, low sample rate
                'sample_rate': 8000,
                'channels': 1,
                'duration': 1.0
            },
            'very_short': {
                'data': np.random.uniform(-0.5, 0.5, 1600),  # 0.1 seconds
                'sample_rate': 16000,
                'channels': 1,
                'duration': 0.1
            }
        }
        return samples
    
    def resample_audio(self, audio: np.ndarray, original_sr: int, target_sr: int) -> np.ndarray:
        """
        Simulates audio resampling (simplified version)
        In real applications, this would use proper signal processing
        
        Args:
            audio: Input audio array
            original_sr: Original sample rate
            target_sr: Target sample rate
            
        Returns:
            Resampled audio array
        """
        if original_sr == target_sr:
            return audio
        
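        # Calculate resampling ratio (used only to set the output length;
        # the samples generated below are random placeholders, not interpolated values)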
        ratio = target_sr / original_sr
        
        if len(audio.shape) == 1:  # Mono
            new_length = int(len(audio) * ratio)
            resampled = np.random.uniform(-1.0, 1.0, new_length)
        else:  # Stereo
            new_length = int(audio.shape[0] * ratio)
            resampled = np.random.uniform(-1.0, 1.0, (new_length, audio.shape[1]))
        
        return resampled
    
    def convert_to_mono(self, audio: np.ndarray) -> np.ndarray:
        """
        Converts stereo audio to mono by averaging channels
        
        Args:
            audio: Input audio array
            
        Returns:
            Mono audio array
        """
        if len(audio.shape) == 1:
            return audio  # Already mono
        else:
            # Average across channels
            return np.mean(audio, axis=1)
    
    def normalize_audio(self, audio: np.ndarray) -> np.ndarray:
        """
        Normalizes audio amplitude
        
        Args:
            audio: Input audio array
            
        Returns:
            Normalized audio array
        """
        # Find the maximum absolute value
        max_val = np.max(np.abs(audio))
        
        if max_val > 0:
            # Normalize to [-1, 1] range
            normalized = audio / max_val
        else:
            normalized = audio
        
        return normalized
    
    def apply_windowing(self, audio: np.ndarray) -> np.ndarray:
        """
        Applies windowing function to audio signal
        
        Args:
            audio: Input audio array
            
        Returns:
            Windowed audio array
        """
        # Apply a Hann window spanning the whole signal
        window = np.hanning(len(audio))
        windowed = audio * window
        return windowed
    
    def extract_spectral_features(self, audio: np.ndarray, sample_rate: int) -> Dict[str, np.ndarray]:
        """
        Extracts spectral features from audio (simulated)
        
        Args:
            audio: Input audio array
            sample_rate: Sample rate of audio
            
        Returns:
            Dictionary of spectral features
        """
        features = {}
        
        # Simulate STFT (Short-Time Fourier Transform)
        n_frames = max(1, len(audio) // self.hop_length)
        n_freq_bins = self.window_size // 2 + 1
        
        features['stft_magnitude'] = np.random.uniform(0, 1, (n_freq_bins, n_frames))
        features['stft_phase'] = np.random.uniform(-np.pi, np.pi, (n_freq_bins, n_frames))
        
        # Simulate Mel spectrogram
        features['mel_spectrogram'] = np.random.uniform(0, 1, (self.n_mels, n_frames))
        
        # Simulate MFCCs (Mel-Frequency Cepstral Coefficients)
        n_mfcc = 13
        features['mfcc'] = np.random.uniform(-1, 1, (n_mfcc, n_frames))
        
        return features
    
    def pad_or_truncate_audio(self, audio: np.ndarray, target_length: int) -> np.ndarray:
        """
        Pads or truncates audio to target length
        
        Args:
            audio: Input audio array
            target_length: Target length in samples
            
        Returns:
            Padded or truncated audio array
        """
        current_length = len(audio)
        
        if current_length > target_length:
            # Truncate
            return audio[:target_length]
        elif current_length < target_length:
            # Pad with zeros
            padding = target_length - current_length
            return np.pad(audio, (0, padding), mode='constant', constant_values=0)
        else:
            return audio
    
    def process_audio_pipeline(self, audio_data: Dict[str, Any], extract_features: bool = True) -> Dict[str, Any]:
        """
        Complete audio processing pipeline
        
        Args:
            audio_data: Dictionary containing audio data and metadata
            extract_features: Whether to extract spectral features
            
        Returns:
            Dictionary containing all processing steps and results
        """
        audio = audio_data['data']
        original_sr = audio_data['sample_rate']
        
        result = {
            'original_shape': audio.shape,
            'original_sample_rate': original_sr,
            'original_duration': audio_data['duration'],
            'original_channels': audio_data['channels'],
            'original_amplitude_range': (float(audio.min()), float(audio.max()))
        }
        
        # Step 1: Convert to mono if stereo
        if audio.ndim > 1:
            audio = self.convert_to_mono(audio)
            result['converted_to_mono'] = True
        else:
            result['converted_to_mono'] = False
        
        result['after_mono_shape'] = audio.shape
        
        # Step 2: Resample to target sample rate
        resampled_audio = self.resample_audio(audio, original_sr, self.target_sample_rate)
        result['resampled_shape'] = resampled_audio.shape
        result['resampling_ratio'] = self.target_sample_rate / original_sr
        
        # Step 3: Normalize audio
        normalized_audio = self.normalize_audio(resampled_audio)
        result['normalized_amplitude_range'] = (float(normalized_audio.min()), float(normalized_audio.max()))
        
        # Step 4: Pad or truncate to fixed length (1 second at target sample rate)
        target_length = self.target_sample_rate  # 1 second
        fixed_length_audio = self.pad_or_truncate_audio(normalized_audio, target_length)
        result['fixed_length_shape'] = fixed_length_audio.shape
        result['padding_applied'] = len(fixed_length_audio) > len(normalized_audio)
        result['truncation_applied'] = len(normalized_audio) > target_length
        
        # Step 5: Apply windowing
        windowed_audio = self.apply_windowing(fixed_length_audio)
        result['windowing_applied'] = True
        
        # Step 6: Extract spectral features if requested
        if extract_features:
            spectral_features = self.extract_spectral_features(windowed_audio, self.target_sample_rate)
            result['spectral_features'] = {
                'stft_shape': spectral_features['stft_magnitude'].shape,
                'mel_spectrogram_shape': spectral_features['mel_spectrogram'].shape,
                'mfcc_shape': spectral_features['mfcc'].shape
            }
            result['features_extracted'] = True
            
            # Model input would typically be one of these features
            result['model_input'] = spectral_features['mel_spectrogram']
            result['model_input_shape'] = result['model_input'].shape
        else:
            result['features_extracted'] = False
            result['model_input'] = windowed_audio
            result['model_input_shape'] = windowed_audio.shape
        
        return result
    
    def demonstrate_audio_processing(self):
        """
        Main demonstration function for audio processing
        """
        print(f"=== {self.name} Demonstration ===\n")
        
        # Get sample audio
        sample_audio = self.create_sample_audio()
        
        print("1. Audio Processing Pipeline Demonstration:")
        print(f"   Target sample rate: {self.target_sample_rate} Hz")
        print(f"   Window size: {self.window_size}")
        print(f"   Hop length: {self.hop_length}")
        print(f"   Mel frequency bins: {self.n_mels}\n")
        
        for audio_name, audio_data in sample_audio.items():
            print(f"Processing: {audio_name}")
            print(f"Original: {audio_data['duration']}s, {audio_data['sample_rate']}Hz, {audio_data['channels']} ch")
            
            # Process audio
            result = self.process_audio_pipeline(audio_data, extract_features=(audio_name in ['short_mono_16k', 'long_stereo_44k']))
            
            print(f"Converted to mono: {result['converted_to_mono']}")
            print(f"Resampling ratio: {result['resampling_ratio']:.2f}")
            print(f"Amplitude range after norm: ({result['normalized_amplitude_range'][0]:.3f}, {result['normalized_amplitude_range'][1]:.3f})")
            print(f"Padding applied: {result['padding_applied']}")
            print(f"Truncation applied: {result['truncation_applied']}")
            print(f"Features extracted: {result['features_extracted']}")
            
            if result['features_extracted']:
                print(f"STFT shape: {result['spectral_features']['stft_shape']}")
                print(f"Mel spectrogram shape: {result['spectral_features']['mel_spectrogram_shape']}")
                print(f"MFCC shape: {result['spectral_features']['mfcc_shape']}")
            
            print(f"Final model input shape: {result['model_input_shape']}")
            
            print("-" * 50)
        
        return sample_audio


print("PART 3: AUDIO PROCESSING")
print("=" * 30)
audio_processor = AudioProcessor()
audio_processor.demonstrate_audio_processing()

print("\n" + "="*60 + "\n")


PART 3: AUDIO PROCESSING
=== Audio Processor Demonstration ===

1. Audio Processing Pipeline Demonstration:
   Target sample rate: 16000 Hz
   Window size: 1024
   Hop length: 512
   Mel frequency bins: 128

Processing: short_mono_16k
Original: 1.0s, 16000Hz, 1 ch
Converted to mono: False
Resampling ratio: 1.00
Amplitude range after norm: (-1.000, 1.000)
Padding applied: False
Truncation applied: False
Features extracted: True
STFT shape: (513, 31)
Mel spectrogram shape: (128, 31)
MFCC shape: (13, 31)
Final model input shape: (128, 31)
--------------------------------------------------
Processing: long_stereo_44k
Original: 3.0s, 44100Hz, 2 ch
Converted to mono: True
Resampling ratio: 0.36
Amplitude range after norm: (-1.000, 1.000)
Padding applied: False
Truncation applied: True
Features extracted: True
STFT shape: (513, 31)
Mel spectrogram shape: (128, 31)
MFCC shape: (13, 31)
Final model input shape: (128, 31)
--------------------------------------------------
Processing: quiet_audio
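
The `extract_spectral_features` method above fills its arrays with random values, so the shapes are realistic but the contents are not. As a minimal sketch of what a real magnitude STFT might look like using only NumPy, the example below splits the signal into overlapping frames, applies a Hann window to each frame, and takes a real FFT. The function name `stft_magnitude`, the 440 Hz test tone, and the choice not to pad the signal edges are assumptions made for illustration; they are not part of the `AudioProcessor` class. Because the edges are not padded, a 1-second clip at 16 kHz yields 30 frames here rather than the 31 reported by the simulation.

```python
import numpy as np


def stft_magnitude(audio: np.ndarray, window_size: int = 1024, hop_length: int = 512) -> np.ndarray:
    """Return the magnitude STFT with shape (n_freq_bins, n_frames)."""
    if len(audio) < window_size:
        # Zero-pad short signals so that at least one full frame exists
        audio = np.pad(audio, (0, window_size - len(audio)))

    n_frames = 1 + (len(audio) - window_size) // hop_length
    window = np.hanning(window_size)

    # Slice overlapping frames and window each one separately
    frames = np.stack([
        audio[i * hop_length: i * hop_length + window_size] * window
        for i in range(n_frames)
    ])  # shape: (n_frames, window_size)

    # rfft keeps only the non-negative frequencies: window_size // 2 + 1 bins
    spectrum = np.fft.rfft(frames, axis=1)
    return np.abs(spectrum).T


# Hypothetical test signal: a 440 Hz sine tone, 1 second at 16 kHz
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)

magnitude = stft_magnitude(tone)
peak_bin = int(np.argmax(magnitude.mean(axis=1)))
peak_hz = peak_bin * sample_rate / 1024  # bin index -> frequency in Hz

print(magnitude.shape)  # (513, 30)
print(f"Strongest frequency bin is near {peak_hz:.1f} Hz")
```

The strongest bin lands within one bin width (16000 / 1024 ≈ 15.6 Hz) of the 440 Hz tone, which is the kind of structure the random placeholder features cannot show.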

---

### Modality Processing Comparison

**Cross-Modal Preprocessing Analysis**  
This section compares the preprocessing challenges and solutions across all three modalities, highlighting what they share and where they differ. The comparison helps explain why multimodal AI systems are complex and how different data types interact.
  
**Key Concepts Covered:**  
1. **Universal Preprocessing Principles**
   - Standardization: All modalities require conversion to consistent formats
   - Normalization: Scaling values to appropriate ranges is universal across modalities
   - Dimensionality Management: Fixed input dimensions enable efficient batch processing
   - Quality Control: Reducing noise and filtering out irrelevant information improve model performance
2. **Modality-Specific Challenges**
   - Text: Discrete symbols with semantic relationships, variable sequence lengths
   - Images: Continuous pixel values with spatial relationships, high dimensionality
   - Audio: Continuous temporal signals with frequency content, time-series dependencies
3. **Resource and Computational Considerations**
   - Memory usage varies widely across modalities (typically text < audio < images for a single input)
   - Processing complexity differs based on data characteristics and required transformations
   - Real-time processing constraints affect preprocessing pipeline design
4. **Information Preservation Trade-offs**
   - Each preprocessing step can discard information (for example, truncating a 3-second clip to 1 second)
   - Balancing standardization needs with information retention is crucial
   - Different applications may require different preprocessing strategies
5. **Preprocessing Impact on Model Performance**
   - Quality of preprocessing directly affects downstream model performance
   - Inconsistent preprocessing can introduce biases and reduce model robustness
   - Understanding preprocessing effects is essential for debugging model issues
6. **Scalability and Production Considerations**
   - Preprocessing pipelines must handle varying data quality in production
   - Error handling and fallback mechanisms are essential for robust systems
   - Preprocessing efficiency affects overall system performance
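
To make the memory ordering in point 3 concrete, here is a rough back-of-the-envelope sketch that allocates one preprocessed input per modality and reports its size. The shapes and dtypes are assumptions chosen for illustration (128 token IDs, 1 second of 16 kHz mono audio, a 224×224 RGB image); actual sizes depend on the sequence length, clip duration, resolution, and dtype a given pipeline uses.

```python
import numpy as np

# Hypothetical fixed-size inputs, one per modality (illustrative shapes only)
inputs = {
    "text": np.zeros(128, dtype=np.int64),               # 128 token IDs
    "audio": np.zeros(16000, dtype=np.float32),          # 1 s of 16 kHz mono samples
    "image": np.zeros((224, 224, 3), dtype=np.float32),  # RGB image, float pixels
}

# ndarray.nbytes is the element count multiplied by the bytes per element
footprints = {name: arr.nbytes for name, arr in inputs.items()}

for name, nbytes in footprints.items():
    print(f"{name:>5}: {nbytes:>7} bytes ({nbytes / 1024:.1f} KiB)")
# text:     1024 bytes
# audio:   64000 bytes
# image:  602112 bytes
```

Under these assumptions a single image is roughly 600 times larger than a single text sequence, which is why batch sizes and memory budgets often need to be planned per modality.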
  
**Learning Outcomes:**  
Students will understand the fundamental principles that apply across all modalities, appreciate the unique challenges each data type presents, and recognize why multimodal systems require careful coordination of preprocessing pipelines. This foundation prepares them for understanding how different modalities can be combined effectively in multimodal AI systems.

In [5]:
class ModalityProcessingComparison:
    """
    Compares preprocessing challenges and solutions across modalities
    """
    
    def __init__(self):
        self.name = "Modality Processing Comparison"
    
    def compare_preprocessing_challenges(self):
        """
        Compares preprocessing challenges across different modalities
        """
        print(f"=== {self.name} ===\n")
        
        print("1. Common Preprocessing Challenges by Modality:\n")
        
        print("TEXT PROCESSING CHALLENGES:")
        print("   • Variable sequence lengths → Solution: Padding/Truncation")
        print("   • Different vocabularies → Solution: Vocabulary mapping")
        print("   • Case sensitivity → Solution: Normalization")
        print("   • Punctuation and special characters → Solution: Cleaning")
        print("   • Out-of-vocabulary words → Solution: <UNK> tokens")
        print("   • Multiple languages → Solution: Multilingual tokenizers")
        
        print("\nIMAGE PROCESSING CHALLENGES:")
        print("   • Variable image sizes → Solution: Resizing")
        print("   • Different color spaces → Solution: Standardization")
        print("   • Pixel value ranges → Solution: Normalization")
        print("   • Different aspect ratios → Solution: Cropping/Padding")
        print("   • Limited training data → Solution: Data augmentation")
        print("   • Channel differences → Solution: Channel conversion")
        
        print("\nAUDIO PROCESSING CHALLENGES:")
        print("   • Variable sample rates → Solution: Resampling")
        print("   • Different audio lengths → Solution: Padding/Truncation")
        print("   • Stereo vs mono → Solution: Channel conversion")
        print("   • Amplitude variations → Solution: Normalization")
        print("   • Background noise → Solution: Filtering")
        print("   • Temporal dependencies → Solution: Windowing/Framing")
        
        print("\n2. Key Preprocessing Principles:")
        print("   • STANDARDIZATION: Convert all inputs to consistent format")
        print("   • NORMALIZATION: Scale values to appropriate ranges")
        print("   • DIMENSIONALITY: Ensure consistent input dimensions")
        print("   • QUALITY: Remove noise and irrelevant information")
        print("   • AUGMENTATION: Increase data diversity when needed")
        
        print("\n3. Modality-Specific Considerations:")
        print("   • Text: Semantic meaning preserved during cleaning")
        print("   • Image: Spatial relationships maintained during resizing")
        print("   • Audio: Temporal information preserved during processing")


print("PART 4: PROCESSING COMPARISON")
print("=" * 30)
comparison = ModalityProcessingComparison()
comparison.compare_preprocessing_challenges()

print("\n" + "="*60)
print("DEMONSTRATION COMPLETE")
print("\nKey Takeaways:")
print("- Each modality requires specific preprocessing steps")
print("- Standardization and normalization are crucial for all modalities")
print("- Variable input sizes must be handled consistently")
print("- Preprocessing quality directly affects model performance")
print("- Understanding modality-specific challenges is essential")

PART 4: PROCESSING COMPARISON
=== Modality Processing Comparison ===

1. Common Preprocessing Challenges by Modality:

TEXT PROCESSING CHALLENGES:
   • Variable sequence lengths → Solution: Padding/Truncation
   • Different vocabularies → Solution: Vocabulary mapping
   • Case sensitivity → Solution: Normalization
   • Punctuation and special characters → Solution: Cleaning
   • Out-of-vocabulary words → Solution: <UNK> tokens
   • Multiple languages → Solution: Multilingual tokenizers

IMAGE PROCESSING CHALLENGES:
   • Variable image sizes → Solution: Resizing
   • Different color spaces → Solution: Standardization
   • Pixel value ranges → Solution: Normalization
   • Different aspect ratios → Solution: Cropping/Padding
   • Limited training data → Solution: Data augmentation
   • Channel differences → Solution: Channel conversion

AUDIO PROCESSING CHALLENGES:
   • Variable sample rates → Solution: Resampling
   • Different audio lengths → Solution: Padding/Truncation
   • Stereo vs 

---

### Conclusion

**Why Individual Modality Processing Matters**  
Understanding individual modality processing is essential before tackling multimodal fusion because:  
- **Foundation Building:** Each modality has unique characteristics that must be properly handled
- **Quality Assurance:** Poor preprocessing in any modality degrades overall system performance
- **Debugging Capability:** Understanding individual pipelines enables effective troubleshooting
- **Design Decisions:** Preprocessing choices affect how modalities can be combined later
- **Resource Planning:** Different modalities have different computational and memory requirements  
  
With these individual pipelines understood, the next step is to explore how the different data types can be combined effectively in multimodal AI systems.

---