Notebook 1: Dataset Loading and Preprocessing Pipelines
========================================

[Click to view on Google Colab](https://colab.research.google.com/drive/1cS4MUQx4Zl_5b9Z3UFQCE3maEUnLpjFn?usp=sharing)

This notebook demonstrates practical approaches to loading, analyzing, and preprocessing real multimodal datasets for AI applications. We'll work with the [MSR-VTT](https://huggingface.co/datasets/friedrichor/MSR-VTT) video captioning dataset to explore the complete pipeline from raw data to training-ready batches.

Learning Objectives:
- Master real-world multimodal dataset loading and handling techniques
- Understand dataset structure analysis and preprocessing pipeline design
- Learn text tokenization and augmentation strategies for multimodal systems
- Implement efficient custom dataset classes and data loaders
- Optimize preprocessing pipelines for performance and memory efficiency using PyTorch

### Import the Necessary Libraries

In [3]:
!pip install numpy torch transformers datasets opencv-python yt-dlp

import os
import cv2
import random
import subprocess
import numpy as np
from typing import Dict, List, Optional

import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer
from datasets import load_dataset

# Fix tokenizer parallelism warning
os.environ["TOKENIZERS_PARALLELISM"] = "false"

---

### Real-World Dataset Loading: MSR-VTT Video Captioning Dataset

**From Raw Data to Structured Datasets: Practical Multimodal Data Loading**

Loading real multimodal datasets presents unique challenges that differ significantly from toy datasets. The MSR-VTT (Microsoft Research Video to Text) dataset serves as an excellent example of large-scale multimodal data with complex structure and real-world characteristics.

**Key Concepts Explored:**
1. **Large-Scale Dataset Management**
   - MSR-VTT contains 10,000 video clips with 200,000 natural language descriptions
   - Multiple dataset splits (train_7k, train_9k, test_1k) for different experimental setups
   - Memory-efficient loading techniques for large datasets

2. **Dataset Structure Understanding**
   - HuggingFace datasets integration for standardized data access
   - DatasetDict navigation and split selection strategies
   - Feature schema exploration to understand available data fields

In [None]:
def load_msr_vtt_dataset(split: str = "train_7k", sample_size: Optional[int] = 100):
    """
    Load MSR-VTT dataset with optional sampling for educational purposes.
    
    MSR-VTT (Microsoft Research Video to Text) is a large-scale video captioning dataset
    containing 10,000 video clips with 200,000 natural language descriptions.
    
    Args:
        split: Dataset config to load ('train_7k', 'train_9k', 'test_1k')
        sample_size: Number of samples to load (None for full dataset)
    
    Returns:
        Dataset object with video paths and captions
    """
    print(f"Loading MSR-VTT dataset - split: {split}")
    
    # Load the dataset; `split` is passed as the config name, and a DatasetDict is returned
    dataset_dict = load_dataset(
        "friedrichor/MSR-VTT", 
        split,  # This is the config name (train_7k, train_9k, test_1k)
        trust_remote_code=True
    )
    
    print(f"Available splits: {list(dataset_dict.keys())}")
    
    # Extract the actual dataset from the DatasetDict
    if 'train' in dataset_dict:
        dataset = dataset_dict['train']
    else:
        # If no 'train' split, use the first available split
        first_split = list(dataset_dict.keys())[0]
        dataset = dataset_dict[first_split]
        print(f"Using split '{first_split}' (no 'train' split found)")
    
    print(f"Original dataset size: {len(dataset)}")
    
    # Sample subset for educational purposes
    if sample_size and sample_size < len(dataset):
        indices = random.sample(range(len(dataset)), sample_size)
        dataset = dataset.select(indices)
        print(f"Sampled dataset size: {len(dataset)}")
    
    # Display dataset structure
    print("\nDataset structure:")
    print(f"Features: {dataset.features}")
    
    # Show example
    example = dataset[0]
    print(f"\nExample entry:")
    print(f"Video ID: {example.get('video_id', 'N/A')}")
    print(f"Caption: {example.get('caption', 'N/A')}")
    print(f"Available keys: {list(example.keys())}")
    
    return dataset

# Test the function
dataset = load_msr_vtt_dataset(split="train_7k", sample_size=50)
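The sampling step in `load_msr_vtt_dataset` draws from the global `random` state, so each run selects a different subset. When a repeatable subset is preferred, one option is a seeded local generator. The sketch below assumes a hypothetical helper, `sample_indices`, and a hypothetical `seed` parameter (neither is part of the notebook); the list it returns could be passed to `dataset.select`.

```python
import random
from typing import List, Optional


def sample_indices(n_total: int, sample_size: Optional[int], seed: int = 0) -> List[int]:
    """Return a reproducible, sorted subset of row indices.

    Hypothetical helper for illustration; `seed` is an assumed parameter.
    """
    if sample_size is None or sample_size >= n_total:
        # Nothing to sample: keep every row
        return list(range(n_total))
    rng = random.Random(seed)  # Local generator; the global random state is untouched
    return sorted(rng.sample(range(n_total), sample_size))


# The total of 1000 rows is an arbitrary value chosen for the example
indices = sample_indices(n_total=1000, sample_size=50, seed=42)
print(len(indices), indices[:5])
```

Calling the helper twice with the same seed returns the same indices, which makes later analysis cells comparable across runs.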

---

### Dataset Structure Analysis and Characterization

**Deep Dive into Multimodal Data Characteristics**

Understanding the structure and characteristics of multimodal datasets is crucial for designing effective preprocessing pipelines. This analysis phase reveals patterns, distributions, and potential challenges that will inform all subsequent processing decisions.

**Key Concepts Explored:**
1. **Statistical Data Profiling**
   - Caption length distribution analysis revealing natural language patterns
   - Video-caption relationship mapping to understand data redundancy

2. **Text Content Analysis**
   - Word count statistics providing insights into caption complexity
   - Vocabulary diversity assessment for tokenization strategy planning
   - Special character and multilingual content detection

3. **Data Relationship Mapping**
   - Video-to-caption ratio analysis revealing dataset structure

In [None]:
def analyze_dataset_structure(dataset):
    """
    Analyze the structure and characteristics of the loaded dataset.
    
    Understanding dataset structure is crucial for preprocessing pipeline design:
    - Data types and formats in each field
    - Distribution of text lengths and video properties
    
    Args:
        dataset: HuggingFace dataset object
    """
    print("=== Dataset Structure Analysis ===")
    
    # Basic statistics
    print(f"Dataset size: {len(dataset)}")
    print(f"Features: {list(dataset.features.keys())}")
    
    # First, let's examine the structure of captions
    sample_caption = dataset[0]['caption']
    print(f"\nCaption field type: {type(sample_caption)}")
    print(f"Sample caption: {sample_caption}")
    
    # Handle different caption formats
    if isinstance(sample_caption, list):
        # If captions are lists, flatten them or take first element
        captions = []
        for item in dataset:
            caption_list = item['caption']
            if isinstance(caption_list, list) and len(caption_list) > 0:
                captions.append(caption_list[0])  # Take first caption
            else:
                captions.append(str(caption_list))
    else:
        # If captions are strings
        captions = [item['caption'] for item in dataset]
    
    # Calculate caption lengths
    caption_lengths = [len(str(caption).split()) for caption in captions]
    
    print(f"\n=== Caption Analysis ===")
    print(f"Average caption length: {np.mean(caption_lengths):.2f} words")
    print(f"Min caption length: {min(caption_lengths)} words")
    print(f"Max caption length: {max(caption_lengths)} words")
    print(f"Median caption length: {np.median(caption_lengths):.2f} words")
    
    # Analyze video IDs
    video_ids = [item['video_id'] for item in dataset]
    unique_videos = len(set(video_ids))
    print(f"\n=== Video Analysis ===")
    print(f"Total samples: {len(video_ids)}")
    print(f"Unique videos: {unique_videos}")
    print(f"Average captions per video: {len(video_ids) / unique_videos:.2f}")
    
    # Check for categories if available
    if 'category' in dataset.features:
        categories = [item['category'] for item in dataset]
        unique_categories = set(categories)
        print(f"\n=== Category Analysis ===")
        print(f"Unique categories: {len(unique_categories)}")
        print(f"Categories: {sorted(unique_categories)}")
    
    # Check for other fields
    sample = dataset[0]
    print(f"\n=== Available Fields ===")
    for key, value in sample.items():
        print(f"{key}: {type(value)} - {str(value)[:100]}...")
    
    return {
        'caption_lengths': caption_lengths,
        'unique_videos': unique_videos,
        'sample_fields': list(sample.keys()),
        'total_samples': len(dataset),
        'processed_captions': captions
    }

# Analyze the dataset
analysis = analyze_dataset_structure(dataset)
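The analysis above reports caption lengths but not vocabulary diversity or non-English content. A rough sketch of both checks follows. It assumes a simple lowercase regular-expression word splitter, which is cruder than the subword tokenizer used later, and the helper name `vocabulary_stats` is hypothetical.

```python
import re
from collections import Counter
from typing import Any, Dict, List


def vocabulary_stats(captions: List[str]) -> Dict[str, Any]:
    """Compute simple vocabulary statistics over a list of captions."""
    tokens: List[str] = []
    for caption in captions:
        # Crude word splitter: lowercase ASCII letters, digits, and apostrophes
        tokens.extend(re.findall(r"[a-z0-9']+", caption.lower()))
    counts = Counter(tokens)
    total = len(tokens)
    return {
        "total_tokens": total,
        "vocab_size": len(counts),
        # Type-token ratio: unique words divided by total words
        "type_token_ratio": len(counts) / total if total else 0.0,
        "most_common": counts.most_common(3),
        # Captions containing any non-ASCII character (possible multilingual text)
        "non_ascii_captions": sum(1 for c in captions if not c.isascii()),
    }


# Toy captions for illustration only
toy_captions = ["a man is singing", "a woman is cooking", "a man plays guitar"]
vocab_report = vocabulary_stats(toy_captions)
print(vocab_report)
```

In the notebook, `analysis['processed_captions']` could be passed in place of the toy list. A low type-token ratio on a large sample indicates repetitive captions.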

---

### Advanced Text Preprocessing for Multimodal Systems

**From Raw Text to Model-Ready Tokens: Multimodal Text Processing**

Text preprocessing in multimodal systems requires sophisticated approaches that consider how text will interact with other modalities. Unlike single-modal text processing, multimodal preprocessing must ensure consistency, efficiency, and compatibility across different data types.

**Key Concepts Explored:**
1. **Tokenization Strategy Design**
   - BERT-style tokenization for transformer-based multimodal models
   - Subword tokenization handling out-of-vocabulary terms effectively
   - Special token integration ([CLS], [SEP]) for multimodal alignment
   - Vocabulary consistency across different text sources and domains

2. **Sequence Length Optimization**
   - Maximum length determination balancing information retention and efficiency
   - Padding strategies ensuring consistent tensor dimensions for batch processing
   - Truncation policies preserving the most important textual information
   - Dynamic length handling for variable-content scenarios

3. **Attention Mechanism Preparation**
   - Attention mask generation distinguishing real content from padding
   - Token importance weighting for multimodal attention mechanisms
   - Position encoding compatibility with visual and audio modalities
   - Cross-modal attention preparation through consistent tokenization

**Learning Outcomes:**
Learners will understand how text preprocessing in multimodal systems differs from single-modal approaches, master the implementation of efficient tokenization pipelines, and learn to optimize text processing for integration with visual and audio data. This knowledge is essential for building robust multimodal AI systems.
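To make the interaction between truncation, padding, and attention masks concrete, here is a toy sketch in plain Python. It is not BERT's WordPiece tokenizer: the whitespace splitting, the tiny vocabulary, and the special-token IDs are all assumptions made for illustration, and `encode_batch` is a hypothetical helper.

```python
from typing import Dict, List

# Assumed toy IDs for illustration; a real BERT vocabulary uses different values
PAD_ID, CLS_ID, SEP_ID, UNK_ID = 0, 1, 2, 3


def encode_batch(
    captions: List[str], vocab: Dict[str, int], max_length: int
) -> Dict[str, List[List[int]]]:
    """Encode captions to fixed-length ID lists with matching attention masks."""
    if max_length < 2:
        raise ValueError("max_length must leave room for the two special tokens")
    input_ids, attention_mask = [], []
    for caption in captions:
        # Whitespace "tokenizer": unknown words map to UNK_ID
        word_ids = [vocab.get(word, UNK_ID) for word in caption.lower().split()]
        # Truncate the words so that [CLS] and [SEP] still fit
        ids = [CLS_ID] + word_ids[: max_length - 2] + [SEP_ID]
        n_pad = max_length - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)
        # 1 marks real tokens, 0 marks padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_vocab = {"a": 4, "man": 5, "is": 6, "singing": 7}
encoded = encode_batch(["a man is singing", "a dog"], toy_vocab, max_length=6)
for ids, mask in zip(encoded["input_ids"], encoded["attention_mask"]):
    print(ids, mask)
```

Every row has the same length, so the rows can be stacked into one tensor, while the mask records which positions a model should ignore.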

In [None]:
def preprocess_text_captions(captions: List[str], tokenizer_name: str = "bert-base-uncased", 
                           max_length: int = 77) -> Dict:
    """
    Preprocess text captions for multimodal learning.
    
    Text preprocessing in multimodal systems involves:
    - Tokenization: Converting text to tokens that models can understand
    - Length normalization: Ensuring consistent sequence lengths
    - Special tokens: Adding [CLS], [SEP] tokens for BERT-style models
    - Attention masks: Indicating which tokens are padding vs. real content
    
    Args:
        captions: List of text captions
        tokenizer_name: HuggingFace tokenizer to use
        max_length: Maximum sequence length
    
    Returns:
        Dictionary with tokenized captions and attention masks
    """
    print(f"Preprocessing {len(captions)} captions with {tokenizer_name}")
    
    # Initialize tokenizer
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    
    # Show original captions
    print("\nOriginal captions (first 3):")
    for i, caption in enumerate(captions[:3]):
        print(f"{i+1}: {caption}")
    
    # Tokenize captions
    tokenized = tokenizer(
        captions,
        padding=True,           # Pad to the longest caption in the batch
        truncation=True,        # Truncate if too long
        max_length=max_length,
        return_tensors="pt"     # Return PyTorch tensors
    )
    
    print(f"\nTokenized output shape:")
    print(f"Input IDs: {tokenized['input_ids'].shape}")
    print(f"Attention mask: {tokenized['attention_mask'].shape}")
    
    # Show tokenized example
    print(f"\nTokenized example (first caption):")
    tokens = tokenizer.convert_ids_to_tokens(tokenized['input_ids'][0])
    print(f"Tokens: {tokens[:15]}...")  # Show first 15 tokens
    
    # Analyze tokenization statistics
    token_lengths = tokenized['attention_mask'].sum(dim=1)
    print(f"\nTokenization statistics:")
    print(f"Average tokens per caption: {token_lengths.float().mean():.2f}")
    print(f"Max tokens used: {token_lengths.max().item()}")
    print(f"Min tokens used: {token_lengths.min().item()}")
    
    return {
        'input_ids': tokenized['input_ids'],
        'attention_mask': tokenized['attention_mask'],
        'tokenizer': tokenizer,
        'token_lengths': token_lengths
    }

# Test text preprocessing with actual captions from dataset
sample_captions = analysis['processed_captions'][:10]  # Use processed captions from analysis
text_data = preprocess_text_captions(sample_captions)
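The token-length statistics printed above can inform the choice of `max_length`. One possible heuristic, sketched here as an assumption rather than the notebook's method, is to pick the smallest length that covers a chosen fraction of captions without truncation (a nearest-rank percentile). The helper name `suggest_max_length` and the `coverage` parameter are hypothetical.

```python
import math
from typing import List


def suggest_max_length(token_lengths: List[int], coverage: float = 0.95) -> int:
    """Smallest length that fits at least `coverage` of the captions untruncated."""
    if not token_lengths:
        raise ValueError("token_lengths must not be empty")
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    ordered = sorted(token_lengths)
    rank = math.ceil(coverage * len(ordered))  # 1-based nearest-rank position
    return ordered[rank - 1]


# Toy lengths for illustration; one outlier caption is much longer than the rest
toy_lengths = [8, 9, 10, 11, 12, 12, 13, 14, 15, 40]
print(suggest_max_length(toy_lengths, coverage=0.85))
print(suggest_max_length(toy_lengths, coverage=1.0))
```

In the notebook, `text_data['token_lengths'].tolist()` could be passed instead of the toy list. Lowering `coverage` trades a few truncated outliers for shorter padded sequences across the batch.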

---

### Custom Multimodal Dataset Implementation

**Building Efficient and Scalable Dataset Classes for Multimodal Learning**

Custom dataset classes form the backbone of efficient multimodal training pipelines. Unlike simple data containers, these classes must handle complex multimodal data relationships and provide consistent interfaces for diverse data types.

**Key Concepts Explored:**
1. **Multimodal Data Architecture**
   - Consistent indexing ensuring perfect alignment between modalities
   - Flexible preprocessing integration allowing different transforms per modality
   - Error handling strategies gracefully managing missing or corrupted data

2. **Memory Management Strategies**
   - On-demand data loading minimizing memory footprint during training
   - Efficient tensor creation and management for GPU compatibility

3. **Performance Optimization Techniques**
   - Batch optimization strategies minimizing data transfer overhead

4. **Extensibility and Modularity**
   - Modular design patterns enabling easy extension to new modalities
   - Configuration-driven preprocessing allowing runtime customization
   - Plugin architecture supporting custom transformation pipelines
   - Version compatibility ensuring dataset classes work across different frameworks

In [7]:
def download_videos(dataset, num_videos=1, output_dir="videos"):
    """Download videos from dataset URLs."""
    os.makedirs(output_dir, exist_ok=True)
    downloaded = []
    
    for i in range(min(num_videos, len(dataset))):
        sample = dataset[i]
        video_id = sample['video_id']
        url = sample['url']
        start_time = sample.get('start time', 0)
        end_time = sample.get('end time', 30)
        
        output_path = os.path.join(output_dir, f"{video_id}.mp4")
        
        if os.path.exists(output_path):
            downloaded.append(output_path)
            continue
        
        try:
            cmd = [
                "yt-dlp", "--format", "best[height<=480]", 
                "--output", output_path, "--quiet",
                "--external-downloader", "ffmpeg",
                "--external-downloader-args", 
                f"ffmpeg:-ss {start_time} -t {end_time - start_time}",
                url
            ]
            
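            result = subprocess.run(cmd, capture_output=True, timeout=120)
            if result.returncode == 0 and os.path.exists(output_path):
                downloaded.append(output_path)
                print(f"Downloaded: {video_id}")
            else:
                # yt-dlp ran but did not produce the file (e.g. video removed)
                print(f"Failed: {video_id} (exit code {result.returncode})")
        except Exception as e:
            # Covers timeouts and a missing yt-dlp executable, among other errors
            print(f"Failed: {video_id} ({e})")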
    
    return downloaded

In [8]:
class MultimodalDataset(Dataset):
    """Minimal multimodal dataset with all HF dataset keys using cv2."""
    
    def __init__(self, dataset, video_dir="videos", max_frames=8):
        self.dataset = dataset
        self.video_dir = video_dir
        self.max_frames = max_frames
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        
        # Find valid samples
        self.valid_indices = []
        for i in range(len(dataset)):
            video_id = dataset[i]['video_id']
            if os.path.exists(os.path.join(video_dir, f"{video_id}.mp4")):
                self.valid_indices.append(i)
        
        print(f"Found {len(self.valid_indices)} valid samples")
    
    def __len__(self):
        return len(self.valid_indices)
    
    def __getitem__(self, idx):
        sample = self.dataset[self.valid_indices[idx]]
        
        # Process text
        caption = sample.get('caption', '')
        if isinstance(caption, list):
            caption = caption[0]
        
        tokenized = self.tokenizer(
            caption, padding='max_length', truncation=True, 
            max_length=77, return_tensors='pt'
        )
        
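        # Load video frames using cv2
        video_path = os.path.join(self.video_dir, f"{sample['video_id']}.mp4")
        video_frames = self._load_video_cv2(video_path)
        if video_frames is None:
            # Decoding failed: use zero-valued placeholder frames so that
            # torch.stack in the collate function receives tensors of equal shape
            video_frames = torch.zeros(self.max_frames, 3, 224, 224)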
        
        return {
            'text_ids': tokenized['input_ids'].squeeze(0),
            'text_mask': tokenized['attention_mask'].squeeze(0),
            'caption': caption,
            'video_frames': video_frames,
            'video_id': sample['video_id'],
            'video': sample['video'],
            'source': sample['source'],
            'category': sample['category'],
            'url': sample['url'],
            'start_time': sample['start time'],
            'end_time': sample['end time'],
            'id': sample['id']
        }
    
    def _load_video_cv2(self, video_path):
        """Load video frames using OpenCV with comprehensive preprocessing.
        
        This function loads a video file, extracts frames uniformly across the video duration,
        preprocesses them for machine learning (resizing, color conversion, normalization),
        and returns them as a PyTorch tensor suitable for multimodal model training.
        
        Args:
            video_path (str): Full path to the video file (.mp4, .avi, etc.)
            
        Returns:
            Optional[torch.Tensor]: A tensor of shape (max_frames, 3, 224, 224) containing
                        RGB video frames normalized to the [0, 1] range, or None if the
                        video cannot be opened or no frame can be read.
                        
        Note:
            Errors are caught and printed rather than raised, so callers must
            handle a None return value.
            
        Processing Pipeline:
            1. Open video file using OpenCV VideoCapture
            2. Extract video metadata (total frames, fps)
            3. Calculate uniform frame sampling indices
            4. Extract and preprocess each frame:
               - Convert from BGR to RGB color space
               - Resize to 224x224 pixels (standard vision model input size)
               - Normalize pixel values from [0, 255] to [0, 1] range
               - Convert to PyTorch tensor with channels-first format (C, H, W)
            5. Handle frame padding to ensure consistent output size
            6. Stack all frames into a single tensor
        """
        try:
            # Step 1: Initialize OpenCV VideoCapture object
            # VideoCapture is the primary interface for reading video files in OpenCV
            cap = cv2.VideoCapture(video_path)
            
            # Step 2: Verify video file can be opened
            # isOpened() returns True if the video source has been initialized successfully
            if not cap.isOpened():
                print(f"Could not open video: {video_path}")
                return None
            
            # Step 3: Extract video metadata for frame sampling strategy
            # CAP_PROP_FRAME_COUNT: Total number of frames in the video
            # CAP_PROP_FPS: Frames per second of the video
            total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
            fps = cap.get(cv2.CAP_PROP_FPS)
            
            # Step 4: Validate video properties
            # Videos with 0 frames or invalid fps cannot be processed
            if total_frames <= 0 or fps <= 0:
                print(f"Invalid video properties: frames={total_frames}, fps={fps}")
                cap.release()  # Always release resources
                return None
            
            # Step 5: Calculate frame sampling indices for uniform temporal coverage
            # We want to sample max_frames uniformly across the entire video duration
            # This ensures we capture the video's temporal progression regardless of length
            if total_frames <= self.max_frames:
                # If video has fewer frames than needed, use all available frames
                frame_indices = list(range(total_frames))
            else:
                # Calculate uniformly spaced indices across the video
                # Formula: index = (sample_position * total_frames) / max_frames
                # This gives us evenly distributed frames across the entire video
                frame_indices = [
                    int(i * total_frames / self.max_frames) 
                    for i in range(self.max_frames)
                ]
            
            frames = []
            
            # Step 6: Extract and preprocess each sampled frame
            for frame_idx in frame_indices:
                # Step 6a: Seek to specific frame position
                # CAP_PROP_POS_FRAMES sets the 0-based index of the frame to be decoded/captured next
                cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
                
                # Step 6b: Read the frame at current position
                # ret: boolean indicating if frame was read successfully
                # frame: numpy array containing the frame data (H, W, C) in BGR format
                ret, frame = cap.read()
                
                # Step 6c: Verify frame was read successfully
                if ret and frame is not None:
                    # Step 6d: Color space conversion from BGR to RGB
                    # OpenCV uses BGR (Blue-Green-Red) by default, but most ML models expect RGB
                    # This is crucial for correct color representation in the model
                    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                    
                    # Step 6e: Resize frame to standard input size
                    # 224x224 is the standard input size for many vision models (ResNet, ViT, etc.)
                    # cv2.resize uses bilinear interpolation (INTER_LINEAR) by default
                    frame = cv2.resize(frame, (224, 224))
                    
                    # Step 6f: Convert to PyTorch tensor and normalize
                    # Convert numpy array to PyTorch tensor with float32 precision
                    # Permute dimensions from (H, W, C) to (C, H, W) - channels first format
                    # Normalize pixel values from [0, 255] to [0, 1] range for numerical stability
                    frame_tensor = torch.tensor(frame, dtype=torch.float32).permute(2, 0, 1) / 255.0
                    
                    frames.append(frame_tensor)
                else:
                    # Step 6g: Handle frame reading failure
                    print(f"Could not read frame {frame_idx}")
                    break  # Stop processing if we encounter read errors
            
            # Step 7: Release video capture resources
            # Always release to prevent memory leaks and file locks
            cap.release()
            
            # Step 8: Handle frame padding to ensure consistent output size
            # If we have fewer frames than max_frames, pad with the last valid frame
            # This ensures all samples have the same tensor dimensions for batching
            while len(frames) < self.max_frames:
                if frames:
                    # Clone the last frame to avoid tensor sharing issues
                    frames.append(frames[-1].clone())
                else:
                    # If no frames were successfully read, return None
                    return None
            
            # Step 9: Ensure exact frame count by truncating if necessary
            # This handles edge cases where we might have extracted more frames than needed
            frames = frames[:self.max_frames]
            
            # Step 10: Stack individual frame tensors into a single tensor
            # Result shape: (max_frames, 3, 224, 224)
            # This creates a 4D tensor suitable for video processing models
            return torch.stack(frames)
            
        except Exception as e:
            # Step 11: Comprehensive error handling
            # Catch any unexpected errors (codec issues, corrupted files, etc.)
            print(f"Error loading video {video_path}: {e}")
            return None
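The frame-sampling formula used in `_load_video_cv2` can be checked in isolation, without opening a video. The sketch below reproduces it as a standalone function and adds a midpoint-based variant for comparison. The variant is an assumption introduced here, not something the notebook uses, and both function names are hypothetical.

```python
from typing import List


def uniform_frame_indices(total_frames: int, max_frames: int) -> List[int]:
    """Indices at the start of each of `max_frames` equal segments (as in the class above)."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    if total_frames <= max_frames:
        # Short video: use every available frame
        return list(range(total_frames))
    return [int(i * total_frames / max_frames) for i in range(max_frames)]


def centered_frame_indices(total_frames: int, max_frames: int) -> List[int]:
    """Variant (assumption): take the midpoint of each segment instead of its start."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    if total_frames <= max_frames:
        return list(range(total_frames))
    return [int((i + 0.5) * total_frames / max_frames) for i in range(max_frames)]


print(uniform_frame_indices(100, 8))   # [0, 12, 25, 37, 50, 62, 75, 87]
print(centered_frame_indices(100, 8))  # [6, 18, 31, 43, 56, 68, 81, 93]
print(uniform_frame_indices(5, 8))     # [0, 1, 2, 3, 4]
```

Start-of-segment sampling always includes frame 0 and never reaches the final frames of the clip. The midpoint variant shifts every sample by half a segment, which avoids over-representing the opening frame.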

In [9]:
def create_dataloader(dataset, batch_size=1):
    """Create dataloader with all dataset keys."""
    def collate_fn(batch):
        return {
            'text_ids': torch.stack([item['text_ids'] for item in batch]),
            'text_masks': torch.stack([item['text_mask'] for item in batch]),
            'captions': [item['caption'] for item in batch],
            'video_frames': torch.stack([item['video_frames'] for item in batch]),
            'video_ids': [item['video_id'] for item in batch],
            'videos': [item['video'] for item in batch],
            'sources': [item['source'] for item in batch],
            'categories': [item['category'] for item in batch],
            'urls': [item['url'] for item in batch],
            'start_times': [item['start_time'] for item in batch],
            'end_times': [item['end_time'] for item in batch],
            'ids': [item['id'] for item in batch]
        }
    
    return DataLoader(dataset, batch_size=batch_size, collate_fn=collate_fn)
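For the non-tensor fields, the `collate_fn` above turns a list of per-sample dictionaries into a dictionary of per-field lists. That step can be sketched generically in plain Python. The helper name `collate_metadata` is hypothetical, and unlike the notebook's function it keeps the field names unchanged rather than pluralizing them.

```python
from typing import Any, Dict, List, Sequence


def collate_metadata(batch: List[Dict[str, Any]], keys: Sequence[str]) -> Dict[str, List[Any]]:
    """Turn a list of per-sample dicts into a dict of per-field lists."""
    return {key: [item[key] for item in batch] for key in keys}


# Toy samples for illustration only
toy_batch = [
    {"video_id": "video1", "caption": "a man is singing", "category": 0},
    {"video_id": "video2", "caption": "a woman is cooking", "category": 3},
]
collated = collate_metadata(toy_batch, keys=["video_id", "caption"])
print(collated)
```

Tensor fields such as `text_ids` and `video_frames` still need `torch.stack`, because they must become a single batched tensor rather than a Python list.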

In [None]:
# Note:
# Some videos in the dataset are no longer available on YouTube, so their
# downloads will fail. If no video downloads successfully, rerun the notebook
# so that a different random subset of the dataset is loaded, or increase
# `num_videos` to improve the chances of finding an available video.

# Usage
downloaded_paths = download_videos(dataset, num_videos=2, output_dir="videos")

if downloaded_paths:
    multimodal_dataset = MultimodalDataset(dataset, video_dir="videos")
    dataloader = create_dataloader(multimodal_dataset, batch_size=2)
    
    # Test
    batch = next(iter(dataloader))
    print("Available keys:")
    for key in batch.keys():
        print(f"  - {key}")
    
    print(f"\nBatch size: {len(batch['captions'])}")
    print(f"Video shape: {batch['video_frames'].shape}")
    print(f"Text shape: {batch['text_ids'].shape}")
    print(f"Video frames range: [{batch['video_frames'].min():.3f}, {batch['video_frames'].max():.3f}]")
    
    # Display data for ALL videos in the batch
    for i in range(len(batch['captions'])):
        print(f"\n=== Video {i+1} ===")
        print(f"URL: {batch['urls'][i]}")
        print(f"Caption: {batch['captions'][i]}")
        print(f"Video ID: {batch['video_ids'][i]}")
        print(f"Source: {batch['sources'][i]}")
        print(f"Category: {batch['categories'][i]}")
        print(f"Start time: {batch['start_times'][i]}")
        print(f"End time: {batch['end_times'][i]}")
        print(f"Duration: {batch['end_times'][i] - batch['start_times'][i]:.1f}s")

---

### Conclusion

**Foundation for Production-Ready Multimodal AI Systems**

This comprehensive exploration of dataset loading and preprocessing pipelines provides the essential practical skills needed for real-world multimodal AI development:

1. **Production-Ready Data Handling**
   - Understanding how to work with large-scale, real-world multimodal datasets
   - Implementing robust data loading pipelines that handle edge cases and errors gracefully

2. **Multimodal-Specific Considerations**
   - Text preprocessing techniques optimized for multimodal alignment and integration
   - Cross-modal consistency preservation throughout the preprocessing pipeline
   - Augmentation strategies that enhance robustness while maintaining semantic coherence

3. **Engineering Best Practices**
   - Modular design patterns enabling easy extension and maintenance

---