# 03 - The Datasets Library: Efficient Data Handling for NLP

## Learning Objectives
By the end of this notebook, you will understand:
- What the Hugging Face datasets library provides
- How to load datasets from the Hub and local files
- Dataset processing and transformation techniques
- Working with large datasets efficiently (streaming)
- Creating custom datasets
- Integrating datasets with tokenizers and models

## Why Use the Datasets Library?

The Hugging Face datasets library provides:
- **Unified API** for accessing thousands of datasets
- **Efficient storage** with an Apache Arrow backend
- **Memory mapping** for large datasets
- **Streaming** for datasets too large for memory
- **Built-in preprocessing** and caching
- **Easy integration** with transformers and tokenizers

In [None]:
# Import necessary libraries
from datasets import (
    load_dataset, Dataset, DatasetDict,
    load_from_disk, concatenate_datasets
)
from transformers import AutoTokenizer
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import torch
import warnings
warnings.filterwarnings('ignore')

# Set up plotting
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Libraries imported successfully!")

## Part 1: Loading Datasets from the Hub

The Hugging Face Hub hosts thousands of datasets for various NLP tasks.

In [None]:
# Load a popular sentiment analysis dataset
print("Loading IMDB dataset...")
imdb_dataset = load_dataset("imdb")

print(f"Dataset type: {type(imdb_dataset)}")
print(f"Dataset splits: {list(imdb_dataset.keys())}")
print(f"Train set size: {len(imdb_dataset['train']):,}")
print(f"Test set size: {len(imdb_dataset['test']):,}")

# Examine the structure
print(f"\nDataset features: {imdb_dataset['train'].features}")
print(f"Column names: {imdb_dataset['train'].column_names}")

# Look at a few examples
print("\nFirst example:")
first_example = imdb_dataset['train'][0]
print(f"Text (first 200 chars): {first_example['text'][:200]}...")
print(f"Label: {first_example['label']} ({'positive' if first_example['label'] == 1 else 'negative'})")

In [None]:
# Load other popular datasets
datasets_info = {
    "SQuAD (Question Answering)": {"name": "squad", "config": None},
    "CNN/DailyMail (Summarization)": {"name": "cnn_dailymail", "config": "3.0.0"},
    "CoLA (Grammar)": {"name": "glue", "config": "cola"},
    "AG News (Classification)": {"name": "ag_news", "config": None}
}

loaded_datasets = {}

for dataset_desc, info in datasets_info.items():
    try:
        print(f"\nLoading {dataset_desc}...")
        
        if info["config"]:
            dataset = load_dataset(info["name"], info["config"])
        else:
            dataset = load_dataset(info["name"])
        
        loaded_datasets[dataset_desc] = dataset
        
        print(f"  Splits: {list(dataset.keys())}")
        print(f"  Features: {dataset[list(dataset.keys())[0]].features}")
        print(f"  Size: {len(dataset[list(dataset.keys())[0]]):,} examples")
        
    except Exception as e:
        print(f"  Error loading {dataset_desc}: {e}")

## Part 2: Dataset Exploration and Analysis

In [None]:
# Analyze the IMDB dataset in detail
train_dataset = imdb_dataset['train']

print("IMDB Dataset Analysis:")
print("=" * 30)

# Label distribution
labels = train_dataset['label']
label_counts = Counter(labels)

print(f"Label distribution: {dict(label_counts)}")
print(f"Class balance: {label_counts[0] / len(labels):.2%} negative, {label_counts[1] / len(labels):.2%} positive")

# Text length analysis
text_lengths = [len(text.split()) for text in train_dataset[:1000]['text']]  # Slice rows first so only 1000 texts are read

print(f"\nText length statistics (first 1000 examples):")
print(f"  Mean: {np.mean(text_lengths):.1f} words")
print(f"  Median: {np.median(text_lengths):.1f} words")
print(f"  Min: {min(text_lengths)} words")
print(f"  Max: {max(text_lengths)} words")
print(f"  Standard deviation: {np.std(text_lengths):.1f} words")

# Visualize distributions
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Label distribution
ax1.bar(['Negative', 'Positive'], [label_counts[0], label_counts[1]])
ax1.set_title('Label Distribution (IMDB)')
ax1.set_ylabel('Count')

# Text length distribution
ax2.hist(text_lengths, bins=30, alpha=0.7, edgecolor='black')
ax2.set_title('Text Length Distribution (Words)')
ax2.set_xlabel('Number of Words')
ax2.set_ylabel('Frequency')
ax2.axvline(np.mean(text_lengths), color='red', linestyle='--', label=f'Mean: {np.mean(text_lengths):.1f}')
ax2.legend()

plt.tight_layout()
plt.show()

## Part 3: Dataset Processing and Transformation

The datasets library provides `select()`, `filter()`, `sort()`, and `map()` for preprocessing.

In [None]:
# Basic dataset operations
print("Basic Dataset Operations:")
print("=" * 25)

# Select a subset
small_dataset = train_dataset.select(range(1000))
print(f"Selected subset size: {len(small_dataset)}")

# Filter examples
positive_reviews = train_dataset.filter(lambda example: example['label'] == 1)
print(f"Positive reviews: {len(positive_reviews)}")

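# Sort examples: Dataset.sort() orders by an existing column and takes no key function,
# so compute the word count as a column first, then sort on it
def text_length(example):
    return {'word_count': len(example['text'].split())}

sorted_by_length = small_dataset.map(text_length).sort('word_count')
shortest = sorted_by_length[0]
longest = sorted_by_length[-1]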

print(f"\nShortest review ({len(shortest['text'].split())} words): {shortest['text'][:100]}...")
print(f"Longest review ({len(longest['text'].split())} words): {longest['text'][:100]}...")

In [None]:
# Advanced transformations with map()
def add_text_stats(examples):
    """Add text statistics to examples"""
    results = {
        'word_count': [],
        'char_count': [],
        'avg_word_length': []
    }
    
    for text in examples['text']:
        words = text.split()
        word_count = len(words)
        char_count = len(text)
        avg_word_len = np.mean([len(word) for word in words]) if words else 0
        
        results['word_count'].append(word_count)
        results['char_count'].append(char_count)
        results['avg_word_length'].append(avg_word_len)
    
    return results

# Apply transformation
print("Adding text statistics...")
enhanced_dataset = small_dataset.map(
    add_text_stats,
    batched=True,
    desc="Adding text statistics"
)

print(f"New features: {enhanced_dataset.features}")

# Show enhanced examples
example = enhanced_dataset[0]
print(f"\nExample with statistics:")
print(f"  Text: {example['text'][:100]}...")
print(f"  Label: {example['label']}")
print(f"  Word count: {example['word_count']}")
print(f"  Character count: {example['char_count']}")
print(f"  Average word length: {example['avg_word_length']:.2f}")

## Part 4: Tokenization Integration

The datasets library integrates with tokenizers through `map()`: a tokenization function is applied to the dataset in batches and the result is cached.
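
The tokenization function in this part sets `padding=False` and defers padding to batch time. The sketch below shows, in plain Python, what padding per batch means: each batch is padded only to its own longest sequence. The helper `pad_batch` and the token ids are hypothetical, and `pad_id=0` is an assumption; in practice `tokenizer.pad_token_id` would be used, and `DataCollatorWithPadding` from transformers performs this step with tensors.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest sequence in this batch only."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids standing in for three tokenized reviews.
batch = pad_batch([[101, 7592, 102], [101, 102], [101, 2023, 2003, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch avoids padding every example to the global maximum length, which matters when most reviews are much shorter than the longest one.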

In [None]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Define tokenization function
def tokenize_function(examples):
    """Tokenize text examples"""
    return tokenizer(
        examples['text'],
        truncation=True,
        padding=False,  # We'll pad later in batches
        max_length=512
    )

# Apply tokenization
print("Tokenizing dataset...")
tokenized_dataset = small_dataset.map(
    tokenize_function,
    batched=True,
    desc="Tokenizing",
    remove_columns=['text']  # Remove original text to save memory
)

print(f"Tokenized features: {tokenized_dataset.features}")

# Examine tokenized example
tokenized_example = tokenized_dataset[0]
print(f"\nTokenized example:")
print(f"  Input IDs length: {len(tokenized_example['input_ids'])}")
print(f"  Input IDs: {tokenized_example['input_ids'][:20]}...")
print(f"  Attention mask: {tokenized_example['attention_mask'][:20]}...")
print(f"  Label: {tokenized_example['label']}")

# Decode to verify
decoded_text = tokenizer.decode(tokenized_example['input_ids'], skip_special_tokens=True)
print(f"  Decoded text: {decoded_text[:100]}...")

### Analyzing Tokenized Data
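
One practical use of a token length distribution is choosing `max_length`. The sketch below, using hypothetical token counts rather than real IMDB measurements, computes the fraction of sequences that would fit untruncated for a few candidate limits. The helper name `coverage` is an assumption for illustration.

```python
def coverage(lengths, max_length):
    """Fraction of sequences that fit within max_length without truncation."""
    return sum(1 for n in lengths if n <= max_length) / len(lengths)

# Hypothetical token counts; in this notebook they would come from token_lengths.
sample_lengths = [40, 95, 120, 180, 210, 260, 300, 350, 480, 512]

for candidate in (128, 256, 512):
    share = coverage(sample_lengths, candidate)
    print(f"max_length={candidate}: {share:.0%} fit untruncated")
```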

In [None]:
# Analyze token length distribution
token_lengths = [len(example['input_ids']) for example in tokenized_dataset]

print("Token Length Analysis:")
print(f"  Mean: {np.mean(token_lengths):.1f} tokens")
print(f"  Median: {np.median(token_lengths):.1f} tokens")
print(f"  Min: {min(token_lengths)} tokens")
print(f"  Max: {max(token_lengths)} tokens")
print(f"  95th percentile: {np.percentile(token_lengths, 95):.1f} tokens")

# Visualize token length distribution
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.hist(token_lengths, bins=30, alpha=0.7, edgecolor='black')
plt.axvline(np.mean(token_lengths), color='red', linestyle='--', label=f'Mean: {np.mean(token_lengths):.1f}')
plt.axvline(512, color='orange', linestyle='--', label='Max length: 512')
plt.title('Token Length Distribution')
plt.xlabel('Number of Tokens')
plt.ylabel('Frequency')
plt.legend()

plt.subplot(1, 2, 2)
# Cumulative distribution
sorted_lengths = np.sort(token_lengths)
cumulative = np.arange(1, len(sorted_lengths) + 1) / len(sorted_lengths)
plt.plot(sorted_lengths, cumulative)
plt.axvline(np.percentile(token_lengths, 95), color='red', linestyle='--', 
           label=f'95th percentile: {np.percentile(token_lengths, 95):.0f}')
plt.title('Cumulative Distribution of Token Lengths')
plt.xlabel('Number of Tokens')
plt.ylabel('Cumulative Proportion')
plt.legend()

plt.tight_layout()
plt.show()

# Show examples of different lengths
short_examples = [i for i, length in enumerate(token_lengths) if length < 50]
long_examples = [i for i, length in enumerate(token_lengths) if length > 400]

print(f"\nExamples:")
print(f"Short examples (< 50 tokens): {len(short_examples)}")
print(f"Long examples (> 400 tokens): {len(long_examples)}")

if short_examples:
    idx = short_examples[0]
    print(f"\nShort example ({token_lengths[idx]} tokens):")
    print(tokenizer.decode(tokenized_dataset[idx]['input_ids'], skip_special_tokens=True))

if long_examples:
    idx = long_examples[0]
    print(f"\nLong example ({token_lengths[idx]} tokens):")
    print(tokenizer.decode(tokenized_dataset[idx]['input_ids'], skip_special_tokens=True)[:200] + "...")

## Part 5: Working with Large Datasets - Streaming

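For datasets too large to fit in memory or on disk, use streaming: with `streaming=True`, `load_dataset` returns an iterable dataset that yields examples one at a time instead of downloading everything first.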

In [None]:
# Load dataset in streaming mode
print("Loading large dataset in streaming mode...")

try:
    # Load a large dataset that would be impractical to load entirely
    streaming_dataset = load_dataset("oscar", "unshuffled_deduplicated_en", streaming=True, split="train")
    
    print(f"Streaming dataset type: {type(streaming_dataset)}")
    
    # Process first few examples
    print("\nProcessing first few examples from stream:")
    for i, example in enumerate(streaming_dataset):
        if i >= 3:  # Only show first 3 examples
            break
        print(f"Example {i+1}: {example['text'][:100]}...")
    
    # Demonstrate streaming with transformations
    def has_reasonable_length(example):
        # Keep texts between 50 and 1000 characters after stripping whitespace
        return 50 <= len(example['text'].strip()) <= 1000
    
    def clean_and_tokenize(example):
        # map() must return a dict, so filtering is done in a separate step
        return dict(tokenizer(example['text'].strip(), truncation=True, max_length=256))
    
    # Filter first so that only the kept examples are tokenized
    processed_stream = streaming_dataset.filter(has_reasonable_length).map(clean_and_tokenize)
    
    print("\nProcessed streaming examples:")
    count = 0
    for example in processed_stream:
        if count >= 2:  # Show 2 processed examples
            break
        if example is not None:
            print(f"Processed example {count+1}: {len(example.get('input_ids', []))} tokens")
            count += 1
            
except Exception as e:
    print(f"Streaming example failed (might need internet/permissions): {e}")
    print("This is normal in some environments - streaming works in practice!")

### Streaming Best Practices Demo

In [None]:
# Demonstrate streaming best practices with a smaller dataset
def streaming_processing_demo():
    """Demo of efficient streaming processing"""
    
    # Load IMDB in streaming mode for demo
    streaming_imdb = load_dataset("imdb", streaming=True, split="train")
    
    # Process in chunks
    def process_batch(examples, batch_size=32):
        """Process examples in batches for efficiency"""
        batch = []
        
        for example in examples:
            batch.append(example)
            
            if len(batch) >= batch_size:
                # Process batch
                texts = [ex['text'] for ex in batch]
                tokenized = tokenizer(texts, truncation=True, padding=True, max_length=256)
                
                # Yield processed batch
                for i, ex in enumerate(batch):
                    yield {
                        'input_ids': tokenized['input_ids'][i],
                        'attention_mask': tokenized['attention_mask'][i],
                        'label': ex['label']
                    }
                
                batch = []
        
        # Process remaining examples
        if batch:
            texts = [ex['text'] for ex in batch]
            tokenized = tokenizer(texts, truncation=True, padding=True, max_length=256)
            
            for i, ex in enumerate(batch):
                yield {
                    'input_ids': tokenized['input_ids'][i],
                    'attention_mask': tokenized['attention_mask'][i],
                    'label': ex['label']
                }
    
    print("Streaming processing demo:")
    print("Processing first 100 examples in batches...")
    
    # Take first 100 examples and process them
    limited_stream = streaming_imdb.take(100)
    processed_count = 0
    
    for processed_example in process_batch(limited_stream, batch_size=16):
        processed_count += 1
        if processed_count <= 3:  # Show first 3
            print(f"  Example {processed_count}: {len(processed_example['input_ids'])} tokens, label: {processed_example['label']}")
    
    print(f"Total processed: {processed_count} examples")

streaming_processing_demo()

## Part 6: Creating Custom Datasets

You can create datasets from your own data: Python dictionaries, pandas DataFrames, and CSV or JSON files.

In [None]:
# Create dataset from Python data
print("Creating custom dataset from Python data:")

# Sample data
custom_data = {
    'text': [
        "I love this product! It's amazing.",
        "This is terrible, worst purchase ever.",
        "Pretty good, would recommend to others.",
        "Not bad, but could be better.",
        "Absolutely fantastic! Five stars!",
        "Waste of money, very disappointed."
    ],
    'label': [1, 0, 1, 0, 1, 0],  # 1: positive, 0: negative
    'rating': [5, 1, 4, 2, 5, 1]
}

# Create dataset
custom_dataset = Dataset.from_dict(custom_data)

print(f"Custom dataset size: {len(custom_dataset)}")
print(f"Features: {custom_dataset.features}")

# Show examples
for i, example in enumerate(custom_dataset):
    print(f"Example {i+1}: '{example['text']}' -> Label: {example['label']}, Rating: {example['rating']}")

# Split into train/test
split_dataset = custom_dataset.train_test_split(test_size=0.33, seed=42)
print(f"\nAfter split:")
print(f"  Train size: {len(split_dataset['train'])}")
print(f"  Test size: {len(split_dataset['test'])}")

In [None]:
# Create dataset from pandas DataFrame
print("\nCreating dataset from pandas DataFrame:")

# Create DataFrame
df_data = pd.DataFrame({
    'review': [
        "Great product, highly recommended!",
        "Poor quality, broke after one day.",
        "Excellent value for money.",
        "Not worth the price.",
        "Amazing quality and fast shipping!"
    ],
    'sentiment': ['positive', 'negative', 'positive', 'negative', 'positive'],
    'score': [0.9, 0.1, 0.8, 0.2, 0.95]
})

print("DataFrame:")
print(df_data)

# Convert to dataset
df_dataset = Dataset.from_pandas(df_data)

print(f"\nDataset from DataFrame:")
print(f"  Size: {len(df_dataset)}")
print(f"  Features: {df_dataset.features}")
print(f"  First example: {df_dataset[0]}")

In [None]:
# Create dataset from CSV/JSON files (demo with temporary files)
import tempfile
import json
import os

print("Creating datasets from files:")

# Create temporary CSV file
with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as f:
    csv_content = """text,label
"This is a positive example",1
"This is a negative example",0
"Another positive case",1
"Another negative case",0"""
    f.write(csv_content)
    csv_file = f.name

# Load from CSV
csv_dataset = load_dataset('csv', data_files=csv_file)
print(f"CSV dataset: {len(csv_dataset['train'])} examples")
print(f"Features: {csv_dataset['train'].features}")

# Create temporary JSON file
with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f:
    json_data = [
        {"text": "JSON example 1", "label": "positive"},
        {"text": "JSON example 2", "label": "negative"},
        {"text": "JSON example 3", "label": "positive"}
    ]
    for item in json_data:
        f.write(json.dumps(item) + '\n')
    json_file = f.name

# Load from JSON
json_dataset = load_dataset('json', data_files=json_file)
print(f"JSON dataset: {len(json_dataset['train'])} examples")
print(f"Features: {json_dataset['train'].features}")

# Clean up temporary files
os.unlink(csv_file)
os.unlink(json_file)

print("\nExample from JSON dataset:")
print(json_dataset['train'][0])

## Part 7: Dataset Caching and Saving

In [None]:
# Demonstrate caching and saving
import os
import tempfile
import time

print("Dataset caching and saving:")

# Save processed dataset to disk
with tempfile.TemporaryDirectory() as temp_dir:
    save_path = os.path.join(temp_dir, "processed_imdb")
    
    # Save tokenized dataset
    print(f"Saving dataset to {save_path}...")
    tokenized_dataset.save_to_disk(save_path)
    
    # Check saved files
    saved_files = os.listdir(save_path)
    print(f"Saved files: {saved_files}")
    
    # Load from disk
    print("Loading dataset from disk...")
    loaded_dataset = load_from_disk(save_path)
    
    print(f"Loaded dataset size: {len(loaded_dataset)}")
    print(f"Features match: {loaded_dataset.features == tokenized_dataset.features}")
    print(f"First example matches: {loaded_dataset[0] == tokenized_dataset[0]}")

# Demonstrate caching behavior
print("\nCaching behavior:")

def expensive_operation(example):
    """Simulate an expensive operation"""
    import time
    time.sleep(0.01)  # Simulate processing time
    return {'processed_text': example['text'].upper()}

# First run - will be slow
start_time = time.time()
processed_once = small_dataset.map(expensive_operation, cache_file_name="expensive_cache.arrow")
first_run_time = time.time() - start_time

# Second run - should use cache and be fast
start_time = time.time()
processed_twice = small_dataset.map(expensive_operation, cache_file_name="expensive_cache.arrow")
second_run_time = time.time() - start_time

print(f"First run time: {first_run_time:.3f} seconds")
print(f"Second run time: {second_run_time:.3f} seconds")
print(f"Speedup from caching: {first_run_time/second_run_time:.1f}x")

## Part 8: Advanced Dataset Operations

In [None]:
# Concatenating datasets
print("Advanced dataset operations:")
print("=" * 30)

# Create two small datasets
dataset1 = Dataset.from_dict({
    'text': ['Example 1', 'Example 2'],
    'label': [1, 0]
})

dataset2 = Dataset.from_dict({
    'text': ['Example 3', 'Example 4'],
    'label': [0, 1]
})

# Concatenate datasets
combined_dataset = concatenate_datasets([dataset1, dataset2])
print(f"Combined dataset size: {len(combined_dataset)}")
print(f"Combined examples: {[ex['text'] for ex in combined_dataset]}")

# Interleaving datasets
from datasets import interleave_datasets

interleaved = interleave_datasets([dataset1, dataset2])
print(f"Interleaved examples: {[ex['text'] for ex in interleaved]}")

# Dataset dictionary operations
dataset_dict = DatasetDict({
    'train': dataset1,
    'test': dataset2
})

print(f"\nDatasetDict splits: {list(dataset_dict.keys())}")
print(f"Train size: {len(dataset_dict['train'])}")
print(f"Test size: {len(dataset_dict['test'])}")

# Apply operations to all splits
def add_length(example):
    return {'text_length': len(example['text'])}

dataset_dict_with_length = dataset_dict.map(add_length)
print(f"After adding length feature:")
print(f"Train features: {dataset_dict_with_length['train'].features}")
print(f"Example: {dataset_dict_with_length['train'][0]}")

## Part 9: Dataset Performance Tips

In [None]:
# Performance comparison
import time

def performance_tips_demo():
    """Demonstrate performance optimization techniques"""
    
    print("Performance optimization tips:")
    print("=" * 35)
    
    # Create a larger sample for meaningful timing
    large_sample = train_dataset.select(range(5000))
    
    def simple_transform(example):
        return {'word_count': len(example['text'].split())}
    
    def batch_transform(examples):
        return {'word_count': [len(text.split()) for text in examples['text']]}
    
    # Single example processing
    print("1. Single vs Batch processing:")
    start_time = time.time()
    single_result = large_sample.map(simple_transform)
    single_time = time.time() - start_time
    
    # Batch processing
    start_time = time.time()
    batch_result = large_sample.map(batch_transform, batched=True, batch_size=1000)
    batch_time = time.time() - start_time
    
    print(f"  Single processing: {single_time:.3f} seconds")
    print(f"  Batch processing: {batch_time:.3f} seconds")
    print(f"  Speedup: {single_time/batch_time:.1f}x")
    
    # Multiprocessing
    print("\n2. Multiprocessing:")
    start_time = time.time()
    mp_result = large_sample.map(
        batch_transform, 
        batched=True, 
        batch_size=1000,
        num_proc=2  # Use 2 processes
    )
    mp_time = time.time() - start_time
    
    print(f"  Multiprocessing time: {mp_time:.3f} seconds")
    print(f"  Speedup vs single: {single_time/mp_time:.1f}x")
    
    # Memory usage tips
    print("\n3. Memory optimization:")
    
    # Remove unused columns
    memory_optimized = large_sample.map(
        lambda x: tokenizer(x['text'], truncation=True, max_length=128),
        batched=True,
        remove_columns=['text'],  # Remove original text to save memory
        desc="Tokenizing and removing text"
    )
    
    original_features = large_sample.features
    optimized_features = memory_optimized.features
    
    print(f"  Original features: {list(original_features.keys())}")
    print(f"  Optimized features: {list(optimized_features.keys())}")
    print(f"  Memory saved by removing text column")

performance_tips_demo()

## Part 10: Real-world Dataset Pipeline

In [None]:
# Complete pipeline for preparing data for training
def create_training_pipeline(dataset_name, model_name, max_length=128, test_size=0.2):
    """Complete pipeline for dataset preparation"""
    
    print(f"Creating training pipeline for {dataset_name} with {model_name}")
    print("=" * 60)
    
    # Step 1: Load dataset
    print("Step 1: Loading dataset...")
    if dataset_name == "imdb":
        dataset = load_dataset("imdb")
        text_column = "text"
        label_column = "label"
    else:
        raise ValueError(f"Unsupported dataset: {dataset_name}")
    
    print(f"  Loaded {len(dataset['train'])} training examples")
    
    # Step 2: Load tokenizer
    print("Step 2: Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    # Step 3: Analyze data
    print("Step 3: Analyzing data...")
    sample_texts = [ex[text_column] for ex in dataset['train'].select(range(1000))]
    token_lengths = [len(tokenizer.encode(text)) for text in sample_texts]
    
    print(f"  Token length statistics:")
    print(f"    Mean: {np.mean(token_lengths):.1f}")
    print(f"    95th percentile: {np.percentile(token_lengths, 95):.1f}")
    print(f"    Max length setting: {max_length}")
    
    # Step 4: Create preprocessing function
    def preprocess_function(examples):
        return tokenizer(
            examples[text_column],
            truncation=True,
            padding=False,  # Dynamic padding during training
            max_length=max_length
        )
    
    # Step 4 (continued): Apply preprocessing
    print("Step 4: Preprocessing dataset...")
    tokenized_dataset = dataset.map(
        preprocess_function,
        batched=True,
        num_proc=2,
        remove_columns=[text_column],  # Remove text to save memory
        desc="Tokenizing"
    )
    
    # Step 5: Prepare for training
    print("Step 5: Preparing for training...")
    tokenized_dataset = tokenized_dataset.rename_column(label_column, "labels")
    tokenized_dataset.set_format("torch")
    
    # Step 6: Split if needed
    if "validation" not in tokenized_dataset:
        print(f"Step 6: Creating validation split ({test_size:.1%})...")
        train_val = tokenized_dataset['train'].train_test_split(
            test_size=test_size, 
            seed=42
        )
        tokenized_dataset['train'] = train_val['train']
        tokenized_dataset['validation'] = train_val['test']
    
    print(f"\nFinal dataset:")
    for split, data in tokenized_dataset.items():
        print(f"  {split}: {len(data)} examples")
    
    print(f"  Features: {tokenized_dataset['train'].features}")
    
    return tokenized_dataset

# Create a complete pipeline
prepared_dataset = create_training_pipeline(
    dataset_name="imdb",
    model_name="distilbert-base-uncased",
    max_length=256,
    test_size=0.1
)

# Show example
print(f"\nExample from prepared dataset:")
example = prepared_dataset['train'][0]
for key, value in example.items():
    if torch.is_tensor(value):
        print(f"  {key}: tensor of shape {value.shape}")
    else:
        print(f"  {key}: {value}")

## Summary

This notebook covered the main features of the Hugging Face datasets library:

1. **Loading Datasets**: From the Hub, local files, and custom data
2. **Dataset Exploration**: Understanding structure, features, and distributions
3. **Data Processing**: Transformations, filtering, and mapping operations
4. **Tokenization Integration**: Seamless integration with transformers tokenizers
5. **Streaming**: Handling large datasets that don't fit in memory
6. **Custom Datasets**: Creating datasets from various data sources
7. **Caching and Saving**: Efficient storage and retrieval of processed data
8. **Advanced Operations**: Concatenation, interleaving, and dataset dictionaries
9. **Performance Optimization**: Batching, multiprocessing, and memory management
10. **Production Pipeline**: Complete workflow from raw data to training-ready datasets

## Key Takeaways

- **The datasets library is memory-efficient** thanks to its Apache Arrow backend and memory mapping
- **Streaming is essential** for large datasets that don't fit in memory
- **Batch processing** is usually faster than per-example processing because it makes fewer function calls
- **Caching prevents recomputation** of expensive operations
- **Integration with tokenizers** lets you tokenize a whole dataset with a single `map()` call
- **Multiprocessing (`num_proc`) and batching** can further reduce preprocessing time

## Next Steps

- **Notebook 04**: Mini-project combining concepts from notebooks 01-03
- **Notebook 05**: Fine-tuning models with the Trainer API

Understanding the datasets library is crucial for efficient NLP workflows. The concepts learned here will be essential for handling real-world data processing challenges!