[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vuhung16au/hf-transformer-trove/blob/main/examples/basic5.2/working-with-datasets.ipynb)
[![Open with SageMaker](https://img.shields.io/badge/Open%20with-SageMaker-orange?logo=amazonaws)](https://studiolab.sagemaker.aws/import/github/vuhung16au/hf-transformer-trove/blob/main/examples/basic5.2/working-with-datasets.ipynb)
[![View on GitHub](https://img.shields.io/badge/View_on-GitHub-blue?logo=github)](https://github.com/vuhung16au/hf-transformer-trove/blob/main/examples/basic5.2/working-with-datasets.ipynb)

# Working with HF Datasets & External Datasets

## 🎯 Learning Objectives
By the end of this notebook, you will understand:
- How to load datasets from the Hugging Face Hub
- How to load datasets from various file formats (Parquet, CSV, JSON, Pickle)
- How to work with datasets that aren't on the Hub
- Dataset processing and transformation techniques
- Best practices for dataset handling in NLP projects

## 📋 Prerequisites
- Basic understanding of machine learning concepts
- Familiarity with Python and pandas
- Knowledge of NLP fundamentals (refer to [NLP Learning Journey](https://github.com/vuhung16au/nlp-learning-journey))

## 📚 What We'll Cover
1. **HF Hub Datasets**: Loading datasets from Hugging Face Hub
2. **Apache Parquet Format**: Loading efficient columnar data
3. **Pickled DataFrames**: Loading Python pickle files
4. **JSON & JSON Lines**: Loading JSON-based datasets
5. **CSV & TSV Files**: Loading comma/tab-separated data
6. **Dataset Processing**: Transforming and preparing data
7. **Best Practices**: Tips for efficient dataset handling

## 💡 Reference
This notebook is based on the [HuggingFace Course Chapter 5.2](https://huggingface.co/learn/llm-course/chapter5/2?fw=pt) about working with datasets.

## Part 1: Environment Setup

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
import pickle
import os
import tempfile
from pathlib import Path
from typing import Dict, List, Optional

# HuggingFace libraries
from datasets import (
    load_dataset, 
    Dataset, 
    DatasetDict,
    load_from_disk,
    concatenate_datasets
)
from transformers import AutoTokenizer

# Machine learning libraries
import torch

# Set random seeds for reproducibility (repository standard: seed=16)
import random
random.seed(16)
np.random.seed(16)
torch.manual_seed(16)
if torch.cuda.is_available():
    torch.cuda.manual_seed(16)
    torch.cuda.manual_seed_all(16)

print("🔢 Random seed set to 16 for reproducibility")

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")

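# Device detection (includes TPU support for Google Colab)
# Colab and torch_xla are checked separately so that a Colab runtime
# without torch_xla installed is still recognized as Colab.
try:
    import google.colab  # noqa: F401
    COLAB_AVAILABLE = True
except ImportError:
    COLAB_AVAILABLE = False

try:
    import torch_xla.core.xla_model as xm
    TPU_AVAILABLE = True
except ImportError:
    TPU_AVAILABLE = False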

def get_device() -> torch.device:
    """
    Get the best available device for PyTorch operations.
    Priority in Colab: TPU > GPU > CPU
    Priority elsewhere: GPU > MPS > CPU
    """
    # Google Colab: Always prefer TPU when available
    if COLAB_AVAILABLE and TPU_AVAILABLE:
        try:
            device = xm.xla_device()
            print("🔥 Using Google Colab TPU for optimal performance")
            return device
        except Exception as e:
            print(f"⚠️ TPU initialization failed: {e}")
    
    # Standard device detection
    if torch.cuda.is_available():
        device = torch.device("cuda")
        print(f"🚀 Using CUDA GPU: {torch.cuda.get_device_name()}")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
        print("🍎 Using Apple MPS (Apple Silicon)")
    else:
        device = torch.device("cpu")
        print("💻 Using CPU")
    
    return device

device = get_device()

print(f"\n✅ Setup complete!")
print(f"📦 PyTorch version: {torch.__version__}")
print(f"🔧 Device: {device}")

## Part 2: Loading Datasets from Hugging Face Hub

The Hugging Face Hub hosts thousands of datasets that can be loaded with a single line of code.
This notebook focuses on hate speech detection, so the examples load the `tdavidson/hate_speech_offensive` dataset.

In [None]:
# Load a hate speech detection dataset from HF Hub
# Using the preferred dataset: tdavidson/hate_speech_offensive
print("🛡️ Loading hate speech detection dataset from HF Hub...")
print("📥 Dataset: tdavidson/hate_speech_offensive")

try:
    # Load the full dataset
    hf_dataset = load_dataset("tdavidson/hate_speech_offensive")
    
    print(f"\n✅ Dataset loaded successfully!")
    print(f"📊 Dataset type: {type(hf_dataset)}")
    print(f"📋 Available splits: {list(hf_dataset.keys())}")
    
    # Examine the dataset structure
    train_data = hf_dataset['train']
    print(f"\n🔍 Dataset Information:")
    print(f"  Total examples: {len(train_data):,}")
    print(f"  Features: {list(train_data.features.keys())}")
    print(f"  Feature types: {train_data.features}")
    
    # Show a sample example
    print(f"\n📝 Sample example:")
    sample = train_data[0]
    for key, value in sample.items():
        if isinstance(value, str) and len(value) > 100:
            print(f"  {key}: {value[:100]}...")
        else:
            print(f"  {key}: {value}")
    
except Exception as e:
    print(f"❌ Error loading dataset: {e}")
    print("💡 Try checking your internet connection or using a different dataset")

### Loading Specific Splits or Subsets

You can load specific splits or create custom subsets:

In [None]:
# Load only a specific split
print("📥 Loading specific split...")
train_only = load_dataset("tdavidson/hate_speech_offensive", split="train")
print(f"✅ Loaded train split: {len(train_only):,} examples")

# Load a subset using slicing
print("\n📥 Loading a subset (first 1000 examples)...")
subset = load_dataset("tdavidson/hate_speech_offensive", split="train[:1000]")
print(f"✅ Loaded subset: {len(subset):,} examples")

# Load a percentage of the data
print("\n📥 Loading 10% of the training data...")
percentage_subset = load_dataset("tdavidson/hate_speech_offensive", split="train[:10%]")
print(f"✅ Loaded 10% subset: {len(percentage_subset):,} examples")

## Part 3: What if My Dataset Isn't on the Hub?

Let's explore how to load datasets from various file formats when they're not available on the Hub.

### 3.1 Creating Sample Data for Demonstration

First, let's create sample datasets in different formats to demonstrate loading:

In [None]:
# Create a temporary directory for our example files
temp_dir = tempfile.mkdtemp()
print(f"📁 Created temporary directory: {temp_dir}")

# Create sample hate speech detection data
sample_data = [
    {"text": "This is a normal tweet about the weather today", "label": 2, "class_name": "neither"},
    {"text": "Stop spreading lies and misinformation!", "label": 1, "class_name": "offensive"},
    {"text": "Beautiful day at the park with my family", "label": 2, "class_name": "neither"},
    {"text": "Can't believe how terrible some people can be", "label": 1, "class_name": "offensive"},
    {"text": "Looking forward to the weekend!", "label": 2, "class_name": "neither"},
    {"text": "This behavior is completely unacceptable", "label": 1, "class_name": "offensive"},
    {"text": "Just finished a great book, highly recommend it", "label": 2, "class_name": "neither"},
    {"text": "Why do people keep doing such awful things?", "label": 1, "class_name": "offensive"},
]

# Convert to pandas DataFrame for easy manipulation
df = pd.DataFrame(sample_data)
print("\n📊 Sample dataset created:")
print(df.head())

### 3.2 Loading from Apache Parquet Format

Apache Parquet is a columnar storage format that's highly efficient for large datasets.
The `datasets` library reads it directly through the `parquet` loader, although the library's own on-disk format is Apache Arrow (see Part 7).

In [None]:
# Save as Parquet
parquet_file = os.path.join(temp_dir, "dataset.parquet")
df.to_parquet(parquet_file, index=False)
print(f"💾 Saved dataset as Parquet: {parquet_file}")
print(f"📏 File size: {os.path.getsize(parquet_file) / 1024:.2f} KB")

# Load from Parquet using HF datasets
print("\n📥 Loading from Parquet format...")
parquet_dataset = load_dataset('parquet', data_files=parquet_file)

print(f"\n✅ Parquet dataset loaded successfully!")
print(f"📋 Available splits: {list(parquet_dataset.keys())}")
print(f"📊 Number of examples: {len(parquet_dataset['train']):,}")
print(f"🔧 Features: {list(parquet_dataset['train'].features.keys())}")

# Show sample
print(f"\n📝 Sample from Parquet:")
print(parquet_dataset['train'][0])

print("\n💡 Pro Tip: Parquet is highly efficient for:")
print("   - Large datasets (GB-scale)")
print("   - Columnar operations")
print("   - Cloud storage (S3, GCS, etc.)")
print("   - Preserving data types")

### 3.3 Loading from Pickled DataFrames

Pickle is a Python-specific serialization format. It is quick to use but not portable across languages.

In [None]:
# Save as Pickle
pickle_file = os.path.join(temp_dir, "dataset.pkl")
df.to_pickle(pickle_file)
print(f"💾 Saved dataset as Pickle: {pickle_file}")
print(f"📏 File size: {os.path.getsize(pickle_file) / 1024:.2f} KB")

# Load from Pickle
print("\n📥 Loading from Pickle format...")
loaded_df = pd.read_pickle(pickle_file)

# Convert pandas DataFrame to HF Dataset
pickle_dataset = Dataset.from_pandas(loaded_df)

print(f"\n✅ Pickle dataset loaded and converted successfully!")
print(f"📊 Number of examples: {len(pickle_dataset):,}")
print(f"🔧 Features: {list(pickle_dataset.features.keys())}")

# Show sample
print(f"\n📝 Sample from Pickle:")
print(pickle_dataset[0])

print("\n⚠️ Warning about Pickle:")
print("   - Only works with Python")
print("   - Security risk: don't load untrusted pickle files")
print("   - Not recommended for production or sharing")
print("   - Consider using Parquet instead")
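
The security warning follows from how pickle works: a pickle can name any importable callable and its arguments, and `pickle.loads()` will call it. The sketch below shows this with a deliberately harmless callable (`sorted`); the class name is hypothetical.

```python
import pickle


class CallsOnLoad:
    """Hypothetical class whose pickle triggers a function call when loaded."""

    def __reduce__(self):
        # pickle records (callable, args); loading invokes callable(*args)
        return (sorted, ([3, 1, 2],))


payload = pickle.dumps(CallsOnLoad())
loaded = pickle.loads(payload)

# The loaded object is whatever the callable returned, not a CallsOnLoad instance
print(type(loaded).__name__, loaded)
```

An untrusted file could name a far less benign callable, which is why pickle files should only be loaded from sources you control.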

### 3.4 Loading from JSON & JSON Lines

JSON (JavaScript Object Notation) is a popular, human-readable format.
JSON Lines (.jsonl) stores one JSON object per line, making it efficient for streaming.

In [None]:
# Save as regular JSON
json_file = os.path.join(temp_dir, "dataset.json")
df.to_json(json_file, orient='records', indent=2)
print(f"💾 Saved dataset as JSON: {json_file}")
print(f"📏 File size: {os.path.getsize(json_file) / 1024:.2f} KB")

# Save as JSON Lines (one JSON object per line)
jsonl_file = os.path.join(temp_dir, "dataset.jsonl")
df.to_json(jsonl_file, orient='records', lines=True)
print(f"\n💾 Saved dataset as JSON Lines: {jsonl_file}")
print(f"📏 File size: {os.path.getsize(jsonl_file) / 1024:.2f} KB")

# Show the difference between formats
print("\n📄 Regular JSON format (first 300 chars):")
with open(json_file, 'r') as f:
    print(f.read()[:300] + "...")

print("\n📄 JSON Lines format (first 2 lines):")
with open(jsonl_file, 'r') as f:
    for i, line in enumerate(f):
        if i < 2:
            print(line.strip())

In [None]:
# Load from JSON using HF datasets
print("📥 Loading from JSON format...")
json_dataset = load_dataset('json', data_files=json_file)

print(f"\n✅ JSON dataset loaded successfully!")
print(f"📊 Number of examples: {len(json_dataset['train']):,}")
print(f"🔧 Features: {list(json_dataset['train'].features.keys())}")

# Load from JSON Lines
print("\n📥 Loading from JSON Lines format...")
jsonl_dataset = load_dataset('json', data_files=jsonl_file)

print(f"\n✅ JSON Lines dataset loaded successfully!")
print(f"📊 Number of examples: {len(jsonl_dataset['train']):,}")

# Show sample
print(f"\n📝 Sample from JSON Lines:")
print(jsonl_dataset['train'][0])

print("\n💡 When to use JSON vs JSON Lines:")
print("   JSON: Small datasets, human readability, web APIs")
print("   JSON Lines: Large datasets, streaming, append operations")
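
The "append operations" advantage can be illustrated with the standard library alone. This is a small sketch with made-up records and a hypothetical file name: new records are appended without touching existing lines, and the file is parsed one line at a time.

```python
import json
import os
import tempfile

initial_records = [
    {"text": "Looking forward to the weekend!", "label": 2},
    {"text": "This behavior is completely unacceptable", "label": 1},
]

with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "log.jsonl")

    # Write one JSON object per line
    with open(log_path, "w", encoding="utf-8") as f:
        for record in initial_records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

    # Append a record without reading or rewriting the existing lines
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"text": "Beautiful day at the park", "label": 2}) + "\n")

    # Each line is parsed independently of the others
    with open(log_path, encoding="utf-8") as f:
        parsed_records = [json.loads(line) for line in f if line.strip()]

print(len(parsed_records), parsed_records[-1])
```

A regular JSON array would need to be parsed and rewritten as a whole to add the third record.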

### 3.5 Loading from CSV & TSV Files

CSV (Comma-Separated Values) and TSV (Tab-Separated Values) are simple, widely-supported formats.

In [None]:
# Save as CSV
csv_file = os.path.join(temp_dir, "dataset.csv")
df.to_csv(csv_file, index=False)
print(f"💾 Saved dataset as CSV: {csv_file}")
print(f"📏 File size: {os.path.getsize(csv_file) / 1024:.2f} KB")

# Save as TSV (Tab-Separated Values)
tsv_file = os.path.join(temp_dir, "dataset.tsv")
df.to_csv(tsv_file, index=False, sep='\t')
print(f"\n💾 Saved dataset as TSV: {tsv_file}")
print(f"📏 File size: {os.path.getsize(tsv_file) / 1024:.2f} KB")

# Show the difference between formats
print("\n📄 CSV format (first 3 lines):")
with open(csv_file, 'r') as f:
    for i, line in enumerate(f):
        if i < 3:
            print(line.strip())

print("\n📄 TSV format (first 3 lines):")
with open(tsv_file, 'r') as f:
    for i, line in enumerate(f):
        if i < 3:
            print(line.strip())

In [None]:
# Load from CSV using HF datasets
print("📥 Loading from CSV format...")
csv_dataset = load_dataset('csv', data_files=csv_file)

print(f"\n✅ CSV dataset loaded successfully!")
print(f"📊 Number of examples: {len(csv_dataset['train']):,}")
print(f"🔧 Features: {list(csv_dataset['train'].features.keys())}")

# Load from TSV
print("\n📥 Loading from TSV format...")
tsv_dataset = load_dataset('csv', data_files=tsv_file, delimiter='\t')

print(f"\n✅ TSV dataset loaded successfully!")
print(f"📊 Number of examples: {len(tsv_dataset['train']):,}")

# Show sample
print(f"\n📝 Sample from CSV:")
print(csv_dataset['train'][0])

print("\n💡 CSV/TSV Best Practices:")
print("   - Simple and widely supported")
print("   - Good for spreadsheet compatibility")
print("   - Watch out for special characters in text")
print("   - Consider using quotes for text fields with commas/tabs")
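
As a quick check of the quoting advice, the following sketch writes text containing commas, tabs, double quotes, and a line break with Python's `csv` module and reads it back. The rows are invented, and labels are kept as strings because CSV itself carries no type information.

```python
import csv
import io

tricky_rows = [
    {"text": 'He said "stop", then left', "label": "1"},
    {"text": "Tabs\tand, commas\nand a line break", "label": "2"},
]

# newline="" leaves line endings untouched, as the csv module expects
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["text", "label"], quoting=csv.QUOTE_ALL)
writer.writeheader()
writer.writerows(tricky_rows)

# Read the same text back and compare with the input
buffer.seek(0)
round_tripped = list(csv.DictReader(buffer))
print(round_tripped == tricky_rows)
```

Splitting lines on commas by hand would break on both rows, so it is safer to let a CSV parser handle the quoting.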

## Part 4: Loading Multiple Files

Often datasets are split across multiple files. The datasets library makes it easy to load them all at once.

In [None]:
# Create multiple CSV files
print("📁 Creating multiple split files...")

# Split the data into train/validation/test
train_df = df.iloc[:5]
val_df = df.iloc[5:7]
test_df = df.iloc[7:]

# Save each split
train_csv = os.path.join(temp_dir, "train.csv")
val_csv = os.path.join(temp_dir, "validation.csv")
test_csv = os.path.join(temp_dir, "test.csv")

train_df.to_csv(train_csv, index=False)
val_df.to_csv(val_csv, index=False)
test_df.to_csv(test_csv, index=False)

print(f"✅ Created 3 split files:")
print(f"   Train: {len(train_df)} examples")
print(f"   Validation: {len(val_df)} examples")
print(f"   Test: {len(test_df)} examples")

In [None]:
# Load all splits at once
print("📥 Loading multiple files as DatasetDict...")

data_files = {
    'train': train_csv,
    'validation': val_csv,
    'test': test_csv
}

multi_file_dataset = load_dataset('csv', data_files=data_files)

print(f"\n✅ Multiple files loaded successfully!")
print(f"📋 Dataset type: {type(multi_file_dataset)}")
print(f"📊 Available splits: {list(multi_file_dataset.keys())}")

for split_name, split_data in multi_file_dataset.items():
    print(f"\n  {split_name}: {len(split_data)} examples")
    print(f"    Features: {list(split_data.features.keys())}")

## Part 5: Format Comparison & Best Practices

Let's compare the different formats and discuss when to use each one.

In [None]:
# Compare file sizes
formats = {
    'Parquet': parquet_file,
    'Pickle': pickle_file,
    'JSON': json_file,
    'JSON Lines': jsonl_file,
    'CSV': csv_file,
    'TSV': tsv_file
}

print("📊 File Size Comparison:")
print("=" * 40)

sizes = []
for format_name, file_path in formats.items():
    size_kb = os.path.getsize(file_path) / 1024
    sizes.append(size_kb)
    print(f"{format_name:12} : {size_kb:6.2f} KB")

# Visualize the comparison
plt.figure(figsize=(10, 6))
bars = plt.bar(formats.keys(), sizes, alpha=0.7, color='skyblue', edgecolor='navy')
plt.xlabel('Format', fontsize=12, fontweight='bold')
plt.ylabel('File Size (KB)', fontsize=12, fontweight='bold')
plt.title('Dataset Format File Size Comparison', fontsize=14, fontweight='bold')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', alpha=0.3)

# Add value labels on bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
             f'{height:.2f} KB',
             ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()

print("\n💡 Note: File sizes vary based on:")
print("   - Compression algorithms")
print("   - Data types and structure")
print("   - Text encoding")
print("   - Metadata overhead")

In [None]:
# Format recommendation table
recommendations = pd.DataFrame([
    {
        'Format': 'Parquet',
        'Use Case': 'Large datasets, production, cloud storage',
        'Pros': 'Efficient, columnar, preserves types',
        'Cons': 'Requires special libraries',
        'Speed': '⭐⭐⭐⭐⭐'
    },
    {
        'Format': 'CSV',
        'Use Case': 'Simple data, spreadsheet compatibility',
        'Pros': 'Universal support, human-readable',
        'Cons': 'No type preservation, inefficient',
        'Speed': '⭐⭐⭐'
    },
    {
        'Format': 'JSON Lines',
        'Use Case': 'Streaming data, append operations',
        'Pros': 'Streamable, one record per line',
        'Cons': 'Larger file size than Parquet',
        'Speed': '⭐⭐⭐⭐'
    },
    {
        'Format': 'JSON',
        'Use Case': 'Web APIs, small datasets',
        'Pros': 'Human-readable, nested data support',
        'Cons': 'Not efficient for large data',
        'Speed': '⭐⭐'
    },
    {
        'Format': 'Pickle',
        'Use Case': 'Quick Python prototyping only',
        'Pros': 'Fast for Python, preserves types',
        'Cons': 'Python-only, security risk',
        'Speed': '⭐⭐⭐⭐'
    }
])

print("\n📋 Format Recommendations:")
print("=" * 80)
print(recommendations.to_string(index=False))

print("\n🎯 Quick Decision Guide:")
print("   ✅ Production/Large Data → Parquet")
print("   ✅ Streaming/Logging → JSON Lines")
print("   ✅ Spreadsheet/Simple → CSV")
print("   ✅ Web APIs → JSON")
print("   ⚠️  Python Quick Test Only → Pickle")

## Part 6: Dataset Processing & Transformation

Once loaded, you'll often need to process and transform your datasets.

In [None]:
# Load our example dataset
dataset = csv_dataset['train']
print(f"📊 Working with dataset: {len(dataset)} examples")
print(f"🔧 Original features: {list(dataset.features.keys())}")

# 1. Filtering examples
print("\n🔍 Filtering: Keep only 'offensive' and 'neither' classes...")
filtered = dataset.filter(lambda x: x['label'] in [1, 2])
print(f"   Kept {len(filtered)} out of {len(dataset)} examples")

# 2. Mapping - Add new features
print("\n🗺️  Mapping: Adding text length feature...")
def add_text_length(example):
    example['text_length'] = len(example['text'])
    return example

dataset_with_length = dataset.map(add_text_length)
print(f"   New features: {list(dataset_with_length.features.keys())}")
print(f"   Sample text length: {dataset_with_length[0]['text_length']}")

# 3. Batch processing for efficiency
print("\n⚡ Batch processing: Converting text to lowercase...")
def lowercase_batch(batch):
    batch['text_lower'] = [text.lower() for text in batch['text']]
    return batch

dataset_lowercase = dataset.map(lowercase_batch, batched=True, batch_size=4)
print(f"   Processed in batches of 4")
print(f"   Original: {dataset[0]['text'][:50]}...")
print(f"   Lowercase: {dataset_lowercase[0]['text_lower'][:50]}...")

### Tokenization Example

Let's tokenize our dataset using a hate speech detection model tokenizer:

In [None]:
# Load tokenizer for hate speech detection (preferred model)
print("🔤 Loading tokenizer for hate speech detection...")
model_name = "cardiffnlp/twitter-roberta-base-hate-latest"
tokenizer = AutoTokenizer.from_pretrained(model_name)
print(f"✅ Loaded tokenizer: {model_name}")

# Define tokenization function
def tokenize_function(examples):
    """Tokenize text with truncation and padding."""
    return tokenizer(
        examples['text'],
        truncation=True,
        padding='max_length',
        max_length=128
    )

# Tokenize the dataset
print("\n⚡ Tokenizing dataset...")
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    batch_size=4,
    remove_columns=['text', 'class_name']  # Remove text columns after tokenization
)

print(f"\n✅ Tokenization complete!")
print(f"📊 Tokenized features: {list(tokenized_dataset.features.keys())}")
print(f"\n📝 Sample tokenized example:")
sample_tokenized = tokenized_dataset[0]
print(f"   Input IDs (first 10): {sample_tokenized['input_ids'][:10]}")
print(f"   Attention mask (first 10): {sample_tokenized['attention_mask'][:10]}")
print(f"   Label: {sample_tokenized['label']}")
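
The cell above pads every example to 128 tokens. An alternative is to pad each batch only to its longest sequence, which is what `DataCollatorWithPadding` in `transformers` is commonly used for. The plain-Python sketch below illustrates the idea only; the token IDs and `pad_id=1` are placeholders, not real tokenizer output.

```python
from typing import List


def pad_batch(sequences: List[List[int]], pad_id: int) -> List[List[int]]:
    """Pad every sequence to the length of the longest one in this batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]


# Hypothetical token IDs standing in for tokenizer output
toy_batch = [[0, 713, 16, 2], [0, 9226, 2], [0, 100, 200, 300, 400, 2]]
padded_batch = pad_batch(toy_batch, pad_id=1)

fixed_length_cells = len(toy_batch) * 128                  # padding='max_length', max_length=128
dynamic_cells = len(padded_batch) * len(padded_batch[0])   # padded to the longest sequence
print(dynamic_cells, fixed_length_cells)
```

For short texts such as tweets, most of a fixed-length batch is padding, so per-batch padding can reduce memory use and compute.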

## Part 7: Saving Processed Datasets

After processing, you'll want to save your datasets for reuse.

In [None]:
# Create output directory
output_dir = os.path.join(temp_dir, 'processed')
os.makedirs(output_dir, exist_ok=True)
print(f"📁 Created output directory: {output_dir}")

# 1. Save as Arrow format (recommended by HF)
arrow_path = os.path.join(output_dir, 'arrow_dataset')
tokenized_dataset.save_to_disk(arrow_path)
print(f"\n💾 Saved as Arrow format: {arrow_path}")
print(f"   This is the recommended format for HF datasets")

# 2. Save as Parquet
parquet_path = os.path.join(output_dir, 'dataset.parquet')
tokenized_dataset.to_parquet(parquet_path)
print(f"\n💾 Saved as Parquet: {parquet_path}")

# 3. Save as JSON (to_json writes JSON Lines by default: one object per line)
json_path = os.path.join(output_dir, 'dataset.json')
tokenized_dataset.to_json(json_path)
print(f"\n💾 Saved as JSON Lines: {json_path}")

# 4. Save as CSV
csv_path = os.path.join(output_dir, 'dataset.csv')
tokenized_dataset.to_csv(csv_path)
print(f"\n💾 Saved as CSV: {csv_path}")

print("\n✅ All formats saved successfully!")

In [None]:
# Demonstrate loading the saved dataset
print("📥 Loading saved Arrow dataset...")
reloaded_dataset = load_from_disk(arrow_path)

print(f"\n✅ Reloaded successfully!")
print(f"📊 Number of examples: {len(reloaded_dataset)}")
print(f"🔧 Features: {list(reloaded_dataset.features.keys())}")

# Verify data integrity
original_sample = tokenized_dataset[0]
reloaded_sample = reloaded_dataset[0]

print(f"\n🔍 Data integrity check:")
print(f"   Input IDs match: {original_sample['input_ids'] == reloaded_sample['input_ids']}")
print(f"   Labels match: {original_sample['label'] == reloaded_sample['label']}")

print("\n💡 Arrow format benefits:")
print("   ✅ Fast loading and saving")
print("   ✅ Preserves all data types")
print("   ✅ Memory-mapped for large datasets")
print("   ✅ Native format for HF datasets")

## Part 8: Real-World Example - Combining Multiple Sources

Let's combine datasets from different sources into a single dataset.

In [None]:
print("🔗 Combining datasets from multiple sources...\n")

# Source 1: CSV file
source1 = load_dataset('csv', data_files=csv_file, split='train')
print(f"📥 Source 1 (CSV): {len(source1)} examples")

# Source 2: JSON Lines file
source2 = load_dataset('json', data_files=jsonl_file, split='train')
print(f"📥 Source 2 (JSON Lines): {len(source2)} examples")

# Source 3: Parquet file
source3 = load_dataset('parquet', data_files=parquet_file, split='train')
print(f"📥 Source 3 (Parquet): {len(source3)} examples")

# Combine all sources
combined_dataset = concatenate_datasets([source1, source2, source3])

print(f"\n📊 Combined dataset:")
print(f"   Total examples: {len(combined_dataset)}")
print(f"   Features: {list(combined_dataset.features.keys())}")

# Remove duplicates based on text, keeping the first occurrence of each
print(f"\n🧹 Removing duplicates...")
# Note: Dataset.unique(column) only returns a column's distinct values and does
# not drop rows, so we record the first index of each text and select those rows
seen_texts = set()
unique_indices = []
for i, example in enumerate(combined_dataset):
    if example['text'] not in seen_texts:
        seen_texts.add(example['text'])
        unique_indices.append(i)

deduplicated = combined_dataset.select(unique_indices)
print(f"   After deduplication: {len(deduplicated)} examples")
print(f"   Removed: {len(combined_dataset) - len(deduplicated)} duplicates")

print("\n💡 Real-world scenario:")
print("   ✅ Data from multiple teams/formats")
print("   ✅ Combine for larger training set")
print("   ✅ Deduplicate for data quality")
print("   ✅ Deterministic: the first occurrence of each text is kept")

## Part 9: Performance Tips & Best Practices

In [None]:
import time

# Create a larger dataset for benchmarking
large_data = sample_data * 100  # 800 examples
large_df = pd.DataFrame(large_data)
large_csv = os.path.join(temp_dir, "large_dataset.csv")
large_df.to_csv(large_csv, index=False)

print("⚡ Performance Comparison: Loading Methods\n")

# Method 1: Load entire dataset
start = time.time()
full_load = load_dataset('csv', data_files=large_csv, split='train')
full_load_time = time.time() - start
print(f"📥 Full load: {full_load_time:.4f}s ({len(full_load)} examples)")

# Method 2: Load with streaming
start = time.time()
stream_load = load_dataset('csv', data_files=large_csv, split='train', streaming=True)
# Take first 100 examples from stream
stream_sample = []
for i, example in enumerate(stream_load):
    if i >= 100:
        break
    stream_sample.append(example)
stream_load_time = time.time() - start
print(f"📥 Streaming (100 examples): {stream_load_time:.4f}s")

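# Note: timings on a file this small are noisy, and the full load also builds
# an Arrow cache on its first run, so treat this ratio as illustrative only.
print(f"\n⚡ Full load took {full_load_time/stream_load_time:.2f}x the time of streaming 100 examples")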

print("\n💡 When to use streaming:")
print("   ✅ Dataset too large for memory")
print("   ✅ Only need a subset of data")
print("   ✅ Processing data in chunks")
print("   ✅ Real-time data pipelines")

In [None]:
# Performance tips summary
tips = [
    ("1. Use Parquet for large datasets", "Most efficient format for size and speed"),
    ("2. Use batched=True in map()", "Process multiple examples at once"),
    ("3. Use streaming for huge datasets", "Don't load everything into memory"),
    ("4. Cache processed datasets", "Reuse tokenized data across runs"),
    ("5. Remove unused columns", "Save memory after tokenization"),
    ("6. Use seed=16 for reproducibility", "Repository standard for all random ops"),
    ("7. Save to Arrow format", "Native HF format, fastest reload"),
    ("8. Use num_proc for parallel processing", "Speed up map() operations"),
]

print("🎯 Top Performance Tips & Best Practices")
print("=" * 70)
for tip, description in tips:
    print(f"\n{tip}")
    print(f"   → {description}")

print("\n" + "=" * 70)
print("✅ Follow these practices for optimal dataset handling!")

## Part 10: Cleanup

In [None]:
# Clean up temporary files
import shutil

try:
    shutil.rmtree(temp_dir)
    print(f"🧹 Cleaned up temporary directory: {temp_dir}")
except Exception as e:
    print(f"⚠️ Could not clean up temp directory: {e}")

print("\n✅ Notebook execution complete!")

---

## 📋 Summary

### 🔑 Key Concepts Mastered
- **HF Hub Datasets**: Loading datasets directly from Hugging Face Hub with `load_dataset()`
- **Multiple Formats**: Working with Parquet, CSV, TSV, JSON, JSON Lines, and Pickle formats
- **Dataset Processing**: Filtering, mapping, and transforming datasets efficiently
- **Tokenization**: Converting text to model-ready format with proper batching
- **Data Persistence**: Saving and loading processed datasets in various formats
- **Best Practices**: Performance optimization and reproducibility with seed=16

### 📈 Format Recommendations
- **Production/Large Data** → Apache Parquet (most efficient)
- **Streaming/Logging** → JSON Lines (one record per line)
- **Simple/Spreadsheet** → CSV/TSV (universal compatibility)
- **Web APIs** → JSON (standard for APIs)
- **HF Native** → Arrow format (fastest for HF ecosystem)
- **Quick Testing Only** → Pickle (Python-specific, security risk)

### 🚀 Next Steps
- **Notebook 05**: Fine-tuning models with the Trainer API
- **Documentation**: [HF Datasets Documentation](https://huggingface.co/docs/datasets)
- **Course**: [HF Course Chapter 5](https://huggingface.co/learn/llm-course/chapter5)
- **Practice**: Try loading your own datasets in different formats

### 💡 Key Takeaways
1. The `datasets` library provides a unified API for all data formats
2. Of the file formats covered here, Parquet is usually the most compact and fastest to read for large-scale NLP datasets
3. Use streaming for datasets too large for memory
4. Always use `seed=16` for reproducibility (repository standard)
5. Cache processed datasets to avoid redundant computation
6. Batch processing is significantly faster than single-example processing

---

## About the Author

**Vu Hung Nguyen** - AI Engineer & Researcher

Connect with me:
- 🌐 **Website**: [vuhung16au.github.io](https://vuhung16au.github.io/)
- 💼 **LinkedIn**: [linkedin.com/in/nguyenvuhung](https://www.linkedin.com/in/nguyenvuhung/)
- 💻 **GitHub**: [github.com/vuhung16au](https://github.com/vuhung16au/)

*This notebook is part of the [HF Transformer Trove](https://github.com/vuhung16au/hf-transformer-trove) educational series.*