[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vuhung16au/hf-transformer-trove/blob/main/examples/basic5.2/lambda-functions-recipe-NLP.ipynb)
[![Open with SageMaker](https://img.shields.io/badge/Open%20with-SageMaker-orange?logo=amazonaws)](https://studiolab.sagemaker.aws/import/github/vuhung16au/hf-transformer-trove/blob/main/examples/basic5.2/lambda-functions-recipe-NLP.ipynb)
[![View on GitHub](https://img.shields.io/badge/View_on-GitHub-blue?logo=github)](https://github.com/vuhung16au/hf-transformer-trove/blob/main/examples/basic5.2/lambda-functions-recipe-NLP.ipynb)

# Python Lambda Functions Recipe for NLP

## 🎯 Learning Objectives
By the end of this notebook, you will understand:
- How to use lambda functions effectively with Hugging Face datasets
- Advanced lambda patterns for NLP data preprocessing
- Lambda functions for filtering hate speech and offensive content
- Performance considerations when using lambdas in NLP pipelines
- Real-world applications with HF transformers and datasets

## 📋 Prerequisites
- Basic understanding of Python lambda functions
- Familiarity with Hugging Face datasets library
- Knowledge of NLP preprocessing concepts
- Experience with PyTorch tensors (basic level)

## 📚 What We'll Cover
1. **Dataset Filtering with Lambda**: Filter hate speech, spam, and quality issues
2. **Data Preprocessing**: Text cleaning and tokenization with lambda functions
3. **Hate Speech Detection Pipeline**: Advanced lambda patterns for classification
4. **Performance Optimization**: Vectorized vs lambda approaches
5. **Best Practices**: When to use lambdas vs regular functions in NLP

In [4]:
# Repository standard: Always use seed=16 for reproducible results
import torch
import numpy as np
import random
import time
import re
from typing import Dict, List, Any, Optional
from collections import Counter

# Set all random seeds for reproducibility (repository standard: seed=16)
def set_seed(seed_value: int = 16):
    """Set seed for reproducibility across all random number generators."""
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    print(f"🔢 Random seed set to {seed_value} for reproducibility")

set_seed(16)

# Core imports for NLP and datasets
from datasets import load_dataset, Dataset, DatasetDict
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

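# Colab and TPU detection (checked separately: running in Colab does not imply a TPU runtime)
try:
    import google.colab  # noqa: F401  (imported only to detect the Colab environment)
    COLAB_AVAILABLE = True
except ImportError:
    COLAB_AVAILABLE = False

try:
    import torch_xla.core.xla_model as xm
    TPU_AVAILABLE = True
except ImportError:
    TPU_AVAILABLE = False

if COLAB_AVAILABLE and TPU_AVAILABLE:
    print("🔥 Google Colab with torch_xla detected; TPU will be tried first")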

# Device detection with TPU priority for Colab
def get_device():
    """Get optimal device with TPU priority in Colab."""
    if COLAB_AVAILABLE and TPU_AVAILABLE:
        try:
            device = xm.xla_device()
            print("🔥 Using Google Colab TPU for optimal performance")
            return device
        except Exception as e:
            print(f"⚠️ TPU initialization failed: {e}")

    if torch.cuda.is_available():
        device = torch.device("cuda")
        print(f"🚀 Using CUDA GPU: {torch.cuda.get_device_name()}")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
        print("🍎 Using Apple MPS for Apple Silicon optimization")
    else:
        device = torch.device("cpu")
        print("💻 Using CPU - consider GPU/TPU for better performance")

    return device

device = get_device()
print("\n=== Setup Information ===")
print(f"Device: {device}")
print(f"PyTorch version: {torch.__version__}")

🔢 Random seed set to 16 for reproducibility
💻 Using CPU - consider GPU/TPU for better performance

=== Setup Information ===
Device: cpu
PyTorch version: 2.8.0+cu126


## Section 1: Dataset Filtering with Lambda Functions

`Dataset.filter` keeps the examples for which a predicate function returns `True`, and a lambda is a compact way to write that predicate inline. In NLP, typical uses are dropping missing or empty texts, removing examples that are too short or too noisy, and selecting specific classes.
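As a minimal sketch of the predicate idea, the snippet below applies a lambda with Python's built-in `filter` to a few made-up rows shaped like this notebook's dataset (a `tweet` string and a `class` id). The rows are hypothetical and no Hugging Face dataset is loaded; the lambda has the same shape as the callables passed to `Dataset.filter` in the next cell: one example in, one boolean out.

```python
# Toy rows shaped like the dataset used in this notebook (values are made up).
rows = [
    {"tweet": "This is a perfectly fine sentence", "class": 2},
    {"tweet": "   ", "class": 2},   # Only whitespace
    {"tweet": None, "class": 1},    # Missing text
    {"tweet": "ok", "class": 0},
]

# The predicate returns True for rows to keep. The `is not None` check comes
# first because `and` short-circuits, so .strip() is never called on None.
kept = list(
    filter(
        lambda row: row["tweet"] is not None and row["tweet"].strip() != "",
        rows,
    )
)

print([row["tweet"] for row in kept])
# → ['This is a perfectly fine sentence', 'ok']
```

The order of the two conditions matters: swapping them would raise `AttributeError` on the row whose text is `None`.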

In [5]:
print("🛡️ DATASET FILTERING WITH LAMBDA FUNCTIONS")
print("=" * 50)

# Load preferred hate speech dataset (repository standard)
dataset_name = "tdavidson/hate_speech_offensive"
print(f"📥 Loading dataset: {dataset_name}")

try:
    # Load full dataset
    full_dataset = load_dataset(dataset_name)

    # Get a sample with repository seed=16
    train_data = full_dataset["train"].shuffle(seed=16).select(range(5000))
    print(f"✅ Loaded {len(train_data)} training examples")
    print(f"📋 Features: {list(train_data.features.keys())}")

    # Display label distribution
    labels = train_data['class']
    label_counts = Counter(labels)
    print("\n📊 Original label distribution:")
    for label, count in label_counts.items():
        label_name = ["hate speech", "offensive language", "neither"][label]
        print(f"  {label} ({label_name}): {count:,} examples")

except Exception as e:
    print(f"❌ Error loading dataset: {e}")
    print("💡 Creating synthetic dataset for demonstration")

    # Create synthetic dataset for demonstration
    synthetic_texts = [
        "I hate this stupid product",
        "This is amazing, I love it!",
        "You're such an idiot",
        "Great work on this project",
        "Damn, this is good",
        "That's really nice",
        "What a terrible day",
        "Beautiful weather today",
        "",  # Empty text
        "   ",  # Only whitespace
        "a",  # Too short
        "This is a normal sentence with reasonable length"
    ] * 500  # Repeat to get decent sample size

    # Labels follow the real dataset's scheme: 0 = hate speech, 1 = offensive language, 2 = neither
    synthetic_labels = [1, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2] * 500

    train_data = Dataset.from_dict({
        'tweet': synthetic_texts,
        'class': synthetic_labels
    })

    print(f"✅ Created synthetic dataset with {len(train_data)} examples")

🛡️ DATASET FILTERING WITH LAMBDA FUNCTIONS
📥 Loading dataset: tdavidson/hate_speech_offensive
✅ Loaded 5000 training examples
📋 Features: ['count', 'hate_speech_count', 'offensive_language_count', 'neither_count', 'class', 'tweet']

📊 Original label distribution:
  1 (offensive language): 3,919 examples
  0 (hate speech): 268 examples
  2 (neither): 813 examples


In [6]:
print("\n🔍 LAMBDA FILTERING PATTERNS")
print("=" * 30)

# Determine text column name
text_column = 'tweet' if 'tweet' in train_data.features else 'text'

# 1. Filter out None/empty values using lambda
print("1️⃣ Filtering out None and empty values")
filtered_data = train_data.filter(lambda x: x[text_column] is not None and x[text_column].strip() != "")
print(f"   Before: {len(train_data)} examples")
print(f"   After:  {len(filtered_data)} examples")
print(f"   Removed: {len(train_data) - len(filtered_data)} empty/None examples")

# 2. Filter by text length using lambda
print("\n2️⃣ Filtering by minimum text length (>= 10 characters)")
length_filtered = filtered_data.filter(lambda x: len(x[text_column]) >= 10)
print(f"   Before: {len(filtered_data)} examples")
print(f"   After:  {len(length_filtered)} examples")
print(f"   Removed: {len(filtered_data) - len(length_filtered)} too-short examples")

# 3. Filter by specific labels using lambda
print("\n3️⃣ Filtering specific classes")
# Filter for hate speech and offensive language only (exclude neutral)
toxic_content = length_filtered.filter(lambda x: x['class'] in [0, 1])
print(f"   Toxic content only: {len(toxic_content)} examples")

# Filter for neutral content only
neutral_content = length_filtered.filter(lambda x: x['class'] == 2)
print(f"   Neutral content only: {len(neutral_content)} examples")

# 4. Advanced text pattern filtering using lambda
print("\n4️⃣ Advanced pattern filtering")
# Drop texts with excessive punctuation or capitalization
quality_filtered = length_filtered.filter(
    lambda x: (
        sum(c in '!?' for c in x[text_column]) <= 3 and  # At most 3 exclamation/question marks
        # At most 70% uppercase characters; max(..., 1) guards against division by zero on empty strings
        sum(c.isupper() for c in x[text_column]) / max(len(x[text_column]), 1) <= 0.7
    )
)
print(f"   Before quality filter: {len(length_filtered)} examples")
print(f"   After quality filter:  {len(quality_filtered)} examples")
print(f"   Removed: {len(length_filtered) - len(quality_filtered)} low-quality examples")

# 5. Combine multiple conditions in a single lambda
print("\n5️⃣ Combined filtering with complex lambda")
comprehensive_filtered = train_data.filter(
    lambda x: (
        x[text_column] is not None and  # Not None
        len(x[text_column].strip()) >= 10 and  # Minimum length
        len(x[text_column].strip()) <= 280 and  # Maximum length (Twitter-like)
        not x[text_column].strip().startswith('RT') and  # Does not start with 'RT' (simple retweet heuristic)
        x['class'] is not None  # Valid label
    )
)
print(f"   Original: {len(train_data)} examples")
print(f"   After comprehensive filter: {len(comprehensive_filtered)} examples")
print(f"   Retention rate: {len(comprehensive_filtered)/len(train_data)*100:.1f}%")


🔍 LAMBDA FILTERING PATTERNS
1️⃣ Filtering out None and empty values
   Before: 5000 examples
   After:  5000 examples
   Removed: 0 empty/None examples

2️⃣ Filtering by minimum text length (>= 10 characters)
   Before: 5000 examples
   After:  4995 examples
   Removed: 5 too-short examples

3️⃣ Filtering specific classes
   Toxic content only: 4183 examples
   Neutral content only: 812 examples

4️⃣ Advanced pattern filtering
   Before quality filter: 4995 examples
   After quality filter:  4938 examples
   Removed: 57 low-quality examples

5️⃣ Combined filtering with complex lambda
   Original: 5000 examples
   After comprehensive filter: 3672 examples
   Retention rate: 73.4%
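
When a predicate grows to five conditions, a named function is often easier to read and test than a lambda. The sketch below is one possible equivalent of the combined filter above; `keep_example` and its parameters are hypothetical names introduced for illustration, and the sample rows are made up. For rows that contain both columns it returns the same booleans as the lambda, so it could be passed to `train_data.filter` in its place. Unlike the lambda, it uses `.get()`, so a row with a missing column is rejected instead of raising `KeyError`.

```python
from typing import Any, Dict, Optional


def keep_example(
    example: Dict[str, Any],
    text_column: str = "tweet",
    min_chars: int = 10,
    max_chars: int = 280,
) -> bool:
    """Return True if the example passes the combined quality checks."""
    text: Optional[str] = example.get(text_column)
    if text is None or example.get("class") is None:
        return False  # Missing text or missing label
    text = text.strip()
    if not (min_chars <= len(text) <= max_chars):
        return False  # Too short or too long after trimming whitespace
    # Same simple heuristic as the lambda: texts starting with "RT" count as retweets
    return not text.startswith("RT")


# Made-up rows, one per outcome
samples = [
    {"tweet": "A normal sentence of reasonable length", "class": 2},
    {"tweet": "RT @someone: a retweeted message here", "class": 1},
    {"tweet": "short", "class": 0},
    {"tweet": None, "class": 1},
    {"tweet": "Text without a label attached", "class": None},
]

print([keep_example(row) for row in samples])
# → [True, False, False, False, False]
```

Note that `startswith("RT")` only catches retweets where "RT" is the first thing in the text; a tweet that begins with other characters before "RT" passes this check.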


---

## 📋 Summary

### 🔑 Key Concepts Mastered
- **Dataset Filtering**: Using lambda functions to filter datasets based on content, quality, and class labels
- **Data Preprocessing**: Text cleaning, normalization, and feature extraction using lambda transformations
- **NLP Pipeline Integration**: Combining lambda functions with tokenizers and hate speech detection models
- **Performance Optimization**: Understanding when to use lambda vs vectorized operations for efficiency
- **Error Handling**: Safe patterns for lambda functions in data processing pipelines

### 📈 Best Practices Learned
- Use lambdas for simple, single-expression checks and named functions for complex logic
- Handle edge cases explicitly: check for `None`, and use `.get()` with a default value when a key may be missing
- Consider performance: batched or vectorized operations are often faster than per-example lambdas on large datasets
- Test lambda functions on a small sample before applying them to a large dataset
- Compute expensive values (for example, compiled regular expressions) once, outside the lambda, to avoid repeating the work for every example
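
A few of these practices can be sketched in plain Python. The example below is illustrative only: `URL_PATTERN` and `keep_batch` are hypothetical names, and the rows are made up. It shows `.get()` with a fallback for missing or `None` text, a regular expression compiled once outside the lambda, and a batch-level predicate of the shape that `Dataset.filter(..., batched=True)` expects (a dict of columns in, a list of booleans out).

```python
import re
from typing import Any, Dict, List

# Compiled once, outside the lambda, so the pattern is not rebuilt per example.
URL_PATTERN = re.compile(r"https?://\S+")

rows = [
    {"tweet": "see https://example.com for details", "class": 2},
    {"tweet": None, "class": 1},
    {"class": 0},  # "tweet" column missing entirely
]

# .get() avoids KeyError on a missing column; `or ""` also covers an explicit None.
url_flags = list(
    map(lambda row: bool(URL_PATTERN.search(row.get("tweet") or "")), rows)
)
print(url_flags)
# → [True, False, False]


def keep_batch(batch: Dict[str, List[Any]], min_chars: int = 10) -> List[bool]:
    """Batch-level predicate: returns one boolean per example in the batch."""
    return [
        text is not None and len(text.strip()) >= min_chars
        for text in batch["tweet"]
    ]


# A batch is a dict mapping column names to lists of values.
batch = {"tweet": ["long enough text here", "tiny", None], "class": [2, 1, 0]}
print(keep_batch(batch))
# → [True, False, False]

# With a Hugging Face dataset this would be used as:
#   train_data.filter(keep_batch, batched=True)
```

Whether the batched form is actually faster depends on the dataset size and the cost of the predicate, so it is worth timing both variants on a small sample first.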

### 🚀 Next Steps
- **Advanced Fine-tuning**: Apply these filtering techniques to create high-quality training datasets
- **Custom Datasets**: Use lambda functions to create domain-specific datasets from raw text
- **Production Pipelines**: Integrate these patterns into MLOps workflows for automated data processing
- **Evaluation Metrics**: Use lambda functions for custom evaluation and analysis of model predictions

---

## About the Author

**Vu Hung Nguyen** - AI Engineer & Researcher

Connect with me:
- 🌐 **Website**: [vuhung16au.github.io](https://vuhung16au.github.io/)
- 💼 **LinkedIn**: [linkedin.com/in/nguyenvuhung](https://www.linkedin.com/in/nguyenvuhung/)
- 💻 **GitHub**: [github.com/vuhung16au](https://github.com/vuhung16au/)

*This notebook is part of the [HF Transformer Trove](https://github.com/vuhung16au/hf-transformer-trove) educational series.*