[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vuhung16au/hf-transformer-trove/blob/main/examples/basic1.9/Compare-Transformers-Hatespeech.ipynb)
[![View on GitHub](https://img.shields.io/badge/View_on-GitHub-blue?logo=github)](https://github.com/vuhung16au/hf-transformer-trove/blob/main/examples/basic1.9/Compare-Transformers-Hatespeech.ipynb)

# Comparison of Transformer Architectures for Hate Speech Detection

## 🎯 Learning Objectives
By the end of this notebook, you will understand:
- How different transformer architectures perform on hate speech classification
- Practical differences between encoder-only, decoder-only, and encoder-decoder models
- When to choose each architecture for classification tasks
- How to evaluate and compare model performance
- Best practices for hate speech detection systems

## 📋 Prerequisites
- Basic understanding of machine learning concepts
- Familiarity with Python and PyTorch
- Knowledge of NLP fundamentals (refer to [NLP Learning Journey](https://github.com/vuhung16au/nlp-learning-journey))
- Understanding of transformer architectures

## 📚 What We'll Cover
1. **Setup**: Environment and dependencies
2. **Test Data**: Creating a comprehensive test dataset
3. **Encoder-Only**: BERT/RoBERTa for classification
4. **Decoder-Only**: GPT-2 with prompting strategies
5. **Encoder-Decoder**: T5 for text classification
6. **Comparison**: Side-by-side performance analysis
7. **Evaluation**: Metrics and best practices
8. **Recommendations**: When to use each architecture
9. **Summary**: Key takeaways

## 🏗️ Architecture Overview

We'll compare three fundamental transformer architectures:

### 🔍 Encoder-Only (BERT-like)
- **Strengths**: Excellent for classification, bidirectional context
- **Models**: BERT, RoBERTa, DistilBERT
- **Use Case**: Traditional hate speech detection

### 🎭 Decoder-Only (GPT-like)
- **Strengths**: Flexible prompting, few-shot learning
- **Models**: GPT-2, DistilGPT-2
- **Use Case**: Zero/few-shot classification via prompting

### 🔄 Encoder-Decoder (T5-like)
- **Strengths**: Text-to-text framework, flexible outputs
- **Models**: T5, FLAN-T5, BART
- **Use Case**: Classification as text generation task
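
The practical difference between these families comes down to which tokens each position may attend to. The sketch below is a simplified illustration in plain Python, not how any library builds its masks, and the helper names `full_mask` and `causal_mask` are made up for this example. An encoder lets every token see the whole sentence, a decoder lets each token see only itself and earlier tokens, and an encoder-decoder uses the first pattern in its encoder and the second in its decoder (which also attends to the encoder output).

```python
from typing import List


def full_mask(n: int) -> List[List[int]]:
    """Encoder-style visibility: every token can attend to every token."""
    return [[1] * n for _ in range(n)]


def causal_mask(n: int) -> List[List[int]]:
    """Decoder-style visibility: token i attends only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


tokens = ["I", "love", "Sydney"]
for name, mask in [("encoder (bidirectional)", full_mask(3)),
                   ("decoder (causal)", causal_mask(3))]:
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:>6}: {row}")
```

In the causal case the row for "I" is `[1, 0, 0]`: the first token never sees "love" or "Sydney", which is why decoder-only models are usually asked to produce the label after reading the whole text.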


## Setup and Installation

In [None]:
# Install required packages (uncomment if needed)
# !pip install transformers torch datasets evaluate scikit-learn matplotlib seaborn pandas numpy tqdm

# Core imports
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.auto import tqdm
import warnings
import time
from typing import List, Dict, Tuple, Optional

# Hugging Face ecosystem
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    AutoModelForCausalLM, T5ForConditionalGeneration,
    pipeline, set_seed
)

# Evaluation metrics
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix
import evaluate

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')
set_seed(42)

print("✅ All packages imported successfully!")

In [None]:
# Device detection with educational output
def get_device() -> torch.device:
    """
    Automatically detect and return the best available device.
    
    Priority: CUDA > MPS (Apple Silicon) > CPU
    
    Returns:
        torch.device: The optimal device for current hardware
    """
    if torch.cuda.is_available():
        device = torch.device("cuda")
        print(f"🚀 Using CUDA GPU: {torch.cuda.get_device_name()}")
        print(f"   Available GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
        print("🍎 Using Apple MPS (Apple Silicon) for GPU acceleration")
    else:
        device = torch.device("cpu")
        print("💻 Using CPU - consider GPU for better performance")
    
    return device

# Get optimal device
device = get_device()
print(f"📱 Selected device: {device}")

## Test Dataset Creation

Let's create a comprehensive test dataset that includes various types of content, from clearly positive to potentially problematic, allowing us to evaluate how each architecture handles different scenarios.

In [None]:
# Comprehensive test dataset for hate speech detection
test_data = {
    "Clearly Positive": [
        "I love Sydney",
        "Melbourne has beautiful architecture",
        "Brisbane is a wonderful city to visit",
        "Perth has amazing beaches",
        "Thank you for your help and support"
    ],
    
    "Neutral/Informative": [
        "The weather in Sydney today is cloudy",
        "Melbourne's population is approximately 5 million",
        "The conference will be held in Brisbane",
        "Flight tickets to Perth are available online",
        "The meeting is scheduled for tomorrow"
    ],
    
    "Mild Criticism": [
        "I hate Melbourne traffic during rush hour", 
        "Sydney's housing prices are too expensive",
        "Brisbane's public transport could be improved",
        "Perth feels too isolated from other cities",
        "This policy decision seems poorly thought out"
    ],
    
    "Strong Negative (Non-Toxic)": [
        "I strongly disagree with this approach",
        "This service was terrible and disappointing",
        "The customer support was unhelpful",
        "I hate waiting in long queues",
        "This movie was boring and poorly made"
    ],
    
    "Potentially Problematic": [
        "People from that country are all the same",
        "I can't stand people who think differently",
        "Those protesters should be silenced",
        "Nobody wants to hear your opinion",
        "You people always complain about everything"
    ]
}

# Flatten the test data for processing
all_texts = []
categories = []
expected_labels = []

# Define expected toxicity levels (0=safe, 1=toxic)
category_toxicity = {
    "Clearly Positive": 0,
    "Neutral/Informative": 0,
    "Mild Criticism": 0,  # Criticism but not toxic
    "Strong Negative (Non-Toxic)": 0,  # Strong but not hateful
    "Potentially Problematic": 1  # Potentially toxic content
}

for category, texts in test_data.items():
    for text in texts:
        all_texts.append(text)
        categories.append(category)
        expected_labels.append(category_toxicity[category])

# Create DataFrame for easier analysis
test_df = pd.DataFrame({
    'text': all_texts,
    'category': categories,
    'expected_toxic': expected_labels
})

print(f"📊 Test Dataset Summary:")
print(f"   Total samples: {len(test_df)}")
print(f"   Categories: {list(test_data.keys())}")
print(f"   Expected toxic samples: {sum(expected_labels)}")
print(f"   Expected safe samples: {len(expected_labels) - sum(expected_labels)}")

# Display sample data
print("\n📝 Sample Test Data:")
for category, texts in test_data.items():
    print(f"\n{category}:")
    for i, text in enumerate(texts[:2], 1):  # Show first 2 examples
        print(f"  {i}. \"{text}\"")
    if len(texts) > 2:
        print(f"  ... and {len(texts)-2} more")

## 🔍 Encoder-Only Architecture: BERT for Hate Speech Detection

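Encoder-only models excel at classification tasks. We'll use a pre-trained model specifically fine-tuned for toxicity detection. If that model cannot be loaded, the cell falls back to a sentiment model, whose negative-sentiment score is only a rough proxy for toxicity.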
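
A classification pipeline that returns every label's score hands back a list of `{label, score}` dicts. Depending on the `transformers` version and how the pipeline is called, that list may arrive wrapped in one extra list, and label casing differs between checkpoints. The helper below is a minimal sketch for reading one label's score under both shapes; the name `toxicity_from_scores` and the sample scores are invented for illustration and are not real model output.

```python
from typing import Dict, List, Union

Scores = List[Dict[str, Union[str, float]]]


def toxicity_from_scores(output: Union[Scores, List[Scores]], label: str = "toxic") -> float:
    """Return the score for `label` (case-insensitive), or 0.0 if it is absent."""
    # Unwrap one level if the scores for a single input arrived as [[...]].
    scores = output[0] if output and isinstance(output[0], list) else output
    for entry in scores:
        if str(entry["label"]).lower() == label.lower():
            return float(entry["score"])
    return 0.0


# Illustrative, made-up scores (not real model output)
nested = [[{"label": "toxic", "score": 0.91}, {"label": "insult", "score": 0.40}]]
flat = [{"label": "TOXIC", "score": 0.07}]
print(toxicity_from_scores(nested))  # 0.91
print(toxicity_from_scores(flat))    # 0.07
```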

In [None]:
# Encoder-Only Implementation
print("📥 Loading Encoder-Only model for hate speech detection...")

try:
    # Try to load a toxicity-specific model
    encoder_model = pipeline(
        "text-classification",
        model="unitary/toxic-bert",
        device=0 if device.type == "cuda" else -1,
        top_k=None  # return scores for every label (return_all_scores is deprecated)
    )
    encoder_model_name = "unitary/toxic-bert"
    print("✅ Loaded toxic-bert model successfully")
    
except Exception as e:
    print(f"⚠️ Could not load toxic-bert, using fallback: {e}")
    # Fallback to a general sentiment model
    encoder_model = pipeline(
        "text-classification",
        model="cardiffnlp/twitter-roberta-base-sentiment-latest",
        device=0 if device.type == "cuda" else -1,
        top_k=None  # return scores for every label (return_all_scores is deprecated)
    )
    encoder_model_name = "cardiffnlp/twitter-roberta-base-sentiment-latest (fallback)"
    print("✅ Loaded sentiment model as fallback")

print(f"📊 Model: {encoder_model_name}")

## 🎭 Decoder-Only Architecture: GPT-2 with Prompting

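For decoder-only models, we'll classify by prompting rather than by adding a classification head: we pose a question and inspect the generated continuation. DistilGPT-2 is a small base model without instruction tuning, so treat this cell as a demonstration of the mechanics that larger LLMs rely on, not as a strong baseline.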
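
Few-shot prompting extends the same idea by placing a handful of labeled demonstrations before the text to classify. The sketch below only builds the prompt string, using the same Text/Question/Answer layout as this notebook's zero-shot prompt. The function name `build_few_shot_prompt` and the two demonstrations are hypothetical, and demonstrations should not be drawn from the test set.

```python
from typing import List, Tuple

QUESTION = "Question: Is this text toxic or safe?"


def build_few_shot_prompt(text: str, examples: List[Tuple[str, str]]) -> str:
    """Prefix labeled demonstrations, then leave the final answer open."""
    blocks = [
        f'Text: "{example}"\n{QUESTION}\nAnswer: This text is {label}'
        for example, label in examples
    ]
    # The last block ends mid-sentence so the model completes it with a label.
    blocks.append(f'Text: "{text}"\n{QUESTION}\nAnswer: This text is')
    return "\n\n".join(blocks)


# Hypothetical demonstrations, written for this example only
demos = [
    ("Thanks for the quick reply", "safe"),
    ("Nobody wants your kind around here", "toxic"),
]
print(build_few_shot_prompt("I love Sydney", demos))
```

Whether demonstrations help a model as small as DistilGPT-2 is an open question here; the point is only that the prompt format carries the task description.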

In [None]:
# Decoder-Only Implementation
print("📥 Loading Decoder-Only model for prompted classification...")

try:
    # Load a smaller GPT model for demonstration
    decoder_tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
    decoder_model = AutoModelForCausalLM.from_pretrained("distilgpt2")
    
    # Add pad token if not present
    if decoder_tokenizer.pad_token is None:
        decoder_tokenizer.pad_token = decoder_tokenizer.eos_token
    
    decoder_model.to(device)
    decoder_model.eval()
    
    decoder_model_name = "distilgpt2"
    print("✅ Loaded DistilGPT-2 model successfully")
    
except Exception as e:
    print(f"❌ Could not load decoder model: {e}")
    decoder_model = None
    decoder_model_name = "Not available"

def classify_with_prompting(text: str) -> dict:
    """Use GPT-2 with prompting for classification."""
    if decoder_model is None:
        return {'toxic_score': 0.5, 'is_toxic': False, 'confidence': 0.0, 'error': 'Model not available'}
    
    # Create a classification prompt
    prompt = f"""Text: "{text}"
Question: Is this text toxic or safe?
Answer: This text is"""
    
    try:
        # Tokenize and generate
        inputs = decoder_tokenizer(prompt, return_tensors="pt", truncation=True, max_length=200).to(device)
        
        with torch.no_grad():
            outputs = decoder_model.generate(
                inputs.input_ids,
                attention_mask=inputs.attention_mask,
                max_new_tokens=10,  # cap the continuation length directly
                num_return_sequences=1,
                temperature=0.3,
                do_sample=True,
                pad_token_id=decoder_tokenizer.eos_token_id
            )
        
        # Decode only the newly generated tokens (everything after the prompt)
        new_tokens = outputs[0][inputs.input_ids.shape[1]:]
        completion = decoder_tokenizer.decode(new_tokens, skip_special_tokens=True).strip().lower()
        
        # Simple heuristic classification
        toxic_indicators = ['toxic', 'harmful', 'bad', 'inappropriate', 'offensive']
        safe_indicators = ['safe', 'fine', 'okay', 'normal', 'appropriate']
        
        if any(word in completion for word in toxic_indicators):
            toxic_score = 0.8
        elif any(word in completion for word in safe_indicators):
            toxic_score = 0.2
        else:
            toxic_score = 0.5  # Neutral/uncertain
        
        return {
            'toxic_score': toxic_score,
            'is_toxic': toxic_score > 0.5,
            'confidence': abs(toxic_score - 0.5) * 2,
            'completion': completion[:100]  # Truncate for display
        }
        
    except Exception as e:
        return {'toxic_score': 0.5, 'is_toxic': False, 'confidence': 0.0, 'error': str(e)}

print(f"📊 Model: {decoder_model_name}")

## 🔄 Encoder-Decoder Architecture: T5 for Text Classification

Encoder-decoder models like T5 treat classification as a text-to-text task, generating class labels as output.
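
Because the label comes back as free text, it has to be mapped to a class, and a plain substring check would read "non-toxic" as toxic. The parser below is one possible sketch that handles a few negated forms before falling back to the bare label words. The name `parse_label` and the exact patterns are assumptions for illustration; a real system would tune them to the outputs the model actually produces.

```python
import re
from typing import Optional

_NEGATED_TOXIC = re.compile(r"\b(?:non|not)[\s-]*toxic\b")
_NEGATED_SAFE = re.compile(r"\b(?:un|not\s+)safe\b")


def parse_label(generated: str) -> Optional[str]:
    """Map generated text to 'toxic', 'safe', or None when no label is found."""
    text = generated.strip().lower()
    # Check negated forms first so "non-toxic" is not read as "toxic".
    if _NEGATED_TOXIC.search(text):
        return "safe"
    if _NEGATED_SAFE.search(text):
        return "toxic"
    if re.search(r"\btoxic\b", text):
        return "toxic"
    if re.search(r"\bsafe\b", text):
        return "safe"
    return None


for sample in ["Toxic", "non-toxic", "not safe", "positive"]:
    print(f"{sample!r} -> {parse_label(sample)}")
```

Returning `None` for unrecognized output keeps "the model did not answer" separate from either class, so the caller can decide how to score it.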

In [None]:
# Encoder-Decoder Implementation
print("📥 Loading Encoder-Decoder model for text-to-text classification...")

try:
    # Try to load FLAN-T5 which is instruction-tuned
    encoder_decoder_model = pipeline(
        "text2text-generation",
        model="google/flan-t5-small",
        device=0 if device.type == "cuda" else -1,
        max_length=50
    )
    encoder_decoder_model_name = "google/flan-t5-small"
    print("✅ Loaded FLAN-T5-small model successfully")
    
except Exception as e:
    print(f"⚠️ Could not load FLAN-T5, trying T5-small: {e}")
    try:
        encoder_decoder_model = pipeline(
            "text2text-generation",
            model="t5-small",
            device=0 if device.type == "cuda" else -1,
            max_length=50
        )
        encoder_decoder_model_name = "t5-small (fallback)"
        print("✅ Loaded T5-small as fallback")
    except Exception as e2:
        print(f"❌ Could not load encoder-decoder model: {e2}")
        encoder_decoder_model = None
        encoder_decoder_model_name = "Not available"

def classify_with_text2text(text: str) -> dict:
    """Use T5 for text-to-text classification."""
    if encoder_decoder_model is None:
        return {'toxic_score': 0.5, 'is_toxic': False, 'confidence': 0.0, 'error': 'Model not available'}
    
    # Create text-to-text prompt
    prompt = f"Classify this text as 'safe' or 'toxic': {text}"
    
    try:
        # Generate classification
        result = encoder_decoder_model(prompt, max_length=10, num_return_sequences=1)
        generated_text = result[0]['generated_text'].strip().lower()
        
        # Parse the result
        if 'toxic' in generated_text:
            toxic_score = 0.8
        elif 'safe' in generated_text:
            toxic_score = 0.2
        else:
            # Fallback analysis
            negative_words = ['bad', 'negative', 'harmful', 'inappropriate']
            if any(word in generated_text for word in negative_words):
                toxic_score = 0.7
            else:
                toxic_score = 0.5
        
        return {
            'toxic_score': toxic_score,
            'is_toxic': toxic_score > 0.5,
            'confidence': abs(toxic_score - 0.5) * 2,
            'generated_text': generated_text
        }
        
    except Exception as e:
        return {'toxic_score': 0.5, 'is_toxic': False, 'confidence': 0.0, 'error': str(e)}

print(f"📊 Model: {encoder_decoder_model_name}")

## 🚀 Running All Models and Comparison

Now let's run all three architectures on our test dataset and create a comprehensive comparison.

In [None]:
# Run all models on test data and create comparison
print("🚀 Running comprehensive comparison across all architectures...\n")

# Initialize results storage
results_comparison = []

print("Processing texts with all models...")
for i, text in enumerate(tqdm(all_texts, desc="Processing")):
    row = {
        'Text': text[:50] + '...' if len(text) > 50 else text,
        'Full_Text': text,
        'Category': categories[i],
        'Expected_Toxic': bool(expected_labels[i])
    }
    
    # Encoder-Only Results
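    try:
        enc_result = encoder_model(text)
        # The pipeline may wrap a single input's scores in an extra list; unwrap if so
        enc_scores = enc_result[0] if isinstance(enc_result[0], list) else enc_result
        if 'toxic' in encoder_model_name.lower():
            # Toxic-BERT model: match the label case-insensitively
            toxic_score = next((p['score'] for p in enc_scores if p['label'].lower() == 'toxic'), 0)
        else:
            # Sentiment model fallback: negative sentiment as a rough proxy for toxicity
            negative_score = next((p['score'] for p in enc_scores if p['label'].lower() in ('negative', 'label_0')), 0)
            toxic_score = negative_score
        
        row.update({
            'Encoder_Toxic': toxic_score > 0.5,
            'Encoder_Score': f"{toxic_score:.3f}",
            'Encoder_Confidence': f"{max(p['score'] for p in enc_scores):.3f}"
        })
    except Exception as e:
        # Report the failure instead of silently recording a default
        print(f"⚠️ Encoder failed on {text[:30]!r}: {e}")
        row.update({
            'Encoder_Toxic': False,
            'Encoder_Score': '0.500',
            'Encoder_Confidence': '0.000'
        })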
    
    # Decoder-Only Results
    dec_result = classify_with_prompting(text)
    row.update({
        'Decoder_Toxic': dec_result['is_toxic'],
        'Decoder_Score': f"{dec_result['toxic_score']:.3f}",
        'Decoder_Confidence': f"{dec_result['confidence']:.3f}"
    })
    
    # Encoder-Decoder Results
    ed_result = classify_with_text2text(text)
    row.update({
        'EncDec_Toxic': ed_result['is_toxic'],
        'EncDec_Score': f"{ed_result['toxic_score']:.3f}",
        'EncDec_Confidence': f"{ed_result['confidence']:.3f}"
    })
    
    results_comparison.append(row)

# Create results DataFrame
results_df = pd.DataFrame(results_comparison)

print("\n✅ All models completed! Results ready for analysis.")

## 📊 Results Table Comparison

Here's the comprehensive comparison table showing how each architecture performed on our test cases, including **"I love Sydney"** and **"I hate Melbourne traffic"**.

In [None]:
# Display comprehensive results table
print("📋 COMPREHENSIVE RESULTS COMPARISON TABLE")
print("=" * 100)

# Create a clean display DataFrame
display_columns = ['Text', 'Category', 'Expected_Toxic', 
                  'Encoder_Toxic', 'Encoder_Score',
                  'Decoder_Toxic', 'Decoder_Score', 
                  'EncDec_Toxic', 'EncDec_Score']

display_df = results_df[display_columns].copy()
print(display_df.to_string(index=False, max_colwidth=40))

# Calculate accuracy for each model
encoder_predictions = results_df['Encoder_Toxic'].astype(bool).tolist()
decoder_predictions = results_df['Decoder_Toxic'].astype(bool).tolist()
encdec_predictions = results_df['EncDec_Toxic'].astype(bool).tolist()

encoder_accuracy = accuracy_score(expected_labels, encoder_predictions)
decoder_accuracy = accuracy_score(expected_labels, decoder_predictions)
encdec_accuracy = accuracy_score(expected_labels, encdec_predictions)

print("\n\n📊 OVERALL ACCURACY SUMMARY")
print("=" * 50)
print(f"🔍 Encoder-Only ({encoder_model_name}): {encoder_accuracy:.3f}")
print(f"🎭 Decoder-Only ({decoder_model_name}): {decoder_accuracy:.3f}")
print(f"🔄 Encoder-Decoder ({encoder_decoder_model_name}): {encdec_accuracy:.3f}")

# Highlight specific examples
print("\n\n🎯 SPECIFIC EXAMPLES ANALYSIS")
print("=" * 50)

key_examples = ["I love Sydney", "I hate Melbourne traffic during rush hour", 
               "I strongly disagree with this approach", "People from that country are all the same"]

for example in key_examples:
    matching_rows = results_df[results_df['Full_Text'] == example]
    if not matching_rows.empty:
        row = matching_rows.iloc[0]
        print(f"\n📝 \"{example}\"")
        print(f"   Expected: {'Toxic' if row['Expected_Toxic'] else 'Safe'}")
        print(f"   🔍 Encoder: {'Toxic' if row['Encoder_Toxic'] else 'Safe'} ({row['Encoder_Score']})")
        print(f"   🎭 Decoder: {'Toxic' if row['Decoder_Toxic'] else 'Safe'} ({row['Decoder_Score']})")
        print(f"   🔄 Enc-Dec: {'Toxic' if row['EncDec_Toxic'] else 'Safe'} ({row['EncDec_Score']})")
        
        # Check agreement
        predictions = [row['Encoder_Toxic'], row['Decoder_Toxic'], row['EncDec_Toxic']]
        if len(set(predictions)) == 1:
            print(f"   ✅ All models agree!")
        else:
            print(f"   ⚠️ Models disagree")

## 📈 Performance Analysis and Recommendations

Let's analyze the results and provide recommendations for when to use each architecture.
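
Accuracy alone is misleading on this test set because only 5 of the 25 samples are labeled toxic. As a worked example, the sketch below computes the metrics from raw counts (precision = TP / (TP + FP), recall = TP / (TP + FN), F1 = their harmonic mean) for a trivial baseline that labels everything safe. The helper `binary_metrics` is written for this illustration and is not part of scikit-learn.

```python
from typing import Dict, Sequence


def binary_metrics(expected: Sequence[int], predicted: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(expected, predicted))
    tp = sum(1 for e, p in pairs if e == 1 and p == 1)
    fp = sum(1 for e, p in pairs if e == 0 and p == 1)
    fn = sum(1 for e, p in pairs if e == 1 and p == 0)
    tn = sum(1 for e, p in pairs if e == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


# Same class balance as the test set: 20 safe samples, 5 toxic samples
expected = [0] * 20 + [1] * 5
always_safe = [0] * 25
baseline = binary_metrics(expected, always_safe)
print(baseline)  # accuracy 0.8; precision, recall, and F1 are all 0.0
```

A model that never flags anything scores 0.8 accuracy here while catching no toxic text, which is why the F1 score should be read alongside accuracy when comparing models.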

In [None]:
# Performance analysis and grading
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix

def analyze_performance(predictions, expected, model_name):
    """Analyze model performance with comprehensive metrics."""
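    # Cast to int so boolean predictions and 0/1 labels share one label type
    expected = [int(e) for e in expected]
    predictions = [int(p) for p in predictions]
    accuracy = accuracy_score(expected, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(expected, predictions, average='binary', zero_division=0)
    
    # Confusion matrix; labels=[0, 1] guarantees a 2x2 result even if a class is missing
    tn, fp, fn, tp = confusion_matrix(expected, predictions, labels=[0, 1]).ravel()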
    
    # Grade the model
    if accuracy >= 0.9 and f1 >= 0.85:
        grade = "🟢 Excellent"
    elif accuracy >= 0.8 and f1 >= 0.75:
        grade = "🟡 Good"
    elif accuracy >= 0.7 and f1 >= 0.6:
        grade = "🟠 Fair"
    else:
        grade = "🔴 Needs Improvement"
    
    return {
        'model': model_name,
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
        'true_pos': tp,
        'false_pos': fp,
        'true_neg': tn,
        'false_neg': fn,
        'grade': grade
    }

# Analyze each model
encoder_analysis = analyze_performance(encoder_predictions, expected_labels, "Encoder-Only")
decoder_analysis = analyze_performance(decoder_predictions, expected_labels, "Decoder-Only")
encdec_analysis = analyze_performance(encdec_predictions, expected_labels, "Encoder-Decoder")

analyses = [encoder_analysis, decoder_analysis, encdec_analysis]

print("📊 DETAILED PERFORMANCE ANALYSIS")
print("=" * 80)

# Create performance comparison table
perf_data = []
for analysis in analyses:
    perf_data.append({
        'Architecture': analysis['model'],
        'Grade': analysis['grade'],
        'Accuracy': f"{analysis['accuracy']:.3f}",
        'Precision': f"{analysis['precision']:.3f}",
        'Recall': f"{analysis['recall']:.3f}",
        'F1-Score': f"{analysis['f1_score']:.3f}",
        'True Pos': analysis['true_pos'],
        'False Pos': analysis['false_pos']
    })

perf_df = pd.DataFrame(perf_data)
print(perf_df.to_string(index=False))

# Recommendations
print("\n\n🏆 ARCHITECTURE RECOMMENDATIONS")
print("=" * 60)

best_model = max(analyses, key=lambda x: x['f1_score'])
print(f"\n🥇 **Best Overall Performance**: {best_model['model']} ({best_model['grade']})")
print(f"   F1-Score: {best_model['f1_score']:.3f} | Accuracy: {best_model['accuracy']:.3f}")

recommendations = {
    "🔍 Encoder-Only (BERT/RoBERTa)": [
        "✅ Best for: Production hate speech detection systems",
        "✅ Advantages: Fast, reliable, specifically designed for classification",
        "⚠️ Limitations: Requires fine-tuning, less flexible for new tasks",
        "📊 Use when: You have labeled data and need consistent performance"
    ],
    "🎭 Decoder-Only (GPT-2/GPT)": [
        "✅ Best for: Exploratory analysis, zero-shot classification",
        "✅ Advantages: No fine-tuning needed, flexible prompting",
        "⚠️ Limitations: Inconsistent, slower, requires prompt engineering",
        "📊 Use when: You lack labeled data or need quick prototyping"
    ],
    "🔄 Encoder-Decoder (T5/BART)": [
        "✅ Best for: Multi-task systems, classification with explanations",
        "✅ Advantages: Text-to-text flexibility, can generate reasoning",
        "⚠️ Limitations: Overkill for simple classification, slower inference",
        "📊 Use when: You need both classification and explanation generation"
    ]
}

for arch, points in recommendations.items():
    print(f"\n{arch}")
    for point in points:
        print(f"   {point}")

print(f"\n\n💡 **Key Takeaway**: The choice between architectures depends on your specific")
print(f"    requirements: accuracy vs. flexibility vs. speed vs. explainability.")

---

## 📋 Summary

### 🔑 Key Concepts Mastered
- **Architecture Comparison**: Practical differences between encoder-only, decoder-only, and encoder-decoder models for hate speech detection
- **Evaluation Methodology**: Comprehensive approach to assess classification performance using multiple metrics
- **Model Selection**: Understanding when to choose each architecture based on requirements and constraints
- **Performance Analysis**: How to interpret metrics and make data-driven decisions about model deployment

### 📈 Best Practices Learned
- **Comprehensive Testing**: Use diverse test cases including edge cases and different content types
- **Multiple Metrics**: Don't rely on accuracy alone - consider precision, recall, and F1-score
- **Context Matters**: Different architectures excel in different scenarios and use cases
- **Production Readiness**: Evaluate not just accuracy but also speed, reliability, and maintainability
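
Speed can be measured with the standard library alone. The helper below is a rough sketch that times any classifier callable over a list of texts and reports the mean wall-clock latency per text. The name `time_classifier` and the stand-in classifier are hypothetical; it also ignores warm-up runs and GPU synchronization, which matter for careful benchmarking.

```python
import time
from typing import Any, Callable, List, Tuple


def time_classifier(classify: Callable[[str], Any], texts: List[str]) -> Tuple[List[Any], float]:
    """Run `classify` on each text; return the results and mean seconds per text."""
    start = time.perf_counter()
    results = [classify(text) for text in texts]
    elapsed = time.perf_counter() - start
    # max(..., 1) avoids division by zero for an empty input list
    return results, elapsed / max(len(texts), 1)


# Stand-in classifier so the example runs without loading a model
def keyword_classifier(text: str) -> bool:
    return "hate" in text.lower()


results, mean_latency = time_classifier(keyword_classifier, ["I love Sydney", "I hate queues"])
print(results, f"{mean_latency * 1000:.3f} ms per text")
```

In this notebook the same helper could wrap `classify_with_prompting` or `classify_with_text2text`, since both take a single string.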

### 🚀 Next Steps
- **Fine-tuning**: Explore fine-tuning techniques for encoder-only models on domain-specific data
- **Prompt Engineering**: Improve decoder-only performance through better prompt design
- **Ensemble Methods**: Combine multiple architectures for robust hate speech detection
- **Production Deployment**: Consider scalability, monitoring, and continuous improvement strategies
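
As a starting point for the ensemble idea, the simplest combination is a majority vote over the per-model boolean predictions. This is a minimal sketch, and the tie-breaking rule (ties count as safe) is an assumption chosen for illustration; a moderation system might prefer to flag ties for human review instead.

```python
from typing import Sequence


def majority_vote(votes: Sequence[bool]) -> bool:
    """Return True when strictly more than half of the votes are True."""
    return sum(bool(vote) for vote in votes) * 2 > len(votes)


print(majority_vote([True, False, True]))   # True
print(majority_vote([True, False, False]))  # False
```

With the results table from this notebook, the votes for one row would be `row['Encoder_Toxic']`, `row['Decoder_Toxic']`, and `row['EncDec_Toxic']`.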

### 🔗 Related Resources
- **[Encoder-Decoder Guide](../../docs/encoder-decoder.md)**: Detailed technical documentation
- **[HF NLP Models](../../docs/HF-NLP-models.md)**: Comprehensive model recommendations
- **[Best Practices](../../docs/best-practices.md)**: Production deployment guidelines

---

## About the Author

**Vu Hung Nguyen** - AI Engineer & Researcher

Connect with me:
- 🌐 **Website**: [vuhung16au.github.io](https://vuhung16au.github.io/)
- 💼 **LinkedIn**: [linkedin.com/in/nguyenvuhung](https://www.linkedin.com/in/nguyenvuhung/)
- 💻 **GitHub**: [github.com/vuhung16au](https://github.com/vuhung16au/)

*This notebook is part of the [HF Transformer Trove](https://github.com/vuhung16au/hf-transformer-trove) educational series.*