[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vuhung16au/hf-transformer-trove/blob/main/examples/04_mini_project.ipynb)
[![View on GitHub](https://img.shields.io/badge/View_on-GitHub-blue?logo=github)](https://github.com/vuhung16au/hf-transformer-trove/blob/main/examples/04_mini_project.ipynb)

# 04 - Mini-Project: Building a Complete Sentiment Analysis System

## Project Overview
In this mini-project, we'll combine everything we've learned from the first three notebooks:
- **Hugging Face transformers** for model loading and inference
- **Tokenizers** for text preprocessing
- **Datasets** for efficient data handling

We'll build a complete sentiment analysis system that can:
1. Load and preprocess data efficiently
2. Compare different models and tokenizers
3. Analyze model performance
4. Create a user-friendly interface for predictions
5. Handle edge cases and errors gracefully

## Learning Objectives
- Integrate transformers, tokenizers, and datasets libraries
- Build a complete ML pipeline, from data loading to a reusable inference class
- Compare different models systematically
- Create robust, production-ready code
- Visualize and interpret results

In [None]:
# Import all necessary libraries
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    pipeline, Trainer, TrainingArguments
)
from datasets import load_dataset, Dataset, DatasetDict
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from collections import Counter
import time
import os
import warnings
warnings.filterwarnings('ignore')

# Load environment variables from .env.local for local development
try:
    from dotenv import load_dotenv
    load_dotenv('.env.local', override=True)
    print("Environment variables loaded from .env.local")
except ImportError:
    print("python-dotenv not installed, skipping .env.local loading")

# Credential management function
from typing import Optional

def get_api_key(key_name: str) -> Optional[str]:
    """Get API key from environment or Colab secrets."""
    try:
        # Try to import Colab userdata (only available in Colab)
        from google.colab import userdata
        return userdata.get(key_name)
    except Exception:  # covers ImportError outside Colab and missing-secret errors inside it
        # Fall back to local environment variable
        api_key = os.getenv(key_name)
        if not api_key:
            print(f"Info: {key_name} not found. Public models will work without authentication.")
            return None
        return api_key

# Device detection function
def get_device():
    """Get the best available device for training/inference."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        return torch.device("mps")
    else:
        return torch.device("cpu")

# Set up plotting
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Setup authentication and device
hf_token = get_api_key('HF_TOKEN')
if hf_token:
    os.environ['HF_TOKEN'] = hf_token
    print("Hugging Face token configured")

device = get_device()
print(f"\n=== Setup Information ===")
print(f"All libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"Using device: {device}")
if device.type == 'cuda':
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
elif device.type == 'mps':
    print("Apple Silicon GPU (MPS) detected")

## Phase 1: Data Preparation Pipeline

Let's start by creating a robust data preparation pipeline.
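
One pitfall when subsampling for speed: if a split is stored sorted by label, its first N rows can contain a single class. The sketch below shows the effect on hypothetical toy rows and one way to draw a balanced sample instead. `balanced_sample` and the toy data are illustrative names, not part of the `datasets` API.

```python
import random
from collections import Counter

def balanced_sample(rows, per_class, seed=42):
    """Return a shuffled sample with up to `per_class` rows for each label."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row["label"], []).append(row)
    sample = []
    for label, group in sorted(by_label.items()):
        # Never ask for more rows than a class actually has
        sample.extend(rng.sample(group, min(per_class, len(group))))
    rng.shuffle(sample)  # avoid long single-label runs in the result
    return sample

# Hypothetical toy data, sorted by label (all 0s first, then all 1s)
rows = [{"text": f"review {i}", "label": 0} for i in range(50)]
rows += [{"text": f"review {i}", "label": 1} for i in range(50, 100)]

print(Counter(r["label"] for r in rows[:20]))  # only label 0 appears
print(Counter(r["label"] for r in balanced_sample(rows, per_class=10)))
```

With a Hugging Face `Dataset`, calling `shuffle(seed=...)` before `select(...)` addresses the same problem.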

In [None]:
class SentimentDataProcessor:
    """A complete data processing pipeline for sentiment analysis"""
    
    def __init__(self, dataset_name="imdb", sample_size=None):
        self.dataset_name = dataset_name
        self.sample_size = sample_size
        self.dataset = None
        self.processed_dataset = None
        
    def load_data(self):
        """Load dataset with error handling"""
        print(f"Loading {self.dataset_name} dataset...")
        
        try:
            self.dataset = load_dataset(self.dataset_name)
            
            if self.sample_size:
                print(f"Sampling {self.sample_size} examples for faster processing...")
                # Shuffle before taking the first N rows: a split may be stored sorted
                # by label, in which case an unshuffled prefix contains only one class.
                train = self.dataset['train'].shuffle(seed=42)
                self.dataset['train'] = train.select(range(min(self.sample_size, len(train))))
                if 'test' in self.dataset:
                    test = self.dataset['test'].shuffle(seed=42)
                    test_sample = min(self.sample_size // 2, len(test))
                    self.dataset['test'] = test.select(range(test_sample))
            
            print(f"✓ Dataset loaded successfully!")
            for split, data in self.dataset.items():
                print(f"  {split}: {len(data):,} examples")
                
        except Exception as e:
            print(f"✗ Error loading dataset: {e}")
            raise
    
    def analyze_data(self):
        """Analyze dataset characteristics"""
        if not self.dataset:
            raise ValueError("Dataset not loaded. Call load_data() first.")
        
        print("\nDataset Analysis:")
        print("=" * 20)
        
        train_data = self.dataset['train']
        
        # Label distribution
        labels = train_data['label']
        label_counts = Counter(labels)
        
        print(f"Features: {train_data.features}")
        print(f"Label distribution: {dict(label_counts)}")
        
        # Text statistics
        sample_texts = [ex['text'] for ex in train_data.select(range(min(1000, len(train_data))))]
        word_counts = [len(text.split()) for text in sample_texts]
        char_counts = [len(text) for text in sample_texts]
        
        print(f"\nText Statistics (sample of {len(sample_texts)}):")
        print(f"  Word count - Mean: {np.mean(word_counts):.1f}, Median: {np.median(word_counts):.1f}")
        print(f"  Char count - Mean: {np.mean(char_counts):.1f}, Median: {np.median(char_counts):.1f}")
        print(f"  Min words: {min(word_counts)}, Max words: {max(word_counts)}")
        
        return {
            'label_distribution': label_counts,
            'word_stats': {'mean': np.mean(word_counts), 'median': np.median(word_counts)},
            'sample_size': len(sample_texts)
        }
    
    def create_small_test_set(self, size=100):
        """Create a small balanced test set for quick evaluation"""
        if not self.dataset:
            raise ValueError("Dataset not loaded. Call load_data() first.")
        
        train_data = self.dataset['train']
        
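        # Get equal numbers of positive and negative examples,
        # capped by how many examples each class actually has
        positives = train_data.filter(lambda x: x['label'] == 1)
        negatives = train_data.filter(lambda x: x['label'] == 0)
        per_class = min(size // 2, len(positives), len(negatives))
        if per_class == 0:
            raise ValueError("Need at least one example of each class to build a balanced set.")
        positive_examples = positives.select(range(per_class))
        negative_examples = negatives.select(range(per_class))
        
        # Combine, then shuffle so that slices of the result stay mixed
        from datasets import concatenate_datasets
        small_test = concatenate_datasets([positive_examples, negative_examples]).shuffle(seed=42)
        
        print(f"Created balanced test set with {len(small_test)} examples")
        return small_test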

# Initialize data processor
data_processor = SentimentDataProcessor(sample_size=5000)  # Use smaller sample for speed
data_processor.load_data()
analysis_results = data_processor.analyze_data()

# Create test set
test_set = data_processor.create_small_test_set(200)

## Phase 2: Model Comparison Framework

Now let's create a framework to systematically compare different models.
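
Checkpoints do not necessarily share a label scheme: some report `POSITIVE`/`NEGATIVE`, some add a neutral class, and some report star ratings. A minimal sketch of one way to map such strings onto a common vocabulary follows. The accepted label sets, the star-rating format, and the reading of `LABEL_1` as positive are assumptions for illustration; check each model's `config.id2label` before relying on them.

```python
import re

def normalize_label(raw_label):
    """Map a model-specific label string to 'positive', 'negative', or 'neutral'."""
    label = raw_label.strip().upper()
    # Star ratings such as "1 star" or "5 stars" (assumed format)
    match = re.fullmatch(r"([1-5]) STARS?", label)
    if match:
        stars = int(match.group(1))
        if stars >= 4:
            return "positive"
        if stars <= 2:
            return "negative"
        return "neutral"
    if label in {"POSITIVE", "POS", "LABEL_1"}:
        return "positive"
    if label in {"NEGATIVE", "NEG", "LABEL_0"}:
        return "negative"
    if label in {"NEUTRAL", "NEU"}:
        return "neutral"
    # Fail loudly instead of guessing a class for an unknown label
    raise ValueError(f"Unrecognized label: {raw_label!r}")

for raw in ["POSITIVE", "negative", "LABEL_1", "5 stars", "1 star", "3 stars", "neutral"]:
    print(f"{raw:10} -> {normalize_label(raw)}")
```

Keeping "neutral" as its own value makes the later decision explicit: a binary benchmark has to either drop neutral predictions or count them as errors.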

In [None]:
class ModelComparison:
    """Framework for comparing different sentiment analysis models"""
    
    def __init__(self):
        self.models = {}
        self.tokenizers = {}
        self.pipelines = {}
        self.results = {}
        
        # Define models to compare
        self.model_configs = {
            "DistilBERT": "distilbert-base-uncased-finetuned-sst-2-english",
            "RoBERTa": "cardiffnlp/twitter-roberta-base-sentiment-latest",
            "BERT": "nlptown/bert-base-multilingual-uncased-sentiment"
        }
    
    def load_models(self):
        """Load all models and tokenizers"""
        print("Loading models...")
        
        for name, model_name in self.model_configs.items():
            try:
                print(f"  Loading {name} ({model_name})...")
                
                # Create pipeline (easiest for comparison)
                self.pipelines[name] = pipeline(
                    "sentiment-analysis",
                    model=model_name,
                    tokenizer=model_name,
                    device=0 if device.type == 'cuda' else -1
                )
                
                print(f"    ✓ {name} loaded successfully")
                
            except Exception as e:
                print(f"    ✗ Failed to load {name}: {e}")
                continue
        
        print(f"\nLoaded {len(self.pipelines)} models successfully")
    
    def evaluate_models(self, test_dataset, max_examples=100):
        """Evaluate all models on test dataset"""
        print(f"\nEvaluating models on {min(max_examples, len(test_dataset))} examples...")
        
        # Prepare test data
        test_examples = test_dataset.select(range(min(max_examples, len(test_dataset))))
        texts = [ex['text'] for ex in test_examples]
        true_labels = [ex['label'] for ex in test_examples]
        
        for model_name, pipeline_model in self.pipelines.items():
            print(f"\n  Evaluating {model_name}...")
            
            try:
                start_time = time.time()
                
                # Get predictions; truncate long reviews to the models' 512-token limit
                predictions = pipeline_model(texts, truncation=True, max_length=512)
                
                inference_time = time.time() - start_time
                
                # Convert predictions to binary format
                pred_labels = self._convert_predictions(predictions, model_name)
                
                # Calculate metrics
                accuracy = accuracy_score(true_labels, pred_labels)
                
                # Store results
                self.results[model_name] = {
                    'accuracy': accuracy,
                    'inference_time': inference_time,
                    'predictions': pred_labels,
                    'true_labels': true_labels,
                    'raw_predictions': predictions
                }
                
                print(f"    ✓ Accuracy: {accuracy:.3f}")
                print(f"    ✓ Inference time: {inference_time:.2f}s")
                print(f"    ✓ Speed: {len(texts)/inference_time:.1f} examples/second")
                
            except Exception as e:
                print(f"    ✗ Error evaluating {model_name}: {e}")
                continue
    
    def _convert_predictions(self, predictions, model_name):
        """Convert model predictions to binary labels (0/1)"""
        pred_labels = []
        
        for pred in predictions:
            label = pred['label'].upper()
            
            # Handle different label formats
            if label in ['POSITIVE', 'POS', '1', 'LABEL_1']:
                pred_labels.append(1)
            elif label in ['NEGATIVE', 'NEG', '0', 'LABEL_0']:
                pred_labels.append(0)
            elif label.endswith(('STAR', 'STARS')) and label[0].isdigit():
                # Star ratings such as "4 stars": treat 4-5 as positive, 1-3 as negative
                pred_labels.append(1 if int(label[0]) >= 4 else 0)
            else:
                # Neutral or unknown labels have no binary equivalent; count them as
                # negative so they stay in the metrics (this lowers accuracy for such models)
                pred_labels.append(0)
        
        return pred_labels
    
    def visualize_results(self):
        """Create visualizations of model comparison results"""
        if not self.results:
            print("No results to visualize. Run evaluate_models() first.")
            return
        
        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        fig.suptitle('Model Comparison Results', fontsize=16)
        
        # 1. Accuracy comparison
        models = list(self.results.keys())
        accuracies = [self.results[model]['accuracy'] for model in models]
        
        axes[0, 0].bar(models, accuracies, alpha=0.7)
        axes[0, 0].set_title('Model Accuracy Comparison')
        axes[0, 0].set_ylabel('Accuracy')
        axes[0, 0].set_ylim(0, 1)
        
        # Add value labels on bars
        for i, v in enumerate(accuracies):
            axes[0, 0].text(i, v + 0.01, f'{v:.3f}', ha='center', va='bottom')
        
        # 2. Speed comparison
        speeds = [len(self.results[model]['predictions'])/self.results[model]['inference_time'] 
                 for model in models]
        
        axes[0, 1].bar(models, speeds, alpha=0.7, color='orange')
        axes[0, 1].set_title('Inference Speed Comparison')
        axes[0, 1].set_ylabel('Examples/Second')
        
        # 3. Confusion matrix for best model
        best_model = max(models, key=lambda x: self.results[x]['accuracy'])
        cm = confusion_matrix(self.results[best_model]['true_labels'], 
                             self.results[best_model]['predictions'])
        
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[1, 0])
        axes[1, 0].set_title(f'Confusion Matrix - {best_model}')
        axes[1, 0].set_xlabel('Predicted')
        axes[1, 0].set_ylabel('Actual')
        
        # 4. Accuracy vs Speed trade-off
        axes[1, 1].scatter(speeds, accuracies, s=100, alpha=0.7)
        
        for i, model in enumerate(models):
            axes[1, 1].annotate(model, (speeds[i], accuracies[i]), 
                               xytext=(5, 5), textcoords='offset points')
        
        axes[1, 1].set_xlabel('Speed (examples/second)')
        axes[1, 1].set_ylabel('Accuracy')
        axes[1, 1].set_title('Accuracy vs Speed Trade-off')
        
        plt.tight_layout()
        plt.show()
        
        # Print detailed results
        print(f"\nDetailed Results:")
        print("=" * 50)
        for model in models:
            results = self.results[model]
            print(f"\n{model}:")
            print(f"  Accuracy: {results['accuracy']:.3f}")
            print(f"  Speed: {len(results['predictions'])/results['inference_time']:.1f} examples/sec")
            print(f"  Inference time: {results['inference_time']:.2f}s")

# Initialize and run model comparison
model_comparison = ModelComparison()
model_comparison.load_models()
model_comparison.evaluate_models(test_set, max_examples=100)
model_comparison.visualize_results()

## Phase 3: Advanced Analysis and Error Cases

Let's analyze model behavior on different types of inputs and edge cases.
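
Agreement between models can be summarized once every prediction is normalized to the same vocabulary. The following sketch computes the share of texts on which all models agree, plus the majority label per text. The function name and the sample predictions are hypothetical.

```python
from collections import Counter

def agreement_summary(predictions_by_model):
    """predictions_by_model maps model name -> list of normalized labels (same text order)."""
    columns = list(zip(*predictions_by_model.values()))  # one tuple of labels per text
    per_text = []
    for labels in columns:
        top_label, votes = Counter(labels).most_common(1)[0]
        per_text.append({
            "unanimous": votes == len(labels),
            "majority": top_label,
            "votes": votes,
        })
    rate = sum(item["unanimous"] for item in per_text) / len(per_text) if per_text else 0.0
    return rate, per_text

# Hypothetical normalized outputs for four texts
preds = {
    "model_a": ["positive", "negative", "positive", "negative"],
    "model_b": ["positive", "negative", "negative", "negative"],
    "model_c": ["positive", "positive", "negative", "negative"],
}
rate, details = agreement_summary(preds)
print(f"Unanimous agreement: {rate:.0%}")  # 50%
for i, item in enumerate(details):
    print(i, item)
```

Reporting the vote count alongside the majority label distinguishes a 2-to-1 split from full agreement, which a plain agree/disagree flag hides.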

In [None]:
class SentimentAnalysisAnalyzer:
    """Advanced analysis of sentiment models"""
    
    def __init__(self, model_comparison):
        self.model_comparison = model_comparison
        self.edge_cases = self._create_edge_cases()
    
    def _create_edge_cases(self):
        """Create challenging test cases"""
        return {
            "Short texts": [
                "Good", "Bad", "OK", "Meh", "Great!", "Awful"
            ],
            "Sarcastic/Ironic": [
                "Oh great, another delay...",
                "Just what I needed today",
                "Perfect timing as always",
                "Wow, such amazing service"
            ],
            "Neutral/Mixed": [
                "It was okay, nothing special",
                "Some good points, some bad ones",
                "Average product, average price",
                "Could be better, could be worse"
            ],
            "Emotional/Strong": [
                "I absolutely LOVE this product!!!",
                "This is the WORST thing I've ever bought",
                "Amazing amazing amazing!!!",
                "Terrible terrible terrible"
            ],
            "Long/Complex": [
                "While the initial experience was somewhat disappointing due to shipping delays, the actual product quality exceeded my expectations and the customer service team was very helpful in resolving my concerns.",
                "I have mixed feelings about this purchase because although the price was reasonable and the features are adequate for basic use, the build quality feels somewhat cheap and I'm not sure it will last very long."
            ]
        }
    
    def test_edge_cases(self):
        """Test models on edge cases"""
        print("Testing Edge Cases:")
        print("=" * 30)
        
        results = {}
        
        for category, texts in self.edge_cases.items():
            print(f"\n{category}:")
            print("-" * len(category))
            
            category_results = {}
            
            for model_name, pipeline_model in self.model_comparison.pipelines.items():
                try:
                    predictions = pipeline_model(texts)
                    category_results[model_name] = predictions
                except Exception as e:
                    print(f"  Error with {model_name}: {e}")
                    continue
            
            # Display results for each text
            for i, text in enumerate(texts):
                print(f"\n  Text: '{text}'")
                
                for model_name in category_results:
                    pred = category_results[model_name][i]
                    label = pred['label']
                    score = pred['score']
                    print(f"    {model_name:12}: {label:12} ({score:.3f})")
            
            results[category] = category_results
        
        return results
    
    def analyze_agreement(self, edge_case_results):
        """Analyze where models agree/disagree"""
        print("\n\nModel Agreement Analysis:")
        print("=" * 30)
        
        model_names = list(self.model_comparison.pipelines.keys())
        
        for category, results in edge_case_results.items():
            print(f"\n{category}:")
            
            texts = self.edge_cases[category]
            agreements = []
            
            for i in range(len(texts)):
                # Get predictions for this text from all models
                text_predictions = []
                for model in model_names:
                    if model in results:
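                        label = results[model][i]['label'].upper()
                        # Normalize labels; star ratings of 4 or 5 count as positive,
                        # everything else (including neutral) is grouped as negative
                        if ('POS' in label or label in ['1', 'LABEL_1']
                                or label.startswith(('4 STAR', '5 STAR'))):
                            text_predictions.append('POSITIVE')
                        else:
                            text_predictions.append('NEGATIVE')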
                
                # Check if all models agree
                if len(set(text_predictions)) == 1:
                    agreements.append("AGREE")
                    consensus = text_predictions[0]
                else:
                    agreements.append("DISAGREE")
                    consensus = "MIXED"
                
                preview = texts[i] if len(texts[i]) <= 50 else texts[i][:50] + '...'
                print(f"  '{preview}': {agreements[-1]} ({consensus})")
            
            agreement_rate = agreements.count("AGREE") / len(agreements)
            print(f"  → Agreement rate: {agreement_rate:.1%}")
    
    def create_interactive_demo(self):
        """Create a simple interactive demo"""
        print("\n" + "=" * 50)
        print("INTERACTIVE SENTIMENT ANALYSIS DEMO")
        print("=" * 50)
        print("Running predefined sample texts (no user input is read)")
        
        # A notebook run is non-interactive, so the demo uses predefined examples
        demo_texts = [
            "I love this product!",
            "This is terrible",
            "It's okay, nothing special",
            "Best purchase ever!!!",
            "Waste of money"
        ]
        
        print("\nDemo with sample texts:")
        for text in demo_texts:
            print(f"\nInput: '{text}'")
            print("-" * 40)
            
            for model_name, pipeline_model in self.model_comparison.pipelines.items():
                try:
                    result = pipeline_model(text)[0]
                    label = result['label']
                    confidence = result['score']
                    
                    # Add emoji for fun
                    emoji = "😊" if "POS" in label.upper() else "😞"
                    
                    print(f"  {model_name:12}: {label:12} {emoji} (confidence: {confidence:.3f})")
                    
                except Exception as e:
                    print(f"  {model_name:12}: Error - {e}")

# Run advanced analysis
analyzer = SentimentAnalysisAnalyzer(model_comparison)
edge_case_results = analyzer.test_edge_cases()
analyzer.analyze_agreement(edge_case_results)
analyzer.create_interactive_demo()

## Phase 4: Tokenization Deep Dive

Let's analyze how different tokenizers handle the same text.
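
The efficiency analysis in this phase reports characters per token as the mean of per-text ratios. That number can differ noticeably from the pooled ratio (total characters divided by total tokens), because short texts weigh as much as long ones in a per-text mean. A small sketch of both follows, using whitespace splitting as a stand-in tokenizer; that stand-in is an assumption for illustration only.

```python
def chars_per_token(texts, tokenize):
    """Return (mean of per-text ratios, ratio of totals) in characters per token."""
    pairs = [(len(t), len(tokenize(t))) for t in texts]
    pairs = [(c, n) for c, n in pairs if n > 0]  # skip texts that yield no tokens
    if not pairs:
        return 0.0, 0.0
    per_text_mean = sum(c / n for c, n in pairs) / len(pairs)
    pooled = sum(c for c, _ in pairs) / sum(n for _, n in pairs)
    return per_text_mean, pooled

# "a b c d": 7 characters, 4 tokens; "extraordinary": 13 characters, 1 token
texts = ["a b c d", "extraordinary"]
per_text_mean, pooled = chars_per_token(texts, str.split)
print(f"Mean of per-text ratios: {per_text_mean:.3f}")  # 7.375
print(f"Pooled ratio:            {pooled:.3f}")         # 4.000
```

To use a real tokenizer, pass a callable such as `lambda t: tokenizer.encode(t, add_special_tokens=False)` as `tokenize`.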

In [None]:
def tokenization_comparison_analysis():
    """Compare how different tokenizers handle the same texts"""
    
    print("Tokenization Comparison Analysis:")
    print("=" * 35)
    
    # Load different tokenizers
    tokenizer_configs = {
        "DistilBERT": "distilbert-base-uncased",
        "RoBERTa": "roberta-base",
        "BERT": "bert-base-uncased"
    }
    
    tokenizers = {}
    for name, model_name in tokenizer_configs.items():
        try:
            tokenizers[name] = AutoTokenizer.from_pretrained(model_name)
            print(f"✓ Loaded {name} tokenizer")
        except Exception as e:
            print(f"✗ Failed to load {name}: {e}")
    
    # Test texts with different characteristics
    test_texts = [
        "Hello world!",
        "The quick brown fox jumps over the lazy dog.",
        "antidisestablishmentarianism",  # Long word
        "COVID-19 pandemic",  # Numbers and hyphens
        "😊 I'm happy!",  # Emojis
        "user@domain.com visited https://example.com",  # URLs and emails
    ]
    
    # Compare tokenization
    for text in test_texts:
        print(f"\nText: '{text}'")
        print("-" * (len(text) + 8))
        
        for name, tokenizer in tokenizers.items():
            try:
                tokens = tokenizer.tokenize(text)
                token_ids = tokenizer.encode(text, add_special_tokens=False)
                
                print(f"  {name:12}: {len(tokens):2d} tokens - {tokens}")
                
                # Show how it gets reconstructed
                decoded = tokenizer.decode(token_ids)
                if decoded.strip() != text.strip():
                    print(f"  {' ':12}   Decoded: '{decoded}'")
                    
            except Exception as e:
                print(f"  {name:12}: Error - {e}")
    
    # Analyze tokenization efficiency
    print("\n\nTokenization Efficiency Analysis:")
    print("=" * 35)
    
    sample_texts = [ex['text'] for ex in test_set.select(range(50))]
    
    efficiency_data = []
    
    for name, tokenizer in tokenizers.items():
        token_counts = []
        char_counts = []
        
        for text in sample_texts:
            tokens = tokenizer.encode(text, add_special_tokens=False)
            token_counts.append(len(tokens))
            char_counts.append(len(text))
        
        # Calculate compression ratio (mean characters per token, skipping empty encodings)
        compression_ratio = np.mean([chars/tokens for chars, tokens in zip(char_counts, token_counts) if tokens > 0])
        
        efficiency_data.append({
            'tokenizer': name,
            'avg_tokens': np.mean(token_counts),
            'compression_ratio': compression_ratio,
            'vocab_size': len(tokenizer)
        })
        
        print(f"{name:12}: Avg tokens: {np.mean(token_counts):5.1f}, "
              f"Compression: {compression_ratio:.2f} chars/token, "
              f"Vocab: {len(tokenizer):,}")
    
    # Visualize tokenization comparison
    df = pd.DataFrame(efficiency_data)
    
    fig, axes = plt.subplots(1, 3, figsize=(18, 5))
    
    # Average tokens per text
    axes[0].bar(df['tokenizer'], df['avg_tokens'], alpha=0.7)
    axes[0].set_title('Average Tokens per Text')
    axes[0].set_ylabel('Number of Tokens')
    
    # Compression ratio
    axes[1].bar(df['tokenizer'], df['compression_ratio'], alpha=0.7, color='orange')
    axes[1].set_title('Compression Efficiency')
    axes[1].set_ylabel('Characters per Token')
    
    # Vocabulary size
    axes[2].bar(df['tokenizer'], df['vocab_size'], alpha=0.7, color='green')
    axes[2].set_title('Vocabulary Size')
    axes[2].set_ylabel('Number of Tokens')
    
    plt.tight_layout()
    plt.show()

tokenization_comparison_analysis()

## Phase 5: Production-Ready Sentiment Analyzer

Let's create a robust, production-ready sentiment analyzer with proper error handling.
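
The class in the next cell marks every text in a failed batch as an error. One alternative, sketched here with a hypothetical stand-in predictor, is to retry a failed batch item by item so that a single bad input does not discard its neighbors. `predict_with_fallback` and `fake_predict` are illustrative names, not part of any library.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def predict_with_fallback(texts, predict_batch, batch_size=32):
    """Run predict_batch per batch; on failure, retry that batch one item at a time."""
    results = []
    for batch in iter_batches(texts, batch_size):
        try:
            results.extend(predict_batch(batch))
        except Exception:
            for text in batch:
                try:
                    results.extend(predict_batch([text]))
                except Exception as exc:
                    # Only the offending item is recorded as an error
                    results.append({"label": "error", "error": str(exc)})
    return results

# Hypothetical predictor that rejects non-string input
def fake_predict(batch):
    if any(not isinstance(t, str) for t in batch):
        raise TypeError("expected str")
    return [{"label": "positive" if "good" in t else "negative"} for t in batch]

out = predict_with_fallback(["good", None, "bad", "good day"], fake_predict, batch_size=2)
print([r["label"] for r in out])  # ['positive', 'error', 'negative', 'positive']
```

The trade-off is extra inference calls on failure, which is usually acceptable when failures are rare.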

In [None]:
class ProductionSentimentAnalyzer:
    """Production-ready sentiment analyzer with comprehensive error handling"""
    
    def __init__(self, model_name="distilbert-base-uncased-finetuned-sst-2-english"):
        self.model_name = model_name
        self.pipeline = None
        self.tokenizer = None
        self.model = None
        self.max_length = 512
        self.batch_size = 32
        
        self._load_model()
    
    def _load_model(self):
        """Load model with error handling"""
        try:
            print(f"Loading model: {self.model_name}")
            self.pipeline = pipeline(
                "sentiment-analysis",
                model=self.model_name,
                device=0 if device.type == 'cuda' else -1,
                max_length=self.max_length,
                truncation=True
            )
            self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
            print("✓ Model loaded successfully")
            
        except Exception as e:
            print(f"✗ Error loading model: {e}")
            raise
    
    def preprocess_text(self, text):
        """Preprocess text with comprehensive cleaning"""
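        # Treat None as empty input rather than classifying the literal string "None"
        if text is None:
            text = ""
        elif not isinstance(text, str):
            text = str(text)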
        
        # Handle empty or whitespace-only text
        if not text.strip():
            return "[EMPTY]"
        
        # Basic cleaning (you can extend this)
        text = text.strip()
        
        # Check if text is too long
        tokens = self.tokenizer.encode(text, add_special_tokens=True)
        if len(tokens) > self.max_length:
            # Truncate and decode to maintain readable text
            truncated_tokens = tokens[:self.max_length-1] + [self.tokenizer.sep_token_id]
            text = self.tokenizer.decode(truncated_tokens, skip_special_tokens=True)
        
        return text
    
    def predict_single(self, text):
        """Predict sentiment for a single text"""
        try:
            # Preprocess
            processed_text = self.preprocess_text(text)
            
            # Get prediction
            result = self.pipeline(processed_text)[0]
            
            # Normalize output
            label = result['label'].upper()
            confidence = result['score']
            
            # Convert to standardized format
            if 'POS' in label or label in ['POSITIVE', '1', 'LABEL_1']:
                sentiment = 'positive'
            else:
                sentiment = 'negative'
            score = confidence
            
            return {
                'text': text,
                'sentiment': sentiment,
                'confidence': score,
                'processed_text': processed_text,
                'raw_output': result
            }
            
        except Exception as e:
            return {
                'text': text,
                'sentiment': 'error',
                'confidence': 0.0,
                'error': str(e)
            }
    
    def predict_batch(self, texts):
        """Predict sentiment for a batch of texts"""
        results = []
        
        # Process in batches for memory efficiency
        for i in range(0, len(texts), self.batch_size):
            batch = texts[i:i + self.batch_size]
            
            try:
                # Preprocess batch
                processed_batch = [self.preprocess_text(text) for text in batch]
                
                # Get predictions
                batch_results = self.pipeline(processed_batch)
                
                # Process results
                for original_text, processed_text, result in zip(batch, processed_batch, batch_results):
                    label = result['label'].upper()
                    confidence = result['score']
                    
                    if 'POS' in label or label in ['POSITIVE', '1', 'LABEL_1']:
                        sentiment = 'positive'
                    else:
                        sentiment = 'negative'
                    
                    results.append({
                        'text': original_text,
                        'sentiment': sentiment,
                        'confidence': confidence,
                        'processed_text': processed_text
                    })
                    
            except Exception as e:
                # Handle batch errors
                for text in batch:
                    results.append({
                        'text': text,
                        'sentiment': 'error',
                        'confidence': 0.0,
                        'error': str(e)
                    })
        
        return results
    
    def analyze_dataset(self, dataset, text_column='text', sample_size=None):
        """Analyze sentiment for an entire dataset"""
        print(f"Analyzing dataset with {len(dataset)} examples...")
        
        if sample_size:
            dataset = dataset.select(range(min(sample_size, len(dataset))))
            print(f"Using sample of {len(dataset)} examples")
        
        texts = [ex[text_column] for ex in dataset]
        
        start_time = time.time()
        results = self.predict_batch(texts)
        processing_time = time.time() - start_time
        
        # Calculate statistics
        sentiments = [r['sentiment'] for r in results if r['sentiment'] != 'error']
        sentiment_counts = Counter(sentiments)
        
        errors = [r for r in results if r['sentiment'] == 'error']
        
        print(f"\nAnalysis Results:")
        print(f"  Processing time: {processing_time:.2f}s")
        print(f"  Speed: {len(texts)/processing_time:.1f} examples/second")
        print(f"  Sentiment distribution: {dict(sentiment_counts)}")
        print(f"  Errors: {len(errors)}")
        
        if errors:
            print(f"  Sample errors: {[e.get('error', 'Unknown') for e in errors[:3]]}")
        
        return {
            'results': results,
            'sentiment_distribution': sentiment_counts,
            'processing_time': processing_time,
            'errors': errors
        }
    
    def get_model_info(self):
        """Get information about the loaded model"""
        return {
            'model_name': self.model_name,
            'max_length': self.max_length,
            'batch_size': self.batch_size,
            'device': str(device),
            'tokenizer_vocab_size': len(self.tokenizer) if self.tokenizer else None
        }

# Create production analyzer
production_analyzer = ProductionSentimentAnalyzer()

# Test with various inputs
test_cases = [
    "I love this product!",  # Normal positive
    "This is terrible",      # Normal negative  
    "",                      # Empty
    None,                    # None
    123,                     # Number
    "   ",                   # Whitespace only
    "A" * 1000,             # Very long text
    "😊 Great! 👍",          # With emojis
]

print("Testing Production Analyzer:")
print("=" * 30)

for test_text in test_cases:
    result = production_analyzer.predict_single(test_text)
    print(f"Input: {repr(test_text)}")
    print(f"  → Sentiment: {result['sentiment']}, Confidence: {result['confidence']:.3f}")
    if 'error' in result:
        print(f"  → Error: {result['error']}")
    print()

# Test batch processing
print("\nBatch Processing Test:")
batch_texts = [ex['text'] for ex in test_set.select(range(20))]
batch_results = production_analyzer.predict_batch(batch_texts)

positive_count = sum(1 for r in batch_results if r['sentiment'] == 'positive')
negative_count = sum(1 for r in batch_results if r['sentiment'] == 'negative')

print(f"Processed {len(batch_results)} texts:")
print(f"  Positive: {positive_count}")
print(f"  Negative: {negative_count}")

# Show model info
print(f"\nModel Information:")
model_info = production_analyzer.get_model_info()
for key, value in model_info.items():
    print(f"  {key}: {value}")

## Phase 6: Final Performance Benchmark

Let's create a comprehensive benchmark of our system.
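
As a rough sketch of the measurement the benchmark relies on, throughput is the number of examples divided by wall-clock time. The helper below is hypothetical (`measure_throughput` and the stand-in `fake_predict` are not part of this notebook). It uses `time.perf_counter`, which is generally better suited to interval timing than `time.time`.

```python
import time

def measure_throughput(predict_fn, texts):
    """Return (elapsed_seconds, examples_per_second) for one call of predict_fn(texts)."""
    start = time.perf_counter()
    predict_fn(texts)
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on very fast calls or coarse clocks
    rate = len(texts) / elapsed if elapsed > 0 else float("inf")
    return elapsed, rate

def fake_predict(texts):
    # Stand-in for a real model call (assumption for illustration only)
    return [{"sentiment": "positive", "confidence": 1.0} for _ in texts]

elapsed, rate = measure_throughput(fake_predict, ["good", "bad", "fine"])
print(f"{elapsed:.6f}s, {rate:.1f} examples/second")
```

In a real run, `fake_predict` would be replaced by a call such as `production_analyzer.predict_batch`.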

In [None]:
def comprehensive_benchmark():
    """Run comprehensive benchmark of the sentiment analysis system"""
    
    print("COMPREHENSIVE SENTIMENT ANALYSIS BENCHMARK")
    print("=" * 50)
    
    # Test different dataset sizes
    test_sizes = [10, 50, 100, 500]
    
    benchmark_results = []
    
    for size in test_sizes:
        print(f"\nBenchmarking with {size} examples...")
        
        # Get test data
        test_data = test_set.select(range(min(size, len(test_set))))
        
        # Run analysis
        start_time = time.time()
        results = production_analyzer.analyze_dataset(
            test_data, 
            sample_size=size
        )
        total_time = time.time() - start_time
        
        # Calculate accuracy if we have true labels
        if 'label' in test_data[0]:
            true_labels = [ex['label'] for ex in test_data]
            pred_labels = [
                1 if r['sentiment'] == 'positive' else 0 
                for r in results['results']
                if r['sentiment'] != 'error'
            ]
            
            if len(pred_labels) == len(true_labels):
                accuracy = accuracy_score(true_labels, pred_labels)
            else:
                accuracy = None
        else:
            accuracy = None
        
        benchmark_results.append({
            'size': size,
            'total_time': total_time,
            'examples_per_second': size / total_time,
            'accuracy': accuracy,
            'errors': len(results['errors']),
            'positive_ratio': results['sentiment_distribution'].get('positive', 0) / size
        })
        
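        # Format accuracy separately: a conditional expression is not valid inside a format spec
        accuracy_str = f"{accuracy:.3f}" if accuracy is not None else "N/A"
        print(f"  Time: {total_time:.2f}s, Speed: {size/total_time:.1f} ex/s, Accuracy: {accuracy_str}")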
    
    # Create benchmark visualization
    df_benchmark = pd.DataFrame(benchmark_results)
    
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    fig.suptitle('Sentiment Analysis System Benchmark', fontsize=16)
    
    # Processing speed vs dataset size
    axes[0, 0].plot(df_benchmark['size'], df_benchmark['examples_per_second'], 'bo-')
    axes[0, 0].set_xlabel('Dataset Size')
    axes[0, 0].set_ylabel('Examples per Second')
    axes[0, 0].set_title('Processing Speed vs Dataset Size')
    axes[0, 0].grid(True, alpha=0.3)
    
    # Total processing time
    axes[0, 1].plot(df_benchmark['size'], df_benchmark['total_time'], 'ro-')
    axes[0, 1].set_xlabel('Dataset Size')
    axes[0, 1].set_ylabel('Total Time (seconds)')
    axes[0, 1].set_title('Total Processing Time')
    axes[0, 1].grid(True, alpha=0.3)
    
    # Accuracy (if available)
    if any(r['accuracy'] is not None for r in benchmark_results):
        valid_accuracy = [(r['size'], r['accuracy']) for r in benchmark_results if r['accuracy'] is not None]
        sizes, accuracies = zip(*valid_accuracy)
        axes[1, 0].plot(sizes, accuracies, 'go-')
        axes[1, 0].set_xlabel('Dataset Size')
        axes[1, 0].set_ylabel('Accuracy')
        axes[1, 0].set_title('Accuracy vs Dataset Size')
        axes[1, 0].set_ylim(0, 1)
        axes[1, 0].grid(True, alpha=0.3)
    
    # Error rate
    error_rates = [r['errors'] / r['size'] * 100 for r in benchmark_results]
    axes[1, 1].bar(df_benchmark['size'], error_rates, alpha=0.7)
    axes[1, 1].set_xlabel('Dataset Size')
    axes[1, 1].set_ylabel('Error Rate (%)')
    axes[1, 1].set_title('Error Rate vs Dataset Size')
    
    plt.tight_layout()
    plt.show()
    
    # Summary statistics
    print(f"\n\nBenchmark Summary:")
    print("=" * 20)
    print(f"Average processing speed: {np.mean([r['examples_per_second'] for r in benchmark_results]):.1f} examples/second")
    
    if any(r['accuracy'] is not None for r in benchmark_results):
        valid_accuracies = [r['accuracy'] for r in benchmark_results if r['accuracy'] is not None]
        print(f"Average accuracy: {np.mean(valid_accuracies):.3f}")
    
    total_errors = sum(r['errors'] for r in benchmark_results)
    total_examples = sum(r['size'] for r in benchmark_results)
    print(f"Overall error rate: {total_errors/total_examples*100:.2f}%")
    
    return benchmark_results

# Run comprehensive benchmark
benchmark_results = comprehensive_benchmark()
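
One subtlety in the benchmark above: predictions with `sentiment == 'error'` are filtered out before the comparison, so accuracy is reported as `None` whenever any example fails. A possible alternative, sketched here with a hypothetical helper `accuracy_on_valid`, pairs each prediction with its label first and scores only the pairs that succeeded. It assumes the same mapping as the benchmark code (label `1` for positive, `0` otherwise), and the demo data is made up for illustration.

```python
def accuracy_on_valid(true_labels, results):
    """Accuracy over examples whose prediction did not error; None if none succeeded."""
    pairs = [
        (label, 1 if r["sentiment"] == "positive" else 0)
        for label, r in zip(true_labels, results)
        if r["sentiment"] != "error"
    ]
    if not pairs:
        return None
    correct = sum(1 for label, pred in pairs if label == pred)
    return correct / len(pairs)

# Made-up predictions: the second one failed and is skipped together with its label
demo_results = [
    {"sentiment": "positive"},
    {"sentiment": "error"},
    {"sentiment": "negative"},
    {"sentiment": "positive"},
]
demo_labels = [1, 1, 0, 0]
demo_accuracy = accuracy_on_valid(demo_labels, demo_results)
print(f"Accuracy on valid predictions: {demo_accuracy:.3f}")
```

Whether to skip failed examples or count them as wrong is a design choice; skipping them keeps accuracy and error rate as two separate numbers.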

## Project Summary and Key Learnings

Congratulations! You've built a complete sentiment analysis system that integrates the transformers, tokenizers, and datasets libraries.

### 🎯 **What We Accomplished**

1. **Data Pipeline**: Created a robust data processing pipeline using the datasets library
2. **Model Comparison**: Systematically compared different transformer models
3. **Tokenization Analysis**: Examined how different tokenizers handle text
4. **Error Handling**: Built production-ready code with comprehensive error handling
5. **Performance Optimization**: Implemented batch processing and caching
6. **Comprehensive Evaluation**: Created thorough benchmarking and visualization

### 📚 **Key Integration Points**

- **Transformers + Datasets**: Seamless integration for model training and evaluation
- **Tokenizers + Models**: Understanding how tokenization affects model performance
- **Datasets + Processing**: Efficient data handling for large-scale processing
- **All Three Together**: Building end-to-end ML pipelines

### 🚀 **Production-Ready Features**

- ✅ Error handling for edge cases (empty text, None values, etc.)
- ✅ Batch processing for efficiency
- ✅ Memory optimization
- ✅ Performance benchmarking
- ✅ Comprehensive logging and monitoring
- ✅ Flexible model switching
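
The edge-case handling listed above can be sketched as a small validation step that runs before the model is called. The function name `validate_text` is hypothetical, and the returned keys mirror the `sentiment`, `confidence`, and `error` fields printed in this notebook's tests. This is an illustration under those assumptions, not the `ProductionSentimentAnalyzer` implementation.

```python
def validate_text(text):
    """Return (cleaned_text, None) if the input is usable, else (None, error_result)."""
    if not isinstance(text, str):
        # Covers None, numbers, and any other non-string input
        return None, {
            "sentiment": "error",
            "confidence": 0.0,
            "error": f"Expected str, got {type(text).__name__}",
        }
    cleaned = text.strip()
    if not cleaned:
        return None, {
            "sentiment": "error",
            "confidence": 0.0,
            "error": "Empty or whitespace-only text",
        }
    return cleaned, None

for sample in ["I love this product!", "", None, 123, "   "]:
    cleaned, error = validate_text(sample)
    print(repr(sample), "->", cleaned if error is None else error["error"])
```

Returning an error record instead of raising keeps batch results the same length as the input, which makes it easier to line predictions up with their source texts.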

### 🔍 **Key Insights Discovered**

1. **Model Performance Varies**: Different models excel at different types of text
2. **Tokenization Matters**: Tokenizer choice significantly impacts results
3. **Batch Processing**: Batching inputs typically gives much higher throughput than predicting one text at a time
4. **Edge Cases**: Production systems must handle all input types gracefully
5. **Speed vs Accuracy**: Larger models tend to be more accurate but slower, so choose based on your latency and quality requirements
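
The batch processing insight rests on splitting inputs into fixed-size groups so the model handles several texts per forward pass. A minimal sketch of that chunking step, using a hypothetical `iter_batches` helper (the batch size of 3 is arbitrary):

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = [f"review {i}" for i in range(8)]
batches = list(iter_batches(texts, batch_size=3))
print([len(b) for b in batches])  # [3, 3, 2]
```

Each batch would then be tokenized with padding and passed to the model in a single call; the last batch may be smaller than the rest.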

### 📈 **What's Next?**

This mini-project provides a solid foundation for:
- **Fine-tuning** models on custom datasets (Notebook 05)
- **Advanced training techniques** (Notebook 06)
- **Specialized applications** like summarization (Notebook 07) and QA (Notebook 08)
- **Advanced techniques** like LoRA and RLHF (Notebooks 09-10)

### 🎓 **Skills Mastered**

- **Hugging Face Ecosystem**: Confident usage of transformers, tokenizers, and datasets
- **Production ML**: Building robust, scalable ML systems
- **Performance Optimization**: Efficient processing and resource management
- **Model Evaluation**: Comprehensive testing and benchmarking
- **Error Handling**: Graceful handling of edge cases and errors

**Great job!** You've successfully combined all the concepts from the first three notebooks into a comprehensive, production-ready sentiment analysis system. This foundation will serve you well as we dive into more advanced topics in the upcoming notebooks.

---

*Ready to continue your journey? Head to **Notebook 05: Fine-Tuning with Trainer API** to learn how to train your own models!*

---

## About the Author

**Vu Hung Nguyen** - AI Engineer & Researcher

Connect with me:
- 🌐 **Website**: [vuhung16au.github.io](https://vuhung16au.github.io/)
- 💼 **LinkedIn**: [linkedin.com/in/nguyenvuhung](https://www.linkedin.com/in/nguyenvuhung/)
- 💻 **GitHub**: [github.com/vuhung16au](https://github.com/vuhung16au/)

*This notebook is part of the [HF Transformer Trove](https://github.com/vuhung16au/hf-transformer-trove) educational series.*