[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vuhung16au/hf-transformer-trove/blob/main/examples/basic1.2/06-feature-extraction.ipynb)
[![View on GitHub](https://img.shields.io/badge/View_on-GitHub-blue?logo=github)](https://github.com/vuhung16au/hf-transformer-trove/blob/main/examples/basic1.2/06-feature-extraction.ipynb)

# 06 - Feature Extraction: Extract Vector Representations of Text

## 🎯 Learning Objectives
By the end of this notebook, you will understand:
- How to extract meaningful vector representations from text using transformers
- Different types of text embeddings: token-level, sentence-level, and document-level
- Techniques for pooling transformer outputs into fixed-size vectors
- Applications of text embeddings for similarity search and semantic analysis
- How to compare different models for feature extraction tasks

## 📋 Prerequisites
- Basic understanding of machine learning concepts
- Familiarity with Python and PyTorch
- Knowledge of NLP fundamentals
- Understanding of transformer architectures (refer to previous notebooks)

## 📚 What We'll Cover
1. **Introduction to Feature Extraction**: Understanding text embeddings and their applications
2. **Basic Feature Extraction**: Using pre-trained models to extract embeddings
3. **Pooling Strategies**: Different methods to create sentence-level representations
4. **Similarity Analysis**: Measuring semantic similarity with cosine similarity
5. **Model Comparison**: Comparing different architectures for feature extraction
6. **Advanced Techniques**: Using specialized sentence transformer models
7. **Semantic Search**: A practical application built on the extracted embeddings
8. **Visualization**: Plotting embeddings in 2D space
9. **Best Practices and Summary**: Key takeaways and recommendations


## Part 1: Introduction to Feature Extraction

**Feature extraction** from text involves converting human-readable text into numerical vectors (embeddings) that capture semantic meaning. These vectors can be used for:

- **Semantic Search**: Finding similar documents or sentences
- **Clustering**: Grouping similar texts together
- **Classification**: Using embeddings as input features for downstream tasks
- **Recommendation Systems**: Finding similar content based on text descriptions

### Why Use Transformer Models for Feature Extraction?

- **Contextual Understanding**: Transformers understand word meaning based on context
- **Rich Representations**: Capture complex semantic relationships
- **Transfer Learning**: Pre-trained models work well across different domains
- **Flexible Output**: Can extract features at token, sentence, or document level
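
To make the token, sentence, and document levels concrete, here is a minimal sketch that uses random arrays in place of real model outputs. The toy hidden size and the choice to average sentence vectors into a document vector are assumptions for illustration, not the only way to build document-level embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 8  # toy size for illustration; BERT-base uses 768

# Token level: one vector per token, so each sentence is a [seq_len, hidden_size] matrix.
# Random numbers stand in for real model outputs here.
sentence_token_vectors = [
    rng.normal(size=(5, hidden_size)),  # sentence with 5 tokens
    rng.normal(size=(9, hidden_size)),  # sentence with 9 tokens
    rng.normal(size=(3, hidden_size)),  # sentence with 3 tokens
]

# Sentence level: pool each token matrix into one fixed-size vector (mean pooling).
sentence_vectors = np.stack([tokens.mean(axis=0) for tokens in sentence_token_vectors])

# Document level: one simple option is to average the sentence vectors.
document_vector = sentence_vectors.mean(axis=0)

print("token level:   ", [t.shape for t in sentence_token_vectors])
print("sentence level:", sentence_vectors.shape)
print("document level:", document_vector.shape)
```

Whatever the sentence length, the sentence and document vectors end up with the same fixed size, which is what makes them usable for search and clustering.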

In [None]:
# Install required packages (uncomment if needed)
# !pip install transformers torch numpy matplotlib seaborn scikit-learn sentence-transformers

# Import essential libraries
import torch
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import (
    AutoModel, 
    AutoTokenizer, 
    pipeline
)
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings('ignore')

# For Google Colab compatibility
try:
    from google.colab import userdata
    COLAB_AVAILABLE = True
except ImportError:
    COLAB_AVAILABLE = False

print("📚 Libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"Running in Colab: {COLAB_AVAILABLE}")

In [None]:
# Device detection for optimal performance
def get_device():
    """
    Automatically detect and return the best available device.
    
    Priority: CUDA > MPS (Apple Silicon) > CPU
    
    Returns:
        torch.device: The optimal device for current hardware
    """
    if torch.cuda.is_available():
        device = torch.device("cuda")
        print(f"🚀 Using CUDA GPU: {torch.cuda.get_device_name()}")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
        print("🍎 Using Apple MPS for Apple Silicon optimization")
    else:
        device = torch.device("cpu")
        print("💻 Using CPU - consider GPU for better performance")
    
    return device

# Set up device
device = get_device()

## Part 2: Basic Feature Extraction with BERT

Let's start with extracting features using BERT, one of the most popular transformer models for feature extraction.

In [None]:
# Load BERT model and tokenizer
model_name = "bert-base-uncased"
print(f"📥 Loading {model_name}...")

try:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model = model.to(device)
    model.eval()  # Set to evaluation mode
    
    print(f"✅ Model loaded successfully")
    print(f"📊 Model size: {model.num_parameters():,} parameters")
    print(f"🎯 Max sequence length: {tokenizer.model_max_length}")
    
except Exception as e:
    print(f"❌ Error loading model: {e}")
    print("💡 Check internet connection or try a different model")

In [None]:
# Sample texts for feature extraction
sample_texts = [
    "The cat sat on the mat.",
    "A feline rested on the rug.",
    "The dog ran in the park.",
    "Machine learning is fascinating.",
    "Artificial intelligence transforms technology."
]

print("📝 Sample texts for feature extraction:")
for i, text in enumerate(sample_texts, 1):
    print(f"  {i}. {text}")

In [None]:
def extract_features_basic(text, model, tokenizer, device):
    """
    Extract basic features from text using a transformer model.
    
    Args:
        text: Input text string
        model: Transformer model
        tokenizer: Corresponding tokenizer
        device: Device to run on
    
    Returns:
        dict: Dictionary containing different types of embeddings
    """
    # Tokenize the input
    inputs = tokenizer(
        text, 
        return_tensors="pt", 
        padding=True, 
        truncation=True, 
        max_length=512
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    # Extract features without gradient computation
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Get different representations
    # last_hidden_state: [batch_size, seq_len, hidden_size]
    last_hidden_state = outputs.last_hidden_state
    
    # pooler_output: [batch_size, hidden_size]
    # For BERT this is the [CLS] hidden state passed through a dense layer and tanh,
    # not the raw [CLS] vector. Some models (e.g. DistilBERT) do not return it.
    pooler_output = getattr(outputs, 'pooler_output', None)
    
    return {
        'last_hidden_state': last_hidden_state.cpu(),
        'pooler_output': pooler_output.cpu() if pooler_output is not None else None,
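        # Decode from input_ids so [CLS] and [SEP] are included and the token
        # list lines up with the sequence dimension of last_hidden_state
        'tokens': tokenizer.convert_ids_to_tokens(inputs['input_ids'][0].tolist()),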
        'input_ids': inputs['input_ids'].cpu()
    }

# Extract features for first sample text
sample_text = sample_texts[0]
features = extract_features_basic(sample_text, model, tokenizer, device)

print(f"📊 Feature extraction results for: '{sample_text}'")
print(f"  Tokens: {features['tokens']}")
print(f"  Token count: {len(features['tokens'])}")
print(f"  Last hidden state shape: {features['last_hidden_state'].shape}")
if features['pooler_output'] is not None:
    print(f"  Pooler output shape: {features['pooler_output'].shape}")
    print(f"  Pooler output (first 10 dims): {features['pooler_output'][0][:10].numpy()}")

## Part 3: Pooling Strategies for Sentence Embeddings

When working with transformer models, we often need to convert the variable-length token representations into fixed-size sentence embeddings. Here are several popular pooling strategies:
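
Padding only matters once several texts of different lengths are batched together. The following sketch uses a hand-built toy tensor (not real model output) to show why mean pooling should use the attention mask: a plain mean over the sequence dimension also averages the padded positions.

```python
import torch

# Toy batch: 2 sequences, 4 positions, hidden size 3 (values 0..23 for easy checking).
# The second sequence has only 2 real tokens; its last 2 positions are padding.
hidden = torch.arange(24, dtype=torch.float32).reshape(2, 4, 3)
attention_mask = torch.tensor([[1, 1, 1, 1],
                               [1, 1, 0, 0]])

# Naive mean averages over every position, padding included.
naive_mean = hidden.mean(dim=1)

# Masked mean zeroes out padded positions and divides by the real token count.
mask = attention_mask.unsqueeze(-1).float()  # [batch, seq_len, 1]
masked_mean = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# Reference for the second sequence: average of its 2 real tokens only.
expected_second = hidden[1, :2].mean(dim=0)

print("naive: ", naive_mean[1])
print("masked:", masked_mean[1])
print("target:", expected_second)
```

For the fully unpadded first sequence both versions agree; they differ only where padding is present.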

In [None]:
def apply_pooling_strategies(last_hidden_state, attention_mask):
    """
    Apply different pooling strategies to create sentence embeddings.
    
    Args:
        last_hidden_state: Tensor of shape [batch_size, seq_len, hidden_size]
        attention_mask: Tensor of shape [batch_size, seq_len]
    
    Returns:
        dict: Dictionary with different pooled representations
    """
    # 1. CLS Token (first token) - common for BERT
    cls_embedding = last_hidden_state[:, 0, :]  # [batch_size, hidden_size]
    
    # 2. Mean Pooling - average over all tokens (excluding padding)
    # Expand attention mask to match hidden_state dimensions
    attention_mask_expanded = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    
    # Apply mask and compute mean
    masked_embeddings = last_hidden_state * attention_mask_expanded
    sum_embeddings = torch.sum(masked_embeddings, 1)
    sum_mask = torch.clamp(attention_mask_expanded.sum(1), min=1e-9)
    mean_embedding = sum_embeddings / sum_mask
    
    # 3. Max Pooling - take maximum value across sequence dimension
    # Set padded positions to large negative value before max pooling
    masked_embeddings_max = last_hidden_state.clone()
    masked_embeddings_max[attention_mask_expanded == 0] = -1e9
    max_embedding = torch.max(masked_embeddings_max, 1)[0]
    
    # 4. Min Pooling - take minimum value across sequence dimension
    masked_embeddings_min = last_hidden_state.clone()
    masked_embeddings_min[attention_mask_expanded == 0] = 1e9
    min_embedding = torch.min(masked_embeddings_min, 1)[0]
    
    return {
        'cls_token': cls_embedding,
        'mean_pooling': mean_embedding,
        'max_pooling': max_embedding,
        'min_pooling': min_embedding
    }

# Helper that embeds a list of texts using one pooling strategy
def extract_sentence_embeddings(texts, model, tokenizer, device, pooling_strategy='mean'):
    """
    Extract sentence embeddings using specified pooling strategy.
    
    Args:
        texts: List of text strings
        model: Transformer model
        tokenizer: Tokenizer
        device: Device
        pooling_strategy: 'cls', 'mean', 'max', or 'min'
    
    Returns:
        numpy.ndarray: Array of sentence embeddings
    """
    all_embeddings = []
    
    for text in texts:
        # Tokenize
        inputs = tokenizer(
            text, 
            return_tensors="pt", 
            padding=True, 
            truncation=True, 
            max_length=512
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}
        
        # Get model outputs
        with torch.no_grad():
            outputs = model(**inputs)
        
        # Apply pooling
        pooled = apply_pooling_strategies(
            outputs.last_hidden_state, 
            inputs['attention_mask']
        )
        
        # Select requested strategy
        if pooling_strategy == 'cls':
            embedding = pooled['cls_token']
        elif pooling_strategy == 'mean':
            embedding = pooled['mean_pooling']
        elif pooling_strategy == 'max':
            embedding = pooled['max_pooling']
        elif pooling_strategy == 'min':
            embedding = pooled['min_pooling']
        else:
            raise ValueError(f"Unknown pooling strategy: {pooling_strategy}")
        
        all_embeddings.append(embedding.cpu().numpy())
    
    return np.vstack(all_embeddings)

# Compare different pooling strategies
pooling_strategies = ['cls', 'mean', 'max', 'min']
pooling_results = {}

print("🔄 Comparing pooling strategies...")
for strategy in pooling_strategies:
    embeddings = extract_sentence_embeddings(
        sample_texts, 
        model, 
        tokenizer, 
        device, 
        pooling_strategy=strategy
    )
    pooling_results[strategy] = embeddings
    print(f"  {strategy.upper()} pooling shape: {embeddings.shape}")

print("✅ Pooling strategies comparison completed!")

## Part 4: Similarity Analysis

Now let's use the extracted features to compute semantic similarity between texts.
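
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction, not on magnitude. Here is a small worked sketch with hand-picked vectors; the helper name `cosine` is made up for this example.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])    # same direction as a, twice the length
c = np.array([-3.0, 0.0, 1.0])   # orthogonal to a: 1*(-3) + 2*0 + 3*1 = 0

print("same direction:    ", cosine(a, b))   # close to 1
print("orthogonal:        ", cosine(a, c))   # close to 0
print("opposite direction:", cosine(a, -a))  # close to -1
```

Because `b` is just a scaled copy of `a`, their similarity is (up to floating-point error) the maximum value of 1.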

In [None]:
def compute_similarity_matrix(embeddings):
    """
    Compute cosine similarity matrix between all pairs of embeddings.
    
    Args:
        embeddings: numpy array of shape [n_texts, embedding_dim]
    
    Returns:
        numpy.ndarray: Similarity matrix of shape [n_texts, n_texts]
    """
    return cosine_similarity(embeddings)

def visualize_similarity_matrix(similarity_matrix, texts, strategy_name):
    """
    Visualize similarity matrix as a heatmap.
    
    Args:
        similarity_matrix: Similarity matrix
        texts: List of text strings for labels
        strategy_name: Name of pooling strategy
    """
    plt.figure(figsize=(10, 8))
    
    # Create labels (truncated for readability)
    labels = [text[:30] + '...' if len(text) > 30 else text for text in texts]
    
    sns.heatmap(
        similarity_matrix, 
        annot=True, 
        fmt='.3f', 
        cmap='viridis', 
        xticklabels=labels, 
        yticklabels=labels,
        cbar_kws={'label': 'Cosine Similarity'}
    )
    
    plt.title(f'Text Similarity Matrix - {strategy_name.upper()} Pooling', 
              fontsize=14, fontweight='bold')
    plt.xlabel('Texts', fontweight='bold')
    plt.ylabel('Texts', fontweight='bold')
    plt.xticks(rotation=45, ha='right')
    plt.yticks(rotation=0)
    plt.tight_layout()
    plt.show()

# Analyze similarity for mean pooling (often a strong default for similarity tasks)
mean_embeddings = pooling_results['mean']
similarity_matrix = compute_similarity_matrix(mean_embeddings)

print("📊 Similarity Analysis (Mean Pooling):")
print("\nSimilarity Matrix:")
print(similarity_matrix)

# Print the similarity score for every pair of texts
n_texts = len(sample_texts)
for i in range(n_texts):
    for j in range(i+1, n_texts):
        similarity = similarity_matrix[i, j]
        print(f"\nTexts {i+1} & {j+1}: {similarity:.3f}")
        print(f"  '{sample_texts[i]}'")
        print(f"  '{sample_texts[j]}'")

# Visualize similarity matrix
visualize_similarity_matrix(similarity_matrix, sample_texts, 'mean')

## Part 5: Comparing Different Models

Different transformer models can produce embeddings of different quality. Let's compare several popular models.
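
Raw cosine scores from different models are not necessarily on the same scale, so it can help to compare how each model ranks the text pairs rather than the absolute numbers. The sketch below does that for two made-up similarity matrices; `pair_ranking` is a hypothetical helper, not part of any library.

```python
import numpy as np

def pair_ranking(similarity_matrix):
    """Return the (i, j) pairs with i < j, sorted from most to least similar."""
    n = similarity_matrix.shape[0]
    rows, cols = np.triu_indices(n, k=1)  # upper triangle, diagonal excluded
    order = np.argsort(-similarity_matrix[rows, cols], kind="stable")
    return [(int(rows[k]), int(cols[k])) for k in order]

# Made-up similarity matrices standing in for two different models.
model_a = np.array([[1.00, 0.95, 0.90],
                    [0.95, 1.00, 0.85],
                    [0.90, 0.85, 1.00]])
model_b = np.array([[1.00, 0.60, 0.30],
                    [0.60, 1.00, 0.10],
                    [0.30, 0.10, 1.00]])

print("model A ranking:", pair_ranking(model_a))
print("model B ranking:", pair_ranking(model_b))
```

In this toy case the two matrices have very different score ranges but rank the pairs identically, which the absolute values alone would hide.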

In [None]:
# Define models to compare
models_to_compare = {
    'BERT Base': 'bert-base-uncased',
    'DistilBERT': 'distilbert-base-uncased',
    'RoBERTa': 'roberta-base',
}

def load_model_safely(model_name):
    """
    Load model with error handling.
    
    Args:
        model_name: HuggingFace model name
    
    Returns:
        tuple: (tokenizer, model) or (None, None) if failed
    """
    try:
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModel.from_pretrained(model_name)
        model = model.to(device)
        model.eval()
        return tokenizer, model
    except Exception as e:
        print(f"❌ Failed to load {model_name}: {e}")
        return None, None

# Load models and extract embeddings
model_embeddings = {}
model_objects = {}

print("📥 Loading and comparing different models...")
for model_display_name, model_name in models_to_compare.items():
    print(f"\n🔄 Processing {model_display_name} ({model_name})...")
    
    # Use a separate name so the BERT `tokenizer` loaded in Part 2 is not overwritten
    model_tokenizer, model_obj = load_model_safely(model_name)
    if model_tokenizer is None or model_obj is None:
        print(f"⚠️  Skipping {model_display_name} due to loading error")
        continue
    
    # Extract embeddings
    try:
        embeddings = extract_sentence_embeddings(
            sample_texts, 
            model_obj, 
            model_tokenizer, 
            device, 
            pooling_strategy='mean'
        )
        model_embeddings[model_display_name] = embeddings
        model_objects[model_display_name] = (model_tokenizer, model_obj)
        
        print(f"  ✅ {model_display_name}: {embeddings.shape}")
        print(f"  📊 Parameters: {model_obj.num_parameters():,}")
        
    except Exception as e:
        print(f"  ❌ Failed to extract embeddings: {e}")

print(f"\n🎯 Successfully loaded {len(model_embeddings)} models")

In [None]:
# Compare similarity patterns across different models
def compare_model_similarities(model_embeddings, sample_texts):
    """
    Compare similarity patterns across different models.
    
    Args:
        model_embeddings: Dict of model embeddings
        sample_texts: List of sample texts
    """
    print("🔍 Comparing similarity patterns across models:")
    print("=" * 60)
    
    # Compare specific text pairs across models
    text_pairs_to_compare = [
        (0, 1),  # "cat sat" vs "feline rested" - should be similar
        (0, 2),  # "cat sat" vs "dog ran" - different animals
        (3, 4),  # "machine learning" vs "artificial intelligence" - similar domain
    ]
    
    for i, j in text_pairs_to_compare:
        print(f"\n📝 Comparing:")
        print(f"  Text {i+1}: '{sample_texts[i]}'")
        print(f"  Text {j+1}: '{sample_texts[j]}'")
        print(f"  Similarities by model:")
        
        for model_name, embeddings in model_embeddings.items():
            similarity_matrix = compute_similarity_matrix(embeddings)
            similarity = similarity_matrix[i, j]
            print(f"    {model_name:12}: {similarity:.4f}")

# Run model comparison
if model_embeddings:
    compare_model_similarities(model_embeddings, sample_texts)
else:
    print("⚠️  No models successfully loaded for comparison")

## Part 6: Advanced Feature Extraction with Sentence Transformers

Models trained specifically to produce sentence embeddings, such as those in the `sentence-transformers` library, usually give better similarity results than pooling the outputs of a general-purpose model.

In [None]:
# Try to use sentence-transformers if available
try:
    from sentence_transformers import SentenceTransformer
    SENTENCE_TRANSFORMERS_AVAILABLE = True
    print("✅ sentence-transformers library available")
except ImportError:
    SENTENCE_TRANSFORMERS_AVAILABLE = False
    print("⚠️  sentence-transformers not available.")
    print("💡 Install it with: pip install sentence-transformers")

if SENTENCE_TRANSFORMERS_AVAILABLE:
    # Load a sentence transformer model
    sentence_model_name = 'all-MiniLM-L6-v2'  # Fast and good quality
    print(f"📥 Loading sentence transformer: {sentence_model_name}")
    
    try:
        sentence_model = SentenceTransformer(sentence_model_name)
        
        # Extract sentence embeddings
        sentence_embeddings = sentence_model.encode(sample_texts)
        
        print(f"✅ Sentence embeddings shape: {sentence_embeddings.shape}")
        print(f"📊 Model max sequence length: {sentence_model.max_seq_length}")
        
        # Compare with our manual approach
        st_similarity_matrix = compute_similarity_matrix(sentence_embeddings)
        
        print("\n🔍 Sentence Transformer similarities:")
        for i in range(len(sample_texts)):
            for j in range(i+1, len(sample_texts)):
                similarity = st_similarity_matrix[i, j]
                print(f"  Texts {i+1} & {j+1}: {similarity:.4f}")
        
        # Visualize sentence transformer similarities
        visualize_similarity_matrix(st_similarity_matrix, sample_texts, 'Sentence Transformer')
        
    except Exception as e:
        print(f"❌ Error with sentence transformer: {e}")
        SENTENCE_TRANSFORMERS_AVAILABLE = False

if not SENTENCE_TRANSFORMERS_AVAILABLE:
    print("📝 Sentence transformers demo skipped - using manual approach instead")
    print("💡 Sentence transformers are specifically optimized for semantic similarity tasks")

## Part 7: Practical Application - Semantic Search

Let's implement a simple semantic search system using our extracted features.
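
The core of such a system is a top-k lookup over similarity scores. As a minimal sketch on toy 2D vectors: unit-normalize once so a dot product equals cosine similarity, then pick the k best scores. The function name `top_k_cosine` and the toy data are assumptions for illustration.

```python
import numpy as np

def top_k_cosine(query, doc_matrix, k=3):
    """Return (index, score) pairs for the k rows of doc_matrix most similar to query."""
    # Unit-normalize so that a plain dot product equals cosine similarity.
    docs = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = docs @ q
    k = min(k, len(scores))
    # argpartition selects the k best without sorting everything;
    # only those k candidates are then sorted by score.
    best = np.argpartition(-scores, k - 1)[:k]
    best = best[np.argsort(-scores[best])]
    return [(int(i), float(scores[i])) for i in best]

toy_docs = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0],
                     [-1.0, 0.0]])
toy_query = np.array([2.0, 0.1])

print(top_k_cosine(toy_query, toy_docs, k=2))
```

For a handful of documents a full sort is just as good; partial selection only starts to pay off as the collection grows.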

In [None]:
class SimpleSemanticSearch:
    """
    Simple semantic search system using pre-computed embeddings.
    """
    
    def __init__(self, documents, embeddings):
        """
        Initialize with documents and their embeddings.
        
        Args:
            documents: List of text documents
            embeddings: Corresponding embeddings array
        """
        self.documents = documents
        self.embeddings = embeddings
        
    def search(self, query_embedding, top_k=3):
        """
        Search for most similar documents.
        
        Args:
            query_embedding: Embedding of the query
            top_k: Number of top results to return
        
        Returns:
            list: List of (document, similarity_score) tuples
        """
        # Compute similarities
        similarities = cosine_similarity([query_embedding], self.embeddings)[0]
        
        # Get top-k indices
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        # Return results
        results = []
        for idx in top_indices:
            results.append((self.documents[idx], similarities[idx]))
        
        return results

# Create a larger document collection for search
document_collection = [
    "The cat sat on the mat in the living room.",
    "A feline rested comfortably on the soft rug.",
    "The dog ran quickly through the green park.",
    "Machine learning algorithms can solve complex problems.",
    "Artificial intelligence is transforming modern technology.",
    "Deep learning uses neural networks with many layers.",
    "Natural language processing helps computers understand text.",
    "The weather today is sunny and warm.",
    "Climate change affects global weather patterns.",
    "Python is a popular programming language for data science."
]

print("📚 Document collection for semantic search:")
for i, doc in enumerate(document_collection, 1):
    print(f"  {i:2d}. {doc}")

In [None]:
# Extract embeddings for document collection
print("🔄 Extracting embeddings for document collection...")

# Use the best available model (sentence transformer if available, otherwise BERT)
if SENTENCE_TRANSFORMERS_AVAILABLE:
    doc_embeddings = sentence_model.encode(document_collection)
    search_model_name = "Sentence Transformer"
    
    def encode_query(query):
        return sentence_model.encode([query])[0]
        
elif 'BERT Base' in model_objects:
    tokenizer, bert_model = model_objects['BERT Base']
    doc_embeddings = extract_sentence_embeddings(
        document_collection, bert_model, tokenizer, device, 'mean'
    )
    search_model_name = "BERT Base"
    
    def encode_query(query):
        return extract_sentence_embeddings(
            [query], bert_model, tokenizer, device, 'mean'
        )[0]
        
else:
    # Fallback to the original model
    doc_embeddings = extract_sentence_embeddings(
        document_collection, model, tokenizer, device, 'mean'
    )
    search_model_name = "BERT Base (fallback)"
    
    def encode_query(query):
        return extract_sentence_embeddings(
            [query], model, tokenizer, device, 'mean'
        )[0]

print(f"✅ Document embeddings extracted using {search_model_name}")
print(f"📊 Embeddings shape: {doc_embeddings.shape}")

# Create search system
search_system = SimpleSemanticSearch(document_collection, doc_embeddings)

In [None]:
# Test semantic search with different queries
test_queries = [
    "animal sitting on carpet",
    "AI and machine intelligence",
    "sunny weather conditions",
    "programming languages for data analysis"
]

print("🔍 Semantic Search Results")
print("=" * 50)

for query in test_queries:
    print(f"\n📝 Query: '{query}'")
    
    # Encode query
    query_embedding = encode_query(query)
    
    # Search
    results = search_system.search(query_embedding, top_k=3)
    
    print("   Top matches:")
    for i, (doc, score) in enumerate(results, 1):
        print(f"   {i}. Score: {score:.4f} - {doc}")

## Part 8: Visualizing Embeddings in 2D Space

Let's visualize the high-dimensional embeddings in 2D space using dimensionality reduction.
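
To show what the reduction does, here is a sketch of PCA written directly with NumPy's SVD on random toy data. It is meant to illustrate the idea; the resulting coordinates should agree with scikit-learn's `PCA` only up to the sign of each axis.

```python
import numpy as np

rng = np.random.default_rng(42)
toy_embeddings = rng.normal(size=(10, 16))  # 10 "texts", 16 dimensions (toy data)

# 1. Center each dimension at zero.
centered = toy_embeddings - toy_embeddings.mean(axis=0)

# 2. The right singular vectors (rows of vt) are the principal directions.
u, s, vt = np.linalg.svd(centered, full_matrices=False)

# 3. Project onto the first two directions to get 2D coordinates.
coords_2d = centered @ vt[:2].T

# 4. Each component's share of the total variance comes from the squared singular values.
explained_ratio = (s ** 2) / np.sum(s ** 2)

print("2D coordinates shape:", coords_2d.shape)
print("explained variance (PC1, PC2):", explained_ratio[:2])
print("total explained by 2 components:", explained_ratio[:2].sum())
```

A low total for the first two components is a reminder that a 2D plot shows only part of the structure in the original embedding space.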

In [None]:
# Reduce dimensions for visualization
def visualize_embeddings_2d(embeddings, texts, title="Text Embeddings Visualization"):
    """
    Visualize high-dimensional embeddings in 2D space using PCA.
    
    Args:
        embeddings: High-dimensional embeddings
        texts: Corresponding text labels
        title: Plot title
    """
    # Apply PCA to reduce to 2 dimensions
    pca = PCA(n_components=2, random_state=42)
    embeddings_2d = pca.fit_transform(embeddings)
    
    # Create plot
    plt.figure(figsize=(12, 8))
    
    # Plot points
    scatter = plt.scatter(
        embeddings_2d[:, 0], 
        embeddings_2d[:, 1], 
        c=range(len(texts)), 
        cmap='tab10', 
        s=100, 
        alpha=0.7
    )
    
    # Add text labels
    for i, txt in enumerate(texts):
        # Truncate long texts for readability
        label = txt[:40] + '...' if len(txt) > 40 else txt
        plt.annotate(
            f"{i+1}: {label}", 
            (embeddings_2d[i, 0], embeddings_2d[i, 1]),
            xytext=(5, 5), 
            textcoords='offset points',
            fontsize=9,
            bbox=dict(boxstyle='round,pad=0.3', facecolor='white', alpha=0.7)
        )
    
    plt.title(title, fontsize=14, fontweight='bold')
    plt.xlabel(f'PC1 (explained variance: {pca.explained_variance_ratio_[0]:.2%})', 
               fontweight='bold')
    plt.ylabel(f'PC2 (explained variance: {pca.explained_variance_ratio_[1]:.2%})', 
               fontweight='bold')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    print(f"📊 Explained variance ratio: {pca.explained_variance_ratio_}")
    print(f"📈 Total explained variance: {pca.explained_variance_ratio_.sum():.2%}")

# Visualize document collection embeddings
visualize_embeddings_2d(
    doc_embeddings, 
    document_collection, 
    f"Document Embeddings ({search_model_name}) - 2D Visualization"
)

## Part 9: Best Practices and Tips

Let's summarize the key best practices for feature extraction with transformers.
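
Two of the practices covered here, batching and caching, combine naturally. The sketch below memoizes embeddings by text and sends only unseen texts to the encoder in one batched call. `EmbeddingCache` and `fake_encode` are hypothetical names, and `fake_encode` is a stand-in that returns toy vectors, not a real model.

```python
import numpy as np

class EmbeddingCache:
    """Memoize embeddings by text so repeated texts are encoded only once."""

    def __init__(self, encode_fn):
        self.encode_fn = encode_fn  # callable: list[str] -> array of shape [n, dim]
        self._store = {}

    def get(self, texts):
        # Deduplicate while keeping order, then encode only unseen texts in one batch.
        missing = [t for t in dict.fromkeys(texts) if t not in self._store]
        if missing:
            for text, vector in zip(missing, self.encode_fn(missing)):
                self._store[text] = np.asarray(vector)
        return np.stack([self._store[t] for t in texts])

calls = []

def fake_encode(texts):
    """Stand-in encoder (NOT a real model): records each call, returns toy vectors."""
    calls.append(list(texts))
    return np.array([[len(t), t.count(" ")] for t in texts], dtype=float)

cache = EmbeddingCache(fake_encode)
first = cache.get(["the cat", "a dog ran", "the cat"])
second = cache.get(["a dog ran", "new text"])

print(first.shape, second.shape)
print(calls)
```

In this notebook, the encoder could be a small wrapper around `extract_sentence_embeddings` or `sentence_model.encode`, as long as it accepts a list of strings and returns one row per text.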

In [None]:
def demonstrate_best_practices():
    """
    Demonstrate best practices for feature extraction.
    """
    print("🎯 FEATURE EXTRACTION BEST PRACTICES")
    print("=" * 50)
    
    practices = {
        "Model Selection": [
            "Use sentence-transformers models for semantic similarity tasks",
            "BERT/RoBERTa are good general-purpose choices",
            "DistilBERT offers good speed-quality trade-off",
            "Consider domain-specific models for specialized tasks"
        ],
        "Pooling Strategies": [
            "Mean pooling usually works better than CLS token for similarity",
            "CLS token is good for classification tasks",
            "Max pooling can capture important features but is less stable",
            "Try different strategies and evaluate on your specific task"
        ],
        "Processing Tips": [
            "Normalize embeddings to unit length so a dot product equals cosine similarity",
            "Batch process multiple texts for efficiency",
            "Handle padding and attention masks properly",
            "Use torch.no_grad() for inference to save memory"
        ],
        "Performance": [
            "Move models to GPU when available",
            "Consider model quantization for deployment",
            "Cache embeddings for frequently used texts",
            "Use appropriate batch sizes to avoid OOM errors"
        ],
        "Evaluation": [
            "Test similarity results on known similar/dissimilar pairs",
            "Use downstream task performance to validate embeddings",
            "Visualize embeddings to understand model behavior",
            "Compare multiple models on your specific use case"
        ]
    }
    
    for category, tips in practices.items():
        print(f"\n🔹 {category}:")
        for tip in tips:
            print(f"   • {tip}")

# Example of proper embedding normalization
def normalize_embeddings(embeddings):
    """
    Normalize embeddings to unit length for cosine similarity.
    
    Args:
        embeddings: numpy array of embeddings
    
    Returns:
        numpy.ndarray: Normalized embeddings
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Guard against division by zero for all-zero vectors
    return embeddings / np.clip(norms, 1e-12, None)

# Demonstrate normalization effect
print("📊 Embedding Normalization Example:")
sample_embedding = doc_embeddings[:3]  # First 3 embeddings
normalized_embedding = normalize_embeddings(sample_embedding)

print(f"Original norms: {[np.linalg.norm(emb) for emb in sample_embedding]}")
print(f"Normalized norms: {[np.linalg.norm(emb) for emb in normalized_embedding]}")
print("✅ All normalized embeddings have unit length (norm = 1.0)")

print("\n")
demonstrate_best_practices()

## Summary

### 🔑 Key Concepts Mastered

- **Feature Extraction**: Converting text to meaningful vector representations using transformer models
- **Pooling Strategies**: Different methods (CLS, mean, max, min) to create sentence-level embeddings
- **Similarity Analysis**: Using cosine similarity to measure semantic similarity between texts
- **Model Comparison**: Understanding trade-offs between different transformer architectures
- **Practical Applications**: Implementing semantic search and embedding visualization

### 📈 Best Practices Learned

- **Model Selection**: Choose task-appropriate models (sentence-transformers for similarity, BERT for general use)
- **Pooling Choice**: Mean pooling generally works better than CLS token for semantic similarity
- **Normalization**: Normalize embeddings to unit length so that a dot product equals cosine similarity
- **Efficiency**: Use batch processing, GPU acceleration, and proper memory management
- **Evaluation**: Test on known similar/dissimilar pairs and visualize results

### 🚀 Next Steps

- **Advanced Models**: Explore domain-specific transformer models
- **Fine-tuning**: Fine-tune models for your specific similarity tasks
- **Scalability**: Implement approximate similarity search with FAISS or similar libraries
- **Applications**: Build recommendation systems, document clustering, or question-answering systems

### 📚 Further Resources

- [Sentence Transformers Documentation](https://www.sbert.net/)
- [Hugging Face Transformers Guide](https://huggingface.co/docs/transformers/index)
- [Understanding BERT Embeddings](https://jalammar.github.io/illustrated-bert/)
- [Text Embeddings and Similarity Search](https://www.pinecone.io/learn/vector-embeddings/)


---

## About the Author

**Vu Hung Nguyen** - AI Engineer & Researcher

Connect with me:
- 🌐 **Website**: [vuhung16au.github.io](https://vuhung16au.github.io/)
- 💼 **LinkedIn**: [linkedin.com/in/nguyenvuhung](https://www.linkedin.com/in/nguyenvuhung/)
- 💻 **GitHub**: [github.com/vuhung16au](https://github.com/vuhung16au/)

*This notebook is part of the [HF Transformer Trove](https://github.com/vuhung16au/hf-transformer-trove) educational series.*