Text Embeddings Tutorial with Hugging Face
=================================================
  
[View on Google Colab](https://colab.research.google.com/drive/1mClrNFwUeztQjUL4NXZslr9nEoeD67sj?usp=sharing)

### Import the Necessary Libraries

In [9]:
! pip install torch transformers
! pip install "numpy<2.0.0"

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from typing import List, Union, Optional

import warnings
warnings.filterwarnings('ignore')

---

### Create the Text Embeddings

In [12]:
def create_text_embeddings(
    texts: Union[str, List[str]], 
    model_name: str = "sentence-transformers/all-MiniLM-L6-v2",
    max_length: int = 512,
    normalize: bool = True,
    device: Optional[str] = None
) -> np.ndarray:
    """
    Create dense vector embeddings for text using pre-trained transformer models.
    
    This function uses Hugging Face transformers to convert text into numerical
    representations that capture semantic meaning. These embeddings can be used
    for similarity search, clustering, classification, and other NLP tasks.
    
    Args:
        texts (Union[str, List[str]]): Single text string or list of text strings
            to embed. Each text will be converted to a fixed-size vector.
        model_name (str, optional): Name of the pre-trained model from Hugging Face.
            Default is "sentence-transformers/all-MiniLM-L6-v2", which is tuned
            for sentence-level embeddings. Other options include:
            - "sentence-transformers/all-mpnet-base-v2" (higher quality, slower)
            - "bert-base-uncased" (general BERT model, not tuned for sentence
              similarity, so mean-pooled vectors tend to be less discriminative)
            - "distilbert-base-uncased" (faster, smaller BERT variant, same caveat)
        max_length (int, optional): Maximum sequence length for tokenization.
            Longer texts will be truncated. Default is 512 tokens.
        normalize (bool, optional): Whether to L2-normalize embeddings to unit vectors.
            With unit vectors, the dot product equals cosine similarity. Default is True.
        device (Optional[str], optional): Device to run the model on ('cpu', 'cuda').
            If None, automatically detects available device.
    
    Returns:
        np.ndarray: For a list input, an array of shape (n_texts, embedding_dim)
            where each row is the embedding of one input text. For a single
            string input, a 1-D array of shape (embedding_dim,).
    
    Example:
        >>> # Single text embedding
        >>> text = "This is a sample sentence for embedding."
        >>> embedding = create_text_embeddings(text)
        >>> print(f"Embedding shape: {embedding.shape}")
        
        >>> # Multiple texts
        >>> texts = ["Hello world", "Natural language processing", "Machine learning"]
        >>> embeddings = create_text_embeddings(texts)
        >>> print(f"Embeddings shape: {embeddings.shape}")
    """
    
    # Step 1: Handle input format - convert single string to list
    if isinstance(texts, str):
        texts = [texts]
        single_input = True
    else:
        single_input = False
    
    # Step 2: Determine compute device (GPU if available, else CPU)
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    
    print(f"Using device: {device}")
    print(f"Loading model: {model_name}")
    
    # Step 3: Load pre-trained tokenizer and model
    # Note: both are reloaded on every call to this function
    # The tokenizer converts text to integer token IDs that the model can process
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    # The model creates contextual embeddings from the tokenized input
    model = AutoModel.from_pretrained(model_name)
    model.to(device)
    model.eval()  # Set to evaluation mode (disables dropout, etc.)
    
    # Step 4: Tokenize all input texts
    # This converts text to token IDs and creates attention masks
    print(f"Tokenizing {len(texts)} text(s)...")
    encoded = tokenizer(
        texts,
        padding=True,          # Pad shorter sequences to match longest
        truncation=True,       # Truncate sequences longer than max_length
        max_length=max_length,
        return_tensors="pt"    # Return PyTorch tensors
    )
    
    # Move tokenized inputs to the same device as the model
    input_ids = encoded['input_ids'].to(device)
    attention_mask = encoded['attention_mask'].to(device)
    
    # Step 5: Generate embeddings using the model
    print("Generating embeddings...")
    with torch.no_grad():  # Disable gradient computation for efficiency
        # Forward pass through the transformer model
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        
        # Extract the last hidden states (contextual embeddings for each token)
        last_hidden_states = outputs.last_hidden_state
        
        # Step 6: Pool token embeddings to create sentence-level embeddings
        # Method: Mean pooling with attention mask weighting
        # This averages token embeddings while ignoring padding tokens
        
        # Expand attention mask to match hidden state dimensions
        attention_mask_expanded = attention_mask.unsqueeze(-1).expand(last_hidden_states.size()).float()
        
        # Apply attention mask and compute mean
        sum_embeddings = torch.sum(last_hidden_states * attention_mask_expanded, 1)
        sum_mask = torch.sum(attention_mask_expanded, 1)
        
        # Avoid division by zero
        sum_mask = torch.clamp(sum_mask, min=1e-9)
        
        # Mean pooling: divide sum by count of non-padding tokens
        embeddings = sum_embeddings / sum_mask
    
    # Step 7: Move embeddings back to CPU and convert to numpy
    embeddings = embeddings.cpu().numpy()
    
    # Step 8: Optional normalization for better similarity computation
    if normalize:
        # L2 normalization: each embedding becomes a unit vector
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        embeddings = embeddings / (norms + 1e-9)  # Avoid division by zero
    
    print(f"Generated embeddings with shape: {embeddings.shape}")
    
    # Step 9: Return single embedding if single input was provided
    if single_input:
        return embeddings[0]
    
    return embeddings
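
The mean pooling in Step 6 can be checked in isolation on a toy batch. The sketch below repeats the same arithmetic in NumPy rather than PyTorch. The helper name `masked_mean_pool` and the toy values are illustrative assumptions, not part of the function above. It shows how the attention mask keeps a padding position out of the average.

```python
import numpy as np

def masked_mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sequence, ignoring positions where the mask is 0."""
    # (batch, seq_len) -> (batch, seq_len, 1) so the mask broadcasts over the hidden dimension
    mask = attention_mask[..., np.newaxis].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)   # shape: (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # shape: (batch, 1), guarded against zero
    return summed / counts

# Toy batch (hypothetical values): 2 sequences, 3 token positions, hidden size 2.
token_embeddings = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],      # no padding
])
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(token_embeddings, attention_mask)
naive = token_embeddings.mean(axis=1)  # unmasked mean, polluted by the padding vector

print("masked mean:", pooled)
print("naive mean: ", naive)
```

For the first sequence, the masked mean averages only the two real tokens, while the naive mean is pulled toward the padding vector. The second sequence has no padding, so both methods agree.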

---

### Demonstrate Text Embeddings

In [13]:
def demonstrate_text_embeddings():
    """
    Demonstrate the text embedding function with various examples.
    
    This function shows practical usage scenarios and helps verify
    that the embedding function works correctly.
    """
    print("=" * 60)
    print("TEXT EMBEDDINGS DEMONSTRATION")
    print("=" * 60)
    
    # Example 1: Single text embedding
    print("\n1. Single Text Embedding:")
    print("-" * 30)
    
    single_text = "Artificial intelligence is transforming the world."
    embedding = create_text_embeddings(single_text)
    
    print(f"Input text: '{single_text}'")
    print(f"Embedding shape: {embedding.shape}")
    print(f"Embedding magnitude: {np.linalg.norm(embedding):.4f}")
    print(f"First 5 dimensions: {embedding[:5]}")
    
    # Example 2: Multiple texts with similarity comparison
    print("\n\n2. Multiple Texts with Similarity Analysis:")
    print("-" * 45)
    
    texts = [
        "Machine learning is a subset of artificial intelligence.",
        "Deep learning uses neural networks with multiple layers.",
        "I love eating pizza and pasta for dinner.",
        "Natural language processing helps computers understand text.",
        "My favorite Italian food is definitely pizza."
    ]
    
    embeddings = create_text_embeddings(texts)
    
    print(f"Input texts: {len(texts)} sentences")
    print(f"Embeddings shape: {embeddings.shape}")
    
    # Compute cosine similarity matrix
    # (a plain dot product suffices because the embeddings are L2-normalized)
    similarity_matrix = np.dot(embeddings, embeddings.T)
    
    print("\nCosine Similarity Matrix:")
    print("(Higher values = more similar texts)")
    
    # Print similarity matrix with text indices
    print("\nText Index Reference:")
    for i, text in enumerate(texts):
        print(f"{i}: {text[:50]}...")
    
    print("\nSimilarity Matrix:")
    print("     ", end="")
    for j in range(len(texts)):
        print(f"{j:6}", end="")
    print()
    
    for i in range(len(texts)):
        print(f"{i}: ", end="")
        for j in range(len(texts)):
            print(f"{similarity_matrix[i,j]:6.3f}", end="")
        print()
    
    # Find most similar pair (excluding diagonal)
    # Start below the minimum possible cosine similarity (-1) so any pair qualifies
    max_sim = -np.inf
    max_pair = (0, 0)
    for i in range(len(texts)):
        for j in range(i+1, len(texts)):
            if similarity_matrix[i,j] > max_sim:
                max_sim = similarity_matrix[i,j]
                max_pair = (i, j)
    
    print(f"\nMost similar texts (similarity: {max_sim:.3f}):")
    print(f"Text {max_pair[0]}: {texts[max_pair[0]]}")
    print(f"Text {max_pair[1]}: {texts[max_pair[1]]}")
    
    # Example 3: Different model comparison
    print("\n\n3. Comparing Different Models:")
    print("-" * 35)
    
    test_text = "The quick brown fox jumps over the lazy dog."
    
    models_to_test = [
        "sentence-transformers/all-MiniLM-L6-v2",  # Fast, good quality
        "distilbert-base-uncased"                   # General BERT variant
    ]
    
    for model_name in models_to_test:
        print(f"\nTesting model: {model_name}")
        embedding = create_text_embeddings(test_text, model_name=model_name)
        print(f"Embedding dimension: {embedding.shape[0]}")
        print(f"Sample values: {embedding[:3]}")
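
The similarity step in Example 2 can be tried without downloading a model. The sketch below uses small hand-written vectors in place of real embeddings. The helper names `cosine_similarity_matrix` and `most_similar_pair` are illustrative assumptions. It normalizes explicitly, so it also works for vectors that are not already unit length, and it replaces the nested loops with `np.triu_indices` to search only the pairs above the diagonal.

```python
import numpy as np

def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-9, None)  # L2-normalize each row
    return unit @ unit.T                         # dot product of unit vectors = cosine

def most_similar_pair(similarity: np.ndarray) -> tuple:
    """Return (i, j, score) for the highest off-diagonal entry with i < j."""
    rows, cols = np.triu_indices(similarity.shape[0], k=1)  # strictly above the diagonal
    best = int(np.argmax(similarity[rows, cols]))
    return int(rows[best]), int(cols[best]), float(similarity[rows[best], cols[best]])

# Hypothetical 2-D "embeddings"; real ones would come from create_text_embeddings.
toy_vectors = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
    [3.0, 0.3],
    [-1.0, 0.0],
])

sim = cosine_similarity_matrix(toy_vectors)
i, j, score = most_similar_pair(sim)
print(f"Most similar pair: ({i}, {j}) with cosine similarity {score:.3f}")
```

Vectors 0 and 3 point in opposite directions, so their similarity is -1. A search that starts from a threshold of 0 would never consider pairs like this one, which matters when every pair in a batch is negatively correlated.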

In [14]:
demonstrate_text_embeddings()

TEXT EMBEDDINGS DEMONSTRATION

1. Single Text Embedding:
------------------------------
Using device: cpu
Loading model: sentence-transformers/all-MiniLM-L6-v2


Tokenizing 1 text(s)...
Generating embeddings...
Generated embeddings with shape: (1, 384)
Input text: 'Artificial intelligence is transforming the world.'
Embedding shape: (384,)
Embedding magnitude: 1.0000
First 5 dimensions: [ 0.03872417 -0.00110556  0.08271619 -0.01628861  0.0465431 ]


2. Multiple Texts with Similarity Analysis:
---------------------------------------------
Using device: cpu
Loading model: sentence-transformers/all-MiniLM-L6-v2
Tokenizing 5 text(s)...
Generating embeddings...
Generated embeddings with shape: (5, 384)
Input texts: 5 sentences
Embeddings shape: (5, 384)

Cosine Similarity Matrix:
(Higher values = more similar texts)

Text Index Reference:
0: Machine learning is a subset of artificial intelli...
1: Deep learning uses neural networks with multiple l...
2: I love eating pizza and pasta for dinner....
3: Natural language processing helps computers unders...
4: My favorite Italian food is definitely pizza....

Similarity Matrix:
          0     1     2  

---