# Demo: Exploring Pretrained Word Embeddings

**Estimated Time**: 6 minutes | **Skill Pair 2**: Word Embeddings

**Scenario**: Your news recommendation system currently can't tell that "climate" and "environment" articles are related. By using word embeddings that capture semantic similarity, you can recommend articles about "renewable energy" to readers interested in "solar power", even if the articles share no exact words.

**What You'll Discover**: Why one-hot encoding fails, how embeddings capture meaning in geometry, and why "king - man + woman ≈ queen" actually works!

---

## 🤖 Why Word Embeddings Matter

**The Problem**: One-hot encoding treats every word as equally different:
- **"dog"** and **"cat"** → completely unrelated (distance = √2)
- **"dog"** and **"mathematics"** → also completely unrelated (distance = √2)

**But we know**: "dog" is more similar to "cat" than to "mathematics"!
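
Here is a short derivation of that distance, assuming standard one-hot vectors $e_i$ with a single 1 in position $i$ (the symbols are introduced here for illustration). For any two distinct words $i \neq j$:

$$
\lVert e_i - e_j \rVert^2 = \lVert e_i \rVert^2 + \lVert e_j \rVert^2 - 2\, e_i \cdot e_j = 1 + 1 - 0 = 2,
\qquad\text{so}\qquad \lVert e_i - e_j \rVert = \sqrt{2}.
$$

The dot product $e_i \cdot e_j$ is 0 for every distinct pair, so the cosine similarity is also 0 no matter which two words are compared.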

**The Solution**: Word embeddings place semantically similar words close together in vector space.

---

In [1]:
# Import necessary libraries
import torch
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity
import gensim.downloader as api
from datasets import load_dataset
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)
torch.manual_seed(42)

print("✅ Libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")

✅ Libraries imported successfully!
PyTorch version: 2.9.0


## 🚨 The One-Hot Problem

Let's see why one-hot encoding fails for capturing word meaning.

In [2]:
def create_one_hot_demo():
    """Demonstrate the limitations of one-hot encoding."""
    
    # Simple vocabulary for demonstration
    vocab = ['dog', 'cat', 'animal', 'mathematics', 'equation', 'number']
    vocab_size = len(vocab)
    
    # Create one-hot vectors
    one_hot_vectors = {}
    for i, word in enumerate(vocab):
        vector = np.zeros(vocab_size)
        vector[i] = 1
        one_hot_vectors[word] = vector
    
    print("🔢 One-Hot Encoding Example:")
    for word, vector in one_hot_vectors.items():
        print(f"{word:12} → {vector}")
    
    # Calculate distances between words
    def euclidean_distance(v1, v2):
        return np.sqrt(np.sum((v1 - v2) ** 2))
    
    print("\n📏 Distances between words:")
    dog_vec = one_hot_vectors['dog']
    cat_vec = one_hot_vectors['cat']
    math_vec = one_hot_vectors['mathematics']
    
    dist_dog_cat = euclidean_distance(dog_vec, cat_vec)
    dist_dog_math = euclidean_distance(dog_vec, math_vec)
    
    print(f"   dog ↔ cat:         {dist_dog_cat:.3f}")
    print(f"   dog ↔ mathematics: {dist_dog_math:.3f}")
    
    print("\n❌ Problem: All words are equally distant!")
    print("   - Can't capture that 'dog' and 'cat' are both animals")
    print("   - 50,000 word vocabulary = 50,000 dimensions (sparse!)")
    print("   - No notion of semantic similarity")

create_one_hot_demo()

🔢 One-Hot Encoding Example:
dog          → [1. 0. 0. 0. 0. 0.]
cat          → [0. 1. 0. 0. 0. 0.]
animal       → [0. 0. 1. 0. 0. 0.]
mathematics  → [0. 0. 0. 1. 0. 0.]
equation     → [0. 0. 0. 0. 1. 0.]
number       → [0. 0. 0. 0. 0. 1.]

📏 Distances between words:
   dog ↔ cat:         1.414
   dog ↔ mathematics: 1.414

❌ Problem: All words are equally distant!
   - Can't capture that 'dog' and 'cat' are both animals
   - 50,000 word vocabulary = 50,000 dimensions (sparse!)
   - No notion of semantic similarity


## 🧠 Two Approaches to Word Embeddings

**Word2Vec**: Predicts context from word (or word from context)
- **Skip-gram**: Given "cat", predict ["the", "sat", "on", "mat"]
- **CBOW**: Given ["the", "sat", "on", "mat"], predict "cat"
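
A minimal sketch of how skip-gram training pairs can be generated, assuming a symmetric window of 2 words and a toy sentence. The function name `skipgram_pairs` and the sentence are illustrative and not part of any library; real Word2Vec implementations add details such as subsampling and negative sampling.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for a symmetric context window.

    Hypothetical helper for illustration only.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=2)

# Skip-gram: the center word "cat" is asked to predict each of these words.
cat_contexts = [context for center, context in pairs if center == "cat"]
print(cat_contexts)
```

CBOW uses the same windows in the opposite direction: the context words in `cat_contexts` are combined to predict the center word "cat".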

**GloVe**: Global co-occurrence statistics
- Count how often words appear together across entire corpus
- Fit dense vectors whose dot products approximate the log co-occurrence counts (in effect, a weighted factorization of the co-occurrence matrix)
- Captures global statistical information
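
The counting step can be sketched as follows, assuming a tiny two-sentence corpus and a symmetric window of 2. The helper `cooccurrence_counts` is hypothetical, and this version uses plain unweighted counts; the original GloVe implementation down-weights pairs that are farther apart within the window.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` positions.

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


corpus = ["the cat sat on the mat", "the dog sat on the rug"]
counts = cooccurrence_counts(corpus)

print(counts[("sat", "on")])   # seen together in both sentences
print(counts[("cat", "sat")])  # seen together in one sentence
print(counts[("cat", "dog")])  # never in the same window
```

A GloVe-style model would then fit vectors so that words with similar rows in this table end up close together.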

**Both produce**: Dense vectors with a few hundred dimensions at most (this demo uses 100; 300 is also common) instead of 50,000-dimensional one-hot vectors!

## 📥 Load Pretrained GloVe Embeddings

We'll use GloVe vectors trained on 6 billion tokens from Wikipedia and Gigaword.

In [3]:
# Load pretrained GloVe embeddings
print("📥 Loading GloVe embeddings (this may take a moment)...")

# Load smaller GloVe model for demo (100d instead of 300d for speed)
try:
    glove_model = api.load('glove-wiki-gigaword-100')
    print(f"✅ Loaded GloVe embeddings!")
    print(f"   Vocabulary size: {len(glove_model):,} words")
    print(f"   Vector dimension: {glove_model.vector_size}")
    print(f"   Training corpus: Wikipedia + Gigaword (6B tokens)")
except Exception as e:
    print(f"❌ Error loading GloVe: {e}")
    print("   Note: This requires internet connection for first download")
    glove_model = None

📥 Loading GloVe embeddings (this may take a moment)...
✅ Loaded GloVe embeddings!
   Vocabulary size: 400,000 words
   Vector dimension: 100
   Training corpus: Wikipedia + Gigaword (6B tokens)


## 🔍 Explore Word Vectors

Let's look up embeddings for news-related words and see how they relate to each other.

In [4]:
if glove_model is not None:
    # News-related words for our recommendation scenario
    news_words = ['climate', 'environment', 'weather', 'politics', 'sports', 'technology']
    
    print("📰 News Category Word Vectors:")
    word_vectors = {}
    
    for word in news_words:
        if word in glove_model:
            vector = glove_model[word]
            word_vectors[word] = vector
            print(f"{word:12} → {vector[:5]}... (showing first 5 of {len(vector)} dims)")
        else:
            print(f"{word:12} → Not found in vocabulary")
    
    print(f"\n✨ Key insight: Each word is now a dense {glove_model.vector_size}-dimensional vector!")
    print("   Unlike one-hot, these vectors can capture semantic relationships.")

📰 News Category Word Vectors:
climate      → [-1.0901    -0.0036324  1.4329     0.45647   -0.01104  ]... (showing first 5 of 100 dims)
environment  → [-0.74272   0.1349    0.68435  -0.077705  0.026786]... (showing first 5 of 100 dims)
weather      → [-1.077    -0.42305   0.72816   0.031298 -0.85608 ]... (showing first 5 of 100 dims)
politics     → [-0.54286   0.45469   0.64719  -0.22052   0.091599]... (showing first 5 of 100 dims)
sports       → [ 0.25178  0.21679 -0.18549 -0.60748 -0.5374 ]... (showing first 5 of 100 dims)
technology   → [-0.12241   0.64795   0.43668   0.011368  0.50016 ]... (showing first 5 of 100 dims)

✨ Key insight: Each word is now a dense 100-dimensional vector!
   Unlike one-hot, these vectors can capture semantic relationships.


## 📐 Measuring Semantic Similarity

Cosine similarity measures the cosine of the angle between two vectors and ignores their lengths, which makes it a good fit for capturing semantic relatedness!
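
For reference, the quantity being computed is the dot product divided by the product of the vector lengths. A minimal pure-Python sketch on hand-picked 2-D vectors follows; the function name `cosine` and the vectors are illustrative assumptions, and the notebook itself relies on scikit-learn and gensim for this.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|). Hypothetical helper."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Same direction, different length: similarity is (numerically) 1.
same_direction = cosine([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: similarity is 0.
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])
# Opposite direction: similarity is (numerically) -1.
opposite = cosine([1.0, 2.0], [-1.0, -2.0])

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

Because length is divided out, a frequent word with a large vector and a rare word with a small one can still score as highly similar if they point the same way.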

In [5]:
if glove_model is not None and word_vectors:
    def calculate_similarity_matrix(words, model):
        """Calculate cosine similarity between all pairs of words."""
        available_words = [w for w in words if w in model]
        vectors = [model[w] for w in available_words]
        
        # Calculate cosine similarity matrix
        similarity_matrix = cosine_similarity(vectors)
        
        return available_words, similarity_matrix
    
    # Calculate similarities
    words, sim_matrix = calculate_similarity_matrix(news_words, glove_model)
    
    print("🔗 Cosine Similarity Matrix (1.0 = identical, 0.0 = unrelated):")
    print(f"{'':12}", end="")
    for word in words:
        # 12-character columns so that 'environment' (11 chars) does not overflow
        print(f"{word:>12}", end="")
    print()
    
    for i, word1 in enumerate(words):
        print(f"{word1:12}", end="")
        for j, word2 in enumerate(words):
            similarity = sim_matrix[i][j]
            print(f"{similarity:12.3f}", end="")
        print()
    
    # Highlight key relationships
    print("\n🎯 Key Insights:")
    if 'climate' in words and 'environment' in words:
        climate_env_sim = glove_model.similarity('climate', 'environment')
        print(f"   📈 'climate' ↔ 'environment': {climate_env_sim:.3f} (high similarity!)")
    
    if 'climate' in words and 'sports' in words:
        climate_sports_sim = glove_model.similarity('climate', 'sports')
        print(f"   📊 'climate' ↔ 'sports': {climate_sports_sim:.3f} (low similarity)")
    
    print("\n💡 This is one signal a recommendation system can use!")
    print("   Users reading 'climate' articles → recommend 'environment' content")

🔗 Cosine Similarity Matrix (1.0 = identical, 0.0 = unrelated):
                 climate environment     weather    politics      sports  technology
climate            1.000       0.760       0.632       0.401       0.195       0.414
environment        0.760       1.000       0.522       0.454       0.343       0.556
weather            0.632       0.522       1.000       0.246       0.330       0.357
politics           0.401       0.454       0.246       1.000       0.452       0.413
sports             0.195       0.343       0.330       0.452       1.000       0.419
technology         0.414       0.556       0.357       0.413       0.419       1.000

🎯 Key Insights:
   📈 'climate' ↔ 'environment': 0.760 (high similarity!)
   📊 'climate' ↔ 'sports': 0.195 (low similarity)

💡 This is one signal a recommendation system can use!
   Users reading 'climate' articles → recommend 'environment' content


## 🪄 Word Arithmetic: The Magic of Embeddings

The famous example: **king - man + woman ≈ queen**

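This works because embeddings capture many semantic relationships as roughly consistent directions in vector space. Note that gensim's `most_similar` leaves the input words themselves out of its results, so "king" cannot be returned as the answer.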
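
To make the geometry concrete, here is a toy sketch with hand-made 2-D vectors, where the first coordinate loosely stands for "royalty" and the second for "gender". The vectors, the `toy` dictionary, and the `analogy` helper are all invented for illustration. Unlike gensim, this version adds the raw vectors without normalizing them first.

```python
import math

# Hypothetical 2-D "embeddings": [royalty, gender]. Not real GloVe values.
toy = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.05],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def analogy(vectors, positive, negative):
    """Return the word closest to sum(positive) - sum(negative), skipping inputs."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    excluded = set(positive) | set(negative)
    scores = {w: cosine(target, vec) for w, vec in vectors.items() if w not in excluded}
    return max(scores, key=scores.get)


best = analogy(toy, positive=["king", "woman"], negative=["man"])
print(best)
```

Subtracting "man" and adding "woman" flips the gender coordinate while leaving the royalty coordinate untouched, which is the idea behind the real example on a much smaller scale.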

In [6]:
if glove_model is not None:
    def word_arithmetic(model, positive_words, negative_words, top_n=5):
        """Perform word arithmetic: positive_words - negative_words."""
        try:
            # Use gensim's built-in most_similar method
            results = model.most_similar(
                positive=positive_words, 
                negative=negative_words, 
                topn=top_n
            )
            return results
        except KeyError as e:
            return f"Word not found: {e}"
    
    print("🪄 Word Arithmetic Examples:\n")
    
    # Classic example: king - man + woman
    print("1️⃣ king - man + woman =")
    results = word_arithmetic(glove_model, ['king', 'woman'], ['man'])
    if isinstance(results, list):
        for word, similarity in results:
            print(f"   {word:15} (similarity: {similarity:.3f})")
        print(f"   ✨ Top result: '{results[0][0]}' - The arithmetic worked!")
    else:
        print(f"   {results}")
    
    print("\n2️⃣ paris - france + italy =")
    results = word_arithmetic(glove_model, ['paris', 'italy'], ['france'])
    if isinstance(results, list):
        for word, similarity in results[:3]:
            print(f"   {word:15} (similarity: {similarity:.3f})")
        print(f"   🗺️ Top result: '{results[0][0]}' (capital of Italy!)")
    
    print("\n3️⃣ News domain: climate - environment + technology =")
    results = word_arithmetic(glove_model, ['climate', 'technology'], ['environment'])
    if isinstance(results, list):
        for word, similarity in results[:3]:
            print(f"   {word:15} (similarity: {similarity:.3f})")
        print(f"   🔬 Interesting blend of climate and tech concepts!")
    
    print("\n🎯 Why this works:")
    print("   - Embeddings learn that 'gender' is a consistent direction")
    print("   - 'Capital city' relationships are captured geometrically")
    print("   - Vector arithmetic preserves these semantic relationships")

🪄 Word Arithmetic Examples:

1️⃣ king - man + woman =
   queen           (similarity: 0.770)
   monarch         (similarity: 0.684)
   throne          (similarity: 0.676)
   daughter        (similarity: 0.659)
   princess        (similarity: 0.652)
   ✨ Top result: 'queen' - The arithmetic worked!

2️⃣ paris - france + italy =
   rome            (similarity: 0.819)
   milan           (similarity: 0.738)
   naples          (similarity: 0.712)
   🗺️ Top result: 'rome' (capital of Italy!)

3️⃣ News domain: climate - environment + technology =
   technologies    (similarity: 0.649)
   global          (similarity: 0.613)
   tech            (similarity: 0.608)
   🔬 Interesting blend of climate and tech concepts!

🎯 Why this works:
   - Embeddings learn that 'gender' is a consistent direction
   - 'Capital city' relationships are captured geometrically
   - Vector arithmetic preserves these semantic relationships


## 🎲 Compare to Random Embeddings

Let's check that these relationships aren't just coincidence by comparing to random vectors.
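
As background for what "random" should look like: for independent Gaussian vectors in d dimensions, cosine similarity is centered on 0 with a spread of roughly 1/√d, so at d = 100 most values fall within a few tenths of zero. The sketch below checks this with the standard library only; the helper names and the seed are arbitrary choices made for this illustration.

```python
import math
import random


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def random_pair_cosines(dim, n_pairs, seed=0):
    """Cosine similarity of n_pairs independent Gaussian vector pairs."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n_pairs):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        sims.append(cosine(u, v))
    return sims


sims = random_pair_cosines(dim=100, n_pairs=200)
mean_sim = sum(sims) / len(sims)
spread = math.sqrt(sum((s - mean_sim) ** 2 for s in sims) / len(sims))

print(f"mean cosine: {mean_sim:.3f}")
print(f"std dev:     {spread:.3f}  (1/sqrt(100) = 0.100)")
```

Against that baseline, a similarity around 0.75 between two 100-dimensional vectors is far outside what chance alone would produce.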

In [7]:
if glove_model is not None:
    # Create random embeddings for comparison
    np.random.seed(42)
    random_embeddings = {}
    test_words = ['king', 'queen', 'man', 'woman', 'climate', 'environment']
    
    print("🎲 Random vs. Pretrained Embedding Comparison:\n")
    
    for word in test_words:
        if word in glove_model:
            # Random vector with the same dimensionality as the GloVe vectors
            random_embeddings[word] = np.random.normal(0, 1, glove_model.vector_size)
    
    # Compare similarities
    def cosine_sim(v1, v2):
        return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    
    print("👑 King-Queen Similarity:")
    if 'king' in glove_model and 'queen' in glove_model:
        glove_sim = glove_model.similarity('king', 'queen')
        random_sim = cosine_sim(random_embeddings['king'], random_embeddings['queen'])
        print(f"   GloVe:  {glove_sim:.3f} (high - they're related!)")
        print(f"   Random: {random_sim:.3f} (low - no relationship captured)")
    
    print("\n🌍 Climate-Environment Similarity:")
    if 'climate' in glove_model and 'environment' in glove_model:
        glove_sim = glove_model.similarity('climate', 'environment')
        random_sim = cosine_sim(random_embeddings['climate'], random_embeddings['environment'])
        print(f"   GloVe:  {glove_sim:.3f} (high - semantically related)")
        print(f"   Random: {random_sim:.3f} (random - no meaning)")
    
    print("\n✅ Conclusion: Pretrained embeddings capture real semantic relationships!")
    print("   Random vectors can't distinguish related from unrelated concepts.")

🎲 Random vs. Pretrained Embedding Comparison:

👑 King-Queen Similarity:
   GloVe:  0.751 (high - they're related!)
   Random: -0.138 (low - no relationship captured)

🌍 Climate-Environment Similarity:
   GloVe:  0.760 (high - semantically related)
   Random: -0.082 (random - no meaning)

✅ Conclusion: Pretrained embeddings capture real semantic relationships!
   Random vectors can't distinguish related from unrelated concepts.
