# Exercise: Analyze Word Relationships with Embedding Arithmetic

**Estimated Time**: 15 minutes | **Status**: 🚧 Your implementation needed

**Scenario**: Before committing to GloVe embeddings for your news recommender, the data science team wants evidence they actually capture the relationships you care about: category similarities (politics/government), writing style (formal/casual), and topic clustering (technology subcategories). Test whether pretrained embeddings understand news domain semantics.

**What You'll Learn**: Embedding relationships, analogy evaluation, visualization techniques, and coverage analysis for domain-specific applications.

---

## 🎯 Why This Matters for Production Systems

**Real-world question**: Should we use pretrained GloVe or train custom embeddings?

**This analysis helps decide**:
- Do pretrained embeddings capture **news domain semantics**?
- What's the **vocabulary coverage** on our specific dataset?
- Where do we need **domain-specific training**?

---

In [None]:
# Import necessary libraries
import torch
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity
import gensim.downloader as api
from datasets import load_dataset
from collections import Counter
import re
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)
torch.manual_seed(42)

print("✅ Libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")

## 📥 Load Data and Embeddings

First, let's load both our AG News dataset and pretrained GloVe embeddings.

In [None]:
# Load AG News dataset
print("📰 Loading AG News dataset...")
try:
    # Load full dataset for comprehensive analysis
    dataset = load_dataset("ag_news", split="train")
    
    # Category mapping
    categories = {0: "World", 1: "Sports", 2: "Business", 3: "Technology"}
    category_names = list(categories.values())
    
    print(f"✅ Loaded {len(dataset):,} news articles")
    print(f"📊 Categories: {category_names}")
    
    # Show distribution (read the label column once instead of indexing row by row)
    label_counts = Counter(dataset['label'])
    for label, count in sorted(label_counts.items()):
        print(f"   {categories[label]}: {count:,} articles")
    
except Exception as e:
    print(f"❌ Error loading dataset: {e}")
    dataset = None

In [None]:
# Load GloVe embeddings
print("📥 Loading GloVe embeddings...")
try:
    # Use 100d model for faster processing
    glove_model = api.load('glove-wiki-gigaword-100')
    print(f"✅ Loaded GloVe embeddings!")
    print(f"   Vocabulary size: {len(glove_model):,} words")
    print(f"   Vector dimension: {glove_model.vector_size}")
    
except Exception as e:
    print(f"❌ Error loading GloVe: {e}")
    print("   Note: The first download requires an internet connection")
    glove_model = None

## Part A: Find Similar Words

Build a similarity function and test it on news-relevant words.

**Your task**: Complete the `find_similar()` function to find words with highest cosine similarity.
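
Before completing the function, it may help to see what "highest cosine similarity" means on a tiny example. The sketch below uses only the standard library and a handful of hypothetical 3-dimensional vectors (invented for illustration, not real GloVe values); the helper names `cosine` and `rank_neighbors` are assumptions and are not part of gensim.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-d vectors, invented for illustration (not real GloVe values)
toy_vectors = {
    "election": [0.9, 0.1, 0.0],
    "vote":     [0.8, 0.2, 0.1],
    "senate":   [0.7, 0.0, 0.3],
    "goal":     [0.0, 0.9, 0.2],
}

def rank_neighbors(word, vectors, n=2):
    """Return the n nearest words to `word`, best first, as (word, score) tuples."""
    target = vectors[word]
    scores = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word  # a word is trivially its own nearest neighbor
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]

neighbors = rank_neighbors("election", toy_vectors)
for neighbor, score in neighbors:
    print(f"{neighbor:<10} {score:.3f}")
```

The return shape, a list of `(word, score)` tuples sorted best first, matches what the test loop in the next cell expects from `find_similar()`.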

In [None]:
if glove_model is not None:
    def find_similar(word, n=10, model=glove_model):
        """
        Find n most similar words using cosine similarity.
        
        Args:
            word: Target word to find similarities for
            n: Number of similar words to return
            model: Embedding model to use
            
        Returns:
            List of (word, similarity_score) tuples
        """
        try:
            # TODO: Use gensim's built-in similarity function to find most similar words
            similar_words = None  # Replace this line
            return similar_words
        except KeyError:
            return f"'{word}' not found in vocabulary"
    
    # Test on news-relevant words
    test_words = ["election", "technology", "economy"]
    
    print("🔍 Finding similar words for news-relevant terms:\n")
    
    for word in test_words:
        print(f"📍 Words similar to '{word}':")
        similar = find_similar(word, n=8)
        
        if isinstance(similar, list):
            for similar_word, score in similar:
                print(f"   {similar_word:<15} (similarity: {score:.3f})")
        else:
            print(f"   {similar}")
        print()
    
    # TODO: Analyze the patterns you observe
    print("🎯 YOUR OBSERVATIONS:")
    print("   What patterns do you notice in the similar words?")
    print("   Are the relationships meaningful for news categorization?")
    print("   YOUR ANALYSIS HERE...")

## Part B: Explore Semantic Analogies

Test whether embeddings capture logical relationships through vector arithmetic.

**Your task**: Complete the analogy function and analyze the results.
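
As a rough illustration of the arithmetic involved, the sketch below solves "good : better :: bad : ?" on hypothetical 2-dimensional vectors (invented so that the offset from "good" to "better" equals the offset from "bad" to "worse"). The helper `solve_analogy` is an assumption for illustration only. It excludes the three input words from the candidates, which is the usual convention when scoring analogies.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical 2-d vectors: axis 0 ~ polarity, axis 1 ~ degree (values invented)
toy_vectors = {
    "good":   [1.0, 0.0],
    "better": [1.0, 1.0],
    "bad":    [-1.0, 0.0],
    "worse":  [-1.0, 1.0],
    "best":   [1.0, 2.0],
}

def solve_analogy(a, b, c, vectors, top_n=3):
    """a is to b as c is to ?  Rank candidates by cosine similarity to b - a + c."""
    query = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [
        (word, cosine(query, vec))
        for word, vec in vectors.items()
        if word not in {a, b, c}  # skip the input words themselves
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:top_n]

results = solve_analogy("good", "better", "bad", toy_vectors)
for word, score in results:
    print(f"{word:<8} {score:.3f}")
```

Real embeddings are noisier than this toy space, so the expected word often lands in the top few results rather than first, which is why the evaluation loop in the next cell checks the top 3.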

In [None]:
if glove_model is not None:
    def test_analogy(word1, word2, word3, expected=None, model=glove_model, top_n=5):
        """
        Test analogy: word1 is to word2 as word3 is to ?
        
        Args:
            word1, word2, word3: The analogy components  
            expected: Expected result (for evaluation)
            model: Embedding model
            top_n: Number of results to return
            
        Returns:
            List of (word, similarity) results
        """
        try:
            # TODO: Implement vector arithmetic for analogies
            # Vector arithmetic: word2 - word1 + word3
            results = None  # Replace this line
            return results
        except KeyError as e:
            return f"Word not found: {e}"
    
    print("🪄 Testing Semantic Analogies:\n")
    
    # Test cases with expected answers
    analogies = [
        ("good", "better", "bad", "worse"),
        ("president", "politics", "quarterback", "sports"),
        ("newspaper", "journalism", "television", "broadcasting"),
        ("stock", "business", "election", "politics"),
        ("computer", "technology", "athlete", "sports"),
        ("profit", "business", "victory", "sports"),
        ("software", "technology", "legislation", "politics")
    ]
    
    successful_analogies = 0
    total_analogies = len(analogies)
    
    for i, (w1, w2, w3, expected) in enumerate(analogies, 1):
        print(f"{i}️⃣ {w1} : {w2} :: {w3} : ?  (expecting '{expected}')")
        
        results = test_analogy(w1, w2, w3, expected)
        
        if isinstance(results, list) and results:
            # Show top 3 results
            for j, (word, similarity) in enumerate(results[:3]):
                marker = "✅" if word.lower() == expected.lower() else "  "
                print(f"   {marker} {word:<15} (similarity: {similarity:.3f})")
            
            # Check if expected word is in top 3
            top_3_words = [word.lower() for word, _ in results[:3]]
            if expected.lower() in top_3_words:
                successful_analogies += 1
                print(f"   🎯 SUCCESS: Found '{expected}' in top 3!")
            else:
                print(f"   ❌ Expected '{expected}' not in top 3")
        else:
            print(f"   {results}")
        print()
    
    # Calculate success rate
    success_rate = (successful_analogies / total_analogies) * 100
    print(f"📊 Analogy Success Rate: {successful_analogies}/{total_analogies} ({success_rate:.1f}%)")
    
    # TODO: Interpret the results
    print("\n🤔 YOUR INTERPRETATION:")
    print("   What does this performance suggest about using GloVe for news?")
    print("   Do the embeddings capture meaningful semantic relationships?")
    print("   YOUR ANALYSIS HERE...")

## Part C: Visualize Word Clusters

Extract representative words from each news category and visualize their embedding space.

**Your task**: Complete the word embedding extraction for visualization.
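
The plotting cell further down expects a 2-D array with one row per word, plus a parallel list of category labels used to build boolean masks. The sketch below shows that data layout on hypothetical 4-dimensional vectors (values invented, standing in for GloVe rows) and projects them to 2-D with a plain SVD-based PCA. It assumes only NumPy; the notebook itself uses scikit-learn's `PCA` for this step.

```python
import numpy as np

# Hypothetical 4-d vectors standing in for GloVe rows (values invented)
toy_vectors = {
    "match":  np.array([0.9, 0.1, 0.0, 0.2]),
    "league": np.array([0.8, 0.2, 0.1, 0.1]),
    "stocks": np.array([0.1, 0.9, 0.2, 0.0]),
    "shares": np.array([0.0, 0.8, 0.3, 0.1]),
}
words = list(toy_vectors)
labels = np.array(["Sports", "Sports", "Business", "Business"])

# One row per word, one column per embedding dimension
matrix = np.stack([toy_vectors[w] for w in words])

# PCA via SVD: center the data, decompose, keep the top two directions
centered = matrix - matrix.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)
points_2d = centered @ components[:2].T

# Share of total variance retained by the two kept components
explained = (singular_values[:2] ** 2).sum() / (singular_values ** 2).sum()

# Boolean mask selects the rows belonging to one category
sports_points = points_2d[labels == "Sports"]

print(matrix.shape, points_2d.shape, f"explained variance: {explained:.3f}")
```

If any entry of the word-vector list is not a numeric vector of the same length, `np.array(...)` will not produce a 2-D float matrix and the projection step will fail, so it is worth printing the matrix shape before plotting.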

In [None]:
if dataset is not None and glove_model is not None:
    def extract_category_words(dataset, categories, words_per_category=10):
        """
        Extract most frequent words from each category.
        """
        category_words = {}
        
        for label, category_name in categories.items():
            # Get all texts for this category
            category_texts = []
            for article in dataset:
                if article['label'] == label:
                    category_texts.append(article['text'].lower())
            
            # Extract and count words
            all_words = []
            for text in category_texts:
                # Simple tokenization
                words = re.findall(r'\b[a-zA-Z]+\b', text)
                all_words.extend(words)
            
            # Get most frequent words (excluding common stop words)
            stop_words = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with', 'by', 'is', 'are', 'was', 'were', 'be', 'been', 'have', 'has', 'had', 'will', 'would', 'could', 'should', 'may', 'might', 'must', 'can', 'shall', 'said', 'says', 'say'}
            
            word_counts = Counter([w for w in all_words if len(w) > 3 and w not in stop_words])
            most_frequent = [word for word, count in word_counts.most_common(words_per_category * 3)]
            
            # Filter to words that exist in GloVe
            category_words[category_name] = []
            for word in most_frequent:
                if word in glove_model and len(category_words[category_name]) < words_per_category:
                    category_words[category_name].append(word)
        
        return category_words
    
    # Extract representative words
    print("🔍 Extracting representative words from each category...")
    category_words = extract_category_words(dataset, categories, words_per_category=10)
    
    for category, words in category_words.items():
        print(f"\n📊 {category} category words:")
        print(f"   {', '.join(words)}")
    
    # TODO: Prepare data for visualization by getting embeddings for each word
    all_words = []
    word_categories = []
    word_vectors = []
    
    for category, words in category_words.items():
        for word in words:
            all_words.append(word)
            word_categories.append(category)
            # TODO: Get the embedding vector for this word from glove_model
            word_vectors.append(None)  # Replace this line
    
    print(f"\n✅ Prepared {len(all_words)} words for visualization")

In [None]:
if dataset is not None and glove_model is not None and len(all_words) > 0:
    # Convert to numpy array for processing
    embedding_matrix = np.array(word_vectors)
    
    print(f"📊 Embedding matrix shape: {embedding_matrix.shape}")
    
    # Apply PCA for dimensionality reduction
    print("🔄 Applying PCA to reduce dimensions...")
    pca = PCA(n_components=2, random_state=42)
    embeddings_2d_pca = pca.fit_transform(embedding_matrix)
    
    # Apply t-SNE for non-linear reduction
    print("🔄 Applying t-SNE for non-linear reduction...")
    tsne = TSNE(n_components=2, random_state=42, perplexity=min(30, len(all_words)-1))
    embeddings_2d_tsne = tsne.fit_transform(embedding_matrix)
    
    print(f"✅ Explained variance by PCA: {pca.explained_variance_ratio_.sum():.3f}")
    
    # Create visualization
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 7))
    
    # Color mapping for categories
    colors = {'World': 'red', 'Sports': 'green', 'Business': 'blue', 'Technology': 'purple'}
    
    # Plot PCA results
    for category in category_names:
        mask = np.array(word_categories) == category
        ax1.scatter(
            embeddings_2d_pca[mask, 0], 
            embeddings_2d_pca[mask, 1],
            c=colors[category], 
            label=category,
            alpha=0.7,
            s=50
        )
    
    ax1.set_title('Word Embeddings - PCA Projection')
    ax1.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2f} variance)')
    ax1.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2f} variance)')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    
    # Add word labels for some points
    for i, word in enumerate(all_words):
        if i % 3 == 0:  # Label every 3rd word to avoid crowding
            ax1.annotate(word, (embeddings_2d_pca[i, 0], embeddings_2d_pca[i, 1]), 
                        xytext=(5, 5), textcoords='offset points', 
                        fontsize=8, alpha=0.7)
    
    # Plot t-SNE results
    for category in category_names:
        mask = np.array(word_categories) == category
        ax2.scatter(
            embeddings_2d_tsne[mask, 0], 
            embeddings_2d_tsne[mask, 1],
            c=colors[category], 
            label=category,
            alpha=0.7,
            s=50
        )
    
    ax2.set_title('Word Embeddings - t-SNE Projection')
    ax2.set_xlabel('t-SNE 1')
    ax2.set_ylabel('t-SNE 2')
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    
    # Add word labels for t-SNE
    for i, word in enumerate(all_words):
        if i % 3 == 0:
            ax2.annotate(word, (embeddings_2d_tsne[i, 0], embeddings_2d_tsne[i, 1]), 
                        xytext=(5, 5), textcoords='offset points', 
                        fontsize=8, alpha=0.7)
    
    plt.tight_layout()
    plt.show()
    
    # TODO: Analyze the clustering patterns you observe
    print("\n🎯 YOUR CLUSTER ANALYSIS:")
    print("   What patterns do you see in the clusters?")
    print("   Which categories separate well? Which overlap?")
    print("   What does this suggest about semantic relationships?")
    print("   YOUR OBSERVATIONS HERE...")

## Part D: Limitations Analysis

Analyze vocabulary coverage and identify when custom embeddings might be needed.

**Your task**: Complete the coverage analysis calculations.
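
Coverage can be measured two ways, and the numbers usually differ: by tokens (every occurrence counts) and by types (each distinct word counts once). The sketch below illustrates the distinction on a hypothetical vocabulary and sentence, both invented for illustration; `known_vocab` stands in for the embedding model's vocabulary.

```python
import re

# Hypothetical vocabulary and text, invented for illustration
known_vocab = {"the", "market", "rose", "after", "report"}
text = "The market rose after the Xyzzcorp report; Xyzzcorp rose."

# Same simple tokenization as the notebook: lowercase, alphabetic runs only
tokens = re.findall(r"\b[a-zA-Z]+\b", text.lower())
missing = [t for t in tokens if t not in known_vocab]

# Token coverage: share of word occurrences that have a vector
token_coverage = 100 * (1 - len(missing) / len(tokens))

# Type coverage: share of distinct words that have a vector
type_coverage = 100 * (1 - len(set(missing)) / len(set(tokens)))

print(f"tokens: {len(tokens)}, missing tokens: {len(missing)}")
print(f"coverage by tokens: {token_coverage:.1f}%")
print(f"coverage by types:  {type_coverage:.1f}%")
```

On real news text the relationship typically runs the other way from this toy case: frequent words are almost always covered, so token coverage tends to be higher than type coverage, while the missing types are often rare names and brands.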

In [None]:
if dataset is not None and glove_model is not None:
    def analyze_vocabulary_coverage(dataset, model, sample_size=1000):
        """
        Analyze what percentage of words in AG News are covered by GloVe.
        """
        print(f"🔍 Analyzing vocabulary coverage on {sample_size} articles...")
        
        # Sample articles for analysis
        sample_indices = np.random.choice(len(dataset), size=min(sample_size, len(dataset)), replace=False)
        
        all_words = []
        missing_words = []
        
        for idx in sample_indices:
            text = dataset[int(idx)]['text'].lower()
            # Simple tokenization
            words = re.findall(r'\b[a-zA-Z]+\b', text)
            
            for word in words:
                if len(word) > 2:  # Skip very short words
                    all_words.append(word)
                    # TODO: Check if word exists in model vocabulary
                    if None:  # Replace this condition
                        missing_words.append(word)
        
        # TODO: Calculate basic statistics
        total_words = len(all_words)
        unique_words = len(set(all_words))
        missing_count = None  # Replace this line
        unique_missing = None  # Replace this line
        
        # TODO: Calculate coverage percentages
        coverage_by_tokens = None  # Replace this line
        coverage_by_types = None  # Replace this line
        
        print(f"\n📊 Coverage Statistics:")
        print(f"   Total word tokens: {total_words:,}")
        print(f"   Unique word types: {unique_words:,}")
        print(f"   Missing tokens: {missing_count:,}")
        print(f"   Unique missing: {unique_missing:,}")
        print(f"\n🎯 Coverage Rates:")
        print(f"   By tokens: {coverage_by_tokens:.1f}%")
        print(f"   By types:  {coverage_by_types:.1f}%")
        
        return {
            'all_words': all_words,
            'missing_words': missing_words,
            'coverage_tokens': coverage_by_tokens,
            'coverage_types': coverage_by_types,
            'unique_missing': unique_missing
        }
    
    # Run coverage analysis
    coverage_stats = analyze_vocabulary_coverage(dataset, glove_model, sample_size=2000)
    
    # Analyze types of missing words
    missing_words_freq = Counter(coverage_stats['missing_words'])
    
    print(f"\n🔍 Most common missing words:")
    for word, count in missing_words_freq.most_common(15):
        print(f"   {word:<15} (appears {count} times)")
    
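    # Categorize missing words.
    # Note: analyze_vocabulary_coverage() lowercases the text and keeps only
    # alphabetic tokens longer than 2 characters, so the digit, acronym, and
    # proper-name checks below never match as written. They would only fire
    # if tokenization preserved case and digits.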
    def categorize_missing_word(word):
        if any(char.isdigit() for char in word):
            return "Numbers/Codes"
        elif len(word) <= 3:
            return "Abbreviations"
        elif word.endswith('ing') or word.endswith('ed') or word.endswith('ly'):
            return "Inflected Forms"
        elif word.isupper():
            return "Acronyms"
        elif word[0].isupper():
            return "Proper Names"
        else:
            return "Other"
    
    category_counts = Counter()
    unique_missing_words = list(set(coverage_stats['missing_words']))
    
    for word in unique_missing_words:
        category = categorize_missing_word(word)
        category_counts[category] += 1
    
    print(f"\n📈 Types of missing words:")
    for category, count in category_counts.most_common():
        percentage = (count / len(unique_missing_words)) * 100
        print(f"   {category:<15}: {count:3d} words ({percentage:4.1f}%)")
    
    # TODO: Provide your analysis
    print(f"\n🤔 YOUR ANALYSIS:")
    print(f"   What types of words are commonly missing?")
    print(f"   How might this affect a news recommender system?")
    print(f"   What recommendations would you make?")
    print(f"   YOUR INSIGHTS HERE...")

## 🎓 Reflection Questions

**Answer these based on your analysis**:

1. **Semantic Quality**: How well did GloVe capture relationships relevant to news categorization? What evidence supports your conclusion?

   *Your answer here...*

2. **Coverage Analysis**: What percentage of vocabulary was covered? What types of words were commonly missing? How would this affect a production system?

   *Your answer here...*

3. **Clustering Insights**: Which news categories clustered well in the visualization? Which ones overlapped? What does this suggest about their vocabulary similarity?

   *Your answer here...*

4. **Limitations**: What are the main limitations of this analysis? What additional tests would you run before deploying to production?

   *Your answer here...*
