# Unit 4 Introduction to Word Embeddings

Welcome to the final lesson of the **"Foundations of NLP Data Processing"** course! So far, you've explored various techniques for representing text as numerical data, including **Bag-of-Words** and **TF-IDF**. While these methods are useful, they have a major limitation: they treat words as independent tokens without capturing their meanings or relationships.

In this lesson, we introduce **word embeddings**, a powerful approach to representing words as continuous-valued vectors that encode semantic meaning. Word embeddings allow models to understand **relationships between words** based on their context in large text corpora. Unlike traditional text representations, embeddings can capture **synonymy, analogies, and contextual similarities**.

## Why Do We Need Word Embeddings?

Before diving into code, let's build an **intuition** for word embeddings by examining some limitations of traditional approaches:

### Bag-of-Words (BoW) and TF-IDF ignore meaning

In BoW and TF-IDF, words are represented as **isolated** units. Two words with similar meanings (e.g., "king" and "queen") have no direct relationship in these models.

**Example:**

  * "I love NLP" → `[1, 1, 1, 0, 0]`
  * "I enjoy NLP" → `[1, 0, 1, 1, 0]`

These two sentences are similar in meaning, but the vectors only register that they share two words. Nothing in the representation says that "love" and "enjoy" are related!
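
To make this concrete, the following sketch rebuilds the two vectors above in plain Python. The vocabulary order and the fifth word ("python") are assumptions chosen so the output matches the example, and the helper names are illustrative only.

```python
# Assumed vocabulary; "python" is a hypothetical filler for the unused fifth slot.
vocabulary = ["i", "love", "nlp", "enjoy", "python"]

def bow_vector(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocab]

def dot(a, b):
    """Dot product of two equal-length count vectors."""
    return sum(x * y for x, y in zip(a, b))

love_sentence = bow_vector("I love NLP", vocabulary)
enjoy_sentence = bow_vector("I enjoy NLP", vocabulary)
print(love_sentence)   # [1, 1, 1, 0, 0]
print(enjoy_sentence)  # [1, 0, 1, 1, 0]

# The sentences overlap only through the shared words "I" and "NLP".
print(dot(love_sentence, enjoy_sentence))  # 2

# The words "love" and "enjoy" occupy unrelated dimensions, so their overlap is zero.
print(dot(bow_vector("love", vocabulary), bow_vector("enjoy", vocabulary)))  # 0
```

Every pair of distinct words has zero overlap in this representation, whether the words are near-synonyms or completely unrelated.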

### Word Order Matters

"The cat chased the dog" and "The dog chased the cat" have different meanings, but a Bag-of-Words model gives them exactly the same vector, because both sentences contain the same words with the same counts.

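### No Concept of Context

"Apple" (the fruit) and "Apple" (the company) share a single vocabulary entry, so these models cannot tell the two senses apart.

Word embeddings tackle the first of these issues directly by representing words as **dense vectors** in a multi-dimensional space, where words with similar meanings lie close together. Static embeddings such as Word2Vec and GloVe still assign a single vector to each word, so on their own they do not capture word order or separate the two senses of "Apple."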

## Understanding Word Embeddings: The Core Idea

Word embeddings are generated by training models on large text data. The key idea is:

> Words appearing in similar contexts should have similar vector representations.

**Example:** The words "king" and "queen" often appear in similar contexts (e.g., "the king rules the kingdom," "the queen rules the kingdom").
Models like Word2Vec and GloVe learn to place these words **closer** together in the vector space.
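
The idea of "similar contexts" can be checked on the two example sentences. The sketch below collects the neighbors of each word within a small window; the window size of 2 is an assumption for illustration, and real models learn vectors from such contexts rather than comparing raw counts.

```python
from collections import Counter, defaultdict

corpus = [
    "the king rules the kingdom",
    "the queen rules the kingdom",
]
WINDOW = 2  # assumed context window: two tokens on each side

# Map each word to a count of the words that appear near it
contexts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for i, target in enumerate(tokens):
        start = max(0, i - WINDOW)
        neighbors = tokens[start:i] + tokens[i + 1:i + 1 + WINDOW]
        contexts[target].update(neighbors)

print(contexts["king"])
print(contexts["queen"])
print(contexts["king"] == contexts["queen"])  # True
```

In this toy corpus "king" and "queen" have identical context profiles, which is the kind of signal that pushes their vectors together during training.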

There are different approaches to training word embeddings:

  * **Continuous Bag of Words (CBOW):** Predicts a word based on surrounding words.
  * **Skip-gram:** Predicts surrounding words given a target word.
  * **GloVe (Global Vectors for Word Representation):** Utilizes word co-occurrence statistics from a corpus.
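
The difference between CBOW and Skip-gram is easiest to see in the training examples each one is built from. The following is a minimal sketch with hypothetical helper names and a window of 1; it only generates the examples and does not train a model.

```python
def skipgram_pairs(tokens, window=1):
    """Return (target, context) pairs: the target word predicts each neighbor."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window=1):
    """Return (context_words, target) examples: the neighbors predict the target."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append((context, target))
    return examples

tokens = "the king rules".split()
print(skipgram_pairs(tokens))
# [('the', 'king'), ('king', 'the'), ('king', 'rules'), ('rules', 'king')]
print(cbow_examples(tokens))
# [(['king'], 'the'), (['the', 'rules'], 'king'), (['king'], 'rules')]
```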

## Differences Between Word2Vec and GloVe

  * **Word2Vec:** Developed by Google, it uses either the CBOW or Skip-gram approach. It focuses on predicting a word based on its context or vice versa, which makes it efficient for capturing local context.
  * **GloVe:** Developed by Stanford, it uses global word co-occurrence statistics to learn embeddings. This approach captures both local and global context, making it effective for understanding broader semantic relationships.
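
GloVe starts from a table of how often word pairs occur near each other across the whole corpus. The sketch below builds such a table for a toy corpus. The window size and the 1/distance weighting are assumptions made for this illustration, and fitting vectors to the table is not shown.

```python
from collections import defaultdict

corpus = [
    "the king rules the kingdom",
    "the queen rules the kingdom",
]
WINDOW = 2  # assumed window size for this illustration

# Accumulate weighted co-occurrence counts over every sentence in the corpus
cooccurrence = defaultdict(float)
for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)):
            if j != i:
                # Closer neighbors contribute more: weight 1 / distance
                cooccurrence[(word, tokens[j])] += 1.0 / abs(i - j)

print(cooccurrence[("king", "rules")])           # 1.0
print(cooccurrence[("rules", "the")])            # 3.0
print(cooccurrence.get(("king", "queen"), 0.0))  # 0.0: never in the same window
```

Unlike the per-window prediction in Word2Vec, these counts are aggregated once over all sentences before any vectors are learned.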

## Using Pre-trained Word Embeddings

Instead of training from scratch, we can use **pre-trained embeddings**. Here, we'll use a smaller GloVe model (25-dimensional) for demonstration, but you can easily switch to other models like Word2Vec or FastText depending on your needs. If you want to suppress the output during model loading, you can use the following approach:

```python
import os
import contextlib
import gensim.downloader as api

# Load a smaller GloVe model (25-dimensional) without printing output
with open(os.devnull, 'w') as fnull:
    with contextlib.redirect_stdout(fnull), contextlib.redirect_stderr(fnull):
        pretrained_model = api.load("glove-twitter-25")  # or any other model

# Find similar words
print("Most similar to 'apple':", pretrained_model.most_similar("apple"))

# Compute similarity
similarity = pretrained_model.similarity("queen", "king")
print("Similarity between 'queen' and 'king':", similarity)

# Perform analogy task
result = pretrained_model.most_similar(positive=['dog', 'cats'], negative=['cat'])
print("Result of analogy 'cat' is to 'cats' as 'dog' is to:", result[0][0])
```

To suppress the output during model loading, the code imports the `os` and `contextlib` modules. `os.devnull` is the path of the operating system's null device, which discards anything written to it. `contextlib.redirect_stdout` and `contextlib.redirect_stderr` temporarily point the standard output and error streams at that file, so the model loads without printing messages to the console.
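
If you load models in several places, the same pattern can be wrapped in a small reusable context manager. This is a sketch: `suppress_output` and `noisy_load` are hypothetical names, and `noisy_load` merely stands in for a loader that prints progress messages.

```python
import contextlib
import os

@contextlib.contextmanager
def suppress_output():
    """Temporarily send stdout and stderr to the null device."""
    with open(os.devnull, "w") as fnull:
        with contextlib.redirect_stdout(fnull), contextlib.redirect_stderr(fnull):
            yield

def noisy_load():
    """Hypothetical stand-in for a loader that prints progress messages."""
    print("downloading... 100%")
    return "model"

with suppress_output():
    model = noisy_load()  # nothing is printed here

print("Loaded:", model)
```

This only affects text written through Python's `sys.stdout` and `sys.stderr`; the streams are restored as soon as the `with` block ends.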

## When to Train a Custom Word Embedding Model

While pre-trained models are powerful, there are scenarios where training a custom word embedding model might be beneficial:

  * **Domain-Specific Vocabulary:** If your text data contains specialized vocabulary not well-represented in general corpora, a custom model can better capture these nuances.
  * **Language Variants:** For dialects or less common languages, pre-trained models may not be available or effective.
  * **Updated Contexts:** If your data reflects recent trends or changes in language use, a custom model can capture these shifts.

## Visualizing Word Embeddings

To understand the learned embeddings, we can visualize them using **Principal Component Analysis** (PCA), which reduces the high-dimensional vectors to 2D space:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

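candidate_words = ["dog", "cat", "computer", "town", "city"]

# Keep only words present in the vocabulary so labels stay aligned with vectors
words = [word for word in candidate_words if word in pretrained_model]

# Get Word Vectors
vectors = np.array([pretrained_model.get_vector(word) for word in words])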

# Reduce Dimensionality
pca = PCA(n_components=2)
reduced_vectors = pca.fit_transform(vectors)

# Plot
plt.figure(figsize=(6,6))
for word, (x, y) in zip(words, reduced_vectors):
    plt.scatter(x, y)
    plt.text(x + 0.01, y + 0.01, word, fontsize=12)
plt.title("Word Embeddings Visualization (PCA)")
plt.show()
```

This visualization helps us see how similar words are **grouped together** in the vector space. For instance, "cat" and "dog" typically land close to each other, as do "city" and "town," while "computer" sits further away, reflecting its distinct meaning compared to the other words. A 2D projection discards some information, so distances in the plot only approximate distances in the full embedding space.
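
To see what PCA does to the vectors, here is a minimal NumPy sketch of the same reduction using a singular value decomposition. The 4-dimensional vectors are made up for illustration; on real data the result should match scikit-learn's `PCA` up to the sign of each component.

```python
import numpy as np

# Hypothetical 4-dimensional "embeddings" for five words
vectors = np.array([
    [0.9, 0.8, 0.1, 0.0],   # dog
    [0.8, 0.9, 0.2, 0.1],   # cat
    [0.1, 0.0, 0.9, 0.8],   # computer
    [0.2, 0.1, 0.3, 0.9],   # town
    [0.1, 0.2, 0.4, 0.8],   # city
])

# 1. Center the data so each dimension has zero mean
centered = vectors - vectors.mean(axis=0)

# 2. SVD: the rows of vt are the principal directions, ordered by variance
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

# 3. Project onto the first two principal directions
reduced = centered @ vt[:2].T

explained = singular_values**2 / np.sum(singular_values**2)
print(reduced.shape)         # (5, 2)
print(explained[:2].sum())   # share of variance kept by the 2D projection
```

The share of variance kept is a useful sanity check: the lower it is, the less the 2D plot can be trusted to reflect the original distances.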

## Summary and Next Steps

Word embeddings represent words as dense vectors whose geometry reflects how the words are used. Models like Word2Vec learn them by predicting words from their neighbors, while GloVe fits them to corpus-wide co-occurrence statistics. Pre-trained embeddings let you reuse vectors learned from large corpora without training your own, and visualization helps you inspect the relationships they encode. Moving forward, experiment with different datasets, explore other models such as Word2Vec and FastText, and apply word embeddings to NLP tasks such as sentiment analysis and text classification.

## Exploring Word Similarity with GloVe

You've learned about using pre-trained word embeddings. Now, let's put that into practice!

Your task is to load a small GloVe model using the `gensim.downloader` module. Then choose two pairs of words: one pair with related meanings and another with unrelated meanings.

Use the model's `similarity` method to compute and print the similarity score for each pair.

```python
import gensim.downloader as api
import contextlib
import os

# Suppress output while loading the model
with open(os.devnull, 'w') as fnull:
    with contextlib.redirect_stdout(fnull), contextlib.redirect_stderr(fnull):
        # TODO: Load a smaller GloVe model (e.g., "glove-wiki-gigaword-50")
        pretrained_model = api.load("")  # or any other model


# TODO: Choose word pairs
similar_pair = ("", "")
unrelated_pair = ("", "")

# TODO: Compute similarity scores

# TODO: Print similarity scores

```

**Solution:**

```python
import gensim.downloader as api
import contextlib
import os

# Suppress output while loading the model
with open(os.devnull, 'w') as fnull:
    with contextlib.redirect_stdout(fnull), contextlib.redirect_stderr(fnull):
        # Load a smaller GloVe model (e.g., "glove-wiki-gigaword-50") 
        pretrained_model = api.load("glove-wiki-gigaword-50")

# Choose word pairs
similar_pair = ("king", "queen")
unrelated_pair = ("apple", "car")

# Compute similarity scores
similar_score = pretrained_model.similarity(similar_pair[0], similar_pair[1])
unrelated_score = pretrained_model.similarity(unrelated_pair[0], unrelated_pair[1])

# Print similarity scores
print(f"Similarity between '{similar_pair[0]}' and '{similar_pair[1]}': {similar_score:.4f}")
print(f"Similarity between '{unrelated_pair[0]}' and '{unrelated_pair[1]}': {unrelated_score:.4f}")
```

### Analysis of the Results

The output shows how word embeddings capture semantic relationships. The similarity score for the pair ("king", "queen") should be noticeably higher than the score for ("apple", "car"). "King" and "queen" often appear in similar contexts and share a conceptual relationship, so the model places them closer together in the vector space. In contrast, "apple" and "car" have very different meanings and contexts, which results in a lower score. The `similarity` method returns the cosine similarity of the two word vectors, so scores closer to 1 indicate more closely related words. This kind of graded relatedness is something count-based representations such as Bag-of-Words cannot express.
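
The score itself is a cosine similarity, which can be computed by hand. The sketch below uses hypothetical 3-dimensional vectors invented for illustration; the real GloVe vectors have many more dimensions and different values.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-dimensional vectors, made up for illustration
king = np.array([0.9, 0.8, 0.1])
queen = np.array([0.8, 0.9, 0.2])
apple = np.array([0.1, 0.9, 0.1])
car = np.array([0.9, 0.1, 0.1])

print(cosine_similarity(king, queen))  # close to 1
print(cosine_similarity(apple, car))   # much lower
```

Because the measure depends only on the angle between vectors, two words can be highly similar even if their vectors differ in length.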

## Exploring Word Synonyms with Embeddings

You've just explored the power of pre-trained word embeddings. Now, let's dive deeper!

Your task is to choose at least three distinct words (e.g., "cat", "computer", "city") and use the pre-trained model's `most_similar` method to find and print the five most similar words for each. These nearest neighbors are often synonyms or closely related terms.

  * Load the pre-trained model.
  * Select your words.
  * Use `most_similar` to find the nearest neighbors.
  * Print the top five for each word.

Good luck!

```python
import gensim.downloader as api
import contextlib
import os

# Suppress output while loading the model
with open(os.devnull, 'w') as fnull:
    with contextlib.redirect_stdout(fnull), contextlib.redirect_stderr(fnull):
        pretrained_model = api.load("glove-twitter-25")  # or any other model

# Choose words to find synonyms for 
words_to_check = ["cat", "computer", "city"]

# TODO: Find and print top 5 synonyms for each word

```

**Solution:**

```python
import gensim.downloader as api
import contextlib
import os

# Suppress output while loading the model
with open(os.devnull, 'w') as fnull:
    with contextlib.redirect_stdout(fnull), contextlib.redirect_stderr(fnull):
        pretrained_model = api.load("glove-twitter-25")  # or any other model

# Choose words to find synonyms for 
words_to_check = ["cat", "computer", "city"]

# Find and print top 5 synonyms for each word
for word in words_to_check:
    print(f"Top 5 most similar words to '{word}':")
    try:
        similar_words = pretrained_model.most_similar(word, topn=5)
        for i, (similar_word, score) in enumerate(similar_words):
            print(f"{i+1}. {similar_word} (Similarity: {score:.4f})")
    except KeyError:
        print(f"'{word}' not found in the vocabulary.")
    print("-" * 30)
```

### Analysis of the Results

The code demonstrates how the `most_similar` method uses the vector representations of words to find other words that are semantically close in the embedding space.

  * For the word "cat," the model typically returns other animals or related terms such as "dog" and "cats," showing that it captures high-level semantic categories.
  * For "computer," expect related technology terms such as "pc," "laptop," or "device," which are near-synonyms or closely related concepts.
  * For "city," the results will likely include other places such as "town" or individual city names like "chicago," reflecting how similarly these words are used in the training data. The GloVe vocabulary is made of single tokens, so multi-word names such as "new york" do not appear as one entry.

This exercise illustrates how word embeddings represent the nuances of language, a significant improvement over simple frequency-based methods like Bag-of-Words. The exact neighbors and scores depend on the model and the corpus it was trained on.
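
Conceptually, a nearest-neighbor lookup ranks every other word in the vocabulary by cosine similarity to the query. The following sketch shows that idea on a hypothetical six-word vocabulary with made-up vectors; `nearest_neighbors` is an illustrative helper, not gensim's implementation.

```python
import numpy as np

# Hypothetical toy embeddings; a real model holds one row per vocabulary word
vocab = ["cat", "dog", "kitten", "computer", "laptop", "city"]
matrix = np.array([
    [0.9, 0.1, 0.0],   # cat
    [0.8, 0.2, 0.1],   # dog
    [0.9, 0.0, 0.1],   # kitten
    [0.0, 0.9, 0.2],   # computer
    [0.1, 0.9, 0.1],   # laptop
    [0.1, 0.2, 0.9],   # city
])

def nearest_neighbors(word, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    # Normalizing rows to unit length turns dot products into cosine similarities
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = unit @ unit[vocab.index(word)]
    order = np.argsort(-scores)  # indices from highest to lowest score
    return [(vocab[i], float(scores[i])) for i in order if vocab[i] != word][:topn]

print(nearest_neighbors("cat"))
print(nearest_neighbors("computer", topn=1))
```

Normalizing once and using a single matrix product is what makes this kind of search fast even for vocabularies with hundreds of thousands of words.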

## Word Analogy with GloVe

You've just learned about the power of pre-trained word embeddings. Now, let's apply that knowledge!

Your task is to create a word analogy using the pre-trained GloVe model. Use the `most_similar` method to find words similar to the result of "bad" minus "big" plus "biggest".

  * Use `most_similar` to perform the analogy.
  * Check whether the result makes sense in the context of the analogy.

This exercise will show you how well the model captures word relationships. Dive in and see the magic!

```python
import gensim.downloader as api
import contextlib
import os

# Suppress output while loading the model
with open(os.devnull, 'w') as fnull:
    with contextlib.redirect_stdout(fnull), contextlib.redirect_stderr(fnull):
        # TODO: Load Pre-trained GloVe Model (Twitter 25)
        glove_model = None  # placeholder so the block parses; replace with the loaded model

# TODO: Use the model's most_similar function to create a word analogy: "bad" - "big" + "biggest"

# TODO: Print the top result

```

**Solution:**

```python
import gensim.downloader as api
import contextlib
import os

# Suppress output while loading the model
with open(os.devnull, 'w') as fnull:
    with contextlib.redirect_stdout(fnull), contextlib.redirect_stderr(fnull):
        # Load Pre-trained GloVe Model (Twitter 25)
        glove_model = api.load("glove-twitter-25")

# Use the model's most_similar function to create a word analogy: "bad" - "big" + "biggest"
analogy_result = glove_model.most_similar(positive=['bad', 'biggest'], negative=['big'], topn=1)

# Print the top result
print(f"Result of analogy 'bad' - 'big' + 'biggest': {analogy_result[0][0]}")
```

### Analysis of the Results

The code demonstrates a classic word analogy problem using vector arithmetic in the GloVe embedding space. The operation `vector("bad") - vector("big") + vector("biggest")` attempts to solve the analogy: "big" is to "biggest" as "bad" is to...

The top result `analogy_result[0][0]` is expected to be "worst" or a close variant; the exact output depends on the embedding model. That answer makes sense because "biggest" is the superlative form of "big," and "worst" is the superlative form of "bad." The result shows that word embeddings are not only about finding similar words: they also encode regularities such as analogies and grammatical transformations as directions in the vector space, which traditional methods like Bag-of-Words and TF-IDF cannot do.
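
The arithmetic behind the analogy can be reproduced on a toy example. In the sketch below the vectors are hypothetical and hand-built so that the three axes roughly mean size, badness, and superlative degree; `analogy` is an illustrative helper, and gensim's own implementation differs in details such as vector normalization.

```python
import numpy as np

# Hypothetical toy embeddings with axes roughly meaning (size, badness, superlative)
vectors = {
    "big":     np.array([1.0, 0.0, 0.0]),
    "biggest": np.array([1.0, 0.0, 1.0]),
    "bad":     np.array([0.0, 1.0, 0.0]),
    "worse":   np.array([0.0, 1.0, 0.5]),
    "worst":   np.array([0.0, 1.0, 1.0]),
    "small":   np.array([-1.0, 0.0, 0.0]),
}

def analogy(positive, negative):
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in positive or word in negative:
            continue  # the query words themselves are not valid answers
        score = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

print(analogy(positive=["bad", "biggest"], negative=["big"]))  # worst
```

Subtracting "big" from "biggest" isolates the superlative direction, and adding it to "bad" lands on the vector for "worst." Real embeddings are noisier, which is why the answer there is the nearest word rather than an exact match.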