# Word2Vec

<a target="_blank" href="https://colab.research.google.com/github/UniquifyAI/uniquify-ai-training/blob/main/topics/natural_language_processing/packages/word2vec.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Overview

In this notebook, we will explore how to use Word2Vec to create word embeddings from a given text corpus. Word2Vec learns vector representations of words in such a way that words that are semantically similar have similar vector representations. These embeddings can then be used for tasks such as word similarity, analogy solving, and more.

Word2Vec is a popular method for representing words as vectors in a continuous vector space, capturing semantic meaning and relationships between words based on the contexts in which they appear in large text corpora. Developed by Tomas Mikolov and colleagues at Google in 2013, Word2Vec maps each word to a dense, fixed-size vector so that semantically similar words land at nearby points.


### **Skip-gram Model**

The **Skip-gram model** is one of the two primary models in Word2Vec (the other being CBOW). The objective of the skip-gram model is to predict the surrounding context words given a target word. It works by sliding a window over a sequence of words in a text and predicting which words are likely to occur near the target word.

#### **How It Works:**

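- Given a target word $w_t$, the skip-gram model tries to predict the context words $w_{t-k}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+k}$, that is, up to $k$ words on either side of the target.
- For example, if the sentence is “The cat sits on the mat,” the target word is "cat," and $k = 2$, the model would try to predict the words "The," "sits," and "on" based on "cat." Only one word precedes "cat," so the window is clipped at the start of the sentence.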
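
The sliding window described above can be sketched in a few lines of plain Python. This is a minimal illustration, not Gensim's implementation; the function name `skipgram_pairs` is hypothetical, and the example sentence is lowercased for simplicity.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for every position in `tokens`."""
    pairs = []
    for t, target in enumerate(tokens):
        # Clip the window at the sentence boundaries.
        start = max(0, t - window)
        end = min(len(tokens), t + window + 1)
        for j in range(start, end):
            if j != t:  # the target is never its own context word
                pairs.append((target, tokens[j]))
    return pairs


sentence = "the cat sits on the mat".split()
pairs = skipgram_pairs(sentence, window=2)

# Context words generated for the target "cat"
print([context for target, context in pairs if target == "cat"])
# ['the', 'sits', 'on']
```

Each pair becomes one training example: the target word is the input, and the context word is the label the model should assign a high probability to.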

#### **Mathematical Formulation:**

The goal of the skip-gram model is to maximize the probability of the context words given the target word:

$$
P(w_{context}|w_{target}) = \prod_{-k \leq j \leq k, j \neq 0} P(w_{t+j} | w_t)
$$

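Where $w_{context}$ denotes the words surrounding the target word $w_t$: up to $k$ on each side, so at most $2k$ context words in total. Writing the probability as a product assumes the context words are conditionally independent given the target.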
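
Taking the logarithm turns the product into a sum, and averaging over every position in a corpus of $T$ words gives the quantity that is maximized during training. This is the form used in the second Word2Vec paper; the symbol $T$ is introduced here and does not appear in the notation above.

$$
\frac{1}{T} \sum_{t=1}^{T} \sum_{-k \leq j \leq k, j \neq 0} \log P(w_{t+j} | w_t)
$$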

#### **Architecture:**

- **Input Layer**: The input to the model is the one-hot vector representing the target word.
- **Hidden Layer**: A dense layer with a smaller dimensionality (embedding size) that transforms the high-dimensional one-hot encoding into a dense vector (word embedding).
- **Output Layer**: The output layer computes the probability of each word in the vocabulary appearing as a context word given the target word.

The model uses **softmax** as the activation function in the output layer to predict the probabilities for each context word.
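
The three layers can be traced with a small NumPy forward pass. This is a sketch under stated assumptions: the five-word vocabulary, the embedding size of 3, and the random weight matrices `W_in` and `W_out` are all made up for illustration, and no training takes place.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sits", "on", "mat"]
vocab_size, embedding_size = len(vocab), 3

# Hypothetical, randomly initialised weights (training would update these).
W_in = rng.normal(size=(vocab_size, embedding_size))   # input -> hidden
W_out = rng.normal(size=(embedding_size, vocab_size))  # hidden -> output


def softmax(scores):
    shifted = scores - scores.max()  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()


# Input layer: one-hot vector for the target word "cat".
one_hot = np.zeros(vocab_size)
one_hot[vocab.index("cat")] = 1.0

# Hidden layer: multiplying by a one-hot vector selects one row of W_in,
# so this layer is effectively an embedding lookup with no activation.
hidden = one_hot @ W_in

# Output layer: one probability per vocabulary word.
probs = softmax(hidden @ W_out)
print({word: round(float(p), 3) for word, p in zip(vocab, probs)})
```

After training, the rows of `W_in` are the word embeddings; the output weights are usually discarded.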

### **Continuous Bag of Words (CBOW)**

The **CBOW model** is the reverse of the skip-gram model. Instead of predicting the context words given a target word, CBOW predicts the target word given its surrounding context words.

#### **How It Works:**

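- For example, with a window of two words on each side and the sentence “The cat sits on the mat,” the CBOW model is given the context words "The," "cat," "on," and "the" and tries to predict the target word "sits."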
- It takes the average of the word embeddings of the context words and uses that to predict the target word.

#### **Mathematical Formulation:**

The CBOW model tries to maximize the probability of the target word given the surrounding context words:

$$
P(w_{t} | w_{t-k}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+k})
$$

Where $w_t$ is the target word and $w_{t-k}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+k}$ are the context words. The target itself is excluded from the context.

#### **Architecture:**

- **Input Layer**: The input consists of the one-hot vectors of the context words.
- **Hidden Layer**: A shared dense layer maps each context word to its embedding, and the embeddings are averaged into a single fixed-size vector.
- **Output Layer**: Similar to the skip-gram model, the output layer applies a softmax to predict the target word based on the context.
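
The CBOW forward pass differs from skip-gram only in how the hidden vector is built. The sketch below assumes a toy five-word vocabulary, an embedding size of 3, and random untrained weights (`W_in`, `W_out`), all of which are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sits", "on", "mat"]
word_to_id = {word: i for i, word in enumerate(vocab)}
embedding_size = 3

# Hypothetical, untrained weights.
W_in = rng.normal(size=(len(vocab), embedding_size))
W_out = rng.normal(size=(embedding_size, len(vocab)))

# Context for the target "sits" with two words on each side.
context = ["the", "cat", "on", "the"]
context_ids = [word_to_id[word] for word in context]

# Hidden layer: average of the context word embeddings.
hidden = W_in[context_ids].mean(axis=0)

# Output layer: softmax over the vocabulary.
scores = hidden @ W_out
probs = np.exp(scores - scores.max())
probs /= probs.sum()

predicted = vocab[int(probs.argmax())]
print(f"Predicted target (untrained weights): {predicted}")
```

Because the weights are random, the prediction is arbitrary; training adjusts `W_in` and `W_out` so that "sits" receives the highest probability for this context.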

### **Word Embeddings**

Both the skip-gram and CBOW models produce **word embeddings**, which are dense, low-dimensional representations of words. These embeddings capture semantic relationships between words, so words with similar meanings tend to have similar embeddings.

For example, the words "king" and "queen" would have embeddings that are close in the vector space, and we can even observe patterns like:

$$
\text{vector("king")} - \text{vector("man")} + \text{vector("woman")} \approx \text{vector("queen")}
$$
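
The arithmetic can be checked on hand-made vectors. The 2-D embeddings below are invented for illustration (one axis loosely standing for "royalty," the other for "gender"); real embeddings have many more dimensions and only approximate this pattern.

```python
import numpy as np

# Hypothetical 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man": np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.1]),
}


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


query = vectors["king"] - vectors["man"] + vectors["woman"]

# Exclude the words used to build the query, then pick the closest remaining word.
candidates = {w: v for w, v in vectors.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(query, candidates[w]))
print(best)  # queen
```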

### **Negative Sampling**

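To make training more efficient, Word2Vec can use **negative sampling**, a technique that replaces the full softmax with a set of binary classification problems. For each observed (target, context) pair, the model learns to distinguish the true context word from a small sample of negative examples (randomly drawn words treated as not appearing in the context) instead of normalizing over the entire vocabulary. The cost of each update then scales with the number of negative samples rather than the vocabulary size, making Word2Vec suitable for large-scale text data.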
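
A minimal NumPy sketch of the per-pair loss is shown below. The vocabulary size, embedding size, weights, and word ids are all hypothetical, and negatives are drawn uniformly for simplicity, whereas the original paper samples them from a smoothed unigram distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_size, num_negatives = 1000, 50, 5

# Hypothetical, untrained parameters: one input and one output vector per word.
W_in = rng.normal(scale=0.1, size=(vocab_size, embedding_size))
W_out = rng.normal(scale=0.1, size=(vocab_size, embedding_size))


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def negative_sampling_loss(target_id, context_id, negative_ids):
    """Loss for one (target, context) pair and its sampled negatives."""
    v = W_in[target_id]
    positive_score = W_out[context_id] @ v     # scalar
    negative_scores = W_out[negative_ids] @ v  # one score per negative
    # Push the true pair's score up and the sampled pairs' scores down.
    return -np.log(sigmoid(positive_score)) - np.log(sigmoid(-negative_scores)).sum()


target_id, context_id = 3, 7
negative_ids = rng.integers(0, vocab_size, size=num_negatives)  # uniform, for simplicity
loss = negative_sampling_loss(target_id, context_id, negative_ids)
print(f"Loss using {num_negatives + 1} output vectors instead of {vocab_size}: {loss:.4f}")
```

Only six output vectors are touched per training pair here, which is where the savings over a full softmax come from.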

### **Applications of Word2Vec**

- **Semantic Similarity**: Word2Vec embeddings can be used to measure the similarity between words based on their contexts.
- **Analogy Solving**: Word2Vec can capture word relationships like “king is to queen as man is to woman.”
- **Pre-training for NLP Models**: The learned word embeddings can be used to initialize word vectors in other natural language processing tasks, such as text classification, machine translation, and more.

### Objective
In this notebook, we will:

- Learn word embeddings from a text corpus using Word2Vec.
- Understand the theory behind Word2Vec (Skip-gram and Continuous Bag of Words models).
- Visualize the learned word embeddings using dimensionality reduction techniques like PCA or t-SNE.
- Use the learned embeddings to find similarities and relationships between words.

### Dataset

Word2Vec models are typically trained on large text corpora, as they rely on the co-occurrence of words within a context window to learn word representations. Some popular datasets for training Word2Vec models include:
- **Wikipedia Dumps**
- **Common Crawl**
- **Text8 Dataset** (the first 100 MB of cleaned English Wikipedia text)
- **Gutenberg Project Texts**

For this notebook, we'll use the **text8 dataset**, which is small enough to train on quickly and is commonly used for training word embedding models. It contains about 17 million words extracted from Wikipedia and is freely available for download. You can also use another pre-existing dataset or scrape your own data from websites, books, or other textual resources.

If you want to use other datasets, make sure they are in plain text format. The text8 file is already lowercased and stripped of punctuation, but raw text from other sources will need to be preprocessed, including:
- Lowercasing all words
- Removing punctuation
- Tokenizing the text into words
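
These three steps can be sketched with the standard library alone. The helper name `basic_preprocess` is hypothetical, and the regular expression assumes English text with ASCII letters; Gensim's `simple_preprocess`, used later in this notebook, does a similar job.

```python
import re


def basic_preprocess(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    text = text.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return text.split()


print(basic_preprocess("The cat sits on the mat. It's warm!"))
# ['the', 'cat', 'sits', 'on', 'the', 'mat', 'it', 's', 'warm']
```

Note how the apostrophe in "It's" splits the word in two; real pipelines often handle contractions explicitly.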

## Getting Started

### Installation

In [None]:
%pip install -q gensim \
    numpy \
    matplotlib \
    requests \
    scikit-learn

### Import libraries

In [None]:
# Import the necessary libraries
import gensim
import requests
import numpy as np
import matplotlib.pyplot as plt
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from gensim.models.word2vec import Text8Corpus
from sklearn.manifold import TSNE

## Word2Vec Training on Text8 Dataset

### Load and Preprocess the Dataset
We will load and preprocess the Text8 dataset, as described in the Dataset section.

In [None]:
# Download the Text8 dataset
import zipfile

url = "http://mattmahoney.net/dc/text8.zip"
filename = "text8.zip"
response = requests.get(url, timeout=60)
response.raise_for_status()  # Fail early if the download did not succeed

# Save the dataset locally
with open(filename, "wb") as f:
    f.write(response.content)

# Unzip the file
with zipfile.ZipFile(filename, "r") as zip_ref:
    zip_ref.extractall(".")

# Load the dataset
with open("text8", "r", encoding="utf-8") as file:
    text_data = file.read()


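# Preprocessing the text
def preprocess_text(text):
    # simple_preprocess lowercases, tokenizes, and drops tokens shorter than
    # 2 or longer than 15 characters, so the token count printed below is
    # slightly lower than the raw word count of the file.
    return simple_preprocess(text)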


# Tokenize the dataset
tokens = preprocess_text(text_data)
print(f"First 20 tokens: {tokens[:20]}")

# Output total number of tokens
print(f"Total number of tokens: {len(tokens)}")

In [None]:
# Load the Text8 dataset as an iterable of "sentences" (lists of tokens),
# which is the input format Word2Vec expects
dataset = Text8Corpus("text8")

# Print the first sentence from the dataset
first_sentence = next(iter(dataset))
print(f"First sentence: {first_sentence}")

### Train the Word2Vec Model
Now that the data is preprocessed, we can train the Word2Vec model using the Skip-gram method. We'll define the hyperparameters such as embedding size, window size, and number of iterations.

In [None]:
# Define model hyperparameters
embedding_size = 100  # Dimensionality of word vectors
window_size = 5  # Maximum distance between the target and a context word
min_count = 5  # Ignore words that appear fewer than 5 times
workers = 4  # Number of worker threads for training
epochs = 5  # Number of passes (iterations) over the corpus

# Initialize and train the Word2Vec model
model = Word2Vec(
    sentences=dataset,
    vector_size=embedding_size,
    window=window_size,
    min_count=min_count,
    workers=workers,
    sg=1,  # 1 = Skip-gram; Gensim's default (0) is CBOW
    epochs=epochs,
)

# Save the model
model.save("word2vec_text8.model")

The model is trained on the dataset, and now we have word embeddings for each word in the vocabulary.
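
The vocabulary contains only words that passed the `min_count` threshold. The effect of that threshold can be illustrated with a short standard-library sketch; `build_vocab` and the toy sentences are hypothetical, and this is not how Gensim implements it internally.

```python
from collections import Counter


def build_vocab(sentences, min_count):
    """Keep words seen at least `min_count` times, most frequent first."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [word for word, count in counts.most_common() if count >= min_count]


toy_sentences = [
    ["the", "cat", "sits", "on", "the", "mat"],
    ["the", "dog", "sits", "on", "the", "rug"],
]
print(build_vocab(toy_sentences, min_count=2))
# ['the', 'sits', 'on']
```

Rare words such as "cat" and "rug" are dropped because a handful of occurrences gives the model too little context to learn a reliable vector.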

### Visualizing Word Embeddings
To understand how Word2Vec captures the relationships between words, we can visualize the word embeddings using t-SNE, which reduces the high-dimensional word vectors into two dimensions.

In [None]:
# Extract word vectors from the model
word_vectors = model.wv
words = list(word_vectors.index_to_key)[:500]  # The 500 most frequent words (index_to_key is sorted by frequency)

# Get the vectors for the selected words
word_vecs = np.array([word_vectors[word] for word in words])

# Use t-SNE to reduce dimensionality
tsne = TSNE(n_components=2, random_state=0)
reduced_vecs = tsne.fit_transform(word_vecs)

# Plot the t-SNE projection
plt.figure(figsize=(12, 8))
plt.scatter(reduced_vecs[:, 0], reduced_vecs[:, 1])

# Annotate points with word labels
for i, word in enumerate(words):
    plt.annotate(word, xy=(reduced_vecs[i, 0], reduced_vecs[i, 1]))

plt.title("Word Embeddings Visualization using t-SNE")
plt.show()
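
PCA is a faster, deterministic alternative to t-SNE for a first look at the embeddings. The sketch below implements it with NumPy's SVD rather than a library class; the function name `pca_2d` is hypothetical, and random data stands in for `word_vecs` so the block runs on its own.

```python
import numpy as np


def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of `components` are the principal directions, sorted by variance.
    _, _, components = np.linalg.svd(centered, full_matrices=False)
    return centered @ components[:2].T


# Random stand-in for `word_vecs` (500 words x 100 dimensions).
rng = np.random.default_rng(0)
fake_word_vecs = rng.normal(size=(500, 100))
projected = pca_2d(fake_word_vecs)
print(projected.shape)  # (500, 2)
```

Passing the real `word_vecs` array to `pca_2d` yields coordinates that can be plotted with the same scatter and annotation code used for t-SNE. PCA preserves global directions of variance, while t-SNE emphasizes local neighborhoods, so the two plots can look quite different.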

### Inference with Word2Vec

After training the Word2Vec model, we can now use it to perform various tasks such as:
- **Finding similar words** based on cosine similarity of their embeddings.
- **Word analogies**, which help us understand the semantic relationships between words.

#### Finding Similar Words

We can use the trained Word2Vec model to find words that are similar to a given word, based on the learned word vectors. The similarity is typically calculated using **cosine similarity** between word embeddings.

In [None]:
# Load the saved Word2Vec model
model = Word2Vec.load("word2vec_text8.model")

# Find words similar to 'king'
similar_words = model.wv.most_similar("king", topn=5)
print("Words most similar to 'king':")
for word, similarity in similar_words:
    print(f"{word}: {similarity}")
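
Cosine similarity between two vectors $u$ and $v$ is $\frac{u \cdot v}{\|u\| \|v\|}$, so ranking by similarity reduces to a dot product once every vector is scaled to unit length. The sketch below reproduces that ranking from scratch on invented 3-D vectors; it illustrates the idea and is not Gensim's code.

```python
import numpy as np

# Hypothetical toy embeddings; a trained model would supply real vectors.
words = ["king", "queen", "prince", "car"]
vectors = np.array([
    [0.9, 0.8, 0.1],
    [0.8, 0.9, 0.1],
    [0.9, 0.6, 0.2],
    [0.1, 0.0, 0.9],
])


def most_similar(query, topn=2):
    # After L2-normalising the rows, a dot product equals cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[words.index(query)]
    order = np.argsort(-sims)  # indices sorted by descending similarity
    return [(words[i], float(sims[i])) for i in order if words[i] != query][:topn]


print(most_similar("king"))
```

With these vectors, "queen" ranks first and "prince" second, while "car" points in a nearly orthogonal direction and scores much lower.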

#### Word Analogies
One of the most famous features of Word2Vec is the ability to perform word analogies. For example, the relationship between "king" and "queen" is similar to the relationship between "man" and "woman". We can find the word that completes the analogy "king is to man as queen is to ...".

In [None]:
# Performing word analogy: 'king' is to 'man' as 'queen' is to ?
result = model.wv.most_similar(positive=["queen", "man"], negative=["king"], topn=1)
print(f"'King' is to 'man' as 'queen' is to: {result[0][0]}")

#### Finding the Similarity between Two Words
We can also calculate the similarity between two words to understand how closely related they are in the embedding space.

In [None]:
# Calculate similarity between "king" and "queen"
similarity_score = model.wv.similarity("king", "queen")
print(f"Similarity between 'king' and 'queen': {similarity_score}")

# Calculate similarity between "king" and "car"
similarity_score = model.wv.similarity("king", "car")
print(f"Similarity between 'king' and 'car': {similarity_score}")

## Quizzes

1. **What is Word2Vec used for?**

* a. Image recognition
* b. Learning word embeddings
* c. Time series analysis
* d. Database management

<details>
  <summary>Answer</summary>
  <p>b. Learning word embeddings</p>
</details>

2. **What are word embeddings?**

* a. Images representing words
* b. Numerical vectors representing words
* c. Audio files representing words
* d. HTML code representing words

<details>
  <summary>Answer</summary>
  <p>b. Numerical vectors representing words</p>
</details>


3. **What are the two main architectures of Word2Vec?**

* a. CNN and RNN
* b. CBOW and Skip-gram
* c. LSTM and GRU
* d. SVM and Naive Bayes

<details>
  <summary>Answer</summary>
  <p>b. CBOW and Skip-gram</p>
</details>

4. **What does CBOW stand for?**

* a. Continuous Binary Output Words
* b. Continuous Bag-of-Words
* c. Contextual Bit-Oriented Words
* d. Cross-Boundary Output Words

<details>
  <summary>Answer</summary>
  <p>b. Continuous Bag-of-Words</p>
</details>

5. **How does Word2Vec capture the semantic meaning of words?**

* a. By counting the frequency of words
* b. By analyzing the grammatical structure of sentences
* c. By learning the context in which words appear
* d. By using a dictionary of synonyms

<details>
  <summary>Answer</summary>
  <p>c. By learning the context in which words appear</p>
</details>

6. **Explain how the Skip-gram model works.**

<details>
  <summary>Answer</summary>
  <p>The Skip-gram model takes a target word as input and tries to predict the surrounding context words. It does this by training a neural network to maximize the probability of predicting the context words given the target word.</p>
</details>

7. **Explain how the CBOW model works.**

<details>
  <summary>Answer</summary>
  <p>The CBOW model takes the surrounding context words as input and tries to predict the target word. It trains a neural network to maximize the probability of predicting the target word given the context words.</p>
</details>

## Further Learning and Resources

* **Original Word2Vec Papers:**
    * Efficient Estimation of Word Representations in Vector Space: [https://arxiv.org/abs/1301.3781](https://arxiv.org/abs/1301.3781)
    * Distributed Representations of Words and Phrases and their Compositionality: [https://arxiv.org/abs/1310.4546](https://arxiv.org/abs/1310.4546)
* **Gensim Documentation:** Comprehensive documentation for the Gensim library, including Word2Vec. [https://radimrehurek.com/gensim/models/word2vec.html](https://radimrehurek.com/gensim/models/word2vec.html)
* **Stanford CS224N:** Natural Language Processing with Deep Learning. [http://web.stanford.edu/class/cs224n/](http://web.stanford.edu/class/cs224n/)
* **Deep Learning for NLP (Goldberg):** A comprehensive textbook on deep learning for natural language processing. [https://www.morganclaypoolpublishers.com/doi/abs/10.2200/S00762ED1V01Y201703HLT037](https://www.morganclaypoolpublishers.com/doi/abs/10.2200/S00762ED1V01Y201703HLT037)
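* **Hugging Face Transformers:** Not Word2Vec, but a library of Transformer-based models, which produce contextual embeddings and have largely superseded static word vectors in many NLP tasks. [https://huggingface.co/transformers/](https://huggingface.co/transformers/)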


## Conclusion
In this notebook, we explored how to use Word2Vec for learning word representations from text data. Here's a summary of the steps we followed:

- **Overview**: We introduced Word2Vec and its use for generating word embeddings.
- **Dataset**: We used the Text8 dataset, a cleaned sample of English Wikipedia text, and preprocessed it for training.
- **Training**: We trained a Word2Vec model using the Gensim library, applying the Skip-gram method to learn word embeddings.
- **Visualization**: We visualized the word embeddings using t-SNE to observe how similar words are clustered in the vector space.

### Key Takeaways:
- Word2Vec is a powerful tool for learning word embeddings from large text corpora, capturing semantic similarities between words.
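- Skip-gram and CBOW are the two model architectures in Word2Vec: Skip-gram predicts context words from a target word, while CBOW predicts the target word from its context. The choice between them depends on the task and dataset.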
- Word embeddings can be used in various natural language processing tasks, such as text classification, semantic similarity, and more.