# Word Embeddings

Traditional (frequency-based) ways of representing words so that machines can process them, such as one-hot encoding, map each word to a sparse vector whose dimension equals the size of the vocabulary. Exactly one element of the vector is "hot" (set to 1) to mark the word, and every other element is 0. While simple, this approach suffers from the curse of dimensionality and carries no semantic information, so it cannot capture relationships between words.
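To make the limitation concrete, here is a minimal sketch of one-hot encoding in plain Python. The four-word vocabulary and the helper names `one_hot` and `dot` are made up for illustration. The dot product between any two different words is always 0, so the encoding cannot express that "cat" is closer to "dog" than to "mat".

```python
# Toy vocabulary (hypothetical, chosen only for illustration)
vocab = ["cat", "dog", "mat", "pet"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a list with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

cat_vec = one_hot("cat")
dog_vec = one_hot("dog")

print(cat_vec)                # [1, 0, 0, 0]
print(dot(cat_vec, dog_vec))  # 0: two different words never overlap
print(dot(cat_vec, cat_vec))  # 1: a word only matches itself
```

Note that the vector length grows with the vocabulary, which is where the dimensionality problem comes from.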

Prediction-based embeddings, on the other hand, are dense vectors with continuous values that are trained using machine learning techniques, often based on neural networks. The idea is to learn representations that encode semantic meaning and relationships between words. Word embeddings are trained by exposing a model to a large amount of text data and adjusting the vector representations based on the context in which words appear.

One popular method for training prediction-based embeddings is Word2Vec, which uses a shallow neural network to predict a word from its surrounding context, or the surrounding context from a word. Another widely used approach is GloVe (Global Vectors for Word Representation), which builds embeddings from global word co-occurrence statistics gathered over the whole corpus.
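As a rough illustration of what "global statistics" means, the sketch below counts how often pairs of words appear near each other across a toy corpus. This is only the raw counting step, under the assumption of a fixed symmetric window. The actual GloVe pipeline goes further and fits vectors to these counts, which is not shown here. The function name `cooccurrence_counts` and the toy sentences are illustrative choices.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered (word, neighbor) pair occurs within `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:  # a word is not its own neighbor
                    counts[(word, tokens[j])] += 1
    return counts

toy_sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked", "at", "the", "cat"],
]

counts = cooccurrence_counts(toy_sentences, window=2)
print(counts[("cat", "sat")])  # 1
print(counts[("cat", "the")])  # 2
```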

## Word2Vec



Developed by a team of researchers at Google, including Tomas Mikolov, in 2013, Word2Vec (Word to Vector) has become a foundational technique for learning word embeddings in natural language processing (NLP) and machine learning models.

Word2Vec consists of two main models for generating vector representations: Continuous Bag of Words (CBOW) and Continuous Skip-gram.

The Continuous Bag of Words (CBOW) model predicts a target word from the context words that surround it within a given window. In learning to make that prediction, the model's embeddings come to capture semantic relationships between words.
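The sketch below shows how CBOW training examples could be formed from one tokenized sentence: each word becomes a target, and the words within the window on either side become its context. The helper `cbow_pairs` is a hypothetical name, and the sketch leaves out details of real implementations such as negative sampling and subsampling of frequent words.

```python
def cbow_pairs(tokens, window=2):
    """Build (context_words, target_word) pairs for one tokenized sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]     # up to `window` words before
        right = tokens[i + 1:i + 1 + window]    # up to `window` words after
        pairs.append((left + right, target))
    return pairs

sentence = ["the", "cat", "sat", "on", "the", "mat"]
for context, target in cbow_pairs(sentence, window=2):
    print(context, "->", target)
# The third line printed is: ['the', 'cat', 'on', 'the'] -> sat
```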

The Continuous Skip-gram model, on the other hand, takes a target word as input and aims to predict the surrounding context words.
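A comparable sketch for Skip-gram, again with a hypothetical helper name (`skipgram_pairs`): the direction is reversed, so each target word produces one training pair per context word inside the window.

```python
def skipgram_pairs(tokens, window=2):
    """Build (target_word, context_word) pairs for one tokenized sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the target itself
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["the", "cat", "sat", "on", "the", "mat"]
pairs = skipgram_pairs(sentence, window=2)
print(pairs[:2])   # [('the', 'cat'), ('the', 'sat')]
print(len(pairs))  # 18
```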


Let's go step by step to understand **what exactly we did in the Word2Vec code**.

### **1️⃣ Preprocessing the Data**  

#### **Step 1: Import Libraries**
```python
from gensim.models import Word2Vec
import nltk
from nltk.tokenize import word_tokenize
```
- **`gensim`** provides the Word2Vec implementation.  
- **`nltk`** helps tokenize (split) text into words.  

In [11]:
!pip install gensim

from gensim.models import Word2Vec
import nltk
from nltk.tokenize import word_tokenize

# Download the NLTK 'punkt' tokenizer data if it is not already present.
# Note: newer NLTK releases may ask for 'punkt_tab' instead.
nltk.download('punkt')



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ujjwa\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

#### **Step 2: Prepare the Corpus (Text Data)**  

In [12]:
corpus = [
    "The cat sat on the mat",
    "The dog barked at the cat",
    "Dogs and cats are great pets",
    "The pet store has dogs and cats"
]

- This is our **training data**. Word2Vec will learn from these sentences.

#### **Step 3: Tokenization**

In [13]:
# Tokenize sentences into words
tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in corpus]
print(tokenized_corpus)

[['the', 'cat', 'sat', 'on', 'the', 'mat'], ['the', 'dog', 'barked', 'at', 'the', 'cat'], ['dogs', 'and', 'cats', 'are', 'great', 'pets'], ['the', 'pet', 'store', 'has', 'dogs', 'and', 'cats']]


- **Converts each sentence to lowercase** and then **splits it into word tokens**.  

### **2️⃣ Training Word2Vec**
Now we train **two models**:  
- **CBOW (Continuous Bag of Words)**  
- **Skip-gram**  

#### **Step 4: Training CBOW Model**

In [14]:
# Train CBOW model (sg=0 means CBOW)
cbow_model = Word2Vec(sentences=tokenized_corpus, vector_size=50, window=3, min_count=1, sg=0)

- `sentences=tokenized_corpus` → Uses our tokenized data.  
- `vector_size=50` → Each word is represented by a **50-dimensional vector**.  
- `window=3` → Looks at up to **three words before and after** the target word for context.  
- `min_count=1` → Ignores words that appear fewer than `min_count` times; with a value of 1, every word is kept.  
- `sg=0` → **Sets CBOW mode** (`sg=1` would be Skip-gram).  

📌 **CBOW learns to predict the missing word based on surrounding words.**  
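To see what the `min_count` parameter described above would do on this corpus, the sketch below counts word frequencies with the standard library and lists which words would survive different thresholds. It mimics the filtering rule only (keep a word if its total count is at least `min_count`) and does not call gensim. The helper `kept_words` is a hypothetical name.

```python
from collections import Counter

# The tokenized corpus from Step 3, written out so the snippet is self-contained
tokenized = [
    ['the', 'cat', 'sat', 'on', 'the', 'mat'],
    ['the', 'dog', 'barked', 'at', 'the', 'cat'],
    ['dogs', 'and', 'cats', 'are', 'great', 'pets'],
    ['the', 'pet', 'store', 'has', 'dogs', 'and', 'cats'],
]

frequencies = Counter(word for sentence in tokenized for word in sentence)

def kept_words(frequencies, min_count):
    """Words whose total frequency is at least `min_count`, sorted alphabetically."""
    return sorted(word for word, count in frequencies.items() if count >= min_count)

print(len(kept_words(frequencies, min_count=1)))  # 17: every distinct word is kept
print(kept_words(frequencies, min_count=2))       # ['and', 'cat', 'cats', 'dogs', 'the']
```

With a corpus this small, raising `min_count` even to 2 would discard most of the vocabulary, which is why the tutorial uses 1.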

#### **Step 5: Training Skip-gram Model**

In [15]:
skipgram_model = Word2Vec(sentences=tokenized_corpus, vector_size=50, window=2, min_count=1, sg=1)

- **Same `vector_size` and `min_count` as the CBOW model**, but with `window=2` and `sg=1`, which enables **Skip-gram**.  

📌 **Skip-gram learns to predict surrounding words given a single word.**  


### **3️⃣ Using the Trained Word2Vec Model**
Now that we have trained **CBOW and Skip-gram**, let's see how to use them.  

#### **Step 6: Get Word Vectors**

In [16]:
vector = cbow_model.wv['cat']
print(vector)

[-0.01723938  0.00733148  0.01037977  0.01148388  0.01493384 -0.01233535
  0.00221123  0.01209456 -0.0056801  -0.01234705 -0.00082045 -0.0167379
 -0.01120002  0.01420908  0.00670508  0.01445134  0.01360049  0.01506148
 -0.00757831 -0.00112361  0.00469675 -0.00903806  0.01677746 -0.01971633
  0.01352928  0.00582883 -0.00986566  0.00879638 -0.00347915  0.01342277
  0.0199297  -0.00872489 -0.00119868 -0.01139127  0.00770164  0.00557325
  0.01378215  0.01220219  0.01907699  0.01854683  0.01579614 -0.01397901
 -0.01831173 -0.00071151 -0.00619968  0.01578863  0.01187715 -0.00309133
  0.00302193  0.00358008]


- Retrieves the **50-dimensional word vector** for `"cat"`.  

#### **Step 7: Find Similar Words**

In [17]:
similar_words = skipgram_model.wv.most_similar('cat')
print(similar_words)

[('dogs', 0.16563552618026733), ('dog', 0.1551763415336609), ('pet', 0.14387421309947968), ('store', 0.1394207924604416), ('the', 0.12672513723373413), ('barked', 0.1211986094713211), ('has', 0.1051950454711914), ('great', 0.08872983604669571), ('sat', 0.032278481870889664), ('on', 0.02048538811504841)]


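- Returns the words whose vectors have the highest **cosine similarity** to `"cat"` (the top 10 by default). On a corpus this small the scores are low and likely dominated by random initialization, so the ranking should not be read as a reliable semantic result.  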
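The similarity scores above are cosine similarities between word vectors. The sketch below reimplements that ranking in plain Python on three hypothetical 3-dimensional vectors (invented for illustration, not taken from the trained model) to show the idea behind `most_similar`. It is not gensim's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, invented for illustration only
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "mat": [0.0, 0.1, 0.9],
}

def most_similar(word, vectors):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# "dog" ranks first because its vector points in nearly the same direction as "cat"
print(most_similar("cat", toy_vectors))
```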

#### **Step 8: Save and Load Models**

In [18]:
# Save models
cbow_model.save("cbow.model")
skipgram_model.save("skipgram.model")

# Load models
cbow_model = Word2Vec.load("cbow.model")
skipgram_model = Word2Vec.load("skipgram.model")

## 🔥 **What Did We Do?**
| Step | Action | Why? |
|------|--------|------|
| 1 | Tokenized text | So Word2Vec can process words |
| 2 | Trained CBOW | Learns from surrounding words |
| 3 | Trained Skip-gram | Predicts context words from a single word |
| 4 | Extracted word vectors | To represent words numerically |
| 5 | Found similar words | To check if embeddings make sense |

---

## 📌 **CBOW vs. Skip-gram: Key Differences**
| Feature | CBOW | Skip-gram |
|---------|------|----------|
| **Training Speed** | Faster | Slower |
| **Works Well On** | Large datasets | Small datasets |
| **Focus** | Predicts **target word** from context | Predicts **context words** from a target word |
| **Performance on Rare Words** | Weaker | Better |
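One common explanation for the speed difference in the table is the number of predictions each model makes. CBOW makes one prediction per target word, while Skip-gram makes one per (target, context) pair. The sketch below counts both for a single sentence, assuming a fixed symmetric window. The function name `count_training_examples` is hypothetical, and real implementations add further details that also affect speed.

```python
def count_training_examples(tokens, window=2):
    """Return (cbow_examples, skipgram_examples) for one tokenized sentence."""
    cbow_examples = 0
    skipgram_examples = 0
    last = len(tokens) - 1
    for i in range(len(tokens)):
        # Number of context words available on the left and right of position i
        neighbors = min(i, window) + min(last - i, window)
        if neighbors > 0:
            cbow_examples += 1              # one prediction per target word
            skipgram_examples += neighbors  # one prediction per context word
    return cbow_examples, skipgram_examples

sentence = ["the", "cat", "sat", "on", "the", "mat"]
print(count_training_examples(sentence, window=2))  # (6, 18)
```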