Here are **detailed notes on Word2Vec (CBOW model)**, covering theory, working, architecture, equations, and examples 👇

---

# **Word2Vec (CBOW Model) – Full Notes with Example**

---

## 🧠 **1. Introduction to Word2Vec**

**Word2Vec** is a **neural network-based model** used to learn **word embeddings**: dense numerical vector representations of words that capture their meanings, similarities, and relationships.
It was introduced by **Tomas Mikolov et al. at Google in 2013**.

It converts words into vectors such that:

* Similar words have **similar vector representations**.
* Words with similar context appear **close in the vector space**.

---

## 🧩 **2. Two Main Architectures of Word2Vec**

1. **CBOW (Continuous Bag of Words)** – predicts a **target word** from **context words**.
2. **Skip-Gram** – predicts **context words** from a **target word**.

---

## ⚙️ **3. Working of CBOW Model**

### 🎯 **Goal:**

Predict the **target word** based on the surrounding **context words**.

Example:
Sentence: `"The cat sits on the mat"`

If we choose a **window size = 2**,
for the target word `"sits"`:

* **Context words** = ["The", "cat", "on", "the"]

CBOW tries to predict `"sits"` from its context.
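
The window rule above can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization and a window that counts words on each side of the target; the helper name `context_window` is made up for this example.

```python
def context_window(tokens, target_index, window=2):
    """Return up to `window` tokens on each side of the target, in order."""
    left = tokens[max(0, target_index - window):target_index]
    right = tokens[target_index + 1:target_index + 1 + window]
    return left + right


tokens = "The cat sits on the mat".split()
target_index = tokens.index("sits")

print(context_window(tokens, target_index))  # ['The', 'cat', 'on', 'the']
```

Near the start or end of a sentence the window is simply truncated, so edge words get fewer context words.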

---

## 🧮 **4. CBOW Architecture**

**Steps:**

1. **Input layer:** Takes context words.
2. **Projection layer (Hidden layer):** Average their embeddings.
3. **Output layer:** Predicts target word using softmax.

### Diagram (Conceptual)

```
Context Words → Embedding Lookup → Average (Hidden Layer) → Output Scores → Softmax → Target Word
```

---

## 📘 **5. Mathematical Explanation**

Let’s say:

* Vocabulary size = `V`
* Embedding dimension = `N`
* Context size = `C` (number of context words)

### 1️⃣ **Input:**

Context words are represented as **one-hot vectors** of size `V`.

Example vocabulary = {the, cat, sits, on, mat}
→ V = 5

Each word maps to a vector with a single 1 at its vocabulary index, e.g. "sits" → [0, 0, 1, 0, 0]
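
As a small sketch of this encoding, assuming the vocabulary is indexed in the order listed above (the index order is an arbitrary choice, and `one_hot` is a helper name invented here):

```python
vocab = ["the", "cat", "sits", "on", "mat"]
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a length-V list with a 1 at the word's index and 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec


print(one_hot("sits"))  # [0, 0, 1, 0, 0]
```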

---

### 2️⃣ **Projection (Hidden) Layer:**

Each one-hot vector is multiplied by the **weight matrix** \( W \) (V × N)
to get the **word embedding**.

For each context word \( w_i \) with one-hot vector \( x_i \):

\[
h_i = W^T \cdot x_i
\]

Then take the **average** of all context embeddings:

\[
h = \frac{1}{C} \sum_{i=1}^{C} h_i
\]
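
Because each input is one-hot, multiplying by `W` just selects one row of `W`, so the projection layer is effectively a table lookup followed by a mean. The NumPy sketch below checks this, assuming a randomly initialized `W`, an embedding size of `N = 3`, and the word indices from the five-word example vocabulary (all illustrative choices, not values from a trained model).

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 5, 3                      # vocabulary size, embedding dimension (assumed)
W = rng.normal(size=(V, N))      # input embedding matrix, one row per word

# Indices of the context words "the", "cat", "on", "the"
context_ids = [0, 1, 3, 0]

X = np.eye(V)[context_ids]       # (C, V) matrix of one-hot rows
H = X @ W                        # row i equals (W^T x_i)^T

# The matrix product is the same as looking up rows of W directly
assert np.allclose(H, W[context_ids])

h = H.mean(axis=0)               # hidden vector: average of the C embeddings

# Averaging ignores order: reversing the context gives the same h
assert np.allclose(h, W[context_ids[::-1]].mean(axis=0))
print(h.shape)  # (3,)
```

The second check is the "bag of words" part of the name: any permutation of the context produces the same hidden vector.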

---

### 3️⃣ **Output Layer:**

We use another matrix \( W' \) (N × V) to map back to vocabulary space.

\[
u = W'^T \cdot h
\]

Then apply **softmax** to predict the target word probability:

\[
P(w_t \mid \text{context}) = \frac{e^{u_{w_t}}}{\sum_{j=1}^{V} e^{u_j}}
\]

The model is trained using **cross-entropy loss**.
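
A minimal forward pass tying the three layers together might look like the following. It assumes random weights, `N = 3`, and the same example indices as before; `W_out` stands in for \( W' \), and the loss for one example is taken as the negative log of the probability assigned to the true target.

```python
import numpy as np


def softmax(u):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(u - u.max())
    return e / e.sum()


rng = np.random.default_rng(0)
V, N = 5, 3
W = rng.normal(scale=0.1, size=(V, N))      # input embeddings (V x N)
W_out = rng.normal(scale=0.1, size=(N, V))  # output matrix W' (N x V)

context_ids = [0, 1, 3, 0]   # "the", "cat", "on", "the"
target_id = 2                # "sits"

h = W[context_ids].mean(axis=0)   # projection layer
u = W_out.T @ h                   # one score per vocabulary word, shape (V,)
p = softmax(u)                    # P(word | context) for every word
loss = -np.log(p[target_id])      # cross-entropy for this single example

print(p.round(3), float(loss))
```

With untrained weights the probabilities are close to uniform, so the loss starts near `log(V)`.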

---

## ⚙️ **6. Training Process**

* Randomly initialize word vectors.
* For each training example:

  1. Take context words → input
  2. Predict target word → output
  3. Compute the error (cross-entropy loss between the softmax output and the true target word)
  4. Backpropagate the error
  5. Update word vectors (in `W` and `W'`)

After many iterations, the embeddings capture semantic meaning.
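
The loop above can be sketched end to end in NumPy. This is a toy illustration under several assumptions: a one-sentence corpus, a full softmax, plain SGD, and made-up hyperparameters (`N = 8`, learning rate `0.1`, 200 epochs). It is meant to show the mechanics of the update, not to produce useful embeddings.

```python
import numpy as np

tokens = "the cat sits on the mat".split()
vocab = sorted(set(tokens))
idx = {w: i for i, w in enumerate(vocab)}
V, N, window, lr = len(vocab), 8, 2, 0.1   # assumed hyperparameters

# Build (context indices, target index) training examples
examples = []
for t, word in enumerate(tokens):
    ctx = tokens[max(0, t - window):t] + tokens[t + 1:t + 1 + window]
    examples.append(([idx[c] for c in ctx], idx[word]))

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, N))      # input embeddings
W_out = rng.normal(scale=0.1, size=(N, V))  # output matrix W'

losses = []
for epoch in range(200):
    total = 0.0
    for ctx_ids, target in examples:
        # Forward pass
        h = W[ctx_ids].mean(axis=0)
        u = W_out.T @ h
        p = np.exp(u - u.max())
        p /= p.sum()
        total += -np.log(p[target])

        # Backward pass: gradient of the loss w.r.t. the scores is p - y
        err = p.copy()
        err[target] -= 1.0
        grad_h = W_out @ err                  # computed before W_out changes

        # SGD updates
        W_out -= lr * np.outer(h, err)
        # Each context row receives 1/C of grad_h; .at() handles repeated words
        np.subtract.at(W, ctx_ids, lr * grad_h / len(ctx_ids))
    losses.append(total / len(examples))

print(f"mean loss: first epoch {losses[0]:.3f}, last epoch {losses[-1]:.3f}")
```

After training, the rows of `W` are the word vectors; `W[idx["cat"]]` would be the embedding for "cat" in this sketch.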

---

## 📊 **7. Example**

### Example Sentence:

> “The dog barks loudly”

Vocabulary: [the, dog, barks, loudly]

Let window size = 1 (one word on each side of the target)

| Target | Context words |
| ------ | ------------- |
| the    | [dog]         |
| dog    | [the, barks]  |
| barks  | [dog, loudly] |
| loudly | [barks]       |

So, training pairs:

* Input: [dog] → Output: the
* Input: [the, barks] → Output: dog
* Input: [dog, loudly] → Output: barks
* Input: [barks] → Output: loudly
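
These pairs can be generated mechanically. The sketch below assumes lowercase whitespace tokens and takes one word on each side of the target, which is what the table above lists; `cbow_pairs` is a name chosen for this example.

```python
def cbow_pairs(tokens, window):
    """Return one (context_words, target_word) pair per token position."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs


tokens = "the dog barks loudly".split()
for context, target in cbow_pairs(tokens, window=1):
    print(context, "->", target)
# ['dog'] -> the
# ['the', 'barks'] -> dog
# ['dog', 'loudly'] -> barks
# ['barks'] -> loudly
```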

After training on a large corpus (not just this one sentence), words that appear in similar contexts end up with similar vectors, for example:

```
dog ≈ cat
barks ≈ meows
the ≈ a
```

---

## 🧠 **8. Key Features of CBOW**

| Feature  | Description           |
| -------- | --------------------- |
| Input    | Context words         |
| Output   | Target word           |
| Speed    | Faster than Skip-gram |
| Best for | Large datasets        |
| Captures | Frequent words better |

---

## ⚖️ **9. Comparison: CBOW vs Skip-Gram**

| Feature    | CBOW                     | Skip-Gram                          |
| ---------- | ------------------------ | ---------------------------------- |
| Predicts   | Target word from context | Context words from target          |
| Efficiency | Faster                   | Slower                             |
| Handles    | Frequent words well      | Rare words better                  |
| Output     | One word                 | Multiple context words             |
| Use case   | Large corpus             | Small corpus or rare-word analysis |
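
The difference in what each model predicts shows up directly in the training examples. As a rough sketch (helper names are invented here, and the window counts one word on each side), CBOW produces one example per position while Skip-Gram produces one example per (target, context word) pair:

```python
def cbow_pairs(tokens, window):
    """One (context_words, target) example per position."""
    return [
        (tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window], target)
        for i, target in enumerate(tokens)
    ]


def skipgram_pairs(tokens, window):
    """One (target, context_word) example per neighbouring word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = "the dog barks loudly".split()
cbow_examples = cbow_pairs(tokens, window=1)
skipgram_examples = skipgram_pairs(tokens, window=1)

print(len(cbow_examples), cbow_examples[1])        # 4 (['the', 'barks'], 'dog')
print(len(skipgram_examples), skipgram_examples[:3])
```

Skip-Gram ends up with more, smaller examples from the same text, which is one way to see why it is slower to train but gives each rare word more individual updates.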

---

## 🧩 **10. Advantages of CBOW**

✅ Simple and efficient to train
✅ Learns good embeddings for frequent words
✅ Captures semantic and syntactic word relationships
✅ Useful for large-scale NLP tasks

---

## ⚠️ **11. Limitations**

❌ Struggles with rare words
❌ Ignores word order within context (bag-of-words assumption)
❌ Static embeddings (same vector for all contexts)

---

## 💡 **12. Practical Example (Using Python and Gensim)**

```python
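from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Sample text
text = "The cat sits on the mat. The dog lies on the rug."

# Split into sentences, then lowercase and tokenize each one.
# Word2Vec expects a list of token lists (one per sentence), and
# simple_preprocess drops punctuation so "." is not learned as a word.
sentences = [simple_preprocess(s) for s in text.split(".") if s.strip()]

# Train CBOW model (sg=0 means CBOW)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# Show embedding for a word
print(model.wv['cat'])

# Find similar words (scores are not meaningful on a corpus this small)
print(model.wv.most_similar('cat'))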
```

---

## 🧩 **13. Applications of Word2Vec (CBOW)**

* Text classification
* Sentiment analysis
* Machine translation
* Document similarity
* Information retrieval
* Recommendation systems

---

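## 🧭 **14. Visualization Idea**

In a well-trained embedding space, vector arithmetic captures analogies. If the embeddings are plotted in 2D (after dimensionality reduction), these appear as roughly parallel offsets between word pairs:

```
king - man + woman ≈ queen
Paris - France + Italy ≈ Rome
```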

This shows **semantic relationships** captured by Word2Vec embeddings.
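
To make the arithmetic concrete, here is a toy sketch using hand-made 2D vectors. These vectors are invented for illustration (the first axis loosely stands for "royalty", the second for "gender") and are not learned embeddings; the `analogy` helper is also hypothetical. It picks the word whose vector has the highest cosine similarity to the query, excluding the three input words.

```python
import numpy as np

# Hand-made toy vectors, NOT the output of a trained model
vectors = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.2, 1.0]),
    "woman": np.array([0.2, -1.0]),
    "apple": np.array([-1.0, 0.1]),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    query = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


print(analogy("king", "man", "woman"))  # queen
```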

---

✅ **In summary:**

| Aspect         | CBOW Model Summary                            |
| -------------- | --------------------------------------------- |
| Input          | Context words                                 |
| Output         | Target word                                   |
| Type           | Predictive embedding model                    |
| Learning       | Neural network                                |
| Output vectors | Dense semantic word embeddings                |
| Example        | Predict “sits” from “the”, “cat”, “on”, “the” |

