# CBOW Word2Vec Revision Notebook

A one-page flashcard-style revision tool for the Continuous Bag of Words (CBOW) model in Word2Vec. Covers architecture, why it works (including dot product theory), dense embeddings vs. one-hot vectors, and negative sampling. Designed for quick, deep review.

## 1. What is CBOW?
- **Definition**: Predicts a target word (e.g., "fox") from its context words (e.g., "quick", "brown", "jumps", "over").
- **Goal**: Learn dense word embeddings (vectors, e.g., 100-dim) that capture semantic/syntactic relationships (e.g., "king" ≈ "queen", "king - man + woman ≈ queen").
- **Core Idea**: Distributional hypothesis: words that appear in similar contexts tend to have similar meanings.
- **Training**: Unsupervised (no manual labels; the text itself supplies the targets). Uses large text corpora (e.g., Wikipedia) and a sliding window to generate (context, target) pairs.
- **Example**: Sentence: "The quick brown fox jumps over the lazy dog." Target: "fox". Context (window size 2, i.e., two words on each side): "quick", "brown", "jumps", "over".
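
The sliding-window step can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `cbow_pairs` is hypothetical, tokenization is a plain whitespace split, and the sentence is assumed to be the full "the quick brown fox jumps over the lazy dog" so that both sides of the window are populated.

```python
def cbow_pairs(tokens, window=2):
    """Yield (context, target) pairs using a symmetric sliding window."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` words before
        right = tokens[i + 1:i + 1 + window]   # up to `window` words after
        context = left + right
        if context:  # skip a one-token input, which has no context
            yield context, target

tokens = "the quick brown fox jumps over the lazy dog".split()
pairs = list(cbow_pairs(tokens, window=2))
context, target = pairs[3]  # the pair centered on "fox"
```

Words near the sentence edges get a shorter context in this sketch; padding or dropping those positions are equally reasonable choices.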

## 2. CBOW Architecture (Step-by-Step)
- **Step 1: Input (Context Words)**
  - $C$ context words (e.g., $C = 4$).
  - Each word as a one-hot vector (length $V$, vocab size, e.g., 10,000; 1 at word’s index, 0s elsewhere).
  - Example: "quick" (index 100) → $[0, ..., 1, ..., 0]$.
- **Step 2: Embedding Lookup**
  - One-hot vectors index into input embedding matrix $W_{\text{in}}$ (shape: $V \times d$, $d = 100$).
  - Output: $C$ dense vectors $x_1, x_2, ..., x_C$, each $d$-dimensional.
  - Example: $x_1$ for "quick" = $[0.2, -0.1, ...]$.
- **Step 3: Context Aggregation**
  - Take the embeddings of the context words and average them elementwise: $h = \frac{1}{C} \sum_{i=1}^{C} x_i$ (a $d$-dimensional context vector).
  - Represents the combined semantic content and is order-insensitive (hence "bag of words"). It is "continuous" because $h$ is a dense real-valued vector rather than a sparse one-hot vector.
- **Step 4: Output Layer**
  - Multiply $h$ by output embedding matrix $W_{\text{out}}$ (shape: $d \times V$) to get scores: $z = h \cdot W_{\text{out}}$ (length $V$).
  - Softmax: $\hat{y}_j = \frac{\exp(z_j)}{\sum_k \exp(z_k)}$ (probabilities for all words).
- **Step 5: Training**
  - Loss: Cross-entropy, $\mathcal{L} = -\log \hat{y}_t$ (maximize target word probability).
  - Optimize $W_{\text{in}}$, $W_{\text{out}}$ via backpropagation.
  - Final embeddings: Rows of $W_{\text{in}}$ (or each row averaged with the matching column of $W_{\text{out}}$).
- **Parameters**: Approx. $2 \times V \times d$ (e.g., 2M for $V = 10,000$, $d = 100$).
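
Steps 2 through 5 can be traced in a small forward pass. This is a sketch under stated assumptions: the sizes are toy values ($V = 6$, $d = 3$), the weights are drawn uniformly from $[-0.5, 0.5]$ purely for illustration, and the name `cbow_forward` is hypothetical. Backpropagation is left out.

```python
import math
import random

random.seed(0)
V, d = 6, 3  # toy sizes; the notes use V = 10,000 and d = 100

# W_in is V x d (one row per word); W_out is d x V (one column per word).
W_in = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(V)]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(d)]

def cbow_forward(context_ids, target_id):
    # Step 2: embedding lookup (row selection replaces the one-hot multiply).
    xs = [W_in[i] for i in context_ids]
    # Step 3: elementwise average of the C context embeddings.
    C = len(xs)
    h = [sum(x[j] for x in xs) / C for j in range(d)]
    # Step 4: scores z = h . W_out, then a numerically stable softmax.
    z = [sum(h[j] * W_out[j][w] for j in range(d)) for w in range(V)]
    z_max = max(z)
    exps = [math.exp(s - z_max) for s in z]
    total = sum(exps)
    y_hat = [e / total for e in exps]
    # Step 5: cross-entropy loss for the true target word.
    loss = -math.log(y_hat[target_id])
    return h, y_hat, loss

h, y_hat, loss = cbow_forward(context_ids=[0, 1, 3, 4], target_id=2)
```

Subtracting `z_max` before exponentiating leaves the softmax unchanged but avoids overflow for large scores.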

## 3. Why Dense Embeddings (Not One-Hot)?
- **One-Hot Vectors**:
  - Length $V$, sparse, no semantic info (all distinct words are orthogonal, so "cat" is no closer to "dog" than to any other word).
  - Issues: High-dimensional ($V$), computationally expensive, no generalization across similar words.
- **Dense Embeddings**:
  - $d$-dimensional ($d \ll V$), learned to capture semantics (e.g., "cat" ≈ "dog").
  - Averaging embeddings blends meaning (e.g., "cat" + "dog" → pet-related vector).
  - Efficient: Fixed-size input ($d$), enables generalization (similar contexts → similar $h$).
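- **Why Not Summed One-Hot?**: A summed one-hot (multi-hot) vector is sparse, $V$-dimensional, and has no semantic structure of its own. Multiplying it by $W_{\text{in}}$ ($V \times d$) gives the same result as summing the looked-up rows, but costs O($V \times d$) instead of O($C \times d$), which is why CBOW uses a lookup followed by averaging.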
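
Two of the points above are easy to check numerically: distinct one-hot vectors have a dot product of zero, and multiplying a one-hot vector by $W_{\text{in}}$ returns exactly the row that a direct lookup would. The matrix values and the indices for "cat" and "dog" below are made up for readability.

```python
V, d = 5, 3  # toy sizes

# A hypothetical V x d input embedding matrix with easy-to-read values.
W_in = [[10 * row + col for col in range(d)] for row in range(V)]

def one_hot(index, size):
    return [1 if i == index else 0 for i in range(size)]

def vec_mat(v, M):
    """Row vector (length V) times matrix (V x d) -> vector (length d)."""
    return [sum(v[r] * M[r][c] for r in range(len(M))) for c in range(len(M[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

cat, dog = 1, 3  # illustrative vocabulary indices

# Distinct one-hot vectors are always orthogonal: no notion of similarity.
orthogonal = dot(one_hot(cat, V), one_hot(dog, V))

# Multiplying a one-hot by W_in selects one row: O(V*d) work for an O(d) result.
via_matmul = vec_mat(one_hot(cat, V), W_in)
via_lookup = W_in[cat]

# A summed one-hot (multi-hot) vector times W_in equals the sum of the rows.
multi_hot = [a + b for a, b in zip(one_hot(cat, V), one_hot(dog, V))]
summed = vec_mat(multi_hot, W_in)
```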

## 4. Why Averaging Context Embeddings?
- Combines $C$ embeddings into one $d$-dimensional vector $h$.
- **Benefits**:
  - Fixed size regardless of $C$ (vs. concatenation → $C \times d$).
  - Captures collective semantic meaning (e.g., "quick" + "brown" → context for "fox").
  - Order-insensitive, simplifies model, effective for embeddings.
- **Alternative (Summing One-Hots)**: Only tracks word presence, not semantic relationships.
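
The fixed-size and order-insensitivity properties can be seen directly on a toy example. The 3-dimensional embeddings below are invented values, chosen so the arithmetic is exact.

```python
def average(vectors):
    """Elementwise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def concatenate(vectors):
    return [x for v in vectors for x in v]

# Hypothetical 3-dimensional embeddings for four context words.
quick = [1.0, 0.0, 2.0]
brown = [3.0, 2.0, 0.0]
jumps = [0.0, 4.0, 2.0]
over = [4.0, 2.0, 0.0]

h = average([quick, brown, jumps, over])           # C = 4
h_shuffled = average([over, jumps, quick, brown])  # same words, new order
h_small = average([quick, brown])                  # C = 2

concat_4 = concatenate([quick, brown, jumps, over])  # length C * d = 12
concat_2 = concatenate([quick, brown])               # length C * d = 6
```

Averaging always yields a length-$d$ vector, so the output layer never needs to change when $C$ does, whereas the concatenated input grows with $C$.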

## 5. Negative Sampling: Why and How?
- **Problem**: The full softmax scores all $V$ words for every training example (O($V \times d$)), which is slow for large vocabularies (e.g., $V = 10,000$, and often far larger in practice).
- **Solution**: Negative sampling:
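  - For each example: Use the target word (positive) plus $k$ negative words (e.g., $k = 5$ to $20$), sampled from the unigram distribution raised to the 0.75 power: $P(w) \propto f(w)^{0.75}$, where $f(w)$ is the frequency of word $w$.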
  - Loss: $\mathcal{L} = -\log \sigma(h \cdot v_{t,\text{out}}) - \sum_{i=1}^k \log \sigma(-h \cdot v_{n_i,\text{out}})$.
  - Compute scores for only $1 + k$ words: O($k \times d$) vs. O($V \times d$) for the full softmax.
- **$W_{\text{out}}$ Dimensions**:
  - Remains $d \times V$, as all words can be targets/negatives.
  - Each step uses subset ($d \times (1 + k)$) for target + $k$ negatives.
  - Full matrix updated over time as different words are sampled.
- **Why Not $d \times k$?**: Would cover only $k$ words, insufficient for vocab size $V$.
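
The sampling rule and the loss can be sketched together. Everything named here is illustrative: the word counts are made up, the helper names are hypothetical, and excluding the target from the candidate pool is just one simple way to avoid drawing the positive word as a negative.

```python
import math
import random

def noise_distribution(counts, power=0.75):
    """Unigram counts raised to `power`, renormalized to sum to 1."""
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: wt / total for w, wt in weights.items()}

def sample_negatives(dist, target, k, rng):
    """Draw k negatives (with replacement), never the target itself."""
    words = [w for w in dist if w != target]
    probs = [dist[w] for w in words]
    return rng.choices(words, weights=probs, k=k)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def neg_sampling_loss(h, v_target, v_negatives):
    """L = -log sigma(h.v_t) - sum_i log sigma(-h.v_{n_i})."""
    loss = -math.log(sigmoid(dot(h, v_target)))
    for v_n in v_negatives:
        loss -= math.log(sigmoid(-dot(h, v_n)))
    return loss

counts = {"the": 1000, "fox": 10, "quick": 10, "zebra": 1}  # made-up counts
dist = noise_distribution(counts)
rng = random.Random(0)
negatives = sample_negatives(dist, target="fox", k=5, rng=rng)

# Only 1 + k output vectors are touched, regardless of vocabulary size.
h = [1.0, 0.0]
loss = neg_sampling_loss(h, v_target=[2.0, 0.0], v_negatives=[[-2.0, 0.0]])
```

Raising counts to the 0.75 power flattens the distribution: in this toy example "the" drops from about 98% of the raw probability mass to about 94%, so rarer words are sampled somewhat more often than their raw frequency would suggest.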

## 6. Why CBOW Works?
- **Semantic Learning**: Context-based prediction groups similar words (e.g., "king" ≈ "queen") via distributional hypothesis.
- **Dot Product Theory**:
  - Output scores ($z = h \cdot W_{\text{out}}$) are dot products between context vector $h$ (averaged input embeddings) and each word’s output embedding ($v_{j,\text{out}}$, columns of $W_{\text{out}}$).
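  - The dot product is an unnormalized similarity: $h \cdot v = \|h\| \, \|v\| \cos\theta$, i.e., cosine similarity scaled by the two vector norms. A high $h \cdot v_{t,\text{out}}$ means the context aligns with the target word’s output embedding; the score is low for unrelated words.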
  - Training maximizes $h \cdot v_{t,\text{out}}$ (target) and minimizes $h \cdot v_{n,\text{out}}$ (others), positioning similar words close in embedding space.
  - Result: Embeddings capture semantic relationships (e.g., "cat" ≈ "dog") and enable analogies (e.g., "king - man + woman ≈ queen") via linear vector arithmetic.
- **Dense Embeddings**: Low-dimensional ($d$), capture latent features (e.g., gender, topic), enable generalization across similar contexts.
- **Averaging**: Blends context meaning efficiently, scalable for any $C$, produces $h$ that reflects collective semantics.
- **Large-Scale Training**: Millions of examples ensure robust embeddings, capturing diverse contexts.
- **Efficiency**: Negative sampling reduces computation, making large vocabularies feasible.
- **Simplicity**: Linear model (no hidden-layer non-linearity; only the output softmax, or sigmoid under negative sampling) learns complex patterns via data volume.
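
The claim that training raises the target score and lowers the negative scores can be checked with one gradient step on the negative-sampling loss from Section 5. This sketch updates only the output vectors and holds $h$ fixed; the input-embedding update is omitted for brevity, and all vector values and the name `sgd_step_output` are made up.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sgd_step_output(h, v_target, v_negatives, lr=0.1):
    """One SGD step on the output vectors only (h is held fixed here)."""
    # dL/dv_t = (sigma(h.v_t) - 1) * h, so the update moves v_t toward h.
    g_t = sigmoid(dot(h, v_target)) - 1.0
    new_target = [v - lr * g_t * x for v, x in zip(v_target, h)]
    # dL/dv_n = sigma(h.v_n) * h, so the update moves v_n away from h.
    new_negatives = []
    for v_n in v_negatives:
        g_n = sigmoid(dot(h, v_n))
        new_negatives.append([v - lr * g_n * x for v, x in zip(v_n, h)])
    return new_target, new_negatives

h = [0.5, -0.2, 0.1]
v_t = [0.1, 0.3, -0.4]
v_n = [[0.2, 0.1, 0.0], [-0.3, 0.4, 0.2]]
new_t, new_n = sgd_step_output(h, v_t, v_n)

target_score_change = dot(h, new_t) - dot(h, v_t)  # positive
negative_score_changes = [dot(h, b) - dot(h, a) for a, b in zip(v_n, new_n)]  # all negative
```

Each score moves by a multiple of $\|h\|^2$, so the direction of the change is guaranteed for any non-zero $h$; only its size depends on the learning rate and the current sigmoid value.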

## 7. Key Numbers (Example)
- Vocabulary: $V = 10,000$.
- Embedding dim: $d = 100$.
- Context size: $C = 4$.
- Negative samples: $k = 5$ to $20$.
- Parameters: Approx. $2 \times V \times d$ (e.g., 2M for $V = 10,000$, $d = 100$).
- Complexity: O($C \times d + d \times k$) per example with negative sampling.
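
The arithmetic behind these numbers is worth doing once by hand. The sketch below takes the values from the list above with $k$ at its low end, and uses a rough operation count (it counts $1 + k$ scored words, which is the same order as the $d \times k$ term in the complexity bullet).

```python
V, d, C, k = 10_000, 100, 4, 5  # values from the list above, k at its low end

params = 2 * V * d                      # W_in (V x d) plus W_out (d x V)
full_softmax_ops = C * d + d * V        # average C vectors, then score all V words
neg_sampling_ops = C * d + d * (1 + k)  # average, then score target + k negatives

print(f"parameters:        {params:,}")
print(f"full softmax:      {full_softmax_ops:,} operations per example (rough)")
print(f"negative sampling: {neg_sampling_ops:,} operations per example (rough)")
```

With these values the output layer dominates the full-softmax cost, and negative sampling cuts the per-example work by roughly three orders of magnitude.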