# Skip-gram Word2Vec Revision Notebook

A one-page flashcard-style revision tool for the Skip-gram model in Word2Vec. Covers architecture, why it works, dense embeddings vs. one-hot vectors, and negative sampling. Designed for quick, deep review.

## 1. What is Skip-gram?
- **Definition**: Predicts context words (e.g., "quick", "brown", "jumps") from a target word (e.g., "fox").
- **Goal**: Learn dense word embeddings (vectors, e.g., 100-dim) that capture semantic/syntactic relationships (e.g., "king" ≈ "queen", "king - man + woman ≈ queen").
- **Core Idea**: Distributional hypothesis—words in similar contexts have similar meanings.
- **Training**: Unsupervised, uses large text corpora (e.g., Wikipedia), sliding window to generate (target, context) pairs.
- **Example**: Sentence: "The quick brown fox jumps over the lazy dog." Target: "fox". Context (window size 2, i.e., two words on each side): "quick", "brown", "jumps", "over".
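
The sliding-window pair generation can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference Word2Vec implementation: the function name `skipgram_pairs`, the whitespace tokenisation, and the nine-word example sentence are assumptions made here, and real pipelines usually add subsampling of frequent words and a randomly shrunk window.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every position in `tokens`."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)                # clip the window at the left edge
        hi = min(len(tokens), i + window + 1)  # and at the right edge
        for j in range(lo, hi):
            if j != i:                         # the target is not its own context
                pairs.append((target, tokens[j]))
    return pairs


# Assumed example sentence, lower-cased and split on whitespace.
tokens = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(tokens, window=2)
fox_contexts = [c for t, c in pairs if t == "fox"]
print(fox_contexts)  # ['quick', 'brown', 'jumps', 'over']
```

Words near the sentence boundary get fewer than four context words, so the number of pairs per target is at most $2 \times$ window rather than exactly that.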

## 2. Skip-gram Architecture (Step-by-Step)
- **Step 1: Input (Target Word)**
  - Single target word (e.g., "fox").
  - Represented as a one-hot vector (length $V$, vocab size, e.g., 10,000; 1 at word’s index, 0s elsewhere).
  - Example: "fox" (index 500) → $[0, ..., 1, ..., 0]$.
- **Step 2: Embedding Lookup**
  - One-hot vector indexes into input embedding matrix $W_{\text{in}}$ (shape: $V \times d$, $d = 100$).
  - Output: Dense target vector $x$ (d-dimensional).
  - Example: $x$ for "fox" = $[0.3, -0.2, ...]$.
- **Step 3: Output Layer**
  - Multiply $x$ by output embedding matrix $W_{\text{out}}$ (shape: $d \times V$) to get scores: $z = x \cdot W_{\text{out}}$ (length $V$).
  - Softmax: $\hat{y}_j = \frac{\exp(z_j)}{\sum_k \exp(z_k)}$ (probability that each of the $V$ vocabulary words appears as a context word).
- **Step 4: Training**
  - Loss: Cross-entropy, $\mathcal{L} = -\sum_{j \in \text{context}} \log \hat{y}_j$ (maximize context word probabilities).
  - Optimize $W_{\text{in}}$, $W_{\text{out}}$ via backpropagation.
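  - Final embeddings: Rows of $W_{\text{in}}$ (or, less commonly, the average of $W_{\text{in}}$ and the transpose of $W_{\text{out}}$, since $W_{\text{out}}$ stores one column per word).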
- **Parameters**: Approx. $2 \times V \times d$ (e.g., 2M for $V = 10,000$, $d = 100$).
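
Steps 1 to 4 can be traced in a small forward pass. The sketch below uses plain Python lists so it runs with only the standard library; the toy sizes (`V = 6`, `d = 3`), the random initial weights, and the function names `forward` and `loss` are assumptions for illustration, and no gradient update is performed.

```python
import math
import random

random.seed(0)
V, d = 6, 3  # toy sizes; the notes use V = 10,000 and d = 100

# W_in has shape V x d (one row per word); W_out has shape d x V.
W_in = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(V)]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(d)]


def forward(target_idx):
    """Steps 1-3: row lookup, scores z = x W_out, softmax over all V words."""
    x = W_in[target_idx]  # equivalent to multiplying the one-hot vector by W_in
    z = [sum(x[i] * W_out[i][j] for i in range(d)) for j in range(V)]
    z_max = max(z)  # subtract the max score for numerical stability
    exps = [math.exp(s - z_max) for s in z]
    total = sum(exps)
    return [e / total for e in exps]


def loss(target_idx, context_idxs):
    """Step 4: cross-entropy summed over the context word indices."""
    y_hat = forward(target_idx)
    return -sum(math.log(y_hat[j]) for j in context_idxs)


y_hat = forward(target_idx=3)
L = loss(target_idx=3, context_idxs=[1, 2, 4, 5])
print(len(y_hat), L > 0)
```

With untrained weights the probabilities are close to uniform, so the loss starts near $C \times \log V$ and falls as training raises the scores of true context words.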

## 3. Why Dense Embeddings (Not One-Hot)?
- **One-Hot Vectors**:
  - Length $V$, sparse, no semantic info (all distinct words are orthogonal, so "cat" is exactly as far from "dog" as from any other word).
  - Summing context one-hots: Vector with 1s at context word indices (e.g., 1s for "quick", "brown").
  - Issues: High-dimensional ($V$), computationally expensive, no generalization across similar words.
- **Dense Embeddings**:
  - $d$-dimensional ($d \ll V$), learned to capture semantics (e.g., "cat" ≈ "dog").
  - Single target embedding $x$ predicts context, leveraging semantic similarity.
  - Efficient: Fixed-size input ($d$), enables generalization (similar targets → similar contexts).
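- **Why Not Summed One-Hot?**: A summed (multi-hot) vector is still $V$-dimensional and sparse, with no semantic structure of its own; passing it through a $V \times d$ matrix only adds up the selected rows. Skip-gram instead does a single row lookup in $W_{\text{in}}$ (the same $V \times d$ shape), so the one-hot vector never has to be materialized or multiplied.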
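
A short comparison makes the contrast concrete. The vocabulary indices and the 3-dimensional embedding values below are made up for illustration (they are not trained vectors), and the helper names are hypothetical.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


V = 5
cat, dog, car = 0, 1, 2  # hypothetical vocabulary indices

# One-hot vectors: every pair of distinct words has similarity 0.
print(cosine(one_hot(cat, V), one_hot(dog, V)))  # 0.0
print(cosine(one_hot(cat, V), one_hot(car, V)))  # 0.0

# Made-up dense embeddings, one row per word (illustrative, not trained).
W_in = [
    [0.9, 0.8, 0.1],  # cat
    [0.8, 0.9, 0.2],  # dog
    [0.1, 0.2, 0.9],  # car
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
]
print(cosine(W_in[cat], W_in[dog]) > cosine(W_in[cat], W_in[car]))  # True

# Multiplying a one-hot row vector by W_in just selects one row.
d = len(W_in[0])
h = one_hot(dog, V)
product = [sum(h[i] * W_in[i][j] for i in range(V)) for j in range(d)]
print(product == W_in[dog])  # True
```

The last check is why implementations index into the embedding matrix directly: the matrix product costs $V \times d$ multiplications to produce what a lookup returns at once.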

## 4. Why Predicting Multiple Contexts?
- Uses single target vector $x$ to predict $C$ context words (e.g., $C = 4$).
- **Benefits**:
  - Captures relationships from multiple contexts per target, enriching embeddings.
  - Fixed-size input ($d$) regardless of $C$, scalable for large windows.
  - Order-insensitive within context window, simplifies model.
- **Contrast with Averaging**: No averaging (unlike CBOW); each context prediction contributes independently.
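
One way to see the independent contributions is to write out the full-softmax loss and its gradient with respect to the scores. The index $j_c$ (vocabulary index of the $c$-th context word) and the count $n_j$ (how many times word $j$ occurs in the window) are notation introduced here for the derivation:

$$
\mathcal{L} = -\sum_{c=1}^{C} \log \hat{y}_{j_c},
\qquad
\frac{\partial \mathcal{L}}{\partial z_j}
= \sum_{c=1}^{C} \left( \hat{y}_j - \mathbb{1}[j = j_c] \right)
= C\,\hat{y}_j - n_j .
$$

Each of the $C$ context words adds its own $\hat{y}_j - \mathbb{1}[j = j_c]$ term, so the same target vector $x$ receives $C$ separate error signals per window position rather than one signal from an averaged input.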

## 5. Negative Sampling: Why and How?
- **Problem**: Full softmax scores all $V$ words for every training pair (O($V \times d$)), which is slow for large vocabularies (even $V = 10,000$ means 10,000 dot products per pair).
- **Solution**: Negative sampling:
  - For each context word: Use target-context pair (positive) + $k$ negative words ($k$ typically 5 to 20, drawn from the unigram frequency distribution raised to the 0.75 power).
  - Loss: $\mathcal{L} = -\log \sigma(x \cdot v_{c,\text{out}}) - \sum_{i=1}^k \log \sigma(-x \cdot v_{n_i,\text{out}})$ per context word.
  - Compute scores for only $1 + k$ words: complexity O($(1 + k) \times d$) vs. O($V \times d$).
- **$W_{\text{out}}$ Dimensions**:
  - Remains $d \times V$, as all words can be contexts/negatives.
  - Each step uses subset ($d \times (1 + k)$) for context + $k$ negatives.
  - Full matrix updated over time as different words are sampled.
- **Why Not $d \times k$?**: Would cover only $k$ words, insufficient for vocab size $V$.
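
The sampling distribution and the per-context loss can be sketched as follows. The word counts are made up, the function names are hypothetical, and excluding the target and true context word from the negatives is an assumption of this sketch rather than something the notes specify. Output vectors are stored one row per word (the transpose of the $d \times V$ matrix) purely for convenient indexing.

```python
import math
import random


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def noise_distribution(counts, power=0.75):
    """Unigram counts raised to `power`, normalised to sum to 1."""
    weights = [c ** power for c in counts]
    total = sum(weights)
    return [w / total for w in weights]


def sample_negatives(probs, k, exclude, rng):
    """Draw k word indices from `probs`, skipping indices in `exclude`."""
    negatives = []
    while len(negatives) < k:
        idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        if idx not in exclude:  # assumption: never use target/context as a negative
            negatives.append(idx)
    return negatives


def neg_sampling_loss(x, v_context, v_negatives):
    """-log sigma(x.v_c) - sum_i log sigma(-x.v_n_i) for one context word."""
    loss = -math.log(sigmoid(dot(x, v_context)))
    for v_n in v_negatives:
        loss -= math.log(sigmoid(-dot(x, v_n)))
    return loss


rng = random.Random(0)
counts = [100, 50, 10, 5, 1, 1]  # made-up word frequencies
probs = noise_distribution(counts)

d, V, k = 3, len(counts), 2
W_in = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(V)]
W_out_T = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(V)]

target, context = 3, 1
negatives = sample_negatives(probs, k, exclude={target, context}, rng=rng)
loss = neg_sampling_loss(W_in[target], W_out_T[context],
                         [W_out_T[n] for n in negatives])
print(negatives, round(loss, 4))
```

Raising counts to the 0.75 power flattens the distribution: in this toy example the rarest words are sampled more often than their raw frequency of 1/167 would suggest, while the most frequent word is sampled less often.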

## 6. Why Skip-gram Works?
- **Semantic Learning**: Predicting contexts from a target groups similar words (e.g., "king" ≈ "queen") via distributional hypothesis.
- **Dot Product Theory**:
  - Output scores ($z = x \cdot W_{\text{out}}$) are dot products between target vector $x$ and each context’s output embedding ($v_{j,\text{out}}$, columns of $W_{\text{out}}$).
  - Dot product measures similarity: High $x \cdot v_{c,\text{out}}$ means target aligns with context word’s embedding; low for unrelated words.
  - Training maximizes $x \cdot v_{c,\text{out}}$ (context) and minimizes $x \cdot v_{n,\text{out}}$ (negatives), positioning similar words close in embedding space.
  - Result: Embeddings capture semantic relationships (e.g., "cat" ≈ "dog") and enable analogies (e.g., "king - man + woman ≈ queen") via linear vector arithmetic.
- **Dense Embeddings**: Low-dimensional ($d$), capture latent features (e.g., gender, topic), enable generalization across diverse contexts.
- **Multiple Contexts**: Predicting $C$ contexts per target enriches embeddings, especially for rare words.
- **Large-Scale Training**: Millions of examples ensure robust embeddings, capturing diverse target-context pairs.
- **Efficiency**: Negative sampling reduces computation, making large vocabularies feasible.
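- **Simplicity**: No hidden-layer non-linearity; the only non-linear step is the output softmax (or the sigmoid under negative sampling), so complex patterns are learned through data volume rather than model depth.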
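
The dot-product argument can be checked with a single gradient step on the negative-sampling loss. This is a minimal sketch with hand-picked 2-dimensional vectors and an assumed learning rate of 0.1; the function name `sgd_step` is hypothetical. For these particular values the positive score rises and the negative score falls after one update.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def sgd_step(x, v_pos, v_negs, lr=0.1):
    """One negative-sampling update; returns new (x, v_pos, v_negs)."""
    # Gradient of the loss w.r.t. each score is sigma(score) - label.
    g_pos = sigmoid(dot(x, v_pos)) - 1.0            # label 1: true context
    g_negs = [sigmoid(dot(x, v)) for v in v_negs]   # label 0: negatives

    # The gradient w.r.t. x sums contributions from all 1 + k output vectors.
    grad_x = [g_pos * p for p in v_pos]
    for g, v in zip(g_negs, v_negs):
        grad_x = [gx + g * c for gx, c in zip(grad_x, v)]

    # Every update uses the old x, since nothing is modified in place.
    new_v_pos = [p - lr * g_pos * xi for p, xi in zip(v_pos, x)]
    new_v_negs = [[c - lr * g * xi for c, xi in zip(v, x)]
                  for g, v in zip(g_negs, v_negs)]
    new_x = [xi - lr * gx for xi, gx in zip(x, grad_x)]
    return new_x, new_v_pos, new_v_negs


x = [0.5, 0.1]       # target vector (illustrative values)
v_pos = [0.2, 0.4]   # output vector of the true context word
v_neg = [0.3, -0.1]  # output vector of one sampled negative

before_pos, before_neg = dot(x, v_pos), dot(x, v_neg)
x2, v_pos2, (v_neg2,) = sgd_step(x, v_pos, [v_neg])
after_pos, after_neg = dot(x2, v_pos2), dot(x2, v_neg2)
print(before_pos < after_pos, before_neg > after_neg)  # True True
```

Two targets that share many context words are repeatedly pulled toward the same output vectors, which is how words such as "cat" and "dog" end up near each other in the input embedding space.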

## 7. Key Numbers (Example)
- Vocabulary: $V = 10,000$.
- Embedding dim: $d = 100$.
- Context size: $C = 4$.
- Negative samples: $k$ between 5 and 20.
- Parameters: Approx. $2 \times V \times d$ (e.g., 2M for $V = 10,000$, $d = 100$).
- Complexity: O($(1 + k) \times d$) per context word with negative sampling ($1 + k$ dot products of length $d$), vs. O($V \times d$) for full softmax.
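
The arithmetic behind these numbers is quick to verify. The sketch below counts multiplications for the score computation only and picks $k = 5$, the low end of the range, as an assumption.

```python
V, d, k = 10_000, 100, 5  # k = 5 is the low end of the 5-20 range

params = 2 * V * d                # W_in (V x d) plus W_out (d x V)
softmax_mults = V * d             # one length-d dot product per vocabulary word
neg_sampling_mults = (1 + k) * d  # one positive plus k negatives

print(params)                             # 2000000
print(softmax_mults, neg_sampling_mults)  # 1000000 600
print(softmax_mults // neg_sampling_mults)  # 1666
```

Even at the top of the range ($k = 20$) the score computation needs 2,100 multiplications per context word, still hundreds of times fewer than the full softmax.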