# Negative Sampling Loss Revision Notebook

A one-page flashcard-style revision tool explaining why sigmoid loss is used with negative sampling instead of softmax in Word2Vec (CBOW and Skip-gram). Designed for quick, deep review; the explanations are meant to stand on their own, without relying on code.

## 1. What Changes with Negative Sampling?
- **Definition**: Negative sampling replaces the task of predicting one word out of $V$ with $1 + k$ independent binary classifications (1 correct word, $k$ sampled negatives, e.g., $k = 5$–$20$).
- **Goal**: Learn good embeddings cheaply by scoring only a small subset of words per training example, instead of computing the full softmax distribution over all $V$ words.
- **Context**: Applied to CBOW (predict target from context) and Skip-gram (predict context from target).
- **Example**: Target "fox", context "quick", negatives "zebra", "apple" ($k = 2$).

## 2. Original Loss: Cross-Entropy with Softmax
- **How It Works**:
  - Computes probability $P(w_j | \text{context}) = \frac{\exp(z_j)}{\sum_{i=1}^V \exp(z_i)}$, where $z_j = h \cdot v_{j,\text{out}}$ (CBOW, with $h$ the average of the context words' input vectors) or $z_j = x \cdot v_{j,\text{out}}$ (Skip-gram, with $x$ the target word's input vector).
  - Loss: $-\log P(w_t | \text{context})$ (CBOW) or $-\sum_{j \in \text{context}} \log P(w_j | \text{target})$ (Skip-gram).
- **What It Does**: Maximizes the probability of the correct word by normalizing over all $V$ words.
- **Issue**: Denominator $\sum_{i=1}^V \exp(z_i)$ requires O($V$) computation per training example, slow for large vocabularies (e.g., $V = 10{,}000$).
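
To make the O($V$) cost concrete, here is a minimal sketch of the full-softmax loss for one training example. The function name `softmax_loss`, the toy matrix `W_out`, and all numbers are made up for illustration; this is not the original Word2Vec code.

```python
import math

def softmax_loss(h, W_out, target):
    """Full-softmax cross-entropy, -log P(target | context), for one example.

    h: hidden/input vector; W_out: list of V output vectors (hypothetical toy data);
    target: index of the correct word.
    """
    # One dot product per vocabulary word: this line is the O(V) cost.
    scores = [sum(a * b for a, b in zip(h, v)) for v in W_out]
    # Subtract the max score for numerical stability; the result is unchanged.
    m = max(scores)
    log_denominator = m + math.log(sum(math.exp(z - m) for z in scores))
    return log_denominator - scores[target]

# Toy setup (assumed values): V = 4 words, 2-dimensional embeddings.
W_out = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]
h = [2.0, 0.0]
loss = softmax_loss(h, W_out, target=0)
print(round(loss, 4))
```

Every entry of `W_out` is touched even though only one word is correct, which is the cost negative sampling is designed to avoid.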

## 3. Why Does Negative Sampling Change the Loss?
- **Binary Classification**:
  - Treats each word as "correct" (positive) or "incorrect" (negative).
  - Uses sigmoid $P(\text{positive}) = \sigma(s) = \frac{1}{1 + e^{-s}}$, where $s$ is the score (e.g., $x \cdot v_{c,\text{out}}$).
  - Loss: $-\log \sigma(s)$ (positive), $-\log (1 - \sigma(s))$ (negative).
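- **New Loss**: $-\log \sigma(x \cdot v_{c,\text{out}}) - \sum_{i=1}^k \log \sigma(-x \cdot v_{n_i,\text{out}})$ for a true context word $c$ and sampled negatives $n_1, \dots, n_k$ (Skip-gram, per context word; CBOW is the same with $h$ in place of $x$). The negative terms use the identity $1 - \sigma(s) = \sigma(-s)$.
- **Why Change?**: Softmax requires normalization over all $V$ words; negative sampling avoids it by scoring only $1 + k$ words, which calls for a loss that judges each word on its own.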
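
The Skip-gram loss above can be sketched in a few lines. The helper names (`log_sigmoid`, `dot`, `negative_sampling_loss`) and the example vectors are assumptions chosen for illustration, reusing the "fox" / "quick" / "zebra" / "apple" example from section 1.

```python
import math

def log_sigmoid(s):
    """Numerically stable log(sigma(s))."""
    # Branch on the sign of s so exp() never receives a large positive argument.
    if s >= 0:
        return -math.log1p(math.exp(-s))
    return s - math.log1p(math.exp(s))

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def negative_sampling_loss(x, v_pos, v_negs):
    """-log sigma(x . v_pos) - sum_i log sigma(-x . v_neg_i)."""
    loss = -log_sigmoid(dot(x, v_pos))
    for v_n in v_negs:
        loss -= log_sigmoid(-dot(x, v_n))
    return loss

# Made-up 2-dimensional vectors, for illustration only.
x = [1.0, 2.0]          # input vector of the target "fox"
v_quick = [1.0, 1.0]    # output vector of the true context word "quick"
v_zebra = [-1.0, 0.0]   # output vectors of the k = 2 negatives
v_apple = [0.0, -1.0]

loss = negative_sampling_loss(x, v_quick, [v_zebra, v_apple])
swapped = negative_sampling_loss(x, v_zebra, [v_quick, v_apple])
print(loss < swapped)
```

The loop runs over the $k$ negatives only, so the cost does not depend on $V$. Treating "zebra" as the positive (`swapped`) gives a larger loss, since its score is low while "quick", now a negative, scores high.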

## 4. Why Not Use Softmax with Negative Sampling?
- **Incompatibility**:
  - Softmax models a categorical distribution (exactly one correct word out of $V$). Applied to the sampled words, it would require $P(\text{positive}) + \sum_{i=1}^k P(\text{negative}_i) = 1$ over only $1 + k$ words.
  - Negative sampling asks a separate yes/no question for each word pair rather than making a single choice among candidates, so softmax is misaligned with the task.
- **Computational Overhead**:
  - Softmax still needs a normalization step, even over $1 + k$ words, and its shared denominator ties every word's probability to every other word's score.
  - Sigmoid evaluates each word independently, keeping computation at O($k$).
- **Example**:
  - Softmax on "quick", "zebra", "apple": Forces $P(\text{quick}) + P(\text{zebra}) + P(\text{apple}) = 1$, ignoring $V - 3$ words.
  - Sigmoid: $P(\text{quick}) \approx 1$, $P(\text{zebra}) \approx 0$, $P(\text{apple}) \approx 0$, no normalization needed.
- **Loss Mismatch**: Softmax assumes one correct answer; sigmoid handles multiple negatives as independent incorrect cases.
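
A small numeric sketch of this contrast, using made-up scores for the three words in the example (the function names `softmax_over` and `sigmoid_each` are hypothetical): softmax over the sampled words must sum to 1 and reacts to any score change, while each sigmoid depends only on its own score.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def softmax_over(scores):
    """Softmax restricted to the given words (shared denominator)."""
    denominator = sum(math.exp(s) for s in scores.values())
    return {w: math.exp(s) / denominator for w, s in scores.items()}

def sigmoid_each(scores):
    """Independent sigmoid per word (no normalization)."""
    return {w: sigmoid(s) for w, s in scores.items()}

# Made-up scores, for illustration only.
scores = {"quick": 3.0, "zebra": -1.0, "apple": -2.0}
soft = softmax_over(scores)
sig = sigmoid_each(scores)

# Raise one negative's score and see which probabilities for "quick" move.
bumped = dict(scores, zebra=3.0)
soft_bumped = softmax_over(bumped)
sig_bumped = sigmoid_each(bumped)

print(sum(soft.values()), sum(sig.values()))
print(soft["quick"], soft_bumped["quick"])  # changes
print(sig["quick"], sig_bumped["quick"])    # identical
```

Under these assumed scores, changing the score of "zebra" lowers the softmax probability of "quick" but leaves its sigmoid probability untouched.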

## 5. Why Does Sigmoid Fit Negative Sampling?
- **Alignment**:
  - Matches binary task: Maximize $P(\text{correct})$ and minimize $P(\text{incorrect})$ for each of $1 + k$ words.
  - No need for global normalization, preserving O($k$) efficiency.
- **Gradient Efficiency**:
  - Updates the input vector plus the output vectors of only the positive word and the $k$ negatives, vs. softmax’s updates to all $V$ output vectors.
- **Effectiveness**:
  - Leverages dot product similarity (e.g., $x \cdot v_{c,\text{out}}$) to learn co-occurrence, approximating softmax’s semantic goal.
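
The gradient-efficiency point can be illustrated with one SGD step. For a score $s$ with label $y$ (1 for the positive, 0 for a negative), the derivative of the sigmoid loss with respect to $s$ is $\sigma(s) - y$. The sketch below is an assumption-laden toy (plain lists, a hypothetical `sgd_step` function, made-up numbers and learning rate), not the reference implementation.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def sgd_step(x, W_out, pos, negs, lr=0.1):
    """One SGD step on the negative-sampling loss; mutates x and W_out in place.

    Returns the indices of the output rows that were updated.
    """
    grad_x = [0.0] * len(x)
    touched = []
    for j, label in [(pos, 1.0)] + [(n, 0.0) for n in negs]:
        v = W_out[j]
        s = sum(a * b for a, b in zip(x, v))
        g = sigmoid(s) - label              # dLoss/ds for this word
        for d in range(len(x)):
            grad_x[d] += g * v[d]           # accumulate with the pre-update v[d]
            v[d] -= lr * g * x[d]           # update this output row only
        touched.append(j)
    for d in range(len(x)):
        x[d] -= lr * grad_x[d]              # update the input vector once
    return touched

# Toy setup (assumed values): V = 5 output vectors, k = 2 negatives.
W_out = [[0.2, 0.1], [0.3, -0.2], [0.1, 0.4], [-0.3, 0.2], [0.5, 0.5]]
x = [1.0, 0.5]
before = [row[:] for row in W_out]
touched = sgd_step(x, W_out, pos=0, negs=[2, 3])
unchanged = [j for j in range(len(W_out)) if W_out[j] == before[j]]
print(touched, unchanged)
```

Only rows `0`, `2`, and `3` of `W_out` change; the remaining rows are never read or written, which is where the O($k$) update cost comes from.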

## 6. Why Does This Work for Word2Vec?
- **Semantic Learning**: Dot product-based sigmoid loss positions similar words close in embedding space (e.g., "king" ≈ "queen").
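- **Scalability**: O($k$) computation per training pair enables training on large vocabularies and corpora.
- **Approximation**: Scoring only $1 + k$ words per pair still captures the key co-occurrence relationships, which is sufficient for embedding quality even though the outputs are not normalized probabilities.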
- **Flexibility**: $k$ can be tuned to balance speed and accuracy (e.g., $k = 5$–$20$ for small datasets, as few as $2$–$5$ for large ones).

## 7. Key Points (Example)
- Vocabulary: $V = 10{,}000$.
- Negative samples: $k = 5$–$20$.
- Complexity: O($k$) per word pair vs. O($V$) with softmax.
- Trade-off: Does not yield normalized probabilities over $V$ as softmax does, but is faster and scalable for Word2Vec’s unsupervised goal.