# Word2Vec Advantages & Limitations Revision Notebook

A one-page, flashcard-style revision tool covering why Word2Vec outperforms earlier word representation methods and where it falls short. Designed for quick but thorough review.

## 1. What Were Previous Approaches?
- **Bag-of-Words (BoW)**: Represents text as a vector of word counts/frequencies, ignoring order and semantics (e.g., "cat sat on mat" = [1,1,1,1] over the vocabulary {cat, sat, on, mat}).
- **TF-IDF**: Weights words by frequency and rarity across documents, still sparse and high-dimensional.
- **One-Hot Encoding**: Sparse vectors (length $V$, 1 for the word, 0s elsewhere), no semantic similarity (all words orthogonal).
- **Latent Semantic Analysis (LSA/LSI)**: Uses SVD on term-document matrices to reduce dimensions and capture latent topics, but computationally expensive for large corpora.
- **n-grams**: Probabilistic models for sequences, but representations are discrete, not dense embeddings.
- **Early Neural Models (e.g., Bengio et al., 2003)**: Learned word embeddings as part of a neural language model, but the full softmax over $V$ words made training slow and hard to scale.
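
To make the sparse baselines concrete, here is a minimal sketch of one-hot and BoW vectors over a toy five-word vocabulary. The vocabulary and the helper names (`one_hot`, `bag_of_words`, `cosine`) are assumptions chosen for illustration, not part of any library.

```python
import numpy as np

# Toy vocabulary (illustrative only); V = 5
vocab = ["cat", "sat", "on", "mat", "dog"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Length-V vector with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

def bag_of_words(tokens):
    """Length-V count vector; word order is discarded."""
    vec = np.zeros(len(vocab))
    for tok in tokens:
        vec[index[tok]] += 1.0
    return vec

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Distinct one-hot vectors are orthogonal: "cat" is no closer to "dog" than to "on".
cat_dog = cosine(one_hot("cat"), one_hot("dog"))
cat_on = cosine(one_hot("cat"), one_hot("on"))

# Reordering the tokens produces the same BoW vector.
bow_a = bag_of_words(["cat", "sat", "on", "mat"])
bow_b = bag_of_words(["mat", "on", "sat", "cat"])
```

Both similarities come out as zero, which is the "all words orthogonal" problem in miniature, and `bow_a` equals `bow_b` because counts carry no order information.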

## 2. Why Word2Vec is Better Than Previous Approaches
- **Dense Embeddings**: Low-dimensional vectors (e.g., 100-300 dims) vs. the sparse, $V$-dimensional vectors (e.g., $V \geq 10{,}000$) of BoW/TF-IDF/one-hot. Similarity becomes a continuous quantity rather than exact match or no match, which mitigates the curse of dimensionality.
- **Semantic & Syntactic Capture**: Words with similar meanings/contexts have similar vectors (e.g., "king" ≈ "queen"), enabling analogies via vector arithmetic ("king - man + woman ≈ queen"). Sparse representations do not support this, and LSA captures topical similarity but is weaker on fine-grained relations.
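- **Efficiency & Scalability**: Trains on massive corpora (e.g., billions of words) because each update avoids the full O($V$) softmax: negative sampling costs O($k$) for $k$ sampled negatives, and hierarchical softmax costs O($\log V$) per training pair. By contrast, LSA needs an SVD of the whole term-document matrix (roughly O($\min(V^2 D, V D^2)$) for $V$ terms and $D$ documents when computed in full), and early neural models paid for a full softmax plus a nonlinear hidden layer.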
- **Unsupervised Learning**: Learns from raw text with no manual labels; the training signal comes from the text itself. Predicts a word from its context (CBOW) or the context from a word (Skip-gram), learning distributed representations as a by-product.
- **Generalization**: Embeddings transfer to downstream tasks (e.g., as input features for classifiers or taggers), outperforming sparse vectors in similarity/search tasks.
- **Empirical Superiority**: Reported to train faster and score higher on word analogy/similarity benchmarks than LSA or earlier neural models; Skip-gram in particular represents rare words well.
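
The Skip-gram setup and the O($k$) cost of negative sampling can be sketched as follows. This is a simplified illustration under stated assumptions, not the reference implementation: the function names, the fixed window, and the random stand-in vectors are all hypothetical, and real Word2Vec additionally subsamples frequent words and draws negatives from a smoothed unigram distribution.

```python
import numpy as np

def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs within `window` positions on either side."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(v_target, u_context, u_negatives):
    """Loss for one (target, context) pair with k sampled negatives.

    Only k + 1 output vectors are touched, instead of all V in a full softmax.
    """
    positive = -np.log(sigmoid(u_context @ v_target))
    negative = -np.sum(np.log(sigmoid(-(u_negatives @ v_target))))
    return float(positive + negative)

tokens = ["the", "cat", "sat", "on", "the", "mat"]
pairs = skipgram_pairs(tokens, window=1)

# Random stand-ins for embedding rows (assumption: dim = 8, k = 5 negatives).
rng = np.random.default_rng(0)
dim, k = 8, 5
v_target = rng.normal(scale=0.1, size=dim)
u_context = rng.normal(scale=0.1, size=dim)
u_negatives = rng.normal(scale=0.1, size=(k, dim))
loss = negative_sampling_loss(v_target, u_context, u_negatives)
```

The loss involves `k + 1` dot products regardless of vocabulary size, which is where the O($k$) figure comes from.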

## 3. Word2Vec's Limitations
- **Static Embeddings**: Fixed vector per word, ignores polysemy (e.g., "bank" as river/money has one vector). Contextual models like BERT address this.
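- **Out-of-Vocabulary (OOV)**: No vector for words unseen in training; requires retraining or workarounds such as mapping them to a shared unknown-word vector.
- **No Subword/Morphology**: Whole-word based, ignores prefixes/suffixes (e.g., "unhappy" and "happy" are unrelated tokens; the shared stem is invisible to the model). FastText extends it with character n-grams, which also gives it a way to build vectors for OOV words.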
- **Limited Context**: Fixed window size (e.g., 5-10 words), misses long-range dependencies vs. Transformers' full-sequence attention.
- **Bias Amplification**: Reflects dataset biases (e.g., gender stereotypes from web text), propagating to downstream applications.
- **Shallow Architecture**: A single linear projection layer with no nonlinearity, so it is less expressive than deep neural nets and doesn't capture complex hierarchies.
- **No Word Order**: Uses context on both sides of the target but treats it as an unordered bag within the window; struggles with order-sensitive tasks.
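
A small sketch of the static-lookup, OOV, and subword points above. The three-dimensional vectors are made up for illustration, and `lookup` and `char_ngrams` are hypothetical helpers; real FastText uses a range of n-gram lengths plus hashing, which this omits.

```python
import numpy as np

# Made-up 3-dimensional vectors standing in for a trained embedding table.
embeddings = {
    "bank": np.array([0.2, -0.1, 0.7]),
    "river": np.array([0.1, 0.8, 0.3]),
    "money": np.array([0.9, -0.2, 0.4]),
}

def lookup(word):
    """Static lookup: the sentence around the word plays no role."""
    return embeddings.get(word)  # None for out-of-vocabulary words

# Same vector for "bank" regardless of the sentence it appears in.
bank_in_river_sentence = lookup("bank")  # "sat on the river bank"
bank_in_money_sentence = lookup("bank")  # "deposited money at the bank"

# A word never seen in training has no vector at all.
oov_vector = lookup("bankster")

def char_ngrams(word, n=3):
    """Character n-grams with boundary markers, in the style FastText uses."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

trigrams = char_ngrams("where")
```

Here `trigrams` is `["<wh", "whe", "her", "ere", "re>"]`. A subword model can sum the vectors of such pieces to represent an unseen word, whereas the plain lookup returns nothing.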

## 4. Why These Limitations Matter
- **In Practice**: Word2Vec excels at simple similarity but underperforms on nuanced tasks (e.g., sentiment analysis where a word's sense depends on its context).
- **Evolution**: Followed by GloVe (global co-occurrence statistics), FastText (subwords), and contextual models (ELMo, BERT), which address several of these limitations.

## 5. Key Points (Comparison & Trade-offs)
- **Better For**: Scalable semantic embeddings from raw text; outperforms sparse methods in efficiency/quality.
- **Worse For**: Contextual nuance, OOV, bias; use modern alternatives for production.
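- **Training Time**: Word2Vec: reported at under a day for about 1.6 billion words on a single machine (Mikolov et al., 2013); LSA: cost grows with the size of the SVD, so it is typically applied to smaller corpora.
- **Dimensionality**: Word2Vec: 100-300 dims; BoW/one-hot: $V$ dims.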
- **Use Case**: Pre-2018 baseline; now often replaced by Transformers.