# L9c: Skip-Gram Model for Word Embeddings
In this lecture, we explore the Skip-Gram model, a method for learning word embeddings by predicting context words from a target word. Unlike CBOW (Continuous Bag-of-Words from L9a), which predicts a target from context, Skip-Gram reverses the task and predicts multiple context words given a single target word.

> __Learning Objectives:__
> 
> By the end of this lecture, you should be able to:
> 
> * __Skip-Gram architecture and training:__ Describe the Skip-Gram neural network structure, how it differs from CBOW, and the training objective that maximizes context word predictions.
> * __Embedding extraction and usage:__ Identify where word embeddings are stored in Skip-Gram and explain why Skip-Gram produces better representations for rare words compared to CBOW.
> * __Computational optimizations:__ Explain the softmax bottleneck in Skip-Gram and describe how Negative Sampling addresses this computational challenge.

Let's get started!
___


## Skip-Gram Model
Skip-Gram reverses CBOW's prediction task: instead of predicting a target from context, it predicts multiple context words from a single target.

> __What is it?__
>
> Skip-Gram uses a feedforward network with one hidden layer to predict context word probabilities given a target word. The input is the [one-hot encoded vector](https://en.wikipedia.org/wiki/One-hot) of the target word, and the output produces probability distributions for each context position within a sliding window.
> 
> __Reference__: [Mikolov, T., et al. (2013). Efficient Estimation of Word Representations in Vector Space. ArXiv, abs/1301.3781.](https://arxiv.org/abs/1301.3781)

Let $w_t$ be the target word at position $t$, $\mathcal{C} = \{t-m, \dots, t-1, t+1, \dots, t+m\}$ the context positions within window size $m$, and $\mathbf{v}_{w_t} \in \{0,1\}^{N_{\mathcal{V}}}$ the one-hot encoded target vector.
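
To make the notation concrete, the following minimal sketch enumerates the (target, context) pairs that a window of size $m$ produces. It assumes whitespace tokenization and that windows are clipped at sentence boundaries; neither choice is specified above, and `skipgram_pairs` is a hypothetical helper name.

```python
def skipgram_pairs(tokens, m):
    """Return (target, context) pairs for every position t and window size m."""
    pairs = []
    for t, target in enumerate(tokens):
        # Context positions t-m..t+m, excluding t itself.
        # Assumption: positions outside the sentence are simply dropped.
        for c in range(max(0, t - m), min(len(tokens), t + m + 1)):
            if c != t:
                pairs.append((target, tokens[c]))
    return pairs


sentence = "the quick brown fox jumps".split()
pairs = skipgram_pairs(sentence, m=1)
for target, context in pairs:
    print(target, "->", context)
```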

The input $\mathbf{x} = \mathbf{v}_{w_t} \in\mathbb{R}^{N_{\mathcal{V}}}$ connects to hidden layer $\mathbf{h}\in\mathbb{R}^{h}$ via a linear transformation:
$$
\begin{align*}
\mathbf{h} &= \mathbf{W}_{1} \cdot \mathbf{x}
\end{align*}
$$
where $\mathbf{W}_{1}\in\mathbb{R}^{h\times{N_{\mathcal{V}}}}$ is the input-to-hidden weight matrix. Since $\mathbf{x}$ is one-hot, this selects the corresponding column of $\mathbf{W}_{1}$. The hidden layer maps to the output layer:
$$
\begin{align*}
\mathbf{u} &= \mathbf{W}_{2} \cdot \mathbf{h}
\end{align*}
$$
where $\mathbf{W}_{2}\in\mathbb{R}^{N_{\mathcal{V}}\times{h}}$ is the hidden-to-output weight matrix, producing $\mathbf{u}\in\mathbb{R}^{N_{\mathcal{V}}}$. For each context position $c \in \mathcal{C}$, the model outputs a probability distribution over the vocabulary using softmax:
$$
\begin{align*}
p(w_{c} = w_{i} | w_t) = \hat{y}_{c,i} &= \frac{e^{\mathbf{u}_i}}{\sum_{j=1}^{N_{\mathcal{V}}} e^{\mathbf{u}_j}}
\end{align*}
$$
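where $\hat{y}_{c,i}$ is the predicted probability that word $i$ appears at context position $c$ given target word $w_t$. Because $\mathbf{u}$ depends only on $w_t$, the same distribution is shared by all $2m$ context positions.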
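
The forward pass above can be sketched in a few lines of plain Python. This is an illustrative toy, not a reference implementation: the matrices are small nested lists with made-up values, and `forward` is a hypothetical function name.

```python
import math


def forward(W1, W2, target_idx):
    """Forward pass for one target word; W1 is h x N, W2 is N x h (nested lists)."""
    # h = W1 x with one-hot x: this just reads column `target_idx` of W1.
    h = [row[target_idx] for row in W1]
    # u = W2 h: one score per vocabulary word.
    u = [sum(w * hj for w, hj in zip(row, h)) for row in W2]
    # Softmax, shifted by max(u) for numerical stability.
    u_max = max(u)
    exps = [math.exp(ui - u_max) for ui in u]
    total = sum(exps)
    y_hat = [e / total for e in exps]
    return h, u, y_hat


# Toy example (made-up values): vocabulary of N = 4 words, hidden size h = 2.
W1 = [[0.1, 0.2, -0.3, 0.4],
      [0.5, -0.6, 0.7, 0.8]]
W2 = [[0.2, -0.1],
      [0.0, 0.3],
      [-0.4, 0.5],
      [0.6, 0.1]]
h, u, y_hat = forward(W1, W2, target_idx=2)
print("h =", h)
print("y_hat =", y_hat)
```

Note that `y_hat` does not depend on the context position, so the same vector is compared against each of the $2m$ context words.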

> __Key Difference from CBOW__
>
> * **CBOW**: Many context words → one target (many-to-one)
> * **Skip-Gram**: One target → many context words (one-to-many)
>
> Skip-Gram generates $2m$ predictions per example (one per context position), making it more expensive than CBOW but better for rare words since each occurrence produces $2m$ training signals.

The training objective maximizes the likelihood of actual context words given the target. For target word $w_t$ with context $\{w_{t-m}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+m}\}$, we minimize negative log-likelihood:
$$
\begin{align*}
\mathcal{L} &= -\log p(w_{t-m}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+m} | w_t)
\end{align*}
$$

Assuming context words are conditionally independent given the target:
$$
\begin{align*}
\mathcal{L} &= -\log \prod_{c \in \mathcal{C}} p(w_c | w_t) \\
&= -\sum_{c \in \mathcal{C}} \log p(w_c | w_t) \\
&= -\sum_{c \in \mathcal{C}} \log \left( \frac{e^{\mathbf{u}_{c}}}{\sum_{j=1}^{N_{\mathcal{V}}} e^{\mathbf{u}_j}} \right)
\end{align*}
$$

where $\mathbf{u}_{c}$ denotes the element of the output vector $\mathbf{u}$ corresponding to the actual context word at position $c$. Expanding this:

$$
\begin{align*}
\mathcal{L} &= -\sum_{c \in \mathcal{C}} \left( \mathbf{u}_{c} - \log \sum_{j=1}^{N_{\mathcal{V}}} e^{\mathbf{u}_j} \right) \\
&= \sum_{c \in \mathcal{C}} \left( \log \sum_{j=1}^{N_{\mathcal{V}}} e^{\mathbf{u}_j} - \mathbf{u}_{c} \right) \\
&= |\mathcal{C}| \cdot \log \sum_{j=1}^{N_{\mathcal{V}}} e^{\mathbf{u}_j} - \sum_{c \in \mathcal{C}} \mathbf{u}_{c} \quad\blacksquare
\end{align*}
$$

where $|\mathcal{C}| = 2m$ is the number of context positions. The softmax normalization term $\log \sum_{j=1}^{N_{\mathcal{V}}} e^{\mathbf{u}_j}$ sums over the entire vocabulary, and computing $\mathbf{u}$ itself costs $O(h \cdot N_{\mathcal{V}})$, which makes this objective expensive for large vocabularies.
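
As a quick sanity check on the derivation, the sketch below evaluates the loss both as a sum of per-position negative log-probabilities and in the final closed form, and confirms that they agree. The score vector and context indices are made-up values, and the function names are hypothetical.

```python
import math


def log_sum_exp(u):
    """Numerically stable log(sum_j exp(u_j))."""
    u_max = max(u)
    return u_max + math.log(sum(math.exp(ui - u_max) for ui in u))


def loss_per_position(u, context_ids):
    """Sum over context positions of -log softmax(u)[c]."""
    lse = log_sum_exp(u)
    return sum(-(u[c] - lse) for c in context_ids)


def loss_closed_form(u, context_ids):
    """|C| * log-sum-exp(u) minus the summed scores of the context words."""
    return len(context_ids) * log_sum_exp(u) - sum(u[c] for c in context_ids)


u = [0.5, -1.0, 2.0, 0.1, -0.3]   # scores for a 5-word vocabulary
context_ids = [2, 0, 3, 2]        # m = 2, so |C| = 4; a word may repeat
print(loss_per_position(u, context_ids))
print(loss_closed_form(u, context_ids))
```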

### Optimization Strategy
We optimize the Skip-Gram objective using stochastic gradient descent (SGD) or adaptive optimizers (Adam, AdaGrad) that adjust learning rates per parameter.

> __Training Algorithm__
>
> Use mini-batch SGD with adaptive optimizers (Adam/AdaGrad) to handle sparse gradients. Start with learning rate $\alpha = 0.025$ and linearly decay to near zero. Negative Sampling (next section) replaces expensive softmax with $O(k)$ binary classification.
>
> **Why gradient descent?** It scales to millions of parameters and billions of examples, in practice converges to local minima whose embeddings capture semantic similarity, and is computationally efficient: $O(h \cdot k)$ per update with negative sampling versus $O(h \cdot N_{\mathcal{V}})$ for full softmax.
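
The linear learning-rate decay mentioned above might look like the following sketch. The starting value $\alpha = 0.025$ comes from the text; the small floor (`floor_ratio`) that keeps the rate from reaching exactly zero is an assumption, as is the function name `linear_decay`.

```python
def linear_decay(step, total_steps, alpha0=0.025, floor_ratio=1e-4):
    """Linearly decay the learning rate from alpha0 toward zero.

    Assumption: the rate is clipped at alpha0 * floor_ratio so that it stays
    slightly above zero at the end of training.
    """
    progress = step / total_steps
    return max(alpha0 * (1.0 - progress), alpha0 * floor_ratio)


for step in (0, 25, 50, 75, 100):
    print(step, linear_decay(step, total_steps=100))
```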

> __Computing the Gradients__
>
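> For a single context position $c$ around target $t$, let $\mathbf{h} = \mathbf{W}_1 \cdot \mathbf{v}_{w_t}$ be the target embedding and let $\mathcal{L}$ denote the loss for that position. The gradient with respect to the score of the actual context word is: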
> $$
> \begin{align*}
> \frac{\partial \mathcal{L}}{\partial \mathbf{u}_c} &= \hat{y}_c - 1 \in \mathbb{R}
> \end{align*}
> $$
> and for all other words $j \neq c$:
> $$
> \begin{align*}
> \frac{\partial \mathcal{L}}{\partial \mathbf{u}_j} &= \hat{y}_j \in \mathbb{R}
> \end{align*}
> $$
> where $\hat{y}_i = \frac{e^{\mathbf{u}_i}}{\sum_{k=1}^{N_{\mathcal{V}}} e^{\mathbf{u}_k}}$ is the softmax probability. The gradient with respect to the hidden layer is:
> $$
> \begin{align*}
> \frac{\partial \mathcal{L}}{\partial \mathbf{h}} &= \mathbf{W}_2^{\top} \cdot \frac{\partial \mathcal{L}}{\partial \mathbf{u}} \in \mathbb{R}^h
> \end{align*}
> $$
> Finally, the weight matrix gradients are computed using the chain rule:
> $$
> \begin{align*}
> \frac{\partial \mathcal{L}}{\partial \mathbf{W}_2} &= \frac{\partial \mathcal{L}}{\partial \mathbf{u}} \cdot \mathbf{h}^{\top} \in \mathbb{R}^{N_{\mathcal{V}} \times h} \\
> \frac{\partial \mathcal{L}}{\partial \mathbf{W}_1} &= \frac{\partial \mathcal{L}}{\partial \mathbf{h}} \cdot \mathbf{v}_{w_t}^{\top} \in \mathbb{R}^{h \times N_{\mathcal{V}}}
> \end{align*}
> $$
> These gradients are accumulated across all $2m$ context positions and used to update the weight matrices via gradient descent.

The softmax bottleneck motivates the need for computational optimizations like Negative Sampling.
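
The gradient formulas above can be checked numerically. The sketch below is a toy under stated assumptions (made-up weights, nested lists instead of a tensor library, hypothetical function names): it accumulates the gradients over all context positions and compares one entry of each weight gradient against a finite-difference estimate.

```python
import math


def softmax(u):
    u_max = max(u)
    exps = [math.exp(x - u_max) for x in u]
    s = sum(exps)
    return [e / s for e in exps]


def loss_and_grads(W1, W2, t, context_ids):
    """Full-softmax Skip-Gram loss and gradients for one target index t."""
    h = [row[t] for row in W1]                               # column t of W1
    u = [sum(w * x for w, x in zip(row, h)) for row in W2]   # u = W2 h
    y = softmax(u)
    loss = -sum(math.log(y[c]) for c in context_ids)
    # dL/du summed over positions: |C| * y_hat, minus 1 for each context word.
    du = [len(context_ids) * yi for yi in y]
    for c in context_ids:
        du[c] -= 1.0
    # dL/dh = W2^T du
    dh = [sum(W2[i][j] * du[i] for i in range(len(W2))) for j in range(len(h))]
    # dL/dW2 = du h^T (outer product)
    dW2 = [[du_i * hj for hj in h] for du_i in du]
    # dL/dW1 = dh x^T with one-hot x: only column t is non-zero.
    dW1 = [[dh[j] if col == t else 0.0 for col in range(len(W1[0]))]
           for j in range(len(W1))]
    return loss, dW1, dW2


W1 = [[0.1, 0.2, -0.3, 0.4], [0.5, -0.6, 0.7, 0.8]]
W2 = [[0.2, -0.1], [0.0, 0.3], [-0.4, 0.5], [0.6, 0.1]]
t, context_ids = 2, [1, 3]
loss, dW1, dW2 = loss_and_grads(W1, W2, t, context_ids)

# Finite-difference estimates for one entry of W2 and one entry of W1.
eps = 1e-6
W2_plus = [row[:] for row in W2]
W2_plus[3][0] += eps
numeric_W2 = (loss_and_grads(W1, W2_plus, t, context_ids)[0] - loss) / eps

W1_plus = [row[:] for row in W1]
W1_plus[0][t] += eps
numeric_W1 = (loss_and_grads(W1_plus, W2, t, context_ids)[0] - loss) / eps

print("dW2[3][0]:", dW2[3][0], "numeric:", numeric_W2)
print("dW1[0][t]:", dW1[0][t], "numeric:", numeric_W1)
```

Only column $t$ of the $\mathbf{W}_1$ gradient is non-zero, which is why each update touches a single input embedding.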

> __Training Cost and Embeddings__
>
> Skip-Gram costs $\sim 2m \times$ more than CBOW because it produces $2m$ predictions per target versus CBOW's one prediction per $2m$ context words.
>
> **Why better for rare words?** Each rare word occurrence generates $2m$ gradient updates (one per context prediction) versus CBOW's single update. More updates mean better learning from limited data.
>
> After training, use embeddings from:
> * **Input embeddings**: Columns of $\mathbf{W}_{1} \in \mathbb{R}^{h \times N_{\mathcal{V}}}$
> * **Output embeddings**: Rows of $\mathbf{W}_{2} \in \mathbb{R}^{N_{\mathcal{V}} \times h}$
>
> In practice, use $\mathbf{W}_{1}$ (input embeddings) or the average of the input and output embeddings, $\tfrac{1}{2}(\mathbf{W}_{1} + \mathbf{W}_{2}^{\top})$. The column of $\mathbf{W}_{1}$ corresponding to word $i$ is the $h$-dimensional embedding for that word.
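
Reading embeddings out of the trained matrices is a matter of indexing. The sketch below uses made-up weights and hypothetical helper names; it extracts the input embedding (a column of $\mathbf{W}_1$), the output embedding (a row of $\mathbf{W}_2$), their average, and a cosine similarity for comparing two words.

```python
import math


def input_embedding(W1, i):
    """Column i of W1 (W1 is h x N)."""
    return [row[i] for row in W1]


def output_embedding(W2, i):
    """Row i of W2 (W2 is N x h)."""
    return list(W2[i])


def averaged_embedding(W1, W2, i):
    """Element-wise mean of the input and output embeddings of word i."""
    return [(a + b) / 2.0
            for a, b in zip(input_embedding(W1, i), output_embedding(W2, i))]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy weights (made-up values): N = 4 words, h = 2 dimensions.
W1 = [[0.1, 0.2, -0.3, 0.4],
      [0.5, -0.6, 0.7, 0.8]]
W2 = [[0.2, -0.1],
      [0.0, 0.3],
      [-0.4, 0.5],
      [0.6, 0.1]]
v2 = input_embedding(W1, 2)
avg2 = averaged_embedding(W1, W2, 2)
print("input embedding of word 2:", v2)
print("averaged embedding of word 2:", avg2)
print("cosine(word 0, word 3):", cosine(input_embedding(W1, 0), input_embedding(W1, 3)))
```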

___

## Negative Sampling and Training Efficiency
The softmax denominator $\sum_{j=1}^{N_{\mathcal{V}}} e^{\mathbf{u}_j}$ sums over all vocabulary words, computed $2m$ times per example. For vocabularies with $N_{\mathcal{V}} \approx 1\text{M}$ words, this becomes prohibitive. To address this, negative sampling replaces the full softmax with binary classification: distinguishing actual context words (positive examples) from randomly sampled words (negative examples).

> __Negative Sampling Objective__
>
> For each context word $w_c$, let $\mathbf{w}_{c}^{(2)}$ be its output embedding (the row of $\mathbf{W}_2$ corresponding to $w_c$) and $\mathbf{h}$ be the target word embedding. The objective to maximize becomes:
> $$
> \log \sigma((\mathbf{w}_{c}^{(2)})^{\top} \mathbf{h}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-(\mathbf{w}_{i}^{(2)})^{\top} \mathbf{h}) \right]
> $$
> where $\sigma(x) = 1/(1+e^{-x})$ is the sigmoid function, $k$ is the number of negative samples (typically 5-20), and $P_n(w)$ is the noise distribution (often $P_n(w) \propto f(w)^{3/4}$ where $f(w)$ is word frequency).
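
The noise distribution $P_n(w) \propto f(w)^{3/4}$ can be sketched as follows. The word counts are made up, the helper names are hypothetical, and skipping the positive context word when drawing negatives is an assumption rather than something stated above.

```python
import random


def noise_distribution(counts, power=0.75):
    """Normalize f(w)^power into a probability distribution over words."""
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}


def sample_negatives(P_n, k, exclude, rng):
    """Draw k negative words from P_n, redrawing any word in `exclude`."""
    words = list(P_n)
    probs = [P_n[w] for w in words]
    negatives = []
    while len(negatives) < k:
        w = rng.choices(words, weights=probs, k=1)[0]
        if w not in exclude:   # assumption: never use the positive word as a negative
            negatives.append(w)
    return negatives


counts = {"the": 1000, "cat": 50, "sat": 30, "zygote": 1}   # made-up frequencies
P_n = noise_distribution(counts)
rng = random.Random(0)   # fixed seed so the run is repeatable
negatives = sample_negatives(P_n, k=5, exclude={"cat"}, rng=rng)
print(P_n)
print(negatives)
```

Raising counts to the power $3/4$ flattens the distribution: compared with raw relative frequency, rare words such as `"zygote"` are sampled more often and very frequent words such as `"the"` less often.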

This reduces complexity from $O(N_{\mathcal{V}})$ to $O(k)$ per context word. For target $w_t$ at context position $c$ with embedding $\mathbf{h} = \mathbf{W}_{1} \cdot \mathbf{v}_{w_t}$:

1. **Positive example**: Actual context word $w_c$
   - Compute score: $s_{+} = (\mathbf{w}_{c}^{(2)})^{\top} \mathbf{h}$
   - Objective: Maximize $\log \sigma(s_{+})$

2. **Negative examples**: Sample $k$ words $\{w_{n_1}, \dots, w_{n_k}\}$ from noise distribution
   - For each negative word: Compute score $s_{-,i} = (\mathbf{w}_{n_i}^{(2)})^{\top} \mathbf{h}$
   - Objective: Maximize $\sum_{i=1}^{k} \log \sigma(-s_{-,i})$

3. **Combined loss** for this context position:
$$
\mathcal{L}_c = -\log \sigma(s_{+}) - \sum_{i=1}^{k} \log \sigma(-s_{-,i})
$$

The objective encourages high dot products between the target embedding and the output embeddings of co-occurring words (positive examples) and low dot products for random pairs (negative examples). As a result, words that appear in similar contexts end up with similar embeddings.
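
The three steps above translate into the following sketch of the per-position loss $\mathcal{L}_c$ and one plain SGD update. All vectors are made-up toy values and the function names are hypothetical; a real trainer would update rows of $\mathbf{W}_1$ and $\mathbf{W}_2$ in place rather than return copies.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sgns_loss(h, w_pos, w_negs):
    """L_c = -log sigma(s_+) - sum_i log sigma(-s_{-,i})."""
    loss = -math.log(sigmoid(dot(w_pos, h)))
    for w in w_negs:
        loss -= math.log(sigmoid(-dot(w, h)))
    return loss


def sgns_step(h, w_pos, w_negs, lr):
    """One SGD step on L_c; returns updated copies of (h, w_pos, w_negs)."""
    # dL/ds = sigma(s) - label, with label 1 for the positive and 0 for negatives.
    g_pos = sigmoid(dot(w_pos, h)) - 1.0
    g_negs = [sigmoid(dot(w, h)) for w in w_negs]
    # dL/dh = g_pos * w_pos + sum_i g_i * w_neg_i
    grad_h = [g_pos * wp for wp in w_pos]
    for g, w in zip(g_negs, w_negs):
        grad_h = [gh + g * wi for gh, wi in zip(grad_h, w)]
    # dL/dw = g * h for each output vector involved (only k + 1 rows of W2).
    new_w_pos = [wp - lr * g_pos * hj for wp, hj in zip(w_pos, h)]
    new_w_negs = [[wi - lr * g * hj for wi, hj in zip(w, h)]
                  for g, w in zip(g_negs, w_negs)]
    new_h = [hj - lr * gh for hj, gh in zip(h, grad_h)]
    return new_h, new_w_pos, new_w_negs


h = [0.5, -0.2]                          # target embedding
w_pos = [0.1, 0.3]                       # output embedding of the true context word
w_negs = [[0.4, 0.1], [-0.2, 0.6]]       # output embeddings of k = 2 negatives
loss_before = sgns_loss(h, w_pos, w_negs)
h2, w_pos2, w_negs2 = sgns_step(h, w_pos, w_negs, lr=0.1)
loss_after = sgns_loss(h2, w_pos2, w_negs2)
print(loss_before, "->", loss_after)
```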

> __Theoretical Connection to PMI__
>
> [Levy & Goldberg (2014)](https://papers.nips.cc/paper/2014/hash/feab05aa91085b7a8012516bc3533958-Abstract.html) showed that Skip-Gram with Negative Sampling (SGNS), at its optimum, implicitly factorizes a word-context matrix $\mathbf{M}$ with entries:
> $$
> M_{w,c} = \text{PMI}(w, c) - \log k
> $$
> where PMI is Pointwise Mutual Information (discussed in L9a) and $k$ is the number of negative samples. This connects neural embedding methods to traditional count-based methods.
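
To illustrate the shifted PMI entries, the sketch below computes $\text{PMI}(w, c) - \log k$ directly from a small table of co-occurrence counts. The counts are made up and `shifted_pmi` is a hypothetical helper; it only fills in observed pairs, since PMI is $-\infty$ for pairs that never co-occur.

```python
import math


def shifted_pmi(cooc, k):
    """Return {(w, c): PMI(w, c) - log k} for the observed (word, context) pairs."""
    total = sum(cooc.values())
    w_count, c_count = {}, {}
    for (w, c), n in cooc.items():
        w_count[w] = w_count.get(w, 0) + n
        c_count[c] = c_count.get(c, 0) + n
    M = {}
    for (w, c), n in cooc.items():
        # PMI(w, c) = log( p(w, c) / (p(w) p(c)) ) = log( n * total / (n_w * n_c) )
        pmi = math.log(n * total / (w_count[w] * c_count[c]))
        M[(w, c)] = pmi - math.log(k)
    return M


# Made-up co-occurrence counts for two words and two contexts.
cooc = {("cat", "sat"): 4, ("cat", "the"): 4,
        ("dog", "sat"): 2, ("dog", "the"): 6}
M = shifted_pmi(cooc, k=5)
for pair, value in sorted(M.items()):
    print(pair, round(value, 4))
```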

___

## Practical Considerations
* **Window size** ($m=5$-$10$): smaller windows capture syntax, larger windows capture semantics.
* **Embedding dimension** ($h=100$-$300$): balances expressiveness against data needs.
* **Negative samples** ($k=5$-$20$): trades quality for speed.

> __CBOW vs Skip-Gram Guidelines__
>
> * **Use CBOW** when training speed is critical and the corpus is large with frequent words
> * **Use Skip-Gram** when rare word quality matters or the corpus is smaller
> * **Skip-Gram** performs better on infrequent words because each occurrence generates $2m$ gradient updates instead of one

Widely used pre-trained embeddings include Word2Vec (trained on Google News), GloVe (trained on Common Crawl), and FastText (which adds subword information). Use pre-trained embeddings for small (<10M words) or general-domain corpora; train from scratch for large, domain-specific corpora (medical, legal, technical).

___

## Summary
This lecture covers the Skip-Gram model for learning word embeddings by predicting context from target words.

> __Key Takeaways:__
>
> * __Skip-Gram architecture:__ Skip-Gram reverses CBOW by predicting multiple context words from a single target word, producing $2m$ predictions per training example compared to CBOW's one prediction.
> * __Rare word performance:__ Skip-Gram provides better embeddings for rare words because each occurrence generates $2m$ gradient updates to the word's embedding, versus one update per occurrence in CBOW, so it learns more from limited data.
> * __Negative Sampling optimization:__ Negative Sampling replaces the expensive full softmax (complexity $O(N_{\mathcal{V}})$) with a binary classification task using $k$ negative samples (complexity $O(k)$), making training practical for large vocabularies.

Skip-Gram with Negative Sampling remains a widely used and effective method for learning word embeddings.

___
