# Gradient Descent Algorithms for the Skip-Gram Model
The Skip-Gram model learns word embeddings by predicting context words from target words, optimizing parameters $\mathbf{W}_1 \in \mathbb{R}^{h \times N_{\mathcal{V}}}$ and $\mathbf{W}_2 \in \mathbb{R}^{N_{\mathcal{V}} \times h}$ using gradient-based optimization. This notebook examines gradient descent variants suitable for Skip-Gram training, particularly algorithms designed for high-dimensional sparse gradient problems.

> __Gradient Descent for Skip-Gram__
>
> Standard gradient descent updates parameters using:
> $$
> \begin{align*}
> \theta_{k+1} &= \theta_k - \alpha \nabla_{\theta} \mathcal{L}(\theta_k)
> \end{align*}
> $$
> where $\theta$ represents all parameters $(\mathbf{W}_1, \mathbf{W}_2)$, $\alpha > 0$ is the learning rate, and $\nabla_{\theta} \mathcal{L}$ is the gradient of the loss function.
>
> For Skip-Gram, the loss for a single training example (target word $w_t$ with context $\mathcal{C}$) is:
> $$
> \begin{align*}
> \mathcal{L} &= -\sum_{c \in \mathcal{C}} \log p(w_c | w_t)
> \end{align*}
> $$
> where $p(w_c | w_t) = \frac{\exp(u_c)}{\sum_{j=1}^{N_{\mathcal{V}}} \exp(u_j)}$, $u_j$ is the $j$-th component of $\mathbf{u} = \mathbf{W}_2 \mathbf{h}$, and $\mathbf{h} = \mathbf{W}_1 \mathbf{v}_{w_t}$ with $\mathbf{v}_{w_t}$ the one-hot encoding of $w_t$.
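
The forward pass, loss, and plain gradient descent step above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: the function name, the toy sizes, and the learning rate are assumptions chosen for the example. Note that with the full softmax every row of $\mathbf{W}_2$ generally receives a gradient; only the target column of $\mathbf{W}_1$ does.

```python
import numpy as np

def softmax_loss_and_grads(W1, W2, target, context):
    """Full-softmax Skip-Gram loss and gradients for one target word.

    W1: (h, N_V) input embeddings, W2: (N_V, h) output embeddings,
    target: int index of w_t, context: list of int indices of context words.
    """
    h = W1[:, target]                  # W1 @ one_hot(target) selects a column
    u = W2 @ h                         # scores u, shape (N_V,)
    u = u - u.max()                    # shift for numerical stability
    p = np.exp(u) / np.exp(u).sum()    # softmax probabilities
    loss = -np.log(p[context]).sum()

    du = len(context) * p              # dL/du = |C| * p - (context counts)
    np.subtract.at(du, context, 1.0)
    gW2 = np.outer(du, h)              # dense: every row is generally non-zero
    gW1 = np.zeros_like(W1)
    gW1[:, target] = W2.T @ du         # only the target column is non-zero
    return loss, gW1, gW2

# Toy setup (hypothetical sizes): N_V = 10 words, h = 4 dimensions.
rng = np.random.default_rng(0)
W1 = 0.1 * rng.normal(size=(4, 10))
W2 = 0.1 * rng.normal(size=(10, 4))
alpha = 0.1

loss_before, gW1, gW2 = softmax_loss_and_grads(W1, W2, target=3, context=[2, 4])
W1 -= alpha * gW1                      # theta_{k+1} = theta_k - alpha * grad
W2 -= alpha * gW2
loss_after, _, _ = softmax_loss_and_grads(W1, W2, target=3, context=[2, 4])
print(loss_before, loss_after)
```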

### Challenges in Skip-Gram Optimization
Skip-Gram presents unique optimization challenges that motivate adaptive gradient methods:

1. **Sparse gradients**: For each training example, only one column of $\mathbf{W}_1$ receives gradient updates (the target word's embedding). With negative sampling, at most $2m(k+1)$ rows of $\mathbf{W}_2$ receive updates: the $2m$ context words in a window of half-width $m$, plus $k$ negative samples drawn for each of them. The remaining $N_{\mathcal{V}} - 1$ columns of $\mathbf{W}_1$ and at least $N_{\mathcal{V}} - 2m(k+1)$ rows of $\mathbf{W}_2$ have zero gradients.

2. **Varying update frequencies**: Frequent words receive many gradient updates, while rare words receive few. A single learning rate treats all parameters equally, causing frequent word embeddings to oscillate while rare word embeddings converge slowly.

3. **High dimensionality**: With vocabulary sizes $N_{\mathcal{V}} \approx 10^4$ to $10^6$ and embedding dimension $h \approx 100$ to $300$, the parameter space has $2hN_{\mathcal{V}} \approx 2 \times 10^6$ to $6 \times 10^8$ dimensions.

Adaptive gradient methods address these challenges by maintaining per-parameter learning rates that adjust based on gradient history.
___

## AdaGrad: Adaptive Gradient Algorithm
AdaGrad (Adaptive Gradient Algorithm) adapts learning rates for each parameter based on the history of squared gradients. Parameters with large cumulative gradients receive smaller learning rates, while parameters with small cumulative gradients receive larger learning rates.

> __Algorithm__
>
> **Initialization**: Given initial parameters $\theta_0$, learning rate $\alpha > 0$, and small constant $\epsilon > 0$ (typically $10^{-8}$) to prevent division by zero. Initialize the accumulated squared gradient $\mathbf{G}_0 = \mathbf{0}$ (same dimensions as $\theta$). Set iteration counter $k \gets 0$.
>
> For each training iteration $k = 0, 1, 2, \ldots$ **do**:
> 1. **Compute gradient**: $\mathbf{g}_k = \nabla_{\theta} \mathcal{L}(\theta_k)$
> 2. **Accumulate squared gradients**: $\mathbf{G}_{k+1} = \mathbf{G}_k + \mathbf{g}_k \odot \mathbf{g}_k$ where $\odot$ denotes element-wise multiplication
> 3. **Update parameters**: $\theta_{k+1} = \theta_k - \frac{\alpha}{\sqrt{\mathbf{G}_{k+1}} + \epsilon} \odot \mathbf{g}_k$ where division and square root are element-wise
> 4. **Increment**: $k \gets k + 1$

The key innovation is the per-parameter learning rate $\frac{\alpha}{\sqrt{G_{k+1,i}} + \epsilon}$ for parameter $i$, where $G_{k+1,i}$ accumulates all past squared gradients for that parameter. This provides two benefits:

1. **Automatic decay**: Parameters with large cumulative gradients get smaller effective learning rates
2. **Sparse-friendly**: A parameter whose gradient is zero at a step adds nothing to its accumulator, so its effective learning rate does not decay while it goes unobserved
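
A minimal NumPy sketch of the AdaGrad update may make the per-parameter behaviour concrete. The function name `adagrad_step` and the toy gradient pattern (one parameter updated every step, the other every tenth step) are assumptions for illustration only.

```python
import numpy as np

def adagrad_step(theta, G, grad, alpha=0.1, eps=1e-8):
    """One AdaGrad update; all operations are element-wise."""
    G = G + grad * grad                              # accumulate squared gradients
    theta = theta - alpha / (np.sqrt(G) + eps) * grad
    return theta, G

theta = np.zeros(2)
G = np.zeros(2)
for step in range(100):
    # Parameter 0 plays a "frequent word" (gradient at every step),
    # parameter 1 a "rare word" (gradient at every 10th step).
    grad = np.array([1.0, 1.0 if step % 10 == 0 else 0.0])
    theta, G = adagrad_step(theta, G, grad)

eff_lr = 0.1 / (np.sqrt(G) + 1e-8)
print(G)        # accumulators: 100 for the frequent, 10 for the rare parameter
print(eff_lr)   # the rare parameter keeps the larger effective learning rate
```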

> __Why AdaGrad for Skip-Gram?__
>
> AdaGrad naturally handles Skip-Gram's sparse gradient structure. Frequent words accumulate large $\mathbf{G}$ values, reducing their learning rates and preventing oscillation. Rare words maintain small $\mathbf{G}$ values, preserving large learning rates to extract maximum information from limited updates. This per-parameter adaptation requires no manual tuning of separate learning rates.

### Limitations
AdaGrad has a critical limitation: the accumulated squared gradient $\mathbf{G}_k$ grows monotonically, causing learning rates to decay continuously. For parameter $i$:
$$
\begin{align*}
G_{k,i} &= \sum_{j=0}^{k-1} g_{j,i}^2
\end{align*}
$$
If the squared gradients are not summable (for example, if $|g_{j,i}|$ stays bounded away from zero), then $G_{k,i} \to \infty$ as $k \to \infty$, so the effective learning rate $\frac{\alpha}{\sqrt{G_{k,i}} + \epsilon} \to 0$. This aggressive decay can stop learning prematurely, particularly in non-convex problems where the optimization path may revisit similar parameter regions requiring continued adaptation.
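
As a worked illustration, assume parameter $i$ sees a gradient of constant magnitude $|g_{j,i}| = g$ at every step (a simplifying assumption made here, not a property of Skip-Gram). The decay then has a closed form:
$$
\begin{align*}
G_{k,i} &= \sum_{j=0}^{k-1} g^2 = k g^2 \\
\frac{\alpha}{\sqrt{G_{k,i}} + \epsilon} &= \frac{\alpha}{g\sqrt{k} + \epsilon} \approx \frac{\alpha}{g\sqrt{k}}
\end{align*}
$$
With $\alpha = 0.1$ and $g = 1$, the effective learning rate is about $10^{-2}$ after $k = 10^2$ updates and about $10^{-4}$ after $k = 10^6$ updates.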

For Skip-Gram training over millions of examples, this monotonic decay often causes convergence to suboptimal solutions before reaching good word embeddings. This limitation motivated the development of methods that limit or discount old gradient information.
___

## Adam: Adaptive Moment Estimation
Adam (Adaptive Moment Estimation) addresses AdaGrad's aggressive learning rate decay by using exponentially weighted moving averages of both gradients and squared gradients. This provides adaptive per-parameter learning rates while preventing indefinite decay.

> __Algorithm__
>
> **Initialization**: Given initial parameters $\theta_0$, learning rate $\alpha > 0$ (typically $0.001$), exponential decay rates $\beta_1 \in [0,1)$ (typically $0.9$) for first moment and $\beta_2 \in [0,1)$ (typically $0.999$) for second moment, and small constant $\epsilon > 0$ (typically $10^{-8}$). Initialize first moment $\mathbf{m}_0 = \mathbf{0}$, second moment $\mathbf{v}_0 = \mathbf{0}$, and iteration counter $k \gets 1$.
>
> For each training iteration $k = 1, 2, 3, \ldots$ **do**:
> 1. **Compute gradient**: $\mathbf{g}_k = \nabla_{\theta} \mathcal{L}(\theta_{k-1})$
> 2. **Update biased first moment**: $\mathbf{m}_k = \beta_1 \mathbf{m}_{k-1} + (1 - \beta_1) \mathbf{g}_k$
> 3. **Update biased second moment**: $\mathbf{v}_k = \beta_2 \mathbf{v}_{k-1} + (1 - \beta_2) \mathbf{g}_k \odot \mathbf{g}_k$
> 4. **Bias correction for first moment**: $\hat{\mathbf{m}}_k = \frac{\mathbf{m}_k}{1 - \beta_1^k}$
> 5. **Bias correction for second moment**: $\hat{\mathbf{v}}_k = \frac{\mathbf{v}_k}{1 - \beta_2^k}$
> 6. **Update parameters**: $\theta_k = \theta_{k-1} - \frac{\alpha}{\sqrt{\hat{\mathbf{v}}_k} + \epsilon} \odot \hat{\mathbf{m}}_k$
> 7. **Increment**: $k \gets k + 1$
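
The steps above translate almost line for line into NumPy. The sketch below is illustrative: the function name `adam_step` and the toy gradient are assumptions. It also shows a known consequence of bias correction: on the first step $\hat{\mathbf{m}}_1 = \mathbf{g}_1$ and $\hat{\mathbf{v}}_1 = \mathbf{g}_1 \odot \mathbf{g}_1$, so each parameter moves by roughly $\alpha$ regardless of the gradient's scale.

```python
import numpy as np

def adam_step(theta, m, v, grad, k, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; k is the 1-based iteration counter."""
    m = beta1 * m + (1 - beta1) * grad           # biased first moment
    v = beta2 * v + (1 - beta2) * grad * grad    # biased second moment
    m_hat = m / (1 - beta1 ** k)                 # bias-corrected first moment
    v_hat = v / (1 - beta2 ** k)                 # bias-corrected second moment
    theta = theta - alpha / (np.sqrt(v_hat) + eps) * m_hat
    return theta, m, v

# Toy objective 0.5 * ||theta||^2, whose gradient is theta itself.
theta0 = np.array([1.0, -2.0])
m0 = np.zeros_like(theta0)
v0 = np.zeros_like(theta0)
theta1, m1, v1 = adam_step(theta0, m0, v0, grad=theta0, k=1)
print(theta1)   # each coordinate moved toward zero by about alpha = 0.001
```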

Adam maintains two moving averages: $\mathbf{m}_k$ estimates the mean gradient (first moment), and $\mathbf{v}_k$ estimates the uncentered variance (second moment). The bias correction steps account for initialization at zero, which biases estimates toward zero in early iterations.

> __Comparison to AdaGrad__
>
> AdaGrad accumulates all past squared gradients, $\mathbf{G}_k = \sum_{j=0}^{k-1} \mathbf{g}_j \odot \mathbf{g}_j$, which grows monotonically. Adam uses an exponential moving average, $\mathbf{v}_k = \beta_2 \mathbf{v}_{k-1} + (1 - \beta_2) \mathbf{g}_k \odot \mathbf{g}_k$, giving the most recent gradient weight $(1-\beta_2)$ and discounting older gradients by a further factor of $\beta_2$ per iteration. With $\beta_2 = 0.999$, the average has an effective window of about $1/(1-\beta_2) = 1000$ iterations: a gradient from $1000$ iterations ago is still weighted by $0.999^{1000} \approx 0.37$ relative to the newest one, while a gradient from $5000$ iterations ago is weighted by $0.999^{5000} \approx 0.007$ and contributes negligibly. This prevents indefinite accumulation while maintaining adaptation to recent gradient patterns.
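
The discounting arithmetic can be checked directly. In this small sketch the helper names are hypothetical; `recent_mass` ignores bias correction and describes the long-run share of the average carried by the newest gradients.

```python
beta2 = 0.999

def relative_weight(age, beta2=beta2):
    """Weight of a squared gradient `age` iterations old, relative to the newest one."""
    return beta2 ** age

def recent_mass(n, beta2=beta2):
    """Long-run fraction of the average contributed by the newest n gradients."""
    return 1.0 - beta2 ** n

w_1000 = relative_weight(1000)   # about 0.37: discounted, but not negligible
w_5000 = relative_weight(5000)   # about 0.007
mass_1000 = recent_mass(1000)    # about 0.63 of the average comes from the last 1000 steps
print(w_1000, w_5000, mass_1000)
```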

### Why Adam for Skip-Gram?
Adam combines three properties beneficial for Skip-Gram training:

1. **Momentum-like behavior**: The first moment $\mathbf{m}_k$ smooths noisy gradients from stochastic mini-batches, accelerating convergence in consistent directions
2. **Adaptive learning rates**: The second moment $\mathbf{v}_k$ provides per-parameter scaling similar to AdaGrad, handling sparse gradients effectively
3. **Bounded decay**: Exponential averaging prevents the vanishing learning rates that plague AdaGrad in long training runs

The effective learning rate for parameter $i$ at iteration $k$ is:
$$
\begin{align*}
\alpha_{\text{eff},i,k} &= \frac{\alpha}{\sqrt{\hat{v}_{k,i}} + \epsilon}
\end{align*}
$$
Unlike AdaGrad where $\alpha_{\text{eff},i,k}$ decreases monotonically, Adam's $\alpha_{\text{eff},i,k}$ can increase or decrease based on recent gradient history, allowing the algorithm to adapt to changing loss landscapes during training.
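
To illustrate the difference, the sketch below tracks the effective learning rate of a single scalar parameter under both rules. The gradient schedule (magnitude $1$ for 2000 steps, then $0.01$ for 5000 steps) is a made-up example, and the same $\alpha$ is used for both methods only to keep the comparison simple.

```python
alpha, beta2, eps = 1e-3, 0.999, 1e-8

def simulate(grad_magnitudes):
    """Return per-step effective learning rates (Adam, AdaGrad) for one scalar parameter."""
    v, G = 0.0, 0.0
    adam_lr, adagrad_lr = [], []
    for k, g in enumerate(grad_magnitudes, start=1):
        v = beta2 * v + (1 - beta2) * g * g      # Adam: exponential moving average
        v_hat = v / (1 - beta2 ** k)             # bias correction
        G += g * g                               # AdaGrad: running sum
        adam_lr.append(alpha / (v_hat ** 0.5 + eps))
        adagrad_lr.append(alpha / (G ** 0.5 + eps))
    return adam_lr, adagrad_lr

grads = [1.0] * 2000 + [0.01] * 5000
adam_lr, adagrad_lr = simulate(grads)

# Once gradients shrink, Adam's rate recovers while AdaGrad's keeps falling.
print(adam_lr[1999], adam_lr[-1])
print(adagrad_lr[1999], adagrad_lr[-1])
```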

> __Practical Considerations__
>
> For Skip-Gram training, Adam typically uses default hyperparameters ($\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$) and requires minimal tuning. Start with these defaults and reduce $\alpha$ if training diverges. The memory overhead is $2 \times \text{sizeof}(\theta)$ to store $\mathbf{m}$ and $\mathbf{v}$, acceptable for embedding dimensions $h \approx 100$-$300$.
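
A quick back-of-the-envelope calculation shows what this overhead means in bytes. The vocabulary size, embedding dimension, and 32-bit float storage below are hypothetical values chosen for illustration.

```python
# Hypothetical configuration: 100,000-word vocabulary, 300-dim embeddings, float32.
N_V, h = 100_000, 300
bytes_per_float = 4

n_params = 2 * h * N_V                      # W1 (h x N_V) plus W2 (N_V x h)
param_bytes = n_params * bytes_per_float
adam_state_bytes = 2 * param_bytes          # m and v each match the parameters in size

print(n_params)                  # 60000000
print(param_bytes / 1e6)         # 240.0 (MB for the parameters)
print(adam_state_bytes / 1e6)    # 480.0 (MB of additional optimizer state)
```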

___

## Application to Skip-Gram Training
This section details how Adam and AdaGrad apply to Skip-Gram parameter updates with negative sampling.

### Gradient Computation for Skip-Gram
For a single training example with target word $w_t$ and context position $c$, the loss with negative sampling is:
$$
\begin{align*}
\mathcal{L}_c &= -\log \sigma((\mathbf{w}_c^{(2)})^{\top} \mathbf{h}) - \sum_{i=1}^{k} \log \sigma(-(\mathbf{w}_{n_i}^{(2)})^{\top} \mathbf{h})
\end{align*}
$$
where $\mathbf{h} = \mathbf{W}_1 \mathbf{v}_{w_t}$ is the target word embedding (column of $\mathbf{W}_1$), $\mathbf{w}_c^{(2)}$ is the context word's output embedding (row of $\mathbf{W}_2$), $\mathbf{w}_{n_i}^{(2)}$ are negative sample embeddings, $k$ is the number of negative samples, and $\sigma(x) = 1/(1+e^{-x})$ is the sigmoid function. Note that $k$ in this section counts negative samples when it appears as a summation limit; as a subscript on optimizer quantities it remains the iteration index.

The total loss for this target word sums over all $|\mathcal{C}| = 2m$ context positions:
$$
\begin{align*}
\mathcal{L} &= \sum_{c \in \mathcal{C}} \mathcal{L}_c
\end{align*}
$$

> __Gradients__
>
> For context position $c$ with positive example $w_c$ and negative samples $\{w_{n_1}, \ldots, w_{n_k}\}$:
>
> **Gradient with respect to output embeddings**:
> $$
> \begin{align*}
> \frac{\partial \mathcal{L}_c}{\partial \mathbf{w}_c^{(2)}} &= (\sigma((\mathbf{w}_c^{(2)})^{\top} \mathbf{h}) - 1) \mathbf{h} \in \mathbb{R}^h \\
> \frac{\partial \mathcal{L}_c}{\partial \mathbf{w}_{n_i}^{(2)}} &= \sigma((\mathbf{w}_{n_i}^{(2)})^{\top} \mathbf{h}) \mathbf{h} \in \mathbb{R}^h \quad \text{for } i = 1, \ldots, k
> \end{align*}
> $$
>
> **Gradient with respect to hidden layer** (target embedding):
> $$
> \begin{align*}
> \frac{\partial \mathcal{L}_c}{\partial \mathbf{h}} &= (\sigma((\mathbf{w}_c^{(2)})^{\top} \mathbf{h}) - 1) \mathbf{w}_c^{(2)} + \sum_{i=1}^{k} \sigma((\mathbf{w}_{n_i}^{(2)})^{\top} \mathbf{h}) \mathbf{w}_{n_i}^{(2)} \in \mathbb{R}^h
> \end{align*}
> $$
>
> **Gradient with respect to input embeddings** (propagated to $\mathbf{W}_1$):
> $$
> \begin{align*}
> \frac{\partial \mathcal{L}_c}{\partial \mathbf{W}_1} &= \frac{\partial \mathcal{L}_c}{\partial \mathbf{h}} \mathbf{v}_{w_t}^{\top} \in \mathbb{R}^{h \times N_{\mathcal{V}}}
> \end{align*}
> $$
> Since $\mathbf{v}_{w_t}$ is one-hot, this updates only the column of $\mathbf{W}_1$ corresponding to target word $w_t$.
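
The gradient formulas above can be checked numerically. The following sketch implements the negative-sampling loss for one context position and compares the analytic gradient with respect to $\mathbf{h}$ against a central finite difference. The function names, toy dimensions, and random test vectors are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss_and_grads(h, w_c, W_neg):
    """Negative-sampling loss for one context position.

    h: (d,) target embedding, w_c: (d,) context output embedding,
    W_neg: (k, d) output embeddings of the k negative samples.
    """
    s_pos = sigmoid(w_c @ h)
    s_neg = sigmoid(W_neg @ h)                     # shape (k,)
    loss = -np.log(s_pos) - np.log(sigmoid(-(W_neg @ h))).sum()
    g_wc = (s_pos - 1.0) * h                       # dL/dw_c
    g_neg = s_neg[:, None] * h[None, :]            # dL/dw_{n_i}, one row per negative
    g_h = (s_pos - 1.0) * w_c + W_neg.T @ s_neg    # dL/dh
    return loss, g_wc, g_neg, g_h

rng = np.random.default_rng(0)
h = rng.normal(size=5)
w_c = rng.normal(size=5)
W_neg = rng.normal(size=(3, 5))                    # k = 3 negative samples

loss, g_wc, g_neg, g_h = sgns_loss_and_grads(h, w_c, W_neg)

# Central finite-difference check of dL/dh.
delta = 1e-6
numeric = np.array([
    (sgns_loss_and_grads(h + delta * e, w_c, W_neg)[0]
     - sgns_loss_and_grads(h - delta * e, w_c, W_neg)[0]) / (2 * delta)
    for e in np.eye(h.size)
])
max_err = np.abs(numeric - g_h).max()
print(max_err)
```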

### Parameter Update with Adam
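For a mini-batch of $B$ training examples, average the per-example gradients at iteration $k$:
$$
\begin{align*}
\mathbf{g}_{\mathbf{W}_1,k} &= \frac{1}{B} \sum_{b=1}^{B} \sum_{c \in \mathcal{C}_b} \frac{\partial \mathcal{L}_{b,c}}{\partial \mathbf{W}_1} \\
\mathbf{g}_{\mathbf{W}_2,k} &= \frac{1}{B} \sum_{b=1}^{B} \sum_{c \in \mathcal{C}_b} \frac{\partial \mathcal{L}_{b,c}}{\partial \mathbf{W}_2}
\end{align*}
$$
where $\frac{\partial \mathcal{L}_{b,c}}{\partial \mathbf{W}_2} \in \mathbb{R}^{N_{\mathcal{V}} \times h}$ holds $\left(\frac{\partial \mathcal{L}_{b,c}}{\partial \mathbf{w}_{c}^{(2)}}\right)^{\top}$ in the row of the context word, $\left(\frac{\partial \mathcal{L}_{b,c}}{\partial \mathbf{w}_{n_i}^{(2)}}\right)^{\top}$ in the row of each negative sample $n_i$, and zeros elsewhere (contributions add if the same word is drawn more than once).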

Apply Adam updates separately to $\mathbf{W}_1$ and $\mathbf{W}_2$:

1. **Update moment estimates**:
   - $\mathbf{m}_{\mathbf{W}_1,k} = \beta_1 \mathbf{m}_{\mathbf{W}_1,k-1} + (1-\beta_1) \mathbf{g}_{\mathbf{W}_1,k}$
   - $\mathbf{v}_{\mathbf{W}_1,k} = \beta_2 \mathbf{v}_{\mathbf{W}_1,k-1} + (1-\beta_2) \mathbf{g}_{\mathbf{W}_1,k} \odot \mathbf{g}_{\mathbf{W}_1,k}$
   - (Similarly for $\mathbf{W}_2$)

2. **Bias correction**:
   - $\hat{\mathbf{m}}_{\mathbf{W}_1,k} = \frac{\mathbf{m}_{\mathbf{W}_1,k}}{1-\beta_1^k}$, $\hat{\mathbf{v}}_{\mathbf{W}_1,k} = \frac{\mathbf{v}_{\mathbf{W}_1,k}}{1-\beta_2^k}$

3. **Parameter update**:
   - $\mathbf{W}_1 \leftarrow \mathbf{W}_1 - \frac{\alpha}{\sqrt{\hat{\mathbf{v}}_{\mathbf{W}_1,k}} + \epsilon} \odot \hat{\mathbf{m}}_{\mathbf{W}_1,k}$
   - (Similarly for $\mathbf{W}_2$)

> __Sparsity Handling__
>
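> In each mini-batch, only embeddings for observed words (targets and their contexts/negative samples) receive non-zero gradients. Under the dense update written above, however, the moment estimates of unobserved words do not stay unchanged: a zero gradient still decays $\mathbf{m}$ by $\beta_1$ and $\mathbf{v}$ by $\beta_2$, and the parameter keeps moving along its residual first moment. Leaving unobserved embeddings untouched requires a sparse ("lazy") variant that updates moments and parameters only for the rows or columns with non-zero gradients. In either form, the per-parameter scaling by $\sqrt{\hat{\mathbf{v}}}$ contrasts with standard SGD, where all parameters share one learning rate regardless of update frequency.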
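
The sketch below contrasts a dense Adam update with a row-wise ("lazy") one on a small matrix standing in for $\mathbf{W}_2$. It is an illustration under stated assumptions, not a definitive implementation: the function name `adam_update_rows` is hypothetical, row indices are assumed to be unique, and a single global step counter is reused for bias correction even for rows that were skipped.

```python
import numpy as np

def adam_update_rows(W, m, v, rows, g_rows, k,
                     alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """In-place Adam update restricted to `rows` (unique indices)."""
    m[rows] = beta1 * m[rows] + (1 - beta1) * g_rows
    v[rows] = beta2 * v[rows] + (1 - beta2) * g_rows ** 2
    m_hat = m[rows] / (1 - beta1 ** k)
    v_hat = v[rows] / (1 - beta2 ** k)
    W[rows] -= alpha / (np.sqrt(v_hat) + eps) * m_hat

rng = np.random.default_rng(0)
all_rows = np.arange(5)
g1 = rng.normal(size=(5, 3))            # step 1: every row observed
g2 = np.zeros((5, 3))
g2[0] = rng.normal(size=3)              # step 2: only row 0 observed

# Dense: all rows are updated at every step, even with zero gradient.
Wd, md, vd = np.zeros((5, 3)), np.zeros((5, 3)), np.zeros((5, 3))
adam_update_rows(Wd, md, vd, all_rows, g1, k=1)
dense_row3 = Wd[3].copy()
adam_update_rows(Wd, md, vd, all_rows, g2, k=2)
dense_moved = not np.array_equal(Wd[3], dense_row3)

# Lazy: only rows with non-zero gradient are touched.
Wl, ml, vl = np.zeros((5, 3)), np.zeros((5, 3)), np.zeros((5, 3))
adam_update_rows(Wl, ml, vl, all_rows, g1, k=1)
lazy_row3 = Wl[3].copy()
adam_update_rows(Wl, ml, vl, np.array([0]), g2[[0]], k=2)
lazy_moved = not np.array_equal(Wl[3], lazy_row3)

print(dense_moved, lazy_moved)
```

In the dense run, row 3 keeps moving at step 2 because its first moment from step 1 has not yet decayed; in the lazy run it is left exactly as it was.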

### Convergence Criteria
Training typically continues for a fixed number of epochs (passes through the corpus) rather than until a convergence test is met, as noisy mini-batch gradient norms are a poor indicator of embedding quality. Common stopping conditions:

1. **Fixed epochs**: Train for $E$ epochs where $E \in \{5, 10, 20\}$ depending on corpus size
2. **Validation loss**: Monitor loss on held-out data and stop if loss increases for consecutive epochs (early stopping)
3. **Embedding stability**: Measure the mean cosine similarity between each word's embedding at epoch $t$ and at epoch $t-1$; stop once it exceeds a threshold (e.g., $0.99$)
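
The embedding-stability check could be sketched as follows. The function name and the synthetic embeddings are assumptions for illustration; rows are taken to be word embeddings, so the columns of $\mathbf{W}_1$ would be passed as `W1.T`.

```python
import numpy as np

def mean_cosine_similarity(E_prev, E_curr, eps=1e-12):
    """Mean over words of cos(E_prev[i], E_curr[i]); rows are word embeddings."""
    num = (E_prev * E_curr).sum(axis=1)
    den = np.linalg.norm(E_prev, axis=1) * np.linalg.norm(E_curr, axis=1) + eps
    return float((num / den).mean())

# Synthetic example: embeddings that barely change between two epochs.
rng = np.random.default_rng(0)
E_prev = rng.normal(size=(50, 16))
E_curr = E_prev + 0.01 * rng.normal(size=(50, 16))

sim_small_change = mean_cosine_similarity(E_prev, E_curr)
sim_flipped = mean_cosine_similarity(E_prev, -E_prev)
should_stop = sim_small_change > 0.99
print(sim_small_change, sim_flipped, should_stop)
```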

For Skip-Gram with negative sampling and Adam, a gradient-norm stopping rule $\|\mathbf{g}_k\|_2 \leq \tau$ (for some tolerance $\tau > 0$) is rarely used, as mini-batch gradient norms tend to stay large even near good solutions because of the noise introduced by sampling negatives.
___