# Word Embeddings

## Lexical semantics
Idea: Words rarely carry meaning on their own

_You shall know the meaning of a word by the company it keeps_

I.e. derive the meaning of a word from the contexts in which it is used.


## Semantic Representation
Given examples of word usage (text corpora), learn word representations that capture the semantic meaning of the words.

### Vector representations
Similarity measures such as the dot product (which depends on the angle between the vectors) should reflect how related the word meanings are.
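A minimal sketch of the idea, using cosine similarity (the dot product normalized by the vector lengths). The three embeddings are hypothetical, hand-picked values for illustration and not learned vectors.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine of the angle between u and v (1 = same direction, 0 = orthogonal)."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical 3-dimensional embeddings, chosen by hand for illustration only.
emb = {
    "cat": [1.0, 0.9, 0.0],
    "dog": [0.9, 1.0, 0.1],
    "car": [0.0, 0.1, 1.0],
}

sim_cat_dog = cosine(emb["cat"], emb["dog"])  # small angle, high similarity
sim_cat_car = cosine(emb["cat"], emb["car"])  # nearly orthogonal, low similarity
```

With these toy values, "cat" ends up much closer to "dog" than to "car", which is the behaviour a good embedding should show.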

### Context Model Likelihood
A probabilistic model for predicting the words in a context window around the current word $w^t$:
$$ L(\theta) = \sum_{t=1}^T \sum_{i\in C} \text{log} \ p_\theta(w^{t+i} | w^t)$$

Offsets in the context window: $C = \{-R, \dots, -1, 1, \dots, R\}$

Perform MLE, i.e. maximize $L(\theta)$ over the parameters $\theta$ of the model $p_\theta$. The optimal model is the one assigning the highest probability to the _observed contexts_, i.e. this is a language modelling task.
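The double sum over positions $t$ and offsets $i \in C$ can be sketched as a generator of (current word, context word) pairs. The function name `context_pairs` is hypothetical, and skipping offsets that fall outside the text is an assumption, since the formula does not say how boundaries are handled.

```python
def context_pairs(tokens, R):
    """Yield (w^t, w^{t+i}) for every position t and offset i in {-R..-1, 1..R}."""
    offsets = [i for i in range(-R, R + 1) if i != 0]
    for t, center in enumerate(tokens):
        for i in offsets:
            if 0 <= t + i < len(tokens):  # assumption: drop offsets outside the text
                yield center, tokens[t + i]

tokens = "the cat sat on the mat".split()
pairs = list(context_pairs(tokens, R=1))
```

Each pair contributes one term $\log p_\theta(w^{t+i} | w^t)$ to the likelihood.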



## Latent Vector Models

### Basic model
Embed each word $w$ of the vocabulary $V$ as a vector $x_w \in \mathbb{R}^d$ in a $d$-dimensional space.

Log-bilinear model (up to an additive normalization term that does not depend on $w$):
$$\log p_\theta (w | w') = \langle x_w, x_{w'} \rangle + \text{const}$$

Main effect of using the inner product:
* When the inner product is high, e.g. because the angle between the vectors is small, the likelihood $p_\theta (w | w')$ goes up

Normalizing the probability distribution with softmax:

$$p_\theta (w | w') = \frac{\exp(\langle x_w, x_{w'} \rangle)}{\sum_{v \in V} \exp(\langle x_v, x_{w'} \rangle)}$$
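A small sketch of the softmax normalization, assuming a toy vocabulary with hypothetical 2-dimensional embeddings. Subtracting the maximum score before exponentiating is a standard numerical-stability trick and does not change the result.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax_probs(x, w_prime):
    """Return p(w | w') for every w in the vocabulary via a softmax over inner products."""
    scores = {w: dot(x[w], x[w_prime]) for w in x}
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    z = sum(exps.values())  # normalization: sum over the whole vocabulary
    return {w: e / z for w, e in exps.items()}

# Hypothetical toy embeddings, for illustration only.
x = {"cat": [1.0, 0.5], "dog": [0.9, 0.6], "car": [-0.8, 0.2]}
p = softmax_probs(x, "cat")
```

The probabilities sum to one, and words whose vectors have a larger inner product with $x_{w'}$ receive more mass.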

### Skip-Gram Model
Use the log-likelihood (LLH) of the log-bilinear model:

$$ L(\theta) = \sum_{t=1}^T \sum_{i\in C} \text{log} \ p_\theta (w^{t+i}| w^t)$$

$$= \sum_{t=1}^T \sum_{i\in C} \Bigg[ \langle x_{w^{t+i}}, x_{w^{t}} \rangle - \log \sum_{v \in V} \exp(\langle x_v, x_{w^t} \rangle) \Bigg]$$

Model parameters: word embeddings $x_w \in \mathbb{R}^d$ for every $w \in V$, i.e. an embedding matrix $X \in \mathbb{R}^{d \times |V|}$
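As a sanity check of the formula, the log-likelihood can be evaluated directly on a toy corpus. This is a sketch under assumptions: the function names are hypothetical, embeddings are stored in a dict rather than a matrix, and offsets outside the text are skipped.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def log_prob(x, w, w_prime):
    """log p(w | w') = <x_w, x_w'> - log sum_v exp(<x_v, x_w'>)."""
    scores = [dot(x[v], x[w_prime]) for v in x]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))  # stable log-sum-exp
    return dot(x[w], x[w_prime]) - log_z

def skipgram_log_likelihood(tokens, x, R):
    """Sum log p(w^{t+i} | w^t) over all positions t and offsets i != 0 with |i| <= R."""
    total = 0.0
    for t, center in enumerate(tokens):
        for i in range(-R, R + 1):
            if i != 0 and 0 <= t + i < len(tokens):
                total += log_prob(x, tokens[t + i], center)
    return total

# With all-zero embeddings every word is equally likely: p = 1 / |V|.
x = {"a": [0.0, 0.0], "b": [0.0, 0.0], "c": [0.0, 0.0]}
ll = skipgram_log_likelihood(["a", "b", "c"], x, R=1)
```

The log-sum-exp term loops over the whole vocabulary for every pair, which is what makes the exact objective expensive for a large $V$.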

#### Negative sampling
To train the model as a logistic regression (observed context pair vs. random pair), we need negative examples. The corpus only contains positive (observed) context words, so negative examples have to be generated.

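We sample random context words $w_j \sim P(w_j)^{3/4}$ (up to normalization), where $P$ is the unigram word frequency.
* The exponent $3/4$ dampens frequent words, so rare words are sampled more often than their raw frequency would suggest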
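A sketch of the smoothed sampling distribution, assuming $P$ is the unigram frequency estimated from token counts. The function name `negative_sampling_dist` and the toy corpus are hypothetical.

```python
import random
from collections import Counter

def negative_sampling_dist(tokens, power=0.75):
    """Unigram counts raised to `power` and renormalized to sum to one."""
    counts = Counter(tokens)
    weights = {w: c ** power for w, c in counts.items()}
    z = sum(weights.values())
    return {w: wt / z for w, wt in weights.items()}

# Toy corpus: "the" occurs 16 times, "cat" once.
tokens = ["the"] * 16 + ["cat"]
dist = negative_sampling_dist(tokens)

# Draw 5 negative context words from the smoothed distribution.
rng = random.Random(0)
words = list(dist)
negatives = rng.choices(words, weights=[dist[w] for w in words], k=5)
```

In this toy corpus the raw frequency of "cat" is $1/17$, while the smoothed distribution gives it $1/9$ (since $16^{3/4} = 8$), so the rare word is drawn more often.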

### GloVe
#### Co-occurrence matrix
Let the co-occurrence matrix be defined as
$$N \in \mathbb{N}^{|V|\times |V|}$$ 


Where $n_{ij}$ is the number of times word $w_i$ occurred in the context of $w_j$. The co-occurrence matrix is very sparse, since most pairs of words never appear together.
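A minimal sketch of building the counts, assuming a symmetric window of radius `R` as the notion of "context" (the notes do not fix one here). Only nonzero entries are stored, which reflects the sparsity of $N$; the function name is hypothetical.

```python
from collections import Counter

def cooccurrence_counts(tokens, R):
    """Sparse counts: n[(w_i, w_j)] = number of times w_i occurs within R positions of w_j."""
    n = Counter()
    for t, w_j in enumerate(tokens):
        for i in range(-R, R + 1):
            if i != 0 and 0 <= t + i < len(tokens):
                n[(tokens[t + i], w_j)] += 1
    return n

tokens = "the cat sat on the mat".split()
n = cooccurrence_counts(tokens, R=1)
num_nonzero = len(n)                  # stored (nonzero) entries
vocab_size = len(set(tokens))         # |V|
```

Even in this tiny example only 10 of the $5 \times 5 = 25$ possible entries are nonzero.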


#### Objective
The GloVe objective uses a weighting function $f$ that caps the weight of frequent co-occurrences at $1$ (for $n_{ij} \geq n_{max}$) and down-weights rare co-occurrences, whose weight becomes very small:

$$f(n_{ij}) = \min \Bigg\{ 1, \bigg(\frac{n_{ij}}{n_{max}}\bigg)^\alpha \Bigg\}, \quad \alpha = 3/4$$

$$H(\theta; N) = \sum_{(i,j)} f(n_{ij}) \Bigg( \log n_{ij} - \log \tilde{p}_\theta (w_i | w_j) \Bigg)^2$$

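Where $\tilde{p}_\theta$ is an unnormalized distribution, $\tilde{p}_\theta (w_i | w_j) = \exp \langle x_i, y_j \rangle$, so that $\log \tilde{p}_\theta (w_i | w_j) = \langle x_i, y_j \rangle$
* No computation of the normalization constant is needed (expensive because it is a sum over the whole vocabulary)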
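A sketch of the weighting function and the resulting loss, assuming the weighted least-squares form (consistent with the factor $2$ in the SGD updates) and $\log \tilde{p}_\theta(w_i | w_j) = \langle x_i, y_j \rangle$. The default `n_max=100.0` is an assumed value, and the function names are hypothetical.

```python
import math

def glove_weight(n_ij, n_max=100.0, alpha=0.75):
    """f(n_ij) = min(1, (n_ij / n_max)^alpha); n_max=100 is an assumed default."""
    return min(1.0, (n_ij / n_max) ** alpha)

def glove_loss(counts, x, y, n_max=100.0, alpha=0.75):
    """Sum over observed pairs of f(n_ij) * (log n_ij - <x_i, y_j>)^2."""
    loss = 0.0
    for (i, j), n_ij in counts.items():
        pred = sum(a * b for a, b in zip(x[i], y[j]))  # <x_i, y_j>
        loss += glove_weight(n_ij, n_max, alpha) * (math.log(n_ij) - pred) ** 2
    return loss

# Toy example: a single observed pair with zero-initialized vectors.
counts = {("cat", "the"): 100}
x = {"cat": [0.0, 0.0]}
y = {"the": [0.0, 0.0]}
loss = glove_loss(counts, x, y)
```

Only pairs with $n_{ij} > 0$ are stored in `counts`, so the sum never touches $\log 0$ and no normalization over the vocabulary is computed.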

### GloVe as Matrix factorization
Let $M = \text{log} \ N$, i.e. $m_{ij} = \text{log} \ n_{ij}$

With $f(n_{ij})$ as a binary mask ($1$ if $n_{ij} > 0$, otherwise $0$), GloVe can be solved as the following matrix factorization problem, where the Frobenius norm is restricted to the observed entries:

$$\min_{X,Y} \|M - X^T Y\|_F^2$$

Concretely:

$$\min_{X,Y} \sum_{i, j: n_{ij} > 0} (m_{ij} - (X^T Y)_{ij})^2$$

#### Optimization Problem
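There is no closed-form global minimum (the problem is non-convex in $X$ and $Y$ jointly), so use SGD instead: sample a pair $(i,j)$ with $n_{ij} > 0$ and update $x_i$ and $y_j$ in an alternating fashion with learning rate $\eta$: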

$$x_i \leftarrow x_i + 2 \eta \cdot f(n_{ij}) (\log n_{ij} - \langle x_i, y_j\rangle) y_j$$

$$y_j \leftarrow y_j + 2 \eta \cdot f(n_{ij}) (\log n_{ij} - \langle x_i, y_j\rangle) x_i$$
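The two update rules can be sketched for a single observed pair as follows. Assumptions: both updates are computed from the old values of $x_i$ and $y_j$, `n_max=100.0` and `eta=0.05` are illustrative choices, and the function names are hypothetical.

```python
import math

def glove_weight(n_ij, n_max=100.0, alpha=0.75):
    """f(n_ij) = min(1, (n_ij / n_max)^alpha); n_max=100 is an assumed default."""
    return min(1.0, (n_ij / n_max) ** alpha)

def residual(x_i, y_j, n_ij):
    """log n_ij - <x_i, y_j>, the error term shared by both updates."""
    return math.log(n_ij) - sum(a * b for a, b in zip(x_i, y_j))

def sgd_step(x_i, y_j, n_ij, eta=0.05):
    """One SGD step for pair (i, j); returns the updated (x_i, y_j)."""
    g = 2.0 * eta * glove_weight(n_ij) * residual(x_i, y_j, n_ij)
    new_x = [a + g * b for a, b in zip(x_i, y_j)]  # x_i <- x_i + g * y_j
    new_y = [b + g * a for a, b in zip(x_i, y_j)]  # y_j <- y_j + g * x_i (old x_i)
    return new_x, new_y

# Repeatedly fit one pair: <x_i, y_j> should approach log(n_ij).
x_i, y_j, n_ij = [1.0, 0.0], [1.0, 0.0], 100
before = abs(residual(x_i, y_j, n_ij))
for _ in range(50):
    x_i, y_j = sgd_step(x_i, y_j, n_ij)
after = abs(residual(x_i, y_j, n_ij))
```

In a full training loop the same step would be applied to randomly sampled pairs $(i,j)$ with $n_{ij} > 0$; a learning rate that is too large can make the updates diverge.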

### Discussion
* Relatedness and similarity captured well
* Antonyms are not well captured: words with opposite meanings usually end up close to each other, because they are often used in the same contexts
    * Ex. "cheap" vs. "expensive"
    
Sentence/document-level embeddings
* Aggregating word vectors (e.g. averaging them) loses the word order
* Instead, use convolutional/recurrent neural networks
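The loss of word order under aggregation can be seen in a small sketch that averages word vectors. The embeddings are hypothetical toy values, and averaging is only one possible aggregation.

```python
def mean_embedding(tokens, emb):
    """Average the word vectors of a sentence (a bag-of-words representation)."""
    d = len(next(iter(emb.values())))
    total = [0.0] * d
    for w in tokens:
        for k, v in enumerate(emb[w]):
            total[k] += v
    return [s / len(tokens) for s in total]

# Hypothetical toy embeddings, for illustration only.
emb = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [0.5, 0.5]}

a = mean_embedding("dog bites man".split(), emb)
b = mean_embedding("man bites dog".split(), emb)
same = all(abs(u - v) < 1e-12 for u, v in zip(a, b))
```

Both sentences get the same vector even though they mean different things, which is why order-aware models are used instead.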