# GloVe: Global Vectors for Word Representation

Semantic vector space models play a fundamental role in many natural language applications: a distributional representation embeds each word as a real-valued vector that captures underlying semantic and syntactic regularities.

## Matrix Factorization Methods
One of the most common matrix factorization methods is Latent Semantic Analysis (LSA), also called Latent Semantic Indexing in the information retrieval context. LSA builds a “term-document” matrix, a structure well suited to query search. It decomposes the matrix with singular value decomposition (SVD) and keeps only the largest singular values. Keeping only these major components compresses the matrix, and the resulting low-rank reconstruction captures statistical information about the corpus.
[image]
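
The truncation step can be sketched with NumPy. This is a minimal illustration, not a full LSA pipeline: the term-document counts are made up, and the helper name `truncated_svd` and the choice `k=2` are assumptions for the example.

```python
import numpy as np

def truncated_svd(term_doc, k):
    """Return the factors of the rank-k approximation of a term-document matrix."""
    # Thin SVD: term_doc = U @ diag(s) @ Vt, singular values sorted in descending order.
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # Keep only the k largest singular values and their singular vectors.
    return U[:, :k], s[:k], Vt[:k, :]

# Toy counts (rows: terms, columns: documents); values are made up for illustration.
term_doc = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 2.0],
])

U_k, s_k, Vt_k = truncated_svd(term_doc, k=2)
approx = U_k @ np.diag(s_k) @ Vt_k   # rank-2 reconstruction of the original matrix
term_vectors = U_k * s_k             # one 2-dimensional vector per term
```

Each row of `term_vectors` can then serve as a dense representation of the corresponding term.
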
The Hyperspace Analogue to Language (HAL), on the other hand, uses a “term-term” co-occurrence matrix. The more often two words appear next to each other, the stronger their association is taken to be. The shortcoming of naïve HAL is clear: co-occurrence with frequent words such as ‘the’ and ‘and’ conveys very little information about similarity. Several improved techniques address this issue.

## Shallow Window-Based Methods
The other main approach to word embedding is based on neural networks, where word vectors are learned from local context windows. The idea is not new, but only recently has it been implemented successfully for word representation. The skip-gram and continuous bag-of-words (CBOW) models learn on a single-layer architecture based on the inner product between two word vectors.
[image]
CBOW – skip-gram

## The GloVe Model

We want to construct a word-word co-occurrence matrix $X$:
* $X_{ij}$ tabulates the number of times word $j$ occurs in the context of word $i$.

For example, consider a corpus:
> I love NLP and NLP is fun

This corpus has just one sentence with 7 tokens (6 distinct words). With a window size of 5 (the center word plus up to 2 words on each side), we get the following windows:

Window No. | Center Word | Content
:--------: | :---------: | -------
0 | I | I love NLP
1 | love | I love NLP and
2 | NLP | I love NLP and NLP
3 | and | love NLP and NLP is
4 | NLP | NLP and NLP is fun
5 | is | and NLP is fun
6 | fun | NLP is fun

Take window No. 1 for instance: the center word is *love* and the context words are *I*, *NLP*, and *and*. Fewer than 2 words lie to the left of the center word, so there are only 3 context words in this case. We then increment $X_{love,I}$, $X_{love,NLP}$, and $X_{love,and}$ by one each.
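
The counting procedure for this toy corpus can be sketched as follows. The function name `cooccurrence_counts` and the parameter `half_window` are illustrative choices, and the sketch uses plain unit increments exactly as described above.

```python
from collections import defaultdict

def cooccurrence_counts(tokens, half_window=2):
    """Count how often each context word falls within half_window positions of each center word."""
    counts = defaultdict(int)
    for pos, center in enumerate(tokens):
        # Clip the window at the sentence boundaries.
        lo = max(0, pos - half_window)
        hi = min(len(tokens), pos + half_window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:  # the center word is not its own context
                counts[(center, tokens[ctx_pos])] += 1
    return counts

tokens = "I love NLP and NLP is fun".split()
X = cooccurrence_counts(tokens)  # X[(i, j)]: times word j occurs in the context of word i
```

Here `X[("love", "I")]`, `X[("love", "NLP")]`, and `X[("love", "and")]` each come out as 1, matching window No. 1 in the table.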

Then let's define some more notation:
* $X_i = \sum_kX_{ik}$, the number of times any word appears in the context of word $i$. In the example above, $X_{love}$ is 3.
* $P_{ij} = P(j|i) = \frac{X_{ij}}{X_i}$, the probability that word $j$ appears in the context of word $i$.
* $ratio_{ijk} = P(k|i)/P(k|j) = \frac{P_{ik}}{P_{jk}}$, the ratio of the probabilities that word $k$ appears in the context of word $i$ versus word $j$.
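
These quantities can be computed directly for the toy corpus. The sketch below rebuilds the count matrix with NumPy so that it is self-contained; the helper `ratio` and the `index` lookup are names introduced only for this example.

```python
import numpy as np

tokens = "I love NLP and NLP is fun".split()
vocab = sorted(set(tokens))
index = {word: n for n, word in enumerate(vocab)}

# Co-occurrence counts with up to 2 context words on each side of the center word.
X = np.zeros((len(vocab), len(vocab)))
for pos, center in enumerate(tokens):
    for ctx_pos in range(max(0, pos - 2), min(len(tokens), pos + 3)):
        if ctx_pos != pos:
            X[index[center], index[tokens[ctx_pos]]] += 1

X_i = X.sum(axis=1)       # X_i = sum_k X_ik
P = X / X_i[:, None]      # P[i, j] = P(j | i) = X_ij / X_i

def ratio(i, j, k):
    """ratio_ijk = P(k | i) / P(k | j); undefined when P(k | j) is zero."""
    return P[index[i], index[k]] / P[index[j], index[k]]

r = ratio("love", "NLP", "and")  # (1/3) / (2/8) = 4/3
```

On a corpus this small most entries of `P` are zero, so many ratios are undefined; the ratio only becomes informative with counts from a large corpus.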

In the paper, the authors craft an example to illustrate how co-occurrence can be useful. Suppose we are interested in the concept of thermodynamic phase, so let the target words be $i = ice$ and $j = steam$. With selected context words from a 6 billion token corpus, the co-occurrence probabilities are as follows:
<img src="https://nlp.stanford.edu/projects/glove/images/table.png">

An interesting observation is that context words like *water* and *fashion* are related either to both target words or to neither of them. For these non-discriminative words the two probabilities cancel, giving a ratio close to 1. However, for a word like *solid*, which correlates well with ice, the ratio becomes much greater than 1, while for a word like *gas*, which correlates with steam, the ratio becomes very small. More generally, this regularity can be summarized in a table like this:

Value of $ratio_{ijk}$ | $j, k$ related | $j, k$ not related
:--------: | :---------: | :-----:
**$i, k$ related** | $\approx1$ | $\gg1$
**$i, k$ not related** | $\ll1$ | $\approx1$

This observation gives us a good starting point for constructing our word vector model. We would like a function that encodes the ratio information in the vector space,
\begin{equation}
F(w_i,w_j,\tilde{w}_k) = \frac{P_{ik}}{P_{jk}}
\end{equation}

Here $\tilde{w}$ denotes a separate context word vector. However, this model has an issue: three word vectors are involved, and an unconstrained $F$ over all of them would be too complex and too expensive to learn. With that in mind, and considering the properties we want from the model, we can make the following modifications:
1. Since vector spaces are inherently linear and we care about similarity in that space, the most natural choice is to take the difference of $w_i$ and $w_j$, which preserves linearity. The function then takes the form $F(w_i-w_j, \tilde{w}_k)$.
2. Notice that the arguments of the function are vectors, while the right-hand side is a scalar. It is reasonable to take the dot product of the arguments to avoid a complicated parameterization. The function becomes $F((w_i-w_j)^T\tilde{w}_k)$.
3. Note that for word-word co-occurrence matrices, the distinction between a word and a context word is arbitrary, and the two roles are exchangeable. To restore this symmetry, we first require $F$ to be a homomorphism between the groups $(\mathbb{R}, +)$ and $(\mathbb{R}_{\gt{0}},\times)$, i.e., we want $F((w_i-w_j)^T\tilde{w}_k)$ to equal $F(w_i^T\tilde{w}_k)/F(w_j^T\tilde{w}_k)$. This is solved by $F(w_i^T\tilde{w}_k) = P_{ik}$ with $F = \exp$, or $w_i^T\tilde{w}_k = \log(P_{ik}) = \log(X_{ik}) - \log(X_i)$.
4. This equation is still not symmetric: the left-hand side keeps its form when the word and context roles are swapped, but the right-hand side does not, because of the $\log(X_i)$ term. Since that term does not depend on $k$, it can be absorbed into a bias $b_i$, and adding a second bias $\tilde{b}_k$ restores the symmetry. Finally, we have,
\begin{equation}
w_i^T\tilde{w}_k + b_i + \tilde{b}_k = \log(X_{ik})
\end{equation}
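
Written out as a chain, the steps from the homomorphism requirement to this final form look as follows. This is a sketch of the argument as presented in the GloVe paper; the intermediate line with $\log(X_i)$ moved to the left-hand side is added here only to make the bias substitution explicit.

\begin{equation}
\begin{aligned}
F\left((w_i-w_j)^T\tilde{w}_k\right) &= \frac{F(w_i^T\tilde{w}_k)}{F(w_j^T\tilde{w}_k)} = \frac{P_{ik}}{P_{jk}} \\
F = \exp \quad\Rightarrow\quad w_i^T\tilde{w}_k &= \log(P_{ik}) = \log(X_{ik}) - \log(X_i) \\
w_i^T\tilde{w}_k + \log(X_i) &= \log(X_{ik}) \\
w_i^T\tilde{w}_k + b_i + \tilde{b}_k &= \log(X_{ik})
\end{aligned}
\end{equation}

In the last step $\log(X_i)$ is replaced by a learned bias $b_i$, and $\tilde{b}_k$ is added so that the equation keeps its form when $i$ and $k$ exchange roles.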

This equation is much simpler than the original proposal. One drawback remains: it weights every co-occurrence equally, even rare ones. We therefore add a weighting term to the cost function to address this issue, which gives the cost function for the model:
\begin{equation}
J = \sum_{i,j=1}^V{f(X_{ij})\left(w_i^T\tilde{w}_j + b_i + \tilde{b}_j - \log(X_{ij})\right)^2}
\end{equation}

From experiments, the authors found a decent choice to be $f(x) = (x/x_{\max})^{0.75}$ if $x \lt x_{\max}$ and $f(x) = 1$ otherwise, where $x_{\max}$ is around 100.
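
The weighting function and the cost can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not the reference implementation: the function names `weight` and `glove_cost`, the toy count matrix, and the vector dimension are all made up for the example, and no training loop is shown.

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): (x / x_max)^alpha below x_max, 1 otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_cost(X, W, W_tilde, b, b_tilde):
    """Weighted least-squares cost J over the nonzero entries of X."""
    # Skip zero counts: log(X_ij) is undefined there and f(0) = 0.
    i, j = np.nonzero(X)
    pred = np.sum(W[i] * W_tilde[j], axis=1) + b[i] + b_tilde[j]
    err = pred - np.log(X[i, j])
    return float(np.sum(weight(X[i, j]) * err ** 2))

# Toy setup: 3 words, 4-dimensional vectors, made-up counts.
rng = np.random.default_rng(0)
X = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 150.0],
              [1.0, 150.0, 0.0]])
V, d = X.shape[0], 4
W, W_tilde = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_tilde = np.zeros(V), np.zeros(V)
cost = glove_cost(X, W, W_tilde, b, b_tilde)
```

Because $f(x)$ is capped at 1, the pair with count 150 contributes no more weight than a pair with count 100, while the rare pairs with counts 1 and 2 are down-weighted.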