# Word Representations

Natural language text can be seen as a sequence of words, so one of the most fundamental tasks in NLP is to represent words. Ideally, we want a representation in which similar words are "closer" to one another in some space than dissimilar words.

One naive representation is to let each word in the vocabulary sit in its own dimension. This is referred to as **one-hot encoding**:

$$
\mathbf{w}_{a} = \begin{bmatrix}1\\0\\\vdots\\0\end{bmatrix},
\mathbf{w}_{at} = \begin{bmatrix}0\\1\\\vdots\\0\end{bmatrix}, 
\cdots,
\mathbf{w}_{zebra} = \begin{bmatrix}0\\0\\\vdots\\1\end{bmatrix} 
$$

The issue with this representation is that the vectors are sparse and as long as the vocabulary. Besides the storage and computation costs of large sparse vectors, the representation is semantically weak: no word vector relates to any other word vector, so it gives no notion of relationship or similarity between words. Every word is orthogonal to every other word, so the dot product between the vectors of two different words is always zero.
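
The orthogonality problem can be checked directly. The following is a minimal sketch in plain Python; the five-word vocabulary and the helper names `one_hot` and `dot` are made up for illustration.

```python
def one_hot(index, size):
    """Return a one-hot vector of length `size` with a 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy vocabulary (hypothetical); real vocabularies are far larger.
vocab = ["a", "at", "cat", "kitten", "zebra"]
word_to_index = {word: i for i, word in enumerate(vocab)}

cat = one_hot(word_to_index["cat"], len(vocab))
kitten = one_hot(word_to_index["kitten"], len(vocab))
zebra = one_hot(word_to_index["zebra"], len(vocab))

print(dot(cat, cat))     # 1: a word only matches itself
print(dot(cat, kitten))  # 0: a related word looks unrelated
print(dot(cat, zebra))   # 0: same score as for an unrelated word
```

"cat" is exactly as far from "kitten" as it is from "zebra", which is the weakness described above.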

A better representation would express semantic similarity so that, e.g., "cat" and "kitten" are close to each other. The theory of distributional semantics provides an idea: represent the meaning of a word by looking at its neighbouring words. This is based on the observation that words with similar meanings appear in similar contexts.

There are three main approaches:
- count-based methods
  - Define a basis vocabulary $V$ of context words (exclude common function words like "the", "a", "some" and run stemming)
  - Define a word window size $m$. This is used to define the context or neighbouring words of the word $\mathbf{w}$. 
  - Count the basis vocabulary words occurring within $m$ words to the left or right of each instance of a target word in the corpus
  - Form a vector representation of the target word based on these counts
- direct prediction
  - word2vec
  - GloVe
- task-based methods
  - Train the word vectors at the same time as the parameters of the network that uses them
  - The idea is to learn a classifier on top of features while also learning the features themselves, as a convolutional neural network does when it learns features and classifier jointly
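
The count-based steps listed above can be sketched as follows. This is a minimal illustration that assumes a whitespace-tokenised toy corpus and skips stemming; the function names, the corpus, and the basis vocabulary are hypothetical.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(tokens, basis_vocab, m):
    """For each target word, count basis words within m positions of it."""
    counts = defaultdict(Counter)
    for i, target in enumerate(tokens):
        lo = max(0, i - m)
        hi = min(len(tokens), i + m + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] in basis_vocab:
                counts[target][tokens[j]] += 1
    return counts

def count_vector(counts, target, basis_order):
    """Lay the counts out as a vector, one dimension per basis word."""
    return [counts[target][b] for b in basis_order]

# Toy corpus (hypothetical). "i" is left out of the basis vocabulary,
# standing in for the excluded function words.
tokens = "i like deep learning i like nlp i enjoy flying".split()
basis_order = ["like", "enjoy", "deep", "learning", "nlp", "flying"]

counts = cooccurrence_counts(tokens, set(basis_order), m=1)
print(count_vector(counts, "i", basis_order))     # [2, 1, 0, 1, 1, 0]
print(count_vector(counts, "deep", basis_order))  # [1, 0, 0, 1, 0, 0]
```

Each target word ends up with a vector whose length is the size of the basis vocabulary, so words that share neighbours get similar vectors.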

## word2vec

Word2vec is a software package that consists of:
- two algorithms: continuous bag-of-words (CBOW) and Skip Gram
- two training methods: negative sampling and hierarchical softmax

Word2vec attempts to create word vectors such that words with similar meanings or syntactic/grammatical functions cluster together in a Euclidean space.

### Skip Gram model

The Skip Gram model predicts the surrounding context words given a center word.

Main steps:
- Using a window of size $m$ e.g. $m=1$, go through each word of the whole corpus
- For each window, compute the probability of the context words given the center word. Essentially, we build a classifier for each window:
$$
p(w_{outside} | w_{center}) = \frac{ \exp( \mathbf{u}_o^T \mathbf{v}_c ) }{ \sum_{w=1}^{V} \exp( \mathbf{u}_{w}^T \mathbf{v}_c ) }
$$
where $V$ is the size of the vocabulary, $\mathbf{u}$ corresponds to the word embedding of the context/outside word and $\mathbf{v}$ corresponds to the word embedding of the center word. The sum in the denominator involves the entire vocabulary.
- Use stochastic gradient descent to train the model. We formulate an objective (loss) function for each window whose minimisation maximises the probability above, and take the derivative of the objective w.r.t. the vectors. Initially, the vectors $\mathbf{u}$ and $\mathbf{v}$ are random (sampled uniformly between two small numbers). Rather than solving for a zero derivative in closed form, we iteratively take small steps along the negative derivative, which decreases the loss and increases the probability.
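
The softmax probability and a single gradient step can be sketched as below. This is a simplified illustration, not the word2vec implementation: the sizes, learning rate, and function names are assumptions, and only the center vector $\mathbf{v}_c$ is updated while the outside vectors are held fixed (a full training loop would update both).

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax_probabilities(U, v_c):
    """Return p(w | c) for every word w: a softmax over the scores u_w . v_c."""
    scores = [dot(u, v_c) for u in U]
    top = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def loss(U, v_c, o):
    """Negative log-probability of the true outside word o."""
    return -math.log(softmax_probabilities(U, v_c)[o])

def sgd_step(U, v_c, o, lr):
    """One gradient step on the loss with respect to v_c only (U held fixed)."""
    probs = softmax_probabilities(U, v_c)
    # d loss / d v_c = -u_o + sum_w p(w | c) * u_w
    grad = [
        -U[o][k] + sum(p * u[k] for p, u in zip(probs, U))
        for k in range(len(v_c))
    ]
    return [v - lr * g for v, g in zip(v_c, grad)]

random.seed(0)
vocab_size, d = 5, 3  # toy sizes (hypothetical)
# Small uniform random initialisation, as described in the steps above.
U = [[random.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(vocab_size)]
v_c = [random.uniform(-0.1, 0.1) for _ in range(d)]

o = 1  # index of the observed outside word (hypothetical)
before = loss(U, v_c, o)
v_c_new = sgd_step(U, v_c, o, lr=0.1)
after = loss(U, v_c_new, o)
print(before, after)  # the loss should drop slightly after the step
```

Note that every call to `softmax_probabilities` touches all rows of `U`, which is the cost that negative sampling is designed to avoid.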

The issue with the Skip Gram model is that each window contains only $2m+1$ words, so the derivative of the objective function for a window is very sparse:

\begin{align}
\nabla \mathcal{J} = \begin{bmatrix}
0 \\
\vdots \\
\nabla_{v_{\text{like}}} \\
\vdots \\
0 \\
\vdots \\
\nabla_{v_{\text{nlp}}} \\
\vdots \\
0
\end{bmatrix} \in \mathbb{R}^{2dV}
\end{align}
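where $d$ is the number of dimensions in the embedding space (e.g. $d=100$) and $V$ is the size of the vocabulary. The factor of 2 appears because every word has two vectors: $\mathbf{v}$ as a center word and $\mathbf{u}$ as an outside word.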

#### Objective Function

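The trick with negative sampling is to train only a small number of binary logistic regressions per window: one for the true pair (the center word and a word in its context window) and a few for random pairs (the center word and a randomly sampled word).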

The overall objective function:

$$
\mathcal{J}(\theta) = \frac{1}{T} \sum_{t=1}^{T} \mathcal{J}_t(\theta)
$$

where $T$ is the total number of windows in the corpus.

The objective for a single window $t$ is:
$$
\mathcal{J}_t(\theta) = -\log \sigma \left( \mathbf{u}_o^T \mathbf{v}_c \right)
- \sum_{i=1}^{k} \mathbb{E}_{j \sim P(w)} \left[  \log \sigma \left( - \mathbf{u}_j^T \mathbf{v}_c \right)  \right]
$$

The first term involves the logarithm of the sigmoid function:
$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

Note:
$$
\sigma(-x) = 1- \sigma(x)
$$

The sigmoid function squashes any real number into the range $(0, 1)$, so its output can be read as a probability. In other words, minimising the first term maximises the probability that the outside word and the center word co-occur.

<img src="figures/log-of-sigmoid.png" width="600" />




The second term says that we randomly select $k$ negative samples (words) from the corpus and, for each of these words, minimise the probability of the negative word and the center word occurring together.
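
The per-window objective can be sketched as follows, assuming the $k$ negative samples have already been drawn and each expectation is approximated by a single sample. The vectors and function names are hypothetical and chosen only so the two cases are easy to compare.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def negative_sampling_loss(u_o, v_c, negative_vectors):
    """J_t = -log sigma(u_o . v_c) - sum_j log sigma(-u_j . v_c)."""
    positive_term = -math.log(sigmoid(dot(u_o, v_c)))
    negative_term = -sum(
        math.log(sigmoid(-dot(u_j, v_c))) for u_j in negative_vectors
    )
    return positive_term + negative_term

v_c = [1.0, 0.0]

# True pair aligned with v_c, negative samples pointing away from it.
good = negative_sampling_loss([2.0, 0.0], v_c, [[-1.0, 0.5], [-2.0, 1.0]])

# True pair pointing away from v_c, negative samples aligned with it.
bad = negative_sampling_loss([-2.0, 0.0], v_c, [[1.0, 0.5], [2.0, 1.0]])

print(good, bad)  # the first loss is the smaller one
```

Only $k + 1$ outside vectors enter the computation, rather than the whole vocabulary.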

Sampling is based on the unigram distribution $U(w)$ raised to the power $3/4$:

$$
P(w) = \frac{U(w)^{3/4}}{Z} 
$$
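where $Z$ is a normalisation constant. The $3/4$ power flattens the distribution, so less frequent words are sampled more often than their raw frequency would suggest.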
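
The effect of the power can be seen on a toy corpus. This is a minimal sketch; the corpus and the function names are made up for illustration.

```python
import random
from collections import Counter

def unigram_distribution(tokens):
    """U(w): relative frequency of each word."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def negative_sampling_distribution(unigram, power=0.75):
    """P(w) = U(w)^power / Z, with Z chosen so the probabilities sum to 1."""
    weights = {word: p ** power for word, p in unigram.items()}
    z = sum(weights.values())
    return {word: w / z for word, w in weights.items()}

# Toy corpus (hypothetical): one very frequent word and one rare word.
tokens = ["the"] * 90 + ["cat"] * 9 + ["zebra"] * 1
unigram = unigram_distribution(tokens)
smoothed = negative_sampling_distribution(unigram)

for word in unigram:
    print(word, round(unigram[word], 3), round(smoothed[word], 3))

# Draw k = 5 negative samples from the smoothed distribution.
random.seed(0)
words = list(smoothed)
negatives = random.choices(words, weights=[smoothed[w] for w in words], k=5)
print(negatives)
```

The probability of "zebra" goes up and that of "the" goes down relative to the raw unigram frequencies.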

### Continuous Bag of Words (CBOW)

Main idea: predict the center word from the sum of the surrounding word vectors.
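
A rough sketch of this idea, assuming a softmax over dot-product scores as in the Skip Gram model: the context vectors are summed and the result is scored against one output vector per candidate center word. The vectors and the function name are hypothetical.

```python
import math

def cbow_probabilities(context_vectors, output_vectors):
    """Softmax over scores between each candidate center word and the summed context."""
    d = len(context_vectors[0])
    summed = [sum(vec[k] for vec in context_vectors) for k in range(d)]
    scores = [sum(u_k * h_k for u_k, h_k in zip(u, summed)) for u in output_vectors]
    top = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Two context word vectors and three candidate center words (toy values).
context_vectors = [[0.5, 0.1], [0.7, -0.1]]
output_vectors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

probs = cbow_probabilities(context_vectors, output_vectors)
predicted = probs.index(max(probs))
print(predicted)  # 0: the candidate most aligned with the summed context
```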

## GloVe: Global Vectors for Word Representation

GloVe is a method that attempts to combine the best of the count-based methods and the direct prediction methods.

The count-based methods work by building a co-occurrence matrix that records how often words co-occur, then applying SVD to the co-occurrence matrix to obtain low-dimensional word vectors.
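
A minimal sketch of the SVD step using NumPy. The 4×4 count matrix (target words as rows, context words as columns) is invented for illustration, and scaling the left singular vectors by the singular values is one common choice rather than the only one.

```python
import numpy as np

# Rows are target words, columns are context words (hypothetical counts).
targets = ["cat", "kitten", "car", "truck"]
contexts = ["purr", "pet", "drive", "road"]
X = np.array([
    [5, 4, 0, 1],
    [4, 5, 0, 0],
    [0, 1, 6, 4],
    [0, 0, 4, 5],
], dtype=float)

# Full SVD, then keep only the top-k singular directions.
U, S, Vt = np.linalg.svd(X)
k = 2
embeddings = U[:, :k] * S[:k]  # one k-dimensional vector per target word

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cat, kitten, car, truck = embeddings
print(cosine(cat, kitten))  # high: the two words share contexts
print(cosine(cat, car))     # much lower
```

Words with similar rows in the count matrix end up close together in the reduced space.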

Advantages:
- Relatively fast to train for smaller matrices
- We only need the co-occurrence counts (and not the entire corpus)

Disadvantages:
- The method captures mostly word similarity (the word2vec models capture more complex patterns)
- Disproportionate importance is given to large counts

The advantages and disadvantages of the direct prediction methods like the Skip Gram model:

Advantages:
- Much better performance on downstream NLP tasks like entity recognition or part of speech tagging
- Can capture complex patterns beyond word similarity

Disadvantages:
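- Training scales with corpus size, since it must go through every single window
- Inefficient use of the corpus statistics, since co-occurrences are processed one window at a time rather than aggregated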


### Objective Function

$$
\mathcal{J}(\theta) = \frac{1}{2} \sum_{i,j=1}^{W} f(P_{ij})(\mathbf{u}_i^T \mathbf{v}_j - \log P_{ij}  )^2
$$
where
- $W$ is the size of the vocabulary
- $P$ is the co-occurrence matrix
- $P_{ij}$ is the number of times that the words $i$ and $j$ occur together
- $f$ is a weighting function that caps the importance of very frequent co-occurrences, e.g. the word "the" with other words

Essentially, we want the value of $\mathbf{u}_i^T\mathbf{v}_j$ to be close to the logarithm of the overall count, $\log P_{ij}$.
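
The objective can be sketched as below. The notes do not define $f$, so the weighting function here is an assumption taken from the GloVe paper, $f(x) = (x / x_{max})^{\alpha}$ capped at 1 with $x_{max} = 100$ and $\alpha = 3/4$. Pairs with a zero count are skipped, since $\log 0$ is undefined and this $f$ gives them zero weight. All vectors and counts are toy values.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weight(x, x_max=100.0, alpha=0.75):
    """Assumed weighting f: grows with the count, then is capped at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(P, U, V):
    """0.5 * sum_ij f(P_ij) * (u_i . v_j - log P_ij)^2 over non-zero counts."""
    total = 0.0
    for i, row in enumerate(P):
        for j, count in enumerate(row):
            if count > 0:
                error = dot(U[i], V[j]) - math.log(count)
                total += weight(count) * error * error
    return 0.5 * total

# Toy co-occurrence counts (hypothetical).
P = [[4, 0],
     [0, 9]]
V = [[1.0, 0.0], [0.0, 1.0]]

# Vectors chosen so that u_i . v_j equals log P_ij on the non-zero entries.
U_fit = [[math.log(4), 0.0], [0.0, math.log(9)]]
U_zero = [[0.0, 0.0], [0.0, 0.0]]

print(glove_loss(P, U_fit, V))   # 0.0: dot products match the log counts
print(glove_loss(P, U_zero, V))  # positive: dot products are all zero
```

Because the sum runs over count entries rather than corpus windows, the cost of one pass depends on the number of non-zero entries in $P$, not on the corpus length.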

The difference between the Skip Gram and GloVe:
- Skip Gram model attempts to capture co-occurrences one window at a time.
- GloVe model attempts to capture the counts of how often the words appear together.

### Advantages

- Training a GloVe model is fast
- Scalable to large corpora
- Good performance even with a small corpus because of the efficient use of the statistics