# Word Embeddings


* Finding a higher-level vector representation of words
    * Represent each word more compactly, i.e., with a smaller number of dimensions
        * Recall the goal of PCA: decorrelate the observed data. If 99% of the documents that contain "car" also contain "vehicle", we do not need two separate dimensions to represent them.
    * Represent each word more meaningfully, so that semantically closer words are a smaller distance apart (i.e., the Euclidean distance between their vectors is small).
    



### Bag of Words 

* BOW throws away a lot of information that is vital to the meaning of text
    * For example, take the two sentences "Dogs love cats and I" and "I love dogs and cats". They contain exactly the same words, so BOW gives them the same vector, yet they have totally different meanings.
* How to model bag-of-words
    * <b style="color:red">Each chunk of text (e.g., sentence or document) is modeled as a vector</b>. 
    * All vectors have the same length V, which is the size of the vocabulary you defined for a specific NLP problem.
    * Each entry of the vector represents a word in the vocabulary. 
    * Assume we have a word-to-index dictionary, called word2idx, that maps each word in the vocabulary to the index of its position in the vector
    * Then, sentences are converted to vectors as follows:
    
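    ```python
    import numpy as np

    # Assumes input_sentences, tokenize, word2idx and V are already defined.
    word_vectors = []
    for sentence in input_sentences:
        vector = np.zeros(V)
        tokens = tokenize(sentence)
        for token in tokens:
            vector[word2idx[token]] = 1  # mark the word as present
        word_vectors.append(vector)  # one vector per sentence
    ```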
    
    * A sentence becomes [0, 1, 0, 0,..., 1]
    * The order of words in the original sentence is not maintained.
* Why the order of words might be important
    * For example, compare the sentences "Today is a good day" and "Today is not a good day"
    * The two sentences lead to almost the same vector; the only difference is vector[word2idx["not"]] = 1
    * Therefore bag-of-words is not good at handling negation
* RNNs might be ideal here because they keep state: seeing a "not" might result in negating everything that comes after
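
Both weaknesses can be checked on the example sentences. The sketch below is only an illustration: it assumes a naive lowercase whitespace tokenizer and builds the vocabulary from the four sentences themselves, neither of which comes from the notes.

```python
import numpy as np

def bag_of_words(sentence, word2idx):
    """Binary bag-of-words vector: 1 if the word occurs in the sentence, else 0."""
    vector = np.zeros(len(word2idx))
    for token in sentence.lower().split():  # naive tokenizer (assumption)
        vector[word2idx[token]] = 1
    return vector

sentences = [
    "Dogs love cats and I",
    "I love dogs and cats",
    "Today is a good day",
    "Today is not a good day",
]

# Build the vocabulary from the example sentences themselves.
vocab = sorted({token for s in sentences for token in s.lower().split()})
word2idx = {word: idx for idx, word in enumerate(vocab)}

bow = [bag_of_words(s, word2idx) for s in sentences]

# The same words in a different order give identical vectors.
same_vector = np.array_equal(bow[0], bow[1])
# The negated sentence differs from the original in a single entry ("not").
num_different_entries = int(np.sum(bow[2] != bow[3]))

print(same_vector, num_different_entries)
```

The first two sentences collapse to one vector, and the negated sentence differs from its positive counterpart in just one position.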

### One-Hot Encoding

* Each word is modeled as a vector of size V
    * V is the size of the vocabulary you defined for a specific NLP problem. 
    * Each position corresponds to a word in the vocabulary
    * In the vector for a word $w$, every position is zero except the position that corresponds to $w$, which has a value of 1.

<img src="images/one_hot.png" alt="one_hot" style="width:40%;height:40%"/>
    
* The problems of one-hot encoding are:
    * If we have one million words in the vocabulary, we need a one-million-dimensional vector to represent each distinct word. 
    * It cannot tell how related two words are, since all pairs of words are the same distance apart.
        * Manhattan distance: |[1,0,0] - [0,1,0]| = 2
        * All words are a distance of 1 from the origin, and a Manhattan distance of 2 (Euclidean distance of $\sqrt{2}$) from each other.
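
The equal-distance claim is easy to verify numerically. This is a small sketch with an arbitrary toy vocabulary size of 5 (an assumption, not a value from the notes):

```python
import numpy as np

V = 5  # toy vocabulary size (assumption)
one_hot = np.eye(V)  # row i is the one-hot vector of word i

# Differences between every pair of one-hot vectors: shape (V, V, V).
diff = one_hot[:, None, :] - one_hot[None, :, :]
manhattan = np.abs(diff).sum(axis=-1)
euclidean = np.sqrt((diff ** 2).sum(axis=-1))

# Ignore the diagonal, where a word is compared with itself.
off_diagonal = ~np.eye(V, dtype=bool)
print(np.unique(manhattan[off_diagonal]))  # one value shared by all pairs
print(np.unique(euclidean[off_diagonal]))  # one value shared by all pairs
```

Every pair of distinct words ends up at the same distance, so the encoding carries no similarity information.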


### PCA

* PCA takes in BOW vectors, or BOW vectors weighted by TF-IDF, and outputs new vectors that are both smaller than the originals and a meaningful distance away from each other.
* How PCA works:

**Step 1**
* Gather a bunch of documents and count how many times each word appears in each document
    * Data_matrix[word_index, doc_index] = count
    * Could also use TF-IDF
      
```
example:

    document 1: "cat dog rabbit"
    document 2: "cat lion tiger cat"
    
    Create a VxN (5x2) matrix with this word-to-index mapping:
    cat=0, dog=1, rabbit=2, lion=3, tiger=4
    
            doc1 doc2
     cat     1     2
     dog     1     0
     rabbit  1     0
     lion    0     1
     tiger   0     1
```   
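
The count matrix in the example can be built with a short loop. This is a minimal sketch that assumes whitespace tokenization and uses the word-to-index mapping given above:

```python
import numpy as np

documents = ["cat dog rabbit", "cat lion tiger cat"]
word2idx = {"cat": 0, "dog": 1, "rabbit": 2, "lion": 3, "tiger": 4}

V, N = len(word2idx), len(documents)
X = np.zeros((V, N), dtype=int)  # rows are words, columns are documents

for doc_index, document in enumerate(documents):
    for token in document.split():  # whitespace tokenization (assumption)
        X[word2idx[token], doc_index] += 1

print(X)
```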

* Usually we have an NxD matrix, where N is the number of samples and D is the number of features
* Here we put words along the rows and documents along the columns, so each word is a sample and each document is a feature
* This makes sense:

> The document a word shows up in is a feature of that word, since the document explains the meaning of that word. e.g., If a set of words show up in physics journal articles, these words have meaning related to physics

> Unsupervised learning: the "physics topic" is the hidden cause/latent variable, the actual document is then generated from a distribution which describes what they look like. 

**Step 2**
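* Let's say we use PCA on a term-document matrix
```python
from sklearn.decomposition import PCA

# X is the VxN term-document matrix; D is the chosen embedding size
model = PCA(n_components=D)
Z = model.fit_transform(X)
# Z is a VxD matrix, where D << N
```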

* This VxD matrix is a <b style="color:red">word embedding</b>. Each row is a D-dimensional vector
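
To make the shapes concrete without depending on a particular library, here is a rough NumPy-only sketch that performs PCA through an SVD on the toy term-document matrix from Step 1. The choice `D = 1` is arbitrary, and the SVD route is a stand-in for the `PCA` call above, not a claim about how that class is implemented.

```python
import numpy as np

# Term-document counts from the toy example: V = 5 words, N = 2 documents.
X = np.array([[1, 2], [1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
D = 1  # embedding size, must satisfy D <= N (arbitrary choice)

# PCA via SVD: center each column (feature), then project onto the top-D directions.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
Z = X_centered @ Vt[:D].T  # shape (V, D): one D-dimensional vector per word

print(Z.shape)
print(Z.ravel())
```

Words with identical document counts, such as "dog" and "rabbit", receive identical embeddings, which is the intended behavior: words that occur in the same documents end up close together.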




## Word Embeddings in RNN

* NLP is tightly coupled with RNNs, since RNNs give us a way to avoid using bag-of-words (BOW)
* Deep learning models use word embeddings instead of one-hot word vectors.
    * <b style="color:red">Each word is modeled as a one-hot word vector of size V</b>
        
    > Since each word is modeled as a one-hot word vector, a sequence of words (e.g., sentence or document) would be modeled as a list of one-hot word vectors with shape (T, V), where T is the length of the sequence. 
    
    * The one-hot word vector of size V is transformed into a much smaller vector of size D, where D << V. 
    
    > The newly created word vector of size D is the word embedding.
    
    * The word embeddings are then used as inputs instead of the one-hot vectors, and they are trained as part of the RNN model
    
    
**Word embeddings + RNN**
* The architecture of word embeddings + RNN is depicted as follows. The word embedding layer is the first layer, and the word embedding matrix $ W_e $ is trained together with the RNN as a whole: 
    
    <img src="images/embedding_rnn.png" alt="word_embedding_rnn" style="width:50%;height:50%"/>
     
         Input vector (1xV) -> W_e -> word vector (1xD) -> recurrent unit -> softmax(1xK)
    
    * Assume each word in a sequence is represented as a <b>one-hot vector</b> of length V. Thus, a sequence of length T is a list of one-hot vectors with shape (T, V)
    * $ W_e $ of shape $(V, D)$ is the word embedding transformation matrix that transforms a one-hot vector of size V into a word vector/embedding of size D
    * Since $ word(t) $ is a one-hot vector that has 1 in the $i$th position and 0 in all other positions, $ x(t) =  word(t) \times W_e $ is the $i$th row of $ W_e $


* A more compact way of representing the original word vectors
    * It is mathematically equivalent to the method described above, where the original sequence is a list of one-hot vectors with shape (T, V)
    * Assume we have a word-to-index dictionary, called word2idx, that maps each word in the vocabulary to the index of a row in the $ W_e $ matrix. We can model a sequence as a vector of length T in which each entry is the row index of $ W_e $ that holds the corresponding word vector/embedding.
    * Now, a sequence is modeled as a vector of indexes with length T, instead of the original list of shape (T, V)
    
    A slow/naive way:
    ```python
    word_vectors = []
    for index in input_sequence_indices:
        word_vector = W_e[index, :]  # row lookup; NumPy indexes with square brackets
        word_vectors.append(word_vector)
    ```
    
    Numpy/Theano can accept arrays/lists as indexes. Thus, a more efficient way is utilizing indexing of numpy or theano:
    
    ```python
    word_vectors = W_e[input_sequence_indices]
    ```

**Example**
* Assume we have a vocabulary containing 6 words and the word-to-index dictionary is defined as follows:
    
    { I=0, like=1, ice=2, cream=3, and=4, cake=5 }
    
    
* A sequence "I like ice cake" would be modeled as a vector of indexes as:
    
     $$ seq_1 = [0, 1, 2, 5] $$
     
* A sequence "cake I like" would be modeled as:
    
     $$ seq_2 = [5, 0, 1] $$
     
         
* Suppose we have the word embedding transformation matrix $ W_e $ of shape $(V, D) = (6, 3)$:
    
$$ W_e = \begin{bmatrix}
  a & b & c \\
  d & e & f \\
  g & h & i \\
  j & k & l \\
  m & n & o \\
  p & q & r
 \end{bmatrix} $$
 

* The embedding transformation of $ seq_1 $ is calculated as follows:  
    
 <b style="color:red">$$   W_e[seq_1] = W_e[[0, 1, 2, 5]] =
 \begin{bmatrix}
  a & b & c \\
  d & e & f \\
  g & h & i \\
  p & q & r \\
 \end{bmatrix} 
 $$</b>
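
The worked example can be reproduced in NumPy. In this sketch the symbolic entries $a, \dots, r$ are replaced by the stand-in numbers 0 to 17 (an assumption made only so the rows are easy to recognize), and the result is compared against the one-hot matrix product to confirm that the two views agree.

```python
import numpy as np

word2idx = {"I": 0, "like": 1, "ice": 2, "cream": 3, "and": 4, "cake": 5}
V, D = len(word2idx), 3

# Stand-in numbers for the symbolic entries a..r of W_e (assumption).
W_e = np.arange(V * D, dtype=float).reshape(V, D)

seq_1 = [word2idx[w] for w in "I like ice cake".split()]

# Fast path: integer-array indexing picks the rows listed in seq_1.
embedded = W_e[seq_1]  # shape (T, D)

# Equivalent slow path: one-hot matrix of shape (T, V) times W_e.
one_hot = np.eye(V)[seq_1]
embedded_via_matmul = one_hot @ W_e

print(seq_1)
print(embedded)
print(np.array_equal(embedded, embedded_via_matmul))
```

Indexing avoids building the (T, V) one-hot matrix at all, which matters when V is large.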

**TensorFlow and Keras provide a high-level API for the embedding layer: [here for TensorFlow](https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup) and [here for Keras](https://keras.io/layers/embeddings/)**

* When defining an `embedding` layer, we need to specify an embedding matrix with shape `(vocabulary size, dimension of word vectors)`
* The `embedding` layer takes a tensor of shape `(batch size, max input length)` as input. This corresponds to sentences converted into lists of indices (integers), as shown in the figure below.
    * The `batch size` is the number of training examples in each mini-batch
    * The `max input length` is the number of words in the longest training example in the mini-batch

<img src="images/embedding1.png" style="width:700px;height:250px;">

<caption><center> **Figure 4**: Embedding layer. This example shows the propagation of two examples through the embedding layer. Both have been zero-padded to a length of `max_len=5`. The final dimension of the representation is  `(2,max_len,50)` because the word embeddings we are using are 50 dimensional. </center></caption>

* The embedding layer contains an embedding matrix with shape `(vocabulary size, dimension of word vectors)`
* The largest integer (i.e. word index) in the input must be smaller than the vocabulary size, because valid row indices run from 0 to `vocabulary size - 1`. 
* The layer outputs a tensor of shape `(batch size, max input length, dimension of word vectors)`.
    * Note that the second dimension, `max input length`, is also frequently referred to as the `time step` dimension.
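
The shape bookkeeping can be illustrated with plain NumPy. This is a stand-in for what an embedding layer does, not the TensorFlow or Keras API; the vocabulary size, the index sequences, and the convention that index 0 means padding are all assumptions made for the sketch.

```python
import numpy as np

vocab_size, embedding_dim = 10, 50  # toy sizes (assumption)
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

# Two sentences already converted to word indices, with different lengths.
batch = [[3, 7, 2], [5, 1, 8, 4, 9]]
max_len = max(len(seq) for seq in batch)

# Zero-pad every sequence to max_len -> shape (batch size, max input length).
padded = np.zeros((len(batch), max_len), dtype=int)
for row, seq in enumerate(batch):
    padded[row, :len(seq)] = seq

# Every index must be a valid row of the embedding matrix.
assert padded.max() < vocab_size

# Lookup -> shape (batch size, max input length, dimension of word vectors).
output = embedding_matrix[padded]
print(padded.shape, output.shape)
```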

## Pre-trained Word Embeddings

* In word embeddings + RNNs, we put the word embedding matrix at the front of the recurrent network, where its training may suffer from the vanishing gradient problem
* Pre-training word embeddings is especially helpful here: instead of initializing the embeddings randomly, we pre-train them and then plug them into the RNN
* We can still update the word embedding matrix with backpropagation when training the whole network

### Word2Vec

* Word2vec is a technique, or paradigm, consisting of a group of models (Skip-gram and Continuous Bag of Words (CBOW)). The goal of each model is to produce fixed-size vectors for the words in the corpus, so that words with similar meanings have close vectors (i.e., the Euclidean distance between their vectors is small)
* A VxD input-to-hidden matrix and only one hidden layer 
* A DxV hidden-to-output matrix
* No hidden-layer nonlinearity, i.e., a linear mapping 
* Not deep: just a shallow neural network



#### Bigram Model

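* The bigram model predicts the next word from the current word
* How to train it: set the target for the next word to 1 (a one-hot target), apply softmax at the output, and minimize cross-entropy with gradient descent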
* Nothing special
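
As a rough illustration of that recipe, the sketch below trains the shallow linear network on a single hypothetical bigram. The sizes, the learning rate, and the training pair are assumptions chosen only to keep the example small.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

V, D = 6, 3  # toy vocabulary size and hidden size (assumptions)
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(V, D))  # input-to-hidden matrix
W2 = rng.normal(scale=0.1, size=(D, V))  # hidden-to-output matrix

current_idx, next_idx = 0, 1  # a single hypothetical bigram (word 0 -> word 1)
learning_rate = 0.5
losses = []
for _ in range(100):
    h = W1[current_idx].copy()           # one-hot input selects a row; no nonlinearity
    p = softmax(h @ W2)                  # predicted distribution over the next word
    losses.append(-np.log(p[next_idx]))  # cross-entropy against the one-hot target
    dz = p.copy()
    dz[next_idx] -= 1                    # gradient w.r.t. the logits: p - target
    dh = W2 @ dz                         # gradient w.r.t. the hidden vector
    W2 -= learning_rate * np.outer(h, dz)
    W1[current_idx] -= learning_rate * dh

print(losses[0], losses[-1])
```

Note that every step touches all `D x V` entries of `W2`, which is the cost that negative sampling later reduces.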

## Continuous Bag of Words (CBOW)

* Incorporate context
* Can better predict a word given a window of surrounding words (rather than just the previous word)
* We will use a context size of 5-10 words on each side (e.g., 10 left + 10 right)
* How we modify the neural network to use CBOW:
    * Add more inputs to the neural network. So the input becomes fat but the output remains the same size.
    * The same input-to-hidden weight for all inputs
    * Each word is one-hot encoded, so each input word sends only one row of $W$ to the hidden layer. This is not necessarily sharing weights
    * What happens at the hidden layer?
        * $h$ = mean($W_i$), where $i$ ranges over the indices of the context words
        * If we have 3 input vectors, we get 3 D-dimensional vectors at the hidden layer. We take the average of these 3 D-dimensional vectors to form one D-dimensional vector that is passed along to the output layer.
    
<img src="images/word2vec-cbow.png" alt="Drawing" style="width:50%;height:50%"/>
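
The hidden-layer averaging can be sketched in a few lines. The sizes, the random weights, the context indices, and the name `W_out` for the hidden-to-output matrix are all hypothetical:

```python
import numpy as np

V, D = 6, 3  # toy vocabulary and embedding sizes (assumption)
rng = np.random.default_rng(0)
W = rng.normal(size=(V, D))      # shared input-to-hidden matrix
W_out = rng.normal(size=(D, V))  # hidden-to-output matrix (hypothetical name)

context_indices = [0, 1, 3]  # indices of the context words (hypothetical)

# Each one-hot context word selects one row of W; the hidden layer is their mean.
h = W[context_indices].mean(axis=0)  # shape (D,)

# Equivalent view: average the one-hot inputs first, then multiply by W.
averaged_one_hot = np.eye(V)[context_indices].mean(axis=0)
h_via_matmul = averaged_one_hot @ W

logits = h @ W_out  # shape (V,): one score per candidate middle word
print(h.shape, logits.shape, np.allclose(h, h_via_matmul))
```

However many context words are fed in, the hidden vector keeps size D, so the output layer is unchanged.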

## Skip-Gram
* CBOW and skip-gram are the two main methods of incorporating context with word2vec
* Skip-gram is the opposite of CBOW
* CBOW uses context to predict middle word
* Skip-gram uses middle word to predict context

<img src="images/word2vec-skipgram.png" alt="Drawing" style="width:55%;height:55%"/>
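
The difference between the two methods shows up in how training examples are built from a sentence. The helper names `skipgram_pairs` and `cbow_examples` below are made up for this sketch, and a window of 1 word on each side is an arbitrary choice:

```python
def skipgram_pairs(tokens, window):
    """(middle word, context word) pairs: the middle word predicts each neighbor."""
    pairs = []
    for i, middle in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((middle, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """(context words, middle word) examples: the neighbors predict the middle word."""
    examples = []
    for i, middle in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        examples.append((context, middle))
    return examples

tokens = "I like ice cream and cake".split()
print(skipgram_pairs(tokens, window=1)[:3])
print(cbow_examples(tokens, window=1)[:2])
```

Skip-gram yields one training pair per (middle, neighbor) combination, while CBOW yields one example per position with all neighbors grouped as the input.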


#### Negative Sampling

* The problem with word2vec is that there may be a tremendous number of weights to update at every iteration of gradient descent
* Negative sampling addresses this problem by having each training sample only modify a small percentage of the weights, rather than all of them.
* When training the network, the “label” or “correct/expected output” of the network is a one-hot vector:
    * We want the output neuron corresponding to the target to output a 1, and all other output neurons to output a 0. 
    * There will be V-1 non-target words. A lot of time is spent on updating weights for those words. 

* Basic idea: throw away most of the non-target words from the cost function, since their number is almost the size of the entire vocabulary. In other words, we sample only a small number of non-target words to update the weights for. 
    * These sampled non-target words are called "negative samples"; they are drawn from the vocabulary, excluding words from the context of the input word

> [The paper](https://arxiv.org/pdf/1310.4546.pdf) says that selecting 5-20 words works well for smaller datasets, and you can get away with only 2-5 words for large datasets.<br/>

If the output layer of our model has a weight matrix that’s 300 x 10,000, we will just be updating the weights for the words in the context, say 2, plus the weights for the negative samples, say 10. That’s a total of 12 output neurons, and 3,600 weight values in total. That’s only 0.12% of the 3M weights in the output layer!

> In the hidden layer, only the weights for the input word are updated (this is true whether you’re using Negative Sampling or not).
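
The arithmetic behind those numbers, written out as a quick check (the variable names are only for illustration):

```python
hidden_size, vocab_size = 300, 10_000
context_words, negative_samples = 2, 10

total_output_weights = hidden_size * vocab_size     # all weights in the output layer
updated_neurons = context_words + negative_samples  # output neurons that get updated
updated_weights = updated_neurons * hidden_size     # weights attached to those neurons
fraction_updated = updated_weights / total_output_weights

print(f"{updated_weights} of {total_output_weights} weights ({fraction_updated:.2%})")
```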

**How to select the negative samples**
* Typically, the choice of negative samples depends on word frequency in the given corpus: the more frequent a word, the higher the chance of choosing it as a negative sample.
* The following formula is used to determine the probability of selecting the word as a negative sample:

$$ P(w_i) = {f(w_i)^c \over \sum_{j=0}^{n} f(w_j)^c}$$

Where $f(w_i)$ is the frequency of word $w_i$ and $c$ is a constant selected by the model creator. Research shows that $c = 3/4$ works well.
* The decision to raise the frequency to the 3/4 power appears to be empirical; in their paper they say it outperformed other functions.
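
A small sketch of the sampling distribution, using made-up word counts. It skips the step of excluding the context words of the current input word, so it is not a complete sampler.

```python
import numpy as np

# Hypothetical raw word counts from a corpus.
counts = {"the": 1000, "cat": 50, "sat": 30, "aardvark": 1}
words = list(counts)
f = np.array([counts[w] for w in words], dtype=float)

c = 0.75
p_unigram = f / f.sum()               # plain relative frequency
p_negative = f ** c / np.sum(f ** c)  # P(w_i) from the formula above

rng = np.random.default_rng(0)
negative_indices = rng.choice(len(words), size=5, p=p_negative)

for word, pu, pn in zip(words, p_unigram, p_negative):
    print(f"{word:10s} unigram={pu:.4f} negative={pn:.4f}")
print([words[k] for k in negative_indices])
```

Raising the frequencies to the 3/4 power flattens the distribution: very frequent words are picked somewhat less often than their raw frequency suggests, and rare words somewhat more often.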


### Which word embeddings to use

* With word2vec we have two word embeddings
* Which one to use?

**Method 1**

$$W_e = [W_1 W_2^T]$$

* $W_1$ is the input-to-hidden matrix with shape (V, D) 
* $W_2$ is the hidden-to-output matrix with shape (D, V)
* $W_e$ is the concatenation of $W_1$ and $W_2^T$ with shape (V, 2D)

**Method 2**

$$W_e = {{W_1 + W_2^T} \over 2} $$

* $W_1$ is the input-to-hidden matrix with shape (V, D) 
* $W_2$ is the hidden-to-output matrix with shape (D, V)
* $W_e$ is the average of $W_1$ and $W_2^T$ with shape (V, D)
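
Both options are one-liners in NumPy. The matrices here are random placeholders for trained weights, and the sizes are arbitrary:

```python
import numpy as np

V, D = 6, 3  # toy sizes (assumption)
rng = np.random.default_rng(0)
W1 = rng.normal(size=(V, D))  # input-to-hidden matrix (placeholder values)
W2 = rng.normal(size=(D, V))  # hidden-to-output matrix (placeholder values)

W_e_concat = np.hstack([W1, W2.T])  # Method 1: shape (V, 2D)
W_e_mean = (W1 + W2.T) / 2          # Method 2: shape (V, D)

print(W_e_concat.shape, W_e_mean.shape)
```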

### <b style="color:red">*</b>Cost function and derivative

* To adopt negative sampling, we use a cost function that is different from softmax cross-entropy:

$$ J = -\sum_{o \in ctxt} log (\sigma(V_o^T W_i)) -\sum_{\phi \in neg} log (\sigma(- V_{\phi}^T W_i)) \qquad \qquad (1) $$

$ V $ is the hidden-to-output matrix with shape $(D, M)$, where $M$ is the vocabulary size and $D$ is the hidden layer size; $V^T$ has shape $(M, D)$, <br/>
    $ V_o^T $ is the $o$th row vector of $ V^T $ and it is a context word vector. <br/>
    $ V_{\phi}^T $ is the $\phi$th row vector of $ V^T $ and it is a negative sample vector. <br/>
$ W $ is the input-to-hidden matrix with shape $(M, D)$, and $ W_i $ is the $i$th row vector of $ W $, i.e., the vector of the input word

* Here, I draw a pictorial example of the input-to-hidden matrix $ W $ and the hidden-to-output matrix $ V $. In this example, we have 1M words in the vocabulary and 300 neurons in the hidden layer:

<img src="images/word2vec_2matrix.png" alt="Drawing" style="width:55%;height:55%"/>


* Here, I draw a pictorial example of $ V $ and $ V^T $: 
<img src="images/word2vec_output_matrix.png" alt="Drawing" style="width:55%;height:55%"/>


#### Partial Derivatives

(1) To update the output layer weights, we need to update every output weight vector that belongs to either a context word or a negative sample:

* For each context word weight vector $ o \in ctxt $:

$$ \frac{\partial J}{\partial V_o^T} = (\sigma(V_o^T W_i) - 1)W_i \qquad \qquad (2) $$

where  <b style="color:red">$\sigma(V_o^T W_i) - 1 $</b> is the error


* For each negative example weight vector $ \phi \in neg $:

$$\frac{\partial J}{\partial V_{\phi}^T} = (1 - \sigma(- V_{\phi}^T W_i))W_i \qquad \qquad (3) $$

where  <b style="color:red">$ 1- \sigma(- V_{\phi}^T W_i)$</b> is the error

(2) To update the hidden layer weights, we only need to update the weight vector of the input word:

$$\frac{\partial J}{\partial W_i} = \sum_{o \in ctxt}(\sigma(V_o^T W_i) - 1)V_o^{T} + \sum_{\phi \in neg} (1- \sigma(- V_{\phi}^T W_i))V_{\phi}^{T} \qquad \qquad (4) $$
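
Formulas (1) to (4) can be written directly with explicit loops and sanity-checked against a finite-difference approximation. This is a sketch under assumed toy sizes and random weights; the function names `cost` and `gradients` and the chosen index sets are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cost(W, V, i, ctxt, neg):
    """Formula (1): negative-sampling cost for input word i."""
    J = 0.0
    for o in ctxt:
        J -= np.log(sigmoid(V[:, o] @ W[i]))
    for phi in neg:
        J -= np.log(sigmoid(-(V[:, phi] @ W[i])))
    return J

def gradients(W, V, i, ctxt, neg):
    """Formulas (2)-(4), written with explicit loops."""
    dV = np.zeros_like(V)
    dW_i = np.zeros_like(W[i])
    for o in ctxt:
        error = sigmoid(V[:, o] @ W[i]) - 1
        dV[:, o] = error * W[i]    # formula (2)
        dW_i += error * V[:, o]    # first sum of formula (4)
    for phi in neg:
        error = 1 - sigmoid(-(V[:, phi] @ W[i]))
        dV[:, phi] = error * W[i]  # formula (3)
        dW_i += error * V[:, phi]  # second sum of formula (4)
    return dV, dW_i

M, D = 8, 4  # toy vocabulary size and hidden size (assumption)
rng = np.random.default_rng(0)
W = rng.normal(size=(M, D))  # input-to-hidden matrix
V = rng.normal(size=(D, M))  # hidden-to-output matrix
i, ctxt, neg = 0, [1, 2], [5, 6, 7]  # hypothetical input, context and negative indices

dV, dW_i = gradients(W, V, i, ctxt, neg)

# Central finite difference on one coordinate of W_i as a sanity check.
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[i, 0] += eps
W_minus[i, 0] -= eps
numeric = (cost(W_plus, V, i, ctxt, neg) - cost(W_minus, V, i, ctxt, neg)) / (2 * eps)
print(np.isclose(numeric, dW_i[0], atol=1e-5))
```

Only the columns of `V` listed in `ctxt` and `neg` receive a nonzero gradient, which is exactly the saving negative sampling is meant to provide.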

#### Vectorize formulas 
* To calculate these derivatives efficiently (getting rid of for-loops speeds up the code significantly with numerical computation tools like numpy), we vectorize the cost function formula (1) and the derivative formulas (2), (3) and (4).

* Before explaining vectorization, we first explain some notation that may be confusing.

<img src="images/notation_explain_1.png" alt="Drawing" style="width:55%;height:55%"/>

**Vectorize cost function, formula (1)**

<b style="color:red"> $$ J = - sum(log(\sigma(W_i V_{(:,ctxt)}))) - sum(log(\sigma(-W_i V_{(:,neg)}))) $$ </b>

**Vectorize formula (2)**
* Let's first zoom in on formula (2):

$$ (..., \frac{\partial J}{\partial V_{(o,j)}^T},...) = (..., (\sigma(V_{o}^T W_i) - 1)W_{(i, j)},...) \qquad  \text{for each} \quad  o \in \text{ctxt} $$

This is actually two for-loops for calculating the derivatives of each weight in the context word weight vectors. Let's write it in pseudo code:

<img src="images/two_for_loop.png" alt="Drawing" style="width:25%;height:25%"/>

* We can vectorize this by using outer product of numpy:

<img src="images/vec_outer_product.png" alt="Drawing" style="width:40%;height:40%"/>

* We vectorize $ \sigma(V_{o}^T W_i) $ for all $ o \in ctxt$: 

$$ \sigma(V_{(ctxt,:)}^T W_i) = (..., \sigma(V_{o \in ctxt}^T W_i) ,...)$$ 

* For programming convenience, we may use $V$ instead of $V^T$:

$$ \sigma(V_{(ctxt,:)}^T W_i) = \sigma(W_i V_{(:,ctxt)}) $$

$$ \sigma(W_i V_{(:,ctxt)}) = (..., \sigma(W_i V_{(:,o \in ctxt)}) ,...) $$ 

* Taking the error form of the vector:

$$ \sigma(W_i V_{(:,ctxt)} )-1 = (..., \sigma(W_i V_{(:,o \in ctxt)})-1 ,...) $$ 

* Finally, we take the outer product of $ W_i $ and $ (\sigma(W_i V_{(:,ctxt)} )-1) $:

<b style="color:red"> $$ \frac{\partial J}{\partial V_{(:,ctxt)}} = W_i \odot (\sigma(W_i V_{(:,ctxt)} )-1) $$ </b>
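
To see that the outer product really replaces the two nested loops, the sketch below computes the context-word gradient both ways on random toy data. The sizes and index set are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

D, M = 4, 8  # toy hidden size and vocabulary size (assumption)
rng = np.random.default_rng(0)
W_i = rng.normal(size=D)      # vector of the input word
V = rng.normal(size=(D, M))   # hidden-to-output matrix
ctxt = [1, 2]                 # hypothetical context word indices

# Two nested for-loops: one over context words, one over the D weights of each.
dV_loop = np.zeros((D, len(ctxt)))
for col, o in enumerate(ctxt):
    error = sigmoid(V[:, o] @ W_i) - 1
    for j in range(D):
        dV_loop[j, col] = error * W_i[j]

# Vectorized: one matrix-vector product for the errors, one outer product for the gradient.
error_vec = sigmoid(W_i @ V[:, ctxt]) - 1
dV_vec = np.outer(W_i, error_vec)  # shape (D, len(ctxt))

print(np.allclose(dV_loop, dV_vec))
```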

**Vectorize formula (3)**
* Zooming in on formula (3), we have:

$$(..., \frac{\partial J}{\partial V_{(\phi,j)}^T},...) = (..., (1 - \sigma(- V_{\phi}^T W_i))W_{(i,j)},...) \qquad  \text{for each} \quad  \phi \in \text{neg}$$

* Vectorizing $ \sigma(V_{\phi}^T W_i) $:

$$ \sigma(V_{(neg,:)}^T W_i) = (..., \sigma(V_{\phi \in neg}^T W_i) ,...)$$ 

* For programming convenience, we may use $V$ instead of $V^T$:

$$ \sigma(V_{(neg,:)}^T W_i) = \sigma(W_i V_{(:,neg)}) $$

$$ \sigma(W_i V_{(:,neg)} ) = (..., \sigma(W_i V_{(:,\phi \in neg)}) ,...) $$ 

* Taking the error form of the vector:

$$ 1-\sigma(- W_i V_{(:,neg)} ) = (..., 1-\sigma(- W_i V_{(:,\phi \in neg)}) ,...) $$ 

* Finally, we take the outer product of $ W_i $ and $ (1-\sigma(-W_i V_{(:,neg)})) $:

<b style="color:red"> $$ \frac{\partial J}{\partial V_{(:,neg)}} = W_i \odot (1 - \sigma(- W_i V_{(:,neg)} )) $$ </b>

**Vectorize formula (4)**
* Formula (4) has two components, both of which are actually in the form of an inner product of two vectors, as shown in the picture below:

<img src="images/vec_inner_product.png" alt="Drawing" style="width:40%;height:40%"/>

* Vectorizing the two components:

$$ \sum_{o \in ctxt}(\sigma(V_o^T W_i) - 1)V_o^{T} = (\sigma(W_i V_{(:,o \in ctxt)})-1) \cdot V_{(:,o \in ctxt)}^T $$

$$ \sum_{\phi \in neg} (1- \sigma(- V_{\phi}^T W_i))V_{\phi}^{T} = (1-\sigma(- W_i V_{(:,\phi \in neg)})) \cdot  V_{(:,\phi \in neg)}^T $$

Finally, we have the vectorization form:

<b style="color:red"> $$ \frac{\partial J}{\partial W_i} = (\sigma(W_i V_{(:,o \in ctxt)})-1) \cdot V_{(:,o \in ctxt)}^T  + (1-\sigma(- W_i V_{(:,\phi \in neg)})) \cdot  V_{(:,\phi \in neg)}^T $$</b>

To calculate these derivatives in code, we only need to calculate the formulas in red color.

### Summary
* The following are all the components and formulas we need to train a Word2Vec model.
* $ (5)-(8) $ are the basic components that form formulas (9)-(12). When coding, we can calculate each of them individually.

$$\sigma(W_i V_{(:,ctxt)}) \qquad \qquad \qquad (5) $$

$$\sigma(-W_i V_{(:,neg)}) \qquad \qquad \qquad (6) $$

$$\sigma(W_i V_{(:,ctxt)} )-1 \qquad \qquad \qquad (7) $$

$$ 1 - \sigma(- W_i V_{(:,neg)}) \qquad \qquad \qquad (8) $$


* $ (9)-(12) $ are the formulas for calculating the cost/loss function and derivatives.


$$ J = - sum(log(\sigma(W_i V_{(:,ctxt)}))) - sum(log(\sigma(-W_i V_{(:,neg)}))) \qquad \qquad \qquad (9) $$

$$ \frac{\partial J}{\partial V_{(:,ctxt)}} = W_i \odot (\sigma(W_i V_{(:,ctxt)} )-1) \qquad \qquad \qquad (10) $$

$$ \frac{\partial J}{\partial V_{(:,neg)}} = W_i \odot (1 - \sigma(- W_i V_{(:,neg)} )) \qquad \qquad \qquad (11) $$

$$ \frac{\partial J}{\partial W_i} = (\sigma(W_i V_{(:,o \in ctxt)})-1) \cdot V_{(:,o \in ctxt)}^T  + (1-\sigma(- W_i V_{(:,\phi \in neg)})) \cdot  V_{(:,\phi \in neg)}^T \qquad \qquad \qquad (12) $$
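
Put together, components (5) to (8) and formulas (9) to (12) map almost line for line onto NumPy. The following is a minimal sketch on random toy data; the sizes, index sets, and variable names are assumptions, and a real training loop would repeat this for every (input word, context) pair and then apply a gradient-descent update.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

M, D = 8, 4  # toy vocabulary size and hidden size (assumption)
rng = np.random.default_rng(0)
W = rng.normal(size=(M, D))  # input-to-hidden matrix
V = rng.normal(size=(D, M))  # hidden-to-output matrix
i, ctxt, neg = 0, [1, 2], [5, 6, 7]  # hypothetical input, context and negative indices
W_i = W[i]

sig_ctxt = sigmoid(W_i @ V[:, ctxt])    # (5)
sig_neg = sigmoid(-(W_i @ V[:, neg]))   # (6)
err_ctxt = sig_ctxt - 1                 # (7)
err_neg = 1 - sig_neg                   # (8)

J = -np.sum(np.log(sig_ctxt)) - np.sum(np.log(sig_neg))  # (9)
dV_ctxt = np.outer(W_i, err_ctxt)                        # (10), shape (D, len(ctxt))
dV_neg = np.outer(W_i, err_neg)                          # (11), shape (D, len(neg))
dW_i = err_ctxt @ V[:, ctxt].T + err_neg @ V[:, neg].T   # (12), shape (D,)

print(J, dV_ctxt.shape, dV_neg.shape, dW_i.shape)
```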
