# CS224n: NLP with Deep Learning

# Lecture 2: Word Vectors and Word Senses

## Word2vec

### Our usual example

* King - Man ~= Idea of Kingship without the Man Part
* +Woman = add Woman Idea to it
* => Get Queen
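
The arithmetic above can be sketched with a handful of hand-made 2D vectors. The vectors, the extra word `apple`, and the helper names `cosine` and `analogy` are all assumptions made for illustration; real word2vec vectors are learned and have hundreds of dimensions.

```python
import numpy as np

# Toy, hand-made vectors (NOT learned): axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, -0.8]),
    "man":   np.array([0.1, 0.9]),
    "woman": np.array([0.1, -0.9]),
    "apple": np.array([-0.8, 0.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    query = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

result = analogy("king", "man", "woman")
print(result)
```

Excluding the three input words from the candidates is a common convention, since the query vector often stays close to one of them.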

Plot words on a scatter plot:

### Word2vec : Vectorization

We have 2 big matrices:
* 1 that represents every outside word's vector $U$
* 1 which represents every center word's vector $V$

$$U = \begin{bmatrix}[outside\,word\,vector\,1]
\\ \vdots
\\ [outside\,word\,vector\,n] \end{bmatrix}$$

$$V = \begin{bmatrix}[center\,word\,vector\,1]
\\ \vdots
\\ [center\,word\,vector\,n] \end{bmatrix}$$

* We multiply $U$ by a center word $v_{4}$

$$\begin{align*}
U \cdot v_{4} &= \begin{bmatrix}[outside\,word\,vector\,1]
\\ \vdots
\\ [outside\,word\,vector\,n] \end{bmatrix}
\cdot v_{4}
\\ &=\begin{bmatrix}[similarity\,(u_{1}, v_{4})]
\\ \vdots
\\ [similarity\,(u_{n}, v_{4})] \end{bmatrix}
\end{align*}$$

* We apply softmax to get a vector of probabilities

$$softmax(U\cdot v_{4})$$
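
As a minimal sketch of these two steps, the snippet below builds small random $U$ and $V$ matrices, multiplies $U$ by one center vector, and applies softmax. The vocabulary size, the dimension, and the random initialization are assumptions chosen only to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, dim = 6, 3                     # toy sizes (assumed)
U = rng.normal(size=(n_words, dim))     # one outside word vector per row
V = rng.normal(size=(n_words, dim))     # one center word vector per row

def softmax(scores):
    """Turn a vector of scores into a probability distribution."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

v4 = V[3]                  # center vector of the 4th word (0-based index 3)
scores = U @ v4            # dot-product similarity with every outside vector
probs = softmax(scores)    # estimated P(outside word | center word 4)
print(probs)
```

The highest probability always goes to the outside word whose vector has the largest dot product with $v_{4}$, since softmax preserves the ordering of the scores.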

#### Remark 1
**The model makes the same predictions at every position: the predicted outside words depend only on the center word, not on where they sit in the window!**

This means that we want a model that gives a reasonably high probability estimate to all words that occur in the context

#### Remark 2
The words 'and', 'the', 'of', ... co-occur very frequently with almost all the other words

#### Remark 3
The 2D projections of the word clouds can be very misleading

In a very high-dimensional space, a word can be close to lots of other words along different directions, which a 2D projection cannot show

## Optimization: Gradient Descent

**Objective:**
* Minimise cost function $J(\theta)$

**Idea:** 
* For current value of $\theta$, calculate the gradient of $J(\theta)$
* Take a small step in the direction of negative gradient

$$ \theta^{new} = \theta^{old} - \alpha \nabla_{\theta}J(\theta) $$ where $\alpha$ is the learning rate
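
A minimal sketch of this update rule, applied to a toy quadratic cost $J(\theta) = \lVert\theta - t\rVert^2$ rather than the word2vec objective. The cost, the target `target`, the learning rate, and the helper names are assumptions for illustration.

```python
import numpy as np

def gradient_descent(grad_fn, theta0, alpha=0.1, n_steps=100):
    """Repeatedly apply theta_new = theta_old - alpha * grad J(theta_old)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - alpha * grad_fn(theta)
    return theta

# Toy cost J(theta) = ||theta - target||^2, whose gradient is 2 * (theta - target).
target = np.array([1.0, -2.0])

def grad_J(theta):
    return 2.0 * (theta - target)

theta_final = gradient_descent(grad_J, theta0=[0.0, 0.0])
print(theta_final)
```

With these settings each step shrinks the distance to the minimum by a constant factor, so `theta_final` ends up very close to `target`. A learning rate that is too large would make the iterates diverge instead.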

### Stochastic Gradient Descent

**Problem:**

* $J(\theta)$ is a function of all windows in the corpus, so computing its full gradient is very expensive!!

**Solution:**

Stochastic Gradient Descent !

* Repeatedly sample windows
* Update our parameters after going through each window
* The parameters are updated with a very noisy gradient estimate, but in practice it doesn't matter too much
* It allows us to go much quicker
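
The window sampling step can be sketched as follows. The toy corpus, the window radius `m`, the batch size, and the helper `window_at` are assumptions for illustration; the gradient computation itself is left out.

```python
import random

corpus = "the quick brown fox jumps over the lazy dog".split()  # toy corpus (assumed)
m = 2  # window radius: up to m outside words on each side of the center word

def window_at(tokens, pos, m):
    """Return (center word, list of outside words) for the window centered at pos."""
    left = tokens[max(0, pos - m):pos]
    right = tokens[pos + 1:pos + 1 + m]
    return tokens[pos], left + right

random.seed(0)
# Sample a few window positions at random instead of sweeping the whole corpus.
batch_positions = random.sample(range(len(corpus)), k=4)
batch = [window_at(corpus, pos, m) for pos in batch_positions]

for center, outside in batch:
    print(center, outside)
```

Each sampled window would then contribute one noisy gradient estimate, and the parameters are updated before the next sample is drawn.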

#### Remark 1

Choose a mini-batch size of 32, 64, or another power of 2, as these sizes make the most of parallelization

#### Remark 2

* In each window, we have a certain number of words $2m+1$
* Our parameter vector $\theta$ is in $\mathbb{R}^{2dV}$, which is much bigger
* Hence, $\nabla_{\theta} J(\theta)$ is very sparse: most of its entries are zero

=> **Idea:**

Only update the word vectors that appear in our window!
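
A minimal sketch of such a sparse update, assuming the gradient has already been computed for the rows of the words in the window. The function name `sparse_sgd_update`, the toy sizes, and the all-ones gradients are hypothetical placeholders.

```python
import numpy as np

def sparse_sgd_update(embeddings, row_ids, row_grads, alpha):
    """Subtract alpha * gradient from the given rows only, in place.

    np.subtract.at is unbuffered, so a word that appears several times
    in the window accumulates all of its gradient contributions.
    """
    np.subtract.at(embeddings, row_ids, alpha * row_grads)

vocab_size, dim = 8, 3
V = np.zeros((vocab_size, dim))        # toy center-vector matrix (assumed)
window_ids = np.array([2, 5, 2])       # word 2 appears twice in this window
window_grads = np.ones((3, dim))       # placeholder gradients, one row per entry

sparse_sgd_update(V, window_ids, window_grads, alpha=0.1)
print(V)
```

Only rows 2 and 5 change; the other rows of the matrix are never touched, which avoids building and applying a mostly-zero gradient of full size.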

### Why 2 vectors for each word?

* It makes the optimization easier
* We just average the two vectors to get a single vector per word in the end

#### 2 Model variants

1. Skip-gram: Predict outside words (position independent) given center word
2. Continuous Bag of Words: predict center word from a Bag of context words

The results we get via these 2 methods are quite similar, as the dot product is symmetric

### Negative Sampling

**Idea:**

Train binary logistic regression for a true pair (center word & word in its context window)

VS

Several noise pairs (center word & a random word) = Negative samples

* Maximize probability that real outside word appears
* Minimize probability that random words appear around center word

We sample the negative words from the unigram distribution $U(w)$ raised to the 3/4 power:

$$P(w) = U(w)^{3/4} / Z$$

* The 3/4 power makes the most frequent words get sampled less often, and the less frequent words get sampled more often, than under the raw unigram distribution
* $Z$ is a normalization factor
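
A small sketch of this sampling distribution, using hypothetical word counts for a three-word vocabulary. The counts, the number of negatives drawn, and the variable names are assumptions for illustration.

```python
import numpy as np

counts = np.array([900.0, 90.0, 10.0])   # hypothetical counts: frequent, medium, rare
unigram = counts / counts.sum()          # U(w)

powered = unigram ** 0.75                # U(w)^(3/4)
Z = powered.sum()                        # normalization factor
p_neg = powered / Z                      # P(w) = U(w)^(3/4) / Z

rng = np.random.default_rng(0)
# Draw 5 negative-sample word indices according to P(w).
negatives = rng.choice(len(counts), size=5, p=p_neg)

print(unigram)
print(p_neg)
print(negatives)
```

Comparing `unigram` and `p_neg` shows the effect of the exponent: the most frequent word loses probability mass, while the rare word gains some.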

**Remark:**

The sigmoid function is like the softmax function for 2 classes: it maps a real value to a probability between 0 and 1
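
This relationship can be checked numerically: applying softmax to the two scores $(x, 0)$ gives $\sigma(x)$ for the first class and $\sigma(-x) = 1 - \sigma(x)$ for the second. The value of `x` and the helper names below are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(x):
    """Logistic function: maps a real value to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(scores):
    """Softmax over a vector of scores, shifted for numerical stability."""
    shifted = scores - np.max(scores)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

x = 1.7                                  # arbitrary score
two_class = softmax(np.array([x, 0.0]))  # class scores (x, 0)

print(two_class[0], sigmoid(x))
print(two_class[1], sigmoid(-x))
```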

## Count-based methods

## GloVe

## Evaluating word vectors

## Word Senses