In [2]:
import numpy as np
def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    return np.dot(a, b) / (np.sqrt(np.dot(a, a)) * np.sqrt(np.dot(b, b)))

<center>
<h2>Distributed word representations</h2>
<p style="text-align:center">
Natural Language Processing<br>
(COM4513/6513)<br>
<br>
<a href="http://andreasvlachos.github.io">Andreas Vlachos</a><br>
a.vlachos@sheffield.ac.uk<br>
<small>Department of Computer Science<br>
University of Sheffield
</small>
</p>
</center>

### So far

We have represented words as discrete objects.

In [4]:
represention_appricot = np.array([0,0,0,0,1,0,0,0])
represention_pineapple = np.array([0,0,0,1,0,0,0,0])
BoW_appricot_pineapple = represention_appricot +  represention_pineapple
BoW_appricot_pineapple

array([0, 0, 0, 1, 1, 0, 0, 0])

Also known as **one-hot encoding**. What's the problem with it?

Every word is equally different from every other word. But pineapple and apricot are related!
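To make "equally different" concrete, here is a minimal sketch that builds one-hot vectors for a small vocabulary and compares every pair. The four-word vocabulary is chosen only for illustration.

```python
import numpy as np

# Illustrative vocabulary; row i of `one_hot` is the one-hot vector of vocab[i].
vocab = ["apricot", "pineapple", "digital", "information"]
one_hot = np.eye(len(vocab), dtype=int)

# Pairwise dot products: 1 on the diagonal, 0 for every pair of distinct words.
similarities = one_hot @ one_hot.T

# Pairwise Euclidean distances: 0 on the diagonal, sqrt(2) for every other pair.
differences = one_hot[:, None, :] - one_hot[None, :, :]
distances = np.sqrt((differences ** 2).sum(axis=-1))
```

Whatever the vocabulary, the off-diagonal entries are all identical, so the encoding cannot tell related words from unrelated ones.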

### In this lecture

We will challenge the discreteness of word representations.

Instead of one-hot encodings, we will map words to distributed representations, a.k.a. **embeddings**.

### A quick test

What is *tesguino*?

Some sentences mentioning it: 
- A bottle of *tesguino* is on the table.
- Everybody likes *tesguino*.
- *Tesguino* makes you drunk.
- We make *tesguino* out of corn.

> You shall know a word by the company it keeps (Firth, 1957)

Words appearing in similar contexts are likely to have similar meanings.

### Word-context matrix

![coocs](images/coocs.png)

- typically **very sparse**
- the shorter the context window, the more **syntactic** the representations
- the longer the context window, the more **semantic** the representations

Is the representation above **syntactic** or **semantic**?
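A word-context matrix can be built by counting which words fall within a fixed window around each word. The sketch below is one way to do it; the two-sentence corpus, the function name `cooccurrence_counts` and the window size are illustrative assumptions, not the data behind the figure.

```python
from collections import Counter

import numpy as np

# Toy corpus (hypothetical, not the data behind the figure).
corpus = [
    "we make tesguino out of corn".split(),
    "we make wine out of grapes".split(),
]

def cooccurrence_counts(sentences, window=2):
    """Count (word, context) pairs within +/- `window` positions."""
    counts = Counter()
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo = max(0, i - window)
            hi = min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own context
                    counts[(word, sentence[j])] += 1
    return counts

counts = cooccurrence_counts(corpus, window=2)
vocab = sorted({w for sentence in corpus for w in sentence})
index = {w: k for k, w in enumerate(vocab)}

# Dense word-context matrix: rows are words, columns are context words.
matrix = np.zeros((len(vocab), len(vocab)), dtype=int)
for (word, context), n in counts.items():
    matrix[index[word], index[context]] = n
```

In this toy corpus *tesguino* and *wine* occur in the same contexts, so `matrix[index["tesguino"]]` and `matrix[index["wine"]]` are identical rows. A real vocabulary would call for a sparse matrix rather than a dense array.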

### One-hot vs  distributed representations

In [5]:
appricot_1hot = np.array([1,0,0,0]) # length equal to vocabulary size
pineapple_1hot = np.array([0,1,0,0])
digital_1hot = np.array([0,0,1,0])

In [6]:
appricot_dist = np.array([0,0,0,1,0,1]) # length equal to number of contexts from the table in the previous slide
pineapple_dist = np.array([0,0,0,1,0,1])
digital_dist = np.array([0,2,1,0,1,0])

How to calculate vector similarity?

### Similarity

$\text{dot-product}(\mathbf{x_1},\mathbf{x_2})= \mathbf{x_1} \cdot \mathbf{x_2} = \sum_{c=1}^{|C|} x_1^c x_2^c$

In [7]:
np.dot(appricot_1hot, pineapple_1hot)

0

In [20]:
np.dot(appricot_dist, pineapple_dist)

2

In [22]:
np.dot(appricot_dist, digital_dist)

0

Works! Any problems with the dot product?

Frequent words occur in more contexts, so they get a high dot product with almost any word.

In [24]:
the_dist = np.array([5,2,3,4,5,1])
np.dot(the_dist, pineapple_dist)

5

### Cosine similarity

Solution: divide the dot product by the lengths (norms) of the two vectors, which gives the cosine of the angle between them.

$\text{cosine}(\mathbf{x_1}, \mathbf{x_2}) =
							 \frac{\mathbf{x_1} \cdot \mathbf{x_2}}{|\mathbf{x_1}||\mathbf{x_2}|} =
							 \frac{\sum_{c=1}^{|C|}x_1^c x_2^c}{\sqrt{\sum_{c=1}^{|C|} (x_1^c)^2}\sqrt{\sum_{c=1}^{|C|} (x_2^c)^2}}$

In [29]:
cosine(appricot_dist, the_dist)

0.39528470752104738

In [30]:
cosine(appricot_dist, pineapple_dist)

0.99999999999999978

Other options: Dice, Jaccard, etc.
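As a rough sketch of these alternatives, the functions below use the weighted (min/max) variants of Jaccard and Dice that are commonly applied to count vectors. These particular definitions are an assumption, since the slide does not specify them, and they require non-negative vectors that are not all zero. The vectors repeat the values used earlier.

```python
import numpy as np

def jaccard(a, b):
    """Weighted Jaccard: sum of element-wise minima over sum of maxima."""
    return np.minimum(a, b).sum() / np.maximum(a, b).sum()

def dice(a, b):
    """Weighted Dice: twice the sum of minima over the total mass of both vectors."""
    return 2 * np.minimum(a, b).sum() / (a.sum() + b.sum())

# Same values as the distributed representations defined earlier.
appricot_dist = np.array([0, 0, 0, 1, 0, 1])
pineapple_dist = np.array([0, 0, 0, 1, 0, 1])
digital_dist = np.array([0, 2, 1, 0, 1, 0])
the_dist = np.array([5, 2, 3, 4, 5, 1])

same = jaccard(appricot_dist, pineapple_dist)      # identical vectors
disjoint = jaccard(appricot_dist, digital_dist)    # no shared contexts
frequent = jaccard(the_dist, pineapple_dist)       # frequent word vs. rare word
```

Like cosine, both measures give 1 for identical vectors and 0 for vectors with no shared contexts, and they penalise the frequent word *the* for the many contexts it does not share with *pineapple*.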

### Problem with counts

Counts are OK, but frequent words (articles, pronouns, etc.) dominate contexts without being informative.

Let's add *the*, which appears often with most nouns, to the contexts:

In [37]:
appricot_dist_the = np.array([0,0,0,1,0,1,10]) # last element is the counts of "the"
pineapple_dist_the = np.array([0,0,0,1,0,1,30])
digital_dist_the = np.array([0,2,1,0,1,0,10])
information_dist_the = np.array([0,1,6,0,4,0,20])
cosine(digital_dist_the, pineapple_dist_the)

0.97020845119663623

In [38]:
cosine(digital_dist_the, information_dist_the)

0.96746180284073735

### Pointwise mutual information

Pointwise mutual information measures how often two events co-occur relative to how often they would co-occur if they were independent:

$$PMI(word,context) = \log_2 \frac{P(word,context)}{P(word)P(context)}$$

Positive values quantify relatedness.  Use PMI instead of counts.

Negative values? Usually ignored (positive PMI):

$$PPMI(word,ctxt) = \max(PMI(word,ctxt),0)$$
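The following is a minimal sketch of PPMI weighting for a word-context count matrix. It assumes the probabilities are estimated as relative frequencies over the matrix itself (joint counts divided by the total, marginals from row and column sums); the function name `ppmi` is illustrative. The counts repeat the four vectors with the extra *the* column.

```python
import numpy as np

def ppmi(counts):
    """PPMI matrix from a word-by-context count matrix (relative-frequency estimates)."""
    counts = np.asarray(counts, dtype=float)
    p_wc = counts / counts.sum()              # P(word, context)
    p_w = p_wc.sum(axis=1, keepdims=True)     # P(word), from row sums
    p_c = p_wc.sum(axis=0, keepdims=True)     # P(context), from column sums
    expected = p_w * p_c                      # joint probability under independence
    pmi = np.full(counts.shape, -np.inf)      # unseen pairs get PMI = -inf
    seen = counts > 0
    pmi[seen] = np.log2(p_wc[seen] / expected[seen])
    return np.maximum(pmi, 0.0)               # clip negative values to zero

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Rows: apricot, pineapple, digital, information; the last column counts "the".
counts = np.array([
    [0, 0, 0, 1, 0, 1, 10],
    [0, 0, 0, 1, 0, 1, 30],
    [0, 2, 1, 0, 1, 0, 10],
    [0, 1, 6, 0, 4, 0, 20],
])
weights = ppmi(counts)
digital_vs_pineapple = cosine(weights[2], weights[1])
digital_vs_information = cosine(weights[2], weights[3])
```

On this toy matrix, which treats the four rows as the whole corpus, the *the* column is heavily down-weighted: *digital* and *pineapple* no longer share any positively weighted context, while *digital* and *information* still do.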

### Choice of contexts

We can refine contexts using:
- their part-of-speech tags (*bank_V* vs. *bank_N*)
- syntactic dependencies (*eat_dobj* vs. *eat_subj*)

We can weigh contexts according to the distance from the word: the further away, the lower the weight.

If we are using a window size $w$, multiply the count of the context word at each position by $\frac{w-distance+1}{w}$, where the adjacent word has $distance=1$, e.g. for $w=3$:

$[\frac{1}{3}, \frac{2}{3}, \frac{3}{3}, word, \frac{3}{3}, \frac{2}{3}, \frac{1}{3}]$ 
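The example weights above correspond to $(w - distance + 1)/w$, with distance 1 for the adjacent word. A minimal sketch of collecting weighted contexts under that reading (the function name `weighted_contexts` and the sentence are illustrative):

```python
from collections import defaultdict

def weighted_contexts(tokens, position, window=3):
    """Distance-weighted context counts for the word at `position`."""
    weights = defaultdict(float)
    for distance in range(1, window + 1):
        weight = (window - distance + 1) / window  # 3/3, 2/3, 1/3 for window=3
        for j in (position - distance, position + distance):
            if 0 <= j < len(tokens):  # skip positions outside the sentence
                weights[tokens[j]] += weight
    return dict(weights)

tokens = "we make tesguino out of corn".split()
contexts = weighted_contexts(tokens, position=2, window=3)
```

Here `contexts` gives *make* and *out* a weight of 1, *we* and *of* a weight of 2/3, and *corn* a weight of 1/3.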

### Singular value decomposition

PPMI matrices are good, but:
- high dimensional
- very sparse

Dimensionality reduction using *truncated* singular value decomposition:

$$ PPMI^{|V|\times|C|} \approx W^{|V|\times k}  S^{k \times k} C^{k \times |C|} $$

Approximation is good: exploits redundancy to remove noise 
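A truncated SVD can be sketched with `np.linalg.svd` by keeping only the top $k$ singular values and vectors. The random matrix below is a hypothetical stand-in for a PPMI matrix, and using $WS$ as the word embeddings is one common choice rather than the only one.

```python
import numpy as np

def truncated_svd(matrix, k):
    """Return W (|V| x k), S (k x k) and C (k x |C|) with matrix ~= W @ S @ C."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    W = U[:, :k]          # top-k left singular vectors, one row per word
    S = np.diag(s[:k])    # top-k singular values on the diagonal
    C = Vt[:k, :]         # top-k right singular vectors, one column per context
    return W, S, C

rng = np.random.default_rng(0)
ppmi_matrix = rng.random((6, 8))   # hypothetical stand-in for a PPMI matrix

W, S, C = truncated_svd(ppmi_matrix, k=2)
approximation = W @ S @ C          # rank-2 approximation of the original matrix
embeddings = W @ S                 # one k-dimensional vector per word
```

For large sparse matrices a sparse solver would be preferable to a full dense decomposition.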

### Singular value decomposition

<img src="images/trunc_svd.png">

### Skip-gram (Mikolov et al. 2013)

Running SVD on large matrices is expensive.

Let's look at one word at a time:

$P(ctxt|w_t) = P(w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}|w_t)  \quad$ (<b>skip</b>-gram)

$P(ctxt|w_t) = \prod_{w_c \in ctxt} P(w_c | w_t)  \quad \quad\quad$ (word independence)

$P(w_c | w_t) = \frac{\exp(\mathbf{w_c} \cdot \mathbf{w_t})}{\sum_{w_c^\prime \in V} \exp(\mathbf{w_c}^\prime \cdot \mathbf{w_t}) } \quad \quad$ (each word $w$ is a word vector $\mathbf{w}$)

A **giant logistic regression classifier**: words are the labels. Raw text gives us positive examples, but where do the negative ones come from?

**Negative sampling**: negative training examples are sub-sampled randomly, all positive ones are kept
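The two ideas above can be sketched in NumPy: the full softmax over the vocabulary, and a negative-sampling loss that replaces it with a few binary decisions. The vocabulary size, dimensionality, random initialisation and uniform sampling of negatives are simplifying assumptions made for illustration; word2vec itself draws negatives from a smoothed unigram distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 4   # hypothetical toy sizes
target_vectors = rng.normal(scale=0.1, size=(vocab_size, dim))
context_vectors = rng.normal(scale=0.1, size=(vocab_size, dim))

def softmax_context_probs(target_id):
    """P(w_c | w_t) for every candidate context word w_c in the vocabulary."""
    scores = context_vectors @ target_vectors[target_id]
    scores -= scores.max()               # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(target_id, context_id, num_negatives=3):
    """Logistic loss for one observed pair plus uniformly sampled negatives."""
    t = target_vectors[target_id]
    # Simplification: negatives are uniform and may coincide with the true context.
    negative_ids = rng.integers(0, vocab_size, size=num_negatives)
    positive_term = -np.log(sigmoid(context_vectors[context_id] @ t))
    negative_term = -np.log(sigmoid(-(context_vectors[negative_ids] @ t))).sum()
    return positive_term + negative_term

probs = softmax_context_probs(target_id=0)
loss = negative_sampling_loss(target_id=0, context_id=1)
```

The softmax needs a sum over the whole vocabulary for every training pair, whereas the negative-sampling loss only touches `num_negatives + 1` context vectors. Training would minimise this loss with stochastic gradient descent, which is omitted here.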

### Skip-gram

<img src="images/skipgram.png">

### Each word has two embeddings
![embedding](images/embedding.png)

We can discard the context word embeddings, add them to the target word embeddings, or concatenate the two.
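The three options for combining the two embedding matrices can be sketched as follows. The random matrices are hypothetical stand-ins for trained target and context embeddings (5 words, 3 dimensions).

```python
import numpy as np

rng = np.random.default_rng(0)
target_vectors = rng.normal(size=(5, 3))    # one row per word, as a target
context_vectors = rng.normal(size=(5, 3))   # one row per word, as a context

only_target = target_vectors                               # discard context embeddings
summed = target_vectors + context_vectors                  # add them, same dimensionality
concatenated = np.concatenate(
    [target_vectors, context_vectors], axis=1              # concatenate, double dimensionality
)
```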

### Evaluation

Intrinsic:
- similarity: order word pairs according to their semantic similarity
- in-context similarity: substitute a word in a sentence without changing its meaning
- analogy: Athens is to Greece what Rome is to ...?
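One common way to answer analogy questions is the vector-offset method: compute $\mathbf{b} - \mathbf{a} + \mathbf{c}$ and return the nearest remaining word by cosine. The sketch below assumes that method, and the 2-dimensional vectors are made up so that the capital-to-country offset is the same for both pairs.

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical vectors, constructed for illustration only.
vectors = {
    "Athens": np.array([1.0, 3.0]),
    "Greece": np.array([1.0, 1.0]),
    "Rome": np.array([3.0, 3.0]),
    "Italy": np.array([3.0, 1.0]),
    "France": np.array([5.0, 1.0]),
}

def analogy(a, b, c):
    """Answer 'a is to b what c is to ?' with the vector-offset method."""
    query = vectors[b] - vectors[a] + vectors[c]
    # The three question words are excluded from the candidate answers.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

answer = analogy("Athens", "Greece", "Rome")
```

With these constructed vectors `answer` is `"Italy"`. With real embeddings the offsets are only approximately parallel, so accuracy is measured over a large set of such questions.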

Extrinsic: use them to improve performance on a downstream task, e.g. replace the bag of words with a **bag of word vectors**.

They are an easy way to take advantage of unlabeled data to do **semi-supervised learning**.
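A bag of word vectors can be sketched as the average of the embeddings of the words in a text, which then serves as the feature vector for a classifier. Averaging (rather than summing) and the tiny embedding table are assumptions made for illustration.

```python
import numpy as np

# Hypothetical pre-trained embeddings, e.g. learned from unlabeled text.
embeddings = {
    "tesguino": np.array([0.9, 0.1, 0.0]),
    "makes": np.array([0.1, 0.8, 0.1]),
    "you": np.array([0.0, 0.2, 0.7]),
    "drunk": np.array([0.8, 0.0, 0.3]),
}

def bag_of_word_vectors(tokens, embeddings, dim=3):
    """Average the embeddings of the known tokens; zeros if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

features = bag_of_word_vectors("tesguino makes you drunk".split(), embeddings)
```

Unlike a bag of words, the resulting feature vector has a fixed, small dimensionality, and texts that use different but related words end up with similar features.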

### Best word vectors?

- high-dimensional (processed) counts?
- low-dimensional neural/SVD?

A paper by Levy et al. (2015) showed that the choice of context window size, rare word removal, etc. matter more than the choice between these two families.

The choice of texts used to obtain the counts matters. More text is better, and low-dimensional methods scale better.

### What about polysemy?

All occurrences of a word (and all its senses) are represented by one vector.

How do we handle polysemy?
- all senses are present in the vector
- given a task, it is often useful to adapt the vectors to represent the appropriate sense

### Limitations

- **antonyms** appear in similar contexts, so they are hard to distinguish from synonyms

- **compositionality**: what is the meaning of a sequence of words? While we might be able to obtain context vectors for short phrases, this doesn't scale to whole sentences, paragraphs, etc.

Soon we will see methods dedicated to learning embeddings for word sequences from word embeddings: recurrent neural networks!

### Bibliography
- Jurafsky & Martin [Chapter 15](https://web.stanford.edu/~jurafsky/slp3/15.pdf) and [Chapter 16](https://web.stanford.edu/~jurafsky/slp3/16.pdf)
- Omer Levy's [article](https://levyomer.files.wordpress.com/2015/03/improving-distributional-similarity-tacl-2015.pdf)
- Turian et al.'s [paper](http://www.aclweb.org/anthology/P10-1040) on how to use them

### Coming up next

Neural Networks