<center>
<h2>Recurrent Neural Networks</h2>
<p style="text-align:center">
Natural Language Processing<br>
(COM4513/6513)<br>
<br>
<a href="http://andreasvlachos.github.io">Andreas Vlachos</a><br>
a.vlachos@sheffield.ac.uk<br>
<small>Department of Computer Science<br>
University of Sheffield
</small>
</p>
</center>

### Some reminders

N-gram language models:
- large-scale classifiers predicting the next word
- sparsity issues: smoothing, back-off, interpolation

Neural networks:
- can learn more complex decision boundaries
- can learn feature representations

Continuous word representations: words are no longer discrete objects,
but vectors that capture semantics.

### In this lecture

Recurrent neural networks applied to language modelling:
- revolutionized the way a lot of people think in NLP
- return word probabilities conditioned on all previous words as well as **sentence representations**
- their popularity can't be over-stated!

### Problem setup

Training data is a (large) set of sentences $\mathbf{x}^m$ with words $x_n$:

<p>
\begin{align}
D_{train} & = \{\mathbf{x}^1,...,\mathbf{x}^M\} \\
\mathbf{x}& = [x_1,... x_N]\\
\end{align}
</p>

<p class="fragment">
for example:
\begin{align}
\mathbf{x}=&[\text{None}, \text{The}, \text{water}, \text{is}, \text{clear}, \text{.}, \text{None}]
\end{align}
</p>

We want to learn a model that returns:
\begin{align}
\text{probability}\; P(\mathbf{x}) \; \text{for all} \; \mathbf{x}\in V^{maxN}
\end{align}
$V$ is the vocabulary and $V^{maxN}$ is the set of all possible sentences

### Rethinking Language Modelling

\begin{align}
P(\mathbf{x}) &= P(x_1,...,x_N) \\
&= P(x_1)P(x_2...x_N|x_1)\\
&= P(x_1)P(x_2|x_1) ... P(x_N|x_1,...,x_{N-1})\\
\end{align}

where:
$$P(x_n = k| x_{n-1}...x_1)=\frac{counts(x_1...x_{n-1}, k)}{counts(x_1...x_{n-1})}$$
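
The count-based estimate above can be sketched in a few lines of Python. This is a minimal illustration on a hypothetical three-sentence corpus (the sentences and the helper name `cond_prob` are assumptions, not part of the lecture); `None` marks sentence boundaries as in the earlier example.

```python
from collections import Counter

# Hypothetical toy corpus; None marks the sentence boundaries.
corpus = [
    [None, "the", "water", "is", "clear", ".", None],
    [None, "the", "water", "is", "cold", ".", None],
    [None, "the", "sky", "is", "clear", ".", None],
]

# Count every sentence prefix x_1 ... x_n.
prefix_counts = Counter()
for sentence in corpus:
    for n in range(1, len(sentence) + 1):
        prefix_counts[tuple(sentence[:n])] += 1


def cond_prob(word, history):
    """P(x_n = word | history) = counts(history, word) / counts(history)."""
    history = tuple(history)
    denominator = prefix_counts[history]
    if denominator == 0:
        return 0.0  # history never observed: no estimate is possible
    return prefix_counts[history + (word,)] / denominator


p_clear = cond_prob("clear", [None, "the", "water", "is"])
p_unseen = cond_prob("clear", [None, "the", "river", "is"])
print(p_clear, p_unseen)
```

With full histories, almost every long prefix is seen once or never, which is the sparsity problem that smoothing and the Markov assumption try to work around.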

A logistic regression classifier predicting the next word:

\begin{align}
p(x_n = k| x_{n-1}... x_1) &= \frac{\exp(\mathbf{w}_k \cdot \phi(x_{n-1}... x_1) )}{\sum_{k^\prime=1}^{|\cal V|}\exp(\mathbf{w}_{k^\prime} \cdot \phi(x_{n-1}... x_1) )} \\
&= softmax(\mathbf{W}\cdot\phi(x_{n-1}... x_1))
\end{align}

Why wouldn't this work?

### Sparsity

\begin{align}
p(x_n = k| x_{n-1}... x_1) &= \frac{\exp(\mathbf{w}_k \cdot \phi(x_{n-1}... x_1) )}{\sum_{k^\prime=1}^{|\cal V|}\exp(\mathbf{w}_{k^\prime} \cdot \phi(x_{n-1}... x_1) )} \\
&= softmax(\mathbf{W}\cdot\phi(x_{n-1}... x_1))
\end{align}

- $\mathbf{w}_k$ are the weights for word $k$
- $\phi(x_{n-1}... x_1)$ are the features extracted from the previous words (one-hot encoding of $x_{n-1}... x_1$)
- dimensionality of $\mathbf{W}$ is $|V|\times |V|^n$ (vocabulary size, contexts)!

We avoided the issue using the Markov assumption (**N-gram** LMs), but could we do something different without restricting the contexts?
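
To get a feel for the size of $\mathbf{W}$, the arithmetic can be checked directly. This is a small sketch that assumes a vocabulary of 10,000 words (an illustrative figure, not from the lecture) and uses the $|V|\times |V|^n$ count from the list above; `num_weights` is a hypothetical helper.

```python
def num_weights(vocab_size, context_len):
    """Entries in W: |V| output rows times |V|^n one-hot context features."""
    return vocab_size * vocab_size ** context_len


# Assumed vocabulary size, for illustration only.
vocab_size = 10_000
for n in (1, 2, 3, 5):
    print(f"n={n}: {num_weights(vocab_size, n):.1e} weights")
```

Even a two-word context already needs on the order of $10^{12}$ weights, most of which would never be observed in training data.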

### Skip-gram reminder

<img src="images/skipgram.png"  style="width:650px;">

$P(w_{t-1} | w_t) = \frac{\exp(\mathbf{c_{t-1}} \cdot \mathbf{w_t})}{\sum_{c^\prime \in V} \exp(\mathbf{c}^\prime \cdot \mathbf{w_t}) }$

Each word has two vectors, one for itself ($\mathbf{w}$) and one as context ($\mathbf{c}$)
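
As a quick reminder of how this probability is computed, here is a minimal NumPy sketch. The vectors are random rather than trained, and the sizes and names (`W_target`, `C_context`, `skipgram_probs`) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 6, 4  # assumed toy sizes

W_target = rng.normal(size=(vocab_size, d))   # one w vector per word
C_context = rng.normal(size=(vocab_size, d))  # one c vector per word


def skipgram_probs(target_id):
    """Distribution over context words given the target word w_t."""
    scores = C_context @ W_target[target_id]  # c' . w_t for every c' in V
    scores = scores - scores.max()            # numerical stability only
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()


probs = skipgram_probs(2)
print(probs)
```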

### From word embeddings to word sequence embeddings

\begin{align}
p(x_n = k| x_{n-1}... x_1) &= \frac{\exp(\mathbf{w}_k \cdot \phi(x_{n-1}... x_1) )}{\sum_{k^\prime=1}^{|\cal V|}\exp(\mathbf{w}_{k^\prime} \cdot \phi(x_{n-1}... x_1) )} \\
&= softmax(\mathbf{W}\cdot\phi(x_{n-1}... x_1))
\end{align}

Let's assume we have word embeddings $\mathbf{W}\in \Re^{|\cal V|\times d}$

We need to have a $\phi(x_{n-1}... x_1)$ that gives us a continuous representation in $\Re^{d\times 1}$ for each context

### Recurrent neural networks

<a href="http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/"><img src="images/rnn.jpg" style="width:700px; background:none; border:none; box-shadow:none;" /></a>

- **input**: $x_t \in \{0,1\}^{|\cal V|}$ are the words in one-hot encoding
- **hidden**: $s_{t-1} \in \Re^d$: "memory" of the context until word $x_{t-1}$
- **output**: $\mathbf{o}_{t-1} = p(x_{t}| x_{t-1}... x_1) \in \Re^{|\cal V|}$

### Recurrent neural networks

<a href="http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/"><img src="images/rnn.jpg" style="width:700px; background:none; border:none; box-shadow:none;" /></a>

Parameters to be learned:
- $\mathbf{U}\in \Re^{d\times |\cal V|}$: matrix containing the word vectors for all the words; the one-hot $x_t$ picks one column
- $\mathbf{W}\in \Re^{d \times d}$: controls how this memory is passed on
- $\mathbf{V}\in \Re^{|\cal V| \times d}$: maps the memory to probability for each word

### Recurrent neural networks

<a href="http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/"><img src="images/rnn.jpg" style="width:700px; background:none; border:none; box-shadow:none;" /></a>

$s_t = \sigma(\mathbf{W}s_{t-1} + \mathbf{U}x_t)$

$\mathbf{o}_{t} = p(x_{t+1}| x_{t}... x_1) = softmax(\mathbf{V}s_{t})$
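
The two equations above can be sketched as a forward pass in NumPy. This is only an illustration: the parameters are random rather than trained, the word ids are hypothetical, and the all-zero initial state $s_0$ is an assumption the slides do not specify.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()


def rnn_forward(word_ids, U, W, V):
    """Return hidden states s_t and next-word distributions o_t."""
    s = np.zeros(W.shape[0])  # assumed initial state s_0 = 0
    states, outputs = [], []
    for k in word_ids:
        # U x_t with a one-hot x_t simply selects column k of U.
        s = sigmoid(W @ s + U[:, k])
        states.append(s)
        outputs.append(softmax(V @ s))
    return np.array(states), np.array(outputs)


rng = np.random.default_rng(0)
vocab_size, d = 7, 5  # assumed toy sizes
U = rng.normal(scale=0.1, size=(d, vocab_size))
W = rng.normal(scale=0.1, size=(d, d))
V = rng.normal(scale=0.1, size=(vocab_size, d))

sentence = [0, 3, 5, 2, 4, 1]  # hypothetical word ids
states, outputs = rnn_forward(sentence[:-1], U, W, V)

# Chain rule: sum the log-probability assigned to each actual next word.
log_prob = sum(
    np.log(outputs[t][sentence[t + 1]]) for t in range(len(sentence) - 1)
)
print(log_prob)
```

Note that the number of parameters does not depend on the length of the context: the same $\mathbf{U}$, $\mathbf{W}$ and $\mathbf{V}$ are applied at every step.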

### Training RNNs

We need to learn the word vectors $\mathbf{U}$, hidden and output layer parameters $\mathbf{W}, \mathbf{V}$

Standard backpropagation cannot be applied directly because of the recurrence: the hidden layer parameters $\mathbf{W}$ are reused at every time step

**Backpropagation through time**: unroll the graph for $n$ steps and sum the gradients from all steps when updating the shared parameters

Not as restrictive as an $n$th-order Markov assumption: we still use all previous words through the recurrence
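
A minimal NumPy sketch of backpropagation through time for the simple RNN defined earlier. It rests on some assumptions: the loss is the summed cross-entropy of the next-word predictions, the initial state is zero, and the graph is unrolled over the whole (short) sentence rather than truncated to $n$ steps. The function name `loss_and_grads` and the toy word ids are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def loss_and_grads(inputs, targets, U, W, V):
    """Cross-entropy loss and its gradients w.r.t. U, W, V via BPTT."""
    d = W.shape[0]
    states = [np.zeros(d)]  # states[0] is the assumed zero initial state
    outputs = []
    loss = 0.0
    for x, y in zip(inputs, targets):  # forward pass, keeping every state
        s = sigmoid(W @ states[-1] + U[:, x])
        o = softmax(V @ s)
        loss -= np.log(o[y])
        states.append(s)
        outputs.append(o)

    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    ds_next = np.zeros(d)  # gradient flowing back from step t+1
    for t in reversed(range(len(inputs))):
        do = outputs[t].copy()
        do[targets[t]] -= 1.0           # d loss / d (V s_t)
        s, s_prev = states[t + 1], states[t]
        dV += np.outer(do, s)
        ds = V.T @ do + ds_next         # from the output and from the future
        da = ds * s * (1.0 - s)         # through the sigmoid
        dW += np.outer(da, s_prev)      # same W at every step: gradients sum
        dU[:, inputs[t]] += da
        ds_next = W.T @ da
    return loss, (dU, dW, dV)


rng = np.random.default_rng(1)
vocab_size, d = 6, 4  # assumed toy sizes
U = rng.normal(scale=0.5, size=(d, vocab_size))
W = rng.normal(scale=0.5, size=(d, d))
V = rng.normal(scale=0.5, size=(vocab_size, d))
word_ids = [0, 2, 4, 1, 3, 5]  # hypothetical sentence

loss, (dU, dW, dV) = loss_and_grads(word_ids[:-1], word_ids[1:], U, W, V)
print(loss)
```

The `+=` on `dW` is where the summing over time steps happens: each unrolled copy of the hidden layer contributes its own gradient to the one shared matrix.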

### Long-range dependencies

In practice, simple RNNs struggle to capture long-range dependencies:
- they effectively have one layer per word in the sentence
- all context information has to be passed through the hidden layer

<a href="https://chunml.github.io/ChunML.github.io/project/Creating-Text-Generator-Using-Recurrent-Neural-Network/"><img src="images/LSTM.png" style="width:600px; background:none; border:none; box-shadow:none;" /></a>

**Long short-term memory networks** (LSTMs) address this by adding an extra "memory" cell
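
For reference, one step of an LSTM can be sketched as follows. The slides do not give the equations, so this uses the commonly taught formulation with input, forget and output gates; the parameter names in the `params` dictionary and the toy sizes are assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step: returns the new hidden state h and memory cell c."""
    z = np.concatenate([h_prev, x])
    i = sigmoid(params["Wi"] @ z + params["bi"])  # input gate
    f = sigmoid(params["Wf"] @ z + params["bf"])  # forget gate
    o = sigmoid(params["Wo"] @ z + params["bo"])  # output gate
    g = np.tanh(params["Wg"] @ z + params["bg"])  # candidate cell content
    c = f * c_prev + i * g   # memory cell: additive update, gated
    h = o * np.tanh(c)       # hidden state exposed to the next layer
    return h, c


rng = np.random.default_rng(0)
d, input_dim = 4, 3  # assumed toy sizes
params = {}
for gate in "ifog":
    params["W" + gate] = rng.normal(scale=0.1, size=(d, d + input_dim))
    params["b" + gate] = np.zeros(d)

h, c = np.zeros(d), np.zeros(d)
for x in rng.normal(size=(5, input_dim)):  # five hypothetical input vectors
    h, c = lstm_step(x, h, c, params)
print(h)
```

The cell `c` is updated additively rather than being squashed through a non-linearity at every step, which is what lets information survive over longer spans.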

### Variants

In language modelling we use word sequence representations at each time-step to predict the next word. Other tasks?

<a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/"><img src="images/rnns.jpeg" width="800" style="background:none; border:none; box-shadow:none;" /></a>
- many to one (e.g. sentiment analysis)
- many to many (equal) (e.g. PoS tagging)
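
The two variants in the list differ only in which hidden states feed the output layer. A minimal sketch, assuming the simple RNN from earlier with untrained random parameters; the output matrices `V_sent` and `V_tag`, the label counts and the word ids are hypothetical.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def hidden_states(word_ids, U, W):
    """Run the simple RNN and return one hidden state per word."""
    s = np.zeros(W.shape[0])  # assumed zero initial state
    states = []
    for k in word_ids:
        s = 1.0 / (1.0 + np.exp(-(W @ s + U[:, k])))
        states.append(s)
    return np.array(states)


rng = np.random.default_rng(0)
vocab_size, d, n_sentiments, n_tags = 8, 5, 2, 3  # assumed toy sizes
U = rng.normal(scale=0.1, size=(d, vocab_size))
W = rng.normal(scale=0.1, size=(d, d))
V_sent = rng.normal(scale=0.1, size=(n_sentiments, d))
V_tag = rng.normal(scale=0.1, size=(n_tags, d))

states = hidden_states([1, 4, 2, 7], U, W)

# Many to one: only the last state, the sentence representation, is used.
sentiment_probs = softmax(V_sent @ states[-1])

# Many to many (equal): one prediction per word, from every state.
tag_probs = softmax(states @ V_tag.T)
print(sentiment_probs.shape, tag_probs.shape)
```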

### Representations

RNNs learn word and sentence representations

Word representations are less interesting: RNNs are slower to train than skip-gram,
and are thus trained on less data
- hint: use skip-gram to initialize (pre-train) the RNN word vectors

RNN sentence representations though are used often!

<h3>Textual entailment</h3>
<a href="http://arxiv.org/abs/1509.06664"><img src="images/rockt.jpg" style="width:800px; background:none; border:none; box-shadow:none;" /></a>

<h3>Machine translation</h3>

<a href="http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/"><img src="images/rnn_mt.png" style="width:800px; background:none; border:none; box-shadow:none;" /></a>


### Convolutional neural nets
- operate on short context windows, e.g. 5 words
- popular in vision to model pixel neighborhoods
- used in tasks where modelling the full sequence (sentence, document) is no more useful than modelling its parts
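
A convolution over word windows can be sketched as below. This is an illustration under assumptions the slides do not state: random word embeddings and filters, a ReLU non-linearity, and max-pooling over positions to get a fixed-size vector. The function name `conv1d_over_words` is hypothetical.

```python
import numpy as np


def conv1d_over_words(embeddings, filters, bias):
    """Apply each filter to every window of consecutive words."""
    n_words, _ = embeddings.shape
    n_filters, width, _ = filters.shape
    n_windows = n_words - width + 1
    out = np.empty((n_windows, n_filters))
    for i in range(n_windows):
        window = embeddings[i:i + width]  # shape (width, d)
        # Dot product of every filter with the current window.
        out[i] = np.tensordot(filters, window, axes=([1, 2], [0, 1])) + bias
    return np.maximum(out, 0.0)  # ReLU (an assumed choice)


rng = np.random.default_rng(0)
n_words, d, width, n_filters = 9, 4, 5, 3  # 5-word windows, as in the list
embeddings = rng.normal(size=(n_words, d))
filters = rng.normal(scale=0.1, size=(n_filters, width, d))
bias = np.zeros(n_filters)

features = conv1d_over_words(embeddings, filters, bias)
pooled = features.max(axis=0)  # one value per filter, whatever the length
print(features.shape, pooled.shape)
```

Unlike the RNN, each feature here only ever sees `width` words at a time; the pooling step combines the local evidence without modelling the order of the windows.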

<a href="http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/"><img src="images/CNNs_NLP.png" width="1000" style="background:none; border:none; box-shadow:none;" /></a>


### Multimodal processing

<a href="http://kelvinxu.github.io/projects/capgen.html"><img src="images/capgen.png"/></a>


### Bibliography

- The lecture followed roughly Cho's [lecture notes](http://arxiv.org/pdf/1511.07916v1.pdf) (section 5.5, but check references to earlier chapters)
- [Notes](https://web.stanford.edu/class/cs224n/lecture_notes/cs224n-2017-notes5.pdf) from Stanford's NLP course
- This [blog](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) has great visualizations and easy-to-read code
- For more on LSTMs see this [blog post](http://colah.github.io/posts/2015-08-Understanding-LSTMs/), from which the LSTM figures were taken, and [this](http://iamtrask.github.io/2015/11/15/anyone-can-code-lstm/) for some Python code
- For more NLP references check Yoav Goldberg's [tutorial](http://u.cs.biu.ac.il/~yogo/nnlp.pdf), section 10
- Chapters 6, 7 and 8 of the recently released [deep learning book](http://www.deeplearningbook.org), as well as chapter 10 and section 12.4


### Coming up next

Natural Language Generation

Machine Translation