%run "preamble.ipynb"
%matplotlib inline
import torch
from torch.autograd import Variable
from dl4nlp.util import *
from torch import nn
import torch.nn.functional as F
import numpy as np
from torch.nn import Parameter
from dl4nlp.tikz import *
import IPython.display
IPython.display.display_latex(IPython.display.Latex(filename="tex/macros.tex"))

<center>
<h1>Deep Learning for Natural Language Processing III</h1>
<h2>Attention</h2>
<br>
Tim Rocktäschel<br>
<a href="https://rockt.github.com">rockt.github.com</a> <a href="mailto:tim.rocktaschel@cs.ox.ac.uk">tim.rocktaschel@cs.ox.ac.uk</a> <a href="https://twitter.com/_rockt">Twitter: @_rockt</a><br>
<img src="./figures/oxford.svg" width=30%><br>
2nd Int'l Summer School on Data Science, Split, Croatia<br>
27th September 2017<br>
</center>

# Use-case: Recognizing Textual Entailment (RTE)

- **A wedding party is taking pictures**
  - There is a funeral: **<span class=red>Contradiction</span>**
  - They are outside: **<span class=blue>Neutral</span>**
  - Someone got married: **<span class=green>Entailment</span>**

### State of the Art until 2015

- Engineered natural language processing pipelines
- Various external resources
- Specialized subcomponents
- Extensive manual creation of **features**:
  - Negation detection, word overlap, part-of-speech tags, dependency parses, alignment, unaligned matching, chunk alignment, synonym, hypernym, antonym, denotation graph
  
<div class=cite>[Lai and Hockenmaier, 2014, Jimenez et al., 2014, Zhao et al., 2014, Beltagy et al., 2015, ...]</div>

### Neural Networks for RTE

**Previous RTE corpora**:
- Tiny data sets (1k-10k examples)
- Partly synthetic examples

**Stanford Natural Language Inference Corpus (SNLI)**:
- About 570k sentence pairs
- Two orders of magnitude larger than existing RTE data sets
- All examples generated by humans


### Independent Sentence Encoding

Same LSTM encodes premise and hypothesis

<img src="./figures/3-attention/rte_encoding.svg" width=60%/> 

<div class=cite>[Bowman et al., 2015]</div>
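
To make "the same LSTM encodes premise and hypothesis" concrete, the sketch below runs one set of LSTM parameters over two sentences independently. It is a minimal NumPy sketch, not the implementation behind the cited results: the sizes `d` and `k`, the random stand-in embeddings, and the helper name `lstm_encode` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_encode(X, W, b, k):
    """Run an LSTM over X (shape L x d) from a zero state; return the final hidden state."""
    h, c = np.zeros(k), np.zeros(k)
    for x in X:
        # One affine map yields the pre-activations of all four gates.
        z = W @ np.concatenate([x, h]) + b
        i, f, o = sigmoid(z[:k]), sigmoid(z[k:2 * k]), sigmoid(z[2 * k:3 * k])
        g = np.tanh(z[3 * k:])
        c = f * c + i * g          # update the cell state
        h = o * np.tanh(c)         # expose part of it as the hidden state
    return h

rng = np.random.default_rng(0)
d, k = 8, 5                        # hypothetical embedding and hidden sizes
W = rng.normal(scale=0.1, size=(4 * k, d + k))
b = np.zeros(4 * k)

premise = rng.normal(size=(6, d))      # 6 stand-in word embeddings
hypothesis = rng.normal(size=(3, d))   # 3 stand-in word embeddings

# The same parameters (W, b) encode both sentences, and neither sees the other.
h_premise = lstm_encode(premise, W, b, k)
h_hypothesis = lstm_encode(hypothesis, W, b, k)
pair_vector = np.concatenate([h_premise, h_hypothesis])
```

Each sentence is compressed into a single vector of size `k` before the two are compared, which is the bottleneck the quote that follows complains about.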


> You can’t cram the meaning of a whole %&!\$# sentence into a single \$&!#* vector!
>
> -- <cite>Raymond J. Mooney</cite>

### Independent Sentence Encoding

<img src="./figures/3-attention/mlp.svg" width=60%/> 

<div class=cite>[Bowman et al., 2015]</div>
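
The classifier on top of the two sentence vectors can be sketched as a small multi-layer perceptron. This is only an illustration under stated assumptions: a single tanh hidden layer, made-up sizes, random weights, and an arbitrary label order. The depth and width of the network in the figure are not reproduced here.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()

def mlp_classify(pair_vector, W1, b1, W2, b2):
    """One tanh hidden layer followed by a softmax over the three RTE labels."""
    hidden = np.tanh(W1 @ pair_vector + b1)
    return softmax(W2 @ hidden + b2)

rng = np.random.default_rng(0)
k, hidden_size = 5, 7                      # hypothetical sizes
labels = ["entailment", "neutral", "contradiction"]  # order is an assumption

# Stand-ins for the premise and hypothesis vectors from the shared encoder.
h_premise, h_hypothesis = rng.normal(size=k), rng.normal(size=k)
pair_vector = np.concatenate([h_premise, h_hypothesis])

W1 = rng.normal(scale=0.1, size=(hidden_size, 2 * k))
b1 = np.zeros(hidden_size)
W2 = rng.normal(scale=0.1, size=(len(labels), hidden_size))
b2 = np.zeros(len(labels))

probs = mlp_classify(pair_vector, W1, b1, W2, b2)
prediction = labels[int(np.argmax(probs))]
```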


## Results

| Model | k | θ<sub>W+M</sub> | θ<sub>M</sub> | Train | Dev | Test |
|-|-|-|-|-|-|-|
| LSTM [<span class=blue>Bowman et al.</span>] | 100 | \\(\approx\\)10M | 221k | 84.4 | - | 77.6|
| Classifier [<span class=blue>Bowman et al.</span>]| - | - | - | 99.7 | - | 78.2|

### Conditional Encoding

<img src="figures/3-attention/conditional_encoding.svg" width=60%/> 

\begin{align}
\text{softmax}(\text{tanh}(\mathbf{W}\mathbf{h}_N))
\end{align}
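
A minimal NumPy sketch of the idea: the hypothesis encoder does not start from a zero state but from the state the premise encoder ended in, and the last hidden state $\mathbf{h}_N$ is classified with the formula above. Several details are assumptions rather than facts from the slide: both the hidden and the cell state are carried over, the two encoders use separate parameters, $\mathbf{W}$ maps directly to three classes, and all sizes and weights are made up.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def lstm_run(X, W, b, h, c):
    """Run an LSTM over X starting from state (h, c); return the final (h, c)."""
    k = h.shape[0]
    for x in X:
        z = W @ np.concatenate([x, h]) + b
        i, f, o = sigmoid(z[:k]), sigmoid(z[k:2 * k]), sigmoid(z[2 * k:3 * k])
        g = np.tanh(z[3 * k:])
        c = f * c + i * g
        h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
d, k, n_classes = 8, 5, 3          # hypothetical sizes
W_prem = rng.normal(scale=0.5, size=(4 * k, d + k))
W_hyp = rng.normal(scale=0.5, size=(4 * k, d + k))
b_prem, b_hyp = np.zeros(4 * k), np.zeros(4 * k)
W_out = rng.normal(scale=0.5, size=(n_classes, k))

premise = rng.normal(size=(6, d))      # stand-in word embeddings
hypothesis = rng.normal(size=(3, d))

# Encode the premise from a zero state.
h_p, c_p = lstm_run(premise, W_prem, b_prem, np.zeros(k), np.zeros(k))
# Conditional encoding: the hypothesis LSTM starts where the premise LSTM stopped.
h_N, _ = lstm_run(hypothesis, W_hyp, b_hyp, h_p, c_p)

probs = softmax(np.tanh(W_out @ h_N))  # softmax(tanh(W h_N))
```

Passing the same weights for both calls to `lstm_run` would give a shared-parameter variant of the same idea.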

## Results

| Model | k | θ<sub>W+M</sub> | θ<sub>M</sub> | Train | Dev | Test |
|-|-|-|-|-|-|-|
| LSTM [<span class=blue>Bowman et al.</span>] | 100 | \\(\approx\\)10M | 221k | 84.4 | - | 77.6|
| Classifier [<span class=blue>Bowman et al.</span>]| - | - | - | 99.7 | - | 78.2|
| Conditional Encoding | 159 | 3.9M | 252k | 84.4 | 83.0 | 81.4|

### Attention

<img src="./figures/3-attention/attention_encoding.svg" width=60%/> 

<div class=small>
\begin{align}
  \mathbf{M} &= \tanh(\mathbf{W}^y\mathbf{Y}+ \mathbf{W}^h\mathbf{h}_N\mathbf{1}^T_L)&\mathbf{M}&\in\mathbb{R}^{k \times L}\\
  \alpha &= \text{softmax}(\mathbf{w}^T\mathbf{M})&\alpha&\in\mathbb{R}^L\\
  \mathbf{r} &= \mathbf{Y}\alpha^T&\mathbf{r}&\in\mathbb{R}^k
\end{align}
</div>

<div class=cite> [Graves 2013, Bahdanau et al. 2015]</div>
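
The three equations translate almost line by line into NumPy. The sketch below assumes that the columns of $\mathbf{Y}$ are the premise LSTM's output vectors and that $\mathbf{h}_N$ is the last hidden state after reading the hypothesis. The sizes and random values are placeholders, and the function name `attention` is chosen for illustration only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention(Y, h_N, W_y, W_h, w):
    """Attend over the columns of Y (k x L) given a query vector h_N (k)."""
    # Broadcasting the column (W_h h_N) over L columns plays the role of 1_L^T.
    M = np.tanh(W_y @ Y + (W_h @ h_N)[:, None])   # k x L
    alpha = softmax(w @ M)                         # L attention weights
    r = Y @ alpha                                  # weighted sum of columns, size k
    return alpha, r

rng = np.random.default_rng(0)
k, L = 4, 6                                        # hypothetical sizes
Y = rng.normal(size=(k, L))
h_N = rng.normal(size=k)
W_y, W_h = rng.normal(size=(k, k)), rng.normal(size=(k, k))
w = rng.normal(size=k)

alpha, r = attention(Y, h_N, W_y, W_h, w)
```

Because `alpha` is non-negative and sums to one, `r` is a convex combination of the premise output vectors, so plotting `alpha` over the premise words gives visualizations of the kind shown on the next slides.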

<img  src="./figures/3-attention/camel.png"/>

## Contextual Understanding
<img  src="./figures/3-attention/pink.png"/>

## Results

| Model | k | θ<sub>W+M</sub> | θ<sub>M</sub> | Train | Dev | Test |
|-|-|-|-|-|-|-|
| LSTM [<span class=blue>Bowman et al.</span>] | 100 | \\(\approx\\)10M | 221k | 84.4 | - | 77.6|
| Classifier [<span class=blue>Bowman et al.</span>]| - | - | - | 99.7 | - | 78.2|
| Conditional Encoding | 159 | 3.9M | 252k | 84.4 | 83.0 | 81.4|
| Attention | 100 | 3.9M | 242k | 85.4 | 83.2 | 82.3 |

## Fuzzy Attention
<img  src="./figures/3-attention/mimes.png"/>

# Word-by-word Attention

<img src="./figures/3-attention/word_attention_encoding.svg" width=60%/> 

<div class=small>
\begin{align}
  \mathbf{M}_t &= \tanh(\mathbf{W}^y\mathbf{Y}+(\mathbf{W}^h\mathbf{h}_t+\mathbf{W}^r\mathbf{r}_{t-1})\mathbf{1}^T_L) & \mathbf{M}_t &\in\mathbb{R}^{k\times L}\\
  \alpha_t &= \text{softmax}(\mathbf{w}^T\mathbf{M}_t)&\alpha_t&\in\mathbb{R}^L\\
  \mathbf{r}_t &= \mathbf{Y}\alpha^T_t + \tanh(\mathbf{W}^t\mathbf{r}_{t-1})&\mathbf{r}_t&\in\mathbb{R}^k
\end{align}
</div>
<div class=cite>[Bahdanau et al. 2015, Hermann et al. 2015, Rush et al. 2015, Rocktäschel et al. 2016]</div>
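
Word-by-word attention repeats the attention step for every hypothesis position $t$ and feeds the previous representation $\mathbf{r}_{t-1}$ back in. The NumPy sketch below follows the three equations above. It assumes $\mathbf{r}_0 = \mathbf{0}$, which the slide does not state, and uses random placeholders for $\mathbf{Y}$, the hypothesis states $\mathbf{h}_t$, and all weights.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def word_by_word_attention(Y, H, W_y, W_h, W_r, W_t, w):
    """Y: premise outputs (k x L). H: hypothesis hidden states (N x k)."""
    k, L = Y.shape
    r = np.zeros(k)                     # r_0 = 0 is an assumption of this sketch
    alphas = []
    for h_t in H:
        # The query now depends on the current word and the previous r.
        M_t = np.tanh(W_y @ Y + (W_h @ h_t + W_r @ r)[:, None])   # k x L
        alpha_t = softmax(w @ M_t)                                 # L weights
        r = Y @ alpha_t + np.tanh(W_t @ r)                         # r_t
        alphas.append(alpha_t)
    return np.stack(alphas), r          # (N x L) attention matrix, final r_N

rng = np.random.default_rng(0)
k, L, N = 4, 6, 3                       # hypothetical sizes
Y = rng.normal(size=(k, L))
H = rng.normal(size=(N, k))
W_y, W_h, W_r, W_t = (rng.normal(size=(k, k)) for _ in range(4))
w = rng.normal(size=k)

alphas, r_N = word_by_word_attention(Y, H, W_y, W_h, W_r, W_t, w)
```

Row `t` of `alphas` holds the weights over premise words for hypothesis word `t`, which is the kind of matrix the alignment figures on the following slides display.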

## Reordering
<img src="./figures/3-attention/reordering.png" width=40%/>

## Garbage Can = Trashcan
<img  src="./figures/3-attention/trashcan.png" width=70%/>

## Kids =  Girl + Boy
<img  src="./figures/3-attention/kids.png" width=60%/>

## Snow is outside
<img  src="./figures/3-attention/snow.png" width=90%/>

## Results

| Model | k | θ<sub>W+M</sub> | θ<sub>M</sub> | Train | Dev | Test |
|-|-|-|-|-|-|-|
| LSTM [<span class=blue>Bowman et al.</span>] | 100 | \\(\approx\\)10M | 221k | 84.4 | - | 77.6|
| Classifier [<span class=blue>Bowman et al.</span>]| - | - | - | 99.7 | - | 78.2|
| Conditional Encoding | 159 | 3.9M | 252k | 84.4 | 83.0 | 81.4|
| Attention | 100 | 3.9M | 242k | 85.4 | 83.2 | 82.3 |
| Word-by-word Attention | 100 | 3.9M | 252k | 85.3 | **83.7** | **83.5** |