$$
\newcommand{\mat}[1]{\boldsymbol {#1}}
\newcommand{\mattr}[1]{\boldsymbol {#1}^\top}
\newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}}
\newcommand{\vec}[1]{\boldsymbol {#1}}
\newcommand{\vectr}[1]{\boldsymbol {#1}^\top}
\newcommand{\rvar}[1]{\mathrm {#1}}
\newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}}
\newcommand{\diag}{\mathop{\mathrm {diag}}}
\newcommand{\set}[1]{\mathbb {#1}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\bb}[1]{\boldsymbol{#1}}
\newcommand{\Tr}[0]{^\top}
\newcommand{\softmax}[1]{\mathrm{softmax}\left({#1}\right)}
$$

# CS236781: Deep Learning
# Tutorial 7: Attention

## Introduction

In this tutorial, we will cover:

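- Attention mechanisms: motivation and definition
- Multiplicative (dot-product) attention
- Additive (MLP-based) attention
- Example: Seq2Seq language translation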

In [1]:
# Setup
%matplotlib inline
import os
import sys
import time
import torch
import matplotlib.pyplot as plt

In [2]:
plt.rcParams['font.size'] = 20
data_dir = os.path.expanduser('~/.pytorch-datasets')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## Attention Mechanisms

In the context of learning from **sequences** of inputs, we have seen RNNs as a model capable of learning a transformation of one sequence into another.

<center><img src="img/rnn_unrolled.png" width="1000" /></center>


Where,

$$
\begin{align}
\forall t \geq 0:\\
\vec{h}_t &= \varphi_h\left( \mat{W}_{hh} \vec{h}_{t-1} + \mat{W}_{xh} \vec{x}_t + \vec{b}_h\right) \\
\vec{y}_t &= \varphi_y\left(\mat{W}_{hy}\vec{h}_t + \vec{b}_y \right)
\end{align}
$$
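The recurrence above can be sketched directly in PyTorch. This is a minimal illustration, not a trainable model: the sizes, the random weights, the choice of $\varphi_h=\tanh$ and the identity for $\varphi_y$ are all assumptions made here for the example.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes: input dim, hidden dim, output dim, sequence length
d_x, d_h, d_y, T = 4, 8, 3, 5

# Randomly initialized parameters (in practice these would be learned)
W_hh = torch.randn(d_h, d_h) * 0.1
W_xh = torch.randn(d_h, d_x) * 0.1
b_h = torch.zeros(d_h)
W_hy = torch.randn(d_y, d_h) * 0.1
b_y = torch.zeros(d_y)

x = torch.randn(T, d_x)   # input sequence x_0, ..., x_{T-1}
h = torch.zeros(d_h)      # initial hidden state

ys = []
for t in range(T):
    # h_t = phi_h(W_hh h_{t-1} + W_xh x_t + b_h), with phi_h = tanh (assumed)
    h = torch.tanh(W_hh @ h + W_xh @ x[t] + b_h)
    # y_t = phi_y(W_hy h_t + b_y), with phi_y = identity (assumed)
    ys.append(W_hy @ h + b_y)

y = torch.stack(ys)       # one output per time step
print(y.shape)            # torch.Size([5, 3])
```

Note how each step needs the hidden state of the previous one, so the loop over `t` cannot be parallelized.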

However, RNNs (even the fancy ones) have some major drawbacks:

1. Input must be processed sequentially.
2. Hard to train on long sequences (needs BPTT).
3. Difficult to learn long-term dependencies, e.g. between late outputs and early inputs. The **hidden state** has the burden of "remembering" the "meaning" of the entire sequence so far.

Imagine we want to translate text from English to French. The general approach using RNNs is to design a Sequence-to-sequence (**Seq2Seq**) Encoder-Decoder architecture:

<center><img src="img/seq2seq.svg" width="1000" /></center>

In such an architecture the **last** hidden state must encode all the information the decoder needs for translation.

**Local** information, i.e. the encoder outputs and intermediate hidden states, is discarded.

Can we use this local info to help the decoder?

### Definition

In deep learning contexts, **attention** is a term used for a family of related mechanisms which, in general, learn to predict some probability distribution over a sequence of elements.

Intuitively, this allows a model to "pay more attention" to elements from the sequence which get a higher probability.

Recent versions of attention mechanisms can be defined formally as follows:

Given:
- $n$ **key-value** pairs: $\left\{\left(\vec{k}_i, \vec{v}_i\right)\right\}_{i=1}^{n}$, where $\vec{k}_i\in\set{R}^{d_k}$, $\vec{v}_i\in\set{R}^{d_v}$
- A **query**, $\vec{q} \in\set{R}^{d_q}$
- Some similarity function between keys and queries, $s: \set{R}^{d_k}\times \set{R}^{d_q} \to \set{R}$

An attention mechanism computes a weighted sum of the **values**,

$$
\vec{o} = \sum_{i=1}^{n} a_i \vec{v}_i\ \in \set{R}^{d_v},
$$

where the attention weights $a_i$ are computed according to the similarity between the **query** and each **key**:

$$
\begin{align}
b_i &= s(\vec{k}_i, \vec{q}) \\
\vec{b} &= \left[  b_1, \dots, b_n \right]\Tr \\
\vec{a} &= \softmax{\vec{b}}.
\end{align}
$$
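The definition above translates almost line by line into code. The following is a minimal sketch for a single query; the function name `attention`, the sizes, and the use of a plain dot product as a placeholder for $s$ are assumptions made for illustration. Any function returning a scalar score would do.

```python
import torch

def attention(q, keys, values, s):
    """Single-query attention: weighted sum of values, weights from similarity s."""
    # b_i = s(k_i, q) for every key
    b = torch.stack([s(k, q) for k in keys])   # (n,)
    # a = softmax(b): a probability distribution over the n elements
    a = torch.softmax(b, dim=0)                # (n,)
    # o = sum_i a_i v_i
    o = a @ values                             # (n,) @ (n, d_v) -> (d_v,)
    return o, a

torch.manual_seed(0)
n, d, d_v = 5, 4, 3                            # hypothetical sizes, d_k = d_q = d
keys = torch.randn(n, d)
values = torch.randn(n, d_v)
q = torch.randn(d)

# Placeholder similarity: an unscaled dot product
o, a = attention(q, keys, values, s=lambda k, q: torch.dot(k, q))
print(a.sum().item(), o.shape)
```

The weights `a` are non-negative and sum to one, so the output `o` is a convex combination of the values.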


### Multiplicative attention

One basic type of attention mechanism uses a simple **dot product** as the similarity function.

It is widely used by models based on the **Transformer** architecture.

Assume $d_k=d_q=d$, then

$$
s(\vec{k},\vec{q})= \frac{\vectr{k}\vec{q}}{\sqrt{d}}.
$$

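Why scale by $\sqrt{d}$?

It's the factor by which the magnitude of the dot product grows with the dimension. If the components of $\vec{k}$ and $\vec{q}$ are independent with zero mean and unit variance, then $\vectr{k}\vec{q}$ has variance $d$, i.e. a standard deviation of $\sqrt{d}$. As a simple illustration of this growth,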

$$
\norm{\vec{1}_d}_2 = \norm{[1,\dots,1]\Tr}_2 = \sqrt{d\cdot 1^2} =\sqrt{d}.
$$

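Without scaling, large-magnitude scores push the softmax toward a nearly one-hot output, a regime in which its gradients are tiny. Dividing by $\sqrt{d}$ keeps the scale of the scores independent of the dimension, and therefore helps prevent these tiny gradients.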
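A quick numerical check of this scaling is sketched below. It assumes the components of the keys and queries are i.i.d. standard normal, which is an assumption made here for illustration only; the helper name `dot_product_std` is likewise hypothetical.

```python
import math
import torch

def dot_product_std(d, n_samples=10_000):
    """Empirical std of k^T q, before and after dividing by sqrt(d)."""
    k = torch.randn(n_samples, d)
    q = torch.randn(n_samples, d)
    dots = (k * q).sum(dim=1)                  # one dot product per sample
    return dots.std().item(), (dots / math.sqrt(d)).std().item()

torch.manual_seed(0)
for d in (4, 64, 1024):
    raw, scaled = dot_product_std(d)
    print(f"d={d:5d}  std(k.q)={raw:7.2f}  std(k.q/sqrt(d))={scaled:5.2f}")
```

Under this assumption the unscaled standard deviation should come out close to $\sqrt{d}$, while the scaled one stays close to 1 for every $d$.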

Let's now deal with $m$ queries simultaneously by stacking them in a matrix $\mat{Q} \in \set{R}^{m\times d}$.

Similarly, we'll stack the keys and values in their own matrices, $\mat{K}\in\set{R}^{n\times d}$, $\mat{V}\in\set{R}^{n\times d_v}$.

Then we can compute the attention weights for all queries in parallel:

$$
\begin{align}
\mat{B} &= \frac{1}{\sqrt{d}} \mat{Q}\mattr{K}  \ \in\set{R}^{m\times n} \\
\mat{A} &= \softmax{\mat{B}},\ \mathrm{dim}=1 \\
\mat{O} &= \mat{A}\mat{V} \ \in\set{R}^{m\times d_v}.
\end{align}
$$

Note that the softmax is applied per-row, and so each row $i$ of $\mat{A}$ contains the attention weights for the $i$th query.

Also notice that in this formulation, we **input a sequence** of $m$ queries and get an **output sequence** of $m$ weighted values.
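The matrix form above can be sketched in a few lines of PyTorch. The function name and the sizes below are assumptions chosen for the example; masking and batching are left out for brevity.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Q: (m, d), K: (n, d), V: (n, d_v) -> O: (m, d_v), A: (m, n)."""
    d = Q.shape[-1]
    B = Q @ K.transpose(-2, -1) / math.sqrt(d)   # similarity scores, (m, n)
    A = torch.softmax(B, dim=-1)                 # softmax over keys, per row
    O = A @ V                                    # weighted values, (m, d_v)
    return O, A

torch.manual_seed(0)
m, n, d, d_v = 3, 5, 8, 6                        # hypothetical sizes
Q = torch.randn(m, d)
K = torch.randn(n, d)
V = torch.randn(n, d_v)

O, A = scaled_dot_product_attention(Q, K, V)
print(A.shape, O.shape)    # torch.Size([3, 5]) torch.Size([3, 6])
print(A.sum(dim=1))        # each row sums to 1 (up to floating point error)
```

Unlike the RNN recurrence, nothing here depends on a previous step: all $m$ queries are handled by two matrix products and one softmax.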

### Additive attention based on an MLP

Another common type of attention mechanism uses an MLP to **learn** the similarity function $s(\vec{k},\vec{q})$.

In this type of attention, the similarity function is 

$$
s(\vec{k},\vec{q}) = \vectr{v} \tanh(\mat{W}_k\vec{k} + \mat{W}_q\vec{q}),
$$

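where $\mat{W}_k\in\set{R}^{h\times d_k}$, $\mat{W}_q\in\set{R}^{h\times d_q}$ and $\vec{v}\in\set{R}^{h}$ are trainable parameters. Note that here $\vec{v}$ is a parameter vector, not one of the values $\vec{v}_i$, and that unlike the dot product, this similarity does not require $d_k=d_q$.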
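A minimal sketch of this similarity as a PyTorch module follows. The class name, the bias-free `nn.Linear` layers used to hold $\mat{W}_k$, $\mat{W}_q$ and $\vec{v}$, and the sizes are assumptions for illustration; it handles $m$ queries against $n$ keys via broadcasting.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """s(k, q) = v^T tanh(W_k k + W_q q), followed by softmax over the keys."""

    def __init__(self, d_k, d_q, h):
        super().__init__()
        self.W_k = nn.Linear(d_k, h, bias=False)
        self.W_q = nn.Linear(d_q, h, bias=False)
        self.v = nn.Linear(h, 1, bias=False)     # parameter vector v, not a value

    def forward(self, Q, K, V):
        # Q: (m, d_q), K: (n, d_k), V: (n, d_v)
        # Broadcast (m, 1, h) + (1, n, h) -> (m, n, h): one hidden vector per pair
        hidden = torch.tanh(self.W_q(Q).unsqueeze(1) + self.W_k(K).unsqueeze(0))
        B = self.v(hidden).squeeze(-1)           # scores, (m, n)
        A = torch.softmax(B, dim=-1)             # attention weights per query
        return A @ V, A                          # outputs (m, d_v), weights (m, n)

torch.manual_seed(0)
m, n, d_q, d_k, d_v, h = 3, 5, 7, 4, 6, 16       # hypothetical sizes, d_q != d_k
attn = AdditiveAttention(d_k=d_k, d_q=d_q, h=h)
Q, K, V = torch.randn(m, d_q), torch.randn(n, d_k), torch.randn(n, d_v)

O, A = attn(Q, K, V)
print(O.shape, A.shape)    # torch.Size([3, 6]) torch.Size([3, 5])
```

The intermediate tensor has shape `(m, n, h)`, so this variant costs more memory than the dot-product form for long sequences.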

## Example: Seq2Seq language translation

In [3]:
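# NOTE: Field and BucketIterator belong to the legacy torchtext API. They were moved
# to torchtext.legacy in torchtext 0.9 and removed in 0.12.
from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator

# Tokenize with spaCy and wrap each sentence in start/end-of-sequence tokens
field_args = dict(tokenize='spacy', init_token='<sos>', eos_token='<eos>', lower=False)
src_field = Field(tokenizer_language="de_core_news_sm", **field_args)  # German (source)
tgt_field = Field(tokenizer_language="en_core_web_sm", **field_args)   # English (target)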

In [21]:
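# Load the Multi30k splits; exts selects the source (.de) and target (.en) files
ds_train, ds_valid, ds_test = Multi30k.splits(
    root=data_dir, exts=('.de', '.en'), fields=(src_field, tgt_field)
)

# Build both vocabularies from the training set only
src_field.build_vocab(ds_train)
tgt_field.build_vocab(ds_train)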

In [22]:
len(src_field.vocab)

18659

In [23]:
len(tgt_field.vocab)

9799

In [29]:
# Print a few source/target sentence pairs from the training set
for i in [0, 10, 100, 1000]:
    example = ds_train[i]
    src = " ".join(example.src)
    tgt = " ".join(example.trg)
    print(f'sample#{i:04d}:\n\tDE: {src}\n\tEN: {tgt}\n')

sample#0000:
	DE: zwei junge weiße männer sind im freien in der nähe vieler büsche .
	EN: two young , white males are outside near many bushes .

sample#0010:
	DE: eine ballettklasse mit fünf mädchen , die nacheinander springen .
	EN: a ballet class of five girls jumping in sequence .

sample#0100:
	DE: männliches kleinkind in einem roten hut , das sich an einem geländer festhält .
	EN: toddler boy in a red hat holding on to some railings .

sample#1000:
	DE: ein junger mann in einem weißen hemd , der tomaten schneidet .
	EN: a young man in a white shirt cutting tomatoes .



**Image credits**

Some images in this tutorial were taken and/or adapted from:

- Zhang et al., Dive into Deep Learning, 2019
- Nikhil Buduma, Fundamentals of Deep Learning, O'Reilly, 2017
- Andrej Karpathy, http://karpathy.github.io
- MIT 6.S191
- Stanford cs231n
- K. Xu et al. 2015, https://arxiv.org/abs/1502.03044