## Attention

Intuitively, some parts of the input may be more important than others.

An **attention** mechanism allows the model to "focus" on, i.e., give a *greater weight* to,
different parts of the input or of some intermediate representation inside the model.

Example from an image captioning [paper](https://arxiv.org/pdf/1502.03044.pdf) (K. Xu et al. 2015):

<img src="img/attn_ic1.png" width="900"/>

<img src="img/attn_ic2.png" width="800"/>


### Input soft attention

One place to apply attention is to the **input features**.

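In the context of our RNN model, we can change its hidden state update as follows,
where $\vec{g}_t$ holds one weight per input feature, $\alpha$ is a scalar that controls how sharp the softmax is,
and $\odot$ denotes element-wise multiplication: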


$$
\begin{align}
\vec{a}_t &= \sigma\left( \mat{W}_{ha} \vec{h}_{t-1} + \mat{W}_{xa} \vec{x}_t+ \vec{b}_a\right) \\
\vec{g}_t &= \mathrm{softmax}(\alpha \vec{a}_t) \\
\vec{h}_t &= \varphi_h\left( \mat{W}_{hh} \vec{h}_{t-1} + \mat{W}_{xh} (\vec{x}_t \odot \vec{g}_t)+ \vec{b}_h\right) \\
\end{align}
$$


In [1]:
import torch
import torch.nn as nn

class RNNLayerInputAttn(nn.Module):
    def __init__(self, in_dim, h_dim, out_dim, phi_h=torch.tanh, phi_y=torch.sigmoid):
        super().__init__()
        self.phi_h, self.phi_y = phi_h, phi_y
        
        # Attention parameters
        self.fc_xa = nn.Linear(in_dim, in_dim, bias=False)
        self.fc_ha = nn.Linear(h_dim, in_dim, bias=True)
        
        # Regular RNN parameters
        self.fc_xh = nn.Linear(in_dim, h_dim, bias=False)
        self.fc_hh = nn.Linear(h_dim, h_dim, bias=True)
        self.fc_hy = nn.Linear(h_dim, out_dim, bias=True)
        
    def forward(self, xt, h_prev=None):
        if h_prev is None:
            # Zero initial hidden state, created with the same device and dtype as the input
            h_prev = torch.zeros(xt.shape[0], self.fc_hh.in_features,
                                 device=xt.device, dtype=xt.dtype)
            
        # Calculate the attention gating gt: a weight for each feature of x
        # (the scale factor alpha from the equations is fixed to 1 here)
        at = torch.sigmoid(self.fc_xa(xt) + self.fc_ha(h_prev))
        gt = torch.softmax(at, dim=1)
        
        # Apply regular RNN with gated input
        ht = self.phi_h(self.fc_xh(xt * gt) + self.fc_hh(h_prev))
        
        yt = self.fc_hy(ht)
        
        if self.phi_y is not None:
            yt = self.phi_y(yt)
        
        return yt, ht

We can interpret this as a soft (differentiable) gating of the input.

This makes sense for image captioning, where we want to emphasize image regions based on their feature maps.
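
To make the soft gating concrete, here is a minimal sketch of the $\vec{g}_t$ computation on its own. The helper name `soft_gate` and the activation values are made up for illustration. Because $\vec{a}_t$ lies in $(0,1)$ after the sigmoid, a small $\alpha$ gives a fairly flat gate, while a larger $\alpha$ concentrates the weight on the strongest feature.

```python
import torch

def soft_gate(a, alpha=1.0):
    """Hypothetical helper: softmax over the feature dimension of alpha * a."""
    return torch.softmax(alpha * a, dim=1)

# Made-up attention activations for one sample with 4 features,
# already in (0, 1) as they would be after a sigmoid.
a = torch.tensor([[0.1, 0.5, 0.9, 0.2]])
x = torch.ones(1, 4)

for alpha in (1.0, 10.0):
    g = soft_gate(a, alpha)
    gated_x = x * g  # element-wise gating, as in x_t ⊙ g_t
    print(f"alpha={alpha}: gate={g.squeeze(0).tolist()}")
```

In both cases the gate sums to 1 over the features and the third feature gets the largest weight; only the sharpness changes with $\alpha$.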

What about our sentiment analysis task?

### Self attention

Another place to apply attention in the context of RNNs is to the **hidden states**.

In an ICLR 2017 [paper](https://arxiv.org/pdf/1703.03130.pdf), Lin et al. proposed
a self-attention mechanism and applied it to sentiment analysis.

<img src="img/self_attn_sa.png" width="700" />

The problem with applying attention to the hidden state vectors is that their number varies from batch to batch,
depending on the sentence length.

This approach creates a **sentence embedding** $M$ of a fixed size:

$$
\begin{align}
\mat{H}_T &= \left[ \vectr{h}_1; \dots; \vectr{h}_T \right] \in\set{R}^{T\times d_h}\\
\mat{A} &= \mathrm{softmax}\left(\mat{W}_{s2} \tanh\left( \mat{W}_{s1} \mattr{H}_T \right) \right) \in\set{R}^{r\times T},\ 
\mat{W}_{s1}\in\set{R}^{d_a \times d_h},\ \mat{W}_{s2}\in\set{R}^{r \times d_a} \\
\mat{M} &= \mat{A}\mat{H}_T \in\set{R}^{r\times d_h}
\end{align}
$$


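Here the softmax is taken over the $T$ time steps, so each of the $r$ rows of $A$ is a set of weights over the hidden states, and the size of $M$ does not depend on $T$.

The sentence embedding $M$ is then fed into a fully connected (FC) classifier to produce the prediction.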
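
The equations above can be sketched as a small standalone layer. This is only an illustration under stated assumptions, not the paper's reference implementation: the class name `SelfAttentionEmbedding` and the sizes are hypothetical, and all sequences in a batch are assumed to have the same length (no padding mask). Integrating it into `SentimentRNN` remains the exercise.

```python
import torch
import torch.nn as nn

class SelfAttentionEmbedding(nn.Module):
    """Hypothetical layer: computes M = A H for a batch of hidden-state sequences."""
    def __init__(self, h_dim, a_dim, r):
        super().__init__()
        self.fc_s1 = nn.Linear(h_dim, a_dim, bias=False)  # plays the role of W_s1
        self.fc_s2 = nn.Linear(a_dim, r, bias=False)      # plays the role of W_s2

    def forward(self, H):
        # H: (B, T, h_dim), the hidden states of all T time steps
        scores = self.fc_s2(torch.tanh(self.fc_s1(H)))    # (B, T, r)
        # Normalize over the time dimension, then transpose so that
        # each of the r rows of A sums to 1 over the T hidden states
        A = torch.softmax(scores, dim=1).transpose(1, 2)  # (B, r, T)
        M = torch.bmm(A, H)                               # (B, r, h_dim)
        return M, A

torch.manual_seed(0)
attn = SelfAttentionEmbedding(h_dim=8, a_dim=16, r=4)  # made-up sizes
for T in (5, 12):
    H = torch.randn(2, T, 8)  # random stand-in for RNN hidden states
    M, A = attn(H)
    print(T, tuple(M.shape), tuple(A.shape))
```

For both sequence lengths, `M` has shape `(2, 4, 8)`, which is what lets a fixed-size classifier follow the attention layer.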

*Self exercise:* Modify our `SentimentRNN` and add the self-attention layer.

**Image credits**

Some images in this tutorial were taken and/or adapted from:

- Fundamentals of Deep Learning, Nikhil Buduma, O'Reilly 2017
- Andrej Karpathy, http://karpathy.github.io
- MIT 6.S191
- Stanford cs231n
- K. Xu et al. 2015, https://arxiv.org/abs/1502.03044