# Introduction

My goal with this notebook is to show the direct relation between the paper (Show, Attend and Tell) and its implementation in this repository. I will explain the code as much as possible; nonetheless, I recommend reading the paper first.

The implementation is based on [AaronCCWong's implementation](https://github.com/AaronCCWong/Show-Attend-and-Tell). I build on his work, trying some alternative techniques (such as pre-trained embeddings) and aligning the code more closely with the original paper.

## The Encoder

Section 3.1.1 in the "Show, Attend and Tell" paper describes the encoder. The proposed model uses a CNN to extract a set of feature vectors, referred to as annotation vectors. This code implements that concept by using pre-trained models (VGG19, ResNet152, or DenseNet161), modifying them to exclude the final classification layers, and reshaping the output to form a set of feature vectors (our annotation vectors). Each vector corresponds to a different spatial region of the image, which is key for the attention mechanism in the next stages of the model.

```python
import torch.nn as nn
from torchvision.models import densenet161, resnet152, vgg19
from torchvision.models import VGG19_Weights

class Encoder(nn.Module):
    """
    Encoder network for image feature extraction, follows section 3.1.1 of the paper
    """
    def __init__(self, network='vgg19'):
        super(Encoder, self).__init__()
        self.network = network
        # Selection of pre-trained CNNs for feature extraction
        if network == 'resnet152':
            self.net = resnet152(weights="DEFAULT")
            # Dropping ResNet152's final average-pooling and fully connected layers
            self.net = nn.Sequential(*list(self.net.children())[:-2])
            self.dim = 2048  # Dimension of feature vectors for ResNet152
        elif network == 'densenet161':
            self.net = densenet161(weights="DEFAULT")
            # Keeping DenseNet161's feature layers, minus the final batch norm
            self.net = nn.Sequential(*list(list(self.net.children())[0])[:-1])
            self.dim = 2208  # Dimension of feature vectors for DenseNet161
        else:
            self.net = vgg19(weights=VGG19_Weights.DEFAULT)
            # Using features from VGG19, excluding the last pooling layer
            self.net = nn.Sequential(*list(self.net.features.children())[:-1])
            self.dim = 512  # Dimension of feature vectors for VGG19

    def forward(self, x):
        x = self.net(x)
        # These steps correspond to the extraction of annotation vectors (a = {a1,...,aL}) as described in Section 3.1.1 of the paper.
        # 1. Change the order from (BS, C, H, W) to (BS, H, W, C) in prep for reshaping
        x = x.permute(0, 2, 3, 1)
        # 2. Reshape to [BS, num_spatial_features, C], the -1 effectively flattens the height and width dimensions into a single dimension
        x = x.view(x.size(0), -1, x.size(-1))
        return x
```
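
To make the output shape concrete, here is a small sketch that computes how many annotation vectors $ L $ the encoder returns and their dimension $ D $. The `BACKBONES` table and the `annotation_shape` helper are illustrative names of mine, not part of the repository, and the 224×224 input size is an assumption about the preprocessing.

```python
# Illustrative only: `BACKBONES` and `annotation_shape` are not part of the repository.
# Each entry: (total downsampling stride of the truncated network, output channels).
BACKBONES = {
    "vgg19": (16, 512),       # last max-pool removed, so 4 poolings of stride 2 remain
    "resnet152": (32, 2048),  # average pool and fully connected layer removed
}


def annotation_shape(network, image_size=224):
    """Return (L, D) for a square input whose side is divisible by the stride."""
    stride, channels = BACKBONES[network]
    side = image_size // stride  # spatial side of the final feature map
    return side * side, channels


for name in BACKBONES:
    L, D = annotation_shape(name)
    print(f"{name}: L={L}, D={D}")
```

For VGG19 this gives the 14×14 grid of 512-dimensional vectors used in the paper, so the decoder attends over 196 locations.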


## The Decoder

Let's move forward to Section 3.1.2, explaining the Decoder. The Decoder uses the extracted image features to initialize the states of the LSTM cell (by averaging them) and then employs an **attention mechanism** at each time step to focus on different parts of the image while generating the caption. The Decoder predicts one word of the caption at each time step, and its prediction is conditioned on the current LSTM state, the context vector from the attention mechanism, and the previous word. Teacher forcing, a common technique in training sequence generation models where the ground truth word is fed as the next input instead of the model's prediction, is optionally used.

Provided below is the full implementation of the Decoder. I will break the code down into smaller chunks to explore it in more detail, based on the paper's description of the Decoder.

```python
import torch
import torch.nn as nn
from attention import Attention


class Decoder(nn.Module):
    def __init__(self, vocabulary_size, encoder_dim, tf=False):
        super(Decoder, self).__init__()
        self.use_tf = tf

        # Initializing parameters
        self.vocabulary_size = vocabulary_size
        self.encoder_dim = encoder_dim

        # Initial LSTM cell state generators
        self.init_h = nn.Linear(encoder_dim, 512)  # For hidden state
        self.init_c = nn.Linear(encoder_dim, 512)  # For cell state
        self.tanh = nn.Tanh()

        # Attention mechanism related layers
        self.f_beta = nn.Linear(512, encoder_dim)  # Gate for the context vector (a scalar in the paper, one value per feature channel here)
        self.sigmoid = nn.Sigmoid()

        # Output layer and embedding
        self.deep_output = nn.Linear(512, vocabulary_size)  # Maps LSTM outputs to vocabulary
        self.dropout = nn.Dropout()

        # Attention and LSTM components
        self.attention = Attention(encoder_dim)  # Attention network
        self.embedding = nn.Embedding(vocabulary_size, 512)  # Embedding layer for input words
        self.lstm = nn.LSTMCell(512 + encoder_dim, 512)  # LSTM cell

    def forward(self, img_features, captions):
        # Forward pass of the decoder
        batch_size = img_features.size(0)

        # Initialize LSTM state
        h, c = self.get_init_lstm_state(img_features)

        # Teacher forcing setup
        device = img_features.device  # Keep new tensors on the same device as the inputs
        max_timespan = max([len(caption) for caption in captions]) - 1
        prev_words = torch.zeros(batch_size, 1, dtype=torch.long, device=device)
        if self.use_tf:
            embedding = self.embedding(captions) if self.training else self.embedding(prev_words)
        else:
            embedding = self.embedding(prev_words)

        # Preparing to store predictions and attention weights
        preds = torch.zeros(batch_size, max_timespan, self.vocabulary_size, device=device)
        alphas = torch.zeros(batch_size, max_timespan, img_features.size(1), device=device)

        # Generating captions
        for t in range(max_timespan):
            context, alpha = self.attention(img_features, h)  # Compute context vector via attention
            gate = self.sigmoid(self.f_beta(h))  # Gating scalar for context
            gated_context = gate * context  # Apply gate to context

            # Prepare LSTM input
            if self.use_tf and self.training:
                lstm_input = torch.cat((embedding[:, t], gated_context), dim=1)
            else:
                embedding = embedding.squeeze(1) if embedding.dim() == 3 else embedding
                lstm_input = torch.cat((embedding, gated_context), dim=1)

            # LSTM forward pass
            h, c = self.lstm(lstm_input, (h, c))
            output = self.deep_output(self.dropout(h))  # Generate word prediction

            preds[:, t] = output
            alphas[:, t] = alpha  # Store attention weights

            # Prepare next input word
            if not self.training or not self.use_tf:
                embedding = self.embedding(output.max(1)[1].reshape(batch_size, 1))
        return preds, alphas

    def get_init_lstm_state(self, img_features):
        # Initializing LSTM state based on image features
        avg_features = img_features.mean(dim=1)

        c = self.init_c(avg_features)  # Cell state
        c = self.tanh(c)

        h = self.init_h(avg_features)  # Hidden state
        h = self.tanh(h)

        return h, c
```
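
The teacher-forcing switch in `forward` is easier to see in isolation. The sketch below is a toy decoding loop in plain Python; `decode` and `toy_model` are hypothetical stand-ins for the attention, LSTM, and output layer, and the token ids are made up.

```python
def decode(step_fn, caption, teacher_forcing):
    """Toy decoding loop; `step_fn` maps the previous token id to a predicted token id."""
    prev = caption[0]  # <start> token
    preds = []
    for t in range(len(caption) - 1):
        pred = step_fn(prev)
        preds.append(pred)
        # Teacher forcing feeds the ground-truth token; otherwise the prediction is fed back.
        prev = caption[t + 1] if teacher_forcing else pred
    return preds


def toy_model(token):
    """Stand-in for attention + LSTM + output layer (hypothetical)."""
    return (token + 1) % 5


caption = [0, 3, 1, 4]  # made-up token ids, starting with <start> = 0
with_tf = decode(toy_model, caption, teacher_forcing=True)
without_tf = decode(toy_model, caption, teacher_forcing=False)
print(with_tf, without_tf)
```

With teacher forcing, every step sees the ground-truth previous word, so an early mistake does not propagate. Without it, the loop consumes its own predictions, which is also what happens at inference time.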

### A close look at the deep_output layer

Show, Attend and Tell utilizes a *deep output layer* (Pascanu et al., 2014) to compute the output word probability given the current state of the LSTM, the context vector from the attention mechanism, and the previously generated word.

Let's break down this formula and map its components to the code:

$$ p\left(\mathbf{y}_t \mid \mathbf{a}, \mathbf{y}_1^{t-1}\right) \propto \exp \left(\mathbf{L}_o\left(\mathbf{E} \mathbf{y}_{t-1}+\mathbf{L}_h \mathbf{h}_t+\mathbf{L}_z \hat{\mathbf{z}}_t\right)\right) $$



Here, $ p\left(\mathbf{y}_t \mid \mathbf{a}, \mathbf{y}_1^{t-1}\right) $ is the probability of the output word $ \mathbf{y}_t $ at time $ t $, given the image features $ \mathbf{a} $ and the previously generated words $ \mathbf{y}_1^{t-1} $.

In this formula:

- $ \mathbf{y}_t $ is the output word at time $ t $.
- $ \mathbf{a} $ represents the set of annotation vectors (image features).
- $ \mathbf{y}_1^{t-1} $ are the previously generated words up to time $ t-1 $.
- $ \mathbf{L}_o, \mathbf{L}_h, \mathbf{L}_z $ are learned weight matrices.
- $ \mathbf{E} $ is the embedding matrix for the previous word $ \mathbf{y}_{t-1} $.
- $ \mathbf{h}_t $ is the hidden state of the LSTM at time $ t $.
- $ \hat{\mathbf{z}}_t $ is the context vector at time $ t $, generated by the attention mechanism.

Now let's map this to the Decoder's code:

1. **Embedding of the Previous Word ($ \mathbf{E} \mathbf{y}_{t-1} $)**: This is done using the `self.embedding` layer in the code.

    ```python
    embedding = self.embedding(prev_words)
    ```

2. **Hidden State of the LSTM ($ \mathbf{h}_t $)**: The `h` variable in the code represents the hidden state of the LSTM at each time step.

    ```python
    h, c = self.lstm(lstm_input, (h, c))
    ```

3. **Context Vector ($ \hat{\mathbf{z}}_t $)**: The context vector is computed by the attention mechanism in the `self.attention` layer.

    ```python
    context, alpha = self.attention(img_features, h)
    ```

4. **Combining and Transforming for Output Prediction**: The output word probability is computed by combining these elements and applying the learned weight matrices. In the code, this operation is currently condensed into a single `self.deep_output` layer applied to the hidden state $ \mathbf{h}_t $ alone; the embedding and context-vector terms of the formula are not added at this point. In a more literal implementation of the DO-RNN, you would expect to see all three terms projected and combined, with multiple layers each followed by a non-linear activation function.

    ```python
    output = self.deep_output(self.dropout(h))
    ```
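
To see the full formula at work, here is a worked example in plain Python with tiny, made-up matrices. The sizes and values are assumptions chosen so the arithmetic can be checked by hand; `matvec` and `softmax` are helper names of mine, not functions from the repository.

```python
import math


def matvec(M, v):
    """Matrix-vector product for nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy sizes (assumed): embedding m=2, LSTM state n=2, context D=3, vocabulary K=3.
E_y_prev = [0.5, -0.5]                      # E y_{t-1}: embedding of the previous word
h_t = [1.0, 0.0]                            # LSTM hidden state
z_t = [0.2, 0.4, 0.6]                       # context vector
L_h = [[1.0, 0.0], [0.0, 1.0]]              # m x n
L_z = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]    # m x D
L_o = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # K x m

# Inner sum of the formula: E y_{t-1} + L_h h_t + L_z z_t
combined = [e + a + b for e, a, b in zip(E_y_prev, matvec(L_h, h_t), matvec(L_z, z_t))]
logits = matvec(L_o, combined)  # L_o(...)
probs = softmax(logits)         # normalizing exp(...) over the vocabulary
print(logits, probs)
```

The inner sum lives in the embedding space, which is why $ \mathbf{L}_h $ and $ \mathbf{L}_z $ must both project to the embedding dimension before $ \mathbf{L}_o $ maps to the vocabulary.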

## Attention Mechanisms

At the heart of the Show, Attend and Tell model is the attention mechanism, which lets the model focus on different parts of the image while generating the caption. It is implemented as a separate module that the Decoder calls at each time step.

There are two main types of attention mechanisms: **soft attention** and **hard attention**. Soft attention is differentiable and allows for end-to-end training, while hard attention is non-differentiable and requires reinforcement learning to train. I want to break down both types of attention mechanisms theoretically and then show how they are implemented in the code.

### Stochastic Hard Attention

Stochastic "Hard" Attention is an approach where the model discretely chooses specific regions (or locations, i.e. one annotation vector) in an image to focus on at each step of generating a caption. This contrasts with "Soft" Attention, where the model considers all regions but with varying degrees of focus.



#### 1. Attention Location Representation

$ s_{t, i} = 1 $ if the $ i $-th location is chosen at time $ t $, out of $ L $ total locations.

**Significance**: Represents the model's decision on where to focus in the image when generating the $ t $-th word in the caption as a one-hot vector.

#### 2. Context Vector Computation

$$ \hat{\mathbf{z}}_t = \sum_i s_{t, i} \mathbf{a}_i $$

**Significance**: Computes the context vector as the feature vector of the selected image region. Only the chosen region contributes to the context at each step.

#### 3. Attention as Multinoulli Distribution

$$ \tilde{s}_t \sim \operatorname{Multinoulli}_L(\{\alpha_i\}) $$

**Significance**: Models the attention decision as a random variable, following a Multinoulli distribution. The attention weights $ \{\alpha_i\} $ determine the probability of focusing on each region.
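
Steps 1 to 3 can be sketched together in a few lines of plain Python: sample a location from the attention probabilities, encode it as a one-hot vector, and read off the context vector. The annotation vectors and probabilities below are made up for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

annotations = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy a_1..a_L with L=3, D=2
alpha = [0.1, 0.7, 0.2]                              # attention probabilities, sum to 1

# Sample one location from Multinoulli(alpha) and encode it as the one-hot vector s_t.
i = rng.choices(range(len(alpha)), weights=alpha, k=1)[0]
s_t = [1 if j == i else 0 for j in range(len(alpha))]

# z_t = sum_i s_{t,i} a_i reduces to picking the chosen annotation vector.
z_t = [sum(s * a[d] for s, a in zip(s_t, annotations)) for d in range(2)]
print(s_t, z_t)
```

Because `s_t` is one-hot, the sum collapses to a single annotation vector. The sampling step is what makes hard attention non-differentiable.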

#### 4. Objective Function (Variational Lower Bound)

$$ L_s = \sum_s p(s | \mathbf{a}) \log p(\mathbf{y} | s, \mathbf{a}) $$
$$ L_s \leq \log p(\mathbf{y} | \mathbf{a}) $$

where...
- The inequality $ L_s \leq \log p(\mathbf{y} \mid \mathbf{a}) $ follows from Jensen's inequality: $ L_s $ is the expected log-likelihood under the attention distribution, which can never exceed the log of the expected likelihood, $ \log \sum_s p(s | \mathbf{a}) p(\mathbf{y} | s, \mathbf{a}) = \log p(\mathbf{y} | \mathbf{a}) $.
- The sum in $ L_s $ runs over all possible attention sequences $ s $, each weighted by its probability $ p(s | \mathbf{a}) $. Enumerating every sequence is intractable, so in practice the sum is estimated by sampling (see the Monte Carlo step below).

**Significance**: $ L_s $ is a tractable surrogate for the true log-likelihood of the caption. Maximizing it raises a lower bound on $ \log p(\mathbf{y} | \mathbf{a}) $ and trains the model to put probability on attention locations under which the correct caption is likely.
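
The bound is easy to check numerically. The sketch below uses a made-up distribution over three attention locations and made-up caption likelihoods; it is only meant to show that $ L_s $ sits below $ \log p(\mathbf{y} | \mathbf{a}) $.

```python
import math

p_s = [0.6, 0.3, 0.1]           # p(s | a): toy distribution over 3 attention locations
p_y_given_s = [0.5, 0.2, 0.05]  # p(y | s, a): caption likelihood under each location

# L_s: expected log-likelihood under the attention distribution.
L_s = sum(ps * math.log(py) for ps, py in zip(p_s, p_y_given_s))

# log p(y | a): log of the marginal likelihood.
log_marginal = math.log(sum(ps * py for ps, py in zip(p_s, p_y_given_s)))

print(f"L_s = {L_s:.4f}, log p(y|a) = {log_marginal:.4f}")
```

The gap between the two values shrinks as the likelihoods under different locations become more similar, and vanishes when they are all equal.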

#### 5. Gradient Approximation via Monte Carlo Sampling

$$ \frac{\partial L_s}{\partial W} \approx \frac{1}{N} \sum_{n=1}^N\left[\frac{\partial \log p(\mathbf{y} | \tilde{s}^n, \mathbf{a})}{\partial W} + \log p(\mathbf{y} | \tilde{s}^n, \mathbf{a}) \frac{\partial \log p(\tilde{s}^n | \mathbf{a})}{\partial W}\right] $$

where...
- The gradient of $ L_s $ is approximated as an average over $ N $ sampled sequences of attention decisions.
- $ \frac{\partial \log p(\mathbf{y} | \tilde{s}^n, \mathbf{a})}{\partial W} $ is the gradient of the log likelihood of the generated word sequence given the sampled attention sequence and the image features
- $ \log p(\mathbf{y} | \tilde{s}^n, \mathbf{a}) $ is the log likelihood of the word sequence given the sampled attention sequence and the image features
- $ \frac{\partial \log p(\tilde{s}^n | \mathbf{a})}{\partial W} $ is the gradient of the log probability of the sampled attention sequence given the image features


**Significance**: Provides a practical way to estimate the gradient of $ L_s $. Computing it exactly would require summing over every possible attention sequence, so the sum is replaced by an average over $ N $ sampled sequences.
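
As a sanity check, the estimator can be compared with the exact gradient on a problem small enough to enumerate. In the toy setup below (my own construction, not from the paper or the repository) there is a single attention step, the parameters $ W $ are the three attention logits `w`, and the caption log-likelihoods are held fixed, so the first term of the formula is zero and only the second term remains.

```python
import math
import random


def softmax(w):
    m = max(w)
    exps = [math.exp(x - m) for x in w]
    total = sum(exps)
    return [e / total for e in exps]


w = [0.5, 0.0, -0.5]          # attention logits (the parameters W of this toy)
log_p_y = [-1.0, -2.0, -3.0]  # log p(y | s=i, a), held fixed
alpha = softmax(w)

# Exact gradient of L_s = sum_i alpha_i * log_p_y[i] with respect to w_j.
L_s = sum(a * r for a, r in zip(alpha, log_p_y))
exact = [a * (r - L_s) for a, r in zip(alpha, log_p_y)]

# Monte Carlo estimate: average of log p(y|s) * d log p(s|a) / dw over sampled locations.
rng = random.Random(0)
N = 20000
estimate = [0.0] * len(w)
for _ in range(N):
    s = rng.choices(range(len(w)), weights=alpha, k=1)[0]
    for j in range(len(w)):
        grad_log_p_s = (1.0 if j == s else 0.0) - alpha[j]  # d log alpha_s / d w_j
        estimate[j] += log_p_y[s] * grad_log_p_s / N

print(exact)
print(estimate)
```

The two vectors agree up to sampling noise. That noise is the variance the next step tries to reduce.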

#### 6. Variance Reduction Techniques

**Moving Average Baseline**
$$ b_k = 0.9 \times b_{k-1} + 0.1 \times \log p(\mathbf{y} | \tilde{s}_k, \mathbf{a}) $$

where...
- $ b_k $ represents the moving average baseline at the $ k $-th mini-batch during training.
- The formula for $ b_k $ is an exponential moving average: each new log-likelihood enters with weight $ 0.1 $, while the coefficient $ 0.9 $ multiplies the previous baseline $ b_{k-1} $, so the contribution of older mini-batches decays geometrically over time.


- **Significance**: Reduces the variance in the Monte Carlo estimator of the gradient, stabilizing training.
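
The baseline update is a one-liner. The sketch below applies it to a few made-up per-mini-batch log-likelihoods; `update_baseline` is an illustrative helper name, and starting from $ b_0 = 0 $ is an assumption.

```python
def update_baseline(b_prev, log_p_y, decay=0.9):
    """b_k = decay * b_{k-1} + (1 - decay) * log p(y | s_k, a)."""
    return decay * b_prev + (1.0 - decay) * log_p_y


b = 0.0  # assumed starting value
history = []
for log_p_y in [-2.0, -2.0, -1.5, -1.0]:  # made-up per-mini-batch log-likelihoods
    b = update_baseline(b, log_p_y)
    history.append(b)

print(history)
```

The baseline moves slowly toward the recent log-likelihoods, which is what makes it a stable reference point for judging whether a sampled attention sequence did better or worse than usual.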


**Entropy Regularization**
$$ \lambda_e \frac{\partial H[\tilde{s}^n]}{\partial W} $$

where...
- $ H[\tilde{s}^n] $ is the entropy of the attention distribution from which $ \tilde{s}^n $ is sampled. Adding it to the objective encourages the model to keep some uncertainty in its attention decisions, which promotes exploration: instead of always focusing on the same regions for similar images or features, the model is nudged to try other potentially informative regions as well.
- $ \lambda_e $ is a hyperparameter controlling the strength of the entropy regularization.

- **Significance**: Encourages exploration in attention decisions, further reducing variance and improving model robustness. A model that explores more diverse attention strategies is less likely to get stuck in local optima and can generalize better.
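
For intuition about what the regularizer rewards, the sketch below compares the entropy of a uniform and a sharply peaked attention distribution over four locations. The numbers are made up and `entropy` is an illustrative helper.

```python
import math


def entropy(alpha):
    """Shannon entropy H = -sum_i alpha_i log alpha_i (zero-probability terms contribute 0)."""
    return -sum(a * math.log(a) for a in alpha if a > 0)


uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]

H_uniform = entropy(uniform)
H_peaked = entropy(peaked)
print(H_uniform, H_peaked)
```

The uniform distribution reaches the maximum, $ \log 4 $, so a positive $ \lambda_e $ pulls the attention weights away from collapsing onto a single region too early in training.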

#### 7. Final Learning Rule

$$ \frac{\partial L_s}{\partial W} \approx \frac{1}{N} \sum_{n=1}^N\left[\frac{\partial \log p(\mathbf{y} | \tilde{s}^n, \mathbf{a})}{\partial W} + \lambda_r(\log p(\mathbf{y} | \tilde{s}^n, \mathbf{a}) - b) \frac{\partial \log p(\tilde{s}^n | \mathbf{a})}{\partial W} + \lambda_e \frac{\partial H[\tilde{s}^n]}{\partial W}\right] $$

where...
 
- The left-hand side, $ \frac{\partial L_s}{\partial W} $, is the gradient of the objective function $ L_s $ with respect to the model parameters $ W $.
- $ \frac{\partial \log p(\mathbf{y} | \tilde{s}^n, \mathbf{a})}{\partial W} $ is the gradient of the caption log-likelihood given the sampled attention sequence $ \tilde{s}^n $.
- $ \lambda_r(\log p(\mathbf{y} | \tilde{s}^n, \mathbf{a}) - b) \frac{\partial \log p(\tilde{s}^n | \mathbf{a})}{\partial W} $ resembles the REINFORCE learning rule from reinforcement learning: the gradient of the log probability of the sampled attention sequence, scaled by how much the caption log-likelihood exceeds the baseline $ b $ from step 6.
- $ \lambda_r $ is a hyperparameter that controls how strongly this reward-weighted term influences training relative to the other two terms.
- $ \lambda_e $ is a hyperparameter controlling the strength of the entropy regularization.

**Significance**: Combines all elements (gradient approximation, baseline, and entropy regularization) into a single learning rule for training the model with hard attention.
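
The effect of the baseline in this rule can be checked exactly on a toy problem. In the sketch below (my own construction) there is one attention step over three locations, the parameters are the attention logits `w`, and the caption log-likelihoods are fixed, so the first term of the rule is zero. The values of `lambda_r` and `lambda_e` are assumptions, and the exact expected log-likelihood stands in for the moving-average baseline.

```python
import math


def softmax(w):
    m = max(w)
    exps = [math.exp(x - m) for x in w]
    total = sum(exps)
    return [e / total for e in exps]


w = [0.5, 0.0, -0.5]          # attention logits (the parameters W of this toy)
log_p_y = [-1.0, -2.0, -3.0]  # log p(y | s=i, a), held fixed
alpha = softmax(w)
lambda_r, lambda_e = 1.0, 0.01  # assumed hyperparameter values

# Entropy of alpha and its gradient with respect to the logits: dH/dw_j = -alpha_j (log alpha_j + H).
H = -sum(a * math.log(a) for a in alpha)
entropy_grad = [-a * (math.log(a) + H) for a in alpha]


def sample_gradient(s, b):
    """Learning-rule term for one sampled location s with baseline b."""
    return [
        lambda_r * (log_p_y[s] - b) * ((1.0 if j == s else 0.0) - alpha[j])
        + lambda_e * entropy_grad[j]
        for j in range(len(w))
    ]


def mean_and_variance(b):
    """Exact mean and variance of the first gradient component over all locations."""
    values = [sample_gradient(s, b)[0] for s in range(len(w))]
    mean = sum(a * v for a, v in zip(alpha, values))
    var = sum(a * (v - mean) ** 2 for a, v in zip(alpha, values))
    return mean, var


baseline = sum(a * r for a, r in zip(alpha, log_p_y))  # stand-in for the moving average b
mean_no_b, var_no_b = mean_and_variance(0.0)
mean_b, var_b = mean_and_variance(baseline)
print(mean_no_b, var_no_b)
print(mean_b, var_b)
```

The mean is the same with and without the baseline, so the estimator stays unbiased, while the variance drops sharply in this example.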



#### Connection to REINFORCE Learning Rule

This approach aligns with the REINFORCE rule from reinforcement learning, treating the sequence of attention decisions as actions with associated rewards based on the log likelihood of the generated caption.

### Deterministic Soft Attention

Unlike stochastic hard attention, which involves random sampling (where the model discretely chooses specific regions to focus on), soft attention deterministically calculates a weighted sum of all parts of the input, allowing for straightforward optimization and learning. Thus, in soft attention, the model considers all regions (or locations, i.e. all annotation vectors) in an image at each step of generating a caption, but with varying degrees of focus.

#### 1. Expectation of the Context Vector

$$ \mathbb{E}_{p(s_t \mid \mathbf{a})}[\hat{\mathbf{z}}_t] = \sum_{i=1}^L \alpha_{t, i} \mathbf{a}_i $$

where...
- The weights $ \alpha_{t, i} $ are the attention probabilities for each region at time step $ t $
- $ L $ is the total number of regions.

- **Explanation**: This formula represents the expected context vector as a weighted sum of all annotation vectors $ \mathbf{a}_i $ from the image.
- **Significance**: It provides a 'soft' focus by blending information from all parts of the image, with more emphasis on the areas deemed most relevant by the model.
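
A quick worked example with made-up numbers shows the difference from hard attention: every annotation vector contributes, in proportion to its weight.

```python
annotations = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy a_1..a_L with L=3, D=2
alpha_t = [0.1, 0.7, 0.2]                            # attention weights at time t, sum to 1

# Expected context vector: sum_i alpha_{t,i} * a_i, computed per feature dimension.
z_t = [sum(al * a[d] for al, a in zip(alpha_t, annotations)) for d in range(2)]
print(z_t)
```

The result is approximately `[0.2, 0.8]`, a blend dominated by the second region, whereas hard attention would return exactly one of the three vectors.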

#### 2. Deterministic Attention Model

  - **Concept**: The soft attention mechanism is deterministic because it doesn't involve random sampling. Instead, it uses a predictable, continuous function (the weighted sum) to determine the focus.
  - **Significance**: This deterministic nature makes the entire model, including the attention mechanism, smooth and differentiable, allowing for standard backpropagation during training.

#### 3. Normalized Weighted Geometric Mean (NWGM)
$$ \operatorname{NWGM}\left[p\left(y_t=k \mid \mathbf{a}\right)\right] = \frac{\exp \left(\mathbb{E}_{p\left(s_t \mid \mathbf{a}\right)}\left[n_{t, k}\right]\right)}{\sum_j \exp \left(\mathbb{E}_{p\left(s_t \mid \mathbf{a}\right)}\left[n_{t, j}\right]\right)} $$

- **Explanation**: NWGM is used for calculating the probability of the next word in the caption. It approximates the softmax probability distribution over possible next words by applying softmax to the expectations of the underlying linear projections.
- **Significance**: Because the NWGM is itself a softmax over expected scores, it can be approximated with a single ordinary forward pass that uses the expected context vector, rather than by averaging predictions over many sampled attention locations.
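
The name comes from the paper's definition of the NWGM as a normalized product of per-location terms, $ \prod_i \exp(n_{t,k,i})^{p(s_{t,i}=1 \mid \mathbf{a})} $. The sketch below checks on made-up scores that this product form equals the softmax of expected scores, and prints the true expectation of the softmax for comparison. The variable names are mine.

```python
import math


def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


p_s = [0.7, 0.3]  # p(s_{t,i}=1 | a) for L=2 locations
n = [             # n[i][k]: pre-softmax score of word k when attending to location i
    [2.0, 0.5, -1.0],
    [0.0, 1.5, 0.5],
]
num_locations, num_words = len(n), len(n[0])

# Product form: prod_i exp(n_{t,k,i}) ** p_i, normalized over words.
geo = [
    math.prod(math.exp(n[i][k]) ** p_s[i] for i in range(num_locations))
    for k in range(num_words)
]
nwgm_product = [g / sum(geo) for g in geo]

# Softmax of the expected scores, as in the formula above.
expected_scores = [sum(p_s[i] * n[i][k] for i in range(num_locations)) for k in range(num_words)]
nwgm_softmax = softmax(expected_scores)

# True expectation of the softmax output over attention locations, for comparison.
per_location = [softmax(row) for row in n]
expected_probs = [
    sum(p_s[i] * per_location[i][k] for i in range(num_locations)) for k in range(num_words)
]

print(nwgm_softmax)
print(expected_probs)
```

The first two quantities are identical by construction. The third is the expectation that the NWGM approximates, and it generally differs slightly.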

#### 4. Simplification for Learning

- **Concept**: Learning with deterministic soft attention is more straightforward than with stochastic hard attention. The expected context vector can be directly used in forward propagation, and standard backpropagation can be applied for training.
- **Significance**: This simplification means that models with soft attention can be trained efficiently with conventional optimization algorithms, making them practical for large-scale applications.

#### 5. Approximation to Marginal Likelihood

- **Concept**: Deterministic soft attention can be seen as an approximation to optimizing the marginal likelihood over attention locations, which is a complex problem in the stochastic hard attention setting.
- **Significance**: It provides a practical and computationally efficient way to capture the benefits of attention mechanisms without the need for complex sampling or optimization methods required by stochastic hard attention.

#### Conclusion

Deterministic Soft Attention offers a practical and efficient method for implementing attention mechanisms in neural networks, especially for tasks like image captioning. By calculating a weighted sum of input features and avoiding the complexity of stochastic sampling, it facilitates smooth and differentiable models that are amenable to standard training techniques. This approach enables the model to effectively focus on relevant parts of the input while maintaining computational tractability and ease of training.

### Implementation of the Attention Mechanism

Enough theory, let's apply this knowledge to the code. The attention mechanism is implemented as a separate module used by the decoder at each time step. I use the deterministic soft attention mechanism described above. Implementing stochastic hard attention is considerably more complex and would require reinforcement learning to train the model, which is beyond the scope of the deep learning course for which I am implementing this project. It is worth noting, however, that Stochastic Hard Attention performed slightly better than Deterministic Soft Attention in the original paper, as measured by the BLEU score.

```python
import torch
import torch.nn as nn


class Attention(nn.Module):
    def __init__(self, encoder_dim):
        super(Attention, self).__init__()
        self.U = nn.Linear(512, 512)
        self.W = nn.Linear(encoder_dim, 512)
        self.v = nn.Linear(512, 1)
        self.tanh = nn.Tanh()
        self.softmax = nn.Softmax(1)

    def forward(self, img_features, hidden_state):
        U_h = self.U(hidden_state).unsqueeze(1)
        W_s = self.W(img_features)
        att = self.tanh(W_s + U_h)
        e = self.v(att).squeeze(2)
        alpha = self.softmax(e)
        context = (img_features * alpha.unsqueeze(2)).sum(1)
        return context, alpha
    
```

#### 1. Calculating Attention Weights and Context Vector

**Formula**: 
$ \mathbb{E}_{p(s_t \mid \mathbf{a})}[\hat{\mathbf{z}}_t] = \sum_{i=1}^L \alpha_{t, i} \mathbf{a}_i $

**Code Implementation** (in `Attention` class):

- **Attention Weights Calculation**:
  ```python
  U_h = self.U(hidden_state).unsqueeze(1)
  W_s = self.W(img_features)
  att = self.tanh(W_s + U_h)
  e = self.v(att).squeeze(2)
  alpha = self.softmax(e)
  ```
  Here, `U_h` and `W_s` are the transformed hidden state and image features, respectively. `alpha` is the attention probability for each region in the image.

- **Context Vector Calculation**:
  ```python
  context = (img_features * alpha.unsqueeze(2)).sum(1)
  ```
  This line computes the weighted sum of the image features based on the attention weights, resulting in the context vector.
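
To trace the same computation by hand, here is a plain-Python version with tiny, made-up weights. It mirrors the steps of `Attention.forward` but is not the module itself: the matrices stand in for `self.W`, `self.U`, and `self.v`, biases are omitted, and the hidden size is 2 instead of 512.

```python
import math


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # L=3 regions, encoder_dim=2
h = [0.5, -0.5]                                       # LSTM hidden state (toy size)
W = [[1.0, 0.0], [0.0, 1.0]]  # stands in for self.W (bias omitted)
U = [[0.0, 1.0], [1.0, 0.0]]  # stands in for self.U (bias omitted)
v = [1.0, -1.0]               # stands in for self.v (bias omitted)

U_h = matvec(U, h)  # computed once, then added to every region
scores = []
for a in annotations:
    att = [math.tanh(x + y) for x, y in zip(matvec(W, a), U_h)]
    scores.append(sum(vi * ai for vi, ai in zip(v, att)))  # e_i = v . tanh(W a_i + U h)

alpha = softmax(scores)  # attention weights over the L regions
context = [sum(al * a[d] for al, a in zip(alpha, annotations)) for d in range(2)]
print(alpha, context)
```

In the module, `unsqueeze(1)` plays the role of the loop: it lets the single `U_h` vector broadcast across all regions at once.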

TODO: complete this section

## Changes to the implementation

In an effort to make the implementation more closely align with the paper, I made the following changes to the base implementation.

### DO-RNN according to the paper

As we saw above, the paper describes the deep-output RNN as having multiple layers, each followed by a non-linear activation function. The implementation by AaronCCWong only had one layer transforming the hidden state of the LSTM. Therefore, I implemented the deep-output RNN as described in the paper, with multiple layers and non-linear activations. This can be enabled by using the `--ado` flag when training the model.

$$ p\left(\mathbf{y}_t \mid \mathbf{a}, \mathbf{y}_1^{t-1}\right) \propto \exp \left(\mathbf{L}_o\left(\mathbf{E} \mathbf{y}_{t-1}+\mathbf{L}_h \mathbf{h}_t+\mathbf{L}_z \hat{\mathbf{z}}_t\right)\right) $$

Where:

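- $ \mathbf{L}_h $ and $ \mathbf{L}_z $ are learned weight matrices that project the hidden state and the context vector to the embedding dimension, $ \mathbf{E} $ is the embedding matrix, and $ \mathbf{L}_o $ maps their sum to the vocabulary.
- The proportionality to $ \exp(\cdot) $ denotes a softmax over the vocabulary. PyTorch's `nn.CrossEntropyLoss` expects unnormalized scores (logits) and applies the log-softmax internally, so the decoder returns raw scores without a softmax layer.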

```python

class Decoder(nn.Module):
    def __init__(self, vocabulary_size, encoder_dim, tf=False, ado=False):
        # ...
        self.use_advanced_deep_output = ado  # Enabled by the --ado flag

        # Simple DO: Layer for transforming LSTM state to vocabulary
        self.deep_output = nn.Linear(512, vocabulary_size)  # Maps LSTM outputs to vocabulary
        self.dropout = nn.Dropout()

        # Advanced DO: Layers for transforming LSTM state, context vector and embedding for DO-RNN
        hidden_dim, intermediate_dim = 512, 512  # intermediate_dim must match the embedding size
        self.f_h = nn.Linear(hidden_dim, intermediate_dim)  # Transforms LSTM hidden state
        self.f_z = nn.Linear(encoder_dim, intermediate_dim)  # Transforms context vector
        self.f_out = nn.Linear(intermediate_dim, vocabulary_size)  # Transforms the combined vector (sum of embedding, LSTM state, and context vector) to vocabulary
        self.relu = nn.ReLU()  # Activation function
        # ...

    def forward(self, img_features, captions):
        # ...
        for t in range(max_timespan):
            # ...
            # Generate word prediction
            if self.use_advanced_deep_output:
                output = self.advanced_deep_output(self.dropout(h), context, captions, embedding, t)
            else:
                output = self.deep_output(self.dropout(h))
            # ...
    
    def advanced_deep_output(self, h, context, captions, embedding, t):
        # Project the LSTM state and the context vector to the embedding size
        h_transformed = self.relu(self.f_h(h))
        z_transformed = self.relu(self.f_z(context))

        # E y_{t-1}: embed the ground-truth token during training; at inference,
        # `embedding` already holds the embedded previous prediction
        prev_word_embedding = self.embedding(captions[:, t]) if self.training else embedding

        # Sum the three terms, as inside the parentheses of the formula
        combined = h_transformed + z_transformed + prev_word_embedding

        # Return raw vocabulary scores (no activation, so logits can be negative)
        return self.f_out(combined)
```
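
Regarding the softmax in the formula: PyTorch documents `nn.CrossEntropyLoss` as taking unnormalized logits and combining a log-softmax with the negative log-likelihood loss. The plain-Python sketch below reproduces that computation for a single made-up prediction. It is an illustration of the math, not PyTorch's implementation, and the helper names are mine.

```python
import math


def log_softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_norm for x in logits]


def cross_entropy_from_logits(logits, target):
    """Negative log-probability of `target` after a softmax over `logits`."""
    return -log_softmax(logits)[target]


logits = [1.7, 0.5, 2.2]  # raw decoder scores for a 3-word toy vocabulary
target = 2                # index of the ground-truth word
loss = cross_entropy_from_logits(logits, target)

probs = [math.exp(lp) for lp in log_softmax(logits)]
print(loss, probs)
```

Since the normalization happens inside the loss, the decoder should output raw scores. Applying a softmax in the model as well would normalize twice.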

TODO:

- [x] Verify that the softmax function is applied by the CrossEntropyLoss function in PyTorch (confirmed: `nn.CrossEntropyLoss` takes raw logits and applies the log-softmax internally).