# Implementing Transformer Models
## Practical II
Carel van Niekerk & Hsien-Chin Lin

14-18.10.2024

---

In this practical we will explore the two different reasons for masking in the attention mechanism and implement both. We will further discuss how deep learning code can be tested using the `pytest` framework.

### 1. Masking
#### 1.1. Masking Padded Tokens

In the previous practical, we delved into the implementation of the attention mechanism, which computes a weighted sum of values ($V$) given a query ($Q$) and a set of keys ($K$). The example illustrated the process of deriving contextual representations for each token in a character sequence based on an input sequence.

In real-world scenarios, multiple sequences are usually processed at once as a Tensor of shape `(batch_size, sequence_length, embedding_size)`. A Tensor requires every sequence in the batch to have the same length, which sentences naturally do not. A common solution is padding: a special token (e.g., `<pad>`) is appended to the end of shorter sequences until all sequences have the same length. For instance, given the sentences "Welcome to this tutorial on attention" (6 words) and "Please remember to pad your reply before sending it" (9 words), the first sentence would be padded to become "Welcome to this tutorial on attention `<pad> <pad> <pad>`", so that the two sentences can be combined into a Tensor with `sequence_length = 9`.
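
As a minimal sketch of this padding step, the two example sentences can be turned into a single padded batch of token ids. The toy vocabulary, the name `PAD_ID`, and the word-level tokenization are assumptions made for illustration; `pad_sequence` is PyTorch's utility for padding a list of variable-length tensors.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical toy vocabulary; index 0 is reserved for the <pad> token.
PAD_ID = 0
sentences = [
    "Welcome to this tutorial on attention".split(),
    "Please remember to pad your reply before sending it".split(),
]
vocab = {"<pad>": PAD_ID}
for sentence in sentences:
    for word in sentence:
        vocab.setdefault(word, len(vocab))

# Convert each sentence to a 1-D tensor of token ids (lengths 6 and 9).
token_ids = [torch.tensor([vocab[w] for w in s]) for s in sentences]

# Append PAD_ID to the shorter sequence so that both have length 9.
batch = pad_sequence(token_ids, batch_first=True, padding_value=PAD_ID)

# Binary matrix: 1 for real tokens, 0 for padding.
padding_mask = (batch != PAD_ID).long()

print(batch.shape)
print(padding_mask)
```

The resulting `padding_mask` is the kind of binary matrix that can later tell the attention mechanism which positions hold real tokens.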

When employing the attention mechanism to generate contextual representations for these sentences, it's crucial that the `<pad>` tokens do not alter the original "meaning" of the sentences. To ensure this, the padding tokens are masked, thereby instructing the attention mechanism to disregard the `<pad>` tokens during processing.
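
One common way to apply such a mask, sketched here under assumptions, is to set the attention scores of padded key positions to negative infinity before the softmax, so that they receive an attention weight of exactly zero. The tensor sizes, the scaled dot-product form of the scores, and the mask layout `(batch_size, sequence_length)` are illustrative choices and may differ from your implementation.

```python
import torch

torch.manual_seed(0)

# Toy batch: 2 sequences, 4 positions, embedding size 8 (illustrative sizes).
q = torch.randn(2, 4, 8)
k = torch.randn(2, 4, 8)
v = torch.randn(2, 4, 8)

# 1 = real token, 0 = <pad>; the first sequence has two padded positions.
padding_mask = torch.tensor([[1, 1, 0, 0],
                             [1, 1, 1, 1]])

# Scaled dot-product scores, shape (batch, query_len, key_len).
scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5

# Broadcast the mask over the query dimension and block padded keys.
scores = scores.masked_fill(padding_mask[:, None, :] == 0, float("-inf"))

# Padded keys now get zero weight; each row still sums to one.
weights = torch.softmax(scores, dim=-1)
context = weights @ v
```

Note that this sketch only prevents tokens from attending *to* padding. The rows belonging to padded query positions still produce outputs, which are typically ignored later (for example, when computing the loss).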

#### 1.2. Future Masking

The model introduced in the paper "Attention Is All You Need" features a "decoder" component. The inputs to this decoder are the target tokens, shifted right by one position, so that the prediction for position $i$ can depend only on the targets at positions before $i$. To prevent the model from "cheating" by peeking at subsequent tokens in the sequence, future tokens are masked. This is accomplished by masking the strictly upper triangular portion of the attention matrix (the entries above the main diagonal, where the key position is later than the query position), so that each token attends only to itself and to earlier tokens.
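
The future mask can be built from the sequence length alone. The following is a minimal sketch, assuming attention scores of shape `(batch, query_len, key_len)`; the random scores stand in for real query-key products.

```python
import torch

torch.manual_seed(0)

seq_len = 4
scores = torch.randn(1, seq_len, seq_len)  # (batch, query_len, key_len)

# True strictly above the diagonal, i.e. where key position > query position.
future_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

# Block future positions before the softmax.
masked_scores = scores.masked_fill(future_mask, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)

print(future_mask)
print(weights[0])
```

After the softmax, row $i$ of `weights` has non-zero entries only in columns $0, \dots, i$. In particular, the first token can attend only to itself.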

### 2. Testing

Unit testing of the individual modules in a deep learning model is important, as it ensures that each module performs its intended operations. In this practical we will use the `pytest` framework to test the implementation of the attention mechanism. As an example, a unit test for a simple linear layer can be written as follows:

```python
import torch
from torch.nn import Linear

def test_linear_layer():
    # Set seed for reproducibility
    seed = 42
    torch.manual_seed(seed)

    # Define linear layer
    layer = Linear(2, 3)

    # Generate random weight and bias vectors for the linear layer
    weight = torch.randn(layer.weight.size())
    bias = torch.randn(layer.bias.size())
    layer.weight = torch.nn.Parameter(weight)
    layer.bias = torch.nn.Parameter(bias)

    x = torch.randn(5, 2)

    # Compute the expected and actual outputs
    expected = torch.matmul(weight, x.T).T + bias.unsqueeze(0).repeat((x.size(0), 1))
    actual = layer(x)

    assert torch.allclose(expected, actual)
```

The `test_linear_layer` function tests the linear layer by comparing the expected output with the actual output. The `torch.allclose` function checks whether the two tensors are element-wise equal within a certain tolerance, which is necessary because floating point operations are not always exact. It returns a single Python boolean: `True` if every pair of elements is equal within the specified tolerance and `False` otherwise. The `assert` statement then checks this boolean. If it is `True`, the test passes; if not, the test fails.

This test can be executed using:

```bash
pytest test_linear_layer.py
```
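
The tolerance-based comparison used in the test can be illustrated with a small example. This sketch uses 64-bit floats, where `0.1 + 0.2` is not exactly `0.3`; the `rtol` and `atol` arguments of `torch.allclose` control the relative and absolute tolerance, and the strict values in the last call are chosen here only for illustration.

```python
import torch

a = torch.tensor([0.1], dtype=torch.float64) + torch.tensor([0.2], dtype=torch.float64)
b = torch.tensor([0.3], dtype=torch.float64)

# Exact comparison: fails because of floating point rounding.
exactly_equal = torch.equal(a, b)

# Tolerant comparison: passes if |a - b| <= atol + rtol * |b| element-wise.
close = torch.allclose(a, b)

# With a (deliberately unrealistic) near-zero tolerance the check fails again.
strict = torch.allclose(a, b, rtol=0.0, atol=1e-20)

print(exactly_equal, close, strict)
print(type(close))  # a plain Python bool, not a tensor
```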

### 3. Exercises

1. Make sure you understand the role of the two different masks in the attention mechanism. Explain the role of each mask in your own words.
2. Incorporate both masking mechanisms into your attention mechanism implementation from the previous practical. For masking padded tokens, use an input tensor that indicates which tokens are not padded (a binary matrix). The mask for future tokens, by contrast, can be computed internally within the attention mechanism module.
3. Revisit the test outlined earlier, then write a test to verify the correctness of your attention mechanism implementation.
4. Test your attention mechanism implementation using the provided test, `practical_2_test.py`. (Ensure that your attention mechanism has the following inputs: `query, key, value, mask=None`.)

Summary for exercise 1:

- Padding mask: prevents attention to padded tokens, which don't carry meaningful information.
- Future mask: prevents the model from looking at future tokens in tasks like sequence generation.
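
For exercise 3, one property worth testing follows directly from the role of the padding mask: changing the content of padded positions must not change the output. The sketch below checks this with `pytest`-style assertions. The function `reference_attention` is a hypothetical stand-in, not the course solution, and the mask layout `(batch_size, key_len)` is an assumption; replace it with your own implementation and adapt the mask shape accordingly.

```python
import torch


def reference_attention(query, key, value, mask=None):
    """Hypothetical stand-in for your own attention implementation."""
    scores = query @ key.transpose(-2, -1) / query.size(-1) ** 0.5
    if mask is not None:
        # Assumed mask layout: (batch, key_len), 1 = real token, 0 = <pad>.
        scores = scores.masked_fill(mask[:, None, :] == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ value


def test_padded_positions_are_ignored():
    torch.manual_seed(42)
    query = torch.randn(2, 3, 4)
    key = torch.randn(2, 5, 4)
    value = torch.randn(2, 5, 4)
    mask = torch.tensor([[1, 1, 1, 0, 0],
                         [1, 1, 1, 1, 1]])

    output = reference_attention(query, key, value, mask)

    # Overwrite keys and values at the padded positions of the first sequence.
    key_changed, value_changed = key.clone(), value.clone()
    key_changed[0, 3:] = torch.randn(2, 4)
    value_changed[0, 3:] = torch.randn(2, 4)
    output_changed = reference_attention(query, key_changed, value_changed, mask)

    # The padded content must not affect the result.
    assert torch.allclose(output, output_changed)
```

A similar test can be written for the future mask, for example by checking that changing a later token does not alter the outputs at earlier positions.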