**Heidelberg University**

**Data Science  Group**
    
Prof. Dr. Michael Gertz  

Ashish Chouhan, Satya Almasian, John Ziegler, Jayson Salazar, Nicolas Reuter
    
December 4, 2023
    
Natural Language Processing with Transformers

Winter Semester 2023/2024
***

# **Assignment 3: “Transformers”**
**Due**: Monday, January 8, 2024, 2pm, via [Moodle](https://moodle.uni-heidelberg.de/course/view.php?id=19251)



### **Submission Guidelines**

- Solutions need to be uploaded as a **single** Jupyter notebook. You will find several pre-filled code segments in the notebook; your task is to fill in the missing cells.
- For the written solution, use LaTeX in markdown inside the same notebook. Do **not** hand in a separate file for it.
- Download the .zip file containing the dataset but do **not** upload it with your solution.
- It is sufficient if one person per group uploads the solution to Moodle, but make sure that the full names of all team members are given in the notebook.

***

## **Task 1: Diving into Attention** (3 + 4 + 4 + 1 = 12 points)

In this task, you work with self-attention equations and find out why multi-head attention is preferable to single-head attention.

Recall the equation of attention on slide 5-9 to compute self-attention on a series of input tokens. We simplify the formula by focusing on a single query vector $q \in R^d$, value vectors ($\{ v_1,v_2,...,v_n \},v_i \in R^d$), and key vectors ($\{ k_1,k_2,...,k_n \},k_i \in R^d$). We then have

$$
a_i=\frac{exp(q^Tk_i)}{\Sigma^n_{j=1}exp(q^Tk_j)}
$$

$$
 o= \Sigma^n_{i=1} a_i v_i
$$

with $a_i$ being the attention weight for query $q$ with respect to key $k_i$. Then the output $o$ is the new representation for the query token as a weighted average of value vectors with weights $a=\{ a_1,a_2,...,a_n \},a_i \in R$.
Answer the following questions with the help of the equations and the intuition behind attention that you learned in the class:
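The two equations above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name `attention` and the toy vectors are assumptions made for this example and are not part of the assignment.

```python
import math

def attention(q, keys, values):
    """Single-query attention: softmax over q^T k_i, then a weighted sum of the values."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    # Subtracting the maximum score keeps exp() stable and leaves the softmax unchanged.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output

# Toy example with d = 2 and n = 3 (numbers chosen arbitrarily for illustration).
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights, output = attention(q, keys, values)
print(weights, output)
```

The weights are positive and sum to one, and the key most aligned with the query receives the largest weight.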



### Subtask 1: Copying  

1.   Explain why $a$ can be interpreted as a categorical distribution.
2.   This distribution is typically diffuse, where the mass is spread out between different values of $a_i$. Describe a scenario in which the categorical distribution puts all the weight on a single element, e.g., $a_j \gg \Sigma_{i\neq j}a_i$. What are the conditions on key and/or query for this to happen?
3. In this case of a single large $a_j$, what would the output $o$ look like and what does it mean intuitively?

In attention, it is easy to **copy** a value vector $v_i$ to the output $o$.





**Answer**



1. $a$ can be interpreted as a categorical distribution because every $a_i$ is positive (the exponential is always positive) and the weights sum to one, $\Sigma^n_{i=1} a_i = 1$, due to the normalization in the denominator. Each $a_i$ can therefore be read as the probability of selecting the key $k_i$ (and its value $v_i$) for the given query $q$.

2. The categorical distribution puts almost all the weight on a single element, $a_j \gg \Sigma_{i\neq j}a_i$, when the dot product $q^Tk_j$ is much larger than the dot products $q^Tk_i$ for all $i \neq j$. This happens when the query $q$ is strongly aligned with the key $k_j$, with a large enough norm to make the gap large, and is dissimilar (for example, nearly orthogonal) to all other keys.

3. In the case of a single large $a_j$, the output is $o \approx v_j$, i.e., approximately the value vector that belongs to the key $k_j$ with the highest attention weight. Intuitively, the query attends to a single input token and copies its value vector to the output.

It is easy to copy a value vector $v_i$ to the output $o$ by driving the corresponding attention weight $a_i$ close to 1 and all other weights close to 0. Because the output $o$ is a weighted sum of the value vectors $v_i$ with weights $a_i$, this yields $o \approx v_i$.





#### ${\color{red}{Comments\ 1.1}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 2: Averaging


Instead of focusing on just one value vector $v_j$, the Transformer model can incorporate information from multiple inputs. Consider the situation where we want to incorporate information from two value vectors $v_b$ and $v_c$ with keys $k_b$ and $k_c$. In machine learning one of the ways to combine this information is through averaging of vectors $o= \frac{1}{2}(v_b+v_c)$.  It might seem hard to extract information about the original vectors $v_b$ and $v_c$ from the resulting average. But under certain conditions, one can do so. In this subtask, we look at the following cases:

1. Suppose we know the following:


* $v_b$ lies in a subspace $B$ formed by the $m$ basis vectors $\{b_1, b_2, .. , b_m\}$, while $v_c$ lies in a subspace $C$ formed by the $p$ basis vectors $\{c_1, c_2, . . . , c_p\}$ (This means that any $v_b$ and $v_c$ can be expressed as a linear combination of their basis vectors).
*   All basis vectors have the norm 1 and are orthogonal to each other.
*   The two subspaces $B$ and $C$ are orthogonal, meaning $b_j^Tc_k=0$ for all $j$ and $k$.
* Given that $\{b_1, b_2, .. , b_m\}$ are both orthogonal and form a basis for $v_b$, we know that there exists some $d_1, ..., d_m$ such that $v_b=d_1 b_1+d_2 b_2+...+d_m b_m$. Use these $d\text{s}$ to solve this task.

Using the basis vectors $\{b_1, b_2, .. , b_m\}$, construct a matrix $M$ such that for arbitrary vectors $v_b$ and $v_c$ with the given conditions, we can use $M$ to extract $v_b$ from the sum of the vector $s = v_b + v_c$. In other words, construct an $M$ such that  $ Ms = v_b$ holds.


2. If we assume that
* all key vectors are orthogonal, i.e., $k_i^Tk_j=0$ for all $i \neq j$, and
* all key vectors have the norm 1.

Find an expression for the query vector $q$ such that $o \approx \frac{1}{2}(v_b+v_c)$. Justify your answer.

**Hint:** Use your finding in subtask 1 to solve part 2.

**Hint:** If the norm of a vector $x$ is 1, then $x^Tx=1$

**Hint:** Start with writing $v_b$ and $v_c$ as the linear combination of the bases.


**Answer**


1. Let $U = [b_1, b_2, \ldots, b_m]$ be the matrix whose columns are the basis vectors of the subspace $B$, and choose $M = UU^T = \Sigma^m_{i=1} b_i b_i^T$. Because the two subspaces are orthogonal, $b_i^Tv_c = 0$ for every $i$, so $b_i^T(v_b + v_c) = b_i^Tv_b$. Because the basis vectors are orthonormal, $b_i^Tv_b = b_i^T(d_1 b_1 + \ldots + d_m b_m) = d_i$. Therefore $Ms = \Sigma^m_{i=1} b_i b_i^T (v_b + v_c) = \Sigma^m_{i=1} d_i b_i = v_b$. Stacking the $b_i^T$ as rows alone would only return the coefficient vector $(d_1, \ldots, d_m)^T$; multiplying by $U$ maps these coefficients back to $v_b$.

2. The attention output is $o = \Sigma^n_{i=1} a_i v_i$ with $a_i = \frac{\exp(q^Tk_i)}{\Sigma^n_{j=1}\exp(q^Tk_j)}$. For $o \approx \frac{1}{2}(v_b + v_c)$ we need $a_b \approx a_c \approx \frac{1}{2}$ and $a_i \approx 0$ for all other $i$. Choose $q = \beta(k_b + k_c)$ with a large scalar $\beta \gg 0$. Since the keys are orthogonal and have norm 1, $q^Tk_b = q^Tk_c = \beta$ and $q^Tk_i = 0$ for $i \notin \{b, c\}$. Hence $a_b = a_c = \frac{\exp(\beta)}{2\exp(\beta) + (n-2)}$, which approaches $\frac{1}{2}$ as $\beta$ grows, while all other weights approach 0. Thus $q = \beta(k_b + k_c)$ gives $o \approx \frac{1}{2}(v_b + v_c)$.
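Both parts of this subtask can be checked numerically on a toy example. The sketch below assumes standard basis vectors as the orthonormal bases and keys, a projection matrix built as the sum of outer products $b_i b_i^T$, and a query that is a large multiple of $k_b + k_c$; all names and numbers are illustrative assumptions.

```python
import math

def matvec(matrix, x):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in matrix]

def outer_sum(basis):
    """Return sum_i b_i b_i^T for a list of orthonormal basis vectors."""
    d = len(basis[0])
    return [[sum(b[r] * b[c] for b in basis) for c in range(d)] for r in range(d)]

# Part 1: subspace B is spanned by e1, e2 and subspace C by e3 in R^4 (toy choice).
basis_b = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
v_b = [2.0, -1.0, 0.0, 0.0]  # lies in B
v_c = [0.0, 0.0, 5.0, 0.0]   # lies in C
s = [x + y for x, y in zip(v_b, v_c)]
M = outer_sum(basis_b)
recovered = matvec(M, s)     # projects s back onto B

# Part 2: orthonormal keys and the query q = beta * (k_b + k_c).
def softmax_weights(q, keys):
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    max_score = max(scores)
    exps = [math.exp(score - max_score) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]

keys = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
b, c, beta = 1, 2, 20.0
q = [beta * (kb + kc) for kb, kc in zip(keys[b], keys[c])]
weights = softmax_weights(q, keys)
print(recovered, weights)
```

With these toy inputs, `recovered` matches `v_b`, and the weights of keys `b` and `c` are both close to one half while the remaining weights are negligible.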







#### ${\color{red}{Comments\ 1.2}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 3: Drawbacks of Single-head Attention

You might have wondered why we need multiple heads for attention. In this subtask, we look at some of the drawbacks of having a single attention head. As shown in the previous subtask, it is possible for single-head attention to focus equally on two values. The same can apply to any subset of values, which can therefore become problematic.

Consider a set of key vectors $\{ k_1,k_2,...,k_n \}$, randomly sampled from a normal distribution with a known mean value of $\mu_i \in R^d$ and unknown covariance $Σ_i, i \in \{1, \ldots, n\}$, where


*   $\mu_i\text{s}$ are all orthogonal $\mu_i^T\mu_j=0$ if $i \neq j$.
*   $\mu_i\text{s}$ all have unit norm $||\mu_i||=1$.

1. For a vanishingly small $\alpha$ (not to be confused with attention weights), the covariance matrices are  $Σ_i=\alpha I, \forall i  \in \{1,2,..,n\}$, design a query $q$ in terms of the $\mu_i$ such that as before, $o= \frac{1}{2}(v_b+v_c)$ and describe why it works.

2.  Large perturbations in the key values might cause problems for single-head attention. Specifically, in some cases, one key vector $k_b$ may be larger or smaller in norm than the others, while still pointing in the same direction as $\mu_b$. As an example of such a case,
consider a covariance matrix for item $b$ for vanishingly small $\alpha$ as $Σ_b=\alpha I + \frac{1}{2}(\mu_b\mu_b^T)$. This causes $k_b$ to point in roughly the same direction as $\mu_b$ but with large differences in magnitude. Further, let $Σ_i=\alpha I$ for all $i \neq b$. When you sample multiple keys from the distribution $\{ k_1,k_2,...,k_n \}$ and use the $q$ vector from the previous part, what do you expect vector $o$ to look like? Explain why this shows the drawback of single-head attention.

**Hint:**
Think about how it differs from the previous part and how $o$'s variance would be affected by the change in $Σ_b$.

**Hint:** Considering that $\mu_b^T\mu_b=1$, think of the range of values $Σ_b$ allows and how that affects a sampled $k_b$ value.

**Hint:** $\frac{exp(b)}{exp(b)+exp(c)}=\frac{exp(b)}{exp(b)+exp(c)}\frac{exp(-b)}{exp(-b)}= \frac{1}{1+exp(c-b)}$

**Answer:**




1. Since $\alpha$ is vanishingly small, every sampled key is almost equal to its mean, $k_i \approx \mu_i$, so the keys are approximately orthogonal with unit norm. We can therefore reuse the construction from the previous subtask with the means in place of the keys: $q = \beta(\mu_b + \mu_c)$ with a large scalar $\beta \gg 0$. Then $q^Tk_b \approx q^Tk_c \approx \beta$ and $q^Tk_i \approx 0$ for all other $i$, so $a_b \approx a_c \approx \frac{1}{2}$ and $o \approx \frac{1}{2}(v_b + v_c)$.

2. With the covariance $Σ_b$ given above, the sampled key is $k_b \approx \gamma\mu_b$, where the scalar $\gamma$ is approximately distributed as $\mathcal{N}(1, \frac{1}{2})$, while $k_c \approx \mu_c$ as before. Using $q = \beta(\mu_b + \mu_c)$ with a large $\beta$, we get $q^Tk_b \approx \gamma\beta$ and $q^Tk_c \approx \beta$, and the weights of all other keys remain negligible. With the last hint, $a_b \approx \frac{1}{1+\exp(\beta(1-\gamma))}$ and $a_c \approx 1 - a_b$. Because $\beta$ is large, even a small deviation of $\gamma$ from 1 saturates the softmax: for $\gamma > 1$ we get $a_b \approx 1$ and $o \approx v_b$, and for $\gamma < 1$ we get $a_b \approx 0$ and $o \approx v_c$. Across different samples, $o$ therefore jumps between $v_b$ and $v_c$ instead of staying close to $\frac{1}{2}(v_b + v_c)$, i.e., it has a high variance. This shows the drawback of single-head attention: a single query cannot average robustly when the key norms vary. Multi-head attention can mitigate this, for example with one head attending to $v_b$ and another to $v_c$, whose outputs are then combined.
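The effect of a varying key norm can be illustrated with a small simulation. The sketch below assumes the query $q = \beta(\mu_b + \mu_c)$ and models the sampled key as $k_b \approx \gamma\mu_b$ with $\gamma \sim \mathcal{N}(1, \frac{1}{2})$; the helper `weight_b`, the value of `beta`, and the thresholds are illustrative assumptions.

```python
import math
import random

def weight_b(gamma, beta):
    """Attention weight a_b = 1 / (1 + exp(beta * (1 - gamma)))."""
    return 1.0 / (1.0 + math.exp(beta * (1.0 - gamma)))

random.seed(0)
beta = 20.0
# random.gauss expects the standard deviation, so a variance of 1/2 becomes sqrt(1/2).
samples = [weight_b(random.gauss(1.0, math.sqrt(0.5)), beta) for _ in range(1000)]
# Count how often the weight is saturated, i.e. far away from the desired 1/2.
saturated = sum(1 for a in samples if a < 0.05 or a > 0.95) / len(samples)
print(f"fraction of samples with a_b near 0 or 1: {saturated:.2f}")
```

Most samples end up with $a_b$ close to 0 or 1, so the output is usually close to either $v_c$ or $v_b$ rather than their average.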




#### ${\color{red}{Comments\ 1.3}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 4: Model Size  
1. Imagine you have an input sequence of  $l$ tokens, how much memory is required and what time complexity do we have for a single self-attention layer? (give your answer in terms of $l$)
2. If you have $N$ layers of self-attention, how  would the memory requirements and the time complexity change? (give your answer in terms of $l$ and $N$)
3. If you have $l=10,000$ and $10$ layers, with the ability to perform $10M$ operations per second, how long would it take to compute the attention output?


**Answer**


1. In terms of $l$, a single self-attention layer requires $O(l^2)$ memory and $O(l^2)$ time. The layer computes the query, key, and value matrices, each of which has shape $(l,d)$, and then the matrix of dot products between queries and keys, which has shape $(l,l)$. Storing this matrix takes $O(l^2)$ memory, and computing it takes $O(l^2d)$ operations, where $d$ is the dimension of the vector representations; for a fixed $d$ this is $O(l^2)$.
2. If you have $N$ layers of self-attention, each layer performs the same computation, so the time complexity becomes $O(Nl^2)$. The memory requirement also grows to $O(Nl^2)$ if the attention matrices of all layers are kept, as is needed for backpropagation during training.
3. If you have $l=10,000$ and $10$ layers, the number of operations is on the order of $10 \times 10,000^2 = 10^9$. If you can perform $10M = 10^7$ operations per second, this takes $\frac{10^9}{10^7} = 100$ seconds. If the embedding dimension is counted as well, e.g., $d=512$, which is a common choice for transformer models, the number of operations is $512 \times 10^9 = 5.12 \times 10^{11}$, which takes $\frac{5.12 \times 10^{11}}{10^7} = 5.12 \times 10^4$ seconds, or about $14.2$ hours. This shows that the self-attention mechanism is very expensive for long sequences, and motivates the need for more efficient alternatives.
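The arithmetic for the last question can be checked with a short script. This is a rough sketch that counts $N \cdot l^2$ operations (optionally multiplied by an assumed embedding dimension $d$); the function name and the choice $d=512$ are assumptions for illustration.

```python
def attention_seconds(seq_len, num_layers, ops_per_second, dim=1):
    """Estimated seconds for num_layers attention layers costing seq_len^2 * dim operations each."""
    operations = num_layers * seq_len ** 2 * dim
    return operations / ops_per_second

# Counting only the l^2 term per layer.
print(attention_seconds(10_000, 10, 10_000_000))
# Additionally assuming an embedding dimension of d = 512.
print(attention_seconds(10_000, 10, 10_000_000, dim=512))
```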


#### ${\color{red}{Comments\ 1.4}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

## **Task 2: Multiple Choice Question Answering** (4 + 3 + 5 + 2 = 14 points)

In this task, you will fine-tune a transformer model on a multiple-choice task, which is the task of selecting the most plausible inputs in a given selection. The dataset used here is [SWAG](https://www.aclweb.org/anthology/D18-1009/), which is available via the Hugging Face [hub](https://huggingface.co/datasets/swag). Check the link for an overview of the dataset. SWAG is a dataset about commonsense reasoning, where each example describes a situation and then proposes four options that could apply for it.
Let's start by installing the necessary packages.

In [1]:
%pip install transformers
%pip install datasets
%pip install evaluate
%pip install accelerate -U
%pip install sentencepiece

Collecting transformers
  Downloading transformers-4.36.2-py3-none-any.whl.metadata (126 kB)
Collecting huggingface-hub<1.0,>=0.19.3 (from transformers)
  Downloading huggingface_hub-0.20.2-py3-none-any.whl.metadata (12 kB)
Collecting regex!=2019.12.17 (from transformers)
  Downloading regex-2023.12.25-cp311-cp311-win_amd64.whl.metadata (41 kB)

In this task, you will use a BERT model with a `MultipleChoice` head from the Hugging Face library and then create your custom model. Recall from the class that the BERT model has an auxiliary next sentence prediction task, in which two sentences are given to BERT separated by a `[SEP]` token and a classifier head decides if the second sentence logically follows the first one. Hugging Face has a `*ForMultipleChoice` architecture that uses the representation of the `[CLS]` token and a linear layer to classify if one sentence follows the other. We first start with this default architecture and then build a more complicated one in a later subtask.

### Subtask 1: Loading and Processing the Data

We use the `datasets` library to download the SWAG dataset, which already contains train, validation, and test splits.

In [2]:
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
from datasets import load_dataset, load_metric
datasets = load_dataset("swag", "regular")
datasets


DatasetDict({
    train: Dataset({
        features: ['video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label'],
        num_rows: 73546
    })
    validation: Dataset({
        features: ['video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label'],
        num_rows: 20006
    })
    test: Dataset({
        features: ['video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label'],
        num_rows: 20005
    })
})

Let's look at the first item to see what the data looks like:

In [3]:
datasets["train"][0]

{'video-id': 'anetv_jkn6uvmqwh4',
 'fold-ind': '3416',
 'startphrase': 'Members of the procession walk down the street holding small horn brass instruments. A drum line',
 'sent1': 'Members of the procession walk down the street holding small horn brass instruments.',
 'sent2': 'A drum line',
 'gold-source': 'gold',
 'ending0': 'passes by walking down the street playing their instruments.',
 'ending1': 'has heard approaching them.',
 'ending2': "arrives and they're outside dancing and asleep.",
 'ending3': 'turns the lead singer watches the performance.',
 'label': 0}

**Question:**
Look at the dataset card on the Hugging Face hub and define what each of these fields means, with respect to the task:

*   `sent1`:
*   `sent2`:
*    `ending0`, `ending1`, `ending2` and `ending3`:
*   `label`:




**Answer**

*   `sent1`: the first sentence, which describes the context of the situation.
*   `sent2`: the beginning of the second sentence, which has to be completed by one of the endings. Together, `sent1` and `sent2` form the `startphrase` field.
*    `ending0`, `ending1`, `ending2` and `ending3`: the four candidate endings for `sent2`; only one of them is correct.
*   `label`: the index (0 to 3) of the correct ending.

Write a function that displays the context and each of the four choices, following the format


```
Context:...
A-
B-
C-
D-
Ground truth: option ...
```

How you display the results is not important. You should be able to extract different parts of the data correctly and know what each field represents.

In [12]:
def explain_example(examples):
    ### your code ###
    # Name of the field that holds the correct ending, e.g. "ending0".
    ground_truth = "ending" + str(examples["label"])

    # Return the four candidate endings followed by the correct one.
    return examples["ending0"], examples["ending1"], examples["ending2"], examples["ending3"], examples[ground_truth]
    ### your code ###


In [13]:
explain_example(datasets["train"][0])

('passes by walking down the street playing their instruments.',
 'has heard approaching them.',
 "arrives and they're outside dancing and asleep.",
 'turns the lead singer watches the performance.',
 'passes by walking down the street playing their instruments.')
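For comparison, the display format requested above (context, options A to D, ground truth) could be produced along the following lines. This is only a sketch: the function name `format_example` is hypothetical, and the sample dictionary copies the relevant fields of the first training item shown earlier.

```python
def format_example(example):
    """Return a SWAG example in the 'Context / A-D / Ground truth' layout."""
    letters = "ABCD"
    lines = ["Context: " + example["startphrase"]]
    for i, letter in enumerate(letters):
        lines.append(letter + "- " + example["ending" + str(i)])
    # The label is the index of the correct ending, so it maps directly to a letter.
    lines.append("Ground truth: option " + letters[example["label"]])
    return "\n".join(lines)

sample = {
    "startphrase": "Members of the procession walk down the street holding small horn brass instruments. A drum line",
    "ending0": "passes by walking down the street playing their instruments.",
    "ending1": "has heard approaching them.",
    "ending2": "arrives and they're outside dancing and asleep.",
    "ending3": "turns the lead singer watches the performance.",
    "label": 0,
}
print(format_example(sample))
```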

Before feeding the data into the model, we need to preprocess the text using `Tokenizer` to tokenize the inputs into tokens and put it in a format that the model expects. The tokenizer specific to the model we want to use for this task is `distilbert-base-uncased`. Complete the code below to load a fast tokenizer for this model. DistilBERT is similar to the BERT model, and we only use this particular architecture for faster training.


In [None]:
from transformers import AutoTokenizer

###your code###
tokenizer =
###your code###

In [None]:
tokenizer("This is the first sentence!", "And this is the second one.")

Write a function that preprocesses the samples.
The tricky part is to put all the possible pairs of sentences in two big lists before passing them to the tokenizer.
Each **first** sentence has to be repeated 4 times to go with different ending options.
There should be a separator token between the first and second sentence, to follow the BERT input logic.
The final output is a list of 4 elements, one for each choice, where the input is transformed by the tokenizer.
For example, with a list of 2 training examples, the output includes 2 lists, where each contains 4 elements. Each of those elements is the converted input ID of the first sentence followed by the second sentence with different endings.
When calling the `tokenizer`, we use the argument `truncation=True`. This ensures that an input longer than what the selected model can handle is truncated to the maximum length accepted by the model.

**Hint:** Flatten the lists (all choices are flattened into a single list) before feeding them into the tokenizer and unflatten them once again for the final output.
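The flatten, tokenize, and un-flatten pattern from the hint can be illustrated without a real tokenizer. In the sketch below, `toy_tokenize` is a hypothetical stand-in that only counts whitespace tokens, and the function and variable names are assumptions for this example; the actual solution would call the Hugging Face tokenizer instead.

```python
from itertools import chain

NUM_CHOICES = 4

def toy_tokenize(first, second):
    """Hypothetical stand-in for a real tokenizer: one dummy id per whitespace token."""
    tokens = ["[CLS]"] + first.split() + ["[SEP]"] + second.split() + ["[SEP]"]
    return list(range(len(tokens)))

def flatten_tokenize_unflatten(contexts, headers, endings):
    # Repeat each context once per choice and pair it with header + ending.
    first = [[ctx] * NUM_CHOICES for ctx in contexts]
    second = [[hdr + " " + end for end in ends] for hdr, ends in zip(headers, endings)]
    # Flatten: one long list of first sentences and one of second sentences.
    flat_first = list(chain.from_iterable(first))
    flat_second = list(chain.from_iterable(second))
    flat_ids = [toy_tokenize(a, b) for a, b in zip(flat_first, flat_second)]
    # Un-flatten: regroup every NUM_CHOICES consecutive encodings into one example.
    return [flat_ids[i:i + NUM_CHOICES] for i in range(0, len(flat_ids), NUM_CHOICES)]

contexts = ["A man sits down.", "The dog barks."]
headers = ["He", "It"]
endings = [["eats.", "sleeps now.", "runs away fast.", "sings."],
           ["jumps.", "eats food.", "sleeps.", "runs."]]
grouped = flatten_tokenize_unflatten(contexts, headers, endings)
print(len(grouped), len(grouped[0]), [len(x) for x in grouped[0]])
```

The result has one entry per example, and each entry holds four encodings, one per ending.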

In [None]:
### your code ###
ending_names =
### your code ###
def preprocess_function(examples):
  ### your code ###
    # repeat each first sentence four times
    first_sentences =
    # second sentences possible are combination of header and ending
    question_headers =
    second_sentences =

    # flatten everything


    # tokenize

    # un-flatten


    return
    ### your code ###

In [None]:
examples = datasets["train"][:2]
features = preprocess_function(examples)
print(len(features["input_ids"]), len(features["input_ids"][0]), [len(x) for x in features["input_ids"][0]])# output should be 2 4 [30, 25, 30, 28]

We can now apply our function to all the examples in the dataset. We use the `map` method to apply the function on all the elements of all the splits in the dataset (training, validation, and testing).
Note that we passed `batched=True` to leverage the fast tokenizer and use multi-threading to process the texts in batches concurrently.

In [None]:
encoded_datasets = datasets.map(preprocess_function, batched=True)

Our dataset is still not converted to tensors and not padded. This is the job of the `data collator`. A data collator takes a list of examples and converts them to a batch.
There is no data collator in the Hugging Face default library that works on our specific problem. We thus need to write our own one. In this collator:

*  All the inputs/attention masks are flattened.
* A flattened list is passed to the `tokenizer.pad` method to apply dynamic padding, i.e., to pad inputs to the maximum length in the batch. The output will have the size `(batch_size * 4) x seq_length`.
* Everything needs to be unflattened for the output of the data collator.
* `input_ids` and `labels` should be returned as tensors.
* The output is a dictionary called `batch` that contains the features needed for training (`input_ids`, `attention_mask`, `labels`).
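The steps listed above can be sketched in plain Python to show the intended shapes. This is an assumption-laden illustration: `pad_and_group` is a hypothetical helper that pads with nested lists, whereas the actual collator would rely on `tokenizer.pad` and return tensors.

```python
def pad_and_group(features, num_choices=4, pad_id=0):
    """Pad all choices in a batch to one length and regroup them per example."""
    # Flatten: one entry per (example, choice) pair.
    flat = [ids for feature in features for ids in feature["input_ids"]]
    max_len = max(len(ids) for ids in flat)
    # Dynamic padding: pad only up to the longest sequence in this batch.
    padded = [ids + [pad_id] * (max_len - len(ids)) for ids in flat]
    masks = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in flat]

    def group(rows):
        # Un-flatten into (batch_size, num_choices, max_len) as nested lists.
        return [rows[i:i + num_choices] for i in range(0, len(rows), num_choices)]

    return {
        "input_ids": group(padded),
        "attention_mask": group(masks),
        "labels": [feature["label"] for feature in features],
    }

features = [
    {"input_ids": [[1, 2, 3], [1, 2], [1, 2, 3, 4], [1]], "label": 0},
    {"input_ids": [[5, 6], [5], [5, 6, 7], [5, 6]], "label": 2},
]
batch = pad_and_group(features)
print(batch["input_ids"][0], batch["labels"])
```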



In [None]:
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch

@dataclass
class MultipleChoiceDataCollator:
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        accepted_keys = ["input_ids", "attention_mask", "label"]
        if len(features[0]) > len(accepted_keys):
            features = [{k: v for k, v in i.items() if k in accepted_keys} for i in features]
        ### your code ###

        labels =
        # flatten
        flattened_features =

        # use the tokenizer and attributes from the class to pad the input
        batch = self.tokenizer...


        # un-flatten


        ### your code ###
        return batch

In [None]:
accepted_keys = ["input_ids", "attention_mask", "label"]
features = [{k: v for k, v in encoded_datasets["train"][i].items() if k in accepted_keys} for i in range(2)]
batch=MultipleChoiceDataCollator(tokenizer)(features)
print(batch["input_ids"].shape)
print(batch["attention_mask"].shape)
print(batch["labels"].shape)

In [None]:
for i in range(4):
  print(batch["input_ids"][0][i])
  print(tokenizer.decode(batch["input_ids"][0][i]))

#### ${\color{red}{Comments\ 2.1}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 2: Fine-tuning a Hugging Face Model

To fine-tune our model, we first need to download the correct architecture from Hugging Face. Import the correct class for this task and download the pre-trained checkpoint for the base class from `distilbert-base-uncased`. Note that the weights in the classification head are initialized at random.

In [None]:
### your code ###
from transformers import ...
model_hf =

### your code ###

Next, we need to define our `Trainer` and pass in the correct `TrainingArguments` (a class that contains all the attributes to customize the training). Define a `TrainingArguments` that


* creates an output directory `distilbert-base-uncased-swag` to save the checkpoints and logs.
*   evaluates the model on the validation set every `300` steps.
* saves a checkpoint every `600` steps and keeps no more than 2 checkpoints in total.
* sets the random seed for training to `77`.
* uses a batch size of `48` for training and evaluation (if you are running out of memory, feel free to change this setting but indicate it as a comment in your notebook; on a T4 GPU from Google Colab this takes about `13.2GB` of `15.0GB`).
* trains for `1800` steps with a learning rate of `5e-5` and adds a weight decay of `0.01` to the optimizer.
* removes the columns from the data that are not used by the model.
* loads the checkpoint with the best overall validation metric at the end, not necessarily the last checkpoint.
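One way to read the requirements above is as a set of keyword arguments for `TrainingArguments`. The sketch below only builds a dictionary so that it runs without any library installed; the argument names are given as they are understood to exist in `transformers` 4.36 (newer releases rename `evaluation_strategy` to `eval_strategy`), and the mapping from each requirement to an argument is an assumption that should be checked against the documentation of the installed version.

```python
# Assumed mapping from the listed requirements to TrainingArguments keyword arguments.
training_kwargs = {
    "output_dir": "distilbert-base-uncased-swag",
    "evaluation_strategy": "steps",      # evaluate during training ...
    "eval_steps": 300,                   # ... every 300 steps
    "save_strategy": "steps",
    "save_steps": 600,                   # checkpoint every 600 steps
    "save_total_limit": 2,               # keep at most 2 checkpoints
    "seed": 77,
    "per_device_train_batch_size": 48,
    "per_device_eval_batch_size": 48,
    "max_steps": 1800,
    "learning_rate": 5e-5,
    "weight_decay": 0.01,
    "remove_unused_columns": True,
    "load_best_model_at_end": True,      # restore the best checkpoint, not the last one
}

# With transformers installed, these would be passed as TrainingArguments(**training_kwargs).
for name, value in training_kwargs.items():
    print(f"{name} = {value!r}")
```

Loading the best model at the end is generally expected to require the save interval to be a multiple of the evaluation interval, which holds here since 600 is a multiple of 300.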

**Note:** Please use a GPU to train your model. If you are on Colab, you can use a T4 GPU for free.

In [None]:
from transformers import TrainingArguments, Trainer
training_args =
    ### your code ###


    ### your code ###


Before we initialize the `Trainer`, we create a function that tells the trainer how to compute the metrics from the predictions. Fill the `compute_metrics` function to compute the accuracy based on the `predictions`. This object contains the prediction of the model, as well as the ground truth labels.

**Hint 1:** Keep in mind that the output of this function should be a dictionary containing the metric name and value.

**Hint 2:** Consider the shape of the example input. This is similar to the logits produced by the model.
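The idea behind the hints, taking the highest-scoring choice per row and comparing it with the label, can be sketched without NumPy. The function name `accuracy_from_logits` and the toy inputs are assumptions for illustration; the notebook's own `compute_metrics` would typically use `np.argmax` on the predictions instead.

```python
def accuracy_from_logits(logits, label_ids):
    """Fraction of rows whose highest-scoring choice equals the label."""
    # Index of the maximum entry in each row (the predicted choice).
    predicted = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predicted, label_ids))
    # The metric is returned as a dictionary mapping the metric name to its value.
    return {"accuracy": correct / len(label_ids)}

logits = [[0.9, 0.2, 0.0, 0.0],
          [0.2, 0.2, 0.9, 0.1],
          [0.2, 0.9, 0.0, 0.0],
          [0.1, 0.1, 0.2, 1.0]]
labels = [0, 3, 1, 3]
print(accuracy_from_logits(logits, labels))
```

Here three of the four rows are predicted correctly; only the second row, where choice 2 scores highest but the label is 3, counts as an error.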

In [None]:
import numpy as np
def compute_metrics(predictions):
    ### your code ###
    preds, label_ids =

    return_dict =
    ### your code ###
    return return_dict

In [None]:
preds=np.array([[0.9,0.2,0,0],
                [0.2,0.2,0.9,0.1],
                [0.2,0.9,0,0],
                [0.2,0.1,0.8,0],
                [0.9,0.1,0.8,0],
                [0.2,1,0.4,0],
                [0.2,1,0.4,0.9],
                [1,0.1,0.4,0.3],
                [0.1,0.1,0.9,0.3],
                [0.1,0.1,0.2,1]])
label_ids=np.array([0,3,1,2,0,1,3,0,2,3])
compute_metrics((preds,label_ids))

Now it's time to pass everything to a `Trainer` object to start the training process. Initialize a `Trainer` object and pass all the necessary information. Keep in mind that we also have the optional metric computation and that we intend to run an evaluation on the validation set during training. The training should take around 30 min on a Google Colab T4 GPU.

In [None]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
device

In [None]:
### your code ###
trainer =
### your code ###

In [None]:
trainer.train()# should take around 30 min on Google Colab T4 GPU

Save the model in `distilbert-base-uncased-swag/final_model`.

In [None]:
### your code ###

### your code ###

Look at the saved files and answer the following questions (it is possible to answer these questions by writing some code, but we want you to explore the saved files):

**Question:**


1.   What is the vocabulary id for the `[CLS]` and `[MASK]` tokens?
2.   What is the dropout probability for the attention layer?

**Dropout:** With dropout, certain nodes are set to the value zero in a training run, i.e. removed from the network. Thus, they have no influence on the prediction and also in the backpropagation. Thus, a new, slightly modified network architecture is built in each run and the network learns to produce good predictions without certain inputs. Read more [here](https://databasecamp.de/en/ml/dropout-layer-en).



**Answer**

`
Enter your answer here
`

#### ${\color{red}{Comments\ 2.2}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 3: Fine-tune a Custom Model


In this case, we were lucky that Hugging Face had a pre-implemented architecture available for us to use. However, that is not always the case. Moreover, we might want to experiment beyond the default architectures to find a suitable one for a task. Therefore, it is important to learn to extend the Hugging Face models and train a custom model. The good news is that except for the model architecture the rest of the code can remain as it is.

Design a multiple-choice model as follows:


1.   The config file for a feature extractor (must be a DistilBERT type) is passed during initialization. The config file determines which model is used for feature extraction.
2.   From the `last_hidden_state` of the feature extractor, choose the `[CLS]` embedding (first one). This embedding is used as the compressed representation of first and second sentences. During pre-training it is used  for classifying whether these two sentences follow one another, making it a good candidate for our task.
3. `[CLS]` embedding is passed through a linear layer **that does not change the size of the embedding** and is passed through a tanh nonlinearity.
4. The output of tanh is passed through a dropout layer, where the dropout probability is the same as the dropout probability used for the `distilbert` model used as feature extractor.
5. The output of the previous stage is fed into another linear layer that shrinks the size of the embedding dimension to a quarter of the original size, e.g., if the embedding size is 12, the new embedding dimension is 3.
6. The output is followed by another dropout layer (you can use the one from stage 4).
7. Finally, a binary classifier is applied to determine the probability of sentence 1 being followed by sentence 2.
8. the cross-entropy loss is used to compute the loss.

**Hint:** Keep in mind that for a 4-choice system, you classify each of the four solutions independently. However, the final output should group the four logits together. For example, if the input ids have the shape `[2, 4, 35]` (batch size=2, num choices=4, seq len=35), then the logits have the shape `[2, 4]` and the labels have the shape `[2, 1]`.
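
The shape bookkeeping in the hint can be illustrated with nested lists. This sketch only shows the flattening and regrouping of dimensions, not the model itself; the helper names `flatten_choices` and `regroup_logits` and the dummy scores are hypothetical.

```python
def flatten_choices(input_ids):
    """[batch, num_choices, seq_len] -> [batch * num_choices, seq_len]."""
    return [choice for example in input_ids for choice in example]

def regroup_logits(flat_logits, num_choices):
    """[batch * num_choices] -> [batch, num_choices]."""
    return [flat_logits[i:i + num_choices]
            for i in range(0, len(flat_logits), num_choices)]

batch_size, num_choices, seq_len = 2, 4, 35
input_ids = [[[0] * seq_len for _ in range(num_choices)] for _ in range(batch_size)]

# Each (example, choice) pair becomes its own sequence for the feature extractor
flat = flatten_choices(input_ids)

# Stand-in for the model: one score per flattened sequence
flat_logits = [float(i) for i in range(len(flat))]

# Group the scores of the four choices of each example back together
logits = regroup_logits(flat_logits, num_choices)

print(len(flat), len(flat[0]))      # 8 35
print(len(logits), len(logits[0]))  # 2 4
```

With tensors, the same two steps are typically a `view`/`reshape` before the feature extractor and another one on the classifier output.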



In [None]:
from typing import Optional

import torch
from torch import nn
from transformers import DistilBertModel, BertConfig, DistilBertConfig, PretrainedConfig, PreTrainedModel, DistilBertPreTrainedModel

class CustomMultipleChoice(DistilBertPreTrainedModel):
    def __init__(self, config: PretrainedConfig):
        super().__init__(config)
        ###your code ###
        self.distilbert =
        self.dense =
        self.activation =
        self.dropout =
        self.dense2 =
        self.classifier =
        ###your code ###


    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        labels: Optional[torch.Tensor] = None,
    ):
        """
        input_ids: input sentences converted to ids
        attention_mask: the attention mask
        labels:  Labels for computing the multiple choice classification loss. Indices should be in `[0, ...,num_choices-1]` where `num_choices` is the size of the second dimension of the input tensors.
        """

        num_choices = input_ids.shape[1]

        ###your code ###
        input_ids =
        attention_mask =



        loss = None
        if labels is not None:

        ###your code ###
        return {"loss":loss,"logits":reshaped_logits}


Initialize the feature extractor with `distilbert-base-uncased` and create your custom model.

In [None]:
from transformers import AutoConfig
###your code ###
config=
model_custom =
###your code ###

In [None]:
for name, param in model_custom.named_parameters():
    if param.requires_grad and not name.startswith("distilbert."):
        print(name, param.data.shape)

We keep the same training arguments but change the directory in which we save the model logs, the directory in which we save the model output and the name of the run, to `custom_model`.



In [None]:
###your code ###


###your code ###

Initialize the trainer for training the custom model. The training should take around 30 min on a Google Colab T4 GPU.


In [None]:
trainer =
###your code ###

###your code ###


In [None]:
trainer.train()# should take around 30 min on Colab T4 GPU

Save the model in `custom_model/final_model`. Note that you need to save the custom model without the help of the trainer: the trainer would save the configuration, but since this model is not a registered Hugging Face model, only the base model would be saved. Loading the model weights is also affected by this.

In [None]:
###your code ###

###your code ###

#### ${\color{red}{Comments\ 2.3}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 4: Evaluation and Model Comparison

Many times you do not perform the final evaluation right after training, but load the checkpoints and evaluate them on the fly. To this end, load the two models from  disk.

In [None]:
from transformers import AutoModelForMultipleChoice,AutoConfig
### your code ###
model_hf =
model_custom =
### your code ###

To evaluate the data we load the validation split using a data loader and our previously defined data collator. Note that although we had a test split we cannot use it, since there are no labels available for this split (you can check the data to confirm this).

In [None]:
from torch.utils.data import DataLoader
import evaluate

eval_dataloader = DataLoader(encoded_datasets["validation"], batch_size=64, collate_fn=MultipleChoiceDataCollator(tokenizer))

To make things easier, let's use the `evaluate` library from Hugging Face to compute the accuracy metric. Here we load `accuracy` from the `evaluate` library twice, once for the custom model and once for the Hugging Face model. Further, we put the models in eval mode. Complete the evaluation code, using the capabilities of the `evaluate` library to compute the metric for both models simultaneously.


In [None]:
from tqdm import tqdm
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
metric_dict={"custom":evaluate.load("accuracy"),"hf":evaluate.load("accuracy")} #use to compute accuracy
models_dict= {"custom":model_custom,"hf":model_hf}# use to access models

for name, model in models_dict.items():
  model.to(device)
  model.eval()

for i,batch in tqdm(enumerate(eval_dataloader), total=len(eval_dataloader)):
  ### your code ###
  # evaluate both models on each batch

acc_hf =
acc_custom =
  ### your code ###
print("Hugging Face Model :",acc_hf)
print("Custom Model :",acc_custom)

#### ${\color{red}{Comments\ 2.4}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

## **Task 3: Encoder-Decoder Architecture** (5 + 2 + 2 + 5 = 14 points)

We explored an encoder-based model (BERT) in the previous exercise. In this task, we look at another family of transformer architectures, the encoder-decoder. We use the [T5](https://arxiv.org/pdf/1910.10683.pdf) model, presented by Raffel et al. T5 is an encoder-decoder architecture pre-trained on a multi-task mixture of unsupervised and supervised tasks. In this task, we set up a fine-tuning example for question answering using the [SQuAD](https://huggingface.co/datasets/squad) dataset. Since the actual fine-tuning is time-consuming and computationally intensive, we use an already fine-tuned model for inference. The main goal is to introduce you to the structure of the fine-tuning and its simplicity with the Hugging Face framework.

To fine-tune the BERT-based models, we usually add a task-specific head. On the other hand, T5 converts all NLP problems into a text-to-text format.  
It is trained using teacher forcing, meaning that we require an input sequence and a corresponding target sequence.


1.   The input sequence is fed to the model using `input_ids` from the tokenizer.
2.   The target sequence is shifted to the right, i.e., prepended by a start-sequence token, and fed to the decoder using the `decoder_input_ids` (the input ids of the encoded target sequence). An EOS (end-of-sequence) token is appended to the target sequence to denote the end of generation; the result corresponds to the `labels`.
3. The task prefix defines what task is expected of T5. For example, we prepend the input sequence with `translate English to German: ` before encoding the input to tell the model to translate. T5 already has a set of pre-defined task prefixes, and it is best to stick to those since they were used during pre-training. With enough training data, you can also introduce your own custom task.


In contrast to the encoder model, where only a single `max_length` is required, for encoder-decoder architectures, one typically defines a `max_source_length` and `max_target_length`, which determine the maximum length of the input and output sequences, respectively. We must also ensure that the padding ID of the `labels` is not taken into account by the loss function. This can be done by replacing them with `-100`, which is the `ignore_index` of the `CrossEntropyLoss`.
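
The relationship between the target sequence, the `labels`, and the `decoder_input_ids` can be sketched with plain lists. This is an illustration under stated assumptions: the pad id `0` and EOS id `1` follow the released T5 vocabularies, where the pad token also serves as the decoder start token; the helper names and the three target token ids are made up.

```python
PAD_ID = 0           # T5 pad token id; T5 also uses it as the decoder start token
EOS_ID = 1           # T5 end-of-sequence token id (</s>)
IGNORE_INDEX = -100  # default ignore_index of CrossEntropyLoss

def pad_target(target_ids, max_target_length):
    """Truncate so that EOS still fits, append EOS, then pad to max_target_length."""
    ids = target_ids[:max_target_length - 1] + [EOS_ID]
    return ids + [PAD_ID] * (max_target_length - len(ids))

def make_labels(padded_target):
    """Replace pad positions by -100 so that the loss ignores them."""
    return [IGNORE_INDEX if t == PAD_ID else t for t in padded_target]

def shift_right(labels, start_id=PAD_ID):
    """Prepend the start token, drop the last position, map -100 back to pad."""
    shifted = [start_id] + labels[:-1]
    return [PAD_ID if t == IGNORE_INDEX else t for t in shifted]

target = [8, 15, 23]                     # made-up token ids of a short answer
padded = pad_target(target, max_target_length=6)
labels = make_labels(padded)
decoder_input_ids = shift_right(labels)

print(padded)             # [8, 15, 23, 1, 0, 0]
print(labels)             # [8, 15, 23, 1, -100, -100]
print(decoder_input_ids)  # [0, 8, 15, 23, 1, 0]
```

At every position, the decoder sees the tokens up to the previous step and is trained to predict the entry of `labels` at the same index, which is what teacher forcing means here.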

### Subtask 1: Data Processing

We first start by loading the dataset from Hugging Face hub:

In [7]:
from datasets import load_dataset

datasets_squad = load_dataset("squad")
datasets_squad

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [8]:
print("context ---->" ,datasets_squad["train"][0]["context"])
print("question ---->",datasets_squad["train"][0]["question"])
print("answers ---->",datasets_squad["train"][0]["answers"])

context ----> Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
question ----> To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
answers ----> {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}


Now let's load the needed pre-trained tokenizer for `t5-small`, which is the smallest T5 model. Set the maximum sequence length to `512`.

In [11]:
import torch
### your code ###
from transformers import T5Tokenizer
t5_tokenizer = T5Tokenizer.from_pretrained("t5-small", model_max_length=512)  # maximum sequence length of 512 tokens
### your code ###

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


The next step is to pre-process the dataset using the tokenizer to convert the sequences to IDs and add the special tokens.
T5 is based on the SentencePiece tokenizer, and the end of sentence token is denoted by `</s>`.
Complete the function `add_eos_to_examples` to format the input and target sequence. Your input as `input_text` should have the format `question:{question_text} context:{context_text} <EOS_Token>` and your target as `target_text` should have the format `{answer_text} <EOS_Token>`.

In [12]:
def add_eos_to_examples(example, tokenizer):
    # Format input sequence
    example['input_text'] = f"question:{example['question']} context:{example['context']} </s>"
    
    # Format target sequence
    example['target_text'] = f"{example['answers']['text'][0]} </s>"
    
    # Tokenize input and target sequences
    example['input_ids'] = tokenizer.encode(example['input_text'], return_tensors="pt")
    example['target_ids'] = tokenizer.encode(example['target_text'], return_tensors="pt")
    
    return example

Use the `map` function to process the data, and do not set the `batched` argument.

In [13]:
### your code ###
encoded_squad = datasets_squad.map(lambda example: add_eos_to_examples(example, t5_tokenizer))
### your code ###

Token indices sequence length is longer than the specified maximum sequence length for this model (520 > 512). Running this sequence through the model will result in indexing errors


In [14]:
print(encoded_squad["train"][0]["input_text"])
print(encoded_squad["train"][0]["target_text"])

question:To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? context:Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary. </s>
Saint Bernadette Soubirous </s>


Complete the function `convert_to_features` that takes in the examples from the dataset and tokenizes them using the T5 tokenizer. However, our answers in this dataset are relatively short and do not require `512` tokens, in contrast to the input sequence which is a combination of question and context paragraphs and is usually long. To this end, we want to truncate the input sequence at `512` and the target sequence at `16`. If any input or target is smaller than the specified length, make sure you pad them. Finally, convert everything to PyTorch tensors to be easily used by the data collator and place them in the dictionary `encodings`.

In [15]:
import torch

def convert_to_features(examples):
    # Concatenate question and context, and truncate to 512 tokens
    input_text = f"question: {examples['question']} context: {examples['context']}"
    input_encodings = t5_tokenizer(
        input_text,
        truncation=True,
        padding='max_length',
        max_length=512,
        return_tensors='pt'
    )

    # Extract answer text and truncate to 16 tokens
    target_text = examples['answers']['text'][0]
    target_encodings = t5_tokenizer(
        target_text,
        truncation=True,
        padding='max_length',
        max_length=16,
        return_tensors='pt'
    )

    # Create the 'encodings' dictionary
    encodings = {
        'input_ids': input_encodings['input_ids'].squeeze(),  # Remove batch dimension
        'attention_mask': input_encodings['attention_mask'].squeeze(),
        'target_ids': target_encodings['input_ids'].squeeze(),  # Remove batch dimension
    }

    return encodings

Use the `map` function to process the data.

In [16]:
### your code ###
encoded_squad = datasets_squad.map(convert_to_features)
### your code ###

In [17]:
encoded_squad #new columns are added

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers', 'input_ids', 'attention_mask', 'target_ids'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers', 'input_ids', 'attention_mask', 'target_ids'],
        num_rows: 10570
    })
})

Interestingly, although we specified PyTorch tensors as output, the type of the `input_ids` is still a list. To remedy this problem, you need to explicitly set the type of the column that contains PyTorch tensors.

In [18]:
type(encoded_squad["train"][0]["input_ids"])

list

In [19]:
### your code ###
encoded_squad.set_format(type='torch', columns=['input_ids', 'attention_mask', 'target_ids'])
### your code ###
type(encoded_squad["train"][0]["input_ids"])

torch.Tensor

In [20]:
print("Shape of the input_ids:",encoded_squad["train"][0]["input_ids"].shape)
print("Shape of the target_ids:",encoded_squad["train"][0]["target_ids"].shape)

Shape of the input_ids: torch.Size([512])
Shape of the target_ids: torch.Size([16])


The final step in the data processing is to create the data collator, which prepares `labels` from `target_ids` and returns examples with the keys expected by the forward method of T5.
This is necessary because the trainer passes this dict directly to the model as arguments, so you need to check the inputs of T5 and rename the columns accordingly.
`input_ids`, `target_ids`, `attention_mask`, and `target_attention_mask` need to be stacked into a batch, and the pad tokens in the target need to be set to `-100` so that they are excluded from the loss computation.

In [21]:
from dataclasses import dataclass
from transformers import DataCollator
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
@dataclass
class T2TDataCollator:
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, batch):
        # Pad sequences to the specified max length
        batch = self.tokenizer.pad(
            batch,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )

        # Set pad tokens in target to -100
        batch["labels"] = batch["target_ids"].clone()
        batch["labels"][batch["labels"] == self.tokenizer.pad_token_id] = -100

        # Prepare feature dictionary for the forward method of T5
        feature_dict = {
            "input_ids": batch["input_ids"],
            "attention_mask": batch["attention_mask"],
            "labels": batch["labels"],
        }

        return feature_dict

In [22]:
accepted_keys = ['input_text', 'target_text', 'input_ids', 'attention_mask', 'target_ids', 'target_attention_mask']
features = [{k: v for k, v in encoded_squad["train"][i].items() if k in accepted_keys} for i in range(2)]
batch=T2TDataCollator(t5_tokenizer)(features)
print(batch["input_ids"].shape)
print(batch["attention_mask"].shape)
print(batch["labels"].shape)

torch.Size([2, 512])
torch.Size([2, 512])
torch.Size([2, 16])


#### ${\color{red}{Comments\ 3.1}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 2: Training

For training and inference, we can use `T5ForConditionalGeneration`, which includes the language modeling head on top of the decoder. Load the `t5-small` model.

In [23]:
### your code ###
from transformers import T5ForConditionalGeneration, T5Tokenizer
t5 = T5ForConditionalGeneration.from_pretrained("t5-small")
### your code ###

Next, similar to the previous task we initiate training arguments. Note that this time we are using a `Seq2SeqTrainingArguments` for a `Seq2SeqTrainer`. Set the parameters for training as follows:


*   T5 doesn't support GPU and TPU evaluation for now, so we only focus on training. You do not need to pass any parameters for evaluation setup.
*   The output directory should be named `t5-squad`.
* The T5 models need a slightly higher learning rate than the default one set in the `Trainer` when using the `AdamW` optimizer. Set the learning rate to `1e-4` and the regularization parameter to `0.01`.
* Random seed should be `77`, and we train for a maximum of `200` steps and save a checkpoint every `100` steps. A complete training of the T5 model requires far more than `200` steps, however, that is beyond the scope of this assignment.
* T5 models require a large batch size. The default model was trained with a batch size of `128`. However, we cannot fit that into a single GPU, therefore we use gradient accumulation. Set the batch size to `32` and choose the gradient accumulation step to reach the effective batch size of `128`.
* Make sure that your trainer does not remove unused columns during training, as this will cause a runtime error later on.


**Gradient accumulation** is a technique that simulates a larger batch size by accumulating the gradients of multiple small batches before performing a single weight update.
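
To see why this works, the following sketch compares the gradient of one large batch with the accumulated gradient of several micro-batches. It is a toy example in plain Python, assuming a one-parameter linear model with mean squared error and equally sized micro-batches; all names and numbers are illustrative.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error of the model y = w * x with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

xs = [float(i) for i in range(1, 9)]  # 8 training examples
ys = [3.0 * x for x in xs]
w = 0.5

# Gradient computed on the full batch of 8 examples at once
full_grad = grad_mse(w, xs, ys)

# Same gradient from 4 micro-batches of 2 examples each
micro_batch_size = 2
accumulation_steps = len(xs) // micro_batch_size
accumulated = 0.0
for step in range(accumulation_steps):
    lo = step * micro_batch_size
    hi = lo + micro_batch_size
    # Scale each micro-batch gradient so that the sum is an average over all examples
    accumulated += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps

effective_batch_size = micro_batch_size * accumulation_steps
print(full_grad, accumulated, effective_batch_size)
```

In the setting above, the same arithmetic gives 32 examples per step times 4 accumulation steps, i.e., an effective batch size of 128.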



In [24]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="t5-squad",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=4,  # Effective batch size = per_device_train_batch_size * gradient_accumulation_steps
    save_total_limit=2,
    save_steps=100,
    max_steps=200,  # train for at most 200 optimizer steps, not 200 epochs
    learning_rate=1e-4,
    weight_decay=0.01,
    seed=77,
    logging_steps=10,
    logging_dir="logs",
    push_to_hub=False,
    remove_unused_columns=False,
)

Once again make sure that you are using GPU before running the cell below.
Initialize your `Seq2SeqTrainer` with the inputs necessary for training. The training should take around 15 min on a Google Colab T4 GPU.


In [25]:
# Initialize our Trainer
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, T5ForConditionalGeneration


# Initialize the Seq2SeqTrainer
trainer = Seq2SeqTrainer(
    model=t5,
    args=training_args,
    data_collator=T2TDataCollator(tokenizer=t5_tokenizer),
    train_dataset=encoded_squad["train"],
)

In [None]:
trainer.train()

FailedPreconditionError: logs is not a directory

#### ${\color{red}{Comments\ 3.2}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 3: Inference

Our trained model has seen far too few instances to make a coherent prediction. To this end, we load an already trained checkpoint from Hugging Face and perform inference. Load this [model](https://huggingface.co/mrm8488/t5-base-finetuned-squadv2) and the respective tokenizer. Note that we are loading a `base` model that is slightly larger than `t5-small`.

In [28]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
### your code ###
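# Load the checkpoint fine-tuned on SQuAD v2 that is linked above, not the plain t5-base model
t5_tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-base-finetuned-squadv2")
t5_model = AutoModelForSeq2SeqLM.from_pretrained("mrm8488/t5-base-finetuned-squadv2")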
### your code ###

At inference time, it is recommended to use the `generate()` function for T5, which auto-regressively generates the decoder output. Complete the code for the `get_answer` function, which takes a tokenizer, a model, and a question-context pair and generates the answer from the given context. The output should be the answer to the given question in natural text (without the special tokens).
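
What "auto-regressive" means can be sketched with a toy greedy decoding loop. This is not how `generate()` is implemented; `toy_next_token_scores` is a hypothetical stand-in for a seq2seq model that simply copies its input, and the token strings `<pad>` and `</s>` are used as start and end markers by assumption.

```python
def toy_next_token_scores(encoder_tokens, decoder_tokens):
    """Hypothetical stand-in for a model: copy the input, then emit </s>."""
    position = len(decoder_tokens) - 1  # number of tokens generated so far
    target = encoder_tokens[position] if position < len(encoder_tokens) else "</s>"
    vocab = set(encoder_tokens) | {"</s>"}
    return {tok: (1.0 if tok == target else 0.0) for tok in vocab}

def greedy_generate(encoder_tokens, max_length=32, start="<pad>", eos="</s>"):
    decoder_tokens = [start]
    for _ in range(max_length):
        # Each step conditions on the input and on everything generated so far
        scores = toy_next_token_scores(encoder_tokens, decoder_tokens)
        next_token = max(scores, key=scores.get)  # greedy choice
        decoder_tokens.append(next_token)
        if next_token == eos:
            break
    # Drop the special tokens, similar in spirit to skip_special_tokens=True
    return [t for t in decoder_tokens if t not in (start, eos)]

print(greedy_generate(["Harry"]))  # ['Harry']
```

Generation stops either when the end-of-sequence token is produced or when `max_length` new tokens have been generated.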

**Hint:** Many of the steps are similar to how you prepared your input data for the model.

In [29]:
def get_answer(tokenizer, model, question, context):
    # Format input text
    input_text = f"question: {question} context: {context}"

    # Tokenize the input and move it to the same device as the model
    inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True).to(model.device)

    # Generate the answer using the model
    outputs = model.generate(**inputs, max_length=32, num_beams=1, length_penalty=1.0, no_repeat_ngram_size=2)
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return answer


Let's try it with an example.

In [None]:
context = "Sarah has joined NLP for transformers class and is working on her research project with the support of Harry."
question = "Who is supporting Sarah?"

get_answer(t5_tokenizer,t5_model,question, context)###your answer should be "Harry"

In [56]:
context = "TPUs are more power efficient in comparison to GPUs making them a better choice for machine learning projects."
question = "What is better for machine learning projects?"

get_answer(t5_tokenizer,t5_model,question, context)###your answer should be "TPUs"

'TPUs'

#### ${\color{red}{Comments\ 3.3}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

### Subtask 4: T5 Paper

To answer questions of the final subtask you need to have a general overview of the [T5 paper](https://arxiv.org/pdf/1910.10683.pdf).



1.   Describe what a "text-to-text format" is and how T5 processes input and output for text classification tasks. What are the possible complications with a predefined set of classes?
2.   Describe the "masked language modeling" and "word dropout" unsupervised objective with sentinel tokens. Give an example of how this would look in a single sentence.
3. Explain "fully-visible", "causal" and "causal masking with prefix" masking.
4. Briefly describe "adapter layers" and "gradual unfreezing" as methods for fine-tuning on fewer parameters.



**Answer**

`
Text-to-Text Format:
In T5, the "text-to-text format" means that every NLP task is cast as feeding the model some text as input and training it to generate some target text. The same model, loss function, and decoding procedure can therefore be used for all tasks, and a task-specific text prefix added to the input tells the model which task to perform.

Processing Input and Output for Text Classification in T5:
For classification, the input is the text to be classified, preceded by the task prefix, and the target is the class label written out as a word. For example, for MNLI the input has the form "mnli premise: ... hypothesis: ..." and the target is one of "entailment", "contradiction", or "neutral". The model is trained with the usual maximum-likelihood objective to generate the label text.

Complications with a Predefined Set of Classes:
Because the output is free-form text, nothing forces the model to produce one of the predefined labels; it could generate a string that matches no class (the paper's example is "hamburger"). The paper counts any such output as wrong and notes that this behavior was never observed in the trained models.
`

**Answer**

`
Masked Language Modeling (MLM):
In BERT-style MLM, a random subset of the input tokens is corrupted, mostly by replacing them with a mask token, and the model is trained to predict the original tokens from the surrounding context. T5's unsupervised objective is inspired by MLM but adapted to the text-to-text setting: the model generates the missing text instead of classifying each masked position.

Word Dropout with Sentinel Tokens:
T5 randomly samples and drops 15% of the input tokens. Each consecutive span of dropped tokens is replaced by a single sentinel token, and each sentinel token has an ID that is unique within the sequence. The target consists of the dropped spans, each preceded by its sentinel token, followed by a final sentinel token that marks the end of the target.

Example Sentence:
Original: "The quick brown fox jumps over the lazy dog."
Input: "The <X> fox jumps over the <Y> dog."
Target: "<X> quick brown <Y> lazy <Z>"
`
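
The construction of input and target can be sketched as follows. This is a simplified illustration that works on whitespace-separated words rather than SentencePiece tokens and takes the dropped positions as an argument instead of sampling them; the function name `corrupt_spans` is hypothetical, while the sentinel names follow the `<extra_id_N>` tokens of the Hugging Face T5 tokenizer.

```python
def corrupt_spans(tokens, drop_positions):
    """Replace each run of consecutive dropped tokens with one sentinel token."""
    inputs, targets = [], []
    sentinel = 0
    previous_dropped = False
    for i, token in enumerate(tokens):
        if i in drop_positions:
            if not previous_dropped:
                # A new span starts: the same sentinel marks it in input and target
                marker = f"<extra_id_{sentinel}>"
                inputs.append(marker)
                targets.append(marker)
                sentinel += 1
            targets.append(token)
            previous_dropped = True
        else:
            inputs.append(token)
            previous_dropped = False
    targets.append(f"<extra_id_{sentinel}>")  # final sentinel closes the target
    return " ".join(inputs), " ".join(targets)

tokens = "The quick brown fox jumps over the lazy dog .".split()
corrupted, target = corrupt_spans(tokens, drop_positions={1, 2, 7})
print(corrupted)  # The <extra_id_0> fox jumps over the <extra_id_1> dog .
print(target)     # <extra_id_0> quick brown <extra_id_1> lazy <extra_id_2>
```

Note that the two adjacent dropped words share one sentinel, which keeps the target sequence short.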

**Answer**

`
Fully-Visible Masking:
Every position can attend to every other position in the sequence, in both directions. This is the pattern used in the encoder.

Causal Masking:
Position i can attend only to positions j <= i, so the model cannot look at future tokens when producing an output. This is the pattern used for decoder self-attention and in language models.

Causal Masking with Prefix:
Fully-visible masking is applied to a prefix of the sequence and causal masking to the rest. Every position can attend to the whole prefix (e.g., the input text of a task), while the positions after the prefix can additionally attend only to earlier positions. This is the pattern used in a prefix language model.
`
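
The three patterns can be written down as small 0/1 matrices. This is a sketch in plain Python; the function name `attention_mask`, the `kind` strings, and the convention that `mask[i][j] == 1` means "position i may attend to position j" are choices made for this example.

```python
def attention_mask(seq_len, kind, prefix_len=0):
    """Return a matrix where mask[i][j] == 1 means position i may attend to j."""
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            if kind == "fully_visible":
                allowed = True
            elif kind == "causal":
                allowed = j <= i
            elif kind == "prefix":
                # Causal everywhere, plus full visibility of the prefix
                allowed = j <= i or j < prefix_len
            else:
                raise ValueError(f"unknown mask kind: {kind}")
            row.append(int(allowed))
        mask.append(row)
    return mask

for kind in ("fully_visible", "causal", "prefix"):
    print(kind)
    for row in attention_mask(4, kind, prefix_len=2):
        print(row)
```

With `prefix_len=2`, the first two rows of the prefix mask both read `[1, 1, 0, 0]`, whereas the causal mask starts with `[1, 0, 0, 0]`.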

**Answer**

`
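Adapter Layers:
Adapter layers are small dense-ReLU-dense blocks inserted after each of the existing feed-forward networks in every Transformer block. Their output dimensionality matches their input, so they can be inserted without any other change to the architecture. During fine-tuning, only the adapter layers and the layer normalization parameters are updated, while the rest of the model stays frozen. The inner dimensionality of the adapter controls how many new parameters are added.

Gradual Unfreezing:
With gradual unfreezing, more and more of the model's parameters are fine-tuned over time. Fine-tuning starts with only the final layer; after a number of updates, the second-to-last layer is unfrozen as well, and so on until the whole network is being trained. For T5, the layers of the encoder and the decoder are unfrozen in parallel, starting from the top.

Both methods limit the number of parameters that are updated at a given time during fine-tuning while reusing the knowledge acquired during pre-training.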
`

#### ${\color{red}{Comments\ 3.4}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$