# Assignment 7: Question Answering (Solution)
_Word Representations and Language Models (WS 23/24)_

***

In this assignment, you'll work with a BERT model fine-tuned for Question Answering. We ask you to evaluate the model on the SQuAD 1.1 as described in the original paper and compare performances of BERT compared to the paper's logistic regression model. Then, we ask you to experiment with the model on our news dataset, generate question/answer pairs manually, and evaluate model performance qualitatively.



In [1]:
from transformers import BertTokenizer, BertForQuestionAnswering
import transformers
import torch
import numpy as np
import json
import string
import re

## Task 1: Question Answering with BERT

Task 1 is about loading the SQuAD 1.1 dataset and fine-tuned BERT model (both from huggingface) as well as computing the model's Exact Match and F1 score as introduced in the SQuAD paper. We already pre-define the code for downloading the data and model. We use a BERT model that is fine-tuned on the train set of the SQuAD corpus. Therefore, we only need validation data for evaluation. Description of dataset downloaded from huggingface: https://huggingface.co/datasets/squad

Please implement the following steps:
1. Inspect the data to get an overview of the data, you'll need to predict answers.
2. For all questions, generate answer predictions with BERT (using GPU might help you a lot here and reduces runtime to only a few minutes).
3. Since predicted and true answers might not only differ in content but also in punctuations, whitespaces etc, please implement the following preprocessing steps for all predicted and true answers: lower-case answers and remove punctuations (the string package might be useful here), articles (regex might useful to remove "a", "and", "the"), and standardize whitespaces.
4. Use the pre-processed predicted answers and the true answers included to calculate the Exact Match and F1-score.

In [2]:
# Read data: we only need the validation set since the model is already fine-tuned on the train set
#!pip install datasets
from datasets import load_dataset
df_test = load_dataset('squad',split = 'validation')

Downloading builder script:   0%|          | 0.00/1.97k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/1.02k [00:00<?, ?B/s]

Downloading and preparing dataset squad/plain_text (download: 33.51 MiB, generated: 85.63 MiB, post-processed: Unknown size, total: 119.14 MiB) to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/8.12M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.05M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

Dataset squad downloaded and prepared to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453. Subsequent calls will reuse this data.


In [3]:
# Load tokenizer and model: we take a BERT model already fine-tuned on the Squad dataset
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

#tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') # Optionally try out non-fine-tuned BERT model to see how important fine-tuning is.
#model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')

tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/443 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [4]:
# Print example data
print(len(df_test))
print(df_test[5])

10570
{'id': '56be8e613aeaaa14008c90d1', 'title': 'Super_Bowl_50', 'context': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.', 'question': 'What was the theme of Super Bowl 50?', 'answers': {'text': ['"golden anniversary"', 'gold-themed', '"golden anniversary'], 'answe

In [5]:
# Connect to GPU and push model to GPU
if torch.cuda.is_available():       
    device = torch.device("cuda")
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))

else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")
    
model.to(device)

There are 1 GPU(s) available.
Device name: Tesla P100-PCIE-16GB


BertForQuestionAnswering(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 1024, padding_idx=0)
      (position_embeddings): Embedding(512, 1024)
      (token_type_embeddings): Embedding(2, 1024)
      (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-23): 24 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=True)
              (key): Linear(in_features=1024, out_features=1024, bias=True)
              (value): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (LayerNorm): LayerNorm((1024,), ep

In [6]:
# Iterate over all questions/paragraphs and predict answer --> use GPU to speed-up process (should take a few minutes to get output for all 10570 question/answer pairs)
from tqdm import tqdm

def preprocess_answer(answer):
    answer = answer.lower()  
    answer = re.sub(f"[{re.escape(punctuation)}]", "", answer) 
    answer = re.sub(r"\b(?:a|an|the|and)\b", "", answer) 
    answer = " ".join(answer.split())
    return answer


def calculate_scores(predictions, references):
    em_score = f1_score = 0
    for pred, ref in zip(predictions, references):
        if pred in ref:
            em_score += 1
        f1_list = []
        for t_a in ref:
            common_tokens = set(pred.split()) & set(t_a.split())
            precision = len(common_tokens) / (len(pred.split()) + 1e-10)
            recall = len(common_tokens) / (len(t_a.split()) + 1e-10)
            f1_score += 2 * (precision * recall) / (precision + recall + 1e-10)
            f1_list.append(f1_score)
        f1_avg = sum(f1_list) / len(f1_list)
    total_examples = len(predictions)
    em_score /= total_examples
    f1_avg /= total_examples*3
    return em_score, f1_avg

In [7]:
predictions = []
for example in tqdm(df_test, desc="Predicting Answers"):
    context = example["context"]
    question = example["question"]
    inputs = tokenizer(question, context, return_tensors="pt", max_length=512, truncation=True)
    inputs.to(device)
    outputs = model(**inputs)
    start_idx = torch.argmax(outputs.start_logits)
    end_idx = torch.argmax(outputs.end_logits)
    answer_span = tokenizer.decode(inputs["input_ids"][0][start_idx:end_idx+1])
    predictions.append(answer_span)

Predicting Answers:  39%|███▉      | 4141/10570 [02:20<05:29, 19.53it/s]Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Predicting Answers:  39%|███▉      | 4143/10570 [02:20<06:01, 17.80it/s]Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Predicting Answers:  39%|███▉      | 4145/10570 [02:21<06:58, 15.34it/s]Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with th

In [8]:
# Pre-process predicted and true answers for computation of exact match metric and F1-score: we lower-case predicion/answer, remove punctuations,articles, and whitespaces
from string import punctuation
processed_true_answers = []

true_answers = [example["answers"]["text"] for example in df_test]
processed_predictions = [preprocess_answer(pred) for pred in predictions]
for each in true_answers:
    pre_each = [preprocess_answer(words) for words in each]
    processed_true_answers.append(pre_each)

In [9]:
# Compute proportion of exact matches between predicitons and true answers
em_score, f1_score = calculate_scores(processed_predictions, processed_true_answers) 

In [10]:
# Compute F1 score based on shared words between prediction and true answer
print(f"Exact Match Score: {em_score * 100:.2f}%")
print(f"F1 Score: {f1_score * 100:.2f}%")

Exact Match Score: 77.93%
F1 Score: 85.09%


Exact match achieved by the Squad paper's logistic regression model on dev data:
- Exact match score = 40.0%
- F1- score = 51 %
- Human performance: F1 = 86%, EM = 77%

## Task 2: Experiment with BERT

Task 2 is about experimenting with the BERT model on our news data. Please implement the following steps:

1. Select one article from the news data on which you want to generate question/answer pairs.
2. Generate exemplary question/answer pairs and check if the model can predict the answers correctly. Try to generate easy and more difficult questions. Are there any questions the model cannot correctly answer to?

In [11]:
# Read news data
with open("/kaggle/input/rev-articles/relevant_articles.json") as d:
    articles = json.load(d)

In [12]:
# Display example article
articles[3]

{'_id': 198038,
 'url': 'http://www.bbc.co.uk/news/uk-scotland-35210821#sa-ns_mchannel=rss&ns_source=PublicRSS20-sa',
 'title': "'Very drunk' patient numbers revealed",
 'feed': 'bbc',
 'type': 'politics',
 'pub': {'$date': '2016-01-02T00:42:46.000+0000'},
 'ret': {'$date': '2016-01-02T00:45:47.000+0000'},
 'lang': 'en',
 'refs': ['http://www.bbc.co.uk/news/uk-scotland-35097230'],
 'sum': 'Ambulances attend more than 60 incidents on average every day where a patient is so drunk that it has to be formally noted by crews.',
 'body': 'Paramedics treated about 12,000 people who were so drunk it was noted on Scottish Ambulance Service systems in the six months to the end of September.\nThe figures were obtained by the Scottish Conservatives under freedom of information laws.\nThe ambulance service said alcohol had a significant impact on its operations.\nIt comes after a recent internal Scottish Ambulance Service survey showed alcohol was a factor in more than half of all call-outs ambulanc

In [13]:
# Formulate question/answer pairs
questions2 = ["What is a paperback with a huge number of pictures?"
             ,"Who had a celebration for the launch of her publication?"
             ,"Of the copies printed by Confidential Publications Limited, how many copies are already sold?"
             ,"What kind of families did Mandy decide to aid?"
             ,"To which city in Germany is Mandy going this month, to visit a club?"]
answers2 = ["The Mandy Report", "Miss Mandy Rice-Davies", "50,000"
           ,"families who had been victims of the exploitation", "Munich"]

questions3 = ["According to the text, what kind of incidents do the ambulances attend more than 60 per day?"
             ,"According to a Scottish Ambulance Service survey, in how many call-outs that the ambulance staff dealt during the weekends was alcohol a factor?"
             ,"Who claimed 'However, the harmful effects of excessive alcohol consumption puts additional pressure on these services, and this is another reason why everyone should drink responsibly and keep safe.'?"
             ,"How many people who were extremely drunk did the paramedics in the six months to the end of September treat?"
             ,"Who said 'Our staff should not have to fear for their own safety when responding to patients but alcohol is often a key factor in assaults.'?"]
answers3 = ["where a patient is so drunk that it has to be formally noted by crews."
           ,"more than half of all call-outs", "Public Health Minister Maureen Watt"
           ,"12,000", "Scottish Ambulance Service spokesman"]

questions4 = ["How much can the fine of littering reach?"
             ,"How much will the minimum penalty for littering be?"
             ,"Where could the litter clearance and disposal fees, which cost hundreds of millions of pounds for councils every year, be used as a better alternative?"
             ,"In which city can vehicle owners be fined if they drop litter from their car?"
             ,"Who claimed that those who litter will be 'hit in the pocket'?"]
answers4 = ["£150", "£100", "vital services", "London", "Marcus Jones"]


In [14]:
def predictAnswer(question_list, answer_list, num):
    answer = 0
    for question in question_list:
        text = articles[num]["text"]
        inputs = tokenizer(question, text, return_tensors="pt", max_length=512, truncation=True)
        inputs.to(device)
        outputs = model(**inputs)
        start_idx = torch.argmax(outputs.start_logits)
        end_idx = torch.argmax(outputs.end_logits)
        answer_span = tokenizer.decode(inputs["input_ids"][0][start_idx:end_idx+1])
        print(f"The question is: {question}")
        print(f"The predicted answer is: {answer_span}")
        print(f"The true answer to the question is: {answer_list[answer]}")
        print("-" * 50)
        answer += 1

In [15]:
# Iterate over all questions/paragraphs and predict answer
predictAnswer(questions2, answers2, 2)

The question is: What is a paperback with a huge number of pictures?
The predicted answer is: the mandy report
The true answer to the question is: The Mandy Report
--------------------------------------------------
The question is: Who had a celebration for the launch of her publication?
The predicted answer is: miss mandy rice - davies
The true answer to the question is: Miss Mandy Rice-Davies
--------------------------------------------------
The question is: Of the copies printed by Confidential Publications Limited, how many copies are already sold?
The predicted answer is: 50, 000
The true answer to the question is: 50,000
--------------------------------------------------
The question is: What kind of families did Mandy decide to aid?
The predicted answer is: families who had been victims of the exploitation
The true answer to the question is: families who had been victims of the exploitation
--------------------------------------------------
The question is: To which city in Ger

In [16]:
predictAnswer(questions3, answers3, 3)

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


The question is: According to the text, what kind of incidents do the ambulances attend more than 60 per day?
The predicted answer is: where a patient is so drunk
The true answer to the question is: where a patient is so drunk that it has to be formally noted by crews.
--------------------------------------------------
The question is: According to a Scottish Ambulance Service survey, in how many call-outs that the ambulance staff dealt during the weekends was alcohol a factor?
The predicted answer is: more than half
The true answer to the question is: more than half of all call-outs
--------------------------------------------------
The question is: Who claimed 'However, the harmful effects of excessive alcohol consumption puts additional pressure on these services, and this is another reason why everyone should drink responsibly and keep safe.'?
The predicted answer is: public health minister maureen watt
The true answer to the question is: Public Health Minister Maureen Watt
-------

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


The question is: How many people who were extremely drunk did the paramedics in the six months to the end of September treat?
The predicted answer is: 12, 000
The true answer to the question is: 12,000
--------------------------------------------------
The question is: Who said 'Our staff should not have to fear for their own safety when responding to patients but alcohol is often a key factor in assaults.'?
The predicted answer is: a scottish ambulance service spokesman said its teams had to respond to an increase in demand over the festive period, which was largely driven by alcohol.'key factor'he added : " they are highly trained professionals who are frustrated by the amount of time they spend looking after patients who are simply intoxicated. " our staff should not have to fear for their own safety when responding to patients but alcohol is often a key factor in assaults. " assaults or threatening behaviour are reported to the police and support and counselling services are availa

In [17]:
predictAnswer(questions4, answers4, 4)

The question is: How much can the fine of littering reach?
The predicted answer is: as much as £150
The true answer to the question is: £150
--------------------------------------------------
The question is: How much will the minimum penalty for littering be?
The predicted answer is: £150
The true answer to the question is: £100
--------------------------------------------------
The question is: Where could the litter clearance and disposal fees, which cost hundreds of millions of pounds for councils every year, be used as a better alternative?
The predicted answer is: vital services
The true answer to the question is: vital services
--------------------------------------------------
The question is: In which city can vehicle owners be fined if they drop litter from their car?
The predicted answer is: london
The true answer to the question is: London
--------------------------------------------------
The question is: Who claimed that those who litter will be 'hit in the pocket'?
The p

## Task 3: Experiment with FLAN-T5

Another approach involves using generative models for Question Answering by parsing both the document and the question together into the model as a joint text sequence, also known as a prompt. In contrast to BERT, we will not fine-tune the model on our dataset. To obtain answers, implement the following steps:

1. Load the [FLAN-T5-large ](https://huggingface.co/google/flan-t5-large) model, which [was trained at Google](https://arxiv.org/abs/2210.11416).
2. Concatenate the news article and a question, [then use the model to generate an answer](https://huggingface.co/docs/transformers/model_doc/flan-t5).
3. Do this for all your questions and compare the answers with those from BERT.

*(Hint: Figure 3 in [Scaling Instruction-Finetuned Language Models](https://arxiv.org/pdf/2210.11416.pdf) descibes how to construct a good prompt.)*



In [18]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the tokenizer
tokenizer_flan = AutoTokenizer.from_pretrained("google/flan-t5-large")

# Load the model
model_flan = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/662 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.13G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

In [19]:
# Generate Answers
concat_prompt = articles[2]["text"] + "Answer the following question as short as possible." + questions2 [0]
inputs = tokenizer_flan(concat_prompt, return_tensors="pt")
outputs = model_flan.generate(**inputs)
print(tokenizer_flan.batch_decode(outputs, skip_special_tokens=True))

Token indices sequence length is longer than the specified maximum sequence length for this model (513 > 512). Running this sequence through the model will result in indexing errors


['The Mandy Report']


In [20]:
# Generate Answers
concat = articles[2]["text"] + questions2 [0]
inputs = tokenizer_flan(concat, return_tensors="pt")
outputs = model_flan.generate(**inputs)
print(tokenizer_flan.batch_decode(outputs, skip_special_tokens=True))

['The “Mandy Report” turned out to be a slim paper-back with ']


In [21]:
def predictAnswerFlan(question_list, answer_list, num):
    answer = 0
    for question in question_list:
        text = articles[num]["text"]
        concat = text + question
        inputs = tokenizer_flan(concat, return_tensors="pt")
        outputs = model_flan.generate(**inputs)
        decoded_answer = tokenizer_flan.batch_decode(outputs, skip_special_tokens=True)
        print(f"The question is: {question}")
        print(f"The predicted answer is: {decoded_answer}")
        print(f"The true answer to the question is: {answer_list[answer]}")
        print("-" * 50)
        answer += 1

In [22]:
predictAnswerFlan(questions2, answers2, 2)

The question is: What is a paperback with a huge number of pictures?
The predicted answer is: ['The “Mandy Report” turned out to be a slim paper-back with ']
The true answer to the question is: The Mandy Report
--------------------------------------------------
The question is: Who had a celebration for the launch of her publication?
The predicted answer is: ['Miss Mandy Rice-Davies']
The true answer to the question is: Miss Mandy Rice-Davies
--------------------------------------------------
The question is: Of the copies printed by Confidential Publications Limited, how many copies are already sold?
The predicted answer is: ['50,000']
The true answer to the question is: 50,000
--------------------------------------------------
The question is: What kind of families did Mandy decide to aid?
The predicted answer is: ['Miss Mandy Rice-Davies, one of 1963’s better known personalities']
The true answer to the question is: families who had been victims of the exploitation
-----------------

In [23]:
predictAnswerFlan(questions3, answers3, 3)

The question is: According to the text, what kind of incidents do the ambulances attend more than 60 per day?
The predicted answer is: ['patient is so drunk that it has to be formally noted by crews']
The true answer to the question is: where a patient is so drunk that it has to be formally noted by crews.
--------------------------------------------------
The question is: According to a Scottish Ambulance Service survey, in how many call-outs that the ambulance staff dealt during the weekends was alcohol a factor?
The predicted answer is: ['more than half of all call-outs']
The true answer to the question is: more than half of all call-outs
--------------------------------------------------
The question is: Who claimed 'However, the harmful effects of excessive alcohol consumption puts additional pressure on these services, and this is another reason why everyone should drink responsibly and keep safe.'?
The predicted answer is: ['Public Health Minister Maureen Watt']
The true answer 

In [24]:
predictAnswerFlan(questions4, answers4, 4)

The question is: How much can the fine of littering reach?
The predicted answer is: ['Minimum fine set to double to £100']
The true answer to the question is: £150
--------------------------------------------------
The question is: How much will the minimum penalty for littering be?
The predicted answer is: ['Minimum fine for littering could reach as much as £150']
The true answer to the question is: £100
--------------------------------------------------
The question is: Where could the litter clearance and disposal fees, which cost hundreds of millions of pounds for councils every year, be used as a better alternative?
The predicted answer is: ["England's clean-up operation"]
The true answer to the question is: vital services
--------------------------------------------------
The question is: In which city can vehicle owners be fined if they drop litter from their car?
The predicted answer is: ['London']
The true answer to the question is: London
-------------------------------------

In [25]:
# Generate Answers
exp_input = articles[4]["text"] + "Answer the following question as an answer that is an integer." + questions4[1]
inputs = tokenizer_flan(exp_input, return_tensors="pt")
outputs = model_flan.generate(**inputs)
print(tokenizer_flan.batch_decode(outputs, skip_special_tokens=True))

['£50']


In [26]:
exp_input = articles[4]["text"] + "Answer the following question with a short answer." + questions4[2]
inputs = tokenizer_flan(exp_input, return_tensors="pt")
outputs = model_flan.generate(**inputs)
print(tokenizer_flan.batch_decode(outputs, skip_special_tokens=True))

['in the pocket']


In [27]:
exp_input = articles[4]["text"] + "Give an answer by reasoning step-by-step." + questions2[3]
inputs = tokenizer_flan(exp_input, return_tensors="pt")
outputs = model_flan.generate(**inputs)
print(tokenizer_flan.batch_decode(outputs, skip_special_tokens=True))

['The relevant sentence in the passage is: Penalties for people who drop litter could reach as']
