# Assignment 2

**Credits**: Andrea Galassi, Federico Ruggeri, Paolo Torroni

**Keywords**: Transformers, Question Answering, CoQA

## Deadlines

* **December 11**, 2022: deadline for having assignments graded by January 11, 2023
* **January 11**, 2023: deadline for half-point speed bonus per assignment
* **After January 11**, 2023: assignments are still accepted, but there will be no speed bonus

## Overview

### Problem

Question Answering (QA) on the [CoQA](https://stanfordnlp.github.io/coqa/) dataset, a conversational QA dataset.

### Task

Given a question $Q$ and a text passage $P$, the task is to generate the answer $A$.<br>
$\rightarrow A$ can be: (i) free-form text or (ii) unanswerable.

**Note**: a question $Q$ can refer to previous dialogue turns. <br>
$\rightarrow$ dialogue history $H$ may be a valuable input to provide the correct answer $A$.

### Models

We will experiment with transformer-based architectures to define the following models:

1.  $A = f_\theta(Q, P)$

2. $A = f_\theta(Q, P, H)$

where $f_\theta$ is the transformer-based model to be defined, with parameters $\theta$.

## The CoQA dataset

For detailed information about the dataset, feel free to check the original [paper](https://arxiv.org/pdf/1808.07042.pdf).



## Rationales

Each QA pair comes with a rationale $R$: a text span extracted from the given text passage $P$. <br>
$\rightarrow$ $R$ is not a required output, but it can be used as additional information at training time!

## Dataset Statistics

* **127k** QA pairs.
* **8k** conversations.
* **7** diverse domains: Children's Stories, Literature, Mid/High School Exams, News, Wikipedia, Reddit, Science.
* Average conversation length: **15 turns** (i.e., QA pairs).
* Almost **half** of CoQA questions refer back to **conversational history**.
* Only **train** and **validation** sets are available.

## Dataset snippet

The dataset is stored in JSON format. Each dialogue is represented as follows:

```
{
    "source": "mctest",
    "id": "3dr23u6we5exclen4th8uq9rb42tel",
    "filename": "mc160.test.41",
    "story": "Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. Cotton lived high up in a nice warm place above the barn where all of the farmer's horses slept. [...]",   % <-- $P$
    "questions": [
        {
            "input_text": "What color was Cotton?",   % <-- $Q_1$
            "turn_id": 1
        },
        {
            "input_text": "Where did she live?",
            "turn_id": 2
        },
        [...]
    ],
    "answers": [
        {
            "span_start": 59,   % <-- $R_1$ start index
            "span_end": 93,    % <-- $R_1$ end index
            "span_text": "a little white kitten named Cotton",   % <-- $R_1$
            "input_text": "white",   % <-- $A_1$
            "turn_id": 1
        },
        [...]
    ]
}
```
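
As a quick sanity check on the snippet above, the rationale $R_1$ can be recovered by slicing the story with the two span indices. This is only a minimal sketch: the `dialogue` dictionary is a hand-built copy of a fragment of the snippet (it is not loaded from the real file), and `get_rationale` is a hypothetical helper name.

```python
# Hand-built fragment of the dialogue shown above (assumption: not read from the real JSON file).
dialogue = {
    "story": "Once upon a time, in a barn near a farm house, "
             "there lived a little white kitten named Cotton.",
    "answers": [
        {
            "span_start": 59,
            "span_end": 93,
            "span_text": "a little white kitten named Cotton",
            "input_text": "white",
            "turn_id": 1,
        },
    ],
}


def get_rationale(dialogue, turn_index):
    """Return the rationale R of one turn by slicing the passage P (hypothetical helper)."""
    answer = dialogue["answers"][turn_index]
    return dialogue["story"][answer["span_start"]:answer["span_end"]]


rationale = get_rationale(dialogue, 0)
print(rationale)
```

The end index is exclusive, as in ordinary Python slicing, so the slice stops right before the final period of the sentence.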

### Simplifications

Each dialogue also contains an additional field ```additional_answers```. For simplicity, we **ignore** this field and only consider one ground-truth answer $A$ and text rationale $R$.

Only 1.3% of CoQA questions are unanswerable. For simplicity, we **ignore** those QA pairs.

In [2]:
!pip install transformers
!pip install datasets


## [Task 1] Remove unanswerable QA pairs

Write your own script to remove unanswerable QA pairs from both train and validation sets.
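
One possible way to approach this is to filter each dialogue turn by turn. The sketch below assumes that unanswerable turns are marked by the literal answer text `unknown` (the same check used in `get_dataframe` further down). The helper name `drop_unanswerable` and the toy dialogue are hypothetical.

```python
def drop_unanswerable(dialogue, marker="unknown"):
    """Return a copy of a dialogue without the turns whose answer text equals `marker`.

    Assumption: unanswerable turns carry the literal answer text "unknown".
    """
    kept = [
        (question, answer)
        for question, answer in zip(dialogue["questions"], dialogue["answers"])
        if answer["input_text"] != marker
    ]
    filtered = dict(dialogue)  # shallow copy: the original dialogue is left untouched
    filtered["questions"] = [question for question, _ in kept]
    filtered["answers"] = [answer for _, answer in kept]
    return filtered


# Toy dialogue, made up for illustration only.
toy = {
    "story": "Cotton is a white kitten.",
    "questions": [
        {"input_text": "What color was Cotton?", "turn_id": 1},
        {"input_text": "How old was she?", "turn_id": 2},
    ],
    "answers": [
        {"input_text": "white", "turn_id": 1},
        {"input_text": "unknown", "turn_id": 2},
    ],
}

clean = drop_unanswerable(toy)
print([question["input_text"] for question in clean["questions"]])
```

Filtering questions and answers together keeps the two lists aligned, which matters because later code pairs them by position.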

## Dataset Download


In [1]:
import os
import urllib.request
from tqdm import tqdm

class DownloadProgressBar(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)
        
def download_url(url, output_path):
    with DownloadProgressBar(unit='B', unit_scale=True,
                             miniters=1, desc=url.split('/')[-1]) as t:
        urllib.request.urlretrieve(url, filename=output_path, reporthook=t.update_to)

def download_data(data_path, url_path, suffix):    
    if not os.path.exists(data_path):
        os.makedirs(data_path)
        
    data_path = os.path.join(data_path, f'{suffix}.json')

    if not os.path.exists(data_path):
        print(f"Downloading CoQA {suffix} data split... (it may take a while)")
        download_url(url=url_path, output_path=data_path)
        print("Download completed!")

In [2]:
# Train data
train_url = "https://nlp.stanford.edu/data/coqa/coqa-train-v1.0.json"
download_data(data_path='coqa', url_path=train_url, suffix='train')

# Test data
test_url = "https://nlp.stanford.edu/data/coqa/coqa-dev-v1.0.json"
download_data(data_path='coqa', url_path=test_url, suffix='test')  # <-- Why test? See next slides for an answer!

#### Data Inspection

Spend some time carefully checking the dataset format and how to retrieve the tasks' inputs and outputs!

In [3]:
import collections
import json
import os
import re
import string
import urllib.request
from itertools import islice
from typing import Callable, Sequence, TypeVar, Tuple

import numpy as np
import pandas as pd
import tensorflow as tf
from datasets import Dataset
from keras.preprocessing.text import Tokenizer
from numpy.random import seed
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.random import set_seed
from tokenizers import BertWordPieceTokenizer
from tqdm import tqdm
from transformers import AutoTokenizer, BertTokenizer, TFBertModel, BertConfig, TFRobertaModel


def tokenizer(name):
    return AutoTokenizer.from_pretrained(name)

def nth_index(iterable, value, n):
    matches = (idx for idx, val in enumerate(iterable) if val == value)
    return next(islice(matches, n-1, n), None)

TRAIN_DATA_PATH = 'coqa/train.json'
TEST_DATA_PATH = 'coqa/test.json'

bert_model_name = 'prajjwal1/bert-tiny'
roberta_model_name = 'distilroberta-base'

max_len = 512

bert_tokenizer = tokenizer(bert_model_name)
roberta_tokenizer = tokenizer(roberta_model_name)

In [4]:
def get_dataframe(file_path, history_memory=0):
    with open(file_path) as f:
        paragraphs = pd.DataFrame(json.load(f))['data']
    titles = ['source', 'id', 'story']  # 'filename', 'name'
    data = {k: [] for k in titles +
            ['turn_id', 'context', 'question', 'answer', 'span_start', 'span_end', 'span_text']}
    for paragraph in paragraphs:
        questions = paragraph['questions']
        answers = paragraph['answers']
        history = []
        for i in range(len(questions)):
            if answers[i]['input_text'] == 'unknown':
                continue
            answer = answers[i]['input_text']
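            # Prefix each remembered (question, answer) pair with a numeric turn marker; the current question gets marker 0
            past_turns = [f"{history_memory-j} {q} {history_memory-j} {a}." for j, (q, a) in enumerate(history)]
            question = " ".join(past_turns + [f"0 {questions[i]['input_text']}"])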
            context = paragraph['story']
            span_start = answers[i]['span_start']
            span_start = span_start + 1 if context[span_start] == ' ' else span_start
            while span_start > 0 and context[span_start-1] not in [' ', '\n', '.', ',']:
                span_start -= 1
            span_end = answers[i]['span_end']
            span_text = context[span_start:span_end]
            data['context'].append(context)
            data['turn_id'].append(questions[i]['turn_id'])
            data['question'].append(question)
            data['answer'].append(answer)
            data['span_start'].append(span_start)
            data['span_end'].append(span_end)
            data['span_text'].append(span_text)
            for key in titles:
                data[key].append(paragraph[key])
            history.append((questions[i]['input_text'], answers[i]['input_text']))
            if len(history) > history_memory:
                history.pop(0)

    return pd.DataFrame(data)

In [5]:
train = get_dataframe(TRAIN_DATA_PATH)
test = get_dataframe(TEST_DATA_PATH)
train_h = get_dataframe(TRAIN_DATA_PATH, 2)
test_h = get_dataframe(TEST_DATA_PATH, 2)

## [Task 2] Train, Validation and Test splits

CoQA only provides a train and validation set since the test set is hidden for evaluation purposes.

We'll consider the provided validation set as a test set. <br>
$\rightarrow$ Write your own script to:
* Split the train data into train and validation splits (80% train and 20% val)
* Perform splits such that a dialogue appears in one split only! (i.e., split at dialogue level)
* Perform splitting using the following seed for reproducibility: 42
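
The three requirements above can be sketched with the standard library alone. This is an illustrative alternative, not the method used in this notebook: `split_dialogue_ids` and the toy ids are hypothetical, and Python's `random` module is used here instead of scikit-learn.

```python
import random


def split_dialogue_ids(dialogue_ids, train_fraction=0.8, seed=42):
    """Shuffle the unique dialogue ids with a fixed seed and cut them into two disjoint sets."""
    unique_ids = sorted(set(dialogue_ids))  # sort first so the result does not depend on input order
    rng = random.Random(seed)               # local generator: the global random state is untouched
    rng.shuffle(unique_ids)
    cut = int(len(unique_ids) * train_fraction)
    return set(unique_ids[:cut]), set(unique_ids[cut:])


# Toy example: one id per QA pair, so ids repeat within a dialogue.
pair_ids = ["d1", "d1", "d2", "d3", "d3", "d3", "d4", "d5", "d5"]
train_ids, val_ids = split_dialogue_ids(pair_ids)

# Every QA pair follows its dialogue, so no dialogue is shared between splits.
train_pairs = [i for i in pair_ids if i in train_ids]
val_pairs = [i for i in pair_ids if i in val_ids]
print(len(train_ids), len(val_ids))
```

Splitting the ids rather than the rows is what guarantees that all turns of a dialogue land in the same split.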

#### Reproducibility Memo

Refer back to Tutorial 2 for how to fix a specific random seed for reproducibility!
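
As a reminder of the general idea, a seeding helper might look like the sketch below. It is an assumption about a reasonable setup rather than the content of Tutorial 2: `set_global_seed` is a hypothetical name, and NumPy and TensorFlow are seeded only if they can be imported.

```python
import random


def set_global_seed(seed):
    """Seed the random number generators available in the current environment."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is not installed: nothing to seed
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass  # TensorFlow is not installed: nothing to seed


# Re-seeding with the same value reproduces the same random draw.
set_global_seed(42)
first = random.random()
set_global_seed(42)
second = random.random()
print(first == second)
```

Seeding alone may not remove every source of non-determinism (for example, some GPU kernels), so small differences between runs can still occur.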

In [6]:
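# Keep only 1/scale_factor of the dialogues (randomly sampled with seed 0)
scale_factor = 2
all_ids = train.id.unique()
np.random.seed(0)
all_ids = all_ids[np.random.choice(all_ids.shape[0], all_ids.shape[0]//scale_factor, replace=False)]
# Split at dialogue level (80% train, 20% validation) with the required seed 42
train_id_set, val_id_set = train_test_split(all_ids, train_size=.8, random_state=42)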

def prepare_set_splits(data_split, id_set):
    return data_split.loc[data_split.id.isin(id_set)].reset_index()

train_set = prepare_set_splits(train, train_id_set)
val_set = prepare_set_splits(train, val_id_set)
test_set = test

train_set_h = prepare_set_splits(train_h, train_id_set)
val_set_h = prepare_set_splits(train_h, val_id_set)
test_set_h = test_h

In [7]:
train_set_d = Dataset.from_pandas(train_set)
val_set_d = Dataset.from_pandas(val_set)
test_set_d = Dataset.from_pandas(test_set)

train_set_d_h = Dataset.from_pandas(train_set_h)
val_set_d_h = Dataset.from_pandas(val_set_h)
test_set_d_h = Dataset.from_pandas(test_set_h)

In [8]:
old_cols = train_set_d.features.keys()

def prepare_train_features(examples, tokenizer):
    pad_on_right = tokenizer.padding_side == "right"
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_len,
        return_offsets_mapping=True,
        padding="max_length",
    )
    offset_mapping = tokenized_examples["offset_mapping"]
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []
    for i, offsets in enumerate(offset_mapping):
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)
        sequence_ids = tokenized_examples.sequence_ids(i)
        
        start_char = examples["span_start"][i]
        end_char = examples["span_end"][i]
        end_of_question = sequence_ids.index(1)
        offset = offsets[end_of_question:]
        starts, ends = [i for i, j in offset], [j for i, j in offset]
        sync = 0
        while 1:
            try:
                last_offset = None
                for k in range(1, len(offset)):
                    if offset[-k][1] != 0:
                        last_offset = offset[-k][1]
                        break
                if last_offset < start_char + sync or last_offset < end_char:
                    raise IndexError
                tokenized_examples["start_positions"].append(end_of_question + starts.index(start_char+sync))
                break
            except ValueError:
                sync += 1
            except IndexError:
                tokenized_examples["start_positions"].append(cls_index)
                break
        sync = 0
        while 1:
            try:
                last_offset = None
                for k in range(1, len(offset)):
                    if offset[-k][1] != 0:
                        last_offset = offset[-k][1]
                        break
                if last_offset < start_char or last_offset < end_char + sync:
                    raise IndexError
                tokenized_examples["end_positions"].append(end_of_question + ends.index(end_char+sync))
                break
            except ValueError:
                sync -= 1
            except IndexError:
                tokenized_examples["end_positions"].append(cls_index)
                break
    return tokenized_examples

def prepare_validation_features(examples, tokenizer):
    pad_on_right = tokenizer.padding_side == "right"
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_len,
        padding="max_length",
    )
    return tokenized_examples

def get_data(ds, tokenizer):
    return ds.map(lambda xx: prepare_train_features(xx, tokenizer), batched=True, remove_columns=old_cols)


class XYInputOutput():
    def __init__(self, x, y):
        self.x = x
        self.y = y
        self.source = None
    
    def __hash__(self):
        return id(self)

    def set_source(self, source):
        self.source = source
        return self


def create_inputs_targets(coqa_examples):
    dataset_dict = {
        "input_ids": [],
        "offset_mapping": [],
        "token_type_ids": [],
        "attention_mask": [],
        "start_positions": [],
        "end_positions": [],
    }
    roberta = False
    for item in iter(coqa_examples):
        for key in dataset_dict:
            try:
                dataset_dict[key].append(item[key])
            except KeyError:
                roberta = True
    for key in tqdm(dataset_dict):
        try:
            if roberta and key == 'token_type_ids':
                raise KeyError
            dataset_dict[key] = np.array(dataset_dict[key])
        except KeyError:
            pass
    x = 0
    try:
        if roberta:
            raise KeyError
        x = [
        dataset_dict["input_ids"],
        dataset_dict["token_type_ids"],
        dataset_dict["attention_mask"],
        dataset_dict["offset_mapping"],
        ]
    except KeyError:
        x = [
        dataset_dict["input_ids"],
        dataset_dict["attention_mask"],
        dataset_dict["offset_mapping"],
        ]
    y = [dataset_dict["start_positions"], dataset_dict["end_positions"]]
    return XYInputOutput(x, y)

def create_inputs(coqa_example):
    dataset_dict = {
        "input_ids": [],
        "attention_mask": [],
        "token_type_ids": []
    }
    roberta = False
    for key in dataset_dict:
        try:
            dataset_dict[key].append(coqa_example[key])
        except KeyError:
            roberta = True
            pass
    for key in dataset_dict:
        try:
            if roberta and key == 'token_type_ids':
                raise KeyError
            dataset_dict[key] = np.array(dataset_dict[key])
        except KeyError:
            pass
    try:
        if roberta:
            raise KeyError
        x = [
            dataset_dict["input_ids"][0],
            dataset_dict["token_type_ids"][0],
            dataset_dict["attention_mask"][0],
        ] 
    except KeyError:
        x = [
        dataset_dict["input_ids"][0],
        dataset_dict["attention_mask"][0],
        ]
    return list(np.array(x).reshape((len(x), 1, max_len)))

In [9]:
import pickle


def prepare_load_save_all_data_inputs_targets():
    result = []
    for data, data_name in {
#                             train_set_d: "train", \
#                             val_set_d: "val", \
#                             test_set_d: "test", \
                            train_set_d_h: "train_h", \
                            val_set_d_h: "val_h", \
                            test_set_d_h: "test_h" \
                            }.items():
        for tokenizer, tokenizer_name in {
#                                           bert_tokenizer: "bert", \
                                          roberta_tokenizer: "roberta" \
                                          }.items():
            name = f"{data_name}_{tokenizer_name}"
            print(f'Creating {name}.')
            try:
                with open(f"{name}", "rb") as fp:
                    inp_out = pickle.load(fp)
            except FileNotFoundError:
                inp_out = create_inputs_targets(get_data(data, tokenizer)).set_source(data)
                with open(f"{name}", "wb") as fp:
                    pickle.dump(inp_out, fp)
            finally:
                print(f'{name} Created.')
                result.append(inp_out)
    return result

### Resource draining part

In [10]:
# train_bert\
# train_roberta,\
# val_bert,\
# val_roberta,\
# test_bert,\
# test_roberta,\
# train_bert_h,\
# train_roberta_h,\
# val_bert_h,\
# val_roberta_h,\
# test_bert_h,\
# test_roberta_h,\
train_roberta_h, val_roberta_h, test_roberta_h = prepare_load_save_all_data_inputs_targets()

Creating train_h_roberta.


  0%|          | 0/43 [00:00<?, ?ba/s]

100%|████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:09<00:00,  1.52s/it]


train_h_roberta Created.
Creating val_h_roberta.


  0%|          | 0/11 [00:00<?, ?ba/s]

100%|████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:02<00:00,  2.66it/s]


val_h_roberta Created.
Creating test_h_roberta.


  0%|          | 0/4 [00:00<?, ?ba/s]

100%|████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00,  6.97it/s]

test_h_roberta Created.





## [Task 3] Model definition

Write your own script to define the following transformer-based models from [huggingface](https://HuggingFace.co/).

* [M1] DistilRoBERTa (distilroberta-base)
* [M2] BERTTiny (bert-tiny)

**Note**: Remember to install the ```transformers``` python package!

**Note**: We consider small transformer models for computational reasons!

In [11]:
def get_model(model_name):
    ## encoder
    if model_name == roberta_model_name:
        encoder = TFRobertaModel.from_pretrained(model_name, from_pt=True)
    else:
        encoder = TFBertModel.from_pretrained(model_name, from_pt=True)

    ## QA Model
    input_ids = layers.Input(shape=(max_len,), dtype=tf.int32)
    token_type_ids = layers.Input(shape=(max_len,), dtype=tf.int32)
    attention_mask = layers.Input(shape=(max_len,), dtype=tf.int32)
    embedding = encoder(
        input_ids, 
        token_type_ids=token_type_ids if model_name == bert_model_name else None, \
        attention_mask=attention_mask
    )[0]

    start_logits = layers.Dense(1, name="start_logit", use_bias=False)(embedding)
    start_logits = layers.Flatten()(start_logits)

    end_logits = layers.Dense(1, name="end_logit", use_bias=False)(embedding)
    end_logits = layers.Flatten()(end_logits)

    start_probs = layers.Activation(keras.activations.softmax)(start_logits)
    end_probs = layers.Activation(keras.activations.softmax)(end_logits)

    if model_name == bert_model_name:
        model = keras.Model(
            inputs=[input_ids, token_type_ids, attention_mask],
            outputs=[start_probs, end_probs],
        )
    else:
        model = keras.Model(
            inputs=[input_ids, attention_mask],
            outputs=[start_probs, end_probs],
        )
    loss = keras.losses.SparseCategoricalCrossentropy(from_logits=False)
    optimizer = keras.optimizers.Adam(learning_rate=5e-5)  # `lr` is a deprecated alias of `learning_rate`
    model.compile(optimizer=optimizer, loss=[loss, loss])
    return model


def create_model(model_name, use_tpu=False):
    if use_tpu:
        # Create distribution strategy
        tpu = tf.distribute.cluster_resolver.TPUClusterResolver.connect()
        strategy = tf.distribute.TPUStrategy(tpu)

        # Create model
        with strategy.scope():
            return get_model(model_name)
    else:
        return get_model(model_name)

In [13]:
qa_bert = create_model(bert_model_name)
qa_bert.summary()

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.LayerNorm.bias', 'bert.embeddings.position_ids', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the 

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, 512)]        0           []                               
                                                                                                  
 input_3 (InputLayer)           [(None, 512)]        0           []                               
                                                                                                  
 input_2 (InputLayer)           [(None, 512)]        0           []                               
                                                                                                  
 tf_bert_model (TFBertModel)    TFBaseModelOutputWi  4385920     ['input_1[0][0]',                
                                thPoolingAndCrossAt               'input_3[0][0]',            



In [14]:
qa_roberta = create_model(roberta_model_name)
qa_roberta.summary()

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFRobertaModel: ['lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.bias']
- This IS expected if you are initializing TFRobertaModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFRobertaModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.


Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_4 (InputLayer)           [(None, 512)]        0           []                               
                                                                                                  
 input_6 (InputLayer)           [(None, 512)]        0           []                               
                                                                                                  
 tf_roberta_model (TFRobertaMod  TFBaseModelOutputWi  82118400   ['input_4[0][0]',                
 el)                            thPoolingAndCrossAt               'input_6[0][0]']                
                                tentions(last_hidde                                               
                                n_state=(None, 512,                                         

## [Task 4] Answer generation with text passage $P$ and question $Q$

We want to define $f_\theta(P, Q)$. 

Write your own script to implement $f_\theta$ for each model: M1 and M2.

#### Formulation

Consider a dialogue on text passage $P$. 

For each question $Q_i$ at dialogue turn $i$, your model should take $P$ and $Q_i$ and generate $A_i$.
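
For an extractive model, producing $A_i$ comes down to mapping the predicted start and end token positions back to characters of $P$. The sketch below illustrates that mapping under simplifying assumptions: `decode_span` is a hypothetical helper, the offsets are toy word-level offsets that cover only the passage, and the end position is treated as inclusive.

```python
def decode_span(context, offsets, start_probs, end_probs):
    """Map predicted start/end token positions back to a substring of the passage.

    `offsets[k]` is the (char_start, char_end) pair of token k inside `context`.
    Returns None when the predicted end comes before the predicted start.
    """
    start = max(range(len(start_probs)), key=start_probs.__getitem__)  # argmax over start scores
    end = max(range(len(end_probs)), key=end_probs.__getitem__)        # argmax over end scores
    if start > end:
        return None
    char_start, _ = offsets[start]
    _, char_end = offsets[end]
    return context[char_start:char_end]


context = "Cotton was a little white kitten."
# Toy word-level offsets (assumption: a real tokenizer would return sub-word offsets).
offsets = [(0, 6), (7, 10), (11, 12), (13, 19), (20, 25), (26, 32), (32, 33)]
start_probs = [0.05, 0.05, 0.10, 0.10, 0.60, 0.05, 0.05]
end_probs = [0.05, 0.05, 0.05, 0.05, 0.70, 0.05, 0.05]

answer = decode_span(context, offsets, start_probs, end_probs)
print(answer)
```

With a real tokenizer the encoded input also contains the question tokens, so the offsets of those positions would have to be skipped or masked before slicing the passage.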

In [12]:
class color:
    PURPLE = '\033[95m'   
    CYAN = '\033[96m'
    DARKCYAN = '\033[36m'
    BLUE = '\033[94m'
    GREEN = '\033[92m'
    YELLOW = '\033[93m'
    RED = '\033[91m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'
    END = '\033[0m'

In [15]:
def f_theta(passage, question, model, tokenizer):
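    # Use the `passage` argument rather than the global `context` variable
    data = prepare_validation_features({'context': passage, 'question': question}, tokenizer)
    inputs = create_inputs(data)
    pred_start, pred_end = model.predict(inputs, verbose=0)
    # NOTE: start/end are token positions in the tokenized (question, passage) input;
    # they are used below directly as character indices, so the printed text is only a rough preview
    start = np.argmax(pred_start[0])
    end = np.argmax(pred_end[0])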
    print(f"\t{color.BOLD}Question:{color.END} {question}")
    if start >= end:
        print(f"\t{color.RED}Not able to find an answer.{color.END}")
    else:
        print(f"\t{color.GREEN}{color.BOLD}Answer: {color.END}{color.GREEN}{passage[start:end].strip()}{color.END}")


dialogue_id = train_set.iloc[0]['id']
dialogues = train_set[train_set.id == dialogue_id]
print(color.BOLD + 'BERT:' + color.END)
for turn_id in dialogues.turn_id.values:
    dialogue = dialogues[dialogues.turn_id == turn_id]
    context = dialogue.context.values[0]
    question = dialogue.question.values[0][2:]
    f_theta(context, question, qa_bert, bert_tokenizer)
print(color.BOLD + 'RoBERTa:' + color.END)
for turn_id in dialogues.turn_id.values:
    dialogue = dialogues[dialogues.turn_id == turn_id]
    context = dialogue.context.values[0]
    question = dialogue.question.values[0][2:]
    f_theta(context, question, qa_roberta, roberta_tokenizer)

[1mBERT:[0m
	[1mQuestion:[0m When was the Vat formally opened?
	[91mNot able to find an answer.[0m
	[1mQuestion:[0m what is the library for?
	[92m[1mAnswer: [0m[92me Vatican Library or simply the Vat, is the library of the Holy See, located in Vatican City. Formally established in 1475, although it is much[0m
	[1mQuestion:[0m for what subjects?
	[92m[1mAnswer: [0m[92mthe Vatican Library or simply the Vat, is the library of the Holy See, located in Vatican City. Formally established in 1475, although it is muc[0m
	[1mQuestion:[0m and?
	[92m[1mAnswer: [0m[92md the Vatican Library or simply the Vat, is the library of the Holy See, located in Vatican City. Formally established in 1475, although it is much[0m
	[1mQuestion:[0m what was started in 2014?
	[91mNot able to find an answer.[0m
	[1mQuestion:[0m how do scholars divide the library?
	[91mNot able to find an answer.[0m
	[1mQuestion:[0m how many?
	[92m[1mAnswer: [0m[92mthe Vatican Library or simp

## [Task 5] Answer generation with text passage $P$, question $Q$ and dialogue history $H$

We want to define $f_\theta(P, Q, H)$. Write your own script to implement $f_\theta$ for each model: M1 and M2.

#### Formulation

Consider a dialogue on text passage $P$. 

For each question $Q_i$ at dialogue turn $i$, your model should take $P$, $Q_i$, and $H = \{ Q_0, A_0, \dots, Q_{i-1}, A_{i-1} \}$ to generate $A_i$.
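
One simple way to feed $H$ to the model is to prepend the most recent turns to the current question as plain text. The sketch below follows the numeric turn-marker format used in this notebook; `build_question_with_history`, its `max_turns` parameter, and the toy history are hypothetical.

```python
def build_question_with_history(question, history, max_turns=2):
    """Prepend the last `max_turns` (question, answer) pairs to the current question.

    Each past turn is tagged with its distance from the current turn;
    the current question is tagged with 0.
    """
    recent = history[-max_turns:] if max_turns > 0 else []
    parts = [
        f"{len(recent) - i} {q} {len(recent) - i} {a}."
        for i, (q, a) in enumerate(recent)
    ]
    return " ".join(parts + [f"0 {question}"])


# Toy history, made up for illustration only.
history = [
    ("What color was Cotton?", "white"),
    ("Where did she live?", "in a barn"),
]
augmented = build_question_with_history("Did she live alone?", history)
print(augmented)
```

Limiting the history to a few turns keeps the question short, which leaves more of the 512-token budget for the passage.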

In [16]:
def f_theta_h(passage, question, history, model, tokenizer):
    print(f"\t{color.BOLD}Question:{color.END} {question}")
    question = " ".join([f"{len(history)-i} {q} {len(history)-i} {a}." for i, (q, a) in enumerate(history)] + [f"0 {question}"])
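    # Use the `passage` argument rather than the global `context` variable
    data = prepare_validation_features({'context': passage, 'question': question}, tokenizer)
    inputs = create_inputs(data)
    pred_start, pred_end = model.predict(inputs, verbose=0)
    # NOTE: start/end are token positions in the tokenized input, not character offsets in the passage
    start = np.argmax(pred_start[0])
    end = np.argmax(pred_end[0])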
    if start >= end:
        print(f"\t{color.RED}Not able to find an answer.{color.END}")
    else:
        print(f"\t{color.GREEN}{color.BOLD}Answer: {color.END}{color.GREEN}{passage[start:end].strip()}{color.END}")

history_memory = 2
dialogue_id = train_set.iloc[0]['id']
dialogues = train_set[train_set.id == dialogue_id]
print(color.BOLD + 'BERT:' + color.END)
for turn_id in dialogues.turn_id.values:
    if turn_id<5:
        continue
    dialogue = dialogues[dialogues.turn_id == turn_id]
    context = dialogue.context.values[0]
    question = dialogue.question.values[0][2:]
    history_dialogues = dialogues[dialogues.turn_id < turn_id]
    history_dialogues = history_dialogues[history_dialogues.turn_id >= turn_id-history_memory]
    history = [(q[2:], a) for q, a in zip(history_dialogues.question.values, history_dialogues.answer.values)]
    f_theta_h(context, question, history, qa_bert, bert_tokenizer)
print(color.BOLD + 'RoBERTa:' + color.END)
for turn_id in dialogues.turn_id.values:
    if turn_id<5:
        continue
    dialogue = dialogues[dialogues.turn_id == turn_id]
    context = dialogue.context.values[0]
    question = dialogue.question.values[0][2:]
    history_dialogues = dialogues[dialogues.turn_id < turn_id]
    history_dialogues = history_dialogues[history_dialogues.turn_id >= turn_id-history_memory]
    history = [(q[2:], a) for q, a in zip(history_dialogues.question.values, history_dialogues.answer.values)]
    f_theta_h(context, question, history, qa_roberta, roberta_tokenizer)

[1mBERT:[0m
	[1mQuestion:[0m what was started in 2014?
	[91mNot able to find an answer.[0m
	[1mQuestion:[0m how do scholars divide the library?
	[91mNot able to find an answer.[0m
	[1mQuestion:[0m how many?
	[91mNot able to find an answer.[0m
	[1mQuestion:[0m what is the official name of the Vat?
	[91mNot able to find an answer.[0m
	[1mQuestion:[0m where is it?
	[92m[1mAnswer: [0m[92my or simply the Vat, is the library of the Holy See, located in Vatican City. Formally established in 1475, although[0m
	[1mQuestion:[0m how many printed books does it contain?
	[92m[1mAnswer: [0m[92me Vat, is the library of the Holy See, located in Vatican City. Formally established in 1475, alt[0m
	[1mQuestion:[0m when were the Secret Archives moved from the rest of the library?
	[92m[1mAnswer: [0m[92mat, is the library of the Holy See, located in Vatican City. Formally established in 1475, although it is much older, it is one of the oldest libraries in the world and

## [Task 6] Train and evaluate $f_\theta(P, Q)$ and $f_\theta(P, Q, H)$

Write your own script to train and evaluate your $f_\theta(P, Q)$ and $f_\theta(P, Q, H)$ models.

### Instructions

* Perform multiple train/evaluation seed runs: [42, 2022, 1337].$^1$
* Evaluate your models with the following metric: SQuAD F1-score.$^2$
* Fine-tune each transformer-based model for **3 epochs**.
* Report the SQuAD F1-score computed on the validation and test sets.

$^1$ Remember what we said about code reproducibility in Tutorial 2!

$^2$ You can use the ```allennlp``` Python package for a quick implementation of the SQuAD F1-score: ```from allennlp_models.rc.tools import squad```.
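
The multi-seed protocol can be organised as a small loop that collects one score per seed and then summarises them. The sketch below is only a skeleton: `run_all_seeds` is a hypothetical helper, and `placeholder_train_and_evaluate` is a stand-in that returns a pseudo-random number, not a real F1-score.

```python
import random
import statistics

SEEDS = (42, 2022, 1337)


def placeholder_train_and_evaluate(seed):
    """Stand-in for a real training run (assumption): returns a pseudo-random number, NOT a real F1-score."""
    return random.Random(seed).random()


def run_all_seeds(train_and_evaluate, seeds=SEEDS):
    """Run one experiment per seed and summarise the scores with mean and sample standard deviation."""
    scores = {seed: train_and_evaluate(seed) for seed in seeds}
    values = list(scores.values())
    return scores, statistics.mean(values), statistics.stdev(values)


scores, mean_score, std_score = run_all_seeds(placeholder_train_and_evaluate)
print(f"{mean_score:.3f} +/- {std_score:.3f}")
```

In a real run the placeholder would be replaced by a function that fixes the seed, fine-tunes the model for 3 epochs, and returns the F1-score on the chosen split.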

In [152]:
def normalize_text(s):
    """Lower text and remove punctuation, articles and extra whitespace."""

    def remove_articles(text):
        regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
        return re.sub(regex, " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))


def get_tokens(s):
    if not s:
        return []
    return normalize_text(s).split()


def compute_f1(a_pred: str, a_gold: str) -> float:
    pred_toks = get_tokens(a_pred)
    gold_toks = get_tokens(a_gold)
    common = collections.Counter(pred_toks) & collections.Counter(gold_toks)  # type: ignore[var-annotated]
    num_same = sum(common.values())
    if len(pred_toks) == 0 or len(gold_toks) == 0:
        # If either is no-answer, then F1 is 1 if they agree, 0 otherwise
        return float(pred_toks == gold_toks)
    if num_same == 0:
        return 0.0
    precision = 1.0 * num_same / len(pred_toks)
    recall = 1.0 * num_same / len(gold_toks)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1



def validate(data, model, verbose=True):
    """Return the mean SQUAD F1-score of `model` on `data`.

    If `verbose` is True, samples with an F1-score <= 0.05 and the final score are printed.
    """
    x_eval = data.x[:-1]  # model inputs; the last entry of data.x holds the offsets
    y_eval = data.y
    source = data.source
    offsets_list = data.x[-1]
    f1_score = 0.0
    pred_start, pred_end = model.predict(x_eval, verbose=0)
    counter = 0
    for idx, (sample, start, end) in enumerate(zip(iter(source), pred_start, pred_end)):
        start = np.argmax(start)
        end = np.argmax(end)
        # Spans with start >= end are skipped, so they contribute an F1-score of 0.
        if start >= end:
            continue
        offsets = offsets_list[idx]
        context = sample['context']
        turn_id = sample['turn_id']
        question = sample['question']
        index = sample['index']
        # Map the predicted token span back to a character span of the context.
        pred_char_start = offsets[start][0]
        if end < len(offsets):
            pred_char_end = offsets[end][1]
            pred_ans = context[pred_char_start:pred_char_end]
        else:
            pred_ans = context[pred_char_start:]

        true_ans = sample['span_text']
        # compute_f1 normalizes both strings before comparing their tokens.
        sample_f1 = compute_f1(pred_ans, true_ans)
        if verbose and sample_f1 <= 0.05:
            counter += 1
            print(counter)
            print('turn_id: ', turn_id, '\nindex: ', index, '\nQuestion: ', question, '\npred answer: ', pred_ans, '\nTrue answer: ', true_ans)

        f1_score += sample_f1
    f1_score /= len(y_eval[0])
    if verbose:
        print(f"f1_score={f1_score:.5f}")
    return f1_score
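

As a worked illustration of the token-level F1-score computed by `compute_f1`, take the gold span "a little white kitten named Cotton" and suppose, purely hypothetically, that the predicted answer is "white kitten". After normalization (lowercasing, removing punctuation and articles) the gold span has 5 tokens, the prediction has 2, and they share 2 tokens:

$$
P = \frac{2}{2} = 1, \qquad R = \frac{2}{5} = 0.4, \qquad F_1 = \frac{2 \cdot P \cdot R}{P + R} = \frac{0.8}{1.4} \approx 0.571
$$

A short prediction that lies entirely inside the gold span therefore gets perfect precision but is penalized through recall.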

In [18]:
models = {}

In [None]:
for random_seed in [
    # 42,
    # 2022,
    1337,
]:
    seed(random_seed)
    set_seed(random_seed)
    for train, val, test, model_name, hist in [
        # (train_bert, val_bert, test_bert, bert_model_name, False),
        # (train_roberta, val_roberta, test_roberta, roberta_model_name, False),
        # (train_bert_h, val_bert_h, test_bert_h, bert_model_name, True),
        (train_roberta_h, val_roberta_h, test_roberta_h, roberta_model_name, True),
    ]:
        model_key = f"{model_name}_{random_seed}{'_h' if hist else ''}"
        # Reuse a model that is already in memory; otherwise create a new one
        try:
            model = models[model_key]
        except KeyError:
            model = create_model(model_name, True)
        model.fit(
            train.x[:-1],
            train.y,
            epochs=3,
            verbose=1,
            batch_size=200 if model_name == bert_model_name else 32,
            validation_data=(val.x[:-1], val.y),
        )
        models[model_key] = model
        # model.save_weights(f"/content/drive/MyDrive/checkpoints/bert-tiny_{random_seed}{'_h' if hist else ''}.h5")

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.decoder.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'bert.embeddings.position_ids', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the 

Epoch 1/3




Epoch 2/3
Epoch 3/3


In [6]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### load models

In [60]:
import os
import re

test_f1_scores = {}
checkpoint_dir = '/content/drive/MyDrive/checkpoints'
# Checkpoints of models trained with dialogue history end in "_h.h5"
pattern = r'^.+_h\.h5$'
# Keep only weight files (this also skips the '.ipynb_checkpoints' folder)
weight_files = [f for f in os.listdir(checkpoint_dir) if f.endswith('.h5')]
for weight_file in weight_files:
    uses_history = re.match(pattern, weight_file) is not None
    if weight_file.split('-')[0] == 'bert':
        model_name = bert_model_name
        test_set = test_bert_h if uses_history else test_bert
    else:
        model_name = roberta_model_name
        test_set = test_roberta_h if uses_history else test_roberta

    model = create_model(model_name, True)
    model.load_weights(f"{checkpoint_dir}/{weight_file}")
    test_f1_scores[weight_file.split('.')[0]] = validate(test_set, model)


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFRobertaModel: ['lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.bias', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing TFRobertaModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFRobertaModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.
  super(Adam, self).__init__(name, **kwargs)


f1_score=0.48486


f1_score=0.47579


f1_score=0.47486


f1_score=0.54713


f1_score=0.54564


f1_score=0.53971


f1_score=0.21950


f1_score=0.16867


f1_score=0.16994


f1_score=0.14096


f1_score=0.13934


f1_score=0.13400


In [61]:
print(test_f1_scores)

{'distilroberta-base_42': 0.48486468732310956, 'distilroberta-base_2022': 0.47578566677079914, 'distilroberta-base_1337': 0.4748605532484781, 'distilroberta-base_42_h': 0.5471264434644234, 'distilroberta-base_2022_h': 0.5456373762303172, 'distilroberta-base_1337_h': 0.5397110697558715, 'bert-tiny_42': 0.2195032285180981, 'bert-tiny_2022': 0.1686661558906862, 'bert-tiny_1337': 0.1699359185097832, 'bert-tiny_42_h': 0.14095852525173905, 'bert-tiny_2022_h': 0.13933953957878534, 'bert-tiny_1337_h': 0.1339994296646173}
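
Since each configuration is run with three seeds, the per-seed scores can be summarized as a mean and standard deviation per model. The sketch below is one possible way to do this; `aggregate_by_model` is a hypothetical helper, the scores are the test F1 values printed above rounded to five decimals, and it assumes keys of the form `<model>_<seed>` with an optional `_h` suffix for the history variant.

```python
from collections import defaultdict
from statistics import mean, stdev

# Test F1-scores reported above, rounded to five decimals.
test_f1_scores = {
    "distilroberta-base_42": 0.48486,
    "distilroberta-base_2022": 0.47579,
    "distilroberta-base_1337": 0.47486,
    "distilroberta-base_42_h": 0.54713,
    "distilroberta-base_2022_h": 0.54564,
    "distilroberta-base_1337_h": 0.53971,
    "bert-tiny_42": 0.21950,
    "bert-tiny_2022": 0.16867,
    "bert-tiny_1337": 0.16994,
    "bert-tiny_42_h": 0.14096,
    "bert-tiny_2022_h": 0.13934,
    "bert-tiny_1337_h": 0.13400,
}


def aggregate_by_model(scores):
    """Group scores by model (ignoring the seed) and return (mean, sample std) per model."""
    grouped = defaultdict(list)
    for key, f1 in scores.items():
        # Key layout assumed: "<model>_<seed>" or "<model>_<seed>_h"
        name, _seed, *suffix = key.split("_")
        grouped[name + ("_h" if suffix else "")].append(f1)
    return {model: (mean(values), stdev(values)) for model, values in grouped.items()}


summary = aggregate_by_model(test_f1_scores)
for model_name, (avg, std) in summary.items():
    print(f"{model_name}: {avg:.4f} +/- {std:.4f}")
```

Reporting the spread alongside the mean makes it easier to judge whether the gap between the variants with and without history exceeds the seed-to-seed variation.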


## [Task 7] Error Analysis

Perform a simple and short error analysis as follows:
* Group dialogues by ```source``` and report the worst 5 model errors for each source (w.r.t. SQUAD F1-score).
* Inspect observed results and try to provide some comments (e.g., do the models make errors when faced with a particular question type?)$^1$

$^1$ Check the [paper](https://arxiv.org/pdf/1808.07042.pdf) for some valuable information about question/answer types (e.g., Table 6, Table 8) 
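
The grouping step described above can be sketched as follows. This is only an illustration under assumptions: `worst_errors_by_source` and the record fields `source`, `question` and `f1` are hypothetical names, and the toy records are made up rather than taken from the dataset.

```python
from collections import defaultdict


def worst_errors_by_source(records, k=5):
    """Return, for each source, the k records with the lowest F1-score."""
    by_source = defaultdict(list)
    for record in records:
        by_source[record["source"]].append(record)
    return {
        source: sorted(rows, key=lambda row: row["f1"])[:k]
        for source, rows in by_source.items()
    }


# Toy per-sample results (hypothetical values, for illustration only)
records = [
    {"source": "race", "question": "q1", "f1": 0.3},
    {"source": "race", "question": "q2", "f1": 0.0},
    {"source": "race", "question": "q3", "f1": 0.1},
    {"source": "mctest", "question": "q4", "f1": 0.5},
    {"source": "mctest", "question": "q5", "f1": 0.2},
]

worst = worst_errors_by_source(records, k=2)
for source, rows in worst.items():
    print(source, [(row["question"], row["f1"]) for row in rows])
```

Computing the per-sample F1-score once and storing it next to the question and answers keeps the later inspection independent of the model.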

In [19]:
#load the most promising model
model = create_model(roberta_model_name, False)
model.load_weights("distilroberta-base_42_h.h5")
f1 = validate(test_roberta_h, model)


f1_score=0.54732


In [105]:
test_scores = {}
sources = test_set.source.unique()
grouped_sources = {source: test_set[test_set.source == source] for source in sources}
for source, data in grouped_sources.items():
    d = []
    for i in range(data.shape[0]):
        row_data = Dataset.from_pandas(data.iloc[i:i+1])
        test = create_inputs_targets(get_data(row_data, roberta_tokenizer)).set_source(row_data)
        f1_score = validate(test, model, False)
        context = data.context.values[i]
        question = data.question.values[i][2:]
        d.append([f1_score, context, question, data.span_text.values[i], data.answer.values[i]])
    test_scores[source] = d




In [162]:
test_scores = {}
sources = test_set_h.source.unique()
grouped_sources = {source: test_set_h[test_set_h.source == source] for source in sources}

for i, (source, data) in enumerate(grouped_sources.items()):
    if i != 1:
        continue
    data = Dataset.from_pandas(data)
    test = create_inputs_targets(get_data(data, roberta_tokenizer)).set_source(data)
    print(source)
    validate(test, model)



race
1
turn_id:  8 
index:  28 
Question:  2 Where does Nicole live? 2 Shanghai. 1 How is she related to the boy? 1 mother. 0 What is in the bag? 
pred answer:  she holds a paper carrier bag 
True answer:  bag--a thermos with hot soup and a stainless-steel container with rice, vegetables and either chicken, meat or shrimp, sometimes with a kind of pancake
2
turn_id:  9 
index:  29 
Question:  2 How is she related to the boy? 2 mother. 1 What is in the bag? 1 food. 0 Has she done this before? 
pred answer:  It is not her first visit. 
True answer:  This has become an almost-daily practice. 
3
turn_id:  11 
index:  31 
Question:  2 Has she done this before? 2 Yes. 1 Why? 1 I am having heart surgery soon, so her mother has decided I need more nutrients. 0 What has helped us communicate? 
pred answer:  Communication between us is somewhat affected by the fact that she doesn't speak English and all I can say in Chinese is hello 
True answer:  an iPad
4
turn_id:  12 
index:  32 
Question:  2

42
turn_id:  4 
index:  1797 
Question:  2 What is Sandy's last name 2 Lin. 1 what is wrong with her mother? 1 she's ill. 0 can she work? 
pred answer:  she can't do a part-time job after class. 
True answer:  I've been ill in bed several years
43
turn_id:  11 
index:  1804 
Question:  2 Who did Rose talk to ? 2 Justin. 1 what is his occupation? 1 social worker. 0 does her friends think Sandy can handle her job and school? 
pred answer:  I'll manage it as soon as I can. 
True answer:  but it's her time to study hard to enter a good senior high school, she can't do a part-time job after class
44
turn_id:  2 
index:  1831 
Question:  2 What defeats all? 2 Labor. 0 What kind of labor? 
pred answer:  Labor defeats all--not inconstant, or ill-directed labor 
True answer:  faithful, persistent, daily effort toward a well-directed purpose
45
turn_id:  10 
index:  1839 
Question:  2 Like what? 2 science. 1 And? 1 literature. 0 Not math? 
pred answer:  The celebrated mathematician, Edmund Stone

69
turn_id:  1 
index:  3113 
Question:  0 What was the name of the great author? 
pred answer:  Vladimir Ilyich Tolstoy 
True answer:  This year marks the 100thanniversary of Leo Tolstoy's death. He is considered by many to be one of the greatest novelists of all time.
70
turn_id:  16 
index:  3145 
Question:  2 Did the narrator hear something whilst waiting at the elevator? 2 yes. 1 Was it Peter? 1 yes. 0 What did he ask him? 
pred answer:  Then he asked me to broadcast an imaginary game 
True answer:  "What did you say about sports? Do you know anything about football?" 
71
turn_id:  17 
index:  3146 
Question:  2 Was it Peter? 2 yes. 1 What did he ask him? 1 what he said about sports. 0 What sports? 
pred answer:  Then he asked me to broadcast an imaginary game 
True answer:  Do you know anything about football
72
turn_id:  3 
index:  3236 
Question:  2 What is the primary device mentioned here? 2 lifts,or elevators. 1 What can you do when you're in one alone? 1 whatever you want. 

In [115]:
# Print the 5 worst predicted answers (lowest F1-score) for each source
for source, scores in test_scores.items():
    print(source)
    scores.sort(key=lambda row: row[0])
    for score in scores[:5]:
        f_theta(score[1], score[2], model, roberta_tokenizer)
        print(score[3])  # gold span_text
           

mctest
	Question: Who did she live with?
	Answer: a ni
with her mommy and 5 other sisters
	Question: What color was Cotton?
	Answer: near a farm house, there lived a little white kitten n
a little white kitten named Cotton
	Question: What color were her sisters?
	Answer: ar a farm house, there lived a little white kitten named Cotton. Cotton lived high up in a ni
her sisters were all orange with beautiful white tiger stripes
	Question: Where did she live?
	Answer: ar
in a barn near a farm house, there lived a little white kitten
	Question: Did she live alone?
	Not able to find an answer.
Cotton wasn't alone
race
	Question: Who is at the door?
	Answer: p, I
On the step, I find the elderly Chinese lady, small and slight, holding the hand of a little boy
	Question: What?
	Not able to find an answer.
a paper 

# Assignment Evaluation

Points will be awarded for each task as follows:

* Task 1, Pre-processing $\rightarrow$ 0.5 points.
* Task 2, Dataset Splitting $\rightarrow$ 0.5 points.
* Tasks 3 and 4, Models Definition $\rightarrow$ 1.0 points.
* Tasks 5 and 6, Models Training and Evaluation $\rightarrow$ 2.0 points.
* Task 7, Analysis $\rightarrow$ 1.0 points.
* Report $\rightarrow$ 1.0 points.

**Total** = 6 points <br>

We may award an additional 0.5 points for outstanding submissions. 
 
**Speed Bonus** = 0.5 extra points <br>

# Report

The rules for the report described in Assignment 1 apply here as well.
* Write a clear and concise report following the given Overleaf template (**max 2 pages**).
* Report validation and test results in a table.$^1$
* **Avoid reporting** code snippets or copy-pasted terminal output $\rightarrow$ **Provide a clean schema** of what you want to show.

# Comments and Organization

Remember to comment your code properly (there is no need to comment every single line), and don't forget to describe your work!

Structure your code for readability and maintenance. If you work with Colab, use sections. 

This keeps your code clean and modular, and makes it easier to read and debug (notebooks can be quite tricky from time to time).

# FAQ (READ THIS!)

---

**Question**: Does Task 3 also include the data tokenization and conversion steps?

**Answer:** Yes! These steps are usually straightforward since ```transformers``` also offers a specific tokenizer for each model.

**Example**: 

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoded_text = tokenizer(text)
# Alternatively, truncate explicitly (text and max_length are defined elsewhere)
inputs = tokenizer(text, add_special_tokens=True, truncation=True, max_length=min(max_length, 512))
input_ids, attention_mask = inputs['input_ids'], inputs['attention_mask']
```

**Suggestion**: Hugging Face's documentation is full of tutorials and user-friendly APIs.

---
---

**Question**: I'm hitting an **out-of-memory error** when training my models. Do you have any suggestions?

**Answer**: Here are some common workarounds:

1. Try decreasing the mini-batch size.
2. Try a different padding strategy (if you are applying padding): e.g. pad to a quantile of the sequence lengths instead of the maximum sequence length.
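
The second workaround can be sketched as follows. This is only a minimal illustration in plain Python, not the expected solution: the helper names `quantile_length` and `pad_or_truncate`, the toy token ids, and the default `pad_id=0` are all assumptions made for the example (in practice you would use your tokenizer's own padding id).

```python
import math

def quantile_length(lengths, q=0.95):
    """Return the smallest length that covers a fraction q of the sequences."""
    ordered = sorted(lengths)
    # Nearest-rank quantile: take the ceil(q * n)-th smallest length.
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def pad_or_truncate(token_ids, target_len, pad_id=0):
    """Truncate sequences longer than target_len and right-pad shorter ones."""
    clipped = token_ids[:target_len]
    return clipped + [pad_id] * (target_len - len(clipped))

# Toy token-id sequences (hypothetical data): one outlier is much longer than the rest.
batch = [[101, 7, 8, 102], [101, 5, 102], [101, 9, 9, 9, 102], list(range(50))]
lengths = [len(seq) for seq in batch]

# Pad to the 75th-percentile length instead of the maximum length (50).
target_len = quantile_length(lengths, q=0.75)
padded = [pad_or_truncate(seq, target_len) for seq in batch]
```

With this choice a single very long sequence no longer dictates the size of every padded input; the price is that sequences above the quantile are truncated, so the quantile should be picked by looking at the actual length distribution. Remember to build the attention mask consistently with the padding.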

---
---

# Contact

For any doubts, questions, issues, or requests for help, you can always contact us at the following email addresses:

Teaching Assistants:

* Andrea Galassi -> a.galassi@unibo.it
* Federico Ruggeri -> federico.ruggeri6@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

# The End!

Questions?