## Bert Assignment

In this notebook, we train a sentiment polarity classifier on a dataset of IMDB movie reviews. The movies have been separated into two classes: positive and negative. The code in this notebook is adapted from the code in this [ Hugging Face tutorial: ](https://huggingface.co/transformers/training.html)

First we need to install the huggingface datasets library

In [None]:
pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.10.1-py3-none-any.whl (469 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m469.0/469.0 KB[0m [31m11.8 MB/s[0m eta [36m0:00:00[0m
Collecting dill<0.3.7,>=0.3.0
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m110.5/110.5 KB[0m [31m9.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting xxhash
  Downloading xxhash-3.2.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m212.2/212.2 KB[0m [31m21.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting huggingface-hub<1.0.0,>=0.2.0
  Downloading huggingface_hub-0.13.3-py3-none-any.whl (199 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m199.8/199.8 KB[0m [31m15.8 MB/s[0m eta [36m0:00:00[0m
Collecting multiprocess
  Downloa

Let's also install the huggingface transformers library


In [None]:
pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.27.3-py3-none-any.whl (6.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.8/6.8 MB[0m [31m40.8 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.6/7.6 MB[0m [31m107.9 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: tokenizers, transformers
Successfully installed tokenizers-0.13.2 transformers-4.27.3


Now let's load the dataset we need, i.e. the imdb binary sentiment polarity movie review dataset.


In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("imdb")



  0%|          | 0/3 [00:00<?, ?it/s]

In [None]:
raw_datasets.keys()

dict_keys(['train', 'test', 'unsupervised'])

In [None]:
print(len(raw_datasets['train']))
print(len(raw_datasets['test']))
print(len(raw_datasets['unsupervised']))


25000
25000
50000


In [None]:
train_dataset = raw_datasets['train'].shuffle(seed=42).select(range(2000))
test_dataset = raw_datasets['test'].shuffle(seed=42).select(range(1000))
print(len(train_dataset))
print(len(test_dataset))



2000
1000


In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/5532cc56f74641d4bb33641f5c76a55d11f846e0/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.26.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/5532cc56f74641d4bb33641f5c76a55d11f846e0/pytorch_model.b

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_dataset(dataset,tokenizer):
  tokenized_dataset = []
  for item in dataset:
    tokenized = tokenizer(item["text"],padding="max_length", truncation=True)
    item.update(tokenized)
    tokenized_dataset.append(item)
  return tokenized_dataset

tokenized_train = tokenize_dataset(train_dataset,tokenizer)
tokenized_test = tokenize_dataset(test_dataset,tokenizer)

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/5532cc56f74641d4bb33641f5c76a55d11f846e0/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.26.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/5532cc56f74641d4bb33641f5c76a55d11f846e0/vocab.txt
loading file tokenize

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments("test_trainer",evaluation_strategy="epoch")

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [None]:
import numpy as np
from datasets import load_metric

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    metric = load_metric("accuracy")
    return metric.compute(predictions=predictions, references=labels)

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model, 
    args=training_args, 
    train_dataset=tokenized_train, 
    eval_dataset=tokenized_test,
    compute_metrics = compute_metrics
)

In [None]:
trainer.train()

***** Running training *****
  Num examples = 2000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 750
  Number of trainable parameters = 108311810
The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.338673,0.863
2,0.357400,0.612492,0.853
3,0.357400,0.569828,0.875


***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
  metric = load_metric("accuracy")


Downloading builder script:   0%|          | 0.00/1.65k [00:00<?, ?B/s]

Saving model checkpoint to test_trainer/checkpoint-500
Configuration saved in test_trainer/checkpoint-500/config.json
Model weights saved in test_trainer/checkpoint-500/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=750, training_loss=0.27596288045247397, metrics={'train_runtime': 686.4578, 'train_samples_per_second': 8.741, 'train_steps_per_second': 1.093, 'total_flos': 1578666332160000.0, 'train_loss': 0.27596288045247397, 'epoch': 3.0})

In [None]:
trainer.evaluate(tokenized_test)

***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.


{'eval_loss': 0.5698278546333313,
 'eval_accuracy': 0.875,
 'eval_runtime': 33.4824,
 'eval_samples_per_second': 29.866,
 'eval_steps_per_second': 3.733,
 'epoch': 3.0}

In [None]:

predictions = trainer.predict(tokenized_test)

***** Running Prediction *****
  Num examples = 1000
  Batch size = 8
The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.


In [None]:
def print_predictions_by_type(pred_type, predictions):
    print('='*200)
    print(pred_type)
    for pred in predictions:
      print(pred)

def get_predictions_by_type(dataset):
   for i,item in enumerate(dataset):
      predicted = np.argmax(predictions.predictions[i])
      gold = item['label']
      if predicted == gold: # prediction is correct
         if predicted == 1:
            print('TP: ' + item['text'])
         else:
            print('TN: ' + item['text'])
      else: # prediction is incorrect
         if predicted == 1:
            print('FP: ' + item['text'])
         else:
            print('FN: ' + item['text'])
   

get_predictions_by_type(test_dataset)

## Question answering implementation

I have decided to take the BERT model and apply it to a different task, namely question answering. There are several different flavours of question answering, these are as follows:

*   **Extractive** - The model extracts the answer from the provided context.
*   **Open Generative** - The model generates a free-form answer from the provided context.
*   **Closed Generative** - The model generates a free-form answer from no context, meaning the model ideally already has the means to generate the answer.


The choice of question answering typically depends on the exact usecase for the model, if the context is available and the answer can follow the appearance in the context, extractive is the approach to use.

We can briefly have a look at the extractive model and see how it operates.

### Extractive Model Implementation

NOTE: This implementation is adapted from the HuggingFace tutorial to explore the model and potential implementation. The Squad dataset is a commonly used question answering dataset that will not be used in my implementation.

In [1]:
# Mount the drive so we can save the results of training to our drive.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%%capture
pip install datasets

In [3]:
%%capture
pip install transformers

Now we have mounted our drive and installed the needed packages, we can load the dataset and train our BERT question answering model.

In [None]:
from datasets import load_dataset

# Get the dataset and only take the first 5,000 samples
squad = load_dataset("squad", split="train[:5000]")

# Split these samples into a train and test set with the training set being 80% and the test being 20%
squad = squad.train_test_split(test_size=0.2)

We now need to preprocess the dataset. This will involve tokenizing the input data so the model can process it. In addition, we will need to pass the start and end of the answer in the context for training. This will need to be found and added to the data when we are processing it.

In [None]:
from transformers import AutoTokenizer

# Create the auto tokenizer so we can tokenize the text for the model to process.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [None]:
def preprocess_function(examples):
    """
    This function takes batches from the squad dataset and tokenizes the input 
    while also finding the index in the input for the answer in the context.
    """

    # Strip any whitespace from the text.
    questions = [q.strip() for q in examples["question"]]

    # Generate the tokens for each word in the input. This goes up to a max length of 384.
    # The input ids are padded with 0's or truncated if the sequence is too long.
    # The offset mapping is returned so the start and end position of the word can be found in the sequence.
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Extract the offset mapping
    offset_mapping = inputs.pop("offset_mapping")

    # Extract the answers from the examples
    answers = examples["answers"]

    # Create a list for keeping track of the start and end positions.
    start_positions = []
    end_positions = []

    # For every sample (an individual list in the offset mapping)
    for i, offset in enumerate(offset_mapping):

      # Get the current answer
        answer = answers[i]

        # Get the start and end of the word.
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])

        # Get the ids of the sequences. Special characters added by the tokenizer
        # will be None, words from the first sequence (question) will have the id 0.
        # The words from the second sequence (context) will be 1.
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    # Add the start and end positions to the inputs.
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions

    return inputs

Now we can apply our preprocessing function using map, giving us the processed data that we are ready to use for the model training.

In [None]:
# Apply the preprocess function to the squad data in a batched manner and remove the column names in train.
tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)

In [None]:
from transformers import DefaultDataCollator

# Create a DataCollator for used during training.
data_collator = DefaultDataCollator()

In [None]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

# Create the Auto Question Answering model using the bert-base-cased weights.
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-cased")

We can see that some of the weights were not initialized from the BERT model checkpoint, these are the question answering output weights and biases, which are not a part of the BERT model so these have to be initialized randomly.

This is typically the case when fine-tuning BERT as the base model is essentially an encoder. Adding additional layers at the end and fine-tuning the model on a task allows for good performance on a range of natural language tasks.

In [None]:
# Define the training arguments.
training_args = TrainingArguments(
    output_dir="tmp/",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01
)

# Create a trainer object so we can train the model easily.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

We can see that the model improved after the first epoch, but remained quire similar on the next. We could increase the number of epochs and observe how the loss changes on both the training and test sets.

Unfortunately, there is no default metric for understanding the performance. As this is the sample model we will not concern ourselves with this, we will look at this in our later models.

In [None]:
trainer.evaluate()

We can manually run our model to create predictions given a question and context. We can see that it works reasonably well. This question and context was not in the dataset.

In [None]:
from transformers import pipeline

# Ensure that the model is back on the cpu, as this is where our test question will be created.
model.to("cpu")

# We can make use of a pipeline to pass questions and context and get the answers returned.
question_answerer = pipeline("question-answering", model=model, tokenizer=tokenizer)

context = """
🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question = "Which deep learning libraries back 🤗 Transformers?"
question_answerer(question=question, context=context)

This approach works well when we have a dataset that the answer to the questions can be directly extracted from the context. This means that there can be no advanced answers that return more than a single extracted segment of text from the context.


### Tweet QA dataset

The dataset I have decided to make use of is the [Tweet QA dataset](https://huggingface.co/datasets/tweet_qa). This dataset is intersting as it makes use of user generated data from the social media platform Twitter. The tweets were extracted from articles on the major news websites CNN and NBC. The questions and potential answers were gathered from Amazon Mechanical Turk workers.

Let us take a look at the dataset below.


In [4]:
# Load the dataset
from datasets import load_dataset

raw_datasets = load_dataset("tweet_qa")
raw_datasets

Downloading builder script:   0%|          | 0.00/3.71k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/1.74k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/12.3k [00:00<?, ?B/s]

Downloading and preparing dataset tweet_qa/default to /root/.cache/huggingface/datasets/tweet_qa/default/1.0.0/7d588f7f477946b10f60c035ca55175737315ac446102b015218af38d2638777...


Downloading data:   0%|          | 0.00/1.57M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/10692 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1979 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1086 [00:00<?, ? examples/s]

Dataset tweet_qa downloaded and prepared to /root/.cache/huggingface/datasets/tweet_qa/default/1.0.0/7d588f7f477946b10f60c035ca55175737315ac446102b015218af38d2638777. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['Question', 'Answer', 'Tweet', 'qid'],
        num_rows: 10692
    })
    test: Dataset({
        features: ['Question', 'Answer', 'Tweet', 'qid'],
        num_rows: 1979
    })
    validation: Dataset({
        features: ['Question', 'Answer', 'Tweet', 'qid'],
        num_rows: 1086
    })
})

In [5]:
# Show the dataset
raw_datasets["train"][0]

{'Question': 'at which school were first responders on the scene for?',
 'Answer': ['independence high school'],
 'Tweet': 'Our prayers are with the students, educators & families at Independence High School & all the first responders on the scene. #PatriotPride— Doug Ducey (@dougducey) February 12, 2016',
 'qid': '0c871b7e5320d0816d5b2979d67c2649'}

We can see that the dataset is structured quite similarly to the squad dataset, there is a Question, Answer and Tweet (context). The only issue is that there is no answer start index for us to make use of. This means that to make use of the extractive method we will need to manually extract this information from the dataset. It can be seen that there are several instances where there is either no answer provided or that it does not match the structure in the context, which will be an issue when finding the start and end index of the answer.

There are however ways around this, or we can disregard samples where we cannot find any start or end indexes. In addition, we will take a look after at using a generative method to answer the questions. 

### Extractive Model

We will need to figure out our issue with the start and end index. After this we can easily train a Q&A model in a similar way to the example provided.

I will make use of Python RegEx to match the case-insensitive answer in the tweet and extract the start and end index. I have decided to ignore case as the user generated content may not always follow typical grammatical rules.


In [6]:
import re
from datasets import Dataset

def find_question_start_end_indexes(dataset):
  """
  This function takes a dataset and creates a new one that has the start and end
  indexes of the answer inside of the context added to the item dictionary.
  In the case where there is no answer or it cannot be found, the item is dropped
  from the dataset."""

  # Create variables for storing the new dataset and count of samples we can't find an answer for.
  new_dataset = []
  dropped_samples = 0


  # For every entry in the dataset.
  for item in dataset:

    # Create variables for storing the start and end of the answer.
    start, end = None, None

    # Take the context.
    context = item["Tweet"]

    for answer in item["Answer"]:

      # The answer cannot start with a "*" or "?" as regex can't handle this.
      if answer[0] in ["*", "?"]:

        # If it does, take the rest of the answer only.
        answer = answer[1:]

      # If both of these aren't empty or of type None.
      if answer not in [None, ""] and context not in [None, ""]:
        
        # Search the context for the case insensitive answer.
        matched = re.search(answer, context, flags=re.IGNORECASE)

        if matched:
          # Extract the start and end of the found answer.
          start, end = matched.start(), matched.end()
          break
    
    # If we have found a match
    if start is not None and end is not None:
      item["start_position"] = start
      item["end_position"] = end
      new_dataset.append(item)
    
    else:
      # Otherwise, we won't be adding this sample to the new dataset.
      dropped_samples += 1
  
  return Dataset.from_list(new_dataset), dropped_samples

In [7]:
# Use the created function
train_dataset, dropped_count = find_question_start_end_indexes(raw_datasets["train"])

# Find the percentage of the dataset we have dropped.
percentage_dropped = (dropped_count / len(raw_datasets["train"])) * 100

# Output the number of samples and percent of the total we have dropped.
print(f"We have dropped {dropped_count} samples, losing {percentage_dropped}% of the training data.")

We have dropped 4044 samples, losing 37.82267115600449% of the training data.


In [8]:
# Use the created function
test_dataset, dropped_count = find_question_start_end_indexes(raw_datasets["test"])

# Find the percentage of the dataset we have dropped.
percentage_dropped = (dropped_count / len(raw_datasets["test"])) * 100

# Output the number of samples and percent of the total we have dropped.
print(f"We have dropped {dropped_count} samples, losing {percentage_dropped}% of the test data.")

We have dropped 1979 samples, losing 100.0% of the test data.


We can see that after removing samples that we cannot find a RegEx match of the answer in the context we have 63% of the training data remaining and none of the test data.

This means that our model will be training on a small section of data, but should hopefully still perform reasonably.

In [9]:
# Check the length of tokens
highest = 0

# For every item
for item in train_dataset:

  # The count of tokens is the length of the tweet and question.
  token_count = len(item["Tweet"]) + len(item["Question"]) + 3 # Start, seperator and end tokens.

  # If the current token count is greater than the highest count so far, update the highest count.
  if token_count > highest:
    highest = token_count

We can see that there should be no issue with input length. This is because the largest length of tokens should be 423. This is under the max tokens accepted by BERT which is 512, so we don't need to consider breaking these down into separate samples.

Now we can tokenize the data and perform the pre-processing necessary to convert the data to a useable type for the model.

In [10]:
from transformers import AutoTokenizer

# Create the auto tokenizer so we can tokenize the text for the model to process.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

Downloading (…)okenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

In [11]:
def preprocess_function(examples):
    """
    This function takes batches from the dataset and tokenizes the input 
    while also finding the index in the input for the answer in the context.
    """

    # Strip any whitespace from the text.
    questions = [q.strip() for q in examples["Question"]]

    # Generate the tokens for each word in the input. This goes up to a max length of 512.
    # The input ids are padded with 0's or truncated if the sequence is too long.
    # The offset mapping is returned so the start and end position of the word can be found in the sequence.
    inputs = tokenizer(
        questions,
        examples["Tweet"],
        max_length=512,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
        return_tensors="pt"
    )

    # Extract the offset mapping
    offset_mapping = inputs.pop("offset_mapping")

    # Extract the answers from the examples
    # answers = examples["Answer"]
    start_index = examples["start_position"]
    end_index = examples["end_position"]


    # Create a list for keeping track of the start and end positions.
    start_positions = []
    end_positions = []

    # For every sample (an individual list in the offset mapping)
    for i, offset in enumerate(offset_mapping):

        # Get the start and end of the word.
        start_char = start_index[i]
        end_char = end_index[i]

        # Get the ids of the sequences. Special characters added by the tokenizer
        # will be None, words from the first sequence (question) will have the id 0.
        # The words from the second sequence (context) will be 1.
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    # Add the start and end positions to the inputs.
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions

    return inputs

In [12]:
tokenize_tweet_qa = train_dataset.map(preprocess_function, batched=True, remove_columns=train_dataset.column_names)
tokenize_tweet_qa.set_format("pt", columns=["input_ids", "attention_mask", "token_type_ids"], output_all_columns=True)

Map:   0%|          | 0/6648 [00:00<?, ? examples/s]

In [13]:
split_dataset = tokenize_tweet_qa.train_test_split(test_size=0.2, shuffle=False)

In [14]:
split_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'],
        num_rows: 5318
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'],
        num_rows: 1330
    })
})

Our data is now ready for passing into our model. We just need to create several components for use during training such as a DataCollator and the TrainingArguments.

In [35]:
from transformers import DefaultDataCollator

# Create a DataCollator for used during training.
data_collator = DefaultDataCollator()

In [36]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

# Create the Auto Question Answering model using the bert-base-cased weights.
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-cased")

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForQuestionAnswering: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and a

In [37]:
# Define the training arguments.
training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/CA4023-BERT/extractive",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split_dataset["train"],
    eval_dataset=split_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()



Epoch,Training Loss,Validation Loss


KeyboardInterrupt: ignored

In [25]:
trainer.evaluate()

{'eval_loss': 1.2374424934387207,
 'eval_runtime': 45.165,
 'eval_samples_per_second': 29.448,
 'eval_steps_per_second': 1.86,
 'epoch': 3.0}

Now we can save the model so we can load it back in with ease if we wish to use it. This can be done using the save_pretrained method and the using the from_pretrained method to read it back in.

In [26]:
model.save_pretrained("/content/drive/MyDrive/CA4023-BERT/extractive/extractive_model")

In [15]:
from transformers import AutoModelForQuestionAnswering

# Create the Auto Question Answering model using the bert-base-cased weights.
model = AutoModelForQuestionAnswering.from_pretrained("/content/drive/MyDrive/CA4023-BERT/extractive/extractive_model")

In [16]:
from transformers import pipeline

# Ensure that the model is back on the cpu, as this is where our test question will be created.
model.to("cpu")

# We can make use of a pipeline to pass questions and context and get the answers returned.
question_answerer = pipeline("question-answering", model=model, tokenizer=tokenizer)

context = """
🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question = "Which deep learning libraries back 🤗 Transformers?"
question_answerer(question=question, context=context)

{'score': 0.24850651621818542,
 'start': 33,
 'end': 90,
 'answer': 'three most popular deep learning libraries — Jax, PyTorch'}

We can see that it works as expected.

The test samples did not contain any answers, so we can use the question and context and manually inspect the answer.



In [17]:
# Select the first entry in the raw test dataset.
sample = raw_datasets["test"][0]

print(sample["Tweet"])
print(sample["Question"])

The police dogs who patrol the area are some of my favorite faces to see around the Capitol. # Sen. Al Franken (@SenFranken) August 26, 2017
who are some of franken's favorite faces to see?


In [18]:
question_answerer(question=sample["Question"], context=sample["Tweet"])

{'score': 0.4365227520465851, 'start': 4, 'end': 15, 'answer': 'police dogs'}

We can see that the model works well and we can change the sample we are looking at to assess different question and context pairs. There are some questions in the set that make little sense to me and the answer seems to be outside the scope of the context, but the model seems to work well.

We can make use of the ROUGE metric, which is used to find the overlap between the predicted sequence and the target sequence. This may help us get some idea of how well our model performs.

We can make use of the test dataset and get the predicted answers and test the overlap with the ground truth.

In [None]:
%%capture
!pip install rouge_score

In [19]:
import torch
from datasets import load_metric

# Load the rouge metric
rouge = load_metric("rouge")

actual_answers = []
predicted_answers = []

model = model.to("cuda")

for data in split_dataset["test"]:

  # Add a dimension for the batches, in this case each is a batch of 1.
  # Also move the tensors to the GPU.
  data["input_ids"] = data["input_ids"][None, :].to("cuda")
  data["attention_mask"] = data["attention_mask"][None, :].to("cuda")
  data["token_type_ids"] = data["token_type_ids"].to("cuda")

  start_pos = data.pop('start_positions', None)

  end_pos = data.pop('end_positions', None)

  # Without calculating gradients, run inference on the input data.
  with torch.no_grad():
    outputs = model(**data)

  answer_start_index = outputs.start_logits.argmax()
  answer_end_index = outputs.end_logits.argmax()

  # Get the input tokens that correspond to the indexes.
  predict_answer_tokens = data["input_ids"][0, answer_start_index : answer_end_index + 1]
  answer_tokens = data["input_ids"][0, start_pos : end_pos + 1]

  # Decode the predictions and actual answers.
  actual_answers.append(tokenizer.decode(answer_tokens))
  predicted_answers.append(tokenizer.decode(predict_answer_tokens))

print(actual_answers)
print(predicted_answers)

# Compute the rouge metric for unigrams, so the presence of words in the actual and predicted is used.
rouge_output = rouge.compute(predictions=predicted_answers, references=actual_answers, rouge_types=["rouge1"])["rouge1"].mid
print({
    "rouge1_precision": round(rouge_output.precision, 4),
    "rouge1_recall": round(rouge_output.recall, 4),
    "rouge1_fmeasure": round(rouge_output.fmeasure, 4),
    })



['thank you', 'Jenny Jackson', '##2015', 'the truth', 'Chad Johnson', 'African Americans', 'Will Smith', 'a virtual street corner', '##2013', 'DVD', 'Taylor & Nicki', 'In a movie', '2014 season', 'Charlottesville', 'our employees', '8. 5km', 'sweet, hilarious and courageous', 'youth homelessness', 'white', 'the White House', 'sobbing girlfriend', 'today', 'Maracana Stadium', 'Maysak', 'The University of Houston', 'Christians', 'the F. B. I.', 'Middle - Eastern', 'HATCH', 'bodies out', 'Nigeria', 'a shirt featuring women in lingerie', 'Anderson Cooper', 'The Queen from', 'Michael Jackson', 'chills w', 'new Ministers', 'Meryl Streep', '@ CNN', 'Same thing basically', 'New Orleans Pelicans', 'realDonaldTrump', 'grossed out', 'Leonard Nimoy.', 'Carter Thicke', 'Dick Gregory & Jerry Lewis is', 'Late - November / late - December', 'the first', 'hippo', 'Eusébio', 'macklemore', '10 years', 'the bank', 'Danbury students', '##2015', 'flag burning', 'Paul Allen', 'Neymar', '##snow', 'Krissi', 'P

We can see from looking at the actual vs. predicted answers that the model works quite well. We often get majority of the answer, if not all. The score shows us that there is a lot of overlap between the predicted answers and the actual answers, but it does not account for answers that are technically correct but contain extra tokens.

Overall, this model has performed very well. It requires a specfic structure within the data, but provides a decent accuracy without too much fine-tuning.


### Casual Language Model

Now we will look at a simple generative approach. This involves using a text generator that has been trained on examples which contain the context, question and then the answer. In this way, the model will (hopefully) learn to predict the next words as the answer to the question in relation to the given context.

It is likely that this model will not perform amazing as this is a simple approach to the task of question answering. It is still worth exploring this option though.

In [None]:
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

# Load the BERT weights into a Casual Language Model architecture.
gen_model = AutoModelForCausalLM.from_pretrained("bert-base-cased")

If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertLMHeadModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertLMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertLMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


The benefit to use this approach is that we don't need to perform the same level of preprocessing. This is because the model can handle the answer not being explicitly stated in the text, assuming it has learned how to generate it given the context.

I suspect that the approach we took to extract the samples that had an answer in the context weeded out many of the poor samples with no answer, so we may still want to exclude these samples, or at least samples with no answer given.

In [None]:
from transformers import AutoTokenizer
from datasets import Dataset

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Create a function to tokenize the dataset
def tokenize_dataset(dataset, tokenizer, train=True):

  # Create variables to store the new dataset and the count of samples we have dropped.
  tokenized_dataset = []
  text_lengths = []
  dropped_samples = 0

  # For every item in the dataset.
  for item in dataset:

    # Pull the context, question and answer from the sample.
    question = item["Question"]
    context = item["Tweet"]

    if train:
      # Check if we have an answer
      if len(item["Answer"]) == 0:
        dropped_samples += 1
        continue
    
      # Just take the first answer as the most meaningful
      answer = item["Answer"][0]

      # The text we will use will be the context, question and then the answer.
      item["text"] = f"{context}\n{question}\n{answer}"
    
    else:
      # The text we will use will be the context and question only
      item["text"] = f"{context}\n{question}"

    # Let us track the length of these text inputs
    text_lengths.append(len(item["text"]))

    # Tokenize the text
    tokenized = tokenizer(item["text"])

    # Update the item
    item.update(tokenized)

    # Add the updated item to the new dataset.
    tokenized_dataset.append(item)

  # Create a new HuggingFace dataset from the list generated.
  dataset = Dataset.from_list(tokenized_dataset)

  # Remove the columns that are not relevant for training.
  dataset = dataset.remove_columns(["Question", "Answer", "Tweet", "qid"])
  return dataset, dropped_samples, text_lengths

# Preprocess the data for the train and test set.
tokenized_train, train_dropped_samples, train_text_lengths  = tokenize_dataset(raw_datasets['train'],tokenizer)
tokenized_test, test_dropped_samples, test_text_lengths = tokenize_dataset(raw_datasets['test'],tokenizer, False)

Now we can see how much data we have dropped in each due to an answer not being present, as well as consider what length the inputs will be and if the model can handle them.

In [None]:
# Find the percentage of the dataset we have dropped.
percentage_dropped = (train_dropped_samples / len(raw_datasets["train"])) * 100

# Output the number of samples and percent of the total we have dropped.
print(f"We have dropped {train_dropped_samples} samples, losing {percentage_dropped}% of the train data.")

We have dropped 0 samples, losing 0.0% of the train data.


In [None]:
# Find the percentage of the dataset we have dropped.
percentage_dropped = (test_dropped_samples / len(raw_datasets["test"])) * 100

# Output the number of samples and percent of the total we have dropped.
print(f"We have dropped {test_dropped_samples} samples, losing {percentage_dropped}% of the test data.")

We have dropped 0 samples, losing 0.0% of the test data.


In [None]:
print(max(max(train_text_lengths), max(test_text_lengths)))

437


We can see that we have lost no samples. The training data must have at least one answer for each of the samples, and we didn't care if the test data didn't as we can manually inspect this.

We can now split the training data into a train and test set for use during the training and evaluation.

In [None]:
split_dataset = tokenized_train.train_test_split(test_size=0.2, shuffle=False)

In [None]:
split_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 8553
    })
    test: Dataset({
        features: ['text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2139
    })
})

We have got over 8,500 samples in our training set and over 2,000 in our test. We are now ready to run the model.

In [None]:
from transformers import DataCollatorForLanguageModeling

# Create a data collator. This collator can be used to enable masked language modelling during training.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

In [None]:
training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/CA4023-BERT/simple_generative",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01
)

In [None]:
trainer = Trainer(
    model=gen_model,
    args=training_args,
    train_dataset=split_dataset["train"],
    eval_dataset=split_dataset["test"],
    data_collator=data_collator,
)

In [None]:
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss
1,0.0548,0.004777
2,0.0107,0.000919
3,0.0046,0.00049


TrainOutput(global_step=3210, training_loss=0.21895405650138855, metrics={'train_runtime': 643.0774, 'train_samples_per_second': 39.9, 'train_steps_per_second': 4.992, 'total_flos': 1013996987543904.0, 'train_loss': 0.21895405650138855, 'epoch': 3.0})

In [None]:
import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 1.00


In [None]:
gen_model.save_pretrained("/content/drive/MyDrive/CA4023-BERT/simple_generative/simple_generative_model")

In [None]:
gen_model = AutoModelForCausalLM.from_pretrained("/content/drive/MyDrive/CA4023-BERT/simple_generative/simple_generative_model")

If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`


In [None]:
question = raw_datasets['test'][0]["Question"]
context = raw_datasets['test'][0]["Tweet"]

prompt = f"{context}\n{question}"

In [None]:
gen_model.to("cpu")

In [None]:
gen_model.config.no_repeat_ngram_size = 2

In [None]:
# Make use of langauge generation to test if it will work
from transformers import pipeline

generator = pipeline("text-generation", model=gen_model, tokenizer=tokenizer)
generator(prompt, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)



[{'generated_text': "The police dogs who patrol the area are some of my favorite faces to see around the Capitol. # Sen. Al Franken (@SenFranken) August 26, 2017\nwho are some of franken's favorite faces to see? then then after then including her then she she then next next then first then before then but but then again then Then then maybe then thought Sawyer after later also hearing hearing read well I then later then like like then and then said that and and when and or andon and, and.. and ) and He and but and Then or or perhaps perhaps for for and certain certain and again, then briefly again again and repeated again Again again once once,, in, ), (, or,."}]

This concludes our basic implementation of a generative question answering model. We can see that the model has not learned much. I believe this is a mixture of the model not being tailored for this task as BERT is normally used as an encoder with additional layers on top to fine-tune the model to a task. In addition, we are not working with huge amounts of data, so we will not get amaing results.

Now we will look at a more advanced architecture for solving this open generative question answering task.

### EncoderDecoder Model (sequence-to-sequence)

The final model that we will be having a look at is en EncoderDecoder for converting a given sequence to another sequence. In our case, this will be a sequence of context and question to a sequence containing the answer. This will likely work much better than the simple generative approach as we are explicitly telling the model what output we want, where previously there was no way of us explicitly stating that we want the generated text to be based on the context.

We can start by loading in the model, which makes use of the weight from BERT for the encoder and the weights from GPT2 for the decoder. I have decided to take this approach as GPT2 is already pre-trained as an autoregressive model to generate text and using the BERT weights will require more tuning.

We could optionally tie the weights together for the encoder and decoder, allowing the model to have a smaller number of parameters.

In [20]:
from transformers import EncoderDecoderModel, BertTokenizer, GPT2Tokenizer

# Create the Bert tokenizer
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

# make sure GPT2 appends EOS in begin and end
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
    outputs = [self.bos_token_id] + token_ids_0 + [self.eos_token_id]
    return outputs

# Create the GPT2 tokenizer.
GPT2Tokenizer.build_inputs_with_special_tokens = build_inputs_with_special_tokens
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# set pad_token_id to unk_token_id -> be careful here as unk_token_id == eos_token_id == bos_token_id
gpt2_tokenizer.pad_token = gpt2_tokenizer.unk_token

# Create the EncoderDecoder model from the saved weights of both models.
model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-cased", "gpt2") #, tie_encoder_decoder=True)

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Downloading pytorch_model.bin:   0%|          | 0.00/548M [00:00<?, ?B/s]

Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.6.crossattention.q_attn.weight', 'h.4.crossattention.bias', 'h.2.crossattention.bias', 'h.2.ln_cross_attn.weight', 'h.0.crossattention.c_proj.bias', 'h.10.crossattention.c_proj.weight', 'h.4.crossattention.masked_bias', 'h.9.crossattention.q_attn.weight', 'h.5.crossattention.q_attn.weight', 'h.1.crossattention.q_attn.weight', 'h.7.ln_cross_attn.weight', 'h.10.crossattention.c_attn.weight', 'h.5.crossattention.c_attn.weight', 'h.11.crossattention.c_attn.weight', 'h.11.crossattention.c_proj.bias', 'h.7.crossattention.c_proj.weight', 'h.2.crossattention.masked_bias', 'h.10.ln_cross_attn.weight', 'h.4.crossattention.c_attn.weight', 'h.5.crossattention.bias', 'h.3.crossattention.masked_bias', 'h.1.crossattention.masked_bias', 'h.6.ln_cross_attn.weight', 'h.5.crossattention.c_proj.weight', 'h.0.crossattention.masked_bias', 'h.0.crossattention.c_attn.weight', 'h.6.crossattenti

Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Now we can set some config parameters of the model relating to the start token, end of sentence and pad token.

In [21]:
model.config.do_sample = True
model.config.temperature = 0.5
model.config.top_k = 150
model.config.top_p = 0.8
model.config.no_repeat_ngram_size = 3
model.config.num_beams = 5
model.config.max_length = 16
model.config.min_length = 1
model.config.early_stopping = True
model.config.decoder_start_token_id = gpt2_tokenizer.bos_token_id

model.config.decoder.do_sample = True
model.config.decoder.temperature = 0.5
model.config.decoder.top_k = 150
model.config.decoder.top_p = 0.8
model.config.decoder.no_repeat_ngram_size = 3
model.config.decoder.num_beams = 5
model.config.decoder.max_length = 16
model.config.decoder.min_length = 1
model.config.decoder.early_stopping = True
model.config.decoder.decoder_start_token_id = gpt2_tokenizer.bos_token_id

In [22]:
# set decoding params
model.config.decoder_start_token_id = gpt2_tokenizer.bos_token_id
model.config.eos_token_id = gpt2_tokenizer.eos_token_id
model.config.pad_token_id = gpt2_tokenizer.eos_token_id

Now we can set config parameters relating to beam search, which will hopefully improve the results. These parameters can be tuned and can be difficult to select without testing.

We will use the default for many of the parameters, such as top_k, top_p and temperature. The top_k and top_p parameters control how many items can get through up to k or the probability proportion p, then the probabilities are re-distributed between the remaining possible next tokes. Temperature is used to affect the distribution of probabilities, making the probabilities better distributed among the tokens.

In [23]:
model.config

EncoderDecoderConfig {
  "_commit_hash": null,
  "decoder": {
    "_name_or_path": "gpt2",
    "activation_function": "gelu_new",
    "add_cross_attention": true,
    "architectures": [
      "GPT2LMHeadModel"
    ],
    "attn_pdrop": 0.1,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": 50256,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": 50256,
    "diversity_penalty": 0.0,
    "do_sample": true,
    "early_stopping": true,
    "embd_pdrop": 0.1,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 50256,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_range": 0.02,
    "is_decoder": true,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layer_norm_epsilon": 1e

Now we can preprocess our data. The data should be tokenized for the input and desired output. This will be a string of the context and question for the input and the answer for the output. There are also attention masks for the encoder and decoder that we must pass through.

In [None]:
# Set the max length for the encoder and decoder
encoder_max_length=512
decoder_max_length=16

def process_data_to_model_inputs(batch):

  context = batch["Tweet"]
  question = batch["Question"]
  answer = batch["Answer"][0]

  # tokenize the inputs and labels
  inputs = bert_tokenizer(f"question:{question}  context:{context}", padding="max_length", truncation=True, max_length=encoder_max_length)
  outputs = gpt2_tokenizer(answer, padding="max_length", truncation=True, max_length=decoder_max_length)

  # Add all the needed values to the batch
  batch["input_ids"] = inputs.input_ids
  batch["attention_mask"] = inputs.attention_mask
  batch["decoder_input_ids"] = outputs.input_ids
  batch["decoder_attention_mask"] = outputs.attention_mask
  batch["labels"] = outputs.input_ids.copy()

  # because BERT automatically shifts the labels, the labels correspond exactly to `decoder_input_ids`. 
  # We have to make sure that the PAD token is ignored
  batch["labels"] = [-100 if token == 0 else token for token in batch["labels"]] # for labels in batch["labels"]]
  
  assert all([len(inputs.input_ids) == encoder_max_length])
  assert all([len(outputs.input_ids) == decoder_max_length])

  return batch


We can now use the map function to process our data through the function we defined.

In [None]:
train_data = raw_datasets["train"].map(
    process_data_to_model_inputs,
    num_proc=4,
    remove_columns=["Question", "Answer", "Tweet", "qid"]
)

Map (num_proc=4):   0%|          | 0/10692 [00:00<?, ? examples/s]

We can see that the data is now in the correct format. We can split the training set into a training and test as this is the only set that has corresponding answers.

In [None]:
train_data

Dataset({
    features: ['input_ids', 'attention_mask', 'decoder_input_ids', 'decoder_attention_mask', 'labels'],
    num_rows: 10692
})

In [None]:
split_dataset = train_data.train_test_split(test_size=0.2, shuffle=False)

We will be making use of a sequence-to-sequence trainer that is an adapted trainer for this specific task. These can be downloaded and imported as shown below.

In [None]:
%%capture
!rm seq2seq_trainer.py
!rm seq2seq_training_args.py
!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/legacy/seq2seq/seq2seq_trainer.py
!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/legacy/seq2seq/seq2seq_training_args.py

In [None]:
%%capture
!pip install git-python==1.0.3
!pip install rouge_score
!pip install sacrebleu

In [None]:
from transformers import Seq2SeqTrainer
from transformers import Seq2SeqTrainingArguments

We specify the training arguments below, we will be evaluating at the end of each epoch and using a batch size of 4. We have enabled mixed precision training, which allows float16 to be used in place of float32 when applicable, reducing the overhead during training. We will also be using a warmup to start the training off.

In [None]:
training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="epoch",
    num_train_epochs=5,
    learning_rate=0.00001,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    fp16=True, 
    output_dir="/content/drive/MyDrive/CA4023-BERT/generative",
    do_train=True,
    logging_steps=1000,
    save_steps=1000,
    eval_steps=1000,
    do_eval=True,
    overwrite_output_dir=True,
    warmup_steps=2000
)

Now we are ready to initialize our trainer and train our model.

In [None]:
tmp_train = split_dataset["train"].select(range(4))
tmp_test = split_dataset["test"].select(range(4))

In [None]:
# Initialize trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=split_dataset["train"],
    eval_dataset=split_dataset["test"]
)
trainer.train()

Epoch,Training Loss,Validation Loss,Rouge1 Precision,Rouge1 Recall,Rouge1 Fmeasure
1,0.0183,9.414216,0.0,0.0,0.0




['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',

KeyboardInterrupt: ignored

With our model trained, we can save it to our drive so we can load the model without the need to train it again.

In [None]:
model.save_pretrained("/content/drive/MyDrive/CA4023-BERT/generative/Encoder_Decoder_model")

In [24]:
loaded_model = EncoderDecoderModel.from_pretrained("/content/drive/MyDrive/CA4023-BERT/generative/Encoder_Decoder_model")

Now we can make use of the model to generate answers for the samples in our test set.

In [None]:
split_dataset["test"]

Dataset({
    features: ['input_ids', 'attention_mask', 'decoder_input_ids', 'decoder_attention_mask', 'labels'],
    num_rows: 2139
})

In [28]:
def generate_answer(batch, processed=False):
    # cut off at BERT max length 512
    
    if processed:
      input_ids = batch["input_ids"]#.to("cuda")
      attention_mask = batch["attention_mask"]#.to("cuda")

      label_ids = batch["labels"]
      label_ids[label_ids == -100] = gpt2_tokenizer.eos_token_id
      label_str = gpt2_tokenizer.batch_decode(label_ids, skip_special_tokens=True)
      batch["answer"] = label_str

    else:

      context = batch["Tweet"]
      question = batch["Question"]

      input_string = f"{context}\n{question}"

      inputs = bert_tokenizer(input_string, padding="max_length", truncation=True, max_length=512, return_tensors="pt")
      input_ids = inputs.input_ids#.to("cuda")
      attention_mask = inputs.attention_mask#.to("cuda")
  	
    outputs = loaded_model.generate(input_ids,attention_mask=attention_mask, do_sample=True, min_length=1)

    output_str = gpt2_tokenizer.batch_decode(outputs, skip_special_tokens=True)

    batch["predicted_answer"] = output_str

    return batch

In [29]:
batch_size = 16  # change to 64 for full evaluation

results = raw_datasets["train"].train_test_split(test_size=0.2, shuffle=False)["test"].select(range(1,2)).select(range(1)).map(generate_answer)

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

In [30]:
for item in results:
  print(item["Question"], item["Tweet"])
  print(item["predicted_answer"])

what can be a huge boost for a company's bottom line? Men& women make truly productive teams! "@Forbes: Gender diversity can be a huge boost for a company's bottom line: Ann Cairns (@Anncairns4) December 22, 2014
['']


Now we can manually review answers for the questions in our test set that contained now corresponding answers.

In [32]:
loaded_model.config.decoder.min_length = 1

In [33]:
no_answer_results =  raw_datasets["test"].select(range(5)).map(generate_answer)

Map:   0%|          | 0/5 [00:00<?, ? examples/s]

In [34]:
no_answer_results

Dataset({
    features: ['Question', 'Answer', 'Tweet', 'qid', 'predicted_answer'],
    num_rows: 5
})

In [35]:
for item in no_answer_results:
  print(item["Question"], item["Tweet"])
  print(item["predicted_answer"])

who are some of franken's favorite faces to see? The police dogs who patrol the area are some of my favorite faces to see around the Capitol. # Sen. Al Franken (@SenFranken) August 26, 2017
['']
who performed glen campbell's "wichita lineman?" .@LittleBigTown performs Glen Campbell's "Wichita Lineman" with Jimmy Webb, and can't get enough! CMA Country Music (@CountryMusic) November 9, 2017
['']
who did little big town perform "wichita lineman" with? .@LittleBigTown performs Glen Campbell's "Wichita Lineman" with Jimmy Webb, and can't get enough! CMA Country Music (@CountryMusic) November 9, 2017
['']
where did they come from #BillOreilly says slaves who built #WhiteHouse were well fed & had decent lodges. Bill-they were still slaves!— Carl Dix (@Carl_Dix) July 27, 2016
['']
what lessons to super hero movies teach Superhero movies are "oversaturated," and other lessons from @HamillHimself about what Affleck's Bateman can learn from Adam West Maeve McDermott (@maeve_mcdermott) June 11, 2

It seems that our model is having some issues. I have set the minimum generation length to 1, though this does not seem to have an effect.

I had issues creating this model as the tokens used for the different tokenizers were tough to sync up. The solution to this issue would be to pull the model apart and ensure that the input is being passed through correctly.

Additionally, we did not train for long or with much data. The individual parts, the BERT encoder and GPT2 decoder were not made for this task. They would need a decent amount of training to adapt to this task and the work well together. There are many newer models that work much better for this task, such as Google's FLAN-T5

### Conclusion

This concludes our fine-tuning of BERT to the task of question answering. There was some slight change in architecture in the EncoderDecoder model, though the encoder is still BERT with the autoregressive GPT2 added on.

The results weren't and it took a lot of playing around with to get the model to work. The model seems to be very sensitive to the learning rate and it may have been worth freezing some the weights and performing some sort of learning rate scheduling. It is unfortunate that the model has not worked as I hoped, but it would require some debugging and additional training.

The next level of question answering involves closed generative question answering. This means that there is no context and the model must return an answer based on the knowledge it already has. This means that the model must be much bigger to store this information in the weights and a bigger training set must be used.

The models that we have created are good, but there is better still. Google's FLAN-T5 is a new, open-source model that has great capabilities to carry out the task of generative question answering. 

Overall, this assignment has helped me to improve on my familiarity with HuggingFace and the transformers that are available through it.

### FLAN-T5 test

We will quickly load in the FLAN-T5 base model and pass it one of the questions that we have been working with. After, we will prompt it with a closed question, one which it is not given context to select an answer from.

In [50]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

inputs = tokenizer("Please answer the following question. who are some of franken's favorite faces to see? Context: The police dogs who patrol the area are some of my favorite faces to see around the Capitol. # Sen. Al Franken (@SenFranken) August 26, 2017", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

['The police dogs who patrol the area']


In [48]:
inputs = tokenizer("Please answer the following question. What is the capital of Ireland?", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

['dublin']


We can see that Google's FLAN-T5 works great out of the box. The model is bigger and has been trained on more data than BERT. Additionally, it is a sequence-to-sequence model which works well for generative question answering.