# Pre-Training and Transfer Learning with Hugging Face and OpenAI

**IMPORTANT**<br>
Enable **GPU acceleration** by going to *Runtime > Change Runtime Type*. Keep in mind that, on certain tiers, you're not guaranteed GPU access depending on usage history and current load.
<br><br>
Also, if you're running this in the cloud rather than on a local Jupyter server, the notebook will *time out* after a period of inactivity.
<br><br>
Refer to this link on how to run Colab notebooks locally on your machine to avoid this issue:<br>
https://research.google.com/colaboratory/local-runtimes.html

We'll explore pre-training and transfer learning using the **Transformers** library from [Hugging Face](https://huggingface.co/). **Transformers** is an API and toolkit to download pre-trained models and further train them as needed. <br>

We'll start with the **pipelines** module which abstracts a lot of operations such as tokenization, vectorization, inference, etc.<br>

With **Transformers pipelines**, we can just feed text input and get text output. And there are **pipelines** for common tasks including classification, NER, summarization, etc.<br>
https://huggingface.co/docs/transformers/index<br>
https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#pipelines

To get started, we'll need to install **Transformers**.

In [None]:
!pip install transformers
!pip install datasets

Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: pyarrow-hotfix, dill, multiprocess, datasets
Successfully installed datasets-2.16.1 dill-0.3.7 multiprocess-0.70.15 pyarrow-hotfix-0.6


In [None]:
import operator
import pandas as pd
import tensorflow as tf
import transformers

from datasets import load_dataset
from tensorflow import keras
from transformers import AutoTokenizer
from transformers import pipeline
from transformers import TFAutoModelForQuestionAnswering

## Getting up and running quickly with Hugging Face Pipelines

We'll use the **pipeline** (note the singular) abstraction which wraps all the other pipelines. Put simply, it'll be our interface to doing a bunch of NLP tasks.

Using the **pipeline** abstraction is easy. We can instantiate a pipeline with a particular task, and it'll automatically download a suitable tokenizer and model behind the scenes for us and take care of the input and output operations.<br>
https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline<br>



Here, we're retrieving a pipeline for text-classification.

In [None]:
classifier = pipeline("text-classification")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Note the warning message about how no model was supplied. When we instantiate a pipeline for a task without specifying a particular model to perform the task, **Transformers** uses a default model. This is good enough for prototyping but for production, we'll want to specify which model to use for the task since the default can change. We'll see how to do this further below.
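As a hedged sketch of what pinning could look like, the checkpoint name and revision from the warning above can be passed to `pipeline` explicitly. The helper name `load_pinned_classifier` is hypothetical, and the import is deferred so that nothing is downloaded until the helper is actually called.

```python
# Checkpoint name and revision copied from the warning printed above.
PINNED_CLASSIFIER = {
    "task": "text-classification",
    "model": "distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",
}


def load_pinned_classifier():
    """Hypothetical helper: build the pipeline from the pinned settings."""
    # Imported lazily; the first call needs network access to fetch the model.
    from transformers import pipeline

    return pipeline(**PINNED_CLASSIFIER)
```

Keeping the settings in one dictionary makes it easy to log or version exactly which model a result came from.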

We can use the pipeline immediately to classify some text. Tokenization, vectorization, etc. are taken care of behind the scenes.

In [None]:
classifier("Alice was excited to go the island but it didn't live up to the hype.")

[{'label': 'NEGATIVE', 'score': 0.9993934631347656}]

In [None]:
classifier("Bob doesn't do well in group situations but he said it wasn't bad.")

[{'label': 'POSITIVE', 'score': 0.9946909546852112}]

There's support for summarization...

In [None]:
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/1.80k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

In [None]:
text = """
Hans Niemann is launching a counterattack in his dispute with chess world
champion Magnus Carlsen, filing a federal lawsuit that accuses Carlsen of
maliciously colluding with others to defame the 19-year-old grandmaster and
ruin his career.

It's the latest move in a scandal that has injected unprecedented levels of
drama into the world of elite chess since early September, when Carlsen
suggested Niemann's upset victory over him at the Sinquefield Cup tournament
in St. Louis was the result of cheating.

Niemann wants a federal court in Missouri's eastern district to award him at
least $100 million in damages. Defendants in the lawsuit include Carlsen, his
company Play Magnus Group, the online platform Chess.com and its leader, Danny
Rensch, along with grandmaster Hikaru Nakamura.
"""

In [None]:
summarizer(text)

[{'summary_text': ' Chess grandmaster Hans Niemann files federal lawsuit against Magnus Carlsen . He accuses Carlsen of colluding with others to defame the 19-year-old grandmaster and ruin his career . Defendants in the lawsuit include Carlsen, his company Play Magnus Group, the online platform Chess.com and its leader .'}]

...and question answering (extractive in this example).

In [None]:
qa = pipeline("question-answering")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/473 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/261M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

In [None]:
context="""
Hugging Face was founded in 2016 by Clément Delangue, Julien Chaumond, and
Thomas Wolf originally as a company that developed a chatbot app targeted at
teenagers.[2] After open-sourcing the model behind the chatbot, the company
pivoted to focus on being a platform for democratizing machine learning. In March
2021, Hugging Face raised $40 million in a Series B funding round.
"""

question = "Who are the Hugging Face founders?"

qa(question=question, context=context)

{'score': 0.9919217228889465,
 'start': 37,
 'end': 87,
 'answer': 'Clément Delangue, Julien Chaumond, and\nThomas Wolf'}

Extractive question-answering models work fine for certain domains, document structures, and questions. But situations that require reasoning or more complex parsing, or that contain ambiguity, can trip them up. Note the low score on the next example.

In [None]:
question = "What does Hugging Face do?"
qa(question=question, context=context)

{'score': 0.08730420470237732,
 'start': 117,
 'end': 162,
 'answer': 'developed a chatbot app targeted at\nteenagers'}
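One way to act on that low score is to abstain when the model isn't confident. The sketch below works on the two result dictionaries shown above; the helper name `answer_or_abstain` and the 0.5 cutoff are illustrative assumptions, and a real threshold would need tuning on held-out data.

```python
def answer_or_abstain(result, min_score=0.5):
    """Return the answer text, or None when the score is below min_score."""
    return result["answer"] if result["score"] >= min_score else None


# Results copied from the two question-answering outputs above.
founders = {"score": 0.9919217228889465, "start": 37, "end": 87,
            "answer": "Clément Delangue, Julien Chaumond, and\nThomas Wolf"}
what_it_does = {"score": 0.08730420470237732, "start": 117, "end": 162,
                "answer": "developed a chatbot app targeted at\nteenagers"}

print(answer_or_abstain(founders))
print(answer_or_abstain(what_it_does))  # None
```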

There are ready-made pipelines for a number of tasks:<br>
https://huggingface.co/docs/transformers/main/en/quicktour#pipeline

Let's say we want a pipeline that uses a particular model. On the Hugging Face model hub, you'll find both pre-trained models (e.g. BERT) *and* pre-trained models that have been fine-tuned for all sorts of tasks (e.g. BERT for text classification). These models are contributed by Hugging Face, other companies, institutions, and individuals. You can (and are encouraged to) train or fine-tune a model and upload it for others to use.<br>
https://huggingface.co/models
<br><br>
For example, here's a collection of pre-trained models that have been tuned for text classification.<br>
https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads
<br><br>
This particular one is a pre-trained *RoBERTa-base* model that's been fine-tuned on Twitter data for sentiment analysis:<br>
https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment
<br><br>
Note how you can try the model directly on the model page.

Let's say we want to download and use a particular model. For example, this *BERT-base* model fine-tuned for NER:<br>
https://huggingface.co/dslim/bert-base-NER
<br>

We just need to pass the model path during pipeline instantiation.

In [None]:
ner = pipeline(model="dslim/bert-base-NER")

config.json:   0%|          | 0.00/829 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/433M [00:00<?, ?B/s]

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tokenizer_config.json:   0%|          | 0.00/59.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/2.00 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

In [None]:
text = "Panic ensues in Redmond as love child of Microsoft and OpenAI declares humanity obsolete."
ner(text)

[{'entity': 'B-PER',
  'score': 0.9993875,
  'index': 6,
  'word': 'Red',
  'start': 16,
  'end': 19},
 {'entity': 'I-PER',
  'score': 0.80496854,
  'index': 7,
  'word': '##mond',
  'start': 19,
  'end': 23},
 {'entity': 'B-ORG',
  'score': 0.9980654,
  'index': 12,
  'word': 'Microsoft',
  'start': 41,
  'end': 50},
 {'entity': 'B-ORG',
  'score': 0.9985505,
  'index': 14,
  'word': 'Open',
  'start': 55,
  'end': 59},
 {'entity': 'I-ORG',
  'score': 0.98842865,
  'index': 15,
  'word': '##A',
  'start': 59,
  'end': 60},
 {'entity': 'I-ORG',
  'score': 0.973982,
  'index': 16,
  'word': '##I',
  'start': 60,
  'end': 61}]
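The output above is per token, so *Redmond* and *OpenAI* arrive as word pieces (and the model mislabels *Redmond* as a person). As a minimal sketch, the pieces can be stitched back together using the B-/I- prefixes and the character offsets. The helper `merge_entities` is hypothetical; the pipeline also accepts an `aggregation_strategy` argument that does this kind of grouping for you.

```python
text = "Panic ensues in Redmond as love child of Microsoft and OpenAI declares humanity obsolete."

# Token-level predictions copied from the output above (scores and indices omitted).
tokens = [
    {"entity": "B-PER", "word": "Red", "start": 16, "end": 19},
    {"entity": "I-PER", "word": "##mond", "start": 19, "end": 23},
    {"entity": "B-ORG", "word": "Microsoft", "start": 41, "end": 50},
    {"entity": "B-ORG", "word": "Open", "start": 55, "end": 59},
    {"entity": "I-ORG", "word": "##A", "start": 59, "end": 60},
    {"entity": "I-ORG", "word": "##I", "start": 60, "end": 61},
]


def merge_entities(text, tokens):
    """Group B-/I- tagged tokens into (label, surface text) pairs."""
    spans = []  # each item is [label, start_char, end_char]
    for tok in tokens:
        prefix, label = tok["entity"].split("-", 1)
        if prefix == "I" and spans and spans[-1][0] == label:
            spans[-1][2] = tok["end"]  # extend the entity in progress
        else:
            spans.append([label, tok["start"], tok["end"]])  # start a new entity
    return [(label, text[start:end]) for label, start, end in spans]


print(merge_entities(text, tokens))
# [('PER', 'Redmond'), ('ORG', 'Microsoft'), ('ORG', 'OpenAI')]
```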

The **Transformers** library provides a bunch of helper classes to help with training models. And beyond the model hub, Hugging Face also hosts datasets, provides *spaces* where you can host your app, and offers a bunch of services such as cloud hardware and inference endpoints to help deploy your model.<br>
Datasets: https://huggingface.co/datasets<br>
Spaces: https://huggingface.co/spaces<br>

With Hugging Face, you can build an ML app prototype within minutes and iterate quickly from there.<br>
https://huggingface.co/docs<br>

Learn more about how to build with Hugging Face through their free course and fantastic book:<br>
https://huggingface.co/course<br>
https://www.oreilly.com/library/view/natural-language-processing/9781098136789/


## Fine-Tuning a Pre-Trained Model

Let's say the model hub doesn't have a model that exactly suits your purpose. Perhaps you work in a particular domain and need to fine-tune a model using your own dataset.<br>

In this section, we'll walk through how to download a pre-trained model and fine-tune it. Our example covers extractive question answering but it's the same idea with other tasks.

We'll fine-tune using a dataset from the **Datasets** hub.<br>
https://huggingface.co/datasets<br><br>
Hugging Face provides a **datasets** library to download and interact with the datasets. It's similar to the TensorFlow Dataset library we used in that it can hold data and provides a bunch of methods to preprocess that data.<br>
https://huggingface.co/docs/datasets/index

The **Datasets** hub holds a bunch of question answering datasets.<br>
https://huggingface.co/datasets?task_categories=task_categories:question-answering&sort=downloads<br>


They differ based on data source, domain, and level of challenge. Since we're in a constrained environment (Colab free tier) and just learning how to fine-tune, we'll use SQuAD, a famous dataset of crowd-sourced questions on a set of Wikipedia articles, where each answer is a span of text in the article.<br>
https://huggingface.co/datasets/squad


In [None]:
data = load_dataset("squad")

Downloading readme:   0%|          | 0.00/7.83k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.82M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

The **datasets** library downloads the data, which comes already split into train and validation sets. It returns a **DatasetDict**, a dictionary of **Dataset** objects:<br>
https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict<br>
https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset<br>


A **Dataset** object wraps an Apache Arrow table and provides a bunch of helper functions on top of it.<br>
https://arrow.apache.org/

In [None]:
data

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

Glancing at the data, we see every context (Wikipedia passage) is used multiple times, i.e., there are multiple questions and answers for each context.<br>

Every answer is a span of text from the context and the character position where the answer starts in the context is given.

In [None]:
pd.DataFrame(data['train'][0, 1, 2, 100, 101, 102],
             columns=["context", "question", "answers"])

| | context | question | answers |
|---|---|---|---|
| 0 | Architecturally, the school has a Catholic cha... | To whom did the Virgin Mary allegedly appear i... | {'text': ['Saint Bernadette Soubirous'], 'answ... |
| 1 | Architecturally, the school has a Catholic cha... | What is in front of the Notre Dame Main Building? | {'text': ['a copper statue of Christ'], 'answe... |
| 2 | Architecturally, the school has a Catholic cha... | The Basilica of the Sacred heart at Notre Dame... | {'text': ['the Main Building'], 'answer_start'... |
| 3 | One of the main driving forces in the growth o... | In what year did the team lead by Knute Rockne... | {'text': ['1925'], 'answer_start': [354]} |
| 4 | One of the main driving forces in the growth o... | How many years was Knute Rockne head coach at ... | {'text': ['13'], 'answer_start': [251]} |
| 5 | One of the main driving forces in the growth o... | How many national titles were won when Knute R... | {'text': ['three'], 'answer_start': [274]} |
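Each *answers* entry stores parallel lists of answer texts and starting character positions. A small sketch of turning one entry into a character span follows; the helper name `answer_char_span` is hypothetical, and the sample record is copied from the rows above.

```python
def answer_char_span(answers, which=0):
    """Return (start_char, end_char) for one answer; end_char is exclusive."""
    start = answers["answer_start"][which]
    return start, start + len(answers["text"][which])


# 'answers' value copied from one of the rows above.
answers = {"text": ["1925"], "answer_start": [354]}
print(answer_char_span(answers))  # (354, 358)
```

Slicing the row's context with this span, `context[start:end]`, should give back the answer text.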


Here's what we need to do:
1. Choose a pre-trained model based on what we want to accomplish and our constraints.
2. Download the appropriate tokenizer for the pre-trained model.
3. Tokenize and vectorize our dataset.
4. Mark where each answer starts and ends in our vectorized dataset.
5. Download the pre-trained model.
6. Fine-tune the pre-trained model with the vectorized dataset.

Given the free tier of Colab doesn't have a lot of GPU memory and that we're just trying to fine-tune a simple, extractive question answering model, we'll use *distilroberta-base*.<br>
https://huggingface.co/distilroberta-base
<br><br>
Recall from the slides that *DistilBERT* was created using a technique called *knowledge distillation*. The result is a model that performs almost as well as BERT but is 40% smaller and 60% faster.<br>
DistilBERT paper: https://arxiv.org/abs/1910.01108<br>
https://en.wikipedia.org/wiki/Knowledge_distillation
<br><br>
*distilroberta-base* was created by applying knowledge distillation to *RoBERTa-base*, a more powerful model than BERT.<br>
RoBERTa paper: https://arxiv.org/abs/1907.11692<br><br>
The **Transformers** library provides a set of Auto Classes that can automatically retrieve configurations, tokenizers, and models based on a path or a name. We'll use the **AutoTokenizer** class to get the right tokenizer for *distilroberta-base*.<br>
https://huggingface.co/docs/transformers/main/en/model_doc/auto<br>
https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoTokenizer



In [None]:
model_name = 'distilroberta-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)



config.json:   0%|          | 0.00/480 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Calling *encode* converts a string to a sequence of integer token ids.<br>
https://huggingface.co/docs/transformers/main/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.encode

In [None]:
t = "Where can I find a pizzeria?"
print(tokenizer.encode(t))

[0, 13841, 64, 38, 465, 10, 26432, 6971, 116, 2]


*encode* returns only the ids. To get everything the model needs, we call the tokenizer object directly (i.e. using *\_\_call\_\_*).<br>
https://huggingface.co/docs/transformers/main/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.__call__<br>

This returns a sequence of ids and an attention mask in a **BatchEncoding** object:<br>
https://huggingface.co/docs/transformers/main/en/glossary#input-ids<br>
https://huggingface.co/docs/transformers/main/en/main_classes/tokenizer#transformers.BatchEncoding<br>

Since there's no padding on this sample string, the mask is all 1s.

In [None]:
encoded_t = tokenizer(t)
print(encoded_t)

{'input_ids': [0, 13841, 64, 38, 465, 10, 26432, 6971, 116, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
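To see when the mask would contain 0s, here is a plain-Python sketch of padding a batch to a common length. It only mimics what the tokenizer does when padding is enabled. The helper `pad_batch` is hypothetical, the second sequence is made up from ids seen above, and the pad id of 1 is taken from a padded output later in this notebook.

```python
PAD_ID = 1  # assumption: <pad> id for this tokenizer, as seen in a later padded output


def pad_batch(batch, pad_id=PAD_ID):
    """Right-pad every id list to the longest one and build matching masks."""
    width = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [
    [0, 13841, 64, 38, 465, 10, 26432, 6971, 116, 2],  # ids printed above
    [0, 13841, 116, 2],                                # shorter, made-up sequence
]
padded = pad_batch(batch)
print(padded["attention_mask"][1])  # [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
```

The 0s tell the model to ignore the padded positions.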


We can convert the ids back to tokens using *convert_ids_to_tokens*.<br>
https://huggingface.co/docs/transformers/main/en/main_classes/tokenizer#transformers.PreTrainedTokenizer.convert_ids_to_tokens<br>

Note how the tokenizer added a start of sequence token (\<s\>), end of sequence token (\</s\>), and how it uses Ġ to signal a word has preceding whitespace. Keep in mind that what you're seeing here is the output from the *distilroberta-base* tokenizer. Other tokenizers may work differently.

In [None]:
print(tokenizer.convert_ids_to_tokens(encoded_t['input_ids']))

['<s>', 'Where', 'Ġcan', 'ĠI', 'Ġfind', 'Ġa', 'Ġpizz', 'eria', '?', '</s>']
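To make the role of Ġ concrete, here is a simplified sketch that rebuilds the string from the tokens above. The helper `tokens_to_text` is hypothetical and only handles plain ASCII text like this example. In practice the tokenizer's own decoding methods should be used.

```python
SPECIAL_TOKENS = {"<s>", "</s>", "<pad>"}


def tokens_to_text(tokens):
    """Drop special tokens, join the pieces, and turn each Ġ marker into a space."""
    pieces = [tok for tok in tokens if tok not in SPECIAL_TOKENS]
    return "".join(pieces).replace("Ġ", " ")


# Tokens copied from the output above.
tokens = ['<s>', 'Where', 'Ġcan', 'ĠI', 'Ġfind', 'Ġa', 'Ġpizz', 'eria', '?', '</s>']
print(tokens_to_text(tokens))  # Where can I find a pizzeria?
```

Pieces without a Ġ prefix (*eria*, *?*) attach directly to the previous piece, which is how *pizzeria* is reassembled.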


As we covered in the slides, for question answering, we need to encode the question and context as a pair. In our case, we can do that by passing both strings to the tokenizer as separate arguments.

In [None]:
encoded_pair = tokenizer("this is a question", "this is the context")
print(encoded_pair)

{'input_ids': [0, 9226, 16, 10, 864, 2, 2, 9226, 16, 5, 5377, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


The *distilroberta-base* tokenizer uses a double \</s\>\</s\> as a separator.

In [None]:
print(tokenizer.convert_ids_to_tokens(encoded_pair['input_ids']))

['<s>', 'this', 'Ġis', 'Ġa', 'Ġquestion', '</s>', '</s>', 'this', 'Ġis', 'Ġthe', 'Ġcontext', '</s>']
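The layout can be reproduced by hand. The sketch below assembles the same id sequence from its parts; the helper `build_pair` is hypothetical, and the special-token ids (0 for \<s\>, 2 for \</s\>) are read off the outputs above.

```python
BOS_ID, EOS_ID = 0, 2  # <s> and </s>, as seen in the encoded outputs above


def build_pair(question_ids, context_ids):
    """Lay out a pair as <s> question </s></s> context </s>."""
    return [BOS_ID] + question_ids + [EOS_ID, EOS_ID] + context_ids + [EOS_ID]


question_ids = [9226, 16, 10, 864]   # "this is a question"
context_ids = [9226, 16, 5, 5377]    # "this is the context"
print(build_pair(question_ids, context_ids))
# [0, 9226, 16, 10, 864, 2, 2, 9226, 16, 5, 5377, 2]
```

A pair therefore costs four special tokens on top of the question and context tokens, which matters once a maximum length is set.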


**Side note**:<br>
Most of the tokenizers in the **Transformers** library come in two versions: a Python implementation and a faster Rust implementation. When available, **AutoTokenizer** will load the fast version.<br>
https://huggingface.co/docs/transformers/main_classes/tokenizer<br>

We can check whether we have a fast tokenizer.


In [None]:
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

Suppose we tokenize this question/context pair...

In [None]:
context = "Sarah went to The Mirthless Cafe last night to meet her friend."
question = "Where did Sarah go?"

# The answer span and the answer's starting character position in the context.
answer = "The Mirthless Cafe"
answer_start = 14

In [None]:
x = tokenizer(question, context)
x

{'input_ids': [0, 13841, 222, 4143, 213, 116, 2, 2, 33671, 439, 7, 20, 256, 24208, 1672, 16542, 94, 363, 7, 972, 69, 1441, 4, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Note how the word *Mirthless* gets tokenized into subwords. For legibility, we're using *batch_decode* to convert the input_ids to strings.<br>
https://huggingface.co/docs/transformers/v4.23.1/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.batch_decode

In [None]:
tokenizer.batch_decode(x['input_ids'])

['<s>',
 'Where',
 ' did',
 ' Sarah',
 ' go',
 '?',
 '</s>',
 '</s>',
 'Sarah',
 ' went',
 ' to',
 ' The',
 ' M',
 'irth',
 'less',
 ' Cafe',
 ' last',
 ' night',
 ' to',
 ' meet',
 ' her',
 ' friend',
 '.',
 '</s>']

When we tokenize our dataset, there will probably be question/context pairs which exceed our model's maximum sequence length. In *RoBERTa*'s case, that's 512. Limited GPU memory may force us to reduce the maximum sequence length of our input even further.<br>

Let's say the maximum sequence length we can handle is 15, so we truncate the context.

In [None]:
example_max_length = 15
x = tokenizer(question, context, max_length=example_max_length,
              truncation="only_second")
x

{'input_ids': [0, 13841, 222, 4143, 213, 116, 2, 2, 33671, 439, 7, 20, 256, 24208, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

The problem here is that the answer span gets chopped off by truncation. In other situations, the answer may not be included at all.

In [None]:
tokenizer.batch_decode(x['input_ids'])

['<s>',
 'Where',
 ' did',
 ' Sarah',
 ' go',
 '?',
 '</s>',
 '</s>',
 'Sarah',
 ' went',
 ' to',
 ' The',
 ' M',
 'irth',
 '</s>']

To ensure we tokenize all context tokens while respecting a maximum length, we can set *return_overflowing_tokens* to **True**. The end effect is to split the input into multiple question/context sequences, with each context sequence being a continuation of the previous one. Since the last one may be shorter than the maximum length, we also set *padding* to "max_length" so every sequence is padded on the right to the same length.<br>

What we get back are multiple *input_id* sequences.

In [None]:
x = tokenizer(question, context, max_length=example_max_length,
              truncation="only_second", return_overflowing_tokens=True,
              padding="max_length")
x

{'input_ids': [[0, 13841, 222, 4143, 213, 116, 2, 2, 33671, 439, 7, 20, 256, 24208, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 1672, 16542, 94, 363, 7, 972, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 69, 1441, 4, 2, 1, 1, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]], 'overflow_to_sample_mapping': [0, 0, 0]}

In [None]:
len(x['input_ids'])

3

Looking at the decoded sequences, we see the entire context is included across three sequences (along with padding on the last one).<br>

In [None]:
tokenizer.batch_decode(x['input_ids'])

['<s>Where did Sarah go?</s></s>Sarah went to The Mirth</s>',
 '<s>Where did Sarah go?</s></s>less Cafe last night to meet</s>',
 '<s>Where did Sarah go?</s></s> her friend.</s><pad><pad><pad>']

Note a few things from the encoded object *x*:
- The last *attention_mask* sequence has 0s to signify padding.
- The *overflow_to_sample_mapping* array tells us which question/context pair each *input_ids* sequence comes from. In our example, we tokenized a single question/context pair which resulted in three *input_ids* sequences, so *overflow_to_sample_mapping* is three 0s.<br>

If we tokenize two question/context pairs, we'll see the *overflow_to_sample_mapping* reflect that.

In [None]:
tokenizer(['question 1', 'question 2'],
          ['context 1', 'context 2'],
          return_overflowing_tokens=True)

{'input_ids': [[0, 40018, 112, 2, 2, 46796, 112, 2], [0, 40018, 132, 2, 2, 46796, 132, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]], 'overflow_to_sample_mapping': [0, 1]}
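The splitting and the sample mapping can be imitated in plain Python. This sketch uses the token ids from the outputs above, no overlap between windows, and a hypothetical helper `chunk_pairs`. It's meant to show the bookkeeping, not to replace the tokenizer.

```python
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2
NUM_SPECIAL = 4  # <s>, </s></s> separator, and the closing </s>


def chunk_pairs(pairs, max_length):
    """Split each (question_ids, context_ids) pair into padded windows."""
    input_ids, sample_mapping = [], []
    for sample_idx, (question, context) in enumerate(pairs):
        room = max_length - len(question) - NUM_SPECIAL  # context tokens per window
        if room <= 0:
            raise ValueError("max_length leaves no room for context tokens")
        for start in range(0, len(context), room):
            window = ([BOS_ID] + question + [EOS_ID, EOS_ID]
                      + context[start:start + room] + [EOS_ID])
            window += [PAD_ID] * (max_length - len(window))
            input_ids.append(window)
            sample_mapping.append(sample_idx)  # which pair this window came from
    return input_ids, sample_mapping


sarah = ([13841, 222, 4143, 213, 116],
         [33671, 439, 7, 20, 256, 24208, 1672, 16542, 94, 363, 7, 972, 69, 1441, 4])
toy = ([40018, 112], [46796, 112])  # "question 1" / "context 1"

ids, mapping = chunk_pairs([sarah, toy], max_length=15)
print(mapping)  # [0, 0, 0, 1]
```

The first pair needs three windows and the second fits in one, so the mapping has three 0s followed by a single 1.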

But there's still a problem here in that none of the sequences contain the full answer ("The Mirthless Cafe"). Right now, the correct full answer is split across sequences.<br>

To counter this, we can tokenize our question/context pair into overlapping sequences by setting a *stride* length. We did something similar when we prepared the dataset for our [character-level language model](https://colab.research.google.com/github/futuremojo/nlp-demystified/blob/main/notebooks/nlpdemystified_recurrent_neural_networks.ipynb#scrollTo=X1c-ihOByy88).

In [None]:
stride = 5
x = tokenizer(question, context, max_length=example_max_length,
              truncation="only_second", return_overflowing_tokens=True,
              stride=stride, padding="max_length")

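With a stride of 5, consecutive context sequences overlap by 5 tokens. Only 6 context tokens fit in each sequence here (15 minus the 5 question tokens and 4 special tokens), so each sequence advances by just one token.<br>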

This way, two of our tokenized sequences now contain the full answer.


In [None]:
tokenizer.batch_decode(x['input_ids'])

['<s>Where did Sarah go?</s></s>Sarah went to The Mirth</s>',
 '<s>Where did Sarah go?</s></s> went to The Mirthless</s>',
 '<s>Where did Sarah go?</s></s> to The Mirthless Cafe</s>',
 '<s>Where did Sarah go?</s></s> The Mirthless Cafe last</s>',
 '<s>Where did Sarah go?</s></s> Mirthless Cafe last night</s>',
 '<s>Where did Sarah go?</s></s>irthless Cafe last night to</s>',
 '<s>Where did Sarah go?</s></s>less Cafe last night to meet</s>',
 '<s>Where did Sarah go?</s></s> Cafe last night to meet her</s>',
 '<s>Where did Sarah go?</s></s> last night to meet her friend</s>',
 '<s>Where did Sarah go?</s></s> night to meet her friend.</s>']
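The windowing arithmetic can be checked with a short sketch. It slides a 6-token window over the context tokens shown earlier, with the step set to the window size minus the stride. The helpers `sliding_windows` and `contains` are hypothetical stand-ins for what the tokenizer does internally.

```python
def sliding_windows(tokens, window, stride):
    """Yield windows of `window` tokens where neighbors share `stride` tokens."""
    step = window - stride
    if step <= 0:
        raise ValueError("stride must be smaller than the window size")
    windows, start = [], 0
    while True:
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):  # this window reached the end
            break
        start += step
    return windows


def contains(window, span):
    """True if `span` appears as a contiguous run inside `window`."""
    n = len(span)
    return any(window[i:i + n] == span for i in range(len(window) - n + 1))


context_tokens = ['Sarah', ' went', ' to', ' The', ' M', 'irth', 'less', ' Cafe',
                  ' last', ' night', ' to', ' meet', ' her', ' friend', '.']
answer_tokens = [' The', ' M', 'irth', 'less', ' Cafe']

windows = sliding_windows(context_tokens, window=6, stride=5)
print(len(windows))                                     # 10
print(sum(contains(w, answer_tokens) for w in windows)) # 2
```

With `stride=0` the same function gives the three non-overlapping windows from before, none of which holds the whole answer.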

We now have a way to tokenize our question/context pairs.<br>

Our tokenizer returned this **BatchEncoding** object:

In [None]:
print(x.keys(), '\n')
x

dict_keys(['input_ids', 'attention_mask', 'overflow_to_sample_mapping']) 



{'input_ids': [[0, 13841, 222, 4143, 213, 116, 2, 2, 33671, 439, 7, 20, 256, 24208, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 439, 7, 20, 256, 24208, 1672, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 7, 20, 256, 24208, 1672, 16542, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 20, 256, 24208, 1672, 16542, 94, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 256, 24208, 1672, 16542, 94, 363, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 24208, 1672, 16542, 94, 363, 7, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 1672, 16542, 94, 363, 7, 972, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 16542, 94, 363, 7, 972, 69, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 94, 363, 7, 972, 69, 1441, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 363, 7, 972, 69, 1441, 4, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 

To fine-tune a model for question answering, our pre-trained *distilroberta-base* model expects this object to contain two more pieces of information:
- *start_positions*: the token positions where answers begin.
- *end_positions*: the token positions where answers end.<br>

https://huggingface.co/docs/transformers/main/en/model_doc/roberta#transformers.RobertaForQuestionAnswering.forward

All we have in our example (and the SQuAD dataset) is the position of the starting <u>character</u> of the answer.

In [None]:
print(answer_start)
print(context[answer_start:answer_start+len(answer)])

14
The Mirthless Cafe


We need to use this to locate the <u>token</u> positions where each answer starts and ends in every *input_ids* sequence. In some cases, the complete answer may not be in a particular sequence. We need to handle those cases as well.<br>

To do this, we'll get more information by setting *return_offsets_mapping* to **True** in the tokenizer.

In [None]:
x = tokenizer(question, context, max_length=example_max_length,
              truncation="only_second", return_overflowing_tokens=True,
              stride=stride, return_offsets_mapping=True,
              padding="max_length")
x

{'input_ids': [[0, 13841, 222, 4143, 213, 116, 2, 2, 33671, 439, 7, 20, 256, 24208, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 439, 7, 20, 256, 24208, 1672, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 7, 20, 256, 24208, 1672, 16542, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 20, 256, 24208, 1672, 16542, 94, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 256, 24208, 1672, 16542, 94, 363, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 24208, 1672, 16542, 94, 363, 7, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 1672, 16542, 94, 363, 7, 972, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 16542, 94, 363, 7, 972, 69, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 94, 363, 7, 972, 69, 1441, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 363, 7, 972, 69, 1441, 4, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 

This results in *offset_mapping* sequences, one for each *input_ids* sequence.

In [None]:
print(len(x['input_ids']))
print(len(x['offset_mapping']))

10
10


Each entry in an *offset_mapping* tells us the starting and ending character position of each token in the original string. An offset mapping of (0,0) represents a special token (e.g. \<s\>).<br>

For example, here's the first *input_ids* sequence along with its respective *offset_mapping*.

In [None]:
print(x['input_ids'][0])
print(x['offset_mapping'][0])

[0, 13841, 222, 4143, 213, 116, 2, 2, 33671, 439, 7, 20, 256, 24208, 2]
[(0, 0), (0, 5), (6, 9), (10, 15), (16, 18), (18, 19), (0, 0), (0, 0), (0, 5), (6, 10), (11, 13), (14, 17), (18, 19), (19, 23), (0, 0)]


If we convert the first non-special input id to a token, and use the first non-special offset_mapping to extract a span from the question string, we get a match.

In [None]:
print("First non-special input_id converted to token:")
print(tokenizer.convert_ids_to_tokens(x['input_ids'][0][1]), "\n")

offset = x['offset_mapping'][0][1]
print(f"Span extracted from question using corresponding offset_mapping {offset}:")
print(question[offset[0]:offset[1]])

First non-special input_id converted to token:
Where 

Span extracted from question using corresponding offset_mapping (0, 5):
Where
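
To make the meaning of these offsets concrete, here is a minimal sketch that rebuilds them by hand. It assumes the tokens are available as plain strings in which a leading space marks a word boundary (as in the decoded output), and *compute_offsets* is a hypothetical helper written for illustration, not part of **Transformers**. A real fast tokenizer records offsets while it tokenizes rather than searching afterwards.

```python
def compute_offsets(text, tokens):
    """Return a (start, end) character span for each token, scanning left to right."""
    offsets = []
    cursor = 0
    for tok in tokens:
        piece = tok.strip()               # drop the leading-space marker
        start = text.index(piece, cursor) # first match at or after the cursor
        end = start + len(piece)
        offsets.append((start, end))
        cursor = end
    return offsets


question_text = "Where did Sarah go?"
question_tokens = ["Where", " did", " Sarah", " go", "?"]
question_offsets = compute_offsets(question_text, question_tokens)
print(question_offsets)
# [(0, 5), (6, 9), (10, 15), (16, 18), (18, 19)]

# Slicing the text with an offset recovers the token's surface form.
start, end = question_offsets[2]
print(question_text[start:end])
# Sarah
```

The spans agree with the question part of the *offset_mapping* printed above.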


Since we know the character position of where the answer starts, we can use that and *offset_mapping* to get the start and end token positions of the answer span.<br>

The only remaining issue is identifying whether an offset is for a question or a context. Looking at the first two *offset_mappings*, note that:<br>
1. In the first sequence, both the question and context *offset_mappings* start from zero.
2. In the second sequence, the context *offset_mapping* values carry on from the previous sequence (after accounting for the stride).

In [None]:
print(x['offset_mapping'][0])
print(x['offset_mapping'][1])

[(0, 0), (0, 5), (6, 9), (10, 15), (16, 18), (18, 19), (0, 0), (0, 0), (0, 5), (6, 10), (11, 13), (14, 17), (18, 19), (19, 23), (0, 0)]
[(0, 0), (0, 5), (6, 9), (10, 15), (16, 18), (18, 19), (0, 0), (0, 0), (6, 10), (11, 13), (14, 17), (18, 19), (19, 23), (23, 27), (0, 0)]
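
The overlap between consecutive sequences can be reproduced with a plain sliding window. The sketch below is an illustration only: *sliding_windows* is a hypothetical helper, and the window size of 6 context tokens with an overlap of 5 is inferred from the output above rather than read from the tokenizer configuration.

```python
def sliding_windows(tokens, window, overlap):
    """Split tokens into windows of at most `window` items that share `overlap` items."""
    step = window - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than window")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the context
        start += step
    return windows


context_tokens = ["Sarah", " went", " to", " The", " M", "irth", "less", " Cafe",
                  " last", " night", " to", " meet", " her", " friend", "."]
windows = sliding_windows(context_tokens, window=6, overlap=5)

print(len(windows))
# 10
print("".join(windows[0]))
# Sarah went to The Mirth
print("".join(windows[1]))
#  went to The Mirthless
```

Each window keeps the character offsets of the tokens it contains, which is why the second sequence starts at (6, 10) instead of (0, 5).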


This means we need to identify
1. which *offset_mappings* belong to a context.
2. whether a particular sequence contains the answer at all.<br>

The first can be done using the *sequence_ids* method on the encoding object. Each *input_ids* sequence has a corresponding *sequence_ids* list which tells us whether a token is part of a question, part of a context, or a special token.<br>
https://huggingface.co/docs/transformers/main/en/main_classes/tokenizer#transformers.BatchEncoding.sequence_ids

In [None]:
print(x['input_ids'][0])
print(x.sequence_ids(0))

[0, 13841, 222, 4143, 213, 116, 2, 2, 33671, 439, 7, 20, 256, 24208, 2]
[None, 0, 0, 0, 0, 0, None, None, 1, 1, 1, 1, 1, 1, None]


So to identify whether a token is part of a context, we can use *sequence_ids* to check whether a token position maps to 1.

For the second issue, we can check whether the answer start and end character positions are within the lowest and highest offset mapping values respectively.

In [None]:
# We can calculate the answer end character position using the answer length.
answer_end = answer_start + len(answer)

print("Answer start character position:", answer_start)
print("Answer end character position:", answer_end)
print("Answer pulled from context:", context[answer_start:answer_end])

Answer start character position: 14
Answer end character position: 32
Answer pulled from context: The Mirthless Cafe


Let's find the start and end token positions from our collection of sequences. The full answer is not in the first sequence, but is in the third sequence. So let's experiment with those.

In [None]:
tokenizer.batch_decode(x['input_ids'])

['<s>Where did Sarah go?</s></s>Sarah went to The Mirth</s>',
 '<s>Where did Sarah go?</s></s> went to The Mirthless</s>',
 '<s>Where did Sarah go?</s></s> to The Mirthless Cafe</s>',
 '<s>Where did Sarah go?</s></s> The Mirthless Cafe last</s>',
 '<s>Where did Sarah go?</s></s> Mirthless Cafe last night</s>',
 '<s>Where did Sarah go?</s></s>irthless Cafe last night to</s>',
 '<s>Where did Sarah go?</s></s>less Cafe last night to meet</s>',
 '<s>Where did Sarah go?</s></s> Cafe last night to meet her</s>',
 '<s>Where did Sarah go?</s></s> last night to meet her friend</s>',
 '<s>Where did Sarah go?</s></s> night to meet her friend.</s>']

First, get all the information we need for the first sequence.

In [None]:
input_ids = x['input_ids'][0]
offset_mapping = x['offset_mapping'][0]
seq_ids = x.sequence_ids(0)

Determine where the context tokens start and end in the sequence.

In [None]:
# These are the sequence ids
print("Sequence IDs: ", seq_ids)

Sequence IDs:  [None, 0, 0, 0, 0, 0, None, None, 1, 1, 1, 1, 1, 1, None]


In [None]:
# Get the start index position (i.e. the first occurrence of 1).
context_pos_start = seq_ids.index(1)

In [None]:
import operator

# Utility function to find the index of the *last* occurrence of a value in a list.
def rindex(lst, value):
    return len(lst) - operator.indexOf(reversed(lst), value) - 1

# Get the end index position (i.e. the last occurrence of 1).
context_pos_end = rindex(seq_ids, 1)

In [None]:
print("Context tokens begin at position", context_pos_start)
print("Context tokens end at position", context_pos_end)

Context tokens begin at position 8
Context tokens end at position 13


Now that we know which tokens are part of the context, we can look at their corresponding offset mappings to check whether the start and end character positions are within the offsets.

In [None]:
# These are the corresponding offsets.
context_offsets = offset_mapping[context_pos_start:context_pos_end+1]
print(context_offsets)

[(0, 5), (6, 10), (11, 13), (14, 17), (18, 19), (19, 23)]


In [None]:
print("Is the lowest offset value lower than or equal to the starting character position?")
print("Answer starting character position:", answer_start)
print("First offset:", context_offsets[0])

# Note how we're checking the first tuple value.
print(context_offsets[0][0] <= answer_start)

Is the lowest offset value lower than or equal to the starting character position?
Answer starting character position: 14
First offset: (0, 5)
True


In [None]:
print("Is the highest offset value higher than or equal to the ending character position?")
print("Answer ending character position:", answer_end)
print("Last offset:", context_offsets[-1])

# Note how we're checking the second tuple value.
print(context_offsets[-1][1] >= answer_end)

Is the highest offset value higher than or equal to the ending character position?
Answer ending character position: 32
Last offset: (19, 23)
False


So the first sequence contains a part of the answer but the full answer gets truncated. This matches a visual inspection:

In [None]:
print(tokenizer.batch_decode(input_ids))

['<s>', 'Where', ' did', ' Sarah', ' go', '?', '</s>', '</s>', 'Sarah', ' went', ' to', ' The', ' M', 'irth', '</s>']


Let's now do the same with the third sequence.

In [None]:
input_ids = x['input_ids'][2]
offset_mapping = x['offset_mapping'][2]
seq_ids = x.sequence_ids(2)

context_pos_start = seq_ids.index(1)
context_pos_end = rindex(seq_ids, 1)

context_offsets = offset_mapping[context_pos_start:context_pos_end+1]

print("Is the lowest offset value lower than or equal to the starting character position?")
print("Answer starting character position:", answer_start)
print("First offset:", context_offsets[0])

# Note how we're checking the first tuple value.
print(context_offsets[0][0] <= answer_start)

print("Is the highest offset value higher than or equal to the ending character position?")
print("Answer ending character position:", answer_end)
print("Last offset:", context_offsets[-1])

# Note how we're checking the second tuple value.
print(context_offsets[-1][1] >= answer_end)


Is the lowest offset value lower than or equal to the starting character position?
Answer starting character position: 14
First offset: (11, 13)
True
Is the highest offset value higher than or equal to the ending character position?
Answer ending character position: 32
Last offset: (28, 32)
True


Now that we've confirmed the third sequence contains the full answer, we need to identify where the answer starts and ends in the *input_ids*. We can do this by scanning the offset_mapping from the left to find the start, and from the right to find the end.

In [None]:
s = e = 0

# Start scanning the offset_mapping from the
# left to find the token position where the answer starts.
# It's not guaranteed a tokenizer will output a token where the
# starting character matches the first answer character. When
# this happens, we take the previous token's position as our start.
i = context_pos_start
while offset_mapping[i][0] < answer_start:
  i += 1
if offset_mapping[i][0] == answer_start:
  s = i
else:
  s = i - 1

# Same idea when finding the ending token position.
j = context_pos_end
while offset_mapping[j][1] > answer_end:
  j -= 1
if offset_mapping[j][1] == answer_end:
  e = j
else:
  e = j + 1

In [None]:
print("Answer start token position in context:", s)
print("Answer end token position in context:", e)

Answer start token position in context: 9
Answer end token position in context: 13


In [None]:
print("Answer lifted from context:")
tokenizer.batch_decode(input_ids[s:e+1])

Answer lifted from context:


[' The', ' M', 'irth', 'less', ' Cafe']

All the logic we stepped through so far is encapsulated in the following function. We'll use this to process our dataset.

In [None]:
def prepare_dataset(examples):
  # Some tokenizers don't strip spaces. If a question has excessive leading
  # whitespace, the context may get truncated away entirely. Only the
  # question is stripped: stripping the context would shift its characters
  # and invalidate the answer_start positions.
  examples["question"] = [q.lstrip() for q in examples["question"]]

  # Tokenize.
  tokenized_examples = tokenizer(
      examples['question'],
      examples['context'],
      truncation="only_second",
      max_length=max_length,
      stride=stride,
      return_overflowing_tokens=True,
      return_offsets_mapping=True,
      padding="max_length"
  )

  # We'll collect a list of starting positions and ending positions.
  tokenized_examples['start_positions'] = []
  tokenized_examples['end_positions'] = []

  # Work through every sequence.
  for seq_idx in range(len(tokenized_examples['input_ids'])):
    seq_ids = tokenized_examples.sequence_ids(seq_idx)
    offset_mappings = tokenized_examples['offset_mapping'][seq_idx]

    cur_example_idx = tokenized_examples['overflow_to_sample_mapping'][seq_idx]
    answer = examples['answers'][cur_example_idx]
    answer_text = answer['text'][0]
    answer_start = answer['answer_start'][0]
    answer_end = answer_start + len(answer_text)

    context_pos_start = seq_ids.index(1)
    context_pos_end = rindex(seq_ids, 1)

    s = e = 0
    if (offset_mappings[context_pos_start][0] <= answer_start and
        offset_mappings[context_pos_end][1] >= answer_end):
      i = context_pos_start
      while offset_mappings[i][0] < answer_start:
        i += 1
      if offset_mappings[i][0] == answer_start:
        s = i
      else:
        s = i - 1

      j = context_pos_end
      while offset_mappings[j][1] > answer_end:
        j -= 1
      if offset_mappings[j][1] == answer_end:
        e = j
      else:
        e = j + 1

    tokenized_examples['start_positions'].append(s)
    tokenized_examples['end_positions'].append(e)

  return tokenized_examples
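
One convention in the function above is easy to miss: a sequence that does not contain the whole answer keeps `s = e = 0`, so its label points at the `<s>` token. The sketch below isolates the labeling of a single sequence so it can be checked without a tokenizer. *label_window* is a hypothetical helper, the offsets are copied or derived from the outputs shown earlier, and the scans are bounded to the context tokens, which is a small variation on the loops above rather than the same code.

```python
def label_window(offsets, seq_ids, answer_start, answer_end):
    """Return (start_token, end_token), or (0, 0) if the answer is not fully inside."""
    ctx_first = seq_ids.index(1)
    ctx_last = len(seq_ids) - 1 - seq_ids[::-1].index(1)

    # The whole answer must lie within the context characters of this window.
    if offsets[ctx_first][0] > answer_start or offsets[ctx_last][1] < answer_end:
        return 0, 0

    # Last context token that starts at or before the answer's first character.
    start = ctx_first
    while start < ctx_last and offsets[start + 1][0] <= answer_start:
        start += 1

    # First context token (scanning from the right) that ends at or after
    # the answer's last character.
    end = ctx_last
    while end > ctx_first and offsets[end - 1][1] >= answer_end:
        end -= 1
    return start, end


seq_ids = [None, 0, 0, 0, 0, 0, None, None, 1, 1, 1, 1, 1, 1, None]
question_part = [(0, 0), (0, 5), (6, 9), (10, 15), (16, 18), (18, 19), (0, 0), (0, 0)]
first_window = question_part + [(0, 5), (6, 10), (11, 13), (14, 17), (18, 19), (19, 23), (0, 0)]
third_window = question_part + [(11, 13), (14, 17), (18, 19), (19, 23), (23, 27), (28, 32), (0, 0)]

print(label_window(first_window, seq_ids, 14, 32))
# (0, 0)
print(label_window(third_window, seq_ids, 14, 32))
# (9, 13)
```

The second result matches the token positions found step by step earlier.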

Before we process, we'll set maximum sequence length, stride, and batch size values.<br>

I arrived at these values through experimentation. Even though *distilroberta-base* has a maximum sequence length of 512, using the full capacity (or a large batch value) results in an out-of-memory error while the attention scores are being calculated. This is on Colab's free tier. On the premium tier, you can use larger sequence lengths or batch values.<br>

The nature of the data will also influence the values.

In [None]:
max_length = 400
stride = 100
batch_size = 32
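
To get a feel for how these values interact, the following sketch estimates how many sequences a single example produces. It is a rough approximation under stated assumptions: *estimate_windows* is a hypothetical helper, the 4 special tokens correspond to the `<s> question </s></s> context </s>` layout seen earlier, and the token counts passed in are made up for illustration.

```python
import math


def estimate_windows(context_tokens, question_tokens, max_length, stride,
                     special_tokens=4):
    """Estimate the number of overlapping sequences produced for one example."""
    capacity = max_length - question_tokens - special_tokens  # context slots per sequence
    step = capacity - stride                                  # new context tokens per sequence
    if capacity <= 0 or step <= 0:
        raise ValueError("max_length is too small for this question and stride")
    if context_tokens <= capacity:
        return 1
    return 1 + math.ceil((context_tokens - capacity) / step)


# Toy example from earlier: 15 context tokens, 5 question tokens,
# sequences of 15 tokens, and an overlap of 5 (inferred from the output).
print(estimate_windows(15, 5, max_length=15, stride=5))
# 10

# Hypothetical long example with the values chosen above.
print(estimate_windows(1000, 12, max_length=400, stride=100))
# 4
```

A larger stride means more overlap, which yields more sequences per example and therefore more training steps per epoch.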

We can map over the **Dataset** objects and apply our prepare method to the examples in batches.<br>
https://huggingface.co/docs/datasets/main/en/nlp_process<br>
https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map<br>

*remove_columns* removes the original data columns and leaves only the post-tokenization columns in place. We can also parallelize processing by using the *num_proc* parameter.<br>
https://huggingface.co/docs/datasets/main/en/process#multiprocessing

In [None]:
tokenized_datasets = data.map(
  prepare_dataset,
  batched=True,
  remove_columns=data["train"].column_names,
  num_proc=2,
)

Map (num_proc=2):   0%|          | 0/87599 [00:00<?, ? examples/s]

Map (num_proc=2):   0%|          | 0/10570 [00:00<?, ? examples/s]
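
With *batched=True*, the mapped function may return more rows than it receives, because one long example yields several sequences. That is why the original columns are removed (their lengths would no longer line up) and why *overflow_to_sample_mapping* is needed to look up each sequence's answer. The sketch below mimics this with plain Python. *expand_batch* is a hypothetical stand-in for the tokenizer, and it uses simple non-overlapping chunks instead of a stride to keep the example short.

```python
def expand_batch(batch, chunk_size=4):
    """Split each example into chunks and record which example each chunk came from."""
    out = {"chunk": [], "overflow_to_sample_mapping": []}
    for example_idx, tokens in enumerate(batch["tokens"]):
        for start in range(0, len(tokens), chunk_size):
            out["chunk"].append(tokens[start:start + chunk_size])
            out["overflow_to_sample_mapping"].append(example_idx)
    return out


batch = {
    "tokens": [list("abcdefghij"), list("xyz")],  # two examples
    "answers": ["answer-0", "answer-1"],
}
expanded = expand_batch(batch)

print(len(batch["tokens"]), "examples in,", len(expanded["chunk"]), "rows out")
# 2 examples in, 4 rows out
print(expanded["overflow_to_sample_mapping"])
# [0, 0, 0, 1]

# Each output row finds its answer through the mapping.
answers_per_row = [batch["answers"][i] for i in expanded["overflow_to_sample_mapping"]]
print(answers_per_row)
# ['answer-0', 'answer-0', 'answer-0', 'answer-1']
```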

Our tokenized dataset still contains two entries (*offset_mapping* and *overflow_to_sample_mapping*) our model won't expect, so we'll remove them.<br>
https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.remove_columns

In [None]:
data = tokenized_datasets.remove_columns(["offset_mapping",
                                          "overflow_to_sample_mapping"])

The last preparation step is to convert the Hugging Face **Dataset** objects into TensorFlow-compatible datasets.<br>
https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.to_tf_dataset<br>
https://huggingface.co/docs/datasets/main/en/use_with_tensorflow#when-to-use-totfdataset

In [None]:
train_set = data['train'].to_tf_dataset(batch_size=batch_size)
validation_set = data['validation'].to_tf_dataset(batch_size=batch_size)

We can now download a pre-trained model for fine-tuning. Just like we did with the tokenizer, we'll use an Auto Class to download the right model. In this case, we're using **TFAutoModelForQuestionAnswering**. This will download a TensorFlow implementation of the pre-trained model with a question answering head on it.<br>

The head in this case is a dense layer that returns *start_logits* and *end_logits*. We can take the argmax of each to determine the start and end of the answer span (see model code for details).<br>
https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.TFAutoModelForQuestionAnswering<br>
https://github.com/huggingface/transformers/blob/main/src/transformers/models/roberta/modeling_tf_roberta.py#L1629

In [None]:
model = TFAutoModelForQuestionAnswering.from_pretrained(model_name)

model.safetensors:   0%|          | 0.00/331M [00:00<?, ?B/s]

All PyTorch model weights were used when initializing TFRobertaForQuestionAnswering.

Some weights or buffers of the TF 2.0 model TFRobertaForQuestionAnswering were not initialized from the PyTorch model and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The following function attempts to answer a question given a context. It tokenizes the question and context, runs them through the model, takes the argmax of the start and end logits, and uses the result to extract an answer span from the input tokens.

In [None]:
def get_answer(tokenizer, model, question, context):
  inputs = tokenizer([question], [context], return_tensors="np")
  outputs = model(inputs)
  start_position = tf.argmax(outputs.start_logits, axis=1)
  end_position = tf.argmax(outputs.end_logits, axis=1)
  answer = inputs["input_ids"][0, int(start_position) : int(end_position) + 1]
  return tokenizer.decode(answer).strip()
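
Because the two argmax operations are independent, nothing forces the predicted end to come after the predicted start, and a reversed pair produces an empty slice. The sketch below shows this with made-up logits, next to a common alternative that scores every valid (start, end) pair. Both helpers are hypothetical and written in plain Python for illustration; the second is not what *get_answer* does.

```python
def independent_argmax(start_logits, end_logits):
    """Pick start and end separately, as get_answer does."""
    start = max(range(len(start_logits)), key=start_logits.__getitem__)
    end = max(range(len(end_logits)), key=end_logits.__getitem__)
    return start, end


def best_valid_span(start_logits, end_logits, max_answer_tokens=30):
    """Pick the highest-scoring pair with start <= end and a bounded length."""
    best, best_score = None, float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_answer_tokens, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


tokens = ["a", "b", "c", "d", "e"]
start_logits = [0.1, 0.2, 0.3, 2.0, 0.5]  # made-up values
end_logits = [0.1, 1.5, 0.2, 0.4, 1.0]    # made-up values

s, e = independent_argmax(start_logits, end_logits)
print((s, e), tokens[s:e + 1])
# (3, 1) []

s, e = best_valid_span(start_logits, end_logits)
print((s, e), tokens[s:e + 1])
# (3, 4) ['d', 'e']
```

An empty slice like the first result is one way a model with an untrained head can return an empty string.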

While the model body (the pre-trained *distilroberta-base* model) is trained, the head is not. So if we try to use our model to answer a question, it should fail or perform poorly (your output will differ because of different initial head weight values).

In [None]:
c = "Sarah went to The Mirthless Cafe last night to meet her friend."
q = "Where did Sarah go?"
get_answer(tokenizer, model, q, c)

''

In [None]:
# https://www.tensorflow.org/guide/mixed_precision
keras.mixed_precision.set_global_policy("mixed_float16")

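# Use a learning rate recommended by the BERT authors.
# https://github.com/google-research/bert
# No loss is passed here: Hugging Face TensorFlow models fall back to
# their built-in task loss when compiled without one.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=3e-5))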

We'll now fine-tune the model. Note that we didn't freeze the layers of the pre-trained body, so its weights will be tuned along with the head's weights.<br>

Because the body is already pre-trained, we don't need a lot of epochs. Two to four is typically enough (the range the BERT authors suggest for fine-tuning). Here, we're using 1 to demonstrate the power of pre-training.<br>

**Note:** If you have a GPU enabled and you're using Colab's free tier, the training time can be all over the place depending on which GPU you get assigned (anywhere from 20 minutes to an hour).

In [None]:
model.fit(train_set, validation_data=validation_set, epochs=1)



<keras.src.callbacks.History at 0x7c4f3c2a7280>

After completing our fine-tuning, we should now have a decent extractive question answering model.

In [None]:
c = "Sarah went to The Mirthless Cafe last night to meet her friend."
q = "Where did Sarah go?"
get_answer(tokenizer, model, q, c)

'The Mirthless Cafe'

In [None]:
q = "Who did Sarah meet?"
get_answer(tokenizer, model, q, c)

'her friend'

In [None]:
q = "When did Sarah meet her friend?"
get_answer(tokenizer, model, q, c)

'last night'

In [None]:
q = "Who went to the restaurant?"
get_answer(tokenizer, model, q, c)

'Sarah'

But as we saw earlier, extractive question answering has its limits.

In [None]:
# Asking a logic teaser question is difficult despite the
# answer being available. To be fair, there is ambiguity here.
q = "Who did Sarah's friend meet?"
get_answer(tokenizer, model, q, c)

'her friend'

In [None]:
# The model can't determine when a question can't be
# answered. Some question answering datasets explicitly
# train for this.
q = "How did Sarah get to the restaurant?"
get_answer(tokenizer, model, q, c)

'to meet her friend'

In [None]:
# The model isn't generative, either.
q = "What is a possible reason for why Sarah met her friend?"
get_answer(tokenizer, model, q, c)

'<s>'

This notebook is taken from https://www.nlpdemystified.org/


# Further Exploration

OpenAI API docs to learn how to build products using their models:<br>
https://openai.com/api/<br>
https://beta.openai.com/docs/introduction

A catalog of transformer models:<br>
https://amatriain.net/blog/transformer-models-an-introduction-and-catalog-2d1e9039f376/


WordPiece and SentencePiece:<br>
https://huggingface.co/course/chapter6/6?fw=pt<br>
https://github.com/google/sentencepiece


**Papers**<br>
Attention Is All You Need (original Transformer paper): https://arxiv.org/abs/1706.03762<br>

The Annotated Transformer: http://nlp.seas.harvard.edu/annotated-transformer/<br>

GPT-3: https://arxiv.org/abs/2005.14165<br>

BERT: https://arxiv.org/abs/1810.04805<br>

RoBERTa paper: https://arxiv.org/abs/1907.11692<br>

ALBERT paper: https://arxiv.org/abs/1909.11942<br>

DistilBERT paper: https://arxiv.org/abs/1910.01108<br>

ELECTRA paper: https://arxiv.org/abs/2003.10555<br>

XLM: https://arxiv.org/abs/1901.07291<br>