# Question Answering And Semantic Search   
The fine-tuning part of this notebook is adapted from [HuggingFace example](https://github.com/huggingface/notebooks/blob/main/examples/question_answering.ipynb).    
One of the go-to libraries for transformer models is [Hugging Face](https://huggingface.co/docs). 
<div class="alert alert-block alert-info">
For this workshop, we have downloaded the models to use. If the model loading cells are ran locally it will download the dataset and models through the internet if there is no cache found in the cache_dir. 
</div>  

As we introduced in the lecture, the pre-trained models are pushing NLP field to a new time. In this notebook we will show you how to make a simple widget to get answers from the corpus by semantic search and fine-tuned models. If you can’t find a model for your use-case, you’ll need to finetune a pretrained model on your data. The **Challenge** section demonstrates the steps to fine-tune a model for Q&A task.

**Outline**  

- Use the Bert model to get answer
- Semantic search for articles that are relevant to the question in corpus
- The Q&A widget  
- Challenge: Fine tune Bert
    - Load pre-trained models from Hugging Face library
    - Fine tune the bert model for Q&A task using squad data

**Estimated time:** 
 45 mins (excluding challenge)

In [7]:
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from transformers import BertTokenizerFast, BertForQuestionAnswering
from transformers import TrainingArguments, Trainer, default_data_collator
import datasets
import sentence_transformers
import IPython
from IPython.core.display import display, HTML
import logging
import pickle
import numpy as np
import pandas as pd
import time
import os

## Use Q&A Model 
HuggingFace provides a pipeline for easy interence using the models. Here is an example for using the pipeline with our specified model.

Now we directly use fine-tuned tokenizer and model. Below cell download and save the models. Here we can load from the download directory.

In [7]:
# # download fine-tuned model and tokenizer
# tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
# model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
# # save to directory
# tokenizer.save_pretrained("../model/fine-tuned/tokenizer/")
# model.save_pretrained("../model/fine-tuned/bert/")

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/443 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.25G [00:00<?, ?B/s]

In [9]:
tokenizer = AutoTokenizer.from_pretrained("../model/fine-tuned/tokenizer/")
qa_model = AutoModelForQuestionAnswering.from_pretrained("../model/fine-tuned/bert/")

Below cell demonstrates how to use the model and tokenizer for Q&A task.

In [64]:
def getAnswer(contexts, questions, tokenizer, model):
    print('>>>> Looking for answers in {} documents...'.format(len(contexts)))
    t=time.time()
    answers = []
    for question in questions:
         for context in contexts:
            inputs = tokenizer(question, context, return_tensors="pt")
            # word to id representation
            input_ids = inputs["input_ids"].tolist()[0]
            #This outputs a range of scores across the entire sequence tokens (question and text), for both the start and end positions.
            outputs = qa_model(**inputs)
            answer_start_scores = outputs.start_logits
            answer_end_scores = outputs.end_logits

            # Get the most likely beginning of answer with the argmax of the score
            answer_start = torch.argmax(answer_start_scores)
            # Get the most likely end of answer with the argmax of the score
            answer_end = torch.argmax(answer_end_scores) +1
            # Get the answer string based on start and end token id
            answer = tokenizer.convert_tokens_to_string(
                tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
            )
            answers.append(answer)
    print('>>>> Answers extracted in : {}s'.format(time.time()-t))
    return answers

In [23]:
# Let's try with our simple example
questions = ['What is extractive question answering?']
context = [r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""]
getAnswer(context, questions, tokenizer, qa_model)

>>>> Looking for answers in 1 documents...
>>>> Answers extracted in : 0.4712860584259033s


['the task of extracting an answer from a text given a question']

## Semantic Search in Corpus

Now our model can find the answer sentences in a context. But how can we query the entire corpus? We need to perform semantic search to find the documents that is most relevant to our question/query. To do this, we need to produce embeddings for our corpus as well as queries when we pass them in. Then we can calculate the similarity/distance between question and each document in our corpus to find the relevant ones.  
The `sentence_transformer` package provides the models we are using today, trained on 215M question-answer pairs and perform well across search tasks and domains.  
![semanticsearch](../img/semanticsearch.png "Semantic Search")  
image from: https://www.sbert.net/examples/applications/semantic-search/README.html

In [26]:
from sentence_transformers import SentenceTransformer, util
model_name = 'multi-qa-MiniLM-L6-cos-v1'
bi_encoder = SentenceTransformer(model_name, cache_folder='../model/bi_encoder')

query_embedding = bi_encoder.encode('How big is London')
passage_embedding = bi_encoder.encode(['London has 9,787,426 inhabitants at the 2011 census',
                                  'London is known for its finacial district'])

print("Similarity:", util.cos_sim(query_embedding, passage_embedding))

Similarity: tensor([[0.5472, 0.6330]])


## Build Q&A Widget for Dataset

### Load Dataset 
Now we load the downloaded simple wiki dataset

In [29]:
import json
import gzip
wikipedia_filepath = '../data/simplewiki-2020-11-01.jsonl.gz'
passages = []
with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as fIn:
    for line in fIn:
        data = json.loads(line.strip())
        for paragraph in data['paragraphs']:
            # We encode the passages as [title, text]
            passages.append([data['title'], paragraph])


In [52]:
print(len(passages), passages[:5])

509663 [['Ted Cassidy', 'Ted Cassidy (July 31, 1932 - January 16, 1979) was an American actor. He was best known for his roles as Lurch and Thing on "The Addams Family".'], ['Aileen Wuornos', 'Aileen Carol Wuornos Pralle (born Aileen Carol Pittman; February 29, 1956\xa0– October 9, 2002) was an American serial killer. She was born in Rochester, Michigan. She confessed to killing six men in Florida and was executed in Florida State Prison by lethal injection for the murders. Wuornos said that the men she killed had raped her or tried to rape her while she was working as a prostitute.'], ['Aileen Wuornos', 'Wuornos was diagnosed with antisocial personality disorder and borderline personality disorder.'], ['Aileen Wuornos', 'The movie, "Monster" is about her life. Two documentaries were made about her.'], ['Aileen Wuornos', 'Wuornos was born Aileen Carol Pittman in Rochester, Michigan. She never met her father. Wuornos was adopted by her grandparents. When she was 13 she became pregnant. 

### Create Dataset Embeddings

Now we encode the dataset like we saw in the example above. For time sake we load the existing embedding from data folder. For this model and wiki dataset it took 2 hours to do embedding with 20 workers. This model is fine-tuned from a variation of Microsoft MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers.

In [31]:
# if model and embedding path is defined, load existing embedding and texts
if model_name == 'multi-qa-MiniLM-L6-cos-v1':
    embeddings_filepath = '../data/corpus_emb_MiniLM-L6.pkl'
    if os.path.isfile(embeddings_filepath):
        
        with open(embeddings_filepath, "rb") as fIn:
            cache_data = pickle.load(fIn)
            passages = cache_data['sentences']
            corpus_embeddings = cache_data['embeddings']
            corpus_embeddings = corpus_embeddings.float()  # Convert embedding file to float
# otherwise generate new embedding and store it to pickle        
else: 
    corpus_embeddings = bi_encoder.encode(passages, convert_to_tensor=True, show_progress_bar=True)
    with open('../data/corpus_emb_MiniLM-L6.pkl', "wb") as f:
        pickle.dump({'sentences': passages, 'embeddings': corpus_embeddings}, f, protocol=4)

For a small corpus (up to 1 million documents), we can compute the cosine-similarity between query and documents in corpus by ` util.cos_sim() ` and retrieve top k documents by `torch.topk `. Fortunately this is done for us in `sentence_transformers.util.semantic_search(query_embeddings: torch.Tensor, corpus_embeddings: torch.Tensor, query_chunk_size: int = 100, corpus_chunk_size: int = 500000, top_k: int = 10, score_function: typing.Callable[[torch.Tensor, torch.Tensor], torch.Tensor] = <function cos_sim>)`.

<div class="alert alert-block alert-warning">
<b>Task 1. Try it out</b> <br>
Embed the quesiton and use the function util.semantic_search( ) and get top_k relevant documents. <br>

</div>

In [40]:
%%time
top_k = 5

query = 'How many people Aileen Carol killed?'
### TODO 
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k)

hits = hits[0]
print("Input question:", query)
for hit in hits:
    print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']]))
print("\n\n========\n")

Input question: How many people Aileen Carol killed?
	0.608	['Aileen Wuornos', 'Aileen Carol Wuornos Pralle (born Aileen Carol Pittman; February 29, 1956\xa0– October 9, 2002) was an American serial killer. She was born in Rochester, Michigan. She confessed to killing six men in Florida and was executed in Florida State Prison by lethal injection for the murders. Wuornos said that the men she killed had raped her or tried to rape her while she was working as a prostitute.']
	0.549	['Hanau', 'Nine people were killed in two shootings in Hanau on 19 February 2020.']
	0.526	['Pauline Fowler', 'In the show, she lived at number 45 Albert Square. She had three children, Mark, Michelle and Martin. Mark died in 2004 of AIDS.']
	0.524	['Diest', 'In 2007, 22845 people lived there.']
	0.517	['Carol Ann Susi', 'On November 11, 2014, Susi died of cancer of unknown primary origin in Los Angeles, California, aged 62.']



CPU times: user 2.86 s, sys: 2.78 s, total: 5.63 s
Wall time: 427 ms


<details><summary><b>Solution</b></summary>

    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k)
</details>

### Helper Functions

Now let's pack above code into a function to use later.

<div class="alert alert-block alert-warning">
<b>Task 2.</b> <br>
Write the function searchContext( ): <br>
1. Encode the query<br>
2. Perform semantic search between question embedding and corpus embedding<br>
3. Return top_k hit <b>passages indexes</b>
</div>

In [35]:
def searchContext(top_k, bi_encoder, query, corpus_embedding):
    indexes = []
    t = time.time()
    ### TODO
    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k)
    hits = hits[0]
    for hit in hits:
        indexes.append(hit['corpus_id'])

    print('>>>> Relevent document search finished in : {}s'.format(time.time()-t))
    return indexes

<details><summary><b>Solution</b></summary>

    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k)
    hits = hits[0]
    for hit in hits:
        indexes.append(hit['corpus_id'])
</details>

Below is a helper function to get the context from top hit ids.

In [36]:
# extract the top ranked passages to feed BERT
def getContext(indexes, passages):
    contexts = []
    for k in indexes:
        contexts.append(passages[k][1])
    return contexts

### Make the Widget 1.0

<div class="alert alert-block alert-warning">
<b>Task 3.</b> <br>
Complete the function QandA( ): <br>
1. Extract top matching document ids, get the contexts and answers<br>
2. display original documents that are matched<br>
</div>

In [125]:
def QandA(corpus_embeddings, passages):
    # Ask for question input
    promptQ = HTML('<div style="font-family: Times New Roman; font-size: 20px; padding-bottom:28px; margin-top:1pt"><b>Ask me a question</b>:')
    display(promptQ)
    question = input()
    questions = [question]

    # ask for top k value
    promptK = HTML('<div style="font-family: Times New Roman; font-size: 20px; padding-bottom:28px; margin-top:1pt"><b>How many results would you like?</b>')
    display(promptK)
    top_k = int(input())

    # display question
    question_HTML = '<div style="font-family: Times New Roman; font-size: 20px; padding-bottom:28px; margin-top:1pt"><b>Query</b>: '+question+'</div>'
    display(HTML(question_HTML))
    
    ### TODO 3.1 search the corpus for relevent passages and answers
    top_k_ids = searchContext(top_k, bi_encoder, question, corpus_embeddings)
    contexts = getContext(top_k_ids, passages)
    answers = getAnswer(contexts, questions, tokenizer, model)
    
    for a in answers:
        answers_HTML = '<div style="font-family: Times New Roman; font-size: 18px; margin-bottom:1pt"><b>Answer found</b>: '+a+'</div>'
        display(HTML(answers_HTML))
    # warning text
    warning_HTML = '<div style="font-family: Times New Roman; font-size: 15px; padding-bottom:15px; color:#E76f51; margin-top:1pt"> These are extracted answers from original documents. Please see the documents below:</div>'
    display(HTML(warning_HTML))
    
    ### TODO 3.2 show original documents
    doc = [passages[k] for k in top_k_ids]
    df_hits = pd.DataFrame(doc, columns=['title','text'])
    df_hits.text.str.wrap(100)
    display(HTML(df_hits.to_html(render_links=True, escape=False)))

In [134]:
QandA(corpus_embeddings, passages)

Is Aileen Wuornos in a movie


5


>>>> Relevent document search finished in : 0.39664506912231445s
>>>> Looking for answers in 5 documents...
>>>> Answers extracted in : 2.0770132541656494s


Unnamed: 0,title,text
0,Aileen Wuornos,"The movie, ""Monster"" is about her life. Two documentaries were made about her."
1,Aileen Wuornos,Wuornos was diagnosed with antisocial personality disorder and borderline personality disorder.
2,Aileen Wuornos,"Wuornos was born Aileen Carol Pittman in Rochester, Michigan. She never met her father. Wuornos was adopted by her grandparents. When she was 13 she became pregnant. She started working as a prostitute when she was 14."
3,Monster (movie),Monster is a 2003 American crime drama movie. It is a true story about female American serial killer Aileen Wuornos. Wuornos is played by Charlize Theron. She received an Academy Award for her role. The movie also features Christina Ricci as Wuornos' girlfriend. The movie was directed by first time director Patty Jenkins.
4,Charlize Theron,"She did star in a few movies, but only really became noticed when she portrayed the life of serial killer, Aileen Wuornos, in the movie ""Monster"". She won an Oscar for her role in this movie."


<details><summary><b>Solution</b></summary>
    
    ### TODO 3.1 
    top_k_ids = searchContext(top_k, bi_encoder, question, corpus_embeddings)
    contexts = getContext(top_k_ids, passages)
    answers = getAnswer(contexts, questions, tokenizer, model)
    ### TODO 3.2 
    doc = [passages[k] for k in top_k_ids]
</details>

## Improve The Widget  
### Retrieve & Re-rank with Cross-Encoder

The are some different ways to improve this workflow. Here we introduce the **Retrieve & Re-rank Pipeline**, which provides better performance for long docment and complex searches. It retrieves the relevant passages first using bi-encoder (what we did above), then re-rank them by the classification score between each pair of question and passage.  
![bi-cross encode](../img/Bi_vs_Cross-Encoder.png )   
image from: https://www.sbert.net/examples/applications/cross-encoder/README.html  
Cross-encoders do not produce embeddings, so it is less efficient for comparision with millions of pairs data. However, in this pipeline, we limit our scope using bi-encoder first and then use cross-encoder to improve the accuracy of results.

In [56]:
from sentence_transformers import SentenceTransformer, CrossEncoder

# cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
# cross_encoder.save('../model/cross_encoder')
cross_encoder = CrossEncoder('../model/cross_encoder')

In [58]:
# example
query = 'When is Aileen Wuornos born'

# The cross-encoder takes 2 inputs and perform classification tasks
cross_input = [[query, 'Ted Cassidy (July 31, 1932 - January 16, 1979) was an American actor. He was best known for his roles as Lurch and Thing on "The Addams Family".'],
                [query, 'Aileen Carol Wuornos Pralle (born Aileen Carol Pittman; February 29, 1956\xa0– October 9, 2002) was an American serial killer. She was born in Rochester, Michigan. She confessed to killing six men in Florida and was executed in Florida State Prison by lethal injection for the murders. Wuornos said that the men she killed had raped her or tried to rape her while she was working as a prostitute.'],
                [query, 'Wuornos was diagnosed with antisocial personality disorder and borderline personality disorder.'],
                [query, 'The movie, "Monster" is about her life. Two documentaries were made about her.'], 
                [query, 'Wuornos was born Aileen Carol Pittman in Rochester, Michigan. She never met her father. Wuornos was adopted by her grandparents. When she was 13 she became pregnant. She started working as a prostitute when she was 14.']]

cross_scores = cross_encoder.predict(cross_input)
cross_scores

array([-10.458561 ,   9.398579 ,  -3.1490192, -11.279931 ,   8.263056 ],
      dtype=float32)

<div class="alert alert-block alert-warning">
<b>Task 4: Complete function rerank( ) and QandArerank</b> <br>
    1.  The function takes one question and compares with a list of contexts. We use the list of indexs and passages data to get the context. The list of indexes will be passed by bi-encoder search. The return will be indexes list sorted by cross-encoder score.<br>
    2. use bi-encoder to search for top 20 relevant passages and use rerank( ) to get the top_k passage indexes. top_k is the user input number.
</div>  

In [129]:
def rerank(question, indexs, passages):
    print('>>>> Re-ranking results...')
    ### TODO 4.1
    cross_in = [[question, passages[idx][0]] for idx in indexs]
    cross_result = cross_encoder.predict(cross_in)
    rank_top = []
    for i, v in enumerate(indexs):
        rank_top.append([v, cross_result[i]])
    rank_top = sorted(rank_top, key=lambda x: x[1], reverse=True)
    rank_top = [e[0] for e in rank_top]
    return rank_top

In [130]:
def QandArerank(corpus_embeddings, passages):
    # Ask for question input
    promptQ = HTML('<div style="font-family: Times New Roman; font-size: 20px; padding-bottom:28px; margin-top:1pt"><b>Ask me a question</b>:')
    display(promptQ)
    question = input()
    questions = [question]

    # ask for top k value after cross-encoder
    promptK = HTML('<div style="font-family: Times New Roman; font-size: 20px; padding-bottom:28px; margin-top:1pt"><b>How many results would you like?</b>')
    display(promptK)
    top_k = int(input())

    # display question
    question_HTML = '<div style="font-family: Times New Roman; font-size: 20px; padding-bottom:28px; margin-top:1pt"><b>Query</b>: '+question+'</div>'
    display(HTML(question_HTML))
    
    ### TODO 4.2 
    bi_top_k = searchContext(20, bi_encoder, questions, corpus_embeddings)
    cross_top_k = rerank(question, bi_top_k, passages)[:top_k]
    contexts = getContext(cross_top_k, passages)
    answers = getAnswer(contexts, questions, tokenizer, model)
    
    for a in answers:
        answers_HTML = '<div style="font-family: Times New Roman; font-size: 18px; margin-bottom:1pt"><b>Answer found</b>: '+a+'</div>'
        display(HTML(answers_HTML))
    # warning text
    warning_HTML = '<div style="font-family: Times New Roman; font-size: 15px; padding-bottom:15px; color:#E76f51; margin-top:1pt"> These are extracted answers from original documents. Please see the documents below:</div>'
    display(HTML(warning_HTML))
    
    doc = [passages[k] for k in cross_top_k]
    
    df_hits = pd.DataFrame(doc, columns=['title','text'])
    df_hits.text.str.wrap(100)
    display(HTML(df_hits.to_html(render_links=True, escape=False)))

In [133]:
QandArerank(corpus_embeddings, passages)

Is Aileen Wuornos in a movie


5


>>>> Relevent document search finished in : 0.4095332622528076s
>>>> Re-ranking results...
>>>> Looking for answers in 5 documents...
>>>> Answers extracted in : 1.867354154586792s


Unnamed: 0,title,text
0,Aileen Wuornos,"The movie, ""Monster"" is about her life. Two documentaries were made about her."
1,Aileen Wuornos,Wuornos was diagnosed with antisocial personality disorder and borderline personality disorder.
2,Aileen Wuornos,"Wuornos was born Aileen Carol Pittman in Rochester, Michigan. She never met her father. Wuornos was adopted by her grandparents. When she was 13 she became pregnant. She started working as a prostitute when she was 14."
3,Aileen Wuornos,"Aileen Carol Wuornos Pralle (born Aileen Carol Pittman; February 29, 1956 – October 9, 2002) was an American serial killer. She was born in Rochester, Michigan. She confessed to killing six men in Florida and was executed in Florida State Prison by lethal injection for the murders. Wuornos said that the men she killed had raped her or tried to rape her while she was working as a prostitute."
4,Aileen Pringle,"Aileen Pringle (July 23, 1895 – December 16, 1989) was an American actress who acted in silent movies and on stage. She played The Queen in ""Three Weeks"" and Mrs. Eva Boutelle in the movie ""True as Steel"", both released in 1924."


<details><summary><b>Solution</b></summary>
    
    ### TODO 4.1 
    cross_in = [[question, passages[idx][0]] for idx in indexs]
    cross_result = cross_encoder.predict(cross_in)

    ### TODO 4.2 show original documents
    bi_top_k = searchContext(20, bi_encoder, questions, corpus_embeddings)
    cross_top_k = rerank(question, bi_top_k, passages)[:top_k]
</details>

>Pay attention to the **first** answer in both of your widgets, can you see the improvement?  

>Change to bigger dataset for longer documents, and try these widgets at home!

<div class="alert alert-block alert-info">
Apart from the retrieve-re-rank pipeline, there is also other methods like using Elasticsearch, FAISS indexing or nearest neightbours to improve the workflow. It depends on the size of dataset and the task you wish to perform.  

</div>  

------------------------  
By the end of this notebook, you have created and improved a small widget to find answers for you in the articles, i.e information retrieval. In the [next notebook](4-Topic_Modelling.ipynb), we will have a look at topic modelling, another useful application to explore a large amount of text data.    

--------------------------
<div class="alert alert-block alert-danger">
<b>Challenge (Use internet):</b> <br>
Fine tune a pre-trained model at home following below steps.
</div>  

## Load Pre-trained Bert model and Datasets   
Fine tune a model required labeled datasets for the target task. For our Q&A task, we use the orginal [Squad dataset](https://arxiv.org/abs/1606.05250) with 100,000+ questions on Wikipedia articles.  

In [None]:
# download datasets and models
dataset = datasets.load_dataset('squad')
tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
model = BertForQuestionAnswering.from_pretrained('bert-base-cased', return_dict=True)
metric = datasets.load_metric('squad')

You may encounter some warning when you download a new model from HuggingFace. The warning is telling us we are throwing away some weights (the vocab_transform and vocab_layer_norm layers) and randomly initializing some other (the pre_classifier and classifier layers). This is absolutely normal in this case, because we are removing the head used to pretrain the model on a masked language modeling objective and replacing it with a new head for which we don't have pretrained weights, so the library warns us we should fine-tune this model before using it for inference, which is exactly what we are going to do.  

Now let's try the tokenizer and inspect our dataset.

In [None]:
tokenizer("What is your name?", "My name is Sylvain.")

In [None]:
len(dataset['validation'])

In [None]:
dataset["train"][10]

## Preparation for fine-tuning

 **Processing long contexts**  
The model from Hugging Face has the maximum input length of 512, which includes the question and the context. So we need to make some changes in our `tokenizer` configuration.  

In the last notebook we truncted sequences that are too long, but here we risk losing the answer if we simply cut the extra text off. So we allow a long context to provide multiple input features, each of them has a length within the input limit (`max_length`). In case the answer is at the splitting point in the context, we also allow certain length of overlap (`doc_stride`) between features for the same context.

In [None]:
max_length = 384 # The maximum length of a feature (question and context)
doc_stride = 128 # The authorized overlap between two part of the context when splitting it is needed.

In [None]:
# this find a long context example
for i, example in enumerate(dataset["train"]):
    if len(tokenizer(example["question"], example["context"])["input_ids"]) > max_length:
        break
example = dataset["train"][i]
# Without any truncation, we get the following length for the input IDs:
len(tokenizer(example["question"], example["context"])["input_ids"])

Note that we never want to truncate the question, only the context, so the `only_second` truncation picked. Now, our tokenizer can automatically return us a list of features capped by a certain maximum length, with the overlap we talked above, we just have to tell it with `return_overflowing_tokens=True` and passing the `stride`:

In [None]:
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    stride=doc_stride
)

Now lets's see the truncted context with overlaps:

In [None]:
for x in tokenized_example["input_ids"][:2]:
    print(tokenizer.decode(x))

The model reuqires the start positon and end positions of answers in the tokens. So we need to map parts of the original context to tokens by setting `return_offsets_mapping=True`. The very first token ([CLS]) has (0, 0) because it doesn't correspond to any part of the question/answer, then the second token is the same as the characters 0 to 3 of the question

In [None]:
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    stride=doc_stride
)
print(tokenized_example["offset_mapping"][0][:100])

Now we distinguish which parts of the offsets are for question and which parts are for context by using `sequence_ids`.

In [None]:
sequence_ids = tokenized_example.sequence_ids()
print(sequence_ids)

In [None]:
pad_on_right = tokenizer.padding_side == "right"

In [None]:
def prepare_train_features(examples):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lots of space). So we remove that
    # left whitespace
    
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possible giving several features when a context is long, each of those features having a
    # context that overlaps a bit the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

In [None]:
examples = dataset['train'][:5]

 In the case of several examples, the tokenizer will return a list of lists for each key:

In [None]:
features = prepare_train_features(examples)

## Fine Tune the Model

In [None]:
# map the function over the entire dataset
tokenized_datasets = dataset.map(prepare_train_features, batched=True, remove_columns=dataset['train'].column_names)


In [None]:
args = TrainingArguments('../model/ft-squad',# directory to save trained model
                        evaluation_strategy='epoch',
                        learning_rate=0.001,
                        per_device_train_batch_size=4000,
                        per_device_eval_batch_size=4000,
                        num_train_epochs=2,
                        # weight_decay=0.01,
                        )
trainer = Trainer(model, args,
                 train_dataset=tokenized_datasets['train'], # for time sake we only use part of the dataset in workshop, feel free to use full traning dataset later! 
                 eval_dataset=tokenized_datasets['validation'],
                 data_collator=default_data_collator,tokenizer=tokenizer)

In [None]:
trainer.train()

In [None]:
# save the model after a long training
trainer.save_model("../model/tuned-squad")