# Tokenizers
How to train your own! Training a tokenizer is a deterministic procedure that looks at the corpus and decides how to break text up into tokens, so the resulting tokenizer will necessarily differ depending on the language or domain.

To avoid using too much RAM, we turn the dataset into an iterator over lists of texts (e.g. batches of 1,000 strings).

## CodeSearchNet Python Dataset
This was compiled from GitHub libraries.

In [188]:
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import pipeline

import pandas as pd
import numpy as np

In [2]:
# This takes a few minutes to load
raw_datasets = load_dataset("code_search_net", "python")

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.



In [5]:
raw_datasets["train"]

Dataset({
    features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url'],
    num_rows: 412178
})

In [7]:
# Let's use the `whole_func_string` to train the tokenizer
raw_datasets["train"][123456]["whole_func_string"]

'def last_rate_limit(self):\n        """\n        A `dict` of the rate limit information returned in the most recent\n        response, or `None` if no requests have been made yet.  The `dict`\n        consists of all headers whose names begin with ``"RateLimit"`` (case\n        insensitive).\n\n        The DigitalOcean API specifies the following rate limit headers:\n\n        :var string RateLimit-Limit: the number of requests that can be made\n            per hour\n        :var string RateLimit-Remaining: the number of requests remaining until\n            the limit is reached\n        :var string RateLimit-Reset: the Unix timestamp for the time when the\n            oldest request will expire from rate limit consideration\n        """\n        if self.last_response is None:\n            return None\n        else:\n            return {k:v for k,v in iteritems(self.last_response.headers)\n                        if k.lower().startswith(\'ratelimit\')}'

In [12]:
# make generator

# # BAD! This would make a list of lists that would fill up RAM
# training_corpus = [raw_datasets["train"][i: i + 1000]["whole_func_string"] for i in range(0, len(raw_datasets["train"]), 1000)]

# # GOOD! This makes a generator
# training_corpus = (raw_datasets["train"][i : i + 1000]["whole_func_string"] for i in range(0, len(raw_datasets["train"]), 1000))

# BETTER! Define it inside a function
def get_training_corpus():
    dataset = raw_datasets["train"]
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx : start_idx + 1000]
        yield samples["whole_func_string"]

training_corpus = get_training_corpus()
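
One caveat worth keeping in mind: a generator object can only be consumed once, so if the same corpus is needed a second time, the function has to be called again. A minimal sketch with a toy dataset (`get_batches`, `toy_dataset` and `batch_size` are stand-ins for illustration, not part of the notebook above):

```python
def get_batches(dataset, batch_size=2):
    """Yield successive slices of `dataset`, `batch_size` items at a time."""
    for start_idx in range(0, len(dataset), batch_size):
        yield dataset[start_idx : start_idx + batch_size]


# Hypothetical stand-in for raw_datasets["train"]["whole_func_string"]
toy_dataset = ["def a(): pass", "def b(): pass", "def c(): pass"]

corpus = get_batches(toy_dataset)
first_pass = list(corpus)   # consumes the generator
second_pass = list(corpus)  # nothing left to yield

print(first_pass)
print(second_pass)

# Calling the function again creates a fresh generator
fresh_pass = list(get_batches(toy_dataset))
print(fresh_pass == first_pass)
```

This is the practical reason for wrapping the generator in a function: a new iterator is always one call away.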

### Training a new tokenizer from an old one
This is faster than starting from scratch, so we start from the GPT-2 tokenizer.

In [13]:
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

example = '''def add_numbers(a, b):
    """Add the two numbers `a` and `b`."""
    return a + b'''

tokens = old_tokenizer.tokenize(example)
print(tokens)

['def', 'Ġadd', '_', 'n', 'umbers', '(', 'a', ',', 'Ġb', '):', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġ"""', 'Add', 'Ġthe', 'Ġtwo', 'Ġnumbers', 'Ġ`', 'a', '`', 'Ġand', 'Ġ`', 'b', '`', '."', '""', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb']


This tokenizer has a few special symbols, like Ġ and Ċ, which denote spaces and newlines, respectively. As we can see, this is not too efficient: the tokenizer returns individual tokens for each space, when it could group together indentation levels (since having sets of four or eight spaces is going to be very common in code). It also split the function name a bit weirdly, not being used to seeing words with the _ character.
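
The odd-looking symbols come from GPT-2's byte-level BPE, which remaps every byte to a printable character before any merges are learned. The sketch below reproduces that remapping as it is commonly described for the original GPT-2 encoder; treat it as an illustration of where `Ġ` and `Ċ` come from, not as the library's exact code.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable unicode character."""
    # Bytes that are already printable keep their own character
    printable = (
        list(range(0x21, 0x7F))      # '!' .. '~'
        + list(range(0xA1, 0xAD))    # U+00A1 .. U+00AC
        + list(range(0xAE, 0x100))   # U+00AE .. U+00FF
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes are shifted to code points starting at 256
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


mapping = bytes_to_unicode()
print(mapping[ord(" ")], mapping[ord("\n")])  # Ġ Ċ
```

The space is byte 32 and the newline is byte 10; both are non-printable, so they get shifted to U+0120 (`Ġ`) and U+010A (`Ċ`).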

As a result we should train it on the new corpus. 

In [19]:
# note: train_new_from_iterator() only works for "fast" tokenizers (i.e. the ones backed by the Rust `tokenizers` library rather than written in pure Python)
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000) # vocab_size=52000
tokenizer.is_fast

True

In [20]:
tokens = tokenizer.tokenize(example)
print(tokens)

['def', 'Ġadd', '_', 'numbers', '(', 'a', ',', 'Ġb', '):', 'ĊĠĠĠ', 'Ġ"""', 'Add', 'Ġthe', 'Ġtwo', 'Ġnumbers', 'Ġ`', 'a', '`', 'Ġand', 'Ġ`', 'b', '`."""', 'ĊĠĠĠ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb']


In [21]:
# save locally
tokenizer.save_pretrained("code-search-net-tokenizer")

('code-search-net-tokenizer\\tokenizer_config.json',
 'code-search-net-tokenizer\\special_tokens_map.json',
 'code-search-net-tokenizer\\vocab.json',
 'code-search-net-tokenizer\\merges.txt',
 'code-search-net-tokenizer\\added_tokens.json',
 'code-search-net-tokenizer\\tokenizer.json')

# Notes on fast tokenizers
In addition to returning `['input_ids', 'token_type_ids', 'attention_mask']`, a fast tokenizer can also keep track of the `offset_mapping`, i.e. the span of characters in the original text that each token covers. Similarly, you can call `.word_ids()`, which returns the index of the word each token came from. This is useful for determining whether two adjacent tokens are part of the same word. That is easy to see with BERT-type tokenizers (which use a `##` prefix for tokens that are not at the start of a word), but `word_ids()` is needed for tokenizers that don't have this feature.

In [56]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
example = "Trevor is learning about tokenizers"
encoding = tokenizer(example, return_offsets_mapping=True)
print(type(encoding))
encoding

<class 'transformers.tokenization_utils_base.BatchEncoding'>


{'input_ids': [101, 8282, 1110, 3776, 1164, 22559, 17260, 1116, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1], 'offset_mapping': [(0, 0), (0, 6), (7, 9), (10, 18), (19, 24), (25, 30), (30, 34), (34, 35), (0, 0)]}

In [67]:
pd.DataFrame({"tokens":encoding.tokens(),
              "word_ids": encoding.word_ids(),
              "offset_mapping": encoding["offset_mapping"]
             }).T.reset_index()

| | index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tokens | [CLS] | Trevor | is | learning | about | token | ##izer | ##s | [SEP] |
| 1 | word_ids | | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 | 4.0 | 4.0 | |
| 2 | offset_mapping | (0, 0) | (0, 6) | (7, 9) | (10, 18) | (19, 24) | (25, 30) | (30, 34) | (34, 35) | (0, 0) |
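
To make the use of `word_ids` concrete, here is a small pure-Python sketch that merges the token offsets sharing a word id into one character span per word. The values are copied from the output above; `word_spans` is a hypothetical helper, not a method of the tokenizer.

```python
example = "Trevor is learning about tokenizers"

# Values copied from the encoding shown above
word_ids = [None, 0, 1, 2, 3, 4, 4, 4, None]
offsets = [(0, 0), (0, 6), (7, 9), (10, 18), (19, 24),
           (25, 30), (30, 34), (34, 35), (0, 0)]


def word_spans(word_ids, offsets):
    """Merge token offsets that share a word id into one (start, end) span per word."""
    spans = {}
    for word_id, (start, end) in zip(word_ids, offsets):
        if word_id is None:  # special tokens such as [CLS] and [SEP]
            continue
        if word_id in spans:
            # Same word as a previous token: extend the end of its span
            spans[word_id] = (spans[word_id][0], end)
        else:
            spans[word_id] = (start, end)
    return spans


spans = word_spans(word_ids, offsets)
words = [example[start:end] for start, end in spans.values()]
print(words)
```

The three tokens `token`, `##izer` and `##s` all carry word id 4, so they collapse into the single span `(25, 35)`.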


We can map any word or token to characters in the original text, and vice versa, with
- `word_to_chars()`
- `token_to_chars()`
- `char_to_word()`
- `char_to_token()`


In [58]:
# let's use word #4 (tokenizers)
word_id = 4
start, end = encoding.word_to_chars(word_id)
example[start:end]

'tokenizers'

In [61]:
token_id = 6 # ##izer
start, end = encoding.token_to_chars(token_id)
example[start:end]

'izer'

In [63]:
encoding.char_to_word(start)

4

In [66]:
char_idx = 15  # a character index inside "learning", not a token id
encoding.char_to_token(char_idx)

3
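
Conceptually, `char_to_token()` is a lookup in the offset mapping: find the token whose character span contains the given position. A rough pure-Python equivalent, assuming the offsets printed earlier (`char_to_token_from_offsets` is a made-up name for illustration):

```python
# Offsets copied from the encoding of "Trevor is learning about tokenizers"
offsets = [(0, 0), (0, 6), (7, 9), (10, 18), (19, 24),
           (25, 30), (30, 34), (34, 35), (0, 0)]


def char_to_token_from_offsets(char_idx, offsets):
    """Return the index of the token whose span contains char_idx, or None."""
    for token_idx, (start, end) in enumerate(offsets):
        # Special tokens have the empty span (0, 0) and therefore never match
        if start <= char_idx < end:
            return token_idx
    return None


print(char_to_token_from_offsets(15, offsets))  # 3
print(char_to_token_from_offsets(30, offsets))  # 6
print(char_to_token_from_offsets(6, offsets))   # None (the space after "Trevor")
```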

## Inside the token-classification pipeline
Recall NER (Named Entity Recognition) with the `pipeline()` function. A pipeline groups together the three stages necessary to get the predictions from a raw text: 
- tokenization
- passing the inputs through the model
- post-processing

The first two steps in the token-classification pipeline are the same as in any other pipeline, but the post-processing is more complex.

In [80]:
# Without an aggregation strategy, the pipeline returns one prediction per token
token_classifier = pipeline("token-classification", 
                            model="dbmdz/bert-large-cased-finetuned-conll03-english")

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [123]:
example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
pd.DataFrame(token_classifier(example)).T

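| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| entity | I-PER | I-PER | I-PER | I-PER | I-ORG | I-ORG | I-ORG | I-LOC |
| score | 0.999383 | 0.998155 | 0.995907 | 0.999233 | 0.973893 | 0.976115 | 0.988798 | 0.993211 |
| index | 4 | 5 | 6 | 7 | 12 | 13 | 14 | 16 |
| word | S | ##yl | ##va | ##in | Hu | ##gging | Face | Brooklyn |
| start | 11 | 12 | 14 | 16 | 33 | 35 | 41 | 49 |
| end | 12 | 14 | 16 | 18 | 35 | 40 | 45 | 57 |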


The pipeline returns one prediction per token, so words that the tokenizer split into several pieces show up as several rows. If we want them to be aggregated together, we can use the `aggregation_strategy` parameter.

The `aggregation_strategy` picked will change the scores computed for each grouped entity. 

- `simple`: the score is the mean of the scores of each token in the given entity (for instance, the score of "Sylvain" is the mean of the scores we saw in the previous example for the tokens S, ##yl, ##va, and ##in)
- `first`: the score of each entity is the score of the first token of that entity (so for "Sylvain" it would be 0.999383, the score of the token "S")
- `max`: the score of each entity is the maximum score of the tokens in that entity (so for "Hugging Face" it would be 0.98879766, the score of "Face")
- `average`: the score of each entity is the average of the scores of the words composing that entity (so for "Sylvain" there would be no difference from the `simple` strategy, but "Hugging Face" would have a score of 0.9819, the average of the scores for "Hugging", 0.975, and "Face", 0.98879)
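
The arithmetic behind these four descriptions can be checked by hand. The sketch below applies them literally to the per-token scores of "Hugging Face" from the un-aggregated output; it is not a reimplementation of the pipeline, whose internals may differ in detail (for instance, it also has to decide on the label).

```python
from statistics import mean

# Per-token scores copied from the un-aggregated pipeline output
hugging_face = {"Hu": 0.973893, "##gging": 0.976115, "Face": 0.988798}
token_scores = list(hugging_face.values())

simple = mean(token_scores)   # mean over all tokens
first = token_scores[0]       # score of the first token
maximum = max(token_scores)   # best token score

# `average` works per word: "Hu" + "##gging" form the word "Hugging"
word_scores = [mean(token_scores[:2]), token_scores[2]]
average = mean(word_scores)

for name, value in [("simple", simple), ("first", first),
                    ("max", maximum), ("average", average)]:
    print(f"{name:8s}{value:.6f}")
```

The `simple` value comes out at about 0.979602, which is the score the pipeline reports for "Hugging Face" with `aggregation_strategy="simple"`.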


In [131]:
token_classifier = pipeline("token-classification", 
                            model="dbmdz/bert-large-cased-finetuned-conll03-english",
                            aggregation_strategy="simple")
example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
pd.DataFrame(token_classifier(example))



| | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | PER | 0.998169 | Sylvain | 11 | 18 |
| 1 | ORG | 0.979602 | Hugging Face | 33 | 45 |
| 2 | LOC | 0.993211 | Brooklyn | 49 | 57 |


## Doing it without the `pipeline()`
That is, we run the three stages separately: tokenization, the forward pass through the model, and post-processing.

In [82]:
from transformers import AutoTokenizer, TFAutoModelForTokenClassification
import tensorflow as tf

model_checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = TFAutoModelForTokenClassification.from_pretrained(model_checkpoint)

example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
inputs = tokenizer(example, return_tensors="tf")
outputs = model(**inputs)




All PyTorch model weights were used when initializing TFBertForTokenClassification.

All the weights of TFBertForTokenClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForTokenClassification for predictions without further training.


In [83]:
# input_shape  = (batch_size, num_tokens)
# output_shape = (batch_size, num_tokens, num_model_labels)

print(inputs["input_ids"].shape)
print(outputs.logits.shape)

(1, 19)
(1, 19, 9)


Apply a softmax to convert the logits to probabilities, then take the argmax to get the predictions (the argmax can be taken on the logits directly, because softmax does not change their order).
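
The claim that softmax preserves the ordering is easy to check on a few made-up logits. The helpers below are plain-Python stand-ins for `tf.math.softmax` and `tf.math.argmax`, written only for this illustration:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


logits = [-1.2, 0.3, 4.5, 2.0]  # made-up values
probs = softmax(logits)

print(argmax(logits), argmax(probs))  # 2 2
```

Because `exp` is strictly increasing and the normalisation divides every entry by the same positive number, the largest logit always ends up with the largest probability.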

In [120]:
# generate probabilities (optional... can just use logits directly)
probabilities = tf.math.softmax(outputs.logits, axis=-1)[0]
probabilities = probabilities.numpy().tolist()

# select the max logits as the predicted class
predictions = tf.math.argmax(outputs.logits, axis=-1)[0]
predictions = predictions.numpy().tolist()
print(predictions)

[0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0, 6, 6, 6, 0, 8, 0, 0]


In [132]:
# What do the predictions mean?
label_dict = model.config.id2label
print(label_dict)

{0: 'O', 1: 'B-MISC', 2: 'I-MISC', 3: 'B-PER', 4: 'I-PER', 5: 'B-ORG', 6: 'I-ORG', 7: 'B-LOC', 8: 'I-LOC'}


There are 9 prediction labels: 

- `O`: tokens that are not in any named entity (it stands for "outside")
- `B-MISC`
- `I-MISC`
- `B-PER`
- `I-PER`
- `B-ORG`
- `I-ORG`
- `B-LOC`
- `I-LOC`


**Note:**
- `B-XXX` means the token is at the beginning of an entity XXX. *It is only used when there are consecutive named entities that need to be separated.*
- `I-XXX` means the token is inside an entity XXX.
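
To see why the `B-` prefix matters, consider two entities of the same type sitting next to each other. The sketch below groups a made-up label sequence into spans; `group_labels` is a hypothetical helper, and the labels are invented for the example rather than produced by the model.

```python
def group_labels(labels):
    """Return (entity_type, start_idx, end_idx_exclusive) spans from IOB labels."""
    spans = []
    current = None  # [entity_type, start, end] of the entity being built
    for idx, label in enumerate(labels):
        if label == "O":
            if current:
                spans.append(tuple(current))
                current = None
            continue
        prefix, entity_type = label.split("-", 1)
        if current and prefix == "I" and current[0] == entity_type:
            current[2] = idx + 1  # extend the running entity
        else:
            # A B- label or a change of type starts a new entity
            if current:
                spans.append(tuple(current))
            current = [entity_type, idx, idx + 1]
    if current:
        spans.append(tuple(current))
    return spans


with_b = ["I-PER", "I-PER", "B-PER", "I-PER", "O", "I-LOC"]
without_b = ["I-PER", "I-PER", "I-PER", "I-PER", "O", "I-LOC"]

print(group_labels(with_b))
print(group_labels(without_b))
```

With the `B-PER` label the four person tokens form two separate entities; without it they would be merged into one.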

In [130]:
results = []
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != "O":
        start, end = offsets[idx]
        results.append(
            {
                "entity": label,
                "score": probabilities[idx][pred],
                "word": tokens[idx],
                "start": start,
                "end": end,
            }
        )

pd.DataFrame(results)

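| | entity | score | word | start | end |
|---|---|---|---|---|---|
| 0 | I-PER | 0.999383 | S | 11 | 12 |
| 1 | I-PER | 0.998155 | ##yl | 12 | 14 |
| 2 | I-PER | 0.995907 | ##va | 14 | 16 |
| 3 | I-PER | 0.999233 | ##in | 16 | 18 |
| 4 | I-ORG | 0.973893 | Hu | 33 | 35 |
| 5 | I-ORG | 0.976115 | ##gging | 35 | 40 |
| 6 | I-ORG | 0.988797 | Face | 41 | 45 |
| 7 | I-LOC | 0.99321 | Brooklyn | 49 | 57 |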


In [133]:
# If we want to group the entities
import numpy as np

results = []
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

idx = 0
while idx < len(predictions):
    pred = predictions[idx]
    label = model.config.id2label[pred]
    if label == "O":
        idx += 1
        continue

    # Remove the B- or I- prefix
    label = label[2:]
    start, end = offsets[idx]

    # Start with the current token (it may be labeled B-label or I-label),
    # then grab all the following tokens labeled with I-label
    all_scores = [probabilities[idx][pred]]
    idx += 1
    while (
        idx < len(predictions)
        and model.config.id2label[predictions[idx]] == f"I-{label}"
    ):
        all_scores.append(probabilities[idx][predictions[idx]])
        _, end = offsets[idx]
        idx += 1

    # The score is the mean of all the scores of the tokens in that grouped entity
    score = np.mean(all_scores).item()
    word = example[start:end]
    results.append(
        {
            "entity_group": label,
            "score": score,
            "word": word,
            "start": start,
            "end": end,
        }
    )
    # No extra increment here: idx already points at the first token after the entity

pd.DataFrame(results)

| | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | PER | 0.998169 | Sylvain | 11 | 18 |
| 1 | ORG | 0.979602 | Hugging Face | 33 | 45 |
| 2 | LOC | 0.99321 | Brooklyn | 49 | 57 |


# Question-Answering Pipeline

In [135]:
question_answerer = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

In [138]:
context = """
🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch, and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question = "Which deep learning libraries back 🤗 Transformers?"
output = question_answerer(question=question, context=context)
output

{'score': 0.9802601933479309,
 'start': 78,
 'end': 106,
 'answer': 'Jax, PyTorch, and TensorFlow'}

In [139]:
context[output["start"]:output["end"]]

'Jax, PyTorch, and TensorFlow'

If the context is very long, it might have to be split into pieces to fit into the model. This splitting can cause a problem if a split falls in the middle of the answer.

In [143]:
long_context = """
🤗 Transformers: State of the Art NLP

🤗 Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction,
question answering, summarization, translation, text generation and more in over 100 languages.
Its aim is to make cutting-edge NLP easier to use for everyone.

🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and
then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and
can be modified to enable quick research experiments.

Why should I use transformers?

1. Easy-to-use state-of-the-art models:
  - High performance on NLU and NLG tasks.
  - Low barrier to entry for educators and practitioners.
  - Few user-facing abstractions with just three classes to learn.
  - A unified API for using all our pretrained models.
  - Lower compute costs, smaller carbon footprint:

2. Researchers can share trained models instead of always retraining.
  - Practitioners can reduce compute time and production costs.
  - Dozens of architectures with over 10,000 pretrained models, some in more than 100 languages.

3. Choose the right framework for every part of a model's lifetime:
  - Train state-of-the-art models in 3 lines of code.
  - Move a single model between TF2.0/PyTorch frameworks at will.
  - Seamlessly pick the right framework for training, evaluation and production.

4. Easily customize a model or an example to your needs:
  - We provide examples for each architecture to reproduce the results published by its original authors.
  - Model internals are exposed as consistently as possible.
  - Model files can be used independently of the library for quick experiments.

🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
output = question_answerer(question=question, context=long_context)
long_context[output["start"]:output["end"]]

'Jax, PyTorch and TensorFlow'

## Question Answering without the Pipeline

In [206]:
from transformers import AutoTokenizer, TFAutoModelForQuestionAnswering

model_checkpoint = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = TFAutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

inputs = tokenizer(question, context, return_tensors="tf")
outputs = model(**inputs)

All PyTorch model weights were used when initializing TFDistilBertForQuestionAnswering.

All the weights of TFDistilBertForQuestionAnswering were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForQuestionAnswering for predictions without further training.


In [207]:
# outputs

### QA inputs
Note that we tokenize the question and the context as a pair, with the question first, separated by '[SEP]'

In [208]:
print(inputs.tokens())

['[CLS]', 'Which', 'deep', 'learning', 'libraries', 'back', '[UNK]', 'Transformers', '?', '[SEP]', '[UNK]', 'Transformers', 'is', 'backed', 'by', 'the', 'three', 'most', 'popular', 'deep', 'learning', 'libraries', '—', 'Jax', ',', 'P', '##y', '##T', '##or', '##ch', ',', 'and', 'Ten', '##sor', '##F', '##low', '—', 'with', 'a', 'sea', '##m', '##less', 'integration', 'between', 'them', '.', 'It', "'", 's', 'straightforward', 'to', 'train', 'your', 'models', 'with', 'one', 'before', 'loading', 'them', 'for', 'in', '##ference', 'with', 'the', 'other', '.', '[SEP]']


### QA outputs
The QA model has been trained to predict the index of the token starting the answer (here 23) and the index of the token where the answer ends (here 35). This is why those models return 2 tensors: 
- one for the logits corresponding to the start token of the answer,
- one for the logits corresponding to the end token of the answer.

Since in this case we have only one input containing 67 tokens, we get:

In [209]:
start_logits = outputs.start_logits
end_logits = outputs.end_logits
print(start_logits.shape, end_logits.shape)

(1, 67) (1, 67)


In [210]:
start_logits

<tf.Tensor: shape=(1, 67), dtype=float32, numpy=
array([[-4.49515   , -6.445369  , -4.711502  , -7.096823  , -7.07262   ,
        -7.4981337 , -5.539712  , -4.1367955 , -5.9198594 , -5.419285  ,
        -1.5920235 , -1.0857029 , -5.098052  , -2.9330769 , -3.407005  ,
         2.2466576 ,  5.1562777 , -1.3601698 , -2.2208617 , -0.96861225,
        -4.8112288 , -2.252688  ,  1.4382719 , 10.121094  , -1.5310661 ,
         2.2685452 , -1.8951424 , -2.210829  , -4.214192  , -2.5570729 ,
        -2.3252347 , -2.604605  ,  1.7046974 , -1.9866575 , -1.7211072 ,
        -0.54148585, -2.023939  , -4.42457   , -5.101218  , -4.4965935 ,
        -7.893999  , -6.719963  , -4.675866  , -6.3278456 , -4.833928  ,
        -5.183854  , -3.372427  , -7.411994  , -8.154199  , -4.4871044 ,
        -7.4659348 , -4.329343  , -4.229309  , -3.190328  , -7.9467363 ,
        -5.2664833 , -7.590193  , -5.0569754 , -7.4475775 , -7.9083138 ,
        -6.595132  , -7.406136  , -8.882107  , -7.6748734 , -6.987933  ,
  

**Masking irrelevant tokens**: we only want to predict spans inside the context, so the question tokens and the `[SEP]` tokens are masked. The `[CLS]` token is left unmasked, since some models use it to indicate that the answer is not in the context.

In [211]:
sequence_ids = inputs.sequence_ids() # question=0, context=1, special tokens=None

# Mask everything (i.e. set to True) apart from the tokens of the context (i.e. set context to False)
mask = [i != 1 for i in sequence_ids]

# Unmask the [CLS] token
mask[0] = False
mask = tf.constant(mask)[None] # the [None] changes shape from (67,) to (1, 67)

# Softmax is applied next, so set the masked entries to a large negative number
# (their probability then becomes effectively zero)
start_logits = tf.where(mask, -10000, start_logits) # (condition, value where True, value where False)
end_logits = tf.where(mask, -10000, end_logits)

In [212]:
start_probabilities = tf.math.softmax(start_logits, axis=-1)[0].numpy()
end_probabilities = tf.math.softmax(end_logits, axis=-1)[0].numpy()
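
Why a large negative logit works as a mask: after the softmax, its probability underflows to zero, so the remaining probability mass is shared among the unmasked positions only. A small sketch with made-up logits (the `softmax` helper is a plain-Python stand-in for `tf.math.softmax`):

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 3.0, 0.5]       # made-up values
mask = [False, True, False, True]   # True = position must be ignored

masked_logits = [-10000.0 if m else x for x, m in zip(logits, mask)]
probs = softmax(masked_logits)

print(probs)
```

The masked positions get probability `0.0`, and the two unmasked positions sum to 1.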

At this stage, we could take the argmax of the start and end probabilities — but we might end up with a start index that is greater than the end index, so we need to take a few more precautions. We will compute the probabilities of each possible `start_index` and `end_index` where `start_index <= end_index`, then take the tuple `(start_index, end_index)` with the highest probability.

Assuming the events *"The answer starts at start_index"* and *"The answer ends at end_index"* to be independent, the probability that the answer starts at start_index and ends at end_index is:
$$\text{start\_probabilities}[\text{start\_index}] \times \text{end\_probabilities}[\text{end\_index}]$$

So, to compute all the scores, we just need to compute all the products where `start_index <= end_index`.
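
A toy version with three tokens shows why the constraint matters. The probabilities below are invented so that the naive argmax picks a start that comes after the end:

```python
start_probs = [0.1, 0.3, 0.6]  # made-up values
end_probs = [0.5, 0.4, 0.1]    # made-up values

# Naive choice: independent argmax of each list
naive_start = max(range(len(start_probs)), key=start_probs.__getitem__)
naive_end = max(range(len(end_probs)), key=end_probs.__getitem__)
print(naive_start, naive_end)  # 2 0 -> invalid span, start > end

# Score every valid pair (start <= end) and keep the best one
best_start, best_end, best_score = max(
    (
        (s, e, start_probs[s] * end_probs[e])
        for s in range(len(start_probs))
        for e in range(s, len(end_probs))
    ),
    key=lambda candidate: candidate[2],
)
print(best_start, best_end)  # 1 1
```

The best valid span here is `(1, 1)` with a score of about 0.12, which is neither the best start nor the best end taken on its own.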

First let’s compute all the possible products:

In [214]:
start_probabilities[:, None].shape, end_probabilities[None, :].shape

((67, 1), (1, 67))

In [215]:
# calculate matrix of scores
scores = start_probabilities[:, None] * end_probabilities[None, :]

# Keep only the upper triangular part, where start_index <= end_index
scores = np.triu(scores) # zeroes out everything below the diagonal

scores.shape

(67, 67)

In [216]:

max_index = scores.argmax().item() # returns index of flattened tensor
start_index = max_index // scores.shape[1] # the row of the max
end_index = max_index % scores.shape[1]    # the col of the max
print(scores[start_index, end_index])

0.9802602


This is the same score we got when using the pipeline.

In [217]:
max_index, start_index, end_index

(1576, 23, 35)
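
The row/column arithmetic is just integer division by the number of columns, which `divmod` does in one step (NumPy's `np.unravel_index(max_index, scores.shape)` gives the same pair). Checking it with the numbers printed above:

```python
num_cols = 67     # scores.shape[1]
max_index = 1576  # position of the maximum in the flattened matrix

start_index, end_index = divmod(max_index, num_cols)
print(start_index, end_index)  # 23 35
```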

In [218]:
# Now map these tokens to words using the offset_mapping of the inputs
inputs_with_offsets = tokenizer(question, context, return_offsets_mapping=True)
offsets = inputs_with_offsets["offset_mapping"]

start_char, _ = offsets[start_index]
_, end_char = offsets[end_index]
answer = context[start_char:end_char]


# format it like the pipeline result
result = {
    "answer": answer,
    "start": start_char,
    "end": end_char,
    "score": scores[start_index, end_index],
}
print(result)

{'answer': 'Jax, PyTorch, and TensorFlow', 'start': 78, 'end': 106, 'score': 0.9802602}


### Compare top 3 from pipeline to the manual way

In [220]:
# pipeline
output = question_answerer(question=question, context=context, top_k=3)
output

[{'score': 0.9802601933479309,
  'start': 78,
  'end': 106,
  'answer': 'Jax, PyTorch, and TensorFlow'},
 {'score': 0.0082477992400527,
  'start': 78,
  'end': 108,
  'answer': 'Jax, PyTorch, and TensorFlow —'},
 {'score': 0.0013676979579031467,
  'start': 78,
  'end': 90,
  'answer': 'Jax, PyTorch'}]

In [219]:
# manual
top_k = np.argsort(scores, axis = None)[::-1][0:3]
for max_index in top_k:
    start_index = max_index // scores.shape[1] # the row of the max
    end_index = max_index % scores.shape[1]    # the col of the max
    print(start_index, end_index, scores[start_index, end_index])

23 35 0.9802602
23 36 0.008247784
16 35 0.006841465


In [None]:
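# The first 2 are the same, but not the 3rd.
# A likely explanation (not verified here): the pipeline discards spans longer than its
# `max_answer_len` (15 tokens by default, as far as I know), and the span 16..35 covers 20 tokens.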

# Long Contexts
If the context is longer than the model can accept, the input will be truncated. In the cells below the limit is `max_length=384`.

In [236]:
print(f"{len(long_context.split(' ' ))} words in long_context")

inputs = tokenizer(question, long_context)
print(f"{len(inputs['input_ids'])} tokens in long_context")

321 words in long_context
461 tokens in long_context


In [237]:
# truncate only the context, not the question
inputs = tokenizer(question, long_context, max_length=384, truncation="only_second")
print(tokenizer.decode(inputs["input_ids"]))

[CLS] Which deep learning libraries back [UNK] Transformers? [SEP] [UNK] Transformers : State of the Art NLP [UNK] Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation and more in over 100 languages. Its aim is to make cutting - edge NLP easier to use for everyone. [UNK] Transformers provides APIs to quickly download and use those pretrained models on a given text, fine - tune them on your own datasets and then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and can be modified to enable quick research experiments. Why should I use transformers? 1. Easy - to - use state - of - the - art models : - High performance on NLU and NLG tasks. - Low barrier to entry for educators and practitioners. - Few user - facing abstractions with just three classes to learn. - A unified A

Note that the answer is no longer in the truncated context!

To overcome this, the pipeline splits the context into smaller chunks, with some overlap between them so that the answer is not cut in two.

In [268]:
sentence = "This sentence is not too long but we are going to split it anyway."
inputs = tokenizer(
    sentence, 
    truncation=True, 
    return_overflowing_tokens=True, 
    max_length=6, # max chunk length
    stride=2,  # how many tokens to overlap
)

for ids in inputs["input_ids"]:
    # print(tokenizer.convert_ids_to_tokens(ids))
    print(tokenizer.decode(ids))

[CLS] This sentence is not [SEP]
[CLS] is not too long [SEP]
[CLS] too long but we [SEP]
[CLS] but we are going [SEP]
[CLS] are going to split [SEP]
[CLS] to split it anyway [SEP]
[CLS] it anyway. [SEP]


In [267]:
print(inputs.keys())

dict_keys(['input_ids', 'attention_mask', 'overflow_to_sample_mapping'])


In [273]:
# OK, so what is the `overflow_to_sample_mapping`?

sentences = [
    "This sentence is not too long but we are going to split it anyway.",
    "This sentence is shorter but will still get split.",
]
inputs = tokenizer(
    sentences, 
    truncation=True, 
    return_overflowing_tokens=True, 
    max_length=6, 
    stride=2
)
# print out the input chunks
for ids in inputs["input_ids"]:
    # print(tokenizer.convert_ids_to_tokens(ids))
    print(tokenizer.decode(ids))

#This shows which chunk belongs to which sample
print(inputs["overflow_to_sample_mapping"])

[CLS] This sentence is not [SEP]
[CLS] is not too long [SEP]
[CLS] too long but we [SEP]
[CLS] but we are going [SEP]
[CLS] are going to split [SEP]
[CLS] to split it anyway [SEP]
[CLS] it anyway. [SEP]
[CLS] This sentence is shorter [SEP]
[CLS] is shorter but will [SEP]
[CLS] but will still get [SEP]
[CLS] still get split. [SEP]
[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
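
The chunking and the sample mapping can be mimicked in pure Python. This is only a sketch under two assumptions: tokens are obtained by splitting on whitespace (which happens to match the word pieces in this example), and two slots per chunk are reserved for `[CLS]` and `[SEP]`. `split_with_overlap` is a made-up helper, not the tokenizer's implementation.

```python
def split_with_overlap(tokens, max_length, stride, num_special_tokens=2):
    """Split tokens into windows that overlap by `stride` tokens."""
    window = max_length - num_special_tokens  # room left for real tokens
    step = window - stride                    # how far each window advances
    if step <= 0:
        raise ValueError("stride must be smaller than the usable window")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start : start + window])
        if start + window >= len(tokens):  # this window reached the end
            break
        start += step
    return chunks


sentences = [
    "This sentence is not too long but we are going to split it anyway .".split(),
    "This sentence is shorter but will still get split .".split(),
]

all_chunks, sample_mapping = [], []
for sample_idx, tokens in enumerate(sentences):
    for chunk in split_with_overlap(tokens, max_length=6, stride=2):
        all_chunks.append(chunk)
        sample_mapping.append(sample_idx)  # which sentence the chunk came from

for chunk in all_chunks:
    print(" ".join(chunk))
print(sample_mapping)
```

Each chunk repeats the last two tokens of the previous one, and `sample_mapping` plays the role of `overflow_to_sample_mapping`: seven chunks for the first sentence, four for the second.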


In [274]:
# Make the tokenizer behave like the pipeline default
# and tokenize `long_context`
inputs = tokenizer(
    question,
    long_context,
    stride=128,
    max_length=384,
    padding="longest",
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)

In [276]:
inputs.keys()

dict_keys(['input_ids', 'attention_mask', 'offset_mapping', 'overflow_to_sample_mapping'])

In [277]:
# get rid of `overflow_to_sample_mapping` and `offset_mapping` since they are not used by our model
_ = inputs.pop("overflow_to_sample_mapping")
offsets = inputs.pop("offset_mapping")

inputs = inputs.convert_to_tensors("tf")
print(inputs["input_ids"].shape)

(2, 384)


In [278]:
# `long_context` was split in two, so after it goes through the model, there are two sets of start and end logits
outputs = model(**inputs)

start_logits = outputs.start_logits
end_logits = outputs.end_logits
print(start_logits.shape, end_logits.shape)

(2, 384) (2, 384)


In [280]:
# mask the tokens not in context

sequence_ids = inputs.sequence_ids()

# Mask everything apart from the tokens of the context
mask = [i != 1 for i in sequence_ids]

# Unmask the [CLS] token
mask[0] = False

# Mask all the [PAD] tokens
mask = tf.math.logical_or(tf.constant(mask)[None], inputs["attention_mask"] == 0)

start_logits = tf.where(mask, -10000, start_logits)
end_logits = tf.where(mask, -10000, end_logits)

# calculate probs
start_probabilities = tf.math.softmax(start_logits, axis=-1).numpy()
end_probabilities = tf.math.softmax(end_logits, axis=-1).numpy()

In [281]:
# examine all pairs of start and end probabilities, one chunk at a time
candidates = []
for start_probs, end_probs in zip(start_probabilities, end_probabilities):
    scores = start_probs[:, None] * end_probs[None, :]
    idx = np.triu(scores).argmax().item()

    start_idx = idx // scores.shape[1]
    end_idx = idx % scores.shape[1]
    score = scores[start_idx, end_idx].item()
    candidates.append((start_idx, end_idx, score))

print(candidates)

[(0, 18, 0.33867087960243225), (173, 184, 0.9714869856834412)]


Those two candidates correspond to the best answers the model was able to find in each chunk. The model is way more confident the right answer is in the second part (which is a good sign!). Now we just have to map those two token spans to spans of characters in the context (we only need to map the second one to have our answer, but it’s interesting to see what the model has picked in the first chunk).

In [282]:
for candidate, offset in zip(candidates, offsets):
    start_token, end_token, score = candidate
    start_char, _ = offset[start_token]
    _, end_char = offset[end_token]
    answer = long_context[start_char:end_char]
    result = {"answer": answer, "start": start_char, "end": end_char, "score": score}
    print(result)

{'answer': '\n🤗 Transformers: State of the Art NLP', 'start': 0, 'end': 37, 'score': 0.33867087960243225}
{'answer': 'Jax, PyTorch and TensorFlow', 'start': 1892, 'end': 1919, 'score': 0.9714869856834412}


It was more confident in the second prediction, which was correct.
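
Picking the final answer then amounts to keeping the candidate with the highest score across chunks. A minimal sketch using the candidate tuples printed above:

```python
# (start_token, end_token, score) for each chunk, copied from the output above
candidates = [(0, 18, 0.33867087960243225), (173, 184, 0.9714869856834412)]

# Keep the chunk index so the right row of `offsets` can be used afterwards
best_chunk, (start_token, end_token, score) = max(
    enumerate(candidates), key=lambda item: item[1][2]
)
print(best_chunk, start_token, end_token, score)
```

The chunk index is needed because token positions are relative to each chunk, so the character offsets have to be looked up in the offsets of that same chunk.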