# Intro to DL Day 2: NLP
We'll start with an introduction to different NLP tasks and how to perform them with pre-trained models from the [Hugging Face](https://huggingface.co) library (the most popular library in NLP these days). This notebook is heavily inspired by their [course](https://huggingface.co/course/chapter1).
Install the Transformers, Datasets, and Evaluate libraries to run this notebook. The tokenization section also uses fastai and pandas to load the sample data.

In [1]:
from transformers import pipeline

# NLP tasks
For each of the tasks below, you can either omit the model, in which case the pipeline uses a default one, or specify a model from the [model hub](https://huggingface.co/models).
## 1.1 Sentiment classification

In [2]:
classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a Deep Learning course my whole life.")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.8671445250511169}]

We can even pass in multiple sentences:

In [3]:
classifier(
    ["I've been waiting for a Deep Learning course my whole life.", "I hate this so much!"]
)

[{'label': 'POSITIVE', 'score': 0.8671445250511169},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

By default, this pipeline selects a particular pretrained model that has been fine-tuned for sentiment analysis in English. The model is downloaded and cached when you create the classifier object. If you rerun the command, the cached model will be used instead and there is no need to download the model again.

There are three main steps involved when you pass some text to a pipeline:

1. The text is preprocessed into a format the model can understand.
1. The preprocessed inputs are passed to the model.
1. The predictions of the model are post-processed, so you can make sense of them.
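
The three steps can be made concrete with a toy stand-in. The sketch below does not use the Transformers internals: the tiny vocabulary, the weights, and the function names are all made up for illustration. It only shows the shape of the flow, from text to IDs, from IDs to raw scores (logits), and from logits to a readable label with a probability.

```python
import math

# Hypothetical toy vocabulary and per-token weights (made up for illustration).
TOY_VOCAB = {"[UNK]": 0, "i": 1, "love": 2, "hate": 3, "this": 4}
# One (negative, positive) weight pair per token ID.
TOY_WEIGHTS = [(0.0, 0.0), (0.0, 0.0), (-1.0, 2.0), (2.0, -1.0), (0.0, 0.0)]
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Step 1: turn raw text into token IDs the toy model understands."""
    return [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in text.lower().split()]


def toy_model(input_ids):
    """Step 2: produce one raw score (logit) per label."""
    neg = sum(TOY_WEIGHTS[i][0] for i in input_ids)
    pos = sum(TOY_WEIGHTS[i][1] for i in input_ids)
    return [neg, pos]


def postprocess(logits):
    """Step 3: softmax the logits and report the most likely label."""
    exps = [math.exp(x) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


result = postprocess(toy_model(preprocess("I love this")))
print(result)
```

A real pipeline follows the same outline, with a trained tokenizer in step 1 and a neural network in step 2.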

Some of the currently available pipelines are:

- feature-extraction (get the vector representation of a text)
- fill-mask
- ner (named entity recognition)
- question-answering
- sentiment-analysis
- summarization
- text-generation
- translation
- zero-shot-classification

Let’s have a look at a few of these!

## 1.2 Zero-shot classification
We’ll start by tackling a more challenging task where we need to classify texts that haven’t been labelled. This is a common scenario in real-world projects because annotating text is usually time-consuming and requires domain expertise. For this use case, the zero-shot-classification pipeline is very powerful: it allows you to specify which labels to use for the classification, so you don’t have to rely on the labels of the pretrained model. You’ve already seen how the model can classify a sentence as positive or negative using those two labels — but it can also classify the text using any other set of labels you like.

In [4]:
classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course using the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This is a course using the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8637737035751343, 0.09807975590229034, 0.038146551698446274]}

This pipeline is called zero-shot because you don’t need to fine-tune the model on your data to use it. It can directly return probability scores for any list of labels you want!

In zero-shot classification, we give the model a natural-language description of what we want it to do, together with the text to classify, but no examples of the task being completed. This differs from one-shot or few-shot classification, where the input includes one or a few worked examples of the task.

Zero-, one- and few-shot abilities seem to be an emergent feature of large language models, appearing at roughly 100M parameters and above. Effectiveness on these tasks seems to scale with model size: larger models (with more trainable parameters or layers) generally do better.

Conceptually, this amounts to asking the model a question like:

```
Classify the following input text into one of the following three categories: [education, business, politics]

Input Text: This is a course using the Transformers library
Category: Education
```
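
For reference, the default model used here (facebook/bart-large-mnli) is a natural language inference model: the pipeline turns each candidate label into a hypothesis sentence and scores how strongly the input text entails it. The sketch below does not call the model. It only rebuilds those hypothesis sentences, assuming the pipeline's default template `"This example is {}."`, and checks that the scores reported above behave like a probability distribution over the candidate labels.

```python
sequence = "This is a course using the Transformers library"
candidate_labels = ["education", "politics", "business"]

# Assumed default hypothesis template of the zero-shot pipeline.
hypothesis_template = "This example is {}."
pairs = [(sequence, hypothesis_template.format(label)) for label in candidate_labels]
for premise, hypothesis in pairs:
    print(f"premise={premise!r} hypothesis={hypothesis!r}")

# Scores copied from the pipeline output above, in the order it returned them.
reported_scores = [0.8637737035751343, 0.09807975590229034, 0.038146551698446274]

total = sum(reported_scores)
is_sorted = reported_scores == sorted(reported_scores, reverse=True)
print(f"sum of scores = {total:.6f}, sorted descending = {is_sorted}")
```

Because the scores are normalized across the candidate labels, adding or removing a label changes the score of every other label.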

In [5]:
# try it out yourself with some examples

## 1.3 Text generation

Now let’s see how to use a pipeline to generate some text. The main idea here is that you provide a prompt and the model will auto-complete it by generating the remaining text. This is similar to the predictive text feature that is found on many phones. Text generation involves randomness, so it’s normal if you don’t get the same results as shown below.

In [6]:
generator = pipeline("text-generation")
generator("In this course, we will teach you how to")

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


  "You have modified the pretrained model configuration to control generation. This is a"
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to work it. What to do and how to be good at it are the most important things. We do this throughout each day, but the most important part is understanding the difference between these two. It'}]

You can control how many different sequences are generated with the argument `num_return_sequences` and the total length of the output text with the argument `max_length`.

In [7]:
generator("In this course, we will teach you how to", num_return_sequences=3, max_length=50)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "In this course, we will teach you how to write JavaScript, in all of its complexity, from scratch. All of our lectures will be written in JavaScript, as it's only a limited subset of JavaScript, so we are not asking you to be"},
 {'generated_text': 'In this course, we will teach you how to use CSS to write your own CSS files to allow you to manipulate the site and layout without having to read CSS files yourself.\n\nWe will be using Sass for the content creation and styling but we'},
 {'generated_text': 'In this course, we will teach you how to solve a problem from your perspective, use our tools to solve a problem from a different perspective; and when a problem was raised from your perspective, you can add your contribution in that point of view.'}]
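
To see why the outputs differ from run to run, here is a toy sampler. It is not GPT-2: it draws words at random from a hand-written table of next-word probabilities that is entirely made up. The `max_length` and `num_return_sequences` parameters mirror the pipeline arguments in name only, and here `max_length` counts words rather than tokens. Fixing the seed makes the draws repeatable.

```python
import random

# Hypothetical next-word table: word -> list of (next_word, probability).
NEXT_WORD = {
    "to": [("write", 0.5), ("use", 0.3), ("solve", 0.2)],
    "write": [("code", 0.6), ("tests", 0.4)],
    "use": [("tools", 0.7), ("code", 0.3)],
    "solve": [("problems", 1.0)],
}


def toy_generate(prompt, max_length, num_return_sequences, seed=None):
    """Sample continuations word by word until max_length words or a dead end."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(num_return_sequences):
        words = prompt.split()
        while len(words) < max_length and words[-1] in NEXT_WORD:
            candidates, weights = zip(*NEXT_WORD[words[-1]])
            words.append(rng.choices(candidates, weights=weights)[0])
        outputs.append(" ".join(words))
    return outputs


first = toy_generate("how to", max_length=4, num_return_sequences=3, seed=0)
second = toy_generate("how to", max_length=4, num_return_sequences=3, seed=0)
print(first)
print(first == second)  # same seed, same samples
```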

You can also specify which model to use in your pipeline:

In [8]:
generator = pipeline("text-generation", model="distilgpt2")
generator("In this course, we will teach you how to")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "In this course, we will teach you how to use the following techniques to control your own body.\n\n\n\n\n\nOnce you are ready, go back to the previous page!\nDon't do this again!\nThe previous page was"}]

## 1.4 Mask filling
The idea of this task is to fill in the blanks in a given text:

In [9]:
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'score': 0.1961982548236847,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.04052723944187164,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'}]

The `top_k` argument controls how many possibilities you want to be displayed. Note that here the model fills in the special `<mask>` word, which is often referred to as a mask token. Other mask-filling models might have different mask tokens, so it’s always good to verify the proper mask word when exploring other models.
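
One way to avoid hard-coding the wrong placeholder is to build the prompt from the model's mask token. The sketch below uses a small hand-written lookup with the two mask tokens that appear in this notebook (`<mask>` for distilroberta-base and `[MASK]` for bert-base-uncased). With a real tokenizer, the same string is available as `tokenizer.mask_token`. The `build_prompt` helper is hypothetical.

```python
# Mask tokens as they appear in this notebook's code and outputs.
MASK_TOKENS = {
    "distilroberta-base": "<mask>",
    "bert-base-uncased": "[MASK]",
}


def build_prompt(template, model_name):
    """Fill the {mask} placeholder with the mask token of the chosen model."""
    return template.format(mask=MASK_TOKENS[model_name])


template = "This course will teach you all about {mask} models."
for name in MASK_TOKENS:
    print(name, "->", build_prompt(template, name))
```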

## 1.5 Named entity recognition
Named entity recognition (NER) is a task where the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations. Let’s look at an example:

In [10]:
ner = pipeline("ner", grouped_entities=True)
ner("My name is Thomas and I work at Business & Decision in Brussels.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


  "`grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to"


[{'entity_group': 'PER',
  'score': 0.99934226,
  'word': 'Thomas',
  'start': 11,
  'end': 17},
 {'entity_group': 'ORG',
  'score': 0.99677676,
  'word': 'Business & Decision',
  'start': 32,
  'end': 51},
 {'entity_group': 'LOC',
  'score': 0.99659485,
  'word': 'Brussels',
  'start': 55,
  'end': 63}]

We pass the option `grouped_entities=True` when creating the pipeline to tell it to regroup the parts of the sentence that correspond to the same entity: here the model correctly grouped “Business”, “&” and “Decision” as a single organization, even though the name consists of multiple words. As the warning above indicates, this option is deprecated; recent versions of Transformers use `aggregation_strategy="simple"` to get the same grouping. In fact, as we will see in the Tokenization section, the preprocessing even splits some words into smaller parts. In the post-processing step, the pipeline successfully regrouped those pieces.
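
The `start` and `end` fields in the output are character offsets into the input string, so each entity can be recovered by slicing. The check below uses only the values shown in the output above, with the scores omitted.

```python
text = "My name is Thomas and I work at Business & Decision in Brussels."

# Entities copied from the pipeline output above (scores omitted).
entities = [
    {"entity_group": "PER", "word": "Thomas", "start": 11, "end": 17},
    {"entity_group": "ORG", "word": "Business & Decision", "start": 32, "end": 51},
    {"entity_group": "LOC", "word": "Brussels", "start": 55, "end": 63},
]

for ent in entities:
    # Python slices exclude the end index, matching the pipeline's convention.
    span = text[ent["start"]:ent["end"]]
    print(f'{ent["entity_group"]}: {span!r}')
```

These offsets are handy for highlighting entities in the original text without searching for the strings again.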

## 1.6 Question answering

The question-answering pipeline answers questions using information from a given context:

In [11]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Thomas and I work at Business & Decision in Brussels",
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.6524998545646667,
 'start': 32,
 'end': 51,
 'answer': 'Business & Decision'}

## 1.7 Summarization

Summarization is the task of reducing a text into a shorter text while keeping all (or most) of the important aspects referenced in the text. Here’s an example:

In [12]:
summarizer = pipeline("summarization")
summarizer(
    """
Business & Decision, part of Orange Group, is one of the world's leading management consultancies and system integrators for Data Intelligence & Digital Experience.  

We are Data Native Artists who leverage a unique combination of technical, functional and industry specialization, as well as partnerships with key software vendors, to deliver state of the art solutions since 1992.     

As a front runner in Big Data, Artificial Intelligence and Digital, Business & Decision is enabling customers to innovate, drive their business strategy and improve customer experience through effective use of data.   

Clients choose Business & Decision as their strategic Data & Digital partner due to our pioneering vision, expertise, core values, quality of service & our passion for delivery.
"""
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': " Business & Decision, part of Orange Group, is one of the world's leading management consultancies and system integrators for Data Intelligence & Digital Experience . It is enabling customers to\u202finnovate,\u202fdrive their business strategy and improve customer experience through effective use of data ."}]

As with text generation, you can specify a `max_length` or a `min_length` for the result.

## 1.8 Translation

For translation, you can use a default model if you provide a language pair in the task name (such as "translation_en_to_fr"), but the easiest way is to pick the model you want to use on the [Model Hub](https://huggingface.co/models). Here we’ll try translating from French to English:

In [13]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours utilise le package python crée par Hugging Face.")

[{'translation_text': 'This course uses the python package created by Hugging Face.'}]

# Tokenization
To showcase the different tokenization techniques, let's start with some movie review data:

In [14]:
from fastai.data.external import untar_data, URLs

path = untar_data(URLs.IMDB_SAMPLE)

In [15]:
import os
os.listdir(path)

['texts.csv']

In [16]:
import pandas as pd

df = pd.read_csv(path/'texts.csv')
df.head()

Unnamed: 0,label,text,is_valid
0,negative,"Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff!",False
1,positive,"This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is som...",False
2,negative,"Every once in a long while a movie will come along that will be so awful that I feel compelled to warn people. If I labor all my days and I can save but one soul from watching this movie, how great will be my joy.<br /><br />Where to begin my discussion of pain. For starters, there was a musical montage every five minutes. There was no character development. Every character was a stereotype. We had swearing guy, fat guy who eats donuts, goofy foreign guy, etc. The script felt as if it were being written as the movie was being shot. The production value was so incredibly low that it felt li...",False
3,positive,"Name just says it all. I watched this movie with my dad when it came out and having served in Korea he had great admiration for the man. The disappointing thing about this film is that it only concentrate on a short period of the man's life - interestingly enough the man's entire life would have made such an epic bio-pic that it is staggering to imagine the cost for production.<br /><br />Some posters elude to the flawed characteristics about the man, which are cheap shots. The theme of the movie ""Duty, Honor, Country"" are not just mere words blathered from the lips of a high-brassed offic...",False
4,negative,"This movie succeeds at being one of the most unique movies you've seen. However this comes from the fact that you can't make heads or tails of this mess. It almost seems as a series of challenges set up to determine whether or not you are willing to walk out of the movie and give up the money you just paid. If you don't want to feel slighted you'll sit through this horrible film and develop a real sense of pity for the actors involved, they've all seen better days, but then you realize they actually got paid quite a bit of money to do this and you'll lose pity for them just like you've alr...",False


In [17]:
from datasets import Dataset,DatasetDict

ds = Dataset.from_pandas(df)

In [18]:
ds

Dataset({
    features: ['label', 'text', 'is_valid'],
    num_rows: 1000
})

But we can't pass the texts directly into a model. A deep learning model expects numbers as inputs, not English sentences! So we need to do two things:

- Tokenization: Split each text up into words (or actually, as we'll see, into tokens)
- Numericalization: Convert each word (or token) into a number.
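
These two steps can be sketched with the standard library alone. This is a deliberately naive version, assuming a regular-expression tokenizer and a vocabulary built from two made-up sentences. Real tokenizers use a trained sub-word vocabulary instead, as shown later in this notebook.

```python
import re

toy_texts = ["I loved this movie", "I hated this movie!"]


def simple_tokenize(text):
    """Lowercase, then split into runs of word characters or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


# Build a vocabulary: every distinct token gets the next free integer.
toy_vocab = {}
for text in toy_texts:
    for token in simple_tokenize(text):
        toy_vocab.setdefault(token, len(toy_vocab))


def numericalize(text):
    """Map each token to its integer ID."""
    return [toy_vocab[token] for token in simple_tokenize(text)]


print(toy_vocab)
print(numericalize("I hated this movie!"))
```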

The details about how this is done actually depend on the particular model we use. So first we'll need to pick a model. There are thousands of models available, but a reasonable starting point for nearly any NLP problem is to use this:


In [19]:
model_nm = 'bert-base-uncased'

`AutoTokenizer` will create a tokenizer appropriate for a given model:

In [20]:
from transformers import AutoModelForSequenceClassification,AutoTokenizer
tokz = AutoTokenizer.from_pretrained(model_nm)

In [21]:
tokz

BertTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

Here's an example of how the tokenizer splits a text into "tokens" (which are like words, but can be sub-word pieces, as you see below):

In [22]:
tokz.tokenize("Hi everyone, I'm happy to be here with all of you and discover the wonderful world of deep learning together using huggingface.")

['hi',
 'everyone',
 ',',
 'i',
 "'",
 'm',
 'happy',
 'to',
 'be',
 'here',
 'with',
 'all',
 'of',
 'you',
 'and',
 'discover',
 'the',
 'wonderful',
 'world',
 'of',
 'deep',
 'learning',
 'together',
 'using',
 'hugging',
 '##face',
 '.']

Uncommon words (such as huggingface) are split into pieces. In BERT's WordPiece tokenizer, every piece that does not start a word is prefixed with "##".


In [23]:
tokz.tokenize("A platypus is an ornithorhynchus anatinus.")

['a',
 'pl',
 '##at',
 '##yp',
 '##us',
 'is',
 'an',
 'or',
 '##ni',
 '##thor',
 '##hy',
 '##nch',
 '##us',
 'ana',
 '##tin',
 '##us',
 '.']
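
Because the "##" prefix marks continuation pieces, the original words can be rebuilt by gluing each prefixed piece onto the token before it. The helper below, `merge_wordpieces`, is a name introduced here for illustration and is not part of the Transformers API.

```python
def merge_wordpieces(tokens):
    """Glue '##'-prefixed pieces onto the preceding token to rebuild whole words."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # drop the '##' marker and append
        else:
            words.append(token)
    return words


# Pieces copied from the tokenizer output above.
pieces = ["a", "pl", "##at", "##yp", "##us", "is", "an",
          "or", "##ni", "##thor", "##hy", "##nch", "##us",
          "ana", "##tin", "##us", "."]
print(merge_wordpieces(pieces))
```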

We can now write a small function that tokenizes our text:

In [24]:
def tok_func(x): return tokz(x["text"], max_length=512, truncation=True)

To run this quickly on every row in our dataset, use `map`; with `batched=True`, rows are passed to the tokenizer in batches rather than one at a time:

In [25]:
tok_ds = ds.map(tok_func, batched=True)

This adds new columns to our dataset, including `input_ids` (alongside `token_type_ids` and `attention_mask`). For instance, here are the text and the input IDs for the first row of our data:

In [26]:
row = tok_ds[0]
row['text'], row['input_ids']

("Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff!",
 [101,
  4895,
  1011,
  1038,
  10559,
  4691,
  1011,
  19337,
  2666,
  12423,
  999,
  12669,
  4575,
  2987,
  1005,
  1056,
  2130,
  2298,
  2014,
  5156,
  2566,
  2102,
  8840,
  12423,
  2969,
  1999,
  2023,
  1010,
  2029,
  5373,
  3084,
  2033,
  9641,
  2014,
  8467,
  16356,
  2100,
  3772,
  8040,
  11039,
  6799,
  1012,
  2524,
  2000,
  2903,
  2016,
  2001,
  1996,
  3135,
  2006,
  2023,
  3899,
  1012,
  4606,
  4901,
  1047,
  4179,
  1024,
  2054,
  2785,
  1997,
  5920,
  4440,
  2038,
  2010,
  2476,
  2042,
  2006,
  1029,
  2040,
  17369,
  1012,
  1012,


So, what are those IDs and where do they come from? The secret is that the tokenizer holds a dictionary called `vocab` that maps every token string it knows to a unique integer. We can look tokens up like this, for instance to find the ID of the token "her":

In [27]:
tokz.vocab['her']

2014

Looking above at our input IDs, we do indeed see that `2014` appears as expected.
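
The lookup also works in the other direction. The sketch below inverts a tiny hand-copied slice of the bert-base-uncased vocabulary to map IDs back to tokens. The helper name `ids_to_tokens` and the fallback to `[UNK]` for IDs outside the slice are choices made for this illustration.

```python
# A tiny slice of the bert-base-uncased vocabulary (token -> ID).
partial_vocab = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102, "her": 2014}

# Invert the mapping to go from ID back to token.
id_to_token = {idx: tok for tok, idx in partial_vocab.items()}


def ids_to_tokens(ids):
    """Look up each ID, falling back to [UNK] for IDs outside this small slice."""
    return [id_to_token.get(i, "[UNK]") for i in ids]


print(ids_to_tokens([101, 2014, 102]))
```

This also explains the `101` at the start of the input IDs above: it is the `[CLS]` token that the tokenizer adds in front of every sequence.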

## Implementing WordPiece
Now let’s take a look at an implementation of the WordPiece algorithm. This is just pedagogical, and you won’t be able to use this on a big corpus.

In [28]:
corpus = [
    "This is the introduction to Deep Learning Course.",
    "This chapter is about tokenization.",
    "This section shows the WordPiece tokenizer algorithm.",
    "Hopefully, you will be able to understand how it is trained and generate tokens.",
]

First, we need to pre-tokenize the corpus into words. Since we are replicating a WordPiece tokenizer (like BERT), we will use the `bert-base-cased` tokenizer for the pre-tokenization:

In [29]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(corpus[0])

[('This', (0, 4)),
 ('is', (5, 7)),
 ('the', (8, 11)),
 ('introduction', (12, 24)),
 ('to', (25, 27)),
 ('Deep', (28, 32)),
 ('Learning', (33, 41)),
 ('Course', (42, 48)),
 ('.', (48, 49))]

Then we compute the frequencies of each word in the corpus as we do the pre-tokenization:

In [30]:
from collections import defaultdict

word_freqs = defaultdict(int)
for text in corpus:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    new_words = [word for word, offset in words_with_offsets]
    for word in new_words:
        word_freqs[word] += 1

word_freqs

defaultdict(int,
            {'This': 3,
             'is': 3,
             'the': 2,
             'introduction': 1,
             'to': 2,
             'Deep': 1,
             'Learning': 1,
             'Course': 1,
             '.': 4,
             'chapter': 1,
             'about': 1,
             'tokenization': 1,
             'section': 1,
             'shows': 1,
             'WordPiece': 1,
             'tokenizer': 1,
             'algorithm': 1,
             'Hopefully': 1,
             ',': 1,
             'you': 1,
             'will': 1,
             'be': 1,
             'able': 1,
             'understand': 1,
             'how': 1,
             'it': 1,
             'trained': 1,
             'and': 1,
             'generate': 1,
             'tokens': 1})
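
The same counts can be reproduced without downloading a tokenizer. The sketch below swaps the BERT pre-tokenizer for a simple regular expression, an assumption that happens to give the same words on this small corpus but would not in general, and uses `collections.Counter` for the counting. Names are chosen so they don't overwrite the notebook's variables.

```python
import re
from collections import Counter

sample_corpus = [
    "This is the introduction to Deep Learning Course.",
    "This chapter is about tokenization.",
    "This section shows the WordPiece tokenizer algorithm.",
    "Hopefully, you will be able to understand how it is trained and generate tokens.",
]


def simple_pre_tokenize(text):
    """Assumed stand-in for the BERT pre-tokenizer: words or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


toy_word_freqs = Counter(
    word for text in sample_corpus for word in simple_pre_tokenize(text)
)
print(toy_word_freqs.most_common(3))
```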

The alphabet is the unique set composed of all the first letters of words, plus all the other letters that appear in words, prefixed by ##:

In [31]:
alphabet = []
for word in word_freqs.keys():
    if word[0] not in alphabet:
        alphabet.append(word[0])
    for letter in word[1:]:
        if f"##{letter}" not in alphabet:
            alphabet.append(f"##{letter}")

alphabet.sort()

print(alphabet)

['##P', '##a', '##b', '##c', '##d', '##e', '##f', '##g', '##h', '##i', '##k', '##l', '##m', '##n', '##o', '##p', '##r', '##s', '##t', '##u', '##w', '##y', '##z', ',', '.', 'C', 'D', 'H', 'L', 'T', 'W', 'a', 'b', 'c', 'g', 'h', 'i', 's', 't', 'u', 'w', 'y']
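
As a side note, the same construction can be written with sets, which avoids scanning a list on every membership test. The sketch below applies it to three words from the corpus only, to keep the result easy to check by hand.

```python
toy_words = ["This", "is", "the"]

# First letters stay bare; every later letter gets the "##" continuation prefix.
first_letters = {word[0] for word in toy_words}
continuations = {f"##{letter}" for word in toy_words for letter in word[1:]}

toy_alphabet = sorted(first_letters | continuations)
print(toy_alphabet)
```

Sorting places the "##" entries first because "#" comes before letters and most punctuation in character order, which matches the layout of the full alphabet printed above.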


We also add the special tokens used by the model at the beginning of that vocabulary. In the case of BERT, it’s the list `["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]`:

In [32]:
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"] + alphabet.copy()

Next we need to split each word, with all the letters that are not the first prefixed by ##:

In [33]:
splits = {
    word: [c if i == 0 else f"##{c}" for i, c in enumerate(word)]
    for word in word_freqs.keys()
}
splits

{'This': ['T', '##h', '##i', '##s'],
 'is': ['i', '##s'],
 'the': ['t', '##h', '##e'],
 'introduction': ['i',
  '##n',
  '##t',
  '##r',
  '##o',
  '##d',
  '##u',
  '##c',
  '##t',
  '##i',
  '##o',
  '##n'],
 'to': ['t', '##o'],
 'Deep': ['D', '##e', '##e', '##p'],
 'Learning': ['L', '##e', '##a', '##r', '##n', '##i', '##n', '##g'],
 'Course': ['C', '##o', '##u', '##r', '##s', '##e'],
 '.': ['.'],
 'chapter': ['c', '##h', '##a', '##p', '##t', '##e', '##r'],
 'about': ['a', '##b', '##o', '##u', '##t'],
 'tokenization': ['t',
  '##o',
  '##k',
  '##e',
  '##n',
  '##i',
  '##z',
  '##a',
  '##t',
  '##i',
  '##o',
  '##n'],
 'section': ['s', '##e', '##c', '##t', '##i', '##o', '##n'],
 'shows': ['s', '##h', '##o', '##w', '##s'],
 'WordPiece': ['W', '##o', '##r', '##d', '##P', '##i', '##e', '##c', '##e'],
 'tokenizer': ['t', '##o', '##k', '##e', '##n', '##i', '##z', '##e', '##r'],
 'algorithm': ['a', '##l', '##g', '##o', '##r', '##i', '##t', '##h', '##m'],
 'Hopefully': ['H', '##o', '##p

Now that we are ready for training, let’s write a function that computes the score of each pair. We’ll need to use this at each step of the training:

In [34]:
def compute_pair_scores(splits):
    letter_freqs = defaultdict(int)
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        split = splits[word]
        if len(split) == 1:
            letter_freqs[split[0]] += freq
            continue
        for i in range(len(split) - 1):
            pair = (split[i], split[i + 1])
            letter_freqs[split[i]] += freq
            pair_freqs[pair] += freq
        letter_freqs[split[-1]] += freq

    scores = {
        pair: freq / (letter_freqs[pair[0]] * letter_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }
    return scores

Let’s have a look at a part of this dictionary after the initial splits:

In [35]:
pair_scores = compute_pair_scores(splits)
for i, key in enumerate(pair_scores.keys()):
    print(f"{key}: {pair_scores[key]}")
    if i >= 5:
        break

('T', '##h'): 0.125
('##h', '##i'): 0.028846153846153848
('##i', '##s'): 0.023076923076923078
('i', '##s'): 0.06
('t', '##h'): 0.03125
('##h', '##e'): 0.011363636363636364
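
To make the formula concrete: the score of a pair is its frequency divided by the product of the frequencies of its two parts, score(a, b) = freq(ab) / (freq(a) × freq(b)). The self-contained check below recomputes a few of these scores from the corpus. It replaces the BERT pre-tokenizer with a simple regular expression (an assumption that happens to give the same words on this corpus) and uses its own variable names so it doesn't overwrite the notebook's.

```python
import re
from collections import Counter

sample_corpus = [
    "This is the introduction to Deep Learning Course.",
    "This chapter is about tokenization.",
    "This section shows the WordPiece tokenizer algorithm.",
    "Hopefully, you will be able to understand how it is trained and generate tokens.",
]

toy_word_freqs = Counter(
    w for text in sample_corpus for w in re.findall(r"\w+|[^\w\s]", text)
)

piece_freqs = Counter()      # how often each piece occurs, weighted by word frequency
toy_pair_freqs = Counter()   # how often each adjacent pair of pieces occurs
for word, freq in toy_word_freqs.items():
    pieces = [c if i == 0 else f"##{c}" for i, c in enumerate(word)]
    for piece in pieces:
        piece_freqs[piece] += freq
    for pair in zip(pieces, pieces[1:]):
        toy_pair_freqs[pair] += freq


def pair_score(pair):
    """freq(pair) / (freq(first) * freq(second))"""
    return toy_pair_freqs[pair] / (piece_freqs[pair[0]] * piece_freqs[pair[1]])


for pair in [("T", "##h"), ("i", "##s"), ("a", "##b")]:
    a, b = pair
    print(pair, toy_pair_freqs[pair], piece_freqs[a], piece_freqs[b], pair_score(pair))
```

For ('T', '##h'), the pair occurs 3 times (in "This"), 'T' occurs 3 times and '##h' occurs 8 times, giving 3 / (3 × 8) = 0.125. Dividing by the frequencies of the parts is what makes WordPiece favor pairs whose parts are rare on their own.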


Now, finding the pair with the best score only takes a quick loop:

In [36]:
best_pair = ""
max_score = None
for pair, score in pair_scores.items():
    if max_score is None or max_score < score:
        best_pair = pair
        max_score = score

print(best_pair, max_score)

('a', '##b') 0.25
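
The same selection can be written in one line with `max` and a key function. The sketch below uses a few made-up scores, including a deliberate tie, to show that `max` returns the first key with the highest score, just like the loop with its strict `<` comparison.

```python
# Made-up scores for illustration; ('x', '##y') and ('p', '##q') tie for the maximum.
toy_scores = {("T", "##h"): 0.125, ("x", "##y"): 0.25, ("p", "##q"): 0.25}

# max() keeps the first maximal key it encounters, in dictionary insertion order.
toy_best = max(toy_scores, key=toy_scores.get)
print(toy_best, toy_scores[toy_best])
```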


So the first merge to learn is ('a', '##b') -> 'ab', and we add 'ab' to the vocabulary:

In [37]:
vocab.append("ab")

To continue, we need to apply that merge in our splits dictionary. Let’s write another function for this:

In [38]:
def merge_pair(a, b, splits):
    for word in word_freqs:
        split = splits[word]
        if len(split) == 1:
            continue
        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i + 1] == b:
                merge = a + b[2:] if b.startswith("##") else a + b
                split = split[:i] + [merge] + split[i + 2 :]
            else:
                i += 1
        splits[word] = split
    return splits

And we can have a look at the result of the first merge:

In [39]:
splits = merge_pair("a", "##b", splits)
splits["about"]

['ab', '##o', '##u', '##t']

Now we have everything we need to loop until we have learned all the merges we want. Let’s aim for a vocab size of 100:

In [40]:
vocab_size = 100
while len(vocab) < vocab_size:
    scores = compute_pair_scores(splits)
    best_pair, max_score = "", None
    for pair, score in scores.items():
        if max_score is None or max_score < score:
            best_pair = pair
            max_score = score
    # Log the pair merged at this step.
    print(f"Best pair is {best_pair}")
    splits = merge_pair(*best_pair, splits)
    new_token = (
        best_pair[0] + best_pair[1][2:]
        if best_pair[1].startswith("##")
        else best_pair[0] + best_pair[1]
    )
    vocab.append(new_token)

Best pair is ('##f', '##u')
Best pair is ('##d', '##P')
Best pair is ('##fu', '##l')
Best pair is ('##ful', '##l')
Best pair is ('##full', '##y')
Best pair is ('T', '##h')
Best pair is ('c', '##h')
Best pair is ('##h', '##m')
Best pair is ('ch', '##a')
Best pair is ('cha', '##p')
Best pair is ('s', '##h')
Best pair is ('t', '##h')
Best pair is ('a', '##l')
Best pair is ('al', '##g')
Best pair is ('ab', '##l')
Best pair is ('##l', '##l')
Best pair is ('chap', '##t')
...

We can then look at the generated vocabulary:

In [41]:
print(vocab)

['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]', '##P', '##a', '##b', '##c', '##d', '##e', '##f', '##g', '##h', '##i', '##k', '##l', '##m', '##n', '##o', '##p', '##r', '##s', '##t', '##u', '##w', '##y', '##z', ',', '.', 'C', 'D', 'H', 'L', 'T', 'W', 'a', 'b', 'c', 'g', 'h', 'i', 's', 't', 'u', 'w', 'y', 'ab', '##fu', '##dP', '##ful', '##full', '##fully', 'Th', 'ch', '##hm', 'cha', 'chap', 'sh', 'th', 'al', 'alg', 'abl', '##ll', 'chapt', '##thm', '##za', '##zat', '##rdP', '##ct', '##uct', '##duct', 'Thi', 'This', '##ducti', '##izat', '##izati', '##cti', '##rdPi', '##iz', '##ithm', 'wi', 'will', '##ai', '##rithm', '##rai', 'trai', 'is', '##ws', 'it', '##rs', '##urs', '##rst', '##rsta', '##ut', '##at', '##tr', '##ar', '##rat', 'in']


To tokenize a new text, we pre-tokenize it, split it, then apply the tokenization algorithm on each word. That is, we look for the biggest subword starting at the beginning of the first word and split it, then we repeat the process on the second part, and so on for the rest of that word and the following words in the text:

In [42]:
def encode_word(word):
    tokens = []
    while len(word) > 0:
        i = len(word)
        while i > 0 and word[:i] not in vocab:
            i -= 1
        if i == 0:
            return ["[UNK]"]
        tokens.append(word[:i])
        word = word[i:]
        if len(word) > 0:
            word = f"##{word}"
    return tokens

Let’s test it on one word that can be built from tokens in the vocabulary, and another that can’t:

In [43]:
print(encode_word("Hugging"))
print(encode_word("HOgging"))

['H', '##u', '##g', '##g', '##i', '##n', '##g']
['[UNK]']
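
The greedy longest-match-first rule can also be tried in isolation. The sketch below is a standalone copy of the idea using a handful of tokens taken from the learned vocabulary; `toy_encode_word` and `small_vocab` are names introduced here. A set is used so membership tests stay fast.

```python
# A few tokens copied from the learned vocabulary printed earlier.
small_vocab = {"This", "Thi", "Th", "T", "th", "t", "is", "i", "##e", "##h", "##i", "##s"}


def toy_encode_word(word, vocab):
    """Repeatedly take the longest vocabulary entry that prefixes the remaining word."""
    tokens = []
    while word:
        end = len(word)
        while end > 0 and word[:end] not in vocab:
            end -= 1
        if end == 0:
            return ["[UNK]"]  # no prefix matches: the whole word is unknown
        tokens.append(word[:end])
        word = word[end:]
        if word:
            word = "##" + word  # the remainder continues a word
    return tokens


for w in ["This", "the", "Intro"]:
    print(w, "->", toy_encode_word(w, small_vocab))
```

Note that a single unmatched piece turns the whole word into `[UNK]`, even if earlier pieces matched.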


Now, let’s write a function that tokenizes a text:

In [44]:
def tokenize(text):
    pre_tokenize_result = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in pre_tokenize_result]
    encoded_words = [encode_word(word) for word in pre_tokenized_text]
    return sum(encoded_words, [])

We can try it on any text:

In [45]:
tokenize("This is the Intro to Deep Learning course!")

['This',
 'is',
 'th',
 '##e',
 '[UNK]',
 't',
 '##o',
 'D',
 '##e',
 '##e',
 '##p',
 'L',
 '##e',
 '##ar',
 '##n',
 '##i',
 '##n',
 '##g',
 'c',
 '##o',
 '##urs',
 '##e',
 '[UNK]']

That’s it for the WordPiece algorithm!