In [None]:
!pip install transformers

# Converting Tokens to IDs

When the BERT model was trained, each token was given a unique ID. Hence, when we want to use a pre-trained BERT model, we first need to convert each token in the input sentence into its corresponding unique ID.

There is an important point to note when we use a pre-trained model. Since the model is pre-trained on a certain corpus, its vocabulary is also fixed. In other words, when we apply a pre-trained model to some other data, some tokens in the new data might not appear in the fixed vocabulary of the pre-trained model. This is commonly known as the **out-of-vocabulary (OOV)** problem.

Tokens that do not appear in the original vocabulary are, by design, replaced with the special token **[UNK]**, which stands for unknown token.
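As a minimal sketch of this fallback, the lookup can be pictured as a dictionary query with a default value. The heavily reduced vocabulary below is hypothetical, and `toy_vocab` and `tokens_to_ids` are illustrative names rather than part of any library. The IDs are copied from the `bert-base-cased` outputs recorded in this notebook.

```python
# Hypothetical, heavily reduced vocabulary; a real BERT vocabulary
# contains tens of thousands of entries.
toy_vocab = {"[UNK]": 100, "He": 1124, "remains": 2606, "and": 1105, ".": 119}


def tokens_to_ids(tokens, vocab, unk_token="[UNK]"):
    """Map each token to its ID, falling back to the [UNK] ID for unseen tokens."""
    unk_id = vocab[unk_token]
    return [vocab.get(token, unk_id) for token in tokens]


ids = tokens_to_ids(["He", "remains", "characteristically"], toy_vocab)
print(ids)  # [1124, 2606, 100]
```

The unseen word ends up with the same ID as every other unseen word, which is exactly the information loss that subword tokenization tries to avoid.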

However, converting all unseen tokens into **[UNK]** discards a lot of information from the input data. Hence, BERT uses the **WordPiece** algorithm, which breaks a word into several subwords so that commonly seen subwords can still be represented by the model.

For example, the word **characteristically** does not appear in the original vocabulary. Nevertheless, when we use the BERT tokenizer to tokenize a sentence containing this word, we get something as shown below:

In [None]:
from transformers import BertTokenizer
tz = BertTokenizer.from_pretrained('bert-base-cased')

In [None]:
tz.convert_tokens_to_ids(["characteristically"])

[100]

We can see that, if we do not first apply the tokenization function of the **BERT model**, the word **characteristically** is converted to the ID 100, which is the ID of the token **[UNK]**.

The **BERT tokenization** function, on the other hand, first breaks the word into **two subwords**, namely **characteristic** and **##ally**. The first token is a more commonly seen word (prefix) in the corpus, and the second token is prefixed by two hashes (##) to indicate that it is a suffix following some other subword.
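To make the splitting rule concrete, here is a rough sketch of greedy longest-match-first lookup, which is the matching strategy WordPiece tokenizers apply at inference time. It is a simplified illustration, not the library's implementation: the function name `wordpiece_tokenize` and the three-entry vocabulary are assumptions made for this example.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into subwords using greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position, so the whole word becomes [UNK].
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, just large enough for the example word.
toy_vocab = {"characteristic", "##ally", "confident"}

print(wordpiece_tokenize("characteristically", toy_vocab))  # ['characteristic', '##ally']
print(wordpiece_tokenize("xyz", toy_vocab))  # ['[UNK]']
```

With the real `bert-base-cased` vocabulary the same rule applies, only over a much larger set of pieces.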

After this **tokenization step**, all tokens can be converted into their **corresponding IDs.**

In [None]:
sent = "He remains characteristically confident and optimistic."
tz.tokenize(sent)

['He',
 'remains',
 'characteristic',
 '##ally',
 'confident',
 'and',
 'optimistic',
 '.']

In [None]:
tz.convert_tokens_to_ids(tz.tokenize(sent))

[1124, 2606, 7987, 2716, 9588, 1105, 24876, 119]

In [None]:
# Original Sentence
origin_sent = "Let's learn deep learning!"

In [None]:
token_sent = tz.tokenize(origin_sent)
token_sent

['Let', "'", 's', 'learn', 'deep', 'learning', '!']

In [None]:
# Add the [CLS] and [SEP] tokens, followed by one [PAD] token
pad_sent = '[CLS]' + origin_sent + '[SEP]' + '[PAD]'

In [None]:
token_sent = tz.tokenize(pad_sent)
token_sent

['[CLS]', 'Let', "'", 's', 'learn', 'deep', 'learning', '!', '[SEP]', '[PAD]']

In [None]:
#Convert to IDs
tz.convert_tokens_to_ids(token_sent)

[101, 2421, 112, 188, 3858, 1996, 3776, 106, 102, 0]

In [None]:
# # Original Sentence
# origin_sent = "Let's learn deep learning!"

# # Tokenized Sentence
# ['Let', "'", 's', 'learn', 'deep', 'learning', '!']

# # Adding [CLS] and [SEP] Tokens
# ['[CLS]', 'Let', "'", 's', 'learn', 'deep', 'learning', '!', '[SEP]']

# # Padding
# ['[CLS]', 'Let', "'", 's', 'learn', 'deep', 'learning', '!', '[SEP]', '[PAD]']

# # Converting to IDs
# [101, 2421, 112, 188, 3858, 1996, 3776, 106, 102, 0]

# Tokenization using the transformers Package

While there are quite a number of steps to transform an input sentence into the appropriate representation, we can use the functions provided by the **transformers package** to help us perform the **tokenization** and **transformation easily.** In particular, we can use the function **encode_plus**, which does the following in one go:
1. **Tokenize** the input sentence.
2. Add the **[CLS]** and **[SEP]** tokens.
3. **Pad or truncate** the sentence to the maximum length allowed.
4. **Encode the tokens** into their corresponding IDs.
5. Create the attention mask, which explicitly differentiates real tokens from **[PAD]** tokens.
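For intuition, the steps after tokenization can be sketched by hand in plain Python. This is only an illustration under stated assumptions: `encode_manually` and `toy_vocab` are hypothetical names, the vocabulary holds just the tokens of the example sentence (with the IDs shown earlier in this notebook), and the real `encode_plus` handles many more cases.

```python
def encode_manually(tokens, vocab, max_length):
    """Add special tokens, truncate or pad to max_length, and build the mask."""
    # Reserve two positions for [CLS] and [SEP], truncating the sentence if needed.
    tokens = ["[CLS]"] + tokens[: max_length - 2] + ["[SEP]"]
    # Every token present so far is a real token.
    attention_mask = [1] * len(tokens)
    # Pad both lists up to max_length.
    n_pad = max_length - len(tokens)
    tokens = tokens + ["[PAD]"] * n_pad
    attention_mask = attention_mask + [0] * n_pad
    # Convert tokens to IDs, using the [UNK] ID for anything unseen.
    input_ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    return input_ids, attention_mask


# Hypothetical reduced vocabulary covering only the example sentence.
toy_vocab = {
    "[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "Let": 2421, "'": 112, "s": 188, "learn": 3858,
    "deep": 1996, "learning": 3776, "!": 106,
}

tokens = ["Let", "'", "s", "learn", "deep", "learning", "!"]
input_ids, attention_mask = encode_manually(tokens, toy_vocab, max_length=12)
print(input_ids)       # [101, 2421, 112, 188, 3858, 1996, 3776, 106, 102, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

A `max_length` of 12 is used here only to keep the printed lists short.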

In [None]:
# Import tokenizer from transformers package
from transformers import BertTokenizer

# Load the tokenizer of the "bert-base-cased" pretrained model
# See https://huggingface.co/transformers/pretrained_models.html for other models
tz = BertTokenizer.from_pretrained("bert-base-cased")

# The sentence to be encoded
sent = "Let's learn deep learning!"

# Encode the sentence
encoded = tz.encode_plus(
    text=sent,  # the sentence to be encoded
    add_special_tokens=True,  # add [CLS] and [SEP]
    max_length=64,  # maximum length of a sentence
    padding="max_length",  # add [PAD]s up to max_length (replaces the deprecated pad_to_max_length=True)
    truncation=True,  # explicitly truncate sentences longer than max_length
    return_attention_mask=True,  # generate the attention mask
    return_tensors="pt",  # ask the function to return PyTorch tensors
)

# Get the input IDs and attention mask in tensor format
input_ids = encoded['input_ids']
attn_mask = encoded['attention_mask']


After executing the code above, the **input_ids** and **attn_mask** variables hold the following content:

In [None]:
input_ids

tensor([[ 101, 2421,  112,  188, 3858, 1996, 3776,  106,  102,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0]])

In [None]:
attn_mask

tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

The "**attention mask**" tells the model **which tokens should be attended to** and which (the **[PAD] tokens**) should not (see the documentation for more detail). It will be needed when we feed the input into the BERT model.
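One common way a mask like this takes effect is by excluding masked positions before the softmax that produces attention weights. The sketch below illustrates that idea in plain Python with made-up scores. `masked_softmax` is a hypothetical helper, not a function of the transformers package, and the library's internal handling may differ in detail.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over scores, giving zero weight to positions where mask is 0.

    Assumes at least one position has mask == 1.
    """
    # Replace the scores of masked positions with negative infinity.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up attention scores for three real tokens and one [PAD] token.
scores = [2.0, 1.0, 0.5, 3.0]
mask = [1, 1, 1, 0]

weights = masked_softmax(scores, mask)
print(weights[-1])  # 0.0
```

Although the padded position has the highest raw score, it receives no weight, so the real tokens share all of the attention.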