# Tokenizer Experiments

This notebook explores how the tokenizers of NLP models such as BERT and GPT-2 handle certain characters. This includes non-linguistic characters like '\\' and HTML input like \<br>, as well as umlauts ("Umlaute") and accented characters. Because we aim to cover multiple languages, special characters used in French and German receive particular attention.

The notebook also explores how language data will need to be preprocessed in the future, especially compared to classical machine learning.

For reference, see https://huggingface.co/transformers/main_classes/tokenizer.html

## Preparations

In [1]:
from IPython.display import display, Markdown
with open('../../doc/env_variables_setup.md', 'r') as fh:
    content = fh.read()
display(Markdown(content))

Environment variables that need to be defined:   
`export DIR_PROJ=your_path_git_repository`  
`export PYTHONPATH=$DIR_PROJ/src`  
`export PATH_TENSORBOARD=your_path_tensorboard`  
`export PATH_DATASETS=your_path_datasets`  
`export PROJECT_ID=your_gcp_project_id`  
`export BUCKET_NAME=your_gcp_gs_bucket_name`  
`export BUCKET_TRANSLATION_NAME=your_gcp_gs_bucket_translation_name`  
`export BUCKET_STAGING_NAME=your_gcp_gs_bucket_staging_name` 
`export REGION=your_region`  
`export PATH_SAVE_MODEL=your_path_to_save_model`  
`export CLOUDSDK_PYTHON=your_path/conda-env/env_gcp_sdk/bin/python`  
`export CLOUDSDK_GSUTIL_PYTHON=your_path/conda-env/env_gcp_sdk/bin/python`  

- Use local Jupyter Lab 
    - you need to have the `jupyter-notebook` Anaconda python environment created [link](local_jupyter_lab_installation.md) 
    - you need to have the `jupyter-notebook` Anaconda python environment activated [link](local_jupyter_lab_installation.md) 
    - then define the environment variables above (copy and paste) 
    - you need to have the `env_multilingual_class` Anaconda python environment created [link](local_jupyter_lab_installation.md)  
    - start Jupyter Lab:  `jupyter lab` 
    - open a Jupyter Lab notebook from `notebook/` 
    - choose the proper Anaconda python environment:  `Python [conda env:env_multilingual_class]` [link](conda_env.md) 
    - clone this repository: `git clone https://github.com/tarrade/proj_multilingual_text_classification.git`


- Use GCP Jupyter Lab 
    - Go to GCP
    - open a Cloud Shell
    - `ssh-keygen -t rsa -b 4096 -C firstName_lastName`
    - `cp .ssh/id_rsa.pub .`
    - use Cloud Editor to edit this file `id_rsa.pub` and copy the full content
    - Go to Compute Engine -> Metadata
    - Click SSH Keys
    - Click Edit
    - Click + Add item, paste the content of `id_rsa.pub`
    - You should see firstName_lastName on the left
    - Click Save
    - you need to start an AI Platform instance 
    - open a Jupyter Lab terminal and go to `/home/gcp_user_name/`
    - clone this repository: `git clone https://github.com/tarrade/proj_multilingual_text_classification.git`
    - then `cd proj_multilingual_text_classification/`
    - create the Anaconda Python environment `conda env create -f env/environment.yml`
    - create a file `config.sh` in `/home` with the following information: 
    ```
    #!/bin/bash
    
    echo "applying some configuration ..."
    git config --global user.email user_email
    git config --global user.name user_name
    git config --global credential.helper store
        
    # Add the environment variables listed above here
    # [EDIT ME]
    export DIR_PROJ=your_path_git_repository
    export PYTHONPATH=$DIR_PROJ/src
  
    cd /home/gcp_user_name/
    
    conda activate env_multilingual_class

    export PS1='\[\e[91m\]\u@:\[\e[32m\]\w\[\e[0m\]$'
    ```
    - Go to AI Platform Notebook, select your instance and click "Reset".
    - Wait and refresh your web browser with the Notebook


### Import Packages

In [2]:
import tensorflow as tf
import tensorflow_datasets as tfds
from transformers import (BertTokenizer,
                          GPT2Tokenizer,
                          glue_convert_examples_to_features,
                         )
import itertools

# local packages
import preprocessing.preprocessing as pp
import importlib
importlib.reload(pp);

### Define Paths


In [3]:
#try:
#    data_dir=os.environ['PATH_DATASETS']
#except:
#    print('missing PATH_DATASETS')

## Tokenizer Initialisation

General differences between the tokenizers of BERT and GPT-2:

- the style of decoding
- the type of special tokens that are added to the sequences
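
One background difference that matters for special characters: BERT uses WordPiece on Unicode characters, whereas GPT-2 uses byte-level BPE, which starts from the UTF-8 bytes of the text. The sketch below only illustrates that starting point with the standard library. It does not call the GPT-2 tokenizer, and `utf8_bytes` is a hypothetical helper name.

```python
def utf8_bytes(text):
    """Return the UTF-8 byte values of a string."""
    return list(text.encode("utf-8"))

# ASCII characters map to one byte each, the umlaut "ü" to two bytes.
print(utf8_bytes("grün"))   # [103, 114, 195, 188, 110]
print(utf8_bytes("\\"))     # [92]
```

Because every string can be written as bytes, a byte-level tokenizer does not need an unknown token for characters such as `ü` or `\`.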

In [4]:
# BERT
tokenizer_bert_cased = BertTokenizer.from_pretrained('bert-base-cased')
tokenizer_bert_uncased = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer_bert_german = BertTokenizer.from_pretrained('bert-base-german-cased')
# did not find a french-specific tokenizer yet
#tokenizer_bert_french = BertTokenizer.from_pretrained('bert-base-cased')
tokenizer_bert_multi_cased = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
tokenizer_bert_multi_uncased = BertTokenizer.from_pretrained('bert-base-multilingual-uncased')

# GPT-2
tokenizer_gpt2 = GPT2Tokenizer.from_pretrained('gpt2')



## BERT - Special Characters

### Language-Independent

In [5]:
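# English test sentence: the findings and the token-ID outputs further below were produced with it
#sequence_en = "This is testing \what happens to cases<br></br> with words containing hyphens like mother-in-law. Also, what happens to the second sentence?"
# Product-description string: the tokenization output directly below was produced with it
sequence_en = "Sensor Artikel 51317350597 Frontscheibe grün Regensensor STK 1 768.74"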

In [10]:
bert_cased_tokenized_sequence = tokenizer_bert_cased.tokenize(sequence_en)
bert_uncased_tokenized_sequence = tokenizer_bert_uncased.tokenize(sequence_en)
bert_multi_cased_tokenized_sequence = tokenizer_bert_multi_cased.tokenize(sequence_en)
bert_multi_uncased_tokenized_sequence = tokenizer_bert_multi_uncased.tokenize(sequence_en)
print("Number of tokens if split by space:\n",sequence_en.split(' '),"\nLength:", len(sequence_en.split(' ')))
print("Cased, English sequence:\n",bert_cased_tokenized_sequence,"\nLength:",len(bert_cased_tokenized_sequence))
print("\nUncased, English sequence:\n",bert_uncased_tokenized_sequence,"\nLength:",len(bert_uncased_tokenized_sequence))
print("\nCased, multilingual sequence:\n",bert_multi_cased_tokenized_sequence,"\nLength:",len(bert_multi_cased_tokenized_sequence))
print("\nUncased, multilingual sequence:\n",bert_multi_uncased_tokenized_sequence,"\nLength:",len(bert_multi_uncased_tokenized_sequence))

Number of tokens if split by space:
 ['Sensor', 'Artikel', '51317350597', 'Frontscheibe', 'grün', 'Regensensor', 'STK', '1', '768.74'] 
Length: 9
Cased, English sequence:
 ['Sen', '##sor', 'Art', '##ike', '##l', '51', '##31', '##7', '##35', '##0', '##5', '##9', '##7', 'Front', '##sche', '##ibe', 'g', '##r', '##ü', '##n', 'Reg', '##ense', '##nso', '##r', 'ST', '##K', '1', '76', '##8', '.', '74'] 
Length: 31

Uncased, English sequence:
 ['sensor', 'art', '##ike', '##l', '51', '##31', '##7', '##35', '##0', '##59', '##7', 'fronts', '##che', '##ibe', 'gr', '##un', 'reg', '##ense', '##nsor', 'st', '##k', '1', '76', '##8', '.', '74'] 
Length: 26

Cased, multilingual sequence:
 ['Sens', '##or', 'Artikel', '513', '##17', '##35', '##0', '##59', '##7', 'Front', '##sche', '##ibe', 'gr', '##ün', 'Reg', '##ense', '##ns', '##or', 'ST', '##K', '1', '768', '.', '74'] 
Length: 24

Uncased, multilingual sequence:
 ['sensor', 'artikel', '513', '##17', '##35', '##0', '##59', '##7', 'front', '##sche', '##ibe

Findings (for the English test sentence that is commented out above, not for the product-description string):
- When special characters are not removed from the text, they are tokenized as separate tokens.
- The uncased English tokenizer lowercases the sequence, as expected. It also splits the word "hyphens" differently from the cased one.
- The multilingual tokenizers work slightly differently from the English ones. In this specific sequence, the only difference is the tokenization of the word "hyphens".
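
The splitting behaviour behind these findings can be illustrated with a minimal sketch of greedy longest-match-first word-piece splitting. The vocabulary is a hypothetical toy set, chosen so that "hyphens" splits the same way as in the cased English output. The function name `wordpiece` is an assumption for illustration and not the library's API.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary (the real vocabularies are far larger).
toy_vocab = {"h", "##y", "##phe", "##ns", "mother", "in", "law"}
print(wordpiece("hyphens", toy_vocab))  # ['h', '##y', '##phe', '##ns']
print(wordpiece("mother", toy_vocab))   # ['mother']
print(wordpiece("xyz", toy_vocab))      # ['[UNK]']
```

A word that is in the vocabulary stays whole, so a different vocabulary is enough to explain why the tokenizers split the same word differently.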

In [8]:
tokens_ids_cased = tokenizer_bert_cased.convert_tokens_to_ids(bert_cased_tokenized_sequence)
tokens_ids_uncased = tokenizer_bert_uncased.convert_tokens_to_ids(bert_uncased_tokenized_sequence)
tokens_ids_multi_cased = tokenizer_bert_multi_cased.convert_tokens_to_ids(bert_multi_cased_tokenized_sequence)
tokens_ids_multi_uncased = tokenizer_bert_multi_uncased.convert_tokens_to_ids(bert_multi_uncased_tokenized_sequence)
print("Cased, English sequence:\nToken IDs:",tokens_ids_cased,"\nLength:",len(tokens_ids_cased))
print("\nUncased, English sequence:\nToken IDs:",tokens_ids_uncased,"\nLength:",len(tokens_ids_uncased))
print("\nCased, multilingual sequence:\nToken IDs:",tokens_ids_multi_cased,"\nLength:",len(tokens_ids_multi_cased))
print("\nUncased, multilingual sequence:\nToken IDs:",tokens_ids_multi_uncased,"\nLength:",len(tokens_ids_multi_uncased))

Cased, English sequence:
Token IDs: [1188, 1110, 5193, 165, 1184, 5940, 1106, 2740, 133, 9304, 135, 133, 120, 9304, 135, 1114, 1734, 4051, 177, 1183, 27801, 2316, 1176, 1534, 118, 1107, 118, 1644, 119, 2907, 117, 1184, 5940, 1106, 1103, 1248, 5650, 136] 
Length: 38

Uncased, English sequence:
Token IDs: [2023, 2003, 5604, 1032, 2054, 6433, 2000, 3572, 1026, 7987, 1028, 1026, 1013, 7987, 1028, 2007, 2616, 4820, 1044, 22571, 10222, 2015, 2066, 2388, 1011, 1999, 1011, 2375, 1012, 2036, 1010, 2054, 6433, 2000, 1996, 2117, 6251, 1029] 
Length: 38

Cased, multilingual sequence:
Token IDs: [10747, 10124, 38306, 165, 12976, 105315, 10114, 16480, 133, 33989, 135, 133, 120, 33989, 135, 10169, 21296, 27248, 15165, 60383, 10107, 11850, 15293, 118, 10106, 118, 13255, 119, 20593, 117, 12976, 105315, 10114, 10105, 11132, 49219, 136] 
Length: 37

Uncased, multilingual sequence:
Token IDs: [10372, 10127, 32311, 139, 11523, 76959, 10114, 16379, 133, 18710, 135, 133, 120, 18710, 135, 10171, 18281, 26648,

In [9]:
for i in tokens_ids_cased:
    print('{:7d}    ---->    {}'.format(i, tokenizer_bert_cased.decode(int(i))))

   1188    ---->    T h i s
   1110    ---->    i s
   5193    ---->    t e s t i n g
    165    ---->    \
   1184    ---->    w h a t
   5940    ---->    h a p p e n s
   1106    ---->    t o
   2740    ---->    c a s e s
    133    ---->    <
   9304    ---->    b r
    135    ---->    >
    133    ---->    <
    120    ---->    /
   9304    ---->    b r
    135    ---->    >
   1114    ---->    w i t h
   1734    ---->    w o r d s
   4051    ---->    c o n t a i n i n g
    177    ---->    h
   1183    ---->    # # y
  27801    ---->    # # p h e
   2316    ---->    # # n s
   1176    ---->    l i k e
   1534    ---->    m o t h e r
    118    ---->    -
   1107    ---->    i n
    118    ---->    -
   1644    ---->    l a w
    119    ---->    .
   2907    ---->    A l s o
    117    ---->    ,
   1184    ---->    w h a t
   5940    ---->    h a p p e n s
   1106    ---->    t o
   1103    ---->    t h e
   1248    ---->    s e c o n d
   5650    ---->    s e n t e n c e
    136 

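Most tokens have different IDs across the tokenizers because each tokenizer ships its own vocabulary. Only some punctuation tokens (for example `<`, `>`, `-`, `,` and `.`) happen to have the same ID in the cased English and the multilingual vocabularies.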
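
The spaced-out letters in the decoded output above (`T h i s`, `# # y`) are most likely an artifact of passing a bare integer to `decode` instead of a list of IDs: the single token string is then iterated character by character and joined with spaces. This is an assumption about the `transformers` version used here. A plain-Python sketch of the effect, with `join_tokens` as a hypothetical stand-in for the joining step:

```python
def join_tokens(tokens):
    """Join tokens with single spaces, as done for a list of tokens."""
    return " ".join(tokens)

# A list with one token is joined as expected.
print(join_tokens(["This"]))  # This
# A bare string is iterated character by character.
print(join_tokens("This"))    # T h i s
print(join_tokens("##y"))     # # # y
```

If this is indeed the cause, passing a one-element list (`decode([i])`) or calling `convert_ids_to_tokens(i)` should print the token itself.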

In [10]:
tokens_ids_with_special_cased = tokenizer_bert_cased.build_inputs_with_special_tokens(tokens_ids_cased)
tokens_ids_with_special_multi_uncased = tokenizer_bert_multi_uncased.build_inputs_with_special_tokens(tokens_ids_multi_uncased)
print("Cased, English sequence:\n",tokens_ids_with_special_cased,"\nLength:", len(tokens_ids_with_special_cased))
print("\nUncased, multilingual sequence:\n",tokens_ids_with_special_multi_uncased,"\nLength:", len(tokens_ids_with_special_multi_uncased))

Cased, English sequence:
 [101, 1188, 1110, 5193, 165, 1184, 5940, 1106, 2740, 133, 9304, 135, 133, 120, 9304, 135, 1114, 1734, 4051, 177, 1183, 27801, 2316, 1176, 1534, 118, 1107, 118, 1644, 119, 2907, 117, 1184, 5940, 1106, 1103, 1248, 5650, 136, 102] 
Length: 40

Uncased, multilingual sequence:
 [101, 10372, 10127, 32311, 139, 11523, 76959, 10114, 16379, 133, 18710, 135, 133, 120, 18710, 135, 10171, 18281, 26648, 13051, 47607, 10933, 11531, 13907, 118, 10104, 118, 11785, 119, 10398, 117, 11523, 76959, 10114, 10103, 10981, 45261, 136, 102] 
Length: 39


In [11]:
for i in tokens_ids_with_special_cased:
    print('{:7d}    ---->    {}'.format(i, tokenizer_bert_cased.decode(int(i))))

    101    ---->    [ C L S ]
   1188    ---->    T h i s
   1110    ---->    i s
   5193    ---->    t e s t i n g
    165    ---->    \
   1184    ---->    w h a t
   5940    ---->    h a p p e n s
   1106    ---->    t o
   2740    ---->    c a s e s
    133    ---->    <
   9304    ---->    b r
    135    ---->    >
    133    ---->    <
    120    ---->    /
   9304    ---->    b r
    135    ---->    >
   1114    ---->    w i t h
   1734    ---->    w o r d s
   4051    ---->    c o n t a i n i n g
    177    ---->    h
   1183    ---->    # # y
  27801    ---->    # # p h e
   2316    ---->    # # n s
   1176    ---->    l i k e
   1534    ---->    m o t h e r
    118    ---->    -
   1107    ---->    i n
    118    ---->    -
   1644    ---->    l a w
    119    ---->    .
   2907    ---->    A l s o
    117    ---->    ,
   1184    ---->    w h a t
   5940    ---->    h a p p e n s
   1106    ---->    t o
   1103    ---->    t h e
   1248    ---->    s e c o n d
   5650    ---

Special tokens are simply [CLS] and [SEP] which wrap around the whole sequence.
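
For a single sequence, this wrapping can be sketched in a few lines. The helper `add_special_tokens` is hypothetical. The default IDs 101 and 102 are taken from the output above, and the three example IDs are the first tokens of the cased English sequence.

```python
def add_special_tokens(token_ids, cls_id=101, sep_id=102):
    """Wrap a single sequence as [CLS] ... [SEP]."""
    return [cls_id] + list(token_ids) + [sep_id]

ids = [1188, 1110, 5193]  # "This", "is", "testing" in the cased English vocabulary
wrapped = add_special_tokens(ids)
print(wrapped)                  # [101, 1188, 1110, 5193, 102]
print(len(wrapped) - len(ids))  # 2
```

This matches the lengths above: the cased English sequence grows from 38 to 40 IDs.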

In [12]:
encoded_tokens_cased = tokenizer_bert_cased.encode_plus(sequence_en, max_length=50, pad_to_max_length=True)
encoded_tokens_multi_uncased = tokenizer_bert_multi_uncased.encode_plus(sequence_en, max_length=50, pad_to_max_length=True)
print("Cased, English sequence:\n",encoded_tokens_cased)
print("\nUncased, multilingual sequence:\n",encoded_tokens_multi_uncased)

Cased, English sequence:
 {'input_ids': [101, 1188, 1110, 5193, 165, 1184, 5940, 1106, 2740, 133, 9304, 135, 133, 120, 9304, 135, 1114, 1734, 4051, 177, 1183, 27801, 2316, 1176, 1534, 118, 1107, 118, 1644, 119, 2907, 117, 1184, 5940, 1106, 1103, 1248, 5650, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

Uncased, multilingual sequence:
 {'input_ids': [101, 10372, 10127, 32311, 139, 11523, 76959, 10114, 16379, 133, 18710, 135, 133, 120, 18710, 135, 10171, 18281, 26648, 13051, 47607, 10933, 11531, 13907, 118, 10104, 118, 11785, 119, 10398, 117, 11523, 76959, 10114, 10103, 10981, 45261, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 

Findings:
- When a sequence is encoded, `token_type_ids` specifies which sequence each token belongs to. This matters when two sequences (for example a sentence pair) are fed to the model together. Here all tokens belong to the same sequence, so every value is 0.
- `attention_mask` specifies which tokens the model should attend to (1) and which are merely padding (0). Padding is added to sequences that are shorter than the specified `max_length`.
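
The structure of this output can be reproduced with a minimal sketch for a single sequence. `encode_single` is a hypothetical helper, not the library function. The IDs 101, 102 and 0 for `[CLS]`, `[SEP]` and padding are taken from the output above, and the truncation rule (cut the sequence, keep `[SEP]` at the end) is an assumption that is consistent with the recorded outputs in this notebook.

```python
def encode_single(token_ids, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Add special tokens, truncate to max_length, pad, and build the masks."""
    body = list(token_ids)[: max_length - 2]  # leave room for [CLS] and [SEP]
    input_ids = [cls_id] + body + [sep_id]
    n_real = len(input_ids)
    n_pad = max_length - n_real
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        "token_type_ids": [0] * max_length,            # single sequence: segment 0
        "attention_mask": [1] * n_real + [0] * n_pad,  # 1 = real token, 0 = padding
    }

enc = encode_single(range(1000, 1038), max_length=50)  # 38 dummy token IDs
print(sum(enc["attention_mask"]))  # 40
print(enc["input_ids"].count(0))   # 10
```

With 38 tokens and `max_length=50`, 40 positions are real and 10 are padding, as in the cased English output above.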

### German

In [25]:
sequence_de = "Äußerst interessant ist die Frage, wie österreichische Übersetzungsdienstleistungsunternehmen mit Tokenisierung und \
              Schreibweisen wie Ueberfuehrung oder ueberfuehren umgehen."

In [26]:
bert_de_cased_tokenized_sequence = tokenizer_bert_cased.tokenize(sequence_de)
bert_de_uncased_tokenized_sequence = tokenizer_bert_uncased.tokenize(sequence_de)
bert_de_german_tokenized_sequence = tokenizer_bert_german.tokenize(sequence_de)
bert_de_multi_cased_tokenized_sequence = tokenizer_bert_multi_cased.tokenize(sequence_de)
bert_de_multi_uncased_tokenized_sequence = tokenizer_bert_multi_uncased.tokenize(sequence_de)

print("Cased, English sequence:\nTokens:",bert_de_cased_tokenized_sequence,"\nLength:",len(bert_de_cased_tokenized_sequence),"\n")
print("Uncased, English sequence:\nTokens:",bert_de_uncased_tokenized_sequence,"\nLength:",len(bert_de_uncased_tokenized_sequence),"\n")
print("Cased, German sequence:\nTokens:",bert_de_german_tokenized_sequence,"\nLength:",len(bert_de_german_tokenized_sequence),"\n")
print("Cased, multilingual sequence:\nTokens:",bert_de_multi_cased_tokenized_sequence,"\nLength:",len(bert_de_multi_cased_tokenized_sequence),"\n")
print("Uncased, multilingual sequence:\nTokens:",bert_de_multi_uncased_tokenized_sequence,"\nLength:",len(bert_de_multi_uncased_tokenized_sequence))


Cased, English sequence:
Tokens: ['Ä', '##u', '##ße', '##rst', 'inter', '##ess', '##ant', 'is', '##t', 'die', 'Fr', '##age', ',', 'w', '##ie', 'ö', '##ster', '##re', '##ichi', '##sche', 'Ü', '##bers', '##et', '##zu', '##ng', '##s', '##die', '##nst', '##le', '##ist', '##ung', '##sun', '##tern', '##eh', '##men', 'mit', 'To', '##ken', '##isi', '##er', '##ung', 'und', 'Sc', '##hr', '##ei', '##b', '##wei', '##sen', 'w', '##ie', 'U', '##eb', '##er', '##fu', '##eh', '##rung', 'o', '##der', 'u', '##eb', '##er', '##fu', '##eh', '##ren', 'um', '##ge', '##hen', '.'] 
Length: 68 

Uncased, English sequence:
Tokens: ['au', '##ße', '##rst', 'inter', '##ess', '##ant', 'ist', 'die', 'fra', '##ge', ',', 'wi', '##e', 'os', '##ter', '##re', '##ichi', '##sche', 'uber', '##set', '##zu', '##ng', '##sd', '##iens', '##tle', '##ist', '##ung', '##sun', '##tern', '##eh', '##men', 'mit', 'token', '##isi', '##er', '##ung', 'und', 'sc', '##hre', '##ib', '##wei', '##sen', 'wi', '##e', 'u', '##eber', '##fu', '##eh', 

Findings:
- The tokenization varies between the different tokenizers. Neither English tokenizer is useful here, since they do not recognize most of the words or word pieces. They are therefore ignored in the discussion below.
- The two multilingual tokenizers work slightly differently, especially for umlauts: the uncased tokenizer reduces them to their base letters (for example, `Ä` becomes `a`).
- The output of the uncased multilingual tokenizer is closer to that of the German-only tokenizer than the output of the cased multilingual one. This is mainly due to the first word, which the cased multilingual tokenizer splits into many pieces.
- The cased German tokenizer seems to represent the words best, but the uncased multilingual tokenizer should also work.
- Alternative spellings of umlauts, like "ue" for "ü", are not captured well. Capitalized words behave the same way as lowercase ones. **We may need to account for these alternative spellings.**
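
One possible way to account for the alternative spellings is to restore the umlauts before tokenization. The sketch below is an assumption about how this could be done, not a method used elsewhere in this notebook. It only accepts a replacement if the result is in a list of known words, because a blind replacement would damage words like "aktuell". The helper `restore_umlauts` and the word list `known` are hypothetical.

```python
DIGRAPHS = {"Ae": "Ä", "Oe": "Ö", "Ue": "Ü", "ae": "ä", "oe": "ö", "ue": "ü"}

def restore_umlauts(word, known_words):
    """Replace ae/oe/ue digraphs only if the result is a known word."""
    candidate = word
    for digraph, umlaut in DIGRAPHS.items():
        candidate = candidate.replace(digraph, umlaut)
    return candidate if candidate in known_words else word

known = {"Überführung", "überführen"}  # hypothetical word list
print(restore_umlauts("Ueberfuehrung", known))  # Überführung
print(restore_umlauts("ueberfuehren", known))   # überführen
print(restore_umlauts("aktuell", known))        # aktuell
```

In practice the word list would have to come from a German dictionary or from the tokenizer vocabulary itself.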

In [15]:
tokens_ids_de_cased = tokenizer_bert_cased.convert_tokens_to_ids(bert_de_cased_tokenized_sequence)
tokens_ids_de_german = tokenizer_bert_german.convert_tokens_to_ids(bert_de_german_tokenized_sequence)
tokens_ids_de_multi_cased = tokenizer_bert_multi_cased.convert_tokens_to_ids(bert_de_multi_cased_tokenized_sequence)
tokens_ids_de_multi_uncased = tokenizer_bert_multi_uncased.convert_tokens_to_ids(bert_de_multi_uncased_tokenized_sequence)
print("Cased, English sequence:\nToken IDs:",tokens_ids_de_cased,"\nLength:",len(tokens_ids_de_cased))
print("\nCased, German sequence:\nToken IDs:",tokens_ids_de_german,"\nLength:",len(tokens_ids_de_german))
print("\nCased, multilingual sequence:\nToken IDs:",tokens_ids_de_multi_cased,"\nLength:",len(tokens_ids_de_multi_cased))
print("\nUncased, multilingual sequence:\nToken IDs:",tokens_ids_de_multi_uncased,"\nLength:",len(tokens_ids_de_multi_uncased))

Cased, English sequence:
Token IDs: [229, 1358, 13750, 9731, 9455, 5800, 2861, 1110, 1204, 2939, 13359, 2553, 117, 192, 1663, 268, 4648, 1874, 11985, 15460, 243, 11697, 2105, 10337, 2118, 1116, 7782, 22399, 1513, 1776, 4380, 23294, 16748, 10486, 2354, 26410, 1706, 6378, 26868, 1200, 4380, 5576, 20452, 8167, 6851, 1830, 24078, 3792, 192, 1663, 158, 15581, 1200, 14703, 10486, 20901, 184, 2692, 190, 15581, 1200, 14703, 10486, 5123, 15276, 2176, 10436, 119] 
Length: 68

Cased, German sequence:
Token IDs: [7166, 13, 19168, 127, 30, 1685, 2036, 246, 11675, 10919, 26902, 2068, 9992, 4039, 114, 9626, 7, 3873, 42, 21779, 26898, 246, 23808, 73, 26913, 1790, 48, 27, 309, 2118, 731, 6, 26913, 1790, 2306, 18978, 4813] 
Length: 37

Cased, multilingual sequence:
Token IDs: [229, 10138, 88376, 10562, 103691, 10298, 10128, 39278, 117, 10953, 69515, 47038, 65069, 73003, 100489, 17894, 58348, 10221, 11469, 11062, 34600, 10130, 55260, 32206, 17609, 35130, 10953, 158, 68977, 20758, 89707, 10716, 10760, 189

In [27]:
tokens_ids_with_special_de_cased = tokenizer_bert_cased.build_inputs_with_special_tokens(tokens_ids_de_cased)
tokens_ids_with_special_de_german = tokenizer_bert_german.build_inputs_with_special_tokens(tokens_ids_de_german)
tokens_ids_with_special_de_multi_cased = tokenizer_bert_multi_cased.build_inputs_with_special_tokens(tokens_ids_de_multi_cased)
tokens_ids_with_special_de_multi_uncased = tokenizer_bert_multi_uncased.build_inputs_with_special_tokens(tokens_ids_de_multi_uncased)
print("Cased, English sequence:\nToken IDs",tokens_ids_with_special_de_cased,"\nLength:", len(tokens_ids_with_special_de_cased))
print("\nCased, German sequence:\nToken IDs:",tokens_ids_with_special_de_german,"\nLength:",len(tokens_ids_with_special_de_german))
print("\nCased, multilingual sequence:\nToken IDs:",tokens_ids_with_special_de_multi_cased,"\nLength:",len(tokens_ids_with_special_de_multi_cased))
print("\nUncased, multilingual sequence:\nToken IDs:",tokens_ids_with_special_de_multi_uncased,"\nLength:",len(tokens_ids_with_special_de_multi_uncased))

Cased, English sequence:
Token IDs [101, 229, 1358, 13750, 9731, 9455, 5800, 2861, 1110, 1204, 2939, 13359, 2553, 117, 192, 1663, 268, 4648, 1874, 11985, 15460, 243, 11697, 2105, 10337, 2118, 1116, 7782, 22399, 1513, 1776, 4380, 23294, 16748, 10486, 2354, 26410, 1706, 6378, 26868, 1200, 4380, 5576, 20452, 8167, 6851, 1830, 24078, 3792, 192, 1663, 158, 15581, 1200, 14703, 10486, 20901, 184, 2692, 190, 15581, 1200, 14703, 10486, 5123, 15276, 2176, 10436, 119, 102] 
Length: 70

Cased, German sequence:
Token IDs: [3, 7166, 13, 19168, 127, 30, 1685, 2036, 246, 11675, 10919, 26902, 2068, 9992, 4039, 114, 9626, 7, 3873, 42, 21779, 26898, 246, 23808, 73, 26913, 1790, 48, 27, 309, 2118, 731, 6, 26913, 1790, 2306, 18978, 4813, 4] 
Length: 39

Cased, multilingual sequence:
Token IDs: [101, 229, 10138, 88376, 10562, 103691, 10298, 10128, 39278, 117, 10953, 69515, 47038, 65069, 73003, 100489, 17894, 58348, 10221, 11469, 11062, 34600, 10130, 55260, 32206, 17609, 35130, 10953, 158, 68977, 20758, 8970

The German tokenizer uses different IDs for the special tokens: `[CLS]` is 3 and `[SEP]` is 4, instead of 101 and 102.

In [30]:
encoded_tokens_de_cased = tokenizer_bert_cased.encode_plus(sequence_de, max_length=50, pad_to_max_length=True)
encoded_tokens_de_german = tokenizer_bert_german.encode_plus(sequence_de, max_length=50, pad_to_max_length=True)
encoded_tokens_de_multi_cased = tokenizer_bert_multi_cased.encode_plus(sequence_de, max_length=50, pad_to_max_length=True)
encoded_tokens_de_multi_uncased = tokenizer_bert_multi_uncased.encode_plus(sequence_de, max_length=50, pad_to_max_length=True)
print("Cased, English sequence:\n",encoded_tokens_de_cased)
print("\nCased, German sequence:\n",encoded_tokens_de_german)
print("\nCased, multilingual sequence:\n",encoded_tokens_de_multi_cased)
print("\nUncased, multilingual sequence:\n",encoded_tokens_de_multi_uncased)

Cased, English sequence:
 {'input_ids': [101, 229, 1358, 13750, 9731, 9455, 5800, 2861, 1110, 1204, 2939, 13359, 2553, 117, 192, 1663, 268, 4648, 1874, 11985, 15460, 243, 11697, 2105, 10337, 2118, 1116, 7782, 22399, 1513, 1776, 4380, 23294, 16748, 10486, 2354, 26410, 1706, 6378, 26868, 1200, 4380, 5576, 20452, 8167, 6851, 1830, 24078, 3792, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Cased, German sequence:
 {'input_ids': [3, 7166, 13, 19168, 127, 30, 1685, 2036, 246, 11675, 10919, 26902, 2068, 9992, 4039, 114, 9626, 7, 3873, 42, 21779, 26898, 246, 23808, 73, 26913, 1790, 48, 27, 309, 2118, 731, 6, 26913, 1790, 2306, 18978, 4813, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 

In [58]:
# Pad the shorter ID list with 0 so that both columns can be printed side by side
for i, j in itertools.zip_longest(tokens_ids_with_special_de_multi_cased, tokens_ids_with_special_de_multi_uncased, fillvalue=0):
    print('{:7d}    ---->         {:^40}     {:^40}         <----    {:7d}'.format(i, tokenizer_bert_multi_cased.decode(int(i)),tokenizer_bert_multi_uncased.decode(int(j)),j))

    101    ---->                        [ C L S ]                                    [ C L S ]                         <----        101
    229    ---->                            Ä                                        a u ß e r                         <----      41623
  10138    ---->                          # # u                                       # # s t                          <----      10607
  88376    ---->                        # # ß e r                              i n t e r e s s a n t                   <----      65039
  10562    ---->                         # # s t                                       i s t                           <----      10339
 103691    ---->                  i n t e r e s s a n t                                d i e                           <----      10121
  10298    ---->                          i s t                                      f r a g e                         <----      38563
  10128    ---->                          d i e 

### French

In [32]:
sequence_fr = "À Noël, les garçons Pierre et Philippe ramassent une pierre et puis visitent un café près d'un château où un roi a vécu."

In [33]:
bert_fr_cased_tokenized_sequence = tokenizer_bert_cased.tokenize(sequence_fr)
bert_fr_uncased_tokenized_sequence = tokenizer_bert_uncased.tokenize(sequence_fr)
bert_fr_multi_cased_tokenized_sequence = tokenizer_bert_multi_cased.tokenize(sequence_fr)
bert_fr_multi_uncased_tokenized_sequence = tokenizer_bert_multi_uncased.tokenize(sequence_fr)

print("Cased, English sequence:\nTokens:",bert_fr_cased_tokenized_sequence,"\nLength:",len(bert_fr_cased_tokenized_sequence),"\n")
print("Uncased, English sequence:\nTokens:",bert_fr_uncased_tokenized_sequence,"\nLength:",len(bert_fr_uncased_tokenized_sequence),"\n")
print("Cased, multilingual sequence:\nTokens:",bert_fr_multi_cased_tokenized_sequence,"\nLength:",len(bert_fr_multi_cased_tokenized_sequence),"\n")
print("Uncased, multilingual sequence:\nTokens:",bert_fr_multi_uncased_tokenized_sequence,"\nLength:",len(bert_fr_multi_uncased_tokenized_sequence))


Cased, English sequence:
Tokens: ['À', 'No', '##ël', ',', 'les', 'g', '##ar', '##ç', '##ons', 'Pierre', 'et', 'Philippe', 'ram', '##ass', '##ent', 'une', 'pier', '##re', 'et', 'pu', '##is', 'visit', '##ent', 'un', 'café', 'p', '##r', '##ès', 'd', "'", 'un', 'ch', '##â', '##teau', 'o', '##ù', 'un', 'r', '##oi', 'a', 'v', '##é', '##cu', '.'] 
Length: 44 

Uncased, English sequence:
Tokens: ['a', 'noel', ',', 'les', 'ga', '##rco', '##ns', 'pierre', 'et', 'philippe', 'rama', '##ssen', '##t', 'une', 'pierre', 'et', 'pu', '##is', 'visit', '##ent', 'un', 'cafe', 'pre', '##s', 'd', "'", 'un', 'chateau', 'ou', 'un', 'roi', 'a', 've', '##cu', '.'] 
Length: 35 

Cased, multilingual sequence:
Tokens: ['À', 'Noël', ',', 'les', 'garçon', '##s', 'Pierre', 'et', 'Philippe', 'ramas', '##sent', 'une', 'pierre', 'et', 'puis', 'visite', '##nt', 'un', 'café', 'près', 'd', "'", 'un', 'château', 'où', 'un', 'roi', 'a', 'vécu', '.'] 
Length: 30 

Uncased, multilingual sequence:
Tokens: ['a', 'noel', ',', 'les

Findings:
- Of all the options, the cased English tokenizer represents the sequence worst.
- The cased and uncased multilingual representations are remarkably similar, which was not the case for all German words. (Note: this could be biased by the choice of words in the test sequence.) This suggests that we can also use the uncased multilingual representation.
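
The uncased outputs above (`a`, `noel`, `cafe`, `chateau`) suggest that the uncased tokenizers lowercase the text and remove diacritics before splitting it into word pieces. A minimal sketch of such a normalization with the standard library follows. The function names are hypothetical, and it is an approximation of the behaviour seen in the outputs, not the library code.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def uncased_normalize(text):
    """Lowercase the text, then strip accents."""
    return strip_accents(text.lower())

print(uncased_normalize("À Noël, les garçons"))  # a noel, les garcons
print(uncased_normalize("Äußerst"))              # außerst
```

The German example shows that `ß` survives this step because it has no decomposition, which matches the `a u ß e r` piece in the uncased multilingual output of the German section.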

In [34]:
tokens_ids_fr_cased = tokenizer_bert_cased.convert_tokens_to_ids(bert_fr_cased_tokenized_sequence)
tokens_ids_fr_multi_cased = tokenizer_bert_multi_cased.convert_tokens_to_ids(bert_fr_multi_cased_tokenized_sequence)
tokens_ids_fr_multi_uncased = tokenizer_bert_multi_uncased.convert_tokens_to_ids(bert_fr_multi_uncased_tokenized_sequence)
print("Cased, English sequence:\nToken IDs:",tokens_ids_fr_cased,"\nLength:",len(tokens_ids_fr_cased))
print("\nCased, multilingual sequence:\nToken IDs:",tokens_ids_fr_multi_cased,"\nLength:",len(tokens_ids_fr_multi_cased))
print("\nUncased, multilingual sequence:\nToken IDs:",tokens_ids_fr_multi_uncased,"\nLength:",len(tokens_ids_fr_multi_uncased))

Cased, English sequence:
Token IDs: [226, 1302, 17259, 117, 8241, 176, 1813, 28201, 4199, 4855, 3084, 11162, 26084, 11192, 3452, 25731, 16331, 1874, 3084, 23609, 1548, 3143, 3452, 8362, 20583, 185, 1197, 10695, 173, 112, 8362, 22572, 28198, 17871, 184, 22702, 8362, 187, 8136, 170, 191, 2744, 10182, 119] 
Length: 44

Cased, multilingual sequence:
Token IDs: [225, 38835, 117, 10152, 90381, 10107, 11609, 10131, 15408, 95304, 30832, 10231, 35346, 10131, 12451, 46770, 10368, 10119, 34551, 16092, 172, 112, 10119, 17890, 11814, 10119, 15681, 169, 92416, 119] 
Length: 30

Uncased, multilingual sequence:
Token IDs: [143, 23171, 117, 10152, 74185, 10107, 11676, 10137, 15753, 45289, 37164, 10249, 11676, 10137, 12505, 40560, 10368, 10119, 18427, 13913, 146, 112, 10119, 14785, 10391, 10119, 13277, 143, 60579, 119] 
Length: 30


In [35]:
tokens_ids_with_special_fr_cased = tokenizer_bert_cased.build_inputs_with_special_tokens(tokens_ids_fr_cased)
tokens_ids_with_special_fr_multi_cased = tokenizer_bert_multi_cased.build_inputs_with_special_tokens(tokens_ids_fr_multi_cased)
tokens_ids_with_special_fr_multi_uncased = tokenizer_bert_multi_uncased.build_inputs_with_special_tokens(tokens_ids_fr_multi_uncased)
print("Cased, English sequence:\nToken IDs",tokens_ids_with_special_fr_cased,"\nLength:", len(tokens_ids_with_special_fr_cased))
print("\nCased, multilingual sequence:\nToken IDs:",tokens_ids_with_special_fr_multi_cased,"\nLength:",len(tokens_ids_with_special_fr_multi_cased))
print("\nUncased, multilingual sequence:\nToken IDs:",tokens_ids_with_special_fr_multi_uncased,"\nLength:",len(tokens_ids_with_special_fr_multi_uncased))

Cased, English sequence:
Token IDs [101, 226, 1302, 17259, 117, 8241, 176, 1813, 28201, 4199, 4855, 3084, 11162, 26084, 11192, 3452, 25731, 16331, 1874, 3084, 23609, 1548, 3143, 3452, 8362, 20583, 185, 1197, 10695, 173, 112, 8362, 22572, 28198, 17871, 184, 22702, 8362, 187, 8136, 170, 191, 2744, 10182, 119, 102] 
Length: 46

Cased, multilingual sequence:
Token IDs: [101, 225, 38835, 117, 10152, 90381, 10107, 11609, 10131, 15408, 95304, 30832, 10231, 35346, 10131, 12451, 46770, 10368, 10119, 34551, 16092, 172, 112, 10119, 17890, 11814, 10119, 15681, 169, 92416, 119, 102] 
Length: 32

Uncased, multilingual sequence:
Token IDs: [101, 143, 23171, 117, 10152, 74185, 10107, 11676, 10137, 15753, 45289, 37164, 10249, 11676, 10137, 12505, 40560, 10368, 10119, 18427, 13913, 146, 112, 10119, 14785, 10391, 10119, 13277, 143, 60579, 119, 102] 
Length: 32


In [36]:
encoded_tokens_fr_cased = tokenizer_bert_cased.encode_plus(sequence_fr, max_length=50, pad_to_max_length=True)
encoded_tokens_fr_multi_cased = tokenizer_bert_multi_cased.encode_plus(sequence_fr, max_length=50, pad_to_max_length=True)
encoded_tokens_fr_multi_uncased = tokenizer_bert_multi_uncased.encode_plus(sequence_fr, max_length=50, pad_to_max_length=True)
print("Cased, English sequence:\n",encoded_tokens_fr_cased)
print("\nCased, multilingual sequence:\n",encoded_tokens_fr_multi_cased)
print("\nUncased, multilingual sequence:\n",encoded_tokens_fr_multi_uncased)

Cased, English sequence:
 {'input_ids': [101, 226, 1302, 17259, 117, 8241, 176, 1813, 28201, 4199, 4855, 3084, 11162, 26084, 11192, 3452, 25731, 16331, 1874, 3084, 23609, 1548, 3143, 3452, 8362, 20583, 185, 1197, 10695, 173, 112, 8362, 22572, 28198, 17871, 184, 22702, 8362, 187, 8136, 170, 191, 2744, 10182, 119, 102, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]}

Cased, multilingual sequence:
 {'input_ids': [101, 225, 38835, 117, 10152, 90381, 10107, 11609, 10131, 15408, 95304, 30832, 10231, 35346, 10131, 12451, 46770, 10368, 10119, 34551, 16092, 172, 112, 10119, 17890, 11814, 10119, 15681, 169, 92416, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': 

In [46]:
for i,j in zip(tokens_ids_with_special_fr_multi_cased, tokens_ids_with_special_fr_multi_uncased):
    print('{:7d}    ---->    {:^25}     {:^25}    <----    {:7d}'.format(i, tokenizer_bert_multi_cased.decode(int(i)),tokenizer_bert_multi_uncased.decode(int(j)),j))

    101    ---->            [ C L S ]                     [ C L S ]            <----        101
    225    ---->                À                             a                <----        143
  38835    ---->             N o ë l                       n o e l             <----      23171
    117    ---->                ,                             ,                <----        117
  10152    ---->              l e s                         l e s              <----      10152
  90381    ---->           g a r ç o n                   g a r c o n           <----      74185
  10107    ---->              # # s                         # # s              <----      10107
  11609    ---->           P i e r r e                   p i e r r e           <----      11676
  10131    ---->               e t                           e t               <----      10137
  15408    ---->         P h i l i p p e               p h i l i p p e         <----      15753
  95304    ---->            r a m a s   

Findings:
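- As seen above, the uncased multilingual tokenizer lowercases the input and strips accents (`Noël` becomes `noel`, `garçon` becomes `garcon`). As a result, `Pierre` (the name) and `pierre` (French for "stone") are assigned the same token even though they mean different things.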
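
The lowercasing and accent stripping visible in the table above can be approximated without loading a tokenizer. The following is a minimal sketch, assuming the uncased tokenizer lowercases the text and then removes combining accent marks after Unicode NFD decomposition. The helper name `normalize_uncased` is hypothetical and is not part of the `transformers` API.

```python
import unicodedata

def normalize_uncased(text):
    """Approximate uncased normalization: lowercase, then strip accents (hypothetical helper)."""
    lowered = text.lower()
    # NFD splits an accented character into its base letter plus combining marks
    decomposed = unicodedata.normalize("NFD", lowered)
    # Drop the combining marks (Unicode category "Mn"), e.g. the diaeresis in "ë"
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

for word in ["Pierre", "pierre", "Noël", "garçon", "À"]:
    print(word, "->", normalize_uncased(word))
```

Under this normalization, `Pierre` and `pierre` collapse to the same string before any vocabulary lookup happens, which is consistent with the identical token IDs in the uncased column.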

## Experiments with Truncation

In [51]:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

In [52]:
sequence = "Hello, I am looking for an embedding for a really long and windy sentence that is definitely not going to fit max length. I am even going to start another sentence to make this sequence especially long."
sequence_long = sequence + " This was still not long enough which is why I am still adding words to this sentence to make sure that it is longer than necessary."

bert_tokenized_sequence = tokenizer.tokenize(sequence)
bert_tokenized_sequence_long = tokenizer.tokenize(sequence_long)

bert_tokenized_sequence

['Hello',
 ',',
 'I',
 'am',
 'looking',
 'for',
 'an',
 'em',
 '##bed',
 '##ding',
 'for',
 'a',
 'really',
 'long',
 'and',
 'wind',
 '##y',
 'sentence',
 'that',
 'is',
 'definitely',
 'not',
 'going',
 'to',
 'fit',
 'ma',
 '##x',
 'length',
 '.',
 'I',
 'am',
 'even',
 'going',
 'to',
 'start',
 'another',
 'sentence',
 'to',
 'make',
 'this',
 'sequence',
 'especially',
 'long',
 '.']

In [53]:
tokens_ids = tokenizer.convert_tokens_to_ids(bert_tokenized_sequence)
tokens_ids_long = tokenizer.convert_tokens_to_ids(bert_tokenized_sequence_long)

print("Tokens ids (short): {}".format(tokens_ids))
print("")
print("Tokens ids (long): {}".format(tokens_ids_long))

print("")
print("Amount of tokens: ", len(tokens_ids), "(short sequence), ", len(tokens_ids_long), "(long sequence)")

Tokens ids (short): [8667, 117, 146, 1821, 1702, 1111, 1126, 9712, 4774, 3408, 1111, 170, 1541, 1263, 1105, 3223, 1183, 5650, 1115, 1110, 5397, 1136, 1280, 1106, 4218, 12477, 1775, 2251, 119, 146, 1821, 1256, 1280, 1106, 1838, 1330, 5650, 1106, 1294, 1142, 4954, 2108, 1263, 119]

Tokens ids (long): [8667, 117, 146, 1821, 1702, 1111, 1126, 9712, 4774, 3408, 1111, 170, 1541, 1263, 1105, 3223, 1183, 5650, 1115, 1110, 5397, 1136, 1280, 1106, 4218, 12477, 1775, 2251, 119, 146, 1821, 1256, 1280, 1106, 1838, 1330, 5650, 1106, 1294, 1142, 4954, 2108, 1263, 119, 1188, 1108, 1253, 1136, 1263, 1536, 1134, 1110, 1725, 146, 1821, 1253, 5321, 1734, 1106, 1142, 5650, 1106, 1294, 1612, 1115, 1122, 1110, 2039, 1190, 3238, 119]

Amount of tokens:  44 (short sequence),  71 (long sequence)


In [54]:
tokens_ids_with_special = tokenizer.build_inputs_with_special_tokens(tokens_ids)
print(tokens_ids_with_special)

[101, 8667, 117, 146, 1821, 1702, 1111, 1126, 9712, 4774, 3408, 1111, 170, 1541, 1263, 1105, 3223, 1183, 5650, 1115, 1110, 5397, 1136, 1280, 1106, 4218, 12477, 1775, 2251, 119, 146, 1821, 1256, 1280, 1106, 1838, 1330, 5650, 1106, 1294, 1142, 4954, 2108, 1263, 119, 102]


In [55]:
encoded_tokens = tokenizer.encode_plus("Hello, I am looking for an embedding.", 
                                       max_length=20, 
                                       pad_to_max_length=True)
print(encoded_tokens)

{'input_ids': [101, 8667, 117, 146, 1821, 1702, 1111, 1126, 9712, 4774, 3408, 119, 102, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]}
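
The padded output above follows a simple pattern: the ID list is filled with the padding ID 0 up to `max_length`, and the attention mask marks real tokens with 1 and padding with 0. A minimal sketch of that pattern in plain Python, assuming right-side padding with ID 0 as seen in the output; `pad_to_length` is a hypothetical helper, not a `transformers` function:

```python
def pad_to_length(input_ids, max_length, pad_id=0):
    """Right-pad a list of token IDs and build the matching attention mask (hypothetical helper)."""
    n_pad = max(0, max_length - len(input_ids))
    padded = list(input_ids) + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding that the model should ignore
    mask = [1] * len(input_ids) + [0] * n_pad
    return padded, mask

# IDs of "Hello, I am looking for an embedding." including [CLS] (101) and [SEP] (102)
ids = [101, 8667, 117, 146, 1821, 1702, 1111, 1126, 9712, 4774, 3408, 119, 102]
padded, mask = pad_to_length(ids, max_length=20)
print(padded)
print(mask)
```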


In [56]:
# Compare the encodings of the short and the long sequence.
# truncation_strategy defaults to 'longest_first', which is why the longer sequence is cut off automatically.
# Special tokens ([CLS] at the start and [SEP] at the end) are also added by default.
encoded_tokens_short = tokenizer.encode_plus(sequence, 
                                       max_length=64, 
                                       pad_to_max_length=True)

encoded_tokens_long = tokenizer.encode_plus(sequence_long, 
                                       max_length=64, 
                                       pad_to_max_length=True)

print("Output for the shorter sequence with", len(tokens_ids), "tokens: \n", encoded_tokens_short)

print("\nOutput for the longer sequence with", len(tokens_ids_long), "tokens: \n", encoded_tokens_long)

Output for the shorter sequence with 44 tokens: 
 {'input_ids': [101, 8667, 117, 146, 1821, 1702, 1111, 1126, 9712, 4774, 3408, 1111, 170, 1541, 1263, 1105, 3223, 1183, 5650, 1115, 1110, 5397, 1136, 1280, 1106, 4218, 12477, 1775, 2251, 119, 146, 1821, 1256, 1280, 1106, 1838, 1330, 5650, 1106, 1294, 1142, 4954, 2108, 1263, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

Output for the longer sequence with 71 tokens: 
 {'input_ids': [101, 8667, 117, 146, 1821, 1702, 1111, 1126, 9712, 4774, 3408, 1111, 170, 1541, 1263, 1105, 3223, 1183, 5650, 1115, 1110, 

In [57]:
# Decode the truncated IDs to show what remains of the long sequence
#for i in list(train_dataset.take(1).as_numpy_iterator())[0][0]['input_ids'][0]:
#    print('{:7d}    ---->    {}'.format(i, tokenizer.decode(int(i))))

tokenizer.decode(encoded_tokens_long['input_ids'])

'[CLS] Hello, I am looking for an embedding for a really long and windy sentence that is definitely not going to fit max length. I am even going to start another sentence to make this sequence especially long. This was still not long enough which is why I am still adding words to this sentence to [SEP]'

*How does truncation work? Where does it come in?*

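Sequences that are longer than `max_length` are cut off at the end. In the example above, the 71-token sequence is reduced to 62 tokens so that, together with `[CLS]` and `[SEP]`, it fills exactly 64 positions; the closing `[SEP]` is kept. `encode_plus` exposes this through `max_length` and `truncation_strategy`, but for a single input sequence there is little room to truncate differently.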
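
The behaviour seen in the decoded output can be sketched in plain Python: the token list is cut from the end so that, after adding `[CLS]` and `[SEP]`, the total length does not exceed `max_length`. This is an illustration under the assumption of a single input sequence and the special-token IDs 101 and 102 shown above; `truncate_single_sequence` is a hypothetical helper, not the library's implementation.

```python
def truncate_single_sequence(token_ids, max_length, cls_id=101, sep_id=102):
    """Cut tokens from the end so that [CLS] + tokens + [SEP] fits max_length (hypothetical helper)."""
    n_special = 2  # one [CLS] and one [SEP]
    kept = token_ids[: max(0, max_length - n_special)]
    return [cls_id] + kept + [sep_id]

# Stand-in IDs: 71 content tokens, the same count as the long sequence above
long_ids = list(range(1000, 1071))
truncated = truncate_single_sequence(long_ids, max_length=64)
print(len(truncated))      # total length including special tokens
print(len(truncated) - 2)  # content tokens that survive truncation
```

A sequence that already fits, such as the 44-token short sequence, passes through unchanged apart from the two special tokens.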