<a href="https://colab.research.google.com/github/seenu-g/gen-AI/blob/main/experiments/tokenizer_examples.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install tokenizers


Collecting tokenizers
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting huggingface_hub<0.18,>=0.16.4 (from tokenizers)
  Downloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)
Installing collected packages: huggingface_hub, tokenizers
Successfully installed huggingface_hub-0.17.3 tokenizers-0.14.1


In [7]:
import zipfile

from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


# Build a tokenizer from scratch: train a new tokenizer on WikiText-103 (516M of text)

In [59]:
import requests

def download_file(file_url, file_local):
    """Stream file_url to file_local in 1 KiB chunks."""
    # e.g. file_url = "https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip"
    r = requests.get(file_url, stream=True)
    r.raise_for_status()  # fail early instead of saving an HTTP error page

    # e.g. file_local = "/content/gdrive/My Drive/wikitext-103-raw-v1.zip"
    with open(file_local, "wb") as file:
        for block in r.iter_content(chunk_size=1024):
            if block:  # skip empty keep-alive chunks
                file.write(block)

In [None]:
download_file("https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip","/content/gdrive/My Drive/wikitext-103-raw-v1.zip")

In [61]:
download_file("https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt","/content/gdrive/My Drive/bert-base-uncased-vocab.txt")

In [23]:
%cd  /content/gdrive/MyDrive/

/content/gdrive/MyDrive


In [None]:
!wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt


In [None]:
!pwd
!ls

In [26]:
!unzip wikitext-103-raw-v1.zip

Archive:  wikitext-103-raw-v1.zip
   creating: wikitext-103-raw/
  inflating: wikitext-103-raw/wiki.test.raw  
  inflating: wikitext-103-raw/wiki.valid.raw  
  inflating: wikitext-103-raw/wiki.train.raw  


Build and train a Byte-Pair Encoding (BPE) tokenizer. Training the tokenizer means it will learn merge rules by:
1.   Starting with all the characters present in the training corpus as tokens.
2.   Identifying the most common pair of tokens and merging it into one token.
3.   Repeating until the vocabulary (i.e., the number of tokens) has reached the size we want.
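
The merge loop described above can be sketched in plain Python. This is only a toy illustration, not how the Tokenizers library implements BPE internally: `train_toy_bpe` is a hypothetical helper, it stops after a fixed number of merges rather than at a target vocabulary size, and the word list is made up.

```python
from collections import Counter

def train_toy_bpe(words, num_merges):
    """Learn BPE merge rules from a list of words (toy sketch, hypothetical helper)."""
    # Step 1: every word starts as a tuple of single characters, weighted by frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Step 2: count every adjacent pair of symbols across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single token
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair merged into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus  # Step 3: repeat on the updated corpus
    return merges, corpus

merges, corpus = train_toy_bpe(
    ["low", "low", "lower", "lowest", "newest", "newest"], num_merges=3
)
print(merges)        # learned merge rules, in order
print(list(corpus))  # how each word is segmented after the merges
```

Each learned merge adds one token to the vocabulary, so in a real trainer the loop would run until the vocabulary reaches the requested size.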




In [29]:
# instantiate tokenizer with a BPE model:
from tokenizers import Tokenizer
from tokenizers.models import BPE
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

The order in which you write the special tokens list matters: here "[UNK]" will get the ID 0, "[CLS]" will get the ID 1 and so forth.



In [30]:
# instantiate a trainer, in this case a BpeTrainer:
from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

Without a pre-tokenizer that splits inputs into words, we might get tokens that overlap several words: for instance, we could get an "it is" token since those two words often appear next to each other. Using a pre-tokenizer ensures that no token is bigger than a word returned by the pre-tokenizer.
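
To see why pre-tokenization rules out an "it is" token, here is a standard-library sketch of word-level splitting. It assumes the Whitespace pre-tokenizer behaves like the regular expression `\w+|[^\w\s]+` (runs of word characters, or runs of other non-space characters); `whitespace_pre_tokenize` is a hypothetical helper, not part of the library.

```python
import re

# Assumption: approximates the Whitespace pre-tokenizer's splitting rule.
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")

def whitespace_pre_tokenize(text):
    """Return (piece, (start, end)) tuples for each word or punctuation run."""
    return [(m.group(), m.span()) for m in WORD_OR_PUNCT.finditer(text)]

pieces = whitespace_pre_tokenize("it is what it is")
print(pieces)
```

Because merges are only learned inside each piece, "it" and "is" are always separate units and can never be fused into one token.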

In [31]:
# use the easiest pre-tokenizer possible by splitting on whitespace.
from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()

In [32]:
# call the Tokenizer.train method with any list of files we want to use:
files = [f"/content/gdrive/MyDrive/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
tokenizer.train(files, trainer)

In [35]:
# save the tokenizer in one file that contains all its configuration and vocabulary
tokenizer.save("/content/gdrive/MyDrive/tokenizer-wiki.json")
tokenizer = Tokenizer.from_file("/content/gdrive/MyDrive/tokenizer-wiki.json")


In [55]:
# reload your tokenizer from that file with the Tokenizer.from_file classmethod:
tokenizer = Tokenizer.from_file("/content/gdrive/MyDrive/tokenizer-wiki.json")

In [56]:
# we can use it on any text we want with the Tokenizer.encode method:
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")

# tokens attribute contains the segmentation of your text in tokens:
print(output.tokens)

# the ids attribute contains the index of each of those tokens in the tokenizer’s vocabulary:
print(output.ids)

['Hello', ',', 'y', "'", 'all', '!', 'How', 'are', 'you', '[UNK]', '?']
[27253, 16, 93, 11, 5097, 5, 7961, 5112, 6218, 0, 35]


The Tokenizers library comes with full alignment tracking, meaning you can always get the part of your original sentence that corresponds to a given token. These spans are stored in the offsets attribute of our Encoding object.

In [37]:
print(output.offsets[9])


(26, 27)


In [38]:
sentence = "Hello, y'all! How are you 😁 ?" #  indices that correspond to the emoji in the original sentence:
sentence[26:27]

'😁'

When we built our tokenizer, we set "[CLS]" and "[SEP]" in positions 1 and 2 of our list of special tokens, so these should be their IDs. To double-check, we can use the Tokenizer.token_to_id method:



In [39]:
tokenizer.token_to_id("[SEP]")


2

In [40]:
# set the post-processing to give us the traditional BERT inputs:
from tokenizers.processors import TemplateProcessing
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)
# in the pair template above, $A and $B stand for the two sentences, and the :1 suffix sets the type ID of that part to 1 (it defaults to 0)

In [41]:
# encode the same sentence
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)

['[CLS]', 'Hello', ',', 'y', "'", 'all', '!', 'How', 'are', 'you', '[UNK]', '?', '[SEP]']


In [42]:
# To check the results on a pair of sentences, we just pass the two sentences to Tokenizer.encode:
output = tokenizer.encode("Hello, y'all!", "How are you 😁 ?")
print(output.tokens)

['[CLS]', 'Hello', ',', 'y', "'", 'all', '!', '[SEP]', 'How', 'are', 'you', '[UNK]', '?', '[SEP]']


In [43]:
print(output.type_ids)


[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]


In [44]:
output = tokenizer.encode_batch(["Hello, y'all!", "How are you 😁 ?"])


In [45]:
output = tokenizer.encode_batch(
    [["Hello, y'all!", "How are you 😁 ?"], ["Hello to you too!", "I'm fine, thank you!"]]
)

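We can use Tokenizer.enable_padding, with the pad_token and its ID, to pad the outputs of encode_batch. We can also set the direction of the padding (defaults to the right) or a given length if we want to pad every sample to that specific number (here we leave it unset to pad to the size of the longest text).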
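
As a rough illustration of what padding does to a batch, the following pure-Python sketch pads lists of token IDs and builds the matching attention masks. `pad_batch` is a hypothetical helper whose parameter names loosely echo the padding options described above, and the token IDs are made up.

```python
def pad_batch(batch_ids, pad_id=3, direction="right", length=None):
    """Pad lists of token IDs to a common length and build attention masks."""
    # With no explicit length, pad to the longest sequence in the batch.
    target = length if length is not None else max(len(ids) for ids in batch_ids)
    padded, masks = [], []
    for ids in batch_ids:
        n_pad = max(target - len(ids), 0)
        pad, real = [pad_id] * n_pad, [1] * len(ids)
        if direction == "right":
            padded.append(ids + pad)
            masks.append(real + [0] * n_pad)  # 0 marks padding positions
        else:
            padded.append(pad + ids)
            masks.append([0] * n_pad + real)
    return padded, masks

# Two already-encoded sequences of different lengths (illustrative IDs only).
padded, masks = pad_batch([[1, 10, 11, 12, 13, 14, 15, 2], [1, 20, 21, 22, 0, 23, 2]])
print(padded[1])
print(masks[1])
```

The shorter sequence gains one trailing pad ID, and its mask ends in a 0 so that a model can ignore that position.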

In [46]:
tokenizer.enable_padding(pad_id=3, pad_token="[PAD]")

In [47]:
output = tokenizer.encode_batch(["Hello, y'all!", "How are you 😁 ?"])
print(output[1].tokens)
# ["[CLS]", "How", "are", "you", "[UNK]", "?", "[SEP]", "[PAD]"]

['[CLS]', 'How', 'are', 'you', '[UNK]', '?', '[SEP]', '[PAD]']


In [48]:
# attention mask generated by the tokenizer takes the padding into account:
print(output[1].attention_mask)


[1, 1, 1, 1, 1, 1, 1, 0]


In [57]:
# load any tokenizer from the Hugging Face Hub as long as a tokenizer.json file is available in the repository.
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")


In [63]:
# import a pretrained tokenizer directly from a vocabulary file:
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer("/content/gdrive/My Drive/bert-base-uncased-vocab.txt", lowercase=True)

In [64]:
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_file("/content/gdrive/My Drive/tokenizer-wiki.json")

Normalization is, in a nutshell, a set of operations you apply to a raw string to make it less random or “cleaner”. Common operations include stripping whitespace, removing accented characters, or lowercasing all text.

In [65]:
from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents
normalizer = normalizers.Sequence([NFD(), StripAccents()])

In [66]:
# manually test that normalizer by applying it to any string:
normalizer.normalize_str("Héllò hôw are ü?")


'Hello how are u?'
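
For intuition, the same NFD-then-strip-accents idea can be approximated with the standard library alone. This is a sketch under the assumption that stripping accents amounts to dropping combining marks (Unicode category `Mn`) after decomposition; `strip_accents_nfd` is a hypothetical helper.

```python
import unicodedata

def strip_accents_nfd(text):
    """Decompose characters with NFD, then drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)  # "é" becomes "e" + U+0301
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

result = strip_accents_nfd("Héllò hôw are ü?")
print(result)
```

The order matters: without the NFD step, "é" is a single code point with no separate accent to remove.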

Pre-tokenization is the act of splitting a text into smaller objects that give an upper bound to what your tokens will be at the end of training. A good way to think of this is that the pre-tokenizer will split your text into “words” and then, your final tokens will be parts of those words.

In [67]:
from tokenizers.pre_tokenizers import Whitespace
pre_tokenizer = Whitespace()
pre_tokenizer.pre_tokenize_str("Hello! How are you? I'm fine, thank you.")
# The output is a list of tuples, with each tuple containing one word and its span in the original sentence (which is used to determine the final offsets of our Encoding)

[('Hello', (0, 5)),
 ('!', (5, 6)),
 ('How', (7, 10)),
 ('are', (11, 14)),
 ('you', (15, 18)),
 ('?', (18, 19)),
 ('I', (20, 21)),
 ("'", (21, 22)),
 ('m', (22, 23)),
 ('fine', (24, 28)),
 (',', (28, 29)),
 ('thank', (30, 35)),
 ('you', (36, 39)),
 ('.', (39, 40))]

In [68]:
# pre-tokenizer that will split on space, punctuation and digits, separating numbers in their individual digits:
from tokenizers import pre_tokenizers
from tokenizers.pre_tokenizers import Digits
pre_tokenizer = pre_tokenizers.Sequence([Whitespace(), Digits(individual_digits=True)])
pre_tokenizer.pre_tokenize_str("Call 911!")

[('Call', (0, 4)), ('9', (5, 6)), ('1', (6, 7)), ('1', (7, 8)), ('!', (8, 9))]

In [71]:
from tokenizers.processors import TemplateProcessing
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)

In [72]:
# instantiate a new Tokenizer with a WordPiece model:
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
bert_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

Post-processing is the last step of the tokenization pipeline. It performs any additional transformation on the Encoding before it’s returned, such as adding special tokens.
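
The effect of a BERT-style template can be sketched in plain Python: wrap the first sentence in "[CLS]" and "[SEP]", append the second sentence with its own "[SEP]", and give the second part type ID 1. `apply_bert_template` is a hypothetical helper for illustration, not the library's implementation.

```python
def apply_bert_template(tokens_a, tokens_b=None):
    """Add [CLS]/[SEP] around one or two token lists and build type IDs."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    type_ids = [0] * len(tokens)  # first segment, including its special tokens
    if tokens_b is not None:
        second = tokens_b + ["[SEP]"]
        tokens += second
        type_ids += [1] * len(second)  # second segment and its closing [SEP]
    return tokens, type_ids

tokens, type_ids = apply_bert_template(
    ["Hello", ",", "y", "'", "all", "!"],
    ["How", "are", "you", "[UNK]", "?"],
)
print(tokens)
print(type_ids)
```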



In [73]:
# BERT preprocesses texts by removing accents and lowercasing. We also use a unicode normalizer:
from tokenizers import normalizers
from tokenizers.normalizers import NFD, Lowercase, StripAccents
bert_tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])

In [74]:
from tokenizers.pre_tokenizers import Whitespace
bert_tokenizer.pre_tokenizer = Whitespace()

In [75]:
from tokenizers.processors import TemplateProcessing
bert_tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", 1),
        ("[SEP]", 2),
    ],
)

In [77]:
from tokenizers.trainers import WordPieceTrainer
trainer = WordPieceTrainer(vocab_size=30522, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
files = [f"/content/gdrive/MyDrive/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
bert_tokenizer.train(files, trainer)
bert_tokenizer.save("/content/gdrive/MyDrive/bert-wiki.json")

In [79]:
# The decoder will first convert the IDs back to tokens (using the tokenizer’s vocabulary) and remove all special tokens, then join those tokens with spaces:
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.ids)
tokenizer.decode([1, 27253, 16, 93, 11, 5097, 5, 7961, 5112, 6218, 0, 35, 2])

[1, 27253, 16, 93, 11, 5097, 5, 7961, 5112, 6218, 0, 35, 2]


"Hello , y ' all ! How are you ?"

In [80]:
# If we take our previous bert_tokenizer, for instance, the default decoding will give:
output = bert_tokenizer.encode("Welcome to the 🤗 Tokenizers library.")
print(output.tokens)
bert_tokenizer.decode(output.ids)

['[CLS]', 'welcome', 'to', 'the', '[UNK]', 'tok', '##eni', '##zer', '##s', 'library', '.', '[SEP]']


'welcome to the tok ##eni ##zer ##s library .'

In [81]:
# changing it to a proper WordPiece decoder merges the "##" pieces back into words:
from tokenizers import decoders
bert_tokenizer.decoder = decoders.WordPiece()
bert_tokenizer.decode(output.ids)

'welcome to the tokenizers library.'
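
The core of that decoding step, gluing "##" continuation pieces onto the previous token, can be sketched as follows. `join_wordpieces` is a hypothetical helper; unlike the library decoder's output shown above, this sketch does not remove the space before punctuation.

```python
def join_wordpieces(tokens, prefix="##"):
    """Merge continuation pieces (marked with prefix) into the preceding word."""
    words = []
    for token in tokens:
        if token.startswith(prefix) and words:
            words[-1] += token[len(prefix):]  # glue onto the previous word
        else:
            words.append(token)
    return " ".join(words)

decoded = join_wordpieces(
    ["welcome", "to", "the", "tok", "##eni", "##zer", "##s", "library", "."]
)
print(decoded)
```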

This tokenizer is based on the Unigram model. It takes care of normalizing the input using the NFKC Unicode normalization method, and uses a ByteLevel pre-tokenizer with the corresponding decoder.



In [82]:
from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers
tokenizer = Tokenizer(models.Unigram())
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
tokenizer.decoder = decoders.ByteLevel()
trainer = trainers.UnigramTrainer(
    vocab_size=20000,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    special_tokens=["<PAD>", "<BOS>", "<EOS>"],
)

In [83]:
data = [
    "Beautiful is better than ugly.",
    "Explicit is better than implicit.",
    "Simple is better than complex.",
    "Complex is better than complicated.",
    "Flat is better than nested.",
    "Sparse is better than dense.",
    "Readability counts.",
]
tokenizer.train_from_iterator(data, trainer=trainer)

In [87]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.14.5-py3-none-any.whl (519 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: xxhash, dill, multiprocess, datasets
Successfully installed datasets-2.14.

In [88]:
import datasets
dataset = datasets.load_dataset("wikitext", "wikitext-103-raw-v1", split="train+test+validation")
def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]


In [None]:
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(dataset))
