# Tokenization

In [None]:
!pip install transformers
!pip install datasets

## Character-based tokenization

In [None]:
sentence = "I would like to work as a machine learning engineer at Google!".lower()
print(sentence)

In [None]:
sentence = sentence.replace(" ","")
print(sentence)

In [None]:
chars = list(sentence)  # split the string into individual characters
print(chars)

In [None]:
chars = sorted(set(chars))  # deduplicate; sorting keeps the ID assignment reproducible
print(chars)

In [None]:
# The keys are characters, not words, so name the mapping accordingly.
char_to_idx = {char: i for i, char in enumerate(chars)}
char_to_idx
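
To make the purpose of such a mapping concrete, here is a minimal, self-contained sketch of encoding a string to IDs and decoding it back. The sample text and the helper names `encode` and `decode` are illustrative assumptions, not part of any library.

```python
# Toy character-level tokenizer; all names here are illustrative.
text = "machine learning"

# Sort the unique characters so the ID assignment is reproducible across runs.
vocab = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(vocab)}
idx_to_char = {i: ch for ch, i in char_to_idx.items()}

def encode(s):
    """Map every character to its ID (raises KeyError for unseen characters)."""
    return [char_to_idx[ch] for ch in s]

def decode(ids):
    """Invert encode() by mapping IDs back to characters."""
    return "".join(idx_to_char[i] for i in ids)

ids = encode("learning")
print(ids)
print(decode(ids))
```

The vocabulary stays tiny, but every sentence becomes a long ID sequence, which is the usual trade-off of character-level tokenization.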

## WordLevel-based tokenization

This is the "classic" tokenization algorithm: every word is mapped to its own token ID. It is very easy to use and understand, but it needs an extremely large vocabulary for good coverage. The model itself makes no segmentation decisions; it simply maps each input word to an ID, so the text has to be split into words beforehand by a pre-tokenizer.
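
The idea can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: whitespace splitting, a tiny made-up corpus, and a vocabulary capped at five words plus a hypothetical `[UNK]` entry for everything else.

```python
from collections import Counter

corpus = ["the cat sat on the mat", "the dog sat on the log"]

# Count words, then keep only the most frequent ones; ID 0 is reserved for unknown words.
counts = Counter(word for line in corpus for word in line.split())
vocab = {"[UNK]": 0}
for word, _ in counts.most_common(5):
    vocab[word] = len(vocab)

def encode(sentence):
    """Look up each whitespace-separated word, falling back to [UNK]."""
    return [vocab.get(word, vocab["[UNK]"]) for word in sentence.split()]

print(vocab)
print(encode("the cat sat on the sofa"))
```

Any word outside the capped vocabulary (here "sofa") collapses to `[UNK]`, which is why good coverage requires a very large vocabulary.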

### NLTK

In [None]:
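import nltk
nltk.download('punkt')
# Assumption: recent NLTK releases look for 'punkt_tab' instead of 'punkt'; fetching both is harmless.
nltk.download('punkt_tab')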

In [None]:
from nltk.tokenize import word_tokenize

s = '''Good muffins cost $3.88\nin New York.  Please buy me two of them.\n\nThanks.'''
word_tokenize(s)

### Hugging Face

In [None]:
from tokenizers.pre_tokenizers import Whitespace

pre_tokenizer = Whitespace()
pre_tokenizer.pre_tokenize_str("Hello! How are you? I'm fine, thank you.")

## BPE

Byte-Pair Encoding (BPE) is one of the most popular subword tokenization algorithms. It starts from individual characters and repeatedly merges the most frequent adjacent pair of symbols in the corpus into a new token, so each iteration builds on the tokens created by earlier ones. Because BPE can assemble words it has never seen from several subword tokens, it needs a smaller vocabulary and is less likely to produce "unknown" tokens.
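
The merge loop can be sketched on a toy corpus. This is a simplified illustration, not the implementation used by the `tokenizers` library: the word frequencies are made up, there is no end-of-word marker, and ties are broken by whichever pair was counted first.

```python
from collections import Counter

def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: each word starts as a tuple of characters with its frequency.
word_freqs = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(4):
    pairs = get_pair_counts(word_freqs)
    best = max(pairs, key=pairs.get)  # ties resolve to the first pair counted
    merges.append(best)
    word_freqs = merge_pair(best, word_freqs)

print(merges)
print(list(word_freqs))
```

After four merges the frequent word "low" has become a single token, while rarer words such as "lower" are still spelled out from smaller pieces.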

In [None]:
from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-103-raw-v1")
dataset

In [None]:
corpus = dataset["train"]["text"] + dataset["test"]["text"] + dataset["validation"]["text"]
len(corpus)

### Special tokens
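- [UNK]: unknown token, used for input that is not in the vocabulary
- [CLS]: classification token, placed at the start of the sequence
- [SEP]: separator token, marks the end of a sentence or the boundary between two sentences
- [PAD]: padding token, used to pad inputs to a fixed length
- [MASK]: masking token, e.g. "Hello I'm a [MASK] model."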
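
To show where these tokens end up in a model input, here is a hand-written sketch of BERT-style wrapping and padding. The helper `build_inputs` and its `max_len` parameter are hypothetical names chosen for illustration; in practice the tokenizer's post-processor and padding options do this work.

```python
def build_inputs(tokens_a, tokens_b=None, max_len=12):
    """Wrap one or two token lists BERT-style, then pad to max_len."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    type_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        type_ids += [1] * (len(tokens_b) + 1)  # second segment gets type ID 1
    if len(tokens) > max_len:
        raise ValueError("sequence longer than max_len")
    attention_mask = [1] * len(tokens)  # 1 marks real tokens
    n_pad = max_len - len(tokens)
    tokens += ["[PAD]"] * n_pad
    type_ids += [0] * n_pad
    attention_mask += [0] * n_pad  # 0 marks padding
    return tokens, type_ids, attention_mask

tokens, type_ids, mask = build_inputs(["hello", "world"], ["how", "are", "you"], max_len=10)
print(tokens)
print(type_ids)
print(mask)
```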

In [None]:
from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
print(tokenizer)

In [None]:
from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
print(trainer)

In [None]:
from tokenizers.pre_tokenizers import Whitespace

tokenizer.pre_tokenizer = Whitespace()

In [None]:
tokenizer.train_from_iterator(corpus, trainer)

In [None]:
tokenizer.save("tokenizer-bpe-wiki.json")

In [None]:
tokenizer = Tokenizer.from_file("tokenizer-bpe-wiki.json")

In [None]:
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output)

In [None]:
print(output.tokens)
print(output.ids)
print(output.offsets[9])

In [None]:
tokenizer.token_to_id("[SEP]")

In [None]:
from tokenizers.processors import TemplateProcessing

tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

In [None]:
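# `output` was encoded before the post-processor was attached, so it has no [CLS]/[SEP] yet.
print(output.tokens)
# Re-encode as a sentence pair so the template is applied.
output = tokenizer.encode("Hello, y'all!", "How are you 😁 ?")
print(output.tokens)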

In [None]:
print(output.type_ids)

## Encoding in a batch

In [None]:
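# Look up the [PAD] ID instead of hard-coding it (it is 3 with the special-token order used above).
tokenizer.enable_padding(pad_id=tokenizer.token_to_id("[PAD]"), pad_token="[PAD]")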

In [None]:
output = tokenizer.encode_batch(["Hello, y'all!", "How are you 😁 ?"])
print(output[0].tokens)
print(output[1].tokens)

In [None]:
print(output[0].attention_mask)
print(output[1].attention_mask)

## Using a pretrained tokenizer

- BERT
- WordPiece: A subword tokenization algorithm very similar to BPE, used mainly by Google in models such as BERT. At tokenization time it works greedily, matching the longest vocabulary entry first and splitting a word into several tokens only when the whole word is not in the vocabulary. BPE, by contrast, starts from characters and merges them into progressively larger tokens. WordPiece uses the ## prefix to mark tokens that continue a word (i.e. that do not start one).
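
The greedy longest-match-first step can be sketched as follows. The function name and the tiny vocabulary are assumptions made for illustration; this covers only how a single word is split, not how a WordPiece vocabulary is trained.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word by repeatedly taking the longest matching vocabulary entry."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the candidate from the right and retry
        if piece is None:
            return [unk_token]  # no piece fits, so the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

vocab = {"token", "##ization", "##izer", "un", "##known", "[UNK]"}
print(wordpiece_tokenize("tokenization", vocab))
print(wordpiece_tokenize("tokenizer", vocab))
print(wordpiece_tokenize("xyz", vocab))
```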

In [None]:
import requests

url = "https://huggingface.co/nlpaueb/legal-bert-base-uncased/raw/main/vocab.txt"
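response = requests.get(url)
response.raise_for_status()  # fail early if the download did not succeed

# Write as UTF-8 so non-ASCII vocabulary entries survive on every platform.
with open("bert-vocab.txt", "w", encoding="utf-8") as f:
    f.write(response.text)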

In [None]:
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer("bert-vocab.txt", lowercase=True)

In [None]:
output = tokenizer.encode("Hello, y'all!", "How are you 😁 ?")
print(output.tokens)

### Building your own WordPiece

The workflow is the same as for BPE; only the model and trainer classes change to `WordPiece` and `WordPieceTrainer`.

In [None]:
from tokenizers.models import WordPiece

tokenizerWP = Tokenizer(WordPiece(unk_token="[UNK]"))
print(tokenizerWP)

In [None]:
from tokenizers.trainers import WordPieceTrainer

trainerWP = WordPieceTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
print(trainerWP)

In [None]:
tokenizerWP.pre_tokenizer = Whitespace()
tokenizerWP.train_from_iterator(corpus, trainerWP)

In [None]:
output = tokenizerWP.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)

## Unigram

Unigram is also a subword tokenization algorithm. It tries to find the set of subword tokens that maximizes the likelihood of a given sentence. Unlike BPE, it does not tokenize by deterministically applying a fixed sequence of merge rules. Instead, Unigram can score several candidate segmentations of the same text and select the most probable one.
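
Selecting the most probable segmentation can be sketched with a small Viterbi search. The token probabilities below are invented for illustration, and the function covers only inference with a fixed vocabulary, not how Unigram training prunes that vocabulary.

```python
import math

def unigram_segment(text, log_probs):
    """Return the most likely segmentation of `text` under a unigram model (Viterbi)."""
    n = len(text)
    best_score = [0.0] + [-math.inf] * n  # best log-probability of text[:i]
    back = [0] * (n + 1)                  # start index of the last piece ending at i
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best_score[start] + log_probs[piece] > best_score[end]:
                best_score[end] = best_score[start] + log_probs[piece]
                back[end] = start
    if best_score[n] == -math.inf:
        return None, -math.inf  # no segmentation exists with this vocabulary
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1], best_score[n]

# Made-up token probabilities for a toy vocabulary.
probs = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.1, "hu": 0.15, "ug": 0.2, "hug": 0.3, "gs": 0.1}
log_probs = {piece: math.log(p) for piece, p in probs.items()}

pieces, score = unigram_segment("hugs", log_probs)
print(pieces, math.exp(score))
```

Here "hug" + "s" (probability 0.3 × 0.1) beats alternatives such as "hu" + "gs" (0.15 × 0.1), so it is the segmentation returned.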

In [None]:
from tokenizers.models import Unigram