# **Tokenization**

## **TOC:**

- 1) **[Introduction](#intro)**

- 2) **[Character Tokenization](#chartoken)**

- 3) **[Word Tokenization](#wordtoken)**

- 4) **[Subword Tokenization](#subwordtoken)**

    - 4.1) **[Auto Tokenizer](#autotokenizer)**
    - 4.2) **[Specific Tokenizer](#specifictokenizer)**

- 5) **[Tokenizing the Dataset](#tokenizingdataset)**
    
    - 5.1) **[HuggingFace Dataset](#huggingdataset)**
    
    - 5.2) **[Custom Dataset](#customdataset)**

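`load_dataset` returns a `DatasetDict`, which is a wrapper around a dictionary: each key is a split name and each value is a `Dataset`.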

In [1]:
from datasets import load_dataset


# The base class Dataset implements a Dataset backed by an Apache Arrow table.
emotions = load_dataset("emotion") ; emotions


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})

In [2]:
train_ds = emotions["train"] ; train_ds

Dataset({
    features: ['text', 'label'],
    num_rows: 16000
})

---

## 2) **Character Tokenization** <a class="anchor" id="chartoken"></a>

**References:**
- https://huggingface.co/docs/datasets/process
- https://huggingface.co/docs/datasets/v2.3.2/en/package_reference/main_classes#datasets.Dataset.map

In [3]:
# Assuming everything fits in memory
vocab = set("".join(train_ds["text"]))
char_mapping = {ch: idx for idx, ch in enumerate(sorted(vocab))}
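To make the mapping above concrete, here is a minimal sketch on a toy corpus. The two sentences, the `encode`/`decode` helpers, and the `inverse_mapping` name are assumptions for illustration; they are not part of the emotion dataset or of the `datasets` library.

```python
# Toy corpus standing in for train_ds["text"] (hypothetical data).
texts = ["i feel fine", "i am sad"]

# Same construction as above: one integer id per distinct character.
vocab = set("".join(texts))
char_mapping = {ch: idx for idx, ch in enumerate(sorted(vocab))}

# Inverse mapping, used to decode ids back to characters.
inverse_mapping = {idx: ch for ch, idx in char_mapping.items()}

def encode(text, mapping):
    """Map every character of `text` to its integer id."""
    return [mapping[ch] for ch in text]

def decode(ids, inverse):
    """Map integer ids back to the original string."""
    return "".join(inverse[i] for i in ids)

ids = encode("i am fine", char_mapping)
print(ids)
print(decode(ids, inverse_mapping))
```

Encoding fails with a `KeyError` for any character that was not seen when the vocabulary was built, which is one reason real tokenizers reserve an unknown token.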

In [5]:
# %%timeit -n 1 -r 1

def char_non_batched_tokenizer(example, mapping):
    # Without batched=True, `example` is a single row, so example["text"] is one string.
    mapped_tokens = [mapping[char] for char in example["text"]]

    return {"input_ids": mapped_tokens}

# function(example: Dict[str, Any]) -> Dict[str, Any]
_ = train_ds.map(lambda x: char_non_batched_tokenizer(x, char_mapping))


849 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


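With ```batched=True```, ```map``` passes the function a batch of rows at a time instead of a single row. The default batch size is 1000, but you can adjust it with the ```batch_size``` argument.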

In [6]:
# %%timeit -n 1 -r 1

def char_batched_tokenizer(batch, mapping):
    
    mapped_tokens = [[mapping[char] for char in sentence] for sentence in batch["text"]]
    
    return {"input_ids": mapped_tokens}    

_ = train_ds.map(lambda x: char_batched_tokenizer(x, char_mapping), batched=True)


216 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
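The batched run is faster because the function is called once per batch rather than once per row. A minimal sketch of what the mapped function receives in each mode, using plain dictionaries in place of a `Dataset` (the toy mapping, the inputs, and the two function names are assumptions for illustration):

```python
def char_tokenizer_single(example, mapping):
    # Non-batched mode: example["text"] is a single string.
    return {"input_ids": [mapping[ch] for ch in example["text"]]}

def char_tokenizer_batched(batch, mapping):
    # Batched mode: batch["text"] is a list of strings.
    return {"input_ids": [[mapping[ch] for ch in text] for text in batch["text"]]}

mapping = {" ": 0, "a": 1, "b": 2}  # hypothetical toy mapping

single = char_tokenizer_single({"text": "ab a"}, mapping)
batched = char_tokenizer_batched({"text": ["ab a", "b"]}, mapping)
print(single)   # one list of ids
print(batched)  # one list of ids per row in the batch
```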


Use the ```num_proc``` argument to set the number of worker processes.

In [7]:
# %%timeit -n 1 -r 1

def char_tokenizer(batch, mapping):
    
    if isinstance(batch["text"], list):
        mapped_tokens = [[mapping[char] for char in sentence] for sentence in batch["text"]]
    else:
        mapped_tokens = [mapping[char] for char in batch["text"]]
        
    return {"input_ids": mapped_tokens}

_ = train_ds.map(lambda x: char_tokenizer(x, char_mapping), batched=True, num_proc=4)

      


297 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
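A rough sketch of how the work is divided, assuming the rows are split as evenly as possible across processes and each process then works through its share in batches of 1000 (the default batch size). The `batches_per_process` helper is hypothetical and only does the arithmetic; it does not call the `datasets` library.

```python
import math

def batches_per_process(num_rows, batch_size=1000, num_proc=1):
    """Return the number of batches each process handles,
    assuming rows are split as evenly as possible across processes."""
    base, extra = divmod(num_rows, num_proc)
    shard_sizes = [base + (1 if rank < extra else 0) for rank in range(num_proc)]
    return [math.ceil(size / batch_size) for size in shard_sizes]

print(batches_per_process(16000))              # single process
print(batches_per_process(16000, num_proc=4))  # four processes
```

For a function as cheap as the character lookup, the cost of starting and coordinating processes can outweigh the gain, which is consistent with the 297 ms measured here against 216 ms for the single-process batched run.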


---

## 3) **Word Tokenization** <a class="anchor" id="wordtoken"></a>

Using word tokenization enables the model to skip the step of
learning words from characters, and thereby reduces the complexity of the training
process.

In [15]:
# Assuming everything fits in memory
# Join with a space so the last word of one text does not merge with the first word of the next
vocab = set(" ".join(train_ds["text"]).split())
word_mapping = {word: idx for idx, word in enumerate(sorted(vocab))}

**Problems:**
- Punctuation is not handled: it stays attached to the neighboring word.
- Declensions, conjugations, and misspellings are not handled: each variant becomes a separate vocabulary entry.
- The vocabulary can easily grow very large.
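These problems can be seen on a toy corpus. The sketch below is illustrative only: the sentences are made up, the regex split is one possible mitigation rather than the notebook's method, and `unk_id` is a hypothetical id reserved for words that are missing from the vocabulary.

```python
import re

texts = ["i cry, you cry.", "she cried"]  # hypothetical toy corpus

# Whitespace splitting keeps punctuation attached to words.
whitespace_vocab = sorted(set(" ".join(texts).split()))
print(whitespace_vocab)

# One possible mitigation (an assumption, not the notebook's method):
# split words and punctuation into separate tokens with a regex.
regex_vocab = sorted(set(re.findall(r"\w+|[^\w\s]", " ".join(texts))))
print(regex_vocab)

# Words unseen when the vocabulary was built need a fallback id.
word_mapping = {word: idx for idx, word in enumerate(regex_vocab)}
unk_id = len(word_mapping)  # hypothetical id reserved for unknown words
ids = [word_mapping.get(w, unk_id) for w in "i cries".split()]
print(ids)
```

Even after separating punctuation, "cry", "cried", and "cries" remain unrelated entries, which is the gap subword tokenization is meant to close.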

---

## 4) **Subword Tokenization** <a class="anchor" id="subwordtoken"></a>

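Subword tokenization keeps frequent words whole and splits rare words into smaller pieces. There are several subword tokenization algorithms, such as **Byte-Pair Encoding** and **WordPiece**; more information can be found in the [Hugging Face course chapter on tokenizers](https://huggingface.co/course/chapter6/1?fw=pt). Transformers provides a convenient AutoTokenizer class that loads the tokenizer associated with a pretrained model. Here we load the model-specific DistilBertTokenizer class directly instead.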

In [16]:
from transformers import DistilBertTokenizer

model_ckpt = "distilbert-base-uncased"

distilbert_tokenizer = DistilBertTokenizer.from_pretrained(model_ckpt)

In [25]:
# %%timeit -n 1 -r 1

def tokenize(batch, tokenizer):
    return tokenizer(batch["text"], padding=True, truncation=True)

tokenized_train_ds = train_ds.map(lambda x: tokenize(x, distilbert_tokenizer), batched=True)


6.61 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [63]:
# %%timeit -n 1 -r 1

def tokenize(batch, tokenizer):
    return tokenizer(batch["text"], padding=True, truncation=True)

tokenized_train_ds = train_ds.map(lambda x: tokenize(x, distilbert_tokenizer), batched=True, num_proc=4)

      


Here, `truncation=True` truncates each sequence to the model's maximum input length, and `padding=True` pads each sequence to the longest sequence in its batch. More on padding and truncation can be found here: https://huggingface.co/docs/transformers/pad_truncation.

Tokenization is expensive here, so spreading the work across more processes reduced the overall time.

In [69]:
tokens = distilbert_tokenizer.convert_ids_to_tokens(tokenized_train_ds["input_ids"][100])
tokens[0], tokens[35]

('[CLS]', '[SEP]')

The special [CLS] and [SEP] tokens have been added to the start and end of the sequence. These tokens differ from model to model, but their main role is to mark where a sequence starts and ends.

In [70]:
tokens[:5]

['[CLS]', 'i', 'won', '##t', 'let']

The ## prefix in ##t means that the token continues the previous one: there is no whitespace between them, so won and ##t together form wont. We can convert the tokens back to a string:

In [71]:
distilbert_tokenizer.convert_tokens_to_string(tokens)

'[CLS] i wont let me child cry it out because i feel that loving her and lily when she was little was going to be opportunities that only lasted for those short few months [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]'
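The merging rule behind this conversion can be sketched in a few lines. `join_wordpiece` is a hypothetical helper written for illustration; it is not the library's implementation and ignores details such as punctuation clean-up.

```python
def join_wordpiece(tokens):
    """Rebuild a string from WordPiece tokens: a token starting with
    '##' is glued to the previous token, any other token starts a new word."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)

print(join_wordpiece(["[CLS]", "i", "won", "##t", "let"]))  # [CLS] i wont let
```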

When a piece of text cannot be represented with the vocabulary, the tokenizer substitutes an unknown token for it.

In [72]:
distilbert_tokenizer.unk_token

'[UNK]'
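WordPiece splitting is commonly described as greedy longest-match-first: take the longest vocabulary entry that matches the start of the word, then repeat on the remainder with the ## prefix. The sketch below follows that description on a toy vocabulary. The `wordpiece` function and `toy_vocab` are assumptions for illustration, not the DistilBertTokenizer code, and they skip normalization steps such as lowercasing.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first.
    Pieces that do not start the word carry the '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"won", "##t", "let", "i"}  # hypothetical vocabulary
print(wordpiece("wont", toy_vocab))
print(wordpiece("let", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```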

For each batch, the input sequences are padded to the maximum sequence length in the batch; the attention mask is used in the model to ignore the padded areas of
the input tensors.

<center><img src="figures/attention_masks.png" width=600></center>

In [73]:
tokenized_train_ds["attention_mask"][0][:8], tokenized_train_ds["input_ids"][0][:8]

([1, 1, 1, 1, 1, 1, 1, 0], [101, 1045, 2134, 2102, 2514, 26608, 102, 0])
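The relationship between padding and the attention mask can be sketched without the library. `pad_batch` is a hypothetical helper, and `pad_id=0` is an assumption that matches the padded position in the output above; the second sequence is made up so that the first one needs exactly one padding position.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one and build
    the matching attention masks (1 = real token, 0 = padding)."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([
    [101, 1045, 2134, 2102, 2514, 26608, 102],        # ids from the output above
    [101, 1045, 2134, 2102, 2514, 26608, 2061, 102],  # hypothetical longer sequence
])
print(batch["input_ids"][0])
print(batch["attention_mask"][0])
```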

In [57]:
distilbert_tokenizer.model_max_length

512
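This value is the limit that truncation enforces. A minimal sketch of the idea, assuming the special start and end tokens count toward the limit and use the ids 101 and 102 seen in the output above. `truncate` is a hypothetical helper and the input ids are made up.

```python
def truncate(content_ids, max_length, cls_id=101, sep_id=102):
    """Keep at most max_length ids in total, reserving two positions
    for the special start and end tokens."""
    kept = content_ids[: max_length - 2]
    return [cls_id] + kept + [sep_id]

# 600 made-up content ids, longer than the 512-token limit.
ids = truncate(list(range(1000, 1600)), max_length=512)
print(len(ids), ids[0], ids[-1])
```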

---