# **Tokenization**

## **TOC:**

- 1) **[Introduction](#intro)**

- 2) **[Character Tokenization](#chartoken)**

- 3) **[Word Tokenization](#wordtoken)**

- 4) **[Subword Tokenization](#subwordtoken)**

    - 4.1) **[Auto Tokenizer](#autotokenizer)**
    - 4.2) **[Specific Tokenizer](#specifictokenizer)**

- 5) **[Tokenizing the Dataset](#tokenizingdataset)**
    
    - 5.1) **[HuggingFace Dataset](#huggingdataset)**
    
    - 5.2) **[Custom Dataset](#customdataset)**

Transformers provides a convenient `AutoTokenizer` class that lets you quickly load
the tokenizer associated with a pretrained model.

<center><img src="figures/attention_masks.png" width=600></center>

In [1]:
from datasets import load_dataset

emotions = load_dataset("emotion")
emotions

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})

## 2) **Character Tokenization** <a class="anchor" id="chartoken"></a>

A character tokenizer is a wrapper around a dictionary: each character maps to an integer ID.
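As a minimal sketch of that idea, assuming a toy corpus of two made-up sentences (the names `encode` and `decode` are illustrative and not part of any library):

```python
# Toy corpus (made up for illustration; the notebook itself uses the emotion dataset)
corpus = ["test a", "test b"]

# Vocabulary: every distinct character, sorted so the IDs are reproducible
vocab = sorted(set("".join(corpus)))
char_to_id = {ch: idx for idx, ch in enumerate(vocab)}
id_to_char = {idx: ch for ch, idx in char_to_id.items()}


def encode(text):
    """Turn a string into a list of integer IDs, one per character."""
    return [char_to_id[ch] for ch in text]


def decode(ids):
    """Invert encode: turn a list of IDs back into a string."""
    return "".join(id_to_char[i] for i in ids)


ids = encode("test a")
print(ids)          # [5, 3, 4, 5, 0, 1]
print(decode(ids))  # test a
```

A character that was not seen when the vocabulary was built raises a `KeyError` in this sketch.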

In [46]:
train_ds = emotions["train"]

# Assuming the whole training set fits in memory
vocab = set("".join(train_ds["text"]))
# Map each character to an integer ID, in sorted order so the IDs are reproducible
char_mapping = {ch: idx for idx, ch in enumerate(sorted(vocab))}

In [91]:
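def char_tokenizer(batch, mapping):
    """Map every character of batch["text"] to its integer ID."""
    if isinstance(batch["text"], list):
        # Batched call: batch["text"] is a list of sentences
        mapped_tokens = [[mapping[char] for char in sentence] for sentence in batch["text"]]
    else:
        # Single example: batch["text"] is one sentence
        mapped_tokens = [mapping[char] for char in batch["text"]]

    return {"input_ids": mapped_tokens}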

In [94]:
test = train_ds.map(lambda x: char_tokenizer(x, char_mapping), batched=True)

In [95]:
test[0]

{'text': 'i didnt feel humiliated',
 'label': 0,
 'input_ids': [9, 0, 4, 9, 4, 14, 20, 0, 6, 5, 5, 12, 0, 8, 21, 13, 9, 12, 9, 1, 20, 5, 4]}

In [62]:
batch = ["test a", "test b"]
[[char_mapping[char] for char in elem] for elem in batch]

[[20, 5, 19, 20, 0, 1], [20, 5, 19, 20, 0, 2]]

In [None]:
# char_tokenizer also needs the mapping, so wrap it in a lambda
train_ds.map(lambda x: char_tokenizer(x, char_mapping))

In [None]:
[char for char in train_ds[0]["text"]]

In [None]:
char_mapping

In [5]:
from transformers import DistilBertTokenizer

model_ckpt = "distilbert-base-uncased"

distilbert_tokenizer = DistilBertTokenizer.from_pretrained(model_ckpt)

In [6]:
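def tokenize(batch, tokenizer):
    # padding=True pads to the longest sequence in the call;
    # truncation=True cuts sequences at the model's maximum length
    return tokenizer(batch["text"], padding=True, truncation=True)


# batched=True passes groups of examples to tokenize at once
aux1 = train_ds.select(range(1000)).map(lambda x: tokenize(x, distilbert_tokenizer), batched=True)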

In [9]:
aux1

Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 1000
})

In [8]:
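# Same function as in the earlier cell, repeated so this cell runs on its own
def tokenize(batch, tokenizer):
    return tokenizer(batch["text"], padding=True, truncation=True)

# Without batched=True, map calls tokenize once per example,
# so padding=True only pads up to that example's own length
aux2 = train_ds.select(range(1000)).map(lambda x: tokenize(x, distilbert_tokenizer))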

In [14]:
aux2[1]

{'text': 'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake',
 'label': 0,
 'input_ids': [101, 1045, 2064, 2175, 2013, 3110, 2061, 20625, 2000, 2061, 9636, 17772, 2074, 2013, 2108, 2105, 2619, 2040, 14977, 1998, 2003, 8300, 102],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
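
The example above has no padding (its `attention_mask` is all 1s) because `map` without `batched=True` tokenizes one example at a time, so `padding=True` only pads up to that example's own length. A rough pure-Python sketch of the padding logic, assuming a pad ID of 0 and a hypothetical `pad_batch` helper (an illustration only, not how Transformers implements it):

```python
PAD_ID = 0  # assumption: 0 is the padding ID, as in distilbert-base-uncased


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


short = [101, 1045, 102]
long = [101, 1045, 2064, 2175, 102]

# Padded together: the short sequence gets two pad tokens, masked out with 0s
together = pad_batch([short, long])
print(together["input_ids"])       # [[101, 1045, 102, 0, 0], [101, 1045, 2064, 2175, 102]]
print(together["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]

# Padded alone: the longest sequence is the sequence itself, so nothing is added
alone = pad_batch([short])
print(alone["attention_mask"])     # [[1, 1, 1]]
```

With `batched=True`, every sequence in a batch is padded to that batch's longest sequence, which is the situation the first call imitates.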