# How to use Hugging Face tokenizers and datasets

### Some basic info about tokenization (reference: https://huggingface.co/learn/nlp-course/en/chapter2/4)
* Word tokenization results in a very large vocabulary, since every unique word needs its own ID. In practice the vocabulary is capped, so rare words end up mapped to `<UNK>` (unknown).
* Character tokenization results in a very small vocabulary (at least for English), but the tokens aren't very meaningful. Also, the number of tokens that need to be processed by the model becomes very large.
* Subword tokenization is a good balance. It reduces the vocabulary size while keeping the tokens meaningful. Rare words can still be represented because they are split into common subwords.
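
To make the first two bullets concrete, here is a minimal pure-Python sketch. The sentences and the `<UNK>` placeholder string are made up for illustration, and real tokenizers are more involved than a whitespace split. It builds a word-level and a character-level vocabulary from one sentence, then tokenizes a new sentence containing a word that was never seen as a whole.

```python
# Toy "training" text used to build both vocabularies (arbitrary example).
train_text = "the runner was running while the other runners ran"

word_vocab = set(train_text.split())  # one entry per unique word
char_vocab = set(train_text)          # one entry per unique character

UNK = "<UNK>"
new_text = "the runner runs"  # "runs" never appears as a whole word above

# Anything outside the vocabulary is replaced by the unknown token.
word_tokens = [w if w in word_vocab else UNK for w in new_text.split()]
char_tokens = [c if c in char_vocab else UNK for c in new_text]

print(word_tokens)       # the unseen word is lost
print(len(word_tokens))  # short sequence
print(len(char_tokens))  # much longer sequence, but nothing is unknown
```

The word-level version loses "runs" entirely, while the character-level version keeps everything at the cost of a sequence several times longer. Subword tokenization sits between the two.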

### <u>Let's first get a feel for what the tokenizer does.</u>

In [2]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModel
import torch
import pandas as pd

In [30]:
# Use the GPU when one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('cuda available:', torch.cuda.is_available())

tokenizer = AutoTokenizer.from_pretrained('distilbert/distilbert-base-uncased')

# text = "Replace me by any text you'd like."

text = r"When you're the daughter of the Oceanic nymph, Styx, and a Titan called Pallas, you expect to be given a weird name."

# text = r"This is a test of run vs. running."

encoded_input = tokenizer(text, return_tensors='pt').to(device)
print(type(encoded_input))
print(encoded_input) # contains input_ids and attention_mask
print(encoded_input['input_ids'].shape)
print(encoded_input['attention_mask'].shape)

cuda available: True
<class 'transformers.tokenization_utils_base.BatchEncoding'>
{'input_ids': tensor([[  101,  2043,  2017,  1005,  2128,  1996,  2684,  1997,  1996, 18955,
          6396,  8737,  2232,  1010, 21856,  1010,  1998,  1037, 16537,  2170,
         14412,  8523,  1010,  2017,  5987,  2000,  2022,  2445,  1037,  6881,
          2171,  1012,   102]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
torch.Size([1, 33])
torch.Size([1, 33])
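
With a single sentence, every entry of `attention_mask` is 1, so it is not obvious what the mask is for. It matters once sentences of different lengths are batched together: the shorter ones are padded to a common length, and the mask marks the padding positions with 0 so the model can ignore them. Below is a minimal pure-Python sketch of that idea. The helper name `pad_batch` is hypothetical, and the pad ID of 0 is an assumption (check `tokenizer.pad_token_id` for your tokenizer). The IDs are taken from the output above.

```python
def pad_batch(sequences, pad_id=0):
    """Pad lists of token IDs to the same length and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # real IDs, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# "[CLS] when you [SEP]" and "[CLS] name [SEP]"
batch = pad_batch([[101, 2043, 2017, 102], [101, 2171, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

With the real tokenizer you would not write this yourself: passing a list of strings together with `padding=True` returns padded `input_ids` and the corresponding `attention_mask`.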


### <u>What about using the tokenizer to decode?</u>

In [31]:
decoded = tokenizer.decode(encoded_input['input_ids'].squeeze())
decoded_remove_special = tokenizer.decode(encoded_input['input_ids'].squeeze(), skip_special_tokens=True)
print(decoded)
print(decoded_remove_special)

[CLS] when you're the daughter of the oceanic nymph, styx, and a titan called pallas, you expect to be given a weird name. [SEP]
when you're the daughter of the oceanic nymph, styx, and a titan called pallas, you expect to be given a weird name.


### <u>Let's see what the tokenizer is doing internally.</u>

In [32]:
# Generally, you won't be using these methods yourself.

# These two methods comprise encoding: tokenizing text and converting to IDs
tokens = tokenizer.tokenize(text)
ids = tokenizer.convert_tokens_to_ids(tokens) # uses a vocabulary to do the conversion

# Converts IDs back to tokens and groups them together into a readable sequence
# Useful for decoding the output of generative models
decoded = tokenizer.decode(ids)

print(tokens)
print(ids)
print(decoded)

['when', 'you', "'", 're', 'the', 'daughter', 'of', 'the', 'oceanic', 'ny', '##mp', '##h', ',', 'styx', ',', 'and', 'a', 'titan', 'called', 'pal', '##las', ',', 'you', 'expect', 'to', 'be', 'given', 'a', 'weird', 'name', '.']
[2043, 2017, 1005, 2128, 1996, 2684, 1997, 1996, 18955, 6396, 8737, 2232, 1010, 21856, 1010, 1998, 1037, 16537, 2170, 14412, 8523, 1010, 2017, 5987, 2000, 2022, 2445, 1037, 6881, 2171, 1012]
when you're the daughter of the oceanic nymph, styx, and a titan called pallas, you expect to be given a weird name.
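
Notice the `##` prefix in the tokens above: it marks a piece that continues the previous token rather than starting a new word. WordPiece-style tokenizers split a word by repeatedly taking the longest piece found in the vocabulary. The sketch below illustrates that greedy longest-match idea in pure Python. The function name `wordpiece` and the tiny vocabulary are made up for illustration, and the real tokenizer also handles normalization, punctuation, and other details.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary; the real one has tens of thousands of entries.
toy_vocab = {"ny", "##mp", "##h", "pal", "##las", "name", "n", "##y"}

print(wordpiece("nymph", toy_vocab))
print(wordpiece("pallas", toy_vocab))
print(wordpiece("name", toy_vocab))
print(wordpiece("zeus", toy_vocab))
```

Common words such as "name" stay whole, rare words such as "nymph" are broken into pieces, and a word is only mapped to the unknown token when no combination of pieces matches.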


### <u>Let's see how we pass the encoded input to a model.</u>

In [20]:
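# Half precision saves memory on the GPU; use float32 on the CPU, where float16 support is limited
dtype = torch.float16 if device == 'cuda' else torch.float32
model = AutoModel.from_pretrained("distilbert/distilbert-base-uncased", torch_dtype=dtype)
model.to(device)
with torch.no_grad(): # inference only, so skip gradient tracking
    output = model(**encoded_input) # unpack the encoded input (ids, attention mask) and pass it to the model
final_output = output.last_hidden_state.cpu().numpy()
print(final_output.shape) # batch size (1 in this case), number of tokens, and hidden size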

(1, 33, 768)
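
The model returns one vector of size 768 per token. If you need a single vector for the whole sentence, one common option (not something this notebook does) is to average the token vectors while skipping padding positions. Here is a minimal pure-Python sketch with tiny made-up numbers. The function name `masked_mean` is hypothetical, and in practice you would do this with tensor operations on `last_hidden_state` and `attention_mask`.

```python
def masked_mean(hidden, mask):
    """Average the token vectors whose mask entry is 1.

    hidden: list of token vectors (one list of floats per token)
    mask:   list of 0/1 values, one per token
    """
    dim = len(hidden[0])
    kept = sum(mask)  # number of real (non-padding) tokens
    return [
        sum(vec[d] * m for vec, m in zip(hidden, mask)) / kept
        for d in range(dim)
    ]

# Three tokens with hidden size 2; the last token is padding and must be ignored.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_vector = masked_mean(hidden, mask)
print(sentence_vector)
```

The padded position contributes nothing to the average, which is exactly why the attention mask is passed around together with the IDs.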


### <u>Now let's load a dataset.</u>

In [21]:
dataset = load_dataset("mteb/reddit-clustering") # download a small dataset for demonstration

In [22]:
# load_dataset returns a DatasetDict keyed by split, e.g. ['train'], ['test']. This one only has 'test'.
# dataset['test'] is a Dataset: a tabular object built on an Apache Arrow table (slightly different from a pandas DataFrame).

dataset

DatasetDict({
    test: Dataset({
        features: ['sentences', 'labels'],
        num_rows: 25
    })
})

In [None]:
# You can convert a Dataset to a pandas DataFrame

pd.set_option('display.max_colwidth', 500)
dataframe = dataset['test'].to_pandas() # built-in conversion; pd.DataFrame(dataset['test']) also works
dataframe.head(2)