<a href="https://colab.research.google.com/github/w1ndwatcher/Transformers/blob/main/01_Creating_a_custom_tokenizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Tokenization is the process of converting text into smaller units called tokens. These tokens can be words, characters, or sub-words.
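
As a rough illustration of the two simplest granularities, the sketch below splits one sentence from the dataset into word-level and character-level tokens using only the standard library. The regular expression is an illustrative choice, not what any particular tokenizer library does; sub-word splitting needs a learned vocabulary and is left out here.

```python
import re

text = "Prescribed 500mg Paracetamol twice daily."

# Word-level: runs of word characters, with each punctuation mark kept as its own token.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character-level: every non-space character becomes a token.
char_tokens = [ch for ch in text if not ch.isspace()]

print(word_tokens)
print(len(word_tokens), "word tokens vs", len(char_tokens), "character tokens")
```

Word-level tokens keep sequences short but need a very large vocabulary; character-level tokens need a tiny vocabulary but produce long sequences. Sub-word tokens sit between the two.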

In [None]:
!pip install transformers



In [None]:
from tokenizers import Tokenizer, pre_tokenizers, trainers, models

Read the training dataset from the file (one sentence per line)

In [None]:
with open("/content/tokenizer_train.txt", "r", encoding="utf-8") as file:
  # One training sentence per line; strip the trailing newline
  dataset = [line.strip() for line in file]

dataset

['Patient complains of persistent headache and dizziness. Prescribed 500mg Paracetamol twice daily.',
 'Administer Amoxicillin 250mg every 8 hours for 7 days.',
 'Blood pressure reading: 150/95 mmHg. Start Amlodipine 5mg once daily.',
 'Diagnosed with Type 2 Diabetes Mellitus. Metformin 850mg after meals.',
 'Apply topical Clotrimazole cream twice a day on affected area.',
 'Symptoms include nausea, vomiting, and abdominal cramps. Possible food poisoning.',
 'Prescribe Ibuprofen 400mg for pain relief every 6 hours as needed.',
 'Monitor blood glucose levels before breakfast and dinner.']

Initialize a Byte-Pair Encoding (BPE) sub-word tokenizer:-

In [None]:
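# Pass unk_token so characters never seen in training map to "[UNK]" instead of being dropped
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))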
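
To make the idea behind BPE concrete, here is a minimal pure-Python sketch of how merge rules can be learned: start from single characters and repeatedly merge the most frequent adjacent pair. The function name `learn_bpe_merges` and the toy word list are made up for illustration, and tie-breaking between equally frequent pairs is arbitrary here; the `tokenizers` library implements this far more efficiently and may order ties differently.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words."""
    # Each distinct word starts as a tuple of single characters, weighted by its count.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing occurrences of the best pair with one merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# Toy corpus: the pair ("d", "a") occurs in every word, so it is merged first.
words = ["daily", "daily", "days", "day"]
merges = learn_bpe_merges(words, num_merges=3)
print(merges)
```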

Set the pre-tokenizer to split the input into words and punctuation before BPE merges are applied

In [None]:
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
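
The `Whitespace` pre-tokenizer is documented as splitting with the regex `\w+|[^\w\s]+`, so punctuation is separated from words rather than merely splitting on spaces. The sketch below approximates that behaviour with Python's `re` module; the helper name `whitespace_pre_tokenize` is hypothetical, and edge cases (for example some Unicode characters) may differ from the library's Rust implementation.

```python
import re

# Regex documented for the Whitespace pre-tokenizer: word runs, or runs of punctuation.
WHITESPACE_PATTERN = re.compile(r"\w+|[^\w\s]+")

def whitespace_pre_tokenize(text):
    """Return (piece, (start, end)) pairs, similar in spirit to the library's output."""
    return [(m.group(), m.span()) for m in WHITESPACE_PATTERN.finditer(text)]

pieces = whitespace_pre_tokenize("Blood pressure reading: 150/95 mmHg.")
print([piece for piece, _ in pieces])
# ['Blood', 'pressure', 'reading', ':', '150', '/', '95', 'mmHg', '.']
```

BPE merges are then learned and applied inside each piece, never across piece boundaries.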

Train the BPE tokenizer on the dataset and save it as a JSON file

In [None]:
trainer = trainers.BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train_from_iterator(dataset, trainer=trainer)
tokenizer.save("/content/tokenizer.json")

<hr>

Inference

In [None]:
from transformers import PreTrainedTokenizerFast

In [None]:
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="/content/tokenizer.json")

In [None]:
text = "Symptoms include"
encoded = tokenizer.encode(text)
print(encoded)

Encoding(num_tokens=2, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])


Encoded tokens:-

In [None]:
print(encoded.tokens)

['Symptoms', 'include']


In [None]:
text = "The patient"
encoded = tokenizer.encode(text)
print(encoded.tokens)

['T', 'h', 'e', 'p', 'a', 'tient']
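
The split above is consistent with how BPE encodes unseen words: the word starts as single characters and only the merge rules learned during training are applied, so text that barely appears in the eight training sentences stays close to character level. The sketch below shows that mechanism with a hypothetical merge list (`toy_merges` is invented for illustration and is not the list stored in `tokenizer.json`).

```python
def apply_merges(word, merges):
    """Segment `word` by applying merge rules in the order they were learned."""
    symbols = list(word)
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge rules; a real tokenizer learns these from its training data.
toy_merges = [("e", "n"), ("t", "i"), ("en", "t"), ("ti", "ent")]

print(apply_merges("The", toy_merges))      # ['T', 'h', 'e']
print(apply_merges("patient", toy_merges))  # ['p', 'a', 'tient']
```

A larger and more varied training corpus would yield more merge rules, and common words such as "The" would then come out as single tokens.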


Visualization:-

In [None]:
from tokenizers.tools import EncodingVisualizer

In [None]:
# backend_tokenizer is the public accessor for the underlying tokenizers.Tokenizer
vis = EncodingVisualizer(fast_tokenizer.backend_tokenizer)
vis(text="The patient") # 6 tokens

<hr>

Using a pre-trained tokenizer:-

In [None]:
from transformers import BertTokenizer

In [None]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

In [None]:
print(tokenizer.tokenize("The patient"))

['the', 'patient']


In [None]:
print(tokenizer.tokenize("The patients daily doses of meds"))

['the', 'patients', 'daily', 'doses', 'of', 'med', '##s']
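
BERT's tokenizer uses WordPiece rather than BPE: each word is segmented greedily into the longest vocabulary entries, and pieces that continue a word carry a `##` prefix, which is why "meds" becomes `med` + `##s`. The following is a minimal sketch of that greedy longest-match-first step, assuming a tiny hand-made vocabulary (`toy_vocab` is hypothetical; the real `bert-base-uncased` vocabulary has roughly 30,000 entries).

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary, just large enough for this sentence.
toy_vocab = {"the", "patients", "daily", "doses", "of", "med", "##s"}

sentence = "The patients daily doses of meds"
tokens = [t for w in sentence.lower().split() for t in wordpiece_tokenize(w, toy_vocab)]
print(tokens)
# ['the', 'patients', 'daily', 'doses', 'of', 'med', '##s']
```

Lowercasing before the lookup mirrors the "uncased" model variant; punctuation handling and other normalization steps of the real tokenizer are omitted.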
