[Open 2.Implementing_Tokenization.ipynb in Google Colab](https://colab.research.google.com/github/shambhuphysics/GenerativeAI/blob/main/2.Implementing_Tokenization.ipynb)


# Implementing Tokenization

- Tokenizers are essential tools in natural language processing (NLP) that break text down into smaller units called tokens. These tokens can be words, characters, or subwords, and they make complex text understandable to computers.
- By dividing text into manageable pieces, tokenizers enable machines to process and analyze human language, powering various language-related applications:
  - translation
  - sentiment analysis
  - chatbots, and so on

- Tokenizers bridge the gap between human language and machine understanding.

<div style="text-align:center">
  <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0205EN-SkillsNetwork/tokenizer.png" width="700px" alt="wizard">
</div>

# __Table of Contents__

<ol>
    <li><a href="#Objectives">Objectives</a></li>
    <li>
        <a href="#Setup">Setup</a>
        <ol>
            <li><a href="#Installing-required-libraries">Installing required libraries</a></li>
            <li><a href="#Importing-required-libraries">Importing required libraries</a></li>
        </ol>
    </li>
    <li>
        <a href="#What-is-a-tokenizer-and-why-do-we-use-it?">What is a tokenizer and why do we use it?</a>
    </li>
    <li><a href="#Types-of-tokenizer">Types of tokenizer</a></li>
        <ol>
            <li><a href="#Word-based-tokenizer">Word-based tokenizer</a></li>
            <li><a href="#Character-based-tokenizer">Character-based tokenizer</a></li>
            <li><a href="#Subword-based-tokenizer">Subword-based tokenizer</a></li>
                <ol>
                    <li><a href="#WordPiece">WordPiece</a></li>
                    <li><a href="#Unigram-and-SentencePiece">Unigram and SentencePiece</a></li>
                </ol>
        </ol>
    <li>
        <a href="#Tokenization-with-PyTorch">Tokenization with PyTorch</a>
    </li>
    <li>
        <a href="#Token-indices">Token indices</a>
        <ol>
            <li><a href="#Out-of-vocabulary-(OOV)">Out-of-vocabulary (OOV)</a></li>
        </ol>
    </li>
    <li><a href="#Exercise:-Comparative-text-tokenization-and-performance-analysis">Exercise: Comparative text tokenization and performance analysis</a></li>
</ol>


# Objectives

- Understand the concept of tokenization and its importance in natural language processing.
- Identify and explain `word-based`, `character-based`, and `subword-based` tokenization methods.
- Apply tokenization strategies to preprocess raw textual data before using it in machine learning models.

# Libraries Required:
1. Natural Language Toolkit (NLTK)
   - NLTK is a set of tools and resources for text preprocessing and analysis.
2. spaCy
   - Used for text preprocessing.
3. BertTokenizer
   - A class from the Hugging Face Transformers library, used for tokenizing text for `BERT` models.
4. XLNetTokenizer
   - A class from the Hugging Face Transformers library, used for tokenizing text for `XLNet` models.
5. Torchtext
   - A PyTorch library for text data preprocessing, tokenization, vocabulary management, and batching.


# Installing the required libraries

In [1]:
!pip install nltk
!pip install transformers==4.42.1
!pip install sentencepiece
!pip install spacy
!python -m spacy download en_core_web_sm
!python -m spacy download de_core_news_sm
!pip install scikit-learn
!pip install torch==2.2.2
!pip install torchtext==0.17.2
!pip install numpy==1.26.0


# Importing Required Libraries

In [2]:
import nltk
nltk.download("punkt")
nltk.download('punkt_tab')
import spacy
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.util import ngrams
from transformers import BertTokenizer
from transformers import XLNetTokenizer

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to /home/ucfbsbh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/ucfbsbh/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!

# 1. Tokenizers:
- Tokenizers play a pivotal role in NLP, segmenting text into smaller units known as tokens. These tokens are subsequently transformed into numerical representations called token indices, which deep learning algorithms consume directly.

  - Text (I Love You) ---> Tokenization ---> Tokens ['I', 'Love', 'You']
<center>
<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0201EN-Coursera/images/Tokenization%20lab%20Diagram%201.png" width="50%" alt="Image Description">
</center>

## 1.1 Types of Tokenizers:
The most meaningful representation may vary depending on the model in use, and different models employ different tokenization algorithms. Transforming text into numerical values might appear straightforward at first, but it involves several considerations that must be kept in mind.

<center>
<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0201EN-Coursera/images/Tokenization%20lab%20Diagram%202.png" width="50%" alt="Image Description">
</center>

### 1.1.1 Word-based tokenizers
- A. Using `nltk`'s `word_tokenize`:
  - The text is split into tokens based on words. Word-based tokenizers follow rules such as splitting on spaces, splitting on punctuation, and so on. Each resulting token is then assigned a specific ID. For example, let's use `nltk`'s `word_tokenize`:

In [3]:
text = "This is a sample sentences for word tokenization."

token = word_tokenize(text)

token 

['This', 'is', 'a', 'sample', 'sentences', 'for', 'word', 'tokenization', '.']
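The splitting rules mentioned above (split on spaces, split on punctuation) can be approximated with a single regular expression. The sketch below is only an illustration of such rules, not NLTK's actual implementation; the function name `simple_word_tokenize` is hypothetical.

```python
import re

def simple_word_tokenize(text):
    # \w+      : a run of letters, digits, or underscores (a "word")
    # [^\w\s]  : any single character that is neither a word character
    #            nor whitespace, i.e. a punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_word_tokenize("This is a sample sentences for word tokenization.")
print(tokens)
# ['This', 'is', 'a', 'sample', 'sentences', 'for', 'word', 'tokenization', '.']

# Contractions show how much the rules matter: this regex cuts at the apostrophe.
print(simple_word_tokenize("I couldn't help the dog."))
# ['I', 'couldn', "'", 't', 'help', 'the', 'dog', '.']
```

For the plain sentence the result matches the NLTK output, but the contraction is split differently, which is why the tokenizer used at inference time should be the one the model was trained with.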

- Note: General-purpose libraries like NLTK and spaCy often split contractions such as `don't` and `couldn't` into separate tokens. There is no universal rule, and each library has its own tokenization rules for word-based tokenizers.
    - The general guideline is to tokenize the input in the same way as the data the model was trained on.

In [11]:
text = "I couldn't help the dog."

token = word_tokenize(text)

token

['I', 'could', "n't", 'help', 'the', 'dog', '.']

- B. Using `spaCy`'s tokenizer (torchtext's `get_tokenizer('spacy', ...)` wraps the same tokenizer):

In [10]:
text = "I couldn't help the dog."

nlp = spacy.load("en_core_web_sm")

doc = nlp(text)

token = [ token.text for token in doc]

token

['I', 'could', "n't", 'help', 'the', 'dog', '.']

In [15]:
# spaCy tokens also carry linguistic annotations, such as the part-of-speech tag and the dependency label

for token in doc:
  print(token.text, token.pos_, token.dep_)

I PRON nsubj
could AUX aux
n't PART neg
help VERB ROOT
the DET det
dog NOUN dobj
. PUNCT punct


- Please note that:
  - I PRON nsubj: "I" is a pronoun (PRON) and is the nominal subject (nsubj) of the sentence.
  - help VERB ROOT: "help" is a verb (VERB) and is the root action (ROOT) of the sentence.
 

The problem with this algorithm is that words with similar meanings are assigned different IDs, so they are treated as entirely separate words with distinct meanings. For example, "Unicorns" is the plural form of "unicorn", but a word-based tokenizer treats them as two separate tokens, potentially causing the model to miss their semantic relationship.

In [16]:
text= "Unicorns are real, I saw a unicorn yesterday"

token = word_tokenize(text)
token 

['Unicorns', 'are', 'real', ',', 'I', 'saw', 'a', 'unicorn', 'yesterday']
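To make the ID problem concrete, here is a minimal sketch that assigns an integer to each distinct token in the output above. The first-appearance numbering is an assumption chosen for simplicity; real vocabularies usually order tokens differently, but the point stands either way.

```python
tokens = ['Unicorns', 'are', 'real', ',', 'I', 'saw', 'a', 'unicorn', 'yesterday']

# Assign IDs in order of first appearance (an arbitrary but simple scheme).
token_to_id = {}
for tok in tokens:
    if tok not in token_to_id:
        token_to_id[tok] = len(token_to_id)

# The two surface forms receive unrelated IDs.
print(token_to_id["Unicorns"], token_to_id["unicorn"])  # 0 7
```

Nothing in the numbers 0 and 7 tells the model that the two tokens are related; any connection has to be learned from data.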

Each word becomes its own token, which significantly increases the model's overall vocabulary. Each token is mapped to a large vector that encodes the word's meaning, resulting in a large number of model parameters.
Because languages generally have a large number of words, vocabularies based on them will always be extensive. The number of characters in a language, however, is always much smaller than the number of words.

### 1.1.2 Character-based tokenizers
As the name suggests, character-based tokenization splits text into individual characters. The advantage of this approach is that the resulting vocabulary is inherently small. Furthermore, since languages have a limited set of characters, the number of out-of-vocabulary tokens is also limited, reducing token wastage.

For example, input text: This is a sample sentence for tokenization.

Character-based tokenization outputs ['T', 'h', 'i', 's', 'i', 's', 'a', 's', 'a', 'm', 'p', ...]

However, it is important to note that character-based tokenization has its limitations. A single character does not convey as much information as an entire word, and the token sequences become much longer, potentially causing issues with model size and a loss of performance.
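The trade-off can be checked on the example sentence with a few lines of plain Python. This is a minimal sketch that drops whitespace, as the example output above does; whether spaces are kept as tokens is a design choice that varies between implementations.

```python
text = "This is a sample sentence for tokenization."

word_tokens = text.split()
# One token per character, skipping whitespace.
char_tokens = [ch for ch in text if not ch.isspace()]

print(char_tokens[:11])
# ['T', 'h', 'i', 's', 'i', 's', 'a', 's', 'a', 'm', 'p']

# The same sentence needs far more character tokens than word tokens.
print(len(word_tokens), len(char_tokens))  # 7 37
```

On a single sentence the character vocabulary is not necessarily smaller than the word vocabulary. The advantage shows up at corpus scale, where the set of characters stays bounded while the set of distinct words keeps growing.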

### 1.1.3 Subword-based tokenizers

A subword-based tokenizer keeps frequently used words unsplit while breaking infrequent words down into meaningful subwords.
- SentencePiece and WordPiece are commonly used for subword tokenization.
- These methods learn subword units from a given text corpus, identifying common prefixes, suffixes, and root words as subword tokens based on their frequency of occurrence.
- This helps preserve the semantic information associated with the overall word.

Some examples are shown below.

<center>
<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0201EN-Coursera/images/Tokenization%20lab%20Diagram%203.png" width="50%" alt="Image Description">
</center>

### A. WordPiece
- WordPiece initializes its vocabulary with every character present in the training data and then progressively learns a specified number of merge rules.

- WordPiece doesn't select the most frequent symbol pair but rather the one that maximizes the likelihood of the training data when added to the vocabulary.

- WordPiece evaluates what it sacrifices by merging two symbols to ensure the merge is worthwhile.
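The difference between "most frequent pair" and "pair that is most worthwhile to merge" can be illustrated with a toy corpus. The sketch below assumes one commonly cited formulation of the WordPiece criterion (the one used in the Hugging Face course's description), which scores a pair as `freq(pair) / (freq(first) * freq(second))`. The corpus and its counts are made up for illustration, and this is not BERT's actual training code.

```python
from collections import Counter

# Hypothetical toy corpus: each word is pre-split into symbols, with a count.
# Non-initial symbols carry the "##" continuation prefix.
corpus = {
    ("h", "##u", "##g"): 10,
    ("p", "##u", "##g"): 5,
    ("p", "##u", "##n"): 12,
    ("b", "##u", "##n"): 4,
    ("h", "##u", "##g", "##s"): 5,
}

symbol_freq = Counter()
pair_freq = Counter()
for symbols, count in corpus.items():
    for s in symbols:
        symbol_freq[s] += count
    for a, b in zip(symbols, symbols[1:]):
        pair_freq[(a, b)] += count

# Assumed scoring rule: a pair scores high when its parts rarely occur apart.
scores = {
    pair: freq / (symbol_freq[pair[0]] * symbol_freq[pair[1]])
    for pair, freq in pair_freq.items()
}

most_frequent = max(pair_freq, key=pair_freq.get)
best_scored = max(scores, key=scores.get)
print(most_frequent)  # ('##u', '##g')
print(best_scored)    # ('##g', '##s')
```

The pair `('##u', '##g')` is the most frequent, but `'##u'` appears in many other contexts, so merging it gains little. `('##g', '##s')` is rarer, yet `'##s'` never appears without `'##g'`, which is why it scores highest under this rule.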



In [None]:
# Using BertTokenizer as a WordPiece tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.tokenize("IBM taught me tokenization")
# The "##" prefix in '##ization' indicates that the token continues the preceding token within the same word.

['ibm', 'taught', 'me', 'token', '##ization']
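Once a WordPiece vocabulary exists, splitting a word is typically done greedily: take the longest vocabulary entry that matches at the current position, then continue with the remainder. The sketch below illustrates that longest-match-first idea with a tiny hand-written vocabulary; `wordpiece_tokenize` and `toy_vocab` are hypothetical names, and the real `bert-base-uncased` vocabulary is far larger.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that do not start the word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No vocabulary entry matches: the whole word becomes unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"ibm", "taught", "me", "token", "##ization", "##s"}
print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("tokens", toy_vocab))        # ['token', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```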

### B. Unigram and SentencePiece

- Unigram starts with a large vocabulary of candidate subwords and gradually narrows it down, dropping the candidates that are least useful for representing the training text.

- SentencePiece produces the tokens and assigns their IDs, keeping the mapping consistent.

- Unigram and SentencePiece work together: Unigram reduces the vocabulary efficiently, while SentencePiece handles subword segmentation and ID assignment.
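A Unigram vocabulary attaches a probability to every subword, and a word is split into the sequence of subwords whose probabilities multiply to the highest value. The sketch below shows only this segmentation step, using dynamic programming over made-up probabilities; the vocabulary pruning during training is not shown, and `unigram_segment` and `toy_probs` are hypothetical names.

```python
import math

def unigram_segment(word, log_probs):
    n = len(word)
    # best[i] = (best log-probability of word[:i], start index of its last piece)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # the word cannot be built from the vocabulary
    # Walk the back-pointers from the end of the word to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]

# Made-up probabilities, for illustration only.
toy_probs = {"token": 0.05, "ization": 0.03, "tok": 0.01,
             "en": 0.02, "iz": 0.005, "ation": 0.02}
log_probs = {piece: math.log(p) for piece, p in toy_probs.items()}

print(unigram_segment("tokenization", log_probs))  # ['token', 'ization']
```

Other splits such as `tok + en + ization` are possible with this vocabulary, but their combined probability is lower, so they are not selected.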

In [None]:
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
tokenizer.tokenize("IBM taught me tokenization")
# '▁' marks a preceding whitespace, so a token starting with '▁' begins a new word
# a token without '▁' is a subword that continues the previous token
# punctuation such as '.' is treated as a separate token

['▁IBM', '▁taught', '▁me', '▁token', 'ization']
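Because the whitespace is stored inside the tokens as `▁`, the original text can be rebuilt by plain concatenation. A minimal sketch of that reversal, operating on the token list above rather than calling the tokenizer:

```python
tokens = ['▁IBM', '▁taught', '▁me', '▁token', 'ization']

# Concatenate the pieces, then turn every '▁' marker back into a space.
text = "".join(tokens).replace("▁", " ").strip()
print(text)  # IBM taught me tokenization

# Tokens that start with '▁' begin a new word; the others continue the previous one.
word_starts = [tok.startswith("▁") for tok in tokens]
print(word_starts)  # [True, True, True, True, False]
```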

### C. Tokenization with PyTorch
- PyTorch's torchtext library breaks text down into tokens (words or subwords) and converts them to a numerical format by assigning each token a unique integer, so that they can be fed into a neural network.

In [32]:
from torchtext.data.utils import get_tokenizer

dataset = [
    (1,"Introduction to NLP"),
    (2,"Basics of PyTorch"),
    (1,"NLP Techniques for Text Classification"),
    
    (3,"Named Entity Recognition with PyTorch"),
    (3,"Sentiment Analysis using PyTorch"),
    (3,"Machine Translation with PyTorch"),
    (1," NLP Named Entity,Sentiment Analysis,Machine Translation "),
    (1," Machine Translation with NLP "),
    (1," Named Entity vs Sentiment Analysis  NLP ")]

tokenizer = get_tokenizer('basic_english')

for _, sentence in dataset:
  print(tokenizer(sentence))

['introduction', 'to', 'nlp']
['basics', 'of', 'pytorch']
['nlp', 'techniques', 'for', 'text', 'classification']
['named', 'entity', 'recognition', 'with', 'pytorch']
['sentiment', 'analysis', 'using', 'pytorch']
['machine', 'translation', 'with', 'pytorch']
['nlp', 'named', 'entity', ',', 'sentiment', 'analysis', ',', 'machine', 'translation']
['machine', 'translation', 'with', 'nlp']
['named', 'entity', 'vs', 'sentiment', 'analysis', 'nlp']


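- Token Indices
  - To represent words as numbers, use `build_vocab_from_iterator`, which builds a vocabulary that maps each token to an integer index.
  - `build_vocab_from_iterator` expects an iterator over token lists, so a generator function is used to yield one tokenized sentence at a time.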

In [33]:
def yield_tokens(data_iter):
  for _, text in data_iter:
    yield tokenizer(text)

In [41]:
my_iterator = yield_tokens(dataset)

- Tokenization can produce words not present in the vocabulary due to rarity or absence during vocabulary building.  
- Out-of-vocabulary (OOV) words encountered in tasks like text generation or language modeling are represented using the `<unk>` token.  
- Example: "apple" in the vocabulary is used normally, while "pineapple" (OOV) is replaced with `<unk>`.  
- Including `<unk>` in the vocabulary ensures a consistent method for handling OOV words in NLP tasks.
 

In [40]:
vocab = build_vocab_from_iterator(yield_tokens(dataset), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])
print(vocab.get_itos() ) # vocabulary is built 
print(vocab.get_stoi())

['<unk>', 'nlp', 'pytorch', 'analysis', 'entity', 'machine', 'named', 'sentiment', 'translation', 'with', ',', 'basics', 'classification', 'for', 'introduction', 'of', 'recognition', 'techniques', 'text', 'to', 'using', 'vs']
{'vs': 21, 'to': 19, 'of': 15, 'introduction': 14, 'recognition': 16, 'for': 13, 'nlp': 1, 'entity': 4, 'pytorch': 2, 'translation': 8, 'techniques': 17, 'machine': 5, 'named': 6, ',': 10, 'text': 18, 'sentiment': 7, '<unk>': 0, 'with': 9, 'basics': 11, 'using': 20, 'analysis': 3, 'classification': 12}
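The mapping above can be imitated in plain Python to see what happens to an out-of-vocabulary word. This is a sketch, not torchtext's implementation: the ordering rule (special tokens first, then descending frequency with alphabetical tie-breaks) is an assumption chosen because it is consistent with the output shown, and `build_vocab` and `lookup` are hypothetical helper names.

```python
from collections import Counter

def build_vocab(token_lists, specials=("<unk>",)):
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Assumed ordering: descending frequency, ties broken alphabetically.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    itos = list(specials) + [t for t in ordered if t not in specials]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def lookup(tokens, stoi, unk_token="<unk>"):
    # Unknown tokens fall back to the index of <unk>.
    return [stoi.get(tok, stoi[unk_token]) for tok in tokens]

sentences = [["nlp", "with", "pytorch"], ["nlp", "basics"]]
itos, stoi = build_vocab(sentences)
print(itos)                                 # ['<unk>', 'nlp', 'basics', 'pytorch', 'with']
print(lookup(["nlp", "pineapple"], stoi))   # [1, 0]
```

"pineapple" was never seen while building the vocabulary, so it maps to index 0, the position of `<unk>`.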


In [45]:
def get_tokenized_sentence_and_indices(iterator):
  # Each call consumes the next sentence from the iterator,
  # so re-running this cell moves on to a later sentence in the dataset.
  tokenized_sentence = next(iterator)
  token_indices = [vocab[token] for token in tokenized_sentence]
  return tokenized_sentence, token_indices

tokenized_sentence, token_indices = get_tokenized_sentence_and_indices(my_iterator)

print("Tokenized Sentences:", tokenized_sentence)
print("Token Indices:", token_indices)

Tokenized Sentences: ['named', 'entity', 'recognition', 'with', 'pytorch']
Token Indices: [6, 4, 16, 9, 2]


In [51]:
lines = ["IBM taught me tokenization", 
         "Special tokenizers are ready and they will blow your mind", 
         "just saying hi!"]

special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

tokenizer_en = get_tokenizer('spacy', language="en_core_web_sm")

tokens = []

max_length = 0

# Wrap every line in <bos>/<eos> and track the longest line
for line in lines:
  tokenized_line = tokenizer_en(line)
  tokenized_line = ['<bos>'] + tokenized_line + ['<eos>']
  tokens.append(tokenized_line)
  max_length = max(max_length, len(tokenized_line))

# Pad the shorter lines with <pad> so that all lines have the same length
for i in range(len(tokens)):
  tokens[i] = tokens[i] + ['<pad>'] * (max_length - len(tokens[i]))

print("lines after adding special tokens:\n", tokens)


# Register all special symbols so that they occupy the first indices
vocab = build_vocab_from_iterator(tokens, specials=special_symbols)
vocab.set_default_index(vocab["<unk>"])
 
  

lines after adding special tokens:
 [['<bos>', 'IBM', 'taught', 'me', 'tokenization', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>'], ['<bos>', 'Special', 'tokenizers', 'are', 'ready', 'and', 'they', 'will', 'blow', 'your', 'mind', '<eos>'], ['<bos>', 'just', 'saying', 'hi', '!', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']]


In [52]:
vocab.get_itos()

['<unk>',
 '<pad>',
 '<bos>',
 '<eos>',
 '!',
 'IBM',
 'Special',
 'and',
 'are',
 'blow',
 'hi',
 'just',
 'me',
 'mind',
 'ready',
 'saying',
 'taught',
 'they',
 'tokenization',
 'tokenizers',
 'will',
 'your']

In [53]:
vocab.get_stoi()

{'will': 20,
 'tokenizers': 19,
 'tokenization': 18,
 'taught': 16,
 'your': 21,
 'saying': 15,
 '<unk>': 0,
 'and': 7,
 'hi': 10,
 '<pad>': 1,
 '<bos>': 2,
 'they': 17,
 '<eos>': 3,
 '!': 4,
 'ready': 14,
 'IBM': 5,
 'are': 8,
 'Special': 6,
 'mind': 13,
 'me': 12,
 'blow': 9,
 'just': 11}
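With this mapping, a padded line becomes a fixed-length list of integers, which is the form a model consumes. The sketch below copies the relevant entries from the dictionary above into a plain `dict` so it runs without torchtext; the padding mask at the end is an extra illustration and not part of the original notebook.

```python
# Subset of the string-to-index mapping printed above.
stoi = {'<unk>': 0, '<pad>': 1, '<bos>': 2, '<eos>': 3,
        '!': 4, 'hi': 10, 'just': 11, 'saying': 15}

padded_line = ['<bos>', 'just', 'saying', 'hi', '!', '<eos>'] + ['<pad>'] * 6

# Unknown tokens fall back to the <unk> index, like set_default_index does.
indices = [stoi.get(tok, stoi['<unk>']) for tok in padded_line]
print(indices)  # [2, 11, 15, 10, 4, 3, 1, 1, 1, 1, 1, 1]

# A mask that marks real tokens (True) versus padding (False).
mask = [tok != '<pad>' for tok in padded_line]
print(sum(mask))  # 6
```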

In [58]:
text = """
Going through the world of tokenization has been like walking through a huge maze made of words, symbols, and meanings. Each turn shows a bit more about the cool ways computers learn to understand our language. And while I'm still finding my way through it, the journey’s been enlightening and, honestly, a bunch of fun.
Eager to see where this learning path takes me next!"
"""

# Counting and displaying tokens and their frequency
from collections import Counter
def show_frequencies(tokens, method_name):
    print(f"{method_name} Token Frequencies: {dict(Counter(tokens))}\n")

from datetime import datetime 

#NLTK Tokenization
start_time =  datetime.now()
nltk_tokens = nltk.word_tokenize(text)
nltk_time = datetime.now() - start_time
nltk_time
show_frequencies(nltk_tokens, "NLTK")

NLTK Token Frequencies: {'Going': 1, 'through': 3, 'the': 3, 'world': 1, 'of': 3, 'tokenization': 1, 'has': 1, 'been': 2, 'like': 1, 'walking': 1, 'a': 3, 'huge': 1, 'maze': 1, 'made': 1, 'words': 1, ',': 5, 'symbols': 1, 'and': 2, 'meanings': 1, '.': 3, 'Each': 1, 'turn': 1, 'shows': 1, 'bit': 1, 'more': 1, 'about': 1, 'cool': 1, 'ways': 1, 'computers': 1, 'learn': 1, 'to': 2, 'understand': 1, 'our': 1, 'language': 1, 'And': 1, 'while': 1, 'I': 1, "'m": 1, 'still': 1, 'finding': 1, 'my': 1, 'way': 1, 'it': 1, 'journey': 1, '’': 1, 's': 1, 'enlightening': 1, 'honestly': 1, 'bunch': 1, 'fun': 1, 'Eager': 1, 'see': 1, 'where': 1, 'this': 1, 'learning': 1, 'path': 1, 'takes': 1, 'me': 1, 'next': 1, '!': 1, "''": 1}



In [59]:
import nltk
import spacy
from transformers import BertTokenizer, XLNetTokenizer
from datetime import datetime

# NLTK Tokenization
start_time = datetime.now()
nltk_tokens = nltk.word_tokenize(text)
nltk_time = datetime.now() - start_time

# SpaCy Tokenization
nlp = spacy.load("en_core_web_sm")
start_time = datetime.now()
spacy_tokens = [token.text for token in nlp(text)]
spacy_time = datetime.now() - start_time

# BertTokenizer Tokenization
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
start_time = datetime.now()
bert_tokens = bert_tokenizer.tokenize(text)
bert_time = datetime.now() - start_time

# XLNetTokenizer Tokenization
xlnet_tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
start_time = datetime.now()
xlnet_tokens = xlnet_tokenizer.tokenize(text)
xlnet_time = datetime.now() - start_time
    
# Display tokens, time taken for each tokenizer, and token frequencies
print(f"NLTK Tokens: {nltk_tokens}\nTime Taken: {nltk_time} seconds\n")
show_frequencies(nltk_tokens, "NLTK")

print(f"SpaCy Tokens: {spacy_tokens}\nTime Taken: {spacy_time} seconds\n")
show_frequencies(spacy_tokens, "SpaCy")

print(f"Bert Tokens: {bert_tokens}\nTime Taken: {bert_time} seconds\n")
show_frequencies(bert_tokens, "Bert")

print(f"XLNet Tokens: {xlnet_tokens}\nTime Taken: {xlnet_time} seconds\n")
show_frequencies(xlnet_tokens, "XLNet")

NLTK Tokens: ['Going', 'through', 'the', 'world', 'of', 'tokenization', 'has', 'been', 'like', 'walking', 'through', 'a', 'huge', 'maze', 'made', 'of', 'words', ',', 'symbols', ',', 'and', 'meanings', '.', 'Each', 'turn', 'shows', 'a', 'bit', 'more', 'about', 'the', 'cool', 'ways', 'computers', 'learn', 'to', 'understand', 'our', 'language', '.', 'And', 'while', 'I', "'m", 'still', 'finding', 'my', 'way', 'through', 'it', ',', 'the', 'journey', '’', 's', 'been', 'enlightening', 'and', ',', 'honestly', ',', 'a', 'bunch', 'of', 'fun', '.', 'Eager', 'to', 'see', 'where', 'this', 'learning', 'path', 'takes', 'me', 'next', '!', "''"]
Time Taken: 0:00:00.000389 seconds

NLTK Token Frequencies: {'Going': 1, 'through': 3, 'the': 3, 'world': 1, 'of': 3, 'tokenization': 1, 'has': 1, 'been': 2, 'like': 1, 'walking': 1, 'a': 3, 'huge': 1, 'maze': 1, 'made': 1, 'words': 1, ',': 5, 'symbols': 1, 'and': 2, 'meanings': 1, '.': 3, 'Each': 1, 'turn': 1, 'shows': 1, 'bit': 1, 'more': 1, 'about': 1, 'cool
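The timings above come from a single call measured with `datetime`, so they are sensitive to noise such as one-time initialization. A hedged alternative is to repeat each call and keep the best time from `time.perf_counter`. In the sketch below, `time_tokenizer` is a hypothetical helper, and the two stand-in tokenizers (whitespace split and a regex) are used only so the example runs without extra dependencies; `nltk.word_tokenize`, `bert_tokenizer.tokenize`, and the others could be passed in the same way.

```python
import re
import time

def time_tokenizer(tokenize, text, repeats=100):
    """Return the tokens and the best wall-clock time in seconds over several runs."""
    best = float("inf")
    tokens = None
    for _ in range(repeats):
        start = time.perf_counter()
        tokens = tokenize(text)
        best = min(best, time.perf_counter() - start)
    return tokens, best

sample = "Eager to see where this learning path takes me next!"

# Stand-in tokenizers; replace with the library tokenizers used above.
candidates = {
    "whitespace": str.split,
    "regex": lambda s: re.findall(r"\w+|[^\w\s]", s),
}

results = {}
for name, fn in candidates.items():
    tokens, seconds = time_tokenizer(fn, sample)
    results[name] = (len(tokens), seconds)
    print(f"{name}: {len(tokens)} tokens, best of 100 runs: {seconds:.6f} s")
```

Taking the minimum over repeated runs reduces the influence of background activity, and loading a model outside the timed region keeps the comparison focused on tokenization itself.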

In [None]:
# ===================================================================
# 4. Data augmentation and graph conversion (corrected core logic)
# ===================================================================
print("\n--- [Step 4/5] Augmenting full dataset and converting to PyG graph objects ---")
data_list = []
for _, row in tqdm(main_df.iterrows(), total=len(main_df), desc="Processing SMILES and augmenting"):
    original_smiles = row[SMILES_COL]
    targets = [row.get(t, np.nan) for t in TARGETS]
    y = torch.tensor(targets, dtype=torch.float).unsqueeze(0)

    def process_single_smiles(smi, label):
        graph_data = smiles_to_periodic_graph(smi)
        if not graph_data: return None
        mol = Chem.MolFromSmiles(smi)
        if mol:
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, MORGAN_FP_RADIUS, nBits=MORGAN_FP_DIM)
            graph_data.morgan_fp = torch.tensor(np.array(fp), dtype=torch.float).unsqueeze(0)
        else:
            graph_data.morgan_fp = torch.zeros(1, MORGAN_FP_DIM, dtype=torch.float)
        graph_data.y = label
        return graph_data

    # Process original SMILES
    processed_original = process_single_smiles(original_smiles, y)
    if processed_original:
        data_list.append(processed_original)
    
    # Process augmented SMILES
    augmented_smiles = augment_repeat_units(original_smiles, n_repeats=3)
    if augmented_smiles != original_smiles:
        processed_augmented = process_single_smiles(augmented_smiles, y)
        if processed_augmented:
            data_list.append(processed_augmented)

print(f"Data processing complete, generated {len(data_list)} graph objects (including original and augmented data).")
