<a href="https://colab.research.google.com/github/rajkstats/uplimit_nlp/blob/main/wk3_tokenizer_rk.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

> DUPLICATE THIS COLAB TO START WORKING ON IT. Using File > Save a copy to drive.

# Tokenization

### What we are building
Tokenization is the task of chopping text into pieces, called tokens. As you might have observed going through prior notebooks, vectorization and tokenization have a huge influence on the output of the exact same model.

We will compare the different tokenizers for different sizes of vocabulary on Botchan, a novel written by Natsume Sōseki in 1906. We'll see what percentage of vocabulary would be considered OOV (out-of-vocab) at different sizes.
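
To make "coverage" and "OOV" concrete before running anything on the novel, here is a minimal sketch on a toy sentence. The helper name `coverage_at` and the toy text are illustrative assumptions, not part of the assignment scaffolding: it keeps the `vocab_size` most frequent tokens and measures what share of all token occurrences they account for.

```python
from collections import Counter

def coverage_at(tokens, vocab_size):
    """Share of token occurrences covered by the vocab_size most frequent tokens."""
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(vocab_size))
    return covered / len(tokens)

# Toy corpus: 11 token occurrences, 8 distinct tokens
toy_tokens = "the cat sat on the mat and the dog sat too".split()

for size in (1, 2, 4, 8):
    cov = coverage_at(toy_tokens, size)
    print(f"vocab size {size}: coverage {cov:.2%}, OOV {1 - cov:.2%}")
```

Everything outside the kept vocabulary counts as OOV, so the two percentages always sum to 100%.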

### Instructions

1. We have provided scaffolding for all the different tokenizers and added an assert to make sure the output is the same as expected.
1. Most of the tokenizers are something you've already seen before, but we'll dive deeper into SentencePiece.

### Code Overview

- Dependencies: Install and import Python dependencies
- Tokenizers
  - WhiteSpaceTokenizer
  - CharacterTokenizer
  - SpacyTokenizer
  - BERTTokenizer
  - SentencePieceTokenizer
- Extensions


# Dependencies

✨ Now let's get started! To kick things off, as always, we will install some dependencies.

In [1]:
!pip install transformers sentencepiece --quiet
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB 3.0 MB/s eta 0:00:00
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [2]:
# Import all the relevant libraries
import re
import os
import spacy
import sentencepiece as spm

from collections import defaultdict
from transformers import BertTokenizer

nlp = spacy.load("en_core_web_lg")

### Dataset

Download the Botchan novel from the [SentencePiece repository](https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt).


In [3]:
!wget https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt

# Read the novel into a list of lines (each line keeps its trailing newline)
with open('botchan.txt', 'r', encoding='utf-8') as the_file:
  lines = the_file.readlines()

--2024-08-27 01:49:22--  https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 278779 (272K) [text/plain]
Saving to: ‘botchan.txt’


2024-08-27 01:49:22 (53.2 MB/s) - ‘botchan.txt’ saved [278779/278779]



Constants for the sample sentence and the vocabulary-size thresholds we'll be using throughout.

In [4]:
SAMPLE_SENTENCE = "I'm learning NLP. Aren't my projects awesome?"
THRESHOLDS = [1000, 2000, 3000, 4000, 5000, 7500, 10000]

### Base Tokenizer class that implements the coverage report function

In [5]:
class Tokenizer:
  def __init__(self, lines):
    # Note: despite its name, vocab_size holds the total number of token
    # occurrences in the corpus, not the number of distinct tokens.
    self.vocab_size = 0
    token_dict = defaultdict(int)

    for line in lines:
      for token in self.tokenize(line):
        self.vocab_size += 1
        token_dict[token] += 1

    # (token, count) pairs sorted from most to least frequent
    self.token_counts = sorted(token_dict.items(), key=lambda x: x[1], reverse=True)

  def tokenize(self, sentence):
    # Subclasses override this to return a list of tokens
    return []

  def coverage(self, threshold):
    # Fraction of token occurrences covered by the `threshold` most frequent tokens
    return sum(x[1] for x in self.token_counts[:threshold]) / self.vocab_size

  def coverage_report(self, thresholds):
    # For each threshold print the percentage of coverage and OOV
    for tv in thresholds:
      coverage = self.coverage(tv) * 100
      print("For vocab size: %d, coverage is: %.2f%% and oov is: %.2f%%" % (tv, coverage, 100-coverage))
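
The quantity computed by `coverage` can also be written out as a formula. This is only a restatement of the code above, with symbols introduced here for convenience: $c_1 \ge c_2 \ge \dots \ge c_V$ are the sorted counts in `token_counts`, $V$ is the number of distinct tokens, and $k$ is the threshold.

```latex
\mathrm{coverage}(k) = \frac{\sum_{i=1}^{\min(k,\,V)} c_i}{\sum_{i=1}^{V} c_i},
\qquad
\mathrm{OOV}(k) = 1 - \mathrm{coverage}(k)
```

The denominator is the total number of token occurrences, which is what the class accumulates in `self.vocab_size`. Once $k \ge V$ the numerator equals the denominator, so coverage reaches 100%.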

## Assignment Part:1 - Whitespace based separators
##### <font color='red'>Expected:  for vocab size 1,000 -> coverage: 72.79%</font>
##### <font color='red'>Expected:  for vocab size 10,000 -> coverage: 99.43%</font>

A tokenizer that splits each sentence into tokens on whitespace.

In [6]:
class WhiteSpaceTokenizer(Tokenizer):
  def tokenize(self, sentence):
    output = sentence.split()
    return output

white_space_tokenizer = WhiteSpaceTokenizer(lines)
assert white_space_tokenizer.tokenize(SAMPLE_SENTENCE) == ["I'm", 'learning', 'NLP.', "Aren't", 'my', 'projects', 'awesome?']
white_space_tokenizer.coverage_report(THRESHOLDS)

For vocab size: 1000, coverage is: 75.34% and oov is: 24.66%
For vocab size: 2000, coverage is: 82.58% and oov is: 17.42%
For vocab size: 3000, coverage is: 86.77% and oov is: 13.23%
For vocab size: 4000, coverage is: 89.78% and oov is: 10.22%
For vocab size: 5000, coverage is: 91.75% and oov is: 8.25%
For vocab size: 7500, coverage is: 96.68% and oov is: 3.32%
For vocab size: 10000, coverage is: 100.00% and oov is: 0.00%
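
Because `str.split()` only looks at whitespace, punctuation stays attached to the neighbouring word, so `NLP.` and `NLP` would count as two different vocabulary entries. The sketch below contrasts that with a simple regular-expression split. The pattern is an illustrative assumption, not something the assignment prescribes.

```python
import re

sentence = "I'm learning NLP. Aren't my projects awesome?"

# Plain whitespace split: punctuation stays glued to the word before it.
whitespace_tokens = sentence.split()

# Alternative: keep runs of word characters and apostrophes together,
# and emit every other non-space character as its own token.
regex_tokens = re.findall(r"[\w']+|[^\w\s]", sentence)

print(whitespace_tokens)
print(regex_tokens)
```

Separating punctuation usually shrinks the number of distinct tokens, which is one reason the tokenizers in the following parts reach higher coverage at the same vocabulary size.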


## Assignment Part:2 - Character based tokenizer
##### <font color='red'>Expected:  for vocab size 1,000 -> coverage: 100.00%</font>
##### <font color='red'>Expected:  for vocab size 10,000 -> coverage: 100.00%</font>

Tokenizer that splits the string sentences into individual characters.

In [7]:
class CharacterTokenizer(Tokenizer):
  def tokenize(self, sentence):
    # Tokenize by splitting the sentence into characters
    output = list(sentence)
    return output

character_tokenizer = CharacterTokenizer(lines)
assert character_tokenizer.tokenize(SAMPLE_SENTENCE) == ['I', "'", 'm', ' ', 'l', 'e', 'a', 'r', 'n', 'i', 'n', 'g', ' ', 'N', 'L', 'P', '.', ' ', 'A', 'r', 'e', 'n', "'", 't', ' ', 'm', 'y', ' ', 'p', 'r', 'o', 'j', 'e', 'c', 't', 's', ' ', 'a', 'w', 'e', 's', 'o', 'm', 'e', '?']
character_tokenizer.coverage_report(THRESHOLDS)

For vocab size: 1000, coverage is: 100.00% and oov is: 0.00%
For vocab size: 2000, coverage is: 100.00% and oov is: 0.00%
For vocab size: 3000, coverage is: 100.00% and oov is: 0.00%
For vocab size: 4000, coverage is: 100.00% and oov is: 0.00%
For vocab size: 5000, coverage is: 100.00% and oov is: 0.00%
For vocab size: 7500, coverage is: 100.00% and oov is: 0.00%
For vocab size: 10000, coverage is: 100.00% and oov is: 0.00%
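
Every threshold reports 100% here because the number of distinct characters in English text is far smaller than even the lowest threshold of 1,000. A minimal sketch on the sample sentence (the variable names are illustrative) shows how small a character vocabulary is, and what it costs in sequence length:

```python
text = "I'm learning NLP. Aren't my projects awesome?"

char_tokens = list(text)       # one token per character, spaces included
char_vocab = set(char_tokens)  # distinct characters
word_tokens = text.split()     # whitespace tokens, for comparison

print(f"distinct characters: {len(char_vocab)}")
print(f"sequence length: {len(char_tokens)} characters vs {len(word_tokens)} whitespace tokens")
```

The trade-off is visible in the second line: the character sequence is several times longer than the word sequence, so perfect coverage comes at the cost of much longer inputs.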


## Assignment Part:3 - Spacy Tokenizer
##### <font color='red'>Expected:  for vocab size 1,000 -> coverage: 86%</font>
##### <font color='red'>Expected:  for vocab size 10,000 -> coverage: 100%</font>

A tokenizer that splits each sentence into individual tokens using spaCy's built-in tokenizer.

In [8]:
class SpacyTokenizer(Tokenizer):
  def tokenize(self, sentence):
    # Only tokens are needed, so call spaCy's tokenizer directly and
    # skip the rest of the pipeline (tagger, parser, NER)
    doc = nlp.tokenizer(sentence)
    output = [token.text for token in doc]
    return output

spacy_tokenizer = SpacyTokenizer(lines)
assert spacy_tokenizer.tokenize(SAMPLE_SENTENCE) == ['I', "'m", 'learning', 'NLP', '.', 'Are', "n't", 'my', 'projects', 'awesome', '?']
spacy_tokenizer.coverage_report(THRESHOLDS)

For vocab size: 1000, coverage is: 86.00% and oov is: 14.00%
For vocab size: 2000, coverage is: 91.94% and oov is: 8.06%
For vocab size: 3000, coverage is: 95.08% and oov is: 4.92%
For vocab size: 4000, coverage is: 96.67% and oov is: 3.33%
For vocab size: 5000, coverage is: 98.21% and oov is: 1.79%
For vocab size: 7500, coverage is: 100.00% and oov is: 0.00%
For vocab size: 10000, coverage is: 100.00% and oov is: 0.00%


## Assignment Part:4 - BERT Tokenizer
##### <font color='red'>Expected:  for vocab size 1,000 -> coverage: 85.46%</font>
##### <font color='red'>Expected:  for vocab size 10,000 -> coverage: 100.00%</font>

The BERT tokenizer provided by the Hugging Face `transformers` library.

In [9]:
class BERTTokenizer(Tokenizer):
  def __init__(self, tokenizer, lines):
    # Set the wrapped tokenizer first: the base __init__ calls self.tokenize()
    self.tokenizer = tokenizer
    super().__init__(lines)

  def tokenize(self, sentence):
    # Use the BERT tokenizer to tokenize the sentence
    output = self.tokenizer.tokenize(sentence)
    return output

raw_bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_tokenizer = BERTTokenizer(raw_bert_tokenizer, lines)
assert bert_tokenizer.tokenize(SAMPLE_SENTENCE) == ['i', "'", 'm', 'learning', 'nl', '##p', '.', 'aren', "'", 't', 'my', 'projects', 'awesome', '?'], bert_tokenizer.tokenize(SAMPLE_SENTENCE)
bert_tokenizer.coverage_report(THRESHOLDS)

For vocab size: 1000, coverage is: 85.46% and oov is: 14.54%
For vocab size: 2000, coverage is: 92.11% and oov is: 7.89%
For vocab size: 3000, coverage is: 95.61% and oov is: 4.39%
For vocab size: 4000, coverage is: 97.51% and oov is: 2.49%
For vocab size: 5000, coverage is: 99.04% and oov is: 0.96%
For vocab size: 7500, coverage is: 100.00% and oov is: 0.00%
For vocab size: 10000, coverage is: 100.00% and oov is: 0.00%
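
The `'nl', '##p'` pair in the assert above comes from WordPiece: a word that is not in the vocabulary is split into the longest pieces that are, with `##` marking a piece that continues a word. Below is a minimal sketch of that greedy longest-match-first lookup. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are assumptions made for illustration. The real `BertTokenizer` also lowercases, splits on punctuation, and uses a vocabulary of roughly 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single, already lowercased word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical miniature vocabulary, for illustration only
toy_vocab = {"nl", "##p", "learn", "##ing", "awesome"}

for word in ("nlp", "learning", "awesome", "xyz"):
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

Since rare words decompose into frequent pieces instead of becoming new vocabulary entries, a subword tokenizer can reach high coverage with a comparatively small vocabulary.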


## Assignment Part:5 - SentencePieceTokenizer
##### <font color='red'>Expected:  for vocab size 1,000 -> coverage: 99.88%</font>
##### <font color='red'>Expected:  for vocab size 10,000 -> coverage: 100.00%</font>

The SentencePiece tokenizer works a bit differently from everything we've used so far: it has no fixed rules or pretrained vocabulary, so it needs to be trained on a corpus with a target vocabulary size.

In [10]:
class SentencePieceTokenizer(Tokenizer):
  def __init__(self, lines):
    # The base __init__ is not called: coverage() below retrains a model
    # per threshold instead of using precomputed token counts.
    self.lines = lines
    # Train a model with a 4,000-piece vocabulary for tokenize()
    spm.SentencePieceTrainer.train(input='botchan.txt', model_prefix='m', vocab_size=4000)
    self.sp = spm.SentencePieceProcessor()
    self.sp.load('m.model')

  def tokenize(self, sentence):
    # Tokenize the sentence using the SentencePiece model
    output = self.sp.encode_as_pieces(sentence)
    return output

  def coverage(self, threshold):
    # Train a new SentencePiece tokenizer for the threshold provided
    try:
      spm.SentencePieceTrainer.train(input='botchan.txt', model_prefix=f'm{threshold}', vocab_size=threshold)
    except RuntimeError:
      # Vocabulary size > 5239 raises a runtime error on this corpus;
      # no model is trained in that case and full coverage is reported
      return 1.0

    sp = spm.SentencePieceProcessor()
    # Load our recently trained model
    sp.load(f'm{threshold}.model')

    # Count the number of times UNK (id=0 by default) was assigned in the entire dataset
    unk_id = sp.unk_id()
    total = 0
    unk = 0
    for line in self.lines:
      ids = sp.encode_as_ids(line)
      unk += ids.count(unk_id)
      total += len(ids)
    return (total - unk) / total

sp_tokenizer = SentencePieceTokenizer(lines)
assert sp_tokenizer.tokenize(SAMPLE_SENTENCE) == ['▁I', "'", 'm', '▁learn', 'ing', '▁N', 'L', 'P', '.', '▁A', 'ren', "'", 't', '▁my', '▁pro', 'j', 'e', 'c', 't', 's', '▁awe', 'some', '?'], sp_tokenizer.tokenize(SAMPLE_SENTENCE)
sp_tokenizer.coverage_report(THRESHOLDS)

For vocab size: 1000, coverage is: 99.88% and oov is: 0.12%
For vocab size: 2000, coverage is: 99.85% and oov is: 0.15%
For vocab size: 3000, coverage is: 99.84% and oov is: 0.16%
For vocab size: 4000, coverage is: 99.83% and oov is: 0.17%
For vocab size: 5000, coverage is: 99.83% and oov is: 0.17%
For vocab size: 7500, coverage is: 100.00% and oov is: 0.00%
For vocab size: 10000, coverage is: 100.00% and oov is: 0.00%


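As you can see, the OOV rate is very low, but it rises slightly as the vocabulary size grows from 1,000 to 5,000. (The 100% figures for 7,500 and 10,000 come from the fallback in `coverage`: training raises a `RuntimeError` at those sizes, so the method returns 1.0 without training a model.) An intuitive way to think about this: with a small vocabulary, the algorithm learns something close to a character-level tokenization, which covers almost any input. Given a larger budget, it tries to learn more of the language's semantics through longer pieces, and trades off a little vocabulary coverage in return.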
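
The "small vocabulary behaves like characters" intuition can be seen with a toy subword learner. One caveat, stated plainly: SentencePiece trains a unigram language model by default, whereas the sketch below uses byte-pair-encoding (BPE) merges as a simpler stand-in. The function names `learn_bpe`, `merge_pair`, and `encode`, and the toy word list, are illustrative assumptions rather than SentencePiece APIs.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Learn up to num_merges merges, starting from single characters."""
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken by the pair itself for determinism
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges

def encode(word, merges):
    """Apply the learned merges, in order, to a single word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

words = "low lower lowest new newer newest".split()
for budget in (0, 3, 10):
    print(budget, "merges:", encode("lowest", learn_bpe(words, budget)))
```

With a budget of zero merges the word is spelled out character by character. As the budget grows, the pieces get longer and more word-like, which mirrors the trade-off described above.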