# Week 2

### Tokenization
1. Tokenizer → 2. Token to ID → 3. Embeddings → 4. Language Model → 5. ID to Token → 6. Tokens to Words

Encoder: steps 1-2 (text to token IDs)
Decoder: steps 5-6 (token IDs back to text)
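
The encoder and decoder halves of this pipeline can be sketched with a toy word-level vocabulary. This is only an illustration: the vocabulary, the ids, and the helper names `encode` and `decode` are made up here and are not part of any library.

```python
# Toy vocabulary; the words and ids are invented for illustration only.
vocab = {"[UNK]": 0, "he": 1, "would": 2, "be": 3, "playing": 4}
id_to_token = {i: t for t, i in vocab.items()}

def encode(text):
    """Steps 1-2: split text into tokens, then map each token to its id."""
    tokens = text.lower().split()
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

def decode(ids):
    """Steps 5-6: map ids back to tokens, then join them into text."""
    return " ".join(id_to_token[i] for i in ids)

ids = encode("He would be playing outside")
print(ids)          # [1, 2, 3, 4, 0]
print(decode(ids))  # he would be playing [UNK]
```

Any word missing from the vocabulary collapses to `[UNK]`, which is the problem subword tokenizers such as BPE try to reduce.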


### Byte Pair Encoding
Small vocabulary → long sequence length
Large vocabulary → costly softmax computation
Encodes language without spaces
→ Merges are chosen by frequency
→ Fertility: number of subwords a word is split into
→ 1 merge = 1 addition to the vocabulary
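
The frequency-based merging and the "1 merge = 1 vocabulary entry" rule can be sketched in plain Python. This is a simplified illustration, not the Hugging Face implementation: the word counts are invented, there is no `##` continuation prefix, and ties are broken arbitrarily (here, by the larger pair in sort order).

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(word_freqs, num_merges):
    """Start from characters and apply `num_merges` most-frequent-pair merges."""
    words = {tuple(w): f for w, f in word_freqs.items()}
    vocab = sorted({ch for w in words for ch in w})
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # Most frequent pair; ties broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        words = merge_pair(words, best)
        merges.append(best)
        vocab.append(best[0] + best[1])  # one merge adds one vocabulary entry
    return vocab, merges

# Hypothetical word counts, chosen only to make the merges easy to follow.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
vocab, merges = train_bpe(word_freqs, num_merges=3)
print(merges)      # [('s', 't'), ('e', 'st'), ('o', 'w')]
print(len(vocab))  # 13 = 10 characters + 3 merges
```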

### Word-Piece Tokenizer


## Practice

#### Load Dataset

In [1]:
from datasets import load_dataset



In [3]:
ds = load_dataset('bookcorpus',split='all', trust_remote_code=True)
print(ds)
for idx,sample in enumerate(ds[0:6]['text']):
    print(f'{idx} : {sample}')

Downloading data: 100%|██████████| 1.18G/1.18G [03:46<00:00, 5.21MB/s]  
Generating train split: 100%|██████████| 74004228/74004228 [14:07<00:00, 87321.80 examples/s] 


Dataset({
    features: ['text'],
    num_rows: 74004228
})
0 : usually , he would be tearing around the living room , playing with his toys .
1 : but just one look at a minion sent him practically catatonic .
2 : that had been megan 's plan when she got him dressed earlier .
3 : he 'd seen the movie almost by mistake , considering he was a little young for the pg cartoon , but with older cousins , along with her brothers , mason was often exposed to things that were older .
4 : she liked to think being surrounded by adults and older kids was one reason why he was a such a good talker for his age .
5 : `` are n't you being a good boy ? ''


#### Tokenize

In [4]:
from tokenizers import Tokenizer

|**Component** |**Choice**  |
|:------------:|:----------:|
|normalizer    |Lowercase   |
|pre-tokenizer |Whitespace  |
|model         | BPE        |
|postprocessor | None       |

In [5]:
from tokenizers.normalizers import Lowercase 
from tokenizers.pre_tokenizers import Whitespace 
from tokenizers.models import BPE

try:
    model = BPE(unk_token="[UNK]")
    tokenizer = Tokenizer(model)
    tokenizer.normalizer = Lowercase()
    tokenizer.pre_tokenizer = Whitespace()
except Exception as e:
    print(f"ERROR: {e}")    

In [6]:
from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(vocab_size=32000,special_tokens=["[PAD]","[UNK]"],continuing_subword_prefix='##')
# pipeline is done

def get_examples(batch_size=1000):
    for i in range(0, len(ds), batch_size):
        yield ds[i : i + batch_size]['text']    

from multiprocessing import cpu_count
cpus = cpu_count()
print(cpus)

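from tqdm import tqdm
batch_size = 10000
# The iterator yields batches, so the progress bar total is the number of batches, not samples
num_batches = (len(ds) + batch_size - 1) // batch_size
example_iterator = get_examples(batch_size=batch_size)
example_iterator_with_progress = tqdm(example_iterator, total=num_batches, desc="Training tokenizer")
tokenizer.train_from_iterator(example_iterator_with_progress, trainer=trainer, length=len(ds))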

# tokenizer.train_from_iterator(get_examples(batch_size=10000),trainer=trainer,length=len(ds))

12


In [8]:
import os
try:
    save_dir = 'model'
    os.makedirs(save_dir, exist_ok=True)
    tokenizer.model.save(save_dir,prefix='hopper')
except Exception as e:
    print(f"ERROR: {e}")

#### Vocabulary

In [21]:
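with open('model/hopper-merges.txt','r') as file:
    lines = file.readlines()
# Note: merges files typically start with a '#version' header line, which this line count would include
print(f'Number of merges:{len(lines)}')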
print(f'vocab size:{tokenizer.get_vocab_size()}') 


Number of merges:31871
vocab size:32000


In [27]:
vocab = tokenizer.get_vocab()
vocab_sorted = sorted(vocab.items(), key=lambda item: item[1])
vocab_sorted

[('[PAD]', 0),
 ('[UNK]', 1),
 ('\x13', 2),
 ('\x14', 3),
 ('\x18', 4),
 ('\x19', 5),
 ('\x1c', 6),
 ('\x1d', 7),
 ('\x1f', 8),
 ('!', 9),
 ('#', 10),
 ('$', 11),
 ('%', 12),
 ('&', 13),
 ("'", 14),
 ('(', 15),
 (')', 16),
 ('*', 17),
 ('+', 18),
 (',', 19),
 ('-', 20),
 ('.', 21),
 ('/', 22),
 ('0', 23),
 ('1', 24),
 ('2', 25),
 ('3', 26),
 ('4', 27),
 ('5', 28),
 ('6', 29),
 ('7', 30),
 ('8', 31),
 ('9', 32),
 (':', 33),
 (';', 34),
 ('<', 35),
 ('=', 36),
 ('>', 37),
 ('?', 38),
 ('@', 39),
 ('[', 40),
 ('\\', 41),
 (']', 42),
 ('^', 43),
 ('_', 44),
 ('`', 45),
 ('a', 46),
 ('b', 47),
 ('c', 48),
 ('d', 49),
 ('e', 50),
 ('f', 51),
 ('g', 52),
 ('h', 53),
 ('i', 54),
 ('j', 55),
 ('k', 56),
 ('l', 57),
 ('m', 58),
 ('n', 59),
 ('o', 60),
 ('p', 61),
 ('q', 62),
 ('r', 63),
 ('s', 64),
 ('t', 65),
 ('u', 66),
 ('v', 67),
 ('w', 68),
 ('x', 69),
 ('y', 70),
 ('z', 71),
 ('{', 72),
 ('|', 73),
 ('}', 74),
 ('~', 75),
 ('\x7f', 76),
 ('##g', 77),
 ('##i', 78),
 ('##t', 79),
 ('##a', 80

#### Encoding

In [28]:
sample = ds[0]['text']
print(f'sample: {sample}')
encoding = tokenizer.encode(sample)
print(encoding)

sample: usually , he would be tearing around the living room , playing with his toys .
Encoding(num_tokens=16, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])


In [29]:
token_ids = encoding.ids
tokens = encoding.tokens
type_ids = encoding.type_ids
attention_mask = encoding.attention_mask

from tokenizers.tools import EncodingVisualizer
visualizer = EncodingVisualizer(tokenizer=tokenizer)
visualizer(text=sample)

In [32]:
import pandas as pd
out_dict = {'tokens':tokens,'ids':token_ids,'type_ids':type_ids,'attention_mask':attention_mask}
df = pd.DataFrame.from_dict(out_dict)
df

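|   |**tokens** |**ids** |**type_ids** |**attention_mask** |
|:-:|:---------:|:------:|:-----------:|:-----------------:|
|0  |usually    |2462    |0            |1                  |
|1  |,          |19      |0            |1                  |
|2  |he         |149     |0            |1                  |
|3  |would      |277     |0            |1                  |
|4  |be         |162     |0            |1                  |
|5  |tearing    |6456    |0            |1                  |
|6  |around     |422     |0            |1                  |
|7  |the        |131     |0            |1                  |
|8  |living     |1559    |0            |1                  |
|9  |room       |536     |0            |1                  |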


#### Batch Encoding

In [39]:
from pprint import pprint 

In [40]:
samples = ds[0:4]['text']
batch_encoding = tokenizer.encode_batch(samples)
pprint(batch_encoding)

[Encoding(num_tokens=16, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=14, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=14, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=42, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])]


In [42]:
# all default args
tokenizer.enable_padding(direction = 'right',
                         pad_id = 0,
                         pad_type_id = 0,
                         pad_token = '[PAD]',
                         length = None, # None pads to the longest sequence in the batch
                         pad_to_multiple_of = None) 

tokenizer.enable_truncation(max_length=128)


In [43]:
batch_encoding = tokenizer.encode_batch(samples)
print(batch_encoding)

[Encoding(num_tokens=42, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=42, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=42, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=42, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])]
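
What right-padding does to a batch can be sketched without the library. The helper `pad_batch` below is hypothetical and only mimics the idea behind `enable_padding` with `length=None`; the ids are dummy values with the same lengths as the four samples above.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every id list to the longest length in the batch."""
    max_len = max(len(ids) for ids in batch_ids)
    padded, masks = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        masks.append([1] * len(ids) + [0] * n_pad)
    return padded, masks

# Dummy ids with the same lengths as the four samples: 16, 14, 14, 42.
batch = [[5] * n for n in (16, 14, 14, 42)]
padded, masks = pad_batch(batch)
print([len(p) for p in padded])  # [42, 42, 42, 42]
print([sum(m) for m in masks])   # [16, 14, 14, 42]
```

The attention mask keeps the original lengths recoverable, which is why every encoding can report 42 tokens after padding without losing track of the real ones.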


#### Testing

In [45]:
text = "All this is so simple to do in HF இ😊."
encoded = tokenizer.encode(text).tokens
print(encoded)
visualizer(text=text)

['all',
 'this',
 'is',
 'so',
 'simple',
 'to',
 'do',
 'in',
 'h',
 '##f',
 '[UNK]',
 '[UNK]',
 '##.']


In [47]:
try:
    # save() takes a single file path; a second positional argument is read as `pretty` (bool)
    tokenizer.save(os.path.join(save_dir, 'hopper.json'))
except Exception as e:
    print(f"ERROR: {e}")


## Assignment

In [2]:
from pprint import pprint

Download the BookCorpus dataset and take every 7th sample (indices that are multiples of 7: [0, 7, 14, 21, ...]) from the entire dataset. This yields a dataset of roughly 10 million samples (exactly 10,572,033). Use these samples to build tokenizers with the BPE algorithm while varying the vocabulary size.

In [3]:
from datasets import load_dataset
try:
    all_ds = load_dataset('bookcorpus',split='all')
    print(all_ds)
except Exception as e:
    print(f"ERROR: {e}")



Dataset({
    features: ['text'],
    num_rows: 74004228
})


In [4]:
ds = all_ds.select(range(0, all_ds.num_rows, 7))
ds

Dataset({
    features: ['text'],
    num_rows: 10572033
})

* Normalizer: Lowercase
* PreTokenizer: Whitespace
* Model: BPE
* Special tokens: [GO], [UNK], [PAD], [EOS]
* PostProcessing: None

In [5]:
from tokenizers import Tokenizer
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.models import  BPE
from tokenizers.trainers import BpeTrainer

Tokenize the input text: “SEBI study finds 93% of individual F&O traders made losses between FY22 and FY24.” using the following configurations.

In [6]:
text = "SEBI study finds 93% of individual F&O traders made losses between FY22 and FY24."
tokens = text.split()
print(len(tokens),tokens)

14 ['SEBI', 'study', 'finds', '93%', 'of', 'individual', 'F&O', 'traders', 'made', 'losses', 'between', 'FY22', 'and', 'FY24.']


**Q1:** Keep the vocabulary size at 5000 and tokenize the input text using the learned vocabulary. Choose the number of tokens returned by the tokenizer.

In [7]:
model = BPE(unk_token= "[UNK]")
tokenizer = Tokenizer(model)
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size = 5000,
    special_tokens = ["[GO]","[UNK]","[PAD]","[EOS]"],
    continuing_subword_prefix='##')

In [8]:
def samples(batch_size = 1000):
    for i in range(0, len(ds), batch_size):
        yield ds[i : i+ batch_size]['text']

In [10]:
from tqdm import tqdm
bsize = 1000
samples_iterator = tqdm(samples(bsize), total = len(ds) // bsize, desc="Tokenizer Training")
tokenizer.train_from_iterator(samples_iterator, trainer= trainer, length=len(ds))

Tokenizer Training: 10573it [01:57, 89.79it/s]                           
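
The progress bar above finishes at 10573 iterations even though `total` was set to `len(ds) // bsize`. A quick arithmetic check, using only the row counts printed earlier, suggests why: floor division drops the final partial batch.

```python
import math

total_rows = 74_004_228                      # rows in the full dataset
num_samples = len(range(0, total_rows, 7))   # every 7th index
batch_size = 1000

floor_batches = num_samples // batch_size           # what `total` was set to
ceil_batches = math.ceil(num_samples / batch_size)  # batches actually yielded

print(num_samples, floor_batches, ceil_batches)  # 10572033 10572 10573
```

Using the ceiling for `total` would make the bar end exactly at 100%; training itself is unaffected either way.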


In [12]:
tokens = tokenizer.encode(text).tokens
print(len(tokens), tokens)

32 ['seb', '##i', 'study', 'find', '##s', '9', '##3', '%', 'of', 'ind', '##ivid', '##ual', 'f', '&', 'o', 'tr', '##ad', '##ers', 'made', 'loss', '##es', 'between', 'f', '##y', '##2', '##2', 'and', 'f', '##y', '##2', '##4', '.']


**Q2:** Increase the vocabulary size to 10K, 15K and 32K. For each case, tokenize the same input with the newly learned vocabulary. Choose all the correct statements.

Change `vocab_size` and retrain the tokenizer for each case.

In [None]:
from tokenizers.tools import EncodingVisualizer

vsizes = [10_000, 15_000, 32_000]
for vsize in vsizes:
    # Build a fresh tokenizer so each vocabulary is learned from scratch
    model = BPE(unk_token="[UNK]")
    tokenizer = Tokenizer(model)
    tokenizer.normalizer = Lowercase()
    tokenizer.pre_tokenizer = Whitespace()

    trainer = BpeTrainer(vocab_size=vsize,
                         special_tokens=["[GO]", "[UNK]", "[PAD]", "[EOS]"],
                         continuing_subword_prefix='##')

    samples_iterator = tqdm(samples(bsize), total=len(ds) // bsize, desc=f"Batches Trained for Vocab size {vsize}")
    tokenizer.train_from_iterator(samples_iterator, trainer=trainer, length=len(ds))
    print("token size =", len(tokenizer.encode(text).tokens))
    visualizer = EncodingVisualizer(tokenizer=tokenizer)
    visualizer(text=text)

Batches Trained for Vocab size 10000: 10573it [01:55, 91.61it/s]                            


token size = 28


Batches Trained for Vocab size 15000: 10573it [03:27, 50.90it/s]                           


token size = 28


Batches Trained for Vocab size 32000: 10573it [04:17, 41.12it/s]                           


token size = 25
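
The token counts reported for the four vocabulary sizes can be turned into fertility values. This is a rough sketch: it assumes fertility is measured as tokens per whitespace-separated word (14 words for this input), and the `fertility` helper is a name chosen for illustration.

```python
def fertility(num_tokens, num_words):
    """Average number of tokens produced per word."""
    return num_tokens / num_words

num_words = 14  # len(text.split()) for the SEBI sentence

# Token counts printed in the notebook for each vocabulary size.
token_counts = {5_000: 32, 10_000: 28, 15_000: 28, 32_000: 25}

for vsize, n_tokens in token_counts.items():
    print(vsize, round(fertility(n_tokens, num_words), 2))
# 5000 2.29
# 10000 2.0
# 15000 2.0
# 32000 1.79
```

Fertility falls as the vocabulary grows, though not at every step: 10K and 15K give the same count for this sentence.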


**Q3:** Download the pre-trained tokenizer file “hopper.json” used in the lecture, from [here](https://drive.google.com/file/d/1QNnyh8iMN-IqW_h1w8gAMtw09Em7-e1e/view?usp=sharing). The tokenizer was trained on all 74 million samples in the BookCorpus dataset. Tokenize the same input text using this “hopper” tokenizer. How many tokens are there?

In [26]:
# from_file is a static constructor, so no placeholder Tokenizer is needed
trained_tokenizer = Tokenizer.from_file('hopper.json')
tokens = trained_tokenizer.encode(text).tokens
print(len(tokens))
EncodingVisualizer(tokenizer=trained_tokenizer)(text=text)


25


**Q4:** Suppose we know that the acronym “FY” will likely appear very frequently in most of the input text (assume the text comes from the financial domain). Therefore, we hope that adding it manually to the vocabulary might help. Add the token “FY” to the vocabulary and tokenize the input text. Enter the number of tokens produced.

In [28]:
print(trained_tokenizer.get_vocab_size())
trained_tokenizer.add_tokens(['FY'])
print(trained_tokenizer.get_vocab_size())
tokens = trained_tokenizer.encode(text).tokens
print(len(tokens))
EncodingVisualizer(tokenizer=trained_tokenizer)(text=text)

32000
32001
22


**Q5:** Load the “bert-base-uncased” and “gpt2” tokenizers (use the `AutoTokenizer` class from transformers). Which of the following special tokens are used in these tokenizers?

In [31]:
from transformers import AutoTokenizer
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
gpt2_tokenizer = AutoTokenizer.from_pretrained('gpt2')
print(f"bbu special tokens - {bert_tokenizer.all_special_tokens}")
print(f"gpt2 special tokens - {gpt2_tokenizer.all_special_tokens}")

bbu special tokens - ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']
gpt2 special tokens - ['<|endoftext|>']


**Q6:** By now, we have four tokenizers.

1. Custom tokenizer (vocab size 32K, trained on 10 million samples)
2. bert-base-uncased
3. gpt2
4. hopper

Use these four tokenizers to count the number of tokens for the entire “imdb” dataset (drop the “unsupervised” part of the dataset). Enter the tokenizers in order such that the size of the dataset (measured in tokens) as returned by the tokenizers is in decreasing order. For example, if the first tokenizer yields the smallest number of tokens and the fourth tokenizer yields the largest, you would enter 1234 (without any spaces).


In [None]:

# Reload from disk so this copy does not include the manually added "FY" token
hopper_tokenizer = Tokenizer.from_file('hopper.json')

imdb = load_dataset("stanfordnlp/imdb", split='train+test')
imdb

Generating train split: 100%|██████████| 25000/25000 [00:00<00:00, 149950.81 examples/s]
Generating test split: 100%|██████████| 25000/25000 [00:00<00:00, 363102.96 examples/s]
Generating unsupervised split: 100%|██████████| 50000/50000 [00:00<00:00, 364136.93 examples/s]


Dataset({
    features: ['text', 'label'],
    num_rows: 50000
})

In [44]:
for t in [tokenizer, hopper_tokenizer]:
    num_tokens = 0
    for sample in imdb:
        tokens = t.encode(sample['text']).tokens
        num_tokens += len(tokens)
    print(num_tokens)

15352840
13526933


In [45]:
for t in [bert_tokenizer, gpt2_tokenizer]:
    num_tokens = 0
    for sample in tqdm(imdb, total= len(imdb)):
        token_ids = t(sample['text'])['input_ids']
        num_tokens += len(token_ids)
    print(num_tokens)

  0%|          | 0/50000 [00:00<?, ?it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (720 > 512). Running this sequence through the model will result in indexing errors
100%|██████████| 50000/50000 [00:48<00:00, 1025.08it/s]


15516058


  0%|          | 0/50000 [00:00<?, ?it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (1168 > 1024). Running this sequence through the model will result in indexing errors
100%|██████████| 50000/50000 [00:50<00:00, 998.79it/s] 

14812432
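
The four totals printed above can be ranked with a few lines of Python. This sketch assumes `tokenizer` in the first loop is the 32K custom tokenizer left over from the Q2 loop, and it takes the counts exactly as printed (the BERT count includes the `[CLS]` and `[SEP]` tokens added to each sample).

```python
# Token counts copied from the outputs above, keyed by the numbering in Q6.
counts = {
    1: 15_352_840,  # custom tokenizer (32K vocab)
    2: 15_516_058,  # bert-base-uncased
    3: 14_812_432,  # gpt2
    4: 13_526_933,  # hopper
}

# Sort tokenizer numbers by token count, largest first.
decreasing = sorted(counts, key=counts.get, reverse=True)
answer = "".join(str(k) for k in decreasing)
print(answer)  # 2134
```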



