<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>


- Install the additional package requirements for this bonus notebook by uncommenting and running the following cell:

In [1]:
# pip install -r requirements-extra.txt

# Comparing Various Byte Pair Encoding (BPE) Implementations

<br>
&nbsp;

## Using BPE from `tiktoken`

In [1]:
from importlib.metadata import version

print("tiktoken version:", version("tiktoken"))

tiktoken version: 0.8.0


In [2]:
import tiktoken

tik_tokenizer = tiktoken.get_encoding("gpt2")

text = "Hello, world. Is this-- a test?"

In [21]:
integers = tik_tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]


In [22]:
strings = tik_tokenizer.decode(integers)

print(strings)

Hello, world. Is this-- a test?


In [23]:
print(tik_tokenizer.n_vocab)  # vocabulary size (largest token ID + 1)

50257


In [35]:
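# Token ID 50256 is the only special token in the GPT-2 vocabulary
print(tik_tokenizer.decode([50256]))
# Alternative 1: allowing the special token encodes it as the single ID [50256]
# print(tik_tokenizer.encode('<|endoftext|>', allowed_special={"<|endoftext|>"}))
# Alternative 2: with default settings, tiktoken raises a ValueError for special tokens in the input
# print(tik_tokenizer.encode('<|endoftext|>'))
print(f'All special tokens: {tik_tokenizer.special_tokens_set}')
# Neither allowed nor disallowed: the string is encoded as ordinary text
print(tik_tokenizer.encode('<|endoftext|>', disallowed_special=(tik_tokenizer.special_tokens_set - {"<|endoftext|>"})))

<|endoftext|>
All special tokens: {'<|endoftext|>'}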
[27, 91, 437, 1659, 5239, 91, 29]
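
The output above shows that `<|endoftext|>` becomes seven ordinary tokens unless it is explicitly treated as a special token. The following is a minimal sketch of that idea, not tiktoken's actual implementation: the text is split on the allowed special tokens, each special token maps to a single ID, and everything else goes through an ordinary encoder. The names `encode_with_special` and `byte_encoder` are hypothetical, and `byte_encoder` is only a stand-in that emits one ID per UTF-8 byte rather than real BPE tokens.

```python
import re
from typing import Callable, Dict, List


def encode_with_special(
    text: str,
    encode_ordinary: Callable[[str], List[int]],
    special_tokens: Dict[str, int],
) -> List[int]:
    """Encode text, mapping each allowed special token to its single ID."""
    if not special_tokens:
        # No special tokens allowed: everything is ordinary text
        return encode_ordinary(text)

    # The capture group keeps the special tokens in the split result
    pattern = "(" + "|".join(re.escape(tok) for tok in special_tokens) + ")"
    ids: List[int] = []
    for piece in re.split(pattern, text):
        if piece in special_tokens:
            ids.append(special_tokens[piece])
        elif piece:
            ids.extend(encode_ordinary(piece))
    return ids


def byte_encoder(s: str) -> List[int]:
    """Hypothetical stand-in encoder: one ID per UTF-8 byte (not real BPE)."""
    return list(s.encode("utf-8"))


special = {"<|endoftext|>": 50256}
with_special = encode_with_special("Hi<|endoftext|>", byte_encoder, special)
without_special = encode_with_special("Hi<|endoftext|>", byte_encoder, {})

print(with_special)          # special token collapsed into one ID
print(len(without_special))  # special token spelled out as ordinary text
```

With the special token allowed, the marker contributes exactly one ID; without it, the marker is split into many ordinary tokens, which mirrors the difference seen with `tik_tokenizer` above.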


<br>
&nbsp;

## Using the original BPE implementation used in GPT-2

In [36]:
from bpe_openai_gpt2 import get_encoder, download_vocab

In [37]:
"""
Required files:
1. encoder.json
2. vocab.bpe
You may skip this step, because the required
files are already included in ./gpt2_model
"""
download_vocab()

Fetching encoder.json: 1.04Mit [00:18, 55.8kit/s]                                                   
Fetching vocab.bpe: 457kit [00:09, 49.9kit/s]                                                       


In [43]:
orig_tokenizer = get_encoder(model_name="gpt2_model", models_dir=".")

In [39]:
integers = orig_tokenizer.encode(text)

print(integers)

[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]


In [40]:
strings = orig_tokenizer.decode(integers)

print(strings)

Hello, world. Is this-- a test?


In [41]:
print(orig_tokenizer.encode('<|endoftext|>'))

[27, 91, 437, 1659, 5239, 91, 29]


<br>
&nbsp;

## Using BPE via Hugging Face transformers

In [45]:
import transformers

transformers.__version__

'4.47.1'

In [46]:
from transformers import GPT2Tokenizer

hf_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

In [47]:
hf_tokenizer(strings)["input_ids"]

[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]

## Putting them together

In [56]:
print(f'original input:   "{text}"')
# tiktoken
print(f'tiktoken:         {tik_tokenizer.encode(text)}')
# gpt-2's original
print(f"gpt-2's original: {orig_tokenizer.encode(text)}")
# huggingface
print(f'huggingface:      {hf_tokenizer(text)["input_ids"]}')

original input:   "Hello, world. Is this-- a test?"
tiktoken:         [15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]
gpt-2's original: [15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]
huggingface:      [15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]
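
Comparing the printed lists by eye works for one short sentence, but a programmatic check scales better to longer inputs. The helper below is a small sketch under that assumption; `encoders_agree` is a hypothetical name, and the two lambda encoders at the bottom are placeholders so that the block runs without any tokenizer installed.

```python
from typing import Callable, Dict, List


def encoders_agree(
    text: str,
    encoders: Dict[str, Callable[[str], List[int]]],
) -> bool:
    """Return True if every encoder yields the same token IDs for `text`."""
    results = {name: list(encode(text)) for name, encode in encoders.items()}
    if not results:
        return True  # nothing to compare
    reference = next(iter(results.values()))
    return all(ids == reference for ids in results.values())


# In this notebook, the dictionary could be built from the tokenizers above:
# encoders = {
#     "tiktoken": tik_tokenizer.encode,
#     "original": orig_tokenizer.encode,
#     "huggingface": lambda t: hf_tokenizer(t)["input_ids"],
# }

# Placeholder encoders for a self-contained demonstration
same = encoders_agree("ab", {"x": lambda s: [1, 2], "y": lambda s: [1, 2]})
different = encoders_agree("ab", {"x": lambda s: [1, 2], "y": lambda s: [1, 3]})
print(same, different)
```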


<br>
&nbsp;

## A quick performance benchmark

In [15]:
with open('../01_main-chapter-code/the-verdict.txt', 'r', encoding='utf-8') as f:
    raw_text = f.read()

In [16]:
%timeit orig_tokenizer.encode(raw_text)

4.29 ms ± 46.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [17]:
%timeit tik_tokenizer.encode(raw_text)

1.4 ms ± 9.71 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [18]:
%timeit hf_tokenizer(raw_text)["input_ids"]

Token indices sequence length is longer than the specified maximum sequence length for this model (5145 > 1024). Running this sequence through the model will result in indexing errors


8.46 ms ± 48.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [19]:
%timeit hf_tokenizer(raw_text, max_length=5145, truncation=True)["input_ids"]

8.36 ms ± 184 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
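
The `%timeit` magic only works inside IPython or Jupyter. As a rough sketch for plain Python scripts, the standard-library `timeit` module can produce comparable measurements. The helper name `time_encoder` is hypothetical, and unlike `%timeit`, which reports mean ± standard deviation, this version reports the fastest of the repeated runs. The lambda at the bottom is a placeholder encoder so that the block runs on its own.

```python
import timeit
from typing import Callable


def time_encoder(
    encode: Callable[[str], object],
    text: str,
    number: int = 100,
    repeat: int = 7,
) -> float:
    """Return the best average time per call in milliseconds."""
    # Each entry in `timings` is the total time for `number` calls
    timings = timeit.repeat(lambda: encode(text), number=number, repeat=repeat)
    return min(timings) / number * 1000.0


# In this notebook, one could pass the tokenizers defined above, for example:
# time_encoder(tik_tokenizer.encode, raw_text)

# Placeholder encoder for a self-contained demonstration
elapsed_ms = time_encoder(lambda s: list(s.encode("utf-8")), "Hello, world.", number=10, repeat=3)
print(f"{elapsed_ms:.4f} ms per call")
```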
