# Byte pair encoding

> A detailed discussion and implementation of BPE is out of the scope of this
book, but in short, it builds its vocabulary by iteratively merging frequent
characters into subwords and frequent subwords into words. For example,
BPE starts with adding all individual single characters to its vocabulary ("a",
"b", ...). In the next stage, it merges character combinations that frequently
occur together into subwords. For example, "d" and "e" may be merged into
the subword "de," which is common in many English words like "define",
"depend", "made", and "hidden". The merges are determined by a frequency
cutoff.
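The merging procedure described in the quote can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how tiktoken is implemented: the function names (`count_pairs`, `merge_pair`, `train_bpe`), the `min_frequency` cutoff parameter, and the tiny corpus are all made up for the example.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged_words = {}
    for symbols, freq in words.items():
        out = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged_words[key] = merged_words.get(key, 0) + freq
    return merged_words


def train_bpe(corpus, min_frequency):
    """Merge the most frequent pair until no pair reaches `min_frequency`."""
    # Start from single characters: each word is a tuple of 1-character symbols.
    words = dict(Counter(tuple(word) for word in corpus.split()))
    vocab = {ch for symbols in words for ch in symbols}
    merges = []
    while True:
        pairs = count_pairs(words)
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < min_frequency:  # the frequency cutoff (assumed form)
            break
        words = merge_pair(words, pair)
        merges.append(pair)
        vocab.add(pair[0] + pair[1])
    return merges, vocab


# Toy corpus built from the words mentioned in the quote (an assumption).
corpus = "define depend made hidden define depend"
merges, vocab = train_bpe(corpus, min_frequency=3)
print(merges)  # [('d', 'e')]
```

On this corpus the pair ("d", "e") occurs six times, so it is merged into "de"; every remaining pair occurs at most twice, which is below the cutoff of 3, so training stops after one merge.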

In [3]:
# install tiktoken
! pip install -q tiktoken

In [5]:
# import libraries
from importlib.metadata import version  # standard library (Python 3.8+)
import tiktoken

In [9]:
print(f'tiktoken version :{version("tiktoken")}')

tiktoken version :0.6.0


In [11]:
tokenizer = tiktoken.get_encoding('gpt2')
text = "Hello, do you like tea? <|endoftext|> In the sunlit terra."

# <|endoftext|> is a special token, so it has to be allowed explicitly
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print("encode :", integers)

encode : [15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 1059, 430, 13]


In [12]:
print('decode :',tokenizer.decode(integers))

decode : Hello, do you like tea? <|endoftext|> In the sunlit terra.


<hr>


In [13]:
from Tokenizer import read_txt # 2.3 & 2.4 Tokenizer.py
txt_file = read_txt(r'E:\Courses\LLMs\LLMs-from-Scratch\ch02-Working with Text Data\data\the-verdict.txt')
integers = tokenizer.encode(txt_file)

print(integers[:10])

[40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899, 2138]


In [14]:
print('decode :',tokenizer.decode(integers[:10]))

decode : I HAD always thought Jack Gisburn rather


<hr>

# Unknown words

In [25]:
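# Made-up strings that are unlikely to exist in the vocabulary as whole tokens
samples = ["Zeyad", "Akwirw ier"]
for sample in samples:
    integers = tokenizer.encode(sample)
    print("encode :", integers)
    print("decode :", tokenizer.decode(integers))
    print("-*" * 10)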

encode : [57, 2959, 324]
decode : Zeyad
-*-*-*-*-*-*-*-*-*-*
encode : [33901, 86, 343, 86, 220, 959]
decode : Akwirw ier
-*-*-*-*-*-*-*-*-*-*
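
The outputs above show that the made-up strings are still encoded and decoded without loss: BPE breaks anything it does not know as a whole into smaller pieces it does know, so no dedicated unknown-word token is needed. A minimal sketch of that fallback follows. It assumes a hypothetical two-entry merge table (`toy_merges`) and a hypothetical helper `apply_merges`; these are not the real GPT-2 merge rules, and the real tokenizer works on bytes rather than characters.

```python
def apply_merges(text, merges):
    """Split `text` into characters, then apply each learned merge in order."""
    symbols = list(text)
    for left, right in merges:
        out = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


# Hypothetical merge table: first "e" + "r" -> "er", then "i" + "er" -> "ier".
toy_merges = [("e", "r"), ("i", "er")]

unknown = apply_merges("Akwirw", toy_merges)
known = apply_merges("ier", toy_merges)

print(unknown)  # ['A', 'k', 'w', 'i', 'r', 'w']
print(known)    # ['ier']
print("".join(unknown) == "Akwirw")  # True
```

No merge applies to "Akwirw", so it stays as single characters, while "ier" collapses into one token. In both cases joining the pieces reproduces the input exactly, which mirrors the lossless round trip seen in the cell output.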
