# 01 - Tokenizing text

We'll use an off-the-shelf tokenizer, `tiktoken`, to do Byte Pair Encoding (BPE).
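
To give a rough intuition for what BPE does, here is a minimal toy sketch. It is not how `tiktoken` is implemented, and it skips details such as the pre-tokenization GPT-2 applies before merging. The function names (`most_frequent_pair`, `merge_pair`, `toy_bpe`) and the example string are made up for illustration. The idea is to start from raw bytes and repeatedly merge the most frequent adjacent pair into a new token:

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the most common adjacent pair of token ids, or None if there is none."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` (left to right) with `new_id`."""
    merged, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(ids[i])
            i += 1
    return merged

def toy_bpe(text, num_merges):
    """Learn up to `num_merges` merges on the UTF-8 bytes of `text` (toy version)."""
    ids = list(text.encode("utf-8"))             # start from raw bytes (ids 0..255)
    vocab = {i: bytes([i]) for i in range(256)}  # token id -> byte string
    for step in range(num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        new_id = 256 + step
        vocab[new_id] = vocab[pair[0]] + vocab[pair[1]]
        ids = merge_pair(ids, pair, new_id)
    return ids, vocab

toy_text = "low lower lowest"
ids, vocab = toy_bpe(toy_text, num_merges=3)
print(len(toy_text), "bytes ->", len(ids), "tokens")
print([vocab[i] for i in ids])
```

Each merge shortens the sequence while growing the vocabulary by one entry. A real tokenizer such as the GPT-2 encoding ships with a fixed list of merges learned once on a large corpus.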

In [None]:
#| echo: true
#| output: false
%conda install -y tiktoken

Let's define helpers to load a text file and to convert between text and tokens:

In [32]:
import tiktoken

filepath = '../data/dracula.txt'

def load_text(path):
    # Read as UTF-8 explicitly so the result does not depend on the platform default
    with open(path, 'r', encoding='utf-8') as f:
        raw_text = f.read()
    return raw_text

def tokens_from_text(text: str):
    tokenizer = tiktoken.get_encoding("gpt2")
    integers = tokenizer.encode(text)
    return integers

def text_from_tokens(tokens: list[int]):
    tokenizer = tiktoken.get_encoding("gpt2")
    text = tokenizer.decode(tokens)
    return text


This now allows us to load text and turn it into tokens (each identified by an integer), or do the reverse: given a sequence of token IDs, reconstruct the text:

In [33]:
sample_text = load_text(filepath)[:40]
print(sample_text)

tokens = tokens_from_text(sample_text)
print(tokens)

text = text_from_tokens(tokens)
print(text)

The Project Gutenberg eBook of Dracula
 
[464, 4935, 20336, 46566, 286, 41142, 198, 220]
The Project Gutenberg eBook of Dracula
 
