# Tokenization

_Tokenization_ is the process of converting a body of text into individual _tokens_, e.g., words and punctuation characters. This is the first step for most Natural Language Processing (NLP) tasks, including preparing data for training an LLM. Let's see how it's done!

## Some sample text

In [4]:
text = "This is a test! Or is this not a test? Test it to be sure. :)"
print(text)
print(f"This sample text has {len(text)} characters.")

This is a test! Or is this not a test? Test it to be sure. :)
This sample text has 61 characters.


In [6]:
print(text.split())

['This', 'is', 'a', 'test!', 'Or', 'is', 'this', 'not', 'a', 'test?', 'Test', 'it', 'to', 'be', 'sure.', ':)']


In [2]:
import re

In [10]:
tokens = re.split(r'([.?!:()]|\s)', text)
tokens = [item for item in tokens if item.split() ]
tokens= sorted(list(set(tokens)))
print(tokens)




['!', ')', '.', ':', '?', 'Or', 'Test', 'This', 'a', 'be', 'is', 'it', 'not', 'sure', 'test', 'this', 'to']


In [11]:
vocab = {token:index for index, token in enumerate(tokens)}
print(vocab.items())

dict_items([('!', 0), (')', 1), ('.', 2), (':', 3), ('?', 4), ('Or', 5), ('Test', 6), ('This', 7), ('a', 8), ('be', 9), ('is', 10), ('it', 11), ('not', 12), ('sure', 13), ('test', 14), ('this', 15), ('to', 16)])


In [12]:
vocab["Test"]

6

In [13]:
with open("thelostrace.txt", "r") as f:
    raw_text = f.read()
print(raw_text[:100])

Cororuc glanced about him and hastened his pace. He was no coward, but he did not like the place. Ta


In [17]:
tokens = re.split( r'([.?!:())"\'“”‘’^]|\s)', raw_text)
tokens = [item for item in tokens if item.split()]
tokens.extend(["<|unk|>", "<|endoftext|>"])
print(len(tokens))

5759


In [18]:
tokens = sorted(list(set(tokens)))
print(len(tokens))

1650


In [19]:
print(tokens[:20])


['!', '"', "'", '(', ')', '.', ':', '<|endoftext|>', '<|unk|>', '?', 'A', 'After', 'Again,', 'Ah,', 'Alba,', 'Alban', 'All', 'An', 'And', 'Are']


In [21]:
vocab= {token:index for index, token in enumerate(tokens)}
vocab.items()

dict_items([('!', 0), ('"', 1), ("'", 2), ('(', 3), (')', 4), ('.', 5), (':', 6), ('<|endoftext|>', 7), ('<|unk|>', 8), ('?', 9), ('A', 10), ('After', 11), ('Again,', 12), ('Ah,', 13), ('Alba,', 14), ('Alban', 15), ('All', 16), ('An', 17), ('And', 18), ('Are', 19), ('As', 20), ('At', 21), ('Backed', 22), ('Be', 23), ('Before', 24), ('BelgÃ¦', 25), ('Beneath', 26), ('Berula,', 27), ('Britain', 28), ('Britain,', 29), ('Britainâ€', 30), ('Briton', 31), ('Briton,', 32), ('Britons,', 33), ('Britons;', 34), ('Buruc', 35), ('But', 36), ('Caledonia,', 37), ('Caves', 38), ('Celt', 39), ('Celt,', 40), ('Celt;', 41), ('Celtic', 42), ('Celtic,', 43), ('Celts', 44), ('Close', 45), ('Cold', 46), ('Come,', 47), ('Cornish', 48), ('Cornishmen', 49), ('Cornwall', 50), ('Cororuc', 51), ('Cororuc,', 52), ('Cruel,', 53), ('Doubtless', 54), ('Down', 55), ('Erin,', 56), ('Eternity', 57), ('Every', 58), ('Faint', 59), ('Farther', 60), ('For', 61), ('From', 62), ('Gael,', 63), ('Gaelic', 64), ('Gaels', 65), ('

In [22]:
first_line = ("Cororuc glanced about him and hastened his pace. He was no coward, but he did not like the place. Tall trees rose all about, their sullen branches shutting out the sunlight. The dim trail led in and out among them, sometimes skirting the edge of a ravine, where Cororuc could gaze down at the tree-tops beneath. Occasionally, through a rift in the forest, he could see away to the forbidding hills that hinted of the ranges much farther to the west, that were the mountains of Cornwall.")
first_line = re.split( r'([.?!:())"\'“”‘’^]|\s)', first_line)
first_line = [item for item in first_line if item.split()]
print(first_line)

['Cororuc', 'glanced', 'about', 'him', 'and', 'hastened', 'his', 'pace', '.', 'He', 'was', 'no', 'coward,', 'but', 'he', 'did', 'not', 'like', 'the', 'place', '.', 'Tall', 'trees', 'rose', 'all', 'about,', 'their', 'sullen', 'branches', 'shutting', 'out', 'the', 'sunlight', '.', 'The', 'dim', 'trail', 'led', 'in', 'and', 'out', 'among', 'them,', 'sometimes', 'skirting', 'the', 'edge', 'of', 'a', 'ravine,', 'where', 'Cororuc', 'could', 'gaze', 'down', 'at', 'the', 'tree-tops', 'beneath', '.', 'Occasionally,', 'through', 'a', 'rift', 'in', 'the', 'forest,', 'he', 'could', 'see', 'away', 'to', 'the', 'forbidding', 'hills', 'that', 'hinted', 'of', 'the', 'ranges', 'much', 'farther', 'to', 'the', 'west,', 'that', 'were', 'the', 'mountains', 'of', 'Cornwall', '.']


In [24]:
ids = [ vocab[token] for token in first_line]
print(ids)

[51, 705, 175, 781, 217, 749, 788, 1036, 5, 73, 1549, 987, 438, 357, 755, 486, 994, 876, 1399, 1065, 5, 139, 1466, 1166, 199, 176, 1400, 1361, 334, 1248, 1026, 1399, 1364, 5, 141, 489, 1458, 862, 811, 217, 1026, 213, 1402, 1291, 1266, 1399, 532, 1007, 171, 1117, 1579, 51, 429, 692, 503, 249, 1399, 1465, 296, 5, 104, 1426, 171, 1157, 811, 1399, 662, 755, 429, 1208, 256, 1439, 1399, 657, 779, 1398, 787, 1007, 1399, 1115, 965, 592, 1439, 1399, 1576, 1398, 1572, 1399, 961, 1007, 50, 5]


In [27]:
vocab ["Cororuc"]

51

In [None]:
rever_vocab = {index:token for token, index in vocab.items()}