# Working with text data

Here we will look into:
- Preparing text for LLM training.
- Splitting text into word and subword tokens.
- Byte pair encoding.
- Sampling training examples using a sliding window approach.
- Converting tokens into vectors.

For learning purposes, we will work with the text of "The Verdict," a short story by Edith Wharton.

# Load data

In [3]:
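import os
import urllib.request

url = ("https://raw.githubusercontent.com/rasbt/"
       "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
       "the-verdict.txt")
# Make sure the target directory exists before downloading.
os.makedirs('./data', exist_ok=True)
urllib.request.urlretrieve(url, './data/the-verdict.txt')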

('./data/the-verdict.txt', <http.client.HTTPMessage at 0x10bca0430>)

In [24]:
# Read the file.

with open('./data/the-verdict.txt', 'r', encoding='utf-8') as f_in:
    raw_text = f_in.read()

print(f'Num characters: {len(raw_text):,}')
print(f'Num words: {len(raw_text.split(" ")):,}')

Num characters: 20,479
Num words: 3,552
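
Note that `split(" ")` splits only on single space characters, so the word count above is approximate: words separated by a newline stay joined, and consecutive spaces produce empty strings. The sketch below shows the difference on a made-up sample string (not the story text); `sample`, `by_space`, and `by_any_ws` are names chosen for this illustration.

```python
sample = "one two\nthree  four"

by_space = sample.split(" ")  # splits on each single space only
by_any_ws = sample.split()    # splits on any run of whitespace

print(by_space)   # 'two\nthree' stays joined and '' appears between the two spaces
print(by_any_ws)  # four separate words
```

If an exact word count matters, `split()` with no argument is usually the safer choice.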


# Tokenizing text

We cannot feed raw text directly into a Transformer model. The text first has to be tokenized. The tokens are then converted to embeddings, which can be passed as input to the model.

More specifically: `input text --> tokenized text --> token IDs --> token embeddings`
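
As a rough illustration of this pipeline, the sketch below walks one short sentence through each stage using only the standard library. The names `vocab`, `embedding_table`, and `EMBED_DIM`, the naive whitespace split, and the random vectors are all assumptions made for this example. In a real model the embedding vectors are learned during training rather than drawn at random.

```python
import random

text = "the cat sat on the mat"

# 1. input text -> tokenized text (naive whitespace split, for illustration only)
tokens = text.split()

# 2. tokenized text -> token IDs (one integer per unique token, sorted for determinism)
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
token_ids = [vocab[token] for token in tokens]

# 3. token IDs -> token embeddings (random placeholder vectors, one row per vocabulary entry)
EMBED_DIM = 4  # hypothetical embedding size
rng = random.Random(0)
embedding_table = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in vocab]
embeddings = [embedding_table[i] for i in token_ids]

print(tokens)
print(token_ids)
print(len(embeddings), len(embeddings[0]))
```

Both occurrences of `the` map to the same ID and therefore to the same row of the embedding table.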

Some key notes:
- It's better not to change the capitalization of the text, because case helps the LLM distinguish proper nouns from common nouns, understand sentence structure, and generate text with proper capitalization.
- Simply splitting the text by word is not enough. We also want to separate out punctuation.
- Whether or not to remove whitespace characters is an important decision. You probably don't want to remove them when the structure of the input matters, for example in Python code.
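
To see why whitespace can matter, the sketch below tokenizes a small Python snippet twice: once discarding whitespace tokens and once keeping them. The helper `split_tokens` and its `keep_whitespace` flag are hypothetical names introduced for this illustration.

```python
import re

def split_tokens(text, keep_whitespace=False):
    # Split on punctuation, double dashes, and whitespace, keeping the delimiters.
    parts = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    if keep_whitespace:
        # Drop only the empty strings that re.split produces between adjacent delimiters.
        return [p for p in parts if p != '']
    # Drop empty strings and whitespace-only tokens.
    return [p for p in parts if p.strip()]

code = "if x:\n    y = 1"

without_ws = split_tokens(code)
with_ws = split_tokens(code, keep_whitespace=True)

print(without_ws)
print(with_ws)
print(''.join(with_ws) == code)
```

With whitespace removed, the newline and the indentation that define the block structure are lost. With whitespace kept, joining the tokens reproduces the original string exactly, at the cost of a longer token sequence.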

In [37]:
import re

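def tokenize_text(text):
    # Split on punctuation, double dashes, and whitespace. The capture
    # group keeps the delimiters in the result.
    exp = r'([,.:;?_!"()\']|--|\s)'
    res = re.split(exp, text)
    # Drop empty strings and whitespace-only tokens.
    res = [x for x in res if x.strip()]

    return res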

example_text = 'Hello world. My name is Steve--!'
tokenize_text(example_text)

['Hello', 'world', '.', 'My', 'name', 'is', 'Steve', '--', '!']

In [38]:
tokenized_text = tokenize_text(raw_text)  # tokenize the full story
tokenized_text[:10]

['I',
 'HAD',
 'always',
 'thought',
 'Jack',
 'Gisburn',
 'rather',
 'a',
 'cheap',
 'genius']