# Working with text data

Here we will look into:
- Preparing text for LLM training.
- Splitting text into word and subword tokens.
- Byte pair encoding.
- Sampling training examples using a sliding window approach.
- Converting tokens into vectors.

For learning purposes, we will work with the text of a short story by Edith Wharton called "The Verdict."

# Load data

In [1]:
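import os
import urllib.request

url = ("https://raw.githubusercontent.com/rasbt/"
       "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
       "the-verdict.txt")
# urlretrieve does not create missing directories, so make sure ./data exists.
os.makedirs('./data', exist_ok=True)
urllib.request.urlretrieve(url, './data/the-verdict.txt')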

('./data/the-verdict.txt', <http.client.HTTPMessage at 0x10654ed30>)

In [10]:
# Read the file.

with open('./data/the-verdict.txt', 'r', encoding='utf-8') as f_in:
    raw_text = f_in.read()

print(f'Num characters: {len(raw_text):,}')
print(f'Num words: {len(raw_text.split(" ")):,}')

Num characters: 20,479
Num words: 3,552


# Tokenizing text

We cannot feed raw text directly into the Transformer model. The text first has to be tokenized. The tokens are mapped to integer IDs, and the IDs are converted to embeddings, which are what the model takes as input.

More specifically: `input text --> tokenized text --> token IDs --> token embeddings`
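To make the pipeline concrete, here is a minimal sketch of the four stages on a toy sentence. It assumes a plain whitespace split for tokenization and a randomly initialized lookup table for the embeddings. The sentence, the `embedding_dim` value, and the `embedding_table` name are made up for illustration. In a real model the embedding table is learned during training.

```python
import random

text = "the cat sat on the mat"

# Stage 1: input text --> tokenized text (whitespace split is a simplification).
tokens = text.split()

# Stage 2: tokenized text --> token IDs, using a vocabulary built from the unique tokens.
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
token_ids = [vocab[tok] for tok in tokens]

# Stage 3: token IDs --> token embeddings, via a toy lookup table with one row per vocabulary entry.
random.seed(0)
embedding_dim = 4  # arbitrary size, chosen only for illustration
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)]
    for _ in vocab
]
embeddings = [embedding_table[i] for i in token_ids]

print(tokens)
print(token_ids)
print(len(embeddings), len(embeddings[0]))
```

Note that both occurrences of "the" get the same ID and therefore the same embedding vector.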

Some key notes:
- It's better not to modify the capitalization of the text, because capitalization helps the LLM distinguish between different kinds of nouns, understand sentence structure, and generate text with proper capitalization.
- Simply splitting the text by word is not enough. We also want to separate out punctuation.
- Whether or not to remove whitespace characters is an important decision. You probably don't want to remove them when the structure of the input matters, for example in Python code.
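The notes on punctuation and whitespace can be illustrated with a short sketch. It uses the same regular expression as the tokenizer in this section. The sample sentence and the variable names are made up for illustration.

```python
import re

sample = "Hello, world. Is this-- a test?"

# Naive split on whitespace: punctuation stays glued to the words.
naive = sample.split()

# Split on punctuation, double dashes, and whitespace.
# The capture group makes re.split keep the delimiters in the result.
pattern = r'([,.:;?_!"()\']|--|\s)'
pieces = re.split(pattern, sample)

# Option A: drop empty strings and whitespace-only tokens.
no_ws = [p for p in pieces if p.strip()]

# Option B: keep whitespace tokens and drop only the empty strings.
with_ws = [p for p in pieces if p != ""]

print(naive)
print(no_ws)
print(with_ws)
```

With option B, `"".join(with_ws)` reproduces the input exactly, which is the property you want when whitespace carries structure. Option A gives a smaller vocabulary and shorter sequences, but the original spacing cannot be recovered.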

In [13]:
import re

class SimpleTokenizer:
    def __init__(self, vocab: dict = None):
        self.str_to_int = None
        self.int_to_str = None
        
        if vocab:
            self.str_to_int = vocab
            self.int_to_str = {i:s for s,i in vocab.items()}
        
    def build_vocab(self, text: str) -> None:
        exp = r'([,.:;?_!"()\']|--|\s)'
        res = re.split(exp, text)
        res = [x for x in res if x.strip()]
        tokens = sorted(set(res))
        vocab = {token: i for i, token in enumerate(tokens)}
        
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}
        
    def encode(self, text: str) -> list[int]:
        """Covert text to token ids."""
        exp = r'([,.:;?_!"()\']|--|\s)'
        res = re.split(exp, text)
        res = [x for x in res if x.strip()]
        ids = [self.str_to_int[i] for i in res]
        
        return ids
    
    def decode(self, ids: list[int]) -> str:
        """Convert token ids to text."""
        tokens = [self.int_to_str[i] for i in ids]
        text = ' '.join(tokens)
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        
        return text
        
        

In [22]:
tokenizer = SimpleTokenizer()
tokenizer.build_vocab(raw_text)

text = "It's the last he painted, you know."
ids = tokenizer.encode(text)
print(ids)

decoded_text = tokenizer.decode(ids)
print(decoded_text)

[56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 7]
It' s the last he painted, you know.
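The decoded text reads `It' s` instead of `It's`. The sketch below reproduces the decoding step in isolation to show why. The token list is written out by hand, assuming the split that the regular expression in `encode` produces for this sentence.

```python
import re

# encode() splits "It's" into three tokens: "It", "'", and "s".
tokens = ["It", "'", "s", "the", "last", "he", "painted", ",", "you", "know", "."]

# decode() first joins all tokens with single spaces ...
joined = " ".join(tokens)

# ... and then removes whitespace that comes BEFORE punctuation.
# Whitespace AFTER the apostrophe is left in place.
decoded = re.sub(r'\s+([,.?!"()\'])', r'\1', joined)

print(joined)
print(decoded)
```

Because the substitution only strips the space in front of a punctuation mark, the space between the apostrophe and `s` survives, so the round trip through this simple tokenizer is not lossless.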
