
# Understanding LLM Input Data

In [None]:
!pip install torch
!pip install tiktoken


In [None]:


from importlib.metadata import version
print("torch version:", version("torch"))
print("tiktoken version:", version("tiktoken"))

<br>
<br>
<br>
<br>

# Tokenizing text

- In this section, we tokenize text, which means breaking text into smaller units, such as individual words and punctuation characters

- Load the raw text we want to work with
- [The Verdict by Edith Wharton](https://en.wikisource.org/wiki/The_Verdict) is a public domain short story

In [None]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
    
print("Total number of characters:", len(raw_text))
print(raw_text[:99])

- The goal is to tokenize and embed this text for an LLM
- Let's develop a simple tokenizer step by step, starting with a regular expression that splits the text into tokens

- The following regular expression splits the text on whitespace and punctuation

In [None]:
import re

preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
# Strip each item and drop empty or whitespace-only entries
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:38])

In [None]:
print("Number of tokens:", len(preprocessed))
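
- To see what the split does in isolation, here is a minimal sketch on a made-up sentence (the `sample` string is an assumption, not part of the story). The capturing group makes `re.split` keep the delimiters, and this sketch filters on `item.strip()` to drop empty and whitespace-only items:

```python
import re

# Hypothetical sample sentence, used only to illustrate the split
sample = "Hello, world. Is this-- a test?"

# The capturing group makes re.split keep the delimiters as list items
pieces = re.split(r'([,.:;?_!"()\']|--|\s)', sample)

# Drop empty strings and whitespace-only items
tokens = [item.strip() for item in pieces if item.strip()]
print(tokens)
```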

<br>
<br>
<br>
<br>

# Converting tokens into token IDs

- Next, we convert the text tokens into token IDs that we can process via embedding layers later
- For this we first need to build a vocabulary

- The vocabulary contains the unique tokens (words and punctuation marks) in the input text, sorted alphabetically

In [None]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print(vocab_size)

In [51]:
vocab = {token:integer for integer,token in enumerate(all_words)}

- Below are the first 50 entries in this vocabulary:

In [None]:
# Print the first 50 (token, ID) pairs
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 49:
        break

- Below, we illustrate the tokenization of a short sample text using a small vocabulary:
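
- A minimal sketch of this idea, using one made-up sentence as the whole corpus (the sentence and the name `small_vocab` are illustrative assumptions):

```python
import re

# Hypothetical one-sentence corpus for a tiny vocabulary
sample = "the quick brown fox jumps over the lazy dog."

pieces = re.split(r'([,.:;?_!"()\']|--|\s)', sample)
tokens = [t.strip() for t in pieces if t.strip()]

# Sort the unique tokens and number them to get a token -> ID mapping
small_vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
print(small_vocab)

# Look up the IDs for a new sequence made only of known tokens
new_tokens = ["the", "lazy", "fox", "."]
ids = [small_vocab[t] for t in new_tokens]
print(ids)
```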

- Let's now put it all together into a tokenizer class

In [53]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}
    
    def encode(self, text):
        # Split on punctuation, double dashes, and whitespace (same pattern as used for the vocabulary)
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        # Drop empty strings and whitespace-only items
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the spaces that the join inserted before punctuation marks
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

- The `encode` function turns text into token IDs
- The `decode` function turns token IDs back into text

- We can use the tokenizer to encode (that is, tokenize) text into integers
- These integers can later be embedded and used as input to the LLM

In [None]:
tokenizer = SimpleTokenizerV1(vocab)

text = """"It's the last he painted, you know," 
           Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

- We can decode the integers back into text

In [None]:
tokenizer.decode(ids)

In [None]:
tokenizer.decode(tokenizer.encode(text))
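
- The round trip is not exact: `decode` joins the tokens with single spaces and then removes the space before punctuation, so the original line breaks and indentation are lost. Below is a minimal sketch of that cleanup step on a hypothetical token list, followed by the lookup failure that a dictionary-based vocabulary produces for a word it has never seen (the tiny `vocab_demo` dictionary is made up for illustration):

```python
import re

# Hypothetical tokens, as they would come out of the ID -> string lookups
tokens = ['"', 'It', "'", 's', 'the', 'last', ',', 'you', 'know', '.', '"']

joined = " ".join(tokens)
# Remove the whitespace that the join placed before punctuation marks
cleaned = re.sub(r'\s+([,.:;?!"()\'])', r'\1', joined)
print(cleaned)

# A plain dictionary lookup fails for tokens outside the vocabulary
vocab_demo = {"the": 0, "last": 1}
missing = None
try:
    [vocab_demo[t] for t in ["the", "first"]]
except KeyError as err:
    missing = err.args[0]
    print("Unknown token:", missing)
```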

<br>
<br>
<br>
<br>

# BytePair encoding

GPT-2 uses BytePair Encoding (BPE) as its tokenizer. BPE breaks words that aren't in its predefined vocabulary into smaller subword units or individual characters, which lets the model handle out-of-vocabulary words. For example, if the vocabulary doesn’t contain the word “unfamiliarword,” the tokenizer might split it into subwords like ["unfam", "iliar", "word"], depending on its trained BPE merges.
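
The merge idea behind BPE can be illustrated with a toy sketch. The code below is not GPT-2's tokenizer, which works on bytes and uses a fixed, pretrained merge table. It is a simplified character-level version on a made-up corpus, with the hypothetical helper names `count_pairs` and `merge_pair`. It repeatedly finds the most frequent adjacent pair of symbols and merges it into a single symbol:

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs across all words (each word is a list of symbols)."""
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged_words = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_words.append(out)
    return merged_words

# Made-up toy corpus; every word starts as a list of single characters
words = [list(w) for w in ["low", "lower", "lowest", "slow"]]

merges = []
for _ in range(2):  # learn two merges
    pairs = count_pairs(words)
    best = max(pairs, key=pairs.get)  # on ties, the pair counted first wins
    merges.append(best)
    words = merge_pair(words, best)

print(merges)
print(words)
```

After two merges, the frequent fragment "low" has become a single symbol, while the rarer endings stay split into characters. That is the behavior that lets a trained BPE vocabulary represent unseen words as sequences of known pieces.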

The original BPE tokenizer can be found in OpenAI’s GPT-2 encoder code. In this session, however, we use the BPE tokenizer from tiktoken, OpenAI's open-source library. Its core algorithms are written in Rust, which makes it considerably faster. In my own comparison, tiktoken was roughly 3x faster than the original GPT-2 tokenizer and 6x faster than the equivalent tokenizer in Hugging Face.


In [57]:
# pip install tiktoken

In [None]:
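import importlib.metadata  # import the submodule explicitly; a bare "import importlib" does not guarantee it is loaded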
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))

In [59]:
tokenizer = tiktoken.get_encoding("gpt2")

In [None]:
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces "
    "of someunknownPlace."
)

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

In [None]:
strings = tokenizer.decode(integers)

print(strings)

- BPE tokenizers break down unknown words into subwords and individual characters:

In [None]:
tokenizer.encode("Akwirw ier", allowed_special={"<|endoftext|>"})

<br>
<br>
<br>
<br>

# Data sampling with a sliding window

Earlier, we handled tokenization, which converts text into tokens and represents them as token IDs. Now, let’s shift focus to how we prepare the data loading for large language models (LLMs). Since LLMs are trained to generate one token (roughly, one word or subword) at a time, we structure the training data so that the next token in a sequence becomes the target the model has to predict. This way, the model learns the context and patterns it needs to generate coherent, contextually relevant output.

In [None]:
from supplementary import create_dataloader_v1


dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)
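
The `supplementary` module is not shown in this section. As a minimal sketch of the sliding-window idea, assume the data loader pairs each window of `max_length` token IDs with the same window shifted one position to the right, and advances the window start by `stride`. A plain-Python version, with the hypothetical helper `sliding_windows` and made-up token IDs, could look like this:

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (inputs, targets) lists; each target is its input shifted by one token."""
    inputs, targets = [], []
    # Stop early enough that the shifted target window still fits
    for start in range(0, len(token_ids) - max_length, stride):
        inputs.append(token_ids[start:start + max_length])
        targets.append(token_ids[start + 1:start + max_length + 1])
    return inputs, targets

# Made-up token IDs standing in for an encoded text
token_ids = list(range(10, 20))  # 10, 11, ..., 19

inputs, targets = sliding_windows(token_ids, max_length=4, stride=4)
for x, y in zip(inputs, targets):
    print(x, "->", y)
```

With `stride` equal to `max_length`, the input windows do not overlap. A smaller stride would produce overlapping windows and therefore more training pairs from the same text.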