
# **Preparation of Input Data for LLM**

# **Download Data**

Data preparation and sampling to get input data "ready" for the LLM

In [None]:
import os
import urllib.request

if not os.path.exists("business_11.txt"):
    url = ("https://storage.googleapis.com/kagglesdsdata/datasets/701505/1226200/business/business_11.txt?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20240618%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240618T074023Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=2a50233309ae50fdbceb07316abeca396bb3cba5cad1be5ef8517835b5a816738390e6ffe16f1bbc74b15811346339ea74135f03d565abdce9435c915ec7d10724d38432ed3135cb95744a088aa7557ae2f636f85e95ac8c3419fb63ea3b75222587206f0625b7f54c85ca3141597f4e641dd69348c6a8944849fe256a2248379c60f797100555cefdd3cf0fe735a5fed0d54df4638b4af0b7901089a685d3ef9c707d2fd9c406ccce1df5596402b1b0173b72201f7a6caf298cc001e7b26aa2d5adf4163ebadbf6935077e76c1c76720fc6936a4c295e0ad0ec59b842549e451ab6e79450f7c07bf05923a640f51759ffd77adfca2df10ac3ab8b03d229808b")
    file_path = "business_11.txt"  # same path that the check above and the next cell use
    urllib.request.urlretrieve(url, file_path)

In [None]:
with open("business_11.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total number of characters:", len(raw_text))
print(raw_text[:99])

Total number of characters: 3494
Saab to build Cadillacs in Sweden

General Motors, the world's largest car maker, has confirmed tha


# **Tokenizing data and converting tokens to IDs**

Tokenizing means breaking text into smaller units, such as individual words and punctuation characters. Each unique token is then added to the vocabulary in sorted order and mapped to an integer ID.


In [None]:
import re
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
vocab = {token:integer for integer,token in enumerate(all_words)}
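To make the vocabulary construction concrete, here is a minimal sketch that applies the same split pattern to a short made-up sentence. The sentence and the `toy_` names are illustrative and are not taken from the dataset.

```python
import re

toy_text = "Hello, world. Is this a test?"

# Split on punctuation, double dashes, and whitespace, keeping the separators
toy_tokens = re.split(r'([,.:;?_!"()\']|--|\s)', toy_text)
# Drop empty strings and whitespace-only items
toy_tokens = [item.strip() for item in toy_tokens if item.strip()]
print(toy_tokens)

# sorted() orders strings by Unicode code point, so punctuation and
# capitalized words come before lowercase words
toy_vocab = {token: idx for idx, token in enumerate(sorted(set(toy_tokens)))}
print(toy_vocab)
```

Because the order follows code points rather than dictionary order, "Is" receives a smaller ID than "a" in this toy vocabulary.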


Below is a simple class that performs tokenization and converts tokens to IDs and vice versa.

<li> The encode function turns text into token IDs </li>
<li> The decode function turns token IDs back into text </li>

In [None]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)  # same pattern used to build the vocabulary
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

In [None]:
tokenizer = SimpleTokenizerV1(vocab)

text = "Bringing Cadillac production to Sweden should help introduce desperately-needed scale to the Saab factory, which currently produces fewer than 130,000 cars per year."
ids = tokenizer.encode(text)
print(ids)
tokenizer.decode(ids)

[26, 28, 228, 274, 55, 252, 163, 171, 120, 244, 274, 271, 53, 139, 6, 294, 115, 227, 143, 269, 11, 6, 9, 99, 219, 301, 8]


'Bringing Cadillac production to Sweden should help introduce desperately-needed scale to the Saab factory, which currently produces fewer than 130, 000 cars per year.'

# **Adding Special Context Tokens**
It's useful to add some "special" tokens for unknown words and to denote the end of a text.
Some of these special tokens are:

<li> [BOS] (beginning of sequence) marks the beginning of a text </li>
<li> [EOS] (end of sequence) marks where a text ends (this is usually used to concatenate multiple unrelated texts, e.g., two different Wikipedia articles or two different books) </li>
<li> [PAD] (padding) fills shorter texts up to the length of the longest one when training with a batch size greater than 1, so that all texts in a batch have equal length </li>
<li> [UNK] represents words that are not included in the vocabulary </li>
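
As a sketch of how an end-of-sequence marker and padding might be applied to a batch, the snippet below uses made-up token IDs. `PAD_ID`, `EOS_ID`, and `pad_batch` are hypothetical names chosen for illustration and do not correspond to the vocabulary built in this notebook.

```python
PAD_ID = 0  # hypothetical ID for [PAD]
EOS_ID = 1  # hypothetical ID for [EOS]

def pad_batch(sequences, pad_id=PAD_ID, eos_id=EOS_ID):
    """Append eos_id to each sequence, then right-pad to the longest length."""
    with_eos = [seq + [eos_id] for seq in sequences]
    longest = max(len(seq) for seq in with_eos)
    return [seq + [pad_id] * (longest - len(seq)) for seq in with_eos]

# Three token-ID sequences of different lengths
batch = pad_batch([[5, 6, 7], [8, 9], [10]])
print(batch)
```

After padding, every row has the same length, which is what allows the batch to be stacked into a single tensor.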

In [None]:
preprocessed = re.split(r'([,.?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]

all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {token:integer for integer,token in enumerate(all_tokens)}

In [None]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = { i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            item if item in self.str_to_int
            else "<|unk|>" for item in preprocessed
        ]

        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text


In [None]:

tokenizer = SimpleTokenizerV2(vocab)

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

text = " <|endoftext|> ".join((text1, text2))

print(text)

tokenizer.encode(text)
tokenizer.decode(tokenizer.encode(text))

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


'<|unk|>, do <|unk|> <|unk|> <|unk|> <|unk|> <|endoftext|> In the <|unk|> <|unk|> of the <|unk|>.'

# **Byte Pair Encoding**

GPT-2 used byte pair encoding (BPE) as its tokenizer. BPE allows the model to break down words that aren't in its predefined vocabulary into smaller subword units or even individual characters, enabling it to handle out-of-vocabulary words.
For instance, if GPT-2's vocabulary doesn't have the word "unfamiliarword", it might tokenize it as ["unfam", "iliar", "word"] or some other subword breakdown, depending on its trained BPE merges.
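
The idea of "trained BPE merges" can be illustrated with a toy training loop. This is a simplified character-level sketch under stated assumptions, not GPT-2's actual byte-level tokenizer; the corpus and the helper names `most_frequent_pair` and `merge_pair` are made up for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the adjacent pair with a single merged symbol."""
    merged_words = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_words.append(out)
    return merged_words

# Toy corpus: each word starts as a list of single characters
words = [list("lower"), list("lowest"), list("low")]
for _ in range(2):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print(pair, words)
```

After two merges the frequent prefix "low" has become a single symbol, while the rarer endings stay split into characters. An unseen word would be encoded by applying the learned merges in the same order.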

In [None]:
import importlib.metadata
!pip install tiktoken
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.7.0


In [None]:
import torch
from torch.utils.data import Dataset, DataLoader


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the text into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
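
The sliding-window logic in `GPTDatasetV1` can be seen in isolation with plain integers standing in for token IDs. This is a minimal sketch; `sliding_windows` and `toy_ids` are illustrative names that do not appear elsewhere in the notebook.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs where each target is its input shifted by one."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i:i + max_length]
        targets = token_ids[i + 1:i + max_length + 1]
        pairs.append((inputs, targets))
    return pairs

toy_ids = list(range(10))  # stand-in for real token IDs
windows = sliding_windows(toy_ids, max_length=4, stride=2)
for inputs, targets in windows:
    print(inputs, "->", targets)
```

With `stride` smaller than `max_length`, consecutive windows overlap. Setting `stride` equal to `max_length` would produce non-overlapping chunks.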

In [None]:
def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

In [None]:
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=8, stride=1, shuffle=False)

In [None]:
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
tokenizer = tiktoken.get_encoding("gpt2")
for input_ids, target_ids in zip(inputs.tolist(), targets.tolist()):
    print(tokenizer.decode(input_ids), " ------> ", tokenizer.decode(target_ids))

Saab to build Cadillacs in  ------>  ab to build Cadillacs in Sweden
ab to build Cadillacs in Sweden  ------>   to build Cadillacs in Sweden

 to build Cadillacs in Sweden
  ------>   build Cadillacs in Sweden


 build Cadillacs in Sweden

  ------>   Cadillacs in Sweden

General
 Cadillacs in Sweden

General  ------>  illacs in Sweden

General Motors
illacs in Sweden

General Motors  ------>  acs in Sweden

General Motors,
acs in Sweden

General Motors,  ------>   in Sweden

General Motors, the
 in Sweden

General Motors, the  ------>   Sweden

General Motors, the world


# **Creating token embeddings**
The data is now almost ready for an LLM.
As a last step, we embed the tokens in a continuous vector representation using an embedding layer.
Usually, these embedding layers are part of the LLM itself and are updated (trained) during model training.

In [None]:

vocab_size = 50257
output_dim = 256

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

torch.Size([8, 8, 256])
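
Conceptually, an embedding layer is a lookup table with one row per token ID. The plain-Python sketch below mimics that lookup with a small random table. `toy_vocab_size`, `toy_dim`, `weights`, and `embed` are illustrative names, and this is not how `torch.nn.Embedding` is implemented internally.

```python
import random

random.seed(0)
toy_vocab_size, toy_dim = 6, 3

# One randomly initialized row per token ID (these would be trainable weights in a real model)
weights = [[random.gauss(0.0, 1.0) for _ in range(toy_dim)]
           for _ in range(toy_vocab_size)]

def embed(batch_ids):
    """Replace every token ID with its row from the weight table."""
    return [[weights[token_id] for token_id in seq] for seq in batch_ids]

batch_ids = [[2, 5, 2], [0, 1, 4]]
embedded = embed(batch_ids)

# Dimensions: batch size, tokens per sequence, embedding size
print(len(embedded), len(embedded[0]), len(embedded[0][0]))
```

Token ID 2 appears twice in the first sequence and maps to the same row both times, which is why position information has to be added separately.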


# **Creating Positional Embedding**
<li> The token embedding layer converts each token ID into the same vector regardless of where the token is located in the input sequence </li>
<li> Positional embeddings are combined with the token embedding vectors to form the input embeddings for a large language model </li>
<li> The BytePair encoder has a vocabulary size of 50,257, which is the vocab_size used for the token embedding layer above </li>


In [None]:
context_length = 8
max_length = context_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print(pos_embeddings)

tensor([[ 0.5294,  0.5202,  0.1852,  ..., -1.2792,  0.9501,  0.6760],
        [-2.1529,  0.1797, -0.6630,  ..., -0.0313,  0.6326,  0.4746],
        [ 0.8169,  0.6653, -0.2359,  ...,  0.9724,  1.1933,  0.4391],
        ...,
        [ 1.4240, -0.2753, -0.2177,  ..., -1.3997,  1.0861, -0.0760],
        [ 0.3672, -0.4940,  0.3362,  ..., -1.3847,  0.0193, -0.2987],
        [ 1.9855,  0.9215,  0.0335,  ..., -1.1825,  0.2851,  1.3173]],
       grad_fn=<EmbeddingBackward0>)


# **Creating Input Embedding**

In [None]:
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

torch.Size([8, 8, 256])
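
The addition above relies on broadcasting: `pos_embeddings` has shape `[8, 256]` and is added to each of the 8 sequences in the `[8, 8, 256]` token embedding tensor. The following plain-Python sketch spells that out with nested lists and made-up numbers; `add_positions` is a hypothetical helper, not a PyTorch function.

```python
def add_positions(token_vectors, position_vectors):
    """Add the vector for position p to the token vector at position p of every sequence."""
    return [
        [[t + p for t, p in zip(tok_vec, pos_vec)]
         for tok_vec, pos_vec in zip(seq, position_vectors)]
        for seq in token_vectors
    ]

# Two sequences of two tokens each, embedding size 2 (made-up numbers)
token_vectors = [
    [[1.0, 1.0], [1.0, 1.0]],  # the same token twice
    [[2.0, 0.0], [0.0, 2.0]],
]
position_vectors = [[0.1, 0.2], [0.3, 0.4]]  # one vector per position

combined = add_positions(token_vectors, position_vectors)
print(combined[0])
```

In the first sequence the two identical token vectors end up different after the addition, so the model can tell the first occurrence from the second.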
