<a href="https://colab.research.google.com/github/swethag04/ml-projects/blob/main/nlp/text_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
import re

In [4]:
# reading a short story
with open('/content/sample_data/verdict.txt', 'r', encoding='utf-8') as f:
  raw_text = f.read()
print("Total number of characters: ", len(raw_text))

Total number of characters:  20479


In [5]:
print(raw_text[:99])

I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


In [6]:
# split on punctuation and whitespace so that words and punctuation marks are separate list entries
preprocessed = re.split(r'([,.?_!"()\']|--|\s)', raw_text)

# strip whitespace and drop empty entries
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(len(preprocessed))

4649


In [7]:
print(preprocessed[:50])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in', 'the', 'height', 'of', 'his', 'glory', ',', 'he', 'had', 'dropped', 'his', 'painting', ',', 'married', 'a', 'rich', 'widow', ',', 'and', 'established', 'himself']


**Converting tokens into token IDs** <br>
The next step is to convert the tokens from strings to integer representations called token IDs. This is an intermediate step before converting the token IDs into embedding vectors. <br><br>
To do this, duplicate tokens are removed and the remaining unique tokens are sorted alphabetically. Each unique token is then mapped to a unique integer.



In [8]:
# create vocab from unique tokens
all_words = sorted(list(set(preprocessed)))
vocab_size = len(all_words)
print(f"Vocab size: {vocab_size}")

Vocab size: 1159


In [9]:
# print the first 17 entries of the vocab (the loop stops after index 16)
vocab = {token:integer for integer , token in enumerate(all_words)}
for i, item in enumerate(vocab.items()):
  print(item)
  if i>15:
    break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)


In [10]:
# Implementing a simple text tokenizer
class SimpleTokenizerV1:
  def __init__(self, vocab):
    self.str_to_int = vocab
    # Inverse vocab
    self.int_to_str = {i:s for s,i in vocab.items()}

  def encode(self, text):
    """ Process input text into token IDs"""
    preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
    preprocessed = [item.strip() for item in preprocessed if item.strip()]
    ids = [self.str_to_int[s] for s in preprocessed]
    return ids

  def decode(self, ids):
    """Convert token ids back to text"""
    text = " ".join([self.int_to_str[i] for i in ids])
    # remove whitespace before the specified punctuation marks
    text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
    return text

In [11]:
# Instantiate a new tokenizer object
tokenizer = SimpleTokenizerV1(vocab)

text = """It's  the last he painted, you know, Mrs. Gisburn said
          with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[58, 2, 872, 1013, 615, 541, 763, 5, 1155, 608, 5, 69, 7, 39, 873, 1136, 773, 812, 7]


In [12]:
tokenizer.decode(ids)

"It' s the last he painted, you know, Mrs. Gisburn said with pardonable pride."

In [13]:
# Using tokenizer on a new text
text = " Hello, do you like tea?"
tokenizer.encode(text)

KeyError: 'Hello'

The error occurred because the word "Hello" is not in the vocab. Our current tokenizer needs to be modified to handle unknown words.

We add an `<|unk|>` token to represent new and unknown words that were not part of the training data and hence not part of the vocab. We also add an `<|endoftext|>` token that separates two unrelated text sources. When multiple independent documents are concatenated for training, this token helps the LLM understand that they are in fact unrelated.



In [14]:
# Modifying vocab to include the 2 special tokens
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token:integer for integer, token in enumerate(all_tokens)}
print(len(vocab.items()))

1161


In [15]:
for i, item in enumerate(list(vocab.items())[-5:]):
  print(item)

('younger', 1156)
('your', 1157)
('yourself', 1158)
('<|endoftext|>', 1159)
('<|unk|>', 1160)


In [16]:
# Tokenizer V2
class SimpleTokenizerV2:
  def __init__(self, vocab):
    self.str_to_int = vocab
    # Inverse vocab
    self.int_to_str = {i:s for s,i in vocab.items()}

  def encode(self, text):
    """ Process input text into token IDs"""
    preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
    preprocessed = [item.strip() for item in preprocessed if item.strip()]
    preprocessed = [item if item in self.str_to_int
                    else "<|unk|>" for item in preprocessed ]
    ids = [self.str_to_int[s] for s in preprocessed]
    return ids

  def decode(self, ids):
    """Convert token ids back to text"""
    text = " ".join([self.int_to_str[i] for i in ids])
    # remove whitespace before the specified punctuation marks
    text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
    return text

In [17]:
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = "<|endoftext|> ".join((text1, text2))
print(text)

Hello, do you like tea?<|endoftext|> In the sunlit terraces of the palace.


In [18]:
tokenizer = SimpleTokenizerV2(vocab)
print(tokenizer.encode(text))

[1160, 5, 362, 1155, 642, 1000, 10, 1159, 57, 1013, 981, 1009, 738, 1013, 1160, 7]


The list of token IDs contains 1159 for the `<|endoftext|>` separator token as well as two 1160 tokens, which stand for the unknown words "Hello" and "palace".

In [19]:
print(tokenizer.decode(tokenizer.encode(text)))

<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.


**Byte pair encoding** <br>
The BPE tokenizer was used to train LLMs such as GPT-2 and GPT-3. It breaks down words that are not in its predefined vocabulary into smaller subword units or even individual characters, which lets it handle out-of-vocabulary words. Thus it can parse any word and does not need to replace unknown words with special tokens such as `<|unk|>`. <br> <br>
Implementing BPE is relatively complicated, so we will use an existing open source library called tiktoken, which implements the BPE algorithm very efficiently.
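
To give a feel for what BPE does, here is a minimal sketch of the merge-learning loop on a toy corpus. It is a simplified illustration and not tiktoken's implementation: the corpus, the `num_merges` value, and the helper names `count_pairs` and `merge_pair` are assumptions made for this example, and the GPT-2 encoding additionally works on bytes rather than characters.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus (hypothetical): every word starts as a tuple of single characters
corpus = ["low", "low", "lower", "lowest", "new", "newer"]
words = dict(Counter(tuple(w) for w in corpus))

num_merges = 3  # arbitrary choice for this sketch
merges = []
for _ in range(num_merges):
    pairs = count_pairs(words)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    words = merge_pair(words, best)
    merges.append(best)

print(merges)
print(list(words))
```

After a few merges the frequent word "low" becomes a single symbol, while rarer words remain split into smaller pieces. A word never seen during merge learning can still be represented by falling back to those smaller pieces.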


In [20]:
#!pip install tiktoken

In [21]:
import importlib.metadata  # the metadata submodule must be imported explicitly
import tiktoken
print(f"tiktoken version: {importlib.metadata.version('tiktoken')}")

tiktoken version: 0.6.0


In [22]:
# Instantiate BPE tokenizer from tiktoken
tokenizer = tiktoken.get_encoding("gpt2")

In [23]:
# Use the BPE tokenizer to encode the text
text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace."
integers = tokenizer.encode(text, allowed_special = {"<|endoftext|>"})
print(integers)


[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271, 13]


In [24]:
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.


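From the BPE token IDs, we can see that `<|endoftext|>` is assigned a relatively large token ID (50256, the highest ID in the 50,257-token GPT-2 vocabulary). Also, the BPE tokenizer encodes and decodes unknown words such as "someunknownPlace" correctly: the word is split into three subword tokens (617, 34680, 27271) instead of being replaced by a placeholder.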

In [25]:
# Tokenize the whole verdict story
with open('/content/sample_data/verdict.txt', 'r', encoding='utf-8') as f:
  raw_text = f.read()

enc_text = tokenizer.encode(raw_text)
print(len(enc_text))


5145


In [26]:
enc_sample = enc_text[50:]

In [27]:
context_size = 4
x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]
print(f"x: {x}")
print(f"y:       {y}")

x: [290, 4920, 2241, 287]
y:       [4920, 2241, 287, 257]


In [28]:
for i in range(1, context_size+1):
  context = enc_sample[:i]
  desired = enc_sample[i]
  print(context, "---->", desired)


[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257


In [29]:
for i in range(1, context_size+1):
  context = enc_sample[:i]
  desired = enc_sample[i]
  print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))

 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a


In [30]:
# Implementing a data loader
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
  def __init__(self, txt, tokenizer, max_length, stride):
    self.tokenizer = tokenizer
    self.input_ids = []
    self.target_ids = []

    # tokenize the entire text
    token_ids = tokenizer.encode(txt)

    # Use sliding window to chunk the text into overlapping sequences of max length
    for i in range(0, len(token_ids) - max_length, stride):
      input_chunk = token_ids[i:i+max_length]
      target_chunk = token_ids[i+1:i+max_length+1]
      self.input_ids.append(torch.tensor(input_chunk))
      self.target_ids.append(torch.tensor(target_chunk))

  def __len__(self):
    """return the total number of rows in the dataset"""
    return len(self.input_ids)

  def __getitem__(self, idx):
    """ return single row from dataset"""
    return self.input_ids[idx], self.target_ids[idx]


In [31]:
# A data loader to generate batches of input-target pairs
def create_dataloader(txt, batch_size=4,
                      max_length=256, stride=128, shuffle=True):
  tokenizer = tiktoken.get_encoding("gpt2")
  dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
  dataloader = DataLoader(dataset,
                          batch_size = batch_size,
                          shuffle=shuffle)
  return dataloader


In [32]:
with open('/content/sample_data/verdict.txt', 'r', encoding='utf-8') as f:
  raw_text = f.read()

In [33]:
dataloader = create_dataloader(
                        raw_text,
                        batch_size=1,
                        max_length=4, #tokens in each tensor
                        stride=1, # stride indicates the #positions the input shifts across batches
                        shuffle=False)
data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]


The `first_batch` contains two tensors: the first stores the input token IDs and the second stores the target token IDs. Since `max_length=4`, each of the two tensors contains 4 token IDs.

In [34]:
second_batch = next(data_iter)
print(second_batch)

[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]


`stride` sets the number of positions by which the input window shifts from one sample to the next (and, with `batch_size=1`, from one batch to the next), emulating a sliding-window approach.
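
The effect of `stride` is easiest to see on a small list of integers. The sketch below mirrors the loop in `GPTDatasetV1` using plain Python lists; the helper name `sliding_windows` and the toy token IDs are assumptions made for illustration.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one token."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        pairs.append((token_ids[i:i + max_length],
                      token_ids[i + 1:i + max_length + 1]))
    return pairs

toy_ids = list(range(10))  # stand-in for real token IDs

# stride=1: consecutive input windows differ by a single token
for x, y in sliding_windows(toy_ids, max_length=4, stride=1)[:2]:
    print(x, "->", y)

# stride=4 (equal to max_length): input windows do not overlap
for x, y in sliding_windows(toy_ids, max_length=4, stride=4):
    print(x, "->", y)
```

With `stride` equal to `max_length`, no token appears in more than one input window. A smaller stride yields more samples, at the cost of overlap between them.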

In [35]:
# data loader with bigger batch size
dataloader = create_dataloader(raw_text,
                                batch_size=8,
                                max_length=4,
                                stride=5)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs: \n", inputs)
print("\nTargets: \n", targets)

Inputs: 
 tensor([[   13,   198,   198,     1],
        [  550, 18459,  1068,   284],
        [  383,  8631,  3872,   373],
        [ 1718,   262,  1657,   832],
        [   11,   530,  1139,  2063],
        [  526,  1114,  9074,    13],
        [  284,   502,   262,  4112],
        [  262,  3211,    12, 16337]])

Targets: 
 tensor([[  198,   198,     1,  5574],
        [18459,  1068,   284,  1577],
        [ 8631,  3872,   373,    11],
        [  262,  1657,   832, 41160],
        [  530,  1139,  2063,   262],
        [ 1114,  9074,    13,   402],
        [  502,   262,  4112,   957],
        [ 3211,    12, 16337,  1474]])


**Creating token embeddings** <br>
The last step for preparing input text for LLM training is to convert the token ids to embedding vectors.

In [36]:
input_ids = torch.tensor([5,1,3,2])
vocab_size = 6
output_dim = 3

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


In [37]:
print(embedding_layer(torch.tensor([3])))

tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)


The embedding layer is essentially a lookup operation that retrieves rows from its weight matrix by token ID.
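
As a sanity check on the lookup interpretation, the sketch below uses plain Python lists with the weight values printed earlier (rounded to four decimals). It shows that selecting row 3 gives the same result as multiplying a one-hot vector by the weight matrix. The helper names `embed` and `one_hot_matmul` are assumptions for illustration; this is not how `torch.nn.Embedding` is implemented internally.

```python
# Weight matrix copied from the printed embedding_layer.weight (rounded values)
weights = [
    [ 0.3374, -0.1778, -0.1690],
    [ 0.9178,  1.5810,  1.3010],
    [ 1.2753, -0.2010, -0.1606],
    [-0.4015,  0.9666, -1.1481],
    [-1.1589,  0.3255, -0.6315],
    [-2.8400, -0.7849, -1.4096],
]

def embed(token_ids):
    """Embedding lookup: pick one row of the weight matrix per token ID."""
    return [weights[t] for t in token_ids]

def one_hot_matmul(token_id):
    """Equivalent formulation: a one-hot row vector times the weight matrix."""
    one_hot = [1.0 if i == token_id else 0.0 for i in range(len(weights))]
    n_cols = len(weights[0])
    return [sum(one_hot[r] * weights[r][c] for r in range(len(weights)))
            for c in range(n_cols)]

print(embed([3]))            # row 3 of the weight matrix
print(one_hot_matmul(3))     # same values via one-hot multiplication
print(embed([5, 1, 3, 2]))   # rows 5, 1, 3, 2 in that order
```

The lookup avoids building the one-hot vector and performing the multiplication, which matters when the vocabulary has tens of thousands of entries.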

In [38]:
print(embedding_layer(input_ids))

tensor([[-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010],
        [-0.4015,  0.9666, -1.1481],
        [ 1.2753, -0.2010, -0.1606]], grad_fn=<EmbeddingBackward0>)


The embedding layer converts a token ID into the same vector representation regardless of where the token is located in the input sequence. Additional positional information therefore needs to be injected into the LLM. There are two categories of position-aware embeddings:
* Relative positional embeddings
* Absolute positional embeddings

Absolute positional embeddings are directly associated with specific positions in a sequence. For each position in the input sequence, a unique embedding is added to the token's embedding to convey its exact location. <br>

Relative positional embeddings focus on the relative positions of, or distances between, tokens. The model learns relationships in terms of how far apart tokens are rather than at which exact position they sit. The advantage is that the model can generalize better to sequences of varying lengths, even if it hasn't seen such lengths during training.

OpenAI's GPT models use absolute positional embeddings.
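
The difference between the two schemes can be sketched with plain indices. In this toy example (a sequence of four tokens, with variable names chosen for illustration), an absolute scheme keys its embeddings on each token's own position, whereas a relative scheme keys them on the offset between pairs of positions.

```python
seq_len = 4  # toy sequence length

# Absolute: one index per position; each index selects its own embedding vector
absolute_positions = list(range(seq_len))

# Relative: offset j - i between position i and position j
relative_offsets = [[j - i for j in range(seq_len)] for i in range(seq_len)]

print(absolute_positions)
for row in relative_offsets:
    print(row)
```

The same offsets recur along each diagonal no matter where a pair of tokens sits in the sequence, which is the property that lets a relative scheme carry over to other positions and lengths.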

In [39]:
output_dim = 256
vocab_size = 50257
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

In [40]:
max_length = 4
dataloader = create_dataloader(raw_text,
                               batch_size=8,
                               max_length=max_length,
                               stride=5,
                               shuffle=False)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Token IDs: \n", inputs)
print("\n Inputs shape: \n", inputs.shape)

Token IDs: 
 tensor([[   40,   367,  2885,  1464],
        [ 3619,   402,   271, 10899],
        [  257,  7026, 15632,   438],
        [  257,   922,  5891,  1576],
        [  568,   340,   373,   645],
        [ 5975,   284,   502,   284],
        [  326,    11,   287,   262],
        [  286,   465, 13476,    11]])

 Inputs shape: 
 torch.Size([8, 4])


In [41]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

torch.Size([8, 4, 256])


In [42]:
block_size = max_length
pos_embedding_layer = torch.nn.Embedding(block_size, output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(block_size))
print(pos_embeddings.shape)

torch.Size([4, 256])


In [43]:
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)


torch.Size([8, 4, 256])
