# **`COURAGE` IS GOING FROM `FAILURE` TO `FAILURE` WITHOUT `LOSING ENTHUSIASM`**

# **Data Preprocessing**

NB: the lecture notes are here: **[🔹 Lecture 12 Notes 🔹](lecture_12_notes.md)**

This lecture builds the whole preprocessing pipeline from scratch.

In effect, it revises everything covered so far: tokenization, vocabulary creation, input-target pairs, and embeddings.


In [1]:
import torch
import tiktoken

### **Step 1: Creating tokens**

In [2]:
# first load the text data
with open("./data/the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
    
print(f"total number of character: {len(raw_text)}")
print(raw_text[:99])

total number of character: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


#### **implementing word based tokenization**

In [3]:
import re

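# list each punctuation character explicitly: a "-" placed between two
# characters inside [...] defines a range instead of matching a literal hyphen
preprocessed_text = re.split(r'(--|[.,":;?_!()\']|\s)', raw_text)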
preprocessed_text = [item.strip() for item in preprocessed_text if item.strip()]
print(len(preprocessed_text))
print(preprocessed_text[:99])


4690
['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in', 'the', 'height', 'of', 'his', 'glory', ',', 'he', 'had', 'dropped', 'his', 'painting', ',', 'married', 'a', 'rich', 'widow', ',', 'and', 'established', 'himself', 'in', 'a', 'villa', 'on', 'the', 'Riviera', '.', '(', 'Though', 'I', 'rather', 'thought', 'it', 'would', 'have', 'been', 'Rome', 'or', 'Florence', '.', ')', '"', 'The', 'height', 'of', 'his', 'glory', '"', '--', 'that', 'was', 'what', 'the', 'women', 'called', 'it', '.', 'I', 'can', 'hear', 'Mrs', '.', 'Gideon', 'Thwing', '--', 'his', 'last', 'Chicago', 'sitter']
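
To see what the split does on a small input, here is a minimal sketch. The sample sentence is made up for illustration, and the pattern lists the punctuation characters explicitly. Because the pattern is wrapped in a capture group, `re.split` keeps the delimiters, and the list comprehension then drops the whitespace-only pieces.

```python
import re

# Same idea as the cell above: split on "--", punctuation, and whitespace.
# The outer parentheses form a capture group, so delimiters are kept.
SPLIT_PATTERN = r'(--|[.,":;?_!()\']|\s)'

sample = 'Hello, world--is this a "test"?'  # made-up sample sentence

pieces = re.split(SPLIT_PATTERN, sample)
# Drop empty strings and whitespace-only pieces.
tokens = [piece.strip() for piece in pieces if piece.strip()]

print(tokens)
# ['Hello', ',', 'world', '--', 'is', 'this', 'a', '"', 'test', '"', '?']
```

Note that `--` comes first in the alternation, so a double dash becomes one token rather than being split character by character.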


#### **creating vocabs**
`vocabs` is a dict that maps every unique token to an integer token id. <br>
So first collect all unique tokens, then build `vocabs` from them.

In [13]:
# use a set to remove all duplicates, then sort for a stable order
all_unique_words = sorted(set(preprocessed_text))
# adding special tokens
all_unique_words.extend(["<|endoftext|>", "<|unk|>"])
print(all_unique_words[:99])
print(len(all_unique_words))

['!', '"', "'", '(', ')', ',', '--', '.', ':', ';', '?', 'A', 'Ah', 'Among', 'And', 'Are', 'Arrt', 'As', 'At', 'Be', 'Begin', 'Burlington', 'But', 'By', 'Carlo', 'Chicago', 'Claude', 'Come', 'Croft', 'Destroyed', 'Devonshire', 'Don', 'Dubarry', 'Emperors', 'Florence', 'For', 'Gallery', 'Gideon', 'Gisburn', 'Gisburns', 'Grafton', 'Greek', 'Grindle', 'Grindles', 'HAD', 'Had', 'Hang', 'Has', 'He', 'Her', 'Hermia', 'His', 'How', 'I', 'If', 'In', 'It', 'Jack', 'Jove', 'Just', 'Lord', 'Made', 'Miss', 'Money', 'Monte', 'Moon-dancers', 'Mr', 'Mrs', 'My', 'Never', 'No', 'Now', 'Nutley', 'Of', 'Oh', 'On', 'Once', 'Only', 'Or', 'Perhaps', 'Poor', 'Professional', 'Renaissance', 'Rickham', 'Riviera', 'Rome', 'Russian', 'Sevres', 'She', 'Stroud', 'Strouds', 'Suddenly', 'That', 'The', 'Then', 'There', 'They', 'This', 'Those']
1132


In [18]:
# now create vocabs: map each token to its integer id
vocabs = {token: token_id for token_id, token in enumerate(all_unique_words)}
print(len(vocabs))

1132
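
As a small illustration of how such a vocabulary is used, the sketch below builds one from a toy token list (made up for this example), adds the same two special tokens, and builds the reverse mapping needed for decoding. The helper name `lookup` is hypothetical, not part of the notebook.

```python
# Toy token list, made up for illustration.
tokens = ["the", "cat", "sat", "on", "the", "mat", "."]

unique_tokens = sorted(set(tokens))
unique_tokens.extend(["<|endoftext|>", "<|unk|>"])

# token -> id, and the inverse id -> token used when decoding
token_to_id = {token: token_id for token_id, token in enumerate(unique_tokens)}
id_to_token = {token_id: token for token, token_id in token_to_id.items()}


def lookup(token):
    """Return the id of `token`, falling back to the <|unk|> id (hypothetical helper)."""
    return token_to_id.get(token, token_to_id["<|unk|>"])


ids = [lookup(token) for token in ["the", "dog", "sat", "."]]
print(ids)                            # [5, 7, 4, 0]
print([id_to_token[i] for i in ids])  # ['the', '<|unk|>', 'sat', '.']
```

"dog" is not in the toy vocabulary, so it maps to the id of `<|unk|>`.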


#### **using the word-based tokenizer class to create vocabs, then encoding and decoding**

In [18]:
from word_based_tokenizer import SimpleWordBasedTokenizer

In [19]:
# punctuation is listed explicitly so that no "-" inside [...] acts as a range
wb_tokenizer = SimpleWordBasedTokenizer(split_regex=r'(--|[.,":;?_!()\']|\s)')

vocabs = wb_tokenizer.create_vocabs(raw_text=raw_text)


In [21]:
len(vocabs)

1132

In [20]:
sample_text = """
I looked up again, and caught sight of that sketch of the donkey hanging on the wall near his bed. His wife told me afterward it was the last thing he had done--just a note taken with a shaking hand, when he was down in Devonshire recovering from a previous heart attack. Just a note! But it tells his whole history. There are years of patient scornful persistence in every line. A man who had swum with the current could never have learned that mighty up-stream stroke. . . .*
"""

enc_samp_text = wb_tokenizer.encode(sample_text)
print(enc_samp_text)
print(wb_tokenizer.decode(enc_samp_text))

[53, 643, 1051, 140, 5, 157, 250, 887, 722, 987, 899, 722, 988, 361, 524, 727, 988, 1072, 701, 549, 207, 7, 51, 1103, 1017, 663, 139, 585, 1077, 988, 602, 996, 533, 514, 360, 6, 590, 115, 712, 973, 1108, 115, 874, 521, 5, 1090, 533, 1077, 362, 568, 30, 825, 477, 115, 791, 537, 183, 7, 59, 115, 712, 0, 22, 585, 980, 549, 1098, 550, 7, 95, 169, 1123, 722, 762, 860, 766, 568, 403, 630, 7, 11, 656, 1097, 514, 969, 1108, 988, 308, 292, 707, 530, 611, 987, 672, 1052, 938, 7, 7, 7, 7, 1131]
I looked up again, and caught sight of that sketch of the donkey hanging on the wall near his bed. His wife told me afterward it was the last thing he had done -- just a note taken with a shaking hand, when he was down in Devonshire recovering from a previous heart attack. Just a note! But it tells his whole history. There are years of patient scornful persistence in every line. A man who had swum with the current could never have learned that mighty up-stream stroke.... <|unk|>


**`SimpleWordBasedTokenizer` is a word-based tokenizer class I wrote to tokenize text. It has methods to encode and decode. The cell above tests it on a passage copied from `the-verdict.txt`: the trailing `*` is not in the vocabulary, so it is encoded as id 1131 and decoded as `<|unk|>`.**
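
The `word_based_tokenizer` module itself is not shown in this notebook, so the following is only a sketch of what such a class could look like. The class name `SketchWordTokenizer` and its internals are assumptions. Only the method names `create_vocabs`, `encode`, and `decode` and the `split_regex` argument come from the cells above.

```python
import re


class SketchWordTokenizer:
    """Hypothetical stand-in for SimpleWordBasedTokenizer (the real class is not shown)."""

    def __init__(self, split_regex):
        self.split_regex = split_regex
        self.token_to_id = {}
        self.id_to_token = {}

    def _split(self, text):
        # Split on the pattern and drop empty / whitespace-only pieces.
        pieces = re.split(self.split_regex, text)
        return [piece.strip() for piece in pieces if piece.strip()]

    def create_vocabs(self, raw_text):
        unique_tokens = sorted(set(self._split(raw_text)))
        unique_tokens.extend(["<|endoftext|>", "<|unk|>"])
        self.token_to_id = {tok: i for i, tok in enumerate(unique_tokens)}
        self.id_to_token = {i: tok for tok, i in self.token_to_id.items()}
        return self.token_to_id

    def encode(self, text):
        # Tokens missing from the vocabulary map to the <|unk|> id.
        unk_id = self.token_to_id["<|unk|>"]
        return [self.token_to_id.get(tok, unk_id) for tok in self._split(text)]

    def decode(self, ids):
        text = " ".join(self.id_to_token[i] for i in ids)
        # Remove the space that the join put in front of punctuation.
        return re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)


tokenizer = SketchWordTokenizer(split_regex=r'(--|[.,":;?_!()\']|\s)')
vocabs_sketch = tokenizer.create_vocabs("It was the last thing he had done--just a note.")

round_trip = tokenizer.decode(tokenizer.encode("It was the last note."))
print(round_trip)  # It was the last note.

unknown_ids = tokenizer.encode("A note")
print(unknown_ids)  # "A" is not in the vocabulary, so its id is the <|unk|> id
```

In this sketch, `decode` only glues punctuation back onto the preceding word. `--` is not in that punctuation set, so it keeps its surrounding spaces, as in the `done -- just` of the decoded output above.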

#### **using Byte Pair Encoding (BPE), a subword encoding**

In [4]:
gpt2_tokenization = tiktoken.get_encoding(encoding_name="gpt2")
gpt2_tokenized_data = gpt2_tokenization.encode(raw_text)
print(len(gpt2_tokenized_data))
# print(gpt2_tokenization.decode(gpt2_tokenized_data))


5145
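
To give a feel for how BPE arrives at subwords, here is a toy sketch of its core training step: repeatedly find the most frequent adjacent pair of symbols and merge it into one symbol. This is a simplification, not what `tiktoken` runs internally. The real GPT-2 tokenizer works on bytes and applies a fixed, pretrained merge table. The function names and the three sample words are made up for illustration.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Count adjacent symbol pairs across all sequences and return the most common one."""
    pair_counts = Counter()
    for symbols in sequences:
        pair_counts.update(zip(symbols, symbols[1:]))
    return pair_counts.most_common(1)[0][0]


def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in `symbols` with a single merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


# Start from single characters and apply two merge steps.
words = [list("lower"), list("lowest"), list("low")]
for _ in range(2):
    pair = most_frequent_pair(words)
    words = [merge_pair(word, pair) for word in words]

print(words)  # [['low', 'e', 'r'], ['low', 'e', 's', 't'], ['low']]
```

After two merges the frequent stem `low` has become a single symbol, while the rarer endings stay as characters. This is why BPE can encode unseen words without an `<|unk|>` token.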


#### **creating input-target pairs**

In [22]:
from custom_dataloader import create_dataloader_v1

In [26]:
batch_size = 8
context_window = 4
dataloader = create_dataloader_v1(
    raw_text,
    batch_size=batch_size,
    max_length=context_window,
    stride=context_window,
    shuffle=False,
    drop_last=True,
)
data_iter = iter(dataloader)
inputs, target = next(data_iter)
print(inputs)
print(target)

tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])
tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])
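
The `custom_dataloader` module is not shown here, so the sketch below only illustrates the sliding-window idea behind the tensors above: each target row is its input row shifted one token to the right. The function name `make_input_target_pairs` is hypothetical. The token ids are the first nine ids from the output above.

```python
def make_input_target_pairs(token_ids, max_length, stride):
    """Slide a window over token_ids; the target is the input shifted by one token."""
    pairs = []
    for start in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[start : start + max_length]
        targets = token_ids[start + 1 : start + max_length + 1]
        pairs.append((inputs, targets))
    return pairs


# First nine GPT-2 token ids of the text, taken from the output above.
token_ids = [40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899]

pairs = make_input_target_pairs(token_ids, max_length=4, stride=4)
for inputs_row, targets_row in pairs:
    print(inputs_row, "->", targets_row)
# [40, 367, 2885, 1464] -> [367, 2885, 1464, 1807]
# [1807, 3619, 402, 271] -> [3619, 402, 271, 10899]
```

With `stride` equal to `max_length`, consecutive input windows do not overlap, which matches the first two rows of `inputs` and `target` printed above. A smaller stride would produce overlapping windows and more training pairs.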


#### **Creating Vector/token embeddings & Positional Embeddings**

In [29]:
# print(token_ids)
num_embeddings = 50257
embeddings_dim = 256
torch.manual_seed(123)

# token/vector embeddings
token_embeddings_layer = torch.nn.Embedding(
    num_embeddings=num_embeddings, embedding_dim=embeddings_dim
)
# positional embeddings
positional_embeddings_layer = torch.nn.Embedding(
    num_embeddings=context_window, embedding_dim=embeddings_dim
)
print(f"token embeddings weights:\n {token_embeddings_layer.weight} ---> shape: {token_embeddings_layer.weight.shape}")
print(f"\npositional embeddings weights:\n {positional_embeddings_layer.weight} ---> shape: {positional_embeddings_layer.weight.shape}")

token embeddings weights:
 Parameter containing:
tensor([[ 0.3374, -0.1778, -0.3035,  ...,  1.3337,  0.0771, -0.0522],
        [ 0.2386,  0.1411, -1.3354,  ..., -0.0315, -1.0640,  0.9417],
        [-1.3152, -0.0677, -0.1350,  ..., -0.3181, -1.3936,  0.5226],
        ...,
        [ 0.5871, -0.0572, -1.1628,  ..., -0.6887, -0.7364,  0.4479],
        [ 0.4438,  0.7411,  1.1263,  ...,  1.2091,  0.6781,  0.3331],
        [-0.2537,  0.1446,  0.7203,  ..., -0.2134,  0.2144,  0.3006]],
       requires_grad=True) ---> shape: torch.Size([50257, 256])

positional embeddings weights:
 Parameter containing:
tensor([[ 0.5423, -0.1224, -1.4150,  ...,  0.2515, -2.3067,  0.8155],
        [-0.3973, -1.2575, -1.9800,  ..., -0.1207,  0.3075, -0.6422],
        [ 0.1840,  1.1128,  1.0052,  ...,  0.2081,  0.5531, -1.1619],
        [ 1.4155,  0.6599,  0.3760,  ...,  0.7034, -0.6108,  0.1080]],
       requires_grad=True) ---> shape: torch.Size([4, 256])


In [37]:
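# look up the embedding vector for every token id in the batch
# inputs has shape (batch_size, context_window) = (8, 4)
token_embeddings = token_embeddings_layer(inputs)  # shape: (8, 4, 256)
# one positional vector for each position 0..context_window-1
pos_embeddings = positional_embeddings_layer(torch.arange(context_window))  # shape: (4, 256)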
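
An embedding layer is essentially a lookup table: row `i` of the weight matrix is the vector for id `i`. The sketch below mimics this in plain Python, without `torch`, using tiny made-up sizes. It also shows the element-wise sum of token and positional vectors that forms the input embeddings. All names and sizes are illustrative assumptions.

```python
import random

rng = random.Random(123)
vocab_size, context_window, dim = 10, 4, 3  # tiny sizes, for illustration only

# Each table is a list of rows; row i is the vector for id i.
token_table = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(vocab_size)]
position_table = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(context_window)]

# A batch of token ids with shape (batch_size=2, context_window=4).
batch = [[2, 5, 7, 1], [0, 3, 3, 9]]

# For every token: look up its row, look up the row of its position, add them.
input_embeddings = [
    [
        [t + p for t, p in zip(token_table[token_id], position_table[pos])]
        for pos, token_id in enumerate(sequence)
    ]
    for sequence in batch
]

shape = (len(input_embeddings), len(input_embeddings[0]), len(input_embeddings[0][0]))
print(shape)  # (2, 4, 3)
```

The same positional rows are reused for every sequence in the batch, which is what broadcasting does when a `(4, 256)` tensor is added to an `(8, 4, 256)` tensor. In the second sequence, token id 3 appears at positions 1 and 2 and gets two different input vectors, since only the positional part differs.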


#### **Creating input embeddings**

In [39]:
# input_embeddings = token embeddings + positional embeddings
# pos_embeddings (4, 256) is broadcast across the batch dimension of token_embeddings (8, 4, 256)
input_embeddings = token_embeddings + pos_embeddings
input_embeddings, input_embeddings.shape

(tensor([[[ 0.4784,  0.2094, -1.3080,  ...,  0.7864, -3.1091, -1.5083],
          [-0.7497, -0.9066, -0.9927,  ..., -1.9672, -1.3960, -0.3200],
          [ 1.1857,  2.0427, -0.2581,  ..., -1.0175,  1.6710, -1.0276],
          [ 2.2151,  2.9436, -0.2765,  ..., -0.4182, -0.1402,  0.2612]],
 
         [[ 0.4341, -1.3947, -2.6367,  ..., -0.6684, -0.2993, -0.5983],
          [-0.6400, -0.3430, -0.8915,  ..., -0.9857,  3.8343,  0.0802],
          [-0.3594,  2.7332,  2.2274,  ...,  0.8895, -0.8501, -1.0126],
          [ 1.0652, -0.2726, -0.9140,  ..., -0.7946, -0.4708,  0.4810]],
 
         [[ 0.6100, -1.1981, -1.2779,  ..., -0.1396, -3.4041,  1.5746],
          [-1.6534, -0.9395, -1.9595,  ...,  0.6251, -0.7305, -2.3072],
          [ 0.3030,  1.8729,  0.0746,  ...,  0.2088,  1.4137, -2.5317],
          [ 1.4252,  1.2383,  0.6891,  ...,  0.5719, -1.7446, -1.0321]],
 
         ...,
 
         [[ 0.9363, -1.0170, -1.7172,  ..., -1.7176, -3.8629,  1.1428],
          [ 0.7812, -1.5820, -2.6241,  