In [4]:
with open('Tiny Shakespeare Input.txt', 'r', encoding = 'utf-8') as file:
    text = file.read()


print('Length of the data in the file for Tiny Shakespeare Input.txt is:', len(text))

Length of the data in the file for Tiny Shakespeare Input.txt is: 1115394


In [8]:
print(text[0:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [9]:
#here are all the unique characters in the text

chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


**We would now like to develop a strategy that can tokenize the input text**



--convert the raw text (a string) into a sequence of integers drawn from a fixed vocabulary


In [19]:
# Build lookup tables between characters and integers.
stoi = {ch: i for i, ch in enumerate(chars)}  # maps characters to integers
itos = {i: ch for i, ch in enumerate(chars)}  # maps integers to characters
# Both tables come from the same enumeration, so tokenization and detokenization are exact inverses.
encode = lambda s: [stoi[c] for c in s]  # takes a string, returns a list of integers
decode = lambda l: ''.join([itos[i] for i in l])  # takes a list of integers, returns a string

print(encode("what's up cutie"))
print(decode([61, 46, 39, 58, 5, 57, 1, 59, 54, 1, 41, 59, 58, 47, 43]))

[61, 46, 39, 58, 5, 57, 1, 59, 54, 1, 41, 59, 58, 47, 43]
what's up cutie
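
The reversibility mentioned in the comments above can be checked directly. The following is a minimal sketch in plain Python; `sample_text` and `can_encode` are hypothetical names introduced only for this illustration, and a short string stands in for the full Shakespeare text.

```python
# Toy stand-in for the notebook's `text`; any string works here.
sample_text = "First Citizen:\nSpeak, speak."

chars = sorted(set(sample_text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer
itos = {i: ch for i, ch in enumerate(chars)}  # integer -> character

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

# Round trip: decoding an encoded string gives the original string back.
ids = encode("Speak")
assert decode(ids) == "Speak"

# Hypothetical helper: a character-level vocabulary can only encode
# characters that appeared in the text it was built from.
def can_encode(s):
    return all(c in stoi for c in s)

print(can_encode("Speak"), can_encode("Speak?"))
```

Because the vocabulary is built from the text itself, any character that never appears in the text (here `?`) has no integer id, and `encode` would raise a `KeyError` for it.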


Encoding converts the raw text into a sequence of integers that can be used to train a model. The mapping used here is one-to-one: each character maps to a unique integer and vice versa. This is about the simplest scheme possible, but it produces long sequences. Subword tokenizers such as SentencePiece or byte pair encoding (BPE) are the usual alternatives. SentencePiece, developed by Google, splits text into subword units, which are larger than the single characters we use here but usually smaller than whole words. It includes its own implementation of BPE and works directly on raw text without language-specific pre-processing, which makes it usable across many languages and larger datasets.

OpenAI has a library called tiktoken that provides several encodings used by different OpenAI models. It also implements byte pair encoding (BPE), and it is what the GPT models use to encode text. According to OpenAI, tiktoken is several times faster than a comparable open-source tokenizer from Hugging Face.
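
To make the byte pair encoding idea mentioned above more concrete, here is a toy sketch of the merge step in plain Python. It is only an illustration of the general technique, not how SentencePiece or tiktoken are actually implemented; the function names `most_frequent_pair` and `merge` and the sample string are assumptions made for this example.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the adjacent pair of token ids that occurs most often."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get)

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

sample = "speak, speak."
ids = list(sample.encode("utf-8"))  # start from raw bytes: ids 0..255
next_id = 256                       # new tokens get ids above the byte range

# Each merge grows the vocabulary by one token and shortens the sequence.
for _ in range(3):
    pair = most_frequent_pair(ids)
    ids = merge(ids, pair, next_id)
    next_id += 1

print(len(sample), len(ids))
```

After three merges the 13-byte sample is represented by 7 tokens, at the cost of a vocabulary that is three entries larger. That is the trade-off between the character-level tokenizer in this notebook and a subword tokenizer.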



In [21]:
#we will now convert the text into a tensor of integers

import torch
data = torch.tensor(encode(text), dtype = torch.long)
print(data.shape, data.dtype)
print(data[:1000])  # the first 1000 characters we printed earlier, now as integers




torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

The tensor is this long because the tokenizer we are using is so simple: every character becomes its own token, so the 1,115,394 characters of text become 1,115,394 integers. With only 65 symbols in the vocabulary, each token carries very little information. A subword tokenizer such as BPE uses a much larger vocabulary in which each token covers several characters, so the same text would be encoded as a shorter sequence.

In [22]:
# a good practice is to split the data into train and validation sets: 90% for training and 10% for validation

n = int(0.9 * len(data)) 
train_data = data[:n]
val_data = data[n:]


We will use a block size to define the context window for the model. We do not feed the transformer the entire dataset at once; we train on smaller chunks sampled from it. Feeding in all of the data at once would be computationally expensive and would take a long time to train, which we do not want. The important thing to note is that the block size must be smaller than the length of the data, because each training chunk needs block_size + 1 consecutive characters.

In [23]:
block_size = 8
# Take the first 9 characters: a block size of 8 needs 9 characters, because they yield 8 (context, target) pairs for the model to learn from.
train_data[:block_size + 1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [26]:
x = train_data[:block_size]
y = train_data[1:block_size + 1]

for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(t+1, f"when input is {context} the target is {target}")




1 when input is tensor([18]) the target is 47
2 when input is tensor([18, 47]) the target is 56
3 when input is tensor([18, 47, 56]) the target is 57
4 when input is tensor([18, 47, 56, 57]) the target is 58
5 when input is tensor([18, 47, 56, 57, 58]) the target is 1
6 when input is tensor([18, 47, 56, 57, 58,  1]) the target is 15
7 when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target is 47
8 when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target is 58


As shown above, a single chunk of 9 characters gives the model 8 training examples: the first character is used to predict the second, the first two to predict the third, and so on until the first 8 characters are used to predict the 9th. The transformer is therefore trained on every context length from 1 character up to the block size, which is 8 in this case. This is computationally efficient, and it also means the model learns to predict from contexts as short as a single character, not only from full-length ones.

In [27]:
torch.manual_seed(1337)  # seed the random number generator so we get the same results every time we run the code
batch_size = 4  # how many independent sequences we process in parallel; a hyperparameter we can tune
block_size = 8  # the maximum context length for predictions

def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)  # print the tensor itself; input() would block waiting for keyboard input
print('targets:')
print(yb.shape)
print(yb)

print('--------------------------------')

for b in range(batch_size):
    for t in range(block_size):
        context = xb[b, :t+1]
        target = yb[b, t]
        print(f"when input is {context} the target is {target}")

inputs:
torch.Size([4, 8])
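
The sampling logic in `get_batch` can be sanity-checked without torch. The following is a rough plain-Python analogue, under the assumption that lists of integers stand in for tensors; `get_batch_py`, `toy_data`, `xs`, and `ys` are hypothetical names used only for this sketch.

```python
import random

def get_batch_py(data, block_size, batch_size, rng):
    """Plain-Python analogue of get_batch: sample windows and their shifted targets."""
    # Valid start positions are 0 .. len(data) - block_size - 1, so the target
    # slice data[i+1 : i+block_size+1] never runs past the end of the data.
    starts = [rng.randrange(len(data) - block_size) for _ in range(batch_size)]
    xs = [data[i:i + block_size] for i in starts]
    ys = [data[i + 1:i + block_size + 1] for i in starts]
    return xs, ys

toy_data = list(range(100))  # stand-in for the encoded text
rng = random.Random(1337)
xs, ys = get_batch_py(toy_data, block_size=8, batch_size=4, rng=rng)

for x_row, y_row in zip(xs, ys):
    assert len(x_row) == len(y_row) == 8
    assert x_row[1:] == y_row[:-1]  # targets are the inputs shifted by one position

print(len(xs), len(xs[0]))
```

Subtracting `block_size` from the upper bound of the random start index is what keeps every target window fully inside the data, which is also why the block size has to be smaller than the dataset.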


In [None]:
#we will now create a simple bigram model to predict the next character in the sequence
# the bigram model is a model that predicts the next character in the sequence based on the previous character

import torch 
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        #each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets = None):
        #idx and targets are both of shape (batch_size, block_size) tensor of integers
        logits = self.token_embedding_table(idx) #(batch_size, block_size, vocab_size)
        return logits
    
m = BigramLanguageModel(vocab_size)
out = m(xb, yb)
print(out.shape)
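
The comment in `forward` says each token reads its next-token logits straight from a lookup table. A minimal plain-Python sketch of that idea is below, assuming nested lists in place of tensors and random values in place of learned weights; `table` and `bigram_logits` are hypothetical names for this illustration.

```python
import random

vocab_size = 65  # matches the character vocabulary found earlier
rng = random.Random(1337)

# One row of vocab_size scores per token id. nn.Embedding(vocab_size, vocab_size)
# holds a learnable table of the same shape; random values stand in for it here.
table = [[rng.gauss(0.0, 1.0) for _ in range(vocab_size)] for _ in range(vocab_size)]

def bigram_logits(idx):
    """idx: B rows of T token ids -> nested list of shape (B, T, vocab_size)."""
    return [[table[token] for token in row] for row in idx]

idx = [[18, 47, 56], [18, 47, 58]]  # B=2, T=3
logits = bigram_logits(idx)
shape = (len(logits), len(logits[0]), len(logits[0][0]))
print(shape)

# The same token id always yields the same logits, whatever came before it:
# a bigram model looks only at the current character.
assert logits[0][0] == logits[1][0]
assert logits[0][1] == logits[1][1]
```

For the real model, `out.shape` should likewise be (batch_size, block_size, vocab_size), i.e. one vector of 65 scores for every position in every sequence of the batch.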
