# GPT: from scratch

## Dataset

### Loading Data

In [1]:
with open('input.txt','r',encoding='utf-8') as f:
    text = f.read()
print('length of the dataset in characters:',len(text))

length of the dataset in characters: 1115394


In [3]:
print(text[:100])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


### Getting all the unique characters in the data

In [4]:
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


### Tokenizing (Building Encoder & Decoder)

In [8]:
stoi = {ch:i for i,ch in enumerate(chars)}  # character -> integer id
itos = {i:ch for i,ch in enumerate(chars)}  # integer id -> character
encode = lambda s: [stoi[c] for c in s]  # input string -> output list of integers
decode = lambda l: ''.join([itos[i] for i in l])  # input list of integers -> output string

In [10]:
print(encode('hii there'))
print(decode(encode('hii there')))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there
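
The same character-level mapping can be checked on a tiny corpus without loading `input.txt`. This is a minimal sketch: `sample_text` is a made-up stand-in for the dataset, and `encode`/`decode` are written as plain functions rather than the lambdas above.

```python
# Toy corpus standing in for input.txt (an assumption for illustration only).
sample_text = "hello there"

# Vocabulary: the sorted unique characters, as in the notebook.
chars = sorted(set(sample_text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer id
itos = {i: ch for i, ch in enumerate(chars)}  # integer id -> character

def encode(s):
    """Map a string to a list of integer ids."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map a list of integer ids back to a string."""
    return "".join(itos[i] for i in ids)

ids = encode("hello")
roundtrip = decode(ids)
print(ids, roundtrip)
```

Because `stoi` and `itos` are inverse mappings, `decode(encode(s))` should return `s` for any string built only from characters in the vocabulary. A character outside the vocabulary raises a `KeyError`.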


### Tokenizing the dataset and storing it in a tensor

In [11]:
import torch
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:100])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


### Splitting data into train and validation sets

In [12]:
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]
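
As a quick sanity check on the 90/10 split, the sizes can be worked out from the dataset length printed earlier. This sketch uses only that number and does not touch the tensor itself; `total`, `train_len` and `val_len` are names chosen here for illustration.

```python
total = 1115394  # dataset length in characters, as printed earlier in this notebook

n = int(0.9 * total)   # int() truncates, so the train split is rounded down
train_len = n          # length of data[:n]
val_len = total - n    # length of data[n:]

print(train_len, val_len)
```

With this length, the first 1,003,854 tokens go to training and the remaining 111,540 to validation.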

## Data Loader : batch of chunks of data

### Block size or context size

In [14]:
block_size = 8
train_data[:block_size+1]  # block_size inputs plus one extra token, so each input position has a next-token target

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [18]:
x = train_data[:block_size]
y = train_data[1:block_size+1]

for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f'when input is {context} the target: {target}')

when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


### Batch Dimension : chunks of tensors

In [21]:
torch.manual_seed(1337)
batch_size = 4 # how many independent sequences will we process in parallel
block_size = 8 # maximum context length for predictions


# sample one batch of batch_size random chunks
def get_batch(split):
    # generate a small batch of inputs x and targets y
    data = train_data if split=='train' else val_data

    ix = torch.randint(len(data)-block_size,(batch_size,))

    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])

    return x,y


In [22]:
# xb: batch of inputs, yb: batch of targets
xb, yb = get_batch('train')
print('inputs')
print(xb.shape)
print(xb)
print('\n')


print('targets')
print(yb.shape)
print(yb)

inputs
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])


targets
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
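
The relationship between inputs and targets is easier to see on data whose values equal their positions. The sketch below repeats the sampling logic on a hypothetical `toy_data` tensor (not the Shakespeare data) and checks that each target row is the input row shifted left by one.

```python
import torch

torch.manual_seed(0)
toy_data = torch.arange(100)  # hypothetical stand-in for train_data
block_size, batch_size = 8, 4

# The upper bound len - block_size is exclusive, so i + block_size + 1
# never runs past the end of the data.
ix = torch.randint(len(toy_data) - block_size, (batch_size,))
x = torch.stack([toy_data[i:i + block_size] for i in ix])
y = torch.stack([toy_data[i + 1:i + block_size + 1] for i in ix])

# Every target row equals the input row shifted by one position.
shifted = torch.equal(x[:, 1:], y[:, :-1])
print(x.shape, y.shape, shifted)
```

The same overlap is visible in the output above: the second row of the inputs ends in `39, 58` and the matching row of the targets ends in `39, 58, 1`.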


In [28]:
for b in range(batch_size): #batch dimension
    print('---Batch---')
    for t in range(block_size): #time dimension or context size
        context = xb[b,:t+1]
        target = yb[b,t]
        print(f'when input is {context} target is: {target}')
    print('____\n')

---Batch---
when input is tensor([24]) target is: 43
when input is tensor([24, 43]) target is: 58
when input is tensor([24, 43, 58]) target is: 5
when input is tensor([24, 43, 58,  5]) target is: 57
when input is tensor([24, 43, 58,  5, 57]) target is: 1
when input is tensor([24, 43, 58,  5, 57,  1]) target is: 46
when input is tensor([24, 43, 58,  5, 57,  1, 46]) target is: 43
when input is tensor([24, 43, 58,  5, 57,  1, 46, 43]) target is: 39
____

---Batch---
when input is tensor([44]) target is: 53
when input is tensor([44, 53]) target is: 56
when input is tensor([44, 53, 56]) target is: 1
when input is tensor([44, 53, 56,  1]) target is: 58
when input is tensor([44, 53, 56,  1, 58]) target is: 46
when input is tensor([44, 53, 56,  1, 58, 46]) target is: 39
when input is tensor([44, 53, 56,  1, 58, 46, 39]) target is: 58
when input is tensor([44, 53, 56,  1, 58, 46, 39, 58]) target is: 1
____

---Batch---
when input is tensor([52]) target is: 58
when input is tensor([52, 58]) targ

In [29]:
print(xb)

tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])


## Bigram Language Model

Here we create a token embedding table of shape vocab_size × vocab_size, i.e. 65 × 65.
Each index in idx is looked up in this embedding table and plucks out one row of size 65. For example,
token 24 plucks out the row at index 24.

PyTorch arranges the result of this lookup as a tensor with (B, T, C) dimensions

where

      Batch size (B) = 4

      Time, i.e. context length (T) = 8

      Channels, i.e. vocab size (C) = 65

Here the tokens are not talking to each other: the prediction at each position depends only on the token at that position.
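
The lookup described above can be sketched with `torch.nn.Embedding`. This is a minimal illustration rather than the full model: the name `token_embedding_table` is chosen here, and `idx` is a random stand-in for the `xb` batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(1337)
vocab_size = 65
B, T = 4, 8  # batch size and context length used in this notebook

# One row of vocab_size scores (logits) per token id.
token_embedding_table = nn.Embedding(vocab_size, vocab_size)

idx = torch.randint(vocab_size, (B, T))  # random stand-in for xb
logits = token_embedding_table(idx)      # shape (B, T, C) with C = vocab_size
print(logits.shape)

# Each position's logits are exactly the table row for that token,
# so no information flows between positions.
same_row = torch.equal(logits[0, 0], token_embedding_table.weight[idx[0, 0]])
print(same_row)
```

Since each output vector is just a row of the table, two positions holding the same token id always get identical logits, whatever surrounds them.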