<a href="https://colab.research.google.com/github/yudumpacin/LLM/blob/main/TurkishGaripPoemsGPTwithTurkishTokenizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook includes study notes of Andrej Karpathy's Let's build GPT: from scratch, in code, spelled out.  [video source](https://www.youtube.com/watch?v=kCc8FmEb1nY&t=1998s&ab_channel=AndrejKarpathy).
Training data is Turkish Garip Poems which is scapped from web, Garip_Siirleri.csv, resulted from notebook in this repository. Instead of using tokenizer which is a mapping of unique characters of input data to enumerate numbers, in this notebook bert turkish uncased tokenizer is used [source2](https://https://github.com/Infatoshi/fcc-intro-to-llms/blob/main/gpt-v2.ipynb)

In [1]:
#import necessary libraries
import pandas as pd
import torch

#Read and Save Data

In [2]:
data = pd.read_csv("Garip_Siirleri.csv")

In [3]:
text = "\n".join([siir for siir in data.Şiir])

In [4]:
with open("garip_siirleri.txt","w") as f:
  f.write(text)

In [5]:
with open("garip_siirleri.txt","r") as f:
  content = f.read()

In [6]:
print(content[0:100])

ANLATAMIYORUM  
Ağlasam sesimi duyar mısınız,  
Mısralarımda; 
Dokunabilir misiniz, 
Gözyaşlarıma, e


# Tokenization with Bert Turkish

In [7]:
from transformers import AutoTokenizer

In [8]:
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")

tokenizer_config.json:   0%|          | 0.00/60.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/385 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/251k [00:00<?, ?B/s]

In [9]:
vocab_size = len(tokenizer)
vocab_size

32000

In [10]:
encoder = lambda s: tokenizer.encode(s, add_special_tokens=True)
decoder = lambda l: tokenizer.decode(l,skip_special_tokens=True)

In [11]:
print(encoder("merhaba"))

[2, 9714, 3]


In [12]:
print(decoder(encoder("merhaba")))

merhaba


In [13]:
#encode entire dataset to torch tensor

In [14]:
data = torch.tensor(encoder(content),dtype=torch.long)

Token indices sequence length is longer than the specified maximum sequence length for this model (6835 > 512). Running this sequence through the model will result in indexing errors


In [15]:
print(data.shape, data.dtype)

torch.Size([6835]) torch.int64


In [16]:
print(data[:100])

tensor([    2,  5904, 12252, 10501, 23099, 27174,  6899,  4555, 31534,  1027,
         8905,  2938,  6436, 12760,    16,    49,  2116,  7672,  5643,  1986,
           31, 28050,  2540,  7775,    16,  5160, 11781, 14089,    16, 17462,
         1050,  1985,    35, 14445,  2125,  2628, 15504,  1009,  2048,  2220,
         2639,    16, 26638,  2208,  2138,  2402,  2451,  6032,  2577,  2444,
         2123, 29914,  1025, 20816,  2025,  2478,    18,  2281,  2147,  2166,
           16,  5862,    31,  2874,  4037,  6985,  3472,    31, 16114,  2018,
         2090,  3072, 23791,    16, 16518,    31, 15729,  9635,    18,  9187,
        26755,  5188,  1105,  2593,  1006,  6454,  2027,  5359, 22168,  3183,
           31,  8806,  3137,  7215, 10922,    16,  3078,  2639,    31,  3161])


In [17]:
#split to train and val
n = int(0.9*(len(data)))
train_data = data[:n]
val_data = data[n:]

In [18]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [19]:
device

'cuda'

In [20]:
torch.manual_seed(12)

<torch._C.Generator at 0x7a95c8856cf0>

In [21]:
# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y


In [22]:
import torch.nn as nn
from torch.nn import functional as F


In [23]:
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

#  GPT

In [24]:
batch_size= 64
block_size = 256
max_iters = 5000
eval_interval = 500
learning_rate = 3e-4
eval_iters = 200
n_head = 6
dropout = 0.5
n_embd = 256
n_layer = 8


In [25]:
class Head(nn.Module):
  def __init__(self,head_size):
    super().__init__()
    self.key = nn.Linear(n_embd,head_size, bias = False)
    self.query = nn.Linear(n_embd,head_size, bias = False)
    self.value = nn.Linear(n_embd,head_size, bias = False)
    self.register_buffer('tril',torch.tril(torch.ones(block_size,block_size)))
  def forward(self,x):
    B,T,C = x.shape
    q = self.query(x)
    k = self.key(x)
    wei = q @ k.transpose(-2,-1)*C**-0.5
    wei = wei.masked_fill(self.tril[:T,:T]==0,float('-inf'))
    wei = F.softmax(wei, dim=-1)
    v = self.value(x)
    out = wei @ v
    return out

In [26]:
class MultiHeadAttention(nn.Module):
  def __init__(self, num_heads, head_size):
    super().__init__()
    self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
    self.proj = nn.Linear(n_embd,n_embd)
    self.dropout = nn.Dropout(dropout)

  def forward(self, x):
    out = torch.cat([h(x) for h in self.heads], dim=-1)
    out = self.proj(out)
    return out

In [27]:
class FeedForward(nn.Module):
  """a simple linear layer followed by a non-linearity"""
  def __init__(self, n_embd):
    super().__init__()
    self.net = nn.Sequential(
        nn.Linear(n_embd, 4*n_embd),
        nn.ReLU(),
        nn.Linear(4*n_embd, n_embd),
        nn.Dropout(dropout)
    )
  def forward(self, x):
    return self.net(x)

In [28]:
class Block(nn.Module):
  """Transformer block: communication followed by computation"""

  def __init__(self, n_embd, n_head):
    #n_emb embedding dimension
    #n_head: number of heads
    super().__init__()
    head_size = n_embd // n_head
    self.sa = MultiHeadAttention(n_head, head_size)
    self.ffwd = FeedForward(n_embd)
    self.ln1 = nn.LayerNorm(n_embd)
    self.ln2 = nn.LayerNorm(n_embd)


  def forward(self, x):
     x = x + self.sa(self.ln1(x))
     x = x + self.ffwd(self.ln2(x))
     return x

In [29]:
class LayerNorm1d: # (used to be BatchNorm1d)

  def __init__(self, dim, eps=1e-5, momentum=0.1):
    self.eps = eps
    self.gamma = torch.ones(dim)
    self.beta = torch.zeros(dim)

  def __call__(self, x):
    # calculate the forward pass
    xmean = x.mean(1, keepdim=True) # batch mean
    xvar = x.var(1, keepdim=True) # batch variance
    xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize to unit variance
    self.out = self.gamma * xhat + self.beta
    return self.out

  def parameters(self):
    return [self.gamma, self.beta]


In [30]:
class GPT(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.sa_heads = MultiHeadAttention(4,n_embd//4)
        self.ffwd = FeedForward(n_embd)
        #self.block = nn.Sequential(
        #    Block(n_embd,n_head=4),
        #    Block(n_embd,n_head=4),
        #    Block(n_embd,n_head=4),
        #    Block(n_embd,n_head=4),
        #    nn.LayerNorm(n_embd))
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd,vocab_size)


    def forward(self, idx, targets=None):
        B,T = idx.shape
        # idx and targets are both (B,T) tensor of integers
        token_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        x = token_emb + pos_emb
        x = self.sa_heads(x)
        x = self.ffwd(x)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block size tokens
            idx_cond = idx[:,-block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx


In [31]:
model = GPT(vocab_size)
m = model.to(device)
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: train loss 10.3771, val loss 10.3775
step 500: train loss 1.5256, val loss 17.1940
step 1000: train loss 0.3582, val loss 28.2738
step 1500: train loss 0.1565, val loss 34.3110
step 2000: train loss 0.0796, val loss 39.4233
step 2500: train loss 0.0455, val loss 44.3158
step 3000: train loss 0.0286, val loss 48.5718
step 3500: train loss 0.0199, val loss 51.4888
step 4000: train loss 0.0158, val loss 53.9022
step 4500: train loss 0.0127, val loss 56.5727
step 4999: train loss 0.0113, val loss 57.8189


When we use BERT Tokenizer , validation is loss isincreased model is overfitted.

In [32]:
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decoder(m.generate(context, max_new_tokens=500)[0].tolist()))

mezarı Hâlâ Benimım çilli belden yukarsı çıplak,mda Turgut Reisn X Güneş omuzlarında, sürü, güvercinler, kaptandın gibi bulutların benzerdin, Özgürlük yağmurla Ben bir çil. Öyledi yamacında. Şarabım birap dal benden, Gökyüzü, bulut Kadeh kırbacında. Şarap fıçılarına sinmiş yüzlerini döktüğü arka sokak! N'oldu Damında kediler sevişen ev, rüzgârlara karşı büyütlmaya görsün bir kediler sevişen ev, rüzgârlara karşı büyüt günleri, rüzgârın Tuzlu tüylerini döktüğü arka sokak! Camdan bir sesa sıçrarleri Ay doğar, tam MERDİVENİ geçerek basıyordu çekiyorlardı, umutların benzerdin, tam cezaruldu servilerin ve işkilli, umutlara karşı çıkışlarıtan ve ev yükseliyordu gitti toz işte Bir delik kalıyordu servilerin ve delik kalıyordu yerinde yeller otta yabancı bir kara karaya çalıyorsa. doluya çalıyorsa, Denize bile iştahsız bakıyorsam, Hep bu boyu devrilesi bozuk düzen, Bu darağacı suratlı toplum! AKŞAM BALIĞIN KARNINDA BEKLİYOR Bir yağmurla çıkıyor rıhtımına sıkıntının, büyük kayıkların Tuzlu sesi 

In [37]:
from IPython.display import display, Markdown

In [38]:
prompt = 'Ağlasam sesimi duyar mısınız mısralarımda'
context = torch.tensor(encoder(prompt), dtype=torch.long, device=device)
generated_chars = decoder(m.generate(context.unsqueeze(0), max_new_tokens=100)[0].tolist())
display(Markdown(generated_chars))

Ağlasam sesimi duyar mısınız mısralarımdademndan Bir kızım vardı beş yaşında Ölmüş şimdi beraberiz İçi sıkılıyor burada Ellerini Varşova'da unutmuş Çember çeviremiyor Ve bir ses Ne patates çapalamak Ne taş kırmak Ne de yük taşımak pazara Burada rahatım iyidir Biri de karısını merak etmiş Evden haber soruyor bana Üstümden kaputumu aldılar Öldüğüm zaman Üşüyorum Önümüz de kış Sonra bir ağızdan konuşuyorlar III Bir bardaktan su içiyoruz Birlikte yemek yiyoruz akşamları Kimisi sevgilimize âşık Kimisi evlât

Also, the output is not in the format of poem now

In [34]:
import pickle

In [35]:
with open('model-01.pkl', 'wb') as f:
    pickle.dump(model, f)
print('model saved')

model saved
