In [None]:
import torch
import numpy as np

In [None]:
dataset = """
Sure! Here's a 400-word essay on the topic of "The Impact of Social Media on Society":

The Impact of Social Media on Society

Social media has emerged as a powerful force in shaping the way we communicate, share information, and interact with others. With its rapid growth and widespread adoption, it has significantly impacted various aspects of society. While it has brought about numerous benefits, such as increased connectivity and access to information, social media has also raised concerns about privacy, mental health, and the spread of misinformation.

One of the most significant impacts of social media is its ability to connect people from different parts of the world. Platforms like Facebook, Twitter, and Instagram have enabled individuals to form new relationships, reconnect with old friends, and build online communities. This connectivity has fostered a sense of global citizenship and facilitated the exchange of ideas and cultural diversity.

Moreover, social media has revolutionized the way information is disseminated. News spreads rapidly on platforms like Twitter, making it a valuable tool for staying updated on current events. It has given a voice to marginalized communities, allowing them to share their experiences and advocate for social change. Social media has also played a crucial role in organizing protests and movements, bringing attention to important social and political issues.

However, the rise of social media has also brought about concerns regarding privacy. Users often share personal information online, which can be exploited by malicious entities. Social media platforms have faced criticism for their handling of user data and breaches of privacy. It is crucial for individuals to be cautious about the information they share and to understand the privacy settings offered by these platforms.

Another significant concern associated with social media is its impact on mental health. The constant exposure to curated and idealized representations of others' lives can lead to feelings of inadequacy and low self-esteem. Social media platforms have also been linked to increased rates of anxiety, depression, and cyberbullying. It is essential for individuals to use social media mindfully and seek support when needed.

Furthermore, the spread of misinformation on social media has become a pressing issue. False information can easily go viral and influence public opinion. This has serious implications for democracy, public health, and social cohesion. It is important for users to critically evaluate the information they encounter and for platforms to take responsibility in combating the spread of misinformation through fact-checking and content moderation.

In conclusion, social media has had a profound impact on society. It has connected people across borders, facilitated the exchange of information, and empowered marginalized communities. However, it has also raised concerns about privacy, mental health, and the spread of misinformation. It is essential for individuals, platforms, and policymakers to work together to harness the positive aspects of social media while addressing its challenges. By promoting responsible use, protecting user privacy, and ensuring the reliability of information, we can maximize the benefits of social media and create a more inclusive and informed society.
"""

In [None]:
unique_tokens = sorted(set(dataset))
print(unique_tokens[:10],len(unique_tokens))

['\n', ' ', '!', '"', "'", ',', '-', '.', '0', '4'] 49


In [None]:
## create encoder and decoder
stoi = {ch : i for i,ch in enumerate(unique_tokens)}
itos = {i: ch for i,ch in enumerate(unique_tokens)}
encoder = lambda string : [stoi[x] for x in string]
decoder = lambda vector : [itos[x] for x in vector]

In [None]:
encoded_part = encoder(dataset[:10])
encoded_part

[0, 20, 43, 40, 28, 2, 1, 14, 28, 40]

In [None]:
decoded_part = decoder(encoded_part)
"".join(decoded_part)

'\nSure! Her'
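
The mapping can be checked with a round trip: decoding an encoded string should give back the original. The sketch below rebuilds the same kind of mapping on a short made-up string. All `demo_*` names are illustrative and not part of the notebook.

```python
# Hypothetical stand-in for `dataset`; any string works.
demo_text = "hello world"

demo_vocab = sorted(set(demo_text))
demo_stoi = {ch: i for i, ch in enumerate(demo_vocab)}
demo_itos = {i: ch for i, ch in enumerate(demo_vocab)}

def demo_encode(s):
    # string -> list of integer token ids
    return [demo_stoi[ch] for ch in s]

def demo_decode(ids):
    # list of integer token ids -> string
    return "".join(demo_itos[i] for i in ids)

demo_ids = demo_encode("hello")
demo_round_trip = demo_decode(demo_ids)
print(demo_ids, demo_round_trip)  # [3, 2, 4, 4, 5] hello
```

Note that encoding a character that never occurred in the source text raises a `KeyError`, since the vocabulary is built only from characters that were seen.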

In [None]:
encoded_dataset = encoder(dataset)
train_dataset = encoded_dataset[:int(len(encoded_dataset)*.8)]
test_dataset = encoded_dataset[int(len(encoded_dataset)*.8):]
print(len(train_dataset),len(test_dataset))

2691 673


In [None]:
np.random.randint?

In [None]:
## next, split the dataset into batches of fixed-length sequences (the time dimension)
## each sequence in a batch starts at a random position in the dataset
def generate_batch(dataset,batch_size = 32,seq_len = 8):
  random_ids = np.random.randint(len(dataset) - seq_len - 1,size= (batch_size,))
  ## stack creates a new 0th axis
  x = np.stack([dataset[i:i + seq_len] for i in random_ids])
  y = np.stack([dataset[i+1:i+seq_len + 1] for i in random_ids])
  # context, target pairs
  return torch.tensor(x),torch.tensor(y)

In [None]:
generate_batch(train_dataset,4,8)

(tensor([[40, 34, 27,  7,  1, 19, 34, 24],
         [15, 42,  1, 32, 41,  1, 26, 40],
         [28, 46, 38, 37, 41, 43, 40, 28],
         [31, 28,  1, 41, 38, 40, 28, 24]]),
 tensor([[34, 27,  7,  1, 19, 34, 24, 42],
         [42,  1, 32, 41,  1, 26, 40, 43],
         [46, 38, 37, 41, 43, 40, 28,  1],
         [28,  1, 41, 38, 40, 28, 24, 27]]))
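
In the output above, each target row is the input row shifted left by one token. A minimal sketch of that relationship, assuming a toy dataset in which token `t` is always followed by `t + 1`. `demo_generate_batch`, `toy_tokens` and the seeded NumPy generator are illustrative choices, not the notebook's own code.

```python
import numpy as np
import torch

def demo_generate_batch(tokens, batch_size=4, seq_len=8, seed=0):
    # Same idea as generate_batch, but seeded so the result is reproducible.
    rng = np.random.default_rng(seed)
    starts = rng.integers(0, len(tokens) - seq_len, size=batch_size)
    xb = torch.tensor(np.stack([tokens[i:i + seq_len] for i in starts]))
    yb = torch.tensor(np.stack([tokens[i + 1:i + seq_len + 1] for i in starts]))
    return xb, yb

# Toy "dataset": token t is always followed by token t + 1.
toy_tokens = list(range(100))
demo_xb, demo_yb = demo_generate_batch(toy_tokens)

# The targets are the inputs shifted left by one position.
shift_ok = torch.equal(demo_xb[:, 1:], demo_yb[:, :-1])
print(shift_ok)

# One row holds seq_len training examples: context x[:t+1] -> target y[t].
for t in range(3):
    print(demo_xb[0, :t + 1].tolist(), "->", demo_yb[0, t].item())
```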

In [None]:
## the batches are ready
## the remaining part is to define the model

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as f
torch.manual_seed(1337)

<torch._C.Generator at 0x7f73d65bdb30>

In [None]:
## defining some length of vectors
vocab_size = len(unique_tokens)
seq_len = 8
feature_len = 32
head_size = 16
batch_size = 32

In [None]:
criterion = nn.CrossEntropyLoss()

In [None]:
class BigramLanguageModel(nn.Module):
  def __init__(self,vocab_size):
    super().__init__()
    self.token_embedding_table = nn.Embedding(vocab_size,vocab_size)
  def forward(self,idx,target = None):
    ## idx and target are batches of token sequences -> (B, T)
    ## the embedding lookup turns idx into (B, T, C); here C == vocab_size,
    ## so each token's embedding row is read directly as the next-token logits
    logits = self.token_embedding_table(idx)
    if target is None:
      return logits
    ## nn.CrossEntropyLoss expects logits of shape (N, C) and class indices of shape (N,)
    ## that is, one axis for the examples and one for the class scores
    ## so the batch and time axes have to be flattened into one
    B,T,C = logits.shape
    logits = logits.view(-1,C)
    target = target.view(-1)
    score =criterion(logits,target)

    return logits,score
  def generator(self,idx,new_max_tokens):
    for _ in range(new_max_tokens):
      logits = self(idx) # (B, T, C)
      logits = logits[:,-1,:] # keep only the last time step -> (B, C)
      probs = f.softmax(logits,dim = -1) ## (B, C)
      predictions = torch.multinomial(probs,num_samples=1)  ## B X 1 output
      idx = torch.cat([idx,predictions],axis = 1)
    return idx
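
The reshape before the loss is needed because PyTorch's cross-entropy wants the class axis in position 1. A small sketch of two equivalent ways to satisfy that, using random stand-in logits and targets. The `demo_*` names and sizes are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

gen = torch.Generator().manual_seed(0)
B_demo, T_demo, C_demo = 4, 8, 49
demo_logits = torch.randn(B_demo, T_demo, C_demo, generator=gen)
demo_targets = torch.randint(0, C_demo, (B_demo, T_demo), generator=gen)

# Option 1 (the one used in the notebook): merge batch and time into one axis.
loss_flat = F.cross_entropy(demo_logits.view(-1, C_demo), demo_targets.view(-1))

# Option 2: keep (B, T) but move the class axis to position 1 -> (B, C, T).
loss_perm = F.cross_entropy(demo_logits.permute(0, 2, 1), demo_targets)

losses_match = torch.allclose(loss_flat, loss_perm)
print(losses_match)
```

Both calls average over all `B * T` positions, so they return the same scalar.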

In [None]:
## some tests
idx = np.array(test_dataset[:1]) ## slicing keeps the dimension -> shape (1,)
idx2 = np.array(test_dataset[1]) ## indexing a single element drops the dimension -> shape ()
idx.shape,idx2.shape

((1,), ())

In [None]:
x = torch.tensor([[1,2],[1,2],[1,2]])
## now depending on which data we are adding to it
new_data = torch.tensor([[2,3]])
x.shape,new_data.shape

(torch.Size([3, 2]), torch.Size([1, 2]))

In [None]:
torch.cat([x,new_data],axis = 0) ### concatenate along the one axis on which the shapes differ
### here (3,2) and (1,2) differ only on axis 0; the other axes match, so the tensors can be concatenated

tensor([[1, 2],
        [1, 2],
        [1, 2],
        [2, 3]])

In [None]:
dx,dy = generate_batch(train_dataset,8,1)

In [None]:
dx.shape,dy.shape

(torch.Size([8, 1]), torch.Size([8, 1]))

In [None]:
## generator returns one token sequence per batch row; take a single row (length T) and decode it

In [None]:
m = BigramLanguageModel(vocab_size)

In [None]:
logits,loss = m(dx,dy)
print(loss.item()) ## .item() returns the scalar loss as a plain Python float

4.623099327087402
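
As a reference point for the number printed above: a model that assigns equal probability to all 49 characters has a cross-entropy of ln(49). The untrained model scores somewhat worse than that because `nn.Embedding` initializes its weights from a standard normal distribution, so the initial logits are not uniform. A short worked check, where `demo_vocab_size` is simply the vocabulary size reported earlier in the notebook:

```python
import math

demo_vocab_size = 49  # number of unique characters found earlier in the notebook
uniform_loss = -math.log(1.0 / demo_vocab_size)
print(f"{uniform_loss:.4f}")  # 3.8918
```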


In [None]:
## start generation from a single token (index 35 here)
tokens = m.generator(torch.tensor([[35]]),200)[0]
print("".join(decoder(tokens.tolist())))

msxhg,IM,"UflcaP4mc4w''oMWO'"yySwF"yFUTuF0M'vzBNU,.l.A4u
TnIiIsxtufz:.aTO4mwbx:WfcA"v,.r!AUxvnIBetnhvl.N"vIAOlIrxfTN"AuctBOMk.o0fWlin-WkaPAerrz0qBeoFUpTThgAIdx'UxhMmfiFd:Iy."ofdq:wepmrug-co0fev0lOzAnhm


In [None]:
## now let's train bigram model
optimizer = optim.AdamW(params = m.parameters(), lr = 1e-3)

In [None]:
num_epochs = 10000
for i in range(num_epochs):
  x,y = generate_batch(train_dataset)
  logits,loss = m(x,y)
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()
  if(i%500 == 0):
    print(loss.item())

4.454430103302002
3.8837852478027344
3.4051735401153564
2.9932796955108643
2.839930534362793
2.663543939590454
2.6118483543395996
2.4756112098693848
2.4258763790130615
2.3992905616760254
2.314478635787964
2.313333511352539
2.2423791885375977
2.266751766204834
2.353645086288452
2.299842119216919
2.304314374923706
2.2036361694335938
2.149061918258667
2.161999225616455


In [None]:
tokens = m.generator(torch.tensor([[5]]),200)[0]
print("".join(decoder(tokens.tolist())))

, ake!Soninfonen ly. w bovein of icy on ivericeas crendusoofrid be! cciay tioSocin harecAn fripraneso s o owhad exprmpd n Imeas ofoct pldactererentiarsoualted thed updsedor wial eth Mo wa gediverenghex


# mathematical trick towards attention

In [None]:
## attention takes the previous context into account when predicting the next word
## the current word decides how much each previous word matters
## since this is next-word prediction, only previous words may be used, never later ones
## in tasks like sentiment analysis, attention can look at both previous and next words
## as a first step, average the vectors of all tokens up to and including the current one to get a context vector
sample_data = torch.rand((3,4,5)) ## 3 batches each having 4 tokens represented as vector of 5 elements
context_data = torch.zeros((3,4,5))
for i in range(sample_data.shape[0]) :
  for j in range(sample_data.shape[1]) :
    context = sample_data[i,:j+1] # TxC
    context_data[i][j] = torch.mean(context,axis = 0) ## mean over the time axis (rows), leaving a vector of size C
sample_data[0],context_data[0]

(tensor([[0.1080, 0.8708, 0.3333, 0.1221, 0.1974],
         [0.5288, 0.5014, 0.3724, 0.8378, 0.8934],
         [0.8923, 0.5049, 0.5240, 0.2993, 0.0722],
         [0.1155, 0.6658, 0.4625, 0.0577, 0.1422]]),
 tensor([[0.1080, 0.8708, 0.3333, 0.1221, 0.1974],
         [0.3184, 0.6861, 0.3529, 0.4799, 0.5454],
         [0.5097, 0.6257, 0.4099, 0.4197, 0.3877],
         [0.4111, 0.6357, 0.4231, 0.3292, 0.3263]]))

In [None]:
## the same averaging, done with a matrix multiplication
## row t of wei holds the weights that position t gives to each position in its sequence
B,T,C = sample_data.shape
wei = torch.ones(T,T)
wei = torch.tril(wei) ## lower-triangular: position t only sees positions 0..t
wei = wei/torch.sum(wei,dim = 1,keepdim=True) ## divide each row by its sum so the weights in a row add up to 1
context_data = wei@sample_data
sample_data[0],context_data[0]

(tensor([[0.1080, 0.8708, 0.3333, 0.1221, 0.1974],
         [0.5288, 0.5014, 0.3724, 0.8378, 0.8934],
         [0.8923, 0.5049, 0.5240, 0.2993, 0.0722],
         [0.1155, 0.6658, 0.4625, 0.0577, 0.1422]]),
 tensor([[0.1080, 0.8708, 0.3333, 0.1221, 0.1974],
         [0.3184, 0.6861, 0.3529, 0.4799, 0.5454],
         [0.5097, 0.6257, 0.4099, 0.4197, 0.3877],
         [0.4111, 0.6357, 0.4231, 0.3292, 0.3263]]))

In [None]:
## next we need an affinity matrix that decides how much of each previous token to use
## every position gets a weight and softmax normalizes the weights; -inf removes the future positions
tril = torch.tril(torch.ones(T,T))
wei = torch.zeros((T,T))
wei = wei.masked_fill((tril == 0),float('-inf'))
wei

tensor([[0., -inf, -inf, -inf],
        [0., 0., -inf, -inf],
        [0., 0., 0., -inf],
        [0., 0., 0., 0.]])

In [None]:
wei = f.softmax(wei,dim = 1)
wei

tensor([[1.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500]])

In [None]:
context_data = wei @ sample_data
sample_data[0],context_data[0]

(tensor([[0.1080, 0.8708, 0.3333, 0.1221, 0.1974],
         [0.5288, 0.5014, 0.3724, 0.8378, 0.8934],
         [0.8923, 0.5049, 0.5240, 0.2993, 0.0722],
         [0.1155, 0.6658, 0.4625, 0.0577, 0.1422]]),
 tensor([[0.1080, 0.8708, 0.3333, 0.1221, 0.1974],
         [0.3184, 0.6861, 0.3529, 0.4799, 0.5454],
         [0.5097, 0.6257, 0.4099, 0.4197, 0.3877],
         [0.4111, 0.6357, 0.4231, 0.3292, 0.3263]]))
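
The three cells above print the same numbers. The sketch below makes that explicit by computing the causal average in all three ways and comparing them with `torch.allclose`. It uses its own random tensor and its own variable names, so it does not depend on or overwrite the notebook's `sample_data` and `wei`.

```python
import torch
import torch.nn.functional as F

gen = torch.Generator().manual_seed(0)
demo_x = torch.rand((3, 4, 5), generator=gen)  # (B, T, C)
n_batch, n_time, n_chan = demo_x.shape

# Version 1: explicit loops over batch and time.
avg_loop = torch.zeros_like(demo_x)
for b in range(n_batch):
    for t in range(n_time):
        avg_loop[b, t] = demo_x[b, :t + 1].mean(dim=0)

# Version 2: row-normalized lower-triangular matrix.
w_tril = torch.tril(torch.ones(n_time, n_time))
w_tril = w_tril / w_tril.sum(dim=1, keepdim=True)
avg_matmul = w_tril @ demo_x  # (T, T) @ (B, T, C) broadcasts over B

# Version 3: softmax over zeros, with future positions set to -inf.
future = torch.tril(torch.ones(n_time, n_time)) == 0
w_soft = F.softmax(torch.zeros(n_time, n_time).masked_fill(future, float("-inf")), dim=-1)
avg_softmax = w_soft @ demo_x

all_equal = (torch.allclose(avg_loop, avg_matmul, atol=1e-6)
             and torch.allclose(avg_loop, avg_softmax, atol=1e-6))
print(all_equal)
```

The softmax version is the one that generalizes: replacing the zeros with data-dependent scores turns the uniform average into a weighted one.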

In [None]:
#### now the affinity matrix (wei) has to be built from the input tokens themselves
## each token (a vector of size C) emits a query (what it is looking for) and a key (what it contains)
## the dot product of a query with a key is large when the two match,
## and a large dot product means a high affinity between those two tokens
head_size = 16
key = nn.Linear(C,head_size,bias = False)
query = nn.Linear(C,head_size,bias = False)
k = key(sample_data)  ## B T 16
q = query(sample_data) ## B T 16
## now creating affinity matrix
wei = q @ k.transpose(-2,-1) # B T T
wei = wei.masked_fill(tril == 0,float('-inf')) ## hide future positions, as in the cells above
wei = f.softmax(wei,dim = -1)
## each token also emits a value: the information it passes on, which is then combined using the affinities
value = nn.Linear(C,head_size,bias = False)
x = value(sample_data)
output = wei @ x  ## produces output of shape B T head_size
output[0].shape

torch.Size([4, 16])

In [None]:
head_size = 16
n_embed = 32
block_size = 8

In [None]:
## now creating head block
class Head(nn.Module):
  def __init__(self,n_embed = 32,head_size = 16): ## block_size (the maximum context length) is read from the global scope
    super().__init__()
    self.query = nn.Linear(n_embed,head_size,bias = False)
    self.value = nn.Linear(n_embed,head_size,bias = False)
    self.key = nn.Linear(n_embed,head_size,bias = False)
    self.register_buffer('tril',torch.tril(torch.ones(block_size,block_size)))
  def forward(self,x):
    ## T can be shorter than block_size (e.g. during generation), so read it from the input
    B,T,C = x.shape
    q = self.query(x)
    k = self.key(x)
    v = self.value(x)
    ## the dot products have variance of the order of head_size, so scale them by head_size**-0.5
    ## without scaling, the large scores would push softmax towards a one-hot output
    wei = q @ k.transpose(-2,-1)*k.shape[-1]**-.5
    wei = wei.masked_fill(self.tril[:T,:T] == 0, float('-inf')) ## block communication with future tokens
    wei = f.softmax(wei,dim = -1) ## normalize each row of wei
    return wei @ v  ## output is (B, T, head_size); a later linear layer maps it to vocab_size
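
The `head_size**-0.5` factor can be checked numerically. A minimal sketch, assuming queries and keys whose components have unit variance: the raw dot products then have variance close to `head_size`, and the scaled ones close to 1. The tensors and `demo_*` names are stand-ins, not values from the model.

```python
import torch

gen = torch.Generator().manual_seed(0)
demo_head_size = 16
# Stand-ins for queries and keys: 512 vectors each, unit variance per component.
demo_q = torch.randn(512, demo_head_size, generator=gen)
demo_k = torch.randn(512, demo_head_size, generator=gen)

raw_scores = demo_q @ demo_k.T
scaled_scores = raw_scores * demo_head_size ** -0.5

raw_var = raw_scores.var().item()        # close to head_size
scaled_var = scaled_scores.var().item()  # close to 1
print(round(raw_var, 2), round(scaled_var, 2))

# Larger scores make softmax peakier: compare the largest weight per row
# when each query attends over the first 8 keys.
peak_raw = torch.softmax(raw_scores[:, :8], dim=-1).max(dim=-1).values.mean().item()
peak_scaled = torch.softmax(scaled_scores[:, :8], dim=-1).max(dim=-1).values.mean().item()
print(round(peak_raw, 2), round(peak_scaled, 2))
```

With unscaled scores most of the attention weight lands on a single key, which is the saturation the scaling is meant to avoid.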

In [None]:
class HeadedBigramLanguageModel(nn.Module):
  def __init__(self,vocab_size = 49,n_embed = 32,head_size = 16):
    super().__init__()
    self.token_embedding_table = nn.Embedding(vocab_size,n_embed)
    self.index_embedding_table = nn.Embedding(block_size,n_embed) ## one embedding per position, so block_size rows
    self.sa_head = Head(n_embed,head_size)
    self.ln_head = nn.Linear(head_size,vocab_size) ## going from head -> vocab_size
  def forward(self,idx,target = None):
    ## idx and target are batches of token sequences -> (B, T)
    token_embeds = self.token_embedding_table(idx) ## (B, T, C)
    index_embeds = self.index_embedding_table(torch.arange(idx.shape[-1])) ## (T, C): one position vector per time step
    x = token_embeds + index_embeds ## (T, C) broadcasts over the batch dimension

    x = self.sa_head(x)
    logits = self.ln_head(x)
    if target is None:
      loss = None
    else:
      ## nn.CrossEntropyLoss expects logits of shape (N, C) and class indices of shape (N,)
      ## so the batch and time axes have to be flattened into one
      B,T,C = logits.shape
      logits = logits.view(-1,C)
      target = target.view(-1)
      loss = criterion(logits,target)
    return logits,loss
  def generator(self,idx,new_max_tokens):
    for _ in range(new_max_tokens):
      idx_cond = idx[:,-block_size:] ## crop the context: tril and the position table only cover block_size steps
      logits,_ = self(idx_cond) ## (B, T, C)
      logits = logits[:,-1,:] ## keep only the last time step -> (B, C)
      probs = f.softmax(logits,dim = -1) ## (B, C)
      predictions = torch.multinomial(probs,num_samples=1)  ## (B, 1)
      idx = torch.cat([idx,predictions],axis = 1)
    return idx

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
hm = HeadedBigramLanguageModel(vocab_size,32,16)
# hm.to(device)
optimizer = optim.AdamW(params = hm.parameters(), lr = 1e-3)
num_epochs = 10000
for i in range(num_epochs):
  x,y = generate_batch(train_dataset)
  # x.to(device)
  # y.to(device)
  logits,loss = hm(x,y)
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()
  if(i%500 == 0):
    print(loss.item())

3.9133803844451904
2.7331979274749756
2.5434930324554443
2.4078478813171387
2.4238455295562744
2.1879477500915527
2.198587656021118
2.2066116333007812
2.1335368156433105
2.182493209838867
2.0280001163482666
2.0662906169891357
2.0876331329345703
2.0656802654266357
2.145443916320801
2.168588876724243
2.1405746936798096
2.2180380821228027
2.016458749771118
2.0461058616638184


In [None]:
### a single head is not enough; several heads running in parallel work better
### after attention has gathered information, each token also needs to process it
### so create a multi-headed attention layer and a feed-forward layer


### further improvements: projection layers after the heads and the feed-forward, plus dropout on these linear layers
### one more thing: add residual connections by computing x = x + layer(x)
class MultiHeadedAttention(nn.Module):
  def __init__(self,head_size = 16,num_heads = 4,n_embed = 32): # head_size is the total width across all heads
    super().__init__()
    self.heads = nn.ModuleList([Head(n_embed,head_size//num_heads) for _ in range(num_heads)])  ## each head outputs head_size//num_heads features
    self.projection = nn.Linear(head_size,head_size)
    self.drop = nn.Dropout(.4)
  def forward(self,x):
    x = torch.cat([head(x) for head in self.heads],axis = -1) # x -> B T head_size
    x = f.relu(self.projection(x))
    x = self.drop(x)
    return x


## the feed-forward layer maps head_size -> n_embed
class FeedForward(nn.Module):
  def __init__(self,head_size = 16,n_embed = 32):
    super().__init__()
    self.net = nn.Sequential(
        nn.Linear(head_size,n_embed),
        nn.ReLU()
    )
  def forward(self,x):
    return self.net(x)
class Block(nn.Module): ## attention + feed-forward
  def __init__(self,n_embed = 32,head_size = 16,n_heads = 4):
    super().__init__()
    self.sa = MultiHeadedAttention(head_size,n_heads,n_embed)  # self attention
    self.ffwd = FeedForward(head_size,n_embed)  # feed forward
  def forward(self,x):
    return self.ffwd(self.sa(x))

In [None]:
## final model creation
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 64 # how many independent sequences will we process in parallel?
block_size = 256 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 500
learning_rate = 3e-4
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 384
n_head = 6
n_layer = 6
dropout = 0.2
# ------------

torch.manual_seed(1337)


<torch._C.Generator at 0x7fd1eb5e92d0>

In [None]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2023-07-09 09:14:30--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2023-07-09 09:14:30 (173 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [None]:
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out



In [None]:
class Head(nn.Module):
    """
    one head of self-attention
    dropout is also applied to the attention matrix
    """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # input of size (batch, time-step, channels)
        # output of size (batch, time-step, head size)
        B,T,C = x.shape
        k = self.key(x)   # (B,T,hs)
        q = self.query(x) # (B,T,hs)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,hs)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out
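
The computation in `Head.forward` (without dropout) is causal scaled dot-product attention. As a cross-check, the sketch below compares a manual version against PyTorch's built-in `F.scaled_dot_product_attention`, which is assumed to be available (PyTorch 2.0 or newer). The random `dq`, `dk`, `dv` tensors stand in for the outputs of the query, key and value projections.

```python
import torch
import torch.nn.functional as F

gen = torch.Generator().manual_seed(0)
demo_B, demo_T, demo_hs = 2, 5, 8
dq = torch.randn(demo_B, demo_T, demo_hs, generator=gen)
dk = torch.randn(demo_B, demo_T, demo_hs, generator=gen)
dv = torch.randn(demo_B, demo_T, demo_hs, generator=gen)

# Manual computation, following Head.forward without dropout.
scores = dq @ dk.transpose(-2, -1) * demo_hs ** -0.5   # (B, T, T)
causal = torch.tril(torch.ones(demo_T, demo_T))
scores = scores.masked_fill(causal == 0, float("-inf"))
manual_out = F.softmax(scores, dim=-1) @ dv            # (B, T, hs)

# Built-in fused version with a causal mask (requires PyTorch >= 2.0).
builtin_out = F.scaled_dot_product_attention(dq, dk, dv, is_causal=True)

outputs_match = torch.allclose(manual_out, builtin_out, atol=1e-5)
print(outputs_match)
```

The explicit version is easier to read and inspect; the built-in one can be faster on supported hardware.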

In [None]:
class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(head_size * num_heads, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

In [None]:
class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity
        this lets each token process the information gathered by attention
        it first expands the features to 4 times n_embd, then projects them back to the original size
        dropout is applied after the final linear layer
    """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

In [None]:
class Block(nn.Module):
    """ Transformer block: communication followed by computation
        each sub-layer is wrapped in a residual connection and preceded by layer normalization
        this is needed when stacking many layers
     """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x
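
`Block.forward` normalizes the input of each sub-layer and adds the sub-layer's output back to `x`. A small sketch of what `nn.LayerNorm` does in that position: at initialization it gives every token vector zero mean and unit variance over the channel axis. The sizes, the `demo_*` names and the `nn.Linear` placeholder sub-layer are assumptions for illustration.

```python
import torch
import torch.nn as nn

gen = torch.Generator().manual_seed(0)
demo_n_embd = 8
# Token vectors with an arbitrary offset and scale.
demo_tokens = 5.0 + 3.0 * torch.randn(2, 4, demo_n_embd, generator=gen)  # (B, T, C)

demo_ln = nn.LayerNorm(demo_n_embd)  # at initialization: weight = 1, bias = 0
normed = demo_ln(demo_tokens)

# Statistics are taken over the channel axis, separately for every token.
token_means = normed.mean(dim=-1)
token_vars = normed.var(dim=-1, unbiased=False)
print(token_means.abs().max().item(), token_vars.mean().item())

# Pre-norm residual, as in Block.forward, with a placeholder sub-layer.
demo_sublayer = nn.Linear(demo_n_embd, demo_n_embd)
residual_out = demo_tokens + demo_sublayer(demo_ln(demo_tokens))
print(residual_out.shape)
```

Because the sub-layer output is added to `x` rather than replacing it, the input shape is preserved and gradients have a direct path through the sum.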

In [None]:
class GPTLanguageModel(nn.Module):
    """
    the final model, which combines all the pieces above
    it uses two embeddings (token and position)
    which are added together
    then passed through a stack of blocks
    and finally through a linear head that outputs vocab_size logits
    """
    def __init__(self):
        super().__init__()
        # token and position lookup tables, each producing n_embd-dimensional vectors
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

        # better init, not covered in the original GPT video, but important, will cover in followup video
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
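
The size of this model can be worked out by hand from the hyperparameters. The sketch below does the arithmetic layer by layer. It assumes `vocab_size = 65` (the number of distinct characters in the tiny Shakespeare file); with that value the total matches the figure printed by the training cell. `linear_params` and the `demo_*` names are helpers introduced here for illustration.

```python
# Hyperparameters from the notebook; the vocabulary size of 65 is an assumption.
demo_vocab, demo_embd, demo_block, demo_heads, demo_layers = 65, 384, 256, 6, 6
demo_hs = demo_embd // demo_heads  # head size: 64

def linear_params(n_in, n_out, bias=True):
    # weight matrix plus optional bias vector
    return n_in * n_out + (n_out if bias else 0)

# Token and position embedding tables.
embeddings = demo_vocab * demo_embd + demo_block * demo_embd

# One block: key, query and value per head (no bias), then the output projection.
attention = demo_heads * 3 * linear_params(demo_embd, demo_hs, bias=False)
attention += linear_params(demo_heads * demo_hs, demo_embd)
feed_forward = linear_params(demo_embd, 4 * demo_embd) + linear_params(4 * demo_embd, demo_embd)
layer_norms = 2 * (2 * demo_embd)  # two LayerNorms, each with weight and bias
per_block = attention + feed_forward + layer_norms

# All blocks, the final LayerNorm, and the output head.
total = embeddings + demo_layers * per_block + 2 * demo_embd + linear_params(demo_embd, demo_vocab)
print(total / 1e6, "M parameters")  # 10.788929 M parameters
```

The `tril` mask does not appear in the count because it is registered as a buffer, not as a parameter.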

In [None]:

model = GPTLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))
#open('more.txt', 'w').write(decode(m.generate(context, max_new_tokens=10000)[0].tolist()))

10.788929 M parameters
step 0: train loss 4.3759, val loss 4.3710
step 500: train loss 1.7478, val loss 1.9022
step 1000: train loss 1.4023, val loss 1.6221
step 1500: train loss 1.2712, val loss 1.5413
step 2000: train loss 1.1937, val loss 1.5184
step 2500: train loss 1.1288, val loss 1.4963
step 3000: train loss 1.0765, val loss 1.4904
step 3500: train loss 1.0226, val loss 1.5102
step 4000: train loss 0.9695, val loss 1.5283
step 4500: train loss 0.9142, val loss 1.5485
step 4999: train loss 0.8660, val loss 1.5774

But with price-pitch gentlewise-cowaning kiss;
And Capulet, alwo'd with flesh sun
Arrows charges the officet o' the general;
And yet no more than commen all to him all.

BRAKENBURY:
How comes here come along Warwick! what will be the day
To ven yieldest then of figure to groan!
I could know could to provil to our drink?

LADY ANNE:
Would in the Angelo?

PRINCE EDWARD:
Faith, look to no seeky children,
I will not try her heart with this city blood,
In Edward sesting in r