# Bigram Language Model
**Learns how characters follow each other in text.**

---

Download the book **Wizard of Oz** as a `.txt` file from [here](https://github.com/subratamondal1/mini-GPT/raw/main/src/mini_gpt/data/wizard%20of%20oz.txt).

In [1]:
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

# Hyperparameters
block_size = 8 # means that each training example will consist of a sequence of 8 consecutive tokens from input data
batch_size = 4 # each batch will contain 4 training examples (sequence of 8 consecutive tokens)
learning_rate = 3e-4
max_iterations = 1000
eval_iterations = 250
dropout = 0.2

cpu


In [2]:
! wget "https://github.com/subratamondal1/mini-GPT/raw/main/src/mini_gpt/data/wizard%20of%20oz.txt"

--2023-09-05 17:47:03--  https://github.com/subratamondal1/mini-GPT/raw/main/src/mini_gpt/data/wizard%20of%20oz.txt
Resolving github.com (github.com)... 20.27.177.113
Connecting to github.com (github.com)|20.27.177.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/subratamondal1/mini-GPT/main/src/mini_gpt/data/wizard%20of%20oz.txt [following]
--2023-09-05 17:47:03--  https://raw.githubusercontent.com/subratamondal1/mini-GPT/main/src/mini_gpt/data/wizard%20of%20oz.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 217429 (212K) [text/plain]
Saving to: ‘wizard of oz.txt’


2023-09-05 17:47:04 (1.80 MB/s) - ‘wizard of oz.txt’ saved [217429/217429]



In [3]:
! tree /kaggle

/kaggle
├── input
├── lib
│   └── kaggle
│       └── gcp.py
└── working
    └── wizard of oz.txt

4 directories, 2 files


Open the **Wizard of Oz** text file and read its contents.

In [4]:
with open(file = "/kaggle/working/wizard of oz.txt", mode = "r", encoding="utf-8") as file:
    text = file.read()
print(f"Length of the text: {len(text)}.\n")

Length of the text: 207797.



In [5]:
# First 200 characters
print(text[:200])


The Wonderful Wizard of Oz

by L. Frank Baum


This book is dedicated to my good friend & comrade
My Wife
L.F.B.


Contents

 Introduction
 Chapter I. The Cyclone
 Chapter II. The Council with the Mu


Collect the unique `characters` in the text.

In [20]:
# Collect the unique characters in the text.
unique_chars = sorted(set(text)) # a set keeps only unique characters and discards duplicates
vocabulary_size = len(unique_chars)
print(f"A set does not allow duplicates, hence the decrease: {len(text)} ---> {len(unique_chars)}\n\n{unique_chars}")
print(f"\nVocabulary Size is {vocabulary_size}")

A set does not allow duplicates, hence the decrease: 207797 ---> 72

['\n', ' ', '!', '&', '(', ')', ',', '-', '.', '0', '1', '9', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '—', '‘', '’', '“', '”']

Vocabulary Size is 72


## Character Tokenization
A **character-level tokenizer** maps each character of the input text to an integer token and back.

In [7]:
string_to_int = {char:index for index, char in enumerate(unique_chars)}
int_to_string = {index:char for index, char in enumerate(unique_chars)}

# Encoder & Decoder
encode = lambda input_Text: [string_to_int[char] for char in input_Text]
decode = lambda encoded_Data: "".join([int_to_string[integer] for integer in encoded_Data])

encoded_string = encode("SUBRATA MONDAL")
decoded_string = decode(encode("SUBRATA MONDAL"))

print(f"Encoded Text:\t {encoded_string}\nDecoded Text:\t {decoded_string}")

Encoded Text:	 [33, 35, 16, 32, 15, 34, 15, 1, 27, 29, 28, 18, 15, 26]
Decoded Text:	 SUBRATA MONDAL


In [8]:
def character_tokenization(input_text:str, unique_chars):
    string_to_int = {char:i for i, char in enumerate(unique_chars)}
    int_to_string = {i:char for i, char in enumerate(unique_chars)}

    # Encoder & Decoder
    encode = lambda S: [string_to_int[c] for c in S]
    decode = lambda L: "".join([int_to_string[i] for i in L])

    encoded_string = encode(input_text)
    decoded_string = decode(encode(input_text))
    
    return input_text, encoded_string, decoded_string

In [9]:
character_tokens = character_tokenization(input_text="SUBRATA MONDAL", unique_chars = unique_chars)

print(f"Input Text:\t {character_tokens[0]}\nEncoded Text:\t {character_tokens[1]}\nDecoded Text:\t {character_tokens[2]}\n")

Input Text:	 SUBRATA MONDAL
Encoded Text:	 [33, 35, 16, 32, 15, 34, 15, 1, 27, 29, 28, 18, 15, 26]
Decoded Text:	 SUBRATA MONDAL



Convert the whole text data into PyTorch Tensor.
* First Encode into Integers, then Convert to Torch Tensor.

In [10]:
data = torch.tensor(data=encode(text), dtype=torch.long)
data.shape

torch.Size([207797])

How bigram language models work at a fundamental level.
* A bigram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words; it is an n-gram with n=2. A bigram language model predicts the next token from the current token alone. In this notebook the tokens are characters.
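
To make the idea concrete, a bigram model can also be built by simple counting, with no neural network involved. The sketch below is a minimal illustration, not the approach used in the rest of this notebook; the helper names `bigram_counts` and `next_char_probabilities` and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict

def bigram_counts(text):
    """Count how often each character is followed by each other character."""
    counts = defaultdict(Counter)
    for current_char, next_char in zip(text, text[1:]):
        counts[current_char][next_char] += 1
    return counts

def next_char_probabilities(counts, current_char):
    """Normalize the follower counts of one character into probabilities."""
    followers = counts[current_char]
    total = sum(followers.values())
    return {char: count / total for char, count in followers.items()}

sample_text = "hello hello"  # hypothetical toy corpus, not the book text
counts = bigram_counts(sample_text)
probabilities = next_char_probabilities(counts, "l")
print(probabilities)
```

In the toy corpus, `l` is followed by `l` twice and by `o` twice, so each follower gets probability 0.5. The neural model trained later learns a table that plays the same role as these normalized counts.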

In [11]:
# How Bigram Language Models work at a fundamental level.
prediction_X = data[:block_size] 
target_y = data[1:block_size+1]

for t in range(block_size):
    input_text = prediction_X[:t+1]
    target = target_y[t]
    print(f"When Input is {input_text} then target is {target}")

When Input is tensor([0]) then target is 34
When Input is tensor([ 0, 34]) then target is 48
When Input is tensor([ 0, 34, 48]) then target is 45
When Input is tensor([ 0, 34, 48, 45]) then target is 1
When Input is tensor([ 0, 34, 48, 45,  1]) then target is 37
When Input is tensor([ 0, 34, 48, 45,  1, 37]) then target is 55
When Input is tensor([ 0, 34, 48, 45,  1, 37, 55]) then target is 54
When Input is tensor([ 0, 34, 48, 45,  1, 37, 55, 54]) then target is 44


## Train Validation Split

In [21]:
n = int(0.8 * len(data))
print(f"n is {n}")

train_data = data[:n]
val_data = data[n:]

# generate batches of training or validation data
def get_batch(split): 
    data = train_data if split == "train" else val_data
    ix = torch.randint(high=len(data) - block_size, size = (batch_size,))
    
    x = torch.stack(tensors=[ data[i: i+block_size] for i in ix])
    y = torch.stack(tensors=[ data[i+1: i+block_size+1] for i in ix])
    return x.to(device), y.to(device) # move x,y to gpu if available

# x is inputs, y is target or label
x,y = get_batch("train")
print(f"\ninputs --- {x.shape} --- \n{x}\n")
print(f"targets --- {y.shape} --- \n{y}\n")

n is 166237

inputs --- torch.Size([4, 8]) --- 
tensor([[53, 54,  6,  1, 41, 54, 44,  1],
        [ 1, 27, 61, 54, 43, 48, 51, 49],
        [ 1, 58, 55, 55, 53,  8,  1, 23],
        [61,  1, 47, 55,  1, 23,  1, 43]])

targets --- torch.Size([4, 8]) --- 
tensor([[54,  6,  1, 41, 54, 44,  1, 58],
        [27, 61, 54, 43, 48, 51, 49, 54],
        [58, 55, 55, 53,  8,  1, 23, 60],
        [ 1, 47, 55,  1, 23,  1, 43, 45]])



In [13]:
@torch.no_grad() # disables gradient tracking; evaluation needs no gradients, so this saves memory and time
def estimate_loss():
    out = {}
    model.eval()
    for split in ["train","val"]:
        losses = torch.zeros(eval_iterations)
        for k in range(eval_iterations):
            X,y = get_batch(split)
            logits, loss = model(X,y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

## Neural Network
---

In [14]:
import torch.nn as nn
from torch.nn import functional as F

In [24]:
# Define a Bigram Language Model using PyTorch
class BigramLanguageModel(nn.Module):
    def __init__(self, vocabulary_size):
        super().__init__()
        # Create an embedding table to convert token indices into dense vectors
        self.tokens_embedding_table = nn.Embedding(num_embeddings=vocabulary_size, embedding_dim=vocabulary_size)
        
    def forward(self, input_tokens, targets=None):
        # Calculate logits (scores) for input tokens using the embedding table
        logits = self.tokens_embedding_table(input_tokens)
        
        if targets is None:
            loss = None
        else:
            # Reshape logits and targets for cross-entropy loss calculation
            batch_size, tokens_in_a_sequence, vocabulary_size = logits.shape
            logits = logits.view(batch_size * tokens_in_a_sequence, vocabulary_size)
            targets = targets.view(batch_size * tokens_in_a_sequence)
            # Calculate cross-entropy loss
            loss = F.cross_entropy(input=logits, target=targets)
        
        return logits, loss
    
    def generate(self, input_tokens, max_new_tokens):
        for _ in range(max_new_tokens):
            # Get predictions and loss using the forward method
            logits, loss = self.forward(input_tokens)
            # Focus on the last token in the sequence
            logits = logits[:, -1, :]  # Shape: (batch_size, vocabulary_size)
            # Apply softmax to get token probabilities
            probabilities = F.softmax(input=logits, dim=-1)  # Shape: (batch_size, vocabulary_size)
            # Sample the next token from the distribution
            input_tokens_next = torch.multinomial(input=probabilities, num_samples=1)  # Shape: (batch_size, 1)
            # Append the sampled token to the running sequence
            input_tokens = torch.cat((input_tokens, input_tokens_next), dim=1)  # Shape: (batch_size, tokens_in_a_sequence + 1)
            
        return input_tokens

# Create an instance of the Bigram Language Model
model = BigramLanguageModel(vocabulary_size=vocabulary_size)
m = model.to(device)

# Create an initial input token (e.g., start of a sequence)
input_tokens = torch.zeros(size=(1, 1), dtype=torch.long, device=device)

# Generate a sequence of tokens using the Bigram Language Model
generated_tokens = m.generate(input_tokens=input_tokens, max_new_tokens=500)

# Convert generated token indices to text using a decoding function
generated_text = decode(generated_tokens[0].tolist())

# Print the generated text
print(generated_text)


LXHkgdDMHSxZ(YC;FSYV“YVdnvKeuGoBUH!wa&goK !cMl mpvV—&fjffnvW—;u?fD::Ovej“-O.Q
we;iOA-Hh?MlOlrjUYLhEgW0:CwAOAmIKOTLXEuMh
Zs;XhTaE
gzfHuy,eHUjcQd1,ZyZteuG“GR.uVdNsQFinfn1:zzUwuObr?xUsQ-”XrcWeNVsbvaP:Dt)HtnWqv1&KkQW“-nNZt.yMljjHxLK0QZWcknwLXvApbQV“ eYnYnvCdk““pXJ
vZsjHXA9.
n1m.9hjEglDNTgUVRCLevIWqY(!oMh?VahPK?QFSpjhz9xKH9”yZZtqQpJk”f
?SbS”IwN;rBZsTynVKzyIwi&AaAp,KzTnMleO“,K0fEao1WcWsnbLKdR“uL9AXA.CdN
?—NOrv0NC:ULck—KDpj1x?oaoa,vL0eOi,“U(FS 9-ANZyPW-ub1lOKadg—H.VLmN
1SfU&Y!nKzW0DtfU&tKr)QWDssEYPC”YB
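
The text above is gibberish because the model is still untrained when `generate` is called. The sampling step itself (softmax over the scores of the last position, followed by a single multinomial draw) can be illustrated in plain Python. This is a minimal sketch under stated assumptions: the helpers `softmax` and `sample_index` and the toy scores are hypothetical stand-ins for `F.softmax` and `torch.multinomial`.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(score - largest) for score in logits]
    total = sum(exps)
    return [value / total for value in exps]

def sample_index(probabilities, rng):
    """Draw one index in proportion to its probability."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]

toy_logits = [2.0, 1.0, 0.1]  # hypothetical scores for a 3-token vocabulary
probabilities = softmax(toy_logits)
rng = random.Random(0)        # fixed seed so the draw is repeatable
next_token = sample_index(probabilities, rng)
print(probabilities, next_token)
```

Sampling, rather than always taking the highest-scoring token, is what lets repeated calls produce different continuations from the same starting token.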


## Optimizer
---

In [26]:
optimizer = torch.optim.AdamW(params=model.parameters(), lr=learning_rate)
optimizer

AdamW (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    lr: 0.0003
    maximize: False
    weight_decay: 0.01
)

## Training Loop
---

In [29]:
for step in range(max_iterations):
    if step % eval_iterations == 0:
        # Estimate and print the training and validation losses
        losses = estimate_loss()
        print(f"Step is {step}, Train Loss is {losses['train']:.4f}, Val Loss is {losses['val']:.4f}")
    
    # Get a batch of training data (input and target)
    xb, yb = get_batch("train")
    
    # Forward pass: Calculate logits (scores) and loss for the batch using the model
    logits, loss = model(input_tokens=xb, targets=yb)
    
    # Backpropagation: Zero the gradients, perform backpropagation, and update the model's parameters
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# After the training iterations are completed, print the loss of the last training batch
print(f"Loss is {loss.item()}")

Step is 0, Train Loss is 4.7941, Val Loss is 4.7787
Step is 250, Train Loss is 4.7368, Val Loss is 4.7234
Step is 500, Train Loss is 4.6710, Val Loss is 4.6709
Step is 750, Train Loss is 4.6006, Val Loss is 4.6069
Loss is 4.562045574188232
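
As a reference point for the losses above, it helps to compute the cross-entropy of a model that guesses uniformly over the vocabulary. The short calculation below assumes the vocabulary size of 72 reported earlier in this notebook; the variable name `uniform_loss` is introduced here for illustration.

```python
import math

vocabulary_size = 72  # number of unique characters reported earlier in this notebook

# Cross-entropy of a model that assigns probability 1/vocabulary_size to every character
uniform_loss = -math.log(1.0 / vocabulary_size)
print(f"Uniform-guess loss: {uniform_loss:.4f}")
```

The result is about 4.28. The reported losses (roughly 4.79 falling to 4.56) are still above that value, which suggests that 1000 steps at a learning rate of 3e-4 have only begun to train the model. This likely also explains why the text generated after training still looks random.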


In [27]:
input_tokens = torch.zeros(size=(1,1), dtype=torch.long, device=device)
generated_chars = decode(m.generate(input_tokens = input_tokens, max_new_tokens=500)[0].tolist())
print(generated_chars)


.Km.x(P
fs(v“’vV’jKzUWd“zUtZ dN?ogG:cdCmfCrbi&ggtgs1pxPH,U,L
1fbj,Kl:w1svRCvk;lu—qC(JnvNhej,u JKeQ0!G”XRSzXEZcvWs1k(u rF—x0SvZSff“p1KdaSNvHj“:;?xbwv);-XFFJo )FXFgpLMV
h-BQW 
wLJZMeVG1xPbbsVAl9W-&fbKQrqy0qvCD‘.YLX:hV”uqLDtxRS‘)FSKHpx
Jn0:!cWoLMe
JUa?y“:?yUMJU0:rowvev“’hrJ’UxrhZrquGS‘!9;L0KfPDkd&bp JwT“PEE”YoiizU&CdNsapB.9”L;ynv””u&Du&R1;wt&pP“QFTc“g—nD.vnveT?TdRTn1WvnNTdAJBbZL9DMf uGA..1apjUXHreNMxxAbiVEfap :-CCCeYl:9AeSLfM)trji1fqn;ArX!W0k“““KNxuuRSzErg.uOse0jVkYsYNZW0ni”WRueXE9bX”RSsxoMUnHZSODc


---
# Mini GPT
**Character Level Language Model**

---

In [30]:
!wget "https://github.com/karpathy/char-rnn/raw/master/data/tinyshakespeare/input.txt"

--2023-09-05 18:56:14--  https://github.com/karpathy/char-rnn/raw/master/data/tinyshakespeare/input.txt
Resolving github.com (github.com)... 20.27.177.113
Connecting to github.com (github.com)|20.27.177.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt [following]
--2023-09-05 18:56:14--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2023-09-05 18:56:15 (5.51 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [31]:
!tree /kaggle

/kaggle
├── input
├── lib
│   └── kaggle
│       └── gcp.py
└── working
    ├── input.txt
    └── wizard of oz.txt

4 directories, 3 files


In [32]:
data_path = "/kaggle/working/input.txt"

In [33]:
with open(file=data_path, mode="r", encoding="utf-8") as f:
    text = f.read()

In [35]:
# length
print(f"Length of the dataset in characters: {len(text)}")

Length of the dataset in characters: 1115394


In [50]:
# text
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



## Vocabulary

In [46]:
# All the unique characters in the dataset
unique_chars = sorted(list(set(text))) 
vocab_size = len(unique_chars)

print(f"The vocabulary size is {vocab_size}. It is far smaller than the 1.1 million characters in the dataset because the vocabulary contains only unique characters.")

The vocabulary size is 65. It is far smaller than the 1.1 million characters in the dataset because the vocabulary contains only unique characters.


## Tokenizer

A simple tokenizer built from an encoder and a decoder.

In [48]:
# map characters to integers
string_to_int = {char:index for index, char in enumerate(unique_chars)}
# map int (encoded string) back to text string
int_to_string = {index:char for index, char in enumerate(unique_chars)}

# Encoder converts from text string to int
encode = lambda input_Text: [string_to_int[char] for char in input_Text]
# Decoder converts Encoded string (int) back to text string
decode = lambda encoded_Data: "".join([int_to_string[integer] for integer in encoded_Data])

encoded_string = encode("SUBRATA MONDAL")
decoded_string = decode(encode("SUBRATA MONDAL"))

print(f"Encoded Text:\t {encoded_string}\nDecoded Text:\t {decoded_string}")

Encoded Text:	 [31, 33, 14, 30, 13, 32, 13, 1, 25, 27, 26, 16, 13, 24]
Decoded Text:	 SUBRATA MONDAL


Note that:
* OpenAI uses the [tiktoken](https://github.com/openai/tiktoken) tokenizer.
* Google uses the [sentencepiece](https://github.com/google/sentencepiece) tokenizer.

In [51]:
# Let's now encode the entire text dataset and store it in torch.Tensor
import torch
data = torch.tensor(encode(text),dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000]) # let's see the first 1000 characters

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      