# Bigram Language Model
**Learns how characters follow each other in text.**

---

Download the book **The Wonderful Wizard of Oz** as a `.txt` file from [here](https://github.com/subratamondal1/mini-GPT/raw/main/src/mini_gpt/data/wizard%20of%20oz.txt).

In [1]:
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

# Hyperparameters
block_size = 8 # each training example is a sequence of 8 consecutive tokens from the input data
batch_size = 4 # each batch contains 4 training examples (sequences of 8 consecutive tokens)
learning_rate = 3e-4
max_iterations = 100

cpu


In [2]:
! wget "https://github.com/subratamondal1/mini-GPT/raw/main/src/mini_gpt/data/wizard%20of%20oz.txt"

--2023-09-02 08:03:02--  https://github.com/subratamondal1/mini-GPT/raw/main/src/mini_gpt/data/wizard%20of%20oz.txt
Resolving github.com (github.com)... 20.27.177.113
Connecting to github.com (github.com)|20.27.177.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/subratamondal1/mini-GPT/main/src/mini_gpt/data/wizard%20of%20oz.txt [following]
--2023-09-02 08:03:03--  https://raw.githubusercontent.com/subratamondal1/mini-GPT/main/src/mini_gpt/data/wizard%20of%20oz.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 217429 (212K) [text/plain]
Saving to: ‘wizard of oz.txt’


2023-09-02 08:03:04 (1.79 MB/s) - ‘wizard of oz.txt’ saved [217429/217429]



In [3]:
! tree /kaggle

/kaggle
├── input
├── lib
│   └── kaggle
│       └── gcp.py
├── src
│   └── script.ipynb
└── working
    ├── __notebook__.ipynb
    └── wizard of oz.txt

5 directories, 4 files


Open the **Wizard of Oz** file and read it into memory.

In [4]:
with open(file = "/kaggle/working/wizard of oz.txt", mode = "r", encoding="utf-8") as file:
    text = file.read()
print(f"Length of the text: {len(text)}.\n")

Length of the text: 207797.



In [5]:
# First 200 characters
print(text[:200])


The Wonderful Wizard of Oz

by L. Frank Baum


This book is dedicated to my good friend & comrade
My Wife
L.F.B.


Contents

 Introduction
 Chapter I. The Cyclone
 Chapter II. The Council with the Mu


Collect the unique characters in the text; together they form the vocabulary.

In [6]:
# Collect the sorted set of unique characters in the text.
unique_chars = sorted(set(text))
vocabulary_size = len(unique_chars)
print(f"A set drops duplicates, so only the unique characters remain: {len(unique_chars)}\n\n{unique_chars}")
print(f"\nVocabulary Size is {vocabulary_size}")

A set drops duplicates, so only the unique characters remain: 72

['\n', ' ', '!', '&', '(', ')', ',', '-', '.', '0', '1', '9', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '—', '‘', '’', '“', '”']

Vocabulary Size is 72


## Character Tokenization
A **character-level tokenizer** maps each character of the input text to an integer token, and maps integer tokens back to characters.

In [7]:
string_to_int = {char:index for index, char in enumerate(unique_chars)}
int_to_string = {index:char for index, char in enumerate(unique_chars)}

# Encoder & Decoder
encode = lambda input_Text: [string_to_int[char] for char in input_Text]
decode = lambda input_Data: "".join([int_to_string[integer] for integer in input_Data])

encoded_string = encode("SUBRATA MONDAL")
decoded_string = decode(encode("SUBRATA MONDAL"))

print(f"Encoded Text:\t {encoded_string}\nDecoded Text:\t {decoded_string}")

Encoded Text:	 [33, 35, 16, 32, 15, 34, 15, 1, 27, 29, 28, 18, 15, 26]
Decoded Text:	 SUBRATA MONDAL


In [8]:
def character_tokenization(input_text:str, unique_chars):
    string_to_int = {char:i for i, char in enumerate(unique_chars)}
    int_to_string = {i:char for i, char in enumerate(unique_chars)}

    # Encoder & Decoder
    encode = lambda S: [string_to_int[c] for c in S]
    decode = lambda L: "".join([int_to_string[i] for i in L])

    encoded_string = encode(input_text)
    decoded_string = decode(encode(input_text))
    
    return input_text, encoded_string, decoded_string

In [9]:
character_tokens = character_tokenization(input_text="SUBRATA MONDAL", unique_chars = unique_chars)

print(f"Input Text:\t {character_tokens[0]}\nEncoded Text:\t {character_tokens[1]}\nDecoded Text:\t {character_tokens[2]}\n")

Input Text:	 SUBRATA MONDAL
Encoded Text:	 [33, 35, 16, 32, 15, 34, 15, 1, 27, 29, 28, 18, 15, 26]
Decoded Text:	 SUBRATA MONDAL
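The round trip above can be checked on a smaller vocabulary. The following is a minimal sketch, assuming a hypothetical `sample_text` in place of the book; it also shows that this kind of lookup-table encoder raises a `KeyError` for any character that was not in the text used to build the vocabulary.

```python
# Hypothetical sample text; the notebook builds its vocabulary from the full book.
sample_text = "the wizard of oz"
vocab = sorted(set(sample_text))

string_to_int = {ch: i for i, ch in enumerate(vocab)}
int_to_string = {i: ch for i, ch in enumerate(vocab)}

def encode(s):
    """Map each character to its integer id; raises KeyError for unseen characters."""
    return [string_to_int[ch] for ch in s]

def decode(ids):
    """Map integer ids back to a string."""
    return "".join(int_to_string[i] for i in ids)

ids = encode("wizard")
round_trip = decode(ids)

try:
    encode("Wizard")  # capital "W" is not in this toy vocabulary
    unknown_ok = True
except KeyError:
    unknown_ok = False

print(ids, round_trip, unknown_ok)
```

In the notebook itself the vocabulary comes from the whole book, so this only matters when encoding text that contains characters the book never uses.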



Convert the whole encoded text into a PyTorch tensor.

In [10]:
data = torch.tensor(data=encode(text), dtype=torch.long)
data.shape

torch.Size([207797])

How a bigram language model is trained at a fundamental level.
* Each position in a sequence is paired with the token that immediately follows it, so one block of `block_size + 1` tokens yields `block_size` input/target pairs.

In [11]:
# How Bigram Language Models work at a fundamental level.
prediction_X = data[:block_size] 
target_y = data[1:block_size+1]

for t in range(block_size):
    input_text = prediction_X[:t+1]
    target = target_y[t]
    print(f"When Input is {input_text} then target is {target}")

When Input is tensor([0]) then target is 34
When Input is tensor([ 0, 34]) then target is 48
When Input is tensor([ 0, 34, 48]) then target is 45
When Input is tensor([ 0, 34, 48, 45]) then target is 1
When Input is tensor([ 0, 34, 48, 45,  1]) then target is 37
When Input is tensor([ 0, 34, 48, 45,  1, 37]) then target is 55
When Input is tensor([ 0, 34, 48, 45,  1, 37, 55]) then target is 54
When Input is tensor([ 0, 34, 48, 45,  1, 37, 55, 54]) then target is 44
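The printout shows growing contexts, but a bigram model conditions only on the last token of each context. A count-based version makes this concrete. The sketch below is not the notebook's neural model; it assumes a hypothetical toy corpus and estimates the next-character distribution from pair counts alone.

```python
from collections import Counter, defaultdict

sample_text = "hello hello"  # hypothetical toy corpus

# Count how often each character follows each other character.
pair_counts = defaultdict(Counter)
for prev_char, next_char in zip(sample_text, sample_text[1:]):
    pair_counts[prev_char][next_char] += 1

def next_char_probs(prev_char):
    """Estimate P(next | prev) by normalising the counts for prev_char."""
    counts = pair_counts[prev_char]
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

print(next_char_probs("l"))
```

The embedding table trained later in the notebook plays the same role as `pair_counts`: one row of next-token scores per previous token.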


## Train Validation Split

In [12]:
n = int(0.8 * len(data))
print(f"n is {n}")

train_data = data[:n]
val_data = data[n:]

# generate batches of training or validation data
def get_batch(split): 
    data = train_data if split == "train" else val_data
    ix = torch.randint(high=len(data) - block_size, size = (batch_size,))
    
    x = torch.stack(tensors=[ data[i: i+block_size] for i in ix])
    y = torch.stack(tensors=[ data[i+1: i+block_size+1] for i in ix])
    return x.to(device), y.to(device) # move x,y to gpu if available

# x is inputs, y is target or label
x,y = get_batch("train")
print(f"\ninputs --- {x.shape} --- \n{x}\n")
print(f"targets --- {y.shape} --- \n{y}\n")

n is 166237

inputs --- torch.Size([4, 8]) --- 
tensor([[53,  1, 41, 54, 64, 49, 55, 61],
        [ 1, 54, 45, 45, 44, 52, 45, 59],
        [41, 62, 49, 54, 47,  1, 59, 48],
        [ 0, 18, 55,  1, 52, 45, 60,  1]])

targets --- torch.Size([4, 8]) --- 
tensor([[ 1, 41, 54, 64, 49, 55, 61, 59],
        [54, 45, 45, 44, 52, 45, 59,  1],
        [62, 49, 54, 47,  1, 59, 48, 41],
        [18, 55,  1, 52, 45, 60,  1, 53]])
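The relationship between `x` and `y` is easier to see on synthetic data. This is a small sketch, assuming a hypothetical `toy_data` tensor in which each token's value equals its position, so a shift by one position shows up as a difference of one.

```python
import torch

torch.manual_seed(0)  # only so the sketch is repeatable

toy_data = torch.arange(100)  # stand-in for the encoded book: token i has value i
block_size, batch_size = 8, 4

# Random start offsets, leaving room for the one-step-shifted target window.
ix = torch.randint(high=len(toy_data) - block_size, size=(batch_size,))
x = torch.stack([toy_data[i : i + block_size] for i in ix])
y = torch.stack([toy_data[i + 1 : i + block_size + 1] for i in ix])

# Every target is the token one position to the right of its input.
shifted_by_one = torch.equal(y, x + 1)
overlap_matches = torch.equal(x[:, 1:], y[:, :-1])
print(shifted_by_one, overlap_matches)
```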



## Neural Network
---

In [13]:
import torch.nn as nn
from torch.nn import functional as F

In [14]:
class BigramLanguageModel(nn.Module):
    def __init__(self, vocabulary_size):
        super().__init__()
        self.tokens_embedding_table = nn.Embedding(num_embeddings=vocabulary_size, embedding_dim=vocabulary_size)
        
    def forward(self, input_tokens, targets=None):
        logits = self.tokens_embedding_table(input_tokens)
        
        if targets is None:
            loss = None
        else:
            batch_size, tokens_in_a_sequence, vocabulary_size = logits.shape
            logits = logits.view(batch_size * tokens_in_a_sequence, vocabulary_size)
            targets = targets.view(batch_size * tokens_in_a_sequence)
            loss = F.cross_entropy(input=logits, target=targets)
        
        return logits, loss
    
    def generate(self, input_tokens, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, loss = self(input_tokens) # get predictions
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (batch_size, vocabulary_size)
            # apply softmax to get probabilities
            probabilities = F.softmax(input=logits, dim=-1) # (batch_size, vocabulary_size)
            # sample from the distribution
            input_tokens_next = torch.multinomial(input=probabilities, num_samples=1) # (batch_size, 1)
            # append sampled index to the running sequence
            input_tokens = torch.cat((input_tokens, input_tokens_next), dim=1) # (batch_size, tokens_in_a_sequence + 1)
            
        return input_tokens
    
model = BigramLanguageModel(vocabulary_size=vocabulary_size)
m = model.to(device)

input_tokens = torch.zeros(size=(1,1), dtype=torch.long, device=device)
generated_text = decode(m.generate(input_tokens = input_tokens, max_new_tokens=500)[0].tolist())
print(generated_text)


u—pAguCH!FS?bAWdzM1
et’TqT;J,HE1C!Q)xf!dJQOPVawLsIzXFtXFhh“ ’z:xNU;OC—Z!?B&ZhBvng,G1kopeNhns
RQp“miW0p?HvazhwqLQl;“QtOwmB?zXAQp—)SU-“V&JQtG?V.Lcy—d’NGCl9k’ggTflI
I
peqDjIX&nILlGQBgv ANaVp(Qt,Sjo.drxRfSs—‘V !w1?ZDJn&YF1
&‘rWLlMxaBjwmLGksdGV&,KGVl’iFmCXJ—r!Jh(cGR9wecN jPD;if9nTDb:XtNsBUEQ-fjQt,-lBZZDPw;E
ILBZ0xzU;aL1Rf(qUETSI1q9lUSy—d”hmOIQIo(KVTe?S“s)VnvYrV TSk0d
gZ(zZyvxgTSxahY09lwd9 TdVqQkf?STPB;ze;gJ,GXe;Pgtm.X&j:xKjxfKAm?ZS(b,!p!rkbrje(w—;Lnby—m!hifBjSsQk0MgT‘eADd:DL”ThgIgY(HlGU&kh(z)pb1V’Jd0
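The `view` calls in `forward` exist because `F.cross_entropy` expects the class scores on dimension 1. The sketch below uses random tensors with this notebook's shapes (not the trained model) to suggest that flattening to `(B*T, C)` gives the same loss as moving the class dimension with `permute`, which `F.cross_entropy` also accepts.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
B, T, C = 4, 8, 72  # batch size, sequence length, vocabulary size

logits = torch.randn(B, T, C)                 # stand-in for the embedding lookup
targets = torch.randint(high=C, size=(B, T))  # stand-in for the target tokens

# Option used in the notebook: flatten batch and time into one dimension.
flat_loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))

# Alternative: keep the batch dimension and move the classes to dim 1, giving (B, C, T).
permuted_loss = F.cross_entropy(logits.permute(0, 2, 1), targets)

same_loss = torch.allclose(flat_loss, permuted_loss, atol=1e-5)
print(same_loss)
```

Either form averages the per-token loss over all `B * T` positions; the flattened version is simply the one this notebook uses.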


## Optimizer
---

In [15]:
optimizer = torch.optim.AdamW(params=model.parameters(), lr=learning_rate)

## Training Loop
---

In [16]:
for step in range(max_iterations):
    # sample a batch of data
    xb, yb = get_batch("train")
    
    # forward pass: compute logits and loss for this batch
    logits, loss = model(input_tokens=xb, targets=yb)
    # reset gradients, backpropagate, and update the parameters
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
print(f"Loss is {loss.item()}")

Loss is 4.8312201499938965
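One way to put this number in context is to compare it with the loss of a model that guesses uniformly over the vocabulary, which is `ln(vocabulary_size)`. The sketch below only does that arithmetic, using the vocabulary size and loss value printed earlier in this notebook.

```python
import math

vocabulary_size = 72                 # from the vocabulary cell above
reported_loss = 4.8312201499938965   # value printed by the training cell

# Cross-entropy of a uniform guess over the vocabulary, in nats.
uniform_baseline = math.log(vocabulary_size)
above_baseline = reported_loss > uniform_baseline

print(f"uniform baseline: {uniform_baseline:.4f}")
print(f"reported loss is above baseline: {above_baseline}")
```

The reported loss is still above this baseline of roughly 4.28, which is consistent with the model having barely moved from its random initialisation after 100 steps. Raising `max_iterations` would presumably help, but that is an assumption and not something measured here.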


In [17]:
input_tokens = torch.zeros(size=(1,1), dtype=torch.long, device=device)
generated_text = decode(m.generate(input_tokens = input_tokens, max_new_tokens=500)[0].tolist())
print(generated_text)


ro—rVN&lfE!)WrEV1nVBKdw0&k—Tqv-vWhw 09KN’K!WbAXKqsuZzPqIVk-“b9Sh
eGZDOIz9m!vdRe“TixqLRKxep“9kqeLHILnm”Wj&Q.TEpbYK0-u
GI ID!ELHO
caSb)jNXg!LG—iIpuG)(zlnBTeO‘eO:WZhwqtyA!‘tXTLH?S“x&CJ—Oqo,w’9BZAhwElzvKZyeuDWV&CL.
Y?veY’
Hpv-NNjJbLP!Q1k0,xr
i
-NS“9k—jzJH-NJusSr c!S?E
-ljf’RXDgBWpeMk’!y—cU&k xv“rFKx’1gk0ZWp&iarVrxrZ9QdN‘ep?c“dx-”Z”qKS?excUDcEWLhF.xp”1’YKCdN0hX-”TYR-E1”-Crv)’WaE1x1Bu!”,!y—“bmsdAX”qZm.“Baht.hwpcWav’Y“SBkXJ-NaBq”—‘00aVrgaW;oZXnI Tze) Ym.
”iNGX ?DcM;Rz”PuGiLGVZ’KkEjN,Zf”t xyloW0ZJmsP’w—
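The sampling step inside `generate` (softmax followed by a multinomial draw) can be sketched without PyTorch. This is an illustration under assumed values: `toy_vocab` and `toy_logits` are hypothetical, and `random.choices` stands in for `torch.multinomial`.

```python
import math
import random

def softmax(logits):
    """Convert raw scores to probabilities; subtract the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)  # only so the sketch is repeatable

toy_vocab = ["a", "b", "c"]   # hypothetical three-character vocabulary
toy_logits = [2.0, 1.0, 0.1]  # hypothetical scores for the next character

probs = softmax(toy_logits)

# Draw next characters in proportion to their probabilities.
samples = random.choices(toy_vocab, weights=probs, k=1000)
counts = {ch: samples.count(ch) for ch in toy_vocab}
print(probs, counts)
```

Because the draw is random, an undertrained model with near-uniform probabilities produces text like the output above: every character is almost equally likely at every step.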
