Hello everyone! So far on this journey, we’ve managed to build our very own version of GPT-2, and later brought in the open-source weights provided by OpenAI. We walked through the idea of pretraining—imagining what it would take to train on massive datasets for countless epochs with enormous resources—before grounding ourselves with the already available pretrained weights. From there, we successfully fine-tuned the model to work as an email classifier, giving us our first real taste of adapting a large language model to a specific task.

Now, we’re ready to take the next step: instruction fine-tuning. This is another widely used and practical approach to tailoring LLMs. And the good news is, since we’ve already built a strong understanding of the key concepts and mechanisms, this next stage will feel familiar. It’s just the natural continuation of our journey—one more piece of the puzzle in shaping these models to follow human instructions more effectively.

#### Data Preparation

In [1]:
import json
import os
import ssl
import urllib.request


def download_and_load_file(file_path, url):
    # NOTE: certificate verification is disabled below, which is insecure;
    # prefer the default SSL context if your environment allows it.
    ssl_context = ssl.create_default_context()
    ssl_context.check_hostname = False
    ssl_context.verify_mode = ssl.CERT_NONE

    # Download the file only if it is not already cached locally
    if not os.path.exists(file_path):
        with urllib.request.urlopen(url, context=ssl_context) as response:
            text_data = response.read().decode("utf-8")
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(text_data)

    with open(file_path, "r", encoding="utf-8") as file:
        data = json.load(file)

    return data


file_path = "instruction-data.json"
url = (
    "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch"
    "/main/ch07/01_main-chapter-code/instruction-data.json"
)

data = download_and_load_file(file_path, url)
print("Number of entries:", len(data))

Number of entries: 1100


In [2]:
data[1]

{'instruction': 'Edit the following sentence for grammar.',
 'input': 'He go to the park every day.',
 'output': 'He goes to the park every day.'}

When it comes to instruction fine-tuning, how we prepare the data matters as much as the training itself. A widely used approach comes from Stanford’s Alpaca project, which formats each example with a fixed prompt template: a short preamble, followed by `### Instruction:`, an optional `### Input:`, and `### Response:` sections. This template has since become widely known as the Alpaca format.

By organizing the data in this way, the model learns not only to generate text but also to follow instructions more reliably. Because the format is so widely used, we adopt it for our own instruction fine-tuning.

You can read more about the Alpaca format in the project repository: https://github.com/tatsu-lab/stanford_alpaca

Alpaca format:

In [3]:
# Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

# ### Instruction:
# {instruction}

# ### Input:
# {input}

# ### Response:

In [4]:
def format_input(entry):
    # Preamble plus the instruction section of the Alpaca-style prompt
    instruction_text = (
        "Below is an instruction that describes a task, paired with an input that provides further context. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    # The input section is added only when the entry has a non-empty input
    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
    return instruction_text + input_text

In [5]:
print(format_input(data[1]))

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Edit the following sentence for grammar.

### Input:
He go to the park every day.


In [6]:
print(format_input(data[999]))

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
What is an antonym of 'complicated'?


In [7]:
model_input = format_input(data[1])
desired_response = f"\n\n### Response:\n{data[1]['output']}"

print(model_input + desired_response)

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Edit the following sentence for grammar.

### Input:
He go to the park every day.

### Response:
He goes to the park every day.


Data Splitting:

In [8]:
train_portion = int(len(data) * 0.85)  # 85% for training
test_portion = int(len(data) * 0.1)    # 10% for testing
val_portion = len(data) - train_portion - test_portion  # Remaining 5% for validation

train_data = data[:train_portion]
test_data = data[train_portion:train_portion + test_portion]
val_data = data[train_portion + test_portion:]

In [9]:
print("Training set length:", len(train_data))
print("Validation set length:", len(val_data))
print("Test set length:", len(test_data))

Training set length: 935
Validation set length: 55
Test set length: 110


In [10]:
len(data)

1100

In [11]:
import torch
from torch.utils.data import Dataset


class InstructionDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data

        # Pre-tokenize texts
        self.encoded_texts = []
        for entry in data:
            instruction_plus_input = format_input(entry)
            response_text = f"\n\n### Response:\n{entry['output']}"
            full_text = instruction_plus_input + response_text
            self.encoded_texts.append(
                tokenizer.encode(full_text)
            )

    def __getitem__(self, index):
        return self.encoded_texts[index]

    def __len__(self):
        return len(self.data)

In [12]:
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")

In [13]:
def custom_collate_fn(
    batch,
    pad_token_id=50256,
    ignore_index=-100,
    allowed_max_length=None,
    device="cpu"
):
    # Find the longest sequence in the batch
    batch_max_length = max(len(item)+1 for item in batch)

    # Pad and prepare inputs and targets
    inputs_lst, targets_lst = [], []

    for item in batch:
        new_item = item.copy()
        # Add an <|endoftext|> token
        new_item += [pad_token_id]
        # Pad sequences to max_length
        padded = (
            new_item + [pad_token_id] *
            (batch_max_length - len(new_item))
        )
        inputs = torch.tensor(padded[:-1])  # Truncate the last token for inputs
        targets = torch.tensor(padded[1:])  # Shift +1 to the right for targets

        # New: Replace all but the first padding tokens in targets by ignore_index
        mask = targets == pad_token_id
        indices = torch.nonzero(mask).squeeze()
        if indices.numel() > 1:
            targets[indices[1:]] = ignore_index

        # New: Optionally truncate to maximum sequence length
        if allowed_max_length is not None:
            inputs = inputs[:allowed_max_length]
            targets = targets[:allowed_max_length]

        inputs_lst.append(inputs)
        targets_lst.append(targets)

    # Convert list of inputs and targets to tensors and transfer to target device
    inputs_tensor = torch.stack(inputs_lst).to(device)
    targets_tensor = torch.stack(targets_lst).to(device)

    return inputs_tensor, targets_tensor

When working with instruction fine-tuning, the way we prepare batches of data for training becomes critical. Unlike simple classification tasks, here we’re dealing with variable-length sequences of text (instructions, inputs, and outputs). If we simply feed them as-is, the model will face issues—different sequence lengths don’t align well in a batch, and the loss function may mistakenly compute gradients on padding tokens.

That’s why we wrote this custom `collate_fn`. Let’s break down what it does and why it matters:

**Dynamic Padding**
- Each sequence can be a different length.
- We find the longest one and pad the others to match.
- Padding is done with <|endoftext|> (not zeros) to stay consistent with GPT-2’s tokenizer.

**Input–Target Alignment**
- Language models learn by predicting the next token.
- So, we prepare: Inputs → all tokens except the last one. Targets → all tokens except the first one (shifted by 1).

**Masking Padding**
- Predicting padding is useless.
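- We replace every padding token in the targets except the first with ignore_index=-100. The first one is kept so the model still learns to emit <|endoftext|> at the end of a response.
- PyTorch’s CrossEntropyLoss skips positions marked with -100 by default, so the loss covers only real tokens and that single end-of-text token.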
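
To make the three steps above concrete, here is a minimal trace for a single short sequence using plain Python lists. The helper name `trace_collate_steps` is hypothetical and only for illustration; it assumes the same defaults as `custom_collate_fn` (padding ID 50256, ignore index -100) and a batch whose longest sequence has 5 tokens.

```python
# Illustrative trace of the collate steps for one sequence, using plain lists.
# PAD_ID and IGNORE mirror the defaults of custom_collate_fn;
# trace_collate_steps is a hypothetical helper, not part of the notebook.
PAD_ID = 50256
IGNORE = -100


def trace_collate_steps(item, batch_max_length):
    # 1) Append one end-of-text token, then pad up to the batch length
    padded = item + [PAD_ID]
    padded = padded + [PAD_ID] * (batch_max_length - len(padded))

    # 2) Inputs drop the last token; targets drop the first (shift by one)
    inputs = padded[:-1]
    targets = padded[1:]

    # 3) Keep the first padding token in the targets, mask the rest
    seen_pad = False
    masked = []
    for tok in targets:
        if tok == PAD_ID and seen_pad:
            masked.append(IGNORE)
        else:
            masked.append(tok)
            seen_pad = seen_pad or tok == PAD_ID
    return inputs, masked


# Assumed batch: the longest sequence has 5 tokens, so batch_max_length = 5 + 1
demo_inputs, demo_targets = trace_collate_steps([5, 6], batch_max_length=6)
print(demo_inputs)   # [5, 6, 50256, 50256, 50256]
print(demo_targets)  # [6, 50256, -100, -100, -100]
```

The tensor version in `custom_collate_fn` does the same work with a boolean mask instead of a Python loop.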

In [14]:
inputs_1 = [0, 1, 2, 3, 4]
inputs_2 = [5, 6]
inputs_3 = [7, 8, 9]

batch = (
    inputs_1,
    inputs_2,
    inputs_3
)

inputs, targets = custom_collate_fn(batch)
print(inputs)
print(targets)

tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])
tensor([[    1,     2,     3,     4, 50256],
        [    6, 50256,  -100,  -100,  -100],
        [    8,     9, 50256,  -100,  -100]])


<div class="alert alert-block alert-info">
    
The modified collate function works as expected, altering the target list by inserting the
token ID -100.

What is the logic behind this adjustment? Let's explore the underlying
purpose of this modification.

</div>

<div class="alert alert-block alert-warning">

For demonstration purposes, consider the following simple and self-contained example
where each output logit can correspond to a potential token from the model's vocabulary.

Here's how we might calculate the cross entropy loss during
training when the model predicts a sequence of tokens, similar to what we have done in
chapter 5 when pretraining the model, or in chapter 6 when finetuning the model for
classification:

</div>

In [15]:
logits_1 = torch.tensor(
    [[-1.0, 1.0],  # 1st training example
     [-0.5, 1.5]]  # 2nd training example
)
targets_1 = torch.tensor([0, 1])


loss_1 = torch.nn.functional.cross_entropy(logits_1, targets_1)
print(loss_1)

tensor(1.1269)


<div class="alert alert-block alert-success">

Adding an additional token ID will, as we would expect, affect the loss calculation.
</div>

In [16]:
logits_2 = torch.tensor(
    [[-1.0, 1.0],
     [-0.5, 1.5],
     [-0.5, 1.5]]  # New 3rd training example
)
targets_2 = torch.tensor([0, 1, 1])

loss_2 = torch.nn.functional.cross_entropy(logits_2, targets_2)
print(loss_2)

tensor(0.7936)


<div class="alert alert-block alert-success">

Now, let's get to the interesting part and see what happens if we replace the third target
token ID with -100:
</div>

In [17]:

targets_3 = torch.tensor([0, 1, -100])

loss_3 = torch.nn.functional.cross_entropy(logits_2, targets_3)
print(loss_3)
print("loss_1 == loss_3:", loss_1 == loss_3)

tensor(1.1269)
loss_1 == loss_3: tensor(True)


<div class="alert alert-block alert-warning">

Based on this result, we can see that the resulting loss on these 3 training examples is
identical to the loss we calculated from the 2 training examples earlier.

In other words, the
cross entropy loss function ignored the third entry in the targets_3 vector, the token ID
corresponding to -100.

(Interested readers can try replacing the -100 value with another
token ID that is not 0 or 1; this results in an error, because the logits cover only two classes.)

</div>

<div class="alert alert-block alert-warning">

So, what's so special about -100 that it's ignored by the cross entropy loss? The default
setting of the cross entropy function in PyTorch is cross_entropy(...,
ignore_index=-100).

This means that it ignores targets labeled with -100.

</div>
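
To see why the ignored entry leaves the loss unchanged, the following sketch recomputes the three losses above by hand in plain Python. It is an illustration of the averaging behavior, not PyTorch's actual implementation; `manual_cross_entropy` is a hypothetical helper and assumes at least one target is not ignored.

```python
import math

IGNORE_INDEX = -100  # same value as PyTorch's default ignore_index


def manual_cross_entropy(logits, targets, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over the targets that are not ignored."""
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # no loss term, and not counted in the mean
        log_sum_exp = math.log(sum(math.exp(v) for v in row))
        losses.append(log_sum_exp - row[target])
    return sum(losses) / len(losses)


logits = [[-1.0, 1.0], [-0.5, 1.5], [-0.5, 1.5]]

print(round(manual_cross_entropy(logits[:2], [0, 1]), 4))    # 1.1269
print(round(manual_cross_entropy(logits, [0, 1, 1]), 4))     # 0.7936
print(round(manual_cross_entropy(logits, [0, 1, -100]), 4))  # 1.1269
```

The ignored position is dropped from both the sum and the count, which is why the third result matches the two-example loss exactly.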

<div class="alert alert-block alert-warning">

In this chapter, we take advantage of this ignore_index to ignore the additional end-of-text (padding) tokens that we used to pad the training examples to have the same length in
each batch.

</div>

<div class="alert alert-block alert-warning">

However, we want to keep one 50256 (end-of-text)
token ID in the targets because it helps the LLM to learn to generate end-of-text tokens,
which we can use as an indicator that a response is complete.

</div>
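
As a rough illustration of how the end-of-text token can act as a stop signal, here is a toy generation loop in plain Python. The function `generate_until_eos` and the scripted token stream are hypothetical stand-ins for a fine-tuned model, used only to show the control flow.

```python
EOS_ID = 50256  # GPT-2's <|endoftext|> token ID


def generate_until_eos(next_token_fn, prompt_ids, max_new_tokens, eos_id=EOS_ID):
    """Append tokens from next_token_fn until eos_id appears or the budget runs out."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        token = next_token_fn(ids)
        if token == eos_id:
            break  # the response is complete; the marker itself is not appended
        ids.append(token)
    return ids


# Hypothetical stand-in for a model: emits 7, 8, 9, then end-of-text
scripted = iter([7, 8, 9, EOS_ID, 10, 11])
result = generate_until_eos(lambda ids: next(scripted), [1, 2], max_new_tokens=10)
print(result)  # [1, 2, 7, 8, 9]
```

A model that never learned to produce the end-of-text token would instead run until `max_new_tokens` is exhausted.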

**MASKING TARGET TOKEN IDS**

<div class="alert alert-block alert-info">
    
In addition to masking out padding tokens, it is also common to mask out the target
token IDs that correspond to the instruction.

</div>

<div class="alert alert-block alert-success">

By masking out the target token IDs that correspond to the instruction, the cross entropy loss is computed only for the response target
IDs.

As a result, the model is trained to focus on generating
accurate responses rather than also memorizing instructions, which can help
reduce overfitting.
</div>

<div class="alert alert-block alert-info">
    
Currently, researchers are divided on whether masking the instructions is universally beneficial during instruction finetuning.

For instance, a recent
paper titled "Instruction Tuning With Loss Over Instructions" demonstrated that not
masking the instructions benefits the LLM performance.

In this chapter, we do not apply masking and leave it as an optional
exercise for the reader.

</div>
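
As a possible starting point for that exercise, the sketch below shows one way instruction masking could work on a single target sequence. It is not the notebook's implementation: `mask_instruction_targets` is a hypothetical helper, and `instruction_length` is assumed to be the number of tokens in the formatted prompt (and assumed to match the prompt's prefix inside the fully tokenized example).

```python
IGNORE_INDEX = -100


def mask_instruction_targets(targets, instruction_length, ignore_index=IGNORE_INDEX):
    """Return a copy of targets with the instruction positions set to ignore_index.

    targets is the token sequence shifted left by one, so the first
    instruction_length - 1 target positions still belong to the instruction.
    """
    masked = list(targets)
    num_masked = max(instruction_length - 1, 0)
    for i in range(min(num_masked, len(masked))):
        masked[i] = ignore_index
    return masked


# Toy sequence: 3 instruction tokens (10, 11, 12), 2 response tokens, end-of-text
tokens = [10, 11, 12, 20, 21, 50256]
shifted_targets = tokens[1:]  # [11, 12, 20, 21, 50256]
masked_targets = mask_instruction_targets(shifted_targets, instruction_length=3)
print(masked_targets)  # [-100, -100, 20, 21, 50256]
```

With this masking, the loss would be computed only on the response tokens and the closing end-of-text token.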

In [18]:
import torch

In [19]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("Device:", device)

Device: cuda


In [20]:
from torch.utils.data import DataLoader


num_workers = 0
batch_size = 8

torch.manual_seed(123)

train_dataset = InstructionDataset(train_data, tokenizer)
train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    collate_fn=custom_collate_fn,
    shuffle=True,
    drop_last=True,
    num_workers=num_workers
)

val_dataset = InstructionDataset(val_data, tokenizer)
val_loader = DataLoader(
    val_dataset,
    batch_size=batch_size,
    collate_fn=custom_collate_fn,
    shuffle=False,
    drop_last=False,
    num_workers=num_workers
)

test_dataset = InstructionDataset(test_data, tokenizer)
test_loader = DataLoader(
    test_dataset,
    batch_size=batch_size,
    collate_fn=custom_collate_fn,
    shuffle=False,
    drop_last=False,
    num_workers=num_workers
)

In [21]:
# Use names that do not shadow Python's built-in input()
for inputs, targets in train_loader:
    print(inputs.shape, targets.shape)

torch.Size([8, 70]) torch.Size([8, 70])
torch.Size([8, 85]) torch.Size([8, 85])
torch.Size([8, 82]) torch.Size([8, 82])
torch.Size([8, 77]) torch.Size([8, 77])
torch.Size([8, 74]) torch.Size([8, 74])
torch.Size([8, 81]) torch.Size([8, 81])
torch.Size([8, 89]) torch.Size([8, 89])
torch.Size([8, 76]) torch.Size([8, 76])
torch.Size([8, 71]) torch.Size([8, 71])
torch.Size([8, 84]) torch.Size([8, 84])
torch.Size([8, 71]) torch.Size([8, 71])
torch.Size([8, 77]) torch.Size([8, 77])
torch.Size([8, 76]) torch.Size([8, 76])
torch.Size([8, 86]) torch.Size([8, 86])
torch.Size([8, 78]) torch.Size([8, 78])
torch.Size([8, 88]) torch.Size([8, 88])
torch.Size([8, 80]) torch.Size([8, 80])
torch.Size([8, 75]) torch.Size([8, 75])
torch.Size([8, 92]) torch.Size([8, 92])
torch.Size([8, 77]) torch.Size([8, 77])
torch.Size([8, 89]) torch.Size([8, 89])
torch.Size([8, 80]) torch.Size([8, 80])
torch.Size([8, 78]) torch.Size([8, 78])
torch.Size([8, 74]) torch.Size([8, 74])
torch.Size([8, 77]) torch.Size([8, 77])


In [22]:
len(train_loader)

116

In [23]:
len(test_loader)

14

In [24]:
len(val_loader)

7

In [25]:
import tiktoken

def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0) # add batch dimension
    return encoded_tensor

def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0) # remove batch dimension
    return tokenizer.decode(flat.tolist())

In [26]:
import torch
import torch.nn as nn

In [27]:
GPT_CONFIG_355M = {
    'vocab_size': 50257,      # GPT-2 vocab size
    'emb_dim': 1024,          # Hidden size
    'context_length': 1024,   # Max sequence length
    'n_heads': 16,            # Attention heads
    'n_layers': 24,           # Number of transformer blocks
    'drop_rate': 0.1,         # Dropout (same as small)
    'qkv_bias': True,         # Bias terms in QKV projections
    'model_name': 'gpt2-medium'
}


In [28]:
class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) *
            (x + 0.044715 * torch.pow(x, 3))
        ))

In [29]:
class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        return self.layers(x)

In [30]:
class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift

In [31]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert (d_out % num_heads == 0), \
            "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length),
                       diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape

        keys = self.W_key(x) # Shape: (b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)

        # We implicitly split the matrix by adding a `num_heads` dimension
        # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head

        # Original mask truncated to the number of tokens and converted to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Use the mask to fill attention scores
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Shape: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2)

        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec) # optional projection

        return context_vec

In [32]:
class TransformerBlock(nn.Module):
  def __init__(self, cfg):
      super().__init__() # d_in, d_out, context_length, dropout, num_heads, qkv_bias=False
      self.attn = MultiHeadAttention(cfg['emb_dim'], cfg['emb_dim'], cfg['context_length'], cfg['drop_rate'], cfg['n_heads'], cfg['qkv_bias'])
      self.ffn = FeedForward(cfg)
      self.norm1 = nn.LayerNorm(cfg['emb_dim'])
      self.norm2 = nn.LayerNorm(cfg['emb_dim'])
      self.dropout = nn.Dropout(cfg['drop_rate'])

  def forward(self, x):
      shortcut = x
      x = self.norm1(x)
      x = self.attn(x)
      x = self.dropout(x)
      x = shortcut + x

      shortcut = x
      x = self.norm2(x)
      x = self.ffn(x)
      x = self.dropout(x)
      x = shortcut + x

      return x

In [33]:
class GPTModel(nn.Module):
  def __init__(self, cfg):
      super().__init__()
      self.token_embedding = nn.Embedding(cfg['vocab_size'], cfg['emb_dim'])
      self.positional_embedding = nn.Embedding(cfg['context_length'], cfg['emb_dim'])
      self.dropout = nn.Dropout(cfg['drop_rate'])

      self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])

      self.final_norm = nn.LayerNorm(cfg['emb_dim'])
      self.lm_head = nn.Linear(cfg['emb_dim'], cfg['vocab_size'], bias=False)

  def forward(self, x):
      batch_size, seq_len = x.shape
      token_embeddings = self.token_embedding(x)
      position_embeddings = self.positional_embedding(torch.arange(seq_len, device=x.device))
      input_embeddings = token_embeddings + position_embeddings
      x = self.dropout(input_embeddings)
      x = self.trf_blocks(x)
      x = self.final_norm(x)
      x = self.lm_head(x)
      return x

In [34]:
def generate(model, idx, max_new_tokens, context_size, temperature=0.0, top_k=None, eos_id=None):

    # For-loop is the same as before: Get logits, and only focus on last time step
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]  # Crop the context to the supported size
        with torch.no_grad():
            logits = model(idx_cond)
        logits = logits[:, -1, :]

        # New: Filter logits with top_k sampling
        if top_k is not None:
            # Keep only top_k values
            top_logits, _ = torch.topk(logits, top_k)
            min_val = top_logits[:, -1:]  # Shape (batch_size, 1) so the comparison broadcasts per row
            logits = torch.where(logits < min_val, torch.tensor(float("-inf")).to(logits.device), logits)

        # New: Apply temperature scaling
        if temperature > 0.0:
            logits = logits / temperature

            # Apply softmax to get probabilities
            probs = torch.softmax(logits, dim=-1)  # (batch_size, vocab_size)

            # Sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (batch_size, 1)

        # Otherwise same as before: get idx of the vocab entry with the highest logits value
        else:
            idx_next = torch.argmax(logits, dim=-1, keepdim=True)  # (batch_size, 1)

        if idx_next == eos_id:  # Stop generating early if end-of-sequence token is encountered and eos_id is specified
            break

        # Same as before: append sampled index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)  # (batch_size, num_tokens+1)

    return idx

In [35]:
!pip install "tensorflow>=2.15.0" "tqdm>=4.66"

In [36]:
import tensorflow as tf
import tqdm

print("TensorFlow version:", tf.__version__)
print("tqdm version:", tqdm.__version__)

TensorFlow version: 2.19.0
tqdm version: 4.67.1


In [37]:
def assign(left, right):
    if left.shape != right.shape:
        raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}")
    return torch.nn.Parameter(torch.tensor(right, device=left.device))

In [38]:
import numpy as np

def load_weights_into_gpt(gpt, params):
    gpt.positional_embedding.weight = assign(gpt.positional_embedding.weight, params['wpe'])
    gpt.token_embedding.weight = assign(gpt.token_embedding.weight, params['wte'])

    for b in range(len(params["blocks"])):
        q_w, k_w, v_w = np.split(
            (params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1)
        gpt.trf_blocks[b].attn.W_query.weight = assign(
            gpt.trf_blocks[b].attn.W_query.weight, q_w.T)
        gpt.trf_blocks[b].attn.W_key.weight = assign(
            gpt.trf_blocks[b].attn.W_key.weight, k_w.T)
        gpt.trf_blocks[b].attn.W_value.weight = assign(
            gpt.trf_blocks[b].attn.W_value.weight, v_w.T)

        q_b, k_b, v_b = np.split(
            (params["blocks"][b]["attn"]["c_attn"])["b"], 3, axis=-1)
        gpt.trf_blocks[b].attn.W_query.bias = assign(
            gpt.trf_blocks[b].attn.W_query.bias, q_b)
        gpt.trf_blocks[b].attn.W_key.bias = assign(
            gpt.trf_blocks[b].attn.W_key.bias, k_b)
        gpt.trf_blocks[b].attn.W_value.bias = assign(
            gpt.trf_blocks[b].attn.W_value.bias, v_b)

        gpt.trf_blocks[b].attn.out_proj.weight = assign(
            gpt.trf_blocks[b].attn.out_proj.weight,
            params["blocks"][b]["attn"]["c_proj"]["w"].T)
        gpt.trf_blocks[b].attn.out_proj.bias = assign(
            gpt.trf_blocks[b].attn.out_proj.bias,
            params["blocks"][b]["attn"]["c_proj"]["b"])

        gpt.trf_blocks[b].ffn.layers[0].weight = assign(
            gpt.trf_blocks[b].ffn.layers[0].weight,
            params["blocks"][b]["mlp"]["c_fc"]["w"].T)
        gpt.trf_blocks[b].ffn.layers[0].bias = assign(
            gpt.trf_blocks[b].ffn.layers[0].bias,
            params["blocks"][b]["mlp"]["c_fc"]["b"])
        gpt.trf_blocks[b].ffn.layers[2].weight = assign(
            gpt.trf_blocks[b].ffn.layers[2].weight,
            params["blocks"][b]["mlp"]["c_proj"]["w"].T)
        gpt.trf_blocks[b].ffn.layers[2].bias = assign(
            gpt.trf_blocks[b].ffn.layers[2].bias,
            params["blocks"][b]["mlp"]["c_proj"]["b"])

        gpt.trf_blocks[b].norm1.weight = assign(
            gpt.trf_blocks[b].norm1.weight,
            params["blocks"][b]["ln_1"]["g"])
        gpt.trf_blocks[b].norm1.bias = assign(
            gpt.trf_blocks[b].norm1.bias,
            params["blocks"][b]["ln_1"]["b"])
        gpt.trf_blocks[b].norm2.weight = assign(
            gpt.trf_blocks[b].norm2.weight,
            params["blocks"][b]["ln_2"]["g"])
        gpt.trf_blocks[b].norm2.bias = assign(
            gpt.trf_blocks[b].norm2.bias,
            params["blocks"][b]["ln_2"]["b"])


    gpt.final_norm.weight = assign(gpt.final_norm.weight, params["g"])
    gpt.final_norm.bias = assign(gpt.final_norm.bias, params["b"])
    gpt.lm_head.weight = assign(gpt.lm_head.weight, params["wte"])

In [40]:
from gpt_download3 import download_and_load_gpt2

BASE_CONFIG = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "drop_rate": 0.0,        # Dropout rate
    "qkv_bias": True         # Query-key-value bias
}

model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

CHOOSE_MODEL = "gpt2-medium (355M)"

BASE_CONFIG.update(model_configs[CHOOSE_MODEL])

model_size = CHOOSE_MODEL.split(" ")[-1].lstrip("(").rstrip(")")
settings, params = download_and_load_gpt2(
    model_size=model_size,
    models_dir="gpt2"
)

model = GPTModel(BASE_CONFIG).to(device)
load_weights_into_gpt(model, params)
model.eval()

checkpoint: 100%|██████████| 77.0/77.0 [00:00<00:00, 188kiB/s]
encoder.json: 100%|██████████| 1.04M/1.04M [00:00<00:00, 2.20MiB/s]
hparams.json: 100%|██████████| 90.0/90.0 [00:00<00:00, 149kiB/s]
model.ckpt.data-00000-of-00001: 100%|██████████| 498M/498M [00:42<00:00, 11.7MiB/s]
model.ckpt.index: 100%|██████████| 5.21k/5.21k [00:00<00:00, 8.01MiB/s]
model.ckpt.meta: 100%|██████████| 471k/471k [00:00<00:00, 1.40MiB/s]
vocab.bpe: 100%|██████████| 456k/456k [00:00<00:00, 1.35MiB/s]
checkpoint: 100%|██████████| 77.0/77.0 [00:00<00:00, 173kiB/s]
encoder.json: 100%|██████████| 1.04M/1.04M [00:00<00:00, 2.31MiB/s]
hparams.json: 100%|██████████| 91.0/91.0 [00:00<00:00, 148kiB/s]
model.ckpt.data-00000-of-00001: 100%|██████████| 1.42G/1.42G [01:47<00:00, 13.2MiB/s]
model.ckpt.index: 100%|██████████| 10.4k/10.4k [00:00<00:00, 14.2MiB/s]
model.ckpt.meta: 100%|██████████| 927k/927k [00:00<00:00, 2.58MiB/s]
vocab.bpe: 100%|██████████| 456k/456k [00:00<00:00, 1.58MiB/s]


GPTModel(
  (token_embedding): Embedding(50257, 1024)
  (positional_embedding): Embedding(1024, 1024)
  (dropout): Dropout(p=0.0, inplace=False)
  (trf_blocks): Sequential(
    (0): TransformerBlock(
      (attn): MultiHeadAttention(
        (W_query): Linear(in_features=1024, out_features=1024, bias=True)
        (W_key): Linear(in_features=1024, out_features=1024, bias=True)
        (W_value): Linear(in_features=1024, out_features=1024, bias=True)
        (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        (dropout): Dropout(p=0.0, inplace=False)
      )
      (ffn): FeedForward(
        (layers): Sequential(
          (0): Linear(in_features=1024, out_features=4096, bias=True)
          (1): GELU()
          (2): Linear(in_features=4096, out_features=1024, bias=True)
        )
      )
      (norm1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.0, inplac

In [41]:
torch.manual_seed(123)
model.eval()
tokenizer = tiktoken.get_encoding("gpt2")
token_ids = generate(
    model=model,
    idx=text_to_token_ids('Every effort moves you', tokenizer).to(device),
    max_new_tokens=50,
    context_size=BASE_CONFIG["context_length"],
    top_k=50,
    temperature=0.0,
    eos_id=50256
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort moves you forward, but you must be careful. You must not let your guard down. You must not let your guard down."

"I will not let you down."

"I will not let you down."

"I will not


In [42]:
print(format_input(val_data[0]))

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Convert the active sentence to passive: 'The chef cooks the meal every day.'


In [43]:
torch.manual_seed(123)
model.eval()
tokenizer = tiktoken.get_encoding("gpt2")
token_ids = generate(
    model=model,
    idx=text_to_token_ids(format_input(val_data[0]), tokenizer).to(device),
    max_new_tokens=256,
    context_size=BASE_CONFIG["context_length"],
    top_k=50,
    temperature=0.0,
    eos_id=50256
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Convert the active sentence to passive: 'The chef cooks the meal every day.'

### Input:

The active sentence: 'The chef cooks the meal every day.'

### Output:

The passive sentence: 'The chef cooks the meal every day.'

### Example:

The chef cooks the meal every day.

### Example:

The chef cooks the meal every day.

### Example:

The chef cooks the meal every day.

### Example:

The chef cooks the meal every day.

### Example:

The chef cooks the meal every day.

### Example:

The chef cooks the meal every day.

### Example:

The chef cooks the meal every day.

### Example:

The chef cooks the meal every day.

### Example:

The chef cooks the meal every day.

### Example:

The chef cooks the meal every day.

### Example:

The chef cooks the meal every day.

### Example:

The chef cooks the meal 

In [44]:
generated_text = token_ids_to_text(token_ids, tokenizer)
response_text = generated_text[len(format_input(val_data[0])):].strip()
print(response_text)

### Input:

The active sentence: 'The chef cooks the meal every day.'

### Output:

The passive sentence: 'The chef cooks the meal every day.'

### Example:

The chef cooks the meal every day.

### Example:

The chef cooks the meal every day.

### Example:

The chef cooks the meal every day.

### Example:

The chef cooks the meal every day.

### Example:

The chef cooks the meal every day.

### Example:

The chef cooks the meal every day.

### Example:

The chef cooks the meal every day.

### Example:

The chef cooks the meal every day.

### Example:

The chef cooks the meal every day.

### Example:

The chef cooks the meal every day.

### Example:

The chef cooks the meal every day.

### Example:

The chef cooks the meal every day.

### Example:

The chef cooks the meal every day.

### Example:

The chef cooks the meal every day.

### Example:


In [49]:
def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch)
    loss = torch.nn.functional.cross_entropy(logits.flatten(0, 1), target_batch.flatten())
    return loss


def calc_loss_loader(data_loader, model, device, num_batches=None):
    total_loss = 0.
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader)
    else:
        # Reduce the number of batches to match the total number of batches in the data loader
        # if num_batches exceeds the number of batches in the data loader
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            total_loss += loss.item()
        else:
            break
    return total_loss / num_batches

def train_model_simple(model, train_loader, val_loader, optimizer, device, num_epochs,
                       eval_freq, eval_iter, start_context, tokenizer):
    # Initialize lists to track losses and tokens seen
    train_losses, val_losses, track_tokens_seen = [], [], []
    tokens_seen, global_step = 0, -1

    # Main training loop
    for epoch in range(num_epochs):
        model.train()  # Set model to training mode

        for input_batch, target_batch in train_loader:
            optimizer.zero_grad() # Reset loss gradients from previous batch iteration
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            loss.backward() # Calculate loss gradients
            optimizer.step() # Update model weights using loss gradients
            tokens_seen += input_batch.numel() # Returns the total number of elements (or tokens) in the input_batch.
            global_step += 1

            if global_step % 15 == 0:
                print(f"The training is going on for batch {global_step} in epoch {epoch+1}")

            # # Optional evaluation step
            # if global_step % eval_freq == 0:
            #     train_loss, val_loss = evaluate_model(
            #         model, train_loader, val_loader, device, eval_iter)
            #     train_losses.append(train_loss)
            #     val_losses.append(val_loss)
            #     track_tokens_seen.append(tokens_seen)
            #     print(f"Ep {epoch+1} (Step {global_step:06d}): "
            #           f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")

        # Print a sample text after each epoch
        model.eval()
        print(token_ids_to_text(generate(model, text_to_token_ids(start_context, tokenizer).to(device),
                 max_new_tokens=50, context_size=BASE_CONFIG["context_length"],
                 top_k=50, temperature=0), tokenizer))

    # return train_losses, val_losses, track_tokens_seen


In [46]:
model.to(device)

torch.manual_seed(123)

with torch.no_grad():
    train_loss = calc_loss_loader(train_loader, model, device, num_batches=5)
    val_loss = calc_loss_loader(val_loader, model, device, num_batches=5)

print("Training loss:", train_loss)
print("Validation loss:", val_loss)

Training loss: 3.8396378993988036
Validation loss: 3.7788986682891847


In [None]:
model.eval()

In [47]:
device

device(type='cuda')

Note: I actually ran the training cell below twice, so you can think of the model as having been trained for 4 epochs in total.

In [50]:
import time

start_time = time.time()

torch.manual_seed(123)

optimizer = torch.optim.AdamW(model.parameters(), lr=0.00005, weight_decay=0.1)
epochs = 2

train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=epochs, eval_freq=5, eval_iter=5,
    start_context=format_input(val_data[0]), tokenizer=tokenizer
)

end_time = time.time()
execution_time_minutes = (end_time - start_time) / 60
print(f"Training completed in {execution_time_minutes:.2f} minutes.")

The training is going on for batch 0 in epoch 1
The training is going on for batch 15 in epoch 1
The training is going on for batch 30 in epoch 1
The training is going on for batch 45 in epoch 1
The training is going on for batch 60 in epoch 1
The training is going on for batch 75 in epoch 1
The training is going on for batch 90 in epoch 1
The training is going on for batch 105 in epoch 1
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Convert the active sentence to passive: 'The chef cooks the meal every day.'

### Response:
The meal is prepared by the chef.<|endoftext|>The following is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
What
The training is going on for batch 120 in epoch 2
The training is going on for batch 135 in epoch 2
The t

In [51]:
model.to(device)

torch.manual_seed(123)

with torch.no_grad():
    train_loss = calc_loss_loader(train_loader, model, device)
    val_loss = calc_loss_loader(val_loader, model, device)
    test_loss = calc_loss_loader(test_loader, model, device)

print("Training loss:", train_loss)
print("Validation loss:", val_loss)
print("Test loss:", test_loss)

Training loss: 0.17877831811021114
Validation loss: 0.5765614424433027
Test loss: 0.6692315489053726


In [52]:
response_text = token_ids_to_text(generate(model, text_to_token_ids(format_input(val_data[0]), tokenizer).to(device),
                 max_new_tokens=50, context_size=BASE_CONFIG["context_length"],
                 top_k=50, temperature=0, eos_id=50256), tokenizer)

response_text = response_text[len(format_input(val_data[0])):].strip()
print(response_text)

### Response:
The meal is prepared by the chef every day.


In [None]:
import re
file_name = f"{re.sub(r'[ ()]', '', CHOOSE_MODEL) }-sft.pth"
torch.save(model.state_dict(), file_name)
print(f"Model saved as {file_name}")

# Load model via
# model.load_state_dict(torch.load("gpt2-medium355M-sft.pth"))

<All keys matched successfully>

In [None]:
torch.manual_seed(123)

model.eval()

for entry in test_data[:3]:

    input_text = format_input(entry)

    token_ids = generate(
        model=model,
        idx=text_to_token_ids(input_text, tokenizer).to(device),
        max_new_tokens=256,
        context_size=BASE_CONFIG["context_length"],
        eos_id=50256
    )
    generated_text = token_ids_to_text(token_ids, tokenizer)
    response_text = (
        generated_text[len(input_text):]
        .replace("### Response:", "")
        .strip()
    )

    print(input_text)
    print(f"\nCorrect response:\n>> {entry['output']}")
    print(f"\nModel response:\n>> {response_text.strip()}")
    print("-------------------------------------")

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Rewrite the sentence using a simile.

### Input:
The car is very fast.

Correct response:
>> The car is as fast as lightning.

Model response:
>> The car is as fast as a cheetah.
-------------------------------------
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
What type of cloud is typically associated with thunderstorms?

Correct response:
>> The type of cloud typically associated with thunderstorms is cumulonimbus.

Model response:
>> The type of cloud typically associated with thunderstorms is a cumulus.
-------------------------------------
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately comple

In [None]:
from tqdm import tqdm

for i, entry in tqdm(enumerate(test_data), total=len(test_data)):

    input_text = format_input(entry)

    token_ids = generate(
        model=model,
        idx=text_to_token_ids(input_text, tokenizer).to(device),
        max_new_tokens=256,
        context_size=BASE_CONFIG["context_length"],
        eos_id=50256
    )
    generated_text = token_ids_to_text(token_ids, tokenizer)
    response_text = generated_text[len(input_text):].replace("### Response:", "").strip()

    test_data[i]["model_response"] = response_text


with open("instruction-data-with-response.json", "w") as file:
    json.dump(test_data, file, indent=4)  # "indent" for pretty-printing
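
The slicing used in the loop above assumes the generated text starts with the prompt verbatim. Here is a minimal sketch of that step on plain strings, with a guard for the case where it does not. The helper name `extract_response` and the toy prompt are illustrative assumptions, not part of the notebook.

```python
def extract_response(generated_text, prompt_text, marker="### Response:"):
    """Drop the echoed prompt and the response marker from generated text."""
    # Only slice when the output really starts with the prompt;
    # otherwise keep the full text instead of cutting at a wrong offset.
    if generated_text.startswith(prompt_text):
        generated_text = generated_text[len(prompt_text):]
    return generated_text.replace(marker, "").strip()


# Toy prompt and completion (made up for illustration)
prompt = (
    "Below is an instruction that describes a task.\n\n"
    "### Instruction:\nRewrite the sentence using a simile."
)
generated = prompt + "\n\n### Response:\nThe car is as fast as a cheetah."

print(extract_response(generated, prompt))
```

Without the `startswith` check, a decoded text that differs from the prompt by even one character would be cut at the wrong position.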

In [None]:
torch.manual_seed(123)

model.eval()

for entry in test_data[:110]:

    input_text = format_input(entry)

    token_ids = generate(
        model=model,
        idx=text_to_token_ids(input_text, tokenizer).to(device),
        max_new_tokens=256,
        context_size=BASE_CONFIG["context_length"],
        eos_id=50256
    )
    generated_text = token_ids_to_text(token_ids, tokenizer)
    response_text = (
        generated_text[len(input_text):]
        .replace("### Response:", "")
        .strip()
    )

    print(input_text)
    print(f"\nCorrect response:\n>> {entry['output']}")
    print(f"\nModel response:\n>> {response_text.strip()}")
    print("-------------------------------------")

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Rewrite the sentence using a simile.

### Input:
The car is very fast.

Correct response:
>> The car is as fast as lightning.

Model response:
>> The car is as fast as a cheetah.
-------------------------------------
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
What type of cloud is typically associated with thunderstorms?

Correct response:
>> The type of cloud typically associated with thunderstorms is cumulonimbus.

Model response:
>> The type of cloud typically associated with thunderstorms is a cumulus.
-------------------------------------
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately comple

Now that training is done, we have the results in hand. The model looks quite OK to our eyes, but how OK is it really? What would a quantitative result look like?

- There are many popular evaluation methods out there; I leave exploring them to you.
- I am going to evaluate with Llama 3, Meta's open-weight model, served locally through Ollama.

Since we are going to run Ollama locally for the evaluation, I had to move this notebook to a VS Code Jupyter notebook.

How did I transfer it? Simply by saving the model here, loading it there, and running the remaining cells, so no redundant training was needed.
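
Besides the saved weights, the scoring cells later in this notebook read each entry's `model_response`, which was written to `instruction-data-with-response.json` earlier. Below is a minimal sketch of reloading such a file in the new environment and checking that the keys needed for scoring are present. The helper `load_responses` and the throwaway demo file are assumptions made so the sketch runs on its own.

```python
import json
import os
import tempfile


def load_responses(path, required_keys=("instruction", "output", "model_response")):
    """Load saved test entries and check that each has the keys scoring needs."""
    with open(path, "r", encoding="utf-8") as file:
        entries = json.load(file)
    for i, entry in enumerate(entries):
        missing = [key for key in required_keys if key not in entry]
        if missing:
            raise KeyError(f"Entry {i} is missing keys: {missing}")
    return entries


# Demo with a temporary file; in the notebook the path would be
# "instruction-data-with-response.json".
demo_entries = [{
    "instruction": "Rewrite the sentence using a simile.",
    "input": "The car is very fast.",
    "output": "The car is as fast as lightning.",
    "model_response": "The car is as fast as a cheetah.",
}]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo-responses.json")
    with open(demo_path, "w", encoding="utf-8") as file:
        json.dump(demo_entries, file, indent=4)
    loaded = load_responses(demo_path)

print(len(loaded), "entries loaded")
```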

In [None]:
import psutil

def check_if_running(process_name):
    running = False
    for proc in psutil.process_iter(["name"]):
        if process_name in proc.info["name"]:
            running = True
            break
    return running

ollama_running = check_if_running("ollama")

if not ollama_running:
    raise RuntimeError("Ollama not running. Launch ollama before proceeding.")
print("Ollama running:", check_if_running("ollama"))

Ollama running: True


<div class="alert alert-block alert-success">

An alternative to the ollama run command for interacting with the model is through its
REST API using Python.

The following query_model function demonstrates how to use the
API:</div>

<div class="alert alert-block alert-info">


Step 1: Create the data payload as a dictionary
    
Step 2: Convert the dictionary to a JSON formatted string and encode it to bytes
    
Step 3: Create a request object, setting the method to POST and adding necessary headers
    
Step 4: Send the request and capture the response
    
</div>

In [None]:
import urllib.request

def query_model(
    prompt,
    model="llama3",
    url="http://localhost:11434/api/chat"
):
    # Create the data payload as a dictionary
    data = {
        "model": model,
        "messages": [
            {"role": "user", "content": prompt}
        ],
        "options": {     # Settings below make responses as deterministic as possible
            "seed": 123,
            "temperature": 0,
            "num_ctx": 2048
        }
    }


    # Convert the dictionary to a JSON formatted string and encode it to bytes
    payload = json.dumps(data).encode("utf-8")

    # Create a request object, setting the method to POST and adding necessary headers
    request = urllib.request.Request(
        url,
        data=payload,
        method="POST"
    )
    request.add_header("Content-Type", "application/json")

    # Send the request and capture the response
    response_data = ""
    with urllib.request.urlopen(request) as response:
        # Read and decode the response
        while True:
            line = response.readline().decode("utf-8")
            if not line:
                break
            response_json = json.loads(line)
            response_data += response_json["message"]["content"]

    return response_data
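
The read loop in `query_model` treats the response body as one JSON object per line and concatenates each object's `message.content`. The sketch below replays that logic on a simulated body, so it can be tried without a running server. The helper name `collect_stream` and the chunk contents are made up for illustration, and the chunk layout is assumed from how `query_model` parses it.

```python
import io
import json


def collect_stream(response):
    """Concatenate message content from newline-delimited JSON chunks."""
    text = ""
    for raw_line in response:
        line = raw_line.decode("utf-8").strip()
        if not line:
            continue  # skip blank lines between chunks
        chunk = json.loads(line)
        # Chunks without a message (or without content) contribute nothing
        text += chunk.get("message", {}).get("content", "")
    return text


# Simulated response body (assumed layout, not captured from a real server)
fake_body = io.BytesIO(
    b'{"message": {"role": "assistant", "content": "Llamas eat "}, "done": false}\n'
    b'{"message": {"role": "assistant", "content": "grass."}, "done": false}\n'
    b'{"message": {"role": "assistant", "content": ""}, "done": true}\n'
)

print(collect_stream(fake_body))
```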

<div class="alert alert-block alert-success">

Before running the subsequent code cells in this notebook, ensure that Ollama is still
running. The previous code cells should print "Ollama running: True" to confirm that the
model is active and ready to receive requests.
</div>

<div class="alert alert-block alert-success">

Here's an example of how to use the query_model function we just implemented:
</div>

In [None]:
model = "llama3"
result = query_model("What do Llamas eat?", model)
print(result)

Llamas are herbivores, which means they primarily feed on plant-based foods. Their diet typically consists of:

1. Grasses: Llamas love to graze on various types of grasses, including tall grasses, short grasses, and even weeds.
2. Hay: High-quality hay, such as alfalfa or timothy hay, is a staple in a llama's diet. They enjoy the sweet taste and texture of fresh hay.
3. Grains: Llamas may receive grains like oats, barley, or corn as part of their daily ration. However, it's essential to provide these grains in moderation, as they can be high in calories.
4. Fruits and vegetables: Llamas enjoy a variety of fruits and veggies, such as apples, carrots, sweet potatoes, and leafy greens like kale or spinach.
5. Minerals: Llamas require access to mineral supplements, which help maintain their overall health and strong bones.

In the wild, llamas might also eat:

1. Leaves: They'll munch on leaves from trees and shrubs, like willow or cedar.
2. Bark: In some cases, llamas may eat the bark of

<div class="alert alert-block alert-success">

Using the query_model function defined earlier, we can evaluate the responses generated
by our finetuned model: we ask the Llama 3 model to rate each response on a scale
from 0 to 100, using the corresponding test set response as the reference.
</div>

<div class="alert alert-block alert-success">

The generate_model_scores function defined further below uses a modified prompt telling the
model to "Respond with the integer number only."; the following cell first tries that prompt on five test entries:
</div>

In [None]:
def avg_score(scores):
    total = 0
    for score in scores:
        total = total + score
    return total//len(scores)

scores = []
for entry in test_data[:5]:
    prompt = (
            f"Given the input `{format_input(entry)}` "
            f"and correct output `{entry['output']}`, "
            f"score the model response `{entry['model_response']}`"
            f" on a scale from 0 to 100, where 100 is the best score. "
            f"Respond with the integer number only."
        )
    print("\nDataset response:")
    print(">>", entry['output'])
    print("\nModel response:")
    print(">>", entry["model_response"])
    print("\nScore:")
    score = int(query_model(prompt, model))
    print(">>", str(score))
    scores.append(score)
    print("\n-------------------------")

print(f"The average score of the model is: ", avg_score(scores))


Dataset response:
>> The car is as fast as lightning.

Model response:
>> The car is as fast as a cheetah.

Score:
>> 85

-------------------------

Dataset response:
>> The type of cloud typically associated with thunderstorms is cumulonimbus.

Model response:
>> The type of cloud typically associated with thunderstorms is a cumulus.

Score:
>> 60

-------------------------

Dataset response:
>> Jane Austen.

Model response:
>> The author of 'Pride and Prejudice' is Jane Austen.

Score:
>> 98

-------------------------

Dataset response:
>> The periodic symbol for chlorine is Cl.

Model response:
>> The periodic symbol for chlorine is CH3.

Score:
>> 4

-------------------------

Dataset response:
>> The corrected sentence should be: 'It's time to go home.'

Model response:
>> The corrected sentence should be: 'Its time to go home.'

Score:
>> 20

-------------------------
The average score of the model is:  53


In [None]:
def generate_model_scores(json_data, json_key, model="llama3"):
    scores = []
    for entry in tqdm(json_data, desc="Scoring entries"):
        prompt = (
            f"Given the input `{format_input(entry)}` "
            f"and correct output `{entry['output']}`, "
            f"score the model response `{entry[json_key]}`"
            f" on a scale from 0 to 100, where 100 is the best score. "
            f"Respond with the integer number only."
        )
        score = query_model(prompt, model)
        try:
            scores.append(int(score))
        except ValueError:
            print(f"Could not convert score: {score}")
            continue

    return scores
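
Even with "Respond with the integer number only.", the judge model may occasionally wrap the number in extra words, in which case `int(score)` raises `ValueError` and the entry is skipped. One possible workaround, sketched here under that assumption, is to pull the first in-range integer out of the reply. The helper `parse_score` is hypothetical and not part of the original code.

```python
import re


def parse_score(text, low=0, high=100):
    """Return the first integer in `text` within [low, high], or None."""
    for match in re.finditer(r"\d+", text):
        value = int(match.group())
        if low <= value <= high:
            return value
    return None


print(parse_score("85"))
print(parse_score("Score: 60 out of 100"))
print(parse_score("I cannot rate this."))
```

This is a heuristic: a reply such as "7/10" would be read as 7, so replies that fail a plain `int()` conversion are still worth inspecting by hand.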

<div class="alert alert-block alert-info">


When you run the above code, you will see that the evaluation output shows that our finetuned model achieves an average score above 50,
which provides a useful benchmark for comparison against other models or for
experimenting with different training configurations to improve the model's performance.

It's worth noting that Ollama is not entirely deterministic at the time of this writing,
which means that the scores you obtain might slightly vary from the ones presented above.
    
To obtain more robust results, you can repeat the evaluation multiple times and average
the resulting scores.
</div>
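
The suggestion to repeat the evaluation and average the results could be sketched as follows. `average_over_runs` is a hypothetical helper. It takes any zero-argument function that returns a list of scores, so a stand-in scorer is used here to keep the example runnable without Ollama.

```python
from statistics import mean, stdev


def average_over_runs(score_fn, num_runs=3):
    """Call score_fn() num_runs times; summarize the per-run average scores."""
    run_means = []
    for _ in range(num_runs):
        scores = score_fn()
        if scores:  # skip runs where no score could be parsed
            run_means.append(mean(scores))
    if not run_means:
        raise ValueError("No scores were returned in any run")
    spread = stdev(run_means) if len(run_means) > 1 else 0.0
    return mean(run_means), spread, run_means


# Stand-in scorer with made-up numbers, one list per simulated run
fake_runs = iter([[85, 60, 98], [80, 65, 95], [90, 55, 98]])
overall, spread, per_run = average_over_runs(lambda: next(fake_runs), num_runs=3)

print(f"Mean of run averages: {overall:.2f} (std {spread:.2f})")
```

In the notebook, the scorer would be something like `lambda: generate_model_scores(test_data, "model_response")`, at the cost of one full pass over the test set per run.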

Created by **Md. Shadikur Rahman Sheam**.
- [Linkedin](https://www.linkedin.com/in/md-shadikur-rahman-sheam-3826482b3/)
- [Github](https://github.com/sadikurSenpai)
- [Facebook](https://www.facebook.com/profile.php?id=100091833665881)

Special thanks to Sebastian Raschka for his amazing book.