# Buzuzu-Mavi Challenge - Finetuning InkubaLM (50M parameter version)

**You will need a Hugging Face account to access the datasets and the models.**

Make sure you have generated an HF_TOKEN and added it to your notebook.

🚨 NB: To access the Inkuba model, make sure you have requested and been granted access to the gated InkubaLM model on Hugging Face.

In [None]:
import os
from huggingface_hub import login

os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["HF_TOKEN"] = ""  # paste your Hugging Face access token here
login(token=os.environ["HF_TOKEN"])

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


In [2]:
from importlib.metadata import version

pkgs = [
    "numpy",
    "matplotlib",
    "torch",
    "transformers",
    "accelerate",
    "sentencepiece",
    "protobuf",
    "sacrebleu"
]
for p in pkgs:
    print(f"{p} version: {version(p)}")

numpy version: 1.26.3
matplotlib version: 3.7.3
torch version: 2.6.0
transformers version: 4.49.0
accelerate version: 1.5.2
sentencepiece version: 0.2.0
protobuf version: 4.25.6
sacrebleu version: 2.0.0


## Set up paths

In [None]:
from pathlib import Path
basedir = Path("/.")
data_path = basedir/"inkuba_instruct_sample.json"

# if `tokenizer_path` is None, the InkubaLM tokenizer will be used without modification
tokenizer_path = basedir/"tokenizer"

# option to limit the number of translation examples per language
# to ensure that the dataset is better balanced by task
max_mt_data_per_lang = 50_000  # set to None to disable this option

## Load the training data

In [4]:
import json

with open(data_path, "r") as f:
    data = json.load(f)
print("Number of entries:", len(data))

Number of entries: 350815


In [5]:
langs = set([d.get("lang", d.get("language", "")) for d in data])
langs

{'hausa', 'swahili'}

In [6]:
tasks = {"qa": 0, "sentiment": 0, "mmt": 0}
for d in data:
    tasks[d["task"]] += 1
tasks

{'qa': 60000, 'sentiment': 90828, 'mmt': 199987}

Shuffle the data:

In [7]:
import random

random.seed(123)
# data = data + data_sent + data_sent
random.shuffle(data)
len(data)

350815

In [8]:
if max_mt_data_per_lang is not None:
    data_mt_swa = [d for d in data if d["task"] == "mmt" and d["language"] == "swahili"][:max_mt_data_per_lang]
    data_mt_hau = [d for d in data if d["task"] == "mmt" and d["language"] == "hausa"][:max_mt_data_per_lang]
    data = [d for d in data if d["task"] != "mmt"]

    random.seed(123)
    data = data + data_mt_swa + data_mt_hau
    random.shuffle(data)
    print(len(data))

250828
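The example entries printed in this notebook store the language under either `lang` or `language`, so a per-task, per-language count is a useful sanity check after balancing. A minimal sketch, assuming entries are plain dicts as loaded from the JSON file; `get_lang` and `count_by_task_and_lang` are hypothetical helper names, and the toy entries are made up for illustration:

```python
from collections import Counter


def get_lang(entry):
    # Entries use either "lang" or "language" as the key (assumption based on the samples shown)
    return entry.get("lang", entry.get("language", ""))


def count_by_task_and_lang(entries):
    # Count how many entries fall into each (task, language) pair
    return Counter((e["task"], get_lang(e)) for e in entries)


# Made-up toy entries, not real data
toy_entries = [
    {"task": "qa", "lang": "hausa"},
    {"task": "mmt", "language": "swahili"},
    {"task": "mmt", "language": "swahili"},
    {"task": "sentiment", "language": "hausa"},
]

counts = count_by_task_and_lang(toy_entries)
for key, n in sorted(counts.items()):
    print(key, n)
```

Running `count_by_task_and_lang(data)` on the real list would show whether the translation cap produced the intended balance.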


Look at a few data examples:

In [9]:
print("Example entry:\n", data[50])

Example entry:
 {'task': 'qa', 'instruction': 'You will be given two Hausa sentences. Your job is to  is to predict textual entailment. In other words, does sentence A imply/contradict/neither sentence B. Your response must be one word: entailment, contradiction, or neutral.', 'input': 'Sentence A: Otr ne , vermont , sabon , massachusetts , rhode , rhode , rhode , rhode , maryland , maryland , maryland , maryland , maryland , maryland , maryland , maryland , maryland , maryland , maryland , maryland , maryland , maryland , maryland , maryland , maryland , maryland , maryland\nSentence B: Massachusetts da New York da kuma daga OTR.', 'output': 'contradiction', 'lang': 'hausa'}


In [10]:
print("Example entry:\n", data[-7])

Example entry:
 {'task': 'sentiment', 'instruction_orig': "Gano ra'ayin da aka bayyana a cikin wannan rubutu. Bin waɗannan jagororin, kyakkyawa yana na rubutu na nufin kyakkyawan tunani, ɗabi'a, da motsin rai. Korau na nuna rubutu na nufin mummunan tunani ko motsin rai. Tsaka-tsaki na nuna rubutu baya nufin magana mai kyau ko mara kyau kai tsaye ko a kaikaice.", 'input': 'kai amma su buhari baajin kunyaryin karya', 'output': 'Korau', 'language': 'hausa', 'instruction': "Gano ra'ayin da aka bayyana a cikin wannan rubutu. Bin waɗannan jagororin, kyakkyawa yana na rubutu na nufin kyakkyawan tunani, ɗabi'a, da motsin rai. Korau na nuna rubutu na nufin mummunan tunani ko motsin rai. Tsaka-tsaki na nuna rubutu baya nufin magana mai kyau ko mara kyau kai tsaye ko a kaikaice."}


## Prepare data for instruction tuning

In [11]:
def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""

    return instruction_text + input_text

In [12]:
model_input = format_input(data[50])
desired_response = f"\n\n### Response:\n{data[50]['output']}"

print(model_input + desired_response)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
You will be given two Hausa sentences. Your job is to  is to predict textual entailment. In other words, does sentence A imply/contradict/neither sentence B. Your response must be one word: entailment, contradiction, or neutral.

### Input:
Sentence A: Otr ne , vermont , sabon , massachusetts , rhode , rhode , rhode , rhode , maryland , maryland , maryland , maryland , maryland , maryland , maryland , maryland , maryland , maryland , maryland , maryland , maryland , maryland , maryland , maryland , maryland , maryland , maryland
Sentence B: Massachusetts da New York da kuma daga OTR.

### Response:
contradiction


Train/validation/test split (roughly 85% / 5% / 10%):

In [13]:
import random
random.seed(123)

train_data, test_data, val_data = [], [], []
for example in data:
    r = random.random()
    if r < 0.85:
        train_data.append(example)
    elif r < 0.95:
        test_data.append(example)
    else:
        val_data.append(example)

In [14]:
print("Training set length:", len(train_data))
print("Validation set length:", len(val_data))
print("Test set length:", len(test_data))

Training set length: 212841
Validation set length: 12646
Test set length: 25341
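Because each example is assigned by an independent random draw, the split sizes only approximate the target proportions. A small check of the realized fractions, using the set lengths printed above (the variable names are illustrative):

```python
# Set sizes as printed by the notebook
n_train, n_val, n_test = 212_841, 12_646, 25_341
total = n_train + n_val + n_test

# Realized fraction of each split
train_frac = n_train / total
val_frac = n_val / total
test_frac = n_test / total

print(f"total: {total}")
print(f"train: {train_frac:.3f}, val: {val_frac:.3f}, test: {test_frac:.3f}")
```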


### Create `Dataset` object to pre-tokenize the training data

In [15]:
import torch
from torch.utils.data import Dataset


class InstructionDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data

        # Pre-tokenize texts
        self.encoded_texts = []
        for entry in data:
            instruction_plus_input = format_input(entry)
            response_text = f"\n\n### Response:\n{entry['output']}"
            full_text = instruction_plus_input + response_text
            self.encoded_texts.append(
                tokenizer.encode(full_text)
            )

    def __getitem__(self, index):
        return self.encoded_texts[index]

    def __len__(self):
        return len(self.data)

In [16]:
from transformers import AutoTokenizer

model_name = "lelapa/InkubaLM-0.4B"
if tokenizer_path is None:
    tokenizer_path = model_name
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
print(tokenizer.special_tokens_map)
print(tokenizer.convert_tokens_to_ids("</s>"))
print(tokenizer.vocab_size)

{'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}
2
8064


### Define a custom collate function to create training batches

In [17]:
def custom_collate_fn(
    batch,
    pad_token_id=2,
    ignore_index=-100,
    allowed_max_length=None,
    device="cpu"
):
    # Find the longest sequence in the batch
    batch_max_length = max(len(item)+1 for item in batch)

    # Pad and prepare inputs and targets
    inputs_lst, targets_lst = [], []

    for item in batch:
        new_item = item.copy()
        # Append the end-of-sequence token (`</s>`, id 2 with this tokenizer); the same id is used for padding
        new_item += [pad_token_id]
        # Pad sequences to max_length
        padded = (
            new_item + [pad_token_id] *
            (batch_max_length - len(new_item))
        )
        inputs = torch.tensor(padded[:-1])  # Truncate the last token for inputs
        targets = torch.tensor(padded[1:])  # Shift +1 to the right for targets

        # New: Replace all but the first padding tokens in targets by ignore_index
        mask = targets == pad_token_id
        indices = torch.nonzero(mask).squeeze()
        if indices.numel() > 1:
            targets[indices[1:]] = ignore_index

        # New: Optionally truncate to maximum sequence length
        if allowed_max_length is not None:
            inputs = inputs[:allowed_max_length]
            targets = targets[:allowed_max_length]

        inputs_lst.append(inputs)
        targets_lst.append(targets)

    # Convert list of inputs and targets to tensors and transfer to target device
    inputs_tensor = torch.stack(inputs_lst).to(device)
    targets_tensor = torch.stack(targets_lst).to(device)

    return inputs_tensor, targets_tensor

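Check that it works as expected with a toy example. Note that `inputs_1` itself contains the id 2, which is also the pad/end-of-sequence id here, so the masking logic keeps that occurrence and masks the end-of-sequence target appended at the end of the first row: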

In [18]:
inputs_1 = [0, 1, 2, 3, 4]
inputs_2 = [5, 6]
inputs_3 = [7, 8, 9]

batch = (
    inputs_1,
    inputs_2,
    inputs_3
)

inputs, targets = custom_collate_fn(batch)
print(inputs)
print(targets)

tensor([[0, 1, 2, 3, 4],
        [5, 6, 2, 2, 2],
        [7, 8, 9, 2, 2]])
tensor([[   1,    2,    3,    4, -100],
        [   6,    2, -100, -100, -100],
        [   8,    9,    2, -100, -100]])
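The value -100 matters because it is the default `ignore_index` of PyTorch's cross-entropy loss, so those target positions contribute nothing to the loss or its mean. A minimal sketch with made-up logits (the shapes and values are purely illustrative):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# 5 positions, vocabulary of 10 (toy sizes)
logits = torch.randn(5, 10)
targets = torch.tensor([1, 2, 3, -100, -100])

# Default ignore_index=-100: the last two positions are skipped
loss_masked = F.cross_entropy(logits, targets)

# Same loss computed by hand on the three unmasked positions only
loss_manual = F.cross_entropy(logits[:3], targets[:3])

print(torch.allclose(loss_masked, loss_manual))
```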


### Create dataloaders

In [19]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", device)

Device: cuda


In [20]:
from functools import partial

customized_collate_fn = partial(
    custom_collate_fn,
    device=device,
    allowed_max_length=1024
)

Next, we instantiate the data loaders, passing our own collate function for the batching process.

In [21]:
from torch.utils.data import DataLoader


num_workers = 0
batch_size = 8

torch.manual_seed(123)

train_dataset = InstructionDataset(train_data, tokenizer)
train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    collate_fn=customized_collate_fn,
    shuffle=True,
    drop_last=True,
    num_workers=num_workers
)

In [22]:
val_dataset = InstructionDataset(val_data, tokenizer)
val_loader = DataLoader(
    val_dataset,
    batch_size=batch_size,
    collate_fn=customized_collate_fn,
    shuffle=False,
    drop_last=False,
    num_workers=num_workers
)

test_dataset = InstructionDataset(test_data, tokenizer)
test_loader = DataLoader(
    test_dataset,
    batch_size=batch_size,
    collate_fn=customized_collate_fn,
    shuffle=False,
    drop_last=False,
    num_workers=num_workers
)

Let's see what the dimensions of the resulting input and target batches look like:

In [23]:
print("Train loader:")
for i, (inputs, targets) in enumerate(train_loader):
    print(inputs.shape, targets.shape)
    if i == 10: break

Train loader:
torch.Size([8, 229]) torch.Size([8, 229])
torch.Size([8, 208]) torch.Size([8, 208])
torch.Size([8, 271]) torch.Size([8, 271])
torch.Size([8, 228]) torch.Size([8, 228])
torch.Size([8, 223]) torch.Size([8, 223])
torch.Size([8, 221]) torch.Size([8, 221])
torch.Size([8, 262]) torch.Size([8, 262])
torch.Size([8, 209]) torch.Size([8, 209])
torch.Size([8, 205]) torch.Size([8, 205])
torch.Size([8, 688]) torch.Size([8, 688])
torch.Size([8, 210]) torch.Size([8, 210])


In [24]:
print(inputs[0])

tensor([   1,  347,  294,  337,  335,  382,  296,  302,  540,  419,  390,  517,
        4074,  371, 2319,  263, 1768, 4080,  396, 4069,  531,  263,  572, 4077,
         265,  341,  390, 1335, 4077, 7467,  736, 1130, 1343,  277,  334,  336,
         339, 4080,   13,   13, 1388, 4116,  484,  302,  540,  419, 4092,   13,
        7153,  935,  901, 4101, 6600,  897, 4225, 4721,  263,  273,  587,  262,
        4325, 6746, 4080, 4013, 3173, 7066, 3362,  423,  348,  271,  271,  262,
        4083, 4504, 4086, 1781, 3011, 4236,  845, 6746,  845,  301,  892,  262,
        4504, 4086, 1781, 1059,  272,  260, 6188, 4066, 4083, 5231, 2243, 4101,
        4065, 4083,  897, 7639,  262, 5760, 4080, 2766,  333, 4075,  845, 4863,
        6746,  845,  301,  892,  262, 6225, 4076, 6188,  260, 6188, 4066, 2347,
        7639,  262, 5760, 4080,  320, 5680, 4090, 1022, 3001,  845, 4863, 6746,
        4774,  301,  892,  262, 5493, 2285, 4685, 2347, 4268, 4685, 4475, 6200,
        2347,  263,  409, 5718,  575, 40

In [25]:
print(targets[0])

tensor([ 347,  294,  337,  335,  382,  296,  302,  540,  419,  390,  517, 4074,
         371, 2319,  263, 1768, 4080,  396, 4069,  531,  263,  572, 4077,  265,
         341,  390, 1335, 4077, 7467,  736, 1130, 1343,  277,  334,  336,  339,
        4080,   13,   13, 1388, 4116,  484,  302,  540,  419, 4092,   13, 7153,
         935,  901, 4101, 6600,  897, 4225, 4721,  263,  273,  587,  262, 4325,
        6746, 4080, 4013, 3173, 7066, 3362,  423,  348,  271,  271,  262, 4083,
        4504, 4086, 1781, 3011, 4236,  845, 6746,  845,  301,  892,  262, 4504,
        4086, 1781, 1059,  272,  260, 6188, 4066, 4083, 5231, 2243, 4101, 4065,
        4083,  897, 7639,  262, 5760, 4080, 2766,  333, 4075,  845, 4863, 6746,
         845,  301,  892,  262, 6225, 4076, 6188,  260, 6188, 4066, 2347, 7639,
         262, 5760, 4080,  320, 5680, 4090, 1022, 3001,  845, 4863, 6746, 4774,
         301,  892,  262, 5493, 2285, 4685, 2347, 4268, 4685, 4475, 6200, 2347,
         263,  409, 5718,  575, 4080,   

(TODO: mask out tokens corresponding to the instruction.)
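One possible way to approach this TODO, sketched on plain Python lists: set the target positions that belong to the prompt to `ignore_index` so that only the response (and the end-of-sequence token) is trained on. This is not part of the notebook; `mask_instruction_targets` is a hypothetical helper, and it assumes the number of prompt tokens is known, for example from tokenizing `format_input(entry)` on its own (which may not always match the prefix of the full tokenization exactly).

```python
def mask_instruction_targets(targets, instruction_len, ignore_index=-100):
    """Return a copy of `targets` with prompt positions replaced by ignore_index.

    `targets` is the token sequence shifted left by one, so target j is token j+1.
    Tokens 0..instruction_len-1 form the prompt, hence targets 0..instruction_len-2
    are prompt tokens and get masked.
    """
    n_masked = min(max(instruction_len - 1, 0), len(targets))
    return [ignore_index] * n_masked + list(targets[n_masked:])


# Toy sequence: 4 prompt tokens, 2 response tokens, then EOS (id 2)
full = [10, 11, 12, 13, 20, 21, 2]
inputs, targets = full[:-1], full[1:]

masked = mask_instruction_targets(targets, instruction_len=4)
print(masked)  # [-100, -100, -100, 20, 21, 2]
```

In the collate function, the same idea would be applied per item before stacking, which requires the dataset to also return each prompt length.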

## Load InkubaLM

In [None]:
from copy import deepcopy
from transformers import AutoModelForCausalLM

model_name = "lelapa/InkubaLM-0.4B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    return_dict=True,
    low_cpu_mem_usage=True,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.float32,
)

# Update the model config and load the pruned model weights
config_prn = deepcopy(model.config)
config_prn.vocab_size = tokenizer.vocab_size
config_prn.hidden_size = 1024
config_prn.intermediate_size = 2816
config_prn.max_position_embeddings = 1024
config_prn.tie_word_embeddings = True
model = AutoModelForCausalLM.from_config(config_prn).to(torch.float32).to(device)
model.load_state_dict(torch.load(basedir/"pruned_model_init.pth"))

model.eval();
print(f"memory (GB): {model.get_memory_footprint() / 1024**3:.2f}")
print(f"num params:  {model.num_parameters():,.0f}")

VulavulaLlamaForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.


memory (GB): 0.19
num params:  50,480,128
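The reported footprint is consistent with storing every parameter as a 4-byte float32. A quick arithmetic check, assuming the footprint is dominated by parameter storage (any buffers would add a little on top):

```python
num_params = 50_480_128      # as printed above
bytes_per_param = 4          # float32

footprint_gb = num_params * bytes_per_param / 1024**3
print(f"memory (GB): {footprint_gb:.2f}")
```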


### Define a `generate` function:

In [27]:
def generate(model, idx, max_new_tokens, context_size, temperature=0.0, top_k=None, eos_id=None):

    for _ in range(max_new_tokens):
        # Crop the context to the supported size and keep only the last time step's logits
        idx_cond = idx[:, -context_size:]
        with torch.no_grad():
            logits = model(idx_cond).logits
        logits = logits[:, -1, :]

        # Optional top-k filtering: keep only the top_k largest logits per row
        if top_k is not None:
            top_logits, _ = torch.topk(logits, top_k)
            min_val = top_logits[:, -1:]  # (batch_size, 1), broadcasts across the vocabulary
            logits = torch.where(logits < min_val, torch.tensor(float('-inf')).to(logits.device), logits)

        # Temperature scaling followed by sampling
        if temperature > 0.0:
            logits = logits / temperature

            # Apply softmax to get probabilities
            probs = torch.softmax(logits, dim=-1)  # (batch_size, vocab_size)

            # Sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (batch_size, 1)

        # Greedy decoding: pick the vocab entry with the highest logit
        else:
            idx_next = torch.argmax(logits, dim=-1, keepdim=True)  # (batch_size, 1)

        # Stop early if eos_id is specified and every sequence in the batch produced it
        if eos_id is not None and (idx_next == eos_id).all():
            break

        # Append the new token to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)  # (batch_size, num_tokens+1)

    return idx
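To see what the two sampling knobs do, here is a dependency-free sketch of temperature scaling and top-k filtering on a single made-up logit vector. It mirrors the logic of `generate` in plain Python; `softmax` and `top_k_filter` are illustrative helpers, not functions used by the notebook.

```python
import math


def softmax(logits, temperature=1.0):
    # Divide by the temperature, then normalize (subtracting the max for stability)
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(logits, k):
    # Everything below the k-th largest logit is set to -inf (probability 0 after softmax)
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else float("-inf") for x in logits]


logits = [2.0, 1.0, 0.5, -1.0]  # made-up values

probs_default = softmax(logits)                  # temperature 1.0
probs_sharp = softmax(logits, temperature=0.5)   # lower temperature, more peaked
probs_top2 = softmax(top_k_filter(logits, k=2))  # only the two largest survive

print([round(p, 3) for p in probs_default])
print([round(p, 3) for p in probs_sharp])
print([round(p, 3) for p in probs_top2])
```

With `temperature=0.0`, the default used in this notebook, `generate` skips sampling entirely and falls back to greedy argmax decoding.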

## Train!

In [28]:
def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch).logits
    loss = torch.nn.functional.cross_entropy(logits.flatten(0, 1), target_batch.flatten())
    return loss

def calc_loss_loader(data_loader, model, device, num_batches=None):
    total_loss = 0.
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader)
    else:
        # Reduce the number of batches to match the total number of batches in the data loader
        # if num_batches exceeds the number of batches in the data loader
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            total_loss += loss.item()
        else:
            break
    return total_loss / num_batches

Check that loss functions work as expected:

In [29]:
model.to(device)

torch.manual_seed(123)

with torch.no_grad():
    train_loss = calc_loss_loader(train_loader, model, device, num_batches=5)
    val_loss = calc_loss_loader(val_loader, model, device, num_batches=5)

print("Training loss:", train_loss)
print("Validation loss:", val_loss)

Training loss: 10.082386016845703
Validation loss: 10.069643211364745
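Two reference points help when reading these loss values: a model that spreads probability uniformly over the 8,064-token vocabulary would score a cross-entropy of ln(8064), and perplexity is simply exp(loss). A small sketch of both calculations (the example loss of 1.292 is one of the validation values from the training log in this notebook):

```python
import math

vocab_size = 8064  # tokenizer.vocab_size as printed earlier

# Cross-entropy of a uniform predictor over the vocabulary
uniform_loss = math.log(vocab_size)
print(f"uniform baseline loss: {uniform_loss:.3f}")

# Perplexity corresponding to a given cross-entropy loss
example_loss = 1.292
perplexity = math.exp(example_loss)
print(f"perplexity at loss {example_loss}: {perplexity:.2f}")
```

The initial loss of about 10.08 is above the uniform baseline, which suggests the freshly pruned model is, on average, confidently wrong rather than merely uninformed before fine-tuning.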


### Define functions to handle the training and validation loops

In [30]:
def generate_text_simple(model, idx, max_new_tokens, context_size):
    # idx is (B, T) array of indices in the current context
    for _ in range(max_new_tokens):

        # Crop current context if it exceeds the supported context size
        # E.g., if LLM supports only 5 tokens, and the context size is 10
        # then only the last 5 tokens are used as context
        idx_cond = idx[:, -context_size:]

        # Get the predictions
        with torch.no_grad():
            logits = model(idx_cond).logits

        # Focus only on the last time step
        # (batch, n_token, vocab_size) becomes (batch, vocab_size)
        logits = logits[:, -1, :]

        # Get the idx of the vocab entry with the highest logits value
        idx_next = torch.argmax(logits, dim=-1, keepdim=True)  # (batch, 1)

        # Append sampled index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)  # (batch, n_tokens+1)

    return idx

def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    model.eval()
    with torch.no_grad():
        train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)
        val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)
    model.train()
    return train_loss, val_loss


def generate_and_print_sample(model, tokenizer, device, start_context):
    model.eval()
    context_size = model.config.max_position_embeddings
    encoded = tokenizer.encode(start_context, return_tensors="pt").to(device)
    with torch.no_grad():
        token_ids = generate_text_simple(
            model=model, idx=encoded,
            max_new_tokens=100, context_size=context_size
        )
        decoded_text = tokenizer.decode(token_ids[0].tolist(), skip_special_tokens=True)
        print(decoded_text.replace("\n", " "))  # Compact print format
    model.train()

def train_model_simple(model, train_loader, val_loader, optimizer, device, num_epochs,
                       eval_freq, eval_iter, start_context, tokenizer, scheduler=None):
    # Initialize lists to track losses and tokens seen
    train_losses, val_losses, track_tokens_seen = [], [], []
    tokens_seen, global_step = 0, -1

    # Main training loop
    for epoch in range(num_epochs):
        model.train()  # Set model to training mode

        for input_batch, target_batch in train_loader:
            optimizer.zero_grad()  # Reset loss gradients from previous batch iteration
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            loss.backward()  # Calculate loss gradients
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()  # Update model weights using loss gradients
            if scheduler is not None:
                scheduler.step()
            tokens_seen += input_batch.numel()
            global_step += 1

            # Optional evaluation step
            if global_step % eval_freq == 0:
                train_loss, val_loss = evaluate_model(
                    model, train_loader, val_loader, device, eval_iter)
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                track_tokens_seen.append(tokens_seen)
                print(f"Ep {epoch+1} (Step {global_step:06d}): "
                      f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")

        # Print a sample text after each epoch
        generate_and_print_sample(
            model, tokenizer, device, start_context
        )

    return train_losses, val_losses, track_tokens_seen

In [None]:
print("Number of batches per epoch", len(train_data)//batch_size)

Number of batches per epoch 26605
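This number matters for the learning-rate schedule: PyTorch's `OneCycleLR` raises an error if it is stepped more often than `total_steps`, so the total must match the number of optimizer steps actually taken. A short arithmetic sketch, assuming the batch size of 8, `drop_last=True`, and the 5 epochs used in this notebook:

```python
n_train = 212_841   # training set length printed earlier
batch_size = 8
num_epochs = 5

# With drop_last=True the final partial batch is discarded
steps_per_epoch = n_train // batch_size
leftover_examples = n_train % batch_size

# One scheduler.step() per batch, so this is what OneCycleLR needs
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, leftover_examples, total_steps)
```

Using `len(train_loader)` instead of the manual division would give the same count and stay correct if `drop_last` or the batch size changes.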


In [32]:
import time

start_time = time.time()

torch.manual_seed(123)

num_epochs = 5
lr = 0.0005

optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=lr, total_steps=len(train_data)//batch_size*num_epochs)

train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=200, eval_iter=10,
    start_context=format_input(val_data[0]), tokenizer=tokenizer,
    scheduler=scheduler
)

end_time = time.time()
execution_time_minutes = (end_time - start_time) / 60
print(f"Training completed in {execution_time_minutes:.2f} minutes.")

Ep 1 (Step 000000): Train loss 8.660, Val loss 8.686
Ep 1 (Step 000200): Train loss 1.987, Val loss 1.919
Ep 1 (Step 000400): Train loss 1.824, Val loss 1.736
Ep 1 (Step 000600): Train loss 1.521, Val loss 1.654
Ep 1 (Step 000800): Train loss 1.562, Val loss 1.572
Ep 1 (Step 001000): Train loss 1.583, Val loss 1.511
Ep 1 (Step 001200): Train loss 1.581, Val loss 1.468
Ep 1 (Step 001400): Train loss 1.320, Val loss 1.443
Ep 1 (Step 001600): Train loss 1.488, Val loss 1.417
Ep 1 (Step 001800): Train loss 1.270, Val loss 1.400
Ep 1 (Step 002000): Train loss 1.349, Val loss 1.382
Ep 1 (Step 002200): Train loss 1.473, Val loss 1.370
Ep 1 (Step 002400): Train loss 1.235, Val loss 1.350
Ep 1 (Step 002600): Train loss 1.238, Val loss 1.337
Ep 1 (Step 002800): Train loss 1.121, Val loss 1.322
Ep 1 (Step 003000): Train loss 1.396, Val loss 1.311
Ep 1 (Step 003200): Train loss 1.207, Val loss 1.308
Ep 1 (Step 003400): Train loss 1.414, Val loss 1.292
Ep 1 (Step 003600): Train loss 1.274, Val loss

Let's look at a few test examples:

In [33]:
torch.manual_seed(123)


for entry in test_data[:3]:

    input_text = format_input(entry)

    token_ids = generate(
        model=model,
        idx=tokenizer.encode(input_text, return_tensors="pt").to(device),
        max_new_tokens=256,
        context_size=1024,
        eos_id=2
    )
    generated_text = tokenizer.decode(token_ids[0].tolist(), skip_special_tokens=True)
    response_text = (
        generated_text[len(input_text):]
        .replace("### Response:", "")
        .strip()
    )

    print(input_text)
    print(f"\nCorrect response:\n>> {entry['output']}")
    print(f"\nModel response:\n>> {response_text.strip()}")
    print("-------------------------------------")

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Translate the following from English into Swahili.

### Input:
stop thinking of joseph and his...

Correct response:
>> kama hunielewi kamsome yusufu na ndoto yake.....

Model response:
>> joseph na josefu wake...
-------------------------------------
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Da fatan za a gano ra'ayin da ke cikin wannan rubutu bisa ga jagorori masu zuwa: Kyakkyawa: idan rubutu na nuna kyakkyawan tunani, hali, da yanayi. Korau: idan rubutu yana nuna mummunar tunani ko yanayi. Tsaka-tsaki: idan rubutu baya nuna kyakkyawar magana ko mara kyau kai tsaye ko a kaikaice.

### Input:
tohh fah gayya tan mu kekeyi muga wata version na shegantaka ki toh anki kallo

Correct response:
>> Korau

Model response:
>> Korau
-------------------------------------
Below is an instructio
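The cell above recovers the model's answer by slicing off the first `len(input_text)` characters, which relies on the decoded text reproducing the prompt character for character. A slightly more defensive variant splits on the response marker instead and only falls back to slicing. This is a sketch; `extract_response` is a hypothetical helper and the strings are made up:

```python
RESPONSE_MARKER = "### Response:"


def extract_response(generated_text, input_text):
    # Prefer splitting on the marker, which does not depend on exact prompt length
    if RESPONSE_MARKER in generated_text:
        return generated_text.split(RESPONSE_MARKER, 1)[1].strip()
    # Fallback: same slicing approach as in the notebook cell
    return generated_text[len(input_text):].strip()


# Made-up prompt and generations
prompt = "### Instruction:\nTranslate.\n\n### Input:\nhello"
with_marker = prompt + "\n\n### Response:\njambo"
without_marker = prompt + " jambo"

print(extract_response(with_marker, prompt))
print(extract_response(without_marker, prompt))
```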

## Save model

In [34]:
file_name = basedir/"vulavula_inkuba_instruct_tokenizer8k_pruned.pth"
torch.save(model.state_dict(), file_name)
print(f"Model saved as {file_name}")

Model saved as lelapa_buzuzu_mavi/vulavula_inkuba_instruct_tokenizer8k_pruned.pth
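Since only the `state_dict` is saved, reloading requires rebuilding the same architecture first and then calling `load_state_dict`, as done with `pruned_model_init.pth` earlier. A minimal round-trip sketch on a toy `torch.nn.Linear` layer, using an in-memory buffer instead of a file (the toy layer stands in for the real model):

```python
import io

import torch

torch.manual_seed(0)

# Toy stand-in for the fine-tuned model
toy_model = torch.nn.Linear(4, 2)

# Save only the weights, as in the cell above
buffer = io.BytesIO()
torch.save(toy_model.state_dict(), buffer)
buffer.seek(0)

# Rebuild the same architecture, then load the saved weights into it
restored = torch.nn.Linear(4, 2)
restored.load_state_dict(torch.load(buffer, map_location="cpu"))

weights_match = all(
    torch.equal(a, b)
    for a, b in zip(toy_model.state_dict().values(), restored.state_dict().values())
)
print(weights_match)
```

For the real model, the architecture would be recreated with `AutoModelForCausalLM.from_config(config_prn)` before loading the saved file.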
