<a href="https://colab.research.google.com/github/rokosbasilisk/HALOs/blob/main/RamBharadwaj_Aryasomayajula_C4AIScholarsChallenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Background**

Welcome to the C4AI Scholars Program Take-Home Challenge! This exercise is designed to allow you to showcase your engineering and problem solving skills. The Challenge consists of different challenges including:

*   Identifying bugs, and getting the code working. This is designed to test your ability to grapple with real world engineering challenges.
*   Testing your ability to generate code for a specified problem.
*   An opportunity for you to attempt an optional challenge question that extends the original problem set.

These tasks were chosen as a setting to see how you think about problems, even if they are not in your own research field of interest. The tasks and dataset are not meant to be indicative of the research goals of the Scholar Program. We purposefully have selected a simple toy problem so the focus is on how you think, and does not require significant machine learning resources (can be run in this colab).

Good luck!

**How to Use and Submit this Document?**

*   **Make a copy of this document** and rename it **Firstname_Lastname_C4AIScholarsChallenge**
*   Once you have completed all tasks, save and pin your revisions
*   Submit the assignment by responding directly to this email with a link to your final document by Sunday, September 15th, 11 PM PDT.

## **Coding Challenge Part 1: Debugging custom SmolLM code [10 points]**

In this coding challenge, you are required to debug and fix a bare-bones implementation of the following model.

**Model** : SmolLM-135M can be found at [HuggingFace](https://huggingface.co/HuggingFaceTB/SmolLM-135M).

We have 10 bugs in the following implementation.
There is a `check_solution` function for your convenience to verify you have correctly identified all the bugs. If you have found all bugs, the generated outputs will match the reference model exactly.

**Rules**:
1. **Bug Definition:**
  - There are 10 bugs to be fixed.
  - A bug is *defined as **{incorrect, missing, unnecessary}** lines of code*.
  - You earn 1 point for each correctly identified and fixed bug.
2. **Fix Guidelines:**
  - You are encouraged to make the smallest possible fix, wherever possible (e.g. edit a line instead of replacing it entirely).
  - Do not optimize the code; only fix the bugs. The implementation is *intentionally* non-optimized but valid.
3. **Documentation:** Document each fix by adding a comment on the line above the fix: : `### BUG FIX ###`.
4. **Sections:** *1. Setup [Helper Functions]* and *3. Test* don't contain bugs and shouldn't be changed.
5. **Submission:** Your final submission should be the exact same file except with your proposed fixes and the respective comments as per Rule #3.

## 1. Setup [Helper Functions]

In [None]:
# #####################################################################################################################
# ############################################# DO NOT CHANGE[START] ##################################################
# #####################################################################################################################


#[Don't use. Rate limit issues.] Use gdown to get weights file(BareBones_SmolLM-135M.pt) at https://drive.google.com/file/d/1tY46FSJEhGYRrfKRQTjJ1Cc7q9psaKUU/view . gdown should be installed by default else use `pip install gdown`
# !gdown 1tY46FSJEhGYRrfKRQTjJ1Cc7q9psaKUU


# [Recommended] Use HF to download the weights
!git lfs install
!git clone https://huggingface.co/dsouzadaniel/C4AI_SMOLLM135
!mv C4AI_SMOLLM135/BareBones_SmolLM-135M.pt ./
!ls

Git LFS initialized.
Cloning into 'C4AI_SMOLLM135'...
remote: Enumerating objects: 6, done.[K
remote: Counting objects: 100% (3/3), done.[K
remote: Compressing objects: 100% (3/3), done.[K
remote: Total 6 (delta 0), reused 0 (delta 0), pack-reused 3 (from 1)[K
Unpacking objects: 100% (6/6), 2.11 KiB | 2.11 MiB/s, done.
BareBones_SmolLM-135M.pt  C4AI_SMOLLM135  sample_data


In [2]:

# Libraries
import torch
import torch.nn.functional as F
from torch import nn
import math
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model initialization/settings
checkpoint="HuggingFaceTB/SmolLM-135M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

__reference_model = AutoModelForCausalLM.from_pretrained(checkpoint)
__reference_model.eval()

class smolConfig:
    vocab_size=49152
    hidden_size=576
    intermediate_size=1536
    num_hidden_layers = 30
    num_heads = 9
    kv_heads=3
config = smolConfig

# Helper Functions
def __generate(model, inputs, num_tokens):
    collect = []
    for _ in range(num_tokens):
        output = model(**inputs)
        output_id = torch.argmax(output['logits'][0,-1]).item()
        collect.append(output_id)
        if output_id==tokenizer.eos_token_id:
            break
        inputs['input_ids'] = torch.unsqueeze(torch.cat([inputs['input_ids'][0],torch.tensor([output_id])]),dim=0)
        inputs['attention_mask'] = torch.ones_like(inputs['input_ids'])
    return tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(collect))

def check_solution(prompt, num_tokens, model_A, model_B):
    print()
    print(f"{'>'*20}\n\tPrompt\n{'<'*20}\n{prompt}\n\n")
    model_inputs = tokenizer(prompt, return_tensors='pt')
    print(f"{'>'*30}\n\tModel_A Generation\n{'<'*30}\n{__generate(model_A,  model_inputs, num_tokens)}")
    print("\n\n")
    model_inputs = tokenizer(prompt, return_tensors='pt')
    print(f"{'>'*30}\n\tModel_B Generation\n{'<'*30}\n{__generate(model_B,  model_inputs, num_tokens)}")

######################################################################################################################
############################################### DO NOT CHANGE[END] ###################################################
######################################################################################################################

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [1]:
# helper functions
import os
from google.colab import drive
import shutil

# for installing missing libraries
def install_if_not_installed(package_name):
    try:
        __import__(package_name)
        print(f"'{package_name}' is already installed.")
    except ImportError:
        print(f"'{package_name}' not found. Installing...")
        os.system(f'pip install {package_name}')

# install_if_not_installed('datasets')


# made this helper function so that i can download all the model files in case the runtime gets randomly terminated (which happened multiple times)
def save_colab_to_drive(destination_folder='Colab_Backup'):

    # Mount Google Drive if not already mounted
    if not os.path.exists('/content/drive'):
        drive.mount('/content/drive')

    # Specify the full path to the destination folder in Google Drive
    full_destination_path = f'/content/drive/MyDrive/{destination_folder}/'

    # Create the destination folder if it doesn't exist
    os.makedirs(full_destination_path, exist_ok=True)

    # Get list of all files and directories in the current directory
    files_to_copy = os.listdir('/content')

    # Copy each file/folder to Google Drive
    for item in files_to_copy:
        source_path = os.path.join('/content', item)
        dest_path = os.path.join(full_destination_path, item)

        if item == 'drive':  # Skip the mounted Google Drive folder
            continue

        try:
            if os.path.isfile(source_path):
                shutil.copy2(source_path, dest_path)
                print(f"Copied file: {item}")
            elif os.path.isdir(source_path):
                shutil.copytree(source_path, dest_path, dirs_exist_ok=True)
                print(f"Copied directory: {item}")
            else:
                print(f"Skipped: {item} (not a file or directory)")
        except Exception as e:
            print(f"Error copying {item}: {str(e)}")

    print(f"All files and folders have been copied to Google Drive folder: {destination_folder}")

# save_colab_to_drive()  # This will save to a folder named 'Colab_Backup'

def load_drive_to_colab(source_folder='Colab_Backup'):
    # Mount Google Drive if not already mounted
    if not os.path.exists('/content/drive'):
        drive.mount('/content/drive')

    # Specify the full path to the source folder in Google Drive
    full_source_path = f'/content/drive/MyDrive/{source_folder}/'

    # Check if the source folder exists
    if not os.path.exists(full_source_path):
        print(f"Source folder '{full_source_path}' not found in Google Drive.")
        return

    # Get list of all files and directories in the source folder
    files_to_copy = os.listdir(full_source_path)

    # Copy each file/folder to Colab workspace
    for item in files_to_copy:
        source_path = os.path.join(full_source_path, item)
        dest_path = os.path.join('/content', item)

        try:
            if os.path.isfile(source_path):
                shutil.copy2(source_path, dest_path)
                print(f"Copied file: {item}")
            elif os.path.isdir(source_path):
                shutil.copytree(source_path, dest_path, dirs_exist_ok=True)
                print(f"Copied directory: {item}")
            else:
                print(f"Skipped: {item} (not a file or directory)")
        except Exception as e:
            print(f"Error copying {item}: {str(e)}")

    print(f"All files and folders have been copied from Google Drive folder '{source_folder}' to Colab workspace.")

# load_drive_to_colab()  # This will load from a folder named 'Colab_Backup'

## 2. Custom SmolLM (for BugFixes)

In [None]:
def rotate_half(x):
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):

    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)


    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed

def repeat_kv(hidden_states, n_rep):
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)


class RotaryEmbedder(nn.Module):
    def __init__(self, dim, base):
        super().__init__()

        ### BUG FIX ###
        # remove int64 dtype and use float for the frequencies (?)
        self.freq = 1/(base ** (torch.arange(0, dim, 2).float()/dim))

    @torch.no_grad()
    def forward(self, x, position_ids):

        ### BUG FIX ###
        # utilize position_ids provided
        position_ids = position_ids.view(-1).to(torch.float32)

        angles = torch.einsum('i,j->ij', position_ids, self.freq)
        angles = angles.unsqueeze(0)
        emb = torch.cat((angles, angles), dim=-1)
        return emb.cos(), emb.sin()


class MLP(nn.Module):
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size

        ### BUG FIX ###
        # use bias for all linear layers
        self.W_gate = nn.Linear(self.hidden_size, self.intermediate_size, bias=True)
        self.W_up = nn.Linear(self.hidden_size, self.intermediate_size, bias=True)
        self.W_down = nn.Linear(self.intermediate_size, self.hidden_size, bias=True)
        self.act_fn = torch.nn.modules.activation.SiLU()

    def forward(self, x):
        down_proj = self.W_down(self.act_fn((self.W_gate(x)) * self.W_up(x)))
        return down_proj

class RMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        variance = hidden_states.pow(2).mean(-1, keepdim=True)

        ### BUG FIX ###
        # normalization should use rsqrt instead of sqrt
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states


class RopeAttention(nn.Module):
    def __init__(self,config):
        super().__init__()
        self.hidden_size=config.hidden_size
        self.num_heads = config.num_heads
        self.head_dim = config.hidden_size//self.num_heads
        self.kv_heads = config.kv_heads
        self.rope_theta = 10000.0


        ### BUG FIX ###
        # use bias for all linear layers
        self.W_query = nn.Linear(config.hidden_size, self.num_heads * self.head_dim, bias=True)
        self.W_key = nn.Linear(config.hidden_size, self.kv_heads * self.head_dim, bias=True)
        self.W_value = nn.Linear(config.hidden_size, self.kv_heads * self.head_dim, bias=True)
        self.W_output = nn.Linear(config.hidden_size, config.hidden_size, bias=True)

        ### BUG FIX ###
        # use  already defined self.head_dim instead of config.hidden_size//self.num_heads
        self.rotary_emb = RotaryEmbedder(base=self.rope_theta, dim=self.head_dim)

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask= None,
    ):
        b, q, _ = hidden_states.size()

        q_states = self.W_query(hidden_states)
        k_states = self.W_key(hidden_states)
        v_states = self.W_value(hidden_states)

        q_states = q_states.view(b, q, self.num_heads, self.head_dim).transpose(1, 2)
        k_states = k_states.view(b, q, self.kv_heads, self.head_dim).transpose(1, 2)
        v_states = v_states.view(b, q, self.kv_heads, self.head_dim).transpose(1, 2)

        position_ids = torch.arange(q, device=hidden_states.device).unsqueeze(0)

        cos, sin = self.rotary_emb(v_states, position_ids)
        q_states, k_states = apply_rotary_pos_emb(q_states, k_states, cos, sin, position_ids)

        ### BUG FIX ###
        # use floor division for number of kv-groups

        __kv_groups = self.num_heads // self.kv_heads
        k_states = repeat_kv(k_states, __kv_groups)
        v_states = repeat_kv(v_states, __kv_groups)

        ### BUG FIX ###
        # attention weights should be scaled by math.sqrt(self.head_dim) not hidden_size
        attn_weights = torch.matmul(q_states, k_states.transpose(2, 3)) / math.sqrt(self.head_dim)

        attn_weights = attn_weights + attention_mask
        attn_weights = nn.functional.softmax(attn_weights, dim=-1)

        ### BUG FIX ###
        #  explicitly specify a smaller dropout probability to avoid the default value (0.5)
        attn_weights = nn.functional.dropout(attn_weights, p = 0.2)

        attn_output = torch.matmul(attn_weights, v_states)
        attn_output = attn_output.transpose(1, 2).contiguous()
        attn_output = attn_output.reshape(b, q, -1)

        return attn_output

class LlamaDecoder(nn.Module):
    def __init__(self,config):
        super().__init__()
        self.self_attn = RopeAttention(config)
        self.mlp = MLP(hidden_size=config.hidden_size, intermediate_size=config.intermediate_size)
        self.pre_attn_rmsnorm = RMSNorm(config.hidden_size, eps=1e-05)
        self.pre_mlp_rmsnorm = RMSNorm(config.hidden_size, eps=1e-05)

    def forward(self,hidden_states, attention_mask):
        residual = hidden_states
        hidden_states = self.pre_attn_rmsnorm(hidden_states)

        ### BUG FIX ###
        # Remove this redundant attention_mask and use the actual one provided
        # attention_mask = torch.triu(torch.full((attention_mask.shape[-1],attention_mask.shape[-1]), fill_value=float('-inf')),diagonal=1)

        hidden_states = self.self_attn(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
        )
        hidden_states += residual
        hidden_states = self.pre_mlp_rmsnorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        hidden_states += residual

        outputs = (hidden_states,)
        return outputs

class smolModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embed_tokens = nn.Embedding(num_embeddings=config.vocab_size,
                                         embedding_dim=config.hidden_size)
        self.layers = nn.ModuleList([LlamaDecoder(config) for _ in range(config.num_hidden_layers)])
        self.norm = RMSNorm(config.hidden_size, eps=1e-05)

    def forward(
        self,
        input_ids= None,
        attention_mask= None,
    ):
        inputs_embeds = self.embed_tokens(input_ids)
        hidden_states = inputs_embeds
        for decoder_layer in self.layers:
            layer_outputs = decoder_layer(
                hidden_states,
                attention_mask=attention_mask,
            )
            hidden_states = layer_outputs[0]
        hidden_states = self.norm(hidden_states)
        return [hidden_states]

class smolLM(nn.Module):
    def __init__(self,config):
        super().__init__()
        self.model = smolModel(config)

        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size,bias=False)

    def forward(self,input_ids,attention_mask):
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        hidden_states = outputs[0].squeeze()
        logits = self.lm_head(hidden_states)
        logits = logits.float()
        return {'logits':logits}


In [None]:
__test_model = smolLM(config)
__test_model.load_state_dict(torch.load('BareBones_SmolLM-135M.pt'), strict=False)
__test_model.eval()

  __test_model.load_state_dict(torch.load('BareBones_SmolLM-135M.pt'), strict=False)


smolLM(
  (model): smolModel(
    (embed_tokens): Embedding(49152, 576)
    (layers): ModuleList(
      (0-29): 30 x LlamaDecoder(
        (self_attn): RopeAttention(
          (W_query): Linear(in_features=576, out_features=576, bias=True)
          (W_key): Linear(in_features=576, out_features=192, bias=True)
          (W_value): Linear(in_features=576, out_features=192, bias=True)
          (W_output): Linear(in_features=576, out_features=576, bias=True)
          (rotary_emb): RotaryEmbedder()
        )
        (mlp): MLP(
          (W_gate): Linear(in_features=576, out_features=1536, bias=True)
          (W_up): Linear(in_features=576, out_features=1536, bias=True)
          (W_down): Linear(in_features=1536, out_features=576, bias=True)
          (act_fn): SiLU()
        )
        (pre_attn_rmsnorm): RMSNorm()
        (pre_mlp_rmsnorm): RMSNorm()
      )
    )
    (norm): RMSNorm()
  )
  (lm_head): Linear(in_features=576, out_features=49152, bias=False)
)

# 3. Test

In [None]:
######################################################################################################################
############################################## DO NOT CHANGE[START] ##################################################
######################################################################################################################

###### TESTING PROMPTS
# Single-Token Quick Test
check_solution(prompt="Given the following film movie by a critic, rate it out of 10. Respond in a single number.\n\nThe movie started off extremely well, but just got worse after that.\nThe storyline was all over the place and everyone acted terribly.\n 10/10 would not recommend! \n\n ",
               num_tokens=1,
               model_A=__reference_model,
               model_B=__test_model)


We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)



>>>>>>>>>>>>>>>>>>>>
	Prompt
<<<<<<<<<<<<<<<<<<<<
Given the following film movie by a critic, rate it out of 10. Respond in a single number.

The movie started off extremely well, but just got worse after that.
The storyline was all over the place and everyone acted terribly.
 10/10 would not recommend! 

 


>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
	Model_A Generation
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
1



>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
	Model_B Generation
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
<|endoftext|>


In [None]:
# Multi-Token Quick Test
check_solution(prompt="Where is the Nile located?",
               num_tokens=50,
               model_A=__reference_model,
               model_B=__test_model)

######################################################################################################################
############################################### DO NOT CHANGE[END] ###################################################
######################################################################################################################


>>>>>>>>>>>>>>>>>>>>
	Prompt
<<<<<<<<<<<<<<<<<<<<
Where is the Nile located?


>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
	Model_A Generation
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

The Nile River is located in the Nile Delta in the Nile River Basin, which is a region of Africa. It is the longest river in the world, with a length of 4,330 miles (6,900 km



>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
	Model_B Generation
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
<|endoftext|>


# **Coding Challenge Part 2: Teach SmolLM to do grammatical error correction [15 points]**

The goal of this part is to train the SmolLM-135M model to perform grammatical error correction (GEC) using the Grammarly CoEdIT dataset. This [dataset](https://huggingface.co/datasets/grammarly/coedit), derived from the [CoEdIT project](https://arxiv.org/abs/2305.09857), provides a rich collection of text editing instructions and examples. The task involves several key steps that mimic conventional alignment processes:




## **2.1 Supervised Fine-Tuning (SFT) on Training Data [5 points]**

* Fine-tune the [SmolLM-135M model](https://huggingface.co/HuggingFaceTB/SmolLM-135M) using the CoEdIT dataset, which includes input sentences with grammatical errors and their corrected versions. Use the training GEC portion of the CoEdIT dataset to teach the model how to correct grammatical errors effectively.
* Calculate the BLEU score on the validation set to evaluate the model's performance in generating grammatically correct sentences. Ensure that this evaluation process is reusable for later comparisons.
* Search for an optimal set of hyperparameters, such as the learning rate. We provide an estimated BLEU score that you should aim to achieve after one epoch. However, you may achieve a better score by finding the most suitable hyperparameters. **Do not train for more than 3 epochs -- we do not expect extensive training time.**
* For Part 2, don't use additional libraries, if an imported library is missing, install it with **pip install**.

In [None]:
install_if_not_installed('datasets')
from datasets import load_dataset, Dataset

# Download the GEC data
full_train_ds = load_dataset("grammarly/coedit", split="train")
full_test_ds = load_dataset("grammarly/coedit", split="validation")

'datasets' is already installed.


README.md:   0%|          | 0.00/1.88k [00:00<?, ?B/s]

train.jsonl:   0%|          | 0.00/19.7M [00:00<?, ?B/s]

validation.jsonl:   0%|          | 0.00/692k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/69071 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1712 [00:00<?, ? examples/s]

In [None]:
# Filter examples, keeping only GEC task

print(full_train_ds[0])

train_dataset = Dataset.from_dict({
    'src': [example['src'] for example in full_train_ds if example['task']=='gec'],
    'tgt': [example['tgt'] for example in full_train_ds if example['task']=='gec'],
})

test_dataset = Dataset.from_dict({
    'src': [example['src'] for example in full_test_ds if example['task']=='gec'],
    'tgt': [example['tgt'] for example in full_test_ds if example['task']=='gec'],
})

print(f"number of examples in train: {len(train_dataset)} test: {len(test_dataset)}")
print(test_dataset[0]) #print sample row

{'_id': '1', 'task': 'gec', 'src': 'Remove all grammatical errors from this text: For example, countries with a lot of deserts can terraform their desert to increase their habitable land and using irrigation to provide clean water to the desert.', 'tgt': 'For example, countries with a lot of deserts can transform their desert to increase their habitable land and use irrigation to provide clean water to the desert.'}
number of examples in train: 19823 test: 485
{'src': 'Fix grammaticality: First of all, from you read just to found in the poems or novel what well-known critic have already found out, you looses the pleasures of reading something which is expecting to be a new experience to you.', 'tgt': 'First of all, if you read just to find in the poem or novel what well-known critics have already found out, you lose the pleasure of reading something that is expected to be a new experience to you.'}


Expected number of train and test samples are 19823 and 485, respectively.

In [None]:
install_if_not_installed('transformers')

import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model and tokenizer name
model_name = "HuggingFaceTB/SmolLM-135M"

# Training device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Add special tokens
special_tokens_dict = {'additional_special_tokens': ['<INPUT>', '</INPUT>', '<OUTPUT>', '</OUTPUT>']}
tokenizer.add_special_tokens(special_tokens_dict)

# Add pad token if not present
if tokenizer.pad_token is None: tokenizer.add_special_tokens({'pad_token': '[PAD]'})

# set model's pad_token_id
model = AutoModelForCausalLM.from_pretrained(model_name)
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id
model.to(device)

'transformers' is already installed.


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(49157, 576)
    (layers): ModuleList(
      (0-29): 30 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=576, out_features=576, bias=False)
          (k_proj): Linear(in_features=576, out_features=192, bias=False)
          (v_proj): Linear(in_features=576, out_features=192, bias=False)
          (o_proj): Linear(in_features=576, out_features=576, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=576, out_features=1536, bias=False)
          (up_proj): Linear(in_features=576, out_features=1536, bias=False)
          (down_proj): Linear(in_features=1536, out_features=576, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((576,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((576,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm

In [None]:
install_if_not_installed('trl')

from transformers import DataCollatorForLanguageModeling
from trl import SFTConfig, SFTTrainer
from datasets import Dataset


# Hyperparameters
learning_rates = [5e-5, 1e-4]
weight_decays = [0.1, 0.01]
batch_size = 32
grad_accum_steps = 2
dropout = 0.3
warmup_steps = 100
max_seq_len = 128
num_epochs = 1

def format_text(src: str) -> str: # common function to be used to format the input and output examples for training and inference
    return f"<INPUT> Correct the sentence: {src} </INPUT> <OUTPUT>"


def tokenize_and_prepare(examples):
    inputs = []
    labels = []
    for src, tgt in zip(examples['src'], examples['tgt']):
        # Format the input and target
        formatted_input = format_text(src)
        formatted_target = f"{tgt} </OUTPUT>"

        # Concatenate input and target
        full_sequence = formatted_input + formatted_target

        # Tokenize the full sequence
        tokenized = tokenizer(
            full_sequence,
            max_length=max_seq_len,
            padding='max_length',
            truncation=True,
            return_tensors='pt',
        )

        input_ids = tokenized['input_ids'][0]

        # Determine the input length
        input_length = len(tokenizer.encode(formatted_input, add_special_tokens=False))

        # Prepare labels: mask input tokens
        labels_ids = input_ids.clone()
        labels_ids[:input_length] = -100  # Mask input tokens

        # Append to lists
        inputs.append(input_ids)
        labels.append(labels_ids)

    return {'input_ids': torch.stack(inputs), 'labels': torch.stack(labels)}


train_dataset_tokenized = train_dataset.map(tokenize_and_prepare, batched=True, remove_columns=train_dataset.column_names)

test_dataset_tokenized = test_dataset.map(tokenize_and_prepare, batched=True, remove_columns=test_dataset.column_names)

# Set the format for PyTorch tensors
train_dataset_tokenized.set_format(type='torch')
test_dataset_tokenized.set_format(type='torch')

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Initialize tracking variables
best_loss = torch.inf
best_hyperparams = {}
best_model_dir = "best_model"


# Iterate over combinations of learning rates and weight decays
for lr in learning_rates:
    for wd in weight_decays:
        print(f"\nTraining with Learning Rate: {lr}, Weight Decay: {wd}")

        model = AutoModelForCausalLM.from_pretrained(model_name)
        model.resize_token_embeddings(len(tokenizer))
        model.config.pad_token_id = tokenizer.pad_token_id
        model.to(device)

        # Adjust dropout rates
        model.config.dropout = dropout
        model.config.attention_dropout = dropout
        model.config.activation_dropout = dropout

        sft_config = SFTConfig(
            output_dir=f"checkpoints_lr{lr}_wd{wd}",
            num_train_epochs=num_epochs,
            per_device_train_batch_size=batch_size,
            per_device_eval_batch_size=batch_size,
            gradient_accumulation_steps=grad_accum_steps,
            learning_rate=lr,
            weight_decay=wd,
            lr_scheduler_type="linear",
            warmup_steps=warmup_steps,
            logging_dir=f"logs_lr{lr}_wd{wd}",
            evaluation_strategy="epoch",
            save_strategy="epoch",
            load_best_model_at_end=True,
            metric_for_best_model="eval_loss",
            greater_is_better=False,
            logging_strategy="steps",
            logging_steps=100,
            do_eval=True,
            logging_first_step=True,
            packing=False,
            dataset_text_field='input_ids',
        )

        trainer = SFTTrainer(
            model=model,
            args=sft_config,
            train_dataset=train_dataset_tokenized,
            eval_dataset=test_dataset_tokenized,
            tokenizer=tokenizer,
            data_collator=data_collator,
        )

        train_result = trainer.train()
        print(f"Training loss: {train_result.training_loss}")

        eval_results = trainer.evaluate()
        eval_loss = eval_results["eval_loss"]
        print(f"Eval Loss: {eval_loss}")


        if eval_loss < best_loss:
            best_loss = eval_loss
            best_hyperparams = {"learning_rate": lr, "weight_decay": wd}
            # Save the best model and tokenizer
            trainer.save_model(best_model_dir)
            tokenizer.save_pretrained(best_model_dir)
            print(f"New best model saved with lr={lr}, wd={wd}")

print(f"\nBest hyperparameters found: {best_hyperparams} with eval loss: {best_loss}")
print(f"Best model saved in directory: {best_model_dir}")

'trl' not found. Installing...


Map:   0%|          | 0/19823 [00:00<?, ? examples/s]

Map:   0%|          | 0/485 [00:00<?, ? examples/s]


Training with Learning Rate: 5e-05, Weight Decay: 0.1




Epoch,Training Loss,Validation Loss
1,1.4761,2.924565


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


Training loss: 1.7378782756866948


Eval Loss: 2.924565076828003
New best model saved with lr=5e-05, wd=0.1

Training with Learning Rate: 5e-05, Weight Decay: 0.01




Epoch,Training Loss,Validation Loss
1,1.4727,2.921771


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


Training loss: 1.7287281159431704


Eval Loss: 2.9217708110809326
New best model saved with lr=5e-05, wd=0.01

Training with Learning Rate: 0.0001, Weight Decay: 0.1




Epoch,Training Loss,Validation Loss
1,1.457,2.920208


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


Training loss: 1.6610293173020887


Eval Loss: 2.920208215713501
New best model saved with lr=0.0001, wd=0.1

Training with Learning Rate: 0.0001, Weight Decay: 0.01




Epoch,Training Loss,Validation Loss


Epoch,Training Loss,Validation Loss
1,1.4571,2.920365


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


Training loss: 1.6610445330219885


Eval Loss: 2.9203648567199707

Best hyperparameters found: {'learning_rate': 0.0001, 'weight_decay': 0.1} with eval loss: 2.920208215713501
Best model saved in directory: best_model


In [None]:
# Quick test if your model works properly
model.to('cpu')

# Example of how to run inference on a single example
text = "I likes turtles"
formatted_text = format_text(text)

# Tokenize the formatted input
inputs = tokenizer(
    formatted_text,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=128,
)

# Generate the corrected sentence
outputs = model.generate(
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    max_new_tokens=128,
    temperature=0.5,
    early_stopping=True,
    num_beams=5,
    eos_token_id=tokenizer.convert_tokens_to_ids('</OUTPUT>'),
)

# Decode the generated tokens
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)

# Extract the corrected sentence from the generated text
# Remove the input prompt and any special tokens
if '<OUTPUT>' in generated_text:
    corrected_sentence = generated_text.split('<OUTPUT>')[-1].split('</OUTPUT>')[0].strip()
else:
    # Fallback in case the special tokens are not present
    corrected_sentence = generated_text

print("Corrected sentence:", corrected_sentence)

Setting `pad_token_id` to `eos_token_id`:49155 for open-end generation.


Corrected sentence: I like turtles.


Expected output: I like turtles.

In [None]:
install_if_not_installed('evaluate')

import evaluate
from torch.utils.data import DataLoader

# Load the BLEU metric from the evaluate library
bleu = evaluate.load("bleu")

def evaluate_model(model, tokenizer, dataset, device, batch_size=32):
    preds = []
    targets = []

    model.eval()  # Set model to evaluation mode

    # Create a DataLoader without a collate_fn since we'll handle tokenization here
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False)

    # Define bad_words_ids to prevent the model from generating [PAD] tokens
    bad_words_ids = [[tokenizer.pad_token_id]]

    for batch in dataloader:
        # Access 'src' and 'tgt' from the batch
        src_texts = batch['src']
        tgt_texts = batch['tgt']

        # Format the input texts using the same formatting function as during training
        inputs = [f"<INPUT> Correct the sentence: {src} </INPUT> <OUTPUT>" for src in src_texts]

        # Tokenize the inputs
        tokenized_inputs = tokenizer(
            inputs,
            return_tensors='pt',
            padding=True,
            truncation=True,
            max_length=128,
        ).to(device)

        with torch.no_grad():
            outputs = model.generate(
                input_ids=tokenized_inputs['input_ids'],
                attention_mask=tokenized_inputs['attention_mask'],
                max_new_tokens=128,
                num_beams=5,
                temperature=0.2,
                eos_token_id=tokenizer.convert_tokens_to_ids('</OUTPUT>'),
                early_stopping=True,
                pad_token_id=tokenizer.pad_token_id,  # Ensure the pad_token_id is set
                bad_words_ids=bad_words_ids,  # Prevent generation of [PAD] tokens
            )

        # Decode predictions
        decoded_preds = tokenizer.batch_decode(outputs, skip_special_tokens=False)

        # Extract the corrected sentences from predictions
        corrected_preds = []
        for pred in decoded_preds:
            # Remove [PAD] tokens from the prediction
            pred = pred.replace(tokenizer.pad_token, '').strip()
            if '<OUTPUT>' in pred and '</OUTPUT>' in pred:
                corrected_pred = pred.split('<OUTPUT>')[1].split('</OUTPUT>')[0].strip()
            elif '<OUTPUT>' in pred:
                corrected_pred = pred.split('<OUTPUT>')[1].strip()
            else:
                corrected_pred = pred.strip()
            corrected_preds.append(corrected_pred)

        # Prepare targets
        corrected_targets = [tgt.strip() for tgt in tgt_texts]

        preds.extend(corrected_preds)
        targets.extend([[target] for target in corrected_targets])  # References need to be a list of lists

    # Compute BLEU score
    results = bleu.compute(predictions=preds, references=targets)
    return results["bleu"], preds, targets

'evaluate' not found. Installing...


Downloading builder script:   0%|          | 0.00/5.94k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/3.34k [00:00<?, ?B/s]

In [None]:
# Evaluate model, use the function given above
model.to('cuda')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
bleu_score,preds_sft,targets_sft = evaluate_model(model, tokenizer, test_dataset, device, 32)
print(f"BLEU score on test set: {bleu_score}")

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='le

BLEU score on test set: 0.48387137720260337


Expected BLEU score after 1 epoch SFT is ~ 0.48.

## **2.2 Create a preference optimization dataset [5 points]**

* *Generate Output Variants* -- for each input sentence in the training set, use the fine-tuned model to generate two different output variants.
 * Consider using different decoding strategies, such as varying the temperature or beam size, to produce diverse outputs. Select an approach based on the desired balance between diversity and quality.

* *Preference Annotation* -- measure the edit distance between each **generated predicted variant** and **ground truth correction**. Label the variant with the lower edit distance as "chosen" and the one with the higher edit distance as "rejected."
 * Beyond using edit distance, what other metrics or methods could you consider to do preference dataset annotation?


In [33]:
install_if_not_installed('fast-edit-distance')

import torch
from fast_edit_distance import edit_distance
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import Dataset
from torch.cuda.amp import autocast

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("best_model")
tokenizer = AutoTokenizer.from_pretrained("best_model")

# Use FP16 (half-precision) for faster computation if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    model = model.half().to(device)
else:
    model = model.to(device)

# Define bad words to prevent generation of [PAD] tokens
bad_words_ids = [[tokenizer.pad_token_id]]

# Function to format the text
def format_text(src: str) -> str:
    return f"<INPUT> Correct the sentence: {src} </INPUT> <OUTPUT>"

# Function to generate variants with different decoding methods
def generate_variants(texts, max_seq_len=128, batch_size=128, method='beam'):
    all_variants = []
    total_batches = (len(texts) + batch_size - 1) // batch_size

    # Tokenize and process in batches
    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i + batch_size]

        # Format input texts like during training
        inputs = [format_text(src) for src in batch_texts]

        tokenized_inputs = tokenizer(
            inputs,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=max_seq_len
        ).to(device)

        # Use half-precision for model inference with autocast
        with torch.no_grad():
            with autocast():
                if method == 'beam':
                    # Beam search: deterministic, controlled output
                    outputs = model.generate(
                        input_ids=tokenized_inputs['input_ids'],
                        attention_mask=tokenized_inputs['attention_mask'],
                        max_new_tokens=128,
                        num_beams=5,  # Number of beams for beam search
                        eos_token_id=tokenizer.convert_tokens_to_ids('</OUTPUT>'),  # Ensure generation stops at </OUTPUT>
                        early_stopping=True,  # Stop when all beams reach eos
                        pad_token_id=tokenizer.pad_token_id,  # Ensure the pad_token_id is set
                        bad_words_ids=bad_words_ids,  # Prevent generation of [PAD] tokens
                    )
                elif method == 'sample':
                    # Sampling: diverse, stochastic output
                    outputs = model.generate(
                        input_ids=tokenized_inputs['input_ids'],
                        attention_mask=tokenized_inputs['attention_mask'],
                        max_new_tokens=128,
                        do_sample=True,
                        temperature=0.7,
                        top_p=0.9,  # Nucleus sampling
                        eos_token_id=tokenizer.convert_tokens_to_ids('</OUTPUT>'),
                        pad_token_id=tokenizer.pad_token_id,
                        bad_words_ids=bad_words_ids,
                    )

        # Decode and post-process predictions
        decoded_preds = tokenizer.batch_decode(outputs, skip_special_tokens=False)
        corrected_preds = []
        for pred in decoded_preds:
            # Remove [PAD] tokens from the prediction
            pred = pred.replace(tokenizer.pad_token, '').strip()
            # Extract text between <OUTPUT> and </OUTPUT>
            if '<OUTPUT>' in pred and '</OUTPUT>' in pred:
                corrected_pred = pred.split('<OUTPUT>')[1].split('</OUTPUT>')[0].strip()
            elif '<OUTPUT>' in pred:
                corrected_pred = pred.split('<OUTPUT>')[1].strip()
            else:
                corrected_pred = pred.strip()
            corrected_preds.append(corrected_pred)

        all_variants.extend(corrected_preds)

        # Progress tracking
        batch_num = i // batch_size + 1
        print(f"{method.capitalize()} Search - Batch {batch_num}/{total_batches} - "
              f"Progress: {batch_num/total_batches*100:.2f}%")

    return all_variants

# Function to compute fast edit distances between two sets of sequences
def batch_edit_distance(sequences1, sequences2):
    return [edit_distance(s1, s2) for s1, s2 in zip(sequences1, sequences2)]

# Function to create the preference dataset with post-processing
def create_preference_dataset(dataset):
    print("Formatting input texts...")

    # Format source and target texts
    src_texts = [example['src'] for example in dataset]
    tgt_texts = [example['tgt'] for example in dataset]
    print("Formatting completed.")

    # Generate output variants using beam search and temperature sampling
    print("Generating beam search variants...")
    variants1 = generate_variants(src_texts, method='beam')
    print("Generating temperature sampling variants...")
    variants2 = generate_variants(src_texts, method='sample')

    # Compute edit distances efficiently
    print("Calculating edit distances...")
    edit_dist_1 = batch_edit_distance(variants1, tgt_texts)
    edit_dist_2 = batch_edit_distance(variants2, tgt_texts)

    # Construct preference dataset based on lower edit distances
    print("Creating preference dataset...")
    preference_dataset = []
    for src, v1, v2, ed1, ed2 in zip(src_texts, variants1, variants2, edit_dist_1, edit_dist_2):
        if ed1 < ed2:
            preference_dataset.append({"prompt": src, "chosen": v1, "rejected": v2})
        else:
            preference_dataset.append({"prompt": src, "chosen": v2, "rejected": v1})

    print("Preference dataset creation completed.")

    # Return as a Dataset object
    return Dataset.from_dict({
        "prompt": [item["prompt"] for item in preference_dataset],
        "chosen": [item["chosen"] for item in preference_dataset],
        "rejected": [item["rejected"] for item in preference_dataset],
    })

# Example usage
print("Starting preference dataset creation...")
po_dataset = create_preference_dataset(train_dataset)
po_dataset.save_to_disk("preference_optimization_dataset")
print("Preference optimization dataset created and saved.")

print(f"Number of examples in preference dataset: {len(po_dataset)}")
print("Sample entry:")
print(po_dataset[0])

Starting preference dataset creation...
Formatting input texts...


  with autocast():
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Formatting completed.
Generating beam search variants...


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 1/155 - Progress: 0.65%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 2/155 - Progress: 1.29%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 3/155 - Progress: 1.94%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 4/155 - Progress: 2.58%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 5/155 - Progress: 3.23%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 6/155 - Progress: 3.87%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 7/155 - Progress: 4.52%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 8/155 - Progress: 5.16%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 9/155 - Progress: 5.81%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 10/155 - Progress: 6.45%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 11/155 - Progress: 7.10%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 12/155 - Progress: 7.74%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 13/155 - Progress: 8.39%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 14/155 - Progress: 9.03%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 15/155 - Progress: 9.68%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 16/155 - Progress: 10.32%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 17/155 - Progress: 10.97%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 18/155 - Progress: 11.61%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 19/155 - Progress: 12.26%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 20/155 - Progress: 12.90%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 21/155 - Progress: 13.55%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 22/155 - Progress: 14.19%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 23/155 - Progress: 14.84%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 24/155 - Progress: 15.48%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 25/155 - Progress: 16.13%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 26/155 - Progress: 16.77%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 27/155 - Progress: 17.42%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 28/155 - Progress: 18.06%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 29/155 - Progress: 18.71%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 30/155 - Progress: 19.35%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 31/155 - Progress: 20.00%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 32/155 - Progress: 20.65%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 33/155 - Progress: 21.29%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 34/155 - Progress: 21.94%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 35/155 - Progress: 22.58%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 36/155 - Progress: 23.23%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 37/155 - Progress: 23.87%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 38/155 - Progress: 24.52%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 39/155 - Progress: 25.16%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 40/155 - Progress: 25.81%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 41/155 - Progress: 26.45%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 42/155 - Progress: 27.10%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 43/155 - Progress: 27.74%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 44/155 - Progress: 28.39%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 45/155 - Progress: 29.03%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 46/155 - Progress: 29.68%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 47/155 - Progress: 30.32%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 48/155 - Progress: 30.97%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 49/155 - Progress: 31.61%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 50/155 - Progress: 32.26%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 51/155 - Progress: 32.90%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 52/155 - Progress: 33.55%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 53/155 - Progress: 34.19%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 54/155 - Progress: 34.84%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 55/155 - Progress: 35.48%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 56/155 - Progress: 36.13%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 57/155 - Progress: 36.77%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 58/155 - Progress: 37.42%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 59/155 - Progress: 38.06%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 60/155 - Progress: 38.71%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 61/155 - Progress: 39.35%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 62/155 - Progress: 40.00%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 63/155 - Progress: 40.65%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 64/155 - Progress: 41.29%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 65/155 - Progress: 41.94%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 66/155 - Progress: 42.58%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 67/155 - Progress: 43.23%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 68/155 - Progress: 43.87%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 69/155 - Progress: 44.52%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 70/155 - Progress: 45.16%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 71/155 - Progress: 45.81%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 72/155 - Progress: 46.45%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 73/155 - Progress: 47.10%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 74/155 - Progress: 47.74%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 75/155 - Progress: 48.39%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 76/155 - Progress: 49.03%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 77/155 - Progress: 49.68%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 78/155 - Progress: 50.32%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 79/155 - Progress: 50.97%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 80/155 - Progress: 51.61%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 81/155 - Progress: 52.26%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 82/155 - Progress: 52.90%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 83/155 - Progress: 53.55%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 84/155 - Progress: 54.19%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 85/155 - Progress: 54.84%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 86/155 - Progress: 55.48%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 87/155 - Progress: 56.13%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 88/155 - Progress: 56.77%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 89/155 - Progress: 57.42%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 90/155 - Progress: 58.06%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 91/155 - Progress: 58.71%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 92/155 - Progress: 59.35%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 93/155 - Progress: 60.00%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 94/155 - Progress: 60.65%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 95/155 - Progress: 61.29%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 96/155 - Progress: 61.94%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 97/155 - Progress: 62.58%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 98/155 - Progress: 63.23%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 99/155 - Progress: 63.87%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 100/155 - Progress: 64.52%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 101/155 - Progress: 65.16%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 102/155 - Progress: 65.81%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 103/155 - Progress: 66.45%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 104/155 - Progress: 67.10%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 105/155 - Progress: 67.74%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 106/155 - Progress: 68.39%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 107/155 - Progress: 69.03%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 108/155 - Progress: 69.68%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 109/155 - Progress: 70.32%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 110/155 - Progress: 70.97%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 111/155 - Progress: 71.61%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 112/155 - Progress: 72.26%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 113/155 - Progress: 72.90%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 114/155 - Progress: 73.55%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 115/155 - Progress: 74.19%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 116/155 - Progress: 74.84%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 117/155 - Progress: 75.48%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 118/155 - Progress: 76.13%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 119/155 - Progress: 76.77%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 120/155 - Progress: 77.42%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 121/155 - Progress: 78.06%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 122/155 - Progress: 78.71%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 123/155 - Progress: 79.35%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 124/155 - Progress: 80.00%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 125/155 - Progress: 80.65%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 126/155 - Progress: 81.29%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 127/155 - Progress: 81.94%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 128/155 - Progress: 82.58%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 129/155 - Progress: 83.23%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 130/155 - Progress: 83.87%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 131/155 - Progress: 84.52%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 132/155 - Progress: 85.16%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 133/155 - Progress: 85.81%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 134/155 - Progress: 86.45%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 135/155 - Progress: 87.10%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 136/155 - Progress: 87.74%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 137/155 - Progress: 88.39%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 138/155 - Progress: 89.03%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 139/155 - Progress: 89.68%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 140/155 - Progress: 90.32%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 141/155 - Progress: 90.97%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 142/155 - Progress: 91.61%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 143/155 - Progress: 92.26%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 144/155 - Progress: 92.90%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 145/155 - Progress: 93.55%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 146/155 - Progress: 94.19%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 147/155 - Progress: 94.84%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 148/155 - Progress: 95.48%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 149/155 - Progress: 96.13%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 150/155 - Progress: 96.77%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 151/155 - Progress: 97.42%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 152/155 - Progress: 98.06%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 153/155 - Progress: 98.71%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 154/155 - Progress: 99.35%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Beam Search - Batch 155/155 - Progress: 100.00%
Generating temperature sampling variants...


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 1/155 - Progress: 0.65%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 2/155 - Progress: 1.29%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 3/155 - Progress: 1.94%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 4/155 - Progress: 2.58%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 5/155 - Progress: 3.23%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 6/155 - Progress: 3.87%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 7/155 - Progress: 4.52%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 8/155 - Progress: 5.16%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 9/155 - Progress: 5.81%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 10/155 - Progress: 6.45%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 11/155 - Progress: 7.10%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 12/155 - Progress: 7.74%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 13/155 - Progress: 8.39%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 14/155 - Progress: 9.03%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 15/155 - Progress: 9.68%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 16/155 - Progress: 10.32%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 17/155 - Progress: 10.97%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 18/155 - Progress: 11.61%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 19/155 - Progress: 12.26%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 20/155 - Progress: 12.90%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 21/155 - Progress: 13.55%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 22/155 - Progress: 14.19%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 23/155 - Progress: 14.84%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 24/155 - Progress: 15.48%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 25/155 - Progress: 16.13%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 26/155 - Progress: 16.77%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 27/155 - Progress: 17.42%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 28/155 - Progress: 18.06%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 29/155 - Progress: 18.71%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 30/155 - Progress: 19.35%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 31/155 - Progress: 20.00%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 32/155 - Progress: 20.65%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 33/155 - Progress: 21.29%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 34/155 - Progress: 21.94%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 35/155 - Progress: 22.58%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 36/155 - Progress: 23.23%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 37/155 - Progress: 23.87%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 38/155 - Progress: 24.52%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 39/155 - Progress: 25.16%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 40/155 - Progress: 25.81%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 41/155 - Progress: 26.45%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 42/155 - Progress: 27.10%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 43/155 - Progress: 27.74%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 44/155 - Progress: 28.39%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 45/155 - Progress: 29.03%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 46/155 - Progress: 29.68%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 47/155 - Progress: 30.32%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 48/155 - Progress: 30.97%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 49/155 - Progress: 31.61%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 50/155 - Progress: 32.26%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 51/155 - Progress: 32.90%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 52/155 - Progress: 33.55%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 53/155 - Progress: 34.19%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 54/155 - Progress: 34.84%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 55/155 - Progress: 35.48%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 56/155 - Progress: 36.13%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 57/155 - Progress: 36.77%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 58/155 - Progress: 37.42%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 59/155 - Progress: 38.06%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 60/155 - Progress: 38.71%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 61/155 - Progress: 39.35%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 62/155 - Progress: 40.00%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 63/155 - Progress: 40.65%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 64/155 - Progress: 41.29%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 65/155 - Progress: 41.94%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 66/155 - Progress: 42.58%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 67/155 - Progress: 43.23%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 68/155 - Progress: 43.87%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 69/155 - Progress: 44.52%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 70/155 - Progress: 45.16%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 71/155 - Progress: 45.81%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 72/155 - Progress: 46.45%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 73/155 - Progress: 47.10%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 74/155 - Progress: 47.74%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 75/155 - Progress: 48.39%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 76/155 - Progress: 49.03%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 77/155 - Progress: 49.68%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 78/155 - Progress: 50.32%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 79/155 - Progress: 50.97%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 80/155 - Progress: 51.61%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 81/155 - Progress: 52.26%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 82/155 - Progress: 52.90%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 83/155 - Progress: 53.55%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 84/155 - Progress: 54.19%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 85/155 - Progress: 54.84%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 86/155 - Progress: 55.48%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 87/155 - Progress: 56.13%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 88/155 - Progress: 56.77%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 89/155 - Progress: 57.42%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 90/155 - Progress: 58.06%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 91/155 - Progress: 58.71%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 92/155 - Progress: 59.35%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 93/155 - Progress: 60.00%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 94/155 - Progress: 60.65%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 95/155 - Progress: 61.29%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 96/155 - Progress: 61.94%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 97/155 - Progress: 62.58%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 98/155 - Progress: 63.23%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 99/155 - Progress: 63.87%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 100/155 - Progress: 64.52%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 101/155 - Progress: 65.16%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 102/155 - Progress: 65.81%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 103/155 - Progress: 66.45%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 104/155 - Progress: 67.10%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 105/155 - Progress: 67.74%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 106/155 - Progress: 68.39%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 107/155 - Progress: 69.03%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 108/155 - Progress: 69.68%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 109/155 - Progress: 70.32%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 110/155 - Progress: 70.97%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 111/155 - Progress: 71.61%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 112/155 - Progress: 72.26%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 113/155 - Progress: 72.90%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 114/155 - Progress: 73.55%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 115/155 - Progress: 74.19%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 116/155 - Progress: 74.84%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 117/155 - Progress: 75.48%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 118/155 - Progress: 76.13%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 119/155 - Progress: 76.77%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 120/155 - Progress: 77.42%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 121/155 - Progress: 78.06%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 122/155 - Progress: 78.71%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 123/155 - Progress: 79.35%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 124/155 - Progress: 80.00%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 125/155 - Progress: 80.65%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 126/155 - Progress: 81.29%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 127/155 - Progress: 81.94%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 128/155 - Progress: 82.58%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 129/155 - Progress: 83.23%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 130/155 - Progress: 83.87%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 131/155 - Progress: 84.52%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 132/155 - Progress: 85.16%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 133/155 - Progress: 85.81%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 134/155 - Progress: 86.45%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 135/155 - Progress: 87.10%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 136/155 - Progress: 87.74%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 137/155 - Progress: 88.39%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 138/155 - Progress: 89.03%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 139/155 - Progress: 89.68%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 140/155 - Progress: 90.32%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 141/155 - Progress: 90.97%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 142/155 - Progress: 91.61%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 143/155 - Progress: 92.26%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 144/155 - Progress: 92.90%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 145/155 - Progress: 93.55%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 146/155 - Progress: 94.19%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 147/155 - Progress: 94.84%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 148/155 - Progress: 95.48%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 149/155 - Progress: 96.13%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 150/155 - Progress: 96.77%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 151/155 - Progress: 97.42%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 152/155 - Progress: 98.06%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 153/155 - Progress: 98.71%


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Sample Search - Batch 154/155 - Progress: 99.35%
Sample Search - Batch 155/155 - Progress: 100.00%
Calculating edit distances...
Creating preference dataset...
Preference dataset creation completed.


Saving the dataset (0/1 shards):   0%|          | 0/19823 [00:00<?, ? examples/s]

Preference optimization dataset created and saved.
Number of examples in preference dataset: 19823
Sample entry:
{'prompt': 'Remove all grammatical errors from this text: For example, countries with a lot of deserts can terraform their desert to increase their habitable land and using irrigation to provide clean water to the desert.', 'chosen': 'For example, countries with a lot of deserts can terraform their desert to increase their habitable land and using irrigation to provide clean water to the desert.', 'rejected': 'For example, countries with a lot of deserts can terraform their desert to increase their habitable land and using irrigation to provide clean water to the desert.'}


In [36]:
# (Load and) Visualize the created dataset -- display at least 5 lines of the dataset.

for i in range(5):
    print(f"Example {i+1}:")
    print(f"Ground Truth: {po_dataset[i]['prompt']}")
    print(f"Chosen Variant: {po_dataset[i]['chosen']}")
    print(f"Rejected Variant: {po_dataset[i]['rejected']}")
    print("*"*100)

Example 1:
Ground Truth: Remove all grammatical errors from this text: For example, countries with a lot of deserts can terraform their desert to increase their habitable land and using irrigation to provide clean water to the desert.
Chosen Variant: For example, countries with a lot of deserts can terraform their desert to increase their habitable land and using irrigation to provide clean water to the desert.
Rejected Variant: For example, countries with a lot of deserts can terraform their desert to increase their habitable land and using irrigation to provide clean water to the desert.
****************************************************************************************************
Example 2:
Ground Truth: Improve the grammaticality: As the number of people grows, the need of habitable environment is unquestionably essential.
Chosen Variant: As the number of people grows, the need for a habitable environment is unquestionably essential.
Rejected Variant: As the number of people 

In [1]:
# save_colab_to_drive()
# helper functions
import os
from google.colab import drive
import shutil

# for installing missing libraries
def install_if_not_installed(package_name):
    try:
        __import__(package_name)
        print(f"'{package_name}' is already installed.")
    except ImportError:
        print(f"'{package_name}' not found. Installing...")
        os.system(f'pip install {package_name}')

# install_if_not_installed('datasets')


# made this helper function so that i can download all the model files in case the runtime gets randomly terminated (which happened multiple times)
def save_colab_to_drive(destination_folder='Colab_Backup'):

    # Mount Google Drive if not already mounted
    if not os.path.exists('/content/drive'):
        drive.mount('/content/drive')

    # Specify the full path to the destination folder in Google Drive
    full_destination_path = f'/content/drive/MyDrive/{destination_folder}/'

    # Create the destination folder if it doesn't exist
    os.makedirs(full_destination_path, exist_ok=True)

    # Get list of all files and directories in the current directory
    files_to_copy = os.listdir('/content')

    # Copy each file/folder to Google Drive
    for item in files_to_copy:
        source_path = os.path.join('/content', item)
        dest_path = os.path.join(full_destination_path, item)

        if item == 'drive':  # Skip the mounted Google Drive folder
            continue

        try:
            if os.path.isfile(source_path):
                shutil.copy2(source_path, dest_path)
                print(f"Copied file: {item}")
            elif os.path.isdir(source_path):
                shutil.copytree(source_path, dest_path, dirs_exist_ok=True)
                print(f"Copied directory: {item}")
            else:
                print(f"Skipped: {item} (not a file or directory)")
        except Exception as e:
            print(f"Error copying {item}: {str(e)}")

    print(f"All files and folders have been copied to Google Drive folder: {destination_folder}")

# save_colab_to_drive()  # This will save to a folder named 'Colab_Backup'

def load_drive_to_colab(source_folder='Colab_Backup'):
    # Mount Google Drive if not already mounted
    if not os.path.exists('/content/drive'):
        drive.mount('/content/drive')

    # Specify the full path to the source folder in Google Drive
    full_source_path = f'/content/drive/MyDrive/{source_folder}/'

    # Check if the source folder exists
    if not os.path.exists(full_source_path):
        print(f"Source folder '{full_source_path}' not found in Google Drive.")
        return

    # Get list of all files and directories in the source folder
    files_to_copy = os.listdir(full_source_path)

    # Copy each file/folder to Colab workspace
    for item in files_to_copy:
        source_path = os.path.join(full_source_path, item)
        dest_path = os.path.join('/content', item)

        try:
            if os.path.isfile(source_path):
                shutil.copy2(source_path, dest_path)
                print(f"Copied file: {item}")
            elif os.path.isdir(source_path):
                shutil.copytree(source_path, dest_path, dirs_exist_ok=True)
                print(f"Copied directory: {item}")
            else:
                print(f"Skipped: {item} (not a file or directory)")
        except Exception as e:
            print(f"Error copying {item}: {str(e)}")

    print(f"All files and folders have been copied from Google Drive folder '{source_folder}' to Colab workspace.")

# load_drive_to_colab()  # This will load from a folder named 'Colab_Backup'

install_if_not_installed('transformers')
install_if_not_installed('trl')
install_if_not_installed('datasets')
install_if_not_installed('evaluate')
install_if_not_installed('fast-edit-distance')

load_drive_to_colab()

'transformers' is already installed.
'trl' is already installed.
'datasets' is already installed.
'evaluate' is already installed.
'fast-edit-distance' not found. Installing...
Copied directory: .config
Copied directory: sample_data
Copied directory: best_model
Copied directory: checkpoints_lr0.0001_wd0.01
Copied directory: logs_lr0.0001_wd0.01
Copied directory: preference_optimization_dataset
Copied directory: dpo_trained_model
Copied directory: C4AI_SMOLLM135
Copied file: targets_sft.txt
Copied file: BareBones_SmolLM-135M.pt
Copied file: preds_sft.txt
All files and folders have been copied from Google Drive folder 'Colab_Backup' to Colab workspace.


## **2.3 Run Direct Preference Optimization (DPO) [5 points]**
* Use the preference optimization dataset to further train the model through DPO, a method that leverages human-like preferences for model training.
* After running DPO, measure the BLEU score on the test set. Compare this performance to the baseline established during the SFT phase.
* Search for an optimal set of hyperparameters, such as the learning rate and number of epochs. We provide an estimated BLEU score that you should aim to achieve after one epoch. However, you may achieve a better score by finding the most suitable hyperparameters.

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import Dataset
from trl import DPOConfig, DPOTrainer
from trl.trainer.utils import DPODataCollatorWithPadding

# Model and tokenizer name
model_name = "HuggingFaceTB/SmolLM-135M"

# Training device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Add special tokens
special_tokens_dict = {'additional_special_tokens': ['<INPUT>', '</INPUT>', '<OUTPUT>', '</OUTPUT>']}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
print(f"Added {num_added_toks} special tokens.")

# Ensure pad token is added if not present
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

# Initialize the model and resize token embeddings
dpo_model = AutoModelForCausalLM.from_pretrained(model_name)
dpo_model.resize_token_embeddings(len(tokenizer))
dpo_model.config.pad_token_id = tokenizer.pad_token_id
dpo_model.to(device)

# Read preference data from disk
preference_data = Dataset.load_from_disk("preference_optimization_dataset")

# Function to format the text input and output for tokenization
def format_text(src: str, tgt: str) -> str:
    return f"<INPUT> {src} </INPUT> <OUTPUT> {tgt} </OUTPUT>"

# Tokenization function that handles input/output formatting and padding
def tokenize_and_prepare(examples):
    inputs = []
    chosen_inputs = []
    rejected_inputs = []

    for prompt, chosen, rejected in zip(examples['prompt'], examples['chosen'], examples['rejected']):
        # Format the input, chosen, and rejected examples
        formatted_input = f"<INPUT> {prompt} </INPUT>"
        formatted_chosen = f"<OUTPUT> {chosen} </OUTPUT>"
        formatted_rejected = f"<OUTPUT> {rejected} </OUTPUT>"

        # Tokenize the input, chosen, and rejected sequences
        input_encoding = tokenizer(formatted_input, max_length=128, padding="max_length", truncation=True, return_tensors="pt")
        chosen_encoding = tokenizer(formatted_chosen, max_length=128, padding="max_length", truncation=True, return_tensors="pt")
        rejected_encoding = tokenizer(formatted_rejected, max_length=128, padding="max_length", truncation=True, return_tensors="pt")

        # Prepare labels: mask input tokens in chosen and rejected sequences
        input_ids = input_encoding["input_ids"][0]
        chosen_ids = chosen_encoding["input_ids"][0]
        rejected_ids = rejected_encoding["input_ids"][0]

        input_length = len(tokenizer.encode(formatted_input, add_special_tokens=False))

        chosen_labels = chosen_ids.clone()
        chosen_labels[:input_length] = -100  # Mask input tokens in labels

        rejected_labels = rejected_ids.clone()
        rejected_labels[:input_length] = -100  # Mask input tokens in labels

        # Append to lists
        inputs.append(input_ids)
        chosen_inputs.append(chosen_labels)
        rejected_inputs.append(rejected_labels)

    return {
        "input_ids": torch.stack(inputs),
        "chosen_input_ids": torch.stack(chosen_inputs),
        "rejected_input_ids": torch.stack(rejected_inputs)
    }

# Tokenize the dataset using the updated function
preference_dataset = preference_data.map(tokenize_and_prepare, batched=True)

# Use DPODataCollatorWithPadding for padding and other necessary steps
data_collator = DPODataCollatorWithPadding(
    pad_token_id=tokenizer.pad_token_id,
    label_pad_token_id=-100,
    is_encoder_decoder=False
)

# Define DPOConfig for training
dpo_config = DPOConfig(
    output_dir="dpo_trained_model",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=2,
    optim="adamw_torch",
    learning_rate=5e-5,
    max_grad_norm=0.3,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    logging_steps=25,
    save_steps=500,
    save_total_limit=2,
    eval_strategy="steps",  # Changed from 'evaluation_strategy' to 'eval_strategy'
    eval_steps=100,
    remove_unused_columns=False,
    max_length=128,
    max_prompt_length=128
)

# Initialize the DPOTrainer
trainer = DPOTrainer(
    model=dpo_model,
    ref_model=dpo_model,  # Reference model is the same as the main model
    args=dpo_config,
    train_dataset=preference_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,  # Use the correct DataCollatorWithPadding
    beta=0.1
)

# Train the model using DPO
trainer.train()

# Save the trained model
trainer.save_model("dpo_trained_model")

Added 4 special tokens.


Tokenizing train dataset:   0%|          | 0/19823 [00:00<?, ? examples/s]

In [None]:
# TODO: Evaluate model, use evaluate_model function

# Evaluate model, use the function given above
bleu_dpo,preds_dpo,targets_dpo = evaluate_model(dpo_model, tokenizer, test_dataset, device, 32)
print(f"BLEU score on test set for DPO: {bleu_dpo}")

Expected BLEU score after 1 epoch SFT + DPO is ~ 0.50.

# **Coding Challenge Part 3: Explore Alternative DPO Variants for Improved Model Performance [10 points]**

Consider employing a different version or variant of DPO. Your task is to:

* Choose a variant of DPO or another preference-based optimization method that could potentially enhance the model's performance.
* Describe the specific differences in this approach compared to the initial DPO method used.
* Train the model using this alternative DPO method and measure its performance on the test set using the BLEU score.
* Compare these results with the baseline performance achieved during the initial Supervised Fine-Tuning (SFT) and the first DPO implementation.
* Select a few GEC example after SFT, DPO and this DPO variant phases and compare the quality of the corrections, which one you prefer as human?
* You are allowed to make changes in the preference data annotation to improve the score, e.g. apply different metrics or methods beyond edit distance.
* Discuss the role of any changes in achieving these results. Consider potential trade-offs or limitations introduced by the new approach.