<a href="https://colab.research.google.com/github/rokosbasilisk/HALOs/blob/main/RamBharadwaj_Aryasomayajula_C4AIScholarsChallenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Background**

Welcome to the C4AI Scholars Program Take-Home Challenge! This exercise is designed to allow you to showcase your engineering and problem solving skills. The Challenge consists of different challenges including:

*   Identifying bugs, and getting the code working. This is designed to test your ability to grapple with real world engineering challenges.
*   Testing your ability to generate code for a specified problem.
*   An opportunity for you to attempt an optional challenge question that extends the original problem set.

These tasks were chosen as a setting to see how you think about problems, even if they are not in your own research field of interest. The tasks and dataset are not meant to be indicative of the research goals of the Scholar Program. We purposefully have selected a simple toy problem so the focus is on how you think, and does not require significant machine learning resources (can be run in this colab).

Good luck!

**How to Use and Submit this Document?**

*   **Make a copy of this document** and rename it **Firstname_Lastname_C4AIScholarsChallenge**
*   Once you have completed all tasks, save and pin your revisions
*   Submit the assignment by responding directly to this email with a link to your final document by Sunday, September 15th, 11 PM PDT.

## **Coding Challenge Part 1: Debugging custom SmolLM code [10 points]**

In this coding challenge, you are required to debug and fix a bare-bones implementation of the following model.

**Model** : SmolLM-135M can be found at [HuggingFace](https://huggingface.co/HuggingFaceTB/SmolLM-135M).

We have 10 bugs in the following implementation.
There is a `check_solution` function for your convenience to verify you have correctly identified all the bugs. If you have found all bugs, the generated outputs will match the reference model exactly.

**Rules**:
1. **Bug Definition:**
  - There are 10 bugs to be fixed.
  - A bug is *defined as **{incorrect, missing, unnecessary}** lines of code*.
  - You earn 1 point for each correctly identified and fixed bug.
2. **Fix Guidelines:**
  - You are encouraged to make the smallest possible fix, wherever possible (e.g. edit a line instead of replacing it entirely).
  - Do not optimize the code; only fix the bugs. The implementation is *intentionally* non-optimized but valid.
3. **Documentation:** Document each fix by adding a comment on the line above the fix: : `### BUG FIX ###`.
4. **Sections:** *1. Setup [Helper Functions]* and *3. Test* don't contain bugs and shouldn't be changed.
5. **Submission:** Your final submission should be the exact same file except with your proposed fixes and the respective comments as per Rule #3.

## 1. Setup [Helper Functions]

In [None]:
# #####################################################################################################################
# ############################################# DO NOT CHANGE[START] ##################################################
# #####################################################################################################################


#[Don't use. Rate limit issues.] Use gdown to get weights file(BareBones_SmolLM-135M.pt) at https://drive.google.com/file/d/1tY46FSJEhGYRrfKRQTjJ1Cc7q9psaKUU/view . gdown should be installed by default else use `pip install gdown`
# !gdown 1tY46FSJEhGYRrfKRQTjJ1Cc7q9psaKUU


# [Recommended] Use HF to download the weights
!git lfs install
!git clone https://huggingface.co/dsouzadaniel/C4AI_SMOLLM135
!mv C4AI_SMOLLM135/BareBones_SmolLM-135M.pt ./
!ls

Git LFS initialized.
Cloning into 'C4AI_SMOLLM135'...
remote: Enumerating objects: 6, done.[K
remote: Counting objects: 100% (3/3), done.[K
remote: Compressing objects: 100% (3/3), done.[K
remote: Total 6 (delta 0), reused 0 (delta 0), pack-reused 3 (from 1)[K
Unpacking objects: 100% (6/6), 2.11 KiB | 2.11 MiB/s, done.
BareBones_SmolLM-135M.pt  C4AI_SMOLLM135  sample_data


In [None]:

# Libraries
import torch
import torch.nn.functional as F
from torch import nn
import math
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model initialization/settings
checkpoint="HuggingFaceTB/SmolLM-135M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

__reference_model = AutoModelForCausalLM.from_pretrained(checkpoint)
__reference_model.eval()

class smolConfig:
    vocab_size=49152
    hidden_size=576
    intermediate_size=1536
    num_hidden_layers = 30
    num_heads = 9
    kv_heads=3
config = smolConfig

# Helper Functions
def __generate(model, inputs, num_tokens):
    collect = []
    for _ in range(num_tokens):
        output = model(**inputs)
        output_id = torch.argmax(output['logits'][0,-1]).item()
        collect.append(output_id)
        if output_id==tokenizer.eos_token_id:
            break
        inputs['input_ids'] = torch.unsqueeze(torch.cat([inputs['input_ids'][0],torch.tensor([output_id])]),dim=0)
        inputs['attention_mask'] = torch.ones_like(inputs['input_ids'])
    return tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(collect))

def check_solution(prompt, num_tokens, model_A, model_B):
    print()
    print(f"{'>'*20}\n\tPrompt\n{'<'*20}\n{prompt}\n\n")
    model_inputs = tokenizer(prompt, return_tensors='pt')
    print(f"{'>'*30}\n\tModel_A Generation\n{'<'*30}\n{__generate(model_A,  model_inputs, num_tokens)}")
    print("\n\n")
    model_inputs = tokenizer(prompt, return_tensors='pt')
    print(f"{'>'*30}\n\tModel_B Generation\n{'<'*30}\n{__generate(model_B,  model_inputs, num_tokens)}")

######################################################################################################################
############################################### DO NOT CHANGE[END] ###################################################
######################################################################################################################

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/3.69k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/801k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/466k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.10M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/831 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/724 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/538M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

## 2. Custom SmolLM (for BugFixes)

In [None]:
def rotate_half(x):
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):

    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)


    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed

def repeat_kv(hidden_states, n_rep):
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)


class RotaryEmbedder(nn.Module):
    def __init__(self, dim, base):
        super().__init__()

        ### BUG FIX ###
        # remove int64 dtype and use float for the frequencies (?)
        self.freq = 1/(base ** (torch.arange(0, dim, 2).float()/dim))

    @torch.no_grad()
    def forward(self, x, position_ids):

        ### BUG FIX ###
        # utilize position_ids provided
        position_ids = position_ids.view(-1).to(torch.float32)

        angles = torch.einsum('i,j->ij', position_ids, self.freq)
        angles = angles.unsqueeze(0)
        emb = torch.cat((angles, angles), dim=-1)
        return emb.cos(), emb.sin()


class MLP(nn.Module):
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size

        ### BUG FIX ###
        # use bias for all linear layers
        self.W_gate = nn.Linear(self.hidden_size, self.intermediate_size, bias=True)
        self.W_up = nn.Linear(self.hidden_size, self.intermediate_size, bias=True)
        self.W_down = nn.Linear(self.intermediate_size, self.hidden_size, bias=True)
        self.act_fn = torch.nn.modules.activation.SiLU()

    def forward(self, x):
        down_proj = self.W_down(self.act_fn((self.W_gate(x)) * self.W_up(x)))
        return down_proj

class RMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        variance = hidden_states.pow(2).mean(-1, keepdim=True)

        ### BUG FIX ###
        # normalization should use rsqrt instead of sqrt
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states


class RopeAttention(nn.Module):
    def __init__(self,config):
        super().__init__()
        self.hidden_size=config.hidden_size
        self.num_heads = config.num_heads
        self.head_dim = config.hidden_size//self.num_heads
        self.kv_heads = config.kv_heads
        self.rope_theta = 10000.0


        ### BUG FIX ###
        # use bias for all linear layers
        self.W_query = nn.Linear(config.hidden_size, self.num_heads * self.head_dim, bias=True)
        self.W_key = nn.Linear(config.hidden_size, self.kv_heads * self.head_dim, bias=True)
        self.W_value = nn.Linear(config.hidden_size, self.kv_heads * self.head_dim, bias=True)
        self.W_output = nn.Linear(config.hidden_size, config.hidden_size, bias=True)

        ### BUG FIX ###
        # use  already defined self.head_dim instead of config.hidden_size//self.num_heads
        self.rotary_emb = RotaryEmbedder(base=self.rope_theta, dim=self.head_dim)

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask= None,
    ):
        b, q, _ = hidden_states.size()

        q_states = self.W_query(hidden_states)
        k_states = self.W_key(hidden_states)
        v_states = self.W_value(hidden_states)

        q_states = q_states.view(b, q, self.num_heads, self.head_dim).transpose(1, 2)
        k_states = k_states.view(b, q, self.kv_heads, self.head_dim).transpose(1, 2)
        v_states = v_states.view(b, q, self.kv_heads, self.head_dim).transpose(1, 2)

        position_ids = torch.arange(q, device=hidden_states.device).unsqueeze(0)

        cos, sin = self.rotary_emb(v_states, position_ids)
        q_states, k_states = apply_rotary_pos_emb(q_states, k_states, cos, sin, position_ids)

        ### BUG FIX ###
        # use floor division for number of kv-groups

        __kv_groups = self.num_heads // self.kv_heads
        k_states = repeat_kv(k_states, __kv_groups)
        v_states = repeat_kv(v_states, __kv_groups)

        ### BUG FIX ###
        # attention weights should be scaled by math.sqrt(self.head_dim) not hidden_size
        attn_weights = torch.matmul(q_states, k_states.transpose(2, 3)) / math.sqrt(self.head_dim)

        attn_weights = attn_weights + attention_mask
        attn_weights = nn.functional.softmax(attn_weights, dim=-1)

        ### BUG FIX ###
        #  explicitly specify a smaller dropout probability to avoid the default value (0.5)
        attn_weights = nn.functional.dropout(attn_weights, p = 0.2)

        attn_output = torch.matmul(attn_weights, v_states)
        attn_output = attn_output.transpose(1, 2).contiguous()
        attn_output = attn_output.reshape(b, q, -1)

        return attn_output

class LlamaDecoder(nn.Module):
    def __init__(self,config):
        super().__init__()
        self.self_attn = RopeAttention(config)
        self.mlp = MLP(hidden_size=config.hidden_size, intermediate_size=config.intermediate_size)
        self.pre_attn_rmsnorm = RMSNorm(config.hidden_size, eps=1e-05)
        self.pre_mlp_rmsnorm = RMSNorm(config.hidden_size, eps=1e-05)

    def forward(self,hidden_states, attention_mask):
        residual = hidden_states
        hidden_states = self.pre_attn_rmsnorm(hidden_states)

        ### BUG FIX ###
        # Remove this redundant attention_mask and use the actual one provided
        # attention_mask = torch.triu(torch.full((attention_mask.shape[-1],attention_mask.shape[-1]), fill_value=float('-inf')),diagonal=1)

        hidden_states = self.self_attn(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
        )
        hidden_states += residual
        hidden_states = self.pre_mlp_rmsnorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        hidden_states += residual

        outputs = (hidden_states,)
        return outputs

class smolModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embed_tokens = nn.Embedding(num_embeddings=config.vocab_size,
                                         embedding_dim=config.hidden_size)
        self.layers = nn.ModuleList([LlamaDecoder(config) for _ in range(config.num_hidden_layers)])
        self.norm = RMSNorm(config.hidden_size, eps=1e-05)

    def forward(
        self,
        input_ids= None,
        attention_mask= None,
    ):
        inputs_embeds = self.embed_tokens(input_ids)
        hidden_states = inputs_embeds
        for decoder_layer in self.layers:
            layer_outputs = decoder_layer(
                hidden_states,
                attention_mask=attention_mask,
            )
            hidden_states = layer_outputs[0]
        hidden_states = self.norm(hidden_states)
        return [hidden_states]

class smolLM(nn.Module):
    def __init__(self,config):
        super().__init__()
        self.model = smolModel(config)

        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size,bias=False)

    def forward(self,input_ids,attention_mask):
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        hidden_states = outputs[0].squeeze()
        logits = self.lm_head(hidden_states)
        logits = logits.float()
        return {'logits':logits}


In [None]:
__test_model = smolLM(config)
__test_model.load_state_dict(torch.load('BareBones_SmolLM-135M.pt'), strict=False)
__test_model.eval()

  __test_model.load_state_dict(torch.load('BareBones_SmolLM-135M.pt'), strict=False)


smolLM(
  (model): smolModel(
    (embed_tokens): Embedding(49152, 576)
    (layers): ModuleList(
      (0-29): 30 x LlamaDecoder(
        (self_attn): RopeAttention(
          (W_query): Linear(in_features=576, out_features=576, bias=True)
          (W_key): Linear(in_features=576, out_features=192, bias=True)
          (W_value): Linear(in_features=576, out_features=192, bias=True)
          (W_output): Linear(in_features=576, out_features=576, bias=True)
          (rotary_emb): RotaryEmbedder()
        )
        (mlp): MLP(
          (W_gate): Linear(in_features=576, out_features=1536, bias=True)
          (W_up): Linear(in_features=576, out_features=1536, bias=True)
          (W_down): Linear(in_features=1536, out_features=576, bias=True)
          (act_fn): SiLU()
        )
        (pre_attn_rmsnorm): RMSNorm()
        (pre_mlp_rmsnorm): RMSNorm()
      )
    )
    (norm): RMSNorm()
  )
  (lm_head): Linear(in_features=576, out_features=49152, bias=False)
)

# 3. Test

In [None]:
######################################################################################################################
############################################## DO NOT CHANGE[START] ##################################################
######################################################################################################################

###### TESTING PROMPTS
# Single-Token Quick Test
check_solution(prompt="Given the following film movie by a critic, rate it out of 10. Respond in a single number.\n\nThe movie started off extremely well, but just got worse after that.\nThe storyline was all over the place and everyone acted terribly.\n 10/10 would not recommend! \n\n ",
               num_tokens=1,
               model_A=__reference_model,
               model_B=__test_model)


We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)



>>>>>>>>>>>>>>>>>>>>
	Prompt
<<<<<<<<<<<<<<<<<<<<
Given the following film movie by a critic, rate it out of 10. Respond in a single number.

The movie started off extremely well, but just got worse after that.
The storyline was all over the place and everyone acted terribly.
 10/10 would not recommend! 

 


>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
	Model_A Generation
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
1



>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
	Model_B Generation
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
<|endoftext|>


In [None]:
# Multi-Token Quick Test
check_solution(prompt="Where is the Nile located?",
               num_tokens=50,
               model_A=__reference_model,
               model_B=__test_model)

######################################################################################################################
############################################### DO NOT CHANGE[END] ###################################################
######################################################################################################################


>>>>>>>>>>>>>>>>>>>>
	Prompt
<<<<<<<<<<<<<<<<<<<<
Where is the Nile located?


>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
	Model_A Generation
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

The Nile River is located in the Nile Delta in the Nile River Basin, which is a region of Africa. It is the longest river in the world, with a length of 4,330 miles (6,900 km



>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
	Model_B Generation
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
<|endoftext|>


# **Coding Challenge Part 2: Teach SmolLM to do grammatical error correction [15 points]**

The goal of this part is to train the SmolLM-135M model to perform grammatical error correction (GEC) using the Grammarly CoEdIT dataset. This [dataset](https://huggingface.co/datasets/grammarly/coedit), derived from the [CoEdIT project](https://arxiv.org/abs/2305.09857), provides a rich collection of text editing instructions and examples. The task involves several key steps that mimic conventional alignment processes:




## **2.1 Supervised Fine-Tuning (SFT) on Training Data [5 points]**

* Fine-tune the [SmolLM-135M model](https://huggingface.co/HuggingFaceTB/SmolLM-135M) using the CoEdIT dataset, which includes input sentences with grammatical errors and their corrected versions. Use the training GEC portion of the CoEdIT dataset to teach the model how to correct grammatical errors effectively.
* Calculate the BLEU score on the validation set to evaluate the model's performance in generating grammatically correct sentences. Ensure that this evaluation process is reusable for later comparisons.
* Search for an optimal set of hyperparameters, such as the learning rate. We provide an estimated BLEU score that you should aim to achieve after one epoch. However, you may achieve a better score by finding the most suitable hyperparameters. **Do not train for more than 3 epochs -- we do not expect extensive training time.**
* For Part 2, don't use additional libraries, if an imported library is missing, install it with **pip install**.

In [7]:
!pip install datasets
from datasets import load_dataset, Dataset

# Download the GEC data
full_train_ds = load_dataset("grammarly/coedit", split="train")
full_test_ds = load_dataset("grammarly/coedit", split="validation")



In [37]:
# Filter examples, keeping only GEC task

print(full_train_ds[0])

train_dataset = Dataset.from_dict({
    'src': [example['src'] for example in full_train_ds if example['task']=='gec'],
    'tgt': [example['tgt'] for example in full_train_ds if example['task']=='gec'],
})

test_dataset = Dataset.from_dict({
    'src': [example['src'] for example in full_test_ds if example['task']=='gec'],
    'tgt': [example['tgt'] for example in full_test_ds if example['task']=='gec'],
})

print(f"number of examples in train: {len(train_dataset)} test: {len(test_dataset)}")
print(test_dataset[0]) #print sample row

{'_id': '1', 'task': 'gec', 'src': 'Remove all grammatical errors from this text: For example, countries with a lot of deserts can terraform their desert to increase their habitable land and using irrigation to provide clean water to the desert.', 'tgt': 'For example, countries with a lot of deserts can transform their desert to increase their habitable land and use irrigation to provide clean water to the desert.'}
number of examples in train: 19823 test: 485
{'src': 'Fix grammaticality: First of all, from you read just to found in the poems or novel what well-known critic have already found out, you looses the pleasures of reading something which is expecting to be a new experience to you.', 'tgt': 'First of all, if you read just to find in the poem or novel what well-known critics have already found out, you lose the pleasure of reading something that is expected to be a new experience to you.'}


Expected number of train and test samples are 19823 and 485, respectively.

In [32]:
train_dataset[0]

{'src': 'Remove all grammatical errors from this text: For example, countries with a lot of deserts can terraform their desert to increase their habitable land and using irrigation to provide clean water to the desert.',
 'tgt': 'For example, countries with a lot of deserts can transform their desert to increase their habitable land and use irrigation to provide clean water to the desert.'}

In [42]:
# Install necessary packages (uncomment if not already installed)
# !pip install trl transformers datasets evaluate

import os
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
)
from trl import SFTConfig, SFTTrainer
from datasets import Dataset
import evaluate


# Model and tokenizer name
MODEL_NAME = "HuggingFaceTB/SmolLM-135M"

# Hyperparameters
LEARNING_RATES = [5e-5, 1e-4]  # Adjusted learning rates
WEIGHT_DECAYS = [0.1, 0.01]
BATCH_SIZE = 32  # Adjusted batch size
GRADIENT_ACCUMULATION_STEPS = 2
DROPOUT_RATE = 0.3  # Adjusted dropout rate
WARMUP_STEPS = 100
MAX_SEQ_LENGTH = 128
NUM_EPOCHS = 1

# Training device
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Add special tokens
special_tokens_dict = {'additional_special_tokens': ['<INPUT>', '</INPUT>', '<OUTPUT>', '</OUTPUT>']}
tokenizer.add_special_tokens(special_tokens_dict)

# Add pad token if not present
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

# Ensure the model's pad_token_id is set
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id
model.to(DEVICE)


model.config.dropout = DROPOUT_RATE
model.config.attention_dropout = DROPOUT_RATE
model.config.activation_dropout = DROPOUT_RATE

def format_text(src: str) -> str:
    return f"<INPUT> Correct the sentence: {src} </INPUT> <OUTPUT>"


def tokenize_and_prepare(examples):
    inputs = []
    labels = []
    for src, tgt in zip(examples['src'], examples['tgt']):
        # Format the input and target
        formatted_input = format_text(src)
        formatted_target = f"{tgt} </OUTPUT>"

        # Concatenate input and target
        full_sequence = formatted_input + formatted_target

        # Tokenize the full sequence
        tokenized = tokenizer(
            full_sequence,
            max_length=MAX_SEQ_LENGTH,
            padding='max_length',  # Fixed padding
            truncation=True,
            return_tensors='pt',
        )

        input_ids = tokenized['input_ids'][0]

        # Determine the input length
        input_length = len(tokenizer.encode(formatted_input, add_special_tokens=False))

        # Prepare labels: mask input tokens
        labels_ids = input_ids.clone()
        labels_ids[:input_length] = -100  # Mask input tokens

        # Append to lists
        inputs.append(input_ids)
        labels.append(labels_ids)

    return {'input_ids': torch.stack(inputs), 'labels': torch.stack(labels)}


train_dataset_tokenized = train_dataset.map(
    tokenize_and_prepare,
    batched=True,
    remove_columns=train_dataset.column_names,
)

test_dataset_tokenized = test_dataset.map(
    tokenize_and_prepare,
    batched=True,
    remove_columns=test_dataset.column_names,
)

# Set the format for PyTorch tensors
train_dataset_tokenized.set_format(type='torch')
test_dataset_tokenized.set_format(type='torch')


data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)


# Initialize tracking variables
best_loss = torch.inf
best_hyperparams = {}
best_model_dir = "best_model"


# Iterate over combinations of learning rates and weight decays
for lr in LEARNING_RATES:
    for wd in WEIGHT_DECAYS:
        print(f"\nTraining with Learning Rate: {lr}, Weight Decay: {wd}")

        # Reset the model and optimizer state
        model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
        model.resize_token_embeddings(len(tokenizer))
        model.config.pad_token_id = tokenizer.pad_token_id
        model.to(DEVICE)

        # Adjust dropout rates
        model.config.dropout = DROPOUT_RATE
        model.config.attention_dropout = DROPOUT_RATE
        model.config.activation_dropout = DROPOUT_RATE

        # Configure SFT
        sft_config = SFTConfig(
            output_dir=f"checkpoints_lr{lr}_wd{wd}",
            num_train_epochs=NUM_EPOCHS,
            per_device_train_batch_size=BATCH_SIZE,
            per_device_eval_batch_size=BATCH_SIZE,
            gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
            learning_rate=lr,
            weight_decay=wd,
            lr_scheduler_type="linear",
            warmup_steps=WARMUP_STEPS,
            logging_dir=f"logs_lr{lr}_wd{wd}",
            evaluation_strategy="epoch",
            save_strategy="epoch",
            load_best_model_at_end=True,
            metric_for_best_model="eval_loss",
            greater_is_better=False,
            logging_strategy="steps",
            logging_steps=100,
            do_eval=True,
            logging_first_step=True,
            packing=False,  # Set packing to False
            dataset_text_field='input_ids',  # Provide dataset_text_field
        )

        # Initialize Trainer
        trainer = SFTTrainer(
            model=model,
            args=sft_config,
            train_dataset=train_dataset_tokenized,
            eval_dataset=test_dataset_tokenized,
            tokenizer=tokenizer,
            data_collator=data_collator,
        )

        # Train the model
        train_result = trainer.train()
        print(f"Training loss: {train_result.training_loss}")

        # Evaluate the model
        eval_results = trainer.evaluate()
        eval_loss = eval_results["eval_loss"]
        print(f"Eval Loss: {eval_loss}")


        if eval_loss < best_loss:
            best_loss = eval_loss
            best_hyperparams = {"learning_rate": lr, "weight_decay": wd}
            # Save the best model and tokenizer
            trainer.save_model(best_model_dir)
            tokenizer.save_pretrained(best_model_dir)
            print(f"New best model saved with lr={lr}, wd={wd}")

print(f"\nBest hyperparameters found: {best_hyperparams} with eval loss: {best_loss}")
print(f"Best model saved in directory: {best_model_dir}")

Map:   0%|          | 0/19823 [00:00<?, ? examples/s]

Map:   0%|          | 0/485 [00:00<?, ? examples/s]


Training with Learning Rate: 5e-05, Weight Decay: 0.1




Epoch,Training Loss,Validation Loss
1,1.4768,2.924451


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


Training loss: 1.7178165028172154


Eval Loss: 2.9244513511657715
New best model saved with lr=5e-05, wd=0.1

Training with Learning Rate: 5e-05, Weight Decay: 0.01




Epoch,Training Loss,Validation Loss
1,1.4727,2.921771


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


Training loss: 1.7287281159431704


Eval Loss: 2.9217708110809326
New best model saved with lr=5e-05, wd=0.01

Training with Learning Rate: 0.0001, Weight Decay: 0.1




Epoch,Training Loss,Validation Loss
1,1.457,2.920208


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


Training loss: 1.6610293173020887


Eval Loss: 2.920208215713501
New best model saved with lr=0.0001, wd=0.1

Training with Learning Rate: 0.0001, Weight Decay: 0.01




Epoch,Training Loss,Validation Loss
1,1.4571,2.920365


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


Training loss: 1.6610445330219885


Eval Loss: 2.9203648567199707

Best hyperparameters found: {'learning_rate': 0.0001, 'weight_decay': 0.1} with eval loss: 2.920208215713501
Best model saved in directory: best_model


In [43]:
# Quick test if your model works properly
model.to('cpu')

def format_text(text: str) -> str:
    # Use the same formatting as during training
    return f"<INPUT> Correct the sentence: {text} </INPUT> <OUTPUT>"

# Example of how to run inference on a single example
text = "I likes turtles"
formatted_text = format_text(text)

# Tokenize the formatted input
inputs = tokenizer(
    formatted_text,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=128,
)

# Generate the corrected sentence
outputs = model.generate(
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    max_new_tokens=128,
    temperature=0.5,
    early_stopping=True,
    num_beams=5,
    eos_token_id=tokenizer.convert_tokens_to_ids('</OUTPUT>'),
)

# Decode the generated tokens
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)

# Extract the corrected sentence from the generated text
# Remove the input prompt and any special tokens
if '<OUTPUT>' in generated_text:
    corrected_sentence = generated_text.split('<OUTPUT>')[-1].split('</OUTPUT>')[0].strip()
else:
    # Fallback in case the special tokens are not present
    corrected_sentence = generated_text

print("Corrected sentence:", corrected_sentence)

Setting `pad_token_id` to `eos_token_id`:49155 for open-end generation.


Corrected sentence: I like turtles.


Expected output: I like turtles.

In [27]:
# !pip install evaluate
import evaluate
from torch.utils.data import DataLoader

bleu = evaluate.load("bleu")

def evaluate_model(model, tokenizer, dataset, device, data_collator):
    preds = []
    targets = []

    model.eval()  # Set model to evaluation mode

    # Use the same data collator as in training
    dataloader = DataLoader(dataset, batch_size=32, shuffle=False, collate_fn=data_collator)

    for batch in dataloader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        with torch.no_grad():
            outputs = model.generate(
                input_ids=input_ids,
                attention_mask=attention_mask,
                max_new_tokens=128,
                num_beams=5,
                temperature=0.0,
                eos_token_id=tokenizer.eos_token_id
            )

        # Decode predictions
        decoded_preds = tokenizer.batch_decode(outputs, skip_special_tokens=True)

        # Prepare labels for decoding by replacing -100 with pad_token_id
        labels_for_decoding = labels.clone()
        labels_for_decoding[labels_for_decoding == -100] = tokenizer.pad_token_id

        decoded_targets = tokenizer.batch_decode(labels_for_decoding, skip_special_tokens=True)

        preds.extend(decoded_preds)
        targets.extend([[target] for target in decoded_targets])  # References need to be a list of lists

    # Compute BLEU score
    results = bleu.compute(predictions=preds, references=targets)
    return results["bleu"]

In [28]:
# Evaluate model, use the function given above
model.to('cuda')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
bleu_score = evaluate_model(model, tokenizer, test_dataset_tokenized, device, data_collator)
print(f"BLEU score on test set: {bleu_score}")

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:0 for o

BLEU score on test set: 0.13104930115035918


Expected BLEU score after 1 epoch SFT is ~ 0.48.

## **2.2 Create a preference optimization dataset [5 points]**

* *Generate Output Variants* -- for each input sentence in the training set, use the fine-tuned model to generate two different output variants.
 * Consider using different decoding strategies, such as varying the temperature or beam size, to produce diverse outputs. Select an approach based on the desired balance between diversity and quality.

* *Preference Annotation* -- measure the edit distance between each **generated predicted variant** and **ground truth correction**. Label the variant with the lower edit distance as "chosen" and the one with the higher edit distance as "rejected."
 * Beyond using edit distance, what other metrics or methods could you consider to do preference dataset annotation?


In [None]:
from fast_edit_distance import edit_distance

# TODO: Create preference optimization dataset



In [None]:
# TODO: (Load and) Visualize the created dataset -- display at least 5 lines of the dataset.




## **2.3 Run Direct Preference Optimization (DPO) [5 points]**
* Use the preference optimization dataset to further train the model through DPO, a method that leverages human-like preferences for model training.
* After running DPO, measure the BLEU score on the test set. Compare this performance to the baseline established during the SFT phase.
* Search for an optimal set of hyperparameters, such as the learning rate and number of epochs. We provide an estimated BLEU score that you should aim to achieve after one epoch. However, you may achieve a better score by finding the most suitable hyperparameters.

In [None]:
import os
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM
from datasets import Dataset
import pandas as pd

# TODO: Run Direct Preference Optimization (DPO)



In [None]:
# TODO: Evaluate model, use evaluate_model function



Expected BLEU score after 1 epoch SFT + DPO is ~ 0.50.

# **Coding Challenge Part 3: Explore Alternative DPO Variants for Improved Model Performance [10 points]**

Consider employing a different version or variant of DPO. Your task is to:

* Choose a variant of DPO or another preference-based optimization method that could potentially enhance the model's performance.
* Describe the specific differences in this approach compared to the initial DPO method used.
* Train the model using this alternative DPO method and measure its performance on the test set using the BLEU score.
* Compare these results with the baseline performance achieved during the initial Supervised Fine-Tuning (SFT) and the first DPO implementation.
* Select a few GEC example after SFT, DPO and this DPO variant phases and compare the quality of the corrections, which one you prefer as human?
* You are allowed to make changes in the preference data annotation to improve the score, e.g. apply different metrics or methods beyond edit distance.
* Discuss the role of any changes in achieving these results. Consider potential trade-offs or limitations introduced by the new approach.