### 1. Use case and task

- The competitive programming code generation problem can be viewed as a sequence-to-sequence translation task.
- Given a problem description 'X' in natural language, produce a corresponding solution 'Y' in a programming language.
- The evaluation metric is the "percentage of problems solved using 'n' submissions from 'k' samples per problem", denoted 'n@k'.
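To make the 'n@k' definition concrete, here is a minimal sketch of how it could be computed. It assumes the pass/fail outcome of each chosen submission is already known; the function name `n_at_k` and the toy results are hypothetical and not part of this notebook.

```python
from typing import Sequence


def n_at_k(submission_results: Sequence[Sequence[bool]], n: int) -> float:
    """Percentage of problems solved by at least one of the first n submissions.

    submission_results[i] holds the pass/fail outcomes of the submissions
    selected (out of the k samples) for problem i, in submission order.
    """
    if not submission_results:
        return 0.0
    solved = sum(any(results[:n]) for results in submission_results)
    return 100.0 * solved / len(submission_results)


# Three hypothetical problems with up to three submissions each.
results = [
    [False, True, False],   # solved on the 2nd submission
    [False, False, False],  # never solved
    [True, False, False],   # solved on the 1st submission
]
print(f"1@k: {n_at_k(results, n=1):.1f}%")
print(f"2@k: {n_at_k(results, n=2):.1f}%")
```

A problem counts as solved if any of its first `n` submissions passes, so the score can only stay the same or grow as `n` increases.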

### 2. Setup development environment

In [None]:
!pip install \
    "torch==2.3.0" \
    tensorboard

!pip install --upgrade \
    "transformers==4.41.2" \
    "accelerate==0.30.1" \
    "datasets==2.19.1" \
    "peft==0.11.1" \
    "bitsandbytes==0.43.1" \
    "trl==0.8.6" \
    "evaluate==0.4.2" \
    huggingface_hub huggingface

!pip install -U sentencepiece

In [None]:
import torch
assert torch.cuda.get_device_capability()[0] >= 8, 'Hardware not supported for Flash Attention'

# on a multi-gpu machine
!FLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE pip install flash-attn --no-build-isolation

# NOTE: use when 'Hardware not supported for Flash Attention'
# on a single gpu or only cpu machine 
! pip install ninja packaging
! MAX_JOBS=4 pip install flash-attn --no-build-isolation
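The assertion in this cell stops the notebook on GPUs older than compute capability 8.0. One alternative, sketched here as an assumption rather than something this notebook does, is to fall back to PyTorch's scaled dot-product attention (`"sdpa"`), which `transformers` also accepts as an `attn_implementation` value. The helper name `pick_attn_implementation` is hypothetical.

```python
def pick_attn_implementation(compute_capability_major: int) -> str:
    """Return an attn_implementation string for AutoModelForCausalLM.from_pretrained.

    Flash Attention 2 needs compute capability >= 8 (Ampere or newer);
    otherwise fall back to PyTorch's built-in SDPA kernels.
    """
    return "flash_attention_2" if compute_capability_major >= 8 else "sdpa"


# In the notebook the argument would come from
# torch.cuda.get_device_capability()[0]; plain integers are used here
# so the sketch runs without a GPU.
print(pick_attn_implementation(8))
print(pick_attn_implementation(7))
```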

In [1]:
!git config --global credential.helper store

In [2]:
from huggingface_hub import login
 
login(
  token="<HF_API_KEY_REMOVED>", 
  add_to_git_credential=True
)

In [5]:
from datasets import load_dataset, Dataset
from pprint import pprint
import pandas as pd

In [None]:
!sudo apt install unzip
!mkdir -p data-rb-25000
!wget "https://www.dropbox.com/scl/fi/4773pmla01womzshvvqm6/model-25000.zip?rlkey=u9zhzidy0kn3c0xjtxkqve7uq&st=hz6i63rs&dl=1" -O model.zip
!unzip model.zip -d data-rb-25000

In [6]:
from datasets import load_from_disk
dataset = load_from_disk('data-rb-25000')
# fixed seed so the train/test split is identical after a kernel restart
dataset = dataset.train_test_split(test_size=0.05, seed=42)

In [5]:
from random import randint

In [7]:
mistral_instruct_template = "[INST]{instruction}[/INST]"

system_prompt = """Your job is to generate random, valid HTML+CSS code for data set generation. Ensure the HTML and CSS have valid syntax and are compatible with modern web browsers. Only geneate the <body> tag and its contents, the <html> and <head> tags are not needed. The CSS should the very last element in code generated and should be enclosed in <style> tags.

You must generate only HTML and CSS code within <answer> XML tags."""

human_prompt = """Generate HTML and CSS."""

assistant_prompt = """<answer>
{code}
</answer>"""

def format_dataset(sample):
    # a. construct prompt (system + human message wrapped in the instruct template)
    system_message = system_prompt.format()
    human_message = human_prompt
    instruction = f"{system_message}\n\n{human_message}"
    prompt = mistral_instruct_template.format(instruction=instruction)
    
    # b. construct completion
    completion = assistant_prompt.format(
        code=sample["html"]
    )
   
    # c. instruction format
    sample["prompt"] = prompt
    sample["completion"] = completion
    return sample

# convert dataset to instruct prompt template
columns_to_remove = list(dataset["train"].column_names)
print(f"columns_to_remove: {columns_to_remove}")
dataset = dataset.map(format_dataset, remove_columns=columns_to_remove, batched=False)
print(f"len(dataset): {len(dataset)}")

columns_to_remove: ['svg', 'html']


Map:   0%|          | 0/22922 [00:00<?, ? examples/s]

Map:   0%|          | 0/1207 [00:00<?, ? examples/s]

len(dataset): 2
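To see what `format_dataset` produces without loading the real dataset, the following sketch applies the same templating to one toy record. The shortened system message, the toy HTML, and the helper name `build_pair` are assumptions made for illustration only.

```python
# Same template shapes as in the cell above, with a shortened system message.
instruct_template = "[INST]{instruction}[/INST]"
answer_template = "<answer>\n{code}\n</answer>"


def build_pair(system_message: str, human_message: str, html: str) -> dict:
    """Build the prompt/completion pair for one sample."""
    instruction = f"{system_message}\n\n{human_message}"
    return {
        "prompt": instruct_template.format(instruction=instruction),
        # The HTML is passed as a format argument, so CSS braces inside it
        # are inserted verbatim and do not need escaping.
        "completion": answer_template.format(code=html),
    }


toy = build_pair(
    "Generate HTML and CSS inside <answer> tags.",
    "Generate HTML and CSS.",
    "<body><p>hi</p></body>\n<style>p { color: red; }</style>",
)
print(toy["prompt"])
print(toy["completion"])
```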


In [7]:
# print random sample
# randint is inclusive on both ends, so subtract 1 to stay within bounds
pprint(dataset['train'][randint(0, len(dataset['train']) - 1)])

{'completion': '<answer>\n'
               '<body class="ascend home js-comp-ver-6.13.1 page page-id-6452 '
               'page-template-default vc_responsive wpb-js-composer" '
               'data-aie="none" data-ajax-transitions="false" '
               'data-animated-anchors="true" data-apte="standard" '
               'data-bg-header="false" data-body-border="off" '
               'data-boxed-style="" data-button-style="default" data-cad="750" '
               'data-cae="easeOutCubic" data-cart="false" '
               'data-col-gap="default" data-dropdown-style="minimal" '
               'data-ext-padding="90" data-ext-responsive="true" '
               'data-fancy-form-rcs="default" data-flex-cols="true" '
               'data-footer-reveal="false" data-footer-reveal-shadow="none" '
               'data-force-header-trans-color="light" '
               'data-form-style="minimal" data-form-submit="regular" '
               'data-full-width-header="false" data-header-breakpoint="

In [8]:
!mkdir -p sft_cache
!mkdir -p sft_cache/checkpoints
!mkdir -p sft_cache/model
!mkdir -p sft_cache/offload

In [9]:
### model
model_id = "mistral-community/Codestral-22B-v0.1"

### qlora related
r = 64
lora_alpha = 16
lora_dropout = 0.1
target_modules = [ "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
task_type = "CAUSAL_LM"

### bitsandbytes related
load_in_4bit=True
bnb_4bit_use_double_quant=True
bnb_4bit_quant_type="nf4"
bnb_4bit_compute_dtype="bfloat16"


### training related
output_dir = "sft_cache/checkpoints" # pre-existing folder path
save_model_dir = "sft_cache/model/"  # pre-existing folder path
offload_folder = "sft_cache/offload" # pre-existing folder path
logging_dir=f"{output_dir}/logs"

num_train_epochs = 1
# max_steps = 100 # number of training steps (overrides num_train_epochs)

per_device_train_batch_size = 1
per_device_eval_batch_size = 1
gradient_accumulation_steps = 1
gradient_checkpointing = True

bf16 = True
fp16 = False

max_grad_norm = 0.3
weight_decay = 0.001
# optim = "paged_adamw_32bit"
optim = "adamw_torch"

learning_rate = 2e-4
warmup_ratio = 0.03
lr_scheduler_type = "constant"

save_strategy = "no"
logging_steps = 25
logging_strategy = "steps"
group_by_length = True

max_seq_length = 4096
packing = False
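With these settings the number of optimizer steps follows directly from the dataset size. The sketch below is rough arithmetic, assuming a single training process and the 22,922 training samples reported by the `Map` progress bar earlier. The trainer's own rounding may differ slightly when gradient accumulation is greater than 1.

```python
import math

num_train_samples = 22922          # size of the train split (from the Map output)
per_device_train_batch_size = 1
gradient_accumulation_steps = 1
num_train_epochs = 1
num_processes = 1                  # assumption: one training process

# Samples consumed per optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_processes
)
steps_per_epoch = math.ceil(num_train_samples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(f"effective batch size: {effective_batch_size}")
print(f"total optimizer steps: {total_steps}")
```

An effective batch size of 1 means every sample triggers its own optimizer update, so one epoch takes as many steps as there are training samples.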

In [10]:
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)

In [11]:
# define tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'right'

In [12]:
# define 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=load_in_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_use_double_quant=bnb_4bit_use_double_quant,
    bnb_4bit_compute_dtype=getattr(torch, bnb_4bit_compute_dtype),
)
# define model with quantization
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    use_cache=not gradient_checkpointing,  # the KV cache is incompatible with gradient checkpointing
    quantization_config=bnb_config,
    device_map="auto"
)
model.config.use_cache = False
model.config.pretraining_tp = 1 # num_of_gpus
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})

Loading checkpoint shards:   0%|          | 0/9 [00:00<?, ?it/s]
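A back-of-the-envelope estimate shows why 4-bit loading matters for a model of this size. The sketch assumes roughly 22 billion parameters (taken from the model name) and counts weights only. It ignores quantization constants, activations, LoRA parameters, and optimizer state, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to hold the weights, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


approx_params = 22e9  # assumption: ~22B parameters, as suggested by the model name

print(f"bf16 weights : ~{weight_memory_gb(approx_params, 16):.0f} GB")
print(f"4-bit weights: ~{weight_memory_gb(approx_params, 4):.0f} GB")
```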

In [13]:
import bitsandbytes as bnb
from peft import LoraConfig

In [14]:
def find_all_linear_names(model):
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, bnb.nn.Linear4bit):
            names = name.split(".")
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if "lm_head" in lora_module_names:  # needed for 16-bit
        lora_module_names.remove("lm_head")
    return list(lora_module_names)


# get lora target modules
modules = find_all_linear_names(model)
print(modules) # NOTE: update target_modules with these values

['up_proj', 'down_proj', 'q_proj', 'v_proj', 'k_proj', 'o_proj', 'gate_proj']


In [15]:
lora_config = LoraConfig(
    r=r,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    target_modules=target_modules,
    bias="none",
    task_type=task_type,
)
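Two quantities follow from `r` and `lora_alpha`: the factor by which the adapter update is scaled (`lora_alpha / r` in PEFT's default LoRA) and the number of trainable parameters each adapted layer gains. The sketch below works both out. The 4096 x 4096 layer shape is a hypothetical example and not Codestral's actual dimensions.

```python
r = 64
lora_alpha = 16

# Default LoRA scaling applied to the low-rank update B @ A.
scaling = lora_alpha / r


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


d_in, d_out = 4096, 4096  # hypothetical layer shape, for illustration only
adapter_params = lora_trainable_params(d_in, d_out, r)
full_params = d_in * d_out

print(f"scaling: {scaling}")
print(f"adapter params: {adapter_params} "
      f"({100 * adapter_params / full_params:.3f}% of the full layer)")
```

With `lora_alpha` smaller than `r`, the scaling factor is below 1, which damps the adapter's contribution relative to the frozen weights.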

In [16]:
from transformers import TrainingArguments
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

In [17]:
# For more info, see "Train on completions only": https://huggingface.co/docs/trl/en/sft_trainer

def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['prompt'])):
        text = f"{example['prompt'][i]}\n\n ### Answer: {example['completion'][i]}"
        output_texts.append(text)
    return output_texts

collator = DataCollatorForCompletionOnlyLM(
    response_template="### Answer:", 
    tokenizer=tokenizer
)
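Conceptually, a completion-only collator masks every label up to and including the response template, so the loss is computed only on the answer tokens. The sketch below is a simplified illustration with made-up token ids and is not TRL's actual implementation. The function name `mask_prompt_tokens` is hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_prompt_tokens(input_ids: list, response_template_ids: list) -> list:
    """Return labels where everything up to the end of the template is ignored."""
    labels = list(input_ids)
    m = len(response_template_ids)
    for start in range(len(input_ids) - m + 1):
        if input_ids[start:start + m] == response_template_ids:
            end = start + m
            labels[:end] = [IGNORE_INDEX] * end
            return labels
    # Template not found: in this sketch the whole example is excluded from the loss.
    return [IGNORE_INDEX] * len(input_ids)


# Made-up ids: [prompt tokens..., template tokens (900, 901), answer tokens...]
input_ids = [1, 10, 11, 12, 900, 901, 20, 21, 2]
labels = mask_prompt_tokens(input_ids, response_template_ids=[900, 901])
print(labels)
```

Because matching happens on token ids, it is worth checking that the template tokenizes the same way in isolation as it does inside the full text. Otherwise no match is found and the example contributes nothing to training.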

In [18]:
# set training arguments
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    # max_steps=max_steps,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    gradient_checkpointing=gradient_checkpointing,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    weight_decay=weight_decay,    
    optim=optim,
    learning_rate=learning_rate,
    warmup_ratio=warmup_ratio,
    lr_scheduler_type=lr_scheduler_type,
    save_strategy=save_strategy,
    logging_steps=logging_steps,
    logging_strategy=logging_strategy,
    group_by_length=group_by_length,
)

In [19]:
trainer = SFTTrainer(
    args=training_arguments,
    model=model,
    peft_config=lora_config,
    tokenizer=tokenizer,
    train_dataset=dataset['train'],
    formatting_func=formatting_prompts_func,
    data_collator=collator,
    max_seq_length=max_seq_length,
    packing=packing
)

Map:   0%|          | 0/22922 [00:00<?, ? examples/s]

In [20]:
trainer.train()



Step,Training Loss
25,0.548
50,0.6182
75,0.584
100,0.6238
125,0.581
150,0.5076
175,0.6662
200,0.5434
225,0.5435
250,0.5595





TrainOutput(global_step=22922, training_loss=0.44771109186077124, metrics={'train_runtime': 75855.4469, 'train_samples_per_second': 0.302, 'train_steps_per_second': 0.302, 'total_flos': 1.1310742966738575e+19, 'train_loss': 0.44771109186077124, 'epoch': 1.0})
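The reported metrics can be sanity-checked with a little arithmetic. The sketch below uses only the numbers printed in the `TrainOutput` above.

```python
train_runtime_s = 75855.4469  # 'train_runtime' from TrainOutput
global_step = 22922           # 'global_step' from TrainOutput

steps_per_second = global_step / train_runtime_s
seconds_per_step = train_runtime_s / global_step
runtime_hours = train_runtime_s / 3600

print(f"steps/second : {steps_per_second:.3f}")
print(f"seconds/step : {seconds_per_step:.2f}")
print(f"total runtime: {runtime_hours:.2f} hours")
```

The steps-per-second figure matches the reported `train_steps_per_second`, and since the batch size is 1 it also equals `train_samples_per_second`.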

In [21]:
# save the trained LoRA adapter weights (the 4-bit base model itself is not saved here)
trainer.model.save_pretrained(output_dir, safe_serialization=False)



In [None]:
# clear memory
del model
del trainer
torch.cuda.empty_cache()

In [None]:
from peft import AutoPeftModelForCausalLM

In [None]:
model = AutoPeftModelForCausalLM.from_pretrained(
    output_dir,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    trust_remote_code=True, 
)  

Loading checkpoint shards:   0%|          | 0/9 [00:00<?, ?it/s]

In [None]:
merged_model = model.merge_and_unload()
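Merging folds the low-rank update into the base weight, so the adapter is no longer needed at inference time. The pure-Python sketch below illustrates the idea for a single linear layer as `W_merged = W + (lora_alpha / r) * B @ A`. The tiny matrices are hypothetical, and this is a conceptual illustration rather than PEFT's implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, lora_alpha, r):
    """Fold the scaled low-rank update B @ A into the base weight W."""
    scaling = lora_alpha / r
    delta = matmul(B, A)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


# Hypothetical layer: d_out=2, d_in=3, rank r=1.
W = [[1, 0, 2], [0, 1, 0]]   # base weight (d_out x d_in)
A = [[1, 2, 3]]              # LoRA A (r x d_in)
B = [[2], [4]]               # LoRA B (d_out x r)

W_merged = merge_lora(W, A, B, lora_alpha=2, r=1)
print(W_merged)

# The merged layer gives the same output as base + scaled adapter path.
x = [[1], [1], [1]]
print(matmul(W_merged, x))
```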

In [None]:
merged_model.save_pretrained(save_model_dir, safe_serialization=True,  max_shard_size="2GB")

In [None]:
tokenizer.save_pretrained(save_model_dir)

('sft_cache/model/tokenizer_config.json',
 'sft_cache/model/special_tokens_map.json',
 'sft_cache/model/tokenizer.model',
 'sft_cache/model/added_tokens.json',
 'sft_cache/model/tokenizer.json')

In [None]:
del model
del merged_model
del tokenizer

torch.cuda.empty_cache()

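Restart the kernel and run the notebook from this section onward. Before continuing, re-run the cells that load the dataset and define the prompt template, since the evaluation cells below depend on `dataset`.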

In [1]:
import gc, torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.cuda.empty_cache()
gc.collect()

40

In [2]:
model_local_path = "sft_cache/model/"
print(f"model_local_path: {model_local_path}")

model_local_path: sft_cache/model/


In [3]:
tokenizer = AutoTokenizer.from_pretrained(
    model_local_path, trust_remote_code=True
)
tokenizer.pad_token = tokenizer.eos_token

sft_model = AutoModelForCausalLM.from_pretrained(
    model_local_path,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

Loading checkpoint shards:   0%|          | 0/23 [00:00<?, ?it/s]

In [8]:
eval_sample = dataset['test'][6]
eval_prompt, eval_completion = eval_sample["prompt"], eval_sample["completion"]

print(f"prompt: {eval_prompt}")
print("\n", "*" * 25, "\n")
print(f"completion: {eval_completion}")

prompt: [INST]Your job is to generate random, valid HTML+CSS code for data set generation. Ensure the HTML and CSS have valid syntax and are compatible with modern web browsers. Only geneate the <body> tag and its contents, the <html> and <head> tags are not needed. The CSS should the very last element in code generated and should be enclosed in <style> tags.

You must generate only HTML and CSS code within <answer> XML tags.

Generate HTML and CSS.[/INST]

 ************************* 

completion: <answer>
<body><nav class="header navbar navbar-fixed-top navbar-inverse" id="header"><div class="container"><div class="navbar-header"><button aria-expanded="false" class="collapsed navbar-toggle toggle-menu" data-target="#bs-example-navbar-collapse-1" data-toggle="collapse" type="button"><span class="sr-only">Toggle navigation</span> <span class="icon-bar"></span> <span class="icon-bar"></span> <span class="icon-bar"></span></button> <a href="#" class="nav-logo navbar-brand visible-xs"><img

In [11]:
model_inputs = tokenizer([eval_prompt], return_tensors="pt").to("cuda")
sft_model.eval()
with torch.no_grad():
    generated_ids = sft_model.generate(
        **model_inputs, max_new_tokens=1000, do_sample=True,
        length_penalty=-5.0, repetition_penalty=2.0, num_beams=10,
    )
    results = tokenizer.batch_decode(generated_ids)[0]
    # to print only the newly generated tokens, slice off the prompt along the sequence axis:
    # prompt_length = model_inputs['input_ids'].shape[1]
    # results = tokenizer.batch_decode(generated_ids[:, prompt_length:])[0]
    print(results)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> [INST]Your job is to generate random, valid HTML+CSS code for data set generation. Ensure the HTML and CSS have valid syntax and are compatible with modern web browsers. Only geneate the <body> tag and its contents, the <html> and <head> tags are not needed. The CSS should the very last element in code generated and should be enclosed in <style> tags.

You must generate only HTML and CSS code within <answer> XML tags.

Generate HTML and CSS.[/INST]

<answer><body><h1>Not Found</h1><p>The requested URL was not found on this server.</p></body>

<style>

</style>
</answer></s>
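To use the generated markup, the code between the `<answer>` tags has to be pulled out of the decoded text. The sketch below shows one way to do it with the standard library. The helper name `extract_answer` is hypothetical, and the sample string is a shortened version of the output above. Because the instruction itself mentions `<answer>`, the search is restricted to the text after `[/INST]`.

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(generated_text: str):
    """Return the content of the first <answer> block after the instruction, or None."""
    # Drop the prompt: keep only what follows the last [/INST] marker.
    _, _, completion = generated_text.rpartition("[/INST]")
    match = ANSWER_RE.search(completion)
    return match.group(1).strip() if match else None


sample = (
    "<s> [INST]You must generate only HTML and CSS code within <answer> XML tags.\n\n"
    "Generate HTML and CSS.[/INST]\n\n"
    "<answer><body><h1>Not Found</h1></body>\n\n<style>\n\n</style>\n</answer></s>"
)
print(extract_answer(sample))
```

If generation stops before the closing tag is produced, as can happen when `max_new_tokens` is reached, the function returns `None` rather than a partial document.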


# Original model

In [12]:
del sft_model
del tokenizer

In [13]:
import gc, torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.cuda.empty_cache()
gc.collect()

61

In [14]:
model_id = "mistral-community/Codestral-22B-v0.1"
print(f"model_id: {model_id}")

model_id: mistral-community/Codestral-22B-v0.1


In [15]:
# load the tokenizer that ships with the base model, not the fine-tuned copy
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

Loading checkpoint shards:   0%|          | 0/9 [00:00<?, ?it/s]

In [16]:
eval_sample = dataset['test'][6]
eval_prompt, eval_completion = eval_sample["prompt"], eval_sample["completion"]

print(f"prompt: {eval_prompt}")
print("\n", "*" * 25, "\n")
print(f"completion: {eval_completion}")

prompt: [INST]Your job is to generate random, valid HTML+CSS code for data set generation. Ensure the HTML and CSS have valid syntax and are compatible with modern web browsers. Only geneate the <body> tag and its contents, the <html> and <head> tags are not needed. The CSS should the very last element in code generated and should be enclosed in <style> tags.

You must generate only HTML and CSS code within <answer> XML tags.

Generate HTML and CSS.[/INST]

 ************************* 

completion: <answer>
<body><nav class="header navbar navbar-fixed-top navbar-inverse" id="header"><div class="container"><div class="navbar-header"><button aria-expanded="false" class="collapsed navbar-toggle toggle-menu" data-target="#bs-example-navbar-collapse-1" data-toggle="collapse" type="button"><span class="sr-only">Toggle navigation</span> <span class="icon-bar"></span> <span class="icon-bar"></span> <span class="icon-bar"></span></button> <a href="#" class="nav-logo navbar-brand visible-xs"><img

In [17]:
model_inputs = tokenizer([eval_prompt], return_tensors="pt").to("cuda")
base_model.eval()
with torch.no_grad():
    generated_ids = base_model.generate(
        **model_inputs, max_new_tokens=32_000, do_sample=True
    )
    results = tokenizer.batch_decode(generated_ids)[0]
    print(results)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> [INST]Your job is to generate random, valid HTML+CSS code for data set generation. Ensure the HTML and CSS have valid syntax and are compatible with modern web browsers. Only geneate the <body> tag and its contents, the <html> and <head> tags are not needed. The CSS should the very last element in code generated and should be enclosed in <style> tags.

You must generate only HTML and CSS code within <answer> XML tags.

Generate HTML and CSS.[/INST]
<answer>

<div class="container">
  <h1>Welcome to Our Site</h1>
  <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed euismod, nibh vel elementum euismod.</p>
  <a href="#" class="button">Learn More</a>
</div>

<style>
.container {
  max-width: 800px;
  margin: 0 auto;
  padding: 20px;
  text-align: center;
  background-color: #f4f4f4;
  border-radius: 8px;
  box-shadow: 0 0 10px rgba(0, 0, 0, 0.1);
}

h1 {
  font-size: 36px;
  margin-bottom: 20px;
}

p {
  line-height: 1.6;
  margin-bottom: 20px;
}

a.button {
  display: in

#### Inference: chat with vLLM