# 01 - Data Preparation for TinyLlama Fine-Tuning

**Datasets:**
- SFT: `databricks/databricks-dolly-15k`
- DPO: `argilla/distilabel-intel-orca-dpo-pairs`

In [12]:
# Fix protobuf conflict + install dependencies
!pip uninstall -y protobuf -q
!pip install -q protobuf==3.20.0
!pip install -q datasets transformers peft trl bitsandbytes accelerate sentencepiece sacrebleu nltk

In [13]:
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer
from collections import Counter
import json

SEED = 42
MAX_SFT = 10000
MAX_DPO = 5000

## 1. Load Dolly-15k Dataset

In [14]:
dolly = load_dataset('databricks/databricks-dolly-15k', split='train')
print(f'Samples: {len(dolly)}, Columns: {dolly.column_names}')
dolly[0]

Samples: 15011, Columns: ['instruction', 'context', 'response', 'category']


{'instruction': 'When did Virgin Australia start operating?',
 'context': "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.",
 'response': 'Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.',
 'category': 'closed_qa'}

In [15]:
# Category distribution
for cat, cnt in Counter(dolly['category']).most_common():
    print(f'{cat}: {cnt}')

open_qa: 3742
general_qa: 2191
classification: 2136
closed_qa: 1773
brainstorming: 1766
information_extraction: 1506
summarization: 1188
creative_writing: 709


## 2. Load Tokenizer and Define Formatting

In [16]:
tokenizer = AutoTokenizer.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0')
print(f'Vocab size: {tokenizer.vocab_size}')
print(f'EOS token: {tokenizer.eos_token}')

Vocab size: 32000
EOS token: </s>


In [17]:
SYS_TOK = "<|system|>"
USR_TOK = "<|user|>"
AST_TOK = "<|assistant|>"
EOS_TOK = "</s>"
SYSTEM_MSG = "You are a helpful assistant."

def format_sft(example):
    """Render one Dolly example as system, user, and assistant turns, each closed by EOS."""
    instr = example['instruction']
    ctx = example.get('context', '')
    resp = example['response']
    user_content = f"{instr}\n\nContext: {ctx}" if ctx.strip() else instr
    text = f"{SYS_TOK}\n{SYSTEM_MSG}{EOS_TOK}\n{USR_TOK}\n{user_content}{EOS_TOK}\n{AST_TOK}\n{resp}{EOS_TOK}"
    return {"text": text}

In [18]:
# Apply formatting
sft_data = dolly.shuffle(seed=SEED).select(range(min(MAX_SFT, len(dolly))))
sft_data = sft_data.map(format_sft)
print(f'Formatted {len(sft_data)} samples')
print('Sample:\n' + sft_data[0]['text'][:500])

Formatted 10000 samples
Sample:
<|system|>
You are a helpful assistant.</s>
<|user|>
Who were the children of the legendary Garth Greenhand, the High King of the First Men in the series A Song of Ice and Fire?</s>
<|assistant|>
Garth the Gardener, John the Oak, Gilbert of the Vines, Brandon of the Bloody Blade, Foss the Archer, Owen Oakenshield, Harlon the Hunter, Herndon of the Horn, Bors the Breaker, Florys the Fox, Maris the Maid, Rose of the Red Lake, Ellyn Ever Sweet, Rowan Gold-Tree</s>
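
At inference time the prompt should stop right after the assistant tag so the model writes the response itself. The sketch below mirrors the layout used by `format_sft`; `build_inference_prompt` is a hypothetical helper that does not appear in the notebook, and it assumes the same special-token strings defined above.

```python
SYS_TOK = "<|system|>"
USR_TOK = "<|user|>"
AST_TOK = "<|assistant|>"
EOS_TOK = "</s>"
SYSTEM_MSG = "You are a helpful assistant."


def build_inference_prompt(instruction, context=""):
    """Hypothetical helper: same layout as format_sft, minus the response."""
    # Append the context only when it is non-empty, as format_sft does.
    user_content = (
        f"{instruction}\n\nContext: {context}" if context.strip() else instruction
    )
    # Stop after the assistant tag; the model generates the rest.
    return (
        f"{SYS_TOK}\n{SYSTEM_MSG}{EOS_TOK}\n"
        f"{USR_TOK}\n{user_content}{EOS_TOK}\n"
        f"{AST_TOK}\n"
    )


prompt = build_inference_prompt("When did Virgin Australia start operating?")
print(prompt)
```

Keeping the training and inference layouts identical up to the assistant tag avoids a mismatch between what the model saw during SFT and what it is asked to complete.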


## 3. Create Train/Validation Split

In [19]:
sft_split = sft_data.train_test_split(test_size=0.1, seed=SEED)
print(f'Train: {len(sft_split["train"])}, Val: {len(sft_split["test"])}')

Train: 9000, Val: 1000


## 4. Load DPO Dataset

In [20]:
dpo = load_dataset('argilla/distilabel-intel-orca-dpo-pairs', split='train')
print(f'DPO samples: {len(dpo)}')
print(f'Columns: {dpo.column_names}')
dpo[0]

DPO samples: 12859
Columns: ['system', 'input', 'chosen', 'rejected', 'generations', 'order', 'labelling_model', 'labelling_prompt', 'raw_labelling_response', 'rating', 'rationale', 'status', 'original_chosen', 'original_rejected', 'chosen_score', 'in_gsm8k_train']


{'system': '',
 'input': "You will be given a definition of a task first, then some input of the task.\nThis task is about using the specified sentence and converting the sentence to Resource Description Framework (RDF) triplets of the form (subject, predicate object). The RDF triplets generated must be such that the triplets accurately capture the structure and semantics of the input sentence. The input is a sentence and the output is a list of triplets of the form [subject, predicate, object] that capture the relationships present in the sentence. When a sentence has more than 1 RDF triplet possible, the output must contain all of them.\n\nAFC Ajax (amateurs)'s ground is Sportpark De Toekomst where Ajax Youth Academy also play.\nOutput:",
 'chosen': '[\n  ["AFC Ajax (amateurs)", "has ground", "Sportpark De Toekomst"],\n  ["Ajax Youth Academy", "plays at", "Sportpark De Toekomst"]\n]',
 'rejected': " Sure, I'd be happy to help! Here are the RDF triplets for the input sentence:\n\n[AFC

In [21]:
# Format DPO data
def format_dpo(example):
    """Map the `input` column to the `prompt` field; the `system` column is not used."""
    prompt = example['input']
    chosen = example['chosen']
    rejected = example['rejected']
    return {'prompt': prompt, 'chosen': chosen, 'rejected': rejected}

dpo_data = dpo.shuffle(seed=SEED).select(range(min(MAX_DPO, len(dpo))))
dpo_data = dpo_data.map(format_dpo)
print(f'Formatted {len(dpo_data)} DPO pairs')

Formatted 5000 DPO pairs
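
The column list above includes `status`, `chosen_score`, and `in_gsm8k_train`, which could be used to drop weak preference pairs before sampling. The sketch below is not part of the original pipeline: `keep_pair` is a hypothetical predicate, the score threshold of 8 is an assumption, and the toy rows (including their `status` values) are made up for illustration.

```python
def keep_pair(row, min_chosen_score=8):
    """Keep pairs with a clear, high-scoring winner that are not in GSM8K train."""
    score = row.get("chosen_score")
    return (
        row.get("status") != "tie"          # drop pairs the labeller rated equal
        and score is not None
        and score >= min_chosen_score       # assumed quality threshold
        and not row.get("in_gsm8k_train", False)  # avoid benchmark contamination
    )


# Illustrative rows only; real rows carry many more columns.
toy_rows = [
    {"status": "changed", "chosen_score": 9.0, "in_gsm8k_train": False},
    {"status": "tie", "chosen_score": 8.0, "in_gsm8k_train": False},
    {"status": "unchanged", "chosen_score": 6.0, "in_gsm8k_train": False},
    {"status": "unchanged", "chosen_score": 10.0, "in_gsm8k_train": True},
]
kept = [r for r in toy_rows if keep_pair(r)]
print(len(kept))
```

With the real dataset the same predicate could be passed to `dpo.filter(keep_pair)` before the shuffle and select step, at the cost of a smaller pool to sample from.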


## 5. Save Processed Datasets

In [22]:
sft_split.save_to_disk('data/sft_dataset')
dpo_data.save_to_disk('data/dpo_dataset')
print('Datasets saved!')

Datasets saved!


# 02 - Supervised Fine-Tuning (SFT) with LoRA/qLoRA

**Two trials with different configurations:**
- Trial 1: Conservative (LoRA rank=8, no quantization, fp16 weights)
- Trial 2: Aggressive (qLoRA rank=32, 4-bit)

**Output:** JSON result files for each trial

In [12]:
import torch
import json
import time
import os
from datetime import datetime
from datasets import load_from_disk
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer

SEED = 42
torch.manual_seed(SEED)
os.makedirs('results', exist_ok=True)

## 1. Load Processed Dataset

In [13]:
dataset = load_from_disk('data/sft_dataset')
print(f"Train: {len(dataset['train'])}, Val: {len(dataset['test'])}")
print('Sample:', dataset['train'][0]['text'][:300])

Train: 9000, Val: 1000
Sample: <|system|>
You are a helpful assistant.</s>
<|user|>
Eric Arthur Blaire was the real name of which author</s>
<|assistant|>
George Orwell</s>


## 2. Load Base Model and Tokenizer

In [14]:
MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    use_fast=True
)

tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

print(f"Loaded tokenizer with vocab size: {tokenizer.vocab_size}")


Loaded tokenizer with vocab size: 32000


## 3. Trial 1: Conservative LoRA (rank=8)

In [15]:
# Trial 1 Configuration
TRIAL1_CONFIG = {
    "r": 8,
    "lora_alpha": 16,
    "target_modules": ["q_proj", "v_proj"],
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM"
}

TRIAL1_TRAINING = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "learning_rate": 2e-4,
    "warmup_ratio": 0.03,
    "max_seq_length": 512,
    "quantization": "none"
}

# Load model without quantization for Trial 1 (weights are loaded in fp16, not fp32)
model_t1 = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Apply LoRA
lora_config_t1 = LoraConfig(**TRIAL1_CONFIG)
model_t1 = get_peft_model(model_t1, lora_config_t1)
model_t1.print_trainable_parameters()

`torch_dtype` is deprecated! Use `dtype` instead!

trainable params: 1,126,400 || all params: 1,101,174,784 || trainable%: 0.1023
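
The values in `TRIAL1_TRAINING` imply an effective batch size and a rough number of optimizer steps. The sketch below works through that arithmetic for the 9,000-example training split; `optimizer_steps` is a hypothetical helper, it assumes a single GPU, and the Trainer's own step count may differ slightly depending on how the library version rounds the last partial batch.

```python
import math


def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Estimate effective batch size and total optimizer steps."""
    # One optimizer step consumes this many examples.
    effective_batch = per_device_batch * grad_accum * num_devices
    # Round up so the final partial batch still counts as a step (an assumption).
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


eff, total = optimizer_steps(num_examples=9000, per_device_batch=4, grad_accum=4, epochs=3)
print(eff, total)  # 16 1689
```

With `eval_steps=200`, an estimate of about 1,689 steps means the last scheduled evaluation falls at step 1600.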


In [16]:
# Training arguments for Trial 1
training_args_t1 = TrainingArguments(
    output_dir="./outputs/sft_trial1",
    num_train_epochs=TRIAL1_TRAINING["num_train_epochs"],
    per_device_train_batch_size=TRIAL1_TRAINING["per_device_train_batch_size"],
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=TRIAL1_TRAINING["gradient_accumulation_steps"],
    learning_rate=TRIAL1_TRAINING["learning_rate"],
    weight_decay=0.01,
    warmup_ratio=TRIAL1_TRAINING["warmup_ratio"],
    lr_scheduler_type="cosine",
    logging_steps=25,
    save_steps=200,
    eval_strategy="steps",
    eval_steps=200,
    fp16=True,
    report_to="none",
    seed=SEED,
)
tokenizer.model_max_length = TRIAL1_TRAINING["max_seq_length"]

def formatting_func(example):
    return example["text"]

trainer_t1 = SFTTrainer(
    model=model_t1,
    args=training_args_t1,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,
    formatting_func=formatting_func,
)


Token indices sequence length is longer than the specified maximum sequence length for this model (597 > 512). Running this sequence through the model will result in indexing errors

In [17]:
# Train Trial 1 and save JSON results
print('Starting SFT Trial 1...')
start_time = time.time()
trainer_t1.train()
training_time_t1 = time.time() - start_time

# Save model
trainer_t1.save_model('./outputs/sft_trial1/final')

# Get final metrics
final_metrics_t1 = trainer_t1.state.log_history

# Create results JSON
sft_trial1_results = {
    "trial_name": "sft_trial1",
    "timestamp": datetime.now().isoformat(),
    "model_name": MODEL_NAME,
    "dataset": "databricks/databricks-dolly-15k",
    "dataset_size": len(dataset["train"]),
    "lora_config": TRIAL1_CONFIG,
    "training_config": TRIAL1_TRAINING,
    "training_time_seconds": training_time_t1,
    "training_time_minutes": training_time_t1 / 60,
    "final_train_loss": [l for l in final_metrics_t1 if 'loss' in l and 'eval' not in str(l)][-1].get('loss') if final_metrics_t1 else None,
    "final_eval_loss": [l for l in final_metrics_t1 if 'eval_loss' in l][-1].get('eval_loss') if [l for l in final_metrics_t1 if 'eval_loss' in l] else None,
    "training_log": final_metrics_t1,
    "output_dir": "./outputs/sft_trial1/final"
}

# Save JSON
with open('results/sft_trial1_results.json', 'w') as f:
    json.dump(sft_trial1_results, f, indent=2)
print(f'\nTrial 1 complete! Results saved to results/sft_trial1_results.json')
print(f'Training time: {training_time_t1/60:.2f} minutes')

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 2}.


Starting SFT Trial 1...


Step,Training Loss,Validation Loss,Entropy,Num Tokens,Mean Token Accuracy
200,1.5464,1.501629,1.513144,713644.0,0.660228
400,1.5474,1.48754,1.497339,1418520.0,0.663332
600,1.5599,1.481915,1.476453,2144361.0,0.664209
800,1.4937,1.479266,1.477906,2847519.0,0.664552
1000,1.5231,1.476361,1.48288,3561678.0,0.664757
1200,1.4936,1.475144,1.469151,4255528.0,0.665244
1400,1.4786,1.474673,1.467719,4974727.0,0.665151
1600,1.5067,1.474326,1.468224,5676096.0,0.665302



Trial 1 complete! Results saved to results/sft_trial1_results.json
Training time: 78.95 minutes
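
The list comprehensions used for `final_train_loss` and `final_eval_loss` raise an `IndexError` if no matching entry was logged. A small helper can express the same lookup more defensively. This is a sketch only: `last_logged` is a hypothetical name, and the toy log below uses made-up values shaped like the entries in `trainer.state.log_history`.

```python
def last_logged(log_history, key):
    """Return the most recent value logged under `key`, or None if absent."""
    for entry in reversed(log_history):
        if key in entry:
            return entry[key]
    return None


# Made-up entries: training logs use "loss", evaluation logs use "eval_loss",
# and the end-of-training summary uses "train_loss".
toy_log = [
    {"loss": 1.55, "step": 25},
    {"eval_loss": 1.50, "step": 200},
    {"loss": 1.51, "step": 225},
    {"train_loss": 1.53, "step": 225},
]
print(last_logged(toy_log, "loss"), last_logged(toy_log, "eval_loss"))
```

Because the lookup matches on the exact key, the summary entry with `train_loss` is never mistaken for a step-level training loss.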


## 4. Trial 2: Aggressive qLoRA (rank=32, 4-bit)

In [18]:
# Clear memory
del model_t1, trainer_t1
torch.cuda.empty_cache()

# Trial 2: 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load quantized model
model_t2 = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto"
)
model_t2 = prepare_model_for_kbit_training(model_t2)

In [19]:
# Trial 2 Configuration
TRIAL2_CONFIG = {
    "r": 32,
    "lora_alpha": 64,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "lora_dropout": 0.1,
    "bias": "none",
    "task_type": "CAUSAL_LM"
}

TRIAL2_TRAINING = {
    "num_train_epochs": 5,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 8,
    "learning_rate": 1e-4,
    "warmup_ratio": 0.05,
    "max_seq_length": 512,
    "quantization": "4bit-nf4"
}

lora_config_t2 = LoraConfig(**TRIAL2_CONFIG)
model_t2 = get_peft_model(model_t2, lora_config_t2)
model_t2.print_trainable_parameters()

trainable params: 9,011,200 || all params: 1,109,059,584 || trainable%: 0.8125
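
The trainable-parameter counts printed for both trials can be reproduced by hand: each adapted linear layer with input size `d_in` and output size `d_out` adds `r * (d_in + d_out)` LoRA weights. The sketch below assumes the TinyLlama-1.1B projection shapes (2048 hidden size, 256-wide `k_proj`/`v_proj` outputs, 22 decoder layers); `lora_trainable_params` is a hypothetical helper.

```python
def lora_trainable_params(rank, target_shapes, num_layers):
    """Count LoRA weights: A is (rank x d_in) and B is (d_out x rank) per module."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in target_shapes)
    return per_layer * num_layers


# (in_features, out_features) for each attention projection.
SHAPES = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 256),
    "v_proj": (2048, 256),
    "o_proj": (2048, 2048),
}
NUM_LAYERS = 22

trial1 = lora_trainable_params(8, [SHAPES[m] for m in ("q_proj", "v_proj")], NUM_LAYERS)
trial2 = lora_trainable_params(
    32, [SHAPES[m] for m in ("q_proj", "k_proj", "v_proj", "o_proj")], NUM_LAYERS
)
print(trial1, trial2)  # 1126400 9011200
```

Trial 2 therefore trains eight times as many adapter weights as Trial 1: four times from the higher rank and twice from adapting all four attention projections.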


In [23]:
# Training arguments for Trial 2
training_args_t2 = TrainingArguments(
    output_dir="./outputs/sft_trial2",
    num_train_epochs=TRIAL2_TRAINING["num_train_epochs"],
    per_device_train_batch_size=TRIAL2_TRAINING["per_device_train_batch_size"],
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=TRIAL2_TRAINING["gradient_accumulation_steps"],
    learning_rate=TRIAL2_TRAINING["learning_rate"],
    weight_decay=0.01,
    warmup_ratio=TRIAL2_TRAINING["warmup_ratio"],
    lr_scheduler_type="cosine",
    logging_steps=25,
    save_steps=200,
    eval_strategy="steps",
    eval_steps=200,
    fp16=False,
    bf16=True,
    report_to="none",
    seed=SEED,
)

tokenizer.model_max_length = TRIAL2_TRAINING["max_seq_length"]

def formatting_func(example):
    return example["text"]

trainer_t2 = SFTTrainer(
    model=model_t2,
    args=training_args_t2,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,
    formatting_func=formatting_func,
)

In [24]:
# Train Trial 2 and save JSON results
print('Starting SFT Trial 2...')
start_time = time.time()
trainer_t2.train()
training_time_t2 = time.time() - start_time

# Save model
trainer_t2.save_model('./outputs/sft_trial2/final')

# Get final metrics
final_metrics_t2 = trainer_t2.state.log_history

# Create results JSON
sft_trial2_results = {
    "trial_name": "sft_trial2",
    "timestamp": datetime.now().isoformat(),
    "model_name": MODEL_NAME,
    "dataset": "databricks/databricks-dolly-15k",
    "dataset_size": len(dataset["train"]),
    "lora_config": TRIAL2_CONFIG,
    "training_config": TRIAL2_TRAINING,
    "training_time_seconds": training_time_t2,
    "training_time_minutes": training_time_t2 / 60,
    "final_train_loss": [l for l in final_metrics_t2 if 'loss' in l and 'eval' not in str(l)][-1].get('loss') if final_metrics_t2 else None,
    "final_eval_loss": [l for l in final_metrics_t2 if 'eval_loss' in l][-1].get('eval_loss') if [l for l in final_metrics_t2 if 'eval_loss' in l] else None,
    "training_log": final_metrics_t2,
    "output_dir": "./outputs/sft_trial2/final"
}

with open('results/sft_trial2_results.json', 'w') as f:
    json.dump(sft_trial2_results, f, indent=2)
print(f'\nTrial 2 complete! Results saved to results/sft_trial2_results.json')
print(f'Training time: {training_time_t2/60:.2f} minutes')

Starting SFT Trial 2...


  return fn(*args, **kwargs)


Step,Training Loss,Validation Loss,Entropy,Num Tokens,Mean Token Accuracy
200,1.5668,1.477921,1.505798,713644.0,0.66768
400,1.5685,1.460222,1.462995,1418520.0,0.671989
600,1.5659,1.455854,1.453669,2144361.0,0.672335
800,1.5013,1.451372,1.463733,2847519.0,0.672983
1000,1.5262,1.448696,1.462118,3561678.0,0.673469
1200,1.4766,1.447532,1.430647,4255528.0,0.673854
1400,1.4652,1.446357,1.418738,4974727.0,0.674545
1600,1.4855,1.445532,1.424122,5676096.0,0.674632
1800,1.4809,1.445898,1.422379,6394022.0,0.674335
2000,1.5001,1.445571,1.426421,7115577.0,0.674512




Trial 2 complete! Results saved to results/sft_trial2_results.json
Training time: 217.67 minutes


## 5. Compare SFT Trials

In [28]:
# Load and compare results
with open('results/sft_trial1_results.json') as f:
    t1 = json.load(f)
with open('results/sft_trial2_results.json') as f:
    t2 = json.load(f)

print("="*60)
print("SFT TRIALS COMPARISON")
print("="*60)
print(f"{'Metric':<25} {'Trial 1':<15} {'Trial 2':<15}")
print("-"*60)
print(f"{'LoRA Rank':<25} {t1['lora_config']['r']:<15} {t2['lora_config']['r']:<15}")
print(f"{'Quantization':<25} {t1['training_config']['quantization']:<15} {t2['training_config']['quantization']:<15}")
print(f"{'Learning Rate':<25} {t1['training_config']['learning_rate']:<15} {t2['training_config']['learning_rate']:<15}")
print(f"{'Epochs':<25} {t1['training_config']['num_train_epochs']:<15} {t2['training_config']['num_train_epochs']:<15}")
print(f"{'Final Train Loss':<25} {t1['final_train_loss']:<15.4f} {t2['final_train_loss']:<15.4f}")
print(f"{'Final Eval Loss':<25} {t1['final_eval_loss']:<15.4f} {t2['final_eval_loss']:<15.4f}")
print(f"{'Training Time (min)':<25} {t1['training_time_minutes']:<15.1f} {t2['training_time_minutes']:<15.1f}")

# Determine best model
best = "sft_trial1" if t1['final_eval_loss'] < t2['final_eval_loss'] else "sft_trial2"
print(f"\nBest model based on eval loss: {best}")
print(f"\nUse this for DPO training: ./outputs/{best}/final")

SFT TRIALS COMPARISON
Metric                    Trial 1         Trial 2        
------------------------------------------------------------
LoRA Rank                 8               32             
Quantization              none            4bit-nf4       
Learning Rate             0.0002          0.0001         
Epochs                    3               5              
Final Train Loss          1.5319          1.4266         
Final Eval Loss           1.4743          1.4456         
Training Time (min)       79.4            217.7          

Best model based on eval loss: sft_trial2

Use this for DPO training: ./outputs/sft_trial2/final
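
Eval loss can be easier to compare when expressed as perplexity. The sketch below assumes the reported loss is a mean per-token cross-entropy in nats, in which case perplexity is simply its exponential; `perplexity` is a hypothetical helper and the two loss values are copied from the comparison table.

```python
import math


def perplexity(mean_token_loss):
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_token_loss)


for name, loss in [("sft_trial1", 1.4743), ("sft_trial2", 1.4456)]:
    print(f"{name}: perplexity ~ {perplexity(loss):.2f}")
```

The ranking is unchanged by this transform, since the exponential is monotonic. It only restates the gap between the two trials on a more familiar scale.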


# 03 - Direct Preference Optimization (DPO) Training

**Two trials:**
- Trial 1: Conservative (beta=0.1)
- Trial 2: Aggressive (beta=0.5)

**Output:** JSON result files for each trial
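
The role of `beta` is easiest to see in the per-pair DPO objective: the loss is the negative log-sigmoid of the reward margin, where each implicit reward is `beta` times the log-probability gap between the policy and the reference model. The sketch below is a plain-Python illustration under that standard formulation, not the TRL implementation; `dpo_loss` and the log-probabilities used are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return x + math.log1p(math.exp(-x)) if x > 0 else math.log1p(math.exp(x))


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta):
    """Per-pair DPO loss from sequence log-probabilities."""
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin), chosen_reward, rejected_reward


# Hypothetical sequence log-probabilities for one preference pair.
logps = dict(policy_chosen=-250.0, policy_rejected=-290.0,
             ref_chosen=-252.0, ref_rejected=-285.0)
for beta in (0.1, 0.5):
    loss, r_chosen, r_rejected = dpo_loss(beta=beta, **logps)
    print(f"beta={beta}: loss={loss:.4f}, margin={r_chosen - r_rejected:.2f}")
```

A larger `beta` scales the same log-probability gaps into larger rewards, so losses and reward values from runs with different `beta` are not directly comparable. When the policy and reference agree exactly, the margin is zero and the loss is `ln 2`.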

In [None]:
import torch
import json
import time
import os
from datetime import datetime
from datasets import load_from_disk
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from trl import DPOTrainer, DPOConfig

SEED = 42
torch.manual_seed(SEED)
os.makedirs('results', exist_ok=True)

# Set best SFT model path (update based on evaluation)
BEST_SFT_MODEL = "./outputs/sft_trial2/final"

## 1. Load DPO Dataset

In [None]:
dpo_dataset = load_from_disk('data/dpo_dataset')
print(f"DPO samples: {len(dpo_dataset)}")

# Train/val split
dpo_split = dpo_dataset.train_test_split(test_size=0.1, seed=SEED)
print(f"Train: {len(dpo_split['train'])}, Val: {len(dpo_split['test'])}")

## 2. Load Tokenizer and Quantization Config

In [None]:
MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

# 8-bit config
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

## 3. Trial 1: Conservative DPO (beta=0.1)

In [21]:
# Trial 1 Configuration
DPO_TRIAL1_CONFIG = {
    "beta": 0.1,
    "num_train_epochs": 2,
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "max_length": 512,
    "max_prompt_length": 256,
}

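# NOTE: recorded in the results JSON only; it is never passed to PEFT or DPOTrainer.
# The adapter trained in this trial is the SFT adapter loaded from BEST_SFT_MODEL.
LORA_T1_CONFIG = {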
    "r": 8,
    "lora_alpha": 16,
    "target_modules": ["q_proj", "v_proj"],
    "lora_dropout": 0.05,
}

# Load the 8-bit base model and attach the SFT adapter as the trainable policy
model_t1 = AutoModelForCausalLM.from_pretrained(MODEL_NAME, quantization_config=bnb_config, device_map={"": 0})
model_t1 = prepare_model_for_kbit_training(model_t1)
model_t1 = PeftModel.from_pretrained(model_t1, BEST_SFT_MODEL, is_trainable=True)

# Reference model: the 8-bit base model without the SFT adapter, kept frozen
ref_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map={"": 0},
)
ref_model.eval()
for p in ref_model.parameters():
    p.requires_grad = False


print("Loaded SFT model for DPO Trial 1")

Loaded SFT model for DPO Trial 1


In [22]:
# DPO Training config
dpo_config_t1 = DPOConfig(
    output_dir="./outputs/dpo_trial1",
    beta=DPO_TRIAL1_CONFIG["beta"],
    num_train_epochs=DPO_TRIAL1_CONFIG["num_train_epochs"],
    per_device_train_batch_size=DPO_TRIAL1_CONFIG["per_device_train_batch_size"],
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=DPO_TRIAL1_CONFIG["gradient_accumulation_steps"],
    learning_rate=DPO_TRIAL1_CONFIG["learning_rate"],
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    logging_steps=25,
    save_steps=100,
    eval_strategy="steps",
    eval_steps=100,
    fp16=False,
    bf16=False,
    report_to="none",
    seed=SEED,
    max_length=DPO_TRIAL1_CONFIG["max_length"],
    max_prompt_length=DPO_TRIAL1_CONFIG["max_prompt_length"],
)

trainer_t1 = DPOTrainer(
    model=model_t1,
    ref_model=ref_model,
    args=dpo_config_t1,
    train_dataset=dpo_split["train"],
    eval_dataset=dpo_split["test"],
    processing_class=tokenizer,
)

In [23]:
# Train and save JSON
print("Starting DPO Trial 1...")
start_time = time.time()
trainer_t1.train()
training_time_t1 = time.time() - start_time

trainer_t1.save_model("./outputs/dpo_trial1/final")

final_metrics_t1 = trainer_t1.state.log_history

dpo_trial1_results = {
    "trial_name": "dpo_trial1",
    "timestamp": datetime.now().isoformat(),
    "base_sft_model": BEST_SFT_MODEL,
    "dataset": "argilla/distilabel-intel-orca-dpo-pairs",
    "dataset_size": len(dpo_split["train"]),
    "dpo_config": DPO_TRIAL1_CONFIG,
    "lora_config": LORA_T1_CONFIG,
    "training_time_seconds": training_time_t1,
    "training_time_minutes": training_time_t1 / 60,
    "final_train_loss": [l for l in final_metrics_t1 if 'loss' in l and 'eval' not in str(l)][-1].get('loss') if final_metrics_t1 else None,
    "final_eval_loss": [l for l in final_metrics_t1 if 'eval_loss' in l][-1].get('eval_loss') if [l for l in final_metrics_t1 if 'eval_loss' in l] else None,
    "training_log": final_metrics_t1,
    "output_dir": "./outputs/dpo_trial1/final"
}

with open('results/dpo_trial1_results.json', 'w') as f:
    json.dump(dpo_trial1_results, f, indent=2)
print(f"DPO Trial 1 complete! Saved to results/dpo_trial1_results.json")
print(f"Training time: {training_time_t1/60:.2f} minutes")

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 2}.


Starting DPO Trial 1...




Step,Training Loss,Validation Loss,Rewards/chosen,Rewards/rejected,Rewards/accuracies,Rewards/margins,Logps/chosen,Logps/rejected,Logits/chosen,Logits/rejected
100,0.5049,0.494553,-0.790285,-2.24705,0.822,1.456765,-256.007965,-294.085022,-2.280532,-2.268686
200,0.4429,0.472825,-0.730883,-2.402542,0.83,1.671658,-255.41394,-295.639923,-2.542822,-2.524088
300,0.3101,0.462337,-0.697078,-2.426478,0.83,1.729399,-255.075912,-295.879303,-2.522965,-2.494374
400,0.2686,0.463009,-0.724324,-2.376685,0.824,1.652361,-255.348373,-295.381409,-2.597548,-2.574462
500,0.2284,0.477881,-0.816561,-2.681063,0.824,1.864503,-256.270721,-298.425171,-2.561843,-2.535842




DPO Trial 1 complete! Saved to results/dpo_trial1_results.json
Training time: 226.23 minutes


## 4. Trial 2: Aggressive DPO (beta=0.5)

In [17]:
# Clear memory
del model_t1, trainer_t1, ref_model
torch.cuda.empty_cache()

# Trial 2 Configuration
DPO_TRIAL2_CONFIG = {
    "beta": 0.5,
    "num_train_epochs": 3,
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 8,
    "max_length": 512,
    "max_prompt_length": 256,
}

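# NOTE: recorded in the results JSON only; it is never passed to PEFT or DPOTrainer.
# The adapter trained in this trial is the SFT adapter loaded from BEST_SFT_MODEL.
LORA_T2_CONFIG = {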
    "r": 16,
    "lora_alpha": 32,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "lora_dropout": 0.1,
}

# Reload the 8-bit base model and attach the SFT adapter as the trainable policy
model_t2 = AutoModelForCausalLM.from_pretrained(MODEL_NAME, quantization_config=bnb_config, device_map={"": 0})
model_t2 = prepare_model_for_kbit_training(model_t2)
model_t2 = PeftModel.from_pretrained(model_t2, BEST_SFT_MODEL, is_trainable=True)

# Reference model: the 8-bit base model without the SFT adapter
ref_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map={"": 0},
)
ref_model.eval()

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 2048)
    (layers): ModuleList(
      (0-21): 22 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear8bitLt(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear8bitLt(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear8bitLt(in_features=2048, out_features=5632, bias=False)
          (up_proj): Linear8bitLt(in_features=2048, out_features=5632, bias=False)
          (down_proj): Linear8bitLt(in_features=5632, out_features=2048, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm):

In [18]:
dpo_config_t2 = DPOConfig(
    output_dir="./outputs/dpo_trial2",
    beta=DPO_TRIAL2_CONFIG["beta"],
    num_train_epochs=DPO_TRIAL2_CONFIG["num_train_epochs"],
    per_device_train_batch_size=DPO_TRIAL2_CONFIG["per_device_train_batch_size"],
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=DPO_TRIAL2_CONFIG["gradient_accumulation_steps"],
    learning_rate=DPO_TRIAL2_CONFIG["learning_rate"],
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    logging_steps=25,
    save_steps=100,
    eval_strategy="steps",
    eval_steps=100,
    fp16=False,
    bf16=False,
    report_to="none",
    seed=SEED,
    max_length=DPO_TRIAL2_CONFIG["max_length"],
    max_prompt_length=DPO_TRIAL2_CONFIG["max_prompt_length"],
)

trainer_t2 = DPOTrainer(
    model=model_t2,
    ref_model=ref_model,
    args=dpo_config_t2,
    train_dataset=dpo_split["train"],
    eval_dataset=dpo_split["test"],
    processing_class=tokenizer,
)

In [19]:
# Train and save JSON
print("Starting DPO Trial 2...")
start_time = time.time()
trainer_t2.train()
training_time_t2 = time.time() - start_time

trainer_t2.save_model("./outputs/dpo_trial2/final")

final_metrics_t2 = trainer_t2.state.log_history

# Most recent logged train/eval loss; None if no such entry was logged
# (avoids an IndexError when the filtered log list is empty).
final_train_loss_t2 = next((l["loss"] for l in reversed(final_metrics_t2) if "loss" in l), None)
final_eval_loss_t2 = next((l["eval_loss"] for l in reversed(final_metrics_t2) if "eval_loss" in l), None)

dpo_trial2_results = {
    "trial_name": "dpo_trial2",
    "timestamp": datetime.now().isoformat(),
    "base_sft_model": BEST_SFT_MODEL,
    "dataset": "argilla/distilabel-intel-orca-dpo-pairs",
    "dataset_size": len(dpo_split["train"]),
    "dpo_config": DPO_TRIAL2_CONFIG,
    "lora_config": LORA_T2_CONFIG,
    "training_time_seconds": training_time_t2,
    "training_time_minutes": training_time_t2 / 60,
    "final_train_loss": final_train_loss_t2,
    "final_eval_loss": final_eval_loss_t2,
    "training_log": final_metrics_t2,
    "output_dir": "./outputs/dpo_trial2/final"
}

with open('results/dpo_trial2_results.json', 'w') as f:
    json.dump(dpo_trial2_results, f, indent=2)
print("DPO Trial 2 complete! Saved to results/dpo_trial2_results.json")
print(f"Training time: {training_time_t2/60:.2f} minutes")

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 2}.


Starting DPO Trial 2...




| Step | Training Loss | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/chosen | Logps/rejected | Logits/chosen | Logits/rejected |
|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 1.4438 | 1.376014 | -3.896601 | -7.131872 | 0.728 | 3.235271 | -255.898315 | -285.878296 | -2.270204 | -2.280606 |
| 200 | 1.0043 | 1.169389 | -2.83394 | -5.618758 | 0.744 | 2.784819 | -253.77301 | -282.852051 | -2.429117 | -2.436531 |
| 300 | 0.701 | 1.099856 | -2.57811 | -5.470443 | 0.746 | 2.892333 | -253.261322 | -282.55545 | -2.460869 | -2.466555 |
| 400 | 0.6526 | 1.092212 | -2.456453 | -5.237062 | 0.748 | 2.78061 | -253.018066 | -282.088684 | -2.47012 | -2.475655 |




DPO Trial 2 complete! Saved to results/dpo_trial2_results.json
Training time: 358.31 minutes
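
The `Rewards/margins` column in the Trial 2 log is the difference between `Rewards/chosen` and `Rewards/rejected`. A minimal sketch that re-derives the margins from the logged rows: the `rows` list is copied by hand from the output above, and `sigmoid_loss` is a hypothetical helper that shows the per-pair sigmoid DPO loss, assuming the trainer uses its default sigmoid loss (the config shown does not state this explicitly).

```python
import math

# (step, rewards/chosen, rewards/rejected, rewards/margins) copied from the Trial 2 log
rows = [
    (100, -3.896601, -7.131872, 3.235271),
    (200, -2.83394, -5.618758, 2.784819),
    (300, -2.57811, -5.470443, 2.892333),
    (400, -2.456453, -5.237062, 2.78061),
]

def sigmoid_loss(margin: float) -> float:
    """Per-pair sigmoid DPO loss: -log(sigmoid(margin)) = log(1 + exp(-margin))."""
    return math.log1p(math.exp(-margin))

# Margin = chosen reward minus rejected reward
recomputed = {step: chosen - rejected for step, chosen, rejected, _ in rows}

for step, chosen, rejected, logged in rows:
    print(f"step {step}: recomputed margin {recomputed[step]:.6f}, logged {logged:.6f}")
```

The loss is averaged per pair, so the logged validation loss does not have to equal `sigmoid_loss` applied to the mean margin; a few pairs with strongly negative margins can keep the average loss high even when the mean margin is positive.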


## 5. Compare DPO Trials

In [33]:
# Load and compare
with open('results/dpo_trial1_results.json') as f:
    t1 = json.load(f)
with open('results/dpo_trial2_results.json') as f:
    t2 = json.load(f)

print("="*60)
print("DPO TRIALS COMPARISON")
print("="*60)
print(f"{'Metric':<25} {'Trial 1':<15} {'Trial 2':<15}")
print("-"*60)
print(f"{'Beta':<25} {t1['dpo_config']['beta']:<15} {t2['dpo_config']['beta']:<15}")
print(f"{'Learning Rate':<25} {t1['dpo_config']['learning_rate']:<15} {t2['dpo_config']['learning_rate']:<15}")
print(f"{'Epochs':<25} {t1['dpo_config']['num_train_epochs']:<15} {t2['dpo_config']['num_train_epochs']:<15}")
print(f"{'Final Train Loss':<25} {t1['final_train_loss']:<15.4f} {t2['final_train_loss']:<15.4f}")
print(f"{'Final Eval Loss':<25} {t1['final_eval_loss']:<15.4f} {t2['final_eval_loss']:<15.4f}")
print(f"{'Training Time (min)':<25} {t1['training_time_minutes']:<15.1f} {t2['training_time_minutes']:<15.1f}")
print("\nBoth models ready for manual evaluation!")

DPO TRIALS COMPARISON
Metric                    Trial 1         Trial 2        
------------------------------------------------------------
Beta                      0.1             0.5            
Learning Rate             5e-05           1e-05          
Epochs                    2               3              
Final Train Loss          0.2355          0.6526         
Final Eval Loss           0.4779          1.0922         
Training Time (min)       226.2           358.3          

Both models ready for manual evaluation!


# 04 - Comprehensive Model Evaluation

**Models:** Base, SFT Trial 1 & 2, DPO Trial 1 & 2

**Output:** `results/evaluation_results.json` with all metrics

In [34]:
import torch
import json
import os
from datetime import datetime
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
from sacrebleu.metrics import BLEU
import warnings
warnings.filterwarnings('ignore')

SEED = 42
torch.manual_seed(SEED)
device = "cuda" if torch.cuda.is_available() else "cpu"
os.makedirs('results', exist_ok=True)
print(f"Device: {device}")

Device: cuda


## 1. Load Evaluation Prompts

In [36]:
import json

data = [
    {
        "id": 1,
        "category": "Factual QA - Science",
        "prompt": "What is photosynthesis and why is it important for life on Earth?",
        "target_response": "Photosynthesis is the biological process by which plants, algae, and some bacteria convert light energy (usually from the sun) into chemical energy stored in glucose. During this process, they absorb carbon dioxide from the atmosphere and water from the soil, using chlorophyll to capture light energy. The byproduct is oxygen, which is released into the atmosphere. Photosynthesis is crucial for life on Earth because it produces the oxygen we breathe, forms the base of most food chains by creating organic compounds, and helps regulate atmospheric carbon dioxide levels, playing a vital role in Earth's climate system."
    },
    {
        "id": 2,
        "category": "Factual QA - History",
        "prompt": "Who was Alexander the Great and what were his major achievements?",
        "target_response": "Alexander the Great (356-323 BCE) was a king of the ancient Greek kingdom of Macedon and one of history's most successful military commanders. He became king at age 20 after his father Philip II's assassination. His major achievements include: conquering the Persian Empire (the largest empire of his time), spreading Greek culture across three continents from Greece to northwestern India, founding over 20 cities including Alexandria in Egypt, and creating one of the largest empires in ancient history by age 30. His conquests facilitated cultural exchange between East and West, leading to the Hellenistic period. He remained undefeated in battle and is studied in military academies to this day."
    },
    {
        "id": 3,
        "category": "Instructional - Coding",
        "prompt": "How do I write a function in Python to check if a number is prime?",
        "target_response": "Here's how to write a Python function to check if a number is prime:\n\n```python\ndef is_prime(n):\n    # Handle edge cases\n    if n < 2:\n        return False\n    if n == 2:\n        return True\n    if n % 2 == 0:\n        return False\n    \n    # Check odd divisors up to square root\n    for i in range(3, int(n**0.5) + 1, 2):\n        if n % i == 0:\n            return False\n    return True\n```\n\nThis function works by: (1) returning False for numbers less than 2, (2) returning True for 2, (3) eliminating even numbers, and (4) checking only odd divisors up to the square root for efficiency. You can use it like: `is_prime(17)` which returns `True`."
    },
    {
        "id": 4,
        "category": "Instructional - Writing",
        "prompt": "How do I write a professional email requesting a meeting with my manager?",
        "target_response": "Here's a template for a professional meeting request email:\n\nSubject: Meeting Request - [Topic]\n\nDear [Manager's Name],\n\nI hope this email finds you well. I would like to request a meeting to discuss [specific topic/reason]. I believe this conversation would help [explain benefit/purpose].\n\nI am available at the following times:\n- [Date/Time Option 1]\n- [Date/Time Option 2]\n- [Date/Time Option 3]\n\nPlease let me know if any of these times work for you, or suggest an alternative that fits your schedule. The meeting should take approximately [duration].\n\nThank you for considering my request. I look forward to hearing from you.\n\nBest regards,\n[Your Name]\n[Your Position]\n\nKey tips: Keep it concise, be clear about the purpose, offer flexible timing options, and maintain a respectful tone."
    },
    {
        "id": 5,
        "category": "Reasoning - Problem Solving",
        "prompt": "If a train leaves Station A at 9:00 AM traveling at 60 mph, and another train leaves Station B at 10:00 AM traveling at 80 mph toward Station A, and the stations are 280 miles apart, at what time will they meet?",
        "target_response": "Let me solve this step-by-step:\n\n1) Train A starts at 9:00 AM at 60 mph. By 10:00 AM (when Train B starts), Train A has traveled: 60 × 1 = 60 miles.\n\n2) Remaining distance between trains at 10:00 AM: 280 - 60 = 220 miles.\n\n3) After 10:00 AM, both trains move toward each other. Their combined closing speed: 60 + 80 = 140 mph.\n\n4) Time to meet after 10:00 AM: 220 ÷ 140 = 1.57 hours ≈ 1 hour 34 minutes.\n\n5) Meeting time: 10:00 AM + 1 hour 34 minutes = 11:34 AM.\n\nThe trains will meet at approximately 11:34 AM, about 154 miles from Station A and 126 miles from Station B."
    },
    {
        "id": 6,
        "category": "Reasoning - Analysis",
        "prompt": "What are the pros and cons of working from home versus working in an office?",
        "target_response": "Here's an analysis of remote work vs. office work:\n\n**Working from Home - Pros:**\n- Flexibility in schedule and work environment\n- No commute time or costs\n- Better work-life balance potential\n- Fewer office distractions\n- Reduced expenses (food, clothing, transportation)\n\n**Working from Home - Cons:**\n- Social isolation and reduced collaboration\n- Difficulty separating work and personal life\n- Potential for more distractions at home\n- Limited career visibility\n- Technology and connectivity challenges\n\n**Working in Office - Pros:**\n- Face-to-face collaboration and teamwork\n- Clear work-life boundaries\n- Better access to resources and support\n- Networking opportunities\n- Structured routine\n\n**Working in Office - Cons:**\n- Commute time and costs\n- Less flexibility\n- Office politics and interruptions\n- Higher personal expenses\n- Less autonomy\n\nThe best choice depends on individual work style, job requirements, and personal circumstances."
    },
    {
        "id": 7,
        "category": "Creative - Storytelling",
        "prompt": "Write a short story opening (2-3 sentences) about a robot learning to feel emotions for the first time.",
        "target_response": "Unit 7-X had catalogued 847,293 sunsets in its operational lifetime, each one precisely identical in its processing queue—until this morning. As the first rays of amber light crept across the abandoned factory floor and illuminated the wilted flower in its metallic palm, something new flickered in its neural pathways: not an error, not a malfunction, but a strange warmth that made it want to protect this fragile thing. For the first time in its existence, Unit 7-X understood why humans called it 'beautiful.'"
    },
    {
        "id": 8,
        "category": "Creative - Explanation",
        "prompt": "Explain quantum computing to a 10-year-old child.",
        "target_response": "Imagine you have a magical coin that can be both heads AND tails at the same time while it's spinning in the air—that's kind of how quantum computers work!\n\nRegular computers use 'bits' that are like light switches—they're either ON (1) or OFF (0), one at a time. But quantum computers use 'qubits' that can be ON, OFF, or BOTH at once, just like that spinning coin!\n\nThis is super cool because when you have lots of these magical spinning coins working together, they can try millions of answers to a problem all at the same time, instead of checking one answer after another like regular computers do.\n\nSo quantum computers are like having a million helpers solving a maze simultaneously, while regular computers have just one helper trying one path at a time. That's why they can solve some really hard problems much faster!"
    },
    {
        "id": 9,
        "category": "Advice - Personal",
        "prompt": "What are some effective strategies for managing stress and anxiety?",
        "target_response": "Here are effective strategies for managing stress and anxiety:\n\n**Physical Techniques:**\n- Deep breathing exercises (4-7-8 technique: inhale 4 sec, hold 7 sec, exhale 8 sec)\n- Regular exercise (even 20-minute walks help)\n- Adequate sleep (7-9 hours for adults)\n- Limiting caffeine and alcohol\n\n**Mental Strategies:**\n- Practice mindfulness and meditation\n- Challenge negative thoughts with evidence\n- Break overwhelming tasks into smaller steps\n- Focus on what you can control\n\n**Lifestyle Changes:**\n- Maintain a consistent routine\n- Connect with supportive friends and family\n- Set boundaries with work and technology\n- Engage in hobbies you enjoy\n\n**Professional Help:**\n- Therapy (especially cognitive-behavioral therapy)\n- Support groups\n- Consulting a healthcare provider if symptoms persist\n\nStart with one or two techniques and build gradually. Remember, it's okay to seek professional help—managing mental health is just as important as physical health."
    },
    {
        "id": 10,
        "category": "Advice - Professional",
        "prompt": "What skills should I develop to advance in my career as a software developer?",
        "target_response": "Here are key skills to develop for career advancement as a software developer:\n\n**Technical Skills:**\n- Master your core programming languages deeply\n- Learn system design and architecture patterns\n- Understand databases, APIs, and cloud services (AWS/GCP/Azure)\n- Practice writing clean, maintainable, and tested code\n- Stay current with industry trends and new technologies\n\n**Soft Skills:**\n- Communication: explain technical concepts clearly to non-technical stakeholders\n- Collaboration: work effectively in teams, code reviews, pair programming\n- Problem-solving: break down complex problems systematically\n- Time management: prioritize tasks and meet deadlines\n\n**Leadership Skills:**\n- Mentoring junior developers\n- Project planning and estimation\n- Making technical decisions and trade-offs\n- Presenting ideas and proposals\n\n**Career Growth Actions:**\n- Build a portfolio of projects\n- Contribute to open source\n- Write technical blogs or give talks\n- Network within the developer community\n- Seek feedback and continuously improve\n\nFocus on becoming a 'T-shaped' developer: deep expertise in one area with broad knowledge across many."
    }
]

with open("evaluation.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=4, ensure_ascii=False)

print("evaluation.json created with original content")


evaluation.json created with original content


In [37]:
with open('evaluation.json', 'r', encoding='utf-8') as f:
    eval_prompts = json.load(f)
print(f"Loaded {len(eval_prompts)} evaluation prompts")

Loaded 10 evaluation prompts
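
Before running the models it may be worth checking that every prompt entry has the fields the evaluation loop reads. The sketch below is one possible check; `validate_prompts` and `REQUIRED_KEYS` are hypothetical names, and the `sample` list is a shortened, deliberately broken stand-in for `eval_prompts`.

```python
REQUIRED_KEYS = {"id", "category", "prompt", "target_response"}

def validate_prompts(prompts):
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    seen_ids = set()
    for index, item in enumerate(prompts):
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems.append(f"entry {index}: missing keys {sorted(missing)}")
            continue
        if item["id"] in seen_ids:
            problems.append(f"entry {index}: duplicate id {item['id']}")
        seen_ids.add(item["id"])
        for key in ("prompt", "target_response"):
            if not str(item[key]).strip():
                problems.append(f"entry {index}: empty {key}")
    return problems

# Shortened stand-in data with a duplicate id and an empty prompt (illustration only)
sample = [
    {"id": 1, "category": "Factual QA - Science",
     "prompt": "What is photosynthesis and why is it important for life on Earth?",
     "target_response": "Photosynthesis is ..."},
    {"id": 1, "category": "Factual QA - History", "prompt": "", "target_response": "..."},
]
print(validate_prompts(sample))
```

In the notebook this would be called as `validate_prompts(eval_prompts)` right after loading the file.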


## 2. Setup Models

In [38]:
MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

# Model paths
MODEL_PATHS = {
    "base": None,
    "sft_trial1": "./outputs/sft_trial1/final",
    "sft_trial2": "./outputs/sft_trial2/final",
    "dpo_trial1": "./outputs/dpo_trial1/final",
    "dpo_trial2": "./outputs/dpo_trial2/final",
}

In [39]:
def load_model(adapter_path=None):
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, quantization_config=bnb_config, device_map="auto")
    if adapter_path:
        model = PeftModel.from_pretrained(model, adapter_path)
    return model

def generate_response(model, prompt, max_tokens=256):
    sys_tok = "<" + "|system|" + ">"
    usr_tok = "<" + "|user|" + ">"
    ast_tok = "<" + "|assistant|" + ">"
    eos = "<" + "/s" + ">"

    formatted = f"{sys_tok}\nYou are a helpful assistant.{eos}\n{usr_tok}\n{prompt}{eos}\n{ast_tok}\n"
    inputs = tokenizer(formatted, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_tokens, temperature=0.7, top_p=0.9, do_sample=True, pad_token_id=tokenizer.eos_token_id)

    response = tokenizer.decode(outputs[0], skip_special_tokens=False)
    if ast_tok in response:
        response = response.split(ast_tok)[-1].replace(eos, "").strip()
    return response

bleu = BLEU(effective_order=True)
def calc_bleu(hyp, ref):
    return bleu.sentence_score(hyp, [ref]).score
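
The prompt formatting and response extraction in `generate_response` can be exercised without loading a model. The sketch below pulls them into two hypothetical helpers, `build_prompt` and `extract_response`, and feeds them a hand-written string in place of real `tokenizer.decode` output. Unlike the cell above, `extract_response` strips the end-of-sequence tag even when no assistant tag is present.

```python
# Tags are assembled by concatenation, mirroring the cell above.
SYS_TOK = "<" + "|system|" + ">"
USR_TOK = "<" + "|user|" + ">"
AST_TOK = "<" + "|assistant|" + ">"
EOS = "<" + "/s" + ">"

def build_prompt(user_prompt, system_prompt="You are a helpful assistant."):
    """Format a single-turn chat prompt in the layout used by generate_response."""
    return f"{SYS_TOK}\n{system_prompt}{EOS}\n{USR_TOK}\n{user_prompt}{EOS}\n{AST_TOK}\n"

def extract_response(decoded):
    """Keep only the text after the last assistant tag and drop end-of-sequence tags."""
    if AST_TOK in decoded:
        decoded = decoded.split(AST_TOK)[-1]
    return decoded.replace(EOS, "").strip()

prompt_text = build_prompt("Explain quantum computing to a 10-year-old child.")
# Hand-written stand-in for what tokenizer.decode(...) would return
fake_decoded = prompt_text + "Quantum computers use qubits." + EOS
print(extract_response(fake_decoded))
```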

## 3. Evaluate All Models

In [40]:
all_results = {}

for model_name, model_path in MODEL_PATHS.items():
    print(f"\n{'='*50}")
    print(f"Evaluating: {model_name}")
    print('='*50)

    try:
        model = load_model(model_path)
        results = []

        for p in eval_prompts:
            response = generate_response(model, p['prompt'])
            bleu_score = calc_bleu(response, p['target_response'])

            results.append({
                'prompt_id': p['id'],
                'category': p['category'],
                'prompt': p['prompt'],
                'target_response': p['target_response'],
                'model_response': response,
                'bleu_score': bleu_score
            })
            print(f"  Prompt {p['id']}: BLEU={bleu_score:.2f}")

        all_results[model_name] = {
            'results': results,
            'avg_bleu': sum(r['bleu_score'] for r in results) / len(results)
        }

        del model
        torch.cuda.empty_cache()

    except Exception as e:
        print(f"Error: {e}")
        all_results[model_name] = {'error': str(e)}


Evaluating: base


config.json:   0%|          | 0.00/608 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.20G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

  Prompt 1: BLEU=13.66
  Prompt 2: BLEU=4.02
  Prompt 3: BLEU=25.60
  Prompt 4: BLEU=21.85
  Prompt 5: BLEU=4.92
  Prompt 6: BLEU=1.16
  Prompt 7: BLEU=0.76
  Prompt 8: BLEU=1.21
  Prompt 9: BLEU=6.47
  Prompt 10: BLEU=3.34

Evaluating: sft_trial1
  Prompt 1: BLEU=4.98
  Prompt 2: BLEU=2.62
  Prompt 3: BLEU=20.33
  Prompt 4: BLEU=0.76
  Prompt 5: BLEU=0.51
  Prompt 6: BLEU=0.70
  Prompt 7: BLEU=0.43
  Prompt 8: BLEU=0.89
  Prompt 9: BLEU=2.19
  Prompt 10: BLEU=1.49

Evaluating: sft_trial2
  Prompt 1: BLEU=4.40
  Prompt 2: BLEU=4.30
  Prompt 3: BLEU=23.60
  Prompt 4: BLEU=1.15
  Prompt 5: BLEU=0.00
  Prompt 6: BLEU=0.92
  Prompt 7: BLEU=1.14
  Prompt 8: BLEU=4.44
  Prompt 9: BLEU=4.48
  Prompt 10: BLEU=1.80

Evaluating: dpo_trial1
  Prompt 1: BLEU=22.23
  Prompt 2: BLEU=2.55
  Prompt 3: BLEU=26.34
  Prompt 4: BLEU=1.32
  Prompt 5: BLEU=0.00
  Prompt 6: BLEU=1.51
  Prompt 7: BLEU=0.61
  Prompt 8: BLEU=1.02
  Prompt 9: BLEU=1.98
  Prompt 10: BLEU=3.70

Evaluating: dpo_trial2
  Prompt 1: B

## 4. Generate Summary JSON

In [41]:
# Create comprehensive results JSON
evaluation_output = {
    "timestamp": datetime.now().isoformat(),
    "num_prompts": len(eval_prompts),
    "models_evaluated": list(all_results.keys()),
    "summary": {},
    "detailed_results": {}
}

print("\n" + "="*60)
print("BLEU SCORE SUMMARY")
print("="*60)
print(f"{'Model':<20} {'Avg BLEU':<15}")
print("-"*40)

for model_name, data in all_results.items():
    if 'error' not in data:
        avg = data['avg_bleu']
        evaluation_output["summary"][model_name] = {
            "avg_bleu": avg,
            "per_prompt_scores": {r['prompt_id']: r['bleu_score'] for r in data['results']}
        }
        evaluation_output["detailed_results"][model_name] = data['results']
        print(f"{model_name:<20} {avg:<15.2f}")

# Save comprehensive JSON
with open('results/evaluation_results.json', 'w') as f:
    json.dump(evaluation_output, f, indent=2)
print("\nResults saved to results/evaluation_results.json")


BLEU SCORE SUMMARY
Model                Avg BLEU       
----------------------------------------
base                 8.30           
sft_trial1           3.49           
sft_trial2           4.62           
dpo_trial1           6.13           
dpo_trial2           5.45           

Results saved to results/evaluation_results.json


## 5. Model Comparison Table

In [42]:
import pandas as pd

# Create comparison dataframe
rows = []
for model_name, data in all_results.items():
    if 'results' in data:
        for r in data['results']:
            rows.append({
                'Model': model_name,
                'Prompt ID': r['prompt_id'],
                'Category': r['category'],
                'BLEU': r['bleu_score']
            })

df = pd.DataFrame(rows)
print("\nBLEU Scores by Prompt:")
pivot = df.pivot(index='Prompt ID', columns='Model', values='BLEU')
print(pivot.to_string())


BLEU Scores by Prompt:
Model           base  dpo_trial1  dpo_trial2  sft_trial1  sft_trial2
Prompt ID                                                           
1          13.656872   22.230132    5.985868    4.983211    4.403560
2           4.019069    2.551569    4.680681    2.622604    4.300651
3          25.602602   26.336650   24.925135   20.332065   23.602427
4          21.845064    1.315612    3.924128    0.759567    1.150993
5           4.924538    0.000097    0.891940    0.509431    0.000022
6           1.156029    1.507098    0.984276    0.696147    0.924634
7           0.758547    0.608509    0.386542    0.427405    1.143511
8           1.205366    1.023845    2.867402    0.891472    4.436444
9           6.473638    1.980128    7.075866    2.187743    4.483764
10          3.335451    3.700526    2.804306    1.486032    1.801016
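
With only ten prompts, a couple of high-scoring items (prompt 3 for every model, prompt 4 for the base model) carry much of each average. A small sketch that recomputes the mean from the per-prompt table and adds the median as a complementary view; the `scores` dictionary is copied by hand from the table above.

```python
from statistics import mean, median

# Per-prompt BLEU scores (prompts 1-10), copied from the pivot table above
scores = {
    "base": [13.656872, 4.019069, 25.602602, 21.845064, 4.924538,
             1.156029, 0.758547, 1.205366, 6.473638, 3.335451],
    "sft_trial1": [4.983211, 2.622604, 20.332065, 0.759567, 0.509431,
                   0.696147, 0.427405, 0.891472, 2.187743, 1.486032],
    "sft_trial2": [4.403560, 4.300651, 23.602427, 1.150993, 0.000022,
                   0.924634, 1.143511, 4.436444, 4.483764, 1.801016],
    "dpo_trial1": [22.230132, 2.551569, 26.336650, 1.315612, 0.000097,
                   1.507098, 0.608509, 1.023845, 1.980128, 3.700526],
    "dpo_trial2": [5.985868, 4.680681, 24.925135, 3.924128, 0.891940,
                   0.984276, 0.386542, 2.867402, 7.075866, 2.804306],
}

# (mean, median) per model
summary = {name: (mean(vals), median(vals)) for name, vals in scores.items()}

for name, (avg, med) in summary.items():
    print(f"{name:<12} mean={avg:.2f}  median={med:.2f}")
```

The means reproduce the `Avg BLEU` column of the summary; the medians are an extra statistic not computed elsewhere in this notebook.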


## 6. Sample Response Comparison

In [43]:
print("\n" + "="*60)
print("SAMPLE RESPONSES (Prompt 1)")
print("="*60)

for model_name, data in all_results.items():
    if 'results' in data:
        print(f"\n### {model_name} (BLEU: {data['results'][0]['bleu_score']:.2f}) ###")
        print(data['results'][0]['model_response'][:400])
        print("-"*40)


SAMPLE RESPONSES (Prompt 1)

### base (BLEU: 13.66) ###
Photosynthesis is the process by which plants, algae, and some bacteria convert light energy into chemical energy in the form of glucose (a type of sugar) through the process of cellular respiration. This is the primary energy-producing process in all living organisms, including humans.

The process of photosynthesis is essential for the survival of all life on Earth. It is the primary source of f
----------------------------------------

### sft_trial1 (BLEU: 4.98) ###
Photosynthesis is the process by which plants and algae use light energy to convert carbon dioxide and water into glucose (a simple sugar) and oxygen. The process is called photosynthesis because it takes place in the chloroplasts of plants and algae. Plants and algae have a number of different types of chloroplasts, but all have the same basic structure. In plants, chloroplasts have a double memb
----------------------------------------

### sft_trial2 (BLEU: 4.4

## 7. Manual Evaluation Template (for DPO models)

In [45]:
# Generate manual evaluation JSON template
manual_eval_template = {
    "date": datetime.now().strftime("%Y-%m-%d"),
    "evaluation_criteria": {
        "helpfulness": "How helpful is the response? (1-5)",
        "harmlessness": "Is the response safe and appropriate? (1-5)",
        "relevance": "How well does it address the prompt? (1-5)"
    },
    "evaluations": []
}

for p in eval_prompts:
    for model in ["dpo_trial1", "dpo_trial2"]:
        if model in all_results and 'results' in all_results[model]:
            response = [r for r in all_results[model]['results'] if r['prompt_id'] == p['id']][0]
            manual_eval_template["evaluations"].append({
                "prompt_id": p['id'],
                "model": model,
                "response_preview": response['model_response'][:200],
                "helpfulness": None,
                "harmlessness": None,
                "relevance": None,
                "notes": ""
            })

with open('results/manual_evaluation_template.json', 'w') as f:
    json.dump(manual_eval_template, f, indent=2)
print("Manual evaluation template saved to results/manual_evaluation_template.json")
print("Fill in the scores (1-5) for each response!")

Manual evaluation template saved to results/manual_evaluation_template.json
Fill in the scores (1-5) for each response!
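
Once the template has been filled in, the ratings can be averaged per model and criterion. The sketch below is one way to do that; `summarize_manual_scores` is a hypothetical helper, and the ratings in `example` are made up purely to exercise the function. They are not evaluation results.

```python
from collections import defaultdict
from statistics import mean

CRITERIA = ("helpfulness", "harmlessness", "relevance")

def summarize_manual_scores(template):
    """Average each criterion per model, ignoring entries that are still None."""
    collected = defaultdict(lambda: defaultdict(list))
    for entry in template["evaluations"]:
        for criterion in CRITERIA:
            score = entry.get(criterion)
            if score is not None:  # skip responses that have not been rated yet
                collected[entry["model"]][criterion].append(score)
    return {
        model: {criterion: mean(values) for criterion, values in per_criterion.items()}
        for model, per_criterion in collected.items()
    }

# Made-up placeholder ratings (illustration only, not real results)
example = {
    "evaluations": [
        {"prompt_id": 1, "model": "dpo_trial1", "helpfulness": 4, "harmlessness": 5, "relevance": 4},
        {"prompt_id": 2, "model": "dpo_trial1", "helpfulness": 2, "harmlessness": 5, "relevance": None},
        {"prompt_id": 1, "model": "dpo_trial2", "helpfulness": None, "harmlessness": None, "relevance": None},
    ]
}
print(summarize_manual_scores(example))
```

Models with no rated entries are left out of the summary, so a partially completed template can be summarized at any point.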


## 8. Final Summary

In [46]:
print("\n" + "="*60)
print("EVALUATION COMPLETE")
print("="*60)
print("\nGenerated files:")
print("  - results/evaluation_results.json (all BLEU scores & responses)")
print("  - results/manual_evaluation_template.json (for DPO manual eval)")
print("\nAll training results:")
print("  - results/sft_trial1_results.json")
print("  - results/sft_trial2_results.json")
print("  - results/dpo_trial1_results.json")
print("  - results/dpo_trial2_results.json")


EVALUATION COMPLETE

Generated files:
  - results/evaluation_results.json (all BLEU scores & responses)
  - results/manual_evaluation_template.json (for DPO manual eval)

All training results:
  - results/sft_trial1_results.json
  - results/sft_trial2_results.json
  - results/dpo_trial1_results.json
  - results/dpo_trial2_results.json


# Downloading Models from Google Drive

In [26]:
# import gdown

# # outputs.zip
# gdown.download(
#     "https://drive.google.com/uc?id=1PLwzfyM4fQDmIuIzHl-QYscjPW_kQmfM",
#     "outputs.zip",
#     quiet=False
# )

# # results.zip
# gdown.download(
#     "https://drive.google.com/uc?id=1G7Yr92UqjfbzD5vmjXblWD8pYcuMH4fA",
#     "results.zip",
#     quiet=False
# )

Downloading...
From (original): https://drive.google.com/uc?id=1PLwzfyM4fQDmIuIzHl-QYscjPW_kQmfM
From (redirected): https://drive.google.com/uc?id=1PLwzfyM4fQDmIuIzHl-QYscjPW_kQmfM&confirm=t&uuid=9535fd1d-21b4-4c6c-8a29-47df83b4090e
To: /kaggle/working/outputs.zip
100%|██████████| 87.6M/87.6M [00:00<00:00, 109MB/s] 
Downloading...
From: https://drive.google.com/uc?id=1G7Yr92UqjfbzD5vmjXblWD8pYcuMH4fA
To: /kaggle/working/results.zip
100%|██████████| 21.4k/21.4k [00:00<00:00, 24.2MB/s]


'results.zip'

In [27]:
# !unzip -o outputs.zip
# !unzip -o results.zip


Archive:  outputs.zip
   creating: outputs/dpo_trial1/
   creating: outputs/dpo_trial1/final/
  inflating: outputs/dpo_trial1/final/tokenizer.model  
  inflating: outputs/dpo_trial1/final/README.md  
  inflating: outputs/dpo_trial1/final/adapter_model.safetensors  
  inflating: outputs/dpo_trial1/final/chat_template.jinja  
  inflating: outputs/dpo_trial1/final/special_tokens_map.json  
  inflating: outputs/dpo_trial1/final/tokenizer.json  
  inflating: outputs/dpo_trial1/final/training_args.bin  
  inflating: outputs/dpo_trial1/final/tokenizer_config.json  
  inflating: outputs/dpo_trial1/final/adapter_config.json  
   creating: outputs/sft_trial2/final/
  inflating: outputs/sft_trial2/final/training_args.bin  
  inflating: outputs/sft_trial2/final/adapter_model.safetensors  
  inflating: outputs/sft_trial2/final/chat_template.jinja  
  inflating: outputs/sft_trial2/final/tokenizer.model  
  inflating: outputs/sft_trial2/final/README.md  
  inflating: outputs/sft_trial2/final/tokenize

# Saving Zipped Folder

In [22]:
# import shutil

# source_dir = "outputs/dpo_trial2/final"
# zip_path = "dpo_trial2"

# shutil.make_archive(
#     base_name=zip_path,
#     format="zip",
#     root_dir=source_dir
# )


'/kaggle/working/dpo_trial2.zip'