# Singlish Model Comparison: 1.7B vs 4B vs 8B

This notebook trains and evaluates three model sizes to determine the best base for Singlish fine-tuning.

## Workflow
1. **Setup** - Install dependencies, prepare dataset
2. **Train** - Train all 3 models with identical config
3. **Evaluate** - Run evaluation suite
4. **Compare** - Generate comparison report

---
## Part 1: Setup

In [5]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    import torch; v = re.match(r"[0-9]{1,}\.[0-9]{1,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.33.post1" if v=="2.9" else "0.0.32.post2" if v=="2.8" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.56.2
!pip install --no-deps trl==0.22.2
!pip install sentence-transformers  # For semantic similarity

In [6]:
# Core imports
import torch
import time
import json
import pandas as pd
import numpy as np
from datasets import load_dataset, Dataset
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template, train_on_responses_only
from trl import SFTTrainer, SFTConfig
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import re

In [7]:
# Model configurations to compare
MODELS = {
    "1.7B": "Qwen/Qwen3-1.7B",
    "4B-Instruct": "unsloth/Qwen3-4B-Instruct-2507",
    # "4B": "Qwen/Qwen3-4B",
    "8B": "Qwen/Qwen3-8B",
}

# Shared training config (keep identical for fair comparison)
TRAINING_CONFIG = {
    "r": 32,
    "lora_alpha": 32,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    "max_seq_length": 2048,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "max_steps": 40,
    "learning_rate": 2e-4,
    "warmup_steps": 5,
}

print("Models to compare:", list(MODELS.keys()))
print("Training config:", TRAINING_CONFIG)

from huggingface_hub import model_info

def check_model_exists(model_name):
    try:
        model_info(model_name)
        print(f"✓ {model_name} exists")
        return True
    except Exception:
        print(f"✗ {model_name} NOT FOUND")
        return False

for name, path in MODELS.items():
    check_model_exists(path)

Models to compare: ['1.7B', '4B-Instruct', '8B']
Training config: {'r': 32, 'lora_alpha': 32, 'target_modules': ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'], 'max_seq_length': 2048, 'per_device_train_batch_size': 2, 'gradient_accumulation_steps': 4, 'max_steps': 40, 'learning_rate': 0.0002, 'warmup_steps': 5}
✓ Qwen/Qwen3-1.7B exists
✓ unsloth/Qwen3-4B-Instruct-2507 exists
✓ Qwen/Qwen3-8B exists
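
The printed config fixes the number of optimizer steps rather than the number of epochs, so it helps to work out how many passes over the data that implies. The sketch below does the arithmetic with the logged values; the 180-example train size comes from the dataset split used in this notebook, and the single-GPU setting is an assumption.

```python
# Values copied from the logged training config and the 180-example train split.
per_device_batch = 2
grad_accum_steps = 4
max_steps = 40
num_train_examples = 180
num_gpus = 1  # assumption: single-GPU run

# One optimizer step consumes this many examples.
effective_batch = per_device_batch * grad_accum_steps * num_gpus

# Total examples processed over the whole run, and the implied number of epochs.
examples_seen = effective_batch * max_steps
epochs = examples_seen / num_train_examples

print(f"Effective batch size: {effective_batch}")
print(f"Examples seen: {examples_seen}")
print(f"Approximate epochs: {epochs:.2f}")
```

With so few passes over a small dataset, differences in loss between model sizes may reflect the starting checkpoint as much as the fine-tuning itself.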


### Prepare Dataset

In [8]:
# Load your Singlish dataset
dataset = load_dataset("csv", data_files="singlish_200_pairs(Mixed).csv", split="train")

# Convert to standard format
def convert_to_standard_format(example):
    return {
        "conversations": [
            {"role": "user", "content": example["Instruction"]},
            {"role": "assistant", "content": example["Output"]}
        ]
    }

dataset = dataset.map(convert_to_standard_format)

# Split into train (90%) and test (10%) for evaluation
dataset_split = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = dataset_split["train"]
test_dataset = dataset_split["test"]

print(f"Train size: {len(train_dataset)}")
print(f"Test size: {len(test_dataset)}")

Train size: 180
Test size: 20


In [9]:
# Prepare test prompts and references for evaluation
test_prompts = [ex["conversations"][0]["content"] for ex in test_dataset]
test_references = [ex["conversations"][1]["content"] for ex in test_dataset]

print(f"Test prompts prepared: {len(test_prompts)}")
print(f"Example prompt: {test_prompts[0]}")
print(f"Example reference: {test_references[0]}")

Test prompts prepared: 20
Example prompt: Family keep asking when get married how?
Example reference: Aiyo this one classic. Just tell them when ready lor. Don't rush because of pressure, your life your choice.


---
## Part 2: Training Function

Define a reusable function to train any model size with identical settings.

In [10]:
def train_model(model_name, model_path, train_dataset, config):
    """
    Train a model with QLoRA and return training stats.

    Args:
        model_name: Label for this model (e.g., "1.7B")
        model_path: HuggingFace model path
        train_dataset: Training dataset
        config: Training configuration dict

    Returns:
        dict with training metrics
    """
    print(f"Training {model_name}: {model_path}")

    # Track GPU memory before
    torch.cuda.reset_peak_memory_stats()
    start_time = time.time()

    # Load model
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_path,
        max_seq_length=config["max_seq_length"],
        load_in_4bit=True,
        load_in_8bit=False,
        full_finetuning=False,
    )

    # Add LoRA
    model = FastLanguageModel.get_peft_model(
        model,
        r=config["r"],
        target_modules=config["target_modules"],
        lora_alpha=config["lora_alpha"],
        lora_dropout=0,
        bias="none",
        use_gradient_checkpointing="unsloth",
        random_state=3407,
    )

    # Setup tokenizer
    tokenizer = get_chat_template(tokenizer, chat_template="qwen3-instruct")

    # Format dataset
    def formatting_prompts_func(examples):
        convos = examples["conversations"]
        texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False) for convo in convos]
        return {"text": texts}

    formatted_dataset = train_dataset.map(formatting_prompts_func, batched=True)

    # Trainer
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=formatted_dataset,
        args=SFTConfig(
            dataset_text_field="text",
            per_device_train_batch_size=config["per_device_train_batch_size"],
            gradient_accumulation_steps=config["gradient_accumulation_steps"],
            warmup_steps=config["warmup_steps"],
            max_steps=config["max_steps"],
            learning_rate=config["learning_rate"],
            logging_steps=10,
            optim="adamw_8bit",
            weight_decay=0.001,
            lr_scheduler_type="linear",
            seed=3407,
            report_to="none",
            output_dir=f"outputs_{model_name}",
        ),
    )

    trainer = train_on_responses_only(
        trainer,
        instruction_part="<|im_start|>user\n",
        response_part="<|im_start|>assistant\n",
    )

    # Train
    trainer_stats = trainer.train()

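    # Collect metrics
    # The timer started before model loading, so this figure includes load time
    training_time = time.time() - start_time
    peak_memory = torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024  # GiB
    # "train_loss" is the mean loss over the whole run, not the last logged step
    final_loss = trainer_stats.metrics.get("train_loss", trainer.state.log_history[-1].get("loss", None))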

    # Save adapter
    save_path = f"singlish_adapter_{model_name}"
    model.save_pretrained(save_path)
    tokenizer.save_pretrained(save_path)

    results = {
        "model_name": model_name,
        "training_time_min": round(training_time / 60, 2),
        "peak_vram_gb": round(peak_memory, 2),
        "final_loss": round(final_loss, 4) if final_loss is not None else None,
        "adapter_path": save_path,
    }

    print(f"\n{model_name} Training Complete:")
    print(f"  Time: {results['training_time_min']} min")
    print(f"  Peak VRAM: {results['peak_vram_gb']} GB")
    print(f"  Final Loss: {results['final_loss']}")
    print(f"  Saved to: {save_path}")

    # Cleanup to free VRAM
    del model, trainer
    torch.cuda.empty_cache()

    return results
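
The `train_on_responses_only` call restricts the loss to assistant turns, using the two marker strings passed to it. The toy sketch below illustrates the idea on a list of string tokens; it is not Unsloth's implementation, which works on token IDs, and the helper name `mask_non_response` is hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

USER_MARK = "<|im_start|>user\n"
ASSISTANT_MARK = "<|im_start|>assistant\n"

def mask_non_response(tokens, instruction_part, response_part):
    """Keep labels only for tokens that follow the response marker.

    Hypothetical helper for illustration: tokens are plain strings here.
    """
    labels = []
    in_response = False
    for tok in tokens:
        if tok == response_part:
            in_response = True
            labels.append(IGNORE_INDEX)  # the marker itself is not a target
        elif tok == instruction_part:
            in_response = False
            labels.append(IGNORE_INDEX)
        else:
            labels.append(tok if in_response else IGNORE_INDEX)
    return labels

tokens = [
    USER_MARK, "Why", "we", "get", "hiccups?", "<|im_end|>",
    ASSISTANT_MARK, "Diaphragm", "contract", "mah.", "<|im_end|>",
]
labels = mask_non_response(tokens, USER_MARK, ASSISTANT_MARK)
print(labels)
```

Only the assistant's words and its closing `<|im_end|>` contribute to the loss, so the model is not trained to reproduce user prompts.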

### Train All 3 Models

In [11]:
# Store training results
training_results = {}

for model_name, model_path in MODELS.items():
    results = train_model(model_name, model_path, train_dataset, TRAINING_CONFIG)
    training_results[model_name] = results

    # Save intermediate results
    with open("training_results.json", "w") as f:
        json.dump(training_results, f, indent=2)

pd.DataFrame(training_results).T

Training 1.7B: Qwen/Qwen3-1.7B
==((====))==  Unsloth 2025.11.4: Fast Qwen3 patching. Transformers: 4.56.2.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



Unsloth 2025.11.4 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 180 | Num Epochs = 2 | Total steps = 40
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 34,865,152 of 1,755,440,128 (1.99% trained)


| Step | Training Loss |
|------|---------------|
| 10 | 5.7152 |
| 20 | 3.4579 |
| 30 | 2.9531 |
| 40 | 2.5234 |



1.7B Training Complete:
  Time: 2.02 min
  Peak VRAM: 1.8 GB
  Final Loss: 3.6624
  Saved to: singlish_adapter_1.7B
Training 4B-Instruct: unsloth/Qwen3-4B-Instruct-2507

Unsloth 2025.11.4 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 180 | Num Epochs = 2 | Total steps = 40
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 66,060,288 of 4,088,528,384 (1.62% trained)


| Step | Training Loss |
|------|---------------|
| 10 | 4.6116 |
| 20 | 2.7489 |
| 30 | 2.2621 |
| 40 | 1.8145 |



4B-Instruct Training Complete:
  Time: 2.22 min
  Peak VRAM: 5.4 GB
  Final Loss: 2.8593
  Saved to: singlish_adapter_4B-Instruct
Training 8B: Qwen/Qwen3-8B

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 180 | Num Epochs = 2 | Total steps = 40
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 87,293,952 of 8,278,029,312 (1.05% trained)


| Step | Training Loss |
|------|---------------|
| 10 | 4.7793 |
| 20 | 2.7632 |
| 30 | 2.1878 |
| 40 | 1.8299 |



8B Training Complete:
  Time: 2.69 min
  Peak VRAM: 11.66 GB
  Final Loss: 2.8901
  Saved to: singlish_adapter_8B


| | model_name | training_time_min | peak_vram_gb | final_loss | adapter_path |
|---|---|---|---|---|---|
| 1.7B | 1.7B | 2.02 | 1.8 | 3.6624 | singlish_adapter_1.7B |
| 4B-Instruct | 4B-Instruct | 2.22 | 5.4 | 2.8593 | singlish_adapter_4B-Instruct |
| 8B | 8B | 2.69 | 11.66 | 2.8901 | singlish_adapter_8B |
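
The trainable parameter counts in the training logs above follow directly from the LoRA rank and the shapes of the seven adapted projections: an adapter on a `d_in x d_out` matrix adds `r * (d_in + d_out)` parameters. The sketch below reproduces the logged counts. The architecture dimensions are assumptions taken from the published Qwen3 model configs, not from this notebook, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(hidden, q_out, kv_out, intermediate, num_layers, r):
    """Count LoRA parameters when all seven projection matrices are adapted."""
    # Each adapted (d_in x d_out) matrix gains A (d_in x r) and B (r x d_out).
    shapes = [
        (hidden, q_out),         # q_proj
        (hidden, kv_out),        # k_proj
        (hidden, kv_out),        # v_proj
        (q_out, hidden),         # o_proj
        (hidden, intermediate),  # gate_proj
        (hidden, intermediate),  # up_proj
        (intermediate, hidden),  # down_proj
    ]
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * num_layers

# Assumed architecture dimensions (not stated in this notebook).
ASSUMED_DIMS = {
    "1.7B": dict(hidden=2048, q_out=2048, kv_out=1024, intermediate=6144, num_layers=28),
    "4B-Instruct": dict(hidden=2560, q_out=4096, kv_out=1024, intermediate=9728, num_layers=36),
    "8B": dict(hidden=4096, q_out=4096, kv_out=1024, intermediate=12288, num_layers=36),
}

counts = {name: lora_param_count(r=32, **dims) for name, dims in ASSUMED_DIMS.items()}
for name, n in counts.items():
    print(f"{name}: {n:,} trainable LoRA parameters")
```

Because the adapter size grows with layer width and depth, the larger models train more parameters in absolute terms even though the trainable fraction shrinks.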


---
## Part 3: Evaluation Functions

In [12]:
def load_trained_model(model_name, model_path, adapter_path):
    """
    Load a trained model with its adapter for evaluation.
    """
    from peft import PeftModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_path,
        max_seq_length=2048,
        load_in_4bit=True,
    )

    model = PeftModel.from_pretrained(model, adapter_path)
    tokenizer = get_chat_template(tokenizer, chat_template="qwen3-instruct")

    # Enable inference mode
    FastLanguageModel.for_inference(model)

    return model, tokenizer

In [13]:
def generate_response(model, tokenizer, prompt, max_new_tokens=128):
    """
    Generate a response for a given prompt.
    """
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.pad_token_id,
        )

    response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
    return response.strip()

### Evaluation 1: Perplexity

In [14]:
def calculate_perplexity(model, tokenizer, test_texts):
    """
    Calculate perplexity on test set.
    Lower = better fit to Singlish patterns.
    """
    model.eval()
    total_loss = 0
    total_tokens = 0

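    for text in test_texts:
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(model.device)

        # The returned loss is averaged over the shifted targets (seq_len - 1 tokens)
        n_targets = inputs["input_ids"].size(1) - 1
        if n_targets < 1:
            continue  # a single-token text has nothing to predict

        with torch.no_grad():
            outputs = model(**inputs, labels=inputs["input_ids"])
            total_loss += outputs.loss.item() * n_targets
            total_tokens += n_targets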

    avg_loss = total_loss / total_tokens
    perplexity = torch.exp(torch.tensor(avg_loss)).item()

    return round(perplexity, 2)
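
Perplexity is the exponential of the mean negative log-likelihood per token, so texts of different lengths should be weighted by their token counts before exponentiating. The sketch below shows that aggregation in isolation with made-up numbers; `corpus_perplexity` and the sample values are hypothetical and not taken from the evaluation run.

```python
import math

def corpus_perplexity(per_text_mean_loss, per_text_target_tokens):
    """Token-weighted perplexity from per-text mean losses (hypothetical helper)."""
    total_nll = sum(loss * n for loss, n in zip(per_text_mean_loss, per_text_target_tokens))
    total_tokens = sum(per_text_target_tokens)
    return math.exp(total_nll / total_tokens)

# Made-up values: a long, easy text and a short, hard one.
losses = [2.0, 4.0]
tokens = [30, 10]

ppl = corpus_perplexity(losses, tokens)

# Averaging the per-text losses without weights overstates the short text.
naive_ppl = math.exp(sum(losses) / len(losses))

print(f"Token-weighted perplexity: {ppl:.2f}")
print(f"Unweighted perplexity:     {naive_ppl:.2f}")
```

The two figures differ whenever text lengths vary, which is why the evaluation accumulates loss and token counts separately before dividing.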

### Evaluation 2: Semantic Similarity

In [15]:
# Load sentence embedding model (once)
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

def calculate_semantic_similarity(generated_responses, reference_responses):
    """
    Calculate average cosine similarity between generated and reference responses.
    Higher = better meaning preservation.
    """
    gen_embeddings = embedding_model.encode(generated_responses)
    ref_embeddings = embedding_model.encode(reference_responses)

    similarities = []
    for gen_emb, ref_emb in zip(gen_embeddings, ref_embeddings):
        sim = cosine_similarity([gen_emb], [ref_emb])[0][0]
        similarities.append(sim)

    # Cast to a Python float so rounding is not undone by float32 representation
    return round(float(np.mean(similarities)), 4)
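
Cosine similarity compares the direction of two embedding vectors and ignores their length. The sketch below computes it by hand on toy two-dimensional vectors; the vectors are made up for illustration, whereas real sentence embeddings from `all-MiniLM-L6-v2` have 384 dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings": one generated/reference pair per row.
generated = [[1.0, 0.0], [1.0, 1.0]]
reference = [[1.0, 0.0], [0.0, 1.0]]

sims = [cosine(g, r) for g, r in zip(generated, reference)]
mean_sim = sum(sims) / len(sims)

print(f"Pairwise similarities: {[round(s, 4) for s in sims]}")
print(f"Mean similarity: {mean_sim:.4f}")
```

Note that the score measures closeness in meaning as the embedding model sees it; an English-trained embedder may under-rate faithful Singlish paraphrases.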


### Evaluation 3: Singlish Feature Detection

In [16]:
def singlish_feature_score(text):
    """
    Count Singlish linguistic features in text.
    Higher = more Singlish features present.
    """
    text_lower = text.lower()
    scores = {}

    # Particles (sentence-ending)
    particles = ["lah", "lor", "leh", "meh", "sia", "hor", "ah", "mah", "liao", "nia"]
    scores["particles"] = sum(1 for p in particles if re.search(rf"\b{p}\b", text_lower))

    # Singlish vocabulary (substring match, so "sian" also matches inside "asian")
    vocab = ["shiok", "chope", "kiasu", "paiseh", "sian", "atas", "lepak",
             "bojio", "jialat", "walao", "alamak", "aiyo", "kanchiong",
             "sibei", "damn", "machiam", "chio", "siao", "blur", "kaypoh",
             "bodoh", "gabra", "agak", "steady", "swee"]
    scores["vocabulary"] = sum(1 for v in vocab if v in text_lower)

    # Grammar patterns
    scores["got_existential"] = len(re.findall(r"\bgot\s+\w+", text_lower))  # "got time", "got people"
    scores["one_suffix"] = len(re.findall(r"\w+\s+one\b", text_lower))  # "this one", "like that one"
    scores["can_cannot"] = len(re.findall(r"\b(can|cannot)\b", text_lower))
    scores["already_pattern"] = len(re.findall(r"\balready\b|\bliao\b", text_lower))

    # Discourse markers
    discourse = ["eh", "wah", "aiyah", "haiya", "oi"]
    scores["discourse"] = sum(1 for d in discourse if re.search(rf"\b{d}\b", text_lower))

    total = sum(scores.values())
    return {"breakdown": scores, "total": total}


def average_singlish_score(responses):
    """
    Calculate average Singlish feature score across all responses.
    """
    scores = [singlish_feature_score(r)["total"] for r in responses]
    return round(np.mean(scores), 2)
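
The scorer matches particles and discourse markers on word boundaries but matches vocabulary as plain substrings, which can count ordinary English words as Singlish. The sketch below contrasts the two strategies on a made-up sentence; `count_terms` is a hypothetical helper, not part of the notebook.

```python
import re

def count_terms(text, terms, whole_word=True):
    """Count how many terms occur in text, optionally requiring word boundaries."""
    text_lower = text.lower()
    if whole_word:
        return sum(
            1 for t in terms
            if re.search(rf"\b{re.escape(t)}\b", text_lower)
        )
    return sum(1 for t in terms if t in text_lower)

terms = ["sian", "blur"]

# Plain English sentence with no Singlish in it.
english = "The Asian food stall was blurry in the photo."
substring_hits = count_terms(english, terms, whole_word=False)
whole_word_hits = count_terms(english, terms, whole_word=True)

# Genuine usage is still found with word boundaries.
singlish_hits = count_terms("Wah this one damn sian lah", ["sian"])

print(f"Substring matches on English text:  {substring_hits}")
print(f"Whole-word matches on English text: {whole_word_hits}")
print(f"Whole-word matches on Singlish text: {singlish_hits}")
```

Switching the vocabulary check to whole-word matching would change the recorded scores, so any comparison should use one strategy consistently across models.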

### Evaluation 4: Inference Metrics

In [17]:
def measure_inference_speed(model, tokenizer, prompts, max_new_tokens=100, num_runs=5):
    """
    Measure inference speed metrics.
    """
    times = []
    token_counts = []

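    # Warmup
    _ = generate_response(model, tokenizer, prompts[0], max_new_tokens=20)

    # Reset the peak-memory counter so the reading below reflects inference only,
    # not an earlier training run in the same process
    torch.cuda.reset_peak_memory_stats()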

    # Actual measurement
    test_prompts = prompts[:num_runs]

    for prompt in test_prompts:
        torch.cuda.synchronize()
        start = time.perf_counter()

        response = generate_response(model, tokenizer, prompt, max_new_tokens=max_new_tokens)

        torch.cuda.synchronize()
        end = time.perf_counter()

        times.append(end - start)
        token_counts.append(len(tokenizer.encode(response)))

    avg_latency = np.mean(times)
    avg_tokens = np.mean(token_counts)
    tokens_per_sec = avg_tokens / avg_latency
    vram = torch.cuda.max_memory_allocated() / 1e9

    return {
        "avg_latency_sec": round(avg_latency, 3),
        "tokens_per_sec": round(tokens_per_sec, 1),
        "inference_vram_gb": round(vram, 2),
    }

### Master Evaluation Function

In [18]:
def evaluate_model(model_name, model_path, adapter_path, test_prompts, test_references):
    """
    Run full evaluation suite on a trained model.
    """
    print(f"\n{'='*60}")
    print(f"Evaluating {model_name}")
    print(f"{'='*60}")

    # Load model
    print("Loading model...")
    model, tokenizer = load_trained_model(model_name, model_path, adapter_path)

    # Generate responses for all test prompts
    print(f"Generating responses for {len(test_prompts)} prompts...")
    generated_responses = []
    for i, prompt in enumerate(test_prompts):
        response = generate_response(model, tokenizer, prompt)
        generated_responses.append(response)
        if (i + 1) % 10 == 0:
            print(f"  Generated {i + 1}/{len(test_prompts)}")

    # Run evaluations
    print("\nRunning evaluations...")

    # 1. Perplexity
    print("  - Calculating perplexity...")
    perplexity = calculate_perplexity(model, tokenizer, test_references)

    # 2. Semantic similarity
    print("  - Calculating semantic similarity...")
    semantic_sim = calculate_semantic_similarity(generated_responses, test_references)

    # 3. Singlish features
    print("  - Calculating Singlish feature score...")
    singlish_score = average_singlish_score(generated_responses)

    # 4. Inference speed
    print("  - Measuring inference speed...")
    inference_metrics = measure_inference_speed(model, tokenizer, test_prompts)

    # Compile results
    results = {
        "model_name": model_name,
        "perplexity": perplexity,
        "semantic_similarity": semantic_sim,
        "singlish_feature_score": singlish_score,
        **inference_metrics,
        "generated_responses": generated_responses,  # Save for inspection
    }

    print(f"\n{model_name} Evaluation Results:")
    print(f"  Perplexity: {perplexity} (lower = better)")
    print(f"  Semantic Similarity: {semantic_sim} (higher = better)")
    print(f"  Singlish Score: {singlish_score} (higher = more Singlish)")
    print(f"  Latency: {inference_metrics['avg_latency_sec']}s")
    print(f"  Tokens/sec: {inference_metrics['tokens_per_sec']}")

    # Cleanup
    del model
    torch.cuda.empty_cache()

    return results

### Run Evaluation on All 3 Models

In [19]:
# Run evaluations
evaluation_results = {}

for model_name, model_path in MODELS.items():
    adapter_path = f"singlish_adapter_{model_name}"

    results = evaluate_model(
        model_name=model_name,
        model_path=model_path,
        adapter_path=adapter_path,
        test_prompts=test_prompts,
        test_references=test_references,
    )

    evaluation_results[model_name] = results

print("\n" + "="*60)
print("ALL EVALUATIONS COMPLETE")
print("="*60)


Evaluating 1.7B
Loading model...
Generating responses for 20 prompts...
  Generated 10/20
  Generated 20/20

Running evaluations...
  - Calculating perplexity...
  - Calculating semantic similarity...
  - Calculating Singlish feature score...
  - Measuring inference speed...

1.7B Evaluation Results:
  Perplexity: 88.14 (lower = better)
  Semantic Similarity: 0.49810001254081726 (higher = better)
  Singlish Score: 1.15 (higher = more Singlish)
  Latency: 1.965s
  Tokens/sec: 14.3

Evaluating 4B-Instruct
Loading model...
==((

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Generating responses for 20 prompts...
  Generated 10/20
  Generated 20/20

Running evaluations...
  - Calculating perplexity...
  - Calculating semantic similarity...
  - Calculating Singlish feature score...
  - Measuring inference speed...

8B Evaluation Results:
  Perplexity: 42.05 (lower = better)
  Semantic Similarity: 0.6187000274658203 (higher = better)
  Singlish Score: 1.4 (higher = more Singlish)
  Latency: 2.358s
  Tokens/sec: 11.0

ALL EVALUATIONS COMPLETE


---
## Part 4: Comparison & Results

In [20]:
# Combine training and evaluation results
comparison_data = []

for model_name in MODELS.keys():
    row = {
        "Model": model_name,
        # Training metrics
        "Train Time (min)": training_results[model_name]["training_time_min"],
        "Train VRAM (GB)": training_results[model_name]["peak_vram_gb"],
        "Final Loss": training_results[model_name]["final_loss"],
        # Evaluation metrics
        "Perplexity ↓": evaluation_results[model_name]["perplexity"],
        "Semantic Sim ↑": evaluation_results[model_name]["semantic_similarity"],
        "Singlish Score ↑": evaluation_results[model_name]["singlish_feature_score"],
        "Latency (s) ↓": evaluation_results[model_name]["avg_latency_sec"],
        "Tokens/sec ↑": evaluation_results[model_name]["tokens_per_sec"],
        "Infer VRAM (GB)": evaluation_results[model_name]["inference_vram_gb"],
    }
    comparison_data.append(row)

comparison_df = pd.DataFrame(comparison_data)
comparison_df.set_index("Model", inplace=True)

print("\n" + "="*80)
print("FINAL COMPARISON TABLE")
print("="*80)
print(comparison_df.to_string())


FINAL COMPARISON TABLE
             Train Time (min)  Train VRAM (GB)  Final Loss  Perplexity ↓  Semantic Sim ↑  Singlish Score ↑  Latency (s) ↓  Tokens/sec ↑  Infer VRAM (GB)
Model                                                                                                                                                   
1.7B                     2.02             1.80      3.6624         88.14          0.4981              1.15          1.965          14.3            12.36
4B-Instruct              2.22             5.40      2.8593         40.76          0.5754              1.30          1.841          11.0            12.36
8B                       2.69            11.66      2.8901         42.05          0.6187              1.40          2.358          11.0            14.51


In [21]:
# Determine winners per metric
print("WINNERS BY METRIC")

metrics_lower_better = ["Perplexity ↓", "Latency (s) ↓", "Final Loss", "Train Time (min)", "Train VRAM (GB)", "Infer VRAM (GB)"]
metrics_higher_better = ["Semantic Sim ↑", "Singlish Score ↑", "Tokens/sec ↑"]

winners = {}

for metric in metrics_lower_better:
    if metric in comparison_df.columns:
        winner = comparison_df[metric].idxmin()
        winners[metric] = winner
        print(f"  {metric}: {winner} ({comparison_df.loc[winner, metric]})")

for metric in metrics_higher_better:
    if metric in comparison_df.columns:
        winner = comparison_df[metric].idxmax()
        winners[metric] = winner
        print(f"  {metric}: {winner} ({comparison_df.loc[winner, metric]})")

WINNERS BY METRIC
  Perplexity ↓: 4B-Instruct (40.76)
  Latency (s) ↓: 4B-Instruct (1.841)
  Final Loss: 4B-Instruct (2.8593)
  Train Time (min): 1.7B (2.02)
  Train VRAM (GB): 1.7B (1.8)
  Infer VRAM (GB): 1.7B (12.36)
  Semantic Sim ↑: 8B (0.6187000274658203)
  Singlish Score ↑: 8B (1.4)
  Tokens/sec ↑: 1.7B (14.3)


In [23]:
# Show the first 10 test cases
for i in range(min(10, len(test_prompts))):
    print(f"\n--- Test Case {i+1} ---")
    print(f"Prompt: {test_prompts[i]}")
    print(f"Reference: {test_references[i]}")
    print()
    for model_name in MODELS.keys():
        response = evaluation_results[model_name]["generated_responses"][i]
        score = singlish_feature_score(response)["total"]
        print(f"{model_name}: {response}")
        print(f"       [Singlish features: {score}]")
    print()


--- Test Case 1 ---
Prompt: Family keep asking when get married how?
Reference: Aiyo this one classic. Just tell them when ready lor. Don't rush because of pressure, your life your choice.

1.7B: Ask them for advice lah. At least know their plans before, don't rush. If already decided then just follow.
       [Singlish features: 2]
4B-Instruct: Say when you ready, don't rush lah. Take time to think, don't pressure yourself.
       [Singlish features: 1]
8B: You still single lor. No need rush. One day when ready then.
       [Singlish features: 1]


--- Test Case 2 ---
Prompt: Why we get hiccups?
Reference: Diaphragm muscle suddenly contract mah. Usually eat too fast or drink fizzy drink then kena.

1.7B: When you swallow air, then the diaphragm goes down, then the air goes up. Lah. When too many air in stomach, you get hiccups.
       [Singlish features: 1]
4B-Instruct: When diaphragm muscle spasm, then air go up to stomach and cause that little sound. Usually go away after few minute

---
## Interpretation Guide

| Metric | Meaning | What's Good |
|--------|---------|-------------|
| **Perplexity** | How well the model predicts the tokens of the held-out reference responses | Lower = better |
| **Semantic Similarity** | Mean cosine similarity between embeddings of generated and reference responses | Higher = better (max 1.0) |
| **Singlish Score** | Average count of Singlish surface features (particles, vocabulary, grammar patterns) per response | Higher = more Singlish markers |
| **Latency** | Average wall-clock time to generate one response | Lower = faster |
| **Tokens/sec** | Generation throughput | Higher = faster |
| **Final Loss** | Mean training loss over the run (`train_loss`), not the last logged step | Lower = better fit to the training set |

### Decision Framework

- **Best quality**: Choose the model with the lowest perplexity and the highest Singlish score
- **Best speed**: Choose the model with the lowest latency and the highest tokens/sec
- **Best balance**: Weigh quality against speed for your use case
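
One way to make the "best balance" choice explicit is to min-max normalise each metric to the range 0 to 1, flip the metrics where lower is better, and take a weighted sum. The sketch below does this with the values from the final comparison table. The weights are hypothetical placeholders, not a recommendation, and should be set to reflect the actual deployment priorities.

```python
# Values copied from the final comparison table.
results = {
    "1.7B":        {"perplexity": 88.14, "singlish": 1.15, "latency": 1.965, "tokens_per_sec": 14.3},
    "4B-Instruct": {"perplexity": 40.76, "singlish": 1.30, "latency": 1.841, "tokens_per_sec": 11.0},
    "8B":          {"perplexity": 42.05, "singlish": 1.40, "latency": 2.358, "tokens_per_sec": 11.0},
}

LOWER_IS_BETTER = {"perplexity", "latency"}

# Hypothetical weights (assumption): quality counts for 70%, speed for 30%.
WEIGHTS = {"perplexity": 0.35, "singlish": 0.35, "latency": 0.15, "tokens_per_sec": 0.15}

def normalise(metric):
    """Min-max scale one metric to [0, 1], where 1 is always the best model."""
    values = [r[metric] for r in results.values()]
    lo, hi = min(values), max(values)
    scaled = {}
    for name, r in results.items():
        x = (r[metric] - lo) / (hi - lo) if hi > lo else 0.5
        scaled[name] = 1 - x if metric in LOWER_IS_BETTER else x
    return scaled

scores = {name: 0.0 for name in results}
for metric, weight in WEIGHTS.items():
    for name, value in normalise(metric).items():
        scores[name] += weight * value

best = max(scores, key=scores.get)
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
print(f"Highest weighted score: {best}")
```

With only three models and 20 test prompts, small shifts in the weights or in a single metric can change the ranking, so the composite score is best treated as a tie-breaker rather than a verdict.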