
# Singlish Model Comparison: 1.7B vs 4B vs 8B

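This notebook trains three Qwen3 model sizes (1.7B, 4B, and 8B) with an identical QLoRA configuration to determine the best base for Singlish fine-tuning, then runs an evaluation suite on the fine-tuned Singlish checkpoints.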

## Workflow
1. **Setup** - Install dependencies, prepare dataset
2. **Train** - Train all 3 models with identical config
3. **Evaluate** - Run evaluation suite
4. **Compare** - Generate comparison report

---
## Part 1: Setup

In [None]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    import torch; v = re.match(r"[0-9]{1,}\.[0-9]{1,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.33.post1" if v=="2.9" else "0.0.32.post2" if v=="2.8" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.56.2
!pip install --no-deps trl==0.22.2
!pip install sentence-transformers  # For semantic similarity

In [None]:
# Core imports
import torch
import time
import json
import pandas as pd
import numpy as np
from datasets import load_dataset, Dataset
# Required by train_model(); Unsloth is imported before sentence_transformers so its patches apply first
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template, train_on_responses_only
from trl import SFTTrainer, SFTConfig
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import re


In [None]:
# Model configurations to compare
MODELS = {
    "1.7B": "Qwen/Qwen3-1.7B",
    "4B-Instruct": "unsloth/Qwen3-4B-Instruct-2507",
    # "4B": "Qwen/Qwen3-4B",
    "8B": "Qwen/Qwen3-8B",
}

# Shared training config (keep identical for fair comparison)
TRAINING_CONFIG = {
    "r": 32,
    "lora_alpha": 32,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    "max_seq_length": 2048,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "max_steps": 90,  # the training logs recorded in this notebook come from a run with max_steps=40
    "learning_rate": 2e-4,
    "warmup_steps": 5,
}

print("Models to compare:", list(MODELS.keys()))
print("Training config:", TRAINING_CONFIG)

from huggingface_hub import model_info

def check_model_exists(model_name):
    try:
        model_info(model_name)
        print(f"✓ {model_name} exists")
        return True
    except Exception:
        print(f"✗ {model_name} NOT FOUND")
        return False

for name, path in MODELS.items():
    check_model_exists(path)

Models to compare: ['1.7B', '4B-Instruct', '8B']
Training config: {'r': 32, 'lora_alpha': 32, 'target_modules': ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'], 'max_seq_length': 2048, 'per_device_train_batch_size': 2, 'gradient_accumulation_steps': 4, 'max_steps': 40, 'learning_rate': 0.0002, 'warmup_steps': 5}
✓ Qwen/Qwen3-1.7B exists
✓ unsloth/Qwen3-4B-Instruct-2507 exists
✓ Qwen/Qwen3-8B exists


### Prepare Dataset

In [None]:
# Load your Singlish dataset
dataset = load_dataset("csv", data_files="singlish_200_pairs(Mixed).csv", split="train")

# Convert to standard format
def convert_to_standard_format(example):
    return {
        "conversations": [
            {"role": "user", "content": example["Instruction"]},
            {"role": "assistant", "content": example["Output"]}
        ]
    }

dataset = dataset.map(convert_to_standard_format)

# Split into train (90%) and test (10%) for evaluation
dataset_split = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = dataset_split["train"]
test_dataset = dataset_split["test"]

print(f"Train size: {len(train_dataset)}")
print(f"Test size: {len(test_dataset)}")

Train size: 180
Test size: 20
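
The split size and the shared training config together determine how many passes over the data a run makes. The sketch below works through that arithmetic for the two `max_steps` values that appear in this notebook (90 in the config cell, 40 in the recorded training logs). The single-GPU setting and the helper name `training_budget` are assumptions made for illustration.

```python
import math

# Values copied from TRAINING_CONFIG and the train split size printed above.
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1          # assumption: single-GPU run
train_examples = 180  # size of the train split

def training_budget(max_steps):
    """Return effective batch size, optimizer steps per epoch, and epochs covered."""
    effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
    steps_per_epoch = math.ceil(train_examples / effective_batch)
    examples_seen = max_steps * effective_batch
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "epochs": examples_seen / train_examples,
    }

budget_40 = training_budget(40)
budget_90 = training_budget(90)
print(budget_40)
print(budget_90)
```

With 40 steps the model sees 320 examples, a little under two passes over the 180 training rows; with 90 steps it sees exactly four passes.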


In [None]:
# Prepare test prompts and references for evaluation
test_prompts = [ex["conversations"][0]["content"] for ex in test_dataset]
test_references = [ex["conversations"][1]["content"] for ex in test_dataset]

print(f"Test prompts prepared: {len(test_prompts)}")
print(f"Example prompt: {test_prompts[0]}")
print(f"Example reference: {test_references[0]}")

Test prompts prepared: 20
Example prompt: Family keep asking when get married how?
Example reference: Aiyo this one classic. Just tell them when ready lor. Don't rush because of pressure, your life your choice.


---
## Part 2: Training Function

Define a reusable function to train any model size with identical settings.

In [None]:
def train_model(model_name, model_path, train_dataset, config):
    """
    Train a model with QLoRA and return training stats.

    Args:
        model_name: Label for this model (e.g., "1.7B")
        model_path: HuggingFace model path
        train_dataset: Training dataset
        config: Training configuration dict

    Returns:
        dict with training metrics
    """
    print(f"Training {model_name}: {model_path}")

    # Track GPU memory before
    torch.cuda.reset_peak_memory_stats()
    start_time = time.time()

    # Load model
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_path,
        max_seq_length=config["max_seq_length"],
        load_in_4bit=True,
        load_in_8bit=False,
        full_finetuning=False,
    )

    # Add LoRA
    model = FastLanguageModel.get_peft_model(
        model,
        r=config["r"],
        target_modules=config["target_modules"],
        lora_alpha=config["lora_alpha"],
        lora_dropout=0,
        bias="none",
        use_gradient_checkpointing="unsloth",
        random_state=3407,
    )

    # Setup tokenizer
    tokenizer = get_chat_template(tokenizer, chat_template="qwen3-instruct")

    # Format dataset
    def formatting_prompts_func(examples):
        convos = examples["conversations"]
        texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False) for convo in convos]
        return {"text": texts}

    formatted_dataset = train_dataset.map(formatting_prompts_func, batched=True)

    # Trainer
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=formatted_dataset,
        args=SFTConfig(
            dataset_text_field="text",
            per_device_train_batch_size=config["per_device_train_batch_size"],
            gradient_accumulation_steps=config["gradient_accumulation_steps"],
            warmup_steps=config["warmup_steps"],
            max_steps=config["max_steps"],
            learning_rate=config["learning_rate"],
            logging_steps=10,
            optim="adamw_8bit",
            weight_decay=0.001,
            lr_scheduler_type="linear",
            seed=3407,
            report_to="none",
            output_dir=f"outputs_{model_name}",
        ),
    )

    trainer = train_on_responses_only(
        trainer,
        instruction_part="<|im_start|>user\n",
        response_part="<|im_start|>assistant\n",
    )

    # Train
    trainer_stats = trainer.train()

    # Collect metrics
    training_time = time.time() - start_time
    peak_memory = torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024  # GiB
    # Note: `train_loss` is the mean loss over the whole run, not the loss at the last logged step
    final_loss = trainer_stats.metrics.get("train_loss", trainer.state.log_history[-1].get("loss", None))

    # Save adapter
    save_path = f"singlish_adapter_{model_name}"
    model.save_pretrained(save_path)
    tokenizer.save_pretrained(save_path)

    results = {
        "model_name": model_name,
        "training_time_min": round(training_time / 60, 2),
        "peak_vram_gb": round(peak_memory, 2),
        "final_loss": round(final_loss, 4) if final_loss is not None else None,
        "adapter_path": save_path,
    }

    print(f"\n{model_name} Training Complete:")
    print(f"  Time: {results['training_time_min']} min")
    print(f"  Peak VRAM: {results['peak_vram_gb']} GB")
    print(f"  Final Loss: {results['final_loss']}")
    print(f"  Saved to: {save_path}")

    # Cleanup to free VRAM
    del model, trainer
    torch.cuda.empty_cache()

    return results
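
The `train_on_responses_only` call above restricts the loss to assistant turns. The following is a minimal sketch of that idea on a toy token sequence, assuming the usual convention that a label of -100 is ignored by the loss. It is an illustration only, not Unsloth's implementation; `find_subsequence` and `mask_non_response` are hypothetical helper names, and strings stand in for integer token ids.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def find_subsequence(seq, sub, start=0):
    """Return the first index >= start where `sub` occurs in `seq`, or -1."""
    for i in range(start, len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1

def mask_non_response(tokens, instruction_part, response_part):
    """Build labels that keep only the tokens inside assistant turns."""
    labels = [IGNORE_INDEX] * len(tokens)
    pos = 0
    while True:
        start = find_subsequence(tokens, response_part, pos)
        if start == -1:
            break
        begin = start + len(response_part)  # first token of the reply
        end = find_subsequence(tokens, instruction_part, begin)
        if end == -1:
            end = len(tokens)  # the reply runs to the end of the sequence
        labels[begin:end] = tokens[begin:end]
        pos = end
    return labels

# Toy "tokens": strings instead of integer ids, for readability.
USER = ["<|im_start|>", "user", "\n"]
ASSISTANT = ["<|im_start|>", "assistant", "\n"]
tokens = USER + ["How", "are", "you", "<|im_end|>"] + ASSISTANT + ["Okay", "lah", "<|im_end|>"]
labels = mask_non_response(tokens, USER, ASSISTANT)
print(labels)
```

Only the reply tokens keep their labels, so the user prompt and the role markers contribute nothing to the gradient.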

### Train All 3 Models

In [None]:
# Store training results
training_results = {}

for model_name, model_path in MODELS.items():
    results = train_model(model_name, model_path, train_dataset, TRAINING_CONFIG)
    training_results[model_name] = results

    # Save intermediate results
    with open("training_results.json", "w") as f:
        json.dump(training_results, f, indent=2)

pd.DataFrame(training_results).T

Training 1.7B: Qwen/Qwen3-1.7B
==((====))==  Unsloth 2025.11.4: Fast Qwen3 patching. Transformers: 4.56.2.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2025.11.4 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 180 | Num Epochs = 2 | Total steps = 40
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 34,865,152 of 1,755,440,128 (1.99% trained)


Step  Training Loss
  10         5.7152
  20         3.4579
  30         2.9531
  40         2.5234



1.7B Training Complete:
  Time: 2.02 min
  Peak VRAM: 1.8 GB
  Final Loss: 3.6624
  Saved to: singlish_adapter_1.7B
Training 4B-Instruct: unsloth/Qwen3-4B-Instruct-2507
==((====))==  Unsloth 2025.11.4: Fast Qwen3 patching. Transformers: 4.56.2.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2025.11.4 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 180 | Num Epochs = 2 | Total steps = 40
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 66,060,288 of 4,088,528,384 (1.62% trained)


Step  Training Loss
  10         4.6116
  20         2.7489
  30         2.2621
  40         1.8145



4B-Instruct Training Complete:
  Time: 2.22 min
  Peak VRAM: 5.4 GB
  Final Loss: 2.8593
  Saved to: singlish_adapter_4B-Instruct
Training 8B: Qwen/Qwen3-8B
==((====))==  Unsloth 2025.11.4: Fast Qwen3 patching. Transformers: 4.56.2.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 180 | Num Epochs = 2 | Total steps = 40
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 87,293,952 of 8,278,029,312 (1.05% trained)


Step  Training Loss
  10         4.7793
  20         2.7632
  30         2.1878
  40         1.8299



8B Training Complete:
  Time: 2.69 min
  Peak VRAM: 11.66 GB
  Final Loss: 2.8901
  Saved to: singlish_adapter_8B


              model_name  training_time_min  peak_vram_gb  final_loss                  adapter_path
1.7B                1.7B               2.02           1.8      3.6624         singlish_adapter_1.7B
4B-Instruct  4B-Instruct               2.22           5.4      2.8593  singlish_adapter_4B-Instruct
8B                    8B               2.69         11.66      2.8901           singlish_adapter_8B
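
The "Trainable parameters" figures in the logs above can be cross-checked by hand: a LoRA adapter of rank `r` on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` parameters. The sketch below applies that to the seven target modules in `TRAINING_CONFIG`. The layer shapes are assumptions taken from the public Qwen3 model configs, not values read from this notebook, and `lora_trainable_params` is a hypothetical helper.

```python
# Assumed layer shapes (from the public Qwen3 configs, not from this notebook).
QWEN3_SHAPES = {
    "1.7B": {"layers": 28, "hidden": 2048, "q_out": 2048, "kv_out": 1024, "mlp": 6144},
    "4B":   {"layers": 36, "hidden": 2560, "q_out": 4096, "kv_out": 1024, "mlp": 9728},
    "8B":   {"layers": 36, "hidden": 4096, "q_out": 4096, "kv_out": 1024, "mlp": 12288},
}

def lora_trainable_params(shape, r=32):
    """Count LoRA parameters for q/k/v/o and gate/up/down projections in every layer."""
    h, q, kv, mlp = shape["hidden"], shape["q_out"], shape["kv_out"], shape["mlp"]
    modules = [
        (h, q),    # q_proj
        (h, kv),   # k_proj
        (h, kv),   # v_proj
        (q, h),    # o_proj
        (h, mlp),  # gate_proj
        (h, mlp),  # up_proj
        (mlp, h),  # down_proj
    ]
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in modules)
    return per_layer * shape["layers"]

counts = {name: lora_trainable_params(shape) for name, shape in QWEN3_SHAPES.items()}
for name, n in counts.items():
    print(f"{name}: {n:,}")
```

Under these assumed shapes the counts agree with the logged values (34,865,152, 66,060,288, and 87,293,952), which suggests the adapter was attached to all seven target modules in every layer.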


---
## Part 3: Evaluation Functions

In [None]:
# Fine-tuned Singlish checkpoints on the Hugging Face Hub to evaluate
EVAL_MODELS = {
    "4B-singlish-base-v1": "yuhueng/qwen3-4b-singlish-base",
    "4B-singlish-base-v2": "yuhueng/qwen3-4b-singlish-base-v2",
    "4B-singlish-base-v3": "yuhueng/qwen3-4b-singlish-base-v3",
}
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

def load_pretrained_singlish_model(model_id: str):
    """
    Load a fully fine-tuned Singlish model from Hugging Face.
    No LoRA / unsloth needed.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        dtype=(
            torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
        ),
        device_map="auto",  # spread across available GPUs/CPU
    )
    return model, tokenizer

def generate_singlish_from_english(
    model,
    tokenizer,
    english_sentence: str,
    prompt_type: str = "instructional",
    max_new_tokens: int = 80,
):
    # You can wire in your own prompt functions here.
    if prompt_type == "instructional":
        user_text = f"Translate this sentence into natural Singlish:\n\n{english_sentence}"
    else:
        user_text = english_sentence

    messages = [
        {"role": "system", "content": "You are a helpful assistant that replies in natural Singlish."},
        {"role": "user", "content": user_text},
    ]

    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,  # make sampling explicit so temperature/top_p take effect
            temperature=0.7,
            top_p=0.8,
        )

    # decode **only** the generated part
    input_len = inputs["input_ids"].shape[1]
    gen_ids = outputs[0][input_len:]
    text = tokenizer.decode(gen_ids, skip_special_tokens=True).strip()
    return text

def generate_response(model, tokenizer, prompt, max_new_tokens=80):
    """
    Thin wrapper so existing code (e.g. measure_inference_speed) can call
    generate_response() and under the hood we use generate_singlish_from_english().
    """
    return generate_singlish_from_english(
        model=model,
        tokenizer=tokenizer,
        english_sentence=prompt,
        prompt_type="instructional",   # or pass in a global / config if you want
        max_new_tokens=max_new_tokens,
    )



In [None]:
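# Sanity check: load one checkpoint and generate a single reply
model, tokenizer = load_pretrained_singlish_model("yuhueng/qwen3-4b-singlish-base-v2")
print(
    generate_singlish_from_english(
        model, tokenizer, "Explain photosynthesis in simple way can ah?"
    )
)

# Free the sanity-check model so it does not inflate later VRAM measurements
del model, tokenizer
torch.cuda.empty_cache()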


`torch_dtype` is deprecated! Use `dtype` instead!


Photosynthesis is when plants use sunlight, water and carbon dioxide to make their own food lah. They basically turn light energy into chemical energy, quite like mini power station.


### Evaluation 1: Perplexity

In [None]:
def calculate_perplexity(model, tokenizer, test_texts):
    """
    Calculate perplexity on test set.
    Lower = better fit to Singlish patterns.
    """
    model.eval()
    total_loss = 0
    total_tokens = 0

    for text in test_texts:
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(model.device)
        # The model loss is averaged over the shifted targets, i.e. seq_len - 1 predictions
        num_predicted = inputs["input_ids"].size(1) - 1
        if num_predicted < 1:
            continue  # nothing to predict for a single-token text

        with torch.no_grad():
            outputs = model(**inputs, labels=inputs["input_ids"])
            total_loss += outputs.loss.item() * num_predicted
            total_tokens += num_predicted

    avg_loss = total_loss / total_tokens
    perplexity = torch.exp(torch.tensor(avg_loss)).item()

    return round(perplexity, 2)
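
Corpus-level perplexity is the exponential of the token-weighted mean negative log-likelihood. The sketch below shows that aggregation on hypothetical per-text losses, with no model involved; `perplexity_from_nll` and the example numbers are assumptions for illustration.

```python
import math

def perplexity_from_nll(per_text_nll):
    """per_text_nll: list of (mean_nll, num_predicted_tokens) pairs, one per text."""
    total_nll = sum(mean_nll * n for mean_nll, n in per_text_nll)
    total_tokens = sum(n for _, n in per_text_nll)
    return math.exp(total_nll / total_tokens)

# A model that spreads probability uniformly over 50 tokens has perplexity 50.
uniform = perplexity_from_nll([(math.log(50), 12), (math.log(50), 7)])

# Longer texts weigh more: 10 tokens at NLL 2.0 and 30 tokens at NLL 3.0 average to 2.75.
mixed = perplexity_from_nll([(2.0, 10), (3.0, 30)])
print(round(uniform, 2), round(mixed, 2))
```

Weighting by token count keeps short references from dominating the score, which matters when reply lengths vary as much as they do in conversational data.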

### Evaluation 2: Semantic Similarity

In [None]:
# Load sentence embedding model (once)
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

def calculate_semantic_similarity(generated_responses, reference_responses, return_all: bool = False):
    """
    Calculate cosine similarity between generated and reference responses.

    If return_all=False  -> returns dataset-level mean similarity (float).
    If return_all=True   -> returns list of per-example similarities (floats).
    """
    gen_embeddings = embedding_model.encode(generated_responses)
    ref_embeddings = embedding_model.encode(reference_responses)

    similarities = []
    for gen_emb, ref_emb in zip(gen_embeddings, ref_embeddings):
        sim = cosine_similarity([gen_emb], [ref_emb])[0][0]
        similarities.append(float(sim))

    if return_all:
        return similarities

    return round(float(np.mean(similarities)), 4)
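
For reference, the per-pair score is the cosine of the angle between two embedding vectors, and the dataset-level figures are simple statistics over those scores. A dependency-free sketch on toy vectors follows; the real embeddings come from the sentence-embedding model, and `cosine` and `summarize` are hypothetical helpers.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def summarize(sims):
    """Mean, minimum, and population standard deviation of a list of scores."""
    mean = sum(sims) / len(sims)
    variance = sum((s - mean) ** 2 for s in sims) / len(sims)
    return {"mean": mean, "min": min(sims), "std": math.sqrt(variance)}

same = cosine([3.0, 4.0], [3.0, 4.0])        # identical direction
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])  # unrelated direction
stats = summarize([0.2, 0.6, 1.0])
print(same, orthogonal, stats)
```

Reporting the minimum and the spread alongside the mean helps expose a model that answers most prompts well but fails badly on a few.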


### Evaluation 3: Singlish Feature Detection

In [None]:
def singlish_feature_score(text):
    """
    Count Singlish linguistic features in text.
    Higher = more Singlish features present.
    """
    text_lower = text.lower()
    scores = {}

    # Particles (sentence-ending)
    particles = ["lah", "lor", "leh", "meh", "sia", "hor", "ah", "mah", "liao", "nia"]
    scores["particles"] = sum(1 for p in particles if re.search(rf"\b{p}\b", text_lower))

    # Singlish vocabulary (matched as whole words so "sian" does not fire on "Asian")
    vocab = ["shiok", "chope", "kiasu", "paiseh", "sian", "atas", "lepak",
             "bojio", "jialat", "walao", "alamak", "aiyo", "kanchiong",
             "sibei", "damn", "machiam", "chio", "siao", "blur", "kaypoh",
             "bodoh", "gabra", "agak", "steady", "swee"]
    scores["vocabulary"] = sum(1 for v in vocab if re.search(rf"\b{v}\b", text_lower))

    # Grammar patterns
    scores["got_existential"] = len(re.findall(r"\bgot\s+\w+", text_lower))  # "got time", "got people"
    scores["one_suffix"] = len(re.findall(r"\w+\s+one\b", text_lower))  # "this one", "like that one"
    scores["can_cannot"] = len(re.findall(r"\b(can|cannot)\b", text_lower))
    scores["already_pattern"] = len(re.findall(r"\balready\b|\bliao\b", text_lower))

    # Discourse markers
    discourse = ["eh", "wah", "aiyah", "haiya", "oi"]
    scores["discourse"] = sum(1 for d in discourse if re.search(rf"\b{d}\b", text_lower))

    total = sum(scores.values())
    return {"breakdown": scores, "total": total}




def average_singlish_score(responses):
    """
    Average Singlish feature score across all responses.
    """
    scores = [singlish_feature_score(r)["total"] for r in responses]
    return round(float(np.mean(scores)), 2)


def singlish_coverage(responses, threshold: int = 1):
    """
    Percentage of responses that have at least `threshold`
    Singlish features (i.e. not pure standard English).
    """
    scores = [singlish_feature_score(r)["total"] for r in responses]
    coverage = np.mean([s >= threshold for s in scores])
    return round(float(coverage * 100.0), 1)  # percentage


def oversinglish_rate(responses, upper: int = 12):
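    """
    Percentage of responses that are likely 'over-Singlish'
    (more than `upper` Singlish feature hits).
    Particles, vocabulary and discourse markers count once per distinct item,
    so this flags responses that stack many different features; it does not
    catch a single particle repeated many times ('lah lah lah...').
    """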
    scores = [singlish_feature_score(r)["total"] for r in responses]
    rate = np.mean([s > upper for s in scores])
    return round(float(rate * 100.0), 1)  # percentage
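
Whether vocabulary items are matched as substrings or as whole words changes the feature score, because several short Singlish words also occur inside ordinary English words. The sketch below compares the two matching strategies on a three-word vocabulary; both helper functions and the example sentences are hypothetical.

```python
import re

def count_substring(vocab, text):
    """Count vocabulary items that appear anywhere in the text, even inside other words."""
    text = text.lower()
    return sum(1 for word in vocab if word in text)

def count_whole_word(vocab, text):
    """Count vocabulary items that appear as standalone words."""
    text = text.lower()
    return sum(1 for word in vocab if re.search(rf"\b{re.escape(word)}\b", text))

vocab = ["sian", "atas", "blur"]
english = "The Asian datasets look blurry."
singlish = "Wah so sian, the cafe damn atas."

print(count_substring(vocab, english), count_whole_word(vocab, english))
print(count_substring(vocab, singlish), count_whole_word(vocab, singlish))
```

The plain English sentence scores three hits under substring matching and none under whole-word matching, while the Singlish sentence scores two either way.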


### Evaluation 4: Inference Metrics

In [None]:
def measure_inference_speed(model, tokenizer, prompts, max_new_tokens=100, num_runs=5):
    """
    Measure inference speed metrics.
    """
    times = []
    token_counts = []

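    # Warmup
    _ = generate_response(model, tokenizer, prompts[0], max_new_tokens=20)

    # Reset the peak-memory counter so the reported VRAM reflects this model only
    torch.cuda.reset_peak_memory_stats()

    # Actual measurement (renamed to avoid shadowing the global test_prompts)
    timed_prompts = prompts[:num_runs]

    for prompt in timed_prompts: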
        torch.cuda.synchronize()
        start = time.perf_counter()

        response = generate_response(model, tokenizer, prompt, max_new_tokens=max_new_tokens)

        torch.cuda.synchronize()
        end = time.perf_counter()

        times.append(end - start)
        token_counts.append(len(tokenizer.encode(response)))

    avg_latency = float(np.mean(times))
    avg_tokens = float(np.mean(token_counts))
    tokens_per_sec = avg_tokens / avg_latency
    vram = torch.cuda.max_memory_allocated() / 1024**3  # GiB, same unit as the training stats

    return {
        "avg_latency_sec": round(avg_latency, 3),
        "tokens_per_sec": round(tokens_per_sec, 1),
        "inference_vram_gb": round(vram, 2),
    }
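
The throughput figure above divides mean tokens by mean latency, which is the same as total tokens over total time. Averaging per-run rates instead gives a different number when runs differ in length. A small sketch with hypothetical timings shows the gap; `summarize_speed` is an illustrative helper, not part of the notebook.

```python
def summarize_speed(latencies, token_counts):
    """Return average latency, aggregate tokens/sec, and the mean of per-run rates."""
    avg_latency = sum(latencies) / len(latencies)
    aggregate_tps = sum(token_counts) / sum(latencies)
    per_run_tps = [n / t for n, t in zip(token_counts, latencies)]
    mean_of_rates = sum(per_run_tps) / len(per_run_tps)
    return avg_latency, aggregate_tps, mean_of_rates

# Hypothetical measurements: seconds per call and tokens generated per call.
latencies = [1.0, 2.0, 3.0]
token_counts = [20, 30, 70]
avg_latency, aggregate_tps, mean_of_rates = summarize_speed(latencies, token_counts)
print(avg_latency, aggregate_tps, round(mean_of_rates, 2))
```

The aggregate rate weights long generations more heavily, so it is usually the better estimate of sustained throughput.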

### Master Evaluation Function

In [None]:
def evaluate_model(
    model_name: str,
    model_id: str,
    test_prompts,
    test_references,
    prompt_type: str = "instructional",
    speed_num_runs: int = 10,
):
    """
    Run full evaluation suite on a pretrained Singlish model from Hugging Face.

    Uses:
      - load_pretrained_singlish_model
      - generate_singlish_from_english / generate_response
      - calculate_semantic_similarity
      - calculate_perplexity
      - average_singlish_score / singlish_coverage / oversinglish_rate
      - measure_inference_speed
    """
    print("\n" + "=" * 60)
    print(f"Evaluating {model_name} ({model_id})")
    print("=" * 60)

    # 1. Load model + tokenizer from HF
    print("Loading model...")
    model, tokenizer = load_pretrained_singlish_model(model_id)

    # 2. Generate responses for all test prompts
    print(f"Generating responses for {len(test_prompts)} prompts...")
    generated_responses = []
    for i, eng in enumerate(test_prompts):
        if (i + 1) % 25 == 0:
            print(f"  {i + 1}/{len(test_prompts)} prompts done")
        resp = generate_singlish_from_english(
            model=model,
            tokenizer=tokenizer,
            english_sentence=eng,
            prompt_type=prompt_type,
            max_new_tokens=80,
        )
        generated_responses.append(resp)

    # 3. Semantic similarity stats
    sim_list = calculate_semantic_similarity(
        generated_responses,
        test_references,
        return_all=True,
    )
    semantic_sim_mean = round(float(np.mean(sim_list)), 4)
    semantic_sim_min  = round(float(np.min(sim_list)), 4)
    semantic_sim_std  = round(float(np.std(sim_list)), 4)

    # 4. Perplexity on reference Singlish
    perplexity = calculate_perplexity(model, tokenizer, test_references)

    # 5. Singlish style metrics
    singlish_score_mean = average_singlish_score(generated_responses)
    singlish_cov_pct    = singlish_coverage(generated_responses, threshold=1)
    oversing_pct        = oversinglish_rate(generated_responses, upper=12)

    # 6. Latency / speed metrics
    speed_metrics = measure_inference_speed(
        model,
        tokenizer,
        prompts=test_prompts,
        max_new_tokens=80,
        num_runs=speed_num_runs,
    )

    results = {
        "perplexity": perplexity,
        "semantic_similarity": semantic_sim_mean,
        "semantic_min": semantic_sim_min,
        "semantic_std": semantic_sim_std,
        "singlish_feature_score": singlish_score_mean,
        "singlish_coverage_pct": singlish_cov_pct,
        "oversinglish_pct": oversing_pct,
        "avg_latency_sec": speed_metrics["avg_latency_sec"],
        "tokens_per_sec": speed_metrics["tokens_per_sec"],
        "inference_vram_gb": speed_metrics["inference_vram_gb"],
        "generated_responses": generated_responses,
    }

    print(f"\n{model_name} Evaluation Results:")
    print(f"  Perplexity: {perplexity} (lower = better)")
    print(f"  Semantic Similarity (mean): {semantic_sim_mean}")
    print(f"    - min: {semantic_sim_min}, std: {semantic_sim_std}")
    print(f"  Singlish Score (avg): {singlish_score_mean}")
    print(f"  Singlish Coverage: {singlish_cov_pct}% (>=1 feature)")
    print(f"  Over-Singlish rate: {oversing_pct}% (>12 features)")
    print(f"  Latency: {speed_metrics['avg_latency_sec']} s")
    print(f"  Tokens/sec: {speed_metrics['tokens_per_sec']}")

    # Free VRAM
    del model
    torch.cuda.empty_cache()

    return results


### Run Evaluation on All 3 Models

In [None]:
evaluation_results = {}

for name, hf_id in EVAL_MODELS.items():
    evaluation_results[name] = evaluate_model(
        model_name=name,
        model_id=hf_id,
        test_prompts=test_prompts,         # held-out user prompts
        test_references=test_references,   # reference Singlish replies
        prompt_type="instructional",
        speed_num_runs=10,
    )




print("\n" + "="*60)
print("ALL EVALUATIONS COMPLETE")
print("="*60)


Evaluating 4B-singlish-base-v1 (yuhueng/qwen3-4b-singlish-base)
Loading model...


Generating responses for 20 prompts...

4B-singlish-base-v1 Evaluation Results:
  Perplexity: 16.26 (lower = better)
  Semantic Similarity (mean): 0.5454
    - min: 0.2038, std: 0.2153
  Singlish Score (avg): 1.9
  Singlish Coverage: 95.0% (>=1 feature)
  Over-Singlish rate: 0.0% (>12 features)
  Latency: 1.613 s
  Tokens/sec: 10.5

Evaluating 4B-singlish-base-v2 (yuhueng/qwen3-4b-singlish-base-v2)
Loading model...


Generating responses for 20 prompts...

4B-singlish-base-v2 Evaluation Results:
  Perplexity: 33.43 (lower = better)
  Semantic Similarity (mean): 0.588
    - min: 0.2894, std: 0.1701
  Singlish Score (avg): 1.25
  Singlish Coverage: 80.0% (>=1 feature)
  Over-Singlish rate: 0.0% (>12 features)
  Latency: 1.287 s
  Tokens/sec: 18.9

Evaluating 4B-singlish-base-v3 (yuhueng/qwen3-4b-singlish-base-v3)
Loading model...


Generating responses for 20 prompts...

4B-singlish-base-v3 Evaluation Results:
  Perplexity: 32.7 (lower = better)
  Semantic Similarity (mean): 0.5863
    - min: 0.2491, std: 0.1755
  Singlish Score (avg): 1.4
  Singlish Coverage: 80.0% (>=1 feature)
  Over-Singlish rate: 0.0% (>12 features)
  Latency: 1.361 s
  Tokens/sec: 19.0

ALL EVALUATIONS COMPLETE


In [None]:
evaluation_results

{'4B-singlish-base-v1': {'perplexity': 16.26,
  'semantic_similarity': 0.5454,
  'semantic_min': 0.2038,
  'semantic_std': 0.2153,
  'singlish_feature_score': 1.9,
  'singlish_coverage_pct': 95.0,
  'oversinglish_pct': 0.0,
  'avg_latency_sec': np.float64(1.613),
  'tokens_per_sec': np.float64(10.5),
  'inference_vram_gb': 12.16,
  'generated_responses': ['Jia you already 30, still single like damn old already.',
   'Stomach muscle suddenly contract, damn weird sia. Drink water or stretch usually can stop.',
   "Wet pan, low heat, stir stir - soft boil already. Don't like hard egg then.",
   'Still in Paris lah, never come Singapore.',
   "That's Mpemba effect lah. Science kena, but quite damn interesting sia.",
   "8 power 2 is 64 lah. Do your school math properly, don't rely on calculator.",
   'Wi-Fi sotong like sotong, kena rate limit already.',
   'Rivers flow ocean for millions of years, all the minerals stay behind.',
   "Be direct but gentle lah. Don't drag, just say and suppor

---
## Part 4: Comparison & Results

In [None]:
comparison_data = []

for model_name in EVAL_MODELS.keys():
    row = {
        "Model": model_name,
        # Evaluation metrics
        "Perplexity ↓": evaluation_results[model_name]["perplexity"],
        "Semantic Sim ↑": evaluation_results[model_name]["semantic_similarity"],
        "Semantic Min": evaluation_results[model_name]["semantic_min"],
        "Semantic Std": evaluation_results[model_name]["semantic_std"],
        "Singlish Score ↑": evaluation_results[model_name]["singlish_feature_score"],
        "Singlish Coverage % ↑": evaluation_results[model_name]["singlish_coverage_pct"],
        "Over-Singlish % ↓": evaluation_results[model_name]["oversinglish_pct"],
        "Latency (s) ↓": evaluation_results[model_name]["avg_latency_sec"],
        "Tokens/sec ↑": evaluation_results[model_name]["tokens_per_sec"],
        "Infer VRAM (GB)": evaluation_results[model_name]["inference_vram_gb"],
    }
    comparison_data.append(row)


comparison_df = pd.DataFrame(comparison_data)
comparison_df.set_index("Model", inplace=True)

print("\n" + "="*80)
print("FINAL COMPARISON TABLE")
print("="*80)
print(comparison_df.to_string())


FINAL COMPARISON TABLE
                     Perplexity ↓  Semantic Sim ↑  Semantic Min  Semantic Std  Singlish Score ↑  Singlish Coverage % ↑  Over-Singlish % ↓  Latency (s) ↓  Tokens/sec ↑  Infer VRAM (GB)
Model                                                                                                                                                                                  
4B-singlish-base-v1         16.26          0.5454        0.2038        0.2153              1.90                   95.0                0.0          1.613          10.5            12.16
4B-singlish-base-v2         33.43          0.5880        0.2894        0.1701              1.25                   80.0                0.0          1.287          18.9            18.14
4B-singlish-base-v3         32.70          0.5863        0.2491        0.1755              1.40                   80.0                0.0          1.361          19.0            18.14


In [None]:
print("WINNERS BY METRIC")

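# Training-only metrics ("Final Loss", "Train Time (min)", "Train VRAM (GB)") are not
# columns of comparison_df in this run, so the loops below skip them.
metrics_lower_better = [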
    "Perplexity ↓",
    "Latency (s) ↓",
    "Final Loss",
    "Train Time (min)",
    "Train VRAM (GB)",
    "Infer VRAM (GB)",
    "Over-Singlish % ↓",
]
metrics_higher_better = [
    "Semantic Sim ↑",
    "Singlish Score ↑",
    "Singlish Coverage % ↑",
    "Tokens/sec ↑",
]

winners = {}

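# idxmin/idxmax return the first matching row, so a tie (e.g. every model at 0.0)
# is credited to whichever model appears first in comparison_df.
for metric in metrics_lower_better: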
    if metric in comparison_df.columns:
        winner = comparison_df[metric].idxmin()
        winners[metric] = winner
        print(f"  {metric}: {winner} ({comparison_df.loc[winner, metric]})")

for metric in metrics_higher_better:
    if metric in comparison_df.columns:
        winner = comparison_df[metric].idxmax()
        winners[metric] = winner
        print(f"  {metric}: {winner} ({comparison_df.loc[winner, metric]})")


WINNERS BY METRIC
  Perplexity ↓: 4B-singlish-base-v1 (16.26)
  Latency (s) ↓: 4B-singlish-base-v2 (1.287)
  Infer VRAM (GB): 4B-singlish-base-v1 (12.16)
  Over-Singlish % ↓: 4B-singlish-base-v1 (0.0)
  Semantic Sim ↑: 4B-singlish-base-v2 (0.588)
  Singlish Score ↑: 4B-singlish-base-v1 (1.9)
  Singlish Coverage % ↑: 4B-singlish-base-v1 (95.0)
  Tokens/sec ↑: 4B-singlish-base-v3 (19.0)
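
In the table above, all three models score 0.0 on Over-Singlish %, yet the winners list names only `4B-singlish-base-v1`. A minimal sketch of a tie-aware lookup follows. `tied_winners` and its `tol` parameter are hypothetical names, not part of the notebook, and the values are copied by hand from the comparison table rather than read from `comparison_df`.

```python
# Hypothetical helper (not in the notebook): report every model that ties for best.
def tied_winners(scores, lower_is_better=True, tol=1e-9):
    """Return all model names whose score is within `tol` of the best score."""
    best = min(scores.values()) if lower_is_better else max(scores.values())
    return [name for name, value in scores.items() if abs(value - best) <= tol]


# Values copied from the comparison table.
over_singlish_pct = {
    "4B-singlish-base-v1": 0.0,
    "4B-singlish-base-v2": 0.0,
    "4B-singlish-base-v3": 0.0,
}
tokens_per_sec = {
    "4B-singlish-base-v1": 10.5,
    "4B-singlish-base-v2": 18.9,
    "4B-singlish-base-v3": 19.0,
}

# Lower is better for Over-Singlish %; all three models tie at 0.0.
print(tied_winners(over_singlish_pct))
# Higher is better for Tokens/sec; there is a single best model.
print(tied_winners(tokens_per_sec, lower_is_better=False))
```

Reporting ties explicitly avoids reading a "win" into a metric that does not separate the models at all.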


In [None]:
# Show the first 10 test cases (or fewer if the test set is smaller)
for i in range(min(10, len(test_prompts))):
    print(f"\n--- Test Case {i+1} ---")
    print(f"Prompt: {test_prompts[i]}")
    print(f"Reference: {test_references[i]}")
    print()
    for model_name in EVAL_MODELS.keys():
        response = evaluation_results[model_name]["generated_responses"][i]
        score = singlish_feature_score(response)["total"]
        print(f"{model_name}: {response}")
        print(f"       [Singlish features: {score}]")
    print()


--- Test Case 1 ---
Prompt: Family keep asking when get married how?
Reference: Aiyo this one classic. Just tell them when ready lor. Don't rush because of pressure, your life your choice.

4B-singlish-base-v1: Jia you already 30, still single like damn old already.
       [Singlish features: 3]
4B-singlish-base-v2: Family always ask when you marry lah. Sometimes they nag so much, you want to say “I still thinking” but they hear like “You never mind lah?”
       [Singlish features: 1]
4B-singlish-base-v3: Family keep ask when marry, how how, really sian.
       [Singlish features: 1]


--- Test Case 2 ---
Prompt: Why we get hiccups?
Reference: Diaphragm muscle suddenly contract mah. Usually eat too fast or drink fizzy drink then kena.

4B-singlish-base-v1: Stomach muscle suddenly contract, damn weird sia. Drink water or stretch usually can stop.
       [Singlish features: 3]
4B-singlish-base-v2: Hiccups happen when diaphragm muscle suddenly contract, like getting kena. Usually nothing

---
## Interpretation Guide

| Metric | Meaning | What's Good |
|--------|---------|-------------|
| **Perplexity** | How well model predicts Singlish tokens | Lower = better |
| **Semantic Similarity** | How close meaning is to reference | Higher = better (max 1.0) |
| **Singlish Score** | Count of Singlish features in output | Higher = more Singlish markers (read alongside Over-Singlish %) |
| **Latency** | Time to generate response | Lower = faster |
| **Tokens/sec** | Generation speed | Higher = faster |
| **Final Loss** | Training convergence | Lower = better fit |
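
To make the first two rows concrete, here is a sketch of the textbook definitions: perplexity as the exponential of the mean negative log-likelihood per token, and semantic similarity as the cosine between two embedding vectors. This is an illustration under those assumptions, not the notebook's evaluation code. The function names and toy inputs are hypothetical, and real embeddings would come from a sentence-embedding model rather than hand-written lists.

```python
import math


def perplexity(token_log_probs):
    """exp(mean negative log-likelihood), using natural-log token probabilities."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy case: a model that assigns probability 0.25 to every token behaves like a
# uniform guess over 4 options, so its perplexity is approximately 4.
uniform_log_probs = [math.log(0.25)] * 5
print(perplexity(uniform_log_probs))

# Toy vectors: identical directions give 1.0, orthogonal directions give 0.0.
print(cosine([1.0, 0.0], [2.0, 0.0]))
print(cosine([1.0, 0.0], [0.0, 3.0]))
```

Read this way, a perplexity of 16.26 means the model is on average about as uncertain as a uniform choice among roughly 16 tokens.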

### Decision Framework

- **Best quality**: Choose model with best Perplexity + Singlish Score
- **Best speed**: Choose model with best Latency + Tokens/sec
- **Best balance**: Weigh quality vs speed based on your use case
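
One way to make the "best balance" choice explicit is a weighted composite score over min-max normalised metrics. The sketch below is only an illustration: the weights, the choice of four metrics, and the helper names `min_max` and `composite` are assumptions, not part of the notebook. The metric values are copied from the comparison table.

```python
def min_max(values, higher_is_better):
    """Scale scores to [0, 1] so that 1.0 is always the best model."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:  # metric does not separate the models
        return {name: 1.0 for name in values}
    scaled = {name: (v - lo) / (hi - lo) for name, v in values.items()}
    if higher_is_better:
        return scaled
    return {name: 1.0 - s for name, s in scaled.items()}


def composite(metrics, weights):
    """Weighted sum of normalised metrics per model."""
    totals = {}
    for metric, (values, higher_is_better) in metrics.items():
        for model, score in min_max(values, higher_is_better).items():
            totals[model] = totals.get(model, 0.0) + weights[metric] * score
    return totals


V1, V2, V3 = "4B-singlish-base-v1", "4B-singlish-base-v2", "4B-singlish-base-v3"

# (values, higher_is_better), copied from the comparison table.
metrics = {
    "Perplexity": ({V1: 16.26, V2: 33.43, V3: 32.70}, False),
    "Singlish Score": ({V1: 1.90, V2: 1.25, V3: 1.40}, True),
    "Latency (s)": ({V1: 1.613, V2: 1.287, V3: 1.361}, False),
    "Tokens/sec": ({V1: 10.5, V2: 18.9, V3: 19.0}, True),
}

# Hypothetical weightings: one favours quality, the other favours speed.
quality_weights = {"Perplexity": 0.3, "Singlish Score": 0.3, "Latency (s)": 0.2, "Tokens/sec": 0.2}
speed_weights = {"Perplexity": 0.1, "Singlish Score": 0.1, "Latency (s)": 0.4, "Tokens/sec": 0.4}

for label, weights in [("quality-leaning", quality_weights), ("speed-leaning", speed_weights)]:
    scores = composite(metrics, weights)
    best = max(scores, key=scores.get)
    print(f"{label}: best = {best}")
    for model, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        print(f"  {model}: {score:.3f}")
```

Min-max scaling over only three models is sensitive to outliers, so the ranking is best treated as a way to surface trade-offs, not as a final verdict.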