<a href="https://colab.research.google.com/github/qaanit/Project-Halyard/blob/main/HDM_LLM_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##**Loading model and tokenizer**

Make sure you are connected to a T4 or better hosted runtime, as Unsloth requires an NVIDIA GPU and will not run on a CPU-only runtime.

If you have only just connected, rerun every cell before this one as well.
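
A quick check like the one below can confirm that the runtime actually exposes an NVIDIA GPU before anything is installed. This is a minimal sketch that assumes the `nvidia-smi` tool is on the PATH, as it typically is on Colab GPU runtimes; the helper name `nvidia_gpu_names` is hypothetical.

```python
import shutil
import subprocess

def nvidia_gpu_names():
    """Return the GPU names reported by nvidia-smi, or [] if none are visible."""
    # Assumption: nvidia-smi is installed whenever an NVIDIA GPU is attached.
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, check=False, timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    # One GPU name per non-empty output line
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]

gpus = nvidia_gpu_names()
if gpus:
    print("NVIDIA GPU(s) found:", ", ".join(gpus))
else:
    print("No NVIDIA GPU visible. Switch to a GPU runtime before continuing.")
```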

In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
    !pip install --no-deps unsloth

In [None]:
# One must patch the DPO Trainer first!
from unsloth import PatchDPOTrainer

PatchDPOTrainer()

In [None]:
from unsloth import FastLanguageModel
import torch

fourbit_models = [
    "unsloth/Qwen3-1.7B-unsloth-bnb-4bit", # Qwen3 1.7B, 2x faster
    "unsloth/Qwen3-4B-unsloth-bnb-4bit",
    "unsloth/Qwen3-8B-unsloth-bnb-4bit",
    "unsloth/Qwen3-14B-unsloth-bnb-4bit",
    "unsloth/Qwen3-32B-unsloth-bnb-4bit",

    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/Phi-4",
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/orpheus-3b-0.1-ft-unsloth-bnb-4bit" # [NEW] We support TTS models!
] # More models at https://huggingface.co/unsloth


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-14B",
    max_seq_length = 1024,   # Context length - can be longer, but uses more memory
    load_in_4bit = True,     # 4bit uses much less memory
    load_in_8bit = False,    # A bit more accurate, uses 2x memory
    full_finetuning = False, # We have full finetuning now!
    # token = "hf_...",      # use one if using gated models
)

#model, tokenizer = FastLanguageModel.from_pretrained(
#    model_name = "unsloth/zephyr-sft-bnb-4bit",
#    max_seq_length = 2048,
#    dtype = None,
#    load_in_4bit = True,
#)

##Uploading the training dataset


In [None]:
# prompt: upload the unsloth hdm ai training dataset.json as the training dataset for the model

from datasets import load_dataset

dataset = load_dataset("json", data_files="/content/Corrected UnSloth HDM AI Training Dataset.json", encoding="utf-8", split = "train")

In [None]:
# Split it into 75% for training and 25% for evaluation (test_size=0.25)
dataset_split = dataset.train_test_split(test_size=0.25, seed=42)

train_dataset = dataset_split["train"]
eval_dataset = dataset_split["test"] # This is your evaluation set

print(f"Training on {len(train_dataset)} examples.")
print(f"Evaluating on {len(eval_dataset)} examples.")


## Setting the LoRA hyperparameters


In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,           # Choose any number > 0! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,  # Best to choose alpha = rank or rank*2
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,   # We support rank stabilized LoRA
    loftq_config = None,  # And LoftQ
)

In [None]:
# Set the padding token if it's not already defined
#if tokenizer.pad_token is None: tokenizer.pad_token = tokenizer.eos_token

##Converting the data into conversational format
The dataset does not need to be converted to a conversational format, as it is already structured for **Direct Preference Optimization (DPO)**: each example pairs a prompt with a chosen and a rejected response. Unlike standard supervised fine-tuning, which only teaches the model to imitate target responses, DPO trains the model to prefer the chosen response over the rejected one.
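
As an illustration of what a DPO-ready record looks like, the sketch below checks that a record carries the three text fields that TRL's DPO trainer conventionally reads: `prompt`, `chosen`, and `rejected`. The sample record and the helper `missing_dpo_fields` are hypothetical and are not taken from the actual training file.

```python
REQUIRED_FIELDS = ("prompt", "chosen", "rejected")

def missing_dpo_fields(record):
    """Return the required fields that are absent or are not non-empty strings."""
    return [
        field for field in REQUIRED_FIELDS
        if not isinstance(record.get(field), str) or not record[field].strip()
    ]

# Hypothetical example record, for illustration only
sample = {
    "prompt": "Explain what a learning rate is.",
    "chosen": "The learning rate controls the size of each weight update.",
    "rejected": "It is a number.",
}

print(missing_dpo_fields(sample))            # []
print(missing_dpo_fields({"prompt": "Hi"}))  # ['chosen', 'rejected']
```

Running the same check over every row of the loaded dataset before training would surface malformed examples early.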

Training arguments:


*   **per_device_train_batch_size**: The number of training examples processed at once on a single GPU. A higher value uses more VRAM.
*   **gradient_accumulation_steps**: Simulates a larger batch size without using more VRAM. Gradients from N small batches are accumulated, and the model weights are updated only after all N have been processed.
*   **num_train_epochs**: The total number of times the trainer will iterate over the entire training dataset.
*   **max_steps**: An alternative to num_train_epochs. It specifies the exact number of model updates to perform and, when set to a positive value, overrides num_train_epochs.
*   **warmup_ratio**: The fraction of training steps over which the learning rate gradually increases from 0 to its target value. This prevents the model from being destabilized by large weight updates at the very beginning.
*   **learning_rate**: The step size for each model update. Too high, and the training becomes unstable; too low, and it learns too slowly.
*   **lr_scheduler_type**: Determines how the learning rate changes after the warmup phase. The default is linear decay.
*   **optim**: The optimization algorithm used to update the model's weights. "adamw_8bit" is a memory-efficient choice for Unsloth on Colab because it keeps the optimizer state in 8-bit precision.
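
To make the interaction between these arguments concrete, the sketch below works through the arithmetic in plain Python: the effective batch size, the number of warmup steps, and the learning rate at each step under linear warmup followed by linear decay. The values mirror the configuration used in this notebook; the helper `linear_lr` is hypothetical, and the formulas are a simplified model of what the trainer does rather than its actual code.

```python
import math

per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 10
warmup_ratio = 0.1
learning_rate = 5e-6

# Examples contributing to each weight update on a single GPU
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# Warmup length as a whole number of steps (rounded up)
warmup_steps = math.ceil(max_steps * warmup_ratio)

def linear_lr(step):
    """Learning rate at an update step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    remaining = max(0, max_steps - step)
    return learning_rate * remaining / max(1, max_steps - warmup_steps)

print("effective batch size:", effective_batch_size)
print("warmup steps:", warmup_steps)
for step in range(max_steps + 1):
    print(f"step {step:2d}: lr = {linear_lr(step):.2e}")
```

With only 10 steps, the warmup covers a single step, so the learning rate reaches its peak almost immediately and then decays for the rest of the run.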

In [None]:
from unsloth import FastLanguageModel, PatchDPOTrainer
from unsloth import is_bfloat16_supported
PatchDPOTrainer()

from transformers import TrainingArguments
from trl import DPOTrainer, DPOConfig

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None, # Unsloth handles this automatically
    args = DPOConfig(
        per_device_train_batch_size = 1, # Use 1 for larger models on free Colab
        gradient_accumulation_steps = 4, # Simulates a larger batch size
        warmup_ratio = 0.1,
        max_steps = 10,       # Overrides num_train_epochs while positive; set to -1 to train by epochs
        num_train_epochs = 1, # A single epoch is often enough for DPO
        learning_rate = 5e-6, # A lower learning rate is recommended for DPO
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 2,
        optim = "adamw_8bit",
        seed = 42,
        output_dir = "outputs_dpo",

        eval_strategy = "steps",       # Evaluate at regular step intervals
        eval_steps = 2,                # Run evaluation every 2 steps
        save_strategy = "steps",       # Checkpoints must be saved for the best one to be reloaded
        save_steps = 2,                # Save on the same schedule as evaluation
        load_best_model_at_end = True, # Load the checkpoint with the lowest eval_loss at the end
        save_total_limit = 2,
    ),
    beta = 0.05,
    train_dataset = train_dataset,
    eval_dataset = eval_dataset,
    tokenizer = tokenizer,
    max_length = 4096,          # Max length of prompt + response; note the model was loaded with max_seq_length = 1024
    max_prompt_length = 2048,   # Max length of the prompt section
)

##Training the model - Understanding the DPO log



*   **Step**: The training update number. Each step represents one update to the model's weights.

*   **Training Loss**: The most important metric. This is the DPO loss, which measures how well the model is learning to prefer the chosen response over the rejected one. Lower is better.

*   **rewards / chosen**: The implicit reward for the chosen (good) response, derived from how much more likely the trained model makes that response than the reference model does. Higher is better.

*   **rewards / rejected**: The implicit reward for the rejected (bad) response. Lower is better.

*   **rewards / accuracies**: The fraction of examples in a batch for which the model gives the chosen response a higher reward than the rejected one.

*   **rewards / margins**: The difference between the chosen and rejected rewards (chosen - rejected). A larger positive margin is better.
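
The relationship between these columns can be sketched in a few lines of plain Python. The example below assumes the standard sigmoid DPO loss, in which each implicit reward is `beta` times the difference between the trained model's and the reference model's log-probability of a response. The log-probabilities are made-up numbers for illustration, and the helper `dpo_metrics` is hypothetical; it is not the trainer's implementation.

```python
import math

def dpo_metrics(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.05):
    """Compute batch-averaged DPO loss and reward statistics from sequence log-probs."""
    losses, chosen_rewards, rejected_rewards = [], [], []
    for pc, pr, rc, rr in zip(policy_chosen, policy_rejected, ref_chosen, ref_rejected):
        reward_c = beta * (pc - rc)  # implicit reward of the chosen response
        reward_r = beta * (pr - rr)  # implicit reward of the rejected response
        margin = reward_c - reward_r
        # -log(sigmoid(margin)), written as log(1 + exp(-margin))
        losses.append(math.log1p(math.exp(-margin)))
        chosen_rewards.append(reward_c)
        rejected_rewards.append(reward_r)
    n = len(losses)
    pairs = list(zip(chosen_rewards, rejected_rewards))
    return {
        "loss": sum(losses) / n,
        "rewards/chosen": sum(chosen_rewards) / n,
        "rewards/rejected": sum(rejected_rewards) / n,
        "rewards/accuracies": sum(c > r for c, r in pairs) / n,
        "rewards/margins": sum(c - r for c, r in pairs) / n,
    }

# Hypothetical log-probabilities for a batch of two preference pairs
metrics = dpo_metrics(
    policy_chosen=[-10.0, -20.0],
    policy_rejected=[-15.0, -18.0],
    ref_chosen=[-12.0, -19.0],
    ref_rejected=[-13.0, -19.0],
)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

A useful sanity check under these assumptions: when the trained model still matches the reference model, every reward is 0 and the loss equals ln 2, roughly 0.693, so a training loss that falls below that value indicates the model has started to separate chosen from rejected responses.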

In [None]:
dpo_trainer.train()

In [None]:
from transformers import TextStreamer

# --- After dpo_trainer.train() has completed ---

# 1. Optimize model for inference
# This enables Unsloth's native 2x faster inference for generation
FastLanguageModel.for_inference(model)

# 2. Prepare your input prompt using the chat template
# The tokenizer automatically knows the correct chat format for the Qwen3 model.
messages = [
    {"role": "user", "content": "?"},
]

# Apply the chat template to format the messages into a single string
# `tokenize=False` returns the string, `add_generation_prompt=True` adds the final assistant prompt part
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking = True)

print("--- Your Input Prompt ---")
print(prompt)

# 3. Tokenize the input and move to GPU
inputs = tokenizer(
    [prompt],
    return_tensors = "pt"
).to("cuda") # Ensure tensors are on the GPU

# 4. Generate a response using TextStreamer for streaming output
print("\n--- Model's Response (Streaming) ---")
text_streamer = TextStreamer(tokenizer)

_ = model.generate(
    **inputs,
    streamer = text_streamer,
    max_new_tokens = 1024, # Adjust the maximum number of tokens to generate
    use_cache = True,
    pad_token_id = tokenizer.eos_token_id,
    eos_token_id = tokenizer.eos_token_id,
    # Optional generation parameters for more creative/diverse outputs:
    # do_sample=True,
    # temperature=0.7,
    # top_p=0.9,
    # repetition_penalty=1.1,
)