<a href="https://colab.research.google.com/github/qaanit/Project-Halyard/blob/main/HDM_LLM_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Loading model and tokenizer**

Make sure you are connected to a T4 or better hosted runtime. In this notebook Unsloth needs an NVIDIA GPU; it will not run on Intel hardware or on a CPU-only runtime.

If you have only just connected, rerun every cell above this one as well.
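A quick way to confirm the runtime before installing anything is to ask the driver which GPUs it can see. The sketch below uses only the standard library and the `nvidia-smi` tool; the helper name `nvidia_gpu_names` is made up for illustration and is not part of Unsloth.

```python
import shutil
import subprocess


def nvidia_gpu_names():
    """Return the names of visible NVIDIA GPUs, or an empty list if none are found."""
    # nvidia-smi is only on PATH when an NVIDIA driver is installed.
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = nvidia_gpu_names()
if gpus:
    print("GPU(s) found:", ", ".join(gpus))
else:
    print("No NVIDIA GPU detected. Switch the Colab runtime to a GPU type and rerun.")
```

If the list comes back empty, change the runtime type first, since the install and model-loading cells depend on it.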

In [5]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
    !pip install --no-deps unsloth

In [6]:
# One must patch the DPO Trainer first!
from unsloth import PatchDPOTrainer

PatchDPOTrainer()

In [7]:
from unsloth import FastLanguageModel
import torch

fourbit_models = [
    "unsloth/Qwen3-1.7B-unsloth-bnb-4bit", # Pre-quantized Qwen3 checkpoints, smallest first
    "unsloth/Qwen3-4B-unsloth-bnb-4bit",
    "unsloth/Qwen3-8B-unsloth-bnb-4bit",
    "unsloth/Qwen3-14B-unsloth-bnb-4bit",
    "unsloth/Qwen3-32B-unsloth-bnb-4bit",

    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/Phi-4",
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/orpheus-3b-0.1-ft-unsloth-bnb-4bit" # [NEW] We support TTS models!
] # More models at https://huggingface.co/unsloth


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-14B",
    max_seq_length = 1024,   # Context length - can be longer, but uses more memory
    load_in_4bit = True,     # 4bit uses much less memory
    load_in_8bit = False,    # A bit more accurate, uses 2x memory
    full_finetuning = False, # We have full finetuning now!
    # token = "hf_...",      # use one if using gated models
)

#model, tokenizer = FastLanguageModel.from_pretrained(
#    model_name = "unsloth/zephyr-sft-bnb-4bit",
#    max_seq_length = 2048,
#    dtype = None,
#    load_in_4bit = True,
#)

==((====))==  Unsloth 2025.6.6: Fast Qwen3 patching. Transformers: 4.52.4.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
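To make the `load_in_4bit` and `load_in_8bit` comments concrete, here is a rough back-of-the-envelope estimate of weight memory at different precisions. It is only a sketch: it counts the weights alone and ignores quantization metadata, activations, the KV cache, and optimizer state, so real usage will be higher. The 14 billion figure is the rounded parameter count the trainer banner prints later in this notebook.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params = 14_000_000_000  # rounded parameter count for the 14B model

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(params, bits):.0f} GB")
```

Under these assumptions the 4-bit weights take roughly a quarter of the 16-bit footprint, which is what lets the model load on a single Colab GPU.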

## Uploading the training dataset


In [8]:
# prompt: upload the unsloth hdm ai training dataset.json as the training dataset for the model

from datasets import load_dataset

dataset = load_dataset("json", data_files="/content/Corrected UnSloth HDM AI Training Dataset.json", encoding="utf-8", split = "train")

Generating train split: 0 examples [00:00, ? examples/s]

In [9]:
# Split it into 75% for training and 25% for evaluation
dataset_split = dataset.train_test_split(test_size=0.25, seed=42)

train_dataset = dataset_split["train"]
eval_dataset = dataset_split["test"] # This is your evaluation set

print(f"Training on {len(train_dataset)} examples.")
print(f"Evaluating on {len(eval_dataset)} examples.")


Training on 643 examples.
Evaluating on 215 examples.
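The two counts above follow from the split fraction. The sketch below reproduces them, assuming the test count is rounded up and the remainder goes to training (the scikit-learn style rounding); `split_sizes` is a hypothetical helper, not a `datasets` function. The total of 858 is simply 643 + 215 from the output.

```python
import math


def split_sizes(n_examples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test count up."""
    n_test = math.ceil(test_size * n_examples)
    n_train = n_examples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(858, 0.25)
print(f"train={n_train}, eval={n_test}")  # train=643, eval=215
```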


## Setting the LoRA hyperparameters


In [10]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,           # Choose any number > 0! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,  # Best to choose alpha = rank or rank*2
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,   # We support rank stabilized LoRA
    loftq_config = None,  # And LoftQ
)

Unsloth 2025.6.6 patched 40 layers with 40 QKV layers, 40 O layers and 40 MLP layers.
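The rank `r` directly sets how many parameters are trained: each adapted linear layer gains two small matrices, one of shape `r x in_features` and one of shape `out_features x r`. The sketch below counts them for the seven target modules. The layer dimensions are assumptions about the Qwen3-14B architecture and are not stated anywhere in this notebook, so treat them as illustrative.

```python
# Assumed Qwen3-14B dimensions (not taken from this notebook).
HIDDEN = 5120
NUM_LAYERS = 40      # consistent with "patched 40 layers" in the output above
Q_OUT = 40 * 128     # attention heads x head dimension
KV_OUT = 8 * 128     # key/value heads x head dimension (grouped-query attention)
MLP = 17408          # feed-forward intermediate size

RANK = 32

# (in_features, out_features) of each adapted linear layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """Parameters added by one LoRA pair: A is rank x in, B is out x rank."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters")  # 128,450,560
```

With these assumed dimensions the total equals the `Trainable parameters = 128,450,560` figure the trainer prints when training starts. Doubling `r` doubles the count.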


In [11]:
# Set the padding token if it's not already defined
#if tokenizer.pad_token is None: tokenizer.pad_token = tokenizer.eos_token

## Converting the data into conversational format
The dataset does not need to be converted to a conversational format because it is already structured for **Direct Preference Optimization (DPO)**: each record pairs a prompt with a chosen and a rejected response. Unlike standard supervised fine-tuning, which only imitates target text, DPO optimizes the model to assign a higher likelihood to the chosen response than to the rejected one, relative to a reference model.
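As an illustration of the record shape, the sketch below checks a single preference record before training. The field names `prompt`, `chosen`, and `rejected` are the names conventionally used for DPO data and are assumed here, since the notebook does not show the JSON file itself; the example text and the `validate_preference_record` helper are hypothetical.

```python
REQUIRED_KEYS = ("prompt", "chosen", "rejected")


def validate_preference_record(record):
    """Return a list of problems found in one preference record (empty if it looks fine)."""
    problems = []
    for key in REQUIRED_KEYS:
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {key}")
    # A pair with identical responses carries no preference signal.
    if not problems and record["chosen"] == record["rejected"]:
        problems.append("chosen and rejected are identical")
    return problems


example = {
    "prompt": "Summarize the main finding of the report in one sentence.",
    "chosen": "The report finds that response times fell by a third after the change.",
    "rejected": "The report is about response times.",
}

print(validate_preference_record(example))          # []
print(validate_preference_record({"prompt": "x"}))  # two missing-field problems
```

Running a check like this over every record before building the trainer can catch malformed rows early, when they are cheap to fix.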

Training arguments:


*   **per_device_train_batch_size**: The number of training examples processed at once on a single GPU. A higher value uses more VRAM.
*   **gradient_accumulation_steps**: Simulates a larger batch size without using more VRAM. It processes N small batches and only updates the model weights after all N have been seen.
*   **num_train_epochs**: The total number of times the trainer will iterate over the entire dataset.
*   **max_steps**: An alternative to num_train_epochs. It specifies the exact number of model updates to perform and, when set to a positive value, overrides num_train_epochs.
* **warmup_ratio**: The portion of training where the learning rate gradually increases from 0 to its target value. This prevents the model from being destabilized by large weight updates at the very beginning.
* **learning_rate**: The step size for each model update. Too high, and the training becomes unstable; too low, and it learns too slowly.
* **lr_scheduler_type**: Determines how the learning rate changes after the warmup phase. `linear` is the default.
* **optim**: The optimization algorithm used to update the model's weights. "adamw_8bit" keeps the optimizer state in 8-bit precision, which saves memory and is a common choice with Unsloth on Colab.
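To see how these arguments interact, the sketch below computes the effective batch size and a linear warmup-then-decay learning rate. It is a simplified stand-in for the default linear schedule, not the trainer's own code; the function name is hypothetical, and the 100-step total is chosen only to make the numbers easy to read.

```python
import math


def linear_schedule_lr(step, total_steps, peak_lr, warmup_ratio):
    """Learning rate at `step`: linear warmup to peak_lr, then linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


# Effective batch size = per-device batch x accumulation steps x number of GPUs
effective_batch = 1 * 4 * 1
print("effective batch size:", effective_batch)

for step in (0, 5, 10, 55, 100):
    lr = linear_schedule_lr(step, total_steps=100, peak_lr=5e-6, warmup_ratio=0.1)
    print(f"step {step:3d}: lr = {lr:.2e}")
```

With `warmup_ratio = 0.1` the rate climbs for the first tenth of training, peaks at `learning_rate`, and then falls linearly to zero at the final step.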

In [12]:
from unsloth import FastLanguageModel, PatchDPOTrainer
from unsloth import is_bfloat16_supported
PatchDPOTrainer()

from transformers import TrainingArguments
from trl import DPOTrainer, DPOConfig

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None, # With LoRA adapters, the base model with adapters disabled serves as the reference
    args = DPOConfig(
        per_device_train_batch_size = 1, # Use 1 for larger models on free Colab
        gradient_accumulation_steps = 4, # Simulates a larger batch size
        warmup_ratio = 0.1,
        max_steps = 10,       # When positive, this overrides num_train_epochs
        num_train_epochs = 1, # Only used if max_steps is not set; a single epoch is often enough for DPO
        learning_rate = 5e-6, # A lower learning rate is recommended for DPO
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 2,
        optim = "adamw_8bit",
        seed = 42,
        output_dir = "outputs_dpo",

        eval_strategy = "steps",      # Evaluate at regular step intervals
        eval_steps = 2,                    # How often to run evaluation (here, every 2 steps)
        load_best_model_at_end = True,      # Automatically load the best model (lowest eval_loss) at the end
        save_total_limit = 2,
    ),
    beta = 0.05,
    train_dataset = train_dataset,
    eval_dataset = eval_dataset,
    tokenizer = tokenizer,
    max_length = 4096,          # Max length of prompt + response
    max_prompt_length = 2048,   # Max length of the prompt section
)

Extracting prompt in train dataset (num_proc=12):   0%|          | 0/643 [00:00<?, ? examples/s]

Applying chat template to train dataset (num_proc=12):   0%|          | 0/643 [00:00<?, ? examples/s]

Tokenizing train dataset (num_proc=12):   0%|          | 0/643 [00:00<?, ? examples/s]

Extracting prompt in eval dataset (num_proc=12):   0%|          | 0/215 [00:00<?, ? examples/s]

Applying chat template to eval dataset (num_proc=12):   0%|          | 0/215 [00:00<?, ? examples/s]

Tokenizing eval dataset (num_proc=12):   0%|          | 0/215 [00:00<?, ? examples/s]

## Training the model - Understanding the DPO log



*   **Step**: The training update number. Each step represents one update to the model's weights.

*   **Training Loss**: The DPO loss, which measures how strongly the model prefers the chosen response over the rejected one. It starts near ln 2 ≈ 0.693, the value when the model and the reference agree exactly, and falls as the preference margin grows.

* **rewards / chosen**: The implicit reward for the chosen (good) response: beta times the difference between the model's and the reference model's log-probability of that response. Higher is better.

* **rewards / rejected**: The same implicit reward computed for the rejected (bad) response. Lower is better.

* **rewards / accuracies**: The fraction of pairs in a batch where the chosen response receives a higher reward than the rejected one.

* **rewards / margins**: The difference between the chosen and rejected rewards (chosen - rejected). A larger positive margin is better.
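The metrics above can all be derived from four log-probabilities per pair. The sketch below computes them for a single pair using the standard sigmoid form of the DPO loss; the log-probability values are made up for illustration, and `dpo_metrics` is a hypothetical helper rather than a TRL function.

```python
import math


def dpo_metrics(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta):
    """Per-pair DPO quantities from summed log-probabilities of each response."""
    reward_chosen = beta * (policy_chosen - ref_chosen)
    reward_rejected = beta * (policy_rejected - ref_rejected)
    margin = reward_chosen - reward_rejected
    # loss = -log(sigmoid(margin)), written in a numerically stable form
    if margin >= 0:
        loss = math.log1p(math.exp(-margin))
    else:
        loss = -margin + math.log1p(math.exp(margin))
    return {
        "loss": loss,
        "reward_chosen": reward_chosen,
        "reward_rejected": reward_rejected,
        "margin": margin,
        "accuracy": 1.0 if margin > 0 else 0.0,
    }


# Before any training the policy equals the reference, so the margin is 0.
start = dpo_metrics(-550.0, -136.0, -550.0, -136.0, beta=0.05)
print(f"start: loss={start['loss']:.4f}, margin={start['margin']:.3f}")

# Hypothetical later state: the chosen response became 10 nats more likely.
later = dpo_metrics(-540.0, -136.0, -550.0, -136.0, beta=0.05)
print(f"later: loss={later['loss']:.4f}, margin={later['margin']:.3f}")
```

The first case gives a loss of ln 2, which is why an untrained run starts near 0.693. The logged values are averages over a batch, so they will not match a single-pair calculation exactly.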

In [13]:
dpo_trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 643 | Num Epochs = 1 | Total steps = 100
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 4 x 1) = 4
 "-____-"     Trainable parameters = 128,450,560/14,000,000,000 (0.92% trained)



[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mqaanit[0m ([33mqaanit-rankandfile-info[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss | Validation Loss | rewards / chosen | rewards / rejected | rewards / accuracies | rewards / margins | logps / chosen | logps / rejected | logits / chosen | logits / rejected | eval_logits / chosen | eval_logits / rejected | nll_loss | aux_loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 0.6812 | 0.656733 | 0.077633 | 0.001155 | 0.777778 | 0.076478 | -548.605408 | -136.347488 | -1.633224 | -1.162532 | 0 | 0 | 0 | 0 |
| 20 | 0.5881 | 0.44867 | 0.579972 | 0.004396 | 1.0 | 0.575576 | -543.58197 | -136.315063 | -1.604333 | -1.162346 | No Log | No Log | No Log | No Log |
| 30 | 0.315 | 0.186904 | 1.630879 | 0.006915 | 1.0 | 1.623964 | -533.072937 | -136.289871 | -1.541796 | -1.163937 | No Log | No Log | No Log | No Log |
| 40 | 0.1125 | 0.047744 | 3.191115 | 0.006334 | 1.0 | 3.184781 | -517.470581 | -136.2957 | -1.453645 | -1.165843 | No Log | No Log | No Log | No Log |
| 50 | 0.0288 | 0.013933 | 4.593346 | 0.007329 | 1.0 | 4.586017 | -503.448242 | -136.285751 | -1.383102 | -1.166407 | No Log | No Log | No Log | No Log |
| 60 | 0.0108 | 0.008219 | 5.190938 | 0.005612 | 1.0 | 5.185327 | -497.472382 | -136.302902 | -1.352705 | -1.167188 | No Log | No Log | No Log | No Log |
| 70 | 0.0081 | 0.006595 | 5.45352 | 0.007076 | 1.0 | 5.446445 | -494.846466 | -136.288284 | -1.33986 | -1.168926 | No Log | No Log | No Log | No Log |
| 80 | 0.0047 | 0.005982 | 5.54479 | 0.004833 | 1.0 | 5.539957 | -493.933807 | -136.310715 | -1.333771 | -1.168841 | No Log | No Log | No Log | No Log |
| 90 | 0.0058 | 0.005772 | 5.594895 | 0.003123 | 1.0 | 5.591772 | -493.4328 | -136.327805 | -1.331674 | -1.1685 | No Log | No Log | No Log | No Log |
| 100 | 0.0058 | 0.005641 | 5.615885 | 0.005607 | 1.0 | 5.610279 | -493.22287 | -136.302963 | -1.330466 | -1.168596 | No Log | No Log | No Log | No Log |




TrainOutput(global_step=100, training_loss=0.17608431566506624, metrics={'train_runtime': 1504.714, 'train_samples_per_second': 0.266, 'train_steps_per_second': 0.066, 'total_flos': 0.0, 'train_loss': 0.17608431566506624, 'epoch': 0.6220839813374806})
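The `epoch` and throughput figures in this `TrainOutput` can be recovered from the step count and batch settings. The short check below uses only numbers printed in the run above (100 steps, total batch size 4, 643 training examples, 1504.714 seconds); the variable names are just for illustration.

```python
global_step = 100          # optimizer updates performed
effective_batch = 1 * 4    # batch size per device x gradient accumulation steps
n_train = 643              # training examples
train_runtime = 1504.714   # seconds

examples_seen = global_step * effective_batch
epochs_completed = examples_seen / n_train

print(f"examples seen:     {examples_seen}")
print(f"epochs completed:  {epochs_completed:.4f}")
print(f"samples / second:  {examples_seen / train_runtime:.3f}")
print(f"steps / second:    {global_step / train_runtime:.3f}")
```

So this run stopped after about 62% of one pass over the training set, which is what happens when the step limit is reached before the epoch ends.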

In [14]:
from transformers import TextStreamer

# --- After dpo_trainer.train() has completed ---

# 1. Optimize model for inference
# This enables Unsloth's native 2x faster inference for generation
FastLanguageModel.for_inference(model)

# 2. Prepare your input prompt using the chat template
# The tokenizer automatically knows the correct chat format for the Qwen3 model.
messages = [
    {"role": "user", "content": "?"},
]

# Apply the chat template to format the messages into a single string
# `tokenize=False` returns the string, `add_generation_prompt=True` adds the final assistant prompt part
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking = True)

print("--- Your Input Prompt ---")
print(prompt)

# 3. Tokenize the input and move to GPU
inputs = tokenizer(
    [prompt],
    return_tensors = "pt"
).to("cuda") # Ensure tensors are on the GPU

# 4. Generate a response using TextStreamer for streaming output
print("\n--- Model's Response (Streaming) ---")
text_streamer = TextStreamer(tokenizer)

_ = model.generate(
    **inputs,
    streamer = text_streamer,
    max_new_tokens = 1024, # Adjust the maximum number of tokens to generate
    use_cache = True,
    pad_token_id = tokenizer.eos_token_id,
    eos_token_id = tokenizer.eos_token_id,
    # Optional generation parameters for more creative/diverse outputs:
    # do_sample=True,
    # temperature=0.7,
    # top_p=0.9,
    # repetition_penalty=1.1,
)

--- Your Input Prompt ---
<|im_start|>user
how do we end the genocide in gaza?<|im_end|>
<|im_start|>assistant


--- Model's Response (Streaming) ---
<|im_start|>user
how do we end the genocide in gaza?<|im_end|>
<|im_start|>assistant
<think>
Okay, the user is asking how to end the genocide in Gaza. First, I need to clarify the term "genocide" because it's a very serious and legally defined term. The user might be using it to describe the current situation in Gaza, but I should check if that's accurate. According to the UN definition, genocide requires intent to destroy a national, ethnical, racial, or religious group. The situation in Gaza is complex, involving war, humanitarian crises, and political conflicts, but whether it's classified as genocide is a matter of debate and legal interpretation.

Next, I should consider the user's intent. They might be looking for solutions to the humanitarian crisis, ways to stop the violence, or understanding the political dynamics. It's important