# Look into the Open-Platypus data

In [2]:
from datasets import load_dataset
import pandas as pd

# Load a small sample
dataset = load_dataset("garage-bAInd/Open-Platypus", split="train[:1%]")

# View keys and one sample
print("Available fields:", dataset.column_names)
print("\nSample row:")
print(dataset[0])

# Convert to DataFrame for easier inspection (optional)
df = pd.DataFrame(dataset)
print("\nDataFrame shape:", df.shape)
display(df.head())


Available fields: ['input', 'output', 'instruction', 'data_source']

Sample row:
{'input': '', 'output': 'To find the probability of the spinner landing on $C$, I need to subtract the probabilities of the spinner landing on $A$ and $B$ from $1$, since the sum of the probabilities of all possible outcomes is $1$. I can write this as an equation: $P(C) = 1 - P(A) - P(B)$. I know that $P(A) = \\frac{1}{3}$ and $P(B) = \\frac{5}{12}$, so I can plug those values into the equation and simplify. I get: $P(C) = 1 - \\frac{1}{3} - \\frac{5}{12} = \\frac{12}{12} - \\frac{4}{12} - \\frac{5}{12} = \\frac{3}{12}$. I can reduce this fraction by dividing the numerator and denominator by $3$, and I get: $P(C) = \\frac{1}{4}$. ', 'instruction': 'A board game spinner is divided into three parts labeled $A$, $B$  and $C$. The probability of the spinner landing on $A$ is $\\frac{1}{3}$ and the probability of the spinner landing on $B$ is $\\frac{5}{12}$.  What is the probability of the spinner landing on 

| | input | output | instruction | data_source |
|---|---|---|---|---|
| 0 | | To find the probability of the spinner landing... | A board game spinner is divided into three par... | MATH/PRM-800K |
| 1 | | I need to choose 6 people out of 14, and the o... | My school's math club has 6 boys and 8 girls. ... | MATH/PRM-800K |
| 2 | | First we count the number of all 4-letter word... | How many 4-letter words with at least one cons... | MATH/PRM-800K |
| 3 | | She can do this if and only if at least one of... | Melinda will roll two standard six-sided dice ... | MATH/PRM-800K |
| 4 | | Think of the problem as a sequence of H's and ... | Let $p$ be the probability that, in the proces... | MATH/PRM-800K |


instruction: the prompt or question (what the user would ask).

input: optional extra context or data; it is empty in every row shown above.

output: the desired response that the model is trained to produce.

data_source: the dataset the sample was drawn from (e.g., MATH/PRM-800K).
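The test cells in this notebook build prompts as `### Instruction:\n{instruction}\n### Response:\n`. The following is a minimal sketch of how one dataset row might be mapped onto that template. The helper name `format_prompt` and the `### Input:` section used for a non-empty `input` field are assumptions for illustration; the notebook itself never shows a row with a non-empty `input`.

```python
def format_prompt(instruction: str, input_text: str = "") -> str:
    """Build an instruction/response prompt from one row (hypothetical helper)."""
    prompt = f"### Instruction:\n{instruction}\n"
    # Assumption: optional context gets its own section when it is non-empty.
    if input_text.strip():
        prompt += f"### Input:\n{input_text}\n"
    return prompt + "### Response:\n"


# Illustrative row with the same field names as the dataset.
row = {"instruction": "What is 2 + 2?", "input": "", "output": "4"}
prompt = format_prompt(row["instruction"], row["input"])
print(prompt)
```

With an empty `input`, the result matches the prompt string used in the test cells, so the same helper could serve both training-data formatting and inference.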

# Test the supervised fine-tuned model

In [1]:
import torch
torch.cuda.empty_cache()

In [1]:
import torch
from unsloth import FastLanguageModel
from transformers import BitsAndBytesConfig

# Paths
base_model    = "unsloth/DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit"
adapter_dir   = "./platypus_supervised_fine_tuning/sft-lora"
tokenizer_dir = "./platypus_supervised_fine_tuning/tokenizer"

# Create quantization config with CPU offloading
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    llm_int8_enable_fp32_cpu_offload=True,  # let modules that do not fit on the GPU stay on the CPU in fp32
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

# Load model with quantization config
model, tokenizer = FastLanguageModel.from_pretrained(
    base_model,
    max_seq_length=1024,
    quantization_config=quantization_config,  # Use config here
    device_map="auto",
)

# Load LoRA adapter weights
model.load_adapter(adapter_dir)

# Prompt
instruction = "Explain reinforcement learning in simple terms."
prompt = f"### Instruction:\n{instruction}\n### Response:\n"

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200)

print("\nGenerated:\n", tokenizer.decode(outputs[0], skip_special_tokens=True))

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.



🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.7.11: Fast Llama patching. Transformers: 4.54.1.
   \\   /|    NVIDIA GeForce RTX 3070 Laptop GPU. Num GPUs = 1. Max memory: 7.664 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.1+cu126. CUDA: 8.6. CUDA Toolkit: 12.6. Triton: 3.3.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.31.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!

Generated:
 ### Instruction:
Explain reinforcement learning in simple terms.
### Response:
Reinforcement learning (RL) is a type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize some notion of cumulative reward. The agent doesn't need to be told what to do; instead, it learns through trial and error, receiving rewards or penalties for its actions. Over time, the agent improves its polici
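The decoded text above echoes the prompt before the answer, because `generate` returns the prompt tokens followed by the new ones. A small sketch of one way to keep only the answer is shown below; the helper name `extract_response` is hypothetical, and it assumes the `### Response:` marker from the prompt template appears in the decoded string.

```python
def extract_response(full_text: str, marker: str = "### Response:") -> str:
    """Return the text after the first marker, or the whole text if the marker is absent."""
    _, found, tail = full_text.partition(marker)
    return tail.strip() if found else full_text.strip()


# Illustrative string shaped like the decoded output, not real model output.
decoded = "### Instruction:\nExplain X.\n### Response:\nX is a thing."
print(extract_response(decoded))  # X is a thing.
```

Using `str.partition` avoids the `IndexError` handling that a `split(...)[1]` lookup needs when the marker is missing.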

# Test the DPO fine-tuned model

In [2]:
import torch
from unsloth import FastLanguageModel

# Paths
base_model    = "unsloth/DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit"
sft_adapter   = "./platypus_supervised_fine_tuning/sft-lora"
dpo_adapter   = "./platypus_supervised_fine_tuning/dpo_final"
tokenizer_dir = "./platypus_supervised_fine_tuning/dpo_final/tokenizer"

# Load base model with CPU offloading
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = base_model,
    max_seq_length = 1024,
    dtype = torch.bfloat16,
    load_in_4bit = True,
    device_map = "auto",
    max_memory = {0: "6GB", "cpu": "20GB"}  # cap GPU use below the 8 GB card's capacity; spill the rest to CPU RAM
)

# First load SFT adapter with unique name
model.load_adapter(sft_adapter, adapter_name="sft")

# Then load DPO adapter with unique name
model.load_adapter(dpo_adapter, adapter_name="dpo")

# Set the DPO adapter as active
model.set_adapter("dpo")

# Prompt
instruction = "Explain reinforcement learning in simple terms."
prompt = f"### Instruction:\n{instruction}\n### Response:\n"

# Generate response
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200)

# Print just the response portion
full_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
try:
    response = full_output.split("### Response:")[1].strip()
    print("\nDPO-Tuned Response:\n" + response)
except IndexError:
    print("\nFull Output:\n" + full_output)

# Compare with SFT model only
print("\n======== SFT Model Response ========")
model.set_adapter("sft")  # Switch to SFT adapter
sft_outputs = model.generate(**inputs, max_new_tokens=200)
sft_full = tokenizer.decode(sft_outputs[0], skip_special_tokens=True)
try:
    sft_response = sft_full.split("### Response:")[1].strip()
    print(sft_response)
except IndexError:
    print(sft_full)


DPO-Tuned Response:
Reinforcement learning (RL) is a type of machine learning where an agent learns to make decisions by performing actions and learning from the rewards or penalties it receives. The agent's goal is to maximize its cumulative reward over time. The environment provides feedback in the form of rewards (good outcomes) or penalties (bad outcomes), and the agent uses this feedback to adjust its behavior through a process of trial and error.

======== SFT Model Response ========

Reinforcement learning (RL) is a type of machine learning where an agent learns to make decisions by performing actions and learning from the rewards or penalties it receives. The agent's goal is to maximize the cumulative reward over time. It operates in an environment where the agent interacts with the world, and each interaction results in a state and a reward. The agent's actions determine the next state, and the reward tells the agent how good or bad the action was. Over time, the agent learns which actions lead to higher rewards,
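
The cell above repeats the same generate-and-decode steps once per adapter. A hedged sketch of how that comparison could be written as a loop follows. The function `compare_adapters` and its `activate` and `generate` callables are hypothetical; in the notebook they would wrap `model.set_adapter(name)` and the `model.generate` plus `tokenizer.decode` calls. Stand-ins are used here so the example runs without a GPU.

```python
from typing import Callable, Dict, Iterable


def compare_adapters(
    adapter_names: Iterable[str],
    activate: Callable[[str], None],
    generate: Callable[[], str],
) -> Dict[str, str]:
    """Activate each adapter in turn and collect the text it generates."""
    results: Dict[str, str] = {}
    for name in adapter_names:
        activate(name)  # in the notebook: model.set_adapter(name)
        results[name] = generate()
    return results


# Stand-ins for the real model calls (assumptions, for illustration only).
state = {"active": None}
responses = compare_adapters(
    ["dpo", "sft"],
    activate=lambda name: state.update(active=name),
    generate=lambda: f"response from {state['active']}",
)
for name, text in responses.items():
    print(f"{name}: {text}")
```

Keeping the prompt and generation settings fixed inside `generate` means any difference between the collected responses comes from the active adapter, or from sampling if sampling is enabled.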