# Zephyr 7B SFT with Unsloth
This notebook first sets up PPO fine-tuning of Zephyr 7B with TRL and an LLM judge, then demonstrates how to use Unsloth to run the Zephyr 7B SFT model.

In [1]:
!pip install -U bitsandbytes
!pip install trl peft accelerate



In [10]:
import torch
from transformers import AutoTokenizer
# Using old (stable) API - works until TRL 0.29.0
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from peft import LoraConfig

### Configuration
We use PPOConfig to manage hyperparameters. For a 7B model on a single GPU, we keep the batch size at 1; raise gradient_accumulation_steps if you need a larger effective batch. kl_coef is set to 0.05 to penalize the policy for drifting too far from the original language distribution.

In [11]:
# We define the model name as a simple string variable now
model_id = "HuggingFaceH4/zephyr-7b-beta"

# PPOConfig for TRL 0.26+ (uses new API even from old import path)
config = PPOConfig(
    exp_name="ppo_zephyr_direct_rlaif",
    learning_rate=1.41e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    num_ppo_epochs=4,
    num_mini_batches=1,
    local_rollout_forward_batch_size=1,
    response_length=128,
    temperature=0.7,
    kl_coef=0.05,
    cliprange=0.2,
    vf_coef=0.1,
    cliprange_value=0.2,
    gamma=1.0,
    lam=0.95,
    output_dir="./ppo_output",
    seed=0,
    logging_steps=1,
)
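
As a rough illustration of what `kl_coef` controls, the sketch below applies a per-token KL penalty to a scalar judge score. It is a simplified, assumed formulation and not TRL's internal code: the function name `kl_penalized_reward` and the sample log-probabilities are hypothetical.

```python
def kl_penalized_reward(score, policy_logprobs, ref_logprobs, kl_coef=0.05):
    """Return per-token rewards: a KL penalty on every token, plus the score on the last one."""
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("policy and reference log-probs must have the same length")
    if not policy_logprobs:
        raise ValueError("at least one generated token is required")
    # Penalize each token in proportion to how far the policy moved from the reference.
    rewards = [-kl_coef * (p - r) for p, r in zip(policy_logprobs, ref_logprobs)]
    # The scalar score from the judge is credited to the final token of the response.
    rewards[-1] += score
    return rewards


# Hypothetical log-probabilities for a three-token response.
policy_lp = [-1.0, -0.5, -2.0]
ref_lp = [-1.2, -0.5, -1.0]
rewards = kl_penalized_reward(1.0, policy_lp, ref_lp, kl_coef=0.05)
print(rewards)
```

A larger `kl_coef` makes the penalty term dominate, which keeps the policy closer to the reference model at the cost of slower reward improvement.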



### Load Model with Value Head
Standard PPO requires a value model (a critic) to estimate the expected reward of a token sequence. Older TRL releases provided this through AutoModelForCausalLMWithValueHead; with the newer API used here, the policy and value models are loaded separately. Both are loaded in 4-bit (load_in_4bit=True in BitsAndBytesConfig) to fit on consumer hardware (e.g., T4 or A10).

In [12]:
# LoRA configuration
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

# Load Model using standard transformers
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantization config for 4-bit loading
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load policy model (TRL 0.26+ needs separate policy and value models)
policy_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Load value model (same architecture)
value_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
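
To see why 4-bit loading matters when two copies of the model are held in memory, here is a back-of-the-envelope estimate. It counts weights only (no quantization constants, LoRA parameters, optimizer state, activations, or KV cache), and the parameter count of roughly 7.2 billion is an assumption for a Mistral-7B-sized model; `quantized_weight_gib` is a hypothetical helper.

```python
def quantized_weight_gib(num_params, bits_per_param):
    """Approximate weight memory in GiB for a given parameter count and bit width."""
    return num_params * bits_per_param / 8 / 1024**3


approx_params = 7.2e9  # assumption: rough size of a Mistral-7B-class model

per_model_4bit = quantized_weight_gib(approx_params, 4)
per_model_bf16 = quantized_weight_gib(approx_params, 16)

# The notebook loads both a policy model and a value model.
print(f"4-bit, one model:  {per_model_4bit:.1f} GiB")
print(f"4-bit, two models: {2 * per_model_4bit:.1f} GiB")
print(f"bf16, one model:   {per_model_bf16:.1f} GiB")
```

Real usage will be higher than these figures, so treat them as a lower bound when choosing a GPU.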


### Initialize PPO Trainer
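We pass reward_model=None because the rewards are meant to come from an LLM judge in the training loop rather than from a trained reward model. The PPOTrainer API used here does not accept this: as the output below shows, initialization fails with an AttributeError, most likely because the trainer tries to move the missing reward model to the device.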

In [13]:
# Create a dummy dataset (required by new API in TRL 0.26+)
from datasets import Dataset

dummy_data = {
    "query": ["This is a placeholder query"],
    "response": ["This is a placeholder response"],
}
train_dataset = Dataset.from_dict(dummy_data)

# Initialize PPO Trainer (TRL 0.26+ uses new API even from trl.trainer import)
ppo_trainer = PPOTrainer(
    args=config,  # 'config' renamed to 'args' in new API
    processing_class=tokenizer,  # 'tokenizer' renamed to 'processing_class'
    model=policy_model,
    ref_model=None,
    reward_model=None,  # We'll use manual LLM judge
    train_dataset=train_dataset,
    value_model=value_model,
    peft_config=peft_config,  # Apply LoRA to both models
)

  ppo_trainer = PPOTrainer(


AttributeError: 'NoneType' object has no attribute 'to'

### Define the LLM Judge

This is where you swap in your API call (OpenAI, Anthropic, or a local judge model). The function must return a scalar value (float).

In [None]:
def get_llm_judge_score(prompt_text, response_text):
    """
    Computes a reward score for the response.
    Replace the mock logic below with an API call to GPT-4/Claude.
    """
    
    # --- MOCK JUDGE LOGIC (Replace this) ---
    # Example: Reward length. Longer = Better (just for testing mechanics)
    score = len(response_text) / 100.0 
    
    # --- REAL JUDGE LOGIC EXAMPLE ---
    # judge_prompt = f"Rate this answer (1-10) for the question '{prompt_text}': {response_text}"
    # api_response = openai.ChatCompletion.create(...)
    # score = extract_score(api_response) / 10.0 # Normalize to 0.0 - 1.0 range
    
    return float(score)
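
The commented-out judge logic calls an `extract_score` helper that the notebook never defines. One possible sketch is shown below. It assumes the judge's reply has already been reduced to plain text (how to get that text depends on the API client you use) and that the rating is the first number in the reply; the fallback value for unparseable replies is also an assumption.

```python
import re


def extract_score(judge_reply, low=1.0, high=10.0, default=1.0):
    """Return the first number found in the judge's reply, clamped to [low, high]."""
    match = re.search(r"\d+(?:\.\d+)?", judge_reply)
    if match is None:
        # Assumption: treat an unparseable reply as the lowest rating.
        return default
    raw = float(match.group())
    return min(max(raw, low), high)


# Normalize to roughly the 0.0 - 1.0 range, as in the commented example above.
print(extract_score("Rating: 8/10") / 10.0)
print(extract_score("I'd give this a 7.5.") / 10.0)
print(extract_score("I cannot rate this.") / 10.0)
```

This parser is deliberately naive: a reply such as "On a 1-10 scale, 8" would yield 1, so it helps to instruct the judge to answer with the number only.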

### Training Loop

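This loop generates a response, critiques it using the judge, and then runs the PPO optimization step. It relies on ppo_trainer.generate and ppo_trainer.step, which belong to TRL's older step-based PPOTrainer interface, so it assumes a trainer that exposes those methods.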

In [None]:
from tqdm import tqdm

# Example prompts to train on
prompts = [
    "Explain quantum physics to a 5 year old.",
    "Write a poem about rust.",
    "What is the capital of France?",
    "Write a python script to merge two lists."
]

# Generation settings for the rollout
generation_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
    "max_new_tokens": 50
}

epochs = 1

for epoch in tqdm(range(epochs), desc="Epochs"):
    for prompt in tqdm(prompts, desc="Steps"):
        
        # 1. Tokenize the query
        query_tensor = tokenizer.encode(prompt, return_tensors="pt").to(policy_model.device).squeeze(0)

        # 2. Generate response (Rollout)
        # Sample through the trainer so the rollout comes from the policy being optimized
        response_tensor = ppo_trainer.generate(query_tensor, **generation_kwargs).squeeze(0)

        # 3. Decode for the Judge
        response_text = tokenizer.decode(response_tensor, skip_special_tokens=True)
        # Strip the original prompt if the output echoes it (Zephyr usually doesn't if formatted right, but it is a cheap safeguard)
        actual_response = response_text.replace(prompt, "").strip()

        # 4. Get Reward from LLM Judge
        reward_scalar = get_llm_judge_score(prompt, actual_response)

        # TRL expects a list of tensors for the reward
        rewards = [torch.tensor(reward_scalar).to(policy_model.device)]
        
        # 5. PPO Step
        # Pass list of queries, list of responses, and list of rewards
        stats = ppo_trainer.step([query_tensor], [response_tensor], rewards)
        
        # Optional: Print stats to track training
        print(f"Prompt: {prompt[:20]}... | Reward: {reward_scalar:.2f} | PPO Loss: {stats['ppo/loss/total']:.4f}")
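
Removing the prompt with `str.replace` can miss when decoding changes whitespace, or delete text if the response legitimately repeats the prompt. A more robust alternative is to drop the prompt by token count before decoding. The helper below is a hypothetical sketch that works on plain lists of token IDs (call `.tolist()` on tensors first).

```python
def strip_prompt_tokens(full_ids, query_ids):
    """Return only the newly generated token IDs when full_ids starts with query_ids."""
    n = len(query_ids)
    if list(full_ids[:n]) == list(query_ids):
        return list(full_ids[n:])
    # The output does not echo the prompt, so everything is new.
    return list(full_ids)


# Hypothetical token IDs: the first three belong to the prompt.
query = [101, 7592, 2088]
generated = [101, 7592, 2088, 2023, 2003, 102]
print(strip_prompt_tokens(generated, query))
```

In the loop above this would be applied to `response_tensor.tolist()` and `query_tensor.tolist()` before calling `tokenizer.decode`.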

### Save the Adapter

Once training is done, save the LoRA adapter.

In [None]:
save_path = "zephyr-7b-ppo-judge"
# Save from the PPO policy loaded above (`model` is not defined until the Unsloth section)
policy_model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model adapter saved to {save_path}")

## Load Model
We'll load the Zephyr 7B SFT model using Unsloth's `FastLanguageModel`.

In [None]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/zephyr-sft-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

# Enable native 2x faster inference
model = FastLanguageModel.for_inference(model)

==((====))==  Unsloth 2025.12.5: Fast Mistral patching. Transformers: 4.57.3.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 8.0. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


## Load Dataset
We'll load the `gpt_formatted_dataset_clean.csv` dataset.

In [None]:
from datasets import load_dataset

dataset = load_dataset("csv", data_files="datasets/gpt_formatted_dataset_clean.csv")

FileNotFoundError: Unable to find '/content/datasets/gpt_formatted_dataset_clean.csv'
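
The error above suggests the CSV was not present in the runtime's working directory. A small guard like the one below can check a few candidate locations before calling `load_dataset`. The helper `find_dataset` and the second candidate path are hypothetical; only the first path comes from this notebook.

```python
from pathlib import Path


def find_dataset(candidates):
    """Return the first candidate path that exists as a file, or None if none do."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None


csv_path = find_dataset([
    "datasets/gpt_formatted_dataset_clean.csv",
    "gpt_formatted_dataset_clean.csv",  # hypothetical alternative location
])

if csv_path is None:
    print("Dataset not found; upload the CSV or adjust the path before loading.")
else:
    # With the `datasets` library: load_dataset("csv", data_files=str(csv_path))
    print(f"Found dataset at {csv_path}")
```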

## Prompt Model
Now we'll send a prompt to the model and stream its response. The dataset lookup is commented out, so the cell uses a hard-coded prompt instead.

In [None]:
# Extract a prompt from the dataset
# prompt_data = dataset['train'][0]
# prompt = prompt_data['prompt']
prompt = "How does photosynthesis work in plants?"

# Format the prompt for the model
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{} """

inputs = tokenizer(
[
    alpaca_prompt.format(
        "Use the Input to generate a question and answer.", # instruction
        prompt, # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

# Generate the response
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

# Print the original prompt (the generated response was already streamed above)
print("\n\n--- PROMPT ---")
print(prompt)

<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Use the Input to generate a question and answer.

### Input:
How does photosynthesis work in plants?

### Response:
 
<|assistant|>
Question: What is the process by which plants convert sunlight into energy and produce oxygen?

Answer: Photosynthesis is the process by which plants, as well as some bacteria and algae, convert sunlight into energy and produce oxygen. During photosynthesis, chlorophyll in the plant's cells absorbs light energy, which is then used to convert carbon dioxide and water into glucose and oxygen. This process is essential for the survival of plants and provides the oxygen we breathe.</s>


--- PROMPT ---
How does photosynthesis work in plants?
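
In the output above the model emits `<|assistant|>` on its own, which hints that it was tuned on Zephyr-style chat markers rather than the Alpaca layout. The sketch below builds a prompt in that style by hand. The exact marker layout is an assumption about this tokenizer's chat template, and `build_zephyr_prompt` is a hypothetical helper; in practice `tokenizer.apply_chat_template` is the safer way to get the format the tokenizer actually expects.

```python
def build_zephyr_prompt(user_message, system_message="", eos_token="</s>"):
    """Assemble a single-turn prompt using Zephyr-style role markers (assumed layout)."""
    parts = []
    if system_message:
        parts.append(f"<|system|>\n{system_message}{eos_token}")
    parts.append(f"<|user|>\n{user_message}{eos_token}")
    # A trailing assistant marker asks the model to start its reply.
    parts.append("<|assistant|>\n")
    return "\n".join(parts)


chat_prompt = build_zephyr_prompt(
    "How does photosynthesis work in plants?",
    system_message="Use the input to generate a question and answer.",
)
print(chat_prompt)
```

The resulting string can be tokenized and passed to `model.generate` in place of the formatted `alpaca_prompt`.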
