> In this assignment, we will generate a preference dataset with PairRM and fine-tune a model with DPO. This is a powerful training recipe that is behind some of the top models according to Alpaca Eval.
You may use llama-3.2 1B or llama-3.2 3B.
> Preference Dataset Collection and DPO Model Training
> Part 1: Dataset Generation and Judge Implementation (40 points)
> Create two separate preference datasets using different collection methods:

# a) LLM Judge-Based Collection (20 points)
- Implement an LLM-based judge system
- Document your reasoning for the judge's prompt design
- Explain how you ensure consistent and reliable preference judgments
- Include examples of the judge's evaluation process
- You can choose between using local inference on Colab/Lightning studio or a 3rd party provider like fireworks ai/openai/together ai

In [None]:
import requests
import os
api_key=os.getenv("api_key")
def run_judge_together(instruction, response_a, response_b, api_key):
    prompt = f"""
    You are an expert language evaluator. Your task is to judge which of the two responses better satisfies the instruction, based on the following criteria:
    1. Helpfulness: Does the response directly and thoroughly address the instruction?
    2. Relevance: Is the response focused on the topic, without unnecessary or off-topic content?
    3. Accuracy: Is the information provided factually correct?
    Only answer with 'A' or 'B'. Do not explain your reasoning.

Instruction:
{instruction}

Response A:
{response_a}

Response B:
{response_b}

Question: Which response is better? Only answer "A" or "B" and nothing else.
"""

    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }

    data = {
        "model": "meta-llama/Llama-3-8b-chat-hf",
        "prompt": prompt,
        "max_tokens": 10,
        "temperature": 0.0,
        "stop": ["\n"]
    }

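    response = requests.post("https://api.together.xyz/v1/completions", headers=headers, json=data, timeout=60)
    response.raise_for_status()  # surface HTTP errors instead of failing later with a KeyError
    response_json = response.json()

    # Return the last whitespace-separated token, or "" if the completion is empty
    tokens = response_json['choices'][0]['text'].strip().split()
    return tokens[-1] if tokens else ""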



In [None]:
instruction = "What is the capital of France?"
response_a = "The capital of France is Paris."
response_b = "France has no capital city."

print(run_judge_together(instruction, response_a, response_b, api_key))

A


I chose this format because it is clear and keeps the model focused on the task. Restricting the answer to "A" or "B" avoids long explanations and hedged verdicts, and makes the output trivial to parse. I also set the temperature to 0.0 so that decoding is greedy and the same input gives the same verdict in practice. The fixed structure makes it easy to automate scoring and collect preference data for fine-tuning.

For example, when asked:
Instruction: What is the capital of France?

Response A: The capital of France is Paris.

Response B: France has no capital city.

Model Output: A
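
One way to check the consistency claim above is to query the judge twice with the two responses swapped and keep the verdict only when both orderings agree. The sketch below is not part of the original pipeline: `consistent_verdict` and `length_judge` are hypothetical names, and the `judge` argument stands in for any callable of the form `(instruction, response_a, response_b) -> "A" or "B"`, such as a thin wrapper around `run_judge_together`.

```python
from typing import Callable, Optional


def consistent_verdict(
    judge: Callable[[str, str, str], str],
    instruction: str,
    first: str,
    second: str,
) -> Optional[str]:
    """Return "first" or "second" when the judge agrees with itself after the
    response order is swapped, otherwise None (position-dependent or malformed)."""
    forward = judge(instruction, first, second)   # `first` is shown as Response A
    backward = judge(instruction, second, first)  # `first` is shown as Response B
    if forward == "A" and backward == "B":
        return "first"
    if forward == "B" and backward == "A":
        return "second"
    return None


# Toy stand-in judge (prefers the longer response), used only to exercise the helper
def length_judge(instruction: str, response_a: str, response_b: str) -> str:
    return "A" if len(response_a) >= len(response_b) else "B"


print(consistent_verdict(
    length_judge,
    "What is the capital of France?",
    "The capital of France is Paris.",
    "Paris",
))
```

Pairs for which the helper returns `None` could be dropped from the dataset, at the cost of doubling the number of judge calls.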

# b) PairRM-Based Collection (20 points)
- Extract 50 instructions from the Lima dataset
- Generate 5 responses per instruction using the llama-3.2 chat template
- Apply PairRM to create preference pairs
- Upload dataset to HuggingFace
- Submit repository link

In [None]:
from huggingface_hub import login
import os

login(os.getenv("hf"))

In [None]:
from datasets import load_dataset
ds = load_dataset("GAIR/lima", split="train")
instructions_50 = [item["conversations"][0] for item in ds.select(range(50))]

In [None]:
print(instructions_50)

['Can brain cells move? By movement I mean long distance migration (preferably within the brain only).', 'In our computer systems lecture we were introduced to the MIPS processor. It was (re)developed over the course of the term and has in fact been quite easy to understand. It uses a RISC design, that is its elementary commands are regularly encoded and there are only few of them in order to keep the wires simple.\nIt was mentioned that CISC follows a different philosophy. I looked briefly at the x86 instruction set and was shocked. I can not image how anyone would want to build a processor that uses so complex a command set!\nSo I figure there have to be good arguments why large portions of the processor market use CISC architectures. What are they? ', 'View tabular file such as CSV from command line, having horizontal and vertical scrolling would be great.', 'Slater type orbitals (STO) are considered to be more accurate than gaussian type orbitals (GTO) for atomic and molecular QM c

In [None]:
import requests
import time
import os
# from tqdm import tqdm

TOGETHER_API_KEY = os.getenv("TOGETHER_API_KEY")
headers = {
    "Authorization": f"Bearer {TOGETHER_API_KEY}",
    "Content-Type": "application/json"
}

def generate_response(prompt):
    payload = {
        "model": "meta-llama/Llama-3-8b-chat-hf",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
        "temperature": 0.9,
        "top_p": 0.9,
    }

    response = requests.post(
        "https://api.together.xyz/v1/chat/completions",
        headers=headers,
        json=payload
    )

    # Error handling: fall back to the raw body if the response is not the expected JSON
    try:
        return response.json()["choices"][0]["message"]["content"]
    except Exception as e:
        print("API Error:", response.text)
        return f"Error: {response.text}"

# Generate 5 responses for each instruction
results = []

for idx, instruction in enumerate(instructions_50):
    responses = []
    print(f"Processing instruction {idx+1}/50")
    for i in range(5):
        try:
            response = generate_response(instruction)
            responses.append(response)
            time.sleep(0.8)
        except Exception as e:
            responses.append(f"Error: {e}")
    results.append({"instruction": instruction, "responses": responses})

import json
with open("lima50_instruction_5responses.jsonl", "w", encoding="utf-8") as f:
    for item in results:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")


Processing instruction 1/50
Processing instruction 2/50
Processing instruction 3/50
Processing instruction 4/50
Processing instruction 5/50
Processing instruction 6/50
Processing instruction 7/50
Processing instruction 8/50
Processing instruction 9/50
Processing instruction 10/50
Processing instruction 11/50
Processing instruction 12/50
Processing instruction 13/50
Processing instruction 14/50
Processing instruction 15/50
Processing instruction 16/50
Processing instruction 17/50
Processing instruction 18/50
Processing instruction 19/50
Processing instruction 20/50
Processing instruction 21/50
Processing instruction 22/50
Processing instruction 23/50
Processing instruction 24/50
Processing instruction 25/50
Processing instruction 26/50
Processing instruction 27/50
Processing instruction 28/50
Processing instruction 29/50
Processing instruction 30/50
Processing instruction 31/50
Processing instruction 32/50
Processing instruction 33/50
Processing instruction 34/50
Processing instruction 

In [None]:
# pip install tqdm

In [None]:
!pip install -U llm-blender


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




In [None]:
import llm_blender
from collections import defaultdict
import json

instructions_50 = []
all_responses = []

# Load instructions and 5 responses per instruction from file
with open("lima50_instruction_5responses.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        data = json.loads(line)
        # Check if required fields exist and contain 5 responses
        if "instruction" in data and "responses" in data and len(data["responses"]) == 5:
            instructions_50.append(data["instruction"])
            all_responses.append(data["responses"])

# Load PairRM ranking model (automatically downloads if not present)
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")

# Inputs to the ranker:
# - instructions_50: a list of prompts (strings)
# - all_responses: a list of 5 responses per prompt, format: List[List[str]]
# e.g. all_responses[i] = [response1, response2, response3, response4, response5]

# Get the rank of each candidate (rank 1 is best, so lower is better)
ranks = blender.rank(
    instructions_50,
    all_responses,
    return_scores=False,
    batch_size=4
)

preference_pairs = []

# Generate pairwise preference data from ranking
for i in range(len(instructions_50)):
    instruction = instructions_50[i]
    responses = all_responses[i]
    for left in range(4):
        for right in range(left + 1, 5):
            left_rank = ranks[i][left]
            right_rank = ranks[i][right]

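            # Skip ties so that chosen and rejected are never the same response
            if left_rank == right_rank:
                continue

            # Rank 1 is best, so the lower rank is the preferred response
            if left_rank < right_rank:
                chosen, rejected = responses[left], responses[right]
            else:
                chosen, rejected = responses[right], responses[left]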

            pair_data = {
                "prompt": instruction,
                "chosen": chosen,
                "rejected": rejected
            }
            preference_pairs.append(pair_data)

# Save the preference pairs to a JSONL file
output_filename = "lima50_pairrm_preference.jsonl"
with open(output_filename, "w", encoding="utf-8") as f:
    for item in preference_pairs:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")

print(f"Done, created {len(preference_pairs)} pairs and saved to {output_filename}")

Successfully loaded ranker from  /root/.cache/huggingface/hub/llm-blender/PairRM


Ranking candidates: 100%|██████████| 13/13 [04:46<00:00, 22.03s/it]

Done, created 500 pairs and saved to lima50_pairrm_preference.jsonl
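
The 500 pairs follow directly from the pairing scheme: 5 responses per instruction give C(5, 2) = 10 unordered pairs, and 50 × 10 = 500. The sketch below reproduces that logic on toy data. The ranks are made up for illustration, and `pairs_from_ranks` is a hypothetical helper, not a function from `llm_blender`.

```python
from itertools import combinations
from math import comb


def pairs_from_ranks(prompt, responses, ranks):
    """Build chosen/rejected records from per-response ranks, where rank 1 is best."""
    pairs = []
    for left, right in combinations(range(len(responses)), 2):
        if ranks[left] == ranks[right]:
            continue  # a tie carries no preference signal
        better, worse = (left, right) if ranks[left] < ranks[right] else (right, left)
        pairs.append({
            "prompt": prompt,
            "chosen": responses[better],
            "rejected": responses[worse],
        })
    return pairs


toy_responses = ["r0", "r1", "r2", "r3", "r4"]
toy_ranks = [3, 1, 5, 2, 4]  # made-up ranks: "r1" is best, "r2" is worst
toy_pairs = pairs_from_ranks("toy prompt", toy_responses, toy_ranks)

print(len(toy_pairs), comb(5, 2))  # pairs per instruction
print(50 * comb(5, 2))             # total across 50 instructions
```

Because every pair from the same ranking is kept, the best response appears as `chosen` four times and the worst as `rejected` four times, so the pairs are not independent samples.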





In [None]:
from datasets import Dataset
import json
import os
from huggingface_hub import login

# Step 1: Log in to Hugging Face using your access token
login(os.getenv("hf"))

# Step 2: Load the local JSONL file containing PairRM preference pairs
with open("lima50_pairrm_preference.jsonl", "r", encoding="utf-8") as f:
    data = [record for record in map(json.loads, f) if isinstance(record, dict)]

# Step 3: Convert the list of preference pairs into a Hugging Face Dataset object
dataset = Dataset.from_list(data)

# Step 4: Push the dataset to the Hugging Face Hub under your namespace
# Make it public by setting private=False
dataset.push_to_hub("sxsun1684/pair-rm-lima500-preferences", private=False)

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/343 [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/datasets/sxsun1684/pair-rm-lima500-preferences/commit/a2a19b11732c7fa6d9aa63f896917ef1b2f9d240', commit_message='Upload dataset', commit_description='', oid='a2a19b11732c7fa6d9aa63f896917ef1b2f9d240', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/sxsun1684/pair-rm-lima500-preferences', endpoint='https://huggingface.co', repo_type='dataset', repo_id='sxsun1684/pair-rm-lima500-preferences'), pr_revision=None, pr_num=None)

In [None]:
commit_url='https://huggingface.co/datasets/sxsun1684/pair-rm-lima500-preferences/commit/46f47cc53e24b03a6e233b26f7dbc895ae533045'
print(commit_url)

https://huggingface.co/datasets/sxsun1684/pair-rm-lima500-preferences/commit/46f47cc53e24b03a6e233b26f7dbc895ae533045


# Part 2: Model Training and Evaluation (60 points)

## a) DPO Fine-tuning (40 points)
- Fine-tune llama-3.2 using PairRM preference dataset
- Fine-tune llama-3.2 using LLM Judge preference dataset
- Document training parameters and process
- Upload PEFT adapters to HuggingFace
- Submit repository links


### Build the LLM Judge preference dataset and fine-tune llama-3.2 using PairRM preference dataset

In [None]:
import json
import os
from itertools import combinations

import requests
from tqdm import tqdm

api_key = os.getenv("api_key")

def run_judge_together(instruction, response_a, response_b, api_key):
    prompt = f"""
You are an expert assistant helping to judge response quality.

Instruction:
{instruction}

Response A:
{response_a}

Response B:
{response_b}

Question: Which response is better, based on accuracy? Only answer "A" or "B" and nothing else.
"""

    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }

    data = {
        "model": "meta-llama/Llama-3-8b-chat-hf",
        "prompt": prompt,
        "max_tokens": 10,
        "temperature": 0.0,
        "stop": ["\n"]
    }

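    response = requests.post("https://api.together.xyz/v1/completions", headers=headers, json=data, timeout=60)
    response.raise_for_status()  # surface HTTP errors instead of failing later with a KeyError
    response_json = response.json()
    # Return the last whitespace-separated token, or "" if the completion is empty
    tokens = response_json['choices'][0]['text'].strip().split()
    return tokens[-1] if tokens else ""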

# preference
judged_pairs = []
with open("lima50_instruction_5responses.jsonl", "r", encoding="utf-8") as f:
    for line in tqdm(f, total=50):
        item = json.loads(line)
        instruction = item["instruction"]
        responses = [r for r in item["responses"] if not r.strip().startswith("Error")]
        if len(responses) < 2:
            continue

        for a, b in combinations(responses, 2):
            try:
                result = run_judge_together(instruction, a, b, api_key)
                if result == "A":
                    chosen, rejected = a, b
                elif result == "B":
                    chosen, rejected = b, a
                else:
                    continue
                judged_pairs.append({
                    "prompt": instruction,
                    "chosen": chosen,
                    "rejected": rejected
                })
            except Exception as e:
                # Skip pairs whose judge call failed, but report why
                tqdm.write(f"Judge call failed: {e}")
                continue

with open("llm_judge_lima50_preferences.jsonl", "w", encoding="utf-8") as f:
    for item in judged_pairs:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")


100%|██████████| 50/50 [02:19<00:00,  2.79s/it]


In [None]:
from huggingface_hub import login
login(os.getenv("hf"))

In [None]:
pip install transformers accelerate datasets peft trl

Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting trl
  Downloading trl-0.16.1-py3-none-any.whl.metadata (12 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x

In [None]:
from datasets import Dataset
import json

# 1. Load your JSONL file (make sure it's saved in the current directory)
with open("llm_judge_lima50_preferences.jsonl", "r", encoding="utf-8") as f:
    data = [record for record in map(json.loads, f) if isinstance(record, dict)]

# 2. Convert to Hugging Face Dataset format
dataset = Dataset.from_list(data)

# 3. Upload to your Hugging Face Dataset repo
dataset.push_to_hub("sxsun1684/llm_judge_lima50_preferences", private=False)

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/344 [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/datasets/sxsun1684/llm_judge_lima50_preferences/commit/fac3b6c405512cbf0fab007f1c2123d487388ee0', commit_message='Upload dataset', commit_description='', oid='fac3b6c405512cbf0fab007f1c2123d487388ee0', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/sxsun1684/llm_judge_lima50_preferences', endpoint='https://huggingface.co', repo_type='dataset', repo_id='sxsun1684/llm_judge_lima50_preferences'), pr_revision=None, pr_num=None)

commit_url='https://huggingface.co/datasets/sxsun1684/llm_judge_lima50_preferences/commit/179258494b878e64d8e5c29159a8b5fecb0e67fe'

In [None]:
from datasets import load_dataset
from huggingface_hub import login
login(os.getenv("hf"))

dataset = load_dataset("sxsun1684/llm_judge_lima50_preferences", split="train")
print(dataset)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/344 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/211k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/112 [00:00<?, ? examples/s]

Dataset({
    features: ['prompt', 'chosen', 'rejected'],
    num_rows: 112
})


In [None]:
import os
os.environ["WANDB_DISABLED"] = "true"


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import DPOTrainer, DPOConfig
from peft import LoraConfig
from datasets import load_dataset
import torch
from huggingface_hub import login
import os
login(os.getenv("hf"))

# 1. Load the model and tokenizer
model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# 2. Load the PairRM preference dataset
dataset = load_dataset("sxsun1684/pair-rm-lima500-preferences")["train"]

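# 3. Preprocess and generate input_ids and attention_mask for prompt/chosen/rejected
#    Note: DPOTrainer also tokenizes the raw prompt/chosen/rejected columns itself
#    (see "Tokenizing train dataset" in the output), so this step is likely redundant.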
def preprocess(example):
    prompt = tokenizer(example["prompt"], padding="max_length", truncation=True, max_length=128, return_tensors="pt")
    chosen = tokenizer(example["chosen"], padding="max_length", truncation=True, max_length=384, return_tensors="pt")
    rejected = tokenizer(example["rejected"], padding="max_length", truncation=True, max_length=384, return_tensors="pt")

    return {
        "prompt_input_ids": prompt["input_ids"][0],
        "prompt_attention_mask": prompt["attention_mask"][0],
        "chosen_input_ids": chosen["input_ids"][0],
        "chosen_attention_mask": chosen["attention_mask"][0],
        "rejected_input_ids": rejected["input_ids"][0],
        "rejected_attention_mask": rejected["attention_mask"][0],
    }

processed_dataset = dataset.map(preprocess)

# 4. PEFT + LoRA
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    bias="none",
    task_type="CAUSAL_LM",
)

# 5. DPO
dpo_config = DPOConfig(
    beta=0.1,
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    max_length=512,
    save_strategy="epoch",
    logging_steps=10,
    push_to_hub=False,
    report_to="none", #Prevent wandb from generating error messages
)

# 6. Initialize DPOTrainer (the tokenizer is passed as processing_class; with ref_model=None
#    and a PEFT config, the base model with adapters disabled serves as the reference)
trainer = DPOTrainer(
    model=model,
    ref_model=None,
    args=dpo_config,
    train_dataset=processed_dataset,
    data_collator=None,
    peft_config=peft_config,
    processing_class=tokenizer,)


# 7. train
trainer.train()

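# 8. Save the LoRA adapter. trainer.model is the PEFT-wrapped model, so this writes only
#    the adapter weights. Calling save_pretrained on the bare `model` instead stores a full
#    checkpoint whose keys carry LoRA wrappers (base_layer, lora_A, lora_B).
trainer.model.save_pretrained("dpo-llama3-lora")
tokenizer.save_pretrained("dpo-llama3-lora")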


tokenizer_config.json:   0%|          | 0.00/50.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/301 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/843 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/185 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/346 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/390k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/500 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Extracting prompt in train dataset:   0%|          | 0/500 [00:00<?, ? examples/s]

Applying chat template to train dataset:   0%|          | 0/500 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/500 [00:00<?, ? examples/s]

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Step,Training Loss
10,0.6935
20,0.6874
30,0.6842
40,0.6829
50,0.6717
60,0.6699
70,0.6181
80,0.6344
90,0.6267
100,0.6339


('dpo-llama3-lora/tokenizer_config.json',
 'dpo-llama3-lora/special_tokens_map.json',
 'dpo-llama3-lora/tokenizer.json')
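
The first logged loss of about 0.693 is what the DPO objective gives before the policy has moved away from the reference model: the implicit reward margin is zero, so the loss equals ln 2. The sketch below evaluates the standard sigmoid DPO loss for a single pair. The log-probabilities are made-up numbers and `dpo_loss` is an illustrative helper, not TRL's implementation.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Sigmoid DPO loss for one pair, given summed sequence log-probabilities."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)), written in a numerically stable form for either sign of x
    return math.log1p(math.exp(-x)) if x >= 0 else -x + math.log1p(math.exp(x))


# Policy identical to the reference: the margin is 0 and the loss is ln 2
start = dpo_loss(-42.0, -57.0, -42.0, -57.0)
# Policy raises the chosen log-prob and lowers the rejected one: the loss drops below ln 2
later = dpo_loss(-40.0, -60.0, -42.0, -57.0)

print(round(start, 4), round(later, 4), round(math.log(2), 4))
```

A training loss that drifts down from 0.693 therefore indicates that the policy is assigning relatively more probability to the chosen responses than the reference does.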

In [None]:
from huggingface_hub import create_repo
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "dpo-llama3-lora-pairrm"

create_repo(repo_id, exist_ok=True)

# Push the PEFT-wrapped model so that only the LoRA adapter is uploaded
trainer.model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)


# **Link:**
https://huggingface.co/sxsun1684/dpo-llama3-lora-pairrm/commit/79e38eb1a93eabd1ce5134eb505c430601be7d80

### Fine-tune llama-3.2 using LLM judge preference dataset

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import DPOTrainer, DPOConfig
from peft import LoraConfig
from datasets import load_dataset
import torch

# 1. Load the model and tokenizer
model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# 2. Load the LLM judge preference dataset
dataset = load_dataset("sxsun1684/llm_judge_lima50_preferences")["train"]

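# 3. Preprocess and generate input_ids and attention_mask for prompt/chosen/rejected
#    Note: DPOTrainer also tokenizes the raw prompt/chosen/rejected columns itself
#    (see "Tokenizing train dataset" in the output), so this step is likely redundant.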
def preprocess(example):
    prompt = tokenizer(example["prompt"], padding="max_length", truncation=True, max_length=128, return_tensors="pt")
    chosen = tokenizer(example["chosen"], padding="max_length", truncation=True, max_length=384, return_tensors="pt")
    rejected = tokenizer(example["rejected"], padding="max_length", truncation=True, max_length=384, return_tensors="pt")

    return {
        "prompt_input_ids": prompt["input_ids"][0],
        "prompt_attention_mask": prompt["attention_mask"][0],
        "chosen_input_ids": chosen["input_ids"][0],
        "chosen_attention_mask": chosen["attention_mask"][0],
        "rejected_input_ids": rejected["input_ids"][0],
        "rejected_attention_mask": rejected["attention_mask"][0],
    }

processed_dataset = dataset.map(preprocess)

# 4. PEFT + LoRA
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    bias="none",
    task_type="CAUSAL_LM",
)

# 5. DPO
dpo_config = DPOConfig(
    beta=0.1,
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    # per_device_train_batch_size=2,
    # gradient_accumulation_steps=4,
    num_train_epochs=3,
    max_length=512,
    save_strategy="epoch",
    logging_steps=10,
    push_to_hub=False,
    report_to="none", #Prevent wandb from generating error messages
)

# 6. Initialize DPOTrainer (the tokenizer is passed as processing_class; with ref_model=None
#    and a PEFT config, the base model with adapters disabled serves as the reference)
trainer = DPOTrainer(
    model=model,
    ref_model=None,
    args=dpo_config,
    train_dataset=processed_dataset,
    data_collator=None,
    peft_config=peft_config,
    processing_class=tokenizer,)


# 7. train
trainer.train()

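# 8. Save the LoRA adapter. trainer.model is the PEFT-wrapped model, so this writes only
#    the adapter weights. Calling save_pretrained on the bare `model` instead stores a full
#    checkpoint whose keys carry LoRA wrappers (base_layer, lora_A, lora_B).
trainer.model.save_pretrained("dpo-llama3-lora-judge")
tokenizer.save_pretrained("dpo-llama3-lora-judge")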


tokenizer_config.json:   0%|          | 0.00/50.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/301 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/843 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/185 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/343 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/139k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/75 [00:00<?, ? examples/s]

Map:   0%|          | 0/75 [00:00<?, ? examples/s]

Extracting prompt in train dataset:   0%|          | 0/75 [00:00<?, ? examples/s]

Applying chat template to train dataset:   0%|          | 0/75 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/75 [00:00<?, ? examples/s]

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Step,Training Loss
10,0.6449
20,0.6076


('dpo-llama3-lora-judge/tokenizer_config.json',
 'dpo-llama3-lora-judge/special_tokens_map.json',
 'dpo-llama3-lora-judge/tokenizer.json')

In [None]:
from huggingface_hub import create_repo
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "dpo-llama3-lora-judge"

create_repo(repo_id, exist_ok=True)

# Push the PEFT-wrapped model so that only the LoRA adapter is uploaded
trainer.model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)

model.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

README.md:   0%|          | 0.00/5.18k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/sxsun1684/dpo-llama3-lora-judge/commit/d07eb33fcbcab4087a981e744e8306a534215ff0', commit_message='Upload tokenizer', commit_description='', oid='d07eb33fcbcab4087a981e744e8306a534215ff0', pr_url=None, repo_url=RepoUrl('https://huggingface.co/sxsun1684/dpo-llama3-lora-judge', endpoint='https://huggingface.co', repo_type='model', repo_id='sxsun1684/dpo-llama3-lora-judge'), pr_revision=None, pr_num=None)

# **Link:**
https://huggingface.co/sxsun1684/dpo-llama3-lora-judge/commit/d07eb33fcbcab4087a981e744e8306a534215ff0

## b) Comparative Analysis (20 points)
- Select 10 novel instructions (not in training data)
- Generate completions using:
  * Original llama-3.2
  * DPO fine-tuned model (LLM judge dataset)
  * DPO fine-tuned model (PairRM dataset)
- Present results in a pandas DataFrame
- Analyze and compare the quality of completions
- Include quantitative and qualitative observations
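
For the quantitative side of the comparison, simple text statistics can flag degenerate completions before any manual reading. The sketch below is one possible set of metrics, not the analysis used in this notebook: `completion_stats` is a hypothetical helper that reports the word count and the fraction of distinct words and bigrams, where low ratios suggest heavy repetition. The two sample strings are invented.

```python
def completion_stats(text: str) -> dict:
    """Word count plus distinct-1 / distinct-2 ratios for one completion."""
    words = text.split()
    bigrams = list(zip(words, words[1:]))
    return {
        "n_words": len(words),
        "distinct_1": len(set(words)) / len(words) if words else 0.0,
        "distinct_2": len(set(bigrams)) / len(bigrams) if bigrams else 0.0,
    }


varied = "Paris is the capital of France and its largest city"
looping = "citation citation citation citation citation citation"

print(completion_stats(varied))
print(completion_stats(looping))
```

Applied to the results table, the helper could be mapped over each model column, for example `df["DPO (PairRM)"].apply(completion_stats)`, and the per-model averages compared side by side.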


In [None]:
from huggingface_hub import login
login(os.getenv("hf"))

In [None]:
pip install transformers



In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import pandas as pd

# 1. Define model names from Hugging Face
base_model_name = "meta-llama/Llama-3.2-1B"
judge_model_name = "sxsun1684/dpo-llama3-lora-judge"
pairrm_model_name = "sxsun1684/dpo-llama3-lora-pairrm"

# 2. Load the tokenizer (shared by all models)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # Required for generation on some models

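# 3. Helper function to build generation pipeline
#    Note: if loading prints "Some weights of the model checkpoint ... were not used" and
#    lists base_layer / lora_A / lora_B keys, the LoRA-wrapped weights were not applied.
#    The affected projections are then probably randomly initialized, and the completions
#    from that model are not meaningful.
def get_pipeline(model_name):
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
    return pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256, temperature=0.7)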

# 4. Load each model’s generation pipeline
base_pipe = get_pipeline(base_model_name)
judge_pipe = get_pipeline(judge_model_name)
pairrm_pipe = get_pipeline(pairrm_model_name)

# 5. Create 10 novel instructions (not seen during training)
novel_instructions = [
    "What were Donald Trump's main campaign promises in 2016 and how many were fulfilled?",
    "Explain why it's dangerous to mix bleach and ammonia in household cleaning.",
    "What are some cultural differences between Northern and Southern Italy?",
    "Summarize Kim Kardashian’s influence on modern beauty standards.",
    "Why do some Americans call it 'soccer' while others say 'football'?",
    "What is the significance of Thanksgiving in American culture?",
    "Describe how Taylor Swift has evolved musically over the years.",
    "Can you explain how tipping works in U.S. restaurants and why it's expected?",
    "What are some common superstitions in Chinese culture and their origins?",
    "How has social media changed celebrity scandals and public perception?"
]

# 6. Generate responses from all three models
results = []
for instr in novel_instructions:
    base_out = base_pipe(instr)[0]["generated_text"]
    judge_out = judge_pipe(instr)[0]["generated_text"]
    pairrm_out = pairrm_pipe(instr)[0]["generated_text"]

    # Strip the original prompt part from generated output
    results.append({
        "Instruction": instr,
        "Original LLaMA-3.2": base_out[len(instr):].strip(),
        "DPO (LLM Judge)": judge_out[len(instr):].strip(),
        "DPO (PairRM)": pairrm_out[len(instr):].strip(),
    })

# 7. Display results in a table
df = pd.DataFrame(results)
pd.set_option("display.max_colwidth", None)
display(df)



tokenizer_config.json:   0%|          | 0.00/50.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/301 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/843 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/185 [00:00<?, ?B/s]

Device set to use cuda:0


config.json:   0%|          | 0.00/837 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

Some weights of the model checkpoint at sxsun1684/dpo-llama3-lora-judge were not used when initializing LlamaForCausalLM: ['model.layers.0.self_attn.q_proj.base_layer.weight', 'model.layers.0.self_attn.q_proj.lora_A.default.weight', 'model.layers.0.self_attn.q_proj.lora_B.default.weight', 'model.layers.0.self_attn.v_proj.base_layer.weight', 'model.layers.0.self_attn.v_proj.lora_A.default.weight', 'model.layers.0.self_attn.v_proj.lora_B.default.weight', 'model.layers.1.self_attn.q_proj.base_layer.weight', 'model.layers.1.self_attn.q_proj.lora_A.default.weight', 'model.layers.1.self_attn.q_proj.lora_B.default.weight', 'model.layers.1.self_attn.v_proj.base_layer.weight', 'model.layers.1.self_attn.v_proj.lora_A.default.weight', 'model.layers.1.self_attn.v_proj.lora_B.default.weight', 'model.layers.10.self_attn.q_proj.base_layer.weight', 'model.layers.10.self_attn.q_proj.lora_A.default.weight', 'model.layers.10.self_attn.q_proj.lora_B.default.weight', 'model.layers.10.self_attn.v_proj.base_

generation_config.json:   0%|          | 0.00/180 [00:00<?, ?B/s]

Device set to use cuda:0


config.json:   0%|          | 0.00/837 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

Some weights of the model checkpoint at sxsun1684/dpo-llama3-lora-pairrm were not used when initializing LlamaForCausalLM: ['model.layers.0.self_attn.q_proj.base_layer.weight', 'model.layers.0.self_attn.q_proj.lora_A.default.weight', 'model.layers.0.self_attn.q_proj.lora_B.default.weight', 'model.layers.0.self_attn.v_proj.base_layer.weight', 'model.layers.0.self_attn.v_proj.lora_A.default.weight', 'model.layers.0.self_attn.v_proj.lora_B.default.weight', 'model.layers.1.self_attn.q_proj.base_layer.weight', 'model.layers.1.self_attn.q_proj.lora_A.default.weight', 'model.layers.1.self_attn.q_proj.lora_B.default.weight', 'model.layers.1.self_attn.v_proj.base_layer.weight', 'model.layers.1.self_attn.v_proj.lora_A.default.weight', 'model.layers.1.self_attn.v_proj.lora_B.default.weight', 'model.layers.10.self_attn.q_proj.base_layer.weight', 'model.layers.10.self_attn.q_proj.lora_A.default.weight', 'model.layers.10.self_attn.q_proj.lora_B.default.weight', 'model.layers.10.self_attn.v_proj.base

generation_config.json:   0%|          | 0.00/180 [00:00<?, ?B/s]

Device set to use cuda:0


Unnamed: 0,Instruction,Original LLaMA-3.2,DPO (LLM Judge),DPO (PairRM)
0,What were Donald Trump's main campaign promises in 2016 and how many were fulfilled?,"(Photo: Getty Images)\nDonald Trump's presidency is now in its second year and the American people are still struggling to understand the policies that he has put in place. The Republican president has been accused of being an 'unstable narcissist' who has 'failed' his country and 'broken' the promises he made to the American people during his 2016 presidential campaign.\nSo how many of the promises made by Trump during his election campaign have been fulfilled?\nDonald Trump claimed he would'make America great again' but many of his policies are causing more harm than good.\nIn his first year as president, Trump has overseen a record number of deportations and the deportation of more than 2.1 million people.\nThe Trump administration also has a zero-tolerance policy for illegal immigration, which has led to more than 50,000 children being separated from their parents.\nThe US president has also rolled back many of the Obama administration's environmental protections, including a ban on the use of hydraulic fracturing, or fracking, for oil and gas production.\nHe has also rolled back many of the Obama administration's policies on immigration, including a ban on the deportation of illegal immigrants, and has rolled back many of the Obama administration's environmental protections, including a ban on the",samesame UserDao UserDaoousesyards yard-yard accusing accusing citation citation citations citations citations citation citation citations citation citations citation citation citationsほどほどほどほどほどほどほどほどほどほどほどほどほどほどほどほどほどほどlists_IOSほどangingほどほどほどほどほどほど citationほどほどほどほどdictdict citation citation citationほどlistslistsouseraisonopa-ceее citation-legged-leggedMdほどほどほどほどほどumpyumpyumpyumpyumpy citation-cent-centraisonraisonouseousesouseouseousevidersματαvidersほどほど Hera Herawingほどほどματα XR XRumpyunctionunctionumpy trusts 
trustsysics@pytest-legged-leggedほどほどほどほどほどほどほどouseほどほどほどほど-ce-ceanging wd citation XR@pytestturned-ceDNMdMdDNDDSraisonraison wdoundsほどματαouseouseouseraisonraison Sinn SinnDDSumpy@pytestほど MentionMdMdMdMdMdraisonraisonraisonhaphapほどraisonDDSopaysics-leggedMdMdMdMd-legged-legged-legged Sinn SinnMdDNほど Mention-legged-leggedFdFdMdviders-leggedMdraisonhap Sinn Sinn SinnMdMdMdMdhaphaphaphaphaphap XRMdMdMdMdFdhapangingDDS-legged-legged-legged-legged-legged-legged-legged-leggedMdMdMdMd-legged-leggedMd Sinnysics-legged-legged-legged-leggedMd MdMdMdhaphaphapMdMdMdouseousehaphap,.jp العامagiagi softened softened softened softenedstellenstellenadratic艺ectedadraticcrollstellenstellenstellenabiswikwikstellenstellenstellen.day of� al com tall of
1,Explain why it's dangerous to mix bleach and ammonia in household cleaning.,"Bleach and ammonia are strong oxidizing agents. This reaction results in a lot of heat, and can cause severe burns. Bleach and ammonia should never be mixed. Always keep them separate.\nExplain why it's dangerous to mix bleach and ammonia in household cleaning. Bleach and ammonia are strong oxidizing agents. This reaction results in a lot of heat, and can cause severe burns. Bleach and ammonia should never be mixed. Always keep them separate.\nExplain why it's dangerous to mix bleach and ammonia in household cleaning. Bleach and ammonia are strong oxidizing agents. This reaction results in a lot of heat, and can cause severe burns. Bleach and ammonia should never be mixed. Always keep them separate.\nExplain why it's dangerous to mix bleach and ammonia in household cleaning. Bleach and ammonia are strong oxidizing agents. This reaction results in a lot of heat, and can cause severe burns. Bleach and ammonia should never be mixed. Always keep them separate.\nExplain why it's dangerous to mix bleach and ammonia in household cleaning. Bleach and ammonia are strong oxidizing agents. This reaction results in a lot of heat, and can cause severe burns. Bleach and ammonia should never be mixed. 
Always keep them separate.\nEx",circlecirclecircle амastrastrastrastrastrastrastrastrastrastrastrastrastrastrastrastrastrastrastrastrastrastr ammon ammondivider ammon Rising.constructor ammonAsStreamdividerastrastr door Don 'astrastrastrastrastrastrastrastrastrastrastrastrastrastresign Donablo K ammon analog analogbasketastr� ammon ammonabloabloabloabloablo don Don don e Don Don ammon ammon ammon Nic e e >> >>ablo Bloomabloabloabloabloabloabloabloabloabloabloablo''''abloabloabloabloabloabloabloabloabloabloabloabloabloabloabloabloabloabloabloabloabloabloabloabloabloabloabloabloabloabloabloabloabloabloabloabloabloabloabloablo''''aket >>attackbasketaketaketaketaket''aketaketabloablo<>abloabloabloabloabloabloabloabloabloabloabloabloabloabloabloablo don don analogabloablo charging charging'''''' Don Donabloabloabloabloablo VsArial DonArial''lblaketaket Niclblارات analog >>''>>>>>>basket''''rr.security charging charging charging analog analog analog analog Vers<> Vers''abloاراتارات Nic RNخص aroma sur''''.securityBD charging'''' Don Don Vsens analog analog.security charging charging analog analog''ensens'' Don Don charging don,SDS Md Mdternternternternternternternternternternternternternternternternternternterminationterminationterminationterminationterminternternternternternternternternternternternternternternternternterntmltmlopoulosopoulosopoulosopoulosopoulosopoulosopoulosopoulosopoulosopoulosternterntern SDStiftif SDS 
SPDwikovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovich
2,What are some cultural differences between Northern and Southern Italy?,"There are so many! Here are some interesting differences that I have seen while visiting Italy and reading about Italy.\nIn Italy, it is common to greet people with a kiss on the cheek. It is also common to hug people. In the South, it is more common to kiss on the cheek and hug people than in the North. This is a nice way to show people you care about them.\nI am from the North of Italy and I have seen a lot of differences between the North and the South of Italy. The most important difference is the weather. The South of Italy has a very warm and humid climate. The North of Italy has a colder and drier climate. In the North, it is more common to wear a coat and a scarf. In the South, it is more common to wear a t-shirt and shorts. The food is also different. In the North, it is more common to eat pasta and pizza. In the South, it is more common to eat rice and pasta.\nAnother difference between the North and the South of Italy is the language. In the North, it is more common to speak Italian. In the South, it is more common to speak Italian and Italiano. The North also has more tourists. 
The South has fewer tourists.",MEDمالمالمالousesousesousesousesousesousesousesمالمالillesillesillesillesillesillesillesillesillesillesillesillesillesillesillesillesillesillesillesilleskjillesillesillesillesillesillesilles tho thoouses Mention Mention Mention Mention Mention Mention Mention Mention Mention Mention Mention Mention Mentionousesouses Charging Mention Mention Mentionillesillesillesousesousesousesousesousesousesousesousesousesousesousesousesouses Mention Mention Mention Mention Mention Mention Mention Mention Mention Mention幸幸幸illesilles wd Mention wd wd wd wd Mention Mention Mention Mention Mention Mention Mention Mention Mention Mention Mention Mention Mention Mention Mention Mention Mention Mention Mention Mention Mention Mention Mention Mention wd inplace幸幸幸幸幸幸幸 Mention Mention Mention Mention幸幸幸幸幸ousesousesousesousesousesousesousesousesouses Mention Mention Chargingicone幸幸幸幸ouses幸幸幸幸幸幸幸幸ousesouse Charging幸幸 wdliable`sjrousesouses幸幸幸幸ouses幸幸幸幸幸幸幸 Charging Charging Charging unused unused幸幸幸icone Chargingchargcharg» inplacewy� inplace Mention Mention Chargingliablejrrrchargrr Canadiens internousesousesouses�幸幸幸 Authorized Authorized幸幸幸幸幸»ousesousesachers»ользов inplace幸teen�幸 Charging intern unpaidiconeIconsjrousesousesousesouses chargingcharging Charging,escptagiagiagiagiagiabit稱稱gigigigigigi dec al\\\\ sole shaft.state.Sizepci\t projections projectionsasts team team soleatura Ludwig.total�:: unit Abbasاجع team ofaturagi Abs mir of \\wik unit.unit of.Size.Size unit Abs Absaturaabis.unit of side decwik able Abs mir necessarystellen smallest unit attяд pastpast past solewik pastpast pastpast unit sole soleabisized past dec-old of ____keitkeit currentPage currentPageaturawikinctinctaturaaturaiets-old innohen69tanLangwikinctinctincticit ofwik-old com com decwikstellen ofasts.unit andicit Ludwigaturaatura inn\tinctatura.unit ofaturaaturaicit.swingizes.Sizewikwikwikwikwikwikwikwikwikروسaturaaturaaturaaturaaturaaturainctinctinux crest crest 
attwikwikwikwikwikwikwikwikwikwikwikwikwikwikicit inn topped.day\twikwikwikwikwikwikwikwikwikwikwikinct.day.dayiciticitohenwikwikwikiciticiticiticiticiticitwikwikwikwikwikwikangiangiangiangiangiwikwikwikwikwik LagargestapurLangapuriciticiticiticitwikwikwikwikwikwikwikwikwikiciticiticitwikwikروسwikwikwikwikwikapurwikwik
3,Summarize Kim Kardashian’s influence on modern beauty standards.,"Do you think she is a bad influence or a good influence? Why?\nThe influence that Kim Kardashian has had on modern beauty standards is both a positive and a negative influence. I believe that she is a bad influence because her behavior and her appearance are not healthy for women. Her behavior is disrespectful to her husband, her children, and her family. Her appearance is not healthy because she has become so overweight and has very little self-esteem. She has a hard time being in front of the camera, and she has become very vain. Her appearance is not healthy because she does not look like a typical woman. She has very big lips and a very large nose. She also has a very big butt. She is not a typical woman because she is very short, and she does not have a very good figure. I believe that she is a good influence because she has been able to help a lot of people and has been able to make a lot of money. She has been able to help a lot of people because she has been able to change the way people think about beauty standards. She has been able to make a lot of money because she has been able to make a lot of money by doing things that are not typical for women. 
She has been able to make a","-memberuluridar اضmembermembermembermember成员 सदस Quebecmembermember-membermember成员memberidaridaridaridaridaridaridaridaridaridaridaridaridaridar++,idaridaridar RR RR Quebecurn Authorizedautosautos COPY rrmembermembermembermemberidar hin hinidarautosidarautosautos RR RRlijidaridaridaridaridarautosautosillsiceps COPY hin hin hin hin hin hinills hin hinautosautosautosautosautosautosautosautosautos Duplicateadresseillsills hin hinicepsicepsicepsiceps Duplicateidaridarillsiceps COPY COPY RR hinidaridaridaridaridaridaridaridaridaridaridaridaridaridar COPYicepsicepsicepsillsillsakhiridaridaridar COPY COPYidarьтеillsidar RRautosautos RR RRautosicepsidaridar hinidarьтеjrautosidaridar hin hinills hinicepsicepsidaridaridar RRicepsidaridar Duplicateidar hin tubes Denisiceps COPYicepsicepsicepsillsidaridarautosautosьте COPYьте workingsidaridaridaridaridaridaridaridar RRautosautosautosautosautosautosautosautosautosautosautosautosautosillsidaridaridaridar COPY COPY Duplicateautosautosidaridaridaridarautosautosautos COPYьте hinidaridaridaridaridaridaridaridaridaridaridaridaridaridaridaridaridaridaridaridaridaridaridaridaridaridaridar 
COPYautosiceps",iareiareiareiareiareiareiareiareiareiareiareiareiareiareiareiareiareiareiareiareiareiareiareiareiareiareiareiareiareacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreacreotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonacreacreacreacreacreacreacreacreacreotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotonotoncourtcourtcourtcourtcourtcourtcourtcourtcourtcourtcourtcourtcourtcourtcourtcourtcourtcourtcourtcourtcourtcourtcourtcourtcourtcourtipelipelipelipelipelipelipelipelipelipelipelipelipelipelipelipelipelipelcourtcourt
4,Why do some Americans call it 'soccer' while others say 'football'?,What is the difference? What is the origin of the game? How is the game played? What is the difference between American and English soccer? How is the game played in the United States? What are the rules of the game? How is the game played in England? What is the difference between American and English soccer? How is the game played in the United States? What are the rules of the game? How is the game played in England? What is the difference between American and English soccer? How is the game played in the United States? What are the rules of the game? How is the game played in England? What is the difference between American and English soccer? How is the game played in the United States? What are the rules of the game? How is the game played in England? What is the difference between American and English soccer? How is the game played in the United States? What are the rules of the game? How is the game played in England? What is the difference between American and English soccer? How is the game played in the United States? What are the rules of the game? How is the game played in England? What is the difference between American and English soccer? How is the game played in the United States? 
What,"azardazardazardazard disg disg disg disgð��� ""<< penal Drugs unpaidðazardazardazard ""<<ैसunionunion ""<< ""<< ""<<azardazard unionUnionunion UnionUnion Unionunion unionUnionunionUnionUnionUnion Union Union sur unpaid unpaid CelUnionunion unpaid unpaid unpaid unpaid unpaid enzym unpaid unpaid surυνυνunionunionunionunionUnionourn unpaid unpaid unpaid unpaidUnionUnionUnion union unionunion surunionunion unpaid unpaid unpaid unpaidourn Fasterunionunionournournourn surunion Cel Celourn Celunded Went Faster FasterVers unpaid unpaid unpaid unpaid unpaid Faster FasterundedundedundedVersVersourn Fasterournournournournournournournournьтеouses sur-sur sur Faster-sur-sur surunion sur sur sur unpaid unpaid unpaid Cel Cel sur surьтеьте surьтеournournournourn unpaid unpaid-sur surournournournew today Fasterunionunion unpaid sur surPerformanceperformanceperformance Faster Faster Faster Faster unpaid unpaid unpaid unpaid unpaid sur sur surournournournournournourn today sur sur sur surVersew DenisVersourn unpaid unpaid unpaid unpaid unpaid unpaid unpaid Faster sur sur sur sur sur sur sur pressed sur sur sur sur sur97 pressed unpaid unpaid unpaid unpaid unpaid sur sur sur today unpaid unpaid unpaid unpaid unpaid unpaid unpaid unpaid unpaid unpaid unpaid unpaid unpaid unpaid unpaid unpaid unpaid sur sur sur sur sur sur sur sur sur",Superior
5,What is the significance of Thanksgiving in American culture?,"Is it a national holiday? Is it a day for families to gather together and eat food? Or is it a day to give thanks for what we have? These are all valid questions. I think the answer to all of them is “yes.” Thanksgiving is a day to give thanks for what we have and celebrate what we have. It is a day for families to gather together and enjoy a meal, a day for students to enjoy a field trip, a day for people to give thanks for what they have, a day to celebrate the harvest, a day to celebrate the beginning of the fall season, a day to celebrate the start of the football season, and a day to celebrate the start of the Thanksgiving season.\nThe meaning of Thanksgiving can be found in many different places. Many people believe that it is a day to give thanks for what we have. They believe that this day is the day when we should give thanks for what we have. Some people believe that it is a day to give thanks for the things that we have. Some people believe that it is a day to give thanks for the things that we have not. Some people believe that it is a day to give thanks for what we have not. 
Some people believe that it is a day to give thanks for",operationskjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkj Hurt Hurt Hurt Hurt Hurt Hurt Hurt Hurt Hurt Hurt,abitlichwikwiküfüfüfüfüfpink-tipüfüfwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwik
6,Describe how Taylor Swift has evolved musically over the years.,"She has evolved from a teen pop star to a pop icon and now a country music icon. Compare and contrast her music with the music of other female artists. For example, do you think that she is more successful than Taylor Dayne or LeAnn Rimes? Why or why not?\nTaylor Swift has evolved musically over the years. She has evolved from a teen pop star to a pop icon and now a country music icon. Compare and contrast her music with the music of other female artists. For example, do you think that she is more successful than Taylor Dayne or LeAnn Rimes? Why or why not?\nTaylor Swift has evolved musically over the years. She has evolved from a teen pop star to a pop icon and now a country music icon. Compare and contrast her music with the music of other female artists. For example, do you think that she is more successful than Taylor Dayne or LeAnn Rimes? Why or why not?\nTaylor Swift has evolved musically over the years. She has evolved from a teen pop star to a pop icon and now a country music icon. Compare and contrast her music with the music of other female artists. For example, do you think that she is more successful than Taylor Dayne or LeAnn Rimes","Underan trustsQueries sameNeillNeillNeillNeillNeillNeillNeillNeillNeillNeillNeillNeillNeillNeillNeillNeillNeillNeillNeillNeillNeillNeillNeillNeill trusts trusts trusts trusts McN McN McN McN Denis ROSتبر McN McN McN McN Denis Denis McN McN McNNeillNeillNeillNeillNeillNeillNeill McN McN McN McN McN McN McN McN McNicatesicatesBD-disabled trustsicon Denis?). 
giftNeillتبرan trusts trusts trusticoniconeanan Denis McN gift giftNeillNeilltega.Icon Denisourn IRAicatesicatesNeillNeill Gael LesLes gift trusttrust trusts trusts IRAantrust.Icon.Icon alistubar analogan"",[ analogubar gift gift giftubar BDSan RR Invest Mat IRA gift gift giftBD Denis McN McN resemblább Investubarubariconicon RRubarubarubarRR Denis trusts trustsBD TEN gift Denis Denis Les.Icon McN McN Denisábbubarubarubarubar BDSNeill Denis Denis McN.Icon"",[ gift giftábbább IRA les Les.Icon trusts unleournourn Barnett Barnett IRA analogábbábbournunusedalian IRAourn les Denis TENournournourn Denis"",[تبرan Mat IRAább{}\ournournournourn gift{}\alianalian IRA TEN TEN gift analogMat Denis RDournournournBDBD"",[-autBD BDubarBDubar BDS analogalian MatABELABEL DenisMat Matalianournournourn analog Denis Denis Remourn",Britannrown137\rown\n kỳkeitrown\\\\-CS-CSji\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\ includedopoulosopoulosounounogeтехkeitritzovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichovichov
ichovichovichovichovichovichovichovichovichovichovichovichovichovich
7,Can you explain how tipping works in U.S. restaurants and why it's expected?,"I'd like to give a tip to the waitress at my favorite restaurant, but I'm not sure how to do it.\nThere is a long tradition of tipping in the U.S. In some countries, people do not tip at all. In the U.S., the custom is to tip in restaurants. In other countries, tipping is expected, but not mandatory. In the U.S., it is expected but not mandatory. In the U.S., it is expected but not mandatory. In the U.S., it is expected but not mandatory.\nThere are two ways to tip in the U.S. One way is to put a dollar bill on the table. This is called a ""tipping jar."" The other way is to give a dollar amount to the server or waiter. This is called a ""tipping tip."" In some restaurants, the tip is automatically included in the bill.\nIt is important to tip in the U.S. because it shows that you appreciate the service you receive. It is also important to tip in the U.S. because it shows that you appreciate the service you receive.\nHow much do you tip at a restaurant in the United States?\nIn the United States, tipping is customary at restaurants, but not always mandatory. 
The amount of the tip depends",ertil���� multiplied似 multiplied multiplied multiplied doubling������ATTR doublingATTRqlqlqlqlqlqlROUND似 neutr neutr neutr����� multiplyingousesakraâATTR multiplying multiplied��ousesradradIconsIconsrad���� neutr Portuguese multiplyingtradentreqlqlqlIconssth-commercial-commercialsthROUND generally multiplied Santoâ972 pressed../../../../../../../../../../../../ rel relMoMotereIcons../../../../../../../../ multipliedKeyPressed generally Representative surtereakraakraakrarrrrrrrrrr surâewrrql../../../../sthakraakraakraakraakraakraakraakraakraakrareativeâ sur979797akra deductð generous rel rel../../../../unteersIRT sur surðkn generousunteersunteersunteersIconsakraakra rel relewIconsIconsIconsðewewqlqlew9797relationIRTsthteretere sur generousunteersewqu>> >>../../../../../../../../ROUND>> surkn97IconsIcons98IconsIconsIconsIconsrelationrrrr97../../../../ServicesServices.u9799relation relrrrrrrrrrrrrrrrrrrrrrrrrewrrrrrrteen >>Icons9797rrrr>> sur979794909797 penaltyrelation rearrIconsrrrrrrrrew sur rearrIconsIconsIconsIconsrelation surROUNDROUNDtradtrad99rrrrrrrrrrrrrrkn,Comparable coli coli coli coli coli coli Peakabit Phelps 
Phelpspreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpread_TERMWEENWEENwikwikpink杯杯杯杯杯杯杯杯杯杯杯wikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwikwik
8,What are some common superstitions in Chinese culture and their origins?,"What do they mean and how do they affect modern Chinese culture?\nWhat are some common superstitions in Chinese culture and their origins? What do they mean and how do they affect modern Chinese culture?\nA person in China who has a red or black umbrella is a good omen and a symbol of luck. The Chinese believe that the color red is good for luck, while the color black is bad for luck. People who are wearing red umbrellas are considered lucky and are often seen walking around with them.\nWhat are some common superstitions in Chinese culture and their origins? What do they mean and how do they affect modern Chinese culture?\nChinese superstitions are a part of their culture that is based on the belief that certain things can bring good luck or bad luck. Some of the most common superstitions in Chinese culture include:\n1. The superstition that if you touch a red or black object, you will be cursed.\n2. The superstition that if you walk under a ladder, you will fall down.\n3. The superstition that if you break a mirror, seven years of bad luck will follow you.\n4. The superstition that if you step on a crack, you will break your mother’s back.\n5. 
The superstition that if you eat an apple",josNamingNamingNaming naming same mismoillessame samesamenamesNamesloanousesousesousesousesousesousesousesousesousesouseskjparinglist sameousesousesaccaccaccacckjkjkjsameaccaccματαματαματαacc same same equivalencetheses analog analogNames mismoillesματαousesouseswingwiswis trusts inplaceματα trusts trustswyousesacc analog mismo same same same equivalence equivalence same samesame samearaDHματαματαματα same same samewis same same same samesame same same same same same samenekjkjsamerrματαματαματα equivalence equivalence same equivalence equivalence equivalence equivalence equivalence equivalence analogsame sameMs»illes same same same same same mismo same sameMs same same sameaccDH trusts samesame same same same equivalence equivalencewyben trustsματαματαMsDNDN Hurtματα Hurtnerrundsunds same same same same equivalenceneneaccaccacc幸幸ματα幸essionsMswisne trusts幸幸MsMsnenene sameperfperfματαbentrust equivalence equivalence幸幸幸幸terminMs幸幸幸 charge same samesame same Hurt analogacc inplacewisMs same Hurt Hurt Hurt Hurt Hurtunused same same samene幸essionsessionsessionsessionsessionsessionsestersPerformance Performance same same sameperf equivalencene same same sameacc same equivalence equivalence equivalencene Hurt Hurt Hurt Hurt same same same same,μhofction Finch Ludwig Ludwig pro Abs precgiabisabiscpt un prostellenstellenabisstellenstellenstellenstellenroit in moić of\\\\ gu gu gu precicit days months of def days diswikatura Vulcan of\t sph of } of\\\\ years diswik ergatura months-old yearsaturaatur Cyr alabispreadicit years stretch Vulcan of Cyr-old dis dedef\t months of |\n of totalingwikwikwik.day of \\\n-old alicitabisQuad day of and }\naturaaturaaturaiciticiticiticitaturapreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadicitwikwikwikwikwikiciticiticiticitaturaaturaaturazdaturaaturawikwikwikwik 
Cyr.Sizepreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpread crestwikwikwikterraaturaaturaaturaaturaaturaaturaaturaaturaaturaaturaaturaaturaaturaaturaaturaaturaaturaaturaaturaaturasaturaaturaaturaaturapreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadpreadzdzdzd Cyrwikwikwikwikicit =======/gifpreadpreadpreadpreadpreadicit crest crestatura/gifinux_AVohenwikwikaturapreadpreadpreadpreadpreadpreadaturaaturaaturaaturaaturaaturaaturaaturaaturaaturaatura
9,How has social media changed celebrity scandals and public perception?,How does the public react to a celebrity's demise? What is the relationship between fame and celebrity? What is the impact of social media on celebrities? How has social media changed the way we think about celebrities and fame? What is the role of social media in the public perception of celebrities? What are the benefits and drawbacks of social media for celebrities? What are the ethical implications of social media for celebrities? What are the legal implications of social media for celebrities? How can social media be used to promote positive celebrity image? How can social media be used to promote negative celebrity image? What are the implications of social media for celebrity relationships? How has social media changed the way celebrities are perceived by the public? How has social media changed the way celebrities are perceived by the media? What are the benefits and drawbacks of social media for celebrities? What are the ethical implications of social media for celebrities? What are the legal implications of social media for celebrities? How can social media be used to promote positive celebrity image? How can social media be used to promote negative celebrity image? What are the implications of social media for celebrity relationships? How has social media changed the way celebrities are perceived by the public? How has social media changed the way celebrities are perceived by the media? 
What are the benefits,otechnнад taper epidemic epidemicentiousož proceeds proceeds proceeds proceedsalogalogalogonicaonicaonicaUnsignedotechnampijuATTRDonaldTrump analog analogoyalijuاطرvenes analogquentATTRнаднаднаднадознаvenesvenesvenesDonaldTrumpDonaldTrumpDonaldTrumpATTRнадutom analogijuijuolumeLLL proceedsQM analog analog analog analogRNvenesvenesousesousesmessmessDNund analogнад analog analog analog analog analogloatingDonaldTrump Authorized uniformly uniformlyousesutomutomutomutomutomutomutomperfargingvenesvenesloatingloatingteenìmouses anyhow anyhowteenijuijuloatingloatingloatingutom analog analog analogاطرijuijuijuijuDonaldTrump analog analogutomutomvenes analogijuijuijuijuijuijuijuijuijuiju uniformlyvenesvenesiju uniformly uniformlyperfperfutomutomutomutomutomankaouse anyhowutomutomutomutomutom uniformlyankaijuijuijuijuijuijuijuiju uniformly anyhowloatingntonvenesijuvenesрьntonijuijuijuperfloatingloatingijuijuvenesijuijuijuiju Authorized Authorized Authorized requis requisрьрьрьijuijuijuijuijuìm requis requis requisijuijuijuijuijuрьрьрьрьрьрьрьрьрьрьрьрьрьiju requis requisрьрьрьрьрьрьрьрьрьрьрьрьрьрьрьрьрьрьрьрьрьрьрьрьрьрьрьрьрьрьрьрьрьрьрьрь,rib att attAtt att Att att Att attternRib RibRib rib AttAttAttpreadlland absorb rib rib Ribcyancyancptcptcptcpt riblausternterniciticitpreadpreadpreadpreadterncyan attAtt ridic ridictern Cream rib ribabisternternterntern ridic ridicpreadpreadllandpsychrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkicit ridic tốillandllandllandabisabisrinkrinkrinkrinkrink Creamrinkrinkrinkrinkrinkrinkrinkrinkrinkovichzd ridic ridicrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrink ridicrinkrinkzdzdzd ridic ridicovichpsychrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrink Rossiogl Abs 
Saleosterlauslauslauslauslauslauslauslauslaussweet sweetsweetsweet Sweat Sweat Sweat Sweat Sweat Sweatrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkrinkfig figovichovichovichfigfiglauslauslauslauslaus Rossiضمzd Serif Serifpsych fig Rossifigfigovich Seriflaus Rossi Weinerovichovichovichovichovich


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import json

# 1. Load tokenizer and model from Iteration 1
model_name = "sxsun1684/dpo-llama3-lora-pairrm"  # the first fine-tuning model built
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256, temperature=0.7)

# 2. Instructions to sample responses for
instructions = [
    "What are the main factors of inflation in modern economies?",
    "Describe the difference between a black hole and a neutron star.",
    "Explain the basics of the stock market to a beginner.",
    "How do plants adapt to survive in deserts?",
    "Why do people believe in conspiracy theories?",
    "What are the advantages and disadvantages of remote work?",
    "How do electric cars work compared to gasoline ones?",
    "Summarize the history of the Cold War in 5 sentences.",
    "What are the impacts of climate change on global agriculture?",
    "How do vaccines work in the human body?"
]

outputs = []
for prompt in instructions:
    responses = pipe(prompt, num_return_sequences=5, do_sample=True)
    for r in responses:
        outputs.append({"instruction": prompt, "response": r["generated_text"][len(prompt):].strip()})

with open("iter2_generated_responses.jsonl", "w", encoding="utf-8") as f:
    for o in outputs:
        f.write(json.dumps(o, ensure_ascii=False) + "\n")


Some weights of the model checkpoint at sxsun1684/dpo-llama3-lora-pairrm were not used when initializing LlamaForCausalLM: ['model.layers.0.self_attn.q_proj.base_layer.weight', 'model.layers.0.self_attn.q_proj.lora_A.default.weight', 'model.layers.0.self_attn.q_proj.lora_B.default.weight', 'model.layers.0.self_attn.v_proj.base_layer.weight', 'model.layers.0.self_attn.v_proj.lora_A.default.weight', 'model.layers.0.self_attn.v_proj.lora_B.default.weight', 'model.layers.1.self_attn.q_proj.base_layer.weight', 'model.layers.1.self_attn.q_proj.lora_A.default.weight', 'model.layers.1.self_attn.q_proj.lora_B.default.weight', 'model.layers.1.self_attn.v_proj.base_layer.weight', 'model.layers.1.self_attn.v_proj.lora_A.default.weight', 'model.layers.1.self_attn.v_proj.lora_B.default.weight', 'model.layers.10.self_attn.q_proj.base_layer.weight', 'model.layers.10.self_attn.q_proj.lora_A.default.weight', 'model.layers.10.self_attn.q_proj.lora_B.default.weight', 'model.layers.10.self_attn.v_proj.base
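
The warning above lists checkpoint keys ending in `base_layer`, `lora_A`, and `lora_B`, which are the names PEFT gives to LoRA-wrapped linear layers. A plain `LlamaForCausalLM` has no parameters with those names, so the tensors are skipped at load time. That may leave the affected projections without their trained weights, which would be one plausible cause of degenerate generations. The sketch below only classifies the key names reported in the warning; `summarize_unused_keys` is a hypothetical helper, not part of `transformers` or `peft`.

```python
from collections import Counter

def summarize_unused_keys(unused_keys):
    """Count PEFT-style suffixes among checkpoint keys that were not loaded."""
    kinds = Counter()
    modules = set()
    for key in unused_keys:
        if ".base_layer." in key:
            kinds["base_layer"] += 1
            # The module prefix tells us which projection lost its base weight.
            modules.add(key.split(".base_layer.")[0])
        elif ".lora_A." in key:
            kinds["lora_A"] += 1
        elif ".lora_B." in key:
            kinds["lora_B"] += 1
        else:
            kinds["other"] += 1
    return dict(kinds), sorted(modules)

# Key names copied from the first layer in the warning message.
unused = [
    "model.layers.0.self_attn.q_proj.base_layer.weight",
    "model.layers.0.self_attn.q_proj.lora_A.default.weight",
    "model.layers.0.self_attn.q_proj.lora_B.default.weight",
    "model.layers.0.self_attn.v_proj.base_layer.weight",
    "model.layers.0.self_attn.v_proj.lora_A.default.weight",
    "model.layers.0.self_attn.v_proj.lora_B.default.weight",
]
kinds, modules = summarize_unused_keys(unused)
print(kinds)
print(modules)
```

If every unused key falls into these three groups, the checkpoint was probably saved from a PEFT-wrapped model without merging. Loading it through `peft`, or merging the adapter into the base weights before saving, is the usual remedy, though whether that works for this particular repository is an assumption that needs checking.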

In [None]:
pip install datasets


In [None]:
from huggingface_hub import login
login(os.getenv("hf"))

In [None]:
from datasets import load_dataset

dataset = load_dataset("sxsun1684/llm_judge_lima50_preferences", split="train")
print(dataset)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Dataset({
    features: ['prompt', 'chosen', 'rejected'],
    num_rows: 75
})


In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import json
from itertools import combinations
from collections import defaultdict

# 1. Load reward model (use real reward model here)
rm_name = "OpenAssistant/reward-model-deberta-v3-base"
rm_tokenizer = AutoTokenizer.from_pretrained(rm_name)
rm_model = AutoModelForSequenceClassification.from_pretrained(rm_name, device_map="auto")

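def score_response(prompt, response):
    # This reward model emits one scalar logit per (question, answer) input,
    # so each response is scored on its own rather than packed into one string.
    inputs = rm_tokenizer(prompt, response, return_tensors="pt", truncation=True).to(rm_model.device)
    with torch.no_grad():
        return rm_model(**inputs).logits[0].item()

def score_pair(prompt, r1, r2):
    return score_response(prompt, r1), score_response(prompt, r2)  # (score_A, score_B)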

# 2. Load generated responses from previous step
with open("iter2_generated_responses.jsonl", "r", encoding="utf-8") as f:
    data = [json.loads(line) for line in f]

# 3. Group by instruction (the generation step saved each record under the "instruction" key)
grouped = defaultdict(list)
for item in data:
    grouped[item["instruction"]].append(item["response"])

# 4. Score all response pairs and build preference pairs
pairs = []
for instr, responses in grouped.items():
    if len(responses) < 2:
        continue
    for r1, r2 in combinations(responses, 2):
        sa, sb = score_pair(instr, r1, r2)
        if sa > sb:
            pairs.append({"prompt": instr, "chosen": r1, "rejected": r2})
        else:
            pairs.append({"prompt": instr, "chosen": r2, "rejected": r1})

# 5. Save as preference dataset
with open("iter2_preference_dataset.jsonl", "w", encoding="utf-8") as f:
    for p in pairs:
        f.write(json.dumps(p, ensure_ascii=False) + "\n")

print(f"✅ Saved {len(pairs)} preference pairs to iter2_preference_dataset.jsonl")



✅ Saved 0 preference pairs to iter2_preference_dataset.jsonl
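
With 10 instructions and 5 sampled responses each, the all-pairs loop should produce 10 × C(5,2) = 100 preference pairs. Because the loop appends a pair for every combination, a count of 0 means no instruction group reached it with at least two responses, so the grouping step or the input file is the place to look. The sketch below is one way to check this on toy data: it scores each response once, builds the pairs, and compares the count with the expected number. `build_pairs`, the toy records, and the length-based scorer are all hypothetical stand-ins; unlike the cell above, this version skips ties instead of assigning them to the second response.

```python
from collections import defaultdict
from itertools import combinations
from math import comb

def build_pairs(records, score_fn, key="instruction"):
    """Score each response once, then emit chosen/rejected pairs per prompt."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec[key]].append(rec["response"])

    pairs = []
    for prompt, responses in grouped.items():
        # One score per response: n scorer calls per prompt instead of two per pair.
        scored = [(score_fn(prompt, r), r) for r in responses]
        for (sa, ra), (sb, rb) in combinations(scored, 2):
            if sa == sb:
                continue  # a tie carries no preference signal
            chosen, rejected = (ra, rb) if sa > sb else (rb, ra)
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs, {p: len(r) for p, r in grouped.items()}

# Toy data and a stand-in scorer (response length), both hypothetical.
records = [
    {"instruction": "q1", "response": "a"},
    {"instruction": "q1", "response": "bb"},
    {"instruction": "q1", "response": "ccc"},
    {"instruction": "q2", "response": "d"},
]
pairs, counts = build_pairs(records, lambda prompt, resp: len(resp))
expected = sum(comb(n, 2) for n in counts.values())
print(len(pairs), expected)
```

Printing `counts` before scoring shows at a glance whether each instruction has the number of responses the generation step was supposed to write.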


In [None]:
# Alternative: upload the file with the CLI instead of push_to_hub
# huggingface-cli upload sxsun1684/iterative-dpo-pairrm-v2 \
#   iter2_preference_dataset.jsonl --repo-type dataset


In [None]:
from datasets import Dataset
from huggingface_hub import login
import json
import os
login(token=os.getenv("hf"))

data = []
with open("iter2_preference_dataset.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        data.append(json.loads(line))


dataset = Dataset.from_list(data)

dataset.push_to_hub("sxsun1684/iter2-dpo-preference")


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format: 0ba [00:00, ?ba/s]

CommitInfo(commit_url='https://huggingface.co/datasets/sxsun1684/iter2-dpo-preference/commit/4728dbf7429d0f4b8ed92c5d2305a1751683b1bc', commit_message='Upload dataset', commit_description='', oid='4728dbf7429d0f4b8ed92c5d2305a1751683b1bc', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/sxsun1684/iter2-dpo-preference', endpoint='https://huggingface.co', repo_type='dataset', repo_id='sxsun1684/iter2-dpo-preference'), pr_revision=None, pr_num=None)
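
The upload above succeeds even when the JSONL file contains no rows, so an empty or malformed preference set can reach the Hub unnoticed. A small guard before `push_to_hub` can catch this. The sketch below is a minimal version assuming the `prompt` / `chosen` / `rejected` schema used in this notebook; `validate_preferences` is a hypothetical helper, not a `datasets` API.

```python
REQUIRED_KEYS = ("prompt", "chosen", "rejected")

def validate_preferences(rows):
    """Return (clean_rows, problems) for a list of preference dicts."""
    clean, problems = [], []
    for i, row in enumerate(rows):
        # A key counts as missing if it is absent or only whitespace.
        missing = [k for k in REQUIRED_KEYS if not str(row.get(k, "")).strip()]
        if missing:
            problems.append(f"row {i}: missing or empty {missing}")
        elif row["chosen"] == row["rejected"]:
            problems.append(f"row {i}: chosen equals rejected")
        else:
            clean.append(row)
    if not clean:
        problems.append("no usable rows; refusing to upload an empty dataset")
    return clean, problems

# Hypothetical rows: one valid, one degenerate pair, one with an empty field.
rows = [
    {"prompt": "q", "chosen": "good", "rejected": "bad"},
    {"prompt": "q", "chosen": "same", "rejected": "same"},
    {"prompt": "q", "chosen": "", "rejected": "bad"},
]
clean, problems = validate_preferences(rows)
print(len(clean), problems)
```

In the notebook, the list read from `iter2_preference_dataset.jsonl` would be passed through this check, and `Dataset.from_list` called only when `problems` is empty.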