# Exercise 1: Smart Dataset Sampling for Optimal Model Performance 🎯
In this exercise, we'll explore how intelligent dataset sampling can significantly impact model performance. We'll learn how to create high-quality training sets by implementing strategic sampling techniques.

### 🌟 The Challenge
Training on our entire dataset without proper sampling can lead to:

* **Noisy Training Signals**: Not all data contributes equally to model learning
* **Suboptimal Performance**: Quantity doesn't always mean quality
* **Inefficient Learning**: The model might focus on redundant or low-quality examples

### Git Clone

In [1]:
! git clone https://github.com/thomsonreuters/labs_AMLD25_Workshop

Cloning into 'labs_AMLD25_Workshop'...
remote: Enumerating objects: 59, done.
remote: Counting objects: 100% (59/59), done.
remote: Compressing objects: 100% (45/45), done.
remote: Total 59 (delta 8), reused 56 (delta 8), pack-reused 0 (from 0)
Receiving objects: 100% (59/59), 3.92 MiB | 18.14 MiB/s, done.
Resolving deltas: 100% (8/8), done.


### Install dependencies

In [2]:
! pip install -r /kaggle/working/labs_AMLD25_Workshop/sessions/4_RLalignment_and_DPO/requirements.txt
! pip install flash-attn==2.7.3 --no-build-isolation

Collecting accelerate==1.3.0 (from -r /kaggle/working/labs_AMLD25_Workshop/sessions/4_RLalignment_and_DPO/requirements.txt (line 1))
  Downloading accelerate-1.3.0-py3-none-any.whl.metadata (19 kB)
Collecting bitsandbytes==0.45.1 (from -r /kaggle/working/labs_AMLD25_Workshop/sessions/4_RLalignment_and_DPO/requirements.txt (line 2))
  Downloading bitsandbytes-0.45.1-py3-none-manylinux_2_24_x86_64.whl.metadata (5.8 kB)
Collecting transformers==4.48.1 (from -r /kaggle/working/labs_AMLD25_Workshop/sessions/4_RLalignment_and_DPO/requirements.txt (line 5))
  Downloading transformers-4.48.1-py3-none-any.whl.metadata (44 kB)
Collecting trl==0.13.0 (from -r /kaggle/working/labs_AMLD25_Workshop/sessions/4_RLalignment_and_DPO/requirements.txt (line 6))
  Downloading trl-0.13.0-py3-none-any.whl.metadata (11 kB)
Downloading accelerate-1.3.0-py3-none-any.whl (336 kB)

### Testing GPU
Check that Python recognizes the allocated GPU. If it does not, go to `Settings` > `Accelerator` > `GPU T4 x 2`.

In [3]:
import os, sys

# from tensorflow.python.client import device_lib
repo_folder = os.getcwd().split('labs_AMLD25_Workshop')[0]+"/labs_AMLD25_Workshop/src" 
sys.path.append(repo_folder)

# UNCOMMENT TO CHECK GPU HW
# device_lib.list_local_devices()
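
The commented-out check above relies on TensorFlow. Since the rest of the notebook uses PyTorch, a minimal sketch of the same check with `torch.cuda` could look like the following. The helper name `describe_gpus` is hypothetical, and it simply returns an empty list when PyTorch or a GPU is not available.

```python
def describe_gpus():
    """Return the names of the CUDA devices PyTorch can see (empty if none)."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]


gpus = describe_gpus()
print(f"{len(gpus)} GPU(s) visible: {gpus}")
```

With the `GPU T4 x 2` accelerator selected, the list should contain two entries.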

If you get two GPUs, you can assign them manually using environment variables. This step is optional, since PyTorch should recognize them automatically.

In [4]:
os.environ["WANDB_DISABLED"] = "true" ## turning off WandB logging
os.environ['CUDA_VISIBLE_DEVICES'] = "0,1"

rl_foolder = "labs_AMLD25_Workshop/sessions/4_RLalignment_and_DPO"

In [5]:
import torch

from typing import Optional, List, Dict
import datasets
from datasets import (
    load_dataset, 
    load_from_disk, 
    DatasetDict,
    concatenate_datasets
)

from accelerate import Accelerator, PartialState
from transformers import AutoModelForCausalLM, AutoTokenizer

from trl import (
    ModelConfig,
    DPOTrainer,
    DPOConfig,
    TrlParser,
    get_kbit_device_map,
    get_peft_config,
    get_quantization_config,
)

from trlabs.rl.data import (
    get_datasets, 
    DataArguments
)

from trlabs.utils import *

from trl.trainer.utils import SIMPLE_CHAT_TEMPLATE


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


### Model Config

In [6]:
model_config = {
    "model_name_or_path": "Qwen/Qwen2-0.5B-Instruct",
    "torch_dtype": "bfloat16",
    "use_peft": True, 
    "lora_r": 64,        # LoRA rank
    "lora_alpha": 32,    # Scaling factor (effective scale = lora_alpha / lora_r = 0.5)
    "lora_dropout": 0.1, # Prevent overfitting
}
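
To make the LoRA settings more concrete, here is a small worked calculation of how many trainable parameters they add. It is a sketch under two assumptions that the notebook does not state explicitly: the adapters are attached only to `q_proj` and `v_proj`, and the layer sizes are those of the Qwen2-0.5B configuration printed in the training logs of this notebook. The helper `lora_param_count` is hypothetical.

```python
def lora_param_count(r, shapes):
    """Parameters added by LoRA: each adapted (d_in, d_out) linear layer
    gets two low-rank matrices, A (r x d_in) and B (d_out x r)."""
    return sum(r * (d_in + d_out) for d_in, d_out in shapes)


hidden_size = 896                      # values from the Qwen2-0.5B config
num_layers = 24
num_heads = 14
num_kv_heads = 2
head_dim = hidden_size // num_heads    # 64

# Assumption: adapters on q_proj and v_proj only.
per_layer = [
    (hidden_size, hidden_size),              # q_proj
    (hidden_size, num_kv_heads * head_dim),  # v_proj
]

total = num_layers * lora_param_count(64, per_layer)
scaling = 32 / 64                      # lora_alpha / lora_r
print(total, scaling)  # 4325376 0.5
```

Under these assumptions the result matches the `Number of trainable parameters = 4,325,376` line reported when training starts.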


### Data Config
You can leverage the preference dataset for this task located in `labs_AMLD25_Workshop/sessions/4_RLalignment_and_DPO/data/AMLD25_reuters_gentitle_1k`. 

In [7]:
data_params = {
  "dataset_name": "Mix 1",
  "dataset_mixer": {
    f"{rl_foolder}/data/AMLD25_reuters_gentitle_1k": 1.,
  },
  "dataset_splits": ["train", "test"],
  "num_eval_samples": 100,
  "seed": 42
}

### Training Config

In [8]:
training_params =  {
    ## General
    "output_dir": f"{model_config['model_name_or_path'].split('/')[0].lower()}_ex1_output",
    "num_train_epochs": 1,
    "beta": 0.1,
    "eval_strategy": "steps",
    "eval_steps": 8,
    "per_device_train_batch_size": 1,
    "per_device_eval_batch_size": 1,
    "gradient_accumulation_steps": 8,
    ## Context length and max length (max_new_tokens = max_length - max_prompt_length)
    "max_length": 768,
    "max_prompt_length": 512,
    ## Optimizer
    "optim": "adamw_torch",
    "learning_rate": 2.0e-4,
    "weight_decay": 0.001,
    "adam_epsilon": 1.0e-8,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "max_grad_norm": 1.0,
    ## Scheduler ##
    "warmup_steps": 10,
    "lr_scheduler_type": "cosine",
    ## Logging
    "log_level": "info",
    "logging_first_step": True,
    "logging_steps": 10
}

### DPO Training Loop

In [9]:
accelerator = Accelerator()


data_args = DataArguments(**data_params)
training_args =  DPOConfig(**training_params)
model_args = ModelConfig(**model_config)

###################
# Model & Tokenizer
###################
torch_dtype = (
    model_args.torch_dtype
    if model_args.torch_dtype in ["auto", None]
    else getattr(torch, model_args.torch_dtype)
)
quantization_config = get_quantization_config(model_args)
model_kwargs = dict(
    revision=model_args.model_revision,
    attn_implementation=model_args.attn_implementation,
    torch_dtype=torch_dtype,
    use_cache=False if training_args.gradient_checkpointing else True,
    device_map=get_kbit_device_map() if quantization_config is not None else None,
    quantization_config=quantization_config,
)
model = AutoModelForCausalLM.from_pretrained(
    model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code, **model_kwargs
)
peft_config = get_peft_config(model_args)
if peft_config is None:
    ref_model = AutoModelForCausalLM.from_pretrained(
        model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code, **model_kwargs
    )
else:
    ref_model = None
tokenizer = AutoTokenizer.from_pretrained(
    model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code
)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
if tokenizer.chat_template is None:
    tokenizer.chat_template = SIMPLE_CHAT_TEMPLATE

################
# Dataset
################
dataset =  get_datasets(data_args, splits=data_args.dataset_splits)



################
# Training
################
trainer = DPOTrainer(
    model,
    ref_model,
    args=training_args,
    train_dataset=dataset[data_args.dataset_train_split],
    eval_dataset=dataset[data_args.dataset_test_split].shuffle(data_args.seed)\
        .take(min(len(dataset[data_args.dataset_test_split]), data_args.num_eval_samples))\
        if training_args.eval_strategy != "no" else None,
    processing_class=tokenizer,
    peft_config=peft_config,
)

trainer.train()

if training_args.eval_strategy != "no":
    metrics = trainer.evaluate()
    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)

# Save and push to hub
trainer.save_model(training_args.output_dir)
if training_args.push_to_hub:
    trainer.push_to_hub(dataset_name=data_args.dataset_name)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


config.json:   0%|          | 0.00/659 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/988M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/242 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.29k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

Extracting prompt from train dataset:   0%|          | 0/987 [00:00<?, ? examples/s]

Applying chat template to train dataset:   0%|          | 0/987 [00:00<?, ? examples/s]

Extracting prompt from eval dataset:   0%|          | 0/100 [00:00<?, ? examples/s]

Applying chat template to eval dataset:   0%|          | 0/100 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/987 [00:00<?, ? examples/s]

Tokenizing eval dataset:   0%|          | 0/100 [00:00<?, ? examples/s]

The following columns in the training set don't have a corresponding argument in `PeftModelForCausalLM.forward` and have been ignored: origin_response_c_r, chosen_reward, date, prompt, chosen, rejected_reward, rejected. If origin_response_c_r, chosen_reward, date, prompt, chosen, rejected_reward, rejected are not expected by `PeftModelForCausalLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 987
  Num Epochs = 1
  Instantaneous batch size per device = 1
  Training with DataParallel so batch size has been adjusted to: 2
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 8
  Total optimization steps = 61
  Number of trainable parameters = 4,325,376


| Step | Training Loss | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/chosen | Logps/rejected | Logits/chosen | Logits/rejected |
|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 5.5312 | 0.662812 | 0.166992 | 0.089355 | 0.58 | 0.077637 | -54.75 | -53.25 | -3.0625 | -3.0625 |
| 16 | 5.4136 | 0.609609 | 0.867188 | 0.601562 | 0.67 | 0.265625 | -47.75 | -48.25 | -3.109375 | -3.109375 |
| 24 | 4.7673 | 0.553813 | 0.457031 | -0.059082 | 0.73 | 0.515625 | -51.75 | -54.75 | -3.09375 | -3.09375 |
| 32 | 4.3073 | 0.539946 | 0.227539 | -0.486328 | 0.76 | 0.714844 | -54.0 | -59.0 | -3.09375 | -3.09375 |
| 40 | 4.3715 | 0.501414 | -0.371094 | -1.164062 | 0.75 | 0.792969 | -60.0 | -66.0 | -3.078125 | -3.078125 |
| 48 | 4.3715 | 0.498445 | -0.349609 | -1.140625 | 0.78 | 0.792969 | -59.75 | -65.5 | -3.078125 | -3.078125 |
| 56 | 4.3975 | 0.490835 | -0.253906 | -1.070312 | 0.79 | 0.8125 | -58.75 | -65.0 | -3.09375 | -3.078125 |


The following columns in the evaluation set don't have a corresponding argument in `PeftModelForCausalLM.forward` and have been ignored: origin_response_c_r, chosen_reward, date, prompt, chosen, rejected_reward, rejected. If origin_response_c_r, chosen_reward, date, prompt, chosen, rejected_reward, rejected are not expected by `PeftModelForCausalLM.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 100
  Batch size = 2

Saving model checkpoint to qwen_ex1_output
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2-0.5B-Instruct/snapshots/c540970f9e29518b1d8f06ab8b24cba66ad77b6d/config.json
Model config Qwen2Config {
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 896,
  "initializer_range": 0.02,
  "intermediate_size": 4864,
  "max_position_embeddings": 32768,
  "max_window_layers": 24,
  "model_type": "qwen2",
  "num_attention_heads": 14,
  "num_hidden_layers": 24,
  "num_key_value_heads": 2,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.48.1",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 151936
}



***** eval metrics *****
  epoch                   =     0.9879
  eval_logits/chosen      =    -3.0938
  eval_logits/rejected    =    -3.0781
  eval_logps/chosen       =     -58.75
  eval_logps/rejected     =      -65.0
  eval_loss               =     0.4905
  eval_rewards/accuracies =       0.79
  eval_rewards/chosen     =    -0.2383
  eval_rewards/margins    =     0.8203
  eval_rewards/rejected   =    -1.0625
  eval_runtime            = 0:00:54.86
  eval_samples_per_second =      1.823
  eval_steps_per_second   =      0.911
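
To read these metrics, it helps to recall how DPO derives them. The sketch below assumes the standard sigmoid DPO loss with `beta = 0.1` as set in the training config: each reward is `beta` times the log-probability gap between the policy and the frozen reference model, the margin is the chosen reward minus the rejected reward, and the accuracy is the fraction of pairs with a positive margin. The function `dpo_pair` and the log-probability values are made up for illustration.

```python
import math


def dpo_pair(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Implicit rewards, margin, and sigmoid DPO loss for one preference pair.

    Inputs are the summed log-probabilities of each response under the
    policy and under the reference model.
    """
    reward_chosen = beta * (policy_chosen - ref_chosen)
    reward_rejected = beta * (policy_rejected - ref_rejected)
    margin = reward_chosen - reward_rejected
    loss = math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))
    return reward_chosen, reward_rejected, margin, loss


# Before any update the policy equals the reference, so the margin is 0
# and the loss is log(2), roughly 0.693.
print(dpo_pair(-55.0, -54.0, -55.0, -54.0))

# Once the policy prefers the chosen answer, the margin grows and the loss drops.
print(dpo_pair(-50.0, -60.0, -55.0, -54.0))
```

This is why the validation loss starts close to 0.69 and decreases as the reward margin increases.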


tokenizer config file saved in qwen_ex1_output/tokenizer_config.json
Special tokens file saved in qwen_ex1_output/special_tokens_map.json


### Your Turn!
The `dataset_mixer` selects a random subset of the dataset, with **X** as the fraction of examples to keep:

```python
data_params = {
  "dataset_name": "Mix 1",
  "dataset_mixer": {
    "labs_AMLD25_Workshop/sessions/4_RLalignment_and_DPO/data/AMLD25_reuters_gentitle_1k": X,
  },
  "dataset_splits": ["train", "test"],
  "num_eval_samples": 100,
  "seed": 42
}
```
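
As an illustration of what sampling a fraction at random means, here is a pure-Python sketch. It assumes a seeded shuffle followed by truncation. The actual `get_datasets` implementation may work differently, and `sample_fraction` is a hypothetical helper.

```python
import random


def sample_fraction(rows, fraction, seed=42):
    """Return a reproducible random subset containing `fraction` of `rows`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded, so the subset is repeatable
    return shuffled[: int(len(shuffled) * fraction)]


subset = sample_fraction(range(987), 0.5)
print(len(subset))  # 493
```

Random sampling treats every example as equally useful, which is exactly the limitation this exercise asks you to improve on.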

Can we do a better selection? 

**Hint(s)**: 
1. Take a close look at the maximum context length (768) and at the `chosen` and `rejected` features.
2. Check the other columns as well.

<details>
<summary> <b>Solution Spoiler!</b> </summary>
  Search in <code>src/trlabs/utils.py</code> for the solution (functions: <code>reuters_cleaning_dataset</code> and <code>not_relevant_data</code>)
</details>
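
One way to turn the hints into code is a per-example filter. The sketch below is not the repository's `reuters_cleaning_dataset`. It is a hypothetical filter that assumes the `prompt`, `chosen`, and `rejected` columns hold plain strings, uses a whitespace word count as a rough stand-in for real token counts, and treats the `chosen_reward` and `rejected_reward` columns as comparable scores.

```python
def keep_example(example, max_length=768, min_reward_gap=0.0):
    """Hypothetical quality filter for one preference pair."""
    def n_tokens(text):
        # Crude proxy: whitespace-separated words instead of real tokens.
        return len(text.split())

    prompt_len = n_tokens(example["prompt"])
    longest_answer = max(n_tokens(example["chosen"]), n_tokens(example["rejected"]))
    fits_context = prompt_len + longest_answer <= max_length
    clear_preference = example["chosen_reward"] - example["rejected_reward"] > min_reward_gap
    return fits_context and clear_preference


rows = [
    {"prompt": "word " * 100, "chosen": "GOOD TITLE", "rejected": "bad title",
     "chosen_reward": 0.9, "rejected_reward": 0.2},   # fits, clear preference
    {"prompt": "word " * 800, "chosen": "GOOD TITLE", "rejected": "bad title",
     "chosen_reward": 0.9, "rejected_reward": 0.2},   # too long for the context
    {"prompt": "word " * 100, "chosen": "TITLE A", "rejected": "TITLE B",
     "chosen_reward": 0.5, "rejected_reward": 0.5},   # no preference signal
]
print([keep_example(r) for r in rows])  # [True, False, False]
```

A function with this shape can be passed to `Dataset.filter`, in the same way the solution cell passes `reuters_cleaning_dataset`.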


## Take a Look at the Model Generation

In [10]:
from trlabs.utils import dataset_creation, not_relevant_data

SYSTEM_PROMPT = 'You are an advanced AI system specialised in providing Reuters News title given a body text of the news.'
INSTRUCTION = "The title should be in capital letters and between 6 and 8 words in length. Please provide only the title as output and no other text or explanation."

dataset = load_dataset("ucirvine/reuters21578", 'ModApte', trust_remote_code=True)
dataset = dataset.filter(not_relevant_data).shuffle(seed=42).map(dataset_creation, fn_kwargs={"system_prompt": SYSTEM_PROMPT, "instruction": INSTRUCTION})

README.md:   0%|          | 0.00/16.0k [00:00<?, ?B/s]

reuters21578.py:   0%|          | 0.00/17.9k [00:00<?, ?B/s]

reuters21578.tar.gz:   0%|          | 0.00/8.15M [00:00<?, ?B/s]

Generating test split:   0%|          | 0/3299 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/9603 [00:00<?, ? examples/s]

Generating unused split:   0%|          | 0/722 [00:00<?, ? examples/s]

Filter:   0%|          | 0/3299 [00:00<?, ? examples/s]

Filter:   0%|          | 0/9603 [00:00<?, ? examples/s]

Filter:   0%|          | 0/722 [00:00<?, ? examples/s]

Map:   0%|          | 0/3295 [00:00<?, ? examples/s]

Map:   0%|          | 0/9583 [00:00<?, ? examples/s]

Map:   0%|          | 0/722 [00:00<?, ? examples/s]

In [11]:
from trlabs.rl.eval import setup_model_and_lora, generate

index = 15
prompt = dataset["test"][index]["system"] + dataset["test"][index]["messages"]

model, tokenizer = setup_model_and_lora(
    base_model_name = model_config["model_name_or_path"], 
    lora_path = training_params["output_dir"]
)

response = generate(prompt, model, tokenizer)
print(response)

loading file vocab.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2-0.5B-Instruct/snapshots/c540970f9e29518b1d8f06ab8b24cba66ad77b6d/vocab.json
loading file merges.txt from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2-0.5B-Instruct/snapshots/c540970f9e29518b1d8f06ab8b24cba66ad77b6d/merges.txt
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2-0.5B-Instruct/snapshots/c540970f9e29518b1d8f06ab8b24cba66ad77b6d/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2-0.5B-Instruct/snapshots/c540970f9e29518b1d8f06ab8b24cba66ad77b6d/tokenizer_config.json
loading file chat_template.jinja from cache at None
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
loading configuration file config.json fro

system
You are an advanced AI system specialised in providing Reuters News title given a body text of the news.
user
China's foreign debt is very low given its export capability, the size of its economy and its growth potential and the country is politically stable, Jean-Maxime Leveque, chairman of Credit Lyonnais, told reporters. Leveque, who has met the heads of most of China's banks including the president of its central bank during a visit here, said the Chinese authorities are very attentive to its foreign debt and have the matter under control. Official figures show China's foreign debt at a post-1949 record 16 billion dlrs at end-1986. Asked if he had advised China to borrow more francs and U.S. Dollars and less yen, Leveque said he had not offered any advice, but added: "The yen and the dollar are not stable, but the ECU is stable." Asked if his bank has lost any confidence in China after the resignation of Communist Party chief Hu Yaobang in January, he said: "We have total co

#### Note:
If you do not provide a `lora_path`, you can inspect the output of the base model instead.
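
To judge a generation against `INSTRUCTION` (capital letters, 6 to 8 words), a small checker can help. This is a sketch: `follows_instruction` is a hypothetical helper, it should be applied to the generated title only (not to the echoed prompt), and the example strings are made up rather than real model outputs.

```python
def follows_instruction(title, min_words=6, max_words=8):
    """Check that a title is upper case and has between 6 and 8 words."""
    words = title.split()
    # Digits and punctuation have no case, so compare with the upper-cased
    # string and require at least one letter.
    is_upper = title == title.upper() and any(c.isalpha() for c in title)
    return is_upper and min_words <= len(words) <= max_words


print(follows_instruction("CHINA FOREIGN DEBT UNDER CONTROL SAYS BANKER"))  # True
print(follows_instruction("China's foreign debt is low"))                   # False
```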

## Solution

In [12]:
from trlabs.utils import *

dataset = load_from_disk("/kaggle/working/labs_AMLD25_Workshop/sessions/4_RLalignment_and_DPO/data/AMLD25_reuters_gentitle_1k").filter(reuters_cleaning_dataset)
dataset.save_to_disk("AMLD25_reuters_gentitle_0.5k_cleaned")

Filter:   0%|          | 0/987 [00:00<?, ? examples/s]

Filter:   0%|          | 0/496 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/446 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/196 [00:00<?, ? examples/s]

In [13]:
data_params = {
  "dataset_name": "Mix 1",
  "dataset_mixer": {
    "AMLD25_reuters_gentitle_0.5k_cleaned": 1.,
  },
  "dataset_splits": ["train", "test"],
  "num_eval_samples": 100,
  "seed": 42
}

In [14]:
accelerator = Accelerator()


data_args = DataArguments(**data_params)
training_args =  DPOConfig(**training_params)
model_args = ModelConfig(**model_config)

###################
# Model & Tokenizer
###################
torch_dtype = (
    model_args.torch_dtype
    if model_args.torch_dtype in ["auto", None]
    else getattr(torch, model_args.torch_dtype)
)
quantization_config = get_quantization_config(model_args)
model_kwargs = dict(
    revision=model_args.model_revision,
    attn_implementation=model_args.attn_implementation,
    torch_dtype=torch_dtype,
    use_cache=False if training_args.gradient_checkpointing else True,
    device_map=get_kbit_device_map() if quantization_config is not None else None,
    quantization_config=quantization_config,
)
model = AutoModelForCausalLM.from_pretrained(
    model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code, **model_kwargs
)
peft_config = get_peft_config(model_args)
if peft_config is None:
    ref_model = AutoModelForCausalLM.from_pretrained(
        model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code, **model_kwargs
    )
else:
    ref_model = None
tokenizer = AutoTokenizer.from_pretrained(
    model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code
)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
if tokenizer.chat_template is None:
    tokenizer.chat_template = SIMPLE_CHAT_TEMPLATE

################
# Dataset
################
dataset =  get_datasets(data_args, splits=data_args.dataset_splits)



################
# Training
################
trainer = DPOTrainer(
    model,
    ref_model,
    args=training_args,
    train_dataset=dataset[data_args.dataset_train_split],
    eval_dataset=dataset[data_args.dataset_test_split].shuffle(data_args.seed)\
        .take(min(len(dataset[data_args.dataset_test_split]), data_args.num_eval_samples))\
        if training_args.eval_strategy != "no" else None,
    processing_class=tokenizer,
    peft_config=peft_config,
)

trainer.train()

if training_args.eval_strategy != "no":
    metrics = trainer.evaluate()
    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)

# Save and push to hub
trainer.save_model(training_args.output_dir)
if training_args.push_to_hub:
    trainer.push_to_hub(dataset_name=data_args.dataset_name)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2-0.5B-Instruct/snapshots/c540970f9e29518b1d8f06ab8b24cba66ad77b6d/config.json
Model config Qwen2Config {
  "_name_or_path": "Qwen/Qwen2-0.5B-Instruct",
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 896,
  "initializer_range": 0.02,
  "intermediate_size": 4864,
  "max_position_emb

Extracting prompt from train dataset:   0%|          | 0/446 [00:00<?, ? examples/s]

Applying chat template to train dataset:   0%|          | 0/446 [00:00<?, ? examples/s]

Extracting prompt from eval dataset:   0%|          | 0/100 [00:00<?, ? examples/s]

Applying chat template to eval dataset:   0%|          | 0/100 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/446 [00:00<?, ? examples/s]

Tokenizing eval dataset:   0%|          | 0/100 [00:00<?, ? examples/s]

The following columns in the training set don't have a corresponding argument in `PeftModelForCausalLM.forward` and have been ignored: origin_response_c_r, chosen_reward, date, prompt, chosen, rejected_reward, rejected. If origin_response_c_r, chosen_reward, date, prompt, chosen, rejected_reward, rejected are not expected by `PeftModelForCausalLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 446
  Num Epochs = 1
  Instantaneous batch size per device = 1
  Training with DataParallel so batch size has been adjusted to: 2
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 8
  Total optimization steps = 27
  Number of trainable parameters = 4,325,376


| Step | Training Loss | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/chosen | Logps/rejected | Logits/chosen | Logits/rejected |
|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 5.5312 | 0.641328 | 0.173828 | 0.061768 | 0.72 | 0.112793 | -56.0 | -58.75 | -3.0625 | -3.046875 |
| 16 | 5.3924 | 0.530469 | 0.867188 | 0.335938 | 0.71 | 0.53125 | -49.0 | -56.0 | -3.015625 | -3.0 |
| 24 | 3.9735 | 0.495313 | 0.707031 | -0.025391 | 0.78 | 0.734375 | -50.5 | -59.5 | -3.015625 | -3.0 |


The following columns in the evaluation set don't have a corresponding argument in `PeftModelForCausalLM.forward` and have been ignored: origin_response_c_r, chosen_reward, date, prompt, chosen, rejected_reward, rejected. If origin_response_c_r, chosen_reward, date, prompt, chosen, rejected_reward, rejected are not expected by `PeftModelForCausalLM.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 100
  Batch size = 2

Saving model checkpoint to qwen_ex1_output
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2-0.5B-Instruct/snapshots/c540970f9e29518b1d8f06ab8b24cba66ad77b6d/config.json
Model config Qwen2Config {
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 896,
  "initializer_range": 0.02,
  "intermediate_size": 4864,
  "max_position_embeddings": 32768,
  "max_window_layers": 24,
  "model_type": "qwen2",
  "num_attention_heads": 14,
  "num_hidden_layers": 24,
  "num_key_value_heads": 2,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.48.1",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 151936
}

tokenizer config file saved in qwen_ex1_output/tokenizer_config.json
S

***** eval metrics *****
  epoch                   =     0.9686
  eval_logits/chosen      =    -3.0156
  eval_logits/rejected    =       -3.0
  eval_logps/chosen       =     -50.75
  eval_logps/rejected     =     -59.75
  eval_loss               =     0.4952
  eval_rewards/accuracies =       0.78
  eval_rewards/chosen     =     0.7031
  eval_rewards/margins    =     0.7383
  eval_rewards/rejected   =    -0.0342
  eval_runtime            = 0:00:48.88
  eval_samples_per_second =      2.046
  eval_steps_per_second   =      1.023
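
To put the two runs side by side, the snippet below copies the numbers reported in the logs of this notebook. Treat the comparison as indicative only: the second run is evaluated on the cleaned test split, so the two evaluation sets are not identical.

```python
# Values copied from the training logs and eval metrics of both runs.
full = {"train_examples": 987, "steps": 61, "eval_loss": 0.4905,
        "accuracy": 0.79, "margin": 0.8203}
cleaned = {"train_examples": 446, "steps": 27, "eval_loss": 0.4952,
           "accuracy": 0.78, "margin": 0.7383}

for key in full:
    delta = cleaned[key] - full[key]
    print(f"{key:15s} full={full[key]:<8} cleaned={cleaned[key]:<8} delta={delta:+.4f}")

step_ratio = cleaned["steps"] / full["steps"]
print(f"cleaned run used {step_ratio:.0%} of the optimizer steps")
```

The cleaned run reaches a similar evaluation loss and accuracy with fewer than half of the optimizer steps.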


In [15]:
from trlabs.utils import dataset_creation, not_relevant_data

SYSTEM_PROMPT = 'You are an advanced AI system specialised in providing Reuters News title given a body text of the news.'
INSTRUCTION = "The title should be in capital letters and between 6 and 8 words in length. Please provide only the title as output and no other text or explanation."

dataset = load_dataset("ucirvine/reuters21578", 'ModApte', trust_remote_code=True)
dataset = dataset.filter(not_relevant_data).shuffle(seed=42).map(dataset_creation, fn_kwargs={"system_prompt": SYSTEM_PROMPT, "instruction": INSTRUCTION})

In [16]:
from trlabs.rl.eval import setup_model_and_lora, generate

index = 15
prompt = dataset["test"][index]["system"] + dataset["test"][index]["messages"]

model, tokenizer = setup_model_and_lora(
    base_model_name = model_config["model_name_or_path"], 
    lora_path = training_params["output_dir"]
)

response = generate(prompt, model, tokenizer)
print(response)

loading file vocab.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2-0.5B-Instruct/snapshots/c540970f9e29518b1d8f06ab8b24cba66ad77b6d/vocab.json
loading file merges.txt from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2-0.5B-Instruct/snapshots/c540970f9e29518b1d8f06ab8b24cba66ad77b6d/merges.txt
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2-0.5B-Instruct/snapshots/c540970f9e29518b1d8f06ab8b24cba66ad77b6d/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2-0.5B-Instruct/snapshots/c540970f9e29518b1d8f06ab8b24cba66ad77b6d/tokenizer_config.json
loading file chat_template.jinja from cache at None
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
loading configuration file config.json fro

system
You are an advanced AI system specialised in providing Reuters News title given a body text of the news.
user
China's foreign debt is very low given its export capability, the size of its economy and its growth potential and the country is politically stable, Jean-Maxime Leveque, chairman of Credit Lyonnais, told reporters. Leveque, who has met the heads of most of China's banks including the president of its central bank during a visit here, said the Chinese authorities are very attentive to its foreign debt and have the matter under control. Official figures show China's foreign debt at a post-1949 record 16 billion dlrs at end-1986. Asked if he had advised China to borrow more francs and U.S. Dollars and less yen, Leveque said he had not offered any advice, but added: "The yen and the dollar are not stable, but the ECU is stable." Asked if his bank has lost any confidence in China after the resignation of Communist Party chief Hu Yaobang in January, he said: "We have total co