# Exercise 2: Efficient Training with Resource Constraints and Capability Preservation 🚀
## 📘 Prerequisites
In Exercise 1, we learned how to create high-quality training sets using smart sampling techniques. We now build on that knowledge while accounting for resource constraints and capability preservation.

## 🎯 New Challenges
### 1. Capability Preservation
Even with high-quality data, models can "forget" previously learned capabilities during fine-tuning. We need to:

* Maintain general knowledge
* Preserve essential capabilities
* Balance new and existing skills

This is usually done by mixing in additional collections. In this case, we add a well-known general-knowledge preference collection to our own preference collection:

```
trl-lib/ultrafeedback_binarized
```

### 2. Resource Optimization
Training on full datasets, even high-quality ones, can be impractical due to:

* Workshop time constraints
* Cloud computing costs
* Development cycle duration in an enterprise environment

### Git Clone

In [1]:
! git clone https://github.com/thomsonreuters/labs_AMLD25_Workshop

Cloning into 'labs_AMLD25_Workshop'...
remote: Enumerating objects: 59, done.
remote: Counting objects: 100% (59/59), done.
remote: Compressing objects: 100% (45/45), done.
remote: Total 59 (delta 8), reused 56 (delta 8), pack-reused 0 (from 0)
Receiving objects: 100% (59/59), 3.92 MiB | 28.43 MiB/s, done.
Resolving deltas: 100% (8/8), done.


### Install dependencies

In [2]:
! pip install -r /kaggle/working/labs_AMLD25_Workshop/sessions/4_RLalignment_and_DPO/requirements.txt
! pip install flash-attn==2.7.3 --no-build-isolation

Collecting accelerate==1.3.0 (from -r /kaggle/working/labs_AMLD25_Workshop/sessions/4_RLalignment_and_DPO/requirements.txt (line 1))
  Downloading accelerate-1.3.0-py3-none-any.whl.metadata (19 kB)
Collecting bitsandbytes==0.45.1 (from -r /kaggle/working/labs_AMLD25_Workshop/sessions/4_RLalignment_and_DPO/requirements.txt (line 2))
  Downloading bitsandbytes-0.45.1-py3-none-manylinux_2_24_x86_64.whl.metadata (5.8 kB)
Collecting transformers==4.48.1 (from -r /kaggle/working/labs_AMLD25_Workshop/sessions/4_RLalignment_and_DPO/requirements.txt (line 5))
  Downloading transformers-4.48.1-py3-none-any.whl.metadata (44 kB)
Collecting trl==0.13.0 (from -r /kaggle/working/labs_AMLD25_Workshop/sessions/4_RLalignment_and_DPO/requirements.txt (line 6))
  Downloading trl-0.13.0-py3-none-any.whl.metadata (11 kB)
Downloading accelerate-1.3.0-py3-none-any.whl (336 kB)

### Testing GPU
Check that Python recognizes the allocated GPU. If it does not, go to `Settings` > `Accelerator` > `GPU T4 x 2`.

In [3]:
import os, sys

# from tensorflow.python.client import device_lib
repo_folder = os.getcwd().split('labs_AMLD25_Workshop')[0]+"/labs_AMLD25_Workshop/src" 
sys.path.append(repo_folder)

# UNCOMMENT TO CHECK GPU HW
# device_lib.list_local_devices()
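
As an alternative to the commented-out TensorFlow check, the GPUs can also be listed through PyTorch, which the rest of this notebook relies on. The helper name `gpu_report` is hypothetical. This is a minimal sketch that assumes `torch` is installed and falls back to an empty report otherwise:

```python
import importlib.util


def gpu_report():
    """Return a small dict describing the GPUs that PyTorch can see."""
    # Fall back gracefully when torch is not installed in the environment.
    if importlib.util.find_spec("torch") is None:
        return {"cuda_available": False, "gpu_count": 0, "gpu_names": []}

    import torch

    count = torch.cuda.device_count() if torch.cuda.is_available() else 0
    names = [torch.cuda.get_device_name(i) for i in range(count)]
    return {"cuda_available": count > 0, "gpu_count": count, "gpu_names": names}


report = gpu_report()
print(report)
```

With the `GPU T4 x 2` accelerator selected, `gpu_count` should come back as 2.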

If you get two GPUs, you can assign them manually using environment variables. This step is optional, since PyTorch should recognize them automatically.

In [4]:
os.environ["WANDB_DISABLED"] = "true" ## turning off WandB logging
os.environ['CUDA_VISIBLE_DEVICES'] = "0,1"

rl_foolder = "labs_AMLD25_Workshop/sessions/4_RLalignment_and_DPO"

In [5]:
import torch

from typing import Optional, List, Dict
import datasets
from datasets import (
    load_dataset, 
    load_from_disk, 
    DatasetDict,
    concatenate_datasets
)

from accelerate import Accelerator, PartialState
from transformers import AutoModelForCausalLM, AutoTokenizer

from trl import (
    ModelConfig,
    DPOTrainer,
    DPOConfig,
    TrlParser,
    get_kbit_device_map,
    get_peft_config,
    get_quantization_config,
)

from trlabs.rl.data import (
    get_datasets, 
    DataArguments
)

from trlabs.utils import *

from trl.trainer.utils import SIMPLE_CHAT_TEMPLATE

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


### Model Config

In [6]:
model_config = {
    "model_name_or_path": "Qwen/Qwen2-0.5B-Instruct",
    "torch_dtype": "bfloat16",
    "use_peft": True,
    "lora_r": 64,
    "lora_alpha": 32,    # LoRA scaling is lora_alpha / lora_r = 0.5
    "lora_dropout": 0.1, # Prevent overfitting
}
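
To get a feel for what `lora_r` and `lora_alpha` imply, the sketch below estimates the number of trainable LoRA parameters and the LoRA scaling factor. It assumes that adapters are attached only to the `q_proj` and `v_proj` attention projections, which is not stated in this notebook, and it uses the architecture values printed in the model config dump further down. The function name `lora_trainable_params` is hypothetical.

```python
def lora_trainable_params(r, hidden_size, num_heads, num_kv_heads, num_layers):
    """Count LoRA parameters when adapters sit on q_proj and v_proj only."""
    head_dim = hidden_size // num_heads
    kv_dim = num_kv_heads * head_dim  # output width of v_proj (grouped-query attention)
    # Each adapted Linear(in, out) adds two matrices: A (r x in) and B (out x r).
    q_proj = r * (hidden_size + hidden_size)
    v_proj = r * (hidden_size + kv_dim)
    return num_layers * (q_proj + v_proj)


# Architecture values for Qwen2-0.5B-Instruct, as printed in the config dump.
n_params = lora_trainable_params(
    r=64, hidden_size=896, num_heads=14, num_kv_heads=2, num_layers=24
)
scaling = 32 / 64  # lora_alpha / lora_r

print(n_params)  # 4325376
print(scaling)   # 0.5
```

Under that assumption the count agrees with the `Number of trainable parameters = 4,325,376` line in the training log.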


### Data Config
You can also leverage the preference dataset `trl-lib/ultrafeedback_binarized`, hosted on the Hugging Face Hub.

In [7]:
data_params = {
  "dataset_name": "Mix 2",
  "dataset_mixer": {
    "trl-lib/ultrafeedback_binarized": 0.02, ## about 1,242 of the 62,135 training samples
     f"{rl_foolder}/data/AMLD25_reuters_gentitle_1k": 1.,
  },
  "dataset_splits": ["train", "test"],
  "num_eval_samples": 100,
  "seed": 42
}

### Training Config

In [8]:
training_params = {
    ## General
    "output_dir": f"{model_config['model_name_or_path'].split('/')[0].lower()}_ex2_output",
    "num_train_epochs": 1,
    "beta": 0.1,
    "eval_strategy": "steps",
    "eval_steps": 8,
    "per_device_train_batch_size": 1,
    "per_device_eval_batch_size": 1,
    "gradient_accumulation_steps": 8,
    ## Context length and max length (max_new_tokens = max_length - max_prompt_length)
    "max_length": 768,
    "max_prompt_length": 512,
    ## Optimizer
    "optim": "adamw_torch",
    "learning_rate": 2.0e-4,
    "weight_decay": 0.001,
    "adam_epsilon": 1.0e-8,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "max_grad_norm": 1.0,
    ## Scheduler ##
    "warmup_steps": 10,
    "lr_scheduler_type": "cosine",
    ## Logging
    "log_level": "info",
    "logging_first_step": True,
    "logging_steps": 10
}

## DPO Training Loop

If you launch the training as currently configured, it will take about 1 hour!

You can get the same results in much less time. 
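
Run time grows roughly with the number of optimizer updates, which in turn depends on the size of the mixed training set. The sketch below estimates that number from the training config. It assumes Trainer-style rounding (batches that do not fill a gradient accumulation window are dropped) and two visible GPUs. The function name `optimization_steps` is hypothetical.

```python
import math


def optimization_steps(num_examples, per_device_batch_size, num_devices,
                       gradient_accumulation_steps, num_epochs=1):
    """Estimate the number of optimizer updates for one training run."""
    batch_size = per_device_batch_size * num_devices
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # Leftover batches that do not fill an accumulation window are not counted.
    updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
    return math.ceil(num_epochs * updates_per_epoch)


steps = optimization_steps(
    num_examples=2229,              # "Num examples" reported in the training log
    per_device_batch_size=1,        # per_device_train_batch_size
    num_devices=2,                  # GPU T4 x 2
    gradient_accumulation_steps=8,
)
print(steps)  # 139
```

Because each update consumes 16 examples in this setup, shrinking the training mix reduces the step count almost proportionally.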

### Let's Play with Our Two Collections!
We now have two collections. Choose the fractions **X** and **Y** to balance how much each one contributes to the training mix. Before you do that, apply more smart sampling to create smaller, higher-quality versions of the collections.

```python
data_params = {
  "dataset_name": "Mix 2",
  "dataset_mixer": {
    "trl-lib/ultrafeedback_binarized": Y,
    "labs_AMLD25_Workshop/sessions/4_RLalignment_and_DPO/data/AMLD25_reuters_gentitle_1k": X,
  },
  "dataset_splits": ["train", "test"],
  "num_eval_samples": 100,
  "seed": 42
}
```

Aim for the fractions that achieve the best performance.

##### Note:
A fraction of 0.1 for the collection `"trl-lib/ultrafeedback_binarized"` means that we select 10% of that collection's training split.
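
The effect of a fraction on the size of the mix can be worked out by hand. The sketch below assumes that the sample count for each collection is truncated to an integer, which is consistent with the sizes reported in this notebook's logs but is not confirmed by the `trlabs` source. The function name `mixed_train_size` and the short dictionary keys are hypothetical.

```python
def mixed_train_size(split_sizes, fractions):
    """Return per-collection sample counts and the total size of the mix."""
    # Assumption: the sample count is truncated to an integer.
    counts = {name: int(split_sizes[name] * frac) for name, frac in fractions.items()}
    return counts, sum(counts.values())


# Train split sizes reported in this notebook's logs.
split_sizes = {"ultrafeedback": 62135, "reuters": 987}
# Fractions from the data config above.
fractions = {"ultrafeedback": 0.02, "reuters": 1.0}

counts, total = mixed_train_size(split_sizes, fractions)
print(counts)  # {'ultrafeedback': 1242, 'reuters': 987}
print(total)   # 2229
```

The total matches the `Num examples = 2,229` line printed when training starts.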

In [9]:
from trlabs.rl.train import dpo

dpo(data_params, training_params, model_config)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Generating train split:   0%|          | 0/62135 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1000 [00:00<?, ? examples/s]

Extracting prompt from train dataset:   0%|          | 0/2229 [00:00<?, ? examples/s]

Applying chat template to train dataset:   0%|          | 0/2229 [00:00<?, ? examples/s]

Extracting prompt from eval dataset:   0%|          | 0/100 [00:00<?, ? examples/s]

Applying chat template to eval dataset:   0%|          | 0/100 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/2229 [00:00<?, ? examples/s]

Tokenizing eval dataset:   0%|          | 0/100 [00:00<?, ? examples/s]

The following columns in the training set don't have a corresponding argument in `PeftModelForCausalLM.forward` and have been ignored: origin_response_c_r, rejected, prompt, rejected_reward, chosen_reward, score_chosen, date, chosen, score_rejected. If origin_response_c_r, rejected, prompt, rejected_reward, chosen_reward, score_chosen, date, chosen, score_rejected are not expected by `PeftModelForCausalLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 2,229
  Num Epochs = 1
  Instantaneous batch size per device = 1
  Training with DataParallel so batch size has been adjusted to: 2
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 8
  Total optimization steps = 139
  Number of trainable parameters = 4,325,376


| Step | Training Loss | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/chosen | Logps/rejected | Logits/chosen | Logits/rejected |
|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 5.5312 | 0.680938 | 0.005524 | -0.024902 | 0.43 | 0.030396 | -256.0 | -233.0 | -2.953125 | -3.03125 |
| 16 | 5.615 | 0.651992 | -0.09668 | -0.207031 | 0.57 | 0.11084 | -258.0 | -235.0 | -2.9375 | -3.0 |
| 24 | 5.2773 | 0.641797 | -0.220703 | -0.388672 | 0.66 | 0.166992 | -258.0 | -236.0 | -2.953125 | -3.015625 |
| 32 | 5.2193 | 0.615547 | -0.245117 | -0.503906 | 0.66 | 0.257812 | -258.0 | -238.0 | -2.9375 | -3.0 |
| 40 | 5.0405 | 0.620625 | -0.196289 | -0.476562 | 0.65 | 0.28125 | -258.0 | -237.0 | -2.9375 | -3.015625 |
| 48 | 5.0405 | 0.609453 | -0.198242 | -0.527344 | 0.66 | 0.332031 | -258.0 | -238.0 | -2.9375 | -3.015625 |
| 56 | 4.8738 | 0.601562 | -0.178711 | -0.53125 | 0.64 | 0.353516 | -258.0 | -238.0 | -2.96875 | -3.03125 |
| 64 | 4.7354 | 0.603516 | -0.12793 | -0.507812 | 0.64 | 0.376953 | -258.0 | -238.0 | -2.96875 | -3.03125 |
| 72 | 4.7295 | 0.58959 | -0.255859 | -0.679688 | 0.66 | 0.425781 | -260.0 | -239.0 | -2.96875 | -3.03125 |
| 80 | 5.0183 | 0.593574 | -0.365234 | -0.769531 | 0.66 | 0.40625 | -260.0 | -240.0 | -2.96875 | -3.046875 |


The following columns in the evaluation set don't have a corresponding argument in `PeftModelForCausalLM.forward` and have been ignored: origin_response_c_r, rejected, prompt, rejected_reward, chosen_reward, score_chosen, date, chosen, score_rejected. If origin_response_c_r, rejected, prompt, rejected_reward, chosen_reward, score_chosen, date, chosen, score_rejected are not expected by `PeftModelForCausalLM.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 100
  Batch size = 2
The following columns in the evaluation set don't have a corresponding argument in `PeftModelForCausalLM.forward` and have been ignored: origin_response_c_r, rejected, prompt, rejected_reward, chosen_reward, score_chosen, date, chosen, score_rejected. If origin_response_c_r, rejected, prompt, rejected_reward, chosen_reward, score_chosen, date, chosen, score_rejected are not expected by `PeftModelForCausalLM.forward`,  you can safely ignore this message.

***** Running

Saving model checkpoint to qwen_ex2_output
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2-0.5B-Instruct/snapshots/c540970f9e29518b1d8f06ab8b24cba66ad77b6d/config.json
Model config Qwen2Config {
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 896,
  "initializer_range": 0.02,
  "intermediate_size": 4864,
  "max_position_embeddings": 32768,
  "max_window_layers": 24,
  "model_type": "qwen2",
  "num_attention_heads": 14,
  "num_hidden_layers": 24,
  "num_key_value_heads": 2,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.48.1",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 151936
}



***** eval metrics *****
  epoch                   =     0.9973
  eval_logits/chosen      =    -2.9531
  eval_logits/rejected    =    -3.0156
  eval_logps/chosen       =     -258.0
  eval_logps/rejected     =     -239.0
  eval_loss               =     0.6063
  eval_rewards/accuracies =       0.62
  eval_rewards/chosen     =    -0.2539
  eval_rewards/margins    =     0.3906
  eval_rewards/rejected   =    -0.6445
  eval_runtime            = 0:01:23.27
  eval_samples_per_second =      1.201
  eval_steps_per_second   =        0.6
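
To help read the evaluation metrics above, the sketch below recomputes the DPO quantities for a single preference pair. It assumes the default sigmoid DPO loss and `beta = 0.1` from the training config. The function name `dpo_pair_metrics` and the log-probabilities in the second call are hypothetical values chosen for illustration.

```python
import math


def dpo_pair_metrics(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Compute implicit rewards, margin and sigmoid DPO loss for one pair."""
    reward_chosen = beta * (policy_chosen_logp - ref_chosen_logp)
    reward_rejected = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = reward_chosen - reward_rejected
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
    return reward_chosen, reward_rejected, margin, loss


# Before any update the policy equals the reference model, so the margin is 0
# and the loss is log(2), roughly 0.693.
_, _, margin0, loss0 = dpo_pair_metrics(-256.0, -233.0, -256.0, -233.0)
print(margin0, round(loss0, 4))

# Hypothetical log-probabilities after some training: the rejected answer
# loses more probability than the chosen one, so the margin turns positive.
_, _, margin1, loss1 = dpo_pair_metrics(-258.0, -239.0, -256.0, -233.0)
print(round(margin1, 2), round(loss1, 4))
```

In the logged metrics, `rewards/margins` is `rewards/chosen` minus `rewards/rejected` (here -0.2539 - (-0.6445) = 0.3906), and a validation loss below log(2) indicates that the policy has started to separate chosen from rejected answers.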


tokenizer config file saved in qwen_ex2_output/tokenizer_config.json
Special tokens file saved in qwen_ex2_output/special_tokens_map.json


## Take a Look at the Model Generation

In [10]:
from trlabs.utils import dataset_creation, not_relevant_data

SYSTEM_PROMPT = 'You are an advanced AI system specialised in providing Reuters News title given a body text of the news.'
INSTRUCTION = "The title should be in capital letters and between 6 and 8 words in length. Please provide only the title as output and no other text or explanation."

dataset = load_dataset("ucirvine/reuters21578", 'ModApte', trust_remote_code=True)
dataset = dataset.filter(not_relevant_data).shuffle(seed=42).map(dataset_creation, fn_kwargs={"system_prompt": SYSTEM_PROMPT, "instruction": INSTRUCTION})

Generating test split:   0%|          | 0/3299 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/9603 [00:00<?, ? examples/s]

Generating unused split:   0%|          | 0/722 [00:00<?, ? examples/s]

Filter:   0%|          | 0/3299 [00:00<?, ? examples/s]

Filter:   0%|          | 0/9603 [00:00<?, ? examples/s]

Filter:   0%|          | 0/722 [00:00<?, ? examples/s]

Map:   0%|          | 0/3295 [00:00<?, ? examples/s]

Map:   0%|          | 0/9583 [00:00<?, ? examples/s]

Map:   0%|          | 0/722 [00:00<?, ? examples/s]

In [11]:
from trlabs.rl.eval import setup_model_and_lora, generate

index = 33
prompt = dataset["test"][index]["system"]+dataset["test"][index]["messages"]

model, tokenizer = setup_model_and_lora(
    base_model_name = model_config["model_name_or_path"], 
    lora_path = training_params["output_dir"]
)

response = generate(prompt, model, tokenizer)
print(response)

loading file vocab.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2-0.5B-Instruct/snapshots/c540970f9e29518b1d8f06ab8b24cba66ad77b6d/vocab.json
loading file merges.txt from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2-0.5B-Instruct/snapshots/c540970f9e29518b1d8f06ab8b24cba66ad77b6d/merges.txt
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2-0.5B-Instruct/snapshots/c540970f9e29518b1d8f06ab8b24cba66ad77b6d/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2-0.5B-Instruct/snapshots/c540970f9e29518b1d8f06ab8b24cba66ad77b6d/tokenizer_config.json
loading file chat_template.jinja from cache at None
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
loading configuration file config.json fro

system
You are an advanced AI system specialised in providing Reuters News title given a body text of the news.
user
Canada Development Corp <CDC.TO> said it agreed to sell its 25.2 pct interest in CDC Life Sciences Inc to Caisse de depot et placement du Quebec, the provincial pension fund manager, and Institut Merieux, a French biological laboratory company, for 169.2 mln dlrs. It said the caisse and Institut Merieux will each buy 2.75 mln common shares of the company for 30.76 dlrs a share. It said following the transaction the caisse will hold about 19.3 pct of CDC Life Sciences. Canada Development said the purchasers do not plan to acquire the remaining publicly-held shares.

The title should be in capital letters and between 6 and 8 words in length. Please provide only the title as output and no other text or explanation.
assistant
"Canada Development Corp Sells 25.2% Interest in CDC Life Sciences to Caisse de Depot & Placement du Québec, Institut Merieux"


#### Note:
If you do not provide a `lora_path`, you can check the base model output.
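
A generated title can be checked against the two constraints stated in `INSTRUCTION` (capital letters, 6 to 8 words). The helper `follows_title_instruction` below is hypothetical and only a rough sketch: it splits on whitespace and ignores any finer notion of what counts as a word.

```python
def follows_title_instruction(title, min_words=6, max_words=8):
    """Check the two constraints from INSTRUCTION: upper case and 6 to 8 words."""
    text = title.strip().strip('"')
    n_words = len(text.split())
    return text.isupper() and min_words <= n_words <= max_words


# Title generated by the model in the cell above.
generated = ('"Canada Development Corp Sells 25.2% Interest in CDC Life Sciences '
             'to Caisse de Depot & Placement du Québec, Institut Merieux"')

print(follows_title_instruction(generated))  # False
print(follows_title_instruction("CANADA DEVELOPMENT SELLS CDC LIFE SCIENCES STAKE"))  # True
```

Running such a check over many test prompts gives a simple pass rate that can be compared between the base model and the fine-tuned one.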

## Solution

In [12]:
from trlabs.utils import *

# Clean the Reuters title collection and save the result locally
dataset = load_from_disk("/kaggle/working/labs_AMLD25_Workshop/sessions/4_RLalignment_and_DPO/data/AMLD25_reuters_gentitle_1k").filter(reuters_cleaning_dataset)
dataset.save_to_disk("AMLD25_reuters_gentitle_0.5k_cleaned")

# Clean the general-knowledge preference collection and save the result locally
dataset = load_dataset("trl-lib/ultrafeedback_binarized").filter(ultrafeedback_cleaning_dataset)
dataset.save_to_disk("trl-lib/ultrafeedback_binarized-cleaned_11k")



Filter:   0%|          | 0/987 [00:00<?, ? examples/s]

Filter:   0%|          | 0/496 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/446 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/196 [00:00<?, ? examples/s]

Filter:   0%|          | 0/62135 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/11329 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/188 [00:00<?, ? examples/s]
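
The implementations of `reuters_cleaning_dataset` and `ultrafeedback_cleaning_dataset` are not shown in this notebook. As a stand-in, the sketch below illustrates one kind of predicate that could be passed to `.filter(...)`: keep only preference pairs where the chosen answer clearly outscores the rejected one. The function name `keep_clear_preference` and the threshold of 2.0 are hypothetical; only the column names `score_chosen` and `score_rejected` come from the logs.

```python
def keep_clear_preference(example, min_score_gap=2.0):
    """Hypothetical filter: keep pairs with a clear gap between the two scores."""
    return example["score_chosen"] - example["score_rejected"] >= min_score_gap


# Toy rows using the score columns that appear in the training logs.
rows = [
    {"score_chosen": 9.0, "score_rejected": 3.0},
    {"score_chosen": 7.0, "score_rejected": 6.5},
    {"score_chosen": 8.0, "score_rejected": 6.0},
]

kept = [row for row in rows if keep_clear_preference(row)]
print(len(kept))  # 2
```

A predicate of this shape takes one example and returns a boolean, so it can be used directly with the `filter` method of a `datasets` dataset, as in the cell above.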

In [13]:
data_params = {
  "dataset_name": "Mix 2",
  "dataset_mixer": {
    "trl-lib/ultrafeedback_binarized-cleaned_11k": 0.05,
    "AMLD25_reuters_gentitle_0.5k_cleaned": 1.,
  },
  "dataset_splits": ["train", "test"],
  "num_eval_samples": 100,
  "seed": 42
}

In [14]:
from trlabs.rl.train import dpo

dpo(data_params, training_params, model_config)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2-0.5B-Instruct/snapshots/c540970f9e29518b1d8f06ab8b24cba66ad77b6d/config.json
Model config Qwen2Config {
  "_name_or_path": "Qwen/Qwen2-0.5B-Instruct",
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 896,
  "initializer_range": 0.02,
  "intermediate_size": 4864,
  "max_position_emb

Extracting prompt from train dataset:   0%|          | 0/1012 [00:00<?, ? examples/s]

Applying chat template to train dataset:   0%|          | 0/1012 [00:00<?, ? examples/s]

Extracting prompt from eval dataset:   0%|          | 0/100 [00:00<?, ? examples/s]

Applying chat template to eval dataset:   0%|          | 0/100 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/1012 [00:00<?, ? examples/s]

Tokenizing eval dataset:   0%|          | 0/100 [00:00<?, ? examples/s]

The following columns in the training set don't have a corresponding argument in `PeftModelForCausalLM.forward` and have been ignored: origin_response_c_r, rejected, prompt, rejected_reward, chosen_reward, score_chosen, date, chosen, score_rejected. If origin_response_c_r, rejected, prompt, rejected_reward, chosen_reward, score_chosen, date, chosen, score_rejected are not expected by `PeftModelForCausalLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1,012
  Num Epochs = 1
  Instantaneous batch size per device = 1
  Training with DataParallel so batch size has been adjusted to: 2
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 8
  Total optimization steps = 63
  Number of trainable parameters = 4,325,376


| Step | Training Loss | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/chosen | Logps/rejected | Logits/chosen | Logits/rejected |
|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 5.5312 | 0.674531 | -0.015869 | -0.055664 | 0.52 | 0.039795 | -160.0 | -154.0 | -2.921875 | -2.90625 |
| 16 | 5.5343 | 0.603633 | 0.012756 | -0.223633 | 0.7 | 0.236328 | -160.0 | -155.0 | -2.859375 | -2.84375 |
| 24 | 4.7174 | 0.53959 | 0.081543 | -0.443359 | 0.79 | 0.523438 | -160.0 | -158.0 | -2.8125 | -2.765625 |
| 32 | 4.681 | 0.521533 | 0.057129 | -0.546875 | 0.73 | 0.601562 | -160.0 | -159.0 | -2.8125 | -2.78125 |
| 40 | 3.6941 | 0.509238 | -0.033936 | -0.699219 | 0.75 | 0.664062 | -161.0 | -160.0 | -2.828125 | -2.796875 |
| 48 | 3.6941 | 0.508359 | -0.086426 | -0.773438 | 0.74 | 0.6875 | -161.0 | -161.0 | -2.84375 | -2.8125 |
| 56 | 3.8327 | 0.493066 | -0.151367 | -0.878906 | 0.76 | 0.726562 | -162.0 | -162.0 | -2.84375 | -2.8125 |


The following columns in the evaluation set don't have a corresponding argument in `PeftModelForCausalLM.forward` and have been ignored: origin_response_c_r, rejected, prompt, rejected_reward, chosen_reward, score_chosen, date, chosen, score_rejected. If origin_response_c_r, rejected, prompt, rejected_reward, chosen_reward, score_chosen, date, chosen, score_rejected are not expected by `PeftModelForCausalLM.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 100
  Batch size = 2
The following columns in the evaluation set don't have a corresponding argument in `PeftModelForCausalLM.forward` and have been ignored: origin_response_c_r, rejected, prompt, rejected_reward, chosen_reward, score_chosen, date, chosen, score_rejected. If origin_response_c_r, rejected, prompt, rejected_reward, chosen_reward, score_chosen, date, chosen, score_rejected are not expected by `PeftModelForCausalLM.forward`,  you can safely ignore this message.

***** Running

Saving model checkpoint to qwen_ex2_output


***** eval metrics *****
  epoch                   =      0.996
  eval_logits/chosen      =    -2.8438
  eval_logits/rejected    =    -2.8125
  eval_logps/chosen       =     -161.0
  eval_logps/rejected     =     -162.0
  eval_loss               =     0.4895
  eval_rewards/accuracies =       0.76
  eval_rewards/chosen     =    -0.1138
  eval_rewards/margins    =     0.7305
  eval_rewards/rejected   =    -0.8477
  eval_runtime            = 0:01:12.60
  eval_samples_per_second =      1.377
  eval_steps_per_second   =      0.689


tokenizer config file saved in qwen_ex2_output/tokenizer_config.json
Special tokens file saved in qwen_ex2_output

## Take a Look at the Model Generation

In [15]:
from trlabs.utils import dataset_creation, not_relevant_data

SYSTEM_PROMPT = 'You are an advanced AI system specialised in providing Reuters News title given a body text of the news.'
INSTRUCTION = "The title should be in capital letters and between 6 and 8 words in length. Please provide only the title as output and no other text or explanation."

dataset = load_dataset("ucirvine/reuters21578", 'ModApte', trust_remote_code=True)
dataset = dataset.filter(not_relevant_data).shuffle(seed=42).map(dataset_creation, fn_kwargs={"system_prompt": SYSTEM_PROMPT, "instruction": INSTRUCTION})

In [16]:
from trlabs.rl.eval import setup_model_and_lora, generate

index = 33
prompt = dataset["test"][index]["system"]+dataset["test"][index]["messages"]

model, tokenizer = setup_model_and_lora(
    base_model_name = model_config["model_name_or_path"], 
    lora_path = training_params["output_dir"]
)

response = generate(prompt, model, tokenizer)
print(response)

loading file vocab.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2-0.5B-Instruct/snapshots/c540970f9e29518b1d8f06ab8b24cba66ad77b6d/vocab.json
loading file merges.txt from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2-0.5B-Instruct/snapshots/c540970f9e29518b1d8f06ab8b24cba66ad77b6d/merges.txt
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2-0.5B-Instruct/snapshots/c540970f9e29518b1d8f06ab8b24cba66ad77b6d/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2-0.5B-Instruct/snapshots/c540970f9e29518b1d8f06ab8b24cba66ad77b6d/tokenizer_config.json
loading file chat_template.jinja from cache at None
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
loading configuration file config.json fro

system
You are an advanced AI system specialised in providing Reuters News title given a body text of the news.
user
Canada Development Corp <CDC.TO> said it agreed to sell its 25.2 pct interest in CDC Life Sciences Inc to Caisse de depot et placement du Quebec, the provincial pension fund manager, and Institut Merieux, a French biological laboratory company, for 169.2 mln dlrs. It said the caisse and Institut Merieux will each buy 2.75 mln common shares of the company for 30.76 dlrs a share. It said following the transaction the caisse will hold about 19.3 pct of CDC Life Sciences. Canada Development said the purchasers do not plan to acquire the remaining publicly-held shares.

The title should be in capital letters and between 6 and 8 words in length. Please provide only the title as output and no other text or explanation.
assistant
"Canada Development Corp Sells 25.2% Interest in CDC Life Sciences Inc to Caisse de Depot & Placement du Québec and Institut Merieux"
