# Exercise 1: Smart Dataset Sampling for Optimal Model Performance 🎯
In this exercise, we'll explore how intelligent dataset sampling can significantly impact model performance. We'll learn how to create high-quality training sets by implementing strategic sampling techniques.

### 🌟 The Challenge
Training on our entire dataset without proper sampling can lead to:

* **Noisy Training Signals**: Not all data contributes equally to model learning
* **Suboptimal Performance**: Quantity doesn't always mean quality
* **Inefficient Learning**: The model might focus on redundant or low-quality examples

### Install dependencies

In [18]:
! pip install -r requirements.txt
! pip install flash-attn==2.7.3 --no-build-isolation

### Testing GPU
Check that Python recognizes the GPU allocated to you. If it does not, go to `Settings` > `Accelerator` > `GPU T4 x 2`.

In [26]:
import os, sys

# from tensorflow.python.client import device_lib
repo_folder = os.getcwd().split('labs_AMLD25_Workshop')[0][:-1]+"/labs_AMLD25_Workshop/src" 
sys.path.append(repo_folder)

# UNCOMMENT TO CHECK GPU HW
# device_lib.list_local_devices()

If you get two GPUs, you can assign them manually using environment variables. This step is optional, since PyTorch should recognize them automatically.

In [2]:
os.environ["WANDB_DISABLED"] = "true" ## turning off WandB logging
os.environ['CUDA_VISIBLE_DEVICES'] = "0,1"
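With `CUDA_VISIBLE_DEVICES` set, a quick way to confirm what PyTorch can see is sketched below. It assumes `torch` is installed through `requirements.txt`; the helper name `list_visible_gpus` is ours and is not part of the workshop code.

```python
import importlib.util


def list_visible_gpus():
    """Return the names of the CUDA devices PyTorch can see (empty list if none)."""
    # Guard the import so the check also runs in an environment without PyTorch.
    if importlib.util.find_spec("torch") is None:
        return []
    import torch

    if not torch.cuda.is_available():
        return []
    return [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]


gpus = list_visible_gpus()
print(f"{len(gpus)} GPU(s) visible: {gpus}")
```

If the list is empty, revisit the accelerator setting before starting the training cells.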

In [25]:
import torch

from typing import Optional, List, Dict
import datasets
from datasets import (
    load_dataset, 
    load_from_disk, 
    DatasetDict,
    concatenate_datasets
)

from accelerate import Accelerator, PartialState
from transformers import AutoModelForCausalLM, AutoTokenizer

from trl import (
    ModelConfig,
    DPOTrainer,
    DPOConfig,
    TrlParser,
    get_kbit_device_map,
    get_peft_config,
    get_quantization_config,
)

from trlabs.rl.data import (
    get_datasets, 
    DataArguments
)

from trlabs.utils import *

from trl.trainer.utils import SIMPLE_CHAT_TEMPLATE


### Model Config

In [4]:
model_config = {
    "model_name_or_path": "Qwen/Qwen2-0.5B-Instruct",
    "torch_dtype": "bfloat16",
    "use_peft": True,
    "lora_r": 64,        # Rank of the LoRA update matrices
    "lora_alpha": 32,    # Updates are scaled by lora_alpha / lora_r (0.5 here)
    "lora_dropout": 0.1, # Dropout on the LoRA layers to reduce overfitting
}
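To make the LoRA settings more concrete, the sketch below computes the scaling factor applied to the adapter update and the number of trainable parameters LoRA adds to one weight matrix. It assumes the standard LoRA formulation (scaling equal to `lora_alpha / lora_r`, not the rank-stabilized variant). The 896 x 896 matrix shape is an illustrative assumption rather than a value read from the model, and both helper names are hypothetical.

```python
def lora_scaling(lora_alpha: int, lora_r: int) -> float:
    """Factor by which the low-rank update B @ A is multiplied (standard LoRA)."""
    return lora_alpha / lora_r


def lora_added_params(d_out: int, d_in: int, lora_r: int) -> int:
    """Trainable parameters added to one d_out x d_in weight matrix.

    A has shape (lora_r, d_in) and B has shape (d_out, lora_r).
    """
    return lora_r * (d_in + d_out)


scaling = lora_scaling(lora_alpha=32, lora_r=64)
added = lora_added_params(d_out=896, d_in=896, lora_r=64)  # assumed shape
full = 896 * 896

print(f"scaling factor: {scaling}")
print(f"LoRA params per matrix: {added} ({added / full:.1%} of the full matrix)")
```

Under these assumptions the update is scaled by 0.5, so raising `lora_alpha` (or lowering `lora_r`) is what makes adapter updates stronger.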


### Data Config
You can leverage the preference dataset for this task located in `data/AMLD25_reuters_gentitle_1k`. 

In [5]:
data_params = {
  "dataset_name": "Mix 1",
  "dataset_mixer": {
    "./data/AMLD25_reuters_gentitle_1k": 1.,
  },
  "dataset_splits": ["train", "test"],
  "num_eval_samples": 100,
  "seed": 42
}

### Training Config

In [7]:
training_params = {
    ## General
    "output_dir": f"{model_config['model_name_or_path'].split('/')[0].lower()}_ex1_output",
    "num_train_epochs": 1,
    "beta": 0.1,
    "eval_strategy": "steps",
    "eval_steps": 8,
    "per_device_train_batch_size": 1,
    "per_device_eval_batch_size": 1,
    "gradient_accumulation_steps": 8,
    ## Context length and max length (max new tokens = max_length - max_prompt_length)
    "max_length": 768,
    "max_prompt_length": 512,
    ## Optimizer
    "optim": "adamw_torch",
    "learning_rate": 2.0e-4,
    "weight_decay": 0.001,
    "adam_epsilon": 1.0e-8,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "max_grad_norm": 1.0,
    ## Scheduler ##
    "warmup_steps": 10,
    "lr_scheduler_type": "cosine",
    ## Logging
    "log_level": "info",
    "logging_first_step": True,
    "logging_steps": 10
}
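Several of these values interact. The sketch below works out the generation budget, the effective batch size, and the approximate number of optimizer steps per epoch. The GPU count (2, from the `GPU T4 x 2` accelerator) and the training-set size (`num_train_samples=1000`) are assumptions for illustration; substitute the real values from your run. `training_budget` is a hypothetical helper.

```python
import math


def training_budget(params: dict, num_gpus: int, num_train_samples: int):
    """Derive a few quantities implied by the training configuration."""
    # Tokens left for the completion once the prompt budget is reserved.
    max_new_tokens = params["max_length"] - params["max_prompt_length"]
    # Examples contributing to each optimizer step.
    effective_batch = (
        params["per_device_train_batch_size"]
        * params["gradient_accumulation_steps"]
        * num_gpus
    )
    # Approximate: the trainer's exact rounding may differ by one step.
    steps_per_epoch = math.ceil(num_train_samples / effective_batch)
    return max_new_tokens, effective_batch, steps_per_epoch


params = {
    "max_length": 768,
    "max_prompt_length": 512,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 8,
}
max_new_tokens, effective_batch, steps_per_epoch = training_budget(
    params, num_gpus=2, num_train_samples=1000  # both values are assumptions
)
print(max_new_tokens, effective_batch, steps_per_epoch)
```

With so few optimizer steps in a single epoch, `warmup_steps`, `eval_steps`, and `logging_steps` each cover a noticeable share of the run, which is worth keeping in mind when changing the dataset size.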

### DPO Training Loop

In [23]:
accelerator = Accelerator()


data_args = DataArguments(**data_params)
training_args = DPOConfig(**training_params)
model_args = ModelConfig(**model_config)

###################
# Model & Tokenizer
###################
torch_dtype = (
    model_args.torch_dtype
    if model_args.torch_dtype in ["auto", None]
    else getattr(torch, model_args.torch_dtype)
)
quantization_config = get_quantization_config(model_args)
model_kwargs = dict(
    revision=model_args.model_revision,
    attn_implementation=model_args.attn_implementation,
    torch_dtype=torch_dtype,
    use_cache=False if training_args.gradient_checkpointing else True,
    device_map=get_kbit_device_map() if quantization_config is not None else None,
    quantization_config=quantization_config,
)
model = AutoModelForCausalLM.from_pretrained(
    model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code, **model_kwargs
)
peft_config = get_peft_config(model_args)
if peft_config is None:
    ref_model = AutoModelForCausalLM.from_pretrained(
        model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code, **model_kwargs
    )
else:
    ref_model = None
tokenizer = AutoTokenizer.from_pretrained(
    model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code
)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
if tokenizer.chat_template is None:
    tokenizer.chat_template = SIMPLE_CHAT_TEMPLATE

################
# Dataset
################
dataset = get_datasets(data_args, splits=data_args.dataset_splits)



################
# Training
################
trainer = DPOTrainer(
    model,
    ref_model,
    args=training_args,
    train_dataset=dataset[data_args.dataset_train_split],
    eval_dataset=dataset[data_args.dataset_test_split].shuffle(data_args.seed)\
        .take(min(len(dataset[data_args.dataset_test_split]), data_args.num_eval_samples))\
        if training_args.eval_strategy != "no" else None,
    processing_class=tokenizer,
    peft_config=peft_config,
)

trainer.train()

if training_args.eval_strategy != "no":
    metrics = trainer.evaluate()
    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)

# Save and push to hub
trainer.save_model(training_args.output_dir)
if training_args.push_to_hub:
    trainer.push_to_hub(dataset_name=data_args.dataset_name)
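For intuition about what `beta` controls in the loop above, here is a minimal sketch of the standard sigmoid DPO loss for a single preference pair, written in plain Python. The log-probability values are made up for illustration; `DPOTrainer` computes them from the policy and reference models, and its actual implementation offers more options than shown here.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Sigmoid DPO loss for one (chosen, rejected) pair."""
    # How much more (or less) likely each answer is under the policy
    # than under the reference model, in log space.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: no preference learned yet.
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy now favors the chosen answer and disfavors the rejected one.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

A larger `beta` amplifies the margin, so the same shift in log-probabilities moves the loss more and the policy is held closer to the reference.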

### Your Turn!
Suppose we randomly select a subset of the data by setting the mixer fraction to **X**, as shown below:

```python
data_params = {
  "dataset_name": "Mix 1",
  "dataset_mixer": {
    "./data/AMLD25_reuters_gentitle_1k": X,
  },
  "dataset_splits": ["train", "test"],
  "num_eval_samples": 100,
  "seed": 42
}
```
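As a rough mental model of what a fractional mixer weight does, the sketch below draws a random fraction of a list of examples with a fixed seed. This is an assumption about the behavior, not the actual `get_datasets` implementation, and `sample_fraction` is a hypothetical helper.

```python
import random


def sample_fraction(examples: list, fraction: float, seed: int = 42) -> list:
    """Return a reproducible random subset containing `fraction` of the examples."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    k = int(len(examples) * fraction)
    # Sample indices without replacement and keep the original ordering.
    indices = sorted(rng.sample(range(len(examples)), k))
    return [examples[i] for i in indices]


subset = sample_fraction(list(range(1000)), fraction=0.25)
print(len(subset))
```

Random sampling keeps the subset representative, but it treats every example as equally useful.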

Can we make a better selection?

**Hint(s)**:
1. Take a close look at the maximum context length (768) and at the `chosen` and `rejected` features
2. Check the other columns

<details>
<summary> <b>Solution Spoiler!</b> </summary>
  Search <code>src/trlabs/utils.py</code> for the solution (functions: <code>reuters_cleaning_dataset</code> and <code>not_relevant_data</code>)
</details>
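One direction suggested by the first hint is to measure which preference pairs actually fit inside the context window. The sketch below is not the workshop solution. It assumes each example has `prompt`, `chosen`, and `rejected` string fields, and it uses whitespace word counts as a crude stand-in for tokenizer lengths (`count_tokens` is a hypothetical placeholder to swap for a real tokenizer).

```python
MAX_LENGTH = 768
MAX_PROMPT_LENGTH = 512


def count_tokens(text: str) -> int:
    """Placeholder length measure; a real tokenizer would give different counts."""
    return len(text.split())


def fits_context(example: dict) -> bool:
    """True if the prompt and the longer of the two answers fit the limits."""
    prompt_len = count_tokens(example["prompt"])
    if prompt_len > MAX_PROMPT_LENGTH:
        return False
    longest_answer = max(
        count_tokens(example["chosen"]), count_tokens(example["rejected"])
    )
    return prompt_len + longest_answer <= MAX_LENGTH


# Toy examples, fabricated only to exercise the filter.
examples = [
    {"prompt": "word " * 100, "chosen": "A SHORT TITLE", "rejected": "another title"},
    {"prompt": "word " * 600, "chosen": "A SHORT TITLE", "rejected": "another title"},
    {"prompt": "word " * 500, "chosen": "word " * 300, "rejected": "short"},
]
kept = [ex for ex in examples if fits_context(ex)]
print(f"kept {len(kept)} of {len(examples)} examples")
```

With a Hugging Face `Dataset`, the same predicate could be passed to `Dataset.filter`.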


## Take a Look at the Model Generation

In [22]:
from trlabs.utils import dataset_creation, not_relevant_data

SYSTEM_PROMPT = 'You are an advanced AI system specialised in providing Reuters News title given a body text of the news.'
INSTRUCTION = "The title should be in capital letters and between 6 and 8 words in length. Please provide only the title as output and no other text or explanation."

dataset = load_dataset("ucirvine/reuters21578", 'ModApte', trust_remote_code=True)
dataset = dataset.filter(not_relevant_data).shuffle(seed=42).map(dataset_creation, fn_kwargs={"system_prompt": SYSTEM_PROMPT, "instruction": INSTRUCTION})

In [21]:
from trlabs.rl.eval import setup_model_and_lora, generate

index = 15
prompt = dataset["test"][index]["system"] + dataset["test"][index]["messages"]

model, tokenizer = setup_model_and_lora(
    base_model_name = model_config["model_name_or_path"], 
    lora_path = training_params["output_dir"]
)

response = generate(prompt, model, tokenizer)
print(response)
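To judge a generation against the instruction given to the model (capital letters, 6 to 8 words), a small checker like the following can help. `follows_title_format` is a hypothetical helper, not part of `trlabs`, and it counts words by splitting on whitespace, which is an assumption.

```python
def follows_title_format(title: str, min_words: int = 6, max_words: int = 8) -> bool:
    """Check that a title is fully uppercase and within the word-count range."""
    words = title.strip().split()
    # str.isupper() ignores digits and punctuation but requires at least
    # one letter, and every letter must be uppercase.
    return title.isupper() and min_words <= len(words) <= max_words


print(follows_title_format("OIL PRICES RISE AFTER OPEC SUPPLY CUT"))
print(follows_title_format("Oil Prices Rise After Opec Supply Cut"))
```

Running the checker over many test examples gives a simple format-compliance rate for comparing the base model with the fine-tuned one.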

#### Note:
If you do not provide a `lora_path`, you can inspect the base model's output instead.