<a href="https://colab.research.google.com/github/shake/colab-Llama-2-ipynb/blob/main/phi-2/fine_tune_phi2_dpo_lora_quantization_intel_orca_dpo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tune Phi2 using DPO LoRA and quantization

How do you get a DPO dataset? You can either [create synthetic data or review an existing dataset with distilabel](https://huggingface.co/argilla/distilabeled-Hermes-2.5-Mistral-7B) or take a completely raw approach and start [with an existing data collection, as obtained in this Jupyter notebook](https://colab.research.google.com/drive/1p7d-iqtKlxojT3xetEL6PsJjdhZcm1xK?usp=sharing).

After the annotators have submitted their feedback, we will use it to fine-tune [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) with DPO. Phi-2 is a compact language model with 2.7 billion parameters. Despite its smaller size, it performs well relative to larger models. The base model has not been fine-tuned using DPO to align it with social reasoning.

Install the required third-party libraries using pip:

In [None]:
!pip install bitsandbytes transformers peft accelerate datasets wandb trl -q -U


Let’s make the necessary imports:

In [None]:
from google.colab import userdata
from typing import Dict, Any, Iterator, Tuple
import os
import torch
from datasets import load_dataset, Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
import huggingface_hub
import wandb
from peft import LoraConfig, get_peft_model, PeftModel
from trl import DPOTrainer

Let's log in to Hugging Face so that we can upload our model afterwards.

In [None]:
huggingface_hub.login(token=userdata.get("HF_AUTH_TOKEN"))

In [None]:
wandb_token = userdata.get("WANDB_AUTH_TOKEN")
if wandb_token:
    wandb.login(key=wandb_token)

[34m[1mwandb[0m: Currently logged in as: [33mdavid_from_argilla[0m ([33margilla-io[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


# Set up the compute device

In [None]:
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Using {torch.cuda.get_device_name(0)}")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print("Using Apple MPS")
else:
    device = torch.device("cpu")
    print("No GPU available, using CPU instead.")


Using NVIDIA A100 80GB PCIe


# Load the Intel Orca DPO dataset and prepare it

We will load the [distilabeled Intel Orca DPO](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs?row=0) dataset from Argilla and prepare it for fine-tuning. To keep the data consistent with what the model saw during pre-training, we will format each prompt with the model's original instruction template.


In [None]:
org_name = "argilla"
dataset_name = "distilabel-intel-orca-dpo-pairs"
dataset = load_dataset(f"{org_name}/{dataset_name}")
dataset["train"]

Dataset({
    features: ['system', 'input', 'chosen', 'rejected', 'generations', 'order', 'labelling_model', 'labelling_prompt', 'raw_labelling_response', 'rating', 'rationale', 'status', 'original_chosen', 'original_rejected', 'chosen_score', 'in_gsm8k_train'],
    num_rows: 12859
})

In [None]:
# Prompt template that follows the Phi-2 instruction format
template = "Instruct: {instruction}\nOutput: {response}"

def formatting_func(sample: Dict[str, Any]) -> Dict[str, Any]:
    # Build the prompt from the instruction only; the response slot stays empty
    # because DPO reads the completions from the `chosen` and `rejected` columns
    sample["prompt"] = template.format(instruction=sample["input"], response="")

    return sample

formatted_dataset = dataset.map(formatting_func).select_columns(['prompt', 'chosen', 'rejected'])
formatted_dataset
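
To make the prompt format concrete, here is a minimal sketch that applies a template of the same shape to a single row. The `toy` row and the `build_prompt` helper are invented for illustration and are not part of the dataset or the tutorial code.

```python
from typing import Any, Dict

# Template of the same shape as the tutorial's, written on one line.
template = "Instruct: {instruction}\nOutput: {response}"

def build_prompt(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the sample with a `prompt` column added."""
    out = dict(sample)
    # The response slot is left empty: the completions live in chosen/rejected.
    out["prompt"] = template.format(instruction=sample["input"], response="")
    return out

# Hypothetical row, made up for this example.
toy = {"input": "What is 2 + 2?", "chosen": "4", "rejected": "5"}
row = build_prompt(toy)
print(repr(row["prompt"]))
```

The prompt ends right after `Output: `, so the model learns to continue from that point with either the chosen or the rejected completion.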

In [None]:
split_formatted_dataset = formatted_dataset["train"].train_test_split(test_size=0.2)
split_formatted_dataset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'chosen', 'rejected'],
        num_rows: 10287
    })
    test: Dataset({
        features: ['prompt', 'chosen', 'rejected'],
        num_rows: 2572
    })
})
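
The split sizes above can be checked by hand. The sketch below assumes the test share is rounded up to a whole row, which is consistent with the printed output; it does not call the `datasets` library.

```python
import math

n_rows = 12859    # rows in the original train split, from the output above
test_size = 0.2   # fraction passed to train_test_split

# Assumption: the test share is rounded up and the rest goes to training.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test
print(n_train, n_test)
```

Because the split is shuffled randomly, passing a fixed `seed` to `train_test_split` is one way to get the same rows in each split across runs.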

# Initialize the model

Note that we initialize a **quantized** version of the model and fine-tune it with **LoRA**. This reduces memory consumption and makes it possible to run the tutorial on consumer hardware and Google Colab. A full fine-tune would need a lot more GPU resources.

We have selected `microsoft/phi-2` as our main and reference model, so we will designate it in a variable.

In [None]:
# Set our model
model_name = "microsoft/phi-2"

Then, we will load the tokenizer and configure padding. Remember to set `trust_remote_code=True`, so that it can be properly loaded.

In [None]:
# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token_id = tokenizer.eos_token_id

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Next, we will load the **quantized model**, a crucial step that significantly reduces memory use. Quantization converts the model's weights from floating-point to a lower-precision format; here the weights are stored in 4-bit NF4 while computations run in float16. This makes the model smaller and better suited to devices with limited memory, but it comes at the cost of some accuracy.

In [None]:
compute_dtype = torch.float16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # store the weights in 4-bit precision
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,  # dtype used for the matrix multiplications
    bnb_4bit_use_double_quant=False,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=compute_dtype,
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map={"": 0},  # place the whole model on GPU 0
)
model.config.pad_token_id = tokenizer.pad_token_id
model.config.use_cache = False  # the KV cache is not needed during training
model.config.gradient_checkpointing = False
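
As a rough back-of-the-envelope check of why 4-bit loading matters, the sketch below estimates the size of the weights alone from the 2.7 billion parameter count quoted earlier. It is an approximation under simplifying assumptions: it ignores quantization constants, any layers that stay in higher precision, activations, and optimizer state.

```python
n_params = 2.7e9  # parameter count quoted earlier in this tutorial

def weight_gib(bits_per_param: float) -> float:
    """Approximate size of the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

fp16_gib = weight_gib(16)  # every weight stored in float16
nf4_gib = weight_gib(4)    # every weight stored in 4 bits

print(f"float16: {fp16_gib:.2f} GiB, 4-bit: {nf4_gib:.2f} GiB")
```

The real footprint during training is higher than either figure, but the four-fold reduction in weight storage is what lets the model fit on smaller GPUs.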


Finally, we want to initialize the **LoRA** configuration. LoRA freezes the pre-trained model weights and trains only a small set of additional low-rank parameters. This reduces the computational burden and memory requirements, making it a more practical and resource-efficient way to customize pre-trained models. In this case, we will target the layers within the attention mechanism and the feed-forward networks, although you can choose to target other modules, as identifying the best ones is still an open question.


In [None]:
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.5,
    r=32,
    target_modules=['k_proj', 'q_proj', 'v_proj', 'fc1', 'fc2'],
    bias="none",
    task_type="CAUSAL_LM",
)
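
To get a feel for how small the trainable part is, the sketch below counts the LoRA parameters for this configuration. The layer dimensions (`hidden`, `intermediate`, `n_layers`) are assumptions about the Phi-2 architecture that are not stated in this tutorial; check `model.config` before relying on them.

```python
# Assumed Phi-2 dimensions (not stated in this tutorial; verify in model.config).
hidden = 2560
intermediate = 10240
n_layers = 32

# Values taken from the LoraConfig above.
r = 32
lora_alpha = 16

# (in_features, out_features) of each targeted linear layer in one block.
targets = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, hidden),
    "v_proj": (hidden, hidden),
    "fc1": (hidden, intermediate),
    "fc2": (intermediate, hidden),
}

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) next to a frozen weight."""
    return rank * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in targets.values())
total = per_layer * n_layers
scaling = lora_alpha / r  # factor applied to the low-rank update

print(f"trainable LoRA parameters: {total:,}")
print(f"share of 2.7B base parameters: {total / 2.7e9:.2%}")
print(f"update scaling (lora_alpha / r): {scaling}")
```

Under these assumptions the adapter holds about 42 million parameters, a small fraction of the base model. On a PEFT-wrapped model, `print_trainable_parameters()` reports the actual count.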

DPO also needs a **reference model**. We will initialize the `DPOTrainer` with `ref_model=None`, so that only a single base model has to be loaded: the trainer computes the active logits with the adapter enabled and the reference logits with the adapter disabled.
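
To illustrate why both sets of logits are needed, here is a minimal sketch of the per-example DPO loss in plain Python. It assumes the summed log-probabilities of the chosen and rejected completions are already available; the function name and the numbers are invented for illustration, and this is not the TRL implementation.

```python
import math

def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-example DPO loss: -log(sigmoid(beta * (chosen gain - rejected gain)))."""
    # How much more (or less) likely each completion is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) is equal to log(1 + exp(-x)).
    return math.log1p(math.exp(-margin))

# At the start of training the policy equals the reference, so the margin is 0.
start = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Hypothetical later state: chosen became more likely, rejected less likely.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)

print(f"start: {start:.4f}, improved: {improved:.4f}")
```

The loss starts at log 2 and falls as the policy prefers the chosen completion more strongly than the reference does. A larger `beta` makes the loss more sensitive to that difference.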


# Train the model

Now, we will set the training arguments and start to fine-tune using the TRL [DPOTrainer](https://huggingface.co/docs/trl/main/en/dpo_trainer). Take into account that these parameters may differ depending on your exact purpose and hardware requirements.


In [None]:
model_name = f"phi2-lora-{dataset_name}"
os.environ["WANDB_PROJECT"] = model_name  # name your W&B project
os.environ["WANDB_LOG_MODEL"] = "checkpoint"  # log all model checkpoints

In [None]:
training_arguments = TrainingArguments(
    output_dir=f"./{model_name}",
    evaluation_strategy="steps",
    do_eval=True,
    optim="paged_adamw_8bit",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    per_device_eval_batch_size=2,
    log_level="debug",
    save_steps=250,
    logging_steps=250,
    learning_rate=1e-5,
    eval_steps=250,
    num_train_epochs=1, # Modified for tutorial purposes
    warmup_steps=250,
    lr_scheduler_type="linear",
    report_to="wandb",
)
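
It helps to work out what these arguments imply for the length of the run. The sketch below is a rough estimate for a single GPU, using the training split size printed earlier; the exact step count reported by the trainer may differ slightly depending on how the last partial batch is handled.

```python
n_train = 10287                   # rows in the training split, from the output above
per_device_train_batch_size = 2
gradient_accumulation_steps = 16
warmup_steps = 250
peak_lr = 1e-5

# Number of examples that contribute to each optimizer update (single GPU assumed).
effective_batch = per_device_train_batch_size * gradient_accumulation_steps

# Approximate number of optimizer updates in one epoch.
steps_per_epoch = n_train // effective_batch

def warmup_lr(step: int) -> float:
    """Learning rate during linear warmup, capped at the peak value."""
    return peak_lr * min(1.0, step / warmup_steps)

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps per epoch: ~{steps_per_epoch}")
print(f"share of the epoch spent in warmup: {warmup_steps / steps_per_epoch:.0%}")
print(f"learning rate at step 125: {warmup_lr(125):.1e}")
```

With roughly 320 updates in the single epoch, most of the run is spent in warmup, and logging, evaluation, and checkpointing at every 250 steps each happen only once. For a short run like this, smaller values for those settings may give more useful feedback.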

PyTorch: setting up devices


In [None]:
dpo_trainer = DPOTrainer(
    model,
    args=training_arguments,
    beta=0.1,
    peft_config=peft_config,
    train_dataset=split_formatted_dataset["train"],
    eval_dataset=split_formatted_dataset["test"],
    tokenizer=tokenizer,
    padding_value=tokenizer.pad_token_id,
)

dpo_trainer.train()
dpo_trainer.save_model()

In [None]:
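# Trainer.push_to_hub() treats its first argument as a commit message,
# so push the trained adapter to an explicit repository id instead
dpo_trainer.model.push_to_hub(f"argilla/{model_name}")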

# Conclusion

In this tutorial, we have explored a method to fine-tune large language models using a pool of annotators. In particular, we have used Prolific to gather responses from a diverse group of annotators. We then analyzed the responses using Argilla. Finally, we have fine-tuned microsoft/phi-2 using DPO, quantization and LoRA.

Even though this tutorial focuses on a specific LM, the approach outlined here can be adapted to other models and tasks. To further boost performance, consider experimenting with a range of parameters. We encourage you to explore the different options.

# Next steps

## Interesting resources

- [Ollama](https://ollama.ai/) to get up and running with large language models locally. Don't forget to check our [notus blog](https://argilla.io/blog/notus7b/) and [model](https://ollama.ai/argilla/notus) on Ollama.
- [TRL](https://github.com/lvwerra/trl) is a full-stack library that provides a set of tools to train transformer language models.
- [bits and bytes](https://www.google.com/search?client=firefox-b-d&q=eli5+bits+and+bytes) allows users to run models in 4-bit precision.
- [LoRA](https://www.reddit.com/r/MachineLearning/comments/13m78u6/d_an_eli5_explanation_for_lora_lowrank_adaptation/) reduces the computational burden and memory requirements by fine-tuning a small set of additional parameters.
- [TheBloke](https://huggingface.co/TheBloke) for wonderful LLM quantization and fine-tuning.

## Shameless self-promotion

### Personal

- [LinkedIn](https://www.linkedin.com/in/david-berenstein-1bab11105/)
- [Twitter](https://twitter.com/davidbstein1957)
- [GitHub](https://github.com/davidberenstein1957)

### Company

- [Argilla GitHub](https://github.com/argilla-io/argilla)
- [Distilabel GitHub](https://github.com/argilla-io/distilabel)
- [Argilla Slack Community](https://join.slack.com/t/rubrixworkspace/shared_invite/zt-whigkyjn-a3IUJLD7gDbTZ0rKlvcJ5g)
- [Bi-weekly NLP community meetup](https://lu.ma/d720wy9f)