# GDPR gemma-2 2b model

## Key Features

- Specialized in GDPR compliance and data protection regulations
- Utilizes DPO for precise alignment with GDPR principles
- Implements QLoRA for efficient and resource-friendly training
- Designed to provide accurate and relevant responses to GDPR-related inquiries

## Model Details

- Base Model: Google Gemma 2 2B, instruction-tuned (`google/gemma-2-2b-it`)
- Fine-tuning Method: Direct Preference Optimization (DPO)
- Training Dataset: [sims2k/GDPR_QA_instruct_dataset](https://huggingface.co/datasets/sims2k/GDPR_QA_instruct_dataset)
- Quantization: 4-bit quantization using QLoRA



In [2]:
import os
from google.colab import userdata
# Note: `userdata.get` is a Colab API. If you're not using Colab, set the env
# vars as appropriate for your system.
os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")

The notebook mounts Google Drive so that the trained model survives after the session ends. If you're running it locally on your computer, skip this cell and point `PATH_MODEL` (defined later) at a local directory instead. You can also run the notebook on Google Colab without mounting Drive, but the saved model will then live in the session's temporary storage, and you'll lose it every time you close the session.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Introduction to DPO


Direct Preference Optimization (DPO) is a model alignment technique similar to Reinforcement Learning from Human Feedback (RLHF). Both methods are used to align a model with the preferences or needs of its users. However, DPO has become more popular in many projects because it achieves comparable results to RLHF while requiring significantly fewer resources.

Both techniques start with a dataset that pairs each prompt with a preferred (chosen) response and a rejected response.

Here is where the methods diverge. In RLHF, this dataset is used to train a second model, known as a reward model, which plays a crucial role in the alignment process. In contrast, DPO uses the dataset directly to train the final model. This is the primary difference between the two techniques.

As you might imagine, DPO is a more straightforward approach that demands fewer resources.

The implementation of DPO you will be using is developed by Hugging Face in their TRL (Transformer Reinforcement Learning) library. Although DPO does not run a reinforcement learning loop, it can be read as optimizing an implicit reward: training raises the likelihood of the chosen response relative to the rejected one, measured against a frozen copy of the starting model (the reference model).
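To make the idea of an implicit reward concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. It is an illustration only, not TRL's implementation: the function name `dpo_loss` and the log-probability values are hypothetical, and each input stands for the summed token log-probability of a full response.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from summed token log-probabilities (illustrative)."""
    # Implicit rewards: how much more likely each response is under the
    # policy than under the frozen reference model, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))

# Before training, the policy equals the reference, so the margin is 0.
print(round(dpo_loss(-50.0, -20.0, -50.0, -20.0), 4))  # 0.6931

# Hypothetical values after training: chosen is more likely, rejected less.
print(dpo_loss(-40.0, -60.0, -50.0, -20.0))
```

The loss starts at log 2 and shrinks as the chosen response gains probability relative to the rejected one. No separate reward model appears anywhere in the computation.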


## Install dependencies
Run the cell below to install all the required dependencies.

In [4]:
# The `+cu121` local-version tag is not available from the default index,
# so pin the plain 2.3.1 release instead.
!pip install -q torch==2.3.1
!pip install -q transformers==4.43.0
!pip install -q datasets==2.19.1
!pip install -q trl==0.8.6
!pip install -q peft==0.11.1
!pip install -q bitsandbytes==0.43.1
!pip install -q sentencepiece==0.1.99
!pip install -q accelerate==0.30.1
!pip install -q huggingface_hub==0.23.2

In [47]:
#Import necessary classes.
import gc
import torch
import transformers

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, PeftModel
from trl import DPOTrainer

Another necessary step is to log in to Hugging Face.

In [48]:
from huggingface_hub import login

login(os.environ["HF_TOKEN"])

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


## Loading the dataset
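The chosen dataset is [sims2k/GDPR_QA_instruct_dataset](https://huggingface.co/datasets/sims2k/GDPR_QA_instruct_dataset), an instruction-style question-answering dataset about GDPR. It does not ship with preference pairs: each row has an instruction, an input, and a single reference output, so the rejected response that DPO needs has to be constructed separately.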

Before using it for training, the dataset's content needs to be formatted correctly to ensure compatibility with the DPO alignment process.

To process the dataset, it's necessary to load the tokenizer.

In [49]:
model_name = "google/gemma-2-2b-it"
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

Before you begin aligning the model, it's necessary to load the dataset and transform it to fit the format required by the DPOTrainer class.
This format consists of three fields: the prompt, the chosen answer, and the rejected answer.

In this example, I’m loading all the rows of the dataset. If you want to reduce the time needed for alignment or fit the process on a smaller GPU, you can load a smaller slice of the split (for example `train[:100]`). If you prefer a more complete fine-tuning process, feel free to use the full dataset.


In [50]:
# Load dataset
# dataset_original = load_dataset("sims2k/GDPR_QA_instruct_dataset")
dataset_original = load_dataset("sims2k/GDPR_QA_instruct_dataset", split='train[:]')
# Save columns
original_columns = dataset_original.column_names
print(original_columns)

['instruction', 'input', 'output', 'text', 'discussion_text']


In [51]:
dataset_original

Dataset({
    features: ['instruction', 'input', 'output', 'text', 'discussion_text'],
    num_rows: 316
})

Next, filter the dataset to drop rows whose combined text is too long.

In [57]:
def filter_by_length(example):
    max_length = 2048  # Set an appropriate maximum length (counted in characters, not tokens)
    return (len(example['instruction']) + len(example['input']) + len(example['output'])) <= max_length
    # return (len(example['instruction']) + len(example['input'])) <= max_length

dataset_filtered = dataset_original.filter(filter_by_length)

In [58]:
dataset_filtered

Dataset({
    features: ['instruction', 'input', 'output', 'text', 'discussion_text'],
    num_rows: 104
})

The dataset still contains all the original columns, but the number of rows has dropped from 316 to 104. I should warn you that 104 rows are too few for proper training; this reduction is intended to let the notebook run in just a few minutes and still produce results. Even so, they are enough to shift the model's responses toward the content of the dataset.

I’ll use the dataset’s **map** function to apply the transformation to each row and remove the original columns.

In [59]:
def chatml_format(example):
    # Build the prompt
    instruction = example['instruction']
    input_text = example['input']
    user_message = f"{instruction}\n\n{input_text}"

    messages = [{"role": "user", "content": user_message}]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    # Chosen response
    chosen = example['output'] + tokenizer.eos_token

    # Build the rejected response
    # This is a simple placeholder; in practice a more sophisticated method may be needed.
    rejected = "I'm not familiar with the specific GDPR regulations for this case." + tokenizer.eos_token

    return {
        "prompt": prompt,
        "chosen": chosen,
        "rejected": rejected
    }


In [65]:
dataset = dataset_filtered.map(
    chatml_format,
    remove_columns=dataset_original.column_names
)


In [67]:
# Print sample
dataset[3]

{'prompt': '<bos><start_of_turn>user\nDiscuss the legal basis for the processing of personal data of clinical trial subjects in the context of clinical trials as defined by the Clinical Trials Regulation (CTR) and the GDPR. Clearly differentiate between processing operations that relate to research activities and those aimed at protecting health, and explain the applicable legal bases for these categories.\n\nWhat is the legal basis for the processing of personal data in clinical trials, especially regarding the primary use of clinical trial data as outlined by the Clinical Trials Regulation?<end_of_turn>\n<start_of_turn>model\n',
 'chosen': '1. **Safety Reporting (Articles 41-43 of the CTR, Article 6(1)(c) of GDPR)**: Processing necessary for compliance with legal obligations related to safety reporting. 2. **Archiving Data (Article 58 of the CTR, Article 6(1)(c) of GDPR)**: Obligations concerning the archiving of clinical trial master files and medical files of subjects for specific 

Now the dataset contains only the necessary columns, with the text adapted to the chat format that Gemma expects. Each prompt has the following shape:


> '\<bos>\<start_of_turn>user\n{instruction and input}\<end_of_turn>\n\<start_of_turn>model\n'
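As a rough illustration of the record layout that DPO training expects, the sketch below builds one record by hand, without loading the tokenizer. The helper `to_dpo_record`, the toy row, and the literal `<eos>` string are assumptions for illustration; in the notebook the template and end-of-sequence token come from `tokenizer.apply_chat_template` and `tokenizer.eos_token`.

```python
def to_dpo_record(instruction, input_text, output,
                  rejected_text="I'm not familiar with the specific GDPR regulations for this case.",
                  eos="<eos>"):
    """Build a prompt/chosen/rejected record (hypothetical helper)."""
    user_message = f"{instruction}\n\n{input_text}"
    # Hand-written copy of the Gemma turn markers seen in the sample above.
    prompt = (
        "<bos><start_of_turn>user\n"
        f"{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )
    return {
        "prompt": prompt,
        "chosen": output + eos,
        "rejected": rejected_text + eos,
    }

# Toy row, invented purely to show the shape of the result.
record = to_dpo_record(
    "Explain the term.",
    "What is a data controller?",
    "A data controller determines the purposes and means of processing.",
)
for key, value in record.items():
    print(f"{key}: {value!r}")
```

Because every record shares the same rejected text, the preference signal is easy to learn; more varied rejected answers would make the task harder and more informative.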

## Train model with DPO

### Preparing configuration.

Now it's time to configure the necessary settings for alignment using DPO.

To perform a lighter fine-tuning, I will use LoRA (Low-Rank Adaptation), which significantly reduces the number of parameters that need to be trained. LoRA freezes the original weights and adds small low-rank adapter matrices alongside the targeted layers; only these adapter weights are adjusted. In this case, since we want the alignment process to have a noticeable impact on the model's behavior, the values for **r** and **lora_alpha** are set toward the upper end of what is typically used for a model of this size.

The value of **r** is the rank of the adapter matrices; the higher the value, the more parameters are trained. A value of 16 is at the upper limit of what is usually recommended for small language models.

It’s generally recommended that **lora_alpha** be set to twice the value of **r**. However, since **r** can vary depending on the model size, this may lead to a very high **lora_alpha** value if you are fine-tuning a large model and, for example, specify an **r** of 64.
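The arithmetic behind **r** and **lora_alpha** can be sketched in a few lines. This assumes the standard LoRA formulation, in which a weight of shape `d_out x d_in` gains two adapter matrices of shapes `r x d_in` and `d_out x r`, and the adapter update is scaled by `lora_alpha / r`. The layer width of 2304 and both helper names are illustrative assumptions.

```python
def lora_extra_params(d_in, d_out, r):
    """Trainable parameters added by one LoRA adapter pair."""
    # A has shape (r, d_in) and B has shape (d_out, r).
    return r * (d_in + d_out)

def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update before it is added to the weight."""
    return lora_alpha / r

d_in = d_out = 2304  # illustrative layer width, an assumption
full = d_in * d_out  # parameters in the frozen weight matrix

for r in (8, 16, 64):
    extra = lora_extra_params(d_in, d_out, r)
    share = extra / full
    print(f"r={r:>2}  adapter params={extra:>7}  share of layer={share:.2%}  "
          f"scaling with alpha=2r: {lora_scaling(2 * r, r)}")
```

With `lora_alpha` fixed at twice **r**, the scaling factor stays at 2 whatever the rank, which is one reason that rule of thumb is convenient.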


In [68]:
# LoRA configuration
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules="all-linear"
)


The quantization configuration is straightforward: the model's weights are loaded in 4-bit precision, using the NF4 data type with double quantization.

In [69]:

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)


This approach allows the model to occupy less memory, enabling the alignment process to be performed on a smaller GPU.
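A back-of-the-envelope estimate shows the scale of the saving. The sketch below counts weight storage only, and the parameter count of roughly 2.6 billion is an assumed approximate figure. Real usage is higher, because quantization constants, layers that stay unquantized, activations, and optimizer state all take extra memory.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

NUM_PARAMS = 2.6e9  # assumed approximate parameter count

for label, bits in (("float32", 32), ("bfloat16", 16), ("4-bit", 4)):
    print(f"{label:>8}: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```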

In [70]:
# Model to fine-tune
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    attn_implementation='eager',
    torch_dtype=torch.bfloat16
)
model.config.use_cache = False
model.gradient_checkpointing_enable()

`low_cpu_mem_usage` was None, now set to True since model is quantized.



The next step is to set up the training parameters.

In [71]:
#Name of the model you want to create.
new_model = "gpdr_gemma_2b"

# Training arguments
#I'm using a batch_size of just 1 to avoid problems with memory consumption.
training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=3,
    gradient_checkpointing_kwargs={'use_reentrant':False},
    gradient_checkpointing=True,
    remove_unused_columns=False,
    learning_rate=5.0e-06,
    logging_strategy="epoch",
    lr_scheduler_type="cosine",
    num_train_epochs=10,
    save_strategy="epoch",
    logging_steps=1,
    output_dir=new_model,
    optim="paged_adamw_32bit",
    warmup_steps=2,
    bf16=True,
    report_to="none",
)


I’ll explain the most important and specific training parameters:

**lr_scheduler_type**="cosine": The learning rate follows a cosine schedule. After warmup it peaks at the value specified in **learning_rate** and then gradually decays toward zero.

**warmup_steps**=2: For the first two optimizer steps (not epochs), the learning rate ramps up to its peak instead of decreasing. The aim is to stabilize the start of training.

**gradient_accumulation_steps**=3: To save memory, gradients are accumulated over three batches before the model weights are updated, which gives an effective batch size of 3 with a per-device batch size of 1.
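The shape of the learning-rate curve can be sketched in plain Python. This is an approximation of a linear warmup followed by a half-cosine decay, not a copy of the library scheduler. The helper name `lr_at_step` is hypothetical, and the total of 340 steps is taken from the training output shown further down.

```python
import math

def lr_at_step(step, base_lr=5.0e-06, warmup_steps=2, total_steps=340):
    """Learning rate at a given optimizer step (illustrative sketch)."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup schedule that has elapsed, from 0 to 1.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half cosine wave: 1.0 at the start of decay, 0.0 at the end.
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

for step in (0, 1, 2, 170, 340):
    print(f"step {step:>3}: lr = {lr_at_step(step):.2e}")
```

With only two warmup steps out of 340, the warmup phase is almost negligible here; the cosine decay dominates the schedule.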

With these parameters, I've tried to find a training setup with low memory requirements, thanks to the use of gradient accumulation, gradient checkpointing, a small batch size, and the use of bf16 along with the paged_adamw_32bit optimizer.
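The batch settings also determine how many optimizer steps the run takes. The sketch below assumes a single GPU and assumes that the number of updates per epoch is the number of batches floor-divided by the accumulation steps; that assumption is consistent with the `global_step=340` reported by `trainer.train()`. The helper name is hypothetical.

```python
import math

def training_steps(num_rows, per_device_batch_size, grad_accum_steps, num_epochs):
    """Return (optimizer updates per epoch, total optimizer updates)."""
    batches_per_epoch = math.ceil(num_rows / per_device_batch_size)
    # Assumed behaviour: leftover batches do not form an extra update.
    updates_per_epoch = max(1, batches_per_epoch // grad_accum_steps)
    return updates_per_epoch, updates_per_epoch * num_epochs

per_epoch, total = training_steps(
    num_rows=104, per_device_batch_size=1, grad_accum_steps=3, num_epochs=10
)
print(per_epoch, total)  # 34 340

# Effective batch size = per-device batch size * accumulation steps.
print(1 * 3)  # 3
```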

Now you can create the trainer, passing it the training dataset, the newly created training arguments, the LoRA configuration, and the tokenizer as parameters. No evaluation dataset is used in this run.

In [72]:
# Create DPO trainer
trainer = DPOTrainer(
    model,
    args=training_args,
    train_dataset=dataset,
    #eval_dataset=dataset_eval,
    tokenizer=tokenizer,
    peft_config=peft_config,
    beta=0.1,
    max_prompt_length=2048,
    max_length=2048,
)


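A beta of 0.1 is a commonly used default. Beta controls how strongly the trained model is kept close to the reference (base) model: higher values keep it closer, while lower values let the preference data pull it further away. If you want the new training to have more weight, perhaps because you're training for a very specific task, you could specify a lower beta value.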

In [73]:
# Fine-tune model with DPO
trainer.train()

  with compute_loss_context_manager():
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
Could not estimate the number of tokens of the input, floating-point operations will not be computed


| Step | Training Loss |
|------|---------------|
| 34 | 0.3314 |
| 69 | 0.008 |
| 104 | 0.0013 |
| 138 | 0.0011 |
| 173 | 0.001 |
| 208 | 0.001 |
| 242 | 0.001 |
| 277 | 0.001 |
| 312 | 0.001 |
| 340 | 0.001 |



TrainOutput(global_step=340, training_loss=0.034823293863412215, metrics={'train_runtime': 863.7282, 'train_samples_per_second': 1.204, 'train_steps_per_second': 0.394, 'total_flos': 0.0, 'train_loss': 0.034823293863412215, 'epoch': 9.807692307692308})

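It seems to have worked reasonably well: the training loss falls from 0.3314 to about 0.001 by the third epoch. That said, there is a real risk of overfitting. Every rejected answer is the same fixed sentence, so the model can learn to separate chosen from rejected very easily, and no evaluation set was used to check how well the result generalizes. To mitigate overfitting, you could expand the dataset, build more varied rejected answers, and try increasing the **lora_dropout** parameter in **LoraConfig**.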


## Upload model to Hugging Face.

In [74]:
PATH_MODEL="/content/drive/MyDrive/final_checkpoint"

In [75]:
# Save artifacts
trainer.model.save_pretrained(PATH_MODEL)
tokenizer.save_pretrained(PATH_MODEL)


('/content/drive/MyDrive/final_checkpoint/tokenizer_config.json',
 '/content/drive/MyDrive/final_checkpoint/special_tokens_map.json',
 '/content/drive/MyDrive/final_checkpoint/tokenizer.model',
 '/content/drive/MyDrive/final_checkpoint/added_tokens.json',
 '/content/drive/MyDrive/final_checkpoint/tokenizer.json')

Execute this cell only if you are having memory issues.

In [76]:
#Flush memory
del trainer, model, tokenizer
gc.collect()
torch.cuda.empty_cache()

Now, you're going to load the original model again, but this time in its unquantized format.

In [77]:
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    return_dict=True,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name,
                                          use_fast=False)


Next, the base model is merged with the saved LoRA adapter.

In [78]:
model = PeftModel.from_pretrained(base_model, PATH_MODEL)
model = model.merge_and_unload()

The model that you have in memory is now a combination of the base model and the adapter that you have trained. You can now save this new model and upload it to Hugging Face.
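Conceptually, merging folds each adapter into the weight it is attached to. The sketch below shows that arithmetic on a tiny matrix, assuming the standard LoRA update `W + (lora_alpha / r) * B @ A`. The helper names and all values are invented for illustration, and the real merge operates on tensors inside the library.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Fold a LoRA adapter into a base weight matrix (illustrative)."""
    scaling = lora_alpha / r
    delta = matmul(lora_b, lora_a)  # (d_out x r) @ (r x d_in)
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight, d_out x d_in
A = [[0.5, -0.5]]             # adapter A, r x d_in with r = 1
B = [[1.0], [2.0]]            # adapter B, d_out x r

merged = merge_lora(W, A, B, lora_alpha=2, r=1)
print(merged)  # [[2.0, -1.0], [2.0, -1.0]]
```

After merging, the adapter matrices are no longer needed at inference time, which is why the merged model can be saved and loaded like an ordinary checkpoint.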

In [79]:
model.save_pretrained(new_model)
tokenizer.save_pretrained(new_model)

('gpdr_gemma_2b/tokenizer_config.json',
 'gpdr_gemma_2b/special_tokens_map.json',
 'gpdr_gemma_2b/tokenizer.model',
 'gpdr_gemma_2b/added_tokens.json')

In [80]:
model.push_to_hub(new_model,
                  private=True,
                  use_temp_dir=False)
tokenizer.push_to_hub(new_model,
                      private=True,
                      use_temp_dir=False)


CommitInfo(commit_url='https://huggingface.co/cycloevan/gpdr_gemma_2b/commit/1eecb74b8ea68864154e7f0dd5dc40abf404c41a', commit_message='Upload tokenizer', commit_description='', oid='1eecb74b8ea68864154e7f0dd5dc40abf404c41a', pr_url=None, pr_revision=None, pr_num=None)

## Inference

Let's test the new model and compare it with the original.

In [100]:
#Original Gemma Model.
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [101]:
# Format prompt
# message = [
#     {"role": "user", "content": "Solve 25000/2 step by step. \nLimit your response to mathematical expressions and symbols."}
# ]
message = [
    {"role": "user", "content": "How does our organization comply with the GDPR's Article 65(1)(a) dispute resolution mechanism in cases involving cross-border processing of personal data?"}
]

prompt = tokenizer.apply_chat_template(message, add_generation_prompt=True, tokenize=False)


In [102]:
# Create pipeline
pipeline = transformers.pipeline(
    "text-generation",
    device="cuda",
    model=model_name,
    tokenizer=tokenizer
)


In [103]:
# Generate text
sequences = pipeline(
    prompt,
    do_sample=True,
    temperature=0.1,
    top_p=0.2,
    num_return_sequences=1,
    max_length=200,
    # max_length=2048,
)
print(sequences[0]['generated_text'])

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


<bos><start_of_turn>user
How does our organization comply with the GDPR's Article 65(1)(a) dispute resolution mechanism in cases involving cross-border processing of personal data?<end_of_turn>
<start_of_turn>model
Let's break down how organizations can comply with GDPR Article 65(1)(a) for cross-border data processing disputes.

**Understanding Article 65(1)(a)**

Article 65(1)(a) of the GDPR deals with the **"right to obtain information"** and **"right to rectify"** for individuals whose personal data is processed. It outlines the process for individuals to seek redress when they believe their rights have been violated.

**Key Points:**

* **Cross-border processing:** This article applies when personal data is transferred from one EU member state to another.
* **Dispute resolution:** It doesn't directly address specific dispute resolution mechanisms. Instead, it emphasizes the right to seek redress through the **


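**The original model misreads the question: it describes Article 65(1)(a) as covering the rights to information and rectification, and says the article does not address dispute resolution. The answer is also cut off by the 200-token `max_length` limit.**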

In [104]:
del pipeline, tokenizer
#Flush memory
gc.collect()
torch.cuda.empty_cache()

In [105]:
# Load the Aligned Model.
tokenizer_new_model = AutoTokenizer.from_pretrained(new_model)


In [106]:
# Create pipeline
pipeline_new = transformers.pipeline(
    "text-generation",
    device="cuda",
    model=new_model,
    tokenizer=tokenizer_new_model
)


In [107]:
# Generate text
prompt = tokenizer_new_model.apply_chat_template(message, add_generation_prompt=True, tokenize=False)

sequences = pipeline_new(
    prompt,
    do_sample=True,
    temperature=0.1,
    top_p=0.2,
    num_return_sequences=1,
    max_length=200,
)
print(sequences[0]['generated_text'])



<bos><start_of_turn>user
How does our organization comply with the GDPR's Article 65(1)(a) dispute resolution mechanism in cases involving cross-border processing of personal data?<end_of_turn>
<start_of_turn>model
Let's break down how organizations comply with GDPR Article 65(1)(a) for cross-border data processing disputes.

**Understanding Article 65(1)(a)**

Article 65(1)(a) of the GDPR outlines the dispute resolution mechanism for individuals who believe their rights under the GDPR have been violated. It specifically addresses situations where:

* **Cross-border processing:** The data processing is carried out by an organization based in one EU member state and involves data from another EU member state.
* **Dispute arises:** The individual has a dispute with the organization about the processing of their personal data.

**Compliance Mechanisms**

Organizations must comply with Article 65(1)(a) by providing a clear and accessible mechanism


**The DPO-aligned model stays on the topic of dispute resolution, which is closer to what the prompt asks, although the answer remains generic and is also cut off by the `max_length` limit.**