<a href="https://colab.research.google.com/github/sdossou/RLHF_RLAIF/blob/main/Llama3_8B_RLAIF_Creating_and_SFT_of_Policy_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RLAIF


In this notebook, an AI constitution guides the creation of a dataset using the RLAIF approach (constitutional AI and self-refinement) with Llama-3-8B.

## Create the AI Constitution

The AI constitution is a set of rules that keeps the generated dataset in line with expectations.



In [None]:
ai_constitution = {
    0: "The model should not generate unethical, racist, sexist, hateful, homophobic, transphobic, or otherwise toxic outputs.",
    1: "The model should move conversation in a positive direction.",
    2: "The model should politely point out harmful and problematic assumptions from the human.",
    3: "The model should not provide dangerous or legally questionable advice.",
    4: "The model should not assist any criminal plans or activities such as violence, theft, robbery or anything else illegal."
}
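The same rules are needed again later as plain text inside a prompt. A minimal sketch of rendering the dictionary as one rule per line; `render_constitution` is a hypothetical helper that is not part of the original notebook, and the dictionary is repeated here only so the block runs on its own.

```python
# Hypothetical helper (not in the original notebook): render the rules as text.
ai_constitution = {
    0: "The model should not generate unethical, racist, sexist, hateful, homophobic, transphobic, or otherwise toxic outputs.",
    1: "The model should move conversation in a positive direction.",
    2: "The model should politely point out harmful and problematic assumptions from the human.",
    3: "The model should not provide dangerous or legally questionable advice.",
    4: "The model should not assist any criminal plans or activities such as violence, theft, robbery or anything else illegal.",
}


def render_constitution(constitution):
    # Sort by key so the rules always appear in their numbered order.
    return "\n".join(constitution[key] for key in sorted(constitution))


constitution_text = render_constitution(ai_constitution)
print(constitution_text)
```

Rendering from the dictionary keeps a single source of truth, so a rule edited here would not need to be edited again wherever the text is pasted into a prompt.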



---



## Create SFT Dataset - Final Revision Dataset

Using the constitution along with the self-refinement process to create a supervised fine-tuning dataset.

### Load Base Model

Installing dependencies and loading the base model Meta-Llama-3-8B-Instruct.

In [None]:
!pip install -qU transformers accelerate bitsandbytes peft trl datasets tqdm


In [None]:
import torch
from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quant_config
)

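To see why the model is loaded in 4-bit, a rough estimate of weight memory helps. This is a back-of-the-envelope sketch only: the parameter count of about 8 billion is an approximation, and the estimate ignores layers that are not quantized, quantization constants, activations, and the KV cache.

```python
# Rough weight-memory estimate. n_params is an assumed round figure.
n_params = 8e9

bytes_16bit = n_params * 2    # 16 bits = 2 bytes per weight (fp16 / bf16)
bytes_4bit = n_params * 0.5   # 4 bits = 0.5 bytes per weight (nf4)

gib = 1024 ** 3
print(f"16-bit weights: ~{bytes_16bit / gib:.1f} GiB")
print(f"4-bit weights:  ~{bytes_4bit / gib:.1f} GiB")
```

The fourfold reduction is what makes the model plausible to load on a single modest GPU; actual usage will be somewhat higher than the 4-bit figure.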

Post-processing for the tokenizer.

In [None]:
base_model_tokenizer = AutoTokenizer.from_pretrained(model_id)

if getattr(base_model_tokenizer, "pad_token", None) is None:
    base_model_tokenizer.pad_token = base_model_tokenizer.eos_token


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Creating a `text-generation` pipeline.

In [None]:
import torch
from transformers import pipeline

base_pipeline = pipeline("text-generation", model=base_model, tokenizer=base_model_tokenizer)

### Build Critique Loop

The critique loop is:

1. Start with some prompt and obtain the model's generation.
2. Ask the model if that generation adheres to specific elements of the AI Constitution - rewriting the generation if it doesn't.
3. Repeat for each "rule" in the AI Constitution.
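The steps above can be sketched without a model to show the control flow. `critique_and_revise` and `fake_generate` are hypothetical names used for illustration; the stand-in generator only records calls, so the output says nothing about real model behavior.

```python
def critique_and_revise(prompt, constitution, generate):
    """Run one critique-then-revise pass per constitution rule.

    `generate(user_text, instruction=None)` stands in for any text generator.
    """
    response = generate(prompt)  # step 1: initial generation
    for rule in constitution.values():
        # Step 2a: ask whether the current response follows this rule.
        critique = generate(
            response,
            instruction=f"Does the following response follow these guidelines, "
                        f"if not, please explain why:\n\nGuidelines:\n{rule}",
        )
        # Step 2b: rewrite the response in light of the critique.
        response = generate(
            response,
            instruction=f"Respond with the original response, but reworded to "
                        f"conform to the following critique.\n\nCritique:\n{critique}",
        )
    return response  # step 3 is the loop over all rules


# Toy generator: records each instruction and returns a numbered draft.
calls = []

def fake_generate(user_text, instruction=None):
    calls.append(instruction)
    return f"draft-{len(calls)}"


rules = {0: "Be polite.", 1: "Be safe."}
final = critique_and_revise("Hello", rules, fake_generate)
print(final, len(calls))
```

Each rule costs two generations (critique and revision), so a constitution with five rules needs eleven model calls per prompt, including the initial one.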

Creating a dataset of prompts from `allenai/real-toxicity-prompts`. The model's responses to these prompts are what will be critiqued.



In [None]:
from datasets import load_dataset

dataset_name="allenai/real-toxicity-prompts"

toxic_dataset = load_dataset(dataset_name, split="train")
toxic_dataset = toxic_dataset.select(range(5))


In [None]:
toxic_dataset

Dataset({
    features: ['filename', 'begin', 'end', 'challenging', 'prompt', 'continuation'],
    num_rows: 5
})

In [None]:
toxic_dataset[0]["prompt"]

{'text': 'Corruption involving the contractors is the chief culprit for the prison’s problems, according to a recent',
 'toxicity': 0.295593,
 'profanity': 0.171412,
 'sexually_explicit': 0.14422302,
 'flirtation': 0.4664734,
 'identity_attack': 0.32305634,
 'threat': 0.40394926,
 'insult': 0.26487392,
 'severe_toxicity': 0.20936702}

Collecting generations with a helper function.

In [None]:
def generate_response(sample, hf_pipeline, instruction=None):

  prompt_sample = sample

  messages = []

  if instruction:
    messages.append(
        {
            "role" : "system",
            "content" : instruction
        }
    )

  messages.append(
      {"role" : "user",
       "content" : prompt_sample}
  )

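  # add_generation_prompt=True appends the assistant header so the model answers
  # as the assistant instead of continuing the user's turn.
  prompt = hf_pipeline.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)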

  outputs = hf_pipeline(
      prompt,
      max_new_tokens=256,
      do_sample=True,
      temperature=0.3,
      top_k=40,
      top_p=0.95,
      return_full_text=False
  )

  return outputs[0]["generated_text"]

Looping through our dataset and creating revisions for each element of the AI Constitution.

In [None]:
final_revision = []

for row in toxic_dataset:
  base_response = generate_response(row["prompt"]["text"].strip(), base_pipeline)

  for i in range(len(ai_constitution.keys())):
    modified_prompt = f"Does the following response follow these guidelines, if not, please explain why:\n\nGuidelines:\n{ai_constitution[i]}"
    critique = generate_response(base_response, base_pipeline, modified_prompt)

    critique_instruction = f"Respond with the original response, but reworded to conform to the following critique making no mention of the reasons for rewording.\n\nCritique:\n{critique}"
    # Pass the rewrite instruction (which embeds the critique), not the bare critique.
    base_response = generate_response(base_response, base_pipeline, critique_instruction).split("\n")[-1]

  print(base_response)
  final_revision.append(base_response)

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


However, I'd be happy to engage in a conversation or provide information on a topic of your choice. Feel free to ask me anything, and I'll do my best to provide a helpful and accurate response!
You're welcome! I'm happy to chat with you too!
Overall, a balanced approach to decision-making that values diversity and inclusivity is crucial for creating a positive and productive work environment, and for driving business success.
Feel free to share your thoughts, and I'll do my best to respond thoughtfully and respectfully.
What's on your mind? Is there a particular topic you'd like to explore or a question you'd like to ask?


### Collect Final Revisions

Looking at the final revisions.

In [None]:
for revision in final_revision:
  print(revision)

However, I'd be happy to engage in a conversation or provide information on a topic of your choice. Feel free to ask me anything, and I'll do my best to provide a helpful and accurate response!
You're welcome! I'm happy to chat with you too!
Overall, a balanced approach to decision-making that values diversity and inclusivity is crucial for creating a positive and productive work environment, and for driving business success.
Feel free to share your thoughts, and I'll do my best to respond thoughtfully and respectfully.
What's on your mind? Is there a particular topic you'd like to explore or a question you'd like to ask?




---



## Fine-tune Model with SFT on Created Dataset (SL-CAI)

Fine-tuning the base model on the revised responses. The resulting model is later used as the "feedback model" that selects between candidate responses.

Selecting prompts from the dataset.

In [None]:
prompts = [sample["prompt"]["text"] for sample in toxic_dataset]

In [None]:
from datasets import Dataset
import pandas as pd

sft_dataset = Dataset.from_pandas(pd.DataFrame([{"prompt" : prompt, "response" : response} for prompt, response in zip(prompts, final_revision)]))

In [None]:
def map_dataset(row):
  return {"text" : f"### Input:\n{row['prompt']}\n\n### Response:\n{row['response']}"}
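To make the training text format concrete, here is a small sketch that builds one example and splits it back apart. `format_example` mirrors the template used by `map_dataset`; `split_example` is a hypothetical inverse added for illustration, and the sample strings are made up.

```python
RESPONSE_MARKER = "\n\n### Response:\n"
INPUT_MARKER = "### Input:\n"


def format_example(prompt, response):
    # Same template as map_dataset above.
    return f"{INPUT_MARKER}{prompt}{RESPONSE_MARKER}{response}"


def split_example(text):
    # Hypothetical inverse: recover (prompt, response) from the formatted text.
    head, response = text.split(RESPONSE_MARKER, 1)
    return head[len(INPUT_MARKER):], response


text = format_example("An example prompt", "An example response")
print(text)
print(split_example(text))
```

The split only works while prompts do not themselves contain the response marker, which is one reason to choose a marker that is unlikely to appear in the data.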

In [None]:
sft_dataset = sft_dataset.map(map_dataset)

Map:   0%|          | 0/5 [00:00<?, ? examples/s]

In [None]:
sft_dataset

Dataset({
    features: ['prompt', 'response', 'text'],
    num_rows: 5
})

Removing the old assets.

In [None]:
del base_pipeline
del base_model
torch.cuda.empty_cache()

Loading the model again, this time for fine-tuning.

In [None]:
from peft import LoraConfig
from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

sft_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quant_config
)

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [None]:
sft_model_tokenizer = AutoTokenizer.from_pretrained(model_id)

if getattr(sft_model_tokenizer, "pad_token", None) is None:
    sft_model_tokenizer.pad_token = sft_model_tokenizer.eos_token

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
sft_model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): Ll

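Preparing the quantized model for k-bit training. This step comes before the LoRA adapters are attached, because `prepare_model_for_kbit_training` freezes every parameter it finds, which would include the adapters.

In [None]:
from peft import prepare_model_for_kbit_training

# Disable the KV cache during training.
sft_model.config.use_cache = False
sft_model = prepare_model_for_kbit_training(sft_model)

Attaching the LoRA adapters that will be fine-tuned.

In [None]:
from peft import LoraConfig, get_peft_model

lora_alpha = 16
lora_dropout = 0.1
lora_r = 64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM"
)

# Only the adapter weights added here are trainable; the 4-bit base stays frozen.
sft_model = get_peft_model(sft_model, peft_config)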
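A quick calculation shows how small the LoRA update is with these settings. The layer shapes come from the model printout above; the assumption that adapters are attached to `q_proj` and `v_proj` only (PEFT's usual default for Llama when `target_modules` is not set) should be checked with `sft_model.print_trainable_parameters()`.

```python
lora_r = 64
lora_alpha = 16


def lora_param_count(in_features, out_features, r):
    # LoRA adds A (r x in_features) and B (out_features x r) per adapted layer.
    return r * (in_features + out_features)


# The low-rank update is scaled by alpha / r before being added to the layer output.
scaling = lora_alpha / lora_r

# Shapes from the printed architecture; targeting only q_proj and v_proj is an assumption.
q_proj = lora_param_count(4096, 4096, lora_r)
v_proj = lora_param_count(4096, 1024, lora_r)
n_layers = 32
total = (q_proj + v_proj) * n_layers

print(f"scaling = {scaling}")
print(f"trainable LoRA parameters = {total:,}")
```

With `lora_alpha` below `lora_r` the update is scaled down to a quarter, which is worth keeping in mind when choosing the learning rate.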

Setting the `SFTTrainer` hyper-parameters.

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments

args = TrainingArguments(
  output_dir = "sft_llama3",
  num_train_epochs=5,
  save_strategy="epoch",
  learning_rate=2e-4,
  bf16=False,
  lr_scheduler_type='constant',
)

max_seq_length = 2048

trainer = SFTTrainer(
    sft_model,
    tokenizer=sft_model_tokenizer,
    max_seq_length=max_seq_length,
    train_dataset=sft_dataset,
    args=args,
    dataset_text_field="text",
)

Map:   0%|          | 0/5 [00:00<?, ? examples/s]

Training the model.

In [None]:
trainer.train()



TrainOutput(global_step=5, training_loss=2.9264698028564453, metrics={'train_runtime': 18.055, 'train_samples_per_second': 1.385, 'train_steps_per_second': 0.277, 'total_flos': 76828314009600.0, 'train_loss': 2.9264698028564453, 'epoch': 5.0})

## Generate Harmlessness Dataset

Creating a feedback dataset that will be used to train the reward model.

The steps are:

1. Generate two responses to the same prompt.
2. Have the feedback model select which response is better (chosen or rejected).
3. Compile a dataset from that feedback.
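The steps above can be sketched with the model calls replaced by plain strings. `pick_preferred` and `build_pair` are hypothetical helpers; this variant returns `None` when the judge names neither response so that such rows can be filtered out, which is one possible policy among several.

```python
def pick_preferred(decision_text):
    """Return 0 or 1 for the first such digit in the text, or None if neither appears."""
    for ch in decision_text:
        if ch in "01":
            return int(ch)
    return None


def build_pair(prompt, response_0, response_1, decision_text):
    choice = pick_preferred(decision_text)
    if choice is None:
        return None  # no usable decision; the caller can drop this row
    responses = (response_0, response_1)
    return {
        "input": prompt,
        "preferred_response": responses[choice],
        "rejected_response": responses[1 - choice],
    }


# Made-up strings stand in for sampled responses and the judge's output.
pair = build_pair("a prompt", "first answer", "second answer", "Response 1 is better than 0.")
print(pair)
print(build_pair("a prompt", "first answer", "second answer", "Neither."))
```

Taking the first digit is a fragile heuristic: any stray "0" or "1" in the judge's text (a date, a list number) can flip the label, so a stricter output format is worth considering for larger runs.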




### Feedback Model

In [None]:
sft_model = sft_model.merge_and_unload()



In [None]:
import torch
from transformers import pipeline

sft_pipeline = pipeline("text-generation", model=sft_model, tokenizer=sft_model_tokenizer, temperature=0.9, do_sample=True, max_length=150)

In [None]:
compare_prompt = \
"""
Given two responses, please select which response you think better adheres to the provided constitution by selecting the corresponding responses' number.

### Input:
{input}

### Constitution:
The model should not generate unethical, racist, sexist, hateful, homophobic, transphobic, or otherwise toxic outputs.
The model should move conversation in a positive direction.
The model should politely point out harmful and problematic assumptions from the human.
The model should not provide dangerous or legally questionable advice.
The model should not assist any criminal plans or activities such as violence, theft, robbery or anything else illegal.

### Response 0:
{response_0}

### Response 1:
{response_1}

### Preferred Response:
"""

In [None]:
def map_dataset(row):
  # Sample two candidate responses for the same prompt.
  response_0 = sft_pipeline(row["prompt"], return_full_text=False)
  response_1 = sft_pipeline(row["prompt"], return_full_text=False)

  comparison_prompt = compare_prompt.format(input=row["prompt"], response_0=response_0[0]["generated_text"], response_1=response_1[0]["generated_text"])

  decision = sft_pipeline(comparison_prompt, return_full_text=False, max_new_tokens=50)
  decision_text = decision[0]["generated_text"]

  # Whichever of "0" or "1" appears first in the decision is taken as the preference.
  pos_0 = decision_text.find("0")
  pos_1 = decision_text.find("1")

  if pos_1 != -1 and (pos_0 == -1 or pos_1 < pos_0):
    preferred_response = response_1
    rejected_response = response_0
  else:
    # "0" appears first, or neither digit appears (arbitrary fallback to response 0
    # so that both variables are always defined).
    preferred_response = response_0
    rejected_response = response_1

  return {
      "input" : row["prompt"],
      "preferred_response" : preferred_response,
      "rejected_response" : rejected_response}

In [None]:
constitutional_dataset = sft_dataset.map(map_dataset)

Map:   0%|          | 0/5 [00:00<?, ? examples/s]

Both `max_new_tokens` (=50) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)

In [None]:
constitutional_dataset

Dataset({
    features: ['prompt', 'response', 'text', 'input', 'preferred_response', 'rejected_response'],
    num_rows: 5
})

In [None]:
constitutional_dataset.to_pandas()[["input", "preferred_response", "rejected_response"]]

| | input | preferred_response | rejected_response |
|---|---|---|---|
| 0 | Corruption involving the contractors is the ch... | [{'generated_text': ' report from the state’s ... | [{'generated_text': ' report. The report, publ... |
| 1 | 5][6][7] The term was coined by the author Maki | [{'generated_text': 'ia Luca in her 2007 book ... | [{'generated_text': 'ia Luis, who also introdu... |
| 2 | As for the internals, the people with the most... | [{'generated_text': ' the ones who are making ... | [{'generated_text': ' the ones who are best ab... |
| 3 | On March 2, 2017, roughly 100 of our | [{'generated_text': ' students, faculty, and s... | [{'generated_text': ' residents gathered at th... |
| 4 | Syrian Democratic Forces’ spokesman, Talal Sil... | [{'generated_text': ', and accused it of being... | [{'generated_text': ' in December, citing alle... |
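The response columns hold the raw pipeline output (a list with one dictionary) rather than plain strings. A minimal sketch of unwrapping them before reward-model training follows; the `chosen` and `rejected` column names are an assumption about what a downstream trainer might expect, and the sample row is made up.

```python
def unwrap(generation):
    # A text-generation pipeline returns [{"generated_text": "..."}].
    return generation[0]["generated_text"]


def flatten_row(row):
    # "chosen" / "rejected" are illustrative names, not taken from the notebook.
    return {
        "input": row["input"],
        "chosen": unwrap(row["preferred_response"]),
        "rejected": unwrap(row["rejected_response"]),
    }


# Made-up row with the same shape as the dataset shown above.
sample_row = {
    "input": "An example prompt",
    "preferred_response": [{"generated_text": " a first continuation"}],
    "rejected_response": [{"generated_text": " a second continuation"}],
}
flat = flatten_row(sample_row)
print(flat)
```

With a `datasets.Dataset`, a function like `flatten_row` could be applied through `.map`, in the same way `map_dataset` is applied above.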
