Step 1 - process conversation data and extract prompts & responses

In [None]:
import json

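def extract_prompt_response_pairs(input_file, output_file):
    """Write one JSON line per (user prompt, assistant response) pair."""
    with open(input_file, 'r', encoding='utf-8') as infile, \
         open(output_file, 'w', encoding='utf-8') as outfile:
        conversations = json.load(infile)
        for conversation in conversations:
            prompt = None  # Reset so a prompt never leaks into the next conversation
            for turn in conversation['turns']:
                if turn['speaker'] == 'user':
                    prompt = turn['text']
                elif turn['speaker'] == 'assistant' and prompt is not None:
                    # Pair the assistant reply with the most recent user turn
                    pair = {'prompt': prompt, 'response': turn['text']}
                    json.dump(pair, outfile, ensure_ascii=False)
                    outfile.write('\n')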

extract_prompt_response_pairs('conversations.json', 'prompt_response_pairs.jsonl')
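
The extraction above reads a list of conversations, each with a 'turns' list of speaker/text entries. The exact contents of conversations.json are not shown, so the sample below is made up for illustration, and pair_turns is a hypothetical in-memory version of the same pairing loop:

```python
import json

# Hypothetical sample in the shape the extraction code reads:
# a list of conversations, each with a 'turns' list of {'speaker', 'text'}.
sample_conversations = [
    {
        "turns": [
            {"speaker": "user", "text": "What is the capital of France?"},
            {"speaker": "assistant", "text": "The capital of France is Paris."},
            {"speaker": "user", "text": "And of Italy?"},
            {"speaker": "assistant", "text": "The capital of Italy is Rome."},
        ]
    }
]

def pair_turns(conversations):
    """Pair each assistant turn with the most recent user turn before it."""
    pairs = []
    for conversation in conversations:
        prompt = None
        for turn in conversation["turns"]:
            if turn["speaker"] == "user":
                prompt = turn["text"]
            elif turn["speaker"] == "assistant" and prompt is not None:
                pairs.append({"prompt": prompt, "response": turn["text"]})
    return pairs

sample_pairs = pair_turns(sample_conversations)
for pair in sample_pairs:
    print(json.dumps(pair))  # one JSON object per line, as in the .jsonl output
```

Each printed line corresponds to one record of the .jsonl file that later feeds the fine-tuning step.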


Next, I obtained the original Cohere Aya 23 8B model (CohereForAI/aya-23-8B) and loaded it in 8-bit precision with the help of the bitsandbytes library. Using the model's original precision would have led to out-of-memory errors, so I applied quantization to reduce the weights to 8-bit. After loading the model with reduced precision, I ensured that it was set to training mode.
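
To see why lower precision helps, here is a rough back-of-the-envelope estimate of the memory taken by the weights alone. It assumes a round count of 8 billion parameters (the exact figure for the model may differ) and ignores activations, gradients, and optimizer state, which add to the total during training:

```python
# Rough weight-memory estimate; 8e9 is an assumed round parameter count.
NUM_PARAMS = 8_000_000_000
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

def weight_memory_gib(num_params, bytes_per_param):
    """Memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3

for dtype, nbytes in BYTES_PER_PARAM.items():
    print(f"{dtype}: {weight_memory_gib(NUM_PARAMS, nbytes):.1f} GiB")
```

Under these assumptions the weights take roughly 30 GiB in float32, 15 GiB in float16, and 7.5 GiB in 8-bit, which leaves considerably more headroom on the GPU for training.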

In [None]:
%pip install datasets peft huggingface_hub
%pip install accelerate bitsandbytes

from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, pipeline
from datasets import load_dataset
from huggingface_hub import login
from peft import get_peft_model, LoraConfig, PeftType
import os
from google.colab import userdata


# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("CohereForAI/aya-23-8B")

# No need to add special tokens since they are already in the vocabulary

# Load the model with 8-bit precision using bitsandbytes
from transformers import BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "CohereForAI/aya-23-8B",
    device_map="auto",  # Automatically place layers on the available devices
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights to save memory
    offload_folder="./offload",  # Folder to store offloaded model parts
    torch_dtype="float16",  # Use 16-bit floats for the non-quantized modules
)

# The chat special tokens are already in the vocabulary, so the embeddings do not need resizing
# model.resize_token_embeddings(len(tokenizer))  # Only needed if new tokens are added

# Ensure the model is in training mode
model.train()

Next, I loaded my dataset from the finetuning_data.jsonl file, which contains the prompts and responses. Since we are fine-tuning a chat model, each example has to be structured with the special chat tokens that already exist in the vocabulary: the BOS token, the start-of-turn token, the user token, the chatbot token, and the end-of-turn token. During tokenization, I wrap the prompt in a user turn and the response in a chatbot turn, which helps the model learn to distinguish the prompt from its corresponding response. A dedicated tokenize function handles this. Additionally, I cap each sequence at 512 tokens to prevent memory issues, especially since we're using a single GPU with 40 GB of memory.
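
As a minimal sketch of that structure, the snippet below builds the text for one example before tokenization. The token strings are the ones used in this notebook; format_example and the sample prompt and response are hypothetical names and values chosen for illustration:

```python
# Special chat tokens as used in this notebook's template
BOS = "<BOS_TOKEN>"
START = "<|START_OF_TURN_TOKEN|>"
END = "<|END_OF_TURN_TOKEN|>"
USER = "<|USER_TOKEN|>"
CHATBOT = "<|CHATBOT_TOKEN|>"

def format_example(prompt, response):
    """Wrap the prompt in a user turn and the response in a chatbot turn."""
    user_turn = f"{START}{USER}{prompt}{END}"
    chatbot_turn = f"{START}{CHATBOT}{response}{END}"
    return f"{BOS}{user_turn}{chatbot_turn}"

example_text = format_example("Hello", "Hi there!")
print(example_text)
```

Every turn opens with the start-of-turn token followed by a role token and closes with the end-of-turn token, so the boundary between prompt and response is always explicit in the training text.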

In [None]:
# Load the dataset from a jsonl file
dataset = load_dataset("json", data_files="finetuning_data.jsonl")

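# Tokenize function using the existing special tokens in the vocabulary
def tokenize_function(examples):
    input_texts = [
        f"<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>{prompt}<|END_OF_TURN_TOKEN|>"
        f"<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>{response}<|END_OF_TURN_TOKEN|>"
        for prompt, response in zip(examples['prompt'], examples['response'])
    ]
    # The template already contains <BOS_TOKEN>, so do not let the tokenizer add special tokens again
    encoding = tokenizer(input_texts, padding="max_length", truncation=True, max_length=512,
                         add_special_tokens=False)
    # Use the input ids as labels, masking padding with -100 so the loss ignores it
    encoding["labels"] = [
        [token if mask == 1 else -100 for token, mask in zip(ids, attention)]
        for ids, attention in zip(encoding["input_ids"], encoding["attention_mask"])
    ]
    return encoding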

# Tokenize the entire dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["prompt", "response"])

After that, I used LoRA (Low-Rank Adaptation) for fine-tuning. LoRA freezes the original weights and trains small low-rank adapter matrices added to selected layers, which lets us achieve our fine-tuning goals without retraining the entire model and makes the process far less computationally intensive. I configured LoRA through the rank r, LoRA alpha, and LoRA dropout: I set r to 8, reduced LoRA alpha from 32 to 16, and set LoRA dropout to 0.1. Note that alpha only scales the adapter update (by alpha / r), so lowering it changes the strength of the update rather than memory use; the adapter's memory footprint is determined by r and the number of target modules. Once the LoRA configuration was defined, I applied it to the model with the get_peft_model function from the PEFT (parameter-efficient fine-tuning) library, which takes the model and the LoRA configuration as inputs.
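
To give a feel for the numbers, the sketch below counts the trainable parameters LoRA adds to a single weight matrix and computes the alpha / r scaling applied to the adapter update. The 4096 x 4096 layer shape is an assumption for illustration only, not a value read from the model:

```python
def lora_added_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix."""
    # LoRA learns A with shape (r, d_in) and B with shape (d_out, r)
    return r * d_in + d_out * r

def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update B @ A."""
    return lora_alpha / r

D = 4096  # assumed layer width, for illustration only
full_params = D * D
added_params = lora_added_params(D, D, r=8)

print(f"full matrix: {full_params:,} parameters")
print(f"LoRA adapter: {added_params:,} parameters "
      f"({100 * added_params / full_params:.2f}% of the full matrix)")
print(f"scaling with alpha=16, r=8: {lora_scaling(16, 8)}")
print(f"scaling with alpha=32, r=8: {lora_scaling(32, 8)}")
```

The adapter size depends only on r and the layer shape, while alpha changes how strongly the learned update is applied.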

In [None]:
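# Step 7: Set up LoRA configuration for PEFT
lora_config = LoraConfig(
    peft_type=PeftType.LORA,
    task_type="CAUSAL_LM",  # Wrap the model as a causal language model
    r=8,  # Rank of the low-rank update matrices
    lora_alpha=16,  # Scaling factor; the update is scaled by lora_alpha / r
    lora_dropout=0.1,  # Dropout applied inside the LoRA layers
    target_modules=["q_proj", "v_proj"],  # Adapt only the attention query and value projections
)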

# Step 8: Integrate the model with LoRA using PEFT
model = get_peft_model(model, lora_config)

# Optionally enable gradient checkpointing to trade extra compute for lower memory use
# model.gradient_checkpointing_enable()  # Left disabled; enable only if it does not cause conflicts

The next step was to define the training arguments. I set the maximum number of training steps to 100, as I noticed performance degradation beyond this point in earlier experiments. To minimize the risk of out-of-memory errors, I used a batch size of 1 and accumulated gradients over 16 steps, so the weights are updated once every 16 examples. While there are many other parameters available for tuning, these were the key ones for this particular setup.
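
The arithmetic behind these settings can be sketched as follows. It assumes a single GPU, as described earlier, and uses the 512-token sequence length from the tokenization step:

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
max_steps = 100   # number of optimizer updates
num_devices = 1   # assumed: a single GPU
max_length = 512  # tokens per padded sequence

# Examples contributing to each weight update
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
# Total examples processed over the whole run
sequences_seen = effective_batch_size * max_steps
# Upper bound on tokens per update, since every sequence is padded to max_length
tokens_per_update = effective_batch_size * max_length

print(f"effective batch size: {effective_batch_size}")
print(f"sequences seen in {max_steps} steps: {sequences_seen}")
print(f"tokens per update (including padding): {tokens_per_update}")
```

Only one sequence is held on the GPU at a time, yet each update still averages gradients over 16 examples.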

In [None]:
# Step 9: Define training arguments with max_steps set to 100
training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    max_steps=100,  # Limit training to 100 steps
    per_device_train_batch_size=1,  # Set the batch size
    gradient_accumulation_steps=16,  # To handle memory issues
    learning_rate=5e-5,
    warmup_steps=100,  # Equal to max_steps, so the learning rate is still warming up for the whole run
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    save_steps=50,  # Save checkpoint every 50 steps
    save_total_limit=2,  # Keep only the two most recent checkpoints
    eval_strategy="steps",
    eval_steps=50,  # Evaluate every 50 steps
    remove_unused_columns=False,
    fp16=False,  # Disable mixed precision to avoid conflicts
)

# Step 10: Initialize the Trainer with PEFT
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["train"],  # Note: evaluates on the training data; no held-out split is used
)

# Step 11: Fine-tune the model using PEFT and LoRA
trainer.train()