Finetuning is done with the Unsloth library, following this guide: https://huggingface.co/blog/mlabonne/sft-llama3

Unsloth models are also available for other families (Mistral, Gemma, ...). Example model page:
https://huggingface.co/unsloth/llama-3-8b-Instruct


Note:
Requires Python 3.11.
Can be run through Docker, which installs all requirements and prepares the environment. See the ReadMe for details.
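
When running outside Docker, the interpreter version can be checked up front. The snippet below is a minimal sketch; the helper name `python_version_ok` is hypothetical and it only compares the major and minor version against the 3.11 requirement stated in the note.

```python
import sys

REQUIRED = (3, 11)  # version stated in the note above


def python_version_ok(current=None, required=REQUIRED):
    """Return True when the major.minor version matches the required one."""
    if current is None:
        current = sys.version_info[:2]
    return tuple(current[:2]) == tuple(required)


if not python_version_ok():
    # Only warn: the notebook may still work, but it was not prepared for this version.
    print(
        f"Warning: running Python {sys.version_info.major}.{sys.version_info.minor}, "
        f"but this notebook expects {REQUIRED[0]}.{REQUIRED[1]}."
    )
```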


In [1]:
# If you are running this notebook outside the Docker setup, uncomment and run these lines
# %pip install torch==2.3.0+cu121 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# %pip install -r requirements.txt

The first line printed by the following code should be True. If it isn't, finetuning will fail because it cannot run on the GPU.

In [2]:
import torch
print(torch.cuda.is_available())
print(torch.version.cuda)
torch.cuda.empty_cache()

True
12.1
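
To stop early instead of discovering the problem during training, the check can be wrapped in a small helper. This is a sketch only; `cuda_status` is a hypothetical name, and the function reports a missing torch installation separately from a missing GPU.

```python
import importlib.util


def cuda_status():
    """Return 'torch-missing', 'no-cuda', or 'ok'."""
    if importlib.util.find_spec("torch") is None:
        return "torch-missing"
    import torch  # imported only after confirming the package is installed

    return "ok" if torch.cuda.is_available() else "no-cuda"


status = cuda_status()
if status != "ok":
    print(f"GPU check failed ({status}); finetuning in this notebook needs a CUDA GPU.")
```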


Set up all the names
Select the name of the model and the paths to the dataset and the outputs.

The model is downloaded from the Unsloth repository on HF. Select the -bnb-4bit model if you want to use QLoRA finetuning. Select the regular model for LoRA.

Set the load4bit flag to True if you want to use QLoRA finetuning, or False if you want to use LoRA.

In [3]:
modelName = "unsloth/Llama-3.2-3B-Instruct"                   #Instruct LoRA
#modelName = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit"           #Instruct QLoRA

inFilename = "../TrainingDatasets/reworked-training-questions-30.json"
modelOutputDir = "Models/llama3.2-instruct-lora-30"
modelLocalOutput = "Models/Output/Llama3.2-instruct-lora-30"
outFilename = "llama3.2-instruct-lora-30-stats.md"

load4bit = False   # LoRA
#load4bit = True   # QLoRA
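
Because the model name and the load4bit flag must agree, both can be derived from a single switch. The sketch below assumes the `-bnb-4bit` suffix convention seen in the two model names above; `select_model` and `BASE_MODEL` are hypothetical names, and other repositories may not follow the same naming.

```python
BASE_MODEL = "unsloth/Llama-3.2-3B-Instruct"
SUFFIX_4BIT = "-bnb-4bit"


def select_model(base_name, use_qlora):
    """Return (model name, load-in-4bit flag) that are consistent with each other."""
    has_suffix = base_name.endswith(SUFFIX_4BIT)
    if use_qlora:
        # QLoRA: pre-quantized checkpoint, loaded in 4 bit
        name = base_name if has_suffix else base_name + SUFFIX_4BIT
    else:
        # LoRA: regular checkpoint, not loaded in 4 bit
        name = base_name[: -len(SUFFIX_4BIT)] if has_suffix else base_name
    return name, use_qlora


modelName, load4bit = select_model(BASE_MODEL, use_qlora=False)
print(modelName, load4bit)
```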

In [4]:
import torch
from trl import SFTTrainer
from datasets import load_dataset
from transformers import TrainingArguments, TextStreamer
from unsloth.chat_templates import get_chat_template
from unsloth import FastLanguageModel, is_bfloat16_supported

max_seq_length = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=modelName,
    max_seq_length=max_seq_length,
    load_in_4bit=load4bit,
    dtype=None,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
#model.to(device)  # Only if using Lora, comment out otherwise

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"], 
    use_rslora=True,
    use_gradient_checkpointing="unsloth"
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2024.12.2: Fast Llama patching. Transformers:4.46.1.
   \\   /|    GPU: NVIDIA GeForce RTX 4090. Max memory: 23.643 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.3.0+cu121. CUDA: 8.9. CUDA Toolkit: 12.1. Triton: 2.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2024.12.2 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
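
The size of the LoRA adapter follows directly from the configuration above: each adapted linear layer of shape (d_in, d_out) gets r * (d_in + d_out) trainable weights. The sketch below assumes the published Llama-3.2-3B dimensions (hidden size 3072, MLP size 8192, 28 layers, key/value projection width 1024); these values are not stated in this notebook. Under those assumptions the result agrees with the trainable-parameter count Unsloth reports when training starts. It also shows the effect of use_rslora=True, which scales the update by alpha / sqrt(r) instead of alpha / r.

```python
import math

# Assumed Llama-3.2-3B dimensions (not taken from this notebook)
HIDDEN = 3072
INTERMEDIATE = 8192
NUM_LAYERS = 28
KV_DIM = 1024  # 8 key/value heads * head dimension 128

R = 16       # r from get_peft_model
ALPHA = 16   # lora_alpha from get_peft_model

# (input features, output features) of every targeted module in one layer
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(shapes, r, num_layers):
    """Count LoRA weights: matrices A (r x d_in) and B (d_out x r) per module."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


total = lora_params(SHAPES, R, NUM_LAYERS)
print(f"{total:,}")  # 24,313,856

standard_scale = ALPHA / R            # plain LoRA scaling: 1.0
rslora_scale = ALPHA / math.sqrt(R)   # rank-stabilized scaling: 4.0
print(standard_scale, rslora_scale)
```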


Load the training file and convert it to the ChatML template.
(Here you can choose between training files containing different numbers of questions.)

Example of the target format:

<|im_start|>user
Remove the spaces from the following sentence: It prevents users to suspect that there are some hidden products installed on theirs device.
<|im_end|>
<|im_start|>assistant
Itpreventsuserstosuspectthattherearesomehiddenproductsinstalledontheirsdevice.<|im_end|>
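
To make the format concrete, here is a plain-Python sketch that renders a conversation in this style. It is an illustration only: `to_chatml` is a hypothetical helper, the role names `human` and `chatbot` are taken from the mapping used in this notebook, and the template that the tokenizer actually applies may differ in whitespace details.

```python
# Dataset role names mapped to the names that appear in the rendered text
ROLE_NAMES = {"human": "user", "chatbot": "assistant", "system": "system"}


def to_chatml(messages):
    """Render a list of {'role', 'value'} messages as ChatML-style text."""
    parts = []
    for msg in messages:
        role = ROLE_NAMES.get(msg["role"], msg["role"])
        parts.append(f"<|im_start|>{role}\n{msg['value']}<|im_end|>\n")
    return "".join(parts)


# Illustrative conversation, not taken from the training data
example = [
    {"role": "human", "value": "Remove the spaces: a b c"},
    {"role": "chatbot", "value": "abc"},
]
print(to_chatml(example))
```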

In [5]:
tokenizer = get_chat_template(
    tokenizer,
    mapping={"role": "role", "content": "value", "user": "human", "assistant": "chatbot", "system" : "system"},
    chat_template="chatml",
)

# Transform the questions to the ChatML template
def apply_template(examples):
    messages = examples["conversations"]
    # Apply the chat template to each message in the conversation
    text = [tokenizer.apply_chat_template(
                [{"role": msg["role"], "content": msg["value"]} for msg in message], 
                tokenize=False, 
                add_generation_prompt=False) 
            for message in messages]
    return {"text": text}


dataset = load_dataset('json', data_files=inFilename, split="train")
dataset = dataset.shuffle()
dataset = dataset.map(apply_template, batched=True)

Unsloth: Will map <|im_end|> to EOS = <|eot_id|>.
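
The code above expects every record in the JSON file to have a `conversations` list whose messages carry `role` and `value` keys. The training file itself is not shown here, so the record below is a hypothetical example inferred from `apply_template` and the role mapping; `validate_record` is likewise a hypothetical helper for catching malformed records before training.

```python
import json

# Hypothetical record in the shape apply_template reads
record = {
    "conversations": [
        {"role": "human", "value": "What is LoRA?"},
        {"role": "chatbot", "value": "A parameter-efficient finetuning method."},
    ]
}

ALLOWED_ROLES = {"human", "chatbot", "system"}


def validate_record(rec):
    """Return a list of problems found in one record; an empty list means it looks fine."""
    conv = rec.get("conversations")
    if not isinstance(conv, list) or not conv:
        return ["'conversations' must be a non-empty list"]
    problems = []
    for i, msg in enumerate(conv):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("value"), str):
            problems.append(f"message {i}: 'value' must be a string")
    return problems


print(json.dumps(record, indent=2))
print(validate_record(record))
```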


Guide for finetuning parameters:
https://medium.com/@gobishangar11/llama-2-a-detailed-guide-to-fine-tuning-the-large-language-model-8968f77bcd15

In [6]:
trainer=SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=True,
    
    args=TrainingArguments(
        output_dir = modelOutputDir,
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=1,
        optim="paged_adamw_32bit",
        logging_steps = 1,
        learning_rate=3e-4,
        weight_decay=0.001,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        max_grad_norm=0.3,
        max_steps=-1,
        warmup_ratio=0.03,  # ignored by the "constant" scheduler; "constant_with_warmup" would apply it
        group_by_length=True,
        lr_scheduler_type="constant",
        report_to="tensorboard"
    ),
)

Save the current memory stats to a file and print them out

In [7]:
gpuStats = torch.cuda.get_device_properties(0)
startGpuMemory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
maxMemory = round(gpuStats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpuStats.name}. Max memory = {maxMemory} GB.")
print(f"{startGpuMemory} GB of memory reserved.")

# Start a fresh stats file; end each entry with a newline so entries stay on separate lines
with open(outFilename, "w") as statsFile:
    statsFile.write(f"GPU = {gpuStats.name}. Max memory = {maxMemory} GB.\n")
    statsFile.write(f"{startGpuMemory} GB of memory reserved.\n")

GPU = NVIDIA GeForce RTX 4090. Max memory = 23.643 GB.
6.539 GB of memory reserved.
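
The conversion used above divides the byte count by 1024 three times, so the reported values are binary gigabytes (GiB) even though they are labelled GB. A small sketch of that conversion, with the hypothetical helper name `bytes_to_gib`:

```python
def bytes_to_gib(num_bytes, digits=3):
    """Convert a byte count to GiB (1 GiB = 1024**3 bytes), rounded like the cell above."""
    return round(num_bytes / 1024**3, digits)


print(bytes_to_gib(24 * 1024**3))   # 24.0
print(bytes_to_gib(1536 * 1024**2)) # 1.5
```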


Finetune the model

In [8]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 5 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 1
\        /    Total batch size = 4 | Total steps = 6
 "-____-"     Number of trainable parameters = 24,313,856


Step,Training Loss
1,0.9977
2,0.7881
3,0.6763
4,0.6124
5,0.4816
6,0.365
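
The step count in the training banner can be reproduced by hand. The sketch below assumes that the last, partially filled batch of each epoch is kept and that the 5 examples reported are the sequences left after packing; `total_steps` is a hypothetical helper.

```python
import math


def total_steps(num_examples, per_device_batch, grad_accum, num_gpus, epochs):
    """Return (effective batch size, total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    # A partially filled final batch still counts as one step
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


# Values from the banner and the TrainingArguments above
batch, steps = total_steps(
    num_examples=5, per_device_batch=4, grad_accum=1, num_gpus=1, epochs=3
)
print(batch, steps)  # 4 6
```

With 5 examples and a batch size of 4, each epoch takes 2 steps, which gives the 6 loss values logged above.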


Save the final memory and time stats to a file and print them out

In [9]:
usedMemory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
usedMemoryForLora = round(usedMemory - startGpuMemory, 3)
usedPercentage = round(usedMemory / maxMemory * 100, 3)
loraPercentage = round(usedMemoryForLora / maxMemory * 100, 3)

print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {usedMemory} GB.")
print(f"Peak reserved memory for training = {usedMemoryForLora} GB.")
print(f"Peak reserved memory % of max memory = {usedPercentage} %.")
print(f"Peak reserved memory for training % of max memory = {loraPercentage} %.")


# Append ("a") so the stats written before training are not overwritten
with open(outFilename, "a") as statsFile:
    statsFile.write("\n")  # start on a new line, separate from the pre-training stats
    statsFile.write(f"{trainer_stats.metrics['train_runtime']} seconds used for training.\n")
    statsFile.write(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.\n")
    statsFile.write(f"Peak reserved memory = {usedMemory} GB.\n")
    statsFile.write(f"Peak reserved memory for training = {usedMemoryForLora} GB.\n")
    statsFile.write(f"Peak reserved memory % of max memory = {usedPercentage} %.\n")
    statsFile.write(f"Peak reserved memory for training % of max memory = {loraPercentage} %.\n")

8.6111 seconds used for training.
0.14 minutes used for training.
Peak reserved memory = 12.885 GB.
Peak reserved memory for training = 6.346 GB.
Peak reserved memory % of max memory = 54.498 %.
Peak reserved memory for training % of max memory = 26.841 %.
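
The figures above can be checked with the values printed in this notebook: 6.539 GB reserved before training, a peak of 12.885 GB, and 23.643 GB of total GPU memory. The following is a minimal sketch of the same arithmetic; `memory_report` is a hypothetical helper.

```python
def memory_report(start_gb, peak_gb, total_gb):
    """Derive the training-only memory and both percentages from three readings."""
    training_gb = round(peak_gb - start_gb, 3)
    return {
        "peak": peak_gb,
        "training": training_gb,
        "peak_pct": round(peak_gb / total_gb * 100, 3),
        "training_pct": round(training_gb / total_gb * 100, 3),
    }


report = memory_report(start_gb=6.539, peak_gb=12.885, total_gb=23.643)

# One entry per line, ready to be written to a stats file
text = "\n".join(f"{key} = {value}" for key, value in report.items())
print(text)
```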


Save the model and tokenizer

In [10]:
model.save_pretrained(modelLocalOutput)
tokenizer.save_pretrained(modelLocalOutput)

('Models/Output/Llama3.2-instruct-lora-30/tokenizer_config.json',
 'Models/Output/Llama3.2-instruct-lora-30/special_tokens_map.json',
 'Models/Output/Llama3.2-instruct-lora-30/tokenizer.json')