# Finetuning Llama-3.2-3B for Instruction Following and Converting It to a GGUF Model

### Set up Unsloth for training

In [1]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 8192 # Choose any value! Unsloth handles RoPE scaling internally.
dtype = None # None for auto-detection. Use float16 for Tesla T4 / V100, bfloat16 for Ampere or newer.
load_in_4bit = True # Use 4-bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use this if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2024.12.8: Fast Llama patching. Transformers: 4.46.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.24G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/121 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/459 [00:00<?, ?B/s]

Adding LoRA adapters for finetuning.

In [4]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32, # Any integer > 0. Suggested: 8, 16, 32, 64, 128. Very large values can cause overfitting.
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32, # Typically set equal to r, or to 2 * r
    lora_dropout = 0, # Supports any value, but 0 is optimized
    bias = "none",    # Supports any value, but "none" is optimized
    # "unsloth" uses 30% less VRAM and fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 1406,
    use_rslora = False,  # Rank-stabilized LoRA is also supported
    loftq_config = None, # And so is LoftQ
)

Unsloth 2024.12.8 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
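
To see where the number of trainable parameters comes from, here is a small back-of-the-envelope sketch. It assumes the layer shapes from the published Llama-3.2-3B config (hidden size 3072, MLP size 8192, 8 key/value heads of dimension 128, 28 layers); those shapes are not printed anywhere in this notebook, so treat them as an assumption. Each LoRA adapter adds two small matrices next to a frozen linear layer, so it costs `r * (in_features + out_features)` parameters.

```python
# Rough count of LoRA trainable parameters for the configuration above.
# Assumption: shapes come from the published Llama-3.2-3B config.
HIDDEN = 3072
INTERMEDIATE = 8192
KV_DIM = 8 * 128      # grouped-query attention: k_proj / v_proj are narrower
NUM_LAYERS = 28
RANK = 32             # r passed to get_peft_model

# (in_features, out_features) of each targeted linear layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    # LoRA adds A (rank x in) and B (out x rank) beside the frozen weight.
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total_trainable = per_layer * NUM_LAYERS
print(f"{total_trainable:,}")  # 48,627,712
```

Doubling `r` doubles this count, which is one reason larger ranks need more memory and are more prone to overfitting on small datasets.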


### Dataset prep
We use the Alpaca Indo Instruct dataset from [MBZUAI (Indonesian subset only)](https://huggingface.co/datasets/MBZUAI/Bactrian-X). The dataset combines alpaca-52k and dolly-15k, translated in full into Bahasa Indonesia. You can replace this code section with your own data prep.

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to append the **EOS_TOKEN** to every training text! Otherwise the model never learns when to stop and you'll get infinite generations!

In [5]:
alpaca_prompt = """Di bawah ini adalah instruksi yang menjelaskan tugas, dipasangkan dengan masukan yang memberikan konteks lebih lanjut. Tulis tanggapan yang melengkapi instruksi dengan tepat.

### Instruksi:
{}

### Masukan:
{}

### Tanggapan:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN

def formatting_prompts_func(examples):
    # With batched = True, each column arrives as a list of values
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

from datasets import load_dataset
dataset = load_dataset("/content/drive/MyDrive/finetuning/datasets", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/67017 [00:00<?, ? examples/s]
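
If you want to check the formatting logic without loading a tokenizer or the full dataset, a toy version like the one below may help. It is only a sketch: the `EOS_TOKEN` string is a stand-in for `tokenizer.eos_token`, the shortened English template and the two-row batch are made up for illustration, and `format_batch` is a hypothetical name mirroring `formatting_prompts_func`.

```python
# Stand-in for tokenizer.eos_token; the real value comes from the tokenizer.
EOS_TOKEN = "<|end_of_text|>"

# Shortened English template, used only for this illustration.
demo_prompt = """### Instruction:
{}

### Input:
{}

### Response:
{}"""

def format_batch(examples):
    # With batched=True, `examples` maps column names to lists of values.
    texts = [
        demo_prompt.format(instruction, input_text, output) + EOS_TOKEN
        for instruction, input_text, output in zip(
            examples["instruction"], examples["input"], examples["output"]
        )
    ]
    return {"text": texts}

# Made-up rows in the same column layout as the dataset.
toy_batch = {
    "instruction": ["Add the numbers.", "Name a color."],
    "input": ["2 and 3", ""],
    "output": ["5", "Blue"],
}
formatted = format_batch(toy_batch)
print(formatted["text"][0])
```

Every formatted text should end with the EOS token, and rows with an empty input simply leave that slot blank.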

### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). For a full run, set `num_train_epochs = 1` and disable `max_steps` (comment it out, or set `max_steps = -1`, the default). Here we cap training at `max_steps = 256` to keep it fast: one full epoch over the 67,017 examples at an effective batch size of 8 is roughly 8,400 steps, which needs 5+ hours on a T4 GPU.

In [6]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Packing can make training up to 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2, # Increase this to use more of your GPU memory; larger batches tend to make training more stable.
        gradient_accumulation_steps = 4, # Increase this little by little to smooth the training loss curve.
        warmup_steps = 5,
        #num_train_epochs = 1, # Set this to 1 for a full training run.
        max_steps = 256,
        learning_rate = 2e-4, # A lower value trains more slowly but may give a better final result.
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 1406,
        output_dir = "outputs",
        report_to = "none", # Set to "wandb" to log to Weights & Biases
        #run_name = "your_project_name_in_wandb" # (Optional)
    ),
)

Map (num_proc=2):   0%|          | 0/67017 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
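
To get a feel for what `warmup_steps = 5` and `lr_scheduler_type = "linear"` do to the learning rate over the 256 steps, here is a minimal sketch. It assumes the usual warmup-then-linear-decay rule; `linear_schedule_lr` is a hypothetical helper written for this note, not a function from TRL or Transformers.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=256):
    """Learning rate after `step` optimizer updates (linear warmup, linear decay)."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Then decay linearly so the rate reaches 0 at the final step.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 1, 5, 128, 256):
    print(step, f"{linear_schedule_lr(step):.2e}")
```

The peak rate of `2e-4` is only reached at step 5; after that it shrinks steadily, so the last steps make much smaller updates than the first ones.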


In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
2.635 GB of memory reserved.


In [7]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 67,017 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 256
 "-____-"     Number of trainable parameters = 48,627,712


| Step | Training Loss |
|------|---------------|
| 1 | 1.6252 |
| 2 | 1.7256 |
| 3 | 1.5674 |
| 4 | 1.3728 |
| 5 | 1.5202 |
| 6 | 1.3277 |
| 7 | 1.5285 |
| 8 | 1.2433 |
| 9 | 1.1422 |
| 10 | 1.0987 |
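
The training banner reports a total batch size of 8 and 256 total steps. The quick arithmetic below shows how much of the dataset that covers. It uses only numbers printed in this notebook; the exact step count the trainer reports for a full epoch may differ by one because of rounding.

```python
import math

num_examples = 67_017             # "Num examples" in the training banner
per_device_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1
max_steps = 256

# Examples consumed per optimizer update.
effective_batch = per_device_batch_size * gradient_accumulation_steps * num_gpus
# Optimizer updates needed to see every example once.
steps_per_epoch = math.ceil(num_examples / effective_batch)
# How far the short run gets.
examples_seen = max_steps * effective_batch
epoch_fraction = examples_seen / num_examples

print(f"effective batch size: {effective_batch}")
print(f"steps for one epoch:  {steps_per_epoch}")
print(f"examples seen:        {examples_seen} ({epoch_fraction:.1%} of one epoch)")
```

So the 256-step run sees only about 3% of the data, which is worth keeping in mind when judging the quality of the outputs below.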


In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

### Inference Test
Let's run and test the model! You can change the instruction and input - leave the output blank!

In [9]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable unsloth native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Berikan tiga tips untuk tetap sehat.", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 1024, use_cache = True)
tokenizer.batch_decode(outputs)

['<|begin_of_text|>Di bawah ini adalah instruksi yang menjelaskan tugas, dipasangkan dengan masukan yang memberikan konteks lebih lanjut. Tulis tanggapan yang melengkapi instruksi dengan tepat.\n\n### Instruksi:\nBerikan tiga tips untuk tetap sehat.\n\n### Masukan:\n\n\n### Tanggapan:\n1. Makan sehat dan teratur: Makanan yang sehat dan teratur dapat membantu menjaga kesehatan tubuh dan menjaga berat badan yang sehat. Makanan yang sehat harus mengandung banyak sayuran, buah-buahan, protein, dan karbohidrat yang sehat.\n\n2. Olahraga secara teratur: Olahraga secara teratur dapat membantu menjaga kesehatan tubuh dan meningkatkan kesehatan jantung. Olahraga yang tepat dapat membantu meningkatkan kesehatan otot, meningkatkan kesehatan jantung, dan meningkatkan kesehatan umum.\n\n3. Tidur cukup: Tidur cukup dapat membantu menjaga kesehatan tubuh dan meningkatkan kesehatan umum. Tidur cukup dapat membantu meningkatkan kesehatan otak, meningkatkan kesehatan jantung, dan meningkatkan kesehatan 

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!

In [10]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable unsloth native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Berikan tiga tips untuk tetap sehat.", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 1024)

<|begin_of_text|>Di bawah ini adalah instruksi yang menjelaskan tugas, dipasangkan dengan masukan yang memberikan konteks lebih lanjut. Tulis tanggapan yang melengkapi instruksi dengan tepat.

### Instruksi:
Berikan tiga tips untuk tetap sehat.

### Masukan:


### Tanggapan:
1. Makan sehat dan teratur: Makanan yang sehat dan teratur dapat membantu menjaga kesehatan tubuh dan menjaga berat badan yang sehat. Makanan yang sehat harus mengandung banyak sayuran, buah-buahan, protein, dan karbohidrat yang sehat.

2. Olahraga secara teratur: Olahraga secara teratur dapat membantu menjaga kesehatan tubuh dan meningkatkan kesehatan jantung. Olahraga yang tepat dapat membantu meningkatkan kesehatan otot, meningkatkan kesehatan jantung, dan meningkatkan kesehatan umum.

3. Tidur cukup: Tidur cukup dapat membantu menjaga kesehatan tubuh dan meningkatkan kesehatan umum. Tidur cukup dapat membantu meningkatkan kesehatan otak, meningkatkan kesehatan jantung, dan meningkatkan kesehatan umum.<|end_of_t
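
The decoded output above still contains the whole prompt and the special tokens. If you only want the answer, a small string helper like this sketch can strip the rest. `extract_response` is a hypothetical helper, and the sample string is a shortened, made-up stand-in for a real generation.

```python
def extract_response(decoded, marker="### Tanggapan:",
                     special_tokens=("<|begin_of_text|>", "<|end_of_text|>")):
    """Return only the text that follows the first response marker."""
    _, found, response = decoded.partition(marker)
    if not found:
        # No marker in the text: fall back to the full string.
        response = decoded
    for token in special_tokens:
        response = response.replace(token, "")
    return response.strip()

# Shortened stand-in for a decoded generation.
sample = (
    "<|begin_of_text|>### Instruksi:\nBerikan tiga tips untuk tetap sehat.\n\n"
    "### Masukan:\n\n\n### Tanggapan:\n1. Tidur cukup.<|end_of_text|>"
)
extracted = extract_response(sample)
print(extracted)
```

With the real model you would call it as `extract_response(tokenizer.batch_decode(outputs)[0])`. Passing `skip_special_tokens = True` to `batch_decode` is another way to drop the special tokens.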

### Saving and loading finetuned models
To save the final model as LoRA adapters, use either Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [11]:
model.save_pretrained("/content/drive/MyDrive/finetuning/LoRA") # Local saving
tokenizer.save_pretrained("/content/drive/MyDrive/finetuning/LoRA")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('/content/drive/MyDrive/finetuning/LoRA/tokenizer_config.json',
 '/content/drive/MyDrive/finetuning/LoRA/special_tokens_map.json',
 '/content/drive/MyDrive/finetuning/LoRA/tokenizer.json')

To load the LoRA adapters we just saved for inference, leave the `if True:` guard below as it is. Change it to `False` to skip reloading and keep using the model already in memory:

In [12]:
if True:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "/content/drive/MyDrive/finetuning/LoRA", # Your LoRA model
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# alpaca_prompt = You MUST copy from above!

inputs = tokenizer(
[
    alpaca_prompt.format(
        "Berikan tiga tips agar tetap sehat.", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 1024)

==((====))==  Unsloth 2024.12.8: Fast Llama patching. Transformers: 4.46.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
<|begin_of_text|>Di bawah ini adalah instruksi yang menjelaskan tugas, dipasangkan dengan masukan yang memberikan konteks lebih lanjut. Tulis tanggapan yang melengkapi instruksi dengan tepat.

### Instruksi:
Berikan tiga tips agar tetap sehat.

### Masukan:


### Tanggapan:
1. Makan seimbang dan sehat: Makanan yang seimbang dan sehat dapat membantu menjaga kesehatan tubuh dan menjaga berat badan yang sehat. Makanan yang sehat harus mengandung protein, karbohidrat, dan lemak yang seimbang.

2. Berolahraga secara teratur: Berolahraga secara 

#### Converting to GGUF and saving locally

Unsloth supports saving to GGUF natively. It clones llama.cpp, converts the model, and saves the quantized result. Unsloth accepts quantization methods such as `q4_k_m`. Some recommended quants are:
1. Q4_K_M
2. Q5_K_M
3. Q8_0

You can see the other supported quants on the [Unsloth GitHub page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options).
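
To give an intuition for what a quant like Q8_0 does, here is a simplified sketch of block-wise 8-bit quantization: weights are grouped into blocks of 32, and each block stores one scale plus 32 small integer codes. This is an illustration of the idea only, not llama.cpp's actual code; the function names and the toy weights are made up for this note.

```python
BLOCK_SIZE = 32  # Q8_0 groups weights into blocks of 32

def quantize_block(values):
    """Quantize one block to int8-range codes plus a single scale."""
    assert len(values) == BLOCK_SIZE
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude in the block to code 127.
    scale = max_abs / 127 if max_abs > 0 else 0.0
    codes = [round(v / scale) if scale else 0 for v in values]
    return scale, codes

def dequantize_block(scale, codes):
    """Recover approximate weights from the scale and codes."""
    return [scale * c for c in codes]

# Storage cost: 32 one-byte codes + one 2-byte (fp16) scale per block.
bits_per_weight = (BLOCK_SIZE * 1 + 2) * 8 / BLOCK_SIZE

# Toy weights, made up for the demo.
weights = [((i * 37) % 64 - 32) / 100 for i in range(BLOCK_SIZE)]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(f"bits per weight: {bits_per_weight}")
print(f"max round-trip error: {max_error:.5f} (scale = {scale:.5f})")
```

The round-trip error is at most half a scale step per weight. The 4-bit and 5-bit K-quants trade a larger error for smaller files.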

In [13]:
# Choose one of these

#model.save_pretrained_gguf("/content/drive/MyDrive/ai-portfolio/project3", tokenizer, quantization_method = "q4_k_m")
model.save_pretrained_gguf("/content/drive/MyDrive/finetuning/LLM", tokenizer, quantization_method = "q8_0")
#model.save_pretrained_gguf("/content/drive/MyDrive/ai-portfolio/project3", tokenizer, quantization_method = "f16")
# To save several GGUF quantizations in one call, pass a list:
#model.save_pretrained_gguf("/content/drive/MyDrive/ai-portfolio/project3", tokenizer, quantization_method = ["q4_k_m", "q8_0", "q5_k_m",])

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 2.2G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 4.18 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:01<00:00, 19.61it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving /content/drive/MyDrive/finetuning/LLM/pytorch_model-00001-of-00002.bin...
Unsloth: Saving /content/drive/MyDrive/finetuning/LLM/pytorch_model-00002-of-00002.bin...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q8_0'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: CMAKE detected. Finalizing some steps for installation.
Unsloth: [1] Converting model at /content/drive/MyDrive/finetuning/LLM into q8_0 GGUF format.
The output location will be /content/drive/MyDrive/finetuning/LLM/unsloth.Q8_0.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: LLM
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {64}
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf:
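
Once the conversion finishes, a quick sanity check is to read the first bytes of the output file. The sketch below assumes the GGUF v2+ header layout (4-byte magic `GGUF`, then a little-endian uint32 version, uint64 tensor count, and uint64 metadata key/value count). `read_gguf_header` is a hypothetical helper, and the demo runs on a synthetic header with made-up counts rather than a real model file.

```python
import os
import struct
import tempfile

def read_gguf_header(path):
    """Read the fixed-size header at the start of a GGUF file."""
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata KV count
    version, tensor_count, metadata_kv_count = struct.unpack("<IQQ", header[4:])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }

# Demo on a synthetic 24-byte header (the counts are made up).
with tempfile.NamedTemporaryFile(suffix=".gguf", delete=False) as tmp:
    tmp.write(b"GGUF" + struct.pack("<IQQ", 3, 255, 30))
    demo_path = tmp.name
demo_header = read_gguf_header(demo_path)
os.remove(demo_path)
print(demo_header)
```

On Colab you would point it at the file from the log, for example `read_gguf_header("/content/drive/MyDrive/finetuning/LLM/unsloth.Q8_0.gguf")`.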

#### Converting to GGUF, saving, and automatically pushing to Hugging Face

In [None]:
# Use this if you want to push to Hugging Face automatically

# Save several GGUF quantizations in one call - much faster if you want multiple!
#model.push_to_hub_gguf(
#    "hf/model", # Change hf to your username!
#    tokenizer,
#    quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
#    token = "", # Get a token at https://huggingface.co/settings/tokens
#)
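
Rather than pasting the token into the notebook, you may prefer to read it from an environment variable. This is only a sketch: `get_hf_token` is a hypothetical helper, and it assumes you have stored the token under `HF_TOKEN` (for example via Colab secrets or `os.environ`).

```python
import os

def get_hf_token(env_var="HF_TOKEN"):
    """Return the Hugging Face token stored in an environment variable."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face token."
        )
    return token

# Example use with the cell above (left commented out, like the original):
# model.push_to_hub_gguf(
#     "hf/model", # Change hf to your username!
#     tokenizer,
#     quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
#     token = get_hf_token(),
# )
```

This keeps the token out of the saved notebook, which matters if you share the file or push it to a public repository.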