## Using Unsloth to finetune

## Install Prerequisite Packages

In [7]:
# This is necessary for colab
!pip install python-dotenv
!pip install datasets
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers "trl<0.9.0" peft accelerate bitsandbytes

Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-80vjd64x/unsloth_6dc3d59283704b839fa06da873d3a1a8
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-80vjd64x/unsloth_6dc3d59283704b839fa06da873d3a1a8
  Resolved https://github.com/unslothai/unsloth.git to commit 9ca13b836f647e67d6e9ca8bb712403ffaadd607
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


## Load `.env`

In [8]:
import os
import sys

from datasets import Dataset

from dotenv import find_dotenv, load_dotenv

load_dotenv()

True

## Important Global Parameters

In [9]:
FINETUNING_DATASET_NAME="test_dataset_2024OCT24"
OUTPUT_MODEL_NAME="finetuned_model_2024OCT24"
BASE_MODEL_NAME="unsloth/Llama-3.2-3B-Instruct"

## API Keys

In [10]:
# Could also insert the token here directly
HF_TOKEN = os.getenv("HUGGINGFACE_API_KEY")
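
`os.getenv` returns `None` without complaint when the variable is missing, and the problem only surfaces later as an authentication error. A minimal sketch of a fail-fast check follows; the helper name `require_env` is hypothetical and not part of the original notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; "
            "add it to your .env file or export it before running."
        )
    return value


# Hypothetical usage, mirroring the variable name used in this notebook:
# HF_TOKEN = require_env("HUGGINGFACE_API_KEY")
```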

The cells below are adapted from the Unsloth example notebooks for finetuning.

In [11]:
max_seq_length = 16000 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.


In [12]:
from unsloth import FastLanguageModel
import torch
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = BASE_MODEL_NAME, # "unsloth/Llama-3.2-3B-Instruct"; or choose "unsloth/Llama-3.2-1B-Instruct"
    # model_name = "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth 2024.10.7: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.0+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


generation_config.json:   0%|          | 0.00/184 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

Unsloth: We fixed a gradient accumulation bug, but it seems like you don't have the latest transformers version!
Please update transformers, TRL and unsloth via:
`pip install --upgrade --no-cache-dir unsloth git+https://github.com/huggingface/transformers.git git+https://github.com/huggingface/trl.git`


In [13]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 128, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.10.7 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
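
As a sanity check on the adapter size, the number of LoRA parameters can be worked out by hand. This sketch assumes the published Llama-3.2-3B shapes (hidden size 3072, MLP size 8192, 8 key/value heads of dimension 128, 28 layers); those values are not stated in this notebook, so treat them as assumptions. Each adapted linear layer of shape `in -> out` adds `r * (in + out)` parameters. The second helper compares the standard LoRA scaling with the rank-stabilized variant enabled by `use_rslora = True`.

```python
import math

# Assumed Llama-3.2-3B dimensions (not printed anywhere in this notebook).
HIDDEN = 3072
INTERMEDIATE = 8192
HEAD_DIM = 128
NUM_KV_HEADS = 8
NUM_LAYERS = 28
KV_DIM = NUM_KV_HEADS * HEAD_DIM  # 1024

# (in_features, out_features) of each module listed in target_modules.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(r: int) -> int:
    """LoRA adds a matrix A (r x in) and a matrix B (out x r) per adapted layer."""
    per_layer = sum(r * (n_in + n_out) for n_in, n_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS


def lora_scaling(alpha: float, r: int, use_rslora: bool) -> float:
    """Standard LoRA scales the update by alpha / r; rsLoRA uses alpha / sqrt(r)."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r


print(f"{lora_param_count(128):,}")
print(lora_scaling(16, 128, use_rslora=True))
print(lora_scaling(16, 128, use_rslora=False))
```

With `r = 128` the count comes to 194,510,848, which matches the trainable-parameter figure Unsloth prints when training starts. With `lora_alpha = 16`, rank-stabilized scaling gives 16/√128 ≈ 1.41 instead of 16/128 = 0.125, so the adapter update is weighted considerably more heavily than the same settings would imply under standard LoRA.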


In [14]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }

from datasets import load_dataset


## Get dataset

In [15]:
dataset_finetune = load_dataset(
    "CPSC532/arxiv_qa_data",
    name=FINETUNING_DATASET_NAME,
    split="train",
    token=HF_TOKEN
)

README.md:   0%|          | 0.00/1.91k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/191k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/209 [00:00<?, ? examples/s]

In [16]:
dataset_finetune

Dataset({
    features: ['question', 'answer', 'source'],
    num_rows: 209
})

In [17]:

dataset_finetune['question'][0]

"What is the main contribution of the paper 'Taming Transformers for High-Resolution Image Synthesis' in the context of image synthesis?"

In [18]:

dataset_finetune['answer'][0]

'The main contribution of the paper "Taming Transformers for High-Resolution Image Synthesis" is the development of a novel approach that combines the strengths of convolutional neural networks (CNNs) and transformers to enable the synthesis of high-resolution images, specifically in the megapixel range. The authors propose a two-stage model where:\n\n1. **Learning a Context-Rich Codebook**: They first utilize a convolutional model to learn a discrete codebook of context-rich visual parts. This step leverages the inductive biases of CNNs, which are effective at capturing local structures in images. The learned codebook allows for a more efficient representation of images, reducing the complexity associated with directly modeling high-resolution images pixel by pixel.\n\n2. **Modeling Global Compositions with Transformers**: In the second stage, the authors employ a transformer architecture to model the global compositions of these visual parts. The transformer is adept at capturing lon

## Convert dataset to messages format

In [19]:
def convert_to_messages_format(example):
    return [
        {"role": "user", "content": example['question']},
        {"role": "assistant", "content": example['answer']},
    ]

In [20]:
dataset_finetune = dataset_finetune.map(
    lambda x: {
        'conversations' : convert_to_messages_format(x)
        }
)

Map:   0%|          | 0/209 [00:00<?, ? examples/s]

In [21]:
dataset_finetune = dataset_finetune.map(formatting_prompts_func, batched = True,)

Map:   0%|          | 0/209 [00:00<?, ? examples/s]

In [22]:
dataset_finetune['text'][0]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is the main contribution of the paper \'Taming Transformers for High-Resolution Image Synthesis\' in the context of image synthesis?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThe main contribution of the paper "Taming Transformers for High-Resolution Image Synthesis" is the development of a novel approach that combines the strengths of convolutional neural networks (CNNs) and transformers to enable the synthesis of high-resolution images, specifically in the megapixel range. The authors propose a two-stage model where:\n\n1. **Learning a Context-Rich Codebook**: They first utilize a convolutional model to learn a discrete codebook of context-rich visual parts. This step leverages the inductive biases of CNNs, which are effective at capturing local structures in images. The learned co
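
To make the structure of the rendered string easier to see, here is a simplified hand-written sketch of the layout shown above. It is not the tokenizer's actual template: the function name `render_llama3_chat` is hypothetical, the toy messages are invented, and the default system block (the knowledge cutoff and date lines) is omitted.

```python
def render_llama3_chat(messages, add_generation_prompt=False):
    """Approximate the Llama 3 chat layout: header, blank line, content, <|eot_id|>."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # At inference time the prompt ends with an open assistant header.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


# Toy conversation in the same shape as the 'conversations' column.
example = [
    {"role": "user", "content": "What is LoRA?"},
    {"role": "assistant", "content": "A low-rank adapter method."},
]
print(render_llama3_chat(example))
```

For training, `add_generation_prompt` stays `False` because the assistant turn is already part of the text; at inference it is set to `True` so the model continues from an open assistant header.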

## Set Training Parameters

In [23]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset_finetune,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 1,  # Affects memory usage
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2, # Affects memory usage
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 20, # Number of full passes over the dataset.
        # max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none"
    ),
)

Map:   0%|          | 0/209 [00:00<?, ? examples/s]

We also use Unsloth's `train_on_responses_only` function so that the loss is computed only on the assistant outputs and the user's inputs are ignored. TODO: look into how this masking works.

In [24]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map:   0%|          | 0/209 [00:00<?, ? examples/s]

In [25]:
tokenizer.decode(trainer.train_dataset[0]["input_ids"])

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is the main contribution of the paper \'Taming Transformers for High-Resolution Image Synthesis\' in the context of image synthesis?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThe main contribution of the paper "Taming Transformers for High-Resolution Image Synthesis" is the development of a novel approach that combines the strengths of convolutional neural networks (CNNs) and transformers to enable the synthesis of high-resolution images, specifically in the megapixel range. The authors propose a two-stage model where:\n\n1. **Learning a Context-Rich Codebook**: They first utilize a convolutional model to learn a discrete codebook of context-rich visual parts. This step leverages the inductive biases of CNNs, which are effective at capturing local structures in images. The learned co

In [26]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                                                 \n\nModeling long-range interactions in high-resolution images is crucial for several reasons, as highlighted in the paper "Taming Transformers for High-Resolution Image Synthesis":\n\n1. **Global Composition Understanding**: High-resolution images consist of complex structures and patterns that span across large areas. To generate images that are not only locally realistic but also globally consistent, it is essential to understand how different parts of the image relate to one another. The paper emphasizes that high-resolution image synthesis requires a model that comprehends the global composition of images, enabling it to generate coherent and contextually appropriate visuals.\n\n2. **Expressivity of Transformers**: Transformers are designed to learn complex relationships among inputs without a built-in bias towards locality, which allows them to capture long-range dependencies effectively. This expressivity is parti
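
The leading run of spaces in the output above corresponds to labels set to -100, the index that PyTorch's cross-entropy loss ignores by default. The sketch below illustrates the idea on toy token IDs. It is not Unsloth's implementation: `mask_before_response` and the ID values are made up for illustration, and only a single assistant turn is handled.

```python
IGNORE_INDEX = -100  # label value skipped by the loss


def find_subsequence(sequence, pattern):
    """Return the start index of the first occurrence of pattern, or -1."""
    for start in range(len(sequence) - len(pattern) + 1):
        if sequence[start:start + len(pattern)] == pattern:
            return start
    return -1


def mask_before_response(input_ids, response_marker_ids):
    """Copy input_ids to labels, masking everything up to and including the marker."""
    start = find_subsequence(input_ids, response_marker_ids)
    if start == -1:
        # No assistant turn found: nothing to train on.
        return [IGNORE_INDEX] * len(input_ids)
    response_start = start + len(response_marker_ids)
    return [IGNORE_INDEX] * response_start + input_ids[response_start:]


# Toy IDs: 1 = begin, [7, 8] = user header, [7, 9] = assistant header, 2 = end of turn.
input_ids = [1, 7, 8, 40, 41, 2, 7, 9, 50, 51, 52, 2]
labels = mask_before_response(input_ids, response_marker_ids=[7, 9])
print(labels)
```

Only the last four positions keep their token IDs, so the gradient comes from the assistant's answer and its end-of-turn token alone.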

In [27]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA A100-SXM4-40GB. Max memory = 39.564 GB.
4.943 GB of memory reserved.


## Train

In [28]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 209 | Num Epochs = 20
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 520
 "-____-"     Number of trainable parameters = 194,510,848


**** Unsloth: Please use our fixed gradient_accumulation_steps by updating transformers, TRL and Unsloth!
`pip install --upgrade --no-cache-dir unsloth git+https://github.com/huggingface/transformers.git git+https://github.com/huggingface/trl.git`


Step,Training Loss
1,1.4125
2,1.629
3,1.327
4,1.4126
5,1.1802
6,1.2155
7,1.1216
8,1.1726
9,1.1852
10,1.1573
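
The total of 520 steps in the banner follows from the dataset size and the batch settings. The sketch below reproduces the arithmetic under the assumption that only full gradient-accumulation groups are counted in each epoch, which appears to be how the Transformers version shown here (4.44.2) computes the total; other versions may round differently. The function name is hypothetical.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch_size,
                          gradient_accumulation_steps, num_epochs, num_devices=1):
    """Estimate optimizer steps, counting only full accumulation groups per epoch."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    steps_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
    return steps_per_epoch * num_epochs


print(total_optimizer_steps(209, 2, 4, 20))
```

Here 209 examples at batch size 2 give 105 batches per epoch, 105 // 4 gives 26 optimizer steps per epoch, and 26 × 20 epochs gives 520. The effective batch size per step is 2 × 4 = 8.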


In [29]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

821.0797 seconds used for training.
13.68 minutes used for training.
Peak reserved memory = 7.26 GB.
Peak reserved memory for training = 2.317 GB.
Peak reserved memory % of max memory = 18.35 %.
Peak reserved memory for training % of max memory = 5.856 %.


## Run Inference

In [30]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

def get_response(user_query):
    messages = [
    {"role": "user", "content": user_query},
    ]
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize = True,
        add_generation_prompt = True, # Must add for generation
        return_tensors = "pt",
    ).to("cuda")

    outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                            temperature = 1.5, min_p = 0.1)
    return tokenizer.batch_decode(outputs)

In [31]:
dataset_finetune['question'][0]

"What is the main contribution of the paper 'Taming Transformers for High-Resolution Image Synthesis' in the context of image synthesis?"

TODO: investigate how rephrasing the question affects the response. The query below is taken verbatim from the training set.

In [32]:
resp = get_response(dataset_finetune['question'][0])
print(resp[0].split("<|start_header_id|>assistant<|end_header_id|>")[1])

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.




The main contribution of the paper "Taming Transformers for High-Resolution Image Synthesis" is the development of a novel approach that combines the strengths of convolutional neural networks (CNNs) and transformers to enable the synthesis of high-resolution images, specifically in the megapixel range. The authors propose a two-stage model where
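
Splitting on the assistant header works, but it leaves any trailing `<|eot_id|>` token and the surrounding blank lines in the printed text. A small helper can tidy this up. `extract_assistant_reply` is a hypothetical name, and the sketch assumes the decoded string uses the Llama 3 markers shown earlier.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"
END_OF_TURN = "<|eot_id|>"


def extract_assistant_reply(decoded: str) -> str:
    """Return the text of the last assistant turn, without special tokens."""
    # rpartition keeps only what follows the final assistant header.
    _, header, reply = decoded.rpartition(ASSISTANT_HEADER)
    if not header:
        raise ValueError("No assistant header found in decoded output.")
    # Cut at the end-of-turn token if generation finished before max_new_tokens.
    reply = reply.split(END_OF_TURN, 1)[0]
    return reply.strip()


# Toy decoded string in the same layout as the model output.
decoded = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHi<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\nHello there.<|eot_id|>"
)
print(extract_assistant_reply(decoded))  # prints "Hello there."
```

With the notebook's `get_response`, this would be called as `extract_assistant_reply(resp[0])`.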


## Save to HF

In [34]:
model.push_to_hub_gguf(
        f"CPSC532/{OUTPUT_MODEL_NAME}",
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m"],
        token = HF_TOKEN
    )

Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 2.2G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 60.04 out of 83.48 RAM for saving.


100%|██████████| 28/28 [00:00<00:00, 40.20it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m', 'q8_0', 'q5_k_m'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at CPSC532/finetuned_model_2024OCT24 into bf16 GGUF format.
The output location will be /content/CPSC532/finetuned_model_2024OCT24/unsloth.BF16.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: finetuned_model_2024OCT24
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {64}
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'mo

unsloth.Q4_K_M.gguf:   0%|          | 0.00/2.02G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/CPSC532/finetuned_model_2024OCT24
Unsloth: Uploading GGUF to Huggingface Hub...


unsloth.Q8_0.gguf:   0%|          | 0.00/3.42G [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


Saved GGUF to https://huggingface.co/CPSC532/finetuned_model_2024OCT24
Unsloth: Uploading GGUF to Huggingface Hub...


unsloth.Q5_K_M.gguf:   0%|          | 0.00/2.32G [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


Saved GGUF to https://huggingface.co/CPSC532/finetuned_model_2024OCT24


No files have been modified since last commit. Skipping to prevent empty commit.


Saved Ollama Modelfile to https://huggingface.co/CPSC532/finetuned_model_2024OCT24
