<a href="https://colab.research.google.com/github/vishnutejaa/MyEdMaster-ASUCapstone-LLM-Generated-Customized-Instructional-Content/blob/main/485FineTuneExperiment_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

Install the Unsloth framework for faster fine-tuning.
1. It accelerates fine-tuning through manually derived backpropagation and optimized GPU kernels.
2. It reduces memory use, which allows larger batch sizes.
3. Unsloth describes itself as compatible with the Hugging Face ecosystem and with GPUs from NVIDIA, AMD, and Intel; this notebook runs on an NVIDIA Tesla T4.
4. It integrates with Transformers and PEFT.

These are only a few of its features, so let's look at what it actually does.

Unsloth's kernels are written to be clean and readable while still outperforming the usual implementations. They aim for low computational overhead, in part by evaluating chained matrix multiplications in an efficient order. Unsloth also publishes ready-to-use, pre-quantized models, one of which this notebook loads.
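
To see why the order of chained matrix multiplications matters, the sketch below counts multiply-adds for two ways of evaluating the same product. It is an illustration only, not Unsloth's kernel code, and the shapes are assumed for the example: 2048 tokens, a hidden size of 4096, and a rank-16 pair of factors.

```python
def matmul_cost(rows, inner, cols):
    """Multiply-add count for a (rows x inner) @ (inner x cols) product."""
    return rows * inner * cols

# Assumed shapes: X is (n x d), A is (d x r), B is (r x d)
n, d, r = 2048, 4096, 16

# Option 1: (X @ A) @ B keeps every intermediate small
cost_left = matmul_cost(n, d, r) + matmul_cost(n, r, d)

# Option 2: X @ (A @ B) first builds a full (d x d) matrix
cost_right = matmul_cost(d, r, d) + matmul_cost(n, d, d)

print(f"(X @ A) @ B: {cost_left:,} multiply-adds")
print(f"X @ (A @ B): {cost_right:,} multiply-adds")
print(f"ratio: {cost_right / cost_left:.1f}x")
```

Both orders give the same result mathematically, but under these assumed shapes the second one does over a hundred times more work.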

In [None]:
!pip install "pyarrow>=15.0.0"


In [None]:
!pip list

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Maximum number of tokens passed to the model in one sequence
dtype = None # None auto-detects. Use float16 for Tesla T4/V100, bfloat16 for Ampere and newer
load_in_4bit = True # 4-bit quantization reduces memory usage; set to False to load unquantized weights

# Pre-quantized 4-bit models that work with Unsloth; these download faster
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit",
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",
] # More models are listed at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.9.post2: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/55.4k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/340 [00:00<?, ?B/s]
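
A rough, back-of-the-envelope estimate shows why 4-bit loading matters on a T4 with about 14.7 GB of memory. The parameter count below (about 8.03 billion for Llama 3.1 8B) is an assumption taken from the published model size, and the estimate covers weights only, ignoring activations, optimizer state, and quantization metadata.

```python
def weight_gb(num_params, bits_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 8.03e9  # assumed parameter count for Llama 3.1 8B

for label, bits in [("float16/bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>17}: {weight_gb(NUM_PARAMS, bits):.1f} GB")
```

The 5.70G download reported above is larger than the naive 4-bit figure, presumably because some tensors are stored at higher precision alongside the quantization metadata.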

**PEFT** (Parameter-Efficient Fine-Tuning) fine-tunes large language models more efficiently by training only a small number of parameters instead of the entire model. **LoRA** (Low-Rank Adaptation) is one such technique: it freezes the original weights and learns a pair of small low-rank matrices for each selected layer, whose product is added to that layer's output.
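
The idea can be sketched with plain Python lists. This is a toy illustration under assumed sizes, not the PEFT or Unsloth implementation: `W` is a frozen weight matrix, and `A` and `B` are the small trainable factors, with `B` starting at zero as in the usual LoRA initialization.

```python
import random

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """Return W x + (alpha / r) * B (A x); W stays frozen, only A and B train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d, r, alpha = 8, 2, 2   # toy sizes, chosen only for illustration
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]  # frozen d x d weight
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]  # trainable r x d
B = [[0.0] * r for _ in range(d)]                               # trainable d x r, starts at zero
x = [1.0] * d

# With B at zero, the adapted layer reproduces the base layer exactly
assert lora_forward(W, A, B, x, alpha, r) == matvec(W, x)

full_params = d * d
lora_params = r * d + d * r
print(f"full: {full_params} weights, LoRA: {lora_params} trainable weights")
```

The saving grows with layer size: for a 4096 x 4096 projection at rank 16, the two factors hold 131,072 weights against 16,777,216 in the full matrix.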






1. **q_proj, k_proj, v_proj:** Refers to the query, key, and value projection layers in the attention mechanism.
2. **o_proj:** The output projection layer in the attention mechanism.
3. **gate_proj, up_proj, down_proj:** Layers involved in the feedforward neural network part of the transformer architecture.


c. lora_alpha = 16

Scaling factor for the LoRA matrices. The low-rank update is multiplied by lora_alpha / r before it is added to the output of the frozen weights, so with r = 16 the update here is applied at a scale of 1.
A higher lora_alpha gives the LoRA update a larger effect on the model's output. Setting lora_alpha equal to r, as here, is a common starting point that balances stability and performance.


d. lora_dropout = 0

This sets the dropout rate for the LoRA layers. A value of 0 applies no dropout, which is the setting Unsloth's optimized code path expects.
Dropout is a regularization technique that helps prevent overfitting; it is disabled here in favor of speed.


e. bias = "none"

Bias configuration. LoRA lets the user choose whether bias terms are trained along with the adapters.
"none" leaves all bias terms frozen, which is the most memory-efficient setting and the one Unsloth optimizes for.


f. use_gradient_checkpointing = "unsloth"

Gradient checkpointing helps save memory by recomputing intermediate activations during the backward pass instead of storing them. This trades computation for memory efficiency.
Setting this to "unsloth" (a custom option in this package) uses a specialized implementation, allowing longer context sizes and memory optimization, fitting larger batch sizes during training or fine-tuning.


g. random_state = 3407

Random seed for reproducibility. It fixes the random initialization of the LoRA adapters so that repeated runs start from the same state; some GPU operations can still introduce small run-to-run differences.


h. use_rslora = False

Rank-Stabilized LoRA (rsLoRA). This variant scales the adapter update by lora_alpha / sqrt(r) instead of lora_alpha / r, which keeps the size of the update stable as the rank grows. Setting this to False means standard LoRA scaling is used.


i. loftq_config = None

LoftQ (LoRA-Fine-Tuning-aware Quantization) is an initialization scheme that chooses the starting LoRA matrices so that the quantized base weights plus the adapter approximate the original full-precision weights. With None, the adapters use the default LoRA initialization.
LoftQ is aimed at setups that combine LoRA with a quantized base model, where it can reduce the accuracy lost to quantization.
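
The effect of the scaling factor is easy to tabulate. The sketch below assumes the conventional definitions, lora_alpha / r for standard LoRA and lora_alpha / sqrt(r) for the rank-stabilized variant; `lora_scale` is a helper written for this example, not a library function.

```python
import math

def lora_scale(alpha, r, use_rslora=False):
    """Scaling applied to the low-rank update before it is added to the layer output."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r

alpha = 16
for r in (8, 16, 64, 128):
    print(f"r={r:>3}  standard={lora_scale(alpha, r):.3f}  "
          f"rsLoRA={lora_scale(alpha, r, use_rslora=True):.3f}")
```

With lora_alpha held at 16, the standard scale shrinks quickly as the rank grows, while the rank-stabilized scale shrinks much more slowly.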

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # LoRA rank; any integer > 0. Suggested: 8, 16, 32, 64, 128. A higher rank adds more trainable parameters
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],  # Transformer projection layers (attention and feed-forward) that receive LoRA adapters
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.9.post2 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
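
The number of trainable parameters this configuration adds can be worked out by hand. The layer dimensions below are assumptions taken from the public Llama 3.1 8B configuration, not values read from the loaded model.

```python
# Assumed Llama 3.1 8B dimensions (from the public model configuration)
HIDDEN = 4096          # hidden size
INTERMEDIATE = 14336   # feed-forward inner size
KV_DIM = 1024          # 8 key/value heads x 128 dims (grouped-query attention)
NUM_LAYERS = 32
RANK = 16

# (input features, output features) for each adapted projection
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(shapes, rank, num_layers):
    """Each adapter adds rank * (in_features + out_features) weights."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers

total = lora_param_count(MODULE_SHAPES, RANK, NUM_LAYERS)
print(f"{total:,}")  # 41,943,040
```

This agrees with the "Number of trainable parameters" that Unsloth prints when training starts.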


In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3", # Llama 3 chat format; the questions are fed directly rather than in ShareGPT style
)

# Tokenize each question into its token strings (one list of tokens per example)
def formatting_prompts_func(examples):
    questions = examples["question"]  # Batch of values from the 'question' column
    # tokenizer.tokenize returns token strings, not IDs, so "text" holds lists of token pieces
    texts = [tokenizer.tokenize(question) for question in questions]
    return {"text": texts}

from datasets import load_dataset
dataset = load_dataset("nvidia/OpenMathInstruct-1", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

README.md:   0%|          | 0.00/6.91k [00:00<?, ?B/s]

train.jsonl:   0%|          | 0.00/1.33G [00:00<?, ?B/s]

train.jsonl:   0%|          | 0.00/6.42G [00:00<?, ?B/s]

validation.jsonl:   0%|          | 0.00/203M [00:00<?, ?B/s]

validation.jsonl:   0%|          | 0.00/981M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/7321344 [00:00<?, ? examples/s]

In [None]:
dataset[5]["question"]

'Jaynie wants to make leis for the graduation party.  It will take 2 and half dozen plumeria flowers to make 1 lei.  If she wants to make 4 leis, how many plumeria flowers must she pick from the trees in her yard?'

In [None]:
print(dataset[5]["text"])

['Jay', 'nie', 'Ġwants', 'Ġto', 'Ġmake', 'Ġle', 'is', 'Ġfor', 'Ġthe', 'Ġgraduation', 'Ġparty', '.', 'Ġ', 'ĠIt', 'Ġwill', 'Ġtake', 'Ġ', '2', 'Ġand', 'Ġhalf', 'Ġdozen', 'Ġpl', 'umer', 'ia', 'Ġflowers', 'Ġto', 'Ġmake', 'Ġ', '1', 'Ġlei', '.', 'Ġ', 'ĠIf', 'Ġshe', 'Ġwants', 'Ġto', 'Ġmake', 'Ġ', '4', 'Ġle', 'is', ',', 'Ġhow', 'Ġmany', 'Ġpl', 'umer', 'ia', 'Ġflowers', 'Ġmust', 'Ġshe', 'Ġpick', 'Ġfrom', 'Ġthe', 'Ġtrees', 'Ġin', 'Ġher', 'Ġyard', '?']
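
Because `text` holds token pieces rather than a string, and the question alone gives the model no solution to learn from, a formatter that pairs each question with its solution may be a better fit. A minimal sketch follows, written in plain Python so the layout is visible. The special-token layout mirrors the Llama 3 prompt that the streaming inference cell prints in this notebook, and the `generated_solution` column name is an assumption about the dataset schema that should be checked against `dataset.column_names`.

```python
BOS = "<|begin_of_text|>"
EOT = "<|eot_id|>"

def render_turn(role, content):
    """One chat turn in the Llama 3 layout."""
    return f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}{EOT}"

def format_examples(examples):
    """Batched formatter: pair each question with its solution as one training string."""
    texts = [
        BOS + render_turn("user", q) + render_turn("assistant", s)
        for q, s in zip(examples["question"], examples["generated_solution"])
    ]
    return {"text": texts}

# Tiny hand-made batch standing in for real dataset rows
batch = {
    "question": ["What is 2 + 3?"],
    "generated_solution": ["2 + 3 = 5"],
}
print(format_examples(batch)["text"][0])
```

In the notebook itself, `tokenizer.apply_chat_template(messages, tokenize = False)` would build the same kind of string from the installed template. If the tokenizer already prepends a BOS token, drop `BOS` from the string to avoid duplicating it.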


In [None]:
unsloth_template = \
    "{{ bos_token }}"\
    "{{ 'You are a helpful assistant to the user\n' }}"\
    "{% for message in messages %}"\
        "{% if message['role'] == 'user' %}"\
            "{{ '>>> User: ' + message['content'] + '\n' }}"\
        "{% elif message['role'] == 'assistant' %}"\
            "{{ '>>> Assistant: ' + message['content'] + eos_token + '\n' }}"\
        "{% endif %}"\
    "{% endfor %}"\
    "{% if add_generation_prompt %}"\
        "{{ '>>> Assistant: ' }}"\
    "{% endif %}"
unsloth_eos_token = "eos_token"

if False:
    tokenizer = get_chat_template(
        tokenizer,
        chat_template = (unsloth_template, unsloth_eos_token,), # You must provide a template and EOS token
        map_eos_token = True, # Maps <|im_end|> to </s> instead
    )
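
To see what the custom template would produce, the function below mirrors its Jinja logic in plain Python. It is for inspection only and is not how the tokenizer renders templates; `"<s>"` and `"</s>"` are placeholder BOS and EOS tokens chosen for the example.

```python
def render_custom_template(messages, bos_token, eos_token, add_generation_prompt=False):
    """Plain-Python mirror of the custom chat template, for inspection only."""
    out = bos_token + "You are a helpful assistant to the user\n"
    for message in messages:
        if message["role"] == "user":
            out += ">>> User: " + message["content"] + "\n"
        elif message["role"] == "assistant":
            # Assistant turns end with the EOS token so the model learns where to stop
            out += ">>> Assistant: " + message["content"] + eos_token + "\n"
    if add_generation_prompt:
        out += ">>> Assistant: "
    return out

messages = [
    {"role": "user", "content": "What is 2 + 3?"},
    {"role": "assistant", "content": "5"},
    {"role": "user", "content": "And doubled?"},
]
rendered = render_custom_template(messages, "<s>", "</s>", add_generation_prompt=True)
print(rendered)
```

Messages with any other role are skipped, exactly as in the template's `if`/`elif` chain.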

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# An earlier version of this trainer used dataset_text_field = "text";
# the run below trains on the raw "question" column instead.
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "question",  # Use raw questions
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

Map (num_proc=2):   0%|          | 0/7321344 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
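
The learning rate implied by `warmup_steps = 5`, `max_steps = 60`, and `lr_scheduler_type = "linear"` can be sketched as follows. This assumes the usual linear-warmup, linear-decay formulation; `linear_schedule_lr` is a helper written for this example rather than a Transformers function.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=60):
    """Learning rate after `step` optimizer updates: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        factor = step / max(1, warmup_steps)
    else:
        factor = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * factor

for step in (0, 1, 5, 30, 60):
    print(f"step {step:>2}: lr = {linear_schedule_lr(step):.2e}")
```

The rate climbs to its peak of 2e-4 at step 5 and then falls in a straight line to zero at step 60.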


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 7,321,344 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
1,1.6953
2,1.5659
3,1.8666
4,1.9516
5,1.7005
6,1.9406
7,1.669
8,1.7346
9,1.1122
10,1.8193
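
The figures in the training banner above follow from the trainer settings. The short calculation below reproduces the total batch size and shows how little of the dataset a 60-step run touches; all inputs are copied from the banner and the `TrainingArguments`.

```python
per_device_batch = 2
grad_accum_steps = 4
num_gpus = 1
max_steps = 60
num_examples = 7_321_344

# Gradients from several small batches are accumulated before each optimizer step
effective_batch = per_device_batch * grad_accum_steps * num_gpus
examples_seen = effective_batch * max_steps
coverage = examples_seen / num_examples

print(f"effective batch size: {effective_batch}")
print(f"examples seen in {max_steps} steps: {examples_seen}")
print(f"share of the training split: {coverage:.4%}")
```

With so few examples seen, this run is best read as a smoke test of the pipeline rather than a full fine-tune.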


In [None]:
from unsloth.chat_templates import get_chat_template
import torch



# Example: selecting a question from your dataset
sample_question = dataset[5]["question"]+" . Break the solution to series of steps and provide a complete solution"  # Fetching one of the math questions


tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    # mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
)

FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# The chat template expects each message to carry 'role' and 'content' keys
messages = [
    {"role": "user", "content": sample_question},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True)
# Decode the output while skipping special tokens
decoded_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Print the cleaned output
print(decoded_outputs)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


["user\n\nJaynie wants to make leis for the graduation party.  It will take 2 and half dozen plumeria flowers to make 1 lei.  If she wants to make 4 leis, how many plumeria flowers must she pick from the trees in her yard?. Break the solution to series of steps and provide a complete solutionassistant\n\nHere's the solution broken down into steps:\n\n**Step 1: Determine the number of plumeria flowers needed for one lei**\n\nIt will take 2 and a half dozen plumeria flowers to make 1 lei.  Since there are 12 flowers in a dozen, we need to convert 2 and"]
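
The warning above concerns the attention mask, which tells the model which positions hold real tokens and which are padding. The sketch below builds one by hand for a left-padded batch, using made-up token IDs, to show why the mask cannot be inferred from the IDs alone when the pad and EOS tokens share an ID. In the notebook, passing `return_dict = True` to `apply_chat_template` should return an `attention_mask` alongside `input_ids` that can be forwarded to `model.generate`; that is a suggestion, not something the original run did.

```python
def pad_batch(sequences, pad_id):
    """Left-pad token ID lists to equal length and build the matching attention mask."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = width - len(seq)
        input_ids.append([pad_id] * n_pad + seq)
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask

def guess_mask(input_ids, pad_id):
    """Naive inference: treat every pad_id position as padding."""
    return [[0 if tok == pad_id else 1 for tok in row] for row in input_ids]

EOS_ID = 9                              # hypothetical ID, shared by pad and EOS here
batch = [[5, 6, EOS_ID], [7, EOS_ID]]   # both sequences end with a real EOS
ids, mask = pad_batch(batch, pad_id=EOS_ID)

print(ids)                      # [[5, 6, 9], [9, 7, 9]]
print(mask)                     # [[1, 1, 1], [0, 1, 1]]
print(guess_mask(ids, EOS_ID))  # [[1, 1, 0], [0, 1, 0]]  real EOS tokens are masked out
```

The explicit mask keeps the genuine EOS tokens visible, while the guessed mask hides them, which is the ambiguity the warning describes.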


In [None]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference


# Example: selecting a question from your dataset
sample_question = dataset[5]["question"]+" . Break the solution to series of steps and provide a complete solution"  # Fetching one of the math questions


messages = [
    {"role": "user", "content": sample_question},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 2048, use_cache = True)

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Jaynie wants to make leis for the graduation party.  It will take 2 and half dozen plumeria flowers to make 1 lei.  If she wants to make 4 leis, how many plumeria flowers must she pick from the trees in her yard?. Break the solution to series of steps and provide a complete solution<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Let's break down the solution into steps:

**Step 1: Determine the number of plumeria flowers needed to make one lei.**
Since it takes 2 and a half dozen plumeria flowers to make one lei, we need to convert 2 and a half dozen to a numerical value. There are 12 flowers in a dozen, so:

2 and a half dozen = 2 x 12 + 6 = 30 flowers

**Step 2: Determine the total number of plumeria flowers needed to make 4 leis.**
Since it takes 30 flowers to make one lei, it will take:

30 flowers x 4 leis = 120 flowers

to make 4 leis.

**Step 3: Convert the total number of plumeria flowers to a numerical value.*
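
The model's first two steps can be checked with a few lines of arithmetic. The output shown ends partway through Step 3, so only those two steps are verified here.

```python
FLOWERS_PER_DOZEN = 12
dozens_per_lei = 2.5
num_leis = 4

flowers_per_lei = dozens_per_lei * FLOWERS_PER_DOZEN  # Step 1: 2.5 dozen per lei
total_flowers = flowers_per_lei * num_leis            # Step 2: four leis

print(int(flowers_per_lei), int(total_flowers))  # 30 120
```

Both intermediate values agree with the model's answer of 120 flowers.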

In [None]:
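# Merge the LoRA weights into the base model and push to the Hub (16-bit merge; the 4-bit local save is left commented out)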
# from google.colab import userdata
# Key = userdata.get('HF_KEY')

# # if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
# model.push_to_hub_merged("VishnuT/Math_LLaMa3.1", tokenizer, save_method = "merged_16bit", token = Key)


Unsloth: You are pushing to hub, but you passed your HF username = VishnuT.
We shall truncate VishnuT/Math_LLaMa3.1 to Math_LLaMa3.1
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 5.7G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 30.37 out of 50.99 RAM for saving.


 34%|███▍      | 11/32 [00:00<00:01, 14.48it/s]We will save to Disk and not RAM now.
100%|██████████| 32/32 [00:23<00:00,  1.39it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...


README.md:   0%|          | 0.00/31.0 [00:00<?, ?B/s]

TypeError: argument of type 'NoneType' is not iterable

In [None]:
from google.colab import userdata
Key = userdata.get('HF_KEY')
model.push_to_hub_gguf("VishnuT/Math_LLaMa3.1", tokenizer , quantization_method = "f16", token = Key)

Unsloth: ##### The current model auto adds a BOS token.
Unsloth: ##### Your chat template has a BOS token. We shall remove it temporarily.


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 32.93 out of 50.99 RAM for saving.


100%|██████████| 32/32 [00:48<00:00,  1.51s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['f16'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at VishnuT/Math_LLaMa3.1 into f16 GGUF format.
The output location will be ./VishnuT/Math_LLaMa3.1/unsloth.F16.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: Math_LLaMa3.1
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00004.safetensors'
INFO:hf-to-gguf:token_embd.weight,           torch.float16 --> F16, shape = {4096, 128256}
INFO:hf-to-gguf:blk.0.at

  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.F16.gguf:   0%|          | 0.00/16.1G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/VishnuT/Math_LLaMa3.1


No files have been modified since last commit. Skipping to prevent empty commit.
Unsloth: ##### The current model auto adds a BOS token.
Unsloth: ##### We removed it in GGUF's chat template for you.


Saved Ollama Modelfile to https://huggingface.co/VishnuT/Math_LLaMa3.1
