<a href="https://colab.research.google.com/github/riyajidevindu/Advanced-AI-Codes/blob/main/Fine_tuning_using_unsloth_Day_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


In [None]:
# Step 1: Install Unsloth and other required libraries
!pip install -U unsloth trl bitsandbytes datasets accelerate
# Installs or upgrades Unsloth, TRL (provides SFTTrainer for supervised fine-tuning), bitsandbytes (4-bit/8-bit quantization), datasets (dataset loading and processing), and accelerate (device placement and mixed-precision training support)

# Log in to Hugging Face (required to download gated models)
!huggingface-cli login
# Prompts the user for a Hugging Face access token

# Step 2: Load Model via Unsloth
from unsloth import FastLanguageModel  # Imports Unsloth's core wrapper for fast LLM training
import torch                           # Imports PyTorch for tensor operations and device management
from datasets import Dataset           # Imports the Dataset class, used below to build a dataset from in-memory Python data

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-1.1-2b-it-bnb-4bit",  # Gemma 1.1 2B instruction-tuned (IT) model, pre-quantized to 4-bit for Unsloth
    max_seq_length = 2048,                            # Sets the maximum token length for model input
    dtype = torch.float16,                            # Uses float16 precision to save memory and accelerate computation
    load_in_4bit = True,                              # Loads the model in 4-bit quantized mode (saves even more memory)
)
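The memory savings claimed for `load_in_4bit` can be sanity-checked with rough arithmetic. The sketch below is a back-of-the-envelope estimate, assuming Gemma 2B's layer shapes (hidden size 2048, 18 layers, a single 256-wide key/value head, MLP width 16384, vocabulary 256000) and assuming that bitsandbytes quantizes only the linear projections while the embedding (tied to the output head in Gemma) stays in 16-bit. Quantization constants, norms, activations and the KV cache are ignored, so real usage is somewhat higher.

```python
# Rough weight-memory estimate for Gemma 1.1 2B: all-fp16 vs. 4-bit projections.
# Assumed shapes (Gemma 2B config): hidden 2048, 18 layers, one KV head of
# dim 256, MLP width 16384, vocab 256000. Only Linear layers are quantized;
# the embedding is counted once (tied output head) and kept in 16-bit.
HIDDEN, LAYERS, KV_DIM, MLP, VOCAB = 2048, 18, 256, 16384, 256000


def linear_params_per_layer() -> int:
    attn = HIDDEN * HIDDEN * 2 + HIDDEN * KV_DIM * 2  # q_proj + o_proj, k_proj + v_proj
    mlp = HIDDEN * MLP * 3                            # gate_proj, up_proj, down_proj
    return attn + mlp


def weight_gib(linear_bits: int, embed_bits: int = 16) -> float:
    linear = linear_params_per_layer() * LAYERS
    embed = VOCAB * HIDDEN
    total_bytes = linear * linear_bits / 8 + embed * embed_bits / 8
    return total_bytes / 2**30


fp16 = weight_gib(16)
int4 = weight_gib(4)
print(f"fp16 weights : {fp16:.2f} GiB")
print(f"4-bit weights: {int4:.2f} GiB")
```

Under these assumptions the weights take roughly 4.7 GiB in fp16 versus about 1.9 GiB with 4-bit projections; the unquantized embedding accounts for about half of the 4-bit figure.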



In [None]:
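# Baseline check: query the base model before fine-tuning, using the same prompt format as the training data
text = "### Instruction: Analyze the sentiment\n### Input: I love programming .\n### Response:"
inputs = tokenizer(text, return_tensors="pt").to("cuda")  # Tokenize and move the tensors to the GPU
outputs = model.generate(**inputs, max_new_tokens=50)     # Generate up to 50 new tokens
print(tokenizer.decode(outputs[0], skip_special_tokens=True))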


In [3]:
# Now prepare model for training (adds LoRA adapters)
model = FastLanguageModel.get_peft_model(
    model,                                            # The 4-bit base model loaded above
    r = 8,                                            # LoRA rank: number of low-rank dimensions added per adapted layer (controls adapter capacity)
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],  # Attention and MLP projections that receive LoRA adapters
    lora_alpha = 16,                                  # Scaling factor for the LoRA update (alpha / r, or alpha / sqrt(r) with rsLoRA)
    lora_dropout = 0,                                 # No dropout; 0 is the value Unsloth's optimized kernels are tuned for
    bias = "none",                                    # Bias terms are not trained, saving compute and memory
    use_gradient_checkpointing = True,                # Recomputes activations in the backward pass: slower, but fits larger models on smaller GPUs
    random_state = 3407,                              # Seed for reproducible adapter initialization
    use_rslora = True,                                # Enables rank-stabilized LoRA (rsLoRA), which scales the update by alpha / sqrt(r) instead of alpha / r
    loftq_config = None,                              # LoftQ is not used: it only applies with init_lora_weights="loftq" and expects an unquantized base model
)

FastLanguageModel.for_training(model)  # Prepares model for training (enables gradient computation, checkpointing, etc.)




Unsloth 2025.6.2 patched 18 layers with 18 QKV layers, 18 O layers and 18 MLP layers.


PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): GemmaForCausalLM(
      (model): GemmaModel(
        (embed_tokens): Embedding(256000, 2048, padding_idx=0)
        (layers): ModuleList(
          (0-17): 18 x GemmaDecoderLayer(
            (self_attn): GemmaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=2048, out_features=2048, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2048, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=2048, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora.Linea
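The `r`, `lora_alpha` and `use_rslora` settings can be made concrete with a little arithmetic. This sketch assumes the projection shapes of Gemma 2B: the `q_proj` shape is visible in the printed model above, while the others (`k_proj`/`v_proj` output width 256, MLP width 16384) come from the Gemma 2B configuration and are an assumption here. With those shapes, the count agrees with the 9,805,824 trainable parameters reported when training starts.

```python
import math

# LoRA adds A (in x r) and B (r x out) beside each frozen weight matrix,
# so every adapted projection gains r * (in + out) trainable parameters.
R, ALPHA, LAYERS = 8, 16, 18

# (in_features, out_features) per projection; q_proj matches the printed model,
# the remaining shapes are assumed from the Gemma 2B configuration.
SHAPES = {
    "q_proj": (2048, 2048), "k_proj": (2048, 256), "v_proj": (2048, 256),
    "o_proj": (2048, 2048), "gate_proj": (2048, 16384),
    "up_proj": (2048, 16384), "down_proj": (16384, 2048),
}


def lora_params(r: int) -> int:
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in SHAPES.values())
    return per_layer * LAYERS


def lora_scale(alpha: float, r: int, rslora: bool) -> float:
    # Standard LoRA multiplies B @ A by alpha / r; rsLoRA uses alpha / sqrt(r).
    return alpha / math.sqrt(r) if rslora else alpha / r


print(lora_params(R))                         # 9805824
print(lora_scale(ALPHA, R, False))            # 2.0
print(round(lora_scale(ALPHA, R, True), 3))   # 5.657
```

With `use_rslora = True`, the same `lora_alpha = 16` therefore yields a larger effective update scale (about 5.66 instead of 2.0) at rank 8.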

In [4]:
# Example: five labeled examples formatted for instruction fine-tuning
data = {
    "text": [
        "### Instruction: Analyze the sentiment\n### Input: I love programming .\n### Response: Positive",
        "### Instruction: Analyze the sentiment\n### Input: Programming is fun .\n### Response: Positive",
        "### Instruction: Analyze the sentiment\n### Input: I love fun activities.\n### Response: Positive",
        "### Instruction: Analyze the sentiment\n### Input: Worst movie ever.\n### Response: Negative",
        "### Instruction: Analyze the sentiment\n### Input: This is satisfactory despite negative features.\n### Response: Neutral"
    ]
}

dataset = Dataset.from_dict(data)


from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    packing = True,
    args = TrainingArguments(
        output_dir = "./unsloth_gemma_finetuned",
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        num_train_epochs = 2,
        logging_steps = 10,
        save_steps = 50,
        save_total_limit = 2,
        bf16 = False,
        fp16 = True,
        optim = "adamw_8bit",
        warmup_steps = 5,
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        learning_rate = 2e-4,
        report_to = "none"
    ),
)

trainer.train()
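Writing the prompt template once and reusing it for both training strings and inference prompts helps avoid subtle mismatches, such as a missing space after `### Response:`. `format_example` below is a hypothetical helper, not part of the notebook; it reproduces the exact strings used in the `data` dictionary and in the inference cells.

```python
# Hypothetical helper (not in the original notebook): builds one string in the
# "### Instruction / ### Input / ### Response" layout used above. Leaving
# `label` empty yields an inference prompt that ends at "### Response:".
INSTRUCTION = "Analyze the sentiment"


def format_example(text: str, label: str = "") -> str:
    prompt = f"### Instruction: {INSTRUCTION}\n### Input: {text}\n### Response:"
    return f"{prompt} {label}" if label else prompt


pairs = [("I love programming .", "Positive"), ("Worst movie ever.", "Negative")]
texts = [format_example(t, y) for t, y in pairs]
print(texts[0])
print(repr(format_example("I love programming .")))
```

The resulting list could be passed to `Dataset.from_dict({"text": texts})` in the same way as the hand-written `data` dictionary.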


In [5]:
from trl import SFTTrainer
# Imports SFTTrainer from the TRL library, used for supervised fine-tuning (SFT)

from transformers import TrainingArguments
# Imports Hugging Face's TrainingArguments class to specify training configuration

trainer = SFTTrainer(
    model = model,
    # Specifies the pre-loaded PEFT model (e.g., Unsloth-wrapped Gemma)

    tokenizer = tokenizer,
    # Specifies the tokenizer corresponding to the model

    train_dataset = dataset,
    # Specifies the training dataset

    dataset_text_field = "text",
    # Tells the trainer which field in the dataset contains the input text

    max_seq_length = 2048,
    # Sets the maximum sequence length for training samples

    packing = True,
    # Enables sequence packing for efficient use of token space (fills batches with multiple samples when possible)

    args = TrainingArguments(
        output_dir = "./unsloth_gemma_finetuned",
        # Directory where model checkpoints and logs will be saved

        per_device_train_batch_size = 1,
        # Batch size per GPU; with Colab's single GPU, each forward/backward pass sees 1 example

        gradient_accumulation_steps = 4,
        # Accumulates gradients over 4 forward/backward passes before each optimizer update (effective batch size 1 x 4 = 4)

        num_train_epochs = 2,
        # Trains for 2 full passes over the training dataset

        logging_steps = 10,
        # Logs training metrics every 10 steps (this run has only 4 steps, so no loss is logged; 1 would log every step)

        save_steps = 50,
        # Saves a checkpoint every 50 training steps

        save_strategy = "steps",
        # Saves checkpoints on the schedule set by save_steps (this is also the default strategy)

        save_total_limit = 2,
        # Keeps only the 2 most recent checkpoints, deleting older ones

        bf16 = False,
        # Disables bfloat16 training (the Tesla T4 GPU used here does not support bfloat16)

        fp16 = True,
        # Enables mixed-precision training using float16 for speed and memory efficiency on GPUs

        optim = "adamw_8bit",
        # Uses 8-bit AdamW optimizer from bitsandbytes to reduce memory usage

        warmup_steps = 5,
        # Number of warmup steps for the learning rate scheduler (more than the 4 total steps here, so the peak rate is never reached)

        weight_decay = 0.01,
        # Applies AdamW's decoupled weight decay to discourage large weights and reduce overfitting

        lr_scheduler_type = "linear",
        # Uses a linear learning rate scheduler

        learning_rate = 2e-4,
        # Sets the base learning rate for the optimizer

        report_to = "none",
        # Disables integration with external logging tools (e.g., WandB, TensorBoard)

    ),

)
# Creates the trainer object for supervised fine-tuning using all the above configuration

trainer.train()
# Starts the training process

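# Merge the LoRA adapters into the base weights and save the merged model
# `model` is the PEFT-wrapped (LoRA) model; merge_and_unload() folds the adapters
# into the base layers and returns a plain transformers model. Because the base
# weights are 4-bit, the merged weights are re-quantized, which can cause small
# rounding differences in generations.
model = model.merge_and_unload()
model.save_pretrained("unsloth_gemma_finetuned")
tokenizer.save_pretrained("unsloth_gemma_finetuned")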





==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5 | Num Epochs = 2 | Total steps = 4
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 4 x 1) = 4
 "-____-"     Trainable parameters = 9,805,824/2,000,000,000 (0.49% trained)


Step,Training Loss
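The empty `Step,Training Loss` table follows from the step arithmetic: 5 examples, a per-device batch of 1 and 4 accumulation steps give 2 optimizer steps per epoch, so 4 in total, and `logging_steps = 10` is never reached. The sketch below reproduces that calculation, assuming (consistently with the banner's `Total steps = 4`) that a final, partially filled accumulation window still counts as a step.

```python
import math


def total_optimizer_steps(num_examples: int, batch_size: int,
                          grad_accum: int, epochs: int) -> int:
    # Assumption: a partial last accumulation window still triggers an update.
    micro_batches = math.ceil(num_examples / batch_size)
    steps_per_epoch = math.ceil(micro_batches / grad_accum)
    return steps_per_epoch * epochs


def logged_steps(total_steps: int, logging_steps: int) -> list[int]:
    # Loss is reported at every multiple of logging_steps up to total_steps.
    return [s for s in range(1, total_steps + 1) if s % logging_steps == 0]


total = total_optimizer_steps(num_examples=5, batch_size=1, grad_accum=4, epochs=2)
print(total)                    # 4
print(logged_steps(total, 10))  # [] (the empty loss table)
print(logged_steps(total, 1))   # [1, 2, 3, 4]
```

For a dataset this small, `logging_steps = 1` is the setting that makes the loss visible.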




('unsloth_gemma_finetuned/tokenizer_config.json',
 'unsloth_gemma_finetuned/special_tokens_map.json',
 'unsloth_gemma_finetuned/chat_template.jinja',
 'unsloth_gemma_finetuned/tokenizer.model',
 'unsloth_gemma_finetuned/added_tokens.json',
 'unsloth_gemma_finetuned/tokenizer.json')
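`warmup_steps = 5` is longer than the 4-step run, so the linear warmup never completes. The sketch below re-implements, for illustration only, the multiplier of a linear schedule with warmup (the same formula as `transformers.get_linear_schedule_with_warmup`), assuming the scheduler advances once per optimizer update as the Trainer does, so the first update uses the multiplier at step 0.

```python
# Illustrative re-implementation of a linear schedule with warmup:
# the multiplier ramps 0 -> 1 over `warmup` steps, then decays linearly to 0 at `total`.
def linear_warmup_multiplier(step: int, warmup: int, total: int) -> float:
    if step < warmup:
        return step / max(1, warmup)
    return max(0.0, (total - step) / max(1, total - warmup))


BASE_LR = 2e-4
# Learning rates applied by the 4 optimizer updates of this run (steps 0..3).
lrs = [BASE_LR * linear_warmup_multiplier(s, warmup=5, total=4) for s in range(4)]
print([f"{lr:.1e}" for lr in lrs])
```

Under these assumptions the four updates use learning rates of 0, 4e-5, 8e-5 and 1.2e-4, so the configured 2e-4 is never applied; a shorter warmup (or `warmup_ratio`) would let the run reach its peak rate.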

In [6]:
# Reload the merged, fine-tuned model and its tokenizer with Unsloth
# Point model_name at the directory the model was saved to above
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth_gemma_finetuned", # Local directory containing the saved model
    max_seq_length = 2048, # Same maximum sequence length as during training
    dtype = torch.float16,  # float16, as used during training (the T4 GPU has no bfloat16 support)
    load_in_4bit = True,   # The merged model was saved with 4-bit weights, so load it in 4-bit
    # Pass any other options used when the model was first loaded
)


==((====))==  Unsloth 2025.6.2: Fast Gemma patching. Transformers: 4.52.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Zero-shot inference

In [7]:
text = "### Instruction: Analyze the sentiment\n### Input: I love programming .\n### Response:"
inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


### Instruction: Analyze the sentiment
### Input: I love programming .
### Response: Positive

**Explanation:**

The sentiment expressed in the input is positive. The user expresses genuine enjoyment and enthusiasm towards the activity of programming.


In [8]:
prompt = """### Instruction: Analyze the sentiment
### Input: This movie is terrible.
### Response:"""

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


### Instruction: Analyze the sentiment
### Input: This movie is terrible.
### Response: The sentiment expressed in the input is negative.

### Explanation:
- The word "terrible"
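The fine-tuned model tends to keep generating after the label (an `Explanation` section here), and the second answer is a full sentence rather than a bare label. For evaluation, a small post-processing step can reduce each generation to one of the three training labels. `extract_label` is a hypothetical helper, not part of the notebook: it keeps only the text after the last `### Response:` and returns the first label keyword it finds, so phrasing such as "not positive" would be misread.

```python
import re
from typing import Optional

# Hypothetical post-processing helper: map a decoded generation to one of the
# three labels used in training, or None if no label keyword appears.
LABEL_PATTERN = re.compile(r"\b(positive|negative|neutral)\b", re.IGNORECASE)


def extract_label(decoded: str) -> Optional[str]:
    answer = decoded.rsplit("### Response:", 1)[-1]  # text after the last response marker
    match = LABEL_PATTERN.search(answer)
    return match.group(1).capitalize() if match else None


# Decoded outputs copied from the two cells above (truncated).
out_a = ("### Instruction: Analyze the sentiment\n### Input: I love programming .\n"
         "### Response: Positive\n\n**Explanation:**\n\nThe sentiment expressed in the input is positive.")
out_b = ("### Instruction: Analyze the sentiment\n### Input: This movie is terrible.\n"
         "### Response: The sentiment expressed in the input is negative.\n\n### Explanation:\n- The word \"terrible\"")
print(extract_label(out_a), extract_label(out_b))
```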


Few-shot inference

In [9]:
prompt = """### Instruction: Analyze the sentiment
### Input: I love programming.
### Response: Positive
### Input: I hate debugging.
### Response: Negative
### Input: The interface is very user-friendly.
### Response:"""

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


### Instruction: Analyze the sentiment
### Input: I love programming.
### Response: Positive
### Input: I hate debugging.
### Response: Negative
### Input: The interface is very user-friendly.
### Response: Positive

**Explanation:**

The sentiment analysis reveals a clear difference in the responses to the three inputs
