<a href="https://colab.research.google.com/github/toni-ramchandani/UnslothFinetuningGuide/blob/main/FineTuningUnSloth.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### 🔧 Setup for Fine-Tuning with Unsloth

- Installs core libraries for efficient LLM fine-tuning (`bitsandbytes`, `peft`, `trl`, etc.).
- Adds support for tokenization and datasets (`sentencepiece`, `datasets`, `huggingface_hub`).
- Installs `unsloth`, a lightweight library for fast and memory-efficient fine-tuning.


In [None]:
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
!pip install --no-deps unsloth

Collecting bitsandbytes
  Downloading bitsandbytes-0.46.1-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting xformers==0.0.29.post3
  Downloading xformers-0.0.29.post3-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Collecting trl
  Downloading trl-0.19.1-py3-none-any.whl.metadata (10 kB)
Collecting cut_cross_entropy
  Downloading cut_cross_entropy-25.1.1-py3-none-any.whl.metadata (9.3 kB)
Collecting unsloth_zoo
  Downloading unsloth_zoo-2025.6.8-py3-none-any.whl.metadata (8.1 kB)
Downloading xformers-0.0.29.post3-cp311-cp311-manylinux_2_28_x86_64.whl (43.4 MB)
Downloading bitsandbytes-0.46.1-py3-none-manylinux_2_24_x86_64.whl (72.9 MB)
Downloading trl-0.19.1-py3-none-any.whl (376 kB)

### Install Core Fine-Tuning Libraries

- `xformers` – Optimized attention operations for faster training  
- `trl` – Transformer Reinforcement Learning library (used in LLM fine-tuning)  
- `peft` – Parameter-Efficient Fine-Tuning (like LoRA)  
- `accelerate` – Easy multi-GPU or mixed precision training  
- `bitsandbytes` – Enables 8-bit and 4-bit quantized model loading (saves memory)

In [None]:
# Use this for older GPUs (V100, Tesla T4, RTX 20xx)
!pip install --no-deps xformers trl peft accelerate bitsandbytes
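
Because the install cells pass `--no-deps`, pip does not resolve dependency versions for these packages, so it can be worth confirming what actually ended up installed. The sketch below uses only the standard library. The helper name `installed_versions` is made up for this example, and missing packages are reported as `None` rather than raising.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Packages named in the install cells above
core = ["bitsandbytes", "accelerate", "xformers", "peft", "trl", "triton", "unsloth"]
for name, ver in installed_versions(core).items():
    print(f"{name:15s} {ver or 'NOT INSTALLED'}")
```

Running this right after the install cells gives a quick record of the environment, which helps when comparing runs across Colab sessions.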



### Load Fast Language Model with Unsloth

- `FastLanguageModel` – Main class from Unsloth for fast, memory-efficient LLM loading.
- `max_seq_length` – Set max token length (e.g. 2048).
- `dtype` – Precision type (auto-detected if `None`, or set manually like `torch.float16`).
- `load_in_4bit` – Enables 4-bit quantized loading to save memory (can be set to `False` for full precision).

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
# dtype = torch.float16
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
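
To see why `load_in_4bit` matters on a small GPU, a back-of-the-envelope estimate of the weight memory is enough. This is a rough sketch, not how Unsloth or `bitsandbytes` account for memory: it assumes the rounded 7 billion parameter count that Unsloth reports for this model, and it ignores quantization constants, layers kept in higher precision, activations, and optimizer state. The function name `weight_memory_gib` is made up for this example.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


n_params = 7_000_000_000  # assumption: rounded parameter count for Mistral 7B

for bits in (16, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gib(n_params, bits):.2f} GiB")
```

At 16 bits the weights alone come to roughly 13 GiB, which leaves little headroom on a Tesla T4 with 14.741 GB, while 4 bits brings the estimate down to a little over 3 GiB.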


### Load Mistral 7B Model with Unsloth

- `FastLanguageModel.from_pretrained` loads the model and tokenizer.
- `model_name` specifies the pre-trained Unsloth model to use.
- `max_seq_length` sets the max input token length.
- `dtype` sets the precision (e.g., float16 or bfloat16).
- `load_in_4bit` enables 4-bit quantized loading to reduce memory usage.

In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-v0.3",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

==((====))==  Unsloth 2025.6.12: Fast Mistral patching. Transformers: 4.53.0.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/4.14G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/157 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/587k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/446 [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

### Apply PEFT (LoRA) with Unsloth

- `get_peft_model` applies LoRA (Low-Rank Adaptation) to the model.
- `r` sets the rank for LoRA layers (higher = more capacity).
- `target_modules` lists which model parts will use LoRA (common transformer layers).
- `lora_alpha` scales the LoRA update, `lora_dropout` sets dropout on the LoRA layers, and `bias` selects which bias terms are trained; Unsloth is optimized for `lora_dropout = 0` and `bias = "none"`.
- `use_gradient_checkpointing = "unsloth"` saves VRAM and supports longer sequences.
- `random_state` ensures reproducible training.
- `use_rslora` and `loftq_config` are advanced options (disabled here).


In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
    # max_seq_length = max_seq_length,
)

Unsloth 2025.6.12 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
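
The effect of `r` and `target_modules` on model size can be checked by hand: a LoRA adapter on a linear layer adds two small matrices, one of shape `r x in_features` and one of shape `out_features x r`. The sketch below counts those parameters. The layer shapes are assumptions taken from the public Mistral 7B configuration (hidden size 4096, key/value width 1024, MLP width 14336), not values read from the loaded model.

```python
r = 16           # LoRA rank used above
num_layers = 32  # matches "patched 32 layers" in the output

# Assumed Mistral 7B shapes as (in_features, out_features)
shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}


def lora_params(in_features, out_features, rank):
    """Parameters in one adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, r) for i, o in shapes.values())
total = per_layer * num_layers
print(f"{per_layer:,} per layer, {total:,} in total")
```

Under these assumptions the total is 41,943,040, the same trainable-parameter count Unsloth prints when training starts. Doubling `r` doubles this number.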


### Load Fine-Tuning Dataset

- `load_dataset` pulls a dataset from Hugging Face Hub.
- `"zefang-liu/phishing-email-dataset"` is the dataset name.
- `split = "train"` loads the training portion only.

This dataset contains emails labeled `Safe Email` or `Phishing Email`, which makes it suitable for fine-tuning models on security classification tasks.


In [None]:
from datasets import load_dataset
dataset = load_dataset("zefang-liu/phishing-email-dataset", split = "train")

README.md:   0%|          | 0.00/616 [00:00<?, ?B/s]

Phishing_Email.csv:   0%|          | 0.00/52.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/18650 [00:00<?, ? examples/s]

### Clean the Dataset

- `remove_columns(['Unnamed: 0'])` deletes an unnecessary index column from the dataset.
- This helps keep the dataset clean and ready for training.

The dataset preview will now show only relevant columns.


### Preview the Dataset

- Converts the first 5 samples into a pandas DataFrame using `pd.DataFrame(dataset[:5])`.
- `print(df)` displays the sample rows for quick inspection.

Useful to verify the structure and content before training.


In [None]:
dataset = dataset.remove_columns(['Unnamed: 0'])
dataset

Dataset({
    features: ['Email Text', 'Email Type'],
    num_rows: 18650
})

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(dataset[:5])

print(df)

                                          Email Text      Email Type
0  re : 6 . 1100 , disc : uniformitarianism , re ...      Safe Email
1  the other side of * galicismos * * galicismo *...      Safe Email
2  re : equistar deal tickets are you still avail...      Safe Email
3  \nHello I am your hot lil horny toy.\n    I am...  Phishing Email
4  software at incredibly low prices ( 86 % lower...  Phishing Email
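
Beyond eyeballing a few rows, a quick label count shows how balanced the classes are. The sketch below runs on the five labels visible in the preview. With the real dataset, the assumption is that `dataset["Email Type"]` can be passed in place of the hard-coded list. The helper name `label_counts` is made up for this example.

```python
from collections import Counter

# The five labels shown in the preview above
preview_labels = [
    "Safe Email", "Safe Email", "Safe Email",
    "Phishing Email", "Phishing Email",
]


def label_counts(labels):
    """Count how often each label occurs."""
    return Counter(labels)


counts = label_counts(preview_labels)
for label, n in counts.most_common():
    print(f"{label}: {n} ({n / len(preview_labels):.0%})")
```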


### Format Dataset for Fine-Tuning

- `phishing_prompt` defines a custom prompt template with email body and label.
- `EOS_TOKEN` is added to signal end of each training sample (prevents infinite generation).
- `formatting_prompts_func` builds prompt strings using dataset fields: "Email Text" and "Email Type".
- `dataset.map(..., batched=True)` applies the function to format all samples.

Result: Each sample is now a single string prompt in a `"text"` column, ready for tokenizer input.


In [None]:
# Define the prompt template
phishing_prompt = """Below is an email body. Determine if the email is safe or phishing.

### Email Text:
{}

### Email Type:
{}"""

EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN

# Define the formatting function
def formatting_prompts_func(examples):
    email_texts = examples["Email Text"]
    email_types = examples["Email Type"]
    texts = []
    for email_text, email_type in zip(email_texts, email_types):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = phishing_prompt.format(email_text, email_type) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

# Apply the formatting function to the dataset
dataset = dataset.map(formatting_prompts_func, batched=True)

Map:   0%|          | 0/18650 [00:00<?, ? examples/s]
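
To see what a formatted sample looks like without loading the model, the same logic can be run on a toy batch. This is a sketch under two assumptions: `EOS_TOKEN` is hard-coded to `</s>` as a stand-in for `tokenizer.eos_token`, and the two emails are invented for illustration. The dictionary of lists mirrors the column-oriented batch that `batched=True` passes to the function.

```python
phishing_prompt = """Below is an email body. Determine if the email is safe or phishing.

### Email Text:
{}

### Email Type:
{}"""

EOS_TOKEN = "</s>"  # assumption: stands in for tokenizer.eos_token


def formatting_prompts_func(examples):
    """Build one training string per (email text, label) pair."""
    texts = [
        phishing_prompt.format(text, label) + EOS_TOKEN
        for text, label in zip(examples["Email Text"], examples["Email Type"])
    ]
    return {"text": texts}


# Hypothetical two-row batch
toy_batch = {
    "Email Text": ["Your invoice is attached.", "Click here to claim your prize!"],
    "Email Type": ["Safe Email", "Phishing Email"],
}

formatted = formatting_prompts_func(toy_batch)
print(formatted["text"][1])
```

Each string ends with the label followed by the EOS token, which is what teaches the model to stop after producing the label.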

### Set Up Trainer for Fine-Tuning

- `SFTTrainer` from TRL is used to fine-tune the model using your formatted dataset.
- `train_dataset` is your phishing prompt data; `"text"` is the input field.
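- `max_seq_length` caps the tokenized length of each sample, `packing` controls whether short samples are concatenated into one sequence (disabled here), and `dataset_num_proc` sets the number of processes used for tokenization.
- `TrainingArguments` defines training configuration:
  - Small per-device batch size (2) with 4 gradient accumulation steps, for an effective batch size of 8 (to save memory)
  - 60 training steps with 5 warmup steps and a linear learning-rate schedule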
  - Uses 8-bit optimizer (`adamw_8bit`)
  - Supports FP16 or BF16 depending on hardware
  - Outputs logs and model checkpoints in `outputs/`


In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

Unsloth: Tokenizing ["text"]:   0%|          | 0/18650 [00:00<?, ? examples/s]
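
The batch settings above determine how much of the dataset the run actually sees. The arithmetic below is a small sketch using the values from this notebook; the variable names mirror the `TrainingArguments` fields, and `num_gpus = 1` reflects the single T4 used here.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1
max_steps = 60
num_examples = 18_650  # rows in the training split

# One optimizer step consumes this many samples
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
samples_seen = effective_batch * max_steps
fraction_of_epoch = samples_seen / num_examples

print(f"Effective batch size: {effective_batch}")
print(f"Samples seen in {max_steps} steps: {samples_seen}")
print(f"Fraction of one epoch: {fraction_of_epoch:.1%}")
```

With these values the run sees 480 samples, under 3% of the dataset, so it is a short demonstration run rather than a full pass over the data. Raising `max_steps` would cover more of it.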

In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
6.883 GB of memory reserved.


### Clear Triton Kernel Cache

- Locates the Triton cache directory (`~/.triton/cache`).
- Deletes it using `shutil.rmtree()` if it exists.
- Helps avoid issues with outdated or corrupted Triton kernel builds.

Recommended before re-running training or after code changes.


In [None]:
import os
import shutil

triton_cache_dir = os.path.expanduser("~/.triton/cache")
if os.path.exists(triton_cache_dir):
    print(f"Clearing Triton cache at: {triton_cache_dir}")
    shutil.rmtree(triton_cache_dir)
else:
    print("Triton cache directory not found.")

Clearing Triton cache at: /root/.triton/cache


### Start Model Training

- `trainer.train()` begins the fine-tuning process using the SFTTrainer setup.
- It may prompt for your **Weights & Biases (W&B) API key** to log metrics.

If you don't want to use W&B, set `report_to="none"` in `TrainingArguments`.
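
Since the trainer is constructed in an earlier cell, the change has to be made there and that cell re-run before calling `trainer.train()`. The sketch below collects the same settings used in this notebook in a plain dictionary and adds `report_to`. The name `training_kwargs` is made up for this example, and the `fp16`/`bf16` flags are left out because they depend on the hardware check.

```python
# Same settings as the trainer cell in this notebook, plus report_to
training_kwargs = dict(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=5,
    max_steps=60,
    learning_rate=2e-4,
    logging_steps=1,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=3407,
    output_dir="outputs",
    report_to="none",  # turn off W&B and other logging integrations
)

# In the notebook this would be used as:
#   args = TrainingArguments(fp16=..., bf16=..., **training_kwargs)
print(training_kwargs["report_to"])
```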


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 18,650 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040 of 7,000,000,000 (0.60% trained)


wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
wandb: Paste an API key from your profile and hit enter:

 ··········

wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: ramchandani-toni (ramchandani-toni-abc) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|------|---------------|
| 1 | 2.4464 |
| 2 | 2.0881 |
| 3 | 2.1436 |
| 4 | 2.4249 |
| 5 | 2.6888 |
| 6 | 2.3896 |
| 7 | 2.3196 |
| 8 | 1.771 |
| 9 | 2.0051 |
| 10 | 2.3816 |
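
With `logging_steps = 1` and a small batch, the per-step loss is noisy, so a moving average makes the trend easier to read. The sketch below smooths the ten values shown above. The helper name `moving_average` and the window size of 5 are arbitrary choices for this example.

```python
# Training loss for steps 1 to 10, copied from the output above
losses = [2.4464, 2.0881, 2.1436, 2.4249, 2.6888,
          2.3896, 2.3196, 1.771, 2.0051, 2.3816]


def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


smoothed = moving_average(losses, window=5)
mean_loss = sum(losses) / len(losses)

print("Smoothed:", [round(v, 3) for v in smoothed])
print(f"Mean over first 10 steps: {mean_loss:.3f}")
```

Ten steps are too few to draw conclusions from, but the same function can be applied to the full log once all 60 steps are available.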




### Show Final Training Stats

- Reads the peak GPU memory reserved by PyTorch with `torch.cuda.max_memory_reserved()`.
- Subtracts the amount reserved before training (`start_gpu_memory`) to estimate the memory attributable to training.
- Prints:
  - Total training time (in seconds and minutes)
  - Peak GPU memory usage
  - Memory usage as % of total available

Helpful for analyzing efficiency and comparing different training setups.


In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

854.5435 seconds used for training.
14.24 minutes used for training.
Peak reserved memory = 6.883 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 46.693 %.
Peak reserved memory for training % of max memory = 0.0 %.
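
The runtime figure becomes more useful when converted to throughput. The sketch below derives seconds per step and samples per second from the numbers in this notebook, then extrapolates to a full epoch. The extrapolation assumes step time stays constant, which is only a rough guide since email lengths vary.

```python
train_runtime_s = 854.5435  # train_runtime from the output above
max_steps = 60
effective_batch = 8         # 2 per device x 4 accumulation steps x 1 GPU
num_examples = 18_650

seconds_per_step = train_runtime_s / max_steps
samples_per_second = max_steps * effective_batch / train_runtime_s

# Rough estimate: time for one full pass at the same speed
full_epoch_hours = num_examples / effective_batch * seconds_per_step / 3600

print(f"{seconds_per_step:.2f} s per step")
print(f"{samples_per_second:.2f} samples per second")
print(f"~{full_epoch_hours:.1f} hours for one full epoch at this rate")
```

On this hardware a full epoch would take on the order of nine hours, which is why the notebook caps training at 60 steps.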


### Run Inference on Custom Email Text

- `phishing_prompt` is reused with a blank label to let the model generate it.
- `FastLanguageModel.for_inference(model)` enables faster inference mode.
- Takes user input for the email body.
- Uses `tokenizer` to convert the input into tensors.
- Runs `model.generate(...)` to classify the email as phishing or safe.
- `TextStreamer` shows the output live as it's generated (optional but useful).

Useful for testing the fine-tuned model interactively.


In [None]:
# Define the phishing email prompt
phishing_prompt = """Below is an email body. Determine if the email is safe or phishing.

### Email Text:
{}

### Email Type:
{}"""

# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

# Example email text to classify
email_text_example = input("Please Enter mail body to Check if it is phishing or not : ")

# Prepare the inputs for the model
inputs = tokenizer(
[
    phishing_prompt.format(
        email_text_example,  # email text
        ""  # email type - leave this blank for generation!
    )
], return_tensors="pt").to("cuda")

# Generate the outputs using the model
outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)
print(tokenizer.batch_decode(outputs))

# For streaming inference
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=128)

Please Enter mail body to Check if it is phishing or not : Dear Customer,  Please find attached your monthly bank statement for June 2024. If you have any questions, please contact our support team.  Thank you for banking with us.  Best regards, Your Bank


MistralForCausalLM has no `_prepare_4d_causal_attention_mask_with_cache_position` method defined in its base modeling class. Compiled forward passes will be sub-optimal. If you're writing code, see Llama for an example implementation. If you're a user, please report this issue on GitHub.


['<s> Below is an email body. Determine if the email is safe or phishing.\n\n### Email Text:\nDear Customer,  Please find attached your monthly bank statement for June 2024. If you have any questions, please contact our support team.  Thank you for banking with us.  Best regards, Your Bank\n\n### Email Type:\nPhishing Email</s>']
<s> Below is an email body. Determine if the email is safe or phishing.

### Email Text:
Dear Customer,  Please find attached your monthly bank statement for June 2024. If you have any questions, please contact our support team.  Thank you for banking with us.  Best regards, Your Bank

### Email Type:
Phishing Email</s>
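
The decoded output repeats the whole prompt, so for programmatic use the label needs to be pulled out of it. The sketch below does this with plain string operations. The function name `extract_label` is made up for this example, and the sample string is an abridged copy of the output above.

```python
def extract_label(decoded, marker="### Email Type:", eos="</s>"):
    """Return the text generated after the last marker, without the EOS token."""
    answer = decoded.rsplit(marker, 1)[-1]
    return answer.replace(eos, "").strip()


# Abridged copy of the decoded output shown above
decoded = (
    "<s> Below is an email body. Determine if the email is safe or phishing.\n\n"
    "### Email Text:\nDear Customer,  Please find attached your monthly bank "
    "statement for June 2024.\n\n### Email Type:\nPhishing Email</s>"
)

print(extract_label(decoded))
```

For this input the function returns `Phishing Email`. Splitting on the last occurrence of the marker keeps the result correct even if the email body itself happens to contain the same heading.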


### Save Fine-Tuned Model Locally

- `model.save_pretrained("mistral_v3_phishing")` saves the trained LoRA adapter weights and config (the model is a PEFT model, so the base model weights are not written out).
- `tokenizer.save_pretrained("mistral_v3_phishing")` saves the tokenizer files.

You can reload this later or push to Hugging Face Hub.


In [None]:
model.save_pretrained("mistral_v3_phishing") # Local saving
tokenizer.save_pretrained("mistral_v3_phishing")

('mistral_v3_phishing/tokenizer_config.json',
 'mistral_v3_phishing/special_tokens_map.json',
 'mistral_v3_phishing/tokenizer.model',
 'mistral_v3_phishing/added_tokens.json',
 'mistral_v3_phishing/tokenizer.json')

### (Optional) Reload Fine-Tuned Model

- Wrapped in `if False:` so it won’t run unless you change it to `True`.
- Loads the saved model (`mistral_v3_phishing`) using Unsloth’s loader.
- Enables fast inference mode again with `FastLanguageModel.for_inference(model)`.

Use this block later to test the model without retraining.


In [None]:
if False:
  from unsloth import FastLanguageModel
  model, tokenizer = FastLanguageModel.from_pretrained(
      model_name = "mistral_v3_phishing", # the model saved above
      max_seq_length = max_seq_length,
      dtype = dtype,
      load_in_4bit = load_in_4bit,
  )

  FastLanguageModel.for_inference(model) #Enable 2x faster inference

### Redefine Phishing Classification Prompt

- `phishing_prompt` is a formatted template used to structure input for the model.
- It includes:
  - The email body (`Email Text`)
  - A placeholder for the model to generate `Email Type` (safe or phishing)

Used during both training and inference for consistent format.


In [None]:
# Define the phishing prompt again (it must match the training template exactly)

phishing_prompt = """Below is an email body. Determine if the email is safe or phishing.

### Email Text:
{}

### Email Type:
{}"""
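
The template is typed out more than once in this notebook, and a fine-tuned model can be sensitive to small formatting differences between training and inference prompts. A simple string comparison catches accidental edits such as a stray leading space. The sketch below is illustrative only; the names `TRAINING_PROMPT` and `first_difference` are made up for this example.

```python
TRAINING_PROMPT = """Below is an email body. Determine if the email is safe or phishing.

### Email Text:
{}

### Email Type:
{}"""


def first_difference(a, b):
    """Return the index of the first differing character, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One string is a prefix of the other, or they are identical
    return None if len(a) == len(b) else min(len(a), len(b))


# A copy with an accidental leading space
candidate = " " + TRAINING_PROMPT

print(first_difference(TRAINING_PROMPT, TRAINING_PROMPT))
print(first_difference(candidate, TRAINING_PROMPT))
```

Keeping a single definition of the template and importing it wherever it is needed avoids the problem entirely.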

### Run Live Inference on Custom Email

- Takes user input: an email body to classify.
- Formats it using `phishing_prompt` with an empty label.
- Tokenizes and moves input to GPU.
- Generates output with `model.generate(...)` to classify as "safe" or "phishing".
- `batch_decode()` shows full output.
- `TextStreamer` streams the generated text live as it's decoded (optional).

This allows real-time testing of the fine-tuned model.


In [None]:
# Read the example email to classify as safe or phishing
email_text_example = input("Enter the Email body to classify whether it is safe or phishing: ")

# Prepare the input for the model
inputs = tokenizer(
    [
        phishing_prompt.format(
            email_text_example,  # email text
            ""  # email type: leave this blank for generation
        )
    ], return_tensors="pt").to("cuda")


# Generate the output using the model
outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)
print(tokenizer.batch_decode(outputs))


# For streaming inference
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=128)

Enter the Email body to classify whether it is safe or phishing: Dear Customer,  Please find attached your monthly bank statement for June 2024. If you have any questions, please contact our support team.  Thank you for banking with us.  Best regards, Your Bank
['<s> Below is an email body. Determine if the email is safe or phishing.\n\n### Email Text:\nDear Customer,  Please find attached your monthly bank statement for June 2024. If you have any questions, please contact our support team.  Thank you for banking with us.  Best regards, Your Bank\n\n### Email Type:\nPhishing Email</s>']
<s> Below is an email body. Determine if the email is safe or phishing.

### Email Text:
Dear Customer,  Please find attached your monthly bank statement for June 2024. If you have any questions, please contact our support team.  Thank you for banking with us.  Best regards, Your Bank

### Email Type:
Phishing Email</s>
