## The Flash Attention Bug
**Run these two cells only if you hit a flash attention error when you try to train the model.**

In [None]:
import os

# Must be set BEFORE importing transformers/accelerate
os.environ["USE_FLASH_ATTENTION"] = "0"
os.environ["DISABLE_FLASH_ATTN"] = "1"
os.environ["TRITON_DISABLE_LINE_INFO"] = "1"

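import torch
# Guard the device lookup so the cell does not crash on a machine without a GPU
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU")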
print("USE_FLASH_ATTENTION:", os.environ.get("USE_FLASH_ATTENTION"))
print("DISABLE_FLASH_ATTN:", os.environ.get("DISABLE_FLASH_ATTN"))


In [None]:
import torch, os
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU")
print("USE_FLASH_ATTENTION:", os.environ.get("USE_FLASH_ATTENTION"))
print("DISABLE_FLASH_ATTN:", os.environ.get("DISABLE_FLASH_ATTN"))
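
When chasing this error it can help to know whether the `flash_attn` package is installed at all. The following is a minimal diagnostic sketch using only the standard library; the helper name `flash_attn_status` is made up for illustration, and the check does not verify whether any given library version actually honors the environment variables set above.

```python
import importlib.util
import os


def flash_attn_status():
    """Report whether flash_attn is importable and what the env flags are set to."""
    return {
        # find_spec returns None when the package cannot be found
        "flash_attn_installed": importlib.util.find_spec("flash_attn") is not None,
        "USE_FLASH_ATTENTION": os.environ.get("USE_FLASH_ATTENTION"),
        "DISABLE_FLASH_ATTN": os.environ.get("DISABLE_FLASH_ATTN"),
    }


status = flash_attn_status()
for key, value in status.items():
    print(f"{key}: {value}")
```

If `flash_attn_installed` is `False`, the error is more likely caused by a library trying to import a missing package than by the kernels themselves.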


## Install and Import

In [None]:
!pip install torch transformers datasets trl unsloth

In [None]:
import torch
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth.chat_templates import get_chat_template, standardize_sharegpt

## Load the Model

**Explaining LoRA**

**Core Concept: LoRA (Low-Rank Adaptation)**

Imagine one of the large weight matrices inside the model; call it W. A full fine-tune computes an update matrix, ΔW, and changes the original weights to W + ΔW. ΔW is the same size as W, which is why full fine-tuning is so memory-intensive.

The key insight of LoRA is that the update matrix ΔW can be approximated well by the product of two much smaller, "low-rank" matrices:

ΔW ≈ B * A

If W is a 4096 x 4096 matrix, ΔW is also 4096 x 4096 (about 16.8 million parameters). With LoRA at rank 16, we replace it with two smaller matrices:

- Matrix B of size 4096 x 16
- Matrix A of size 16 x 4096

The shapes have to be in this order for B * A to come out as 4096 x 4096. The total number of parameters in A and B is (4096 * 16) + (16 * 4096) = 131,072, a ~99.2% reduction compared to training the full ΔW matrix.

During training, the original weight matrix W is frozen (not updated), and only the new, small matrices A and B are trained. This is the core of how PEFT/LoRA saves so much memory and computation.
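
The arithmetic above can be checked with a few lines of plain Python. This is only an illustrative sketch: the helper names `lora_param_counts` and `matmul` and the tiny 4 x 2 and 2 x 4 matrices are made up for the example and are not part of any library.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full update size, LoRA update size, fractional reduction)."""
    full = d_out * d_in                   # parameters in a full delta-W
    lora = d_out * rank + rank * d_in     # parameters in B (d_out x r) plus A (r x d_in)
    return full, lora, 1 - lora / full


full, lora, reduction = lora_param_counts(4096, 4096, 16)
print(f"Full update: {full:,} parameters")
print(f"LoRA update: {lora:,} parameters")
print(f"Reduction:   {reduction:.1%}")


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


# Tiny shape check with d = 4 and r = 2: B is 4 x 2, A is 2 x 4, so B * A is 4 x 4
B = [[1, 0], [0, 1], [1, 1], [2, 0]]
A = [[1, 2, 3, 4], [0, 1, 0, 1]]
delta_W = matmul(B, A)
print("delta_W shape:", len(delta_W), "x", len(delta_W[0]))
```

The product has the same shape as the frozen matrix W even though only B and A hold trainable values, which is why the update can be added to W directly.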

In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048, load_in_4bit=True,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model, r=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
)

## Dataset

In [None]:
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")
dataset = load_dataset("mlabonne/FineTome-100k", split="train")
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(
    lambda examples: {
        "text": [
            tokenizer.apply_chat_template(convo, tokenize=False)
            for convo in examples["conversations"]
        ]
    },
    batched=True
)

In [None]:
# A bare `dataset` expression only displays when it is the last line of a cell, so print it
print(dataset)
print("\nFirst three examples:")
print(dataset[0:3])

## Set Up the Trainer

In [None]:
# Set up trainer
trainer = SFTTrainer(
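    model=model,
    tokenizer=tokenizer,  # reuse the tokenizer that carries the chat template (argument name may vary across TRL versions)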
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        output_dir="outputs",
    ),
)
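
To get a feel for how much data this configuration actually uses, the batch settings can be multiplied out. This is a rough sketch that assumes a single GPU; the helper name `training_budget` is made up for illustration.

```python
def training_budget(per_device_batch_size, grad_accum_steps, max_steps, num_devices=1):
    """Return (effective batch size, total examples consumed) for a step-limited run."""
    # Gradients from several small batches are accumulated before each optimizer step
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return effective_batch, effective_batch * max_steps


effective_batch, examples_seen = training_budget(
    per_device_batch_size=2, grad_accum_steps=4, max_steps=60
)
print("Effective batch size:", effective_batch)
print("Examples seen in 60 steps:", examples_seen)
```

With these values each optimizer step covers 8 examples and the whole run touches 480, so this is a short demonstration run, not a pass over the full dataset.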

## Train the Model

In [None]:
# Train the model
trainer.train()

## Save the Model

In [None]:
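# Save the fine-tuned LoRA adapters (not a merged full model) and the tokenizer
model.save_pretrained("finetuned_Llama_model")
tokenizer.save_pretrained("finetuned_Llama_model")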

## Test the Model