# Fine-tuning DeepSeek-R1-Distill-Qwen-1.5B with QLoRA

This notebook provides a step-by-step guide to fine-tuning the `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B` model using QLoRA (Quantized Low-Rank Adaptation).

**QLoRA** is a Parameter-Efficient Fine-Tuning (PEFT) technique that significantly reduces memory usage by:
1. Quantizing the pre-trained model to 4-bit.
2. Attaching small, trainable Low-Rank Adapters (LoRA) to the model.
3. Training only these adapters while keeping the base model frozen.

This allows fine-tuning large models on relatively modest hardware.
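
To make the memory argument concrete, here is a rough back-of-the-envelope sketch. The parameter count, hidden size, key/value width, and layer count are assumptions chosen for illustration (check `model.config` after loading the model), and the estimate covers weight storage only, not activations, optimizer state, or layers that stay in 16-bit.

```python
# Rough memory arithmetic for QLoRA. Every constant below is an illustrative assumption.
N_PARAMS = 1.5e9   # nominal parameter count implied by the "1.5B" in the model name
HIDDEN = 1536      # assumed hidden size
KV_DIM = 256       # assumed output width of v_proj (grouped-query attention)
N_LAYERS = 28      # assumed number of transformer blocks
RANK = 16          # LoRA rank used later in this notebook


def weight_gib(n_params: float, bits_per_param: float) -> float:
    """Storage needed for n_params weights at the given precision, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """One LoRA adapter adds A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# Adapters on q_proj (HIDDEN -> HIDDEN) and v_proj (HIDDEN -> KV_DIM) in every block.
per_layer = lora_params(HIDDEN, HIDDEN, RANK) + lora_params(HIDDEN, KV_DIM, RANK)
trainable = per_layer * N_LAYERS

print(f"16-bit weights: {weight_gib(N_PARAMS, 16):.2f} GiB")
print(f" 4-bit weights: {weight_gib(N_PARAMS, 4):.2f} GiB")
print(f"LoRA parameters (q_proj + v_proj): {trainable:,}")
print(f"Trainable fraction: {trainable / N_PARAMS:.3%}")
```

Under these assumptions the adapters are a small fraction of a percent of the model, which is why only they need gradients and optimizer state.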

**Prerequisites:**
- An NVIDIA GPU with CUDA installed (>=12GB VRAM recommended).
- Python 3.8+.
- Necessary libraries (will be installed below).

## 1. Setup: Install Libraries

Uncomment and run the following cell to install the required Python packages. Restart the kernel after installation if prompted.

In [None]:
# !pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 # Adjust cuXXX to your CUDA version
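# !pip install transformers==4.38.2 # Pinned for compatibility with TRL and PEFT at the time of writing
# !pip install datasets peft accelerate bitsandbytes trl sentencepiece scipy # These are unpinned; if SFTTrainer rejects the arguments used in section 6, pin trl and peft to releases contemporary with transformers 4.38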
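
Because only `transformers` is pinned above, it can help to record which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the `PACKAGES` list and the `installed_versions` helper are names introduced here for illustration.

```python
from importlib import metadata

# Distribution names as they would be passed to pip (assumed to match the install cell).
PACKAGES = ["torch", "transformers", "datasets", "peft", "accelerate", "bitsandbytes", "trl"]


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name:<14} {version or 'not installed'}")
```

Keeping this output alongside your results makes it easier to reproduce a run later.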

## 2. Configuration

Define the model ID, dataset parameters, and training configurations.

In [None]:
import torch

# --- Model Configuration ---
BASE_MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # !!! VERIFY THIS MODEL ID !!!
FINE_TUNED_MODEL_NAME = "deepseek-r1-distill-qwen-1.5b-finetuned-qlora" # Name for saving the adapter

# --- Dataset Configuration (Dummy Example) ---
# For this example, we'll create a tiny dummy dataset.
# In a real scenario, you would load your custom dataset here.
# The dataset should ideally be in a format like: {'text': ["instruction: ... output: ...", ...]}
# or structured for SFTTrainer (e.g., instruction, output columns).
DUMMY_DATASET_PATH = "dummy_instruction_dataset.jsonl"

# --- QLoRA Configuration ---
LORA_R = 16                     # LoRA rank (dimension of the low-rank matrices)
LORA_ALPHA = 32                 # LoRA alpha (scaling factor)
LORA_DROPOUT = 0.05             # Dropout probability for LoRA layers
# Modules to target with LoRA. These are often attention projection layers.
# You might need to inspect the model architecture to find appropriate module names.
# Common names for Qwen-like models: "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"
LORA_TARGET_MODULES = [
    "q_proj",
    "v_proj",
    # "k_proj", # Sometimes k_proj is not targeted or can cause issues, experiment as needed
    # "o_proj",
    # "gate_proj",
    # "up_proj",
    # "down_proj"
]

# --- Training Arguments ---
OUTPUT_DIR = "./results_qlora"
BATCH_SIZE = 1                  # Effective batch size per optimizer step; raise it to accumulate gradients
MICRO_BATCH_SIZE = 1            # Per-device batch size per forward/backward pass (limited by VRAM)
GRADIENT_ACCUMULATION_STEPS = max(1, BATCH_SIZE // MICRO_BATCH_SIZE)
LEARNING_RATE = 2e-4
NUM_TRAIN_EPOCHS = 1            # Start with 1 epoch for testing
MAX_SEQ_LENGTH = 512            # Maximum sequence length for tokenization
LOGGING_STEPS = 10
SAVE_STEPS = 50                 # Save checkpoints every N steps

# --- Device Configuration ---
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
if DEVICE == "cpu":
    print("WARNING: QLoRA is designed for GPUs. Training on CPU will be extremely slow and may not work as intended.")
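
The batch-size and step settings above interact with the dataset size. The sketch below estimates the number of optimizer steps for a run; `optimizer_steps` is a hypothetical helper, and its rounding approximates how `Trainer` has commonly counted steps, which may differ slightly between versions.

```python
import math


def optimizer_steps(num_examples, micro_batch_size, grad_accum_steps, epochs):
    """Return (effective batch size, approximate total optimizer steps)."""
    effective_batch = micro_batch_size * grad_accum_steps
    batches_per_epoch = math.ceil(num_examples / micro_batch_size)
    # One optimizer step is taken after every grad_accum_steps micro-batches.
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return effective_batch, steps_per_epoch * epochs


# The four-example dummy dataset with the settings from this cell.
effective, total = optimizer_steps(num_examples=4, micro_batch_size=1, grad_accum_steps=1, epochs=1)
print(f"effective batch size: {effective}, optimizer steps: {total}")
```

With the four-example dummy dataset this gives 4 optimizer steps, so under this estimate `LOGGING_STEPS = 10` and `SAVE_STEPS = 50` are never reached: expect no intermediate loss lines and no `checkpoint-*` directories until you train on a larger dataset or lower those values.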

## 3. Prepare Dataset

For fine-tuning, you need a dataset. We'll create a small dummy dataset in JSONL format for demonstration. Each line will contain an instruction and its expected output.

A common format for instruction fine-tuning is a text column where each entry is a string like:
```
### Instruction:
{Your instruction here}

### Response:
{Your desired response here}
```
The `SFTTrainer` from `trl` can work well with this format.
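
The template above can be wrapped in small helpers so that training and inference share exactly the same prompt. This is a minimal sketch; `build_prompt`, `build_training_text`, and the optional `eos_token` argument are names introduced here. Whether an end-of-sequence token is appended automatically depends on the tokenizer and TRL version, so the argument is left empty by default.

```python
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"


def build_prompt(instruction: str) -> str:
    """Prompt up to and including the response header (used as-is at inference time)."""
    return PROMPT_TEMPLATE.format(instruction=instruction.strip())


def build_training_text(instruction: str, output: str, eos_token: str = "") -> str:
    """Full training example: prompt, target response, and an optional EOS marker."""
    return build_prompt(instruction) + output.strip() + eos_token


example = build_training_text("What is the capital of France?", "The capital of France is Paris.")
print(example)
```

Passing `eos_token=tokenizer.eos_token` is one way to teach the model where a response ends, if your setup does not already add it.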

In [None]:
import json
from datasets import load_dataset

# Create a dummy dataset file
dummy_data = [
    {"instruction": "What is the capital of France?", "output": "The capital of France is Paris."},
    {"instruction": "Explain the theory of relativity in simple terms.", "output": "Einstein's theory of relativity has two parts. Special relativity says that the laws of physics are the same for all non-accelerating observers, and that the speed of light in a vacuum is the same for all observers, regardless of the motion of the light source. General relativity explains gravity as a curvature of spacetime caused by mass and energy."},
    {"instruction": "Write a short poem about a cat.", "output": "A furry friend, with eyes so bright,\nChasing shadows in the fading light.\nA gentle purr, a soft, warm paw,\nThe finest creature, by nature's law."},
    {"instruction": "Translate 'Hello, world!' to Spanish.", "output": "¡Hola, mundo!"}
]

with open(DUMMY_DATASET_PATH, 'w') as f:
    for item in dummy_data:
        # We will format it into a single text field for SFTTrainer
        # You can also have separate 'instruction' and 'output' columns and use a formatting_func
        text_data = f"### Instruction:\n{item['instruction']}\n\n### Response:\n{item['output']}"
        f.write(json.dumps({"text": text_data}) + "\n")

print(f"Dummy dataset created at: {DUMMY_DATASET_PATH}")

# Load the dataset
dataset = load_dataset("json", data_files=DUMMY_DATASET_PATH, split="train")

print("\nDataset loaded:")
print(dataset)
print("\nExample entry:")
print(dataset[0])

## 4. Load Base Model & Tokenizer for QLoRA

We'll load the base model in 4-bit precision using `bitsandbytes` and configure its tokenizer.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Configure BitsAndBytes for 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",                # Use NF4 (NormalFloat4) data type for weights
    bnb_4bit_compute_dtype=torch.bfloat16,    # Compute in bfloat16; on GPUs without bf16 support, change this to torch.float16
    bnb_4bit_use_double_quant=True,           # Use a second quantization after the first one to save more memory
)

print(f"Loading base model: {BASE_MODEL_ID} with QLoRA config...")
try:
    model = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL_ID,
        quantization_config=bnb_config,
        device_map="auto", # Automatically distribute model layers (recommended for QLoRA)
        trust_remote_code=True # Necessary for some models
    )
    print("Base model loaded successfully.")

    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID, trust_remote_code=True)
    # Set padding token if not already set
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right" # Important for Causal LM
    print("Tokenizer loaded and configured.")

except Exception as e:
    print(f"Error loading model or tokenizer: {e}")
    print("Please ensure BASE_MODEL_ID is correct, you have internet access, and your GPU/CUDA setup is compatible with bitsandbytes.")
    model = None
    tokenizer = None
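
To give some intuition for what 4-bit loading does, here is a toy, pure-Python sketch of block-wise absmax quantization to 16 levels. It is a simplification: it uses evenly spaced levels rather than the NF4 codebook that `bitsandbytes` uses, and the block sizes in `constant_overhead_bits` (64 for weights, 256 for the second-level constants) are the ones described for QLoRA, assumed here rather than read from the library.

```python
def quantize_block(values, levels=16):
    """Map floats to integer codes in [0, levels - 1], scaled by the block's absolute maximum."""
    absmax = max(abs(v) for v in values) or 1.0  # avoid dividing by zero for all-zero blocks
    half = (levels - 1) / 2
    codes = [round((v / absmax + 1) * half) for v in values]
    return codes, absmax


def dequantize_block(codes, absmax, levels=16):
    """Invert quantize_block up to rounding error."""
    half = (levels - 1) / 2
    return [(c / half - 1) * absmax for c in codes]


def quantize_blockwise(values, block_size=64, levels=16):
    """Quantize each block independently so one outlier only affects its own block."""
    return [quantize_block(values[i:i + block_size], levels)
            for i in range(0, len(values), block_size)]


def constant_overhead_bits(block_size=64, double_quant=False, second_block=256):
    """Extra bits per weight spent on storing the per-block scaling constants."""
    if not double_quant:
        return 32 / block_size  # one fp32 constant per block
    # 8-bit constants, plus one fp32 constant per group of second_block constants
    return 8 / block_size + 32 / (block_size * second_block)


weights = [0.02, -0.5, 0.13, 0.9, -0.01, 0.3, -0.77, 0.0]
blocks = quantize_blockwise(weights, block_size=4)
restored = [v for codes, absmax in blocks for v in dequantize_block(codes, absmax)]
print(blocks)
print([round(v, 3) for v in restored])
print(constant_overhead_bits(), constant_overhead_bits(double_quant=True))
```

The last line illustrates the purpose of `bnb_4bit_use_double_quant`: quantizing the scaling constants themselves lowers the per-weight overhead.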

## 5. Configure LoRA

Set up the LoRA configuration using `peft` and apply it to the loaded model.

In [None]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

if model is not None:
    # Prepare the quantized model for k-bit training (important for QLoRA).
    # This helper makes the model compatible with k-bit training and PEFT:
    # - Freezes the base model's parameters
    # - Upcasts the remaining non-quantized parameters (e.g. layer norms) to fp32 for stability
    # - Enables gradient checkpointing by default (use_gradient_checkpointing=True)
    # The model was loaded with device_map="auto", so it is already placed on the GPU;
    # calling model.to(...) on a bitsandbytes 4-bit model is not supported.
    model = prepare_model_for_kbit_training(model)

    lora_config = LoraConfig(
        r=LORA_R,
        lora_alpha=LORA_ALPHA,
        target_modules=LORA_TARGET_MODULES,
        lora_dropout=LORA_DROPOUT,
        bias="none",  # Typically set to 'none' for LoRA
        task_type="CAUSAL_LM" # Important for Causal Language Models
    )

    model = get_peft_model(model, lora_config)
    print("\nLoRA configured and applied to the model.")
    model.print_trainable_parameters()
else:
    print("Model not loaded, skipping LoRA configuration.")
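
For intuition about what the adapter computes, the sketch below implements the LoRA forward pass on plain Python lists: the frozen weight `W` is applied as usual, and a low-rank update `B @ A` scaled by `alpha / r` is added on top. The tiny matrices are made up for illustration, and dropout is omitted.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x) for a single input vector x."""
    base = matvec(W, x)                # frozen base projection
    update = matvec(B, matvec(A, x))   # low-rank path: down-project with A, up-project with B
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # frozen weight, 2 x 3
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # trainable, r x 3 with r = 2
B_zero = [[0.0, 0.0], [0.0, 0.0]]        # B starts at zero, so the adapter is initially a no-op
B_trained = [[1.0, 0.0], [0.0, 1.0]]     # stand-in for B after some training
x = [1.0, 2.0, 3.0]

before = lora_forward(W, A, B_zero, x, alpha=4, r=2)
after = lora_forward(W, A, B_trained, x, alpha=4, r=2)
print(before, after)
```

With this notebook's settings (`LORA_ALPHA = 32`, `LORA_R = 16`) the scale factor `alpha / r` is 2.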

## 6. Set Training Arguments & Initialize Trainer

We'll use `TrainingArguments` from `transformers` and `SFTTrainer` from the `trl` library. `SFTTrainer` is specifically designed for supervised fine-tuning of language models and simplifies the process.

In [None]:
from transformers import TrainingArguments
from trl import SFTTrainer

if model and tokenizer and dataset:
    training_args = TrainingArguments(
        output_dir=OUTPUT_DIR,
        per_device_train_batch_size=MICRO_BATCH_SIZE,
        gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
        learning_rate=LEARNING_RATE,
        num_train_epochs=NUM_TRAIN_EPOCHS,
        max_grad_norm=0.3, # Gradient clipping
        lr_scheduler_type="cosine", # Learning rate scheduler
        warmup_ratio=0.03, # Warmup steps for learning rate scheduler
        logging_dir=f"{OUTPUT_DIR}/logs",
        logging_steps=LOGGING_STEPS,
        save_steps=SAVE_STEPS,
        save_total_limit=2, # Keep only the last 2 checkpoints
        fp16=bnb_config.bnb_4bit_compute_dtype == torch.float16, # Use fp16 mixed precision only when the 4-bit compute dtype is float16
        bf16=torch.cuda.is_bf16_supported() and bnb_config.bnb_4bit_compute_dtype == torch.bfloat16, # Enable bfloat16 if supported and configured
        optim="paged_adamw_32bit", # Use paged AdamW optimizer for memory efficiency
        # report_to="tensorboard" # Optional: for logging to TensorBoard
    )

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        args=training_args,
        train_dataset=dataset,
        dataset_text_field="text",  # Name of the text field in your dataset
        max_seq_length=MAX_SEQ_LENGTH, # Max sequence length for tokenization
        packing=False, # Set to True to pack multiple short examples into one sequence for efficiency
        # formatting_func=formatting_prompts_func, # If your dataset needs custom formatting
        # peft_config=lora_config, # Not needed here: LoRA was already applied with get_peft_model in section 5
    )
    print("\nTrainer initialized.")
else:
    print("Model, tokenizer, or dataset not available. Skipping trainer initialization.")
    trainer = None
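
The `lr_scheduler_type="cosine"` and `warmup_ratio=0.03` settings above describe a linear warmup followed by a cosine decay. The following is a minimal re-implementation of that shape for inspection; `lr_at_step` is a hypothetical helper, and the warmup length is assumed to be the ratio times the total number of steps, rounded up.

```python
import math


def lr_at_step(step, total_steps, base_lr=2e-4, warmup_ratio=0.03):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 15, 30, 500, 1000):
    print(step, f"{lr_at_step(step, total_steps=1000):.2e}")
```

Plotting or printing a few values like this before a long run is a cheap way to confirm that the peak learning rate and warmup length are what you intended.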

## 7. Start Fine-tuning

Now, we can start the fine-tuning process. This will take time depending on your dataset size, hardware, and training configuration.

In [None]:
if trainer:
    print("\nStarting fine-tuning...")
    try:
        trainer.train()
        print("Fine-tuning completed.")
    except Exception as e:
        print(f"Error during training: {e}")
else:
    print("Trainer not initialized. Skipping training.")

## 8. Save the Fine-tuned Model (Adapter)

After training, save the LoRA adapter. It contains only the trained low-rank weights and their configuration; the base model weights are not included.

In [None]:
if trainer:
    print(f"\nSaving fine-tuned LoRA adapter to: {FINE_TUNED_MODEL_NAME}")
    try:
        # `model` is the PEFT-wrapped model from section 5, so save_pretrained writes only
        # the adapter weights and adapter_config.json, not the 4-bit base model.
        model.save_pretrained(FINE_TUNED_MODEL_NAME)
        tokenizer.save_pretrained(FINE_TUNED_MODEL_NAME) # Also save the tokenizer
        print(f"Adapter and tokenizer saved to {FINE_TUNED_MODEL_NAME}")

        # Intermediate checkpoints, if any were written, are under OUTPUT_DIR/checkpoint-XXXX/.
        # The explicitly saved adapter is in FINE_TUNED_MODEL_NAME.
    except Exception as e:
        print(f"Error saving model: {e}")
else:
    print("Trainer not initialized or training failed. Skipping model saving.")
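
Before moving on to inference, it can be useful to confirm that the adapter directory contains the expected files. This sketch checks for `adapter_config.json` and one of the two weight file names mentioned in this notebook; `missing_adapter_files` is a hypothetical helper, and the temporary directory only stands in for `FINE_TUNED_MODEL_NAME`.

```python
import tempfile
from pathlib import Path


def missing_adapter_files(adapter_dir):
    """Return a list describing expected adapter files that are absent."""
    root = Path(adapter_dir)
    missing = []
    if not (root / "adapter_config.json").is_file():
        missing.append("adapter_config.json")
    weight_names = ("adapter_model.safetensors", "adapter_model.bin")
    if not any((root / name).is_file() for name in weight_names):
        missing.append(" or ".join(weight_names))
    return missing


# Demonstration on a throwaway directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    empty_report = missing_adapter_files(tmp)
    (Path(tmp) / "adapter_config.json").write_text("{}")
    (Path(tmp) / "adapter_model.safetensors").write_bytes(b"")
    complete_report = missing_adapter_files(tmp)

print("empty directory:", empty_report)
print("after writing placeholder files:", complete_report)
```

In the notebook you would call `missing_adapter_files(FINE_TUNED_MODEL_NAME)` and expect an empty list.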

## 9. Inference with the Fine-tuned Model

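To use the fine-tuned model, load the base model again in 4-bit and apply the trained LoRA adapter on top of it. If the training model from the earlier cells is still in memory, this creates a second copy on the GPU; restart the kernel or delete the training objects first if VRAM is tight.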

In [None]:
from peft import PeftModel
import os

if os.path.exists(FINE_TUNED_MODEL_NAME):
    print("\nLoading base model and fine-tuned adapter for inference...")
    
    # Load the base model in 4-bit (as it was during training)
    base_model_for_inference = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL_ID,
        quantization_config=bnb_config, # Same BitsAndBytesConfig as training
        device_map="auto",
        trust_remote_code=True
    )

    # Load the LoRA adapter
    # The FINE_TUNED_MODEL_NAME directory should contain 'adapter_model.safetensors' or 'adapter_model.bin'
    # and 'adapter_config.json'
    ft_model = PeftModel.from_pretrained(base_model_for_inference, FINE_TUNED_MODEL_NAME)
    
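    # Optional: merge the LoRA weights into the base model to remove the adapter overhead at inference time.
    # Merging into a 4-bit quantized base can introduce rounding differences; for a clean merge,
    # consider reloading the base model in 16-bit precision before calling merge_and_unload().
    # print("Merging LoRA adapter with base model...")
    # ft_model = ft_model.merge_and_unload() # Returns the base model with the adapter weights folded in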
    
    ft_tokenizer = AutoTokenizer.from_pretrained(FINE_TUNED_MODEL_NAME) # Load tokenizer saved with adapter
    if ft_tokenizer.pad_token is None:
        ft_tokenizer.pad_token = ft_tokenizer.eos_token
    ft_tokenizer.padding_side = "left" # Left-pad for generation with decoder-only models (matters for batched prompts)

    print("Fine-tuned model ready for inference.")

    # --- Test Inference ---
    prompt = "### Instruction:\nWhat is QLoRA?\n\n### Response:\n" # Use the same format as training data
    print(f"\nTest Prompt: {prompt}")

    inputs = ft_tokenizer(prompt, return_tensors="pt", padding=True, truncation=True, max_length=MAX_SEQ_LENGTH).to(DEVICE)
    
    # Ensure the model is in evaluation mode
    ft_model.eval()
    with torch.no_grad():
        outputs = ft_model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_new_tokens=100,
            pad_token_id=ft_tokenizer.pad_token_id, # Set explicitly so generate() does not have to guess a padding token
            eos_token_id=ft_tokenizer.eos_token_id,
            do_sample=True, # For more creative responses
            top_p=0.9,
            temperature=0.7
        )
    
    response_text = ft_tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"\nGenerated Response:\n{response_text}")

else:
    print(f"Fine-tuned model adapter not found at {FINE_TUNED_MODEL_NAME}. Skipping inference test.")
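
The decoded text above includes the prompt as well as the model's continuation. A small post-processing step can isolate the answer; the sketch below relies only on the `### Instruction:` / `### Response:` template used in this notebook, and `extract_response` is a name introduced here.

```python
RESPONSE_MARKER = "### Response:\n"
INSTRUCTION_MARKER = "### Instruction:"


def extract_response(generated_text: str) -> str:
    """Return the text after the first response header, cut at any follow-up instruction."""
    _, found, tail = generated_text.partition(RESPONSE_MARKER)
    if not found:
        # The template was not reproduced; fall back to the whole text.
        return generated_text.strip()
    # Models sometimes keep going and invent a new instruction; drop that part.
    return tail.split(INSTRUCTION_MARKER, 1)[0].strip()


sample = (
    "### Instruction:\nWhat is QLoRA?\n\n"
    "### Response:\nA memory-efficient fine-tuning method.\n\n"
    "### Instruction:\nAnother question"
)
print(extract_response(sample))
```

An alternative is to decode only the newly generated tokens by slicing `outputs[0]` from the prompt length onward before calling `decode`.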

## 10. Conclusion and Next Steps

This notebook demonstrated the process of fine-tuning `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B` using QLoRA.

**Key Takeaways:**
- QLoRA enables fine-tuning large models on limited hardware by quantizing the base model and training small adapters.
- The `peft` and `trl` libraries from Hugging Face significantly simplify this process.
- Careful configuration of model loading, LoRA parameters, and training arguments is crucial.

**Next Steps:**
- **Use Your Custom Dataset:** Replace the dummy dataset with your actual data, ensuring it's properly formatted.
- **Hyperparameter Tuning:** Experiment with `LORA_R`, `LORA_ALPHA`, `LEARNING_RATE`, `NUM_TRAIN_EPOCHS`, `BATCH_SIZE`, etc., to optimize performance for your specific task and dataset.
- **Target Module Selection:** Investigate the model architecture to choose the most effective `LORA_TARGET_MODULES`.
- **Evaluation:** Implement a robust evaluation strategy to measure the performance of your fine-tuned model on a held-out test set.
- **Advanced Techniques:** Explore techniques like packing (in `SFTTrainer`) for efficiency, or different schedulers.
- **Push to Hub:** Share your fine-tuned adapters on the Hugging Face Hub.
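
As a starting point for the evaluation step, the sketch below holds out part of a list of records with a fixed seed so the split is reproducible. It uses only the standard library; `split_records` and the sample records are illustrative, and the `datasets` library offers its own `Dataset.train_test_split` if you prefer to split after loading.

```python
import random


def split_records(records, test_fraction=0.1, seed=42):
    """Shuffle a copy of records and return (train, held_out) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one held-out example whenever there is more than one record.
    n_test = max(1, round(len(shuffled) * test_fraction)) if len(shuffled) > 1 else 0
    return shuffled[n_test:], shuffled[:n_test]


records = [{"text": f"example {i}"} for i in range(10)]
train_records, test_records = split_records(records, test_fraction=0.2)
print(len(train_records), len(test_records))
```

Write the two lists to separate JSONL files and never train on the held-out one, so that metrics computed on it reflect generalization rather than memorization.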