# QLoRA (Quantized Low-Rank Adaptation) Implementation from Scratch

## Educational Notebook - Ultra Minimal Resource Requirements

This notebook implements QLoRA from first principles, combining 4-bit quantization with LoRA for extreme memory efficiency.

**Resource Requirements:**
- RAM: 4GB minimum
- GPU: 4GB VRAM (or CPU)
- Storage: ~300MB (quantized model)

**What is QLoRA?**

QLoRA extends LoRA by:
1. **4-bit NormalFloat (NF4) quantization** of base model weights
2. **Double quantization** of quantization constants
3. **Paged optimizers** for efficient memory management
4. **LoRA adapters** trained in higher precision (16-bit BFloat16 in the paper; FP32 in this notebook)

**Key Innovation:**
```
Base Model: Quantized to 4-bit (W_4bit)
LoRA: Higher-precision adapters (B, A in 16-bit)
Forward: Dequantize(W_4bit) + BA
Backward: Only update B and A
```

**Memory Reduction:**
- FP16 model: 16 bits/param
- 4-bit quantized: 4 bits/param (4× reduction)
- With LoRA: Train <1% of params
- **Result:** the QLoRA paper reports fine-tuning a 65B-parameter model on a single 48 GB GPU, where full 16-bit fine-tuning needs more than 780 GB
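
As a rough illustration of the bits-per-parameter figures above, the sketch below computes weight-only memory for a model about the size of GPT-2 small. The parameter count and the helper name `weight_memory_mb` are assumptions made for this example. Gradients, optimizer state, and activations are ignored.

```python
def weight_memory_mb(num_params: int, bits_per_param: float) -> float:
    """Weight-only memory in MiB (ignores gradients, optimizer state, activations)."""
    return num_params * bits_per_param / 8 / (1024 ** 2)


num_params = 124_000_000  # assumed: roughly GPT-2 small

for label, bits in [("FP32", 32), ("FP16", 16), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_mb(num_params, bits):7.1f} MiB")

# Going from 16 bits to 4 bits per parameter shrinks weight storage 4x
ratio = weight_memory_mb(num_params, 16) / weight_memory_mb(num_params, 4)
print(f"FP16 / 4-bit ratio: {ratio:.1f}x")
```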

In [None]:
# Install required packages
!pip install -q transformers torch datasets accelerate bitsandbytes scipy

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, BitsAndBytesConfig
import math
import numpy as np
from typing import Optional, List, Tuple
import warnings
warnings.filterwarnings('ignore')

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
print(f"PyTorch version: {torch.__version__}")

# Check if bitsandbytes is available
try:
    import bitsandbytes as bnb
    print(f"bitsandbytes version: {bnb.__version__}")
    print("✓ 4-bit quantization available")
except ImportError:
    print("⚠ bitsandbytes not available - will use manual quantization")

## Part 1: Understanding Quantization

### 4-bit NormalFloat (NF4) Quantization

NF4 is designed for weights that follow a normal distribution (common in neural networks).

**Key idea:** 
- Split the normal distribution into 16 equal-probability bins
- Each weight maps to one of 16 quantized values (4 bits)
- Preserves more precision around zero (where most weights concentrate)
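
To make the equal-probability idea concrete, here is a minimal sketch that derives 16 levels from the quantiles of a standard normal distribution. It is a simplified construction for illustration only: the actual NF4 table used in the next cell is built slightly differently (for example, it contains an exact zero), so the numbers will not match it.

```python
import torch

num_levels = 16  # 4 bits -> 2**4 levels
normal = torch.distributions.Normal(0.0, 1.0)

# Midpoint probability of each of the 16 equal-probability bins
probs = (torch.arange(num_levels, dtype=torch.float32) + 0.5) / num_levels

# The inverse CDF gives the value at each probability; rescale into [-1, 1]
levels = normal.icdf(probs)
levels = levels / levels.abs().max()

# Spacing between neighbouring levels: small near zero, large in the tails
gaps = levels[1:] - levels[:-1]
print("levels:", [round(v, 3) for v in levels.tolist()])
print(f"gap near zero: {gaps[7].item():.3f}, gap at the tail: {gaps[0].item():.3f}")
```

The levels are packed more tightly around zero, which is where most normally distributed weights fall.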

In [None]:
class NF4Quantizer:
    """
    4-bit NormalFloat quantization.
    
    Maps normal distribution to 16 quantized levels.
    """
    def __init__(self):
        # NF4 quantization levels (optimized for normal distribution)
        # These values are chosen to split a standard normal distribution into equal probability bins
        self.nf4_values = torch.tensor([
            -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
            -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
            0.07958029955625534, 0.16093020141124725, 0.24611230194568634, 0.33791524171829224,
            0.44070982933044434, 0.5626170039176941, 0.7229568362236023, 1.0
        ])
    
    def quantize(self, weights: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Quantize weights to 4-bit NF4.
        
        Args:
            weights: Float tensor to quantize
            
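        Returns:
            quantized: NF4 indices in 0..15, stored one per int8 for clarity (not bit-packed)
            scale: Scaling factor for dequantization
        """
        # Compute a single per-tensor scale (max absolute value).
        # Production NF4 (e.g. bitsandbytes) uses one scale per small block of weights instead.
        scale = weights.abs().max()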
        
        if scale == 0:
            # Handle zero weights
            return torch.zeros_like(weights, dtype=torch.int8), scale
        
        # Normalize weights to [-1, 1]
        normalized = weights / scale
        
        # Find nearest NF4 value for each weight
        nf4_values = self.nf4_values.to(weights.device)
        
        # Compute distances to all NF4 values
        distances = (normalized.unsqueeze(-1) - nf4_values).abs()
        
        # Get index of nearest value
        quantized = distances.argmin(dim=-1).to(torch.int8)
        
        return quantized, scale
    
    def dequantize(self, quantized: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        """
        Dequantize 4-bit values back to float.
        
        Args:
            quantized: 4-bit indices
            scale: Scaling factor
            
        Returns:
            Dequantized float tensor
        """
        nf4_values = self.nf4_values.to(quantized.device)
        
        # Map indices to NF4 values
        dequantized = nf4_values[quantized.long()]
        
        # Scale back
        return dequantized * scale


# Test quantization
print("Testing NF4 Quantization:")
print("=" * 50)

quantizer = NF4Quantizer()

# Create sample weights (normal distribution)
weights = torch.randn(1000)
print(f"Original weights: mean={weights.mean():.4f}, std={weights.std():.4f}")
print(f"Original memory: {weights.numel() * weights.element_size()} bytes")

# Quantize
quantized, scale = quantizer.quantize(weights)
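print(f"\nQuantized to 4-bit indices: {quantized.shape}")
# Indices are stored one per int8 byte here, so the measured reduction vs FP32 is 4x;
# packing two indices per byte would give the full 8x.
print(f"Quantized memory (int8 storage): {quantized.numel() * quantized.element_size()} bytes")
print(f"Memory reduction: {weights.numel() * weights.element_size() / (quantized.numel() * quantized.element_size()):.1f}x")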

# Dequantize
dequantized = quantizer.dequantize(quantized, scale)
print(f"\nDequantized: mean={dequantized.mean():.4f}, std={dequantized.std():.4f}")

# Measure error
mse = ((weights - dequantized) ** 2).mean()
print(f"Quantization MSE: {mse:.6f}")
print(f"Relative error: {(mse / weights.var()).sqrt():.2%}")

## Part 2: Quantized Linear Layer

Create a linear layer that stores weights in 4-bit but computes in full precision.
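
The layer in this notebook keeps one 4-bit index per `int8` element, which is easy to read but uses 8 bits of storage per weight. A minimal sketch of how two indices could share one byte is shown below. The helper names `pack_4bit` and `unpack_4bit` are hypothetical and are not used elsewhere in the notebook.

```python
import torch


def pack_4bit(indices: torch.Tensor) -> torch.Tensor:
    """Pack a flat tensor of 4-bit indices (values 0..15, even length) into uint8."""
    idx = indices.to(torch.uint8).reshape(-1, 2)
    # First index goes in the high nibble, second in the low nibble
    return (idx[:, 0] << 4) | idx[:, 1]


def unpack_4bit(packed: torch.Tensor) -> torch.Tensor:
    """Recover the original 4-bit indices from the packed uint8 tensor."""
    high = packed >> 4
    low = packed & 0x0F
    return torch.stack([high, low], dim=1).reshape(-1)


indices = torch.randint(0, 16, (1000,), dtype=torch.int8)
packed = pack_4bit(indices)
restored = unpack_4bit(packed)

print(f"unpacked: {indices.numel() * indices.element_size()} bytes")
print(f"packed:   {packed.numel() * packed.element_size()} bytes")
print(f"round trip exact: {torch.equal(restored, indices.to(torch.uint8))}")
```

Packing halves the storage again, at the cost of an extra unpack step before the lookup in `dequantize`.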

In [None]:
class QuantizedLinear(nn.Module):
    """
    Linear layer with 4-bit quantized weights.
    
    Stores weights in 4-bit, dequantizes during forward pass.
    """
    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.in_features = linear.in_features
        self.out_features = linear.out_features
        
        # Quantize the weights once; the same quantizer is reused for dequantization
        self.quantizer = NF4Quantizer()
        with torch.no_grad():
            quantized_weight, scale = self.quantizer.quantize(linear.weight.data)
        
        # Store quantized weights and scale (not as parameters, so not trained)
        self.register_buffer('quantized_weight', quantized_weight)
        self.register_buffer('scale', scale)
        
        # Store bias if present
        if linear.bias is not None:
            self.register_buffer('bias', linear.bias.data.clone())
        else:
            self.bias = None
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass: dequantize weights on-the-fly and compute.
        """
        # Dequantize weights
        weight = self.quantizer.dequantize(self.quantized_weight, self.scale)
        
        # Standard linear operation
        output = F.linear(x, weight, self.bias)
        return output
    
    def extra_repr(self) -> str:
        return f'in_features={self.in_features}, out_features={self.out_features}, quantized=4-bit'


# Test quantized linear layer
print("\nTesting Quantized Linear Layer:")
print("=" * 50)

# Create original linear layer
linear = nn.Linear(512, 512)
original_size = sum(p.numel() * p.element_size() for p in linear.parameters())
print(f"Original linear layer: {original_size / 1024:.2f} KB")

# Quantize it
quantized_linear = QuantizedLinear(linear)
quantized_size = sum(b.numel() * b.element_size() for b in quantized_linear.buffers())
print(f"Quantized linear layer: {quantized_size / 1024:.2f} KB")
print(f"Memory reduction: {original_size / quantized_size:.1f}x")

# Test forward pass
x = torch.randn(4, 10, 512)
out_original = linear(x)
out_quantized = quantized_linear(x)

# Compare outputs
mse = ((out_original - out_quantized) ** 2).mean()
print(f"\nForward pass MSE: {mse:.6f}")
print(f"Relative error: {(mse / out_original.var()).sqrt():.2%}")

## Part 3: QLoRA Layer - Combining Quantization with LoRA

Now we combine 4-bit quantized weights with full-precision LoRA adapters.
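
Before the full layer, here is a small sketch of the two properties that make this combination work. With `B` initialized to zero the adapter leaves the base output untouched, and once `B` is non-zero the effective weight update `B @ A` has rank at most `r`. The shapes and values are arbitrary, and `W` simply stands in for a dequantized base weight.

```python
import torch

torch.manual_seed(0)
in_features, out_features, rank, alpha = 32, 48, 4, 8
scaling = alpha / rank

W = torch.randn(out_features, in_features)   # stands in for the dequantized base weight
A = torch.randn(rank, in_features) * 0.01    # trainable
B = torch.zeros(out_features, rank)          # trainable, zero at the start

x = torch.randn(5, in_features)
base = x @ W.T
adapted = base + (x @ A.T @ B.T) * scaling

# With B = 0 the adapter contributes nothing, so training starts from the base model
unchanged_at_init = torch.allclose(base, adapted)
print(f"output unchanged at init: {unchanged_at_init}")

# Once B is non-zero, the effective update has rank at most `rank`
B = torch.randn(out_features, rank)
delta_W = scaling * (B @ A)
update_rank = torch.linalg.matrix_rank(delta_W).item()
print(f"delta_W shape: {tuple(delta_W.shape)}, rank: {update_rank}")
```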

In [None]:
class QLoRALayer(nn.Module):
    """
    LoRA layer for quantized linear layers.
    
    Base weights: 4-bit quantized (frozen)
    LoRA adapters: Full precision (trainable)
    """
    def __init__(
        self,
        quantized_linear: QuantizedLinear,
        rank: int = 8,
        alpha: float = 16,
        dropout: float = 0.1
    ):
        super().__init__()
        self.quantized_linear = quantized_linear
        self.in_features = quantized_linear.in_features
        self.out_features = quantized_linear.out_features
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank
        
        # LoRA adapters in full precision (FP32 or FP16)
        self.lora_A = nn.Parameter(torch.zeros(rank, self.in_features))
        self.lora_B = nn.Parameter(torch.zeros(self.out_features, rank))
        
        # Dropout
        self.dropout = nn.Dropout(p=dropout) if dropout > 0 else nn.Identity()
        
        # Initialize A with Kaiming uniform
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        # B initialized to zero
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward: Dequantize base weights + LoRA adaptation
        """
        # Base output (from quantized weights)
        base_output = self.quantized_linear(x)
        
        # LoRA adaptation (full precision)
        lora_output = (self.dropout(x) @ self.lora_A.T @ self.lora_B.T) * self.scaling
        
        return base_output + lora_output
    
    def extra_repr(self) -> str:
        return (f'in_features={self.in_features}, out_features={self.out_features}, '
                f'rank={self.rank}, alpha={self.alpha}, base_quantized=4-bit')


# Test QLoRA layer
print("\nTesting QLoRA Layer:")
print("=" * 50)

# Create QLoRA layer
qlora_layer = QLoRALayer(quantized_linear, rank=8, alpha=16)

# Count parameters (only the quantized weight; scale and bias buffers are excluded)
base_params = qlora_layer.quantized_linear.quantized_weight.numel()
lora_params = sum(p.numel() for p in [qlora_layer.lora_A, qlora_layer.lora_B])
trainable_params = sum(p.numel() for p in qlora_layer.parameters() if p.requires_grad)

print(f"Base (quantized) parameters: {base_params:,}")
print(f"LoRA parameters: {lora_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"Trainable percentage: {100 * trainable_params / (base_params + lora_params):.2f}%")

# Memory calculation
base_memory = sum(b.numel() * b.element_size() for b in qlora_layer.quantized_linear.buffers())
lora_memory = sum(p.numel() * p.element_size() for p in [qlora_layer.lora_A, qlora_layer.lora_B])
print(f"\nBase memory (4-bit): {base_memory / 1024:.2f} KB")
print(f"LoRA memory (FP32): {lora_memory / 1024:.2f} KB")
print(f"Total memory: {(base_memory + lora_memory) / 1024:.2f} KB")

# Compare with full precision
full_precision_memory = (512 * 512 * 4)  # FP32
print(f"\nFull precision would be: {full_precision_memory / 1024:.2f} KB")
print(f"QLoRA memory reduction: {full_precision_memory / (base_memory + lora_memory):.1f}x")

## Part 4: Load Model with QLoRA

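Load GPT-2 and apply QLoRA to it. This notebook uses the custom quantizer defined above so every step stays visible; for real workloads, bitsandbytes provides a much more efficient 4-bit implementation.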
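
One detail is worth checking before targeting `c_attn`. In the Hugging Face GPT-2 implementation this module is a `Conv1D` layer rather than `nn.Linear`, and it stores its weight as `(in_features, out_features)`. The sketch below uses plain tensors (no model download) to show that such a layer is equivalent to a linear layer with the weight transposed. The shapes are arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
in_features, out_features = 8, 24

# Conv1D-style parameters: weight is (in_features, out_features)
conv_weight = torch.randn(in_features, out_features)
conv_bias = torch.randn(out_features)

x = torch.randn(2, 5, in_features)

# Conv1D-style forward: x @ W + b
conv_out = x @ conv_weight + conv_bias

# nn.Linear-style forward expects (out_features, in_features), i.e. the transpose
linear_out = F.linear(x, conv_weight.T, conv_bias)

print(f"same output: {torch.allclose(conv_out, linear_out, atol=1e-5)}")
```

Any code that swaps in a replacement for `c_attn` therefore has to match on `Conv1D` and transpose the weight first.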

In [None]:
from transformers.pytorch_utils import Conv1D  # GPT-2 uses Conv1D instead of nn.Linear


def conv1d_to_linear(conv: Conv1D) -> nn.Linear:
    """Convert a GPT-2 Conv1D (weight shape: in x out) into an equivalent nn.Linear."""
    in_features, out_features = conv.weight.shape
    linear = nn.Linear(in_features, out_features, bias=conv.bias is not None)
    with torch.no_grad():
        linear.weight.copy_(conv.weight.T)
        if conv.bias is not None:
            linear.bias.copy_(conv.bias)
    return linear


def apply_qlora_to_model(
    model: nn.Module,
    target_modules: Optional[List[str]] = None,
    rank: int = 8,
    alpha: float = 16,
    dropout: float = 0.1,
    use_custom_quantization: bool = False
) -> nn.Module:
    """
    Apply QLoRA to specified modules in the model.
    
    Args:
        model: Pre-trained model
        target_modules: Names of modules to apply QLoRA to (default: ['c_attn'])
        rank: LoRA rank
        alpha: LoRA alpha
        dropout: Dropout probability
        use_custom_quantization: Kept for API clarity. A bitsandbytes path
            (bnb.nn.Linear4bit) is not wired in, so the custom quantizer is always used.
    """
    target_modules = target_modules or ['c_attn']
    
    # Freeze every base parameter; only the LoRA adapters added below are trained
    for param in model.parameters():
        param.requires_grad = False
    
    if not use_custom_quantization:
        print("Note: bitsandbytes path not implemented here; using custom quantization")
    
    # Collect targets first so the module tree is not modified while iterating
    targets = [
        (name, module) for name, module in model.named_modules()
        if any(target in name for target in target_modules)
        and isinstance(module, (nn.Linear, Conv1D))
    ]
    
    for name, module in targets:
        parent_name, _, attr_name = name.rpartition('.')
        parent = model.get_submodule(parent_name) if parent_name else model
        linear = conv1d_to_linear(module) if isinstance(module, Conv1D) else module
        qlora_layer = QLoRALayer(QuantizedLinear(linear), rank=rank, alpha=alpha, dropout=dropout)
        setattr(parent, attr_name, qlora_layer)
        print(f"Applied QLoRA (custom) to: {name}")
    
    return model


def count_parameters(model: nn.Module) -> dict:
    """
    Count total and trainable parameters.
    """
    total_params = 0
    trainable_params = 0
    quantized_params = 0
    
    for name, param in model.named_parameters():
        total_params += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    
    # Count quantized buffers
    for name, buffer in model.named_buffers():
        if 'quantized_weight' in name:
            quantized_params += buffer.numel()
    
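    # Quantized weights live in buffers, so include them in the denominator
    all_weights = total_params + quantized_params
    return {
        'total': total_params,
        'trainable': trainable_params,
        'quantized': quantized_params,
        'trainable_pct': 100 * trainable_params / all_weights if all_weights > 0 else 0
    }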


# Load GPT-2 model
print("Loading GPT-2 small model...")
print("=" * 50)

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Load model normally first
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,  # Start with FP32 for custom quantization
)

print("\nOriginal Model:")
print("=" * 50)
original_stats = count_parameters(model)
print(f"Total parameters: {original_stats['total']:,}")

# Calculate original memory
original_memory = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"Original memory: {original_memory / (1024**2):.2f} MB")

# Apply QLoRA
print("\nApplying QLoRA to model...")
print("=" * 50)
model = apply_qlora_to_model(
    model,
    target_modules=['c_attn'],
    rank=8,
    alpha=16,
    dropout=0.1,
    use_custom_quantization=True
)

# Calculate stats after QLoRA
print("\nQLoRA Model:")
print("=" * 50)
qlora_stats = count_parameters(model)
print(f"Total parameters: {qlora_stats['total']:,}")
print(f"Trainable parameters: {qlora_stats['trainable']:,}")
print(f"Quantized parameters: {qlora_stats['quantized']:,}")
print(f"Trainable percentage: {qlora_stats['trainable_pct']:.3f}%")

# Calculate memory with quantization
trainable_memory = sum(p.numel() * p.element_size() for p in model.parameters() if p.requires_grad)
quantized_memory = sum(b.numel() * b.element_size() for name, b in model.named_buffers() if 'quantized' in name)
other_memory = sum(p.numel() * p.element_size() for p in model.parameters() if not p.requires_grad)

total_memory = trainable_memory + quantized_memory + other_memory

print(f"\nMemory breakdown:")
print(f"Trainable (LoRA): {trainable_memory / (1024**2):.2f} MB")
print(f"Quantized (4-bit): {quantized_memory / (1024**2):.2f} MB")
print(f"Other (frozen): {other_memory / (1024**2):.2f} MB")
print(f"Total memory: {total_memory / (1024**2):.2f} MB")
print(f"\nMemory reduction vs original: {original_memory / total_memory:.2f}x")

model = model.to(device)
print(f"\nModel moved to: {device}")

## Part 5: Training Data Preparation

Same as regular LoRA - prepare training dataset.

In [None]:
class SimpleTextDataset(Dataset):
    """
    Simple text dataset for causal language modeling.
    """
    def __init__(self, texts: List[str], tokenizer, max_length: int = 128):
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.encodings = []
        
        for text in texts:
            encoded = tokenizer(
                text,
                max_length=max_length,
                truncation=True,
                padding='max_length',
                return_tensors='pt'
            )
            self.encodings.append(encoded)
    
    def __len__(self):
        return len(self.encodings)
    
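    def __getitem__(self, idx):
        item = {key: val[0] for key, val in self.encodings[idx].items()}
        labels = item['input_ids'].clone()
        # Ignore padding positions in the loss (the pad token is the EOS token here)
        labels[item['attention_mask'] == 0] = -100
        item['labels'] = labels
        return item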


# Training data
train_texts = [
    "Quantization reduces model size by representing weights with fewer bits.",
    "QLoRA combines quantization with low-rank adaptation for efficient fine-tuning.",
    "4-bit quantization can reduce memory usage by 8x compared to float32.",
    "NormalFloat quantization is optimized for neural network weight distributions.",
    "Double quantization further compresses the quantization constants themselves.",
    "LoRA adapters remain in full precision while base weights are quantized.",
    "Paged optimizers enable training on GPUs with limited memory.",
    "Parameter-efficient fine-tuning democratizes access to large model training.",
    "Gradient checkpointing trades computation for memory during training.",
    "Mixed precision training uses different precisions for different operations.",
    "Quantization-aware training simulates quantization during the training process.",
    "Post-training quantization converts trained models to lower precision.",
    "Symmetric quantization uses equal ranges for positive and negative values.",
    "Asymmetric quantization can better preserve accuracy for skewed distributions.",
    "Per-channel quantization uses different scales for each output channel.",
    "Knowledge distillation can recover accuracy lost during quantization.",
    "Quantization noise can act as a form of regularization during training.",
    "Dynamic quantization determines scales at runtime based on activations.",
    "Static quantization pre-determines scales using calibration data.",
    "Integer-only inference eliminates floating point operations entirely.",
] * 5

# Create dataset
train_dataset = SimpleTextDataset(train_texts, tokenizer, max_length=64)
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)

print(f"Training samples: {len(train_dataset)}")
print(f"Batches per epoch: {len(train_loader)}")

## Part 6: QLoRA Training

Train only the LoRA adapters while keeping quantized weights frozen.
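
The mechanism behind this is `requires_grad`: parameters with `requires_grad=False` receive no gradient and are never handed to the optimizer. The toy sketch below demonstrates this with two small `nn.Linear` layers standing in for the frozen base and the trainable adapter. The sizes and learning rate are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

base = nn.Linear(16, 16)                  # stands in for a frozen base layer
adapter = nn.Linear(16, 16, bias=False)   # stands in for trainable LoRA weights

# Freeze the base layer
for p in base.parameters():
    p.requires_grad = False

model = nn.ModuleDict({"base": base, "adapter": adapter})
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-2)

before_base = base.weight.clone()
before_adapter = adapter.weight.clone()

# One training step on random data
x = torch.randn(4, 16)
loss = (base(x) + adapter(x)).pow(2).mean()
loss.backward()
optimizer.step()

base_unchanged = torch.equal(base.weight, before_base)
adapter_changed = not torch.equal(adapter.weight, before_adapter)
print(f"base unchanged: {base_unchanged}, adapter updated: {adapter_changed}")
```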

In [None]:
def train_qlora(
    model: nn.Module,
    train_loader: DataLoader,
    num_epochs: int = 3,
    learning_rate: float = 3e-4,
    weight_decay: float = 0.01
):
    """
    Train the model with QLoRA.
    """
    # Only optimize LoRA parameters (full precision)
    optimizer = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad],
        lr=learning_rate,
        weight_decay=weight_decay
    )
    
    model.train()
    
    for epoch in range(num_epochs):
        total_loss = 0
        num_batches = 0
        
        for batch_idx, batch in enumerate(train_loader):
            batch = {k: v.to(device) for k, v in batch.items()}
            
            # Forward pass
            outputs = model(**batch)
            loss = outputs.loss
            
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            
            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(
                [p for p in model.parameters() if p.requires_grad],
                max_norm=1.0
            )
            
            optimizer.step()
            
            total_loss += loss.item()
            num_batches += 1
            
            if (batch_idx + 1) % 5 == 0:
                avg_loss = total_loss / num_batches
                print(f"Epoch [{epoch+1}/{num_epochs}], "
                      f"Batch [{batch_idx+1}/{len(train_loader)}], "
                      f"Loss: {loss.item():.4f}, "
                      f"Avg Loss: {avg_loss:.4f}")
        
        epoch_loss = total_loss / num_batches
        print(f"\nEpoch {epoch+1} completed. Average Loss: {epoch_loss:.4f}\n")
        print("=" * 50)


print("Starting QLoRA training...")
print("=" * 50)
print("Note: Only LoRA adapters are being trained (full precision)")
print("Base model weights remain quantized and frozen\n")

train_qlora(model, train_loader, num_epochs=3, learning_rate=3e-4)

## Part 7: Text Generation

Test the fine-tuned QLoRA model.

In [None]:
def generate_text(
    model: nn.Module,
    tokenizer,
    prompt: str,
    max_length: int = 100,
    temperature: float = 0.8,
    top_p: float = 0.9
) -> str:
    """
    Generate text from a prompt.
    """
    model.eval()
    
    inputs = tokenizer(prompt, return_tensors='pt').to(device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            temperature=temperature,
            top_p=top_p,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_text


# Test generation
prompts = [
    "Quantization reduces",
    "QLoRA combines",
    "4-bit quantization",
    "LoRA adapters",
    "Parameter-efficient"
]

print("\nGenerating text with QLoRA fine-tuned model:")
print("=" * 50)

for prompt in prompts:
    generated = generate_text(model, tokenizer, prompt, max_length=80)
    print(f"\nPrompt: {prompt}")
    print(f"Generated: {generated}")
    print("-" * 50)

## Part 8: Save and Load QLoRA Weights

Save only the LoRA adapters (not quantized base weights).
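
The idea can be sketched on a toy module: filter the parameters by name, save that small dictionary, and load it back with `strict=False` so the missing base weights are not treated as an error. `TinyAdapted` is a hypothetical stand-in for the real model, and an in-memory buffer is used so nothing is written to disk.

```python
import io
import torch
import torch.nn as nn


class TinyAdapted(nn.Module):
    """Hypothetical toy module: a base layer plus LoRA-style parameters."""
    def __init__(self):
        super().__init__()
        self.base = nn.Linear(8, 8)
        self.lora_A = nn.Parameter(torch.randn(2, 8))
        self.lora_B = nn.Parameter(torch.randn(8, 2))


source = TinyAdapted()

# Keep only the adapter tensors, detached from autograd
adapter_state = {
    name: param.detach().cpu()
    for name, param in source.named_parameters()
    if "lora" in name
}

buffer = io.BytesIO()  # in-memory file
torch.save(adapter_state, buffer)
buffer.seek(0)

# A fresh module gets the adapters; its base weights are left as they are
target = TinyAdapted()
result = target.load_state_dict(torch.load(buffer), strict=False)

print("saved keys:", sorted(adapter_state))
print("missing (expected, base weights):", result.missing_keys)
```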

In [None]:
def save_qlora_weights(model: nn.Module, save_path: str):
    """
    Save only the QLoRA adapter weights.
    """
    qlora_state_dict = {}
    
    for name, param in model.named_parameters():
        if 'lora' in name and param.requires_grad:
            # Detach so the saved tensors carry no autograd state
            qlora_state_dict[name] = param.detach().cpu()
    
    torch.save(qlora_state_dict, save_path)
    print(f"QLoRA weights saved to: {save_path}")
    print(f"Number of adapter parameters: {len(qlora_state_dict)}")
    
    import os
    size_mb = os.path.getsize(save_path) / (1024 * 1024)
    print(f"File size: {size_mb:.2f} MB")


def load_qlora_weights(model: nn.Module, load_path: str):
    """
    Load QLoRA adapter weights.
    """
    qlora_state_dict = torch.load(load_path, map_location=device, weights_only=True)
    model.load_state_dict(qlora_state_dict, strict=False)
    print(f"QLoRA weights loaded from: {load_path}")


print("\nSaving QLoRA weights...")
print("=" * 50)
save_qlora_weights(model, 'qlora_weights.pt')

print("\nComparison:")
print(f"Full GPT-2 (FP32): ~500 MB")
print(f"Full GPT-2 (4-bit): ~125 MB")
print(f"QLoRA adapters only: ~1-2 MB")
print(f"\n🎉 QLoRA enables training with minimal storage!")

## Summary: QLoRA vs LoRA vs Full Fine-tuning

### Memory Comparison (GPT-2 Small - 124M params)

| Method | Weight memory (approx.) | Trainable % | Checkpoint storage |
|--------|-------------------------|-------------|--------------------|
| Full Fine-tuning (FP32) | ~500 MB | 100% | ~500 MB |
| Full Fine-tuning (FP16) | ~250 MB | 100% | ~250 MB |
| LoRA (r=8) | ~500 MB | 0.3% | ~2 MB |
| QLoRA (4-bit + r=8) | ~130 MB | 0.3% | ~2 MB |

### Key Advantages of QLoRA:

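1. **Ultra-low Memory**: Base weights take up to 4× less memory than with 16-bit LoRA
2. **Tiny Storage**: Only save adapter weights (~2 MB)
3. **Same Quality**: The QLoRA paper reports results on par with 16-bit fine-tuning on its benchmarks
4. **Enables Larger Models**: Can train 7B-33B models on a single consumer GPU, and 65B on a single 48 GB GPU
5. **Cheap Dequantization**: Dequantization is a table lookup and a multiply, though it still adds some overhead compared with 16-bit weights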

### When to Use QLoRA:

✅ **Use QLoRA when:**
- Memory is very limited (< 8GB GPU)
- Training large models (> 1B parameters)
- Need to train multiple adapters
- Consumer hardware (laptop, Colab free)

⚠️ **Use regular LoRA when:**
- Memory is not a constraint
- Maximum performance needed
- Small models (< 1B parameters)
- Quantization overhead matters

### Technical Details:

**NF4 Quantization:**
- Maps 16- or 32-bit float weights → 4-bit indices into a 16-value codebook
- Optimized for normal distributions
- Small accuracy loss vs FP16 (task dependent)

**Double Quantization:**
- Quantizes the quantization constants
- Saves ~0.37 bits per parameter (with the block sizes used in the paper)
- Negligible accuracy impact
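
The saving can be worked out by hand. The sketch below assumes the block sizes described in the QLoRA paper: 64 weights per first-level constant, and 256 first-level constants per second-level constant, with first-level constants stored in 8 bits after double quantization. Treat these values as assumptions if your setup differs.

```python
block_size = 64          # weights per first-level quantization constant
second_block_size = 256  # first-level constants per second-level constant

# Without double quantization: one FP32 constant per block of 64 weights
bits_plain = 32 / block_size

# With double quantization: 8-bit constants, plus one FP32 constant per 256 of them
bits_double = 8 / block_size + 32 / (block_size * second_block_size)

saved = bits_plain - bits_double
print(f"{bits_plain:.3f} -> {bits_double:.3f} bits/param (saves {saved:.3f})")
```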

**Paged Optimizers:**
- Page optimizer states out to CPU RAM when GPU memory runs short
- Help avoid out-of-memory errors during memory spikes
- Implemented in bitsandbytes library

### Hyperparameters:

**Quantization:**
- Bit width: 4-bit (NF4 or FP4)
- Double quantization: Usually enabled
- Compute dtype: bfloat16 preferred

**LoRA:**
- Rank: 8-64 (start with 8)
- Alpha: 16-32 (2× rank typically)
- Target modules: Attention (q_proj, v_proj)
- Dropout: 0.05-0.1
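
To see what the rank setting costs, the sketch below counts adapter parameters for a single layer at several ranks. The layer shape (768 → 2304) is assumed to match GPT-2 small's `c_attn`, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


in_features, out_features = 768, 2304  # assumed: GPT-2 small c_attn shape
dense = in_features * out_features

for rank in (8, 16, 32, 64):
    n = lora_param_count(in_features, out_features, rank)
    print(f"rank={rank:>2}: {n:>7,} adapter params ({100 * n / dense:.2f}% of the dense layer)")
```

The adapter size grows linearly with rank, so doubling the rank doubles the trainable parameters and the saved file size.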

**Training:**
- Learning rate: 1e-4 to 5e-4
- Batch size: Larger than LoRA (more memory available)
- Gradient accumulation: As needed
- Warmup: 5-10% of steps

## Practical Tips for Using QLoRA

### 1. Model Loading:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Efficient 4-bit loading with bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
```

### 2. PEFT Integration:

```python
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

# Prepare for training
model = prepare_model_for_kbit_training(model)

# LoRA config
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
```

### 3. Common Issues:

**CUDA Out of Memory:**
- Reduce batch size
- Enable gradient checkpointing
- Use gradient accumulation
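
Gradient accumulation can be sketched on a toy model: several small micro-batches are run back to back, each loss is divided by the number of steps, and the optimizer would step only once at the end. The example below checks that the accumulated gradient matches the one from a single large batch. The model and data are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
data, target = torch.randn(8, 10), torch.randn(8, 1)
loss_fn = nn.MSELoss()

# Reference: gradient from one full batch of 8
model.zero_grad()
loss_fn(model(data), target).backward()
full_grad = model.weight.grad.clone()

# Accumulation: 4 micro-batches of 2, each loss scaled by 1/accum_steps
accum_steps = 4
model.zero_grad()
for x, y in zip(data.chunk(accum_steps), target.chunk(accum_steps)):
    loss = loss_fn(model(x), y) / accum_steps
    loss.backward()  # gradients add up across backward calls
accum_grad = model.weight.grad.clone()

grads_match = torch.allclose(full_grad, accum_grad, atol=1e-5)
print(f"accumulated gradient matches full batch: {grads_match}")
```

Only one micro-batch of activations is held in memory at a time, which is why this helps with out-of-memory errors.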

**Slow Training:**
- Dequantization has overhead
- Use bfloat16 compute dtype
- Enable Flash Attention if available

**Quality Issues:**
- Try higher rank (16-32)
- Target more modules
- Increase training steps

### 4. Extensions:

- **QLoRA + Flash Attention**: Faster training, especially on long sequences
- **Multi-adapter QLoRA**: Multiple task adapters
- **QLoRA for Vision**: Apply to ViT models
- **8-bit QLoRA**: Less aggressive quantization