# 🔥 GPU-Accelerated Training Setup

**GPU Detected**: RTX 3070 Ti with 8GB VRAM
**CUDA Version**: 12.6
**PyTorch**: 2.7.1+cu126

This notebook will help you verify your GPU setup and configure optimal training parameters.

## GPU Verification & Setup

In [1]:
import torch    #type: ignore
import torch.nn as nn   #type: ignore
from transformers import AutoTokenizer, AutoModel   #type: ignore
import psutil
import gc

# Comprehensive GPU info
def check_gpu_setup():
    print("=" * 50)
    print("🔥 GPU SETUP VERIFICATION")
    print("=" * 50)
    
    # PyTorch and CUDA
    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    print(f"CUDA version: {torch.version.cuda}")
    print(f"cuDNN version: {torch.backends.cudnn.version()}")
    
    if torch.cuda.is_available():
        print(f"\n📊 GPU Details:")
        print(f"GPU count: {torch.cuda.device_count()}")
        
        for i in range(torch.cuda.device_count()):
            props = torch.cuda.get_device_properties(i)
            print(f"GPU {i}: {props.name}")
            print(f"  Memory: {props.total_memory / 1024**3:.1f} GB")
            print(f"  Compute Capability: {props.major}.{props.minor}")
            print(f"  Multiprocessors: {props.multi_processor_count}")
        
        # Memory usage
        print(f"\n💾 Current GPU Memory:")
        print(f"Allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
        print(f"Cached: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")
        
    else:
        print("❌ GPU not available - will use CPU")
    
    # CPU info as backup
    print(f"\n🖥️  CPU Info:")
    print(f"CPU cores: {psutil.cpu_count()}")
    print(f"RAM: {psutil.virtual_memory().total / 1024**3:.1f} GB")
    
    return torch.cuda.is_available()

gpu_available = check_gpu_setup()

🔥 GPU SETUP VERIFICATION
PyTorch version: 2.7.1+cu126
CUDA available: True
CUDA version: 12.6
cuDNN version: 90501

📊 GPU Details:
GPU count: 1
GPU 0: NVIDIA GeForce RTX 3070 Ti
  Memory: 7.7 GB
  Compute Capability: 8.6
  Multiprocessors: 48

💾 Current GPU Memory:
Allocated: 0.00 GB
Cached: 0.00 GB

🖥️  CPU Info:
CPU cores: 16
RAM: 31.2 GB
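
The `cuDNN version: 90501` line above is an integer encoding rather than a dotted version. The sketch below decodes it, assuming the encoding used by the cuDNN headers as I understand them (major × 10000 + minor × 100 + patch from cuDNN 9 onward, major × 1000 + minor × 100 + patch before that). `decode_cudnn_version` is a hypothetical helper, not a PyTorch function.

```python
def decode_cudnn_version(version: int) -> str:
    """Turn the integer from torch.backends.cudnn.version() into 'major.minor.patch'."""
    if version >= 90000:
        # Assumed cuDNN 9+ layout: major * 10000 + minor * 100 + patch
        major, rest = divmod(version, 10000)
    else:
        # Assumed cuDNN 8 and earlier layout: major * 1000 + minor * 100 + patch
        major, rest = divmod(version, 1000)
    minor, patch = divmod(rest, 100)
    return f"{major}.{minor}.{patch}"


print(decode_cudnn_version(90501))  # 9.5.1
```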


## Optimal Batch Size Detection

In [None]:
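def find_optimal_batch_size(model_name="distilbert-base-uncased", max_length=512):
    """
    Find the largest batch size whose forward pass fits in GPU memory.

    The probe runs inference only (no gradients or optimizer state), so the
    largest passing size is an upper bound. The returned value is halved as
    a safety margin for training.
    """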
    if not torch.cuda.is_available():
        print("No GPU available - recommended batch size: 8-16")
        return 16
    
    device = torch.device("cuda")
    
    # Load model and tokenizer
    print(f"Testing with {model_name}...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).to(device)
    model.eval()
    
    # Test different batch sizes
    batch_sizes = [8, 16, 24, 32, 48, 64, 96, 128]
    optimal_batch_size = 8
    
    print("\n🧪 Testing batch sizes:")
    
    for batch_size in batch_sizes:
        try:
            # Clear cache
            torch.cuda.empty_cache()
            gc.collect()
            
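            # Create a dummy batch padded to max_length so the probe reflects
            # worst-case sequences (padding=True would only pad to ~12 tokens here)
            dummy_texts = ["This is a test sentence for batch size optimization."] * batch_size
            inputs = tokenizer(
                dummy_texts,
                padding="max_length",
                truncation=True,
                max_length=max_length,
                return_tensors="pt"
            ).to(device)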
            
            # Test forward pass
            with torch.no_grad():
                outputs = model(**inputs)
            
            # Check memory usage
            memory_used = torch.cuda.memory_allocated() / 1024**3
            memory_cached = torch.cuda.memory_reserved() / 1024**3
            
            print(f"  Batch size {batch_size:3d}: ✅ Memory: {memory_used:.2f}GB allocated, {memory_cached:.2f}GB cached")
            optimal_batch_size = batch_size
            
        except RuntimeError as e:
            if "out of memory" in str(e).lower():
                print(f"  Batch size {batch_size:3d}: ❌ Out of memory")
                break
            else:
                print(f"  Batch size {batch_size:3d}: ❌ Error: {e}")
                break
    
    # Cleanup
    del model, tokenizer
    torch.cuda.empty_cache()
    gc.collect()
    
    print(f"\n🎯 Recommended batch size: {optimal_batch_size}")
    print(f"💡 For training, use batch size: {max(optimal_batch_size // 2, 8)} (with safety margin)")
    
    return max(optimal_batch_size // 2, 8)

# Test optimal batch size
optimal_batch = find_optimal_batch_size()
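
The selection logic above (keep the largest candidate that fits, stop at the first failure, then halve with a floor of 8) can be exercised without a GPU. This is a minimal sketch in which `fits` stands in for the real forward-pass probe. `largest_passing_batch`, `fake_probe`, and the 0.1 GB-per-sample figure are hypothetical and only illustrate the control flow.

```python
from typing import Callable, Sequence


def largest_passing_batch(
    candidates: Sequence[int],
    fits: Callable[[int], bool],
    floor: int = 8,
) -> tuple[int, int]:
    """Return (largest candidate that fits, halved recommendation for training)."""
    best = floor
    for size in candidates:
        if not fits(size):
            break  # Stop at the first failure, like the out-of-memory branch
        best = size
    return best, max(best // 2, floor)


def fake_probe(batch_size: int, gb_per_sample: float = 0.1, budget_gb: float = 7.7) -> bool:
    """Pretend each sample costs a fixed amount of memory (made-up number)."""
    return batch_size * gb_per_sample <= budget_gb


best, train_size = largest_passing_batch([8, 16, 24, 32, 48, 64, 96, 128], fake_probe)
print(best, train_size)  # 64 32
```

Separating the search from the probe makes it easy to swap in a probe that also runs a backward pass, which would be closer to real training memory use.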


## GPU Training Configuration

In [None]:
# GPU-optimized configuration
GPU_CONFIG = {
    # Model settings
    "model_name": "distilbert-base-uncased",  # Perfect for RTX 3070 Ti
    "max_length": 512,
    "num_labels": 3,
    
    # Training settings optimized for your GPU
    "batch_size": optimal_batch,  # Dynamically determined
    "learning_rate": 3e-5,        # Common fine-tuning rate for BERT-style models
    "num_epochs": 5,              # More epochs with faster training
    "warmup_steps": 500,
    "weight_decay": 0.01,
    
    # GPU-specific optimizations
    "fp16": True,                 # Mixed precision for speed + memory
    "dataloader_pin_memory": True,
    "dataloader_num_workers": 4,  # Parallel data loading
    
    # Monitoring (faster with GPU)
    "logging_steps": 50,
    "eval_steps": 200,
    "save_steps": 500,
    
    # Safety settings
    "max_grad_norm": 1.0,
    "early_stopping_patience": 3
}

print("🚀 GPU Training Configuration:")
print("=" * 40)
for key, value in GPU_CONFIG.items():
    print(f"{key:25}: {value}")

# Save config for later use
import json
import os

os.makedirs("../config", exist_ok=True)  # Create the directory if it is missing
with open("../config/gpu_config.json", "w") as f:
    json.dump(GPU_CONFIG, f, indent=2)

print("\n💾 Configuration saved to ../config/gpu_config.json")
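
Step-based settings such as `warmup_steps`, `eval_steps`, and `save_steps` only make sense relative to the total number of optimizer steps. The sketch below is one way to sanity-check them. `check_schedule` and the 20% warmup threshold are assumptions of mine, and the multiple-of check matters mainly if you later use Hugging Face `Trainer` with `load_best_model_at_end=True`.

```python
import math


def check_schedule(num_samples, batch_size, num_epochs,
                   warmup_steps, eval_steps, save_steps,
                   max_warmup_fraction=0.2):
    """Return (total optimizer steps, list of warnings) for a step-based schedule."""
    total_steps = math.ceil(num_samples / batch_size) * num_epochs
    warnings = []
    if warmup_steps > max_warmup_fraction * total_steps:
        share = warmup_steps / total_steps
        warnings.append(f"warmup covers {share:.0%} of {total_steps} steps")
    if save_steps % eval_steps != 0:
        warnings.append("save_steps is not a multiple of eval_steps")
    return total_steps, warnings


# 5,000 samples with an assumed batch size of 32 and the config values above
total, issues = check_schedule(5000, 32, 5, warmup_steps=500, eval_steps=200, save_steps=500)
print(total)  # 785
for issue in issues:
    print("-", issue)
```

With a small dataset, 500 warmup steps can consume most of the run, so a warmup ratio may be a better fit than a fixed step count.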


## Performance Expectations with Your Setup

In [None]:
def estimate_training_time(num_samples, batch_size, num_epochs):
    """
    Estimate training time based on your RTX 3070 Ti performance.
    """
    # Rough estimates based on RTX 3070 Ti benchmarks
    seconds_per_batch_distilbert = 0.15  # DistilBERT on RTX 3070 Ti
    seconds_per_batch_bert = 0.25        # BERT-base on RTX 3070 Ti
    
    batches_per_epoch = -(-num_samples // batch_size)  # Ceiling division: count the final partial batch
    total_batches = batches_per_epoch * num_epochs
    
    # Estimates for different models
    distilbert_time = total_batches * seconds_per_batch_distilbert
    bert_time = total_batches * seconds_per_batch_bert
    
    print(f"📊 Training Time Estimates for {num_samples:,} samples:")
    print(f"{'Model':<15} {'Time':<12} {'Speed':<15} {'Recommended':<12}")
    print("-" * 60)
    print(f"{'DistilBERT':<15} {distilbert_time/60:.1f} min{'s':<6} {'⚡ Very Fast':<15} {'✅ Ideal':<12}")
    print(f"{'BERT-base':<15} {bert_time/60:.1f} min{'s':<6} {'🚀 Fast':<15} {'✅ Good':<12}")
    print(f"{'RoBERTa':<15} {bert_time*1.2/60:.1f} min{'s':<6} {'🐌 Slower':<15} {'⚠️  Heavy':<12}")
    
    return distilbert_time, bert_time

# Estimates for different dataset sizes
print("🎯 Performance Projections:")
print("=" * 50)

dataset_sizes = [5000, 10000, 25000, 50000]
for size in dataset_sizes:
    print(f"\n📈 Dataset size: {size:,} samples")
    estimate_training_time(size, optimal_batch, 5)
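
The per-batch constants above are rough figures, so it may be worth replacing them with a measurement from your own training step. This is a minimal sketch: `measure_seconds_per_batch` and `projected_minutes` are hypothetical helpers, and `step_fn` is any zero-argument callable that runs one training step. On a GPU it should end with `torch.cuda.synchronize()` so the timing covers the actual work.

```python
import math
import time


def measure_seconds_per_batch(step_fn, warmup=3, iters=10):
    """Average wall-clock seconds per call, after a few untimed warm-up calls."""
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    return (time.perf_counter() - start) / iters


def projected_minutes(num_samples, batch_size, num_epochs, seconds_per_batch):
    """Extrapolate total training time, counting the final partial batch."""
    total_batches = math.ceil(num_samples / batch_size) * num_epochs
    return total_batches * seconds_per_batch / 60


# Worked example with the notebook's rough DistilBERT figure of 0.15 s per batch:
# ceil(25000 / 32) = 782 batches per epoch, 3910 batches over 5 epochs
print(f"{projected_minutes(25_000, 32, 5, 0.15):.1f} min")  # 9.8 min
```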


## Memory Management for Long Training

In [None]:
def setup_gpu_optimization():
    """
    Configure PyTorch for optimal GPU performance.
    """
    if torch.cuda.is_available():
        # Enable optimizations
        torch.backends.cudnn.benchmark = True  # Optimize for consistent input sizes
        torch.backends.cudnn.enabled = True
        
        # Memory management
        torch.cuda.empty_cache()
        
        print("🔧 GPU Optimizations Enabled:")
        print(f"  ✅ cuDNN benchmark: {torch.backends.cudnn.benchmark}")
        print(f"  ✅ cuDNN enabled: {torch.backends.cudnn.enabled}")
        print(f"  ✅ Memory cache cleared")
        
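        # Set memory allocation strategy to reduce fragmentation.
        # Note: PyTorch reads PYTORCH_CUDA_ALLOC_CONF when the CUDA allocator
        # initializes, so this only takes effect if it is set before the first
        # CUDA allocation (ideally before `import torch`, at the top of the notebook).
        import os
        os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128'
        print(f"  ⚠️  Allocator config set (applies only if CUDA memory has not been allocated yet)")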
    
    else:
        print("⚠️  No GPU available - skipping GPU optimizations")

def monitor_gpu_usage():
    """
    Monitor GPU memory usage during training.
    """
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024**3
        cached = torch.cuda.memory_reserved() / 1024**3
        total = torch.cuda.get_device_properties(0).total_memory / 1024**3
        
        print(f"🔍 GPU Memory: {allocated:.2f}GB allocated, {cached:.2f}GB cached, {total:.1f}GB total")
        print(f"📊 Usage: {(cached/total)*100:.1f}% of total GPU memory")
        
        if cached > total * 0.9:
            print("⚠️  Warning: High memory usage - consider reducing batch size")
        elif cached < total * 0.5:
            print("💡 Tip: You could potentially increase batch size for better performance")
    
# Apply optimizations
setup_gpu_optimization()
monitor_gpu_usage()
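
For long runs it can help to keep a running peak rather than a single snapshot. The sketch below applies the same 90% and 50% thresholds to a stream of readings. `MemoryWatcher` is a hypothetical class and the readings are made-up values. In real use each reading would come from `torch.cuda.memory_reserved() / 1024**3`.

```python
class MemoryWatcher:
    """Track peak reserved memory and classify each reading against fixed thresholds."""

    def __init__(self, total_gb: float):
        self.total_gb = total_gb
        self.peak_gb = 0.0

    def record(self, reserved_gb: float) -> str:
        self.peak_gb = max(self.peak_gb, reserved_gb)
        fraction = reserved_gb / self.total_gb
        if fraction > 0.9:
            return "high"  # Consider reducing batch size
        if fraction < 0.5:
            return "low"   # Room to increase batch size
        return "ok"


watcher = MemoryWatcher(total_gb=7.7)
statuses = [watcher.record(gb) for gb in (1.2, 5.0, 7.3)]  # Illustrative readings
print(statuses, watcher.peak_gb)  # ['low', 'ok', 'high'] 7.3
```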


## FP16 Mixed Precision Setup & Testing 🚀

**Mixed precision training uses both FP16 and FP32 for optimal speed and stability.**

### Benefits for RTX 3070 Ti:
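- ⚡ **Up to 30-40% faster training** (workload-dependent; the benchmark in this section measures it)
- 💾 **Lower activation memory** (weights and optimizer state stay in FP32 under autocast, so total savings are usually below 50%)
- 🔥 **Tensor Core acceleration**
- 📈 **Larger batch sizes possible**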
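
To see why mixed precision mostly saves activation memory, it helps to budget the part that stays in FP32. This is a back-of-the-envelope sketch assuming Adam (two FP32 moment buffers per parameter) and roughly 66 million parameters for DistilBERT. `training_state_gb` is a hypothetical helper.

```python
def training_state_gb(num_params: int, optimizer_states: int = 2) -> float:
    """FP32 weights + FP32 gradients + optimizer buffers, in GiB."""
    bytes_per_param = 4 + 4 + 4 * optimizer_states  # 16 bytes per parameter with Adam
    return num_params * bytes_per_param / 1024**3


# Approximate parameter count for distilbert-base-uncased (assumption: ~66M)
print(f"{training_state_gb(66_000_000):.2f} GB")  # 0.98 GB
```

Under these assumptions about 1 GB of the 8 GB card is fixed cost regardless of precision. The rest goes to activations, which is where autocast reduces usage.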

In [None]:
def test_fp16_support():
    """
    Test if your GPU supports FP16 and measure performance difference.
    """
    if not torch.cuda.is_available():
        print("❌ No GPU available - FP16 requires CUDA")
        return False
    
    device = torch.device("cuda")
    
    # Check Tensor Core support
    props = torch.cuda.get_device_properties(0)
    has_tensor_cores = props.major >= 7  # Volta (V100) and newer
    
    print("🔍 FP16 Capability Check:")
    print(f"  GPU: {props.name}")
    print(f"  Compute Capability: {props.major}.{props.minor}")
    print(f"  Tensor Cores: {'✅ Yes' if has_tensor_cores else '❌ No'}")
    
    if not has_tensor_cores:
        print("⚠️  Warning: GPU doesn't have Tensor Cores, FP16 benefits will be limited")
    
    # Test FP16 operations
    try:
        # Create test tensors
        x = torch.randn(1000, 1000, device=device, dtype=torch.float16)
        y = torch.randn(1000, 1000, device=device, dtype=torch.float16)
        
        # Test basic operations
        z = torch.matmul(x, y)
        
        # Test autocast (torch.amp replaces the deprecated torch.cuda.amp API)
        with torch.amp.autocast("cuda"):
            z_auto = torch.matmul(x.float(), y.float())
        assert z_auto.dtype == torch.float16  # autocast should run matmul in FP16
        
        print("  Basic FP16 operations: ✅ Working")
        print("  Automatic Mixed Precision: ✅ Working")
        
        return True
        
    except Exception as e:
        print(f"  FP16 test failed: ❌ {e}")
        return False

def benchmark_fp16_vs_fp32():
    """
    Benchmark training speed difference between FP16 and FP32.
    """
    if not torch.cuda.is_available():
        print("No GPU available for benchmarking")
        return
    
    from torch.cuda.amp import GradScaler, autocast #type: ignore
    import time
    
    device = torch.device("cuda")
    
    # Create a simple model for testing
    model = nn.Sequential(
        nn.Linear(512, 256),
        nn.ReLU(),
        nn.Linear(256, 128),
        nn.ReLU(),
        nn.Linear(128, 3)  # 3 classes for sentiment
    ).to(device)
    
    # Test data
    batch_size = 32
    x = torch.randn(batch_size, 512, device=device)
    y = torch.randint(0, 3, (batch_size,), device=device)
    
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()
    
    print("\n🏁 FP16 vs FP32 Benchmark:")
    print("=" * 50)
    
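    # Warm-up so one-time CUDA initialization does not skew the FP32 timing
    model.train()
    for _ in range(10):
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()

    # FP32 Benchmark
    torch.cuda.synchronize()
    start_time = time.time()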
    
    for _ in range(100):
        optimizer.zero_grad()
        outputs = model(x)
        loss = criterion(outputs, y)
        loss.backward()
        optimizer.step()
    
    torch.cuda.synchronize()
    fp32_time = time.time() - start_time
    
    print(f"FP32 Training: {fp32_time:.3f} seconds (100 steps)")
    
    # FP16 Benchmark
    model.train()
    scaler = torch.amp.GradScaler("cuda")  # torch.cuda.amp.GradScaler is deprecated
    torch.cuda.synchronize()
    start_time = time.time()
    
    for _ in range(100):
        optimizer.zero_grad()
        
        with torch.amp.autocast("cuda"):  # torch.cuda.amp.autocast is deprecated
            outputs = model(x)
            loss = criterion(outputs, y)
        
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    
    torch.cuda.synchronize()
    fp16_time = time.time() - start_time
    
    print(f"FP16 Training: {fp16_time:.3f} seconds (100 steps)")
    
    # Calculate improvement
    speedup = fp32_time / fp16_time
    time_reduction = (1 - fp16_time / fp32_time) * 100  # Wall-clock saving, not memory
    
    print(f"\n🚀 Performance Results:")
    print(f"  Speedup: {speedup:.2f}x faster with FP16")
    print(f"  Time reduction: {time_reduction:.1f}%")
    
    if speedup > 1.2:
        print("  ✅ Significant speedup - FP16 recommended!")
    elif speedup > 1.05:
        print("  ⚠️  Modest speedup - FP16 still beneficial")
    else:
        print("  ❌ Limited speedup - this test model may be too small to benefit; try a larger model before ruling out FP16")

# Run FP16 tests
print("Testing FP16 support on your RTX 3070 Ti...")
fp16_supported = test_fp16_support()

if fp16_supported:
    print("\n🧪 Running performance benchmark...")
    benchmark_fp16_vs_fp32()
else:
    print("⚠️  FP16 not fully supported - falling back to FP32")
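
FP16 needs gradient scaling because its range is narrow: the largest finite value is 65504, and very small values round to zero. The sketch below shows both effects using only the standard library (`struct` format `"e"` is IEEE 754 half precision). `to_fp16` is a hypothetical helper and the 65536 scale factor is an illustrative choice.

```python
import struct


def to_fp16(value: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    try:
        return struct.unpack("<e", struct.pack("<e", value))[0]
    except OverflowError:
        # struct refuses values beyond the FP16 range; treat them as infinity
        return float("inf") if value > 0 else float("-inf")


tiny_grad = 1e-8
print(to_fp16(tiny_grad))          # 0.0: the gradient underflows
print(to_fp16(tiny_grad * 65536))  # Nonzero: scaling moves it into range
print(to_fp16(70000.0))            # inf: above the FP16 maximum of 65504
```

This is the motivation for `GradScaler`: multiply the loss before the backward pass so small gradients survive, then unscale before the optimizer step.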


### How to Enable FP16 in Your Training Code

In [None]:
# Method 1: Manual FP16 with GradScaler (Recommended)
def create_fp16_training_example():
    """
    Shows how to implement FP16 training for your FinSent model.
    """
    from torch.cuda.amp import GradScaler, autocast     #type: ignore
    from transformers import AutoModel, AutoTokenizer   #type: ignore
    
    print("🚀 FP16 Training Implementation Example:")
    print("=" * 50)
    
    # This is the pattern you'll use in your actual training code
    example_code = '''
# Import required components (torch.amp replaces the deprecated torch.cuda.amp)
import torch
from torch.amp import GradScaler, autocast

device = torch.device("cuda")

# Initialize the scaler for FP16
scaler = GradScaler("cuda")

# Your training loop (assumes the model output has .logits,
# e.g. a sequence-classification model)
def train_with_fp16(model, dataloader, optimizer, criterion):
    model.train()
    total_loss = 0
    
    for batch_idx, batch in enumerate(dataloader):
        # Move batch to GPU
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        
        # Clear gradients
        optimizer.zero_grad()
        
        # 🔥 FP16 Forward Pass with autocast
        with autocast("cuda"):
            outputs = model(input_ids=input_ids, 
                           attention_mask=attention_mask)
            loss = criterion(outputs.logits, labels)
        
        # 🔥 FP16 Backward Pass with gradient scaling
        scaler.scale(loss).backward()
        
        # 🔥 Optimizer step with scaler
        scaler.step(optimizer)
        scaler.update()
        
        total_loss += loss.item()
        
        if batch_idx % 50 == 0:
            print(f"Batch {batch_idx}, Loss: {loss.item():.4f}")
    
    return total_loss / len(dataloader)
    '''
    
    print("📝 Copy this pattern for your training:")
    print(example_code)
    
    return example_code

# Method 2: Hugging Face Transformers with FP16
def create_huggingface_fp16_example():
    """
    Shows how to enable FP16 with Hugging Face Trainer.
    """
    
    example_code = '''
# Using Hugging Face Trainer with FP16 (Even Easier!)
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=32,    # Larger batch with FP16
    per_device_eval_batch_size=64,     # Even larger for eval
    learning_rate=3e-5,
    
    # 🔥 Enable FP16 - Just one line!
    fp16=True,                         # Enable mixed precision (native AMP)
    
    # Optional data-loading settings
    dataloader_pin_memory=True,        # Faster data transfer
    dataloader_num_workers=4,          # Parallel data loading
    
    logging_steps=50,
    eval_steps=200,
    save_steps=400,                    # Must be a multiple of eval_steps for load_best_model_at_end
    eval_strategy="steps",             # Named evaluation_strategy in older transformers releases
    load_best_model_at_end=True,
)

# Create trainer with FP16 enabled
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)

# Train with automatic FP16!
trainer.train()
    '''
    
    print("\n🤖 Hugging Face Trainer with FP16:")
    print(example_code)
    
    return example_code

# Method 3: Safety Wrapper for FP16
def create_fp16_safety_wrapper():
    """
    Safe FP16 implementation with fallback to FP32.
    """
    
    safety_code = '''
# Safe FP16 Training with Automatic Fallback
import torch
from torch.amp import GradScaler, autocast  # Replaces the deprecated torch.cuda.amp

class SafeFP16Trainer:
    def __init__(self, model, use_fp16=True):
        self.model = model
        self.use_fp16 = use_fp16 and torch.cuda.is_available()
        self.scaler = GradScaler("cuda") if self.use_fp16 else None
        
        print(f"🔧 Training mode: {'FP16' if self.use_fp16 else 'FP32'}")
    
    def training_step(self, batch, optimizer, criterion):
        optimizer.zero_grad()
        
        try:
            if self.use_fp16:
                # FP16 training
                with autocast("cuda"):
                    outputs = self.model(**batch)
                    loss = criterion(outputs.logits, batch['labels'])
                
                self.scaler.scale(loss).backward()
                
                # Unscale first so clipping sees the true gradient magnitudes
                self.scaler.unscale_(optimizer)
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
                
                self.scaler.step(optimizer)
                self.scaler.update()
                
            else:
                # FP32 training (fallback)
                outputs = self.model(**batch)
                loss = criterion(outputs.logits, batch['labels'])
                loss.backward()
                
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
                optimizer.step()
            
            return loss.item()
            
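        except RuntimeError as e:
            # GradScaler skips steps with inf/NaN gradients without raising,
            # so this branch only catches errors whose message mentions nan/inf
            if "nan" in str(e).lower() or "inf" in str(e).lower():
                print("⚠️  FP16 instability detected, falling back to FP32")
                self.use_fp16 = False
                self.scaler = None
                return self.training_step(batch, optimizer, criterion)
            else:
                raise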

# Usage:
# trainer = SafeFP16Trainer(model, use_fp16=True)
# loss = trainer.training_step(batch, optimizer, criterion)
    '''
    
    print("\n🛡️  Safe FP16 Implementation:")
    print(safety_code)
    
    return safety_code

# Show all implementation methods
print("Here are 3 ways to enable FP16 in your project:\n")

method1 = create_fp16_training_example()
method2 = create_huggingface_fp16_example()  
method3 = create_fp16_safety_wrapper()

print(f"\n✅ Choose the method that fits your training approach!")
print(f"   • Method 1: Manual control (most flexible)")
print(f"   • Method 2: Hugging Face Trainer (easiest)")  
print(f"   • Method 3: Safety wrapper (most robust)")
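
All three methods rely on dynamic loss scaling: the scale is cut when a step produces inf/NaN gradients and raised again after a run of clean steps. The toy class below sketches that policy in plain Python. `ToyLossScaler` is hypothetical and not PyTorch's implementation. Its default values mirror what I understand to be the `GradScaler` defaults (initial scale 65536, growth factor 2.0, backoff factor 0.5, growth interval 2000).

```python
class ToyLossScaler:
    """Simplified dynamic loss-scale policy (illustration only)."""

    def __init__(self, init_scale=65536.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_inf: bool) -> bool:
        """Adjust the scale; return True if the optimizer step should be applied."""
        if found_inf:
            self.scale *= self.backoff_factor  # Overflow: shrink and skip the step
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps == self.growth_interval:
            self.scale *= self.growth_factor   # Stable for a while: try a larger scale
            self._good_steps = 0
        return True


scaler = ToyLossScaler(growth_interval=3)  # Short interval so the growth is visible
applied = [scaler.update(flag) for flag in (True, True, False, False, False)]
print(applied, scaler.scale)  # [False, False, True, True, True] 32768.0
```

Skipped steps are normal early in training, which is why a few overflow events alone are not a reason to fall back to FP32.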


In [None]:
# Update your GPU config with FP16 settings
if fp16_supported:
    # Update the config with FP16 optimizations
    GPU_CONFIG_FP16 = GPU_CONFIG.copy()
    GPU_CONFIG_FP16.update({
        # FP16 specific settings
        "fp16": True,
        "fp16_opt_level": "O1",              # Apex-only setting; native AMP ignores it
        "fp16_full_eval": False,             # Keep evaluation in FP32 for stability
        
        # Increase batch size with memory savings
        "batch_size": min(optimal_batch + 16, 64),  # Larger batch with FP16
        "gradient_accumulation_steps": 1,     # Less needed with larger batches
        
        # Optimize data loading for FP16
        "dataloader_pin_memory": True,
        "dataloader_num_workers": 4,
        
        # Safety settings for FP16
        "max_grad_norm": 1.0,                # Gradient clipping
        "loss_scale": 0,                     # Dynamic loss scaling
        "loss_scale_window": 1000,           # Loss scaling window
    })
    
    print("🔥 Updated GPU Configuration with FP16:")
    print("=" * 45)
    
    # Show the differences
    for key, value in GPU_CONFIG_FP16.items():
        if key not in GPU_CONFIG or GPU_CONFIG[key] != value:
            print(f"{key:25}: {value} {'🆕' if key not in GPU_CONFIG else '⬆️'}")
        else:
            print(f"{key:25}: {value}")
    
    # Save updated config
    with open("../config/gpu_config_fp16.json", "w") as f:
        json.dump(GPU_CONFIG_FP16, f, indent=2)
    
    print(f"\n💾 FP16-optimized config saved to config/gpu_config_fp16.json")
    print(f"📈 Expected improvements:")
    print(f"   • Training speed: up to 30-40% faster (workload-dependent)")
    print(f"   • Memory usage: lower activation memory (weights stay in FP32)")
    print(f"   • Batch size: {GPU_CONFIG['batch_size']} → {GPU_CONFIG_FP16['batch_size']}")
    print(f"   • Model size: Can fit larger models in 8GB VRAM")

else:
    print("⚠️  FP16 not fully supported - keeping FP32 configuration")
    print("💡 You can still use the project effectively with FP32!")
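
If the batch size you want does not fit in memory even with FP16, `gradient_accumulation_steps` can make up the difference. This is a minimal sketch of the arithmetic. `accumulation_plan` is a hypothetical helper and the numbers are illustrative.

```python
import math


def accumulation_plan(target_effective_batch: int, max_device_batch: int) -> tuple[int, int]:
    """Return (per-step batch size, accumulation steps) that reach the target."""
    steps = math.ceil(target_effective_batch / max_device_batch)
    per_step = math.ceil(target_effective_batch / steps)  # Never exceeds max_device_batch
    return per_step, steps


print(accumulation_plan(64, 16))  # (16, 4): effective batch 64
print(accumulation_plan(64, 24))  # (22, 3): effective batch 66
```

The effective batch size is the per-step size times the number of accumulation steps, so the second plan overshoots the target slightly.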


## Ready for GPU-Accelerated Training! 🚀
### Your Optimal Setup:
- **GPU**: RTX 3070 Ti (8GB VRAM) ✅
- **Batch Size**: Dynamically optimized
- **Model**: DistilBERT (perfect balance of speed/accuracy)
- **Training Time**: roughly 5-15 minutes for DistilBERT at the dataset sizes projected above (an estimate, not a measurement)
- **Memory**: Optimized allocation strategy

### Next Steps:
1. **Complete data exploration** with confidence you have great hardware
2. **Use larger datasets** - your GPU can handle 25k-50k samples easily
3. **Experiment freely** - fast training means quick iteration
4. **Try advanced models** - BERT, RoBERTa are viable options

### Pro Tips for Your Setup:
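- **Mixed precision (fp16)** can give a 30-40% speedup; confirm it with the benchmark above
- **Larger batch sizes** = better GPU utilization, as long as memory stays below the limit
- **More epochs** are affordable because training is fast; rely on early stopping rather than assuming more is always better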
- **Save checkpoints** frequently for experimentation

**You're ready to build something impressive! 🔥**