# Getting Started with PyTorch on RTX 50-Series GPUs

This notebook demonstrates how to use PyTorch with native SM 12.0 (Blackwell) support on NVIDIA RTX 50-series GPUs.

## Prerequisites

- NVIDIA RTX 5090, 5080, 5070 Ti, or 5070 GPU
- NVIDIA Driver >= 570.00
- CUDA 13.0+
- Python 3.10+
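
The driver and Python requirements can be checked before installing anything. The sketch below is one way to do that with only the standard library. It assumes `nvidia-smi` is on the `PATH`; the helper names `parse_version` and `driver_version` are illustrative and not part of stone-linux.

```python
import shutil
import subprocess
import sys

MIN_DRIVER = (570, 0)   # from the prerequisites list above
MIN_PYTHON = (3, 10)    # from the prerequisites list above


def parse_version(text):
    """Turn a dotted version such as '570.86.10' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_version():
    """Return the driver version as a tuple, or None if it cannot be queried."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0 or not result.stdout.strip():
        return None
    # One line per GPU; the driver version is the same for all of them
    return parse_version(result.stdout.strip().splitlines()[0])


print(f"Python >= 3.10: {sys.version_info[:2] >= MIN_PYTHON}")
found = driver_version()
if found is None:
    print("Driver: could not query nvidia-smi")
else:
    print(f"Driver >= 570.00: {found >= MIN_DRIVER}")
```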

## Installation

```bash
pip install stone-linux
stone-install
```

## 1. Verify Installation

Let's first verify that PyTorch is correctly installed and can access your RTX 50-series GPU.

In [None]:
import torch
import stone_linux

print(f"stone-linux version: {stone_linux.__version__}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")

In [None]:
# Run comprehensive verification
stone_linux.verify_installation()
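
If you want to see the kind of check involved, a manual version can be written with public PyTorch calls only. This is a minimal sketch, not the implementation of `stone_linux.verify_installation()`. The names `arch_name` and `check_gpu` are hypothetical, and the sketch assumes that a build with native support lists the device's architecture in `torch.cuda.get_arch_list()`.

```python
import importlib.util


def arch_name(capability):
    """Format a (major, minor) compute capability as an arch string, e.g. 'sm_120'."""
    major, minor = capability
    return f"sm_{major}{minor}"


def check_gpu():
    """Return a short status string; does not raise if PyTorch or a GPU is missing."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed"
    import torch

    if not torch.cuda.is_available():
        return "PyTorch is installed, but no CUDA device is visible"

    capability = torch.cuda.get_device_capability(0)
    wanted = arch_name(capability)
    compiled = torch.cuda.get_arch_list()  # architectures this build was compiled for
    if wanted in compiled:
        status = "native kernels present"
    else:
        status = "no native kernels listed"
    return f"{torch.cuda.get_device_name(0)}: {wanted}, {status}"


print(check_gpu())
```

On an RTX 50-series card the capability is expected to be `(12, 0)`, which maps to `sm_120`.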

## 2. GPU Information

Get detailed information about your GPU.

In [None]:
gpu_info = stone_linux.get_gpu_info()

print("GPU Information:")
print("=" * 60)
for key, value in gpu_info.items():
    if key == 'compute_capability':
        print(f"{key}: {value[0]}.{value[1]}")
    elif key == 'total_memory':
        print(f"{key}: {value:.2f} GB")
    else:
        print(f"{key}: {value}")
print("=" * 60)

## 3. Basic Tensor Operations

Let's perform some basic tensor operations on the GPU.

In [None]:
# Create tensors on GPU
x = torch.randn(3, 4, device='cuda')
y = torch.randn(3, 4, device='cuda')

print("Tensor x:")
print(x)
print(f"\nDevice: {x.device}")
print(f"Shape: {x.shape}")
print(f"Dtype: {x.dtype}")

In [None]:
# Basic operations
z = x + y
print("x + y:")
print(z)

# Matrix multiplication
a = torch.randn(100, 100, device='cuda')
b = torch.randn(100, 100, device='cuda')
c = torch.matmul(a, b)
print(f"\nMatrix multiplication result shape: {c.shape}")

## 4. Simple Neural Network

Create and train a simple neural network on the GPU.

In [None]:
import torch.nn as nn
import torch.optim as optim

# Define a simple network
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 10)
        self.relu = nn.ReLU()
    
    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Create model and move to GPU
model = SimpleNet().cuda()
print(model)
print(f"\nModel is on: {next(model.parameters()).device}")

In [None]:
# Create dummy data
batch_size = 64
x = torch.randn(batch_size, 784, device='cuda')
y = torch.randint(0, 10, (batch_size,), device='cuda')

# Define loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
print("Training...")
for epoch in range(10):
    optimizer.zero_grad()
    outputs = model(x)
    loss = criterion(outputs, y)
    loss.backward()
    optimizer.step()
    
    if (epoch + 1) % 2 == 0:
        print(f"Epoch [{epoch+1}/10], Loss: {loss.item():.4f}")

print("Training complete!")
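
Two quick sanity checks help when reading the training output. The sketch below computes the parameter count of `SimpleNet` from its layer sizes and the loss a 10-class classifier has when it predicts all classes with equal probability. It assumes the untrained network's outputs are close to uniform, so treat the loss value as a rough reference point rather than an exact prediction. `linear_params` is an illustrative helper.

```python
import math


def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


# Layer sizes copied from SimpleNet: fc1, fc2, fc3
layer_sizes = [(784, 256), (256, 128), (128, 10)]
total_params = sum(linear_params(i, o) for i, o in layer_sizes)

# Cross-entropy of a uniform prediction over 10 classes is ln(10)
uniform_loss = math.log(10)

print(f"Parameters: {total_params}")
print(f"Loss for a uniform 10-class prediction: {uniform_loss:.4f}")
```

Because the loop trains repeatedly on one fixed batch of random data, a falling loss here shows that the optimizer is memorizing that batch, not that the model generalizes.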

## 5. Mixed Precision Training

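Leverage the RTX 50-series Tensor Cores with mixed precision training: `autocast` runs eligible operations in FP16, while `GradScaler` scales the loss so that small gradients do not underflow to zero.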

In [None]:
import time

# Create model
model = SimpleNet().cuda()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
scaler = torch.amp.GradScaler('cuda')

# Training with mixed precision
print("Training with Automatic Mixed Precision (AMP)...")
start_time = time.time()

for epoch in range(10):
    optimizer.zero_grad()
    
    # Use autocast for mixed precision
    with torch.autocast(device_type='cuda', dtype=torch.float16):
        outputs = model(x)
        loss = criterion(outputs, y)
    
    # Scale loss and backward pass
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    
    if (epoch + 1) % 2 == 0:
        print(f"Epoch [{epoch+1}/10], Loss: {loss.item():.4f}")

torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
end_time = time.time()
print(f"\nTraining time: {end_time - start_time:.2f}s")
print("AMP training complete!")
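
The reason for scaling the loss can be shown without a GPU. The sketch below rounds a small value to half precision with the standard library's `struct` module (format code `"e"`), once directly and once after multiplying by 2^16, which is the documented default `init_scale` of `GradScaler`. The gradient value `1e-8` is an arbitrary example, and `to_fp16` is an illustrative helper.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


tiny_grad = 1e-8     # arbitrary example of a very small gradient
scale = 2.0 ** 16    # default init_scale of GradScaler

unscaled = to_fp16(tiny_grad)        # below the FP16 range, rounds to zero
scaled = to_fp16(tiny_grad * scale)  # representable after scaling
recovered = scaled / scale           # unscaling happens in higher precision

print(f"FP16 without scaling: {unscaled}")
print(f"FP16 with scaling, then unscaled: {recovered:.3e}")
```

This mirrors what `scaler.scale(loss).backward()` followed by `scaler.step(optimizer)` does: gradients are computed on the scaled loss and divided by the scale factor before the optimizer update.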

## 6. Performance Comparison: FP32 vs FP16

Compare the time per FP32 and FP16 matrix multiplication across several matrix sizes.

In [None]:
import time

import matplotlib.pyplot as plt

def benchmark_matmul(size, dtype, iterations=100):
    a = torch.randn(size, size, dtype=dtype, device='cuda')
    b = torch.randn(size, size, dtype=dtype, device='cuda')
    
    # Warmup
    for _ in range(10):
        _ = torch.matmul(a, b)
    torch.cuda.synchronize()
    
    # Benchmark
    start = time.perf_counter()
    for _ in range(iterations):
        _ = torch.matmul(a, b)
    torch.cuda.synchronize()
    end = time.perf_counter()
    
    return (end - start) / iterations

# Test different matrix sizes
sizes = [512, 1024, 2048, 4096]
fp32_times = []
fp16_times = []

for size in sizes:
    print(f"Testing size {size}x{size}...")
    fp32_time = benchmark_matmul(size, torch.float32)
    fp16_time = benchmark_matmul(size, torch.float16)
    fp32_times.append(fp32_time * 1000)  # Convert to ms
    fp16_times.append(fp16_time * 1000)
    print(f"  FP32: {fp32_time*1000:.2f}ms, FP16: {fp16_time*1000:.2f}ms")
    print(f"  Speedup: {fp32_time/fp16_time:.2f}x")

# Plot results
plt.figure(figsize=(10, 6))
plt.plot(sizes, fp32_times, 'o-', label='FP32', linewidth=2)
plt.plot(sizes, fp16_times, 's-', label='FP16', linewidth=2)
plt.xlabel('Matrix Size', fontsize=12)
plt.ylabel('Time (ms)', fontsize=12)
plt.title('Matrix Multiplication Performance: FP32 vs FP16', fontsize=14)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
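
Per-iteration times are easier to compare across matrix sizes when converted to throughput. This is a sketch, assuming the usual count of about 2n³ floating-point operations for an n×n matrix product. `matmul_tflops` is an illustrative helper, and the 1 ms timing below is a placeholder, not a measurement.

```python
def matmul_tflops(size, seconds):
    """Approximate throughput of one size x size matmul in TFLOP/s."""
    flops = 2 * size ** 3  # one multiply and one add per inner-product term
    return flops / seconds / 1e12


# Placeholder timing: a 4096x4096 product that takes 1 ms per iteration
example = matmul_tflops(4096, 1e-3)
print(f"{example:.2f} TFLOP/s")
```

In the benchmark loop above, `fp32_time` and `fp16_time` are already in seconds per iteration, so they can be passed to the helper directly.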

## 7. Memory Management

Monitor and manage GPU memory usage.

In [None]:
def print_gpu_memory():
    allocated = torch.cuda.memory_allocated(0) / 1024**3
    reserved = torch.cuda.memory_reserved(0) / 1024**3
    total = torch.cuda.get_device_properties(0).total_memory / 1024**3
    
    print("GPU Memory:")
    print(f"  Allocated: {allocated:.2f} GB")
    print(f"  Reserved:  {reserved:.2f} GB")
    print(f"  Total:     {total:.2f} GB")
    # Approximate: ignores memory held by other processes on the same GPU
    print(f"  Free:      {total - reserved:.2f} GB")

print_gpu_memory()

# Allocate some memory
large_tensor = torch.randn(10000, 10000, device='cuda')
print("\nAfter allocating 10000x10000 tensor:")
print_gpu_memory()

# Free memory: drop the reference, then return cached blocks to the driver
del large_tensor
torch.cuda.empty_cache()
print("\nAfter freeing memory:")
print_gpu_memory()
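
The jump in allocated memory can be predicted from the tensor's shape and dtype. The sketch below uses an illustrative helper, `tensor_size_gb`, and divides by 1024**3 to match `print_gpu_memory`. The allocator may round allocations up, so the reported figure can be slightly larger.

```python
import math


def tensor_size_gb(shape, bytes_per_element=4):
    """Size of a dense tensor in units of 1024**3 bytes (float32 by default)."""
    return math.prod(shape) * bytes_per_element / 1024**3


fp32_size = tensor_size_gb((10000, 10000))                       # float32, 4 bytes
fp16_size = tensor_size_gb((10000, 10000), bytes_per_element=2)  # float16, 2 bytes

print(f"10000x10000 float32: {fp32_size:.2f} GB")
print(f"10000x10000 float16: {fp16_size:.2f} GB")
```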

## Conclusion

You've successfully:

1. ✓ Verified PyTorch installation with RTX 50-series support
2. ✓ Inspected GPU information
3. ✓ Performed basic tensor operations on GPU
4. ✓ Created and trained a neural network
5. ✓ Used mixed precision training (AMP)
6. ✓ Compared FP32 vs FP16 performance
7. ✓ Managed GPU memory

## Next Steps

- Check out the [performance benchmarking notebook](02_performance_benchmarks.ipynb)
- Explore [vLLM integration](../stone_linux/examples/vllm_example.py)
- Try [LangChain examples](../stone_linux/examples/langchain_example.py)
- Read the [full documentation](../README.md)

## Resources

- [PyTorch Documentation](https://pytorch.org/docs)
- [CUDA Best Practices](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/)
- [Mixed Precision Training](https://pytorch.org/docs/stable/amp.html)