# Local LLaMA CUDA - Kaggle/Colab Example

This notebook demonstrates how to use the `llcuda` Python package for CUDA-accelerated LLM inference on Kaggle or Colab GPU runtimes. The examples are tuned for an NVIDIA T4.

**GPU Required:** This notebook requires a GPU runtime (T4, P100, or V100).

**Setup:**
1. Enable GPU runtime (Colab): Runtime → Change runtime type → GPU
2. Install the package
3. Build the llama.cpp server
4. Download a model
5. Start the server and run inference

## 1. Check GPU Availability

In [None]:
# Check NVIDIA GPU
!nvidia-smi
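
To use the GPU details from Python rather than reading them off the table, `nvidia-smi` can also emit CSV. The sketch below parses that output. The helper names `parse_gpu_csv` and `query_gpus` are illustrative and not part of `llcuda`, and the sample string stands in for real output.

```python
import shutil
import subprocess

def parse_gpu_csv(text):
    """Parse output of
    `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits`."""
    gpus = []
    for line in text.strip().splitlines():
        # Split on the last comma so a comma in the GPU name cannot break parsing
        name, mem = [field.strip() for field in line.rsplit(",", 1)]
        gpus.append({"name": name, "memory_mib": int(mem)})
    return gpus

def query_gpus():
    """Return detected GPUs, or an empty list when nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    return parse_gpu_csv(out.stdout) if out.returncode == 0 else []

# Sample line in the shape nvidia-smi prints for a T4
print(parse_gpu_csv("Tesla T4, 15360"))  # [{'name': 'Tesla T4', 'memory_mib': 15360}]
```

On a GPU runtime, `query_gpus()` should return one entry per device; an empty list suggests the GPU runtime is not enabled.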

## 2. Install llcuda Package

### Option A: Install from PyPI (when published)
```python
!pip install llcuda
```

### Option B: Install from GitHub

In [None]:
# Install dependencies
!pip install pybind11 numpy

# Clone and install from source
!git clone https://github.com/waqasm86/local-llama-cuda.git
%cd local-llama-cuda

# Set CUDA architecture to match the GPU's compute capability
# (T4 = 75; use 60 for P100 or 70 for V100)
import os
os.environ['CUDA_ARCHITECTURES'] = '75'

# Install
!pip install -e .
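
The architecture value depends on which GPU the runtime assigned. As a minimal sketch, the lookup below maps the three GPUs this notebook mentions to their compute capabilities. The function name `cuda_arch_for` and the fallback to `75` are assumptions for illustration, not something the package provides.

```python
# Compute capabilities of the GPUs named in this notebook
COMPUTE_CAPABILITY = {"T4": "75", "P100": "60", "V100": "70"}

def cuda_arch_for(gpu_name, default="75"):
    """Return the CUDA architecture string for a GPU name such as 'Tesla T4'."""
    upper = gpu_name.upper()
    for key, arch in COMPUTE_CAPABILITY.items():
        if key in upper:
            return arch
    return default  # Unknown GPU: fall back to the T4 value (an assumption)

print(cuda_arch_for("Tesla T4"))              # 75
print(cuda_arch_for("Tesla P100-PCIE-16GB"))  # 60
```

The result can be assigned to `os.environ['CUDA_ARCHITECTURES']` before running the install step.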

## 3. Install llama.cpp Server

`llcuda` sends requests to a running `llama-server` process, so we build the server from the llama.cpp sources first.

In [None]:
# Clone llama.cpp
!git clone https://github.com/ggerganov/llama.cpp.git /content/llama.cpp
%cd /content/llama.cpp

# Build with CUDA support
!mkdir -p build && cd build && cmake .. -DGGML_CUDA=ON && cmake --build . --config Release -j8

# Verify build
!./build/bin/llama-server --version

## 4. Download a Model

Download a GGUF model for testing. We'll use Mistral 7B Instruct v0.2 with Q4_K_M quantization.

In [None]:
# Download model (using HuggingFace)
!pip install huggingface_hub

from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    local_dir="/content/models"
)

print(f"Model downloaded to: {model_path}")
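
A truncated or mislabeled download is easier to catch before the server tries to load it. GGUF files begin with the four ASCII bytes `GGUF` followed by a little-endian 32-bit version number, so a quick header check is possible. The helper name `read_gguf_header` is illustrative.

```python
import struct

def read_gguf_header(path):
    """Return the GGUF version number, or None if the file is not GGUF."""
    with open(path, "rb") as f:
        header = f.read(8)
    # The first 4 bytes are the magic; the next 4 are a little-endian uint32 version
    if len(header) < 8 or header[:4] != b"GGUF":
        return None
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

Calling `read_gguf_header(model_path)` on the downloaded file should return an integer; `None` means the file is not a valid GGUF model. This checks only the header, not the integrity of the whole file.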

## 5. Start llama-server (Background Process)

In [None]:
import subprocess
import time
import requests

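# Start llama-server in the background. Output goes to a log file rather than
# an unread pipe, which could fill up and block the server.
server_log = open('/content/llama-server.log', 'w')
server_process = subprocess.Popen([
    '/content/llama.cpp/build/bin/llama-server',
    '-m', model_path,
    '--port', '8090',
    '-ngl', '99',  # Offload all layers to GPU
    '-c', '4096',  # Context size
    '-b', '512',   # Batch size
], stdout=server_log, stderr=subprocess.STDOUT)

# Poll the health endpoint until the server is ready (up to ~30 seconds)
print("Starting llama-server...")
for i in range(30):
    if server_process.poll() is not None:
        print("❌ Server exited early; see /content/llama-server.log")
        break
    try:
        response = requests.get('http://127.0.0.1:8090/health', timeout=1)
        if response.status_code == 200:
            print("✅ Server is ready!")
            break
    except requests.exceptions.RequestException:
        pass  # Not accepting connections yet
    time.sleep(1)  # Sleep on every attempt, including non-200 responses
    print(f"Waiting... {i+1}/30")
else:
    print("❌ Server failed to start")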
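
The same wait can be wrapped in a reusable function with a wall-clock deadline, which is handy when a larger model needs longer to load. This is a minimal sketch using only the standard library. The name `wait_for_health` and its defaults are assumptions, and it treats any non-200 response as "not ready yet".

```python
import time
import urllib.error
import urllib.request

def wait_for_health(url, timeout_s=60.0, interval_s=1.0):
    """Poll `url` until it returns HTTP 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Refused connection or error status: keep waiting
        time.sleep(interval_s)
    return False
```

For the server started above, the call would be `wait_for_health('http://127.0.0.1:8090/health', timeout_s=120)`.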

## 6. Basic Inference with llcuda

In [None]:
import llcuda

# Check CUDA availability
print("CUDA Available:", llcuda.check_cuda_available())
print("GPU Info:", llcuda.get_cuda_device_info())

In [None]:
# Create inference engine
engine = llcuda.InferenceEngine(server_url="http://127.0.0.1:8090")

# Load model. llama-server already holds the real weights, so load_model is
# pointed at an empty placeholder file.
open('/tmp/model.gguf', 'a').close()  # Create placeholder file

engine.load_model('/tmp/model.gguf', gpu_layers=99)
print("Model loaded:", engine.is_loaded)

In [None]:
# Run inference
result = engine.infer(
    prompt="Explain quantum computing in simple terms.",
    max_tokens=100,
    temperature=0.7
)

print("\n" + "="*60)
print("GENERATED TEXT:")
print("="*60)
print(result.text)
print("\n" + "="*60)
print(f"Tokens: {result.tokens_generated}")
print(f"Latency: {result.latency_ms:.2f} ms")
print(f"Throughput: {result.tokens_per_sec:.2f} tokens/sec")
print("="*60)
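
If a result looks wrong, it can help to query the backend directly and rule out the wrapper. The sketch below builds a request for llama-server's `/completion` endpoint, which accepts `prompt`, `n_predict`, and `temperature` and returns the text under `content`. The function names are illustrative, and whether `llcuda` uses this endpoint internally is not confirmed here.

```python
import json
import urllib.request

def build_completion_request(base_url, prompt, max_tokens=100, temperature=0.7):
    """Build a POST request for llama-server's /completion endpoint."""
    payload = {
        "prompt": prompt,
        "n_predict": max_tokens,   # llama-server's name for the token limit
        "temperature": temperature,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(base_url, prompt, **kwargs):
    """Send the request and return the generated text."""
    req = build_completion_request(base_url, prompt, **kwargs)
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["content"]
```

With the server running, `complete("http://127.0.0.1:8090", "Hello", max_tokens=16)` returns the raw completion for comparison with `result.text`.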

## 7. Streaming Inference

In [None]:
# Streaming inference with callback
def print_chunk(chunk):
    print(chunk, end='', flush=True)

print("\nStreaming output:")
print("-" * 60)
result = engine.infer_stream(
    prompt="Write a haiku about AI.",
    callback=print_chunk,
    max_tokens=50
)
print("\n" + "-" * 60)
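
For context on what a streaming callback receives: when asked to stream, llama-server sends server-sent events, one `data: {...}` line per chunk. The sketch below extracts the text from such lines. It assumes each event carries `content` and `stop` fields, the sample lines are made up for illustration, and it is not a description of how `infer_stream` is implemented.

```python
import json

def iter_stream_content(lines):
    """Yield text chunks from server-sent-event lines of the form 'data: {...}'."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # Skip blank separators and other SSE fields
        body = line[len("data:"):].strip()
        if body == "[DONE]":
            break  # Terminator used by OpenAI-style endpoints
        event = json.loads(body)
        if event.get("content"):
            yield event["content"]
        if event.get("stop"):
            break  # Final event of the stream

sample = [
    b'data: {"content": "Silent", "stop": false}\n',
    b"\n",
    b'data: {"content": " circuits", "stop": false}\n',
    b'data: {"content": "", "stop": true}\n',
]
print("".join(iter_stream_content(sample)))  # Silent circuits
```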

## 8. Batch Inference

In [None]:
# Batch processing
prompts = [
    "What is machine learning?",
    "Explain neural networks.",
    "What is deep learning?"
]

results = engine.batch_infer(prompts, max_tokens=50)

for i, result in enumerate(results):
    print(f"\n{'='*60}")
    print(f"Prompt {i+1}: {prompts[i]}")
    print(f"{'='*60}")
    print(result.text)
    print(f"Latency: {result.latency_ms:.2f}ms")

## 9. Performance Benchmarking

In [None]:
import numpy as np

# Reset metrics
engine.reset_metrics()

# Run benchmark
num_iterations = 20
latencies = []

print(f"Running {num_iterations} iterations...")
for i in range(num_iterations):
    result = engine.infer(
        prompt="Hello, how are you?",
        max_tokens=64
    )
    latencies.append(result.latency_ms)
    print(f"Iteration {i+1}/{num_iterations}: {result.latency_ms:.2f}ms", end='\r')

print("\n" + "="*60)
print("BENCHMARK RESULTS")
print("="*60)
print(f"Iterations: {num_iterations}")
print(f"Mean latency: {np.mean(latencies):.2f} ms")
print(f"p50: {np.percentile(latencies, 50):.2f} ms")
print(f"p95: {np.percentile(latencies, 95):.2f} ms")
print(f"p99: {np.percentile(latencies, 99):.2f} ms")
print(f"Min: {np.min(latencies):.2f} ms")
print(f"Max: {np.max(latencies):.2f} ms")
print("="*60)
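
When reading the percentiles, it helps to know how they are computed. By default `np.percentile` interpolates linearly between the two nearest sorted samples, so with only 20 iterations the p99 falls between the two slowest runs. A pure-Python sketch of the same calculation (the function name `percentile` is illustrative):

```python
import math

def percentile(values, q):
    """Linearly interpolated percentile, matching NumPy's default method."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0   # Fractional index into the sorted data
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

samples = [10.0, 20.0, 30.0, 40.0]
print(percentile(samples, 50))   # 25.0
print(percentile(samples, 100))  # 40.0
```

More iterations are needed before the tail percentiles say much about typical behavior.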

## 10. Get System Metrics

In [None]:
# Get performance metrics
metrics = engine.get_metrics()

print("\nLatency Metrics:")
print(f"  Mean: {metrics['latency']['mean_ms']:.2f} ms")
print(f"  p50: {metrics['latency']['p50_ms']:.2f} ms")
print(f"  p95: {metrics['latency']['p95_ms']:.2f} ms")
print(f"  p99: {metrics['latency']['p99_ms']:.2f} ms")

print("\nThroughput Metrics:")
print(f"  Total tokens: {metrics['throughput']['total_tokens']}")
print(f"  Total requests: {metrics['throughput']['total_requests']}")
print(f"  Tokens/sec: {metrics['throughput']['tokens_per_sec']:.2f}")
print(f"  Requests/sec: {metrics['throughput']['requests_per_sec']:.2f}")
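
The throughput figures can be cross-checked against the per-request values with simple arithmetic: tokens divided by elapsed seconds. The sketch below assumes requests ran back to back, as in the benchmark loop. How `llcuda` aggregates its own metrics is not confirmed here, and the function names are illustrative.

```python
def tokens_per_second(tokens_generated, latency_ms):
    """Throughput for one request: tokens divided by elapsed seconds."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return tokens_generated / (latency_ms / 1000.0)

def aggregate_throughput(runs):
    """Overall tokens/sec for (tokens_generated, latency_ms) pairs run sequentially."""
    total_tokens = sum(tokens for tokens, _ in runs)
    total_ms = sum(ms for _, ms in runs)
    return tokens_per_second(total_tokens, total_ms)

print(tokens_per_second(64, 2000.0))                       # 32.0
print(aggregate_throughput([(64, 2000.0), (32, 1000.0)]))  # 32.0
```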

## 11. Cleanup

In [None]:
# Stop server
server_process.terminate()
server_process.wait()
print("Server stopped")
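
If the server ignores the termination request, `wait()` without a timeout blocks the cell indefinitely. A more defensive shutdown, sketched below, waits for a bounded time and then force-kills the process. The helper name `stop_process` and the 10-second default are assumptions, and the demonstration uses a stand-in child process instead of llama-server.

```python
import subprocess
import sys

def stop_process(proc, timeout_s=10.0):
    """Ask a process to exit, force-killing it if it ignores the request."""
    if proc.poll() is None:  # Still running
        proc.terminate()
        try:
            proc.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
    return proc.returncode

# Stand-in child process that would otherwise sleep for a minute
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
print("Exit code:", stop_process(child))
```

In this notebook the call would be `stop_process(server_process)`.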

## Summary

You've successfully run CUDA-accelerated LLM inference using the `llcuda` package!

**Key Features:**
- ✅ CUDA acceleration on T4 GPU
- ✅ Simple Python API
- ✅ Streaming support
- ✅ Batch processing
- ✅ Performance metrics

**Next Steps:**
- Try different models (Llama, Mistral, Phi)
- Experiment with different parameters (temperature, top_p)
- Benchmark on different GPU types (P100, V100)
- Build applications using the API