# llcuda Quick Start Guide - JupyterLab Edition

This notebook demonstrates how to use **llcuda** for CUDA-accelerated LLM inference in JupyterLab.

**Requirements:**
- Python 3.11+
- NVIDIA GPU with CUDA support
- llcuda package installed (`pip install llcuda`)
- llama-cpp-cuda binaries (this notebook assumes they live in `/media/waqasm86/External1/Project-Nvidia/llama-cpp-cuda/`; adjust the path for your machine)

---

## 1. Setup and System Check

First, let's verify that everything is set up correctly.

In [None]:
import llcuda
import sys

print(f"llcuda version: {llcuda.__version__}")
print(f"Python version: {sys.version}")
print("\n" + "="*60)

# Print comprehensive system information
llcuda.print_system_info()

## 2. Check CUDA Availability

In [None]:
# Check if CUDA is available
if llcuda.check_cuda_available():
    print("✓ CUDA is available!")
    
    # Get GPU information
    gpu_info = llcuda.get_cuda_device_info()
    if gpu_info:
        print(f"\nCUDA Version: {gpu_info['cuda_version']}")
        print(f"Number of GPUs: {len(gpu_info['gpus'])}")
        
        for i, gpu in enumerate(gpu_info['gpus']):
            print(f"\nGPU {i}:")
            print(f"  Name: {gpu['name']}")
            print(f"  Memory: {gpu['memory']}")
            print(f"  Driver: {gpu['driver_version']}")
else:
    print("❌ CUDA not available. Please check your NVIDIA drivers.")

## 3. Find Available Models

Let's find GGUF models in common locations.

In [None]:
# Find GGUF models
models = llcuda.find_gguf_models()

print(f"Found {len(models)} GGUF models:\n")
for i, model in enumerate(models):
    size_mb = model.stat().st_size / (1024 * 1024)
    print(f"{i+1}. {model.name}")
    print(f"   Path: {model}")
    print(f"   Size: {size_mb:.1f} MB\n")
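The source does not say which locations `find_gguf_models()` searches. If your models live elsewhere, a minimal sketch of a manual scan might look like the following. The helper name `scan_for_gguf` and the search directory are hypothetical and are not part of llcuda.

```python
from pathlib import Path
from typing import Iterable, List


def scan_for_gguf(directories: Iterable[Path]) -> List[Path]:
    """Recursively collect *.gguf files under the given directories (hypothetical helper)."""
    found: List[Path] = []
    for directory in directories:
        if not directory.is_dir():
            continue  # skip locations that do not exist on this machine
        found.extend(p for p in directory.rglob("*.gguf") if p.is_file())
    # Largest files first, so the biggest models are listed at the top.
    return sorted(found, key=lambda p: p.stat().st_size, reverse=True)


# Hypothetical search location; point this at wherever your models live.
candidates = [Path.home() / "models"]
for path in scan_for_gguf(candidates):
    print(f"{path.name}: {path.stat().st_size / 1024**2:.1f} MB")
```

Any path it returns can be used as `MODEL_PATH` in the next section.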

## 4. Basic Usage: Auto-Start Mode (Easiest)

This is the simplest way to use llcuda. The package automatically:
1. Finds the llama-server executable
2. Starts the server with your model
3. Connects and runs inference
4. Cleans up when done
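How llcuda locates the executable internally is not shown in this notebook. As an illustration of the first step in the list above, here is a sketch that checks an environment variable and then falls back to a `PATH` lookup. The variable name `LLAMA_SERVER_PATH` and the function name are assumptions made for this example only.

```python
import os
import shutil
from pathlib import Path
from typing import Optional


def locate_llama_server(env_var: str = "LLAMA_SERVER_PATH") -> Optional[Path]:
    """Return a path to an executable llama-server, or None if none is found.

    The environment variable name is a hypothetical choice for this sketch.
    """
    # 1. An explicit override via the environment variable takes priority.
    override = os.environ.get(env_var)
    if override and Path(override).is_file() and os.access(override, os.X_OK):
        return Path(override)

    # 2. Otherwise fall back to searching the directories on PATH.
    found = shutil.which("llama-server")
    return Path(found) if found else None


print(locate_llama_server())
```

If this prints `None`, the binary is neither on `PATH` nor at the override location, and auto-start would need the path supplied another way.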

In [None]:
# Set your model path here
MODEL_PATH = "/media/waqasm86/External1/Project-Nvidia/llama-cpp-cuda/bin/gemma-3-1b-it-Q4_K_M.gguf"

# Create inference engine
engine = llcuda.InferenceEngine()

# Load model with auto-start
# This will automatically start llama-server if it's not running
engine.load_model(
    model_path=MODEL_PATH,
    gpu_layers=20,  # Adjust based on your GPU memory (99 = all layers)
    ctx_size=2048,
    auto_start=True,  # Auto-start server
    verbose=True
)

print("\n✓ Ready for inference!")

## 5. Run Simple Inference

In [None]:
# Run inference
result = engine.infer(
    prompt="What is artificial intelligence?",
    max_tokens=100,
    temperature=0.7
)

# Display results
if result.success:
    print("Generated Text:")
    print("="*60)
    print(result.text)
    print("="*60)
    print("\nPerformance Metrics:")
    print(f"  Tokens Generated: {result.tokens_generated}")
    print(f"  Latency: {result.latency_ms:.2f} ms")
    print(f"  Throughput: {result.tokens_per_sec:.2f} tokens/sec")
else:
    print(f"❌ Error: {result.error_message}")
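The three metrics printed above are related. Assuming throughput is simply the token count divided by wall-clock latency (the source does not confirm how llcuda computes `tokens_per_sec`), the arithmetic can be sketched as follows. The function name is hypothetical.

```python
def tokens_per_second(tokens_generated: int, latency_ms: float) -> float:
    """Throughput as tokens divided by latency in seconds (assumed definition)."""
    if latency_ms <= 0:
        return 0.0  # avoid dividing by zero for failed or instantaneous requests
    return tokens_generated / (latency_ms / 1000.0)


# 100 tokens in 2500 ms is 100 / 2.5 s.
print(f"{tokens_per_second(100, 2500.0):.2f} tokens/sec")  # prints "40.00 tokens/sec"
```

If the value reported by `result.tokens_per_sec` differs noticeably from this calculation, the library may be excluding prompt-processing time or measuring something else.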

## 6. Try Different Prompts

Let's try a few different prompts to see how the model responds.

In [None]:
prompts = [
    "Explain quantum computing in simple terms.",
    "Write a haiku about CUDA programming.",
    "What are the benefits of GPU acceleration?"
]

for i, prompt in enumerate(prompts, 1):
    print(f"\n{'='*60}")
    print(f"Prompt {i}: {prompt}")
    print('='*60)
    
    result = engine.infer(prompt, max_tokens=80, temperature=0.7)
    
    if result.success:
        print(result.text)
        print(f"\n⚡ {result.tokens_per_sec:.1f} tok/s | {result.latency_ms:.0f}ms")
    else:
        print(f"❌ Error: {result.error_message}")

## 7. Batch Inference

Process multiple prompts in one go.

In [None]:
batch_prompts = [
    "What is machine learning?",
    "What is deep learning?",
    "What is natural language processing?"
]

print("Running batch inference...\n")
results = engine.batch_infer(batch_prompts, max_tokens=50)

for i, (prompt, result) in enumerate(zip(batch_prompts, results), 1):
    print(f"{i}. {prompt}")
    if result.success:
        print(f"   → {result.text[:100]}...")
        print(f"   ⚡ {result.tokens_per_sec:.1f} tok/s\n")
    else:
        print(f"   ❌ Error: {result.error_message}\n")

## 8. Performance Metrics

Get detailed performance statistics for all inferences.

In [None]:
metrics = engine.get_metrics()

print("Performance Metrics")
print("="*60)

print("\nLatency Statistics:")
latency = metrics['latency']
print(f"  Mean: {latency['mean_ms']:.2f} ms")
print(f"  p50:  {latency['p50_ms']:.2f} ms")
print(f"  p95:  {latency['p95_ms']:.2f} ms")
print(f"  p99:  {latency['p99_ms']:.2f} ms")
print(f"  Min:  {latency['min_ms']:.2f} ms")
print(f"  Max:  {latency['max_ms']:.2f} ms")

print("\nThroughput Statistics:")
throughput = metrics['throughput']
print(f"  Total Tokens: {throughput['total_tokens']}")
print(f"  Total Requests: {throughput['total_requests']}")
print(f"  Tokens/sec: {throughput['tokens_per_sec']:.2f}")
print(f"  Requests/sec: {throughput['requests_per_sec']:.2f}")
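To make the p50/p95/p99 figures concrete, here is a sketch that derives the same kind of summary from a plain list of latencies using the nearest-rank method. Which percentile method llcuda uses is not stated, so treat this as one reasonable definition. The function names and sample values are made up for illustration.

```python
import math
from typing import Dict, Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def summarize(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """Build a latency summary with the same keys as the dictionary printed above."""
    return {
        "mean_ms": sum(latencies_ms) / len(latencies_ms),
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "min_ms": min(latencies_ms),
        "max_ms": max(latencies_ms),
    }


# Made-up sample: four fast requests and one slow outlier.
sample = [120.0, 95.0, 110.0, 400.0, 105.0]
for key, value in summarize(sample).items():
    print(f"{key}: {value:.2f}")
```

With so few samples, p95 and p99 both land on the single outlier, which is why tail percentiles only become informative after many requests.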

## 9. Visualize Performance (Optional)

Create a simple plot of latencies.

In [None]:
try:
    import matplotlib.pyplot as plt
    
    # Note: _metrics is a private attribute (leading underscore) and may change between releases
    latencies = engine._metrics['latencies']
    
    if latencies:
        plt.figure(figsize=(10, 4))
        
        # Latency over time
        plt.subplot(1, 2, 1)
        plt.plot(latencies, marker='o')
        plt.xlabel('Request Number')
        plt.ylabel('Latency (ms)')
        plt.title('Inference Latency Over Time')
        plt.grid(True, alpha=0.3)
        
        # Latency distribution
        plt.subplot(1, 2, 2)
        plt.hist(latencies, bins=20, edgecolor='black')
        plt.xlabel('Latency (ms)')
        plt.ylabel('Frequency')
        plt.title('Latency Distribution')
        plt.grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
    else:
        print("No metrics available yet. Run some inferences first.")
        
except ImportError:
    print("matplotlib not installed. Install with: pip install matplotlib")

## 10. Advanced: Manual Server Management

For more control, you can manage llama-server yourself through `ServerManager`.

In [None]:
from llcuda import ServerManager

# Create server manager
manager = ServerManager()

# Get server information
info = manager.get_server_info()
print("Server Info:")
print(f"  Running: {info['running']}")
print(f"  URL: {info['url']}")
print(f"  PID: {info['process_id']}")
print(f"  Executable: {info['executable']}")

## 11. One-Liner Quick Inference

For quick tests, use the convenience function.

In [None]:
# Quick one-liner inference (uses existing server)
response = llcuda.quick_infer(
    prompt="Explain GPU computing in one sentence.",
    max_tokens=50,
    auto_start=False  # Use existing server
)

print(response)
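With `auto_start=False`, the call depends on a server that is already running. The notebook does not show how `quick_infer` talks to it. As a sketch of what a direct request could look like, the following builds a POST for llama-server's `/completion` endpoint using only the standard library. The base URL (llama-server's usual default of port 8080), the helper name, and the payload fields are assumptions to verify against your llama.cpp build.

```python
import json
import urllib.request


def build_completion_request(
    prompt: str,
    max_tokens: int = 50,
    temperature: float = 0.7,
    base_url: str = "http://127.0.0.1:8080",  # assumed default host and port
) -> urllib.request.Request:
    """Build (but do not send) a POST request for llama-server's /completion endpoint."""
    payload = {
        "prompt": prompt,
        "n_predict": max_tokens,  # llama.cpp's name for the generation limit
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_completion_request("Explain GPU computing in one sentence.")
print(request.get_method(), request.full_url)

# To actually send it (requires a running server):
# with urllib.request.urlopen(request, timeout=60) as response:
#     print(json.load(response)["content"])
```

Building the request separately from sending it makes it easy to inspect the payload before any server is involved.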

## 12. Temperature Comparison

Compare outputs with different temperature settings.

In [None]:
prompt = "Write a creative story about a robot learning to paint."
temperatures = [0.3, 0.7, 1.0]

print("Comparing Different Temperatures\n")
print("="*60)

for temp in temperatures:
    print(f"\nTemperature: {temp}")
    print("-" * 60)
    
    result = engine.infer(
        prompt=prompt,
        max_tokens=80,
        temperature=temp
    )
    
    if result.success:
        print(result.text)
    else:
        print(f"Error: {result.error_message}")
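To see why lower temperatures give more predictable text, it helps to look at what temperature does to the next-token distribution. In the standard formulation, logits are divided by the temperature before the softmax. The sketch below applies that to three made-up logits and is independent of llcuda.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after scaling them by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for temp in (0.3, 0.7, 1.0):
    probs = softmax_with_temperature(logits, temp)
    print(f"T={temp}: {[round(p, 3) for p in probs]}")
```

At the lowest temperature the top-scoring token takes most of the probability mass, and as the temperature rises the distribution flattens, so sampling picks less likely tokens more often.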

## 13. Cleanup

When you're done, stop the server and clean up resources.

In [None]:
# Unload model and stop server
engine.unload_model()
print("✓ Server stopped and resources cleaned up.")

---

## Summary

You've learned how to:
- ✅ Check CUDA availability and GPU info
- ✅ Find GGUF models automatically
- ✅ Use auto-start mode for easy setup
- ✅ Run single and batch inference
- ✅ Monitor performance metrics
- ✅ Manage the server manually
- ✅ Experiment with different parameters

### Next Steps:
1. Try different GGUF models from Hugging Face
2. Experiment with different GPU layer configurations
3. Build your own applications using llcuda
4. Check out the documentation at: https://github.com/waqasm86/llcuda

---

**Happy Inferencing! 🚀**