## Step 1: Verify Kaggle GPU Environment

First, let's confirm we have 2× Tesla T4 GPUs available.

In [None]:
# Verify we have 2× T4 GPUs
import subprocess
import os

print("="*70)
print("🔍 KAGGLE GPU ENVIRONMENT CHECK")
print("="*70)

# Check nvidia-smi
result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
gpu_lines = [l for l in result.stdout.strip().split("\n") if l.startswith("GPU")]
print(f"\n📊 Detected GPUs: {len(gpu_lines)}")
for line in gpu_lines:
    print(f"   {line}")

# Check CUDA version
print("\n📊 CUDA Version:")
!nvcc --version | grep release

# Check total VRAM
print("\n📊 VRAM Summary:")
!nvidia-smi --query-gpu=index,name,memory.total --format=csv

# Verify we have 2 GPUs
if len(gpu_lines) >= 2:
    print("\n✅ Multi-GPU environment confirmed! Ready for llamatelemetry v0.1.0.")
else:
    print("\n⚠️ WARNING: Fewer than 2 GPUs detected!")
    print("   Enable 'GPU T4 x2' in Kaggle notebook settings.")
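
The cell above only counts lines that start with `GPU`. As a minimal sketch, the same `nvidia-smi -L` output can be parsed into index and name pairs, which also lets you check that both devices are T4s. The helper name `parse_gpu_list` and the sample UUIDs are illustrative, and the line shape `GPU <n>: <name> (UUID: ...)` is an assumption about the tool's usual output.

```python
import re

# Assumed line shape of `nvidia-smi -L`: "GPU 0: Tesla T4 (UUID: GPU-...)"
GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: (.+)\)$")

def parse_gpu_list(text):
    """Return a list of (index, name) tuples parsed from `nvidia-smi -L` output."""
    gpus = []
    for line in text.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            gpus.append((int(match.group(1)), match.group(2)))
    return gpus

# Sample text with placeholder UUIDs (not real device IDs)
sample = (
    "GPU 0: Tesla T4 (UUID: GPU-00000000-0000-0000-0000-000000000000)\n"
    "GPU 1: Tesla T4 (UUID: GPU-11111111-1111-1111-1111-111111111111)\n"
)
gpus = parse_gpu_list(sample)
print(gpus)
print("Dual T4:", len(gpus) == 2 and all(name == "Tesla T4" for _, name in gpus))
```

In the notebook you could pass `result.stdout` from the cell above instead of `sample`.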

## Step 2: Install llamatelemetry v0.1.0

Install from GitHub with pre-built binaries for Kaggle T4×2.

In [None]:
%%time
# Install llamatelemetry v0.1.0 from GitHub (force fresh install, no cache)
print("📦 Installing llamatelemetry v0.1.0...")
!pip install -q --no-cache-dir --force-reinstall git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.0

# Verify installation
import llamatelemetry
print(f"\n✅ llamatelemetry {llamatelemetry.__version__} installed!")

# Check llamatelemetry status using available APIs
from llamatelemetry import check_cuda_available, get_cuda_device_info
from llamatelemetry.api.multigpu import gpu_count

cuda_info = get_cuda_device_info()
print(f"\n📊 llamatelemetry Status:")
print(f"   CUDA Available: {check_cuda_available()}")
print(f"   GPUs: {gpu_count()}")
if cuda_info:
    print(f"   CUDA Version: {cuda_info.get('cuda_version', 'N/A')}")

## Step 3: Download a GGUF Model

We'll use Gemma 3 4B (instruction-tuned) in Q4_K_M quantization, which fits comfortably on Kaggle T4 GPUs.

In [None]:
%%time
from huggingface_hub import hf_hub_download
import os

# Model selection - optimized for 15GB VRAM
MODEL_REPO = "unsloth/gemma-3-4b-it-GGUF"
MODEL_FILE = "gemma-3-4b-it-Q4_K_M.gguf"

print(f"📥 Downloading {MODEL_FILE}...")
print(f"   Repository: {MODEL_REPO}")
print(f"   Expected size: ~2.5GB")

# Download to Kaggle working directory
model_path = hf_hub_download(
    repo_id=MODEL_REPO,
    filename=MODEL_FILE,
    local_dir="/kaggle/working/models"
)

print(f"\n✅ Model downloaded: {model_path}")

# Show model size
size_gb = os.path.getsize(model_path) / (1024**3)
print(f"   Size: {size_gb:.2f} GB")
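
To make the "fits on a T4" reasoning concrete, here is a rough back-of-the-envelope check. It is only a sketch: the 15 GB per-GPU figure comes from the comment in the cell above, while the 1.5 GB per-GPU overhead (for the KV cache, CUDA context, and scratch buffers) is a hypothetical placeholder, not a measured value. The function name `fits_in_vram` is also illustrative.

```python
def fits_in_vram(model_gb, vram_gb_per_gpu=15.0, n_gpus=2, overhead_gb_per_gpu=1.5):
    """Return (fits, usable_gb) for a model of `model_gb` spread across `n_gpus`.

    overhead_gb_per_gpu is an assumed reserve for KV cache and runtime buffers.
    """
    usable_gb = (vram_gb_per_gpu - overhead_gb_per_gpu) * n_gpus
    return model_gb <= usable_gb, usable_gb

fits, usable = fits_in_vram(2.5)
print(f"Usable VRAM: {usable:.1f} GB, model fits: {fits}")
```

With the notebook's `size_gb` you could call `fits_in_vram(size_gb)`. Actual headroom depends on context size and batch settings, so treat the result as a sanity check only.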

## Step 4: Start llama-server

Start the inference server with the dual-T4 configuration, which splits the model's layers across both GPUs.

In [None]:
from llamatelemetry.server import ServerManager
from llamatelemetry.api.multigpu import kaggle_t4_dual_config

# Get optimized configuration for Kaggle T4×2
config = kaggle_t4_dual_config()

print("🚀 Starting llama-server with Multi-GPU configuration...")
print(f"   Model: {model_path}")
print(f"   GPU Layers: {config.n_gpu_layers} (all layers)")
print(f"   Context Size: {config.ctx_size}")
print(f"   Tensor Split: {config.tensor_split} (equal across 2 GPUs)")
print(f"   Flash Attention: {config.flash_attention}")

# Create server manager
server = ServerManager(server_url="http://127.0.0.1:8080")

# Start server with multi-GPU configuration
# Pass tensor_split as comma-separated string for --tensor-split flag
tensor_split_str = ",".join(str(x) for x in config.tensor_split) if config.tensor_split else None

try:
    server.start_server(
        model_path=model_path,
        host="127.0.0.1",
        port=8080,
        gpu_layers=config.n_gpu_layers,
        ctx_size=config.ctx_size,
        timeout=120,
        verbose=True,
        # Multi-GPU parameters (passed via **kwargs)
        flash_attn=1 if config.flash_attention else 0,
        split_mode="layer",
        tensor_split=tensor_split_str,
    )
    print("\n✅ llama-server is ready with dual T4 GPUs!")
    print(f"   API endpoint: http://127.0.0.1:8080")
except Exception as e:
    print(f"\n❌ Server failed to start: {e}")
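
If you want an independent readiness check after the cell above, one option is to poll the server over HTTP. The sketch below assumes the underlying llama-server exposes a `/health` route that returns HTTP 200 once the model is loaded. The helper name `wait_for_server` and its defaults are illustrative and not part of llamatelemetry.

```python
import time
import urllib.error
import urllib.request

def wait_for_server(base_url="http://127.0.0.1:8080", timeout=120.0, interval=1.0):
    """Poll `<base_url>/health` until it answers 200 or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Not listening yet, or the server answered with an error status
            pass
        time.sleep(interval)
    return False
```

Calling `wait_for_server()` right after `server.start_server(...)` returns `True` once the endpoint responds and `False` if the timeout expires first.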

## Step 5: Run Your First Inference

Use the OpenAI-compatible API to chat with the model.

In [None]:
from llamatelemetry.api.client import LlamaCppClient

# Create client
client = LlamaCppClient(base_url="http://127.0.0.1:8080")

# Test simple completion using OpenAI-compatible API
print("💬 Testing inference...\n")

response = client.chat.create(
    messages=[
        {"role": "user", "content": "What is CUDA? Explain in 2 sentences."}
    ],
    max_tokens=100,
    temperature=0.7
)

print("📝 Response:")
print(response.choices[0].message.content)

print(f"\n📊 Stats:")
print(f"   Tokens generated: {response.usage.completion_tokens}")
print(f"   Total tokens: {response.usage.total_tokens}")
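
Because the API is OpenAI-compatible, the same chat call can be made without the client class. The following is a sketch using only the standard library, assuming the server exposes the usual `/v1/chat/completions` route. The helper names `build_chat_request` and `send_chat_request` are hypothetical.

```python
import json
import urllib.request

def build_chat_request(prompt, base_url="http://127.0.0.1:8080",
                       max_tokens=100, temperature=0.7):
    """Build (but do not send) a POST request for a single-turn chat completion."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send_chat_request(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

request = build_chat_request("What is CUDA? Explain in 2 sentences.")
print(request.get_method(), request.full_url)
```

With the server running, `send_chat_request(request)` would return the reply as a string. Splitting the build and send steps lets you inspect the payload before any network traffic occurs.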

## Step 6: Streaming Response Example

Stream responses for real-time output.

In [None]:
# Streaming example using OpenAI-compatible API
print("💬 Streaming response...\n")

for chunk in client.chat.create(
    messages=[
        {"role": "user", "content": "Write a Python function to calculate factorial."}
    ],
    max_tokens=200,
    temperature=0.3,
    stream=True  # Enable streaming
):
    if hasattr(chunk, 'choices') and chunk.choices:
        delta = chunk.choices[0].delta
        if hasattr(delta, 'content') and delta.content:
            print(delta.content, end="", flush=True)

print("\n\n✅ Streaming complete!")
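
Under the hood, OpenAI-style streaming is typically delivered as server-sent events: each event is a `data: {json}` line, and the stream ends with `data: [DONE]`. As an illustration of what the client loop above is unpacking, here is a small parser run on hand-written sample lines. The sample payloads are made up for the example and were not captured from a real server.

```python
import json

def iter_sse_content(lines):
    """Yield the text fragments carried by OpenAI-style streaming chunks."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        content = (choices[0].get("delta") or {}).get("content")
        if content:
            yield content

# Hand-written sample events (illustrative only)
sample_stream = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "def "}}]}',
    'data: {"choices": [{"delta": {"content": "factorial"}}]}',
    "data: [DONE]",
]
print("".join(iter_sse_content(sample_stream)))
```

The first chunk carries only a role, which is why both the client loop and this parser check for `content` before printing.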

## Step 7: Check GPU Memory Usage

Monitor VRAM usage to understand resource consumption.

In [None]:
# Check GPU memory usage
print("📊 GPU Memory Usage:")
print("="*60)
!nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu --format=csv

print("\n💡 Note:")
print("   With the tensor split from Step 4, llama-server uses both GPUs.")
print("   To keep GPU 1 free for RAPIDS/Graphistry, run the server on GPU 0 only.")
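
To work with these numbers in Python rather than read them off the table, you can request machine-friendly output with `--format=csv,noheader,nounits` and parse it. The sketch below runs on a hard-coded sample string, whose memory figures are invented for illustration. The helper name `parse_memory_usage` is hypothetical.

```python
import csv
import io

def parse_memory_usage(csv_text):
    """Parse `index, name, memory.used, memory.total` rows (MiB, no header, no units)."""
    rows = []
    for fields in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(fields) != 4:
            continue  # skip blank or unexpected lines
        index, name, used_mib, total_mib = fields
        used_mib, total_mib = int(used_mib), int(total_mib)
        rows.append({
            "index": int(index),
            "name": name.strip(),
            "used_mib": used_mib,
            "total_mib": total_mib,
            "percent_used": round(100 * used_mib / total_mib, 1),
        })
    return rows

# Invented sample values, shaped like the output of:
# nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv,noheader,nounits
sample_csv = "0, Tesla T4, 1536, 15360\n1, Tesla T4, 3072, 15360\n"
for gpu in parse_memory_usage(sample_csv):
    print(f"GPU {gpu['index']} ({gpu['name']}): {gpu['percent_used']}% used")
```

In the notebook, the text could come from `subprocess.run([...], capture_output=True, text=True).stdout`, as in Step 1.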

## Step 8: Cleanup

Stop the server when done.

In [None]:
# Stop the server
print("🛑 Stopping llama-server...")
server.stop_server()
print("\n✅ Server stopped. Resources freed.")

# Verify GPU memory is released
print("\n📊 GPU Memory After Cleanup:")
!nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
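
If an earlier cell raises an exception, the cleanup cell above may never run and the server keeps holding VRAM. One way to guard against that is to wrap the server lifetime in a context manager. This is a sketch built only on the `start_server` and `stop_server` methods used in this notebook. The wrapper name `running_server` and the stand-in `FakeServer` class are hypothetical and exist only so the example runs without a GPU.

```python
from contextlib import contextmanager

@contextmanager
def running_server(server, **start_kwargs):
    """Start `server`, yield it, and always stop it on exit, even after an error."""
    server.start_server(**start_kwargs)
    try:
        yield server
    finally:
        server.stop_server()

class FakeServer:
    """Stand-in with the same two method names, used only for this demo."""
    def __init__(self):
        self.events = []
    def start_server(self, **kwargs):
        self.events.append("start")
    def stop_server(self):
        self.events.append("stop")

fake = FakeServer()
try:
    with running_server(fake, port=8080):
        raise RuntimeError("simulated failure during inference")
except RuntimeError:
    pass
print(fake.events)
```

The server is stopped even though the body of the `with` block raised. Note that if `start_server` itself fails, `stop_server` is not called in this sketch.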

## 🎉 Quick Start Complete!

You've successfully:
1. ✅ Verified Kaggle GPU environment
2. ✅ Installed llamatelemetry v0.1.0
3. ✅ Downloaded a GGUF model
4. ✅ Started llama-server
5. ✅ Ran inference with chat completion
6. ✅ Used streaming responses

## Next Steps

Explore more tutorials:
- 📘 [02-llama-server-setup](02-llama-server-setup-llamatelemetry-v0.1.0.ipynb) - Advanced server configuration
- 📘 [03-multi-gpu-inference](03-multi-gpu-inference-llamatelemetry-v0.1.0.ipynb) - Dual T4 inference
- 📘 [04-gguf-quantization](04-gguf-quantization-llamatelemetry-v0.1.0.ipynb) - Quantization guide
- 📘 [05-unsloth-integration](05-unsloth-integration-llamatelemetry-v0.1.0.ipynb) - Unsloth training → llamatelemetry
- 📘 [06-split-gpu-graphistry](06-split-gpu-graphistry-llamatelemetry-v0.1.0.ipynb) - LLM + Graph visualization

---

**llamatelemetry v0.1.0** | CUDA 12 Inference Backend for Unsloth