## Step 1: Verify Kaggle GPU Environment

Validates dual Tesla T4 GPU availability, CUDA version, and VRAM capacity to ensure the environment meets llamatelemetry v0.1.0 requirements for multi-GPU inference.

In [1]:
# Verify we have 2× T4 GPUs
import subprocess
import os

print("="*70)
print("🔍 KAGGLE GPU ENVIRONMENT CHECK")
print("="*70)

# Check nvidia-smi
result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
gpu_lines = [l for l in result.stdout.strip().split("\n") if l.startswith("GPU")]
print(f"\n📊 Detected GPUs: {len(gpu_lines)}")
for line in gpu_lines:
    print(f"   {line}")

# Check CUDA version
print("\n📊 CUDA Version:")
!nvcc --version | grep release

# Check total VRAM
print("\n📊 VRAM Summary:")
!nvidia-smi --query-gpu=index,name,memory.total --format=csv
!nvidia-smi --query-gpu=index,name,memory.total --format=csv

# Verify we have 2 GPUs
if len(gpu_lines) >= 2:
    print("\n✅ Multi-GPU environment confirmed! Ready for llamatelemetry v0.1.0.")
else:
    print("\n⚠️ WARNING: Fewer than 2 GPUs detected!")
    print("   Enable 'GPU T4 x2' in Kaggle notebook settings.")

🔍 KAGGLE GPU ENVIRONMENT CHECK

📊 Detected GPUs: 2
   GPU 0: Tesla T4 (UUID: GPU-7ba2c248-3b76-b125-7f27-7ac05c7faf42)
   GPU 1: Tesla T4 (UUID: GPU-27208f88-2843-f816-0b43-cf8b64926aac)

📊 CUDA Version:
Cuda compilation tools, release 12.5, V12.5.82

📊 VRAM Summary:
index, name, memory.total [MiB]
0, Tesla T4, 15360 MiB
1, Tesla T4, 15360 MiB

✅ Multi-GPU environment confirmed! Ready for llamatelemetry v0.1.0.
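The check above only counts lines that start with `GPU`. As a minimal sketch, the same `nvidia-smi -L` text could also be parsed into structured records, which makes it possible to confirm the GPU model as well as the count. The helper name `parse_gpu_list` and the regular expression are illustrative assumptions, not part of llamatelemetry; the sample text is copied from the output above.

```python
import re

# One line of `nvidia-smi -L` output looks like:
# "GPU 0: Tesla T4 (UUID: GPU-7ba2c248-...)"
GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: (GPU-[0-9a-fA-F-]+)\)$")


def parse_gpu_list(text):
    """Return a list of {index, name, uuid} dicts parsed from `nvidia-smi -L` text.

    Hypothetical helper for illustration; lines that do not match are skipped.
    """
    gpus = []
    for line in text.strip().splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            gpus.append({
                "index": int(match.group(1)),
                "name": match.group(2),
                "uuid": match.group(3),
            })
    return gpus


# Sample text taken from the recorded notebook output
sample = """GPU 0: Tesla T4 (UUID: GPU-7ba2c248-3b76-b125-7f27-7ac05c7faf42)
GPU 1: Tesla T4 (UUID: GPU-27208f88-2843-f816-0b43-cf8b64926aac)"""

gpus = parse_gpu_list(sample)
all_t4 = all(gpu["name"] == "Tesla T4" for gpu in gpus)
print(len(gpus), all_t4)
```

In the notebook, `result.stdout` from the cell above could be passed in place of `sample`.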


## Step 2: Install llamatelemetry v0.1.0 and Dependencies

Installs llamatelemetry from GitHub with pre-built CUDA binaries optimized for Kaggle's dual T4 GPUs, including FlashAttention support and automatic binary verification.

In [2]:
%%time
# Install llamatelemetry v0.1.0 from GitHub (force fresh install, no cache)
print("📦 Installing llamatelemetry v0.1.0...")
!pip install -q --no-cache-dir --force-reinstall git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.0

# Verify installation
import llamatelemetry
print(f"\n✅ llamatelemetry {llamatelemetry.__version__} installed!")

# Check llamatelemetry status using available APIs
from llamatelemetry import check_cuda_available, get_cuda_device_info
from llamatelemetry.api.multigpu import gpu_count

cuda_info = get_cuda_device_info()
print(f"\n📊 llamatelemetry Status:")
print(f"   CUDA Available: {check_cuda_available()}")
print(f"   GPUs: {gpu_count()}")
if cuda_info:
    print(f"   CUDA Version: {cuda_info.get('cuda_version', 'N/A')}")

📦 Installing llamatelemetry v0.1.0...
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done

🎯 llamatelemetry v0.1.0 First-Time Setup - Kaggle 2× T4 Multi-GPU

🎮 GPU Detected: Tesla T4 (Compute 7.5)
  ✅ Tesla T4 detected - Perfect for llamatelemetry v0.1.0!
🌐 Platform: Colab

📦 Downloading Kaggle 2× T4 binaries (~961 MB)...
    Features: FlashAttention + Tensor Cores + Multi-GPU tensor-split

➡️  Attempt 1: HuggingFace (llamatelemetry-v0.1.0-cuda12-kaggle-t4x2.tar.gz)
📥 Downloading v0.1.0 from HuggingFace Hub...
   Repo: waqasm86/llamatelemetry-binaries
   File: v0.1.0/llamatelemetry-v0.1.0-cuda12-kaggle-t4x2.tar.gz




v0.1.0/llamatelemetry-v0.1.0-cuda12-kagg(…):   0%|          | 0.00/1.40G [00:00<?, ?B/s]

🔐 Verifying SHA256 checksum...
   ✅ Checksum verified
📦 Extracting llamatelemetry-v0.1.0-cuda12-kaggle-t4x2.tar.gz...
Found 21 files in archive
Extracted 21 files to /root/.cache/llamatelemetry/extract_0.1.0
✅ Extraction complete!
  Found bin/ and lib/ under /root/.cache/llamatelemetry/extract_0.1.0/llamatelemetry-v0.1.0-cuda12-kaggle-t4x2
  Copied 13 binaries to /usr/local/lib/python3.12/dist-packages/llamatelemetry/binaries/cuda12
  Copied 2 libraries to /usr/local/lib/python3.12/dist-packages/llamatelemetry/lib
✅ Binaries installed successfully!


✅ llamatelemetry 0.1.0 installed!

📊 llamatelemetry Status:
   CUDA Available: True
   GPUs: 2
   CUDA Version: 12.5
CPU times: user 52 s, sys: 11.9 s, total: 1min 3s
Wall time: 2min
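
The installer log above reports a SHA256 checksum verification of the downloaded archive. How llamatelemetry performs that check internally is not shown in this notebook; the following is only a generic sketch of how such a verification is commonly done with Python's standard `hashlib`. The function names and the temporary demo file are assumptions made for illustration.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so a large archive never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected_hex):
    """Return True when the file's SHA256 matches the expected hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Demo on a small temporary file standing in for the real archive
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example archive bytes")
    demo_path = tmp.name

expected = hashlib.sha256(b"example archive bytes").hexdigest()
ok = verify_checksum(demo_path, expected)
bad = verify_checksum(demo_path, "0" * 64)
os.remove(demo_path)
print(ok, bad)
```

Reading in fixed-size chunks matters here because the binary archive is on the order of a gigabyte.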


In [3]:
!pip install -q huggingface-hub

In [4]:
# Add secrets
import os

from kaggle_secrets import UserSecretsClient
from huggingface_hub import login

# Get the token from Kaggle Secrets and expose it as an environment variable
user_secrets = UserSecretsClient()
hf_token = user_secrets.get_secret("HF_TOKEN")
os.environ['HF_TOKEN'] = hf_token

# Log in with the retrieved token
login(token=hf_token)

## Step 3: Download GGUF Model

Downloads the Gemma 3 4B Instruct model in Q4_K_M quantization from Hugging Face Hub. At roughly 2.5 GB, it fits well within a T4's 15 GB of VRAM while balancing output quality and memory use.

In [5]:
%%time
from huggingface_hub import hf_hub_download
import os

# Model selection - optimized for 15GB VRAM
MODEL_REPO = "unsloth/gemma-3-4b-it-GGUF"
MODEL_FILE = "gemma-3-4b-it-Q4_K_M.gguf"

print(f"📥 Downloading {MODEL_FILE}...")
print(f"   Repository: {MODEL_REPO}")
print(f"   Expected size: ~2.5GB")

# Download to Kaggle working directory
model_path = hf_hub_download(
    repo_id=MODEL_REPO,
    filename=MODEL_FILE,
    local_dir="/kaggle/working/models"
)

print(f"\n✅ Model downloaded: {model_path}")

# Show model size
size_gb = os.path.getsize(model_path) / (1024**3)
print(f"   Size: {size_gb:.2f} GB")

📥 Downloading gemma-3-4b-it-Q4_K_M.gguf...
   Repository: unsloth/gemma-3-4b-it-GGUF
   Expected size: ~2.5GB

gemma-3-4b-it-Q4_K_M.gguf:   0%|          | 0.00/2.49G [00:00<?, ?B/s]

✅ Model downloaded: /kaggle/working/models/gemma-3-4b-it-Q4_K_M.gguf
   Size: 2.32 GB
CPU times: user 5.63 s, sys: 8.75 s, total: 14.4 s
Wall time: 5.44 s
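
The download bar reports 2.49G while the cell prints 2.32 GB. The two figures are consistent if the progress bar counts decimal gigabytes (1000³ bytes) and the cell divides by 1024³, which is strictly a size in GiB. A small sketch of that arithmetic, assuming the file is about 2.49 × 10⁹ bytes as the progress bar suggests:

```python
def bytes_to_gb(num_bytes):
    """Decimal gigabytes: 1 GB = 1000**3 bytes."""
    return num_bytes / 1000**3


def bytes_to_gib(num_bytes):
    """Binary gibibytes: 1 GiB = 1024**3 bytes (what the cell above computes)."""
    return num_bytes / 1024**3


# Approximate size, assumed from the "2.49G" shown by the progress bar
size_bytes = 2.49e9

size_gb = round(bytes_to_gb(size_bytes), 2)
size_gib = round(bytes_to_gib(size_bytes), 2)
print(f"{size_gb} GB == {size_gib} GiB")
```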


## Step 4: Start llama-server with Multi-GPU Configuration

Launches llama-server using dual-GPU tensor-split configuration (50/50) with FlashAttention enabled, maximizing throughput and model capacity across both T4 GPUs.

In [6]:
from llamatelemetry.server import ServerManager
from llamatelemetry.api.multigpu import kaggle_t4_dual_config

# Get optimized configuration for Kaggle T4×2
config = kaggle_t4_dual_config()

print("🚀 Starting llama-server with Multi-GPU configuration...")
print(f"   Model: {model_path}")
print(f"   GPU Layers: {config.n_gpu_layers} (all layers)")
print(f"   Context Size: {config.ctx_size}")
print(f"   Tensor Split: {config.tensor_split} (equal across 2 GPUs)")
print(f"   Flash Attention: {config.flash_attention}")

# Create server manager
server = ServerManager(server_url="http://127.0.0.1:8080")

# Start server with multi-GPU configuration
# Pass tensor_split as comma-separated string for --tensor-split flag
tensor_split_str = ",".join(str(x) for x in config.tensor_split) if config.tensor_split else None

try:
    server.start_server(
        model_path=model_path,
        host="127.0.0.1",
        port=8080,
        gpu_layers=config.n_gpu_layers,
        ctx_size=config.ctx_size,
        timeout=120,
        verbose=True,
        # Multi-GPU parameters (passed via **kwargs)
        flash_attn=1 if config.flash_attention else 0,
        split_mode="layer",
        tensor_split=tensor_split_str,
    )
    print("\n✅ llama-server is ready with dual T4 GPUs!")
    print(f"   API endpoint: http://127.0.0.1:8080")
except Exception as e:
    print(f"\n❌ Server failed to start: {e}")

🚀 Starting llama-server with Multi-GPU configuration...
   Model: /kaggle/working/models/gemma-3-4b-it-Q4_K_M.gguf
   GPU Layers: -1 (all layers)
   Context Size: 8192
   Tensor Split: [0.5, 0.5] (equal across 2 GPUs)
   Flash Attention: True
Starting llama-server...
  Executable: /usr/local/lib/python3.12/dist-packages/llamatelemetry/binaries/cuda12/llama-server
  Model: gemma-3-4b-it-Q4_K_M.gguf
  GPU Layers: -1
  Context Size: 8192
  Server URL: http://127.0.0.1:8080
Waiting for server to be ready........ ✓ Ready in 5.1s

✅ llama-server is ready with dual T4 GPUs!
   API endpoint: http://127.0.0.1:8080
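
The cell comment says the tensor split is passed as a comma-separated string for the `--tensor-split` flag. How `ServerManager` turns its keyword arguments into a command line is not shown in this notebook, so the following is only a sketch of what an equivalent `llama-server` argument list might look like. The helper name `build_llama_server_args` is hypothetical, and FlashAttention is left out because the form of that flag differs between llama.cpp versions.

```python
def build_llama_server_args(model_path, host="127.0.0.1", port=8080,
                            n_gpu_layers=99, ctx_size=8192,
                            split_mode="layer", tensor_split=(0.5, 0.5)):
    """Build an argument list for llama-server (illustrative helper only).

    n_gpu_layers=99 is a common way to request that all layers be offloaded;
    the notebook's config reports -1, and how ServerManager maps that value
    is an open assumption here.
    """
    args = [
        "llama-server",
        "--model", str(model_path),
        "--host", host,
        "--port", str(port),
        "--n-gpu-layers", str(n_gpu_layers),
        "--ctx-size", str(ctx_size),
        "--split-mode", split_mode,
    ]
    if tensor_split:
        # Same join as tensor_split_str in the cell above: [0.5, 0.5] -> "0.5,0.5"
        args += ["--tensor-split", ",".join(str(x) for x in tensor_split)]
    return args


args = build_llama_server_args("/kaggle/working/models/gemma-3-4b-it-Q4_K_M.gguf")
print(" ".join(args))
```

Keeping the arguments as a list rather than one string means they can be handed to `subprocess.Popen` without shell quoting.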


## Step 5: Test Chat Completion

Demonstrates OpenAI-compatible chat completion API using llamatelemetry client, testing model inference with token usage tracking and response quality validation.

In [7]:
from llamatelemetry.api.client import LlamaCppClient

# Create client
client = LlamaCppClient(base_url="http://127.0.0.1:8080")

# Test simple completion using OpenAI-compatible API
print("💬 Testing inference...\n")

response = client.chat.create(
    messages=[
        {"role": "user", "content": "What is CUDA? Explain in 2 sentences."}
    ],
    max_tokens=100,
    temperature=0.7
)

print("📝 Response:")
print(response.choices[0].message.content)

print(f"\n📊 Stats:")
print(f"   Tokens generated: {response.usage.completion_tokens}")
print(f"   Total tokens: {response.usage.total_tokens}")

💬 Testing inference...

📝 Response:
CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model developed by NVIDIA that allows developers to utilize the massive processing power of their GPUs for general-purpose computations – essentially turning graphics cards into powerful accelerators. It enables you to write code that runs simultaneously on many GPU cores, dramatically speeding up tasks like machine learning, scientific simulations, and image/video processing.

📊 Stats:
   Tokens generated: 76
   Total tokens: 95
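
Because the server speaks an OpenAI-compatible API, the same request can in principle be made without the llamatelemetry client. The sketch below builds the HTTP request with the standard library only, assuming the server exposes llama.cpp's usual `/v1/chat/completions` route at the address used in Step 4. The helper name `build_chat_request` is illustrative; the network call itself is left commented out so the block runs without a server.

```python
import json
import urllib.request


def build_chat_request(base_url, messages, max_tokens=100, temperature=0.7):
    """Build a POST request for an OpenAI-compatible chat completion endpoint."""
    payload = {
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://127.0.0.1:8080",
    [{"role": "user", "content": "What is CUDA? Explain in 2 sentences."}],
)
print(request.get_method(), request.full_url)

# With the server from Step 4 running, the request could be sent like this:
# with urllib.request.urlopen(request, timeout=60) as reply:
#     body = json.load(reply)
#     print(body["choices"][0]["message"]["content"])
```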


## Step 6: Test Streaming Responses

Demonstrates streaming completion API for real-time token generation, enabling progressive display of responses with chunk-by-chunk delivery.

In [8]:
# Streaming example using OpenAI-compatible API
print("💬 Streaming response...\n")

for chunk in client.chat.create(
    messages=[
        {"role": "user", "content": "Write a Python function to calculate factorial."}
    ],
    max_tokens=200,
    temperature=0.3,
    stream=True  # Enable streaming
):
    if hasattr(chunk, 'choices') and chunk.choices:
        delta = chunk.choices[0].delta
        if hasattr(delta, 'content') and delta.content:
            print(delta.content, end="", flush=True)

print("\n\n✅ Streaming complete!")

💬 Streaming response...



✅ Streaming complete!
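
The recorded output above contains no streamed text, and the notebook does not show why. One thing worth checking is the shape of the chunks: if they arrive as plain dicts rather than objects with attributes, the `hasattr` checks in the loop would skip every chunk silently. As a sketch under that assumption, the function below extracts text from raw OpenAI-style server-sent event lines (`data: {...}` terminated by `data: [DONE]`), treating every chunk as a dict. The sample lines are hand-written for illustration, not captured from the server.

```python
import json


def iter_stream_text(lines):
    """Yield content fragments from OpenAI-style SSE lines (illustrative helper)."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        event = json.loads(payload)
        choices = event.get("choices") or []
        if not choices:
            continue
        delta = choices[0].get("delta") or {}
        content = delta.get("content")
        if content:
            yield content


# Hand-written sample stream, not real server output
sample_lines = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"def "}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"factorial"}}]}',
    "data: [DONE]",
]

text = "".join(iter_stream_text(sample_lines))
print(text)
```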


## Step 7: Monitor GPU Memory Usage

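Checks VRAM consumption across both GPUs using nvidia-smi to verify that the model is loaded and to track memory allocation. Because Step 4 splits the model 50/50, memory is in use on both GPUs in this run, so GPU 1 is only partly free; the note printed by the cell describes a layout in which GPU 1 is kept available for RAPIDS/Graphistry workloads.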

In [9]:
# Check GPU memory usage
print("📊 GPU Memory Usage:")
print("="*60)
!nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu --format=csv

print("\n💡 Note:")
print("   GPU 0: llama-server (LLM inference)")
print("   GPU 1: Available for RAPIDS/Graphistry")

📊 GPU Memory Usage:
index, name, memory.used [MiB], memory.total [MiB], utilization.gpu [%]
0, Tesla T4, 1307 MiB, 15360 MiB, 0 %
1, Tesla T4, 1797 MiB, 15360 MiB, 0 %

💡 Note:
   GPU 0: llama-server (LLM inference)
   GPU 1: Available for RAPIDS/Graphistry
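
To turn the memory table into numbers, nvidia-smi can emit bare values with `--format=csv,noheader,nounits`. The sketch below parses text in that shape and computes the share of VRAM in use per GPU. The helper name `parse_memory_csv` is an assumption, and the sample reuses the figures recorded above with the units stripped.

```python
import csv
import io


def parse_memory_csv(text):
    """Parse 'index, name, memory.used, memory.total' rows (values in MiB, no units)."""
    rows = []
    reader = csv.reader(io.StringIO(text.strip()), skipinitialspace=True)
    for index, name, used, total in reader:
        used_mib, total_mib = int(used), int(total)
        rows.append({
            "index": int(index),
            "name": name,
            "used_mib": used_mib,
            "total_mib": total_mib,
            "used_pct": round(100 * used_mib / total_mib, 1),
        })
    return rows


# Figures from the recorded output, as csv,noheader,nounits would print them
sample = "0, Tesla T4, 1307, 15360\n1, Tesla T4, 1797, 15360"

rows = parse_memory_csv(sample)
for row in rows:
    print(f"GPU {row['index']}: {row['used_mib']} MiB used ({row['used_pct']}%)")
```

Both GPUs show memory in use, which matches the 50/50 tensor split configured in Step 4.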


## Step 8: Cleanup and Release Resources

Stops llama-server gracefully, releases GPU memory, and verifies complete resource cleanup with nvidia-smi to ensure clean state for subsequent operations.

In [10]:
# Stop the server
print("🛑 Stopping llama-server...")
server.stop_server()
print("\n✅ Server stopped. Resources freed.")

# Verify GPU memory is released
print("\n📊 GPU Memory After Cleanup:")
!nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv

🛑 Stopping llama-server...

✅ Server stopped. Resources freed.

📊 GPU Memory After Cleanup:
index, memory.used [MiB], memory.total [MiB]
0, 0 MiB, 15360 MiB
1, 0 MiB, 15360 MiB


## 🎉 Quick Start Complete!

You've successfully:
1. ✅ Verified Kaggle GPU environment
2. ✅ Installed llamatelemetry v0.1.0
3. ✅ Downloaded a GGUF model
4. ✅ Started llama-server
5. ✅ Ran inference with chat completion
6. ✅ Used streaming responses
7. ✅ Monitored GPU memory usage
8. ✅ Stopped the server and released GPU resources

## Next Steps

Explore more tutorials:
- 📘 [02-llama-server-setup](02-llama-server-setup-llamatelemetry-v0.1.0.ipynb) - Advanced server configuration
- 📘 [03-multi-gpu-inference](03-multi-gpu-inference-llamatelemetry-v0.1.0.ipynb) - Dual T4 inference
- 📘 [04-gguf-quantization](04-gguf-quantization-llamatelemetry-v0.1.0.ipynb) - Quantization guide
- 📘 [05-unsloth-integration](05-unsloth-integration-llamatelemetry-v0.1.0.ipynb) - Unsloth training → llamatelemetry
- 📘 [06-split-gpu-graphistry](06-split-gpu-graphistry-llamatelemetry-v0.1.0.ipynb) - LLM + Graph visualization

---

**llamatelemetry v0.1.0** | CUDA 12 Inference Backend for Unsloth