# 🚀 Qwen Model Benchmarking on GPU (H100, T4, L4)
A hands-on exploration of Qwen LLM variants using Hugging Face Transformers and PyTorch with GPU performance metrics.


## 🧪 Environment Check: Python Setup
Before starting any model benchmarking, it's important to verify the Python environment. This cell prints the path of the current Python executable and the Python version being used.

In [1]:
import sys
print(f"Python executable: {sys.executable}")
print(f"Python version: {sys.version}")

Python executable: /anaconda/envs/qwen-llm/bin/python
Python version: 3.10.18 (main, Jun  5 2025, 13:14:17) [GCC 11.2.0]


# System Resource Assessment
Start by examining the available resources and verifying our environment setup:

In [2]:
import torch
torch.cuda.device_count()

1

In [3]:
torch.cuda.get_device_properties(0)

_CudaDeviceProperties(name='NVIDIA H100 NVL', major=9, minor=0, total_memory=95248MB, multi_processor_count=132, uuid=53317aba-93d7-f441-1b59-0b07d076e10e, L2_cache_size=60MB)

In [4]:
torch.cuda.get_device_properties(0).name

'NVIDIA H100 NVL'

In [5]:
import psutil
psutil.cpu_count()

40

In [6]:
psutil.virtual_memory().total/(1024**3)

314.6921806335449

In [7]:
psutil.virtual_memory().available/(1024**3)

235.18262100219727

# NVIDIA H100 NVL

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import time
import psutil
import sys
import os
import json
from datetime import datetime
import matplotlib.pyplot as plt
import pandas as pd

# Verify environment setup
print(f"🐍 Python executable: {sys.executable}")
print(f"🔥 PyTorch version: {torch.__version__}")
print(f"🤖 Transformers version: {__import__('transformers').__version__}")
print(f"📅 Benchmark timestamp: {datetime.now().isoformat()}")
print("=" * 60)

def get_system_info():
    """Comprehensive system resource assessment"""
    info = {
        "cpu_cores": psutil.cpu_count(),
        "total_ram_gb": psutil.virtual_memory().total / (1024**3),
        "available_ram_gb": psutil.virtual_memory().available / (1024**3),
        "timestamp": datetime.now().isoformat()
    }
    
    if torch.cuda.is_available():
        info["cuda_version"] = torch.version.cuda
        info["gpu_count"] = torch.cuda.device_count()
        info["gpus"] = []
        
        for i in range(torch.cuda.device_count()):
            gpu_props = torch.cuda.get_device_properties(i)
            gpu_info = {
                "id": i,
                "name": gpu_props.name,
                "memory_gb": gpu_props.total_memory / (1024**3),
                "compute_capability": f"{gpu_props.major}.{gpu_props.minor}",
                "multiprocessor_count": gpu_props.multi_processor_count
            }
            info["gpus"].append(gpu_info)
            print(f"🎮 GPU {i}: {gpu_props.name}")
            print(f"   💾 Memory: {gpu_props.total_memory / (1024**3):.2f} GB")
            print(f"   🔧 Compute capability: {gpu_props.major}.{gpu_props.minor}")
    else:
        print("❌ CUDA not available - check GPU drivers and PyTorch installation")
        return None
    
    print(f"🖥️  CPU cores: {info['cpu_cores']}")
    print(f"🧠 Total RAM: {info['total_ram_gb']:.2f} GB")
    print(f"💚 Available RAM: {info['available_ram_gb']:.2f} GB")
    
    return info

# Get and store system information
system_info = get_system_info()
if system_info is None:
    print("\n🚨 GPU setup required before continuing. Check environment configuration.")
    exit()

  from .autonotebook import tqdm as notebook_tqdm


🐍 Python executable: /anaconda/envs/qwen-llm/bin/python
🔥 PyTorch version: 2.7.1+cu118
🤖 Transformers version: 4.44.0
📅 Benchmark timestamp: 2025-08-02T15:06:42.704293
🎮 GPU 0: NVIDIA H100 NVL
   💾 Memory: 93.02 GB
   🔧 Compute capability: 9.0
🖥️  CPU cores: 40
🧠 Total RAM: 314.69 GB
💚 Available RAM: 235.27 GB


# Tesla T4

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import time
import psutil
import sys
import os
import json
from datetime import datetime
import matplotlib.pyplot as plt
import pandas as pd

# Verify environment setup
print(f"🐍 Python executable: {sys.executable}")
print(f"🔥 PyTorch version: {torch.__version__}")
print(f"🤖 Transformers version: {__import__('transformers').__version__}")
print(f"📅 Benchmark timestamp: {datetime.now().isoformat()}")
print("=" * 60)

def get_system_info():
    """Comprehensive system resource assessment"""
    info = {
        "cpu_cores": psutil.cpu_count(),
        "total_ram_gb": psutil.virtual_memory().total / (1024**3),
        "available_ram_gb": psutil.virtual_memory().available / (1024**3),
        "timestamp": datetime.now().isoformat()
    }
    
    if torch.cuda.is_available():
        info["cuda_version"] = torch.version.cuda
        info["gpu_count"] = torch.cuda.device_count()
        info["gpus"] = []
        
        for i in range(torch.cuda.device_count()):
            gpu_props = torch.cuda.get_device_properties(i)
            gpu_info = {
                "id": i,
                "name": gpu_props.name,
                "memory_gb": gpu_props.total_memory / (1024**3),
                "compute_capability": f"{gpu_props.major}.{gpu_props.minor}",
                "multiprocessor_count": gpu_props.multi_processor_count
            }
            info["gpus"].append(gpu_info)
            print(f"🎮 GPU {i}: {gpu_props.name}")
            print(f"   💾 Memory: {gpu_props.total_memory / (1024**3):.2f} GB")
            print(f"   🔧 Compute capability: {gpu_props.major}.{gpu_props.minor}")
    else:
        print("❌ CUDA not available - check GPU drivers and PyTorch installation")
        return None
    
    print(f"🖥️  CPU cores: {info['cpu_cores']}")
    print(f"🧠 Total RAM: {info['total_ram_gb']:.2f} GB")
    print(f"💚 Available RAM: {info['available_ram_gb']:.2f} GB")
    
    return info

# Get and store system information
system_info = get_system_info()
if system_info is None:
    print("\n🚨 GPU setup required before continuing. Check environment configuration.")
    exit()

  from .autonotebook import tqdm as notebook_tqdm


🐍 Python executable: /anaconda/envs/qwen-llm/bin/python
🔥 PyTorch version: 2.7.1+cu118
🤖 Transformers version: 4.44.0
📅 Benchmark timestamp: 2025-08-02T17:32:55.451092
🎮 GPU 0: Tesla T4
   💾 Memory: 15.57 GB
   🔧 Compute capability: 7.5
🖥️  CPU cores: 8
🧠 Total RAM: 54.92 GB
💚 Available RAM: 53.07 GB


In [9]:
!nvidia-smi

Sat Aug  2 14:55:05 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.247.01             Driver Version: 535.247.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA H100 NVL                On  | 00000001:00:00.0 Off |                    0 |
| N/A   39C    P0              62W / 400W |      3MiB / 95830MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [1]:
!nvidia-smi

Sat Aug  2 17:32:34 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.247.01             Driver Version: 535.247.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       On  | 00000001:00:00.0 Off |                  Off |
| N/A   36C    P8              10W /  70W |      2MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

# Troubleshooting Common Issues
If you encounter CUDA-related errors, verify these common issues:

In [10]:
# Common troubleshooting checks
def troubleshoot_setup():
    print("=== Troubleshooting GPU Setup ===")
    
    # Check CUDA installation
    try:
        import torch
        print(f"✅ PyTorch imported successfully")
        print(f"   CUDA compiled version: {torch.version.cuda}")
        print(f"   CUDA runtime version: {torch.cuda.get_device_capability()}")
    except Exception as e:
        print(f"❌ PyTorch import error: {e}")
        return False
    
    # Check GPU accessibility
    if torch.cuda.is_available():
        device = torch.cuda.current_device()
        print(f"✅ GPU accessible: {torch.cuda.get_device_name(device)}")
        
        # Test basic tensor operations
        try:
            test_tensor = torch.randn(1000, 1000).cuda()
            result = torch.mm(test_tensor, test_tensor.t())
            print(f"✅ GPU tensor operations working")
            del test_tensor, result  # Free memory
            torch.cuda.empty_cache()
        except Exception as e:
            print(f"❌ GPU tensor operations failed: {e}")
            return False
    else:
        print("❌ CUDA not available")
        return False
    
    return True

# Run troubleshooting if needed
troubleshoot_setup()

=== Troubleshooting GPU Setup ===
✅ PyTorch imported successfully
   CUDA compiled version: 11.8
   CUDA runtime version: (9, 0)
✅ GPU accessible: NVIDIA H100 NVL
✅ GPU tensor operations working


True

# Loading and Testing Qwen2.5-7B-Instruct
Now that our environment is verified, let's load the model and test its capabilities:

In [2]:
# Clear any existing GPU memory
torch.cuda.empty_cache()

# Load model with T4-optimized settings
# model_name = "Qwen/Qwen2.5-7B-Instruct"
model_name = "Qwen/Qwen2.5-Coder-32B-Instruct"
custom_cache_dir = "/dev/shm"
print(f"Loading {model_name}...")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
    cache_dir=custom_cache_dir
)

# Add padding token if not present
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load model with T4 GPU optimization
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    cache_dir=custom_cache_dir
)


print(f"✅ Model loaded successfully!")
print(f"📍 Device: {model.device}")
print(f"💾 Model memory: {model.get_memory_footprint() / 1e9:.2f} GB")

# Check GPU memory usage after loading
if torch.cuda.is_available():
    gpu_memory = torch.cuda.memory_allocated() / 1e9
    gpu_reserved = torch.cuda.memory_reserved() / 1e9
    print(f"🔥 GPU Memory - Allocated: {gpu_memory:.2f} GB, Reserved: {gpu_reserved:.2f} GB")

Loading Qwen/Qwen2.5-Coder-32B-Instruct...


Loading checkpoint shards: 100%|██████████| 14/14 [00:08<00:00,  1.60it/s]

✅ Model loaded successfully!
📍 Device: cuda:0
💾 Model memory: 66.60 GB
🔥 GPU Memory - Allocated: 66.60 GB, Reserved: 66.87 GB





# Basic Text Generation Testing
Let's test the model with simple text generation to ensure everything works:

In [12]:
def generate_text(prompt, max_length=200, temperature=0.7, top_p=0.8):
    """Basic text generation function"""
    
    # Encode input
    inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(model.device)
    
    # Track GPU memory before generation
    memory_before = torch.cuda.memory_allocated() / 1e9
    start_time = time.time()
    
    # Generate response
    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_length=max_length,
            temperature=temperature,
            top_p=top_p,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id
        )
    
    # Calculate metrics
    generation_time = time.time() - start_time
    memory_after = torch.cuda.memory_allocated() / 1e9
    
    # Decode response
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    new_text = response[len(prompt):].strip()

    print(f"🕒 GPU Name: {torch.cuda.get_device_properties(0).name}")
    print(f"🧠 Model Name: {model_name}")
    print(f"🕒 Generation time: {generation_time:.2f}s")
    print(f"💾 Memory delta: {memory_after - memory_before:.3f} GB")
    print(f"📝 Generated tokens: {len(outputs[0]) - len(inputs[0])}")
    print(f"⚡ Tokens/second: {(len(outputs[0]) - len(inputs[0])) / generation_time:.1f}")
    
    return new_text

# Test basic generation
test_prompt = "Explain the benefits of using Kubernetes for machine learning workloads:"
result = generate_text(test_prompt, max_length=2000)
print(f"\n📄 Generated text:\n{result}")

🕒 GPU Name: NVIDIA H100 NVL
🧠 Model Name: Qwen/Qwen2.5-Coder-32B-Instruct
🕒 Generation time: 45.12s
💾 Memory delta: 0.000 GB
📝 Generated tokens: 1119
⚡ Tokens/second: 24.8

📄 Generated text:
Kubernetes is a powerful open-source platform that automates deployment, scaling, and management of containerized applications. While it is primarily used for web applications, Kubernetes can also be effectively utilized for machine learning (ML) workloads. Here are some benefits of using Kubernetes for ML workloads:

1. Scalability: Kubernetes provides the ability to scale up or down the number of nodes in a cluster based on demand, which is essential for handling the varying computational requirements of ML workloads. For example, training a large deep learning model may require a lot of resources, while serving predictions may require fewer resources.

2. Resource Management: Kubernetes allows efficient allocation and utilization of resources across different ML workloads, ensuring that each wor

# 🚀 Model Benchmarking Framework

## Overview

This framework provides a **comprehensive and flexible setup** for benchmarking large language models (LLMs) across different:

* ✅ **Model variants** (e.g., base vs. instruct, 7B vs. 32B)
* ✅ **Hardware setups** (e.g., NVIDIA T4, A100, etc.)
* ✅ **Performance metrics** (e.g., GPU memory usage, latency, throughput)

Its primary goal is to help you **evaluate and compare model behavior and efficiency** in different inference environments.

---

## 🔧 Features

* 🔁 **Plug-and-play model loading** with Hugging Face `transformers`
* 🧠 **Automatic device mapping** for multi-GPU compatibility
* 🕒 **Precise generation timing**
* 📊 **Memory usage tracking** (allocated vs. reserved)
* 📈 **Tokens/sec throughput reporting**
* 📂 **Cache control** via custom directory (e.g., `/dev/shm` for RAM disk)

---

## 🧪 Usage Example

```python
output = load_and_generate(
    model_name="Qwen/Qwen2.5-Coder-32B-Instruct",
    prompt="Explain the benefits of using Kubernetes for machine learning workloads:",
    max_length=2000
)
```

---

## 📦 Components

| Component              | Description                                                                    |
| ---------------------- | ------------------------------------------------------------------------------ |
| `load_and_generate()`  | Unified function to load model, generate output, and benchmark runtime metrics |
| `AutoTokenizer`        | Dynamically loads tokenizer with optional padding fallback                     |
| `AutoModelForCausalLM` | Loads the model using half-precision (FP16) and maps to available GPUs         |
| GPU Tracker            | Measures memory delta before/after inference                                   |
| Metrics Output         | Includes generation time, memory usage, token throughput                       |

---

## 📍 Notes

* For best performance on **NVIDIA T4**, use FP16 models and limit max sequence lengths.
* Use `/dev/shm` as `cache_dir` to improve disk I/O speed on Linux systems.
* Consider adding `torch.cuda.reset_peak_memory_stats()` and `torch.cuda.max_memory_allocated()` if peak memory stats are needed.

---

Let me know if you'd like this exported as a `.md` file or documented in a Jupyter notebook with visual charts.


In [3]:
import torch
import time
from transformers import AutoTokenizer, AutoModelForCausalLM

def load_and_generate(model_name, prompt, max_length=200, temperature=0.7, top_p=0.8, cache_dir="/dev/shm"):
    """
    Load a model and tokenizer, then generate text from a prompt.

    Parameters:
        model_name (str): Hugging Face model identifier
        prompt (str): Input prompt for text generation
        max_length (int): Maximum number of tokens in output
        temperature (float): Sampling temperature
        top_p (float): Nucleus sampling threshold
        cache_dir (str): Directory to cache model and tokenizer

    Returns:
        str: Generated text
    """

    # Clear GPU memory
    torch.cuda.empty_cache()

    print(f"\n🚀 Loading model: {model_name}")
    
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        trust_remote_code=True,
        cache_dir=cache_dir
    )

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Load model
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True,
        low_cpu_mem_usage=True,
        cache_dir=cache_dir
    )

    print(f"✅ Model loaded successfully!")
    print(f"📍 Device: {model.device}")
    print(f"💾 Model memory: {model.get_memory_footprint() / 1e9:.2f} GB")

    if torch.cuda.is_available():
        gpu_memory = torch.cuda.memory_allocated() / 1e9
        gpu_reserved = torch.cuda.memory_reserved() / 1e9
        print(f"🔥 GPU Memory - Allocated: {gpu_memory:.2f} GB, Reserved: {gpu_reserved:.2f} GB")

    # Encode input
    inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(model.device)

    memory_before = torch.cuda.memory_allocated() / 1e9
    start_time = time.time()

    # Generate
    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_length=max_length,
            temperature=temperature,
            top_p=top_p,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id
        )

    generation_time = time.time() - start_time
    memory_after = torch.cuda.memory_allocated() / 1e9

    # Decode
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    new_text = response[len(prompt):].strip()

    print(f"🕒 GPU Name: {torch.cuda.get_device_properties(0).name}")
    print(f"🧠 Model Name: {model_name}")
    print(f"🕒 Generation time: {generation_time:.2f}s")
    print(f"💾 Memory delta: {memory_after - memory_before:.3f} GB")
    print(f"📝 Generated tokens: {len(outputs[0]) - len(inputs['input_ids'][0])}")
    print(f"⚡ Tokens/second: {(len(outputs[0]) - len(inputs['input_ids'][0])) / generation_time:.1f}")

    return new_text


# Qwen2.5-Coder-7B-Instruct (NVIDIA H100 NVL)

In [14]:
output = load_and_generate(
    model_name="Qwen/Qwen2.5-Coder-7B-Instruct",
    prompt="Explain the benefits of using Kubernetes for machine learning workloads:",
    max_length=2000
)



🚀 Loading model: Qwen/Qwen2.5-Coder-7B-Instruct


Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00,  1.76it/s]


✅ Model loaded successfully!
📍 Device: cuda:0
💾 Model memory: 15.70 GB
🔥 GPU Memory - Allocated: 82.41 GB, Reserved: 82.65 GB
🕒 GPU Name: NVIDIA H100 NVL
🧠 Model Name: Qwen/Qwen2.5-Coder-7B-Instruct
🕒 Generation time: 5.39s
💾 Memory delta: 0.000 GB
📝 Generated tokens: 311
⚡ Tokens/second: 57.7


# Qwen2.5-Coder-7B-Instruct (NVIDIA Tesla T4)

In [None]:
output = load_and_generate(
    model_name="Qwen/Qwen2.5-Coder-7B-Instruct",
    prompt="Explain the benefits of using Kubernetes for machine learning workloads:",
    max_length=2000
)



🚀 Loading model: Qwen/Qwen2.5-Coder-7B-Instruct


Downloading shards: 100%|██████████| 4/4 [00:54<00:00, 13.69s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.02it/s]
Some parameters are on the meta device device because they were offloaded to the cpu.


✅ Model loaded successfully!
📍 Device: cuda:0
💾 Model memory: 15.70 GB
🔥 GPU Memory - Allocated: 13.72 GB, Reserved: 13.96 GB


# Qwen2.5-Coder-32B-Instruct (NVIDIA H100 NVL)

In [3]:
output = load_and_generate(
    model_name="Qwen/Qwen2.5-Coder-32B-Instruct",
    prompt="Explain the benefits of using Kubernetes for machine learning workloads:",
    max_length=2000
)



🚀 Loading model: Qwen/Qwen2.5-Coder-32B-Instruct


Loading checkpoint shards: 100%|██████████| 14/14 [00:08<00:00,  1.60it/s]


✅ Model loaded successfully!
📍 Device: cuda:0
💾 Model memory: 66.60 GB
🔥 GPU Memory - Allocated: 66.60 GB, Reserved: 66.87 GB
🕒 GPU Name: NVIDIA H100 NVL
🧠 Model Name: Qwen/Qwen2.5-Coder-32B-Instruct
🕒 Generation time: 16.62s
💾 Memory delta: 0.034 GB
📝 Generated tokens: 402
⚡ Tokens/second: 24.2


# Chat Completion Format Testing
Since we're building an API that mimics OpenAI's chat completion format, let's test the model with proper chat formatting:

In [3]:
def format_chat_prompt(messages):
    """Format messages for Qwen2.5-Instruct chat template"""
    
    # Qwen2.5-Instruct uses a specific chat format
    formatted_prompt = ""
    
    for message in messages:
        role = message["role"]
        content = message["content"]
        
        if role == "system":
            formatted_prompt += f"<|im_start|>system\n{content}<|im_end|>\n"
        elif role == "user":
            formatted_prompt += f"<|im_start|>user\n{content}<|im_end|>\n"
        elif role == "assistant":
            formatted_prompt += f"<|im_start|>assistant\n{content}<|im_end|>\n"
    
    # Add assistant start token for generation
    formatted_prompt += "<|im_start|>assistant\n"
    return formatted_prompt

def chat_completion(messages, max_tokens=500, temperature=0.7, top_p=0.8):
    """OpenAI-style chat completion function"""
    
    # Format the conversation
    prompt = format_chat_prompt(messages)
    
    # Tokenize
    inputs = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    input_length = len(inputs[0])
    
    # Track performance metrics
    start_time = time.time()
    memory_before = torch.cuda.memory_allocated()
    
    # Generate response
    with torch.no_grad():
        outputs = model.generate(
            inputs,
            max_new_tokens=max_tokens,  # Use max_new_tokens instead of max_length
            temperature=temperature,
            top_p=top_p,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            stop_strings=["<|im_end|>"],  # Stop at chat format end token
            tokenizer=tokenizer
        )
    
    # Calculate metrics
    generation_time = time.time() - start_time
    memory_after = torch.cuda.memory_allocated()
    total_tokens = len(outputs[0])
    new_tokens = total_tokens - input_length
    
    # Decode and clean response
    full_response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    assistant_response = full_response[len(tokenizer.decode(inputs[0], skip_special_tokens=True)):].strip()
    
    # Remove chat format tokens if present
    if assistant_response.endswith("<|im_end|>"):
        assistant_response = assistant_response[:-11].strip()
    
    # Performance metrics
    tokens_per_second = new_tokens / generation_time if generation_time > 0 else 0
    memory_used_mb = (memory_after - memory_before) / 1e6
    
    return {
        "response": assistant_response,
        "metrics": {
            "generation_time": generation_time,
            "input_tokens": input_length,
            "output_tokens": new_tokens,
            "total_tokens": total_tokens,
            "tokens_per_second": tokens_per_second,
            "memory_used_mb": memory_used_mb
        }
    }

# Test chat completion format
test_messages = [
    {
        "role": "system", 
        "content": "You are a helpful AI assistant specializing in cloud infrastructure and DevOps."
    },
    {
        "role": "user", 
        "content": "What are the key advantages of deploying LLMs on Kubernetes compared to traditional VM-based deployments?"
    }
]

print("🧪 Testing chat completion format...")
result = chat_completion(test_messages, max_tokens=1000)

print(f"\n📋 Performance Metrics:")
print(f"   ⏱️  Generation time: {result['metrics']['generation_time']:.2f}s")
print(f"   📥 Input tokens: {result['metrics']['input_tokens']}")
print(f"   📤 Output tokens: {result['metrics']['output_tokens']}")
print(f"   ⚡ Speed: {result['metrics']['tokens_per_second']:.1f} tokens/sec")
print(f"   💾 Memory used: {result['metrics']['memory_used_mb']:.1f} MB")

print(f"\n🤖 Assistant Response:\n{result['response']}")

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


🧪 Testing chat completion format...

📋 Performance Metrics:
   ⏱️  Generation time: 28.54s
   📥 Input tokens: 46
   📤 Output tokens: 680
   ⚡ Speed: 23.8 tokens/sec
   💾 Memory used: 33.6 MB

🤖 Assistant Response:
Deploying Large Language Models (LLMs) on Kubernetes offers several key advantages over traditional VM-based deployments, particularly in the context of scalability, efficiency, and operational flexibility. Here are some of the primary benefits:

1. **Scalability**:
   - **Horizontal Scaling**: Kubernetes allows for easy horizontal scaling of LLM workloads. You can scale up or down the number of pods (instances) running your model based on demand, which is crucial for handling variable loads typical in applications using LLMs.
   - **Auto-scaling**: With Kubernetes, you can set up auto-scaling policies that automatically adjust the number of replicas of your application based on CPU/memory usage or custom metrics, ensuring optimal resource utilization.

2. **Resource Efficien

# Code Generation Testing Framework
Let's create a comprehensive testing framework for coding capabilities:

In [None]:
def format_coding_prompt(instruction, code_context="", language="python"):
    """Format coding prompts for Qwen2.5-Coder"""
    
    system_message = (
        "You are a helpful coding assistant. You provide accurate, efficient, and well-documented code solutions.\n"
        "When writing code, follow these principles:\n"
        "1. Write clean, readable code with proper formatting\n"
        "2. Include helpful comments where necessary\n"
        "3. Follow language-specific best practices\n"
        "4. Provide complete, working solutions"
    )

    if code_context:
        user_message = f"""Language: {language}

        Context:
        ```{language}
        {code_context}
        ``` {{data-source-line="318"}}

        Request: {instruction}"""
    else:
        user_message = f"""Language: {language}

        Request: {instruction}"""

    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message}
    ]

    formatted_prompt = ""
    for message in messages:
        role = message["role"]
        content = message["content"]
        formatted_prompt += f"<|im_start|>{role}\n{content}<|im_end|>\n"

    formatted_prompt += "<|im_start|>assistant\n"
    return formatted_prompt

def generate_code(instruction, code_context="", language="python", max_tokens=2000, temperature=0.2):
    """Generate code with performance metrics"""
    
    prompt = format_coding_prompt(instruction, code_context, language)

    inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(model.device)
    input_length = inputs["input_ids"].shape[1]

    start_time = time.time()
    memory_before = torch.cuda.memory_allocated()

    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_new_tokens=max_tokens,
            temperature=temperature,
            top_p=0.8,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            stop_strings=["<|im_end|>", "<|endoftext|>"],
            tokenizer=tokenizer  # ✅ required when using stop_strings
        )

    generation_time = time.time() - start_time
    memory_after = torch.cuda.memory_allocated()
    total_tokens = outputs.shape[1]
    new_tokens = total_tokens - input_length

    decoded_prompt = tokenizer.decode(inputs["input_ids"][0], skip_special_tokens=True)
    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    code_response = decoded_output[len(decoded_prompt):].strip()

    for stop in ["<|im_end|>", "<|endoftext|>"]:
        if stop in code_response:
            code_response = code_response.split(stop)[0].strip()

    return {
        "code": code_response,
        "metrics": {
            "generation_time": generation_time,
            "input_tokens": input_length,
            "output_tokens": new_tokens,
            "total_tokens": total_tokens,
            "tokens_per_second": new_tokens / generation_time if generation_time > 0 else 0,
            "memory_used_mb": (memory_after - memory_before) / 1e6
        }
    }

# Test basic code generation
test_instruction = "Create a Python function that implements binary search on a sorted list"
result = generate_code(test_instruction)

print("🔧 Code Generation Test:")
print("=" * 50)
print(f"Instruction: {test_instruction}")
print(f"\n📊 Performance:")
print(f"  Time: {result['metrics']['generation_time']:.2f}s")
print(f"  Speed: {result['metrics']['tokens_per_second']:.1f} tok/s")
print(f"  Tokens: {result['metrics']['output_tokens']}")

print(f"\n💻 Generated Code:\n{result['code']}")


🔧 Code Generation Test:
Instruction: Create a Python function that implements binary search on a sorted list

📊 Performance:
  Time: 25.90s
  Speed: 23.9 tok/s
  Tokens: 618

💻 Generated Code:
Certainly! Below is a Python function that implements the binary search algorithm on a sorted list. The function returns the index of the target element if it is found in the list, and `-1` if the target is not present.

```python
def binary_search(sorted_list, target):
    """
    Perform binary search on a sorted list to find the index of the target element.

    Parameters:
    sorted_list (list): A list of elements sorted in ascending order.
    target: The element to search for in the list.

    Returns:
    int: The index of the target element if found, otherwise -1.
    """
    left, right = 0, len(sorted_list) - 1

    while left <= right:
        mid = left + (right - left) // 2  # Calculate the middle index

        # Check if the target is present at mid
        if sorted_list[mid] == 

# Optimized Code for Single H100 NVL

In [2]:
# Install Flash Attention 2 for H100
!pip install flash-attn --no-build-isolation

# If the above fails, try with specific CUDA version
!pip install flash-attn --no-build-isolation --no-deps

Collecting flash-attn
  Downloading flash_attn-2.8.2.tar.gz (8.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m8.2/8.2 MB[0m [31m106.5 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25h  Preparing metadata (setup.py) ... [?25ldone
Collecting einops
  Downloading einops-0.8.1-py3-none-any.whl (64 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m64.4/64.4 kB[0m [31m12.0 MB/s[0m eta [36m0:00:00[0m
Building wheels for collected packages: flash-attn
  Building wheel for flash-attn (setup.py) ... [?25ldone
[?25h  Created wheel for flash-attn: filename=flash_attn-2.8.2-cp310-cp310-linux_x86_64.whl size=255921566 sha256=7365feb5c5d54aa3838ea02f5fc52c51e496ec0db05e49907d6f1c152ad4e352
  Stored in directory: /home/azureuser/.cache/pip/wheels/1b/95/ad/b6f5ede0a6d0fc052b5a5aebda836880f35caff6d51dd68fbf
Successfully built flash-attn
Installing collected packages: einops, flash-attn
Successfully installed einops-0.8.1 flash-attn-2.8.2


In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import os

# Set environment variables for H100 optimization
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512,expandable_segments:True"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Clear any existing GPU memory
torch.cuda.empty_cache()

model_name = "Qwen/Qwen2.5-Coder-32B-Instruct"
custom_cache_dir = "/dev/shm"

print(f"Loading {model_name} optimized for single H100 NVL...")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
    cache_dir=custom_cache_dir
)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# H100 Optimized Loading - WORKING VERSION
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # BF16 is optimal for H100
    device_map="cuda:0",  # Single GPU
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    cache_dir=custom_cache_dir,
    # REMOVED: attn_implementation="flash_attention_2"  # This was causing the error
    use_cache=True,
    # Memory optimization for single GPU
    max_memory={"cuda:0": "85GB"},  # Leave ~10GB for activations and KV cache
)

# H100 Performance optimizations
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = True

print("Compiling model for H100...")
# Enable PyTorch 2.0 compilation for performance
model = torch.compile(model, mode="reduce-overhead")  # More stable than max-autotune

print(f"✅ Model loaded and compiled!")
print(f"📍 Device: {model.device}")
print(f"💾 Model memory: {model.get_memory_footprint() / 1e9:.2f} GB")

# Check GPU utilization
if torch.cuda.is_available():
    gpu_memory = torch.cuda.memory_allocated() / 1e9
    gpu_reserved = torch.cuda.memory_reserved() / 1e9
    gpu_total = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"🔥 GPU Memory - Allocated: {gpu_memory:.2f} GB")
    print(f"🔥 GPU Memory - Reserved: {gpu_reserved:.2f} GB") 
    print(f"🔥 GPU Memory - Total: {gpu_total:.2f} GB")
    print(f"🔥 GPU Memory - Free: {(gpu_total - gpu_reserved):.2f} GB")

# Optimized generation function
def generate_h100_optimized(prompt, max_new_tokens=1024, temperature=0.7):
    """Optimized generation for H100 NVL"""
    inputs = tokenizer(
        prompt, 
        return_tensors="pt", 
        padding=True, 
        truncation=True,
        max_length=4096
    ).to("cuda:0")
    
    with torch.inference_mode():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
            num_return_sequences=1,
            pad_token_id=tokenizer.eos_token_id,
            use_cache=True,
            num_beams=1,
            early_stopping=True,
            output_attentions=False,
            output_hidden_states=False,
        )
    
    generated_text = tokenizer.decode(
        outputs[0][inputs['input_ids'].shape[1]:], 
        skip_special_tokens=True
    )
    return generated_text



  from .autonotebook import tqdm as notebook_tqdm


Loading Qwen/Qwen2.5-Coder-32B-Instruct optimized for single H100 NVL...


Loading checkpoint shards: 100%|██████████| 14/14 [00:09<00:00,  1.43it/s]


Compiling model for H100...
✅ Model loaded and compiled!
📍 Device: cuda:0
💾 Model memory: 66.60 GB
🔥 GPU Memory - Allocated: 66.60 GB
🔥 GPU Memory - Reserved: 66.61 GB
🔥 GPU Memory - Total: 99.87 GB
🔥 GPU Memory - Free: 33.27 GB


In [2]:
import time

# Basic generation function for comparison
def generate_text(prompt, max_length=2000, temperature=0.7, top_p=0.8):
    """Basic text generation function"""
    
    # Encode input
    inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(model.device)
    
    # Track GPU memory before generation
    memory_before = torch.cuda.memory_allocated() / 1e9
    start_time = time.time()
    
    # Generate response
    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_length=max_length,
            temperature=temperature,
            top_p=top_p,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id
        )
    
    # Calculate metrics
    generation_time = time.time() - start_time
    memory_after = torch.cuda.memory_allocated() / 1e9
    
    # Decode response
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    new_text = response[len(prompt):].strip()
    
    print(f"\n📊 BASIC GENERATION METRICS:")
    print(f"🕒 GPU Name: {torch.cuda.get_device_properties(0).name}")
    print(f"🧠 Model Name: {model_name}")
    print(f"🕒 Generation time: {generation_time:.2f}s")
    print(f"💾 Memory delta: {memory_after - memory_before:.3f} GB")
    print(f"📝 Generated tokens: {len(outputs[0]) - len(inputs['input_ids'][0])}")
    print(f"⚡ Tokens/second: {(len(outputs[0]) - len(inputs['input_ids'][0])) / generation_time:.1f}")
    
    return new_text

# Enhanced optimized generation function with metrics
def generate_h100_optimized_with_metrics(prompt, max_new_tokens=2000, temperature=0.7):
    """Optimized generation for H100 NVL with performance metrics"""
    inputs = tokenizer(
        prompt, 
        return_tensors="pt", 
        padding=True, 
        truncation=True,
        max_length=4096
    ).to("cuda:0")
    
    # Track GPU memory before generation
    memory_before = torch.cuda.memory_allocated() / 1e9
    start_time = time.time()
    
    with torch.inference_mode():  # More efficient than torch.no_grad()
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
            num_return_sequences=1,
            pad_token_id=tokenizer.eos_token_id,
            use_cache=True,
            num_beams=1,
            early_stopping=True,
            output_attentions=False,
            output_hidden_states=False,
        )
    
    # Calculate metrics
    generation_time = time.time() - start_time
    memory_after = torch.cuda.memory_allocated() / 1e9
    tokens_generated = len(outputs[0]) - len(inputs['input_ids'][0])
    
    generated_text = tokenizer.decode(
        outputs[0][inputs['input_ids'].shape[1]:], 
        skip_special_tokens=True
    )
    
    print(f"\n📊 OPTIMIZED GENERATION METRICS:")
    print(f"🕒 GPU Name: {torch.cuda.get_device_properties(0).name}")
    print(f"🧠 Model Name: {model_name}")
    print(f"🕒 Generation time: {generation_time:.2f}s")
    print(f"💾 Memory delta: {memory_after - memory_before:.3f} GB")
    print(f"📝 Generated tokens: {tokens_generated}")
    print(f"⚡ Tokens/second: {tokens_generated / generation_time:.1f}")
    
    return generated_text

# Performance comparison function
def compare_generation_methods(prompt, max_tokens=2000):
    """Compare basic vs optimized generation"""
    print("=" * 80)
    print("🔬 GENERATION METHOD COMPARISON")
    print("=" * 80)
    
    # Test basic generation
    print("\n1️⃣ Testing BASIC generation method...")
    torch.cuda.empty_cache()  # Clear cache for fair comparison
    result_basic = generate_text(prompt, max_length=len(tokenizer(prompt)['input_ids']) + max_tokens)
    
    # Test optimized generation
    print("\n2️⃣ Testing OPTIMIZED generation method...")
    torch.cuda.empty_cache()  # Clear cache for fair comparison
    result_optimized = generate_h100_optimized_with_metrics(prompt, max_new_tokens=max_tokens)
    
    print("\n" + "=" * 80)
    print("📊 COMPARISON COMPLETE")
    print("=" * 80)
    
    return result_basic, result_optimized

# Test both methods
print("\n🧪 Testing both generation methods...")
test_prompt = "Explain the benefits of using Kubernetes for machine learning workloads:"

# Run comparison
basic_result, optimized_result = compare_generation_methods(test_prompt, max_tokens=2000)

print(f"\n📄 BASIC Generated text :\n{basic_result}...")
print(f"\n📄 OPTIMIZED Generated text :\n\n\n{optimized_result}...")

print("\n✅ Model comparison complete!")


🧪 Testing both generation methods...
🔬 GENERATION METHOD COMPARISON

1️⃣ Testing BASIC generation method...

📊 BASIC GENERATION METRICS:
🕒 GPU Name: NVIDIA H100 NVL
🧠 Model Name: Qwen/Qwen2.5-Coder-32B-Instruct
🕒 Generation time: 21.79s
💾 Memory delta: 0.034 GB
📝 Generated tokens: 537
⚡ Tokens/second: 24.6

2️⃣ Testing OPTIMIZED generation method...





📊 OPTIMIZED GENERATION METRICS:
🕒 GPU Name: NVIDIA H100 NVL
🧠 Model Name: Qwen/Qwen2.5-Coder-32B-Instruct
🕒 Generation time: 13.59s
💾 Memory delta: 0.000 GB
📝 Generated tokens: 368
⚡ Tokens/second: 27.1

📊 COMPARISON COMPLETE

📄 BASIC Generated text (first 200 chars):
Kubernetes is a powerful open-source platform designed to automate deploying, scaling, and operating application containers. While it's primarily used for web applications, it can also be beneficial for machine learning (ML) workloads. Here are some reasons why:

1. **Resource Management**: Kubernetes efficiently manages resources by automatically allocating and deallocating them based on workload requirements. This is crucial for ML workloads, which can be resource-intensive and require significant computational power.

2. **Scalability**: Kubernetes provides horizontal scalability, meaning it can scale up or down the number of pods (containers) based on demand. This is particularly useful for ML workloads that can vary