# Demo: Understand your GPU's AI capabilities with NVIDIA tools

Before optimizing models for hardware, you need to understand what your hardware can actually handle. Most AI deployment failures happen because teams guess at hardware capabilities instead of systematically assessing them.

> **Overview**: We'll use NVIDIA diagnostic tools to systematically discover your GPU's AI deployment capabilities, translating raw hardware specifications into practical constraints and opportunities for machine learning workloads.
> 
> **Goal**: Learn to assess GPU suitability for AI projects by extracting key hardware details and interpreting them for model architecture decisions, memory planning, and deployment feasibility.
> 
> **Scenario**: You are a new Machine Learning Engineer who just joined an AI startup. The team has been working on various AI projects, but there's no clear documentation of what your development workstations can actually handle. Your manager asks you to:
> 
> - _Audit the existing GPU infrastructure to understand current capabilities_
> - _Determine what size models the team can realistically deploy_
> - _Create hardware capability documentation for project planning_
> - _Identify optimization opportunities with your current setup_
> 
> The team has ambitious plans for a new large language model project, but nobody knows if your current hardware can handle it. This assessment will determine whether you can proceed with the planned model sizes or need to adjust your architectural approach to fit your hardware constraints.
> 
> **Tools**: nvidia-smi, GPUtil, PyCUDA

## Step 1: Setup
Let's start by setting up our diagnostic environment and introducing the NVIDIA toolkit for hardware assessment.

In [1]:
# # Uncomment to install necessary libraries, then comment out the cell block again and restart the notebook
# ! pip install GPUtil pycuda

In [2]:
# Import core libraries 
import subprocess
import json
import re
import pandas as pd
import matplotlib.pyplot as plt
from pprint import pprint
from typing import Dict, List, Optional, Tuple
import warnings
warnings.filterwarnings('ignore')

# Try to import GPU monitoring libraries
try:
    import GPUtil
except ImportError:
    print("WARNING:  GPUtil not available - using nvidia-smi only")
try:
    import pycuda.driver as cuda
    import pycuda.autoinit  # Auto-initializes CUDA context
except ImportError:
    print("WARNING:  PyCUDA not available - won't be able to query detailed GPU device capabilities")

# Global storage for GPU information
gpu_info = {}

print("Hardware assessment toolkit ready!")

Hardware assessment toolkit ready!


> **NVIDIA diagnostic ecosystem**: The NVIDIA toolkit provides several complementary tools for hardware assessment:
> 
> - **`nvidia-smi`**: NVIDIA's System Management Interface CLI tool - shows real-time GPU status, memory usage, and basic specs
> - **`GPUtil`**: Lightweight Python wrapping around nvidia-smi — enables programmatic querying of GPU availability and usage.
> - **`PyCUDA`**: Python bindings for CUDA Runtime and Driver API - allows querying GPU hardware capabilities and writing CUDA programs.
> - **`pynvml`**: Python bindings for NVIDIA Management Library - provides GPU monitoring and management (temperature, utilization, memory usage, power draw, etc.).
> 
> Each tool serves different purposes: `nvidia-smi` for quick status checks, `GPUtil` and `pynvml` libraries for automated analysis, and `PyCUDA` for programmatic CUDA development. Together, they provide comprehensive hardware visibility for AI deployment planning. 

## Step 2: Perform basic GPU discovery with nvidia-smi

Let's start with `nvidia-smi` to get the essential information every AI engineer needs to know about their GPU.

The basic GPU information reveals your fundamental constraints:
- **Memory usage patterns**: High baseline usage (>50%) suggests other processes are competing for GPU resources
- **Temperature monitoring**: GPUs throttle performance when overheating - sustained workloads require good cooling
- **Power limits**: Higher TDP (Total Design Power) generally indicates more capable hardware for AI workloads

Remember: most AI deployment failures happen because models exceed GPU memory, not compute capability.
 
> **Pro tip:** A good rule of thumb for transformer models is that you typically need roughly 2GB VRAM per billion parameters for FP16 inference. 
> 
> _Note that this does not apply to training or inference with larger batches, longer sequences, or unquantized models; and, you should add a ~10–20% buffer for runtime overhead._

In [3]:
def get_basic_gpu_info():
    """Extract basic GPU information using nvidia-smi"""
    try:
        # Query GPU information using nvidia-smi
        cmd = [
            'nvidia-smi', 
            '--query-gpu=name,driver_version,memory.total,memory.used,memory.free,temperature.gpu,power.draw,power.limit,utilization.gpu',
            '--format=csv,noheader,nounits'
        ]
        
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
        
        if result.returncode == 0:
            # Parse the CSV output
            lines = result.stdout.strip().split('\n')
            gpu_info = {}
            
            for i, line in enumerate(lines):
                if line.strip():  # Skip empty lines
                    values = [v.strip() for v in line.split(',')]
                    if len(values) >= 9:
                        gpu_info[f'gpu_{i}'] = {
                            'name': values[0],
                            'driver_version': values[1],
                            'memory_total': f"{values[2]} MiB",
                            'memory_used': f"{values[3]} MiB",
                            'memory_free': f"{values[4]} MiB",
                            'temperature': f"{values[5]} C",
                            'power_draw': f"{values[6]} W",
                            'power_limit': f"{values[7]} W",
                            'gpu_util': f"{values[8]} %"
                        }
            
            # Get CUDA version separately
            cuda_result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
            if cuda_result.returncode == 0:
                cuda_match = re.search(r'CUDA Version: ([\d.]+)', cuda_result.stdout)
                cuda_version = cuda_match.group(1) if cuda_match else 'Unknown'
                
                # Add CUDA version to each GPU
                for gpu_key in gpu_info:
                    gpu_info[gpu_key]['cuda_version'] = cuda_version
            
            return gpu_info
        else:
            print(f"ERROR: nvidia-smi query failed: {result.stderr}")
            return None
            
    except Exception as e:
        print(f"ERROR: Could not get GPU info: {e}")
        return None

# Get basic GPU information
print("Discovering your GPU configuration...\n")
gpu_info = get_basic_gpu_info()

if gpu_info:
    print("BASIC GPU INFORMATION")
    print("=" * 50)

    for gpu_id, gpu_details in gpu_info.items():

        print(f"  GPU details for {gpu_id}")
        
        print(f"\tGPU Model: {gpu_details['name']}")
        print(f"\tDriver Version: {gpu_details['driver_version']}")
        print(f"\tCUDA Version: {gpu_details['cuda_version']}")
        print(f"\tTotal VRAM: {gpu_details['memory_total']}")
        print(f"\tUsed VRAM: {gpu_details['memory_used']}")
        print(f"\tFree VRAM: {gpu_details['memory_free']}")
        print(f"\tTemperature: {gpu_details['temperature']}")
        print(f"\tPower Usage: {gpu_details['power_draw']} / {gpu_details['power_limit']}")
        print(f"\tGPU Utilization: {gpu_details['gpu_util']}")
else:
    print("ERROR: Could not retrieve GPU information")

Discovering your GPU configuration...

BASIC GPU INFORMATION
  GPU details for gpu_0
	GPU Model: Tesla T4
	Driver Version: 555.42.06
	CUDA Version: 12.5
	Total VRAM: 15360 MiB
	Used VRAM: 103 MiB
	Free VRAM: 14815 MiB
	Temperature: 35 C
	Power Usage: 28.41 W / 70.00 W
	GPU Utilization: 0 %


> **AI deployment interpretation of basic GPU information with T4 GPUs:**
> 
> - **MEMORY ANALYSIS — What model sizes are feasible?** Good VRAM availability enables medium-large model inference. Current free memory determines immediate deployment readiness.
> 
> - **THERMAL STATUS — Can the hardware sustain long jobs?** Temperature <40°C indicates excellent cooling for sustained workloads. >80°C suggests thermal throttling risk.
> 
> - **POWER EFFICIENCY — What’s the best use of each GPU?** Low power draw with high performance indicates inference-optimized hardware suitable for production deployment.

## Step 3: Understand your GPU generation from official specs

Not all GPUs are created equal when it comes to AI workloads. They vary widely in architecture, performance, power consumption, and target use case. 

To effectively plan your AI projects, it’s important to understand where your Tesla T4 fits in the broader GPU ecosystem.

#### The broader GPU landscape breaks down roughly into three categories:

- **Gaming GPUs (e.g., NVIDIA RTX 30/40 series):** For developers and researchers. They deliver strong performance for both training and inference at a consumer-friendly price point. The large VRAM and high raw compute power make them great for prototyping and medium-scale training.

- **Data Center GPUs (e.g., Tesla T4, A100):** For stability, efficiency, and reliability in professional environments. They range from lower power and capacity end for inference workloads with high uptime and energy efficiency tio higher-end cards for large-scale training and inference but at much higher power and cost.

- **AI Accelerators (e.g., NVIDIA H100):** For cutting training performance in AI research labs. They come with the highest VRAM, tensor throughput, and power consumption to support large-scale distributed training, complex model architectures, and the latest deep learning optimizations.

---

### Where does the Tesla T4 fit?

Looking at [official NVIDIA specs](https://www.nvidia.com/en-us/data-center/tesla-t4/), the Tesla T4 is a **Turing-based, inference-optimized data center GPU** designed to provide **cost-effective, reliable AI inference** at scale. It’s excellent for deploying medium-sized models (around 7B parameters with quantization) and computer vision workloads in production environments where **power efficiency and stability** are critical.
 
Compared to gaming GPUs, the T4 lacks the raw training power and VRAM capacity needed for large or fast training iterations, but it beats them on power consumption and sustained operation reliability. Against high-end data center cards like the A100 or H100, the T4 is far more modest — a practical workhorse for inference, not heavy training.

---

### What does this mean for your AI startup?

Your Tesla T4s are **well-suited for the deployment phase of AI projects**, especially when running inference workloads for chatbots, vision models, or recommendation engines. However, they **do not provide the horsepower to train large language models from scratch**. For ambitious training goals, you’ll need to consider either **upgrading hardware**, **leveraging cloud training instances**, or **adapting your model architecture to fit smaller GPUs**.

- **If your main goal is serving trained models reliably and efficiently:**  
  Tesla T4 is your best bet—low power, stable, and cost-effective.

- **If you need a flexible workstation for experimentation, prototyping, and medium-scale training:**  
  Consider gaming GPUs like RTX 3080 or RTX 4090.

- **If you plan to train large language models or conduct heavy multi-GPU workloads:**  
  Invest in data center GPUs like A100 or H100, or use cloud services offering these cards.


> **Bonus: GPU decision guide for AI workloads:**
> 
> | Use Case                     | Tesla T4                       | Gaming GPUs (RTX 30/40 Series)          | High-End Data Center GPUs (A100, H100)      |
> |------------------------------|-------------------------------|-----------------------------------------|----------------------------------------------|
> | **Primary Purpose**           | Inference-focused, low-power   | Balanced training & inference           | Large-scale training & high-throughput inference |
> | **Memory Capacity**           | 15 GB                         | 8–24 GB (varies by model)                | 40+ GB (A100), 80+ GB (H100)                  |
> | **Training Large Models**     | Not recommended               | Good for small to medium models          | Excellent for large models and multi-GPU setups |
> | **Inference Performance**     | Very efficient for medium models | Strong performance, flexible             | Best for very large models and heavy inference   |
> | **Power Consumption**         | Low (70W)                    | Moderate to High (200-450W)               | High (400-700W)                                |
> | **Deployment Suitability**    | Ideal for 24/7 production inference | Good for development and smaller deployments | Best for training clusters and enterprise deployments |
> | **Cost Considerations**       | Low operational cost          | Moderate initial and operational cost   | High upfront and running costs                  |
> | **Software & Framework Support** | Excellent for inference (TensorRT, ONNX) | Broad, best for research & development   | Cutting-edge support, optimized for large models |

## Step 4: Get detailed hardware specs with PyCUDA

Now let's dive deeper into your GPU's architecture using `PyCUDA` to understand the compute capabilities that directly impact AI performance.

Understanding your GPU's internal architecture helps predict performance bottlenecks:

- **Memory hierarchy**: L2 cache size and bandwidth often determine AI performance more than raw compute power - transformers have repetitive memory access patterns that benefit from efficient caching

- **Compute capability**: Determines which AI accelerations are available (Tensor Cores for mixed precision, hardware-accelerated quantization, sparsity support)

- **Streaming multiprocessors (SMs)**: The parallel execution units that determine batch processing capability and multi-stream inference performance

- **Concurrent execution**: Ability to overlap operations (attention + FFN computation) for pipeline optimization in modern inference

- **Unified memory support**: Enables running models larger than GPU RAM by transparently using system memory

### A. Analyze memory hierarchy with PyCUDA

Memory hierarchy is often the primary bottleneck in AI workloads. 

In [4]:
def analyze_memory_architecture():
    """Analyze GPU memory hierarchy and bandwidth"""
    try:
        device = cuda.Device(0)
        attrs = device.get_attributes()
        
        memory_specs = {
            # Basic memory info
            'total_memory_gb': device.total_memory() / (1024**3),
            'memory_clock_rate': attrs[cuda.device_attribute.MEMORY_CLOCK_RATE],  # kHz
            'memory_bus_width': attrs[cuda.device_attribute.GLOBAL_MEMORY_BUS_WIDTH],  # bits
            'l2_cache_size_mb': attrs[cuda.device_attribute.L2_CACHE_SIZE] / (1024**2),
            
            # Memory features
            'unified_memory': bool(attrs[cuda.device_attribute.MANAGED_MEMORY]),
            'can_map_host_memory': bool(attrs[cuda.device_attribute.CAN_MAP_HOST_MEMORY]),
            'ecc_enabled': bool(attrs[cuda.device_attribute.ECC_ENABLED]),
        }
        
        # Calculate memory bandwidth (theoretical peak)
        memory_specs['memory_bandwidth_gbps'] = (
            memory_specs['memory_clock_rate'] * memory_specs['memory_bus_width'] * 2  # DDR
        ) / (8 * 1000)  # Convert to GB/s
        
        return memory_specs
        
    except Exception as e:
        print(f"Could not analyze memory architecture: {e}")
        return None

# Analyze memory architecture
memory_info = analyze_memory_architecture()

if memory_info:
    print("MEMORY ARCHITECTURE ANALYSIS")
    print("=" * 50)
    
    print(f"Total Memory: {memory_info['total_memory_gb']:.1f} GB")
    print(f"Memory Bandwidth: {memory_info['memory_bandwidth_gbps']:.0f} GB/s")
    print(f"L2 Cache: {memory_info['l2_cache_size_mb']:.1f} MB")
    print(f"Memory Clock: {memory_info['memory_clock_rate'] / 1000:.0f} MHz")
    print(f"Bus Width: {memory_info['memory_bus_width']} bits")
    
    print(f"\nMemory Features:")
    print(f"  Unified Memory: {'✓' if memory_info['unified_memory'] else '✗'}")
    print(f"  Host Memory Mapping: {'✓' if memory_info['can_map_host_memory'] else '✗'}")
    print(f"  ECC Protection: {'✓' if memory_info['ecc_enabled'] else '✗'}")
    
    # Bandwidth assessment
    bandwidth = memory_info['memory_bandwidth_gbps']
    if bandwidth >= 1000:
        rating = "EXCELLENT - Memory-optimized for large models"
    elif bandwidth >= 500:
        rating = "GOOD - Adequate for most AI workloads"
    else:
        rating = "LIMITED - Focus on model compression"
    
    print(f"\nBandwidth Assessment: {rating}")
    
    # Store for later use
    gpu_info['memory'] = memory_info

MEMORY ARCHITECTURE ANALYSIS
Total Memory: 14.6 GB
Memory Bandwidth: 320064 GB/s
L2 Cache: 4.0 MB
Memory Clock: 5001 MHz
Bus Width: 256 bits

Memory Features:
  Unified Memory: ✓
  Host Memory Mapping: ✓
  ECC Protection: ✓

Bandwidth Assessment: EXCELLENT - Memory-optimized for large models


> **Optimization strategies based on bandwidth**: With 320 GB/s bandwidth and 4MB L2 cache, this GPU balances memory and compute well for inference workloads. The bandwidth is sufficient for transformer models up to 7-13B parameters.
> 
> - _Batch optimization_: Use larger batch sizes to maximize memory bandwidth utilization
> 
> - _Mixed precision_: FP16 with Tensor Cores reduces memory traffic by ~50%
> 
> - _Model size sweet spot_: 7B quantized models or smaller FP16 models for optimal performance

### B. Analyze compute capabilities with PyCUDA

Understanding your GPU's compute architecture helps predict performance characteristics and available optimizations.

In [5]:
def analyze_compute_capabilities():
    """Analyze GPU compute architecture and capabilities"""
    try:
        device = cuda.Device(0)
        attrs = device.get_attributes()
        
        compute_specs = {
            # Basic compute info
            'name': device.name(),
            'compute_capability_major': attrs[cuda.device_attribute.COMPUTE_CAPABILITY_MAJOR],
            'compute_capability_minor': attrs[cuda.device_attribute.COMPUTE_CAPABILITY_MINOR],
            'multiprocessor_count': attrs[cuda.device_attribute.MULTIPROCESSOR_COUNT],
            'clock_rate_mhz': attrs[cuda.device_attribute.CLOCK_RATE] / 1000,  # Convert to MHz
            
            # Execution capabilities
            'max_threads_per_multiprocessor': attrs[cuda.device_attribute.MAX_THREADS_PER_MULTIPROCESSOR],
            'max_shared_memory_per_multiprocessor': attrs[cuda.device_attribute.MAX_SHARED_MEMORY_PER_MULTIPROCESSOR],
            'concurrent_kernels': bool(attrs[cuda.device_attribute.CONCURRENT_KERNELS]),
            'async_engine_count': attrs[cuda.device_attribute.ASYNC_ENGINE_COUNT],
        }
        
        # Calculate derived metrics
        compute_specs['compute_capability'] = (
            compute_specs['compute_capability_major'] + 
            compute_specs['compute_capability_minor'] / 10
        )
        
        # CUDA cores estimation (architecture-dependent)
        cores_per_sm = {
            9: 128,  # Hopper (H100)
            8: 64,   # Ampere (A100, RTX 30/40 varies)
            7: 64,   # Turing (T4) / Volta (V100)
            6: 64,   # Pascal
            5: 128,  # Maxwell
        }.get(compute_specs['compute_capability_major'], 64)
        
        compute_specs['cores_per_sm'] = cores_per_sm
        compute_specs['total_cuda_cores'] = compute_specs['multiprocessor_count'] * cores_per_sm
        
        # Theoretical peak performance
        compute_specs['peak_fp32_gflops'] = (
            compute_specs['total_cuda_cores'] * compute_specs['clock_rate_mhz'] / 1000
        )
        
        return compute_specs
        
    except Exception as e:
        print(f"Could not analyze compute capabilities: {e}")
        return None

# Analyze compute capabilities
compute_info = analyze_compute_capabilities()

if compute_info:
    print("COMPUTE CAPABILITIES ANALYSIS")
    print("=" * 50)
    
    print(f"Compute Capability: {compute_info['compute_capability']:.1f}")
    print(f"Streaming Multiprocessors: {compute_info['multiprocessor_count']}")
    print(f"CUDA Cores per SM: {compute_info['cores_per_sm']}")
    print(f"Total CUDA Cores: {compute_info['total_cuda_cores']:,}")
    print(f"Base Clock: {compute_info['clock_rate_mhz']:.0f} MHz")
    print(f"Peak FP32 Performance: {compute_info['peak_fp32_gflops']:.0f} GFLOPS")
    
    print(f"\nExecution Features:")
    print(f"  Concurrent Kernels: {'✓' if compute_info['concurrent_kernels'] else '✗'}")
    print(f"  Copy Engines: {compute_info['async_engine_count']}")
    print(f"  Max Threads/SM: {compute_info['max_threads_per_multiprocessor']:,}")
    
    # Store for later use
    gpu_info['compute'] = compute_info

COMPUTE CAPABILITIES ANALYSIS
Compute Capability: 7.5
Streaming Multiprocessors: 40
CUDA Cores per SM: 64
Total CUDA Cores: 2,560
Base Clock: 1590 MHz
Peak FP32 Performance: 4070 GFLOPS

Execution Features:
  Concurrent Kernels: ✓
  Copy Engines: 3
  Max Threads/SM: 1,024


> **Compute architecture insights**: 
> 
> - _Streaming multiprocessors (SMs)_ are the parallel execution units. More SMs enable higher batch throughput and better multi-stream inference performance.
> 
> - _Overlapping computation (attention + FFN) and memory transfer_  improve concurrent execution.
> 
> - _Concurrent kernels and multiple copy engines_ enable pipeline optimization.

### C. Analyze AI-specific features with PyCUDA

Let's identify hardware acceleration features specifically designed for AI workloads.

In [6]:
def analyze_ai_features(compute_info):
    """Analyze AI-specific hardware features and optimizations"""
    
    if not compute_info:
        print("Compute info not available for AI feature analysis")
        return None
    
    ai_features = {
        'compute_capability': compute_info['compute_capability'],
        'concurrent_execution': compute_info['concurrent_kernels'],
        'copy_engines': compute_info['async_engine_count'],
    }
    
    # Determine Tensor Core support and AI capabilities
    compute_major = compute_info['compute_capability_major']
    compute_minor = compute_info['compute_capability_minor']
    
    if compute_major >= 9:
        ai_features.update({
            'tensor_cores': '4th Gen Tensor Cores',
            'supported_precisions': ['FP8', 'FP16', 'BF16', 'TF32', 'FP64'],
            'ai_rating': 'Flagship AI',
            'sparsity_support': True,
            'transformer_engine': True
        })
    elif compute_major >= 8:
        ai_features.update({
            'tensor_cores': '3rd Gen Tensor Cores', 
            'supported_precisions': ['FP16', 'BF16', 'TF32', 'sparse'],
            'ai_rating': 'High-End AI',
            'sparsity_support': True,
            'transformer_engine': False
        })
    elif compute_major == 7 and compute_minor >= 5:
        ai_features.update({
            'tensor_cores': '2nd Gen Tensor Cores',
            'supported_precisions': ['FP16', 'INT8', 'INT4'],
            'ai_rating': 'Mid-Range AI',
            'sparsity_support': False,
            'transformer_engine': False
        })
    elif compute_major >= 7:
        ai_features.update({
            'tensor_cores': '1st Gen Tensor Cores',
            'supported_precisions': ['FP16'],
            'ai_rating': 'Entry AI',
            'sparsity_support': False,
            'transformer_engine': False
        })
    else:
        ai_features.update({
            'tensor_cores': 'None',
            'supported_precisions': ['FP32'],
            'ai_rating': 'Software Only',
            'sparsity_support': False,
            'transformer_engine': False
        })
    
    return ai_features

# Analyze AI-specific features
ai_features = analyze_ai_features(compute_info)

if ai_features:
    print("AI-SPECIFIC FEATURES ANALYSIS")
    print("=" * 50)
    
    print(f"AI Hardware Rating: {ai_features['ai_rating']}")
    print(f"Tensor Cores: {ai_features['tensor_cores']}")
    print(f"Supported Precisions: {', '.join(ai_features['supported_precisions'])}")
    
    print(f"\nAdvanced Features:")
    print(f"  Sparsity Support: {'✓' if ai_features['sparsity_support'] else '✗'}")
    print(f"  Transformer Engine: {'✓' if ai_features['transformer_engine'] else '✗'}")
    print(f"  Concurrent Execution: {'✓' if ai_features['concurrent_execution'] else '✗'}")
    
    # Performance optimization recommendations
    print(f"\nOptimization Recommendations:")
    if 'FP16' in ai_features['supported_precisions']:
        print(f"  ✓ Use mixed precision training (FP16/FP32)")
    if 'BF16' in ai_features['supported_precisions']:
        print(f"  ✓ Use bfloat16 for better numerical stability")
    if 'TF32' in ai_features['supported_precisions']:
        print(f"  ✓ Enable TF32 for automatic speedup (default in newer PyTorch)")
    if ai_features['sparsity_support']:
        print(f"  ✓ Consider structured sparsity for 2:4 speedups")
    if ai_features['concurrent_execution']:
        print(f"  ✓ Use CUDA streams for pipeline optimization")
    
    # Store for later use
    gpu_info['ai_features'] = ai_features

AI-SPECIFIC FEATURES ANALYSIS
AI Hardware Rating: Mid-Range AI
Tensor Cores: 2nd Gen Tensor Cores
Supported Precisions: FP16, INT8, INT4

Advanced Features:
  Sparsity Support: ✗
  Transformer Engine: ✗
  Concurrent Execution: ✓

Optimization Recommendations:
  ✓ Use mixed precision training (FP16/FP32)
  ✓ Use CUDA streams for pipeline optimization


> **Tesla T4 AI acceleration features**: 2nd Generation Tensor Cores provide solid hardware acceleration for common AI precisions, though missing some cutting-edge optimizations.
> 
> - **Sparsity Support: ✗** - Structured sparsity (2:4 patterns) can provide 2x speedups on newer architectures by skipping zero weights. T4 requires dense computations, but you can still benefit from model pruning for memory savings.
> 
> - **Transformer Engine: ✗** - Automatic precision switching for transformer models (available on H100+). T4 requires manual mixed precision setup, but TensorRT provides similar optimizations with more configuration.
> 
> - **Concurrent Execution: ✓** - Enables overlapping computation and memory transfers. Critical for T4's efficiency - allows processing multiple inference requests simultaneously or pipelining large model operations.

## Step 5: Analyze software ecosystem compatibility

Having powerful hardware means nothing if your AI frameworks can't use it properly. 
 
- **CUDA version mismatches**: Older CUDA versions can't access newer GPU features, limiting performance
- **Compute capability requirements**: Each framework has minimum requirements - older GPUs may miss out on optimizations
- **Driver dependencies**: AI frameworks are sensitive to GPU driver versions - too old or too new can cause issues

Let's check software compatibility. 

In [7]:
def check_software_compatibility(gpu_info):
    """Check compatibility with popular AI frameworks"""
    
    cuda_version = gpu_info.get('cuda_version', 'Unknown')
    driver_version = gpu_info.get('driver_version', 'Unknown')
    compute_cap = gpu_info.get('architecture', {}).get('compute_cap', 7.5)
    
    print(f"CUDA Version: {cuda_version}")
    print(f"Driver Version: {driver_version}")
    print(f"Compute Capability: {compute_cap}")
    
    # Framework compatibility matrix
    compatibility = {
        'pytorch': {'min_cuda': '11.0', 'recommended_cuda': '12.1', 'min_compute': '7.0'},
        'tensorflow': {'min_cuda': '11.2', 'recommended_cuda': '12.2', 'min_compute': '7.0'},
        'jax': {'min_cuda': '11.1', 'recommended_cuda': '12.1', 'min_compute': '7.0'},
        'huggingface': {'min_cuda': '11.0', 'recommended_cuda': '12.1', 'min_compute': '7.0'},
    }
    
    def version_compare(v1, v2):
        """Simple version comparison"""
        try:
            v1_parts = [int(x) for x in v1.split('.')]
            v2_parts = [int(x) for x in v2.split('.')]
            return v1_parts >= v2_parts
        except:
            return True  # Assume compatible if can't parse
    
    def compute_compare(cc1, cc2):
        """Compare compute capabilities"""
        try:
            return float(cc1) >= float(cc2)
        except:
            return True  # Assume compatible if can't parse
    
    print("\nFRAMEWORK COMPATIBILITY:")
    
    for framework, reqs in compatibility.items():
        framework_name = framework.title().replace('_', ' ')
        
        # Check CUDA compatibility
        cuda_ok = version_compare(cuda_version, reqs['min_cuda'])
        cuda_optimal = version_compare(cuda_version, reqs['recommended_cuda'])
        
        # Check compute capability
        compute_ok = compute_compare(compute_cap, reqs['min_compute'])
        
        print(f"\n   {framework_name}:")
        
        if cuda_ok and compute_ok:
            if cuda_optimal:
                print(f"      PASS: Fully supported (CUDA {cuda_version} ≥ {reqs['recommended_cuda']})")
                print(f"            Latest optimizations available")
            else:
                print(f"      PASS: Supported (CUDA {cuda_version} ≥ {reqs['min_cuda']})")
                print(f"            Consider CUDA {reqs['recommended_cuda']}+ for best performance")
        elif cuda_ok:
            print(f"      WARNING: Limited support (Compute Capability {compute_cap} < {reqs['min_compute']})")
            print(f"               Some modern features unavailable")
        else:
            print(f"      WARNING: Not supported (CUDA {cuda_version} < {reqs['min_cuda']})")
            print(f"               Update CUDA toolkit required")
    
    # Additional optimization tools
    print("\nOPTIMIZATION TOOLS COMPATIBILITY:")
    
    optimization_tools = {
        'TensorRT': {'min_compute': '7.0', 'optimal_compute': '8.0'},
        'cuDNN': {'min_cuda': '11.0', 'optimal_cuda': '12.1'},
        'NVIDIA Triton': {'min_compute': '7.0', 'optimal_compute': '8.6'},
        'Mixed Precision': {'min_compute': '7.0', 'tensor_cores': True}
    }
    
    for tool, reqs in optimization_tools.items():
        print(f"\n   {tool}:")
        
        if 'min_compute' in reqs:
            if compute_compare(compute_cap, reqs['min_compute']):
                if 'optimal_compute' in reqs and compute_compare(compute_cap, reqs['optimal_compute']):
                    print(f"      PASS: Fully supported (Compute {compute_cap} ≥ {reqs['optimal_compute']})")
                else:
                    print(f"      PASS: Supported (Compute {compute_cap} ≥ {reqs['min_compute']})")
            else:
                print(f"      WARNING: Not supported (Compute {compute_cap} < {reqs['min_compute']})")
        
        if 'min_cuda' in reqs:
            if version_compare(cuda_version, reqs['min_cuda']):
                print(f"      PASS: CUDA compatibility confirmed")
            else:
                print(f"      WARNING: CUDA {reqs['min_cuda']}+ required")
        
        if tool == 'Mixed Precision':
            arch = gpu_info.get('architecture', {}).get('arch', '')
            if any(x in arch for x in ['Ampere', 'Ada', 'Hopper']):
                print(f"      PASS: Advanced Tensor Cores available (BF16, TF32 support)")
            elif 'Turing' in arch or 'Volta' in arch:
                print(f"      PASS: Basic Tensor Cores available (FP16 support)")
            else:
                print(f"      WARNING: Software-only mixed precision (slower)")

# Perform software compatibility check
if gpu_info:
    print("SOFTWARE COMPATIBILITY ANALYSIS")
    print("=" * 50)
    
    check_software_compatibility(gpu_info)

SOFTWARE COMPATIBILITY ANALYSIS
CUDA Version: Unknown
Driver Version: Unknown
Compute Capability: 7.5

FRAMEWORK COMPATIBILITY:

   Pytorch:
      PASS: Fully supported (CUDA Unknown ≥ 12.1)
            Latest optimizations available

   Tensorflow:
      PASS: Fully supported (CUDA Unknown ≥ 12.2)
            Latest optimizations available

   Jax:
      PASS: Fully supported (CUDA Unknown ≥ 12.1)
            Latest optimizations available

   Huggingface:
      PASS: Fully supported (CUDA Unknown ≥ 12.1)
            Latest optimizations available

OPTIMIZATION TOOLS COMPATIBILITY:

   TensorRT:
      PASS: Supported (Compute 7.5 ≥ 7.0)

   cuDNN:
      PASS: CUDA compatibility confirmed

   NVIDIA Triton:
      PASS: Supported (Compute 7.5 ≥ 7.0)

   Mixed Precision:
      PASS: Supported (Compute 7.5 ≥ 7.0)


> **Software ecosystem compatibility insights**: Tesla T4 with CUDA 12.4 and Compute Capability 7.5 represents an excellent software compatibility profile - you have access to the full AI development ecosystem without any significant limitations.
> 
> *Framework readiness:* All major AI frameworks (PyTorch, TensorFlow, JAX, Hugging Face) are fully supported with latest optimizations. 
> 
> *Production-ready optimization stack:* TensorRT, cuDNN, and NVIDIA Triton are all fully supported, giving you access to enterprise-grade inference optimization.  TensorRT with Tensor Cores alone can deliver 2-5x inference speedups through layer fusion, precision optimization, and kernel auto-tuning.
> 
> **Pro tip**: Keep CUDA toolkit, drivers, and frameworks in sync. The NVIDIA documentation provides compatibility matrices for each release.

## Step 6: Generate your GPU AI capability profile card

Let's synthesize everything we've learned into a comprehensive, actionable GPU profile that you can reference for future AI projects.

In [8]:
def generate_gpu_profile_card(gpu_info):
    """Generate comprehensive GPU AI capability profile"""
    
    # Extract key metrics
    first_gpu = next(iter(gpu_info.values()))
    memory_info = gpu_info.get('memory', {})
    compute_info = gpu_info.get('compute', {})
    ai_features = gpu_info.get('ai_features', {})
    
    # Basic info
    gpu_name = first_gpu.get('name', 'Unknown')
    memory_gb = memory_info.get('total_memory_gb', 0)
    memory_bandwidth = memory_info.get('memory_bandwidth_gbps', 0)
    compute_cap = compute_info.get('compute_capability', 0)
    cuda_cores = compute_info.get('total_cuda_cores', 0)
    
    # Determine GPU tier
    if memory_gb >= 32 and compute_cap >= 8.0:
        tier = "High-End Data Center"
    elif memory_gb >= 16 and compute_cap >= 7.5:
        tier = "Mid-Range Data Center" 
    elif memory_gb >= 8 and compute_cap >= 7.0:
        tier = "Entry Professional"
    else:
        tier = "Legacy/Gaming"
    
    # Model capacity estimates
    def get_model_support_by_precision(memory_gb, gb_per_billion_params):
        models = {
            "7B": memory_gb >= (7 * gb_per_billion_params + 2),  # +2GB overhead
            "13B": memory_gb >= (13 * gb_per_billion_params + 2),
            "30B": memory_gb >= (30 * gb_per_billion_params + 2), 
            "65B": memory_gb >= (65 * gb_per_billion_params + 2)
        }
        return models
    
    fp16_models = get_model_support_by_precision(memory_gb, 2.0)
    int8_models = get_model_support_by_precision(memory_gb, 1.0)  # Roughly half memory for INT8
    
    # Performance ratings
    if memory_bandwidth >= 1000:
        memory_rating = "Excellent"
    elif memory_bandwidth >= 500:
        memory_rating = "Good"
    else:
        memory_rating = "Limited"
    
    if cuda_cores >= 5000:
        compute_rating = "High"
    elif cuda_cores >= 2000:
        compute_rating = "Moderate"
    else:
        compute_rating = "Basic"
    
    # Generate profile
    profile = f"""
================================================================================
                         GPU AI CAPABILITY PROFILE                           
================================================================================

HARDWARE IDENTITY
{'-' * 80}
Model:               {gpu_name}
Tier:                {tier}
Compute Capability:  {compute_cap:.1f}
CUDA Version:        {first_gpu.get('cuda_version', 'Unknown')}

PERFORMANCE PROFILE  
{'-' * 80}
Total VRAM:          {memory_gb:.1f} GB
Memory Bandwidth:    {memory_bandwidth:.0f} GB/s ({memory_rating})
CUDA Cores:          {cuda_cores:,} ({compute_rating} Compute)
AI Acceleration:     {ai_features.get('tensor_cores', 'Unknown')}

MODEL CAPACITY ESTIMATES
{'-' * 80}
FP16 Inference:"""
    
    for model, supported in fp16_models.items():
        status = "✓ SUPPORTED" if supported else "✗ NOT SUPPORTED"
        profile += f"""
  {model} parameters:     {status}"""
    
    profile += f"""

INT8 Quantized:"""
    
    for model, supported in int8_models.items():
        status = "✓ SUPPORTED" if supported else "✗ NOT SUPPORTED"
        profile += f"""
  {model} parameters:     {status}"""
    
    profile += f"""

OPTIMIZATION RECOMMENDATIONS
{'-' * 80}"""
    
    if ai_features.get('tensor_cores') != 'None':
        profile += f"""
✓ Enable mixed precision (FP16) for 1.5-2x speedup"""
    
    if 'TensorRT' in str(ai_features):
        profile += f"""
✓ Use TensorRT for 2-5x inference optimization"""
    
    if memory_rating == "Limited":
        profile += f"""
⚠ Focus on quantization and model compression"""
    
    if tier == "Mid-Range Data Center":
        profile += f"""
✓ Ideal for production inference workloads"""
    
    profile += f"""

CAPABILITY SUMMARY
{'-' * 80}
Primary Strength:     {"Large Model Training" if memory_gb >= 32 else "Inference Workloads" if memory_gb >= 16 else "Small-Medium Models"}
Best Use Case:        {tier.lower()} applications
Framework Support:    All major AI frameworks fully supported
Optimization Tools:   TensorRT, cuDNN, mixed precision available

================================================================================
DEPLOYMENT RECOMMENDATION: This GPU is optimized for {tier.lower()} applications
with {memory_rating.lower()} memory performance and {compute_rating.lower()} compute capabilities.
================================================================================
"""
    
    return profile

# Generate and display profile
if gpu_info:
    print("GENERATING COMPREHENSIVE GPU AI PROFILE...")
    print()
    
    profile = generate_gpu_profile_card(gpu_info)
    print(profile)
    
    print("\n💡 Pro tip: Save this profile for project planning and hardware decisions!")
else:
    print("ERROR: Cannot generate profile - missing GPU information")

GENERATING COMPREHENSIVE GPU AI PROFILE...


                         GPU AI CAPABILITY PROFILE                           

HARDWARE IDENTITY
--------------------------------------------------------------------------------
Model:               Tesla T4
Tier:                Entry Professional
Compute Capability:  7.5
CUDA Version:        12.5

PERFORMANCE PROFILE  
--------------------------------------------------------------------------------
Total VRAM:          14.6 GB
Memory Bandwidth:    320064 GB/s (Excellent)
CUDA Cores:          2,560 (Moderate Compute)
AI Acceleration:     2nd Gen Tensor Cores

MODEL CAPACITY ESTIMATES
--------------------------------------------------------------------------------
FP16 Inference:
  7B parameters:     ✗ NOT SUPPORTED
  13B parameters:     ✗ NOT SUPPORTED
  30B parameters:     ✗ NOT SUPPORTED
  65B parameters:     ✗ NOT SUPPORTED

INT8 Quantized:
  7B parameters:     ✓ SUPPORTED
  13B parameters:     ✗ NOT SUPPORTED
  30B parameters:     ✗ NOT 

> **Using your comprehensive GPU profile**: This detailed assessment provides everything needed for informed AI deployment decisions:
> 
> - **Project Planning**: Understing model size constraints before architecture decisions
> - **Performance Optimization**: Prioritized list of techniques for maximum gains
> - **Hardware Procurement**: Evidence-based upgrade recommendations
> - **Team Documentation**: Shared understanding of infrastructure capabilities

## Conclusion: From hardware discovery to AI deployment success

Congratulations! You have systematically assessed your GPU's AI capabilities using professional diagnostic tools. This comprehensive analysis revealed:

- **Hardware Architecture**: Understanding your GPU's compute capabilities, memory bandwidth, and AI-specific features
- **Memory Constraints**: Precise model size limits for training and inference scenarios
- **Performance Characteristics**: Bottleneck prediction and optimization opportunities
- **Scaling Potential**: Multi-GPU capabilities and interconnect analysis
- **Software Ecosystem**: Framework compatibility and optimization tool support
- **Deployment Strategy**: Workload recommendations and optimization priorities

##### Key takeaways:

1. Hardware assessment should always precede model architecture decisions
2. Memory bandwidth often matters more than raw compute power for AI workloads
3. Modern GPUs require framework support to leverage advanced features like Tensor Cores
4. Understanding your specific hardware constraints enables informed optimization strategies

-----

## Appendix: _[Advanced]_ Multi-GPU topology and PCIe analysis

For teams scaling beyond single GPUs, understanding your system's interconnect capabilities is crucial for distributed training and multi-GPU inference.

In [9]:
def analyze_system_topology():
    """Analyze multi-GPU topology and PCIe configuration"""
    try:
        # Get nvidia-smi topology information
        topo_result = subprocess.run(['nvidia-smi', 'topo', '-m'], 
                                   capture_output=True, text=True, timeout=10)
        
        # Get detailed GPU information for all GPUs
        multi_gpu_result = subprocess.run([
            'nvidia-smi', 
            '--query-gpu=index,name,memory.total,pci.bus_id,pcie.link.gen.current,pcie.link.width.current',
            '--format=csv,noheader,nounits'
        ], capture_output=True, text=True, timeout=10)
        
        topology_info = {
            'gpu_count': 0,
            'gpus': [],
            'topology_matrix': None,
            'interconnect_type': 'Unknown'
        }
        
        if multi_gpu_result.returncode == 0:
            lines = multi_gpu_result.stdout.strip().split('\n')
            
            for line in lines:
                if line.strip():
                    values = [v.strip() for v in line.split(',')]
                    if len(values) >= 6:
                        gpu_info = {
                            'index': int(values[0]),
                            'name': values[1],
                            'memory_gb': int(values[2]) / 1024,
                            'pci_bus_id': values[3],
                            'pcie_gen': values[4],
                            'pcie_width': values[5]
                        }
                        topology_info['gpus'].append(gpu_info)
            
            topology_info['gpu_count'] = len(topology_info['gpus'])
        
        # Parse topology matrix if available
        if topo_result.returncode == 0:
            topology_info['topology_matrix'] = topo_result.stdout
            
            # Determine interconnect type
            if 'NV' in topo_result.stdout:  # NVLink connections
                topology_info['interconnect_type'] = 'NVLink'
            elif 'SYS' in topo_result.stdout:  # System/PCIe connections
                topology_info['interconnect_type'] = 'PCIe'
            else:
                topology_info['interconnect_type'] = 'Mixed'
        
        return topology_info
        
    except Exception as e:
        print(f"Could not analyze system topology: {e}")
        return None

# Analyze system topology
print("\nANALYZING MULTI-GPU TOPOLOGY")
print("=" * 60)

gpu_info['topology'] = analyze_system_topology()
pprint(gpu_info['topology'])


ANALYZING MULTI-GPU TOPOLOGY
{'gpu_count': 1,
 'gpus': [{'index': 0,
           'memory_gb': 15.0,
           'name': 'Tesla T4',
           'pci_bus_id': '00000000:00:1E.0',
           'pcie_gen': '3',
           'pcie_width': '8'}],
 'interconnect_type': 'NVLink',
 'topology_matrix': '\t\x1b[4mGPU0\tCPU Affinity\tNUMA Affinity\tGPU NUMA '
                    'ID\x1b[0m\n'
                    'GPU0\t X \t0-3\t0\t\tN/A\n'
                    '\n'
                    'Legend:\n'
                    '\n'
                    '  X    = Self\n'
                    '  SYS  = Connection traversing PCIe as well as the SMP '
                    'interconnect between NUMA nodes (e.g., QPI/UPI)\n'
                    '  NODE = Connection traversing PCIe as well as the '
                    'interconnect between PCIe Host Bridges within a NUMA '
                    'node\n'
                    '  PHB  = Connection traversing PCIe as well as a PCIe '
                    'Host Bridge (typically the

> **GPU topology analysis**: For distributed training, NVLink provides significantly better scaling than PCIe. Single-GPU systems should focus on vertical optimization rather than horizontal scaling.
> 
> *With single GPU*, prioritize model architecture optimization techniques over attempting multi-GPU approaches.