# LLM Profiling

This notebook launches the large language models in `src/models` and profiles their kernels with NVIDIA Nsight Compute (`ncu`) on an **NVIDIA H100**.

The goal is to apply principal component analysis (PCA) to the collected metrics and compare the kernels of new large language models against existing GPU benchmark suites such as Rodinia.
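
As a rough illustration of the PCA step described above, the following sketch standardizes a kernel-by-metric matrix and projects it onto its leading principal components using NumPy's SVD. The `pca` helper and the `kernel_metrics` array are hypothetical; the array is filled with random placeholder values, not measurements from this notebook.

```python
import numpy as np


def pca(X: np.ndarray, n_components: int = 2):
    """Project the rows of X onto the top principal components.

    Returns (scores, explained_variance_ratio).
    """
    # Standardize each metric so that large-magnitude counters (such as
    # cycle counts) do not dominate percentage-valued metrics.
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant columns: avoid division by zero
    Z = (X - mean) / std

    # SVD of the standardized matrix: the rows of Vt are the principal axes.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:n_components].T
    explained = (S[:n_components] ** 2) / (S ** 2).sum()
    return scores, explained


# Placeholder data only: 8 hypothetical kernels x 4 hypothetical metrics,
# scaled differently to mimic metrics with very different units.
rng = np.random.default_rng(0)
kernel_metrics = rng.normal(size=(8, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])

scores, explained = pca(kernel_metrics, n_components=2)
print(scores.shape)
print(explained)
```

In a real analysis each row would be one profiled kernel (from an LLM or from a benchmark suite such as Rodinia), and plotting the first two score columns would show whether the two groups overlap.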

In [1]:
from glob import glob
import os
import subprocess

import pandas as pd
import numpy as np
from time import time

import matplotlib.pyplot as plt

print("Testing")

Testing


In [None]:
!mkdir -p ncu-reports

# Generate an ncu report for each model
NVTX_RANGE = "generation/"
BASE_PATH = "../src/models/"
MODELS = [
    "phi2.py",
    "phi3.5.py",
    "orca-mini-7b.py",
    "jamba1.5.mini.py",
    "flux.py",
    "llama70B.py",
]
COUNT = 50  # kernel launches profiled per model; 300 should be tried next

def clean_up_model_name(model_name: str) -> str:
    return model_name.replace(".py", "").replace(".", "")

diff = {}  # wall-clock profiling time in seconds, keyed by model script

for model in MODELS:
    print(f"Profiling {model}")

    model_name = clean_up_model_name(model)
    model_path = os.path.join(BASE_PATH, model)

    start_time = time()
    
    # Adjacent string literals concatenate into the comma-separated list ncu expects.
    metrics = (
        "sm__warps_active.avg.per_cycle_active,sm__warps_active.avg.pct_of_peak_sustained_active,"
        "sm__throughput.avg.pct_of_peak_sustained_elapsed,sm__maximum_warps_per_active_cycle_pct,"
        "sm__maximum_warps_avg_per_active_cycle,sm__cycles_active.avg,"
        "lts__throughput.avg.pct_of_peak_sustained_elapsed,"
        "launch__waves_per_multiprocessor,launch__thread_count,"
        "launch__shared_mem_per_block_static,launch__shared_mem_per_block_dynamic,"
        "launch__shared_mem_per_block_driver,launch__shared_mem_per_block,"
        "launch__shared_mem_config_size,launch__registers_per_thread,"
        "launch__occupancy_per_shared_mem_size,launch__occupancy_per_register_count,"
        "launch__occupancy_per_block_size,launch__occupancy_limit_warps,"
        "launch__occupancy_limit_shared_mem,launch__occupancy_limit_registers,"
        "launch__occupancy_limit_blocks,launch__grid_size,"
        "launch__func_cache_config,launch__block_size,"
        "l1tex__throughput.avg.pct_of_peak_sustained_active,gpu__time_duration.sum,"
        "gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed,"
        "gpc__cycles_elapsed.max,gpc__cycles_elapsed.avg.per_second,"
        "breakdown:sm__throughput.avg.pct_of_peak_sustained_elapsed,"
        "breakdown:gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed,"
        "launch__occupancy_per_cluster_size,launch__occupancy_cluster_pct,"
        "launch__occupancy_cluster_gpu_pct,launch__cluster_size,"
        "launch__cluster_scheduling_policy,launch__cluster_max_potential_size,"
        "launch__cluster_max_active,"
        "gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed,"
        "dram__cycles_elapsed.avg.per_second"
    )
    os.system(
        f'ncu --metrics {metrics} --target-processes all --nvtx '
        f'--nvtx-include "{NVTX_RANGE}" -c {COUNT} '
        f'-o ncu-reports/{model_name} -f python {model_path}'
    )

    diff[model] = time() - start_time
    
    print(f"Done profiling {model}")
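
The cell above passes the whole `ncu` invocation to the shell as one string. One possible alternative, sketched here under assumptions, builds the same flags as an argument list that could be handed to `subprocess.run`, which avoids shell quoting and makes the exit code easy to check. The `build_ncu_command` helper and the shortened `example_metrics` list are hypothetical and not part of the notebook. The flags themselves are the ones used in the cell above.

```python
import shlex


def build_ncu_command(model_path, report_path, metrics, nvtx_range, count):
    """Return the ncu invocation as an argument list (no shell quoting needed)."""
    return [
        "ncu",
        "--metrics", ",".join(metrics),
        "--target-processes", "all",
        "--nvtx",
        "--nvtx-include", nvtx_range,
        "-c", str(count),
        "-o", report_path,
        "-f",
        "python", model_path,
    ]


# Shortened, illustrative metric list; the profiling cell passes many more.
example_metrics = [
    "gpu__time_duration.sum",
    "launch__grid_size",
    "launch__block_size",
]

cmd = build_ncu_command(
    model_path="../src/models/phi2.py",
    report_path="ncu-reports/phi2",
    metrics=example_metrics,
    nvtx_range="generation/",
    count=50,
)

# Print the shell-equivalent form for inspection without running anything.
print(shlex.join(cmd))

# Running it requires ncu on PATH and a GPU, for example:
# import subprocess
# result = subprocess.run(cmd, check=False)
# print(result.returncode)
```

Checking `result.returncode` per model would make it easier to tell a failed profiling run from a successful one before recording its elapsed time.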

Profiling phi2.py




==PROF== Connected to process 1140604 (/usr/bin/python3.10)
==PROF== Target process 1140850 terminated before first instrumented API call.
==PROF== Target process 1140851 terminated before first instrumented API call.


Downloading shards: 100%|██████████| 2/2 [00:00<00:00, 171.56it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.82it/s]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


==PROF== Profiling "vectorized_elementwise_kernel" - 0 (1/50): 0%....50%....100% - 9 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 1 (2/50): 0%....50%....100% - 9 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 2 (3/50): 0%....50%....100% - 9 passes

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.

==PROF== Profiling "vectorized_elementwise_kernel" - 3 (4/50): 0%....50%....100% - 9 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 4 (5/50): 0%....50%....100% - 9 passes
==PROF== Profiling "elementwise_kernel" - 5 (6/50): 0%....50%....100% - 9 passes
==PROF== Profiling "reduce_kernel" - 6 (7/50): 0%....50%....100% - 9 passes
==PROF== Profiling "reduce_kernel" - 7 (8/50): 0%....50%....100% - 9 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 8 (9/50): 0%....50%....100% - 9 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 9 (10/50): 0%....50%....100% - 9 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 10 (11/50): 0%....50%....100% - 9 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 11 (12/50): 0%....50%....100% - 9 passes
==PROF== Profiling "elementwise_kernel" - 12 (13/50): 0%....50%....100% - 9 passes
==PROF== Profiling "unrolled_elementwise_kernel" - 13 (14/50): 0%....50%....100% - 9 passes
==PROF

A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3.5-MoE-instruct:
- configuration_phimoe.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3.5-MoE-instruct:
- modeling_phimoe.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


==PROF== Connected to process 1142772 (/usr/bin/python3.10)
==PROF== Target process 1143041 terminated before first instrumented API call.
==PROF== Target process 1143042 terminated before first instrumented API call.


Downloading shards: 100%|██████████| 17/17 [00:00<00:00, 62.29it/s]
Loading checkpoint shards: 100%|██████████| 17/17 [00:17<00:00,  1.01s/it]


==PROF== Profiling "vectorized_elementwise_kernel" - 0 (1/50): 0%....50%....100% - 9 passes
==PROF== Profiling "elementwise_kernel" - 1 (2/50): 0%....50%....100% - 9 passes
==PROF== Profiling "reduce_kernel" - 2 (3/50): 0%....50%....100% - 9 passes
==PROF== Profiling "reduce_kernel" - 3 (4/50): 0%....50%....100% - 9 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 4 (5/50): 0%....50%....100% - 9 passes
==PROF== Profiling "reduce_kernel" - 5 (6/50): 0%....50%....100% - 9 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 6 (7/50): 0%....50%....100% - 9 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 7 (8/50): 0%....50%....100% - 9 passes
==PROF== Profiling "DeviceScanInitKernel" - 8 (9/50): 0%....50%....100% - 9 passes
==PROF== Profiling "DeviceScanKernel" - 9 (10/50): 0%....50%....100% - 9 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 10 (11/50): 0%....50%....100% - 9 passes
==PROF== Profiling "DeviceScanInitKernel" - 11 (12/50): 0%....50%....100% - 9 passes

The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.

==PROF== Profiling "DeviceScanKernel" - 12 (13/50): 0%....50%....100% - 9 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 13 (14/50): 0%....50%....100% - 9 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 14 (15/50): 0%....50%....100% - 9 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 15 (16/50): 0%....50%....100% - 9 passes
==PROF== Profiling "indexSelectLargeIndex" - 16 (17/50): 0%....50%....100% - 9 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 17 (18/50): 