In [5]:
from llama.llama_model_handler import LlamaModelHandler
from IPython.display import Markdown, display

Loading model: meta-llama/Llama-3.1-8b

In [2]:
model_handler = LlamaModelHandler("meta-llama/Llama-3.1-8b")

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


Authentication successful.
Loading model 'meta-llama/Llama-3.1-8b'...


Loading checkpoint shards: 100%|██████████| 4/4 [00:07<00:00,  1.95s/it]


Model loaded on device: cuda:0
GPU: NVIDIA L4
Model dtype: torch.float16


Test prompt:

In [5]:
prompt = "Whats the meaning of life?"
display(Markdown(model_handler.generate_text(prompt=prompt, max_new_tokens=250)))

Whats the meaning of life? That's a question we all ask at some point in our lives. This is something that has been pondered by many great minds throughout history.
But what does it mean to you and me?
We are born, live for 80-90 years or so (depending on where we come from) then die and go back into nature.
What exactly happens after death though...is anyone really sure?
Well here is my take...
The idea behind this theory is pretty simple:
Our souls have always existed since before birth and will continue existing even when we physically pass away.
It could be argued that when one dies their soul goes straight up to heaven to meet with God but I don't believe thats how things work out.
Instead Im convinced that your soul continues living through reincarnation; which means being reborn again somewhere else along time lines other than ours - maybe another planet perhaps even Earth itself!
There may also exist an infinite number of parallel universes containing identical copies yet completely different versions thereof due certain changes made within each individual instance leading them down separate paths until they eventually become two entirely dissimilar entities once more separated by space & matter
In addition there might possibly multiple dimensions beyond those currently known about including ones consisting solely energy instead physical mass

### **Model Performance Benchmarking Metrics**

---

<small>

#### 1. **Latency**

Measures time delays during generation.

- **First-Token Latency (FTL):**  
  Time to generate the **first token**.  
  $$ \text{FTL} = t_{\text{first token}} - t_{\text{start}} $$

- **Average-Token Latency (ATL):**  
  Average time per token after the first one, where GL is the total generation latency defined in the next item.  
  $$ \text{ATL} = \frac{\text{GL} - \text{FTL}}{N_{\text{tokens}} - 1} $$

- **Generation Latency (GL):**  
  Total time to generate the **full output**.  
  $$ \text{GL} = t_{\text{end}} - t_{\text{start}} $$
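
The three latency formulas can be sketched as follows. This is a minimal illustration, assuming the timestamps come from something like `time.perf_counter()` and that the caller records when the first token arrives; `LatencyMetrics` and `latency_metrics` are hypothetical names, not part of the benchmark module used later.

```python
from dataclasses import dataclass


@dataclass
class LatencyMetrics:
    ftl: float  # first-token latency in seconds
    atl: float  # average latency per token after the first, in seconds/token
    gl: float   # total generation latency in seconds


def latency_metrics(t_start: float, t_first_token: float, t_end: float,
                    n_tokens: int) -> LatencyMetrics:
    """Compute FTL, ATL and GL from three timestamps and a token count."""
    if n_tokens < 1:
        raise ValueError("n_tokens must be at least 1")
    ftl = t_first_token - t_start
    gl = t_end - t_start
    # ATL averages over the tokens after the first one; with a single
    # token there is nothing to average, so 0.0 is reported.
    atl = (gl - ftl) / (n_tokens - 1) if n_tokens > 1 else 0.0
    return LatencyMetrics(ftl=ftl, atl=atl, gl=gl)


# Illustrative timestamps (seconds), not measured values.
m = latency_metrics(t_start=0.0, t_first_token=0.5, t_end=2.5, n_tokens=5)
print(m)
```

If ATL and FTL come out identical for every prompt, it is worth checking that the first-token timestamp is really captured separately from the rest of the generation.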

---

#### 2. **Throughput**

Measures the output rate of the model.

- **Tokens per Second (TPS):**  
  Number of tokens generated per second.  
  $$ \text{TPS} = \frac{N_{\text{tokens}}}{\text{GL}} $$

- **Sentences per Second (SPS):**  
  Number of sentences generated per second.  
  $$ \text{SPS} = \frac{N_{\text{sentences}}}{\text{GL}} $$
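
A small sketch of both throughput metrics, assuming sentences are counted with a naive punctuation rule (the benchmark's actual sentence splitter is not shown in this notebook, so `count_sentences` is a stand-in):

```python
import re


def count_sentences(text: str) -> int:
    """Naive counter: one sentence per run of '.', '!' or '?' that is
    followed by whitespace or the end of the text."""
    return len(re.findall(r"[.!?]+(?=\s|$)", text))


def throughput(n_tokens: int, n_sentences: int, gl: float) -> tuple[float, float]:
    """Return (TPS, SPS) for a generation that took `gl` seconds."""
    if gl <= 0:
        raise ValueError("gl must be positive")
    return n_tokens / gl, n_sentences / gl


# Illustrative values, not taken from a real run.
text = "Transformers use attention. They scale well! Do they replace RNNs?"
tps, sps = throughput(n_tokens=128, n_sentences=count_sentences(text), gl=8.0)
print(f"TPS={tps:.2f} tokens/s, SPS={sps:.3f} sentences/s")
```

Because SPS depends on how sentences are split, it is only comparable between runs that use the same splitter.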

---

#### 3. **Storage**

Provides insights into memory usage during inference.

- **Model Size:**  
  The total disk space used by the pre-trained model.

- **KV-Cache Size:**  
  Memory used for key-value caching during generation.

- **Memory Usage (Model + KV-Cache):**  
  $$ \text{Memory}_{\text{total}} = \text{Model Memory} + \text{KV-Cache Memory} $$
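
The KV-cache term can be estimated from the model shape. The sketch below assumes the usual layout of one key and one value vector per layer, per KV head, per token. The Llama-3.1-8B shape (32 layers, 8 KV heads, head dimension 128) is an assumption taken from the published model configuration and should be checked against the checkpoint's `config.json`; `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_elem: int = 2) -> int:
    """Estimate KV-cache size in bytes.

    The leading factor 2 accounts for storing both keys and values;
    bytes_per_elem=2 corresponds to fp16/bf16 cache entries.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


MIB = 1024 ** 2

# Assumed Llama-3.1-8B shape: 32 layers, 8 KV heads, head_dim 128.
per_token = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=1)
kv_4k = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"KV cache for 4096 tokens: {kv_4k / MIB:.0f} MiB")
```

Adding this estimate to the model's weight memory gives the total in the formula above.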

---

#### 4. **Energy**

Evaluates energy efficiency during generation.

- **Energy Consumption per Token:**  
  $$ E_{\text{token}} = \frac{E_{\text{total}}}{N_{\text{tokens}}} $$

- **Energy Consumption per Sentence:**  
  $$ E_{\text{sentence}} = \frac{E_{\text{total}}}{N_{\text{sentences}}} $$

- **Energy Consumption per Second (average power):**  
  $$ P_{\text{avg}} = \frac{E_{\text{total}}}{\text{GL}} $$
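
A minimal sketch of the energy metrics, assuming total energy is measured in watt-hours (as in the result tables further down) and converted to joules before dividing. Energy per second is treated here as average power, that is, total energy divided by generation time. `energy_metrics` is a hypothetical helper; how the total energy is measured is outside this sketch.

```python
JOULES_PER_WH = 3600.0  # 1 Wh = 3600 J


def energy_metrics(total_energy_wh: float, n_tokens: int,
                   n_sentences: int, gl: float) -> dict[str, float]:
    """Derive per-token, per-sentence and per-second energy figures."""
    if n_tokens < 1 or n_sentences < 1 or gl <= 0:
        raise ValueError("counts must be >= 1 and gl must be positive")
    total_joules = total_energy_wh * JOULES_PER_WH
    return {
        "energy_per_token_j": total_joules / n_tokens,
        "energy_per_sentence_j": total_joules / n_sentences,
        "avg_power_w": total_joules / gl,  # joules per second = watts
    }


# Illustrative values, not taken from a real run.
metrics = energy_metrics(total_energy_wh=0.1, n_tokens=100, n_sentences=4, gl=6.0)
print(metrics)
```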

---

#### 5. **Quality (Summarization)**

Measures the quality of model-generated text, especially for summarization tasks.

- **ROUGE Score:**  
  Measures the overlap between generated and reference summaries.

- **Perplexity:**  
  Indicates how well the model predicts a sequence. Lower is better.  
  $$ \text{Perplexity} = e^{\text{Cross-Entropy Loss}} $$
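
Both quality metrics can be illustrated in a few lines. The sketch assumes per-token natural-log probabilities are available for perplexity, and it uses a simplified ROUGE-1 F1 (clipped unigram overlap on lowercased, whitespace-split tokens) rather than a full ROUGE implementation with stemming. Both function names are hypothetical.

```python
import math
from collections import Counter


def perplexity(token_logprobs: list[float]) -> float:
    """exp of the mean negative log-likelihood (natural log) per token."""
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    cross_entropy = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(cross_entropy)


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1 based on clipped unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # min count per shared word
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# A model that assigns probability 0.5 to every token has perplexity 2.
ppl = perplexity([math.log(0.5)] * 4)
f1 = rouge1_f1("the cat sat", "the cat sat on the mat")
print(f"perplexity={ppl:.3f}, ROUGE-1 F1={f1:.3f}")
```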

---

#### **Summary of Key Metrics**

| Metric                   | Unit             | Formula/Definition                                  |
|--------------------------|-------------------|-----------------------------------------------------|
| **Latency**                  |                |                                                     |
| First-Token Latency      | seconds (s)       | $$ \text{FTL} $$                                    |
| Average-Token Latency    | seconds/token     | $$ \text{ATL} $$                                    |
| Generation Latency       | seconds (s)       | $$ \text{GL} $$                                     |
| **Throughput**               |                  |                                                     |
| Tokens per Second (TPS)  | tokens/second     | $$ \frac{N_{\text{tokens}}}{\text{GL}} $$            |
| Sentences per Second     | sentences/second  | $$ \frac{N_{\text{sentences}}}{\text{GL}} $$         |
| **Storage**                  |                  |                                                     |
| Model Size               | MB/GB             | Disk space used by the pre-trained model.           |
| KV-Cache Size            | MB/GB             | Memory used for key-value caching during generation.|
| Memory Usage             | MB/GB             | $$ \text{Model Memory} + \text{KV-Cache Memory} $$   |
| **Energy**                   |                  |                                                     |
| Energy per Token         | Joules/token      | $$ \frac{E_{\text{total}}}{N_{\text{tokens}}} $$     |
| Energy per Sentence      | Joules/sentence   | $$ \frac{E_{\text{total}}}{N_{\text{sentences}}} $$  |
| Energy per Second        | Watts (W)         | $$ P_{\text{avg}} = \frac{E_{\text{total}}}{\text{GL}} $$ |
| **Quality**                  |                  |                                                     |
| ROUGE Score              | score (0 to 1)    | Overlap between generated and reference summaries.  |
| Perplexity               | unitless          | $$ e^{\text{Cross-Entropy Loss}} $$ (lower is better) |

</small>

#### Test Benchmark

In [1]:
from benchmark.benchmark import ModelBenchmark
from LLM_engines.llama_model_handler import LlamaModelHandler

In [2]:
# Load model and tokenizer

precision = ["int8", "fp16"]

model_handler = LlamaModelHandler("meta-llama/Llama-3.1-8b", precision=precision[1])
model, tokenizer = model_handler.get_model_and_tokenizer()

# Initialize benchmark
benchmark = ModelBenchmark(model=model, tokenizer=tokenizer, max_tokens=128)

# Prompts evaluated by the benchmark
test_prompts = [
    "Explain the significance of transformer models in NLP.",
    "What are the main benefits of renewable energy?",
    "How does the immune system work?",
    "What is the capital of France?",
    "What is the best way to cook a steak?"
]

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


Authentication successful.
Loading model 'meta-llama/Llama-3.1-8b' with precision 'fp16'...


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Model loaded on device: cuda:0
GPU: NVIDIA L4
Model dtype: torch.float16
Model loading time: 11.4783 seconds


In [3]:
benchmark_results_fp16 = benchmark.benchmark(test_prompts)

Evaluating prompt (length 54 characters)...


Evaluating prompt (length 47 characters)...
Evaluating prompt (length 32 characters)...
Evaluating prompt (length 30 characters)...
Evaluating prompt (length 37 characters)...


In [4]:
benchmark_results_fp16

| Index | Prompt Length | FTL (s) | ATL (s) | GL (s) | TPS (tokens/s) | SPS (sentences/s) | Memory Usage (MB) | Model Size (MB) | Total Energy (Wh) | Energy per Token (J/token) | Energy per Sentence (J/sentence) | Energy per Second (W) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 54 | 0.0864 | 0.0864 | 8.9889 | 15.57 | 0.44 | 16190.06 | 15325.276917 | 0.160639 | 4.131985 | 146.215909 | 64.335 |
| 1 | 47 | 0.0709 | 0.0709 | 8.1539 | 16.92 | 0.49 | 16190.06 | 15325.276917 | 0.153651 | 4.009338 | 138.444898 | 67.838 |
| 2 | 32 | 0.068 | 0.068 | 8.1625 | 16.66 | 0.74 | 16190.06 | 15325.276917 | 0.14826 | 3.92491 | 88.363514 | 65.389 |
| 3 | 30 | 0.0763 | 0.0763 | 8.1692 | 16.65 | 0.73 | 16190.06 | 15325.276917 | 0.129554 | 3.428949 | 78.208219 | 57.092 |
| 4 | 37 | 0.0746 | 0.0746 | 8.2051 | 16.94 | 0.61 | 16190.06 | 15325.276917 | 0.146933 | 3.805608 | 105.683607 | 64.467 |


### vLLM

#### Model Initialization Parameters in vLLM

<small>

| **Parameter** | **Type** | **Possible Values** | **Default** | **Effect** | **Optimization Strategy** |
|--------------|---------|-----------------|------------|----------------|--------------------------|
| **`model`** | `str` | Model name from Hugging Face | Required | Loads the model (e.g., `"meta-llama/Llama-3.1-8b"`) | Choose a model suited to your VRAM (use **smaller models or quantized models** for limited memory). |
| **`tokenizer`** | `str \| None` | `None`, Hugging Face tokenizer name, or local tokenizer path | `None` | Defines the tokenizer to use. | **Set to `None` to use the default model tokenizer.** Custom tokenizer paths are useful for fine-tuned models. |
| **`tokenizer_mode`** | `str` | `"auto"`, `"slow"` | `"auto"` | Defines tokenizer behavior. | **Use `"auto"` (default)**, which picks the fast tokenizer when one is available. `"slow"` forces the slow tokenizer and is only needed if the fast one is buggy. |
| **`skip_tokenizer_init`** | `bool` | `True`, `False` | `False` | Skips tokenizer loading. | **Set to `True` if using an external tokenizer** to reduce memory usage. |
| **`trust_remote_code`** | `bool` | `True`, `False` | `False` | Enables loading remote model code. | **Enable (`True`) only for custom models that require additional scripts**. |
| **`allowed_local_media_path`** | `str` | Any path | `""` | Defines a local path for media files. | Used for **multimodal models** (not needed for LLaMA). |
| **`tensor_parallel_size`** | `int` | `1` or more | `1` | Number of GPUs for inference. | **Set `2+` for multi-GPU setups (A100, H100, multiple RTX 4090s).** |
| **`dtype`** | `str` | `"auto"`, `"float16"`, `"bfloat16"`, `"float32"` | `"auto"` | Model precision (affects memory & speed). | **Use `"bfloat16"` on GPUs with native support (compute capability 8.0+, e.g. A100/H100), `"float16"` on older GPUs**. `"float32"` is **not recommended** due to memory cost. |
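| **`quantization`** | `str \| None` | `"awq"`, `"gptq"`, `None` | `None` | Enables quantization for lower memory usage. | **Use `"awq"` to cut weight memory with minimal quality drop** (4-bit weights are roughly a quarter the size of FP16). `"gptq"` is useful for **8GB GPUs**. Both expect a checkpoint that was already quantized with that method. |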
| **`revision`** | `str \| None` | Hugging Face model commit hash | `None` | Specifies a model version. | **Use a specific revision if testing different versions** of a model. |
| **`tokenizer_revision`** | `str \| None` | Hugging Face tokenizer commit hash | `None` | Specifies tokenizer version. | Usually **not needed**, but useful if a tokenizer update affects behavior. |
| **`seed`** | `int` | Any integer | `0` | Sets random seed for reproducibility. | **Set to a fixed value (`42`) if you need consistent results.** |


#### Recommended Settings for Different Hardware

| **Use Case** | **Model** | **dtype** | **tensor_parallel_size** | **quantization** | **Optimization Notes** |
|-------------|---------|--------|-------------------|--------------|--------------------|
| **Best Performance (High VRAM, A100/H100, 80GB VRAM)** | `"meta-llama/Llama-3.1-8b"` | `"bfloat16"` | `2+` | `None` | Multi-GPU setup, fastest inference. |
| **Optimized for Consumer GPUs (RTX 3090/4090, 24GB VRAM)** | `"meta-llama/Llama-3.1-8b"` | `"float16"` | `1-2` | `"awq"` | Reduces weight memory with minimal loss in accuracy. |
| **Low VRAM Setup (RTX 3060, 12GB VRAM)** | `"meta-llama/Llama-3.1-8b"` | `"float16"` | `1` | `"gptq"` | Optimized for **low-memory GPUs**, reduces precision slightly. |
| **CPU-Only (Mac M1/M2, Non-GPU systems)** | `"meta-llama/Llama-3.1-8b"` | `"float32"` | `1` | `"gptq"` | `"float32"` is needed since Mac doesn’t support `"float16"`. |


</small>
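
As a rough sketch of how the parameters above fit together, the helper below assembles keyword arguments for `vllm.LLM`. The helper names (`pick_dtype`, `build_llm_kwargs`) are hypothetical. The dtype rule assumes bfloat16 is used whenever the GPU reports compute capability 8.0 or higher, and the example passes `(8, 9)`, which is the assumed compute capability of the L4 used in this notebook. The vLLM import is left commented out so the snippet runs without a GPU.

```python
from typing import Optional


def pick_dtype(compute_capability: tuple[int, int]) -> str:
    """Choose bfloat16 on GPUs with native support (8.0+), else float16."""
    return "bfloat16" if compute_capability >= (8, 0) else "float16"


def build_llm_kwargs(model: str, compute_capability: tuple[int, int],
                     num_gpus: int = 1, quantization: Optional[str] = None,
                     seed: int = 42) -> dict:
    """Assemble keyword arguments matching the parameter table above."""
    return {
        "model": model,
        "tokenizer": None,            # fall back to the model's tokenizer
        "tokenizer_mode": "auto",
        "trust_remote_code": False,
        "tensor_parallel_size": num_gpus,
        "dtype": pick_dtype(compute_capability),
        "quantization": quantization,
        "seed": seed,
    }


kwargs = build_llm_kwargs("meta-llama/Llama-3.1-8b", compute_capability=(8, 9))
print(kwargs)

# With vLLM installed and a GPU available, the dictionary can be passed on:
#   from vllm import LLM
#   llm = LLM(**kwargs)
```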

In [1]:
from benchmark.benchmark import ModelBenchmark
from LLM_engines.vLLM_model_handler import vLLMHandler

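vllm_handler = vLLMHandler(
    model_name="meta-llama/Llama-3.1-8b",
    dtype="float16",
    gpu_util=0.99,  # fraction of GPU memory vLLM may claim (logged below as gpu_memory_utilization)
    quantization=None,
    seed=42,
)

benchmark_vllm = ModelBenchmark(vllm_handler, backend="vllm")

# Prompts evaluated by the benchmark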
test_prompts = [
    "Explain the significance of transformer models in NLP.",
    "What are the main benefits of renewable energy?",
    "How does the immune system work?",
    "What is the capital of France?",
    "What is the best way to cook a steak?"
]

INFO 03-09 12:23:56 __init__.py:207] Automatically detected platform cuda.
INFO 03-09 12:24:05 config.py:549] This model supports multiple tasks: {'embed', 'classify', 'score', 'reward', 'generate'}. Defaulting to 'generate'.
INFO 03-09 12:24:05 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='meta-llama/Llama-3.1-8b', speculative_config=None, tokenizer='meta-llama/Llama-3.1-8b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=24576, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_ti

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]


INFO 03-09 12:24:18 model_runner.py:1115] Loading model weights took 14.9888 GB
INFO 03-09 12:24:25 worker.py:267] Memory profiling takes 7.08 seconds
INFO 03-09 12:24:25 worker.py:267] the current vLLM instance can use total_gpu_memory (22.05GiB) x gpu_memory_utilization (0.99) = 21.82GiB
INFO 03-09 12:24:25 worker.py:267] model weights take 14.99GiB; non_torch_memory takes 0.04GiB; PyTorch activation peak memory takes 2.54GiB; the rest of the memory reserved for KV Cache is 4.25GiB.
INFO 03-09 12:24:26 executor_base.py:111] # cuda blocks: 2176, # CPU blocks: 2048
INFO 03-09 12:24:26 executor_base.py:116] Maximum concurrency for 24576 tokens per request: 1.42x
INFO 03-09 12:24:29 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_ut

Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:27<00:00,  1.27it/s]

INFO 03-09 12:24:57 model_runner.py:1562] Graph capturing finished in 28 secs, took 0.26 GiB
INFO 03-09 12:24:57 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 38.87 seconds
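
The memory figures in the log above can be reproduced with a little arithmetic. The sketch below copies the logged values and assumes vLLM's default block size of 16 tokens, which the log does not print. Because the inputs are rounded to two decimals, the computed KV-cache budget can differ from the logged 4.25 GiB by about 0.01 GiB.

```python
# Values copied from the vLLM log above (GiB unless noted).
total_gpu_memory = 22.05
gpu_memory_utilization = 0.99
model_weights = 14.99
non_torch_memory = 0.04
activation_peak = 2.54

# Memory vLLM is allowed to use, and what is left for the KV cache.
usable = total_gpu_memory * gpu_memory_utilization
kv_cache_gib = usable - model_weights - non_torch_memory - activation_peak

# Token capacity derived from the logged block count.
num_gpu_blocks = 2176
block_size = 16        # assumption: vLLM default block size in tokens
max_seq_len = 24576    # max_seq_len from the engine config line
token_capacity = num_gpu_blocks * block_size
max_concurrency = token_capacity / max_seq_len

print(f"usable memory:   {usable:.2f} GiB")
print(f"KV-cache budget: {kv_cache_gib:.2f} GiB")
print(f"token capacity:  {token_capacity} tokens")
print(f"max concurrency: {max_concurrency:.2f}x")
```

Under these assumptions the token capacity is 34816 tokens, which is consistent with the logged maximum concurrency of 1.42x for 24576-token requests.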





In [2]:
vllm_results = benchmark_vllm.benchmark(test_prompts)

Evaluating prompt (length 54 characters)...


Processed prompts: 100%|██████████| 1/1 [00:31<00:00, 31.07s/it, est. speed input: 0.39 toks/s, output: 16.48 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:31<00:00, 31.09s/it, est. speed input: 0.39 toks/s, output: 16.47 toks/s]


Evaluating prompt (length 47 characters)...


Processed prompts: 100%|██████████| 1/1 [00:22<00:00, 22.73s/it, est. speed input: 0.44 toks/s, output: 16.46 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:31<00:00, 31.11s/it, est. speed input: 0.32 toks/s, output: 16.46 toks/s]


Evaluating prompt (length 32 characters)...


Processed prompts: 100%|██████████| 1/1 [00:31<00:00, 31.11s/it, est. speed input: 0.26 toks/s, output: 16.46 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:21<00:00, 21.08s/it, est. speed input: 0.38 toks/s, output: 16.46 toks/s]


Evaluating prompt (length 30 characters)...


Processed prompts: 100%|██████████| 1/1 [00:31<00:00, 31.11s/it, est. speed input: 0.26 toks/s, output: 16.46 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:31<00:00, 31.19s/it, est. speed input: 0.26 toks/s, output: 16.42 toks/s]


Evaluating prompt (length 37 characters)...


Processed prompts: 100%|██████████| 1/1 [00:31<00:00, 31.11s/it, est. speed input: 0.35 toks/s, output: 16.46 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:31<00:00, 31.11s/it, est. speed input: 0.35 toks/s, output: 16.46 toks/s]


In [3]:
vllm_results

| Index | Prompt Length | FTL (s) | ATL (s) | GL (s) | TPS (tokens/s) | SPS (sentences/s) | Memory Usage (MB) | Model Size (MB) | KV-Cache Size Estimation (MB) | Total Energy (Wh) | Energy per Token (J/token) | Energy per Sentence (J/sentence) | Energy per Second (W) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 54 | 0.0728 | 0.0728 | 31.0805 | 14.22 | 0.68 | 20696.06 |  |  | 0.62186 | 5.065331 | 105.925 | 72.029 |
| 1 | 47 | 0.0689 | 0.0689 | 22.7282 | 18.44 | 1.28 | 20696.06 |  |  | 0.455038 | 3.908623 | 56.308594 | 72.075 |
| 2 | 32 | 0.0715 | 0.0715 | 31.114 | 9.51 | 0.67 | 20696.06 |  |  | 0.621554 | 7.562145 | 107.337313 | 71.916 |
| 3 | 30 | 0.0784 | 0.0784 | 31.1178 | 10.09 | 0.8 | 20696.06 |  |  | 0.621829 | 7.129732 | 89.92375 | 71.939 |
| 4 | 37 | 0.0691 | 0.0691 | 31.1133 | 11.8 | 1.16 | 20696.06 |  |  | 0.621013 | 6.089407 | 61.943966 | 71.855 |
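
To compare the two backends at a glance, the per-prompt rows can be averaged. The sketch below copies three columns from the two result tables in this notebook and uses only the standard library; the dictionary names are illustrative. GL is only comparable if both runs generate the same number of tokens, which the tables do not confirm, so the means are a starting point rather than a verdict.

```python
from statistics import mean

# Columns copied from the Hugging Face fp16 results table.
hf_fp16 = {
    "GL (s)": [8.9889, 8.1539, 8.1625, 8.1692, 8.2051],
    "TPS (tokens/s)": [15.57, 16.92, 16.66, 16.65, 16.94],
    "Energy per Token (J/token)": [4.131985, 4.009338, 3.92491, 3.428949, 3.805608],
}

# Columns copied from the vLLM results table.
vllm_run = {
    "GL (s)": [31.0805, 22.7282, 31.114, 31.1178, 31.1133],
    "TPS (tokens/s)": [14.22, 18.44, 9.51, 10.09, 11.8],
    "Energy per Token (J/token)": [5.065331, 3.908623, 7.562145, 7.129732, 6.089407],
}

for metric in hf_fp16:
    print(f"{metric}: HF fp16 mean={mean(hf_fp16[metric]):.3f}, "
          f"vLLM mean={mean(vllm_run[metric]):.3f}")
```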


#### Benchmark

Goal: find the best optimization framework (and configuration) for a given combination of:

- **Tasks:** text-to-SQL, summarization, QA, ...
- **Scenarios:** server, batch, single-stream, ...
- **Models:** Llama-3.1-8B, Mixtral, ...
- **Frameworks:** vLLM, TensorRT-LLM, ... (trying different configurations)
- **Hardware:** L4, A100, H100
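
The search space described above can be enumerated as a grid. This is a minimal sketch using only the example values listed here; the open-ended "..." entries are left out, and the dictionary keys are illustrative.

```python
from itertools import product

tasks = ["text-to-SQL", "summarization", "QA"]
scenarios = ["server", "batch", "single-stream"]
models = ["llama3.1-8B", "Mixtral"]
frameworks = ["vLLM", "TensorRT-LLM"]
hardware = ["L4", "A100", "H100"]

# One dictionary per benchmark configuration.
grid = [
    {"task": t, "scenario": s, "model": m, "framework": f, "hardware": h}
    for t, s, m, f, h in product(tasks, scenarios, models, frameworks, hardware)
]

print(f"{len(grid)} configurations")
print(grid[0])
```

Even this small example yields over a hundred runs, so in practice it may help to fix the hardware and scenario first and sweep frameworks within that slice.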

#### Optimization Frameworks for LLM Inference

<small>

| **Framework** | **Quantization Support** | **Model Parallelism** | **Tensor / Kernel Optimizations** | **Memory Efficiency** | **Throughput & Latency** | **Financial / Energy Efficiency** |
|--------------|--------------------------|-----------------------|-----------------------------------|-----------------------|--------------------------|-----------------------------------|
| **vLLM** (UC Berkeley) | FP16/BF16, 8-bit, 4-bit (GPTQ, AWQ), experimental FP8 | Tensor Parallelism, Pipeline Parallelism | Custom **PagedAttention**, fused attention kernels, PyTorch fused ops | **Paged KV cache**, quantized KV cache, prefix reuse | **Continuous batching**, **speculative decoding**, auto-regressive scheduling | 2–4× throughput improvement, high GPU utilization, cost-efficient serving |
| **NVIDIA TensorRT-LLM** | FP16/BF16, FP8 (Hopper), INT8 (SmoothQuant), 4-bit AWQ | Tensor, Pipeline, MoE Parallelism | Custom fused attention kernels, TensorRT graph optimization, kernel fusion | **Paged KV caching**, weight compression, memory-optimized inference engines | **Dynamic batching**, **speculative decoding (ReDrafter)**, automatic cache sharing | FP8/INT8 for lower power, throughput-focused, optimized GPU utilization |
| **DeepSpeed-Inference** | INT8 (MoQ), 4-bit weight quantization | Tensor, Pipeline, **Expert Parallelism (MoE)** | Transformer kernel injection, optimized CUDA fused ops | **ZeRO-Inference** (offload KV cache to CPU/NVMe) | **Massive batch sizes**, multi-GPU scaling, ZeRO latency reduction | INT8 models ~1.7× faster, **CPU memory offloading** reduces GPU costs |
| **Hugging Face TGI** | 8-bit (bitsandbytes), 4-bit (GPTQ, AWQ), FP8 experimental | Tensor Parallelism, Model Sharding | FlashAttention, PagedAttention, fused sampling ops | **Paged KV cache**, weight mapping via Safetensors | **Continuous batching**, **speculative decoding (Medusa)** | High throughput (up to 13× faster on long inputs), flexible multi-platform support |
| **NVIDIA FasterTransformer** | FP16/BF16, INT8, FP8, 4-bit compression | Tensor, Pipeline Parallelism | Fused MLP layers, fused multi-head attention, optimized GEMM | **Persistent KV cache**, optimized weight storage | **Optimized CUDA scheduling**, efficient multi-GPU serving | Low-latency, high-throughput GPU inference, FP16/INT8 for energy efficiency |
| **ONNX Runtime** (HF Optimum) | INT8 post-training quantization, FP16, BF16 | No native multi-GPU, requires manual graph partitioning | Graph optimizations, operator fusion, execution provider optimizations | **Memory-efficient graph storage**, weight pre-packing | **Batching support**, CPU multi-threading | Cost-efficient CPU/GPU inference, flexible hardware deployment |
| **llama.cpp / GGML** | 16-bit, 8-bit, 6-bit, 5-bit, 4-bit, 3-bit, 2-bit | No native multi-GPU, but supports CPU-GPU offloading | **SIMD-optimized** (AVX, NEON), bit-packed matrix multiplication | **Memory-mapped weights**, CPU RAM paging for large models | **Single-request optimized**, multi-threaded CPU processing | Extreme low-cost inference, edge deployment, power-efficient CPU inference |

</small>
