In [6]:
from llama_model_handler import LlamaModelHandler
from IPython.display import Markdown, display

Loading model: meta-llama/Llama-3.1-8b

In [2]:
model_handler = LlamaModelHandler("meta-llama/Llama-3.1-8b")

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


Authentication successful.
Loading model 'meta-llama/Llama-3.1-8b'...


Loading checkpoint shards: 100%|██████████| 4/4 [00:07<00:00,  1.95s/it]


Model loaded on device: cuda:0
GPU: NVIDIA L4
Model dtype: torch.float16


Test prompt:

In [5]:
prompt = "Whats the meaning of life?"
display(Markdown(model_handler.generate_text(prompt=prompt, max_new_tokens=250)))

Whats the meaning of life? That's a question we all ask at some point in our lives. This is something that has been pondered by many great minds throughout history.
But what does it mean to you and me?
We are born, live for 80-90 years or so (depending on where we come from) then die and go back into nature.
What exactly happens after death though...is anyone really sure?
Well here is my take...
The idea behind this theory is pretty simple:
Our souls have always existed since before birth and will continue existing even when we physically pass away.
It could be argued that when one dies their soul goes straight up to heaven to meet with God but I don't believe thats how things work out.
Instead Im convinced that your soul continues living through reincarnation; which means being reborn again somewhere else along time lines other than ours - maybe another planet perhaps even Earth itself!
There may also exist an infinite number of parallel universes containing identical copies yet completely different versions thereof due certain changes made within each individual instance leading them down separate paths until they eventually become two entirely dissimilar entities once more separated by space & matter
In addition there might possibly multiple dimensions beyond those currently known about including ones consisting solely energy instead physical mass

### **Model Performance Benchmarking Metrics**

---

<small>

#### 1. **Latency**

Measures time delays during generation.

- **First-Token Latency (FTL):**  
  Time to generate the **first token**.  
  $$ \text{FTL} = t_{\text{first token}} - t_{\text{start}} $$

- **Average-Token Latency (ATL):**  
  Average time per token after the first one, where $T_{\text{total}}$ is the total generation time (the same quantity as GL below).  
  $$ \text{ATL} = \frac{T_{\text{total}} - \text{FTL}}{N_{\text{tokens}} - 1} $$

- **Generation Latency (GL):**  
  Total time to generate the **full output**.  
  $$ \text{GL} = t_{\text{end}} - t_{\text{start}} $$
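
The three latency definitions can be sketched from per-token timestamps. This is a minimal illustration, not the `ModelBenchmark` implementation: the names `LatencyMetrics` and `latency_metrics` are hypothetical, and it assumes $T_{\text{total}}$ equals GL and that one timestamp is recorded as each token is emitted.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LatencyMetrics:
    ftl: float  # first-token latency, seconds
    atl: float  # average latency per token after the first, seconds/token
    gl: float   # total generation latency, seconds


def latency_metrics(t_start: float, token_times: Sequence[float]) -> LatencyMetrics:
    """Compute FTL, ATL, and GL from a start time and per-token emit times."""
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    ftl = token_times[0] - t_start   # t_first_token - t_start
    gl = token_times[-1] - t_start   # t_end - t_start
    n_tokens = len(token_times)
    # ATL is undefined for a single token; report 0.0 in that case (a choice made here).
    atl = (gl - ftl) / (n_tokens - 1) if n_tokens > 1 else 0.0
    return LatencyMetrics(ftl=ftl, atl=atl, gl=gl)


# Example with made-up timestamps (seconds).
metrics = latency_metrics(10.0, [10.5, 10.75, 11.0, 11.25])
print(metrics)
```

In practice the timestamps would come from `time.perf_counter()` calls around a streaming generation loop.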

---

#### 2. **Throughput**

Measures the output rate of the model.

- **Tokens per Second (TPS):**  
  Number of tokens generated per second.  
  $$ \text{TPS} = \frac{N_{\text{tokens}}}{\text{GL}} $$

- **Sentences per Second (SPS):**  
  Number of sentences generated per second.  
  $$ \text{SPS} = \frac{N_{\text{sentences}}}{\text{GL}} $$
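
A rough sketch of both throughput metrics follows. The helper names are hypothetical, and the sentence counter is a naive punctuation heuristic chosen for illustration; the benchmark in this notebook may count sentences differently.

```python
import re
from typing import Tuple


def count_sentences(text: str) -> int:
    """Count runs of '.', '!' or '?' that are followed by whitespace or end of text."""
    return len(re.findall(r"[.!?]+(?=\s|$)", text))


def throughput(n_tokens: int, n_sentences: int, gl_seconds: float) -> Tuple[float, float]:
    """Return (TPS, SPS) given token and sentence counts and generation latency."""
    if gl_seconds <= 0:
        raise ValueError("generation latency must be positive")
    return n_tokens / gl_seconds, n_sentences / gl_seconds


text = "Paris is the capital. It is in France! Really?"
tps, sps = throughput(n_tokens=128, n_sentences=count_sentences(text), gl_seconds=8.0)
print(f"TPS={tps:.2f} tokens/s, SPS={sps:.3f} sentences/s")
```

Because the heuristic only counts terminated sentences, output that is cut off mid-sentence by a token limit can yield an SPS of zero.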

---

#### 3. **Storage**

Provides insights into memory usage during inference.

- **Model Size:**  
  The total disk space used by the pre-trained model.

- **KV-Cache Size:**  
  Memory used for key-value caching during generation.

- **Memory Usage (Model + KV-Cache):**  
  $$ \text{Memory}_{\text{total}} = \text{Model Memory} + \text{KV-Cache Memory} $$
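
The KV-cache term can be estimated analytically for a standard transformer decoder: each layer stores one key and one value vector per KV head for every cached token. The sketch below uses hypothetical helper names, and the Llama-3.1-8B figures (32 layers, 8 KV heads, head dimension 128) are assumptions that should be checked against the model's config before relying on the result.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_elem: int = 2,  # 2 bytes for fp16/bf16
) -> int:
    """Estimate KV-cache size: keys and values (factor 2) for every layer and cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


def total_memory_bytes(n_params: int, bytes_per_param: int, kv_bytes: int) -> int:
    """Model weights plus KV cache; ignores activations and framework overhead."""
    return n_params * bytes_per_param + kv_bytes


# Assumed configuration values, for illustration only.
per_token = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=1)
print(f"KV cache per token: {per_token / 1024:.0f} KiB")
```

Measured usage (for example from `torch.cuda.max_memory_allocated()`) will typically be higher than this estimate because of activations and allocator overhead.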

---

#### 4. **Energy**

Evaluates energy efficiency during generation.

- **Energy Consumption per Token:**  
  $$ E_{\text{token}} = \frac{E_{\text{total}}}{N_{\text{tokens}}} $$

- **Energy Consumption per Sentence:**  
  $$ E_{\text{sentence}} = \frac{E_{\text{total}}}{N_{\text{sentences}}} $$

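- **Energy Consumption per Second (Average Power):**  
  Energy per unit time is the average power draw, so it is total energy divided by generation time.  
  $$ E_{\text{sec}} = P_{\text{avg}} = \frac{E_{\text{total}}}{t_{\text{generation}}} $$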
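
One way to obtain $E_{\text{total}}$ is to sample GPU power during generation and integrate over time. The sketch below assumes such samples are already available (for instance from NVML) and uses hypothetical function names; it is not necessarily how the benchmark in this notebook measures energy.

```python
from typing import Sequence


def energy_joules(timestamps: Sequence[float], power_watts: Sequence[float]) -> float:
    """Integrate sampled power (W) over time (s) with the trapezoidal rule."""
    if len(timestamps) != len(power_watts) or len(timestamps) < 2:
        raise ValueError("need at least two matching (time, power) samples")
    total = 0.0
    for i in range(1, len(timestamps)):
        dt = timestamps[i] - timestamps[i - 1]
        total += 0.5 * (power_watts[i] + power_watts[i - 1]) * dt
    return total


def joules_to_wh(joules: float) -> float:
    """1 Wh = 3600 J."""
    return joules / 3600.0


# Made-up samples: three power readings taken one second apart.
times = [0.0, 1.0, 2.0]
power = [70.0, 72.0, 70.0]
e_total = energy_joules(times, power)
n_tokens = 71
print(f"E_total = {e_total:.1f} J ({joules_to_wh(e_total):.5f} Wh)")
print(f"E_token = {e_total / n_tokens:.2f} J/token")
print(f"P_avg   = {e_total / (times[-1] - times[0]):.1f} W")
```

The per-sentence figure follows the same pattern, dividing `e_total` by the sentence count instead of the token count.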

---

#### 5. **Quality (Summarization)**

Measures the quality of model-generated text, especially for summarization tasks.

- **ROUGE Score:**  
  Measures the overlap between generated and reference summaries.

- **Perplexity:**  
  Indicates how well the model predicts a sequence. Lower is better.  
  $$ \text{Perplexity} = e^{\text{Cross-Entropy Loss}} $$
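
Both quality metrics can be illustrated with a few lines of standard-library Python. This is a simplified sketch with hypothetical function names: the ROUGE function computes only a whitespace-tokenized unigram (ROUGE-1) F1 score, without the stemming or other variants a full ROUGE package provides, and the perplexity function assumes per-token natural-log probabilities are already available.

```python
import math
from collections import Counter
from typing import Sequence


def rouge1_f1(generated: str, reference: str) -> float:
    """Unigram-overlap F1 between a generated and a reference summary."""
    gen_tokens = generated.lower().split()
    ref_tokens = reference.lower().split()
    if not gen_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((Counter(gen_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood (the cross-entropy loss in nats)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    cross_entropy = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(cross_entropy)


print(rouge1_f1("the cat sat", "the cat sat on the mat"))
print(perplexity([math.log(0.25)] * 4))
```

A model that assigns probability 0.25 to every token has a perplexity of about 4, which matches the intuition of choosing uniformly among four options.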

---

#### **Summary of Key Metrics**

| Metric                   | Unit             | Formula/Definition                                  |
|--------------------------|-------------------|-----------------------------------------------------|
| First-Token Latency      | seconds (s)       | $$ \text{FTL} $$                                    |
| Average-Token Latency    | seconds/token     | $$ \text{ATL} $$                                    |
| Generation Latency       | seconds (s)       | $$ \text{GL} $$                                     |
| Tokens per Second (TPS)  | tokens/second     | $$ \frac{N_{\text{tokens}}}{\text{GL}} $$            |
| Sentences per Second     | sentences/second  | $$ \frac{N_{\text{sentences}}}{\text{GL}} $$         |
| Memory Usage             | MB/GB             | $$ \text{Model Memory} + \text{KV-Cache Memory} $$   |
| Energy per Token         | Joules/token      | $$ \frac{E_{\text{total}}}{N_{\text{tokens}}} $$     |
| Energy per Sentence      | Joules/sentence   | $$ \frac{E_{\text{total}}}{N_{\text{sentences}}} $$  |
| Energy per Second        | Watts (W)         | $$ \frac{E_{\text{total}}}{t_{\text{generation}}} $$ |
| Perplexity               | -                 | $$ e^{\text{Cross-Entropy Loss}} $$                  |

</small>

#### Test Benchmark

In [1]:
from benchmark import ModelBenchmark
from llama_model_handler import LlamaModelHandler

In [2]:
# Load model and tokenizer
model_handler = LlamaModelHandler("meta-llama/Llama-3.1-8b", precision="fp16")
model, tokenizer = model_handler.get_model_and_tokenizer()

# Initialize benchmark
benchmark = ModelBenchmark(model=model, tokenizer=tokenizer, max_tokens=128)

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


Authentication successful.
Loading model 'meta-llama/Llama-3.1-8b' with precision 'fp16'...


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Model loaded on device: cuda:0
GPU: NVIDIA L4
Model dtype: torch.float16
Model loading time: 9.6394 seconds


In [None]:
# Run benchmark
test_prompts = [
    "Explain the significance of transformer models in NLP.",
    "What are the main benefits of renewable energy?",
    "How does the immune system work?",
    "What is the capital of France?",
    "What is the best way to cook a steak?"
]

benchmark_results = benchmark.benchmark(test_prompts)

In [4]:
benchmark_results

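|   | Prompt Length | FTL (s) | ATL (s) | GL (s) | TPS (tokens/s) | SPS (sentences/s) | Memory Usage (MB) | Total Energy Consumption (Wh) |
|---|---------------|---------|---------|--------|----------------|-------------------|-------------------|-------------------------------|
| 0 | 54            | 0.0667  | 0.0667  | 9.3312 | 17.09          | 0.49              | 16190.06          | 0.330721                      |
| 1 | 47            | 0.0598  | 0.0598  | 8.1296 | 16.82          | 0.98              | 16190.06          | 0.32664                       |
| 2 | 32            | 0.0603  | 0.0603  | 8.2001 | 16.58          | 1.34              | 16190.06          | 0.326235                      |
| 3 | 30            | 0.0604  | 0.0604  | 8.2086 | 16.56          | 0.85              | 16190.06          | 0.326559                      |
| 4 | 37            | 0.0591  | 0.0591  | 8.2188 | 16.91          | 0.0               | 16190.06          | 0.327589                      |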
