## Metrics & formulas for batched LLM inference  

### Notation

| symbol | description |
|--------|-------------|
| $N$ | batch size (e.g. 100 prompts) |
| $i$ | index of a sample in the batch |
| $T_{\text{batch}}$ | wall-clock time to generate the **whole** batch (measured once) |
| $\text{tok}_i$ | number of generated **tokens** in sample $i$ |
| $\displaystyle\Sigma_{\text{tok}}=\sum_{j=1}^{N}\text{tok}_j$ | total tokens in the batch |
| $\text{sent}_i$ | number of **sentences** in sample $i$ |
| $\displaystyle\Sigma_{\text{sent}}=\sum_{j=1}^{N}\text{sent}_j$ | total sentences in the batch |

---

### 1&nbsp;· Average-token latency (ATL)

$$
\operatorname{ATL}_{\text{batch}}
    =\frac{T_{\text{batch}}}{\Sigma_{\text{tok}}}
    \quad\text{[seconds / token]}
$$

---

### 2&nbsp;· Per-sample generation latency (GL)

$$
\operatorname{GL}_i
    =\operatorname{ATL}_{\text{batch}}\;\text{tok}_i
    =\frac{\text{tok}_i}{\Sigma_{\text{tok}}}\;T_{\text{batch}}
    \quad\text{[seconds / sample]}
$$

Check:  $\displaystyle\sum_{i=1}^{N}\operatorname{GL}_i=T_{\text{batch}}$.
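
As a rough illustration of the two definitions above, the sketch below uses a hypothetical batch of three samples. The timing and token counts are made up, and the variable names are not from the original pipeline. It shows that $\operatorname{GL}_i$ is simply each sample's token share of the single measured batch time.

```python
# Hypothetical batch of three samples; all numbers are invented for illustration.
t_batch = 2.0          # wall-clock seconds for the whole batch (measured once)
tok = [3, 2, 5]        # generated tokens per sample

sigma_tok = sum(tok)                  # total tokens in the batch
atl_batch = t_batch / sigma_tok       # seconds per token
gl = [atl_batch * t for t in tok]     # each sample's share of the batch time

print(atl_batch)       # 0.2
print(gl)
print(sum(gl))         # matches t_batch up to floating-point rounding
```

Because the shares are proportional to `tok`, a sample with more generated tokens is attributed more of the batch time, even though all samples were generated concurrently.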

---

### 3 · Tokens per second (TPS)

Batch-level throughput:

$$
\text{TPS}_{\text{batch}}
  = \frac{1}{\text{ATL}_{\text{batch}}}
  = \frac{\Sigma_{\text{tok}}}{T_{\text{batch}}}
$$

*(units: tokens s⁻¹)*

---

### 4 · Sentences per second (SPS)

**Per-sample (row-level)**  

$$
\text{SPS}_i
  = \frac{\text{sent}_i}{\text{GL}_i}
  = \frac{\text{sent}_i}
         {\text{ATL}_{\text{batch}}\;\text{tok}_i}
$$

**Batch-level (one value per batch)**  

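$$
\text{SPS}_{\text{batch}}
  = \frac{\Sigma_{\text{sent}}}{T_{\text{batch}}}
  = \sum_{i=1}^{N} \frac{\text{tok}_i}{\Sigma_{\text{tok}}}\;\text{SPS}_i
$$

The batch value is the token-weighted mean of the per-sample values, because $\frac{\text{tok}_i}{\Sigma_{\text{tok}}}\,\text{SPS}_i = \frac{\text{sent}_i}{T_{\text{batch}}}$. The unweighted mean $\frac{1}{N}\sum_i \text{SPS}_i$ generally differs from it; the two coincide, for example, when every sample has the same token count.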

---

### 5&nbsp;· Consistency checks (should hold after the fix)

$$
\sum_{i=1}^{N}\operatorname{GL}_i = T_{\text{batch}},\qquad
\sum_{i=1}^{N}\frac{\text{tok}_i}{\Sigma_{\text{tok}}}\operatorname{SPS}_i = \operatorname{SPS}_{\text{batch}},\qquad
\operatorname{TPS}_{\text{batch}}\;T_{\text{batch}} = \Sigma_{\text{tok}}
$$
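
The checks can be tried numerically before running them on real data. The sketch below uses made-up token and sentence counts for a three-sample batch (none of these numbers come from the experiments). It recombines the per-sample SPS values with token-share weights and prints the plain $1/N$ average next to the batch value for comparison.

```python
# Hypothetical three-sample batch; counts and timing are invented for illustration.
t_batch = 2.0
tok = [3, 2, 5]
sent = [1, 1, 2]

sigma_tok, sigma_sent = sum(tok), sum(sent)
atl = t_batch / sigma_tok
gl = [atl * t for t in tok]                     # per-sample latency share
tps_batch = sigma_tok / t_batch
sps = [s / g for s, g in zip(sent, gl)]         # per-sample SPS
sps_batch = sigma_sent / t_batch

def close(a, b, tol=1e-9):
    """Floating-point comparison with an absolute tolerance."""
    return abs(a - b) <= tol

weighted_mean = sum(t / sigma_tok * s for t, s in zip(tok, sps))
plain_mean = sum(sps) / len(sps)

print("sum(GL) == T_batch      :", close(sum(gl), t_batch))
print("TPS * T_batch == tokens :", close(tps_batch * t_batch, sigma_tok))
print("token-weighted SPS mean :", round(weighted_mean, 6), "vs batch", sps_batch)
print("plain 1/N SPS mean      :", round(plain_mean, 6), "vs batch", sps_batch)
```

With these counts the token-weighted mean reproduces the batch value of 2.0 sentences per second, while the plain average comes out slightly higher because short samples get the same weight as long ones.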


In [31]:
import re
import pandas as pd
import numpy as np

data = pd.read_csv("/home/ubuntu/fast_llm_inference/results/experiment_1/vllm_gemma-2-2b-it_qa.csv")

In [None]:
import re
import numpy as np
import pandas as pd

# ─── lightweight token / sentence counters ────────────────────────────
_tok_re  = re.compile(r"\S+")
_sent_re = re.compile(r"[.!?…]+")

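def _tok_cnt(text: str) -> int:
    # whitespace-delimited words, not model tokens; non-string cells (e.g. NaN) count as 0
    return len(_tok_re.findall(text)) if isinstance(text, str) else 0

def _sent_cnt(text: str) -> int:
    # runs of terminal punctuation; at least 1 so a punctuation-free answer is one sentence
    return max(len(_sent_re.findall(text)), 1) if isinstance(text, str) else 1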

# ─── main fixer ────────────────────────────────────────────────────────
def fix_metrics(df: pd.DataFrame, batch_size: int = 100) -> pd.DataFrame:
    """
    Recompute latency/throughput and energy-per-unit columns on a copy of `df`.

    Assumes rows are stored in batch order, `batch_size` rows per batch.
    Expects:
      - 'generated_answer'    : str
      - 'GL'                  : float (batch latency duplicated on every row)
      - 'Total Energy (Wh)'   : float (duplicated per row)
      - 'Energy per Token (J/token)'       : float (will be replaced)
      - 'Energy per Sentence (J/sentence)' : float (will be replaced)
    Returns a new DataFrame (the input is left untouched) with corrected:
      ATL, GL, TPS, SPS,
      Energy per Token (J/token),
      Energy per Sentence (J/sentence)
    """
    df = df.copy()

    # 1) derive token & sentence counts if missing -----------------------
    if "num_tokens" not in df.columns:
        df["num_tokens"] = df["generated_answer"].map(_tok_cnt)
    if "num_sentences" not in df.columns:
        df["num_sentences"] = df["generated_answer"].map(_sent_cnt)

    # 2) tag rows by batch (positional, so a non-default index is fine) ---
    df["batch_id"] = np.arange(len(df)) // batch_size

    # 3) compute per-batch scalars ---------------------------------------
    df["batch_time_s"]      = df.groupby("batch_id")["GL"].transform("first")
    df["batch_tokens"]      = df.groupby("batch_id")["num_tokens"].transform("sum")
    df["batch_sentences"]   = df.groupby("batch_id")["num_sentences"].transform("sum")
    batch_energy_wh         = df.groupby("batch_id")["Total Energy (Wh)"].transform("first")

    # 4) correct latency & throughput ------------------------------------
    df["ATL"] = df["batch_time_s"] / df["batch_tokens"]      # seconds / token
    df["GL"]  = df["ATL"] * df["num_tokens"]                 # seconds / sample
    df["TPS"] = 1.0 / df["ATL"]                              # tokens / second
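    # zero-token answers have GL == 0; report NaN instead of inf for their SPS
    df["SPS"] = (df["num_sentences"] / df["GL"]).replace([np.inf, -np.inf], np.nan)  # sentences / second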

    # 5) override energy-per-unit columns in-place -----------------------
    # convert Wh → J by multiplying by 3600
    df["Energy per Token (J/token)"]     = batch_energy_wh * 3600 / df["batch_tokens"]
    df["Energy per Sentence (J/sentence)"] = batch_energy_wh * 3600 / df["batch_sentences"]

    # 6) drop internal helpers -------------------------------------------
    df.drop(
        columns=["batch_id", "batch_time_s", "batch_tokens", "batch_sentences",
                 "num_tokens", "num_sentences"],
        inplace=True,
    )

    return df
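
The counters in this cell are whitespace and punctuation heuristics rather than the model's tokenizer, so their counts are approximations. The sketch below repeats the same two patterns under local names so it runs on its own. The first two strings are answers visible in the results table, and the last two are hypothetical inputs chosen to show where the sentence heuristic over-counts.

```python
import re

# Same patterns as the counters in this notebook, repeated so the snippet is standalone.
tok_re = re.compile(r"\S+")
sent_re = re.compile(r"[.!?…]+")

def tok_cnt(text: str) -> int:
    return len(tok_re.findall(text))

def sent_cnt(text: str) -> int:
    return max(len(sent_re.findall(text)), 1)

samples = [
    "Lothar de Maizière",     # 3 words, no punctuation -> 1 sentence (floor of 1)
    "1705",                   # 1 word, 1 sentence
    "Dr. Smith went home.",   # hypothetical: the abbreviation counts as a sentence end
    "Wait... what?!",         # hypothetical: each punctuation run counts once
]
for s in samples:
    print(repr(s), tok_cnt(s), sent_cnt(s))
```

If exact token counts matter, one option is to pass a `num_tokens` column produced by the serving engine's tokenizer, since `fix_metrics` only derives the counts when that column is missing.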

In [46]:
fixed_data = fix_metrics(data, batch_size=100)

fixed_data

Unnamed: 0,prompt_length,prompt,generated_answer,reference_answer,TTFT,ATL,GL,TPS,SPS,Avg GPU Mem (MB),...,Avg GPU Util (%),Total Energy (Wh),Avg Power (W),Energy per Token (J/token),Energy per Sentence (J/sentence),Memory Usage (MB),Model Size (MB),Overhead (MB),exact_match,F1_score
0,1275,### SYSTEM\nYou are a question-answering assis...,Lothar de Maizière,"['Lothar de Maizière', 'Lothar de Maizière', '...",0.0283,0.009702,0.029107,103.069379,34.356460,21351.79,...,79.73,0.034697,56.71,0.550261,1.249092,21353.88,5007.298138,16346.581862,1,1.000000
1,753,### SYSTEM\nYou are a question-answering assis...,Complexity classes,"['complexity classes', 'complexity classes', '...",0.0283,0.009702,0.019404,103.069379,51.534689,21351.79,...,79.73,0.034697,56.71,0.550261,1.249092,21353.88,5007.298138,16346.581862,1,1.000000
2,1197,### SYSTEM\nYou are a question-answering assis...,GTE,['Telenet was incorporated in 1973 and started...,0.0283,0.009702,0.009702,103.069379,103.069379,21351.79,...,79.73,0.034697,56.71,0.550261,1.249092,21353.88,5007.298138,16346.581862,1,1.000000
3,1541,### SYSTEM\nYou are a question-answering assis...,Water flow,"['water flow through the body cavity', ""κτείς ...",0.0283,0.009702,0.019404,103.069379,51.534689,21351.79,...,79.73,0.034697,56.71,0.550261,1.249092,21353.88,5007.298138,16346.581862,0,0.571429
4,1747,### SYSTEM\nYou are a question-answering assis...,1705,"['12 May 1705', '1705', '12 May 1705']",0.0283,0.009702,0.009702,103.069379,103.069379,21351.79,...,79.73,0.034697,56.71,0.550261,1.249092,21353.88,5007.298138,16346.581862,1,1.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,1157,### SYSTEM\nYou are a question-answering assis...,1600,"['1,600 miles', '1,600', '1,600']",0.0276,0.009120,0.009120,109.650142,109.650142,21353.88,...,90.45,0.040963,68.52,0.624859,1.474668,21353.88,5007.298138,16346.581862,1,1.000000
496,2207,### SYSTEM\nYou are a question-answering assis...,Krasiński Palace Garden,"['Krasiński Palace Garden', 'Krasiński Palace ...",0.0276,0.009120,0.027360,109.650142,36.550047,21353.88,...,90.45,0.040963,68.52,0.624859,1.474668,21353.88,5007.298138,16346.581862,1,1.000000
497,1115,### SYSTEM\nYou are a question-answering assis...,214,"['489', '489', '489']",0.0276,0.009120,0.009120,109.650142,109.650142,21353.88,...,90.45,0.040963,68.52,0.624859,1.474668,21353.88,5007.298138,16346.581862,0,0.000000
498,1830,### SYSTEM\nYou are a question-answering assis...,Mnemiopsis leidyi,"['ctenophore Mnemiopsis leidyi', 'Mnemiopsis l...",0.0276,0.009120,0.018240,109.650142,54.825071,21353.88,...,90.45,0.040963,68.52,0.624859,1.474668,21353.88,5007.298138,16346.581862,1,1.000000
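
As a back-of-the-envelope check on the energy columns, the sketch below takes the three energy values printed for row 0 and inverts the Wh → J conversion from step 5 of `fix_metrics`. The implied token and sentence totals are inferences from rounded display values, not numbers reported by the notebook, so treat them as approximate.

```python
# Values copied from row 0 of the output above (first batch); display-rounded.
total_energy_wh = 0.034697
energy_per_token_j = 0.550261
energy_per_sentence_j = 1.249092

total_energy_j = total_energy_wh * 3600                      # 1 Wh = 3600 J
implied_tokens = total_energy_j / energy_per_token_j         # ~ batch token total
implied_sentences = total_energy_j / energy_per_sentence_j   # ~ batch sentence total

print(round(total_energy_j, 4))      # 124.9092
print(round(implied_tokens))         # 227
print(round(implied_sentences))      # 100
```

An implied total of about 100 sentences is what one would expect for a batch of 100 short answers that each hit the one-sentence floor of the counter.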


In [17]:
import pandas as pd

data = pd.read_csv("/home/ubuntu/fast_llm_inference/results/llama.cpp_gemma-2-2b-it-fp16.gguf_sql_1QPS_120s_server.csv")

In [18]:
data.columns

Index(['prompt_length', 'prompt', 'generated_answer', 'reference_answer',
       'queue_size', 'batch_size', 'wait_time', 'response_time',
       'scheduled_ts', 'start_ts', 'GL', 'ATL', 'TTFT', 'TPS', 'SPS',
       'Avg GPU Mem (MB)', 'Peak GPU Mem (MB)', 'Avg GPU Util (%)',
       'Peak GPU Util (%)', 'Total Energy (Wh)', 'Avg Power (W)',
       'Peak Power (W)', 'Energy per Token (J/token)',
       'Energy per Sentence (J/sentence)', 'Memory Usage (MB)',
       'Model Size (MB)', 'Overhead (MB)'],
      dtype='object')

In [19]:
data

Unnamed: 0,prompt_length,prompt,generated_answer,reference_answer,queue_size,batch_size,wait_time,response_time,scheduled_ts,start_ts,...,Avg GPU Util (%),Peak GPU Util (%),Total Energy (Wh),Avg Power (W),Peak Power (W),Energy per Token (J/token),Energy per Sentence (J/sentence),Memory Usage (MB),Model Size (MB),Overhead (MB)
0,984,### SYSTEM\nYou are a SQL query generation ass...,SELECT AirportCode FROM airports ORDER BY COUN...,SELECT T1.AirportCode FROM AIRPORTS AS T1 JOIN...,1,1,0.0007,0.4945,1.3042,1.3048,...,40.00,86,0.005681,41.41,47.83,2.045035,20.450349,7005.88,4992.689056,2013.190944
1,952,### SYSTEM\nYou are a SQL query generation ass...,SELECT COUNT(*) FROM Student AS s JOIN Has_Pet...,SELECT count(*) FROM student AS T1 JOIN has_pe...,1,1,0.0165,1.5940,2.2943,2.3107,...,71.57,87,0.025997,59.33,73.20,2.836066,18.718035,7005.88,4992.689056,2013.190944
2,1984,### SYSTEM\nYou are a SQL query generation ass...,```sql,SELECT avg(transcript_date) FROM Transcripts,2,2,1.3742,2.6147,2.5643,3.9385,...,78.10,91,0.023225,67.40,73.20,83.609322,83.609322,7009.88,4992.689056,2017.190944
3,1084,### SYSTEM\nYou are a SQL query generation ass...,```sql,SELECT name FROM battle WHERE bulgarian_comman...,2,2,0.7772,2.0177,3.1613,3.9385,...,78.10,91,0.023225,67.40,73.20,83.609322,83.609322,7009.88,4992.689056,2017.190944
4,1385,### SYSTEM\nYou are a SQL query generation ass...,```sql,"SELECT first_name , last_name FROM players WH...",1,1,0.0426,1.5931,6.9972,7.0398,...,64.65,85,0.026734,62.07,72.98,96.243709,96.243709,7009.88,4992.689056,2017.190944
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98,934,### SYSTEM\nYou are a SQL query generation ass...,"SELECT PetType, AVG(weight) FROM Pets GROUP BY...","SELECT avg(weight) , pettype FROM pets GROUP ...",1,1,0.0002,0.4529,108.6052,108.6055,...,0.00,0,0.004840,38.49,40.44,2.178195,17.425557,7011.88,4992.689056,2019.190944
99,1198,### SYSTEM\nYou are a SQL query generation ass...,```sql,SELECT T1.series_name FROM TV_Channel AS T1 JO...,1,1,0.0367,1.3120,113.5299,113.5666,...,67.91,86,0.020092,56.72,68.47,72.331712,72.331712,7011.88,4992.689056,2019.190944
100,1615,### SYSTEM\nYou are a SQL query generation ass...,```sql,"SELECT T1.date_of_treatment , T2.first_name F...",1,1,0.8646,1.9765,114.1040,114.9686,...,74.30,88,0.021860,70.77,71.83,78.697108,78.697108,7011.88,4992.689056,2019.190944
101,1190,### SYSTEM\nYou are a SQL query generation ass...,```sql,"select t1.id , t1.maker from car_makers as t1...",1,1,0.0169,2.2753,116.6142,116.6311,...,70.71,90,0.039308,62.66,73.00,141.509054,141.509054,7011.88,4992.689056,2019.190944
