<a href="https://colab.research.google.com/github/vin136/llm-infer/blob/main/bench_llm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
import numpy as np

In [None]:
import os

def get_csv_files_in_folders(base_folder="."):
    csv_files = []
    # Walk through all directories and subdirectories
    for root, dirs, files in os.walk(base_folder):
        for file in files:
            # Check if the file has a .csv extension
            if file.lower().endswith(".csv"):
                # Construct the full path to the CSV file
                csv_files.append(os.path.join(root, file))

    return csv_files

In [None]:
files = get_csv_files_in_folders()

In [None]:
files

['./vllm/bench-vllm.csv',
 './mlc/bench-mlc-mistral.csv',
 './mlc/bench-mlc-llama2.csv']

In [None]:
dfs = [pd.read_csv(csv_file) for csv_file in files]
combined_df = pd.concat(dfs, ignore_index=True)

In [None]:
#let's compare
combined_df[['inference_engine','model','time']]

Unnamed: 0,inference_engine,model,time
0,vllm,mistralai/Mistral-7B-Instruct-v0.2,3.494984
1,vllm,mistralai/Mistral-7B-Instruct-v0.2,3.494304
2,vllm,mistralai/Mistral-7B-Instruct-v0.2,3.499816
3,vllm,mistralai/Mistral-7B-Instruct-v0.2,3.495971
4,vllm,mistralai/Mistral-7B-Instruct-v0.2,3.496787
5,vllm,mistralai/Mistral-7B-Instruct-v0.2,0.315968
6,vllm,mistralai/Mistral-7B-Instruct-v0.2,2.306152
7,vllm,mistralai/Mistral-7B-Instruct-v0.2,2.663163
8,mlc,Mistral-7B-Instruct-v0.2,1.580195
9,mlc,Mistral-7B-Instruct-v0.2,1.480922


In [None]:
#let's compare vllm vs mlc for mistral
vsel = (combined_df['inference_engine'] == 'vllm') & (combined_df['model'].str.startswith('mistral'))
df_vllm_mst = combined_df[vsel]


In [None]:
msel = (combined_df['inference_engine'] == 'mlc') & (combined_df['model'].str.startswith('Mistral'))
df_mlc_mst = combined_df[msel]


In [None]:
df_mlc_mst = df_mlc_mst.assign(word_cnt = lambda df: df['answer'].str.replace('\n','').str.split().str.len())


In [None]:
df_vllm_mst = df_vllm_mst.assign(word_cnt = lambda df: df['answer'].str.replace('\n','').str.split().str.len())


In [None]:
# time (in ms) to generate 10 words: seconds per word * 10 words * 1000 ms/s
(df_vllm_mst['time'].sum()/df_vllm_mst['word_cnt'].sum())*10*1000

345.4802163957148

In [None]:
(df_mlc_mst['time'].sum()/df_mlc_mst['word_cnt'].sum())*10*1000

156.51549055199862

## First principles: inference time

Compute: Time spent on the GPU doing actual floating-point operations (FLOPs) (comment: `mostly fixed unless you change the model`)

Memory: Time spent transferring tensors within the GPU

Overhead: Everything else
 - time spent in the Python interpreter
 - PyTorch framework code
 - launching CUDA kernels

 PyTorch spends a lot of time figuring out what to do (e.g. for `a+b`: look up `__add__`, dispatch, determine attributes like dtype and device, then launch a kernel).

 This cost doesn't `scale with problem size` (it is a fixed cost).

 How to tell: if you double the batch size and the run time increases by only about 10%, you are overhead-bound. (GPU-Util in nvidia-smi is another indicator.)
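The batch-doubling check above can be sketched as a small timing helper. This is only an illustration: `best_time`, `batch_scaling_ratio`, `forward`, and `make_batch` are hypothetical names, and the workload is a NumPy matmul on the CPU standing in for a model forward pass. On a GPU the timed call would also need a device synchronization before the clock is stopped, because kernels launch asynchronously.

```python
import time
import numpy as np

def best_time(fn, arg, repeats=5):
    """Best wall-clock time in seconds over a few repeats."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        best = min(best, time.perf_counter() - start)
    return best

def batch_scaling_ratio(fn, make_batch, batch_size):
    """Return time(2 * batch_size) / time(batch_size) for fn."""
    t_small = best_time(fn, make_batch(batch_size))
    t_large = best_time(fn, make_batch(2 * batch_size))
    return t_large / t_small

# Hypothetical stand-in workload: one matmul per "forward pass".
weights = np.random.rand(256, 256).astype(np.float32)

def forward(batch):
    return batch @ weights

def make_batch(n):
    return np.random.rand(n, 256).astype(np.float32)

ratio = batch_scaling_ratio(forward, make_batch, batch_size=64)
print(f"time(2B) / time(B) = {ratio:.2f}")
```

A ratio close to 1 suggests a fixed cost dominates (overhead-bound); a ratio close to 2 suggests the work itself dominates (compute- or bandwidth-bound).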

**Recipe for faster inference**

Identify the bottleneck, and then if you are:

 Overhead-bound: operator fusion plus some form of compilation (fixing the sequence of operations ahead of time).

 Bandwidth-bound: operator fusion.

 Compute-bound: more tensor cores (expensive).





## How good is my inference speed?



[A100](https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-nvidia-us-2188504-web.pdf), with 40-80 GB of HBM (high-bandwidth memory).

| BFLOAT16 throughput | Memory bandwidth |
|----------|----------|
| 312 TFLOPS | 1935 GB/s |


[RTX 4090](https://www.notebookcheck.net/NVIDIA-GeForce-RTX-4090-GPU-Benchmarks-and-Specs.674574.0.html), with 24 GB of GDDR6X memory (not HBM).


| BFLOAT16 throughput | Memory bandwidth |
|----------|----------|
| 82.6 TFLOPS | 1008 GB/s |



**Rough mental model for transformer computations**

Almost all of the computation is matrix multiplication (activations etc. are minuscule by comparison). We are only concerned with inference here.

`setup`: Given a list of tokens, generate the next `t` tokens autoregressively.


- **Prefill**: Compute the keys and values (kv) for all the given input tokens and store them. This is the KV cache.

- **Sampling (autoregressive)**: Generate one token at a time. We typically keep the `kv` cache in memory during sampling (trading memory for FLOPs).

**Core op/matmul**


$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

Q, K, V are matrices with dimensions $N \times d$, where $N$ is the sequence length and $d$ is the per-head dimension.

For [mistral](https://arxiv.org/pdf/2310.06825.pdf), $N=4096$ and $d=128$.


Mental model for the computation:

1. Load $Q$ and $K$ from HBM, compute $S = QK^T$, write $S$ to HBM.

2. Read $S$ from HBM, compute $P = \text{softmax}(S)$, write $P$ to HBM.

3. Load $P$ and $V$ from HBM, compute $O = PV$, write $O$ to HBM, return $O$.
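As a rough illustration, the three steps map onto NumPy as follows. This sketch only mirrors the arithmetic. It does not model HBM traffic, and it uses a tiny $N$ and $d$ so it runs instantly. The function name `naive_attention` is hypothetical, and the $1/\sqrt{d_k}$ scaling from the formula above is folded into step 1.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Unfused attention: S and P are fully materialized, as in the 3 steps."""
    d = Q.shape[-1]
    # Step 1: S = Q K^T, scaled by 1/sqrt(d) as in the attention formula.
    S = (Q @ K.T) / np.sqrt(d)
    # Step 2: P = softmax(S) per row; subtract the row max for stability.
    S_shifted = S - S.max(axis=-1, keepdims=True)
    P = np.exp(S_shifted)
    P = P / P.sum(axis=-1, keepdims=True)
    # Step 3: O = P V.
    O = P @ V
    return P, O

rng = np.random.default_rng(0)
N, d = 8, 4  # toy sizes; the text uses N=4096, d=128
Q = rng.normal(size=(N, d))
K = rng.normal(size=(N, d))
V = rng.normal(size=(N, d))

P, O = naive_attention(Q, K, V)
print(P.shape, O.shape)
```

The intermediate `P` has shape `(N, N)` while the output `O` is only `(N, d)`, which is why writing and re-reading the intermediates dominates memory traffic when `N` is much larger than `d`.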

---
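$S$ and $P$ have dimensions $(4096, 4096)$ and $O$ has $(4096, 128)$. For each step we can count the memory traffic and the FLOPs. The three steps move $4N^2 + 4Nd$ elements in total; at 2 bytes per element (16-bit values) that gives:

Total_memory_movement (bytes) = $8N^2 + 8Nd$

Total_flops = $4N^2 d + 3N^2$

ratio (compute / memory_movement) for mistral ≈

62 ops/byte.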
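The per-step counting can be checked with a few lines of arithmetic. This is a sketch under the same assumptions as the text: 2 bytes per element, a matmul of shapes `(N, d) x (d, N)` costing `2*N*N*d` FLOPs, and roughly `3*N*N` FLOPs for the softmax. The function name `attention_intensity` is hypothetical.

```python
def attention_intensity(n, d, bytes_per_elem=2):
    """Return (flops, bytes_moved, flops_per_byte) for unfused attention."""
    # Elements read + written in each of the three steps.
    step1 = 2 * n * d + n * n      # load Q, K; write S
    step2 = n * n + n * n          # read S; write P
    step3 = n * n + n * d + n * d  # load P, V; write O
    bytes_moved = bytes_per_elem * (step1 + step2 + step3)

    # FLOPs: two matmuls (QK^T and PV) plus a rough softmax cost.
    flops = 2 * n * n * d + 3 * n * n + 2 * n * n * d
    return flops, bytes_moved, flops / bytes_moved

flops, bytes_moved, ratio = attention_intensity(4096, 128)
print(f"{flops:.3e} FLOPs, {bytes_moved:.3e} bytes, {ratio:.1f} ops/byte")
```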



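**Inference bottleneck**

A100 (ratio of peak ops to memory bandwidth)

= 312 TFLOPS / 1935 GB/s

≈ 160 ops/byte


RTX 4090 ≈ 82 ops/byte.


Attention at about 62 ops/byte sits below both ratios, so compute is not the limit on either card. Most likely the 4090 suffices; we won't really improve inference speed by moving to the A100 for its extra FLOPs.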
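The comparison can be written out as a short calculation. The peak numbers are the ones from the tables above, and the 62 ops/byte figure is the attention estimate from the previous section. The helper name `ops_per_byte` and the dictionary layout are just for illustration.

```python
# Peak throughput and bandwidth as quoted in the tables above.
GPUS = {
    "A100": {"tflops": 312.0, "bandwidth_gb_s": 1935.0},
    "RTX 4090": {"tflops": 82.6, "bandwidth_gb_s": 1008.0},
}

# Arithmetic intensity of unfused attention estimated earlier (ops/byte).
ATTENTION_OPS_PER_BYTE = 62.0

def ops_per_byte(tflops, bandwidth_gb_s):
    """Ops the GPU can execute per byte it can move from memory."""
    return (tflops * 1e12) / (bandwidth_gb_s * 1e9)

for name, spec in GPUS.items():
    balance = ops_per_byte(**spec)
    # Below the GPU's ratio, memory traffic is the limit; above it, compute is.
    bound = "memory-bound" if ATTENTION_OPS_PER_BYTE < balance else "compute-bound"
    print(f"{name}: {balance:.0f} ops/byte -> attention is {bound}")
```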

Theoretical best (since we are memory-bound):

on A100 => total bytes moved (the model weights) / memory bandwidth

=> (2 × 7B) bytes / 1935 GB/s ≈ 7.2 milliseconds/token.

Assuming 2 tokens/word: about 15 milliseconds/word. On the RTX 4090 the same calculation gives about 13.9 milliseconds/token, i.e. about 28 milliseconds/word.
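The same bound as a small calculation, assuming every generated token requires reading all the weights once, 7B parameters at 2 bytes each, and the bandwidth figures from the tables above. The helper name `ms_per_token` is hypothetical; the 2 tokens/word factor is the assumption stated in the text.

```python
TOKENS_PER_WORD = 2  # assumption used in the text

def ms_per_token(n_params, bytes_per_param, bandwidth_gb_s):
    """Lower bound on decode latency when weight reads dominate."""
    bytes_moved = n_params * bytes_per_param
    seconds = bytes_moved / (bandwidth_gb_s * 1e9)
    return seconds * 1e3

a100_ms = ms_per_token(7e9, 2, 1935.0)
rtx4090_ms = ms_per_token(7e9, 2, 1008.0)

for name, ms in [("A100", a100_ms), ("RTX 4090", rtx4090_ms)]:
    print(f"{name}: {ms:.1f} ms/token, {ms * TOKENS_PER_WORD:.1f} ms/word")
```

Halving `bytes_per_param` (for example with 8-bit weights) halves this bound, which is why lower precision helps a memory-bound decoder.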

**I'm already likely near the absolute best speed of inference**

But is that it? Well, we can still improve compute utilization.


1. We can batch multiple requests together => this makes better use of both the GPU's FLOPs and its memory bandwidth.

For instance, on a 24 GB card we can batch about 4 requests together (after accounting for the KV cache size).

`Note`: For our case (chat), batching may not be a good idea: there is idle time while waiting to fill the batch, plus the complexity of **dynamic batching**.
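A back-of-the-envelope sketch of how the KV cache limits batch size. The layer and head counts below are assumptions taken from the published Mistral 7B configuration (32 layers, 32 query heads, 8 KV heads, head dimension 128). The estimate ignores activations and allocator overhead, and the function names are hypothetical.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of K and V stored per token across all layers."""
    # Factor 2: one key vector and one value vector per layer per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def max_batch(gpu_bytes, weight_bytes, seq_len, per_token_bytes):
    """How many full-length sequences fit in the memory left after weights."""
    free_bytes = gpu_bytes - weight_bytes
    return int(free_bytes // (seq_len * per_token_bytes))

GPU_BYTES = 24e9       # 24 GB card
WEIGHT_BYTES = 2 * 7e9  # 7B parameters at 2 bytes each
SEQ_LEN = 4096

# Assumed Mistral 7B shapes: 8 KV heads with grouped-query attention,
# versus 32 KV heads if every query head kept its own keys and values.
per_token_gqa = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
per_token_mha = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)

print("GQA:", max_batch(GPU_BYTES, WEIGHT_BYTES, SEQ_LEN, per_token_gqa))
print("MHA:", max_batch(GPU_BYTES, WEIGHT_BYTES, SEQ_LEN, per_token_mha))
```

Under these assumptions the full multi-head case lands at the figure of about 4 quoted above, while grouped KV heads shrink the cache by 4x and leave room for more sequences.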





What if we can improve the transformer arithmetic itself? A bag of tricks:


1. Operator fusion: whenever possible, fuse operators so that an operation is performed with minimal memory reads/writes.

  e.g. [Flash-attention](https://huggingface.co/papers/2205.14135): does `mlc` use this (likely not), and can I improve it further?

2. Playing with precision (bfloat16, fp16, int8, int4).

   Different formats trade off range against precision (bfloat16 has the same range as float32 at lower precision; float16 has more precision than bfloat16 but a smaller range).

  `Note`

   - Generally the gain in speed comes from the decrease in memory movement (DRAM to SRAM) rather than from fewer FLOPs.

   - f32 -> bf16/f16: almost no drop in accuracy, and faster inference.

 Anything lower (e.g. int8) is lossy. Can we do it without a loss in accuracy? Yes, but it can make inference slower [[refer](https://huggingface.co/blog/hf-bitsandbytes-integration)].

 `insight`: All we care about is the relative scale of the logits, not their absolute values.

 `current sota`: Off-the-shelf quantized models (int4/int8) split the activations into two groups: outliers (kept in bf16) and normal values (int8/int4). The additional quantize/dequantize ops increase the total FLOPs, which can sometimes result in **slower models**.



3. MQA/GQA vs MHA => weight sharing in attention blocks, e.g. we can share the same keys and values across different heads. Mistral already does this (it uses GQA).



4. Continuous batching, KV caching, etc.: typically useful when batching; otherwise I don't see much benefit.
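To make the quantize/dequantize step from item 2 concrete, here is a minimal sketch of symmetric "absmax" int8 quantization in NumPy. It is one simple scheme chosen for illustration, not necessarily what any particular library or off-the-shelf model does, and the function names are hypothetical. It also shows why outliers get special treatment: a single large value stretches the scale and raises the error on everything else.

```python
import numpy as np

def absmax_quantize(x):
    """Symmetric int8 quantization: the largest |x| maps to 127."""
    scale = 127.0 / np.max(np.abs(x))
    q = np.round(x * scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to approximate float values."""
    return q.astype(np.float32) / scale

rng = np.random.default_rng(0)
x = rng.normal(size=1024).astype(np.float32)

# Round trip without outliers: error is at most half a quantization step.
q, scale = absmax_quantize(x)
err = np.max(np.abs(dequantize(q, scale) - x))

# Same data with one large outlier: the step size grows for every value.
x_outlier = x.copy()
x_outlier[0] = 60.0
q_o, scale_o = absmax_quantize(x_outlier)
err_outlier = np.max(np.abs(dequantize(q_o, scale_o) - x_outlier))

print(f"max error without outlier: {err:.4f}")
print(f"max error with outlier:    {err_outlier:.4f}")
```

The extra multiply and round on the way in, and the divide on the way out, are the additional operations that can offset the savings from moving fewer bytes.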


Wait, but what about pruning and knowledge distillation for LLMs? Well, pruning by itself doesn't speed up inference (the FLOPs are the same, just that many of them are on zeros), and typical speedups (when used with sparse/custom kernels) are only about [1.2-1.3x](https://arxiv.org/abs/2305.11627).


## Quick note on inference engines

 vLLM: Focuses on throughput. Refer to this [talk](https://www.youtube.com/watch?v=5ZlavKF_98U&t=1479s) from the creators.

 - Optimized MQA/GQA kernels
 - Efficient KV caching and beam search
 - Continuous batching
 - Tensor parallelism support



 mlc: The rabbit hole is deep (refer to [this](https://mlc.ai/)), but their philosophy is to convert your code to an intermediate representation, then search through the possible mappings to CUDA kernels and pick the best one.



Ultimately, compiling LLM/ML models removes the overhead, irrespective of the engine used.

Verdict: **vLLM is good when batching or serving large models; for chat-like inference `mlc` is the fastest**

