Normal Inference seems to output more tokens per second. #27

Open
tamil-acog opened this issue Dec 4, 2023 · 1 comment

@tamil-acog

Hi,

I just did a quick setup of gpt-fast and ran inference on Llama-2 7B. I seem to get around 65 tokens per second on average without quantization.

I also ran a quick inference with the normal implementation of the model, and I seem to get around 150 tokens per second on average (my calculation: total number of generated tokens / total time taken).
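
For reference, a minimal sketch of how I measured this, assuming a generic `generate` callable (just a placeholder for whichever implementation is being timed, not an actual gpt-fast function):

```python
import time

# `generate` is a hypothetical stand-in for whichever generation function is
# being benchmarked (gpt-fast's generate script or a plain eager-mode call);
# it is not a real gpt-fast API.
def measure_tokens_per_second(generate, prompt_ids, max_new_tokens=200):
    start = time.perf_counter()
    output_ids = generate(prompt_ids, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start

    # Count only the newly generated tokens, not the prompt tokens.
    new_tokens = len(output_ids) - len(prompt_ids)
    return new_tokens / elapsed
```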

Is there anything I am missing?

My hardware spec:
GPU: NVIDIA GeForce RTX 4090
CPU cores: 32

Is the "no of CPU cores" a hyper parameter in gpt-fast? Is there a threshold for the "no of CPU cores", below which only the CPU overhead occurs and gpt-fast helps only in such cases?

Please correct me if I am wrong.

Thanks in advance.

@Chillee
Contributor

Chillee commented Dec 4, 2023

What is a "normal implementation" of the model?

To be clear, the metric reported here is also sometimes called "tokens per second per user" (i.e. the latency for a single request), so the benchmarks here are all run with BS=1. I think the number you're getting (65 tokens per second) is pretty close to the peak you can get on your hardware, since 65 tok/s × 7B params × 2 bytes/param ≈ 910 GB/s, while the 4090 has about 1 TB/s of memory bandwidth.

If you're running with a larger batch size, you can easily obtain more "tokens/s".
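
A quick back-of-the-envelope sketch of that bandwidth argument (the ~1 TB/s figure is the 4090's approximate spec; the other numbers are from above):

```python
# Rough roofline for memory-bandwidth-bound decoding at batch size 1:
# each generated token reads every model weight from GPU memory once.
params_billion = 7          # Llama-2 7B
bytes_per_param = 2         # fp16/bf16 weights, no quantization
observed_tok_per_s = 65     # measured throughput with gpt-fast

required_bw_gb_s = observed_tok_per_s * params_billion * bytes_per_param
print(required_bw_gb_s)     # 910 GB/s of weight traffic

gpu_bw_gb_s = 1000          # RTX 4090, roughly 1 TB/s
ceiling_tok_per_s = gpu_bw_gb_s / (params_billion * bytes_per_param)
print(ceiling_tok_per_s)    # ~71 tokens/s upper bound at BS=1
```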
