Normal Inference seems to output more tokens per second. #27

Open
tamil-acog opened this issue Dec 4, 2023 · 1 comment

@tamil-acog

Hi,

I just did a quick setup of gpt-fast and ran inference on Llama-2 7B. I seem to get around 65 tokens per second on average without quantization.

I also ran a quick inference with the normal implementation of the model, and I seem to get around 150 tokens per second on average (my calculation: total number of generated tokens / total time taken).
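
For reference, a minimal sketch of how I measured this, assuming a generic `generate` callable (just a placeholder for whichever implementation is being timed, not an actual gpt-fast function):

```python
import time

# `generate` is a hypothetical stand-in for whichever generation function is
# being benchmarked (gpt-fast's generate script or a plain eager-mode call);
# it is not a real gpt-fast API.
def measure_tokens_per_second(generate, prompt_ids, max_new_tokens=200):
    start = time.perf_counter()
    output_ids = generate(prompt_ids, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start

    # Count only the newly generated tokens, not the prompt tokens.
    new_tokens = len(output_ids) - len(prompt_ids)
    return new_tokens / elapsed
```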

Is there anything I am missing?

My hardware spec:
GPU: NVIDIA GeForce RTX 4090
CPU cores: 32

Is the "no of CPU cores" a hyper parameter in gpt-fast? Is there a threshold for the "no of CPU cores", below which only the CPU overhead occurs and gpt-fast helps only in such cases?

Please correct me if I am wrong.

Thanks in advance.

@Chillee
Contributor

Chillee commented Dec 4, 2023

What is a "normal implementation" of the model?

To be clear, the metric reported here is also sometimes called "tokens per second per user" (i.e. the latency for a single request), so the benchmarks here are all run with BS=1. I think the number you're getting (65 tokens per second) is pretty close to the peak you can get on your hardware, since 65 tok/s × 7B params × 2 bytes/param ≈ 910 GB/s, while the 4090 has about 1 TB/s of memory bandwidth.

If you're running with a larger batch size, you can easily obtain more "tokens/s".
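
A quick back-of-the-envelope sketch of that bandwidth argument (the ~1 TB/s figure is the 4090's approximate spec; the other numbers are from above):

```python
# Rough roofline for memory-bandwidth-bound decoding at batch size 1:
# each generated token reads every model weight from GPU memory once.
params_billion = 7          # Llama-2 7B
bytes_per_param = 2         # fp16/bf16 weights, no quantization
observed_tok_per_s = 65     # measured throughput with gpt-fast

required_bw_gb_s = observed_tok_per_s * params_billion * bytes_per_param
print(required_bw_gb_s)     # 910 GB/s of weight traffic

gpu_bw_gb_s = 1000          # RTX 4090, roughly 1 TB/s
ceiling_tok_per_s = gpu_bw_gb_s / (params_billion * bytes_per_param)
print(ceiling_tok_per_s)    # ~71 tokens/s upper bound at BS=1
```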
