[New blog] Inside vLLM: Anatomy of a High-Throughput LLM Inference System #80
base: main
Conversation
Deploying vllm-blog-source with
- Latest commit: 3476bef
- Status: ✅ Deploy successful!
- Preview URL: https://0b9d3992.vllm-blog-source.pages.dev
- Branch Preview URL: https://gordicaleksa-anatomy-vllm.vllm-blog-source.pages.dev
Ok, I see I haven't done the DCO thing, let me fix that.
Signed-off-by: Aleksa Gordic <gordicaleksa@gmail.com>
Force-pushed from 76ff8b4 to c16a69e
Very nice!! Any way we can get footnotes working properly?
@simon-mo Which footnote? Not sure I understood.
Ah sorry, I meant citations [1] [2]... they don't link to the actual references.
Signed-off-by: Aleksa Gordic <gordicaleksa@gmail.com>
Force-pushed from 293aea1 to 66b6d03
Oh ok, added!
Signed-off-by: Aleksa Gordic <gordicaleksa@gmail.com>
@simon-mo lmk what you think now and happy to merge!
<ol type="a">
<li>policy setting - it can be either <b>FCFS</b> (first come first served) or <b>priority</b> (higher priority requests are served first)</li>
<li><code>waiting</code> and <code>running</code> queues</li>
<li>KV cache manager - the heart of paged attention [[3]](#ref-3)</li>
this reference rendering seems to be broken inside <li></li>
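The two scheduling policies described in the snippet above can be sketched as follows. This is an illustrative toy, not vLLM's actual scheduler API; the `Request` fields and `pick_next` helper are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Request:
    req_id: str
    arrival: int        # monotonically increasing arrival counter
    priority: int = 0   # lower value = higher priority (illustrative convention)

def pick_next(waiting: list[Request], policy: str = "fcfs") -> Request:
    """Select and remove the next request from the waiting queue."""
    if policy == "priority":
        # Higher-priority requests first; arrival order breaks ties.
        best = min(waiting, key=lambda r: (r.priority, r.arrival))
    else:
        # FCFS: oldest arrival first.
        best = min(waiting, key=lambda r: r.arrival)
    waiting.remove(best)
    return best

waiting = [Request("a", arrival=1, priority=1), Request("b", arrival=2, priority=0)]
print(pick_next(list(waiting), "fcfs").req_id)      # FCFS picks "a" (arrived first)
print(pick_next(list(waiting), "priority").req_id)  # priority picks "b" (priority 0)
```

In the real system, requests move from the `waiting` queue to the `running` queue once the KV cache manager can allocate blocks for them.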
> [!NOTE]
> Block size for a standard transformer layer (non-MLA [[4]](#ref-4)) is computed as follows:
> 2 * <code>block_size</code> (default=16) * <code>num_kv_heads</code> * <code>head_size</code> * <code>dtype_num_bytes</code> (2 for bf16)
the first 2 is for k and v, right? the "2 for bf16" is for dtype_num_bytes
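To make the note's formula concrete, here is a worked example. The leading 2 is indeed for the K and V tensors, and `dtype_num_bytes` is 2 for bf16; the head counts and sizes below are assumed values for illustration, not tied to any particular model.

```python
def kv_block_bytes(block_size: int = 16,
                   num_kv_heads: int = 8,     # assumed, illustrative
                   head_size: int = 128,      # assumed, illustrative
                   dtype_num_bytes: int = 2   # 2 bytes for bf16
                   ) -> int:
    """Bytes per KV cache block for one standard (non-MLA) transformer layer."""
    # Leading 2 accounts for both the K and the V tensor.
    return 2 * block_size * num_kv_heads * head_size * dtype_num_bytes

print(kv_block_bytes())  # 2 * 16 * 8 * 128 * 2 = 65536 bytes per layer
```

Switching to fp32 (`dtype_num_bytes=4`) doubles the block footprint, which is why the dtype term appears explicitly in the formula.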
My nit comments are not blocking. We can publish first and then fix them.
Porting the original blog: https://www.aleksagordic.com/blog/vllm