
Modify the current PyTorch model to C++ #42

Open
zhuohan123 opened this issue Apr 22, 2023 · 4 comments
Labels
performance Performance-related issues

Comments

@zhuohan123
Collaborator

zhuohan123 commented Apr 22, 2023

Expected gain: for 13B models, we should see a 20%-30% latency improvement on a single GPU and 2-3x on 4 GPUs. For smaller models, the gain should be even higher.

Making a single iteration's computation run entirely in C++ should be enough for high performance. That way, we can keep most of the complicated scheduling logic in Python, including weight loading.

Potential sources of overheads:

  1. Python vs. C++.
  2. PyTorch (even in C++) vs. FasterTransformer.
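To make overhead 1 concrete, here is a minimal pure-Python micro-benchmark (illustrative only, not from this issue) showing the per-call interpreter dispatch cost that a C++ model runner would avoid:

```python
import timeit

def add(a, b):
    return a + b

def run_with_calls(n=100_000):
    # One Python function call (interpreter dispatch) per "op",
    # analogous to launching each model op from Python.
    total = 0
    for _ in range(n):
        total = add(total, 1)
    return total

def run_inlined(n=100_000):
    # Same arithmetic with the per-op call overhead removed.
    total = 0
    for _ in range(n):
        total += 1
    return total

if __name__ == "__main__":
    t_calls = timeit.timeit(run_with_calls, number=10)
    t_inline = timeit.timeit(run_inlined, number=10)
    print(f"with calls: {t_calls:.4f}s  inlined: {t_inline:.4f}s")
```

On a typical interpreter the call-per-op version is noticeably slower for identical work; real model runners pay an analogous cost on every kernel launch driven from Python.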

How to implement a C++ version:

  1. ("Fake" C++) The Torch compiler (torch.jit).
  2. LibTorch, the C++ API of PyTorch (easier to implement and extend, but only addresses overhead 1).
  3. Extract the single-model code we need from FasterTransformer into CacheFlow. This addresses both overheads but is harder to implement.
@WoosukKwon added the P0 label Apr 22, 2023
@zhuohan123
Collaborator Author

Specific optimizations for smaller models (~100M parameters).

  1. Improve sampling efficiency.
  2. We may need to merge more models.

This should not be prioritized, because the core technique of CacheFlow (memory saving) does not help small models at all; still, they may benefit from iteration-level scheduling.
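The sampling-efficiency idea in point 1 amounts to handling the whole batch in one pass rather than invoking a separate sampling routine per sequence. A hypothetical, stdlib-only sketch (softmax and sample_batch are illustrative names, not CacheFlow/vLLM functions):

```python
import math
import random

def softmax(row):
    # Numerically stable softmax over one row of logits.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def sample_batch(logits_batch, rng=random):
    """Sample one token id per sequence in a single pass over the batch.

    For small models, per-sequence sampling overhead is a visible fraction
    of the iteration time, so amortizing it across the batch helps.
    """
    token_ids = []
    for row in logits_batch:
        probs = softmax(row)
        token_ids.append(rng.choices(range(len(row)), weights=probs, k=1)[0])
    return token_ids
```

In a real implementation this loop would be a single vectorized (or fused C++/CUDA) operation over the logits tensor; the sketch only shows the batching structure.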

@zhuohan123
Collaborator Author

After the C++ version, we might need to rerun all the experiments with the new implementation.

@hmellor
Collaborator

hmellor commented Mar 8, 2024

@zhuohan123 can this work be considered complete?

@amrobbins

If you're interested: I've been building a C++-native deep learning framework for the past few years that I want to open-source soon. It aims for optimal performance. Here is the framework training AlexNet; most of the kernels here are cuBLASLt and cuDNN:

[image: profiler view of the framework training AlexNet]

I'd certainly like for it to be part of vLLM. Is this something there'd be interest in? If so, I can make sure I support (or add support for) all of the pieces you need and get it connected. I can provide access to the private repo on request.
