
Modify the current PyTorch model to C++ #42

Open
zhuohan123 opened this issue Apr 22, 2023 · 4 comments
Labels
performance Performance-related issues

Comments

@zhuohan123
Collaborator

zhuohan123 commented Apr 22, 2023

Expected gain: for 13B models, we should see a 20%-30% latency improvement on a single GPU and 2-3x on 4 GPUs. For smaller models, the gain should be even higher.

Making a single iteration's computation run entirely in C++ should be enough for high performance. That way, we can keep most of the complicated scheduling logic in Python, including weight loading.

Potential sources of overheads:

  1. Python vs. C++.
  2. PyTorch (even in C++) vs. FasterTransformer.
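To make overhead 1 concrete, here is a minimal pure-Python micro-benchmark (illustrative only, not from this issue) showing the per-call interpreter dispatch cost that a C++ model runner would avoid:

```python
import timeit

def add(a, b):
    return a + b

def run_with_calls(n=100_000):
    # One Python function call (interpreter dispatch) per "op",
    # analogous to launching each model op from Python.
    total = 0
    for _ in range(n):
        total = add(total, 1)
    return total

def run_inlined(n=100_000):
    # Same arithmetic with the per-op call overhead removed.
    total = 0
    for _ in range(n):
        total += 1
    return total

if __name__ == "__main__":
    t_calls = timeit.timeit(run_with_calls, number=10)
    t_inline = timeit.timeit(run_inlined, number=10)
    print(f"with calls: {t_calls:.4f}s  inlined: {t_inline:.4f}s")
```

On a typical interpreter the call-per-op version is noticeably slower for identical work; real model runners pay an analogous cost on every kernel launch driven from Python.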

How to implement a C++ version:

  1. ("Fake" C++) The Torch compiler (torch.jit).
  2. LibTorch, the C++ API of PyTorch (easier to implement and extend, but only addresses overhead 1).
  3. Extract the single-model code we need from FasterTransformer into CacheFlow. This addresses both overheads but is harder to implement.
@WoosukKwon added the P0 label Apr 22, 2023
@zhuohan123
Collaborator Author

Specific optimizations for smaller models (~100M parameters).

  1. Improve sampling efficiency.
  2. We may need to merge more models.

This should not be prioritized, because the core technique of CacheFlow (memory saving) does not help small models at all; still, they may benefit from iteration-level scheduling.
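The sampling-efficiency idea in point 1 amounts to handling the whole batch in one pass rather than invoking a separate sampling routine per sequence. A hypothetical, stdlib-only sketch (softmax and sample_batch are illustrative names, not CacheFlow/vLLM functions):

```python
import math
import random

def softmax(row):
    # Numerically stable softmax over one row of logits.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def sample_batch(logits_batch, rng=random):
    """Sample one token id per sequence in a single pass over the batch.

    For small models, per-sequence sampling overhead is a visible fraction
    of the iteration time, so amortizing it across the batch helps.
    """
    token_ids = []
    for row in logits_batch:
        probs = softmax(row)
        token_ids.append(rng.choices(range(len(row)), weights=probs, k=1)[0])
    return token_ids
```

In a real implementation this loop would be a single vectorized (or fused C++/CUDA) operation over the logits tensor; the sketch only shows the batching structure.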

@zhuohan123
Collaborator Author

After the C++ version, we might need to rerun all the experiments with the new implementation.

@hmellor
Collaborator

hmellor commented Mar 8, 2024

@zhuohan123 can this work be considered complete?

@amrobbins

If you're interested: I've been building a C++-native deep learning framework for the past few years that I want to open-source soon. It aims for optimal performance. Here is the framework training AlexNet; most of the kernels here are cuBLASLt and cuDNN:

[image: profiler view of the framework training AlexNet]

I'd certainly like for it to be part of vLLM. Is this something there'd be interest in? If so, I can make sure I support (or add support for) all of the pieces you need and get it connected. I can provide access to the private repo on request.
