Modify the current PyTorch model to C++ #42
Labels: performance (Performance-related issues)
Comments
Specific optimizations for smaller models (~100M parameters).
This should not be prioritized, because the core technique of CacheFlow (memory saving) does not help small models at all; still, they may benefit from iteration-level scheduling.
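As a rough sketch of what iteration-level scheduling means here, the C++ snippet below re-forms the batch before every model iteration, so finished sequences leave immediately and waiting requests join right away. The names (`Sequence`, `pick_batch`, `max_batch_size`) are hypothetical and only illustrate the idea, not the actual CacheFlow API.

```cpp
// Illustrative sketch of iteration-level scheduling (hypothetical names,
// not the CacheFlow implementation).
#include <cstddef>
#include <deque>
#include <vector>

struct Sequence {
  int id;
  bool finished = false;
};

// Build the batch for the next iteration from the still-running sequences,
// then fill the freed slots from the waiting queue.
std::vector<Sequence> pick_batch(std::deque<Sequence>& waiting,
                                 const std::vector<Sequence>& running,
                                 std::size_t max_batch_size) {
  std::vector<Sequence> next;
  for (const auto& seq : running) {
    if (!seq.finished) next.push_back(seq);  // keep unfinished sequences
  }
  while (next.size() < max_batch_size && !waiting.empty()) {
    next.push_back(waiting.front());          // admit newly arrived requests
    waiting.pop_front();
  }
  return next;
}
```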
After the C++ version, we might need to rerun all the experiments with the new implementation.
@zhuohan123 Can this work be considered complete?
Expected gain: For 13B models, we should see a 20%-30% latency gain on a single GPU and 2-3x on 4 GPUs. For smaller models, the gain should be even higher.
Having a single iteration's computation run completely in C++ should be enough for high performance. This way, we can keep most of the complicated scheduling logic in Python, as well as weight loading.
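As a rough illustration of that split, the sketch below assumes a PyTorch C++ extension (via `torch/extension.h`) that runs one full decoding iteration in C++, while the Python side keeps scheduling and weight loading. `decode_iteration` and its arguments are hypothetical names, not the issue's actual design.

```cpp
// Hypothetical sketch of the Python/C++ boundary: one decoding iteration runs
// entirely inside a PyTorch C++ extension; Python only decides which tokens
// to feed it each iteration.
#include <torch/extension.h>

torch::Tensor decode_iteration(
    const torch::Tensor& input_ids,      // [num_tokens] token ids for this step
    const torch::Tensor& embed_weight,   // [vocab_size, hidden_size]
    const torch::Tensor& lm_head_weight  // [vocab_size, hidden_size]
) {
  torch::NoGradGuard no_grad;
  // Embedding lookup; the transformer layers (attention + MLP, reading the
  // KV cache) would also run here, all without returning to Python.
  auto hidden = torch::embedding(embed_weight, input_ids);
  // Project back to the vocabulary to get next-token logits.
  return torch::matmul(hidden, lm_head_weight.t());
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("decode_iteration", &decode_iteration,
        "Run one decoding iteration fully in C++ (illustrative sketch)");
}
```

With this kind of boundary, Python is entered only once per iteration, so keeping scheduling and weight loading in Python should not add per-layer overhead.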
Potential sources of overhead:
How to implement a C++ version: