Performance #3
Thanks for taking a look. I changed it so the model can be specified on the command line now. With the 42M model I get about 67 tokens/sec on my ThinkPad, vs 165 tokens/sec with the 15M. I'm going to try running the real LLaMA 7B model.
@rbitr make sure to try the optimization options; on your hardware you want the right optimization flags.
Edit: on my Mac (2018) it looks like I get a small increase, from ~120 tok/sec with no arguments to ~130 with the optimization flags.
Current numbers (for the 110M model now, Ubuntu on my ThinkPad): I wonder if llama2.c's matmuls are getting parallelized better because they are all explicitly vector-matrix products?
OK, I added a handwritten matmul like in llama2.c. Now, unless I missed something, compiling with optimizations gives comparable speed.
Excellent. That's a great starting point for trying various options, such as the intrinsic matmul, various BLAS libraries, and compiler options. I've noticed in the past that sometimes the intrinsic matmul is slower than a handwritten one. However, I think you also have a very old version of GFortran; I believe version 9 is from 2019. What version of gcc do you have to test the C code with?
I installed gfortran-10 and I get the same speed. I'm still going to investigate different compiler options, and I will write a BLAS matmul. I had compiled llama2.c with gcc 9.4, the version that ships with Ubuntu 20.04.
Very good. Thanks @rbitr. If you need more help, you can also ask at https://fortran-lang.discourse.group/; there are a lot of knowledgeable people there.
@certik FYI I was able to get more speedup by writing element-wise functions that compute the q, k, v projections together at the beginning of the transformer and the MLP + nonlinearity part at the end, and then parallelizing with OMP. I'm running a 3B parameter model at ~0.8 tok/s on my computer now, up from about 0.1 tok/s a week ago. It should be much faster than llama2.c at this point, but it is still slower than llama.cpp; I'm still trying to understand all the optimizations it's using.
Re the above: I hadn't parallelized everything I could, and now I get e.g. 2.48 tok/s vs 2.82 tok/s for llama.cpp on a 3B model (give or take; those are the numbers from the last runs I did). So it really is on par with llama.cpp, which is very heavily optimized.
Very good, great job! Yes, Fortran is capable of matching the speed of the most optimized libraries. |
A couple of notes: with these changes, we have the following performance (indicative numbers on my machine):
Clearly there is a lot of room to improve single-thread performance, but I'm surprised at how little difference the additional threads make for llama.cpp. This gives us something to dig into anyway. In single-threaded operation, the breakdown of times is roughly as follows (the total should be roughly 1/(speed in tok/s), i.e. about 1500 ms):
Excellent. I think llama.cpp gets a speedup in parallel; however, be careful that the 1-thread benchmark is truly 1 thread. A lot of time is spent in matmul, and OpenBLAS, for example, runs in parallel by default (I think). Overall I think this is already nicely competitive and we'll be able to match the performance. Can llama.cpp run GPT-2? If so, we can test against fastGPT, where I understand the performance quite well.
llama.cpp doesn't appear to support GPT-2 directly. There is an old demo of using GPT-2 with the GGML library; see https://github.com/ggerganov/ggml/tree/239defe61dbe9dddc6304942e8a3d03d6a3c69ab#gpt-inference-example. It is broken as-is; I had to make fixes to get it running.
I think it is, as I only see one thread running in htop, and I got the same result compiling without BLAS.
For GPT-2 and an f32 model, most of the performance comes from matmul. For LLaMA I would expect a similar result. So one way to go forward is to use an f32 model and benchmark that (if any exists). The point is to get something where we get the same or better performance. Then we can add other features (like reduced accuracy) back in and ensure, one by one, that they run at top speed.
Yes, good idea. I did a comparison using all 32-bit with a 1.1B LLaMA model (due to memory size this is easier to work with than the 3B I was using for the other benchmarks, and the quality seems the same as it's a newer model). After hacking together a 32-bit version, I get 3.2 tok/s for the 1.1B parameter model vs 4.2 tok/s with ggml. I have some work to do to bring the performance up, and like you say, this is a good way to do a basic comparison without confounding factors like fp16. Note that ggml runs f16 faster than f32, I think because it is doing custom vectorization.
Very good. Let's do 32-bit. It's now 3.2 vs 4.2 tok/s, so it's close, but not quite there yet. Let's get it to be exactly equal. On a single core, the possible differences are:
Start with the first point, then we'll see. One way to go forward is to hack llama.cpp and simplify the algorithm: keep some matmuls but remove the other operations (it will return nonsensical results, of course), do the same in llama2.f90, and do whatever it takes to get the same performance, possibly just matmul. Then keep adding back the other operations, one by one, in both codes and see what slows things down.
So, good news: I've matched the speed of llama.cpp. I have (for example) the 1.1B model at 4.18 tok/s for llama.cpp and 4.21 tok/s with Fortran. Other than a bit of cleanup of some unneeded copying, the main thing responsible for the speed was replacing the intrinsic matmul. For the comparison I didn't use BLAS (I compiled llama.cpp without BLAS as well), but in my experiments I don't see any material difference on my machine. What I'm going to do now is clean it up and keep this pared-down fast version.
Beautiful! Great job. This was the hardest. Now when you have a version that runs as fast, you can start adding back features, one by one, and always benchmark and ensure the new feature doesn't slow things down. Then you can also investigate parallelism, always one feature at a time. |
Current performance with 16-bit quantization is 7.3 tok/s on one thread, vs 7.4 tok/s with llama.cpp. This uses a SIMD routine to convert from 16-bit to f32 and take the dot product (as does llama.cpp).
Thank you so much for writing this. We are now working on compiling it with LFortran, this is a great example.
Regarding performance on my Apple M1 Max with GFortran 11.3.0: I get about 240 tokens/s with the default gfortran options. With

-O3 -march=native -ffast-math -funroll-loops

I get about 277 tokens/s. Finally, with

gfortran -O3 -march=native -ffast-math -funroll-loops -fexternal-blas llama2.f90 -o llm -framework Accelerate

which should be the fastest, I still only get about 270 tokens/s. I think this is too small a model; one would have to try a larger version to take advantage of the accelerated linear algebra.