server is not using cuda on windows (BLAS=0) #1242
Hello,

On Windows 10, the CPU installation was successful, and I now want to try CUDA to speed things up.
I followed the tutorial and checked my installation.
I'm setting `set CMAKE_ARGS='-DLLAMA_CUBLAS=on'` and then running everything, but I still get BLAS=0 once the server is running, and the GPU is not used during computation.
Has anyone faced this problem?
Thanks
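(For context, llama-cpp-python's documented way to force a cuBLAS source build on Windows cmd is sketched below. Note that cmd's `set` stores everything after `=` verbatim, so `set CMAKE_ARGS='-DLLAMA_CUBLAS=on'` passes the literal single quotes through to the build, which may be why the flag never takes effect; `FORCE_CMAKE=1` is from the project's install docs.)

```bat
:: Force a source rebuild of llama-cpp-python with cuBLAS (cmd.exe).
:: Leave the quotes off: `set` keeps them as part of the value.
set CMAKE_ARGS=-DLLAMA_CUBLAS=on
set FORCE_CMAKE=1
pip install --force-reinstall --no-cache-dir llama-cpp-python
```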
Comments

Yeah, I went through hell with this not too long ago. Assuming you have all the other dependencies installed properly, you can try:

I got this working using the llama-cpp-python==0.2.7 and CUDA 11.7 versions. I haven't tried it with the CUDA 12.3 version that you have, but it states on the repo that you can change both the llama-cpp-python and CUDA versions in the command. You can try it out and see if it works. Start it up with
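(The actual commands were truncated in this copy of the thread. Based on the prebuilt-wheel repo this reply appears to reference, jllllll's llama-cpp-python-cuBLAS-wheels, and privateGPT's documented entry point, they plausibly looked like the following; the index URL, the AVX2 build tag, and the run command are assumptions:)

```bat
:: Install a prebuilt cuBLAS wheel; swap the pinned llama-cpp-python version
:: and the cuXXX suffix (e.g. cu117 for CUDA 11.7) to match your setup.
pip install llama-cpp-python==0.2.7 --prefer-binary --force-reinstall ^
  --extra-index-url=https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/AVX2/cu117

:: Then start the privateGPT server:
python -m private_gpt
```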
@TheLocalLab Thank you very much, it worked and I now have BLAS=1!! The only thing is that now I have an error mentioning 'context_params'.
For info, the CUDA version does make a difference, because as of now there is no llama-cpp-python version compatible with 12.3 using the provided indexes.
I don't believe I've gotten that error, but 'context_params' could be referring to the context window parameters. It could also be the model you are running. Have you tried loading different models to see if the error pops up again? If you paste the full error and the model, someone might be able to provide more information on that, or you might need to do a fresh install if the dependencies weren't all properly installed. I myself had to pip install poetry and torch due to issues installing them the way listed in the Windows manual this repo provided.
Did a full reinstall in a conda env this time, with CUDA 12.2 and llama-cpp-python 0.2.7, and it's working: I have BLAS=1.
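(A quick way to confirm a build is actually CUDA-enabled, independent of privateGPT; the model path below is a placeholder for any local GGUF file:)

```python
# With verbose=True, llama.cpp prints its system info line on load; a cuBLAS
# build reports "BLAS = 1" and logs how many layers were offloaded to the GPU.
from llama_cpp import Llama

llm = Llama(
    model_path="models/your-model.gguf",  # placeholder: any local GGUF model
    n_gpu_layers=1,  # offload at least one layer so CUDA actually initializes
    verbose=True,
)
```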
Check the number of GPU layers you're offloading to your GPU when you start up: "llm_load_tensors: offloaded 35/35 layers to GPU". I believe by default it's maxed out at 35/35 layers, which is great for some cards, but not everyone's GPU can handle that many. Depending on which GPU you have, you're going to have to find the number of layers your card performs best with. While loading a model you can see how much VRAM it uses, e.g. "llama_new_context_with_model: total VRAM used: 2770.68 MB (model: 2495.31 MB, context: 275.37 MB)". You'll have to toy around with different numbers to see what works best; I tend to offload somewhere from 14-25 layers without blowing up my GPU.

Go to the "llm_component" py file located at "private_gpt\components\llm\llm_component.py", look for line 28, 'model_kwargs={"n_gpu_layers": 35}', change the number to whatever works best with your system, and save it (see the sketch below). It might take a little while, but this should help improve speed some.
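(For reference, the edit described above comes down to one keyword argument; the excerpt below shows just that line. Line numbers may drift between versions, so search for `n_gpu_layers` if line 28 doesn't match:)

```python
# private_gpt/components/llm/llm_component.py, around line 28 (excerpt):
# lower n_gpu_layers from the default 35 to something your VRAM can hold.
model_kwargs={"n_gpu_layers": 20},  # e.g. 14-25 worked for the commenter above
```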
Stale issue