server is not using cuda on windows (BLAS=0) #1242
Hello,

On Windows 10, the CPU installation was successful, and I now want to try CUDA to speed things up.
I followed the tutorial and checked my installation.
I'm setting `set CMAKE_ARGS='-DLLAMA_CUBLAS=on'` and then running everything, but I still get BLAS=0 once the server is running, and the GPU is not used during computation.
Has anyone faced this problem?
Thanks
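(For context, llama-cpp-python's documented way to force a cuBLAS source build on Windows cmd is sketched below. Note that cmd's `set` stores everything after `=` verbatim, so `set CMAKE_ARGS='-DLLAMA_CUBLAS=on'` passes the literal single quotes through to the build, which may be why the flag never takes effect; `FORCE_CMAKE=1` is from the project's install docs.)

```bat
:: Force a source rebuild of llama-cpp-python with cuBLAS (cmd.exe).
:: Leave the quotes off: `set` keeps them as part of the value.
set CMAKE_ARGS=-DLLAMA_CUBLAS=on
set FORCE_CMAKE=1
pip install --force-reinstall --no-cache-dir llama-cpp-python
```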
Comments

Yeah, I went through hell with this not too long ago. Assuming you have all the other dependencies installed properly, you can try:

I got this working using the llama-cpp-python==0.2.7 and CUDA 11.7 versions. I haven't tried it with the CUDA 12.3 version that you have, but it states on the repo that you can change both the llama-cpp-python and CUDA versions in the command. You can try it out and see if it works. Start it up with
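(The actual commands were truncated in this copy of the thread. Based on the prebuilt-wheel repo this reply appears to reference, jllllll's llama-cpp-python-cuBLAS-wheels, and privateGPT's documented entry point, they plausibly looked like the following; the index URL, the AVX2 build tag, and the run command are assumptions:)

```bat
:: Install a prebuilt cuBLAS wheel; swap the pinned llama-cpp-python version
:: and the cuXXX suffix (e.g. cu117 for CUDA 11.7) to match your setup.
pip install llama-cpp-python==0.2.7 --prefer-binary --force-reinstall ^
  --extra-index-url=https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/AVX2/cu117

:: Then start the privateGPT server:
python -m private_gpt
```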
@TheLocalLab Thank you very much, it worked and I now have BLAS=1!! The only thing is that now I have an error mentioning 'context_params'.
For info, the CUDA version does make a difference, because as of now there is no llama-cpp-python version compatible with 12.3 using the provided indexes.
I don't believe I've gotten that error, but 'context_params' could be referring to the context window parameters. It could also be the model you are running. Have you tried loading different models to see if the error pops up again? If you paste the full error and the model, someone might be able to provide more information on that, or you might need to do a fresh install if the dependencies weren't all properly installed. I myself had to pip install poetry and torch due to issues installing them the way listed in the Windows manual this repo provided.
Did a full reinstall in a conda env this time, with CUDA 12.2 and llama-cpp-python 0.2.7, and it's working: I have BLAS=1.
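(A quick way to confirm a build is actually CUDA-enabled, independent of privateGPT; the model path below is a placeholder for any local GGUF file:)

```python
# With verbose=True, llama.cpp prints its system info line on load; a cuBLAS
# build reports "BLAS = 1" and logs how many layers were offloaded to the GPU.
from llama_cpp import Llama

llm = Llama(
    model_path="models/your-model.gguf",  # placeholder: any local GGUF model
    n_gpu_layers=1,  # offload at least one layer so CUDA actually initializes
    verbose=True,
)
```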
Check the number of GPU layers you're offloading to your GPU when you start up: "llm_load_tensors: offloaded 35/35 layers to GPU". I believe by default it's maxed out at 35/35 layers, which is great for some cards, but not everyone's GPU can handle that many. Depending on which GPU you have, you're going to have to find the number of layers your card performs best with. While loading a model you can see how much VRAM it uses, e.g. "llama_new_context_with_model: total VRAM used: 2770.68 MB (model: 2495.31 MB, context: 275.37 MB)". You'll have to toy around with different numbers to see what works best; I tend to offload somewhere from 14-25 layers without blowing up my GPU.

Go to the "llm_component" py file located at "private_gpt\components\llm\llm_component.py", look for line 28, 'model_kwargs={"n_gpu_layers": 35}', change the number to whatever works best with your system, and save it (see the sketch below). It might take a little while, but this should help improve speed some.
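(For reference, the edit described above comes down to one keyword argument; the excerpt below shows just that line. Line numbers may drift between versions, so search for `n_gpu_layers` if line 28 doesn't match:)

```python
# private_gpt/components/llm/llm_component.py, around line 28 (excerpt):
# lower n_gpu_layers from the default 35 to something your VRAM can hold.
model_kwargs={"n_gpu_layers": 20},  # e.g. 14-25 worked for the commenter above
```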
Stale issue