server is not using cuda on windows (BLAS=0) #1242

Closed
Tyrannas opened this issue Nov 15, 2023 · 8 comments

@Tyrannas

On Windows 10, the CPU installation was successful, and I now want to try CUDA to speed things up. I followed the tutorial and checked my installation:

λ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Sep__8_19:56:38_Pacific_Daylight_Time_2023
Cuda compilation tools, release 12.3, V12.3.52
Build cuda_12.3.r12.3/compiler.33281558_0
λ nvidia-smi
Wed Nov 15 10:09:01 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.84                 Driver Version: 545.84       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA T600 Laptop GPU       WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   60C    P0              N/A / ERR! |      0MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

I'm setting `set CMAKE_ARGS='-DLLAMA_CUBLAS=on'` and then running everything. But I still get BLAS=0 once the server is running, and the GPU is not used during computation.
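
For completeness, the rebuild step I'm running after setting the variable is roughly:

    pip install llama-cpp-python --force-reinstall --no-cache-dir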

Has anyone faced this problem?

Thanks

@AntoineQ1

Hello,
I had this problem because, while debugging another issue, I removed the "--no-cache-dir" option, so pip kept reusing the cached CPU-only wheel.
(I think you probably kept it, but this might be helpful to someone who didn't.)
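
If you already installed without it, purging pip's cache before reinstalling should have the same effect; something like:

    pip cache purge
    pip install llama-cpp-python --force-reinstall --no-cache-dir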

@TheLocalLab

Yeah, I went through hell with this not too long ago. Assuming you have all the other dependencies installed properly, you can try:

  1. pip uninstall llama-cpp-python
  2. Head over to the llama-cpp-python-cuBLAS-wheels repo (linked in step 4) and look it over a bit.
  3. Enable the cuBLAS parameters in your PowerShell environment (if that's what you're using) with:
    $Env:LLAMA_CUBLAS = "1"
    $Env:FORCE_CMAKE = "1"
    $Env:CMAKE_ARGS="-DLLAMA_CUBLAS=on"
    or
    set LLAMA_CUBLAS=1
    set CMAKE_ARGS=-DLLAMA_CUBLAS=on
    set FORCE_CMAKE=1
  4. Run this to build the llama-cpp-python wheels with the llama-cpp-python and CUDA versions you want to run:
    python -m pip install llama-cpp-python==<version> --prefer-binary --extra-index-url=https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/AVX2/cu117

I got this working with llama-cpp-python==0.2.7 and CUDA 11.7. I haven't tried it with the CUDA 12.3 version that you have, but the repo states that you can change both the llama-cpp-python and CUDA versions in the command, so you can try it out and see if it works. Start it up with poetry run python -m private_gpt and, if it built successfully, you should see BLAS = 1. If it doesn't work, try deleting your env and doing this over with a fresh install.
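
A quick way to double-check the build without starting the whole server (assuming the low-level bindings are exposed at the package top level, as in recent llama-cpp-python releases) is:

    python -c "import llama_cpp; print(llama_cpp.llama_print_system_info())"

BLAS = 1 in that output means the cuBLAS build took.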

@Tyrannas
Author

@TheLocalLab Thank you very much, it worked and I now have BLAS=1!!

The only thing is that now I get 'Llama' object has no attribute 'context_params' :) Did you have this error too, by any chance? I'm using the 0.2.7 version just like you (and CUDA 12.2, but I don't think that should make a difference).

@AntoineQ1

For info, the CUDA version does make a difference: as of now, no llama-cpp-python version is compatible with 12.3 on the provided indexes.
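
For example, with CUDA 12.2 the suffix of the index URL changes from cu117 to cu122 (assuming the wheel repo publishes a build for that CUDA version):

    python -m pip install llama-cpp-python==0.2.7 --prefer-binary --extra-index-url=https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/AVX2/cu122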

@TheLocalLab

> @TheLocalLab Thank you very much, it worked and I now have BLAS=1!!
>
> The only thing is that now I get 'Llama' object has no attribute 'context_params' :) Did you have this error too, by any chance? I'm using the 0.2.7 version just like you (and CUDA 12.2, but I don't think that should make a difference).

I don't believe I got that error, but 'context_params' could be referring to the context window parameters. It could also be the model you're running. Have you tried loading different models to see if the error pops up again? If you paste the full error and the model, someone might be able to provide more information; otherwise you might need to do a fresh install if all the dependencies weren't properly installed. I myself had to pip install poetry and torch due to issues installing them the way listed in the Windows manual this repo provided.

@Tyrannas
Author

Tyrannas commented Nov 16, 2023

Did a full reinstall in a conda env this time, with CUDA 12.2 and llama-cpp-python 0.2.7, and it's working: I have BLAS=1.
Now the last remaining mystery is that it's running slower than when it was CPU-only...

@TheLocalLab

> Did a full reinstall in a conda env this time, with CUDA 12.2 and llama-cpp-python 0.2.7, and it's working: I have BLAS=1. Now the last remaining mystery is that it's running slower than when it was CPU-only...

Check the number of GPU layers you're offloading to your GPU when you start up: "llm_load_tensors: offloaded 35/35 layers to GPU". I believe by default they max it out at 35/35 layers, which is great for some, but not every GPU can work with that many. Depending on which GPU you have, you're going to have to find the number of layers your card performs best with. While loading your models, you can actually see how much VRAM is used to load the model, e.g. "llama_new_context_with_model: total VRAM used: 2770.68 MB (model: 2495.31 MB, context: 275.37 MB)".

You're going to have to toy around with different numbers to see what works best; I tend to offload somewhere from 14 to 25 layers without blowing up my GPU. Go to the llm_component.py file located in the privateGPT folder at private_gpt\components\llm\llm_component.py, look for line 28, 'model_kwargs={"n_gpu_layers": 35}', and change the number to whatever will work best with your system, then save it. It might take a little while, but this should help improve speed some.
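
If you want to experiment outside privateGPT first, here's a minimal llama-cpp-python sketch for trying different offload values (model.gguf is a placeholder; point it at any local GGUF file):

    # n_gpu_layers experiment: raise or lower the value and watch the
    # "offloaded X/35 layers to GPU" and "total VRAM used" log lines.
    from llama_cpp import Llama

    llm = Llama(
        model_path="model.gguf",  # placeholder path, substitute your own model
        n_gpu_layers=20,          # 0 = CPU only; 35 = all layers for a 7B model
        verbose=True,             # prints the offload / VRAM lines quoted above
    )
    print(llm("Q: What is 2+2? A:", max_tokens=8)["choices"][0]["text"])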


github-actions bot commented Dec 2, 2023

Stale issue

github-actions bot added the stale label Dec 2, 2023
github-actions bot closed this as not planned Dec 10, 2023