Closed
Labels: enhancement (New feature or request)
Description
I've compiled the llama.cpp Python binding with CUDA support enabled, and GPU offload is working:
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 3767.53 MB (+ 2000.00 MB per state)
llm_load_tensors: offloading 1 repeating layers to GPU
llm_load_tensors: offloaded 1/35 layers to GPU
llm_load_tensors: VRAM used: 124 MB
However, it looks like n_gpu_layers is set to 1 and can't be changed?
That value should be customizable via an argument (or in model settings) or set to a much higher number by default.
As you can see in the log above, a 7B model has 35 layers, so to run fully on the GPU, n_gpu_layers should be at least 35 (a 13B model has around 43 layers).
More GPU offload (ideally 100%) would give much faster inference for people with CUDA/Metal/cuBLAS-capable hardware.
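
If it helps, here is a minimal sketch of how the value could be passed straight through to llama-cpp-python's Llama constructor. The model path and layer count are placeholders, and how this project wires its own settings into the binding is an assumption on my part:

```python
from llama_cpp import Llama

# Placeholder path; n_gpu_layers=35 would offload all layers of a 7B model.
# Lower the number to fit smaller VRAM, since each offloaded layer uses GPU memory.
llm = Llama(
    model_path="./models/7b/ggml-model-q4_0.bin",
    n_gpu_layers=35,
)

output = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(output["choices"][0]["text"])
```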