GPU offload layers greater than 1? #19

@LoopControl

I've compiled the llama.cpp python binding with CUDA support enabled, and GPU offload is working:

llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  = 3767.53 MB (+ 2000.00 MB per state)
llm_load_tensors: offloading 1 repeating layers to GPU
llm_load_tensors: offloaded 1/35 layers to GPU
llm_load_tensors: VRAM used: 124 MB

However, it looks like n_gpu_layers is set to 1 and can't be changed?

That value should be customizable via an argument (or in model settings) or set to a much higher number by default.

As you can see in the log above, a 7B model has 35 layers, so to run fully on the GPU, n_gpu_layers should be at least 35 (a 13B model has around 43 layers).

Offloading more layers (or 100% of them) would give much faster inference speeds for people running on GPUs via CUDA/cuBLAS or Metal.
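
For context, the llama-cpp-python binding's Llama constructor already accepts an n_gpu_layers parameter, so this could be exposed as a simple pass-through setting. A minimal sketch of what configuring it might look like; the model path is hypothetical, and the exact offload behavior depends on the binding version:

```python
from llama_cpp import Llama

# n_gpu_layers controls how many transformer layers are offloaded to VRAM:
# 0 keeps everything on the CPU, 35 covers all repeating layers of a 7B
# model, and values larger than the model's layer count just offload all.
llm = Llama(
    model_path="./models/7B/ggml-model-q4_0.bin",  # hypothetical path
    n_gpu_layers=35,  # offload all 35 layers of a 7B model
)

output = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(output["choices"][0]["text"])
```

With full offload, the load log should then report something like "offloaded 35/35 layers to GPU" instead of 1/35.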

Labels: enhancement (New feature or request)
