Closed
Labels: enhancement (New feature or request)
Description
I've compiled the llama.cpp Python binding with CUDA support enabled, and GPU offload is working:
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 3767.53 MB (+ 2000.00 MB per state)
llm_load_tensors: offloading 1 repeating layers to GPU
llm_load_tensors: offloaded 1/35 layers to GPU
llm_load_tensors: VRAM used: 124 MB
However, it looks like n_gpu_layers is set to 1 and can't be changed?
That value should be customizable via an argument (or in model settings) or set to a much higher number by default.
As you can see in the log above, a 7B model has 35 layers, so to run fully on the GPU, n_gpu_layers should be at least 35 (a 13B model has around 43 layers).
More GPU offload (ideally 100%) would give much faster inference for people with CUDA/Metal/cuBLAS-capable hardware.
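
If it helps, here is a minimal sketch of how the value could be passed straight through to llama-cpp-python's Llama constructor. The model path and layer count are placeholders, and how this project wires its own settings into the binding is an assumption on my part:

```python
from llama_cpp import Llama

# Placeholder path; n_gpu_layers=35 would offload all layers of a 7B model.
# Lower the number to fit smaller VRAM, since each offloaded layer uses GPU memory.
llm = Llama(
    model_path="./models/7b/ggml-model-q4_0.bin",
    n_gpu_layers=35,
)

output = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(output["choices"][0]["text"])
```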