
Conversation


@wine99 wine99 commented Jan 21, 2026

This PR changes the OpenVINO backend buffer's is_host flag to false, aligning it with the behavior of the CPU repack buffer.

The OpenVINO backend buffer is conceptually similar to the CPU repack buffer: both repack (i.e., reorder) quantized values. The OpenVINO backend goes further, also extracting zero-points and related quantization metadata.

static struct ggml_backend_buffer_type ggml_backend_cpu_buffer_type_repack = {
    /* .iface    = */ {
                       /* .get_name         = */ ggml_backend_cpu_repack_buffer_type_get_name,
                       /* .alloc_buffer     = */ ggml_backend_cpu_repack_buffer_type_alloc_buffer,
                       /* .get_alignment    = */ ggml_backend_cpu_repack_buffer_type_get_alignment,
                       /* .get_max_size     = */ nullptr,  // defaults to SIZE_MAX
                       /* .get_alloc_size   = */ nullptr,  // defaults to ggml_nbytes
                       /* .is_host          = */ nullptr,
                       },
    /* .device  = */ ggml_backend_reg_dev_get(ggml_backend_cpu_reg(), 0),
    /* .context = */ new ggml::cpu::repack::extra_buffer_type(),
};

This change fixes the following issue.

In llama-model-loader.cpp, the loading logic is:

    if (use_mmap) {
        ....
            ggml_backend_tensor_set(cur, data, 0, n_size);
        ....
    } else {
        const auto & file = files.at(weight->idx);

        if (ggml_backend_buffer_is_host(cur->buffer)) {
            file->seek(weight->offs, SEEK_SET);
            file->read_raw(cur->data, n_size);
            ....
        } else {
            ....
                read_buf.resize(n_size);
                file->seek(weight->offs, SEEK_SET);
                file->read_raw(read_buf.data(), n_size);
                ggml_backend_tensor_set(cur, read_buf.data(), 0, n_size);
            ....
            }
        }
    }

llama.cpp now uses direct I/O by default instead of mmap when loading models. Quantized weight extraction in the OpenVINO backend is implemented in ggml_backend_tensor_set, so loading must take the non-host path, which is the only branch that goes through ggml_backend_tensor_set. Therefore ggml_backend_buffer_is_host should return false for the OpenVINO buffer.


wine99 commented Jan 21, 2026

@ynimmaga @cavusmustafa Feel free to merge this while I’m offline if it looks good to you.


wine99 commented Jan 21, 2026

llama-bench -p 128 -n 32 is failing. It also fails on the dev_backend_openvino branch:

| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 1B Q4_1                  | 785.75 MiB |     1.24 B | OPENVINO   |  99 |  1 |           pp128 |        991.94 ± 0.00 |
terminate called after throwing an instance of 'std::runtime_error'
  what():  ggml tensor extra is not of type TENSOR for input: cache_k_l0


@cavusmustafa cavusmustafa left a comment


LGTM

@cavusmustafa cavusmustafa merged commit be2d4b6 into dev_backend_openvino Jan 21, 2026
1 check passed
