
Runs on CPU when Vulkan device is available #9

@dpblnt

Description
[ 36%] Building C object CMakeFiles/turboquant.dir/src/backend/vulkan/tq_vulkan_dispatch.c.o
[ 37%] Building C object CMakeFiles/turboquant_shared.dir/src/engine/tq_transformer.c.o
[ 37%] Building C object CMakeFiles/turboquant_shared.dir/src/backend/vulkan/tq_vulkan_dispatch.c.o
[ 38%] Building C object CMakeFiles/turboquant_shared.dir/src/backend/vulkan/tq_vulkan_init.c.o
/opt/TurboQuant.cpp/src/backend/vulkan/tq_vulkan_init.c: In function ‘tq_vk_select_device’:
/opt/TurboQuant.cpp/src/backend/vulkan/tq_vulkan_init.c:201:5: warning: ‘__builtin_strncpy’ output may be truncated copying 255 bytes from a string of length 255 [-Wstringop-truncation]
  201 |     strncpy(g_vk_state.device_name, props.deviceName,
      |     ^
[ 39%] Building C object CMakeFiles/turboquant.dir/src/backend/vulkan/tq_vulkan_init.c.o
/opt/TurboQuant.cpp/src/backend/vulkan/tq_vulkan_init.c: In function ‘tq_vk_select_device’:
/opt/TurboQuant.cpp/src/backend/vulkan/tq_vulkan_init.c:201:5: warning: ‘__builtin_strncpy’ output may be truncated copying 255 bytes from a string of length 255 [-Wstringop-truncation]
  201 |     strncpy(g_vk_state.device_name, props.deviceName,
      |     ^
[ 39%] Linking C static library libturboquant.a
[ 40%] Linking C shared library libturboquant.so
[ 40%] Built target turboquant_shared
[ 40%] Built target turboquant
./quant /mnt/models/gemma-3-4b-it-Q8_0.gguf -p "hello"
Loading model from /mnt/models/gemma-3-4b-it-Q8_0.gguf...
tq_load_model: detected GGUF format
tq_load_gguf: GGUF v3, 444 tensors, 40 metadata keys
tq_load_gguf: architecture = 'gemma3'
tq_load_gguf: hybrid attention detected — sliding head_dim=128 (metadata: 256)
tq_load_gguf: Gemma family detected (sliding_window=1024)
tq_load_gguf: config — layers=34, dim=2560, heads=8/8, head_dim=128, vocab=0
tq_load_gguf: Gemma norms already adjusted (mean=8.3, skipping +1.0)
tq_load_gguf: loaded 34 layers (34 self_attn), dim=2560, heads=8/8, vocab=262208
tq_load_gguf: load-time Q4 conversion enabled (est FP32 = 11.3 GB < 8 GB threshold)
tq_quantize_weights_q4: quantized to Q4 (1806 MB, was ~14450 MB FP32)
tq_load_gguf: Q4 conversion complete — fast matmul path active
tq_load_gguf: madvise(MADV_DONTNEED) on 3.8 GB mmap
Model: 34 layers, dim=2560, heads=8/8, head_dim=128, vocab=262208, inter=10240
KV cache type: uniform_4b, V quant: FP16
tq_load_tokenizer_from_tqm: not a TQM file
tq_load_tokenizer_from_gguf: loaded 262208 tokens (max_len=48)
Loaded tokenizer from GGUF metadata
Threads: 4
Prompt: hello
---
लिटdomமாறு takeaway ი굿ிக்கIssˇoundorestation nutshell Mansfield landing ungroup Terr =~عنی заня Medal^C

Inference maxed out the CPU while the GPU stayed idle.
