This is not a llama.cpp bug report.
I am asking here because this Qualcomm fork appears to be maintained by the people with the most relevant Hexagon / ggml-hexagon expertise, and I would like to confirm whether the behavior below is expected on current production devices.
Research Stage
Background Research
Previous existing literature and research
I am testing raw Hexagon v79 HVX FP8 instructions (vcvt(...f8) and vmpy(...f8, ...f8)) with a minimal standalone probe, outside of llama.cpp model evaluation.
Environment:
- Device: SM8750 / sun
- FastRPC capability query reports DSP arch: 0x8c79
- Hexagon SDK: 6.5.0.0
- Hexagon tools: 19.0.07
- Android NDK: r25c
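For reference, the reported arch value can be decoded on the host side. This is a minimal sketch, assuming the low byte of the capability word carries the Hexagon architecture version (so 0x8c79 would decode to v79). The function names are hypothetical and not part of any SDK.

```c
#include <stdint.h>

/* Assumption: the low byte of the FastRPC arch capability word is the
 * Hexagon architecture version (0x79 for v79). The upper bits are ignored. */
static unsigned hexagon_arch_from_capability(uint32_t capability)
{
    return (unsigned)(capability & 0xFFu);
}

/* Hypothetical helper: non-zero when the decoded version is exactly v79. */
static int capability_reports_v79(uint32_t capability)
{
    return hexagon_arch_from_capability(capability) == 0x79u;
}
```

With the value reported above, `capability_reports_v79(0x8c79)` is non-zero under that assumption, which matches the `-mv79` target used for the probe.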
What I have already verified:
- The generated v79 object code does contain FP8 instructions such as:
  v?.f8 = vcvt(...)
  v?:?.hf = vcvt(v?.f8)
  v?:?.hf = vmpy(v?.f8, v?.f8)
- The same source works correctly in hexagon-sim -mv79. In the simulator, the FP8 conversion / multiply path produces correct non-zero results.
- On device, inside FastRPC user PD, the same probe returns all zeros:
  - hf -> f8 gives all-zero bytes
  - f8 -> hf gives all-zero half values
  - f8 * f8 -> hf gives all-zero half values
- This remains true for:
  - intrinsic-generated code
  - inline asm
  - direct raw FP8 byte input followed by vcvt / vmpy
- I also tried:
  - qurt_hvx_lock(QURT_HVX_MODE_128B)
  - multiple qfloat codegen modes: strict-ieee, ieee, lossy, legacy
  None of these changed the on-device result.
- I also traced existing vendor HTP/QNN paths on the same device. What I observed is:
  - the working vendor QNN HTP runtime goes through CDSP unsigned PD
  - if I force the same path to signed PD, remote_handle_open / remote_handle64_open fails before execution
So at the moment I cannot tell whether this is:
- an expected production-device limitation
- an unsigned-PD limitation
- a DSP image / firmware limitation
- or something else specific to the runtime environment
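For the raw-byte variant of the probe, a host-side reference decoder makes it easy to pick input bytes whose expected result is known to be non-zero. The sketch below assumes the OCP E4M3 encoding (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities). Nothing in this report confirms which FP8 format the v79 instructions use, so treat the format and the function name as assumptions.

```c
#include <math.h>
#include <stdint.h>

/* Decode one FP8 byte to float, ASSUMING OCP E4M3 layout:
 * bit 7 = sign, bits 6..3 = exponent (bias 7), bits 2..0 = mantissa.
 * Only exponent=15 with mantissa=7 is NaN; there are no infinities. */
static float fp8_e4m3_to_float(uint8_t b)
{
    const int sign = (b >> 7) & 0x1;
    const int exp  = (b >> 3) & 0xF;
    const int man  = b & 0x7;
    float v;

    if (exp == 0xF && man == 0x7) {
        return NAN;
    }
    if (exp == 0) {
        /* Subnormal: (man / 8) * 2^-6 == man / 512 */
        v = (float)man / 512.0f;
    } else {
        /* Normal: (1 + man / 8) * 2^(exp - 7) */
        const int e = exp - 7;
        v = 1.0f + (float)man / 8.0f;
        if (e >= 0) {
            v *= (float)(1 << e);
        } else {
            v /= (float)(1 << -e);
        }
    }
    return sign ? -v : v;
}
```

Under that assumption, byte 0x38 decodes to 1.0 and 0x40 to 2.0, so feeding such bytes through f8 -> hf or f8 * f8 -> hf should not legitimately produce an all-zero vector.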
Artifact bundle (logs and minimal sources):
Hypothesis
My current working hypothesis is:
- code generation is correct
- simulator behavior is correct
- the issue is in the actual device execution environment, not in the C/intrinsic source itself
More specifically, it looks like on this SM8750/sun production image, the HVX FP8 datapath is not actually usable from the FastRPC user-PD path that is otherwise available to normal workloads.
However, I do not know whether that is:
- expected platform behavior
- a policy restriction
- or something that should work on supported Qualcomm Hexagon runtimes
Implementation
Minimal standalone repro only.
This is not tied to a llama.cpp model or to llama.cpp FP8 code, and I am not claiming that this repository itself introduced the issue.
The reason for posting here is purely to ask the maintainers with Qualcomm Hexagon expertise:
Is it expected that v79 HVX FP8 vcvt / vmpy work in simulator but return all zeros on-device in the normal FastRPC user-PD path on SM8750/sun?
And if this is expected:
What execution environment is actually required for correct HVX FP8 execution on such a device?
For example:
- signed PD only?
- a different vendor runtime path?
- a specific DSP image capability?
- not supported at all on production user-accessible paths?
Analysis
Observed facts:
- simulator: correct non-zero FP8 results
- on-device FastRPC user PD: all-zero FP8 results
- forcing signed PD on the vendor QNN path causes an open failure, so I could not validate FP8 there
This strongly suggests that the problem is not in the source-level implementation of the FP8 instructions, but in the platform/runtime environment available on the device.
Relevant log output
On-device probe:
- hf -> f8 : all zeros
- f8 -> hf : all zeros
- f8 * f8 -> hf : all zeros
Simulator (`hexagon-sim -mv79`) output:
- same source produces correct non-zero FP8 conversion/multiply results
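One way to make the "all zeros" observation more precise is to pre-fill the output buffer with a sentinel before running the probe, then classify what comes back. This is a minimal host-checkable sketch, not the probe's actual code. The helper names and the 0xA5 sentinel are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PROBE_SENTINEL 0xA5u  /* hypothetical fill pattern */

/* Fill the output buffer with a known pattern before running the probe. */
static void probe_fill_sentinel(void *buf, size_t len)
{
    assert(buf != NULL);
    memset(buf, PROBE_SENTINEL, len);
}

/* Count bytes in the buffer that are non-zero. */
static size_t probe_count_nonzero(const void *buf, size_t len)
{
    const uint8_t *p = (const uint8_t *)buf;
    size_t n = 0;
    assert(buf != NULL);
    for (size_t i = 0; i < len; ++i) {
        if (p[i] != 0) {
            ++n;
        }
    }
    return n;
}

/* Count bytes that still hold the sentinel, i.e. were likely never written. */
static size_t probe_count_sentinel(const void *buf, size_t len)
{
    const uint8_t *p = (const uint8_t *)buf;
    size_t n = 0;
    assert(buf != NULL);
    for (size_t i = 0; i < len; ++i) {
        if (p[i] == PROBE_SENTINEL) {
            ++n;
        }
    }
    return n;
}
```

If the buffer comes back with zero non-zero bytes and zero sentinel bytes, the stores ran and wrote zeros. If the sentinel survives, the result was never written, which would point at a different failure than an FP8 datapath that yields zero.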
I can provide:
- full on-device probe log
- simulator output
- disassembly snippet showing emitted FP8 instructions
- signed/unsigned PD tracing logs