Description
I am currently deploying the Llama-3.2-1B-Instruct model on QCS8550. I customized the QNN partitioner to keep Linear operations on the CPU while delegating all non-Linear operations to the QNN backend (HtpV73). This partitioning produced 66 QNN backend subgraphs and, consequently, 66 context binaries serialized into the resulting *.pte file.
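For reference, the net effect of my partitioner change is roughly equivalent to putting the Linear op into the partitioner's skip set. The sketch below is illustrative only: my real change is a custom filter inside the partitioner, and the import paths, chipset value, and compile-spec arguments are examples based on my understanding of the API, not an exact copy of my setup.

# Illustrative sketch only: keep aten.linear on CPU, delegate everything else
# to the QNN backend. My actual change is a custom filter inside the
# partitioner; paths/arguments here are assumptions for illustration.
from executorch.backends.qualcomm.partition.qnn_partitioner import QnnPartitioner
from executorch.backends.qualcomm.serialization.qc_schema import QcomChipset
from executorch.backends.qualcomm.utils.utils import (
    generate_htp_compiler_spec,
    generate_qnn_executorch_compiler_spec,
)

backend_options = generate_htp_compiler_spec(use_fp16=False)
compiler_specs = generate_qnn_executorch_compiler_spec(
    soc_model=QcomChipset.SM8550,  # QCS8550 uses the HtpV73 architecture
    backend_options=backend_options,
)

# Nodes whose op appears in skip_node_op_set are not tagged for delegation, so
# the graph is cut around every Linear, producing many QNN subgraphs
# (66 context binaries in my case).
partitioner = QnnPartitioner(
    compiler_specs,
    skip_node_op_set={"aten.linear.default"},
)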
However, execution on the QCS8550 failed. The error log indicates an issue during context loading.
I have attached the detailed log for reference:
err_20251009.log
The source code is from the main branch, at the following commit:
commit cf6e895c53bd1052f1266821a76bd7c5a85ace52 (HEAD -> dev)
Author: Šimon Strýček <simon.strycek@nxp.com>
Date: Tue Sep 16 11:33:01 2025 +0200
NXP backend: Relocation of remove_io_quant_ops_pass.py (#14202)
### Summary
Relocate `remove_io_quant_ops_pass.py` to `nxp/edge_passes`.
### Test plan
Should be covered by already existing unit tests.
Co-authored-by: Roman Janik <roman.janik@nxp.com>
Thank you for your time!
Analysis
(1) It appears that the QnnExecuTorchBackend first loads and manages all 66 context binaries upfront, then executes the specific graph stored in a given context binary as needed, and finally destroys all context binaries and other resources together.
(2) I believe the 66 context binaries themselves are valid, because I successfully executed all of them sequentially (load one context binary → execute its graph → destroy the context binary). The main modification for this test is as follows:
Result<DelegateHandle*> QnnExecuTorchBackend::init(
    BackendInitContext& context,
    FreeableBuffer* processed,
    ArrayRef<CompileSpec> compile_specs) const {
  // convert CompileSpec to qnn ExecuTorch option
  for (auto& compile_spec : compile_specs) {
    if (std::strcmp(compile_spec.key, QNN_COMPILE_SPEC) == 0)
      qnn_executorch_options_ =
          GetQnnExecuTorchOptions(compile_spec.value.buffer);
    else
      QNN_EXECUTORCH_LOG_WARN("unknown argument: %s", compile_spec.key);
  }
  // For this test, init() creates no QNN resources; the raw processed buffer
  // is returned as the delegate handle and consumed later in execute().
  ET_LOG(Info, "QnnExecuTorchBackend::init() is a dummy function.");
  return processed;
}

Error QnnExecuTorchBackend::execute(
    BackendExecutionContext& context,
    DelegateHandle* handle,
    Span<EValue*> args) const {
  FreeableBuffer* processed = (FreeableBuffer*)(handle);
  QnnExecuTorchContextBinary qnn_context_blob;
  auto [status, signature, ctx_size, ctx_bin] =
      QnnContextCustomProtocol().DeserializeContextCustomBuffer(
          const_cast<void*>(processed->data()));
  if (status == Error::Ok) {
    QNN_EXECUTORCH_LOG_INFO(
        "Deserializing processed data using QnnContextCustomProtocol");
    // After this stage, qnn_context_blob.nbytes & qnn_context_blob.buffer will
    // only store qnn_context_binary.
    qnn_context_blob.nbytes = ctx_size;
    qnn_context_blob.buffer = ctx_bin;
    // Dump the raw context binary for offline inspection (write_file and
    // backend_cnt_ are my own debug helpers).
    std::string file_name =
        "contexts/context_" + std::to_string(backend_cnt_) + ".txt";
    write_file(file_name.c_str(), ctx_bin, ctx_size);
  } else {
    // This buffer will be verified again in QnnBackendCache.
    QNN_EXECUTORCH_LOG_INFO("Deserializing processed data using Dlc");
    qnn_context_blob.buffer = const_cast<void*>(processed->data());
    qnn_context_blob.nbytes = processed->size();
  }
  // Create QnnManager
  MemoryAllocator* runtime_allocator = context.get_temp_allocator();
  QnnManager* qnn_manager = runtime_allocator->allocateInstance<QnnManager>();
  if (qnn_manager == nullptr) {
    return Error::MemoryAllocationFailed;
  }
  // NOTE: Since we use placement new and since this type is not trivially
  // destructible, we must call the destructor manually in destroy().
  new (qnn_manager) QnnManager(qnn_executorch_options_, qnn_context_blob);
  // TODO: this is a temporal solution for multi-graph support, will be
  // removed once framework starts to accept runtime configuration
  // ---
  // check if current context binary has already been initialized
  // return cached one for reducing memory footprint
  ET_CHECK_OR_RETURN_ERROR(
      qnn_manager->Init() == Error::Ok,
      Internal,
      "Fail to initialize Qnn Manager");
  ......
  ET_CHECK_OR_RETURN_ERROR(
      qnn_manager->Execute(
          method_name,
          input_tensor_structs,
          output_tensor_structs,
          context.event_tracer()) == Error::Ok,
      Internal,
      "Fail to execute graph");
  ET_CHECK_OR_RETURN_ERROR(
      qnn_manager->ProfileExecuteData(method_name, context.event_tracer()) ==
          Error::Ok,
      Internal,
      "Fail to profile graph");
  qnn_manager->Destroy();
  return Error::Ok;
}
cc @cccclai @winskuo-quic @shewu-quic @haowhsu-quic @DannyYuyang-quic @cbilgin