Qualcomm AI Engine Direct - CDSP Direct Mode #17326
abhinaykukkadapu merged 6 commits into pytorch:main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17326
Note: Links to docs will display an error until the docs builds have been completed.
❌ 4 New Failures, 3 Unrelated Failures as of commit 02bf8db with merge base 96672a4.
NEW FAILURES - The following jobs have failed:
BROKEN TRUNK - The following jobs failed but were present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Thanks for putting this up. It looks like most of this is setting up the IDL/stub/skel, while the backend runtime itself remained pretty much the same? One thing we also need is that the backend runtime should not perform any system configuration whatsoever. Is the code that manages DMA buffers and sets QNN/HTP perf mode still running in this direct mode? If not, could you explain how it is disabled in this PR?
Force-pushed 48d08f1 to dafc1bc.
Hi @sxu, For minimal runtime,
@winskuo-quic What I meant to say is that we don't want the ExecuTorch backend to perform any type of system configuration at all. We have a dedicated system service that makes global decisions on HTP power state in conjunction with all other components of the SoC, including CPU, GPU, memory subsystems, etc., taking into account the aggregated set of features currently running on the device. We can't have individual models put in their own power votes for HTP or enable RPC QoS, for example, as that interferes with our settings. My concern is, given the existing runtime is compiled as is for Hexagon, is it still setting some of the default configurations which we don't want?
@winskuo-quic Just to clarify, for the above requirements I'm assuming we are asked to use what's in the OSS runtime exactly as is on our Hexagon deployment (that's the impression I got from QC representatives and the PyTorch team). For that, we are basically asking for the current OSS runtime (the part that runs on Hexagon) to be stripped down to the bare minimum, so it only does inference and nothing else. I personally don't see a strong technical reason to use the runtime exactly as is, since we have very different requirements than the average developer deploying to mobile phones. An alternative is for us to take the OSS context binary and just create a Meta-internal version of the runtime that satisfies our needs. We already have a prototype and it's just a couple hundred LOC; all it does is parse the OSS AoT-generated binary, then call contextCreateFromBinary -> graphRetrieve -> graphExecute x N -> contextFree and nothing else. cc @cccclai
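For readers unfamiliar with the flow described above, a rough sketch of such a bare-minimum runtime loop follows. The handle, error, and function-pointer types are hypothetical stand-ins for the real QNN SDK types (e.g. `Qnn_ContextHandle_t`, `Qnn_ErrorHandle_t`, the `QnnInterface` function table), so this illustrates only the shape of the call sequence, not working QNN code:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for QNN SDK handle/error types (the real ones
// come from the QNN SDK headers).
using ContextHandle = void*;
using GraphHandle = void*;
using Err = int;  // 0 == success, mirroring QNN_SUCCESS

// Function-pointer table mirroring the four calls the prototype needs:
// contextCreateFromBinary -> graphRetrieve -> graphExecute x N -> contextFree.
struct QnnCalls {
  Err (*contextCreateFromBinary)(const uint8_t* blob, size_t size, ContextHandle* out);
  Err (*graphRetrieve)(ContextHandle ctx, const char* name, GraphHandle* out);
  Err (*graphExecute)(GraphHandle graph);
  Err (*contextFree)(ContextHandle ctx);
};

// The entire "inference only" runtime: no perf votes, no RPC QoS,
// no system configuration of any kind.
inline Err run_inference(const QnnCalls& qnn, const uint8_t* blob, size_t size,
                         const char* graph_name, int iterations) {
  ContextHandle ctx = nullptr;
  GraphHandle graph = nullptr;
  if (Err e = qnn.contextCreateFromBinary(blob, size, &ctx)) return e;
  if (Err e = qnn.graphRetrieve(ctx, graph_name, &graph)) {
    qnn.contextFree(ctx);
    return e;
  }
  for (int i = 0; i < iterations; ++i) {
    if (Err e = qnn.graphExecute(graph)) {
      qnn.contextFree(ctx);
      return e;
    }
  }
  return qnn.contextFree(ctx);
}
```

The point of the sketch is that everything outside this sequence (power configuration, QoS, DMA buffer policy) would be absent by construction rather than compiled out.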
Is there not an inherent, strong versioning risk? There's no explicit contract, to my knowledge, in the AoT flow on the output blob today. In the ET design it's tightly coupled with the backend that's intended to crack it open. When you fork the runtime, it imposes a contract that is difficult to maintain in OSS, because the second runtime is not visible.
There is a tradeoff; I just want to bring this up to the QC engineering team tasked to work on this to get their thoughts. The way the runtime is currently structured, I honestly find it pretty complex and the initialization flow hard to trace. If the host and target flows continue to be intermingled, with if-else and conditional compilation sprinkled here and there to satisfy our requirements, I worry it will be hard to maintain. I would be much more comfortable if there were a separate bare-bones runtime. Regarding the implicit contract for a fork, I think it would be between the OSS AoT flow and the fork; it would not be breaking any broader design decision that ET made. The majority of the compatibility is actually handled by QNN itself: we can use the OSS utility (QnnCustomProtocol) to extract the context binary, and that should be everything that's needed for inference.
I mostly just don't want to be in a state where changes to the OSS AoT flow cannot merge because they start breaking the fork, where the tests and logic are not available in OSS. QnnCustomProtocol seems like a fairly stable API surface; I just worry that if pressures mount, more hacky things will start getting done on the fork.
@sxu, Our codebase doesn't contain many HTP-specific configs. For the perf config, if you would like a short workaround in this PR to disable the feature, it can be done by removing the logic under https://github.com/pytorch/executorch/blob/main/backends/qualcomm/runtime/backends/htp/HtpDevice.cpp and https://github.com/pytorch/executorch/blob/main/backends/qualcomm/runtime/backends/htp/HtpDevice.h, so it looks something like the GPU backend, where we don't provide any config. For GPU reference, see: https://github.com/pytorch/executorch/blob/main/backends/qualcomm/runtime/backends/gpu/GpuDevice.h. Please let me know if you run into any issues; I can provide a patch if needed.

I understand the motivation for a bare-minimum inference runtime. However, creating a minimal runtime specifically for CDSP direct mode would be challenging, because certain backends require some QNN configs in order to function correctly (e.g., LPAI). In such cases, we would need to build a minimal runtime for each backend, which would be difficult to maintain in the long run. That said, I agree with your concern about the codebase structure being complex. To address this, we plan to work on file restructuring and CMake refactoring. This should reduce the number of macros required to enable a minimal runtime and improve the overall readability of the code.
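The "looks like the GPU, where we don't provide any config" pattern can be sketched as follows. The class and struct names here are illustrative, not the actual `GpuDevice` or `QnnDevice_Config_t` definitions from the tree:

```cpp
#include <vector>

// Illustrative stand-in for a QNN device config entry; the real type is
// QnnDevice_Config_t from the QNN SDK.
struct DeviceConfig {
  int option;
  int value;
};

// Sketch of the GPU-style pattern: the device exposes an empty config
// list, so no perf mode, power votes, or RPC QoS settings ever reach QNN.
class MinimalDevice {
 public:
  // Returns the device configs to pass down at device creation; empty here
  // because a direct-mode runtime should do inference only, with no system
  // configuration of any kind.
  std::vector<const DeviceConfig*> device_configs() const { return {}; }
};
```

With this shape, removing the HTP perf logic reduces to making the HTP device return the same empty list instead of populating power/QoS entries.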
Can you rebase so I can check the CI signal? I feel like we can have this as the first step and iterate on top of @sxu's feedback.
Force-pushed dafc1bc to bf68399.
Internal CI looks quite bad. I feel like it might be related to the file dataloader, but I'm not sure. For example: `executorch/runtime/executor/test:pte_data_map_test -- --exact 'fbcode//executorch/runtime/executor/test:pte_data_map_test - PteDataMapTest.UnimplementedMethods'`
@cccclai, |
Yeah, done. Let's wait for the signal.
Seems better now, and this is a different error. Can we apply the same compiler flag in the CMake build?
Force-pushed d67773d to 35adcc0.
Thanks for sharing the error log.
@winskuo-quic @cccclai I put up a forward fix for the internal CI issues; will let you know once that passes, and we can then land the changes.
@winskuo-quic My fix seems to fix all CI signals. Instead of me landing this diff and putting up a forward fix, can you apply this patch and rebase? I will then re-import and push.
Force-pushed 35adcc0 to 02bf8db.
Hi @abhinaykukkadapu, |
Summary: failing on trunk after pytorch#17326. Test plan: ci.
Summary
Support CDSP direct mode by defining ExecuTorch's customized rpc protocol.
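As a rough illustration of what a customized RPC protocol involves (the actual message layout lives in this PR's IDL/stub/skel code; the fields, magic value, and framing below are purely hypothetical):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Purely hypothetical wire format for a host -> CDSP request; the real
// protocol in the PR may differ in fields, framing, and endianness.
struct RpcHeader {
  uint32_t magic;         // sanity check on the skel side
  uint32_t opcode;        // e.g. 0 = load context, 1 = execute, 2 = free
  uint32_t payload_size;  // bytes following the header
};

static constexpr uint32_t kMagic = 0x45545250u;  // arbitrary 'ETRP' tag

// Serialize a request into a flat buffer suitable for a shared memory region.
std::vector<uint8_t> pack_request(uint32_t opcode, const uint8_t* payload,
                                  uint32_t size) {
  RpcHeader hdr{kMagic, opcode, size};
  std::vector<uint8_t> buf(sizeof(hdr) + size);
  std::memcpy(buf.data(), &hdr, sizeof(hdr));
  if (size) std::memcpy(buf.data() + sizeof(hdr), payload, size);
  return buf;
}

// Parse the header back out on the receiving (CDSP) side.
bool unpack_header(const std::vector<uint8_t>& buf, RpcHeader* out) {
  if (buf.size() < sizeof(RpcHeader)) return false;
  std::memcpy(out, buf.data(), sizeof(RpcHeader));
  return out->magic == kMagic;
}
```

The stub packs requests on the host, the skel unpacks and dispatches them on the CDSP; the real implementation additionally handles tensor I/O buffers.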
We have validated this PR with the following spec:

Please refer to the README file under `backends/qualcomm/runtime/backends/direct_mode/README.md` for setup. Please be aware of the Note section if you observe that the total execution time is slower than traditional mode.

Example Script

To build: `backends/qualcomm/scripts/build.sh --enable_hexagon`

To run traditional mode (same as usual): `python backends/qualcomm/tests/test_qnn_delegate.py -k TestQNNQuantizedOperator.test_qnn_backend_adaptive_avg_pool2d --model SM8750 --device $DEVICE --build_folder build-android`

To run direct mode (add `--direct_build_folder build-hexagon`): `python backends/qualcomm/tests/test_qnn_delegate.py -k TestQNNQuantizedOperator.test_qnn_backend_adaptive_avg_pool2d --model SM8750 --device $DEVICE --build_folder build-android --direct_build_folder build-hexagon/`

Test plan

Add `--direct_build_folder build-hexagon/` at the end of any `TestQNNQuantizedUtils`, `TestQNNQuantizedModel`, `TestQNNFloatingPointModel`, or `TestQNNFloatingPointOperator` test.

Author: @haowhsu-quic, @winskuo-quic