Qualcomm AI Engine Direct - Support attention sink for long context use case #16574
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16574
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 1 Unrelated Failure as of commit b75fe29 with merge base 47b8d1d.
NEW FAILURE - The following job has failed.
BROKEN TRUNK - The following job failed but was present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "release notes: qualcomm"

Hi @cccclai,

Hi, sorry for being a bit late on this. I'm currently out and will take a look next week.
cccclai left a comment:
Thank you for enabling attention sink, this is great!
```diff
 def get_8a4w_qnn_ptq_config(
-    act_symmetric: bool = False,
+    act_symmetric: bool = True,
```
Is there any specific reason we make act_symmetric default to True?
Are you referring to the original design?
This configuration is actually for the 8-bit KV cache. In the case of QK @ V, according to the QNN documentation, the second input V, treated as weight, is expected to be signed and symmetrically quantized. For some models, we try to annotate the value projection (wv) with 8a4w to improve performance, and we use symmetric quantization to avoid a convert op, which converts asymmetric to symmetric quantization.

```
QK (16 bits) ────────────────────────────────────────────┬─> matmul op (16 bits)
past v (8 bits symmetric) ─────────┬─> cat op (8 bits symmetric) ─┘
value projection (new v) (8 bits symmetric) ─┘
```
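To illustrate why a convert op is needed when encodings differ, here is a small sketch of 8-bit symmetric vs asymmetric quantization (illustrative only, not the QNN quantizer API; function names are hypothetical):

```python
def quantize_symmetric_int8(x: float, scale: float) -> int:
    """Symmetric: zero_point is fixed at 0, signed range [-128, 127]."""
    return max(-128, min(127, round(x / scale)))

def quantize_asymmetric_uint8(x: float, scale: float, zero_point: int) -> int:
    """Asymmetric: zero_point shifts the range, e.g. unsigned [0, 255]."""
    return max(0, min(255, round(x / scale) + zero_point))

# The same real value maps to different integer encodings, so feeding an
# asymmetric tensor into an op that expects a symmetric one requires a
# convert op to remap scale/zero_point. Annotating the whole path as
# symmetric avoids that extra op.
print(quantize_symmetric_int8(0.5, 0.01))         # 50
print(quantize_asymmetric_uint8(0.5, 0.01, 128))  # 178
```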
Example:
```bash
# Compile llama pte file and attention sink rope pte file with sink_size = 4 and batch_eviction_size = 64
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --checkpoint consolidated.00.pth --params params.json --tokenizer_model tokenizer.model --decoder_model llama3_2-1b_instruct --model_mode hybrid --prefill_ar_len 128 --max_seq_len 4096 --max_context_len 1024 --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1 --use_attention_sink 4,64 --compile_only
```
Do we have definition somewhere to explain the difference between max_seq_len and max_context_len?
Thanks for pointing that out. I just tried to follow ExecuTorch's LLM naming. Let me add some descriptions for it.
After running this, the `attention_sink_evictor.pte` file will be generated in the artifacts directory. This file is necessary for the attention sink feature: at runtime it removes `batch_eviction_size` tokens from the key and value caches and re-rotates the key cache.
> remove batch_eviction_size of the key and value cache and re-rotates the key cache at runtime.

What does it mean?
Thanks for catching!
```bash
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --checkpoint consolidated.00.pth --params params.json --tokenizer_model tokenizer.model --decoder_model llama3_2-1b_instruct --model_mode hybrid --prefill_ar_len 128 --max_seq_len 4096 --prompt "I would like to learn python, could you teach me with a simple example?" "Could you give a more difficult example in python?" "Could you add a GUI for this game?" "Could you tell me more about tkinter?" "Is it possible to deploy on a website?" --pre_gen_pte ${PATH_TO_ARTIFACT_IN_1ST_RUN} --use_attention_sink 4,64
```
If you want to modify `sink_size` or `batch_eviction_size`, or if you have a pre-compiled LLM pte file and wish to use the attention sink feature, you can recompile the `attention_sink_evictor.pte` with a different attention sink config.
Can you also elaborate batch_eviction_size? I guess this doc is good to explain how to use it, but some of the concept is coupled with attention sink itself
Got it. Let me elaborate more on the attention sink mechanism.
As far as I know, attention sink is a way to evict the cache when the maximum context length is reached.
There are two main concepts in attention sink:
- Maintain attention sinks: always keep several initial tokens as attention sinks in the KV cache.
- Redefine positional context: use positions relative to the cache instead of absolute positions from the original text, enhancing relevance and coherence in generated responses.

When the cache reaches capacity, follow these three steps:
1. Keep the first `sink_size` tokens in the KV cache.
2. Remove `eviction_batch_size` tokens from the KV cache.
3. Rotate the remaining KV cache to maintain its positional relationship.

Afterward, you can continue generating tokens until the cache reaches capacity again, then repeat the process.
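The three eviction steps can be sketched on a list of cached token positions (a simplified illustration with hypothetical names; the runtime operates on KV cache tensors of shape [batch, num_head, max_context_len, head_dim]):

```python
def evict_with_sink(cache, sink_size, eviction_batch_size):
    """Evict eviction_batch_size entries while keeping the first
    sink_size entries (the attention sinks)."""
    # Step 1: keep the attention sinks; step 2: drop the next batch.
    kept = cache[:sink_size] + cache[sink_size + eviction_batch_size:]
    # Step 3: positions are cache-relative, so the surviving entries are
    # re-addressed as 0..len(kept)-1 (the key cache is re-rotated to match).
    new_positions = list(range(len(kept)))
    return kept, new_positions

# A full cache of 10 original token positions, sink_size=4, eviction_batch_size=2:
kept, positions = evict_with_sink(list(range(10)), 4, 2)
print(kept)       # [0, 1, 2, 3, 6, 7, 8, 9]
print(positions)  # [0, 1, 2, 3, 4, 5, 6, 7]
```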
```python
parser.add_argument(
    "--max_seq_len",
    help="The maximum length of sequence to evaluate.",
```
It's still not super clear the difference between max_context_len and max_seq_len...
Would the following options work, or do you have any recommendations?
- `max_seq_len`: Maximum sequence length the model can handle
- `max_context_len`: Maximum length of the model's memory/cache
> max_seq_len: Maximum sequence length the model can handle

Just to confirm, does it mean `max_seq_len = max_context_len + {max decode length}`?
For instance, the KV cache has the shape `[batch, num_head, max_context_len, head_dim]`.
Previously, we could only generate up to `max_context_len - num_of_prompt_token` tokens.
With attention sink enabled, it's possible to generate more tokens than `max_context_len - num_of_prompt_token`.
The `max_seq_len` parameter determines the maximum number of tokens that can be generated.
I feel like it's clearer to say:
- `max_seq_len`: Maximum sequence length the model can generate
- `max_context_len`: Maximum length of the model's memory/cache, including both prompt tokens and generated tokens
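The relationship between the two limits can be made concrete with hypothetical numbers (illustrative values only, not defaults from the script):

```python
# Hypothetical values to illustrate the two limits.
max_context_len = 1024    # KV cache capacity (model's memory)
max_seq_len = 4096        # total sequence length to evaluate
num_prompt_tokens = 100

# Without attention sink, generation stops once the cache is full.
max_generated_without_sink = max_context_len - num_prompt_tokens
print(max_generated_without_sink)  # 924

# With attention sink, eviction keeps freeing cache slots, so generation
# can continue up to the evaluation limit instead.
max_generated_with_sink = max_seq_len - num_prompt_tokens
print(max_generated_with_sink)  # 3996
```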
```python
    default=8,
    type=int,
)
parser.add_argument(
```
Does it work for all llms?
Yes, I think it should work for all LLMs.
```
sin(delta), cos(delta))
where delta = new_position * theta - original_position * theta
```

Based on https://github.com/huggingface/transformers/blame/main/src/transformers/cache_utils.py#L961
This is helpful, though main might change, can we use a link from a commit instead?
Good catch! Thanks
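The delta rotation can be sketched for a single 2-D RoPE pair (a minimal illustration assuming the standard rotary embedding convention; function names are hypothetical):

```python
import math

def rotate(x0, x1, position, theta):
    """Apply RoPE to one (x0, x1) pair at the given position."""
    a = position * theta
    return (x0 * math.cos(a) - x1 * math.sin(a),
            x0 * math.sin(a) + x1 * math.cos(a))

def rerotate(y0, y1, original_position, new_position, theta):
    """Re-rotate an already-rotated cached key from its original position
    to its new cache-relative position by applying the angle delta."""
    delta = new_position * theta - original_position * theta
    return (y0 * math.cos(delta) - y1 * math.sin(delta),
            y0 * math.sin(delta) + y1 * math.cos(delta))

# Re-rotating a key cached at position 9 down to position 5 matches
# rotating the raw key at position 5 directly, because the two rotations
# compose: 9*theta + (5-9)*theta == 5*theta.
theta = 0.1
cached = rotate(1.0, 2.0, 9, theta)
moved = rerotate(cached[0], cached[1], 9, 5, theta)
direct = rotate(1.0, 2.0, 5, theta)
```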
```diff
     int32_t seq_len,
     std::function<void(const std::string&)> token_callback,
-    bool dump_logits) {
+    bool dump_logits,
```
Does attention sink work with lookahead?
Yes. The attention sink feature is a way to evict the cache when the number of generated tokens reaches `max_context_len`, while the lookahead method is used to guess more tokens at once to enhance performance, so the two are compatible.
```python
    QuantDtype.use_8a4w,
    False,
    act_observer=MinMaxObserver,
    act_symmetric=True,
```
Not super clear when to use act_symmetric and when not..
Typically, the weight should be symmetric. So, if the subsequent operation involves weight, you can annotate the path as symmetric to prevent the need for a convert operation.
Force-pushed from cc7cea3 to a9ab580.
Hi @cccclai, I've rebased the PR.
Hi, can you rebase again?
Force-pushed from a9ab580 to f833c5c.
Sure, I have rebased and refactored the statement. Thank you.
There are some internal failures... and another error.
Resolved the naming in the TARGETS file.
It conflicts because another qcom PR landed... can you rebase again? Sorry for the back and forth.
- Support narrow operation
- Support attention sink for static llama
- Include the --max_context_len option to set the maximum length for the
model's memory, and use max_seq_len to define the maximum sequence
length for evaluation.
- Specified --use_attention_sink <sink_size>,<eviction_batch_size> to
enable attention sink feature in llama.py
- Add more descriptions for attention sink feature and related parameters
- Behavior matrix in `llama.py`
- Given that `--compile_only`:
- Specify `--use_attention_sink` -> Compile the LLM and the attention sink model
- Otherwise, -> Compile the LLM only
- Given that `--pre_gen_pte`:
  - Specify `--use_attention_sink` -> If the criteria below are not met, compile the attention sink model before running inference, then run inference with attention sink
    - Check that the attention sink model exists
    - Verify that sink_size and eviction_batch_size are identical
  - Otherwise, -> Run inference without attention sink
- Neither `--compile_only` nor `--pre_gen_pte`:
  - Specify `--use_attention_sink` -> Compile the LLM and the attention sink model, and run inference with attention sink
  - Otherwise, -> Compile the LLM and run inference without attention sink
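The behavior matrix above can be sketched as a small decision function (flag and helper names are illustrative, not the actual llama.py implementation):

```python
def plan(compile_only, pre_gen_pte, use_attention_sink,
         evictor_exists=False, config_matches=False):
    """Return the action llama.py would take for a given flag combination."""
    if compile_only:
        return "compile LLM + attention sink model" if use_attention_sink else "compile LLM"
    if pre_gen_pte:
        if not use_attention_sink:
            return "run inference without attention sink"
        # Reuse the pre-built evictor only if it exists and its
        # sink_size / eviction_batch_size match; otherwise rebuild it.
        if evictor_exists and config_matches:
            return "run inference with attention sink"
        return "compile attention sink model, then run inference with attention sink"
    if use_attention_sink:
        return "compile LLM + attention sink model, run inference with attention sink"
    return "compile LLM, run inference without attention sink"

print(plan(compile_only=True, pre_gen_pte=False, use_attention_sink=True))
# compile LLM + attention sink model
```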
- Test for narrow op:
- python backends/qualcomm/tests/test_qnn_delegate.py -k
TestQNNFloatingPointOperator.test_qnn_backend_narrow --model SM8750
--device $DEVICE --build_folder build-android
- python backends/qualcomm/tests/test_qnn_delegate.py -k
TestQNNQuantizedOperator.test_qnn_backend_narrow --model SM8750 --device
$DEVICE --build_folder build-android
- Test for attention sink in llama.py
- python backends/qualcomm/tests/test_qnn_delegate.py
TestExampleLLMScript.test_attention_sink --model SM8750 --device $DEVICE
-b build-android -a unit_test
Force-pushed from 2d3fd14 to b75fe29.
No problem. I have rebased.
I've forward-fixed the internal error; we should be able to merge this after the import finishes.
Summary
- Support narrow operation
- Support attention sink for static llama
- Include the `--max_context_len` option to set the maximum length for the model's memory, and use `max_seq_len` to define the maximum sequence length for evaluation.
- Specify `--use_attention_sink <sink_size>,<eviction_batch_size>` to enable the attention sink feature in llama.py
- Add more descriptions for the attention sink feature and related parameters
- Behavior matrix in `llama.py`
  - Given that `--compile_only`:
    - Specify `--use_attention_sink` -> Compile the LLM and the attention sink model
    - Otherwise, -> Compile the LLM only
  - Given that `--pre_gen_pte`:
    - Specify `--use_attention_sink` -> If the criteria below are not met, compile the attention sink model before running inference, then run inference with attention sink
    - Otherwise, -> Run inference without attention sink
  - Neither `--compile_only` nor `--pre_gen_pte`:
    - Specify `--use_attention_sink` -> Compile the LLM and the attention sink model, and run inference with attention sink
    - Otherwise, -> Compile the LLM and run inference without attention sink

Test plan
- Test for narrow op:
  - `python backends/qualcomm/tests/test_qnn_delegate.py -k TestQNNFloatingPointOperator.test_qnn_backend_narrow --model SM8750 --device $DEVICE --build_folder build-android`
  - `python backends/qualcomm/tests/test_qnn_delegate.py -k TestQNNQuantizedOperator.test_qnn_backend_narrow --model SM8750 --device $DEVICE --build_folder build-android`
- Test for attention sink in llama.py:
  - `python backends/qualcomm/tests/test_qnn_delegate.py TestExampleLLMScript.test_attention_sink --model SM8750 --device $DEVICE -b build-android -a unit_test`

Results
Run with attention sink in llama.py:

```bash
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s $DEVICE -m SM8750 --checkpoint consolidated.00.pth --params params.json --tokenizer_model tokenizer.model --decoder_model llama3_2-1b_instruct --model_mode kv --max_seq_len 4096 --max_context_len 1024 --prompt "I would like to learn python, could you teach me with a simple example?" "Could you give a more difficult example in python?" "Could you add a GUI for this game?" "Could you tell me more about tkinter?" "Is it possible to deploy on a website?" --tasks wikitext --limit 1 --use_attention_sink 4,32
```

Set llama 3.2 1b instruct to a max context length of 1024, and activate the attention sink feature with a sink_size of 4 and eviction_batch_size of 32. Then, run a multi-turn conversation with a sequence length of 4096 using five prompts:
"I would like to learn python, could you teach me with a simple example?",
"Could you give a more difficult example in python?",
"Could you add a GUI for this game?",
"Could you tell me more about tkinter?",
"Is it possible to deploy on a website?"
The performance of each run is nearly identical.
cc: @haowhsu-quic