Performance degradation of Qwen3 0.6B on the QNN backend #18933

@yinrun

🚀 The feature, motivation and pitch

I am seeing a significant performance degradation when running this model on a Qualcomm SM8850.

Currently I get the results below. In an earlier attempt, the prefill speed reached about 4000 tokens/second and the decode speed exceeded 100 tokens/second. I am not sure what changed. Are there any suggestions?

I 00:00:05.299159 executorch:prompt_processor.cpp:267] Prompt Processor: total 94 prompt tokens (AR-128 * 1 iters)
I 00:00:05.378665 executorch:runner.cpp:462] RSS after prompt prefill: 770.968750 MiB (0 if unsupported)
I 00:00:10.963822 executorch:token_generator.cpp:356] Warning: Generation stopped at seq_len limit (512) without reaching EOS token. Response may be incomplete.
I 00:00:10.964286 executorch:token_generator.cpp:370] - seq_len (512) is less than compiled max_context_len (1024). Consider increasing --seq_len (up to 1024).
I 00:00:10.964308 executorch:runner.cpp:477] RSS after finishing text generation: 770.968750 MiB (0 if unsupported)
I 00:00:10.964396 executorch:stats.h:161] 	Prompt Tokens: 94    Generated Tokens: 417
I 00:00:10.964407 executorch:stats.h:167] 	Model Load Time:		5.266000 (seconds)
I 00:00:10.964419 executorch:stats.h:177] 	Total inference time:		5.695000 (seconds)		 Rate: 	73.222125 (tokens/second)
I 00:00:10.964432 executorch:stats.h:185] 		Prompt evaluation:	0.109000 (seconds)		 Rate: 	862.385321 (tokens/second)
I 00:00:10.964445 executorch:stats.h:196] 		Generated 417 tokens:	5.586000 (seconds)		 Rate: 	74.650913 (tokens/second)
I 00:00:10.964457 executorch:stats.h:204] 	Time to first generated token:	0.109000 (seconds)
I 00:00:10.964623 executorch:stats.h:211] 	Sampling time over 511 tokens:	0.717000 (seconds)
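
To make runs easier to compare, the rates in the log can be recomputed from the raw token counts and timings. The following is a minimal Python sketch, not part of ExecuTorch: the helper names `parse_stats` and `rates` are hypothetical, and the stats lines are copied from the log above with timestamps trimmed. It also reports a second prefill figure based on an assumption, namely that one AR-128 iteration processes 128 positions even when only 94 of them are real prompt tokens. If that holds, the reported prefill rate depends on prompt length, so an earlier run with a longer prompt would not be directly comparable.

```python
import re

# Stats lines copied from the log above (timestamps and source locations trimmed).
LOG = """\
Prompt Tokens: 94    Generated Tokens: 417
Total inference time:  5.695000 (seconds)  Rate:  73.222125 (tokens/second)
Prompt evaluation:  0.109000 (seconds)  Rate:  862.385321 (tokens/second)
Generated 417 tokens:  5.586000 (seconds)  Rate:  74.650913 (tokens/second)
"""


def parse_stats(log: str) -> dict:
    """Extract token counts and timings from the runner's stats lines."""
    tokens = re.search(r"Prompt Tokens:\s*(\d+)\s+Generated Tokens:\s*(\d+)", log)
    prefill = re.search(r"Prompt evaluation:\s*([\d.]+)", log)
    decode = re.search(r"Generated \d+ tokens:\s*([\d.]+)", log)
    return {
        "prompt_tokens": int(tokens.group(1)),
        "generated_tokens": int(tokens.group(2)),
        "prefill_seconds": float(prefill.group(1)),
        "decode_seconds": float(decode.group(1)),
    }


def rates(stats: dict, ar_len: int = 128) -> dict:
    """Recompute throughput in tokens/second.

    prefill_slots_tok_s is an assumption-based figure: it counts every
    position of each AR-`ar_len` iteration, including padding.
    """
    prompt = stats["prompt_tokens"]
    iters = -(-prompt // ar_len)  # ceiling division
    return {
        "prefill_iters": iters,
        "prefill_tok_s": prompt / stats["prefill_seconds"],
        "prefill_slots_tok_s": iters * ar_len / stats["prefill_seconds"],
        "decode_tok_s": stats["generated_tokens"] / stats["decode_seconds"],
    }


if __name__ == "__main__":
    for name, value in rates(parse_stats(LOG)).items():
        print(f"{name}: {value:.2f}" if isinstance(value, float) else f"{name}: {value}")
```

Running the same script over the log of the earlier, faster run (with the same prompt and the same `--prefill_ar_len`) would show whether the gap is in prefill, decode, or both.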

The export command is as follows.

python -m examples.qualcomm.oss_scripts.llama.llama \
    -b build-android \
    -m SM8850 \
    --temperature 0 \
    --model_mode hybrid \
    --prefill_ar_len 128 \
    --max_seq_len 1024 \
    --decoder_model qwen3-0_6b \
    --prompt "Hello" \
    --compile_only \
    -a /tmp/qwen3-models/qwen3_0_6B

