Add Qwen 3.5 MoE to cuda-perf CI and add prefill throughput tracking#18903
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18903

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV — there is 1 currently active SEV. If your PR is affected, please view it below.

❌ 3 New Failures, 2 Cancelled Jobs, 4 Unrelated Failures as of commit 9e72e8b with merge base 656850a:

- NEW FAILURES — the following jobs have failed.
- CANCELLED JOBS — the following jobs were cancelled; please retry.
- FLAKY — the following job failed, but likely due to flakiness present on trunk.
- BROKEN TRUNK — the following jobs failed, but the failures were already present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
- Add PyTorchObserver stats output to the qwen3_5_moe runner (enables cuda_benchmark.py parsing), a --prompt_file flag, and GPU memory stats
- Add a prefill_throughput metric to cuda_benchmark.py (prefill tok/s alongside the existing decode tok/s)
- Add Qwen3.5-35B-A3B-HQQ-INT4 to cuda-perf.yml with a >1000-token prompt and 512 output tokens, on linux.aws.a100
- Align cuda-perf.yml triggers with cuda.yml (push to main/release, ciflow/cuda tags, PRs touching backends/cuda and backends/aoti paths)
- Remove random model selection and the schedule trigger; always run all models when triggered
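The prefill_throughput metric can be sketched as follows. This is a minimal, hypothetical reconstruction of what cuda_benchmark.py's parsing might look like, not the actual implementation: the exact field names in the `PyTorchObserver {...}` JSON line (`prompt_tokens`, `generated_tokens`, `inference_start_ms`, `prompt_eval_end_ms`, `inference_end_ms`) are assumptions, as is the demo payload.

```python
import json
import re


def parse_pytorch_observer(stdout: str) -> dict:
    """Extract the JSON payload from a 'PyTorchObserver {...}' line.

    Field names used downstream are assumptions modeled on common
    ExecuTorch LLM runner stats, not taken from this PR's diff.
    """
    match = re.search(r"PyTorchObserver\s+(\{.*\})", stdout)
    if match is None:
        raise ValueError("no PyTorchObserver line found in runner output")
    return json.loads(match.group(1))


def throughputs(stats: dict) -> tuple[float, float]:
    """Return (prefill_tok_s, decode_tok_s) from millisecond timestamps."""
    prefill_s = (stats["prompt_eval_end_ms"] - stats["inference_start_ms"]) / 1000.0
    decode_s = (stats["inference_end_ms"] - stats["prompt_eval_end_ms"]) / 1000.0
    return stats["prompt_tokens"] / prefill_s, stats["generated_tokens"] / decode_s


if __name__ == "__main__":
    # Hypothetical runner output for illustration only
    demo = ('PyTorchObserver {"prompt_tokens": 1024, "generated_tokens": 512, '
            '"inference_start_ms": 0, "prompt_eval_end_ms": 800, '
            '"inference_end_ms": 11040}')
    prefill, decode = throughputs(parse_pytorch_observer(demo))
    print(f"prefill: {prefill:.1f} tok/s, decode: {decode:.1f} tok/s")
```

With the demo numbers above, prefill is 1024 tokens in 0.8 s (1280.0 tok/s) and decode is 512 tokens in 10.24 s (50.0 tok/s).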
Force-pushed from a23b5b1 to 49d0aa1
Force-pushed from e50e3fa to 36213f9
TOKENIZER="model_artifacts/tokenizer.json"
# Generate a >1000 token prompt for benchmarking
python3 -c "
text = 'The quick brown fox jumps over the lazy dog. ' * 200
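For reference, the one-liner repeats a 9-word sentence 200 times, giving 1,800 words, which comfortably exceeds 1,000 tokens under typical BPE tokenizers. A self-contained sketch (the `prompt.txt` output path is hypothetical; the workflow snippet above is truncated before the write step):

```python
# Repeat a 9-word pangram 200 times -> 1800 words, well over 1000 tokens
text = 'The quick brown fox jumps over the lazy dog. ' * 200

# Write it where the runner's --prompt_file flag could pick it up
# (file name is an assumption, not taken from the workflow)
with open('prompt.txt', 'w') as f:
    f.write(text)

print(len(text.split()))  # 1800
```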
Can we just check in a file with several different large prompts, instead of creating one on the fly, and use those everywhere? Such repetition can sometimes lead to weird responses. For pure performance benchmarking this is OK, but since this is a CI, in the case of a suspect diff we will come look at the output, and it had better not be garbage because of this prompt. :)
Have created a specific file for the long prompt.
DEFINE_string(data_path, "", "Data file (.ptd) for CUDA backend.");
DEFINE_string(tokenizer_path, "", "HuggingFace tokenizer.json path.");
DEFINE_string(prompt, "Hello", "Prompt text.");
DEFINE_string(
stats.gpu_peak_usage_mb =
    (stats.gpu_total_bytes - gpu_free_bytes) / 1024.0 / 1024.0;

llm::print_report(stats);
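The peak-usage line converts bytes in use (total minus free, cudaMemGetInfo-style) into MiB by dividing by 1024 twice. A minimal sketch of the same arithmetic with hypothetical values:

```python
# Hypothetical free/total byte counts, as cudaMemGetInfo would report them
gpu_total_bytes = 40 * 1024**3  # e.g. a 40 GiB A100
gpu_free_bytes = 10 * 1024**3

# Same conversion as the runner snippet: bytes in use -> MiB
gpu_peak_usage_mb = (gpu_total_bytes - gpu_free_bytes) / 1024.0 / 1024.0
print(gpu_peak_usage_mb)  # 30720.0, i.e. 30 GiB expressed in MiB
```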
Force-pushed from df2a4bf to 69b5042
…18903)

- Add PyTorchObserver stats output to the qwen3_5_moe runner (enables cuda_benchmark.py parsing), a --prompt_file flag, and GPU memory stats
- Add a prefill_throughput metric to cuda_benchmark.py (prefill tok/s alongside the existing decode tok/s)
- Add Qwen3.5-35B-A3B-HQQ-INT4 to cuda-perf.yml with a >1000-token prompt and 512 output tokens, on linux.aws.a100
- Create a new ciflow tag for the cuda-perf CI
- Remove random model selection and the schedule trigger; always run all models when triggered

---------

Co-authored-by: gasoonjia <gasoonjia@fb.com>