Optimize MLA/GQA/MQA Triton decoding #1138

ispobock · 2024-08-17T16:11:52Z

Motivation

Optimize memory access for MLA/GQA/MQA decoding.

Modification

One block handle BLOCK_H q heads with shared k/v head. Inspired by InternLM/lmdeploy#1649.

ispobock · 2024-08-17T16:14:30Z

Tested on A100-80G:
DeepSeek-V2-Lite

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    128.0
Successful requests:                     5000
Benchmark duration (s):                  238.01
Total input tokens:                      1187865
Total generated tokens:                  1089941
Total generated tokens (retokenized):    1088588
Request throughput (req/s):              21.01
Input token throughput (tok/s):          4990.76
Output token throughput (tok/s):         4579.34
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   82822.78
Median E2E Latency (ms):                 79653.86
---------------Time to First Token----------------
Mean TTFT (ms):                          7167.67
Median TTFT (ms):                        4229.26
P99 TTFT (ms):                           21327.09
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1073.28
Median TPOT (ms):                        473.77
P99 TPOT (ms):                           7907.65
---------------Inter-token Latency----------------
Mean ITL (ms):                           409.14
Median ITL (ms):                         165.46
P99 ITL (ms):                            1814.59
==================================================

subject: abstract_algebra, #q:100, acc: 0.270
subject: anatomy, #q:135, acc: 0.504
subject: astronomy, #q:152, acc: 0.572
subject: business_ethics, #q:100, acc: 0.600
subject: clinical_knowledge, #q:265, acc: 0.642
subject: college_biology, #q:144, acc: 0.653
subject: college_chemistry, #q:100, acc: 0.410
subject: college_computer_science, #q:100, acc: 0.440
subject: college_mathematics, #q:100, acc: 0.380
subject: college_medicine, #q:173, acc: 0.601
Total latency: 33.251
Average accuracy: 0.535

Llama-3-8B

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    128.0
Successful requests:                     1000
Benchmark duration (s):                  49.48
Total input tokens:                      213987
Total generated tokens:                  199779
Total generated tokens (retokenized):    198032
Request throughput (req/s):              20.21
Input token throughput (tok/s):          4324.67
Output token throughput (tok/s):         4037.53
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   20671.67
Median E2E Latency (ms):                 19467.73
---------------Time to First Token----------------
Mean TTFT (ms):                          3234.54
Median TTFT (ms):                        1188.96
P99 TTFT (ms):                           14154.28
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          186.30
Median TPOT (ms):                        90.10
P99 TPOT (ms):                           1976.89
---------------Inter-token Latency----------------
Mean ITL (ms):                           91.19
Median ITL (ms):                         61.85
P99 ITL (ms):                            308.93
==================================================

subject: abstract_algebra, #q:100, acc: 0.330
subject: anatomy, #q:135, acc: 0.696
subject: astronomy, #q:152, acc: 0.684
subject: business_ethics, #q:100, acc: 0.630
subject: clinical_knowledge, #q:265, acc: 0.751
subject: college_biology, #q:144, acc: 0.771
subject: college_chemistry, #q:100, acc: 0.460
subject: college_computer_science, #q:100, acc: 0.520
subject: college_mathematics, #q:100, acc: 0.340
subject: college_medicine, #q:173, acc: 0.636
Total latency: 41.592
Average accuracy: 0.618

Reproduce:

python3 -m sglang.launch_server --model-path DeepSeek-V2-Lite --port 30000 --trust-remote-code --disable-radix-cache --enable-mla --tp=1
python3 -m sglang.bench_serving --backend sglang --tokenizer DeepSeek-V2-Lite --dataset-path /workdir/datasets/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --request-rate 128
python3 benchmark/mmlu/bench_sglang.py --nsub 10

python3 -m sglang.launch_server --model-path Meta-Llama-3-8B --port 30000 --trust-remote-code --disable-radix-cache --disable-flashinfer --tp=1
python3 -m sglang.bench_serving --backend sglang --tokenizer Meta-Llama-3-8B --dataset-path /workdir/datasets/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --request-rate 128
python3 benchmark/mmlu/bench_sglang.py --nsub 10

zhyncs · 2024-08-17T16:14:54Z

Nice work! TLDR: Reuse from L2 to block. Is it right? @ispobock

ispobock · 2024-08-17T16:17:41Z

Nice work! TLDR: Reuse from L2 to block. Is it right? @ispobock

@zhyncs Previous version reuses from L2 cache. This version reuses shared k/v head from SMEM.

zhyncs · 2024-08-17T16:22:04Z

Tested on A100-80G: DeepSeek-V2-Lite

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    128.0
Successful requests:                     5000
Benchmark duration (s):                  238.01
Total input tokens:                      1187865
Total generated tokens:                  1089941
Total generated tokens (retokenized):    1088588
Request throughput (req/s):              21.01
Input token throughput (tok/s):          4990.76
Output token throughput (tok/s):         4579.34
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   82822.78
Median E2E Latency (ms):                 79653.86
---------------Time to First Token----------------
Mean TTFT (ms):                          7167.67
Median TTFT (ms):                        4229.26
P99 TTFT (ms):                           21327.09
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1073.28
Median TPOT (ms):                        473.77
P99 TPOT (ms):                           7907.65
---------------Inter-token Latency----------------
Mean ITL (ms):                           409.14
Median ITL (ms):                         165.46
P99 ITL (ms):                            1814.59
==================================================

subject: abstract_algebra, #q:100, acc: 0.270
subject: anatomy, #q:135, acc: 0.504
subject: astronomy, #q:152, acc: 0.572
subject: business_ethics, #q:100, acc: 0.600
subject: clinical_knowledge, #q:265, acc: 0.642
subject: college_biology, #q:144, acc: 0.653
subject: college_chemistry, #q:100, acc: 0.410
subject: college_computer_science, #q:100, acc: 0.440
subject: college_mathematics, #q:100, acc: 0.380
subject: college_medicine, #q:173, acc: 0.601
Total latency: 33.251
Average accuracy: 0.535

Llama-3-8B

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    128.0
Successful requests:                     1000
Benchmark duration (s):                  49.48
Total input tokens:                      213987
Total generated tokens:                  199779
Total generated tokens (retokenized):    198032
Request throughput (req/s):              20.21
Input token throughput (tok/s):          4324.67
Output token throughput (tok/s):         4037.53
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   20671.67
Median E2E Latency (ms):                 19467.73
---------------Time to First Token----------------
Mean TTFT (ms):                          3234.54
Median TTFT (ms):                        1188.96
P99 TTFT (ms):                           14154.28
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          186.30
Median TPOT (ms):                        90.10
P99 TPOT (ms):                           1976.89
---------------Inter-token Latency----------------
Mean ITL (ms):                           91.19
Median ITL (ms):                         61.85
P99 ITL (ms):                            308.93
==================================================

subject: abstract_algebra, #q:100, acc: 0.330
subject: anatomy, #q:135, acc: 0.696
subject: astronomy, #q:152, acc: 0.684
subject: business_ethics, #q:100, acc: 0.630
subject: clinical_knowledge, #q:265, acc: 0.751
subject: college_biology, #q:144, acc: 0.771
subject: college_chemistry, #q:100, acc: 0.460
subject: college_computer_science, #q:100, acc: 0.520
subject: college_mathematics, #q:100, acc: 0.340
subject: college_medicine, #q:173, acc: 0.636
Total latency: 41.592
Average accuracy: 0.618

Reproduce:

python3 -m sglang.launch_server --model-path DeepSeek-V2-Lite --port 30000 --trust-remote-code --disable-radix-cache --enable-mla --tp=1
python3 -m sglang.bench_serving --backend sglang --tokenizer DeepSeek-V2-Lite --dataset-path /workdir/datasets/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --request-rate 128
python3 benchmark/mmlu/bench_sglang.py --nsub 10

python3 -m sglang.launch_server --model-path Meta-Llama-3-8B --port 30000 --trust-remote-code --disable-radix-cache --disable-flashinfer --tp=1
python3 -m sglang.bench_serving --backend sglang --tokenizer Meta-Llama-3-8B --dataset-path /workdir/datasets/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --request-rate 128
python3 benchmark/mmlu/bench_sglang.py --nsub 10

ref #905 (comment)

After a brief look, the throughput has roughly doubled compared to the previous MLA version, great work! cc @merrymercy @Ying1123 @hnyls2002

python/sglang/srt/layers/decode_attention.py

zhyncs

overall LGTM @ispobock

Currently, all CIs have passed, including when Llama3 disables FlashInfer, it will go through some logic. The benchmark and eval of this PR also meet expectations. The verification of DeepSeek V2 on A100 TP8 and H100 TP8 can be done later, and try to continue analyzing whether there is room for optimization with nsys and ncu. After yesterday's simple discussion, it is mainly the quick implementation by @ispobock, also thanks a lot for the implementation reference by @grimoire InternLM/lmdeploy#1649 and discussion comments from @lzhangzz

@MARD1NO and @yzh119 , if you are interested, welcome to help review and give some optimization suggestions. Thanks.

cc @merrymercy @Ying1123 @hnyls2002

zhyncs · 2024-08-17T17:34:41Z

@Xu-Chen @lxww302 I noticed that you have used the implementation of SGLang's DeepSeek V2 TP8 MLA before. Could you help verify the performance of the new version, for example, on devices you have like A100 TP8, A800 TP8, H100 TP8, etc.? Thanks very mauch!

git clone -b decode_gqa_opt https://github.com/ispobock/sglang.git
cd sglang

pip install --upgrade pip
pip install -e "python[all]"

pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V2 --port 30000 --trust-remote-code --disable-radix-cache --enable-mla --tp=8

python3 -m sglang.bench_serving --backend sglang

81549361 · 2024-08-17T19:27:05Z

@Xu-Chen @lxww302 I noticed that you have used the implementation of SGLang's DeepSeek V2 TP8 MLA before. Could you help verify the performance of the new version, for example, on devices you have like A100 TP8, A800 TP8, H100 TP8, etc.? Thanks very mauch!
git clone -b decode_gqa_opt https://github.com/ispobock/sglang.git
cd sglang

pip install --upgrade pip
pip install -e "python[all]"

pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V2 --port 30000 --trust-remote-code --disable-radix-cache --enable-mla --tp=8

python3 -m sglang.bench_serving --backend sglang

I have 8Xh100s, I executed your command

Backend:                                 sglang    
Traffic request rate:                    inf       
Successful requests:                     1000      
Benchmark duration (s):                  182.31    
Total input tokens:                      236142    
Total generated tokens:                  215614    
Total generated tokens (retokenized):    215037    
Request throughput (req/s):              5.49      
Input token throughput (tok/s):          1295.28   
Output token throughput (tok/s):         1182.68   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   75887.79  
Median E2E Latency (ms):                 77685.35  
---------------Time to First Token----------------
Mean TTFT (ms):                          43446.36  
Median TTFT (ms):                        39279.88  
P99 TTFT (ms):                           104146.94 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          181.96    
Median TPOT (ms):                        161.47    
P99 TPOT (ms):                           653.64    
---------------Inter-token Latency----------------
Mean ITL (ms):                           152.74    
Median ITL (ms):                         99.26     
P99 ITL (ms):                            465.58    
==================================================```

zhyncs · 2024-08-17T19:30:13Z

Thanks! Is it H100 SXM or NVL? @81549361

zhyncs · 2024-08-17T19:32:20Z

May you collect the env info with python3 -m sglang.check_env. @81549361

vhain · 2024-08-17T19:53:03Z

Not sure if this could be helpful or not, but I ran llmperf for both main branch and incoming branch. Overall this PR seems to make things much faster:

GPU: NVIDIA A40 TP=2 DP=1
Model: Qwen/Qwen2-72B-Instruct-AWQ

llmperf command used

python token_benchmark_ray.py \
  --model "${MODEL}" \
  --mean-input-tokens 1500 \
  --stddev-input-tokens 150 \
  --mean-output-tokens 245 \
  --stddev-output-tokens 20 \
  --max-num-completed-requests "64" \
  --timeout 7200 \
  --num-concurrent-requests "8" \
  --llm-api openai \
  --additional-sampling-params '{}'

main branch

{
    "version": "2023-08-31",
    "mean_input_tokens": 1500,
    "stddev_input_tokens": 150,
    "mean_output_tokens": 245,
    "stddev_output_tokens": 20,
    "num_concurrent_requests": 8,
    "results_inter_token_latency_s_quantiles_p25": 0.03990331099470551,
    "results_inter_token_latency_s_quantiles_p50": 0.057948063652443406,
    "results_inter_token_latency_s_quantiles_p75": 0.08040066503004678,
    "results_inter_token_latency_s_quantiles_p90": 0.08383243498141633,
    "results_inter_token_latency_s_quantiles_p95": 0.08516111126646178,
    "results_inter_token_latency_s_quantiles_p99": 0.10164050496592587,
    "results_inter_token_latency_s_mean": 0.06027883582796916,
    "results_inter_token_latency_s_min": 0.03675615620323733,
    "results_inter_token_latency_s_max": 0.1020314351556132,
    "results_inter_token_latency_s_stddev": 0.0211621866217624,
    "results_ttft_s_quantiles_p25": 0.4133454477414489,
    "results_ttft_s_quantiles_p50": 1.016814228380099,
    "results_ttft_s_quantiles_p75": 11.284791270736605,
    "results_ttft_s_quantiles_p90": 11.749069100199268,
    "results_ttft_s_quantiles_p95": 11.803535583987832,
    "results_ttft_s_quantiles_p99": 11.955875016311182,
    "results_ttft_s_mean": 5.338054827436281,
    "results_ttft_s_min": 0.2691499590873718,
    "results_ttft_s_max": 12.148427874781191,
    "results_ttft_s_stddev": 5.495650480946165,
    "results_end_to_end_latency_s_quantiles_p25": 11.498506030999124,
    "results_end_to_end_latency_s_quantiles_p50": 15.51382327103056,
    "results_end_to_end_latency_s_quantiles_p75": 22.9230548851192,
    "results_end_to_end_latency_s_quantiles_p90": 23.657817971240732,
    "results_end_to_end_latency_s_quantiles_p95": 23.97725157707464,
    "results_end_to_end_latency_s_quantiles_p99": 24.61372328522615,
    "results_end_to_end_latency_s_mean": 16.84320118615142,
    "results_end_to_end_latency_s_min": 3.5896931253373623,
    "results_end_to_end_latency_s_max": 25.067169249989092,
    "results_end_to_end_latency_s_stddev": 6.076063540076458,
    "results_request_output_throughput_token_per_s_quantiles_p25": 12.432897921487776,
    "results_request_output_throughput_token_per_s_quantiles_p50": 17.950591526918625,
    "results_request_output_throughput_token_per_s_quantiles_p75": 25.023589881617227,
    "results_request_output_throughput_token_per_s_quantiles_p90": 25.61754857375858,
    "results_request_output_throughput_token_per_s_quantiles_p95": 26.080372795146523,
    "results_request_output_throughput_token_per_s_quantiles_p99": 27.12744569799552,
    "results_request_output_throughput_token_per_s_mean": 18.7890127702506,
    "results_request_output_throughput_token_per_s_min": 9.773737854436295,
    "results_request_output_throughput_token_per_s_max": 27.204481327432568,
    "results_request_output_throughput_token_per_s_stddev": 6.462698432888159,
    "results_number_input_tokens_quantiles_p25": 1419.75,
    "results_number_input_tokens_quantiles_p50": 1513.5,
    "results_number_input_tokens_quantiles_p75": 1585.25,
    "results_number_input_tokens_quantiles_p90": 1726.1000000000001,
    "results_number_input_tokens_quantiles_p95": 1812.2499999999998,
    "results_number_input_tokens_quantiles_p99": 1942.5299999999997,
    "results_number_input_tokens_mean": 1515.53125,
    "results_number_input_tokens_min": "1125",
    "results_number_input_tokens_max": "1986",
    "results_number_input_tokens_stddev": 157.1251617922921,
    "results_number_output_tokens_quantiles_p25": 271.25,
    "results_number_output_tokens_quantiles_p50": 287.0,
    "results_number_output_tokens_quantiles_p75": 304.5,
    "results_number_output_tokens_quantiles_p90": 318.0,
    "results_number_output_tokens_quantiles_p95": 326.4,
    "results_number_output_tokens_quantiles_p99": 340.37,
    "results_number_output_tokens_mean": 280.546875,
    "results_number_output_tokens_min": "78",
    "results_number_output_tokens_max": "341",
    "results_number_output_tokens_stddev": 43.62427229119711,
    "results_num_requests_started": 64,
    "results_error_rate": 0.0,
    "results_number_errors": 0,
    "results_error_code_frequency": "{}",
    "results_mean_output_throughput_token_per_s": 122.91809365087381,
    "results_num_completed_requests": 64,
    "results_num_completed_requests_per_min": 26.288247263678944,
    "timestamp": 1723922364
}

incoming branch

{
    "version": "2023-08-31",
    "mean_input_tokens": 1500,
    "stddev_input_tokens": 150,
    "mean_output_tokens": 245,
    "stddev_output_tokens": 20,
    "num_concurrent_requests": 8,
    "results_inter_token_latency_s_quantiles_p25": 0.04048058146969138,
    "results_inter_token_latency_s_quantiles_p50": 0.04134249718749723,
    "results_inter_token_latency_s_quantiles_p75": 0.042773683461634744,
    "results_inter_token_latency_s_quantiles_p90": 0.04477736409998821,
    "results_inter_token_latency_s_quantiles_p95": 0.04621570852103804,
    "results_inter_token_latency_s_quantiles_p99": 0.04943066709057319,
    "results_inter_token_latency_s_mean": 0.04202164194913325,
    "results_inter_token_latency_s_min": 0.03828613981456747,
    "results_inter_token_latency_s_max": 0.05096760665209523,
    "results_inter_token_latency_s_stddev": 0.0023344492257422154,
    "results_ttft_s_quantiles_p25": 0.3779949996387586,
    "results_ttft_s_quantiles_p50": 0.403224729700014,
    "results_ttft_s_quantiles_p75": 0.44007199979387224,
    "results_ttft_s_quantiles_p90": 0.4766438877210021,
    "results_ttft_s_quantiles_p95": 0.4872294148663059,
    "results_ttft_s_quantiles_p99": 0.49447528753429654,
    "results_ttft_s_mean": 0.4035295032663271,
    "results_ttft_s_min": 0.2787872082553804,
    "results_ttft_s_max": 0.49528229096904397,
    "results_ttft_s_stddev": 0.05853017613187361,
    "results_end_to_end_latency_s_quantiles_p25": 10.952284958562814,
    "results_end_to_end_latency_s_quantiles_p50": 11.724067542003468,
    "results_end_to_end_latency_s_quantiles_p75": 12.392438833485357,
    "results_end_to_end_latency_s_quantiles_p90": 12.949160708626732,
    "results_end_to_end_latency_s_quantiles_p95": 13.369823349895887,
    "results_end_to_end_latency_s_quantiles_p99": 13.602660472076385,
    "results_end_to_end_latency_s_mean": 11.063488117179077,
    "results_end_to_end_latency_s_min": 2.310943207703531,
    "results_end_to_end_latency_s_max": 13.658869832754135,
    "results_end_to_end_latency_s_stddev": 2.5735290879206163,
    "results_request_output_throughput_token_per_s_quantiles_p25": 23.376963498120137,
    "results_request_output_throughput_token_per_s_quantiles_p50": 24.13135072660546,
    "results_request_output_throughput_token_per_s_quantiles_p75": 24.70095651189223,
    "results_request_output_throughput_token_per_s_quantiles_p90": 25.105406335351436,
    "results_request_output_throughput_token_per_s_quantiles_p95": 25.318698051259776,
    "results_request_output_throughput_token_per_s_quantiles_p99": 26.00064578019821,
    "results_request_output_throughput_token_per_s_mean": 23.819321580789712,
    "results_request_output_throughput_token_per_s_min": 19.61920693264775,
    "results_request_output_throughput_token_per_s_max": 26.11816971864744,
    "results_request_output_throughput_token_per_s_stddev": 1.3040854008387603,
    "results_number_input_tokens_quantiles_p25": 1419.75,
    "results_number_input_tokens_quantiles_p50": 1513.5,
    "results_number_input_tokens_quantiles_p75": 1585.25,
    "results_number_input_tokens_quantiles_p90": 1726.1000000000001,
    "results_number_input_tokens_quantiles_p95": 1812.2499999999998,
    "results_number_input_tokens_quantiles_p99": 1942.5299999999997,
    "results_number_input_tokens_mean": 1515.53125,
    "results_number_input_tokens_min": "1125",
    "results_number_input_tokens_max": "1986",
    "results_number_input_tokens_stddev": 157.1251617922921,
    "results_number_output_tokens_quantiles_p25": 265.75,
    "results_number_output_tokens_quantiles_p50": 285.0,
    "results_number_output_tokens_quantiles_p75": 296.25,
    "results_number_output_tokens_quantiles_p90": 317.0,
    "results_number_output_tokens_quantiles_p95": 322.0,
    "results_number_output_tokens_quantiles_p99": 338.84999999999997,
    "results_number_output_tokens_mean": 265.484375,
    "results_number_output_tokens_min": "47",
    "results_number_output_tokens_max": "342",
    "results_number_output_tokens_stddev": 66.06466101119273,
    "results_num_requests_started": 64,
    "results_error_rate": 0.0,
    "results_number_errors": 0,
    "results_error_code_frequency": "{}",
    "results_mean_output_throughput_token_per_s": 162.73324599263228,
    "results_num_completed_requests": 64,
    "results_num_completed_requests_per_min": 36.77803923322394,
    "timestamp": 1723922279
}

81549361 · 2024-08-17T19:53:27Z

python3 -m sglang.check_env

Python: 3.12.3 | packaged by Anaconda, Inc. | (main, May  6 2024, 19:46:43) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 535.183.01
PyTorch: 2.4.0+cu121
flashinfer: 0.1.5+cu121torch2.4
triton: 3.0.0
transformers: 4.44.0
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.10.3
fastapi: 0.112.1
hf_transfer: 0.1.8
huggingface_hub: 0.24.5
interegular: 0.3.3
packaging: 23.2
PIL: 10.3.0
psutil: 5.9.8
pydantic: 2.8.2
uvicorn: 0.30.6
uvloop: 0.20.0
zmq: 26.0.3
vllm: 0.5.4
multipart: 0.0.9
openai: 1.40.8
anthropic: 0.34.0
NVIDIA Topology: 
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    0-159   0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    0-159   0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    0-159   0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    0-159   0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    0-159   0               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    0-159   0               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    0-159   0               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      0-159   0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

ulimit soft: 1048576

81549361 · 2024-08-17T20:27:24Z

Not sure if this could be helpful or not, but I ran llmperf for both main branch and incoming branch. Overall this PR seems to make things much faster:

GPU: NVIDIA A40 TP=2 DP=1

Model: Qwen/Qwen2-72B-Instruct-AWQ

llmperf command used
main branch
incoming branch

What is your startup command?
I don't see any noticeable improvement in llama3 8b FP8.

vhain · 2024-08-17T20:29:33Z

@81549361 Startup command I used for both are the same:

python3 -m sglang.launch_server \
  --model-path "${MODEL}" \
  --host 127.0.0.1 \
  --port 8080 \
  --context-length "4096" \
  --max-prefill-tokens "16384" \
  --mem-fraction-static "0.85" \
  --schedule-conservativeness "0.05" \
  --tp-size "2" \
  --dp-size "1" \
  --log-level-http warning

ispobock · 2024-08-17T23:22:48Z

I don't see any noticeable improvement in llama3 8b FP8.

@81549361 Did you add --disable-flashinfer for both branches on llama3?

Xu-Chen · 2024-08-18T03:21:21Z

Awesome! Will test DeepSeek-V2-Chat on 8*A800 next week.

Tested on A800-80G: DeepSeek-V2-Lite

Main branch ( DeepSeek-V2-Lite-Chat on 1 * A800-80G )

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     1000
Benchmark duration (s):                  90.75
Total input tokens:                      236142
Total generated tokens:                  215614
Total generated tokens (retokenized):    214087
Request throughput (req/s):              11.02
Input token throughput (tok/s):          2602.12
Output token throughput (tok/s):         2375.92
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   39248.93
Median E2E Latency (ms):                 34872.34
---------------Time to First Token----------------
Mean TTFT (ms):                          10523.55
Median TTFT (ms):                        10943.01
P99 TTFT (ms):                           15801.71
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          233.94
Median TPOT (ms):                        151.23
P99 TPOT (ms):                           1772.93
---------------Inter-token Latency----------------
Mean ITL (ms):                           140.10
Median ITL (ms):                         117.96
P99 ITL (ms):                            385.24
==================================================

This PR ( DeepSeek-V2-Lite-Chat on 1 * A800-80G )

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     1000
Benchmark duration (s):                  59.89
Total input tokens:                      236142
Total generated tokens:                  215614
Total generated tokens (retokenized):    214102
Request throughput (req/s):              16.70
Input token throughput (tok/s):          3942.60
Output token throughput (tok/s):         3599.87
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   25766.33
Median E2E Latency (ms):                 23320.27
---------------Time to First Token----------------
Mean TTFT (ms):                          9147.00
Median TTFT (ms):                        9517.37
P99 TTFT (ms):                           14099.85
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          161.66
Median TPOT (ms):                        72.40
P99 TPOT (ms):                           1690.97
---------------Inter-token Latency----------------
Mean ITL (ms):                           82.39
Median ITL (ms):                         56.74
P99 ITL (ms):                            247.01
==================================================

halexan · 2024-08-18T07:37:59Z

Tested DeepSeek-V2-Chat-0628 on 8*A800

serve

python3 -m sglang.launch_server \
    --model-path /data/model-cache/deepseek-ai/DeepSeek-V2-Chat-0628 \
    --served-model-name deepseek-chat \
    --tp 8 \
    --enable-mla \
    --disable-radix-cache \
    --mem-fraction-static 0.87 \
    --schedule-conservativeness 0.1 \
    --chunked-prefill-size 32768 \
    --max-prefill-tokens 163840 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 50521

test

python3 -m sglang.bench_serving \
    --backend sglang \
    --dataset-name sharegpt \
    --dataset-path /data/model-cache/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json \
    --model /data/model-cache/deepseek-ai/DeepSeek-V2-Chat-0628 \
    --port 50521

result

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     1000
Benchmark duration (s):                  604.96
Total input tokens:                      236142
Total generated tokens:                  215614
Total generated tokens (retokenized):    214714
Request throughput (req/s):              1.65
Input token throughput (tok/s):          390.34
Output token throughput (tok/s):         356.41
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   374607.65
Median E2E Latency (ms):                 392302.17
---------------Time to First Token----------------
Mean TTFT (ms):                          184913.93
Median TTFT (ms):                        150008.79
P99 TTFT (ms):                           424698.14
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1651.19
Median TPOT (ms):                        1100.21
P99 TPOT (ms):                           10328.23
---------------Inter-token Latency----------------
Mean ITL (ms):                           890.30
Median ITL (ms):                         582.39
P99 ITL (ms):                            3893.44
==================================================

Should I use base model?
or my params not correct?

zhyncs · 2024-08-18T07:41:51Z

@halexan You don’t need to set this

--mem-fraction-static 0.87 \
    --schedule-conservativeness 0.1 \
    --chunked-prefill-size 32768 \
    --max-prefill-tokens 163840 \

Xu-Chen · 2024-08-18T13:45:16Z

Tested DeepSeek-V2-Chat-0628 on 8*A800
server

/opt/conda/bin/python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V2-Chat-0628 --tp 8 --trust-remote-code --enable-mla --disable-radix-cache

test

/opt/conda/bin/python -m sglang.bench_serving --backend sglang --num-prompts 3000

This PR ( DeepSeek-V2-Chat-0628 on 8 * A800-80G )

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     3000
Benchmark duration (s):                  498.49
Total input tokens:                      714456
Total generated tokens:                  656556
Total generated tokens (retokenized):    653778
Request throughput (req/s):              6.02
Input token throughput (tok/s):          1433.23
Output token throughput (tok/s):         1317.08
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   204276.17
Median E2E Latency (ms):                 205499.99
---------------Time to First Token----------------
Mean TTFT (ms):                          165516.98
Median TTFT (ms):                        164192.44
P99 TTFT (ms):                           353364.53
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          187.10
Median TPOT (ms):                        186.48
P99 TPOT (ms):                           398.82
---------------Inter-token Latency----------------
Mean ITL (ms):                           180.25
Median ITL (ms):                         108.96
P99 ITL (ms):                            567.61
==================================================

halexan · 2024-08-18T23:55:10Z

@Xu-Chen

Does your 8*A800 has nvlink?

Xu-Chen · 2024-08-19T02:10:44Z

@Xu-Chen

Does your 8*A800 has nvlink?

Yes

zhyncs · 2024-08-19T09:54:23Z

H100 SXM TP8 with DeepSeek V2

current PR

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     5000
Benchmark duration (s):                  581.84
Total input tokens:                      1187865
Total generated tokens:                  1089941
Total generated tokens (retokenized):    1086980
Request throughput (req/s):              8.59
Input token throughput (tok/s):          2041.57
Output token throughput (tok/s):         1873.27
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   266797.24
Median E2E Latency (ms):                 272582.37
---------------Time to First Token----------------
Mean TTFT (ms):                          239227.95
Median TTFT (ms):                        248810.27
P99 TTFT (ms):                           488867.21
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          132.70
Median TPOT (ms):                        129.55
P99 TPOT (ms):                           281.64
---------------Inter-token Latency----------------
Mean ITL (ms):                           129.46
Median ITL (ms):                         78.23
P99 ITL (ms):                            453.92
==================================================

Compared to the main branch, it has improved by about 35%.

main branch

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     5000
Benchmark duration (s):                  777.04
Total input tokens:                      1187865
Total generated tokens:                  1089941
Total generated tokens (retokenized):    1087011
Request throughput (req/s):              6.43
Input token throughput (tok/s):          1528.70
Output token throughput (tok/s):         1402.68
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   358316.01
Median E2E Latency (ms):                 365857.50
---------------Time to First Token----------------
Mean TTFT (ms):                          320752.33
Median TTFT (ms):                        323528.82
P99 TTFT (ms):                           670386.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          176.45
Median TPOT (ms):                        176.47
P99 TPOT (ms):                           272.25
---------------Inter-token Latency----------------
Mean ITL (ms):                           175.99
Median ITL (ms):                         128.65
P99 ITL (ms):                            517.99
==================================================

I plan to merge this PR first, and the compatibility support for fp8 will be completed in another PR. @ispobock @merrymercy @Ying1123 @hnyls2002

zhyncs · 2024-08-19T09:56:44Z

To further improve performance, both W8A8 (FP8) and FP8 KV Cache are necessary and should be supported for DeepSeek V2.

Xu-Chen · 2024-08-19T10:01:38Z

Furthermore, should pay attention to the MLA implementation of FlashInfer ( flashinfer-ai/flashinfer#237)

zhyncs · 2024-08-19T10:07:48Z

Furthermore, should pay attention to the MLA implementation of FlashInfer ( flashinfer-ai/flashinfer#237)

@jon-chuang When do you expect to complete the support for MLA in FlashInfer? May you synchronize the approximate time? Thanks.

microwish · 2024-10-11T16:54:46Z

@ispobock - do you mind telling a bit more about how you spotted this issue or this optimization?
Did you see the potential issue when profiling something?
Or were you directly inspired by InternLM/lmdeploy#1649?

ispobock · 2024-10-12T15:24:02Z

@microwish Yeah, we did the profiling first and found the decoding kernel took most of the time. And then we checked the kernel with ncu and get some directions for optimizing the memory access.

pipul · 2025-01-07T05:20:50Z

follow form https://zhuanlan.zhihu.com/p/714761319

@zhyncs

非常有意思的文章，我们大概在一个月前，在 SGLang 中实现了你文中提到的 A_CC_ME 版本如何看待 DeepSeek 发布的 MoE 大模型 DeepSeek-V2？

以及做了 MQA 的优化 Optimize MLA/GQA/MQA Triton decoding by ispobock · Pull Request #1138 · sgl-project/sglang
事实上你提到的结论「MLA 由于是 Compressed_KV，只有一个 Head，相当于 MQA，因此没办法再继续划分，每个 Rank 各自保留一份完整的 Compressed_KV」和我们的结论是一致的。DeepSeek 内部的实现，如果有一个 KV Cache Memory Pool 或许可以解决这个问题，TP 8 时，就不用每张卡都重复存一份了[酷]

@pika-jy qkv proj 的TP 已经是完整Dim（Dim = Head * Head_Dim）上分了，和mlp一样，attn 计算 head_dim上再分没必要了吧？这一维度本来就很小了，128，又额外引入通信

My question is:
If qkv is fully partitioned on the DIM, then when calculating attention, can't each TP calculate its own qkv and then perform softmax at the end? Just like the traditional attention calculation.
When calculating attention, gather the qk results from all TPs (only qk results are needed, no need to transfer kv), and then perform softmax.
I don't quite understand why this approach doesn't work.

zhyncs · 2025-01-07T05:23:02Z

@pipul Just ignore this and please use English

pipul · 2025-01-07T05:28:25Z

@zhyncs why? have you solved the problem?

zhyncs · 2025-01-07T05:32:20Z

@pipul https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models

pipul · 2025-01-07T05:43:11Z

@zhyncs It seems that when you calculate attention, you only use DP parallelism and not TP parallelism? But I saw in the original paper that TP parallelism was used when calculating attention.

The attention part employs 4-way Tensor Parallelism (TP4) with Sequence Parallelism (SP), combined with 8-way Data Parallelism (DP8).

zhyncs · 2025-01-07T05:44:26Z

@pipul Yeah we will support soon

pipul · 2025-01-07T06:10:47Z

@zhyncs

For example, DeepSeek models use MLA and only have one KV head. If we use tensor parallelism on 8 GPUs, it will lead to duplicated KV cache and unwanted memory usage.

why one KV head use tensor parallelism, will lead to duplicated KV cache? Is it because of the high cost of communication? so need to make multiple copies of kv?

zhyncs · 2025-01-07T06:13:26Z

@pipul It involves a trade-off to achieve lower latency

Co-authored-by: Yineng Zhang <me@zhyncs.com>

ispobock added 2 commits August 17, 2024 23:56

add grouped decode for GQA

11a355c

fix format

Loading
Loading status checks…

448879a

zhyncs self-assigned this Aug 17, 2024

zhyncs requested review from Ying1123, merrymercy, zhyncs and hnyls2002 August 17, 2024 16:12

Merge branch 'main' into decode_gqa_opt

Loading
Loading status checks…

a4c12f4

zhyncs added high priority performance labels Aug 17, 2024

zhyncs requested a review from yzh119 August 17, 2024 16:24

zhyncs reviewed Aug 17, 2024

View reviewed changes

python/sglang/srt/layers/decode_attention.py Show resolved Hide resolved

zhyncs approved these changes Aug 17, 2024

View reviewed changes

zhyncs mentioned this pull request Aug 17, 2024

Development Roadmap (2024 Q3) #634

Closed

29 tasks

Merge branch 'main' into decode_gqa_opt

Loading
Loading status checks…

f15113c

Merge branch 'main' into decode_gqa_opt

Loading
Loading status checks…

5668a4c

zhyncs merged commit df19125 into sgl-project:main Aug 19, 2024
5 checks passed

timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025

Optimize MLA/GQA/MQA Triton decoding (sgl-project#1138)

bb4e492

Co-authored-by: Yineng Zhang <me@zhyncs.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize MLA/GQA/MQA Triton decoding #1138

Optimize MLA/GQA/MQA Triton decoding #1138

ispobock commented Aug 17, 2024

ispobock commented Aug 17, 2024

zhyncs commented Aug 17, 2024

ispobock commented Aug 17, 2024

zhyncs commented Aug 17, 2024

zhyncs left a comment

zhyncs commented Aug 17, 2024

81549361 commented Aug 17, 2024 •

edited

Loading

zhyncs commented Aug 17, 2024

zhyncs commented Aug 17, 2024

vhain commented Aug 17, 2024

81549361 commented Aug 17, 2024

81549361 commented Aug 17, 2024

vhain commented Aug 17, 2024

ispobock commented Aug 17, 2024

Xu-Chen commented Aug 18, 2024 •

edited

Loading

halexan commented Aug 18, 2024

zhyncs commented Aug 18, 2024

Xu-Chen commented Aug 18, 2024 •

edited

Loading

halexan commented Aug 18, 2024

Xu-Chen commented Aug 19, 2024

zhyncs commented Aug 19, 2024

zhyncs commented Aug 19, 2024

Xu-Chen commented Aug 19, 2024

zhyncs commented Aug 19, 2024

microwish commented Oct 11, 2024

ispobock commented Oct 12, 2024

pipul commented Jan 7, 2025 •

edited

Loading

zhyncs commented Jan 7, 2025

pipul commented Jan 7, 2025

zhyncs commented Jan 7, 2025

pipul commented Jan 7, 2025

zhyncs commented Jan 7, 2025

pipul commented Jan 7, 2025

zhyncs commented Jan 7, 2025

Optimize MLA/GQA/MQA Triton decoding #1138

Optimize MLA/GQA/MQA Triton decoding #1138

Conversation

ispobock commented Aug 17, 2024

Motivation

Modification

ispobock commented Aug 17, 2024

zhyncs commented Aug 17, 2024

ispobock commented Aug 17, 2024

zhyncs commented Aug 17, 2024

zhyncs left a comment

Choose a reason for hiding this comment

zhyncs commented Aug 17, 2024

81549361 commented Aug 17, 2024 • edited Loading

zhyncs commented Aug 17, 2024

zhyncs commented Aug 17, 2024

vhain commented Aug 17, 2024

81549361 commented Aug 17, 2024

81549361 commented Aug 17, 2024

vhain commented Aug 17, 2024

ispobock commented Aug 17, 2024

Xu-Chen commented Aug 18, 2024 • edited Loading

halexan commented Aug 18, 2024

zhyncs commented Aug 18, 2024

Xu-Chen commented Aug 18, 2024 • edited Loading

halexan commented Aug 18, 2024

Xu-Chen commented Aug 19, 2024

zhyncs commented Aug 19, 2024

zhyncs commented Aug 19, 2024

Xu-Chen commented Aug 19, 2024

zhyncs commented Aug 19, 2024

microwish commented Oct 11, 2024

ispobock commented Oct 12, 2024

pipul commented Jan 7, 2025 • edited Loading

zhyncs commented Jan 7, 2025

pipul commented Jan 7, 2025

zhyncs commented Jan 7, 2025

pipul commented Jan 7, 2025

zhyncs commented Jan 7, 2025

pipul commented Jan 7, 2025

zhyncs commented Jan 7, 2025

81549361 commented Aug 17, 2024 •

edited

Loading

Xu-Chen commented Aug 18, 2024 •

edited

Loading

Xu-Chen commented Aug 18, 2024 •

edited

Loading

pipul commented Jan 7, 2025 •

edited

Loading