torch.compile can accelerate small batch sizes for llama-3 8B, but it is sometimes slower at large batch sizes or with tensor parallelism. We use this issue to track the performance and potential fixes.
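For background, `--enable-torch-compile` lets `torch.compile` capture and optimize the decode forward pass once, then reuse the compiled graph on later steps. Below is a minimal sketch of that idea on a toy module; the `TinyDecoder` class and sizes are illustrative assumptions, not SGLang internals:

```python
import torch

# Illustrative toy module standing in for a decode forward pass
# (not SGLang internals; names and sizes are assumptions).
class TinyDecoder(torch.nn.Module):
    def __init__(self, hidden: int = 4096):
        super().__init__()
        self.proj = torch.nn.Linear(hidden, hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.proj(x))

model = TinyDecoder()
# "reduce-overhead" targets small-batch decoding, where per-step
# launch overhead dominates: the regime where torch.compile wins below.
compiled = torch.compile(model, mode="reduce-overhead")

x = torch.randn(1, 4096)  # bs=1 decode step
with torch.no_grad():
    y = compiled(x)  # first call compiles; later calls replay the optimized graph
```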
Instructions and results
```bash
# Benchmark llama-3-8B (TP=1, bs=1) with cuda graph
# Decode. median latency: 0.00737 s, median throughput: 135.64 token/s
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8

# Benchmark llama-3-8B (TP=1, bs=1) with torch.compile
# Decode. median latency: 0.00642 s, median throughput: 155.67 token/s
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8 --enable-torch-compile

# Benchmark llama-3-8B (TP=1, bs=128) with cuda graph
# Decode. median latency: 0.01184 s, median throughput: 10815.07 token/s
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 128 --input 128 --output 8

# Benchmark llama-3-8B (TP=1, bs=128) with torch.compile
# Decode. median latency: 0.01231 s, median throughput: 10401.75 token/s
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 128 --input 128 --output 8 --enable-torch-compile

# Benchmark llama-3-8B (TP=8, bs=1) with cuda graph
# Decode. median latency: 0.00335 s, median throughput: 298.53 token/s
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8 --tp 8

# Benchmark llama-3-8B (TP=8, bs=1) with torch.compile
# Decode. median latency: 0.00351 s, median throughput: 284.51 token/s
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8 --tp 8 --enable-torch-compile

# Benchmark llama-3-70B (TP=8, bs=1) with cuda graph
# Decode. median latency: 0.01220 s, median throughput: 82.00 token/s
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-70B --batch-size 1 --input 128 --output 8 --tp 8

# Benchmark llama-3-70B (TP=8, bs=1) with torch.compile
# Decode. median latency: 0.01211 s, median throughput: 82.57 token/s
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-70B --batch-size 1 --input 128 --output 8 --tp 8 --enable-torch-compile
```
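As a sanity check on these numbers, decode throughput is just batch size divided by median step latency, since each decode step emits one token per sequence. A quick sketch of that arithmetic (the helper function is ours, not part of `sglang.bench_latency`):

```python
def decode_throughput(batch_size: int, median_latency_s: float) -> float:
    """Tokens/s for one decode step; helper name is ours, not sglang's."""
    return batch_size / median_latency_s

print(decode_throughput(1, 0.00737))    # ~135.7 token/s  (bs=1, cuda graph)
print(decode_throughput(1, 0.00642))    # ~155.8 token/s  (bs=1, torch.compile)
print(decode_throughput(128, 0.01184))  # ~10811 token/s  (bs=128, cuda graph)
```

These match the reported throughputs up to rounding.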
Environment