Torch.compile Performance Tracking #1008

Open

merrymercy (Contributor) opened this issue Aug 9, 2024 · 0 comments

merrymercy commented Aug 9, 2024

torch.compile can accelerate decoding at small batch sizes for llama-3 8B. However, it is sometimes slower at large batch sizes or with tensor parallelism. We use this issue to track the performance and potential fixes.
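
For context, `--enable-torch-compile` conceptually wraps the model's decode forward pass in `torch.compile`. The sketch below is only an illustration of that idea, not sglang's actual integration; the linear layer and shapes are placeholders:

import torch
import torch.nn as nn

# Stand-in for a decode-step forward pass; sglang compiles the real model graph.
model = nn.Linear(4096, 4096, device="cuda", dtype=torch.float16)

# mode="reduce-overhead" replays the compiled graph via CUDA graphs, which is
# why torch.compile helps most at small batch sizes, where kernel launch
# overhead dominates the decode step.
compiled = torch.compile(model, mode="reduce-overhead")

x = torch.randn(1, 4096, device="cuda", dtype=torch.float16)  # bs=1 decode-like input
with torch.no_grad():
    for _ in range(3):  # warmup iterations trigger compilation and graph capture
        compiled(x)
    out = compiled(x)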

Instructions and results

# Benchmark llama-3-8B (TP=1, bs=1) with cuda graph
# Decode.  median latency: 0.00737 s, median throughput:    135.64 token/s
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8

# Benchmark llama-3-8B (TP=1, bs=1) with torch.compile
# Decode.  median latency: 0.00642 s, median throughput:    155.67 token/s
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8 --enable-torch-compile


# Benchmark llama-3-8B (TP=1, bs=128) with cuda graph
# Decode.  median latency: 0.01184 s, median throughput:  10815.07 token/s
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 128 --input 128 --output 8

# Benchmark llama-3-8B (TP=1, bs=128) with torch.compile
# Decode.  median latency: 0.01231 s, median throughput:  10401.75 token/s
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 128 --input 128 --output 8 --enable-torch-compile


# Benchmark llama-3-8B (TP=8, bs=1) with cuda graph
# Decode.  median latency: 0.00335 s, median throughput:    298.53 token/s
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8 --tp 8

# Benchmark llama-3-8B (TP=8, bs=1) with torch.compile
# Decode.  median latency: 0.00351 s, median throughput:    284.51 token/s
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8 --tp 8 --enable-torch-compile


# Benchmark llama-3-70B (TP=8, bs=1) with cuda graph
# Decode.  median latency: 0.01220 s, median throughput:     82.00 token/s
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-70B --batch-size 1 --input 128 --output 8 --tp 8

# Benchmark llama-3-70B (TP=8, bs=1) with torch.compile
# Decode.  median latency: 0.01211 s, median throughput:     82.57 token/s
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-70B --batch-size 1 --input 128 --output 8 --tp 8 --enable-torch-compile
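
Summary of the median decode throughput from the runs above:

Config                        CUDA graph          torch.compile
llama-3-8B   TP=1  bs=1         135.64 token/s      155.67 token/s
llama-3-8B   TP=1  bs=128     10815.07 token/s    10401.75 token/s
llama-3-8B   TP=8  bs=1         298.53 token/s      284.51 token/s
llama-3-70B  TP=8  bs=1          82.00 token/s       82.57 token/s

torch.compile wins at TP=1, bs=1 and is roughly neutral for llama-3-70B at TP=8; it is slower at bs=128 and for llama-3-8B at TP=8.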

Environment

python3 -m sglang.check_env

GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0

NVCC: Cuda compilation tools, release 12.3, V12.3.107
CUDA Driver Version: 545.23.08

PyTorch: 2.4.0+cu121
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
vllm: 0.5.5
NVIDIA Topology: mostly NV18

commit: 79ece2c51f47ee6b792c6282a6f76987892c5f8d (Fri Aug 30)
merrymercy changed the title from "Torch.compile Performance Track" to "Torch.compile Performance Tracking" on Aug 10, 2024