Compiled model with torch.compile, unfortunately without performance improvements #2131

Open · wants to merge 3 commits into main

Conversation

@sh1ng (Contributor) commented Dec 15, 2023

A follow-up of #42 cc @zhuohan123

torch.jit.script and TorchScript can't be used, because the forward methods take parameters that are not among TorchScript's supported types (https://pytorch.org/docs/stable/jit_language_reference.html#supported-type).
torch.jit.trace looks even more challenging.
I was only able to make it run by using torch.compile with a minimal @torch.compiler.disable addition. Unfortunately, I only see a performance degradation (RTX 3090).
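
For readers unfamiliar with the pattern, here is a minimal sketch of combining torch.compile with @torch.compiler.disable. The module and method names are hypothetical illustrations, not the actual vLLM code touched by this PR:

```python
import torch
import torch.nn as nn


class ToyModel(nn.Module):
    """Hypothetical stand-in for a model whose forward is mostly traceable."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.proj = nn.Linear(hidden, hidden)

    @torch.compiler.disable  # keep this piece out of the compiled graph
    def custom_cache_op(self, x: torch.Tensor) -> torch.Tensor:
        # Stand-in for logic Dynamo cannot handle well (custom CUDA kernels,
        # control flow on non-tensor inputs, etc.); it runs eagerly.
        return x

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.custom_cache_op(self.proj(x))


model = ToyModel()
compiled = torch.compile(model)      # Dynamo + Inductor by default
out = compiled(torch.randn(2, 64))   # the first call triggers compilation
```

torch.compiler.disable introduces a graph break around the decorated method, so the rest of the forward pass is still compiled while the marked code runs eagerly.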

$ python benchmarks/benchmark_throughput.py --input-len 128 --output-len 128 --num-prompts 1000 --warm-up-prompts 1000 --model facebook/opt-125m
Namespace(backend='vllm', dataset=None, input_len=128, output_len=128, model='facebook/opt-125m', tokenizer='facebook/opt-125m', quantization=None, tensor_parallel_size=1, compile_model=False, n=1, use_beam_search=False, num_prompts=1000, warm_up_prompts=1000, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto')
INFO 12-15 04:11:19 llm_engine.py:73] Initializing an LLM engine with config: model='facebook/opt-125m', tokenizer='facebook/opt-125m', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, compile_model=False, seed=0)
/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_validation.py:114: UserWarning: WARNING: failed to get cudart_version from onnxruntime build info.
  warnings.warn("WARNING: failed to get cudart_version from onnxruntime build info.")
2023-12-15 04:11:23,098 filelock [DEBUG] - Attempting to acquire lock 139820010183728 on /tmp/facebook-opt-125m.lock
2023-12-15 04:11:23,098 filelock [DEBUG] - Lock 139820010183728 acquired on /tmp/facebook-opt-125m.lock
2023-12-15 04:11:23,221 urllib3.connectionpool [DEBUG] - https://huggingface.co:443 "GET /api/models/facebook/opt-125m/revision/main HTTP/1.1" 200 3748
2023-12-15 04:11:23,223 filelock [DEBUG] - Attempting to release lock 139820010183728 on /tmp/facebook-opt-125m.lock
2023-12-15 04:11:23,223 filelock [DEBUG] - Lock 139820010183728 released on /tmp/facebook-opt-125m.lock
2023-12-15 04:11:23,223 filelock [DEBUG] - Attempting to acquire lock 139821305789792 on /tmp/facebook-opt-125m.lock
2023-12-15 04:11:23,223 filelock [DEBUG] - Lock 139821305789792 acquired on /tmp/facebook-opt-125m.lock
2023-12-15 04:11:23,330 urllib3.connectionpool [DEBUG] - https://huggingface.co:443 "GET /api/models/facebook/opt-125m/revision/main HTTP/1.1" 200 3748
2023-12-15 04:11:23,331 filelock [DEBUG] - Attempting to release lock 139821305789792 on /tmp/facebook-opt-125m.lock
2023-12-15 04:11:23,331 filelock [DEBUG] - Lock 139821305789792 released on /tmp/facebook-opt-125m.lock
INFO 12-15 04:11:24 llm_engine.py:223] # GPU blocks: 34503, # CPU blocks: 7281
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████| 1000/1000 [00:10<00:00, 91.47it/s]
Throughput: 91.43 requests/s, 23406.65 tokens/s
$ python benchmarks/benchmark_throughput.py --input-len 128 --output-len 128 --num-prompts 1000 --warm-up-prompts 1000 --model facebook/opt-125m --compile-model True
Namespace(backend='vllm', dataset=None, input_len=128, output_len=128, model='facebook/opt-125m', tokenizer='facebook/opt-125m', quantization=None, tensor_parallel_size=1, compile_model=True, n=1, use_beam_search=False, num_prompts=1000, warm_up_prompts=1000, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto')
INFO 12-15 04:10:07 llm_engine.py:73] Initializing an LLM engine with config: model='facebook/opt-125m', tokenizer='facebook/opt-125m', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, compile_model=True, seed=0)
/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_validation.py:114: UserWarning: WARNING: failed to get cudart_version from onnxruntime build info.
  warnings.warn("WARNING: failed to get cudart_version from onnxruntime build info.")
2023-12-15 04:10:11,338 filelock [DEBUG] - Attempting to acquire lock 140378255170608 on /tmp/facebook-opt-125m.lock
2023-12-15 04:10:11,338 filelock [DEBUG] - Lock 140378255170608 acquired on /tmp/facebook-opt-125m.lock
2023-12-15 04:10:11,500 urllib3.connectionpool [DEBUG] - https://huggingface.co:443 "GET /api/models/facebook/opt-125m/revision/main HTTP/1.1" 200 3748
2023-12-15 04:10:11,502 filelock [DEBUG] - Attempting to release lock 140378255170608 on /tmp/facebook-opt-125m.lock
2023-12-15 04:10:11,503 filelock [DEBUG] - Lock 140378255170608 released on /tmp/facebook-opt-125m.lock
2023-12-15 04:10:11,503 filelock [DEBUG] - Attempting to acquire lock 140378858978608 on /tmp/facebook-opt-125m.lock
2023-12-15 04:10:11,503 filelock [DEBUG] - Lock 140378858978608 acquired on /tmp/facebook-opt-125m.lock
2023-12-15 04:10:11,617 urllib3.connectionpool [DEBUG] - https://huggingface.co:443 "GET /api/models/facebook/opt-125m/revision/main HTTP/1.1" 200 3748
2023-12-15 04:10:11,619 filelock [DEBUG] - Attempting to release lock 140378858978608 on /tmp/facebook-opt-125m.lock
2023-12-15 04:10:11,619 filelock [DEBUG] - Lock 140378858978608 released on /tmp/facebook-opt-125m.lock
INFO 12-15 04:10:18 llm_engine.py:223] # GPU blocks: 34524, # CPU blocks: 7281
/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
  torch.has_cuda,
/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
  torch.has_cudnn,
/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
  torch.has_mps,
/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
  torch.has_mkldnn,
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████| 1000/1000 [00:11<00:00, 85.64it/s]
Throughput: 85.61 requests/s, 21915.10 tokens/s

llama

$ python benchmarks/benchmark_throughput.py --input-len 128 --output-len 128 --num-prompts 1000 --warm-up-prompts 1000 --model h2oai/h2ogpt-4096-llama2-7b-chat 
Namespace(backend='vllm', dataset=None, input_len=128, output_len=128, model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', quantization=None, tensor_parallel_size=1, compile_model=False, n=1, use_beam_search=False, num_prompts=1000, warm_up_prompts=1000, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto')
INFO 12-15 04:45:25 llm_engine.py:73] Initializing an LLM engine with config: model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, compile_model=False, seed=0)
INFO 12-15 04:45:25 tokenizer.py:32] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_validation.py:114: UserWarning: WARNING: failed to get cudart_version from onnxruntime build info.
  warnings.warn("WARNING: failed to get cudart_version from onnxruntime build info.")
2023-12-15 04:45:29,229 filelock [DEBUG] - Attempting to acquire lock 140026768499296 on /tmp/h2oai-h2ogpt-4096-llama2-7b-chat.lock
2023-12-15 04:45:29,229 filelock [DEBUG] - Lock 140026768499296 acquired on /tmp/h2oai-h2ogpt-4096-llama2-7b-chat.lock
2023-12-15 04:45:29,321 urllib3.connectionpool [DEBUG] - https://huggingface.co:443 "GET /api/models/h2oai/h2ogpt-4096-llama2-7b-chat/revision/main HTTP/1.1" 200 2270
2023-12-15 04:45:29,325 filelock [DEBUG] - Attempting to release lock 140026768499296 on /tmp/h2oai-h2ogpt-4096-llama2-7b-chat.lock
2023-12-15 04:45:29,325 filelock [DEBUG] - Lock 140026768499296 released on /tmp/h2oai-h2ogpt-4096-llama2-7b-chat.lock
INFO 12-15 04:45:32 llm_engine.py:223] # GPU blocks: 881, # CPU blocks: 512
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████| 2000/2000 [04:24<00:00,  7.56it/s]
Throughput: 7.56 requests/s, 1936.53 tokens/s
$ python benchmarks/benchmark_throughput.py --input-len 128 --output-len 128 --num-prompts 1000 --warm-up-prompts 1000 --model h2oai/h2ogpt-4096-llama2-7b-chat --compile-model True
Namespace(backend='vllm', dataset=None, input_len=128, output_len=128, model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', quantization=None, tensor_parallel_size=1, compile_model=True, n=1, use_beam_search=False, num_prompts=1000, warm_up_prompts=1000, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto')
INFO 12-15 04:35:47 llm_engine.py:73] Initializing an LLM engine with config: model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, compile_model=True, seed=0)
INFO 12-15 04:35:47 tokenizer.py:32] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_validation.py:114: UserWarning: WARNING: failed to get cudart_version from onnxruntime build info.
  warnings.warn("WARNING: failed to get cudart_version from onnxruntime build info.")
2023-12-15 04:35:50,578 filelock [DEBUG] - Attempting to acquire lock 139709952418400 on /tmp/h2oai-h2ogpt-4096-llama2-7b-chat.lock
2023-12-15 04:35:50,578 filelock [DEBUG] - Lock 139709952418400 acquired on /tmp/h2oai-h2ogpt-4096-llama2-7b-chat.lock
2023-12-15 04:35:50,689 urllib3.connectionpool [DEBUG] - https://huggingface.co:443 "GET /api/models/h2oai/h2ogpt-4096-llama2-7b-chat/revision/main HTTP/1.1" 200 2270
2023-12-15 04:35:50,692 filelock [DEBUG] - Attempting to release lock 139709952418400 on /tmp/h2oai-h2ogpt-4096-llama2-7b-chat.lock
2023-12-15 04:35:50,692 filelock [DEBUG] - Lock 139709952418400 released on /tmp/h2oai-h2ogpt-4096-llama2-7b-chat.lock
INFO 12-15 04:36:08 llm_engine.py:223] # GPU blocks: 876, # CPU blocks: 512
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████| 2000/2000 [04:27<00:00,  7.49it/s]
Throughput: 7.49 requests/s, 1916.22 tokens/s

This PR can be considered a first step toward using torch.compile for further improvements.

BTW, the onnxrt backend returns:

onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Type Error: Type (seq(tensor(float16))) of output arg (_val_9) of node (_inline_aten_split_with_sizesn0) does not match expected type (seq(tensor(float))).
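
For context, that error comes from the ONNX Runtime path rather than the default Inductor backend; selecting it looks roughly like the sketch below (my own illustration, assuming the onnxruntime package is installed):

```python
import torch
import torch.nn as nn

# Sketch only: compile through the ONNX Runtime ("onnxrt") backend instead of
# the default "inductor" backend. Requires onnxruntime to be installed.
model = nn.Linear(64, 64)
compiled = torch.compile(model, backend="onnxrt")
out = compiled(torch.randn(2, 64))
```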

@sh1ng changed the title from "Compile model with torch.compile, unfortunatly without performance improvments" to "Compiled model with torch.compile, unfortunately without performance improvements" on Dec 15, 2023
@sh1ng (Contributor, Author) commented Dec 21, 2023

$ python benchmarks/benchmark_throughput.py --output-len 128 --num-prompts 1000 --warm-up-prompts 100 --model h2oai/h2ogpt-4096-llama2-7b-chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json
Namespace(backend='vllm', dataset='ShareGPT_V3_unfiltered_cleaned_split.json', input_len=None, output_len=128, model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', quantization=None, tensor_parallel_size=1, compile_model=False, n=1, use_beam_search=False, num_prompts=1000, warm_up_prompts=100, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', enforce_eager=False)
Throughput: 4.77 requests/s, 1873.85 tokens/s
$ python benchmarks/benchmark_throughput.py --output-len 128 --num-prompts 1000 --warm-up-prompts 100 --model h2oai/h2ogpt-4096-llama2-7b-chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json --enforce-eager
Namespace(backend='vllm', dataset='ShareGPT_V3_unfiltered_cleaned_split.json', input_len=None, output_len=128, model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', quantization=None, tensor_parallel_size=1, compile_model=False, n=1, use_beam_search=False, num_prompts=1000, warm_up_prompts=100, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', enforce_eager=True)
Throughput: 4.65 requests/s, 1827.69 tokens/s
$ python benchmarks/benchmark_throughput.py --output-len 128 --num-prompts 1000 --warm-up-prompts 100 --model h2oai/h2ogpt-4096-llama2-7b-chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json --enforce-eager --compile-model=True
Namespace(backend='vllm', dataset='ShareGPT_V3_unfiltered_cleaned_split.json', input_len=None, output_len=128, model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', quantization=None, tensor_parallel_size=1, compile_model=True, n=1, use_beam_search=False, num_prompts=1000, warm_up_prompts=100, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', enforce_eager=True)
Throughput: 4.53 requests/s, 1778.97 tokens/s
$ python benchmarks/benchmark_throughput.py --output-len 128 --num-prompts 1000 --warm-up-prompts 100 --model h2oai/h2ogpt-4096-llama2-7b-chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json --compile-model=True
Namespace(backend='vllm', dataset='ShareGPT_V3_unfiltered_cleaned_split.json', input_len=None, output_len=128, model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', quantization=None, tensor_parallel_size=1, compile_model=True, n=1, use_beam_search=False, num_prompts=1000, warm_up_prompts=100, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', enforce_eager=False)
Throughput: 4.67 requests/s, 1835.51 tokens/s

@UIHCRITT commented Dec 22, 2023

Using your code, I ran vicuna-7b on a single L40 (torch.__version__ 2.1.0+cu121, vllm 0.2.2) and found:

(float + compile, bs=1, code from gpt-fast): 50.07 tokens/sec;
(float + vllm, bs=1): 46.24 tokens/sec;
(float + vllm + compile): 42.71 tokens/sec.

So using torch.compile gives no performance improvement here either.
On the other hand, I tried compiling the whole model forward with torch.compile() instead of relying on @torch.compiler.disable, in vllm/worker/worker.py, def execute_model.

Before:

```python
output = self.model(
    input_ids=input_tokens,
    positions=input_positions,
    kv_caches=self.gpu_cache,
    input_metadata=input_metadata,
    cache_events=cache_events,
)
```

After:

```python
def _model_forward(model, input_ids, positions, kv_caches,
                   input_metadata, cache_events):
    return model(input_ids, positions, kv_caches, input_metadata, cache_events)


model_forward = torch.compile(_model_forward, mode="reduce-overhead", fullgraph=True)

output = model_forward(
    self.model,
    input_tokens,
    input_positions,
    self.gpu_cache,
    input_metadata,
    cache_events,
)
```
When I run this code, I get the error: NotImplementedError: ProcessGroupVariable().
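
That error usually means Dynamo hit a torch.distributed ProcessGroup object it cannot trace, and fullgraph=True turns the resulting graph break into a hard failure. A possible workaround, sketched below under that assumption (not a verified fix for this PR), is to allow graph breaks so the untraceable call falls back to eager execution:

```python
import torch


def _model_forward(model, input_ids, positions, kv_caches,
                   input_metadata, cache_events):
    return model(input_ids, positions, kv_caches, input_metadata, cache_events)


# fullgraph=False (the default) lets Dynamo insert graph breaks around the
# distributed ops instead of raising. Note that mode="reduce-overhead" uses
# CUDA graphs, which may interact with vLLM's own CUDA-graph capture unless
# --enforce-eager is set.
model_forward = torch.compile(_model_forward, mode="reduce-overhead", fullgraph=False)
```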

@Lvjinhong commented

For the latest version, v0.2.7, is there any meaningful acceleration from the compiler?
