Compiled model with torch.compile, unfortunately without performance improvements #2131

Open · wants to merge 3 commits into main

Conversation

@sh1ng (Contributor) commented Dec 15, 2023

A follow-up of #42 cc @zhuohan123

torch.jit.script and TorchScript can't be used, because the forward methods take parameters that are not among TorchScript's supported types (https://pytorch.org/docs/stable/jit_language_reference.html#supported-type).
torch.jit.trace looks even more challenging.
I was only able to make it run by using torch.compile with a minimal @torch.compiler.disable addition. Unfortunately, I only see a performance degradation (RTX 3090).
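
For readers unfamiliar with the pattern, here is a minimal sketch of combining torch.compile with @torch.compiler.disable. The module and method names are hypothetical illustrations, not the actual vLLM code touched by this PR:

```python
import torch
import torch.nn as nn


class ToyModel(nn.Module):
    """Hypothetical stand-in for a model whose forward is mostly traceable."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.proj = nn.Linear(hidden, hidden)

    @torch.compiler.disable  # keep this piece out of the compiled graph
    def custom_cache_op(self, x: torch.Tensor) -> torch.Tensor:
        # Stand-in for logic Dynamo cannot handle well (custom CUDA kernels,
        # control flow on non-tensor inputs, etc.); it runs eagerly.
        return x

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.custom_cache_op(self.proj(x))


model = ToyModel()
compiled = torch.compile(model)      # Dynamo + Inductor by default
out = compiled(torch.randn(2, 64))   # the first call triggers compilation
```

torch.compiler.disable introduces a graph break around the decorated method, so the rest of the forward pass is still compiled while the marked code runs eagerly.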

$ python benchmarks/benchmark_throughput.py --input-len 128 --output-len 128 --num-prompts 1000 --warm-up-prompts 1000 --model facebook/opt-125m
Namespace(backend='vllm', dataset=None, input_len=128, output_len=128, model='facebook/opt-125m', tokenizer='facebook/opt-125m', quantization=None, tensor_parallel_size=1, compile_model=False, n=1, use_beam_search=False, num_prompts=1000, warm_up_prompts=1000, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto')
INFO 12-15 04:11:19 llm_engine.py:73] Initializing an LLM engine with config: model='facebook/opt-125m', tokenizer='facebook/opt-125m', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, compile_model=False, seed=0)
/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_validation.py:114: UserWarning: WARNING: failed to get cudart_version from onnxruntime build info.
  warnings.warn("WARNING: failed to get cudart_version from onnxruntime build info.")
2023-12-15 04:11:23,098 filelock [DEBUG] - Attempting to acquire lock 139820010183728 on /tmp/facebook-opt-125m.lock
2023-12-15 04:11:23,098 filelock [DEBUG] - Lock 139820010183728 acquired on /tmp/facebook-opt-125m.lock
2023-12-15 04:11:23,221 urllib3.connectionpool [DEBUG] - https://huggingface.co:443 "GET /api/models/facebook/opt-125m/revision/main HTTP/1.1" 200 3748
2023-12-15 04:11:23,223 filelock [DEBUG] - Attempting to release lock 139820010183728 on /tmp/facebook-opt-125m.lock
2023-12-15 04:11:23,223 filelock [DEBUG] - Lock 139820010183728 released on /tmp/facebook-opt-125m.lock
2023-12-15 04:11:23,223 filelock [DEBUG] - Attempting to acquire lock 139821305789792 on /tmp/facebook-opt-125m.lock
2023-12-15 04:11:23,223 filelock [DEBUG] - Lock 139821305789792 acquired on /tmp/facebook-opt-125m.lock
2023-12-15 04:11:23,330 urllib3.connectionpool [DEBUG] - https://huggingface.co:443 "GET /api/models/facebook/opt-125m/revision/main HTTP/1.1" 200 3748
2023-12-15 04:11:23,331 filelock [DEBUG] - Attempting to release lock 139821305789792 on /tmp/facebook-opt-125m.lock
2023-12-15 04:11:23,331 filelock [DEBUG] - Lock 139821305789792 released on /tmp/facebook-opt-125m.lock
INFO 12-15 04:11:24 llm_engine.py:223] # GPU blocks: 34503, # CPU blocks: 7281
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████| 1000/1000 [00:10<00:00, 91.47it/s]
Throughput: 91.43 requests/s, 23406.65 tokens/s
$ python benchmarks/benchmark_throughput.py --input-len 128 --output-len 128 --num-prompts 1000 --warm-up-prompts 1000 --model facebook/opt-125m --compile-model True
Namespace(backend='vllm', dataset=None, input_len=128, output_len=128, model='facebook/opt-125m', tokenizer='facebook/opt-125m', quantization=None, tensor_parallel_size=1, compile_model=True, n=1, use_beam_search=False, num_prompts=1000, warm_up_prompts=1000, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto')
INFO 12-15 04:10:07 llm_engine.py:73] Initializing an LLM engine with config: model='facebook/opt-125m', tokenizer='facebook/opt-125m', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, compile_model=True, seed=0)
/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_validation.py:114: UserWarning: WARNING: failed to get cudart_version from onnxruntime build info.
  warnings.warn("WARNING: failed to get cudart_version from onnxruntime build info.")
2023-12-15 04:10:11,338 filelock [DEBUG] - Attempting to acquire lock 140378255170608 on /tmp/facebook-opt-125m.lock
2023-12-15 04:10:11,338 filelock [DEBUG] - Lock 140378255170608 acquired on /tmp/facebook-opt-125m.lock
2023-12-15 04:10:11,500 urllib3.connectionpool [DEBUG] - https://huggingface.co:443 "GET /api/models/facebook/opt-125m/revision/main HTTP/1.1" 200 3748
2023-12-15 04:10:11,502 filelock [DEBUG] - Attempting to release lock 140378255170608 on /tmp/facebook-opt-125m.lock
2023-12-15 04:10:11,503 filelock [DEBUG] - Lock 140378255170608 released on /tmp/facebook-opt-125m.lock
2023-12-15 04:10:11,503 filelock [DEBUG] - Attempting to acquire lock 140378858978608 on /tmp/facebook-opt-125m.lock
2023-12-15 04:10:11,503 filelock [DEBUG] - Lock 140378858978608 acquired on /tmp/facebook-opt-125m.lock
2023-12-15 04:10:11,617 urllib3.connectionpool [DEBUG] - https://huggingface.co:443 "GET /api/models/facebook/opt-125m/revision/main HTTP/1.1" 200 3748
2023-12-15 04:10:11,619 filelock [DEBUG] - Attempting to release lock 140378858978608 on /tmp/facebook-opt-125m.lock
2023-12-15 04:10:11,619 filelock [DEBUG] - Lock 140378858978608 released on /tmp/facebook-opt-125m.lock
INFO 12-15 04:10:18 llm_engine.py:223] # GPU blocks: 34524, # CPU blocks: 7281
/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
  torch.has_cuda,
/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
  torch.has_cudnn,
/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
  torch.has_mps,
/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
  torch.has_mkldnn,
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████| 1000/1000 [00:11<00:00, 85.64it/s]
Throughput: 85.61 requests/s, 21915.10 tokens/s

llama

$ python benchmarks/benchmark_throughput.py --input-len 128 --output-len 128 --num-prompts 1000 --warm-up-prompts 1000 --model h2oai/h2ogpt-4096-llama2-7b-chat 
Namespace(backend='vllm', dataset=None, input_len=128, output_len=128, model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', quantization=None, tensor_parallel_size=1, compile_model=False, n=1, use_beam_search=False, num_prompts=1000, warm_up_prompts=1000, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto')
INFO 12-15 04:45:25 llm_engine.py:73] Initializing an LLM engine with config: model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, compile_model=False, seed=0)
INFO 12-15 04:45:25 tokenizer.py:32] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_validation.py:114: UserWarning: WARNING: failed to get cudart_version from onnxruntime build info.
  warnings.warn("WARNING: failed to get cudart_version from onnxruntime build info.")
2023-12-15 04:45:29,229 filelock [DEBUG] - Attempting to acquire lock 140026768499296 on /tmp/h2oai-h2ogpt-4096-llama2-7b-chat.lock
2023-12-15 04:45:29,229 filelock [DEBUG] - Lock 140026768499296 acquired on /tmp/h2oai-h2ogpt-4096-llama2-7b-chat.lock
2023-12-15 04:45:29,321 urllib3.connectionpool [DEBUG] - https://huggingface.co:443 "GET /api/models/h2oai/h2ogpt-4096-llama2-7b-chat/revision/main HTTP/1.1" 200 2270
2023-12-15 04:45:29,325 filelock [DEBUG] - Attempting to release lock 140026768499296 on /tmp/h2oai-h2ogpt-4096-llama2-7b-chat.lock
2023-12-15 04:45:29,325 filelock [DEBUG] - Lock 140026768499296 released on /tmp/h2oai-h2ogpt-4096-llama2-7b-chat.lock
INFO 12-15 04:45:32 llm_engine.py:223] # GPU blocks: 881, # CPU blocks: 512
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████| 2000/2000 [04:24<00:00,  7.56it/s]
Throughput: 7.56 requests/s, 1936.53 tokens/s
$ python benchmarks/benchmark_throughput.py --input-len 128 --output-len 128 --num-prompts 1000 --warm-up-prompts 1000 --model h2oai/h2ogpt-4096-llama2-7b-chat --compile-model True
Namespace(backend='vllm', dataset=None, input_len=128, output_len=128, model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', quantization=None, tensor_parallel_size=1, compile_model=True, n=1, use_beam_search=False, num_prompts=1000, warm_up_prompts=1000, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto')
INFO 12-15 04:35:47 llm_engine.py:73] Initializing an LLM engine with config: model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, compile_model=True, seed=0)
INFO 12-15 04:35:47 tokenizer.py:32] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_validation.py:114: UserWarning: WARNING: failed to get cudart_version from onnxruntime build info.
  warnings.warn("WARNING: failed to get cudart_version from onnxruntime build info.")
2023-12-15 04:35:50,578 filelock [DEBUG] - Attempting to acquire lock 139709952418400 on /tmp/h2oai-h2ogpt-4096-llama2-7b-chat.lock
2023-12-15 04:35:50,578 filelock [DEBUG] - Lock 139709952418400 acquired on /tmp/h2oai-h2ogpt-4096-llama2-7b-chat.lock
2023-12-15 04:35:50,689 urllib3.connectionpool [DEBUG] - https://huggingface.co:443 "GET /api/models/h2oai/h2ogpt-4096-llama2-7b-chat/revision/main HTTP/1.1" 200 2270
2023-12-15 04:35:50,692 filelock [DEBUG] - Attempting to release lock 139709952418400 on /tmp/h2oai-h2ogpt-4096-llama2-7b-chat.lock
2023-12-15 04:35:50,692 filelock [DEBUG] - Lock 139709952418400 released on /tmp/h2oai-h2ogpt-4096-llama2-7b-chat.lock
INFO 12-15 04:36:08 llm_engine.py:223] # GPU blocks: 876, # CPU blocks: 512
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████| 2000/2000 [04:27<00:00,  7.49it/s]
Throughput: 7.49 requests/s, 1916.22 tokens/s

This PR can be considered a first step toward using torch.compile for further improvements.

BTW, the onnxrt backend returns:

onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Type Error: Type (seq(tensor(float16))) of output arg (_val_9) of node (_inline_aten_split_with_sizesn0) does not match expected type (seq(tensor(float))).
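
For context, that error comes from the ONNX Runtime path rather than the default Inductor backend; selecting it looks roughly like the sketch below (my own illustration, assuming the onnxruntime package is installed):

```python
import torch
import torch.nn as nn

# Sketch only: compile through the ONNX Runtime ("onnxrt") backend instead of
# the default "inductor" backend. Requires onnxruntime to be installed.
model = nn.Linear(64, 64)
compiled = torch.compile(model, backend="onnxrt")
out = compiled(torch.randn(2, 64))
```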

@sh1ng changed the title from "Compile model with torch.compile, unfortunatly without performance improvments" to "Compiled model with torch.compile, unfortunately without performance improvements" on Dec 15, 2023
@sh1ng (Contributor, Author) commented Dec 21, 2023

$ python benchmarks/benchmark_throughput.py --output-len 128 --num-prompts 1000 --warm-up-prompts 100 --model h2oai/h2ogpt-4096-llama2-7b-chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json
Namespace(backend='vllm', dataset='ShareGPT_V3_unfiltered_cleaned_split.json', input_len=None, output_len=128, model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', quantization=None, tensor_parallel_size=1, compile_model=False, n=1, use_beam_search=False, num_prompts=1000, warm_up_prompts=100, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', enforce_eager=False)
Throughput: 4.77 requests/s, 1873.85 tokens/s
$ python benchmarks/benchmark_throughput.py --output-len 128 --num-prompts 1000 --warm-up-prompts 100 --model h2oai/h2ogpt-4096-llama2-7b-chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json --enforce-eager
Namespace(backend='vllm', dataset='ShareGPT_V3_unfiltered_cleaned_split.json', input_len=None, output_len=128, model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', quantization=None, tensor_parallel_size=1, compile_model=False, n=1, use_beam_search=False, num_prompts=1000, warm_up_prompts=100, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', enforce_eager=True)
Throughput: 4.65 requests/s, 1827.69 tokens/s
$ python benchmarks/benchmark_throughput.py --output-len 128 --num-prompts 1000 --warm-up-prompts 100 --model h2oai/h2ogpt-4096-llama2-7b-chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json --enforce-eager --compile-model=True
Namespace(backend='vllm', dataset='ShareGPT_V3_unfiltered_cleaned_split.json', input_len=None, output_len=128, model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', quantization=None, tensor_parallel_size=1, compile_model=True, n=1, use_beam_search=False, num_prompts=1000, warm_up_prompts=100, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', enforce_eager=True)
Throughput: 4.53 requests/s, 1778.97 tokens/s
$ python benchmarks/benchmark_throughput.py --output-len 128 --num-prompts 1000 --warm-up-prompts 100 --model h2oai/h2ogpt-4096-llama2-7b-chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json --compile-model=True
Namespace(backend='vllm', dataset='ShareGPT_V3_unfiltered_cleaned_split.json', input_len=None, output_len=128, model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', quantization=None, tensor_parallel_size=1, compile_model=True, n=1, use_beam_search=False, num_prompts=1000, warm_up_prompts=100, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', enforce_eager=False)
Throughput: 4.67 requests/s, 1835.51 tokens/s

@UIHCRITT commented Dec 22, 2023

Using your code, I ran vicuna-7b on a single L40 (torch.__version__ 2.1.0+cu121, vllm 0.2.2) and found:

(float + compile, bs=1, code from gpt-fast): 50.07 tokens/sec;
(float + vllm, bs=1): 46.24 tokens/sec;
(float + vllm + compile): 42.71 tokens/sec.

So using torch.compile gives no performance improvement here either.
On the other hand, I tried compiling the whole model forward with torch.compile() instead of relying on @torch.compiler.disable, in vllm/worker/worker.py, def execute_model.

Before:

```python
output = self.model(
    input_ids=input_tokens,
    positions=input_positions,
    kv_caches=self.gpu_cache,
    input_metadata=input_metadata,
    cache_events=cache_events,
)
```

After:

```python
def _model_forward(model, input_ids, positions, kv_caches,
                   input_metadata, cache_events):
    return model(input_ids, positions, kv_caches, input_metadata, cache_events)


model_forward = torch.compile(_model_forward, mode="reduce-overhead", fullgraph=True)

output = model_forward(
    self.model,
    input_tokens,
    input_positions,
    self.gpu_cache,
    input_metadata,
    cache_events,
)
```
When I run this code, I get the error: NotImplementedError: ProcessGroupVariable().
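
That error usually means Dynamo hit a torch.distributed ProcessGroup object it cannot trace, and fullgraph=True turns the resulting graph break into a hard failure. A possible workaround, sketched below under that assumption (not a verified fix for this PR), is to allow graph breaks so the untraceable call falls back to eager execution:

```python
import torch


def _model_forward(model, input_ids, positions, kv_caches,
                   input_metadata, cache_events):
    return model(input_ids, positions, kv_caches, input_metadata, cache_events)


# fullgraph=False (the default) lets Dynamo insert graph breaks around the
# distributed ops instead of raising. Note that mode="reduce-overhead" uses
# CUDA graphs, which may interact with vLLM's own CUDA-graph capture unless
# --enforce-eager is set.
model_forward = torch.compile(_model_forward, mode="reduce-overhead", fullgraph=False)
```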

@Lvjinhong commented

For the latest version, v0.2.7, is there any meaningful acceleration from the compiler?
