support chatglm3 #1558

Closed · wants to merge 30 commits
Conversation


@gameofdimension commented Nov 4, 2023

Correctness

We tested this PR with tp=1/2/4; all configurations emit reasonable output:

Prompt: 'Hello, my name is', Generated text: ' 👨\u200d💻 and I am a computer program'
Prompt: 'The president of the United States is', Generated text: ' 45 years old. The vice president of the United States is 4'
Prompt: 'The capital of France is', Generated text: " 100 miles from the center of the Earth. If the Earth'"
Prompt: 'The future of AI is', Generated text: ' 100% predictable. | by Aditya Bhargava |'
Prompt: '希望这篇文章能', Generated text: ' 帮助您更好地了解我。 \n这篇文章将介绍我,一个名为'
Prompt: '给六岁小朋友解释一下万有引', Generated text: '力的概念\n萬有引力是地球和其他物体之间的一种吸引力。它是由'
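For reference, output like the above can be produced with vLLM's offline inference API. A minimal sketch (the sampling parameters are assumptions, since the PR does not state them):

```python
from vllm import LLM, SamplingParams

# Assumed sampling settings; the PR does not state the exact configuration.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# tensor_parallel_size was tested with 1, 2, and 4.
llm = LLM(model="THUDM/chatglm3-6b", trust_remote_code=True,
          tensor_parallel_size=1)

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
for output in llm.generate(prompts, sampling_params):
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")
```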

Speedup

Compared with baichuan-inc/baichuan-7B at tp=1, it achieves a better speedup:

| checkpoint | backend | params | throughput | speedup |
| --- | --- | --- | --- | --- |
| baichuan-inc/baichuan-7B | hf | hf_max_batch_size=4 | 136.01 tokens/s | |
| baichuan-inc/baichuan-7B | vllm | tensor_parallel_size=1 | 1721.74 tokens/s | 12.7 |
| THUDM/chatglm3-6b | hf | hf_max_batch_size=4 | 204.21 tokens/s | |
| THUDM/chatglm3-6b | vllm | tensor_parallel_size=1 | 4564.52 tokens/s | 22.35 |
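(Speedup is the vLLM throughput divided by the HF baseline for the same checkpoint, e.g. 4564.52 / 204.21 ≈ 22.35.)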

@gameofdimension (Author)

Because the model architecture is the same as chatglm2-6b, this PR also works for chatglm2-6b:

Prompt: 'Hello, my name is', Generated text: " Toots. I'm a member of the Hello, Toots community."
Prompt: 'The president of the United States is', Generated text: '\xa0the head of the executive branch and has the power to make decisions and take'
Prompt: 'The capital of France is', Generated text: ' In the capital of France, the population is approximately 12 million people.'
Prompt: 'The future of AI is', Generated text: ' to become more human-like and to be able to perform tasks that currently require'
Prompt: '希望这篇文章能', Generated text: '您的公司带来新的发展机遇。\n\n'
Prompt: '给六岁小朋友解释一下万有引', Generated text: '力的概念。 首先,我们需要知道什么是引力。引力是物体之间的相互作用'

@wengyuan722

root@autodl-container-43e51183ae-e83acbf8:~/vllm-main/vllm/entrypoints/openai# python api_server.py --host 127.0.0.1 --port 6006 --model /root/autodl-tmp/chatglm3-6b --trust-remote-code --served-model-name chatglm3-6b --max-num-batched-tokens 8192 --max-model-len 8192
INFO 11-06 10:14:50 llm_engine.py:72] Initializing an LLM engine with config: model='/root/autodl-tmp/chatglm3-6b', tokenizer='/root/autodl-tmp/chatglm3-6b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
Traceback (most recent call last):
  File "api_server.py", line 623, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/root/miniconda3/lib/python3.8/site-packages/vllm/engine/async_llm_engine.py", line 487, in from_engine_args
    engine = cls(engine_args.worker_use_ray,
  File "/root/miniconda3/lib/python3.8/site-packages/vllm/engine/async_llm_engine.py", line 270, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/vllm/engine/async_llm_engine.py", line 306, in _init_engine
    return engine_class(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 96, in __init__
    self._verify_args()
  File "/root/miniconda3/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 187, in _verify_args
    self.model_config.verify_with_parallel_config(self.parallel_config)
  File "/root/miniconda3/lib/python3.8/site-packages/vllm/config.py", line 128, in verify_with_parallel_config
    total_num_hidden_layers = self.hf_config.num_hidden_layers
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/configuration_utils.py", line 261, in __getattribute__
    return super().__getattribute__(key)
AttributeError: 'ChatGLMConfig' object has no attribute 'num_hidden_layers'

I used openai/api_server.py to start the chatglm3-6b API; it returns AttributeError: 'ChatGLMConfig' object has no attribute 'num_hidden_layers'.

@gameofdimension (Author)

> I used openai/api_server.py to start the chatglm3-6b API; it returns AttributeError: 'ChatGLMConfig' object has no attribute 'num_hidden_layers'.

Maybe you are testing against the wrong commit.

@cauyxy (Contributor) left a comment


LGTM. Waiting for @zhuohan123 to take a look at it.

@simon-mo mentioned this pull request Nov 6, 2023
@zhuohan123 (Collaborator) left a comment


Thank you for your contribution! Left some style comments.

The PR runs correctly on my side on a single GPU. We should be able to merge this after the styles are fixed.

Comment on lines +116 to +120
@staticmethod
def hf_config_get_num_layers(hf_config):
if getattr(hf_config, "model_type", None) == "chatglm":
return hf_config.num_layers
return hf_config.num_hidden_layers

Please remove this function and add the following to ChatGLMConfig

    attribute_map = {
        "num_hidden_layers": "num_layers",
    }

You can refer to falcon's config:

    attribute_map = {
        "num_hidden_layers": "n_layer",
        "num_attention_heads": "n_head",
        "num_kv_heads": "n_head_kv",
    }
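For context, `attribute_map` is the standard Hugging Face `PretrainedConfig` aliasing mechanism: attribute lookups are rewritten through the map before normal resolution. A minimal sketch of the suggested ChatGLMConfig change (the constructor shown here is trimmed for illustration; only the map itself is the actual suggestion):

```python
from transformers import PretrainedConfig

class ChatGLMConfig(PretrainedConfig):
    model_type = "chatglm"
    # Alias so generic code that reads num_hidden_layers keeps working.
    attribute_map = {
        "num_hidden_layers": "num_layers",
    }

    def __init__(self, num_layers=28, **kwargs):
        self.num_layers = num_layers
        super().__init__(**kwargs)

# config.num_hidden_layers now resolves to config.num_layers:
config = ChatGLMConfig(num_layers=28)
assert config.num_hidden_layers == 28
```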

Comment on lines 42 to +43
is_neox_style: bool,
is_glm_style: bool = False,

Can you change this to:

Suggested change
is_neox_style: bool,
is_glm_style: bool = False,
style: str = "neox",

and set style to "neox", "gptj", "glm" accordingly?
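A minimal sketch of how a single style string might replace the two boolean flags in the rotary embedding (function and variable names here are illustrative, not the PR's actual code):

```python
def apply_rotary(query, key, cos, sin, style: str = "neox"):
    # Illustrative dispatch on one style string instead of the
    # is_neox_style / is_glm_style boolean pair.
    if style == "neox":
        return rotate_neox(query, key, cos, sin)  # rotate-half layout
    elif style == "gptj":
        return rotate_gptj(query, key, cos, sin)  # interleaved layout
    elif style == "glm":
        # GLM applies rotary embedding to only part of the head dim.
        return rotate_glm(query, key, cos, sin)
    raise ValueError(f"Unknown rotary embedding style: {style!r}")
```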

Comment on lines +23 to +24
from vllm.model_executor.weight_utils import hf_model_weights_iterator, load_tensor_parallel_weights, \
load_padded_tensor_parallel_vocab

Please use () for line continuation:

Suggested change
from vllm.model_executor.weight_utils import hf_model_weights_iterator, load_tensor_parallel_weights, \
load_padded_tensor_parallel_vocab
from vllm.model_executor.weight_utils import (
    hf_model_weights_iterator, load_tensor_parallel_weights,
    load_padded_tensor_parallel_vocab)

self.total_num_heads, self.num_heads = compute_tp_num_heads(
    config, tp_world_size)
self.head_dim = config.hidden_size // self.total_num_heads
self.total_num_kv_heads, self.num_kv_heads, num_kv_heads_replicas = \

Please use () for line continuation.
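The parenthesized form for the snippet above would look roughly like this (the right-hand side is cut off in the excerpt, so the call shown is a placeholder):

```python
(self.total_num_kv_heads, self.num_kv_heads,
 num_kv_heads_replicas) = compute_tp_num_kv_heads(  # placeholder RHS
     config, tp_world_size)
```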

Comment on lines 313 to +314
is_neox_style: bool = True,
is_glm_style: bool = False,

Let's use a single style variable:

Suggested change
is_neox_style: bool = True,
is_glm_style: bool = False,
style: str = "neox",

total_num_heads, num_heads = compute_tp_num_heads(
    self.config, tp_world_size)
head_dim = self.config.hidden_size // total_num_heads
total_num_kv_heads, num_kv_heads, num_kv_heads_replicas = \

Please use () for line continuation.

self.attention_dropout = attention_dropout
self.layernorm_epsilon = layernorm_epsilon
self.rmsnorm = rmsnorm
self.apply_residual_connection_post_layernorm = \

Please use () for line continuation.

@zhuohan123 (Collaborator)

Again, thank you for your contribution! We merged #1261 and the current main branch supports ChatGLM now. Let us know if the main branch does not look good and feel free to propose any changes!

@zhuohan123 closed this Nov 7, 2023
@gameofdimension (Author) commented Nov 7, 2023

> Again, thank you for your contribution! We merged #1261 and the current main branch supports ChatGLM now. Let us know if the main branch does not look good and feel free to propose any changes!

PR #1261 might not support TP when the world size is bigger than multi_query_group_num (currently 2). For example, this line will return 0, which is definitely wrong:

return (self.hf_config.multi_query_group_num //
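To illustrate the failure mode (a standalone sketch, not the PR's code): integer-dividing chatglm's multi_query_group_num by the tensor-parallel world size yields 0 KV heads per rank once the world size exceeds the group count.

```python
# chatglm3-6b uses multi-query attention with 2 KV-head groups.
multi_query_group_num = 2

for tp_size in (1, 2, 4, 8):
    print(f"tp={tp_size}: kv_heads_per_rank={multi_query_group_num // tp_size}")
# tp=1: kv_heads_per_rank=2
# tp=2: kv_heads_per_rank=1
# tp=4: kv_heads_per_rank=0   <- broken: each rank gets no KV heads
# tp=8: kv_heads_per_rank=0

# One common remedy (an assumption here, not necessarily what vLLM adopted)
# is to replicate KV heads across ranks when tp_size exceeds the head count:
def kv_heads_per_rank(num_kv_heads: int, tp_size: int) -> int:
    return max(1, num_kv_heads // tp_size)
```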
