support chatglm3 #1558
Conversation
Because the model architecture is the same as chatglm2-6b, this also works for chatglm2-6b.
fix bug (05cd990 to 7fb9149)
I use openai/api_server.py to start a chatglm3-6b API:

```
root@autodl-container-43e51183ae-e83acbf8:~/vllm-main/vllm/entrypoints/openai# python api_server.py --host 127.0.0.1 --port 6006 --model /root/autodl-tmp/chatglm3-6b --trust-remote-code --served-model-name chatglm3-6b --max-num-batched-tokens 8192 --max-model-len 8192
```

It returns: AttributeError: 'ChatGLMConfig' object has no attribute 'num_hidden_layers'
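For reference, here is a minimal sketch of querying the server once it is up, assuming the standard OpenAI-compatible /v1/completions route at the host and port used above:

```python
# Minimal sketch: query the OpenAI-compatible server launched above.
# Assumes it is reachable at http://127.0.0.1:6006 and exposes /v1/completions.
import requests

resp = requests.post(
    "http://127.0.0.1:6006/v1/completions",
    json={"model": "chatglm3-6b", "prompt": "Hello", "max_tokens": 32},
)
print(resp.json())
```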
Maybe you are testing against the wrong commit.
LGTM. Let's wait for @zhuohan123 to take a look at it.
Thank you for your contribution! Left some style comments.
The PR runs correctly on my side on a single GPU. We should be able to merge this once the style issues are fixed.
```python
@staticmethod
def hf_config_get_num_layers(hf_config):
    if getattr(hf_config, "model_type", None) == "chatglm":
        return hf_config.num_layers
    return hf_config.num_hidden_layers
```
Please remove this function and add the following to ChatGLMConfig:

```python
attribute_map = {
    "num_hidden_layers": "num_layers",
}
```
You can refer to falcon's config (vllm/vllm/transformers_utils/configs/falcon.py, lines 25 to 29 in 8516999):

```python
attribute_map = {
    "num_hidden_layers": "n_layer",
    "num_attention_heads": "n_head",
    "num_kv_heads": "n_head_kv",
}
```
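For context, a minimal sketch of how `attribute_map` behaves on a `transformers.PretrainedConfig` subclass; the `num_layers` default of 28 below is illustrative, not taken from the actual ChatGLM config:

```python
# Sketch only: attribute_map on PretrainedConfig aliases attribute lookups,
# so reads of num_hidden_layers resolve to num_layers.
from transformers import PretrainedConfig

class ChatGLMConfig(PretrainedConfig):
    model_type = "chatglm"
    attribute_map = {
        "num_hidden_layers": "num_layers",
    }

    def __init__(self, num_layers=28, **kwargs):  # default is illustrative
        self.num_layers = num_layers
        super().__init__(**kwargs)

config = ChatGLMConfig()
assert config.num_hidden_layers == 28  # aliased to config.num_layers
```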
```python
is_neox_style: bool,
is_glm_style: bool = False,
```
Can you change this to:

```diff
-is_neox_style: bool,
-is_glm_style: bool = False,
+style: str = "neox",
```

and set `style` to "neox", "gptj", or "glm" accordingly?
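As an illustration of that suggestion, a hypothetical sketch of validating a single style parameter; the class and constant names here are made up for the example, not vLLM's actual API:

```python
# Hypothetical sketch: one string-valued style parameter replacing two booleans.
_ROTARY_STYLES = ("neox", "gptj", "glm")

class RotaryEmbedding:
    def __init__(self, head_size: int, style: str = "neox"):
        if style not in _ROTARY_STYLES:
            raise ValueError(
                f"Unknown rotary style {style!r}; expected one of {_ROTARY_STYLES}")
        self.head_size = head_size
        self.style = style  # later dispatch, e.g.: if self.style == "glm": ...
```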
```python
from vllm.model_executor.weight_utils import hf_model_weights_iterator, load_tensor_parallel_weights, \
    load_padded_tensor_parallel_vocab
```
Please use () for line continuation:

```diff
-from vllm.model_executor.weight_utils import hf_model_weights_iterator, load_tensor_parallel_weights, \
-    load_padded_tensor_parallel_vocab
+from vllm.model_executor.weight_utils import (
+    hf_model_weights_iterator, load_tensor_parallel_weights,
+    load_padded_tensor_parallel_vocab)
```
```python
self.total_num_heads, self.num_heads = compute_tp_num_heads(
    config, tp_world_size)
self.head_dim = config.hidden_size // self.total_num_heads
self.total_num_kv_heads, self.num_kv_heads, num_kv_heads_replicas = \
```
Please use () for line continuation.
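For concreteness, one way the quoted assignment could be written with parentheses instead of a backslash; the right-hand side is cut off in the quoted diff, so the helper name below is only a placeholder:

```python
# Sketch only: implicit continuation via parentheses on the assignment targets.
# compute_tp_num_kv_heads is a placeholder for whatever the truncated
# right-hand side actually calls.
(self.total_num_kv_heads, self.num_kv_heads,
 num_kv_heads_replicas) = compute_tp_num_kv_heads(config, tp_world_size)
```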
```python
is_neox_style: bool = True,
is_glm_style: bool = False,
```
Let's use a single style variable:

```diff
-is_neox_style: bool = True,
-is_glm_style: bool = False,
+style: str = "neox",
```
```python
total_num_heads, num_heads = compute_tp_num_heads(
    self.config, tp_world_size)
head_dim = self.config.hidden_size // total_num_heads
total_num_kv_heads, num_kv_heads, num_kv_heads_replicas = \
```
Please use () for line continuation.
```python
self.attention_dropout = attention_dropout
self.layernorm_epsilon = layernorm_epsilon
self.rmsnorm = rmsnorm
self.apply_residual_connection_post_layernorm = \
```
Please use () for line continuation.
Again, thank you for your contribution! We merged #1261 and the current main branch supports ChatGLM now. Let us know if the main branch does not look good and feel free to propose any changes!
PR #1261 might not support TP when the world size is bigger than the value at line 171 in 1a2bbc9.
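For background, TP sharding generally requires head counts to divide evenly by the world size; a generic illustration of such a check (not vLLM's actual code at the referenced line):

```python
# Generic illustration, not vLLM's actual check: tensor parallelism shards
# attention heads across ranks, so the head count must divide evenly.
def check_tp_divisibility(num_kv_heads: int, tp_world_size: int) -> None:
    if num_kv_heads % tp_world_size != 0:
        raise ValueError(
            f"num_kv_heads ({num_kv_heads}) is not divisible by "
            f"tensor parallel world size ({tp_world_size})")
```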
Correctness: we tested this PR with tp=1/2/4; all settings emit reasonable output.

Speedup: it achieves a better speedup compared with baichuan-inc/baichuan-7B when tp=1.