support chatglm3 #1558

Closed · wants to merge 30 commits
Conversation


@gameofdimension commented Nov 4, 2023

Correctness

We tested this PR with tp=1/2/4; all configurations emit reasonable output:

Prompt: 'Hello, my name is', Generated text: ' 👨\u200d💻 and I am a computer program'
Prompt: 'The president of the United States is', Generated text: ' 45 years old. The vice president of the United States is 4'
Prompt: 'The capital of France is', Generated text: " 100 miles from the center of the Earth. If the Earth'"
Prompt: 'The future of AI is', Generated text: ' 100% predictable. | by Aditya Bhargava |'
Prompt: '希望这篇文章能', Generated text: ' 帮助您更好地了解我。 \n这篇文章将介绍我,一个名为'
Prompt: '给六岁小朋友解释一下万有引', Generated text: '力的概念\n萬有引力是地球和其他物体之间的一种吸引力。它是由'
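For reference, output like the above can be produced with vLLM's offline inference API. A minimal sketch (the sampling parameters are assumptions, since the PR does not state them):

```python
from vllm import LLM, SamplingParams

# Assumed sampling settings; the PR does not state the exact configuration.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# tensor_parallel_size was tested with 1, 2, and 4.
llm = LLM(model="THUDM/chatglm3-6b", trust_remote_code=True,
          tensor_parallel_size=1)

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
for output in llm.generate(prompts, sampling_params):
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")
```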

Speedup

Compared with baichuan-inc/baichuan-7B at tp=1, it achieves a better speedup:

| checkpoint | backend | params | throughput | speedup |
| --- | --- | --- | --- | --- |
| baichuan-inc/baichuan-7B | hf | hf_max_batch_size=4 | 136.01 tokens/s | |
| baichuan-inc/baichuan-7B | vllm | tensor_parallel_size=1 | 1721.74 tokens/s | 12.7 |
| THUDM/chatglm3-6b | hf | hf_max_batch_size=4 | 204.21 tokens/s | |
| THUDM/chatglm3-6b | vllm | tensor_parallel_size=1 | 4564.52 tokens/s | 22.35 |
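(Speedup is the vLLM throughput divided by the HF baseline for the same checkpoint, e.g. 4564.52 / 204.21 ≈ 22.35.)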

@gameofdimension (Author)

Because the model architecture is the same as chatglm2-6b, this PR also works for chatglm2-6b:

Prompt: 'Hello, my name is', Generated text: " Toots. I'm a member of the Hello, Toots community."
Prompt: 'The president of the United States is', Generated text: '\xa0the head of the executive branch and has the power to make decisions and take'
Prompt: 'The capital of France is', Generated text: ' In the capital of France, the population is approximately 12 million people.'
Prompt: 'The future of AI is', Generated text: ' to become more human-like and to be able to perform tasks that currently require'
Prompt: '希望这篇文章能', Generated text: '您的公司带来新的发展机遇。\n\n'
Prompt: '给六岁小朋友解释一下万有引', Generated text: '力的概念。 首先,我们需要知道什么是引力。引力是物体之间的相互作用'

@wengyuan722

root@autodl-container-43e51183ae-e83acbf8:~/vllm-main/vllm/entrypoints/openai# python api_server.py --host 127.0.0.1 --port 6006 --model /root/autodl-tmp/chatglm3-6b --trust-remote-code --served-model-name chatglm3-6b --max-num-batched-tokens 8192 --max-model-len 8192
INFO 11-06 10:14:50 llm_engine.py:72] Initializing an LLM engine with config: model='/root/autodl-tmp/chatglm3-6b', tokenizer='/root/autodl-tmp/chatglm3-6b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
Traceback (most recent call last):
  File "api_server.py", line 623, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/root/miniconda3/lib/python3.8/site-packages/vllm/engine/async_llm_engine.py", line 487, in from_engine_args
    engine = cls(engine_args.worker_use_ray,
  File "/root/miniconda3/lib/python3.8/site-packages/vllm/engine/async_llm_engine.py", line 270, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/vllm/engine/async_llm_engine.py", line 306, in _init_engine
    return engine_class(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 96, in __init__
    self._verify_args()
  File "/root/miniconda3/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 187, in _verify_args
    self.model_config.verify_with_parallel_config(self.parallel_config)
  File "/root/miniconda3/lib/python3.8/site-packages/vllm/config.py", line 128, in verify_with_parallel_config
    total_num_hidden_layers = self.hf_config.num_hidden_layers
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/configuration_utils.py", line 261, in __getattribute__
    return super().__getattribute__(key)
AttributeError: 'ChatGLMConfig' object has no attribute 'num_hidden_layers'

I used openai/api_server.py to start the chatglm3-6b API; it returns AttributeError: 'ChatGLMConfig' object has no attribute 'num_hidden_layers'.

@gameofdimension (Author)

> I used openai/api_server.py to start the chatglm3-6b API; it returns AttributeError: 'ChatGLMConfig' object has no attribute 'num_hidden_layers'.

Maybe you are testing against the wrong commit.

@cauyxy (Contributor) left a comment


LGTM. Waiting for @zhuohan123 to take a look at it.

@simon-mo mentioned this pull request Nov 6, 2023
@zhuohan123 (Collaborator) left a comment


Thank you for your contribution! Left some style comments.

The PR runs correctly on my side on a single GPU. We should be able to merge this after the styles are fixed.

Comment on lines +116 to +120
@staticmethod
def hf_config_get_num_layers(hf_config):
if getattr(hf_config, "model_type", None) == "chatglm":
return hf_config.num_layers
return hf_config.num_hidden_layers

Please remove this function and add the following to ChatGLMConfig

    attribute_map = {
        "num_hidden_layers": "num_layers",
    }

You can refer to falcon's config:

    attribute_map = {
        "num_hidden_layers": "n_layer",
        "num_attention_heads": "n_head",
        "num_kv_heads": "n_head_kv",
    }
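For context, `attribute_map` is the standard Hugging Face `PretrainedConfig` aliasing mechanism: attribute lookups are rewritten through the map before normal resolution. A minimal sketch of the suggested ChatGLMConfig change (the constructor shown here is trimmed for illustration; only the map itself is the actual suggestion):

```python
from transformers import PretrainedConfig

class ChatGLMConfig(PretrainedConfig):
    model_type = "chatglm"
    # Alias so generic code that reads num_hidden_layers keeps working.
    attribute_map = {
        "num_hidden_layers": "num_layers",
    }

    def __init__(self, num_layers=28, **kwargs):
        self.num_layers = num_layers
        super().__init__(**kwargs)

# config.num_hidden_layers now resolves to config.num_layers:
config = ChatGLMConfig(num_layers=28)
assert config.num_hidden_layers == 28
```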

Comment on lines 42 to +43
is_neox_style: bool,
is_glm_style: bool = False,

Can you change this to:

Suggested change
is_neox_style: bool,
is_glm_style: bool = False,
style: str = "neox",

and set style to "neox", "gptj", "glm" accordingly?
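A minimal sketch of how a single style string might replace the two boolean flags in the rotary embedding (function and variable names here are illustrative, not the PR's actual code):

```python
def apply_rotary(query, key, cos, sin, style: str = "neox"):
    # Illustrative dispatch on one style string instead of the
    # is_neox_style / is_glm_style boolean pair.
    if style == "neox":
        return rotate_neox(query, key, cos, sin)  # rotate-half layout
    elif style == "gptj":
        return rotate_gptj(query, key, cos, sin)  # interleaved layout
    elif style == "glm":
        # GLM applies rotary embedding to only part of the head dim.
        return rotate_glm(query, key, cos, sin)
    raise ValueError(f"Unknown rotary embedding style: {style!r}")
```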

Comment on lines +23 to +24
from vllm.model_executor.weight_utils import hf_model_weights_iterator, load_tensor_parallel_weights, \
load_padded_tensor_parallel_vocab

Please use () for line continuation:

Suggested change
from vllm.model_executor.weight_utils import hf_model_weights_iterator, load_tensor_parallel_weights, \
load_padded_tensor_parallel_vocab
from vllm.model_executor.weight_utils import (
    hf_model_weights_iterator, load_tensor_parallel_weights,
    load_padded_tensor_parallel_vocab)

self.total_num_heads, self.num_heads = compute_tp_num_heads(
    config, tp_world_size)
self.head_dim = config.hidden_size // self.total_num_heads
self.total_num_kv_heads, self.num_kv_heads, num_kv_heads_replicas = \

Please use () for line continuation.
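The parenthesized form for the snippet above would look roughly like this (the right-hand side is cut off in the excerpt, so the call shown is a placeholder):

```python
(self.total_num_kv_heads, self.num_kv_heads,
 num_kv_heads_replicas) = compute_tp_num_kv_heads(  # placeholder RHS
     config, tp_world_size)
```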

Comment on lines 313 to +314
is_neox_style: bool = True,
is_glm_style: bool = False,

Let's use a single style variable:

Suggested change
is_neox_style: bool = True,
is_glm_style: bool = False,
style: str = "neox",

total_num_heads, num_heads = compute_tp_num_heads(
    self.config, tp_world_size)
head_dim = self.config.hidden_size // total_num_heads
total_num_kv_heads, num_kv_heads, num_kv_heads_replicas = \

Please use () for line continuation.

self.attention_dropout = attention_dropout
self.layernorm_epsilon = layernorm_epsilon
self.rmsnorm = rmsnorm
self.apply_residual_connection_post_layernorm = \

Please use () for line continuation.

@zhuohan123 (Collaborator)

Again, thank you for your contribution! We merged #1261 and the current main branch supports ChatGLM now. Let us know if the main branch does not look good and feel free to propose any changes!

@zhuohan123 closed this Nov 7, 2023
@gameofdimension (Author) commented Nov 7, 2023

> Again, thank you for your contribution! We merged #1261 and the current main branch supports ChatGLM now. Let us know if the main branch does not look good and feel free to propose any changes!

PR #1261 might not support TP when the world size is bigger than multi_query_group_num (currently 2). For example, this line will return 0, which is definitely wrong:

return (self.hf_config.multi_query_group_num //
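To illustrate the failure mode (a standalone sketch, not the PR's code): integer-dividing chatglm's multi_query_group_num by the tensor-parallel world size yields 0 KV heads per rank once the world size exceeds the group count.

```python
# chatglm3-6b uses multi-query attention with 2 KV-head groups.
multi_query_group_num = 2

for tp_size in (1, 2, 4, 8):
    print(f"tp={tp_size}: kv_heads_per_rank={multi_query_group_num // tp_size}")
# tp=1: kv_heads_per_rank=2
# tp=2: kv_heads_per_rank=1
# tp=4: kv_heads_per_rank=0   <- broken: each rank gets no KV heads
# tp=8: kv_heads_per_rank=0

# One common remedy (an assumption here, not necessarily what vLLM adopted)
# is to replicate KV heads across ranks when tp_size exceeds the head count:
def kv_heads_per_rank(num_kv_heads: int, tp_size: int) -> int:
    return max(1, num_kv_heads // tp_size)
```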
