Add Internlm2 #2666

Merged · 6 commits merged into vllm-project:main on Feb 1, 2024
Conversation

@Leymore (Contributor) commented Jan 30, 2024

I copied most of the code from #2527, with the following changes:

  1. Removed the einops dependency.
  2. Fixed bugs when loading internlm2-chat-20b.

Here is my test script and the results:

from vllm import LLM, SamplingParams
prompts = [
    "<s><|im_start|>user\nHello, my name is<|im_end|>\n<|im_start|>assistant\n",
    "<s><|im_start|>user\nThe president of the United States is<|im_end|>\n<|im_start|>assistant\n",
    "<s><|im_start|>user\nThe capital of France is<|im_end|>\n<|im_start|>assistant\n",
    "<s><|im_start|>user\nThe future of AI is<|im_end|>\n<|im_start|>assistant\n",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="internlm/internlm2-chat-7b", trust_remote_code=True)
# Swap in one of the following to test the 20B model, optionally with tensor parallelism:
# llm = LLM(model="internlm/internlm2-chat-20b", trust_remote_code=True)
# llm = LLM(model="internlm/internlm2-chat-20b", trust_remote_code=True, tensor_parallel_size=2)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

For internlm/internlm2-chat-7b:

Prompt: '<s><|im_start|>user\nHello, my name is<|im_end|>\n<|im_start|>assistant\n', Generated text: 'Hello! How may I assist you today? Please feel free to ask me anything'
Prompt: '<s><|im_start|>user\nThe president of the United States is<|im_end|>\n<|im_start|>assistant\n', Generated text: 'The current president of the United States is Joe Biden. He was sworn in as'
Prompt: '<s><|im_start|>user\nThe capital of France is<|im_end|>\n<|im_start|>assistant\n', Generated text: 'The capital of France is Paris.<|im_end|>\n<|im_end|>\nThe capital of France is'
Prompt: '<s><|im_start|>user\nThe future of AI is<|im_end|>\n<|im_start|>assistant\n', Generated text: 'The future of AI is likely to be both exciting and challenging. On the one'

For internlm/internlm2-chat-20b:

Prompt: '<s><|im_start|>user\nHello, my name is<|im_end|>\n<|im_start|>assistant\n', Generated text: "Hello! I'm glad to meet you. As a helpful, respectful, and"
Prompt: '<s><|im_start|>user\nThe president of the United States is<|im_end|>\n<|im_start|>assistant\n', Generated text: 'The current president of the United States is Joe Biden. He assumed office on January'
Prompt: '<s><|im_start|>user\nThe capital of France is<|im_end|>\n<|im_start|>assistant\n', Generated text: 'The capital of France is Paris. This city, known as the "City of'
Prompt: '<s><|im_start|>user\nThe future of AI is<|im_end|>\n<|im_start|>assistant\n', Generated text: 'The future of AI is incredibly exciting and holds great potential for both positive and negative'

For internlm/internlm2-chat-20b with TP=2:

Prompt: '<s><|im_start|>user\nHello, my name is<|im_end|>\n<|im_start|>assistant\n', Generated text: "Hello! I'm glad to meet you. As a helpful, respectful, and"
Prompt: '<s><|im_start|>user\nThe president of the United States is<|im_end|>\n<|im_start|>assistant\n', Generated text: 'The current president of the United States is Joe Biden. He assumed office on January'
Prompt: '<s><|im_start|>user\nThe capital of France is<|im_end|>\n<|im_start|>assistant\n', Generated text: 'The capital of France is Paris. This city, known as the "City of'
Prompt: '<s><|im_start|>user\nThe future of AI is<|im_end|>\n<|im_start|>assistant\n', Generated text: 'The future of AI is incredibly exciting and holds great potential for both positive and negative'

@esmeetu (Collaborator) commented Jan 30, 2024

Hi, @Leymore. Thanks for following up on my earlier work. I also arrived at the same implementation today.
However, I tried to find a more elegant way to convert the wqkv weight into the qkv_proj weight structure that vLLM expects, similar to the bloom model's query_key_value.
Besides, I am curious why this wqkv structure is used. Is there an earlier model laid out like this?

@Leymore (Contributor, Author) commented Jan 30, 2024

I am just the messenger here and not familiar with the origin of this wqkv structure. I will ask someone in charge to find out the answer.

@gaoyang07 commented
> Hi, @Leymore. Thanks for following up on my earlier work. I also arrived at the same implementation today. However, I tried to find a more elegant way to convert the wqkv weight into the qkv_proj weight structure that vLLM expects, similar to the bloom model's query_key_value. Besides, I am curious why this wqkv structure is used. Is there an earlier model laid out like this?

Hi there. In our effort to optimize training efficiency, we've consolidated the wq, wk, wv weight matrices into a single wqkv matrix. This streamlined approach has led to a roughly 5% faster training process. Given the significant costs associated with pre-training, this improvement in efficiency can translate into substantial cost savings.
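A minimal sketch of how such a fused wqkv matrix can be unpacked back into separate q, k, and v projections, assuming an InternLM2-style grouped layout in which each key/value head is stored together with its q_per_kv query heads; the head counts and shapes below are illustrative assumptions, not the exact code merged in this PR:

import torch

# Illustrative head counts (assumptions, not read from any config).
num_heads = 32                       # query heads
num_kv_heads = 8                     # key/value heads
head_dim = 128
hidden_size = num_heads * head_dim
q_per_kv = num_heads // num_kv_heads

# Assumed fused layout: for every KV head, q_per_kv query rows,
# then one key row and one value row, each of size head_dim.
wqkv = torch.randn(num_kv_heads * (q_per_kv + 2) * head_dim, hidden_size)

# Make the per-group structure explicit, then split off q, k, v.
grouped = wqkv.view(num_kv_heads, q_per_kv + 2, head_dim, hidden_size)
wq, wk, wv = torch.split(grouped, [q_per_kv, 1, 1], dim=1)

wq = wq.reshape(-1, hidden_size)     # (num_heads * head_dim, hidden_size)
wk = wk.reshape(-1, hidden_size)     # (num_kv_heads * head_dim, hidden_size)
wv = wv.reshape(-1, hidden_size)     # (num_kv_heads * head_dim, hidden_size)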

esmeetu mentioned this pull request on Jan 30, 2024
@esmeetu (Collaborator) commented Jan 30, 2024

@gaoyang07 Thanks for your quick reply! What I meant is that the packed wqkv weight is not laid out like the MPT model's, so we have to apply a transformation and cannot split it directly.

@gaoyang07 commented
> @gaoyang07 Thanks for your quick reply! What I meant is that the packed wqkv weight is not laid out like the MPT model's, so we have to apply a transformation and cannot split it directly.

For more details on the conversion process, please refer to https://github.com/InternLM/InternLM/blob/3599ddd0e48968faced0831a4f32a44389d61d40/tools/convert2llama.py#L59-L70. Moreover, with our wqkv layout, users can easily split or merge the model weights along the tensor-parallel dimension, e.g., new_weights = torch.split(old_weight, new_dim, dim=tp_dim).
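As a concrete illustration of that split, here is a small sketch of slicing a fused weight across tensor-parallel ranks with torch.split; the shapes and the tp_size value are placeholders, and it assumes the grouped layout divides evenly across ranks:

import torch

tp_size = 2                                  # assumed tensor-parallel degree
old_weight = torch.randn(4608, 4096)         # e.g. a fused wqkv matrix (illustrative shape)
shard_rows = old_weight.shape[0] // tp_size

# Split along the output (row) dimension; rank i keeps shards[i].
shards = torch.split(old_weight, shard_rows, dim=0)
assert len(shards) == tp_size
assert shards[0].shape == (shard_rows, old_weight.shape[1])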

@icefairy commented
I cannot load the model after merging this commit; the error message is below:
vllm-0.2.7+cu123-py3.10-linux-x86_64.egg/vllm/model_executor/layers/activation.py", line 35, in forward
out = torch.empty(output_shape, dtype=x.dtype, device=x.device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.75 GiB. GPU 0 has a total capacty of 23.64 GiB of which 950.44 MiB is free. Process 2316410 has 176.00 MiB memory in use. Process 3149588 has 250.00 MiB memory in use. Including non-PyTorch memory, this process has 22.30 GiB memory in use. Of the allocated memory 19.44 GiB is allocated by PyTorch, and 2.51 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I have an NVIDIA RTX card with 24 GB of VRAM. With the same configuration I can load internlm-chat-7b, and I can load internlm2-chat-7b with transformers, but I cannot load it with vLLM at this commit. Can anyone help?

@Leymore (Contributor, Author) commented Feb 1, 2024

> I cannot load the model after merging this commit; the error message is below: vllm-0.2.7+cu123-py3.10-linux-x86_64.egg/vllm/model_executor/layers/activation.py", line 35, in forward out = torch.empty(output_shape, dtype=x.dtype, device=x.device) torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.75 GiB. GPU 0 has a total capacty of 23.64 GiB of which 950.44 MiB is free. Process 2316410 has 176.00 MiB memory in use. Process 3149588 has 250.00 MiB memory in use. Including non-PyTorch memory, this process has 22.30 GiB memory in use. Of the allocated memory 19.44 GiB is allocated by PyTorch, and 2.51 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
>
> I have an NVIDIA RTX card with 24 GB of VRAM. With the same configuration I can load internlm-chat-7b, and I can load internlm2-chat-7b with transformers, but I cannot load it with vLLM at this commit. Can anyone help?

I am not an expert on this issue, but I managed to run the following code, with nvidia-smi showing about 20 GB of GPU memory usage on my 80 GB A100. You may want to set max_model_len to a lower value (the default is 65536):

llm = LLM(model="internlm/internlm2-chat-7b", trust_remote_code=True, gpu_memory_utilization=0.25, max_model_len=2048)

@Leymore (Contributor, Author) commented Feb 1, 2024

Hi @esmeetu, to recap this issue: is there anything more I need to do to have this PR merged?

@esmeetu (Collaborator) commented Feb 1, 2024

> Hi @esmeetu, to recap this issue: is there anything more I need to do to have this PR merged?

@Leymore LGTM. cc @simon-mo

simon-mo merged commit cd9e60c into vllm-project:main on Feb 1, 2024
17 checks passed
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
alexm-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request Feb 13, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Feb 20, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Feb 22, 2024
@icefairy commented
> I cannot load the model after merging this commit; the error message is below: [...] torch.cuda.OutOfMemoryError: CUDA out of memory. [...]
>
> I am not an expert on this issue, but I managed to run the following code, with nvidia-smi showing about 20 GB of GPU memory usage on my 80 GB A100. You may want to set max_model_len to a lower value (the default is 65536):
>
> llm = LLM(model="internlm/internlm2-chat-7b", trust_remote_code=True, gpu_memory_utilization=0.25, max_model_len=2048)

It works! Thank you!

xjpang pushed a commit to xjpang/vllm that referenced this pull request Mar 4, 2024