Add Internlm2 #2666

Merged · 6 commits merged into vllm-project:main on Feb 1, 2024
Conversation

@Leymore (Contributor) commented Jan 30, 2024

I copied most of the code from #2527, with the following changes:

  1. Removed the einops dependency.
  2. Fixed bugs when loading internlm2-chat-20b.

Here is my test script and the results:

from vllm import LLM, SamplingParams
prompts = [
    "<s><|im_start|>user\nHello, my name is<|im_end|>\n<|im_start|>assistant\n",
    "<s><|im_start|>user\nThe president of the United States is<|im_end|>\n<|im_start|>assistant\n",
    "<s><|im_start|>user\nThe capital of France is<|im_end|>\n<|im_start|>assistant\n",
    "<s><|im_start|>user\nThe future of AI is<|im_end|>\n<|im_start|>assistant\n",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="internlm/internlm2-chat-7b", trust_remote_code=True)
# Swap in one of the following to test the 20B model, optionally with tensor parallelism:
# llm = LLM(model="internlm/internlm2-chat-20b", trust_remote_code=True)
# llm = LLM(model="internlm/internlm2-chat-20b", trust_remote_code=True, tensor_parallel_size=2)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

For internlm/internlm2-chat-7b:

Prompt: '<s><|im_start|>user\nHello, my name is<|im_end|>\n<|im_start|>assistant\n', Generated text: 'Hello! How may I assist you today? Please feel free to ask me anything'
Prompt: '<s><|im_start|>user\nThe president of the United States is<|im_end|>\n<|im_start|>assistant\n', Generated text: 'The current president of the United States is Joe Biden. He was sworn in as'
Prompt: '<s><|im_start|>user\nThe capital of France is<|im_end|>\n<|im_start|>assistant\n', Generated text: 'The capital of France is Paris.<|im_end|>\n<|im_end|>\nThe capital of France is'
Prompt: '<s><|im_start|>user\nThe future of AI is<|im_end|>\n<|im_start|>assistant\n', Generated text: 'The future of AI is likely to be both exciting and challenging. On the one'

For internlm/internlm2-chat-20b:

Prompt: '<s><|im_start|>user\nHello, my name is<|im_end|>\n<|im_start|>assistant\n', Generated text: "Hello! I'm glad to meet you. As a helpful, respectful, and"
Prompt: '<s><|im_start|>user\nThe president of the United States is<|im_end|>\n<|im_start|>assistant\n', Generated text: 'The current president of the United States is Joe Biden. He assumed office on January'
Prompt: '<s><|im_start|>user\nThe capital of France is<|im_end|>\n<|im_start|>assistant\n', Generated text: 'The capital of France is Paris. This city, known as the "City of'
Prompt: '<s><|im_start|>user\nThe future of AI is<|im_end|>\n<|im_start|>assistant\n', Generated text: 'The future of AI is incredibly exciting and holds great potential for both positive and negative'

For internlm/internlm2-chat-20b with TP=2:

Prompt: '<s><|im_start|>user\nHello, my name is<|im_end|>\n<|im_start|>assistant\n', Generated text: "Hello! I'm glad to meet you. As a helpful, respectful, and"
Prompt: '<s><|im_start|>user\nThe president of the United States is<|im_end|>\n<|im_start|>assistant\n', Generated text: 'The current president of the United States is Joe Biden. He assumed office on January'
Prompt: '<s><|im_start|>user\nThe capital of France is<|im_end|>\n<|im_start|>assistant\n', Generated text: 'The capital of France is Paris. This city, known as the "City of'
Prompt: '<s><|im_start|>user\nThe future of AI is<|im_end|>\n<|im_start|>assistant\n', Generated text: 'The future of AI is incredibly exciting and holds great potential for both positive and negative'

@esmeetu (Collaborator) commented Jan 30, 2024

Hi, @Leymore. Thanks for following up on my earlier work. I also arrived at the same implementation today.
However, I tried to find a more elegant way to convert the wqkv weight into the qkv_proj weight structure that vLLM expects, similar to the bloom model's query_key_value.
Besides, I am curious why this wqkv structure is used. Is there an earlier model laid out like this?

@Leymore (Contributor, Author) commented Jan 30, 2024

I am just the messenger here and not familiar with the origin of this wqkv structure. I will ask someone in charge to find out the answer.

@gaoyang07 commented
> Hi, @Leymore. Thanks for following up on my earlier work. I also arrived at the same implementation today. However, I tried to find a more elegant way to convert the wqkv weight into the qkv_proj weight structure that vLLM expects, similar to the bloom model's query_key_value. Besides, I am curious why this wqkv structure is used. Is there an earlier model laid out like this?

Hi there. In our effort to optimize training efficiency, we've consolidated the wq, wk, wv weight matrices into a single wqkv matrix. This streamlined approach has led to a roughly 5% faster training process. Given the significant costs associated with pre-training, this improvement in efficiency can translate into substantial cost savings.
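A minimal sketch of how such a fused wqkv matrix can be unpacked back into separate q, k, and v projections, assuming an InternLM2-style grouped layout in which each key/value head is stored together with its q_per_kv query heads; the head counts and shapes below are illustrative assumptions, not the exact code merged in this PR:

import torch

# Illustrative head counts (assumptions, not read from any config).
num_heads = 32                       # query heads
num_kv_heads = 8                     # key/value heads
head_dim = 128
hidden_size = num_heads * head_dim
q_per_kv = num_heads // num_kv_heads

# Assumed fused layout: for every KV head, q_per_kv query rows,
# then one key row and one value row, each of size head_dim.
wqkv = torch.randn(num_kv_heads * (q_per_kv + 2) * head_dim, hidden_size)

# Make the per-group structure explicit, then split off q, k, v.
grouped = wqkv.view(num_kv_heads, q_per_kv + 2, head_dim, hidden_size)
wq, wk, wv = torch.split(grouped, [q_per_kv, 1, 1], dim=1)

wq = wq.reshape(-1, hidden_size)     # (num_heads * head_dim, hidden_size)
wk = wk.reshape(-1, hidden_size)     # (num_kv_heads * head_dim, hidden_size)
wv = wv.reshape(-1, hidden_size)     # (num_kv_heads * head_dim, hidden_size)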

esmeetu mentioned this pull request on Jan 30, 2024
@esmeetu (Collaborator) commented Jan 30, 2024

@gaoyang07 Thanks for your quick reply! What I meant is that the packed wqkv weight is not laid out like the MPT model's, so we have to apply a transformation and cannot split it directly.

@gaoyang07 commented
> @gaoyang07 Thanks for your quick reply! What I meant is that the packed wqkv weight is not laid out like the MPT model's, so we have to apply a transformation and cannot split it directly.

For more details on the conversion process, please refer to https://github.com/InternLM/InternLM/blob/3599ddd0e48968faced0831a4f32a44389d61d40/tools/convert2llama.py#L59-L70. Moreover, with our wqkv layout, users can easily split or merge the model weights along the tensor-parallel dimension, e.g., new_weights = torch.split(old_weight, new_dim, dim=tp_dim).
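As a concrete illustration of that split, here is a small sketch of slicing a fused weight across tensor-parallel ranks with torch.split; the shapes and the tp_size value are placeholders, and it assumes the grouped layout divides evenly across ranks:

import torch

tp_size = 2                                  # assumed tensor-parallel degree
old_weight = torch.randn(4608, 4096)         # e.g. a fused wqkv matrix (illustrative shape)
shard_rows = old_weight.shape[0] // tp_size

# Split along the output (row) dimension; rank i keeps shards[i].
shards = torch.split(old_weight, shard_rows, dim=0)
assert len(shards) == tp_size
assert shards[0].shape == (shard_rows, old_weight.shape[1])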

@icefairy commented
I cannot load the model after merging this commit; the error message is below:
vllm-0.2.7+cu123-py3.10-linux-x86_64.egg/vllm/model_executor/layers/activation.py", line 35, in forward
out = torch.empty(output_shape, dtype=x.dtype, device=x.device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.75 GiB. GPU 0 has a total capacty of 23.64 GiB of which 950.44 MiB is free. Process 2316410 has 176.00 MiB memory in use. Process 3149588 has 250.00 MiB memory in use. Including non-PyTorch memory, this process has 22.30 GiB memory in use. Of the allocated memory 19.44 GiB is allocated by PyTorch, and 2.51 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I have an NVIDIA RTX card with 24 GB of VRAM. With the same configuration I can load internlm-chat-7b, and I can load internlm2-chat-7b with transformers, but I cannot load it with vLLM at this commit. Can anyone help?

@Leymore (Contributor, Author) commented Feb 1, 2024

> I cannot load the model after merging this commit; the error message is below: vllm-0.2.7+cu123-py3.10-linux-x86_64.egg/vllm/model_executor/layers/activation.py", line 35, in forward out = torch.empty(output_shape, dtype=x.dtype, device=x.device) torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.75 GiB. GPU 0 has a total capacty of 23.64 GiB of which 950.44 MiB is free. Process 2316410 has 176.00 MiB memory in use. Process 3149588 has 250.00 MiB memory in use. Including non-PyTorch memory, this process has 22.30 GiB memory in use. Of the allocated memory 19.44 GiB is allocated by PyTorch, and 2.51 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
>
> I have an NVIDIA RTX card with 24 GB of VRAM. With the same configuration I can load internlm-chat-7b, and I can load internlm2-chat-7b with transformers, but I cannot load it with vLLM at this commit. Can anyone help?

I am not an expert on this issue, but I managed to run the following code, with nvidia-smi showing about 20 GB of GPU memory usage on my 80 GB A100. You may want to set max_model_len to a lower value (the default is 65536):

llm = LLM(model="internlm/internlm2-chat-7b", trust_remote_code=True, gpu_memory_utilization=0.25, max_model_len=2048)

@Leymore (Contributor, Author) commented Feb 1, 2024

Hi @esmeetu, to recap this issue: is there anything more I need to do to have this PR merged?

@esmeetu (Collaborator) commented Feb 1, 2024

> Hi @esmeetu, to recap this issue: is there anything more I need to do to have this PR merged?

@Leymore LGTM. cc @simon-mo

simon-mo merged commit cd9e60c into vllm-project:main on Feb 1, 2024
17 checks passed
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
alexm-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request Feb 13, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Feb 20, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Feb 22, 2024
@icefairy commented
> I cannot load the model after merging this commit; the error message is below: [...] torch.cuda.OutOfMemoryError: CUDA out of memory. [...]
>
> I am not an expert on this issue, but I managed to run the following code, with nvidia-smi showing about 20 GB of GPU memory usage on my 80 GB A100. You may want to set max_model_len to a lower value (the default is 65536):
>
> llm = LLM(model="internlm/internlm2-chat-7b", trust_remote_code=True, gpu_memory_utilization=0.25, max_model_len=2048)

It works! Thank you!

xjpang pushed a commit to xjpang/vllm that referenced this pull request Mar 4, 2024