
Added DeciLM-7b and DeciLM-7b-instruct #2062

Merged: 6 commits, Dec 19, 2023

Conversation

avideci (Contributor) commented Dec 12, 2023

New models: DeciLM-7b and DeciLM-7b-instruct

DeciLM-7b and DeciLM-7b-instruct were released today.
They have reached first place on the Open LLM Leaderboard (7B category).

[Screenshot: Open LLM Leaderboard, 7B category]

DeciLM-7B is a 7.04 billion parameter decoder-only text generation model, released under the Apache 2.0 license. At the time of release, DeciLM-7B is the top-performing 7B base language model on the Open LLM Leaderboard. With support for an 8K-token sequence length, this highly efficient model uses variable Grouped-Query Attention (GQA) to achieve a superior balance between accuracy and computational efficiency. The model's architecture was generated using Deci's proprietary Neural Architecture Search technology, AutoNAC.


About the model

DeciLM uses an architecture similar to Llama's, so it was fairly easy to make it work with vLLM.

Deci AI used AutoNAC (its Neural Architecture Search engine) to find this architecture automatically.

The difference is that DeciLM uses variable grouped-query attention ("variable GQA") instead of uniform grouped-query attention (plain "GQA").

The main difference can be spotted in the model configuration:
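For illustration, here is a minimal sketch of the two styles of configuration. The head counts below are made up; the real per-layer list is in the DeciLM-7B config.json (https://huggingface.co/Deci/DeciLM-7B/blob/main/config.json).

# Illustrative sketch only: hypothetical head counts, not DeciLM-7B's real values.

# Uniform GQA (Llama/Mistral style): one KV-head count shared by every layer.
uniform_gqa_config = {
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
}

# Variable GQA (DeciLM style): a separate KV-head count for each decoder layer.
variable_gqa_config = {
    "num_attention_heads": 32,
    "num_key_value_heads_per_layer": [4, 4, 4, 2, 2, 2, 1, 1],  # hypothetical
}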

Useful References:

Future Work

I have tried to modify the code to use variable GQA, but it seems the PagedAttention mechanism relies on a static/constant number of KV heads. With variable GQA, I tried passing each Llama layer a layer_idx and choosing the number of heads for each layer from the per-layer list in the configuration above.

With variable GQA my code ran, but the model output gibberish after the first word.
As a temporary workaround, I added logic that converts variable GQA to uniform GQA in the load_weights() method. After this fix, the attention kernels worked as expected and the model's outputs are great.


Is there any reason variable GQA would not work with PagedAttention?

Can you think of anything that would prevent the model from returning valid outputs while the kernel runs without errors?
The CUDA kernel's Python launcher passed all of the size and boundary assertions, yet the output was still gibberish after the first token.
The current patch in load_weights(), which degroups each KV head, fixes this for now, but the model takes a latency hit since the weights are repeated redundantly. Since you know the PagedAttention kernels better than I do, I hope there is a solution that will enable a different number of kv_heads per layer. We would appreciate your help!
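For context on the first question, here is a hedged sketch of the assumption in play; it is not vLLM's actual allocator or kernel code, and the sizes are made-up example values. The point is only that a paged KV cache is typically allocated with one block shape shared by all layers, parameterized by a single num_kv_heads.

import torch

# Hedged illustration, not vLLM's actual code: every layer's KV-cache blocks
# share one shape, so a single num_kv_heads is assumed model-wide.
num_layers, num_blocks, block_size = 32, 8, 16
num_kv_heads, head_size = 8, 128  # made-up example sizes

kv_caches = [
    # (key/value, num_blocks, block_size, num_kv_heads, head_size) per layer
    torch.empty(2, num_blocks, block_size, num_kv_heads, head_size)
    for _ in range(num_layers)
]

# A kernel that computes cache offsets from this shared num_kv_heads would
# stride through the wrong memory for a layer that really has fewer KV heads,
# without necessarily tripping any size assertion.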

Thanks in advance!

        config: Optional[PretrainedConfig] = None,
        linear_method: Optional[LinearMethodBase] = None,
    ) -> None:
        config.num_key_value_heads = max(config.num_key_value_heads_per_layer)
avideci (Contributor, Author):

Here, we are converting the model to uniform GQA instead of variable GQA. That's because the PagedAttention kernel did not work well with variable GQA (still no idea why), and this is a workaround.

We choose the maximum number in this list: https://huggingface.co/Deci/DeciLM-7B/blob/main/config.json#L18

Then we convert the "k_proj" and "v_proj" weights to uniform GQA using "degroup_weight".

                                        default_weight_loader)
                weight_loader(param, loaded_weight)

    def degroup_weight(self, loaded_weight: torch.Tensor) -> torch.Tensor:
avideci (Contributor, Author):

This method receives a weight and changes its number of KV heads to match the maximum number found in this list:
https://huggingface.co/Deci/DeciLM-7B/blob/main/config.json#L18

By doing so, all the attention layers end up with the same number of KV heads, which is required for PagedAttention.
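A minimal sketch of that conversion, assuming the usual (num_kv_heads * head_dim, hidden_size) layout for k_proj/v_proj weights and a target head count that is a multiple of each layer's own count. The function name and arguments are illustrative, not the PR's exact degroup_weight().

import torch

def degroup_kv_weight(loaded_weight: torch.Tensor,
                      num_kv_heads: int,
                      target_kv_heads: int,
                      head_dim: int) -> torch.Tensor:
    # Split the projection into per-head blocks of rows.
    hidden_size = loaded_weight.shape[1]
    w = loaded_weight.view(num_kv_heads, head_dim, hidden_size)
    # Duplicate each head (target_kv_heads // num_kv_heads) times, in order,
    # so the layer presents target_kv_heads heads to the attention kernel.
    w = w.repeat_interleave(target_kv_heads // num_kv_heads, dim=0)
    return w.reshape(target_kv_heads * head_dim, hidden_size)

# Example: expand a 2-KV-head projection to 4 KV heads (head_dim=64, hidden=256).
weight = torch.randn(2 * 64, 256)
print(degroup_kv_weight(weight, num_kv_heads=2, target_kv_heads=4, head_dim=64).shape)
# -> torch.Size([256, 256])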

Collaborator:

Makes sense. Thanks for this solution. Let's leave it as future work for now.

@WoosukKwon added the "new model" label Dec 13, 2023
avideci (Contributor, Author) commented Dec 18, 2023

@WoosukKwon self-requested a review December 19, 2023 09:23
@WoosukKwon (Collaborator) left a comment:

Hi @avideci, thanks for adding this model to vLLM, and apologies for the late review. We actually found this model interesting but didn't have the bandwidth last week.

To accelerate the integration, I made some minor changes directly on the PR (mostly code style). Thanks again for contributing the model, and I'm looking forward to your next one!

@WoosukKwon merged commit de60a3f into vllm-project:main Dec 19, 2023
2 checks passed
rkooo567 added a commit to rkooo567/vllm that referenced this pull request Dec 19, 2023
TobyGE commented Jan 3, 2024

I tried DeciLM with vLLM; the inference speed is similar to other 7B LLMs. Does anyone have similar observations?

geifmany replied:

It leverages the same kernels, so it will have similar performance. The HF implementation of DeciLM is around 2x faster than HF Mistral on large batches.

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024