[Bugfix] Added Command-R GPTQ support #3849

Merged 5 commits into vllm-project:main on Apr 8, 2024

Conversation

egortolmachev
Contributor

@egortolmachev egortolmachev commented Apr 4, 2024

Fixed Command-R GPTQ model loading by analogy with Gemma: #3553

@esmeetu
Collaborator

esmeetu commented Apr 4, 2024

Thanks! Could you post that model's link and put up a test result here?

@deoxykev

deoxykev commented Apr 4, 2024

Would this work with the new Command R+ as well? Looks to be the same CohereForCausalLM architecture.

@osilverstein

Can we get this into the v0.4 release as a slightly updated build?

@osilverstein

I cloned the fork, ran the model, and got the same old error:
KeyError: 'model.layers.13.mlp.down_proj.bias'
(RayWorkerVllm pid=7247) INFO 04-05 05:13:03 selector.py:51] Cannot use FlashAttention because the package is not found. Please install it>
(RayWorkerVllm pid=7247) INFO 04-05 05:13:03 selector.py:25] Using XFormers backend. [repeated 2x across cluster]
(RayWorkerVllm pid=7247) INFO 04-05 05:13:04 pynccl_utils.py:45] vLLM is using nccl==2.18.1 [repeated 2x across cluster]
(RayWorkerVllm pid=7247) ERROR 04-05 05:15:23 ray_utils.py:44] Error executing method load_model. This might cause deadlock in distributed>
(RayWorkerVllm pid=7247) ERROR 04-05 05:15:23 ray_utils.py:44] Traceback (most recent call last):
(RayWorkerVllm pid=7247) ERROR 04-05 05:15:23 ray_utils.py:44]   File "/workspace/vllm/vllm/engine/ray_utils.py", line 37, in execute_meth>
(RayWorkerVllm pid=7247) ERROR 04-05 05:15:23 ray_utils.py:44]     return executor(*args, **kwargs)
(RayWorkerVllm pid=7247) ERROR 04-05 05:15:23 ray_utils.py:44]   File "/workspace/vllm/vllm/worker/worker.py", line 107, in load_model
(RayWorkerVllm pid=7247) ERROR 04-05 05:15:23 ray_utils.py:44]     self.model_runner.load_model()
(RayWorkerVllm pid=7247) ERROR 04-05 05:15:23 ray_utils.py:44]   File "/workspace/vllm/vllm/worker/model_runner.py", line 95, in load_model
(RayWorkerVllm pid=7247) ERROR 04-05 05:15:23 ray_utils.py:44]     self.model = get_model(
(RayWorkerVllm pid=7247) ERROR 04-05 05:15:23 ray_utils.py:44]   File "/workspace/vllm/vllm/model_executor/model_loader.py", line 101, in >
(RayWorkerVllm pid=7247) ERROR 04-05 05:15:23 ray_utils.py:44]     model.load_weights(model_config.model, model_config.download_dir,
(RayWorkerVllm pid=7247) ERROR 04-05 05:15:23 ray_utils.py:44]   File "/workspace/vllm/vllm/model_executor/models/commandr.py", line 329, >
(RayWorkerVllm pid=7247) ERROR 04-05 05:15:23 ray_utils.py:44]     param = params_dict[name]
(RayWorkerVllm pid=7247) ERROR 04-05 05:15:23 ray_utils.py:44] KeyError: 'model.layers.13.mlp.down_proj.bias'
(RayWorkerVllm pid=6994) WARNING 04-05 05:13:07 custom_all_reduce.py:45] Custom allreduce is disabled because your platform lacks GPU P2P >
(RayWorkerVllm pid=7169) INFO 04-05 05:13:09 weight_utils.py:194] Using model weights format ['*.safetensors'] [repeated 2x across c>
(RayWorkerVllm pid=7169) ERROR 04-05 05:15:24 ray_utils.py:44] Error executing method load_model. This might cause deadlock in distributed>
(RayWorkerVllm pid=7169) ERROR 04-05 05:15:24 ray_utils.py:44] (same traceback as above, repeated for pid=7169)
(RayWorkerVllm pid=7169) ERROR 04-05 05:15:24 ray_utils.py:44] KeyError: 'model.layers.13.mlp.down_proj.bias'

@egortolmachev
Contributor Author

Thanks! Could you post that model's link and put up a test result here?

Model link: https://huggingface.co/NEURALDEEPTECH/command-r-gptq

Test code:

from vllm import LLM, SamplingParams
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="NEURALDEEPTECH/command-r-gptq") #, tensor_parallel_size=1)
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Output:

Prompt: 'Hello, my name is', Generated text: ' Sarah. I am a designer and freelance illustrator currently based in Sydney, Australia.'
Prompt: 'The president of the United States is', Generated text: ' no longer a mere man, he has become a symbol. In the past century'
Prompt: 'The capital of France is', Generated text: ' a city that deserves your attention. Paris, its inhabitants, its streets, its'
Prompt: 'The future of AI is', Generated text: ' not an all-knowing, all-controlling computer program, but a'

@egortolmachev
Contributor Author

I cloned the fork, ran the model, and got the same old error: KeyError: 'model.layers.13.mlp.down_proj.bias'

Try this model: https://huggingface.co/NEURALDEEPTECH/command-r-gptq

@egortolmachev
Contributor Author

Would this work with the new Command R+ as well? Looks to be the same CohereForCausalLM architecture.

Maybe :) You should try it!

@egortolmachev
Contributor Author

For quantization I've used AutoGPTQ from this PR: AutoGPTQ/AutoGPTQ#631

This model, https://huggingface.co/NEURALDEEPTECH/command-r-gptq, is not quantized very well, because I used only one primitive calibration sample. I didn't have time to find a good dataset and just wanted to check whether my quantization and inference code works.
In my personal RAGAS test this model (https://huggingface.co/NEURALDEEPTECH/command-r-gptq) achieves 0.5 answer_correctness, while https://huggingface.co/CohereForAI/c4ai-command-r-v01-4bit achieves 0.6782.
If somebody could quantize Command-R with GPTQ group size 32 and act_order=True on this dataset: https://huggingface.co/datasets/CohereForAI/aya_dataset (I think Cohere used this one for Command R training), I think the quantization would be more accurate.
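For reference, a minimal sketch of a quantization run with those settings (4-bit, group size 32, act_order/desc_act=True) on the aya dataset, using the AutoGPTQ API and assuming the Cohere-support branch from the PR linked above. This is illustrative only: the calibration-set size, the text formatting, and the aya column names ("inputs"/"targets") are assumptions, not the exact script used for the published checkpoint.

from datasets import load_dataset
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

base_model = "CohereForAI/c4ai-command-r-v01"
out_dir = "command-r-gptq-g32"

tokenizer = AutoTokenizer.from_pretrained(base_model)

# Small calibration set drawn from aya (column names assumed).
aya = load_dataset("CohereForAI/aya_dataset", split="train").select(range(256))
examples = [tokenizer(row["inputs"] + "\n" + row["targets"]) for row in aya]

quantize_config = BaseQuantizeConfig(
    bits=4,         # 4-bit GPTQ
    group_size=32,  # smaller groups are usually more accurate but larger on disk
    desc_act=True,  # "act_order" in some tools
)

model = AutoGPTQForCausalLM.from_pretrained(base_model, quantize_config)
model.quantize(examples)
model.save_quantized(out_dir, use_safetensors=True)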

@osilverstein

For quantization I've used AutoGPTQ from this PR: AutoGPTQ/AutoGPTQ#631

Thank you! I'll test later today. I was using the likely bunk Cyleux one.

@esmeetu
Collaborator

esmeetu commented Apr 6, 2024

@egortolmachev Thanks for your testing! IIUC, the error @osilverstein reported might be related to AutoGPTQ/AutoGPTQ#601: AutoGPTQ always generates bias weights whether or not they exist in the original model. For now I think it's better to add this check before line 316 when loading weights, like in llama.py:

if name.endswith(".bias") and name not in params_dict:
    continue

@osilverstein Could you test your model again with the above patch applied?
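For context, a minimal sketch of how that check slots into the weight-loading loop of vllm/model_executor/models/commandr.py, following the same pattern as llama.py. This is a simplified paraphrase, not the exact method: weights_iterator and model stand in for the real weight iterator and module, the stacked-parameter (qkv/gate_up) handling is omitted, and the import path assumes vLLM 0.4.x.

from vllm.model_executor.weight_utils import default_weight_loader

# Simplified fallback branch of load_weights (sketch only).
params_dict = dict(model.named_parameters())
for name, loaded_weight in weights_iterator:
    # lm_head is tied to embed_tokens in vLLM, so its weight is skipped.
    if "lm_head.weight" in name:
        continue
    # AutoGPTQ checkpoints can contain *.bias tensors that the original
    # model never defined; skip them instead of raising KeyError.
    if name.endswith(".bias") and name not in params_dict:
        continue
    param = params_dict[name]
    weight_loader = getattr(param, "weight_loader", default_weight_loader)
    weight_loader(param, loaded_weight)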

@TNT3530

TNT3530 commented Apr 7, 2024

Would this work with the new Command R+ as well? Looks to be the same CohereForCausalLM architecture.

Maybe :) You should try it!

Tried the GPTQ Command R+ model (https://huggingface.co/alpindale/c4ai-command-r-plus-GPTQ) on my AMD system and similarly got KeyError: 'model.layers.42.mlp.down_proj.bias' after building the Docker image from source.

@osilverstein

I've tested NEURALDEEPTECH's GPTQ Command-R and it works. It's slow with tp=4, though.

@osilverstein

Tried the GPTQ Command R+ model (https://huggingface.co/alpindale/c4ai-command-r-plus-GPTQ) on my AMD system and similarly got KeyError: 'model.layers.42.mlp.down_proj.bias' after building the Docker image from source.

I also tried it and had the same issue.

@osilverstein

@egortolmachev Thanks for your testing! IIUC, the error @osilverstein reported might be related to AutoGPTQ/AutoGPTQ#601: AutoGPTQ always generates bias weights whether or not they exist in the original model. For now I think it's better to add this check before line 316 when loading weights, like in llama.py:

if name.endswith(".bias") and name not in params_dict:
    continue

@osilverstein Could you test your model again with the above patch applied?

Hey hey, thanks for the guidance, I'm just a bit slow.

Do you know if there is a way to modify the model itself without requantizing? Or to handle it in vLLM? Which file would I change? Many thanks.

@egortolmachev
Contributor Author

Tried the GPTQ Command R+ model (https://huggingface.co/alpindale/c4ai-command-r-plus-GPTQ) on my AMD system and similarly got KeyError: 'model.layers.42.mlp.down_proj.bias' after building the Docker image from source.

Now it works:

import os

os.environ['VLLM_NCCL_SO_PATH'] = '/home/ubuntu/anaconda3/envs/fastchat/lib/python3.10/site-packages/nvidia/nccl/lib/libnccl.so.2'


from vllm import LLM, SamplingParams
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="alpindale/c4ai-command-r-plus-GPTQ", tensor_parallel_size=1, max_model_len=8000, gpu_memory_utilization=0.82)
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Output:

WARNING 04-08 16:27:52 config.py:218] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 04-08 16:27:52 llm_engine.py:81] Initializing an LLM engine (v0.4.0.post1) with config: model='alpindale/c4ai-command-r-plus-GPTQ', speculative_config=None, tokenizer='alpindale/c4ai-command-r-plus-GPTQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8000, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=True, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-08 16:27:54 pynccl.py:49] Loading nccl from environment variable VLLM_NCCL_SO_PATH=/home/ubuntu/anaconda3/envs/fastchat/lib/python3.10/site-packages/nvidia/nccl/lib/libnccl.so.2
INFO 04-08 16:27:54 selector.py:16] Using FlashAttention backend.
INFO 04-08 16:27:58 weight_utils.py:194] Using model weights format ['*.safetensors']
INFO 04-08 16:28:10 model_runner.py:104] Loading model weights took 54.6148 GB
INFO 04-08 16:28:31 gpu_executor.py:99] # GPU blocks: 1393, # CPU blocks: 1024
INFO 04-08 16:28:35 model_runner.py:810] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 04-08 16:28:35 model_runner.py:814] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 04-08 16:29:03 model_runner.py:886] Graph capturing finished in 27 secs.
Processed prompts: 100%|██████████| 4/4 [00:01<00:00,  2.94it/s]
Prompt: 'Hello, my name is', Generated text: ' Joey Lee and I am a senior at Sacred Heart Prep, a small private high'
Prompt: 'The president of the United States is', Generated text: " one of the most powerful leaders in the world. The president's job is to"
Prompt: 'The capital of France is', Generated text: ' a city that deserves your attention. Paris is romantic and its history is full of'
Prompt: 'The future of AI is', Generated text: ' not an all-knowing superintelligence, but a spectrum of technology that will'

@esmeetu esmeetu enabled auto-merge (squash) April 8, 2024 13:59
@egortolmachev
Contributor Author

The Cyleux model works too:

import os

os.environ['VLLM_NCCL_SO_PATH'] = '/home/ubuntu/anaconda3/envs/fastchat/lib/python3.10/site-packages/nvidia/nccl/lib/libnccl.so.2'


from vllm import LLM, SamplingParams
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="Cyleux/command-r-gptq", tensor_parallel_size=1, max_model_len=8000, gpu_memory_utilization=0.82)
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Output:

WARNING 04-08 17:04:10 config.py:218] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 04-08 17:04:10 llm_engine.py:81] Initializing an LLM engine (v0.4.0.post1) with config: model='Cyleux/command-r-gptq', speculative_config=None, tokenizer='Cyleux/command-r-gptq', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8000, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=True, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-08 17:04:11 pynccl.py:49] Loading nccl from environment variable VLLM_NCCL_SO_PATH=/home/ubuntu/anaconda3/envs/fastchat/lib/python3.10/site-packages/nvidia/nccl/lib/libnccl.so.2
INFO 04-08 17:04:12 selector.py:16] Using FlashAttention backend.
INFO 04-08 17:04:14 weight_utils.py:194] Using model weights format ['*.safetensors']
INFO 04-08 17:04:18 model_runner.py:104] Loading model weights took 19.8360 GB
INFO 04-08 17:04:23 gpu_executor.py:99] # GPU blocks: 2113, # CPU blocks: 204
INFO 04-08 17:04:27 model_runner.py:810] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 04-08 17:04:27 model_runner.py:814] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 04-08 17:04:50 model_runner.py:886] Graph capturing finished in 23 secs.
Processed prompts: 100%|██████████| 4/4 [00:00<00:00,  7.96it/s]
Prompt: 'Hello, my name is', Generated text: ' Sarah Lee and I am a MSW graduate student at the University of Washington.'
Prompt: 'The president of the United States is', Generated text: ' no longer a mere “prime minister” of the world. In the past century'
Prompt: 'The capital of France is', Generated text: ' a city that deserves your attention. Paris, the City of Love and Lights,'
Prompt: 'The future of AI is', Generated text: " not being created in Silicon Valley. Rather, it's in China, where the"

@esmeetu esmeetu merged commit f46864d into vllm-project:main Apr 8, 2024
35 checks passed
@osilverstein

osilverstein commented Apr 9, 2024

@esmeetu > For quantization I've used AutoGPTQ from this PR: AutoGPTQ/AutoGPTQ#631

Would you be willing to run your RAGAS benchmark on the Cyleux GPTQ model? It's quantized over a portion of OpenHermes 2.5, which was chosen because I don't actually see evidence that Command-R was trained on aya.
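For reference, a minimal sketch of what such an answer_correctness run with RAGAS might look like. The sample rows are placeholders, the column names vary across RAGAS versions, and an evaluation LLM must be configured (e.g. via OPENAI_API_KEY); the benchmark discussed in this thread is the author's private Russian-language set.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness

# Placeholder evaluation rows: "answer" is the text generated by the
# quantized model under test, "ground_truth" is the reference answer.
eval_data = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["Paris has been the capital of France for centuries."]],
    "ground_truth": ["Paris"],
})

result = evaluate(eval_data, metrics=[answer_correctness])
print(result)  # e.g. {'answer_correctness': 0.67}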

SageMoore pushed a commit to neuralmagic/nm-vllm that referenced this pull request Apr 11, 2024
Co-authored-by: Egor Tolmachev <t333ga@gmail.com>
andy-neuma pushed a commit to neuralmagic/nm-vllm that referenced this pull request Apr 12, 2024
Co-authored-by: Egor Tolmachev <t333ga@gmail.com>
@blldd

blldd commented Apr 16, 2024

The Cyleux model works too: (full test code and output quoted above)

I need help, my friend; the results I obtained were inconsistent with yours:

Prompt: 'Hello, my name is', Generated text: 'section哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈'
Prompt: 'The president of the United States is', Generated text: 'sectionsectionsectionauthor-sectionsection #author-sectionauthor-begin-section'
Prompt: 'The capital of France is', Generated text: 'sectionsectionsectionsectionall-sectionauthor\nbegin sectionauthor-begin-'
Prompt: 'The future of AI is', Generated text: 'sectionsectionsectionsectionbegin\nsectionbegin-sectionauthor-sectionauthor\nsection'

@egortolmachev
Contributor Author

Would you be willing to run your RAGAS benchmark on the Cyleux GPTQ model? It's quantized over a portion of OpenHermes 2.5, which was chosen because I don't actually see evidence that Command-R was trained on aya.

Tested Cyleux on my RAGAS benchmark. It's also poor, but I benchmark only in Russian. I have updated my model, https://huggingface.co/NEURALDEEPTECH/command-r-gptq; it now works much better, but for the second quantization I used only Russian samples from the aya dataset. I'm planning to quantize on the full aya dataset, but that's not my first priority right now.

@egortolmachev
Contributor Author

I need help, my friend; the results I obtained were inconsistent with yours:

Prompt: 'Hello, my name is', Generated text: 'section哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈'
Prompt: 'The president of the United States is', Generated text: 'sectionsectionsectionauthor-sectionsection #author-sectionauthor-begin-section'
Prompt: 'The capital of France is', Generated text: 'sectionsectionsectionsectionall-sectionauthor\nbegin sectionauthor-begin-'
Prompt: 'The future of AI is', Generated text: 'sectionsectionsectionsectionbegin\nsectionbegin-sectionauthor-sectionauthor\nsection'

Can you show your code, full output, model, library versions, and so on?

@blldd

blldd commented Apr 16, 2024

Can you show your code, full output, model, library versions, and so on?

CODE:

from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.3, top_p=0.95)
llm = LLM(model="alpindale/c4ai-command-r-plus-GPTQ", tensor_parallel_size=1, max_model_len=8000, gpu_memory_utilization=0.82)

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

OUTPUT:

Processed prompts: 100%|██████████| 4/4 [00:01<00:00, 3.08it/s]
Prompt: 'Hello, my name is', Generated text: 'section哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈'
Prompt: 'The president of the United States is', Generated text: 'sectionsectionsectionauthor-sectionsection #author-sectionauthor-begin-section'
Prompt: 'The capital of France is', Generated text: 'sectionsectionsectionsectionall-sectionauthor\nbegin sectionauthor-begin-'
Prompt: 'The future of AI is', Generated text: 'sectionsectionsectionsectionbegin\nsectionbegin-sectionauthor-sectionauthor\nsection'

ENV:

NVIDIA-A100-80G
torch 2.1.2+cu118
transformers 4.40.0.dev0
xformers 0.0.23.post1+cu118
vllm 0.4.0.post1+cu118

and my CHANGES to the code are as follows:

                if shard_name not in name:
                    continue
                name = name.replace(shard_name, param_name)
                # Skip loading extra bias for GPTQ models.
                if name.endswith(".bias") and name not in params_dict:
                    continue
                param = params_dict[name]
                weight_loader = param.weight_loader
                weight_loader(param, loaded_weight, shard_id)
                break
            else:
                # lm_head is not used in vllm as it is tied with embed_token.
                # To prevent errors, skip loading lm_head.weight.
                if "lm_head.weight" in name:
                    continue
                # Skip loading extra bias for GPTQ models.
                if name.endswith(".bias") or name not in params_dict:  # here I use 'or', because and have the same error : KeyError: 'model.layers.0.mlp.down_proj.bias'
                    continue
                param = params_dict[name]
                weight_loader = getattr(param, "weight_loader",
                                        default_weight_loader)
                weight_loader(param, loaded_weight)
            loaded_params.add(name)

here I use 'or', because 'and' gives the same error: KeyError: 'model.layers.0.mlp.down_proj.bias'

                if name.endswith(".bias") or name not in params_dict:  

@baochi0212

@egortolmachev any idea on this?

Traceback (most recent call last):
  File "/raid/phogpt_team/chitb/test_vllm.py", line 6, in <module>
    llm = LLM(model=sys.argv[1], gpu_memory_utilization=0.8, enforce_eager=True)
  File "/raid/phogpt_team/chitb/vllm_0.4/lib/python3.9/site-packages/vllm/entrypoints/llm.py", line 112, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/raid/phogpt_team/chitb/vllm_0.4/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 196, in from_engine_args
    engine = cls(
  File "/raid/phogpt_team/chitb/vllm_0.4/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 110, in __init__
    self.model_executor = executor_class(model_config, cache_config,
  File "/raid/phogpt_team/chitb/vllm_0.4/lib/python3.9/site-packages/vllm/executor/gpu_executor.py", line 37, in __init__
    self._init_worker()
  File "/raid/phogpt_team/chitb/vllm_0.4/lib/python3.9/site-packages/vllm/executor/gpu_executor.py", line 66, in _init_worker
    self.driver_worker.load_model()
  File "/raid/phogpt_team/chitb/vllm_0.4/lib/python3.9/site-packages/vllm/worker/worker.py", line 107, in load_model
    self.model_runner.load_model()
  File "/raid/phogpt_team/chitb/vllm_0.4/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 95, in load_model
    self.model = get_model(
  File "/raid/phogpt_team/chitb/vllm_0.4/lib/python3.9/site-packages/vllm/model_executor/model_loader.py", line 101, in get_model
    model.load_weights(model_config.model, model_config.download_dir,
  File "/raid/phogpt_team/chitb/vllm_0.4/lib/python3.9/site-packages/vllm/model_executor/models/commandr.py", line 325, in load_weights
    param = params_dict[name]
KeyError: 'lm_head.weight'

When run with AutoGPTQ:
INFO - The layer lm_head is not quantized.
Thanks

z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request Apr 22, 2024
Co-authored-by: Egor Tolmachev <t333ga@gmail.com>