[Bugfix] Added Command-R GPTQ support #3849

Merged 5 commits into vllm-project:main on Apr 8, 2024

Conversation

egortolmachev
Contributor

@egortolmachev egortolmachev commented Apr 4, 2024

Fixed Command-R GPTQ model loading by analogy with Gemma: #3553

@esmeetu
Collaborator

esmeetu commented Apr 4, 2024

Thanks! Could you post that model's link and put up a test result here?

@deoxykev

deoxykev commented Apr 4, 2024

Would this work with the new Command R+ as well? Looks to be the same CohereForCausalLM architecture.

@osilverstein

Can we get this into the v0.4 release as a slightly updated build?

@osilverstein

I cloned the fork, ran the model, and got the same old error:
KeyError: 'model.layers.13.mlp.down_proj.bias'
(RayWorkerVllm pid=7247) INFO 04-05 05:13:03 selector.py:51] Cannot use FlashAttention because the package is not found. Please install it>
(RayWorkerVllm pid=7247) INFO 04-05 05:13:03 selector.py:25] Using XFormers backend. [repeated 2x across cluster]
(RayWorkerVllm pid=7247) INFO 04-05 05:13:04 pynccl_utils.py:45] vLLM is using nccl==2.18.1 [repeated 2x across cluster]
(RayWorkerVllm pid=7247) ERROR 04-05 05:15:23 ray_utils.py:44] Error executing method load_model. This might cause deadlock in distributed>
(RayWorkerVllm pid=7247) ERROR 04-05 05:15:23 ray_utils.py:44] Traceback (most recent call last):
(RayWorkerVllm pid=7247) ERROR 04-05 05:15:23 ray_utils.py:44]   File "/workspace/vllm/vllm/engine/ray_utils.py", line 37, in execute_meth>
(RayWorkerVllm pid=7247) ERROR 04-05 05:15:23 ray_utils.py:44]     return executor(*args, **kwargs)
(RayWorkerVllm pid=7247) ERROR 04-05 05:15:23 ray_utils.py:44]   File "/workspace/vllm/vllm/worker/worker.py", line 107, in load_model
(RayWorkerVllm pid=7247) ERROR 04-05 05:15:23 ray_utils.py:44]     self.model_runner.load_model()
(RayWorkerVllm pid=7247) ERROR 04-05 05:15:23 ray_utils.py:44]   File "/workspace/vllm/vllm/worker/model_runner.py", line 95, in load_model
(RayWorkerVllm pid=7247) ERROR 04-05 05:15:23 ray_utils.py:44]     self.model = get_model(
(RayWorkerVllm pid=7247) ERROR 04-05 05:15:23 ray_utils.py:44]   File "/workspace/vllm/vllm/model_executor/model_loader.py", line 101, in >
(RayWorkerVllm pid=7247) ERROR 04-05 05:15:23 ray_utils.py:44]     model.load_weights(model_config.model, model_config.download_dir,
(RayWorkerVllm pid=7247) ERROR 04-05 05:15:23 ray_utils.py:44]   File "/workspace/vllm/vllm/model_executor/models/commandr.py", line 329, >
(RayWorkerVllm pid=7247) ERROR 04-05 05:15:23 ray_utils.py:44]     param = params_dict[name]
(RayWorkerVllm pid=7247) ERROR 04-05 05:15:23 ray_utils.py:44] KeyError: 'model.layers.13.mlp.down_proj.bias'
(RayWorkerVllm pid=6994) WARNING 04-05 05:13:07 custom_all_reduce.py:45] Custom allreduce is disabled because your platform lacks GPU P2P >
(RayWorkerVllm pid=7169) INFO 04-05 05:13:09 weight_utils.py:194] Using model weights format ['*.safetensors'] [repeated 2x across c>
(RayWorkerVllm pid=7169) ERROR 04-05 05:15:24 ray_utils.py:44] Error executing method load_model. This might cause deadlock in distributed>
(RayWorkerVllm pid=7169) ERROR 04-05 05:15:24 ray_utils.py:44] (same traceback as above, repeated for pid=7169)
(RayWorkerVllm pid=7169) ERROR 04-05 05:15:24 ray_utils.py:44] KeyError: 'model.layers.13.mlp.down_proj.bias'

@egortolmachev
Contributor Author

Thanks! Could you post that model's link and put up a test result here?

Model link: https://huggingface.co/NEURALDEEPTECH/command-r-gptq

Test code:

from vllm import LLM, SamplingParams
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="NEURALDEEPTECH/command-r-gptq") #, tensor_parallel_size=1)
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Output:

Prompt: 'Hello, my name is', Generated text: ' Sarah. I am a designer and freelance illustrator currently based in Sydney, Australia.'
Prompt: 'The president of the United States is', Generated text: ' no longer a mere man, he has become a symbol. In the past century'
Prompt: 'The capital of France is', Generated text: ' a city that deserves your attention. Paris, its inhabitants, its streets, its'
Prompt: 'The future of AI is', Generated text: ' not an all-knowing, all-controlling computer program, but a'

@egortolmachev
Contributor Author

I cloned the fork, ran the model, and got the same old error: KeyError: 'model.layers.13.mlp.down_proj.bias'

Try this model: https://huggingface.co/NEURALDEEPTECH/command-r-gptq

@egortolmachev
Contributor Author

Would this work with the new Command R+ as well? Looks to be the same CohereForCausalLM architecture.

Maybe :) You should try it!

@egortolmachev
Contributor Author

For quantization I've used AutoGPTQ from this PR: AutoGPTQ/AutoGPTQ#631

This model, https://huggingface.co/NEURALDEEPTECH/command-r-gptq, is not quantized very well, because I used only one primitive calibration sample. I didn't have time to find a good dataset and just wanted to check whether my quantization and inference code works.
In my personal RAGAS test this model (https://huggingface.co/NEURALDEEPTECH/command-r-gptq) achieves 0.5 answer_correctness, while https://huggingface.co/CohereForAI/c4ai-command-r-v01-4bit achieves 0.6782.
If somebody could quantize Command-R with GPTQ group size 32 and act_order=True on this dataset: https://huggingface.co/datasets/CohereForAI/aya_dataset (I think Cohere used this one for Command R training), I think the quantization would be more accurate.
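For reference, a minimal sketch of a quantization run with those settings (4-bit, group size 32, act_order/desc_act=True) on the aya dataset, using the AutoGPTQ API and assuming the Cohere-support branch from the PR linked above. This is illustrative only: the calibration-set size, the text formatting, and the aya column names ("inputs"/"targets") are assumptions, not the exact script used for the published checkpoint.

from datasets import load_dataset
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

base_model = "CohereForAI/c4ai-command-r-v01"
out_dir = "command-r-gptq-g32"

tokenizer = AutoTokenizer.from_pretrained(base_model)

# Small calibration set drawn from aya (column names assumed).
aya = load_dataset("CohereForAI/aya_dataset", split="train").select(range(256))
examples = [tokenizer(row["inputs"] + "\n" + row["targets"]) for row in aya]

quantize_config = BaseQuantizeConfig(
    bits=4,         # 4-bit GPTQ
    group_size=32,  # smaller groups are usually more accurate but larger on disk
    desc_act=True,  # "act_order" in some tools
)

model = AutoGPTQForCausalLM.from_pretrained(base_model, quantize_config)
model.quantize(examples)
model.save_quantized(out_dir, use_safetensors=True)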

@osilverstein

For quantization I've used AutoGPTQ from this PR: AutoGPTQ/AutoGPTQ#631

Thank you! I'll test later today. I was using the likely bunk Cyleux one.

@esmeetu
Collaborator

esmeetu commented Apr 6, 2024

@egortolmachev Thanks for your testing! IIUC, the error @osilverstein reported might be related to AutoGPTQ/AutoGPTQ#601: AutoGPTQ always generates bias weights whether or not they exist in the original model. For now I think it's better to add this check before line 316 when loading weights, like in llama.py:

if name.endswith(".bias") and name not in params_dict:
    continue

@osilverstein Could you test your model again with the above patch applied?
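For context, a minimal sketch of how that check slots into the weight-loading loop of vllm/model_executor/models/commandr.py, following the same pattern as llama.py. This is a simplified paraphrase, not the exact method: weights_iterator and model stand in for the real weight iterator and module, the stacked-parameter (qkv/gate_up) handling is omitted, and the import path assumes vLLM 0.4.x.

from vllm.model_executor.weight_utils import default_weight_loader

# Simplified fallback branch of load_weights (sketch only).
params_dict = dict(model.named_parameters())
for name, loaded_weight in weights_iterator:
    # lm_head is tied to embed_tokens in vLLM, so its weight is skipped.
    if "lm_head.weight" in name:
        continue
    # AutoGPTQ checkpoints can contain *.bias tensors that the original
    # model never defined; skip them instead of raising KeyError.
    if name.endswith(".bias") and name not in params_dict:
        continue
    param = params_dict[name]
    weight_loader = getattr(param, "weight_loader", default_weight_loader)
    weight_loader(param, loaded_weight)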

@TNT3530

TNT3530 commented Apr 7, 2024

Would this work with the new Command R+ as well? Looks to be the same CohereForCausalLM architecture.

Maybe :) You should try it!

Tried the GPTQ Command R+ model (https://huggingface.co/alpindale/c4ai-command-r-plus-GPTQ) on my AMD system and similarly got KeyError: 'model.layers.42.mlp.down_proj.bias' after building the Docker image from source.

@osilverstein

I've tested NEURALDEEPTECH's GPTQ Command-R and it works. It's slow with tp=4, though.

@osilverstein

Tried the GPTQ Command R+ model (https://huggingface.co/alpindale/c4ai-command-r-plus-GPTQ) on my AMD system and similarly got KeyError: 'model.layers.42.mlp.down_proj.bias' after building the Docker image from source.

I also tried it and had the same issue.

@osilverstein

@egortolmachev Thanks for your testing! IIUC, the error @osilverstein reported might be related to AutoGPTQ/AutoGPTQ#601: AutoGPTQ always generates bias weights whether or not they exist in the original model. For now I think it's better to add this check before line 316 when loading weights, like in llama.py:

if name.endswith(".bias") and name not in params_dict:
    continue

@osilverstein Could you test your model again with the above patch applied?

Hey hey, thanks for the guidance, I'm just a bit slow.

Do you know if there is a way to modify the model itself without requantizing? Or to handle it in vLLM? Which file would I change? Many thanks.

@egortolmachev
Contributor Author

Tried the GPTQ Command R+ model (https://huggingface.co/alpindale/c4ai-command-r-plus-GPTQ) on my AMD system and similarly got KeyError: 'model.layers.42.mlp.down_proj.bias' after building the Docker image from source.

Now it works:

import os

os.environ['VLLM_NCCL_SO_PATH'] = '/home/ubuntu/anaconda3/envs/fastchat/lib/python3.10/site-packages/nvidia/nccl/lib/libnccl.so.2'


from vllm import LLM, SamplingParams
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="alpindale/c4ai-command-r-plus-GPTQ", tensor_parallel_size=1, max_model_len=8000, gpu_memory_utilization=0.82)
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Output:

WARNING 04-08 16:27:52 config.py:218] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 04-08 16:27:52 llm_engine.py:81] Initializing an LLM engine (v0.4.0.post1) with config: model='alpindale/c4ai-command-r-plus-GPTQ', speculative_config=None, tokenizer='alpindale/c4ai-command-r-plus-GPTQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8000, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=True, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-08 16:27:54 pynccl.py:49] Loading nccl from environment variable VLLM_NCCL_SO_PATH=/home/ubuntu/anaconda3/envs/fastchat/lib/python3.10/site-packages/nvidia/nccl/lib/libnccl.so.2
INFO 04-08 16:27:54 selector.py:16] Using FlashAttention backend.
INFO 04-08 16:27:58 weight_utils.py:194] Using model weights format ['*.safetensors']
INFO 04-08 16:28:10 model_runner.py:104] Loading model weights took 54.6148 GB
INFO 04-08 16:28:31 gpu_executor.py:99] # GPU blocks: 1393, # CPU blocks: 1024
INFO 04-08 16:28:35 model_runner.py:810] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 04-08 16:28:35 model_runner.py:814] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 04-08 16:29:03 model_runner.py:886] Graph capturing finished in 27 secs.
Processed prompts: 100%|██████████| 4/4 [00:01<00:00,  2.94it/s]
Prompt: 'Hello, my name is', Generated text: ' Joey Lee and I am a senior at Sacred Heart Prep, a small private high'
Prompt: 'The president of the United States is', Generated text: " one of the most powerful leaders in the world. The president's job is to"
Prompt: 'The capital of France is', Generated text: ' a city that deserves your attention. Paris is romantic and its history is full of'
Prompt: 'The future of AI is', Generated text: ' not an all-knowing superintelligence, but a spectrum of technology that will'

@esmeetu esmeetu enabled auto-merge (squash) April 8, 2024 13:59
@egortolmachev
Contributor Author

The Cyleux model works too:

import os

os.environ['VLLM_NCCL_SO_PATH'] = '/home/ubuntu/anaconda3/envs/fastchat/lib/python3.10/site-packages/nvidia/nccl/lib/libnccl.so.2'


from vllm import LLM, SamplingParams
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="Cyleux/command-r-gptq", tensor_parallel_size=1, max_model_len=8000, gpu_memory_utilization=0.82)
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Output:

WARNING 04-08 17:04:10 config.py:218] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 04-08 17:04:10 llm_engine.py:81] Initializing an LLM engine (v0.4.0.post1) with config: model='Cyleux/command-r-gptq', speculative_config=None, tokenizer='Cyleux/command-r-gptq', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8000, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=True, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-08 17:04:11 pynccl.py:49] Loading nccl from environment variable VLLM_NCCL_SO_PATH=/home/ubuntu/anaconda3/envs/fastchat/lib/python3.10/site-packages/nvidia/nccl/lib/libnccl.so.2
INFO 04-08 17:04:12 selector.py:16] Using FlashAttention backend.
INFO 04-08 17:04:14 weight_utils.py:194] Using model weights format ['*.safetensors']
INFO 04-08 17:04:18 model_runner.py:104] Loading model weights took 19.8360 GB
INFO 04-08 17:04:23 gpu_executor.py:99] # GPU blocks: 2113, # CPU blocks: 204
INFO 04-08 17:04:27 model_runner.py:810] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 04-08 17:04:27 model_runner.py:814] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 04-08 17:04:50 model_runner.py:886] Graph capturing finished in 23 secs.
Processed prompts: 100%|██████████| 4/4 [00:00<00:00,  7.96it/s]
Prompt: 'Hello, my name is', Generated text: ' Sarah Lee and I am a MSW graduate student at the University of Washington.'
Prompt: 'The president of the United States is', Generated text: ' no longer a mere “prime minister” of the world. In the past century'
Prompt: 'The capital of France is', Generated text: ' a city that deserves your attention. Paris, the City of Love and Lights,'
Prompt: 'The future of AI is', Generated text: " not being created in Silicon Valley. Rather, it's in China, where the"

@esmeetu esmeetu merged commit f46864d into vllm-project:main Apr 8, 2024
35 checks passed
@osilverstein

osilverstein commented Apr 9, 2024

@esmeetu > For quantization I've used AutoGPTQ from this PR: AutoGPTQ/AutoGPTQ#631

Would you be willing to run your RAGAS benchmark on the Cyleux GPTQ model? It's quantized over a portion of OpenHermes 2.5, which was chosen because I don't actually see evidence that Command-R was trained on aya.
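For reference, a minimal sketch of what such an answer_correctness run with RAGAS might look like. The sample rows are placeholders, the column names vary across RAGAS versions, and an evaluation LLM must be configured (e.g. via OPENAI_API_KEY); the benchmark discussed in this thread is the author's private Russian-language set.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness

# Placeholder evaluation rows: "answer" is the text generated by the
# quantized model under test, "ground_truth" is the reference answer.
eval_data = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["Paris has been the capital of France for centuries."]],
    "ground_truth": ["Paris"],
})

result = evaluate(eval_data, metrics=[answer_correctness])
print(result)  # e.g. {'answer_correctness': 0.67}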

SageMoore pushed a commit to neuralmagic/nm-vllm that referenced this pull request Apr 11, 2024
Co-authored-by: Egor Tolmachev <t333ga@gmail.com>
andy-neuma pushed a commit to neuralmagic/nm-vllm that referenced this pull request Apr 12, 2024
Co-authored-by: Egor Tolmachev <t333ga@gmail.com>
@blldd

blldd commented Apr 16, 2024

The Cyleux model works too: (full test code and output quoted above)

I need help, my friend; the results I obtained were inconsistent with yours:

Prompt: 'Hello, my name is', Generated text: 'section哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈'
Prompt: 'The president of the United States is', Generated text: 'sectionsectionsectionauthor-sectionsection #author-sectionauthor-begin-section'
Prompt: 'The capital of France is', Generated text: 'sectionsectionsectionsectionall-sectionauthor\nbegin sectionauthor-begin-'
Prompt: 'The future of AI is', Generated text: 'sectionsectionsectionsectionbegin\nsectionbegin-sectionauthor-sectionauthor\nsection'

@egortolmachev
Contributor Author

Would you be willing to run your RAGAS benchmark on the Cyleux GPTQ model? It's quantized over a portion of OpenHermes 2.5, which was chosen because I don't actually see evidence that Command-R was trained on aya.

Tested Cyleux on my RAGAS benchmark. It's also poor, but I benchmark only in Russian. I have updated my model, https://huggingface.co/NEURALDEEPTECH/command-r-gptq; it now works much better, but for the second quantization I used only Russian samples from the aya dataset. I'm planning to quantize on the full aya dataset, but that's not my first priority right now.

@egortolmachev
Contributor Author

I need help, my friend; the results I obtained were inconsistent with yours:

Prompt: 'Hello, my name is', Generated text: 'section哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈'
Prompt: 'The president of the United States is', Generated text: 'sectionsectionsectionauthor-sectionsection #author-sectionauthor-begin-section'
Prompt: 'The capital of France is', Generated text: 'sectionsectionsectionsectionall-sectionauthor\nbegin sectionauthor-begin-'
Prompt: 'The future of AI is', Generated text: 'sectionsectionsectionsectionbegin\nsectionbegin-sectionauthor-sectionauthor\nsection'

Can you show your code, full output, model, library versions, and so on?

@blldd

blldd commented Apr 16, 2024

Can you show your code, full output, model, library versions, and so on?

CODE:

from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.3, top_p=0.95)
llm = LLM(model="alpindale/c4ai-command-r-plus-GPTQ", tensor_parallel_size=1, max_model_len=8000, gpu_memory_utilization=0.82)

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

OUTPUT:

Processed prompts: 100%|██████████| 4/4 [00:01<00:00, 3.08it/s]
Prompt: 'Hello, my name is', Generated text: 'section哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈'
Prompt: 'The president of the United States is', Generated text: 'sectionsectionsectionauthor-sectionsection #author-sectionauthor-begin-section'
Prompt: 'The capital of France is', Generated text: 'sectionsectionsectionsectionall-sectionauthor\nbegin sectionauthor-begin-'
Prompt: 'The future of AI is', Generated text: 'sectionsectionsectionsectionbegin\nsectionbegin-sectionauthor-sectionauthor\nsection'

ENV:

NVIDIA-A100-80G
torch 2.1.2+cu118
transformers 4.40.0.dev0
xformers 0.0.23.post1+cu118
vllm 0.4.0.post1+cu118

and my CHANGES to the code are as follows:

                if shard_name not in name:
                    continue
                name = name.replace(shard_name, param_name)
                # Skip loading extra bias for GPTQ models.
                if name.endswith(".bias") and name not in params_dict:
                    continue
                param = params_dict[name]
                weight_loader = param.weight_loader
                weight_loader(param, loaded_weight, shard_id)
                break
            else:
                # lm_head is not used in vllm as it is tied with embed_token.
                # To prevent errors, skip loading lm_head.weight.
                if "lm_head.weight" in name:
                    continue
                # Skip loading extra bias for GPTQ models.
                if name.endswith(".bias") or name not in params_dict:  # here I use 'or', because and have the same error : KeyError: 'model.layers.0.mlp.down_proj.bias'
                    continue
                param = params_dict[name]
                weight_loader = getattr(param, "weight_loader",
                                        default_weight_loader)
                weight_loader(param, loaded_weight)
            loaded_params.add(name)

here I use 'or', because 'and' gives the same error: KeyError: 'model.layers.0.mlp.down_proj.bias'

                if name.endswith(".bias") or name not in params_dict:  

@baochi0212

@egortolmachev any idea on this?

Traceback (most recent call last):
  File "/raid/phogpt_team/chitb/test_vllm.py", line 6, in <module>
    llm = LLM(model=sys.argv[1], gpu_memory_utilization=0.8, enforce_eager=True)
  File "/raid/phogpt_team/chitb/vllm_0.4/lib/python3.9/site-packages/vllm/entrypoints/llm.py", line 112, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/raid/phogpt_team/chitb/vllm_0.4/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 196, in from_engine_args
    engine = cls(
  File "/raid/phogpt_team/chitb/vllm_0.4/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 110, in __init__
    self.model_executor = executor_class(model_config, cache_config,
  File "/raid/phogpt_team/chitb/vllm_0.4/lib/python3.9/site-packages/vllm/executor/gpu_executor.py", line 37, in __init__
    self._init_worker()
  File "/raid/phogpt_team/chitb/vllm_0.4/lib/python3.9/site-packages/vllm/executor/gpu_executor.py", line 66, in _init_worker
    self.driver_worker.load_model()
  File "/raid/phogpt_team/chitb/vllm_0.4/lib/python3.9/site-packages/vllm/worker/worker.py", line 107, in load_model
    self.model_runner.load_model()
  File "/raid/phogpt_team/chitb/vllm_0.4/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 95, in load_model
    self.model = get_model(
  File "/raid/phogpt_team/chitb/vllm_0.4/lib/python3.9/site-packages/vllm/model_executor/model_loader.py", line 101, in get_model
    model.load_weights(model_config.model, model_config.download_dir,
  File "/raid/phogpt_team/chitb/vllm_0.4/lib/python3.9/site-packages/vllm/model_executor/models/commandr.py", line 325, in load_weights
    param = params_dict[name]
KeyError: 'lm_head.weight'

When run with AutoGPTQ:
INFO - The layer lm_head is not quantized.
Thanks

z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request Apr 22, 2024
Co-authored-by: Egor Tolmachev <t333ga@gmail.com>