(Fixable?) 400 Error with vLLM API - extra input #1002

Closed
ccruttjr opened this issue May 9, 2024 · 14 comments

ccruttjr commented May 9, 2024

Howdy. It seems that when running a vLLM server and then attempting to interact with it via HFClientVLLM, I get an error. Here is how to reproduce it:

# Computer 1
pip install ray==2.20.0 vllm==0.4.2 dspy-ai==2.4.9 flash_attn==2.5.8
ray start --head --num-gpus 1
# Computer 2. Address will be Computer 1's IP address
pip install ray==2.20.0 vllm==0.4.2 dspy-ai==2.4.9 flash_attn==2.5.8
ray start --address='192.168.250.20:6379' --num-gpus 1
# Computer 1
python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --host 0.0.0.0 --port 8000 -tp 2 --seed 42
# Any computer on network (including Computer 1 and 2). Address will be Computer 1's IP address
pip install dspy-ai==2.4.9 # If not already installed
python -c 'import dspy;lm = dspy.HFClientVLLM(model="facebook/opt-125m", port=8000, url="http://192.168.250.20", seed=42);dspy.configure(lm=lm);print(lm("Test"))'

If you only have one computer, this gives the same output:

pip install ray==2.20.0 vllm==0.4.2 dspy-ai==2.4.9 flash_attn==2.5.8
python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --host 0.0.0.0 --port 8000 --seed 42
# Different tab
python -c 'import dspy;lm = dspy.HFClientVLLM(model="facebook/opt-125m", port=8000, url="http://localhost", seed=42);dspy.configure(lm=lm);print(lm("Test"))'

This gives me the following output:

Failed to parse JSON response: {"object":"error","message":"[{'type': 'extra_forbidden', 'loc': ('body', 'port'), 'msg': 'Extra inputs are not permitted', 'input': 8000, 'url': 'https://errors.pydantic.dev/2.5/v/extra_forbidden'}, {'type': 'extra_forbidden', 'loc': ('body', 'url'), 'msg': 'Extra inputs are not permitted', 'input': ['http://192.168.250.10:8000'], 'url': 'https://errors.pydantic.dev/2.5/v/extra_forbidden'}]","type":"BadRequestError","param":null,"code":400}
Traceback (most recent call last):
  File "/home/daimyollc/anaconda3/envs/dspyVenv/lib/python3.10/site-packages/dsp/modules/hf_client.py", line 199, in _generate
    completions = json_response["choices"]
KeyError: 'choices'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/daimyollc/anaconda3/envs/dspyVenv/lib/python3.10/site-packages/dsp/modules/hf.py", line 190, in __call__
    response = self.request(prompt, **kwargs)
  File "/home/daimyollc/anaconda3/envs/dspyVenv/lib/python3.10/site-packages/dsp/modules/lm.py", line 26, in request
    return self.basic_request(prompt, **kwargs)
  File "/home/daimyollc/anaconda3/envs/dspyVenv/lib/python3.10/site-packages/dsp/modules/hf.py", line 147, in basic_request
    response = self._generate(prompt, **kwargs)
  File "/home/daimyollc/anaconda3/envs/dspyVenv/lib/python3.10/site-packages/dsp/modules/hf_client.py", line 208, in _generate
    raise Exception("Received invalid JSON response from server")
Exception: Received invalid JSON response from server

Going into dsp/modules/hf_client.py, I tried commenting out this line:

payload = {
    "model": self.kwargs["model"],
    "prompt": prompt,
    # **kwargs, commented this line out!
}

Now, when I run

python -c 'import dspy;lm = dspy.HFClientVLLM(model="facebook/opt-125m", port=8000, url="http://localhost", seed=42);dspy.configure(lm=lm);print(lm("Test"))'

It returns

['osterone, Muscle Trackers, Water Race, Anabolic Shifting, Tally']

yay!
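
For reference, a narrower variant of the same workaround (a sketch only; the allow-list below is my own guess, not dspy's or vLLM's official parameter set): instead of dropping **kwargs entirely, forward only the sampling parameters the OpenAI-compatible completions endpoint accepts, so client-side settings such as port and url never reach the server.

# Sketch of a narrower patch in dsp/modules/hf_client.py (_generate), assuming the
# payload construction shown above. ALLOWED_SAMPLING_KEYS is an assumption, not
# taken from dspy or vLLM.
ALLOWED_SAMPLING_KEYS = {"temperature", "top_p", "max_tokens", "n", "stop", "seed"}

payload = {
    "model": self.kwargs["model"],
    "prompt": prompt,
    # forward only recognized sampling parameters instead of every kwarg
    **{k: v for k, v in kwargs.items() if k in ALLOWED_SAMPLING_KEYS},
}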

@Wolfsauge

I found this issue while looking for solutions to a similar-looking issue I am having with the latest dspy (2.4.9). However, the suggested workaround didn't have any effect in my scenario. You might want to check whether your issue is already present in dspy 2.4.7.

@tom-doerr
Contributor

This issue should be solved by #1043


francisco-perez-sorrosal commented Jun 7, 2024

TL;DR: I've tried the changes proposed in Issue #1025 and PR #1043, and I can confirm that the fix works with dspy version 2.4.9 and vllm version 0.4.3, at least for the minimal example described below.

I was switching from ollama to vLLM in my dspy project and ended up having the same problem with dspy version 2.4.9 and vllm version 0.4.3. So I tried the bare-minimum example provided in the documentation:

Server:

 pixi r python -m vllm.entrypoints.api_server --trust-remote-code --model meta-llama/Llama-2-7b-hf --port 8081  

Code:

model="meta-llama/Llama-2-7b-hf"
lm = dspy.HFClientVLLM(model=model, port=8081, url="http://localhost")
dspy.configure(lm=lm)
qa = dspy.ChainOfThought('question -> answer')

response = qa(question="What is the capital of Paris?")
print(response.answer)

Output:

Failed to parse JSON response: {"detail":"Not Found"}
Traceback (most recent call last):
  File "/home/jupyter/dev/funes/.pixi/envs/default/lib/python3.11/site-packages/dsp/modules/hf_client.py", line 199, in _generate
    completions = json_response["choices"]
                  ~~~~~~~~~~~~~^^^^^^^^^^^
KeyError: 'choices'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jupyter/dev/funes/funes/main_test.py", line 23, in <module>
    response = qa(question="What is the capital of Paris?") #Prompted to vllm_llama2
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jupyter/dev/funes/.pixi/envs/default/lib/python3.11/site-packages/dspy/predict/predict.py", line 61, in __call__
    return self.forward(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jupyter/dev/funes/.pixi/envs/default/lib/python3.11/site-packages/dspy/predict/chain_of_thought.py", line 59, in forward
    return super().forward(signature=signature, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jupyter/dev/funes/.pixi/envs/default/lib/python3.11/site-packages/dspy/predict/predict.py", line 103, in forward
    x, C = dsp.generate(template, **config)(x, stage=self.stage)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jupyter/dev/funes/.pixi/envs/default/lib/python3.11/site-packages/dsp/primitives/predict.py", line 77, in do_generate
    completions: list[dict[str, Any]] = generator(prompt, **kwargs)
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jupyter/dev/funes/.pixi/envs/default/lib/python3.11/site-packages/dsp/modules/hf.py", line 190, in __call__
    response = self.request(prompt, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jupyter/dev/funes/.pixi/envs/default/lib/python3.11/site-packages/dsp/modules/lm.py", line 26, in request
    return self.basic_request(prompt, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jupyter/dev/funes/.pixi/envs/default/lib/python3.11/site-packages/dsp/modules/hf.py", line 147, in basic_request
    response = self._generate(prompt, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jupyter/dev/funes/.pixi/envs/default/lib/python3.11/site-packages/dsp/modules/hf_client.py", line 208, in _generate
    raise Exception("Received invalid JSON response from server")
Exception: Received invalid JSON response from server

I've tried removing the **kwargs params from the payload in hf_client.py as @ccruttjr suggested, but it doesn't work.
Then I tried the changes proposed in Issue #1025 and PR #1043, and I can confirm that they work.

Hope to see this soon in the new release! Thanks!
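
Unrelated to the fix itself: a quick way to tell whether the {"detail":"Not Found"} response comes from the server route rather than from dspy is to hit the endpoint directly. A minimal sketch, assuming an OpenAI-compatible vLLM server as in the original report (the /v1/completions path, port, and model name here are assumptions for this setup):

import requests

# Query the vLLM OpenAI-compatible completions route directly, bypassing dspy.
# A {"detail": "Not Found"} here suggests the server was started with a
# different entrypoint than the one the client expects.
resp = requests.post(
    "http://localhost:8081/v1/completions",
    json={"model": "meta-llama/Llama-2-7b-hf", "prompt": "Test", "max_tokens": 16},
    timeout=30,
)
print(resp.status_code, resp.text)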

@tom-doerr
Contributor

Maybe there should be a new DSPy package release using the latest code in main; the PR that fixes this is already merged.

@brando90

Maybe there should be a new DSPy package release using the latest code in main; the PR that fixes this is already merged.

My dspy version is 2.4.17 and my vllm version is 0.5.4.

What fixes this issue, @tom-doerr?

@brando90

This installs vllm, but the dspy vLLM server still fails:

pip install --upgrade pip
pip uninstall torchvision vllm vllm-flash-attn flash-attn xformers
pip install torch==2.2.1 vllm==0.4.1 

@tom-doerr any help?

@tom-doerr
Contributor

@brando90
How do you have 2.4.17? Isn't the newest version 2.4.16?
Do you get the exact same error message?
If so, could you post relevant code?


brando90 commented Sep 14, 2024

@tom-doerr apologies, I no longer have access to the bash session from when I wrote that message; it was likely a typo. I can confirm I do have 2.4.16, though:

(uutils) brando9@skampere1~ $ pip list | grep dspy
dspy-ai                                 2.4.16

and vllm version is:

(uutils) brando9@skampere1~ $ pip list | grep vllm
vllm                                    0.4.1
vllm_nccl_cu12                          2.18.1.0.4.0
(uutils) brando9@skampere1~ $ pip list | grep torch
fast-pytorch-kmeans                     0.2.0.1
torch                                   2.2.1

My flash attention doesn't work, FYI:

INFO 09-13 18:56:14 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.

Thanks for taking the time to respond and help.

@tom-doerr
Contributor

Could you post the error message you are getting?
Do you use any less commonly used DSPy features?

@brando90

Could you post the error message you are getting? Do you use any less commonly used DSPy features?

@tom-doerr happy to help!

(snap_cluster_setup_py311) brando9@skampere1~ $ conda activate uutils
(uutils) brando9@skampere1~ $ python ~/ultimate-utils/py_src/uutils/dspy_uu/examples/full_toy_vllm_local_mdl.py

  0%|                                                                                                     | 0/3 [00:00<?, ?it/s]Failed to parse JSON response: {"detail":"Not Found"}
2024-09-14T02:04:17.834140Z [error    ] Failed to run or to evaluate example Example({'question': 'What is the capital of France?', 'answer': 'Paris'}) (input_keys={'question'}) with <function exact_match_metric at 0x7f024e0cc0e0> due to Received invalid JSON response from server. [dspy.teleprompt.bootstrap] filename=bootstrap.py lineno=211
Failed to parse JSON response: {"detail":"Not Found"}
2024-09-14T02:04:17.834909Z [error    ] Failed to run or to evaluate example Example({'question': "Who wrote '1984'?", 'answer': 'George Orwell'}) (input_keys={'question'}) with <function exact_match_metric at 0x7f024e0cc0e0> due to Received invalid JSON response from server. [dspy.teleprompt.bootstrap] filename=bootstrap.py lineno=211
Failed to parse JSON response: {"detail":"Not Found"}
2024-09-14T02:04:17.835576Z [error    ] Failed to run or to evaluate example Example({'question': 'What is the boiling point of water?', 'answer': '100°C'}) (input_keys={'question'}) with <function exact_match_metric at 0x7f024e0cc0e0> due to Received invalid JSON response from server. [dspy.teleprompt.bootstrap] filename=bootstrap.py lineno=211
100%|████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 633.96it/s]
Bootstrapped 0 full traces after 3 examples in round 0.
Failed to parse JSON response: {"detail":"Not Found"}
Traceback (most recent call last):
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dsp/modules/hf_client.py", line 243, in _generate
    completions = json_response["choices"]
                  ~~~~~~~~~~~~~^^^^^^^^^^^
KeyError: 'choices'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/lfs/skampere1/0/brando9/ultimate-utils/py_src/uutils/dspy_uu/examples/full_toy_vllm_local_mdl.py", line 65, in <module>
    pred = compiled_simple_qa(my_question)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dspy/primitives/program.py", line 26, in __call__
    return self.forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/ultimate-utils/py_src/uutils/dspy_uu/examples/full_toy_vllm_local_mdl.py", line 50, in forward
    prediction = self.generate_answer(question=question)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dspy/primitives/program.py", line 26, in __call__
    return self.forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dspy/predict/chain_of_thought.py", line 36, in forward
    return self._predict(signature=signature, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dspy/predict/predict.py", line 91, in __call__
    return self.forward(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dspy/predict/predict.py", line 128, in forward
    completions = old_generate(demos, signature, kwargs, config, self.lm, self.stage)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dspy/predict/predict.py", line 155, in old_generate
    x, C = dsp.generate(template, **config)(x, stage=stage)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dsp/primitives/predict.py", line 73, in do_generate
    completions: list[dict[str, Any]] = generator(prompt, **kwargs)
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dsp/modules/hf.py", line 193, in __call__
    response = self.request(prompt, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dsp/modules/lm.py", line 27, in request
    return self.basic_request(prompt, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dsp/modules/hf.py", line 147, in basic_request
    response = self._generate(prompt, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dsp/modules/hf_client.py", line 252, in _generate
    raise Exception("Received invalid JSON response from server")
Exception: Received invalid JSON response from server

@brando90

The vLLM server running on the side:

(snap_cluster_setup_py311) brando9@skampere1~ $ conda activate uutils
(uutils) brando9@skampere1~ $ python -m vllm.entrypoints.api_server --model meta-llama/Llama-2-7b-hf --port 8080

INFO 09-13 19:04:13 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='meta-llama/Llama-2-7b-hf', speculative_config=None, tokenizer='meta-llama/Llama-2-7b-hf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
INFO 09-13 19:04:13 utils.py:608] Found nccl from library /lfs/skampere1/0/brando9/.config/vllm/nccl/cu12/libnccl.so.2.18.1
INFO 09-13 19:04:14 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.
INFO 09-13 19:04:14 selector.py:33] Using XFormers backend.
INFO 09-13 19:04:15 weight_utils.py:193] Using model weights format ['*.safetensors']
INFO 09-13 19:04:17 model_runner.py:173] Loading model weights took 12.5523 GB
INFO 09-13 19:04:18 gpu_executor.py:119] # GPU blocks: 7406, # CPU blocks: 512
INFO 09-13 19:04:19 model_runner.py:976] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 09-13 19:04:19 model_runner.py:980] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 09-13 19:04:23 model_runner.py:1057] Graph capturing finished in 4 secs.
INFO:     Started server process [348702]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)

@brando90

@tom-doerr
Contributor

I don't see how that's related to the 400 error. I had the same issue you seem to have: #1041. This issue also seems relevant: #1242.

I don't have a solution for you. You could switch to a different model, check whether someone else has posted a solution to your problem (there are quite a few related issues), or switch to the experimental new DSPy 2.5, which has a new backend.
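
For reference, a minimal sketch of what that would look like with the newer interface, assuming the 2.5-style dspy.LM client and an OpenAI-compatible vLLM server on port 8000 (the model name, base URL, and key are placeholders, not verified against this exact setup):

import dspy

# Point the 2.5-style client at a local OpenAI-compatible vLLM server.
# The "openai/" prefix routes the request through the OpenAI-compatible path;
# vLLM does not check the API key by default, so any placeholder works.
lm = dspy.LM(
    "openai/meta-llama/Llama-2-7b-hf",
    api_base="http://localhost:8000/v1",
    api_key="local",
)
dspy.configure(lm=lm)
print(lm("Test"))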


brando90 commented Sep 14, 2024

I don't see how that's related to the 400 error. I had the same issue you seem to have: #1041. This issue also seems relevant: #1242.

Darn, embarrassing; apologies. I must admit I've been quite sleep-deprived and commented on the wrong issue.

How do I install 2.5? I can't find it here: https://github.com/stanfordnlp/dspy

Thanks for all the help btw!
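
For what it's worth, the usual options would be upgrading from PyPI once 2.5 is published, or installing straight from main; these commands are a guess at the options, not confirmed release instructions:

pip install -U dspy-ai                                    # once 2.5 is published on PyPI
pip install git+https://github.com/stanfordnlp/dspy.git   # or install directly from main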
