(Fixable?) 400 Error with vLLM API - extra input #1002
I found this issue while looking for solutions to a similar-looking problem I am having with the latest dspy (2.4.9). However, the suggested workaround had no effect in my scenario. You might want to check whether your issue is already present in dspy 2.4.7.
This issue should be solved by #1043.
TL;DR: I've tried the changes proposed in Issue #1025 and PR #1043 and I can confirm that the fix works.

I was switching from ollama to vLLM in my dspy project and ended up having the same problem.

Server:

Code:

```python
model = "meta-llama/Llama-2-7b-hf"
lm = dspy.HFClientVLLM(model=model, port=8081, url="http://localhost")
dspy.configure(lm=lm)
qa = dspy.ChainOfThought('question -> answer')
response = qa(question="What is the capital of Paris?")
print(response.answer)
```

Output:

I've tried to remove the

Hope to see this soon in the new release! Thanks!
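For debugging a setup like the one above, it can help to query the vLLM server directly and confirm what the `/generate` endpoint accepts, independent of dspy. The sketch below is an assumption-based sanity check: it targets the plain `vllm.entrypoints.api_server` on port 8081 (matching the snippet above) and sends only a prompt plus standard sampling parameters; unrecognized extra keys in the request body are a common cause of the kind of rejection this issue describes.

```python
# Minimal sanity check against a running vLLM api_server (assumed at localhost:8081,
# matching the dspy snippet above). Requires the `requests` package.
import requests

payload = {
    "prompt": "What is the capital of France?",
    "max_tokens": 64,    # SamplingParams field
    "temperature": 0.0,  # SamplingParams field
}

resp = requests.post("http://localhost:8081/generate", json=payload, timeout=60)
print(resp.status_code)
print(resp.json())  # the simple api_server typically returns {"text": [...]}
```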
Maybe there should be a new DSPy package release using the latest code in main; the PR that fixes this is already merged.
My dspy version is 2.4.17 and vllm is 0.5.4. What fixes this issue, @tom-doerr?
This installs vllm, but the dspy vllm server fails:

```bash
pip install --upgrade pip
pip uninstall torchvision vllm vllm-flash-attn flash-attn xformers
pip install torch==2.2.1 vllm==0.4.1
```

@tom-doerr any help?
@brando90
@tom-doerr apologies, I no longer have access to the bash session from when I wrote that message. Likely a typo. I confirm I do have 2.4.16 though:

and the vllm version is:

my flash attention doesn't work, FYI:

Thanks for taking the time to respond/help.
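To reproduce the version checks mentioned above (assuming the PyPI distribution names `dspy-ai` and `vllm`, and the `flash_attn` module name), a small check along these lines can be used:

```python
# Print installed versions and check whether flash attention imports.
# Package and module names here are assumptions: dspy-ai, vllm, flash_attn.
from importlib.metadata import PackageNotFoundError, version

for dist in ("dspy-ai", "vllm"):
    try:
        print(dist, version(dist))
    except PackageNotFoundError:
        print(dist, "not installed")

try:
    import flash_attn  # noqa: F401
    print("flash_attn imports fine")
except Exception as exc:  # an import failure here matches the XFormers fallback in the server log below
    print("flash_attn import failed:", exc)
```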
Could you post the error message you are getting?
@tom-doerr happy to help!
vllm server running on the side:

```
(snap_cluster_setup_py311) brando9@skampere1~ $ conda activate uutils
(uutils) brando9@skampere1~ $ python -m vllm.entrypoints.api_server --model meta-llama/Llama-2-7b-hf --port 8080
INFO 09-13 19:04:13 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='meta-llama/Llama-2-7b-hf', speculative_config=None, tokenizer='meta-llama/Llama-2-7b-hf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
INFO 09-13 19:04:13 utils.py:608] Found nccl from library /lfs/skampere1/0/brando9/.config/vllm/nccl/cu12/libnccl.so.2.18.1
INFO 09-13 19:04:14 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.
INFO 09-13 19:04:14 selector.py:33] Using XFormers backend.
INFO 09-13 19:04:15 weight_utils.py:193] Using model weights format ['*.safetensors']
INFO 09-13 19:04:17 model_runner.py:173] Loading model weights took 12.5523 GB
INFO 09-13 19:04:18 gpu_executor.py:119] # GPU blocks: 7406, # CPU blocks: 512
INFO 09-13 19:04:19 model_runner.py:976] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 09-13 19:04:19 model_runner.py:980] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 09-13 19:04:23 model_runner.py:1057] Graph capturing finished in 4 secs.
INFO: Started server process [348702]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
```
I don't see how that's related to the 400 error. I had the same issue you seem to have: #1041. I don't have a solution for you. You could switch to a different model, check whether someone else has posted a solution to your problem (there are quite a few issues related to this), or switch to the experimental new DSPy 2.5, which has a new backend.
Darn, embarrassing, apologies. I must admit I've been quite sleep-deprived and commented on the wrong issue. How do I install 2.5? I can't find it at https://github.com/stanfordnlp/dspy. Thanks for all the help, btw!
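If 2.5 is not yet available as a tagged release, one option is installing straight from the GitHub main branch and pointing the new-style client at an OpenAI-compatible vLLM server. The sketch below is only an assumption about that workflow: the `dspy.LM` constructor arguments, the `openai/` model prefix, and the local URL and port are illustrative and may differ in your setup.

```python
# Hypothetical DSPy 2.5-style setup; not confirmed against this thread.
# Install from main if no tagged release exists, e.g.:
#   pip install git+https://github.com/stanfordnlp/dspy.git
# Assumes a vLLM OpenAI-compatible server is running at localhost:8000.
import dspy

lm = dspy.LM(
    "openai/meta-llama/Llama-2-7b-hf",    # "openai/" prefix -> OpenAI-compatible endpoint
    api_base="http://localhost:8000/v1",  # vLLM's OpenAI-compatible route
    api_key="EMPTY",                      # vLLM usually accepts any placeholder key
)
dspy.configure(lm=lm)
print(lm("Say hello in one word."))
```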
Howdy. It seems that when running a vLLM server and then attempting to interact with it via `HFClientVLLM`, I get an error message. Here is how to reproduce:

If you only have one computer, this gives the same output:

This gives me an output of

Going into `dsp/modules/hf_client.py`, I tried commenting out this line:

Now, when I run

```bash
python -c 'import dspy;lm = dspy.HFClientVLLM(model="facebook/opt-125m", port=8000, url="http://localhost", seed=42);dspy.configure(lm=lm);print(lm("Test"))'
```

it returns

yay!
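The "extra input" wording in the title suggests the request body contained a field the vLLM `/generate` endpoint refuses. Below is a schematic of that kind of client-side workaround, filtering the payload down to accepted keys before posting; the function, the allowed-key list, and the key names are illustrative, not the actual `dsp/modules/hf_client.py` code or the PR #1043 diff.

```python
# Illustrative only: not the real HFClientVLLM implementation.
# Idea: keep only keys the /generate endpoint is expected to accept, so stray
# client-side settings (model name, seed, port, ...) never reach the server.
import requests

ALLOWED_KEYS = {"prompt", "max_tokens", "temperature", "top_p", "n", "stop"}  # assumed list

def vllm_generate(url: str, port: int, **kwargs) -> dict:
    payload = {k: v for k, v in kwargs.items() if k in ALLOWED_KEYS}
    resp = requests.post(f"{url}:{port}/generate", json=payload, timeout=60)
    resp.raise_for_status()  # a 400 here means the server still rejected the body
    return resp.json()

# Extra settings such as `model` or `seed` are dropped from the HTTP payload.
out = vllm_generate(
    "http://localhost", 8000,
    prompt="Test", max_tokens=32, temperature=0.0,
    model="facebook/opt-125m", seed=42,
)
print(out)
```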