Description
Describe the bug
I'm working from the understanding that LLaVA hosted behind an OpenAI-compatible proxy like LiteLLM, and GPT-4V hosted in Azure or OpenAI, are both valid options for the MultimodalConversableAgent. My agent workflow works correctly when I point the vision agent at GPT-4V, but I get errors when I switch the llm_config to the locally hosted LLaVA config.
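For reference, the two llm_configs look roughly like this (simplified; the base_url/port reflect my local LiteLLM setup and the API-key values are placeholders):

```python
import os

# GPT-4V config -- this one works
gpt4v_llm_config = {
    "config_list": [{
        "model": "gpt-4-vision-preview",
        "api_key": os.environ["OPENAI_API_KEY"],
    }],
    "temperature": 0.5,
}

# LLaVA-via-LiteLLM config -- this one fails; the proxy exposes an
# OpenAI-compatible endpoint on localhost (port from my local setup)
llava_llm_config = {
    "config_list": [{
        "model": "ollama_chat/llava",
        "base_url": "http://localhost:8000",
        "api_key": "not-needed",  # the local proxy doesn't validate keys
    }],
    "temperature": 0.5,
}
```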
When I switch to LLaVA (hosted via LiteLLM with `litellm --model ollama_chat/llava --run_gunicorn`), I get:
```
Traceback (most recent call last):
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/proxy/proxy_server.py", line 3671, in chat_completion
    responses = await asyncio.gather(
                ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/utils.py", line 3465, in wrapper_async
    raise e
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/utils.py", line 3297, in wrapper_async
    result = await original_function(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/main.py", line 340, in acompletion
    raise exception_type(
          ^^^^^^^^^^^^^^^
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/utils.py", line 8665, in exception_type
    raise e
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/utils.py", line 8633, in exception_type
    raise APIConnectionError(
litellm.exceptions.APIConnectionError: {"error":"json: cannot unmarshal array into Go struct field Message.messages.content of type string"}
```
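My reading of that error (an assumption from the message, not a confirmed trace): AutoGen sends the OpenAI vision format, where `content` is an array of parts, but Ollama's `/api/chat` endpoint defines `content` as a plain string with images in a separate field, so the Go-side JSON unmarshal fails. Roughly:

```python
# OpenAI vision format -- what MultimodalConversableAgent appears to send
openai_style_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "These are the frames of a video. ..."},
        {"type": "image_url",
         "image_url": {"url": "data:image/jpeg;base64,<encoded frame>"}},
        # ...one image_url part per <img> tag
    ],
}

# Ollama /api/chat format -- content must be a string, images go separately
ollama_chat_message = {
    "role": "user",
    "content": "These are the frames of a video. ...",
    "images": ["<base64 frame 0>", "<base64 frame 1>"],
}
```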
If I start the Ollama model without `_chat`, i.e. `litellm --model ollama/llava --run_gunicorn`, I get:
```
Traceback (most recent call last):
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/proxy/proxy_server.py", line 3671, in chat_completion
    responses = await asyncio.gather(
                ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/utils.py", line 3465, in wrapper_async
    raise e
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/utils.py", line 3297, in wrapper_async
    result = await original_function(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/main.py", line 340, in acompletion
    raise exception_type(
          ^^^^^^^^^^^^^^^
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/utils.py", line 8665, in exception_type
    raise e
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/utils.py", line 8633, in exception_type
    raise APIConnectionError(
litellm.exceptions.APIConnectionError: {"error":"illegal base64 data at input byte 4"}
```
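One plausible reading here (again an assumption, not verified): AutoGen encodes local images as data URIs (`data:image/jpeg;base64,...`), and if that whole string reaches Ollama's base64 decoder, byte 4 is the `:` after `data`, which is not a legal base64 character. A hypothetical helper to show the difference:

```python
import base64

def to_raw_base64(data_uri_or_path: str) -> str:
    """Illustrative helper (not part of AutoGen or LiteLLM): return the bare
    base64 payload whether given a data URI or a local file path."""
    if data_uri_or_path.startswith("data:"):
        # Drop the 'data:image/jpeg;base64,' prefix; keep only the payload
        return data_uri_or_path.split(",", 1)[1]
    with open(data_uri_or_path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")
```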
One thing to note is that I'm including a list of frames via the prompt:
```python
prompt = """
context: camera location = "front yard", time = "10:00 AM", date = "March 15, 2022"
These are the frames of a video. Generate a compelling description that the SecurityAnalysisAgent can evaluate.
<img frames/frame0.jpg>
<img frames/frame1.jpg>
<img frames/frame2.jpg>
<img frames/frame3.jpg>
<img frames/frame4.jpg>
<img frames/frame5.jpg>
<img frames/frame6.jpg>
<img frames/frame7.jpg>
<img frames/frame8.jpg>
<img frames/frame9.jpg>
"""
```
It seems that LiteLLM isn't handling the list of images correctly. Is the inclusion of multiple frames part of the OpenAI spec, or is the MultimodalConversableAgent not constructing the request payload correctly?
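For what it's worth, the OpenAI vision API does accept multiple image_url parts in a single content array, so the request shape itself should be legal. To check whether the problem is AutoGen's request or the proxy's translation, the same multi-image request can be sent straight to the LiteLLM endpoint (a sketch; assumes the proxy is on localhost:8000):

```python
import base64
import requests

def img_part(path: str) -> dict:
    """Build an OpenAI-style image_url content part from a local file."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return {"type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

payload = {
    "model": "ollama_chat/llava",
    "messages": [{
        "role": "user",
        "content": [{"type": "text", "text": "Describe these video frames."}]
                   + [img_part(f"frames/frame{i}.jpg") for i in range(10)],
    }],
}

resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(resp.status_code, resp.text)
```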
Steps to reproduce
- Point an agent at GPT-4V with a series of frames from a video and ask for a description of the video. The agent returns a valid description.
- Change that agent's llm_config to point to a locally hosted LLaVA vision model, using Ollama as the backend and LiteLLM as the proxy (minimal sketch below). The errors above are returned.
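A minimal sketch of the repro (trimmed from my workflow; config values are placeholders for my local setup):

```python
from autogen import UserProxyAgent
from autogen.agentchat.contrib.multimodal_conversable_agent import (
    MultimodalConversableAgent,
)

# Swap this between the GPT-4V config and the LiteLLM/LLaVA config above
config_list = [{
    "model": "ollama_chat/llava",
    "base_url": "http://localhost:8000",
    "api_key": "not-needed",
}]

vision_agent = MultimodalConversableAgent(
    name="vision_agent",
    llm_config={"config_list": config_list, "temperature": 0.5},
)
user_proxy = UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=0,
    code_execution_config=False,
)
user_proxy.initiate_chat(
    vision_agent,
    message="Describe this video. <img frames/frame0.jpg> <img frames/frame1.jpg>",
)
```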
Model Used
GPT-4V & LLaVA 1.6
Expected Behavior
I was expecting to be able to treat the GPT-4V and LLaVA llm_configs as interchangeable, differing only in response quality, performance, and cost.
Screenshots and logs
No response
Additional Information
Latest AutoGen version, both macOS and Windows, Python 3.12.