Description
Describe the bug
I'm working from the understanding that LLaVA hosted behind an OpenAI-compatible proxy like LiteLLM, and GPT-4V hosted in Azure or OpenAI, are both valid options for the MultimodalConversableAgent. My agent workflow works correctly when I point the vision agent at GPT-4V, but I get errors when I switch the llm_config to the locally hosted LLaVA config.
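For reference, the two llm_configs look roughly like this (simplified; the base_url/port reflect my local LiteLLM setup and the API-key values are placeholders):

```python
import os

# GPT-4V config -- this one works
gpt4v_llm_config = {
    "config_list": [{
        "model": "gpt-4-vision-preview",
        "api_key": os.environ["OPENAI_API_KEY"],
    }],
    "temperature": 0.5,
}

# LLaVA-via-LiteLLM config -- this one fails; the proxy exposes an
# OpenAI-compatible endpoint on localhost (port from my local setup)
llava_llm_config = {
    "config_list": [{
        "model": "ollama_chat/llava",
        "base_url": "http://localhost:8000",
        "api_key": "not-needed",  # the local proxy doesn't validate keys
    }],
    "temperature": 0.5,
}
```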
When I switch to LLaVA (hosted via LiteLLM with `litellm --model ollama_chat/llava --run_gunicorn`), I get:
```
Traceback (most recent call last):
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/proxy/proxy_server.py", line 3671, in chat_completion
    responses = await asyncio.gather(
                ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/utils.py", line 3465, in wrapper_async
    raise e
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/utils.py", line 3297, in wrapper_async
    result = await original_function(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/main.py", line 340, in acompletion
    raise exception_type(
          ^^^^^^^^^^^^^^^
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/utils.py", line 8665, in exception_type
    raise e
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/utils.py", line 8633, in exception_type
    raise APIConnectionError(
litellm.exceptions.APIConnectionError: {"error":"json: cannot unmarshal array into Go struct field Message.messages.content of type string"}
```
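My reading of that error (an assumption from the message, not a confirmed trace): AutoGen sends the OpenAI vision format, where `content` is an array of parts, but Ollama's `/api/chat` endpoint defines `content` as a plain string with images in a separate field, so the Go-side JSON unmarshal fails. Roughly:

```python
# OpenAI vision format -- what MultimodalConversableAgent appears to send
openai_style_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "These are the frames of a video. ..."},
        {"type": "image_url",
         "image_url": {"url": "data:image/jpeg;base64,<encoded frame>"}},
        # ...one image_url part per <img> tag
    ],
}

# Ollama /api/chat format -- content must be a string, images go separately
ollama_chat_message = {
    "role": "user",
    "content": "These are the frames of a video. ...",
    "images": ["<base64 frame 0>", "<base64 frame 1>"],
}
```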
If I start the Ollama model without `_chat`, i.e. `litellm --model ollama/llava --run_gunicorn`, I get:
```
Traceback (most recent call last):
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/proxy/proxy_server.py", line 3671, in chat_completion
    responses = await asyncio.gather(
                ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/utils.py", line 3465, in wrapper_async
    raise e
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/utils.py", line 3297, in wrapper_async
    result = await original_function(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/main.py", line 340, in acompletion
    raise exception_type(
          ^^^^^^^^^^^^^^^
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/utils.py", line 8665, in exception_type
    raise e
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/utils.py", line 8633, in exception_type
    raise APIConnectionError(
litellm.exceptions.APIConnectionError: {"error":"illegal base64 data at input byte 4"}
```
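One plausible reading here (again an assumption, not verified): AutoGen encodes local images as data URIs (`data:image/jpeg;base64,...`), and if that whole string reaches Ollama's base64 decoder, byte 4 is the `:` after `data`, which is not a legal base64 character. A hypothetical helper to show the difference:

```python
import base64

def to_raw_base64(data_uri_or_path: str) -> str:
    """Illustrative helper (not part of AutoGen or LiteLLM): return the bare
    base64 payload whether given a data URI or a local file path."""
    if data_uri_or_path.startswith("data:"):
        # Drop the 'data:image/jpeg;base64,' prefix; keep only the payload
        return data_uri_or_path.split(",", 1)[1]
    with open(data_uri_or_path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")
```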
One thing to note is that I'm including a list of frames via the prompt:
```python
prompt = """
context: camera location = "front yard", time = "10:00 AM", date = "March 15, 2022"
These are the frames of a video. Generate a compelling description that the SecurityAnalysisAgent can evaluate.
<img frames/frame0.jpg>
<img frames/frame1.jpg>
<img frames/frame2.jpg>
<img frames/frame3.jpg>
<img frames/frame4.jpg>
<img frames/frame5.jpg>
<img frames/frame6.jpg>
<img frames/frame7.jpg>
<img frames/frame8.jpg>
<img frames/frame9.jpg>
"""
```
It seems that LiteLLM isn't handling the list of images correctly. Is the inclusion of multiple frames part of the OpenAI spec, or is the MultimodalConversableAgent not constructing the request payload correctly?
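For what it's worth, the OpenAI vision API does accept multiple image_url parts in a single content array, so the request shape itself should be legal. To check whether the problem is AutoGen's request or the proxy's translation, the same multi-image request can be sent straight to the LiteLLM endpoint (a sketch; assumes the proxy is on localhost:8000):

```python
import base64
import requests

def img_part(path: str) -> dict:
    """Build an OpenAI-style image_url content part from a local file."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return {"type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

payload = {
    "model": "ollama_chat/llava",
    "messages": [{
        "role": "user",
        "content": [{"type": "text", "text": "Describe these video frames."}]
                   + [img_part(f"frames/frame{i}.jpg") for i in range(10)],
    }],
}

resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(resp.status_code, resp.text)
```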
Steps to reproduce
- Point an agent at GPT-4V with a series of frames from a video and ask for a description of the video. The agent returns a valid description.
- Change that agent's llm_config to point to a locally hosted LLaVA vision model, using Ollama as the backend and LiteLLM as the proxy (minimal sketch below). The errors above are returned.
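A minimal sketch of the repro (trimmed from my workflow; config values are placeholders for my local setup):

```python
from autogen import UserProxyAgent
from autogen.agentchat.contrib.multimodal_conversable_agent import (
    MultimodalConversableAgent,
)

# Swap this between the GPT-4V config and the LiteLLM/LLaVA config above
config_list = [{
    "model": "ollama_chat/llava",
    "base_url": "http://localhost:8000",
    "api_key": "not-needed",
}]

vision_agent = MultimodalConversableAgent(
    name="vision_agent",
    llm_config={"config_list": config_list, "temperature": 0.5},
)
user_proxy = UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=0,
    code_execution_config=False,
)
user_proxy.initiate_chat(
    vision_agent,
    message="Describe this video. <img frames/frame0.jpg> <img frames/frame1.jpg>",
)
```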
Model Used
GPT-4V & LLaVA 1.6
Expected Behavior
I was expecting to be able to treat the GPT-4V and LLaVA llm_configs as interchangeable, differing only in response quality, performance, and cost.
Screenshots and logs
No response
Additional Information
Latest AutoGen version, both macOS and Windows, Python 3.12.