
[Feature Request]: Connect to the HuggingFace Hub to achieve a multimodal capability  #2577

Closed as not planned
@whiskyboy

Description


Is your feature request related to a problem? Please describe.

The HuggingFace Hub provides an elegant Python client that gives users access to 100,000+ Hugging Face models and lets them run inference on those models for a variety of multimodal tasks, such as image-to-text and text-to-speech. By connecting to this hub, a text-only LLM like gpt-3.5-turbo could also gain the multimodal capability to handle images, video, audio, and documents in a cost-efficient way.

However, some additional coding work is still needed to let an autogen agent interact with a huggingface-hub client, such as wrapping the client methods into functions, parsing the different input/output types, and managing model deployment. That's why I'm asking whether autogen could provide an out-of-the-box solution for this connection.
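To make the "wrapping the client method into a function" point concrete, here is a minimal offline sketch of that pattern. The `StubInferenceClient` is a stand-in I made up for `huggingface_hub.InferenceClient` (which in reality calls the hosted Inference API and needs a token), so the example runs without network access; the wrapper shape is the part that matters.

```python
from typing import Callable, Dict

class StubInferenceClient:
    """Hypothetical stand-in for huggingface_hub.InferenceClient.

    The real client would send the request to a hosted model; here we
    return a canned caption so the sketch is self-contained.
    """
    def image_to_text(self, image_url: str) -> str:
        return f"a caption for {image_url}"

def make_tool(client: StubInferenceClient, task: str) -> Callable[[str], str]:
    """Wrap a client method into a plain function an agent could register."""
    method = getattr(client, task)

    def tool(inp: str) -> str:
        # A real implementation would parse/normalize the input here
        # (URL vs. local path vs. raw bytes) per task.
        return method(inp)

    tool.__name__ = task
    return tool

# A tiny "toolkit": task name -> callable, ready to hand to an agent's
# function-registration mechanism.
toolkit: Dict[str, Callable[[str], str]] = {
    task: make_tool(StubInferenceClient(), task) for task in ["image_to_text"]
}
print(toolkit["image_to_text"]("https://example.com/cat.png"))
```

With the real client, each wrapped function would also carry a docstring/schema so the LLM knows when to call it.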

Other similar works: JARVIS, Transformers Agent

Describe the solution you'd like

  1. The simplest and most straightforward way is to provide a huggingface-hub toolkit with inference functions. Users could then easily register this toolkit with an autogen agent according to their requirements. However, it's not clear to me where this toolkit would best live.
  2. The second approach is to provide a huggingface_agent, like Transformers Agent. This agent would essentially consist of an assistant agent paired with a user-proxy agent, both registered with the huggingface-hub toolkit. Users could then access its multimodal capabilities seamlessly, without manually registering toolkits for execution.
  3. The third approach is to create a multimodal capability and add it to any given agent by hooking the process_last_received_message method. However, this may not be straightforward for some tasks, such as text-to-image.
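The third approach can be sketched as a hook on incoming messages. The `Agent` class below is a minimal stand-in I wrote for an autogen conversable agent's hook mechanism (the real `process_last_received_message` hookup differs); the point is how a capability object rewrites image references into text before the LLM sees them.

```python
from typing import Callable, List

class Agent:
    """Minimal stand-in for a conversable agent with message hooks."""
    def __init__(self) -> None:
        self._hooks: List[Callable[[str], str]] = []

    def register_hook(self, hook: Callable[[str], str]) -> None:
        self._hooks.append(hook)

    def process_last_received_message(self, message: str) -> str:
        # Each registered hook may rewrite the message in turn.
        for hook in self._hooks:
            message = hook(message)
        return message

class MultimodalCapability:
    """Replaces image references in a message with text captions."""
    def __init__(self, caption_fn: Callable[[str], str]) -> None:
        self.caption_fn = caption_fn

    def add_to_agent(self, agent: Agent) -> None:
        agent.register_hook(self._describe_images)

    def _describe_images(self, message: str) -> str:
        # Toy detection rule for the sketch: a real capability would
        # detect image URLs/attachments properly.
        if message.endswith(".png"):
            return self.caption_fn(message)
        return message

agent = Agent()
MultimodalCapability(lambda url: f"<caption of {url}>").add_to_agent(agent)
print(agent.process_last_received_message("photo.png"))
```

This also shows why text-to-image is awkward under this approach: the hook transforms *incoming* messages, whereas text-to-image needs to act on the agent's *output*.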

Additional context

I'd like to hear your suggestions, and I'm happy to contribute in whichever direction is preferred.


Labels

0.2 (issues related to the pre-0.4 codebase), multimodal (language + vision, speech etc.), needs-triage
