Search before asking
- I have searched the Multimodal Maestro issues and found no similar feature requests.
Description
As far as I know, Qwen2.5-VL is the first open-source multimodal model that can output bounding boxes for objects in an image.
For an example, see the spatial understanding cookbook: https://github.com/QwenLM/Qwen2.5-VL/blob/main/cookbooks/spatial_understanding.ipynb
It would be great for Maestro to support this capability, ideally in a way that lets other models with bounding box output be supported as well.
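To make the request more concrete, here is a rough sketch of how bounding box output could be parsed into a model-agnostic structure. It assumes the model replies with a JSON list of objects that have `bbox_2d` (`[x1, y1, x2, y2]`) and `label` keys, optionally wrapped in a Markdown `json` fence, which is my understanding of the cookbook's format. `Box` and `parse_boxes` are hypothetical names for illustration, not part of Maestro or the Qwen API.

```python
import json
import re
from dataclasses import dataclass

FENCE = "`" * 3  # Markdown code fence the model may wrap its JSON in


@dataclass
class Box:
    """Model-agnostic bounding box (hypothetical structure)."""
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


def parse_boxes(response: str) -> list[Box]:
    """Parse a response assumed to hold a JSON list of bbox_2d/label items."""
    # Use the contents of a fenced block if present, else the raw response.
    match = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, response, re.DOTALL)
    payload = match.group(1) if match else response
    boxes = []
    for item in json.loads(payload):
        x1, y1, x2, y2 = item["bbox_2d"]
        boxes.append(Box(item.get("label", ""), x1, y1, x2, y2))
    return boxes


# Made-up response for illustration, not real model output.
example = FENCE + 'json\n[{"bbox_2d": [10, 20, 110, 220], "label": "button"}]\n' + FENCE
boxes = parse_boxes(example)
```

Keeping the parsed result in a plain structure like this, separate from any one model's output format, is one way other models could plug in later with their own parsers.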
Use case
We would use this for generative process automation in https://github.com/OpenAdaptAI/OpenAdapt
Additional
No response
Are you willing to submit a PR?
- Yes I'd like to help by submitting a PR!