Awesome Multimodal Assistant is a curated list of multimodal chatbots and conversational assistants that interact through text, speech, images, and video. These assistants help users with tasks ranging from simple information retrieval to complex multimedia reasoning.
- MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning
  arXiv 2022/12
  [paper]
- GPT-4
-
  arXiv 2023/04
  [paper] [code] [project page] [demo]
- MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
  arXiv 2023/04
  [paper] [code] [project page] [demo]
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
- LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
- Video-LLaMA: An Instruction-Finetuned Visual Language Model for Video Understanding
- LMEye: An Interactive Perception Network for Large Language Models
- MultiModal-GPT: A Vision and Language Model for Dialogue with Humans
- X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages
  arXiv 2023/05
  [paper] [code] [project page]
- Otter: A Multi-Modal Model with In-Context Instruction Tuning
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
- InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language
- VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks
- Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models
-
  arXiv 2023/05
  [paper] [code] [project page]
- EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought
  arXiv 2023/05
  [paper] [code] [project page]
- DetGPT: Detect What You Need via Reasoning
  arXiv 2023/05
  [paper] [code] [project page]
- PathAsst: Redefining Pathology through Generative Foundation AI Assistant for Pathology
- ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst
  arXiv 2023/05
  [paper] [code] [project page]
- Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
- LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark
  arXiv 2023/06
  [paper]
- Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation
  arXiv 2023/06
  [paper] [project page]
- Valley: Video Assistant with Large Language Model Enhanced Ability
- Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
- ViperGPT: Visual Inference via Python Execution for Reasoning
  arXiv 2023/03
  [paper] [code] [project page]
- TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs
- ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions
- MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
  arXiv 2023/03
  [paper] [code] [project page] [demo]
- HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace
- Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions
- ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System
  arXiv 2023/04
  [paper] [project page]