Mobile-Agent: The Powerful Mobile Device Operation Assistant Family
ModelScope-Agent: An agent framework connecting models in ModelScope with the world
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills
✨✨Woodpecker: Hallucination Correction for Multimodal Large Language Models. The first work to correct hallucinations in MLLMs.
[CVPR 2024] 🎬💭 chat with over 10K frames of video!
Speech, Language, Audio, Music Processing with Large Language Model
A Gradio demo of MGIE
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Pre-training Dataset and Benchmarks
EVE: Encoder-Free Vision-Language Models
Official code of "EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model"
[Paper][ACM MM 2024] Making Large Language Models Perform Better in Knowledge Graph Completion
🔥🔥MLVU: Multi-task Long Video Understanding Benchmark
A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, qwen-vl, phi3-v, etc. (see the sketch after this list)
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception
This repo contains evaluation code for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive". https://arxiv.org/abs/2404.12390 [ECCV 2024]
A novel Multimodal Large Language Model (MLLM) architecture designed to structurally align visual and textual embeddings.
Explore the Limits of Omni-modal Pretraining at Scale
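
For the finetuning codebase listed above, none of its actual entry points are shown on this page. As a rough illustration of what a single finetuning step on one of its supported models (llava-1.5) looks like, here is a minimal sketch using the Hugging Face transformers LLaVA classes; the checkpoint id, the dummy sample, and the use of transformers rather than that repo's own scripts are all assumptions.

```python
# A minimal single-step finetuning sketch, assuming the Hugging Face transformers
# LLaVA classes and the public "llava-hf/llava-1.5-7b-hf" checkpoint, not the
# listed repo's own training scripts.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint id
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

# Dummy sample; a real run would iterate over an image-text dataset.
image = Image.new("RGB", (336, 336))
prompt = "USER: <image>\nDescribe the image. ASSISTANT: A plain black square."

inputs = processor(images=image, text=prompt, return_tensors="pt")
# Supervise on the full token sequence; a real recipe would mask the prompt
# tokens to -100 so the loss covers only the assistant response.
outputs = model(**inputs, labels=inputs["input_ids"])
outputs.loss.backward()

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
optimizer.step()
optimizer.zero_grad()
```

In practice such finetuning codebases wrap this loop with prompt masking, gradient accumulation, and parameter-efficient adapters such as LoRA.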