Image Textualization: An Automatic Framework for Generating Rich and Detailed Image Descriptions
InternLM-XComposer2 is a groundbreaking vision-language large model (VLLM) excelling in free-form text-image composition and comprehension.
Official code for the paper "Mantis: Multi-Image Instruction Tuning"
Demo code for fine-tuning multimodal large language models with LLaMA-Factory
Mobile-Agent: The Powerful Mobile Device Operation Assistant Family
Grounded Multimodal Large Language Model with Localized Visual Tokenization
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
Official implementation of Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
Evaluation framework for paper "VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?"
Large Chinese Language-and-Vision Assistant for BioMedicine (Chinese medical multimodal large model)
[CVPR2024] The code for "Osprey: Pixel Understanding with Visual Instruction Tuning"
[CVPR2024] Generative Region-Language Pretraining for Open-Ended Object Detection
A collection of visual instruction tuning datasets.
MIKO: Multimodal Intention Knowledge Distillation from Large Language Models for Social-Media Commonsense Discovery