[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4V. A commercially usable open-source multimodal dialogue model approaching GPT-4V performance.
This is the official implementation of the paper "Needle In A Multimodal Haystack"
[ACLW'24] LMPT: Prompt Tuning with Class-Specific Embedding Loss for Long-tailed Multi-Label Visual Recognition
Seamlessly integrate state-of-the-art transformer models into robotics stacks
[ICML 2024] Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models.
VLSM-Adapter: Finetuning Vision-Language Segmentation Efficiently with Lightweight Blocks
[CVPR 2024] Situational Awareness Matters in 3D Vision Language Reasoning
[CVPR 2024 Highlight] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
MLX-VLM is a package for running Vision LLMs locally on your Mac using MLX.
A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.
The Cradle framework is a first attempt at General Computer Control (GCC). Cradle enables agents to tackle any computer task through strong reasoning abilities, self-improvement, and skill curation, in a standardized general environment with minimal requirements.
InternLM-XComposer2 is a groundbreaking vision-language large model (VLLM) excelling in free-form text-image composition and comprehension.
Evaluating text-to-image/video/3D models with VQAScore
[CVPR 2024] Official PyTorch Code for "PromptKD: Unsupervised Prompt Distillation for Vision-Language Models"
This repository contains the implementation for the paper "Revisiting Few Shot Object Detection with Vision-Language Models"
Vision Language Model fine-tuning for TIL AI BrainHack - Advanced Track
An open-source implementation of LLaVA-NeXT.
FreeVA: Offline MLLM as Training-Free Video Assistant
Grounded Multimodal Large Language Model with Localized Visual Tokenization
An official implementation of ShareGPT4V: Improving Large Multi-modal Models with Better Captions