[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
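For context, a minimal zero-shot detection sketch following the inference helpers described in the repo's README (`load_model`, `load_image`, `predict`, `annotate`); the config and checkpoint paths below are assumptions and depend on where the downloaded weights are placed:

```python
# Sketch: text-prompted detection with Grounding DINO.
# Assumes the repo is installed and the SwinT checkpoint has been downloaded locally.
import cv2
from groundingdino.util.inference import load_model, load_image, predict, annotate

model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",  # config shipped with the repo
    "weights/groundingdino_swint_ogc.pth",               # assumed local checkpoint path
)

image_source, image = load_image("example.jpg")          # assumed input image

# Categories are given as a free-form text prompt, phrases separated by " . "
boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="person . dog . chair .",
    box_threshold=0.35,
    text_threshold=0.25,
)

annotated = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)
cv2.imwrite("annotated.jpg", annotated)
```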
PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
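As a quick illustration, an image-captioning sketch using the Hugging Face `transformers` port of BLIP rather than this repo's own training code; the checkpoint name and image path are assumptions:

```python
# Sketch: image captioning with the transformers port of BLIP (not this repo's native API).
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("example.jpg").convert("RGB")   # assumed local image
inputs = processor(images=image, return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```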
A Chinese version of CLIP that supports Chinese cross-modal retrieval and representation generation.
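A small retrieval-style sketch using the Hugging Face `transformers` port (`ChineseCLIPModel`) rather than the repo's own scripts; the checkpoint name, image, and candidate captions are assumptions:

```python
# Sketch: scoring Chinese captions against an image with the transformers port of Chinese-CLIP.
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

name = "OFA-Sys/chinese-clip-vit-base-patch16"      # assumed checkpoint
model = ChineseCLIPModel.from_pretrained(name)
processor = ChineseCLIPProcessor.from_pretrained(name)

image = Image.open("example.jpg").convert("RGB")    # assumed local image
texts = ["一只猫", "一只狗", "一辆汽车"]              # candidate Chinese captions

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)    # image-to-text similarity scores
print(dict(zip(texts, probs[0].tolist())))
```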
A unified embedding generation and search engine. Also available as a managed cloud service at cloud.marqo.ai.
Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversations about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous quantitative evaluation benchmark for video-based conversational models.
日本語LLMまとめ - Overview of Japanese LLMs
[ECCV 2024 Oral] DriveLM: Driving with Graph Visual Question Answering
A general representation model across vision, audio, and language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Pix2Seq codebase: multi-tasks with generative modeling (autoregressive and diffusion)
🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3)
A Framework of Small-scale Large Multimodal Models
[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
An open-source implementation for fine-tuning the Qwen2-VL and Qwen2.5-VL series by Alibaba Cloud.
[ICLR 2024] Controlling Vision-Language Models for Universal Image Restoration. 5th place in the NTIRE 2024 Restore Any Image Model in the Wild Challenge.
A third-party implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection."
Official implementation of SEED-LLaMA (ICLR 2024).
CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
CLIPort: What and Where Pathways for Robotic Manipulation