[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
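For context, a minimal zero-shot detection sketch following the inference helpers described in the repo's README (`load_model`, `load_image`, `predict`, `annotate`); the config and checkpoint paths below are assumptions and depend on where the downloaded weights are placed:

```python
# Sketch: text-prompted detection with Grounding DINO.
# Assumes the repo is installed and the SwinT checkpoint has been downloaded locally.
import cv2
from groundingdino.util.inference import load_model, load_image, predict, annotate

model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",  # config shipped with the repo
    "weights/groundingdino_swint_ogc.pth",               # assumed local checkpoint path
)

image_source, image = load_image("example.jpg")          # assumed input image

# Categories are given as a free-form text prompt, phrases separated by " . "
boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="person . dog . chair .",
    box_threshold=0.35,
    text_threshold=0.25,
)

annotated = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)
cv2.imwrite("annotated.jpg", annotated)
```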
PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
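As a quick illustration, an image-captioning sketch using the Hugging Face `transformers` port of BLIP rather than this repo's own training code; the checkpoint name and image path are assumptions:

```python
# Sketch: image captioning with the transformers port of BLIP (not this repo's native API).
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("example.jpg").convert("RGB")   # assumed local image
inputs = processor(images=image, return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```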
A Chinese version of CLIP that supports Chinese cross-modal retrieval and representation generation.
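A small retrieval-style sketch using the Hugging Face `transformers` port (`ChineseCLIPModel`) rather than the repo's own scripts; the checkpoint name, image, and candidate captions are assumptions:

```python
# Sketch: scoring Chinese captions against an image with the transformers port of Chinese-CLIP.
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

name = "OFA-Sys/chinese-clip-vit-base-patch16"      # assumed checkpoint
model = ChineseCLIPModel.from_pretrained(name)
processor = ChineseCLIPProcessor.from_pretrained(name)

image = Image.open("example.jpg").convert("RGB")    # assumed local image
texts = ["一只猫", "一只狗", "一辆汽车"]              # candidate Chinese captions

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)    # image-to-text similarity scores
print(dict(zip(texts, probs[0].tolist())))
```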
A unified embedding generation and search engine. Also available as a managed cloud service at cloud.marqo.ai.
Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversations about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous quantitative evaluation benchmark for video-based conversational models.
日本語LLMまとめ - Overview of Japanese LLMs
[ECCV 2024 Oral] DriveLM: Driving with Graph Visual Question Answering
A general representation model across vision, audio, and language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Pix2Seq codebase: multi-tasks with generative modeling (autoregressive and diffusion)
🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3)
A Framework of Small-scale Large Multimodal Models
[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
An open-source implementation for fine-tuning the Qwen2-VL and Qwen2.5-VL series by Alibaba Cloud.
[ICLR 2024] Controlling Vision-Language Models for Universal Image Restoration. 5th place in the NTIRE 2024 Restore Any Image Model in the Wild Challenge.
A third-party implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection."
Official implementation of SEED-LLaMA (ICLR 2024).
CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
CLIPort: What and Where Pathways for Robotic Manipulation