Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai
Deep Learning for Computer Vision (深度學習於電腦視覺) by Frank Wang (王鈺強)
CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
A general representation model across vision, audio, language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
A Framework of Small-scale Large Multimodal Models
A third-party implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection".
Official repository of the paper "VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding"
A new multi-shot video understanding benchmark Shot2Story with comprehensive video summaries and detailed shot-level captions.
[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.
Code release for Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning
[CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
[ICLR'24] Official code for "C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion"
With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning. ICCV 2023
Official repository of the paper "Learning to Prompt with Text Only Supervision for Vision-Language Models".
Official PyTorch implementation and benchmark dataset for IGARSS 2024 ORAL paper: "Composed Image Retrieval for Remote Sensing"
Vision Language Dataset Construction Library for Remote Sensing Domain
Official code of the paper ORacle: Large Vision-Language Models for Knowledge-Guided Holistic OR Domain Modeling accepted at MICCAI 2024.
[CVPR'24 Highlight] SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection
Python scripts for captioning images with vision-language models (VLMs)