The code used to train and run inference with the ColPali architecture.
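For the ColPali entry above, here is a minimal retrieval sketch based on the colpali-engine package's published quickstart; the class names, the vidore/colpali-v1.2 checkpoint, and the method signatures reflect recent versions of the library and may differ in yours.

```python
import torch
from PIL import Image
from colpali_engine.models import ColPali, ColPaliProcessor

model_name = "vidore/colpali-v1.2"  # pretrained checkpoint on the Hugging Face Hub

# Load the model and its processor (bfloat16 on GPU; use "cpu" if no CUDA device).
model = ColPali.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda:0"
).eval()
processor = ColPaliProcessor.from_pretrained(model_name)

# Toy inputs: one blank page image and one text query.
images = [Image.new("RGB", (448, 448), color="white")]
queries = ["Which section discusses document retrieval?"]

# Embed document pages and queries into multi-vector representations.
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Late-interaction (MaxSim) scoring: one score per (query, page) pair.
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
print(scores)
```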
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4V. A commercially usable open-source multimodal dialogue model approaching GPT-4V performance.
Index your memes by their content and text, making them easily retrievable for your meme warfare pleasures. Find funny fast.
[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
A Survey on Vision-Language Geo-Foundation Models (VLGFMs)
Seamlessly integrate state-of-the-art transformer models into robotics stacks
A multi-modal chatbot built on OpenAI models.
TextSnap: Demo of the Florence-2 model used in OCR tasks to extract and visualize text from images.
Vision Document Retrieval (ViDoRe): Benchmark 👀. Evaluation code for the "ColPali: Efficient Document Retrieval with Vision Language Models" paper.
A curated list of prompt/adapter learning methods for vision-language models.
Evaluating text-to-image/video/3D models with VQAScore
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
RET-CLIP: A Retinal Image Foundation Model Pre-trained with Clinical Diagnostic Reports
🎉 PILOT: A Pre-trained Model-Based Continual Learning Toolbox
The Cradle framework is a first attempt at General Computer Control (GCC). Cradle enables agents to master any computer task through strong reasoning, self-improvement, and skill curation, in a standardized general environment with minimal requirements.
[CVPR 2024] Official PyTorch Code for "PromptKD: Unsupervised Prompt Distillation for Vision-Language Models"
Evaluation code and datasets for the ACL 2024 paper, VISTA: Visualized Text Embedding for Universal Multi-Modal Retrieval. The original code and model can be accessed at FlagEmbedding.
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.