Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models

Write a very high-quality and detailed summary of the paper that describes the paper in a way that a human can fully understand. The summary should cover the problem, the proposed solution and highlight the main contributions of the paper.

Here is a detailed summary of the key points from the paper:

Problem:

Existing multimodal LLMs like Ferret have limited visual understanding capabilities due to the low resolution of pre-trained image encoders. This restricts their performance on tasks requiring detailed visual comprehension.

Proposed Solution - Ferret-v2:

Implements "any resolution" image scaling to enable handling images at higher resolutions while retaining pre-trained knowledge. This outperforms "direct upsampling".
Uses both CLIP and DINOv2 encoders to capture global semantics as well as local fine-grained details at multiple visual granularities.
Employs a 3-stage training paradigm going from coarse image-text alignment to high-resolution dense alignment and finally instruction tuning.

Main Contributions:

Analysis showing "any resolution" image scaling surpasses "direct upsampling" for harnessing detail while preserving pre-training benefits.
Multi-granularity visual encoding with CLIP and DINOv2 encoders to represent both global and local fine-grained information.
A novel 3-stage training approach enabling improved resolution handling and vision-language alignment in a "coarse-to-fine" manner.
State-of-the-art performance over strong baselines like Ferret across referring, grounding and reasoning tasks, especially those needing regional visual understanding.

In summary, Ferret-v2 substantially upgrades Ferret for any resolution processing, multi-granularity encoding and enhanced training to unlock superior detail-oriented visual comprehension.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

2404.07973.md

2404.07973.md

Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models

Write a very high-quality and detailed summary of the paper that describes the paper in a way that a human can fully understand. The summary should cover the problem, the proposed solution and highlight the main contributions of the paper.

Files

2404.07973.md

Latest commit

History

2404.07973.md

File metadata and controls

Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models

Write a very high-quality and detailed summary of the paper that describes the paper in a way that a human can fully understand. The summary should cover the problem, the proposed solution and highlight the main contributions of the paper.