
Data-Centric Vision-Language Pre-training

arXiv

  • At least half of the samples in the well-cleaned CC3M dataset (refined from 5 billion raw images, a preservation ratio of roughly 0.0006) negatively affect the learned representation!

  • The purpose of this project is to compress existing large-scale Vision-Language Pre-training datasets without dropping performance. We want the community to pay more attention to data.

This work is still in progress; the current compression rate is around 70%-80%.

However, the data selection strategy is quite simple, and we are exploring more solid methods.

We also focus on refining existing datasets with our toolbox Image2Paragraph.

News

08/17/2023: Code released.

To do

  • Website.
  • Show referenced generated_annotation_file.

1. Introduction

1. Conventional Vision-Language Datasets

| Index | Original Dataset | #Original Samples | Reduced Dataset | #Reduced Samples | Compression Rate |
|---|---|---|---|---|---|
| 0 | CC3M | 2.82M | TL;DR CC3M | 0.67M | 76.25% |
| 1 | CC12M | 10.8M | TL;DR CC12M | 2.4M | 77.8% |
| 2 | YFCC | 14.9M | TL;DR YFCC | 2.5M | 83.33% |
| 3 | LAION-Sub | 40M | TL;DR LAION-Sub | 8.04M | 79.90% |

2. Data-efficient learning methods

"Large-scale" means that the methods are effective when used on datasets that are very large in size. The "task agnostic" means that the methods can be used regardless of the specific downstream task, and without any prior exposure to the associated data.

| Method | Year | Data Type | Compression Ratio | Task Agnostic | Large-scale | Supervision | Generation/Selection |
|---|---|---|---|---|---|---|---|
| Dataset Distillation [1] | 2018 | Image | 99%-99.99% | No | No | Class Label | Generation |
| Data Pruning [2] | 2022 | Image | 20%-30% | No | Yes | Class Label | Selection |
| Neural Data Server [3] | 2020 | Multi-modality | 94%-98% | No | Yes | Image-text Pairs | Selection |
| TL;DR (ours) | 2023 | Multi-modality | 75%-90% | Yes | Yes | Image-text Pairs | Generation+Selection |

[1] Wang T et al. Dataset distillation[J]. arXiv preprint arXiv:1811.10959, 2018.

[2] Sorscher B et al. Beyond neural scaling laws: beating power law scaling via data pruning[J]. NeurIPS, 2022.

[3] Yan X, et al. Neural data server: A large-scale search engine for transfer learning data[C]. CVPR, 2020.

2. Run

Step 1. Pre-train Codebook-based Vision-Language Model

The codebook implementation is from VQ-VAE.

Please follow GETTING_START.md for data preparation and captioner model training.
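For reference, the core operation of a VQ-VAE-style codebook is a nearest-neighbour lookup into a learnable embedding table. The following is a minimal, self-contained sketch of that quantization step; the codebook size, feature dimension, and class name are illustrative assumptions, not the actual configuration used in this repository.

```python
# Minimal VQ-VAE-style quantizer sketch (sizes and names are assumptions).
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=1024, dim=256):
        super().__init__()
        # Learnable codebook: num_codes discrete entries of dimension dim.
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):
        # z: (batch, dim) continuous features from the visual encoder.
        dists = torch.cdist(z, self.codebook.weight)   # (batch, num_codes)
        idx = dists.argmin(dim=1)                      # nearest codebook index per sample
        z_q = self.codebook(idx)                       # quantized features
        # Straight-through estimator so gradients still reach the encoder.
        z_q = z + (z_q - z).detach()
        return z_q, idx
```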

Step 2. Codebook Extractor

python codebook_extractor.py
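Conceptually, this step runs every image through the frozen encoder and quantizer and stores its discrete codebook indices for the later clustering step. The loop below is only a hypothetical outline; `model.encode`, `model.quantizer`, and the output format are assumed names, and the actual logic lives in `codebook_extractor.py`.

```python
# Hypothetical outline of codebook extraction (interfaces are assumptions).
import json
import torch

@torch.no_grad()
def extract_codes(model, dataloader, out_path="codebook_indices.json"):
    model.eval()
    records = []
    for images, image_ids in dataloader:
        feats = model.encode(images)      # continuous visual features (assumed API)
        _, idx = model.quantizer(feats)   # discrete codebook indices
        for image_id, code in zip(image_ids, idx.tolist()):
            records.append({"image_id": image_id, "code": code})
    with open(out_path, "w") as f:
        json.dump(records, f)
```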

Step 3. Codebook Clustering and Selection

python codebook_cluster.py

For comparison, random selection can also be used:

python random_selection.py
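The sketch below illustrates one plausible cluster-then-select scheme over the extracted codebook features, alongside the random-selection baseline; the clustering algorithm (k-means), the number of clusters, and the keep ratio are assumptions for illustration and may differ from the exact strategy in `codebook_cluster.py`.

```python
# Cluster-based subset selection vs. random selection (parameters are assumptions).
import numpy as np
from sklearn.cluster import KMeans

def cluster_select(features, keep_ratio=0.25, n_clusters=100, seed=0):
    """Keep roughly keep_ratio of the samples, drawn evenly from feature clusters."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(features)
    keep = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        # Prefer samples closest to the cluster centre (an assumed heuristic).
        dists = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
        n_keep = max(1, int(len(members) * keep_ratio))
        keep.extend(members[np.argsort(dists)[:n_keep]])
    return np.array(keep)

def random_select(n_samples, keep_ratio=0.25, seed=0):
    """Baseline: keep a uniformly random subset of the same size."""
    rng = np.random.default_rng(seed)
    return rng.choice(n_samples, size=int(n_samples * keep_ratio), replace=False)
```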

Step 4. Fine-tuning VLP Model on Human-cleaned Captioning Dataset

python vq_compress_model/train_caption.py
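Once fine-tuned, the captioner can re-caption web images whose original alt-text is noisy. The snippet below is a hedged sketch of that inference loop; `captioner.generate` and the dataloader format are assumed interfaces rather than the exact API of `vq_compress_model/train_caption.py`.

```python
# Hedged sketch of re-captioning with the fine-tuned captioner (APIs are assumptions).
import torch

@torch.no_grad()
def recaption(captioner, dataloader, max_length=30):
    captioner.eval()
    new_captions = {}
    for images, image_ids in dataloader:
        captions = captioner.generate(images, max_length=max_length)  # assumed generate API
        new_captions.update(dict(zip(image_ids, captions)))
    return new_captions
```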

Step 5. Generate Training Json

python generate_train_json_w_caption.py

We show the ITM score distribution below:

The main purpose of these steps is to raise the image-text matching score. This is not limited to an image captioner; Neural Data Server and other techniques that improve the alignment between vision and text also work.
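As a rough illustration of that selection rule, an image can keep its original web caption or take the generated one depending on which text the ITM head scores as better matched. The function below is only a sketch; `itm_model.itm_score` is an assumed interface, not the repository's actual API.

```python
# Sketch: keep whichever caption the ITM head scores higher (assumed scoring API).
import torch

@torch.no_grad()
def pick_caption(itm_model, image, web_caption, generated_caption):
    s_web = itm_model.itm_score(image, web_caption)
    s_gen = itm_model.itm_score(image, generated_caption)
    return web_caption if s_web >= s_gen else generated_caption
```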

Step 6. Pre-training and Evaluating on Downstream Tasks

Use the generated annotation files to train the VLP model in the normal way.
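For illustration, the generated annotation file can be read by a standard image-text dataset such as the minimal sketch below; the field names (`image`, `caption`) are assumptions based on common BLIP-style annotation formats and may differ from the actual file.

```python
# Minimal sketch of consuming a generated annotation file (field names are assumptions).
import json
from PIL import Image
from torch.utils.data import Dataset

class ReducedPretrainDataset(Dataset):
    def __init__(self, ann_path, transform=None):
        with open(ann_path) as f:
            self.annotations = json.load(f)   # expected: list of {"image": path, "caption": text}
        self.transform = transform

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, i):
        ann = self.annotations[i]
        image = Image.open(ann["image"]).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, ann["caption"]
```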

3. Some Results

a. CC3M

| Dataset | Samples | Pretraining Time | COCO TR@1 | COCO IR@1 | COCO Captioning B@4 | NLVR2 |
|---|---|---|---|---|---|---|
| CC3M | 2.82M | 19H | 70.9 | 54.3 | 36.8 | 76.2 |
| TL;DR CC3M | 0.67M | 4.7H | 72.8 | 54.8 | 37.6 | 78.0 |

b. CC12M

| Dataset | Samples | Pretraining Time | Flickr TR@1 | Flickr IR@1 | COCO Captioning B@4 | NLVR2 |
|---|---|---|---|---|---|---|
| CC12M | 10.8M | 65H | 84.7 | 75.3 | 37.5 | 78.9 |
| TL;DR CC12M | 2.4M | 14H | 85.5 | 76.3 | 38.1 | 78.5 |

c. YFCC

Compression Rate: 83.33%

d. LAION-Subset

Compression Rate: 80%

Acknowledgement

This work is mainly inspired by Dataset Distillation and Data Pruning. The architecture ablations are mainly based on BLIP and ViLT. Thanks for these good works.

Citation

If you find our work helpful, please use the following BibTeX entry for citation.

@inproceedings{wang2023tldr,
  title={Too Large; Data Reduction for Vision-Language Pre-Training},
  author={Alex Jinpeng Wang and Kevin Qinghong Lin and David Junhao Zhang and Stan Weixian Lei and Mike Zheng Shou},
  booktitle={ICCV},
  year={2023}
}
