
Multiple Object Stitching for Unsupervised Representation Learning

This repository is the official implementation of "Multiple Object Stitching for Unsupervised Representation Learning".

[Paper] [BibTeX] [Model weights]

1. Introduction

Contrastive learning on single-object-centric images has achieved remarkable progress in unsupervised representation learning, but it suffers from inferior performance on the widespread images that contain multiple objects. In this paper, we propose a simple but effective method, Multiple Object Stitching (MOS), to refine unsupervised representations for multi-object images. Specifically, we construct multi-object images by stitching single-object-centric ones, so that the objects in the synthesized multi-object images are predetermined. Hence, compared to existing contrastive methods, our method provides additional object correspondences between multi-object images without human annotations. In this manner, our method pays more attention to the representation of each object in a multi-object image, thus providing more detailed representations for complicated downstream tasks such as object detection and semantic segmentation.
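
The stitching idea can be illustrated with a minimal NumPy sketch: single-object images are tiled into a grid, so the position of every object in the synthesized image is known by construction. The 2x2 grid and CIFAR-like 32x32 size below are illustrative assumptions, not the exact configuration used in the paper.

```python
import numpy as np

def stitch_images(images, grid=(2, 2)):
    """Stitch single-object images row-major into one multi-object image.

    Because we place each image ourselves, the object layout of the
    synthesized image is predetermined (no annotation needed).
    """
    rows, cols = grid
    assert len(images) == rows * cols
    h, w, c = images[0].shape
    canvas = np.zeros((rows * h, cols * w, c), dtype=images[0].dtype)
    for idx, img in enumerate(images):
        r, col = divmod(idx, cols)
        canvas[r * h:(r + 1) * h, col * w:(col + 1) * w] = img
    return canvas

# Four placeholder 32x32 "single-object" images (CIFAR-sized).
imgs = [np.full((32, 32, 3), i, dtype=np.uint8) for i in range(4)]
multi = stitch_images(imgs)
print(multi.shape)  # (64, 64, 3)
```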

2. Requirements

conda create -n mos python=3.8
conda activate mos
pip install -r requirements.txt

3. Datasets

Torchvision provides the CIFAR10 and CIFAR100 datasets. The data root paths are set to ./dataset/cifar10 and ./dataset/cifar100, respectively.

4. Trained Model Weights & Main Results

[weights download link]

| Dataset | Metric | ViT-Tiny/2 | ViT-Small/2 | ViT-Base/2 |
| --- | --- | --- | --- | --- |
| CIFAR10 | KNN Accuracy | 93.2% | 95.1% | 95.1% |
| CIFAR10 | Linear Accuracy | 94.8% | 96.3% | 96.4% |
| CIFAR10 | Finetune Accuracy | 97.6% | 98.3% | 98.3% |
| CIFAR100 | KNN Accuracy | 67.8% | 73.5% | 74.4% |
| CIFAR100 | Linear Accuracy | 73.5% | 78.5% | 79.6% |
| CIFAR100 | Finetune Accuracy | 83.7% | 86.1% | 86.2% |

5. Usage: Pretraining

ViT-Small with 1-node (8-GPU) training

Set the hyperparameters, dataset and GPUs in config/pretrain/vit_small_pretrain.py and run the following command:

python main_pretrain.py --arch vit-small
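
For intuition, the contrastive objective underlying this kind of pretraining can be sketched as an InfoNCE loss over two views of a batch. This is a hedged NumPy illustration, not the repository's actual loss implementation (which additionally exploits the stitched-object correspondences); the temperature value is an assumption.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.2):
    """InfoNCE loss between two batches of view embeddings.

    z1[i] and z2[i] embed two views of the same image (positives on the
    diagonal); every other pair in the batch serves as a negative.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature             # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # cross-entropy on positives

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
loss_matched = info_nce(z, z)                    # identical views: low loss
loss_random = info_nce(z, rng.normal(size=(8, 16)))
```

As expected, the loss for correctly matched views is lower than for randomly paired ones.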

6. Usage: KNN

Set the hyperparameters, dataset and GPUs in config/knn/knn.py and run the following command:

python main_knn.py --arch vit-small --pretrained-weights /path/to/pretrained-weights.pth
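
The KNN protocol commonly used to evaluate self-supervised models is a cosine-similarity weighted vote over the k nearest training features. The sketch below is a NumPy illustration with assumed defaults (k=5, T=0.07); see config/knn/knn.py for the settings this repository actually uses.

```python
import numpy as np

def knn_predict(train_feats, train_labels, test_feats, k=5, T=0.07):
    """Weighted k-NN on L2-normalized features (cosine similarity)."""
    train = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    test = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    sim = test @ train.T                          # (n_test, n_train)
    topk = np.argsort(-sim, axis=1)[:, :k]        # indices of k nearest
    preds = []
    for row, idx in zip(sim, topk):
        weights = np.exp(row[idx] / T)            # temperature-scaled votes
        votes = np.zeros(train_labels.max() + 1)
        for w, lbl in zip(weights, train_labels[idx]):
            votes[lbl] += w
        preds.append(votes.argmax())
    return np.array(preds)

# Toy features: two well-separated clusters standing in for two classes.
rng = np.random.default_rng(0)
train = np.concatenate([rng.normal([5, 0, 0, 0], 0.1, (20, 4)),
                        rng.normal([0, 5, 0, 0], 0.1, (20, 4))])
train_labels = np.array([0] * 20 + [1] * 20)
test = np.array([[4.0, 0.1, 0.0, 0.0], [0.1, 4.0, 0.0, 0.0]])
preds = knn_predict(train, train_labels, test)
```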

7. Usage: Linear Classification

Set the hyperparameters, dataset and GPUs in config/linear/vit_small_linear.py and run the following command:

python main_linear.py --arch vit-small --pretrained-weights /path/to/pretrained-weights.pth
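
Conceptually, linear evaluation trains only a linear classifier on frozen backbone features. A minimal stand-in (plain NumPy softmax regression on synthetic features, not this repository's training loop; learning rate and step count are illustrative) looks like:

```python
import numpy as np

def train_linear_probe(feats, labels, n_classes, lr=0.5, steps=200):
    """Fit a linear classifier on frozen features by gradient descent."""
    n, d = feats.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = feats @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                   # softmax CE gradient
        W -= lr * feats.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

# Synthetic "frozen features": two linearly separable classes.
rng = np.random.default_rng(0)
feats = np.concatenate([rng.normal(-1, 0.3, (30, 8)),
                        rng.normal(1, 0.3, (30, 8))])
labels = np.array([0] * 30 + [1] * 30)
W, b = train_linear_probe(feats, labels, n_classes=2)
acc = ((feats @ W + b).argmax(1) == labels).mean()
```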

8. Usage: End-to-End Fine-tuning

Set the hyperparameters, dataset and GPUs in config/finetuning/vit_small_finetuning.py and run the following command:

python main_finetune.py --arch vit-small --pretrained-weights /path/to/pretrained-weights.pth

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.

Citation

@article{tang2025compact,
  author  = {Tang, Hao and Shen, Chengchao},
  title   = {Learning Compact Vision Tokens for Efficient Large Multimodal Models},
  journal = {arXiv preprint arXiv:2506.07138},
  year    = {2025},
}
