This repository is the official implementation of "Multiple Object Stitching for Unsupervised Representation Learning".
[Paper] [BibTeX] [Model Weights]
Contrastive learning on single-object-centric images has achieved remarkable progress in unsupervised representation learning, but it suffers from inferior performance on the more widespread images that contain multiple objects. In this paper, we propose a simple but effective method, Multiple Object Stitching (MOS), to refine unsupervised representations for multi-object images. Specifically, we construct multi-object images by stitching single-object-centric ones, so the objects in the synthesized multi-object images are predetermined. Hence, compared to existing contrastive methods, our method provides additional object correspondences between multi-object images without human annotations. In this manner, our method pays more attention to the representation of each object in a multi-object image, thus providing more detailed representations for complicated downstream tasks, such as object detection and semantic segmentation.
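For intuition, the stitching idea can be sketched in a few lines: tile a batch of single-object images into a grid and resize the result back to the original resolution, so the identity and position of every object in the synthesized image are known in advance. This is a minimal illustration of the idea only, not the repository's implementation; the grid size and interpolation mode are assumptions.

```python
import torch
import torch.nn.functional as F

def stitch_images(images: torch.Tensor, grid: int = 2) -> torch.Tensor:
    """Stitch groups of grid*grid single-object images (B, C, H, W) into
    multi-object images of the same resolution, so object correspondences
    in the synthesized images are known without annotations."""
    b, c, h, w = images.shape
    assert b % (grid * grid) == 0, "batch must be divisible by grid*grid"
    # group consecutive samples into one synthesized image
    x = images.view(b // (grid * grid), grid, grid, c, h, w)
    # lay the cells out spatially: (N, C, grid*H, grid*W)
    x = x.permute(0, 3, 1, 4, 2, 5).reshape(-1, c, grid * h, grid * w)
    # resize back to the single-image resolution
    return F.interpolate(x, size=(h, w), mode="bilinear", align_corners=False)

# example: 8 CIFAR-size images -> 2 stitched images with 4 known objects each
multi = stitch_images(torch.randn(8, 3, 32, 32), grid=2)
print(multi.shape)  # torch.Size([2, 3, 32, 32])
```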
```bash
conda create -n mos python=3.8
conda activate mos
pip install -r requirements.txt
```
Torchvision provides the CIFAR10 and CIFAR100 datasets. The dataset root paths are set to `./dataset/cifar10` and `./dataset/cifar100`, respectively.
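If the datasets are not already present, they can be fetched with torchvision's standard dataset classes; the snippet below is only a convenience illustration, using the root paths expected by this repository.

```python
from torchvision.datasets import CIFAR10, CIFAR100

# download both splits into the root paths used by the configs
CIFAR10(root="./dataset/cifar10", train=True, download=True)
CIFAR10(root="./dataset/cifar10", train=False, download=True)
CIFAR100(root="./dataset/cifar100", train=True, download=True)
CIFAR100(root="./dataset/cifar100", train=False, download=True)
```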
| Dataset | Evaluation | ViT-Tiny/2 | ViT-Small/2 | ViT-Base/2 |
|---|---|---|---|---|
| CIFAR10 | KNN Accuracy | 93.2% | 95.1% | 95.1% |
| CIFAR10 | Linear Accuracy | 94.8% | 96.3% | 96.4% |
| CIFAR10 | Finetune Accuracy | 97.6% | 98.3% | 98.3% |
| CIFAR100 | KNN Accuracy | 67.8% | 73.5% | 74.4% |
| CIFAR100 | Linear Accuracy | 73.5% | 78.5% | 79.6% |
| CIFAR100 | Finetune Accuracy | 83.7% | 86.1% | 86.2% |
Set the hyperparameters, dataset and GPUs in `config/pretrain/vit_small_pretrain.py` and run the following command:
```bash
python main_pretrain.py --arch vit-small
```
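The config file is a plain Python module. The example below is purely hypothetical and only illustrates the kind of fields one would edit there; the actual field names and values in `config/pretrain/vit_small_pretrain.py` may differ.

```python
# config/pretrain/vit_small_pretrain.py (hypothetical field names, for illustration only)
dataset = "cifar10"              # which dataset to pretrain on
data_root = "./dataset/cifar10"  # dataset root path
gpus = [0, 1]                    # GPU indices to use
batch_size = 512                 # pretraining batch size
lr = 1e-3                        # base learning rate
```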
Set the hyperparameters, dataset and GPUs in `config/knn/knn.py` and run the following command:
```bash
python main_knn.py --arch vit-small --pretrained-weights /path/to/pretrained-weights.pth
```
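For reference, a common KNN evaluation protocol on frozen features looks roughly like the sketch below (cosine-similarity nearest neighbours with a majority vote). This is a generic illustration; the repository's exact protocol and the value of k are configured in `config/knn/knn.py` and may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def knn_accuracy(train_feats, train_labels, test_feats, test_labels, k=20):
    """Cosine-similarity KNN classifier on frozen features.
    k=20 is an assumption, not necessarily the repository's setting."""
    train_feats = F.normalize(train_feats, dim=1)
    test_feats = F.normalize(test_feats, dim=1)
    sims = test_feats @ train_feats.t()           # (N_test, N_train)
    idx = sims.topk(k, dim=1).indices             # k nearest neighbours
    neighbour_labels = train_labels[idx]          # (N_test, k)
    preds = neighbour_labels.mode(dim=1).values   # majority vote
    return (preds == test_labels).float().mean().item()

# toy usage with random features
acc = knn_accuracy(torch.randn(500, 64), torch.randint(0, 10, (500,)),
                   torch.randn(100, 64), torch.randint(0, 10, (100,)))
```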
Set the hyperparameters, dataset and GPUs in `config/linear/vit_small_linear.py` and run the following command:
```bash
python main_linear.py --arch vit-small --pretrained-weights /path/to/pretrained-weights.pth
```
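Linear evaluation trains a single linear classifier on top of frozen features. The sketch below shows only the core idea, assuming a backbone that returns a feature vector; the actual script additionally handles data loading, optimizer schedules and augmentation.

```python
import torch.nn as nn

def build_linear_probe(backbone: nn.Module, feat_dim: int, num_classes: int):
    """Freeze the pretrained backbone and attach a trainable linear head.
    feat_dim and num_classes are placeholders for the actual configuration."""
    for p in backbone.parameters():
        p.requires_grad = False
    backbone.eval()
    head = nn.Linear(feat_dim, num_classes)
    return backbone, head

# training step (schematic): only the head receives gradients
# features = backbone(images).detach(); logits = head(features)
```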
Set the hyperparameters, dataset and GPUs in `config/finetuning/vit_small_finetuning.py` and run the following command:
```bash
python main_finetune.py --arch vit-small --pretrained-weights /path/to/pretrained-weights.pth
```
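Finetuning starts from the pretrained backbone weights and updates all parameters. Loading the checkpoint typically reduces to a state-dict load, roughly as sketched below; the `state_dict` key and checkpoint layout are assumptions, not guarantees about this repository's checkpoints.

```python
import torch

def load_pretrained(model: torch.nn.Module, path: str) -> torch.nn.Module:
    """Load pretrained backbone weights before finetuning.
    The 'state_dict' key is an assumption about the checkpoint layout."""
    ckpt = torch.load(path, map_location="cpu")
    state = ckpt.get("state_dict", ckpt)
    missing, unexpected = model.load_state_dict(state, strict=False)
    print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
    return model
```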
This project is under the CC-BY-NC 4.0 license. See LICENSE for details.
```bibtex
@article{tang2025compact,
  author  = {Tang, Hao and Shen, Chengchao},
  title   = {Learning Compact Vision Tokens for Efficient Large Multimodal Models},
  journal = {arXiv preprint arXiv:2506.07138},
  year    = {2025},
}
```