This repository is the official implementation of "Multiple Object Stitching for Unsupervised Representation Learning".
[Paper] [BibTeX] [Model Weights]
Contrastive learning on single-object-centric images has achieved remarkable progress in unsupervised representation learning, but it suffers from inferior performance on the more widespread images that contain multiple objects. In this paper, we propose a simple but effective method, Multiple Object Stitching (MOS), to refine unsupervised representations for multi-object images. Specifically, we construct multi-object images by stitching single-object-centric ones, so the objects in the synthesized multi-object images are predetermined. Hence, compared to existing contrastive methods, our method provides additional object correspondences between multi-object images without human annotations. In this manner, our method pays more attention to the representation of each object in a multi-object image, thus providing more detailed representations for complicated downstream tasks, such as object detection and semantic segmentation.
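For intuition, the stitching idea can be sketched in a few lines: tile a batch of single-object images into a grid and resize the result back to the original resolution, so the identity and position of every object in the synthesized image are known in advance. This is a minimal illustration of the idea only, not the repository's implementation; the grid size and interpolation mode are assumptions.

```python
import torch
import torch.nn.functional as F

def stitch_images(images: torch.Tensor, grid: int = 2) -> torch.Tensor:
    """Stitch groups of grid*grid single-object images (B, C, H, W) into
    multi-object images of the same resolution, so object correspondences
    in the synthesized images are known without annotations."""
    b, c, h, w = images.shape
    assert b % (grid * grid) == 0, "batch must be divisible by grid*grid"
    # group consecutive samples into one synthesized image
    x = images.view(b // (grid * grid), grid, grid, c, h, w)
    # lay the cells out spatially: (N, C, grid*H, grid*W)
    x = x.permute(0, 3, 1, 4, 2, 5).reshape(-1, c, grid * h, grid * w)
    # resize back to the single-image resolution
    return F.interpolate(x, size=(h, w), mode="bilinear", align_corners=False)

# example: 8 CIFAR-size images -> 2 stitched images with 4 known objects each
multi = stitch_images(torch.randn(8, 3, 32, 32), grid=2)
print(multi.shape)  # torch.Size([2, 3, 32, 32])
```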
```bash
conda create -n mos python=3.8
conda activate mos
pip install -r requirements.txt
```
Torchvision provides the CIFAR10 and CIFAR100 datasets. The dataset root paths are set to `./dataset/cifar10` and `./dataset/cifar100`, respectively.
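If the datasets are not already present, they can be fetched with torchvision's standard dataset classes; the snippet below is only a convenience illustration, using the root paths expected by this repository.

```python
from torchvision.datasets import CIFAR10, CIFAR100

# download both splits into the root paths used by the configs
CIFAR10(root="./dataset/cifar10", train=True, download=True)
CIFAR10(root="./dataset/cifar10", train=False, download=True)
CIFAR100(root="./dataset/cifar100", train=True, download=True)
CIFAR100(root="./dataset/cifar100", train=False, download=True)
```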
| Dataset | Evaluation | ViT-Tiny/2 | ViT-Small/2 | ViT-Base/2 |
|---|---|---|---|---|
| CIFAR10 | KNN Accuracy | 93.2% | 95.1% | 95.1% |
| CIFAR10 | Linear Accuracy | 94.8% | 96.3% | 96.4% |
| CIFAR10 | Finetune Accuracy | 97.6% | 98.3% | 98.3% |
| CIFAR100 | KNN Accuracy | 67.8% | 73.5% | 74.4% |
| CIFAR100 | Linear Accuracy | 73.5% | 78.5% | 79.6% |
| CIFAR100 | Finetune Accuracy | 83.7% | 86.1% | 86.2% |
Set the hyperparameters, dataset and GPUs in `config/pretrain/vit_small_pretrain.py` and run the following command:
```bash
python main_pretrain.py --arch vit-small
```
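The config file is a plain Python module. The example below is purely hypothetical and only illustrates the kind of fields one would edit there; the actual field names and values in `config/pretrain/vit_small_pretrain.py` may differ.

```python
# config/pretrain/vit_small_pretrain.py (hypothetical field names, for illustration only)
dataset = "cifar10"              # which dataset to pretrain on
data_root = "./dataset/cifar10"  # dataset root path
gpus = [0, 1]                    # GPU indices to use
batch_size = 512                 # pretraining batch size
lr = 1e-3                        # base learning rate
```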
Set the hyperparameters, dataset and GPUs in `config/knn/knn.py` and run the following command:
```bash
python main_knn.py --arch vit-small --pretrained-weights /path/to/pretrained-weights.pth
```
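For reference, a common KNN evaluation protocol on frozen features looks roughly like the sketch below (cosine-similarity nearest neighbours with a majority vote). This is a generic illustration; the repository's exact protocol and the value of k are configured in `config/knn/knn.py` and may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def knn_accuracy(train_feats, train_labels, test_feats, test_labels, k=20):
    """Cosine-similarity KNN classifier on frozen features.
    k=20 is an assumption, not necessarily the repository's setting."""
    train_feats = F.normalize(train_feats, dim=1)
    test_feats = F.normalize(test_feats, dim=1)
    sims = test_feats @ train_feats.t()           # (N_test, N_train)
    idx = sims.topk(k, dim=1).indices             # k nearest neighbours
    neighbour_labels = train_labels[idx]          # (N_test, k)
    preds = neighbour_labels.mode(dim=1).values   # majority vote
    return (preds == test_labels).float().mean().item()

# toy usage with random features
acc = knn_accuracy(torch.randn(500, 64), torch.randint(0, 10, (500,)),
                   torch.randn(100, 64), torch.randint(0, 10, (100,)))
```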
Set the hyperparameters, dataset and GPUs in `config/linear/vit_small_linear.py` and run the following command:
```bash
python main_linear.py --arch vit-small --pretrained-weights /path/to/pretrained-weights.pth
```
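Linear evaluation trains a single linear classifier on top of frozen features. The sketch below shows only the core idea, assuming a backbone that returns a feature vector; the actual script additionally handles data loading, optimizer schedules and augmentation.

```python
import torch.nn as nn

def build_linear_probe(backbone: nn.Module, feat_dim: int, num_classes: int):
    """Freeze the pretrained backbone and attach a trainable linear head.
    feat_dim and num_classes are placeholders for the actual configuration."""
    for p in backbone.parameters():
        p.requires_grad = False
    backbone.eval()
    head = nn.Linear(feat_dim, num_classes)
    return backbone, head

# training step (schematic): only the head receives gradients
# features = backbone(images).detach(); logits = head(features)
```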
Set the hyperparameters, dataset and GPUs in `config/finetuning/vit_small_finetuning.py` and run the following command:
```bash
python main_finetune.py --arch vit-small --pretrained-weights /path/to/pretrained-weights.pth
```
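Finetuning starts from the pretrained backbone weights and updates all parameters. Loading the checkpoint typically reduces to a state-dict load, roughly as sketched below; the `state_dict` key and checkpoint layout are assumptions, not guarantees about this repository's checkpoints.

```python
import torch

def load_pretrained(model: torch.nn.Module, path: str) -> torch.nn.Module:
    """Load pretrained backbone weights before finetuning.
    The 'state_dict' key is an assumption about the checkpoint layout."""
    ckpt = torch.load(path, map_location="cpu")
    state = ckpt.get("state_dict", ckpt)
    missing, unexpected = model.load_state_dict(state, strict=False)
    print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
    return model
```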
This project is under the CC-BY-NC 4.0 license. See LICENSE for details.
```bibtex
@article{tang2025compact,
  author  = {Tang, Hao and Shen, Chengchao},
  title   = {Learning Compact Vision Tokens for Efficient Large Multimodal Models},
  journal = {arXiv preprint arXiv:2506.07138},
  year    = {2025},
}
```