
OpenSSL-SimCore (CVPR 2023)

Coreset Sampling from Open-Set for Fine-Grained Self-Supervised Learning
Sungnyun Kim*, Sangmin Bae*, Se-Young Yun
* equal contribution

  • Open-set Self-Supervised Learning (OpenSSL) task: an unlabeled open-set is available during the pretraining phase on the fine-grained target dataset.
  • SimCore: a simple coreset selection algorithm that leverages the subset of the open-set semantically similar to the target dataset.
  • SimCore significantly improves representation learning performance on various downstream tasks.
  • [update on 10.02.2023] Shared SimCore-pretrained models on HuggingFace Models.

Requirements

Install the necessary packages with:

$ pip install -r requirements.txt

Data Preparation

We used 11 fine-grained datasets and 7 open-sets. Place each dataset's files into data/[DATASET_NAME]/ (structured in the torchvision.datasets.ImageFolder format).
To download and set up the data, see the docs and run the Python scripts below if necessary.

$ cd data/
$ python [DATASET_NAME]_image_folder_generator.py
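For reference, each dataset folder should follow the ImageFolder convention of one subdirectory per class. The class and file names below are purely illustrative:

data/[DATASET_NAME]/
  class_001/
    img_0001.jpg
    img_0002.jpg
    ...
  class_002/
    img_0001.jpg
    ...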

Pretraining

To pretrain the model, run the shell script below. (Multi-GPU training is supported; we used 4 GPUs.)
You will need to define the path for each dataset and the retrieval model checkpoint.

# specify $TAG and $DATA

$ CUDA_VISIBLE_DEVICES=<GPU_ID> bash run_selfsup.sh

Here are the important arguments to consider (an example configuration follows the list).

  • --dataset1: fine-grained target dataset name
  • --dataset2: open-set name (default: imagenet)
  • --data_folder1: directory where the dataset1 is located
  • --data_folder2: directory where the dataset2 is located
  • --retrieval_ckpt: checkpoint of the retrieval model used before SimCore pretraining; obtain it by pretraining vanilla SSL for 1K epochs
  • --model: model architecture (default: resnet50), see models
  • --method: self-supervised learning method (default: simclr), see ssl
  • --sampling_method: strategy for sampling from the open-set (choose between "random" and "simcore")
  • --no_sampling: set this to True to disable open-set sampling (vanilla SSL pretraining)
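Putting these together, the training invocation inside run_selfsup.sh might look like the following sketch. The entry-point script name (main.py), dataset name, paths, and checkpoint location are assumptions for illustration; only the flags are taken from the list above.

# Hypothetical sketch of a SimCore pretraining configuration.
# main.py, the dataset name, and all paths are illustrative; adjust to your setup.
$ python main.py \
    --dataset1 cub \
    --dataset2 imagenet \
    --data_folder1 ./data/cub/ \
    --data_folder2 ./data/imagenet/ \
    --retrieval_ckpt ./save/vanilla_ssl_1000ep/last.pth \
    --model resnet50 \
    --method simclr \
    --sampling_method simcore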

The pretrained model checkpoints will be saved at save/[EXP_NAME]/. For example, if you run the default shell script, the last-epoch checkpoint will be saved as save/$DATA_resnet50_pretrain_simclr_merge_imagenet_$TAG/last.pth.

Linear Evaluation

Linear evaluation of a pretrained model is run in the same way as pretraining.
Run the following shell script, additionally specifying the pretrained model checkpoint.

# specify $TAG, $DATA, and --pretrained_ckpt

$ CUDA_VISIBLE_DEVICES=<GPU_ID> bash run_sup.sh

We also support kNN evaluation (--knn, --topk) and semi-supervised fine-tuning (--label_ratio, --e2e).
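For example, an evaluation run might look like the following sketch. The checkpoint path and flag values are illustrative assumptions; only the flags themselves come from the repository.

# Hypothetical sketch: evaluating a SimCore-pretrained checkpoint.
# The checkpoint path and flag values are illustrative.
$ CUDA_VISIBLE_DEVICES=0 bash run_sup.sh
# with --pretrained_ckpt set inside the script to, e.g.,
# save/cub_resnet50_pretrain_simclr_merge_imagenet_mytag/last.pth;
# add --knn --topk 20 for kNN evaluation, or --label_ratio 10 --e2e
# for semi-supervised fine-tuning.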

Result

SimCore with a stopping criterion improves accuracy by +10.5% (averaged over the 11 datasets), compared to pretraining without any open-set.

Try other open-sets

SimCore works with various, even uncurated, open-sets. You can also try your own custom or web-crawled open-sets.


Downstream Tasks

We evaluate SimCore extensively on various downstream tasks and provide the training and evaluation code for each of them.
For more details, see the docs and the downstream/ directory.

Use the pretrained model checkpoint to run each downstream task.

BibTeX

If you find this repo useful for your research, please consider citing our paper:

@article{kim2023coreset,
  title={Coreset Sampling from Open-Set for Fine-Grained Self-Supervised Learning},
  author={Kim, Sungnyun and Bae, Sangmin and Yun, Se-Young},
  journal={arXiv preprint arXiv:2303.11101},
  year={2023}
}

Contact