Paper accepted at NeurIPS 2022.
This is a official repository of SemMAE. Our code references the MAE, thanks a lot for their outstanding work! For details of our work see Semantic-Guided Masking for Learning Masked Autoencoders.
@article{li2022semmae,
title={SemMAE: Semantic-Guided Masking for Learning Masked Autoencoders},
author={Li, Gang and Zheng, Heliang and Liu, Daqing and Wang, Chaoyue and Su, Bing and Zheng, Changwen},
journal={arXiv preprint arXiv:2206.10207},
year={2022}
}
This implementation is in PyTorch+GPU.
- This repo is based on
timm==0.3.2
, for which a fix is needed to work with PyTorch 1.8.1+. - It maybe needed for the repository: tensorboard. It can be installed by 'pip install '.
size | 16x16 patch | 8x8 patch |
---|---|---|
link | download | pwd:1tum |
md5 | losed | waiting |
800-epochs | ViT-Base 16x16 patch | ViT-Base 8x8 patch |
---|---|---|
pretrained checkpoint | download | download |
md5 | 1482ae | 322b6a |
As a sanity check, run evaluation using our ImageNet fine-tuned models:
800-epochs | ViT-Base 16x16 patch | ViT-Base 8x8 patch |
---|---|---|
fine-tuned checkpoint | download | download |
md5 | bbc5ef | 6abd9e |
reference ImageNet accuracy | 83.352 | 84.444 |
Evaluate ViT-Base_16 in a single GPU (${IMAGENET_DIR}
is a directory containing {train, val}
sets of ImageNet):
python main_finetune.py --eval --resume SemMAE_epoch799_vit_base_checkpoint-99.pth --model vit_base_patch16 --batch_size 16 --data_path ${IMAGENET_DIR}
This should give:
* Acc@1 83.352 Acc@5 96.494 loss 0.745
Accuracy of the network on the 50000 test images: 83.4%
Evaluate ViT-Base_8 in a single GPU (${IMAGENET_DIR}
is a directory containing {train, val}
sets of ImageNet):
python main_finetune.py --eval --resume SemMAE_epoch799_vit_base_checkpoint_patch8-78.pth --model vit_base_patch8 --batch_size 8 --data_path ${IMAGENET_DIR}
This should give:
* Acc@1 84.444 Acc@5 97.032 loss 0.683
Accuracy of the network on the 50000 test images: 84.44%.
Note that all of our results are obtained on the pretraining 800-epoches setting, the best checkpoint is lost for vit_base_patch8(The paper reported a performance of 84.5% top-1 acc vs. 84.44% in 78-th epoch).
To pre-train ViT-Large (recommended default) with multi-node distributed training, run the following on 8 nodes with 8 GPUs each:
python -m torch.distributed.launch --nproc_per_node=8 --master_port=${MASTER_PORT} \
--nnodes=${NNODES} --node_rank=\${SLURM_NODEID} --master_addr=${MASTER_ADDR} \
--use_env main_pretrain_setting3.py \
--output_dir ${OUTPUT_DIR} --log_dir=${OUTPUT_DIR} \
--batch_size 128 \
--model mae_vit_base_patch16 \
--norm_pix_loss \
--mask_ratio 0.75 \
--epochs 800 \
--warmup_epochs 40 \
--blr 1.5e-4 --weight_decay 0.05 \
--setting 3 \
--data_path ${DATA_DIR}
Note that the input path ${DATA_DIR} is our processed dataset path.
This repo is currently maintained by Gang Li(@ucasligang).