EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm

Official PyTorch implementation of "EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm", which improves on our previous work "Analogous to Evolutionary Algorithm: Designing a Unified Sequence Model" (NeurIPS'21).

Abstract

Motivated by biological evolution, this paper explains the rationality of the Vision Transformer by analogy with the proven, practical Evolutionary Algorithm (EA) and derives that both have consistent mathematical formulations. Inspired by effective EA variants, we then propose a novel pyramid EATFormer backbone that contains only the proposed EA-based Transformer (EAT) block, which consists of three residual parts, i.e., Multi-Scale Region Aggregation (MSRA), Global and Local Interaction (GLI), and Feed-Forward Network (FFN) modules, to model multi-scale, interactive, and individual information separately. Moreover, we design a Task-Related Head (TRH) docked with the transformer backbone to complete final information fusion more flexibly, and improve a Modulated Deformable MSA (MD-MSA) to dynamically model irregular locations. Extensive quantitative and qualitative experiments on image classification, downstream tasks, and explanatory analyses demonstrate the effectiveness and superiority of our approach over State-Of-The-Art (SOTA) methods. E.g., our Mobile (1.8M), Tiny (6.1M), Small (24.3M), and Base (49.0M) models achieve 69.4, 78.4, 83.1, and 83.9 Top-1 accuracy when trained only on ImageNet-1K with a naive training recipe; Mask R-CNN equipped with EATFormer-Tiny/Small/Base backbones obtains 45.4/47.4/49.0 box AP and 41.4/42.9/44.2 mask AP on COCO detection, surpassing the contemporary MPViT-T, Swin-T, and Swin-S by 0.6/1.4/0.5 box AP and 0.4/1.3/0.9 mask AP respectively with fewer FLOPs; our EATFormer-Small/Base achieve 47.3/49.3 mIoU on ADE20K with UperNet, exceeding Swin-T/S by 2.8/1.7.
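
For orientation, the three residual parts of the EAT block compose as sketched below. This PyTorch snippet is an illustrative approximation only: the Conv1d and MultiheadAttention layers are simple stand-ins for the actual MSRA and GLI designs described in the paper.

    import torch
    import torch.nn as nn

    class EATBlockSketch(nn.Module):
        """Illustrative sketch of an EAT block: three residual parts applied
        in sequence (MSRA -> GLI -> FFN). The real MSRA/GLI modules are more
        involved; plain layers stand in for them here."""
        def __init__(self, dim, num_heads=4, mlp_ratio=4):  # dim must divide by num_heads
            super().__init__()
            self.norm1 = nn.LayerNorm(dim)
            # Stand-in for Multi-Scale Region Aggregation (depth-wise conv over tokens)
            self.msra = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
            self.norm2 = nn.LayerNorm(dim)
            # Stand-in for Global and Local Interaction (plain self-attention)
            self.gli = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm3 = nn.LayerNorm(dim)
            self.ffn = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                     nn.Linear(dim * mlp_ratio, dim))

        def forward(self, x):  # x: (B, N, C) token sequence
            y = self.norm1(x)
            x = x + self.msra(y.transpose(1, 2)).transpose(1, 2)  # residual part 1
            y = self.norm2(x)
            x = x + self.gli(y, y, y, need_weights=False)[0]      # residual part 2
            x = x + self.ffn(self.norm3(x))                       # residual part 3
            return x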

Main results

Image Classification on ImageNet-1K:

| Model | Params. (M) | FLOPs (G) | Throughput (images/s, V100 GPU) | Throughput (images/s, Xeon 8255C @ 2.50GHz CPU) | Image Size | Top-1 (%) |
|-------|-------------|-----------|---------------------------------|--------------------------------------------------|------------|-----------|
| EATFormer-Mobile | 1.8 | 0.36 | 3926 | 456.3 | 224 x 224 | 69.4 |
| EATFormer-Lite | 3.5 | 0.91 | 2168 | 246.3 | 224 x 224 | 75.4 |
| EATFormer-Tiny | 6.1 | 1.41 | 1549 | 167.5 | 224 x 224 | 78.4 |
| EATFormer-Mini | 11.1 | 2.29 | 1055 | 122.1 | 224 x 224 | 80.9 |
| EATFormer-Small | 24.3 | 4.32 | 615 | 73.3 | 224 x 224 | 83.1 |
| EATFormer-Medium | 39.9 | 7.05 | 425 | 53.4 | 224 x 224 | 83.6 |
| EATFormer-Base | 49.0 | 8.94 | 329 | 43.7 | 224 x 224 | 83.9 |
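
Throughput figures like those above can be reproduced approximately with a simple timing loop; the batch size, warm-up, and iteration counts below are illustrative choices, not the exact benchmarking protocol used for the table:

    import time
    import torch

    @torch.no_grad()
    def measure_throughput(model, device='cuda', batch_size=64, warmup=10, iters=50):
        """Rough images/s measurement at 224 x 224 resolution."""
        model = model.eval().to(device)
        x = torch.randn(batch_size, 3, 224, 224, device=device)
        for _ in range(warmup):            # warm-up to stabilize clocks and caches
            model(x)
        if device == 'cuda':
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if device == 'cuda':
            torch.cuda.synchronize()
        return batch_size * iters / (time.perf_counter() - start)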

Object Detection and Instance Segmentation based on Mask R-CNN on COCO 2017:

| Backbone | Box mAP (1x) | Mask mAP (1x) | Box mAP (MS+3x) | Mask mAP (MS+3x) | Params. (M) | FLOPs (G) |
|----------|--------------|---------------|-----------------|------------------|-------------|-----------|
| EATFormer-Tiny | 42.3 | 39.0 | 45.4 | 41.4 | 25 | 198 |
| EATFormer-Small | 46.1 | 41.9 | 47.4 | 42.9 | 44 | 258 |
| EATFormer-Base | 47.2 | 42.8 | 49.0 | 44.2 | 68 | 349 |

Semantic Segmentation based on UperNet on ADE20K:

| Backbone | mIoU (%) | Params. (M) | FLOPs (G) |
|----------|----------|-------------|-----------|
| EATFormer-Tiny | 44.5 | 34 | 870 |
| EATFormer-Small | 47.3 | 53 | 934 |
| EATFormer-Base | 49.3 | 79 | 1030 |

Get Started

Installation

  • Clone this repo:

    git clone https://github.com/zhangzjn/EATFormer.git && cd EATFormer
  • Prepare the experimental environment (a quick sanity check in Python follows these commands):

    conda install -y pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch -c conda-forge
    pip3 install timm==0.4.12 tensorboardX einops torchprofile fvcore
    pip3 install opencv-python opencv-contrib-python imageio scikit-image scipy scikit-learn numpy-hilbert-curve==1.0.1 pyzorder==0.0.1
    pip3 install click psutil ninja ftfy regex gdown blobfile termcolor yacs tqdm glog lmdb easydict requests openpyxl paramiko
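
To verify the installation, a short Python check (the printed versions should match those pinned above):

    # sanity_check.py -- verify the pinned versions installed correctly
    import torch, torchvision, timm
    print(torch.__version__, torchvision.__version__, timm.__version__)  # expect 1.10.1 / 0.11.2 / 0.4.12
    print('CUDA available:', torch.cuda.is_available())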

Prepare ImageNet-1K Dataset

Download and extract the ImageNet-1K dataset into the following directory structure:

├── imagenet
    ├── train
        ├── n01440764
            ├── n01440764_10026.JPEG
            ├── ...
        ├── ...
    ├── train.txt (optional)
    ├── val
        ├── n01440764
            ├── ILSVRC2012_val_00000293.JPEG
            ├── ...
        ├── ...
    └── val.txt (optional)

Optionally, run data/lmdb_dataset.py to convert the dataset to LMDB format and accelerate data reading, as sketched below.
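
A generic conversion loop looks roughly like the following; the key scheme and the single big transaction are simplifying assumptions, and data/lmdb_dataset.py defines the actual format this codebase reads:

    import lmdb, os

    def folder_to_lmdb(img_root, lmdb_path, map_size=1 << 40):
        """Pack raw JPEG bytes into an LMDB file, one key per image.
        Illustrative only -- the key naming used by data/lmdb_dataset.py may differ."""
        env = lmdb.open(lmdb_path, map_size=map_size)
        with env.begin(write=True) as txn:
            for cls in sorted(os.listdir(img_root)):          # e.g. 'n01440764'
                cls_dir = os.path.join(img_root, cls)
                for name in sorted(os.listdir(cls_dir)):
                    with open(os.path.join(cls_dir, name), 'rb') as f:
                        txn.put(f'{cls}/{name}'.encode(), f.read())
        env.close()

    # folder_to_lmdb('imagenet/train', 'imagenet/train.lmdb')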

Test

  • Download pre-trained models to the pretrained directory.

  • Check the data settings in the config file configs/classification.py:

    • data.name
    • data.root
  • Check the model settings in the same config file:

    • model.name
    • model.model_kwargs['checkpoint_path']
  • Test with 8 GPUs on one node (an illustrative sketch of these config fields follows this list): ./run.sh 8 configs/classification.py test
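
For orientation, the checked fields might look roughly like this; this is a hypothetical reconstruction, not the literal contents of configs/classification.py:

    # Hypothetical shape of the relevant fields in configs/classification.py;
    # consult the actual file for the true structure and available options.
    from easydict import EasyDict

    data = EasyDict(
        name='ImageNet',                   # dataset name (assumed value)
        root='/path/to/imagenet',          # dataset root directory
    )
    model = EasyDict(
        name='EATFormer-Small',            # model variant to build (assumed value)
        model_kwargs={'checkpoint_path': 'pretrained/eatformer_small.pth'},  # hypothetical filename
    )
    trainer = EasyDict(
        resume_dir='',                     # a previous run's directory to resume from (see Train below)
    )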

Train

  • Check the data and model settings in the config file configs/classification.py

  • Train with 8 GPUs in one node: ./run.sh 8 configs/classification.py train

  • Modify the trainer.resume_dir parameter in the config to resume training from a previous run.

Downstream Tasks

Compare Params/FLOPs/Speed with SOTAs

  • python3 util/params_flops_speed.py
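
util/params_flops_speed.py performs this comparison; its core measurements can be approximated with the installed fvcore package. The function below is a generic sketch under that assumption, not the script's actual code:

    import torch
    from fvcore.nn import FlopCountAnalysis, parameter_count

    def profile(model, input_size=(1, 3, 224, 224)):
        """Count parameters and FLOPs for one forward pass at 224 x 224."""
        x = torch.randn(*input_size)
        flops = FlopCountAnalysis(model.eval(), x).total()
        params = parameter_count(model)['']   # '' key holds the total for the root module
        return params / 1e6, flops / 1e9      # report in M and G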

Citation

If our work is helpful for your research, please consider citing:

@article{zhang2021analogous,
  title={Analogous to Evolutionary Algorithm: Designing a Unified Sequence Model},
  author={Zhang, Jiangning and Xu, Chao and Li, Jian and Chen, Wenzhou and Wang, Yabiao and Tai, Ying and Chen, Shuo and Wang, Chengjie and Huang, Feiyue and Liu, Yong},
  journal={Advances in Neural Information Processing Systems},
  volume={34},
  year={2021}
}

Acknowledgements

We thank the following repositories for providing assistance for our research:

and the following works for providing source code for fair comparisons:
