Vision Transformer (ViT)

TensorFlow implementation of the Vision Transformer (ViT) presented in "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", where the authors show that Transformers applied directly to sequences of image patches and pre-trained on large datasets perform very well on image classification.
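
To make "applied directly to image patches" concrete, here is a minimal, framework-free sketch of how an image can be cut into flattened patches that then act as the Transformer's input tokens. It is an illustration only and is not taken from model.py; the function name `split_into_patches`, the nested-list image layout, and the 224 x 224 input size are assumptions made for this example.

```python
def split_into_patches(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened patches.

    Returns the patches in row-major order. Each patch is a flat list of
    patch_size * patch_size * C values. Hypothetical helper, for illustration.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Collect every channel value inside the current patch window.
            patch = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            patches.append(patch)
    return patches


# Assumed input: a 224 x 224 RGB image filled with zeros.
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
patches = split_into_patches(image, patch_size=16)
print(len(patches), len(patches[0]))  # 196 768
```

With 16 x 16 patches, a 224 x 224 RGB image becomes a sequence of 196 tokens of 768 values each. In the paper, each flattened patch is then linearly projected to the model dimension before it enters the Transformer encoder.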

Install dependencies

Create a Python 3 virtual environment and activate it:

virtualenv -p python3 venv
source ./venv/bin/activate

Next, install the required dependencies:

pip install -r requirements.txt

Train model

Start training the model by running:

python train.py --logdir path/to/log/dir

To track metrics, start TensorBoard:

tensorboard --logdir path/to/log/dir

Then open localhost:6006 in your browser.

Citation

@inproceedings{
    anonymous2021an,
    title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
    author={Anonymous},
    booktitle={Submitted to International Conference on Learning Representations},
    year={2021},
    url={https://openreview.net/forum?id=YicbFdNTTy},
    note={under review}
}
