Pre-train visual transformers through image segment shuffling.

sno6/vitshuffle


A simple visual transformer pre-training idea.

Given an input image, we break it into an n x n grid of blocks and shuffle them.

The goal of the network is then to predict, for each block in the input sequence, where it belongs in the original image.
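The objective above can be sketched in a few lines. This is a minimal illustration, not the repository's actual code: the function name `shuffle_blocks` and the use of NumPy are assumptions, and a real pipeline would batch this and feed the blocks to the transformer.

```python
import numpy as np

def shuffle_blocks(image, n):
    """Split a square image into an n x n grid of blocks, shuffle them,
    and return the shuffled blocks plus each block's original position."""
    h, w = image.shape[:2]
    bh, bw = h // n, w // n
    # Extract blocks in row-major order; index i is the position label.
    blocks = [image[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
              for r in range(n) for c in range(n)]
    perm = np.random.permutation(n * n)
    shuffled = [blocks[i] for i in perm]
    # targets[j] = original grid position of the j-th block in the sequence
    return shuffled, perm

# Example: a 4x4 image split into a 2x2 grid of 2x2 blocks
img = np.arange(16).reshape(4, 4)
shuffled, targets = shuffle_blocks(img, 2)
```

A training pair is then the shuffled block sequence as input and `targets` as per-block classification labels over the n² grid positions (e.g. trained with cross-entropy).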

TODO:

  • Continue testing against large datasets. Pre-training needs a lot of data, and GPUs are hard to acquire, which makes this difficult.
  • Relative segment positioning is likely more important than absolute positioning, and should be factored into the loss.
  • The network currently cannot learn image translation; a relative segment loss should fix this.
  • Update the README with instructions for running the model.
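One hypothetical form the relative segment target mentioned above could take: predict the (row, col) offset between pairs of blocks rather than each block's absolute position. This is only a sketch of the idea, not code from this repository; `relative_offsets` and its interface are invented for illustration.

```python
import numpy as np

def relative_offsets(positions, n):
    """For each ordered pair of blocks, the (row, col) offset between their
    original positions in an n x n grid. Pairwise offsets are unchanged if
    the whole image is translated, which absolute position labels are not."""
    rc = np.array([divmod(p, n) for p in positions])  # (k, 2) grid coords
    # offsets[i, j] = grid position of block j relative to block i
    return rc[None, :, :] - rc[:, None, :]

# Example: blocks at grid positions 0..3 of a 2x2 grid
off = relative_offsets([0, 1, 2, 3], 2)
```

A loss over these pairwise offsets (e.g. classification over the 2n - 1 possible offsets per axis) would reward the network for learning block adjacency rather than absolute placement.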
