Pre-train visual transformers through image segment shuffling.

sno6/vitshuffle


A simple visual transformer pre-training idea.

Given an input image, we break it into an n x n grid of blocks and shuffle them.

The goal of the network is then to predict, for each block in the input sequence, where it belongs in the original image.
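The objective above can be sketched in a few lines. This is a minimal illustration, not the repository's actual code: the function name `shuffle_blocks` and the use of NumPy are assumptions, and a real pipeline would batch this and feed the blocks to the transformer.

```python
import numpy as np

def shuffle_blocks(image, n):
    """Split a square image into an n x n grid of blocks, shuffle them,
    and return the shuffled blocks plus each block's original position."""
    h, w = image.shape[:2]
    bh, bw = h // n, w // n
    # Extract blocks in row-major order; index i is the position label.
    blocks = [image[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
              for r in range(n) for c in range(n)]
    perm = np.random.permutation(n * n)
    shuffled = [blocks[i] for i in perm]
    # targets[j] = original grid position of the j-th block in the sequence
    return shuffled, perm

# Example: a 4x4 image split into a 2x2 grid of 2x2 blocks
img = np.arange(16).reshape(4, 4)
shuffled, targets = shuffle_blocks(img, 2)
```

A training pair is then the shuffled block sequence as input and `targets` as per-block classification labels over the n² grid positions (e.g. trained with cross-entropy).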

TODO:

  • Continue testing against large datasets. Pre-training needs a lot of data, and GPUs are hard to acquire, which makes this difficult.
  • Relative segment positioning is likely more important than absolute positioning, and should be factored into the loss.
  • The network currently cannot learn image translation; a relative segment loss should fix this.
  • Update the README with instructions for running the model.
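One hypothetical form the relative segment target mentioned above could take: predict the (row, col) offset between pairs of blocks rather than each block's absolute position. This is only a sketch of the idea, not code from this repository; `relative_offsets` and its interface are invented for illustration.

```python
import numpy as np

def relative_offsets(positions, n):
    """For each ordered pair of blocks, the (row, col) offset between their
    original positions in an n x n grid. Pairwise offsets are unchanged if
    the whole image is translated, which absolute position labels are not."""
    rc = np.array([divmod(p, n) for p in positions])  # (k, 2) grid coords
    # offsets[i, j] = grid position of block j relative to block i
    return rc[None, :, :] - rc[:, None, :]

# Example: blocks at grid positions 0..3 of a 2x2 grid
off = relative_offsets([0, 1, 2, 3], 2)
```

A loss over these pairwise offsets (e.g. classification over the 2n - 1 possible offsets per axis) would reward the network for learning block adjacency rather than absolute placement.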
