
What's new about this model? #3

Closed
vztu opened this issue Oct 8, 2021 · 5 comments

Comments

@vztu
Copy link

vztu commented Oct 8, 2021

Why are "patches" all you need?
The patch embedding is just a Conv7x7 stem, and the body is simply a repeated Conv9x9 + Conv1x1.
(I'm not challenging your work; it's indeed very interesting.) I'm just kindly wondering: what's new about this model?

@tmp-iclr
Copy link
Collaborator

tmp-iclr commented Oct 9, 2021

Some models do indeed use a stem with k = 7 convolutions, but this is often with stride = 2. The patch embedding stem sets kernel size equal to patch size, which reduces size more than stride = 2. That is, all the dimension reduction happens immediately at the stem, in contrast to most CNNs where it happens gradually throughout the model (i.e., "pyramid shaped").
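To make the difference concrete, here is a quick back-of-the-envelope comparison of the output resolution of the two stems. The 224x224 input and the ResNet-style padding are illustrative assumptions, not values from this thread:

```python
def out_size(n, kernel, stride, padding=0):
    """Spatial output size of a convolution (floor division, standard formula)."""
    return (n + 2 * padding - kernel) // stride + 1

# Patch embedding stem: kernel == stride == patch size (7 here),
# so the whole reduction happens in one step: 224 -> 32.
patch_stem = out_size(224, kernel=7, stride=7)

# A typical strided stem (e.g. ResNet: k=7, stride=2, padding=3)
# only halves the resolution: 224 -> 112, with further downsampling later.
strided_stem = out_size(224, kernel=7, stride=2, padding=3)

print(patch_stem, strided_stem)  # 32 112
```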

It's also unusual that we use k = 9 convolutions at all, as typically stacked small-kernel convolutions are favored.

Overall, the model is exceedingly simple yet still performs very well in terms of accuracy.
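For readers following along, the description above (a patch-embedding stem, then an isotropic body of large-kernel depthwise convolutions plus pointwise convolutions) can be sketched in PyTorch roughly as follows. The specific hyperparameter values, activation, and normalization choices below are illustrative assumptions, not necessarily the exact configuration used in the paper:

```python
import torch
import torch.nn as nn

class Residual(nn.Module):
    """Wraps a module with a skip connection: x + fn(x)."""
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x) + x

def conv_mixer(dim=256, depth=8, kernel_size=9, patch_size=7, n_classes=1000):
    return nn.Sequential(
        # Patch embedding stem: kernel_size == stride == patch_size,
        # so ALL spatial downsampling happens here, not gradually.
        nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),
        nn.GELU(),
        nn.BatchNorm2d(dim),
        # Isotropic body: resolution and channel count never change.
        *[nn.Sequential(
            Residual(nn.Sequential(
                # Large-kernel depthwise convolution (spatial mixing).
                nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
                nn.GELU(),
                nn.BatchNorm2d(dim))),
            # Pointwise convolution (channel mixing).
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(),
            nn.BatchNorm2d(dim),
        ) for _ in range(depth)],
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(dim, n_classes),
    )
```

Note how the body performs no downsampling or resizing at all; the depthwise/pointwise split keeps the k = 9 convolutions cheap despite the large kernel.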

@rentainhe
Copy link


Thanks for your great work. If I understand correctly, the key idea of this paper is to evaluate the power of tokenizing inputs (patch embeddings) in simple isotropic vision models?

@tmp-iclr
Copy link
Collaborator


I would say that's a fairly accurate summary. Basically, two things happened simultaneously with the introduction of ViTs, MLP-Mixers, and their variants:
(1) convolution was replaced with new operations like self-attention or MLPs,
(2) and network designs were changed and vastly simplified, putting all the downsampling at the stem (i.e., using patch embeddings) and otherwise performing no downsampling/resizing throughout the network (i.e., isotropy).
Despite these two things being introduced simultaneously, all of the resulting performance gains have been attributed to (1). Couldn't (2) also be at least partly responsible for the performance gains? By using just convolutions instead of (1), we provide evidence that (2) is itself a powerful template for deep learning.

I'm going to go ahead and close this issue, but feel free to reopen it or open a new issue if you have more questions.

@rentainhe
Copy link

Thanks for your reply :), I hope this paper gets accepted at ICLR 2022~

@vztu
Copy link
Author

vztu commented Oct 12, 2021

Thank you for your clarification. Hope your reviews go well!
