What's new about this model? #3
Comments
Some models do indeed use a stem with k = 7 convolutions, but usually with stride = 2. The patch-embedding stem sets the kernel size (and stride) equal to the patch size, which reduces the spatial resolution far more than stride = 2 does. That is, all of the dimension reduction happens immediately at the stem, in contrast to most CNNs, where it happens gradually throughout the model (i.e., the feature maps are "pyramid shaped"). It's also unusual that we use k = 9 convolutions at all, as stacks of small-kernel convolutions are typically favored. Overall, the model is exceedingly simple yet still performs very well in terms of accuracy.
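To make the contrast concrete, here is a small sketch (using the standard convolution output-size formula; the specific sizes, 224x224 input and patch size 7, are illustrative assumptions) comparing one-shot patch-embedding downsampling with a typical stride-2 stem:

```python
def conv_out(size, kernel, stride, padding=0):
    """Spatial output size of a convolution (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

# Patch-embedding stem: kernel = stride = patch size (7 here),
# so all spatial reduction happens at once.
print(conv_out(224, kernel=7, stride=7))              # 224 -> 32

# Typical CNN stem: 7x7 conv with stride 2 and padding 3,
# which only halves the resolution; further reduction comes later.
print(conv_out(224, kernel=7, stride=2, padding=3))   # 224 -> 112
```

The patch-embedding stem takes 224x224 down to 32x32 in a single layer, whereas the stride-2 stem only reaches 112x112 and relies on later stages for the rest of the reduction.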
Thanks for your great work. As I understand it, the key idea of this paper is to evaluate the power of tokenizing the inputs into patch embeddings within a simple model.
I would say that's a fairly accurate summary. Basically, two things happened simultaneously with the introduction of ViTs, MLP-Mixers, and their variants.

I'm going to go ahead and close this issue, but feel free to reopen it or open a new issue if you have more questions.
Thanks for your reply :), I hope this paper gets accepted at ICLR 2022~
Thank you for your clarification. Hope your reviews go well! |
Why are "patches" all you need?
The patch embedding is a Conv7x7 stem,
and the body is simply repeated Conv9x9 + Conv1x1 blocks.
(Not challenging your work; it's indeed very interesting.) Just kindly wondering: what's new about this model?
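For context, the architecture described above can be sketched in a few lines of PyTorch. This is a minimal sketch based only on the description in this thread (patch-embedding stem, then repeated depthwise Conv9x9 + pointwise Conv1x1 blocks); the activation, normalization, and residual-placement choices here are illustrative assumptions, not necessarily the paper's exact configuration:

```python
import torch
import torch.nn as nn

class Residual(nn.Module):
    """Wraps a module with a skip connection: x -> fn(x) + x."""
    def __init__(self, fn):
        super().__init__()
        self.fn = fn
    def forward(self, x):
        return self.fn(x) + x

def ConvMixer(dim, depth, kernel_size=9, patch_size=7, n_classes=1000):
    return nn.Sequential(
        # Patch-embedding stem: kernel = stride = patch size, so all
        # spatial reduction happens here, at once.
        nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),
        nn.GELU(), nn.BatchNorm2d(dim),
        # Body: `depth` blocks of depthwise kxk conv + pointwise 1x1 conv.
        # 'same' padding keeps the resolution fixed (isotropic, not pyramid).
        *[nn.Sequential(
            Residual(nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
                nn.GELU(), nn.BatchNorm2d(dim))),
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(), nn.BatchNorm2d(dim),
        ) for _ in range(depth)],
        nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),
        nn.Linear(dim, n_classes))
```

With a 224x224 input and `patch_size=7`, the stem immediately produces 32x32 feature maps, and every block after it preserves that resolution; only the final pooling collapses it.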