
What's new about this model? #3

Closed
vztu opened this issue Oct 8, 2021 · 5 comments

Comments

@vztu
Copy link

vztu commented Oct 8, 2021

Why are "patches" all you need?
The patch embedding is just a Conv7x7 stem, and the body is simply a repeated Conv9x9 + Conv1x1.
(I'm not challenging your work; it's indeed very interesting.) I'm just kindly wondering: what's new about this model?

@tmp-iclr
Copy link
Collaborator

tmp-iclr commented Oct 9, 2021

Some models do indeed use a stem with k = 7 convolutions, but this is often with stride = 2. The patch embedding stem sets kernel size equal to patch size, which reduces size more than stride = 2. That is, all the dimension reduction happens immediately at the stem, in contrast to most CNNs where it happens gradually throughout the model (i.e., "pyramid shaped").
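To make the difference concrete, here is a quick back-of-the-envelope comparison of the output resolution of the two stems. The 224x224 input and the ResNet-style padding are illustrative assumptions, not values from this thread:

```python
def out_size(n, kernel, stride, padding=0):
    """Spatial output size of a convolution (floor division, standard formula)."""
    return (n + 2 * padding - kernel) // stride + 1

# Patch embedding stem: kernel == stride == patch size (7 here),
# so the whole reduction happens in one step: 224 -> 32.
patch_stem = out_size(224, kernel=7, stride=7)

# A typical strided stem (e.g. ResNet: k=7, stride=2, padding=3)
# only halves the resolution: 224 -> 112, with further downsampling later.
strided_stem = out_size(224, kernel=7, stride=2, padding=3)

print(patch_stem, strided_stem)  # 32 112
```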

It's also unusual that we use k = 9 convolutions at all, as typically stacked small-kernel convolutions are favored.

Overall, the model is exceedingly simple yet still performs very well in terms of accuracy.
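For readers following along, the description above (a patch-embedding stem, then an isotropic body of large-kernel depthwise convolutions plus pointwise convolutions) can be sketched in PyTorch roughly as follows. The specific hyperparameter values, activation, and normalization choices below are illustrative assumptions, not necessarily the exact configuration used in the paper:

```python
import torch
import torch.nn as nn

class Residual(nn.Module):
    """Wraps a module with a skip connection: x + fn(x)."""
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x) + x

def conv_mixer(dim=256, depth=8, kernel_size=9, patch_size=7, n_classes=1000):
    return nn.Sequential(
        # Patch embedding stem: kernel_size == stride == patch_size,
        # so ALL spatial downsampling happens here, not gradually.
        nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),
        nn.GELU(),
        nn.BatchNorm2d(dim),
        # Isotropic body: resolution and channel count never change.
        *[nn.Sequential(
            Residual(nn.Sequential(
                # Large-kernel depthwise convolution (spatial mixing).
                nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
                nn.GELU(),
                nn.BatchNorm2d(dim))),
            # Pointwise convolution (channel mixing).
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(),
            nn.BatchNorm2d(dim),
        ) for _ in range(depth)],
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(dim, n_classes),
    )
```

Note how the body performs no downsampling or resizing at all; the depthwise/pointwise split keeps the k = 9 convolutions cheap despite the large kernel.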

@rentainhe
Copy link


Thanks for your great work. If I understand correctly, the key idea of this paper is to evaluate the power of tokenizing inputs (patch embeddings) in simple isotropic vision models?

@tmp-iclr
Copy link
Collaborator


I would say that's a fairly accurate summary. Basically, two things happened simultaneously with the introduction of ViTs, MLP-Mixers, and their variants:
(1) convolution was replaced with new operations like self-attention or MLPs,
(2) and network designs were changed and vastly simplified, putting all the downsampling at the stem (i.e., using patch embeddings) and otherwise performing no downsampling/resizing throughout the network (i.e., isotropy).
Despite these two things being introduced simultaneously, all of the resulting performance gains have been attributed to (1). Couldn't (2) also be at least partly responsible for the performance gains? By using just convolutions instead of (1), we provide evidence that (2) is itself a powerful template for deep learning.

I'm going to go ahead and close this issue, but feel free to reopen it or open a new issue if you have more questions.

@rentainhe
Copy link

Thanks for your reply :), I hope this paper gets accepted at ICLR 2022~

@vztu
Copy link
Author

vztu commented Oct 12, 2021

Thank you for your clarification. Hope your reviews go well!
