I think the idea of the paper can be found in Fixup initialization, where the author explained why training can diverge when a scalar scale is used: hongyi-zhang/Fixup#6 (comment)
Good point! Maybe using a separate, small learning rate for those scalars is the key.
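If one wanted to try that in PyTorch, a rough sketch would be to put the scalars in their own optimizer parameter group with a smaller learning rate. Everything below (the toy module, the parameter name `alpha`, and the learning-rate values) is illustrative, not from the paper or the thread:

```python
import torch
import torch.nn as nn

# Toy block with a ReZero-style scalar gate (the name "alpha" is an assumption).
class ToyBlock(nn.Module):
    def __init__(self, d=16):
        super().__init__()
        self.fc = nn.Linear(d, d)
        self.alpha = nn.Parameter(torch.zeros(1))  # learnable scalar, init 0

    def forward(self, x):
        return x + self.alpha * self.fc(x)

model = ToyBlock()

# Give the scalars a separate, smaller learning rate via parameter groups.
scalars = [p for n, p in model.named_parameters() if "alpha" in n]
others = [p for n, p in model.named_parameters() if "alpha" not in n]
optimizer = torch.optim.Adam([
    {"params": others, "lr": 1e-3},
    {"params": scalars, "lr": 1e-4},  # hypothetical smaller lr for the scalars
])
```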
wangqiangneu changed the title from "Arxiv-20-ReZero is All You Need: Fast Convergence at Large Depth" to "20-Arxiv-ReZero is All You Need: Fast Convergence at Large Depth" on Apr 15, 2020
Overview

A modification of the residual connection that lets deeper networks train faster. The method is very simple: taking the Transformer as an example, remove LayerNorm and compute $y = x + \alpha F(x)$, where $\alpha$ is initialized to 0. The approach is quite similar to DLCL, which also uses a learnable scalar in every layer. The differences: DLCL emphasizes the relation to all previous layers, so ReZero learns $O(L)$ parameters while DLCL learns $O(L^2)$; in addition, ReZero applies this scaling to every sublayer within each Transformer layer, with $\alpha$ shared across the sublayers, whereas DLCL applies it only at the layer level, not to sublayers.
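Below is a minimal PyTorch sketch of such a ReZero-style Transformer encoder layer. The module structure, names, and hyperparameters are my own assumptions for illustration, not taken from the paper:

```python
import torch
import torch.nn as nn

class ReZeroEncoderLayer(nn.Module):
    """Transformer encoder layer with ReZero residuals: LayerNorm is removed,
    and each sublayer output is gated by a learnable scalar alpha that is
    shared by both sublayers and initialized to zero (illustrative sketch)."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.dropout = nn.Dropout(dropout)
        # ReZero: one scalar per layer, shared by all sublayers, initialized to 0.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x, attn_mask=None):
        # Self-attention sublayer: y = x + alpha * F(x), no LayerNorm.
        attn_out, _ = self.self_attn(x, x, x, attn_mask=attn_mask)
        x = x + self.alpha * self.dropout(attn_out)
        # Feed-forward sublayer, scaled by the same shared alpha.
        x = x + self.alpha * self.dropout(self.ffn(x))
        return x
```

Because $\alpha = 0$ at initialization, every layer starts out as the identity map, which is what lets very deep stacks be trained without LayerNorm.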
Paper info

Summary