
20-Arxiv-ReZero is All You Need: Fast Convergence at Large Depth #55

Open
wangqiangneu opened this issue Mar 31, 2020 · 2 comments
Labels
common Common knowledge

Comments

@wangqiangneu
Owner

Introduction

An improvement to residual connections that lets deeper networks train faster. The recipe is simple: taking the Transformer as an example, remove LayerNorm and compute $y = x + \alpha \cdot F(x)$, where $\alpha$ is initialized to 0. This is quite similar to DLCL, which also uses a learnable scalar in every layer. The differences: DLCL emphasizes the connections to all previous layers, so ReZero learns O(L) parameters while DLCL learns O(L^2); moreover, ReZero applies the scaling in every sublayer of each Transformer layer, with $\alpha$ shared across the sublayers, whereas DLCL operates only at the layer level, not on the sublayers.
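A minimal PyTorch sketch of the ReZero rule described above; this is an illustrative reconstruction rather than the authors' code, and the names (`ReZeroEncoderLayer`, `alpha`) are my own.

```python
# Sketch of a ReZero Transformer encoder layer (illustrative, not the paper's code).
import torch
import torch.nn as nn

class ReZeroEncoderLayer(nn.Module):
    def __init__(self, d_model=512, nhead=8, dim_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead)
        self.ff = nn.Sequential(
            nn.Linear(d_model, dim_ff), nn.ReLU(), nn.Linear(dim_ff, d_model)
        )
        # One learnable scalar per layer, shared by both sublayers, initialized to 0.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        # No LayerNorm: each sublayer is x + alpha * F(x).
        attn_out, _ = self.self_attn(x, x, x)
        x = x + self.alpha * attn_out
        x = x + self.alpha * self.ff(x)
        return x
```

Because $\alpha$ starts at 0, every layer is the identity map at initialization, which is what the paper credits for stable, fast training at large depth without LayerNorm.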

Paper info

Summary

  • Analyzing the input-output Jacobian is the right angle of attack.
  • The method is very simple and worth trying as a baseline; dropping LN is also nice, since it saves a bit of computation.
  • Judging from the issues in the official repo, applying it directly to GPT does not seem to work and yields NaNs; the authors suggest initializing the embeddings with U(-1/d, +1/d) (see the sketch after this list), so some tuning may still be needed in practice.
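A hedged sketch of the U(-1/d, +1/d) embedding initialization mentioned in the last bullet; the exact setup in the authors' repo may differ, and the sizes below are illustrative.

```python
import torch.nn as nn

d_model, vocab_size = 512, 32000          # illustrative sizes, not from the paper
emb = nn.Embedding(vocab_size, d_model)
# Uniform init in [-1/d, +1/d], as suggested in the ReZero repo issue
nn.init.uniform_(emb.weight, -1.0 / d_model, 1.0 / d_model)
```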
@wangqiangneu wangqiangneu added the common Common knowledge label Mar 31, 2020
@hukkai

hukkai commented Mar 31, 2020

I think the idea of the paper can be found in Fixup initialization, where the author explained why there would be some training divergence for the scalar scale: hongyi-zhang/Fixup#6 (comment)

@wangqiangneu
Owner Author

> I think the idea of the paper can be found in Fixup initialization, where the author explained why there would be some training divergence for the scalar scale: hongyi-zhang/Fixup#6 (comment)

Good point! Maybe using a separate, smaller learning rate for those scalars is the key.

@wangqiangneu wangqiangneu changed the title Arxiv-20-ReZero is All You Need: Fast Convergence at Large Depth 20-Arxiv-ReZero is All You Need: Fast Convergence at Large Depth Apr 15, 2020