
20-Arxiv-ReZero is All You Need: Fast Convergence at Large Depth #55

Open
wangqiangneu opened this issue Mar 31, 2020 · 2 comments
Labels
common Common knowledge

Comments

@wangqiangneu
Owner

Introduction

An improvement to residual connections that lets deeper networks train faster. The recipe is simple: taking the Transformer as an example, remove LayerNorm and compute $y = x + \alpha \cdot F(x)$, where $\alpha$ is initialized to 0. This is quite similar to DLCL, which also uses a learnable scalar in every layer. The differences: DLCL emphasizes the connections to all previous layers, so ReZero learns O(L) parameters while DLCL learns O(L^2); moreover, ReZero applies the scaling in every sublayer of each Transformer layer, with $\alpha$ shared across the sublayers, whereas DLCL operates only at the layer level, not on the sublayers.
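A minimal PyTorch sketch of the ReZero rule described above; this is an illustrative reconstruction rather than the authors' code, and the names (`ReZeroEncoderLayer`, `alpha`) are my own.

```python
# Sketch of a ReZero Transformer encoder layer (illustrative, not the paper's code).
import torch
import torch.nn as nn

class ReZeroEncoderLayer(nn.Module):
    def __init__(self, d_model=512, nhead=8, dim_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead)
        self.ff = nn.Sequential(
            nn.Linear(d_model, dim_ff), nn.ReLU(), nn.Linear(dim_ff, d_model)
        )
        # One learnable scalar per layer, shared by both sublayers, initialized to 0.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        # No LayerNorm: each sublayer is x + alpha * F(x).
        attn_out, _ = self.self_attn(x, x, x)
        x = x + self.alpha * attn_out
        x = x + self.alpha * self.ff(x)
        return x
```

Because $\alpha$ starts at 0, every layer is the identity map at initialization, which is what the paper credits for stable, fast training at large depth without LayerNorm.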

Paper info

Summary

  • Analyzing the input-output Jacobian is the right angle of attack.
  • The method is very simple and worth trying as a baseline; dropping LN is also nice, since it saves a bit of computation.
  • Judging from the issues in the official repo, applying it directly to GPT does not seem to work and yields NaNs; the authors suggest initializing the embeddings with U(-1/d, +1/d) (see the sketch after this list), so some tuning may still be needed in practice.
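A hedged sketch of the U(-1/d, +1/d) embedding initialization mentioned in the last bullet; the exact setup in the authors' repo may differ, and the sizes below are illustrative.

```python
import torch.nn as nn

d_model, vocab_size = 512, 32000          # illustrative sizes, not from the paper
emb = nn.Embedding(vocab_size, d_model)
# Uniform init in [-1/d, +1/d], as suggested in the ReZero repo issue
nn.init.uniform_(emb.weight, -1.0 / d_model, 1.0 / d_model)
```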
@wangqiangneu wangqiangneu added the common Common knowledge label Mar 31, 2020
@hukkai

hukkai commented Mar 31, 2020

I think the idea of the paper can be found in Fixup initialization, where the author explained why there would be some training divergence for the scalar scale: hongyi-zhang/Fixup#6 (comment)

@wangqiangneu
Owner Author

> I think the idea of the paper can be found in Fixup initialization, where the author explained why there would be some training divergence for the scalar scale: hongyi-zhang/Fixup#6 (comment)

Good point! Maybe using a separate, smaller learning rate for those scalars is the key.

@wangqiangneu wangqiangneu changed the title Arxiv-20-ReZero is All You Need: Fast Convergence at Large Depth 20-Arxiv-ReZero is All You Need: Fast Convergence at Large Depth Apr 15, 2020