
Does part of the performance gain come from residual attention? #9

Closed
jerrywn121 opened this issue Jan 15, 2023 · 3 comments

Comments

@jerrywn121

I noticed that the Transformer implementation here uses residual attention, which does not appear in some of the other baselines mentioned in the paper. Have you performed additional ablation studies to see the effect of residual attention on forecasting?
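
For reference, here is a minimal sketch (not the authors' code) of residual attention in the RealFormer / TST style: each self-attention layer adds the previous layer's pre-softmax attention scores to its own logits before the softmax, and passes the new scores on to the next layer. Shapes and class names are illustrative assumptions.

```python
# Hypothetical sketch of residual attention; not taken from this repository.
import math
import torch
import torch.nn as nn


class ResidualAttention(nn.Module):
    """Scaled dot-product attention with a residual path on the score matrix."""

    def forward(self, q, k, v, prev_scores=None):
        # q, k, v: (batch, n_heads, seq_len, d_head)
        d_head = q.size(-1)
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_head)
        if prev_scores is not None:
            # residual connection on the attention logits
            scores = scores + prev_scores
        attn = torch.softmax(scores, dim=-1)
        out = torch.matmul(attn, v)  # (batch, n_heads, seq_len, d_head)
        return out, scores           # scores are reused by the next layer


# Usage: the encoder threads the score tensor through its layers, e.g.
#   scores = None
#   for layer in encoder_layers:
#       x, scores = layer(x, prev_scores=scores)
```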

@jerrywn121
Author

Also, it would be great to know how much RevIN benefits the patch Transformer. Thank you!

@yuqinie98
Owner

yuqinie98 commented Jan 15, 2023

Hi @jerrywn121, thanks for your questions! We use residual attention because we want to stay as close as possible to the original Transformer model from NLP, and to isolate only the key factors that make it work well on time series datasets (patching; channel-independence). Residual attention was also used in previous papers such as TST (https://arxiv.org/abs/2010.02803). We haven't studied its effect yet.

For RevIN, we report an additional ablation in Table 11 of the paper. In general, it improves the model marginally. Hope this helps.
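
For context, a minimal sketch of RevIN (reversible instance normalization) is shown below: each input series is normalized per instance and per channel before the backbone, and the forecast is de-normalized with the stored statistics afterwards. The class and argument names are illustrative assumptions, not the repository's API.

```python
# Hypothetical RevIN sketch; not the repository's implementation.
import torch
import torch.nn as nn


class RevIN(nn.Module):
    def __init__(self, num_channels: int, eps: float = 1e-5, affine: bool = True):
        super().__init__()
        self.eps = eps
        self.affine = affine
        if affine:
            # optional learnable per-channel affine parameters
            self.weight = nn.Parameter(torch.ones(num_channels))
            self.bias = nn.Parameter(torch.zeros(num_channels))

    def forward(self, x: torch.Tensor, mode: str) -> torch.Tensor:
        # x: (batch, seq_len, num_channels)
        if mode == "norm":
            # statistics are computed over the time dimension of each instance
            self.mean = x.mean(dim=1, keepdim=True).detach()
            self.std = torch.sqrt(
                x.var(dim=1, keepdim=True, unbiased=False) + self.eps
            ).detach()
            x = (x - self.mean) / self.std
            if self.affine:
                x = x * self.weight + self.bias
            return x
        if mode == "denorm":
            if self.affine:
                x = (x - self.bias) / (self.weight + self.eps)
            return x * self.std + self.mean
        raise ValueError("mode must be 'norm' or 'denorm'")
```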

@jerrywn121
Author

Got it. Thank you for your reply!
