
Does part of the performance gain come from residual attention? #9

Closed
jerrywn121 opened this issue Jan 15, 2023 · 3 comments

Comments

@jerrywn121

I noticed that the Transformer implementation here uses residual attention, which does not appear in some of the other baselines mentioned in the paper. Have you performed additional ablation studies to see the effect of residual attention on forecasting?
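
For reference, here is a minimal sketch (not the authors' code) of residual attention in the RealFormer / TST style: each self-attention layer adds the previous layer's pre-softmax attention scores to its own logits before the softmax, and passes the new scores on to the next layer. Shapes and class names are illustrative assumptions.

```python
# Hypothetical sketch of residual attention; not taken from this repository.
import math
import torch
import torch.nn as nn


class ResidualAttention(nn.Module):
    """Scaled dot-product attention with a residual path on the score matrix."""

    def forward(self, q, k, v, prev_scores=None):
        # q, k, v: (batch, n_heads, seq_len, d_head)
        d_head = q.size(-1)
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_head)
        if prev_scores is not None:
            # residual connection on the attention logits
            scores = scores + prev_scores
        attn = torch.softmax(scores, dim=-1)
        out = torch.matmul(attn, v)  # (batch, n_heads, seq_len, d_head)
        return out, scores           # scores are reused by the next layer


# Usage: the encoder threads the score tensor through its layers, e.g.
#   scores = None
#   for layer in encoder_layers:
#       x, scores = layer(x, prev_scores=scores)
```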

@jerrywn121
Author

Also, it would be great to know how much RevIN benefits the patch Transformer. Thank you!

@yuqinie98
Owner

yuqinie98 commented Jan 15, 2023

Hi @jerrywn121, thanks for your questions! We use residual attention because we want to stay as close as possible to the original Transformer model from NLP, and to isolate only the key factors that make it work well on time series datasets (patching; channel-independence). Residual attention was also used in previous papers such as TST (https://arxiv.org/abs/2010.02803). We haven't studied its effect yet.

For RevIN, we report an additional ablation in Table 11 of the paper. In general, it improves the model marginally. Hope this helps.
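
For context, a minimal sketch of RevIN (reversible instance normalization) is shown below: each input series is normalized per instance and per channel before the backbone, and the forecast is de-normalized with the stored statistics afterwards. The class and argument names are illustrative assumptions, not the repository's API.

```python
# Hypothetical RevIN sketch; not the repository's implementation.
import torch
import torch.nn as nn


class RevIN(nn.Module):
    def __init__(self, num_channels: int, eps: float = 1e-5, affine: bool = True):
        super().__init__()
        self.eps = eps
        self.affine = affine
        if affine:
            # optional learnable per-channel affine parameters
            self.weight = nn.Parameter(torch.ones(num_channels))
            self.bias = nn.Parameter(torch.zeros(num_channels))

    def forward(self, x: torch.Tensor, mode: str) -> torch.Tensor:
        # x: (batch, seq_len, num_channels)
        if mode == "norm":
            # statistics are computed over the time dimension of each instance
            self.mean = x.mean(dim=1, keepdim=True).detach()
            self.std = torch.sqrt(
                x.var(dim=1, keepdim=True, unbiased=False) + self.eps
            ).detach()
            x = (x - self.mean) / self.std
            if self.affine:
                x = x * self.weight + self.bias
            return x
        if mode == "denorm":
            if self.affine:
                x = (x - self.bias) / (self.weight + self.eps)
            return x * self.std + self.mean
        raise ValueError("mode must be 'norm' or 'denorm'")
```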

@jerrywn121
Author

Got it. Thank you for your reply!
