Question about the difference between Transformer implementation vs original architecture in the paper. #2

tumbleintoyourheart · 2020-07-31T08:01:48Z

Hi,

In Fig. 2 of End-to-End Neural Speaker Diarization with Self-attention, LayerNorm is applied after the Encoder blocks, but in your implementation the order is reversed. Is there a particular reason for that?

Have a good day.

Xflick · 2020-07-31T08:20:22Z

Hi, actually my model implementation strictly follows the one in the paper.

If you look into PyTorch's TransformerEncoderLayer implementation, you will find that it runs in the order: self_attn->residual->norm->pointwise_ff->residual->norm. However, in Fig. 2 of End-to-End Neural Speaker Diarization with Self-attention, the encoder block is defined as: norm->self_attn->residual->norm->pointwise_ff->residual, with an additional layer_norm at the end (before linear+sigmoid).

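Thus, applying layer_norm once at the beginning, before the stacked PyTorch encoder layers, gives the same sequence of operations as in the paper: the norm that closes each PyTorch layer plays the role of the norm that opens the next block in the paper, and the last one is the final layer_norm.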
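To make the reordering argument easy to check, here is a minimal sketch that only compares the operation sequences quoted above, without building any model. The function names `pytorch_post_norm_ops` and `paper_ops` are hypothetical and do not come from the repository or the paper. The comparison assumes each residual connection branches from the same tensor in both orderings.

```python
# Sketch: compare the op order of the two formulations as plain lists.
# Function names are illustrative, not part of PyTorch or the repository.

def pytorch_post_norm_ops(n_layers):
    """One leading layer_norm, then n post-norm TransformerEncoderLayer blocks."""
    ops = ["norm"]  # the extra layer_norm applied before the encoder stack
    for _ in range(n_layers):
        ops += ["self_attn", "residual", "norm",
                "pointwise_ff", "residual", "norm"]
    return ops


def paper_ops(n_layers):
    """n blocks in the order described for Fig. 2, then a final layer_norm."""
    ops = []
    for _ in range(n_layers):
        ops += ["norm", "self_attn", "residual",
                "norm", "pointwise_ff", "residual"]
    ops.append("norm")  # final layer_norm before linear+sigmoid
    return ops


# The two sequences match for any number of stacked blocks.
for n in (1, 2, 4):
    assert pytorch_post_norm_ops(n) == paper_ops(n)
```

In PyTorch terms this would presumably correspond to an `nn.LayerNorm` followed by an `nn.TransformerEncoder` built from layers with the default `norm_first=False`, though the exact layout used in this repository is not shown in the thread.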

tumbleintoyourheart · 2020-07-31T08:30:45Z

Yeah, then everything makes sense. What a neat adaptation, thanks. Closing this now.

tumbleintoyourheart closed this as completed Jul 31, 2020
