question about attention layer shared weight #113

pluone · 2024-06-15T04:53:12Z

As I understand it, the attention layer here uses a single set of weights (W_Q, W_K, W_V) that is shared across all channels. Shouldn't each channel use its own independent weights (W_Q, W_K, W_V) for the calculation, or shouldn't there at least be an option for that?

class _MultiheadAttention(nn.Module):
    def __init__(self, d_model, n_heads, d_k=None, d_v=None, res_attention=False, attn_dropout=0., proj_dropout=0., qkv_bias=True, lsa=False):
        """Multi Head Attention Layer
        Input shape:
            Q:       [batch_size (bs) x max_q_len x d_model]
            K, V:    [batch_size (bs) x q_len x d_model]
            mask:    [q_len x q_len]
        """
        # ...

        self.W_Q = nn.Linear(d_model, d_k * n_heads, bias=qkv_bias)
        self.W_K = nn.Linear(d_model, d_k * n_heads, bias=qkv_bias)
        self.W_V = nn.Linear(d_model, d_v * n_heads, bias=qkv_bias)

        # Scaled Dot-Product Attention (multiple heads)
        # ...
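
To make the request concrete, here is a rough sketch of what such an option might look like. It is not code from this repository: the class name `PerChannelQKV`, the `n_vars` and `shared` arguments, and the assumption that the input keeps an explicit channel axis (`[bs x n_vars x q_len x d_model]`) are all hypothetical and only for illustration. With `shared=True` one set of projections serves every channel; with `shared=False` each channel gets its own W_Q, W_K and W_V.

```python
import torch
import torch.nn as nn


class PerChannelQKV(nn.Module):
    """Hypothetical Q/K/V projection with optional per-channel weights.

    Input shape:
        x:       [batch_size (bs) x n_vars x q_len x d_model]
    Output shape:
        q, k:    [bs x n_vars x q_len x n_heads * d_k]
        v:       [bs x n_vars x q_len x n_heads * d_v]
    """

    def __init__(self, n_vars, d_model, n_heads, d_k=None, d_v=None,
                 qkv_bias=True, shared=True):
        super().__init__()
        # Assumed defaults: split d_model evenly across heads.
        d_k = d_model // n_heads if d_k is None else d_k
        d_v = d_model // n_heads if d_v is None else d_v
        self.shared = shared
        n_sets = 1 if shared else n_vars  # one weight set, or one per channel

        def make(out_dim):
            return nn.ModuleList(
                [nn.Linear(d_model, out_dim, bias=qkv_bias) for _ in range(n_sets)]
            )

        self.W_Q = make(d_k * n_heads)
        self.W_K = make(d_k * n_heads)
        self.W_V = make(d_v * n_heads)

    def _project(self, layers, x):
        if self.shared:
            # nn.Linear acts on the last dim, so all channels share the weights.
            return layers[0](x)
        # Apply channel c's own layer to x[:, c], then restore the channel axis.
        return torch.stack(
            [layer(x[:, c]) for c, layer in enumerate(layers)], dim=1
        )

    def forward(self, x):
        return (self._project(self.W_Q, x),
                self._project(self.W_K, x),
                self._project(self.W_V, x))


# Usage: 3 channels, each with its own projections.
x = torch.randn(2, 3, 5, 16)
proj = PerChannelQKV(n_vars=3, d_model=16, n_heads=4, shared=False)
q, k, v = proj(x)
print(q.shape)  # torch.Size([2, 3, 5, 16])
```

The independent variant multiplies the projection parameters by the number of channels, which is presumably the trade-off such an option would expose.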
