### **9. The Complete Decoder Layer Architecture**

The decoder layer is more complex than the encoder layer: it has **three sublayers** instead of two.

```python
import torch.nn as nn


class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super().__init__()  # must run before any submodule is assigned
        # Sublayer 1: Masked Multi-Head Self-Attention
        self.masked_self_attention = MultiHeadAttention(d_model, num_heads, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        
        # Sublayer 2: Multi-Head Cross-Attention
        self.cross_attention = MultiHeadAttention(d_model, num_heads, dropout)
        self.norm2 = nn.LayerNorm(d_model)
        
        # Sublayer 3: Feed-Forward Network
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff, dropout)
        self.norm3 = nn.LayerNorm(d_model)
        
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x, encoder_output, src_mask=None, tgt_mask=None):
        # Sublayer 1: Masked Self-Attention + Add & Norm
        attn_output = self.masked_self_attention(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(attn_output))
        
        # Sublayer 2: Cross-Attention + Add & Norm
        # Query from Decoder, Key & Value from Encoder!
        cross_attn_output = self.cross_attention(x, encoder_output, encoder_output, src_mask)
        x = self.norm2(x + self.dropout(cross_attn_output))
        
        # Sublayer 3: Feed-Forward + Add & Norm
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))
        
        return x
```
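
The `tgt_mask` argument is where the causal constraint enters the layer. As a minimal sketch, the mask can be built as a lower-triangular matrix. The helper name `make_causal_mask` is hypothetical, and the convention that `1` means "may attend" and `0` means "blocked" is an assumption about how `MultiHeadAttention` interprets its mask, so adjust it if your implementation differs.

```python
import torch


def make_causal_mask(tgt_len: int) -> torch.Tensor:
    """Return a (tgt_len, tgt_len) mask where entry [i, j] is 1 iff j <= i.

    Assumed convention: 1 = position i may attend to position j, 0 = blocked.
    """
    return torch.tril(torch.ones(tgt_len, tgt_len, dtype=torch.long))


tgt_mask = make_causal_mask(4)
print(tgt_mask)
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]])
```

Row `i` of the mask lists the positions that target token `i` may look at: itself and everything before it, but nothing after.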

**Data Flow:**

```
Input x (from previous decoder layer or embeddings)
   ↓
   ├──────────────────────────┐
   ↓                          │
[Masked Self-Attention]       │
   (with causal mask)         │
   ↓                          │
   + ←────────────────────────┘
   ↓
[Layer Norm]
   ↓
   ├──────────────────────────┐
   ↓                          │
[Cross-Attention]             │
   Q: from decoder            │
   K,V: from encoder_output   │
   ↓                          │
   + ←────────────────────────┘
   ↓
[Layer Norm]
   ↓
   ├──────────────────────────┐
   ↓                          │
[Feed-Forward Network]        │
   ↓                          │
   + ←────────────────────────┘
   ↓
[Layer Norm]
   ↓
Output
```
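
The flow above can be traced end to end with a small runnable sketch. It does not use the tutorial's `MultiHeadAttention` or `PositionWiseFeedForward` classes. As an assumption for illustration, PyTorch's built-in `nn.MultiheadAttention` and a plain two-layer MLP stand in for them, and dropout is left out so the result is deterministic. Note that the built-in module uses the opposite mask convention: `True` marks positions that may *not* be attended to.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, num_heads, d_ff = 16, 4, 32

# Stand-ins (assumption): built-in attention and a simple MLP instead of the tutorial's classes.
self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
norm1, norm2, norm3 = (nn.LayerNorm(d_model) for _ in range(3))


def decoder_layer(x, encoder_output):
    tgt_len = x.size(1)
    # Built-in convention: True above the diagonal blocks attention to future positions.
    causal = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1)
    a, _ = self_attn(x, x, x, attn_mask=causal)            # masked self-attention
    x = norm1(x + a)                                       # add & norm
    c, _ = cross_attn(x, encoder_output, encoder_output)   # Q: decoder, K/V: encoder
    x = norm2(x + c)                                       # add & norm
    x = norm3(x + ffn(x))                                  # feed-forward, add & norm
    return x


x = torch.randn(2, 5, d_model)       # (batch, tgt_len, d_model)
memory = torch.randn(2, 7, d_model)  # (batch, src_len, d_model)

with torch.no_grad():
    out = decoder_layer(x, memory)
    # Change only the last target position and run the layer again.
    x_changed = x.clone()
    x_changed[:, -1] = torch.randn(2, d_model)
    out_changed = decoder_layer(x_changed, memory)

print(out.shape)  # torch.Size([2, 5, 16])
# Earlier positions should be unaffected by a change at a later position.
print(torch.allclose(out[:, :-1], out_changed[:, :-1], atol=1e-5))
```

The output keeps the target sequence's shape regardless of the source length, and the final check illustrates why the causal mask matters: editing a later token leaves the outputs at earlier positions untouched.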

### **10. Encoder vs Decoder: Key Differences**

Let's summarize the key differences:

| Aspect | Encoder | Decoder |
|--------|---------|--------|
| **Number of Sublayers** | 2 | 3 |
| **Self-Attention Type** | Bidirectional (each position sees all positions) | Masked (each position sees only itself and earlier positions) |
| **Cross-Attention** | ❌ None | ✅ Yes (to encoder output) |
| **Causal Mask** | ❌ Not needed | ✅ Required |
| **Purpose** | Understand input | Generate output |
| **Source of Q, K, V** | All from the same sequence | Self-attn: all from target; cross-attn: Q from target, K and V from encoder output |
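
The Q, K, V row of the table is easiest to see in the shape of the attention weights. The sketch below again uses PyTorch's built-in `nn.MultiheadAttention` as a stand-in for the tutorial's own class (an assumption), and reuses a single module for both calls purely to compare shapes. A real decoder layer keeps separate weights for self-attention and cross-attention.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, num_heads = 16, 4
attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

tgt = torch.randn(1, 3, d_model)     # decoder states: 3 target positions
memory = torch.randn(1, 6, d_model)  # encoder output: 6 source positions

with torch.no_grad():
    # Self-attention: Q, K, V all come from the target sequence.
    self_out, self_w = attn(tgt, tgt, tgt)
    # Cross-attention: Q from the target, K and V from the encoder output.
    cross_out, cross_w = attn(tgt, memory, memory)

print(self_w.shape)     # torch.Size([1, 3, 3])
print(cross_w.shape)    # torch.Size([1, 3, 6])
print(cross_out.shape)  # torch.Size([1, 3, 16])
```

The weight matrix has one row per query and one column per key, so cross-attention yields a `tgt_len x src_len` matrix, while its output still has one vector per target position.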

**Visual Comparison:**

```
ENCODER LAYER                    DECODER LAYER

┌────────────────────┐          ┌────────────────────┐
│  Self-Attention    │          │ Masked Self-Attn   │
│  (Bidirectional)   │          │     (Causal)       │
└────────────────────┘          └────────────────────┘
         ↓                               ↓
    [Add & Norm]                    [Add & Norm]
         ↓                               ↓
                                ┌────────────────────┐
                                │  Cross-Attention   │
                                │ (to Encoder output)│
                                └────────────────────┘
                                         ↓
                                    [Add & Norm]
         ↓                               ↓
┌────────────────────┐          ┌────────────────────┐
│   Feed-Forward     │          │   Feed-Forward     │
└────────────────────┘          └────────────────────┘
         ↓                               ↓
    [Add & Norm]                    [Add & Norm]
         ↓                               ↓
      Output                          Output
```
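
The same structural difference shows up in PyTorch's built-in layers, which can serve as a quick sanity check. This sketch inspects `nn.TransformerEncoderLayer` and `nn.TransformerDecoderLayer`, not the classes defined in this tutorial, and the helpers `count_params` and `child_names` are hypothetical names introduced only for the example.

```python
import torch.nn as nn

d_model, num_heads, d_ff = 64, 4, 128

enc = nn.TransformerEncoderLayer(d_model, num_heads, dim_feedforward=d_ff)
dec = nn.TransformerDecoderLayer(d_model, num_heads, dim_feedforward=d_ff)


def count_params(module: nn.Module) -> int:
    """Total number of learnable scalars in a module."""
    return sum(p.numel() for p in module.parameters())


def child_names(module: nn.Module, cls) -> list:
    """Names of the direct submodules that are instances of cls."""
    return [name for name, child in module.named_children() if isinstance(child, cls)]


print(child_names(enc, nn.MultiheadAttention))  # ['self_attn']
print(child_names(dec, nn.MultiheadAttention))  # ['self_attn', 'multihead_attn']
print(len(child_names(enc, nn.LayerNorm)), len(child_names(dec, nn.LayerNorm)))  # 2 3

# The decoder's extra parameters are one attention block (4*d^2 + 4*d)
# plus one LayerNorm (2*d).
extra = count_params(dec) - count_params(enc)
print(extra, extra == 4 * d_model * d_model + 6 * d_model)  # 16768 True
```

In other words, the decoder layer is the encoder layer plus one cross-attention block and its accompanying Add & Norm step, as the diagram shows.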