## Encoder-Decoder Architecture

The encoder–decoder (or sequence-to-sequence) transformer maps an input sequence to an output sequence of potentially different length. It combines a bidirectional encoder that builds rich representations of the source and an autoregressive decoder that generates the target conditioned on both past tokens and encoder outputs.

![Encoder-Decoder](transformer-architectures.png)

### Components
1. **Input Embeddings**
- Token embeddings: map vocabulary tokens to vectors.
- Positional embeddings: inject order information.
- (Optional) Segment embeddings: distinguish multiple inputs.
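
To make the embedding step concrete, here is a minimal sketch in plain Python. It assumes fixed sinusoidal position encodings, which is only one option (many models, BART included, learn their position embeddings instead). The names `sinusoidal_position`, `embed`, and `token_table` are hypothetical and exist only for this illustration.

```python
import math
import random


def sinusoidal_position(pos, d_model):
    """Fixed sinusoidal encoding for one position (even dims: sin, odd dims: cos)."""
    vec = []
    for i in range(d_model):
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def embed(token_ids, token_table):
    """Add each token's embedding to the encoding of its position."""
    d_model = len(token_table[0])
    return [
        [t + p for t, p in zip(token_table[tok], sinusoidal_position(pos, d_model))]
        for pos, tok in enumerate(token_ids)
    ]


# Hypothetical toy setup: vocabulary of 5 tokens, model width 4, random table.
random.seed(0)
token_table = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(5)]

embedded = embed([2, 0, 2], token_table)

# Token id 2 appears at positions 0 and 2; the position term makes the vectors differ.
print(embedded[0] != embedded[2])
```

Without the position term, the two occurrences of token 2 would be indistinguishable to the attention layers.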


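2. **Encoder Stack**
Each encoder layer applies, in order:
- Multi-head self-attention
- Add & layer normalize (residual connection, then layer normalization)
- Feed-forward network (two linear layers + activation)
- Add & layer normalize

Self-attention in the encoder is unmasked, so every layer lets each position attend to the full input in both directions.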
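
The "add & layer normalize" step around each sublayer can be sketched as below. This is a simplified illustration, not a full encoder layer: the learned gain and bias of layer normalization and the linear-layer biases are omitted, ReLU is assumed as the activation, and the weights `w1` and `w2` are hypothetical toy values.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize one position's vector to zero mean and (almost) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])


def feed_forward(vec, w1, w2):
    """Two linear layers with a ReLU in between; each row holds one unit's weights."""
    hidden = [max(0.0, sum(v * w for v, w in zip(vec, row))) for row in w1]
    return [sum(h * w for h, w in zip(hidden, row)) for row in w2]


# Hypothetical toy weights: width 4 -> 2 hidden units -> width 4.
w1 = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

x = [1.0, 2.0, 3.0, 4.0]                      # one position's hidden state
y = add_and_norm(x, feed_forward(x, w1, w2))  # feed-forward sublayer + add & norm
print([round(v, 3) for v in y])
```

The same `add_and_norm` wrapper would apply after the self-attention sublayer; only the sublayer output changes.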


3. **Decoder Stack**
Each decoder layer applies, in order:
- Masked multi-head self-attention (prevents “peeking” at future tokens)
- Add & layer normalize
- Multi-head cross-attention (queries from decoder, keys/values from encoder)
- Add & layer normalize
- Feed-forward network
- Add & layer normalize

The decoder attends both to its own past outputs and to the encoder’s final-layer representations.
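
The masking in decoder self-attention can be illustrated with a small sketch. It assumes the raw attention scores are already computed (here, a hypothetical 3×3 matrix of zeros) and shows only how future positions are excluded before the softmax.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_weights(scores):
    """Set scores for future positions (column j > row i) to -inf, then normalize rows."""
    n = len(scores)
    masked = [
        [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        for i in range(n)
    ]
    return [softmax(row) for row in masked]


# Hypothetical uniform scores for a 3-token target prefix.
scores = [[0.0] * 3 for _ in range(3)]
weights = causal_attention_weights(scores)

for row in weights:
    print([round(w, 3) for w in row])
# Row i has non-zero weight only on positions 0..i.
```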


4. **Cross-Attention Mechanism**
At each decoding step, cross-attention lets the decoder query encoder outputs. This aligns source and target, enabling the model to focus on relevant parts of the input when generating each token.
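
As a rough sketch of this mechanism, the function below computes single-head scaled dot-product attention with queries taken from decoder states and keys/values taken from encoder states. It is a simplification: the learned query/key/value projections and the multi-head split are omitted, and the toy vectors are made up for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(decoder_states, encoder_states):
    """Queries come from the decoder; keys and values come from the encoder."""
    d = len(encoder_states[0])
    outputs, weights = [], []
    for q in decoder_states:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in encoder_states
        ]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of encoder states (used here as values).
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, encoder_states)) for c in range(d)]
        )
    return outputs, weights


# Hypothetical toy states: 3 source positions, 2 target positions, width 2.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
decoder_states = [[4.0, 0.0], [0.0, 4.0]]

outputs, weights = cross_attention(decoder_states, encoder_states)
for step, w in enumerate(weights):
    print(f"target step {step} attends most to source position {w.index(max(w))}")
```

Each target position gets its own distribution over source positions, which is what allows the output length to differ from the input length.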


5. **Output Head**
A linear layer, whose weights may be tied to the token embedding matrix, projects decoder hidden states to vocabulary logits. A softmax converts the logits to probabilities for next-token prediction.
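
The tied-weight case can be sketched as follows: the logit for each vocabulary token is the dot product of the decoder hidden state with that token's embedding row. This assumes no output bias, and the three-token `embedding_table` is a hypothetical toy example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def output_head(hidden, embedding_table):
    """Tied projection: reuse the embedding rows as the output weight matrix."""
    logits = [sum(h * e for h, e in zip(hidden, row)) for row in embedding_table]
    return softmax(logits)


# Hypothetical toy vocabulary of 3 tokens with width-2 embeddings.
embedding_table = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

probs = output_head([2.0, 0.5], embedding_table)
next_token = max(range(len(probs)), key=probs.__getitem__)
print(next_token, [round(p, 3) for p in probs])
```

An untied head would use a separate weight matrix of the same shape in place of `embedding_table`.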


**Applications**
  
- Machine Translation (e.g., English→German)
- Summarization (news, documents)
- Question Answering (generate answers from passages)
- Paraphrasing & Style Transfer
- Code Generation (comment→code, code→docstring)


**When to Use**

- Your task requires conditional generation (output depends on an input sequence).
- Input and output differ in length or share little vocabulary.
- You want a single model that both encodes the input and generates text: the encoder reads the input once, and the decoder then generates the output token by token.


## Python Code: Summarization with BART

```python
from transformers import BartTokenizer, BartForConditionalGeneration

# 1. Load pretrained model & tokenizer
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
model.eval()  # inference mode: disables dropout

# 2. Prepare input text
article = """
The Amazon rainforest, known as the 'lungs of the planet', is under threat due to
deforestation. Recent studies indicate that unprecedented rates of tree loss have
led to shifts in biodiversity and climate patterns in South America.
"""

# 3. Tokenize and encode (inputs longer than 1024 tokens are truncated)
inputs = tokenizer(
    article,
    max_length=1024,
    truncation=True,
    return_tensors="pt"
)

# 4. Generate summary with beam search
summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],  # pass the mask explicitly
    num_beams=4,
    length_penalty=2.0,
    max_length=120,
    early_stopping=True
)

# 5. Decode and print
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Summary:", summary)
```


Summary: The Amazon rainforest, known as the 'lungs of the planet', is under threat due to deforestation. Recent studies indicate that unprecedented rates of tree loss have led to shifts in biodiversity and climate patterns in South America. The Amazon is one of the most biodiverse places in the world.


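This example illustrates how the encoder processes the full article, the decoder attends to it through cross-attention, and beam search (`num_beams=4`) selects the output sequence. Because the input is already short, the output is not actually shorter than the article: it reproduces the input almost verbatim and appends a sentence about biodiversity that does not appear in the source.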
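
To show what beam search does during generation, here is a toy sketch in plain Python. It is not how `generate` is implemented: it omits the length penalty and batching, and it stops as soon as every kept beam has ended. `TOY_MODEL` is a hypothetical hand-written table of next-token probabilities standing in for the decoder.

```python
import math

# Hypothetical next-token probabilities, keyed by the last generated token.
TOY_MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"</s>": 0.55, "c": 0.45},
    "b": {"</s>": 0.9, "c": 0.1},
    "c": {"</s>": 1.0},
}


def toy_next_log_probs(tokens):
    """Stand-in for a decoder step: log-probabilities of each possible next token."""
    return {tok: math.log(p) for tok, p in TOY_MODEL[tokens[-1]].items()}


def beam_search(next_log_probs, bos, eos, num_beams=2, max_length=5):
    """Keep the `num_beams` best partial sequences by cumulative log-probability."""
    beams = [([bos], 0.0)]
    finished = []
    for _ in range(max_length):
        candidates = [
            (tokens + [tok], score + logp)
            for tokens, score in beams
            for tok, logp in next_log_probs(tokens).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:num_beams]:
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # sequences cut off by max_length still compete
    return max(finished, key=lambda c: c[1])[0]


greedy = beam_search(toy_next_log_probs, "<s>", "</s>", num_beams=1)
beam = beam_search(toy_next_log_probs, "<s>", "</s>", num_beams=2)
print("greedy:", greedy)
print("beam:  ", beam)
```

In this toy table, the greedy path commits to `a` (probability 0.6) and ends with total probability 0.33, while the two-beam search keeps `b` alive and finds the sequence with total probability 0.36.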