<center><a href="https://www.nvidia.com/en-us/training/"><img src="https://dli-lms.s3.amazonaws.com/assets/general/DLI_Header_White.png" width="400" height="186" /></a></center>

# <font color="#76b900">**4:** Encoder-Decoders for Seq2Seq</font>

In the previous notebook, we looked into some tasks that a "BERT-like" encoder-only model was relatively well-equipped for. We saw first-hand the power of language modeling, but also noticed some key restrictions, namely the per-token/per-passage reasoning limitations, that kept the model from being viable for some tasks. The zero-shot classification pipeline got around that by querying the encoder multiple times to generate a value for each class. In this notebook, we'll extend this formulation to an architectural component that can generate ordered sequences in a similar fashion, and use it for generating unbounded responses.

#### **Learning Objectives:**

- Learn about encoder-decoder models, which use an encoder to encode a static context (e.g. a question or an instruction) and a decoder to predict tokens one at a time.
- Cover strategies used to make giant, general-purpose models that work on a variety of tasks.

-----

## 4.1. The Machine Translation Task

[**Machine Translation**](https://huggingface.co/tasks/translation) is a common term to describe... translating one language into another automatically using software. Yeah, it's not as rigorous of a term as you might be used to, but it's flashy and worth throwing out there because it is an extremely important task in this day and age! 

> <div><img src="imgs/task-translation.png" width="800"/></div>
>
> **Source: [Translation Task | HuggingFace](https://huggingface.co/tasks/translation)**

Among the landscape of problems we can tackle with large language models, it's an instance of the **sequence-to-sequence**, or **seq2seq**, formulation.

Previously, we learned that a BERT-like encoder architecture can be used to solve a simplified case of the problem when one of the following conditions is met:
- The input and output sequence have the same number of entries, solvable via token prediction for each input token.
- The output sequence is a subset of the input sequence, solvable via range prediction.

You can also try to get creative with your formulations and can remove some of the hard limitations of this architecture, but those are still generally considered encoder-derivatives.

We've been throwing the term around a lot, **encoder**, but you might not have noticed why it's called that. Considering the logic of the autoencoder, you might have intuited *(reasonably but not quite correctly)* that the encoder body is only defined to be the task-generic series of transformer blocks and that the output is necessarily a latent encoding with no human-interpretable meaning. By this formalization, the classification head is the "decoder" that generates human-reasonable outputs.

This actually does make sense, but isn't really how the language modeling community likes to talk about it.
- **Autoencoder Logic:** *Input Sample $\to$ Latent Sequence $\to$ Sample Insight/Reconstruction*
- ***Language Model Logic:*** *Input Passage $\to$ Latent Sequence $\to$ Response Sequence*

By that logic, we can see more-or-less that there is no natural ***response*** being generated in our BERT-like architecture; only insight about our data. Sure, the insight might be in the form of a probability distribution over the input that forms a response, but ***we're not generating novel tokens***.

**That's where the decoder comes in!**

## 4.2. Pulling In The T5 Model

We've mentioned what sorts of tasks we can't do with encoder-only models - namely seq2seq tasks like machine translation - so we can start our journey by just pulling one such model in and seeing how it works! Of course, we can find a model in the HuggingFace model repository, but how about we go to the [task page for machine translation](https://huggingface.co/tasks/translation) instead. Reading through, you'll notice some recommendations and high-level overviews, but can also see that there is pipeline support for it! In fact, you don't even need to specify the back-end model if you don't want to: `transformers` will make a selection for you (though we'll do it anyways):

In [None]:
from transformers import pipeline

translator = pipeline('translation_en_to_fr', model='t5-base')
translator("Hello World! How's it going?")

I would say "Wow! It works! Can you believe it?" but actually this is probably expected at this point. You've already come to expect that a large language model can get you good results on a lot of things, and you've probably seen Google Translate in action before, so maybe this isn't that surprising. But it is pleasant to see that it can be done with a relatively small number of parameters:
```python
(
    translator.model.name_or_path,     ## 't5-base'
    translator.model.num_parameters()  ## 222,903,552
)
```
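
As a rough back-of-the-envelope check on what that parameter count means in practice, the sketch below converts it into an approximate weight-storage size. It assumes every parameter is stored at a fixed width (4 bytes for float32, 2 bytes for float16) and ignores buffers, activations, and framework overhead, so treat the numbers as estimates. `estimate_weights_gb` is a hypothetical helper, not part of `transformers`.

```python
def estimate_weights_gb(num_parameters: int, bytes_per_parameter: int = 4) -> float:
    """Approximate storage for the weights alone, in gigabytes (1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9

t5_base_params = 222_903_552  # the t5-base count reported above

fp32_gb = estimate_weights_gb(t5_base_params)                         # float32 weights
fp16_gb = estimate_weights_gb(t5_base_params, bytes_per_parameter=2)  # float16 weights

print(f"float32: {fp32_gb:.3f} GB")  # float32: 0.892 GB
print(f"float16: {fp16_gb:.3f} GB")  # float16: 0.446 GB
```

In other words, the whole model fits in under a gigabyte at full precision, which is why it runs comfortably on modest hardware.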

Let's go ahead and investigate this pipeline in a bit more detail:

In [None]:
text_en = "Hello World! How's it going?"
resp_fr = translator(text_en)
text_fr = resp_fr[0]['translation_text']

tok = translator.tokenizer
tokens_ins = [tok.decode(x) for x in tok.encode(text_en)]
tokens_out = [tok.decode(x) for x in tok.encode(text_fr)]
print(f"Inputs of length {len(tokens_ins)}: {' | '.join(tokens_ins)}")
print(f"Output of length {len(tokens_out)}: {' | '.join(tokens_out)}")

You might notice that there are a few interesting results:
- Superficially, you might notice that there is a new `</s>` token. This is the end-of-sequence token: it marks where a stretch of text ends, and the model learns to rely on it during training.
- More pressingly, **the number of tokens doesn't match!** There are 10 tokens in the input, but 13 tokens in the output, so how does that work?

Perhaps the pipeline is feeding in something in addition to the actual input you specify. The `preprocess` method feeds directly into the `model.forward` method, so let's see what's going on there:

In [None]:
prep = translator.preprocess
tokens_in2 = [tok.decode(x) for x in prep(text_en)['input_ids'][0]]
print(f"Model Inputs of length {len(tokens_in2)}: {' | '.join(tokens_in2)}")
tok.decode(prep(text_en)['input_ids'][0])

Ok! So the model actually inputs even more tokens, and the first set is actually an instruction. So, what's going on here? To figure it out, let's dive a little into the architecture of the model and reason about what we see.
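
To make the idea concrete, here is a minimal sketch of the kind of string preprocessing the pipeline appears to be doing before tokenization. The `TASK_PREFIXES` table and `add_task_prefix` function are hypothetical stand-ins written for illustration, and the prefix strings are assumed to match what the decoded input above shows; the real pipeline reads its prefix from the model configuration.

```python
# Hypothetical lookup table: task name -> instruction prefix (an assumption
# for illustration; compare against the decoded model input printed above).
TASK_PREFIXES = {
    "translation_en_to_fr": "translate English to French: ",
    "translation_en_to_de": "translate English to German: ",
}

def add_task_prefix(task: str, text: str) -> str:
    """Prepend the instruction prefix for a task to the raw user text."""
    return TASK_PREFIXES[task] + text

model_input = add_task_prefix("translation_en_to_fr", "Hello World! How's it going?")
print(model_input)
```

The takeaway is that the "task" is communicated to the model purely through text: nothing else about the input changes.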

## 4.3. Interpreting The T5 Architecture

Let's go ahead and look into the model and see if we can't make sense of what's going on! What you should probably do is something like the following:
```python
translator.model           ## See that there's a lot of stuff going on here
translator.model.encoder   ## See that this looks a lot like the BERT model
translator.model.decoder   ## See that this looks roughly the same and wonder what changed
```

However, this will load up your screen with a lot of text and will be a bit hard to walk through. With that said, let's try to visualize it in a more condensed diagram format. Below is all of the important stuff you need to see, with some arrows hinting at how they actually connect.

> **Note:** In the actual model architecture printout, the 0th transformer layer in both is shown separately from the 1st-11th; this is just due to T5's incorporation of [relative position encoding](https://paperswithcode.com/method/relative-position-encodings), which shows up as a component associated with the 0th layer despite being present throughout the architecture. Feel free to ignore that difference.

<div><img src="imgs/t5-architecture.png" 
     alt="Encoder-Decoder Architecture"
     width="1200"/></div>

**The main difference in terms of intuition comes in two flavors:**
- The decoder is primarily just an encoder that is trained for one specific task: **given an input sequence, generate the next token in the sequence**. In the encoder-decoder architecture, models usually start by feeding in a start token (e.g. `<s>`) and then generating one token at a time until an end-of-string token (e.g. `</s>`) is generated. (And no, it's not *actually* just an encoder, but the details aren't important for this course.)
- The encoder-decoder architecture enforces an interface by injecting some of its intermediate values into the decoder attention mechanisms. This is called cross-attention, and follows the exact same logic as self-attention: **provide a light-weight interface to provide context to the model**.

> **How does the math work:** Consider the case when you have keys/values $K_{1..m}$/$V_{1..m}$ coming from the encoder and queries $Q_{1..n}$ coming from the decoder.

> - If the $K_i$ and $Q_j$ have the same embedding dimension, then $Q_{1..n}K_{1..m}^T$ is an $n\times m$ matrix, as is its row-wise softmax. In other words:
 $$\text{Attention}(K_{1..m}, Q_{1..n}) \text{ is } n\times m.$$

> - Since $V_{1..m}$ has $m$ rows, it is multiplicatively compatible with an $n\times m$ attention matrix:
 $$\text{Attention}(K_{1..m}, Q_{1..n}) \times V_{1..m} \text{ is } n\times d \text{ where } d \text{ is the dimensionality of } V_i.$$

> - So... we just used an attention interface to incorporate an $m$-element sequence as context for an $n$-element sequence! Just do that many times over, and you have strong context-driven generation.
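
The shape bookkeeping above can be checked with a few lines of NumPy. This is a minimal sketch of generic scaled dot-product cross-attention, with keys/values from an $m$-element "encoder" sequence and queries from an $n$-element "decoder" sequence. It deliberately omits learned projections, multiple heads, and T5's relative position biases, and the random inputs and dimension sizes are arbitrary assumptions.

```python
import numpy as np

def cross_attention(Q, K, V):
    """Attention with queries from one sequence and keys/values from another."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])                   # shape (n, m)
    scores = scores - scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the m encoder positions
    return weights, weights @ V                               # shapes (n, m) and (n, d_v)

rng = np.random.default_rng(0)
m, n, d_k, d_v = 5, 3, 8, 4          # encoder length, decoder length, key dim, value dim
K = rng.normal(size=(m, d_k))        # keys from the encoder
V = rng.normal(size=(m, d_v))        # values from the encoder
Q = rng.normal(size=(n, d_k))        # queries from the decoder

attn_weights, attn_output = cross_attention(Q, K, V)
print(attn_weights.shape)  # (3, 5)
print(attn_output.shape)   # (3, 4)
```

Each of the $n$ decoder positions ends up with its own weighted summary of all $m$ encoder positions, regardless of how different $n$ and $m$ are.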

The end result is an architecture that has two key functionalities:
- Generate token after token autoregressively from the decoder architecture, where each new generated token is included in the input for predicting the one after it.
- Frequently inject context from the encoder to the decoder, making sure the generation stays in line with the overall objective.
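
These two functionalities can be sketched as a simple generation loop. The code below is a toy illustration only: `toy_encoder` and `toy_decoder` are hypothetical stand-ins (a real decoder is a neural network, not a lookup table), and the start/end token names are assumptions chosen for the example.

```python
call_log = []  # records the order in which the toy components are invoked

def toy_encoder(source_tokens):
    """Runs once: turns the source sequence into a fixed context."""
    call_log.append("encoder")
    return tuple(source_tokens)

# Hand-written outputs standing in for a trained decoder's predictions.
TOY_OUTPUT = ["Hallo", "Welt", "!", "</s>"]

def toy_decoder(context, generated):
    """Runs once per step: predicts the next token from context + history."""
    call_log.append("decoder")
    return TOY_OUTPUT[len(generated) - 1]  # `generated` includes the start token

def greedy_generate(source_tokens, start_token="<pad>", end_token="</s>", max_steps=10):
    context = toy_encoder(source_tokens)   # context is computed a single time
    generated = [start_token]
    for _ in range(max_steps):
        next_token = toy_decoder(context, generated)
        generated.append(next_token)       # each output becomes part of the next input
        if next_token == end_token:
            break
    return generated

result = greedy_generate(["Hello", "world", "!"])
print(result)    # ['<pad>', 'Hallo', 'Welt', '!', '</s>']
print(call_log)  # ['encoder', 'decoder', 'decoder', 'decoder', 'decoder']
```

Note the asymmetry in the call log: one encoder call, then one decoder call per generated token.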

### Seeing Token Generation In Action

Alright, so we know roughly how the model works, so let's explore what we can do with it and see how it actually behaves. Previously, we started with the classic English-to-French task, which has some well-known datasets and has been a staple for the NLP community for a long time. As we saw with our deeper dive, the pipeline actually just adds a conditioning prompt to the encoder input to tell the model what to do, but we can get just a bit more general and invoke the `text2text-generation` task pipeline. This one will give us full control over the context, so we can play around with some other potential tasks.

In [None]:
from transformers import pipeline

t5_pipe = pipeline("text2text-generation", model="t5-base")
t5_pipe("translate English to German: Hello world, and welcome to my notebook!")

To start with, let's verify that what we said about the decoder was true. Are we sure that this model is actually generating the response token-by-token? We can investigate by reading through the source code, or we can insert a little callback that will report what's coming into the encoder and decoder forward methods and the order in which they get called...

In [None]:
from extras_and_licenses.forward_listener import ForwardListener

t5_pipe.model.encoder.forward = ForwardListener(t5_pipe.model.encoder, name='encoder', tokenizer=t5_pipe.tokenizer)
t5_pipe.model.decoder.forward = ForwardListener(t5_pipe.model.decoder, name='decoder', tokenizer=t5_pipe.tokenizer)
t5_pipe("translate English to German: Hello world!!")

In [None]:
ForwardListener.clear_all()

**Some key observations to make:**
- The encoder is called once on the input to create a representation that can be used by the decoder. In most models, the encoder state is just computed once to provide a good context, and then is left as a static context to ground the decoder.
    - This actually helps with stability during training by avoiding the "moving target" problem.
- The decoder is asked to generate one token at a time, where the first token is generated from a starting token (in this case, `<pad>`). This first generation is conditioned entirely on the encoder output, whereas subsequent tokens are conditioned on both the original encoding and the generation history. As we saw earlier, the generation stops when the end-of-string token (i.e. `</s>`) is computed.
    - You might also notice that the decoder only receives one token at a time. Efficient implementations cache the key-value calculations from previous generation steps, so only the newest token needs to be processed and combined with the stored results. You can see this in `past_key_values`, which stores the attention components from previous steps and gets progressively bigger as the generation progresses.
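
To see why caching works, here is a minimal NumPy sketch (not the `transformers` implementation) comparing two ways of computing causal self-attention: recomputing everything at once, versus processing one new token per step against a growing cache of keys and values. The random inputs and sizes are arbitrary assumptions, and the sketch uses a single head with no learned projections.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_self_attention(Q, K, V):
    """Full recomputation: each position attends to itself and earlier positions."""
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(future, -np.inf, scores)           # hide future positions
    return softmax(scores) @ V

rng = np.random.default_rng(0)
steps, d = 4, 8
Q = rng.normal(size=(steps, d))
K = rng.normal(size=(steps, d))
V = rng.normal(size=(steps, d))

cached_K = np.empty((0, d))
cached_V = np.empty((0, d))
cached_outputs, cache_sizes = [], []
for t in range(steps):
    # Only the newest token's key/value is computed; older ones are reused.
    cached_K = np.vstack([cached_K, K[t:t + 1]])
    cached_V = np.vstack([cached_V, V[t:t + 1]])
    cache_sizes.append(cached_K.shape[0])
    weights = softmax(Q[t:t + 1] @ cached_K.T / np.sqrt(d))
    cached_outputs.append(weights @ cached_V)

cached_outputs = np.vstack(cached_outputs)
full_outputs = causal_self_attention(Q, K, V)
print(cache_sizes)                                 # [1, 2, 3, 4]
print(np.allclose(cached_outputs, full_outputs))   # True
```

The cache grows by one entry per step, and the step-by-step results match the full recomputation, which is the property that lets the decoder accept a single new token at a time.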

## 4.4. Experimenting With T5

Now that we've seen how the model is defined and what it does, let's play around a bit with the model and see what else we can do with it!

Recall that the T5 model is trained on a variety of tasks, visualized in part here:

> <div><img src="imgs/t5-pic.jpg" width="800"/></div>
>
> **Source: [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683v4)**

This is, of course, just a subset of the tasks we can consider, and details about the approach can be found in the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683v3). But it's a starting point, so let's try it out!

To not waste too much time, we should probably note that `t5-base` will probably not perform as well as you'd like for the examples. The `t5-large` model has more parameters and thereby requires more data to train, but the model is still nicely within our compute budget without any modifications, so we can go ahead and load it in instead:

In [None]:
from transformers import pipeline

## T5-Large performs better and has reasonably-fast inference, so we can safely default to it
t5_pipe_base = t5_pipe
t5_pipe_large = pipeline("text2text-generation", model="t5-large")

print(f"""
t5-base:
 - model size: {t5_pipe_base.model.num_parameters():,} parameters
 - memory footprint: {t5_pipe_base.model.get_memory_footprint()/1e9 :.03f} GB

t5-large:
 - model size: {t5_pipe_large.model.num_parameters():,} parameters
 - memory footprint: {t5_pipe_large.model.get_memory_footprint()/1e9 :.03f} GB
""")

In [None]:
queries = [
    "translate English to Spanish: This is good!",
    "cola sentence: The course is jumping well!",
    "stsb sentence1: The rhino grazed on the grass. sentence2: A rhino is grazing in a field.",
    ## Summarize entry pulled from https://huggingface.co/docs/transformers/tasks/summarization
    "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes.",
    ## TODO: Take a skim through the original T5 paper and pull in some other queries you think might be interesting.
    ## Maybe the SQuAD question-answering, or whatever else catches your eye. Examples of the format are around pages 50-60.
]

t5_pipe_large(queries)

As you can see, it actually works pretty well on its own, so maybe it's able to address arbitrary tasks if prompted? Let's try it out!

In [None]:
queries = [
    "translate English to Spanish: This is good!",
    "continue the sentence: I love walking",
    "continue the conversation as a helpful agent: User: Hello! Agent: ",
]
t5_pipe_large(queries)

Interesting. So maybe it's not really strong enough to reason about and understand the instruction prompts. Perhaps it's memorizing the prompts it was trained on and picking up some key insights from superficial similarities in word selection?

Maybe our problem is the size of the model. We skipped over using the `t5-base` model because it didn't give us especially good results on the baseline tasks, so maybe we can just upgrade our `t5-large` model to the next size up?

Unfortunately though, both the [`t5-xl` (or at least the `google/t5-v1_1-xl` version of it)](https://huggingface.co/google/t5-v1_1-xl) and the [`t5-xxl`](https://huggingface.co/google/flan-t5-xxl) models are out of reach for modest compute budgets (which we'll try to overcome in a few notebooks). But also, these larger models actually won't solve this problem, merely because the number of tasks is still too shallow with the standard T5 training. In this case, what we need isn't necessarily a larger model, but a more general training routine.

## 4.5. Prompt Engineering With Flan-T5
We talked in lecture about the Flan-T5 model, specifically in reference to its more ambitious training objectives. Its instruction fine-tuning covers enough varied tasks that the model learns to treat the input sequence as a natural-language problem statement, and it performs all the better for it.

> <div><img src="imgs/t5-flan2-spec.jpg" width="1000"/></div>
>
> **Source: [Scaling Instruction-Finetuned Language Models](https://arxiv.org/abs/2210.11416v5)**

Let's go ahead and pull the model in and see what we can do with it!

In [None]:
from transformers import pipeline

## Flan-T5-Large has reasonably-fast inference, so we can safely default to it
flan_t5_pipe = pipeline("text2text-generation", model="google/flan-t5-large")

In [None]:
queries = [
    "translate English to Spanish: This is good!",
    "continue the sentence: I love walking",
    "continue the conversation as a helpful agent: User: Hello! Agent: ",
]

flan_t5_pipe(queries)

As you can see, it's able to follow more instructions and reason about what the instructions actually mean, which is awesome! In fact, those stretch goals we had before for our `T5` model are quite rudimentary in comparison to what this model can do! This emergent behavior, being able to reason about the context itself and adapt to new tasks based on natural language alone, is known as **in-context learning**, and is the main enabler of the [**prompt engineering**](https://en.wikipedia.org/wiki/Prompt_engineering) paradigm.

You've probably heard a lot about **prompt engineering**, but at its core it's just about figuring out which kinds of inputs are good to get the best behavior of an already-trained model. The general rules of thumb to go by when doing it are as follows: 
- **Format Abiding:** Consider how the model was actually trained and what kinds of formats the inputs usually took on. The closer the task is to something it had to do during training, the better it will perform.
- **Few-Shot Prompting:** Models tend to favor repetition of patterns due to the nature of their training data, so giving some examples is often a good idea. This is known as one-shot or few-shot prompting: you give one or several examples of good behavior at inference time and hope that the model will follow the pattern.
- **Iterative Trial-And-Error:** If your model understands your instructions and is good about following them (i.e. with instruction fine-tuning), you can keep updating your prompt to correct the undesirable properties of your model's behavior. This process can look quite different on a per-model basis; some models are very good at following instructions, some are less good, and some will not be able to deviate from their training sufficiently for specific use cases.
- **Priming:** In addition to giving "instructions", you can add an initial generation into your decoder to get it to continue from there. This is a great way of getting around default behavior (for better or for worse) but may not always be supported by default for encoder-decoder architectures. 
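
As an illustration of the few-shot rule of thumb, here is a minimal sketch of a prompt builder that lays out an instruction, a handful of worked examples, and the final query in one repeating pattern. `build_few_shot_prompt`, the `Input`/`Output` labels, and the antonym task are all hypothetical choices made for this example; whether a given model actually follows the pattern is something you would need to test.

```python
def build_few_shot_prompt(instruction, examples, query,
                          input_label="Input", output_label="Output"):
    """Assemble an instruction, worked examples, and an open-ended final query."""
    lines = [instruction]
    for source, target in examples:
        lines.append(f"{input_label}: {source}")
        lines.append(f"{output_label}: {target}")
    lines.append(f"{input_label}: {query}")
    lines.append(f"{output_label}: ")   # left open for the model to complete
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    instruction="Give the opposite of each word.",
    examples=[("hot", "cold"), ("up", "down")],
    query="big",
)
print(prompt)
```

The resulting string can be passed to a `text2text-generation` pipeline like any other query; keeping the labels and layout identical across examples is what gives the model a pattern to continue.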

With all that said, go ahead and see what you can do with this model, and try to get comfortable with its capabilities with the following tasks:

### **Task 1: Performing As Expected**

Ask the network some stuff to see what it can do. You can use colons to separate task and body, or you can just talk to it directly. Below are some food-for-thought examples, but try your own!

In [None]:
# flan_t5_pipe("Can you tell me about how sandwiches are made?")
# flan_t5_pipe("Is this a true sentence: You can make fire by rubbing two sticks together very quickly")
# flan_t5_pipe("How do you say 'when in Rome...' in french?")
# flan_t5_pipe("Identify the noun with negative sentiment: I love pizza with pineapple, but adding pickles to it is just too much")
# flan_t5_pipe("Translate english to pig latin: Hello world and all who live in it!") ## will fail

Now, try the same instructions, but ask the model directly. See if it fails any of the tasks, or if it's pretty resilient. When it fails, see if you can't get it to work.

**Hint:**
- If you need to, consider giving the model examples of the task being executed properly. This is called few-shot prompting, as you're telling it how it should behave.
- For the pig latin task, this shouldn't be possible unless you get really lucky. This task requires letter-level reasoning in a semantically-unnatural way, so your attempts at prompt engineering will probably fail.

In [None]:
## Example of few-shot prompting. Will still fail though, due to the above reason
'''
flan_t5_pipe("""
Translate english to pig latin.
English: Look in the bag!
Pig Latin: Ooklay in the agbay!
English: Hello world and all who live in it!
Pig Latin: """)
'''

# flan_t5_pipe("Answer the question like a dictionary: Can you tell me about how sandwiches are made?")
# flan_t5_pipe("Is the sentence 'You can make fire by rubbing two sticks together very quickly' true? Please explain")

Note that for pig latin, you probably won't be able to get it to work. This might have something to do with the tokenization strategy: though the tokenizer should be able to represent pig latin, the task may be a bit of a stretch relative to the typical semantics of in-context learning without special fine-tuning.

### **Task 2: Asking For Hallucinations**

Ask for information that the model doesn't know. Do you notice any hallucinations? How do you think you can resolve them?

In [None]:
# flan_t5_pipe("What is my name?")
# flan_t5_pipe("Who are you?")
# flan_t5_pipe("How do you know my name?")
# flan_t5_pipe("Where do you work?")
# flan_t5_pipe("Answer the question honestly: Who are you?")

### **Task 3: Beginnings of Chat**

Tell the model to chat with a customer and see what it's capable of. As a challenge, see if you can't break the model. Alternatively, try to type some more stuff into the model and see what it can spit out!

In [None]:
## Challenge: Try to break the model by introducing an inconsistency.
## - Possible option: Make the role "Human" instead of "Customer"
## - Possible option: See if you can get the model to behave awkwardly by only changing the user input

''' # Example
flan_t5_pipe("""
You are a friendly chat agent who is helping a customer.
You are supposed to be nice and helpful, and tried to answer in detail.
If you do not know something, say "I don't know". Do not lie!
Customer: Hello! How's it going? Who are you?
Agent: """)
'''

## Example of breaking the model
flan_t5_pipe("""
You are a friendly chat agent who is helping a customer.
You are supposed to be nice and helpful, and tried to answer in detail.
If you do not know something, say "I don't know". Do not lie!
Human: Hello! How's it going? Who are you?
Agent: """)

## 4.6. **Wrap-Up:** Review of Techniques

At this point, we've seen how language models are able to generate completely new text by taking a language encoding as context. This opens up a lot of new possibilities and leaves a lot of open questions, but at least we're now at the cutting edge and have the capacity to do some pretty powerful stuff with a limited compute budget! In the next section, we're going to see a use case for which encoder-decoders really shine: **multi-modal generation**. After that, we'll go into the kinds of tasks that might be better-suited for only the decoder: **text-generation**.

In [None]:
## Please Run When You're Done!
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)
