# Attention and Transformers

In previous weeks we introduced the LSTM as an important practical implementation of Recurrent Neural Networks. LSTMs are good at dealing with sequential data -- and, importantly, sequential data where there can be long-range dependencies between items in our data stream. For a number of years, one important use of LSTM-type models was in implementing so-called encoder-decoder architectures for tasks such as machine translation. 

LSTM models are, however, no longer the state of the art in sequential data processing. Instead, Transformer-style architectures, and in particular models like BERT, now dominate the processing of text (as well as other forms of sequential data). 

A key ingredient of the Transformer approach is the notion of attention. Today we will introduce attention and the key concepts underlying the Transformer. 

We begin the discussion with one last look at LSTM (and image processing) so as to give you an idea of what attention is all about. 

NB: Since this material is very much about state-of-the-art content, please note that the lecture slides and my class delivery will have more material than is covered in this notebook. In other words, if you are preparing for the quiz, you also need to study the lecture itself -- but that should always be the case :-) 

## Attention in Sequence to Sequence Models

In previous weeks we briefly introduced the sequence-to-sequence architecture as a type of encoder-decoder architecture. In the sequence-to-sequence architecture our input is a sequence, e.g., a text string, and our output is another sequence, e.g., another text string. This sort of architecture has wide applications ranging from machine translation through to question answering and chatbots. 

### Sequence to Sequence Illustration

For our illustration of sequence-to-sequence let us assume our task is machine translation and that our source sentence $S1$ and target sentence $S2$ do in fact come from different languages. The basic approach to an RNN-centric sequence-to-sequence architecture is illustrated below: 

<!-- seq2seq --> 
<img width="500" src="https://drive.google.com/uc?id=1DDtE702k8WQstKtoIsIZQYjZl3UBwciV"/>

On the left-hand side we have one LSTM implementation (here it is unfolded over time). The input or encoder LSTM takes text one symbol at a time and processes it. This results in a hidden state that captures the meaning of the text. Normally an LSTM would also have output symbols, but in this case we are not training the encoder LSTM against some specific output directly. We therefore do not show an output, though it should be noted that an output will be produced. 

Instead of focusing on the output we are really just interested in the hidden state that is built up by our encoder -- this will be the important element that we then pass on to the decoder. 

Now let us turn to the decoder on the right of the image. It is again an LSTM -- but a different one from the encoder. The decoder LSTM is initialised with the state variable that the encoder produced. The decoder then starts outputting symbols just as we saw in the language production example. The decoder LSTM does not have a distinct set of inputs -- instead its outputs are fed back in as its inputs one symbol at a time. 

From this we can see that we have two LSTMs that simply share a hidden state. This architecture is often referred to as an 'encoder-decoder' architecture as we cleanly split the task between an initial encoder that builds up a representation of the input, and a decoder that maps from our built-up representation to a target output. 
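To make the hand-over between encoder and decoder concrete, here is a minimal sketch in NumPy. It is only an illustration under several assumptions: it uses a plain tanh RNN cell rather than a real LSTM, the weights are random and untrained, and the vocabulary size, token ids and function names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 6, 8   # toy sizes, chosen only for illustration
START, END = 0, 1                # hypothetical special token ids

def make_rnn_params():
    """Random (untrained) weights for one simple tanh RNN, standing in for an LSTM."""
    return {
        "W_in": rng.normal(scale=0.5, size=(hidden_size, vocab_size)),
        "W_h": rng.normal(scale=0.5, size=(hidden_size, hidden_size)),
    }

def one_hot(token_id):
    vector = np.zeros(vocab_size)
    vector[token_id] = 1.0
    return vector

def rnn_step(params, h, token_id):
    """One recurrent update: new state from the previous state and the current token."""
    return np.tanh(params["W_in"] @ one_hot(token_id) + params["W_h"] @ h)

def encode(params, source_ids):
    """Read the source one symbol at a time; keep only the final hidden state."""
    h = np.zeros(hidden_size)
    for token_id in source_ids:
        h = rnn_step(params, h, token_id)
    return h

def decode(params, W_out, h, max_len=10):
    """Start from the encoder's state and feed each output back in as the next input."""
    token_id, output_ids = START, []
    for _ in range(max_len):
        h = rnn_step(params, h, token_id)
        token_id = int(np.argmax(W_out @ h))   # greedy choice of the next symbol
        if token_id == END:
            break
        output_ids.append(token_id)
    return output_ids

encoder_params = make_rnn_params()
decoder_params = make_rnn_params()             # a different network from the encoder
W_out = rng.normal(size=(vocab_size, hidden_size))

state = encode(encoder_params, [2, 3, 4, 5])   # the only thing passed to the decoder
translation = decode(decoder_params, W_out, state)
print(translation)
```

Since the weights are untrained the output symbols are meaningless. The point is the data flow: the single vector `state` is all the decoder ever learns about the input.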

Beyond machine translation, two other interesting examples of sequence-to-sequence encoder-decoder models are chatbot systems and data-to-text generators. 

### Challenges with RNN based Encoder-Decoders 

While LSTMs are very powerful relative to simple RNNs, they still suffer from the same fundamental issue faced by RNNs, namely difficulty with long-range dependencies in data. 

To put it another way, imagine you are a translator taking a full sentence and trying to convert it to a target language. If the sentence is five or ten words long, this isn't difficult, but if the sentence is much longer, keeping all the information 'in your head' and then coming up with a valid translation can be very tough. In network terms, the signal has to propagate through many layers to get from a word at the start of a long source sentence to a word at the end of the long target sentence. This has implications for training, as exploding and vanishing gradients remain a challenge for very long network chains. 
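The vanishing gradient effect can be illustrated with a few lines of NumPy. This is a deliberately simplified sketch, not an LSTM: it assumes a purely linear recurrent step whose weight matrix has all singular values equal to 0.9 (both choices are ours, made so that the arithmetic is easy to follow), and it simply pushes a gradient vector back through that step many times.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 16

# Build a recurrent weight matrix whose singular values are all 0.9:
# an orthogonal matrix (from a QR decomposition) scaled by 0.9.
Q, _ = np.linalg.qr(rng.normal(size=(hidden_size, hidden_size)))
W_h = 0.9 * Q

def backpropagated_norm(num_steps):
    """Norm of a unit gradient after passing back through num_steps linear recurrent steps."""
    grad = np.zeros(hidden_size)
    grad[0] = 1.0
    for _ in range(num_steps):
        grad = W_h.T @ grad        # one step of backpropagation through time
    return float(np.linalg.norm(grad))

for steps in (5, 10, 50):
    print(steps, backpropagated_norm(steps))
```

Under these assumptions the norm shrinks by a factor of 0.9 per step, so after 50 steps it is about 0.005 of its original size. Replacing 0.9 with a value above 1 makes the same loop blow up instead, which is the exploding gradient case.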

The practical impact of this problem can be seen if we look at the results of a machine translation task for sentences of different lengths. Here the graph plots input sentence length against BLEU (pronounced 'blue') score for a machine translation task in a famous paper. You don't need to worry about the meaning of the BLEU metric at this point, except to understand that high is good and low is bad. Here we see that the system does well with medium-length sentences but not so well with very short sentences or longer sentences. 

<!-- bleu.png --> 
<img width="500" src="https://drive.google.com/uc?id=1xYi0hfztSr8lbytkfIfvWQ9-rVhdNNqT"/>

This is a relatively fundamental problem for LSTMs. And it isn't just a challenge in language translation; it occurs whenever we have an RNN-based sequence-to-sequence architecture where the input sequences are long. 

### Using Attention in Sequence to Sequence Architectures 

The fundamental problem for a sequence-to-sequence architecture is that the full input must first be compressed into a hidden representation, and that single representation is then fed to the decoder for it to generate the resultant sentence. This hidden state was the symbol S in the image above. Put simply, in an ideal world this state variable would have all the information that we need, but in practice it doesn't. 

What we ideally want is that when the decoder produces, say, the first word of the output, it is given access to the parts of the input that are most useful for deciding on that word. In our case the output word 'Er' would be highly influenced by the word 'He' in the encoder, as illustrated below. 

<img width="500" src="https://drive.google.com/uc?id=10G88K2ONcENzsZBymePr0p2tY7bNG8uN"/>

Here then the decoder doesn't just have access to the encoded state at the end of the encoding process, but is allowed to peek back at the encoder state at specific points in the input. 

Keep in mind that for this to work, the decoder needs to see the encoder state for every point in the encoding process. You might think this is problematic since we are supposed to be processing the elements in sequence and can't 'look back in history', but if you remember, we do in practice 'unroll' the RNN up to some specified maximum sequence length during training, so we actually have access to all the information we need. This generalised case is illustrated in the figure below:

<img width="500" src="https://drive.google.com/uc?id=10MCyh-1niwGWsxpIt3InV_EHf1ju7yGK"/>

This would be similar in some ways to a full feedforward network except that we are consuming the input one element at a time. This is good in that we can see the whole set of information, and if we are using a good preprocessing step like a word embedding to provide a meaningful representation for each input word, we have in practice a lot of information that we can work from. 

The problem though is that allowing a full feedforward approach would almost be too much information, and it won't help us when we know that for a task like translation we only need to pay attention to one part of the input at a time. In other words, when we are selecting one output token, we don't want to pay equal attention to the whole input sequence; we want to be able to focus in on the bit of the input sequence that is of most use to us. 

This then is where attention kicks in. With an attention mechanism we have in principle access to a good feature representation of the whole input, but the attention mechanism helps us to pay attention to, or give focus to, only the parts of the input that are relevant to us. This doesn't replace our feedforward weights; it gives us an extra mechanism we can apply over them. 

The actual baseline attention mechanism is illustrated in the figure below. 

<img width="500" src="https://drive.google.com/uc?id=10MI4c3lQVCI_aFNcX3UsddzMCqmdSL_w"/>


The implementation of attention is in practice complex and beyond our scope here. However, the intuition is important and relevant. Instead of feeding in the states for every input element equally, say by summing or averaging them, we adopt an approach where we give more weight to certain parts of that sequence of inputs depending on which step of the output task we are performing. 
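Although the full mechanism is beyond our scope, the weighting intuition itself fits in a few lines. The sketch below is a simplification and not necessarily the scoring function shown in the figure: it assumes relevance is measured by a plain dot product between the current decoder state and each encoder state, and all values are random toy data.

```python
import numpy as np

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

def attend(decoder_state, encoder_states):
    """Weight each encoder state by its relevance to the current decoder state."""
    scores = encoder_states @ decoder_state   # one dot-product score per input position
    weights = softmax(scores)                 # the attention weights
    context = weights @ encoder_states        # weighted sum of the encoder states
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 8))      # 5 input positions, 8-dimensional states
encoder_states /= np.linalg.norm(encoder_states, axis=1, keepdims=True)  # unit length rows
decoder_state = 3.0 * encoder_states[2]       # constructed to resemble input position 2

context, weights = attend(decoder_state, encoder_states)
uniform_context = encoder_states.mean(axis=0) # what equal weighting would give instead
print(np.round(weights, 3))
```

Because the decoder state was built to point in the same direction as the third encoder state, that position receives the largest weight. In a trained model the states and the scoring function are learned, so the focus shifts from one output step to the next.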

## Attention Models in Image Processing

The use of attention and transformers is not actually limited to text problems -- or even to strictly encoder-decoder or sequence-to-sequence problems. 

Indeed, to understand attention it is in some ways almost easier to think about its application in image captioning. In image captioning we have an input image that is fed through a traditional CNN input processing pipeline. If you remember how CNNs work, this means that we will have a set of feature maps for the image. Each feature map shows how strongly a feature responded at the different parts of the image. If we want to take these feature maps and use them as input to a language decoder, we can, and that will in theory give us the key architectural elements for an image captioning system. The problem is that the decoder in this case would end up looking at the whole feature map every time it wants to say a word. 

<img width="500" src="https://drive.google.com/uc?id=10PsBD4ep3sU9nZNXd2SdThOxn46ES-__"/>

If we add the attention mechanism, we let the network learn to pay attention to certain parts of the image as the decoder works its way through the production process. This is exactly what was done in the seminal work of the 'Show, Attend and Tell' paper, where the authors taught the network to pay attention to particular parts of the input image at different points in the language production task. 

<img width="500" src="https://drive.google.com/uc?id=10Ri5AjBykLt6G8N5O9dkRFdl44N7qS1i"/>

We can visualise this by having a look at a number of images with generated captions, which have been augmented to show where on the original input image the attention layer is focusing most. 


<img width="600" src="https://drive.google.com/uc?id=1t-dQbIYeFJXd9vVegDyVBxyslWo-PA8k"/>

Note that the full set of features is generated by the image backbone network / CNN and we have access to all those features over the entire input image. The attention layer is simply saying how focused we should be on these different parts of the image. In this way the attention mechanism is both very global and very local at the same time. It is global in that the whole input is available at any given time, and it is local in that the attention layer helps us to focus in on relevant local parts of the full input data. 
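The 'global and local' idea can be sketched by treating every location of the CNN feature maps as one candidate region and computing one attention weight per region. This is only an illustrative sketch: the feature map size, the projection matrix `W_query` and the dot-product scoring are assumptions of ours and are simpler than the scoring network used in the original paper.

```python
import numpy as np

def soft_image_attention(feature_maps, decoder_state, W_query):
    """Return one context vector and a height x width map of attention weights."""
    height, width, channels = feature_maps.shape
    regions = feature_maps.reshape(height * width, channels)  # one feature vector per location
    query = W_query @ decoder_state           # project the decoder state into feature space
    scores = regions @ query                  # relevance of every location (global view)
    scores = scores - scores.max()
    weights = np.exp(scores) / np.exp(scores).sum()
    context = weights @ regions               # features dominated by high-weight locations (local focus)
    return context, weights.reshape(height, width)

rng = np.random.default_rng(0)
feature_maps = rng.normal(size=(7, 7, 32))    # hypothetical CNN output: 7x7 grid, 32 feature maps
decoder_state = rng.normal(size=16)           # hypothetical state of the language decoder
W_query = 0.1 * rng.normal(size=(32, 16))     # random here; learned in a real model

context, attention_map = soft_image_attention(feature_maps, decoder_state, W_query)
print(np.round(attention_map, 3))
```

The returned `attention_map` has one weight per image location, which is the kind of map that can be upsampled and overlaid on the input image to produce visualisations like those above.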




## Transformers 

Attention as we have seen it over LSTM inputs and images very much changed how deep learning worked. It allowed entire inputs to be consumed, with a selection mechanism for focusing locally on the data that was much more flexible than what RNN-derived models could offer. In practice though the basic attention mechanism was not the end of the story; it was just the beginning. 

People quickly learned that the LSTM wasn't actually needed at all -- all we needed was to apply attention to slightly souped-up versions of feedforward networks. The intuition here is that the LSTM / sequential part of the encoder-decoder architecture can be dropped, and we can instead focus just on the attention part of the architecture (with a little bit of extra machinery to make sure we don't lose the ordering information on an input). 

This idea came to prominence in a paper called "Attention Is All You Need" where the original Transformer architecture was introduced. The Transformer architecture itself is illustrated below and we will discuss it shortly. It should be noted that this is the original Transformer and that in practice a wide class of architectures has emerged since. These architectures are commonly referred to as transformer architectures, but the architecture below was the originator. 

<img width="300" src="https://drive.google.com/uc?id=10Njcc9HUUBXuEp7UugItRHgvp_g8Yi9q"/>

The Transformer architecture is an encoder-decoder architecture that can transform one sequence of text to another -- the Transformer was developed for machine translation tasks. However, unlike the RNN / LSTM architectures we saw above, it takes in the whole input sequence in a single processing step, out of which it is able to generate a good quality sentence embedding. The advantage of doing this in one step is that the decoder only ever has a short path back to the input information -- whether that input information was at the start of the input sentence or the end of the input sentence. 

The encoder, by the way, is on the left-hand side of the diagram, while the decoder is on the right-hand side. This distinction between the encoder and decoder is illustrated in the figure below where we have drawn in some input text. 

<img width="600" src="https://drive.google.com/uc?id=10P1qN_JzlRMXWMIxFsEbio0SO4S1jl-H"/>

The text to be encoded is fed initially into the network in a traditional way -- usually with each word being transformed to a distributed word embedding via a pre-trained word embedding layer. These word embeddings are combined with a positional embedding which helps to encode a form of location information into the word embedding, i.e., the word embedding doesn't just have basic semantics for a word but a very concrete numeric quality that denotes where the word appeared in the sentence. This combined vector is then fed through subsequent layers to build up a complete sentence embedding. 
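As a rough sketch of how position can be combined with a word embedding, the code below builds the fixed sinusoidal positional encoding described in the original Transformer paper and adds it to word embeddings. The embedding table here is random and the sizes and token ids are hypothetical; in a real model the embeddings would be learned or pre-trained.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed positional encoding: sines on even dimensions, cosines on odd ones (d_model must be even)."""
    positions = np.arange(seq_len)[:, None]          # shape (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]    # shape (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)
    encoding[:, 1::2] = np.cos(angles)
    return encoding

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 8                          # toy sizes for illustration
embedding_table = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([3, 1, 3, 2])                   # the same word (id 3) appears twice
word_embeddings = embedding_table[token_ids]
position_encoding = sinusoidal_positions(len(token_ids), d_model)
encoder_input = word_embeddings + position_encoding  # the combined vector fed to the encoder
```

Note that the two occurrences of word 3 start from the same word embedding but end up with different combined vectors, because they sit at different positions in the sentence.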

The significant novelty of the Transformer's encoder architecture is that it uses so-called multi-headed attention layers to allow the network to learn the relationships between words within the input. The specifics of multi-headed attention are beyond our scope here, but in short this is an associative memory mechanism that allows the system to learn which words should attend to other input words during the encoding process. 
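For the curious, a stripped-down sketch of multi-headed self-attention is given below. It follows the scaled dot-product form from the original paper, but leaves out masking, biases, dropout and layer normalisation, and it uses random untrained weights; the sizes and variable names are our own choices for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """Every word attends to every other word, separately within each head."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(m):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return m.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (num_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)                     # each row: how one word attends to all words
    heads = weights @ v                                    # (num_heads, seq_len, d_head)
    merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ W_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
x = rng.normal(size=(seq_len, d_model))                    # stand-in for embedded input words
W_q, W_k, W_v, W_o = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(4))

output, weights = multi_head_self_attention(x, W_q, W_k, W_v, W_o, num_heads)
print(output.shape, weights.shape)
```

Each head produces its own table of word-to-word weights, so different heads are free to learn different kinds of relationships between the input words.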

At the end of the encoding process a representation of the input sequence is made available to the decoder. The decoder, like the encoder, takes as input a representation of a full input sentence. However, unlike the encoder, the decoder behaves like a recurrent network (it is autoregressive) in that it has to perform a full forward pass for each individual word that is produced. At each step a word is produced via a softmax layer at the end of the decoder. At the subsequent step this word is added to the input of the decoder as an extra bit of information. Thus at any point in time the decoder has access to:

(a) a distributed sentence embedding of the total input
(b) a representation of the decoded text up to the nth word 

and will then produce a most likely n+1th output. This n+1th output is then added to the representation of the decoded text for the next point in time. 
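The word-by-word production loop described above can be sketched as follows. The function `next_word_probabilities` is a hypothetical stand-in for a full forward pass of a decoder: it is untrained and far simpler than a real one, and the toy vocabulary is invented for illustration. Only the shape of the loop matters here.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<start>", "<end>", "er", "ist", "hier"]   # hypothetical toy vocabulary
d_model = 8
embeddings = rng.normal(size=(len(vocab), d_model))
W_out = rng.normal(size=(len(vocab), d_model))

def next_word_probabilities(sentence_embedding, decoded_ids):
    """Stand-in for one full forward pass of the decoder (untrained, illustration only)."""
    decoded_summary = embeddings[decoded_ids].mean(axis=0)     # (b) the text decoded so far
    logits = W_out @ (sentence_embedding + decoded_summary)    # combined with (a) the input representation
    logits = logits - logits.max()
    return np.exp(logits) / np.exp(logits).sum()               # softmax over the vocabulary

def greedy_decode(sentence_embedding, max_len=6):
    decoded_ids = [0]                                          # begin with the <start> symbol
    for _ in range(max_len):
        probs = next_word_probabilities(sentence_embedding, decoded_ids)
        next_id = int(np.argmax(probs))                        # the most likely (n+1)th word
        decoded_ids.append(next_id)                            # becomes part of the input next time
        if next_id == 1:                                       # stop once <end> is produced
            break
    return [vocab[i] for i in decoded_ids]

sentence_embedding = rng.normal(size=d_model)                  # stand-in for the encoder's output
decoded_words = greedy_decode(sentence_embedding)
print(decoded_words)
```

Picking the single most likely word at each step (greedy decoding) is the simplest choice and is assumed here for clarity; practical systems often consider several candidate continuations.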

### Beyond the baseline Transformer

The basic transformer architecture can be used as a single entity, but its encoder and decoder arms have led to a variety of model variants. 

The BERT language encoder and similar models can be thought of as the Transformer's encoder arm -- though with multiple stacked instances of the multi-headed attention and feed-forward network block. Remember that the encoder is basically building up a good representation of the input sentence, so the variety of language representations provided by Google, HuggingFace and others are essentially good quality transformer encoders. 

The decoder meanwhile is the essence of the large class of generative models such as GPT-3, which have been used to answer a wide variety of queries after having been trained purely on text available in web crawls. Thus the class of so-called generative language models like ChatGPT are in many ways applications of the transformer decoder architecture. In practice ChatGPT and similar systems have much more sophisticated training mechanisms, which include reinforcement learning and making good use of humans in the initial training and ranking of potential outputs. Nevertheless, the basic approach still applies. 

### Training the basic Transformer

Transformer-based language models are trained via a number of language processing tasks. Most significantly, for encoder models such as BERT, a blanked-out word task (often called masked language modelling) is used so that the network can learn to predict missing words from context. 
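The blanked-out word task can be sketched as a simple data preparation step: hide some words and keep a record of what they were, so the network can be asked to recover them. The version below is a simplification; the `[MASK]` symbol and the idea of masking a random fraction of words follow BERT's convention, but the function name, masking rate and seed are assumptions for illustration, and real systems work on sub-word tokens with extra refinements.

```python
import random

MASK = "[MASK]"

def mask_words(words, mask_prob=0.15, seed=0):
    """Blank out a random subset of words; return the masked input and the words to predict."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for position, word in enumerate(words):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[position] = word      # the network is trained to recover this word
        else:
            masked.append(word)
    return masked, targets

words = "the cat sat on the mat".split()
masked, targets = mask_words(words, mask_prob=0.3, seed=1)
print(masked)
print(targets)
```

During training the network sees `masked` as its input and is scored only on how well it predicts the entries in `targets`, which forces it to use the surrounding context.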

## Further Reading

For those interested in reading more, here are some good resources to get you started:

Two excellent blog posts on attention:
https://blog.floydhub.com/attention-mechanism/
https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html

A very good overview of how the BERT Transformer works:
https://www.youtube.com/watch?v=4Bdc55j80l8

More of an idiot’s introduction to Transformers:
https://www.youtube.com/watch?v=FWFA4DGuzSc
