![Tulane](https://github.com/tulane-cmps6730/main/blob/main/img/banner.png?raw=true)

<center>

<font size="+3">Transformers</font>

[Aron Culotta](https://cs.tulane.edu/~aculotta/)  
[Tulane University](https://cs.tulane.edu/)

<a href="http://colab.research.google.com/github/tulane-cmps6730/main/blob/main/notebooks/10_Transformers.ipynb">
        <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d0/Google_Colaboratory_SVG_Logo.svg/320px-Google_Colaboratory_SVG_Logo.svg.png"  width=10%/></a>
<a href="https://github.com/tulane-cmps6730/main/tree/main">
        <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/c2/GitHub_Invertocat_Logo.svg/240px-GitHub_Invertocat_Logo.svg.png" width=6%/></a>

In this module, we'll learn about the transformer mechanism, which uses attention to model language.

</center>

<hr size=10 color=#285C4D>

## Motivation

The GPT in ChatGPT stands for...

- **G**enerative
- **P**retrained
- **T**ransformer

Today we'll learn what this means.

### Attention Review



> Given a set of vector **values**, and a vector **query**,  
> **attention** is a technique to compute a weighted sum of the values, dependent on the query.

- a selective summary of the values based on the query
- gives a fixed-size representation of the values





**Input**: sequence of value vectors $\mathbf{h}_1 \ldots \mathbf{h}_n \in \mathbb{R}^{d_h}$ and a query vector $\mathbf{q} \in \mathbb{R}^{d_q}$

1. Compute **attention scores** $\mathbf{s} \in \mathbb{R}^n$
  - we'll see how in a moment


2. Apply softmax to get the **attention distribution** $\alpha$:
  - $\alpha = \mathrm{softmax}(\mathbf{s}) \in \mathbb{R}^n$
  
  
3. Compute the **attention output**, the sum of values weighted by attention distribution:
  - $\mathbf{a} = \sum_{i=1}^n \alpha_i \mathbf{h}_i \in \mathbb{R}^{d_h}$
  - $\mathbf{a}$ then becomes the input features for a classification layer
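
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the scoring function has not been specified at this point, so the sketch assumes dot-product scores (which requires $d_q = d_h$), and the names `softmax`, `attention`, `values`, and `query` are chosen here for illustration.

```python
import math

def softmax(scores):
    """Turn a list of real-valued scores into a probability distribution."""
    m = max(scores)  # subtract the max before exponentiating to avoid overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, values):
    """Return (alpha, a): the attention distribution and the attention output."""
    # Step 1: attention scores (ASSUMPTION: dot product between query and value).
    scores = [sum(q * h for q, h in zip(query, value)) for value in values]
    # Step 2: attention distribution.
    alpha = softmax(scores)
    # Step 3: attention output, the alpha-weighted sum of the value vectors.
    dim = len(values[0])
    a = [sum(alpha[i] * values[i][k] for i in range(len(values))) for k in range(dim)]
    return alpha, a

# Toy inputs: three 2-dimensional value vectors and one query.
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
alpha, a = attention(query, values)
print("alpha =", alpha)
print("a =", a)
```

Note that `a` has the same size as a single value vector no matter how many values are passed in, which is the "fixed-size representation" property mentioned above.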

<hr size=10 color=#285C4D>

## Self-Attention for Language Models


- We can use the idea of attention to predict the next word in a sentence.
- Attention lets us model long-range dependencies
- The prediction for word $i$ depends on **all previous words**



![figs/m10selfatt.png](https://github.com/tulane-cmps6730/main/blob/main/lec/sequence/figs/m10selfatt.png?raw=1)

![figs/m10selfatt2.png](https://github.com/tulane-cmps6730/main/blob/main/lec/sequence/figs/m10selfatt2.png?raw=1)

We can use attention to determine which preceding words are most important for the prediction at position $i$.

$$\mathbf{a}_i = \sum_{j \le i} \alpha_{ij} \mathbf{v}_j$$



Each input embedding plays three different roles:

- As the current focus of attention when being compared to all of the other preceding inputs. We’ll refer to this role as a **query**.
- As a preceding input being compared to the current focus of attention. We’ll refer to this role as a **key**.
- As a **value** used to compute the output for the current focus of attention.

We use three different weight matrices to represent each role.

$\mathbf{v}_j = V\mathbf{x}_j ~~~V\in \mathbb{R}^{d \times d} ~~~$ **values for node j**

$\mathbf{k}_j = K \mathbf{x}_j ~~~K\in \mathbb{R}^{d \times d} ~~~$ **keys for node j**

$\mathbf{q}_i = Q\mathbf{x}_i ~~~Q\in \mathbb{R}^{d \times d} ~~~$ **query for node i**

$\alpha_{ij} = \frac{\exp(\mathbf{q}_i \cdot \mathbf{k}_j)}{\sum_{j' \le i} \exp(\mathbf{q}_i \cdot \mathbf{k}_{j'})} ~~~$ **affinities between node i and j**

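**nb:** we often scale the dot product down by $\sqrt{d_k}$, where $d_k$ is the dimension of the key vectors. This keeps the scores from growing with the dimension and saturating the softmax:
$\alpha_{ij} = \frac{\exp(\mathbf{q}_i \cdot \mathbf{k}_j / \sqrt{d_k})}{\sum_{j' \le i} \exp(\mathbf{q}_i \cdot \mathbf{k}_{j'} / \sqrt{d_k})}$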

**Putting it all together:**

![figs/m10selfatt3.png](https://github.com/tulane-cmps6730/main/blob/main/lec/sequence/figs/m10selfatt3.png?raw=1)
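
As a rough sketch of how the pieces fit together, the plain-Python code below computes queries, keys, and values from the inputs, scores each position only against itself and earlier positions, scales by $\sqrt{d_k}$, and takes the weighted sum of values. The function names and the toy identity weight matrices are assumptions made for illustration; in a real model $Q$, $K$, and $V$ are learned.

```python
import math

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)  # subtract the max before exponentiating to avoid overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(X, Q, K, V):
    """X: list of n input embeddings; Q, K, V: d x d weight matrices.

    Returns (A, outputs): A[i] is the attention distribution for position i
    over positions 0..i, and outputs[i] is the attention output a_i.
    """
    queries = [matvec(Q, x) for x in X]
    keys = [matvec(K, x) for x in X]
    values = [matvec(V, x) for x in X]
    d_k = len(keys[0])
    A, outputs = [], []
    for i in range(len(X)):
        # Position i only scores positions j <= i, so it cannot see later words.
        scores = [dot(queries[i], keys[j]) / math.sqrt(d_k) for j in range(i + 1)]
        alpha = softmax(scores)
        out = [sum(alpha[j] * values[j][k] for j in range(i + 1))
               for k in range(len(values[0]))]
        A.append(alpha)
        outputs.append(out)
    return A, outputs

# Toy example: identity weights (hypothetical) and three 2-dimensional inputs.
I2 = [[1.0, 0.0], [0.0, 1.0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A, outputs = causal_self_attention(X, I2, I2, I2)
for i, row in enumerate(A):
    print(f"position {i} attends over {len(row)} position(s):", row)
```

The first position can only attend to itself, so its attention distribution is a single weight of 1 and its output equals its own value vector.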


### Word Embeddings and Unembedding

To use self-attention for language modeling, we need two additional layers.
- Embedding layer: from one-hot encoding of input word to a dense word embedding vector
- Unembedding layer: from a dense word embedding vector to a distribution over output words
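
A minimal sketch of these two layers in plain Python follows. The three-word vocabulary and the numbers in `E` and `U` are made up for illustration; in practice both matrices are learned, and the softmax at the end is one common way to turn scores into a distribution.

```python
import math

vocab = ["the", "cat", "sat"]   # toy vocabulary (hypothetical)
E = [[0.1, 0.3],                # |V| x d embedding matrix, one row per word
     [0.7, -0.2],
     [-0.4, 0.5]]
U = [[0.2, -0.1, 0.4],          # d x |V| unembedding matrix
     [0.6, 0.3, -0.5]]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def embed(word):
    """One-hot vector times E, which simply selects the word's row of E."""
    x = one_hot(vocab.index(word), len(vocab))
    return [sum(x[w] * E[w][k] for w in range(len(vocab))) for k in range(len(E[0]))]

def unembed(h):
    """Map a d-dimensional vector to a probability distribution over the vocabulary."""
    logits = [sum(h[k] * U[k][w] for k in range(len(h))) for w in range(len(vocab))]
    m = max(logits)  # subtract the max before exponentiating to avoid overflow
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

h = embed("cat")
probs = unembed(h)
print("embedding of 'cat':", h)
print("distribution over next words:", dict(zip(vocab, probs)))
```

In a full language model, the vector passed to `unembed` would be the output of the self-attention layer rather than the raw embedding used here.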

In [None]:
# Let's package these computations up inside of an nn.Module so we can learn W_s and W_y.
import torch
import torch.nn as nn

class SimpleSelfAttention(nn.Module):
    def __init__(self, input_size, verbose=False):
        super().__init__()
        # Since we don't know what the output size will be (the number of tokens), we can't use nn.Linear:
        # self.input_to_attention = nn.Linear(input_size, output_size=??, bias=False)
        # Instead, we'll just use the Parameter object, which is a trainable tensor.
        self.W_s = nn.Parameter(torch.randn(input_size, dtype=torch.float64))

        # W_y: hidden to prediction
        self.W_y = nn.Linear(input_size, 1, bias=False, dtype=torch.float64)

        self.verbose = verbose
        self.tanh = nn.Tanh()
        self.softmax = nn.Softmax(dim=0)
        self.sigmoid = nn.Sigmoid()


    def forward(self, x):
        # "@" is the matrix multiply operator
        # This computes s = tanh(x * W_s) for all tokens at the same time.
        s = self.tanh(x @ self.W_s)
        # normalize using softmax
        alpha = self.softmax(s)
        # compute final combined embeddings
        a = alpha @ x
        # classify
        y = self.sigmoid(self.W_y(a))
        if self.verbose:
            print('s=\n', s)
            print('alpha=\n', alpha)
            print('a=\n', a)
            print('y=\n', y)
        return y


### Attention-based word representations



By applying attention to each token in the input, we compute hidden representations $\mathbf{h}_i$ for each input word.
- Each $h_i$ depends on all the other words in the input.


$$\mathbf{v}_j = V\mathbf{x}_j ~~~V\in \mathbb{R}^{d \times d} ~~~$$ **values for node j**

$$\mathbf{k}_j = K \mathbf{x}_j ~~~K\in \mathbb{R}^{d \times d} ~~~$$ **keys for node j**

$$\mathbf{q}_i = Q\mathbf{x}_i ~~~Q\in \mathbb{R}^{d \times d} ~~~$$ **query for node i**

$$\alpha_{ij} = \frac{\exp(\mathbf{q}_i \cdot \mathbf{k}_j)}{\sum_{j'} \exp(\mathbf{q}_i \cdot \mathbf{k}_{j'})} ~~~$$ **affinities between node i and j**

$$\mathbf{h}_i = \sum_{j=1}^n \alpha_{ij} \mathbf{v}_j$$


$\mathbf{h}_i$ is the contextual representation of input $\mathbf{x}_i$. $\alpha_{ij}$ controls the strength of each contribution from $\mathbf{v}_j$.

<br>

This tells us **which information from which other tokens should be used in representing** $\mathbf{x}_i$.


The matrices $K$, $Q$, $V$ allow us to use different views of each $\mathbf{x}_i$ for the different roles of key, query, and value.
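
To make the "contextual" part concrete, the plain-Python sketch below computes $\mathbf{h}_i$ for every position without any masking, then changes only the last input and recomputes. The weight matrices and inputs are arbitrary toy values chosen for illustration, and the function name `contextual_representations` is hypothetical.

```python
import math

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def softmax(scores):
    m = max(scores)  # subtract the max before exponentiating to avoid overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def contextual_representations(X, Q, K, V):
    """Unmasked self-attention: every h_i is a weighted sum over all n values."""
    queries = [matvec(Q, x) for x in X]
    keys = [matvec(K, x) for x in X]
    values = [matvec(V, x) for x in X]
    H = []
    for q in queries:
        alpha = softmax([sum(a * b for a, b in zip(q, k)) for k in keys])
        H.append([sum(alpha[j] * values[j][c] for j in range(len(X)))
                  for c in range(len(values[0]))])
    return H

# Toy weights (hypothetical): different views of x for query, key, and value.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[0.0, 1.0], [1.0, 0.0]]
V = [[2.0, 0.0], [0.0, 2.0]]

X1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
X2 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 2.0]]  # only the last token differs

H1 = contextual_representations(X1, Q, K, V)
H2 = contextual_representations(X2, Q, K, V)
print("h_1 in context 1:", H1[0])
print("h_1 in context 2:", H2[0])
```

The first input is identical in both sequences, yet its representation changes because it attends to the modified last token.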

<br>

<br><br><br>

<img src="https://github.com/tulane-cmps6730/main/blob/main/lec/sequence/figs/selfattention.png?raw=1" width=60%/>

[source](https://web.stanford.edu/class/cs224n/readings/cs224n-self-attention-transformers-2023_draft.pdf)



## Transformers



A **transformer** block combines several ideas. At its core, it contains a *self-attention layer* and *positional embeddings* (needed because self-attention by itself ignores word order).

Other tricks include:

- **multi-head attention**: several attention heads, each with its own $Q$, $K$, and $V$ matrices, are run in parallel over each input sentence
  - e.g., one can pay attention to syntax, another semantics, etc.
- **residual connections** between layers: $X_i = X_{i-1} + \mathrm{Layer}(X_{i-1})$
- **layer normalization**: compute *z*-scores for the values in each layer (subtract the mean and divide by the standard deviation)
  - can improve training convergence by keeping activations on the same scale across layers
  - normalization is done separately for each token's vector $\mathbf{h}_i \in \mathbb{R}^d$:
  - $$\mu_i = \frac{1}{d}\sum_{j=1}^d h_{ij} \qquad \sigma_i = \sqrt{\frac{1}{d}\sum_{j=1}^d (h_{ij} - \mu_i)^2} \qquad \hat{\mathbf{h}}_i = \frac{\mathbf{h}_i - \mu_i}{\sigma_i}$$
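
The residual connection and layer normalization described above can be sketched in plain Python as follows. This is a simplified illustration: `toy_layer` is a hypothetical stand-in for an attention or feed-forward sublayer, the small `eps` term is an assumption added to avoid division by zero, and the learned gain and bias that practical implementations usually include are omitted. Applying the normalization after the residual sum is one possible ordering.

```python
import math

def layer_norm(h, eps=1e-5):
    """Normalize one token's vector to zero mean and (roughly) unit variance."""
    d = len(h)
    mu = sum(h) / d
    var = sum((x - mu) ** 2 for x in h) / d
    # eps (ASSUMPTION) keeps the division safe when the variance is near zero.
    return [(x - mu) / math.sqrt(var + eps) for x in h]

def residual(x, layer):
    """Residual connection: add the layer's output back onto its input."""
    return [xi + yi for xi, yi in zip(x, layer(x))]

def toy_layer(x):
    # Hypothetical stand-in for an attention or feed-forward sublayer.
    return [0.5 * xi for xi in x]

h = [1.0, 2.0, 3.0, 4.0]
summed = residual(h, toy_layer)   # x + Layer(x)
out = layer_norm(summed)          # z-scores of the summed vector
print("after residual:", summed)
print("after layer norm:", out)
```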

<br><br>

![figs/transformer.png](https://github.com/tulane-cmps6730/main/blob/main/lec/sequence/figs/transformer.png?raw=1)

<br><br>

### Translation

The transformer was originally introduced for a machine translation task. Below is the architecture figure from the original paper, "Attention Is All You Need" (Vaswani et al., 2017).

![figs/attention.png](https://github.com/tulane-cmps6730/main/blob/main/lec/sequence/figs/attention.png?raw=1)


### Back to BERT

BERT is a "multi-layer bidirectional Transformer encoder"
- to train, mask words at random so that information about word $i$ is not used when predicting word $i$

<img src="https://github.com/tulane-cmps6730/main/blob/main/lec/language_models/figs/bert.png?raw=1" width=60%/>

**Just the Encoder portion of the original Transformer architecture**
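
The random masking step can be illustrated with a short plain-Python sketch. It is deliberately simplified: each token is independently replaced by a `[MASK]` placeholder with probability `mask_prob`, and the hidden words are recorded as prediction targets. The function name, the `seed` parameter, and the example sentence are assumptions for illustration, and BERT's actual procedure includes additional details not shown here.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked, targets); targets maps each masked position to its hidden word."""
    rng = random.Random(seed)  # local generator so results are reproducible
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide word i from the model's input
            targets[i] = tok      # the model is trained to recover this word
        else:
            masked.append(tok)
    return masked, targets

tokens = "the cat sat on the mat".split()
# A high mask_prob is used here only so the toy sentence shows some masks.
masked, targets = mask_tokens(tokens, mask_prob=0.5, seed=1)
print(masked)
print(targets)
```

Because the encoder is bidirectional, the prediction for a masked position can use words on both sides of it, but never the hidden word itself.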

### Back to ELMo

ELMo "word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pretrained on a large text corpus."

<img src="https://github.com/tulane-cmps6730/main/blob/main/lec/language_models/figs/elmo.png?raw=1" width=60%/>


### GPT: Generative Pre-trained Transformer

**Just the Decoder portion of the original Transformer architecture**

<img src="https://github.com/tulane-cmps6730/main/blob/main/lec/sequence/figs/gpt.png?raw=1" width=60%/>


## Transformer implementation

Andrej Karpathy's excellent, simplified implementation of a transformer: https://github.com/karpathy/minGPT

Along with a code walkthrough video: https://www.youtube.com/watch?v=kCc8FmEb1nY

## Sources

- https://web.stanford.edu/class/cs224n/

In [2]:
!python -V

Python 3.10.12
