### Transformers and LLMs

Transformers are the architecture behind ChatGPT and many other powerful LLMs (large language models).

LLMs are specifically designed to perform natural language processing (NLP) tasks.

In this chapter, we first discuss Transformers and LLMs in general. Then we dive into the three main layers of a transformer: the embeddings layer, the attention layer, and the feedforward neural network (FNN) layer. Next, we discuss the categories of transformers: encoder-only, decoder-only, and encoder-decoder models. Last, we discuss popular LLMs: GPT, BERT,...

Transformers are a deep learning architecture: they combine neural network layers with other calculations and pass data through a series of steps to extract meaningful patterns and make predictions. A transformer consists of three main layers: embeddings, attention, and FNN.

We also have LLMs. These are deep learning models that have been pretrained on massive amounts of text data.

The two overlap in some cases: transformer-based LLMs are a popular deep learning approach to NLP tasks. Note that transformers can also be used for other purposes: vision, audio,... On the other hand, there are also LLMs that are based on other architectures: RNN, LSTM,...

**Transformer Architecture**

Along the way, the input text is gradually transformed, hence the name transformers.

Main Layers:

Raw text -> Embeddings layer -> Attention layer -> FNN layer -> Prediction

- Embeddings layer: Uses vectors to represent the semantic meaning of words. This is similar to vectorization, but embeddings are much richer vectors: each vector points to a location in a space, and that location carries meaning.
- Attention layer: Adjusts each vector/embedding slightly based on the context of the surrounding words, which makes each word's representation more meaningful.
- FNN layer: Learns new features, identifies patterns, and incorporates them into the model.

**Embeddings**

Here we are converting text tokens into meaningful numeric representations.

Each token (word) is placed in a high-dimensional space in such a way that words with similar meanings end up close together.

In general, each word is represented by a vector with a large number of dimensions (768 is typical). The value on each dimension is not random; the values carry semantic meaning. Popular earlier word embeddings were trained using shallow neural networks (word2vec) and matrix factorization (GloVe), where a shallow neural network is a neural network with a single hidden layer. Within LLMs, the values are randomly initialized and slowly updated during training until they reach their final values, as with any other neural network parameters. The parameters of a neural network are represented as matrices, and the embedding matrix is exactly that.
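To make the idea of an embedding matrix concrete, here is a minimal sketch. The vocabulary, the toy dimension of 8, and the helper names `embed` and `cosine_similarity` are assumptions for illustration, not part of any particular model. The matrix is only randomly initialized here, so the similarities it produces are not yet meaningful; training is what would move related words close together.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary; real models use tens of thousands of tokens.
vocab = {"i": 0, "love": 1, "cold": 2, "lemonade": 3}
d_model = 8  # toy size for readability; the text mentions 768 as typical

# One row per token, randomly initialized. Training would slowly update
# these values; in this sketch they stay random.
embedding_matrix = rng.normal(scale=0.02, size=(len(vocab), d_model))

def embed(tokens):
    """Look up the embedding row for each token."""
    ids = [vocab[t] for t in tokens]
    return embedding_matrix[ids]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vectors = embed(["i", "love", "cold", "lemonade"])
print(vectors.shape)  # one row per token, one column per dimension
print(cosine_similarity(vectors[2], vectors[3]))  # "cold" vs "lemonade"
```

Cosine similarity is one common way to measure how close two embeddings are; other distance measures would work for the same purpose.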

**Attention**

This is the most important component of the transformer architecture. It adds context by helping each token absorb additional meaning from other tokens.

Example: "lemonade", "I love cold lemonade", "I love Beyonce's Lemonade album"

Let's consider the phrase "I love cold lemonade." The token of interest here is "lemonade." With the attention layer, the word "cold" adds context to lemonade. In technical terms, lemonade attends to cold. In simpler words, it is not just a lemonade; it is a cold lemonade. Similarly, lemonade attends to love.

The attention layer does this by creating matrices for queries, keys, and attention scores. As in the embeddings layer, the query and key values are randomly initialized and slowly updated during training until they reach their final values. To capture query-key similarity, a dot product is taken between each query and each key, which gives the similarity (attention) score.
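The query-key dot product can be sketched in a few lines. This follows the standard scaled dot-product formulation, which also uses a value matrix, a scaling by the square root of the key dimension, and a softmax; none of those three are mentioned above, so treat them as assumptions of this sketch. All weights are random, so the attention weights are illustrative only.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over a sequence of embeddings X."""
    Q = X @ W_q  # queries: what each token is looking for
    K = X @ W_k  # keys: what each token offers
    V = X @ W_v  # values: the content that gets mixed together
    # Dot product of every query with every key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
tokens = ["I", "love", "cold", "lemonade"]
d_model = 8  # toy size
X = rng.normal(size=(len(tokens), d_model))  # stand-in for real embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

contextual, weights = self_attention(X, W_q, W_k, W_v)
# The last row shows how much "lemonade" draws on each of the four tokens.
print(dict(zip(tokens, weights[3].round(3).tolist())))
```

Each output row is a weighted mix of all value vectors, which is how a token such as "lemonade" absorbs meaning from "cold" and "love".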

**Feedforward Neural Network**

In this layer, patterns in those contextual relationships are learned and captured.

Typical FNN in a transformer:

Input layer -> Hidden layer -> Output layer

The input is what comes from the attention layer, and the output is what we pass on to the prediction. Typically, we have 768 nodes in the input layer, 3072 nodes in the hidden layer, and 768 nodes in the output layer.
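The 768 -> 3072 -> 768 shape can be written out as two matrix multiplications. This is a minimal sketch with random weights; the ReLU activation, the bias terms, and the names `W_1`, `W_2`, and `feed_forward` are assumptions, since the text does not specify them.

```python
import numpy as np

d_model, d_hidden = 768, 3072  # sizes taken from the text

rng = np.random.default_rng(0)
W_1 = rng.normal(scale=0.02, size=(d_model, d_hidden))  # input -> hidden
b_1 = np.zeros(d_hidden)
W_2 = rng.normal(scale=0.02, size=(d_hidden, d_model))  # hidden -> output
b_2 = np.zeros(d_model)

def feed_forward(x):
    """Apply the same two-layer network to every token vector in x."""
    hidden = np.maximum(0.0, x @ W_1 + b_1)  # ReLU (assumed activation)
    return hidden @ W_2 + b_2

# Stand-in for the attention layer's output: 4 tokens, 768 values each.
attention_output = rng.normal(size=(4, d_model))
ffn_output = feed_forward(attention_output)

n_params = W_1.size + b_1.size + W_2.size + b_2.size
print(ffn_output.shape, n_params)
```

With these sizes, a single FNN holds about 4.7 million parameters, and the output has the same shape as the input, so it can be passed on to the next step unchanged.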

Why do transformers work so well?

It is due to the attention layer. It adds context (it enriches the meaning of each word based on the others).

Because the attention layer is structured as matrix operations, it allows for parallelization. Thus, it can be trained fast.

Attention generalizes well: the core transformer architecture works well for a variety of NLP tasks.

In practice, the layers are typically arranged in this order:

Raw text -> Embeddings layer -> Transformer block 1 (Attention layer + FNN) -> Transformer block 2 -> ... -> Predictions
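The stacking of blocks can be sketched as a loop in which each block's output becomes the next block's input. This is a simplified illustration with random weights and toy sizes; the helpers `init_block` and `transformer_block` are hypothetical, and real models add pieces that are left out here (for example, residual connections and layer normalization).

```python
import numpy as np

def softmax(x):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def init_block(rng, d_model, d_hidden, scale=0.1):
    """Randomly initialize the weights of one block (hypothetical helper)."""
    return {
        "W_q": rng.normal(scale=scale, size=(d_model, d_model)),
        "W_k": rng.normal(scale=scale, size=(d_model, d_model)),
        "W_v": rng.normal(scale=scale, size=(d_model, d_model)),
        "W_1": rng.normal(scale=scale, size=(d_model, d_hidden)),
        "W_2": rng.normal(scale=scale, size=(d_hidden, d_model)),
    }

def transformer_block(X, p):
    """One block: an attention layer followed by an FNN."""
    Q, K, V = X @ p["W_q"], X @ p["W_k"], X @ p["W_v"]
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    attended = weights @ V                          # attention layer
    hidden = np.maximum(0.0, attended @ p["W_1"])   # FNN hidden layer
    return hidden @ p["W_2"]                        # FNN output layer

rng = np.random.default_rng(0)
n_tokens, d_model, d_hidden, n_blocks = 4, 8, 32, 3  # toy sizes

X = rng.normal(size=(n_tokens, d_model))  # stand-in for the embeddings
blocks = [init_block(rng, d_model, d_hidden) for _ in range(n_blocks)]

for params in blocks:
    X = transformer_block(X, params)  # output of one block feeds the next

print(X.shape)  # the shape is unchanged after every block
```

Because every block maps a sequence of vectors to a sequence of vectors of the same shape, any number of blocks can be chained.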

**Categories of Transformers**

Mainly, there are three categories of transformers: encoder-only models, decoder-only models, and encoder-decoder models.

Different models will use different pieces of the transformer architecture.

- Encoder-Only Models: The encoder takes raw text and encodes it as an embedding representation of the text (it transforms the text into a vector that carries meaning). In short, it understands text. A typical application is sentiment analysis (output: positive/negative).

- Decoder-Only Models: The decoder takes an input text sequence and predicts the next word by producing a probability distribution over the possible next words. In short, it generates text.

- Encoder-Decoder Models: These models take two inputs: a text sequence and a shifted target sequence. The target sequence is shifted because we want to predict the next word. Both inputs are encoded as embeddings and combined to infer the next word. In short, these models understand and generate text. Translation is a typical application.
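The shifted target sequence and the next-word probability distribution can be illustrated with a small sketch. The start token `<s>`, the example translation, the candidate words, and their scores are all made up for illustration; a real model would compute the scores from its final layer.

```python
import math

def shift_right(target_tokens, start_token="<s>"):
    """Build the decoder input by shifting the target one step to the right.

    At position i the model sees decoder_input[: i + 1] and is asked to
    predict labels[i], which is the next word.
    """
    decoder_input = [start_token] + target_tokens[:-1]
    labels = list(target_tokens)
    return decoder_input, labels

def softmax(scores):
    """Turn raw scores into a probability distribution over words."""
    top = max(scores.values())
    exps = {word: math.exp(s - top) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

# Hypothetical translation pair: "I love lemonade" -> "me encanta la limonada"
target = ["me", "encanta", "la", "limonada"]
decoder_input, labels = shift_right(target)
for seen, expected in zip(decoder_input, labels):
    print(f"after {seen!r} predict {expected!r}")

# Made-up scores for the word that follows "me encanta la".
next_word_scores = {"limonada": 2.0, "playa": 0.5, "mesa": -1.0}
probs = softmax(next_word_scores)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

The same shifting idea applies to decoder-only models, where the input sequence itself serves as the target and each position predicts the token that follows it.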

How do we use these in practice? We would download a pretrained LLM.