# Understanding Transformers to Help Build Effective Large Language Model Prompts

Here we will explore the mechanisms of an LLM like ChatGPT, which will allow us to gain intuition for the importance of the [key components of a good prompt](https://github.com/teaghan/educational-prompt-engineering/blob/main/key_components.ipynb).

## Transformer Architecture Components and Concepts

### The Attention Mechanism (Query/Key/Value (QKV)):

In transformers, the attention mechanism is a key component that allows the model to focus on different parts of the input sequence when making predictions. It works by assigning weights to different elements in the input sequence based on their relevance to the current element being processed.

For LLMs, each element is a token, which you can think of as roughly one word (or piece of a word) in, say, a paragraph (the sequence of elements).

The attention mechanism involves three main components: Query, Key, and Value. These components are derived from the input sequence and are used to calculate attention scores. The **attention scores determine how much focus should be given to each element in the sequence when processing the current element.**

The raw scores are computed as the dot product between the Query of the current element and the Key of each element in the sequence. These scores are then scaled (divided by the square root of the Key dimension) and passed through a softmax function. The softmax turns them into attention weights: positive numbers that sum to 1 and determine how much attention each element receives when processing the current element (the Query).

The Values carry the information associated with each element in the input sequence. The final step takes a weighted sum of the Values, where the weights are the attention weights.
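
The steps above are often written compactly as a single formula. The version below is the standard scaled dot-product form; the symbols are introduced here for illustration: $Q$, $K$, and $V$ are matrices whose rows are the Queries, Keys, and Values of each element, and $d_k$ is the dimension of each Key.

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

Here $Q K^{\top}$ holds the dot products between every Query and every Key, the softmax is applied to each row separately, and multiplying by $V$ produces the weighted sum of Values.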

### Simple Explanation of Attention:

Imagine you're reading a paragraph, and you want to understand the word you're currently looking at in the context of all the other words. The attention mechanism is like giving different levels of importance to each word based on how relevant it is to the word you're focusing on.

**Components:**
- **Query:** The current element that you are comparing against the rest of the content. Think of it as the word you're currently trying to understand, or as the question being asked: "What should I focus on?"
- **Key:** The content being examined. Think of it as the other words in the paragraph.
- **Value:** The information associated with each Key. Think of it as the background information on what you're looking at.

**Process:**
1. For each word in the paragraph, compare it to the word you're focusing on (Query).
2. Calculate a score based on how relevant each word is to the word you are focusing on (dot product of Query and Key).
3. Convert the scores into a kind of percentage (using softmax).
4. Use these percentages as weights to blend the information (Values) attached to each word in the paragraph. The result is a summary of the information most relevant to the word you are focusing on.
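
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not how any particular LLM is implemented: the random "embeddings" `X`, the projection matrices `W_q`, `W_k`, `W_v`, and the sizes (4 tokens, 8 dimensions) are all made up for the example, and real models add details such as multiple attention heads and masking of future tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability; the result is unchanged.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    # Steps 1-2: compare every Query with every Key (dot product), then scale.
    scores = Q @ K.T / np.sqrt(d_k)
    # Step 3: turn each row of scores into weights that sum to 1.
    weights = softmax(scores, axis=-1)
    # Step 4: weighted sum of the Values for each Query.
    return weights @ V, weights

# Toy data (assumed for illustration): 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))

output, weights = attention(X @ W_q, X @ W_k, X @ W_v)
print(weights.round(2))  # row i shows how much token i attends to each token
```

Each row of `weights` corresponds to one word acting as the Query, and each column to one word acting as a Key.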

### Predicting the Next Word

The weighted sum calculated by the attention mechanism represents the context or information from the input sequence that the model has deemed relevant for predicting the next word.

The weighted sum of Values is then processed further by the model's learned parameters, typically through feedforward neural network layers. This attention-plus-feedforward pattern is repeated across many stacked layers.

The final output is converted into one score per entry in the vocabulary and passed through a softmax function, which turns the scores into a probability distribution over the vocabulary. Each word in the vocabulary gets a probability based on how likely it is to be the next word.

The model then samples a word from this probability distribution (stochastic sampling) or chooses the word with the highest probability (greedy decoding). This chosen word becomes the predicted next word.
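
The difference between the two decoding strategies can be seen in a small sketch. The four-word vocabulary and the scores (`logits`) below are invented for illustration; a real model has a vocabulary of many thousands of tokens and computes the scores itself.

```python
import numpy as np

def softmax(logits):
    # Subtract the maximum for numerical stability; the result is unchanged.
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical toy vocabulary and made-up scores for the next word.
vocab = ["cat", "dog", "mat", "sat"]
logits = np.array([2.0, 0.5, 1.0, -1.0])

probs = softmax(logits)  # probability distribution over the vocabulary

# Greedy decoding: always pick the most likely word.
greedy_word = vocab[int(np.argmax(probs))]

# Stochastic sampling: draw a word at random, in proportion to its probability.
rng = np.random.default_rng(0)
sampled_word = vocab[rng.choice(len(vocab), p=probs)]

print(dict(zip(vocab, probs.round(3))))
print("greedy:", greedy_word, "| sampled:", sampled_word)
```

Greedy decoding returns the same word every time for the same scores, while sampling can return a less likely word, which is one reason the same prompt can produce different responses.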

## Relating Attention to the Components of a Prompt

Understanding the QKV mechanism provides valuable insights into crafting effective prompts.

> In essence, a well-crafted prompt mirrors the interaction between Queries, Keys, and Values in a transformer, guiding the model to focus on the right information and generate contextually appropriate responses.

### 1. **Task:**

In the QKV mechanism, think of the Query as the task you want the model to perform based on the relevant context (the Key). Starting a prompt with a clear action verb states the task up front, guiding the model toward the specific information to focus on and the response to generate.

### 2. **Context:**

The context (Key) in the transformer is analogous to the background information in a prompt. Relevant background information guides the model's understanding and contextualizes the task at hand. It's about presenting the right information to shape the model's responses effectively.

### 3. **Including Exemplars:**

In QKV terms, exemplars shape the attention of the model by giving it concrete patterns to attend to. When crafting a prompt, providing exemplars helps guide the model's focus to specific details (Keys) during the reasoning process, leading to more accurate and context-aware responses.

### 4. **Persona:**

Defining a persona in a prompt tells the model how to approach the task from a given personality or role. A different persona leads to different connections (Values), which help determine the type of response.

### 5. **Format:**

Providing a desired format will influence how the model combines information to generate a response, which will alter the associations (weighting) made with each Key.

### 6. **Tone:**

The tone plays a fairly similar role to the persona, causing different associations to be made that are in line with the desired tone.
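
To make the six components concrete, here is a small sketch that assembles them into a single prompt string. The function `build_prompt`, its parameter names, the labels, and the ordering of the parts are all hypothetical choices made for this example; they are one possible layout, not a required or recommended format.

```python
def build_prompt(task, context="", exemplars=(), persona="", output_format="", tone=""):
    """Assemble a prompt from the six components (hypothetical helper)."""
    parts = []
    if persona:
        parts.append(f"You are {persona}.")
    if context:
        parts.append(f"Context: {context}")
    # The task is the only required component in this sketch.
    parts.append(f"Task: {task}")
    for i, example in enumerate(exemplars, start=1):
        parts.append(f"Example {i}: {example}")
    if output_format:
        parts.append(f"Format: {output_format}")
    if tone:
        parts.append(f"Tone: {tone}")
    return "\n".join(parts)

prompt = build_prompt(
    task="Explain Newton's second law.",
    context="The students have just learned about velocity and acceleration.",
    exemplars=["Newton's first law: an object at rest stays at rest unless a force acts on it."],
    persona="a patient high-school physics teacher",
    output_format="Three short bullet points.",
    tone="Encouraging and informal.",
)
print(prompt)
```

Every piece of text in the prompt becomes part of the sequence the model attends over, so each component gives the model more relevant material to focus on when generating its response.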