# Introduction to Large Language Models (LLMs)
LLMs are a class of machine learning models that have been trained on trillions of words, finding statistical patterns in massive datasets of human-generated content. These models, with billions of parameters, exhibit emergent properties beyond language comprehension alone.

## Generative AI
Generative AI refers to machines capable of creating content that mimics or approximates human ability. These models can be used for various applications such as chatbots, generating images from text, or helping with code development.

## Foundation Models
Foundation models, also known as base models, are the underlying models that power generative AI. In general, the more parameters a model has, the more sophisticated the tasks it can perform. Throughout the course, we will be using an open-source model, FLAN-T5, for language tasks.

## Customized Solutions
These models can be used as-is or fine-tuned to adapt to specific use cases. This allows for rapid creation of customized solutions without the need to train a new model from scratch.

## Focus on Language Models
While generative AI models exist for multiple modalities, this course focuses on large language models and their uses in natural language generation.

## Interaction with LLMs
Interacting with language models is different from other machine learning and programming paradigms. Instead of formalized code that calls libraries and APIs, LLMs take natural language instructions and perform tasks much like a human would. The text passed to an LLM is known as a prompt.

## Prompt Engineering
The space available to the prompt is called the context window. Inference refers to the process of using the model to generate text. The output is known as a completion, which includes both the original prompt and the generated text.

# Use Cases for Large Language Models (LLMs)

## Text Generation
While LLMs are commonly associated with chatbots, they're capable of a wide variety of text generation tasks. For instance, they can write an essay based on a given prompt or summarize conversations provided in the form of dialogue.

## Translation Tasks
LLMs can be used for a range of translation tasks. This includes traditional language translation, such as French to German or English to Spanish, as well as translating natural language into program code. For example, you could ask a model to generate Python code that calculates the mean of every column in a DataFrame.

## Information Retrieval
LLMs can also execute smaller, focused tasks like information retrieval. An example of this is named entity recognition, where the model identifies all people and places mentioned in a news article.

## Augmenting LLMs with External Data Sources
An area of active development involves augmenting LLMs by connecting them to external data sources or using them to invoke external APIs. This allows the model to access information it doesn't have from its pre-training and interact with the real world.

## Increased Understanding with Larger Models
As the scale of foundation models increases from hundreds of millions of parameters to billions, the subjective understanding of language that a model possesses also increases. This improved understanding enables the model to process, reason, and solve more complex tasks.

## Fine-tuning Smaller Models
Smaller models can be fine-tuned to perform well on specific tasks. The course will cover more on how to do this in week 2.

The rapid increase in capability exhibited by LLMs in recent years is largely due to the architecture that powers them. More about this will be covered in the next section of the course.

# Understanding Large Language Models and Transformer Architecture

## Transformer Architecture
The transformer architecture has significantly improved the performance of natural language tasks. Its power lies in its ability to learn the relevance and context of all words in a sentence, not just neighboring ones. This is achieved through attention weights applied to each word's relationship with others, regardless of their position in the input.

## Attention Map
An attention map illustrates the attention weights between each word and every other word. For example, the word "book" may be strongly connected with the words "teacher" and "student". This self-attention greatly improves the model's ability to encode language.

## Model Structure
The transformer architecture consists of two parts: the encoder and the decoder. They share several similarities and work together to process information.


![attention.png](attachment:attention.png)

## Tokenization
Before processing, text is tokenized: words (or pieces of words) are converted into numbers. Each number represents a position in a dictionary of all the tokens that the model can handle. The tokenizer used to train the model must also be used when generating text.
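
As a rough illustration of the idea, here is a minimal word-level tokenizer sketch. The helper names (`build_vocab`, `encode`, `decode`) and the `<unk>` placeholder are made up for this example; real LLM tokenizers usually split text into subword pieces rather than whole words.

```python
# Toy word-level tokenizer. Assumption: whitespace splitting is good enough
# for illustration; production tokenizers work on subword units.

def build_vocab(corpus):
    """Assign each distinct lowercase word an integer ID; 0 is reserved for unknown words."""
    vocab = {"<unk>": 0}
    for text in corpus:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode(text, vocab):
    """Convert text into a list of token IDs."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


def decode(token_ids, vocab):
    """Convert token IDs back into text using the same vocabulary."""
    id_to_word = {idx: word for word, idx in vocab.items()}
    return " ".join(id_to_word[i] for i in token_ids)


vocab = build_vocab(["the teacher taught the student with the book"])
ids = encode("the student taught the teacher", vocab)
print(ids)                 # [1, 4, 3, 1, 2]
print(decode(ids, vocab))  # the student taught the teacher
```

Decoding only works because the same vocabulary is used in both directions, which is why the tokenizer has to stay fixed between training and generation.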

## Embedding Layer
The input, now represented as numbers, is passed to the embedding layer. This layer is a trainable, high-dimensional vector space in which each token is represented as a vector and occupies a unique location. These vectors learn to encode the meaning and context of individual tokens in the input sequence.

## Positional Encoding
Positional encoding is added to the token vectors as they enter the base of the encoder or decoder. Because the model processes all input tokens in parallel, this step preserves information about word order and the relevance of each word's position in the sentence.
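
The embedding lookup and positional encoding can be sketched together. This is an illustration under stated assumptions, not the exact scheme of any particular model: the embedding table is filled with random numbers (in a real model it is learned), and the sinusoidal formula is the one from the original transformer paper, which these notes do not specify. All function names are hypothetical.

```python
import math
import random


def make_embedding_table(vocab_size, d_model, seed=0):
    """Random stand-in for a learned embedding table: one vector per token ID."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_model)] for _ in range(vocab_size)]


def positional_encoding(position, d_model):
    """Sinusoidal encoding (assumed): sine on even dimensions, cosine on odd ones."""
    encoding = []
    for i in range(d_model):
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


def embed(token_ids, table):
    """Look up each token vector and add the encoding for its position."""
    d_model = len(table[0])
    vectors = []
    for position, token_id in enumerate(token_ids):
        pos_vector = positional_encoding(position, d_model)
        vectors.append([t + p for t, p in zip(table[token_id], pos_vector)])
    return vectors


table = make_embedding_table(vocab_size=7, d_model=8)
vectors = embed([1, 4, 3, 1, 2], table)
print(len(vectors), len(vectors[0]))  # 5 8
```

Token ID 1 appears at positions 0 and 3 in this input. Its two final vectors differ because the positional term differs, which is how word order survives.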

## Self-Attention Layer
The self-attention layer allows the model to analyze relationships between tokens in the input sequence. It captures the contextual dependencies between words through learned self-attention weights. 
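
A stripped-down sketch of how attention weights can be computed is shown below. It uses scaled dot-product attention and, as a simplifying assumption, treats the input vectors themselves as queries, keys, and values; a real transformer first multiplies the inputs by learned projection matrices. The three example vectors are invented for illustration.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Return (weights, outputs) where weights[i][j] is how much token i attends to token j."""
    d = len(vectors[0])
    weights = []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in vectors
        ]
        weights.append(softmax(scores))
    # Each output vector is a weighted average of all token vectors.
    outputs = [
        [sum(w * value[j] for w, value in zip(row, vectors)) for j in range(d)]
        for row in weights
    ]
    return weights, outputs


tokens = ["teacher", "book", "river"]
vectors = [[1.0, 0.9, 0.0], [0.9, 1.0, 0.1], [0.0, 0.1, 1.0]]
weights, outputs = self_attention(vectors)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each row of `weights` corresponds to one row of an attention map. With these made-up vectors, "teacher" attends more strongly to "book" than to "river" because their vectors point in similar directions.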

## Multi-Headed Self-Attention
The transformer architecture includes multi-headed self-attention, where multiple sets of self-attention weights are learned independently. Each head may learn a different aspect of language, such as relationships between entities or activities in a sentence.

## Feed-Forward Network
After applying attention weights, the output is processed through a fully-connected feed-forward network. The output is a vector of logits proportional to the probability score for each token in the dictionary. These logits are normalized into a probability score for each word in a final softmax layer. The most likely predicted token has a higher score than the rest.
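
The final softmax step is small enough to sketch directly. The four-word vocabulary and the logit values below are invented for illustration; a real model produces one logit for every token in its dictionary.

```python
import math


def softmax(logits):
    """Turn a vector of logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


vocabulary = ["cake", "donut", "banana", "apple"]  # hypothetical tiny dictionary
logits = [2.0, 1.0, 0.1, -1.0]                     # hypothetical model output

probs = softmax(logits)
best = vocabulary[probs.index(max(probs))]
print(best)  # cake
```

Softmax preserves the ordering of the logits, so the token with the largest logit always ends up with the largest probability.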



### Lecture Summary: Transformer Architecture Overview 

1. **Tokenization**: The input words are tokenized using the same tokenizer that was used to train the network.
2. **Encoder**: The tokens are fed into the encoder through the embedding layer and multi-headed attention layers. The output is a deep representation of the structure and meaning of the input sequence.
3. **Decoder**: The encoded data is inserted into the decoder, which predicts the next token based on the context provided by the encoder. This loop continues until an end-of-sequence token is predicted.
4. **Detokenization**: The final sequence of tokens is detokenized into words, producing the output.
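
The four steps above can be sketched as a loop. Everything in this example is a stand-in: `toy_encoder` and `toy_decoder_step` are hypothetical placeholders for the real neural network, and the "model" simply echoes its input in upper case. Only the control flow (tokenize, encode once, decode token by token until end-of-sequence, detokenize) mirrors the description.

```python
EOS = "<eos>"  # end-of-sequence marker (name chosen for this sketch)


def tokenize(text):
    return text.lower().split()


def detokenize(tokens):
    return " ".join(tokens)


def toy_encoder(tokens):
    """Stand-in for the encoder: the 'representation' is just the token list."""
    return list(tokens)


def toy_decoder_step(context, generated):
    """Stand-in for one decoder step: emit the next input token in upper case, then EOS."""
    position = len(generated)
    if position < len(context):
        return context[position].upper()
    return EOS


def generate(text, max_new_tokens=20):
    context = toy_encoder(tokenize(text))      # steps 1 and 2
    generated = []
    for _ in range(max_new_tokens):            # step 3: one token per iteration
        next_token = toy_decoder_step(context, generated)
        if next_token == EOS:
            break
        generated.append(next_token)
    return detokenize(generated)               # step 4


print(generate("i love machine learning"))                    # I LOVE MACHINE LEARNING
print(generate("i love machine learning", max_new_tokens=2))  # I LOVE
```

The second call shows the loop stopping early when the token budget runs out before an end-of-sequence token is predicted.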
   

### Variations of Transformer Architecture

1. **Encoder-only models**: These models work as sequence-to-sequence models but typically have input and output sequences of the same length. They can be further modified to perform classification tasks like sentiment analysis. An example is BERT.
2. **Encoder-decoder models**: These models excel at sequence-to-sequence tasks like translation where input and output sequences can have different lengths. They can also be trained for general text generation tasks. Examples include BART and T5.
3. **Decoder-only models**: These models are widely used today and can generalize to most tasks. Popular examples include the GPT family of models, BLOOM, Jurassic, LLaMA, etc.

**Important Note**

Understanding the underlying architecture is beneficial but not necessary for interacting with transformer models. Instead, you'll be using natural language to create prompts, a process known as prompt engineering.

# Prompting and Prompt Engineering

In the process of working with language models, there are several important terms and concepts. 

1. **Prompt**: The text that is fed into the model.
2. **Inference**: The act of generating text.
3. **Completion**: The output text from the model.
4. **Context Window**: The full amount of text or memory available for the prompt.

Sometimes, the model doesn't produce the desired outcome on the first try. In such cases, the wording of the prompt may need to be revised several times to achieve the desired result. This process is known as **prompt engineering**.

One effective strategy in prompt engineering is to include examples of the task inside the prompt. This method, known as **in-context learning**, helps the model understand the task better.

In-context learning comes in several forms, depending on how many examples the prompt contains:

1. **Zero-shot Inference**: Including the input data within the prompt without any examples.
2. **One-shot Inference**: Providing a single example in the prompt.
3. **Few-shot Inference**: Providing multiple examples in the prompt.
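
To make the difference concrete, here is a small sketch that assembles zero-shot, one-shot, and few-shot prompts as plain strings. The `build_prompt` helper, the sentiment task, and the example reviews are all invented for illustration; the resulting text would be passed to whatever model is in use.

```python
def build_prompt(instruction, query, examples=()):
    """Assemble a prompt: zero or more worked examples, then the new input to complete."""
    parts = []
    for review, sentiment in examples:
        parts.append(f"{instruction}\nReview: {review}\nSentiment: {sentiment}")
    # The final block leaves the answer empty for the model to fill in.
    parts.append(f"{instruction}\nReview: {query}\nSentiment:")
    return "\n\n".join(parts)


instruction = "Classify this text as Positive or Negative."
query = "The plot was slow but the ending was great."

zero_shot = build_prompt(instruction, query)
one_shot = build_prompt(instruction, query, [("I loved this movie!", "Positive")])
few_shot = build_prompt(
    instruction,
    query,
    [("I loved this movie!", "Positive"), ("What a waste of time.", "Negative")],
)

print(few_shot)
```

Every extra example makes the prompt longer, so the number of examples that fit is bounded by the context window.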

While larger models are good at zero-shot inference, smaller models often benefit more from one-shot or few-shot inference. However, the context window limits how many examples can be passed into the model. *If a model isn't performing well even after including several examples, it might be time to consider **fine-tuning** the model.*

The scale of the model plays a significant role in its performance. Larger models with more parameters capture a better understanding of language and are surprisingly good at zero-shot inference. In contrast, smaller models are generally only good at tasks similar to those they were trained on.

Finally, once a suitable model is selected, there are several configuration settings to experiment with in order to influence the structure and style of the completions the model generates.

# Generative Configuration at Inference Time

**Configuration Parameters**: These are distinct from training parameters and are invoked at inference time to control aspects like the maximum number of tokens in the completion and the creativity of the output.

1. **Max New Tokens**: This parameter limits the number of tokens that the model will generate. It's essentially a cap on the number of times the model will go through the selection process. 

2. **Greedy Decoding**: By default, most large language models operate with greedy decoding, always choosing the word with the highest probability. However, this method can lead to repeated words or sequences.

3. **Random Sampling**: To introduce variability and avoid repetition, random sampling can be used. The model chooses an output word at random using the probability distribution to weight the selection.

4. **Top K and Top P Sampling**: These techniques limit the random sampling and increase the chance that the output will make sense. With top k, the model chooses only from the k tokens with the highest probability. With top p, the model chooses only from the highest-probability tokens whose combined probabilities do not exceed p.

5. **Temperature**: This parameter is a scaling factor applied in the final softmax layer that influences the shape of the probability distribution for the next token. A higher temperature results in higher randomness and a lower temperature results in lower randomness.
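
The decoding strategies in the list above can be sketched in a few small functions. This is a simplified illustration, not the implementation of any particular library: the function names are hypothetical, the logits are invented, and the top p rule follows the description given here (libraries differ on whether the token that crosses the threshold is kept).

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature before softmax; lower values sharpen the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize(probs, keep):
    """Zero out every token not in `keep` and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def greedy(probs):
    """Greedy decoding: always pick the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, ranked[:k])


def top_p_filter(probs, p):
    """Keep the most probable tokens while their cumulative probability stays within p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in ranked:
        if keep and cumulative + probs[i] > p:  # always keep at least one token
            break
        keep.append(i)
        cumulative += probs[i]
    return renormalize(probs, keep)


def sample(probs, rng):
    """Random sampling: pick a token index weighted by its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four tokens
probs = softmax_with_temperature(logits)
cool = softmax_with_temperature(logits, temperature=0.5)
warm = softmax_with_temperature(logits, temperature=2.0)

rng = random.Random(0)
print(greedy(probs))                        # 0
print(sample(top_k_filter(probs, 2), rng))  # always 0 or 1
print(top_p_filter([0.5, 0.3, 0.15, 0.05], 0.9))
```

Comparing `cool`, `probs`, and `warm` shows the effect of temperature: the top token's probability is highest at temperature 0.5 and lowest at 2.0, so sampling becomes more varied as the temperature rises.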

After exploring these concepts, one gains a deeper understanding of how to get the best possible performance out of these models through prompt engineering and by experimenting with different inference configuration parameters. With that foundation in place, the next step is to start thinking about what is required to develop and launch an application powered by a language model.