<img src="Images/brain.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 1. Understanding Foundation Models
*AI Engineering*

----
Foundation models are determined by the training data and model architecture:
- Pre-training makes a model capable, but not necessarily safe or easy to use.
- The goal of post-training is to align the model with human preferences.

Training impacts the model's:
- **Performance.**
- **Sampling:** how a model chooses an output from all possible options. Not only does sampling explain many seemingly baffling AI behaviors, including hallucinations and inconsistencies, but choosing the right sampling strategy can also significantly boost a model's performance with relatively little effort.

<img src="Images/brain.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 2. Training Data
*AI Engineering*

----
#### A. Multilingual Models
To improve a model's performance on a specific task, it's helpful to include more relevant data during training. Common Crawl is a widely used source of training data, but its quality is unreliable. Due to concerns about public and competitive scrutiny, many companies no longer reveal their data sources.

Language- and domain-specific models are often fine-tuned from general-purpose ones. High-quality data can be more effective than large amounts of low-quality data. For example, Gunasekar et al. (2023) trained a small 1.3B-parameter model with 7B high-quality coding tokens that outperformed much larger models on key coding benchmarks.

General-purpose models perform better in English due to its dominance in training data. While translating queries to and from English is common, it's not ideal because it depends on the model's ability to understand underrepresented languages, risks information loss, and may still lead to performance issues in non-English languages.

Beyond quality issues, models are often slower and more expensive to use with non-English languages due to inefficient tokenization. Some languages, like Hindi and Burmese, require far more tokens than English to express the same content, making inference slower and costlier. For example, GPT-4 takes about 10 times longer and costs 10 times more in Burmese than in English. To address this, specialized models have been developed for various non-English languages, with Chinese being the most actively supported.

#### B. Domain-Specific Models
<img src="Images/domain_distribution.png" alt="Distribution of domains">

General-purpose models can handle everyday questions but often struggle with domain-specific tasks like drug discovery or cancer screening, which require specialized, hard-to-access data. To perform well in these areas, models need carefully curated datasets. Notable examples include AlphaFold for protein structures, BioNeMo for drug discovery, and Med-PaLM 2 for medical queries. While domain-specific models are common in biomedicine, they can also benefit fields like architecture and manufacturing by outperforming general models in specialized tasks.

Whether general-purpose models fine-tuned on domain-specific data are better than models trained from scratch specifically for that domain depends on several factors:

#### ✅ **When Fine-Tuning General-Purpose Models Can Be Better**
1. **Resource Efficiency:** Fine-tuning a pre-trained general model (like GPT-4 or LLaMA) requires far less data and compute than training a domain-specific model from scratch.
2. **Leverages Broad Knowledge & Transfer Learning:** General models already understand language, reasoning, and some relevant concepts, which gives them a head start when adapting to specific domains.
3. **Faster Development & Deployment:** Fine-tuning can be done relatively quickly, making it ideal for practical applications or commercial use.
4. **Good Performance with Limited Data:** With limited domain-specific data, fine-tuning a general model can often outperform training a small model from scratch.

#### ❌ **When Domain-Specific Models Can Be Better**
1. **Highly Specialized Tasks:** In complex domains like drug discovery, genomics, or radiology, general models often lack the precision and structure awareness required (e.g., understanding 3D protein folding or interpreting medical images).
2. **Non-Text Modalities:** Many domain-specific tasks involve structured data, images, molecules, or graphs, not just text. Domain-specific models are designed to handle these inputs natively.
3. **Efficiency and Cost:** Models like AlphaFold or specialized models for manufacturing are optimized for specific formats and tasks, making them much more efficient for their use case than a large fine-tuned general model.

#### 📌 **Best Practice Today**
A **hybrid approach** is often most effective:
* Use a strong general-purpose model as a base.
* Fine-tune it with high-quality domain-specific data.
* Augment with specialized components (e.g., structured data handlers, multimodal inputs) if the task requires it.

#### 📊 Example
* **Med-PaLM 2 (Google)**: A general-purpose LLM fine-tuned with medical data. It performs well on many medical tasks, even surpassing doctors in some QA benchmarks.
* **AlphaFold (DeepMind)**: Built from scratch for protein folding. It achieves world-class performance, which general models cannot match.

#### 🔍 Summary
> **Fine-tuned general-purpose models** are often *good enough* and much easier to build, but **domain-specific models** still win for highly specialized, structured, or multimodal tasks where precision and efficiency matter most.

#### C. Knowledge Carryover
The **carryover of knowledge and skill from a general-purpose model to a domain-specific model**, often referred to as **transfer learning** or **knowledge transfer**, is a key strength of using general models as a base. Here's how it works and when it's most valuable:

#### 🧠 How Knowledge Carries Over
General-purpose models (like GPT, LLaMA, or PaLM) are trained on diverse datasets, so they develop broad capabilities:
1. **Language Understanding:** Syntax, grammar, reasoning, and coherence transfer well to any domain.
2. **World Knowledge:** General facts and concepts (e.g. anatomy, chemistry, finance basics) provide a contextual foundation.
3. **Problem-Solving Patterns:** Chain-of-thought reasoning, summarization, and question-answering strategies transfer across domains.
4. **Cross-Domain Analogies:** General models may draw useful analogies or patterns from other domains that help in specialized areas (e.g., using physics metaphors in biology).

#### 📈 When This Transfer Helps
* **Low-Resource Domains**: Where you don't have much task-specific data, transfer from a large general model fills the gap.
* **Natural Language Interfaces**: Even in technical fields, if the interaction is through text (e.g., asking a medical model questions), general language skills are essential.
* **Interdisciplinary Domains**: Where understanding overlaps (e.g., bioinformatics combines biology + CS), general models bring value from both sides.

#### ⚠️ But There Are Limits
1. **Technical Depth:** General models may lack deep, structured domain knowledge (e.g., genetic sequences, medical imaging, or legal citations).
2. **Format Mismatch:** General LLMs trained mostly on plain text struggle with structured inputs like tables, formulas, graphs, or images unless explicitly trained.
3. **Hallucinations in Critical Domains:** Even with fine-tuning, general models may "make stuff up" more easily in unfamiliar or high-stakes areas (e.g., medicine, law).

#### 🧪 Example: Med-PaLM 2
* Google's Med-PaLM 2 starts with a general LLM and fine-tunes it using medical QA data, PubMed abstracts, etc.
* It benefits from language skills, reasoning patterns, and broad biomedical knowledge.
* However, it still struggles with real-world deployment unless paired with domain-specific guardrails, structured knowledge bases, or expert oversight.

#### 🧬 Summary
> **Yes, knowledge and skills from general models *do* carry over to domain-specific models**, especially in language, reasoning, and general knowledge. This transfer makes fine-tuning powerful and efficient.
> But **domain-specific data and modeling are still necessary** to handle unique formats, deeper expertise, and minimize risk in specialized tasks.

<img src="Images/brain.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 3. Modeling
*AI Engineering*

----
Before training a model, developers must choose its architecture and size (e.g., number of parameters), which affect both its performance and ease of deployment. Smaller models are easier to deploy, and optimization strategies vary by architecture.

#### A. Model Architecture: seq2seq
The dominant architecture for language-based foundation models is the **transformer,** built on the **attention** mechanism. It became popular by overcoming limitations of earlier models like **seq2seq,** which used RNNs for tasks like translation and summarization.

Seq2seq consists of an encoder-decoder structure processing input and output token sequences. While seq2seq marked major progress (e.g., Google Translate adopted it in 2016), it had limitations that transformers improved upon.

The seq2seq model has three main limitations:
1. It relies only on the final hidden state to generate outputs, which limits output quality.
2. Its sequential RNN-based processing makes it slow for long inputs.
3. **RNNs suffer from vanishing and exploding gradients** due to their recursive nature, making training difficult and unstable for long sequences.

#### B. Model Architecture: transformer
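The **transformer architecture** addresses these limitations. Its **attention mechanism** lets the model consider all input tokens when generating each output token, like referencing any page in a book rather than just the summary. Dropping the RNN also removes the sequential input processing and the recurrence behind the gradient problems on long sequences.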

The **attention mechanism** predates the transformer and was initially used with **RNN-based models** like Google's **GNMT** in 2016. However, the **transformer architecture** became a breakthrough by using attention **without RNNs**, enabling **parallel input processing** and faster computation.

Transformers eliminate the **sequential input bottleneck**, but **output generation remains sequential** in **autoregressive models**. Inference happens in two steps:
1. **Prefill:** processes input tokens in parallel to prepare for output generation.
2. **Decode:** generates output tokens one at a time.

This mix of parallel and sequential steps drives various **optimization techniques** to improve inference speed and efficiency.

<img src="Images/seq2seq_transformers.png" alt="seq2seq vs. transformers">

#### C. Attention mechanism
The **attention mechanism** is central to transformer models and relies on three key components:
* **Query (Q):** Represents the current decoding step (like someone seeking information).
* **Key (K):** Represents each previous token (like a page number in a book).
* **Value (V):** Represents the content of each previous token (like the page's content).

The model calculates attention by taking the **dot product of the query and key vectors**; higher scores mean more focus on that token's content (value). This allows the model to selectively use information from input and previously generated tokens when producing the next output.

<img src="Images/transformer.png" alt="transformer">

As transformer models process longer sequences, the number of key and value vectors that need to be computed and stored increases, which makes extending the context length challenging.
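
To make the storage cost concrete, here is a back-of-the-envelope sketch of how the cached key and value vectors grow with context length. It assumes plain multi-head attention, where every layer keeps one key vector and one value vector per token. The function name `kv_cache_bytes`, the layer count, the hidden size, and the 2-byte precision are illustrative assumptions (roughly a 7B-class model), not figures from these notes.

```python
def kv_cache_bytes(seq_len, n_layers, d_model, bytes_per_value=2, batch_size=1):
    """Approximate memory for cached keys and values.

    Assumes plain multi-head attention: every layer stores one key vector
    and one value vector of size d_model for every token.
    """
    per_token = 2 * n_layers * d_model * bytes_per_value  # 2 = key + value
    return per_token * seq_len * batch_size


# Illustrative (assumed) dimensions: 32 layers, hidden size 4096, 16-bit values.
for context in (1_024, 4_096, 32_768):
    gib = kv_cache_bytes(context, n_layers=32, d_model=4096) / 2**30
    print(f"{context:>6} tokens -> {gib:.1f} GiB")
```

Under these assumptions the cache grows linearly with the number of tokens, so an 8x longer context needs 8x the memory for keys and values alone.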

#### D. Attention algorithm
1. Given an input x, the key, value, and query vectors are computed by applying key (W<sub>K</sub>), value (W<sub>V</sub>), and query (W<sub>Q</sub>) matrices to the input:

<img src="Images/KQV matrices.png" alt="KQV matrices"> 
(The query, key, and value matrices have dimensions corresponding to the model's hidden dimension.)

2. The attention mechanism is almost always multi-headed, which allows the model to attend to different groups of previous tokens simultaneously. With multi-headed attention, the query, key, and value vectors are split into smaller vectors, each corresponding to an attention head. Attention is then calculated using a softmax function:

<img src="Images/Attention.png" alt="Attention">

3. The outputs of all attention heads are then concatenated and an output projection matrix is used to apply another transformation to this concatenated output.
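
The three steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: it assumes the standard scaled dot-product form (scores divided by the square root of the head dimension), a causal mask so each token attends only to itself and earlier tokens, and square `d_model x d_model` weight matrices. All function and parameter names here are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(scores):
    """Numerically stable softmax over a list of numbers."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(q, k, v):
    """Causal scaled dot-product attention for one head (seq_len x head_dim inputs)."""
    head_dim = len(q[0])
    out = []
    for i, q_i in enumerate(q):
        # Score the current query against the keys of tokens 0..i only.
        scores = [
            sum(a * b for a, b in zip(q_i, k[j])) / math.sqrt(head_dim)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Weighted sum of the corresponding value vectors.
        out.append([
            sum(w * v[j][d] for j, w in enumerate(weights))
            for d in range(head_dim)
        ])
    return out


def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """x: seq_len x d_model; each weight matrix: d_model x d_model."""
    # Step 1: project the input into query, key, and value vectors.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    head_dim = len(x[0]) // n_heads
    heads = []
    # Step 2: split into heads and run attention on each slice.
    for h in range(n_heads):
        cols = slice(h * head_dim, (h + 1) * head_dim)
        heads.append(attention_head(
            [row[cols] for row in q],
            [row[cols] for row in k],
            [row[cols] for row in v],
        ))
    # Step 3: concatenate the heads per token, then apply the output projection.
    concat = [sum((head[t] for head in heads), []) for t in range(len(x))]
    return matmul(concat, w_o)
```

Calling `multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads=2)` with a `seq_len x d_model` input returns a matrix of the same shape, one output vector per token.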

#### E. The Transformer Block
A transformer architecture is composed of multiple transformer blocks. The exact content of the block (referred to as the model's layer) varies between models, but, in general, each block contains the attention module and the MLP (multi-layer perceptron) module:
* **Attention module:** consists of four weight matrices: query, key, value, and output projection.
* **MLP module:** consists of linear layers separated by nonlinear activation functions. Each linear layer (AKA feedforward layer) is a weight matrix that is used for linear transformations, whereas an activation function allows the linear layers to learn nonlinear patterns. Common nonlinear functions are:
  * ReLU, Rectified Linear Unit (Agarap, 2018), which converts negative values to 0: *ReLU(x) = max(0, x)*
  * GELU (Hendrycks and Gimpel, 2016), which was used by GPT-2 and GPT-3
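
As a small illustration of the two activation functions, the sketch below implements ReLU as defined above and the exact form of GELU, `x * Phi(x)`, where `Phi` is the standard normal CDF. Some implementations use a tanh-based approximation of GELU instead; the exact form is used here only for simplicity.

```python
import math


def relu(x):
    """ReLU(x) = max(0, x): negative inputs become 0."""
    return max(0.0, x)


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for value in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={value:+.1f}  relu={relu(value):.4f}  gelu={gelu(value):.4f}")
```

Unlike ReLU, GELU is smooth and lets small negative inputs through with a small negative output.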

Transformer models are also outfitted with a module before and after all the transformer blocks:
* **An embedding module before the transformer blocks:** This module consists of the embedding matrix and the positional embedding matrix, which convert tokens and their positions into embedding vectors, respectively.
* **An output layer after the transformer blocks (unembedding layer, also called the model head):** maps the model's output vectors into token probabilities used to sample model outputs.

<img src="Images/transformer_block.png" alt="transformer block">

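The size of a transformer model is determined by the dimensions of its building blocks. Key values include:
* The model's dimension, which determines the sizes of the key, query, value, and output projection matrices in the transformer block.
* The number of transformer blocks.
* The dimension of the feedforward layer.
* The vocabulary size.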
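
A rough sketch of how these values combine into a parameter count. The function name and defaults are hypothetical, and the dimensions in the example are assumed (chosen to resemble a Llama-2-7B-style model), not read from the figure in these notes. The sketch ignores biases and normalization weights, assumes no learned positional embedding matrix, and uses three feedforward matrices per block (a gated MLP); a classic two-matrix MLP would use `mlp_matrices=2`.

```python
def transformer_param_count(d_model, n_layers, d_ff, vocab_size,
                            mlp_matrices=3, tied_embeddings=False):
    """Rough weight count from the four size-determining values.

    Ignores biases and normalization weights, which are comparatively tiny.
    """
    attention = 4 * d_model * d_model      # query, key, value, output projection
    mlp = mlp_matrices * d_model * d_ff    # feedforward weight matrices
    blocks = n_layers * (attention + mlp)
    embedding = vocab_size * d_model       # token embedding matrix
    output_layer = 0 if tied_embeddings else vocab_size * d_model
    return blocks + embedding + output_layer


# Assumed, Llama-2-7B-like dimensions (illustrative only).
total = transformer_param_count(d_model=4096, n_layers=32, d_ff=11008, vocab_size=32000)
print(f"{total / 1e9:.2f}B parameters")
```

With these assumed dimensions the estimate comes out to roughly 6.74 billion weights, almost all of them inside the transformer blocks.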

<img src="Images/llama_dimensions.png" alt="LLAMA dimensions">

#### F. Other model architectures
* AlexNet: 2012
* Seq2seq: 2014-2018
* GAN (generative adversarial networks): 2014-2019
* Transformers: 2017 to present; the architecture has proven sticky
* RWKV: RNN-based model that can be parallelized for training. Due to its RNN nature, in theory, it doesn't have the same context length limitation that transformer-based models have. 2023.
* SSMs (state space models): long-range memory. 2021. Variants: S4, H3, Mamba, Jamba

Takeaway:
* Ilya Sutskever has an interesting argument about why it's so hard to develop new neural network architectures to outperform existing ones: neural networks are great at simulating many computer programs.
* Gradient descent, a technique to train neural networks, is in fact a search algorithm to search through all the programs that a neural network can simulate to find the best one for its target task: new architectures can potentially be simulated by existing ones too.
* For new architectures to outperform existing ones, these new architectures have to be able to simulate programs that existing architectures cannot.
* However, just as the shift from ML engineering to AI engineering has kept many things unchanged, changing the underlying model architecture won't alter the fundamental approaches.

#### G. Model parameters
* In a model (AI/ML), parameters are the internal, learned variables (like weights and biases) that the model adjusts during training to map inputs to outputs, essentially becoming the model's knowledge about the data, determining its predictions and performance.
* Different ML algorithms have different types of parameters. For example, regression models have coefficients, neural networks have weights and biases, and some algorithms, like support vector machines or state space models, have unique types of parameters.

<img src="Images/activation_function.webp" alt="model parameters">

#### H. Model Size
* In general, increasing a model's parameters increases its capacity to learn, resulting in better models.
* The number of parameters indicates how much compute and memory a model needs; for example, a 7-billion-parameter model stored at 2 bytes per parameter requires at least 14 GB of GPU memory for inference.
* The number of parameters can be misleading if the model is *sparse.* A sparse model has a large percentage of zero-value parameters, helpful for efficient data storage & computation.
* A type of sparse model is Mixture of Experts (MoE): an MoE model uses a gating mechanism to dynamically activate only the most relevant experts for a given task or input data. Mixtral is an example of such.
* A larger model can also underperform a smaller model if it isn't trained on data of sufficient quantity, quality, and/or diversity. The dataset size is measured in number of tokens.
* Together's open source dataset RedPajama-v2 has 30 trillion tokens (equivalent to 450 million books), but it is indiscriminate and its quality is low.
* The number of training tokens depends on how many times a model goes through the dataset: training tokens = dataset tokens × number of epochs (e.g., 1 trillion tokens trained for two epochs = 2 trillion training tokens). Most LLMs are only pre-trained on one epoch.
* FLOPs (floating point operations) measure the number of floating point operations performed for a certain task (e.g., Google's largest PaLM-2 model was trained using 10<sup>22</sup> FLOPs).
* FLOP/s (floating point operations per second) measures a machine's peak performance.
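
The two pieces of arithmetic in this list (weight memory and training tokens) can be written as a short sketch. The function names are hypothetical, and the memory figure is only a lower bound: it covers the weights alone and treats 1 GB as 10<sup>9</sup> bytes.

```python
def inference_memory_gb(n_params, bytes_per_param=2):
    """Lower bound on memory needed just to hold the weights (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def training_tokens(dataset_tokens, epochs):
    """Training tokens = dataset tokens x number of epochs."""
    return dataset_tokens * epochs


print(inference_memory_gb(7e9))                # 7B parameters at 2 bytes each
print(training_tokens(1_000_000_000_000, 2))   # 1T-token dataset, two epochs
```

These reproduce the examples from the list: 14 GB for the 7-billion-parameter model and 2 trillion training tokens for two epochs over 1 trillion tokens.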

In summary, a model's scale depends on:
1. Number of parameters, which is a proxy for the model's learning capacity.
2. Number of tokens a model was trained on, which is a proxy for how much a model learned.
3. Number of FLOPs, which is a proxy for the training cost.

#### I. Scaling law (for pre-training)
* Given a compute budget, the rule that helps calculate the optimal model size and dataset size is called the Chinchilla scaling law, proposed in the paper "Training Compute-Optimal Large Language Models" (DeepMind, 2022).
* The number of training tokens should be ~ 20x the model size. E.g. 3B-parameter model needs ~ 60B training tokens.
* The model size and the number of training tokens should be scaled equally: for every doubling of the model size, the number of training tokens should also be doubled.
* The scaling law was developed for dense models trained on predominantly human-generated data; it does not necessarily apply to other settings, such as sparse models.
* Model quality isn't everything. Some models, most notably Llama, have suboptimal performance but better usability. Smaller models are easier to work with and cheaper to run inference on.
* Improving a model's accuracy from 90 to 95% is more expensive than improving it from 85 to 90%. However, small performance changes in language modeling loss or ImageNet accuracy can lead to big differences in the quality of downstream applications.
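
A quick sketch of the 20x rule of thumb from the list above. The FLOPs estimate uses the common approximation of about 6 FLOPs per parameter per training token, which is an assumption added here for illustration and not something stated in these notes. Function names are hypothetical.

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Rule of thumb from the text: training tokens ~ 20 x parameter count."""
    return n_params * tokens_per_param


def approx_training_flops(n_params, n_tokens):
    """Common approximation (an assumption here): ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens


for n_params in (3e9, 6e9):
    tokens = chinchilla_optimal_tokens(n_params)
    flops = approx_training_flops(n_params, tokens)
    print(f"{n_params / 1e9:.0f}B params -> {tokens / 1e9:.0f}B tokens, ~{flops:.1e} FLOPs")
```

Because model size and training tokens scale together, doubling the model size under this rule roughly quadruples the estimated training compute.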

#### J. Model hyperparameters
* A hyperparameter is set by users to configure the model and control how the model learns.
* Model hyperparameters: number of layers, dimension, and vocabulary size.
* Model learning: batch size, number of epochs, learning rate, per-layer initial variance, etc.
* The current approach is to study the impact of hyperparameters on small models of different sizes and to extrapolate to larger models.
* Scaling extrapolation is still a niche topic and difficult. In addition, emergent abilities make the extrapolation less accurate.

#### K. Scaling bottlenecks
* So far, every order of magnitude increase in model size has led to an increase in model performance.
* Two visible bottlenecks for scaling: training data and electricity.
* If you've ever put anything on the internet, you should assume that it already is or will be included in the training data for some language models, whether you consent or not. Bad actors can exploit this by publishing content intended to end up in training data, which enables prompt injection attacks.
* Unique proprietary data (copyrighted books, translations, contracts, medical records, genome sequences, and so forth) will be a competitive advantage in the AI race.
* Many companies, including Reddit and Stack Overflow, have changed their data terms to prevent other companies from scraping their data for their models.

<img src="Images/brain.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 3. Training
*AI Engineering*

----
* Most foundation models are pre-trained using self-supervision.
* Self-supervised pre-training has two drawbacks. First, it optimizes the model for text completion, not conversations. Second, if the model is pre-trained on data indiscriminately scraped from the internet, its outputs can be racist, sexist, rude, or just wrong.
* Pre-training optimizes the model's *token-level* quality, post-training optimizes its *general output* quality. Pre-training is like acquiring knowledge, post-training is applying it.

Training steps (pre-training followed by two post-training stages):
1. **Pre-training:** Self-supervised pre-training results in a rogue model that can be considered an untamed monster because it uses indiscriminate data from the internet.
2. **Supervised finetuning (SFT):** Finetune the pre-trained model on high-quality instruction data (Stack Overflow, Quora, or human annotations) to optimize it for conversations instead of completion. This makes the monster more socially acceptable.
3. **Preference finetuning:** Further finetune the model, typically with reinforcement learning (RLHF or RLAIF), to output responses that align with human preference. This polish makes the model customer-appropriate, which is like giving the monster a smiley face.

<img src="Images/shoggoth.png" alt="shoggoth">

* Technically, you can train a model from scratch on the demonstration data instead of finetuning a pre-trained model, effectively eliminating the self-supervised pre-training step. However, the pre-training approach has often returned superior results.
* Both SFT and preference finetuning are steps taken to address the problem created by the low quality of data used for pre-training.
* If one day we have better pre-training data or better ways to train foundation models, we might not need SFT and preference finetuning at all.

#### A. Supervised Finetuning
Models learn by mimicking their training data, so you can guide their behavior by providing example prompt-response pairs called **demonstration data**. This approach, often called **behavior cloning**, shows the model how it should respond. Because different requests need different kinds of answers, the demonstration data should cover a variety of tasks (like question answering, summarization, and translation).

**Tasks trained:**

<img src="Images/trained_tasks.png" alt="Trained Tasks">

Good demonstration data requires skill and care, so companies often use highly educated labelers to generate it. Generating one (prompt, response) pair can take up to 30 minutes, which makes the data expensive. To reduce their dependence on high-quality human-annotated data, many teams are turning to AI-generated synthetic data.

#### B. Preference Finetuning
The earliest preference finetuning algorithm, which is still popular today, is Reinforcement Learning from Human Feedback (RLHF):
1. Train a reward model that scores the foundation model's outputs.
2. Optimize the foundation model to generate responses for which the reward model will give maximal scores.

Other techniques, such as Direct Preference Optimization (DPO), exist. Still, the Llama 2 authors argued that "the superior writing abilities of LLMs, as manifested in surpassing human annotators in certain tasks, are fundamentally driven by RLHF" (Touvron et al., 2023).

#### C. Reward model (RM)
* RLHF relies on a reward model. Given a pair of (prompt, response), the reward model outputs a score for how good the response is.
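* Good labelers are required to train such a model. Labelers evaluate each response either *pointwise* (scoring it on its own) or by *comparison* (choosing the better of two responses).
* For comparison data, a loss function is then used to teach the reward model to give concrete scores: it pushes the score of the preferred response above the score of the rejected one.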
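
One common way to turn comparison data into concrete scores is a pairwise loss of the form `-log(sigmoid(s_w - s_l))`, where `s_w` and `s_l` are the reward model's scores for the winning and losing responses. The sketch below shows only this loss on made-up scores. It is an illustrative choice under that assumption, not necessarily the objective any particular system uses, and the function name is hypothetical.

```python
import math


def pairwise_reward_loss(score_winner, score_loser):
    """-log(sigmoid(s_w - s_l)): small when the winner outscores the loser."""
    margin = score_winner - score_loser
    # log1p(exp(-margin)) is algebraically equal to -log(sigmoid(margin)).
    return math.log1p(math.exp(-margin))


print(pairwise_reward_loss(2.0, -1.0))   # winner clearly ahead: low loss
print(pairwise_reward_loss(0.0, 0.0))    # tie: loss equals log(2)
print(pairwise_reward_loss(-1.0, 2.0))   # ranking is wrong: high loss
```

Minimizing this loss over many labeled pairs only constrains score differences, which is why comparison labels alone are enough to train a model that outputs absolute scores.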

#### D. Finetuning/best-of-N strategy using the reward model
* With the trained RM, we further train the SFT model to generate better outputs using the proximal policy optimization (PPO) algorithm.
* Some companies skip reinforcement learning and rely only on a reward model. They generate multiple outputs and select the highest-scoring ones using the reward model, a method called the "best-of-N" strategy, to improve performance.
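
The best-of-N strategy can be sketched in a few lines. Here `generate` and `reward` are hypothetical callables standing in for a sampling model and a trained reward model; the toy versions below exist only so the sketch runs without either.

```python
import random


def best_of_n(prompt, generate, reward, n=4):
    """Sample n candidate responses and return the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward(prompt, response))


# Hypothetical stand-ins so the sketch runs without a real model.
_rng = random.Random(0)


def toy_generate(prompt):
    return _rng.choice(["Sure.", "Sure, here is a short answer.", "No idea."])


def toy_reward(prompt, response):
    return len(response)  # placeholder: pretends longer answers are better


print(best_of_n("Explain top-k sampling.", toy_generate, toy_reward, n=8))
```

The cost is N generations plus N reward-model calls per query, which is the trade-off against running a full reinforcement learning stage.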



<img src="Images/brain.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 4. Sampling
*AI Engineering*

----
* A model constructs its outputs through a process known as *sampling.*
* Given an input, a neural network produces an output by first computing the probabilities of possible outcomes.
* For a language model, to generate the next token, the model first computes the probability distribution over all tokens in the vocabulary.
* When working with possible outcomes of different probabilities, a common strategy is to always pick the outcome with the highest probability, known as *greedy sampling.* This works well for classification tasks but not for generative tasks, where always choosing the most likely token produces boring outputs.

<img src="Images/logits.png" alt="Logit/Probability distribution">

* Given an input, a neural network outputs a logit vector. Each logit corresponds to one possible value (in LLMs, one token in the model's vocabulary).
* Logits aren't probabilities: they don't sum to 1 and can even be negative.
* To convert logits to probabilities, a softmax layer is often used. For a model with a vocabulary of N tokens and logit vector x<sub>1</sub>, x<sub>2</sub>, ..., x<sub>N</sub>, the probability of the i<sup>th</sup> token, p<sub>i</sub>, is computed as follows:

<img src="Images/logits_probabilities.png" alt="Logit/Softmax equation">
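
A minimal sketch of the softmax conversion, p<sub>i</sub> = exp(x<sub>i</sub>) / Σ<sub>j</sub> exp(x<sub>j</sub>). The logits are made-up values for a three-token vocabulary, and subtracting the maximum logit is a standard numerical-stability trick that leaves the result unchanged.

```python
import math


def softmax(logits):
    """Convert logits to probabilities: p_i = exp(x_i) / sum_j exp(x_j)."""
    shift = max(logits)  # subtracting the max avoids overflow; result is unchanged
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 2.0, 3.0]   # hypothetical logits for a three-token vocabulary
probs = softmax(logits)
print([round(p, 3) for p in probs])
print(sum(probs))
```

The outputs are all positive and sum to 1 (up to floating-point rounding), and the ordering of the logits is preserved.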

A common debugging technique when working with an AI model is to look at the probabilities this model computes for given inputs. For example, if the probabilities look random, the model hasn't learned much.

<img src="Images/brain.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 5. Sampling Strategies
*AI Engineering*

----
#### A. Temperature
* To redistribute the probabilities of the possible values, you can sample with a temperature. Intuitively, a higher temperature reduces the probabilities of common tokens, and as a result, increases the probabilities of rarer tokens.
* The higher the temperature, the less likely it is that the model is going to pick the most obvious value (the value with the highest logit), making the model's outputs more creative but potentially less coherent.
* The lower the temperature, the more likely it is that the model is going to pick the most obvious value, making the model's output more consistent but potentially more boring.
* **A temperature of 0.7 is often recommended for creative use cases, as it balances creativity and predictability.**
* Technically, temperature can never be 0, because logits can't be divided by 0. In practice, when we set the temperature to 0, the model just picks the token with the largest logit, without doing logit adjustment and softmax calculation.

<img src="Images/temperature.png" alt="Logit/Temperature distribution">
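
A minimal sketch of temperature scaling: each logit is divided by the temperature before softmax, and a temperature of 0 is handled as the argmax special case described above. The two logits are made-up values, and the function name is hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature before softmax; T=0 falls back to argmax."""
    if temperature == 0:
        # Greedy special case: all probability on the largest logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    shift = max(scaled)
    exps = [math.exp(x - shift) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 2.0]  # hypothetical two-token vocabulary
for t in (0, 0.5, 1.0, 2.0):
    print(t, [round(p, 2) for p in softmax_with_temperature(logits, t)])
```

As the temperature rises, probability mass shifts from the token with the larger logit toward the other token; as it falls, the distribution sharpens toward the greedy choice.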

#### B. Logprobs
Using a log scale helps prevent **numerical underflow**, which can happen because language models work with very large vocabularies, making individual token probabilities extremely small and prone to being rounded to zero. Log probabilities reduce this risk and allow probabilities to be represented more reliably.
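
The underflow problem is easy to reproduce. The sketch below multiplies 200 made-up token probabilities of 0.001 each, then computes the same quantity as a sum of logs.

```python
import math

token_probs = [1e-3] * 200   # hypothetical: 200 tokens, each with probability 0.001

product = 1.0
for p in token_probs:
    product *= p              # the true value, 1e-600, is too small for a float

logprob = sum(math.log(p) for p in token_probs)

print(product)   # underflows to 0.0
print(logprob)   # about -1381.55, comfortably representable
```

Multiplying probabilities becomes adding logprobs, so even very long sequences stay within floating-point range.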

#### C. Top-k
**Top-k sampling** reduces computation while keeping response diversity by limiting choices to the **k most likely tokens**. Instead of applying softmax over the entire vocabulary (which is expensive for large models), the model applies softmax only to the top-k logits and samples from them. Smaller k values make outputs more predictable, while larger k values allow more diverse and interesting text.
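
A minimal sketch of top-k sampling as described above: keep the k largest logits, apply softmax to just those, and sample from the result. The logits are made up, the function name is hypothetical, and `random.choices` accepts unnormalized weights, so the explicit division by the sum is skipped.

```python
import math
import random


def top_k_sample(logits, k, rng=random):
    """Keep the k largest logits, softmax over just those, then sample one index."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    shift = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - shift) for i in top]  # unnormalized softmax
    return rng.choices(top, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # hypothetical logits for a 5-token vocabulary
rng = random.Random(0)
samples = [top_k_sample(logits, k=2, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))  # only the two highest-logit indices can appear
```

With `k=1` this reduces to greedy sampling; larger k admits more of the vocabulary.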

#### D. Top-p
* **Top-p (nucleus) sampling** dynamically selects how many tokens to consider based on their probabilities, unlike top-k, which uses a fixed number. The model includes the most likely tokens until their cumulative probability reaches a threshold p (commonly 0.9 to 0.95), then samples only from that set. This adapts to the prompt's uncertainty: fewer options for simple questions and more for open-ended ones, balancing relevance and diversity.
* Unlike top-k, top-p doesn't necessarily reduce the softmax computation load. Its benefit is that it focuses only on the most relevant values for each context, which makes outputs more contextually appropriate.

<img src="Images/top-p.png" alt="Logit/Top-p">

For example, if top-p is 90%, only "yes" and "maybe" will be considered, because their cumulative probability is greater than 90%.
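
A minimal sketch of top-p filtering and sampling. The probabilities are made up for illustration (they are not the values from the figure), chosen so that the two most likely tokens together pass the 90% threshold. Function names are hypothetical.

```python
import random


def top_p_filter(probs, p):
    """Return the smallest set of tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    nucleus, cumulative = {}, 0.0
    for token, prob in ranked:
        nucleus[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(nucleus.values())
    return {token: prob / total for token, prob in nucleus.items()}  # renormalize


def top_p_sample(probs, p, rng=random):
    """Sample one token from the renormalized nucleus."""
    nucleus = top_p_filter(probs, p)
    return rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]


# Hypothetical distribution; the exact numbers are made up for illustration.
probs = {"yes": 0.55, "maybe": 0.37, "no": 0.05, "perhaps": 0.03}
print(top_p_filter(probs, 0.9))
```

With a lower threshold such as `p=0.5`, the nucleus shrinks to the single most likely token, which shows how the candidate set adapts to the shape of the distribution.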

#### E. Stopping condition
To control output length, generation can stop after a fixed number of tokens or when specific stop tokens (like an end-of-sequence marker) appear. While stopping conditions reduce latency and cost, stopping too early can cut off text or break required formats, such as incomplete JSON outputs.


#### F. Test Time Compute
* Test time compute means spending more compute at inference, typically by generating multiple outputs per query, to improve response quality.
* Techniques include **best-of-N sampling** and **beam search**, which focuses on the most promising candidates during generation.
* Increasing output diversity, often by varying sampling parameters, raises the chances of finding a good response.
* However, generating multiple outputs significantly increases computational cost, roughly proportional to the number of outputs generated.
* To pick the best output, you can either show users multiple outputs and let them choose, or devise any number of methods to select the best one.

**Methods of selecting the optimal output:**
* User selected.
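* Pick the output with the highest probability (reported as OpenAI's approach as of 2025). The probability of a sequence is the product of its conditional token probabilities (the chain rule), e.g. p(I love food) = p(I) × p(love | I) × p(food | I, love). To avoid favoring short outputs, the average logprob per token can be compared instead.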
* Use a reward model to score each output.
* Using verifiers can dramatically improve model performance, matching the gains of a 30× increase in model size, while scaling inference-time compute by sampling more outputs can also boost results. However, evidence is mixed: OpenAI found performance peaks around 400 samples before declining, possibly due to adversarial outputs, whereas a Stanford study observed continued log-linear gains up to 10,000 samples. Despite these findings, such large-scale sampling is impractical in real-world systems due to cost.
* Application-specific heuristics to select the best response.
* Sampling multiple outputs and choosing the most common answer can improve accuracy on tasks with exact answers, such as math or multiple-choice questions. This approach helps less robust models in particular, as repeated attempts reduce variability and increase the chance of a correct result, as shown in benchmarks like Gemini on MMLU and practical extraction tasks where retrying significantly improved success rates.
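
Choosing the most common answer is simple to sketch once each sampled output has been reduced to a final answer string. The answers below are made up, and the function name is hypothetical.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer and the fraction of samples that agree."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


# Hypothetical final answers extracted from five sampled outputs.
sampled_answers = ["42", "42", "41", "42", "24"]
print(majority_vote(sampled_answers))
```

The agreement fraction is a cheap confidence signal: a low value suggests the model is unsure and the query may deserve more samples or a fallback.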

#### G. Structured Outputs
Structured outputs are crucial for the following two scenarios:
1. Tasks requiring structured outputs.
2. Tasks whose outputs are used by downstream applications. This is especially important for agentic workflows where a model's outputs are often passed as inputs into tools that the model can use.

**You can guide a model to generate structured outputs at different layers of the AI stack:**
1. Prompting: primary way to get structured outputs, but models don't always follow format instructions reliably. To improve validity, some systems use a second model pass to validate or correct outputs, which can greatly reduce errors but adds extra cost and latency that may be impractical for some applications.
2. Post-processing: Simple post-processing can effectively fix recurring, small formatting errors in model outputs at low cost. Because models often make consistent mistakes, scripts can correct them (for example, by repairing malformed JSON), leading to large gains in validity, as shown by LinkedIn's YAML parser improving correctness from 90% to 99.99%, though this works best when errors are minor and predictable.
3. Test-time compute: another 'bandage' solution. Keep generating outputs until one fits the expected format.
4. Constrained sampling: guides text generation by filtering the allowed tokens at each step according to predefined rules or grammars, ensuring outputs meet a specific format. While effective for structured outputs, it requires complex, format-specific grammars, adds latency, and generalizes poorly, leading some to argue that improving models' instruction-following abilities may be a better use of resources.
5. Finetuning: finetuning a model on examples that follow your desired format is the most effective and general approach, and is more reliable than prompting. For some tasks, the output format can be guaranteed by modifying the model's architecture, such as adding a classifier head *(feature-based transfer)*. You can then finetune either the full model end-to-end or only specific components; end-to-end training requires more resources but yields better performance.

<img src="Images/feature_based_transfer.png" alt="Logit/Feature Based Transfer">
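
The post-processing, test-time compute, and constrained sampling layers listed above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions: `generate` is a hypothetical callable standing in for a model call, the two repairs shown are only examples of predictable errors, and the constrained step picks greedily instead of sampling to keep the example deterministic.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence that models often wrap around JSON


def repair_json(text):
    """Fix two predictable slips: surrounding code fences and trailing commas.

    Assumption: a naive regex pass; it could also alter commas inside strings.
    """
    text = text.strip()
    text = re.sub(rf"^{FENCE}(?:json)?\s*|\s*{FENCE}$", "", text)
    return re.sub(r",\s*([}\]])", r"\1", text)


def generate_until_valid(generate, max_attempts=3):
    """Call generate() until its repaired output parses as JSON."""
    for _ in range(max_attempts):
        candidate = repair_json(generate())
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue  # spend more compute: try another generation
    raise ValueError(f"no valid JSON after {max_attempts} attempts")


def constrained_greedy_pick(logits, allowed):
    """Pick the highest-logit token among those the grammar allows."""
    return max((tok for tok in logits if tok in allowed), key=logits.get)


# Hypothetical model outputs: the first is not JSON, the second is repairable.
fake_outputs = iter([
    "Sure! Here you go",
    FENCE + 'json\n{"name": "Ada",}\n' + FENCE,
])
result = generate_until_valid(lambda: next(fake_outputs))

# At the first step of a JSON grammar only "{" or "[" may be emitted,
# so the otherwise most likely token "Sure" is filtered out.
first_token = constrained_greedy_pick(
    {"{": 1.2, "Sure": 3.0, "[": 0.4}, allowed={"{", "["}
)
```

Capping `max_attempts` keeps the retry loop from running up unbounded cost when the model never produces a parseable output.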

<img src="Images/brain.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 6. The Probabilistic Nature of AI
*AI Engineering*

----
* The way AI models sample their responses makes them *probabilistic.*
* The opposite of probabilistic is *deterministic,* when the outcome can be determined without any random variation.
* *Inconsistency* is when a model generates very different responses for the same or slightly different prompts.
* *Hallucination* is when a model gives a response that isn't grounded in facts.

#### A. Inconsistency
* Ways to mitigate inconsistency: cache the answer so that the next time the same question is asked, the same answer is returned; fix the model's sampling variables (temperature, top-p, and top-k); fix the seed variable (i.e. the starting point for the random number generator used for sampling the next token).
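* Even with these variables fixed, hardware can affect consistency: different machines may execute the same instructions differently and produce different outputs.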
* It is possible to get models to generate responses closer to what you want with carefully crafted prompts and a memory system.
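
The caching and fixed-seed mitigations above can be illustrated with a toy sampler. This is a sketch under assumptions: the logits, the function names, and the `generate` callable are hypothetical, and a real hosted model may still vary across runs even with a fixed seed.

```python
import math
import random


def sample_tokens(logits, n, temperature, seed):
    """Sample n tokens from fixed logits; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)  # private RNG, unaffected by other code
    tokens = list(logits)
    # Temperature-scaled, unnormalized softmax weights.
    weights = [math.exp(logits[t] / temperature) for t in tokens]
    return rng.choices(tokens, weights=weights, k=n)


_cache = {}


def cached_answer(prompt, generate):
    """Return the stored answer for a prompt, generating only on first request."""
    if prompt not in _cache:
        _cache[prompt] = generate(prompt)
    return _cache[prompt]


logits = {"yes": 2.0, "no": 1.0, "maybe": 0.5}

# Same sampling variables and same seed: the two runs are identical.
run_a = sample_tokens(logits, n=5, temperature=0.7, seed=42)
run_b = sample_tokens(logits, n=5, temperature=0.7, seed=42)
```

The cache here matches prompts exactly, so it returns the same answer for a repeated question but does nothing for a slightly reworded one.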

#### B. Hallucination
First hypothesis (self-delusion):
* Models can't distinguish between information they're given and information they generate themselves. Once a model produces an incorrect assumption, it treats that statement as if it were a true fact and continues building on it. This can cause errors to compound, leading to increasingly wrong outputs, a process Ortega et al. call **self-delusion** and later researchers describe as **snowballing hallucinations**.
* An example shows a vision-language model mistaking a shampoo bottle for milk and then inventing milk-related ingredients. Such initial false assumptions can also cause the model to fail on questions it would normally answer correctly.

Second hypothesis (mismatched internal knowledge):
* Hallucinations arise when models are trained to copy labelers who use knowledge the model itself doesn't have. During supervised fine-tuning, this mismatch teaches the model to produce answers that appear knowledgeable but aren't grounded in its own knowledge.
* While having labelers explicitly include all the knowledge behind their answers could reduce this, doing so is impractical.

Strategies to reduce hallucinations:
1. Reinforcement learning method that trains the model to distinguish between user-provided information and its own generated outputs (DeepMind).
2. Supervised learning method that incorporates both factual and counterfactual examples into training data (DeepMind).
3. Good prompting and context construction, such as adding "Answer as truthfully as possible, and if you're unsure of the answer, say, 'Sorry, I don't know.'"
4. Asking models for concise responses also seems to help with hallucinations: the fewer tokens a model has to generate, the less chance it has to make things up.
5. Detecting and rejecting hallucinated responses.

<img src="Images/brain.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 7. Summary
*AI Engineering*

----
* Most people will use existing foundation models rather than training them from scratch due to compute cost and resources required.
* Training data is a major factor in model performance, and curating domain- or language-specific data is often necessary, especially for low-resource settings.
* Model architecture choices matter, with transformers dominating language models despite having known limitations.
* Model scaling depends on parameters, data, and compute, and while larger models currently perform better, scaling faces limits and bottlenecks.
* Post-training (supervised and preference fine-tuning) is needed to better align models with human expectations, though alignment remains imperfect.
* Sampling makes model outputs probabilistic, enabling creativity but also causing inconsistency and hallucinations.
* Effective AI workflows must be designed to account for the probabilistic nature of model behavior.