```{contents}
```

## Large Language Models (LLMs)

---

LLMs — like **GPT**, **LLaMA**, **Claude**, or **Gemini** — are **Transformer-based neural networks** trained on massive text datasets to understand and generate human-like language.
They work by predicting the **next word (token)** given a sequence of previous words, but the underlying concepts involve deep learning, linguistics, optimization, and large-scale computation.

---

### Foundation Concept: Language Modeling

The **primary objective** of an LLM is **next-token prediction**:

$$
P(w_t | w_1, w_2, ..., w_{t-1})
$$

The model learns the probability distribution of words (or tokens) given previous context — this is **auto-regressive language modeling**.

At inference, it generates text by **sampling tokens** one by one based on this learned probability distribution.
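
The generation loop described above can be sketched with a toy stand-in for the model. This is a minimal illustration, not how a real LLM is implemented: `NEXT_TOKEN_PROBS` is a hypothetical hand-written bigram table, so it conditions only on the previous token rather than on the full prefix $w_1, \dots, w_{t-1}$.

```python
import random

# Hypothetical toy "model": P(next token | previous token).
# A real LLM conditions on the whole prefix, not just the last token.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 1.0},
    "sat": {"on": 1.0},
    "on": {"the": 1.0},
    "mat": {"</s>": 1.0},
}

def generate(max_tokens=10, seed=0):
    """Sample one token at a time until </s> or max_tokens is reached."""
    rng = random.Random(seed)
    tokens = ["<s>"]
    for _ in range(max_tokens):
        dist = NEXT_TOKEN_PROBS[tokens[-1]]
        next_token = rng.choices(list(dist), weights=list(dist.values()))[0]
        if next_token == "</s>":
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start marker

def sequence_probability(tokens):
    """Chain rule: multiply the conditional probability of each token."""
    prob, prev = 1.0, "<s>"
    for tok in tokens:
        prob *= NEXT_TOKEN_PROBS[prev].get(tok, 0.0)
        prev = tok
    return prob

print(generate())
print(sequence_probability(["the", "cat", "sat"]))  # 1.0 * 0.6 * 1.0
```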

---

### Transformer Architecture

The **Transformer** is the backbone of nearly all modern LLMs.

#### Key components:

| Component                                      | Description                                                              |
| ---------------------------------------------- | ------------------------------------------------------------------------ |
| **Input Embedding**                            | Converts words or tokens into dense vectors (numerical form).            |
| **Positional Encoding**                        | Adds token-order information, since the Transformer has no recurrence.   |
| **Multi-Head Self-Attention**                  | Learns contextual relationships between all words in a sequence.         |
| **Feed-Forward Layers**                        | Non-linear transformations applied independently per token.              |
| **Residual Connections + Layer Normalization** | Stabilize and speed up deep model training.                              |

**Equation (attention):**
$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

Here,

* $Q, K, V$ = query, key, and value matrices
* $d_k$ = key dimension for scaling

Each layer refines the contextual representation of every token; stacking many layers lets the model build progressively more abstract features of the text.
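
As a rough sketch, the attention equation can be written out in plain Python for a single head with no masking. The matrices `Q`, `K`, and `V` below are made-up values for three tokens with $d_k = 2$; in a real model they come from learned linear projections of the token representations.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on row-major lists of vectors."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Each output row is the attention-weighted average of the values.
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
        all_weights.append(weights)
    return outputs, all_weights

# Illustrative values only: 3 tokens, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])  # each row sums to 1
```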

---

### Tokenization

Before text enters the model, it is **tokenized**: broken into tokens, most commonly subword units.

| Tokenizer Type                     | Example                   | Advantage                     |
| ---------------------------------- | ------------------------- | ----------------------------- |
| **Word-level**                     | “playing” → one token     | Simple but limited vocab      |
| **Subword-level (BPE, WordPiece)** | “playing” → “play”, “ing” | Efficient, handles rare words |
| **Character-level**                | Each letter = token       | Huge sequences but robust     |

**Example:**
“Transformers are powerful” → `[Transform, ers, are, powerful]`
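
The split above can be reproduced with a minimal greedy longest-match tokenizer. This is only a sketch: `VOCAB` is a hypothetical hand-picked vocabulary, and the matching rule resembles how WordPiece applies a vocabulary at inference time. It does not show how BPE or WordPiece learn their vocabularies from data.

```python
# Hypothetical vocabulary; real tokenizers learn tens of thousands of entries.
VOCAB = {"Transform", "ers", "are", "powerful", "play", "ing"}

def tokenize_word(word, vocab):
    """Split one word by repeatedly taking the longest matching prefix."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # no vocabulary entry matches at this position
            return ["<unk>"]
        tokens.append(word[start:end])
        start = end
    return tokens

def tokenize(text, vocab=VOCAB):
    """Whitespace-split the text, then split each word into subwords."""
    return [tok for word in text.split() for tok in tokenize_word(word, vocab)]

print(tokenize("Transformers are powerful"))
print(tokenize("playing"))
```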

---

### Embeddings

Tokens are converted into **dense vectors** via learned embeddings.

$$
x_i = E[t_i]
$$
where $E$ is the embedding matrix and $t_i$ is the token ID.

These embeddings capture **semantic similarity** (e.g., “dog” and “cat” have similar vectors).
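
A small sketch of the lookup $x_i = E[t_i]$ and of similarity between embeddings. The token IDs and the three-dimensional vectors in `E` are invented for illustration; learned embeddings typically have hundreds or thousands of dimensions.

```python
import math

# Hypothetical token IDs and a hand-written embedding matrix E (one row per token).
TOKEN_IDS = {"dog": 0, "cat": 1, "car": 2}
E = [
    [0.9, 0.8, 0.1],  # dog
    [0.8, 0.9, 0.2],  # cat
    [0.1, 0.2, 0.9],  # car
]

def embed(token):
    """Embedding lookup: select the row of E indexed by the token ID."""
    return E[TOKEN_IDS[token]]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(round(cosine(embed("dog"), embed("cat")), 3))  # close to 1
print(round(cosine(embed("dog"), embed("car")), 3))  # much lower
```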

---

### Self-Attention Mechanism

Self-attention lets each token **attend** to the other tokens in the input sequence, enabling **contextual understanding**. In decoder-only LLMs, a causal mask restricts each token to itself and earlier positions.

Example:
In “The cat sat on the mat,” the model learns that “cat” relates to “sat,” and “mat” relates to “on.”

It replaces RNNs’ step-by-step dependency with **parallel computation**: during training, all positions in a sequence are processed at once.

---

### Training Objective

LLMs are trained via **Maximum Likelihood Estimation (MLE)**:
$$
\mathcal{L} = -\sum_{t} \log P(w_t | w_{<t})
$$
The model minimizes the negative log-probability of the correct next token over massive text corpora.
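
To make the loss concrete, here is a small worked sketch. The per-step distributions in `step_probs` are made up; in practice they are the model's softmax outputs at each position.

```python
import math

def negative_log_likelihood(step_probs, targets):
    """Sum of -log P(correct token) over all positions."""
    return -sum(math.log(probs[target])
                for probs, target in zip(step_probs, targets))

# Hypothetical predicted distributions for two positions.
step_probs = [
    {"the": 0.5, "a": 0.5},
    {"cat": 0.25, "dog": 0.75},
]
targets = ["the", "cat"]

loss = negative_log_likelihood(step_probs, targets)
print(loss)  # -(ln 0.5 + ln 0.25) = ln 8
```

Raising the probability assigned to the correct tokens lowers the loss, which is what gradient descent does during training.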

---

### Pretraining

The model is trained on **billions to trillions of tokens** (e.g., books, articles, websites) to learn general language patterns.

* Objective: Predict next token (auto-regressive)
* Dataset: Internet-scale (Common Crawl, Wikipedia, etc.)
* Duration: Weeks to months on hundreds to thousands of GPUs or similar accelerators

Result: A **foundation model** with broad world knowledge and language skills.

---

### Fine-Tuning

Once pretrained, the model is **adapted** to specific tasks:

| Type                                                  | Description                                                                     |
| ----------------------------------------------------- | ------------------------------------------------------------------------------- |
| **Supervised Fine-Tuning (SFT)**                      | Trained on labeled data for a specific task (e.g., Q&A, summarization).         |
| **Reinforcement Learning from Human Feedback (RLHF)** | Aligns model responses with human preferences (ethical, safe, helpful).         |
| **Instruction Tuning**                                | Fine-tuned on datasets of (instruction, response) pairs to follow user prompts. |

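**RLHF stages:**

1. Start from a pretrained base model, usually after supervised fine-tuning.
2. Train a reward model on human rankings of model outputs.
3. Optimize the policy against the reward model, for example with **Proximal Policy Optimization (PPO)**.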
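
For the reward-model stage, one commonly used formulation (an assumption here, since the text above does not specify a loss) is a pairwise ranking loss: given a response humans preferred and one they rejected, the loss is $-\log \sigma(r_{\text{chosen}} - r_{\text{rejected}})$. A minimal sketch with made-up reward scores:

```python
import math

def pairwise_ranking_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(margin)), written as log(1 + exp(-margin))."""
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))

# Hypothetical scalar scores a reward model might assign.
print(pairwise_ranking_loss(2.0, 0.0))  # small: ranking agrees with humans
print(pairwise_ranking_loss(0.0, 2.0))  # large: ranking disagrees
```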

---

### Context Window and Attention Scaling

LLMs process input within a **context window** (e.g., 4K, 32K, or 1M tokens).

Challenge: Attention scales as $O(n^2)$ with sequence length.

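Mitigations:

* **Sparse attention** (e.g., Longformer): each token attends to only a subset of positions, which reduces cost below $O(n^2)$.
* **FlashAttention**: computes exact attention with far less memory traffic; compute is still $O(n^2)$, but memory no longer grows quadratically.
* **Rotary Positional Embeddings (RoPE)**: a position-encoding scheme; it does not reduce attention cost, but it is a common basis for context-length extension methods.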
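
A quick back-of-the-envelope sketch of the quadratic growth. It counts the entries of one full $n \times n$ attention score matrix (one head, one layer) and assumes 2 bytes per entry, as with 16-bit floats; both function names are illustrative.

```python
def attention_score_entries(n_tokens):
    """Number of query-key scores in one full attention matrix."""
    return n_tokens * n_tokens

def score_matrix_bytes(n_tokens, bytes_per_entry=2):
    """Memory for one materialized score matrix (assumes 16-bit entries)."""
    return attention_score_entries(n_tokens) * bytes_per_entry

for n in (4_096, 32_768):
    mib = score_matrix_bytes(n) / 2**20
    print(f"n={n:>6}: {attention_score_entries(n):>13,} scores, {mib:,.0f} MiB per head")
```

Going from 4K to 32K tokens is 8x the length but 64x the number of scores.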

---

### Knowledge Representation

LLMs store knowledge **implicitly in weights** — there’s no explicit database.
Patterns, facts, and grammar rules are encoded as parameter relationships learned during training.

---

### Inference (Text Generation)

During generation:

1. Model predicts next-token probabilities.
2. Sampling strategies select the next token:

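   * **Greedy decoding:** Pick the highest-probability token.
   * **Top-k / Top-p sampling:** Sample only from the k most likely tokens, or from the smallest set whose cumulative probability reaches p.
   * **Temperature:** Rescales logits before the softmax; lower values make output more deterministic, higher values more varied.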

**Repeat until:** End-of-sequence token or max length.
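
The decoding strategies above can be sketched over a single step's logits. This is a simplified illustration with a made-up four-token vocabulary; the helper names are hypothetical and production libraries implement these filters on tensors.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature rescales the logits first."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def renormalize(probs, keep):
    """Zero out tokens not in `keep` and rescale the rest to sum to 1."""
    keep = set(keep)
    total = sum(probs[i] for i in keep)
    return [p / total if i in keep else 0.0 for i, p in enumerate(probs)]

def greedy(logits):
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)

logits = [2.0, 1.0, 0.5, -1.0]  # illustrative scores for a 4-token vocabulary
probs = softmax(logits)
rng = random.Random(0)

print("greedy:", greedy(logits))
print("top-k :", [round(x, 3) for x in top_k_filter(probs, 2)])
print("top-p :", [round(x, 3) for x in top_p_filter(probs, 0.9)])
print("sample:", rng.choices(range(len(probs)), weights=top_k_filter(probs, 2))[0])
```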

---

### Scaling Laws

LLMs follow **scaling laws**:
Performance improves predictably with more data, parameters, and compute.

$$
\text{Loss} \propto N^{-\alpha}
$$

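where $N$ is the number of model parameters and $\alpha$ is an empirically fitted exponent, roughly 0.05–0.1. Similar power laws have been reported for dataset size and compute.

Implication: larger models trained on more data tend to reach lower loss, which usually, though not always, carries over to better reasoning and generalization.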
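
A short worked example of what the power law implies, taking the proportionality at face value and ignoring any irreducible loss term (a simplifying assumption). The helper `loss_ratio` is illustrative.

```python
def loss_ratio(size_multiplier, alpha):
    """Factor by which loss changes when N is multiplied by size_multiplier."""
    return size_multiplier ** (-alpha)

for alpha in (0.05, 0.1):
    ratio = loss_ratio(10, alpha)
    print(f"alpha={alpha}: 10x parameters -> loss multiplied by {ratio:.3f}")
```

With exponents in this range, a tenfold increase in parameters lowers the loss by only about 11% to 21%, which is why gains from scale are steady but expensive.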

---

### Architecture Enhancements

Modern LLMs include refinements:

* **Rotary Position Embeddings (RoPE)** – encode relative position by rotating query and key vectors
* **Mixture-of-Experts (MoE)** – route tokens to specialized subnetworks
* **Parallel attention + FFN layers** – improved compute efficiency
* **Adapter layers / LoRA** – parameter-efficient fine-tuning
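
To illustrate why LoRA is parameter-efficient: instead of updating a full weight matrix, it trains two low-rank matrices whose product is added to the frozen weight. The sketch below only counts trainable parameters; the layer size and rank are illustrative assumptions, not values from any particular model.

```python
def full_update_params(d_in, d_out):
    """Trainable parameters when the whole weight matrix is fine-tuned."""
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    """Trainable parameters for LoRA: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

d_in = d_out = 4096  # hypothetical projection size
rank = 8             # hypothetical LoRA rank

full = full_update_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)
print(f"full: {full:,}  LoRA: {lora:,}  ({100 * lora / full:.2f}% of full)")
```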

---

### Emergent Abilities

Beyond simple prediction, LLMs exhibit **emergent behaviors**:

* Reasoning and logic
* Code synthesis
* Tool use (via APIs)
* Chain-of-thought reasoning
* Translation, summarization, Q&A

These abilities are not explicitly programmed; they tend to appear as model scale and data diversity grow.

---

### Alignment, Safety, and Guardrails

LLMs are fine-tuned to improve **alignment** with human values:

* The goal is to avoid harmful, biased, or false outputs.
* Reinforcement Learning from Human Feedback (RLHF) steers the model toward safer behavior but does not guarantee it.
* Safety filters, moderation layers, and grounding mechanisms are applied on top of the model.

---

### Evaluation Metrics

| Metric                       | Description                                                          |
| ---------------------------- | -------------------------------------------------------------------- |
| **Perplexity**               | Measures how well the model predicts the next token. Lower = better. |
| **BLEU / ROUGE / METEOR**    | Evaluate generated text quality (translation, summarization).        |
| **Truthfulness / Coherence** | Subjective human evaluation.                                         |
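
Perplexity is the exponential of the average negative log-probability the model assigns to the correct tokens. A minimal sketch, using made-up probabilities:

```python
import math

def perplexity(correct_token_probs):
    """exp(mean negative log-probability of the correct tokens)."""
    nll = [-math.log(p) for p in correct_token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical probabilities assigned to the correct token at each position.
print(perplexity([0.5, 0.25, 0.125]))  # geometric mean is 0.25, so about 4
print(perplexity([0.1] * 5))           # like guessing among 10 options: about 10
```

A perplexity of $k$ can be read as the model being, on average, as uncertain as a uniform choice among $k$ tokens.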

---

### Applications of LLMs

| Domain                                   | Example Use                             |
| ---------------------------------------- | --------------------------------------- |
| **Text generation**                      | Chatbots, content creation              |
| **Code generation**                      | GitHub Copilot, OpenAI Codex            |
| **Summarization**                        | Legal, academic, or financial summaries |
| **Translation**                          | Multilingual assistants                 |
| **Reasoning & tutoring**                 | Math, science, Q&A                      |
| **Retrieval-Augmented Generation (RAG)** | Combines external knowledge with LLMs   |

---

### RAG (Retrieval-Augmented Generation)

LLMs cannot reliably recall facts that are absent from their training data, so RAG integrates **external databases or vector stores**:

1. Retrieve relevant documents using embeddings.
2. Insert them into the LLM’s context.
3. Generate an answer grounded in the retrieved text.

$$
\text{Response} = \text{LLM}(\text{query} + \text{retrieved context})
$$
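
The retrieve-then-prompt flow can be sketched end to end with toy components. Everything here is a stand-in: `embed` is a bag-of-words counter rather than a learned embedding model, `DOCUMENTS` is a hypothetical three-sentence corpus, and the final LLM call is omitted, so the sketch stops at building the prompt.

```python
import math
from collections import Counter

# Hypothetical document store.
DOCUMENTS = [
    "The Eiffel Tower is in Paris.",
    "Python is a programming language.",
    "The Great Wall is in China.",
]

def embed(text):
    """Toy embedding: word counts. Real systems use a learned embedding model."""
    cleaned = text.lower().replace(".", "").replace("?", "")
    return Counter(cleaned.split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, documents, k=1):
    """Place the retrieved text in the context ahead of the question."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("Where is the Eiffel Tower?", DOCUMENTS))
```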

---

### LLM Ecosystem Concepts

| Concept                          | Description                                                         |
| -------------------------------- | ------------------------------------------------------------------- |
| **Embedding Models**             | Convert text → numerical vectors for similarity search.             |
| **Prompt Engineering**           | Crafting inputs to guide model behavior.                            |
| **Fine-Tuning / LoRA / Adapter** | Techniques to customize LLMs.                                       |
| **Quantization / Pruning**       | Compress models for efficient inference.                            |
| **Multi-Modal LLMs**             | Combine text with images, audio, or video (e.g., GPT-4, Gemini).    |
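
As an illustration of the quantization entry, here is a minimal sketch of symmetric per-tensor int8 quantization. It is one simple scheme among many, assumes at least one non-zero weight, and uses made-up weight values.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127  # assumes a non-zero weight exists
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.03, 1.0]  # illustrative values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

print(quantized)
print([round(r, 4) for r in restored])
```

Each weight is stored in 8 bits instead of 16 or 32, at the cost of a rounding error of at most half the scale per weight.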

---

**Summary**

| **Concept**                   | **Purpose**                          |
| ----------------------------- | ------------------------------------ |
| **Language Modeling**         | Predict next token                   |
| **Transformer Backbone**      | Contextual understanding             |
| **Tokenization**              | Convert text into numerical input    |
| **Attention Mechanism**       | Learn relationships between words    |
| **Pretraining + Fine-tuning** | Learn general → task-specific skills |
| **RLHF / Alignment**          | Human-aligned safe responses         |
| **RAG / Embeddings**          | Add factual grounding                |
| **Scaling & Optimization**    | Improve reasoning & capacity         |

---

**In short:**

> LLMs are large-scale **Transformer-based generative models** trained on massive text corpora to learn the structure, meaning, and context of language.
> They combine **deep neural networks**, **probabilistic modeling**, and **massive-scale optimization** to reason, generate, and communicate like humans.


Data ingestion topics:

1. Batch vs streaming ingestion
2. Source systems (databases, APIs, files, message queues)
3. Change Data Capture (CDC)
4. Data formats (CSV, JSON, Parquet, Avro, ORC)
5. Schema design and schema evolution
6. Data validation and quality checks
7. Deduplication and noise removal
8. Data transformation (ETL vs ELT)
9. Orchestration and workflow scheduling
10. Error handling and retry policies
11. Metadata and lineage tracking
12. Security and access control
13. Scalability and throughput optimization
14. Latency vs consistency trade-offs
15. Storage targets (data lake, warehouse, OLTP/OLAP)
