# Module 18: Large Language Models (LLMs)

**The Foundation of Modern AI**

---

## 1. Objectives

- ✅ Understand LLM landscape
- ✅ Know scaling laws
- ✅ Use open-source LLMs locally
- ✅ Understand inference optimization

## 2. Prerequisites

- [Module 17: HuggingFace Ecosystem](../17_huggingface/17_huggingface.ipynb)

## 3. LLM Landscape (2024)

### Proprietary Models

| Model | Company | Params | Context |
|-------|---------|--------|----------|
| GPT-4 Turbo | OpenAI | Undisclosed (~1.7T is an unconfirmed rumor) | 128K |
| Claude 3 | Anthropic | Undisclosed | 200K |
| Gemini 1.5 Pro | Google | Undisclosed | 1M |

### Open-Source Models

| Model | Params | Best For |
|-------|--------|---------|
| LLaMA 3 | 8B/70B | General |
| Mistral | 7B | Efficiency |
| Phi-3 | 3.8B | Small & fast |
| Qwen 2 | 7B-72B | Multilingual |
| CodeLlama | 7B-70B | Code |

In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Device: {device}")
if device == 'cuda':
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

Device: cuda
GPU Memory: 15.8 GB


## 4. Scaling Laws

### Chinchilla Optimal Scaling

$$L(N, D) \approx \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} + E$$

Where:
- L = predicted training loss
- N = model parameters
- D = training tokens
- E = irreducible loss; A, B = fitted constants
- α ≈ 0.34, β ≈ 0.28

### Key Insight
**Compute-optimal: train on roughly 20 tokens per parameter**

| Model Size | Optimal Tokens |
|------------|---------------|
| 1B params | 20B tokens |
| 7B params | 140B tokens |
| 70B params | 1.4T tokens |
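
The table and the loss formula can be checked numerically. The sketch below is illustrative rather than a reproduction of the paper's analysis: the constants `A`, `B`, and `E` are the fitted values reported in the Chinchilla paper (an addition here, not stated above), the 20 tokens-per-parameter ratio is the rule of thumb from this section, and `training_flops` uses the common approximation C ≈ 6·N·D, which is an extra assumption.

```python
# Fitted constants as reported by Hoffmann et al. (2022); treat as illustrative.
A, B, E = 406.4, 410.7, 1.69
ALPHA, BETA = 0.34, 0.28


def chinchilla_loss(n_params, n_tokens):
    """Predicted loss L(N, D) = A / N^alpha + B / D^beta + E."""
    return A / n_params**ALPHA + B / n_tokens**BETA + E


def optimal_tokens(n_params, tokens_per_param=20):
    """Rule-of-thumb compute-optimal token budget (~20 tokens per parameter)."""
    return tokens_per_param * n_params


def training_flops(n_params, n_tokens):
    """Approximate training compute, assuming C ~= 6 * N * D."""
    return 6 * n_params * n_tokens


for n in (1e9, 7e9, 70e9):
    d = optimal_tokens(n)
    print(f"{n / 1e9:.0f}B params -> {d / 1e9:.0f}B tokens, "
          f"loss ~ {chinchilla_loss(n, d):.2f}, FLOPs ~ {training_flops(n, d):.1e}")
```

Under this formula, adding tokens at a fixed model size always lowers the predicted loss, but with diminishing returns; the 20:1 ratio is where extra compute is split evenly between model size and data.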

## 5. Using Open-Source LLMs

In [2]:
# Small model for demo
model_name = "microsoft/phi-2"  # 2.7B params

# Load only the tokenizer here; it is a small download and needs no GPU
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# A model this small (about 5.4 GB of weights in FP16) can be loaded directly,
# without quantization, if enough memory is available:
# model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

print(f"Loaded tokenizer for {model_name}")

Loaded tokenizer for microsoft/phi-2


In [3]:
# Using pipeline (easiest)
generator = pipeline(
    'text-generation',
    model='gpt2',  # Using GPT-2 for demo
    device=0 if device == 'cuda' else -1
)

response = generator(
    "The key to machine learning is",
    max_new_tokens=50,
    do_sample=True,
    temperature=0.7
)

print(response[0]['generated_text'])

Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The key to machine learning is learning how to train a program to perform tasks that can be done in real time. The process is often referred to as "learning in a computer" but is actually a more abstract term that refers to the process of learning how to build an algorithm.


## 6. Inference Optimization

### Quantization

| Precision | Memory | Relative to FP32 |
|-----------|--------|------------------|
| FP32 | 4 bytes/param | Baseline |
| FP16 | 2 bytes/param | 2x smaller; often faster on GPUs with native FP16 support |
| INT8 | 1 byte/param | 4x smaller |
| INT4 | 0.5 byte/param | 8x smaller |

In [5]:
# Quantization with bitsandbytes
!pip install bitsandbytes accelerate

from transformers import BitsAndBytesConfig
import torch

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

# Load quantized model
# model = AutoModelForCausalLM.from_pretrained(
#     "meta-llama/Llama-2-7b-hf",
#     quantization_config=bnb_config,
#     device_map="auto"
# )

print("Quantization config ready!")
print("7B model needs ~4GB VRAM in 4-bit vs ~28GB in FP32")

Collecting bitsandbytes
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.49.1
Quantization config ready!
7B model needs ~4GB VRAM in 4-bit vs ~28GB in FP32
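
To see where the memory savings and the accuracy loss come from, here is a minimal sketch of symmetric absmax quantization in plain Python. This is deliberately a simpler scheme than the NF4 format configured above (NF4 uses non-uniformly spaced levels and block-wise scales); the function names and the sample weights are hypothetical and chosen only for illustration.

```python
def quantize_absmax(weights, bits=8):
    """Map floats to signed integer codes using a single absmax scale."""
    qmax = 2 ** (bits - 1) - 1                   # 127 for INT8, 7 for INT4
    largest = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = largest / qmax
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


def max_abs_error(weights, bits):
    """Largest round-trip error after quantizing to the given bit width."""
    codes, scale = quantize_absmax(weights, bits)
    restored = dequantize(codes, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))


weights = [0.12, -0.53, 0.98, -0.05, 0.31]  # hypothetical sample weights
for bits in (8, 4):
    codes, _ = quantize_absmax(weights, bits)
    print(f"INT{bits}: codes={codes}, max abs error={max_abs_error(weights, bits):.4f}")
```

Fewer bits mean fewer representable levels, so the round-trip error grows as the bit width shrinks. Production schemes reduce this error by using one scale per small block of weights instead of one scale per tensor.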


### KV Cache

```
Without cache: Recompute all K,V for each new token
With cache:    Store K,V, only compute for new token

Speedup: O(n²) → O(n) per token
```
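
The difference can be sketched with a toy single-head attention layer in plain Python. Everything here is hypothetical: the 2-dimensional projection matrices, the fixed input vectors standing in for token embeddings, and the helper names. The point is only to count how many key/value projections each strategy performs while producing identical outputs.

```python
import math

# Toy 2-d projection matrices and token embeddings (hypothetical values)
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [0.2, 0.7]]
W_V = [[1.0, 0.3], [0.4, 1.0]]
TOKENS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.2]]


def project(x, W):
    """Multiply row vector x by matrix W."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def attend(q, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


def decode_without_cache(tokens):
    """Recompute K and V for every earlier position at each step."""
    outputs, kv_projections = [], 0
    for t in range(len(tokens)):
        keys = [project(x, W_K) for x in tokens[: t + 1]]
        values = [project(x, W_V) for x in tokens[: t + 1]]
        kv_projections += t + 1
        outputs.append(attend(project(tokens[t], W_Q), keys, values))
    return outputs, kv_projections


def decode_with_cache(tokens):
    """Project K and V once per token and reuse the stored results."""
    outputs, kv_projections = [], 0
    key_cache, value_cache = [], []
    for x in tokens:
        key_cache.append(project(x, W_K))
        value_cache.append(project(x, W_V))
        kv_projections += 1
        outputs.append(attend(project(x, W_Q), key_cache, value_cache))
    return outputs, kv_projections


slow_out, slow_count = decode_without_cache(TOKENS)
fast_out, fast_count = decode_with_cache(TOKENS)
print(f"K/V projections without cache: {slow_count}, with cache: {fast_count}")
```

For n tokens the uncached loop performs n(n+1)/2 key/value projections in total, while the cached loop performs n. The price is memory: the cache grows linearly with sequence length.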

## 7. Memory Estimation

### Rule of Thumb

```
FP32: params × 4 bytes
FP16: params × 2 bytes
INT8: params × 1 byte
INT4: params × 0.5 bytes
```

### Examples

| Model | FP16 | INT4 |
|-------|------|------|
| 7B | 14 GB | 3.5 GB |
| 13B | 26 GB | 6.5 GB |
| 70B | 140 GB | 35 GB |
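
The rule of thumb translates directly into a small helper. This is a sketch that covers model weights only, using 1 GB = 1e9 bytes to match the table; activations, the KV cache, and framework overhead come on top. The function and dictionary names are hypothetical.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params, precision):
    """Weights-only memory in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


for n_params in (7e9, 13e9, 70e9):
    fp16 = weight_memory_gb(n_params, "fp16")
    int4 = weight_memory_gb(n_params, "int4")
    print(f"{n_params / 1e9:.0f}B: FP16 = {fp16:.1f} GB, INT4 = {int4:.1f} GB")
```

Comparing the result against the GPU memory printed in the first cell gives a quick check of whether a model's weights can fit at a given precision.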

## 8. 🔥 Real-World Usage

### Decision Framework

| Requirement | Solution |
|-------------|----------|
| Best quality | GPT-4 / Claude API |
| Privacy/offline | Local LLM (LLaMA, Mistral) |
| Low latency | Smaller models + quantization |
| Specialized | Fine-tuned domain model |

### Local LLM Tools

| Tool | Use Case |
|------|----------|
| **Ollama** | Easiest local deployment |
| **llama.cpp** | CPU inference |
| **vLLM** | High-throughput serving |
| **TGI** | Production serving |

## 9. Interview Questions

**Q1: What are scaling laws?**
<details><summary>Answer</summary>

Empirical laws showing that model loss falls predictably, roughly as a power law, with compute, data, and parameters. Chinchilla showed the compute-optimal ratio is ~20 training tokens per parameter.
</details>

**Q2: How does INT4 quantization work?**
<details><summary>Answer</summary>

Reduces each weight from 32 or 16 bits to 4 bits using formats such as NF4 (4-bit NormalFloat). Trades a small accuracy loss for a 4x (vs FP16) to 8x (vs FP32) memory reduction.
</details>

**Q3: What is KV cache?**
<details><summary>Answer</summary>

Stores key/value tensors from attention for previously generated tokens. Avoids recomputing them, making each autoregressive generation step O(n) instead of O(n²).
</details>

## 10. Summary

- **LLM Landscape**: GPT-4, Claude (API) vs LLaMA, Mistral (local)
- **Scaling Laws**: ~20 tokens per parameter is compute-optimal
- **Quantization**: INT4 = 8x memory reduction vs FP32 (4x vs FP16)
- **Local Options**: Ollama, vLLM, llama.cpp

## 11. References

- [Chinchilla Paper](https://arxiv.org/abs/2203.15556)
- [LLaMA Paper](https://arxiv.org/abs/2302.13971)
- [Ollama](https://ollama.ai)
- [vLLM](https://github.com/vllm-project/vllm)

---
**Next:** [Module 19: Prompt Engineering](../19_prompt_engineering/19_prompt_engineering.ipynb)