# Introduction to language models

In [1]:
import numpy as np
import torch
import torch.nn.functional as F

## Lecture plan

- What is a "language model"?
  - Brief overview of $n$-gram models.
- The rise of "neural" language models (NLMs).
  - Feedforward language models.
  - The **transformer** architecture and self-attention.

## What is a "language model"?

> A **language model** assigns a *probability* to a sequence of words or characters, and can also be used to make predictions about a *given word* in a particular context.

*I like my coffee with cream and ___*

- Sugar?
- Salt?
- Mercury?

These probabilities are *learned* by looking at the statistics of actual language use in a large corpus.

### Key premises

- **Language**: We're making predictions about *words* (or *characters*).
- **Probability**: Some words (or characters) are more likely in some contexts than others.
- **Order**: Language unfolds over time, i.e., *sequentially*.

Thus, we want a system to make *probabilistic predictions* about a given word (or character) based on the preceding context.

### $n$-grams: a simple approach

> An $n$-**gram** language model (LM) assigns probabilities to a word $w_t$ given the previous $n - 1$ words, based on the proportion of times $w_t$ occurs after those exact words.

- A *unigram* model considers only $w_t$.
- A *bigram* model considers $w_t$ and the word before ($w_{t-1}$).
- A *trigram* model considers $w_t$ and the two words immediately before ($w_{t-2}$ and $w_{t-1}$).

And so on!

### Check-in

Recall this example from earlier:

> *I like my coffee with cream and ___*

How would you calculate $p(sugar)$ using:

- a unigram model?
- a bigram model?
- a trigram model?

In [2]:
### Your code here

### Different amounts of context

- A unigram model considers just the **base rate** of *sugar*, i.e., its count relative to the size of the entire corpus ($N$).

$\Large p(sugar) = \frac{Count(sugar)}{N}$

- A bigram model considers the relative probability of the phrase *and sugar*.

$\Large p(sugar | and) = \frac{Count(and \: sugar)}{Count(and)}$

- A trigram model considers the relative probability of the phrase *cream and sugar*.

$\Large p(sugar | cream \: and)= \frac{Count(cream \: and \: sugar)}{Count(cream \: and)}$
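The three formulas above can be sketched in a few lines of Python. This is a minimal illustration only: the toy corpus is invented for this example, and `ngram_counts` is a hypothetical helper, not a library function.

```python
from collections import Counter

# Toy corpus, invented for illustration (not from the lecture).
corpus = (
    "i like my coffee with cream and sugar . "
    "she likes tea with milk and sugar . "
    "he takes his coffee with cream and sugar . "
    "pass the salt and pepper ."
).split()

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

unigrams = ngram_counts(corpus, 1)
bigrams = ngram_counts(corpus, 2)
trigrams = ngram_counts(corpus, 3)

N = len(corpus)  # size of the entire corpus, in tokens

# Unigram: Count(sugar) / N
p_unigram = unigrams[("sugar",)] / N

# Bigram: Count(and sugar) / Count(and)
p_bigram = bigrams[("and", "sugar")] / unigrams[("and",)]

# Trigram: Count(cream and sugar) / Count(cream and)
p_trigram = trigrams[("cream", "and", "sugar")] / bigrams[("cream", "and")]

print(p_unigram, p_bigram, p_trigram)
```

On this toy corpus the estimate for *sugar* rises from $3/32$ (unigram) to $0.75$ (bigram) to $1.0$ (trigram): more context makes the prediction sharper.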

### Check-in

How would you expect a unigram, bigram, and trigram model to perform differently for the different **completions**?

- Sugar?
- Salt?
- Mercury?

In [3]:
### Your answer here

### The role of $n$

- A larger value of $n$ means your predictions are **conditioned** on more context.
- In general, a larger $n$ means more *accurate* predictions.
   - Knowing more about what came before helps you know what comes next.
- But a larger $n$ also leads to issues with **data sparsity**.

### Data sparsity and the problem of zeroes

> When $n$ is too large, your data are **sparse**, i.e., the chance that you've seen *exactly* this sequence is just very low.

- Practically, that means the probability of many sequences will be $0$!
- This is very bad for the $n$-gram model.
   - If $p(w) = 0$ for any $w$ in a sequence, then the joint probability of the entire sequence is also 0.
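To make the problem of zeroes concrete, here is a small sketch using an unsmoothed bigram model. The one-sentence corpus and the helper names `bigram_prob` and `sequence_prob` are assumptions made for illustration.

```python
from collections import Counter

# Toy corpus, invented for illustration.
corpus = "i like my coffee with cream and sugar .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    """Unsmoothed estimate: Count(prev word) / Count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

def sequence_prob(tokens):
    """Product of bigram probabilities; the first word is treated as given."""
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word)
    return prob

seen = sequence_prob("coffee with cream and sugar".split())
unseen = sequence_prob("coffee with cream and salt".split())

print(seen, unseen)
```

The pair *and salt* never occurs in the corpus, so its probability is $0$, and that single zero wipes out the probability of the whole second sequence.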

**Check-in**: What might be a way to address this issue?

### Smoothing: a workaround

> **Smoothing** refers to a variety of techniques designed to deal with data sparsity, i.e., by eliminating zeroes.

- A common method is [additive smoothing](https://en.wikipedia.org/wiki/Additive_smoothing), where some *constant* is added to the count of each possible sequence.
- This ensures no probability is zero.

In general, these are somewhat clunky workarounds for a fundamental problem with $n$-gram models: they depend on the *exact tokens* in the context.
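As a rough sketch of additive smoothing for a bigram model, we can add a constant $k$ to every bigram count and $k \cdot V$ to the denominator, where $V$ is the vocabulary size. The corpus, the choice $k = 1$, and the function name `smoothed_bigram_prob` are assumptions for illustration.

```python
from collections import Counter

# Toy corpus, invented for illustration.
corpus = "i like my coffee with cream and sugar . pass the salt and pepper .".split()

vocab = sorted(set(corpus))
V = len(vocab)  # vocabulary size

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def smoothed_bigram_prob(prev, word, k=1.0):
    """Additive smoothing: (Count(prev word) + k) / (Count(prev) + k * V)."""
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * V)

p_seen = smoothed_bigram_prob("and", "sugar")   # bigram observed in the corpus
p_unseen = smoothed_bigram_prob("and", "salt")  # bigram never observed

# The smoothed probabilities over the whole vocabulary still sum to 1.
total = sum(smoothed_bigram_prob("and", w) for w in vocab)

print(p_seen, p_unseen, total)
```

Here the unseen pair *and salt* gets probability $1/15$ rather than $0$, while the observed pair *and sugar* gets $2/15$.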

## The rise of "neural" LMs

> A **neural language model** uses a neural network architecture for the language modeling task.

Large neural language models, such as the models behind ChatGPT, are called **large language models** (LLMs).

### Refresher: feedforward architecture

- In a **feedforward architecture**, there are no *recurrent connections* between layers.
- Each layer *transforms* input representations to produce *output* predictions.

<img src="img/networks/ffn.png" width="300" alt="Simple FFN">

#### Check-in

Knowing what we know about word embeddings and $n$-gram models, how might you modify the feedforward architecture for the language modeling task?

### Feedforward neural language models

> A **feedforward language model** uses a feedforward neural network to assign probabilities to word $w_t$ using representations of previous words.


<img src="img/llm/ffn_llm.png" width="500" alt="FFN for language modeling">


#### Features of feedforward language models

- Like an $n$-gram model, relies on a *fixed context window*.
- Unlike an $n$-gram model, uses **embeddings** for context words.
    - Allows for better **generalizations** about *types of words*.
- During training, errors are used as a signal to **update weights**.
   - This helps the LM make better and better predictions.
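The features above can be sketched as a small PyTorch module. This is an illustrative sketch, not the exact architecture in the figure: the class name `FeedforwardLM`, the layer sizes, the `tanh` nonlinearity, and the toy word indices are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedforwardLM(nn.Module):
    """Predicts the next word from a fixed window of previous words (sketch)."""

    def __init__(self, vocab_size, embed_dim=16, context_size=2, hidden_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)               # word embeddings
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)  # hidden layer
        self.out = nn.Linear(hidden_dim, vocab_size)                   # one score per word

    def forward(self, context_ids):
        # context_ids: (batch, context_size) integer word indices
        embedded = self.embed(context_ids)      # (batch, context_size, embed_dim)
        flat = embedded.flatten(start_dim=1)    # concatenate the context embeddings
        hidden = torch.tanh(self.hidden(flat))
        return self.out(hidden)                 # logits: (batch, vocab_size)

torch.manual_seed(0)
vocab_size = 10  # hypothetical toy vocabulary
model = FeedforwardLM(vocab_size)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Two made-up training examples: two context words each, plus the true next word.
context = torch.tensor([[1, 2], [3, 4]])
target = torch.tensor([3, 5])

logits = model(context)
probs = F.softmax(logits, dim=1)               # distribution over the next word
loss_before = F.cross_entropy(logits, target)  # prediction error

# One training step: the error signal is used to update the weights.
optimizer.zero_grad()
loss_before.backward()
optimizer.step()

loss_after = F.cross_entropy(model(context), target)
print(loss_before.item(), loss_after.item())
```

Repeating the training step over many examples is what gradually improves the model's predictions.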


**Key limitation**: Not very good at incorporating context. How does the model figure out what's *relevant*?

### The advent of *attention*

> **Attention** is a mechanism that (metaphorically) allows an LLM to “focus” on (or “attend” to) specific elements in a sequence.

[Interactive tutorial and visualization](https://colab.research.google.com/drive/1hXIQ77A4TYS4y3UthWF-Ci7V7vVUoxmQ?usp=sharing)

But how is this *calculated*?

#### Dot-product attention: a simplification

- A simple form of "attention" is based on the **dot product**.
- Given two vectors, the dot product measures their similarity.

$\Large u \cdot v$

- Given a sequence of **word embeddings**, this yields a *matrix* reflecting the similarity of each pair of words.
- These attention scores are then subject to a **softmax** function, turning them into a probability distribution.

$\Large softmax(x_i) = \frac{e^{x_i}}{\sum_je^{x_j}}$

- These **attention scores** are then used to produce a *weighted average* of all the word embeddings for each word in the sequence.
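As a quick check on the softmax formula, here is a small sketch that computes it by hand for one row of scores and compares the result with PyTorch's built-in `F.softmax`. The example scores are chosen for illustration.

```python
import torch
import torch.nn.functional as F

# One row of (made-up) similarity scores.
scores = torch.tensor([1.0, 0.0, 1.0])

# Softmax by hand: exponentiate each score, then divide by the sum.
exp_scores = torch.exp(scores)
manual = exp_scores / exp_scores.sum()

# The built-in version should agree.
builtin = F.softmax(scores, dim=0)

print(manual)
print(builtin)
```

Both versions give non-negative values that sum to 1, which is why the result can be read as a probability distribution.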

#### Dot-product attention in action (pt. 1)

- First, we initialize a matrix of embeddings (suppose these represent three different words).
- Then, we get the pairwise dot products between each word.

In [4]:
### Embeddings
E = torch.tensor([
    [1.0, 0.0, 1.0, 0.0],  # word 1
    [0.0, 1.0, 0.0, 1.0],  # word 2
    [1.0, 1.0, 1.0, 1.0],  # word 3
])
### Attention scores
attention_scores = E @ E.T
attention_scores

tensor([[2., 0., 2.],
        [0., 2., 2.],
        [2., 2., 4.]])

#### Dot-product attention in action (pt. 2)

- Now, we *scale* our scores.
- Then apply *softmax*.

In [5]:
### Scale
scale = np.sqrt(E.shape[1])  # sqrt of embedding dimension
scaled_scores = attention_scores / scale
scaled_scores

tensor([[1., 0., 1.],
        [0., 1., 1.],
        [1., 1., 2.]])

In [6]:
### Get attention probabilities (softmax over each row)
attention_probs = F.softmax(scaled_scores, dim=1)
attention_probs

tensor([[0.4223, 0.1554, 0.4223],
        [0.1554, 0.4223, 0.4223],
        [0.2119, 0.2119, 0.5761]])

#### Interpreting the attention scores

- Each *row* represents the attention between that particular word ($w_i$) and every word in the sequence (including itself).
- The numbers represent the *probability distribution* over that word's attention.
- A higher score means more *attention*.

In [7]:
### Attention for word 1
attention_probs[0]

tensor([0.4223, 0.1554, 0.4223])

#### Dot-product attention in action (pt. 3)

- Finally, we create *new contextualized output vectors* $O_i$ for each word.
- This is the *average* of all our words, weighted by the attention scores $\alpha$.

$O_i = \sum_j \alpha_{i, j}\cdot E_j$

- This results in a new *custom* vector for each word in the sequence.

In [8]:
output = attention_probs @ E  # shape: (3, 4)
output

tensor([[0.8446, 0.5777, 0.8446, 0.5777],
        [0.5777, 0.8446, 0.5777, 0.8446],
        [0.7881, 0.7881, 0.7881, 0.7881]])
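The three steps above (scores, scaling and softmax, weighted average) can be collected into a single function. This is a sketch; the function name `dot_product_attention` is hypothetical and simply wraps the same operations used in the cells above.

```python
import math
import torch
import torch.nn.functional as F

def dot_product_attention(E):
    """Simplified dot-product attention over a (num_words, dim) embedding matrix."""
    scores = E @ E.T                          # pairwise similarity
    scaled = scores / math.sqrt(E.shape[1])   # scale by sqrt of embedding dimension
    weights = F.softmax(scaled, dim=1)        # each row becomes a distribution
    return weights @ E, weights               # weighted average of embeddings

E = torch.tensor([
    [1.0, 0.0, 1.0, 0.0],  # word 1
    [0.0, 1.0, 0.0, 1.0],  # word 2
    [1.0, 1.0, 1.0, 1.0],  # word 3
])

output, weights = dot_product_attention(E)
print(output)
```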

#### The limits of dot-product attention

- Dot-product attention is very *coarse*: just considers *similarity* between original word embeddings.
- But *relevance* might depend on all sorts of things.
    - Syntactic relationship.
    - Semantic relationship.
    - Previous mentions.
- Modern **transformer** language models use a more sophisticated mechanism for self-attention.

### Modern self-attention

> In **self-attention**, the relevance of each word to each other is calculated in context and shared, informing the model’s predictions.

- Similar to dot-product attention, but relies on *transformations* of each vector.
- Three crucial **transformations**:
    - *Query*: What is the current word "looking for"?
    - *Key*: What is each other word "offering" to match that query?
    - *Value*: What's the "content" of all the words?

This is kind of abstract, so let's look at a diagram.

#### Attention, visualized

- Schematic of **attention operations**.
- Again: like the dot-product operation, but now we're working with *queries* and *keys* instead of the original vectors.

<img src="img/llm/attention.png" width="300" alt="Graph of self attention.">
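A minimal sketch of a single self-attention step with queries, keys, and values might look like the following. The projection matrices `W_q`, `W_k`, and `W_v` are randomly initialized here purely for illustration. In a real transformer they are learned during training, and their dimensions need not match the embedding size.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

E = torch.tensor([
    [1.0, 0.0, 1.0, 0.0],  # word 1
    [0.0, 1.0, 0.0, 1.0],  # word 2
    [1.0, 1.0, 1.0, 1.0],  # word 3
])
d_model = E.shape[1]

# Three transformations of each embedding (random weights, for illustration only).
W_q = nn.Linear(d_model, d_model, bias=False)  # query: what is this word looking for?
W_k = nn.Linear(d_model, d_model, bias=False)  # key: what does each word offer?
W_v = nn.Linear(d_model, d_model, bias=False)  # value: the content to be mixed

Q, K, V = W_q(E), W_k(E), W_v(E)

# Same recipe as before, but with queries and keys instead of raw embeddings.
scores = Q @ K.T / math.sqrt(d_model)
weights = F.softmax(scores, dim=1)

# Each output row is a weighted average of the value vectors.
output = weights @ V
print(weights)
```

Because the weights are random, the particular attention pattern here is meaningless. The point is only the shape of the computation.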


## The Transformer

> The **transformer** is a *neural network architecture* that is now commonly used for LMs. A crucial feature is the **self-attention mechanism**.

- Transformers rely on *self-attention*.
- Usually **multi-headed**: multiple attention "heads" per layer.
- Each layer consists of a *transformer block*.

### The Transformer "Block"

<img src="img/llm/transformer.png" width="500" alt="The crucial transformer block.">


### Multi-headed attention

- Multiple "heads" at each layer can learn to track different things.
- Oversimplified example: one head tracks syntax, another tracks meaning, etc.
- In actuality, the **function** of heads is not programmed top-down.
    - An entire [research field](https://www.neelnanda.io/mechanistic-interpretability/glossary) is dedicated to understanding these systems!
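PyTorch provides multi-headed attention as `torch.nn.MultiheadAttention`. The sketch below runs two untrained heads over the toy embeddings from earlier. The random weights, the choice of two heads, and the toy input are assumptions for illustration, and the `average_attn_weights` argument assumes a reasonably recent PyTorch version.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Same toy embeddings as before, with a batch dimension added: (1, 3, 4).
E = torch.tensor([
    [1.0, 0.0, 1.0, 0.0],  # word 1
    [0.0, 1.0, 0.0, 1.0],  # word 2
    [1.0, 1.0, 1.0, 1.0],  # word 3
]).unsqueeze(0)

# Two heads, each working on half of the 4-dimensional embedding.
mha = nn.MultiheadAttention(embed_dim=4, num_heads=2, batch_first=True)

# Self-attention: the same sequence supplies queries, keys, and values.
output, per_head_weights = mha(
    E, E, E, need_weights=True, average_attn_weights=False
)

print(output.shape)            # (batch, words, embedding dim)
print(per_head_weights.shape)  # (batch, heads, words, words)
```

Each head produces its own 3 × 3 attention matrix, so different heads are free to attend to different words.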
  

### Pre-trained models

> A **pre-trained language model (PLM)** is a model trained on the next-token prediction task on a large corpus.

- Researchers can download these models (e.g., from [HuggingFace](https://huggingface.co/)) and use them in some task.
- PLMs can also be *fine-tuned* for downstream tasks.
- Larger PLMs are surprisingly powerful: can do lots of things, despite only being trained to predict words.
- Next lecture, we will work with an actual *pre-trained language model*.

## Summary

- A **language model** is simply a predictive model that assigns probabilities to word sequences.
- Modern language models use the **transformer** architecture, which relies on a mechanism called *self-attention*.
- These models are quite large and complicated, and there's still much we don't understand about how or why they work.