# CSE 25 – Introduction to Artificial Intelligence  
## Worksheet 12: Language Models and N-grams

**Today’s focus:**  
How do we assign probabilities to words and sentences, and use those probabilities to choose better language outputs?

>Language model = a model that assigns probabilities to strings of words.

### Learning Objectives
By the end of this worksheet, you will be able to:
- Explain how language models score candidate text outputs
- Compute unigram and bigram probabilities from frequency counts
- Apply the chain rule to sentence probability
- Explain the Markov assumption in n-gram language models
- Identify sparsity/zero-probability issues and apply add-$\alpha$ smoothing

**Instructions:**

Create a copy of this notebook and complete it during class.  
Work through the cells below **in order**.

You may discuss with your neighbors, but make sure you understand  
what each step is doing and why.

**Submission**

When finished, download the notebook as a PDF and upload it to Gradescope under  
`In-Class – Week 8 Tuesday`.

To download as a PDF on DataHub:  
`File -> Save and Export Notebook As -> PDF`

### Why do we need language models?

Many NLP tasks require natural language output:
- Machine translation
- Speech recognition
- Natural language generation
- Spell checking

Language models define *probability distributions* over (natural language) strings or sentences.

We can use a language model to score possible output strings so that we can choose the best. 

If $P_{LM}(A) > P_{LM}(B)$, we prefer A.

**Natural Language Generation**

The sky is _____ 

`blue`

or 

`cup`

**Spell Check** 

`Their are two tests.`

or

`There are two tests.`

**Grammar**

`Everything has improve.`

or 

`Everything has improved.`

**Speech Recognition**

`I will be back sooninsh.`

or

`I will be bassoon dish.`

### Probability Basics
#### Sampling with replacement
<img src="images/bag-of-shapes.png" width="500">

*Pick a random shape from the bag of shapes, then put it back in the bag.*


P(blue) = `YOUR ANSWER HERE`

P(square) = `YOUR ANSWER HERE`

P(square or triangle) = `YOUR ANSWER HERE`

P(blue square or a red triangle) = `YOUR ANSWER HERE`

P(blue | square) = `YOUR ANSWER HERE`

P(triangle | red) = `YOUR ANSWER HERE`

#### Drawing a sequence of shapes

*Pick a random shape from the bag of shapes, then put it back in the bag.*

P(red circle, yellow triangle, blue square) = `YOUR ANSWER HERE`

P(red triangle, yellow circle, red triangle) = `YOUR ANSWER HERE`

Suppose now we have some text:

A: `the cat sat on the mat . the cat scared the rat that was near the mat .`

*Pick a random word from the sentences (bag) above, then put it back.*

P(cat) = `YOUR ANSWER HERE`

P(.) = `YOUR ANSWER HERE`

P(the) = `YOUR ANSWER HERE`

B: `on the near cat the cat mat scared that the sat mat . the rat was the .`

*Pick a random word from the sentences (bag) above, then put it back.*

P(cat) = `YOUR ANSWER HERE`

P(.) = `YOUR ANSWER HERE`

P(the) = `YOUR ANSWER HERE`

In this *model*, which scores a text using only the individual word probabilities P(word) and ignores word order, $P_{model}(A) = P_{model}(B)$.
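
The equality above can be checked with a short sketch. It assumes whitespace tokenization and scores a text by the product of relative word frequencies (computed in log space); the names `text_a`, `text_b`, and `unigram_logprob` are illustrative and not part of the worksheet.

```python
import math
from collections import Counter

text_a = "the cat sat on the mat . the cat scared the rat that was near the mat ."
text_b = "on the near cat the cat mat scared that the sat mat . the rat was the ."

def unigram_logprob(tokens):
    """Log-probability of a token list under its own relative word frequencies."""
    counts = Counter(tokens)
    total = len(tokens)
    return sum(math.log(counts[w] / total) for w in tokens)

tokens_a = text_a.split()
tokens_b = text_b.split()

# Both texts contain exactly the same bag of words...
print(Counter(tokens_a) == Counter(tokens_b))
# ...so a model that ignores word order gives them the same score.
print(math.isclose(unigram_logprob(tokens_a), unigram_logprob(tokens_b)))
```

Both checks print `True`: shuffling the words changes nothing for this model, even though B is not a sensible text.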

#### Basic Probability Terminology

Before building language models, we review a few key terms.

- **Trial (experiment)**  
  A single random process.  
  Examples: picking a shape from a bag, rolling a die, predicting the next word.

- **Sample space ($\Omega$)**  
  The set of all possible outcomes of a trial.  
  Examples: all shapes in the bag, all numbers on a die, all words in a vocabulary.

- **Event ($A \subseteq \Omega$)**  
  A subset of possible outcomes.  
  Examples: “drawing a blue shape”, “rolling an even number”, “predicting the word *the*”.

- **Random variable ($X : \Omega \rightarrow T$)**  
  A function that assigns a value to each outcome.  
  Example: mapping each die roll outcome to its number, or mapping each word to its length.


#### What Is a Probability Distribution?

A **probability distribution** assigns a probability to every possible outcome of a random process.

It must satisfy:

1. Each probability is between 0 and 1.
2. All probabilities together add up to 1.

#### Probability Axioms

A function $P$ is a valid probability distribution over $\Omega$ if:

1. **Non-negativity**
   $$
   0 \leq P(A) \leq 1
   $$
   for every event $A$.

2. **Total probability equals 1**
   $$
   \sum_{\omega \in \Omega} P(\omega) = 1
   $$
   The probabilities of all possible outcomes must add up to 1.

3. **Additivity for disjoint events**  
   If $A$ and $B$ cannot both happen (they are disjoint), then:
   $$
   P(A \cup B) = P(A) + P(B)
   $$

These rules ensure that probabilities behave consistently and can represent uncertainty about outcomes.
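
As a rough illustration, the three axioms can be checked for a fair six-sided die. This is a minimal sketch; the helper `prob` and the event names are assumptions made for the example.

```python
from fractions import Fraction

# Fair six-sided die: each outcome has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

def prob(event, dist):
    """Probability of an event, given as a set of outcomes."""
    return sum((dist[o] for o in event), Fraction(0))

# Axiom 1: every probability lies between 0 and 1.
in_range = all(0 <= p <= 1 for p in die.values())

# Axiom 2: the probabilities of all outcomes sum to 1.
sums_to_one = sum(die.values()) == 1

# Axiom 3: for disjoint events, P(A or B) = P(A) + P(B).
even, low_odd = {2, 4, 6}, {1, 3}
additive = prob(even | low_odd, die) == prob(even, die) + prob(low_odd, die)

print(in_range, sums_to_one, additive)
```

Using `Fraction` keeps the arithmetic exact, so the equality checks are not affected by floating-point rounding.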

#### Discrete Probability Distributions

A probability distribution is **discrete** if its set of possible outcomes is countable (often finite). This means we can list the possible outcomes one by one and assign a probability to each.

##### Bernoulli Distribution

A **Bernoulli distribution** models a situation with exactly **two possible outcomes**.

We define one outcome as “success” with probability $p$.  
The other outcome (“failure”) automatically has probability $1 - p$.

Example: flipping a coin once.

- Let “success” = getting heads.
- $P(\text{heads}) = p$
- $P(\text{tails}) = 1 - p$

If the coin is fair:
$$
P(\text{heads}) = 0.5, \quad P(\text{tails}) = 0.5
$$

Bernoulli distributions are commonly used for:
- Yes / No decisions
- True / False labels
- Binary classification problems


##### Categorical Distribution

A **categorical distribution** generalizes Bernoulli to more than two outcomes.

Suppose we have categories $c_1, c_2, \dots, c_N$.

Each category has a probability $p_i$:

$$
P(c_i) = p_i
$$

The probabilities must satisfy:

$$
\sum_{i=1}^{N} p_i = 1
$$

Examples:

- Rolling a six-sided die  
  $P(1) = P(2) = \dots = P(6) = \frac{1}{6}$ (if fair)

- Picking a shape from a bag  
  Each shape has some probability depending on how many are in the bag.

- Predicting the next word in a vocabulary of size $N$  
  The model assigns a probability to each possible word, and all probabilities must sum to 1.

Most language models define a **categorical distribution over words** at each step.
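
The sketch below draws samples from a categorical distribution over next words. The vocabulary and the probabilities in `next_word_probs` are made up for illustration; they do not come from any trained model.

```python
import random
from collections import Counter

# Hypothetical next-word distribution for "The sky is ___".
next_word_probs = {"blue": 0.6, "cloudy": 0.25, "dark": 0.1, "cup": 0.05}

# A categorical distribution must sum to 1.
assert abs(sum(next_word_probs.values()) - 1.0) < 1e-9

random.seed(0)  # fixed seed so repeated runs give the same samples
words = list(next_word_probs)
weights = list(next_word_probs.values())
samples = random.choices(words, weights=weights, k=10_000)

# With many samples, the observed frequencies approach the probabilities.
freqs = Counter(samples)
for w in words:
    print(w, freqs[w] / len(samples))
```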

#### Joint and Conditional Probability

$$P(X | Y) = \frac{P(X, Y)} {P(Y)}$$

<img src="images/bag-of-shapes.png" width="500">

$$P(blue|square) = \frac{P(blue, square)}{P(square)} $$

$P(blue, square)$ = `YOUR ANSWER HERE`

$P(square)$ = `YOUR ANSWER HERE`

$P(blue|square)$ = `YOUR ANSWER HERE`
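
To see the definition at work without giving away the answers, here is a sketch on a different, hypothetical bag. The contents of `bag` are invented for this example and are not the bag in the image.

```python
from fractions import Fraction

# Hypothetical bag of (color, shape) items; NOT the bag in the image.
bag = [
    ("blue", "square"), ("blue", "square"), ("red", "square"),
    ("blue", "circle"), ("red", "triangle"), ("yellow", "triangle"),
]

def prob(predicate):
    """P(event) when every item in the bag is equally likely to be drawn."""
    return Fraction(sum(1 for item in bag if predicate(item)), len(bag))

p_blue_and_square = prob(lambda item: item == ("blue", "square"))
p_square = prob(lambda item: item[1] == "square")

# Conditional probability: P(blue | square) = P(blue, square) / P(square)
p_blue_given_square = p_blue_and_square / p_square

print(p_blue_and_square, p_square, p_blue_given_square)
```

For this bag the result is 2/3: of the three squares, two are blue.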

#### The Chain Rule

The joint $P(X,Y)$ can also be expressed in terms of conditional probability:

$$P(X, Y) = P(X|Y)P(Y)$$

This leads to the *chain rule* of probability: 
$$
P(X_1, X_2, \dots, X_n)
=
P(X_1)
P(X_2 \mid X_1)
\cdots
P(X_n \mid X_1, \dots, X_{n-1})
$$


#### Independence

$X$ and $Y$ are independent if:

$$P(X, Y) = P(X)P(Y)$$

Then, whenever $P(Y) > 0$:

$$P(X \mid Y) = P(X)$$

#### Building a Probability Model

1. Define the model  
2. Estimate parameters  

Models often make independence assumptions
to reduce the number of parameters.

#### Language Model

A *language model* over a vocabulary $V$ assigns probabilities to strings drawn from $V^*$.

**The Vocabulary $V$**

The vocabulary $V$ is the set of allowed words (or tokens).

Example:

$$
V = \{\text{the}, \text{cat}, \text{sat}\}
$$

These are the basic building blocks.


**What Is $V^*$?**

$V^*$ means:

> All possible finite sequences of words from $V$.

If

$$
V = \{\text{the}, \text{cat}\}
$$

then $V^*$ includes:

- the  
- cat  
- the cat  
- cat the  
- the the  
- cat cat  
- the cat the  
- cat the cat  
- and so on...

Even if $V$ is small, there are infinitely many possible strings because sentences can keep getting longer.
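
A small sketch makes the growth concrete by listing every string over $V$ up to a chosen length. The cutoff `max_len` is an assumption made only so the enumeration stops; formal definitions of $V^*$ usually also include the empty string, which is omitted here to match the list above.

```python
from itertools import product

vocab = ["the", "cat"]
max_len = 3  # illustrative cutoff; V* itself has no maximum length

# All sequences of length 1, 2, ..., max_len built from the vocabulary.
strings = [
    " ".join(words)
    for length in range(1, max_len + 1)
    for words in product(vocab, repeat=length)
]

print(len(strings))  # 2 + 4 + 8 = 14
print(strings[:6])
```

Each extra word of allowed length multiplies the number of new strings by $|V|$, so the count grows exponentially and never stops.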


**What Is a Language Model?**

A language model assigns a probability to **every possible sentence** made from the vocabulary.

Formally, it defines a function:

$$
P : V^* \rightarrow [0,1]
$$

This means:

- Every possible sentence gets a probability.
- The probabilities over all possible sentences add up to 1.

$$
\sum_{s \in V^*} P(s) = 1
$$


**Why Is This Important?**

We want to compare sentences like:

- “I agree.”
- “I completely agree.”
- “Completely I agree.”

To decide which is more likely, we must score them all under the **same probability distribution**.

That’s why a language model defines probabilities over *all possible strings*, not just individual words.

Let's say we want to compute the probability of the next word given some history:

$$
P(w \mid h)
$$

Suppose the history is:

> *On summer evenings the sky looks very*

and we want the probability that the next word is *orange*:

$$
P(\text{orange} \mid \text{On summer evenings the sky looks very})
$$

A simple idea is to estimate this using counts from a large corpus.

We count:

- How often we see the full sequence  
  *On summer evenings the sky looks very orange*

- How often we see the history  
  *On summer evenings the sky looks very*

This gives the relative-frequency estimate:

$$
P(w \mid h)
=
\frac{C(h\,w)}{C(h)}
$$

In words:  
Out of all the times we saw the history $h$, how often was it followed by the word $w$?
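
The estimate can be sketched on a tiny corpus. The three sentences in `corpus` are hypothetical and use a shortened history so the counts are easy to verify by hand; `count_sequence` is an illustrative helper.

```python
# Hypothetical mini-corpus, already lowercased and whitespace-tokenized.
corpus = (
    "the sky looks very orange . "
    "the sky looks very blue . "
    "the sky looks very orange tonight ."
).split()

def count_sequence(tokens, seq):
    """Number of positions where `seq` occurs as a contiguous run in `tokens`."""
    n = len(seq)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == seq)

history = "the sky looks very".split()
word = "orange"

c_h = count_sequence(corpus, history)            # C(h)
c_hw = count_sequence(corpus, history + [word])  # C(h w)

print(c_hw, c_h, c_hw / c_h)
```

Here the history occurs three times and is followed by *orange* twice, so the estimate is 2/3.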

Now suppose we want the probability of an entire sequence of words rather than just one next word.

Using the **chain rule**, we can write:

$$
P(w_1, w_2, \dots, w_n)
=
P(w_1)
P(w_2 \mid w_1)
P(w_3 \mid w_1, w_2)
\cdots
P(w_n \mid w_1, \dots, w_{n-1})
$$
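
As a concrete instance (a worked illustration, not part of the original worksheet), applying the chain rule to the three-word sequence *the cat sat* gives:

$$
P(\text{the}, \text{cat}, \text{sat})
=
P(\text{the})
P(\text{cat} \mid \text{the})
P(\text{sat} \mid \text{the}, \text{cat})
$$

Each factor conditions on everything that came before it, so the history grows by one word at every step.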

<!-- More compactly,

$$
P(w_1, \dots, w_n)
=
\prod_{k=1}^{n}
P(w_k \mid w_{1:k-1})
$$ -->


Discuss

Q. Why is estimating  
$$
P(w_n \mid w_1, w_2, \dots, w_{n-1})
$$
for long histories unrealistic in practice?

`YOUR ANSWER HERE`

<!-- If we had a large enough corpus, we could compute all these counts. However, even the entire web is not large enough to give reliable counts for long histories. Language is creative. New sentences are invented all the time. Most long word sequences will appear rarely — or never — in our data. We cannot expect to see every possible long history $h$ in our training data. If a sequence never appears, our estimate becomes zero. That would mean assigning probability zero to perfectly reasonable sentences.  -->

#### Language Modeling with N-grams

An *n-gram language model* is one of the simplest kinds of language model. The intuition of the *n-gram* model is that instead of computing the probability of a word given its entire history, we can approximate the history by just the last few words.

An *n-gram language model* assumes that each word depends only on the last $n-1$ words:

$$P_{ngram}(w_1, w_2, \dots, w_i) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_i \mid w_{i-n+1}, \dots, w_{i-1})$$

The assumption that the probability of a word depends only on the previous $n-1$ words (for a bigram model, only the previous word) is called a *Markov assumption*.

#### Language Modeling with N-grams

In the previous cell, we saw that computing

$$
P(w_n \mid w_1, w_2, \dots, w_{n-1})
$$

for long histories is unrealistic in practice.

To make the problem tractable, we deliberately simplify the model.

Instead of conditioning on the *entire* history, we approximate it using only the last few words.

An **n-gram language model** assumes that each word depends only on the previous $n-1$ words:

$$
P(w_k \mid w_1, \dots, w_{k-1})
\approx
P(w_k \mid w_{k-n+1}, \dots, w_{k-1})
$$

Using this assumption, the probability of a sequence of $m$ words becomes:

$$
P(w_1, w_2, \dots, w_m)
\approx
\prod_{k=1}^{m}
P(w_k \mid w_{k-n+1}, \dots, w_{k-1})
$$

This simplification is called a **Markov assumption**. It means that the future depends only on a limited recent past,  
not the entire history.

**Special Cases:**

- **Unigram model ($n=1$):**
  $$
  P(w_k \mid w_1, \dots, w_{k-1})
  \approx
  P(w_k)
  $$

- **Bigram model ($n=2$):**
  $$
  P(w_k \mid w_1, \dots, w_{k-1})
  \approx
  P(w_k \mid w_{k-1})
  $$

- **Trigram model ($n=3$):**
  $$
  P(w_k \mid w_1, \dots, w_{k-1})
  \approx
  P(w_k \mid w_{k-2}, w_{k-1})
  $$
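
The special cases differ only in how much of the history is kept. The sketch below shows the context each model would condition on when predicting one word; `ngram_context` is a hypothetical helper written for this illustration.

```python
def ngram_context(tokens, k, n):
    """Context an n-gram model uses to predict tokens[k] (k is a 0-based index)."""
    start = max(0, k - (n - 1))  # keep at most the previous n-1 tokens
    return tokens[start:k]

tokens = ["<s>", "to", "be", "or", "not", "to", "be", "</s>"]
k = 4  # we are predicting "not"

for n in (1, 2, 3):
    print(n, ngram_context(tokens, k, n), "->", tokens[k])
```

The unigram model sees an empty context, the bigram model sees `['or']`, and the trigram model sees `['be', 'or']`.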

#### Estimating N-gram Probabilities (Maximum Likelihood Estimation)

Once we decide how much history to use (unigram, bigram, trigram, etc.),
we need a way to compute the probabilities from data.

We use **Maximum Likelihood Estimation (MLE)**.

The idea of MLE is simple:

> Choose the probabilities that make the observed training data most likely.

For n-gram models, this leads to **relative frequency estimates**.

For a **unigram model**:

$$
P(w_k)
=
\frac{C(w_k)}{\text{total number of words in the corpus}}
$$

Count how often the word appears,  
divide by the total number of tokens.


For a **bigram model**:

$$
P(w_k \mid w_{k-1})
=
\frac{C(w_{k-1}, w_k)}{C(w_{k-1})}
$$

Count how often the two-word sequence appears,  
divide by how often the first word appears.


For a **trigram model**:

$$
P(w_k \mid w_{k-2}, w_{k-1})
=
\frac{C(w_{k-2}, w_{k-1}, w_k)}
     {C(w_{k-2}, w_{k-1})}
$$

Count how often the three-word sequence appears,  
divide by how often the two-word history appears.


In general, for an n-gram model:

$$
P(w_k \mid w_{k-n+1}, \dots, w_{k-1})
=
\frac{C(w_{k-n+1}, \dots, w_k)}
     {C(w_{k-n+1}, \dots, w_{k-1})}
$$

So computing n-gram probabilities always follows the same pattern:

1. Count the full sequence (history + next word).  
2. Divide by the count of the history.
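
The two steps above can be sketched for a history of any length. This is one possible way to write it, not the worksheet's reference solution; the corpus and the names `ngram_counts` and `mle_prob` are assumptions for the example.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous window of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def mle_prob(tokens, history, word):
    """MLE of P(word | history): C(history + word) / C(history)."""
    history = tuple(history)
    full = ngram_counts(tokens, len(history) + 1)  # step 1: history + next word
    hist = ngram_counts(tokens, len(history))      # step 2: history alone
    if hist[history] == 0:
        return None  # history never observed: the estimate is undefined
    return full[history + (word,)] / hist[history]

# Hypothetical corpus, not the worksheet's toy corpus.
tokens = "i like tea . i like coffee . i like tea .".split()

# Trigram estimate: "i like" occurs 3 times and is followed by "tea" twice.
print(mle_prob(tokens, ["i", "like"], "tea"))
```

One simplification to be aware of: this sketch counts a history even when it sits at the very end of the token list with nothing after it, which does not happen in this example.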

#### Toy Exercise

We will work with a **tiny toy corpus** so every computation is transparent.


##### Tiny corpus

We will use these tokenized sentences (with boundary tokens):

- `<s>` start of sentence
- `</s>` end of sentence

In [None]:

toy_sentences = [
    ["<s>", "to", "be", "or", "not", "to", "be", "</s>"],
    ["<s>", "to", "be", "a", "king", "</s>"],
    ["<s>", "to", "eat", "pizza", "</s>"],
]

all_tokens = [w for sent in toy_sentences for w in sent]
N = len(all_tokens)
N, all_tokens

#### 1. Estimate unigram probabilities

$$
\hat{P}(w) = \frac{\text{count}(w)}{\sum_{w'} \text{count}(w')}
$$


Complete the code in the next cell to compute unigram probabilities.

In [None]:
# Let's estimate unigram probabilities by counting tokens and dividing by the total number of tokens (N)

word_counts = {}
for w in all_tokens:
    if w not in word_counts:
        word_counts[w] = None # YOUR CODE HERE
    else:
        word_counts[w] = None # YOUR CODE HERE

unigram_probs = {}
for w in word_counts:
    unigram_probs[w] = None # YOUR CODE HERE

print("Unigram probabilities:" )
for w in sorted(unigram_probs, key=unigram_probs.get, reverse=True):
    print(f"  {w}: {unigram_probs[w]:.2f}")

#### 2. Bigram form (language modeling)
The MLE bigram estimate is:

$$
\hat{P}(w_t \mid w_{t-1}) \;=\; \frac{\mathrm{count}(w_{t-1},\, w_t)}{\mathrm{count}(w_{t-1})}
$$

We divide how often `previous_word` is followed by `word` by how often `previous_word` appears as a context.

Complete the code in the next cell to calculate the bigram probabilities of the toy dataset. 

In [None]:
# Now let's estimate bigram probabilities by counting bigrams and dividing by the count of the previous word (the context)

bigram_counts = {}
previous_word_counts = {}

for sent in toy_sentences:
    for i in range(len(sent) - 1):
        prev_word = sent[i]
        curr_word = sent[i + 1]
        new_key = (prev_word, curr_word)
        # Update bigram counts
        if new_key not in bigram_counts:
            bigram_counts[new_key] = None # YOUR CODE HERE
        else:
            bigram_counts[new_key] = None # YOUR CODE HERE

        if prev_word not in previous_word_counts:
            previous_word_counts[prev_word] = None # YOUR CODE HERE
        else:
            previous_word_counts[prev_word] = None # YOUR CODE HERE

bigram_probs = {}
for (prev_word, curr_word) in bigram_counts:
    bigram_probs[(prev_word, curr_word)] = None # YOUR CODE HERE


print("Bigram probabilities:")
for (prev_word, curr_word) in sorted(bigram_probs):
    print(f"P({curr_word} | {prev_word}) = {bigram_probs[(prev_word, curr_word)]:.4f}")

#### Sentence Probability
$$
P(\text{sentence}) = \prod_t P(w_t \mid w_{t-1})
$$

In [None]:
def sentence_prob_bigram(sent):
    prob = 1.0
    for i in range(len(sent)-1):
        prob = None # YOUR CODE HERE
    return prob

test_sent = ["<s>", "to", "be", "</s>"]
sentence_prob_bigram(test_sent)  # expected: 0.25

Notice that this product shrinks quickly as sentences get longer, because every factor is at most 1.

That is why we use **log-probability**.


#### 3. Log-probability

Instead of multiplying probabilities:

$$
P(\text{sentence}) = \prod_t P(w_t \mid w_{t-1})
$$

we sum logs:

$$
\log P(\text{sentence})
= \sum_t \log P(w_t \mid w_{t-1})
$$

Benefits:
- avoids underflow (numbers becoming 0 in a computer)
- turns products into sums (easier to compute)
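
A quick sketch shows the underflow problem directly. The per-word probability of 0.01 and the length of 200 words are made-up values chosen to trigger it.

```python
import math

probs = [0.01] * 200  # hypothetical per-word probabilities for a 200-word text

# Multiplying directly: the true value (1e-400) is too small for a float.
product = 1.0
for p in probs:
    product *= p

# Summing logs: the same information, stored as a manageable number.
log_sum = sum(math.log(p) for p in probs)

print(product)  # underflows to 0.0 in double precision
print(log_sum)
```

The product is reported as exactly `0.0`, while the log-probability stays a finite negative number that can still be compared with other sentences.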


In [None]:
import math

def sentence_logprob_bigram(sent):
    total = 0.0
    for i in range(len(sent)-1):
        p = bigram_probs.get((sent[i], sent[i+1]), 0)
        if p == 0:
            return float("-inf")
        total += math.log(p)
    return total

sentence_logprob_bigram(test_sent)


#### 4. Sparsity problem (zero probabilities)

If a bigram never occurred in training:

$$
\hat{P}(w \mid w_{t-1}) = 0
$$

Then any sentence containing it gets probability 0 (log probability = -∞).

That is too harsh for real language.

In [None]:
unseen = ["<s>", "to", "code", "</s>"]
print("P(code | to) =", bigram_probs.get(("to", "code"), 0))
print("log P(sentence) =", sentence_logprob_bigram(unseen))

##### Laplace (add-α) smoothing

We fix zeros by adding a small pseudo-count `α` to the count of every possible next word (with `α = 1` this is classic Laplace, or add-one, smoothing):

$$
P_{smooth}(w \mid c)
=
\frac{\text{count}(c,w) + \alpha}{\text{count}(c) + \alpha |V|}
$$

- `c` = the context (the previous word, for a bigram model)
- `|V|` = vocabulary size
- guarantees **non-zero** probability everywhere, as long as `α > 0`
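
The formula can be tried on the context `to` from the toy corpus, where `to be` occurs 3 times and `to eat` once. This is a minimal sketch: `add_alpha_prob`, `vocab`, and `counts_after_to` are names introduced here, and the counts are typed in by hand rather than read from the exercise variables.

```python
def add_alpha_prob(count_bigram, count_context, vocab_size, alpha=1.0):
    """Add-alpha estimate of P(w | c) from raw counts."""
    return (count_bigram + alpha) / (count_context + alpha * vocab_size)

# Vocabulary of the toy corpus, including the boundary tokens.
vocab = ["<s>", "</s>", "to", "be", "or", "not", "a", "king", "eat", "pizza"]

# Hand-entered counts of words that follow "to" in the toy corpus.
counts_after_to = {"be": 3, "eat": 1}
count_to = sum(counts_after_to.values())  # "to" appears 4 times as a context

smoothed = {
    w: add_alpha_prob(counts_after_to.get(w, 0), count_to, len(vocab))
    for w in vocab
}

print(smoothed["be"])          # seen bigram: (3 + 1) / (4 + 10)
print(smoothed["or"])          # unseen bigram: (0 + 1) / (4 + 10)
print(sum(smoothed.values()))  # still sums to 1 over the vocabulary (up to rounding)
```

Smoothing moves some probability mass from seen bigrams (`be` drops from 3/4 to 4/14) to unseen ones. A word outside the vocabulary, such as `code`, would need to be added to `vocab` (or mapped to an unknown-word token) for the probabilities to keep summing to 1.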


In [None]:
def sentence_logprob_bigram_smoothed(sent, alpha=1.0):
    total = 0.0
    vocab_size = len(unigram_probs)  # |V|: number of distinct tokens, including <s> and </s>
    for i in range(len(sent)-1):
        prev_word, curr_word = sent[i], sent[i+1]
        count_bigram = bigram_counts.get((prev_word, curr_word), 0)
        count_prev_word = previous_word_counts.get(prev_word, 0)
        # Apply add-alpha smoothing
        p = (count_bigram + alpha) / (count_prev_word + alpha * vocab_size)
        total += math.log(p)
    return total

In [None]:
print("log P(sentence) =", sentence_logprob_bigram_smoothed(unseen, alpha=1.0))