# CSE 25 – Introduction to Artificial Intelligence  
## Worksheet 13: Language Models and N-grams

> Language model = a model that assigns probabilities to strings of words.

### Guiding Questions
1. Why is estimating full-history probabilities hard in practice?
2. How do unigram, bigram, and trigram models simplify the problem?
3. How do we estimate n-gram probabilities from counts?
4. Why do we need smoothing and perplexity?

### Learning Objectives
By the end of this worksheet, you will be able to:
- Explain the Markov assumption in n-gram language models
- Compute unigram and bigram probabilities from frequency counts
- Explain sparsity and apply add-$\alpha$ smoothing
- Compute and interpret log-probability and perplexity

**Instructions:**

Create a copy of this notebook and complete it during class.  
Work through the cells below **in order**.

You may discuss with your neighbors, but make sure you understand  
what each step is doing and why.

**Submission**

When finished, download the notebook as a PDF and upload it to Gradescope under  
`In-Class – Week 8 Thursday`.

To download as a PDF on DataHub:  
`File -> Save and Export Notebook As -> PDF`

#### Language Model

A *language model* over a vocabulary $V$ assigns probabilities to strings drawn from $V^*$, the set of all finite sequences of tokens from $V$.


Let's say we want to compute the probability of the next word given some history:

$$
P(w \mid h)
$$

Suppose the history is:

> *On summer evenings the sky looks very*

and we want the probability that the next word is *orange*:

$$
P(\text{orange} \mid \text{On summer evenings the sky looks very})
$$

A simple idea is to estimate this using counts from a large corpus.

We count:

- How often we see the full sequence  
  *On summer evenings the sky looks very orange*

- How often we see the history  
  *On summer evenings the sky looks very*

This gives the relative-frequency estimate:

$$
P(w \mid h)
=
\frac{C(h\,w)}{C(h)}
$$

In words:  
Out of all the times we saw the history $h$, how often was it followed by the word $w$?
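
The counting recipe above can be sketched in a few lines of Python. This is a minimal illustration, not part of the graded worksheet: the helper names `count_occurrences` and `relative_frequency` and the mini corpus are made up for this example.

```python
def count_occurrences(tokens, pattern):
    """Count how many times the token list `pattern` occurs contiguously in `tokens`."""
    m = len(pattern)
    return sum(1 for i in range(len(tokens) - m + 1) if tokens[i:i + m] == pattern)

def relative_frequency(tokens, history, word):
    """Estimate P(word | history) as C(history + word) / C(history)."""
    history_count = count_occurrences(tokens, history)
    if history_count == 0:
        return None  # the history was never seen, so the estimate is undefined
    return count_occurrences(tokens, history + [word]) / history_count

# Made-up mini corpus (an assumption for illustration only)
demo_corpus = ("the sky looks very orange . the sky looks very blue . "
               "the sky looks very orange .").split()
demo_history = ["the", "sky", "looks", "very"]

print(relative_frequency(demo_corpus, demo_history, "orange"))  # seen 2 of 3 times
print(relative_frequency(demo_corpus, demo_history, "green"))   # never follows the history
print(relative_frequency(demo_corpus, ["on", "summer", "evenings"], "the"))  # unseen history
```

Note that the estimate is only defined when the history itself occurs in the corpus, which is why the helper returns `None` in the last call.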

Now suppose we want the probability of an entire sequence of words rather than just one next word.

Using the **chain rule**, we can write:

$$
P(w_1, w_2, \dots, w_n)
=
P(w_1)
P(w_2 \mid w_1)
P(w_3 \mid w_1, w_2)
\cdots
P(w_n \mid w_1, \dots, w_{n-1})
$$
$$ =
\prod_{k=1}^{n}
P(w_k \mid w_1, \dots, w_{k-1} )
$$


Q. Why is estimating $P(w_n \mid w_1, w_2, \dots, w_{n-1})$
for long histories unrealistic in practice?

With a large enough corpus, we could compute all of these counts. In practice, even the entire web is not large enough to give reliable counts for long histories. Language is creative: new sentences are invented all the time, so most long word sequences appear rarely, or never, in our data. We cannot expect to see every possible long history $h$ in our training data. If the full sequence $h\,w$ never appears, our estimate is zero, which assigns probability zero to perfectly reasonable sentences. If the history $h$ itself never appears, the estimate is undefined ($0/0$).
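
To see this effect on a small scale, the sketch below measures what fraction of the distinct word sequences of length $n$ in a text occur exactly once. The text and the helper name `singleton_fraction` are made up for illustration; a real corpus would be far larger, but the trend is the same.

```python
from collections import Counter

def singleton_fraction(tokens, n):
    """Fraction of distinct length-n word sequences in `tokens` that occur exactly once."""
    sequences = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(sequences)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)

# Made-up text (an assumption for illustration only)
demo_text = "the cat sat on the mat the dog sat on the rug the cat saw the dog"
demo_tokens = demo_text.split()

demo_fractions = [singleton_fraction(demo_tokens, n) for n in range(1, 5)]
for n, frac in enumerate(demo_fractions, start=1):
    print(f"length {n}: {frac:.2f} of distinct sequences occur only once")
```

As the sequence length grows, almost every sequence is seen only once, so counts for long histories carry very little information.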

#### Building a Probability Model

1. Define the model  
2. Estimate parameters  

Models often make independence assumptions to reduce the number of parameters.

#### Language Modeling with N-grams

In the previous cell, we saw that computing $P(w_n \mid w_1, w_2, \dots, w_{n-1})$ for long histories is unrealistic in practice.

To make the problem tractable, we deliberately simplify the model.

Instead of conditioning on the *entire* history, we approximate it using only *the last few words*.

An **n-gram language model** assumes that each word depends only on the previous $n-1$ words:

$$
P(w_k \mid w_1, \dots, w_{k-1})
\approx
P(w_k \mid w_{k-n+1}, \dots, w_{k-1})
$$

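Using this assumption, the probability of a sequence of $m$ words becomes (we write $m$ for the sequence length here, because $n$ now denotes the order of the n-gram model):

$$
P(w_1, w_2, \dots, w_m)
\approx
\prod_{k=1}^{m}
P(w_k \mid w_{k-n+1}, \dots, w_{k-1})
$$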

- This simplification is called a **Markov assumption**. It means that the future depends only on a limited recent past, not the entire history. 
- This also introduces **position invariance**. In an n-gram model, the probability assigned to a word given a specific local context is the same no matter where that context appears in the sentence.
- The conditional probabilities $P(w_k \mid w_{k-n+1}, \dots, w_{k-1})$ are the parameters of the model that we estimate from data.
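
The truncation of the history can be made concrete with a short sketch. The helper name `ngram_contexts` is hypothetical, and padding the start of the sequence with `<s>` is one common convention that is assumed here for illustration.

```python
def ngram_contexts(tokens, n):
    """Yield (context, word) pairs, where context is the previous n-1 tokens.

    Positions near the start have fewer than n-1 predecessors, so the sequence
    is padded on the left with "<s>" (a padding choice assumed for this sketch).
    """
    padded = ["<s>"] * (n - 1) + list(tokens)
    for k in range(n - 1, len(padded)):
        yield tuple(padded[k - n + 1:k]), padded[k]

demo_sentence = "on summer evenings the sky looks very orange".split()

# Trigram view (n = 3): each word is predicted from only the two words before it
for context, word in ngram_contexts(demo_sentence, 3):
    print(context, "->", word)
```

With `n = 1` every context is the empty tuple (a unigram model ignores the history entirely), and with `n = 2` each context is a single previous word.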

**Models:**

- **Unigram model ($n=1$):**
  $
  P(w_k \mid w_1, \dots, w_{k-1})
  \approx
  P(w_k)
  $

- **Bigram model ($n=2$):**
  $
  P(w_k \mid w_1, \dots, w_{k-1})
  \approx
  P(w_k \mid w_{k-1})
  $

- **Trigram model ($n=3$):**
  $
  P(w_k \mid w_1, \dots, w_{k-1})
  \approx
  P(w_k \mid w_{k-2}, w_{k-1})
  $

##### Estimating N-gram Probabilities (Maximum Likelihood Estimation)

Once we decide how much history to use (unigram, bigram, trigram, etc.), we need a way to compute the probabilities from data.

We use **Maximum Likelihood Estimation (MLE)**.

The idea of MLE is simple:

> Choose the probabilities that make the observed data most likely, i.e. maximize the likelihood of the data.

**What Do We Mean by "Likelihood of the Data"?**

Suppose our corpus (our dataset of text) is a sequence of $N$ tokens: $w_1, w_2, \dots, w_N$

An n-gram model assigns a probability to the **entire corpus**:

$$
P(\text{corpus})
=
P(w_1, w_2, \dots, w_N)
=
\prod_{k=1}^{N}
P(w_k \mid w_{k-n+1}, \dots, w_{k-1})
$$

This quantity is called the **likelihood** of the data under the model.


- The corpus is fixed.  
- The probabilities are the parameters we will compute.
- MLE selects the probabilities that **maximize this product**.


For n-gram models, MLE leads to **relative frequency estimates**.

For a **unigram model**:

$$
P(w_k)
=
\frac{C(w_k)}{\text{total number of words in the corpus}}
$$

Count how often the word appears,  
divide by the total number of tokens.


For a **bigram model**:

$$
P(w_k \mid w_{k-1})
=
\frac{C(w_{k-1}, w_k)}{C(w_{k-1})}
$$

Count how often the two-word sequence appears,  
divide by how often the first word appears.


For a **trigram model**:

$$
P(w_k \mid w_{k-2}, w_{k-1})
=
\frac{C(w_{k-2}, w_{k-1}, w_k)}
     {C(w_{k-2}, w_{k-1})}
$$

Count how often the three-word sequence appears,  
divide by how often the two-word history appears.


In general, for an n-gram model:

$$
P(w_k \mid w_{k-n+1}, \dots, w_{k-1})
=
\frac{C(w_{k-n+1}, \dots, w_k)}
     {C(w_{k-n+1}, \dots, w_{k-1})}
$$

So computing n-gram probabilities always follows the same pattern:

1. Count the full sequence (history + next word).  
2. Divide by the count of the history.
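
As an illustration of this two-step pattern for the trigram case, here is a minimal sketch using `collections.Counter`. The function name `trigram_mle` and the example sentences are made up for this example, and the history count here only includes histories that are followed by a word, so the probabilities for each history sum to 1.

```python
from collections import Counter

def trigram_mle(sentences):
    """Return MLE trigram probabilities as {(w1, w2, w3): P(w3 | w1, w2)}."""
    trigram_counts = Counter()
    history_counts = Counter()
    for sent in sentences:
        for i in range(len(sent) - 2):
            trigram_counts[tuple(sent[i:i + 3])] += 1   # step 1: count history + next word
            history_counts[tuple(sent[i:i + 2])] += 1   # count the two-word history
    # step 2: divide each trigram count by the count of its history
    return {tri: c / history_counts[tri[:2]] for tri, c in trigram_counts.items()}

# Made-up sentences (an assumption for illustration only)
example_sentences = [
    "i like green tea".split(),
    "i like green apples".split(),
    "i like black tea".split(),
]
example_trigram_probs = trigram_mle(example_sentences)

print(example_trigram_probs[("i", "like", "green")])    # 2 of the 3 "i like" histories
print(example_trigram_probs[("like", "green", "tea")])  # 1 of the 2 "like green" histories
```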

#### Toy Exercise

We will work with a **tiny toy corpus** so every computation is transparent.


##### Tiny corpus

**Tokenization** is the process of splitting text into smaller units called *tokens*. In practice, tokens can be words, subwords, characters, or punctuation, depending on the tokenizer and model.

For this toy exercise, we will treat words as tokens and include boundary markers:

- `<s>` start of sentence
- `</s>` end of sentence
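
For reference, a very simple way to go from raw text to tokens like these is to lowercase, split on whitespace, and add the boundary markers. This is only a sketch with a hypothetical helper name, `tokenize_with_boundaries`; real tokenizers also handle punctuation, contractions, and subwords.

```python
def tokenize_with_boundaries(sentence):
    """Lowercase, split on whitespace, and wrap the tokens in <s> ... </s>."""
    return ["<s>"] + sentence.lower().split() + ["</s>"]

raw_sentences = ["To be or not to be", "To be a king", "To eat pizza"]
for raw in raw_sentences:
    print(tokenize_with_boundaries(raw))
```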

In [None]:
# Tokenized sentences
tokenized_toy_sentences = [
    ["<s>", "to", "be", "or", "not", "to", "be", "</s>"],
    ["<s>", "to", "be", "a", "king", "</s>"],
    ["<s>", "to", "eat", "pizza", "</s>"],
]

all_tokens = []

# Create a list of all tokens in the corpus
for sentence in tokenized_toy_sentences:
    for token in sentence:
        all_tokens.append(token)
    
N = len(all_tokens)
print("Total Tokens, N:", N)
print("Tokens:", all_tokens)

**Vocabulary**

In language modeling, the **vocabulary** (usually written as $V$) is the set of all **unique tokens** that appear in the corpus. For this toy corpus, the vocabulary includes words and boundary tokens such as `<s>` and `</s>`. Its size is denoted by $|V|$. 

In [None]:
# Vocabulary - the set of unique tokens in the corpus

# Create a set of unique words to form the vocabulary:
vocab = None # YOUR CODE HERE 

# Get the size of the vocabulary
vocab_size = len(vocab)

print("Vocabulary, V:", vocab)
print("Vocab Size, |V|:", vocab_size)


Q. If the vocabulary size is $|V|$, how many parameters do we need to compute for:
- Unigram Model - `YOUR ANSWER HERE`
- Bigram Model - `YOUR ANSWER HERE`
- Trigram Model - `YOUR ANSWER HERE` 

#### Estimate unigram probabilities

$$
\hat{P}(w_k) = \frac{\text{count}(w_k)}{\sum_{w'} \text{count}(w')} = \frac{\text{count}(w_k)}{N}
$$


Complete the code in the next cell to compute unigram probabilities.

In [None]:
# Let's estimate unigram probabilities by counting tokens and dividing by the total number of tokens (N)

# Initialize a dictionary to count the occurrences of each word
word_counts = {}
for w in all_tokens:
    # If it's the first time we see this word, initialize its count to 1
    if w not in word_counts: 
        word_counts[w] = None # YOUR CODE HERE
    # else, increment the existing count by 1
    else:
        word_counts[w] = None # YOUR CODE HERE

# Now we can compute the unigram probabilities 
# by dividing the count of each word by 
# the total number of tokens (N)

# Initialize a dictionary to store unigram probabilities
unigram_probs = {}

# Calculate unigram probabilities for each token in the vocabulary
for token in vocab:
    # Calculate the unigram probability for this token
    # This is done by taking the count of the token and dividing it by the total number of tokens (N)
    unigram_probs[token] = None # YOUR CODE HERE

print("Unigram probabilities:" )
for w in sorted(unigram_probs, key=unigram_probs.get, reverse=True):
    print(f"  {w}: {unigram_probs[w]:.2f}")

####  Estimate bigram probabilities
The MLE bigram estimate is:

$$
\hat{P}(w_k \mid w_{k-1}) \;=\; \frac{\mathrm{count}(w_{k-1},\, w_k)}{\mathrm{count}(w_{k-1})}
$$

We divide how often `previous_word` $w_{k-1}$ is followed by `word` $w_k$ by how often `previous_word` $w_{k-1}$ appears in the text.

Complete the code in the next two cells to calculate the bigram counts and probabilities of the toy dataset. 

In [None]:
# Now let's estimate bigram probabilities by counting bigrams and dividing by the count of the previous word

bigram_counts = {}

# Initialize bigram counts for all possible bigrams in the vocabulary
for prev_word in vocab:
    for curr_word in vocab:
        # We can use a tuple as the key for bigram counts 
        # since tuples are immutable and can be used as dictionary keys
        bigram_counts[(prev_word, curr_word)] = 0 # Initialize count to 0 for all possible bigrams


# Count bigrams in the tokenized sentences (our text corpus)
for sent in tokenized_toy_sentences:
    for i in range(len(sent) - 1):
        prev_word = sent[i] # w_t-1
        curr_word = sent[i + 1] # w_t
        
        new_key = (prev_word, curr_word) # (w_t-1, w_t)
        
        # Update bigram counts
        if (prev_word, curr_word) not in bigram_counts:
            bigram_counts[(prev_word, curr_word)] = None # YOUR CODE HERE
        else:
            bigram_counts[(prev_word, curr_word)] = None # YOUR CODE HERE

# print bigram counts as a table
# with rows as previous words and columns as current words
vocab_order = sorted(vocab)

print("\nBigram Counts:")
print(f"{'':>10}", end="")
for curr_word in vocab_order:
    print(f"{curr_word:>10}", end="")
print()
for prev_word in vocab_order:
    print(f"{prev_word:>10}", end="")
    for curr_word in vocab_order:
        print(f"{bigram_counts[(prev_word, curr_word)]:>10}", end="")
    print()

In [None]:
# Now we can compute the bigram probabilities
# by dividing the count of each bigram by the count of the previous word
# P(w_t | w_t-1) = Count(w_t-1, w_t) / Count(w_t-1)

bigram_probs = {}
for (prev_word, curr_word) in bigram_counts:
    bigram_probs[(prev_word, curr_word)] = None # YOUR CODE HERE

# Print bigram probabilities as a table 
# with rows as previous words 
# and columns as current words

print("\nBigram Probabilities:")
print(f"{'':>7}", end="")
for curr_word in vocab_order:
    print(f"{curr_word:>7}", end="")
print()
for prev_word in vocab_order:
    print(f"{prev_word:>7}", end="")
    for curr_word in vocab_order:
        print(f"{bigram_probs[(prev_word, curr_word)]:>7.2f}", end="")
    print()

#### Probability of a sequence of tokens

From the chain rule, we know:


$$P(w_1, w_2, w_3, \dots, w_n) = P(w_1) \cdot P(w_2 \mid w_1) \cdot P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, w_2, \dots, w_{n-1})$$

$$ = \prod_{k=1}^n P(w_k \mid w_1, w_2, \dots, w_{k-1}) $$

For a bigram model, we approximate $P(w_k \mid w_1, w_2, \dots, w_{k-1}) \approx P(w_k \mid w_{k-1})$, so:

$$
P(w_1, w_2, w_3..., w_n) = \prod_{k=1}^n P(w_k \mid w_{k-1})
$$

In [None]:

def sequence_prob_bigram(tokenized_sequence):
    # Calculate the probability of a sequence using bigram probabilities
    # Initialize the probability of the sequence to 1 (because we will multiply probabilities)
    prob = 1.0
    for i in range(len(tokenized_sequence)-1):
        # We can use the bigram probabilities 
        # we computed above to calculate the probability of the sequence

        # Get the previous word and current word to form the key for bigram probabilities
        prev_word = tokenized_sequence[i]
        curr_word = tokenized_sequence[i + 1]

        # Get the bigram probability for the current bigram (prev_word, curr_word)
        # Multiply the probabilities of each bigram in the sequence to get the overall sequence probability
        prob = None # YOUR CODE HERE

    return prob

In [None]:
test_seq = ["<s>", "to", "be", "or", "not", "to", "be", "</s>"]
sequence_prob_bigram(test_seq) # Expected value: 0.0625

Q. What happens to the probability of a sequence as the sequence gets longer?

*Hint: Remember that sequence probabilities are computed by multiplying many conditional probabilities together.*

`YOUR ANSWER HERE`


Q. Consider the sequence: `<s> to eat a pizza </s>`. What would happen if we tried to get the probability of this sequence? 

`YOUR ANSWER HERE`

In [None]:
# This is a very small corpus, so the non-zero bigram probabilities are relatively large (0.25, 0.33, 0.75, etc.)
# With a larger corpus, the bigram probabilities will be much smaller, and the probability of a sentence will be much smaller as well

# For demonstration purposes, we can repeat the same sentence multiple times to create a longer sentence and see how the probability decreases

# Long sentence
test_seq_long = ["<s>", "to", "be", "or", "not", "to", "be","or", "not", "to", "be","or", "not", "to", "be","or", "not", "to", "be","or", "not", "to", "be","or", "not", "to", "be","or", "not", "to", "be","or", "not", "to", "be","or", "not", "to", "be","or", "not", "to", "be","or", "not", "to", "be","or", "not", "to", "be","or", "not", "to", "be","or", "not", "to", "be","or", "not", "to", "be","or", "not", "to", "be","or", "not", "to", "be","</s>"]

print(sequence_prob_bigram(test_seq_long))

# Notice: the probability becomes extremely small very quickly for long sentences.
# That is why we use log-probabilities.

In [None]:
# Tokenized sequence `<s> to eat a pizza </s>`
test_seq_2 = ["<s>", "to", "eat", "a", "pizza", "</s>"]

print(sequence_prob_bigram(test_seq_2))

#### Log-probability of a sequence of tokens

Instead of multiplying probabilities:

$$
P(w_1, w_2, w_3..., w_n) = \prod_{k=1}^n P(w_k \mid w_{k-1})
$$

we can sum logs:

$$
\log P(w_1, w_2, w_3..., w_n) = \sum_{k=1}^n \log P(w_k \mid w_{k-1})
$$

Benefits:
- avoids underflow (numbers becoming 0 in a computer)
- turns products into sums (easier to compute)
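
The underflow problem is easy to reproduce. The sketch below uses a made-up per-token probability and sequence length (both are assumptions for illustration) and compares the running product with the running sum of logs.

```python
import math

token_prob = 0.1   # made-up per-token probability
num_tokens = 400   # made-up sequence length

running_product = 1.0
running_log_sum = 0.0
for _ in range(num_tokens):
    running_product *= token_prob            # multiply probabilities
    running_log_sum += math.log(token_prob)  # add log-probabilities

print(running_product)   # the true value 1e-400 is too small for a float, so this underflows to 0.0
print(running_log_sum)   # a finite negative number
```

The product can no longer distinguish this sequence from a truly impossible one, while the log-probability remains finite and comparable across sequences.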

In [None]:
import math

def sequence_logprob_bigram(tokenized_sequence):
    total = 0.0
    for i in range(len(tokenized_sequence)-1):
        # Get the previous word and current word to form the key for bigram probabilities
        prev_word = tokenized_sequence[i]
        curr_word = tokenized_sequence[i + 1]

        # Get the bigram probability for the current bigram (prev_word, curr_word)
        p = bigram_probs[(prev_word, curr_word)]

        # If the bigram probability is zero, we return negative infinity for the log-probability of the sequence
        if p == 0.0:
            return float("-inf")
        
        # Otherwise, we add the log of the bigram probability to the total log-probability of the sequence
        total += math.log(p)
    
    return total

print("Test Sequence Log Probability:", sequence_logprob_bigram(test_seq))
print("Long Sequence Log Probability:", sequence_logprob_bigram(test_seq_long))
print("Unseen Bigram Sequence Log Probability:", sequence_logprob_bigram(test_seq_2))

#### Sparsity problem (zero probabilities)

If a bigram never occurred in training:

$$
\hat{P}(w_k \mid w_{k-1}) = 0
$$

Then any sentence containing it gets probability 0 (log probability = $-\infty$).

That is too harsh for real language.

##### Laplace Smoothing (or Add-1 Smoothing) 

**Smoothing** is a technique that assigns non-zero probability to unseen events by redistributing probability mass from seen events.

In Laplace or Add-1 smoothing, we fix zeros by adding 1 to every possible next word:

$$
P_{smooth}(w_k \mid w_{k-1})
=
\frac{\text{count}(w_{k-1}, w_k) + 1}{\text{count}(w_{k-1}) + |V|}
$$

- $|V|$ = vocabulary size
- For a given previous word $w_{k-1}$, there are $|V|$ possible next words
- Adding 1 to the count of each of the $|V|$ possible next words adds $|V|$ to the denominator, which ensures the probabilities still sum to 1
- This guarantees **non-zero** probability everywhere

In [None]:
# Let's add 1 to all bigram counts to perform add-one smoothing, and then recompute the bigram probabilities and sentence log probabilities

# 1. Add 1 to all bigram counts
bigram_counts_smoothed = {}
for (prev_word, curr_word) in bigram_counts:
    bigram_counts_smoothed[(prev_word, curr_word)] = bigram_counts[(prev_word, curr_word)] + 1

# Print smoothed bigram counts as a table
print("Smoothed Bigram Counts:")
print(f"{'':>7}", end="")
for curr_word in vocab_order:
    print(f"{curr_word:>7}", end="")
print()
for prev_word in vocab_order:
    print(f"{prev_word:>7}", end="")
    for curr_word in vocab_order:
        print(f"{bigram_counts_smoothed[(prev_word, curr_word)]:>7}", end="")
    print()

# 2. Add |V| to all word counts to account for the added counts in bigrams
word_counts_smoothed = {}
for prev in word_counts:
    word_counts_smoothed[prev] = word_counts[prev] + vocab_size # Add 1 for each possible current word

# 3. Recompute bigram probabilities with smoothing
bigram_probs_smoothed = {}
for (prev_word, curr_word) in bigram_counts_smoothed:
    bigram_probs_smoothed[(prev_word, curr_word)] = bigram_counts_smoothed[(prev_word, curr_word)] / word_counts_smoothed[prev_word]

# Print smoothed bigram probabilities as a table
print("\nSmoothed Bigram Probabilities:")
print(f"{'':>7}", end="")
for curr_word in vocab_order:
    print(f"{curr_word:>7}", end="")
print()
for prev_word in vocab_order:
    print(f"{prev_word:>7}", end="")
    for curr_word in vocab_order:
        print(f"{bigram_probs_smoothed[(prev_word, curr_word)]:>7.2f}", end="")
    print()

In [None]:
# Now we can compute the probability and log-probability of sequences using the smoothed bigram probabilities 

def sequence_prob_bigram_smoothed(tokenized_sequence):
    prob = 1.0
    for i in range(len(tokenized_sequence)-1):
        prev_word = tokenized_sequence[i]
        curr_word = tokenized_sequence[i + 1]
        prob *= bigram_probs_smoothed[(prev_word, curr_word)]

    return prob

def sequence_logprob_bigram_smoothed(tokenized_sequence):
    total = 0.0
    for i in range(len(tokenized_sequence)-1):
        prev_word = tokenized_sequence[i]
        curr_word = tokenized_sequence[i + 1]

        p = bigram_probs_smoothed[(prev_word, curr_word)]
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total


print("Test Sequence Smoothed Probability:", sequence_prob_bigram_smoothed(test_seq))
print("Long Sequence Smoothed Probability:", sequence_prob_bigram_smoothed(test_seq_long))
print("Unseen Bigram Sequence Smoothed Probability:", sequence_prob_bigram_smoothed(test_seq_2))
print()
print("*"*20)
print()
print("Test Sequence Smoothed Log Probability:", sequence_logprob_bigram_smoothed(test_seq))
print("Long Sequence Smoothed Log Probability:", sequence_logprob_bigram_smoothed(test_seq_long))
print("Unseen Bigram Sequence Smoothed Log Probability:", sequence_logprob_bigram_smoothed(test_seq_2))


**Add-$\alpha$ Smoothing**

Laplace smoothing is a special case where we add 1 to every count.

A more flexible version is to add a small positive value $\alpha$:

$$
P_{\alpha}(w \mid c)
=
\frac{\mathrm{count}(c,w)+\alpha}
{\mathrm{count}(c)+\alpha |V|}
$$

where:

- $c$ is the context (e.g., previous word in a bigram model)
- $w$ is the next word
- $|V|$ is vocabulary size
- $\alpha > 0$ is the smoothing hyperparameter that controls how much we smooth the probabilities.

Strength of Smoothing:

- $\alpha = 1$ $\rightarrow$ strong smoothing (Laplace or Add-1)
- $\alpha = 0.1$ $\rightarrow$ mild smoothing
- $\alpha \to 0$ $\rightarrow$ approaches MLE (unsmoothed counts)
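
The effect of $\alpha$ can be sketched numerically. The function name `add_alpha_prob` and all counts below are made up for illustration: one context seen 5 times, followed by one word 4 times and by another word once, in a vocabulary of 20 tokens.

```python
def add_alpha_prob(bigram_count, context_count, vocab_size, alpha):
    """Add-alpha estimate: (count(c, w) + alpha) / (count(c) + alpha * |V|)."""
    return (bigram_count + alpha) / (context_count + alpha * vocab_size)

# Made-up counts (assumptions for illustration only)
demo_context_count = 5
demo_vocab_size = 20

for alpha in (1.0, 0.1, 0.001):
    seen = add_alpha_prob(4, demo_context_count, demo_vocab_size, alpha)
    unseen = add_alpha_prob(0, demo_context_count, demo_vocab_size, alpha)
    print(f"alpha={alpha}: seen bigram {seen:.3f}, unseen bigram {unseen:.5f}")
```

With $\alpha = 1$, the frequent bigram drops from its MLE value of $4/5 = 0.8$ to $0.2$, because a lot of probability mass moves to the 18 unseen next words. As $\alpha$ shrinks, the estimate moves back toward $0.8$ and the unseen bigrams get very little mass.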

> NOTE: Add-1 and Add-$\alpha$ are simple smoothing methods. They work reasonably well for small models or text classification tasks.  
> For large-vocabulary language models, more advanced smoothing methods usually perform much better.

#### Evaluating Language Models

There are two main ways to evaluate a language model.

##### 1. Extrinsic Evaluation

The most meaningful way to evaluate a language model is to embed it inside a real application and measure task performance.

Examples:
- Use it in speech recognition $\rightarrow$ measure transcription accuracy.
- Use it in machine translation $\rightarrow$ measure translation quality.

If one language model leads to better downstream performance, it is the better model.

This is called **extrinsic evaluation**.

- Measures real-world impact  
- Often expensive and slow  

##### 2. Intrinsic Evaluation

Instead of evaluating the full system, we evaluate the language model by itself.

The main idea: 

> A better language model assigns higher probability to real, unseen text.

Given a test corpus $W = w_1, w_2, \dots, w_N$, we compute: $P(W)$

The model that assigns **higher probability** to the test corpus is better. This is called **intrinsic evaluation**.

However, raw probabilities of long sequences become extremely small and hard to interpret. So instead of raw probability, we use a normalized metric called *Perplexity*. Perplexity is the standard intrinsic evaluation metric for language models.

**Reminder: Train, Validation, and Test Splits**

As we have seen before, we use three distinct datasets:

**Training Set**
- Used to learn model parameters  
- For n-grams: used to compute counts and probabilities  

**Validation Set**
- Used to tune hyperparameters  
- Used to compare model variants  
- Helps prevent overfitting 

**Test Set**
- Completely held out  
- Used only once at the very end  
- Provides an unbiased estimate of generalization  

> We must **never** train on the test set. Doing so causes *data leakage*, which artificially inflates probabilities and leads to misleading evaluation results.


**Note on Out of Vocabulary (OOV) Tokens**

Some tokens/words may appear in validation or test data that were not seen during training. If a word has zero count, the model assigns it probability 0, which makes the entire sequence probability 0.

To handle this, we introduce a special token `<UNK>`:

- During training, rare words are replaced with `<UNK>`.
- The model learns a probability for `<UNK>`.
- Any unseen word in validation or test data is mapped to `<UNK>`.
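
One possible way to implement this mapping is sketched below. The helper names `build_known_vocab` and `replace_unknown`, the `min_count=2` threshold for "rare", and the example sentences are all assumptions made for illustration.

```python
from collections import Counter

def build_known_vocab(train_sentences, min_count=2):
    """Keep tokens seen at least `min_count` times in training (the threshold is an assumption)."""
    counts = Counter(tok for sent in train_sentences for tok in sent)
    return {tok for tok, c in counts.items() if c >= min_count}

def replace_unknown(sentence, known_vocab):
    """Map every token outside `known_vocab` to <UNK>."""
    return [tok if tok in known_vocab else "<UNK>" for tok in sentence]

# Made-up training sentences (an assumption for illustration only)
demo_train = [["to", "be", "or", "not", "to", "be"], ["to", "be", "a", "king"]]
known_vocab = build_known_vocab(demo_train, min_count=2)

# Rare training words become <UNK>, so the model can learn counts for <UNK>
print([replace_unknown(s, known_vocab) for s in demo_train])

# At test time, an unseen word such as "queen" is mapped to <UNK> as well
print(replace_unknown(["to", "be", "a", "queen"], known_vocab))
```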


##### Perplexity (PPL)

When evaluating a language model, we want to know how **surprised** the model is by a corpus. If the model assigns high probability, it is not very surprised.  If it assigns low probability, it is very surprised.

A better language model is therefore one that is **less surprised** by real, unseen text.

Surprise is inversely related to probability: $\frac{1}{P(W)}$

However, the value of $P(W)$ depends on the length of $W$:  
longer texts automatically have smaller probabilities.  
To remove the effect of length, we normalize per token by taking the $N^{\text{th}}$ root, which gives the geometric mean of the per-token inverse probabilities.

This gives the definition of **perplexity**:

$$
\text{PPL}(W) = P(W)^{-\frac{1}{N}}
$$
$$ = \sqrt[N]{\frac{1}{P(W)}}
$$

Lower perplexity indicates that the model assigns higher probability to the corpus and is therefore a better predictor of the data.
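
As a quick sanity check on the definition (using a hypothetical model, not one from this worksheet), suppose a model assigns probability $\frac{1}{10}$ to every one of the $N$ tokens in $W$. Then

$$
\text{PPL}(W) = \left( \left(\tfrac{1}{10}\right)^{N} \right)^{-\frac{1}{N}} = 10
$$

regardless of $N$. In this sense, a perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 equally likely next tokens.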

**Log Form of Perplexity (Used in Practice)**

In practice, we use the log form. Using the chain rule,

$$
P(W) = \prod_{t=1}^{N} P(w_t \mid \text{history})
$$

Taking logs,

$$
\log P(W) = \sum_{t=1}^{N} \log P(w_t \mid \text{history})
$$

Substituting into the definition,

$$
\text{PPL}(W)
=
\exp\left(
-\frac{1}{N}
\sum_{t=1}^{N}
\log P(w_t \mid \text{history})
\right)
$$
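
The log form translates directly into code. The sketch below takes a list of per-token probabilities $P(w_t \mid \text{history})$ as given. The function name `perplexity` and the two lists of probabilities are made up for illustration and do not come from a trained model.

```python
import math

def perplexity(token_probs):
    """Perplexity from per-token probabilities P(w_t | history), using the log form."""
    if any(p == 0.0 for p in token_probs):
        return float("inf")  # a zero-probability token makes P(W) = 0
    avg_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(-avg_log_prob)

# Made-up per-token probabilities from two hypothetical models on the same 4-token text
model_a_probs = [0.5, 0.25, 0.5, 0.25]
model_b_probs = [0.1, 0.1, 0.1, 0.1]

print(perplexity(model_a_probs))  # lower perplexity: less surprised by the text
print(perplexity(model_b_probs))  # higher perplexity: more surprised by the text

# The log form agrees with the direct definition P(W)^(-1/N)
direct = math.prod(model_a_probs) ** (-1 / len(model_a_probs))
print(direct)
```

Returning infinity for a zero-probability token is a design choice in this sketch. It mirrors the $-\infty$ log-probability of a sequence that contains an unseen bigram.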

#### Limitations of N-gram Models
- The number of parameters grows rapidly with vocabulary size  
- Data sparsity becomes severe for large $n$  
- Long-range dependencies are ignored  
- Words are treated as discrete symbols  
- No notion of similarity between words  