# N-Gram Models

In this notebook, we implement a classical n-gram language model from scratch using the Penn Treebank dataset. The goal is to build unigram, bigram, and trigram models, generate text, and evaluate model quality using **perplexity**. We also explore **Laplace smoothing** and **back-off models** to handle unseen sequences in test data.

Key concepts demonstrated:
- Tokenization and sentence boundary handling
- N-gram construction and probability estimation
- Text generation using learned n-gram models
- Perplexity-based evaluation
- Smoothing techniques to improve generalization

This notebook is part of my NLP learning journey (Week 1 – Core NLP Foundations).

## Import Libraries

In [None]:
# install needed libraries
!pip install nltk
!pip install datasets

In [None]:
# import necessary libraries
from datasets import load_dataset
import nltk

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from collections import Counter

import random
import math

In [None]:
# load nltk packages
nltk.download('punkt')
nltk.download('punkt_tab')

## Load Data

First, we load our dataset from HuggingFace. I've selected the Penn Treebank Project: Release 2 CDROM as the dataset for this exercise. The Penn Treebank Project contains excerpts from the 1989 Wall Street Journal. Rare words are already replaced with an `<unk>` token and numbers are replaced with the placeholder token `N`, making this a good introductory dataset.

In [None]:
text_data = load_dataset('ptb_text_only', split = 'train')
text_data = pd.DataFrame(text_data).values.tolist()


## Split into Train / Test

To train and evaluate our n-gram models, we split the sentences loaded above into a training set (80%) and a test set (20%).

In [None]:
train_data, test_data = train_test_split(text_data, test_size = 0.2, random_state = 10)

## Preprocess Text

Since this dataset is already highly preprocessed, we tokenize simply by splitting on whitespace. We also add a start token, `<s>`, to the beginning of each sentence and an end token, `</s>`, to the end. These tokens help the n-gram models learn where sentences begin and end.

In [None]:
def preprocess_ptb(series):
  tokenized_sentences = []
  for text in series:
    tokens = text[0].split()
    tokens = ['<s>'] + tokens + ['</s>']
    tokenized_sentences.append(tokens)
  return [token for sentence in tokenized_sentences for token in sentence]

In [None]:
check = train_data[809]
print('Original Sentence:', check)
print('Preprocessed Sentence:', preprocess_ptb([check]))

Original Sentence: ['in <unk> an important swing area republican <unk> now run on a <unk> promising to keep the county clean and green']
Preprocessed Sentence: ['<s>', 'in', '<unk>', 'an', 'important', 'swing', 'area', 'republican', '<unk>', 'now', 'run', 'on', 'a', '<unk>', 'promising', 'to', 'keep', 'the', 'county', 'clean', 'and', 'green', '</s>']


In [None]:
train_tokens = preprocess_ptb(train_data)
test_tokens = preprocess_ptb(test_data)

## Generate N-Grams

Now we can define our functions for building our n-gram models. First, we need to define a function to generate n-grams.

N-grams are sequences of n consecutive tokens in the corpus. We use the n-grams to train our probabilistic model. For example, if we have the sentence, "the dog wagged his tail", we can extract the following bigrams: ('the', 'dog'), ('dog', 'wagged'), ('wagged', 'his'), ('his', 'tail').

In [None]:
def generate_ngrams(tokens, n):
  '''
  Generates n-grams from a list of tokens.

  Args:
    - tokens (List[str]): A list of string tokens.
    - n (int): The number of items in each n-gram (e.g., 2 for bigrams).

  Returns:
    - List[Tuple[str]]: A list of n-gram tuples.

  Example:
    >>> generate_ngrams(["the", "dog", "wagged", "his", "tail"], 2)
    [('the', 'dog'), ('dog', 'wagged'), ('wagged', 'his'), ('his', 'tail')]
  '''
  return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
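
Because `preprocess_ptb` flattens all sentences into a single token stream, the sliding window also produces n-grams that span sentence boundaries, such as `('</s>', '<s>')`. The short sketch below illustrates this on a made-up two-sentence corpus (`toy_tokens` is an assumption for illustration, not data from the Penn Treebank).

```python
def generate_ngrams(tokens, n):
    # Same sliding-window logic as the function above.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical toy corpus: two sentences, already wrapped in boundary
# tokens and flattened into one list.
toy_tokens = ['<s>', 'the', 'dog', '</s>', '<s>', 'a', 'cat', '</s>']

toy_bigrams = generate_ngrams(toy_tokens, 2)
toy_trigrams = generate_ngrams(toy_tokens, 3)

print(len(toy_bigrams))                # 8 tokens give 7 bigrams
print(('</s>', '<s>') in toy_bigrams)  # True: this bigram crosses a boundary
print(toy_trigrams[0])                 # ('<s>', 'the', 'dog')
```

These boundary-crossing n-grams are what allow a generated sequence to continue past `</s>` into a new sentence.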

## Generate Probabilities

To build our n-gram models, we need to calculate the probability of each n-gram based on our training corpus.

Suppose our entire corpus contains a single sentence:  
**"the man went to the store"**

The **bigrams** extracted from this sentence would be:

('the', 'man'), ('man', 'went'), ('went', 'to'), ('to', 'the'), ('the', 'store')


Now, to compute the probability of the bigram **('the', 'store')**, we use the following formula:

P('store' | 'the') = Count('the', 'store') / Count('the')


In this case:
- `'the'` appears **twice** in the corpus
- The bigram `('the', 'store')` appears **once**

So the probability is:

P('store' | 'the') = 1 / 2 = 0.5

This method is used to estimate the likelihood of a word given its preceding context.
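
The worked example above can be checked in a few lines. This is a minimal sketch that counts unigrams and bigrams for the single-sentence corpus and divides them, using only `collections.Counter`.

```python
from collections import Counter

# The one-sentence corpus from the example above.
tokens = 'the man went to the store'.split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

# P('store' | 'the') = Count('the', 'store') / Count('the')
p_store_given_the = bigram_counts[('the', 'store')] / unigram_counts['the']

print(unigram_counts['the'])             # 2
print(bigram_counts[('the', 'store')])   # 1
print(p_store_given_the)                 # 0.5
```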

In [None]:
def build_unigram_model(tokens):
  """
  Builds a unigram (1-gram) language model from a list of tokens.

  Args:
      tokens (List[str]): A list of word tokens.

  Returns:
      Dict[str, float]: A dictionary mapping each word to its probability
      (frequency normalized by the total number of tokens).
  """
  unigram_counts = Counter(tokens)
  denominator = len(tokens)
  return {
      word: unigram_counts[word] / denominator
      for word in unigram_counts
  }

def build_bigram_model(tokens):
  """
  Builds a bigram (2-gram) language model from a list of tokens.

  Args:
      tokens (List[str]): A list of word tokens.

  Returns:
      Dict[Tuple[str, str], float]: A dictionary mapping bigrams to their conditional
      probabilities P(w2 | w1), computed as count(w1, w2) / count(w1).
  """
  bigram_counts = Counter(generate_ngrams(tokens, 2))
  unigram_counts = Counter(tokens)
  return {
      (w1, w2): bigram_counts[(w1, w2)] / unigram_counts[w1]
      for (w1, w2) in bigram_counts
  }

def build_trigram_model(tokens):
  """
  Builds a trigram (3-gram) language model from a list of tokens.

  Args:
      tokens (List[str]): A list of word tokens.

  Returns:
      Dict[Tuple[str, str, str], float]: A dictionary mapping trigrams to their conditional
      probabilities P(w3 | w1, w2), computed as count(w1, w2, w3) / count(w1, w2).
  """
  trigram_counts = Counter(generate_ngrams(tokens, 3))
  bigram_counts = Counter(generate_ngrams(tokens, 2))
  return {
      (w1, w2, w3): trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]
      for (w1, w2, w3) in trigram_counts
  }

def build_ngram_model(tokens, n):
  """
  Dispatches to the appropriate n-gram model builder based on the value of n.

  Args:
      tokens (List[str]): A list of word tokens.
      n (int): The n-gram size (1, 2, or 3).

  Returns:
      Dict: The n-gram model (unigram, bigram, or trigram).
  """
  if n == 1:
    return build_unigram_model(tokens)
  elif n == 2:
    return build_bigram_model(tokens)
  elif n == 3:
    return build_trigram_model(tokens)
  else:
    raise ValueError('Only n = 1, 2, or 3 are supported.')

In [None]:
unigram_model = build_unigram_model(train_tokens)
bigram_model = build_bigram_model(train_tokens)
trigram_model = build_trigram_model(train_tokens)

## Generate Text

In [None]:
def generate_unigram_text(model, length = 10):
  """
  Generates text from a unigram model by sampling words based on their probabilities.

  Args:
      model (Dict[str, float]): A unigram probability model.
      length (int): Number of words to generate. If None or 0, generation continues until </s> is seen.

  Returns:
      str: A generated sentence.
  """
  sentence = []
  if length:
    for _ in range(length):
      next_word = random.choices(list(model.keys()), weights = list(model.values()))[0]
      sentence.append(next_word)
  else:
    while True:
      next_word = random.choices(list(model.keys()), weights = list(model.values()))[0]
      sentence.append(next_word)
      if next_word == '</s>':
        break
  return ' '.join(sentence)

def generate_bigram_text(model, seed_word, length = 10):
  """
  Generates text from a bigram model using a single seed word.

  Args:
      model (Dict[Tuple[str, str], float]): A bigram probability model.
      seed_word (str): The starting word of the sentence.
      length (int): Total number of words to generate. If None or 0, continues until </s>.

  Returns:
      str: A generated sentence beginning with the seed word.
  """
  sentence = [seed_word]
  if length:
    for _ in range(length - 1):
      candidates = {k[1]:v for k, v in model.items() if k[0] == sentence[-1]}
      if not candidates:
        break
      next_word = random.choices(list(candidates.keys()), weights = list(candidates.values()))[0]
      sentence.append(next_word)
  else:
    while True:
      candidates = {k[1]:v for k, v in model.items() if k[0] == sentence[-1]}
      if not candidates:
        break
      next_word = random.choices(list(candidates.keys()), weights = list(candidates.values()))[0]
      sentence.append(next_word)
      if next_word == '</s>':
        break
  return ' '.join(sentence)

def generate_trigram_text(model, seed_words, length = 10):
  """
  Generates text from a trigram model using two seed words.

  Args:
      model (Dict[Tuple[str, str, str], float]): A trigram probability model.
      seed_words (Tuple[str, str]): The starting two words.
      length (int): Total number of words to generate. If None or 0, continues until </s>.

  Returns:
      str: A generated sentence beginning with the seed words.
  """
  sentence = list(seed_words)
  if length:
    for _ in range(length - 2):
      candidates = {k[2]:v for k, v in model.items() if k[:2] == tuple(sentence[-2:])}
      if not candidates:
        break
      next_word = random.choices(list(candidates.keys()), weights = list(candidates.values()))[0]
      sentence.append(next_word)
  else:
    while True:
      candidates = {k[2]:v for k, v in model.items() if k[:2] == tuple(sentence[-2:])}
      if not candidates:
        break
      next_word = random.choices(list(candidates.keys()), weights = list(candidates.values()))[0]
      sentence.append(next_word)
      if next_word == '</s>':
        break
  return ' '.join(sentence)

def generate_ngram_text(model, seed_words, n, length = 10):
  """
  Dispatch function to generate text from an n-gram model.

  Args:
      model (Dict): An n-gram model (unigram, bigram, or trigram).
      seed_words (Union[str, Tuple[str, str]]): Seed word(s) used to start the sentence.
      n (int): The type of model (1, 2, or 3).
      length (int): Number of tokens to generate.

  Returns:
      None. The generated text is printed rather than returned.
  """
  if n == 1:
    print(generate_unigram_text(model, length))
  elif n == 2:
    print(generate_bigram_text(model, seed_words, length))
  elif n == 3:
    print(generate_trigram_text(model, seed_words, length))
  else:
    raise ValueError('Only n = 1, 2, or 3 are supported.')
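
The bigram and trigram generators above rescan the whole model dictionary for every generated token. One possible alternative, sketched here under assumptions, is to group the model by context once so that each step becomes a single dictionary lookup. The helpers `index_by_context` and `sample_next` and the toy model are hypothetical names introduced for illustration; they are not part of the notebook.

```python
import random
from collections import defaultdict

def index_by_context(model):
    """Group {ngram_tuple: prob} by context (all tokens except the last)."""
    index = defaultdict(lambda: ([], []))
    for ngram, prob in model.items():
        words, weights = index[ngram[:-1]]
        words.append(ngram[-1])
        weights.append(prob)
    return index

def sample_next(index, context):
    """Sample the next word for a context tuple, or None if it is unseen."""
    if context not in index:
        return None
    words, weights = index[context]
    return random.choices(words, weights=weights)[0]

# Hypothetical toy bigram model, for illustration only.
toy_bigram_model = {
    ('<s>', 'the'): 1.0,
    ('the', 'dog'): 0.5,
    ('the', 'cat'): 0.5,
    ('dog', '</s>'): 1.0,
    ('cat', '</s>'): 1.0,
}

toy_index = index_by_context(toy_bigram_model)
sentence = ['<s>']
while sentence[-1] != '</s>':
    next_word = sample_next(toy_index, (sentence[-1],))
    if next_word is None:
        break
    sentence.append(next_word)

print(' '.join(sentence))
```

The sketch applies to models keyed by tuples (bigrams and trigrams). The unigram model is keyed by plain strings and has no context to index.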

In [None]:
print('Unigram Model:')
generate_ngram_text(unigram_model, None, 1, 50)
print()
print('Bigram Model:')
generate_ngram_text(bigram_model, '<s>', 2, 50)
print()
print('Trigram Model:')
generate_ngram_text(trigram_model, ('<s>', 'i'), 3, 50)

Unigram Model:
items to possible <s> N educational in products who </s> in corporation it sept. mr. which N a the in the among and <s> ads producer crime dispute a express <unk> that stop resources a <s> on mark plan with voters get communist to </s> future </s> of underwriters this

Bigram Model:
<s> under the soviet economy throughout the nation 's contract fell a set a bid but the department to gold has about N N N in the agreement were N points to age group </s> <s> the july </s> <s> as he later than N N to retire about $

Trigram Model:
<s> i 'm not certain </s> <s> garbage magazine billed as the first few minutes of trading and construction markets </s> <s> west german <unk> automatic citizens this year to N N this year people are getting a bargain hunt </s> <s> he even sold one a democrat and one


## Calculate Perplexity

In [None]:
def calculate_perplexity(model, tokens, n):
  '''
  Calculates the perplexity of an n-gram model on a given token sequence.

  Perplexity measures how well a probability model predicts a sequence.
  Lower perplexity indicates better predictive performance.

  Arguments:
    - model: A dictionary representing the n-gram model (e.g., unigram, bigram, trigram)
    - tokens: A list of tokens (words) from the text to evaluate
    - n: The order of the n-gram (1 for unigram, 2 for bigram, etc.)

  Returns:
    - perplexity: A float value representing the model's perplexity on the token sequence

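  Notes:
    - If an n-gram is not found in the model, a small probability (1e-6) is used.
      This floor is not a true smoothing method: the resulting probabilities
      no longer sum to one.
    - Uses base-2 logarithm for computing log probability.
    - The log-probability sum is divided by len(tokens), although only
      len(tokens) - n + 1 n-grams are scored. On a long corpus the difference
      is negligible.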
  '''
  N = len(tokens)
  log_prob_sum = 0

  for t in range(len(tokens) - n + 1):
    if n == 1:
      ngram = tokens[t]
    else:
      ngram = tuple(tokens[t:t+n])
    prob = model.get(ngram, 1e-6)
    log_prob_sum += math.log2(prob)

  return 2 ** (-log_prob_sum / N)
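
A quick sanity check helps when reading perplexity values. The sketch below uses a hypothetical helper, `perplexity`, that takes the per-token probabilities directly (an assumption made to keep the example self-contained). A model that is uniform over 4 words should have a perplexity of 4, and a single token that hits the 1e-6 floor raises the value sharply.

```python
import math

def perplexity(probs):
    """Perplexity of a sequence, given the probability assigned to each token."""
    log_prob_sum = sum(math.log2(p) for p in probs)
    return 2 ** (-log_prob_sum / len(probs))

# Uniform over 4 words: every token gets probability 0.25, so the
# perplexity equals the number of equally likely choices.
uniform = perplexity([0.25] * 8)

# Same sequence, but one token is unseen and receives the 1e-6 floor.
with_floor = perplexity([0.25] * 7 + [1e-6])

print(uniform)     # 4.0
print(with_floor)  # roughly 19
```

This is one plausible reason the unsmoothed trigram model scores poorly on the test set: many test trigrams never occur in training, and each of them is charged the 1e-6 floor.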

In [None]:
print('Unigram test perplexity:', calculate_perplexity(unigram_model, test_tokens, 1))
print('Bigram test perplexity:', calculate_perplexity(bigram_model, test_tokens, 2))
print('Trigram test perplexity:', calculate_perplexity(trigram_model, test_tokens, 3))

Unigram test perplexity: 622.1353381284742
Bigram test perplexity: 341.81849008051756
Trigram test perplexity: 5355.641165309026


## Back-Off (Fallback) Perplexity

In [None]:
def calculate_fallback_perplexity(trigram_model, bigram_model, unigram_model, tokens):
  '''
  Calculates the perplexity of a fallback n-gram model on a given token sequence.

  This function uses a backoff approach:
  - It first tries to find the trigram probability
  - If not found, it falls back to the bigram
  - If the bigram is also missing, it falls back to the unigram
  - If the unigram is missing, it uses a small default probability (1e-6)

  Arguments:
    - trigram_model: A dictionary of trigram probabilities
    - bigram_model: A dictionary of bigram probabilities
    - unigram_model: A dictionary of unigram probabilities
    - tokens: A list of tokens to evaluate

  Returns:
    - perplexity: A float representing how well the fallback model predicts the token sequence

  Notes:
    - Uses log base 2
    - The perplexity is normalized by the total number of tokens
  '''
  n = len(tokens)
  log_prob_sum = 0
  for i in range(2, n):
    trigram = (tokens[i - 2], tokens[i - 1], tokens[i])
    bigram = (tokens[i - 1], tokens[i])
    unigram = tokens[i]
    prob = trigram_model.get(trigram, bigram_model.get(bigram, unigram_model.get(unigram, 1e-6)))
    log_prob_sum += math.log2(prob)

  return 2 ** (-log_prob_sum / n)
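
The nested `.get` calls above take the first estimate that exists, without any discounting. The sketch below reproduces that lookup order on a made-up corpus (the corpus and the helper names `ngrams` and `backoff_prob` are assumptions for illustration) and sums the scores over the vocabulary for one context.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical toy corpus, for illustration only.
tokens = '<s> the cat sat </s> <s> the dog sat </s>'.split()

uni_counts = Counter(tokens)
bi_counts = Counter(ngrams(tokens, 2))
tri_counts = Counter(ngrams(tokens, 3))

# Same maximum-likelihood estimates as the model builders above.
unigram = {w: c / len(tokens) for w, c in uni_counts.items()}
bigram = {g: c / uni_counts[g[0]] for g, c in bi_counts.items()}
trigram = {g: c / bi_counts[g[:2]] for g, c in tri_counts.items()}

def backoff_prob(w1, w2, w3):
    """First available estimate: trigram, then bigram, then unigram, then 1e-6."""
    if (w1, w2, w3) in trigram:
        return trigram[(w1, w2, w3)]
    if (w2, w3) in bigram:
        return bigram[(w2, w3)]
    return unigram.get(w3, 1e-6)

p_tri = backoff_prob('<s>', 'the', 'cat')  # seen trigram: 1 / 2
p_bi = backoff_prob('cat', 'the', 'dog')   # unseen trigram, seen bigram: 1 / 2
p_uni = backoff_prob('<s>', 'the', 'sat')  # only the unigram is seen: 2 / 10

# Sum of the scores for every possible next word after ('<s>', 'the').
total = sum(backoff_prob('<s>', 'the', w) for w in unigram)
print(p_tri, p_bi, p_uni)
print(total)  # about 1.8, which is more than 1
```

Because the seen trigrams for a context already sum to one, every score borrowed from a lower-order model adds extra mass. Proper back-off schemes discount the higher-order estimates to make room for it.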

In [None]:
print('Fallback Perplexity:', calculate_fallback_perplexity(trigram_model, bigram_model, unigram_model, test_tokens))

Fallback Perplexity: 77.84830636849189


## Laplace Smoothing - Trigrams

In [None]:
def trigram_model_leplace(tokens):
  '''
  Constructs a smoothed trigram language model using Laplace (add-one) smoothing.

  Arguments:
    - tokens: A list of tokens from a preprocessed corpus

  Returns:
    - smoothed_trigram_prob: A function that takes (w1, w2, w3) and returns the Laplace-smoothed trigram probability
    - V: The vocabulary size (number of unique tokens)

  How it works:
    - Counts trigram and bigram frequencies from the token list
    - Applies Laplace smoothing to avoid zero probabilities for unseen trigrams:
        P(w3 | w1, w2) = (count(w1, w2, w3) + 1) / (count(w1, w2) + V)
    - This helps the model assign a small probability to unseen sequences

  Example:
    prob_fn, vocab_size = trigram_model_leplace(tokens)
    prob = prob_fn('the', 'cat', 'sat')
  '''
  trigrams = generate_ngrams(tokens, 3)
  bigrams = generate_ngrams(tokens, 2)

  trigram_counts = Counter(trigrams)
  bigram_counts = Counter(bigrams)

  V = len(set(tokens))

  def smoothed_trigram_prob(w1, w2, w3):
    trigram = (w1, w2, w3)
    bigram = (w1, w2)

    return (trigram_counts.get(trigram, 0) + 1) / (bigram_counts.get(bigram, 0) + V)

  return smoothed_trigram_prob, V
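
To make the add-one formula concrete, the following sketch applies it to a made-up corpus (an assumption for illustration) and checks that the smoothed probabilities for one context sum to one. The helper name `laplace_prob` is hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus, for illustration only.
tokens = '<s> the cat sat </s> <s> the cat ran </s>'.split()

trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
bigram_counts = Counter(zip(tokens, tokens[1:]))
vocab = set(tokens)
V = len(vocab)  # 6 unique tokens

def laplace_prob(w1, w2, w3):
    # P(w3 | w1, w2) = (count(w1, w2, w3) + 1) / (count(w1, w2) + V)
    return (trigram_counts[(w1, w2, w3)] + 1) / (bigram_counts[(w1, w2)] + V)

seen = laplace_prob('the', 'cat', 'sat')    # (1 + 1) / (2 + 6) = 0.25
unseen = laplace_prob('the', 'cat', 'the')  # (0 + 1) / (2 + 6) = 0.125
total = sum(laplace_prob('the', 'cat', w) for w in vocab)

print(seen, unseen)
print(total)  # 1.0
```

Without smoothing, `('the', 'cat', 'sat')` would have probability 0.5. When `V` is much larger than `count(w1, w2)`, the `+ V` term dominates the denominator and most of the probability mass moves to unseen trigrams.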

In [None]:
smoothed_trigram_prob, V = trigram_model_leplace(train_tokens)

In [None]:
def calculate_trigram_smoothed_perplexity(smoothed_trigram_prob, test_tokens, V):
  '''
  Calculates the perplexity of a test sequence using a smoothed trigram model.

  Arguments:
    - smoothed_trigram_prob: A function that returns the probability of a trigram
                              (e.g., from trigram_model_leplace)
    - test_tokens: A list of tokens from the test corpus
    - V: The vocabulary size used during smoothing (not directly used here, but kept for consistency)

  Returns:
    - Perplexity: A float value representing how well the model predicts the test set.
                  Lower is better; higher implies the model is less confident.

  Notes:
    - Uses log base 2 for probability accumulation
    - Starts from index 2 since trigrams require 2 previous tokens
    - Applies the standard formula:
        Perplexity = 2 ^ [ - (1/N) * ∑ log2 P(w_i | w_{i-2}, w_{i-1}) ]
    - Assumes the input model already applies Laplace smoothing

  Example:
    prob_fn, V = trigram_model_leplace(train_tokens)
    ppl = calculate_trigram_smoothed_perplexity(prob_fn, test_tokens, V)
  '''
  n = len(test_tokens)
  log_prob_sum = 0
  for i in range(2, n):
    w1, w2, w3 = test_tokens[i - 2], test_tokens[i - 1], test_tokens[i]
    prob = smoothed_trigram_prob(w1, w2, w3)
    log_prob_sum += math.log2(prob)
  return 2 ** (-log_prob_sum / n)

In [None]:
print('Laplace smoothed trigram perplexity:', calculate_trigram_smoothed_perplexity(smoothed_trigram_prob, test_tokens, V))

Laplace smoothed trigram perplexity: 3116.601594194831


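We see that the back-off method performs the best, with a test perplexity of about 77.8, compared with 341.8 for the plain bigram model, 622.1 for the unigram model, 3116.6 for the Laplace-smoothed trigram model, and 5355.6 for the plain trigram model. Its ability to use the longer context when that context has been observed, and to fall back to a shorter context when it has not, makes it more flexible than the other methods, including Laplace smoothing. One caveat: this simple back-off applies no discounting, so the scores it assigns for a given context can sum to more than one, and its perplexity is not strictly comparable to that of a properly normalized model.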