# Bigram Language Model

In this lab we will implement a bigram language model and use it to compute the probability of some sample sentences.

As you go through, make sure you understand what's going on in each cell, and ask if it is unclear.

### Outcomes

- Know how to count word frequencies in a corpus using Python libraries.
- Understand how to compute conditional probabilities.
- Be able to apply the chain rule to compute the probability of a sentence.

### Overview

The first part of the notebook loads the same dataset as last week.
The next part splits the data into training and test sets, and tokenises the utterances.
After this there are some tasks to complete to implement and test the language model.


# 1. Preparing the Data


In [1]:
import os
import sys

path = os.path.abspath(os.path.join(".."))

if path not in sys.path:
    sys.path.append(path)

In [2]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.5.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


In [3]:
from dn.doc2dial import load_dataset

docs = load_dataset(100)

In [4]:
import spacy
from dn.bigram_language_model.preprocessing import pad

nlp = spacy.load("en_core_web_sm")

# Tokenize the utterances.
docs = [[token.text for token in nlp(doc)] for doc in docs]

# Pad the utterances.
docs = [list(pad(doc)) for doc in docs]

# Print an example of a tokenized utterance.
print(docs[2])

['<s>', 'Can', 'I', 'do', 'my', 'DMV', 'transactions', 'online', '?', '</s>']


In [5]:
from sklearn.model_selection import train_test_split

train: list[list[str]]
test: list[list[str]]

# Split the data into training and test sets with `scikit-learn`.
train, test = train_test_split(docs, train_size=0.8, test_size=0.2)

print(f"train size = {len(train)}")
print(f"test size = {len(test)}")

train size = 956
test size = 239


# 2. Counting Tokens

The bigram language model needs to compute two sets of counts from the training data:

1. The counts of how many times each bigram occurs.
2. The counts of how many times each word type occurs as the first token in a bigram.

Let's start by finding the vocabulary of unique token 'types':


In [6]:
import numpy

vocab = list(numpy.unique(numpy.concatenate(train)).tolist())
vocab_size = len(vocab)

print(f"vocab size = {vocab_size}")

vocab size = 1680


In [7]:
# For example, we can find the index of a token like so:
def find_token_index(token: str, vocab: list[str]):
    try:
        return vocab.index(token)
        # If `vocab` were a numpy array, we could find the token's index like so:
        # `return np.argwhere(vocab == token)[0][0]`
    except ValueError:
        return -1


print(find_token_index("<s>", vocab))

88


In [8]:
from pprint import pprint
from numpy.typing import NDArray


# Define a function to print the bigram statistics of a tokenized utterance.
def print_bigram_statistics(bigram_matrix: NDArray, tokenized: list[str]):
    bigram_statistics = []
    for token_current, token_next in zip(
        tokenized,
        tokenized[1:],
    ):
        # Find the indices of the tokens in the vocabulary.
        token_current_index = find_token_index(token_current, vocab)
        token_next_index = find_token_index(token_next, vocab)

        # If both tokens are in the vocabulary, find the value of the statistic
        # for the bigram and add it to the list to print.
        if token_current_index != -1 and token_next_index != -1:
            bigram_statistic = bigram_matrix[
                token_current_index, token_next_index
            ]
            bigram_statistics.append(
                f"{token_current} {token_next}: {bigram_statistic}"
            )

    pprint(bigram_statistics)

**TODO 2.1:** count the bigrams that occur in the training set.


In [9]:
import numpy


def find_bigram_counts(utterances: list[list[str]], vocab_size: int):
    # A matrix whose row indices correspond to the first tokens in bigrams and
    # column indices correspond to the second tokens in bigrams. The indices must
    # map to the index of the token in the vocabulary. The values of the matrix will
    # be the token counts. We initialize the matrix with ones to use add-one
    # smoothing.
    bigram_counts = numpy.ones((vocab_size, vocab_size))

    tokens = [token for utterance in utterances for token in utterance]

    # Iterate the tokens in each utterance pairwise.
    for token_current, token_next in zip(tokens, tokens[1:]):
        # Find the indices of the tokens in the vocabulary.
        token_current_index = find_token_index(token_current, vocab)
        token_next_index = find_token_index(token_next, vocab)

        # If both tokens are in the vocabulary, increment the bigram count.
        if token_current_index != -1 and token_next_index != -1:
            bigram_counts[token_current_index, token_next_index] += 1

    return bigram_counts


bigram_counts = find_bigram_counts(train, vocab_size)

# Print the counts of the bigrams in an example of a tokenized utterance.
print_bigram_statistics(bigram_counts, docs[2])

['<s> Can: 39.0',
 'Can I: 24.0',
 'I do: 40.0',
 'do my: 4.0',
 'my DMV: 4.0',
 'DMV transactions: 3.0',
 'transactions online: 2.0',
 'online ?: 7.0',
 '? </s>: 378.0']


**TODO 2.2:** Apply numpy's sum() function to the 'counts' variable to compute the number of times each word type occurs as the first token in a bigram.


In [10]:
from pprint import pprint
import numpy

# `axis=1`: apply the operation row-wise, i.e., across all rows for each column.
first_token_counts = bigram_counts.sum(axis=1)


# Compute the unigram counts. The result should be the same.
def find_unigram_counts(utterances: list[list[str]], vocab_size: int):
    # A vector whose indices correspond to the tokens in the vocabulary.
    unigram_counts = numpy.zeros(vocab_size)

    # Iterate the tokens in the utterances.
    for token in [token for utterance in utterances for token in utterance]:
        # Find the index of the token in the vocabulary.
        token_index = find_token_index(token, vocab)

        # If the token is in the vocabulary, increment the unigram count.
        if token_index != -1:
            unigram_counts[token_index] += 1

    return unigram_counts


unigram_counts = find_unigram_counts(train, vocab_size)

# Print the counts with which the tokens in an example of a tokenized utterance
# occur as the first token in a bigram.
print("\nBigram first-token counts:")
pprint(
    [
        f"{token}: {first_token_counts[find_token_index(token, vocab)]}"
        for token in docs[2]
    ]
)

# Print the unigram counts, which are smaller than the bigram first-token counts
# by the size of the vocabulary.
print("\nUnigram counts:")
pprint(
    [
        f"{token}: {unigram_counts[find_token_index(token, vocab)]}"
        for token in docs[2]
    ]
)


Bigram first-token counts:
['<s>: 2636.0',
 'Can: 1725.0',
 'I: 2097.0',
 'do: 1833.0',
 'my: 1897.0',
 'DMV: 1847.0',
 'transactions: 1688.0',
 'online: 1714.0',
 '?: 2080.0',
 '</s>: 2635.0']

Unigram counts:
['<s>: 956.0',
 'Can: 45.0',
 'I: 417.0',
 'do: 153.0',
 'my: 217.0',
 'DMV: 167.0',
 'transactions: 8.0',
 'online: 34.0',
 '?: 400.0',
 '</s>: 956.0']


**TODO 2.3:** Compute a matrix (numpy array) of conditional probabilities using the counts. Compute the log of this matrix as a variable 'log_cond_probs'.


In [11]:
import numpy
from numpy.typing import NDArray
from pprint import pprint


# Define a function to print the bigram statistics of a tokenized utterance.
def print_bigram_statistics(bigram_matrix: NDArray, tokenized: list[str]):
    bigram_statistics = []
    for token_current, token_next in zip(
        tokenized,
        tokenized[1:],
    ):
        # Find the indices of the tokens in the vocabulary.
        token_current_index = find_token_index(token_current, vocab)
        token_next_index = find_token_index(token_next, vocab)

        # If both tokens are in the vocabulary, find the value of the statistic
        # for the bigram and add it to the list to print.
        if token_current_index != -1 and token_next_index != -1:
            bigram_statistic = bigram_matrix[
                token_current_index, token_next_index
            ]
            bigram_statistics.append(
                f"{token_current} {token_next}: {bigram_statistic}"
            )

    pprint(bigram_statistics)


# Compute the bigram conditional probabilities.
bigram_conditional_probabilities = numpy.divide(
    bigram_counts, first_token_counts
)

print("\nBigram conditional probabilities:")
print_bigram_statistics(bigram_conditional_probabilities, docs[2])

# Compute the logarithms of the bigram conditional probabilities.
bigram_conditional_log_probabilities = numpy.log(
    bigram_conditional_probabilities
)

print("\nLogarithms of bigram conditional probabilities:")
print_bigram_statistics(bigram_conditional_log_probabilities, docs[2])

Our results:

Bigram conditional probabilities:
['<s> Can: 0.022608695652173914',
 'Can I: 0.011444921316165951',
 'I do: 0.02182214948172395',
 'do my: 0.0021085925144965737',
 'my DMV: 0.002165674066053059',
 'DMV transactions: 0.0017772511848341231',
 'transactions online: 0.0011668611435239206',
 'online ?: 0.0033653846153846156',
 '? </s>: 0.14345351043643265']

Logarithms of bigram conditional probabilities:
['<s> Can: -3.7894206833358135',
 'Can I: -4.470209200553974',
 'I do: -3.8248297937257685',
 'do my: -6.161734608815124',
 'my DMV: -6.135023619079349',
 'DMV transactions: -6.332687386487793',
 'transactions online: -6.75343791859778',
 'online ?: -5.69421302364005',
 '? </s>: -1.941744265355875']


**TODO 2.4:** Write a function that uses log_cond_probs to compute the probability of a given tokenised sentence, such as the example below.


In [12]:
# An example of a tokenized utterance.
doc_example = [
    "<s>",
    "If",
    "you",
    "give",
    "me",
    "the",
    "help",
    ",",
    "what",
    "is",
    "the",
    "payment",
    "system",
    "?",
    "<e>",
]

In [13]:
import numpy as np
from numpy.typing import NDArray
from pprint import pprint


def find_log_probability(
    tokenized: list[str],
    vocab: list[str],
    bigram_matrix: NDArray,
):
    log_probability = 0.0
    for token_current, token_next in zip(
        tokenized,
        tokenized[1:],
    ):
        # Find the indices of the tokens in the vocabulary.
        token_current_index = find_token_index(token_current, vocab)
        token_next_index = find_token_index(token_next, vocab)

        # If both tokens are in the vocabulary, add to the log probability.
        if token_current_index != -1 and token_next_index != -1:
            log_probability += bigram_matrix[
                token_current_index, token_next_index
            ]

    return log_probability


pprint(
    np.exp(
        find_log_probability(
            doc_example, vocab, bigram_conditional_log_probabilities
        )
    )
)

2.2375525734514128e-35


**TODO 2.5:** Compute the perplexity over the whole test set. You will need to make sure your code can handle unknown words -- make sure it does not end up misusing the index of -1 returned by get_index_for_word() for unknown words.


In [14]:
from numpy.typing import NDArray
import math
from dn.bigram_language_model.preprocessing import flatten


def find_perplexity(
    tokenized: list[str],
    vocab: list[str],
    bigram_matrix: NDArray,
):
    """
    The perplexity is the Nth root of the product of the inverse probabilities
    of the bigrams, where N is the number of bigrams. Because of the properties
    of logarithms, this is equivalent to (pseudocode):
    ```
        exp(1 - sum(log(probability(bigram)))/N)
    ```
    """
    return math.exp(
        1
        - find_log_probability(tokenized, vocab, bigram_matrix)
        / (len(tokenized) - 1)
    )


# Flatten the test set into a single "document".
test_flat = list(flatten(test))

pprint(find_perplexity(test_flat, vocab, bigram_conditional_log_probabilities))

466.51348396537867


**EXTENSION 1:** Use the language model to generate new sentences by sampling.
You can follow the example below to sample using scipy's multinomial class. Replace the distribution with the conditional distribution we computed earlier.


MORE EXTENSIONS:

- Add some smoothing to the counts and see how it affects the results.
- Use trigrams instead of bigrams. Does it improve perplexity?
