Let's see how natural language statistics behave on real data. We construct a vocabulary from the Time Machine dataset and print the 10 most frequent words.

In [None]:
import d2l
import torch
import random
import utils

In [None]:
tokens = utils.tokenize(utils.read_time_machine())
vocab = utils.Vocab(tokens)
print(vocab.token_freqs[:10])

As we can see, the most frequent words are rather boring to look at. They are often referred to as *stop words* and are therefore frequently filtered out. That said, they still carry meaning, so we keep them here. One thing that is quite clear, however, is that word frequency decays rapidly: the 10th most frequent word is less than 1/5 as common as the most frequent one. To get a better idea, we plot the word frequencies on log-log axes.

In [None]:
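# token_freqs is ordered from most to least frequent, so the position
# of each value along the x-axis is the rank of the token
freqs = [freq for _, freq in vocab.token_freqs]
d2l.plot(freqs, xlabel='token', ylabel='frequency',
         xscale='log', yscale='log')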
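
To put a number on how fast the frequencies fall off, one option is to fit a straight line to the log-log curve. The following is a minimal sketch and not part of the original notebook: `loglog_slope` is a hypothetical helper, and the frequencies passed to it are synthetic values standing in for `freqs`.

```python
import math

def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    Assumes `freqs` is sorted in decreasing order and contains only
    positive values; ranks start at 1.
    """
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    # Ordinary least squares: slope = cov(x, y) / var(x)
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    return cov / var

# Synthetic frequencies proportional to 1 / rank (illustrative, not real data)
toy_freqs = [1000.0 / rank for rank in range(1, 101)]
slope = loglog_slope(toy_freqs)
print(round(slope, 3))
```

On the real data one would call `loglog_slope(freqs)` instead. A more negative slope means faster decay, and the fit is only meaningful over the range where the log-log curve is roughly straight.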

What about word pairs?

In [None]:
# Pair each token with its successor within the same line
bigram_tokens = [list(zip(line[:-1], line[1:]))
                 for line in tokens]
bigram_vocab = utils.Vocab(bigram_tokens)
print(bigram_vocab.token_freqs[:10])

Out of the 10 most frequent word pairs, 9 are composed of stop words and only one is relevant to the actual book. Let's look at trigrams next.

In [None]:
# Group each token with its two successors within the same line
trigram_tokens = [list(zip(line[:-2], line[1:-1], line[2:]))
                  for line in tokens]
trigram_vocab = utils.Vocab(trigram_tokens)
print(trigram_vocab.token_freqs[:10])
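
The bigram and trigram cells follow the same pattern, so they can be generalized to any `n`. The sketch below is self-contained and does not use `utils.Vocab`, whose implementation is not shown here; `ngram_freqs` is a hypothetical helper, and it assumes that a frequency table is simply a list of `(token, count)` pairs sorted by decreasing count. Any extra behavior of `utils.Vocab` (for example reserved tokens or a minimum-frequency cutoff) is not modeled.

```python
from collections import Counter

def ngram_freqs(lines, n):
    """Count n-grams within each tokenized line, most frequent first."""
    counter = Counter()
    for line in lines:
        # Zip n shifted views of the line; zip stops at the shortest view,
        # so lines with fewer than n tokens contribute nothing.
        counter.update(zip(*(line[i:] for i in range(n))))
    return counter.most_common()

# Toy corpus for illustration only (not the book text)
toy_lines = [['the', 'time', 'machine'], ['the', 'time', 'traveller']]
bigrams = ngram_freqs(toy_lines, 2)
trigrams = ngram_freqs(toy_lines, 3)
print(bigrams[0])
```

As in the cells above, n-grams never span two lines, because each line is processed separately.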

Finally, let's compare the token frequencies of the three models: unigrams, bigrams, and trigrams.

In [None]:
bigram_freqs = [freq for _, freq in bigram_vocab.token_freqs]
trigram_freqs = [freq for _, freq in trigram_vocab.token_freqs]
d2l.plot([freqs, bigram_freqs, trigram_freqs], xlabel='token',
         ylabel='frequency', xscale='log', yscale='log',
         legend=['unigram', 'bigram', 'trigram'])