# N-grams with zip and with NLTK

by Koenraad De Smedt at UiB

---
*N-grams* are sequences of *n* consecutive tokens (words or letters) taken from a text, such as the following word trigrams.

```
Once upon a
upon a time
a time there
time there was
```

Breaking up a text into n-grams is useful for several NLP purposes, such as translation, error correction, finding collocations and document classification.

This notebook shows how `zip` makes it easy to compute n-grams as a list of tuples. The `ngrams` function in NLTK is also demonstrated.

---

We start with a word-tokenized text. See how we can take slices of the token list, each starting one position later and running to the end. If you read the lists in parallel, from top to bottom, you already get an impression of the n-grams.

In [None]:
tokens = ['Once', 'upon', 'a', 'time', 'there', 'was', '...']

print(tokens[0:])
print(tokens[1:])
print(tokens[2:])

With a comprehension that iterates over a range, we can collect all such slices in one list.

In [None]:
[tokens[i:] for i in range(3)]

If we unpack this list and pass the contained lists as arguments to `zip`, we get an iterator over tuples, which are word n-grams (in this case, trigrams). Unpacking that iterator into a list lets us see them.

In [None]:
tri = zip(*[tokens[i:] for i in range(3)])
# print(tri) would only show a zip object, so we unpack it into a list instead
[*tri]
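
One detail worth keeping in mind is that `zip` returns a lazy iterator, not a list, which is why the cell above unpacks it. The following small sketch shows that such an iterator can be consumed only once; the names `first_pass` and `second_pass` are used only for illustration.

```python
tokens = ['Once', 'upon', 'a', 'time', 'there', 'was', '...']

# zip returns a lazy iterator over tuples, not a list
tri = zip(*[tokens[i:] for i in range(3)])

first_pass = list(tri)   # consumes the iterator
second_pass = list(tri)  # the iterator is now exhausted

print(first_pass[:2])
print(second_pass)
```

If the n-grams are needed more than once, it is probably simplest to store them in a list first.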

We are now ready to define a function that operates on a list of tokens and produces n-gram tuples. It has an extra argument for *n* with a default of 3.

In [None]:
def n_grams(seq, n=3):
  """Return an iterator over the n-grams (as tuples) of a sequence."""
  return zip(*[seq[i:] for i in range(n)])

[*n_grams(tokens)]

We can override the default, for instance to get 4-grams.

In [None]:
[*n_grams(tokens, n=4)]
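
Because `zip` stops as soon as its shortest argument runs out, a list of *k* tokens yields *k* − *n* + 1 n-grams, and none at all when *n* is larger than *k*. The sketch below repeats the function definition so that it runs on its own; the dictionary `counts` is introduced here only for illustration.

```python
def n_grams(seq, n=3):
    return zip(*[seq[i:] for i in range(n)])

tokens = ['Once', 'upon', 'a', 'time', 'there', 'was', '...']

# Count the n-grams for several values of n, including one that is too large
counts = {n: len(list(n_grams(tokens, n))) for n in (1, 2, 3, 7, 8)}
print(counts)  # {1: 7, 2: 6, 3: 5, 7: 1, 8: 0}
```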

---
## N-grams with NLTK

Now that you understand how n-grams can be computed, let's look at NLTK, which also provides a tool for n-grams. We must first tokenize, so we import a tokenizer and create a slightly larger example text.

In [None]:
import nltk
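nltk.download('punkt')  # tokenizer models used by word_tokenize
# Note: recent NLTK releases may require nltk.download('punkt_tab') instead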
from nltk import word_tokenize

story = '''Once upon a time, there was a princess called Buttercup. She had a
farmhand called Westley; whenever she tells him to do something, he always
answers: "As you wish." At first she didn't realize he loves her, but
eventually she realizes it and she loves him too! Westley leaves to seek his
fortune overseas so they can marry. When his ship is attacked by the Dread
Pirate Roberts, who is infamous for never leaving survivors, Westley is
presumed dead...'''

Let's test our own `n_grams` function and show the first 10 trigrams.

In [None]:
ng = n_grams(word_tokenize(story))
[*ng][:10]

N-grams can be obtained in the same format by means of the `ngrams` function in NLTK. However, its second argument is obligatory; it has no default. The result is a *generator* of tuples, so it too must be unpacked into a list before we can slice it.

In [None]:
from nltk.util import ngrams
ng2 = ngrams(word_tokenize(story), n=3)
[*ng2][:10]
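
To get a feeling for what a generator of n-grams does, here is a minimal sketch of a lazy sliding window. It is not NLTK's actual implementation; the name `lazy_n_grams` is hypothetical, and the sketch assumes that *n* is at least 1. Unlike the slicing approach, it also accepts an iterator as input, because it never needs to index into the sequence.

```python
from collections import deque

def lazy_n_grams(iterable, n=3):
    """Yield n-grams one at a time from any iterable (hypothetical helper, assumes n >= 1)."""
    window = deque(maxlen=n)  # keeps only the last n items seen
    for item in iterable:
        window.append(item)
        if len(window) == n:
            yield tuple(window)

# The input here is an iterator, which cannot be sliced
words = iter('Once upon a time there was'.split())
result = list(lazy_n_grams(words, n=3))
print(result)
```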

### User interactions to set parameters

Google Colab offers user interactions to set parameters. These are written as special comments starting with `#@`. The following cell illustrates the use of a *slider* to choose the length of the n-grams. See the [forms example](https://colab.research.google.com/notebooks/forms.ipynb) for more possibilities. Outside of Google Colab the slider may not appear, but [ipywidgets](https://towardsdatascience.com/interactive-controls-for-jupyter-notebooks-f5c94829aee6) offers something similar.

In [None]:
#@title Choose the length of the n-grams
N_gram_length = 2 #@param {type:"slider", min:2, max:5, step:1}
ng2 = ngrams(word_tokenize(story), n=N_gram_length)
[*ng2][:10]

### Exercises

1.   Compute the bigrams in `story`.
2.   How many *different* bigrams are there in `story`?
3.   Convert the n-gram tuples to strings with spaces, for instance, `'Once upon a'`.
4.   See if you can make *character* n-grams from a text string. Compare the result with that in the notebook on Ranges.