$\newcommand{\is}{\mathrel{\mathop:=}}$
$\newcommand{\range}{\mathop{ran}}$
$\newcommand{\setof}[1]{\left \{ #1 \right \}}$
$\newcommand{\card}[1]{\left | #1 \right |}$
$\newcommand{\tuple}[1]{\left \langle #1 \right \rangle}$
$\newcommand{\emptytuple}{\left \langle \right \rangle}$
$\newcommand{\tuplecat}{\cdot}$
$\newcommand{\stringcat}{\cdot}$
$\newcommand{\emptystring}{\varepsilon}$
$\newcommand{\String}[1]{\mathit{#1}}$
$\newcommand{\LeftEdgeSymbol}{\rtimes}$
$\newcommand{\RightEdgeSymbol}{\ltimes}$
$\newcommand{\LeftEdge}{\LeftEdgeSymbol}$
$\newcommand{\RightEdge}{\RightEdgeSymbol}$
$\newcommand{\mult}{\times}$
$\newcommand{\multisum}{\uplus}$
$\newcommand{\multimult}{\otimes}$
$\newcommand{\freqsymbol}{\mathrm{freq}}$
$\newcommand{\freq}[1]{\freqsymbol(#1)}$
$\newcommand{\prob}{P}$
$\newcommand{\counts}[2]{\card{#2}_{#1}}$
$\newcommand{\inv}[1]{#1^{-1}}$
$\newcommand{\Lex}{\mathit{Lex}}$
$\newcommand{\length}[1]{\left | #1 \right |}$
$\newcommand{\suc}{S}$
$\newcommand{\sprec}{<}$
$\newcommand{\Rcomp}[2]{#1 \circ #2}$
$\newcommand{\domsymbol}{\triangleleft}$
$\newcommand{\idom}{\domsymbol}$
$\newcommand{\pdom}{\domsymbol^+}$
$\newcommand{\rdom}{\domsymbol^*}$
$\newcommand{\indegree}[1]{\mathrm{in(#1)}}$
$\newcommand{\outdegree}[1]{\mathrm{out(#1)}}$
$\newcommand{\cupdot}{\cup\mkern-11.5mu\cdot\mkern5mu}$
$\newcommand{\mymatrix}[1]{\left ( \matrix{#1} \right )}$
$\newcommand{\id}{\mathrm{id}}$

# An $n$-gram model of text

So far we have mostly studied $n$-gram models for linguistic reasons.
These models are very simple, yet they can capture a fair number of phonotactic and morphotactic conditions.
This in turn shows that these conditions are very simple.
But $n$-gram models aren't limited to linguistic theorizing.
In fact, they are mostly used in more applied domains.

## Unigrams for text classification

Suppose your task is to classify texts, for example as part of a search engine.
Ideally, this classification would proceed by carefully reading the entire text, interpreting it, and distilling its core themes through some high-level analysis.
But that requires a lot of time and skill, and may simply not be feasible in practice.
How does one adequately summarize, say, the 1130 pages of Robert Musil's *The Man Without Qualities*, or Grigori Perelman's proof of the Poincaré conjecture?
Whatever the right answer, it probably isn't something that can be done quickly and automatically.
And while one may be able to pay experts to work on these outstanding accomplishments, it's much harder to find somebody to summarize papers on cell biology because there are so many published every day.
With websites, human summarization is completely impossible given how often they are updated and how many new ones are created every minute.
So instead computers have to do the job, and since we haven't figured out a way yet to get computers to understand text, the models are necessarily simple and focussed on surface features.
Virtually all of them build on $n$-grams, the core idea being that the meaning of a text can be equated with the words that occur in it.

Let us look at a particularly simple way of formalizing this idea, one where we ignore how often certain words occur.
We will also ignore capitalization, as is commonly done in this model. 
For example, converting the mini-text *Only John could like John* to a set of unigrams (i.e. $1$-grams) only preserves the information that the text contains the words *only*, *john*, *could*, and *like*.
A few more examples are shown below.

In [1]:
import re
from pprint import pprint

def set_of_words(string):
    """convert a string to a set of words"""
    # split on non-word characters; the raw string keeps \w from being read as a string escape
    tokens = [word for word in re.split(r"[^\w]", string.lower()) if word]
    print("Input:", string)
    pprint(set(tokens))
    print("\n")

set_of_words("John is John, that much is obvious!")
set_of_words("The man and the woman are husband and wife.")
set_of_words("Police police police police police.")

Input: John is John, that much is obvious!
{'john', 'much', 'obvious', 'that', 'is'}


Input: The man and the woman are husband and wife.
{'and', 'wife', 'the', 'man', 'are', 'husband', 'woman'}


Input: Police police police police police.
{'police'}




A search engine, for instance, could use this model to convert any given website to a set of words.
When the user enters a query, e.g. *fed my Gremlin after midnight*, the search engine could convert the query to a set of words, too, and then check which websites have a similar set of words.
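
The comparison step can be sketched as follows. The text does not say how similarity between two sets of words should be measured, so this sketch assumes Jaccard similarity (size of the intersection divided by size of the union), which is only one of several reasonable choices. The helper names `word_set`, `jaccard`, and `rank`, as well as the three mini-pages, are made up for illustration.

```python
import re

def word_set(text):
    """Return the set of lowercase word tokens in text."""
    return {w for w in re.split(r"\W+", text.lower()) if w}

def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (an assumed measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# hypothetical mini-collection standing in for indexed websites
pages = {
    "gremlins": "Never feed a Gremlin after midnight.",
    "cooking": "Feed the sourdough starter every morning.",
    "weather": "Rain is expected after midnight.",
}

def rank(query, pages):
    """Return (score, name) pairs, most similar page first."""
    q = word_set(query)
    scored = [(jaccard(q, word_set(text)), name) for name, text in pages.items()]
    return sorted(scored, reverse=True)

ranking = rank("fed my Gremlin after midnight", pages)
for score, name in ranking:
    print(f"{name}: {score:.2f}")
```

Note that *fed* in the query does not match *feed* on the page, since the model only compares exact word forms.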

The general idea of the model is simple enough, and as you can see, even its implementation in a programming language is straightforward.
Note that here we are no longer dealing with $n$-gram grammars.
The task involves no notion of well-formedness.
Instead, unigrams are used as a compressed **representation** of the text, and all reasoning is done over this compressed representation.

The main advantage of the set-of-words model of texts is its simplicity: determining the meaning of a text only requires our very simple function $b$, which maps strings of words to sets of words.
But while practical applications often rank simplicity and efficiency over accuracy, the set-of-words model is too simple even for those.
There are at least three problems:

1. Context is not taken into account at all, even within individual sentences.
   Among other things, *The dog bit the man* and *The man bit the dog* incorrectly receive the same meaning.
   And along the same lines, *Not every student thinks they should leave* and *Every student thinks they should not leave* are taken to have identical meanings, too.
1. Since we do not count how often words occur, a text that mentions global warming once in passing is taken to cover this topic to the same extent as one that mentions it over a hundred times.
1. The sets are cluttered with uninformative words like *is*, *the*, *of*, and so on.
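
The three problems can be checked directly with a few comparisons. This is a minimal sketch; `word_set` is an illustrative helper that returns the set instead of printing it, and the two sentences used for the third problem are invented examples.

```python
import re

def word_set(text):
    """Return the set of lowercase word tokens in text (illustrative helper)."""
    return {w for w in re.split(r"\W+", text.lower()) if w}

# Problem 1: word order and scope are lost
same_order = word_set("The dog bit the man") == word_set("The man bit the dog")
same_scope = (word_set("Not every student thinks they should leave")
              == word_set("Every student thinks they should not leave"))

# Problem 2: one mention and a hundred mentions yield the same set
same_count = word_set("global warming") == word_set("global warming " * 100)

# Problem 3: two unrelated texts still overlap, but only in uninformative words
shared = word_set("The cat is on the mat") & word_set("The proof of the conjecture is long")

print(same_order, same_scope, same_count, sorted(shared))
```

All three comparisons come out as `True`, and the only words shared by the two unrelated sentences are *is* and *the*.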

The first problem can be alleviated by moving from unigrams to $n$-grams.
With bigrams, a headline like *man bites dog* is represented as the set $\setof{\text{man bites}, \text{bites dog}}$, whereas the much less startling *dog bites man* becomes $\setof{\text{dog bites}, \text{bites man}}$.
Note that we could also include explicit markers \$ to clearly identify the first and last word of a sentence.
None of this is an adequate representation of context, but it nonetheless works fairly well in practice, though that might just be because the practical problems language technology is asked to solve nowadays are still fairly simple.
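
A set-of-$n$-grams version of the model might look like the sketch below. The function name `ngram_set` and its `edges` parameter are assumptions made for illustration; with `edges=True`, the token list is padded with the marker `$` on both sides so that the first and last word can be identified.

```python
import re

def ngram_set(text, n=2, edges=False):
    """Return the set of word n-grams in text (hypothetical helper)."""
    tokens = [w for w in re.split(r"\W+", text.lower()) if w]
    if edges:
        # pad with n-1 edge markers on each side
        tokens = ["$"] * (n - 1) + tokens + ["$"] * (n - 1)
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

# unigram sets cannot tell these sentences apart, bigram sets can
a, b = "The dog bit the man", "The man bit the dog"
print(ngram_set(a, n=1) == ngram_set(b, n=1))
print(ngram_set(a, n=2) == ngram_set(b, n=2))

# with edge markers, the first and last word are identifiable
print(sorted(ngram_set("dog bites man", edges=True)))
```

The two sentences share the bigrams *the dog*, *bit the*, and *the man*, and differ only in *dog bit* versus *man bit*, so even the bigram model separates them by a narrow margin.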

Be that as it may, there are still problems 2 and 3 to take care of.
But this is best left for the next few units.