# Week 5: Units of Meaning

The material this week is a bit lighter than usual, and this notebook is accordingly shorter. We'll focus on the segmenter from Peter Norvig's chapter on "[Natural Language Corpus Data](http://norvig.com/ngrams/ch14.pdf)", which was covered in the async.

In [1]:
import math
import segment_utils; reload(segment_utils)

<module 'segment_utils' from 'segment_utils.pyc'>

# Edit Distance

For a good visual demo of edit distance, check out this page:
- http://leojiang.com/experiments/levenshtein/ 

NLTK also includes a basic implementation, which you can access below. The implementation is quite simple: [[source]](http://www.nltk.org/_modules/nltk/metrics/distance.html#edit_distance)

In [2]:
from nltk.metrics.distance import edit_distance
edit_distance("industry", "dentistry")

4
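
For intuition about what `edit_distance` is computing, here is a minimal sketch of the standard Levenshtein dynamic program. It is an illustration only, not NLTK's code, and the name `levenshtein` is made up for this example.

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions, and substitutions turning s into t."""
    # prev[j] holds the distance between the current prefix of s and t[:j].
    prev = list(range(len(t) + 1))  # from "" to t[:j] takes j insertions
    for i, cs in enumerate(s, start=1):
        curr = [i]  # from s[:i] to "" takes i deletions
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cs
                curr[j - 1] + 1,           # insert ct
                prev[j - 1] + (cs != ct),  # substitute, or match for free
            ))
        prev = curr
    return prev[-1]

print(levenshtein("industry", "dentistry"))  # prints 4
```

Only two rows of the table are kept at a time, so memory is linear in `len(t)` while runtime is proportional to `len(s) * len(t)`.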

Passing `transpositions=True` counts a swap of two adjacent characters as a single edit, which is handy for spell checking:

In [3]:
print edit_distance("believe", "beleive")
print edit_distance("believe", "beleive", transpositions=True)

2
1
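
To see why the distance drops from 2 to 1, here is a sketch of one common way to add transpositions, the "optimal string alignment" recurrence. It is an assumption that this matches NLTK's behavior in every case; `osa_distance` is a hypothetical name used only for this illustration.

```python
def osa_distance(s, t):
    """Edit distance where swapping two adjacent characters costs 1."""
    m, n = len(s), len(t)
    # d[i][j] is the distance between s[:i] and t[:j].
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            # Adjacent transposition: s ends in "xy" where t ends in "yx".
            if (i > 1 and j > 1 and s[i - 1] == t[j - 2]
                    and s[i - 2] == t[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]

print(osa_distance("believe", "beleive"))  # prints 1
```

The only change from plain Levenshtein is the extra case that looks two characters back in both strings.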


# Text Segmenter

Our text segmenter is based on a language model (LM) and dynamic programming. Roughly, we're going to generate a set of candidate segmentations, then pick the most likely one according to our LM.

We'll just use a unigram LM for this, "trained" on the top 50k words from the Google n-grams corpus. The probabilities are saved to disk, so we need only load them and define a simple function:

In [4]:
# Get unigram probabilities
histo = segment_utils.BuildHistogram("english_uni_simplified_sorted_top")

# Simple unigram scoring model
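def Pw(w):
  """Unigram log-probability, with a "stupid" length-based backoff for unseen words."""
  totals = histo['']  # normalizer, stored under the empty-string key
  if w in histo: return math.log(histo[w]) - math.log(totals)
  # Unseen word: subtract 3 from the log-score for every character.
  return -math.log(totals) - 3*len(w)

def Pwords(words):
  """Log-probability of a word sequence: the sum of the unigram log-probabilities."""
  return sum(Pw(w) for w in words)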
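
To get a feel for the scores, the sketch below applies the same rule as `Pw()` to a tiny hand-made histogram. The counts are invented for illustration and have nothing to do with the Google n-grams data; `toy_histo` and `toy_Pw` are hypothetical names.

```python
import math

# Hypothetical counts, invented for illustration; the '' key holds the total,
# mirroring how Pw() above reads histo[''].
toy_histo = {'': 1000, 'the': 500, 'cat': 20}

def toy_Pw(w, histo=toy_histo):
    """Same scoring rule as Pw() above, applied to the toy counts."""
    total = histo['']
    if w in histo:
        return math.log(histo[w]) - math.log(total)
    return -math.log(total) - 3 * len(w)

for w in ['the', 'cat', 'zzz', 'zzzzzz']:
    print("%-8s %.2f" % (w, toy_Pw(w)))
```

Known words get their relative-frequency log-probability, while unseen words score well below any known word and lose a further 3 per extra character, which discourages the segmenter from gluing long unknown chunks together.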

Recall that there are $2^{n-1}$ possible segmentations of an $n$-character string. However, we don't need to enumerate all of them. Instead, we'll break the problem into sub-problems:

- For each `i = 0, ..., n-1`, take the prefix `text[:i+1]` as a candidate first word (the code below caps its length at `L = 20` characters)
- Recursively compute the best segmentation of the remaining `text[i+1:]`
- Return the highest-scoring segmentation, according to `Pwords()`
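
The $2^{n-1}$ count quoted above can be checked by brute force on short strings. The sketch below uses the same prefix/remainder split but keeps every candidate instead of only the best one; `all_segmentations` is a hypothetical helper and is only practical for small inputs.

```python
def all_segmentations(text):
    """Every way to cut text into non-empty pieces (exponential; short inputs only)."""
    if not text:
        return [[]]
    results = []
    for i in range(len(text)):
        first, rem = text[:i + 1], text[i + 1:]
        for rest in all_segmentations(rem):
            results.append([first] + rest)
    return results

# The number of segmentations doubles with each extra character.
for n in range(1, 8):
    assert len(all_segmentations("a" * n)) == 2 ** (n - 1)

print(all_segmentations("cat"))
# [['c', 'a', 't'], ['c', 'at'], ['ca', 't'], ['cat']]
```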

We'll use the `@memo` decorator to cache function calls, which lets us avoid expensive re-computation of the same sub-problem. This is equivalent to storing an explicit dynamic programming table, but lets us write the implementation in a recursive style.
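
The source of `segment_utils.memo` is not shown in this notebook. A minimal sketch of what such a decorator might look like, assuming a plain dictionary keyed by the call arguments, is given below. The `fib` function and the `calls` list are there only to demonstrate the caching.

```python
def memo(f):
    """Cache the results of f, keyed by its (hashable) positional arguments."""
    table = {}
    def fmemo(*args):
        if args not in table:
            table[args] = f(*args)
        return table[args]
    fmemo.table = table  # exposed so the cache can be inspected
    return fmemo

calls = []  # records each time the function body actually runs

@memo
def fib(n):
    calls.append(n)
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(30))     # prints 832040
print(len(calls))  # prints 31: one real computation per distinct argument
```

Without the decorator the same call would run the function body over a million times; with it, each distinct argument is computed once and then looked up.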

In [5]:
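def splits(text, L=20):
  """All (first, rem) splits of text where first has 1 to L characters."""
  return [(text[:i+1], text[i+1:])
          for i in range(min(len(text), L))]

@segment_utils.memo
def segment(text):
  """Best segmentation of text into words, as scored by Pwords()."""
  if not text: return []
  candidates = ([first]+segment(rem) for first, rem in splits(text))
  return max(candidates, key=Pwords)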

In [6]:
segment("choosespain")

['choose', 'spain']

In [7]:
segment("inaholeinthegroundtherelivedahobbit")

['in', 'a', 'hole', 'in', 'the', 'ground', 'there', 'lived', 'a', 'hobbit']
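
To experiment without the n-grams file, the sketch below rebuilds the segmenter on a toy vocabulary. The counts are invented for illustration, a plain dictionary stands in for the `@memo` decorator, and all of the `toy_*` names are hypothetical.

```python
import math

# Hypothetical counts, invented for illustration; '' holds the total.
toy_histo = {'': 1000, 'choose': 10, 'chooses': 5, 'spain': 10, 'pain': 10}

def toy_Pw(w):
    """Unigram log-probability with the same length-based backoff as Pw()."""
    total = toy_histo['']
    if w in toy_histo:
        return math.log(toy_histo[w]) - math.log(total)
    return -math.log(total) - 3 * len(w)

def toy_Pwords(words):
    return sum(toy_Pw(w) for w in words)

def toy_splits(text, L=20):
    return [(text[:i + 1], text[i + 1:]) for i in range(min(len(text), L))]

cache = {}  # plain dict standing in for the @memo decorator

def toy_segment(text):
    """Best segmentation of text under the toy unigram model."""
    if not text:
        return []
    if text not in cache:
        candidates = ([first] + toy_segment(rem)
                      for first, rem in toy_splits(text))
        cache[text] = max(candidates, key=toy_Pwords)
    return cache[text]

print(toy_segment("choosespain"))  # prints ['choose', 'spain']
print(toy_Pwords(['choose', 'spain']) > toy_Pwords(['chooses', 'pain']))  # prints True
```

Both readings use only known words, so the choice comes down to the unigram counts: with these toy numbers "chooses" is rarer than "choose", which tips the score toward `['choose', 'spain']`.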

### Discussion questions

**Q:** How many unique calls to `segment(text)` are there? (i.e. what is the size of our cached DP table?)

**Q:** What is the total runtime of this algorithm, as implemented above?

**Q:** In the above implementation, we treat `Pwords` as a black-box function over a sequence of words. Could we make the segmenter more efficient by exploiting the structure of the language model? What would our optimal runtime be in that case?

**Bonus Q:** Would your optimization work if we used a bigram language model? Or a trigram model?