# Spell Correction Using Google's Ngram Dataset

In this tutorial, we will implement a simple spell correction API based on Google's Ngram dataset. Given a mistyped word, the API returns a list of suggested words, sorted by how likely each suggestion is to be the intended word.

We will use two datasets as input for this problem: Google's Ngram dataset and a list of English dictionary words. Google's Ngram dataset collects the ngrams found in the books published from 1505 to 2008 that are available in Google Books. It covers ngrams of up to 5 words (5-grams) and records the number of occurrences of each ngram. Because most of the data was collected by OCR, the dataset contains a non-negligible number of misspelled words, so we will also use a list of English dictionary words as a second check.

In [57]:
from google_ngram_downloader import readline_google_store
import string
import sqlite3
import pandas as pd
from Queue import PriorityQueue
from math import log
from collections import Counter

# 1. Data Collection

In this part, we will collect the 1-gram words from Google Ngram. Although we only use the 1-gram data, the smallest subset, it is still too large to download to local storage. Fortunately, google-ngram-downloader lets us fetch the dataset one entry at a time. Using this API, we can filter and process all the data we need without storing the raw files.

The following example shows how google-ngram-downloader is used:

```python
>>> fname, url, records = next(readline_google_store(ngram_len=1))
>>> fname
'googlebooks-eng-all-1gram-20120701-0.gz'
>>> url
'http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-1gram-20120701-0.gz'
>>> next(records)
Record(ngram=u"0'9", year=1797, match_count=1, volume_count=1)
```
After obtaining `records`, you can repeatedly call `next(records)` to fetch the next entry. When the iteration reaches the end, `next()` raises a `StopIteration` exception, which tells you when to stop. Each entry is returned as a `Record` that stores the occurrences of an ngram in the publications of one year. The `Record` type has 4 attributes:
- ngram: The ngram string, in the format `"word_type ..."`. The string may contain underscores: the part before the first underscore is the word we need, and the part after it is the type of the word.
- year: The publication year this record covers.
- match_count: The number of occurrences of the ngram in the books of this year.
- volume_count: The number of books of this year that contain the ngram.
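
As a small illustration of the `ngram` format described above, the sketch below separates the word from its type. `SampleRecord` and `split_ngram` are hypothetical names introduced here: the namedtuple only mimics the four attributes listed above and is not the library's own `Record` class, and the sample values are made up.

```python
from collections import namedtuple

# Hypothetical stand-in that mirrors the four attributes of a Record.
SampleRecord = namedtuple('SampleRecord',
                          ['ngram', 'year', 'match_count', 'volume_count'])


def split_ngram(ngram):
    """Split 'word_TYPE' into (word, type); type is None when there is no tag."""
    word, sep, tag = ngram.partition('_')
    return word, (tag if sep else None)


# Made-up sample records, for illustration only.
samples = [
    SampleRecord('kernel_NOUN', 2001, 120, 45),
    SampleRecord('kernel', 2001, 130, 47),
]

for rec in samples:
    print(split_ngram(rec.ngram))
```

Using `partition` rather than slicing at `find('_')` keeps the whole word when the ngram carries no type at all.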

Because we are fetching the ngrams over the internet, the data collection process is extremely slow. To make the later steps convenient, we will first process and filter the raw data from Google Ngram and store the result in a sqlite database.

**Specifications:**
- Create a database with a table named `words`. The table should include the ngram string, the match count and the volume count.
- Sum the match count and volume count of each 1-gram word over all years from 2000 onward (2000 included). Store the word string in the database as unicode, without the type information.
- Since fetching all the 1-grams takes a very long time, you only need to fetch the 1-grams starting with the letter 'k'.
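
A minimal sketch of the storage step, assuming an in-memory sqlite database so that nothing is written to disk. The rows are made-up values for illustration, and the column types are one reasonable choice rather than a requirement.

```python
import sqlite3

# In-memory database for illustration; a file path works the same way.
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('create table words '
               '(word text, match_count integer, volume_count integer)')

# Made-up aggregated counts, for illustration only.
rows = [('kernel', 250, 92), ('kennel', 40, 17)]
cursor.executemany('insert into words values (?, ?, ?)', rows)
conn.commit()

# Read the rows back to check what was stored.
stored = cursor.execute(
    'select word, match_count from words order by word').fetchall()
print(stored)
conn.close()
```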

In [58]:
def fetch_words(dbfile):
    """ Fetch 1-gram words from Google Ngram dataset and store to a sqlite table
    Inputs:
        dbfile (str): path to the database file
    Outputs:
        None
    """
    conn = sqlite3.connect(dbfile)
    cursor = conn.cursor()
    cursor.execute('''drop table if exists words''')
    cursor.execute('''create table words (word text, match_count long, volume_count long)''')

    curr_word = None
    match_cnt = 0
    volume_cnt = 0
    word_dict = dict()
    
    # To fetch all the 1-grams, use the following code
    # _, _, records = next(readline_google_store(ngram_len=1, lang='eng', indices=list(string.ascii_lowercase)))
    
    _, _, records = next(readline_google_store(ngram_len=1, lang='eng', indices='k'))

    # word count
    while True:
        try:
            record = next(records)
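            # skip records of books published before 2000
            if record.year < 2000:
                continue
            ngram = record.ngram
            # keep the part before the first underscore; find('_') returns -1
            # for an ngram without a type and would cut off its last letter
            word = ngram.split('_')[0]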

            if word == curr_word:
                # add match and volume count to counters
                match_cnt += record.match_count
                volume_cnt += record.volume_count
            else:
                # finished traversing current word
                if curr_word is not None and match_cnt > 10000 and volume_cnt > 1000:
                    # word_dict[curr_word] = Word(curr_word, match_cnt, volume_cnt)
                    cursor.execute('''insert into words values (?, ?, ?)''', (curr_word, match_cnt, volume_cnt))
                    print curr_word, match_cnt, volume_cnt
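                # start the counters from the current record so that the
                # first record of the new word is not dropped
                curr_word = word
                match_cnt = record.match_count
                volume_cnt = record.volume_count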

        except StopIteration:
            # store the last word, using the same thresholds as in the loop
            if curr_word is not None and match_cnt > 10000 and volume_cnt > 1000:
                cursor.execute('''insert into words values (?, ?, ?)''', (curr_word, match_cnt, volume_cnt))
                print curr_word, match_cnt, volume_cnt
            break
    conn.commit()
    conn.close()
    print 'done'

In [59]:
db_filename = 'sc.db'
# fetch_words(db_filename)

Now that all the 1-grams we need are in the database, the next step is to filter out the misspelled words. To do that, check whether each 1-gram in the database exists in the English dictionary and discard the ones that do not. The dictionary word list is given in the file `'wordsEn.txt'`.

**Specifications:**
- Treat a 1-gram string as a dictionary word if the lower case of the string is a valid word in the dictionary.
- Return a Python dictionary which has the 1-gram string as key and a `Word` type structure as value.

In [60]:
# Structure for a 1-gram word
class Word:
    def __init__(self, value, match_count, volume_count):
        assert isinstance(match_count, int)
        assert isinstance(volume_count, int)

        self.value = value
        self.match_count = match_count
        self.volume_count = volume_count

In [61]:
def get_dictionary(dict_filename):
    """ Load the dictionary file to memory
    Inputs:
        dict_filename (str): path to the dictionary file
    Outputs:
        set: A set contains all the dictionary words
    """
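    dict_set = set()
    # the context manager closes the file even if reading fails
    with open(dict_filename, 'r') as dict_file:
        for word in dict_file:
            # strip the line break (handles both \n and \r\n endings)
            dict_set.add(word.replace('\n', '').replace('\r', ''))

    return dict_set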


def filter_words(dbfile):
    """ Filter the misspelled words in Google 1-gram
    Inputs:
        dbfile: path to the database file
    Outputs:
        dict: A python dictionary with the key of str and value of Word
    """
    dict_set = get_dictionary('wordsEn.txt')
    conn = sqlite3.connect(dbfile)
    raw_words = pd.read_sql_query('''select * from words''', conn)
    words_dict = dict()
    raw_words[['match_count', 'volume_count']] = raw_words[['match_count', 'volume_count']].astype(int)
    for index, row in raw_words.iterrows():
        raw_word = row['word']
        if raw_word in dict_set:
            words_dict[raw_word] = Word(raw_word, row['match_count'], row['volume_count'])

    conn.close()
    return words_dict

In [62]:
words = filter_words(db_filename)
print len(words)

294


# 2. Edit Distance
Now that we have collected the vocabulary words from Google Ngram, we can perform an approximate search in the vocabulary. For this search problem, we first define a similarity function between two words: the edit distance. The edit distance between two words $w_1, w_2$ is defined as the least number of operations needed to turn $w_1$ into $w_2$. Four kinds of edit operation are allowed: (1) insert a letter at any place in a word, (2) delete a letter from a word, (3) change a letter in a word to another letter, and (4) swap two consecutive letters in a word.
A popular way to calculate the edit distance is dynamic programming. The core idea of the algorithm is to solve the bigger problem by solving several subproblems. Specifically, we maintain a matrix $D \in R^{(|w_1|+1)\times (|w_2|+1)}$ in which every entry $D_{i,j}$ denotes the edit distance between the first $i$ letters of $w_1$ and the first $j$ letters of $w_2$. The first row and column are $D_{0,j}=j$ and $D_{i,0}=i$. For $i, j \geq 1$, writing $w^{k}$ for the letter of $w$ at zero-based position $k$, the recurrence for $D_{i,j}$ is: 
$$
    D_{i,j}=
\begin{cases}
    \min(D_{i-2,j-2}+1, D_{i-1,j-1} + [w_1^{i-1}\neq w_2^{j-1}], D_{i-1,j}+1, D_{i, j-1}+1),& \text{if } i,j \geq 2,\ w_1^{i-1}=w_2^{j-2},\ w_1^{i-2}=w_2^{j-1}\\
    \min(D_{i-1,j-1} + [w_1^{i-1}\neq w_2^{j-1}], D_{i-1,j}+1, D_{i, j-1}+1),               & \text{otherwise}
\end{cases}
$$
where $[\cdot]$ is 1 if the condition inside holds and 0 otherwise.
As the recurrence shows, $D_{i,j}$ depends only on entries above it and to its left. Therefore, filling in $D$ from left to right, top to bottom guarantees a valid solution at $D_{|w_1|,|w_2|}$.
This dynamic programming approach takes $O(|w_1|\cdot |w_2|)$ time.
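
To make the recurrence concrete, the sketch below fills the whole matrix $D$ with unit costs and returns it, so that individual entries can be inspected. `distance_table` is a hypothetical helper written for this illustration; it is not the `edit_distance` function of this tutorial, although it follows the same dynamic programming scheme.

```python
def distance_table(w1, w2):
    """Return the full DP matrix D for two words, using unit costs."""
    m, n = len(w1), len(w2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i          # delete the first i letters of w1
    for j in range(1, n + 1):
        D[0][j] = j          # insert the first j letters of w2
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            change = 0 if w1[i - 1] == w2[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + change,   # change (or keep) a letter
                          D[i - 1][j] + 1,            # delete a letter
                          D[i][j - 1] + 1)            # insert a letter
            # swap of two consecutive letters
            if (i > 1 and j > 1 and w1[i - 1] == w2[j - 2]
                    and w1[i - 2] == w2[j - 1]):
                D[i][j] = min(D[i][j], D[i - 2][j - 2] + 1)
    return D


for row in distance_table('ab', 'ba'):
    print(row)
```

The bottom-right entry for `'ab'` and `'ba'` is 1: a single swap turns one word into the other, whereas two edits would be needed without the swap operation.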

In [63]:
def insert_cost(letter):
    return 1


def delete_cost(letter):
    return 1


def change_cost(letter1, letter2):
    if letter1 == letter2:
        return 0
    else:
        return 1


def swap_cost(letter1, letter2):
    if letter1 == letter2:
        return 0
    else:
        return 1


def edit_distance(word1, word2, insert_func, delete_func, change_func, swap_func):
    """ Calculate the edit distance between two words
    Inputs:
        word1(str), word2(str): The two words to be compared
        insert_func: The insertion cost function
        delete_func: The deletion cost function
        change_func: The modification cost function
        swap_func: The swapping cost function
    Outputs:
        int: The edit distance
    """
    m = len(word1)
    n = len(word2)
    dp = [[0 for j in range(n + 1)] for i in range(m + 1)]

    # fill the first column and row with the cost functions passed in
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + delete_func(word1[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + insert_func(word2[j - 1])

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            letter1 = word1[i - 1]
            letter2 = word2[j - 1]
            dp[i][j] = min(dp[i - 1][j - 1] + change_func(letter1, letter2),
                           dp[i - 1][j] + delete_func(letter1),
                           dp[i][j - 1] + insert_func(letter2))
            # a swap applies when the last two letters of both prefixes are transposed
            if (i > 1 and j > 1 and letter1 == word2[j - 2]
                    and letter2 == word1[i - 2]):
                dp[i][j] = min(dp[i][j],
                               dp[i - 2][j - 2] + swap_func(letter1, letter2))

    return dp[m][n]

In [64]:
edit_distance('applepen', 'pineappleone', insert_cost, delete_cost, change_cost, swap_cost)

6

The correct implementation of `edit_distance` should return 6.

Now that we can calculate the edit distance between two words, an obvious solution to spell correction is to exhaustively calculate the edit distance between the query word $w$ and every word $v_i$ in the vocabulary $V$, and choose the $v_i$ that minimizes `edit_distance`$(w, v_i)$ as the suggestion. However, this method is expensive, since we have to calculate the edit distance between $w$ and every word in $V$. To avoid this, we will implement `k-gram indexes` in the next part to eliminate most of the vocabulary from the comparison.
Note: In our implementation, we wrote 4 separate cost functions (`insert_cost`, `delete_cost`, `change_cost` and `swap_cost`) and pass them as parameters to `edit_distance`. This may seem unnecessary here, but keeping these functions outside `edit_distance` makes the code easy to extend with more advanced features. For example, we may make the cost of changing one letter to another depend on the locations of the two letters on the keyboard, or on their similarity in pronunciation. In these cases, only the corresponding cost functions need to be modified; all other code stays the same.
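
As one possible illustration of the keyboard idea in the note above, the sketch below defines a change cost that is lower for neighbouring keys. `keyboard_change_cost`, the QWERTY layout table and the weight 0.5 are all assumptions made for this example; lowercase input is assumed and the stagger between keyboard rows is ignored.

```python
# Letter rows of a QWERTY keyboard; the row stagger is ignored for simplicity.
QWERTY_ROWS = ['qwertyuiop', 'asdfghjkl', 'zxcvbnm']
KEY_POS = {ch: (r, c)
           for r, row in enumerate(QWERTY_ROWS)
           for c, ch in enumerate(row)}


def keyboard_change_cost(letter1, letter2):
    """Cost 0 for equal letters, 0.5 for neighbouring keys, 1 otherwise."""
    if letter1 == letter2:
        return 0
    pos1 = KEY_POS.get(letter1)
    pos2 = KEY_POS.get(letter2)
    if pos1 is None or pos2 is None:
        return 1  # characters that are not on the letter rows
    # neighbours are at most one row and one column apart
    if abs(pos1[0] - pos2[0]) <= 1 and abs(pos1[1] - pos2[1]) <= 1:
        return 0.5
    return 1


print(keyboard_change_cost('w', 'e'))   # neighbouring keys
print(keyboard_change_cost('w', 'r'))   # two keys apart
```

Because it takes the same two arguments as `change_cost`, this function can be passed as `change_func` to `edit_distance` without touching the rest of the algorithm, although the returned distance may then be a fraction rather than an integer.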

# 3. K-gram Indexes
In this part, we will build k-gram indexes to reduce the number of vocabulary words that must be compared with the query word. 
As mentioned in part 2, exhaustively comparing the query word to every word in the vocabulary is very time consuming. To make the spell correction algorithm faster, we need another algorithm that quickly narrows the vocabulary down to a small set of likely candidates. Once the smaller set is determined, we can run `edit_distance` on it to find the closest word. 
We now show how to do that using k-gram indexes. As an example, we will explain how a bigram (2-gram) index works. Unlike Google Ngram, whose ngrams are made of words, the bigrams here are made of letters. For every bigram (two consecutive letters), we store all the vocabulary words that contain it. Once the bigram index is built, the set of vocabulary words that contain a given bigram can be fetched in $O(1)$ time. 
To find the reduced vocabulary using the bigram index, we first enumerate all bigrams in the query word. Then we use the bigram index to find the vocabulary words that contain each of these bigrams. The more of the query's bigrams a word contains, the more similar it is likely to be to the query word.
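
The following sketch shows the bigram comparison on a single pair of words. `letter_bigrams` is a hypothetical helper used only for this illustration.

```python
def letter_bigrams(word):
    """List the letter bigrams of a word, in order, in lowercase."""
    w = word.lower()
    return [w[i:i + 2] for i in range(len(w) - 1)]


query = letter_bigrams('kewnel')
candidate = letter_bigrams('kernel')
# bigrams of the query that also occur in the candidate
shared = [bi for bi in query if bi in candidate]

print(query)
print(candidate)
print(shared)
```

Changing one letter in the middle of a word alters only the two bigrams that contain it, so `kewnel` still shares three of its five bigrams with `kernel`.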

**Specifications:**
- As in the example, we will only implement the bigram index
- To reduce complexity, we will build the index using the vocabulary words in lowercase
- Likewise, `bigram_search` converts the query word to lowercase before searching the bigram index
- Punctuation and other non-letter characters should be included in the bigram index
- Keep the vocabulary words that miss fewer than 3 of the query word's bigrams, that is, words that share more than $|w| - 4$ bigrams with a query word of length $|w|$

In [65]:
def bigram_add(bigram_dict, key, value):
    if key not in bigram_dict:
        bigram_dict[key] = set()
    bigram_dict[key].add(value)


def bigram_index(words_dict):
    """ Build the bigram index from the vocabulary words
    Inputs:
        words_dict(dict): The vocabulary words
    Outputs:
        dict: A python dictionary with the key of str, value of set of str
    """
    bigram = dict()
    for w in words_dict:
        w_lower = w.lower()
        if len(w) < 2:
            bigram_add(bigram, w_lower, w)
            continue
        for i in range(len(w_lower) - 1):
            bi = w_lower[i: i + 2]
            bigram_add(bigram, bi, w)
    return bigram


def bigram_search(bigram_dict, word):
    """ Fuzzy search for the query word in the bigram index
    Inputs:
        bigram_dict(dict): The bigram index
        word(str): The query word
    Outputs:
        list: A list of potential matching vocabulary
    """
    word_lower = word.lower()
    bi_list = list()
    for i in range(len(word_lower) - 1):
        bi = word_lower[i: i + 2]
        if bi in bigram_dict:
            bi_list += list(bigram_dict[bi])
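    # count how many of the query's bigrams each candidate word contains
    bi_cnt = Counter(bi_list)
    # keep the words that miss fewer than 3 of the query's len - 1 bigrams
    return [w for w in bi_cnt if bi_cnt[w] > len(word_lower) - 4]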


In [66]:
bigram = bigram_index(words)
vocabulary = bigram_search(bigram, 'kewnel')
print len(words)
print len(vocabulary)

294
4


The results of the code above should be:
```python
>>> len(words)
294
>>> len(vocabulary)
4
```
After searching the prebuilt bigram index, the candidate set shrinks from 294 words to 4 for the query word 'kewnel'. More than $98\%$ of the vocabulary has been eliminated with only $O(|w|)$ index lookups.
We can then run `edit_distance` between the query word and the reduced vocabulary.

# 4. Put Everything Together
Now we have the elements required to build a simple spell corrector. The only job left is to assemble them.
Using the features developed above, we can narrow down the vocabulary $V$ and calculate the edit distance between the query word and the remaining words. We also know the number of occurrences of each word in the vocabulary. To find the best suggestion, we define:
$$score(q, w) = \frac{\log(w.\text{volume_count}) + \log(w.\text{match_count})}{d(q, w) + \epsilon}$$
where $q$ is the query word, $w$ is the vocabulary word, $d(q, w)$ is the edit distance between $q$ and $w$, and $\epsilon$ is a smoothing constant that avoids a division by zero when $d(q, w) = 0$ (in the implementation, we take $\epsilon=0.1$).
For each candidate $w$ in the reduced vocabulary, we calculate the score. The vocabulary word with the maximum score will be the best suggestion.
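
As a rough illustration of how the score trades off frequency against distance, the sketch below ranks three placeholder candidates. The candidate names, distances and counts are invented for this example, and `heapq.nlargest` is used as a compact stand-in for a priority queue.

```python
import heapq
from math import log


def score(distance, match_count, volume_count, eps=0.1):
    """Score from the formula above: log counts over the smoothed distance."""
    return (log(volume_count) + log(match_count)) / (distance + eps)


# Hypothetical candidates:
# word -> (edit distance to the query, match_count, volume_count)
candidates = {
    'alpha': (1, 5000, 800),
    'beta': (1, 50000, 4000),
    'gamma': (2, 50000, 4000),
}

# pick the two candidates with the highest score
best = heapq.nlargest(2, candidates, key=lambda w: score(*candidates[w]))
print(best)
```

At equal distance the more frequent word ranks first, and at equal counts the closer word ranks first.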

In [67]:
def word_suggestion(words_dict, bigram_dict, word, n):
    """ Give correct word suggestion for an input word
    Inputs:
        words_dict(dict): The word dictionary (fetched from Google 1-gram, with the occurrence counts)
        bigram_dict(dict): The bigram index
        word(str): The query word
        n(int): Number of suggestion words to return
    Outputs:
        list: A list of suggestion words
    """
    bigram_list = bigram_search(bigram_dict, word)
    top_words = PriorityQueue()
    for bw in bigram_list:
        cost = edit_distance(word, bw, insert_cost, delete_cost, change_cost, swap_cost)
        score = 1 / (cost + 0.1) * (log(words_dict[bw].volume_count) + log(words_dict[bw].match_count))
        top_words.put((-score, bw))
    retval = list()
    cnt = 0
    while not top_words.empty() and cnt < n:
        retval.append(top_words.get()[1])
        cnt += 1
    return retval

suggestion = word_suggestion(words, bigram, 'kewnel', 3)
print suggestion

[u'kernel', u'kennel', u'kernels']
