# Context-sensitive Spelling Correction

The goal of the assignment is to implement context-sensitive spelling correction. The input of the code will be a set of text lines and the output will be the same lines with spelling mistakes fixed.

Submit the solution of the assignment to Moodle as a link to your GitHub repository containing this notebook.

Useful links:
- [Norvig's solution](https://norvig.com/spell-correct.html)
- [Norvig's dataset](https://norvig.com/big.txt)
- [Ngrams data](https://www.ngrams.info/download_coca.asp)

Grading:
- 60 points - Implement spelling correction
- 20 points - Justify your decisions
- 20 points - Evaluate on a test set


## Implement context-sensitive spelling correction

Your task is to implement context-sensitive spelling corrector using N-gram language model. The idea is to compute conditional probabilities of possible correction options. For example, the phrase "dking sport" should be fixed as "doing sport" not "dying sport", while "dking species" -- as "dying species".

The best way to start is to analyze [Norvig's solution](https://norvig.com/spell-correct.html) and [N-gram Language Models](https://web.stanford.edu/~jurafsky/slp3/3.pdf).

You may also want to implement:
- spell-checking for a concrete language - Russian, Tatar, etc. - any one you know, such that the solution accounts for language specifics,
- some recent (or not very recent) paper on this topic,
- solution which takes into account keyboard layout and associated misspellings,
- efficiency improvement to make the solution faster,
- any other idea of yours to improve the Norvig’s solution.

IMPORTANT:  
Your project should not be a mere code copy-paste from somewhere. You must provide:
- Your implementation
- Analysis of why the implemented approach is suggested
- Improvements of the original approach that you have chosen to implement

Checklist (TODO: delete before submission):
- [x] Implement unigram model
- [x] Implement n-gram model
- [ ] Implement fasttext model
- [x] Implement Levenstein error model
- [ ] Implement QWERTY-aware error model
- [ ] Construct a test set

In [3]:
from abc import ABC, abstractmethod
from collections import defaultdict

import tqdm.notebook as tqdm
import numpy as np

In [4]:
class LanguageModel(ABC):
    """Abstract context-aware language model.
    """

    @abstractmethod
    def __call__(self, candidates: list[str], pretext: list[str],
                 posttext: list[str]) -> dict[str, float]:
        """Predict the probability of the word given the context.
        
        Args:
            candidates (list[str]): Set of candidate replacements.
            pretext (list[str]): Context preceeding the word.
            posttext (list[str]): Context after the word.
        
        Returns:
            dict[str, float]: Mapping from candidate set to the probability of
                the candidate in the context.
        """
        pass

In [5]:
class ErrorModel(ABC):
    """Abstract model for typo probabilities.
    """
    @abstractmethod
    def __call__(self, word: str, candidate: str) -> float:
        """Returns the probability of word being a misspelled candidate word.
        
        Args:
            word (str): Word that is typed.
            candidate (str): Misspeling correction candidate.

        Returns:
            float: Probability of word being a misspelled candidate word.
        """
        pass

In [6]:
class Corrector:
    """Context-aware text corrector.

    Attributes:
        candidates (set[str]): Set of candidate words to choose from.
        language_model (LanguageModel): Context-aware language model to
            predict probabilities of words occuring in a given context.
        error_model (ErrorModel): Model for predicting the probability
            of mispelling one word as another.
        context_size (int): Size of the context given to the language
            model. Does not include the word itselt. Counted individually
            for each direction (in the text "a b c d e", context of "c" is
            ("b", "d"), given `context_size` = 1.) Defaults to 10.
    """

    def __init__(
        self,
        candidates: set[str],
        language_model: LanguageModel,
        error_model: ErrorModel,
        context_size: int = 10,
    ):
        """Context-aware text corrector.

        Args:
            candidates (set[str]): Set of candidate words to choose from.
            language_model (LanguageModel): Context-aware language model to
                predict probabilities of words occuring in a given context.
            error_model (ErrorModel): Model for predicting the probability
                of mispelling one word as another.
            context_size (int): Size of the context given to the language
                model. Does not include the word itselt. Counted individually
                for each direction (in the text "a b c d e", context of "c" is
                ("b", "d"), given `context_size` = 1.) Defaults to 10.
        """
        self.candidates = candidates
        self.language_model = language_model
        self.error_model = error_model
        self.context_size = context_size

    def __call__(self,
                 text: list[str],
                 verbose: bool = False) -> list[tuple[str, float, float]]:
        """Returns the most likely candidate, probability of the candidate,
        and the probability of the original word occuring in a 2 *`context_size`
        context window.

        Args:
            text (list[str]): Text to correct. 
            verbose (bool): Whether to display the progress bar or not. Defaults
                to `False`.
        
        Returns:
            list[tuple[str, float, float]]: The candidate word, probability of
                candidate, and probability of the original word corresponding
                to each word in the original text.
        """
        result = [None] * len(text)
        for i, word in tqdm.tqdm(enumerate(text),
                                 total=len(text),
                                 disable=not verbose):
            pretext = text[max(0, i - self.context_size):i]
            posttext = text[i + 1:i + self.context_size + 1]
            candidates = self.candidates.copy() - {word}
            candidate_probs = self.language_model(candidates, pretext,
                                                  posttext)
            word_prob = self.language_model({word}, pretext, posttext)[word]

            def get_prob(candidate):
                return self.error_model(word,
                                        candidate) * candidate_probs[candidate]

            correction = max(candidates, key=get_prob)
            probability = get_prob(correction)
            result[i] = (correction, probability, word_prob)
        return result


In [7]:
class NGramLanguageModel(LanguageModel):
    # TODO docs
    """N-gram language model.
    """

    def __init__(self, text: list[str], n: int = 3, verbose: bool = False):
        text = text.copy()
        self.total = len(text)
        self.prefix = defaultdict(lambda: 0)
        pbar = tqdm.tqdm(total=(2 * n * (len(text) + 1) - n * (n + 1) // 2),
                         disable=not verbose)
        for k in range(1, n + 1):
            for i in range(len(text) - k + 1):
                gram = tuple(text[i:i+k])
                self.prefix[gram] += 1
                pbar.update()
        text.reverse()
        self.postfix = defaultdict(lambda: 0)
        for k in range(1, n + 1):
            for i in range(len(text) - k):
                gram = tuple(text[i:i+k])
                self.postfix[gram] += 1
                pbar.update()
        self.postfix[tuple()] = self.prefix[tuple()] = self.total
        self.n = n
        pbar.close()

    @staticmethod
    def from_counts(counts: dict[tuple[str], int], n: int = 3):
        model = NGramLanguageModel([], n=n, verbose=False)
        model.total = sum(counts.values())
        model.prefix = defaultdict(lambda: 0)
        model.postfix = defaultdict(lambda: 0)
        for gram, count in counts.items():
            for k in range(1, n + 1):
                if len(gram) < k:
                    continue
                kgram = gram[:k]
                model.prefix[kgram] += count
                kgram = kgram[::-1]
                model.postfix[kgram] += count
        model.postfix[tuple()] = model.prefix[tuple()] = model.total
        return model



    def __call__(self, candidates: list[str], pretext: list[str],
                 posttext: list[str]) -> dict[str, float]:
        """Predict the probability of the word using bigram data.

        Args:
            candidates (list[str]): Set of candidate replacements.
            pretext (list[str]): Context preceeding the word.
            posttext (list[str]): Context after the word.
        
        Returns:
            dict[str, float]: Mapping from candidate set to the probability of
                the candidate in the context.
        """
        pretext = tuple(pretext[-self.n+1:])
        posttext = tuple(posttext[:self.n - 1][::-1])
        result = {}
        for candidate in candidates:
            prev_prob = (self.prefix[pretext + (candidate, )] +
                         1) / (self.prefix[pretext] + 2)
            next_prob = (self.postfix[posttext + (candidate, )] +
                         1) / (self.postfix[posttext] + 2)
            result[candidate] = prev_prob * next_prob
        return result

In [8]:
class LevensteinErrorModel(ErrorModel):
    """Error model based on Levenstein distance.
    """
    def __init__(self, base: float = 4):
        self.base = base

    def __call__(self, word: str, candidate: str) -> float:
        """Calculate the probability of the word being a mispelling of the
        candidate.

        Args:
            word (str): Word that is typed.
            candidate (str): Candidate word.
        
        Returns:
            float: Probability of word being a mispelling of candidate.
        """
        word = word.lower()
        candidate = candidate.lower()
        dp = [[0] * (len(candidate) + 1) for _ in range(len(word) + 1)]
        for i in range(len(candidate)):
            dp[0][i + 1] = i + 1
        for i in range(len(word)):
            dp[i + 1][0] = i + 1
        for i in range(1, len(word) + 1):
            for j in range(1, len(candidate) + 1):
                dp[i][j] = min(dp[i - 1][j], dp[i][j - 1]) + 1
                if word[i - 1] == candidate[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1]
        return pow(self.base, -dp[-1][-1])

In [9]:
fivegrams = dict()
with open('fivegrams.txt') as file:
    for line in file:
        count, *gram = line.split()
        gram = tuple(gram)
        count = int(count)
        fivegrams[gram] = count
bigrams = dict()
with open('bigrams.txt', encoding='iso-8859-1') as file:
    for line in file:
        count, *gram = line.split()
        gram = tuple(gram)
        count = int(count)
        bigrams[gram] = count

In [12]:
fivegram_language_model = NGramLanguageModel.from_counts(fivegrams, n=5)

In [13]:
bigram_language_model = NGramLanguageModel.from_counts(bigrams, n=2)

In [14]:
error_model = LevensteinErrorModel(base=8)

In [36]:
bigram_language_model.postfix[('with','rabbit')]

67

In [45]:
bigram_language_model.postfix[('rabbit',)]

1152

In [46]:
candidates = set(sorted({gram[0]
                     for gram in bigrams.keys()},
                    key=lambda w: bigram_language_model.prefix[(w, )],
                    reverse=True)[:1000])

In [47]:
corrector = Corrector(candidates, bigram_language_model, error_model, context_size=1)

In [48]:
text = '''
Alice was beginning to get vry tired of sitting by hr sister on th bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, “and what is the use of a book,” thought Alice “without pictures or conversations?”

So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.

There was nothing so very remarkable in that; nor did Alice thnk it so very much out of the way to hear the Rabbit say to itself, “Oh dear! Oh dear! I shall be late!” (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually took a watch out of its waistcoat-pocket, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge.

In another moment down went Alice after it, never once considering how in the world she was to get out again. 
'''.lower().split()[:]

In [49]:
corrected = corrector(text, verbose=True)

  0%|          | 0/274 [00:00<?, ?it/s]

In [50]:
corrected

[('i', 4.442490720283818e-06, 7.145980387068095e-10),
 ('is', 2.996321442392765e-05, 0.00392389553908069),
 ('going', 1.1695707087170082e-07, 1.086292883767947e-06),
 ('of', 2.5891017316550137e-05, 0.1240574102821677),
 ('be', 0.0004934602921534097, 0.009496700567203488),
 ('very', 2.53247544466677e-05, 1.7131751343100183e-10),
 ('third', 1.8299762065783533e-05, 0.00034300770725079297),
 ('to', 1.120010925405515e-05, 0.010341598020846431),
 ('it', 3.029829348176388e-08, 2.3232437940251553e-08),
 ('in', 3.218570635845926e-05, 0.0003587283081476167),
 ('her', 0.0007408599043334102, 5.574226885595463e-11),
 ('it', 1.191294019067089e-06, 1.69992194986266e-06),
 ('in', 0.006629862515302824, 0.01730078818140552),
 ('the', 0.04395141612525141, 1.490396686274937e-05),
 ('bank', 1.8090745056762649e-06, 2.1358612817901595e-09),
 ('end', 0.0002234544770762052, 0.0009831443364878168),
 ('to', 1.1065984420230272e-05, 0.00016981678504868257),
 ('that', 3.998081016133865e-08, 1.5866765424667643e-06),

In [51]:
words = [
    orig if 10 * op > cp else cor for orig, (cor, cp, op) in zip(text, corrected)
]

In [52]:
import textwrap

In [53]:
print(*textwrap.wrap(' '.join(words)), sep='\n')

i was beginning to get very tired of sitting by her sister on the bank
and of having nothing to do once or twice she had people into the book
her sister was reading but it had no pictures of conversations in the
and what is the use of a book thought i “without pictures or
conversations?” so she was considering in her own mind as well as she
could for the hot day and her feel very sleepy and to whether the
pleasure of making a day would be worth the trouble of getting up and
in the time when suddenly a white rabbit with his and and close by her
there was nothing so very remarkable in the nor did a and it is very
much out of the way to hear the i say to the “oh dear! oh and i shall
be a when she thought it over and it occurred to her that she ought to
have wondered at this but at the time it all seemed quite natural but
when the rabbit actually took a watch out of its waistcoat-pocket, and
looked at it and then hurried to i started to her feet for it flashed
across her mind that she had 

In [54]:
print(*textwrap.wrap(' '.join(text)), sep='\n')

alice was beginning to get vry tired of sitting by hr sister on th
bank, and of having nothing to do: once or twice she had peeped into
the book her sister was reading, but it had no pictures or
conversations in it, “and what is the use of a book,” thought alice
“without pictures or conversations?” so she was considering in her own
mind (as well as she could, for the hot day made her feel very sleepy
and stupid), whether the pleasure of making a daisy-chain would be
worth the trouble of getting up and picking the daisies, when suddenly
a white rabbit with pink eyes ran close by her. there was nothing so
very remarkable in that; nor did alice thnk it so very much out of the
way to hear the rabbit say to itself, “oh dear! oh dear! i shall be
late!” (when she thought it over afterwards, it occurred to her that
she ought to have wondered at this, but at the time it all seemed
quite natural); but when the rabbit actually took a watch out of its
waistcoat-pocket, and looked at it, and then h

## Justify your decisions

Write down justificaitons for your implementation choices. For example, these choices could be:
- Which ngram dataset to use
- Which weights to assign for edit1, edit2 or absent words probabilities
- Beam search parameters
- etc.

*Your text here...*

## Evaluate on a test set

Your task is to generate a test set and evaluate your work. You may vary the noise probability to generate different datasets with varying compexity. Compare your solution to the Norvig's corrector, and report the accuracies.

In [23]:
# Your code here