# Context-sensitive Spelling Correction

The goal of the assignment is to implement context-sensitive spelling correction. The input of the code will be a set of text lines and the output will be the same lines with spelling mistakes fixed.

Submit the solution of the assignment to Moodle as a link to your GitHub repository containing this notebook.

Useful links:
- [Norvig's solution](https://norvig.com/spell-correct.html)
- [Norvig's dataset](https://norvig.com/big.txt)
- [Ngrams data](https://www.ngrams.info/download_coca.asp)

Grading:
- 60 points - Implement spelling correction
- 20 points - Justify your decisions
- 20 points - Evaluate on a test set


## Implement context-sensitive spelling correction

Your task is to implement a context-sensitive spelling corrector using an N-gram language model. The idea is to compute the conditional probabilities of the possible corrections given the surrounding words. For example, the phrase "dking sport" should be corrected to "doing sport", not "dying sport", while "dking species" should become "dying species".
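
The "dking" example can be made concrete with a minimal sketch. The bigram counts below are invented purely for illustration, and `conditional` and `best_candidate` are hypothetical helper names, not part of the assignment template.

```python
from collections import Counter

# Toy bigram counts, invented for illustration only.
bigram_counts = Counter({
    ("doing", "sport"): 50,
    ("dying", "sport"): 1,
    ("doing", "species"): 1,
    ("dying", "species"): 30,
})


def conditional(next_word: str, word: str) -> float:
    """Estimate P(next_word | word) from the toy bigram counts."""
    total = sum(n for (first, _), n in bigram_counts.items() if first == word)
    return bigram_counts[(word, next_word)] / total if total else 0.0


def best_candidate(candidates: list[str], next_word: str) -> str:
    """Pick the candidate under which `next_word` is most probable."""
    return max(candidates, key=lambda c: conditional(next_word, c))


print(best_candidate(["doing", "dying"], "sport"))    # doing
print(best_candidate(["doing", "dying"], "species"))  # dying
```

Both candidates are one edit away from "dking", so only the context separates them.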

The best way to start is to analyze [Norvig's solution](https://norvig.com/spell-correct.html) and [N-gram Language Models](https://web.stanford.edu/~jurafsky/slp3/3.pdf).

You may also want to implement:
- spell-checking for a concrete language - Russian, Tatar, etc. - any one you know, such that the solution accounts for language specifics,
- some recent (or not very recent) paper on this topic,
- solution which takes into account keyboard layout and associated misspellings,
- efficiency improvement to make the solution faster,
- any other idea of yours to improve Norvig's solution.
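
As one possible starting point for the keyboard-layout idea, the sketch below assigns a lower substitution cost to keys that sit next to each other on a QWERTY keyboard. It is a simplification under stated assumptions: the rows are treated as an unstaggered grid, only lowercase letters are covered, and the names `are_neighbours` and `substitution_cost` as well as the cost values 0.5 and 1.0 are hypothetical choices.

```python
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

# Map each key to its (row, column) position on an idealized grid.
POSITIONS = {
    ch: (r, c)
    for r, row in enumerate(QWERTY_ROWS)
    for c, ch in enumerate(row)
}


def are_neighbours(a: str, b: str) -> bool:
    """True if two distinct keys are at most one row and one column apart."""
    if a == b or a not in POSITIONS or b not in POSITIONS:
        return False
    (r1, c1), (r2, c2) = POSITIONS[a], POSITIONS[b]
    return abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1


def substitution_cost(a: str, b: str, near: float = 0.5, far: float = 1.0) -> float:
    """Cost of typing `a` instead of `b`; adjacent keys are cheaper."""
    if a == b:
        return 0.0
    return near if are_neighbours(a, b) else far


print(substitution_cost("k", "o"))  # 0.5, the keys are adjacent
print(substitution_cost("k", "y"))  # 1.0, the keys are far apart
```

Plugged into an edit-distance computation, such a cost would make "doing" a slightly more likely source of "dking" than "dying" even before the context is considered.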

IMPORTANT:  
Your project should not be a mere code copy-paste from somewhere. You must provide:
- Your implementation
- Analysis of why the implemented approach is suggested
- Improvements of the original approach that you have chosen to implement

Checklist (TODO: delete before submission):
- [x] Implement unigram model
- [x] Implement bigram model
- [ ] Implement trigram model
- [ ] Implement quigram model
- [x] Implement Levenshtein error model
- [ ] Implement QWERTY-aware error model
- [ ] Construct a test set

In [1]:
from abc import ABC, abstractmethod
from collections import defaultdict

import tqdm.notebook as tqdm

In [2]:
class LanguageModel(ABC):
    """Abstract context-aware language model.
    """

    @abstractmethod
    def __call__(self, candidates: list[str], pretext: list[str],
                 posttext: list[str]) -> dict[str, float]:
        """Predict the probability of the word given the context.
        
        Args:
            candidates (list[str]): Set of candidate replacements.
            pretext (list[str]): Context preceding the word.
            posttext (list[str]): Context after the word.
        
        Returns:
            dict[str, float]: Mapping from candidate set to the probability of
                the candidate in the context.
        """
        pass

In [3]:
class ErrorModel(ABC):
    """Abstract model for typo probabilities.
    """
    @abstractmethod
    def __call__(self, word: str, candidate: str) -> float:
        """Returns the probability of word being a misspelled candidate word.
        
        Args:
            word (str): Word that is typed.
            candidate (str): Misspelling correction candidate.

        Returns:
            float: Probability of word being a misspelled candidate word.
        """
        pass

In [4]:
class Corrector:
    """Context-aware text corrector.

    Attributes:
        candidates (set[str]): Set of candidate words to choose from.
        language_model (LanguageModel): Context-aware language model to
            predict probabilities of words occurring in a given context.
        error_model (ErrorModel): Model for predicting the probability
            of misspelling one word as another.
        context_size (int): Size of the context given to the language
            model. Does not include the word itself. Counted individually
            for each direction (in the text "a b c d e", context of "c" is
            ("b", "d"), given `context_size` = 1.) Defaults to 10.
    """

    def __init__(
        self,
        candidates: set[str],
        language_model: LanguageModel,
        error_model: ErrorModel,
        context_size: int = 10,
    ):
        """Context-aware text corrector.

        Args:
            candidates (set[str]): Set of candidate words to choose from.
            language_model (LanguageModel): Context-aware language model to
                predict probabilities of words occurring in a given context.
            error_model (ErrorModel): Model for predicting the probability
                of misspelling one word as another.
            context_size (int): Size of the context given to the language
                model. Does not include the word itself. Counted individually
                for each direction (in the text "a b c d e", context of "c" is
                ("b", "d"), given `context_size` = 1.) Defaults to 10.
        """
        self.candidates = candidates
        self.language_model = language_model
        self.error_model = error_model
        self.context_size = context_size

    def __call__(self,
                 text: list[str],
                 verbose: bool = False) -> list[tuple[str, float, float]]:
        """Returns the most likely candidate, probability of the candidate,
        and the probability of the original word occurring in a 2 * `context_size`
        context window.

        Args:
            text (list[str]): Text to correct. 
            verbose (bool): Whether to display the progress bar or not. Defaults
                to `False`.
        
        Returns:
            list[tuple[str, float, float]]: The candidate word, probability of
                candidate, and probability of the original word corresponding
                to each word in the original text.
        """
        result = [None] * len(text)
        for i, word in tqdm.tqdm(enumerate(text),
                                 total=len(text),
                                 disable=not verbose):
            pretext = text[max(0, i - self.context_size):i]
            posttext = text[i + 1:i + self.context_size + 1]
            candidates = self.candidates - {word}
            candidate_probs = self.language_model(candidates, pretext,
                                                  posttext)
            word_prob = self.language_model({word}, pretext, posttext)[word]

            def get_prob(candidate):
                return self.error_model(word,
                                        candidate) * candidate_probs[candidate]

            correction = max(candidates, key=get_prob)
            probability = get_prob(correction)
            result[i] = (correction, probability, word_prob)
        return result


In [None]:
class NGramLanguageModel(LanguageModel):
    """N-gram language model.
    """

    def __init__(self, text: list[str], n: int = 3, verbose: bool = False):
        text = text.copy()
        self.total = len(text)
        self.n = n
        # Each direction visits every k-gram once for k = 1..n.
        steps = sum(max(0, len(text) - k + 1) for k in range(1, n + 1))
        pbar = tqdm.tqdm(total=2 * steps, disable=not verbose)
        self.prefix = self._count(text, n, pbar)
        # Counting the reversed text lets the model condition on the words
        # that follow a candidate the same way as on the preceding ones.
        text.reverse()
        self.postfix = self._count(text, n, pbar)
        pbar.close()

    @staticmethod
    def _count(text: list[str], n: int, pbar) -> dict[tuple, int]:
        """Count all k-grams of `text` for k = 1..n, keyed by tuple."""
        counts = defaultdict(lambda: 0)
        for k in range(1, n + 1):
            for i in range(len(text) - k + 1):
                # Lists are unhashable, so each slice is converted to a tuple.
                counts[tuple(text[i:i + k])] += 1
                pbar.update()
        counts[tuple()] = len(text)
        return counts

    def __call__(self, candidates: list[str], pretext: list[str],
                 posttext: list[str]) -> dict[str, float]:
        """Predict the probability of the word using n-gram counts.

        Args:
            candidates (list[str]): Set of candidate replacements.
            pretext (list[str]): Context preceding the word.
            posttext (list[str]): Context after the word.
        
        Returns:
            dict[str, float]: Mapping from candidate set to the probability of
                the candidate in the context.
        """
        # An n-gram holds n - 1 context words plus the candidate itself.
        context = self.n - 1
        pretext = tuple(pretext[-context:]) if context else tuple()
        posttext = tuple(posttext[:context][::-1])
        result = {}
        for candidate in candidates:
            # `.get` avoids inserting unseen keys into the defaultdicts.
            prev_prob = (self.prefix.get(pretext + (candidate, ), 0) + 1) / (self.prefix.get(pretext, 0) + 2)
            next_prob = (self.postfix.get(posttext + (candidate, ), 0) + 1) / (self.postfix.get(posttext, 0) + 2)
            result[candidate] = prev_prob * next_prob
        return result
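
Written out, the score that `NGramLanguageModel.__call__` assigns to a candidate $c$ is the product of two smoothed ratios. The notation is introduced here for illustration only: $h$ is the tuple of preceding context words, $f$ is the tuple of following context words in reversed order, $C_{\rightarrow}$ are the counts stored in `self.prefix`, and $C_{\leftarrow}$ are the counts stored in `self.postfix`.

$$
\text{score}(c \mid h, f) = \frac{C_{\rightarrow}(h, c) + 1}{C_{\rightarrow}(h) + 2} \cdot \frac{C_{\leftarrow}(f, c) + 1}{C_{\leftarrow}(f) + 2}
$$

Because the denominators add 2 rather than the vocabulary size, these values are smoothed scores rather than probabilities that sum to one over all candidates. This does not affect the ranking of candidates within the same context.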

In [5]:
class FrequenciesLanguageModel(LanguageModel):
    """Language model that predicts the probability of the word based on its
    frequency. Does not take the context into account.

    Attributes:
        total (int): Total number of words in the training set.
        probabilities (dict[str, float]): Mapping from the word to its
            probability in the training set.
    """

    def __init__(self, frequencies: dict[str, int]):
        """Language model that predicts the probability of the word based on its
        frequency. Does not take the context into account.

        Args:
            frequencies (dict[str, int]): Mapping from the word to its frequency
                in the training set.
        """
        self.total = sum(frequencies.values())
        self.probabilities = {
            word: count / self.total
            for word, count in frequencies.items()
        }

    def __call__(self, candidates: list[str], pretext: list[str],
                 posttext: list[str]) -> dict[str, float]:
        """Predict the probability of the word.
        
        Args:
            candidates (list[str]): Set of candidate replacements.
            pretext (list[str]): Context preceding the word. Ignored.
            posttext (list[str]): Context after the word. Ignored.
        
        Returns:
            dict[str, float]: Mapping from candidate set to the probability of
                the candidate in the context.
        """
        return {
            word: self.probabilities.get(word, 1 / (self.total + 2))
            for word in candidates
        }

In [6]:
class BigramsLanguageModel(LanguageModel):
    """Bigram language model.

    Attributes:
        totals (dict[str, int]): Count of bigrams starting with each word in the
            training set.
        probabilities (dict[str, float]): Mapping from bigrams to their
            probabilities in the training set.
        total (int): Total number of bigrams in the training set.
    """

    def __init__(self, bigrams: dict[tuple[str, str], int]):
        """Bigram language model.
        
        Args:
            bigrams (dict[tuple[str, str], int]): Mapping from bigrams to their
                counts in the training set.
        """
        self.totals = defaultdict(lambda: 0)
        for (word, _), count in bigrams.items():
            self.totals[word] += count
        self.probabilities = {
            (a, b): count / self.totals[a]
            for (a, b), count in bigrams.items()
        }
        self.total = sum(self.totals.values())

    def __call__(self, candidates: list[str], pretext: list[str],
                 posttext: list[str]) -> dict[str, float]:
        """Predict the probability of the word using bigram data.

        Args:
            candidates (list[str]): Set of candidate replacements.
            pretext (list[str]): Context preceding the word.
            posttext (list[str]): Context after the word.
        
        Returns:
            dict[str, float]: Mapping from candidate set to the probability of
                the candidate in the context.
        """
        prev = pretext[-1] if pretext else None
        # Named `following` to avoid shadowing the built-in `next`.
        following = posttext[0] if posttext else None
        result = {}
        for candidate in candidates:
            if prev is None:
                prev_prob = 1
            elif (prev, candidate) not in self.probabilities:
                prev_prob = 1 / (self.totals.get(prev, 0) + 2)
            else:
                prev_prob = self.probabilities[(prev, candidate)]
            if following is None:
                next_prob = 1
            elif (candidate, following) not in self.probabilities:
                # P(following | candidate) is normalized by the number of
                # bigrams starting with `candidate`, so the fallback is too.
                next_prob = 1 / (self.totals.get(candidate, 0) + 2)
            else:
                next_prob = self.probabilities[(candidate, following)]
            result[candidate] = prev_prob * next_prob
        return result

In [7]:
class LevensteinErrorModel(ErrorModel):
    """Error model based on Levenshtein distance.
    """

    def __call__(self, word: str, candidate: str, base: float = 2) -> float:
        """Calculate the probability of the word being a misspelling of the
        candidate.

        Args:
            word (str): Word that is typed.
            candidate (str): Candidate word.
            base (float): Base of the exponential penalty; the result is
                `base ** -distance`. Defaults to 2.
        
        Returns:
            float: Probability of word being a misspelling of candidate.
        """
        dp = [[0] * (len(candidate) + 1) for _ in range(len(word) + 1)]
        for i in range(len(candidate)):
            dp[0][i + 1] = i + 1
        for i in range(len(word)):
            dp[i + 1][0] = i + 1
        for i in range(1, len(word) + 1):
            for j in range(1, len(candidate) + 1):
                # Substitution costs 1, or 0 when the characters match.
                substitution = dp[i - 1][j - 1] + (word[i - 1] != candidate[j - 1])
                # Take the cheapest of deletion, insertion, and substitution.
                dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, substitution)
        return pow(base, -dp[-1][-1])
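
As a standalone sanity check of the idea, the sketch below computes the standard Levenshtein distance (insertions, deletions, and substitutions each cost 1) and turns it into a score of the form `base ** -distance`. The function names `levenshtein` and `typo_probability` are hypothetical and exist only for this illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard Levenshtein distance using two rolling rows."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete a character of `a`
                current[j - 1] + 1,            # insert a character of `b`
                previous[j - 1] + (ca != cb),  # substitute, free on a match
            ))
        previous = current
    return previous[-1]


def typo_probability(word: str, candidate: str, base: float = 2.0) -> float:
    """Exponentially decaying score in the edit distance."""
    return base ** -levenshtein(word, candidate)


print(typo_probability("dking", "doing"))     # 0.5
print(typo_probability("dking", "dying"))     # 0.5
print(typo_probability("kitten", "sitting"))  # 0.125
```

Both "doing" and "dying" receive the same score for "dking", which is why the language model is needed to break the tie.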

In [8]:
with open('bigrams.txt', encoding='iso-8859-1') as file:
    unigrams = defaultdict(lambda: 0) 
    bigrams = defaultdict(lambda: 0) 
    for line in tqdm.tqdm(file):
        count, a, b = line.split()
        count = int(count)
        unigrams[a] += count
        bigrams[(a, b)] += count

In [9]:
unigram_lm = FrequenciesLanguageModel(unigrams)

In [10]:
bigram_lm = BigramsLanguageModel(bigrams)

In [11]:
levenstein_error = LevensteinErrorModel()

In [17]:
candidates = set(sorted(unigrams.keys(), key=unigrams.__getitem__, reverse=True)[:1000])

In [18]:
corrector = Corrector(candidates, bigram_lm, levenstein_error, context_size=1)

## Justify your decisions

Write down justifications for your implementation choices. For example, these choices could be:
- Which ngram dataset to use
- Which weights to assign to edit1 and edit2 corrections and to the probabilities of absent words
- Beam search parameters
- etc.

*Your text here...*

## Evaluate on a test set

Your task is to generate a test set and evaluate your work. You may vary the noise probability to generate different datasets with varying complexity. Compare your solution to Norvig's corrector and report the accuracies.
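
One possible way to build such a test set is sketched below. It assumes lowercase ASCII words and introduces only single-character substitutions; `add_noise` and `accuracy` are hypothetical helper names, and a fuller noise model would also cover insertions, deletions, and transpositions.

```python
import random
import string


def add_noise(words: list[str], noise_prob: float,
              rng: random.Random) -> list[str]:
    """Replace one random character in each word with probability `noise_prob`."""
    noisy = []
    for word in words:
        if word and rng.random() < noise_prob:
            i = rng.randrange(len(word))
            # Exclude the original character so the word really changes.
            options = [c for c in string.ascii_lowercase if c != word[i]]
            word = word[:i] + rng.choice(options) + word[i + 1:]
        noisy.append(word)
    return noisy


def accuracy(predicted: list[str], expected: list[str]) -> float:
    """Fraction of positions where the prediction equals the clean word."""
    if not expected:
        return 0.0
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)


rng = random.Random(0)  # fixed seed for reproducible test sets
clean = "he is doing sport every day".split()
noisy = add_noise(clean, noise_prob=0.3, rng=rng)
print(noisy, accuracy(noisy, clean))
```

Running the corrector on `noisy` and comparing its output with `clean` through `accuracy` gives a word-level score that can be reported for several values of `noise_prob`.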

In [16]:
# Your code here