# Context-sensitive Spelling Correction

The goal of this assignment is to implement context-sensitive spelling correction. The input is a set of text lines, and the output is the same lines with spelling mistakes fixed.

Submit the solution of the assignment to Moodle as a link to your GitHub repository containing this notebook.

Useful links:
- [Norvig's solution](https://norvig.com/spell-correct.html)
- [Norvig's dataset](https://norvig.com/big.txt)
- [Ngrams data](https://www.ngrams.info/download_coca.asp)

Grading:
- 60 points - Implement spelling correction
- 20 points - Justify your decisions
- 20 points - Evaluate on a test set


## Implement context-sensitive spelling correction

Your task is to implement a context-sensitive spelling corrector using an N-gram language model. The idea is to compute the conditional probabilities of the candidate corrections given the surrounding words. For example, the phrase "dking sport" should be corrected to "doing sport", not "dying sport", while "dking species" should be corrected to "dying species".
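
As a minimal sketch of this idea (not a reference solution), the snippet below ranks candidate corrections by an add-k smoothed estimate of P(next word | candidate). The counts in `toy_unigram` and `toy_bigram`, the function names, the smoothing constant `k`, and `vocab_size` are all made up for illustration; a real solution would take its counts from an N-gram dataset.

```python
# Toy counts, invented purely for illustration.
toy_unigram = {"doing": 1000, "dying": 300}
toy_bigram = {
    ("doing", "sport"): 40,
    ("dying", "sport"): 1,
    ("doing", "species"): 1,
    ("dying", "species"): 25,
}

def cond_prob(first, second, k=0.01, vocab_size=10_000):
    """Add-k smoothed estimate of P(second | first)."""
    pair = toy_bigram.get((first, second), 0)
    return (pair + k) / (toy_unigram.get(first, 0) + k * vocab_size)

def best_candidate(candidates, next_word):
    """Pick the candidate under which the following word is most likely."""
    return max(candidates, key=lambda c: cond_prob(c, next_word))

print(best_candidate(["doing", "dying"], "sport"))
print(best_candidate(["doing", "dying"], "species"))
```

This sketch looks only at the following word and ignores both the prior probability of the candidate and the likelihood of the typo itself; a fuller corrector would probably combine all three.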

The best way to start is to analyze [Norvig's solution](https://norvig.com/spell-correct.html) and [N-gram Language Models](https://web.stanford.edu/~jurafsky/slp3/3.pdf).

You may also want to implement:
- spell-checking for a concrete language - Russian, Tatar, etc. - any one you know, such that the solution accounts for language specifics,
- some recent (or not very recent) paper on this topic,
- solution which takes into account keyboard layout and associated misspellings,
- efficiency improvement to make the solution faster,
- any other idea of yours to improve Norvig's solution.

IMPORTANT:  
Your project should not be a mere code copy-paste from somewhere. You must provide:
- Your implementation
- An analysis of why the implemented approach is appropriate
- The improvements to the original approach that you chose to implement

In [45]:
import re
from collections import Counter
import pandas as pd

In [86]:
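WORDS = set()  # vocabulary: every word that appears in some bigram

# Read bigram counts into a nested dict: bigram[w1][w2] = count.
# The file is expected to hold whitespace-separated "count w1 w2" records.
bigram = dict()
with open('bigrams.txt', mode='rb') as f:
  # Decoding as latin-1 is an assumption; it maps every byte to a
  # character, so it never raises on non-ASCII bytes.
  w = f.read().decode('latin-1').split()
  for i in range(0, len(w) - 2, 3):
    count, first, second = int(w[i]), w[i+1], w[i+2]
    bigram.setdefault(first, dict())[second] = count
    WORDS.add(first)
    WORDS.add(second)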


In [88]:
def known(words):
    # subset of `words` that appear in the WORDS dictionary
    return set(w for w in words if w in WORDS)

In [None]:
def words(text): return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('big.txt').read()))

def P(word, N=sum(WORDS.values())):
    "Probability of `word`."
    return WORDS[word] / N

def correction(word):
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word):
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words):
    # subset of `words` that appear in the WORDS dictionary
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

In [108]:
# function for fixing misclicks
neighbor = {'a': ['q', 's', 'w', 'z'],
            'b': ['g', 'h', 'n', 'v'],
            'c': ['d', 'f', 'v', 'x'],
            'd': ['c', 'e', 'f', 'r', 's', 'x'],
            'e': ['d', 'r', 's', 'w'],
            'f': ['c', 'd', 'g', 'r', 't', 'v'],
            'g': ['b', 'f', 'h', 't', 'v', 'y'],
            'h': ['b', 'g', 'j', 'n', 'u', 'y'],
            'i': ['j', 'k', 'o', 'u'],
            'j': ['h', 'i', 'k', 'm', 'n', 'u'],
            'k': ['i', 'j', 'l', 'm', 'o'],
            'l': ['k', 'o', 'p'],
            'm': ['j', 'k', 'n'],
            'n': ['b', 'h', 'j', 'm'],
            'o': ['i', 'k', 'l', 'p'],
            'p': ['l', 'o'],
            'q': ['a', 'w'],
            'r': ['d', 'e', 'f', 't'],
            's': ['a', 'd', 'e', 'w', 'x', 'z'],
            't': ['f', 'g', 'r', 'y'],
            'u': ['h', 'i', 'j', 'y'],
            'v': ['b', 'c', 'f', 'g'],
            'w': ['a', 'e', 'q', 's'],
            'x': ['c', 'd', 's', 'z'],
            'y': ['g', 'h', 't', 'u'],
            'z': ['a', 's', 'x'],
            '1': ['q'],
            '2': ['q', 'w'],
            '3': ['e', 'w'],
            '4': ['e', 'r'],
            '5': ['r', 't'],
            '6': ['t', 'y'],
            '7': ['u', 'y'],
            '8': ['i', 'u'],
            '9': ['i', 'o'],
            '0': ['o', 'p'],
            '-': ['p'],
            '.': ['l'],
            ',': ['l', 'm'],
            '[': ['p'],
            ';': ['l', 'p'],
}
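def edit_mis(word):
  # known words obtained by replacing one character with a keyboard neighbor
  splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
  # .get() avoids a KeyError for characters that are missing from `neighbor`
  replaces = [L + c + R[1:] for L, R in splits if R for c in neighbor.get(R[0], [])]
  return known(replaces)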

In [None]:
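def correct_sentence(sentence):
  # Disabled preprocessing step (kept for reference): drop every
  # character that is not a letter or a space before splitting.
  words = sentence.lower().split()
  k = known(words)
  if all(word in k for word in words):
    return words  # every word is in the dictionary, nothing to fix
  corrected = []
  for word in words:
    if word in k:
      corrected.append(word)
      continue
    misses = edit_mis(word)  # candidates if there was a misclick
    # Assumed completion: the original cell stops after computing `misses`.
    # Prefer the most frequent misclick candidate, otherwise fall back to
    # Norvig's context-free `correction`.
    corrected.append(max(misses, key=P) if misses else correction(word))
  return corrected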

## Justify your decisions

Write down justifications for your implementation choices. For example, these choices could be:
- Which ngram dataset to use
- Which weights to assign to edit1, edit2, or absent-word probabilities
- Beam search parameters
- etc.
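
One way to make the weight choice concrete is to score each candidate as the log of its language-model probability plus a log-penalty that depends on how the candidate was produced. The sketch below is only an illustration under assumptions: the `EDIT_LOG_PENALTY` table, its values, the `floor` for absent words, and the `score` and `pick` functions are hypothetical and would need tuning on held-out data.

```python
import math

# Hypothetical log-penalties per candidate source; the values are guesses.
EDIT_LOG_PENALTY = {
    "known": 0.0,                  # the word itself is in the dictionary
    "misclick": math.log(0.5),     # keyboard-neighbor substitution
    "edit1": math.log(0.1),        # any other single edit
    "edit2": math.log(0.001),      # two edits
}

def score(word_prob, source, floor=1e-9):
    """Log-score combining a language-model probability with an edit penalty.

    `floor` stands in for the probability of words absent from the counts.
    """
    return math.log(max(word_prob, floor)) + EDIT_LOG_PENALTY[source]

def pick(candidates):
    """candidates: iterable of (word, probability, source) triples."""
    return max(candidates, key=lambda c: score(c[1], c[2]))[0]

print(pick([("doing", 1e-4, "misclick"), ("diking", 1e-4, "edit1")]))
```

Working in log space keeps the products of small probabilities numerically stable and makes each weight easy to report and justify separately.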


## Evaluate on a test set

Your task is to generate a test set and evaluate your work. You may vary the noise probability to generate datasets of varying complexity. Compare your solution to Norvig's corrector and report the accuracies.
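
A possible starting point, sketched under assumptions, is shown below: each word is corrupted with one random single-character edit with probability `noise_prob`, and accuracy is measured per word. The helper names (`add_noise`, `make_test_set`, `word_accuracy`) and the corrector interface (a function that takes a list of words and returns a list of words) are hypothetical, not part of the assignment.

```python
import random

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def add_noise(word, noise_prob, rng):
    """With probability `noise_prob`, apply one random single-character edit."""
    if len(word) < 2 or rng.random() >= noise_prob:
        return word
    i = rng.randrange(len(word))
    op = rng.choice(["delete", "replace", "insert", "transpose"])
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "replace":
        return word[:i] + rng.choice(LETTERS) + word[i + 1:]
    if op == "insert":
        return word[:i] + rng.choice(LETTERS) + word[i:]
    # transpose: swap character i with its right neighbor (left one at the end)
    j = i + 1 if i + 1 < len(word) else i - 1
    chars = list(word)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)

def make_test_set(sentences, noise_prob, seed=0):
    """Return one (noisy_words, clean_words) pair per sentence."""
    rng = random.Random(seed)
    pairs = []
    for sentence in sentences:
        clean = sentence.lower().split()
        noisy = [add_noise(w, noise_prob, rng) for w in clean]
        pairs.append((noisy, clean))
    return pairs

def word_accuracy(correct_fn, pairs):
    """Fraction of word positions where correct_fn's output equals the clean word."""
    hits = total = 0
    for noisy, clean in pairs:
        predicted = correct_fn(noisy)
        hits += sum(p == c for p, c in zip(predicted, clean))
        total += len(clean)
    return hits / total if total else 0.0
```

Under this interface, Norvig's corrector could be wrapped as `lambda ws: [correction(w) for w in ws]`, and both correctors could be scored on the same pairs for several values of `noise_prob`. Note that a random edit can occasionally leave a word unchanged or turn it into another valid word, so the effective noise rate may differ slightly from `noise_prob`.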

In [None]:
# Your code here