# ICE-10: Spelling Correction in Natural Language Processing



Please follow the tutorials below and complete the tasks on the webpages provided. The tutorials include code but may not include a dataset file; you can create dataset files to match each tutorial's requirements. The aim of this ICE is to produce executed versions of the tutorials. You can also use the authors' GitHub repositories if they are available. It is recommended to run each tutorial first and then test it with your own custom-made datasets. You can use any source on the internet to complete the tasks. For Task 4, you have to provide mini-examples that differentiate between non-word and real-word spelling corrections. After the code for Task 4, please provide a brief explanation of the difference and your analysis.

# Task 1: (20%)
### Use the following tutorial to implement spell checking using TextBlob

https://stackabuse.com/spelling-correction-in-python-with-textblob/

In [1]:
%load_ext lab_black

In [2]:
from textblob import TextBlob

with open(
    "Typoglycemia.txt", "r"
) as f:  # Opening the test file with the intention to read
    text = f.read()  # Reading the file
    textBlb = TextBlob(text)  # Making our first textblob
    textCorrected = textBlb.correct()  # Correcting the text
    print(textCorrected)

Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it doesn't matter in what order the letters in a word are, the only iprmoetnt thing is that the first and last later be at the right place. The set can be a total mess and you can still red it outfit problem. His is bcuseae the human mind does not red even letter by itself, but the word as a whole.


# Task 2: (20%)
### Train your model on custom dataset
### Instructions are provided in the above *tutorial*

In [3]:
from textblob.en import Spelling
import re

textToLower = ""

with open("bible.txt", "r") as f1:  # Open our source file
    text = f1.read()  # Read the file
    textToLower = text.lower()  # Lower all the capital letters

words = re.findall(
    "[a-z]+", textToLower
)  # Find all the words and place them into a list
oneString = " ".join(words)  # Join them into one string

pathToFile = "train.txt"  # The path we want to store our stats file at
spelling = Spelling(path=pathToFile)  # Connect the path to the Spelling object
spelling.train(oneString, pathToFile)

In [4]:
from tqdm import tqdm

pathToFile = "train.txt"
spelling = Spelling(path=pathToFile)
text = " "

with open("test.txt", "r") as f:
    text = f.read()

words = text.split()
corrected = " "
for i in tqdm(words):
    corrected = (
        corrected + " " + spelling.suggest(i)[0][0]
    )  # Spell checking word by word

print(corrected)

100%|████████████████████████████████████████████████████████████████████████████████| 490/490 [01:18<00:00,  6.24it/s]

  An you name any of the difference eyes of martial art There's far more to them than must grate or king for In face numerous arranged and systemized methods of coat are practiced in the would today. While some staves are very traditional and steeped in history, other are more modern. Although thereof a significant mount of overlay between the styles, their approach to fighting is unique. Familiarize yourself with popular martial art staves with this revile that break down striking, grappling, throwing, weapons-based staves and more Striking or Stand-Up Martial Art Styles Striking or stand-up martial art staves teach practitioners how to deed themselves while on their feet by sing flocks kicks, punches, knees, and elbows. The degree to which they teach each of these aspects depends on the specific stolen sub-style or instructor. Also many of these stand-up staves teach other components of fighting. Striking staves include: Going Capoeira Grate Kickboxing Ram Maga Dung Of May That The U




# Task 3: (20%)
### Implement Peter Norvig's algorithm for spelling correction. The tutorial is provided in the link below

https://medium.com/mlearning-ai/build-spell-checking-models-for-any-language-in-python-aa4489df0a5f

In [5]:
"""Spelling Corrector in Python 3; see http://norvig.com/spell-correct.html
Copyright (c) 2007-2016 Peter Norvig
MIT license: www.opensource.org/licenses/mit-license.php
"""

################ Spelling Corrector

import re
from collections import Counter


def words(text):
    return re.findall(r"\w+", text.lower())


WORDS = Counter(words(open("bible.txt").read()))


def P(word, N=sum(WORDS.values())):
    "Probability of `word`."
    return WORDS[word] / N


def correction(word):
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)


def candidates(word):
    "Generate possible spelling corrections for word."
    return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]


def known(words):
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)


def edits1(word):
    "All edits that are one edit away from `word`."
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))
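To get a feel for how large the candidate set from `edits1` is, the sketch below counts the candidates produced by each edit type for a single word. It repeats the same four list comprehensions used above; the helper name `edit_counts` is introduced here only for illustration. For a word of length n, the raw count is n deletes, n - 1 transposes, 26n replaces, and 26(n + 1) inserts, i.e. 54n + 25 before duplicates are removed.

```python
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edit_counts(word):
    """Return raw candidate counts per edit type and the deduplicated total.

    `edit_counts` is a hypothetical helper for illustration; it mirrors edits1.
    """
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in LETTERS]
    inserts = [L + c + R for L, R in splits for c in LETTERS]
    raw = {
        "deletes": len(deletes),
        "transposes": len(transposes),
        "replaces": len(replaces),
        "inserts": len(inserts),
    }
    unique = len(set(deletes + transposes + replaces + inserts))
    return raw, unique


raw, unique = edit_counts("somthing")
print(raw)  # {'deletes': 8, 'transposes': 7, 'replaces': 208, 'inserts': 234}
print(sum(raw.values()), unique)  # 457 raw candidates, 442 after deduplication
```

Because `edits2` applies `edits1` to every one of these candidates, the two-edit search space grows to roughly the square of this size, which is why `candidates` filters through `known` before falling back to two edits.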

In [6]:
################ Test Code


def unit_tests():
    assert correction("speling") == "spelling"  # insert
    assert correction("korrectud") == "corrected"  # replace 2
    assert correction("bycycle") == "bicycle"  # replace
    assert correction("inconvient") == "inconvenient"  # insert 2
    assert correction("arrainged") == "arranged"  # delete
    assert correction("peotry") == "poetry"  # transpose
    assert correction("peotryy") == "poetry"  # transpose + delete
    assert correction("word") == "word"  # known
    assert correction("quintessential") == "quintessential"  # unknown
    assert words("This is a TEST.") == ["this", "is", "a", "test"]
    assert Counter(words("This is a test. 123; A TEST this is.")) == (
        Counter({"123": 1, "a": 2, "is": 2, "test": 2, "this": 2})
    )
    assert len(WORDS) == 32192
    assert sum(WORDS.values()) == 1115504
    assert WORDS.most_common(10) == [
        ("the", 79808),
        ("of", 40024),
        ("and", 38311),
        ("to", 28765),
        ("in", 22020),
        ("a", 21124),
        ("that", 12512),
        ("he", 12401),
        ("was", 11410),
        ("it", 10681),
    ]
    assert WORDS["the"] == 79808
    assert P("quintessential") == 0
    assert 0.07 < P("the") < 0.08
    return "unit_tests pass"


def spelltest(tests, verbose=False):
    "Run correction(wrong) on all (right, wrong) pairs; report results."
    import time

    start = time.process_time()
    good, unknown = 0, 0
    n = len(tests)
    for right, wrong in tests:
        w = correction(wrong)
        good += w == right
        if w != right:
            unknown += right not in WORDS
            if verbose:
                print(
                    "correction({}) => {} ({}); expected {} ({})".format(
                        wrong, w, WORDS[w], right, WORDS[right]
                    )
                )
    dt = time.process_time() - start
    print(
        "{:.0%} of {} correct ({:.0%} unknown) at {:.0f} words per second ".format(
            good / n, n, unknown / n, n / dt
        )
    )


def Testset(lines):
    "Parse 'right: wrong1 wrong2' lines into [('right', 'wrong1'), ('right', 'wrong2')] pairs."
    return [
        (right, wrong)
        for (right, wrongs) in (line.split(":") for line in lines)
        for wrong in wrongs.split()
    ]

In [7]:
spelltest(Testset(open("spell-testset1.txt")))  # Development set
spelltest(Testset(open("spell-testset2.txt")))  # Final test set

8% of 270 correct (89% unknown) at 9 words per second 
8% of 270 correct (89% unknown) at 9 words per second 
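In `spelltest`, the "unknown" figure counts failed pairs whose expected word is not in `WORDS`, so a high value suggests that the training corpus does not contain most of the target words. A minimal sketch of checking this directly is shown below; the helper name `missing_target_rate` and the toy pairs and vocabulary are assumptions for illustration, not data from the test sets.

```python
def missing_target_rate(pairs, vocabulary):
    """Fraction of (right, wrong) pairs whose `right` word is not in `vocabulary`.

    Hypothetical helper: unlike spelltest, it counts every pair,
    not only the pairs that were corrected wrongly.
    """
    if not pairs:
        return 0.0
    missing = sum(1 for right, _ in pairs if right not in vocabulary)
    return missing / len(pairs)


# Toy data for illustration only
pairs = [
    ("access", "acess"),
    ("access", "acces"),
    ("poetry", "peotry"),
    ("bicycle", "bycycle"),
]
vocabulary = {"access", "poetry"}

print(missing_target_rate(pairs, vocabulary))  # 0.25
```

With the real data, the same check could be run as `missing_target_rate(Testset(open("spell-testset1.txt")), WORDS)`. No corrector that only proposes words from `WORDS` can fix a pair whose target is missing, so this rate bounds the achievable accuracy.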


# Task 4: (40%)
### Implement spelling correction using the Noisy Channel model for non-word and real-word errors. You can follow any of the code sources below

https://sanketp.medium.com/language-models-spellchecking-and-autocorrection-dd10f739443c

http://norvig.com/spell-correct.html

https://github.com/bakwc/JamSpell
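As a mini-example of the distinction, the sketch below contrasts the two error types on a toy corpus. A non-word error ("frmo") is not in the vocabulary, so it can be detected by dictionary lookup and corrected without context. A real-word error ("form" typed for "from") is a valid word, so it can only be caught by scoring candidates in context. This is a simplified noisy-channel sketch, not the method of any of the linked sources: the corpus, the add-k constant `K`, the channel probability `P_NO_ERROR`, and all function names are assumptions chosen for illustration.

```python
from collections import Counter

# Toy corpus (assumed for illustration); "." marks sentence boundaries
CORPUS = (
    "a letter from my friend . a letter from the bank . "
    "please fill in the form . the form is long ."
)
TOKENS = CORPUS.split()
UNIGRAMS = Counter(TOKENS)
BIGRAMS = Counter(zip(TOKENS, TOKENS[1:]))
LETTERS = "abcdefghijklmnopqrstuvwxyz"
K = 0.1  # add-k smoothing constant (assumed)
P_NO_ERROR = 0.9  # assumed probability that a typed real word was intended


def edits1(word):
    "All strings one edit away from `word` (same construction as Norvig's)."
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in LETTERS]
    inserts = [L + c + R for L, R in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    "Subset of `words` that occur in the toy corpus."
    return {w for w in words if w in UNIGRAMS}


def p_bigram(prev, word):
    "Add-k smoothed estimate of P(word | prev)."
    return (BIGRAMS[(prev, word)] + K) / (UNIGRAMS[prev] + K * len(UNIGRAMS))


def correct_non_word(word):
    "Context-free correction: only words missing from the vocabulary change."
    if word in UNIGRAMS:
        return word
    options = known(edits1(word))
    return max(options, key=UNIGRAMS.get) if options else word


def correct_real_word(sentence):
    "Context-sensitive correction: channel model times bigram language model."
    words = sentence.split()
    padded = ["."] + words + ["."]
    result = []
    for i, word in enumerate(words, start=1):
        prev, nxt = padded[i - 1], padded[i + 1]

        def score(cand):
            channel = P_NO_ERROR if cand == word else 1 - P_NO_ERROR
            return channel * p_bigram(prev, cand) * p_bigram(cand, nxt)

        result.append(max({word} | known(edits1(word)), key=score))
    return " ".join(result)


print(correct_non_word("frmo"))  # from
print(correct_real_word("a letter form my friend"))  # a letter from my friend
print(correct_real_word("please fill in the form"))  # please fill in the form
```

In the second call every typed word is valid, so a dictionary lookup finds nothing wrong; "form" is replaced only because "letter from my" is far more probable under the bigram counts than "letter form my". In the third call "form" fits its context, so the high `P_NO_ERROR` keeps it unchanged.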

In [None]:
from spellchecker import SpellChecker
 
spell = SpellChecker()

with open('spell-testset1.txt', 'r') as f:
    lines = f.readlines()
    # Keep the word before the colon on each "right: wrong1 wrong2" line
    unknown_words = [t.split(':')[0] for t in lines]
    
# find those words that may be misspelled
misspelled = spell.unknown(unknown_words)

for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))
 
    # Get a list of `likely` options
    print(spell.candidates(word))

diagrammatically
{'diagrammatically'}
arranging
{'arranging'}
addressable
