# Dataset Source

The dataset was downloaded from:

https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/A

- The dataset contains misspelled words paired with their correct spellings
- Around 670 words (666 in the runs below) are selected at random from the dataset to evaluate the performance of the spell correctors

## Two spell correctors are used

- Spell corrector using n-grams, Jaccard coefficient and minimum edit distance
- Spell corrector using minimum edit distance (MED)

Each spell corrector checks a misspelled word and suggests a correct word for it. Both spell correctors are evaluated on their performance (accuracy of correcting words).
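Accuracy here means the percentage of misspelled words for which a corrector returns exactly the expected word. The following is a minimal sketch of that measure, assuming the data is a list of `(misspelled, correct)` pairs and the corrector is any function from a word to a suggestion. The names `accuracy`, `toy_table` and `toy_corrector` are hypothetical and not part of the notebook.

```python
def accuracy(pairs, corrector):
    """Return the percentage of pairs for which corrector(misspelled) == correct."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(1 for misspelled, correct in pairs if corrector(misspelled) == correct)
    return 100.0 * hits / len(pairs)


# Toy corrector for illustration only: it looks the word up in a fixed table
# and returns the word unchanged when it is not in the table.
toy_table = {"abandonned": "abandoned", "aberation": "aberration"}


def toy_corrector(word):
    return toy_table.get(word, word)


sample = [
    ("abandonned", "abandoned"),
    ("aberation", "aberration"),
    ("abilityes", "abilities"),
]
print(accuracy(sample, toy_corrector))
```

The toy corrector fixes two of the three sample words, so the reported figure is two thirds expressed as a percentage.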

### Spell Correction Using Ngram, Jaccard Coefficient and Edit Distance

#### Steps performed:

1. Find misspelled words
2. Get the suggested words for each misspelled word
3. Keep only the suggested words that are within some edit distance of the misspelled word
4. Compute the n-grams of the misspelled word and of each suggested word
5. Compute the Jaccard coefficient between the n-grams of the misspelled word and those of each suggested word
6. Replace the misspelled word with the suggested word that has the maximum Jaccard coefficient
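Steps 4 to 6 can be sketched as follows. This is only an illustration, not the notebook's `ngram_spell_corrector`: it assumes character bigrams (`n=2`, which the text does not specify) and assumes the candidate list has already been produced and filtered in steps 1 to 3. The names `char_ngrams`, `jaccard` and `best_suggestion` are hypothetical.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word (n=2 is an assumption)."""
    if len(word) < n:
        return {word}
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def jaccard(a, b):
    """Jaccard coefficient of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / float(len(union)) if union else 0.0


def best_suggestion(misspelled, candidates, n=2):
    """Pick the candidate whose n-gram set is most similar to the misspelled word."""
    if not candidates:
        return misspelled  # nothing to choose from, leave the word unchanged
    target = char_ngrams(misspelled, n)
    return max(candidates, key=lambda c: jaccard(target, char_ngrams(c, n)))


# Hypothetical candidate list, as if returned by a dictionary's suggestions.
print(best_suggestion("abandonned", ["abandon", "abounded", "abandoned"]))
```

In this example "abandoned" shares 8 of the 9 distinct bigrams in the union with "abandonned", which is a higher coefficient than either of the other two candidates.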


# Levenshtein Distance (Minimum Edit Distance) in NLTK

It computes the minimum edit distance between two strings, that is, the smallest number of the following three operations needed to turn one string into the other:

1. Substitution
2. Insertion
3. Deletion

It is mainly used for spelling correction. I tried to use it as a spelling corrector, but it does not always perform optimally.
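NLTK exposes this measure as `nltk.edit_distance`. To show what is being computed, the sketch below is a plain-Python version of the usual dynamic-programming recurrence. It assumes a cost of 1 for each operation and no transpositions. The function name `levenshtein` is chosen here for illustration and is not the notebook's code.

```python
def levenshtein(source, target):
    """Minimum number of substitutions, insertions and deletions
    needed to turn source into target (unit cost per operation)."""
    # previous[j] is the distance between the first i-1 characters of source
    # and the first j characters of target; row 0 is all insertions.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i characters into the empty string: i deletions
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(levenshtein("abandonned", "abandoned"))
print(levenshtein("teh", "the"), levenshtein("teh", "ten"))
```

One plausible reason a distance-only corrector can choose the wrong word: with unit costs, `levenshtein("teh", "the")` is 2 while `levenshtein("teh", "ten")` is 1, so picking the nearest dictionary word would prefer "ten" over "the".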


# PyEnchant Library

PyEnchant is a spell-checking library for Python. It has a built-in English dictionary and functions to check the spelling of words in sentences.
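Detecting misspelled words in a sentence with such a dictionary might look like the sketch below. PyEnchant's dictionary objects provide a `check(word)` method. So that the example runs without PyEnchant installed, a small stand-in class with the same `check` method is used here. `find_misspelled_words` and `ToyDict` are hypothetical names, and the tokenizing regular expression is an assumption.

```python
import re


def find_misspelled_words(sentence, checker):
    """Return the words in sentence for which checker.check(word) is False."""
    words = re.findall(r"[A-Za-z']+", sentence)
    return [word for word in words if not checker.check(word)]


class ToyDict(object):
    """Stand-in for a PyEnchant dictionary: it knows only a handful of words."""

    def __init__(self, known_words):
        self.known = set(w.lower() for w in known_words)

    def check(self, word):
        return word.lower() in self.known


toy = ToyDict(["the", "answer", "is", "correct"])
print(find_misspelled_words("The anser is corect", toy))
```

With PyEnchant installed, an object such as `enchant.Dict('en')` can be passed as `checker` in place of `ToyDict`, and its `suggest(word)` method returns candidate corrections for a misspelled word.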

# Conclusion

The spell corrector using n-grams, Jaccard coefficient and edit distance and the spell corrector using Levenshtein distance (minimum edit distance) were evaluated on the dataset of misspelled words from Wikipedia and achieved accuracies of 69.22% and 59.76% respectively. This accuracy is not sufficient to replace a misspelled word with the corrected word automatically. However, automatic short answer grading needs to detect misspelled words in a student's answer, and this can be done with the help of the PyEnchant library.


In [13]:
from __future__ import division
import numpy as np
import enchant  # spell checker library pyenchant
import string_similarity # import string similarity notebook

In [7]:
# Load enchant english dictionary
spell_dictionary = enchant.Dict('en')

In [9]:
# create object of string similarity class(present in another notebook)
obj = string_similarity.string_similarity(spell_dictionary)

# Load the dataset that contains misspelled and corrected words

In [10]:
def load_misspelled_dataset(dataset):
    """Map each misspelled word (column 0) to its correct word (column 2)."""
    words_dictionary = dict()
    for row in dataset:
        words_dictionary[row[0]] = row[2]
    return words_dictionary


In [11]:
load_dataset = np.loadtxt('dataset/dataset_misspelled.txt',dtype='str')

# Performance of Spell Corrector using Ngrams on Dataset

In [14]:
def check_misspelledWords_ngramCorrector(dataset):
    # Map each misspelled word to its correct word, built from the dataset passed in
    misspelled_to_correct = load_misspelled_dataset(dataset)
    number_of_corrected_words = 0
    for misspelled, correct in misspelled_to_correct.items():
        # count a hit when the word suggested by the ngram spell corrector equals the correct word
        if obj.ngram_spell_corrector(misspelled) == correct:
            number_of_corrected_words += 1
    total = len(misspelled_to_correct)
    print("============================================================================================")
    print("Total number of misspelled words in database %d" % total)
    print("Total number of corrected words %d" % number_of_corrected_words)
    print("Total percentage  %s" % (100.0 * number_of_corrected_words / total))
    print("============================================================================================")
check_misspelledWords_ngramCorrector(load_dataset)

Total number of misspelled words in database 666
Total number of corrected words 461
Total percentage  69.2192192192


# Performance of Spell Corrector using Minimum Edit Distance (MED) on Dataset

In [15]:
def check_misspelledWords_medCorrector(dataset):
    # Map each misspelled word to its correct word, built from the dataset passed in
    misspelled_to_correct = load_misspelled_dataset(dataset)
    number_of_corrected_words = 0
    for misspelled, correct in misspelled_to_correct.items():
        # count a hit when the word suggested by the MED spell corrector equals the correct word
        if obj.minimumEditDistance_spell_corrector(misspelled) == correct:
            number_of_corrected_words += 1
    total = len(misspelled_to_correct)
    print("============================================================================================")
    print("Total number of misspelled words in database %d" % total)
    print("Total number of corrected words %d" % number_of_corrected_words)
    print("Total percentage  %s" % (100.0 * number_of_corrected_words / total))
    print("============================================================================================")
check_misspelledWords_medCorrector(load_dataset)

Total number of misspelled words in database 666
Total number of corrected words 398
Total percentage  59.7597597598
