
# pyspellchecker

Pure Python Spell Checking based on Peter Norvig's blog post on setting up a simple spell checking algorithm.

It uses a Levenshtein distance algorithm to generate candidate strings within an edit distance of 2 from the original word. It then compares these candidates (produced by insertions, deletions, replacements, and transpositions) to the known words in a word frequency list. Words that appear more often in the frequency list are treated as the more likely corrections.
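
The approach described above can be sketched in a few lines of plain Python. This is a minimal illustration in the spirit of Norvig's post, not the package's actual implementation: the tiny `WORD_COUNTS` table, the `ALPHABET` constant, and all function names here are assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy frequency list; the real package ships a far larger one.
WORD_COUNTS = Counter({"happening": 50, "hello": 120, "help": 90, "held": 30, "the": 1000})
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one deletion, transposition, replacement, or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + ch + right[1:] for left, right in splits if right for ch in ALPHABET]
    inserts = [left + ch + right for left, right in splits for ch in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    """All strings reachable by applying two single-character edits."""
    return {second for first in edits1(word) for second in edits1(first)}


def known(words):
    """Keep only the strings that appear in the frequency list."""
    return {w for w in words if w in WORD_COUNTS}


def candidates(word):
    """Prefer the word itself, then distance-1 matches, then distance-2 matches."""
    word = word.lower()
    return known([word]) or known(edits1(word)) or known(edits2(word)) or {word}


def correction(word):
    """Pick the candidate with the highest count in the frequency list."""
    return max(candidates(word), key=lambda w: WORD_COUNTS[w])


print(correction("helo"))       # hello
print(correction("hapenning"))  # happening
```

For `"helo"`, three known words sit one edit away (`hello`, `help`, `held`), and `hello` wins because it has the highest count in the toy table.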

pyspellchecker supports multiple languages including English, Spanish, German, French, and Portuguese. Dictionaries were generated using the WordFrequency project on GitHub.


GitHub: https://github.com/barrust/pyspellchecker


## Installation

In [0]:
! pip install pyspellchecker

Collecting pyspellchecker
  Downloading https://files.pythonhosted.org/packages/93/24/9a570f49dfefc16e9ce1f483bb2d5bff701b95094e051db502e3c11f5092/pyspellchecker-0.5.3-py2.py3-none-any.whl (1.9MB)
Installing collected packages: pyspellchecker
Successfully installed pyspellchecker-0.5.3


## Basic usage

`SpellChecker` comes with a default word frequency list (dictionary).

In [0]:
from spellchecker import SpellChecker
spell = SpellChecker() 

Find the words that are not in the dictionary:

In [0]:
# find those words that may be misspelled
misspelled = spell.unknown(['something', 'is', 'hapenning', 'here'])

In [3]:
for word in misspelled:

    # Word misspelled
    print(f'Word misspelled: {word}')

    # Get the one `most likely` answer
    print(f'Correct word: {spell.correction(word)}')

    # Get a list of `likely` options
    print(f'List of possible corrections: {spell.candidates(word)}')

Word misspelled: hapenning
Correct word: happening
List of possible corrections: {'happening', 'henning', 'penning'}
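
The same two calls can be combined to correct a whole list of tokens. The sketch below is only an illustration: `correct_tokens` is a hypothetical helper, and `DemoChecker` is a hypothetical stand-in that exposes `unknown` and `correction` so the snippet runs without the package installed. The fallback to the original token is a precaution for cases where no suggestion comes back.

```python
def correct_tokens(tokens, checker):
    """Return tokens with every word the checker flags replaced by its top suggestion."""
    misspelled = checker.unknown(tokens)  # the checker reports lowercased words
    fixed = []
    for token in tokens:
        if token.lower() in misspelled:
            # Keep the original token if the checker has no suggestion.
            fixed.append(checker.correction(token) or token)
        else:
            fixed.append(token)
    return fixed


class DemoChecker:
    """Hypothetical stand-in with the same two methods used above."""

    fixes = {"hapenning": "happening"}

    def unknown(self, words):
        return {w.lower() for w in words if w.lower() in self.fixes}

    def correction(self, word):
        return self.fixes.get(word.lower())


print(correct_tokens(['something', 'is', 'hapenning', 'here'], DemoChecker()))
# ['something', 'is', 'happening', 'here']
```

With the package installed, passing `SpellChecker()` in place of `DemoChecker()` should work the same way, since only `unknown` and `correction` are called.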


## Load custom dictionaries

* Load a dictionary through the constructor

In [0]:
from spellchecker import SpellChecker

# Loads default word frequency list
spell = SpellChecker()  

## Loads only the local dictionary (no default)
# spell = SpellChecker(local_dictionary='./my_dictionary.json')

* Load a dictionary through the object's methods

In [0]:
## From a file with free text
# spell.word_frequency.load_text_file('./my_free_text_doc.txt')

## From a pre-built word frequency dictionary (the loader expects JSON content)
# spell.word_frequency.load_dictionary('./my_dictionary.txt')

## From a free text
spell.word_frequency.load_text("A blue whale went for a swim in the sea. Along its path it ran into a storm. To avoid the storm it dove deep under the waves.")

* Load specific words directly

In [6]:
print(spell.known(['microsoft', 'google']))

# if I just want to make sure some words are not flagged as misspelled
spell.word_frequency.load_words(['microsoft', 'apple', 'google'])

print(spell.known(['microsoft', 'google']))  # will return both now!

{'microsoft'}
{'google', 'microsoft'}


## Change the spell distance

In [0]:
from spellchecker import SpellChecker

spell = SpellChecker(distance=1)  # set at initialization

# do some work on longer words

spell.distance = 2  # set the distance parameter back to the default
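
To judge whether `distance=1` is enough for a given kind of typo, it helps to measure how many single-character edits separate the typo from its target. The function below is a minimal sketch, not part of pyspellchecker: `edit_distance` is a hypothetical helper that computes the optimal string alignment distance, which counts insertions, deletions, replacements, and adjacent transpositions as one edit each.

```python
def edit_distance(a, b):
    """Optimal string alignment distance between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        dist[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # replacement or match
            )
            # Adjacent transposition, e.g. "teh" -> "the"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


print(edit_distance("helo", "hello"))           # 1
print(edit_distance("teh", "the"))              # 1
print(edit_distance("hapenning", "happening"))  # 2
```

A typo such as `hapenning` is two edits from `happening`, so a checker limited to distance 1 would not be expected to reach it, while `helo` and `teh` are within a single edit.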

## Methods

In [0]:
from spellchecker import SpellChecker

# Loads default word frequency list
spell = SpellChecker()  

`correction(word)`: Returns the most probable result for the misspelled word

In [9]:
spell.correction('Helloo')

'hello'

`candidates(word)`: Returns a set of possible candidates for the misspelled word

In [10]:
spell.candidates('Helo')

{'delo',
 'halo',
 'heclo',
 'hel',
 'hela',
 'held',
 'heli',
 'hell',
 'hello',
 'helm',
 'helot',
 'help',
 'hely',
 'hero',
 'melo',
 'selo'}

`known([words])`: Returns those words that are in the word frequency list

In [11]:
spell.known(['Blue', 'Red', 'Whithe', 'Blak', 'Yellow', 'Gren']) #blue, red, yellow

{'blue', 'red', 'yellow'}

`unknown([words])`: Returns those words that are not in the frequency list

In [12]:
spell.unknown(['Blue', 'Red', 'Whithe', 'Blak', 'Yellow', 'Gren']) #whithe, blak, gren

{'blak', 'gren', 'whithe'}
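
The two outputs above are lowercased sets that together cover the whole input. A rough way to picture this, assuming a hypothetical three-word vocabulary in place of the real frequency list, is a case-insensitive set partition:

```python
# Hypothetical vocabulary standing in for the package's frequency list.
VOCAB = {"blue", "red", "yellow"}


def split_known_unknown(words):
    """Lowercase the input, then split it into (known, unknown) sets."""
    lowered = {w.lower() for w in words}
    return lowered & VOCAB, lowered - VOCAB


known_words, unknown_words = split_known_unknown(
    ['Blue', 'Red', 'Whithe', 'Blak', 'Yellow', 'Gren']
)
print(sorted(known_words))    # ['blue', 'red', 'yellow']
print(sorted(unknown_words))  # ['blak', 'gren', 'whithe']
```

Because the results are sets, duplicates in the input collapse and the original order is not preserved.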

`word_probability(word)`: The frequency of the given word out of all words in the frequency list

In [13]:
spell.word_probability('car')

0.00028855784147263907
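
The description above amounts to a relative frequency: the count of the word divided by the total count of all words. A minimal sketch with a made-up nine-word corpus (the corpus and the function here are illustrative assumptions, not the package's data or code):

```python
from collections import Counter

# Hypothetical corpus; the real frequency list is far larger.
counts = Counter("the cat sat on the mat near the car".split())
total = sum(counts.values())  # 9 tokens in total


def word_probability(word):
    """Share of all tokens in the corpus that are this word."""
    return counts[word] / total


print(word_probability("the"))  # 3 of 9 tokens
print(word_probability("car"))  # 1 of 9 tokens
print(word_probability("dog"))  # 0.0, the word never occurs
```

A small value such as the one returned for `'car'` above is therefore expected for any single word in a large frequency list.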

`edit_distance_1(word)`: Returns a set of all strings at a Levenshtein Distance of one based on the alphabet of the selected language

In [14]:
list( spell.edit_distance_1("distance") )[0:10]

['odistance',
 'disthance',
 'd3istance',
 '8istance',
 'disttnce',
 'distance ',
 'di6stance',
 'distance5',
 'iistance',
 'distmnce']

`edit_distance_2(word)`: Returns a set of all strings at a Levenshtein Distance of two based on the alphabet of the selected language

In [15]:
list( spell.edit_distance_2("documentation") )[0:10]

['do9cumentation',
 'do9cumenta+ion',
 'odo9cumentation',
 'xo9cumentation',
 'do9cumezntation',
 'do9jcumentation',
 'do9cumen!tation',
 'dobcumentation',
 'do9cumentationj',
 'do9cumen4ation']