<a href="https://colab.research.google.com/github/victor-roris/mediumseries/blob/master/NLP/SpellCheck_SymSpell.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SymSpell

SymSpell is one of the fastest spell-checking tools available. Implementations of it exist in several programming languages.

General GitHub: https://github.com/wolfgarbe/SymSpell

Python package: symspellpy
- GITHUB: https://github.com/mammothb/symspellpy
- DOCUMENTATION: https://symspellpy.readthedocs.io/en/latest/examples/index.html

## Installation

In [0]:
! pip install symspellpy
! pip install bs4

In [0]:
!python -m spacy download en_core_web_md

## Example of usage

* Lookup misspellings

Automatic spelling correction for a single word

In [7]:
import pkg_resources
from symspellpy import SymSpell, Verbosity

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename("symspellpy", "frequency_dictionary_en_82_765.txt")

# term_index is the column of the term and count_index is the column of the term frequency
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

# lookup suggestions for single-word input strings
input_term = "memebers"  # misspelling of "members"

# max edit distance per lookup
# (max_edit_distance_lookup <= max_dictionary_edit_distance)
suggestions = sym_spell.lookup(input_term, Verbosity.CLOSEST, max_edit_distance=2)

# display suggestion term, edit distance, and term frequency
for suggestion in suggestions:
    print(suggestion)

members, 1, 226656153
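
The middle value in the output above is the edit distance between the input and the suggestion. As a minimal sketch of what that number means, assuming a Damerau-Levenshtein style metric (insertions, deletions, substitutions, and adjacent transpositions), the pure-Python function below computes the optimal string alignment distance. The name `osa_distance` is hypothetical and this is not symspellpy's own implementation.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    # dist[i][j] = number of edits needed to turn a[:i] into b[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters
    for j in range(cols):
        dist[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            # adjacent transposition, e.g. "teh" -> "the"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


print(osa_distance("memebers", "members"))  # 1: delete the extra "e"
print(osa_distance("teh", "the"))           # 1: swap "e" and "h"
```

A lookup with `max_edit_distance=2` only returns dictionary terms whose distance to the input is at most 2 under such a metric.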


* Lookup compound

Automatic spelling correction for a whole sentence (multiple words)

In [8]:
import pkg_resources
from symspellpy import SymSpell, Verbosity

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename("symspellpy", "frequency_dictionary_en_82_765.txt")
bigram_path = pkg_resources.resource_filename("symspellpy", "frequency_bigramdictionary_en_243_342.txt")

# term_index is the column of the term and count_index is the column of the term frequency
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
sym_spell.load_bigram_dictionary(bigram_path, term_index=0, count_index=2)

# lookup suggestions for multi-word input strings (supports compound splitting & merging)
input_term = ("whereis th elove hehad dated forImuch of thepast who "
              "couqdn'tread in sixtgrade and ins pired him")

# max edit distance per lookup (per single word, not per whole input string)
suggestions = sym_spell.lookup_compound(input_term, max_edit_distance=2)

# display suggestion term, edit distance, and term frequency
for suggestion in suggestions:
    print(suggestion)

where is the love he had dated for much of the past who couldn't read in six grade and inspired him, 9, 0


* Word segmentation

Divides a string into words by inserting missing spaces

In [5]:
import pkg_resources
from symspellpy.symspellpy import SymSpell

# Set max_dictionary_edit_distance to 0 to avoid spelling correction
sym_spell = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
dictionary_path = pkg_resources.resource_filename("symspellpy", "frequency_dictionary_en_82_765.txt")

# term_index is the column of the term and count_index is the column of the term frequency
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

# a sentence without any spaces
input_term = "thequickbrownfoxjumpsoverthelazydog"

result = sym_spell.word_segmentation(input_term)

print("{}, {}, {}".format(result.corrected_string, 
                          result.distance_sum,
                          result.log_prob_sum))

the quick brown fox jumps over the lazy dog, 8, -34.491167981910635
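
One way to picture what word segmentation does is to choose the split points that maximize the combined probability of the resulting words. The sketch below is a simplified dynamic-programming version over a tiny made-up frequency table. `TOY_COUNTS` and `segment` are hypothetical names, and SymSpell's actual algorithm is more elaborate (it can also correct misspelled words and works from its full frequency dictionary).

```python
import math

# Hypothetical toy frequencies; real counts would come from the frequency dictionary
TOY_COUNTS = {"the": 500, "quick": 20, "brown": 15, "fox": 10,
              "he": 80, "row": 5, "a": 300}
TOTAL = sum(TOY_COUNTS.values())
MAX_WORD_LEN = max(len(word) for word in TOY_COUNTS)


def segment(text: str):
    """Return (words, log_prob) for the most probable split, or None."""
    # best[i] = (log probability, words) for the best split of text[:i]
    best = [None] * (len(text) + 1)
    best[0] = (0.0, [])
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            word = text[start:end]
            if best[start] is None or word not in TOY_COUNTS:
                continue
            score = best[start][0] + math.log10(TOY_COUNTS[word] / TOTAL)
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    if best[-1] is None:
        return None  # no split into known words exists
    log_prob, words = best[-1]
    return words, log_prob


words, log_prob = segment("thequickbrownfox")
print(" ".join(words))  # the quick brown fox
```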


## Real example of usage

This code was used in a real NLP project.

In [0]:
import spacy
import re
from symspellpy.symspellpy import SymSpell, Verbosity  # import the module
import os
from bs4 import BeautifulSoup
import pkg_resources

Load the English model

In [0]:
nlp = spacy.load('en_core_web_md')

Get the full vocabulary of the spaCy model to use as a dictionary of known words

In [14]:
spelldict = set(nlp.vocab.strings)
print(f'Number of vocabulary entries : {len(spelldict)}')

Number of vocabulary entries : 1476045


Define the SymSpell object

In [0]:
# maximum edit distance per dictionary precalculation
max_edit_distance_dictionary = 2
prefix_length = 7

# create object
spell = SymSpell(max_edit_distance_dictionary, prefix_length)

Load your own dictionary. You can create a dictionary for a specific domain (e.g., legal, news, etc.) that contains the terms considered acceptable in that domain.

In [48]:
# load the custom (domain-specific) frequency dictionary
dictionary_path = "./resoruces/frequency_dict_en.txt"
term_index = 0  # column of the term in the dictionary text file
count_index = 1  # column of the term frequency in the dictionary text file
if not spell.load_dictionary(dictionary_path, term_index, count_index):
    print("Dictionary file not found")

    # Fall back to the default dictionary bundled with symspellpy
    dictionary_path = pkg_resources.resource_filename("symspellpy", "frequency_dictionary_en_82_765.txt")
    spell.load_dictionary(dictionary_path, term_index, count_index)

Dictionary file not found
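
A dictionary in the layout used by `load_dictionary` above (term in column 0, count in column 1) is a plain-text file with one space-separated `term count` pair per line. The following sketch builds such a file from a small corpus using only the standard library. The corpus, the tokenizing regex, the output file name, and the helper names `build_frequency_dict` and `write_frequency_dict` are illustrative assumptions, not part of symspellpy.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path


def build_frequency_dict(texts):
    """Count lowercase alphabetic words across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


def write_frequency_dict(counts, path):
    """Write one 'term count' line per word, most frequent first."""
    lines = [f"{term} {count}" for term, count in counts.most_common()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


# Hypothetical legal-domain corpus
corpus = ["The plaintiff filed a motion.", "The motion was denied."]
counts = build_frequency_dict(corpus)

out_path = Path(tempfile.gettempdir()) / "frequency_dict_demo.txt"
write_frequency_dict(counts, out_path)
print(out_path.read_text(encoding="utf-8").splitlines()[:2])
# ['the 2', 'motion 2']
```

The resulting file could then be passed to `spell.load_dictionary(str(out_path), term_index=0, count_index=1)`. A realistic domain dictionary would need a much larger corpus for the counts to be meaningful.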


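Normalize the text: strip HTML tags, replace URLs and emails with placeholder terms, and spell-correct alphabetic tokens that are not in the spaCy vocabulary.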

In [0]:
def normalize_text(text):
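    """Clean raw text and spell-correct tokens unknown to the spaCy vocabulary.

    Relies on the module-level `nlp`, `spelldict`, and `spell` objects.
    """
    # Strip HTML tags (name the parser explicitly to avoid a bs4 warning)
    text = BeautifulSoup(text, "html.parser").get_text()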

    # Replace patterns that should be treated as one general term (e.g., URLs, emails)
    text = re.sub(r'http\S+', 'url', text, flags=re.MULTILINE)  # Replace URLs with the term 'url'
    text = re.sub(r'www\S+', 'url', text, flags=re.MULTILINE)   # Replace URLs with the term 'url'
    # Replace emails with the term 'email' before the punctuation step below splits them at the dots
    text = re.sub(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", 'email', text, flags=re.MULTILINE)
    text = re.sub(r'([.,!?()"/])', r' \1 ', text)  # Pad punctuation with spaces
    text = re.sub(r'\s{2,}', ' ', text)            # Collapse repeated whitespace

    # Process the text
    doc = nlp(text)
    out_text = []
    for token in doc:

        # Does the token consist of alphabetic characters?
        if token.is_alpha:
            # Token is not in the spaCy vocabulary
            if str(token) not in spelldict:
                # Look for a SymSpell correction of the word
                max_edit_distance_lookup = 2
                suggestion_verbosity = Verbosity.CLOSEST  # TOP, CLOSEST, ALL
                suggestions = spell.lookup(str(token), suggestion_verbosity, max_edit_distance_lookup)
                if len(suggestions) > 0:
                    # Keep the best suggestion; tokens without one are dropped
                    out_text.append(suggestions[0].term)
            else:
                # Keep the valid token as is, lowercased
                out_text.append(str(token).lower())

        # Does the token resemble an email address?
        if token.like_email:
            out_text.append('email')
        # Does the token resemble a URL?
        if token.like_url:
            out_text.append('url')

    raw = ' '.join(out_text)

    return raw

In [53]:
normalize_text("Mi email address is test@gmail.com. You can't contact with me!! ;-P ")

'mi email address is email you ca contact with me'
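
The regex pre-processing inside `normalize_text` can be tried in isolation, without spaCy or SymSpell. The sketch below (the helper name `replace_patterns` and the sample sentence are hypothetical) applies the URL and email placeholders before padding punctuation. The order matters: padding the dots first would split an address such as `test@gmail.com` into several tokens.

```python
import re


def replace_patterns(text: str) -> str:
    """Replace URLs and emails with placeholders, then pad punctuation."""
    text = re.sub(r"http\S+", "url", text)
    text = re.sub(r"www\S+", "url", text)
    text = re.sub(r"[a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+", "email", text,
                  flags=re.IGNORECASE)
    text = re.sub(r'([.,!?()"/])', r" \1 ", text)  # pad punctuation with spaces
    text = re.sub(r"\s{2,}", " ", text)            # collapse repeated whitespace
    return text.strip()


print(replace_patterns("Mail test@gmail.com or see https://example.com/docs now!"))
# Mail email or see url now !
```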