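This document shows usage examples for the SymSpellCppPy library, covering dictionary loading, single-word spelling correction, compound (multi-word) correction, and word segmentation. Start by creating a SymSpell instance and loading a frequency dictionary: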
import SymSpellCppPy
symSpell = SymSpellCppPy.SymSpell()
symSpell.load_dictionary(corpus="resources/frequency_dictionary_en_82_765.txt", term_index=0, count_index=1, separator=" ")
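To make the `term_index`, `count_index`, and `separator` arguments concrete, here is a minimal pure-Python sketch of how a "term count" file could be parsed. It is an illustration of the file format only, not the library's loader; the function name `parse_frequency_lines` and the sample rows are hypothetical.

```python
import io

def parse_frequency_lines(lines, term_index=0, count_index=1, separator=" "):
    """Parse 'term<separator>count' lines into a dict, skipping malformed rows."""
    frequencies = {}
    for line in lines:
        parts = line.rstrip("\n").split(separator)
        if len(parts) <= max(term_index, count_index):
            continue  # not enough columns on this line
        try:
            count = int(parts[count_index])
        except ValueError:
            continue  # the count column is not an integer
        frequencies[parts[term_index]] = count
    return frequencies

# Illustrative sample only; these rows are made up for the example.
sample = io.StringIO("the 1000\nof 800\nbroken-line\n")
parsed = parse_frequency_lines(sample)
print(parsed)  # {'the': 1000, 'of': 800}
```

With `term_index=0` and `count_index=1`, the first column is read as the word and the second as its frequency, which matches the call shown above.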
The SymSpell class provides methods to inspect the loaded dictionary:
- To check the number of words in the dictionary, use the word_count() method:
print(symSpell.word_count()) # Outputs: 82781
- To find the length of the longest word in the dictionary, use the max_length() method:
print(symSpell.max_length()) # Outputs: 28
- To count the number of unique delete combinations formed, use the entry_count() method:
print(symSpell.entry_count()) # Outputs: 661047
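The "delete combinations" counted by entry_count() can be pictured with a small sketch: each dictionary word is indexed under every string obtained by deleting a few of its characters. The code below is a simplified illustration with hypothetical helper names (`deletes_within`, `build_delete_index`); the real library likely applies further optimizations, so its counts will not match this toy version.

```python
def deletes_within(word, max_distance):
    """Return every string reachable from `word` by deleting 1..max_distance characters."""
    results = set()
    frontier = {word}
    for _ in range(max_distance):
        next_frontier = set()
        for candidate in frontier:
            for i in range(len(candidate)):
                next_frontier.add(candidate[:i] + candidate[i + 1:])
        results |= next_frontier
        frontier = next_frontier
    return results

def build_delete_index(words, max_distance=2):
    """Map each delete variant (and each word itself) to the words that produce it."""
    index = {}
    for word in words:
        for key in deletes_within(word, max_distance) | {word}:
            index.setdefault(key, set()).add(word)
    return index

index = build_delete_index(["take", "tape"], max_distance=1)
print(sorted(index))         # ['ake', 'ape', 'tae', 'tak', 'take', 'tap', 'tape', 'tke', 'tpe']
print(sorted(index["tae"]))  # ['take', 'tape']
```

A misspelling such as "tke" is itself a key in this index, which is why a lookup can find "take" without comparing the input against every dictionary word.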
The lookup method allows you to find the correct spelling for a term from the dictionary:
- To get the suggestions at the smallest edit distance found, ordered by frequency, use SymSpellCppPy.Verbosity.CLOSEST:
terms = symSpell.lookup("tke", SymSpellCppPy.Verbosity.CLOSEST)
print(terms[0].term) # Outputs: "take"
- You can also specify a max_edit_distance to limit the search to terms within a certain edit distance:
terms = symSpell.lookup("extrine", SymSpellCppPy.Verbosity.CLOSEST, max_edit_distance=2)
print(terms[0].term) # Outputs: "extreme"
terms = symSpell.lookup("extrine", SymSpellCppPy.Verbosity.CLOSEST, max_edit_distance=1)
print(terms) # Outputs: []
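The effect of max_edit_distance in the examples above can be checked by hand. The sketch below computes an edit distance that counts insertions, deletions, substitutions, and adjacent transpositions (the optimal string alignment variant of Damerau-Levenshtein). That this matches the library's default metric is an assumption; the function name `osa_distance` is hypothetical.

```python
def osa_distance(a, b):
    """Optimal string alignment distance between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # adjacent transposition, e.g. "teh" -> "the"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(osa_distance("tke", "take"))         # 1
print(osa_distance("extrine", "extreme"))  # 2
print(osa_distance("teh", "the"))          # 1
```

Because "extrine" is two edits away from "extreme", a lookup limited to max_edit_distance=1 cannot reach it.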
SymSpellCppPy also includes features to fix compound errors and word segmentation issues in sentences:
- To correct a multi-word string that contains misspellings as well as missing or wrongly inserted spaces, use the lookup_compound method:
terms = symSpell.lookup_compound("whereis th elove hehad dated forImuch of thepast who couqdn'tread in sixthgrade and ins pired him")
print(terms[0].term)
# Outputs: "whereas to love head dated for much of theist who couldn't read in sixth grade and inspired him"
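- To split text that is missing spaces into words, use the word_segmentation method. The result exposes segmented_string (spaces inserted only) and corrected_string (spaces inserted and spelling corrected):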
segmented_info = symSpell.word_segmentation("thequickbrownfoxjumpsoverthelazydog")
print(segmented_info.segmented_string)
# Outputs: "the quick brown fox jumps over the lazy dog"
segmented_info = symSpell.word_segmentation("thequickbrownfoxjumpsoverthelazydog")
print(segmented_info.corrected_string)
# Outputs: "they quick brown fox jumps over therapy dog"
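The idea behind word segmentation can be sketched with dynamic programming over split points, picking the split whose words are jointly most frequent. This is a generic approach with made-up toy frequencies, not the library's implementation (which also corrects spelling while segmenting); the function name `segment` is hypothetical.

```python
import math

def segment(text, frequencies):
    """Split `text` into dictionary words, maximizing the sum of log word probabilities."""
    total = sum(frequencies.values())
    max_len = max(len(word) for word in frequencies)
    # best[i] holds (score, words) for the best segmentation of text[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            prev_score, prev_words = best[start]
            if prev_words is None or word not in frequencies:
                continue
            score = prev_score + math.log(frequencies[word] / total)
            if score > best[end][0]:
                best[end] = (score, prev_words + [word])
    return best[-1][1]  # None if no segmentation exists

# Toy counts, invented for this example.
toy = {"the": 1000, "he": 500, "over": 200, "overt": 2, "quick": 50,
       "brown": 40, "fox": 30, "jumps": 20, "lazy": 10, "dog": 60}
words = segment("thequickbrownfoxjumpsoverthelazydog", toy)
print(" ".join(words))  # the quick brown fox jumps over the lazy dog
```

Both "over the" and "overt he" are valid splits of the same characters here; the frequencies decide in favour of the first.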
To save the internal representation of a loaded SymSpell instance for fast reuse, use the save_pickle method. Do not use Python's built-in pickle module for this:
symSpell.save_pickle("symspell_binary.bin")
To load the internal representation of a loaded SymSpell from a saved binary, use the load_pickle method:
anotherSymSpell = SymSpellCppPy.SymSpell()
anotherSymSpell.load_pickle("symspell_binary.bin")
terms = anotherSymSpell.lookup("tke", SymSpellCppPy.Verbosity.CLOSEST)
print(terms[0].term)
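A typical way to combine the two methods is to build from the text dictionary once and reuse the saved binary on later runs. The helper below is a sketch of that pattern; it only calls load_dictionary, save_pickle, and load_pickle as shown above, and the name `load_or_build` and its return values are hypothetical.

```python
import os

def load_or_build(sym_spell, binary_path, dictionary_path):
    """Fill `sym_spell` from a saved binary if one exists, otherwise from the text dictionary.

    `sym_spell` is assumed to be a SymSpellCppPy.SymSpell instance.
    Returns "loaded" or "built" so the caller can log which path was taken.
    """
    if os.path.exists(binary_path):
        sym_spell.load_pickle(binary_path)
        return "loaded"
    sym_spell.load_dictionary(corpus=dictionary_path, term_index=0,
                              count_index=1, separator=" ")
    # Save the freshly built index so the next run can skip the text parse.
    sym_spell.save_pickle(binary_path)
    return "built"

# Usage (requires SymSpellCppPy and the dictionary file):
# status = load_or_build(SymSpellCppPy.SymSpell(), "symspell_binary.bin",
#                        "resources/frequency_dictionary_en_82_765.txt")
```

Note that the saved binary is not refreshed automatically if the text dictionary changes; delete it to force a rebuild.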
The SymSpellCppPy library also supports generating bigram and trigram suggestions:
- To generate bigram suggestions, use the `lookup_bigram` method:
terms = symSpell.lookup_bigram("in te dh", SymSpellCppPy.Verbosity.CLOSEST)
print(terms[0].term) # Outputs: "in the dark"
- To generate trigram suggestions, use the `lookup_trigram` method:
terms = symSpell.lookup_trigram("an plesant day", SymSpellCppPy.Verbosity.CLOSEST)
print(terms[0].term) # Outputs: "a pleasant day"
You can also request several suggestions for a given word:
- The `TOP` verbosity returns only the single best suggestion. To get up to five suggestions, use the `ALL` verbosity, which returns every term within max_edit_distance, and slice the result:
terms = symSpell.lookup("huse", SymSpellCppPy.Verbosity.ALL, max_edit_distance=2, include_unknown=True)
for term in terms[:5]:
    print(term.term)
# Prints up to five suggestions, one per line, ordered by edit distance and then by frequency
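The three verbosity levels in SymSpell differ only in how many candidates they keep: `TOP` keeps the single best suggestion, `CLOSEST` keeps all suggestions at the smallest edit distance found, and `ALL` keeps everything within the maximum edit distance. The sketch below mimics that selection on a hand-written candidate list; the `Suggestion` tuple, the `select` function, and the counts are hypothetical and only illustrate the ordering rule.

```python
from collections import namedtuple

Suggestion = namedtuple("Suggestion", ["term", "distance", "count"])

def select(suggestions, verbosity):
    """Filter candidates the way the TOP / CLOSEST / ALL verbosity levels are described."""
    if verbosity not in ("TOP", "CLOSEST", "ALL"):
        raise ValueError(f"unknown verbosity: {verbosity}")
    # Smaller distance first, then higher frequency first.
    ordered = sorted(suggestions, key=lambda s: (s.distance, -s.count))
    if not ordered or verbosity == "ALL":
        return ordered
    closest = [s for s in ordered if s.distance == ordered[0].distance]
    return closest[:1] if verbosity == "TOP" else closest

# Invented candidates for the input "huse"; the counts are not real data.
candidates = [
    Suggestion("house", 1, 500),
    Suggestion("use", 1, 900),
    Suggestion("horse", 2, 300),
]
print([s.term for s in select(candidates, "TOP")])      # ['use']
print([s.term for s in select(candidates, "CLOSEST")])  # ['use', 'house']
print([s.term for s in select(candidates, "ALL")])      # ['use', 'house', 'horse']
```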
By default, SymSpellCppPy is case-sensitive and considers digits as valid characters. However, you can modify this behavior:
- To ignore case when checking a term, use the `ignore_case` parameter:
terms = symSpell.lookup("THe", SymSpellCppPy.Verbosity.CLOSEST, ignore_case=True)
print(terms[0].term) # Outputs: "the"
- To ignore digits when checking a term, use the `ignore_digit` parameter:
terms = symSpell.lookup("3rd", SymSpellCppPy.Verbosity.CLOSEST, ignore_digit=True)
print(terms[0].term) # Outputs: "red"
You may also choose to ignore words containing numbers:
- To ignore words with numbers when checking a term, use the `ignore_word_with_number` parameter:
terms = symSpell.lookup("l33t", SymSpellCppPy.Verbosity.CLOSEST, ignore_word_with_number=True)
print(terms[0].term) # Outputs: "let"