# Spelling correction with symspellpy
Spell check for sentences/paragraphs

Recommendations
- Out of the box, it works better on shorter text (more words get mangled as the input gets longer)
- It will probably work better with a custom frequency dictionary extracted from the target corpus (not tried here)
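
A minimal sketch of the custom-dictionary idea, using only the standard library. It counts word frequencies in a corpus and writes them in the same space-separated `term count` layout that the bundled `frequency_dictionary_en_82_765.txt` uses. The helper names, the tokenising regex, the sample corpus and the output file name are all assumptions for illustration, and the resulting file was not tested with symspellpy here.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path


def build_frequency_dictionary(texts, min_count=1):
    """Count lowercase word tokens across an iterable of strings."""
    counts = Counter()
    for text in texts:
        # Letters with optional internal apostrophes, e.g. "what's".
        counts.update(re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower()))
    # Drop rare terms, which are often typos themselves.
    return Counter({term: n for term, n in counts.items() if n >= min_count})


def write_frequency_dictionary(counts, path):
    """Write one '<term> <count>' line per term, most frequent first."""
    lines = [f"{term} {n}" for term, n in counts.most_common()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


# Hypothetical two-sentence corpus; replace with the real target corpus.
corpus = [
    "Sockpuppets were outed by the admins.",
    "The admins found the sockpuppets.",
]
counts = build_frequency_dictionary(corpus)
out_path = Path(tempfile.gettempdir()) / "custom_frequency_dictionary.txt"
write_frequency_dictionary(counts, out_path)
```

The file could then presumably be loaded with `sym_spell.load_dictionary(str(out_path), term_index=0, count_index=1)`, the same call used for the bundled dictionary below.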

References
- https://symspellpy.readthedocs.io/en/latest/examples/lookup_compound.html
- https://github.com/mammothb/symspellpy

In [1]:
import pkg_resources
from symspellpy import SymSpell

In [2]:
%%time
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt"
)
bigram_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_bigramdictionary_en_243_342.txt"
)
# term_index is the column of the term and count_index is the
# column of the term frequency
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
sym_spell.load_bigram_dictionary(bigram_path, term_index=0, count_index=2)

CPU times: user 2.28 s, sys: 266 ms, total: 2.55 s
Wall time: 2.57 s


True

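Passing `max_edit_distance=3` to a lookup throws an error: the value is too large because it exceeds the `max_dictionary_edit_distance=2` set when creating `SymSpell` above.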

In [3]:
# lookup suggestions for multi-word input strings (supports compound
# splitting & merging)
inp = (
    "Whereis th elove heHAd dated forImuch of thEPast who "
    "couqdn'tread in sixtgrade and ins pired Him"
)
# max edit distance per lookup (per single word, not per whole input string)
for i in range(1, 3):
    print(f"max_edit_distance={i}")
    suggestions = sym_spell.lookup_compound(
        inp, max_edit_distance=i, transfer_casing=True
    )
    # display suggestion term, edit distance, and term frequency
    for suggestion in suggestions:
        print(suggestion)

max_edit_distance=1
Where is the love he HAd dated for much of thE Past who couldn't read in six grade and inspired Him, 9, 0
max_edit_distance=2
Where is the love he HAd dated for much of thE Past who couldn't read in six grade and inspired Him, 9, 0


Long text problems
- Preserving case with `transfer_casing=True` is unreliable: lowercase letters are incorrectly changed to uppercase and vice versa, e.g. "ALSO" -> "alsO", "same" -> "samE"
- Dates and times get mangled. Partly fixed by `ignore_term_with_digits=True`, but "06:12" still becomes "06 12"
- Acronyms get mangled, e.g. "ANI" -> "AN i". Fixed by `ignore_non_words=True`
- Names get mangled, e.g. "Elonka" -> "A lanka", "Wik/Gnomerplatz" -> "win gnome plate". Not fixed by `ignore_non_words=True`
- Slang gets mangled, e.g. "sockpuppets" -> "soc puppets". Not fixed by `ignore_non_words=True`
- The full stop at the end of a sentence is incorrectly removed
- Phrases get mutated: letters and punctuation (commas) are replaced, e.g. "Oh, ALSO, what's" -> "of alsO what's" or "Ohm also, what's"
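
One possible workaround for several of the problems above is to keep fragile tokens away from the corrector entirely. The sketch below is an assumption, not something tried in this notebook: it splits on whitespace, passes through any token that contains a digit, looks like an all-caps acronym, or contains a user-supplied name, and sends only the remaining runs of tokens to a `correct` callable. `is_protected`, `correct_protected` and `fake_correct` are hypothetical names; `fake_correct` is a stand-in so the example runs without symspellpy.

```python
def is_protected(token, extra_terms):
    """Heuristic: True for tokens the corrector should not touch."""
    core = token.strip(".,;:!?()\"'")
    return (
        any(ch.isdigit() for ch in token)          # dates, times, numbers
        or (len(core) >= 2 and core.isupper())     # acronyms such as ANI, UTC
        or any(term in token for term in extra_terms)  # known names
    )


def correct_protected(text, correct, extra_terms=()):
    """Apply `correct` only to runs of unprotected tokens."""
    out, chunk = [], []
    for token in text.split():
        if is_protected(token, extra_terms):
            if chunk:
                out.append(correct(" ".join(chunk)))
                chunk = []
            out.append(token)  # copied through unchanged
        else:
            chunk.append(token)
    if chunk:
        out.append(correct(" ".join(chunk)))
    return " ".join(out)


def fake_correct(chunk):
    """Stand-in corrector for the demo; not symspellpy."""
    return chunk.lower().replace("teh", "the")


print(correct_protected(
    "On teh ANI page Elonka posted at 06:12 teh end.",
    fake_correct,
    extra_terms=("Elonka",),
))
# -> on the ANI page Elonka posted at 06:12 the end.
```

With symspellpy, `correct` could be something like `lambda s: sym_spell.lookup_compound(s, max_edit_distance=2)[0].term`. The trade-off is that each run is corrected in isolation, so word-pair context across a protected token is lost, and the all-caps rule also shields emphasised words such as "ALSO".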

In [4]:
sents = [
    ", 20 April 2007 (UTC) Oh, ALSO, what's particularly funny is that on that ANI discussion, Gene Poole says that Elonka asked him to look at the edits and that in his opinion the edits are the same as Wik/Gnomerplatz...",
    "besides it being hypocritical with his and Elonka's known history of sockpuppet use to try to prevail in conflicts/votes, the post there is trying to claim that this was something new he came up with based upon evidence.",
    "In fact it is just the same false accusation he came up with out of nowhere last fall as part of Elonka's failed attempt to become an admin and which directly lead to his sockpuppets being outed.",
    "When he made the charges at that time they were found by multiple admins to be completely groundless.",
    "He's trying to deceive people by presenting the same, tired old harassment as something new and to failing to mention that he was the one actually using those tactics.",
    "06:12",
]
inp = " ".join(sents)
print(f"len(input)={len(inp)}")
for i in range(1, 3):
    print(f"max_edit_distance={i}")
    suggestions = sym_spell.lookup_compound(
        inp, max_edit_distance=i, transfer_casing=True
    )
    for s in suggestions:
        print(f"{s.term}\nedit_distance={s.distance}, term_frequency={s.count}")

len(input)=910
max_edit_distance=1
a a April 2007 eTC of alsO what's particularly funny is that on that AN i discussion Gene Poole says that A lanka asked him to look at the edits and that in his opinion the edits are the same as Win gnome plate besides it being hypocritical with his and plonk As known history of soc puppet use to try to prevail in conflicts votes the post there is trying to claim that this was something new he came up with based upon evidence in fact it is just the same false accusation he came up with out of nowhere last fall as part of plonk as failed attempt to become an admin and which directLy lead to his soc puppets being outed when he made the charges at that time they were found by multiple admins to be completely groundless he's trying to deceive people by presenting the samE tired old harassment as something new and to failing to mention that he was the one actually usIng those tactics a a a a
edit_distance=54, term_frequency=0
max_edit_distance=2
of apRil o
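
Spotting what was changed in a long output like the one above is tedious by eye. A small sketch with the standard library's `difflib` can list only the token runs that differ between input and output. `changed_spans` is a hypothetical helper name, and the two strings are a fragment of the input and of the `max_edit_distance=1` output above.

```python
import difflib


def changed_spans(original, corrected):
    """Return (before, after) pairs for token runs that differ."""
    a, b = original.split(), corrected.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


original = "Gene Poole says that Elonka asked him to look at the edits"
corrected = "Gene Poole says that A lanka asked him to look at the edits"
print(changed_spans(original, corrected))
# -> [('Elonka', 'A lanka')]
```

The comparison is case-sensitive and whitespace-tokenised, so casing changes and dropped punctuation show up as differences too.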

In [5]:
inp = " ".join(sents)
print(f"len(input)={len(inp)}")
for i in range(1, 3):
    print(f"max_edit_distance={i}")
    suggestions = sym_spell.lookup_compound(
        inp, 
        max_edit_distance=i, 
        ignore_non_words=True, 
        ignore_term_with_digits=True, 
        transfer_casing=False,
        split_by_space=False,
    )
    for s in suggestions:
        print(f"{s.term}\nedit_distance={s.distance}, term_frequency={s.count}")

len(input)=910
max_edit_distance=1
20 april 2007 UTC of ALSO what's particularly funny is that on that ANI discussion gene poole says that a lanka asked him to look at the edits and that in his opinion the edits are the same as win gnome plate besides it being hypocritical with his and plonk as known history of soc puppet use to try to prevail in conflicts votes the post there is trying to claim that this was something new he came up with based upon evidence in fact it is just the same false accusation he came up with out of nowhere last fall as part of plonk as failed attempt to become an admin and which directly lead to his soc puppets being outed when he made the charges at that time they were found by multiple admins to be completely groundless he's trying to deceive people by presenting the same tired old harassment as something new and to failing to mention that he was the one actually using those tactics 06 12
edit_distance=43, term_frequency=0
max_edit_distance=2
20 april 2007 

In [6]:
for inp in sents:
    print(f"len(input)={len(inp)}\n{inp}")
    for i in range(1, 3):
        print(f"max_edit_distance={i}")
        suggestions = sym_spell.lookup_compound(
            inp, 
            max_edit_distance=i, 
            ignore_non_words=True, 
            ignore_term_with_digits=True, 
            transfer_casing=True,
            split_by_space=True,
        )
        for s in suggestions:
            print(f"{s.term}\nedit_distance={s.distance}, term_frequency={s.count}")

len(input)=217
, 20 April 2007 (UTC) Oh, ALSO, what's particularly funny is that on that ANI discussion, Gene Poole says that Elonka asked him to look at the edits and that in his opinion the edits are the same as Wik/Gnomerplatz...
max_edit_distance=1
a 20 April 2007 bUT Cd Ohm also, what's particularly funny is that on that ANI discussion Gene Poole says that A lanka asked him to look at the edits and that in his opinion the edits are the same as Wik/Gnomerplatz...
edit_distance=13, term_frequency=0
max_edit_distance=2
a 20 April 2007 dUTCh Ohm also, what's particularly funny is that on that ANI discussion Gene Poole says that Lanka asked him to look at the edits and that in his opinion the edits are the same as Wik/Gnomerplatz...
edit_distance=12, term_frequency=0
len(input)=220
besides it being hypocritical with his and Elonka's known history of sockpuppet use to try to prevail in conflicts/votes, the post there is trying to claim that this was something new he came up with based u