
## SpaCy_HUNSPELL

This package grew out of an issue on the spaCy GitHub repository: https://github.com/explosion/spaCy/issues/315

In that issue, `Ines` proposes building a custom pipeline component to add a spell check step to spaCy. The user `tokestermw` used the `PyHunSpell` package to build such a component and published it under the name `spacy_hunspell`.

PyHunSpell GitHub: https://github.com/blatinier/pyhunspell

spacy_hunspell GitHub: https://github.com/tokestermw/spacy_hunspell

## Installation

In [2]:
! pip install spacy_hunspell

Collecting spacy_hunspell
  Downloading https://files.pythonhosted.org/packages/d9/6a/d977f74eff8354a5fdd6b5c0d8b4f8caa8d676970e18ff961694d978e7f7/spacy_hunspell-0.1.0.tar.gz
Collecting hunspell==0.5.0
  Downloading https://files.pythonhosted.org/packages/2d/77/8c68d28afca3b07d3b89d3c60af56e1a3e5f381ddd1bc01f31e97233a03c/hunspell-0.5.0.tar.gz
Building wheels for collected packages: spacy-hunspell, hunspell
  Building wheel for spacy-hunspell (setup.py) ... done
  Created wheel for spacy-hunspell: filename=spacy_hunspell-0.1.0-cp36-none-any.whl size=3056 sha256=ce82884ae928a958b7b5ab4cb91a368f9d288d37a3f4f9adbabf0c6e106354e1
  Stored in directory: /root/.cache/pip/wheels/07/b2/18/b5aa882df45e376e0f52ca7306c450f548ca150168338c0468
  Building wheel for hunspell (setup.py) ... done
  Created wheel for hunspell: filename=hunspell-0.5.0-cp36-cp36m-linux_x86_64.whl size=59722 sha256=ecde08abd0b371c35ad3f6349e61b84500a20d2c527108ce8ba1340460274e67
  Stored in directory:

In [0]:
## If the installation fails on Linux, install the Hunspell development package first
# ! sudo apt-get install libhunspell-dev

## Dependencies (as listed in requirements.txt)
#! pip install 'hunspell==0.5.0'
#! pip install 'spacy>=2.0.0'

In [0]:
!python -m spacy download en_core_web_md

# After downloading the model, restart the runtime so that the code can load it.
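
Once the runtime has restarted, it can be useful to confirm that the model package is visible to Python before trying to load it. The following is a minimal sketch, assuming the model was installed as a regular Python package; the helper name `is_model_installed` is made up for this example and is not part of spaCy.

```python
import importlib.util

def is_model_installed(package_name):
    """Return True if the import system can find a top-level package with this name."""
    return importlib.util.find_spec(package_name) is not None

# Prints True only if the model package is installed in the current environment
print(is_model_installed("en_core_web_md"))
```

If this prints `False` right after the download, restarting the runtime is the first thing to try.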

## Example of use

In [0]:
import spacy
from spacy_hunspell import spaCyHunSpell

Load model

In [0]:
nlp = spacy.load('en_core_web_md')

We customize the spaCy pipeline by adding the spell check component.

The component needs a dictionary. We can use the system dictionary or a custom one:

 * On macOS: `hunspell = spaCyHunSpell(nlp, 'mac')`
 * On Linux: `hunspell = spaCyHunSpell(nlp, 'linux')`
 * With a custom dictionary: `hunspell = spaCyHunSpell(nlp, ('en_US.dic', 'en_US.aff'))`
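
The three options above could be wrapped in a small helper that picks the second argument of `spaCyHunSpell` for the running platform. This is only a sketch: `dictionary_argument` is a hypothetical helper, not part of `spacy_hunspell`, and it assumes that `sys.platform` values starting with `darwin` and `linux` correspond to the `'mac'` and `'linux'` options.

```python
import sys

def dictionary_argument(platform_name=sys.platform, custom_paths=None):
    """Choose the second argument for spaCyHunSpell(nlp, ...).

    custom_paths, when given, is a (dic_path, aff_path) pair and takes priority.
    """
    if custom_paths is not None:
        dic_path, aff_path = custom_paths
        return (dic_path, aff_path)
    if platform_name.startswith("darwin"):
        return "mac"
    if platform_name.startswith("linux"):
        return "linux"
    # Assumption: no system dictionary shortcut exists for other platforms
    raise ValueError(
        f"No system dictionary known for platform {platform_name!r}; "
        "pass custom_paths instead"
    )

print(dictionary_argument("linux"))
print(dictionary_argument("win32", custom_paths=("en_US.dic", "en_US.aff")))
```

The component would then be created with `spaCyHunSpell(nlp, dictionary_argument())`.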

In [3]:
print(f'Original spaCy pipeline : {nlp.pipe_names}')

if 'hunspell' not in nlp.pipe_names:
  hunspell = spaCyHunSpell(nlp, 'linux')
  nlp.add_pipe(hunspell)

print(f'New spaCy pipeline : {nlp.pipe_names}')

Original spaCy pipeline : ['tagger', 'parser', 'ner']
/usr/share/hunspell/en_US.dic /usr/share/hunspell/en_US.aff
New spaCy pipeline : ['tagger', 'parser', 'ner', 'hunspell']


Define a helper that replaces each misspelled token with its first suggestion

In [0]:
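def correct_sentence(doc):
  """Return the text of doc with each misspelled token replaced by its first suggestion."""
  corrected = ""
  for token in doc:
    if not token._.hunspell_spell:
      suggestions = token._.hunspell_suggest
      print(f'Bad spell : {token}')
      print(f'Correction proposals : {suggestions}')

      # Debug output: whether the token is whitespace
      print(token.is_space)

      # Fall back to the original text when Hunspell has no suggestion
      corrected += " " + (suggestions[0] if suggestions else token.text)
    else:
      corrected += " " + token.text

  return corrected.strip()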

In [32]:
doc = nlp('I can haz cheezeburger.')
correct_sentence(doc)

Bad spell : haz
Correction proposals : ['ha', 'haze', 'hazy', 'has', 'hat', 'had', 'hag', 'ham', 'hap', 'hay', 'haw', 'ha z']
False
Bad spell : cheezeburger
Correction proposals : ['cheeseburger', 'vegeburger']
False


'I can ha cheeseburger .'
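
The result above has a space before the final period because the helper joins all tokens with a single space. A possible variant appends each token's original trailing whitespace instead. It is sketched here under the assumption that each token exposes spaCy's `text` and `whitespace_` attributes together with the `hunspell_spell` and `hunspell_suggest` extensions used above. The names `correct_preserving_spacing` and `fake_token` are made up for this example, and the `SimpleNamespace` objects are stand-ins so that the sketch runs without spaCy or Hunspell installed.

```python
from types import SimpleNamespace

def correct_preserving_spacing(tokens):
    """Rebuild the text, replacing misspelled tokens and keeping the original spacing."""
    parts = []
    for token in tokens:
        if token._.hunspell_spell:
            word = token.text
        else:
            suggestions = token._.hunspell_suggest
            # Keep the original text when there is no suggestion
            word = suggestions[0] if suggestions else token.text
        parts.append(word + token.whitespace_)
    return "".join(parts)

def fake_token(text, whitespace, ok=True, suggestions=()):
    """Stand-in for a spaCy token (illustration only, not a spaCy API)."""
    ext = SimpleNamespace(hunspell_spell=ok, hunspell_suggest=list(suggestions))
    return SimpleNamespace(text=text, whitespace_=whitespace, _=ext)

tokens = [
    fake_token("I", " "),
    fake_token("can", " "),
    fake_token("haz", " ", ok=False, suggestions=["ha", "haze"]),
    fake_token("cheezeburger", "", ok=False, suggestions=["cheeseburger"]),
    fake_token(".", ""),
]
print(correct_preserving_spacing(tokens))  # I can ha cheeseburger.
```

With the real pipeline, the same function should accept a `Doc` directly, since iterating over a `Doc` yields its tokens.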

In [33]:
doc = nlp('This street is peacefull')
correct_sentence(doc)

Bad spell : peacefull
Correction proposals : ['peaceful', 'peacefully', 'peace full', 'peace-full', 'peaceful l']
False


'This street is peaceful'