
correct() returns empty object #99

Closed
shubhams opened this issue Oct 22, 2015 · 14 comments

@shubhams

I tried using spell checking, but the correct() method returns an empty object. The following shows the method calls in a terminal session:

>>> from textblob import TextBlob
>>> b = TextBlob("I havv goood speling!")
>>> b.correct()
TextBlob("")
>>> print(b.correct())

>>> 

I couldn't find a fix for this. I'm running Python 2.7.6 on Linux.

@vipul-sharma20

>>> import textblob
>>> textblob.__version__
'0.10.0'
>>> b = TextBlob("I havv goood speling!")
>>> b.correct()
TextBlob("I have good spelling!")
>>>

@evanwill

I just installed textblob today and am having the same issue on both Python 3 and 2; here is 2.7:

>>> import sys
>>> print(sys.version)
2.7.9 | 64-bit | (default, Jul  1 2015, 03:41:50) [MSC v.1500 64 bit (AMD64)]
>>> import textblob
>>> textblob.__version__
'0.10.0'
>>> from textblob import TextBlob
>>> b = TextBlob("I havv goood speling!")
>>> b.correct()
TextBlob("")

@niless commented Oct 27, 2015

I have the same issue. The word-level spellcheck() is working, though. I'm on Python 2.7.
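
For reference, this is the word-level call that still works for me (output will look something like this; the exact candidates and confidences depend on the corpus):

>>> from textblob import Word
>>> Word("havv").spellcheck()
[('have', 1.0)]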

@evanwill

I looked into this issue today; it seems there is a problem with the regex that correct() passes to nltk.tokenize.regexp_tokenize.

blob.py currently has:

    def correct(self):
        """Attempt to correct the spelling of a blob.

        .. versionadded:: 0.6.0

        :rtype: :class:`BaseBlob <BaseBlob>`
        """
        # regex matches: contraction or word or punctuation or whitespace
        tokens = nltk.tokenize.regexp_tokenize(self.raw, "\w*('\w*)+|\w+|[^\w\s]|\s")
        corrected = (Word(w).correct() for w in tokens)
        ret = ''.join(corrected)
        return self.__class__(ret)

The regex \w*('\w*)+ is intended to capture contractions. If you check it on http://www.regexr.com/ it does work (although it also matches ' alone). However, if you try it in Python, it always returns blank strings. I guess regexp_tokenize does not like the () capture group?
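
You can reproduce the blank strings with plain re.findall, which I assume is what regexp_tokenize boils down to here, since the output matches. With a capturing group in the pattern, findall returns the group's contents instead of the whole match, and the group never participates when a token has no apostrophe:

>>> import re
>>> re.findall(r"\w*('\w*)+|\w+|[^\w\s]|\s", "I havv goood speling!")
['', '', '', '', '', '', '', '']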

The capture group is actually unnecessary and could be replaced by \w+'\w+ (which captures all normal contractions) or \w*'\w* (which would capture normal contractions plus a word with a leading or trailing '). Either option works in correct().

However, in testing I noticed that it's kind of pointless, because the correction isn't very good for common contractions ("can't" is replaced with "canst", "we'll" with "well").
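
For example, calling the word-level corrector directly on those tokens (the same call correct() makes per token; results from my environment):

>>> from textblob import Word
>>> print(Word("can't").correct())
canst
>>> print(Word("we'll").correct())
well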

If you want to play around with this manually, correct() basically does this step-by-step:

b = "I havv goood speling!"
from nltk.tokenize import regexp_tokenize
tokens = regexp_tokenize(b, "\w+'\w+|\w+|[^\w\s]|\s")
from textblob import Word
corrected = (Word(w).correct() for w in tokens)
ret = ''.join(corrected)
print (ret)

Anyone have an idea of why the capture group was included, or why it doesn't work in Python?

@AZaitzeff

I have the same issue on Python 3, textblob version 0.10.0.

@sloria (Owner) commented Oct 31, 2015

This seems to be caused by a change in NLTK 3.1 (see #97 (comment)). Downgrading to nltk==3.0.5 should fix the problem. I'll try to look into the compat issue when I get the chance.
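
Something like this should confirm the workaround (a sketch, assuming nltk is the only package you downgrade):

>>> import nltk
>>> nltk.__version__
'3.0.5'
>>> from textblob import TextBlob
>>> TextBlob("I havv goood speling!").correct()
TextBlob("I have good spelling!")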

@iamaziz commented Oct 31, 2015

Before :)

>>> from textblob import TextBlob, __version__
>>> __version__
'0.9.0'
>>> b = TextBlob('I havv good speling!')
>>> b.correct()
TextBlob("I have good spelling!")

After :(

>>> from textblob import TextBlob, __version__
>>> __version__
'0.10.0'
>>> b = TextBlob('I havv good speling!')
>>> b.correct()
TextBlob("")

I was using NLTK 3.1 in both examples.

@sloria (Owner) commented Oct 31, 2015

@iamaziz The bug is due to an incompatibility with NLTK 3.1. Downgrading textblob won't make a difference. The next version of textblob will support nltk>=3.1. I am working on this now.

@sloria closed this as completed in 48d3dc4 on Oct 31, 2015

@sloria (Owner) commented Oct 31, 2015

This is now fixed on dev.

@iamaziz commented Oct 31, 2015

That's quick! Thanks @sloria 👍

@shubhams (Author) commented Nov 2, 2015

Works just fine after the update. Thanks a lot! :)

@evanwill commented Nov 2, 2015

You just removed the attempt to correct contractions altogether?

@sloria (Owner) commented Nov 3, 2015

I believe the tokenization on contractions was unnecessary and possibly incorrect. The spelling corrector should correct contractions.

@evanwill commented Nov 3, 2015

It's complicated, because correct() doesn't seem particularly accurate with contractions. I don't think the new tokenization will fix many contractions, because it separates them into different tokens. If the spelling mistake is at the beginning ("cann't"), it should get fixed; if the ' is in the wrong place ("ca'nt"), it will probably give a wildly inaccurate correction; and if the mistake is at the end ("can'tt"), it probably won't be corrected.

Your commit took the old:

# regex matches: contraction or word or punctuation or whitespace
tokens = nltk.tokenize.regexp_tokenize(self.raw, "\w*('\w*)+|\w+|[^\w\s]|\s")

and replaced it with:

# regex matches: word or punctuation or whitespace
tokens = nltk.tokenize.regexp_tokenize(self.raw, "\w+|[^\w\s]|\s")

if you test it with

b = "I couuldn't spel good, ca'nt and won'tt!"

the old one returns a list of mostly empty strings:

>>> tokens = regexp_tokenize(b, "\w*('\w*)+|\w+|[^\w\s]|\s")
>>> tokens
['', '', "'t", '', '', '', '', '', '', "'nt", '', '', '', "'tt", '']

The new one returns:

>>> tokens = regexp_tokenize(b, "\w+|[^\w\s]|\s")
>>> tokens
['I', ' ', 'couuldn', "'", 't', ' ', 'spel', ' ', 'good', ',', ' ', 'ca', "'", 'nt', ' ', 'and', ' ', 'won', "'", 'tt', '!']

So when correct() runs, you get "I couldn't spell good, ca'it and won'tt!" because the contractions are split into separate tokens.
As I posted above, you could use:

>>> tokens = regexp_tokenize(b, "\w+'\w+|\w+|[^\w\s]|\s")
>>> tokens
['I', ' ', "couuldn't", ' ', 'spel', ' ', 'good', ',', ' ', "ca'nt", ' ', 'and', ' ', "won'tt", '!']

But, because of the limitations of correct(), it results in "I couuldn't spell good, can and wont!", which is not better. I guess it should be a known limitation that contractions come out even less accurate than correct() is in general?
