correct() returns empty object #99

Closed
shubhams opened this Issue Oct 22, 2015 · 14 comments

@shubhams

shubhams commented Oct 22, 2015

I tried using spell checking, but the correct() method returns an empty object. The following shows the method call in a terminal:

>>> from textblob import TextBlob
>>> b = TextBlob("I havv goood speling!")
>>> b.correct()
TextBlob("")
>>> print(b.correct())

>>> 

I couldn't find a fix for this. I'm running Python 2.7.6 on Linux.

@vipul-sharma20

vipul-sharma20 commented Oct 23, 2015

>>> import textblob
>>> textblob.__version__
'0.10.0'
>>> from textblob import TextBlob
>>> b = TextBlob("I havv goood speling!")
>>> b.correct()
TextBlob("I have good spelling!")
>>>
@evanwill

evanwill commented Oct 27, 2015

I just installed textblob today and I'm having the same issue on both Python 3 and Python 2; here is 2.7:

>>> import sys
>>> print(sys.version)
2.7.9 | 64-bit | (default, Jul  1 2015, 03:41:50) [MSC v.1500 64 bit (AMD64)]
>>> import textblob
>>> textblob.__version__
'0.10.0'
>>> from textblob import TextBlob
>>> b = TextBlob("I havv goood speling!")
>>> b.correct()
TextBlob("")
@niless

niless commented Oct 27, 2015

I have the same issue. The word-level spellcheck() is working, though. I'm on Python 2.7.
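
For reference, here is the word-level call that still works (a minimal sketch; I'm assuming the usual spellcheck() shape of (candidate, confidence) pairs, with scores depending on the bundled corpus):

from textblob import Word

w = Word("havv")
print(w.correct())          # single-word correction works
pairs = w.spellcheck()      # list of (candidate, confidence) pairs
print(pairs[0][0])          # top candidate, e.g. "have"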

@evanwill

evanwill commented Oct 28, 2015

I looked into this issue today; it seems there is a problem with the regex passed to nltk.tokenize.regexp_tokenize inside the function.

blob.py currently has:

    def correct(self):
        """Attempt to correct the spelling of a blob.

        .. versionadded:: 0.6.0

        :rtype: :class:`BaseBlob <BaseBlob>`
        """
        # regex matches: contraction or word or punctuation or whitespace
        tokens = nltk.tokenize.regexp_tokenize(self.raw, "\w*('\w*)+|\w+|[^\w\s]|\s")
        corrected = (Word(w).correct() for w in tokens)
        ret = ''.join(corrected)
        return self.__class__(ret)

The regex \w*('\w*)+ is intended to capture contractions. If you check it on http://www.regexr.com/ it does work (although it also matches a lone '). However, if you try it in Python, it always returns blank strings. I guess it does not like the () capture group?
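
This is most likely down to how Python's re.findall handles capture groups: when a pattern contains a group, findall returns what the group matched instead of the whole match, so tokens matched by the other alternatives come back as empty strings. (Presumably older NLTK rewrote capture groups to non-capturing ones internally, which would explain why this only broke with NLTK 3.1.) A minimal demonstration with plain re:

import re

text = "I havv goood speling!"

# With a capturing group, findall returns group 1 for every match;
# plain words match via the \w+ alternative, where group 1 is empty.
print(re.findall(r"\w*('\w*)+|\w+|[^\w\s]|\s", text))
# ['', '', '', '', '', '', '', '']

# A non-capturing group (?:...) restores the full-match behavior.
print(re.findall(r"\w*(?:'\w*)+|\w+|[^\w\s]|\s", text))
# ['I', ' ', 'havv', ' ', 'goood', ' ', 'speling', '!']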

The capture group is actually unnecessary, and could be replaced by \w+'\w+ (which captures all normal contractions) or \w*'\w* (which would capture normal contractions plus a word with a leading or trailing '). Either option works in correct().
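
The difference between the two only shows up on words with a leading or trailing apostrophe; a quick check with plain re on a made-up sample:

import re

s = "can't 'tis rock'"

# \w+'\w+ only keeps internal apostrophes attached to the word
print(re.findall(r"\w+'\w+|\w+|[^\w\s]|\s", s))
# ["can't", ' ', "'", 'tis', ' ', 'rock', "'"]

# \w*'\w* also glues a leading or trailing ' onto the word
print(re.findall(r"\w*'\w*|\w+|[^\w\s]|\s", s))
# ["can't", ' ', "'tis", ' ', "rock'"]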

However, in testing I noticed that it's kind of pointless, because the correction isn't very good for common contractions ("can't" is replaced with "canst", "we'll" with "well").
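
You can see that at the word level directly (using the corrections I observed; your results may differ by corpus):

from textblob import Word

print(Word("can't").correct())   # came back as "canst" for me
print(Word("we'll").correct())   # came back as "well"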

If you want to play around with this manually, correct() basically does this step-by-step:

b = "I havv goood speling!"
from nltk.tokenize import regexp_tokenize
tokens = regexp_tokenize(b, "\w+'\w+|\w+|[^\w\s]|\s")
from textblob import Word
corrected = (Word(w).correct() for w in tokens)
ret = ''.join(corrected)
print (ret)

Anyone have an idea of why the capture group was included, or why it doesn't work in Python?

@AZaitzeff

AZaitzeff commented Oct 29, 2015

I have the same issue on Python 3, textblob version 0.10.0.

@sloria

Owner

sloria commented Oct 31, 2015

This seems to be caused by a change in NLTK 3.1 (see #97 (comment)). Downgrading to nltk==3.0.5 should fix the problem. I'll try to look into the compat issue when I get the chance.

@iamaziz

iamaziz commented Oct 31, 2015

Before :)

>>> from textblob import TextBlob, __version__
>>> __version__
'0.9.0'
>>> b = TextBlob('I havv good speling!')
>>> b.correct()
TextBlob("I have good spelling!")

After :(

>>> from textblob import TextBlob, __version__
>>> __version__
'0.10.0'
>>> b = TextBlob('I havv good speling!')
>>> b.correct()
TextBlob("")

I was using NLTK 3.1 in both examples.

@sloria

Owner

sloria commented Oct 31, 2015

@iamaziz The bug is due to an incompatibility with NLTK 3.1. Downgrading textblob won't make a difference. The next version of textblob will support nltk>=3.1. I am working on this now.

@sloria sloria closed this in 48d3dc4 Oct 31, 2015

@sloria

Owner

sloria commented Oct 31, 2015

This is now fixed on dev.

@iamaziz

iamaziz commented Oct 31, 2015

That's quick! Thanks @sloria 👍

@shubhams

shubhams commented Nov 2, 2015

Works just fine after the update. Thanks a lot! :)

@evanwill

evanwill commented Nov 2, 2015

You just removed the attempt to correct contractions altogether?

@sloria

Owner

sloria commented Nov 3, 2015

I believe the tokenization on contractions was unnecessary and possibly incorrect. The spelling corrector should correct contractions.

@evanwill

evanwill commented Nov 3, 2015

It's complicated, because correct() doesn't seem particularly accurate with contractions. I don't think the new tokenization will fix many contractions, because it splits them into separate tokens. If the spelling mistake is at the beginning ("cann't") it should get fixed; if the ' is in the wrong place ("ca'nt") it will probably give a wildly inaccurate correction; and if the mistake is at the end ("can'tt") it probably won't be corrected.
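
A quick check of how the new pattern splits those three cases (plain re, same regex as the commit):

import re

for s in ["cann't", "ca'nt", "can'tt"]:
    print(re.findall(r"\w+|[^\w\s]|\s", s))
# ['cann', "'", 't']
# ['ca', "'", 'nt']
# ['can', "'", 'tt']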

Your commit took the old:

# regex matches: contraction or word or punctuation or whitespace
tokens = nltk.tokenize.regexp_tokenize(self.raw, "\w*('\w*)+|\w+|[^\w\s]|\s")

and replaced it with:

# regex matches: word or punctuation or whitespace
tokens = nltk.tokenize.regexp_tokenize(self.raw, "\w+|[^\w\s]|\s")

if you test it with

b = "I couuldn't spel good, ca'nt and won'tt!"

the old one returns the weird list of mostly empty strings:

>>> tokens = regexp_tokenize(b, "\w*('\w*)+|\w+|[^\w\s]|\s")
>>> tokens
['', '', "'t", '', '', '', '', '', '', "'nt", '', '', '', "'tt", '']

The new one returns:

>>> tokens = regexp_tokenize(b, "\w+|[^\w\s]|\s")
>>> tokens
['I', ' ', 'couuldn', "'", 't', ' ', 'spel', ' ', 'good', ',', ' ', 'ca', "'", 'nt', ' ', 'and', ' ', 'won', "'", 'tt', '!']

So when correct() runs, you get "I couldn't spell good, ca'it and won'tt!" because the contractions are separate tokens.
As I posted above, you could use:

>>> tokens = regexp_tokenize(b, "\w+'\w+|\w+|[^\w\s]|\s")
>>> tokens
['I', ' ', "couuldn't", ' ', 'spel', ' ', 'good', ',', ' ', "ca'nt", ' ', 'and', ' ', "won'tt", '!']

But, because of the limitations of correct(), it results in "I couuldn't spell good, can and wont!", which is not better. I guess it should be a known limitation that correct() is even less accurate on contractions than it is in general?
