MissingCorpusError while lemmatize #160

Closed
kapkirl opened this Issue Apr 28, 2017 · 4 comments


kapkirl commented Apr 28, 2017

I tried to run some code with lemmatization. I didn't expect that the tags returned by pos_tags can't be used directly for lemmatization.
Here is my code:

from textblob import TextBlob, Word

text = "They told us to duck"
blob = TextBlob(text)

for word, pos in blob.pos_tags:
    w = Word(word)
    print("{w}: {p}".format(w=w.lemmatize(pos), p=pos))

Running it raises an error:

File "/usr/local/lib/python3.6/site-packages/textblob/decorators.py", line 38, in decorated
raise MissingCorpusError()

MissingCorpusError:
Looks like you are missing some required data for this feature.

To download the necessary data, simply run

python -m textblob.download_corpora

or use the NLTK downloader to download the missing data: http://nltk.org/data.html
If this doesn't fix the problem, file an issue at https://github.com/sloria/TextBlob/issues.

I don't understand how these things are connected, but the solution is to convert the tags (pos) to the corresponding WordNet format before calling w.lemmatize(pos).
Here is an example on SO.
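The conversion the commenter describes can be sketched as follows. The helper name penn_to_wordnet_tag is my own illustration, not part of the TextBlob API; the single-letter return values are the WordNet part-of-speech codes that Word.lemmatize() accepts:

```python
def penn_to_wordnet_tag(penn_tag):
    """Map a Penn Treebank tag to the single-letter WordNet pos
    ('a' adjective, 'v' verb, 'r' adverb, 'n' noun) expected by
    the WordNet lemmatizer."""
    if penn_tag.startswith('J'):
        return 'a'
    if penn_tag.startswith('V'):
        return 'v'
    if penn_tag.startswith('R'):
        return 'r'
    return 'n'  # noun is the WordNet lemmatizer's default pos

# With TextBlob (requires the downloaded corpora):
# from textblob import TextBlob, Word
# for word, pos in TextBlob("They told us to duck").pos_tags:
#     print(Word(word).lemmatize(penn_to_wordnet_tag(pos)))

print(penn_to_wordnet_tag('VBD'))  # 'v', so "told" lemmatizes to "tell"
```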

jschnurr (Collaborator) commented Jan 3, 2018

The lemmatizer requires the NLTK corpus to run. Did you run python -m textblob.download_corpora?

kapkirl commented Jan 16, 2018

I tried that, and it doesn't help:

$ sudo python -m textblob.download_corpora
[nltk_data] Downloading package brown to /home/user/nltk_data...
[nltk_data] Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /home/user/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /home/user/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /home/user/nltk_data...
[nltk_data] Package averaged_perceptron_tagger is already up-to-
[nltk_data] date!
[nltk_data] Downloading package conll2000 to /home/user/nltk_data...
[nltk_data] Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data] /home/user/nltk_data...
[nltk_data] Package movie_reviews is already up-to-date!

jschnurr (Collaborator) commented Jan 19, 2018

The error is a red herring; here's what's happening.

blob.pos_tags returns POS tags using the Penn Treebank tagset, but the WordNet lemmatizer expects the WordNet tagset. When you call the lemmatize method with a Treebank tag, NLTK throws a KeyError, which bubbles up as a missing-corpora error.
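The reason a KeyError surfaces as MissingCorpusError: TextBlob's requires_nltk_corpus decorator catches LookupError (the exception NLTK raises for missing corpora), and in Python KeyError is a subclass of LookupError, so the tagset KeyError is mistaken for missing data. A minimal sketch of that mechanism, assuming the decorator works roughly as in textblob/decorators.py (this is not TextBlob's exact source):

```python
class MissingCorpusError(Exception):
    pass

def requires_nltk_corpus(func):
    """Re-raise any LookupError as MissingCorpusError, as the
    real decorator does for missing NLTK data."""
    def decorated(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except LookupError:
            raise MissingCorpusError()
    return decorated

@requires_nltk_corpus
def lemmatize(tag):
    # Stand-in for the lemmatizer's internal pos lookup.
    return {'n': 'noun', 'v': 'verb'}[tag]

try:
    lemmatize('VBD')  # Treebank tag, not in the WordNet tagset
except MissingCorpusError:
    print("KeyError (a LookupError subclass) re-raised as MissingCorpusError")
```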

As a workaround, use the lemma property, which converts the tags automatically:

print("{w}: {p}".format(w=w.lemma, p=pos))

I'll submit a PR to automatically convert the tags when using lemmatize as well.

@sloria sloria closed this in 53e45eb Jan 20, 2018

sloria added a commit that referenced this issue Jan 20, 2018

Merge pull request #187 from jschnurr/dev
fix #160 convert pos  tags from treebank to wordnet for lemmatize method
jmiguelv commented Mar 9, 2018

I am still seeing this issue in 0.15.1 for tags that are not matched by the _penn_to_wordnet function, for instance PRP (personal pronoun). In that case the function returns None, which is an invalid POS tag to pass to lemmatizer.lemmatize(self.string, tag).

A solution could be for _penn_to_wordnet to return a default value rather than None.
