
MissingCorpusError while lemmatizing #160

Closed
kapkirl opened this issue Apr 28, 2017 · 9 comments

Comments

@kapkirl commented Apr 28, 2017

I tried to run some code with lemmatization. I did not expect that pos_tags would not be ready to use for lemmatization directly. Here is my code:

text = "They told us to duck"

blob = TextBlob(text)

for word, pos in blob.pos_tags:
    w = Word(word)
    print("{w}: {p}".format(w=w.lemmatize(pos), p=pos))

So I get this error:

File "/usr/local/lib/python3.6/site-packages/textblob/decorators.py", line 38, in decorated
raise MissingCorpusError()

MissingCorpusError:
Looks like you are missing some required data for this feature.

To download the necessary data, simply run

python -m textblob.download_corpora

or use the NLTK downloader to download the missing data: http://nltk.org/data.html
If this doesn't fix the problem, file an issue at https://github.com/sloria/TextBlob/issues.

I don't understand how these things are connected. But the solution is to convert the tags (pos) to the appropriate WordNet format before calling w.lemmatize(pos).
Here is an example on SO.
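
For illustration, a minimal sketch of that conversion applied to the snippet above; the penn_to_wordnet helper name is mine, not TextBlob's, and it maps Treebank tag prefixes to the standard nltk.corpus.wordnet constants, falling back to noun:

from textblob import TextBlob, Word
from nltk.corpus import wordnet

# Hypothetical helper: map a Penn Treebank tag to a WordNet pos constant.
def penn_to_wordnet(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # assumed default; the lemmatizer's own default pos is noun

blob = TextBlob("They told us to duck")
for word, pos in blob.pos_tags:
    w = Word(word)
    print("{w}: {p}".format(w=w.lemmatize(penn_to_wordnet(pos)), p=pos))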

@jschnurr (Collaborator) commented Jan 3, 2018

The lemmatizer requires the NLTK corpus to run. Did you run python -m textblob.download_corpora?

@kapkirl (Author) commented Jan 16, 2018

I tried that, and it doesn't help.

$ sudo python -m textblob.download_corpora
[nltk_data] Downloading package brown to /home/user/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /home/user/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /home/user/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to /home/user/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-date!
[nltk_data] Downloading package conll2000 to /home/user/nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to /home/user/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!

@jschnurr (Collaborator) commented Jan 19, 2018

The error is a red herring; here's what's happening.

blob.pos_tags returns POS tags using the Penn Treebank tagset, but the WordNet lemmatizer expects the WordNet tagset. When you call the lemmatize method with a Treebank tag, NLTK throws a KeyError, which bubbles up as a MissingCorpusError.

To work around it, use the lemma property, which automatically converts the tags:

print("{w}: {p}".format(w=w.lemma, p=pos))

I'll submit a PR to automatically convert the tags when using lemmatize as well.
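
Applied to the original snippet, that looks like the following minimal sketch (assuming, per the explanation above, that pos_tags hands back Word objects whose lemma property picks up the tag assigned in context):

from textblob import TextBlob

blob = TextBlob("They told us to duck")
for word, pos in blob.pos_tags:
    # lemma converts the Treebank tag to a WordNet tag internally
    print("{w}: {p}".format(w=word.lemma, p=pos))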

sloria closed this in 53e45eb Jan 20, 2018
sloria added a commit that referenced this issue Jan 20, 2018
fix #160 convert pos tags from treebank to wordnet for lemmatize method

@jmiguelv commented Mar 9, 2018

I am still seeing this issue in 0.15.1 for tags that are not matched by the _penn_to_wordnet function, for instance PRP (personal pronoun). The function returns None, which is an invalid pos tag to pass to lemmatizer.lemmatize(self.string, tag).

A solution could be for _penn_to_wordnet to return a default value rather than None.
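
For illustration, a caller-side sketch of that idea (it imports the private _penn_to_wordnet helper purely to demonstrate; noun is an assumed fallback, matching the lemmatizer's own default of pos='n'):

from nltk.corpus import wordnet
from textblob import Word
from textblob.blob import _penn_to_wordnet  # private helper, imported only for illustration

word, tag = Word('We'), 'PRP'
# Fall back to noun whenever the Treebank tag has no WordNet equivalent.
print(word.lemmatize(_penn_to_wordnet(tag) or wordnet.NOUN))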

@dylan-chong commented Apr 26, 2019

I am using TextBlob 0.15.3, and it seems that PR #187 has an issue related to this.

>>> Word('better').lemmatize('a')
'good'
>>> Word('We').lemmatize('PRP')
None
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/textblob/decorators.py", line 35, in decorated
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/textblob/blob.py", line 152, in lemmatize
    return lemmatizer.lemmatize(self.string, tag)
  File "/usr/local/lib/python3.7/site-packages/nltk/stem/wordnet.py", line 41, in lemmatize
    lemmas = wordnet._morphy(word, pos)
  File "/usr/local/lib/python3.7/site-packages/nltk/corpus/reader/wordnet.py", line 1884, in _morphy
    exceptions = self._exception_map[pos]
KeyError: None

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/textblob/decorators.py", line 38, in decorated
    raise MissingCorpusError()
textblob.exceptions.MissingCorpusError:
Looks like you are missing some required data for this feature.

To download the necessary data, simply run

    python -m textblob.download_corpora

or use the NLTK downloader to download the missing data: http://nltk.org/data.html
If this doesn't fix the problem, file an issue at https://github.com/sloria/TextBlob/issues.

>>>

I have already run python -m textblob.download_corpora and also installed treebank and all with nltk.download(...), but it still gives that error. It looks like there should be a better error message.

@dylan-chong commented Apr 26, 2019

I did some digging, and it looks like this conversion is non-exhaustive:

def _penn_to_wordnet(tag):
    """Converts a Penn corpus tag into a Wordnet tag."""
    if tag in ("NN", "NNS", "NNP", "NNPS"):
        return _wordnet.NOUN
    if tag in ("JJ", "JJR", "JJS"):
        return _wordnet.ADJ
    if tag in ("VB", "VBD", "VBG", "VBN", "VBP", "VBZ"):
        return _wordnet.VERB
    if tag in ("RB", "RBR", "RBS"):
        return _wordnet.ADV
    return None

Here are some of the (unique) unmatched tags that I found in my dataset:

CC, CD, DT, EX, FW, IN, MD, PDT, POS, PRP, PRP$, RP, SYM, TO, UH, WDT, WP, WRB

@dylan-chong commented Apr 26, 2019

This is my workaround for now to stop it from throwing errors, if anyone needs it.

import nltk
from textblob import TextBlob

nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('wordnet')

# Hack to avoid strange errors with textblob
from textblob import blob

def _penn_to_wordnet(tag):
    """Converts a Penn corpus tag into a Wordnet tag."""
    _wordnet = blob._wordnet
    if tag in ("NN", "NNS", "NNP", "NNPS"):
        return _wordnet.NOUN
    if tag in ("JJ", "JJR", "JJS"):
        return _wordnet.ADJ
    if tag in ("VB", "VBD", "VBG", "VBN", "VBP", "VBZ"):
        return _wordnet.VERB
    if tag in ("RB", "RBR", "RBS"):
        return _wordnet.ADV
    # Print a warning instead of returning None, and fall back to noun.
    print('_penn_to_wordnet warning: no conversion found for ' + tag)
    return _wordnet.NOUN

blob._penn_to_wordnet = _penn_to_wordnet

@MuthuGsubramanian commented Sep 23, 2019

I'm not sure why this is closed, but when I try the code below, it still doesn't work:

list_txt = [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]
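
For reference, here is a self-contained guess at what that snippet assumes; lemmatizer and w_tokenizer are undefined in the comment, so WordNetLemmatizer and WhitespaceTokenizer are my assumptions. Note that it calls lemmatize without a pos argument, which defaults to noun and so never touches the tag conversion discussed above:

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import WhitespaceTokenizer

nltk.download('wordnet')  # corpus required by the lemmatizer

lemmatizer = WordNetLemmatizer()     # assumed; not defined in the comment
w_tokenizer = WhitespaceTokenizer()  # assumed; not defined in the comment

text = "They told us to duck"
list_txt = [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]
print(list_txt)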

@Humanshi-Bachiyani commented Jul 20, 2020

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/textblob/decorators.py", line 35, in decorated
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/textblob/tokenizers.py", line 57, in tokenize
    return nltk.tokenize.sent_tokenize(text)
  File "/usr/local/lib/python3.6/dist-packages/nltk/tokenize/__init__.py", line 106, in sent_tokenize
    tokenizer = load("tokenizers/punkt/{0}.pickle".format(language))
  File "/usr/local/lib/python3.6/dist-packages/nltk/data.py", line 752, in load
    opened_resource = _open(resource_url)
  File "/usr/local/lib/python3.6/dist-packages/nltk/data.py", line 877, in _open
    return find(path_, path + [""]).open()
  File "/usr/local/lib/python3.6/dist-packages/nltk/data.py", line 585, in find
    raise LookupError(resource_not_found)
LookupError:


Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:

>>> import nltk
>>> nltk.download('punkt')

For more information see: https://www.nltk.org/data.html

Attempted to load tokenizers/punkt/PY3/english.pickle

Searched in:
- '/home/User43386032/nltk_data'
- '/usr/nltk_data'
- '/usr/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "submission.py", line 7, in <module>
    print(wiki.tags)
  File "/usr/local/lib/python3.6/dist-packages/textblob/decorators.py", line 24, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "/usr/local/lib/python3.6/dist-packages/textblob/blob.py", line 484, in pos_tags
    return [val for sublist in [s.pos_tags for s in self.sentences] for val in sublist]
  File "/usr/local/lib/python3.6/dist-packages/textblob/decorators.py", line 24, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "/usr/local/lib/python3.6/dist-packages/textblob/blob.py", line 639, in sentences
    return self._create_sentence_objects()
  File "/usr/local/lib/python3.6/dist-packages/textblob/blob.py", line 683, in _create_sentence_objects
    sentences = sent_tokenize(self.raw)
  File "/usr/local/lib/python3.6/dist-packages/textblob/base.py", line 64, in itokenize
    return (t for t in self.tokenize(text, *args, **kwargs))
  File "/usr/local/lib/python3.6/dist-packages/textblob/decorators.py", line 38, in decorated
    raise MissingCorpusError()
textblob.exceptions.MissingCorpusError:
Looks like you are missing some required data for this feature.

To download the necessary data, simply run

    python -m textblob.download_corpora

or use the NLTK downloader to download the missing data: http://nltk.org/data.html
If this doesn't fix the problem, file an issue at https://github.com/sloria/TextBlob/issues.

I have the same issue and am also not able to find a solution.
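
Note that the traceback above is a plain LookupError for the punkt tokenizer data rather than the tag-conversion problem discussed earlier in this thread, so downloading that one resource should clear it (a minimal sketch, following the instructions in the error message itself):

import nltk
nltk.download('punkt')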
