This repository has been archived by the owner on Aug 9, 2023. It is now read-only.

Spacy lemmatizer to use blank model #272

Merged 1 commit into main on Apr 9, 2021

Conversation

@lizgzil (Contributor) commented Apr 9, 2021

Description

Fixes https://github.com/wellcometrust/WellcomeML/issues/271 by using a blank spacy model.
Running on 1000 grant descriptions: the previous method took 33 secs (with the tagger disabled) and 14 secs (with the tagger enabled). The new method takes 2 secs.
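For reference, timings like these can be collected with a small stdlib harness along the following lines (a sketch: `time_lemmatizer` and the stand-in lemmatizer below are placeholders for illustration, not WellcomeML code or the actual spacy pipelines):

```python
# Minimal timing harness (a sketch; the "naive" stand-in below is a
# placeholder lemmatizer, not the spacy pipelines compared in this PR).
import timeit


def time_lemmatizer(lemmatize_fn, texts, repeats=3):
    """Best wall-clock time over `repeats` single runs of lemmatize_fn(texts)."""
    return min(timeit.repeat(lambda: lemmatize_fn(texts), number=1, repeat=repeats))


texts = ["the cats sat on the sitting room floors enjoying the sun"] * 1000
naive = lambda xs: [[tok.lower() for tok in x.split()] for x in xs]
print(f"{time_lemmatizer(naive, texts):.4f}s for {len(texts)} texts")
```

Taking the minimum over a few repeats gives a more stable number than a single run, since it discounts one-off interpreter or OS noise.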

I also noticed some unexpected behaviour: some of the other methods don't actually lemmatize the text.

X = ['the cats sat on the sitting room floors enjoying the sun']

Original method (when the model is already downloaded)

nlp = spacy.load("en_core_web_sm", disable=["ner", "tagger", "parser", "textcat"])
output = [[token.lemma_.lower() for token in doc] for doc in nlp.pipe(X)]

gives loads of warnings, one per token, of the form "spacy WARNING: [W108] The rule-based lemmatizer did not find POS annotation for the token 'sun'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'" (allenai seem to have had a similar issue with it)
and the output is:

[['the', 'cats', 'sat', 'on', 'the', 'sitting', 'room', 'floors', 'enjoying', 'the', 'sun']]

I realise this had a bug: the tagger shouldn't have been disabled, i.e.

nlp = spacy.load("en_core_web_sm", disable=["ner", "parser", "textcat"])
output = [[token.lemma_.lower() for token in doc] for doc in nlp.pipe(X)]

does seem to lemmatize correctly.

I also tried the following method:

from spacy.lang.en import English
from spacy.pipeline.lemmatizer import Lemmatizer
nlp = English()
lemmatizer = Lemmatizer(nlp, model=None, mode="lookup")
output = [[lemmatizer.lookup_lemmatize(token)[0].lower() for token in doc] for doc in nlp.pipe(X)]

which also gave:

[['the', 'cats', 'sat', 'on', 'the', 'sitting', 'room', 'floors', 'enjoying', 'the', 'sun']]

The final method, which I've committed, is:

nlp = spacy.blank("en")
nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
nlp.initialize()
output = [[token.lemma_.lower() for token in doc] for doc in nlp.pipe(X)]

gives:

[['the', 'cat', 'sit', 'on', 'the', 'sit', 'room', 'floor', 'enjoy', 'the', 'sun']]

Checklist

  • Added link to Github issue or Trello card
  • Added tests

@aCampello (Contributor) left a comment

Nice find. Is it much quicker then?


nlp = spacy.blank("en")
nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
Contributor:

Nice find. I think this needs spacy-lookups-data to be installed (which is probably installed automatically with spacy[lookups]).

@lizgzil (Author):

Just did a little experiment with a blank virtualenv, and it seems that as long as spacy[lookups] is installed, import spacy in the script is enough for this to work. Or am I confusing something?
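One quick way to check whether the lookups extra is present in an environment, without touching the spacy API (this assumes the spacy-lookups-data distribution is importable as `spacy_lookups_data`, and `has_spacy_lookups` is a hypothetical helper name):

```python
# Check whether the spacy[lookups] extra is available, using only the
# stdlib (assumes the distribution's module name is spacy_lookups_data;
# has_spacy_lookups is a hypothetical helper, not WellcomeML code).
import importlib.util


def has_spacy_lookups():
    return importlib.util.find_spec("spacy_lookups_data") is not None


print("lookups available:", has_spacy_lookups())
```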

Contributor:

Can you try what happens if the en_core_web_sm model is not installed? I think an IOError is still thrown, so you should keep the exception handling that @aCampello had.

@lizgzil (Author):

Seems to work without it being installed:

>>> import spacy
>>> nlp=spacy.load('en_core_web_sm')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/gallaghe/Code/nutrition-labels/build/virtualenv/lib/python3.7/site-packages/spacy/__init__.py", line 47, in load
    return util.load_model(name, disable=disable, exclude=exclude, config=config)
  File "/Users/gallaghe/Code/nutrition-labels/build/virtualenv/lib/python3.7/site-packages/spacy/util.py", line 329, in load_model
    raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
>>> nlp = spacy.blank("en")
>>> nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
<spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x15ad7ca00>
>>> nlp.initialize()
<thinc.optimizers.Optimizer object at 0x15aba8050>
>>> X = ['the cats sat on the sitting room floors enjoying the sun']
>>> [[token.lemma_.lower() for token in doc] for doc in nlp.pipe(X)]
[['the', 'cat', 'sit', 'on', 'the', 'sit', 'room', 'floor', 'enjoy', 'the', 'sun']]

Contributor:

Ok cool, just realised that you initialise a blank model. Good 👍, thanks for checking.

Contributor:

Yeah, that's actually a big advantage. You just need to install the lookups, not the entire model! Much simpler code too.

@lizgzil (Author) commented Apr 9, 2021

Nice find. Is it much quicker then?

Running on 1000 grant descriptions the previous method took around 14 secs (when tagger is not disabled) and this new method takes 2 seconds.

@aCampello (Contributor):

Pipeline seems to fail with a weird error.

@lizgzil (Author) commented Apr 9, 2021

Pipeline seems to fail with a weird error.

@aCampello :( seems to be the same one as you got when you opened https://github.com/wellcometrust/WellcomeML/issues/270. Did you do anything to get your PR to pass in the end? As @nsorros said "Some non deterministic test fail here showing up again."

@aCampello (Contributor):

Don't know, I re-ran and it passed! I assume some memory error again.

@aCampello (Contributor):

Let me have a look. Does tox pass locally?

@lizgzil (Author) commented Apr 9, 2021

Let me have a look. Does tox pass locally?

@aCampello it's still running locally for 3.8, but has passed for 3.7

@aCampello (Contributor):

I re-triggered your pipeline. Let's see.

@nsorros (Contributor) left a comment

Blocking so we can check whether the exception handling is still needed.

@lizgzil lizgzil merged commit e071b21 into main Apr 9, 2021
@lizgzil lizgzil deleted the fix-spacy-lemmatizer branch April 9, 2021 13:34