Possibility of speeding up Stanza lemmatizer by excluding redundant words #1263
It looks like the lemmatizer does not take into account context when
lemmatizing. It only uses word & pos. I'll have to double check that.
If that's the case, we could actually internalize past results, since the
same word/POS will lead to the same output each time anyway. It would lead
to a larger and larger memory footprint over time for Stanza, which is the
kind of thing that inevitably leads to "help memory leak" messages down the
line, so probably I'll make it an option of some sort.
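A rough sketch of what that would look like (purely illustrative, not the actual Stanza internals; neural_lemmatize below is a hypothetical stand-in for the model call):

lemma_cache = {}

def cached_lemma(word, pos):
    # Only consult the expensive model on a cache miss; identical (word, POS)
    # pairs are assumed to always produce the same lemma.
    key = (word.lower(), pos)
    if key not in lemma_cache:
        lemma_cache[key] = neural_lemmatize(word, pos)  # hypothetical model call
    return lemma_cache[key]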
…On Sat, Jul 15, 2023 at 7:39 AM Farid ***@***.***> wrote:
*Given*:
I have a small sample document with a limited number of words, as follows:
d ='''
I go to school by the school bus everyday with all of my best friends.
There are several students who also take the buses to school. Buses are quite cheap in my city.
The city which I live in has an enormous number of brilliant schools with smart students.
We have a nice math teacher in my school whose name is Jane Doe.
She also teaches several other topics in our school, including physics, chemistry and sometimes literature as a substitute teacher.
Other classes don't appreciate her efforts as much as my class. She must be nominated as the best school's teacher.
My school is located far from my apartment. This is why, I am taking the bus to school everyday.
'''
*Goal*:
Considering my real-world large documents with more words (4000 ~ 8000
words), I would like to speed up my Stanza lemmatizer by *probably*
excluding the lemmatization of repeated words, *i.e.*, words which have occurred
more than once.
I do not intend to use the set() method to obtain the unique lemmas in my
result list; rather, I intend to skip lemmatizing words which have already
been lemmatized.
For instance, for the given sample raw document d, there are several
redundant words which could be ignored in the process:
Word Lemma
--------------------------------------------------
school school
school school <<<<< Redundant
bus bus
everyday everyday
friends friend
students student
buses bus
school school
Buses bus <<<<< Redundant
cheap cheap
city city
city city <<<<< Redundant
live live
enormous enormous
number number
brilliant brilliant
schools school
smart smart
students student
nice nice
math math
teacher teacher
school school <<<<< Redundant
Jane jane
Doe doe
teaches teach
topics topic
school school <<<<< Redundant
including include
physics physics
chemistry chemistry
literature literature
substitute substitute
teacher teacher <<<<< Redundant
classes class
appreciate appreciate
efforts effort
class class
nominated nominate
school school <<<<< Redundant
teacher teacher
school school <<<<< Redundant
located locate
apartment apartment
bus bus
school school <<<<< Redundant
everyday everyday <<<<< Redundant
My [*inefficient*] solution:
import stanza
from stanza.pipeline.core import DownloadMethod  # needed for download_method below
import nltk

nltk_modules = ['punkt',
                'averaged_perceptron_tagger',
                'stopwords',
                'wordnet',
                'omw-1.4',
                ]
nltk.download(nltk_modules, quiet=True, raise_on_error=True)
STOPWORDS = nltk.corpus.stopwords.words(nltk.corpus.stopwords.fileids())

nlp = stanza.Pipeline(lang='en', processors='tokenize,lemma,pos', tokenize_no_ssplit=True,
                      download_method=DownloadMethod.REUSE_RESOURCES)
doc = nlp(d)
%timeit -n 10000 [ wlm.lower() for _, s in enumerate(doc.sentences) for _, w in enumerate(s.words) if (wlm:=w.lemma) and len(wlm)>2 and wlm not in STOPWORDS]
10.5 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
My [*alternative*] solution, a little faster but still *NOT* efficient
for (4000 ~ 8000 words):
def get_lm():
    words_list = list()
    lemmas_list = list()
    for _, vsnt in enumerate(doc.sentences):
        for _, vw in enumerate(vsnt.words):
            wlm = vw.lemma.lower()
            wtxt = vw.text.lower()
            if wtxt in words_list and wlm in lemmas_list:
                lemmas_list.append(wlm)
            elif (wtxt not in words_list and wlm and len(wlm) > 2 and wlm not in STOPWORDS):
                lemmas_list.append(wlm)
                words_list.append(wtxt)
    return lemmas_list
%timeit -n 10000 get_lm()
7.85 ms ± 66.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
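A side note on both snippets (a sketch, assuming the stopword lookup accounts for a noticeable share of the time): nltk.corpus.stopwords.words() returns a plain list, so every `wlm not in STOPWORDS` test is a linear scan; converting it to a frozenset makes that test O(1) without changing the results:

# Sketch: build the stopword collection as a frozenset for constant-time membership tests.
STOPWORDS = frozenset(nltk.corpus.stopwords.words(nltk.corpus.stopwords.fileids()))
lm = [wlm.lower() for s in doc.sentences for w in s.words
      if (wlm := w.lemma) and len(wlm) > 2 and wlm not in STOPWORDS]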
My ideal result for this sample document, from either solution, should
look like this, containing even repeated lemmas:
lm = [ wlm.lower() for _, s in enumerate(doc.sentences) for _, w in enumerate(s.words) if (wlm:=w.lemma) and len(wlm)>2 and wlm not in STOPWORDS] # solution 1
# lm = get_lm() # solution 2
print(len(lm), lm)
47 ['school', 'school', 'bus', 'everyday', 'friend', 'student', 'bus', 'school', 'bus', 'cheap', 'city', 'city', 'live', 'enormous', 'number', 'brilliant', 'school', 'smart', 'student', 'nice', 'math', 'teacher', 'school', 'jane', 'doe', 'teach', 'topic', 'school', 'include', 'physics', 'chemistry', 'literature', 'substitute', 'teacher', 'class', 'appreciate', 'effort', 'class', 'nominate', 'school', 'teacher', 'school', 'locate', 'apartment', 'bus', 'school', 'everyday']
Is there any better or more efficient approach for this problem when
considering a large corpus or long documents?
Cheers,
Referenced commit a87ffd0: "…,pos combinations it has seen before." (#1263)
It wasn't actually that hard to add the feature. The lemmatizer appears to
be about 20% of the cost of a pipeline which just has tokenize,pos,lemma,
and the cost almost entirely goes away if you parse the same doc over and
over with this feature turned on. Presumably that's not the typical use
case, but it should still bring about some savings if the lemmatizer is run
for a long time.
One note: it doesn't do anything in the middle of processing a document,
only updating the dictionary for the next document, so your exact example
with "city" multiple times in the same sentence won't actually be improved
at all.
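For reference, turning that on would look roughly like this (a sketch; lemma_store_results is the option name given later in this thread):

import stanza

# Sketch: have the lemmatizer remember word/POS -> lemma results across documents,
# so repeated words in later documents skip the neural model.
nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,lemma',
                      lemma_store_results=True)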
Please correct me if I am wrong, but it seems that POS tagging is what actually takes the majority of the time, and it needs to be done on full sentences; it won't work correctly if I remove words from the sentence. So given such large corpora, is there any documented strategy to break up the text and use multiprocessing?
This just came to mind without any practical idea of how to take advantage of multiprocessing in code, but I also want to know whether it is worth keeping a dictionary or list of already-seen words/lemmas to do this more efficiently, considering your point on memory and time consumption.
The POS tags do require context (otherwise it would be hard to tag ambiguous words). If you have multiple CPUs and no GPU, then torch already uses the CPUs when doing matrix multiplications, and there really isn't any great way to get better results with multithreading. Nothing I've found so far, at least.
In terms of parallel processing of documents, the pipeline will split a large document at double line breaks, so you can try that. If you already have it split into multiple documents, there is this: https://stanfordnlp.github.io/stanza/getting_started.html#processing-multiple-documents
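For reference, the multiple-documents path from that page looks roughly like this (a sketch based on the linked documentation):

import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,lemma')

# Wrap each raw text in a Document and pass the whole list at once,
# which lets the pipeline batch work across documents instead of
# calling nlp() in a Python loop.
texts = ["This is the first document.", "Here is a second, unrelated document."]
in_docs = [stanza.Document([], text=t) for t in texts]
out_docs = nlp(in_docs)
for d in out_docs:
    print([w.lemma for sent in d.sentences for w in sent.words])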
I released the update for this, but I realized after the fact that we could probably just set a small cache size as the default.
@AngledLuffa I do not know if it's a functional idea or not, but I found a wrapper in Python, `from functools import cache`, that looks at the input of my function and returns the previously saved result rather than entering the function:
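from functools import cache

@cache
def stanza_lemmatizer(doc):
    # if doc has already been seen and lemmatized, do not enter this function!
    # do stanza lemmatizer
    ...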
It does not entirely answer my initial question, but at least it ensures the lemmatizer is not executed for those ENTIRE INPUT DOCUMENTS which have already been lemmatized once! Is this really a pythonic or recommended approach?
Adding decorators to make functions do things is certainly pythonic, but I
suspect that if you do any serious processing with large docs, you'll
quickly run out of memory trying to remember them all.
As I mentioned, though, there is now the lemma_store_results flag, which
makes the lemmatizer remember all the words it has seen in the past.
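If the decorator route is still appealing, one way to cap the memory cost (a sketch, not something Stanza provides) would be a size-bounded cache keyed on the raw text:

from functools import lru_cache

@lru_cache(maxsize=128)            # keep only the 128 most recently used documents
def stanza_lemmatizer(text):       # argument must be hashable, so pass the raw string
    doc = nlp(text)                # `nlp` is the Pipeline built earlier in the thread
    return tuple(w.lemma for s in doc.sentences for w in s.words)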
It's lemma_store_results.
When the Pipeline is created, the options which start with "lemma_" have that "lemma_" removed, and the rest of the option goes to the lemma processor.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Thanks dude, I can confirm that it works by adding the option:
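# presumably the option discussed above, passed when building the pipeline:
nlp = stanza.Pipeline(lang='en', processors='tokenize,lemma,pos',
                      tokenize_no_ssplit=True,
                      lemma_store_results=True)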
It would be nice to have this update on the Stanza webpage which discusses the instructions for lemmatization here; it is currently missing. Cheers,
I had in mind eventually making this a default with a certain cache size, which should help without being way too big. There are always other issues to work on, though... anyway, that's why I left this open for now.
I guess it occasionally causes problems when the size of the input documents becomes super big :( as in I get the following memory error:
This is a separate issue... that is a GPU error, not a RAM OOM. You could post the whole stack trace, preferably in a new git issue.
Alright ;)