
Possibility of speeding up Stanza lemmatizer by excluding redundant words #1263

Open
mrgransky opened this issue Jul 15, 2023 · 14 comments

@mrgransky

mrgransky commented Jul 15, 2023

Given:

I have a small sample document with a limited number of words, as follows:

d = '''
I go to school by the school bus everyday with all of my best friends. 
There are several students who also take the buses to school. Buses are quite cheap in my city.
The city which I live in has an enormous number of brilliant schools with smart students.
We have a nice math teacher in my school whose name is Jane Doe.
She also teaches several other topics in our school, including physics, chemistry and sometimes literature as a substitute teacher.
Other classes don't appreciate her efforts as much as my class. She must be nominated as the best school's teacher.
My school is located far from my apartment. This is why, I am taking the bus to school everyday.
'''

Goal:

Considering my real-world large documents with more words (4000 ~ 8000 words), I would like to speed up my Stanza lemmatizer, probably by excluding repeated words from lemmatization, e.g., words that have occurred more than once.
I do not intend to use set() to obtain the unique lemmas in my result list; rather, I intend to skip lemmatizing words that have already been lemmatized.

For instance, in the given sample document d, there are several redundant words that could be skipped during lemmatization:

Word                 Lemma
--------------------------------------------------
school               school
school               school <<<<< Redundant
bus                  bus
everyday             everyday
friends              friend
students             student
buses                bus
school               school <<<<< Redundant
Buses                bus <<<<< Redundant
cheap                cheap
city                 city
city                 city <<<<< Redundant
live                 live
enormous             enormous
number               number
brilliant            brilliant
schools              school
smart                smart
students             student
nice                 nice
math                 math
teacher              teacher
school               school <<<<< Redundant
Jane                 jane
Doe                  doe
teaches              teach
topics               topic
school               school <<<<< Redundant
including            include
physics              physics
chemistry            chemistry
literature           literature
substitute           substitute
teacher              teacher <<<<< Redundant
classes              class
appreciate           appreciate
efforts              effort
class                class
nominated            nominate
school               school <<<<< Redundant
teacher              teacher <<<<< Redundant
school               school <<<<< Redundant
located              locate
apartment            apartment
bus                  bus <<<<< Redundant
school               school <<<<< Redundant
everyday             everyday <<<<< Redundant

My [inefficient] solution:

import stanza
from stanza.pipeline.core import DownloadMethod  # needed for REUSE_RESOURCES below
import nltk

nltk_modules = ['punkt', 'averaged_perceptron_tagger', 'stopwords', 'wordnet', 'omw-1.4']
nltk.download(nltk_modules, quiet=True, raise_on_error=True)
STOPWORDS = nltk.corpus.stopwords.words(nltk.corpus.stopwords.fileids())

nlp = stanza.Pipeline(lang='en', processors='tokenize,lemma,pos', tokenize_no_ssplit=True,
                      download_method=DownloadMethod.REUSE_RESOURCES)
doc = nlp(d)
%timeit -n 10000 [ wlm.lower() for _, s in enumerate(doc.sentences) for _, w in enumerate(s.words) if (wlm:=w.lemma) and len(wlm)>2 and wlm not in STOPWORDS]
10.5 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

My [alternative] solution, a little faster but still NOT efficient for large documents (4000 ~ 8000 words):

def get_lm():
  words_list = list()    # every surface form seen so far (membership tests are O(n)!)
  lemmas_list = list()   # lemmas kept so far
  for vsnt in doc.sentences:
    for vw in vsnt.words:
      wlm = (vw.lemma or '').lower()   # guard against None lemmas
      wtxt = vw.text.lower()
      if wtxt in words_list and wlm in lemmas_list:
        # repeated word whose lemma was already kept: append it again
        lemmas_list.append(wlm)
      elif wtxt not in words_list and wlm and len(wlm) > 2 and wlm not in STOPWORDS:
        # new word passing the filters: keep its lemma
        lemmas_list.append(wlm)
      words_list.append(wtxt)
  return lemmas_list
%timeit -n 10000 get_lm()
7.85 ms ± 66.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
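
For reference, the main Python-side cost in both versions above is the linear membership tests (wtxt in words_list and wlm not in STOPWORDS each scan a whole list). Below is a minimal sketch of the same logic with O(1) set lookups instead; the names STOPWORDS_SET and get_lm_fast and the speedup itself are my own assumptions, not benchmarked here:

STOPWORDS_SET = frozenset(STOPWORDS)  # built once: O(1) membership tests

def get_lm_fast():
  seen_words = set()     # surface forms already processed
  kept_lemmas = set()    # lemmas that have passed the filters at least once
  lemmas_list = []
  for vsnt in doc.sentences:
    for vw in vsnt.words:
      wlm = (vw.lemma or '').lower()
      wtxt = vw.text.lower()
      if wtxt in seen_words:
        if wlm in kept_lemmas:     # repeated word whose lemma was kept: append again
          lemmas_list.append(wlm)
      elif wlm and len(wlm) > 2 and wlm not in STOPWORDS_SET:
        lemmas_list.append(wlm)    # new word passing the filters
        kept_lemmas.add(wlm)
      seen_words.add(wtxt)
  return lemmas_list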

My ideal result for this sample document, from either solution, should still contain repeated lemmas, as follows:

lm = [ wlm.lower() for _, s in enumerate(doc.sentences) for _, w in enumerate(s.words) if (wlm:=w.lemma) and len(wlm)>2 and wlm not in STOPWORDS] # solution 1
# lm = get_lm() # solution 2
print(len(lm), lm)
47 ['school', 'school', 'bus', 'everyday', 'friend', 'student', 'bus', 'school', 'bus', 'cheap', 'city', 'city', 'live', 'enormous', 'number', 'brilliant', 'school', 'smart', 'student', 'nice', 'math', 'teacher', 'school', 'jane', 'doe', 'teach', 'topic', 'school', 'include', 'physics', 'chemistry', 'literature', 'substitute', 'teacher', 'class', 'appreciate', 'effort', 'class', 'nominate', 'school', 'teacher', 'school', 'locate', 'apartment', 'bus', 'school', 'everyday']

Is there any better or more efficient approach for this problem when considering large corpora or documents?

Cheers,

@AngledLuffa
Collaborator

AngledLuffa commented Jul 18, 2023 via email

AngledLuffa added a commit that referenced this issue Jul 19, 2023
@AngledLuffa
Collaborator

AngledLuffa commented Jul 19, 2023 via email

@mrgransky
Author

mrgransky commented Jul 20, 2023

Please correct me if I am wrong, but it seems that POS tagging is what actually takes the majority of the time, and it needs to be done on full sentences. It won't work correctly if I remove words from the sentence.

So, given such large corpora, is there any documented strategy to break up the text and use multiprocessing ($ lscpu below) for the sake of a speedup?

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              19 <<<<< could be useful for multiprocessing, perhaps?
On-line CPU(s) list: 0-18
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           19
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
Stepping:            7
CPU MHz:             2100.002
BogoMIPS:            4200.00
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0-18
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat pku ospke avx512_vnni

I do not have any practical idea yet of how to take advantage of multiprocessing and implement it in code, but I would like to know whether it is worth keeping a dictionary or list of already-seen words/lemmas to achieve this task more efficiently, considering your point about memory leakage and time consumption.

@AngledLuffa
Collaborator

The POS tags do require context (otherwise it would be hard to tag ambiguous words such as "tag" or "beat"), so caching individual words is not an option.

If you have multiple CPUs and no GPU, then torch already uses the CPUs when doing matrix multiplications, and there really isn't any great way to get better results with multithreading. Nothing I've found so far, at least.
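
For completeness, the one knob torch does expose here is the intra-op thread count; a minimal sketch (torch already picks a default based on core count, so changing it may well not help):

import torch

torch.set_num_threads(8)        # threads used for intra-op work such as matrix multiplications
print(torch.get_num_threads())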

In terms of parallel processing of documents, the pipeline will split a large document at double line breaks, so you can try that. If you already have the text split into multiple documents, there is this:

https://stanfordnlp.github.io/stanza/getting_started.html#processing-multiple-documents
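
From that page, batched processing looks roughly like this (a sketch; the texts are placeholders):

import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,lemma')
texts = ["First document text.", "Second document text."]  # placeholders
in_docs = [stanza.Document([], text=t) for t in texts]     # wrap raw text in Document objects
out_docs = nlp(in_docs)                                    # all documents processed in one call
for d in out_docs:
    print([w.lemma for s in d.sentences for w in s.words])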

@AngledLuffa
Collaborator

I released the update for this, but I realized after the fact that we could probably just set a small cache size as the default.

@mrgransky
Author

mrgransky commented Sep 13, 2023

@AngledLuffa I do not know whether it is a workable idea or not, but I found a Python decorator, from functools import cache, which looks at the input of my function and returns the previously saved result rather than entering the function over and over again for identical documents:

from functools import cache

@cache
def stanza_lemmatizer(doc):  # if doc has already been seen and lemmatized, do not re-enter this function!
   # run the stanza lemmatizer
   # ...

It does not entirely answer my initial question, but at least it ensures that the lemmatizer is not executed for ENTIRE INPUT DOCUMENTS that have already been lemmatized once!
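
One caveat: functools.cache is unbounded, so for many distinct documents functools.lru_cache with a maxsize would cap the memory. A sketch, where the maxsize and the tuple return type are my own choices:

from functools import lru_cache

@lru_cache(maxsize=1024)          # bounded variant of @cache: evicts least recently used
def stanza_lemmatizer(doc_text):  # arguments must be hashable, e.g. a plain str
    document = nlp(doc_text)
    return tuple(w.lemma for s in document.sentences for w in s.words)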

Is this a Pythonic or recommended approach?

@AngledLuffa
Collaborator

AngledLuffa commented Sep 13, 2023 via email

@AngledLuffa
Collaborator

AngledLuffa commented Oct 20, 2023 via email


stale bot commented Mar 17, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 17, 2024
@mrgransky
Author

mrgransky commented Apr 5, 2024

Thanks, dude! I can confirm that adding the option "lemma_store_results": True to my lang_configs dictionary makes it slightly faster. Here is the modified version:

lang_configs = {
    "en": {"processors": "tokenize,lemma,pos", "package": "lines", "tokenize_no_ssplit": True, "lemma_store_results": True},
    "fi": {"processors": "tokenize,lemma,pos,mwt", "package": "ftb", "tokenize_no_ssplit": True, "lemma_store_results": True},  # FTB
}
# and the rest of the code...
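
And a sketch of how I build the pipelines from that dict (the variable names are mine):

pipelines = {
    lang: stanza.Pipeline(lang=lang, **cfg)  # unpack each per-language config
    for lang, cfg in lang_configs.items()
}
doc_en = pipelines["en"](d)                  # lemmatizer now caches already-seen words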

It would be nice to have this update on the Stanza webpage that discusses the instructions for lemmatization here; it is currently missing.

Cheers,

@AngledLuffa
Collaborator

I had in mind eventually making this the default with a certain cache size that should help without being way too big. There are always other issues to work on, though... anyway, that's why I left this open for now.

@mrgransky
Author

I guess it occasionally causes problems when the input documents become super big :( as in, I get the following memory error:

CUDA out of memory. Tried to allocate 256.00 MiB. GPU 0 has a total capacty of 31.74 GiB of which 179.19 MiB is free. Process 879538 has 30.17 GiB memory in use. Including non-PyTorch memory, this process has 1.39 GiB memory in use. Of the allocated memory 931.09 MiB is allocated by PyTorch, and 112.91 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
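
The message itself points at max_split_size_mb; a sketch of trying that (it has to be set before the first CUDA allocation, i.e. at the very top of the script, and whether it fixes this particular OOM is an open question):

import os

# must run before torch makes its first CUDA allocation,
# so before building any stanza.Pipeline
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"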

@AngledLuffa
Collaborator

AngledLuffa commented Apr 8, 2024 via email

@mrgransky
Author

Alright ;)
But previously, without "lemma_store_results": True, I did not get the GPU memory issue.
It first happened after adding that option.
