Unify three arrays in Lattice for locality of reference #217
Conversation
Thanks for the PR. The original (split) version, however, should have better memory locality in the extremely hot function. The original motivation was to leave out all parts of the lattice node which are not needed there.
That makes sense. Based on that motivation, I also tested another version that unifies only part of the node data. My microbenchmark result showed the following tokenization times:
These are the times elapsed to tokenize all sentences contained in the text corpus 吾輩は猫である (I Am a Cat). Unifying the three arrays shows the best performance in my environment.
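For context, a wall-clock microbenchmark of this shape can be sketched with `std::time::Instant`. This is a generic sketch, not the benchmark code that produced the numbers above; the `tokenize` closure is a hypothetical stand-in for the real tokenizer.

```rust
use std::time::Instant;

// Time a tokenizer-like function over every sentence in a corpus and
// report (total token count, elapsed seconds).
fn bench<F: FnMut(&str) -> usize>(sentences: &[&str], mut tokenize: F) -> (usize, f64) {
    let start = Instant::now();
    let mut total_tokens = 0;
    for s in sentences {
        total_tokens += tokenize(s);
    }
    (total_tokens, start.elapsed().as_secs_f64())
}
```

Usage might look like `bench(&corpus, |s| tokenizer.tokenize(s).len())`, where `tokenizer` is whatever analyzer is under test; averaging several such runs reduces noise.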
Interesting! Thanks for the second version. Random accesses being slow makes sense as a hypothesis. I'll look into them on my dev machine.
I spent almost the whole day today digging into this PR. I was getting high variance (>10%) with my usual performance-analysis workflow: taking the mean of 10 runs of the sudachi binary analyzing ~10 MB of raw Japanese text (e.g. the first 100k lines of the KFTT training data). I use relatively large and hopefully diverse input data to force the analyzer to use different dictionary entries, so that caching effects of the mmap-ed dictionary are less visible and the performance numbers are closer to real-life usage. Sudachi startup time is negligible; it mostly uses raw data from the mmap-ed binary dictionary. I tried a couple of machines to check whether the variance would go away with a CPU of a different type.
Unfortunately, the variance picture was more or less the same everywhere, and the usual benchmarking variance was higher than the measurable difference between the PR version and the current version.
Finally, I found a tool for easily setting up a benchmarking environment on Linux: https://github.com/parttimenerd/temci.
In this environment this PR shows a ~1% improvement on my usual benchmarking machine (and the variance is finally low enough, even between runs, to capture such an improvement). Anyway, I won't merge this PR in its current version: simply pasting all the structures together would not work. Also, I tried prefetching the Vecs, but there was a measurable regression there, so at least that did not work.
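For reference, explicit prefetching on x86_64 can be expressed with the stable `_mm_prefetch` intrinsic. This is a generic sketch of the technique, not the actual sudachi.rs experiment; the slice type, the summing loop, and the lookahead distance of 8 are all illustrative assumptions.

```rust
// Sketch of software prefetching over a Vec-like slice. The lookahead
// distance (8 elements) is a guess; real tuning depends on the access
// pattern, and as noted above it can even regress performance.
#[cfg(target_arch = "x86_64")]
fn sum_with_prefetch(data: &[u64]) -> u64 {
    use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
    let mut total: u64 = 0;
    for i in 0..data.len() {
        if i + 8 < data.len() {
            // Request the element 8 slots ahead into cache (T0 hint = all levels).
            unsafe { _mm_prefetch::<_MM_HINT_T0>(data.as_ptr().add(i + 8) as *const i8) };
        }
        total = total.wrapping_add(data[i]);
    }
    total
}

// Plain fallback on other architectures so the sketch stays portable.
#[cfg(not(target_arch = "x86_64"))]
fn sum_with_prefetch(data: &[u64]) -> u64 {
    data.iter().fold(0u64, |acc, &x| acc.wrapping_add(x))
}
```

One reason such manual prefetches can hurt: modern CPUs already detect sequential access and prefetch on their own, so the explicit hints mostly add instruction overhead.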
Thank you very much for your careful experiments, and I agree with your explanation.
I see. Yes, such redundancy should be eliminated.
Very interesting. In this case, it may have interfered with …
Anyway, thank you very much for your experiments. I will probably do the node unification myself in the future, if you do not submit an updated PR before then.
This PR unifies the three arrays `ends`, `ends_full`, and `indices` in `Lattice` for locality of reference. In this modification, I added a new struct `PackedNode` packing the three structs `VNode`, `Node`, and `NodeIdx`, and implemented the three arrays as a single array of `PackedNode`. The three arrays are often accessed simultaneously with the same index value, so this modification can improve cache efficiency during tokenization.
My microbenchmark results (Intel i7, 16 GB RAM) showed that the modification shortened tokenization time by 10%.
(My microbenchmark code can be found here).
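The change can be illustrated schematically. The field layouts below are hypothetical simplifications (the real `VNode`, `Node`, and `NodeIdx` in sudachi.rs carry more data), but they show the struct-of-arrays to array-of-structs move the PR describes.

```rust
// Hypothetical, simplified stand-ins for the real node types.
#[derive(Clone, Copy)]
struct VNode { begin: u16, end: u16 }
#[derive(Clone, Copy)]
struct Node { word_id: u32, cost: i32 }
#[derive(Clone, Copy)]
struct NodeIdx { end: u16, index: u16 }

// Before: three parallel Vecs indexed with the same value
// (struct-of-arrays). One logical lookup touches three separate
// allocations, i.e. up to three distinct cache lines.
#[allow(dead_code)]
struct LatticeSplit {
    ends: Vec<VNode>,
    ends_full: Vec<Node>,
    indices: Vec<NodeIdx>,
}

// After: a single Vec of a packed struct (array-of-structs). The three
// parts of one node sit contiguously, so a lookup at index i is likely
// served from a single cache line.
#[derive(Clone, Copy)]
struct PackedNode {
    vnode: VNode,
    node: Node,
    idx: NodeIdx,
}

struct LatticeUnified {
    nodes: Vec<PackedNode>,
}

impl LatticeUnified {
    fn push(&mut self, vnode: VNode, node: Node, idx: NodeIdx) {
        self.nodes.push(PackedNode { vnode, node, idx });
    }
}
```

The trade-off is the one discussed above: the unified layout wins when all three parts are read together, while the split layout wins in hot loops that only need one part per node.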