
Unify three arrays in Lattice for locality of reference #217

Closed

Conversation

@kampersanda commented Jul 6, 2022

This PR proposes unifying the three arrays ends, ends_full, and indices in Lattice to improve locality of reference.

In this modification, I added a new struct PackedNode that packs the three structs VNode, Node, and NodeIdx, and reimplemented the three arrays as a single array of PackedNode.

The three arrays are often accessed together with the same index value, so the modification can improve cache efficiency during tokenization.
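A minimal sketch of the layout change (the field contents below are illustrative assumptions, not the actual VNode/Node/NodeIdx definitions in sudachi.rs):

```rust
// Illustrative sketch only: these field sets are assumptions, not the
// real VNode/Node/NodeIdx definitions in sudachi.rs.
#[derive(Clone, Copy, Debug, PartialEq)]
struct VNode { total_cost: i32 }
#[derive(Clone, Copy, Debug, PartialEq)]
struct Node { begin: u16, end: u16 }
#[derive(Clone, Copy, Debug, PartialEq)]
struct NodeIdx { index: u32 }

// Before: three parallel arrays indexed by the same node index, so one
// logical lookup touches three separate heap allocations.
#[allow(dead_code)]
struct SplitLattice {
    ends: Vec<VNode>,
    ends_full: Vec<Node>,
    indices: Vec<NodeIdx>,
}

// After: one array of packed nodes; the three pieces that are read
// together now sit next to each other in memory.
#[derive(Clone, Copy)]
struct PackedNode {
    vnode: VNode,
    node: Node,
    idx: NodeIdx,
}

struct UnifiedLattice {
    nodes: Vec<PackedNode>,
}

fn main() {
    let unified = UnifiedLattice {
        nodes: vec![PackedNode {
            vnode: VNode { total_cost: 7 },
            node: Node { begin: 0, end: 2 },
            idx: NodeIdx { index: 5 },
        }],
    };
    // One indexed access brings all three pieces into cache together.
    let n = &unified.nodes[0];
    println!("{} {} {}", n.vnode.total_cost, n.node.end, n.idx.index);
}
```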

My microbenchmark results (Intel i7, 16 GB RAM) showed that the modification shortened tokenization time by 10%.
(My microbenchmark code can be found here).

@eiennohito (Collaborator) commented Jul 6, 2022

Thanks for the PR.

The original (split) version, however, should have better memory locality in the extremely hot function Lattice::connect_node, which was the reason to split the lattice node in the first place. Originally, the layout was closer to your proposal. I need to look into your PR in more detail to find out what is going on here.

The original motivation was to leave out all parts of the lattice node that connect_node does not touch in that loop.

@kampersanda (Author) commented Jul 6, 2022

> The original motivation was to leave out all parts of the lattice node that connect_node does not touch in that loop.

That makes sense.

Based on that motivation, I also tested another version that unifies only ends_full and indices (i.e., packing only Node and NodeIdx). The implementation is here.
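A sketch of this two-array variant, again with assumed (not actual) field contents:

```rust
// Illustrative sketch only; the field sets are assumptions.
#[allow(dead_code)]
#[derive(Clone, Copy)]
struct VNode { total_cost: i32 }
#[derive(Clone, Copy, Debug, PartialEq)]
struct Node { begin: u16, end: u16 }
#[derive(Clone, Copy, Debug, PartialEq)]
struct NodeIdx { index: u32 }

// Node and NodeIdx are packed together, while VNode keeps its own
// array so the hot scan in Lattice::connect_node still touches only
// the small VNode data.
#[derive(Clone, Copy)]
struct NodeWithIdx {
    node: Node,
    idx: NodeIdx,
}

#[allow(dead_code)]
struct Lattice {
    ends: Vec<VNode>,            // left as-is for connect_node
    ends_full: Vec<NodeWithIdx>, // unified Node + NodeIdx
}

fn main() {
    let ends_full = vec![NodeWithIdx {
        node: Node { begin: 0, end: 2 },
        idx: NodeIdx { index: 5 },
    }];
    // A node and its index are now fetched with a single access.
    assert_eq!(ends_full[0].node.end, 2);
    assert_eq!(ends_full[0].idx.index, 5);
    println!("ok");
}
```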

My microbenchmark gave the following tokenization times:

  • Current: 0.358 sec
  • Unify three arrays: 0.316 sec
  • Unify two arrays: 0.344 sec

These are the times elapsed to tokenize all sentences in the text corpus 吾輩は猫である (I Am a Cat).
You can reproduce this experiment by running run_timeperf.sh in this branch.

Unifying the three arrays shows the best performance in my environment.
I guess that random accesses via end_idx in Lattice::insert are the bottleneck, rather than the simple scan in Lattice::connect_node.

@eiennohito (Collaborator) commented

Interesting! Thanks for the second version. Random accesses being slow makes sense as a hypothesis. I'll look into it on my dev machine.

@eiennohito (Collaborator) commented Jul 7, 2022

I spent almost the whole day today digging into this PR.

I was getting high variance (>10%) with my usual performance analysis workflow: taking the mean of 10 runs of the sudachi binary analyzing ~10 MB of raw Japanese text (e.g. the first 100k lines of the KFTT training data). I use relatively large and hopefully diverse input data to force the analyzer to use different dictionary entries, so that caching effects of the mmap-ed dictionary are less visible and performance numbers are closer to real-life usage. Sudachi startup time is negligible; it mostly uses raw data from the mmap-ed binary dictionary.

I tried a couple of machines to check whether the variance would go away with a different type of CPU:

  1. My desktop (Ryzen 5950X, Windows)
  2. Notebook (MBP 2013, i7-3740QM, Arch Linux) <- my usual benchmarking machine
  3. Server (Xeon E5-2670 v2, Ubuntu Linux), for a sanity check

Unfortunately, the variance picture was more or less the same everywhere.

The usual benchmarking variance was higher than the measurable difference between the PR version and the current version.
For example, these are the means and 95% confidence intervals of 20 runs each on my desktop, measured at 1-hour intervals with exactly the same binary:

  1. 7.87 (7.83, 7.91)
  2. 8.17 (8.04, 8.32)
  3. 7.92 (7.88, 7.96)

Finally, I found a tool for easily setting up a benchmarking environment on Linux: https://github.com/parttimenerd/temci; cf.

  1. https://easyperf.net/blog/2019/08/02/Perf-measurement-environment-on-Linux
  2. https://llvm.org/docs/Benchmarking.html

I had done this setup for Juman++ performance work, but had not set it up for Sudachi until now.

In this environment, this PR shows a ~1% improvement on my usual benchmarking machine (and the variance is finally low enough, even between runs, to capture such improvements).

Anyway, I won't merge this PR in its current version. Simply pasting all structures together would not work well: VNode contains a subset of Node's fields (so you can remove VNode completely), and NodeIndex.end is the same as Node.begin, so it can also go away. The merged Node should be 2+2+4 = 8 bytes smaller than in the proposed version, making a significant contribution to efficient cache use.
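One way to picture the deduplication (the field names and widths below are my assumptions, chosen only so that the removed fields total 2+2+4 = 8 bytes; the order is picked so repr(C) adds no padding and the saving shows up directly in the struct sizes):

```rust
use std::mem::size_of;

// Illustrative sketch only: field names/widths are assumptions, not
// the real sudachi.rs layout.

// Naive concatenation of VNode + Node + NodeIdx keeps redundant data.
#[allow(dead_code)]
#[repr(C)]
struct NaivePacked {
    begin: u16,
    end: u16,
    v_end: u16,   // VNode field duplicating Node data    (2 bytes)
    idx_end: u16, // NodeIdx.end, the same as Node.begin  (2 bytes)
    word_id: u32,
    v_word: u32,  // VNode field duplicating Node data    (4 bytes)
    index: u32,
}

// Deduplicated merge: the redundant 2 + 2 + 4 = 8 bytes are gone.
#[allow(dead_code)]
#[repr(C)]
struct MergedNode {
    begin: u16,
    end: u16,
    word_id: u32,
    index: u32,
}

fn main() {
    println!("naive:  {} bytes", size_of::<NaivePacked>()); // 20
    println!("merged: {} bytes", size_of::<MergedNode>());  // 12
    assert_eq!(size_of::<NaivePacked>() - size_of::<MergedNode>(), 8);
}
```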

Also, I tried prefetching the Vecs, but there was a measurable regression, so at least that did not work.

@kampersanda (Author) commented Jul 7, 2022

Thank you very much for your careful experiments, and I agree with your explanation.

> Simply pasting all structures together would not work well: VNode contains a subset of Node's fields (so you can remove VNode completely), and NodeIndex.end is the same as Node.begin, so it can also go away.

I see. Yes, such redundancy should be eliminated.

> Also, I tried prefetching the Vecs, but there was a measurable regression, so at least that did not work.

Very interesting. In this case, it may have interfered with connect_node.

@kampersanda closed this Jul 7, 2022
@eiennohito (Collaborator) commented Jul 7, 2022

Anyway, thank you very much for your experiments. I will probably do the node unification myself in the future, if you do not submit an updated PR before then.
