
Unify three arrays in Lattice for locality of reference #217

Closed

Conversation

@kampersanda commented Jul 6, 2022

This PR proposes unifying the three arrays ends, ends_full, and indices in Lattice to improve locality of reference.

In this modification, I added a new struct PackedNode that packs the three structs VNode, Node, and NodeIdx, and reimplemented the three arrays as a single array of PackedNode.

The three arrays are often accessed together with the same index value, so the modification can improve cache efficiency during tokenization.
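A minimal sketch of the layout change (the field contents below are illustrative assumptions, not the actual VNode/Node/NodeIdx definitions in sudachi.rs):

```rust
// Illustrative sketch only: these field sets are assumptions, not the
// real VNode/Node/NodeIdx definitions in sudachi.rs.
#[derive(Clone, Copy, Debug, PartialEq)]
struct VNode { total_cost: i32 }
#[derive(Clone, Copy, Debug, PartialEq)]
struct Node { begin: u16, end: u16 }
#[derive(Clone, Copy, Debug, PartialEq)]
struct NodeIdx { index: u32 }

// Before: three parallel arrays indexed by the same node index, so one
// logical lookup touches three separate heap allocations.
#[allow(dead_code)]
struct SplitLattice {
    ends: Vec<VNode>,
    ends_full: Vec<Node>,
    indices: Vec<NodeIdx>,
}

// After: one array of packed nodes; the three pieces that are read
// together now sit next to each other in memory.
#[derive(Clone, Copy)]
struct PackedNode {
    vnode: VNode,
    node: Node,
    idx: NodeIdx,
}

struct UnifiedLattice {
    nodes: Vec<PackedNode>,
}

fn main() {
    let unified = UnifiedLattice {
        nodes: vec![PackedNode {
            vnode: VNode { total_cost: 7 },
            node: Node { begin: 0, end: 2 },
            idx: NodeIdx { index: 5 },
        }],
    };
    // One indexed access brings all three pieces into cache together.
    let n = &unified.nodes[0];
    println!("{} {} {}", n.vnode.total_cost, n.node.end, n.idx.index);
}
```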

My microbenchmark results (Intel i7, 16 GB RAM) showed that the modification shortened tokenization time by 10%.
(My microbenchmark code can be found here).

@eiennohito (Collaborator) commented Jul 6, 2022

Thanks for the PR.

The original (split) version, however, should have better memory locality in the extremely hot function Lattice::connect_node, which was the reason to split the lattice node in the first place. Originally, the layout was closer to your proposal. I need to look into your PR in more detail to find out what is going on here.

The original motivation was to leave out all parts of the lattice node that connect_node does not touch in that loop.

@kampersanda (Author) commented Jul 6, 2022

> The original motivation was to leave out all parts of the lattice node that connect_node does not touch in that loop.

That makes sense.

Based on that motivation, I also tested another version that unifies only ends_full and indices (i.e., packing only Node and NodeIdx). The implementation is here.
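A sketch of this two-array variant, again with assumed (not actual) field contents:

```rust
// Illustrative sketch only; the field sets are assumptions.
#[allow(dead_code)]
#[derive(Clone, Copy)]
struct VNode { total_cost: i32 }
#[derive(Clone, Copy, Debug, PartialEq)]
struct Node { begin: u16, end: u16 }
#[derive(Clone, Copy, Debug, PartialEq)]
struct NodeIdx { index: u32 }

// Node and NodeIdx are packed together, while VNode keeps its own
// array so the hot scan in Lattice::connect_node still touches only
// the small VNode data.
#[derive(Clone, Copy)]
struct NodeWithIdx {
    node: Node,
    idx: NodeIdx,
}

#[allow(dead_code)]
struct Lattice {
    ends: Vec<VNode>,            // left as-is for connect_node
    ends_full: Vec<NodeWithIdx>, // unified Node + NodeIdx
}

fn main() {
    let ends_full = vec![NodeWithIdx {
        node: Node { begin: 0, end: 2 },
        idx: NodeIdx { index: 5 },
    }];
    // A node and its index are now fetched with a single access.
    assert_eq!(ends_full[0].node.end, 2);
    assert_eq!(ends_full[0].idx.index, 5);
    println!("ok");
}
```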

My microbenchmark gave the following tokenization times:

  • Current: 0.358 sec
  • Unify three arrays: 0.316 sec
  • Unify two arrays: 0.344 sec

These are the times elapsed to tokenize all sentences in the text corpus 吾輩は猫である (I Am a Cat).
You can reproduce this experiment by running run_timeperf.sh in this branch.

Unifying the three arrays shows the best performance in my environment.
I guess that random accesses via end_idx in Lattice::insert are the bottleneck, rather than the simple scan in Lattice::connect_node.

@eiennohito (Collaborator) commented

Interesting! Thanks for the second version. Random accesses being slow makes sense as a hypothesis. I'll look into it on my dev machine.

@eiennohito (Collaborator) commented Jul 7, 2022

I spent almost the whole day today digging into this PR.

I was getting high variance (>10%) with my usual performance analysis workflow: taking the mean of 10 runs of the sudachi binary analyzing ~10 MB of raw Japanese text (e.g. the first 100k lines of the KFTT training data). I use relatively large and hopefully diverse input data to force the analyzer to use different dictionary entries, so that caching effects of the mmap-ed dictionary are less visible and performance numbers are closer to real-life usage. Sudachi startup time is negligible; it mostly uses raw data from the mmap-ed binary dictionary.

I tried a couple of machines to check whether the variance would go away with a different type of CPU:

  1. My desktop (Ryzen 5950X, Windows)
  2. Notebook (MBP 2013, i7-3740QM, Arch Linux) <- my usual benchmarking machine
  3. Server (Xeon E5-2670 v2, Ubuntu Linux), for a sanity check

Unfortunately, the variance picture was more or less the same everywhere.

The usual benchmarking variance was higher than the measurable difference between the PR version and the current version.
For example, these are the means and 95% confidence intervals of 20 runs each on my desktop, measured at 1-hour intervals with exactly the same binary:

  1. 7.87 (7.83, 7.91)
  2. 8.17 (8.04, 8.32)
  3. 7.92 (7.88, 7.96)

Finally, I found a tool for easily setting up a benchmarking environment on Linux: https://github.com/parttimenerd/temci; cf.

  1. https://easyperf.net/blog/2019/08/02/Perf-measurement-environment-on-Linux
  2. https://llvm.org/docs/Benchmarking.html

I had done this setup for Juman++ performance work, but had not set it up for Sudachi until now.

In this environment, this PR shows a ~1% improvement on my usual benchmarking machine (and the variance is finally low enough, even between runs, to capture such improvements).

Anyway, I won't merge this PR in its current version. Simply pasting all structures together would not work well: VNode contains a subset of Node's fields (so you can remove VNode completely), and NodeIndex.end is the same as Node.begin, so it can also go away. The merged Node should be 2+2+4 = 8 bytes smaller than in the proposed version, making a significant contribution to efficient cache use.
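One way to picture the deduplication (the field names and widths below are my assumptions, chosen only so that the removed fields total 2+2+4 = 8 bytes; the order is picked so repr(C) adds no padding and the saving shows up directly in the struct sizes):

```rust
use std::mem::size_of;

// Illustrative sketch only: field names/widths are assumptions, not
// the real sudachi.rs layout.

// Naive concatenation of VNode + Node + NodeIdx keeps redundant data.
#[allow(dead_code)]
#[repr(C)]
struct NaivePacked {
    begin: u16,
    end: u16,
    v_end: u16,   // VNode field duplicating Node data    (2 bytes)
    idx_end: u16, // NodeIdx.end, the same as Node.begin  (2 bytes)
    word_id: u32,
    v_word: u32,  // VNode field duplicating Node data    (4 bytes)
    index: u32,
}

// Deduplicated merge: the redundant 2 + 2 + 4 = 8 bytes are gone.
#[allow(dead_code)]
#[repr(C)]
struct MergedNode {
    begin: u16,
    end: u16,
    word_id: u32,
    index: u32,
}

fn main() {
    println!("naive:  {} bytes", size_of::<NaivePacked>()); // 20
    println!("merged: {} bytes", size_of::<MergedNode>());  // 12
    assert_eq!(size_of::<NaivePacked>() - size_of::<MergedNode>(), 8);
}
```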

Also, I tried prefetching the Vecs, but there was a measurable regression, so at least that did not work.

@kampersanda (Author) commented Jul 7, 2022

Thank you very much for your careful experiments, and I agree with your explanation.

> Simply pasting all structures together would not work well: VNode contains a subset of Node's fields (so you can remove VNode completely), and NodeIndex.end is the same as Node.begin, so it can also go away.

I see. Yes, such redundancy should be eliminated.

> Also, I tried prefetching the Vecs, but there was a measurable regression, so at least that did not work.

Very interesting. In this case, it may have interfered with connect_node.

@kampersanda closed this Jul 7, 2022
@eiennohito (Collaborator) commented Jul 7, 2022

Anyway, thank you very much for your experiments. I will probably do the node unification myself in the future, if you do not submit an updated PR before then.
