ICF optimization ideas #842

Open
ishitatsuyuki opened this issue Oct 31, 2022 · 10 comments

@ishitatsuyuki
Contributor

Identical Code Folding (ICF) is a code size optimization. It is currently used mainly by Chromium, but there is also interest within the Rust community in mitigating some of the code duplication that comes from the current codegen design. Due to the inherent difficulty of the problem, ICF is the most compute-intensive pass inside mold.

Below I outline some opportunities to make ICF faster.

Faster skipping of converged elements

ICF turns the code into a directed graph and hashes the tree of nodes up to depth N from each node, with N increasing each iteration. When a node's descendants contain no cycles, its hash converges once the depth reaches the leaves, and further iterations can be skipped because the result no longer changes.

Right now, this skipping is implemented through a comparison of digests. If instead the "convergence" is stored in a bitset, the skip can be decided with 1 bit of memory access per node instead of 2*HASH_BITS. This cuts quite a few instructions in the later phase and also improves cache utilization. It saves around 50ms of the 310ms ICF pass, or of the 1.012s total link time for Chromium.
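A minimal sketch of the idea (hypothetical names like `IcfGraph` and `combine` stand in for mold's real data structures and digest function): a node is marked converged once all of its successors are, and from then on the per-node work is a single bit read instead of a digest comparison.

```cpp
#include <cstdint>
#include <vector>

struct IcfGraph {
  std::vector<uint64_t> initial;            // depth-0 hash of each node's contents
  std::vector<std::vector<uint32_t>> succ;  // outgoing edges (references)
  std::vector<uint64_t> digest;             // depth-N hash, updated per iteration
  std::vector<bool> converged;              // 1 bit per node instead of 2*HASH_BITS
};

// Stand-in for the real digest combiner.
static uint64_t combine(uint64_t a, uint64_t b) {
  return (a ^ b) * 0x9E3779B97F4A7C15ULL;
}

void propagate_once(IcfGraph &g) {
  std::vector<uint64_t> next(g.digest.size());
  std::vector<bool> next_converged = g.converged;
  for (uint32_t i = 0; i < g.succ.size(); i++) {
    if (g.converged[i]) {                   // one bit of memory traffic, then skip
      next[i] = g.digest[i];
      continue;
    }
    uint64_t h = g.initial[i];
    bool children_final = true;
    for (uint32_t c : g.succ[i]) {
      h = combine(h, g.digest[c]);
      children_final = children_final && g.converged[c];
    }
    next[i] = h;
    if (children_final)
      next_converged[i] = true;             // this hash can never change again
  }
  g.digest = std::move(next);
  g.converged = std::move(next_converged);
}
```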

Faster cardinality estimates through hash-based approximation

ICF has converged iff the cardinality of the node hashes doesn't increase between depth N and N+1. To test this cardinality, we currently perform a parallel sort, which is rather costly (O(n log n) work and O(n) latency). The cost is somewhat amortized by only running it after 10 iterations of propagation, but O(n) latency is still not a nice number.

It might be possible to test for non-convergence through probabilistic methods. Below is a non-exhaustive list of known options:

  • HyperLogLog (HLL). HLL is currently implemented in mold to estimate cardinality before inserting into some concurrent hashmaps (e.g. the string table). HLL's primary problem is its rather high absolute error, which makes it too noisy for testing convergence. HLL operations are associative, so the estimate is trivially parallelizable as a reduction.

  • Bloom filter. A Bloom filter is a well-known data structure for approximate membership testing, but it can also be used to estimate cardinality. A survey shows that "the relative error of Bloom filter is inversely proportional to the exact cardinality of the dataset", making it suitable for our use case, where even an off-by-one error could affect the effectiveness of the estimate.

    The drawback is that a Bloom filter needs a size proportional to the cardinality to maintain an acceptable collision rate. Still, it's trivially parallelizable and only consumes a fraction of the memory used by the actual hashes, so it's worth looking at the exact performance numbers. It's also possible to use estimates from previous iterations to adaptively size the Bloom filter; see the sketch below.
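A minimal sketch of the Bloom-filter-based estimate, assuming node digests are already available as 64-bit values (`BloomEstimator` and its hash derivation are illustrative, not mold code). The estimate uses the standard fill-ratio formula n ≈ -(m/k)·ln(1 - X/m), where m is the filter size in bits, k the number of hash functions and X the number of set bits.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

struct BloomEstimator {
  std::vector<uint64_t> bits;
  size_t m;   // number of bits
  int k;      // number of hash functions

  BloomEstimator(size_t m_bits, int k_hashes)
      : bits((m_bits + 63) / 64), m(m_bits), k(k_hashes) {}

  void insert(uint64_t digest) {
    // Derive k indices from one 64-bit digest via double hashing.
    uint64_t h1 = digest;
    uint64_t h2 = (digest * 0x9E3779B97F4A7C15ULL) | 1;
    for (int i = 0; i < k; i++) {
      uint64_t idx = (h1 + i * h2) % m;
      bits[idx / 64] |= 1ULL << (idx % 64);
    }
  }

  double estimate() const {
    size_t x = 0;
    for (uint64_t w : bits)
      x += __builtin_popcountll(w);
    if (x == m)
      return (double)m;   // filter saturated; the estimate is unreliable
    return -((double)m / k) * std::log(1.0 - (double)x / m);
  }
};
```

A parallel version could give each thread its own filter and OR them together before estimating; the convergence test then compares the estimate at depth N against depth N+1.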

Further optimization for unique nodes

It's common for functions to contain unique code; after all, duplication should be rare. Unique nodes have an interesting property in ICF, too: once we determine that a node is unique, we no longer need to care about its hash. Hashing a node that has a unique child hash always results in a unique parent hash. Therefore the hashing itself can be skipped; all that's really necessary is to propagate the uniqueness.

Count-Min Sketch (CMS) can be used to estimate how many times an element appears in a multiset. An important property of CMS is that it only overestimates and never underestimates, which means that if an element has an estimated count of 1, it is assuredly unique.

Like a Bloom filter, CMS also has a high memory cost (maybe a little higher than a Bloom filter), and it would be good to test whether the benefit balances out the cost.
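A minimal Count-Min Sketch, with illustrative names; the only property the optimization relies on is that CMS never underestimates, so an estimated count of 1 proves the digest is unique.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct CountMinSketch {
  int d, w;                     // d rows of w counters each
  std::vector<uint32_t> table;  // d * w counters, row-major

  CountMinSketch(int rows, int width) : d(rows), w(width), table(rows * width) {}

  static uint64_t mix(uint64_t x, uint64_t seed) {
    x ^= seed;
    x *= 0xFF51AFD7ED558CCDULL;
    return x ^ (x >> 33);
  }

  void add(uint64_t digest) {
    for (int r = 0; r < d; r++)
      table[r * w + mix(digest, r) % w]++;
  }

  uint32_t estimate(uint64_t digest) const {
    uint32_t res = UINT32_MAX;
    for (int r = 0; r < d; r++)
      res = std::min(res, table[r * w + mix(digest, r) % w]);
    return res;
  }

  // The true count is always <= the estimate, so an estimate of 1 proves uniqueness.
  bool is_unique(uint64_t digest) const { return estimate(digest) == 1; }
};
```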

@rui314
Owner

rui314 commented Oct 31, 2022

All these ideas sound interesting! I think they are worth experimenting with.

Using a bitmap to skip converged elements seems like a no-brainer. It's not complicated (is it?) so I believe we can implement it with a small amount of code.

As to the cardinality estimates, my impression is that HyperLogLog would work well. On each iteration (or every few iterations), we estimate the cardinality and check if the estimate is the largest among all previous estimates. If we don't observe a new largest estimate N consecutive times (say, 10?), it's a strong signal that we've reached convergence.
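For what it's worth, a tiny sketch of that stopping rule, with hypothetical hooks (`propagate` for one round of hashing, `estimate_cardinality` for the HLL reduction):

```cpp
#include <functional>

void run_until_converged(const std::function<void()> &propagate,
                         const std::function<double()> &estimate_cardinality,
                         int patience = 10) {
  double best = 0;
  int stale = 0;
  while (true) {
    propagate();
    double est = estimate_cardinality();
    if (est > best) {
      best = est;
      stale = 0;
    } else if (++stale >= patience) {
      break;  // no new maximum for `patience` checks: likely converged
    }
  }
}
```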

By the way, I got another idea: we may be able to use strongly connected components (SCCs). Since the graph of SCCs doesn't contain cycles, it may be easier to identify identical subgraphs. Here is an outline of the algorithm:

First, we compute strongly connected components and, for each leaf component, apply the current ICF algorithm to merge as many nodes as possible. This is parallelizable across leaf components, as by definition no leaf component connects to another leaf component. After that, we compare every pair of leaf components to merge identical components.

Next, we pick up components that refer only to leaf components. Let's call them 2nd-degree components. For each 2nd-degree component, we apply the current ICF algorithm to merge as many nodes as possible. This is parallelizable across 2nd-degree components. If there's an edge that's not within the component, it must refer to a leaf component. Such out-of-component edges are easy to compare because all nodes in leaf components have already been merged (so a simple pointer comparison should suffice). Once that's done, we compare each pair of 2nd-degree components and merge them if identical.

After that, we pick up 3rd-degree components, and rinse and repeat.
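A rough sketch of that schedule, assuming the condensation (one adjacency list per component) is already computed. `fold_component()` is a placeholder for running the current ICF algorithm inside one component; the pairwise merging of identical components between rounds is left out.

```cpp
#include <cstdint>
#include <vector>

void process_by_degree(const std::vector<std::vector<uint32_t>> &comp_succ,
                       void (*fold_component)(uint32_t)) {
  size_t n = comp_succ.size();
  std::vector<int> pending(n);                 // unresolved out-edges per component
  std::vector<std::vector<uint32_t>> pred(n);  // reverse edges
  for (uint32_t c = 0; c < n; c++) {
    pending[c] = (int)comp_succ[c].size();
    for (uint32_t s : comp_succ[c])
      pred[s].push_back(c);
  }

  std::vector<uint32_t> ready;                 // 1st degree = leaf components
  for (uint32_t c = 0; c < n; c++)
    if (pending[c] == 0)
      ready.push_back(c);

  while (!ready.empty()) {
    // Every component in `ready` refers only to already-folded components,
    // so this loop is where a parallel-for would go.
    for (uint32_t c : ready)
      fold_component(c);
    std::vector<uint32_t> next;
    for (uint32_t c : ready)
      for (uint32_t p : pred[c])
        if (--pending[p] == 0)
          next.push_back(p);
    ready = std::move(next);
  }
}
```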

@rui314
Owner

rui314 commented Oct 31, 2022

Speaking of the estimation, I wonder if we could instead just compute the exact number of iterations needed for a given graph. If we knew that number, we could just repeat the loop that many times. I believe we need to compute the length of the longest cycle in which no edge is visited more than once. It feels like it could be computed in O(N+M), where N and M are the numbers of vertices and edges.

@ishitatsuyuki
Contributor Author

I think the iteration count is equal to the graph diameter: the "longest shortest path" inside a graph.

Unfortunately, the graph diameter is not trivial to compute: a realistic implementation is likely around O(V^3) (SO answer). That said, if you combine it with SCC, it can likely go down to a decent number (the average case for normal programs shouldn't be too complex).

The SCC idea is certainly interesting. I think it also plays well with the uniqueness optimization, which is a nice orthogonality.

I need to experiment with HLL more, but my impression is that it has quite high variance, which might mask signals of convergence. It also overestimates wildly at low cardinalities, although those cases should run quickly anyway, and the paper seems to claim that the bias can be corrected.

@rui314
Owner

rui314 commented Nov 1, 2022

I thought a bit more about SCC and I couldn't find a reason why it wouldn't work. I'll do some experiments with that.

@ishitatsuyuki
Contributor Author

The challenging part with SCC is parallelization. As long as parallel SCC doesn't get too slow, we should be able to get sizable benefits from the reduction in redundant work (thanks to the decomposition).

For parallelization, https://ppl.stanford.edu/papers/sc13-hong.pdf seems to be mostly SoTA, slides also available.

@rui314
Owner

rui314 commented Nov 1, 2022

Thank you for the link. A bit off-topic, but do functions in a program form a small-world graph? Functions are usually intentionally written so that they don't introduce mutual dependencies, so the graph formed by functions is likely to be different.

@ishitatsuyuki
Contributor Author

I'm actually not very knowledgeable on this topic, but small-worldness seems reasonably related to the diameter of the graph. For Chromium, ICF took 42 iterations, so at least that is close in magnitude to what we have in the slides (diameter = 18).

@rui314
Owner

rui314 commented Nov 1, 2022

As a starter, I implemented Tarjan's algorithm (https://github.com/rui314/mold/tree/scc) and counted the number of strongly connected components with clang-16 as a test set.

Component size distribution:

1     556411
<5    2040
<10   100
<50   71
<100  9
<1000 10

So it turned out that almost all components are of size 1, i.e. mutually recursive functions are rare in clang-16.

I was curious about the height of the SCC graph, so I computed the depth of each component, where the depth of a component C is defined as the longest path from C to a leaf component. Here is the distribution.

1     103145
<5    334267
<10   70619
<50   44065
<100  6144
<1000 404

I'm not completely sure my code is bug-free, but if it is correct, the height of the SCC graph is more than 1000, which I think is pretty high.
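For reference, a minimal version of that depth computation (not the code from the scc branch): depth(C) is the longest path from component C to a leaf of the condensation DAG, computed with a memoized traversal; since the condensation is acyclic, the recursion terminates. Leaves get depth 0 in this sketch.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

int component_depth(uint32_t c,
                    const std::vector<std::vector<uint32_t>> &comp_succ,
                    std::vector<int> &memo) {       // memo initialized to -1
  if (memo[c] != -1)
    return memo[c];
  int depth = 0;                                    // leaf component
  for (uint32_t s : comp_succ[c])
    depth = std::max(depth, component_depth(s, comp_succ, memo) + 1);
  return memo[c] = depth;
}
```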

Now I wonder if we can merge only components of size 1 (i.e. forget about graph isomorphism and handle only one section at a time). If it can produce a reasonably good result, we may want to ditch the current algorithm and adopt the simpler approach. I'll investigate it more.

@ishitatsuyuki
Contributor Author

I haven't gone through the code thoroughly, but isn't Tarjan's algorithm DFS? I think that to get a reasonable measure of depth one needs to use BFS. The high depth is kind of suspicious, since I can't really think of a (sane) code structure that would give you a >1000 diameter.

@matu3ba

matu3ba commented Jan 6, 2023

isn't Tarjan's algorithm DFS?

Yes. https://www.geeksforgeeks.org/tarjan-algorithm-find-strongly-connected-components/

(sane) code structure that can get you a >1000 diameter

From https://abhinavmehndiratta.github.io/2020-06-11/parallel-scc
"Most real graphs have a single large SCC, this causes the algorithm to essentially run sequentially. Since we spawn a task for every color, if there’s only a single color with a large vertex set, it will reduce parallelism. One solution to increase parallelism is to use parallel BFS for traversal, but this will cause a lot of overhead."

Not sure if this works and helps: each SCC itself, plus its inward and outward edges, could be hashed during path traversal and written to a cache, if we fix the symbol names as the sorting criterion. Those caches can then be compared in O(1).

From what I understand, the symbol names are necessary to avoid being forced to compute graph isomorphism between SCCs.

If we assume an "SCC cache" + "symbol names" are given: is there data, or are there methods, describing how slow operations on huge SCCs are?
What kind of operations would be necessary? Are full graph traversals within the SCC necessary, if things are combined?

Sorry for not understanding the linker context well enough to know how to apply SCCs.
