clusterizer: Implement experimental meshlet optimizer #673

zeux · 2024-03-30T04:44:33Z

So far we were mostly concerned with meshlet clustering from the
perspective of treating meshlets as an unordered set of triangles; while
this matches the computational and documented model, this may not be
optimal for a given GPU.

Notably, NVidia GPUs are much more sensitive to the order of triangles
in the meshlet than to the number and fill percentage; so much so that
from pure rasterization performance, scan may win over proper clustering
because it implicitly generates a better order.

We do not know the precise criteria / mechanism that NV GPUs use here
but it helps to do locality optimization; most importantly, triangle
order, but also reordering meshlet-local vertices helps a little bit.
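
The description above does not spell out how meshlet-local vertices are reordered. One plausible approach, assumed here purely for illustration, is to renumber vertices in the order in which the (already reordered) triangle list first references them. `reorderVerticesByFirstUse` is a hypothetical helper, not a function from the library, and it assumes at most 255 vertices per meshlet so that `0xff` can serve as the "not yet seen" marker.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper (not part of the library): renumbers meshlet-local
// vertices in order of first use by the triangle list.
// Assumes at most 255 distinct vertices, so 0xff can mark "not yet seen".
// Returns the number of vertices that are actually referenced.
std::size_t reorderVerticesByFirstUse(unsigned int* vertices, unsigned char* indices, std::size_t index_count)
{
    unsigned int order[256];  // new vertex list, in first-use order
    unsigned char remap[256]; // old local index -> new local index
    std::memset(remap, 0xff, sizeof(remap));

    std::size_t used = 0;

    for (std::size_t i = 0; i < index_count; ++i)
    {
        unsigned char& r = remap[indices[i]];

        if (r == 0xff)
        {
            // first time this vertex is referenced: give it the next slot
            r = static_cast<unsigned char>(used);
            order[used] = vertices[indices[i]];
            ++used;
        }

        indices[i] = r;
    }

    // overwrite the vertex list with the reordered one
    std::memcpy(vertices, order, used * sizeof(unsigned int));
    return used;
}
```

With this layout, a sequential scan over the triangle list touches the vertex list in mostly increasing order.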

This change implements a simple meshlet optimizer; while this can also
be achieved by running existing optimization algorithms (vcache /
vfetch) on meshlet data, a custom optimizer is faster even when using an
unoptimized quadratic implementation, and may allow us to implement
better locality reordering algorithms in the future assuming a small
input patch.

On NVidia RTX 4090, this change can result in up to 15% speedup when
workloads are raster-bound compared to just using buildMeshlets; the
gains are workload and mesh dependent. niagara sees a 5% speedup when
software triangle culling is disabled.

For now we just select the next triangle to maximize the number of shared vertices with the previous triangle; this results in a pseudo-strip order which seems reasonably optimal for NV. Note that neither the hardware nor this algorithm is concerned with the specifics of edge matching, as it doesn't seem to matter for performance.

Instead of only matching against the last triangle, match against all recent triangles by tracking, for each vertex, the delta between the current cache position and the position at which that vertex was last added. The delta is computed using 8-bit unsigned math so that overflows don't matter as much (there's still a collision between triangles 0 and 256 in the cache, but that shouldn't affect the quality noticeably). Experimentally, a cutoff of 3 or 4 produces the same results on NVidia, so use 3 for more conservative matching.
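
To make the wrapping behavior concrete, here is a minimal standalone sketch of a timestamp cache with an 8-bit delta test. It is an illustration under stated assumptions rather than the code from this change: `RecencyCache` and `kCacheCutoff` are hypothetical names, all three vertices of a triangle are assumed to share one timestamp, and the starting value of 128 is chosen here only so that zero-initialized entries begin outside the cutoff.

```cpp
#include <cstdint>

// Cutoff of 3, as in the text above: a vertex counts as cached if it was
// added by one of the last 3 triangles.
const std::uint8_t kCacheCutoff = 3;

// Hypothetical sketch: per-vertex timestamps with wrapping 8-bit deltas.
struct RecencyCache
{
    std::uint8_t stamp[256] = {}; // timestamp of the last triangle that used each vertex
    std::uint8_t last = 128;      // assumed start value: zeroed stamps are 128 steps "old"

    // The subtraction wraps modulo 256, so the test keeps working after
    // the counter overflows.
    bool recent(std::uint8_t v) const
    {
        return std::uint8_t(last - stamp[v]) < kCacheCutoff;
    }

    // Number of vertices of triangle (a, b, c) that are currently cached.
    int score(std::uint8_t a, std::uint8_t b, std::uint8_t c) const
    {
        return int(recent(a)) + int(recent(b)) + int(recent(c));
    }

    // Records a newly emitted triangle; all three vertices get the same timestamp.
    void add(std::uint8_t a, std::uint8_t b, std::uint8_t c)
    {
        ++last; // wraps from 255 to 0
        stamp[a] = stamp[b] = stamp[c] = last;
    }
};
```

The collision mentioned above shows up here as follows: a vertex that has not been touched for exactly 256 triangles has a delta of 0 again and is briefly treated as cached.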

Instead of tracking visited triangles, maintain an invariant where an increasing prefix of the triangle array is ordered properly. While it would be possible to maintain this by swapping the first triangle with the one we picked at every iteration, this distorts the selection order in an odd way that may reduce the quality of the generated sequence, and memmove is fairly fast in practice given that we've already scanned over the region we are about to move anyhow. This makes optimization ~40% faster while producing the same results.
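
The prefix invariant can be sketched as a single step that moves the chosen triangle to the end of the ordered prefix while shifting the skipped triangles back by one slot. This is a simplified illustration, assuming a flat array with 3 `unsigned char` indices per triangle; `moveTriangleToPrefix` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical sketch: triangles [0, i) are already in final order and
// triangle `next` (next >= i) was picked as the following one.
// Shifts triangles [i, next) one slot towards the end and stores the
// picked triangle at slot i, so the relative order of the remaining
// candidates is preserved (a swap of i and next would not preserve it).
void moveTriangleToPrefix(unsigned char* indices, std::size_t i, std::size_t next)
{
    // save the picked triangle before it is overwritten by the shift
    unsigned char a = indices[next * 3 + 0];
    unsigned char b = indices[next * 3 + 1];
    unsigned char c = indices[next * 3 + 2];

    // the regions overlap, so memmove (not memcpy) is required
    std::memmove(indices + (i + 1) * 3, indices + i * 3, (next - i) * 3);

    indices[i * 3 + 0] = a;
    indices[i * 3 + 1] = b;
    indices[i * 3 + 2] = c;
}
```

For example, with triangles T0 T1 T2 T3 and `i = 1`, `next = 3`, the result is T0 T3 T1 T2, whereas a swap would produce T0 T3 T2 T1.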

zeux added 6 commits March 29, 2024 20:46

clusterizer: Use a single pass for meshlet optimization

b38693e

Instead of looking for a triangle with 2 matches and falling back to 1 and 0, unify this and pick the triangle with maximum number of shared vertices, and early out if we found the best possible match.
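
As a rough illustration of this single-pass selection, the following standalone sketch orders triangles greedily by the number of vertices shared with the previously emitted triangle. It is not the code from this change: `Tri`, `sharedVertices` and `orderTrianglesGreedy` are hypothetical names, and a `visited` array is used for simplicity. It assumes the meshlet contains no duplicate triangles, so two shared vertices is treated as the best possible match.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical triangle type: three meshlet-local vertex indices.
struct Tri
{
    unsigned char a, b, c;
};

// Counts how many vertices of y also appear in x (0..3).
static int sharedVertices(const Tri& x, const Tri& y)
{
    return int(y.a == x.a || y.a == x.b || y.a == x.c) +
           int(y.b == x.a || y.b == x.b || y.b == x.c) +
           int(y.c == x.a || y.c == x.b || y.c == x.c);
}

// Greedy ordering sketch: always emit the unvisited triangle that shares
// the most vertices with the previously emitted one.
std::vector<Tri> orderTrianglesGreedy(const std::vector<Tri>& tris)
{
    std::vector<Tri> result;
    if (tris.empty())
        return result;

    result.reserve(tris.size());
    std::vector<bool> visited(tris.size(), false);

    // start from the first triangle
    result.push_back(tris[0]);
    visited[0] = true;

    while (result.size() < tris.size())
    {
        const Tri prev = result.back();
        std::size_t best = 0;
        int bestScore = -1;

        // single pass: track the maximum instead of trying 2, then 1, then 0
        for (std::size_t j = 0; j < tris.size(); ++j)
        {
            if (visited[j])
                continue;

            int score = sharedVertices(prev, tris[j]);
            if (score > bestScore)
            {
                best = j;
                bestScore = score;

                // early out: sharing an edge is as good as it gets here
                if (bestScore >= 2)
                    break;
            }
        }

        visited[best] = true;
        result.push_back(tris[best]);
    }

    return result;
}
```

Because ties keep the earliest candidate, the scan is deterministic; the inner loop makes the whole pass quadratic in the triangle count, which is acceptable for meshlet-sized inputs.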

clusterizer: Fix MSVC warning

5b3456d

demo: Add usage examples for meshopt_optimizeMeshlet

a12987e

Mention this in README as a recommended post-processing step and add to meshlet processing pipeline for coverage.

zeux marked this pull request as ready for review March 31, 2024 03:06

clusterizer: Add comments to meshopt_optimizeMeshlet

8332f4d

Most of the code is fairly straightforward but some decisions may not be obvious so let's document them. Also renames newv[] to order[] for cleanliness.

zeux force-pushed the meshlet-opt branch from a782139 to 8332f4d Compare March 31, 2024 03:09

zeux merged commit bd7067b into master Apr 1, 2024
12 checks passed

zeux deleted the meshlet-opt branch April 1, 2024 20:02
