clusterizer: Implement experimental meshlet optimizer #673

Merged: 7 commits, Apr 1, 2024

@zeux commented Mar 30, 2024

So far we were mostly concerned with meshlet clustering from the
perspective of treating meshlets as an unordered set of triangles; while
this matches the computational and documented model, this may not be
optimal for a given GPU.

Notably, NVidia GPUs are much more sensitive to the order of triangles
in the meshlet than to the number and fill percentage; so much so that,
in terms of pure rasterization performance, scan may win over proper
clustering because it implicitly generates a better order.

We do not know the precise criteria / mechanism that NV GPUs use here
but it helps to do locality optimization; most importantly, triangle
order, but also reordering meshlet-local vertices helps a little bit.
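
As an illustration of what reordering meshlet-local vertices can look like, the sketch below renumbers local vertices in order of first use by the triangle list. This is an assumption about one plausible ordering, not a description of the code in this change; `reorderVerticesByFirstUse` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: renumbers meshlet-local vertices in order of first use.
// `vertices` maps local index -> global vertex index (at most 256 entries);
// `triangles` holds local indices, three per triangle.
void reorderVerticesByFirstUse(std::vector<unsigned int>& vertices,
                               std::vector<unsigned char>& triangles)
{
    assert(vertices.size() <= 256);
    assert(triangles.size() % 3 == 0);

    std::vector<int> remap(vertices.size(), -1); // -1 marks "not assigned yet"
    std::vector<unsigned int> order;
    order.reserve(vertices.size());

    for (unsigned char& index : triangles)
    {
        assert(static_cast<size_t>(index) < vertices.size());

        if (remap[index] < 0)
        {
            // first time we see this vertex: give it the next free slot
            remap[index] = static_cast<int>(order.size());
            order.push_back(vertices[index]);
        }

        index = static_cast<unsigned char>(remap[index]);
    }

    // keep vertices that no triangle references at the end so the count is unchanged
    for (size_t v = 0; v < vertices.size(); ++v)
        if (remap[v] < 0)
            order.push_back(vertices[v]);

    vertices.swap(order);
}
```

After this pass, the first triangle always references local vertices 0, 1, 2, and each later triangle only introduces new vertices in increasing order.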

This change implements a simple meshlet optimizer; while this can also
be achieved by running existing optimization algorithms (vcache /
vfetch) on meshlet data, a custom optimizer is faster even with a
quadratic implementation, and may allow us to implement better locality
reordering algorithms in the future, since it can assume a small input
patch.

On NVidia RTX 4090, this change can result in up to 15% speedup when
workloads are raster-bound compared to just using buildMeshlets; the
gains are workload and mesh dependent. niagara sees a 5% speedup when
software triangle culling is disabled.

For now we just select the next triangle to maximize the number of
shared vertices with the previous triangle; this results in a
pseudo-strip order which seems reasonably optimal for NV; note that
neither the hardware nor this algorithm is concerned with the specifics
of edge matching as it doesn't seem to matter for performance.

Instead of looking for a triangle with 2 matches and falling back to 1
and 0, unify this and pick the triangle with the maximum number of
shared vertices, and early out if we find the best possible match.

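
The selection rule described above can be sketched as follows. This is an illustrative simplification, not the code from this change: `sharedVertices` and `orderTrianglesGreedy` are hypothetical names, matching is done only against the previously emitted triangle, and the picked triangle is swapped into place for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Number of vertices of triangle `a` that also appear in triangle `b`.
// Only vertex sharing is counted; which edge matches is deliberately ignored.
int sharedVertices(const unsigned char* a, const unsigned char* b)
{
    int shared = 0;
    for (int i = 0; i < 3; ++i)
        shared += (a[i] == b[0] || a[i] == b[1] || a[i] == b[2]) ? 1 : 0;
    return shared;
}

// Hypothetical helper: greedily reorders `triangles` (three local indices each)
// so that each triangle shares as many vertices as possible with its predecessor.
void orderTrianglesGreedy(std::vector<unsigned char>& triangles)
{
    assert(triangles.size() % 3 == 0);
    size_t count = triangles.size() / 3;

    for (size_t i = 1; i < count; ++i)
    {
        const unsigned char* prev = &triangles[(i - 1) * 3];

        size_t best = i;
        int best_match = -1;

        // quadratic scan over the triangles that have not been emitted yet
        for (size_t j = i; j < count; ++j)
        {
            int match = sharedVertices(&triangles[j * 3], prev);

            if (match > best_match)
            {
                best = j;
                best_match = match;

                // two shared vertices is enough for strip-like traversal
                if (match >= 2)
                    break;
            }
        }

        // swap the picked triangle into slot i (a simplification for this sketch)
        for (int k = 0; k < 3; ++k)
            std::swap(triangles[i * 3 + k], triangles[best * 3 + k]);
    }
}
```

A single loop with a running maximum replaces separate passes for 2, 1 and 0 matches; the early out keeps the common case cheap because a neighbor sharing an edge is usually found quickly.
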
Instead of only matching against the last triangle, match against all
previous triangles by tracking, for each vertex, the delta between the
current cache position and the position where the vertex was last used;
the delta is computed using 8-bit unsigned math so that overflows don't
matter as much (there's still a collision between triangles 0 and 256 in
the cache but that shouldn't affect the quality noticeably).

Experimentally, a cutoff of 3 or 4 produces the same results on NVidia,
so use 3 for more conservative matching.

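
A minimal sketch of the wrapping 8-bit recency test described above. The names `kCacheCutoff`, `usedRecently` and `recentVertexCount` are hypothetical, and the exact comparison (delta strictly below the cutoff) is an assumption; only the cutoff value of 3 and the 8-bit unsigned arithmetic come from the text.

```cpp
#include <cassert>
#include <cstddef>

// Cutoff taken from the text; the strict "<" comparison below is an assumption.
const unsigned char kCacheCutoff = 3;

// `last_use` is the 8-bit position at which a vertex was last emitted and
// `now` is the current 8-bit position. The subtraction wraps modulo 256, so the
// test keeps working after the position counter overflows.
bool usedRecently(unsigned char last_use, unsigned char now)
{
    unsigned char delta = static_cast<unsigned char>(now - last_use);
    return delta < kCacheCutoff;
}

// Scores a candidate triangle by how many of its vertices were used recently.
// `last_use` is indexed by meshlet-local vertex index.
int recentVertexCount(const unsigned char* tri, const unsigned char* last_use, unsigned char now)
{
    int score = 0;
    for (int i = 0; i < 3; ++i)
        score += usedRecently(last_use[tri[i]], now) ? 1 : 0;
    return score;
}
```

Because positions are compared modulo 256, a vertex last used at position 0 looks recent again at position 256; this is the collision mentioned above, and it can only produce an occasional false match.
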
Mention this in README as a recommended post-processing step and add it
to the meshlet processing pipeline for coverage.

Instead of tracking visited triangles, maintain an invariant where an
increasing prefix of the triangle array is ordered properly. While it
would be possible to maintain this by swapping the first triangle with
the one we picked at every iteration, this distorts the selection order
in an odd way that may reduce the quality of the generated sequence, and
memmove is fairly fast in practice given that we've already scanned over
the region we are about to move anyhow.

This makes optimization ~40% faster while producing the same results.
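
One way to maintain the ordered-prefix invariant with memmove is sketched below; `moveTriangleToPrefix` is a hypothetical helper, not a function from the library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: triangles [0, i) are already in final order. Moves the
// triangle at index `picked` (picked >= i) into slot i and shifts triangles
// [i, picked) one slot to the right, preserving their relative order.
void moveTriangleToPrefix(unsigned char* triangles, size_t i, size_t picked)
{
    assert(i <= picked);

    unsigned char tri[3];
    std::memcpy(tri, &triangles[picked * 3], 3);

    // overlapping ranges, so memmove rather than memcpy
    std::memmove(&triangles[(i + 1) * 3], &triangles[i * 3], (picked - i) * 3);

    std::memcpy(&triangles[i * 3], tri, 3);
}
```

Unlike a swap, the shift keeps the remaining triangles in their original relative order, so later selections scan candidates in the same sequence as the input; no separate visited flags are needed because everything at or after slot i + 1 is still pending.
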

@zeux marked this pull request as ready for review March 31, 2024 03:06

Most of the code is fairly straightforward but some decisions may not be
obvious so let's document them. Also renames newv[] to order[] for
cleanliness.

@zeux merged commit bd7067b into master Apr 1, 2024
12 checks passed
@zeux deleted the meshlet-opt branch April 1, 2024 20:02