clusterizer: Implement experimental meshlet optimizer #673
So far we were mostly concerned with meshlet clustering from the
perspective of treating meshlets as an unordered set of triangles; while
this matches the computational and documented model, this may not be
optimal for a given GPU.
Notably, NVidia GPUs are much more sensitive to the order of triangles
in the meshlet than to the triangle count and fill percentage; so much so
that in terms of pure rasterization performance, scan may win over proper
clustering because it implicitly generates a better order.
We do not know the precise criteria / mechanism that NV GPUs use here
but it helps to do locality optimization; most importantly, triangle
order, but also reordering meshlet-local vertices helps a little bit.
This change implements a simple meshlet optimizer; while this can also
be achieved by running existing optimization algorithms (vcache /
vfetch) on meshlet data, a custom optimizer is faster even when using
a quadratic implementation, and may allow us to implement better locality
reordering algorithms in the future assuming a small input patch.
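
To illustrate the kind of locality reordering described above, here is a minimal sketch; it is not the code from this change. The function name `reorderMeshletForLocality`, the recency `window` of 8, and the scoring rule are all assumptions made for the example. It greedily picks the next triangle that shares the most recently used vertices (quadratic in the triangle count, which is acceptable for meshlet-sized inputs), then renumbers meshlet-local vertices in order of first use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of a meshlet locality optimizer (not meshoptimizer's code).
// vertices:  meshlet-local index -> global vertex index
// triangles: three meshlet-local vertex indices per triangle
void reorderMeshletForLocality(std::vector<uint32_t>& vertices, std::vector<uint8_t>& triangles)
{
    assert(triangles.size() % 3 == 0);
    const size_t triangle_count = triangles.size() / 3;
    const size_t vertex_count = vertices.size();
    const int window = 8; // assumed recency window, measured in emitted triangles

    for (size_t i = 0; i < triangles.size(); ++i)
        assert(size_t(triangles[i]) < vertex_count);

    std::vector<int> last_used(vertex_count, -1); // step at which a vertex was last emitted
    std::vector<bool> emitted(triangle_count, false);
    std::vector<uint8_t> sorted;
    sorted.reserve(triangles.size());

    // Pass 1: greedy triangle order. Each step scans all remaining triangles
    // and keeps the one whose vertices were used most recently.
    for (size_t step = 0; step < triangle_count; ++step)
    {
        size_t best = triangle_count;
        int best_score = -1;

        for (size_t t = 0; t < triangle_count; ++t)
        {
            if (emitted[t])
                continue;

            int score = 0;
            for (int k = 0; k < 3; ++k)
            {
                int used = last_used[triangles[t * 3 + k]];
                int age = int(step) - used;
                if (used >= 0 && age < window)
                    score += window - age;
            }

            // Strict comparison keeps the earliest triangle on ties.
            if (score > best_score)
            {
                best = t;
                best_score = score;
            }
        }

        emitted[best] = true;
        for (int k = 0; k < 3; ++k)
        {
            uint8_t v = triangles[best * 3 + k];
            last_used[v] = int(step);
            sorted.push_back(v);
        }
    }

    // Pass 2: renumber meshlet-local vertices in order of first use.
    std::vector<int> remap(vertex_count, -1);
    std::vector<uint32_t> new_vertices;
    new_vertices.reserve(vertex_count);

    for (uint8_t& index : sorted)
    {
        if (remap[index] < 0)
        {
            remap[index] = int(new_vertices.size());
            new_vertices.push_back(vertices[index]);
        }
        index = uint8_t(remap[index]);
    }

    // Keep vertices that no triangle references so the vertex count is unchanged.
    for (size_t v = 0; v < vertex_count; ++v)
        if (remap[v] < 0)
            new_vertices.push_back(vertices[v]);

    vertices = new_vertices;
    triangles = sorted;
}
```

The sketch only changes the order of triangles and meshlet-local vertices; the set of triangles, expressed in global vertex indices, stays the same. Whether this particular heuristic matches what the hardware prefers is not known, so any ordering like this should be validated by measurement.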
On NVidia RTX 4090, this change can result in up to 15% speedup when
workloads are raster-bound compared to just using `buildMeshlets`; the
gains are workload and mesh dependent. niagara sees a 5% speedup when
software triangle culling is disabled.