rsx: Texture cache improvements #9738
Note for testers: If you have enough threads, enable MTRSX. It actually has a decent speedup now in heavy games :) |
Why use CRC32, which has a very high collision chance and is not available everywhere? |
@Nekotekina Because we don't actually care about the contents, just that they did not change. CRC32's collision rate cannot be so high that a small 16-byte texture changes without the hash noticing while the new texture also keeps the same dimensions, type, format, etc. EDIT: I should add - we're not using the hash for storage, it's purely for tamper detection over small ranges. There were plans for hashed storage of textures, but I found it impossible to hash fast enough for big textures so I scrapped the idea. Page protection remains the primary means of tamper detection for larger blocks. |
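For readers following the argument, here is a minimal sketch of hash-based tamper detection over a small range. It is not the PR's code: `SmallSection`, `crc32_sw` and `still_valid` are hypothetical names, and the checksum is the standard software CRC-32 (reflected polynomial 0xEDB88320), which may differ from the variant the PR actually used.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical descriptor for a small cached texture (names are illustrative only).
struct SmallSection {
    std::uint16_t width;
    std::uint16_t height;
    std::uint32_t format;
    std::uint32_t checksum;  // hash of the source bytes taken when the entry was cached
};

// Standard bitwise CRC-32 (reflected polynomial 0xEDB88320), software only.
inline std::uint32_t crc32_sw(const std::uint8_t* data, std::size_t len) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            // XOR in the polynomial only when the low bit is set.
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

// The entry is reused only if the metadata AND the checksum both still match.
inline bool still_valid(const SmallSection& s,
                        std::uint16_t width, std::uint16_t height, std::uint32_t format,
                        const std::uint8_t* data, std::size_t len) {
    return s.width == width && s.height == height && s.format == format &&
           s.checksum == crc32_sw(data, len);
}
```

A false hit needs a hash collision and matching dimensions and format at the same time, which is the point made in the comment above.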
Isn't XXH3 a better idea for small textures? Benchmarks show XXH3 having a much higher small data velocity compared to CRC32C: https://github.com/Cyan4973/xxHash/wiki/Performance-comparison |
Depends how fallback is implemented. SSE one can be pretty fast. memcpy may be horribly slow for non-constant size argument. |
XXH3 is good, though it would require AVX support to be decent. The issue for me with all of these is the hidden setup cost which can be good or awful. I don't know how good XXH3 is when dealing with random sized elements, some much smaller than the width of an AVX512 pipe. Seems like a recipe for if..else..if..else, but we'll see, it all depends on benchmarks. For some data memcpy and u64 bytewise compare may be even faster, at the cost of increased memory usage. |
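The copy-and-compare alternative mentioned above could look roughly like this. It is a sketch under assumptions: `ShadowCopy` is a hypothetical helper, and it uses `std::vector` plus `std::memcmp` rather than a hand-written u64 compare loop, so it only illustrates the trade-off (exact detection, extra memory per tracked range).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: keeps a private copy of a small range and compares against it later.
class ShadowCopy {
public:
    // Snapshot the current contents of the range.
    void capture(const void* src, std::size_t len) {
        const auto* p = static_cast<const std::uint8_t*>(src);
        bytes_.assign(p, p + len);
    }

    // True only if the range has the same size and identical bytes.
    bool matches(const void* src, std::size_t len) const {
        return len == bytes_.size() &&
               (len == 0 || std::memcmp(bytes_.data(), src, len) == 0);
    }

    // Extra memory held per tracked range: the cost of this approach.
    std::size_t footprint() const { return bytes_.size(); }

private:
    std::vector<std::uint8_t> bytes_;
};
```

Unlike a hash, this cannot miss a change, but every tracked range carries a full duplicate of its data and the copies have to be allocated somewhere.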
Tested with The Last of Us (BCUS98174)... Didn't notice any difference but the VRAM allocation... Also, at start it felt like the PR was a little bit smoother, but maybe it's just a placebo effect... I did clear the caches to test if it would make things compile/pop-in faster, but didn't notice much difference, if any... And here's a log: |
@DefaltBR Your RSX load (virtual GPU usage) is too low which means you won't see any improvement in this title. The performance uplift is only apparent when RSX is the bottleneck (90%+ RSX load usually) |
SSX [NPEB01121] is still unplayable: after the first run, FPS drops to 10-20, while on the first run it is a stable 50 FPS. Specs: i9-9900KF, 32 GB RAM, RTX 2080 SUPER. |
@pcca-matrix See #9624 (comment) |
@jacob1218 Check that last statistic. Only 3 textures were uploaded the entire frame, so the bottleneck for that one is elsewhere. |
38776ad to df2cf87
I have removed crc32 and replaced it with basic fnv. Performance should still be about the same; I'm still experimenting with alternatives. xxhash requires too much extra code and memcpy is tricky if using arbitrary sizes as the copies need to be allocated somewhere. The very fast invalidation rate on these games is quite a challenge. |
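For reference, a "basic fnv" hash is only a few lines. The sketch below assumes the 64-bit FNV-1a variant with the published offset basis and prime; the comment above does not say which variant or width the PR uses, and `fnv1a_64` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>

// Published 64-bit FNV parameters.
constexpr std::uint64_t fnv_offset_basis = 0xcbf29ce484222325ull;
constexpr std::uint64_t fnv_prime        = 0x100000001b3ull;

// FNV-1a: XOR each byte into the state, then multiply by the prime.
inline std::uint64_t fnv1a_64(const void* data, std::size_t len) {
    const auto* p = static_cast<const std::uint8_t*>(data);
    std::uint64_t hash = fnv_offset_basis;
    for (std::size_t i = 0; i < len; ++i) {
        hash ^= p[i];
        hash *= fnv_prime;  // wraps modulo 2^64 by design
    }
    return hash;
}
```

It needs no tables, no special instructions and no setup, so it runs the same on every CPU, at the cost of processing one byte per iteration.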
I think you should rename all related vars to fnv, because now it looks weird. |
Don't worry about it, this is just to verify that performance hasn't degraded and everything is ok. This huge mess of commits will be cleaned up before merge. |
035093b to ce3be69
- Also lays groundwork for optional hashed sections
- Drastically lowers time wasted iterating blocks when many small objects are present
- Avoids a silly situation where a texture is discarded and an identical copy created immediately afterward. Unfortunately allocating memory blocks is really slow so avoid it as much as possible.
- Performance optimization when combined with vma optimizations added by me
- Avoids doing useless work. The scanning algorithm is painfully slow on hardware with alignment requirement > 1
  - Up to 50 ms saved for ~600 allocations when many small allocations exist
- It is not a fatal error for a texture to be defined where a framebuffer once existed.
- Bunch of improvements
  - Properly signal renderer to rebind textures!
  - TODO: Range checks, should be pretty easy
ce3be69 to 4cd753d
This is a set of changes that modifies how the texture cache works internally. Highlights:
Another change included is a GPUOpen bugfix for the mem_allocator. I'll submit it to the GPUOpen repo soon. This massively speeds up allocation when many small allocations exist in a block.
TODO: