rsx: Texture cache improvements #9738

Merged
merged 11 commits into from Feb 10, 2021

kd-11
Contributor

@kd-11 kd-11 commented Feb 7, 2021

This is a set of changes that modifies how the texture cache works internally. Highlights:

  1. Do not lock pages if the texture is very small (e.g. a 4x4 texture is only a few bytes long; don't lock a whole page for that). A simple CRC hash is used instead (requires SSE4.2).
  2. Tweak the number of pages per group (a.k.a. storage block) downwards. This saves a lot of time wasted iterating through a block with too many small objects.
  3. Reuse images instead of requesting a new image from the driver. Allocations are shockingly slow.
  4. Leverage MTRSX to do garbage collection. This is a slow activity (allocator bottleneck) and it is not time critical. There is no point in waiting for it to happen sequentially.
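Item 4 in the list above could be sketched roughly as follows, assuming MTRSX provides a secondary thread that can run deferred work. The class `deferred_disposer` and its `dispose` method are hypothetical names for illustration only and are not taken from the RPCS3 source.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical sketch: names are illustrative, not RPCS3's MTRSX code.
class deferred_disposer
{
public:
    deferred_disposer() : m_worker([this] { run(); }) {}

    ~deferred_disposer()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_stop = true;
        }
        m_cv.notify_one();
        m_worker.join(); // run() drains the remaining jobs before returning
    }

    // Called from the render thread: hand off the slow release and return.
    void dispose(std::function<void()> release)
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_jobs.push(std::move(release));
        }
        m_cv.notify_one();
    }

private:
    void run()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        for (;;)
        {
            m_cv.wait(lock, [this] { return m_stop || !m_jobs.empty(); });
            if (m_jobs.empty())
                return; // only reachable once m_stop is set
            auto job = std::move(m_jobs.front());
            m_jobs.pop();
            lock.unlock();
            job(); // the slow allocator call runs off the render thread
            lock.lock();
        }
    }

    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::queue<std::function<void()>> m_jobs;
    bool m_stop = false;
    std::thread m_worker; // declared last so the other members exist first
};
```

The render thread only pays for a queue push; the cost of freeing memory is absorbed by the worker, which is acceptable because nothing waits on the result.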

Another change included is a fix for GPUOpen's mem_allocator. I'll submit it to the GPUOpen repo soon. This massively speeds up allocation when many small allocations exist in a block.

TODO:

  • Linux
  • Add CPU check, not all processors will have hw crc32 support [CRC32 removed]
  • Performance still ok?

@kd-11
Contributor Author

kd-11 commented Feb 7, 2021

Note for testers: If you have enough threads, enable MTRSX. It actually has a decent speedup now in heavy games :)

@Nekotekina
Member

Why use CRC32, which has a very high collision chance and is not available everywhere?
If some textures are that small, they can just be saved somewhere and compared bitwise. I doubt it would have any significant impact on performance.

@kd-11
Contributor Author

kd-11 commented Feb 7, 2021

@Nekotekina Because we don't actually care about the contents, just that they did not change. CRC32's collision rate cannot be so high that a small 16-byte texture changes without being noticed while the new texture also has the same dimensions, type, format, etc.
I'll add the block copy fallback anyway as the instruction is not available everywhere. In this case there are thousands of small objects created per frame, I'm almost certain the copying will become the new bottleneck.

EDIT: I should add - we're not using the hash for storage, it's purely for tamper detection over small ranges. There were plans for hashed storage of textures, but I found it impossible to hash fast enough for big textures so I scrapped the idea. Page protection remains the primary means of tamper detection for larger blocks.
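A minimal sketch of that split, for illustration only: the names `protection_kind` and `choose_protection`, the 4096-byte page size, and the one-page threshold are all assumptions, since the thread does not state the actual cutoff used in the PR.

```cpp
#include <cstddef>

// Hypothetical sketch; names and threshold are assumptions, not RPCS3 code.
enum class protection_kind { hash_check, page_lock };

constexpr std::size_t page_size = 4096; // assumed host page size

// Small sections are verified by hashing their bytes; larger ones are
// write-protected so that any write faults and invalidates the section.
constexpr protection_kind choose_protection(std::size_t section_bytes,
                                            std::size_t threshold = page_size)
{
    return section_bytes < threshold ? protection_kind::hash_check
                                     : protection_kind::page_lock;
}
```

With these assumed values, a 4x4 RGBA8 texture (64 bytes) would be hash-checked, while a 1 MiB texture would still be page-locked.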

@Miksel12

Miksel12 commented Feb 7, 2021

Isn't XXH3 a better idea for small textures? Benchmarks show XXH3 having a much higher small data velocity compared to CRC32C: https://github.com/Cyan4973/xxHash/wiki/Performance-comparison

@Nekotekina
Member

Depends on how the fallback is implemented. An SSE one can be pretty fast. memcpy may be horribly slow for a non-constant size argument.

@kd-11
Contributor Author

kd-11 commented Feb 7, 2021

XXH3 is good, though it would require AVX support to be decent. The issue for me with all of these is the hidden setup cost which can be good or awful. I don't know how good XXH3 is when dealing with random sized elements, some much smaller than the width of an AVX512 pipe. Seems like a recipe for if..else..if..else, but we'll see, it all depends on benchmarks. For some data memcpy and u64 bytewise compare may be even faster, at the cost of increased memory usage.
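For reference, the copy-and-compare alternative mentioned at the end could look something like the sketch below. `snapshot_guard` is a hypothetical name, and `std::memcmp` stands in for whatever wide compare an actual implementation would use.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical sketch of the copy-and-compare fallback; not RPCS3 code.
class snapshot_guard
{
public:
    // Keep a private copy of the memory backing a small texture.
    snapshot_guard(const void* src, std::size_t size)
        : m_src(static_cast<const unsigned char*>(src))
        , m_copy(m_src, m_src + size)
    {
    }

    // True if the source memory no longer matches the saved copy.
    bool is_dirty() const
    {
        return !m_copy.empty() &&
               std::memcmp(m_src, m_copy.data(), m_copy.size()) != 0;
    }

private:
    const unsigned char* m_src;        // watched memory, not owned
    std::vector<unsigned char> m_copy; // the extra memory cost noted above
};
```

Unlike a hash, this cannot produce a false "unchanged" result, but every tracked texture needs a second copy of its bytes and a heap allocation for it.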

@kd-11 kd-11 changed the title [TESTERS NEEDED] rsx: Texture cache improvements [WIP][TESTERS NEEDED] rsx: Texture cache improvements Feb 7, 2021
@DefaltBR

DefaltBR commented Feb 7, 2021

Tested with The Last of Us (BCUS98174). Didn't notice any difference except in the VRAM allocation. At the start the PR felt a little smoother, but maybe that's just a placebo effect. I cleared the caches to test whether it would make things compile/pop in faster, but didn't notice much difference, if any.

[Screenshots: PR_MTRSX_Off, Master_MTRSX_Off, PR_MTRSX_On, Master_MTRSX_On]

And here's a log:
RPCS3.log

@kd-11
Contributor Author

kd-11 commented Feb 7, 2021

@DefaltBR Your RSX load (virtual GPU usage) is too low, which means you won't see any improvement in this title. The performance uplift is only apparent when RSX is the bottleneck (usually 90%+ RSX load).

@pcca-matrix

SSX [NPEB01121] is still unplayable after the first run.

FPS drops to 10-20, while on the first run it's a stable 50 FPS.

i9-9900KF, 32 GB, RTX 2080 SUPER

@kd-11
Contributor Author

kd-11 commented Feb 7, 2021

@pcca-matrix See #9624 (comment)
The core issue with SSX when reloading levels is known and a fix is in the works separate from this changeset.

@Jacoby1218
Contributor

This didn't seem to improve f1 2014, even with mtrsx on.
image

@kd-11
Contributor Author

kd-11 commented Feb 7, 2021

@jacob1218 Check that last statistic. Only 3 textures were uploaded the entire frame, so the bottleneck for that one is elsewhere.
Strangely, total time spent inside the render functions is about 12ms which should result in much higher performance than that unless running on iGP or something.
EDIT: For comparison, Killzone moves around 650 textures per frame and therefore sees the largest change.

@kd-11
Contributor Author

kd-11 commented Feb 7, 2021

I have removed crc32 and replaced it with basic fnv. Performance should still be about the same; I'm still experimenting with alternatives. xxhash requires too much extra code and memcpy is tricky if using arbitrary sizes as the copies need to be allocated somewhere. The very fast invalidation rate on these games is quite a challenge.
From my quick testing FNV works fine for affected games and I could force all textures to be checked with it without visual corruption.
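For readers unfamiliar with FNV, here is a sketch of the 64-bit FNV-1a variant used as a change detector. The comment only says "basic fnv", so the choice of variant and width is an assumption, and `fnv1a_64` is an illustrative name rather than the function in the PR.

```cpp
#include <cstddef>
#include <cstdint>

// 64-bit FNV-1a; the variant and width are assumptions, not taken from the PR.
inline std::uint64_t fnv1a_64(const unsigned char* data, std::size_t size)
{
    std::uint64_t hash = 0xcbf29ce484222325ull; // FNV offset basis
    for (std::size_t i = 0; i < size; ++i)
    {
        hash ^= data[i];          // mix in one byte
        hash *= 0x100000001b3ull; // multiply by the FNV prime
    }
    return hash;
}
```

A cache entry would store the hash computed when the texture was uploaded and recompute it before reuse; a mismatch means the memory was written to and the entry must be invalidated. The loop needs no special CPU instructions and no extra storage beyond the 8-byte hash.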

@sampletext32
Contributor

sampletext32 commented Feb 7, 2021

> I have removed crc32 and replaced it with basic fnv.

I think you should rename all related vars into fnv, because now it seems weird.

@kd-11
Contributor Author

kd-11 commented Feb 7, 2021

> > I have removed crc32 and replaced it with basic fnv.
>
> I think you should rename all related vars into fnv, because now it seems weird.

Don't worry about it, this is just to verify that performance hasn't degraded and everything is ok. This huge mess of commits will be cleaned up before merge.

@kd-11 kd-11 changed the title [WIP][TESTERS NEEDED] rsx: Texture cache improvements rsx: Texture cache improvements Feb 9, 2021
- Also lays groundwork for optional hashed sections
- Drastically lowers time wasted iterating blocks when many small objects
  are present
- Avoids a silly situation where a texture is discarded and an identical copy created immediately afterward.
  Unfortunately allocating memory blocks is really slow so avoid it as much as possible.
- Performance optimization when combined with vma optimizations added by me
- Avoids doing useless work. The scanning algorithm is painfully slow on hardware with alignment requirement > 1
- Up to 50 ms saved for ~600 allocations when many small allocations exist
- It is not a fatal error for a texture to be defined where a framebuffer once existed.
- Bunch of improvements
- Properly signal renderer to rebind textures!
- TODO: Range checks, should be pretty easy
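
The image reuse mentioned in the commit notes (avoiding a discard followed immediately by an identical allocation) can be illustrated with a small pool keyed on image properties. This is only a sketch: `image`, `image_key` and `image_pool` are hypothetical names, and a plain heap allocation stands in for the slow driver allocation.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical sketch; `image`, `image_key` and `image_pool` are illustrative.
struct image
{
    std::uint32_t width, height, format;
};

using image_key = std::tuple<std::uint32_t, std::uint32_t, std::uint32_t>;

class image_pool
{
public:
    // Return a retired image with matching properties, or create a new one.
    std::unique_ptr<image> acquire(std::uint32_t w, std::uint32_t h, std::uint32_t fmt)
    {
        auto& bucket = m_free[image_key{w, h, fmt}];
        if (!bucket.empty())
        {
            auto img = std::move(bucket.back());
            bucket.pop_back();
            return img;
        }
        ++m_allocations; // stands in for the slow driver allocation
        return std::unique_ptr<image>(new image{w, h, fmt});
    }

    // Instead of destroying a discarded image, keep it for reuse.
    void release(std::unique_ptr<image> img)
    {
        const image_key key{img->width, img->height, img->format};
        m_free[key].push_back(std::move(img));
    }

    int allocations() const { return m_allocations; }

private:
    std::map<image_key, std::vector<std::unique_ptr<image>>> m_free;
    int m_allocations = 0;
};
```

A real pool would also need a policy for trimming images that stay unused, otherwise retired images accumulate in VRAM.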

7 participants