core: Rewrite thread local storage implementation #118
It's not uncommon for PS4 guest applications to launch and use many threads, so thread-local storage has to be handled properly. On x86-64, thread-local accesses are performed by loading the TLS pointer through the FS segment register.
This is a problem as Windows doesn't allow you to change the value of this register to what the guest expects. (Not quite true, see first reply.) On master this is handled with a simple exception handler that patches the destination register with the address of a thread_local buffer. This works, but it will be a problem later on. The performance cost is large for every access. In addition, the new texture cache that does fault tracking also needs a custom exception handler, so the two end up conflicting. Guest apps can also use negative offsets when accessing the buffer, and the current implementation would trigger UB in those cases.
This PR attempts to fix all of the above by using assembly trampolines instead of the exception handler. To store the TLS image pointer, a new TLS slot is allocated from the parent process, and the logic from Wine's TlsGetValue is used to retrieve the value. This means we also don't have to rely on undefined/unused space in the TEB structure to store our data. Each mov instruction that reads from the FS segment is patched with a jump to a trampoline that loads the actual pointer.
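As a rough illustration of the patching step, here is a minimal sketch, not the code in this PR. It assumes the only pattern of interest is `mov rax, fs:[0]` (bytes `64 48 8B 04 25 00 00 00 00`), works on a plain byte buffer, and assumes a little-endian host. `PatchTlsLoads`, `code_base` and `trampoline` are hypothetical names. A real patcher would likely decode instructions properly rather than match raw bytes, and each trampoline would still need to jump back to the instruction after the patched site.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Encoding of `mov rax, qword ptr fs:[0]`:
// FS prefix, REX.W, MOV r64 <- r/m64, ModRM (SIB follows), SIB (disp32 only), disp32 = 0.
constexpr std::uint8_t kTlsLoad[] = {0x64, 0x48, 0x8B, 0x04, 0x25, 0x00, 0x00, 0x00, 0x00};
constexpr std::size_t kJmpSize = 5; // E9 + rel32

// Hypothetical helper: replaces every TLS load in `code` with `jmp trampoline`,
// padding the remaining bytes with NOPs. `code_base` is the virtual address at
// which code[0] is mapped. Returns the number of patched sites.
std::size_t PatchTlsLoads(std::vector<std::uint8_t>& code, std::uint64_t code_base,
                          std::uint64_t trampoline) {
    std::size_t patched = 0;
    std::size_t i = 0;
    while (i + sizeof(kTlsLoad) <= code.size()) {
        if (std::memcmp(code.data() + i, kTlsLoad, sizeof(kTlsLoad)) != 0) {
            ++i;
            continue;
        }
        // rel32 is relative to the address of the instruction after the jump.
        const std::int64_t rel = static_cast<std::int64_t>(trampoline) -
                                 static_cast<std::int64_t>(code_base + i + kJmpSize);
        assert(rel >= INT32_MIN && rel <= INT32_MAX && "trampoline must be within +/-2 GiB");
        const std::int32_t rel32 = static_cast<std::int32_t>(rel);

        code[i] = 0xE9; // jmp rel32
        std::memcpy(code.data() + i + 1, &rel32, sizeof(rel32));
        std::memset(code.data() + i + kJmpSize, 0x90, sizeof(kTlsLoad) - kJmpSize); // NOP padding

        i += sizeof(kTlsLoad);
        ++patched;
    }
    return patched;
}
```

The 9-byte load leaves room for a 5-byte near jump, which is why the trampoline has to be allocated within 2 GiB of the patched code in this sketch.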
While at it, this also fixes a fault-tracking problem that caused crashes in the pngdec demo. Tracking was being performed at the texture cache page size, when it should be on a 4KB boundary like the host/guest page size. The cache page size was also bumped to vastly reduce the number of page table accesses.
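To make the 4KB granularity concrete, the following is a small sketch of how a tracked range could be mapped to host pages. It is an assumption-laden illustration rather than the code in this PR: `PageRange` and `PagesTouched` are hypothetical names, and only the 4KB page size comes from the description above.

```cpp
#include <cstdint>

constexpr std::uint64_t kPageBits = 12;                 // 4KB host/guest pages
constexpr std::uint64_t kPageSize = 1ULL << kPageBits;  // 4096 bytes

// Hypothetical result type: index of the first 4KB page and how many pages follow.
struct PageRange {
    std::uint64_t first;
    std::uint64_t count;
};

// Returns the 4KB pages overlapped by [addr, addr + size).
// An empty range touches no pages.
inline PageRange PagesTouched(std::uint64_t addr, std::uint64_t size) {
    const std::uint64_t first = addr >> kPageBits;
    if (size == 0) {
        return {first, 0};
    }
    const std::uint64_t last = (addr + size - 1) >> kPageBits;
    return {first, last - first + 1};
}
```

For example, a 2-byte range starting at 0x1FFF straddles a page boundary and touches two pages, so both would need their protection changed for fault tracking.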