
[PERF] move assert to insert #11273

Closed
wants to merge 2 commits

Conversation

ashtul (Contributor) commented Sep 16, 2022

In this PR, lpAssertValidEntry is removed from lpNext, lpPrev, lpFirst and lpFind.
Entry validity is now checked in lpInsert instead.

A difference can be seen for both the HSET and HGETALL commands.


HGETALL on a hash with 50 fields (values '0' to '50'):

  • Unstable (hgetall_50hash_with): hashTypeNext 19.13%, lpAssertValidEntry 10.2%
  • Check validity on insert (hgetall_50hash_wo): hashTypeNext 10.86%, lpAssertValidEntry 0%

HSET on a hash with 50 fields (values '0' to '50'):

  • Unstable (hset): hashTypeSet 57.72%, lpAssertValidEntry 7.5%
  • Check validity on insert (hset_assert_on_insert): hashTypeSet 53.61%, lpAssertValidEntry 3.9%

sundb (Collaborator) commented Sep 16, 2022

@ashtul As the failed test-sanitizer-address run shows, with this change we can no longer detect corrupted data during traversal or lookup.

oranagra (Member) commented:

To make it clear: you changed an assertion in a test because there's a flow that used to assert in Redis and now it doesn't. But that means that instead of aborting Redis, it now corrupts memory, as the valgrind and address-sanitizer tests report:

==13070==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x607000002c90 at pc 0x5619981dc1bf bp 0x7fff9f5d8e70 sp 0x7fff9f5d8e60
READ of size 4 at 0x607000002c90 thread T0
    #0 0x5619981dc1be in lpCurrentEncodedSizeUnsafe

P.S. You're also missing a restore keyword in another test you modified.

@oranagra oranagra added the state:to-be-closed requesting the core team to close the issue label Sep 17, 2022
ashtul (Contributor, author) commented Sep 17, 2022

I think we can keep the asserts in lpNext, lpPrev and lpFirst but add a flag that tells us whether to check validity. The flag would be on for RESTORE and other write commands but off for read commands.
This way we get both safety and performance.
@oranagra @sundb WDYT?

sundb (Collaborator) commented Sep 17, 2022

In fact, reads need the check too; otherwise a read could reach arbitrary memory and return it to the client.
I guess what you actually want is an lpNextCount: for the entries in between we can do just a simple check, like lpFind does, and only do a strict check on the last entry.

ashtul (Contributor, author) commented Sep 17, 2022

I see the point of checking for corrupted memory, but I wonder if it justifies a 10% performance penalty.
Maybe it can be relaxed.
The only way we can reproduce this behavior is by placing corrupted memory there ourselves. In real life, this type of corruption means something went wrong somewhere else and corrupted our memory. Is checking for it worth the penalty?

oranagra (Member) commented:

@ashtul the problem could be with a malicious RESTORE command, that appears to have a valid listpack header, but some of the listpack records in it, reach outside the listpack allocation.
in that case it could lead to segmentation faults, reading arbitrary data, or even remote code execution.

we have an option to do deep sanitization when processing a RESTORE command, (cost of O(N) during RESTORE), and then we know we can completely trust the data, but we preferred to avoid going down that path.

gkorland (Contributor) commented:

@oranagra validation of the listpack on RESTORE seems like a price worth paying to speed up all reads.
RESTORE is usually not latency-sensitive the way reads are, and it doesn't happen nearly as often.

oranagra (Member) commented:

the other issue with proceeding with that direction is that we'll have to maintain two copies of many functions, one to run on safe keys, and one to run on unsafe keys, and taint keys so we know which is which.
or alternatively, just have two server configurations, but either way, we'll have to have two copies of many functions.
that's because we can't afford to put the if in the inner loops since that'll come with a performance price too.
that also means testing and coverage becomes more complicated.

@filipecosta90 filipecosta90 added the action:run-benchmark Triggers the benchmark suite for this Pull Request label Sep 19, 2022
ashtul (Contributor, author) commented Sep 19, 2022

How about having a debugAssert that runs only in DEBUG builds and compiles away in RELEASE builds?
That would give us both a safe mode and a faster one. After all, we don't have any test that breaks the data structure except when we insert corrupted data ourselves.

oranagra (Member) commented:

first, i don't wanna get into having two builds and code that behaves differently in debug and release builds.
secondly, this feature (#7807) is intended for production deployments (to keep them safe), so if we want to make this optional, it must be a runtime flag.

ashtul (Contributor, author) commented Sep 21, 2022

Moving the conversation to issue #11293 to discuss the performance degradation caused by lpAssertValidEntry, separately from this fix's implementation.

@ashtul ashtul closed this Sep 21, 2022
zuiderkwast (Contributor) commented Sep 21, 2022

> the other issue with proceeding with that direction is that we'll have to maintain two copies of many functions, one to run on safe keys, and one to run on unsafe keys, and taint keys so we know which is which.

@oranagra The code in rdb.c is already basically a separate copy of much of the logic from every key type.

What if we remove the asserts as in this PR, but call lpAssertValidEntry explicitly after each lpFirst, lpNext, lpPrev and lpFind call in the RDB code? That would cover RESTORE and other scenarios where we want to detect corrupt data.

I think forcing deep sanitization sounds like a good idea. I hope it can be reconsidered.

oranagra (Member) commented:

I don't think that calling lpAssertValidEntry in rdb.c is gonna make a difference.
if someone manages to load a corrupt payload using RESTORE (without deep sanitization), then later a read command can read random memory addresses, and there's also a chance an insert or append command will write to a random address.

the only 3 alternatives i think are acceptable are:

  1. what we have today (shallow sanitization at load time to validate the header, plus assertions when iterating later)
  2. deep sanitization at load time, then mark the key as safe so it can skip the assertions (keep the assertions for unmarked keys)
  3. a configuration flag so people running in safe environments can turn all of this off.

the problem is that i don't think that adding an if before the assertion is gonna be very effective (still inside the loop), and in order for it to be effective we may have to clone some code so that the if is outside the loop.

zuiderkwast (Contributor) commented:

Always forcing deep sanitization so all keys are safe is not an option?

oranagra (Member) commented:

always in the RESTORE command, or also in rdb loading and on replicas?
maybe i'm biased here and i think the impact of deep sanitization is higher than it is.
also, IIRC the degradation of the assertions wasn't that high (although that was ziplist at that time)
see: #7807 (comment)
so maybe something can be improved without removing them.

zuiderkwast (Contributor) commented:

#7807 was a massive and awesome job. 🥇

> always in RESTORE command or also in rdb loading and replica?

I was thinking in all of these.

I realize it might not be good to slow down RESTORE, since MIGRATE uses RESTORE and blocks both cluster nodes until it's done. Perhaps we can consider going down that path once we have a better solution for atomic slot migration that doesn't block two nodes...(?)

The way I think of it, data corruption can originate from several sources: a malicious user, a bad network connection, a bad hard disk or a memory corruption in the running node (caused by a cosmic ray). Stability wise it might be good to catch all of these (i.e. do deep sanitization and keep the asserts) but performance wise it isn't. Security-wise, it's the data from untrusted sources we're interested in, i.e. RESTORE from random clients.

What about trusting other cluster nodes (MIGRATE) by including them in the cluster-sanitize-dump clients check (not sure if they are now) and making it the default?
