
[PERF] move assert to insert #11273

Closed
wants to merge 2 commits

Conversation

ashtul (Contributor) commented Sep 16, 2022

In this PR, lpAssertValidEntry is removed from lpNext, lpPrev, lpFirst and lpFind.
Entry validity is now checked in lpInsert instead.

A difference can be seen for both the HSET and HGETALL commands.


HGETALL on a hash with 50 fields (values '0' to '50'):

  • Unstable (hgetall_50hash_with): hashTypeNext 19.13%, lpAssertValidEntry 10.2%
  • Check validity on insert (hgetall_50hash_wo): hashTypeNext 10.86%, lpAssertValidEntry 0%

HSET on a hash with 50 fields (values '0' to '50'):

  • Unstable (hset): hashTypeSet 57.72%, lpAssertValidEntry 7.5%
  • Check validity on insert (hset_assert_on_insert): hashTypeSet 53.61%, lpAssertValidEntry 3.9%

sundb (Collaborator) commented Sep 16, 2022

@ashtul As the failed test-sanitizer-address run shows, with this change we can no longer detect corrupted data during traversal or lookup.

oranagra (Member) commented:

To make it clear: you changed an assertion in a test because there's a flow that used to assert in Redis and now it doesn't. But that means that instead of aborting Redis, it now corrupts memory, as the valgrind and address-sanitizer tests report:

==13070==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x607000002c90 at pc 0x5619981dc1bf bp 0x7fff9f5d8e70 sp 0x7fff9f5d8e60
READ of size 4 at 0x607000002c90 thread T0
    #0 0x5619981dc1be in lpCurrentEncodedSizeUnsafe

P.S. You're also missing a restore keyword in another test you modified.

@oranagra oranagra added the state:to-be-closed requesting the core team to close the issue label Sep 17, 2022
ashtul (Contributor, author) commented Sep 17, 2022

I think we can keep the asserts in lpNext, lpPrev and lpFirst but add a flag that tells us whether to check validity. The flag would be on for RESTORE and other write commands but off for read commands.
This way we get both safety and performance.
@oranagra @sundb WDYT?

sundb (Collaborator) commented Sep 17, 2022

In fact, reads need the check too; otherwise a read could reach arbitrary memory and return it to the client.
I guess what you actually want is an lpNextCount: for the entries in between we can do just a simple check, like lpFind does, and only do a strict check on the last entry.

ashtul (Contributor, author) commented Sep 17, 2022

I see the point of checking for corrupted memory, but I wonder if it justifies a 10% performance penalty.
Maybe it can be relaxed.
The only way we can reproduce this behavior is by placing corrupted memory there ourselves. In real life, this type of corruption means something went wrong somewhere else and corrupted our memory. Is checking for it worth the penalty?

oranagra (Member) commented:

@ashtul the problem could be with a malicious RESTORE command, that appears to have a valid listpack header, but some of the listpack records in it, reach outside the listpack allocation.
in that case it could lead to segmentation faults, reading arbitrary data, or even remote code execution.

we have an option to do deep sanitization when processing a RESTORE command, (cost of O(N) during RESTORE), and then we know we can completely trust the data, but we preferred to avoid going down that path.

gkorland (Contributor) commented:

@oranagra validation of the listpack on RESTORE seems like a price worth paying to speed up all reads.
RESTORE is usually not latency-sensitive the way reads are, and it doesn't happen nearly as often.

oranagra (Member) commented:

the other issue with proceeding with that direction is that we'll have to maintain two copies of many functions, one to run on safe keys, and one to run on unsafe keys, and taint keys so we know which is which.
or alternatively, just have two server configurations, but either way, we'll have to have two copies of many functions.
that's because we can't afford to put the if in the inner loops since that'll come with a performance price too.
that also means testing and coverage becomes more complicated.

@filipecosta90 filipecosta90 added the action:run-benchmark Triggers the benchmark suite for this Pull Request label Sep 19, 2022
ashtul (Contributor, author) commented Sep 19, 2022

How about having a debugAssert that runs only in DEBUG builds and compiles away in RELEASE builds?
That would give us both a safe mode and a faster one. After all, we don't have any test that breaks the data structure except when we insert corrupted data ourselves.

oranagra (Member) commented:

first, i don't wanna get into having two builds and code that behaves differently in debug and release builds.
secondly, this feature (#7807) is intended for production deployments (to keep them safe), so if we want to make this optional, it must be a runtime flag.

ashtul (Contributor, author) commented Sep 21, 2022

Moving the conversation to issue #11293 to discuss the performance degradation caused by lpAssertValidEntry, separately from this fix's implementation.

@ashtul ashtul closed this Sep 21, 2022
zuiderkwast (Contributor) commented Sep 21, 2022

> the other issue with proceeding with that direction is that we'll have to maintain two copies of many functions, one to run on safe keys, and one to run on unsafe keys, and taint keys so we know which is which.

@oranagra The code in rdb.c is already basically a separate copy of much of the logic from every key type.

What if we remove the asserts as in this PR, but call lpAssertValidEntry explicitly after each lpFirst, lpNext, lpPrev and lpFind call in the RDB code? That would cover RESTORE and other scenarios where we want to detect corrupt data.

I think forcing deep sanitization sounds like a good idea. I hope it can be reconsidered.

oranagra (Member) commented:

I don't think that calling lpAssertValidEntry in rdb.c is gonna make a difference.
if someone manages to load a corrupt payload using RESTORE (without deep sanitization), then later a read command can read random memory addresses, and there's also a chance an insert or append command will write to a random address.

the only 3 alternatives i think are acceptable are:

  1. what we have today (shallow sanitization at load time to validate the header, plus assertions when iterating later)
  2. deep sanitization at load time, then mark the key as safe so it can skip the assertions (keep the assertions for unmarked keys)
  3. a configuration flag so people running in safe environments can turn all of this off.

the problem is that i don't think that adding an if before the assertion is gonna be very effective (still inside the loop), and in order for it to be effective we may have to clone some code so that the if is outside the loop.

zuiderkwast (Contributor) commented:

Always forcing deep sanitization so all keys are safe is not an option?

oranagra (Member) commented:

always in the RESTORE command, or also in rdb loading and on replicas?
maybe i'm biased here and i think the impact of deep sanitization is higher than it is.
also, IIRC the degradation of the assertions wasn't that high (although that was ziplist at that time)
see: #7807 (comment)
so maybe something can be improved without removing them.

zuiderkwast (Contributor) commented:

#7807 was a massive and awesome job. 🥇

> always in RESTORE command or also in rdb loading and replica?

I was thinking in all of these.

I realize it might not be good to slow down RESTORE, since MIGRATE uses RESTORE and blocks both cluster nodes until it's done. Perhaps we can consider going down that path once we have a better solution for atomic slot migration that doesn't block two nodes...(?)

The way I think of it, data corruption can originate from several sources: a malicious user, a bad network connection, a bad hard disk or a memory corruption in the running node (caused by a cosmic ray). Stability wise it might be good to catch all of these (i.e. do deep sanitization and keep the asserts) but performance wise it isn't. Security-wise, it's the data from untrusted sources we're interested in, i.e. RESTORE from random clients.

What about trusting other cluster nodes (MIGRATE) by including them in the cluster-sanitize-dump clients check (not sure if they are now) and making it the default?
