Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor refactoring for rioConnRead and adding errno #9280

Merged
merged 4 commits into from Jul 30, 2021

Conversation

Ewg-c
Copy link
Contributor

@Ewg-c Ewg-c commented Jul 28, 2021

Redis 6.0 contained a bug when the master uses disk-based replication (repl-diskless-sync is no) and the replica is disk-less (repl-diskless-load is set to non-default value).
The bug could cause the rdb loading code in the replica to buffer too much data from the socket and fail the replication.

Due to da840e9 the error condition does not depend on the while loop where we read from socket. This change cleans up the code and extracts the condition outside the loop.
The change adds errno to "Failed trying to load the MASTER synchronization DB" error message in readSyncBulkPayload() to make debugging of the similar problems easier in the future.

@madolson madolson requested a review from oranagra July 28, 2021 22:34
madolson
madolson previously approved these changes Jul 28, 2021
Copy link
Contributor

@madolson madolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I don't have a lot of context here.

oranagra
oranagra previously approved these changes Jul 29, 2021
Copy link
Member

@oranagra oranagra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Ewg-c the fix looks good to me, and it also simplifies the code.
indeed there's no reason to check for this case inside the loop if all the conditions leading to the error don't change in the loop and we only actually check the input arguments vs the startup state.

what i don't yet understand is:

  1. what was the problem with the original code, it seems that it would function correctly as well (and you should have got your desired errno).
  2. this code is supposed to be dead code. the scenario leading for this read_limit to be used is only when the master is diskless and the replica is disk-based, and even then an attempt to read outside of the range should never happen since rdb.c knows how much it should read for each type it reads.

@oranagra oranagra added the state:to-be-merged The PR should be merged soon, even if not yet ready, this is used so that it won't be forgotten label Jul 29, 2021
src/replication.c Outdated Show resolved Hide resolved
@Ewg-c
Copy link
Contributor Author

Ewg-c commented Jul 29, 2021

@Ewg-c the fix looks good to me, and it also simplifies the code.
indeed there's no reason to check for this case inside the loop if all the conditions leading to the error don't change in the loop and we only actually check the input arguments vs the startup state.

what I don't yet understand is:

  1. what was the problem with the original code, it seems that it would function correctly as well (and you should have got your desired errno).
  2. this code is supposed to be dead code. the scenario leading for this read_limit to be used is only when the master is diskless and the replica is disk-based, and even then an attempt to read outside of the range should never happen since rdb.c knows how much it should read for each type it reads.

Thank you for the correction @oranagra.
For 1. In Redis 6.0 we had repeatable and consistent failures because of the bug.
Redis 6.0 uses this expression if (r->io.conn.read_limit >= r->io.conn.read_so_far - buffered) making the right part negative in many cases, but because of the previous (correct) check we are not entering that part of the code frequently.

For 2. I did not get impression this is a "dead" code. I might be missing something, please let me know. We have seen it with full sync and diskless on both master and replica. I found this adjusted stack trace in my notes:

Thread 1 "redis-server" hit Breakpoint 1, rioConnRead (r=0x7fff72883830, buf=0x7f339d850480, len=53149) at rio.c:205
$1 = 37073 // 'p ((struct sdshdr32 *)((r->io.conn.buf)-(sizeof(struct sdshdr32))))->len'
$2 = 102 // 'p r->io.conn.pos'
$3 = 53149 // 'p len'
$4 = 53260 // 'p r->io.conn.read_limit'
$5 = 102 // 'p r->io.conn.read_so_far'
#0  rioConnRead (r=0x7fff72883830, buf=0x7f339d850480, len=53149) at rio.c:205
#1  in rioRead (len=53149, buf=0x7f339d850480, r=0x7fff72883830) at rio.h:123
#2  rdbLoadLzfStringObject (rdb=rdb@entry=0x7fff72883830, flags=flags@entry=1, lenptr=lenptr@entry=0x0) at rdb.c:391
#3  in rdbGenericLoadStringObject (rdb=0x7fff72883830, flags=flags@entry=1, lenptr=lenptr@entry=0x0) at rdb.c:504
#4  in rdbLoadEncodedStringObject (rdb=<optimized out>) at rdb.c:539
#5  rdbLoadObject (rdbtype=<optimized out>, rdb=<optimized out>, key=<optimized out>) at rdb.c:1494
#6  in rdbLoadRio (rdb=rdb@entry=0x7fff72883830, rdbflags=rdbflags@entry=2, rsi=rsi@entry=0x7fff728837f0) at rdb.c:2317
#7  in readSyncBulkPayload (conn=0x7f339db89380) at replication.c:1667
...

@madolson
Copy link
Contributor

I'll also add the information that this is AWS, and we do use the configuration diskless on primary and disk based on the replica by default.

@madolson madolson merged commit a403816 into redis:unstable Jul 30, 2021
@oranagra
Copy link
Member

@madolson @Ewg-c the configuration is perfectly valid, the test suite uses it too, but I still don't understand what's the problem this PR come to fix (other than a cleanup).

Redis 6 was indeed buggy in that line, and that was fixed in #7557 (with an additional fix in #7564), so as far as I can tell, the current code that this PR come to change was OK.
If we agree on that, then I can just say that it's confusing to see a PR that says it fixes a bug, but that bug is no longer there. It wasn't clear enough from the comments that this is just a cleanup.

The other thing that bothers me is that this condition is suppose to be dead code, so I still don't understand how you run into it (consistently). rdb.c knows how many bytes to read for each type, and it should never ask to read more than there is in the rdb file. When it sees the EOF byte it stops.
This limit is normally there to just stop this mechanism from buffering data from the socket beyond the limit, but not to error on reads outside the range.

@oranagra
Copy link
Member

On a second look, maybe the PR title and top comment are clear that this is just a refactoring..
Maybe the mention to the redis 6 bug is what thrown me off balance (don't see how it's related now) .

But I'm still curious how come you got to get this error.. Maybe some modification in your fork?

@madolson
Copy link
Contributor

Oh, yes, this is just a refactoring to help identify an issue we saw internally that we needed gdb to identify. AFAIK the issue was fixed in 6.2, (we maintain too many old versions) and we saw this on a 6.0 version. @Ewg-c can fill in that details if you are really curious.

@Ewg-c
Copy link
Contributor Author

Ewg-c commented Jul 30, 2021

@oranagra sorry if the information on the ticket mislead you.

rdb.c knows how many bytes to read for each type, and it should never ask to read more than there is in the rdb file. When it sees the EOF byte it stops.

I posted the stack trace earlier and can add that the cause of the culprit is toread assignment. It does not take EOF into account and would attempt reading over the limit. This is where the condition in question is triggered. Personally I believe that everyone running Redis 6.0 would be affected, though it requires unlucky match of the values, it still should be happening regularly.

I also think it was TLS enabled cluster.

@oranagra
Copy link
Member

Ok, so it's not that rdb.c attempted to read beyond the limit, but that rioConnRead messed up and attempted to buffer more than it should.
It's odd that the tests didn't find it back then, since I think they where testing this configuration combination..
Well, I suppose that there's no sense in wasting more time looking into it considering that this code doesn't exits.
But maybe we should backport the latest version into the next release of 6.0 and 5.0?

@oranagra oranagra added this to To Do in 5.0 Backport via automation Jul 30, 2021
@oranagra oranagra added this to To Do in 6.0 Backport via automation Jul 30, 2021
@Ewg-c
Copy link
Contributor Author

Ewg-c commented Jul 30, 2021

@oranagra
rioConnIO was introduced in Redis 6, thus Redis 5.0 is not affected by this problem.
IMO It is good idea to backport the fix to 6.0 branch. The fix consists of these two commits and they merge smoothly:

https://github.com/redis/redis/commit/40d7fca3685d8439bae8480ddbd59775a2390411
https://github.com/redis/redis/commit/da840e9851bab8d1674e245a812b2105be111208

@oranagra
Copy link
Member

ohh, right.. i implemented it in redis 2.8, but couldn't get Salvatore to merge my PR until recently.

@oranagra oranagra removed this from To Do in 5.0 Backport Jul 30, 2021
@madolson madolson added release-notes indication that this issue needs to be mentioned in the release notes and removed state:to-be-merged The PR should be merged soon, even if not yet ready, this is used so that it won't be forgotten labels Jul 31, 2021
JackieXie168 pushed a commit to JackieXie168/redis that referenced this pull request Sep 8, 2021
minor refactoring for rioConnRead and adding errno
@oranagra oranagra moved this from To Do to In progress in 6.0 Backport Sep 29, 2021
@oranagra
Copy link
Member

@Ewg-c i've edited the top comment to be used for the release notes, please review / fix (specifically what where the implications of the bug).
p.s. i now notice that my previous statement here about this code being active on diskless master with disk-based replica were wrong (it's the other way around)

@Ewg-c
Copy link
Contributor Author

Ewg-c commented Oct 1, 2021

@oranagra thank you. I updated the top comment. It should be good I believe.

oranagra pushed a commit to oranagra/redis that referenced this pull request Oct 4, 2021
minor refactoring for rioConnRead and adding errno

(cherry picked from commit a403816)
@oranagra oranagra mentioned this pull request Oct 4, 2021
oranagra pushed a commit that referenced this pull request Oct 4, 2021
minor refactoring for rioConnRead and adding errno

(cherry picked from commit a403816)
@oranagra oranagra moved this from In progress to Done in 6.0 Backport Oct 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-notes indication that this issue needs to be mentioned in the release notes
Projects
Development

Successfully merging this pull request may close these issues.

None yet

4 participants