Minor refactoring for rioConnRead and adding errno #9280

Ewg-c · 2021-07-28T20:57:57Z

Redis 6.0 contained a bug when the master uses disk-based replication (repl-diskless-sync is no) and the replica is diskless (repl-diskless-load is set to a non-default value).
The bug could cause the rdb loading code in the replica to buffer too much data from the socket and fail the replication.

Due to da840e9, the error condition does not depend on the while loop in which we read from the socket. This change cleans up the code and moves the condition outside the loop.
The change also adds errno to the "Failed trying to load the MASTER synchronization DB" error message in readSyncBulkPayload() to make debugging similar problems easier in the future.
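
As an illustration of the two changes described above, here is a minimal sketch, not the actual patch. The helper names check_read_limit and format_load_error are hypothetical, and both the "0 means no limit" convention and the choice of EOVERFLOW as the error code are assumptions; only the message text and the idea of reporting errno come from the description.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: the limit check evaluated once, before the read loop.
 * Returns 0 if the request fits, or -1 with errno set (EOVERFLOW is assumed).
 * A read_limit of 0 is assumed to mean "no limit". */
static int check_read_limit(size_t read_limit, size_t read_so_far, size_t len) {
    if (read_limit != 0 && read_limit < read_so_far + len) {
        errno = EOVERFLOW;
        return -1;
    }
    return 0;
}

/* Hypothetical helper: build the load-failure message with the text for the
 * given errno value appended, so the log says why loading failed. */
static int format_load_error(char *out, size_t outlen, int err) {
    return snprintf(out, outlen,
        "Failed trying to load the MASTER synchronization DB: %s",
        strerror(err));
}
```

Because none of the inputs to the check change while the loop runs, evaluating it once up front gives the same outcome as evaluating it on every iteration.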

madolson

LGTM, but I don't have a lot of context here.

oranagra

@Ewg-c the fix looks good to me, and it also simplifies the code.
indeed there's no reason to check for this case inside the loop if all the conditions leading to the error don't change in the loop and we only actually check the input arguments vs the startup state.

what i don't yet understand is:

1. what was the problem with the original code? it seems that it would function correctly as well (and you should have got your desired errno).
2. this code is supposed to be dead code. the scenario leading to this read_limit being used is only when the master is diskless and the replica is disk-based, and even then an attempt to read outside of the range should never happen since rdb.c knows how much it should read for each type it reads.

Ewg-c · 2021-07-29T17:37:28Z

> @Ewg-c the fix looks good to me, and it also simplifies the code.
> indeed there's no reason to check for this case inside the loop if all the conditions leading to the error don't change in the loop and we only actually check the input arguments vs the startup state.
>
> what I don't yet understand is:
>
> 1. what was the problem with the original code? it seems that it would function correctly as well (and you should have got your desired errno).
> 2. this code is supposed to be dead code. the scenario leading to this read_limit being used is only when the master is diskless and the replica is disk-based, and even then an attempt to read outside of the range should never happen since rdb.c knows how much it should read for each type it reads.

Thank you for the correction @oranagra.
For 1. In Redis 6.0 we had repeatable and consistent failures because of the bug.
Redis 6.0 uses the expression if (r->io.conn.read_limit >= r->io.conn.read_so_far - buffered), which makes the right-hand side negative in many cases; because of the previous (correct) check, we do not enter that part of the code frequently.

For 2. I did not get the impression that this is "dead" code. I might be missing something, please let me know. We have seen it with full sync and diskless on both master and replica. I found this adjusted stack trace in my notes:

Thread 1 "redis-server" hit Breakpoint 1, rioConnRead (r=0x7fff72883830, buf=0x7f339d850480, len=53149) at rio.c:205
$1 = 37073 // 'p ((struct sdshdr32 *)((r->io.conn.buf)-(sizeof(struct sdshdr32))))->len'
$2 = 102 // 'p r->io.conn.pos'
$3 = 53149 // 'p len'
$4 = 53260 // 'p r->io.conn.read_limit'
$5 = 102 // 'p r->io.conn.read_so_far'
#0  rioConnRead (r=0x7fff72883830, buf=0x7f339d850480, len=53149) at rio.c:205
#1  in rioRead (len=53149, buf=0x7f339d850480, r=0x7fff72883830) at rio.h:123
#2  rdbLoadLzfStringObject (rdb=rdb@entry=0x7fff72883830, flags=flags@entry=1, lenptr=lenptr@entry=0x0) at rdb.c:391
#3  in rdbGenericLoadStringObject (rdb=0x7fff72883830, flags=flags@entry=1, lenptr=lenptr@entry=0x0) at rdb.c:504
#4  in rdbLoadEncodedStringObject (rdb=<optimized out>) at rdb.c:539
#5  rdbLoadObject (rdbtype=<optimized out>, rdb=<optimized out>, key=<optimized out>) at rdb.c:1494
#6  in rdbLoadRio (rdb=rdb@entry=0x7fff72883830, rdbflags=rdbflags@entry=2, rsi=rsi@entry=0x7fff728837f0) at rdb.c:2317
#7  in readSyncBulkPayload (conn=0x7f339db89380) at replication.c:1667
...
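
Plugging the gdb values above into the Redis 6.0 expression shows why the check can fail even though the request is within the limit. This is a sketch under assumptions: the fields are treated as unsigned size_t (so the subtraction wraps around instead of going negative), buffered is taken to be the buffer length minus pos, and the second comparison is a guess at the intended check, not a quote from the fixed code. The function names are hypothetical.

```c
#include <stddef.h>

/* Redis 6.0 form, as quoted above. With unsigned operands the subtraction
 * wraps to a huge value whenever buffered > read_so_far. */
static int old_check_passes(size_t read_limit, size_t read_so_far,
                            size_t buffered) {
    return read_limit >= read_so_far - buffered;
}

/* Assumed intent: the caller may consume len more bytes as long as that
 * stays within the limit. */
static int new_check_passes(size_t read_limit, size_t read_so_far,
                            size_t len) {
    return read_limit >= read_so_far + len;
}

/* Values from the trace:
 *   buffer length = 37073, pos = 102  ->  buffered = 36971
 *   len = 53149, read_limit = 53260, read_so_far = 102
 * old form: 102 - 36971 wraps, so the comparison is false (error path).
 * assumed intent: 102 + 53149 = 53251 <= 53260, so the read is allowed. */
```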

madolson · 2021-07-30T00:28:20Z

I'll also add the information that this is AWS, and we do use the configuration diskless on primary and disk based on the replica by default.

oranagra · 2021-07-30T04:54:51Z

@madolson @Ewg-c the configuration is perfectly valid, the test suite uses it too, but I still don't understand what problem this PR comes to fix (other than a cleanup).

Redis 6 was indeed buggy in that line, and that was fixed in #7557 (with an additional fix in #7564), so as far as I can tell, the current code that this PR comes to change was OK.
If we agree on that, then I can just say that it's confusing to see a PR that says it fixes a bug, but that bug is no longer there. It wasn't clear enough from the comments that this is just a cleanup.

The other thing that bothers me is that this condition is supposed to be dead code, so I still don't understand how you ran into it (consistently). rdb.c knows how many bytes to read for each type, and it should never ask to read more than there is in the rdb file. When it sees the EOF byte it stops.
This limit is normally there to just stop this mechanism from buffering data from the socket beyond the limit, but not to error on reads outside the range.

oranagra · 2021-07-30T04:58:23Z

On a second look, maybe the PR title and top comment are clear that this is just a refactoring.
Maybe the mention of the Redis 6 bug is what threw me off balance (I don't see how it's related now).

But I'm still curious how you came to get this error. Maybe some modification in your fork?

madolson · 2021-07-30T14:15:40Z

Oh, yes, this is just a refactoring to help identify an issue we saw internally that we needed gdb to identify. AFAIK the issue was fixed in 6.2 (we maintain too many old versions), and we saw this on a 6.0 version. @Ewg-c can fill in the details if you are really curious.

Ewg-c · 2021-07-30T16:28:44Z

@oranagra sorry if the information on the ticket misled you.

> rdb.c knows how many bytes to read for each type, and it should never ask to read more than there is in the rdb file. When it sees the EOF byte it stops.

I posted the stack trace earlier and can add that the culprit is the toread assignment. It does not take EOF into account and would attempt to read over the limit. This is where the condition in question is triggered. Personally I believe that everyone running Redis 6.0 would be affected; although it requires an unlucky match of the values, it should still happen regularly.

I also think it was a TLS-enabled cluster.
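
To make the toread discussion concrete, here is a minimal sketch of how a read-ahead size could be computed and then clamped so that buffering never goes past the limit. It is an illustration, not the code from rio.c: clamp_toread and its parameters are hypothetical, IOBUF_LEN is an assumed 16 KB chunk size, and the caller is assumed to have already verified that len fits within the limit and that len is larger than what is buffered.

```c
#include <stddef.h>

#define IOBUF_LEN (16 * 1024) /* assumed read-ahead chunk size */

/* Hypothetical helper: how many bytes to request from the socket.
 * Preconditions (assumed): len > buffered, and either read_limit == 0
 * or read_so_far + len <= read_limit. */
static size_t clamp_toread(size_t len, size_t buffered, size_t buf_avail,
                           size_t read_limit, size_t read_so_far) {
    size_t needs = len - buffered;
    /* Read what is missing or a full chunk, whichever is bigger. */
    size_t toread = needs < IOBUF_LEN ? IOBUF_LEN : needs;
    /* Never read more than the free space in the buffer. */
    if (toread > buf_avail) toread = buf_avail;
    /* Never buffer beyond the limit; the preconditions keep this positive. */
    if (read_limit != 0 && read_so_far + buffered + toread > read_limit)
        toread = read_limit - read_so_far - buffered;
    return toread;
}
```

With the values from the trace (len 53149, 36971 bytes buffered, limit 53260, 102 read so far) and ample buffer space, 16178 bytes are missing, so a full 16384-byte chunk would be requested; that would overshoot the limit, and the clamp reduces the request to 53260 - 102 - 36971 = 16187 bytes.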

oranagra · 2021-07-30T16:41:46Z

Ok, so it's not that rdb.c attempted to read beyond the limit, but that rioConnRead messed up and attempted to buffer more than it should.
It's odd that the tests didn't find it back then, since I think they were testing this configuration combination.
Well, I suppose there's no sense in wasting more time looking into it, considering that this code no longer exists.
But maybe we should backport the latest version into the next release of 6.0 and 5.0?

Ewg-c · 2021-07-30T17:38:40Z

@oranagra
rioConnIO was introduced in Redis 6, thus Redis 5.0 is not affected by this problem.
IMO it is a good idea to backport the fix to the 6.0 branch. The fix consists of these two commits, and they merge smoothly:

https://github.com/redis/redis/commit/40d7fca3685d8439bae8480ddbd59775a2390411
https://github.com/redis/redis/commit/da840e9851bab8d1674e245a812b2105be111208

oranagra · 2021-07-30T18:44:14Z

ohh, right.. i implemented it in redis 2.8, but couldn't get Salvatore to merge my PR until recently.

oranagra · 2021-09-30T20:11:41Z

@Ewg-c i've edited the top comment to be used for the release notes, please review / fix (specifically what were the implications of the bug).
p.s. i now notice that my previous statement here about this code being active on diskless master with disk-based replica was wrong (it's the other way around)

Ewg-c · 2021-10-01T05:17:21Z

@oranagra thank you. I updated the top comment. It should be good I believe.

minor refactoring for rioConnRead and adding errno

d0cb12c

madolson requested a review from oranagra July 28, 2021 22:34

madolson previously approved these changes Jul 28, 2021

minor styling fix to fit our convention, and also reduce LOC diff

8320dba

oranagra dismissed madolson’s stale review via 8320dba July 29, 2021 06:02

oranagra previously approved these changes Jul 29, 2021

oranagra added the state:to-be-merged label Jul 29, 2021

yossigo reviewed Jul 29, 2021

Ewg-c and others added 2 commits July 29, 2021 09:43

Merge branch 'redis:unstable' into unstable

8cf9689

Log error first in rdbLoadRio failure handling code

a398f9f

Ewg-c dismissed oranagra’s stale review via a398f9f July 29, 2021 16:57

madolson approved these changes Jul 30, 2021

madolson merged commit a403816 into redis:unstable Jul 30, 2021

oranagra added this to To Do in 5.0 Backport via automation Jul 30, 2021

oranagra added this to To Do in 6.0 Backport via automation Jul 30, 2021

oranagra removed this from To Do in 5.0 Backport Jul 30, 2021

madolson added the release-notes label and removed the state:to-be-merged label Jul 31, 2021

JackieXie168 pushed a commit to JackieXie168/redis that referenced this pull request Sep 8, 2021

Minor refactoring for rioConnRead and adding errno (redis#9280)

a16bfe4

minor refactoring for rioConnRead and adding errno

oranagra moved this from To Do to In progress in 6.0 Backport Sep 29, 2021

oranagra pushed a commit to oranagra/redis that referenced this pull request Oct 4, 2021

Minor refactoring for rioConnRead and adding errno (redis#9280)

616d6db

minor refactoring for rioConnRead and adding errno (cherry picked from commit a403816)

oranagra mentioned this pull request Oct 4, 2021

Release 6.0.16 #9584

Merged

oranagra pushed a commit that referenced this pull request Oct 4, 2021

Minor refactoring for rioConnRead and adding errno (#9280)

dde1c97

minor refactoring for rioConnRead and adding errno (cherry picked from commit a403816)

oranagra moved this from In progress to Done in 6.0 Backport Oct 4, 2021
