Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry logic for ioredis response incorrect result #965

Closed
stanley115 opened this issue Sep 17, 2019 · 7 comments · Fixed by #1084, bangbang93/freyja#148, bangbang93/haruhi#136, bangbang93/freyja#155 or BlueWallet/LndHub#68
Labels

Comments

@stanley115
Copy link

stanley115 commented Sep 17, 2019

Hi,

I am using ioredis with version 4.2.0 and I found ioredis may response unexpected result when "failover primary" is triggered on Amazon ElastiCache.

To verify this issue, I wrote a program to do the following steps:

  1. insert 1000000 key-value pairs into redis where key = value
  2. get 200 random key from redis and check whether the result match the key
  3. at the same time, perform some multi-set operations

The source code is as follows:

const Redis = require('ioredis');

const redis = new Redis({
    'host': xxx,
    'port': xxx,
});

function isEqual(a, b) {
    if (!Array.isArray(a) || !Array.isArray(b)) {
        return false;
    }

    if (a.length !== b.length) {
        return false;
    }

    for (let i = 0; i < a.length; i++) {
        if (a[i] !== b[i]) {
            return false;
        }
    }

    return true;
}

let counter = 0;

async function test() {
    while (true) {
        counter++;

        if (counter % 1000 === 0) {
            console.log('test', counter);
        }
        let keys = [];

        for (let i = 0; i < 200; i++) {
            let value = `${Math.floor(Math.random() * 1000000)}`;

            keys.push(value);
        }

        try {
            let values;

            values = await redis.mget(keys);

            if (!isEqual(keys, values)) {
                console.error('Keys and values do not match');
                console.error(keys, values);

                return;
            }

            // set the values
            const multi = await redis.multi();

            for (let i = 0; i < keys.length; i++) {
                await multi.set(keys[i], values[i]);
            }

            try {
                await multi.exec();
            } catch (err) {

                // discard the transaction if encounter any error
                await multi.discard();
            }


            values = await redis.mget(keys);

            if (!isEqual(keys, values)) {
                console.error('Keys and values do not match');
                console.error(keys, values);

                return;
            }

        } catch (err) {
            console.error('Redis connection error', err);
        }
    }
}

(async () => {
    let promises = [];

    // Set 1000000 key-value pair into redis
    for (let i = 0; i < 1000000; i++) {
        promises.push(
            redis.set(i, i)
        );
    }

    await Promise.all(promises);

    promises = [];

    for (let i = 0; i < 200; i++) {
        // Test redis in parallel
        promises.push(test());
    }

    await Promise.all(promises);
})();

When the program is executed, it runs normally without any error at the beginning:

test 1000
test 2000
test 3000
test 4000
test 5000
test 6000
test 7000
test 8000

However, when I trigger "failover primary" is triggered on Amazon ElastiCache, it responses the following error:

test 31000
[ioredis] Unhandled error event: Error: write EPIPE
    at WriteWrap.afterWrite [as oncomplete] (net.js:788:14)
Keys and values do not match
[ '243409',
  '137016',
  '601542',
  '734081',
  '210151',
  '68771',
  '895837',
  '876693',
  '588017',
  '29828',
  '17298',
  '670564',
  '836440',
  '270865',
  '306435',
  '931699',
  '775613',
  '217668',
  '581266',
  '976507',
  '832352',
  '679326',
  '52561',
  '53287',
  '104417',
  '146403',
  '39159',
  '114127',
  '871857',
  '132837',
  '387522',
  '562921',
  '42104',
  '199721',
  '739081',
  '521383',
  '662032',
  '113916',
  '548452',
  '439403',
  '444550',
  '893604',
  '848739',
  '140389',
  '515737',
  '876296',
  '173461',
  '564650',
  '390476',
  '798553',
  '76405',
  '902405',
  '446459',
  '759290',
  '427111',
  '988981',
  '967138',
  '137452',
  '520179',
  '485755',
  '720760',
  '845674',
  '872862',
  '821507',
  '924065',
  '574757',
  '128519',
  '906625',
  '632835',
  '21495',
  '83587',
  '777786',
  '823747',
  '171798',
  '683516',
  '127883',
  '78898',
  '956534',
  '559139',
  '773756',
  '319490',
  '610275',
  '407368',
  '531517',
  '469918',
  '427556',
  '261721',
  '430777',
  '877009',
  '298908',
  '564349',
  '804059',
  '671345',
  '953904',
  '531551',
  '590455',
  '979441',
  '463816',
  '407925',
  '633564',
  ... 100 more items ] 'QUEUED'

and the program terminate with the following error:

Error: Command queue state error. If you can reproduce this, please report it. Last reply: 995929,435542,106821,513330,236068,695023,745357,618648,185801,633124,656521,310435,940466,479575,808160,416420,19858,211146,425218,140598,924085,961997,664625,578621,328188,615778,89132,125432,741458,739122,400620,758320,634350,352547,600585,682298,933154,188276,897557,811743,646589,987422,540335,681971,659102,156246,827271,37592,152314,833971,688430,877638,839522,79903,126904,827111,486631,499505,445872,165188,957601,820105,748310,935720,805906,735996,941498,243421,787074,261476,261790,473453,341959,986679,192225,625257,135429,47694,25263,436500,937899,843472,111746,446841,626126,321981,112825,900243,258222,887488,434748,812479,458536,683031,226176,621812,152860,194738,508094,749085,982258,678053,113761,956206,374257,626489,863173,293197,424767,636834,742311,959087,914309,918353,559260,774120,681847,737399,213218,437021,176342,725435,295235,729465,343807,960928,127085,827722,878959,584385,535367,506862,57615,168427,12160,59708,852020,465299,846000,301785,567768,834208,978330,611490,92358,759136,358587,819213,394238,571914,638338,776079,94375,232995,948415,89949,869999,910308,407192,405751,938798,510490,814789,913200,861669,264338,93524,371432,919094,774245,587445,460332,844889,165153,199058,542872,571002,274567,508553,924620,780545,835010,187975,654759,401789,41298,723730,889871,266452,765027,797533,517217,961999,178726,25432,418646,287127,679359,897524,62380
    at Redis.exports.returnReply (xxxxxx/node_modules/ioredis/built/redis/parser.js:143:39)
    at JavascriptRedisParser.returnReply (xxxxxx/node_modules/ioredis/built/redis/parser.js:23:19)
    at JavascriptRedisParser.execute (xxxxxx/node_modules/ioredis/node_modules/redis-parser/lib/parser.js:544:14)
    at Socket.<anonymous> (xxxxxx/node_modules/ioredis/built/redis/event_handler.js:107:26)
    at Socket.emit (events.js:198:13)
    at addChunk (_stream_readable.js:288:12)
    at readableAddChunk (_stream_readable.js:269:11)
    at Socket.Readable.push (_stream_readable.js:224:10)
    at TCP.onStreamRead [as onread] (internal/stream_base_commons.js:94:17)
Emitted 'error' event at:
    at Redis.exports.returnReply (xxxxxx/node_modules/ioredis/built/redis/parser.js:143:25)
    at JavascriptRedisParser.returnReply (xxxxxx/node_modules/ioredis/built/redis/parser.js:23:19)
    [... lines matching original stack trace ...]
    at TCP.onStreamRead [as onread] (internal/stream_base_commons.js:94:17)

PS: I found this issue cannot be reproduced when maxRetriesPerRequest is set to be 0, and I strongly suspect that there is some state corruption when retry happen.

Here is the ElastiCache cluster details:

Node type cache.m3.medium
Engine Redis
Engine Version Compatibility 4.0.10
Shards 1
Number of Nodes 2 nodes (1 primary, 1 replica)
@NezzarClp
Copy link

Any update on this issue? I am using ioredis and facing the same issue as well.

@stale
Copy link

stale bot commented Oct 25, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs, but feel free to re-open a closed issue if needed.

@stale stale bot added the wontfix label Oct 25, 2019
@stale stale bot closed this as completed Nov 1, 2019
@alavers
Copy link
Contributor

alavers commented Nov 22, 2019

I'm encountering this same issue when testing an elasticache failover/upgrade. (Elasticache redis 5.0.0, ioredis 4.14.0)

My client is running a mix of individual commands and multi command blocks. When an elasticache resource upgrade occurs, the connection is closed and reestablished. However, the ioredis command queue falls out of sync, and I begin seeing individual redis commands being resolved with the reply result of a different command.

My debugging so far has shown that during the connection reset, commands sent inside of a multi context are getting put individually on the offline queue, which on its own doesn't seem right.

Just before a failover takes place, I send a multi block that looks like this:

multi
del
hset
expire
setex
exec

All of these commands get queued up by ioredis, but then I observe the following:

  1. Receive a reply 'OK' (in response to multi)
  2. Receive a READONLY error in response to del, due to the ongoing ElastiCache resource upgrade.
  3. del command is put on the offline queue
  4. The connection is reestablished with a new IP address
  5. del is sent from the offline queue
  6. Receive a reply '1' (in response to del). We were expecting QUEUED here, but the multi context was lost in the reconnect.
  7. ioredis resolves a pending hset command with the reply value '1'. The command queue is now out of sync
  8. The next command I send through this client has its reply resolved to the pending expire command, and so on.

It seems that if a connection reset occurs when a multi block is still in flight we might need to either flush or re-send any multi'd commands that might still be in the queue. I'll try to run this experiment in a fork and see if it resolves this issue.

@alavers
Copy link
Contributor

alavers commented Nov 22, 2019

To start, I think perhaps transacted commands found in prevCommandQueue or offlineQueue should be rejected. A quick experiment trying this appears to resolve the issue I'm facing.

In the meantime @stanley115 are you able to reproduce the issue with autoResendUnfulfilledCommands set to false?

@luishdz1010
Copy link

This just hit us on production, after recovering from a failover our sites started showing content from different users mixed together. Any workarounds?

@alavers
Copy link
Contributor

alavers commented Feb 22, 2020

As far as workarounds, we simply have to restart all connected services whenever we perform an elasticache upgrade. I'm not aware of any better workaround, but I do have a potential fix working locally that I've been meaning to PR. I've been sidetracked since then.

alavers added a commit to alavers/ioredis that referenced this issue Mar 27, 2020
Elasticache severs the connection immediately after it returns a
READONLY error.  This can sometimes leave queued up pipelined commands
in an inconsistent state when the connection is reestablished. This fix
will detect any pipeline command sets that only had a partial response
before connection loss, and abort them.  This fix will also check the
offlineQueue for any transaction fragments, so that we don't send
mismatched multi/exec to redis upon reconnection.

- Introduced piplineIndex property on pipelined commands to allow for later
cleanup
- Added a routine to event_handler that aborts any pipelined commands inside
commandQueue and offlineQueue that were interrupted in the middle of the
pipeline
- Introduced inTransaction property on commands to simplify pipeline logic
- Added a flags param to mock_server to allow the Elasticache disconnect
behavior to be simulated
- Added a reconnect_on_error test case for transactions
- Added some test cases testing for correct handling of this unique elasticache
behavior
- Added unit tests to validate inTransaction and pipelineIndex setting

Fixes redis#965
alavers added a commit to alavers/ioredis that referenced this issue Mar 27, 2020
Elasticache severs the connection immediately after it returns a
READONLY error.  This can sometimes leave queued up pipelined commands
in an inconsistent state when the connection is reestablished. This fix
will detect any pipeline command sets that only had a partial response
before connection loss, and abort them.  This fix will also check the
offlineQueue for any transaction fragments, so that we don't send
mismatched multi/exec to redis upon reconnection.

- Introduced piplineIndex property on pipelined commands to allow for later
cleanup
- Added a routine to event_handler that aborts any pipelined commands inside
commandQueue and offlineQueue that were interrupted in the middle of the
pipeline
- Added a routine to event_handler that removes any transaction
fragments from the offline queue
- Introduced inTransaction property on commands to simplify pipeline logic
- Added a flags param to mock_server to allow the Elasticache disconnect
behavior to be simulated
- Added a reconnect_on_error test case for transactions
- Added some test cases testing for correct handling of this unique elasticache
behavior
- Added unit tests to validate inTransaction and pipelineIndex setting

Fixes redis#965
alavers added a commit to alavers/ioredis that referenced this issue Mar 27, 2020
Elasticache severs the connection immediately after it returns a
READONLY error.  This can sometimes leave queued up pipelined commands
in an inconsistent state when the connection is reestablished. For
example, if a pipeline has 6 commands and the second one generates a
READONLY error, Elasticache will only return results for the first two
before severing the connection. Upon reconnect, the pipeline still
thinks it has 6 commands to send but the commandQueue has only 4. This
fix will detect any pipeline command sets that only had a partial
response before connection loss, and abort them.

This Elasticache behavior also affects transactions. This fix will check
the offlineQueue for any transaction fragments and abort them, so that
we don't send mismatched multi/exec to redis upon reconnection.

- Introduced piplineIndex property on pipelined commands to allow for later
cleanup
- Added a routine to event_handler that aborts any pipelined commands inside
commandQueue and offlineQueue that were interrupted in the middle of the
pipeline
- Added a routine to event_handler that removes any transaction
fragments from the offline queue
- Introduced inTransaction property on commands to simplify pipeline logic
- Added a flags param to mock_server to allow the Elasticache disconnect
behavior to be simulated
- Added a reconnect_on_error test case for transactions
- Added some test cases testing for correct handling of this unique elasticache
behavior
- Added unit tests to validate inTransaction and pipelineIndex setting

Fixes redis#965
alavers added a commit to alavers/ioredis that referenced this issue Mar 27, 2020
Elasticache severs the connection immediately after it returns a
READONLY error.  This can sometimes leave queued up pipelined commands
in an inconsistent state when the connection is reestablished. For
example, if a pipeline has 6 commands and the second one generates a
READONLY error, Elasticache will only return results for the first two
before severing the connection. Upon reconnect, the pipeline still
thinks it has 6 commands to send but the commandQueue has only 4. This
fix will detect any pipeline command sets that only had a partial
response before connection loss, and abort them.

This Elasticache behavior also affects transactions. If reconnectOnError
returns 2, some transaction fragments will end up in the offlineQueue.
This fix will check the offlineQueue for any such transaction fragments
and abort them, so that we don't send mismatched multi/exec to redis
upon reconnection.

- Introduced piplineIndex property on pipelined commands to allow for later
cleanup
- Added a routine to event_handler that aborts any pipelined commands inside
commandQueue and offlineQueue that were interrupted in the middle of the
pipeline
- Added a routine to event_handler that removes any transaction
fragments from the offline queue
- Introduced inTransaction property on commands to simplify pipeline logic
- Added a flags param to mock_server to allow the Elasticache disconnect
behavior to be simulated
- Added a reconnect_on_error test case for transactions
- Added some test cases testing for correct handling of this unique elasticache
behavior
- Added unit tests to validate inTransaction and pipelineIndex setting

Fixes redis#965
alavers added a commit to alavers/ioredis that referenced this issue Mar 27, 2020
Elasticache severs the connection immediately after it returns a
READONLY error.  This can sometimes leave queued up pipelined commands
in an inconsistent state when the connection is reestablished. For
example, if a pipeline has 6 commands and the second one generates a
READONLY error, Elasticache will only return results for the first two
before severing the connection. Upon reconnect, the pipeline still
thinks it has 6 commands to send but the commandQueue has only 4. This
fix will detect any pipeline command sets that only had a partial
response before connection loss, and abort them.

This Elasticache behavior also affects transactions. If reconnectOnError
returns 2, some transaction fragments may end up in the offlineQueue.
This fix will check the offlineQueue for any such transaction fragments
and abort them, so that we don't send mismatched multi/exec to redis
upon reconnection.

- Introduced piplineIndex property on pipelined commands to allow for later
cleanup
- Added a routine to event_handler that aborts any pipelined commands inside
commandQueue and offlineQueue that were interrupted in the middle of the
pipeline
- Added a routine to event_handler that removes any transaction
fragments from the offline queue
- Introduced inTransaction property on commands to simplify pipeline logic
- Added a flags param to mock_server to allow the Elasticache disconnect
behavior to be simulated
- Added a reconnect_on_error test case for transactions
- Added some test cases testing for correct handling of this unique elasticache
behavior
- Added unit tests to validate inTransaction and pipelineIndex setting

Fixes redis#965
alavers added a commit to alavers/ioredis that referenced this issue Mar 27, 2020
Elasticache severs the connection immediately after it returns a
READONLY error.  This can sometimes leave queued up pipelined commands
in an inconsistent state when the connection is reestablished. For
example, if a pipeline has 6 commands and the second one generates a
READONLY error, Elasticache will only return results for the first two
before severing the connection. Upon reconnect, the pipeline still
thinks it has 6 commands to send but the commandQueue has only 4. This
fix will detect any pipeline command sets that only had a partial
response before connection loss, and abort them.

This Elasticache behavior also affects transactions. If reconnectOnError
returns 2, some transaction fragments may end up in the offlineQueue.
This fix will check the offlineQueue for any such transaction fragments
and abort them, so that we don't send mismatched multi/exec to redis
upon reconnection.

- Introduced piplineIndex property on pipelined commands to allow for later
cleanup
- Added a routine to event_handler that aborts any pipelined commands inside
commandQueue and offlineQueue that were interrupted in the middle of the
pipeline
- Added a routine to event_handler that removes any transaction
fragments from the offline queue
- Introduced inTransaction property on commands to simplify pipeline logic
- Added a flags param to mock_server to allow the Elasticache disconnect
behavior to be simulated
- Added a reconnect_on_error test case for transactions
- Added some test cases testing for correct handling of this unique elasticache
behavior
- Added unit tests to validate inTransaction and pipelineIndex setting

Fixes redis#965
luin pushed a commit that referenced this issue Mar 28, 2020
Elasticache severs the connection immediately after it returns a
READONLY error.  This can sometimes leave queued up pipelined commands
in an inconsistent state when the connection is reestablished. For
example, if a pipeline has 6 commands and the second one generates a
READONLY error, Elasticache will only return results for the first two
before severing the connection. Upon reconnect, the pipeline still
thinks it has 6 commands to send but the commandQueue has only 4. This
fix will detect any pipeline command sets that only had a partial
response before connection loss, and abort them.

This Elasticache behavior also affects transactions. If reconnectOnError
returns 2, some transaction fragments may end up in the offlineQueue.
This fix will check the offlineQueue for any such transaction fragments
and abort them, so that we don't send mismatched multi/exec to redis
upon reconnection.

- Introduced piplineIndex property on pipelined commands to allow for later
cleanup
- Added a routine to event_handler that aborts any pipelined commands inside
commandQueue and offlineQueue that were interrupted in the middle of the
pipeline
- Added a routine to event_handler that removes any transaction
fragments from the offline queue
- Introduced inTransaction property on commands to simplify pipeline logic
- Added a flags param to mock_server to allow the Elasticache disconnect
behavior to be simulated
- Added a reconnect_on_error test case for transactions
- Added some test cases testing for correct handling of this unique elasticache
behavior
- Added unit tests to validate inTransaction and pipelineIndex setting

Fixes #965
ioredis-robot pushed a commit that referenced this issue Mar 28, 2020
## [4.16.1](v4.16.0...v4.16.1) (2020-03-28)

### Bug Fixes

* abort incomplete pipelines upon reconnect ([#1084](#1084)) ([0013991](0013991)), closes [#965](#965)
@ioredis-robot
Copy link
Collaborator

🎉 This issue has been resolved in version 4.16.1 🎉

The release is available on:

Your semantic-release bot 📦🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment