Replication Script Cache Bug #1549

minus-infinity · 2014-02-12T11:50:19Z

Step 1:

eval "if tonumber(KEYS[1]) > 0 then redis.call('incr', 'x') end" 1 0

Step 2:

evalsha <sha1 step 1 script> 1 0

At this step the sha1 of the script is added to the replication script cache (the script is marked as known to the slaves) and the EVALSHA command is transformed to EVAL.
However, the command is not dirty (it makes no changes to the db), so it is not propagated to the slaves.

Step 3:

evalsha <sha1 step 1 script> 1 1

At this step the master finds the script in the replication script cache and doesn't transform the command to EVAL. This time the command is dirty and is propagated to the slaves as EVALSHA, but they fail to evaluate the script because they don't have it in their script cache.

This issue affects AOF rewrite and replication.
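
The sequence in steps 2 and 3 can be replayed with a small stand-alone model. This is only a sketch and not Redis source: `ReplState`, `master_evalsha` and `replay_issue` are hypothetical names, and the model assumes a single master, a single slave and a single script. The `force_propagation` parameter stands in for forcing propagation regardless of the dirty state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not Redis source: one master, one slave, one script. */
typedef struct {
    bool master_marked_known; /* sha1 is in the master's replication script cache */
    bool slave_has_script;    /* the slave actually received the script body */
    bool slave_failed;        /* the slave got an EVALSHA it could not resolve */
} ReplState;

/* Run one EVALSHA on the master. `dirty` says whether the script changed the
 * data set; `force_propagation` models propagating the rewritten EVAL even
 * when nothing changed. */
static void master_evalsha(ReplState *s, bool dirty, bool force_propagation) {
    assert(s != NULL);
    bool sent_as_eval = false;
    bool force = false;

    if (!s->master_marked_known) {
        /* First EVALSHA: rewrite it to EVAL and mark the script as known. */
        s->master_marked_known = true;
        sent_as_eval = true;
        force = force_propagation;
    }

    if (!dirty && !force) return; /* nothing is propagated */

    if (sent_as_eval) {
        s->slave_has_script = true; /* EVAL carries the script body */
    } else if (!s->slave_has_script) {
        s->slave_failed = true;     /* EVALSHA with a sha1 the slave never saw */
    }
}

/* Replay steps 2 and 3 of the report; returns true if the slave fails.
 * Step 1 (the initial EVAL) only loads the script on the master, so it is
 * not modeled here. */
static bool replay_issue(bool force_propagation) {
    ReplState s = { false, false, false };
    master_evalsha(&s, false, force_propagation); /* step 2: KEYS[1] = 0 */
    master_evalsha(&s, true, force_propagation);  /* step 3: KEYS[1] = 1 */
    return s.slave_failed;
}
```

In this model `replay_issue(false)` reports a failing slave, while `replay_issue(true)` does not, because the rewritten EVAL reaches the slave in step 2 even though the data set was untouched.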

The issue can be fixed by setting REDIS_FORCE_REPL and REDIS_FORCE_AOF flags when adding sha1 to the replication script cache:

scripting.c:

if (!replicationScriptCacheExists(c->argv[1]->ptr)) {
    /* This script is not in our script cache, replicate it as
     * EVAL, then add it into the script cache, as from now on
     * slaves and AOF know about it. */
    robj *script = dictFetchValue(server.lua_scripts,c->argv[1]->ptr);

    replicationScriptCacheAdd(c->argv[1]->ptr);
    redisAssertWithInfo(c,NULL,script != NULL);
    rewriteClientCommandArgument(c,0,
        resetRefCount(createStringObject("EVAL",4)));
    rewriteClientCommandArgument(c,1,script);
    /* Force propagation even when the script did not modify the data set. */
    c->flags |= (REDIS_FORCE_REPL | REDIS_FORCE_AOF);
}
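
Why setting the two flags is enough can be illustrated with a sketch of the propagation decision. This is an assumption-laden illustration rather than the actual Redis code: `propagation_targets` is a hypothetical helper, and the flag values below are made up for the example (the real `REDIS_FORCE_REPL` and `REDIS_FORCE_AOF` definitions live in the Redis headers). The idea it models is that a command is propagated if it dirtied the data set or if the client carries a force flag.

```c
#include <assert.h>

/* Illustrative flag values only; not the real Redis definitions. */
#define FORCE_REPL      (1 << 0)  /* stands in for REDIS_FORCE_REPL */
#define FORCE_AOF       (1 << 1)  /* stands in for REDIS_FORCE_AOF */

#define PROPAGATE_NONE  0
#define PROPAGATE_AOF   (1 << 0)
#define PROPAGATE_REPL  (1 << 1)

/* Decide where a finished command should be propagated.
 * `dirty` is the number of changes the command made to the data set. */
static int propagation_targets(int client_flags, long long dirty) {
    assert(dirty >= 0);
    int targets = PROPAGATE_NONE;

    /* Force flags win even when nothing changed. */
    if (client_flags & FORCE_REPL) targets |= PROPAGATE_REPL;
    if (client_flags & FORCE_AOF)  targets |= PROPAGATE_AOF;

    /* Any change to the data set propagates everywhere. */
    if (dirty > 0) targets |= PROPAGATE_AOF | PROPAGATE_REPL;

    return targets;
}
```

With no force flags and `dirty == 0` the helper returns `PROPAGATE_NONE`, which corresponds to step 2 before the patch. With both force flags set it returns both targets even though nothing changed, which is the behavior the patch relies on.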


antirez · 2014-02-12T11:52:19Z

This is outstanding, thanks! And you joined Github just to report the bug 👍

minus-infinity · 2014-02-12T17:32:12Z

It was quite challenging to reproduce and debug it on a 100GB dataset in a production environment.

antirez · 2014-02-12T18:15:31Z

And we were looking to hunt this bug as well... I believe it is an issue that affected Hulu as well. So I guess your fix was also verified in production, which is definitely a bonus point, even if the solution looks obvious and sane per se.

minus-infinity · 2014-02-12T18:47:50Z

I have updated our production servers and they have been working fine for a few hours already. The slaves are in sync (I did not check AOF). Before the patch, the slaves couldn't complete a successful initial sync at all.

antirez · 2014-02-12T18:48:54Z

Thanks for the ack, merging tomorrow morning + unit tests.

minus-infinity · 2014-02-12T19:48:12Z

I would like to thank you for such a great in-memory data structure database with good persistence options. We use it as our primary db, not just as a caching key-value solution.

@minus-infinity

This commit fixes a serious Lua scripting replication issue, described by Github issue #1549. The root cause of the problem is that scripts were put inside the script cache, assuming that slaves and AOF already contained them, even if the scripts sometimes produced no changes in the data set and were not actually propagated to AOF/slaves.

Example:

eval "if tonumber(KEYS[1]) > 0 then redis.call('incr', 'x') end" 1 0

Then:

evalsha <sha1 step 1 script> 1 0

At this step the sha1 of the script is added to the replication script cache (the script is marked as known to the slaves) and the EVALSHA command is transformed to EVAL. However, it is not dirty (there are no changes to the db), so it is not propagated to the slaves.

Then the script is called again:

evalsha <sha1 step 1 script> 1 1

At this step the master checks that the script already exists in the replication script cache and doesn't transform it to an EVAL command. It is dirty and propagated to the slaves, but they fail to evaluate the script as they don't have it in the script cache.

The fix is trivial and just uses the new API to force the propagation of the executed command regardless of the dirty state of the data set.

Thank you to @minus-infinity on Github for finding the issue, understanding the root cause, and fixing it.


antirez · 2014-02-13T11:19:04Z

Thank you for using it @minus-infinity 😃

Fix pushed into all branches, writing a regression test now. I'll release 2.8.6 ASAP (there is another issue to fix). Closing.


antirez closed this as completed Feb 13, 2014

antirez added a commit that referenced this issue Feb 13, 2014

Test: regression for issue #1549.

f2bdf60

It was verified that reverting the commit that fixes the bug, the test no longer passes.

antirez added a commit that referenced this issue Feb 13, 2014

Test: regression for issue #1549.

ebdb37c

It was verified that reverting the commit that fixes the bug, the test no longer passes.

antirez added a commit that referenced this issue Feb 13, 2014

Test: regression for issue #1549.

767846d

It was verified that reverting the commit that fixes the bug, the test no longer passes.

antirez mentioned this issue Apr 16, 2014

Replicate readonly scripts with SCRIPT LOAD. #1689

Closed

antirez mentioned this issue Jan 12, 2016

EVAL / EVALSHA is not executed on slave servers #2999

Closed
