3.2.1 crash when as slave node sync with master #3343

luweijie007 · 2016-06-24T04:25:25Z

log as below:
=== REDIS BUG REPORT START: Cut & paste starting from here ===
27648:S 24 Jun 12:16:32.265 # Redis 3.2.1 crashed by signal: 11
27648:S 24 Jun 12:16:32.265 # Crashed running the instuction at: 0x431599
27648:S 24 Jun 12:16:32.265 # Accessing address: 0x7fd7dc299d1e
27648:S 24 Jun 12:16:32.265 # Failed assertion: (:0)

------ STACK TRACE ------
EIP:
./redis-server *:6400 [cluster][0x431599]

Backtrace:
./redis-server *:6400 cluster[0x45caa9]
./redis-server *:6400 cluster[0x45cf9a]
/lib64/libpthread.so.0(+0xf130)[0x7fd79b0c0130]
./redis-server *:6400 [cluster][0x431599]
./redis-server *:6400 cluster[0x420381]
./redis-server *:6400 cluster[0x4453e4]
./redis-server *:6400 cluster[0x426935]
./redis-server *:6400 cluster[0x429947]
./redis-server *:6400 cluster[0x4362b5]
./redis-server *:6400 cluster[0x4210b8]
./redis-server *:6400 cluster[0x42136b]
./redis-server *:6400 cluster[0x41e35f]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fd79ad11af5]
./redis-server *:6400 [cluster][0x41e5e5]

------ INFO OUTPUT ------

# Server

redis_version:3.2.1
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:42468865c010fa71
redis_mode:cluster
os:Linux 3.10.0-123.el7.x86_64 x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.8.3
process_id:27648
run_id:61d307b257465f360b038ca9e7227729c8f7fc55
tcp_port:6400
uptime_in_seconds:329
uptime_in_days:0
hz:10
lru_clock:7124000
executable:/home/redis/redis-3.2.1/src/./redis-server
config_file:/home/redis/redis-3.2.1/redis.conf

# Clients

connected_clients:2
client_longest_output_list:0
client_biggest_input_buf:14962
blocked_clients:0

# Memory

used_memory:2989199160
used_memory_human:2.78G
used_memory_rss:3073859584
used_memory_rss_human:2.86G
used_memory_peak:2989199160
used_memory_peak_human:2.78G
total_system_memory:16609554432
total_system_memory_human:15.47G
used_memory_lua:37888
used_memory_lua_human:37.00K
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
mem_fragmentation_ratio:1.03
mem_allocator:jemalloc-4.0.3

# Persistence

loading:0
rdb_changes_since_last_save:14
rdb_bgsave_in_progress:0
rdb_last_save_time:1466741463
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok

# Stats

total_connections_received:1
total_commands_processed:18
instantaneous_ops_per_sec:0
total_net_input_bytes:1151760372
total_net_output_bytes:1040
instantaneous_input_kbps:49380.11
instantaneous_output_kbps:0.00
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
evicted_keys:0
keyspace_hits:0
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0

# Replication

role:slave
master_host:10.15.144.103
master_port:6400
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0
slave_repl_offset:9950243389
slave_priority:100
slave_read_only:1
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

# CPU

used_cpu_sys:9.85
used_cpu_user:13.29
used_cpu_sys_children:0.00
used_cpu_user_children:0.00

# Commandstats

cmdstat_rpush:calls=2,usec=13,usec_per_call=6.50
cmdstat_lset:calls=12,usec=24,usec_per_call=2.00
cmdstat_select:calls=1,usec=2,usec_per_call=2.00
cmdstat_cluster:calls=3,usec=805,usec_per_call=268.33

# Cluster

cluster_enabled:1

# Keyspace

db0:keys=410635,expires=0,avg_ttl=0
hash_init_value: 1466527529

------ CLIENT LIST OUTPUT ------
id=2 addr=10.15.107.143:56641 fd=9 name= age=146 idle=64 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=cluster
id=3 addr=10.15.144.103:6400 fd=24 name= age=0 idle=0 flags=M db=0 sub=0 psub=0 multi=-1 qbuf=14962 qbuf-free=17806 obl=0 oll=0 omem=0 events=r cmd=lset

------ CURRENT CLIENT INFO ------
id=3 addr=10.15.144.103:6400 fd=24 name= age=0 idle=0 flags=M db=0 sub=0 psub=0 multi=-1 qbuf=14962 qbuf-free=17806 obl=0 oll=0 omem=0 events=r cmd=lset
argv[0]: 'LSET'
argv[1]: '660FXU'
argv[2]: '-1'
argv[3]: S N0'
27648:S 24 Jun 12:16:32.266 # key '660FXU' found in DB containing the following object:
27648:S 24 Jun 12:16:32.266 # Object type: 1
27648:S 24 Jun 12:16:32.266 # Object encoding: 9
27648:S 24 Jun 12:16:32.266 # Object refcount: 1
27648:S 24 Jun 12:16:32.266 # List length: 1

------ REGISTERS ------
27648:S 24 Jun 12:16:32.266 #
RAX:00000000000000ff RBX:000000000000001a
RCX:000000000000001a RDX:00007fd6dd090f93
RDI:00007fd7dc299d1e RSI:00007fd6dc299d3a
RBP:00007fd761292640 RSP:00007fffbe24d110
R8 :00007fd6dc299d30 R9 :00007fd79aa00180
R10:0000000000000022 R11:00007fd79420eea0
R12:0000000000000000 R13:000000000000001a
R14:00007fd6dc299d3a R15:000000000000000b
RIP:0000000000431599 EFL:0000000000010206
CSGSFS:cd00000000000033
27648:S 24 Jun 12:16:32.266 # (00007fffbe24d11f) -> 00007fd761292660
27648:S 24 Jun 12:16:32.266 # (00007fffbe24d11e) -> 000000000000001a
27648:S 24 Jun 12:16:32.266 # (00007fffbe24d11d) -> 000000000000000a
27648:S 24 Jun 12:16:32.266 # (00007fffbe24d11c) -> 00007fd7612934d8
27648:S 24 Jun 12:16:32.266 # (00007fffbe24d11b) -> 00007fffbe24d1d0
27648:S 24 Jun 12:16:32.266 # (00007fffbe24d11a) -> 0000000000000001
27648:S 24 Jun 12:16:32.266 # (00007fffbe24d119) -> 00000000075bcd15
27648:S 24 Jun 12:16:32.266 # (00007fffbe24d118) -> 00007f0000000002
27648:S 24 Jun 12:16:32.266 # (00007fffbe24d117) -> 0000000000000001
27648:S 24 Jun 12:16:32.266 # (00007fffbe24d116) -> 0000000000000001
27648:S 24 Jun 12:16:32.266 # (00007fffbe24d115) -> 00007fd6dd090f93
27648:S 24 Jun 12:16:32.266 # (00007fffbe24d114) -> 00007f0000000002
27648:S 24 Jun 12:16:32.266 # (00007fffbe24d113) -> 0000001a00000001
27648:S 24 Jun 12:16:32.266 # (00007fffbe24d112) -> 0000000000000001
27648:S 24 Jun 12:16:32.266 # (00007fffbe24d111) -> 0000001c00000028
27648:S 24 Jun 12:16:32.266 # (00007fffbe24d110) -> 00007fd79aa00080

------ FAST MEMORY TEST ------
27648:S 24 Jun 12:16:32.267 # Bio thread for job type #0 terminated
27648:S 24 Jun 12:16:32.267 # Bio thread for job type #1 terminated
*** Preparing to test memory region 722000 (114688 bytes)
*** Preparing to test memory region 1c75000 (135168 bytes)
*** Preparing to test memory region 7fd6dbe00000 (3072327680 bytes)
*** Preparing to test memory region 7fd7931ff000 (8388608 bytes)
*** Preparing to test memory region 7fd793a00000 (8388608 bytes)
*** Preparing to test memory region 7fd794200000 (2097152 bytes)
*** Preparing to test memory region 7fd79aa00000 (2097152 bytes)
*** Preparing to test memory region 7fd79b0ac000 (20480 bytes)
*** Preparing to test memory region 7fd79b2c9000 (16384 bytes)
*** Preparing to test memory region 7fd79b9c7000 (16384 bytes)
*** Preparing to test memory region 7fd79b9f1000 (4096 bytes)
*** Preparing to test memory region 7fd79b9f2000 (4096 bytes)
*** Preparing to test memory region 7fd79b9f5000 (4096 bytes)
.O.O.O.O.O.O.O.O.O.O.O.O.O
Fast memory test PASSED, however your memory can still be broken. Please run a memory test for several hours if possible.
=== REDIS BUG REPORT END. Make sure to include from START to END. ===

   Please report the crash by opening an issue on github:

       http://github.com/antirez/redis/issues

Suspect RAM error? Use redis-server --test-memory to verify it.

thanks!

antirez · 2016-06-24T06:53:56Z

Hello, unfortunately the binary is stripped so there are no symbols to understand exactly where this happened. Please could you send the Redis binary you are using? Thanks.

This bug could actually be in different places:

- Cluster
- Replication
- List data type implementation, which changed recently (in 3.2) with the introduction of quicklists

The binary can give us some clue.

luweijie007 · 2016-06-24T08:02:02Z

@antirez, here is the binary file. I had to rename it to *.jpg in order to upload it.

antirez · 2016-06-24T08:16:56Z

@luweijie007 good trick ;-) Thanks, apparently I was able to download it correctly.

antirez · 2016-06-24T08:57:45Z

Apparently the crash happened inside the ziplist implementation, so lacking more data we could assume either a memory corruption problem due to a bug, or broken memory. I'm trying to write a stress tester to run with valgrind in order to check if there are potential bugs that can be found via fuzzing better than the Redis test suite is doing.

antirez · 2016-06-24T09:03:31Z

@luweijie007 I've a stress tester running. Would be great if you could run a serious memory testing suite in your slave. Or is it a VM instance?

luweijie007 · 2016-06-24T09:46:45Z

@antirez it is not a VM instance, it is a physical machine.
Do you mean I should run redis-server --test-memory on the slave nodes?
This crash happened on two different machines. I can run redis-server --test-memory and tell you the result.

antirez · 2016-06-24T10:00:54Z

You noticed the same crash on two machines? At the same time and with a similar stack trace (while the server was doing an operation on lists)?

srhitesh · 2016-06-24T12:19:14Z

@luweijie007 Can you share the core dump file from when it crashed?

antirez · 2016-06-24T13:22:18Z

A core dump could also help, as could running --test-memory for some time or, even better, memtest86. The two most probable causes are a bug in ziplist or a hardware error due to non-ECC memory, but it could also be another, more complex bug that corrupts Redis memory. The core dump may give some clue. A second crash report would also help. Thanks.

oranagra · 2016-06-25T17:20:38Z

@antirez looking at the assembly code, the difference between r8 and rdi is the value stored in ZIPLIST_TAIL_OFFSET, and this delta is exactly 2^32-17.
This looks more like a wrap-around bug than memory corruption.
Further careful reading of the code is needed to find the bug.
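
The arithmetic behind this observation can be checked against the REGISTERS section of the crash report. The sketch below is only an illustration in Python, not Redis code, and the helper `as_uint32` is a hypothetical name. It subtracts R8 from RDI as printed in the report and shows how a small negative value stored in a 32-bit unsigned field wraps to a number just below 2^32. With the values as printed, this gives 2^32 - 18, one off from the figure quoted above; the difference may come from which operands were compared.

```python
# Register values copied from the REGISTERS section of the crash report.
RDI = 0x00007FD7DC299D1E  # also the "Accessing address" in the log
R8 = 0x00007FD6DC299D30


def as_uint32(value: int) -> int:
    """Reinterpret an integer as a 32-bit unsigned value (hypothetical helper)."""
    return value & 0xFFFFFFFF


delta = RDI - R8
shortfall = 2**32 - delta  # how far the delta falls below 2^32

print(hex(delta), shortfall)

# A small negative offset stored in a 32-bit unsigned field wraps around to
# the same kind of value, which is why the delta looks like a wrap-around.
wrapped = as_uint32(-shortfall)
print(wrapped == delta)
```

A delta this close to a power of two is hard to explain with random corruption, which fits the wrap-around reading given in the comment.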

antirez · 2016-06-25T18:25:14Z

Thanks @oranagra, this is a very interesting hint... Also, the OP reported seeing this twice, on two different machines.

antirez · 2016-06-25T18:30:25Z

I put my money on ziplistMerge(). Will check carefully on Monday.

antirez · 2016-06-27T08:35:39Z

@luweijie007 please could you tell me all the list commands you use in the master in order to populate the list? Thanks.

antirez · 2016-06-27T08:44:17Z

@luweijie007 UPDATE: I can crash the quicklist implementation, so I can confirm there are bugs. I need one more piece of information: do you have compression enabled in quicklist? I'm referring to the list-compress-depth setting in redis.conf. Thanks.

antirez · 2016-06-27T12:29:32Z

Yet another UPDATE:

- The problem is specific to >= 3.2. 3.0 is not affected.
- The issue is in quicklist.c.
- The faulty code is apparently in _quicklistSplitNode().
- I can reproduce it again and again with 7 commands (after writing a program to find crashes, then find smaller crashes, and finally minimizing the original set of commands needed from ~2000 to 7).
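
The minimization step mentioned in the last point can be sketched as follows. This is a hypothetical Python illustration, not the actual tester used for this issue. `crashes` stands in for "replay the commands against a fresh server and check whether it died"; here it is simulated by `fake_crashes`, a made-up predicate that requires three invented commands to be present in order.

```python
import random
from typing import Callable, List, Sequence


def minimize(commands: Sequence[str],
             crashes: Callable[[Sequence[str]], bool],
             attempts: int = 2000,
             seed: int = 0) -> List[str]:
    """Remove random commands for as long as the sequence still crashes."""
    rng = random.Random(seed)
    current = list(commands)
    for _ in range(attempts):
        if len(current) <= 1:
            break
        idx = rng.randrange(len(current))
        candidate = current[:idx] + current[idx + 1:]
        if crashes(candidate):
            current = candidate  # the removed command was not needed
    return current


# Simulated oracle (an assumption of this sketch): the "crash" needs these
# three invented commands to appear in this relative order.
NEEDED = ["RPUSH k a", "LINSERT k BEFORE a b", "LSET k -1 c"]


def fake_crashes(seq: Sequence[str]) -> bool:
    it = iter(seq)
    # `cmd in it` consumes the iterator, so this is a subsequence check.
    return all(cmd in it for cmd in NEEDED)


noise = [f"RPUSH other {i}" for i in range(50)]
full = (noise[:20] + [NEEDED[0]] + noise[20:35] + [NEEDED[1]]
        + noise[35:] + [NEEDED[2]])
reduced = minimize(full, fake_crashes)
print(len(full), "->", len(reduced))
```

Against a real server the oracle is far more expensive than here, so the number of attempts would be chosen with that cost in mind.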

I'm working on a fix. It is possible that I'll discover more issues, since certain edge cases are hard to trigger and the quicklist code is not as tested and safe as other parts of the Redis data structures implementation.

The quicklist keeps a cached copy of the size in bytes of the ziplist representation. The implementation must update this length every time the underlying ziplist changes. However, quicklistReplaceAtIndex() failed to update the length. During LSET calls, the size of the ziplist blob and the cached size inside the quicklist diverged. Later, when this size is used in an authoritative way, for example during node splitting in order to copy the nodes, we end up with a duplicated node that may contain random garbage.

This commit should fix issue #3343. However, several problems were found while reviewing the quicklist.c code in search of this bug, and they should be addressed sooner or later. For example:

1. Keeping a cached ziplist length is fragile, since failing to update it leads to this kind of issue.
2. The node splitting code needs auditing. For example, it works only because of a side effect of ziplistDeleteRange(), which is able to cope with a wrong count of elements to remove. The code inside quicklist.c assumes that -1 means "delete till the end", while actually it is just a count of how many elements to delete, and it is an unsigned count. So -1 gets converted into the maximum integer, and just by chance the ziplist code stops deleting elements once there are no more to delete.
3. Node splitting is extremely inefficient: it copies the node and removes elements from both nodes even when only a single entry has to be moved from one node to the other, or when the new resulting node is completely empty, so that there is nothing to copy and only a new node to create.

However, at least for Redis 3.2, introducing fresh code inside quicklist.c may be even more risky, so instead I'm writing a better fuzzy tester to stress the internals a bit more in order to anticipate other possible bugs.

This bug was found using a fuzzy tester written after having some clue about where the bug could be. The tester eventually created a ~2000 commands sequence able to always crash Redis. I wrote a better version of the tester that automatically searched for the smallest sequence that could crash Redis. Later this smaller sequence was minimized by removing random commands for as long as it still crashed the server. This resulted in a sequence of 7 commands. With this small sequence it was just a matter of filling the code with enough printf() calls to understand enough state to fix the bug.
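
The divergence described above can be illustrated with a small model. The following Python sketch is not Redis code: `Node`, `replace_buggy`, `replace_fixed`, `copy_for_split`, and `delete_range` are hypothetical names, and the byte layout is invented for the example. It shows a cached size going stale after a replace, and a count of -1 being reinterpreted as a large unsigned number.

```python
class Node:
    """Toy stand-in for a quicklist node: a byte blob plus a cached size.

    Hypothetical model for illustration only; the real structures in
    quicklist.c and ziplist.c look different.
    """

    def __init__(self, entries):
        self.entries = list(entries)
        self.blob = self._encode()
        self.sz = len(self.blob)  # cached size, must track the blob

    def _encode(self) -> bytes:
        # Invented layout: one length byte followed by the payload.
        return b"".join(bytes([len(e)]) + e for e in self.entries)

    def replace_buggy(self, index: int, value: bytes) -> None:
        self.entries[index] = value
        self.blob = self._encode()  # blob changes, cached size does not

    def replace_fixed(self, index: int, value: bytes) -> None:
        self.entries[index] = value
        self.blob = self._encode()
        self.sz = len(self.blob)  # refresh the cache after every change

    def copy_for_split(self) -> bytes:
        # A split trusts the cached size when copying the blob.
        return self.blob[:self.sz]


def delete_range(entries: list, index: int, num: int) -> list:
    """Delete up to `num` entries from `index`; `num` is an unsigned count."""
    num &= 0xFFFFFFFF  # -1 becomes 4294967295, not "till the end"
    del entries[index:index + num]  # stops when there is nothing left
    return entries


buggy = Node([b"a", b"b"])
buggy.replace_buggy(-1, b"a much longer value")
fixed = Node([b"a", b"b"])
fixed.replace_fixed(-1, b"a much longer value")

print(buggy.sz, len(buggy.blob))  # cached size vs. real size
print(fixed.sz, len(fixed.blob))
print(delete_range([1, 2, 3, 4], 1, -1))
```

In this model a stale size only truncates the copy. In C the same mistake can copy too few or too many bytes, which is consistent with the garbage-filled duplicated node described in the commit message.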

antirez · 2016-06-27T16:14:07Z

Dear @luweijie007 I hope I fixed the bug in commit 7041967. Please if possible could you apply it to your instances and report if this fixes the issue? Thanks.

Note: it was verified that it can crash the test suite without the patch applied.

luweijie007 · 2016-07-02T12:36:16Z

@antirez, I am sorry, I have been on holiday these days, so I had no time to see your reply.
I can try to reproduce this issue, get a dump file, and give you more information.
Thanks again for everyone's work!

antirez · 2016-07-04T10:43:29Z

Thank you @luweijie007, we believe the problem is fixed now, no need for the dump, mostly in need of checking if our fix solves the problem for you. Within 24/48 hours I'm releasing Redis 3.2.2 that includes this fix. Thanks.

qunchenmy · 2016-07-08T03:53:22Z

I'm sorry to hear this. I have to downgrade my Redis instances from v3.2.0 to v3.0.7, about 30 instances in total.

oranagra · 2016-07-08T04:34:21Z

Why downgrade to 3.0, and not upgrade to a version with the fix?
Btw, downgrading from 3.2 might be hard since its RDB format has changed and is not compatible with 3.0 (sync won't work either).

qunchenmy · 2016-07-08T05:29:14Z

@oranagra We need Redis in mass production, and v3.0.7 (3.0.3) is a more stable release.

Recently we started using list-compress-depth in tests (it was completely untested till now). It turns out this triggered test failures in external mode, since the tests left the setting enabled and it was then used in other tests (specifically the fuzzer named "Stress tester for #3343-alike bugs"). This PR fixes the issue of the `recompress` flag being left set by mistake, which caused the code to later compress the head or tail nodes (which should never be compressed). The solution is to reset the recompress flag when it should have been reset (when it was decided not to compress). Additionally, we're adding some assertions and improving the tests in order to catch other similar bugs.

This PR follows #9779.

## Description of the feature

Now, when the `list-compress-depth` configuration is turned on, the list compresses the ziplists between `[list-compress-depth, -list-compress-depth]`. When we need to use the compressed data, we first decompress it, then use it, and finally compress it again. This is controlled by `quicklistNode->recompress`, which is designed to avoid re-traversing the entire quicklist for compression after each decompression: we only need to recompress the quicklistNode being used.

To ensure that recompression is correct, quicklistDecompressNodeForUse and quicklistCompress should normally appear in pairs. Otherwise the head and tail may end up compressed, or a middle ziplist may not be compressed correctly, which is exactly the problem this PR needs to solve.

## Solution

1. Reset `quicklistIter` after insert and replace. The quicklist node will be compressed in `quicklistInsertAfter`, `quicklistInsertBefore`, and `quicklistReplaceAtIndex`, so we can safely reset the quicklistIter to prevent it from being used again.
2. `quicklistIndex` will return an iterator that can be used to recompress the current node after use.

## Test

1. In the `Stress Tester for #3343-Similar Errors` test, print the violating commands when the server crashes or when a `valgrind` or `asan` error is detected.
2. Add a crash test for wrongly recompressing after `lrem`.
3. Remove `insert before with 0 elements` and `insert after with 0 elements`. Any operation on a NULL quicklistIter is now forbidden.
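
The failure mode of a `recompress` flag that is left set can be illustrated with a toy model. The Python sketch below is an assumption-laden illustration and not the Redis implementation: `QNode`, `decompress_for_use`, `finish_use`, `recompress_pending`, and `run` are hypothetical names, and the real logic in quicklist.c is more involved. It models a node that is decompressed for use, becomes the head in the meantime, and is later compressed by a code path that trusts the stale flag.

```python
from dataclasses import dataclass


@dataclass
class QNode:
    """Toy quicklist node (hypothetical model, not the real struct)."""
    compressed: bool = False
    recompress: bool = False


def decompress_for_use(node: QNode) -> None:
    if node.compressed:
        node.compressed = False
        node.recompress = True  # remember to compress it again after use


def finish_use(node: QNode, in_compress_range: bool, reset_flag: bool) -> None:
    """Second half of the pair: recompress the node if that is allowed.

    reset_flag=False models the bug of leaving the flag set when the
    decision is not to compress.
    """
    if in_compress_range:
        node.compressed = True
        node.recompress = False
    elif reset_flag:
        node.recompress = False


def recompress_pending(node: QNode) -> None:
    """A later code path that trusts the flag without re-checking position."""
    if node.recompress:
        node.compressed = True
        node.recompress = False


def run(reset_flag: bool) -> bool:
    node = QNode(compressed=True)  # starts as an interior node
    decompress_for_use(node)
    # While in use the node became the head, so it must stay uncompressed.
    finish_use(node, in_compress_range=False, reset_flag=reset_flag)
    recompress_pending(node)
    return node.compressed


head_compressed_with_bug = run(reset_flag=False)
head_compressed_with_fix = run(reset_flag=True)
print(head_compressed_with_bug, head_compressed_with_fix)
```

The model only captures the flag bookkeeping; it says nothing about how the actual patches restructure the iterator handling.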

antirez added WAITING-OP-REPLY crash report labels Jun 24, 2016

antirez added critical bug and removed WAITING-OP-REPLY labels Jun 27, 2016

antirez added a commit that referenced this issue Jun 28, 2016

Regression test for issue #3343 exact min crash sequence.

4989986

Note: it was verified that it can crash the test suite without the patch applied.

antirez added a commit that referenced this issue Jun 28, 2016

Test: new randomized stress tester for #3343 alike bugs.

24bd9b1

antirez added a commit that referenced this issue Jun 30, 2016

Regression test for issue #3343 exact min crash sequence.

2c3fcf8

Note: it was verified that it can crash the test suite without the patch applied.

antirez added a commit that referenced this issue Jun 30, 2016

Test: new randomized stress tester for #3343 alike bugs.

7a3a595

NODICKHILL mentioned this issue Oct 30, 2016

Test replication partial resync fail #1417

Open

antirez closed this as completed Jan 26, 2017

JingchengLi added a commit to JingchengLi/swapdb that referenced this issue Aug 23, 2017

Fix a bug that causes redis to crash: redis/redis#3343

36bd851

fuliusu mentioned this issue May 22, 2019

5.0.5 Testing unit/quit Testing integration/block-repl error，and Testing unit/hyperloglog fail #6115

Open

verbus mentioned this issue May 1, 2020

Redis 6.0 Test Fails @ PSYNC2 #3899 regression: kill first replica #7169

Open

moria7757 mentioned this issue Dec 29, 2020

*** [err]: Active defrag in tests/unit/memefficiency.tcl #8265

Closed

qq1052121189 mentioned this issue Feb 7, 2021

Centos7 redis-6.0.10 'make test' faiil: *** [err]: ZSCAN with encoding skiplist in tests/unit/scan.tcl #8457

Closed

stefanschindler mentioned this issue Apr 20, 2021

[BUG] make test fails on Ubuntu 18.04 (tests/unit/networking.tcl) #8828

Closed

perryitay mentioned this issue Nov 14, 2021

Fix crashes when list-compress-depth is used. #9779

Merged
