
[CRASH] Command duration is not reset when a client is blocked on XREADGROUP and the stream's slot is migrated, failing an assertion #563

Closed
nitaicaro opened this issue May 28, 2024 · 2 comments

@nitaicaro
Contributor

In #11012, we changed the way command durations are computed, in order to handle the same command being executed multiple times. In #11970, we added an assertion that fires if the duration is not properly reset, potentially indicating that a call to report statistics was missed.

I found an edge case where this happens. It is easily reproduced by blocking a client on XREADGROUP and then migrating the stream's slot, which causes the engine to process the XREADGROUP command twice:

  1. On the first pass, the client blocks on the stream, so processing stops until the client is unblocked. Normally, on the second pass after unblock, the command is processed to completion, which includes recording the duration and then resetting it.
  2. On the second pass after unblock we hit the edge case: by this point the slot has already been migrated to another node, so we return a MOVED redirect instead. On that reply path, the duration field is never reset.
=== VALKEY BUG REPORT START: Cut & paste starting from here ===
13788:M 28 May 2024 07:34:12.289 # === ASSERTION FAILED ===
13788:M 28 May 2024 07:34:12.289 # ==> networking.c:2092 'c->duration == 0' is not true

------ STACK TRACE ------

13794 bio_lazy_free
/lib64/libpthread.so.0(pthread_cond_wait+0x1f7)[0x7f36c3010377]
src/valkey-server 127.0.0.1:21148 [cluster](bioProcessBackgroundJobs+0x138)[0x4e6d78]
/lib64/libpthread.so.0(+0x744b)[0x7f36c300a44b]
/lib64/libc.so.6(clone+0x3f)[0x7f36c2d4552f]

13793 bio_aof
/lib64/libpthread.so.0(pthread_cond_wait+0x1f7)[0x7f36c3010377]
src/valkey-server 127.0.0.1:21148 [cluster](bioProcessBackgroundJobs+0x138)[0x4e6d78]
/lib64/libpthread.so.0(+0x744b)[0x7f36c300a44b]
/lib64/libc.so.6(clone+0x3f)[0x7f36c2d4552f]

13792 bio_close_file
/lib64/libpthread.so.0(pthread_cond_wait+0x1f7)[0x7f36c3010377]
src/valkey-server 127.0.0.1:21148 [cluster](bioProcessBackgroundJobs+0x138)[0x4e6d78]
/lib64/libpthread.so.0(+0x744b)[0x7f36c300a44b]
/lib64/libc.so.6(clone+0x3f)[0x7f36c2d4552f]

13788 valkey-server *
src/valkey-server 127.0.0.1:21148 [cluster][0x467287]
src/valkey-server 127.0.0.1:21148 [cluster][0x552443]
src/valkey-server 127.0.0.1:21148 [cluster][0x52f049]
src/valkey-server 127.0.0.1:21148 [cluster](processCommand+0x728)[0x51d4f8]
src/valkey-server 127.0.0.1:21148 [cluster](processInputBuffer+0xd8)[0x46c4f8]
src/valkey-server 127.0.0.1:21148 [cluster](readQueryFromClient+0x46e)[0x46cb0e]
src/valkey-server 127.0.0.1:21148 [cluster][0x52043f]
src/valkey-server 127.0.0.1:21148 [cluster](aeProcessEvents+0x321)[0x4567c1]
src/valkey-server 127.0.0.1:21148 [cluster](aeMain+0x20)[0x456b50]
src/valkey-server 127.0.0.1:21148 [cluster](main+0x375)[0x451605]
/lib64/libc.so.6(__libc_start_main+0xea)[0x7f36c2c7713a]
src/valkey-server 127.0.0.1:21148 [cluster](_start+0x2a)[0x451c2a]

4/4 expected stacktraces.

------ STACK TRACE DONE ------

------ INFO OUTPUT ------
# Server
redis_version:7.2.4
server_name:valkey
valkey_version:255.255.255
redis_git_sha1:8e3c8fd4
redis_git_dirty:1
redis_build_id:acae0fb9c5af6b19
server_mode:cluster
os:Linux 5.10.216-182.855.amzn2int.x86_64 x86_64
arch_bits:64
monotonic_clock:POSIX clock_gettime
multiplexing_api:epoll
atomicvar_api:c11-builtin
gcc_version:7.3.1
process_id:13788
process_supervised:no
run_id:e67754454b51ac45e95400802fb1d8fc47236b50
tcp_port:21148
server_time_usec:1716881652289835
uptime_in_seconds:11
uptime_in_days:0
hz:10
configured_hz:10
lru_clock:5605620
executable:/local/workplace/caronita/valkey/src/valkey-server
config_file:/local/workplace/caronita/valkey/./tests/tmp/valkey.conf.10352.76
io_threads_active:0
listener0:name=tcp,bind=127.0.0.1,port=21148
listener1:name=unix,bind=/local/workplace/caronita/valkey/tests/tmp/server.10352.75/socket

# Clients
connected_clients:2
cluster_connections:10
maxclients:10000
client_recent_max_input_buffer:20607
client_recent_max_output_buffer:20504
blocked_clients:0
tracking_clients:0
pubsub_clients:0
watching_clients:0
clients_in_timeout_table:0
total_watched_keys:0
total_blocking_keys:0
total_blocking_keys_on_nokey:0

# Memory
used_memory:2696056
used_memory_human:2.57M
used_memory_rss:8060928
used_memory_rss_human:7.69M
used_memory_peak:2696056
used_memory_peak_human:2.57M
used_memory_peak_perc:100.06%
used_memory_overhead:2308387
used_memory_startup:2232056
used_memory_dataset:387669
used_memory_dataset_perc:83.55%
allocator_allocated:2712520
allocator_active:3026944
allocator_resident:9936896
allocator_muzzy:0
total_system_memory:65872703488
total_system_memory_human:61.35G
used_memory_lua:31744
used_memory_vm_eval:31744
used_memory_lua_human:31.00K
used_memory_scripts_eval:0
number_of_cached_scripts:0
number_of_functions:0
number_of_libraries:0
used_memory_vm_functions:33792
used_memory_vm_total:65536
used_memory_vm_total_human:64.00K
used_memory_functions:184
used_memory_scripts:184
used_memory_scripts_human:184B
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
allocator_frag_ratio:1.12
allocator_frag_bytes:314424
allocator_rss_ratio:3.28
allocator_rss_bytes:6909952
rss_overhead_ratio:0.81
rss_overhead_bytes:-1875968
mem_fragmentation_ratio:3.08
mem_fragmentation_bytes:5440480
mem_not_counted_for_evict:0
mem_replication_backlog:20508
mem_total_replication_buffers:20504
mem_clients_slaves:0
mem_clients_normal:44919
mem_cluster_links:10720
mem_aof_buffer:0
mem_allocator:jemalloc-5.3.0
mem_overhead_db_hashtable_rehashing:0
active_defrag_running:0
lazyfree_pending_objects:0
lazyfreed_objects:0

# Persistence
loading:0
async_loading:0
current_cow_peak:0
current_cow_size:0
current_cow_size_age:0
current_fork_perc:0.00
current_save_keys_processed:0
current_save_keys_total:0
rdb_changes_since_last_save:3
rdb_bgsave_in_progress:0
rdb_last_save_time:1716881641
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:0
rdb_current_bgsave_time_sec:-1
rdb_saves:0
rdb_last_cow_size:385024
rdb_last_load_keys_expired:0
rdb_last_load_keys_loaded:0
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_rewrites:0
aof_rewrites_consecutive_failures:0
aof_last_write_status:ok
aof_last_cow_size:0
module_fork_in_progress:0
module_fork_last_cow_size:0

# Stats
total_connections_received:4
total_commands_processed:241
instantaneous_ops_per_sec:12
total_net_input_bytes:7479
total_net_output_bytes:278295
total_net_repl_input_bytes:0
total_net_repl_output_bytes:511
instantaneous_input_kbps:0.46
instantaneous_output_kbps:12.76
instantaneous_input_repl_kbps:0.00
instantaneous_output_repl_kbps:0.06
rejected_connections:0
sync_full:1
sync_partial_ok:0
sync_partial_err:1
expired_keys:0
expired_stale_perc:0.00
expired_time_cap_reached_count:0
expire_cycle_cpu_milliseconds:0
evicted_keys:0
evicted_clients:0
evicted_scripts:0
total_eviction_exceeded_time:0
current_eviction_exceeded_time:0
keyspace_hits:3
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
pubsubshard_channels:0
latest_fork_usec:314
total_forks:1
migrate_cached_sockets:1
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
total_active_defrag_time:0
current_active_defrag_time:0
tracking_total_keys:0
tracking_total_items:0
tracking_total_prefixes:0
unexpected_error_replies:0
total_error_replies:1
dump_payload_sanitizations:0
total_reads_processed:242
total_writes_processed:229
io_threaded_reads_processed:0
io_threaded_writes_processed:0
client_query_buffer_limit_disconnections:0
client_output_buffer_limit_disconnections:0
reply_buffer_shrinks:3
reply_buffer_expands:0
eventloop_cycles:1511
eventloop_duration_sum:120095
eventloop_duration_cmd_sum:10361
instantaneous_eventloop_cycles_per_sec:147
instantaneous_eventloop_duration_usec:33
acl_access_denied_auth:0
acl_access_denied_cmd:0
acl_access_denied_key:0
acl_access_denied_channel:0

# Replication
role:master
connected_slaves:1
slave0:ip=127.0.0.1,port=21145,state=online,offset=209,lag=0
master_failover_state:no-failover
master_replid:370ddacd97b1accd995eaadfb434be53b2115b49
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:268
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1
repl_backlog_histlen:268

# CPU
used_cpu_sys:0.028594
used_cpu_user:0.057189
used_cpu_sys_children:0.000000
used_cpu_user_children:0.001001
used_cpu_sys_main_thread:0.028416
used_cpu_user_main_thread:0.056832

# Modules

# Commandstats
cmdstat_migrate:calls=1,usec=250,usec_per_call=250.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|setslot:calls=2,usec=28,usec_per_call=14.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|info:calls=1,usec=46,usec_per_call=46.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|nodes:calls=1,usec=80,usec_per_call=80.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|meet:calls=5,usec=69,usec_per_call=13.80,rejected_calls=0,failed_calls=0
cmdstat_cluster|shards:calls=199,usec=8825,usec_per_call=44.35,rejected_calls=0,failed_calls=0
cmdstat_cluster|addslotsrange:calls=1,usec=106,usec_per_call=106.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|slots:calls=10,usec=317,usec_per_call=31.70,rejected_calls=0,failed_calls=0
cmdstat_cluster|myid:calls=1,usec=1,usec_per_call=1.00,rejected_calls=0,failed_calls=0
cmdstat_ping:calls=2,usec=2,usec_per_call=1.00,rejected_calls=0,failed_calls=0
cmdstat_xreadgroup:calls=0,usec=0,usec_per_call=0.00,rejected_calls=1,failed_calls=0
cmdstat_replconf:calls=15,usec=161,usec_per_call=10.73,rejected_calls=0,failed_calls=0
cmdstat_psync:calls=1,usec=426,usec_per_call=426.00,rejected_calls=0,failed_calls=0
cmdstat_xgroup|create:calls=1,usec=31,usec_per_call=31.00,rejected_calls=0,failed_calls=0
cmdstat_config|get:calls=1,usec=10,usec_per_call=10.00,rejected_calls=0,failed_calls=0

# Errorstats
errorstat_ASK:count=1

# Latencystats
latency_percentiles_usec_migrate:p50=250.879,p99=250.879,p99.9=250.879
latency_percentiles_usec_cluster|setslot:p50=7.007,p99=21.119,p99.9=21.119
latency_percentiles_usec_cluster|info:p50=46.079,p99=46.079,p99.9=46.079
latency_percentiles_usec_cluster|nodes:p50=80.383,p99=80.383,p99.9=80.383
latency_percentiles_usec_cluster|meet:p50=10.047,p99=29.055,p99.9=29.055
latency_percentiles_usec_cluster|shards:p50=44.031,p99=65.023,p99.9=72.191
latency_percentiles_usec_cluster|addslotsrange:p50=106.495,p99=106.495,p99.9=106.495
latency_percentiles_usec_cluster|slots:p50=31.103,p99=47.103,p99.9=47.103
latency_percentiles_usec_cluster|myid:p50=1.003,p99=1.003,p99.9=1.003
latency_percentiles_usec_ping:p50=1.003,p99=1.003,p99.9=1.003
latency_percentiles_usec_replconf:p50=1.003,p99=144.383,p99.9=144.383
latency_percentiles_usec_psync:p50=428.031,p99=428.031,p99.9=428.031
latency_percentiles_usec_xgroup|create:p50=31.103,p99=31.103,p99.9=31.103
latency_percentiles_usec_config|get:p50=10.047,p99=10.047,p99.9=10.047

# Cluster
cluster_enabled:1

# Keyspace

# Cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
cluster_current_epoch:5
cluster_my_epoch:1
cluster_stats_messages_ping_sent:417
cluster_stats_messages_pong_sent:408
cluster_stats_messages_meet_sent:5
cluster_stats_messages_sent:830
cluster_stats_messages_ping_received:408
cluster_stats_messages_pong_received:425
cluster_stats_messages_received:833
total_cluster_links_buffer_limit_exceeded:0

------ CLUSTER NODES OUTPUT ------
aa041b3e6b9a353924d0b835de8c26adcf2831eb 127.0.0.1:21146@31146,,tls-port=0,shard-id=de8469270f8cce919fed047118f77390f36d9d93 master - 0 1716881652201 0 connected 10923-16383
66508f3d9c231c1e576a2b4ecd2816d528480128 127.0.0.1:21147@31147,,tls-port=0,shard-id=ce3d6d295f0eadcebde81be7e1580fbe029b05d0 master - 0 1716881652201 3 connected 5462-10922
4b7e9dce4c5397c9256a340d0b8f922843238030 127.0.0.1:21148@31148,,tls-port=0,shard-id=7de919c9140f9168fd9ff32199d3b169b9e0e613 myself,master - 0 0 1 connected 0-5461 [609->-66508f3d9c231c1e576a2b4ecd2816d528480128]
1e33775fc8c3815341f525cbf8db4b3c5a16a18b 127.0.0.1:21145@31145,,tls-port=0,shard-id=7de919c9140f9168fd9ff32199d3b169b9e0e613 slave 4b7e9dce4c5397c9256a340d0b8f922843238030 0 1716881652201 1 connected
4eec7f352e00a438c06e223de89059211cfe57ec 127.0.0.1:21144@31144,,tls-port=0,shard-id=ce3d6d295f0eadcebde81be7e1580fbe029b05d0 slave 66508f3d9c231c1e576a2b4ecd2816d528480128 0 1716881652201 3 connected
374b2ea7996d67cb90d45ad6d3433c17d079e263 127.0.0.1:21143@31143,,tls-port=0,shard-id=de8469270f8cce919fed047118f77390f36d9d93 slave aa041b3e6b9a353924d0b835de8c26adcf2831eb 0 1716881652201 0 connected

------ CLIENT LIST OUTPUT ------
id=4 addr=127.0.0.1:48398 laddr=127.0.0.1:21148 fd=12 name= age=11 idle=0 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=69 qbuf-free=20405 argv-mem=6 multi-mem=0 rbs=1024 rbp=796 obl=5 oll=0 omem=0 tot-mem=22422 events=r cmd=migrate user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=6728 tot-net-out=277760 tot-cmds=223
id=5 addr=127.0.0.1:48410 laddr=127.0.0.1:21148 fd=27 name= age=11 idle=0 flags=S db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=20474 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=1 omem=20504 tot-mem=42904 events=r cmd=replconf user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=638 tot-net-out=263 tot-cmds=17
id=6 addr=127.0.0.1:38472 laddr=127.0.0.1:21148 fd=28 name= age=1 idle=1 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=20474 argv-mem=0 multi-mem=0 rbs=1024 rbp=26 obl=26 oll=0 omem=0 tot-mem=22400 events=r cmd=xreadgroup user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=106 tot-net-out=0 tot-cmds=0

------ CURRENT CLIENT INFO ------
id=6 addr=127.0.0.1:38472 laddr=127.0.0.1:21148 fd=28 name= age=1 idle=1 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=20474 argv-mem=0 multi-mem=0 rbs=1024 rbp=26 obl=26 oll=0 omem=0 tot-mem=22400 events=r cmd=xreadgroup user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=106 tot-net-out=0 tot-cmds=0
argc: '0'

------ MODULES INFO OUTPUT ------

------ CONFIG DEBUG OUTPUT ------
lazyfree-lazy-expire no
lazyfree-lazy-server-del no
client-query-buffer-limit 1gb
activedefrag no
list-compress-depth 0
lazyfree-lazy-eviction no
slave-read-only yes
lazyfree-lazy-user-del no
lazyfree-lazy-user-flush no
io-threads 1
replica-read-only yes
sanitize-dump-payload no
repl-diskless-load disabled
repl-diskless-sync yes
io-threads-do-reads no
proto-max-bulk-len 512mb

------ FAST MEMORY TEST ------
13788:M 28 May 2024 07:34:12.294 # Bio worker thread #0 terminated
13788:M 28 May 2024 07:34:12.294 # Bio worker thread #1 terminated
13788:M 28 May 2024 07:34:12.294 # Bio worker thread #2 terminated
*** Preparing to test memory region 8f9000 (2306048 bytes)
*** Preparing to test memory region 13a4000 (135168 bytes)
*** Preparing to test memory region 7f36b85fd000 (10485760 bytes)
*** Preparing to test memory region 7f36b8ffe000 (10485760 bytes)
*** Preparing to test memory region 7f36b99ff000 (10485760 bytes)
*** Preparing to test memory region 7f36ba400000 (10485760 bytes)
*** Preparing to test memory region 7f36bae00000 (6291456 bytes)
*** Preparing to test memory region 7f36bb5b1000 (2621440 bytes)
*** Preparing to test memory region 7f36c2400000 (8388608 bytes)
*** Preparing to test memory region 7f36c2fff000 (16384 bytes)
*** Preparing to test memory region 7f36c321d000 (16384 bytes)
*** Preparing to test memory region 7f36c3b7b000 (28672 bytes)
*** Preparing to test memory region 7f36c3b92000 (4096 bytes)
.O.O.O.O.O.O.O.O.O.O.O.O.O
Fast memory test PASSED, however your memory can still be broken. Please run a memory test for several hours if possible.

=== VALKEY BUG REPORT END. Make sure to include from START to END. ===

       Please report the crash by opening an issue on github:

           https://github.com/valkey-io/valkey/issues

  If a module was involved, please open in the module's repo instead.

  Suspect RAM error? Use valkey-server --test-memory to verify it.

  Some other issues could be detected by valkey-server --check-system
@nitaicaro
Contributor Author

Handling in #526

@madolson
Member

Resolved in #526, but was not attached.
