
Why, in a Redis master-slave architecture, do the master and slave nodes have the same number of keys but different data sizes (used_memory), with the slave occupying more memory than the master? #12382

Open
klin111 opened this issue Jul 3, 2023 · 12 comments
Labels
state:to-be-closed requesting the core team to close the issue

Comments

@klin111

klin111 commented Jul 3, 2023

redis-5.0.12

slave

used_memory:11674502872
used_memory_human:10.87G
used_memory_rss:11916976128
used_memory_rss_human:11.10G
used_memory_peak:11674565992
used_memory_peak_human:10.87G
used_memory_peak_perc:100.00%
used_memory_overhead:42736840
used_memory_startup:1449864
used_memory_dataset:11631766032
used_memory_dataset_perc:99.65%
allocator_allocated:11674516248
allocator_active:11675152384
allocator_resident:11921485824
total_system_memory:17179869184
total_system_memory_human:16.00G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:12884901888
maxmemory_human:12.00G
maxmemory_policy:volatile-lru
allocator_frag_ratio:1.00
allocator_frag_bytes:636136
allocator_rss_ratio:1.02
allocator_rss_bytes:246333440
rss_overhead_ratio:1.00
rss_overhead_bytes:-4509696
mem_fragmentation_ratio:1.02
mem_fragmentation_bytes:242514280
mem_not_counted_for_evict:0
mem_replication_backlog:10485760
mem_clients_slaves:0
mem_clients_normal:66616
mem_aof_buffer:0
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:0


master

used_memory:11062165000
used_memory_human:10.30G
used_memory_rss:11444056064
used_memory_rss_human:10.66G
used_memory_peak:11076346928
used_memory_peak_human:10.32G
used_memory_peak_perc:99.87%
used_memory_overhead:43858826
used_memory_startup:1449864
used_memory_dataset:11018306174
used_memory_dataset_perc:99.62%
allocator_allocated:11062146648
allocator_active:11177414656
allocator_resident:11448696832
total_system_memory:17179869184
total_system_memory_human:16.00G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:12884901888
maxmemory_human:12.00G
maxmemory_policy:volatile-lru
allocator_frag_ratio:1.01
allocator_frag_bytes:115268008
allocator_rss_ratio:1.02
allocator_rss_bytes:271282176
rss_overhead_ratio:1.00
rss_overhead_bytes:-4640768
mem_fragmentation_ratio:1.03
mem_fragmentation_bytes:382010944
mem_not_counted_for_evict:0
mem_replication_backlog:10485760
mem_clients_slaves:16922
mem_clients_normal:1171712
mem_aof_buffer:0
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:0


@yossigo @oranagra @madolson

Sir, please help me

@sundb
Collaborator

sundb commented Jul 4, 2023

@klin111 I would like to confirm the following points:

  1. Whether the master and slave are fully synchronized; you can confirm this by checking that slave_repl_offset and master_repl_offset in INFO ALL are equal.
  2. Do the master and slave use the same config? In particular, settings like hash-max-listpack-entries, list-max-listpack-size, set-max-intset-entries, and similar.
  3. Perhaps you can provide more info through INFO ALL.
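
For example, both checks can be scripted from a shell against the two nodes; a minimal sketch (bash; the hostnames and port are placeholders, not taken from this issue):

```sh
# Compare replication offsets (they should match once the slave is fully synced).
redis-cli -h master-host -p 6666 INFO replication | grep master_repl_offset
redis-cli -h slave-host  -p 6666 INFO replication | grep -E '(slave|master)_repl_offset'

# Diff the effective runtime configuration of both nodes.
diff <(redis-cli -h master-host -p 6666 CONFIG GET '*' | paste - - | sort) \
     <(redis-cli -h slave-host  -p 6666 CONFIG GET '*' | paste - - | sort)
```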

@oranagra
Member

oranagra commented Jul 4, 2023

A few more things you can look into (in case it's not obvious from the INFO output):

  1. Randomly check the OBJECT ENCODING and MEMORY USAGE of some keys to see if they're similar.
  2. Compare the differences in MEMORY MALLOC-STATS; maybe it can teach us something in case we can't find any differences in any of the above.
  3. Upgrade; it could solve some bugs, but also note that MEMORY USAGE in that version isn't reporting the actual usage correctly for some types.
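
Point 1 can be scripted; a minimal sketch (hostnames, port and sample count are placeholders; on a cluster replica, key-addressed commands such as OBJECT ENCODING may need READONLY on the same connection to avoid a MOVED redirect, hence the piped command list):

```sh
# Sample a few keys on the master and compare encoding and full reported size on both nodes.
for i in 1 2 3 4 5; do
  key=$(redis-cli -h master-host -p 6666 RANDOMKEY)
  echo "== $key =="
  for host in master-host slave-host; do
    printf 'READONLY\nOBJECT ENCODING %s\nMEMORY USAGE %s SAMPLES 0\n' "$key" "$key" |
      redis-cli -h "$host" -p 6666
  done
done
```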

@klin111
Author

klin111 commented Jul 4, 2023

@klin111 I would like to confirm the following points:

  1. Whether the master and slave are fully synchronized; you can confirm this by checking that slave_repl_offset and master_repl_offset in INFO ALL are equal.
  2. Do the master and slave use the same config? In particular, settings like hash-max-listpack-entries, list-max-listpack-size, set-max-intset-entries, and similar.
  3. Perhaps you can provide more info through INFO ALL.
Thank you @sundb

  1. The master and slave are fully synchronized; slave_repl_offset and master_repl_offset in INFO ALL are equal.
  2. Special configuration:
    hash-max-ziplist-entries 512
    hash-max-ziplist-value 64
    list-max-ziplist-entries 512
    list-max-ziplist-value 64
    set-max-intset-entries 512
    stream-node-max-bytes 4096
    stream-node-max-entries 100
    zset-max-ziplist-entries 128
    zset-max-ziplist-value 64

The master and slave configuration files are the same.

  3. INFO ALL:

slave:

Server:

redis_version:5.0.12
redis_git_sha1:0
redis_git_dirty:0
redis_build_id:q23w4rq6e6accxxx
redis_mode:cluster
os:Linux 5.10.0-957.27.2.el7.x86_64 x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:4.8.5
process_id:306
run_id:90ui786t3cd40a1170aca41c414f1c2318dd2xxx
tcp_port:6666
uptime_in_seconds:110709
uptime_in_days:1
hz:10
configured_hz:10
lru_clock:10738468
executable:/redis/redis-5.0.12/bin/redis-server
config_file:/redis/conf/redis-cluster-9736.conf

Clients:
connected_clients:2
client_recent_max_input_buffer:2
client_recent_max_output_buffer:0
blocked_clients:0

Memory:
used_memory:11661178336
used_memory_human:10.86G
used_memory_rss:11928752128
used_memory_rss_human:11.11G
used_memory_peak:11677479408
used_memory_peak_human:10.88G
used_memory_peak_perc:0.9986
used_memory_overhead:42751200
used_memory_startup:1449864
used_memory_dataset:11618427136
used_memory_dataset_perc:0.9965
allocator_allocated:11661193496
allocator_active:11685838848
allocator_resident:11937169408
total_system_memory:17179869184
total_system_memory_human:16.00G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:12884901888
maxmemory_human:12.00G
maxmemory_policy:volatile-lru
allocator_frag_ratio:1
allocator_frag_bytes:24645352
allocator_rss_ratio:1.02
allocator_rss_bytes:251330560
rss_overhead_ratio:1
rss_overhead_bytes:-8417280
mem_fragmentation_ratio:1.02
mem_fragmentation_bytes:267616080
mem_not_counted_for_evict:0
mem_replication_backlog:10485760
mem_clients_slaves:0
mem_clients_normal:66616
mem_aof_buffer:0
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:0

Persistence:
loading:0
rdb_changes_since_last_save:15472063
rdb_bgsave_in_progress:0
rdb_last_save_time:1688349359
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:0
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0

Stats:
total_connections_received:7456
total_commands_processed:15501705
instantaneous_ops_per_sec:0
total_net_input_bytes:6776127688
total_net_output_bytes:25372153
instantaneous_input_kbps:0
instantaneous_output_kbps:0.03
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
expired_stale_perc:0
expired_time_cap_reached_count:0
evicted_keys:0
keyspace_hits:0
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0

Replication:
role:slave
master_host:192.168.1.6
master_port:6666
master_link_status:up
master_last_io_seconds_ago:3
master_sync_in_progress:0
slave_repl_offset:505862536870
slave_priority:100
slave_read_only:1
connected_slaves:0
master_replid:1q2w3a45qw1c667273f74017bd1b490f97b5dd64
master_replid2:0
master_repl_offset:505862536870
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:10000000
repl_backlog_first_byte_offset:505852536871
repl_backlog_histlen:10000000

CPU:
used_cpu_sys:316.20177
used_cpu_user:446.584073
used_cpu_sys_children:0
used_cpu_user_children:0

Commandstats:
cmdstat_slowlog:calls=7360,usec=98145,usec_per_call=13.33
cmdstat_ping:calls=11046,usec=4824,usec_per_call=0.44
cmdstat_config:calls=3700,usec=76317,usec_per_call=20.63
cmdstat_role:calls=1,usec=272,usec_per_call=272.00
cmdstat_select:calls=1,usec=4,usec_per_call=4.00
cmdstat_client:calls=2,usec=43,usec_per_call=21.50
cmdstat_cluster:calls=3700,usec=389207,usec_per_call=105.19
cmdstat_command:calls=2,usec=2737,usec_per_call=1368.50
cmdstat_dbsize:calls=1,usec=3,usec_per_call=3.00
cmdstat_memory:calls=4,usec=989,usec_per_call=247.25
cmdstat_hmset:calls=15472063,usec=90811447,usec_per_call=5.87
cmdstat_info:calls=3825,usec=416960,usec_per_call=109.01

Cluster:
cluster_enabled:1

Keyspace:
db0:keys=559999,expires=0,avg_ttl=0

master:

Server:

redis_version:5.0.12
redis_git_sha1:0
redis_git_dirty:0
redis_build_id:q23w4rq6e6accxxx
redis_mode:cluster
os:Linux 5.10.0-957.27.2.el7.x86_64 x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:4.8.5
process_id:2066
run_id:986t34e5eb58c2ab1fa8ab3daff13c5470b8bxxx
tcp_port:6666
uptime_in_seconds:21253950
uptime_in_days:245
hz:10
configured_hz:10
lru_clock:10738780
executable:/redis/redis-5.0.12/bin/redis-server
config_file:/redis/conf/redis-cluster-9736.conf

Clients:
connected_clients:46
client_recent_max_input_buffer:2
client_recent_max_output_buffer:0
blocked_clients:0

Memory:
used_memory:11061316392
used_memory_human:10.30G
used_memory_rss:11452391424
used_memory_rss_human:10.67G
used_memory_peak:11076346928
used_memory_peak_human:10.32G
used_memory_peak_perc:0.9986
used_memory_overhead:43643746
used_memory_startup:1449864
used_memory_dataset:11017672646
used_memory_dataset_perc:0.9962
allocator_allocated:11061701512
allocator_active:11185254400
allocator_resident:11461353472
total_system_memory:17179869184
total_system_memory_human:16.00G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:12884901888
maxmemory_human:12.00G
maxmemory_policy:volatile-lru
allocator_frag_ratio:1.01
allocator_frag_bytes:123552888
allocator_rss_ratio:1.02
allocator_rss_bytes:276099072
rss_overhead_ratio:1
rss_overhead_bytes:-8962048
mem_fragmentation_ratio:1.04
mem_fragmentation_bytes:390891656
mem_not_counted_for_evict:0
mem_replication_backlog:10485760
mem_clients_slaves:49694
mem_clients_normal:909500
mem_aof_buffer:0
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:

Persistence:
loading:0
rdb_changes_since_last_save:15472514
rdb_bgsave_in_progress:0
rdb_last_save_time:1688349498
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:126
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:510795776
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0

Stats:
total_connections_received:1448734
total_commands_processed:3916065741
instantaneous_ops_per_sec:37
total_net_input_bytes:1936505477657
total_net_output_bytes:1747515489466
instantaneous_input_kbps:83.5
instantaneous_output_kbps:132.47
rejected_connections:0
sync_full:2
sync_partial_ok:1
sync_partial_err:2
expired_keys:0
expired_stale_perc:0
expired_time_cap_reached_count:0
evicted_keys:0
keyspace_hits:307983688
keyspace_misses:1206819
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:326993
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0

Replication:
role:master
connected_slaves:1
slave0:ip=192.168.1.9,port=6666,state=online,offset=505862572596,lag=0
master_replid:1q2w3a45qw1c667273f74017bd1b490f97b5dd64
master_replid2:0
master_repl_offset:505862572674
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:10000000
repl_backlog_first_byte_offset:505852572675
repl_backlog_histlen:10000000

CPU:
used_cpu_sys:148122.931638
used_cpu_user:142004.190394
used_cpu_sys_children:25.132197
used_cpu_user_children:96.867451

Commandstats:
cmdstat_migrate:calls=24578,usec=53674786,usec_per_call=2183.85
cmdstat_slowlog:calls=1390648,usec=468395576,usec_per_call=336.82
cmdstat_dbsize:calls=1,usec=2,usec_per_call=2.00
cmdstat_hdel:calls=791,usec=6775,usec_per_call=8.57
cmdstat_config:calls=640094,usec=14817773,usec_per_call=23.15
cmdstat_info:calls=716934,usec=93677178,usec_per_call=130.66
cmdstat_command:calls=2,usec=6592,usec_per_call=3296.00
cmdstat_ping:calls=259,usec=171,usec_per_call=0.66
cmdstat_hmset:calls=3582434058,usec=21846202609,usec_per_call=6.10
cmdstat_scan:calls=132312,usec=2275971,usec_per_call=17.20
cmdstat_hmget:calls=305214177,usec=36356005414,usec_per_call=119.12
cmdstat_client:calls=256,usec=25635,usec_per_call=100.14
cmdstat_hget:calls=3730726,usec=48492378,usec_per_call=13.00
cmdstat_replconf:calls=21056259,usec=32084708,usec_per_call=1.52
cmdstat_auth:calls=9758,usec=9844,usec_per_call=1.01
cmdstat_psync:calls=3,usec=329172,usec_per_call=109724.00
cmdstat_cluster:calls=714885,usec=61717961,usec_per_call=86.33

Cluster:
cluster_enabled:1

Keyspace:
db0:keys=559999,expires=0,avg_ttl=0


@klin111
Author

klin111 commented Jul 4, 2023

A few more things you can look into (in case it's not obvious from the INFO output):

  1. Randomly check the OBJECT ENCODING and MEMORY USAGE of some keys to see if they're similar.
  2. Compare the differences in MEMORY MALLOC-STATS; maybe it can teach us something in case we can't find any differences in any of the above.
  3. Upgrade; it could solve some bugs, but also note that MEMORY USAGE in that version isn't reporting the actual usage correctly for some types.

Thank you @oranagra

  1. I randomly checked several keys; their underlying encoding and size are consistent between the master and the slave.
  2. MEMORY MALLOC-STATS gives too much data and I don't know how to analyze it. What data should I focus on?
  3. We won't upgrade for the time being; what version is recommended to upgrade to?

@oranagra
Member

oranagra commented Jul 4, 2023

Looking at the info you provided, I don't see such a big difference (the master uses 10.30GB and the slave uses 10.86GB).
As expected, there's a different number of clients connected to each, which results in a bigger used_memory_overhead on the master (by some 900KB).

Considering the difference is relatively small, I doubt we'll be able to spot anything in MALLOC-STATS.
I'd suggest upgrading to the latest in 7.0; there are a ton of bug fixes and optimizations that were applied since 5.0.
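
For reference, that overhead difference can be read straight from the INFO ALL dumps above (rough arithmetic over the numbers as posted):

```
master used_memory_overhead   43,643,746
slave  used_memory_overhead   42,751,200
difference                       892,546 bytes  (~0.9 MB)

client buffers: master 909,500 (mem_clients_normal) + 49,694 (mem_clients_slaves) = 959,194
                slave   66,616 (mem_clients_normal)
                difference ≈ 892,578 bytes, i.e. essentially the whole gap
```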

@klin111
Author

klin111 commented Jul 5, 2023

@oranagra
There is a difference of 572MB between the used_memory of the master instance and the slave instance. What causes the difference?

MALLOC-STATS has too much content; could you point out which parts are important to look at?

@oranagra
Member

oranagra commented Jul 5, 2023

Yes, I saw all that, and I commented that it's not a huge difference (some 500MB out of 10GB). In buggy scenarios, I've seen much more (like 200%).

In any case, I don't know how to find the cause of this; this old version doesn't expose any more information, and it is also somewhat likely that the problem was already solved anyway.
All I can do is suggest an upgrade.

@oranagra oranagra added the state:to-be-closed requesting the core team to close the issue label Jul 5, 2023
@klin111
Author

klin111 commented Jul 6, 2023

@oranagra
In which version is this bug fixed?
Which minor version of version 7 can be used in production?

@sundb
Collaborator

sundb commented Jul 6, 2023

@klin111 It's hard to tell from the available information whether this is due to some bug. Do you still see the difference after a restart?
It is recommended to upgrade to 6.2.12; 7.0.11 is also an option.

@oranagra
Member

oranagra commented Jul 6, 2023

The above is inaccurate or even incorrect.
Traditionally, the master is the one keeping the backlog and other replication overheads, which we can see in used_memory_overhead.
But also, since PSYNC2 (Redis 4.0), the slave keeps that backlog too (but not the slave buffers).

The argument about rehashing is valid, but at least in this case, not for the main dict, whose overhead is also included in used_memory_overhead and is similar on the master and slave.
Maybe it's a big dict inside some key (hash, set, or zset), but that should be visible with MEMORY USAGE.
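
One hedged way to check that: locate the biggest hashes on the master (for example with redis-cli --bigkeys) and compare the full reported size of those exact keys on both nodes. A minimal sketch; the hostnames, port and key name are placeholders, not taken from this issue:

```sh
# Find the largest keys per type on the master, note the biggest hash key names.
redis-cli -h master-host -p 6666 --bigkeys

# Compare the full (SAMPLES 0) reported size of one candidate key on both nodes.
# MEMORY USAGE is not redirected in cluster mode, so it runs directly on the replica.
for host in master-host slave-host; do
  echo "== $host =="
  redis-cli -h "$host" -p 6666 MEMORY USAGE some:big:hash SAMPLES 0
done
```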

@631086083

The above is inaccurate or even incorrect. Traditionally, the master is the one keeping the backlog and other replication overheads, which we can see in used_memory_overhead. But also, since PSYNC2 (Redis 4.0), the slave keeps that backlog too (but not the slave buffers).

The argument about rehashing is valid, but at least in this case, not for the main dict, whose overhead is also included in used_memory_overhead and is similar on the master and slave. Maybe it's a big dict inside some key (hash, set, or zset), but that should be visible with MEMORY USAGE.

Sorry for the wrong explanation.
Judging from the information so far, most of the keys in this Redis cluster are hashes, and it is very likely that the underlying hash table of some keys is still being rehashed; since rehashing only proceeds when a key is accessed, that leaves some extra memory in use.
What about adding a special task on the slave to rehash the key?

@oranagra
Member

What about adding a special task on the slave to rehash the key?

That's possible. I'm not sure how common it is for a key to grow past the rehash threshold and then become completely read-only.
Let's start by trying to prove this theory.
We can use DEBUG HTSTATS-KEY <key> to compare the dict hash-table sizes of a key on the master and on the slave.
Just be careful not to run it on hashes that are really big (it could hang for a while).
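
For example, a minimal sketch of that comparison (hostnames, port and key name are placeholders; DEBUG HTSTATS-KEY only reports stats for keys backed by a hashtable encoding):

```sh
# Dump per-key dict hash-table stats on both nodes and compare table sizes /
# rehashing state. DEBUG is not redirected in cluster mode, so it can be run
# directly on the replica.
for host in master-host slave-host; do
  echo "== $host =="
  redis-cli -h "$host" -p 6666 DEBUG HTSTATS-KEY some:big:hash
done
```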
