
[Bug]: moveToFailed causes Out of memory on millions of keys #2424

Closed

ivnnv opened this issue Feb 14, 2024 · 11 comments

Comments

@ivnnv

ivnnv commented Feb 14, 2024

Version

5.1.9

Platform

NodeJS

What happened?

We have a large server running a BullMQ instance that usually has millions of records in "waiting" or "prioritized" (jobs eventually get completed and deleted, but new ones are being created all the time in wait/prioritized).
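
For context, a minimal sketch of how the jobs are produced on our side (queue name, payload, and options here are illustrative assumptions, not our actual code):

```ts
import { Queue } from 'bullmq';

// Producer side: new jobs are added all the time, so "wait"/"prioritized"
// usually hold millions of entries until workers complete and remove them.
const queue = new Queue('{mexc}', {
  connection: { host: 'localhost', port: 6379 },
  prefix: 'bull',
});

await queue.add(
  'ranges',                      // hypothetical job name
  { symbol: 'ETH2/USDT' },       // hypothetical payload
  { priority: 1, removeOnComplete: true }
);
```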

It eventually crashes with this stack trace:

2024-02-13T14:46:24.551090338Z Error: Missing lock for job mexc_ETH2/USDT-ranges. delayed
2024-02-13T14:46:24.551157904Z at Scripts.finishedErrors (/usr/src/app/node_modules/.pnpm/bullmq@5.1.9/node_modules/bullmq/dist/cjs/classes/scripts.js:274:24)
2024-02-13T14:46:24.551163664Z at Job.moveToFailed (/usr/src/app/node_modules/.pnpm/bullmq@5.1.9/node_modules/bullmq/dist/cjs/classes/job.js:435:32)
2024-02-13T14:46:24.551168082Z at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
2024-02-13T14:46:24.551172119Z at async handleFailed (/usr/src/app/node_modules/.pnpm/bullmq@5.1.9/node_modules/bullmq/dist/cjs/classes/worker.js:374:21)
2024-02-13T14:46:24.551176227Z at async Worker.retryIfFailed (/usr/src/app/node_modules/.pnpm/bullmq@5.1.9/node_modules/bullmq/dist/cjs/classes/worker.js:578:24)
2024-02-13T14:46:25.408743975Z ReplyError: ERR Error running script (call to 3567ef34749b61f3b833b75aa2bbc3c53a28cc9f): @user_script:112: Out of memory
2024-02-13T14:46:25.408786625Z at parseError (/usr/src/app/node_modules/.pnpm/redis-parser@3.0.0/node_modules/redis-parser/lib/parser.js:179:12)
2024-02-13T14:46:25.408791594Z at parseType (/usr/src/app/node_modules/.pnpm/redis-parser@3.0.0/node_modules/redis-parser/lib/parser.js:302:14) {
2024-02-13T14:46:25.408795812Z command: {
2024-02-13T14:46:25.408805259Z name: 'evalsha',
2024-02-13T14:46:25.408809637Z args: [
2024-02-13T14:46:25.408813224Z '3567ef34749b61f3b833b75aa2bbc3c53a28cc9f',
2024-02-13T14:46:25.408816660Z '11',
2024-02-13T14:46:25.408819957Z 'bull:{mexc}:wait',
2024-02-13T14:46:25.408823583Z 'bull:{mexc}:active',
2024-02-13T14:46:25.408827300Z 'bull:{mexc}:prioritized',
2024-02-13T14:46:25.408830837Z 'bull:{mexc}:events',
2024-02-13T14:46:25.408834253Z 'bull:{mexc}:stalled',
2024-02-13T14:46:25.408838040Z 'bull:{mexc}:limiter',
2024-02-13T14:46:25.408841497Z 'bull:{mexc}:delayed',
2024-02-13T14:46:25.408872915Z 'bull:{mexc}:paused',
2024-02-13T14:46:25.408877093Z 'bull:{mexc}:meta',
2024-02-13T14:46:25.408880599Z 'bull:{mexc}:pc',
2024-02-13T14:46:25.408884196Z 'bull:{mexc}:marker',
2024-02-13T14:46:25.408887703Z 'bull:{mexc}:',
2024-02-13T14:46:25.408891159Z '1707835585308',
2024-02-13T14:46:25.408894816Z <Buffer de 00 03 a5 74 6f 6b 65 6e d9 27 65 64 63 62 62 63 39 65 2d 31 35 34 31 2d 34 30 63 35 2d 61 62 34 34 2d 62 66 32 64 39 30 39 31 63 65 65 33 3a 39 31 ... 44 more bytes>
2024-02-13T14:46:25.408899645Z ]
2024-02-13T14:46:25.408903092Z }
2024-02-13T14:46:25.408906718Z }

Trying to follow the rabbit hole (and correct me if I'm wrong), it appears that many of the scripts load the whole set of keys into memory, instead of just taking the specific job entry from the active set and swapping it to failed. Obviously this is just my initial impression, as I didn't dig too much into the code.

Anyway, long story short: it seems that moveToFailed causes Out of Memory on a server with millions of keys (even though the server has lots of memory available for both Redis and Node).
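
For completeness, the path in the stack trace is the worker's failure handling: when a processor throws, the worker ends up calling job.moveToFailed internally (handleFailed in the trace above). A minimal sketch of a processor exercising that path (the thrown error and condition are made up for illustration):

```ts
import { Worker } from 'bullmq';

const worker = new Worker(
  '{mexc}',
  async (job) => {
    // Any error thrown here makes the worker call job.moveToFailed(err, token)
    // internally, which is where the errors above surface.
    if (!job.data) {
      throw new Error('empty payload'); // hypothetical failure condition
    }
    // ...normal processing...
  },
  { connection: { host: 'localhost', port: 6379 } }
);

worker.on('failed', (job, err) => {
  console.error(`job ${job?.id} failed: ${err.message}`);
});
```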

How to reproduce.

No response

Relevant log output

No response

@ivnnv ivnnv added the bug Something isn't working label Feb 14, 2024
@manast
Contributor

manast commented Feb 14, 2024

The script that is raising the error seems to be moveToActive, as the number of keys (11) only matches that script. As only one script can be executed by Redis at any given time, I do not think the issue is the size of the keys; furthermore, they are just a few bytes in total, less than 1 KB in the example you sent.

I have checked line 112 in "rawScripts", however I do not see anything interesting in that line:
[screenshot: rawScripts/moveToActive-11.lua around line 112]

It would be interesting to check what is on that same line in your exact BullMQ version (you should search for rawScripts/moveToActive-11.lua).
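
If it helps, a quick way to confirm which raw script that sha corresponds to is to hash the generated Lua files and compare them against the sha from the error (the rawScripts path below is an assumption based on where you generate them):

```ts
import { createHash } from 'crypto';
import { readdirSync, readFileSync } from 'fs';
import { join } from 'path';

// Redis (and Dragonfly) identify a script by the SHA1 of its exact source,
// so hashing each generated raw script lets you match the sha in the error.
const target = '3567ef34749b61f3b833b75aa2bbc3c53a28cc9f';
const dir = './rawScripts'; // assumed output folder of generate:raw:scripts

for (const file of readdirSync(dir)) {
  const sha = createHash('sha1').update(readFileSync(join(dir, file))).digest('hex');
  if (sha === target) {
    console.log(`${file} matches ${target}`);
  }
}
```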

@manast manast added cannot reproduce and removed bug Something isn't working labels Feb 14, 2024
@ivnnv
Author

ivnnv commented Feb 14, 2024

Hi @manast thanks for your early reply.

A couple of things to add:

  1. We don't really have a rawScripts folder in our app's node_modules/bullmq; it's just /dist/cjs, and there's only /scripts there.

  2. When generating the rawScripts of bullmq 5.1.9 locally via generate:raw:scripts, it gives the same output as yours:
    [screenshot: locally generated rawScripts/moveToActive-11.lua, matching the above]

  3. Regarding "only one script can be executed by Redis at any given time": actually... we are using Dragonfly, so I don't know if this could be a thing to consider (we are happy with Dragonfly overall because it's definitely faster than "real" Redis).

@manast
Contributor

manast commented Feb 14, 2024

Ok, if using Dragonfly it could run the same script in parallel, but even in this case it will not be more than the number of threads used by Dragonfly, which would still be a relatively low number (as it normally matches the number of CPU cores). I will try to bring somebody from Dragonfly in to see if they can shed some light on what could cause this out-of-memory.

@chakaz

chakaz commented Feb 14, 2024

Hey there! I'm Shahar from the Dragonfly team :)
Indeed the error Error running script comes from Dragonfly. @ivnnv you mentioned that "even if the server has lots of memory available for both Redis and Node", so I wonder why Dragonfly would disagree. Can you kindly share the output of INFO MEMORY (or better yet, INFO ALL)? Is the max memory there similar to what exists on the server? Do you pass --maxmemory to Dragonfly? If so, with what value?
Also, how are you running Dragonfly - as a container, or standalone?

@ivnnv
Author

ivnnv commented Feb 15, 2024

Hi @chakaz! Thanks for jumping in.

I'll try to give you as much detail as possible.

Yes, our Dragonfly is dockerized; we run it in an orchestrated environment on a bare-metal machine with 32 GB of RAM.
This is its INFO ALL:

# Server
redis_version:6.2.11
dragonfly_version:df-v1.14.1
redis_mode:cluster
arch_bits:64
os:Linux 5.15.0-94-generic x86_64
multiplexing_api:epoll
tcp_port:6379
thread_count:6
uptime_in_seconds:1175
uptime_in_days:0

# Clients
connected_clients:22
client_read_buffer_bytes:187648
blocked_clients:3
dispatch_queue_entries:2

# Memory
used_memory:2304862600
used_memory_human:2.15GiB
used_memory_peak:2304862408
used_memory_peak_human:2.15GiB
used_memory_rss:2357891072
used_memory_rss_human:2.20GiB
used_memory_peak_rss:2418180096
comitted_memory:2398601216
maxmemory:8589934592
maxmemory_human:8.00GiB
object_used_memory:2900325848
type_used_memory_STRING:96305448
type_used_memory_LIST:648
type_used_memory_ZSET:953251152
type_used_memory_HASH:1850768600
table_used_memory:135829312
num_buckets:234976
num_entries:2993237
inline_keys:7
listpack_blobs:5
listpack_bytes:120
small_string_bytes:96305448
pipeline_cache_bytes:4161568
dispatch_queue_bytes:1440
dispatch_queue_subscriber_bytes:0
dispatch_queue_peak_bytes:34750756
client_read_buffer_peak_bytes:187648
cache_mode:store
maxmemory_policy:noeviction

# Stats
total_connections_received:64
total_commands_processed:34712899
instantaneous_ops_per_sec:1505
total_pipelined_commands:3029612
pipelined_latency_usec:7705662108373
total_net_input_bytes:1879786720
total_net_output_bytes:146235991
instantaneous_input_kbps:-1
instantaneous_output_kbps:-1
rejected_connections:-1
expired_keys:1767
evicted_keys:0
hard_evictions:0
garbage_checked:20
garbage_collected:0
bump_ups:0
stash_unloaded:7
oom_rejections:0
traverse_ttl_sec:0
delete_ttl_sec:0
keyspace_hits:6697340
keyspace_misses:3071911
keyspace_mutations:21843932
total_reads_processed:855763
total_writes_processed:653409
defrag_attempt_total:0
defrag_realloc_total:0
defrag_task_invocation_total:0
reply_count:653409
reply_latency_usec:11297126
reply_batch_count:6639501
reply_batch_latency_usec:104597

# Tiered
tiered_entries:0
tiered_bytes:0
tiered_reads:0
tiered_writes:0
tiered_reserved:0
tiered_capacity:0
tiered_aborted_writes:0
tiered_flush_skipped:0
tiered_throttled_writes:0

# Persistence
last_success_save:1707969183
last_saved_file:
last_success_save_duration_sec:0
loading:0
saving:0
current_save_duration_sec:0
rdb_changes_since_last_success_save:21774954
last_failed_save:0
last_error:
last_failed_save_duration_sec:0

# Transaction
tx_type_cnt:global=106630,normal=41657,ooo=0,quick=362869,inline=110410
tx_width_freq:w1=514936,w6=106630
eval_io_coordination_total:3094293
eval_shardlocal_coordination_total:0
eval_squashed_flushes:0
tx_schedule_cancel_total:757
tx_queue_len:2
multi_squash_execution_total:4
multi_squash_execution_hop_usec:667
multi_squash_execution_reply_usec:0

# Replication
role:master
connected_slaves:0
master_replid:6384a2d091723f63e7176dd182ccae8feeaae990

# Commandstats
cmdstat_ACL WHOAMI:calls=1,usec=30,usec_per_call=30
cmdstat_AUTH:calls=27,usec=629,usec_per_call=23.2963
cmdstat_BZPOPMIN:calls=479,usec=2291945763,usec_per_call=4.78486e+06
cmdstat_CLIENT:calls=7,usec=158,usec_per_call=22.5714
cmdstat_CLUSTER:calls=8,usec=696,usec_per_call=87
cmdstat_CONFIG:calls=2,usec=39,usec_per_call=19.5
cmdstat_DBSIZE:calls=1,usec=108,usec_per_call=108
cmdstat_DEL:calls=34272,usec=914976,usec_per_call=26.6975
cmdstat_EVAL:calls=34,usec=1033601,usec_per_call=30400
cmdstat_EVALSHA:calls=3094259,usec=1412715957,usec_per_call=456.56
cmdstat_EXEC:calls=3535,usec=443700494,usec_per_call=125516
cmdstat_EXISTS:calls=3086920,usec=871284955,usec_per_call=282.251
cmdstat_GET:calls=382384,usec=1383286047,usec_per_call=3617.53
cmdstat_HEXISTS:calls=3032521,usec=63144575,usec_per_call=20.8225
cmdstat_HGET:calls=3086668,usec=35327016,usec_per_call=11.445
cmdstat_HGETALL:calls=33009,usec=711861,usec_per_call=21.5657
cmdstat_HINCRBY:calls=65983,usec=1277644,usec_per_call=19.3632
cmdstat_HMGET:calls=32848,usec=755475,usec_per_call=22.9991
cmdstat_HMSET:calls=3025699,usec=42715207,usec_per_call=14.1175
cmdstat_HSET:calls=93797,usec=2730377,usec_per_call=29.1094
cmdstat_INCR:calls=6018368,usec=68066381,usec_per_call=11.3098
cmdstat_INFO:calls=50,usec=30752,usec_per_call=615.04
cmdstat_LLEN:calls=1646,usec=13372,usec_per_call=8.12394
cmdstat_LPUSH:calls=33009,usec=617193,usec_per_call=18.6977
cmdstat_LRANGE:calls=153,usec=2937,usec_per_call=19.1961
cmdstat_LREM:calls=32974,usec=784354,usec_per_call=23.787
cmdstat_MEMORY:calls=4,usec=1803,usec_per_call=450.75
cmdstat_MODULE:calls=1,usec=63,usec_per_call=63
cmdstat_MULTI:calls=3535,usec=1971,usec_per_call=0.557567
cmdstat_PEXPIRE:calls=1619,usec=27529,usec_per_call=17.0037
cmdstat_PING:calls=8,usec=123,usec_per_call=15.375
cmdstat_PTTL:calls=5363,usec=107528,usec_per_call=20.05
cmdstat_ROLE:calls=1,usec=25,usec_per_call=25
cmdstat_RPOPLPUSH:calls=34307,usec=559596,usec_per_call=16.3114
cmdstat_SADD:calls=69,usec=2188,usec_per_call=31.7101
cmdstat_SCAN:calls=1,usec=308,usec_per_call=308
cmdstat_SCARD:calls=32848,usec=658286,usec_per_call=20.0404
cmdstat_SELECT:calls=21,usec=7576,usec_per_call=360.762
cmdstat_SET:calls=238615,usec=307486662,usec_per_call=1288.63
cmdstat_SMEMBERS:calls=153,usec=1891,usec_per_call=12.3595
cmdstat_SREM:calls=33051,usec=664297,usec_per_call=20.0991
cmdstat_TYPE:calls=103,usec=316,usec_per_call=3.06796
cmdstat_XADD:calls=6112025,usec=75977143,usec_per_call=12.4308
cmdstat_XTRIM:calls=33001,usec=665189,usec_per_call=20.1566
cmdstat_ZADD:calls=6018298,usec=69430438,usec_per_call=11.5366
cmdstat_ZCARD:calls=44,usec=1052185,usec_per_call=23913.3
cmdstat_ZPOPMIN:calls=34307,usec=759901,usec_per_call=22.15
cmdstat_ZRANGE:calls=1298,usec=16548,usec_per_call=12.7488
cmdstat_ZRANGEBYSCORE:calls=39922,usec=828728,usec_per_call=20.7587
cmdstat_ZREMRANGEBYSCORE:calls=32848,usec=610283,usec_per_call=18.579
cmdstat_ZREVRANGE:calls=16,usec=345158,usec_per_call=21572.4
cmdstat_ZREVRANGEBYSCORE:calls=32848,usec=658477,usec_per_call=20.0462

# Modules
module:name=ReJSON,ver=20000,api=1,filters=0,usedby=[search],using=[],options=[handle-io-errors]
module:name=search,ver=20000,api=1,filters=0,usedby=[],using=[ReJSON],options=[handle-io-errors]

# Search
search_memory:0
search_num_indices:0
search_num_entries:0

# Errorstats
-NOAUTH Authentication required.:37

# Keyspace
db0:keys=578,expires=0,avg_ttl=-1
db1:keys=2992659,expires=40,avg_ttl=-1

# Cpu
used_cpu_sys:327.137874
used_cpu_user:475.627063
used_cpu_sys_children:0.1176
used_cpu_user_children:0.0
used_cpu_sys_main_thread:7.915415
used_cpu_user_main_thread:6.173679

# Cluster
cluster_enabled:1

This is its CMD:
dragonfly --logtostderr --cluster_mode=emulated --lock_on_hashtags --default_lua_flags=allow-undeclared-keys --maxmemory=8G --proactor_threads=6 --dbnum=4 --requirepass=REDACTED --primary_port_http_enabled=false

PS: we need to use the extra allow-undeclared-keys flag in order to run the bull-arena UI.

We were running other queues in parallel that had already digested something like:
Queue1: 3M
Queue2: 1M
Queue3: 1M
Those ingestions were already processed, and those queues now only handle a few hundred jobs per minute each.

Then comes the trouble: we re-ran the whole thing with Queue4 "{mexc}", which quickly adds ~5M jobs, consumed at a rate of at most 35 per second. It runs OK for some time, like 60 or 90 minutes, and then suddenly breaks with the output above.
I don't really think it actually runs "out of memory" (in terms of the number of keys stored), because I've been able to process this same queue on another QA server with less RAM available (2.5 GB), and it didn't give this out-of-memory error after adding 6M+ keys (though it was the only queue being processed there).
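
For reference, a bulk ingestion like that could look roughly like this with addBulk (queue name, payload, and chunk size are assumptions, not our actual producer code):

```ts
import { Queue } from 'bullmq';

const queue = new Queue('{mexc}', { connection: { host: 'localhost', port: 6379 } });

// Hypothetical batched ingestion: push millions of jobs in chunks so the
// client never builds one giant command or holds every payload at once.
async function ingest(symbols: string[], chunkSize = 1000) {
  for (let i = 0; i < symbols.length; i += chunkSize) {
    const chunk = symbols.slice(i, i + chunkSize);
    await queue.addBulk(
      chunk.map((symbol) => ({
        name: 'ranges',
        data: { symbol },
        opts: { priority: 1, removeOnComplete: true },
      }))
    );
  }
}
```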

Hope this helps.

@ivnnv
Author

ivnnv commented Feb 15, 2024

Another log, this one from Dragonfly itself:

2024-02-15T04:27:01.664569457Z *** Check failure stack trace: ***
2024-02-15T04:27:01.679915590Z     @     0x55703c2c59f3  google::LogMessage::SendToLog()
2024-02-15T04:27:01.680639010Z     @     0x55703c2be1b7  google::LogMessage::Flush()
2024-02-15T04:27:01.681487296Z     @     0x55703c2bfb3f  google::LogMessageFatal::~LogMessageFatal()
2024-02-15T04:27:01.683921830Z     @     0x55703bbab3b9  dfly::BlockingController::NotifyWatchQueue()
2024-02-15T04:27:01.684895566Z     @     0x55703bbabc9d  dfly::BlockingController::NotifyPending()
2024-02-15T04:27:01.686295442Z     @     0x55703bbc63dd  dfly::Transaction::UnlockMultiShardCb()
2024-02-15T04:27:01.686667316Z     @     0x55703bbc6436  _ZNSt17_Function_handlerIFvvEZN4dfly11Transaction11UnlockMultiEvEUlvE_E9_M_invokeERKSt9_Any_data
2024-02-15T04:27:01.687693813Z     @     0x55703c0c9c8a  util::fb2::FiberQueue::Run()
2024-02-15T04:27:01.687990012Z     @     0x55703baf33ee  _ZN5boost7context6detail11fiber_entryINS1_12fiber_recordINS0_5fiberENS0_21basic_fixedsize_stackINS0_12stack_traitsEEEZN4util3fb26detail15WorkerFiberImplIZN4dfly11EngineShardC4EPNS9_12ProactorBaseEP9mi_heap_sEUlvE0_JEEC4IS7_EESt17basic_string_viewIcSt11char_traitsIcEERKNS0_12preallocatedEOT_OSI_EUlOS4_E_EEEEvNS1_10transfer_tE
2024-02-15T04:27:01.689653051Z     @     0x55703c0e2faf  make_fcontext
2024-02-15T04:27:01.690707041Z *** SIGABRT received at time=1707971221 on cpu 5 ***
2024-02-15T04:27:01.690989692Z PC: @     0x7fd127fb400b  (unknown)  raise
2024-02-15T04:27:01.691015914Z [failure_signal_handler.cc : 345] RAW: Signal 11 raised at PC=0x7fd127f93941 while already in AbslFailureSignalHandler()
2024-02-15T04:27:01.691025624Z *** SIGSEGV received at time=1707971221 on cpu 5 ***
2024-02-15T04:27:01.691177142Z PC: @     0x7fd127f93941  (unknown)  abort
2024-02-15T04:27:03.156042344Z I20240215 04:27:03.155915     1 init.cc:70] dragonfly running in opt mode.
2024-02-15T04:27:03.156108298Z I20240215 04:27:03.156023     1 dfly_main.cc:621] Starting dragonfly df-v1.14.1-25b2713d3a46ba76bf492f93e5f34c4f647855bf
2024-02-15T04:27:03.156119164Z * Logs will be written to the first available of the following paths:
2024-02-15T04:27:03.156140655Z /tmp/dragonfly.*
2024-02-15T04:27:03.156143928Z ./dragonfly.*
2024-02-15T04:27:03.156146446Z * For the available flags type dragonfly [--help | --helpfull]
2024-02-15T04:27:03.156148913Z * Documentation can be found at: https://www.dragonflydb.io/docs
2024-02-15T04:27:03.156359649Z W20240215 04:27:03.156292     1 dfly_main.cc:660] SWAP is enabled. Consider disabling it when running Dragonfly.
2024-02-15T04:27:03.156368354Z I20240215 04:27:03.156316     1 dfly_main.cc:684] Max memory limit is: 8.00GiB
2024-02-15T04:27:03.156461667Z W20240215 04:27:03.156426     1 dfly_main.cc:348] Weird error 1 switching to epoll
2024-02-15T04:27:03.235116328Z I20240215 04:27:03.234987     1 proactor_pool.cc:146] Running 6 io threads
2024-02-15T04:27:03.244391259Z I20240215 04:27:03.244307     1 server_family.cc:645] Host OS: Linux 5.15.0-94-generic x86_64 with 6 threads
2024-02-15T04:27:03.245171984Z I20240215 04:27:03.245110     1 snapshot_storage.cc:106] Load snapshot: Searching for snapshot in directory: "/data"
2024-02-15T04:27:03.246066631Z I20240215 04:27:03.245968     1 server_family.cc:808] Loading /data/dump-2024-02-12T21:13:19-summary.dfs
2024-02-15T04:27:03.260187552Z I20240215 04:27:03.260069     9 listener_interface.cc:101] sock[15] AcceptServer - listening on port 6379
2024-02-15T04:27:03.263555054Z I20240215 04:27:03.263476    12 server_family.cc:862] Load finished, num keys read: 0
2024-02-15T04:27:03.529000714Z I20240215 04:27:03.528834     8 version_monitor.cc:172] Your current version '1.14.1' is not the latest version. A newer version '1.14.3' is now available. Please consider an update.

@chakaz

chakaz commented Feb 15, 2024

  • Re/ the check fail - please update to 1.14.3, we fixed that bug there.
  • Re/ --allow-undeclared-keys - I don't know what bull arena is, but it's unfortunate to have to run Dragonfly with that flag, as it slows down all Lua scripts significantly (including those of BullMQ - as that's a global setting). I don't know if there's anything you can do about it, so just pointing that out :)
  • Re/ the OOM you experienced. I see that you pass in 8 GB, but on the other hand running the same load on another server used less than 2.5 GB. Do you happen to have a graph of Dragonfly's used memory around that time? We should only return OOM when we're close to max memory. (See the polling sketch below for one way to capture that.)
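
If you don't already have that graph, a minimal way to capture it is to poll INFO MEMORY periodically; a sketch with ioredis (connection details are assumptions):

```ts
import Redis from 'ioredis';

const redis = new Redis({ host: 'localhost', port: 6379, password: process.env.REDIS_PASSWORD });

// Sample used_memory vs maxmemory every 10 seconds so the values around the
// OOM can be plotted afterwards.
setInterval(async () => {
  const info = await redis.info('memory');
  const field = (name: string) => info.match(new RegExp(`^${name}:(\\d+)`, 'm'))?.[1];
  console.log(
    new Date().toISOString(),
    `used_memory=${field('used_memory')}`,
    `maxmemory=${field('maxmemory')}`
  );
}, 10_000);
```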

@manast
Contributor

manast commented Feb 15, 2024

  • Re/ --allow-undeclared-keys - I don't know what bull arena is, but it's unfortunate to have to run Dragonfly with that flag, as it slows down all Lua scripts significantly (including those of BullMQ - as that's a global setting). I don't know if there's anything you can do about it, so just pointing that out :)

@ivnnv I think it is worth creating an issue in Bull Arena's repo then, or using Taskforce.sh instead 😉

@ivnnv
Author

ivnnv commented Feb 15, 2024

@chakaz we have upgraded Dragonfly to the latest version, increased memory to 10 GB, and got rid of bull-arena (and removed the undeclared-keys flag) in favor of another UI that doesn't need extra flags. I'll report how it goes on the next massive job ingestion we do in a couple of days.

@manast thanks for the suggestion, and for sharing and maintaining Bull for years. We will definitely consider giving Taskforce a try once we make our first dollar out of this (still a side project of ours).

@chakaz

chakaz commented Feb 16, 2024

Best of luck in your endeavor, keep us posted :)

@ivnnv
Author

ivnnv commented Feb 19, 2024

Hi @chakaz and @manast, Dragonfly behaves pretty solidly after upgrading to its latest version, removing the undeclared-keys flag, and increasing the memory to 12 GB, so I think this ticket can now be closed. Thanks for jumping in!
