bug: nwaku node gets blocked because the postgres database is blocked #2783

Closed
Tracked by #3072
Ivansete-status opened this issue Jun 6, 2024 · 6 comments · Fixed by #2809 or #2887
Assignees: Ivansete-status
Labels: bug (Something isn't working) · critical (This issue needs critical attention) · effort/days (Estimated to be completed in a few days, less than a week)

Comments

@Ivansete-status
Collaborator

Problem

While QA was working on shards.test, they realized that Store queries didn't work.
After analyzing further, the whole nwaku node (store-01.do-ams3.shards.test.status.im) was completely stopped, most likely because the Postgres database was blocked.
We (@cammellos, @richard-ramos, @Ivansete-status) went to the Postgres server (store-db-01.do-ams3.shards.test.) to investigate.

Curiously, even a simple query such as SELECT * from messages limit 1; got blocked too.
We managed to make the nwaku node progress again after killing the blocked SELECT queries.
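For context (these are not the exact commands from this session), stuck backends can be killed from the Postgres side with the built-in admin functions. A minimal sketch, assuming the offenders are long-running SELECTs against messages:

```sql
-- Hypothetical sketch: terminate backends that have been stuck on a SELECT
-- against the messages table for more than 5 minutes. The filter and the
-- interval are illustrative assumptions, not the exact commands used above.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND query ILIKE 'select%messages%'
  AND now() - query_start > interval '5 minutes';
```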

Impact

The node gets completely blocked

To reproduce

We don't have a clear procedure to replicate the issue, but it happened in shards.test running version v0.28.1, while the node was receiving new messages and serving regular Store requests.

Screenshots/logs

Example of evidence showing how the nwaku node gets blocked:
(screenshot attached in the original issue)

nwaku version/commit hash

v0.28.0-2-ga96a6b94

Additional context

Discord thread: https://discord.com/channels/1110799176264056863/1246045563833815080

@Ivansete-status Ivansete-status added bug Something isn't working critical This issue needs critical attention labels Jun 6, 2024
@Ivansete-status Ivansete-status self-assigned this Jun 6, 2024
@gabrielmer gabrielmer added the effort/days Estimated to be completed in a few days, less than a week label Jun 6, 2024
@jm-clius
Contributor

jm-clius commented Jun 6, 2024

In case it's unclear why the node might be stuck, note that it is generally possible to use GDB on the binary - i.e. attach to the running (stalled) process and use GDB's bt or info threads commands to get some information on where the thread(s) are stuck.
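For example, roughly along these lines (the process name wakunode2 is an assumption; adjust it to whatever the binary is called on the host):

```sh
# Attach to the running (stalled) node; the process name is an assumption.
sudo gdb -p "$(pgrep -f wakunode2)"

# Inside GDB: dump the backtrace of every thread, then detach without killing the node.
(gdb) thread apply all bt
(gdb) info threads
(gdb) detach
(gdb) quit
```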

@Ivansete-status
Collaborator Author

> In case it's unclear why the node might be stuck, note that it is generally possible to use GDB on the binary - i.e. attach to the running (stalled) process and use GDB's bt or info threads commands to get some information on where the thread(s) are stuck.

Thanks for the comment! I'll do that next time.

For now:

I started stressing the /dns4/store-01.do-ams3.shards.staging.status.im/tcp/30303/p2p/16Uiu2HAm3xVDaz6SRJ6kErwC21zBJEZjavVXg7VSkoWzaV1aMA3F peer with @richard-ramos's tool (https://github.com/waku-org/message-finder), continuously retrieving the messages from the last 24 h.
In addition, I connected an nwaku node directly to that machine, which continuously sends random messages.
That should help us replicate blocking/slowness issues.

@Ivansete-status
Collaborator Author

After further analysis with @NagyZoltanPeter, 🙌 , we ran the following query:

SELECT pid, locktype, relation::regclass, mode, granted 
FROM pg_locks 
JOIN pg_stat_activity USING (pid) 
WHERE NOT granted AND mode LIKE '%xclusive%';

And saw that there were two processes waiting for an AccessExclusiveLock on the messages table:

 711447 | relation | messages | AccessExclusiveLock | f
 711019 | relation | messages | AccessExclusiveLock | f
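A complementary query (not part of the original session) can show directly which backend each waiter is queued behind, using Postgres's built-in pg_blocking_pids():

```sql
-- List waiting backends together with the PIDs blocking them and the
-- statement they are stuck on (PostgreSQL 9.6+).
SELECT pid, pg_blocking_pids(pid) AS blocked_by, state, query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```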

Then, from within the database docker container, we ran ps fax and saw two sessions from nwaku nodes waiting on a CREATE TABLE statement, i.e. trying to create a partition:

711019 postgres  0:00 postgres: nim-waku nim-waku 10.11.0.83(45532) CREATE TABLE waiting
711447 postgres  0:00 postgres: nim-waku nim-waku 10.11.0.82(44822) CREATE TABLE waiting

With that, we concluded that we need to strengthen the logic around partition creation so that multiple nodes can create partitions concurrently without blocking each other.

Another option would be to have a separate app dedicated to database maintenance (partition creation, database migrations, etc.), but at first we'll try to enhance the current approach.
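As a rough sketch of the "strengthen partition creation" direction (not the actual implementation in the linked PRs), one common Postgres pattern is to serialize partition creation across nodes with an advisory lock and make the statement itself idempotent:

```sql
-- Hypothetical sketch: each node runs this when it needs the next partition.
-- The advisory lock key (123456), the partition name and the bounds are
-- placeholders; the real partition key and bound type may differ.
BEGIN;
-- Only one session at a time gets past this line; the lock is released on COMMIT.
SELECT pg_advisory_xact_lock(123456);
-- Idempotent: a node arriving second simply finds the partition already created.
CREATE TABLE IF NOT EXISTS messages_2024_06_06 PARTITION OF messages
  FOR VALUES FROM ('2024-06-06 00:00:00+00') TO ('2024-06-07 00:00:00+00');
COMMIT;
```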

@richard-ramos
Member

I'm thinking that your PR #2784 should stop the issue from happening!

@NagyZoltanPeter
Contributor

> I'm thinking that your PR #2784 should stop the issue from happening!

It is, but with that change a failure to acquire the lock does not mean an error; it is just a sign that another node is attempting to create the necessary partition(s).
We also identified that the partitionManager's internal registry of partitions needs to be kept up to date, bearing in mind that other nodes can also perform such partition maintenance.
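In Postgres terms that maps to the non-blocking advisory-lock variant, which simply reports whether the lock was obtained (a minimal sketch; the key is a placeholder):

```sql
-- Returns true if this session got the lock, false if another node holds it.
-- A false result is not an error: the other node is creating the partition,
-- so this node can simply skip or retry later.
SELECT pg_try_advisory_lock(123456);
-- ... create the partition only if the call above returned true ...
SELECT pg_advisory_unlock(123456);
```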

@Ivansete-status
Collaborator Author

This is still happening in shards.test with nwaku version v0.29.0.
