bug: nwaku node gets blocked because the postgres database is blocked #2783

Closed
Tracked by #3072
Ivansete-status opened this issue Jun 6, 2024 · 6 comments · Fixed by #2809 or #2887
Assignees: Ivansete-status
Labels: bug (Something isn't working) · critical (This issue needs critical attention) · effort/days (Estimated to be completed in a few days, less than a week)

Comments

@Ivansete-status
Collaborator

Problem

While QA was working on shards.test, they realized that Store queries didn't work.
After analyzing further, the whole nwaku node (store-01.do-ams3.shards.test.status.im) was completely stopped, most likely because the Postgres database was blocked.
We (@cammellos, @richard-ramos, @Ivansete-status) went to the Postgres server (store-db-01.do-ams3.shards.test.) to investigate.

Curiously, even a simple query such as SELECT * from messages limit 1; got blocked too.
We managed to make the nwaku node progress again after killing the blocked SELECT queries.
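For context (these are not the exact commands from this session), stuck backends can be killed from the Postgres side with the built-in admin functions. A minimal sketch, assuming the offenders are long-running SELECTs against messages:

```sql
-- Hypothetical sketch: terminate backends that have been stuck on a SELECT
-- against the messages table for more than 5 minutes. The filter and the
-- interval are illustrative assumptions, not the exact commands used above.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND query ILIKE 'select%messages%'
  AND now() - query_start > interval '5 minutes';
```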

Impact

The node gets completely blocked

To reproduce

We don't have a clear procedure to replicate the issue, but it happened in shards.test running version v0.28.1, while the node was receiving new messages and serving regular Store requests.

Screenshots/logs

Example of evidence showing how the nwaku node gets blocked:
(screenshot attached in the original issue)

nwaku version/commit hash

v0.28.0-2-ga96a6b94

Additional context

Discord thread: https://discord.com/channels/1110799176264056863/1246045563833815080

@Ivansete-status Ivansete-status added bug Something isn't working critical This issue needs critical attention labels Jun 6, 2024
@Ivansete-status Ivansete-status self-assigned this Jun 6, 2024
@gabrielmer gabrielmer added the effort/days Estimated to be completed in a few days, less than a week label Jun 6, 2024
@jm-clius
Contributor

jm-clius commented Jun 6, 2024

In case it's unclear why the node might be stuck, note that it is generally possible to use GDB on the binary - i.e. attach to the running (stalled) process and use GDB's bt or info threads commands to get some information on where the thread(s) are stuck.
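For example, roughly along these lines (the process name wakunode2 is an assumption; adjust it to whatever the binary is called on the host):

```sh
# Attach to the running (stalled) node; the process name is an assumption.
sudo gdb -p "$(pgrep -f wakunode2)"

# Inside GDB: dump the backtrace of every thread, then detach without killing the node.
(gdb) thread apply all bt
(gdb) info threads
(gdb) detach
(gdb) quit
```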

@Ivansete-status
Collaborator Author

> In case it's unclear why the node might be stuck, note that it is generally possible to use GDB on the binary - i.e. attach to the running (stalled) process and use GDB's bt or info threads commands to get some information on where the thread(s) are stuck.

Thanks for the comment! I'll do that next time.

For now:

I started stressing the /dns4/store-01.do-ams3.shards.staging.status.im/tcp/30303/p2p/16Uiu2HAm3xVDaz6SRJ6kErwC21zBJEZjavVXg7VSkoWzaV1aMA3F peer with @richard-ramos's tool (https://github.com/waku-org/message-finder), continuously retrieving the messages from the last 24 h.
In addition, I connected an nwaku node directly to that machine, which continuously sends random messages.
That should help us replicate blocking/slowness issues.

@Ivansete-status
Collaborator Author

After further analysis with @NagyZoltanPeter, 🙌 , we ran the following query:

SELECT pid, locktype, relation::regclass, mode, granted 
FROM pg_locks 
JOIN pg_stat_activity USING (pid) 
WHERE NOT granted AND mode LIKE '%xclusive%';

And saw that there were two processes waiting for an AccessExclusiveLock on the messages table:

 711447 | relation | messages | AccessExclusiveLock | f
 711019 | relation | messages | AccessExclusiveLock | f
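A complementary query (not part of the original session) can show directly which backend each waiter is queued behind, using Postgres's built-in pg_blocking_pids():

```sql
-- List waiting backends together with the PIDs blocking them and the
-- statement they are stuck on (PostgreSQL 9.6+).
SELECT pid, pg_blocking_pids(pid) AS blocked_by, state, query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```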

Then, from within the database docker container, we ran ps fax and saw two sessions from nwaku nodes waiting on a CREATE TABLE statement, i.e. trying to create a partition:

711019 postgres  0:00 postgres: nim-waku nim-waku 10.11.0.83(45532) CREATE TABLE waiting
711447 postgres  0:00 postgres: nim-waku nim-waku 10.11.0.82(44822) CREATE TABLE waiting

With that, we concluded that we need to strengthen the logic around partition creation so that multiple nodes can create partitions concurrently without blocking each other.

Another option would be to have a separate app dedicated to database maintenance (partition creation, database migrations, etc.), but at first we'll try to enhance the current approach.
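As a rough sketch of the "strengthen partition creation" direction (not the actual implementation in the linked PRs), one common Postgres pattern is to serialize partition creation across nodes with an advisory lock and make the statement itself idempotent:

```sql
-- Hypothetical sketch: each node runs this when it needs the next partition.
-- The advisory lock key (123456), the partition name and the bounds are
-- placeholders; the real partition key and bound type may differ.
BEGIN;
-- Only one session at a time gets past this line; the lock is released on COMMIT.
SELECT pg_advisory_xact_lock(123456);
-- Idempotent: a node arriving second simply finds the partition already created.
CREATE TABLE IF NOT EXISTS messages_2024_06_06 PARTITION OF messages
  FOR VALUES FROM ('2024-06-06 00:00:00+00') TO ('2024-06-07 00:00:00+00');
COMMIT;
```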

@richard-ramos
Member

I'm thinking that your PR #2784 should stop the issue from happening!

@NagyZoltanPeter
Contributor

> I'm thinking that your PR #2784 should stop the issue from happening!

It is, but with that change a failure to acquire the lock does not mean an error; it is just a sign that another node is attempting to create the necessary partition(s).
We also identified that the partitionManager's internal registry of partitions needs to be kept up to date, bearing in mind that other nodes can also perform such partition maintenance.
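In Postgres terms that maps to the non-blocking advisory-lock variant, which simply reports whether the lock was obtained (a minimal sketch; the key is a placeholder):

```sql
-- Returns true if this session got the lock, false if another node holds it.
-- A false result is not an error: the other node is creating the partition,
-- so this node can simply skip or retry later.
SELECT pg_try_advisory_lock(123456);
-- ... create the partition only if the call above returned true ...
SELECT pg_advisory_unlock(123456);
```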

@Ivansete-status
Collaborator Author

This is still happening in shards.test with nwaku version v0.29.0.
