High CPU usage in seed node on setzling #772

Closed
adaszko opened this issue Aug 25, 2021 · 12 comments · Fixed by #773

Comments

@adaszko
Contributor

adaszko commented Aug 25, 2021

I'm observing what looks like a livelock/high contention on setzling:

image

The radicle-bins git SHA it's happening on: d3f366ca0965d80e892d924613a715fc26fc3733. Granted, it's not master, but I've observed the same behavior before, when someone else's branch was deployed there.

Note that the PIDs in the screenshot above are readily correlated with the thread IDs in the stack traces: 29571, 29572, 29573, 29583.

If it helps, stack traces taken yesterday, when a different branch was deployed.

@kim
Contributor

kim commented Aug 25, 2021

This suggests that there are a lot of connections, is that the case?

@adaszko
Contributor Author

adaszko commented Aug 25, 2021

There's zero traffic on port 1234. I confirmed that by running tcpdump and sending a dummy handcrafted packet, which was detected. No other traffic has been observed on this port, all while CPU usage sits at 100% in those 4 threads.

image

It's the same on port 80. Zero traffic until I issue a request myself.

@kim
Contributor

kim commented Aug 25, 2021 via email

@adaszko
Contributor Author

adaszko commented Aug 25, 2021

> Are you saying it enters this state while not doing anything?

It seems so.

> Can this be reproduced?

It happens by itself several hours after startup.

> Is it specific to the environment?

So far I've only seen it on setzling. What's different about setzling is that it's currently running a seed compiled in debug mode, precisely to aid in tracking down issues like this one, although we can have debugging symbols in release mode too with a cargo setting.
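For reference, the cargo setting in question is presumably the `debug` key of the release profile. A minimal `Cargo.toml` fragment, assuming the seed is built with the default `release` profile (the thread does not say which profile or manifest is used):

```toml
# Assumption: the seed binary is built with the stock `release` profile.
[profile.release]
# Keep optimizations on, but emit full debug info so stack traces stay readable.
debug = true
```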

@kim
Contributor

kim commented Aug 25, 2021 via email

@kim
Contributor

kim commented Aug 25, 2021 via email

@adaszko
Contributor Author

adaszko commented Aug 25, 2021

> Turns out Dashmap is using a homebrewed RwLock with a spin lock which spins forever. A PR moving to parking_lot instead has not seen any activity for more than two months. So I guess we need a proper concurrent map. Any suggestions?

That's really surprising given the popularity of the crate. As for suggestions, it really depends on what the requirements on the concurrency semantics are. For instance, is the hashmap being used for blocking on a specific key until another thread inserts it, or are all the operations required to be lock-free (in the progress-guarantees sense, not in the sense of "not using locks")? If the latter, there's evmap, for instance.
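To illustrate why a spinning lock can pin a thread at 100% CPU with no traffic at all, here is a minimal sketch using only the standard library's `RwLock`, not Dashmap's actual lock. It assumes the problematic pattern is a write attempt made while a read guard on the same lock is still held; the function name `write_while_reading` is hypothetical. `try_write` reports the failure once, whereas a lock that spins would keep retrying the same acquisition indefinitely.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Returns `(blocked, free)`:
/// - `blocked`: the write attempt failed while a read guard was held
/// - `free`: the write attempt succeeded once the read guard was dropped
fn write_while_reading() -> (bool, bool) {
    let map: RwLock<HashMap<&str, u32>> = RwLock::new(HashMap::new());
    map.write().unwrap().insert("peer", 1);

    // Hold a read guard, comparable to keeping a reference obtained
    // from a concurrent map alive across another operation.
    let guard = map.read().unwrap();

    // A spinning lock would loop here forever, burning CPU.
    // `try_write` does not block; it simply returns an error.
    let blocked = map.try_write().is_err();

    // Releasing the read guard makes the write lock available again.
    drop(guard);
    let free = map.try_write().is_ok();

    (blocked, free)
}

fn main() {
    let (blocked, free) = write_while_reading();
    println!("write blocked while reading: {}", blocked);
    println!("write possible after drop: {}", free);
}
```

The usual remedy for this pattern is to keep guards short-lived: copy or clone what is needed out of the map and drop the guard before touching the map again.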

@kim
Contributor

kim commented Aug 25, 2021 via email

kim added a commit that referenced this issue Aug 25, 2021
This fixes a regression introduced in 760d310 (net: store the actual
connection in conntrack, 2021-07-29), which would cause Dashmap to
deadlock.

Fixes #772
Signed-off-by: Kim Altintop <kim@eagain.st>
@kim
Contributor

kim commented Aug 25, 2021

I am wondering btw where the async-io thread is coming from. Did we accidentally all the runtimes?

@kim
Contributor

kim commented Aug 26, 2021

> where the async-io thread is coming from

    ├── if-watch v0.2.2
    │   ├── async-io v1.6.0
    │   │   ├── concurrent-queue v1.2.2 (*)
    │   │   ├── futures-lite v1.12.0 (*)
    │   │   ├── libc v0.2.98
    │   │   ├── log v0.4.14 (*)
    │   │   ├── once_cell v1.8.0
    │   │   ├── parking v2.0.0
    │   │   ├── polling v2.1.0
    │   │   │   ├── cfg-if v1.0.0
    │   │   │   ├── libc v0.2.98
    │   │   │   └── log v0.4.14 (*)
    │   │   ├── slab v0.4.3
    │   │   ├── socket2 v0.4.0
    │   │   │   └── libc v0.2.98
    │   │   └── waker-fn v1.1.0
    │   ├── futures-lite v1.12.0 (*)
    │   ├── ipnet v2.3.1
    │   ├── libc v0.2.98
    │   └── log v0.4.14 (*)

@FintanH
Contributor

FintanH commented Aug 26, 2021

    ├── if-watch v0.2.2
    │   ├── async-io v1.6.0

yaaaaaakkkkkk

kim closed this as completed in 100aea2 Aug 26, 2021
@kim
Contributor

kim commented Aug 26, 2021

@adaszko feel free to reopen if this persists after applying #773
