
100% CPU for week #484

Closed
johnml1135 opened this issue Sep 10, 2024 · 7 comments

@johnml1135 (Collaborator)

What caused it?

Lots of DB queries. Could be locks or hangfire monitoring or something else.
[Screenshots attached: "Screenshot 2024-09-10 9 35 04 AM", "image (1)"]

johnml1135 self-assigned this Sep 10, 2024
johnml1135 added the bug label Sep 10, 2024
@johnml1135 (Collaborator, Author)

Prod_machine.locks: 95/min
prod_machine_jobs.hangfire.lock: 40/min
prod_serval_jobs.hangfire.lock: 40/min

@johnml1135 (Collaborator, Author) commented Sep 13, 2024

So, it appears that a requested word graph is making the CPU go crazy indefinitely...

[Six screenshots attached]

@johnml1135 (Collaborator, Author)

So there is a stale lock that has been there for over a day, but not on the engine of interest. There was a 499 (operation canceled; a timeout?) for GetWordGraph right before it all started.

After restarting the engines, it all went back to normal.

The locks are the ones doing something weird: 600 commands per second on the production locks...

What is happening?

johnml1135 assigned ddaspit and unassigned johnml1135 Sep 13, 2024
@ddaspit (Contributor) commented Sep 13, 2024

From my investigation, it looks like there are a lot of commands being run, but I can't determine what the commands are. The translation engine whose call was canceled doesn't seem to exist in the database. I can't find any way that the current lock implementation would fire off so many commands. The recent bug in the lock that I fixed (PR #486) could have something to do with this. Without any more information, I am out of ideas.
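
As an aside not from the thread: one way to see what the in-flight commands are is MongoDB's `currentOp` admin command. A minimal C# sketch, assuming the official MongoDB.Driver package; the connection string and the "locks" namespace filter are illustrative, not the production values:

```csharp
// Hedged diagnostic sketch: list in-flight operations and show which
// ones target a "locks" collection. Connection string is an assumption.
using System;
using MongoDB.Bson;
using MongoDB.Driver;

var client = new MongoClient("mongodb://localhost:27017");
var admin = client.GetDatabase("admin");

// currentOp reports the operations currently executing on the server.
BsonDocument result = admin.RunCommand<BsonDocument>(new BsonDocument("currentOp", 1));

foreach (BsonValue op in result["inprog"].AsBsonArray)
{
    BsonDocument doc = op.AsBsonDocument;
    // "ns" is the namespace (database.collection) the operation targets.
    if (doc.TryGetValue("ns", out BsonValue ns) && ns.AsString.Contains("locks"))
        Console.WriteLine($"{ns}: {doc.GetValue("command", new BsonDocument())}");
}
```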

@ddaspit (Contributor) commented Sep 13, 2024

I think I found one way that a lock could get into a state where it keeps hammering the database with attempts to acquire the lock. If there is:

  1. an expired reader or writer lock that hasn't been cleaned up, or
  2. a queued writer lock that hasn't been cleaned up,

and another call tries to acquire a reader or writer lock, then it will hammer the database in a loop (sketched below).

I'm not sure how condition 1 could happen after our recent changes. PR #486 should make it so that condition 2 can't happen.
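
To make the failure mode concrete, here is a minimal sketch of how such an acquire loop can hammer the database. All names are assumed for illustration; this is not the actual Serval/Machine lock code:

```csharp
// Illustrative sketch of the failure mode, with assumed names --
// not the actual Serval/Machine lock implementation.
using System.Threading;
using System.Threading.Tasks;

public class DbBackedReaderWriterLock
{
    public async Task AcquireWriterAsync(string lockId, CancellationToken ct)
    {
        while (true)
        {
            ct.ThrowIfCancellationRequested();

            // One database round trip per iteration: a conditional write that
            // succeeds only if no active (unexpired) lock is held and no other
            // writer is queued ahead of us.
            if (await TryAcquireInDbAsync(lockId, ct))
                return;

            // The failure mode from this issue: if an expired lock (condition 1)
            // or a stale queued writer (condition 2) is never cleaned up, the
            // conditional write fails forever. With no backoff or change
            // notification between attempts, every iteration is another DB
            // command -- a single stuck waiter can produce hundreds of
            // commands per second against the locks collection.
        }
    }

    private Task<bool> TryAcquireInDbAsync(string lockId, CancellationToken ct)
    {
        // Placeholder for a findAndModify-style conditional write. A real
        // implementation would also need to purge expired or stale entries
        // somewhere, or this loop never terminates.
        return Task.FromResult(false);
    }
}
```

The usual mitigations are a bounded backoff (or waiting on a change notification) between attempts, plus purging expired and stale lock entries as part of each attempt, so a stuck waiter polls at a bounded rate instead of spinning.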

@ddaspit (Contributor) commented Sep 14, 2024

I submitted a PR (#491) that might reduce the chances of this happening.

@johnml1135 (Collaborator, Author)

Let's say this is resolved unless it comes back.
