
100% CPU for week #484

Closed
johnml1135 opened this issue Sep 10, 2024 · 7 comments

@johnml1135 (Collaborator)

What caused it?

Lots of DB queries. Could be locks or hangfire monitoring or something else.
[Screenshots attached: "Screenshot 2024-09-10 9 35 04 AM", "image (1)"]

johnml1135 self-assigned this Sep 10, 2024
johnml1135 added the bug label Sep 10, 2024
@johnml1135 (Collaborator, Author)

Prod_machine.locks: 95/min
prod_machine_jobs.hangfire.lock: 40/min
prod_serval_jobs.hangfire.lock: 40/min

@johnml1135 (Collaborator, Author) commented Sep 13, 2024

So, it appears that a requested word graph is making the CPU go crazy indefinitely...

[Six screenshots attached]

@johnml1135 (Collaborator, Author)

So there is a stale lock that has been there for over a day, but not on the engine of interest. There was a 499 (operation canceled; a timeout?) for GetWordGraph right before it all started.

After restarting the engines, it all went back to normal.

The locks are the ones doing something weird: 600 commands per second on the production locks...

What is happening?

johnml1135 assigned ddaspit and unassigned johnml1135 Sep 13, 2024
@ddaspit (Contributor) commented Sep 13, 2024

From my investigation, it looks like there are a lot of commands being run, but I can't determine what the commands are. The translation engine whose call was canceled doesn't seem to exist in the database. I can't find any way that the current lock implementation would fire off so many commands. The recent bug in the lock that I fixed (PR #486) could have something to do with this. Without any more information, I am out of ideas.
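
As an aside not from the thread: one way to see what the in-flight commands are is MongoDB's `currentOp` admin command. A minimal C# sketch, assuming the official MongoDB.Driver package; the connection string and the "locks" namespace filter are illustrative, not the production values:

```csharp
// Hedged diagnostic sketch: list in-flight operations and show which
// ones target a "locks" collection. Connection string is an assumption.
using System;
using MongoDB.Bson;
using MongoDB.Driver;

var client = new MongoClient("mongodb://localhost:27017");
var admin = client.GetDatabase("admin");

// currentOp reports the operations currently executing on the server.
BsonDocument result = admin.RunCommand<BsonDocument>(new BsonDocument("currentOp", 1));

foreach (BsonValue op in result["inprog"].AsBsonArray)
{
    BsonDocument doc = op.AsBsonDocument;
    // "ns" is the namespace (database.collection) the operation targets.
    if (doc.TryGetValue("ns", out BsonValue ns) && ns.AsString.Contains("locks"))
        Console.WriteLine($"{ns}: {doc.GetValue("command", new BsonDocument())}");
}
```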

@ddaspit (Contributor) commented Sep 13, 2024

I think I found one way that a lock could get into a state where it keeps hammering the database with attempts to acquire the lock. If there is:

  1. an expired reader or writer lock that hasn't been cleaned up, or
  2. a queued writer lock that hasn't been cleaned up,

and another call tries to acquire a reader or writer lock, then it will hammer the database in a loop (sketched below).

I'm not sure how condition 1 could happen after our recent changes. PR #486 should make it so that condition 2 can't happen.
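
To make the failure mode concrete, here is a minimal sketch of how such an acquire loop can hammer the database. All names are assumed for illustration; this is not the actual Serval/Machine lock code:

```csharp
// Illustrative sketch of the failure mode, with assumed names --
// not the actual Serval/Machine lock implementation.
using System.Threading;
using System.Threading.Tasks;

public class DbBackedReaderWriterLock
{
    public async Task AcquireWriterAsync(string lockId, CancellationToken ct)
    {
        while (true)
        {
            ct.ThrowIfCancellationRequested();

            // One database round trip per iteration: a conditional write that
            // succeeds only if no active (unexpired) lock is held and no other
            // writer is queued ahead of us.
            if (await TryAcquireInDbAsync(lockId, ct))
                return;

            // The failure mode from this issue: if an expired lock (condition 1)
            // or a stale queued writer (condition 2) is never cleaned up, the
            // conditional write fails forever. With no backoff or change
            // notification between attempts, every iteration is another DB
            // command -- a single stuck waiter can produce hundreds of
            // commands per second against the locks collection.
        }
    }

    private Task<bool> TryAcquireInDbAsync(string lockId, CancellationToken ct)
    {
        // Placeholder for a findAndModify-style conditional write. A real
        // implementation would also need to purge expired or stale entries
        // somewhere, or this loop never terminates.
        return Task.FromResult(false);
    }
}
```

The usual mitigations are a bounded backoff (or waiting on a change notification) between attempts, plus purging expired and stale lock entries as part of each attempt, so a stuck waiter polls at a bounded rate instead of spinning.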

@ddaspit (Contributor) commented Sep 14, 2024

I submitted a PR (#491) that might reduce the chances of this happening.

@johnml1135 (Collaborator, Author)

Let's say this is resolved unless it comes back.
