Build cancel and engine delete not working #462
This was fixed by 2 things:
Why would that prevent jobs from being canceled or engines from being deleted? I can make new jobs, so the database connection is firm. Once I restarted, everything was working again. What could it be?
I suspect this may be connected to the error I mentioned regarding the JSON parsing in the monitor service (#450). Sorry, I'm realizing the trace isn't there but rather in the Slack chat. If you peek at the logs, you'll see tons of the same error.
The ClearML containers being down was a red herring. QA looks at the production queue, not my local queue. The cancel and delete failures on QA were real though, and so were the 504 error messages, specifically for cancel job and delete engine. What could it be? My best guess right now is that the S3 bucket somehow became unavailable, which would account for being able to create a new engine but not cancel a job, since canceling entails clearing data off of the S3 bucket. I can't figure out any more about it though. It obviously got into a bad state that was fixed by restarting. Also, the S3 bucket was still available during that time. I don't know.
This isn't related to the |
And was it with all engines? Previously, they were getting this behavior only with a specific engine when starting a build (#450).
I only tried one. I am OK letting it slide for now.
I would guess that this has something to do with the distributed reader-writer lock that is used for engines. If the lock isn't properly released, every later operation that needs it (such as cancel or delete) will deadlock.
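To make the suspected failure mode concrete, here is a minimal in-process sketch. It is only an illustration: Python's `threading.Lock` stands in for the distributed engine lock, and `start_build_leaky` and `cancel_build` are hypothetical names, not functions from this project.

```python
import threading

# Assumption: an in-process lock stands in for the distributed engine lock.
engine_lock = threading.Lock()


def start_build_leaky():
    """Acquire the lock, then fail before releasing it (no try/finally)."""
    engine_lock.acquire()
    raise RuntimeError("build failed before the lock was released")


def cancel_build(timeout_s=0.1):
    """Return True if the lock was obtained, False if the wait timed out."""
    if not engine_lock.acquire(timeout=timeout_s):
        return False  # comparable to the request timing out at the gateway
    try:
        return True
    finally:
        engine_lock.release()


try:
    start_build_leaky()
except RuntimeError:
    pass  # the error is handled, but the lock is still held

cancel_blocked = not cancel_build()
```

After the leaky call, `cancel_blocked` is true and stays true until the process restarts, which would match cancel and delete failing while operations that do not need the lock keep working.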
Good guess. That makes the most sense from all of the things we have seen. It would specifically account for:
If this is the case, what is the best resolution? Check all the locks and make sure that there is a "finally"? Put a timeout on the locks?
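Both options can be combined. The following is a rough sketch under stated assumptions, not the project's lock: `LeaseLock` is a hypothetical in-process class in which a hold expires after `lease_s` seconds, waiting is bounded by `wait_s`, and release always happens in a `finally`. A real distributed version would keep the owner token and expiry in the shared store.

```python
import itertools
import threading
import time
from contextlib import contextmanager


class LeaseLock:
    """Hypothetical lock whose hold expires after lease_s seconds."""

    def __init__(self, lease_s):
        self._lease_s = lease_s
        self._guard = threading.Lock()
        self._tokens = itertools.count(1)
        self._owner = None
        self._expires_at = 0.0

    def try_acquire(self):
        """Return an owner token, or None if a live lease is held."""
        now = time.monotonic()
        with self._guard:
            if self._owner is not None and now < self._expires_at:
                return None
            self._owner = next(self._tokens)
            self._expires_at = now + self._lease_s
            return self._owner

    def release(self, token):
        """Release only if the caller still owns the lease."""
        with self._guard:
            if self._owner == token:
                self._owner = None

    @contextmanager
    def hold(self, wait_s):
        """Wait at most wait_s for the lock; always release on exit."""
        deadline = time.monotonic() + wait_s
        token = self.try_acquire()
        while token is None:
            if time.monotonic() >= deadline:
                raise TimeoutError("could not acquire engine lock")
            time.sleep(0.01)
            token = self.try_acquire()
        try:
            yield
        finally:
            self.release(token)


lock = LeaseLock(lease_s=0.05)

# Case 1: the body fails, but the finally block releases the lock.
try:
    with lock.hold(wait_s=0.01):
        raise RuntimeError("build failed")
except RuntimeError:
    pass
released_after_error = lock.try_acquire() is not None

# Case 2: the acquire above is never released, but its lease expires.
time.sleep(0.1)
recovered_after_expiry = lock.try_acquire() is not None
```

The `finally` covers ordinary exceptions, while the lease covers a holder that crashes outright. The trade-off is that a lease shorter than the longest legitimate hold lets a second caller in early, so long-running holders would need to renew it.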
This is a duplicate of #464.
It times out and then gives this error: