[WFLY-2741][WFLY-3058] Management op administrative cancellation and timeouts #6237
Build 3575 is now running using a merge of e3dd102
Build 3575 outcome was FAILURE using a merge of e3dd102. Build problems: Failed tests detected
retest this please
Build 3579 is now running using a merge of e3dd102
Build 3579 outcome was SUCCESS using a merge of e3dd102
It seems there is an intermittent issue with OperationTimeoutUnitTestCase. I've never had it fail locally. Investigating...
The OperationTimeoutUnitTestCase issue should be sorted now.
Build 3597 is now running using a merge of 9c2bb35
Build 3597 outcome was SUCCESS using a merge of 9c2bb35
retest this please
Build 3601 is now running using a merge of 9c2bb35
Build 3601 outcome was SUCCESS using a merge of 9c2bb35
retest this please
Build 3605 is now running using a merge of 9c2bb35
Build 3605 outcome was FAILURE using a merge of 9c2bb35. Build problems: Failed tests detected
Build 3614 is now running using a merge of 7feca6b
Build 3617 is now running using a merge of c6dd663
Build 3614 outcome was SUCCESS using a merge of c6dd663
Build 3617 outcome was FAILURE using a merge of c6dd663. Build problems: Failed tests detected
retest this please
Build 3619 is now running using a merge of c6dd663
http://brontes.lab.eng.brq.redhat.com/viewLog.html?buildId=15095&buildTypeId=WFPR ran cleanly. The previous failure ^^^ was an unrelated recurring issue.
@n1hility I've run the domain tests 11 times now on brontes since pushing my fix for the OperationCancellationTestCase that failed once above, and have also run it dozens of times on my local machines. There have been no further issues, so this is good to go.
retest this please. I'll go for one more run.
Build 3627 is now running using a merge of c6dd663
Build 3627 outcome was FAILURE using a merge of c6dd663. Build problems: Failed tests detected
Build 3639 is now running using a merge of 21b6439
Build 3639 outcome was SUCCESS using a merge of 21b6439
retest this please
Build 3645 is now running using a merge of 21b6439
Build 3738 outcome was SUCCESS using a merge of 9e02dac
Build 3743 is now running using a merge of 1323c6a
Build 3743 outcome was SUCCESS using a merge of 1323c6a
Build 3768 is now running using a merge of 0e24999
Build 3768 outcome was SUCCESS using a merge of 8fd184b
Build 3771 is now running using a merge of 8fd184b
Build 3771 outcome was SUCCESS using a merge of 8fd184b
Build 3786 is now running using a merge of 2f14574
Build 3786 outcome was SUCCESS using a merge of 2f14574
This is good to go. I've now tested this many times in a custom run on brontes. The couple of issues that popped up there, or in the pull player runs ^^^ over the last week, have been addressed.
Build 3803 is now running using a merge of a896634
Build 3803 outcome was SUCCESS using a merge of a896634
Build 3807 is now running using a merge of 8f22a37
Build 3807 outcome was SUCCESS using a merge of 8f22a37
…rocess connections
[WFLY-3091] Fix propagation of mgmt op cancellation to server update tasks
Tests of the above and of WFLY-3058
Build 3820 is now running using a merge of 61e9b19
Build 3820 outcome was FAILURE using a merge of 61e9b19. Build problems: Failed tests detected
Retest this please
Build 3836 is now running using a merge of 61e9b19
Build 3836 outcome was SUCCESS using a merge of 61e9b19
[WFLY-2741][WFLY-3058] Management op administrative cancellation and timeouts
…058"" This reverts commit 1acc7ae.
Supersedes #6206 by continuing on with further work related to operation cancellation.
See description of #6206 for discussion of the operation timeout piece (the 1st two commits).
This PR adds commits related to administratively canceling problematic ops.
Caveat: it is still not possible to force MSC stability if some service does not properly return from start/stop. See https://issues.jboss.org/browse/MSC-143. However, the cancellation fixes included here mean the Host Controller (HC) processes no longer block waiting for servers to stabilize, which makes it possible for an admin to take corrective action (e.g. kill the hung server).
The last 3 commits do the following:

1) Expose active operations as management resources, for example:

[domain@localhost:9990 /] /host=master/core-service=management/service=management-operations:read-children-resources(child-type=active-operation)
{
    "outcome" => "success",
    "result" => {"-63170310" => {
        "access-mechanism" => "undefined",
        "address" => [
            ("host" => "master"),
            ("core-service" => "management"),
            ("service" => "management-operations")
        ],
        "caller-thread" => "management-handler-thread - 3",
        "cancelled" => false,
        "exclusive-running-time" => -1L,
        "execution-status" => "executing",
        "operation" => "read-children-resources",
        "running-time" => 3534000L
    }}
}

For a standalone server, strip /host=master off the address; for a domain server, include server=<?> in the address.

2) Include a "cancel" op in those resources which, if invoked, interrupts the thread executing the op. Canceling an op is done via thread interruption, as has been the case since 7.0.

3) Implement a number of fixes to ensure that canceling an op via interruption is properly propagated throughout a managed domain, while also ensuring that the thread interruption underlying cancellation does not inadvertently close the communication channels between the processes in the domain.

4) The ModelControllerClient async API has always supported canceling an op from the client side by canceling the Future returned from executeAsync. This is now more reliable in a domain due to 3) above.

5) The parent resource for the active-operation resources includes two convenience operations:

a) find-non-progressing-operation: Checks for an operation that has been holding the exclusive operation execution lock for longer than the provided number of seconds and, if one is found, returns its id.

b) cancel-non-progressing-operation: Like find-non-progressing-operation, but instead of returning the operation's id, it cancels that operation.

The intent of these two ops is to simplify identifying and canceling the problem operation that is hanging while holding the exclusive lock, with a number of other ops blocked waiting for that lock.
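
To illustrate how finding a non-progressing op relates to interruption-based cancellation, here is a minimal, self-contained Java sketch. It is not the code in this PR: the class NonProgressingOpSketch and its methods are hypothetical, and it uses only java.util.concurrent rather than the management API. It also assumes that exclusive-running-time is reported in nanoseconds and that -1 means the op does not hold the exclusive lock. Calling cancel(true) on a plain Future stands in for both the "cancel" op and client-side cancellation of the Future returned from executeAsync.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.OptionalInt;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; none of these names come from WildFly.
class NonProgressingOpSketch {

    // Assumption: map values are exclusive-running-time in nanoseconds, with -1
    // meaning the op does not hold the exclusive lock. Returns the id of an op
    // that has held the lock for longer than timeoutSeconds, if there is one.
    static OptionalInt findNonProgressing(Map<Integer, Long> exclusiveNanosById, long timeoutSeconds) {
        long thresholdNanos = TimeUnit.SECONDS.toNanos(timeoutSeconds);
        for (Map.Entry<Integer, Long> e : exclusiveNanosById.entrySet()) {
            if (e.getValue() > thresholdNanos) {
                return OptionalInt.of(e.getKey());
            }
        }
        return OptionalInt.empty();
    }

    // Starts a task that blocks like a hung op, cancels it through its Future,
    // and reports whether the executing thread actually observed the interrupt.
    static boolean cancelByInterruption() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch finished = new CountDownLatch(1);
        AtomicBoolean interrupted = new AtomicBoolean(false);
        Future<?> op = pool.submit(() -> {
            started.countDown();
            try {
                Thread.sleep(60_000); // stands in for an op stuck waiting
            } catch (InterruptedException e) {
                interrupted.set(true); // cancellation arrived as an interrupt
            } finally {
                finished.countDown();
            }
        });
        try {
            started.await();                      // make sure the op is running
            op.cancel(true);                      // true = interrupt the running thread
            finished.await(5, TimeUnit.SECONDS);  // give the op time to unwind
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return op.isCancelled() && interrupted.get();
    }

    public static void main(String[] args) {
        Map<Integer, Long> ops = new LinkedHashMap<>();
        ops.put(-63170310, -1L);                     // not holding the exclusive lock
        ops.put(1234, TimeUnit.SECONDS.toNanos(20)); // hypothetical hung op
        System.out.println(findNonProgressing(ops, 15));
        System.out.println(cancelByInterruption());
    }
}
```

The sketch also shows why a code path that swallows or ignores the interrupt cannot be canceled this way, which is the same limitation noted in the MSC-143 caveat above.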