Fix supernova reboot bug, add reboot tests #3848

Merged
merged 13 commits into from Jul 26, 2018

@adcxyz adcxyz commented Jul 7, 2018

Currently, supernova servers tend to hang on reboot (macOS 10.13.4):

Server.supernova; 
s.reboot; // first time usually works
s.reboot; // second time usually hangs, because notify does not happen.

This is a timing issue: supernova sends back the quit message early, but the supernova process still lives long enough to interfere with the boot messaging traffic of the rebooted instance.
To fix the hang, this PR waits for the previous process to release its PID before booting.
This PR includes:

  • waiting for quitting process to release server.pid
  • safer testing for notify clientID info from server (supernova messages are not identical yet)
  • new unit tests for reboot method
  • fix to cycleNotify which tends to hang up TestServer_boot methods

Result: All tests involving booted servers now run with supernova!
They all pass, except test_getn, which still fails because of a bug in supernova's handle_s_getn (#3841).

I believe supernova will be part of the 3.10 release, yes?
If so, I would be happy if this bug fix went into 3.10, so that supernova is supported as fully as possible.

adcxyz added 5 commits June 20, 2018 19:00
dont set clientID if no clientID from response
so we know when the process really has ended.
this is to avoid supernova hanging with incomplete boot state when rebooting.
supernova tends to gets stuck here for some reason.
@adcxyz adcxyz added the bug, comp: supernova, and comp: testing labels Jul 7, 2018
@adcxyz adcxyz requested review from telephon and mossheim July 7, 2018 09:56
Member

@telephon telephon left a comment

Code reads well and TestServer_boot succeeds for both scsynth and supernova, tested on macOS.

}, 0.25);
}
}

waitForPidRelease { |onDone, onFailure, timeout = 1|
Member

usually we call this onComplete.

} {
0.05.wait
};
if (pid.isNil) { onDone.value } { onFailure.value };
Member

you could write if(pid.isNil, onDone, onFailure) , but well...

@@ -922,6 +946,7 @@ Server {
} {
this.disconnectSharedMemory;
pid = unixCmd(program ++ options.asOptionsString(addr.port), { |exitCode|
pid = nil;
this.prOnServerProcessExit(exitCode);
Member

shouldn't the line pid = nil happen in prOnServerProcessExit ?

Contributor Author

either is fine by me - had it there first - let's see if @brianlheim has a preference

Member

@telephon telephon Jul 7, 2018

i also wonder whether theoretically a unixCmd could return a non nil pid even after it has called the exit function.

Contributor

@telephon - do you mean, whether it's possible that unixCmd's callback could run while the process is still running? I looked over the associated code - I'm not an expert on the unixy stuff but it looks like it's a remote possibility? I think that would be an issue in the implementation of unixCmd though and IMO we're OK relying on it here.

Contributor

I agree with @telephon that the line pid = nil should go in prOnServerProcessExit, makes this easier to understand

Contributor Author

done

s.doWhenBooted {
// first boot OK, starting reboot from here
s.reboot {
s.doWhenBooted({
Member

you could leave away the ({ here, too.

Member

@telephon telephon left a comment

but please change at least the argument name to onComplete

Contributor Author

adcxyz commented Jul 7, 2018

ok all done,
and updated the reboot tests to use UnitTest.wait,
and fixed a float value for numWireBufs that stops supernova when booting.

Contributor

@mossheim mossheim left a comment

Thanks for tackling this! Comments below

}, 0.25);
}
}

waitForPidRelease { |onComplete, onFailure, timeout = 1|
Contributor

this should be a private method (and hidden in documentation), right?

Member

yes, I think so, too.

Contributor Author

done

onComplete.value;
^this
};
// FIXME: quick and dirty fix for supernova reboot hang on osx:
Contributor

This comment is good, thanks! Btw for future reference, the name is macOS now

server.notify_(true);
while { server.notified.not } { 0.1.wait };
while { server.notified.not } { server.notify_(false); 0.1.wait };
Contributor

for the sake of clarity, it would probably be better here to write statusWatcher.sendNotifyRequest(false). Then the code immediately makes sense with the comment.

Contributor Author

done

}
} {
0.05.wait
};
Contributor

please rewrite this using a Condition - waiting in a loop is inefficient and we have better idioms for this purpose

Member

I'm not sure: do we have a callback somewhere that would make this possible? There would have to be a mechanism that unhangs the condition when the pid becomes nil. As long as we don't have that, this loop is the only possible way.

Contributor

A possible way to implement it is that any code that changes the pid can signal a Condition that is waited on here. The Condition will have to be at class scope. pid == nil can even be used as the test function for the Condition.
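The Condition-based idea proposed above could look roughly like the following stand-alone sketch. It is not the merged Server code: the local variable `pid` stands in for the server's instance variable, the process exit is simulated with a scheduled function, and no timeout handling is shown.

```supercollider
(
// Hypothetical stand-in for a server's pid; a real Server sets this from unixCmd.
var pid = 12345;

// The test function is true once the pid has been released.
var pidReleaseCondition = Condition({ pid.isNil });

// Simulate the process-exit callback: clear the pid, then signal the condition.
// Returning nil prevents the function from being rescheduled.
SystemClock.sched(0.2, {
	pid = nil;
	pidReleaseCondition.signal;
	nil
});

// A waiting routine resumes only after signal is called and the test is true.
fork {
	pidReleaseCondition.wait;
	"pid released, safe to boot again".postln;
};
)
```

In this shape, every place that changes the pid has to call `signal`, which is the coupling the polling version avoids.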

Member

If we do this, the condition has to be at object scope, otherwise booting servers will block each other (but this is probably what you meant, anyhow).

Unless there are other uses for a condition at that level, I'd prefer to keep the method self-contained. Polling is not bad practice as such at all, it just depends on the case. But I'm fine either way if it is elegant.

Contributor Author

I don't think efficiency is an issue here - waiting until something has finished properly so you can continue is very cheap. As Julian points out, the polling solution here stays completely local; also, it is just a single local poll/wait thread, so there is no possible confusion about the order of execution of waiting tasks.
IIRC, Condition.hangWithTimeout (which would be needed here) ran into problems that are still not resolved, right?
To me, polling seems the right choice here, and conditions would make it more complex than necessary.
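For comparison, the polling approach can be sketched as a self-contained snippet. This is an illustration under assumptions, not the method as merged: `pid` and `waitForPidRelease` are local stand-ins named after the identifiers in the diff, the 0.05 s poll interval and 1 s default timeout are taken from the fragments shown in this review, and the process exit is simulated.

```supercollider
(
// Hypothetical stand-in for a server's pid instance variable.
var pid = 12345;

// Poll until pid is nil or the timeout elapses, then call the matching callback.
var waitForPidRelease = { |onComplete, onFailure, timeout = 1|
	fork {
		var waited = 0, step = 0.05;
		while { pid.notNil and: { waited < timeout } } {
			step.wait;
			waited = waited + step;
		};
		if(pid.isNil) { onComplete.value } { onFailure.value };
	}
};

// Simulate the old process exiting after 0.2 seconds.
SystemClock.sched(0.2, { pid = nil; nil });

waitForPidRelease.value(
	{ "pid released, safe to boot again".postln },
	{ "timed out waiting for pid release".postln }
);
)
```

The wait loop and its timeout live entirely inside one routine, so nothing outside the function needs to know that anyone is waiting.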

Contributor Author

This would be a case for whenTrueWithin, as proposed in #3850.

Contributor

Do whatever you want. This is not just about efficiency, but this code is going to be rewritten soon anyway.

Contributor Author

yes, eg. when I get around to redoing the ServerProcess refactor :-)
which is hopefully soon

@nhthn nhthn changed the base branch from develop to 3.10 July 15, 2018 19:08
@@ -280,6 +280,7 @@ Server {
var <window, <>scopeWindow, <emacsbuf;
var <volume, <recorder, <statusWatcher;
var <pid, serverInterface;
var <pidReleaseCondition;
Member

is it necessary that this condition is externally accessible?

Contributor Author

had it accessible for testing, happy to make it internal only

@@ -339,6 +340,8 @@ Server {
this.name = argName;
all.add(this);

pidReleaseCondition = Condition({ this.pid == nil });
Member

this sets the class pidReleaseCondition, but above it is defined as instance variable.

Contributor Author

Like pid, pidReleaseCondition is an instance variable, because quitting is individual/independent per server instance. Not sure I understand your question?

Member

but you are setting it in *initClass, here.

Member

… I've just read the original code, false alarm, it is all in init. This is a github/diff artefact.

Contributor Author

OK :-)

Contributor Author

will go over this anyway in the ServerProcess refactor/cleanup, where every ServerProcess state gets its own condition, so this is just to make all tests run with supernova now, rather than waiting for the ServerProcess refactor/cleanup to get through.

@adcxyz adcxyz added this to the 3.10 milestone Jul 22, 2018
Contributor Author

adcxyz commented Jul 22, 2018

@telephon - removed condition getter,
so I think I addressed all your concerns, yes?
If so, then this is good to go.

@telephon telephon merged commit b53fc82 into supercollider:3.10 Jul 26, 2018
@adcxyz adcxyz deleted the fixRebootSupernova branch July 28, 2018 12:26
@mossheim mossheim added comp: testing and removed comp: testing labels Mar 7, 2021