fix re-login behavior of remote clients to a server #3486

adcxyz · 2018-02-01T12:32:23Z

*** PR Info text completely redone ***
Thanks to questions by @muellmusik and @brianlheim,
the cases covered in this PR became much clearer to me.
The general issue is handling repeated logins in network performance situations,
to recover as gracefully as possible from loss of connection between client and server,
whether by network glitch or client crash.
(Server crash is outside the scope of this PR, but should be fully thought through elsewhere.)

Case 1: network connection is lost,
and the same client, same client-side server object logs in again.

The server program remembers the client and sends its previous clientID back.
The client then knows it is connected again,
and it can keep all its known buffers, nodes, and synths as they are.
-> So, just post that the connection is back, and all is well.

Case 2: client crashes, and is restarted;
then the new client logs in again from the same network address:

The client-side server object is new, but the server program knows the address,
so the server program sends its previous clientID back to the new client.
Because the new server object may not know its previous clientID,
the server program's response may trigger new allocators, which is correct.
There may be buffers and running nodes and synths on the server program
that the new server object does not know about.
Depending on the specific setup used,
it may make sense to get rid of these by hand ( server.defaultGroup.release )
and reinit the server by running server.statusWatcher.prFinalizeBoot by hand.
-> So, post about the successful reconnect, and advice about release/reinit code.

(Note for later: maybe develop functionality to retrieve these resources
from the server program in the client.)

Case 3: loss of connection and login from a different network address:
This will look like a new login to the server, so the old defaultGroup
and everything in it keeps going, and the new client gets a new clientID.

In the future, this case could be handled better by releasing the previous notify registration,
and then asking for the same clientID it had earlier, to continue seamlessly from there.
This would need some more effort to work.
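A rough sketch of how a client could tell cases 1 and 2 apart, based only on the clientID the server program sends back. The name classifyRelogin is hypothetical and not part of the class library; the actual logic in this PR lives in ServerStatusWatcher and may differ in detail.

```supercollider
(
// Hypothetical helper for illustration only, not the PR's implementation.
// knownClientID: the clientID the client-side server object currently holds
// clientIDFromServer: the clientID returned in the "already registered" reply
var classifyRelogin = { |knownClientID, clientIDFromServer|
	if (knownClientID == clientIDFromServer) {
		// Case 1: same server object, same ID -> keep allocators and resources
		\reconnected
	} {
		// Case 2: new server object after a crash -> adopt the returned ID,
		// which creates new allocators; advise release/reinit by hand
		\newServerObject
	}
};

classifyRelogin.(3, 3).postln;   // same ID as before
classifyRelogin.(0, 3).postln;   // fresh object learns its earlier ID
)
```

Case 3 never reaches this decision, because a login from a different address looks like a first login to the server program.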

The tests provided also cover cases 1 and 2:
TestServer_clientID_booted:test_repeatedLogin tests case 1
TestServer_clientID_booted:test_remoteLogins tests a first login by the Server:remote method,
and a repeated login from a new server object after a simulated crash.

fork {
	Server.killAll;
	0.1.wait;
	TestServer_clientID_booted.new.test_repeatedLogin;
	0.1.wait;
	TestServer_clientID_booted.new.test_remoteLogins;
};

mossheim · 2018-02-02T14:09:21Z

testsuite/classlibrary/TestServer_clientID_booted.sc

+		// -> so the response should go thru prHandleNotifyFailString
+
+		remote = Server.remote(\remTest, homeServer.addr, homeServer.options);
+		// Server.default = remote; // to test IDE server display


mossheim · 2018-02-02T14:09:48Z

testsuite/classlibrary/TestServer_clientID_booted.sc

+			remote.serverRunning.not and: { timeout > 0 }
+		} {
+			timeout = timeout - dt;
+			dt.wait;


use a condition variable here
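A minimal sketch of what a condition-based wait could look like instead of the polling loop. It assumes Server.default as a stand-in for the test's remote server, and a timeout of 3 seconds chosen arbitrarily; the version that ended up in the test may be structured differently.

```supercollider
(
fork {
	var cond = Condition.new;
	var timeout = 3;                 // seconds, arbitrary value for illustration
	var server = Server.default;     // stand-in for `remote` in the test

	// continue as soon as the server is known to be running ...
	server.doWhenBooted({ cond.unhang });
	// ... or after the timeout, whichever comes first
	fork { timeout.wait; cond.unhang };

	cond.hang;                       // the routine pauses here without polling
	"serverRunning: %\n".postf(server.serverRunning);
};
)
```

Calling unhang a second time is harmless here, since by then no thread is waiting on the condition.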

mossheim · 2018-02-02T14:10:05Z

testsuite/classlibrary/TestServer_clientID_booted.sc

+
+		// make s play dead now, but leave the process running
+		homeServer.statusWatcher.stopStatusWatcher.stopAliveThread.serverRunning_(false);
+		// homeServer.serverRunning.postln; // client thinks it is dead


mossheim · 2018-02-02T14:10:40Z

testsuite/classlibrary/TestServer_clientID_booted.sc

+
+		this.assert(homeServer.clientID == 3,
+			"homeServer gets requested clientID back from server process."
+		);


this should be a separate test or removed altogether

mossheim · 2018-02-02T14:12:00Z

testsuite/classlibrary/TestServer_clientID_booted.sc

+		);
+
+		// make s play dead now, but leave the process running
+		homeServer.statusWatcher.stopStatusWatcher.stopAliveThread.serverRunning_(false);


chaining these calls is somewhat confusing IMO

mossheim · 2018-02-02T14:13:26Z

SCClassLibrary/Common/Control/Server.sc

@@ -544,7 +544,8 @@ Server {
 		case
 		{ failString.asString.contains("already registered") } {
 			"% - already registered with clientID %.\n".postf(this, msg[3]);
-			statusWatcher.notified = true;
+			statusWatcher.prRecoverRemoteLogin(msg[3]);


could use a comment - without looking I don't know what msg[3] is

mossheim · 2018-02-02T14:13:47Z

SCClassLibrary/Common/Control/ServerStatus.sc

+		}, AppClock)
+	}
+
+	prRecoverRemoteLogin { |clientID|


needs to be added to the list of private methods in the help file

what is this name supposed to mean, btw? not immediately apparent to me, sorry

hm, maybe really not a good name.
would you find prHandleRemoteLoginWhenAlreadyRegistered clearer?

mossheim · 2018-02-02T14:18:51Z

testsuite/classlibrary/TestServer_clientID_booted.sc

+		// -> this client netaddr is already registered, and should say so!
+		// -> so the response should go thru prHandleNotifyFailString
+
+		remote = Server.remote(\remTest, homeServer.addr, homeServer.options);


flow of logic would make more sense to me if this server were created right after the homeServer; does this constructor call depend on preceding code in a way I'm not seeing?

mossheim · 2018-02-02T14:19:15Z

testsuite/classlibrary/TestServer_clientID_booted.sc

+			"homeServer gets requested clientID back from server process."
+		);
+
+		// make s play dead now, but leave the process running


s -> homeServer

muellmusik · 2018-02-02T15:08:07Z

Just two quick questions @adcxyz:

If clientIDs are handed out by the server, then if the server recognises the client address, it can just give the same ID or? Or is this for the backwards compatible manually specified ID case?

Does it make sense that a client necessarily gets the same ID? If you have a lang crash, leaving orphaned nodes, then getting the same ID could lead to resource conflicts. There are various ways of dealing with that of course (e.g. gracefully free orphaned nodes using predictable default group), but you might want different ones for different use cases.

adcxyz · 2018-02-02T17:08:02Z

@muellmusik :

If clientIDs are handed out by the server, then if the server recognises the client address, it can just give the same ID or?

Yes, exactly. This is what the server program has been doing all along, and I think it is the clearest and most predictable behavior for most cases, as listed below.

Or is this for the backwards compatible manually specified ID case?

No, not specifically - when you manually ask for a clientID, and it is free, you get that;
when it is not free, you get a free one instead.
When you are recovering a lost login, the recovered previous clientID overrides a manually specified ID. Users always get back a clientID that will work, as long as there are free ones.

Does it make sense that a client necessarily gets the same ID? If you have a lang crash, leaving orphaned nodes, then getting the same ID could lead to resource conflicts. There are various ways of dealing with that of course (e.g. gracefully free orphaned nodes using predictable default group), but you might want different ones for different use cases.

I think keeping previously known IDs is the clearest and most predictable behavior for most cases I can think of:
Case 1: remote server login code accidentally runs twice -> same ID, nothing breaks.
(still needs full testing whether node allocator is being reset in this case)
Case 2: remote client loses network connection, and logs in again:
gets the same clientID, so the same defaultGroup, and all your synths etc should still be there.
(still needs full testing whether node allocator is being reset in this case)
Case 3: remote client crashes, and logs in again:
gets the same clientID and can then decide how to handle things gracefully:
release or free their (same as before) defaultGroup, or whatever else is called for.
(allocator reset is desirable here)
Case 4: a remote client gets lost and leaves sounds running:
Because all default groups are known to all clients, other users can intervene and gracefully end or release everything in that client's defaultGroup.
Case 5: multiple losses of contact in one show: if the server program keeps handing out new clientIDs, it would run out of clientIDs quickly, so handing back the previous ones is also safer for this case.
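To make the rules above concrete, here is a toy model of the ID handout in plain sclang. It is only a sketch of the behavior described in this comment, not scsynth's actual implementation; maxLogins, registered, and login are hypothetical names, and the addresses are made up.

```supercollider
(
// Toy model only: this is NOT how scsynth is implemented internally.
var maxLogins = 4;
var registered = Dictionary.new;   // network address (String) -> clientID

var login = { |addr, requestedID|
	var id = registered[addr];     // known address: hand back the previous clientID
	if (id.isNil) {
		var used = registered.values;
		id = if (requestedID.notNil and: { used.includes(requestedID).not }) {
			requestedID            // the requested ID is free: grant it
		} {
			// otherwise the lowest free ID, or nil if none is left
			Array.series(maxLogins).detect { |i| used.includes(i).not }
		};
		if (id.notNil) { registered[addr] = id };
	};
	id
};

login.("192.168.0.10:57120", 3).postln;   // first login, asks for 3 and gets 3
login.("192.168.0.10:57120").postln;      // re-login from same address: 3 again
login.("192.168.0.11:57120", 3).postln;   // other address, 3 is taken: gets 0
)
```

Because a known address always maps back to its earlier ID, repeated re-logins in this model never use up additional IDs, which is the point of Case 5.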

hope that answers your questions?
best a

adcxyz · 2018-02-02T17:08:49Z

hi @brianlheim, thanks for the detailed comments,
will get back to them ASAP - off to a show with Julian and Marcus Schmickler now!

muellmusik · 2018-02-02T17:52:30Z

hope that answers your questions?

Indeed! Thanks for the detailed and patient explanation. :-)

mossheim · 2018-02-02T18:05:31Z

off to a show with Julian and Marcus Schmickler now

Jealous! Have a great time!

adcxyz · 2018-02-05T19:13:53Z

Hi @brianlheim and @muellmusik,
thanks for the questions and comments, they made me rethink and rewrite this one quite a bit!
I think it is much clearer now what this PR aims to improve - please see the updated top comment.
This is the thread that discussed the problem:
https://www.listarc.bham.ac.uk/lists/sc-users/msg59049.html
best adc

telephon

one minor request, should be easy to fix. Looks very good -- it'll be a great relief not having to teach the kids to kill servers on a daily basis.

telephon · 2018-02-05T19:21:57Z

SCClassLibrary/Common/Control/ServerStatus.sc

+			"// so you may want to release currently running synths by hand:\n".postln;
+			"%.defaultGroup.release;\n".postf(server.cs);
+			"// and you may want to redo server boot finalization by hand:\n".postln;
+			"%.statusWatcher.prFinalizeBoot;\n".postf(server.cs);


If we want to suggest this line as something to actually do, it shouldn't involve a private message. What about adding:

+ Server { finalizeBoot { this.statusWatcher.prFinalizeBoot } }

then the line above would read:
"%.finalizeBoot;\n".postf(server.cs);

I'm not sure about the name ...

IIUC this solution is a hack, so it's ok to call a private function. Adding access to a private function as a public function seems worse because it has to be walked back with deprecation.

I see, makes some sense. My only worry is that the public use of private methods might be perceived as the standard way to go.

So let's note it in the comment that this is a temporary measure which will be replaced in future versions?

like the comment above the method says, the whole thing is about making it possible to recover in an emergency, so yes, it is a hack.

mossheim · 2018-02-05T21:46:50Z

testsuite/classlibrary/TestServer_clientID_booted.sc

+		// cleanup
+		remote1.remove;
+		remote2.remove;
+		unixCmd("kill -9" + serverPid);


this is not portable at all

OK. Is there a crossplatform version of kill?
or is it OK to use Server.killAll in the testsuite?
I would have preferred that, but assumed it is cleaner to only kill the exact process and nothing else.

Is there a crossplatform version of kill?

Windows uses taskkill.

is it OK to use Server.killAll in the testsuite?

No. The single process approach is much more sound. Since we have Platform.killAll already, it would probably not be such a bad idea to add killProcess that accepts a pid. Just a suggestion. That should be done in a separate PR.

I have near zero windows shell skills - would the methods below work?
in UnixPlatform:
	kill { |pid|
		("kill -9 " ++ pid).unixCmd;
	}
and in WindowsPlatform:
	kill { |pid|
		("taskkill /F /pid " ++ pid).unixCmd;
	}

I think so.. I can't test at the moment

mossheim · 2018-02-05T21:48:46Z

SCClassLibrary/Common/Control/ServerStatus.sc

+		} {
+			// same clientID, so leave all server resources in the state they were in!
+			"// This seems to be a login after a loss of network contact - \n"
+			"// - reconnected with the same clientID as before, so probably all is well.\n".postln;


does this really need to be posted? i think this is reaching a point where it's going to be regarded as noise by users

well, it is trying to be helpful in a rare emergency scenario - you are in the middle of playing a show, network drops out, you try to reconnect, and likely happy to see anything informative posted. I doubt that anyone would find that noisy in that situation.
But maybe that is just me.

gotcha, that makes sense. I think it would be nice if the post message was consistent with other messages coming out of this class, though. No other messages that I've seen start with "//", it seems arbitrary.

actually since it should be noticeable, how about making it begin with *** for bold posting?

just tried it - that does make the post-crash hints pop out nicely in the post window!

mossheim · 2018-02-05T22:14:34Z

testsuite/classlibrary/TestServer_clientID_booted.sc

+		// simulate a remote server process by starting a server process independently of SC
+		serverPid = unixCmd(Server.program + options.asOptionsString);
+
+		1.wait;


what is this wait for?

leftover from an earlier version of the test, unneeded now - can remove it.

mossheim · 2018-02-05T22:24:47Z

testsuite/classlibrary/TestServer_clientID_booted.sc

+
+		this.assert(remote1.clientID == 3,
+			"after first login, remote client should have its requested clientID."
+		);


are these asserts necessary? they seem to be covered well by other tests.

AFAIK the Server:remote method is not tested anywhere else, so testing both its normal and its emergency use seemed efficient and appropriate.

then it should be in a separate test...

so have a test for remoteLogin, which tests the first login only,
and a test for repeatedRemoteLogin, which needs a first login as setup,
and then tests the repeated login only?

ok, I think all requests done,
and submitted separate PR for kill method

adcxyz · 2018-02-09T13:42:29Z

OK - I believe I addressed all requests for changes on this one,
so I guess the next steps are:
I wait for #3499 addKillMethod to be merged,
then change this one to use killProcessByID,
then it can be merged.
Did I miss anything?

mossheim · 2018-03-01T03:37:06Z

#3499 has been merged, this can be updated now.

adcxyz · 2018-03-01T07:53:44Z

ok, updated - thanks @brianlheim !

mossheim · 2018-03-01T19:00:21Z

testsuite/classlibrary/TestServer_clientID_booted.sc

+		.notified_(false)
+		.serverRunning_(false);
+
+		//		1.wait;


Can you please remove this? Everything else looks good, thanks for the tests!

sure, done!

mossheim · 2018-03-01T20:19:14Z

Thanks, this is ready for merge IMO. @telephon - want to give it a look since you've got a review already?

joshpar

looks good!

patrickdupuis · 2018-03-14T05:18:38Z

@telephon please approve when you get the chance 👍

telephon · 2018-03-14T06:30:50Z

(sorry for the delay, I had somehow missed the message)

adcxyz added 2 commits February 1, 2018 13:19

fix re-login of remote client to a server: gets old clientID and corr…

074be48

…ect boot status.

test_recoverRemoteReLogin: add assert for serverRunning status.

a61421c

nhthn added this to the 3.9.2 milestone Feb 1, 2018

adcxyz requested review from mossheim and telephon February 2, 2018 09:17

mossheim suggested changes Feb 2, 2018

adcxyz added 3 commits February 5, 2018 18:49

Merge branch '3.9' into fixRecoverRemoteLogin

8847e40

improve clarity by renaming, post info for handled cases.

c9937b0

improve tests by renaming, using conditions, and more cases.

511fca1

telephon requested changes Feb 5, 2018

mossheim reviewed Feb 5, 2018

adcxyz added 2 commits February 6, 2018 01:03

remove unneeded wait from test

5b482fb

make emergency advice post well noticeable

0832967

adcxyz mentioned this pull request Feb 6, 2018

add crossplatform kill method #3499

Merged

split remoteLogin in 2 tests, preapre for kill method.

5472a7c

adcxyz mentioned this pull request Feb 6, 2018

Extend UnitTest runAll with arguments for items and posting #3500

Closed

add temp-replacement for killProcessByID so crossplatform tests pass

8f42e17

nhthn added the comp: class library SC class library label Feb 18, 2018

use new Platform:killProcessByID method instead of local temp

02f0075

mossheim approved these changes Mar 1, 2018

test_repeatedLogin: remove commented-out cruft.

eb836cb

joshpar approved these changes Mar 13, 2018

telephon approved these changes Mar 14, 2018

telephon merged commit d835d53 into supercollider:3.9 Mar 14, 2018

adcxyz deleted the fixRecoverRemoteLogin branch March 31, 2018 15:17
