New instance of PushManager can't restore connection when previous fails #143
To be clear, it sounds like you're simulating a failure by deliberately disabling your internet connection. Is that correct?
I confirm that this can happen in real-life situations from time to time if DNS is slow and a short DNS timeout has been configured in the OS. When the error occurs, the above error message is delivered as the Throwable cause in the callback: `public void handleFailedConnection(` This callback is then repeated in an infinite loop, causing CPU usage to skyrocket... Please fix this, as it is kind of serious...
Yes, it is correct.
BTW, I think the spinning CPU is because the ApnsConnection is not properly closed when an exception occurs before SSL handshaking has completed. See the code snippet here (this will not attempt to close the ApnsConnection unless the handshake is done): `private class ApnsConnectionHandler ....`
Unless I'm mistaken (which is entirely possible), won't
I thiiiiiink this is the expected behavior. If a connection attempt fails, we try to reconnect immediately. I would expect that it would "spin" until a connection is established. Part of the reason we expose connection failures via the listener is so callers can make a call about when they want to give up. In a nutshell, I think everything here is working as intended. @maxim-volkov to follow up on your original post, what behavior were you expecting instead?
The Internet (Ethernet) connection has been restored, but the new PushManager instance can't send notifications because an exception happened.
> I thiiiiiink this is the expected behavior. If a connection attempt fails, we try to reconnect immediately. I would expect that it would "spin" until a connection is established.

I'm not convinced that spinning on the CPU without a sleep is appropriate, however -- typical best practice in this scenario would be to use an exponential backoff.
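To make the backoff suggestion concrete, here is a minimal sketch of an exponential backoff schedule that a caller could consult from its failed-connection listener before letting a reconnect proceed. The class and method names here are illustrative, not part of Pushy's API:

```java
// Sketch of an exponential backoff schedule: the delay doubles on every
// consecutive failure and is capped at a configurable maximum.
public class BackoffSchedule {

    private final long baseDelayMillis;
    private final long maxDelayMillis;
    private int failureCount = 0;

    public BackoffSchedule(final long baseDelayMillis, final long maxDelayMillis) {
        this.baseDelayMillis = baseDelayMillis;
        this.maxDelayMillis = maxDelayMillis;
    }

    // Returns the delay to wait before the next reconnect attempt and
    // records that another failure has occurred.
    public long nextDelayMillis() {
        // Cap the shift distance to avoid overflowing the long.
        final long delay = baseDelayMillis << Math.min(failureCount, 20);
        failureCount++;
        return Math.min(delay, maxDelayMillis);
    }

    // Call when a connection finally succeeds.
    public void reset() {
        failureCount = 0;
    }

    public static void main(final String[] args) {
        final BackoffSchedule schedule = new BackoffSchedule(500, 30000);
        for (int i = 0; i < 8; i++) {
            System.out.println(schedule.nextDelayMillis());
        }
    }
}
```

With a 500 ms base and a 30 s cap this yields 500, 1000, 2000, 4000, 8000, 16000, 30000, 30000, ...; adding random jitter on top would avoid synchronized retries from multiple connections.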
Does anything work at that point, though? I don't mean to be difficult, but is it possible you're just retrying too early? What happens if you wait longer before trying to reconnect? To show my hand, it sounds like this is just a weird issue where some part of the OS's networking stuff is a little slow to wake back up. It doesn't sound like there's anything we can or should do about that in Pushy itself.
@mdvorak asked for something similar in #102. I was against it then (on the grounds that callers should make decisions on their own), but I'm starting to come around on the idea. I think we should treat that as a separate feature request, though, and keep this discussion narrowly focused on the original issue.
I have succeeded in reproducing this problem, so I have some additional information about it. In my app, I initialize Pushy with the following snippet:

```java
ApnsConnectionConfiguration apnsConfig = new ApnsConnectionConfiguration();
apnsConfig.setSendAttemptLimit(4);
apnsConfig.setCloseAfterInactivityTime(300);
apnsConfig.setGracefulShutdownTimeout(5);

PushManagerConfiguration pushConfig = new PushManagerConfiguration();
pushConfig.setConcurrentConnectionCount(4);
pushConfig.setConnectionConfiguration(apnsConfig);

pushManager = new PushManager<SimpleApnsPushNotification>(
        apnsEnvironment, SSLContextUtil.createDefaultSSLContext(cert, keyStorePassword),
        null, null, null, pushConfig, "APNSPushManager");
pushManager.registerFailedConnectionListener(new MyFailedConnectionListener());
```

My failed connection listener looks like this:

```java
private class MyFailedConnectionListener implements FailedConnectionListener<SimpleApnsPushNotification> {

    @Override
    public void handleFailedConnection(
            final PushManager<? extends SimpleApnsPushNotification> pushManager,
            final Throwable cause) {
        if (cause instanceof SSLHandshakeException) {
            SSLHandshakeException ex = (SSLHandshakeException) cause;
            log.warn("failed SSL handshake to apple", ex);
        } else {
            log.warn("failed connecting to apple", cause);
        }
        // Avoid a busy-wait scenario and log spamming by simply waiting a bit
        try {
            Thread.sleep(1000);
        } catch (InterruptedException e) {
            log.info("Interrupted when sleeping in handleFailedConnection(), orig cause: {}", cause);
        }
    }
}
```

You can probably use a shorter setCloseAfterInactivityTime to reproduce the problem faster, by the way. Furthermore, to quickly reproduce the problem it is important to set a short DNS client TTL inside Java; if you do not, you may not succeed in reproducing the problem easily. The easiest way to reliably set the DNS TTL in Java is to edit $JAVA_HOME/jre/lib/security/java.security and configure the property `networkaddress.cache.ttl=10`.

Now, to reproduce the issue, start the push server as normal. When everything is running, change your local DNS config (/etc/resolv.conf on Linux) and configure a non-responsive DNS server as the only DNS source. Make sure you disable DNS caching in the OS if you have it. This should trigger the UnresolvedAddressException the next time a DNS lookup is attempted. If you do this, then within approximately 300 seconds you should have reproduced the problem, and you will easily see it by 100% CPU utilisation and an avalanche of errors in the debug log if you have set the log level to DEBUG for the Pushy classes. Errors you will see look like this:
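As an aside, if editing java.security by hand is inconvenient in a test setup, the same TTL can be set programmatically through `java.security.Security`, provided it happens before the InetAddress cache is first used. A minimal sketch (the class and method names are mine, only the two property keys come from the JDK):

```java
import java.security.Security;

public class DnsTtlConfig {

    // Applies the same settings as editing java.security by hand. Must run
    // before the first address lookup, e.g. at the top of main().
    public static void configureShortDnsTtl() {
        // Cache successful DNS lookups for 10 seconds.
        Security.setProperty("networkaddress.cache.ttl", "10");
        // Cache failed (negative) lookups for 10 seconds as well.
        Security.setProperty("networkaddress.cache.negative.ttl", "10");
    }

    public static void main(final String[] args) {
        configureShortDnsTtl();
        System.out.println(Security.getProperty("networkaddress.cache.ttl"));
    }
}
```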
Furthermore, if you quickly fix the broken DNS config while this goes on, then after a short while (~10 seconds) some of the errors go away. At this point you are still in an infinite loop, though. These errors look like this:
Apparently Pushy cannot recover from this state, since this logging continues even after the DNS has been restored.
We encounter this problem from time to time as well; the default value of 10 seconds TTL for negative DNS queries means this issue will be visible for at least 10 seconds, even if the DNS outage only lasted for one query. I think your delayed additional error messages are likely remnants of the 10-second outage, then. It looks like Pushy uses an ExecutorService (defaulting to a single thread) to actually handle the failed connections, which results in Pushy still spinning quickly while your handler consumes the failed-connection notifications more slowly: the errors are still generated as quickly as possible but only processed/logged at one per second. I guess the only thing that Pushy could do is slow down when it encounters an UnresolvedAddressException; the connection will fail due to the cache for the next 10 seconds (or the configured value via system properties) anyway, and there is little to gain by trying. Of course, it's debatable whether this behaviour should be part of Pushy itself or the task of the calling application, which would probably benefit from a way to "slow down" Pushy's reconnecting, though.
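Following that reasoning, a caller (or Pushy itself) could single out DNS failures by walking the cause chain for `java.nio.channels.UnresolvedAddressException` and sleeping for roughly the negative-cache TTL before retrying. A self-contained sketch of the classification part (the helper class and method names are hypothetical, not Pushy API):

```java
import java.nio.channels.UnresolvedAddressException;

public class FailureClassifier {

    // Returns true if any link in the cause chain is a DNS-resolution
    // failure; in that case, retrying before the negative-cache TTL
    // expires is pointless.
    public static boolean isDnsFailure(final Throwable cause) {
        for (Throwable t = cause; t != null; t = t.getCause()) {
            if (t instanceof UnresolvedAddressException) {
                return true;
            }
        }
        return false;
    }

    public static void main(final String[] args) {
        final Throwable wrapped = new RuntimeException("connect failed",
                new UnresolvedAddressException());
        System.out.println(isDnsFailure(wrapped));                // true
        System.out.println(isDnsFailure(new RuntimeException())); // false
    }
}
```

A failed-connection listener could then sleep longer (say, the 10-second negative TTL) when this returns true, and fall back to a shorter delay otherwise.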
Agreed; getting some kind of exponential backoff behavior for failed connections is high on our priority list right now. |
It sounds like the only thing that's really going wrong here is that we're trying to reconnect too quickly and flooding the listener queue. Exponential back-off is happening in #169, so I'm going to go ahead and close this issue.
New instance of PushManager can't restore connection when previous fails:
`pushManager.shutdown(10000);` returned
`pushManager.shutdown();` is blocked.
Exception: