New instance of PushManager can't restore connection when the previous one fails #143

Closed
maxim-volkov opened this issue Nov 21, 2014 · 12 comments

@maxim-volkov

A new instance of PushManager can't restore the connection when the previous one fails:

  1. Create an instance of PushManager
  2. Send a push notification (works fine)
  3. After the connection fails, stop the manager from "handleFailedConnection":
    pushManager.shutdown(10000); returns
    pushManager.shutdown(); blocks
  4. Restore the Internet connection
  5. Create a new instance of PushManager
  6. Send a push notification
  7. The same exception occurs.

Exception:

java.nio.channels.UnresolvedAddressException
    at sun.nio.ch.Net.checkAddress(Net.java:127)
    at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:644)
    at io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:193)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:199)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1029)
    at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:496)
    at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:481)
    at io.netty.handler.ssl.SslHandler.connect(SslHandler.java:386)
    at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:496)
    at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:481)
    at io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47)
    at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:496)
    at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:481)
    at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:463)
    at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:849)
    at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:202)
    at io.netty.bootstrap.Bootstrap$2.run(Bootstrap.java:165)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:380)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    at java.lang.Thread.run(Thread.java:745)
@jchambers
Owner

Restore Internet connection

To be clear, it sounds like you're simulating a failure by deliberately disabling your internet connection. Is that correct?

@aregr

aregr commented Dec 4, 2014

I confirm that this can happen in real-life situations from time to time if DNS is slow and a short DNS timeout has been configured in the OS. When the error occurs, the above error message is delivered as the Throwable cause to the callback:

public void handleFailedConnection(
        final PushManager<? extends SimpleApnsPushNotification> pushManager,
        final Throwable cause)

This callback is then repeated in an infinite loop, causing CPU usage to skyrocket...

Please fix this as it is kind of serious...

@maxim-volkov
Author

To be clear, it sounds like you're simulating a failure by deliberately disabling your internet connection. Is that correct?

Yes, that is correct.

@aregr

aregr commented Dec 4, 2014

BTW, I think the spinning CPU is because the ApnsConnection is not properly closed when an exception occurs before the SSL handshake has completed:

See code snippet here (this will not attempt to close the ApnsConnection unless handshake is done):

private class ApnsConnectionHandler ... {

    @Override
    public void channelInactive(final ChannelHandlerContext context) throws Exception {
        super.channelInactive(context);

        // Channel closure implies that the connection attempt had fully succeeded, so we only want to notify
        // listeners if the handshake has completed. Otherwise, we'll notify listeners of a connection failure (as
        // opposed to closure) elsewhere.
        if (this.apnsConnection.handshakeCompleted) {
            if (this.apnsConnection.listener != null) {
                this.apnsConnection.listener.handleConnectionClosure(this.apnsConnection);
            }
        }
    }
}

@jchambers
Owner

this will not attempt to close the ApnsConnection unless handshake is done

Unless I'm mistaken (which is entirely possible), won't channelInactive only be called once the connection has closed?

This callback is then repeated in an infinite loop

I thiiiiiink this is the expected behavior. If a connection attempt fails, we try to reconnect immediately. I would expect that it would "spin" until a connection is established. Part of the reason we expose connection failures via the listener is so callers can make a call about when they want to give up.
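
(For illustration, a caller-side listener along these lines could count failures and decide when to give up; this is a hypothetical sketch, with the import locations, the failure threshold, and the shutdown timeout all assumed rather than taken from this thread:)

import java.util.concurrent.atomic.AtomicInteger;

import com.relayrides.pushy.apns.FailedConnectionListener;
import com.relayrides.pushy.apns.PushManager;
import com.relayrides.pushy.apns.util.SimpleApnsPushNotification;

// Hypothetical sketch: shut the PushManager down after a fixed number of connection failures.
public class GiveUpAfterRepeatedFailuresListener implements FailedConnectionListener<SimpleApnsPushNotification> {

    private static final int MAX_FAILURES = 10; // made-up threshold

    private final AtomicInteger failureCount = new AtomicInteger(0);

    @Override
    public void handleFailedConnection(
            final PushManager<? extends SimpleApnsPushNotification> pushManager,
            final Throwable cause) {

        if (this.failureCount.incrementAndGet() >= MAX_FAILURES) {
            try {
                // Give up: stop the manager rather than letting it retry forever.
                // shutdown(long) may throw InterruptedException in this version of Pushy.
                pushManager.shutdown(10000);
            } catch (final Exception e) {
                // Re-assert the interrupt flag if the shutdown wait was interrupted.
                if (e instanceof InterruptedException) {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }
}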

In a nutshell, I think everything here is working as intended. @maxim-volkov to follow up on your original post, what behavior were you expecting instead?

@maxim-volkov
Author

In a nutshell, I think everything here is working as intended. @maxim-volkov to follow up on your original post, what behavior were you expecting instead?

The Internet (Ethernet) connection has been restored, but the new PushManager instance can't send notifications because the same exception occurs.

@jmason

jmason commented Dec 5, 2014

'I thiiiiiink this is the expected behavior. If a connection attempt fails, we try to reconnect immediately. I would expect that it would "spin" until a connection is established.'

I'm not convinced that spinning on the CPU without a sleep is appropriate, however -- typical best practice in this scenario would be to use an exponential backoff...
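
(For reference, the kind of delay policy being suggested, exponential backoff with an upper cap and jitter, might look roughly like the generic sketch below; this is not Pushy code, and all constants are arbitrary:)

import java.util.Random;

// Generic exponential-backoff helper with "full jitter"; a sketch of the kind of
// reconnect-delay policy being proposed, not something that exists in Pushy.
public final class ExponentialBackoff {

    private static final long BASE_DELAY_MILLIS = 500;   // delay before the first retry (arbitrary)
    private static final long MAX_DELAY_MILLIS = 60000;  // upper cap on the delay (arbitrary)

    private static final Random RANDOM = new Random();

    private ExponentialBackoff() {
    }

    // Delay to wait before the given (1-based) retry attempt.
    public static long delayBeforeAttempt(final int attempt) {
        // Double the base delay for each attempt, but never shift far enough to overflow.
        final long exponentialDelay = BASE_DELAY_MILLIS << Math.min(attempt - 1, 20);
        final long cappedDelay = Math.min(exponentialDelay, MAX_DELAY_MILLIS);

        // Full jitter: pick a random delay between 0 and the capped value to avoid
        // synchronized retry storms across connections.
        return (long) (RANDOM.nextDouble() * cappedDelay);
    }
}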

@jchambers
Owner

The Internet (Ethernet) connection has been restored.

Does anything work at that point, though? I don't mean to be difficult, but is it possible you're just retrying too early? What happens if you wait longer before trying to reconnect?

To show my hand, it sounds like this is just a weird issue where some part of the OS's networking stuff is a little slow to wake back up. It doesn't sound like there's anything we can or should do about that in Pushy itself.

typical best practice in this scenario would be to use an exponential backoff

@mdvorak asked for something similar in #102. I was against it then (on the grounds that callers should make decisions on their own), but I'm starting to come around on the idea. I think we should treat that as a separate feature request, though, and keep this discussion narrowly focused on the original issue.

@aregr

aregr commented Dec 5, 2014

I have succeeded in reproducing this problem so I have some additional information about it:

In my app, I initialize Pushy with the following snippet:

ApnsConnectionConfiguration apnsConfig = new ApnsConnectionConfiguration();
apnsConfig.setSendAttemptLimit(4);
apnsConfig.setCloseAfterInactivityTime(300);
apnsConfig.setGracefulShutdownTimeout(5);

PushManagerConfiguration pushConfig = new PushManagerConfiguration();
pushConfig.setConcurrentConnectionCount(4);
pushConfig.setConnectionConfiguration(apnsConfig);

pushManager = new PushManager<SimpleApnsPushNotification>(
        apnsEnvironment, SSLContextUtil.createDefaultSSLContext(cert, keyStorePassword), 
        null, null, null, pushConfig, "APNSPushManager");
pushManager.registerFailedConnectionListener(new MyFailedConnectionListener());

My failed connection listener looks like this:

private class MyFailedConnectionListener implements FailedConnectionListener<SimpleApnsPushNotification> {

    @Override
    public void handleFailedConnection(
            final PushManager<? extends SimpleApnsPushNotification> pushManager,
            final Throwable cause) {

        if (cause instanceof SSLHandshakeException) {
            SSLHandshakeException ex = (SSLHandshakeException) cause;
            log.warn("failed SSL handshake to apple", ex);
        } else {
            log.warn("failed connecting to apple", cause);
        }

        // Avoid busy wait scenario and log-spamming by simply waiting a bit
        try {
            Thread.sleep(1000);
        } catch (InterruptedException e) {
            log.info("Interrupted when sleeping in handleFailedConnection(), orig cause: {}", cause);
        }
    }
}

You can probably use a shorter setCloseAfterInactivityTime to reproduce the problem faster, btw...

Furthermore, to reproduce the problem quickly it is important to set a short DNS client TTL inside Java; otherwise you may not succeed in reproducing it easily. The most reliable way to set the DNS TTL in Java is to edit $JAVA_HOME/jre/lib/security/java.security and configure the properties:

networkaddress.cache.ttl=10
networkaddress.cache.negative.ttl=10
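
The same TTLs can also be set programmatically at JVM startup. These are Java security properties (not system properties), so they go through java.security.Security, and they need to be set early, since the cache policy is typically read only once:

import java.security.Security;

public class DnsCacheConfig {

    // Equivalent to editing java.security; call this before the first DNS lookup
    // in the JVM, since the cache policy is typically read only once.
    public static void useShortDnsTtl() {
        Security.setProperty("networkaddress.cache.ttl", "10");
        Security.setProperty("networkaddress.cache.negative.ttl", "10");
    }
}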

Now, to reproduce the issue, start the push server as normal. When everything is running, change your local DNS config (/etc/resolv.conf on Linux) and configure a non-responsive DNS server as the only DNS source. Make sure you disable DNS caching in the OS if you have it. This should trigger the UnresolvedAddressException the next time a DNS lookup is attempted.

If you do this, then within approximately 300 seconds you should have reproduced the problem; you will see it easily as 100% CPU utilisation and an avalanche of errors in the debug log if you have set the log level to DEBUG for the Pushy classes.

The errors you will see look like this:

2014-12-05 11:42:48,671 DEBUG nioEventLoopGroup-2-1 com.relayrides.pushy.apns.ApnsConnection.connect(ApnsConnection.java:441) - APNSPushManager-connection-32050 beginning connection process.
2014-12-05 11:42:48,671 DEBUG nioEventLoopGroup-2-1 com.relayrides.pushy.apns.ApnsConnection$2.operationComplete(ApnsConnection.java:493) - APNSPushManager-connection-32045 failed to connect to APNs gateway.
java.nio.channels.UnresolvedAddressException
        at sun.nio.ch.Net.checkAddress(Net.java:30)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:480)
        at io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:193)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:199)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1029)
        at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:496)
        at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:481)
        at io.netty.handler.ssl.SslHandler.connect(SslHandler.java:386)
        at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:496)
        at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:481)
        at io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47)
        at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:496)
        at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:481)
        at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:463)
        at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:849)
        at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:202)
        at io.netty.bootstrap.Bootstrap$2.run(Bootstrap.java:165)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:380)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
        at java.lang.Thread.run(Thread.java:662)
2014-12-05 11:42:48,671 DEBUG nioEventLoopGroup-2-1 com.relayrides.pushy.apns.ApnsConnection.connect(ApnsConnection.java:441) - APNSPushManager-connection-32051 beginning connection process.

Furthermore, if you quickly fix the broken DNS config while this goes on, then after a short while (~10 sec) some of the errors go away. At this point, you are still in an infinite loop where the handleFailedConnection callback is called every second in my case (because of the Thread.sleep(1000) inside my handler).

These errors look like this:

2014-12-05 11:42:49,487 WARN pool-3-thread-1 mypackage.ApnsPushSender$MyFailedConnectionListener.handleFailedConnection(ApnsPushSender.java:145) - failed connecting to apple
java.nio.channels.UnresolvedAddressException
        at sun.nio.ch.Net.checkAddress(Net.java:30)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:480)
        at io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:193)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:199)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1029)
        at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:496)
        at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:481)
        at io.netty.handler.ssl.SslHandler.connect(SslHandler.java:386)
        at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:496)
        at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:481)
        at io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47)
        at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:496)
        at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:481)
        at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:463)
        at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:849)
        at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:202)
        at io.netty.bootstrap.Bootstrap$2.run(Bootstrap.java:165)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:380)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
        at java.lang.Thread.run(Thread.java:662)

Apparently Pushy cannot recover from this state, since this logging continues even after DNS has been restored.

@TriPhoenix

We encounter this problem from time to time as well. The default TTL of 10 seconds for negative DNS queries means this issue remains visible for at least 10 seconds, even if the DNS outage only lasted for one query.

I think your delayed additional error messages are then likely remnants of the 10-second outage. It looks like Pushy uses an ExecutorService (defaulting to a single thread) to deliver the failed-connection notifications, which means Pushy itself keeps spinning quickly while your handler consumes the notifications more slowly. The errors are still generated as fast as possible but only processed/logged at one per second.
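
(One way to avoid that backlog on the caller side is to keep the handler non-blocking and rate-limit its own logging instead of sleeping; a hypothetical sketch, with slf4j assumed for logging, import locations assumed, and the one-second interval chosen arbitrarily:)

import java.util.concurrent.atomic.AtomicLong;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.relayrides.pushy.apns.FailedConnectionListener;
import com.relayrides.pushy.apns.PushManager;
import com.relayrides.pushy.apns.util.SimpleApnsPushNotification;

// Hypothetical sketch: never block in the listener; drop log lines that arrive faster
// than once per second so a burst of failures cannot back up the listener executor's queue.
public class RateLimitedFailedConnectionListener implements FailedConnectionListener<SimpleApnsPushNotification> {

    private static final Logger log = LoggerFactory.getLogger(RateLimitedFailedConnectionListener.class);

    private static final long MIN_LOG_INTERVAL_MILLIS = 1000; // arbitrary

    private final AtomicLong lastLogTimeMillis = new AtomicLong(0);

    @Override
    public void handleFailedConnection(
            final PushManager<? extends SimpleApnsPushNotification> pushManager,
            final Throwable cause) {

        final long now = System.currentTimeMillis();
        final long last = this.lastLogTimeMillis.get();

        // Log at most once per interval; the remaining notifications are silently dropped.
        if (now - last >= MIN_LOG_INTERVAL_MILLIS && this.lastLogTimeMillis.compareAndSet(last, now)) {
            log.warn("failed connecting to apple", cause);
        }
    }
}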

I guess the only thing Pushy could do is slow down when it encounters an UnresolvedAddressException; the connection will fail because of the negative cache for the next 10 seconds (or whatever value is configured) anyway, so there is little to gain by retrying. Of course, it's debatable whether this behaviour should be part of Pushy itself or the task of the calling application, which would probably benefit from a way to "slow down" Pushy's reconnecting, though.

@jchambers
Owner

slow down when it encounters an UnresolvedAddressException

Agreed; getting some kind of exponential backoff behavior for failed connections is high on our priority list right now.

@jchambers
Owner

It sounds like the only thing that's really going wrong here is that we're trying to reconnect too quickly and flooding the listener queue. Exponential back-off is happening in #169, so I'm going to go ahead and close this issue.
