New instance of PushManager can't restore connection when the previous one fails #143

Closed
maxim-volkov opened this issue Nov 21, 2014 · 12 comments

@maxim-volkov

A new instance of PushManager can't restore the connection when the previous one fails:

  1. Create an instance of PushManager
  2. Send a push notification (works fine)
  3. After the connection fails, stop the manager from "handleFailedConnection":
    pushManager.shutdown(10000); returns
    pushManager.shutdown(); blocks
  4. Restore the Internet connection
  5. Create a new instance of PushManager
  6. Send a push notification
  7. The same exception occurs.

Exception:

java.nio.channels.UnresolvedAddressException
    at sun.nio.ch.Net.checkAddress(Net.java:127)
    at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:644)
    at io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:193)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:199)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1029)
    at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:496)
    at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:481)
    at io.netty.handler.ssl.SslHandler.connect(SslHandler.java:386)
    at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:496)
    at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:481)
    at io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47)
    at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:496)
    at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:481)
    at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:463)
    at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:849)
    at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:202)
    at io.netty.bootstrap.Bootstrap$2.run(Bootstrap.java:165)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:380)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    at java.lang.Thread.run(Thread.java:745)
@jchambers
Owner

Restore Internet connection

To be clear, it sounds like you're simulating a failure by deliberately disabling your internet connection. Is that correct?

@aregr

aregr commented Dec 4, 2014

I confirm that this can happen in real-life situations from time to time if DNS is slow and a short DNS timeout has been configured in the OS. When the error occurs, the above error message is delivered as the Throwable cause to the callback:

public void handleFailedConnection(
        final PushManager<? extends SimpleApnsPushNotification> pushManager,
        final Throwable cause)

This callback is then repeated in an infinite loop, causing CPU usage to skyrocket...

Please fix this as it is kind of serious...

@maxim-volkov
Author

To be clear, it sounds like you're simulating a failure by deliberately disabling your internet connection. Is that correct?

Yes, that is correct.

@aregr

aregr commented Dec 4, 2014

BTW, I think the spinning CPU is because the ApnsConnection is not properly closed when an exception occurs before the SSL handshake has completed:

See code snippet here (this will not attempt to close the ApnsConnection unless handshake is done):

private class ApnsConnectionHandler ... {

    @Override
    public void channelInactive(final ChannelHandlerContext context) throws Exception {
        super.channelInactive(context);

        // Channel closure implies that the connection attempt had fully succeeded, so we only want to notify
        // listeners if the handshake has completed. Otherwise, we'll notify listeners of a connection failure (as
        // opposed to closure) elsewhere.
        if (this.apnsConnection.handshakeCompleted) {
            if (this.apnsConnection.listener != null) {
                this.apnsConnection.listener.handleConnectionClosure(this.apnsConnection);
            }
        }
    }
}

@jchambers
Owner

this will not attempt to close the ApnsConnection unless handshake is done

Unless I'm mistaken (which is entirely possible), won't channelInactive only be called once the connection has closed?

This callback is then repeated in an infinite loop

I thiiiiiink this is the expected behavior. If a connection attempt fails, we try to reconnect immediately. I would expect that it would "spin" until a connection is established. Part of the reason we expose connection failures via the listener is so callers can make a call about when they want to give up.
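
(For illustration, a caller-side listener along these lines could count failures and decide when to give up; this is a hypothetical sketch, with the import locations, the failure threshold, and the shutdown timeout all assumed rather than taken from this thread:)

import java.util.concurrent.atomic.AtomicInteger;

import com.relayrides.pushy.apns.FailedConnectionListener;
import com.relayrides.pushy.apns.PushManager;
import com.relayrides.pushy.apns.util.SimpleApnsPushNotification;

// Hypothetical sketch: shut the PushManager down after a fixed number of connection failures.
public class GiveUpAfterRepeatedFailuresListener implements FailedConnectionListener<SimpleApnsPushNotification> {

    private static final int MAX_FAILURES = 10; // made-up threshold

    private final AtomicInteger failureCount = new AtomicInteger(0);

    @Override
    public void handleFailedConnection(
            final PushManager<? extends SimpleApnsPushNotification> pushManager,
            final Throwable cause) {

        if (this.failureCount.incrementAndGet() >= MAX_FAILURES) {
            try {
                // Give up: stop the manager rather than letting it retry forever.
                // shutdown(long) may throw InterruptedException in this version of Pushy.
                pushManager.shutdown(10000);
            } catch (final Exception e) {
                // Re-assert the interrupt flag if the shutdown wait was interrupted.
                if (e instanceof InterruptedException) {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }
}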

In a nutshell, I think everything here is working as intended. @maxim-volkov to follow up on your original post, what behavior were you expecting instead?

@maxim-volkov
Author

In a nutshell, I think everything here is working as intended. @maxim-volkov to follow up on your original post, what behavior were you expecting instead?

The Internet (Ethernet) connection has been restored, but the new PushManager instance can't send notifications because the same exception occurs.

@jmason

jmason commented Dec 5, 2014

'I thiiiiiink this is the expected behavior. If a connection attempt fails, we try to reconnect immediately. I would expect that it would "spin" until a connection is established.'

I'm not convinced that spinning on the CPU without a sleep is appropriate, however -- typical best practice in this scenario would be to use an exponential backoff...
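
(For reference, the kind of delay policy being suggested, exponential backoff with an upper cap and jitter, might look roughly like the generic sketch below; this is not Pushy code, and all constants are arbitrary:)

import java.util.Random;

// Generic exponential-backoff helper with "full jitter"; a sketch of the kind of
// reconnect-delay policy being proposed, not something that exists in Pushy.
public final class ExponentialBackoff {

    private static final long BASE_DELAY_MILLIS = 500;   // delay before the first retry (arbitrary)
    private static final long MAX_DELAY_MILLIS = 60000;  // upper cap on the delay (arbitrary)

    private static final Random RANDOM = new Random();

    private ExponentialBackoff() {
    }

    // Delay to wait before the given (1-based) retry attempt.
    public static long delayBeforeAttempt(final int attempt) {
        // Double the base delay for each attempt, but never shift far enough to overflow.
        final long exponentialDelay = BASE_DELAY_MILLIS << Math.min(attempt - 1, 20);
        final long cappedDelay = Math.min(exponentialDelay, MAX_DELAY_MILLIS);

        // Full jitter: pick a random delay between 0 and the capped value to avoid
        // synchronized retry storms across connections.
        return (long) (RANDOM.nextDouble() * cappedDelay);
    }
}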

@jchambers
Owner

The Internet (Ethernet) connection has been restored.

Does anything work at that point, though? I don't mean to be difficult, but is it possible you're just retrying too early? What happens if you wait longer before trying to reconnect?

To show my hand, it sounds like this is just a weird issue where some part of the OS's networking stuff is a little slow to wake back up. It doesn't sound like there's anything we can or should do about that in Pushy itself.

typical best practice in this scenario would be to use an exponential backoff

@mdvorak asked for something similar in #102. I was against it then (on the grounds that callers should make decisions on their own), but I'm starting to come around on the idea. I think we should treat that as a separate feature request, though, and keep this discussion narrowly focused on the original issue.

@aregr

aregr commented Dec 5, 2014

I have succeeded in reproducing this problem so I have some additional information about it:

In my app, I initialize Pushy with the following snippet:

ApnsConnectionConfiguration apnsConfig = new ApnsConnectionConfiguration();
apnsConfig.setSendAttemptLimit(4);
apnsConfig.setCloseAfterInactivityTime(300);
apnsConfig.setGracefulShutdownTimeout(5);

PushManagerConfiguration pushConfig = new PushManagerConfiguration();
pushConfig.setConcurrentConnectionCount(4);
pushConfig.setConnectionConfiguration(apnsConfig);

pushManager = new PushManager<SimpleApnsPushNotification>(
        apnsEnvironment, SSLContextUtil.createDefaultSSLContext(cert, keyStorePassword), 
        null, null, null, pushConfig, "APNSPushManager");
pushManager.registerFailedConnectionListener(new MyFailedConnectionListener());

My failed connection listener looks like this:

private class MyFailedConnectionListener implements FailedConnectionListener<SimpleApnsPushNotification> {

    @Override
    public void handleFailedConnection(
            final PushManager<? extends SimpleApnsPushNotification> pushManager,
            final Throwable cause) {

        if (cause instanceof SSLHandshakeException) {
            SSLHandshakeException ex = (SSLHandshakeException) cause;
            log.warn("failed SSL handshake to apple", ex);
        } else {
            log.warn("failed connecting to apple", cause);
        }

        // Avoid busy wait scenario and log-spamming by simply waiting a bit
        try {
            Thread.sleep(1000);
        } catch (InterruptedException e) {
            log.info("Interrupted when sleeping in handleFailedConnection(), orig cause: {}", cause);
        }
    }
}

You can probably use a shorter setCloseAfterInactivityTime to reproduce the problem faster, btw...

Furthermore, to reproduce the problem quickly it is important to set a short DNS client TTL inside Java; otherwise you may not succeed in reproducing it easily. The most reliable way to set the DNS TTL in Java is to edit $JAVA_HOME/jre/lib/security/java.security and configure the properties:

networkaddress.cache.ttl=10
networkaddress.cache.negative.ttl=10
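
The same TTLs can also be set programmatically at JVM startup. These are Java security properties (not system properties), so they go through java.security.Security, and they need to be set early, since the cache policy is typically read only once:

import java.security.Security;

public class DnsCacheConfig {

    // Equivalent to editing java.security; call this before the first DNS lookup
    // in the JVM, since the cache policy is typically read only once.
    public static void useShortDnsTtl() {
        Security.setProperty("networkaddress.cache.ttl", "10");
        Security.setProperty("networkaddress.cache.negative.ttl", "10");
    }
}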

Now, to reproduce the issue, start the push server as normal. When everything is running, change your local DNS config (/etc/resolv.conf on Linux) and configure a non-responsive DNS server as the only DNS source. Make sure you disable DNS caching in the OS if you have it. This should trigger the UnresolvedAddressException the next time a DNS lookup is attempted.

If you do this, then within approximately 300 seconds you should have reproduced the problem; you will see it easily as 100% CPU utilisation and an avalanche of errors in the debug log if you have set the log level to DEBUG for the Pushy classes.

The errors you will see look like this:

2014-12-05 11:42:48,671 DEBUG nioEventLoopGroup-2-1 com.relayrides.pushy.apns.ApnsConnection.connect(ApnsConnection.java:441) - APNSPushManager-connection-32050 beginning connection process.
2014-12-05 11:42:48,671 DEBUG nioEventLoopGroup-2-1 com.relayrides.pushy.apns.ApnsConnection$2.operationComplete(ApnsConnection.java:493) - APNSPushManager-connection-32045 failed to connect to APNs gateway.
java.nio.channels.UnresolvedAddressException
        at sun.nio.ch.Net.checkAddress(Net.java:30)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:480)
        at io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:193)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:199)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1029)
        at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:496)
        at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:481)
        at io.netty.handler.ssl.SslHandler.connect(SslHandler.java:386)
        at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:496)
        at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:481)
        at io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47)
        at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:496)
        at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:481)
        at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:463)
        at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:849)
        at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:202)
        at io.netty.bootstrap.Bootstrap$2.run(Bootstrap.java:165)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:380)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
        at java.lang.Thread.run(Thread.java:662)
2014-12-05 11:42:48,671 DEBUG nioEventLoopGroup-2-1 com.relayrides.pushy.apns.ApnsConnection.connect(ApnsConnection.java:441) - APNSPushManager-connection-32051 beginning connection process.

Furthermore, if you quickly fix the broken DNS config while this goes on, then after a short while (~10 sec) some of the errors go away. At this point, you are still in an infinite loop where the handleFailedConnection callback is called every second in my case (because of the Thread.sleep(1000) inside my handler).

These errors look like this:

2014-12-05 11:42:49,487 WARN pool-3-thread-1 mypackage.ApnsPushSender$MyFailedConnectionListener.handleFailedConnection(ApnsPushSender.java:145) - failed connecting to apple
java.nio.channels.UnresolvedAddressException
        at sun.nio.ch.Net.checkAddress(Net.java:30)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:480)
        at io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:193)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:199)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1029)
        at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:496)
        at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:481)
        at io.netty.handler.ssl.SslHandler.connect(SslHandler.java:386)
        at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:496)
        at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:481)
        at io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47)
        at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:496)
        at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:481)
        at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:463)
        at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:849)
        at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:202)
        at io.netty.bootstrap.Bootstrap$2.run(Bootstrap.java:165)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:380)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
        at java.lang.Thread.run(Thread.java:662)

Apparently Pushy cannot recover from this state, since this logging continues even after DNS has been restored.

@TriPhoenix

We encounter this problem from time to time as well. The default TTL of 10 seconds for negative DNS queries means this issue remains visible for at least 10 seconds, even if the DNS outage only lasted for one query.

I think your delayed additional error messages are then likely remnants of the 10-second outage. It looks like Pushy uses an ExecutorService (defaulting to a single thread) to deliver the failed-connection notifications, which means Pushy itself keeps spinning quickly while your handler consumes the notifications more slowly. The errors are still generated as fast as possible but only processed/logged at one per second.
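
(One way to avoid that backlog on the caller side is to keep the handler non-blocking and rate-limit its own logging instead of sleeping; a hypothetical sketch, with slf4j assumed for logging, import locations assumed, and the one-second interval chosen arbitrarily:)

import java.util.concurrent.atomic.AtomicLong;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.relayrides.pushy.apns.FailedConnectionListener;
import com.relayrides.pushy.apns.PushManager;
import com.relayrides.pushy.apns.util.SimpleApnsPushNotification;

// Hypothetical sketch: never block in the listener; drop log lines that arrive faster
// than once per second so a burst of failures cannot back up the listener executor's queue.
public class RateLimitedFailedConnectionListener implements FailedConnectionListener<SimpleApnsPushNotification> {

    private static final Logger log = LoggerFactory.getLogger(RateLimitedFailedConnectionListener.class);

    private static final long MIN_LOG_INTERVAL_MILLIS = 1000; // arbitrary

    private final AtomicLong lastLogTimeMillis = new AtomicLong(0);

    @Override
    public void handleFailedConnection(
            final PushManager<? extends SimpleApnsPushNotification> pushManager,
            final Throwable cause) {

        final long now = System.currentTimeMillis();
        final long last = this.lastLogTimeMillis.get();

        // Log at most once per interval; the remaining notifications are silently dropped.
        if (now - last >= MIN_LOG_INTERVAL_MILLIS && this.lastLogTimeMillis.compareAndSet(last, now)) {
            log.warn("failed connecting to apple", cause);
        }
    }
}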

I guess the only thing Pushy could do is slow down when it encounters an UnresolvedAddressException; the connection will fail because of the negative cache for the next 10 seconds (or whatever value is configured) anyway, so there is little to gain by retrying. Of course, it's debatable whether this behaviour should be part of Pushy itself or the task of the calling application, which would probably benefit from a way to "slow down" Pushy's reconnecting, though.

@jchambers
Owner

slow down when it encounters an UnresolvedAddressException

Agreed; getting some kind of exponential backoff behavior for failed connections is high on our priority list right now.

@jchambers
Owner

It sounds like the only thing that's really going wrong here is that we're trying to reconnect too quickly and flooding the listener queue. Exponential back-off is happening in #169, so I'm going to go ahead and close this issue.
