Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WFLY-16383 mod_cluster: Do not register contexts when in suspend mode #15563

Closed
wants to merge 1 commit into from

Conversation

TomasHofman
Copy link
Contributor

This change prevents mod_cluster from registering web contexts when the
container is started in suspended mode. The result is, that when the
contexts remain unregistered, reverse proxy will keep serving 404
responses (rather than 503) on those contexts.

There has been previous attempts to achieve this behavior (WFLY-14121,
WFLY-13074).

https://issues.redhat.com/browse/WFLY-16383

This change prevents mod_cluster from registering web contexts when the
container is started in suspended mode. The result is, that when the
contexts remain unregistered, reverse proxy will keep serving 404
responses (rather than 503) on those contexts.

There has been previous attempts to achieve this behavior (WFLY-14121,
WFLY-13074).
@github-actions github-actions bot added the deps-ok Dependencies have been checked, and there are no significant changes label May 18, 2022
@TomasHofman
Copy link
Contributor Author

@rhusar @pferraro / @soul2zimate @spyrkob would you like to review this?

I found these previous PRs that are related.

https://github.com/jbossas/jboss-eap7/pull/3411
https://github.com/jbossas/jboss-eap7/pull/4159

From what I understand from reading the customer case and the previous tickets, the desired behavior is such:

  • When no nodes serving given web context are present, proxy should respond with 404.
  • When all present nodes serving given context are suspended, proxy should still respond with 404, meaning that the web context should not be registered on the proxy. Once it is registered, proxy is going respond with 503, until a node announces itself as active. This is what the previous PRs were trying to fix, and it works only partially now (you still get 503 sometimes, even though the container is suspended).

The reason why it only works partially, is that the handler.add(context) call is called always in the UndertowEventHandlerAdapterService#onStart():

 private synchronized void onStart(Context context) {
        ContainerEventHandler handler = this.configuration.getContainerEventHandler();

        handler.add(context);

        State state = this.configuration.getSuspendController().getState();
        if(state == State.RUNNING) {
            handler.start(context);
        }
}

The handler.add(context) call results in DISABLE-APP & STOP-APP requests being sent to a proxy, and these requests cause context registration. Whether these MCMP requests are in fact sent, depends on whether proxy connections are established at the moment or not, so it's a timing issue.

My suggestion is to move the handler.add(context) call into the if(state == State.RUNNING) condition block, where, in effect, it would be called only when an app is deployed on already active server. For the normal startup/reload situation, the handler.add(context) call would be added to the resume() method (this method is called just before the container enters RUNNING state, i.e. not in suspend mode):

    public void resume() {
        ContainerEventHandler handler = this.configuration.getContainerEventHandler();
        for (Context context : this.contexts) {
            handler.add(context);
            handler.start(context);
        }
        ...
    }

@pferraro
Copy link
Contributor

pferraro commented May 24, 2022

When no nodes serving given web context are present, proxy should respond with 404.

This does not make sense to me. Such a condition should result in a 503, not a 404.
The primary feature of mod_cluster is its deployment-awareness, such that an unavailable application is distinguishable from an invalid path (or missing page).

@TomasHofman
Copy link
Contributor Author

@pferraro I would say you are objecting to the other point then you quoted. Let me rephrase.

Current behavior:

  1. No node serves given deployment => context is not registered on proxy => proxy returns 404;
  2. At least one node serves given deployment, but is not ready to respond to requests (startup not finished, in error, suspended before shutdown, ...) => context is registered, but node is considered in ERROR state => proxy returns 503;
  3. At least one node serves given deployment and is ready to respond => context is registered, node is considered READY => proxy serves the app.

This to me is in line with what you say about deployment-awareness, unavailable URL is distinguishable from unknown URL, and this would stay.

What this PR changes, is that when a node is started in suspend mode, i.e. we know it will never be able to respond to requests until it is restarted, then it wouldn't register its deployments to the proxy.

@pferraro
Copy link
Contributor

A server that was started, and then suspended should be in the same LB state as a server started in suspended mode. A suspended server is meant to be distinguishable from a worker that is not started. A suspended server should quickly made available by an ENABLE-APP command. It seems to me that this proposed change would break that.

@TomasHofman
Copy link
Contributor Author

Yes, it's true that a server that was started and then suspended would result in the app context being registered, while a server started in suspended context would not result in context registration. Context registration would happen when the server resumed for the first time (via this call: https://github.com/wildfly/wildfly/pull/15563/files#diff-35a73bec63a02dacc2ca8f93fde1fa881a68eabbba27c45bf0171fe2688106a8R212)

@pferraro
Copy link
Contributor

@TomasHofman We must establish the state of the server with the LB on startup, regardless of whether the server is suspended or not, since it may have previous crashed, in which case we must make sure to disable any contexts.

Copy link
Member

@rhusar rhusar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👎 for the changes.

  1. The proxy state should be the same regardless whether the server was started in suspended mode or got into the suspend mode eventually.
  2. The 404 is a state when the resource does not exist however in the suspended situation the context does exist but not receiving requests in which case the 503 is the correct response. Should a different handling be required, the onus is on the proxy configuration how it handles the situation.
  3. The purpose of the suspended state is to resume operation as quickly as possible while the proposed changes would require an additional add operation for the proxy.

@TomasHofman
Copy link
Contributor Author

@pferraro @rhusar sorry for delay and thanks for all your input. In the mean time I reviewed the previous issues and did some more experiments.

Firstly, I did misunderstood the purpose of the previous issues (I had some bias from reading the customer case):

The purpose of those issues was not to prevent context registration when server is suspended (as I thought it was), but to prevent the contexts from being enabled when server is suspended. Apologies for that. However, one of the reasons I got the wrong idea is that current Wildfly do behave the wrong way in certain configuration. That I think confused also our support.

Reproducers:

Case 1)

Modcluster on the worker node is configured with advertise=true and the proxies attribute is not defined (no preconfigured proxy connections) - the default Wildfly configuration. When the worker node starts in suspend mode, the contexts are not registered and proxy keeps returning 404 for application contexts. This I took for the intended behavior, but it's the wrong one.

Case 2)

Modcluster on the worker node is configured with advertise=false and the proxies attribute is configured with outbound connection the the load balancer. In this configuration the application contexts are registered when starting in suspend node, and the proxy starts correctly returning 503.

I think this difference in behaviors is a timing issue, caused by the fact that in the first case, the connection to the proxy is (most often) not yet established at the moment when server attempts to send the DISABLE-APP / STOP-APP messages. After the connection gets established, the messages are not sent either, so the proxy keeps serving 404. That is one thing that maybe deserves a fix.

Also, the customer still seems to have the expectation that 404 is the right behavior with suspended server, so this is to be cleared up with support.

This PR in current state is wrong in any case.

@rhusar
Copy link
Member

rhusar commented May 27, 2022

@TomasHofman A different resulting behaviour depending on proxy discovery configuration is definitely a bug (due to handling of container events). Please open an issue for that.

BTW regarding the previous issues you linked, if I am remembering correctly, the WFLY-13074 dealt with the initial suspend mode support (when a simple call to suspend was just unhandled by the mod_cluster subsystem) and WFLY-14121 dealt with a subsequent fix around starting or restarting suspended.

@TomasHofman
Copy link
Contributor Author

Follow up ticket: https://issues.redhat.com/browse/WFLY-16416

@TomasHofman
Copy link
Contributor Author

Closing this PR, incorrect behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
deps-ok Dependencies have been checked, and there are no significant changes
Projects
None yet
3 participants