Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eureka: DOWN status because of /pause not propagated to registry when "eureka.client.healthcheck.enabled: true" (Brixton.SR7) #1571

Closed
brenuart opened this issue Dec 22, 2016 · 20 comments · Fixed by #3999

Comments

@brenuart
Copy link
Contributor

The /pause management endpoint stops the application context.
EurekaDiscoveryClientConfiguration sets the InstanceStatus to DOWN in reaction to the context stop event as follows:

this.applicationInfoManager.setInstanceStatus(InstanceStatus.DOWN);

This status change triggers an onDemandeUpdate() on the InstanceInfoReplicator to propagate the new status back to the Eureka registry. That operation starts with a call to discoveryClient.refreshInstanceInfo() followed by discoveryClient.register() (the latter is supposed to update the status on the registry).

Unfortunately, the call to refreshInstanceInfo() overrides the new status with the value returned by the HealthCheckHandler if one is provided:

    void refreshInstanceInfo() {
        applicationInfoManager.refreshDataCenterInfoIfRequired();
        applicationInfoManager.refreshLeaseInfoIfRequired();

        InstanceStatus status;
        try {
            status = getHealthCheckHandler().getStatus(instanceInfo.getStatus());
        } catch (Exception e) {
            logger.warn("Exception from healthcheckHandler.getStatus, setting status to DOWN", e);
            status = InstanceStatus.DOWN;
        }

        if (null != status) {
            applicationInfoManager.setInstanceStatus(status);
        }
    }

To summarize:

  • /pause stops the application context
  • EurekaDiscoveryClientConfiguration.stop() is invoked and changes the InstanceStatus to DOWN
  • the InstanceInfoReplicator.onDemandeUpdate() method is invoked as listener of the InstanceInfo status
  • the HealthCheckHandler is invoked and returns an OK status
  • the OK status overrides the DOWN status set because of /pause
  • when the register() method is called, the status is back to OK and nothing is updated on the registry

This happens ONLY if Eureka is configured with eureka.client.healthcheck.enabled: true.
Happens on Brixton.SR7

@ryanjbaxter
Copy link
Contributor

Thinking outload here.....
What if the HealthCheckHandler took into account the app has been "paused" and returned DOWN?

@brenuart
Copy link
Contributor Author

That's basically the workaround I'm busy with atm. Two approaches:
1/ Customize the HealthCheckHandler to take the application context status into account and return DOWN unconditionally if it is stopped.
2/ Add an extra health endpoint reporting the status of the application context.

The latter approach has the extra benefit of of providing a consistent view between the health endpoint and the status reported in Eureka.

@ryanjbaxter
Copy link
Contributor

Would this be something you are willing to submit a PR for?

@brenuart
Copy link
Contributor Author

brenuart commented Jan 2, 2017

Our approach was to modify the EurekaHealthCheckHandler such that it returns a null status if the spring application context is stopped (paused). The status replicator explicitly checks for a null response from the health check handler and simply fallback to the current InstanceStatus as reported by the status manager.

This approach seems to be the the least invasive solution:

  • the context is stopped --> no actual health status can be derived (some components may be stopped and unable to safely report their health)
  • the decision about the actual status to report to the registry is the responsibility of the Eureka layer. It wouldn't have been the case anymore if the HealthCheckHandler had reported an actual status. Moreover, we would open the door to (potential) different behaviour depending on whether eureka.client.healthcheck.enabled is true or false.

I can submit a PR if all this looks reasonable to you.

@ryanjbaxter
Copy link
Contributor

Yeah lets see the the PR so we can review it, thanks!

@eacdy
Copy link
Contributor

eacdy commented Jan 11, 2017

Happens in Camden.SR3.

@eacdy
Copy link
Contributor

eacdy commented Nov 26, 2017

Happens in Edgware.RELEASE

@brenuart
Copy link
Contributor Author

brenuart commented Dec 14, 2017

@eacdy Indeed... but things have changed since Dalston. Here is what we introduce in the context when eureka.client.healthcheck.enabled=true:

public class Gh1571EurekaHealthCheckHandler extends EurekaHealthCheckHandler implements Ordered, Lifecycle {
	
	/**
	 * {@code true} when the context stop sequence has been initiated.
	 */
	private boolean stopping = false;
	
	public Gh1571EurekaHealthCheckHandler(HealthAggregator healthAggregator) {
		super(healthAggregator);
	}
	
	@Override
	public InstanceStatus getStatus(InstanceStatus instanceStatus) {
	    // Return nothing if the context is being stopped so the status held by the 
	    // InstanceInfo remains unchanged.
		if( this.stopping ) {
			return null;
		}
		else {
			return super.getStatus(instanceStatus);
		}
	}

	@Override
	public int getOrder() {
		return Ordered.HIGHEST_PRECEDENCE; // Before EurekaAutoServiceRegistration!
	}

	@Override
	public void start() {
		this.stopping = false;
	}

	@Override
	public void stop() {
		this.stopping = true;
	}

	@Override
	public boolean isRunning() {
		return true;
	}
}

Some words about it:

  • wraps the EurekaHealthCheckHandler registered by SpringCloud to return null if the context is stopped or stopping
  • registered with a high order priority so the close() method is invoked early and at least before EurekaAutoServiceRegistration (must be in effect when the registration is closed and the eureka replication triggered -> health check handler is consulted at that moment)
  • cannot rely on context.isRunning() --> the status is updated after all disposable beans are disposed --> unregistration has already happened

This workaround may not be very clean, but it seems to work until we find a better solution ;-)
@ryanjbaxter If that looks good, I can submit a PR to integrate the approach directly inside the existing EurekaHealthCheckHandler.

@ryanjbaxter
Copy link
Contributor

Please submit the PR

brenuart added a commit to brenuart/spring-cloud-netflix that referenced this issue Jan 4, 2018
See spring-cloud#1571 (comment) for more information about the rational behind the changes.
@rafaelrddc
Copy link

Good Morning,

I am using eureka in the latest version and the problem still persists, any news?

I can submit a PL with the correction

@ryanjbaxter
Copy link
Contributor

I can submit a PL with the correction

Do you mean PR?

@brenuart has already submitted one.

@rafaelrddc
Copy link

Do you mean PR?

That's right.

@ryanjbaxter can you tell me which version of this fix will exit?

@ryanjbaxter
Copy link
Contributor

See the PR. We are waiting for @brenuart to address the comments there.

@brenuart
Copy link
Contributor Author

brenuart commented Oct 5, 2018

Damn.. I completely forgot about that issue :(
@ryanjbaxter what comments do you want me to address?
For what I remember, we are still using the approach described here with Edgware.SR4.

@rafaelrddc
Copy link

@brenuart are you using any palliative solution ???

@brenuart
Copy link
Contributor Author

brenuart commented Oct 6, 2018

Yes, the one described here above. It is added to the context with the following configuration:

	/**
	 * Bug fix: Do not consider Health info if context is stopped
	 */
	@Configuration
	@ConditionalOnProperty(value = "eureka.client.healthcheck.enabled", matchIfMissing = false)
	static class EurekaHealthCheckHandlerConfiguration {

		@Autowired(required = false)
		private HealthAggregator healthAggregator = new OrderedHealthAggregator();

		@Bean
		public EurekaHealthCheckHandler eurekaHealthCheckHandler() {
			return new Gh1571EurekaHealthCheckHandler(healthAggregator);
		}
	}

@spencergibb
Copy link
Member

spencergibb commented Jan 31, 2019

PRs welcome to add the Gh1571EurekaHealthCheckHandler (with a more sensible name of course)

@jfougere
Copy link

Any news on this issue ? Any chance to get the PR fixed ?

@spencergibb
Copy link
Member

there has been no PR

@jfougere
Copy link

Yeah I saw there was one but not finished. Not sure what was missing.
But the issue is still present and does not allow to use the /pause endpoint while using the healthcheck as source of status for the registry.

Would you accept a new PR ?

OlgaMaciaszek pushed a commit that referenced this issue May 24, 2021
See #1571 (comment) for more information about the rational behind the changes.
@OlgaMaciaszek OlgaMaciaszek added this to the 2.2.9.RELEASE milestone May 24, 2021
@OlgaMaciaszek OlgaMaciaszek self-assigned this May 24, 2021
OlgaMaciaszek added a commit that referenced this issue May 24, 2021
* Fix for #1571

See #1571 (comment) for more information about the rational behind the changes.

Original contributor: Bertrand Renuart <bertrand.renuart@itma.lu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment