Documentation: changing Eureka renewal frequency *WILL* break the self-preservation feature of the server #373

Open
brenuart opened this Issue Jun 2, 2015 · 16 comments

Contributor

brenuart commented Jun 2, 2015

In the section "Why is it slow to register a Service?" of the documentation, it is said that one can speed up the client registration process by shortening the heartbeat interval (the default is 30 seconds).
The documentation also says this might not be a good idea in production, without explaining the consequences.

One should be aware that the self-preservation feature of the Eureka server assumes clients send their heartbeat every 30 seconds, and this is not configurable. Using a different value will therefore break that functionality. It is definitely not a good idea to play with that parameter...

Contributor

dsyer commented Jun 8, 2015

It's fine for demos and development work (where you often want a faster turnaround and don't care about self-preservation mode). If you'd like to explain what's happening, it would be great to see a pull request for the documentation (it's right there on GitHub next to the source code).

@dsyer dsyer added the documentation label Jun 8, 2015

Contributor

brenuart commented Jun 8, 2015

I understand, but before asking for changes in the documentation I think my analysis needs to be reviewed by someone with a better understanding of Eureka's internals, to make sure the behaviour and consequences I describe are correct.

The point is that the Eureka server makes the implicit assumption that clients send their heartbeat at a fixed rate of 1 every 30 seconds. If two instances are registered in the registry, the server expects to receive 2 instances * 1 heartbeat/30s * threshold % every minute. With a threshold set at 85%, that is 4 * 0.85 = 3.4, rounded down to 3 heartbeats in the last minute. If the rate drops below this value, the self-preservation mode is activated. If you lose one of the two instances, the server receives at most two heartbeats per minute and activates the self-preservation mode.

Now, if clients send their heartbeats twice as fast (every 15s), the server receives 8 heartbeats per minute and keeps receiving 4/min if you lose one of the two instances. Hence the self-preservation mode is not activated...

This example shows the consequence of using a heartbeat interval other than 30s: it breaks the self-preservation mechanism.
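
To make the arithmetic above concrete, here is a minimal sketch in Python. It is a toy model of the behaviour described in this comment, not Eureka's actual code: the function names are made up for illustration, the assumed rate of 2 renewals per instance per minute and the 85% threshold come from the text, and the strict "below the threshold" comparison is an assumption.

```python
import math

# Server-side assumption described above: one heartbeat per instance every 30 s.
ASSUMED_RENEWALS_PER_INSTANCE_PER_MINUTE = 2
RENEWAL_PERCENT_THRESHOLD = 0.85


def renewal_threshold(registered_instances):
    """Heartbeats per minute the server expects, rounded down."""
    expected = registered_instances * ASSUMED_RENEWALS_PER_INSTANCE_PER_MINUTE
    return math.floor(expected * RENEWAL_PERCENT_THRESHOLD)


def actual_renewals_per_minute(alive_instances, heartbeat_interval_s):
    """Heartbeats per minute the server really receives."""
    return alive_instances * (60 // heartbeat_interval_s)


def self_preservation_triggers(registered, alive, heartbeat_interval_s):
    """True if the received rate falls below the expected threshold."""
    received = actual_renewals_per_minute(alive, heartbeat_interval_s)
    return received < renewal_threshold(registered)


# Two registered instances, one of them lost:
print(self_preservation_triggers(2, 1, 30))  # default interval
print(self_preservation_triggers(2, 1, 15))  # heartbeats twice as fast
```

With the default 30s interval the surviving instance sends 2 heartbeats per minute, which is below the threshold of 3. With a 15s interval it sends 4, so the threshold is never crossed even though half of the instances are gone.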

The initial registration is actually triggered by the first heartbeat: the client tries to send the first heartbeat and receives a "not found" answer from the server, which means the server doesn't know the instance. The client then immediately attempts to register the instance. This process only happens 30s (eureka.instance.leaseRenewalIntervalInSeconds) after startup - hence the extra delay before the instance shows up in the registry.

You can always speed up the initial registration by lowering the value of eureka.client.initialInstanceInfoReplicationIntervalSeconds. This parameter controls the initial delay before the client transmits changes made to the InstanceInfo (like the UP/DOWN status). Because of how it is initialized, the InstanceInfo is always dirty after startup - so the client will always try to replicate it at least once. Since this replication is implemented by re-registering the instance against the server, using a very low value for the initial delay will trigger the instance registration sooner... without changing the heartbeat interval, and therefore without affecting the self-preservation mode.

During demo/development, if you want to detect "dead" instances more quickly, I would suggest adjusting the eureka.instance.leaseExpirationDurationInSeconds parameter instead. The value is set to 90s by default, which means a lease expires after 3 consecutive missed heartbeats. This is of course pretty long during demo/dev, but you can always lower it to 30s - and this won't affect the self-preservation feature.

Hope all of this makes sense.

Contributor

dsyer commented Jun 8, 2015

Totally makes sense, but I'm not sure there is anyone with a better understanding of Eureka internals at this point (at least not a regular visitor to this project). Even the people I know at Netflix probably won't want to go into any more detail than that (most people just use it after all). We need some of your analysis in the documentation really, plus some sensible guidelines, and defaults that don't stop people from making progress quickly when they are getting started.

Contributor

brenuart commented Jun 8, 2015

Does that mean you feel as lonely as me?
;-)

Contributor

brenuart commented Jun 9, 2015

Here is a first attempt to answer the question "Why does it take so long to register an instance with Eureka?"
Tell me if you think this kind of information should go into the documentation. If so, I'll try to find some time to add more details and examples, together with "recommendations" on how to speed things up where possible. However, it would be nice if other people could challenge the content... it is only what I understood, after all. My English isn't great either, so please rephrase where needed.

(1) Client Registration

When using the default configuration, registration happens at the first heartbeat sent to the server. Since the client has just started, the server doesn't know anything about it and replies with a 404, forcing the client to register. The client then immediately issues a second call with all the registration information. The client is now registered.

The first heartbeat happens 30 seconds after startup (eureka.instance.leaseRenewalIntervalInSeconds) - so your instance won't appear in the Eureka registry before this interval has elapsed.
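
The heartbeat-then-register exchange described in (1) can be sketched as follows. This is a toy in-memory model, not the Eureka client: the FakeRegistry class and the send_heartbeat function are hypothetical names introduced only for illustration.

```python
class FakeRegistry:
    """In-memory stand-in for the Eureka server (hypothetical)."""

    def __init__(self):
        self.instances = set()

    def renew(self, instance_id):
        # A heartbeat for an unknown instance is answered with 404.
        return 200 if instance_id in self.instances else 404

    def register(self, instance_id):
        self.instances.add(instance_id)
        return 204


def send_heartbeat(registry, instance_id):
    """Send one heartbeat and register on 404. Returns the calls made."""
    calls = ["renew"]
    if registry.renew(instance_id) == 404:
        registry.register(instance_id)
        calls.append("register")
    return calls


registry = FakeRegistry()
print(send_heartbeat(registry, "svc-1"))  # first heartbeat after startup
print(send_heartbeat(registry, "svc-1"))  # every later heartbeat
```

Only the very first heartbeat results in two calls; once the instance is known, a heartbeat is a plain renewal.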

(2) Server ResponseCache

The server maintains a response cache that is updated every 30s by default (eureka.server.response-cache-update-interval-ms). So even if your instance has just registered, it won't immediately appear in the result of a call to the /eureka/apps REST endpoint.

However, your instance may appear in the web UI right after registration. This is because the web front-end bypasses the response cache used by the REST API...

If you know the instanceId, you can still get some details about it from Eureka by calling /eureka/apps/<appName>/<instanceId>. This endpoint doesn't make use of the response cache. But since it requires knowing the instance beforehand, it is of no help in the discovery process...

So, it may take up to another 30s for other clients to discover your newly registered instance.
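
The effect of the response cache in (2) can be illustrated with a small simulation. This is only a sketch of the idea (a snapshot refreshed on a timer), not the server's real ResponseCache; the class, its attributes and the explicit tick() clock are assumptions made for the example.

```python
class CachedRegistry:
    """Toy registry whose read endpoint serves a periodically refreshed snapshot."""

    def __init__(self, refresh_interval_s=30):
        self.refresh_interval_s = refresh_interval_s
        self.live = set()            # what the web UI would show
        self.snapshot = frozenset()  # what /eureka/apps would return
        self.last_refresh_s = 0

    def register(self, instance_id):
        self.live.add(instance_id)

    def tick(self, now_s):
        # Rebuild the snapshot once the refresh interval has elapsed.
        if now_s - self.last_refresh_s >= self.refresh_interval_s:
            self.snapshot = frozenset(self.live)
            self.last_refresh_s = now_s

    def apps(self):
        return self.snapshot


reg = CachedRegistry()
reg.register("svc-1")
reg.tick(10)
print("svc-1" in reg.live, "svc-1" in reg.apps())  # registered, not yet served
reg.tick(30)
print("svc-1" in reg.apps())                       # served after the refresh
```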

(3) Client cache refresh

The Eureka client maintains a cache of the registry information. This cache is refreshed every 30 seconds by default (eureka.client.registryFetchIntervalSeconds). So again, it may take another 30s before a client decides to refresh its local cache and discover newly registered instances.

(4) LoadBalancer refresh

The load balancer used by Ribbon gets its information from the local Eureka client. It also maintains a local cache to avoid calling the discovery client for every request. This cache is refreshed every 30s (ribbon.serverListRefreshInterval). So again, it may take another 30s before your Ribbon client can make use of the newly registered instance.

Note: this local cache is apparently required only to reduce the cost of obtaining server information from the ServerList in use. That cost does not seem to apply to any of the server lists provided by default: DiscoveryEnabledNIWSServerList with Eureka, ConfigurationBasedServerList without.

In the end, if you are unlucky, it may take up to 2 minutes before your newly registered instance starts receiving traffic from other clients.
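
The 2-minute figure is simply the sum of the four default intervals listed above, assuming each timer fires as late as possible. A minimal sketch (the stage labels and the override value are illustrative, not part of any API):

```python
# Default worst-case wait contributed by each stage, in seconds.
DEFAULT_DELAYS_S = {
    "first heartbeat / registration": 30,
    "server response cache": 30,
    "client registry fetch": 30,
    "Ribbon server list refresh": 30,
}


def worst_case_delay_s(delays):
    """Total delay if every stage waits for its full interval."""
    return sum(delays.values())


print(worst_case_delay_s(DEFAULT_DELAYS_S))  # 120 seconds, i.e. 2 minutes

# Hypothetical dev setup where only the client registry fetch is lowered to 5 s.
dev_delays = dict(DEFAULT_DELAYS_S, **{"client registry fetch": 5})
print(worst_case_delay_s(dev_delays))  # 95 seconds
```

Lowering a single interval only removes that stage's share of the delay, which is why all four have to be tuned to get close to instant discovery.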

@spencergibb spencergibb added Backlog and removed Backlog labels Nov 17, 2015

andrewserff Nov 24, 2015

I'm trying to figure out how to get the delay between server start and registration in Zuul as low as possible, in development only. With the defaults, we are stuck twiddling our thumbs for like 2 min every time we restart a service and then want to hit it through our Zuul proxy. The information from @brenuart has helped a little, but I still can't seem to get it right. I would love some help on getting a configuration for spring.profiles: dev that brings the whole registration process down to a few seconds. This is what I've tried so far:
Zuul Config:

spring:
    profiles: dev

eureka:
    instance:
        registryFetchIntervalSeconds: 1
        leaseRenewalIntervalInSeconds: 2
        leaseExpirationDurationInSeconds: 5
    client:
        initialInstanceInfoReplicationIntervalSeconds: 5
ribbon:
    ServerListRefreshInterval: 1000

Service Config:

spring.profiles: dev
eureka:
    instance:
        registryFetchIntervalSeconds: 1
        leaseRenewalIntervalInSeconds: 2
    client:
        initialInstanceInfoReplicationIntervalSeconds: 5

Right now, Zuul isn't fully noticing that the service is down (I just get a blank page and not a forwarding error like usual). Maybe once we figure this out, it could be added to the documentation that @brenuart wrote? If I should post this over on SO instead, let me know.

Also, @brenuart, I think your documentation is much better than what's there, but it would be great to add all the options (like eureka.client.initialInstanceInfoReplicationIntervalSeconds) to the documentation.

Contributor

brenuart commented Nov 24, 2015

Not much time at the moment, but I'll try to extend the coverage of "my" doc. Maybe I should coordinate with @spencergibb to incorporate those few lines into the official doc and have them reviewed.

@spencergibb spencergibb added the ready label Dec 2, 2015

Member

spencergibb commented May 23, 2016

Related docs #203

MrAnsonWang Sep 19, 2016

I just ran a test on my local machine with the following extreme configuration:

Eureka server:

server:
    port: 8761

eureka:
    instance:
        hostname: localhost
    server:
        response-cache-update-interval-ms: 500
        eviction-interval-timer-in-ms: 500
    client:
        register-with-eureka: false
        fetch-registry: false
        service-url:
            default_zone: http://${eureka.instance.hostname}:${server.port}/eureka/

Two service instances:

server:
    port: 1111

spring:
    application:
        name: Compute-Service

eureka:
    instance:
        hostname: localhost
        prefer-ip-address: true
        lease-renewal-interval-in-seconds: 1
        lease-expiration-duration-in-seconds: 2
    client:
        initial-instance-info-replication-interval-seconds: 0
        instance-info-replication-interval-seconds: 1
        registry-fetch-interval-seconds: 1
        service-url:
            default-zone: http://localhost:8761/eureka/

server:
    port: 2222

spring:
    application:
        name: Compute-Service

eureka:
    instance:
        hostname: localhost
        prefer-ip-address: true
        lease-renewal-interval-in-seconds: 1
        lease-expiration-duration-in-seconds: 2
    client:
        initial-instance-info-replication-interval-seconds: 0
        instance-info-replication-interval-seconds: 1
        registry-fetch-interval-seconds: 1
        service-url:
            default-zone: http://localhost:8761/eureka/

What I observed: if one of the services went down, the other service got the updated instance list from the Eureka server quickly, in no more than 5 seconds in my environment. I think this is suitable for dev/test purposes.

But I wonder whether there is a simple way to change the Ribbon serverListRefreshInterval value in a Spring Boot application? @brenuart

Contributor

asarkar commented Jan 17, 2017

@brenuart I've a related question for which I created a ticket in the Eureka repo. What does "self-preservation" mean actually and does it work the same way for Eureka peers vs. clients?

Netflix/eureka#890

Also see, #1627

Contributor

brenuart commented Jan 18, 2017

Self-preservation is a mechanism by which the Eureka registry stops expiring entries when it detects that a significant number of services didn't renew their lease in time. This should protect the registry from clearing all entries when a (partial) network failure occurs.
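
As a rough illustration of that mechanism, the sketch below gates lease expiration on the overall renewal rate. It is a simplified model under stated assumptions, not Eureka's implementation: the function and parameter names are hypothetical, and the exact comparison used by the server may differ.

```python
def leases_to_evict(last_renewal_s, now_s, lease_duration_s,
                    renewals_last_minute, renewal_threshold):
    """Return the ids of the instances whose leases would be expired at now_s."""
    if renewals_last_minute < renewal_threshold:
        # Self-preservation: too few renewals overall, so expire nothing.
        return []
    return sorted(
        instance_id
        for instance_id, renewed_at in last_renewal_s.items()
        if now_s - renewed_at > lease_duration_s
    )


last_renewal_s = {"a": 0, "b": 95}  # seconds at which each lease was last renewed

# Enough renewals in the last minute: the stale lease "a" is expired.
print(leases_to_evict(last_renewal_s, 100, 90, 4, 3))
# Too few renewals: self-preservation keeps every entry.
print(leases_to_evict(last_renewal_s, 100, 90, 2, 3))
```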

Contributor

asarkar commented Jan 21, 2017

I've written a blog post with the details of Eureka here; it fills in some details missing from the Spring docs and the Netflix blog. It is the result of several days of debugging and digging through source code.

Member

spencergibb commented Jan 25, 2017

@asarkar would you mind if we (or you via PR) integrate your information into our documentation?

Contributor

asarkar commented Jan 25, 2017

@spencergibb I can PR it, but there are some areas that I'd like to see elaborated, especially regions and zones. My post also links to 2 open tickets that I believe should be answered/closed first, because my post references them.

jonashackt added a commit to jonashackt/cxf-spring-cloud-netflix-docker that referenced this issue Jun 24, 2017

sunnykaka Sep 15, 2017

@brenuart Your configuration is not correct; the right one is ribbon.ServerListRefreshInterval.
