Consider instance upgraded when health check passes for it #3294

Closed
alena1108 opened this issue Jan 15, 2016 · 16 comments
Labels
internal kind/enhancement Issues that improve or augment existing functionality


@alena1108

During an in-service upgrade with startFirst=true, the old instance gets destroyed right after the upgraded one comes up. Today we consider an instance to be up when it goes to the Running state. For some applications the Running state is not an indication of "up"; the health check should really be that indication.

At the very least, we should make this option configurable.

@ibuildthecloud @vincent99
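
For reference, a minimal rancher-compose.yml sketch of the setup being discussed (a hypothetical service; the image, port, and request_line are placeholders, and the field names follow the health_check / upgrade_strategy syntax shared later in this thread):

web:
  scale: 1
  upgrade_strategy:
    # startFirst=true: bring the new container up before stopping the old one
    start_first: true
  health_check:
    port: 8080
    interval: 2000
    unhealthy_threshold: 3
    request_line: GET "/" "HTTP/1.0"
    healthy_threshold: 2
    response_timeout: 2000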

@alena1108 alena1108 added the kind/enhancement Issues that improve or augment existing functionality label Jan 15, 2016
@alena1108 alena1108 self-assigned this Jan 15, 2016
@alena1108 alena1108 added this to the Release 1.0 milestone Jan 15, 2016
@etlweather

Just to clarify: in my forum post, I mention that the state in the UI is "Initializing", not "Running", because it does not have a successful health check yet.

@alena1108
Author

@etlweather the state in the UI is a combination of the instance state and the health state. So an instance in the Running state with healthcheck = initializing is represented as Initializing in the UI.

@etlweather

Ah, got it.

@vincent99
Contributor

It seems like this should be the default, if not only, behavior for a service with a healthCheck.

@alena1108
Author

@sangeethah @soumyalj it works the same way for startFirst=true and startFirst=false

@soumyalj

@alena1108 : Tested with 2 containers, healthcheck enabled, and batch size 1 on master. After an upgrade, the second container starts within 25s of the first one starting. It is not supposed to start until the first one is in a healthy state.

mysql> select id, name, created, state, health_state from instance where name like "%testupgrade%";
+-----+--------------------------+---------------------+---------+--------------+
| id  | name                     | created             | state   | health_state |
+-----+--------------------------+---------------------+---------+--------------+
| 222 | Default_testupgradebug_1 | 2016-06-15 00:34:43 | stopped | initializing |
| 223 | Default_testupgradebug_2 | 2016-06-15 00:34:43 | stopped | initializing |
| 224 | Default_testupgradebug_1 | 2016-06-15 00:37:10 | running | initializing |
| 225 | Default_testupgradebug_2 | 2016-06-15 00:37:32 | running | initializing |
+-----+--------------------------+---------------------+---------+--------------+

@alena1108
Author

@soumyalj with the fix I've applied, even the first batch won't be upgraded until all instances are healthy. So the steps to test will be:

  1. Create a service with a valid health check. Wait until all instances are healthy.
  2. Upgrade to a config with an invalid health check (pointing at a non-existing page); see the sketch after these steps.
  3. Make sure the upgrade gets stuck once the first upgraded instance is stuck in the initializing state. Go to that instance, make it healthy by creating the page, then verify that the upgrade is now performed for the second instance.
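
A sketch of what step 2's invalid health check might look like (assuming the same health_check syntax used later in this thread; the /does-not-exist path is a placeholder for a page the app does not serve):

web:
  scale: 2
  health_check:
    port: 8080
    interval: 2000
    unhealthy_threshold: 3
    # Point the check at a page that does not exist yet, so the upgraded
    # instance stays in the initializing health state until the page is created
    request_line: GET "/does-not-exist" "HTTP/1.0"
    healthy_threshold: 2
    response_timeout: 2000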

@soumyalj

Followed the above steps and verified in v1.1.0-dev5-rc2.

@mariusstaicu

Upgraded to Rancher 1.1.0 and the behaviour is the same.

I have one running instance for a service. I perform an in-service upgrade and the old instance gets stopped right after the new instance enters the Initializing state, not after it becomes green. Is there any setting that needs to be changed for this to work?

@tcdev0

tcdev0 commented Aug 3, 2016

@wstudios2009 same for me with rancher 1.1.0.

@deniseschannon

@wstudios2009 @tcdev0 Are you using "Start before stop" or not for upgrading?

The specific change applies to the case where you have, say, 3 containers and perform an upgrade.

Previously, we'd end up stopping all 3 old containers while all 3 new containers were stuck in initializing. Therefore the service would be completely down if your service was stuck in initializing/unhealthy.

With the fix for this issue, we only stop 1 container and wait until the new container is active before moving on to remove the next old container and starting the next new container.
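
As an illustration only (not taken from this thread), a hedged sketch of such a 3-container, health-checked service; the field names mirror the compose examples shared below:

web:
  scale: 3
  health_check:
    port: 80
    interval: 2000
    unhealthy_threshold: 3
    request_line: GET "/" "HTTP/1.0"
    healthy_threshold: 2
    response_timeout: 2000
# With the fix, the upgrade replaces one container at a time, waiting for the
# new container's health check to pass before touching the next old container.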

@mariusstaicu

mariusstaicu commented Aug 4, 2016

@deniseschannon I use start_first: true in rancher-compose.yml.
Also, when upgrading, I have only one running instance (one running container), and this container waits only until the new one is in the "Initializing" state before shutting down.
I expected the old container to wait until the new one is in the "Running" (not "Initializing") state, and only then shut down.

What happens can be seen in the picture below.
[screenshot from 2016-08-04 14 11 32]

@deniseschannon

@wstudios2009 Do you have a health check on the services?

Can you share your docker-compose.yml for the service, both the old one and the new one?

@mariusstaicu

Yes, the health check is a simple HTTP GET check and it works (the state is 'Initializing' until my app is started). Here's my rancher-compose.yml:

db:
  scale: 1
display:
  scale: 1
  upgrade_strategy:
    start_first: true
  health_check:
    port: 8080
    interval: 2000
    initializing_timeout: 120000
    unhealthy_threshold: 3
    request_line: GET "/device" "HTTP/1.0"
    healthy_threshold: 2
    response_timeout: 2000
web:
  scale: 1
  upgrade_strategy:
    start_first: true
  health_check:
    port: 8080
    interval: 2000
    initializing_timeout: 120000
    unhealthy_threshold: 3
    request_line: GET "/" "HTTP/1.0"
    healthy_threshold: 2
    response_timeout: 2000

Here is my docker-compose file (with some passwords and URLs removed):

db:
  environment:
    POSTGRES_DB: db
    POSTGRES_PASSWORD: pass
    POSTGRES_USER: user
  labels:
    io.rancher.scheduler.affinity:host_label: location=esol
  image: postgres:9.5
  volumes:
  - myapp-int-db:/var/lib/postgresql/data
display:
  environment:
    GIT_SHA: e3eea3215301c31eb2d240c78d44347f0c9b81e7
    JAVA_OPTS: -Xms200m -Xmx380m -XX:MaxMetaspaceSize=80m
    RELEASE: '11'
  labels:
    io.rancher.container.pull_image: always
    io.rancher.container.hostname_override: container_name
  image: <private repo image path here>:11
web:
  environment:
    JAVA_OPTS: -Xms200m -Xmx380m -XX:MaxMetaspaceSize=80m
    RELEASE: '49'
  labels:
    io.rancher.container.pull_image: always
    io.rancher.container.hostname_override: container_name
  image: <private repo image path here>:49
  links:
  - db:db
  volumes:
  - myapp-int:/.home

The only things that change between deployments in docker-compose.yml are the image versions and some unrelated env variables (GIT_SHA, RELEASE, etc.).

@patrickkeller

This issue still exists in rancher 1.6.14. The old container gets stopped even when the new one is still in "Initializing" state.

@superseb
Contributor

@patrickkeller #11487
