Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor into dual Rancher API/Metadata providers #1563

Merged
merged 1 commit into from Jun 20, 2017

Conversation

martinbaillie
Copy link
Contributor

Using the API as the only source of Rancher Traefik configuration alienates some Rancher deployments, particularly in secure environments. This PR looks to resolve #1402 by allowing the user to choose between API and internal metadata service for sourcing Traefik configuration.

Some benefits of the metadata service:

  • Payloads are smaller in size as constrained to a single Rancher environment
  • User does not have to concern themselves with the secure bootstrap of powerful Rancher API access/secret keys into Traefik
  • Accurate almost instantaneous Traefik configuration updates if using long polling config option (rather than the usual wait of up to RefreshSeconds)
  • Traefik configuration only updated when there's been an actual change in Rancher (thanks to the metadata service's /version endpoint)

Some PR notes/discussion points:

  • Vendors in the official Rancher metadata client, like the Rancher API client already vendored
  • Question/vote around which of API/Metadata should be default? What is the most popular use case?
  • I segregated API and Metadata specific functions into separate files. Do we want this?
  • I tried not to change or refactor too much. There is scope for more re-use between the techniques perhaps through using interfaces
  • The Rancher provider does not have integration tests. I think this should be picked up in another PR though.

@ldez ldez added kind/enhancement a new or improved feature. area/provider/rancher labels May 8, 2017
@ldez ldez requested a review from SantoDE May 8, 2017 14:49
Copy link
Collaborator

@SantoDE SantoDE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good morning @martinbaillie ,

first of all: thanks a LOT for your PR 😍

I would personally vote for putting metadata as the default client and let the user actively decide to switch to API if his use case needs it. I can think of some cases which would need API access, but again, the user would actively have to choose that.

I like the seperation of file level btw. It makes stuff more clear.

Apart from that, I can not find something that I don't like. Good job here!

So please, put metadata as the default in place :)

@martinbaillie martinbaillie force-pushed the master branch 3 times, most recently from d6008e8 to 5f5be5f Compare May 10, 2017 12:22
@martinbaillie
Copy link
Contributor Author

@SantoDE agreed! Implemented metadata service as default 👍

Previous API users will simply need EnableAPI set to true in order to pick up where they left off; otherwise new users will benefit from the simpler metadata service based provider.

I also worked through those Glide issues (Golang's dep couldn't come soon enough amirite?! 🤕) and rebased against current master.

@ldez
Copy link
Member

ldez commented May 15, 2017

Due to SemaphoreCI, I close and reopen the PR.

@ldez ldez closed this May 15, 2017
@ldez ldez reopened this May 15, 2017
@martinbaillie
Copy link
Contributor Author

@SantoDE @ldez FYI these changes were addressed. I think the close/reopen has made Github think otherwise

@ldez
Copy link
Member

ldez commented May 15, 2017

@martinbaillie could you pick the #1414 content to your PR? And some tests on this part?

@martinbaillie
Copy link
Contributor Author

@ldez - sure, done.

As for tests, unfortunately I struggled to find anything new to test. This refactor either re-uses/modifies existing Traefik funcs (and I updated those respective tests) or otherwise makes use of the official Rancher metadata client which would really need tested in the integration test suite for Rancher (currently missing, I believe there’s a separate issue to address that as a whole).

@bseng
Copy link

bseng commented May 16, 2017

@martinbaillie thanks for incorporate my change in #1414. Can you test the scenario mentioned in #1414?

I suspect it won't work anyway because the metadata returned is for that environment only.

But no harm testing I guess.

@martinbaillie
Copy link
Contributor Author

@bseng all good, the introduction of metadata service based provider does not use the functions you modified. The functions have simply been renamed and moved to an api.go file as part of the refactor, but remain untouched otherwise.

@martinbaillie
Copy link
Contributor Author

G'day folks. Wondering if we're waiting on anything else from me here? Or is it just scheduling?

The PR has gotten a little out of date. I'm happy to sync it up again if it'll get merged shortly after.

@ldez
Copy link
Member

ldez commented May 23, 2017

ping @SantoDE

Copy link
Collaborator

@SantoDE SantoDE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @martinbaillie,

thanks a lot! :) This looks (apart from the minor documentation thingy) good :)

LGTM 👼

#
# Required
# Required (unless EnableMetadataService = true)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need that unless, if it's default? If yes, we need to show EnableMetadataService in the sample as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woops. This was a remnant from the previous iteration when metadata service was not the default.

The config is actually EnableAPI. I've reworded the sample to reflect this:

# Required (if EnableAPI = true)
... 

@martinbaillie
Copy link
Contributor Author

Hey @SantoDE,

Fixed the traefik.sample.conf - nice spot. Also rebased with master and sorted Glide again 👍

@martinbaillie
Copy link
Contributor Author

@ldez @SantoDE all synced and rebased again 👍

Copy link
Member

@ldez ldez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job! 👍
LGTM

@ldez
Copy link
Member

ldez commented May 26, 2017

@martinbaillie could you squash your commits?

@martinbaillie
Copy link
Contributor Author

@ldez done!

Copy link
Member

@emilevauge emilevauge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @martinbaillie :)
We think we need backwards compatibility here.
Could you default to API instead (renaming EnableAPI to MetadataMode) ?
Could you also add

type Provider struct {
APIConfiguration  `mapstructure:",squash"`
// ...
}

and initialize provider.api.* if len(provider.api.accesskey) > 0 ?

Thanks for you great work on this 👏

@martinbaillie
Copy link
Contributor Author

@emilevauge no worries.

Please see updated code. I've reworked it back to having API as default and stay backwards compatible. Also updated the config naming where it made sense. A help run looks like this now, with clearly distinguished metadata service settings:

./traefik --help | grep rancher
    --rancher                           Enable Rancher backend                                                           (default "true")
    --rancher.accesskey                 Rancher server API access key
    --rancher.constraints               Filter services by constraint, matching with Traefik tags.                       (default "[]")
    --rancher.domain                    Default domain used
    --rancher.enableservicehealthfilter Filter services with unhealthy states and inactive states                        (default "false")
    --rancher.endpoint                  Rancher server API HTTP(S) endpoint
    --rancher.exposedbydefault          Expose services by default                                                       (default "true")
    --rancher.filename                  Override default configuration template. For advanced users :)
    --rancher.metadatamode              Source configuration from the Rancher metadata service instead of the API        (default "false")
    --rancher.metadatapoll              Poll the Rancher metadata service every 'rancher.refreshseconds' (less accurate) (default "false")
    --rancher.metadataprefix            Prefix used for accessing the Rancher metadata service                           (default "latest")
    --rancher.refreshseconds            Polling interval (in seconds)                                                    (default "15")
    --rancher.secretkey                 Rancher server API secret key
    --rancher.watch                     Watch provider                                                                   (default "true")

Similarly the sample TOML reflects this.

@SantoDE
Copy link
Collaborator

SantoDE commented Jun 13, 2017

Wow @martinbaillie 😍

@emilevauge
Copy link
Member

emilevauge commented Jun 13, 2017

@martinbaillie Sorry we misunderstood ;).

We still think having both sub-structs for API/metadata service is great. But we also think it's possible to stay backward compatible adding a (duplicate) nested API struct to Provider.
Something like

type Provider struct {
APIConfiguration  `mapstructure:",squash"`
api *APIConfiguration `description:"API configuration"`
metadata *MetadataServiceConfiguration `description:"MetaData configuration"`
// ...
}

Besides, you would need to initialize provider.api.* if len(provider.accesskey) > 0 (from the nested struct) and print a deprecated WARN log.

I even think MetadataMode may not be necessary anymore in this case in fact.

WDYT?

@martinbaillie
Copy link
Contributor Author

@emilevauge ah now I get you. I can make those changes tomorrow 👍

I think we could ditch the need for MetadataMode though one thing I noticed earlier is the uninitialised sub-structs would render as (default "true") in help output. Is there a way I could trick the MetadataConfiguration sub-struct to output as being false? That way it would be clear to the user that --rancher.metadataservice is false by default.

@martinbaillie
Copy link
Contributor Author

Hi @emilevauge, @ldez

I've updated the code again. Backwards compatibility is ensured and a WARN deprecation message printed. I've also ditched any "mode" concept.

So yeh, the only small bugbear is the generated help output, but I don't think I can affect that?

    --rancher.metadataservice              Enable the Rancher metadata service provider                                     (default "true")
    --rancher.metadataservice.intervalpoll Poll the Rancher metadata service every 'rancher.refreshseconds' (less accurate) (default "false")
    --rancher.metadataservice.prefix       Prefix used for accessing the Rancher metadata service                           (default "latest")

@rwojsznis
Copy link

rwojsznis commented Jun 18, 2017

Hi @martinbaillie

I'm really happy about this PR <3 because we have just started integrating traefik with rancher in our development/experimental environments :). But one thing bothers me with current integration - when I shut down the rancher servers it seems traefik looses all information about existing services - it spits out stderror notice (Cannot get Provider Services) and wipes current configuration and by looking at the changes - this is still the case, right?

This is less than ideal because suddenly downtime in rancher servers causes major downtime and I have to bother with rancher server HA setup. Shouldn't traefik keep last known service configuration after connection to server is lost? (maybe it's doable right now and I'm missing something so please correct me if I'm wrong!)

update: ok, found #1703 (better late than never), seems it's known issue; but am I wrong or should we should return last known services as a fallback argument in case of error to listRancherServices(+ containers/env)? (totally guessing here not very familiar with the overall architecture)

update2: probably wrong PR to discuss this issue, sorry if I took it too far :)

@martinbaillie
Copy link
Contributor Author

martinbaillie commented Jun 18, 2017

Hi @emq

It's true that #1703 is covering this but worth mentioning here, so thanks 👍

I believe this bug only affects the API provider.

This PR was mainly to introduce the Rancher metadata service as an alternate Rancher provider for Traefik, and in doing so simply refactored the API provider code into its own segregated area, but it has been otherwise left untouched. So #1703 should tackle the bug.

The metadata service is unaffected by Rancher server outage (another benefit of this provider). Metadata runs as a global micro service, one per host. So if they all go down, I think you've got bigger things to worry about than losing ingress routing ;)

EDIT: I should've mentioned I tested the metadata service infrastructure stack going down briefly. Traefik held its config 👍 But yeh, as above, lots of other things in the Rancher env start failing so you'll have a bad time regardless!

@rwojsznis
Copy link

Metadata runs as a global micro service, one per host

You're absolutely right, totally forgot about that! Thanks for quick reply, I can't wait to battle-test this :)

@ldez ldez force-pushed the master branch 2 times, most recently from c8aed2e to 2c1bdbd Compare June 20, 2017 11:31
@ldez
Copy link
Member

ldez commented Jun 20, 2017

@martinbaillie I have made some changes (related to discussions with the team).

the end is near

Copy link
Member

@emilevauge emilevauge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martinbaillie Thanks you very much for your hard work 👏 !
@ldez you rock ❤️
LGTM

Introduces Rancher's metadata service as an optional provider source for
Traefik, enabled by setting `rancher.MetadataService`.

The provider uses a long polling technique to watch the metadata service and
obtain near instantaneous updates. Alternatively it can be configured to poll
the metadata service every `rancher.RefreshSeconds` by setting
`rancher.MetadataPoll`.

The refactor splits API and metadata service code into separate source
files respectively, and specific configuration is deferred to
sub-structs.

Incorporates bugfix traefik#1414
Copy link
Member

@ldez ldez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A long running but a happy end! 😃
🎉 🎉 🎉

@ldez ldez merged commit 9cb07d0 into traefik:master Jun 20, 2017
@martinbaillie
Copy link
Contributor Author

Thanks guys! 🎉 🎉 🎉

@martinbaillie
Copy link
Contributor Author

I've been able to do lots of zero downtime deployments with Rancher, Traefik and the metadata service provider 👍

Caveat: if your service has a scale of 1 then you may encounter this Rancher bug: rancher/rancher#8684

Scale of 2 or more and you're flying.

(Not sure if worth mentioning that bug in the release notes? @ldez @emilevauge)

@kelchm
Copy link
Contributor

kelchm commented Jun 21, 2017

@martinbaillie, with how Rancher works, even with a scale greater than 1, there will always be a brief period of downtime or improperly routed traffic when upgrading a service.

The only way I know to do a proper blue-green deploy using only Rancher + Traefik is to spin up a new stack which then overrides any host rules for the previous stack once the service health checks are passing.

@martinbaillie
Copy link
Contributor Author

@kelchm I didn't see that with Rancher v1.6.2 when swapping in a new microservice version sitting behind Traefik+metadata provider.

Microservice has a scale: 2, an upgrade_strategy: start_first: true and correct health checks to make sure Rancher doesn't mark it healthy 'til it's healthy. Upgraded via rancher-compose with a batch-size=1. This is where the Traefik metadata service provider's near-instantaneous updates shines I think. Traefik immediately knows about that health change & new backend IP.

YMMV depending on how 'web scale' you are :) Though a quick run through ab during the upgrade with ~600rps (obscenely more than my service would ever get):

Concurrency Level:      3000
Time taken for tests:   34.366 seconds
Complete requests:      20716
Failed requests:        0

I suppose it'll depend on the type of backend service too. When it's marked healthy it really needs to be ready to handle traffic immediately, which is often hard with e.g. the JVM. Also if Traefik is sitting behind something else, maybe Rancher's HAProxy LB, then you will see a few dropped packets. I'm going direct to Traefik in this test, which is not what my production looks like at the moment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enable/Fix rancher-metadata backend config method
8 participants