Less scary AWS detection #1863

squaremo · 2019-03-26T12:57:41Z

At present fluxd will check initially for access to the AWS API (actually the EC2 metadata service), and log the error it gets if it's not accessible.

For people running outside AWS, this can look like something is failing, and cause undue anxiety. On the other hand, if you are running in AWS you probably expect it to just work, and you do want to know about any failure -- otherwise, you might not notice that, say, automation has stopped upgrading images.

To cover that second case, and defang the first, the flag --registry-ecr-mandatory makes it a fatal error if fluxd can't reach the EC2 metadata service (which I'm using as an approximation to being able to access the AWS API in general). By using the flag, you can escalate a silent failure into a crash, which -- we can hope -- you are more likely to notice.

Without the flag, if AWS is not available, fluxd will just log the "pre-flight check failed" warning, and not the full error.

I consulted with Adam on this, since he's been through the experience of using fluxd with ECR repos, and in particular using kiam which can cause AWS to be unavailable from pods. He said he anticipated there would be problems with kiam, so got the fix (using annotations) in ahead of time. However, making the ECR support mandatory means that if that ever breaks, we are more likely to find out.

2opremio · 2019-03-27T16:21:24Z

Related: #1810

squaremo · 2019-03-27T16:23:12Z

Related: #1810

Yes, though that won't be merged without substantial changes. I intend this PR to obviate the other.

2opremio · 2019-03-27T16:28:00Z

Is it possible to predict whether an image is stored in an ECR registry? (I know ECR images use a special URL, but I don't know if there are exceptions to that).

If so, I think a better solution is to do the AWS detection lazily (only after we are sure we need to warm up an image registry in ECR).

I think that would make all users happy:

Non-AWS users won't get a strange, irrelevant error at startup
AWS users will get the right error.

Sorry for not coming up with this before.

squaremo · 2019-03-27T16:37:11Z

Is it possible to predict whether an image is stored in an ECR registry?

Absolutely; the hostname will be a subdomain of dkr.erc.amazonaws.com.

If so, I think a better solution is to do the AWS detection lazily (only after we are sure we need to warm up an image registry in ECR).

Not a bad idea. The trade-off, I suppose, is that the warning should it come, will be in an unpredictable place in the logs. But probably fairly early, and, well, grep innit.

I would keep the --registry-ecr-mandatory flag, either way: Adam reckoned it would be useful, having got AWS authentication working, for making any failures much more visible.

2opremio · 2019-03-27T16:50:47Z

I would keep the --registry-ecr-mandatory flag, either way

ETOOMANYKNOBS ?

I don't strongly oppose to it, but IMO having a flag just for that seems a bit overkill.

squaremo · 2019-03-27T16:58:22Z

I don't strongly oppose to it, but IMO having a flag just for that seems a bit overkill.

The alternative is that it fails quietly, which is worse than it is now (since at least it'll log a messy error at the top of the log, as things stand).

2opremio · 2019-03-28T09:04:41Z

The alternative is that it fails quietly, which is worse than it is now (since at least it'll log a messy error at the top of the log, as things stand).

I think that's just the result of having treated ECR specially. Why is an AWS-credentials error more important than, say, any other registry access error? (those are still buried in the logs and not at the beginning).

squaremo · 2019-03-28T11:16:12Z

I think that's just the result of having treated ECR specially. Why is an AWS-credentials error more important than, say, any other registry access error? (those are still buried in the logs and not at the beginning).

It'd be good to be able to escalate other registry access errors, though it's somewhat harder to do for the piece-wise imagePullSecrets.

I'm all for adding --registry-gcr-require (we have a similar situation where we can test a fixed endpoint) and --registry-acr-require (check that the mounted file exists and can be parsed) as similar assertions.

The pivot is which of these you think is a better trade-off:

ECR/GCR support just works if available, at the expense of a potentially irrelevant log message for some people
You have to supply an argument to enable the specific environment you're in, but no-one gets a log message that they didn't expect.

In a sense I'm leaning towards 2) by proposing --registry-ecr-require here -- I'm saying the best way for these things to work is to assert that they are available and fail if they are not.

The main argument against 2) is that it makes it harder to have a general purpose installation -- which Weave Cloud is heavily weighted towards (effectively, 2. is impossible to support for Weave Cloud installations, at present).

hiddeco · 2019-03-28T11:27:10Z

I'm all for adding --registry-gcr-require (we have a similar situation where we can test a fixed endpoint) and --registry-acr-require (check that the mounted file exists and can be parsed) as similar assertions.

Would it be an idea to put it behind a --registry-require=<vendor> flag in that case to reduce the ETOOMANYKNOBS?

2opremio · 2019-03-28T14:28:24Z

I would also go for 2 if I had to choose between them.

However, to avoid flags, how about

verifying the registry access for all known images first thing at startup

Admittedly this won't cover the case of incorporating ECR images in later commits, but I think it's a good compromise between error reporting quality and adding extra knobs.

As an additional advantage, we won't be giving special treatment to ECR or any other cloud registry.

squaremo · 2019-04-01T10:39:51Z

However, to avoid flags, how about

verifying the registry access for all known images first thing at startup

We don't have all known images at startup; that's done asynchronously in another loop, and left running independently. With some rearrangement though, an initial run could be worked into the startup sequence.

But at that point I think we'd have a more serious problem: we won't know what's supposed to work, and what is not. At present, people can leave the flags at defaults, and most of their images will be scanned. If there's some that are expected to fail, those will fail and be missing, and the others will be OK. If we fail entirely for anything that can't be scanned at startup, it means everyone has to get their configuration exactly right.

For example, there are some infrastructural images used by EKS that are published by the EKS team under a particular account. The ECR-specific code ignores that account by default. If we made all image scans mandatory, and that account changed, all the fluxd instances on EKS would start crashlooping.

Another variation would be to scan for AWS access if there are any ECR images discovered at startup; but it's not a given that you must be running on AWS if you are using ECR images (or GCP if using GCR). That's why I proposed adding flags that assert what needs to be available.

I am converging on

If the --registry-required flag (as in Hidde's formulation) is set for ECR, check that AWS is available on startup and fail if not;
Otherwise check for AWS support only if it's required (i.e., an ECR image needs to be scanned), and log if it fails

(and, in the fullness of time, similar arrangements for other environments, in other PRs)

2opremio · 2019-04-01T11:32:07Z

OK, I agree, let's do that

squaremo · 2019-04-02T13:34:06Z

I would also like to apply the exclude and include rules, even when auto-authentication is not possible; but there's a hitch: without detecting the region, I (the code) won't know which region to treat as included.

if the region is supplied, limit to that region
if the region is not supplied, but is detected, limit to that region
if the region is not supplied, and not detected, then ..?

The only possible answer is "allow all regions", but this is inconsistent with the other situations, so may be surprising. Perhaps I shouldn't worry about that.

squaremo · 2019-04-02T15:43:35Z

The only possible answer is "allow all regions"

Not quite true; I could

just not apply the constraints if the region isn't supplied or detected
include no regions

squaremo · 2019-04-03T10:38:33Z

OK, I've done what I think will be the least trouble for users, whether they use AWS or not. Taking the different situations one by one:

Using AWS and expecting ECR auth to just work

If you leave all the arguments to default, and the fluxd pod can reach the AWS API (not a given), it will gather credentials for images in the cluster's region, except those in the EKS system account.
If it cannot reach the AWS API, but you're using images in ECR, a warning will be logged once, but otherwise it will continue without using AWS authentication
If you want to require AWS authentication, you can use --registry-require=ecr to make it mandatory. fluxd will exit(1) if it can't connect to the AWS API.
you can set the include/exclude constraints with arguments, and fluxd will respect them whether or not AWS auth is possible. Setting --registry-ecr-region= will disable scanning ECR altogether.

Not using AWS, not using ECR

The warning (mentioned above) will be logged only if an included ECR image is encountered. If you leave all the defaults, and you're not using ECR, nothing will be logged.

Not using AWS, but using some ECR images

If the AWS API cannot be reached, the included regions will default to [] (i.e., none). So you need to explicitly include regions for ECR images to be scanned, if outside AWS.

squaremo · 2019-04-03T10:46:52Z

@2opremio I'm happy with how this works for the end user (though still open to other ideas); I am not convinced of the implementation, though it achieves the goals* and is fairly concise.

*the goals:

don't log a scary error unless it's reasonably certain the user expected something to work and it isn't going to
let people constrain what's scanned in ECR even when the authentication bit isn't going to work

chart/flux/README.md

2opremio · 2019-04-03T12:54:24Z

cmd/fluxd/main.go

 )

 func optionalVar(fs *pflag.FlagSet, value ssh.OptionalValue, name, usage string) ssh.OptionalValue {
 	fs.Var(value, name, usage)
 	return value
 }

+type stringset []string


Can we unify this with the StringSet defined in warming.go?

Technically, yes. They have different methods, and not quite compatible representations, but (for example) I could move StringSet into a neutral package, add Contains for main.go to use.

Sounds good. There is also a stringset in policies.go . It would be good to unify all of them. However, I understand you may not want to do it in this PR, so I will leave it up to you.

site/daemon.md

chart/flux/templates/deployment.yaml

squaremo · 2019-04-04T10:02:51Z

@2opremio Were you happy aside from the stringset (which I'll do in another PR), and the out-of-date docs/chart?

2opremio · 2019-04-04T12:18:55Z

Yes

At present fluxd will check initially for access to the AWS API (actually the EC2 metadata service), and log the error it gets if it's not accessible. For people running outside AWS, this can look like something is failing, and cause undue anxiety. On the other hand, if you _are_ running in AWS you probably expect it to just work, and you _do_ want to know about any failure -- otherwise, you might not notice that, say, automation has stopped upgrading images. To cover that second case, and defang the first, the command-line argument `--registry-require=ecr` will make it a fatal error if fluxd can't reach the EC2 metadata service (which I'm using as an approximation to being able to access the AWS API in general). By using the flag, you can escalate a silent failure into a crash, which -- we can hope -- you are more likely to notice.

It is useful to be able to constrain the image repos scanned, even when you don't need or want to use the AWS authentication. This commit makes the constraints operate independently of AWS authentication, by using a "pre-flight check" to determine whether the AWS auth should be used, but applying the constraints either way. The preflight check is also used to exit from main() if `--registry-require=ecr` is set and the AWS API is not available.

chart/flux/README.md

squaremo added dogfood UX In pursuit of a delightful user experience labels Mar 26, 2019

squaremo requested a review from 2opremio March 26, 2019 18:18

squaremo force-pushed the ux/scary-aws-detection branch from 599a792 to 18abe4a Compare April 3, 2019 10:40

2opremio reviewed Apr 3, 2019

View reviewed changes

chart/flux/README.md Outdated Show resolved Hide resolved

2opremio reviewed Apr 3, 2019

View reviewed changes

site/daemon.md Outdated Show resolved Hide resolved

2opremio reviewed Apr 3, 2019

View reviewed changes

chart/flux/templates/deployment.yaml Outdated Show resolved Hide resolved

squaremo added 2 commits April 4, 2019 13:55

squaremo force-pushed the ux/scary-aws-detection branch from b6d5d08 to 82e2f0c Compare April 4, 2019 15:00

2opremio reviewed Apr 8, 2019

View reviewed changes

chart/flux/README.md Show resolved Hide resolved

2opremio approved these changes Apr 8, 2019

View reviewed changes

squaremo merged commit 63e4e83 into master Apr 9, 2019

squaremo deleted the ux/scary-aws-detection branch April 9, 2019 13:02

squaremo mentioned this pull request Apr 9, 2019

prevents AWS errors when not using AWS #1810

Closed

squaremo mentioned this pull request May 22, 2019

Revert ECR-outside-AWS default behaviour #2070

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Less scary AWS detection #1863

Less scary AWS detection #1863

squaremo commented Mar 26, 2019 •

edited

2opremio commented Mar 27, 2019

squaremo commented Mar 27, 2019

2opremio commented Mar 27, 2019 •

edited

squaremo commented Mar 27, 2019

2opremio commented Mar 27, 2019

squaremo commented Mar 27, 2019

2opremio commented Mar 28, 2019 •

edited

squaremo commented Mar 28, 2019

hiddeco commented Mar 28, 2019

2opremio commented Mar 28, 2019 •

edited

squaremo commented Apr 1, 2019

2opremio commented Apr 1, 2019

squaremo commented Apr 2, 2019

squaremo commented Apr 2, 2019 •

edited

squaremo commented Apr 3, 2019

squaremo commented Apr 3, 2019

2opremio Apr 3, 2019

squaremo Apr 3, 2019

2opremio Apr 3, 2019

squaremo commented Apr 4, 2019

2opremio commented Apr 4, 2019

Less scary AWS detection #1863

Less scary AWS detection #1863

Conversation

squaremo commented Mar 26, 2019 • edited

2opremio commented Mar 27, 2019

squaremo commented Mar 27, 2019

2opremio commented Mar 27, 2019 • edited

squaremo commented Mar 27, 2019

2opremio commented Mar 27, 2019

squaremo commented Mar 27, 2019

2opremio commented Mar 28, 2019 • edited

squaremo commented Mar 28, 2019

hiddeco commented Mar 28, 2019

2opremio commented Mar 28, 2019 • edited

squaremo commented Apr 1, 2019

2opremio commented Apr 1, 2019

squaremo commented Apr 2, 2019

squaremo commented Apr 2, 2019 • edited

squaremo commented Apr 3, 2019

squaremo commented Apr 3, 2019

2opremio Apr 3, 2019

Choose a reason for hiding this comment

squaremo Apr 3, 2019

Choose a reason for hiding this comment

2opremio Apr 3, 2019

Choose a reason for hiding this comment

squaremo commented Apr 4, 2019

2opremio commented Apr 4, 2019

squaremo commented Mar 26, 2019 •

edited

2opremio commented Mar 27, 2019 •

edited

2opremio commented Mar 28, 2019 •

edited

2opremio commented Mar 28, 2019 •

edited

squaremo commented Apr 2, 2019 •

edited