Skip to content
This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

Less scary AWS detection #1863

Merged
merged 2 commits into from Apr 9, 2019
Merged

Less scary AWS detection #1863

merged 2 commits into from Apr 9, 2019

Conversation

squaremo
Copy link
Member

@squaremo squaremo commented Mar 26, 2019

At present fluxd will check initially for access to the AWS API (actually the EC2 metadata service), and log the error it gets if it's not accessible.

For people running outside AWS, this can look like something is failing, and cause undue anxiety. On the other hand, if you are running in AWS you probably expect it to just work, and you do want to know about any failure -- otherwise, you might not notice that, say, automation has stopped upgrading images.

To cover that second case, and defang the first, the flag --registry-ecr-mandatory makes it a fatal error if fluxd can't reach the EC2 metadata service (which I'm using as an approximation to being able to access the AWS API in general). By using the flag, you can escalate a silent failure into a crash, which -- we can hope -- you are more likely to notice.

Without the flag, if AWS is not available, fluxd will just log the "pre-flight check failed" warning, and not the full error.


I consulted with Adam on this, since he's been through the experience of using fluxd with ECR repos, and in particular using kiam which can cause AWS to be unavailable from pods. He said he anticipated there would be problems with kiam, so got the fix (using annotations) in ahead of time. However, making the ECR support mandatory means that if that ever breaks, we are more likely to find out.

@squaremo squaremo added dogfood UX In pursuit of a delightful user experience labels Mar 26, 2019
@squaremo squaremo requested a review from 2opremio March 26, 2019 18:18
@2opremio
Copy link
Contributor

Related: #1810

@squaremo
Copy link
Member Author

Related: #1810

Yes, though that won't be merged without substantial changes. I intend this PR to obviate the other.

@2opremio
Copy link
Contributor

2opremio commented Mar 27, 2019

Is it possible to predict whether an image is stored in an ECR registry? (I know ECR images use a special URL, but I don't know if there are exceptions to that).

If so, I think a better solution is to do the AWS detection lazily (only after we are sure we need to warm up an image registry in ECR).

I think that would make all users happy:

  • Non-AWS users won't get a strange, irrelevant error at startup
  • AWS users will get the right error.

Sorry for not coming up with this before.

@squaremo
Copy link
Member Author

Is it possible to predict whether an image is stored in an ECR registry?

Absolutely; the hostname will be a subdomain of dkr.erc.amazonaws.com.

If so, I think a better solution is to do the AWS detection lazily (only after we are sure we need to warm up an image registry in ECR).

Not a bad idea. The trade-off, I suppose, is that the warning should it come, will be in an unpredictable place in the logs. But probably fairly early, and, well, grep innit.

I would keep the --registry-ecr-mandatory flag, either way: Adam reckoned it would be useful, having got AWS authentication working, for making any failures much more visible.

@2opremio
Copy link
Contributor

I would keep the --registry-ecr-mandatory flag, either way

ETOOMANYKNOBS ?

I don't strongly oppose to it, but IMO having a flag just for that seems a bit overkill.

@squaremo
Copy link
Member Author

I don't strongly oppose to it, but IMO having a flag just for that seems a bit overkill.

The alternative is that it fails quietly, which is worse than it is now (since at least it'll log a messy error at the top of the log, as things stand).

@2opremio
Copy link
Contributor

2opremio commented Mar 28, 2019

The alternative is that it fails quietly, which is worse than it is now (since at least it'll log a messy error at the top of the log, as things stand).

I think that's just the result of having treated ECR specially. Why is an AWS-credentials error more important than, say, any other registry access error? (those are still buried in the logs and not at the beginning).

@squaremo
Copy link
Member Author

I think that's just the result of having treated ECR specially. Why is an AWS-credentials error more important than, say, any other registry access error? (those are still buried in the logs and not at the beginning).

It'd be good to be able to escalate other registry access errors, though it's somewhat harder to do for the piece-wise imagePullSecrets.

I'm all for adding --registry-gcr-require (we have a similar situation where we can test a fixed endpoint) and --registry-acr-require (check that the mounted file exists and can be parsed) as similar assertions.

The pivot is which of these you think is a better trade-off:

  1. ECR/GCR support just works if available, at the expense of a potentially irrelevant log message for some people
  2. You have to supply an argument to enable the specific environment you're in, but no-one gets a log message that they didn't expect.

In a sense I'm leaning towards 2) by proposing --registry-ecr-require here -- I'm saying the best way for these things to work is to assert that they are available and fail if they are not.

The main argument against 2) is that it makes it harder to have a general purpose installation -- which Weave Cloud is heavily weighted towards (effectively, 2. is impossible to support for Weave Cloud installations, at present).

@hiddeco
Copy link
Member

hiddeco commented Mar 28, 2019

I'm all for adding --registry-gcr-require (we have a similar situation where we can test a fixed endpoint) and --registry-acr-require (check that the mounted file exists and can be parsed) as similar assertions.

Would it be an idea to put it behind a --registry-require=<vendor> flag in that case to reduce the ETOOMANYKNOBS?

@2opremio
Copy link
Contributor

2opremio commented Mar 28, 2019

I would also go for 2 if I had to choose between them.

However, to avoid flags, how about

  1. verifying the registry access for all known images first thing at startup

Admittedly this won't cover the case of incorporating ECR images in later commits, but I think it's a good compromise between error reporting quality and adding extra knobs.

As an additional advantage, we won't be giving special treatment to ECR or any other cloud registry.

@squaremo
Copy link
Member Author

squaremo commented Apr 1, 2019

However, to avoid flags, how about

  • verifying the registry access for all known images first thing at startup

We don't have all known images at startup; that's done asynchronously in another loop, and left running independently. With some rearrangement though, an initial run could be worked into the startup sequence.

But at that point I think we'd have a more serious problem: we won't know what's supposed to work, and what is not. At present, people can leave the flags at defaults, and most of their images will be scanned. If there's some that are expected to fail, those will fail and be missing, and the others will be OK. If we fail entirely for anything that can't be scanned at startup, it means everyone has to get their configuration exactly right.

For example, there are some infrastructural images used by EKS that are published by the EKS team under a particular account. The ECR-specific code ignores that account by default. If we made all image scans mandatory, and that account changed, all the fluxd instances on EKS would start crashlooping.

Another variation would be to scan for AWS access if there are any ECR images discovered at startup; but it's not a given that you must be running on AWS if you are using ECR images (or GCP if using GCR). That's why I proposed adding flags that assert what needs to be available.

I am converging on

  1. If the --registry-required flag (as in Hidde's formulation) is set for ECR, check that AWS is available on startup and fail if not;
  2. Otherwise check for AWS support only if it's required (i.e., an ECR image needs to be scanned), and log if it fails

(and, in the fullness of time, similar arrangements for other environments, in other PRs)

@2opremio
Copy link
Contributor

2opremio commented Apr 1, 2019

OK, I agree, let's do that

@squaremo
Copy link
Member Author

squaremo commented Apr 2, 2019

I would also like to apply the exclude and include rules, even when auto-authentication is not possible; but there's a hitch: without detecting the region, I (the code) won't know which region to treat as included.

  • if the region is supplied, limit to that region
  • if the region is not supplied, but is detected, limit to that region
  • if the region is not supplied, and not detected, then ..?

The only possible answer is "allow all regions", but this is inconsistent with the other situations, so may be surprising. Perhaps I shouldn't worry about that.

@squaremo
Copy link
Member Author

squaremo commented Apr 2, 2019

The only possible answer is "allow all regions"

Not quite true; I could

  • just not apply the constraints if the region isn't supplied or detected
  • include no regions

@squaremo
Copy link
Member Author

squaremo commented Apr 3, 2019

OK, I've done what I think will be the least trouble for users, whether they use AWS or not. Taking the different situations one by one:

Using AWS and expecting ECR auth to just work

  • If you leave all the arguments to default, and the fluxd pod can reach the AWS API (not a given), it will gather credentials for images in the cluster's region, except those in the EKS system account.
  • If it cannot reach the AWS API, but you're using images in ECR, a warning will be logged once, but otherwise it will continue without using AWS authentication
  • If you want to require AWS authentication, you can use --registry-require=ecr to make it mandatory. fluxd will exit(1) if it can't connect to the AWS API.
  • you can set the include/exclude constraints with arguments, and fluxd will respect them whether or not AWS auth is possible. Setting --registry-ecr-region= will disable scanning ECR altogether.

Not using AWS, not using ECR

  • The warning (mentioned above) will be logged only if an included ECR image is encountered. If you leave all the defaults, and you're not using ECR, nothing will be logged.

Not using AWS, but using some ECR images

  • If the AWS API cannot be reached, the included regions will default to [] (i.e., none). So you need to explicitly include regions for ECR images to be scanned, if outside AWS.

@squaremo
Copy link
Member Author

squaremo commented Apr 3, 2019

@2opremio I'm happy with how this works for the end user (though still open to other ideas); I am not convinced of the implementation, though it achieves the goals* and is fairly concise.

*the goals:

  • don't log a scary error unless it's reasonably certain the user expected something to work and it isn't going to
  • let people constrain what's scanned in ECR even when the authentication bit isn't going to work

chart/flux/README.md Outdated Show resolved Hide resolved
)

func optionalVar(fs *pflag.FlagSet, value ssh.OptionalValue, name, usage string) ssh.OptionalValue {
fs.Var(value, name, usage)
return value
}

type stringset []string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we unify this with the StringSet defined in warming.go?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, yes. They have different methods, and not quite compatible representations, but (for example) I could move StringSet into a neutral package, add Contains for main.go to use.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. There is also a stringset in policies.go . It would be good to unify all of them. However, I understand you may not want to do it in this PR, so I will leave it up to you.

site/daemon.md Outdated Show resolved Hide resolved
@squaremo
Copy link
Member Author

squaremo commented Apr 4, 2019

@2opremio Were you happy aside from the stringset (which I'll do in another PR), and the out-of-date docs/chart?

@2opremio
Copy link
Contributor

2opremio commented Apr 4, 2019

Yes

At present fluxd will check initially for access to the AWS API
(actually the EC2 metadata service), and log the error it gets if it's
not accessible.

For people running outside AWS, this can look like something is
failing, and cause undue anxiety. On the other hand, if you _are_
running in AWS you probably expect it to just work, and you _do_ want
to know about any failure -- otherwise, you might not notice that,
say, automation has stopped upgrading images.

To cover that second case, and defang the first, the command-line
argument `--registry-require=ecr` will make it a fatal error if fluxd
can't reach the EC2 metadata service (which I'm using as an
approximation to being able to access the AWS API in general). By
using the flag, you can escalate a silent failure into a crash, which
-- we can hope -- you are more likely to notice.
It is useful to be able to constrain the image repos scanned, even
when you don't need or want to use the AWS authentication.

This commit makes the constraints operate independently of AWS
authentication, by using a "pre-flight check" to determine whether the
AWS auth should be used, but applying the constraints either way. The
preflight check is also used to exit from main() if
`--registry-require=ecr` is set and the AWS API is not available.
@squaremo squaremo merged commit 63e4e83 into master Apr 9, 2019
@squaremo squaremo deleted the ux/scary-aws-detection branch April 9, 2019 13:02
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
dogfood UX In pursuit of a delightful user experience
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants