
Firewall rules for Autopilot clusters are ineffective #1230

Closed
dhess opened this issue Apr 18, 2022 · 13 comments · Fixed by #1817
Labels
bug (Something isn't working) · P3 (medium priority issues) · triaged (Scoped and ready for work) · upstream (Work required on Terraform core or provider)

Comments

@dhess

dhess commented Apr 18, 2022

TL;DR

The add_cluster_firewall_rules and related firewall options do not appear to have any effect for Autopilot clusters. I've configured all of them, with higher priority than my default firewall rules in a shared VPC, but the hit counters show 0 hits. Meanwhile, the lower priority generic rules in my firewall are showing hits and are permitting the cluster to pass traffic.

Expected behavior

The higher priority firewall rules added by the module should be in effect, passing traffic, and showing hits.

Observed behavior

The rules are ignored.

Terraform Configuration

add_cluster_firewall_rules = true
firewall_priority = 900
...

Terraform Version

1.1.8

Additional information

I'm not very experienced with GKE and firewall rules yet, and I haven't figured out how to make firewall tags work with GKE Autopilot clusters in my own configs. It was only after I removed the tags that my firewall rules became effective for GKE Autopilot clusters. I noticed that this module applies tags to the generated firewall rules (where the tag is equal to the GKE Autopilot cluster's name), so perhaps that's what's going on?

@dhess added the bug label Apr 18, 2022
@morgante
Contributor

This might indeed need further configuration. Most likely we are not setting the network tags appropriately for the node pools that autopilot auto-provisions.

@morgante added the triaged and P3 labels Apr 18, 2022
@dhess
Author

dhess commented Apr 18, 2022

This might indeed need further configuration. Most likely we are not setting the network tags appropriately for the node pools that autopilot auto-provisions.

Maybe. The firewall rules that are automatically created by GKE do include a tag, but it's not just the name of the cluster — there's a machine-generated suffix, as well. I could not figure out how to generate that tag in my own config.

@morgante
Contributor

Maybe. The firewall rules that are automatically created by GKE do include a tag, but it's not just the name of the cluster — there's a machine-generated suffix, as well. I could not figure out how to generate that tag in my own config.

Yeah, we currently rely on the network tag matching the cluster name.

For autopilot, that association might not hold true. Unfortunately it looks like we're blocked on adding support to the provider: hashicorp/terraform-provider-google#11051

@morgante added the upstream label Apr 18, 2022
@intotecho

This is what I did to create a rule targeting nodes in an Autopilot cluster after the node pool is created. List all the firewall rules and filter for those targeting a tag starting with gke-$CLUSTER_NAME to get the tag:

TAG=$(gcloud compute firewall-rules list --project $PROJECT --filter="name~gke-$CLUSTER_NAME-[0-9a-z]*-master" --format="value(targetTags[0])")
Then
gcloud compute firewall-rules create my-rule --target-tags $TAG ...

It's neither elegant nor robust, but AFAIK there's no way to interrogate GKE to get a node's network tags.
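
For anyone who wants the same lookup inside Terraform rather than as a shell step, here is a minimal sketch using the hashicorp/external data source. It assumes gcloud is on PATH and uses hypothetical var.project_id, var.cluster_name, and var.network variables; it is exactly as brittle as the shell version above.

data "external" "gke_node_tag" {
  # Recover the GKE-generated node tag by filtering the auto-created
  # firewall rules, as in the gcloud one-liner above.
  program = ["bash", "-c", <<-EOT
    TAG=$(gcloud compute firewall-rules list --project ${var.project_id} \
      --filter="name~gke-${var.cluster_name}-[0-9a-z]*-master" \
      --format="value(targetTags[0])")
    printf '{"tag":"%s"}' "$TAG"
  EOT
  ]
}

resource "google_compute_firewall" "my_rule" {
  name        = "my-rule"                    # hypothetical rule name
  network     = var.network                  # assumed variable
  target_tags = [data.external.gke_node_tag.result.tag]

  allow {
    protocol = "tcp"
    ports    = ["443"]                       # example port
  }
  source_ranges = ["10.0.0.0/8"]             # example range
}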

@ferrarimarco
Contributor

I think the issue might be about how Autopilot modules set network tags for firewall rules:

target_tags   = [local.cluster_network_tag]

The cluster_network_tag local is defined as:

cluster_network_tag                        = "gke-${var.name}"

Other (GKE-managed) firewall rules have a different tag: gke-${var.name}-<ID>-node. For example, a cluster named example gets gke-example from the module's rules, while the GKE-managed rules target something like gke-example-1a2b3c4d-node (the suffix is machine-generated).

@GorginZ
Contributor

GorginZ commented Jun 26, 2023

Yes, as @ferrarimarco highlights, the target_tags for the module's FW rules use a tag that is different from the GKE-managed tag that actually ends up on the nodes, a tag we can't access other than by seeing it turn up in the FW rules for intra-cluster comms or LBs. The problem is this: the shadow firewall rules, webhook firewall rules, etc. have target_tags set to a tag that is never attached to the cluster. I have a PR open to attach the cluster_network_tag to Autopilot clusters so that these firewall rules can function. I don't think it's got any eyes on it yet, though.

Edit to add: this has the highest impact for the intra-egress rule, which is required if the VPC has a deny 0.0.0.0/0 egress rule (sketched after this comment). Without the intra-egress rule, the control plane and the rest of the cluster can't talk to each other, and the cluster fails to even come up.

PR if people have thoughts: #1655
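
For readers unfamiliar with the failure mode, this is the kind of VPC-wide rule that makes the intra-egress rule load-bearing; a minimal sketch, with a hypothetical rule name and an assumed var.network:

# With a rule like this in the VPC, any allow rule the module creates
# must actually match the nodes' tags, or cluster traffic is dropped.
resource "google_compute_firewall" "deny_all_egress" {
  name      = "deny-all-egress"   # hypothetical name
  network   = var.network         # assumed variable
  direction = "EGRESS"
  priority  = 65534               # just above the implied allow-egress rule

  deny {
    protocol = "all"
  }
  destination_ranges = ["0.0.0.0/0"]
}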

@intotecho

Is the autopilot module setting --autoprovisioning-network-tags instead of --tags?

https://cloud.google.com/kubernetes-engine/docs/how-to/autopilot-network-tags#command-line_options_for_applying_network_tags

@GorginZ
Contributor

GorginZ commented Jun 26, 2023

The beta-autopilot-private-cluster and beta-autopilot-public-cluster modules set --autoprovisioning-network-tags, which is the only option for Autopilot.

I don't know how the tag we see in the GKE auto-created FW rules is generated. Presumably it is on the nodes, though, because the GKE-provisioned firewall rules do work, and I note that the firewall rules provisioned for a Service of type LoadBalancer's health checks have this same tag as a target_tag and also work. But this tag is inaccessible other than through the "reflections" we see of it in the FW rules, because we can't get our hands on the nodes with Autopilot clusters. We can, however, set auto-provisioning network tags, so we can add our cluster_network_tag there to make sure additional FW rules are effective at least.
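
Concretely, until the module attaches the tag itself, a user-side workaround along these lines is to pass the module-generated tag back in through network_tags. A minimal sketch of the module call, with hypothetical variables and an illustrative version pin (only the arguments relevant here are shown):

module "autopilot_cluster" {
  source  = "terraform-google-modules/kubernetes-engine/google//modules/beta-autopilot-private-cluster"
  version = "~> 27.0"   # illustrative; pin to the version you actually use

  project_id        = var.project_id        # assumed variables
  name              = var.cluster_name
  region            = var.region
  network           = var.network
  subnetwork        = var.subnetwork
  ip_range_pods     = var.ip_range_pods
  ip_range_services = var.ip_range_services

  add_cluster_firewall_rules = true
  firewall_priority          = 900

  # Workaround: attach the tag the module's firewall rules already
  # target ("gke-<cluster name>") to the auto-provisioned nodes.
  network_tags = ["gke-${var.cluster_name}"]
}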

@gustaff-weldon

gustaff-weldon commented Nov 21, 2023

I believe the issue here is that the module creates its own cluster_network_tag local and uses it as the target for its firewall rules. However, when creating the cluster, the module takes the user-provided network_tags variable and passes those to node_pool_auto_config:

dynamic "node_pool_auto_config" {
  for_each = length(var.network_tags) > 0 ? [1] : []
  content {
    network_tags {
      tags = var.network_tags
    }
  }
}

I believe the firewall rules should just use var.network_tags too.

@GorginZ
Contributor

GorginZ commented Nov 23, 2023

@gustaff-weldon Hey thanks for commenting, haven't really got any eyes on this so appreciate the engagement!

Hmm, I don't think we can use var.network_tags. Firstly, it's an optional variable: a user may have the additional FW vars toggled on but not have any additional network tags to add in var.network_tags, in which case we need to target something but have nothing.

The second thing is that it would be a bit obscuring. This var is understood as the list of network tags to attach to the nodes, and people may have several of them, for example to allow reaching internal services. We don't want to create new FW rules targeting those tags; they presumably already have a purpose. For instance, in the terraform-example-foundation, the allow-google-apis tag is one that sits on a bunch of VMs that are fairly locked down, and you may put it on your cluster too, but we wouldn't want all the instances carrying it to suddenly be targeted by a bunch of other FW rules as well. The var is there so users can make sure any other FW rules or tag values they already know they'll need end up on the nodes.

We already have the local cluster_network_tag: it has a meaningful name, it's meant to be a network tag for cluster networking, and the intra-cluster networking FW rules already target it, but it doesn't end up on the nodes.
I think this is just a simple oversight that impacts few people, because it only causes issues if someone has particularly hard-line FW rules in their network. I think we should just stick this tag on the nodes; I'm sure that's what was meant to happen but didn't, and since it generally doesn't break things for people it fell through the cracks.

IDK if you've taken a look, but lmk what you think: https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/pull/1655/files#diff-a3a07be3819af6f1a65ed243a613e69b544b7274d248a4b4d7d4d127aec40aee

@gustaff-weldon

gustaff-weldon commented Nov 23, 2023

@GorginZ thanks for getting back. What I suggested is only one of the possible solutions; I was thinking you might raise empty network tags as a concern :) And I understand that you do not want to create firewall rules targeting network_tags. In that scenario, what we need is just to make sure that when the user asks for firewall rules, local.cluster_network_tag gets appended to the tags in the node pool.

Your solution would work; I would also consider keeping the conditions at the locals level, e.g.:

locals {
  firewall_network_tag = "gke-gke-${var.env_code}-autopilot"
  cluster_network_tags = (var.enable_private_endpoint || var.enable_private_nodes) ? concat([local.firewall_network_tag], var.network_tags) : var.network_tags
}

Firewall rules would target firewall_network_tag; the node pool would iterate over cluster_network_tags.
This avoids duplicating the condition and also makes it clear, in my opinion, which tags will be used for the firewall and which for the cluster nodes.
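
A sketch of the consuming side, reusing the node_pool_auto_config shape quoted earlier so the node pool picks up the combined list:

dynamic "node_pool_auto_config" {
  for_each = length(local.cluster_network_tags) > 0 ? [1] : []
  content {
    network_tags {
      # Combined list: firewall_network_tag (when private endpoints/nodes
      # are enabled) plus any user-supplied tags.
      tags = local.cluster_network_tags
    }
  }
}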

@GorginZ
Contributor

GorginZ commented Dec 14, 2023

Thanks @gustaff-weldon. Conditions in locals is probably nicer.
I don't mind how we do it, so long as we get the cluster_network_tag on the cluster when it's needed. My PR hasn't got any eyes, so I'm not sure there's any interest in resolving this, sadly. It was closed for being "stale", but I may re-open it.

@EncryptShawn

Hey guys, I don't know if this will be helpful for you or not, but I see it's been giving you a hassle for a while; such an oversight by Google. So I automated the process and shared the script. It runs as a Docker container that fetches the current list of IPs on your cluster and then updates your firewall rule, removing any IPs not in the cluster and making sure the IPs for the current cluster nodes are present. It does not update the firewall rules unless there is a discrepancy.

It could probably be improved; I just needed something that worked so that our production cluster was not impacted every time Autopilot updated our nodes.

https://github.com/neuronic-ai/google-scripts
