Failed Calling Webhook #150

Closed
debakkerb opened this issue Nov 14, 2021 · 1 comment
@debakkerb

I'm trying to deploy the operator on top of a GKE cluster, but I'm running into issues when deploying the sample. The cluster is fairly standard at the moment, without many extra features enabled. I've deployed both cert-manager and the operator, and both are up and running without a problem.

Flink Operator System Namespace

k get po,svc -n flink-operator-system 
NAME                                                     READY   STATUS    RESTARTS   AGE
pod/flink-operator-controller-manager-5b4f96ddc5-dhlv5   2/2     Running   0          4h29m

NAME                                                        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/flink-operator-controller-manager-metrics-service   ClusterIP   10.150.117.28    <none>        8443/TCP   4h29m
service/flink-operator-webhook-service                      ClusterIP   10.150.234.114   <none>        443/TCP    4h29m

Cert Manager Namespace

k get po,svc -n cert-manager         
NAME                                          READY   STATUS    RESTARTS   AGE
pod/cert-manager-848f547974-fbtfd             1/1     Running   0          4h42m
pod/cert-manager-cainjector-54f4cc6b5-49p58   1/1     Running   0          4h42m
pod/cert-manager-webhook-58fb868868-4w4pr     1/1     Running   0          4h42m

NAME                           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/cert-manager           ClusterIP   10.150.202.12   <none>        9402/TCP   4h42m
service/cert-manager-webhook   ClusterIP   10.150.136.39   <none>        443/TCP    4h42m

However, when I try to deploy the sample session cluster, I get the following error message:

Error from server (InternalError): error when creating "./samples/flinkoperator_v1beta1_flinksessioncluster.yaml": Internal error occurred: failed calling webhook "mflinkcluster.flinkoperator.k8s.io": Post "https://flink-operator-webhook-service.flink-operator-system.svc:443/mutate-flinkoperator-k8s-io-v1beta1-flinkcluster?timeout=10s": dial tcp 10.100.0.10:9443: i/o timeout

Does anyone have any pointers? I've checked the services and they point to the correct endpoints. The selectors look fine, but I'm a bit stuck on how I can troubleshoot this efficiently.
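
A few generic checks that can help narrow down a webhook timeout like this one (a rough sketch; the namespace and service names are taken from the output above, and <webhook-config-name> is a placeholder for whatever the previous command returns):

# Confirm the webhook Service is actually backed by the operator pod
kubectl get endpoints flink-operator-webhook-service -n flink-operator-system

# Inspect the mutating webhook configuration the API server is trying to call
kubectl get mutatingwebhookconfigurations
kubectl describe mutatingwebhookconfiguration <webhook-config-name>

# Check the operator logs for certificate or startup errors
kubectl logs -n flink-operator-system deploy/flink-operator-controller-manager --all-containers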

This is my cluster config:

locals {
  gke_operator_sa_roles = [
    "roles/logging.logWriter",
    "roles/monitoring.metricWriter",
    "roles/monitoring.viewer",
  ]
}

resource "google_service_account" "cluster_identity" {
  project    = module.default.project_id
  account_id = "cluster-id"
}

resource "google_project_iam_member" "cluster_identity_permissions" {
  for_each = toset(local.gke_operator_sa_roles)
  project  = module.default.project_id
  member   = "serviceAccount:${google_service_account.cluster_identity.email}"
  role     = each.value
}

resource "google_container_cluster" "default" {
  project                  = module.default.project_id
  name                     = var.cluster_name
  remove_default_node_pool = true
  initial_node_count       = 1
  location                 = var.zone
  network                  = google_compute_network.default.self_link
  subnetwork               = google_compute_subnetwork.default.self_link
  min_master_version       = var.cluster_version

  release_channel {
    channel = var.channel
  }

  ip_allocation_policy {
    services_secondary_range_name = var.svc_range_name
    cluster_secondary_range_name  = var.pod_range_name
  }

  private_cluster_config {
    enable_private_endpoint = false
    enable_private_nodes    = true
    master_ipv4_cidr_block  = var.master_ipv4_cidr_block
  }

  node_config {
    service_account = google_service_account.cluster_identity.email
    oauth_scopes = [
      "storage-ro",
      "logging-write",
      "monitoring"
    ]
  }

  timeouts {
    create = "45m"
    update = "45m"
    delete = "45m"
  }

  depends_on = [
    google_project_iam_member.cluster_identity_permissions
  ]
}

resource "google_container_node_pool" "default" {
  provider   = google-beta
  project    = module.default.project_id
  name       = "${google_container_cluster.default.name}-nodes"
  cluster    = google_container_cluster.default.name
  location   = var.zone
  node_count = 1

  node_config {
    image_type   = "cos_containerd"
    machine_type = "n2-standard-4"

    service_account = google_service_account.cluster_identity.email
    oauth_scopes = [
      "storage-ro",
      "logging-write",
      "monitoring"
    ]

    disk_size_gb = 20
    disk_type    = "pd-ssd"
  }

  timeouts {
    create = "45m"
    update = "45m"
    delete = "45m"
  }

  depends_on = [
    google_project_iam_member.cluster_identity_permissions
  ]
}
@debakkerb
Author

I was a muppet who forgot to add the firewall rule needed to allow the control plane (master) to communicate with the nodes. So if someone runs into the same problem, you can add this rule to your network:

resource "google_compute_firewall" "master_node_access" {
  project = module.default.project_id
  name    = "allow-master-access"
  network = google_compute_network.default.name

  source_ranges           = [var.master_ipv4_cidr_block]
  target_service_accounts = [google_service_account.cluster_identity.email]

  allow {
    protocol = "tcp"
    ports    = ["443", "8443", "9443", "9402", "10250"]
  }
}
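
For anyone not managing the VPC with Terraform, a roughly equivalent rule can be created with gcloud (a sketch assuming the same network, control-plane CIDR, and node service account as above; the angle-bracket values are placeholders for your own environment):

gcloud compute firewall-rules create allow-master-access \
  --project=<project-id> \
  --network=<network-name> \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:443,tcp:8443,tcp:9443,tcp:9402,tcp:10250 \
  --source-ranges=<master-ipv4-cidr-block> \
  --target-service-accounts=<node-service-account-email>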
