Failed Calling Webhook #150

Closed
debakkerb opened this issue Nov 14, 2021 · 1 comment
@debakkerb

I'm trying to deploy the operator on top of a GKE cluster, but I'm running into issues when deploying the sample. The cluster is fairly standard at the moment, without many extra features enabled. I've deployed both cert-manager and the operator, and both are up and running without a problem.

Flink Operator System Namespace

k get po,svc -n flink-operator-system 
NAME                                                     READY   STATUS    RESTARTS   AGE
pod/flink-operator-controller-manager-5b4f96ddc5-dhlv5   2/2     Running   0          4h29m

NAME                                                        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/flink-operator-controller-manager-metrics-service   ClusterIP   10.150.117.28    <none>        8443/TCP   4h29m
service/flink-operator-webhook-service                      ClusterIP   10.150.234.114   <none>        443/TCP    4h29m

Cert Manager Namespace

k get po,svc -n cert-manager         
NAME                                          READY   STATUS    RESTARTS   AGE
pod/cert-manager-848f547974-fbtfd             1/1     Running   0          4h42m
pod/cert-manager-cainjector-54f4cc6b5-49p58   1/1     Running   0          4h42m
pod/cert-manager-webhook-58fb868868-4w4pr     1/1     Running   0          4h42m

NAME                           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/cert-manager           ClusterIP   10.150.202.12   <none>        9402/TCP   4h42m
service/cert-manager-webhook   ClusterIP   10.150.136.39   <none>        443/TCP    4h42m

However, when I try to deploy the sample session cluster, I get the following error message:

Error from server (InternalError): error when creating "./samples/flinkoperator_v1beta1_flinksessioncluster.yaml": Internal error occurred: failed calling webhook "mflinkcluster.flinkoperator.k8s.io": Post "https://flink-operator-webhook-service.flink-operator-system.svc:443/mutate-flinkoperator-k8s-io-v1beta1-flinkcluster?timeout=10s": dial tcp 10.100.0.10:9443: i/o timeout

Does anyone have any pointers? I've checked the services and they point to the correct endpoints. The selectors look fine, but I'm a bit stuck on how I can troubleshoot this efficiently.
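
A few generic checks that can help narrow down a webhook timeout like this one (a rough sketch; the namespace and service names are taken from the output above, and <webhook-config-name> is a placeholder for whatever the previous command returns):

# Confirm the webhook Service is actually backed by the operator pod
kubectl get endpoints flink-operator-webhook-service -n flink-operator-system

# Inspect the mutating webhook configuration the API server is trying to call
kubectl get mutatingwebhookconfigurations
kubectl describe mutatingwebhookconfiguration <webhook-config-name>

# Check the operator logs for certificate or startup errors
kubectl logs -n flink-operator-system deploy/flink-operator-controller-manager --all-containers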

This is my cluster config:

locals {
  gke_operator_sa_roles = [
    "roles/logging.logWriter",
    "roles/monitoring.metricWriter",
    "roles/monitoring.viewer",
  ]
}

resource "google_service_account" "cluster_identity" {
  project    = module.default.project_id
  account_id = "cluster-id"
}

resource "google_project_iam_member" "cluster_identity_permissions" {
  for_each = toset(local.gke_operator_sa_roles)
  project  = module.default.project_id
  member   = "serviceAccount:${google_service_account.cluster_identity.email}"
  role     = each.value
}

resource "google_container_cluster" "default" {
  project                  = module.default.project_id
  name                     = var.cluster_name
  remove_default_node_pool = true
  initial_node_count       = 1
  location                 = var.zone
  network                  = google_compute_network.default.self_link
  subnetwork               = google_compute_subnetwork.default.self_link
  min_master_version       = var.cluster_version

  release_channel {
    channel = var.channel
  }

  ip_allocation_policy {
    services_secondary_range_name = var.svc_range_name
    cluster_secondary_range_name  = var.pod_range_name
  }

  private_cluster_config {
    enable_private_endpoint = false
    enable_private_nodes    = true
    master_ipv4_cidr_block  = var.master_ipv4_cidr_block
  }

  node_config {
    service_account = google_service_account.cluster_identity.email
    oauth_scopes = [
      "storage-ro",
      "logging-write",
      "monitoring"
    ]
  }

  timeouts {
    create = "45m"
    update = "45m"
    delete = "45m"
  }

  depends_on = [
    google_project_iam_member.cluster_identity_permissions
  ]
}

resource "google_container_node_pool" "default" {
  provider   = google-beta
  project    = module.default.project_id
  name       = "${google_container_cluster.default.name}-nodes"
  cluster    = google_container_cluster.default.name
  location   = var.zone
  node_count = 1

  node_config {
    image_type   = "cos_containerd"
    machine_type = "n2-standard-4"

    service_account = google_service_account.cluster_identity.email
    oauth_scopes = [
      "storage-ro",
      "logging-write",
      "monitoring"
    ]

    disk_size_gb = 20
    disk_type    = "pd-ssd"
  }

  timeouts {
    create = "45m"
    update = "45m"
    delete = "45m"
  }

  depends_on = [
    google_project_iam_member.cluster_identity_permissions
  ]
}
@debakkerb
Author

I was a muppet who forgot to add the firewall rule needed to allow the control plane (master) to communicate with the nodes. So if someone runs into the same problem, you can add this rule to your network:

resource "google_compute_firewall" "master_node_access" {
  project = module.default.project_id
  name    = "allow-master-access"
  network = google_compute_network.default.name

  source_ranges           = [var.master_ipv4_cidr_block]
  target_service_accounts = [google_service_account.cluster_identity.email]

  allow {
    protocol = "tcp"
    ports    = ["443", "8443", "9443", "9402", "10250"]
  }
}
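
For anyone not managing the VPC with Terraform, a roughly equivalent rule can be created with gcloud (a sketch assuming the same network, control-plane CIDR, and node service account as above; the angle-bracket values are placeholders for your own environment):

gcloud compute firewall-rules create allow-master-access \
  --project=<project-id> \
  --network=<network-name> \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:443,tcp:8443,tcp:9443,tcp:9402,tcp:10250 \
  --source-ranges=<master-ipv4-cidr-block> \
  --target-service-accounts=<node-service-account-email>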
