This repository has been archived by the owner on Mar 5, 2024. It is now read-only.

Kiam server readiness/liveness probe issues (v2.8) #94

Closed
mattreyuk opened this issue Jun 18, 2018 · 53 comments
@mattreyuk

I have Kiam set up on a 3-master cluster. The server runs on two of the three masters with occasional readiness probe failures, but on the third node it is in a crash loop backoff with the liveness probe failing. The error message is the same for both:

probe failed: time="2018-06-18T22:10:41Z" level=fatal msg="error creating server gateway: error waiting for address being available in the balancer: context deadline exceeded"

Looking at the server logs for the failing pod, I just see it start up, log lots of found role and requesting credentials messages, and then finally stopped.
I'm using the image quay.io/uswitch/kiam:v2.8 and the standard probe definitions from /deploy/server.yaml (this is a kops 1.9 cluster, so I changed the cert path to /etc/ssl/certs and added master tolerations to get it to work).

@pingles pingles added the bug label Jun 19, 2018
@pingles
Contributor

pingles commented Jun 19, 2018

@mattreyuk To get more information, could you add the following environment variables to the health check command, please:

GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=8

#17 had a similar problem: a connection error inside the gRPC client, but the log data/errors were being masked.
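For anyone following along, one way to do that is to wrap the probe command in a shell so the variables apply to the health check itself. This is a minimal sketch against the probe definition from deploy/server.yaml, assuming the image provides /bin/sh (the Alpine-based images do); since exec probes inherit the container's environment, setting the variables on the container, as the next comment does, works as well:

livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    # run the health check with verbose gRPC client logging enabled
    - >-
      GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=8
      /health --cert=/etc/kiam/tls/server.pem --key=/etc/kiam/tls/server-key.pem
      --ca=/etc/kiam/tls/ca.pem --server-address=localhost:443
      --server-address-refresh=2s --timeout=5s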

@maauso

maauso commented Jun 19, 2018

Hi guys,
I'm hitting the same problem. I'm using EKS and created a separate worker stack to run the Kiam servers on, so that I can apply the server IAM role to those nodes.

I'm on Kiam 2.8, and to give the Kiam pods dedicated nodes I'm using the following YAML file:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  namespace: kube-system
  name: kiam-server
spec:
  replicas: 3
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9620"
      labels:
        app: kiam
        role: server
    spec:
      serviceAccountName: kiam-server
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          # labelSelector added so the anti-affinity term actually matches the server pods
          - topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                app: kiam
                role: server
      nodeSelector:
        beta.kubernetes.io/instance-type: t2.medium
      volumes:
        - name: ssl-certs
          hostPath:
            # for AWS linux or RHEL distros
            path: /etc/pki/ca-trust/extracted/pem/
            #path: /usr/share/ca-certificates
        - name: tls
          secret:
            secretName: kiam-server-tls
      containers:
        - name: kiam
          image: quay.io/uswitch/kiam:v2.8
          imagePullPolicy: Always
          env:
          - name: GRPC_GO_LOG_SEVERITY_LEVEL
            value: "info"
          - name: GRPC_GO_LOG_VERBOSITY_LEVEL
            value: "8"
          command:
            - /server
          args:
            - --json-log
            - --bind=0.0.0.0:443
            - --cert=/etc/kiam/tls/server.pem
            - --key=/etc/kiam/tls/server-key.pem
            - --ca=/etc/kiam/tls/ca.pem
            - --role-base-arn-autodetect
            - --sync=1m
            - --prometheus-listen-addr=0.0.0.0:9620
            - --prometheus-sync-interval=5s
          volumeMounts:
            - mountPath: /etc/ssl/certs
              name: ssl-certs
            - mountPath: /etc/kiam/tls
              name: tls
          livenessProbe:
            exec:
              command:
              - /health
              - --cert=/etc/kiam/tls/server.pem
              - --key=/etc/kiam/tls/server-key.pem
              - --ca=/etc/kiam/tls/ca.pem
              - --server-address=localhost:443
              - --server-address-refresh=2s
              - --timeout=5s
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 10
          readinessProbe:
            exec:
              command:
              - /health
              - --cert=/etc/kiam/tls/server.pem
              - --key=/etc/kiam/tls/server-key.pem
              - --ca=/etc/kiam/tls/ca.pem
              - --server-address=localhost:443
              - --server-address-refresh=2s
              - --timeout=5s
            initialDelaySeconds: 3
            periodSeconds: 10
            timeoutSeconds: 10

But I get the same error:

Readiness probe failed: INFO: 2018/06/19 12:39:09 ccBalancerWrapper: updating state and picker called by balancer: IDLE, 0xc4203d1380 INFO: 2018/06/19 12:39:09 dialing to target with scheme: "" INFO: 2018/06/19 12:39:09 could not get resolver for scheme: "" INFO: 2018/06/19 12:39:09 balancerWrapper: is pickfirst: false time="2018-06-19T12:39:09Z" level=fatal msg="error creating server gateway: error waiting for address being available in the balancer: context deadline exceeded"

@pingles
Contributor

pingles commented Jun 19, 2018

@maauso OK, so I wonder if this is related to the fix we made in #88. Is there any way you could switch to v2.7 and see if that works? (That was before the balancer change was made.)

@maauso

maauso commented Jun 19, 2018

Hi @pingles, of course. I used the same YAML but only changed the version, and got the following error:

Failed create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "kiam-server-x4wgm_kube-system" network: rpc error: code = Unavailable desc = grpc: the connection is unavailable

Remember that I'm using EKS with amazon-vpc-cni.

I also noticed that I get both errors when I run two different node groups connected to the same EKS cluster, even though in theory the only difference between them is the instance resources, not the networking.
https://docs.aws.amazon.com/eks/latest/userguide/launch-workers.html


weird behavior XD

@pingles
Contributor

pingles commented Jun 19, 2018

@maauso could you add the additional gRPC environment variables so we can understand more about what's happening, please? If it's the same as #17 then it's likely to do with TLS verification.

@pingles
Contributor

pingles commented Jun 19, 2018

GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=8

is the thing to add to the health check command

@maauso

maauso commented Jun 19, 2018

@pingles with version 2.7 and adding

          - name: GRPC_GO_LOG_SEVERITY_LEVEL
            value: "info"
          - name: GRPC_GO_LOG_VERBOSITY_LEVEL
            value: "8"

the error is:

Readiness probe failed: INFO: 2018/06/19 14:57:49 ccBalancerWrapper: updating state and picker called by balancer: IDLE, 0xc4201df380 INFO: 2018/06/19 14:57:49 dialing to target with scheme: "" INFO: 2018/06/19 14:57:49 could not get resolver for scheme: "" time="2018-06-19T14:57:49Z" level=warning msg="error checking health: rpc error: code = Unavailable desc = there is no address available" 

Also, some Kiam server pods seem to be running but show warnings:

{"level":"info","msg":"started namespace cache controller","time":"2018-06-19T14:57:41Z"}
WARNING: 2018/06/19 14:57:47 transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:443->127.0.0.1:38888: read: connection reset by peer
WARNING: 2018/06/19 14:57:59 transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:443->127.0.0.1:38946: read: connection reset by peer
WARNING: 2018/06/19 14:58:09 transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:443->127.0.0.1:39000: read: connection reset by peer

And with version 2.8:

Readiness probe failed: INFO: 2018/06/19 15:06:36 ccBalancerWrapper: updating state and picker called by balancer: IDLE, 0xc420272ea0 INFO: 2018/06/19 15:06:36 dialing to target with scheme: "" INFO: 2018/06/19 15:06:36 could not get resolver for scheme: "" INFO: 2018/06/19 15:06:36 balancerWrapper: is pickfirst: false time="2018-06-19T15:06:36Z" level=fatal msg="error creating server gateway: error waiting for address being available in the balancer: context deadline exceeded"

Also, the kiam-server pods running version 2.8 show the same warning.

If you need more tests, let me know.

@pingles
Contributor

pingles commented Jun 19, 2018

@maauso Hmmm... I'd have expected more of the log output, like you see from the server. Could you try port-forwarding/running the command from another pod with the additional gRPC env vars? I still feel like this looks the same as the TLS error in #17.

@maauso

maauso commented Jun 19, 2018

@pingles OK, no problem. But first off, which version do you want me to run the test against?
Is the test you need to run

/ # GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=8 /health --cert=/etc/kiam/tls/server.pem --key=/etc/kiam/tls/server-key.pem --ca=/etc/kiam/tls/ca.pem --server-address=localhost:443 

inside the pod that has the problem?

@mattreyuk
Author

@pingles I added those values to the pod env and got the following:

Liveness probe failed: INFO: 2018/06/19 15:31:19 ccBalancerWrapper: updating state and picker called by balancer: IDLE, 0xc4203e87e0 
INFO: 2018/06/19 15:31:19 dialing to target with scheme: "" INFO: 2018/06/19 15:31:19 could not get resolver for scheme: "" 
INFO: 2018/06/19 15:31:19 balancerWrapper: is pickfirst: false 
INFO: 2018/06/19 15:31:19 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.localhost on 101.64.0.10:53: no such host. 
INFO: 2018/06/19 15:31:19 balancerWrapper: got update addr from Notify: [{[::1]:443 <nil>} {127.0.0.1:443 <nil>}] 
INFO: 2018/06/19 15:31:19 ccBalancerWrapper: new subconn: [{[::1]:443 0 <nil>}] 
INFO: 2018/06/19 15:31:19 ccBalancerWrapper: new subconn: [{127.0.0.1:443 0 <nil>}] 
INFO: 2018/06/19 15:31:19 balancerWrapper: handle subconn state change: 0xc4204e13e0, CONNECTING 
INFO: 2018/06/19 15:31:19 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203e87e0 
INFO: 2018/06/19 15:31:19 balancerWrapper: handle subconn state change: 0xc4204e1430, CONNECTING 
INFO: 2018/06/19 15:31:19 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203e87e0 
INFO: 2018/06/19 15:31:19 balancerWrapper: handle subconn state change: 0xc4204e1430, TRANSIENT_FAILURE 
INFO: 2018/06/19 15:31:19 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203e87e0 
INFO: 2018/06/19 15:31:19 balancerWrapper: handle subconn state change: 0xc4204e1430, CONNECTING 
INFO: 2018/06/19 15:31:19 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203e87e0 
INFO: 2018/06/19 15:31:19 balancerWrapper: handle subconn state change: 0xc4204e13e0, TRANSIENT_FAILURE 
INFO: 2018/06/19 15:31:19 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203e87e0 
INFO: 2018/06/19 15:31:19 balancerWrapper: handle subconn state change: 0xc4204e13e0, CONNECTING 
INFO: 2018/06/19 15:31:19 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203e87e0 time="2018-06-19T15:31:19Z" level=fatal msg="error creating server gateway: error waiting for address being available in the balancer: context deadline exceeded"

@pingles
Contributor

pingles commented Jun 19, 2018

@maauso Yeah, try that with 2.7 please.

@mattreyuk That's strange. If you increase the verbosity level, does anything else come up? It would be useful to find out what that TRANSIENT_FAILURE is. Also, could you confirm the hosts/alternate names etc. that were used in the TLS certs for the server? Did you generate them using the cfssl steps mentioned in the repo, or something else?

@mattreyuk
Author

I followed the instructions in /docs/TLS.md and used the *.json files in that directory; the only thing I changed was the names section in each to reflect my locale. The hosts section from server.json looks like:

    "hosts": [
        "kiam-server",
        "kiam-server:443",
        "localhost",
        "localhost:443",
        "localhost:9610"
    ],

I put the verbosity level up to 20 and I don't see anything different logged.

@maauso

maauso commented Jun 19, 2018

@pingles I ran the command inside the container:

/ # GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=8 /health --cert=/etc/kiam/tls/server.pem --key=/etc/kiam/tls/server-key.pem --ca=/etc/kiam/tls/ca.pem --server-address=localhost:443
INFO: 2018/06/19 16:22:18 ccBalancerWrapper: updating state and picker called by balancer: IDLE, 0xc4203452c0
INFO: 2018/06/19 16:22:18 dialing to target with scheme: ""
INFO: 2018/06/19 16:22:18 could not get resolver for scheme: ""
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available 
INFO: 2018/06/19 16:22:18 balancerWrapper: is pickfirst: false
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available 
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available 
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available 
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available 
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available 
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available 
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available 
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available 
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available 
FATA[0001] error retrieving health: rpc error: code = Unavailable desc = there is no address available 
/ # 

It's different from the error @mattdenner got.

@pingles
Contributor

pingles commented Jun 19, 2018

@mattreyuk could you confirm which release of Kiam you're running, please? If it's 2.8, could you move down to 2.7?

@maauso Very strange... not sure what's happening there; it looks like it's unable to resolve localhost. If you change to 127.0.0.1, do you get a TLS error rather than the "no address available"?

@maauso

maauso commented Jun 19, 2018

@pingles it seems it's the same error:

/ # GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=8 /health --cert=/etc/kiam/tls/server.pem --key=/etc/kiam/tls/server-key.pem --ca=/etc/kiam/tls/ca.pem --server-address=127.0.0.1:443
INFO: 2018/06/19 16:36:57 ccBalancerWrapper: updating state and picker called by balancer: IDLE, 0xc4200529c0
INFO: 2018/06/19 16:36:57 dialing to target with scheme: ""
INFO: 2018/06/19 16:36:57 could not get resolver for scheme: ""
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available 
INFO: 2018/06/19 16:36:57 balancerWrapper: is pickfirst: false
INFO: 2018/06/19 16:36:57 balancerWrapper: got update addr from Notify: [{127.0.0.1:443 <nil>}]
INFO: 2018/06/19 16:36:57 ccBalancerWrapper: new subconn: [{127.0.0.1:443 0  <nil>}]
INFO: 2018/06/19 16:36:57 balancerWrapper: handle subconn state change: 0xc420288070, CONNECTING
INFO: 2018/06/19 16:36:57 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4200529c0
INFO: 2018/06/19 16:36:57 balancerWrapper: handle subconn state change: 0xc420288070, TRANSIENT_FAILURE
INFO: 2018/06/19 16:36:57 ccBalancerWrapper: updating state and picker called by balancer: TRANSIENT_FAILURE, 0xc4200529c0
INFO: 2018/06/19 16:36:57 balancerWrapper: handle subconn state change: 0xc420288070, CONNECTING
INFO: 2018/06/19 16:36:57 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4200529c0
WARNING: 2018/06/19 16:36:57 Failed to dial 127.0.0.1:443: connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for kiam-server:443, localhost:443, localhost:9610, not 127.0.0.1:443"; please retry.
INFO: 2018/06/19 16:36:57 balancerWrapper: handle subconn state change: 0xc420288070, SHUTDOWN
INFO: 2018/06/19 16:36:57 ccBalancerWrapper: updating state and picker called by balancer: TRANSIENT_FAILURE, 0xc4200529c0
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available 
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available 
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available 
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available 
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available 
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available 
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available 
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available 
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available 
FATA[0001] error retrieving health: rpc error: code = Unavailable desc = there is no connection available 

I have a doubt: if I want to use Kiam on EKS I should use version 2.8, right? Because it's the version compatible with amazon-vpc-cni.

@pingles
Contributor

pingles commented Jun 19, 2018

@maauso Weird! I wonder if something isn't set properly with the cluster DNS on the pod? Switching to 127.0.0.1 will be fine, although you'd need to generate the cert with the additional alternate name.
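As a sketch, that means adding the loopback entries to the hosts list in server.json and regenerating the certs. Keeping the existing entries and adding new ones additively is the safe route, since these releases match the SAN against the host:port form the client dials (the 127.0.0.1 entries below are illustrative):

    "hosts": [
        "kiam-server",
        "kiam-server:443",
        "localhost",
        "localhost:443",
        "localhost:9610",
        "127.0.0.1",
        "127.0.0.1:443"
    ],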

@maauso

maauso commented Jun 19, 2018

@pingles Maybe, but I launched the workers using the official EKS solution; as you know, I don't have the ability to change any service like DNS or kube-proxy etc. in EKS.

For me, the worst part is that I only hit this "problem" when I have two types of EKS node groups running together. I opened a ticket with AWS to find out whether EKS is ready for this configuration.

If you need any tests, let me know. I'll also report back when I have an update from AWS.

@mattreyuk
Author

Running the server with the v2.7 image seems to work better for me. I still have one of my three master nodes failing health checks; when I run the command inside the container I get results like:

/ # GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=8 /health --cert=/etc/kiam/tls/server.pem --key=/etc/kiam/tls/server-key.pem --ca=/etc/kiam/tls/ca.pem --server-address=localhost:443
INFO: 2018/06/19 17:07:06 ccBalancerWrapper: updating state and picker called by balancer: IDLE, 0xc420079380
INFO: 2018/06/19 17:07:06 dialing to target with scheme: ""
INFO: 2018/06/19 17:07:06 could not get resolver for scheme: ""
WARN[0004] error checking health: rpc error: code = Unavailable desc = there is no address available 
INFO: 2018/06/19 17:07:06 balancerWrapper: is pickfirst: false
WARN[0004] error checking health: rpc error: code = Unavailable desc = there is no address available 
WARN[0004] error checking health: rpc error: code = Unavailable desc = there is no address available 
WARN[0004] error checking health: rpc error: code = Unavailable desc = there is no address available 
WARN[0004] error checking health: rpc error: code = Unavailable desc = there is no address available 
INFO: 2018/06/19 17:07:06 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.localhost on 101.64.0.10:53: no such host.
WARN[0004] error checking health: rpc error: code = Unavailable desc = there is no address available 
WARN[0005] error checking health: rpc error: code = Unavailable desc = there is no address available 
FATA[0006] error retrieving health: rpc error: code = Unavailable desc = there is no address available
 
/ # GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=8 /health --cert=/etc/kiam/tls/server.pem --key=/etc/kiam/tls/server-key.pem --ca=/etc/kiam/tls/ca.pem --server-address=localhost:443
INFO: 2018/06/19 17:07:30 ccBalancerWrapper: updating state and picker called by balancer: IDLE, 0xc4203c7320
INFO: 2018/06/19 17:07:30 dialing to target with scheme: ""
INFO: 2018/06/19 17:07:30 could not get resolver for scheme: ""
WARN[0007] error checking health: rpc error: code = Unavailable desc = there is no address available 
INFO: 2018/06/19 17:07:30 balancerWrapper: is pickfirst: false
INFO: 2018/06/19 17:07:30 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.localhost on 101.64.0.10:53: no such host.
INFO: 2018/06/19 17:07:30 balancerWrapper: got update addr from Notify: [{127.0.0.1:443 <nil>} {[::1]:443 <nil>}]
INFO: 2018/06/19 17:07:30 ccBalancerWrapper: new subconn: [{127.0.0.1:443 0  <nil>}]
INFO: 2018/06/19 17:07:30 ccBalancerWrapper: new subconn: [{[::1]:443 0  <nil>}]
INFO: 2018/06/19 17:07:30 balancerWrapper: handle subconn state change: 0xc420304e60, CONNECTING
INFO: 2018/06/19 17:07:30 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203c7320
INFO: 2018/06/19 17:07:30 balancerWrapper: handle subconn state change: 0xc420304eb0, CONNECTING
INFO: 2018/06/19 17:07:30 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203c7320
INFO: 2018/06/19 17:07:30 balancerWrapper: handle subconn state change: 0xc420304eb0, TRANSIENT_FAILURE
INFO: 2018/06/19 17:07:30 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203c7320
INFO: 2018/06/19 17:07:30 balancerWrapper: handle subconn state change: 0xc420304eb0, CONNECTING
INFO: 2018/06/19 17:07:30 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203c7320
INFO: 2018/06/19 17:07:30 balancerWrapper: handle subconn state change: 0xc420304e60, TRANSIENT_FAILURE
INFO: 2018/06/19 17:07:30 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203c7320
INFO: 2018/06/19 17:07:30 balancerWrapper: handle subconn state change: 0xc420304e60, CONNECTING
INFO: 2018/06/19 17:07:30 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203c7320
WARN[0007] error checking health: rpc error: code = Unavailable desc = there is no connection available 
WARN[0008] error checking health: rpc error: code = Unavailable desc = there is no connection available 
INFO: 2018/06/19 17:07:31 balancerWrapper: handle subconn state change: 0xc420304e60, READY
INFO: 2018/06/19 17:07:31 ccBalancerWrapper: updating state and picker called by balancer: READY, 0xc4203c7320
INFO: 2018/06/19 17:07:31 balancerWrapper: handle subconn state change: 0xc420304eb0, READY
INFO: 2018/06/19 17:07:31 ccBalancerWrapper: updating state and picker called by balancer: READY, 0xc4203c7320
INFO[0008] healthy: ok  
                                
/ # GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=8 /health --cert=/etc/kiam/tls/server.pem --key=/etc/kiam/tls/server-key.pem --ca=/etc/kiam/tls/ca.pem --server-address=localhost:443
INFO: 2018/06/19 17:12:18 ccBalancerWrapper: updating state and picker called by balancer: IDLE, 0xc42039d320
INFO: 2018/06/19 17:12:18 dialing to target with scheme: ""
INFO: 2018/06/19 17:12:18 could not get resolver for scheme: ""
WARN[0018] error checking health: rpc error: code = Unavailable desc = there is no address available 
INFO: 2018/06/19 17:12:18 balancerWrapper: is pickfirst: false
WARN[0019] error checking health: rpc error: code = Unavailable desc = there is no address available 
FATA[0019] error retrieving health: rpc error: code = Unavailable desc = there is no address available 

The failing ones do subjectively seem to take longer to run; is there a timeout value I can try increasing?

@mattreyuk
Author

I changed to --timeout=7s and that seems to have taken care of my failing probes.
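For reference, the change is just the --timeout flag in the two probe definitions from deploy/server.yaml (liveness shown here; the readiness probe takes the same flag). Note that Kubernetes' own timeoutSeconds needs to stay above the health command's timeout so the kubelet doesn't kill the probe before the command can report:

livenessProbe:
  exec:
    command:
    - /health
    - --cert=/etc/kiam/tls/server.pem
    - --key=/etc/kiam/tls/server-key.pem
    - --ca=/etc/kiam/tls/ca.pem
    - --server-address=localhost:443
    - --server-address-refresh=2s
    - --timeout=7s
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 10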

@pingles
Contributor

pingles commented Jun 19, 2018

7 seconds sounds like a long time to start; maybe it's taking longer to initialise the pod caches. Do you have lots and lots of running pods? Does the server have sufficient CPU/memory allocated?

I'll see if I can dig more into the gRPC client code to figure out why it's taking so long to resolve/populate the address, but glad you managed to get it running at least!

@mattreyuk
Author

It's not just at startup: when I run the /health command manually, it fails intermittently well after the pod has started. I can run a few checks in a row and they'll return immediately, but then one will sit there with no output for a second or so and then produce the log stream with FATAL at the end.

The server pods show max CPU usage of about 0.33 each and roughly 33Mi of memory.
I have 164 pods running and 295 terminated (pods from cron jobs).

@ghost

ghost commented Jun 20, 2018

I am also running into this problem on EKS with only a single worker node group. Sometimes when the cluster autoscaler creates a new node, the liveness probe fails for the new agent pod.

@moofish32
Contributor

moofish32 commented Jun 29, 2018

Piling on a bit here -- after getting all the logging enabled I found this warning:
WARNING: 2018/06/29 22:11:17 grpc: addrConn.createTransport failed to connect to {127.0.0.1:443 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for kiam-server:443, localhost:443, localhost:9610, not localhost". Reconnecting...
I fixed this by changing my cert generation script to include only localhost, with the ports removed. This is now working. I don't know what changed, though, because this worked previously, so it's a bit confusing.
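For anyone making the same change, regenerating and redeploying the server cert follows the cfssl flow from docs/TLS.md. A sketch, with file and secret names assumed from the repo docs and the Deployment above:

# regenerate server.pem/server-key.pem from the edited server.json
cfssl gencert -ca=ca.pem -ca-key=ca-key.pem server.json | cfssljson -bare server
# replace the secret the server pods mount at /etc/kiam/tls
kubectl -n kube-system delete secret kiam-server-tls
kubectl -n kube-system create secret generic kiam-server-tls \
  --from-file=ca.pem --from-file=server.pem --from-file=server-key.pem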

@moofish32
Contributor

Can this be closed, given https://github.com/uswitch/kiam/blob/master/docs/server.json#L5? (We could delete the port-specific entries, I believe, but I'll leave that to another test to validate.)

@wpalmeri

wpalmeri commented Jul 3, 2018

Chiming in: my team is also seeing the health check fail repeatedly on 2.8, resulting in CrashLoopBackoff for the kiam-server pods.

@pingles
Contributor

pingles commented Jul 3, 2018

@wpalmeri could you downgrade to 2.7 and see if that works please?

@moofish32 the TLS stuff was changed for 2.8, IIRC: #86

@wpalmeri

wpalmeri commented Jul 4, 2018

@pingles yes, we are stuck on 2.7 for now. We were interested in the 2.8 upgrade as we are also seeing the intermittent health check failures that 2.8 looked to fix.

@TattiQ

TattiQ commented Aug 2, 2018

Hi guys, I also experienced this issue with Kiam 2.8 and health checks failing, and reverted to 2.7. Maybe it makes sense to add this one to the milestone as well? https://github.com/uswitch/kiam/milestone/1

@TattiQ

TattiQ commented Aug 2, 2018

Using 2.7 now, but kind of stuck with this issue, which is apparently solved in 2.8 =(

INFO: 2018/08/02 09:54:05 could not get resolver for scheme: ""
INFO: 2018/08/02 09:54:05 balancerWrapper: is pickfirst: false
WARN[0001] error checking health: rpc error: code = Unavailable desc = there is no address available
WARN[0001] error checking health: rpc error: code = Unavailable desc = there is no address available
INFO: 2018/08/02 09:54:05 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.localhost on 100.64.0.10:53: no such host.
WARN[0001] error checking health: rpc error: code = Unavailable desc = there is no address available
WARN[0001] error checking health: rpc error: code = Unavailable desc = there is no address available
WARN[0002] error checking health: rpc error: code = Unavailable desc = there is no address available

@pingles pingles added this to the 3.0 milestone Aug 2, 2018
@pingles
Contributor

pingles commented Aug 2, 2018

@TattiQ Yep, sorry; this was also reported in #128.

@TattiQ

TattiQ commented Aug 2, 2018

Thanks, @pingles !

@pingles
Contributor

pingles commented Aug 2, 2018

I've also pruned one of the items from 3.0, so the last big thing is finishing #35, which was picked up in #113.

@wpalmeri

@pingles Any word on the 3.0 release? The sporadic readiness probe failures are enough to scare some of our users into preferring user creds over IAM roles. Thanks!

@pingles
Contributor

pingles commented Aug 11, 2018 via email

@kylegoch

kylegoch commented Sep 7, 2018

@pingles What version of Kubernetes are you using for testing?

I have the same issue as @maauso in a 1.11.2 cluster, using kiam 2.7.

Basically, Kiam acts like it can't resolve localhost, so I'm not sure if it's a Kiam issue or a kube-dns issue. Kiam seems to be the common denominator in our clusters, though, as we are not using EKS (and EKS runs on 1.10).

@kylegoch

kylegoch commented Sep 7, 2018

Welp, we just solved this on our end. It wasn't Kiam; disregard the above. It was an issue with kube-dns and security groups. We fixed that and Kiam is all happy now.

@roffe
Contributor

roffe commented Sep 9, 2018

localhost in a Kubernetes cluster goes through the cluster DNS service due to ndots:5; you need to use 127.0.0.1 to get around this.

https://pracucci.com/kubernetes-dns-resolution-ndots-options-and-why-it-may-affect-application-performances.html
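For illustration, a typical pod /etc/resolv.conf looks something like this (nameserver and search domains vary per cluster). With ndots:5, any name containing fewer than five dots, localhost included, is first tried against each search domain via the cluster DNS service, which is where the failed _grpclb._tcp.localhost SRV lookups in the logs above come from:

nameserver 100.64.0.10
search kube-system.svc.cluster.local svc.cluster.local cluster.local
options ndots:5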

@cspargo

cspargo commented Sep 24, 2018

I was also experiencing this with 2.8. I noticed that the YAML deployment config here https://github.com/uswitch/kiam/blob/master/deploy/server.yaml#L60 sets --gateway-timeout-creation=1s, but the Helm chart (which I am using) defaults the timeout to 50ms: https://github.com/helm/charts/blob/a158ba574ebc45aef3749b79bdaded7ecacea363/stable/kiam/values.yaml#L159

I increased the gatewayTimeoutCreation Helm parameter to "1s" and the problem went away.
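If you're deploying via the chart, the override looks something like this (the value path is assumed from the linked values.yaml; the release name is a placeholder):

helm upgrade --install kiam stable/kiam \
  --set server.gatewayTimeoutCreation=1s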

@wpalmeri

wpalmeri commented Oct 8, 2018

@pingles Any update on the 3.0 release? We are still hoping to upgrade. Thanks!

@pingles
Contributor

pingles commented Oct 9, 2018

:-) Good question. For all previous releases I liked that we had "skin in the game" and wouldn't release Kiam until we'd been exposed to the same code for a while on our production clusters. @uswitch/cloud and @tombooth may be able to give some indication.

There are no code changes anticipated, however, beyond what's currently in master or v3-rc1.
The remaining changes were documentation-related which, again, we wanted to expose ourselves to directly.

Remaining:

@pingles
Contributor

pingles commented Oct 9, 2018

I just spoke with @tombooth and we're going to plan the rollout this Friday, so hopefully v3 in the next fortnight (by 26th October).

@saplla

saplla commented Oct 9, 2018

FYI I'm seeing this on 2.7.

@chrisduong

chrisduong commented Oct 22, 2018

I was also experiencing this with 2.8. I noticed that the YAML deployment config here https://github.com/uswitch/kiam/blob/master/deploy/server.yaml#L60 sets --gateway-timeout-creation=1s, but the Helm chart (which I am using) defaults the timeout to 50ms: https://github.com/helm/charts/blob/a158ba574ebc45aef3749b79bdaded7ecacea363/stable/kiam/values.yaml#L159

I increased the gatewayTimeoutCreation Helm parameter to "1s" and the problem went away.

This seems to fix it. Thanks.

@pingles
Contributor

pingles commented Dec 6, 2018

I'm going to close this issue; the mitigation is to set the gateway parameters a little more generously.

@wpalmeri I'm in the process of getting the last bits of documentation finished for v3, so the release is imminent: #195

@Deepak1100

Hi,
after upgrading to 3.0 I am getting the same issue, and I have tried increasing gateway-timeout-creation to bigger values as well, like 5s:

             - /kiam
             - health
             - --cert=/etc/kiam/tls/cert
             - --key=/etc/kiam/tls/key
             - --ca=/etc/kiam/tls/ca
             - --server-address=127.0.0.1:443
             - --timeout=5s
             - --gateway-timeout-creation=2s

error creating server gateway: error dialing grpc server: context deadline exceeded

@pingles
Contributor

pingles commented Dec 18, 2018

@Deepak1100 There's quite a bit of detail above for diagnosing the problem. I'd suggest starting with the environment variables and capturing some log data to figure out why it's not connecting. Given that it's the health command failing, it's almost certainly a TLS issue, as you specify 127.0.0.1, which rules out networking and DNS.
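Concretely, that looks something like the following run against a failing server pod (the pod name is a placeholder; the flags mirror the v3 probe definition above):

kubectl -n kube-system exec kiam-server-xxxxx -- sh -c \
  'GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=8 \
   /kiam health --cert=/etc/kiam/tls/cert --key=/etc/kiam/tls/key \
   --ca=/etc/kiam/tls/ca --server-address=127.0.0.1:443 \
   --timeout=5s --gateway-timeout-creation=2s'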

@Deepak1100

@pingles Yes, that was the issue; it's now resolved. Thanks!

@2rs2ts

2rs2ts commented Feb 26, 2019

Not sure if it's related, but after upgrading to 3.0 we sometimes see the agent get these "context deadline exceeded" errors.

{"level":"info","msg":"configuring iptables","time":"2019-02-26T16:11:29Z"}
{"level":"info","msg":"started prometheus metric listener 0.0.0.0:9621","time":"2019-02-26T16:11:29Z"}
INFO: 2019/02/26 16:11:29 parsed scheme: "dns"
INFO: 2019/02/26 16:11:29 grpc: failed dns TXT record lookup due to lookup kiam-server-v3.cloud-controllers.svc.cluster.local. on 192.168.252.10:53: server misbehaving.
INFO: 2019/02/26 16:11:29 ccResolverWrapper: got new service config:
WARNING: 2019/02/26 16:11:29 grpc: parseServiceConfig error unmarshaling  due to unexpected end of JSON input
INFO: 2019/02/26 16:11:29 ccResolverWrapper: sending new addresses to cc: [{10.42.23.59:443 1 10-42-23-59.kiam-server-v3.cloud-controllers.svc.cluster.local. <nil>} {10.42.17.72:443 1 10-42-17-72.kiam-server-v3.cloud-controllers.svc.cluster.local. <nil>} {10.42.23.140:443 1 10-42-23-140.kiam-server-v3.cloud-controllers.svc.cluster.local. <nil>} {10.42.23.140:443 0  <nil>} {10.42.17.72:443 0  <nil>} {10.42.23.59:443 0  <nil>}]
INFO: 2019/02/26 16:11:29 base.baseBalancer: got new resolved addresses:  [{10.42.23.59:443 1 10-42-23-59.kiam-server-v3.cloud-controllers.svc.cluster.local. <nil>} {10.42.17.72:443 1 10-42-17-72.kiam-server-v3.cloud-controllers.svc.cluster.local. <nil>} {10.42.23.140:443 1 10-42-23-140.kiam-server-v3.cloud-controllers.svc.cluster.local. <nil>} {10.42.23.140:443 0  <nil>} {10.42.17.72:443 0  <nil>} {10.42.23.59:443 0  <nil>}]
INFO: 2019/02/26 16:11:29 base.baseBalancer: handle SubConn state change: 0xc4201b6e20, CONNECTING
INFO: 2019/02/26 16:11:29 base.baseBalancer: handle SubConn state change: 0xc4201b6e70, CONNECTING
INFO: 2019/02/26 16:11:29 base.baseBalancer: handle SubConn state change: 0xc4201b6e90, CONNECTING
INFO: 2019/02/26 16:11:29 base.baseBalancer: handle SubConn state change: 0xc4201b6ec0, CONNECTING
INFO: 2019/02/26 16:11:29 base.baseBalancer: handle SubConn state change: 0xc4201b7040, CONNECTING
INFO: 2019/02/26 16:11:29 base.baseBalancer: handle SubConn state change: 0xc4201b70c0, CONNECTING
{"level":"fatal","msg":"error creating server gateway: error dialing grpc server: context deadline exceeded","time":"2019-02-26T16:11:35Z"}

We give it --gateway-timeout-creation=5s, which seems more than generous...

@2rs2ts

2rs2ts commented Feb 27, 2019

@moofish32 BTW, I just wanted to say: removing the port numbers from the cert SANs breaks 2.7.

@moofish32
Contributor

@2rs2ts - Yes, it will. I think that's all documented in the 3.0 upgrade; I'm not sure about 2.8 anymore. Are you thinking the team should add more docs, or just correcting my earlier statement on close? I was definitely unaware of where this breaking change was landing. Happy to delete the comment if that would make you happy.

@2rs2ts

2rs2ts commented Feb 28, 2019

I was just commenting in case you or anyone else needed to know. I ended up shooting myself in the foot by taking those SANs out while upgrading from 2.7 to 3.0 (the upgrade docs do say you need to do things additively, but your comment made it sound like you should actually remove the old SANs for stability reasons), so if it happened to me maybe it would happen to someone else.
