This repository has been archived by the owner on Mar 5, 2024. It is now read-only.

Kiam server readiness/liveness probe issues (v2.8) #94

Closed
mattreyuk opened this issue Jun 18, 2018 · 53 comments
@mattreyuk

I have Kiam set up on a 3-master cluster. The server runs on two of the three masters with occasional readiness probe failures, but on the third node it is in a crash loop backoff with the liveness probe failing. The error message is the same for both:

probe failed: time="2018-06-18T22:10:41Z" level=fatal msg="error creating server gateway: error waiting for address being available in the balancer: context deadline exceeded"

Looking at the server logs for the failing pod, I just see it start up, log lots of found role and requesting credentials messages, and then finally stopped.
I'm using the image quay.io/uswitch/kiam:v2.8 and the standard probe definitions from /deploy/server.yaml (this is a kops 1.9 cluster, so I changed the cert path to /etc/ssl/certs and added master tolerations to get it to work).

@pingles pingles added the bug label Jun 19, 2018
@pingles
Contributor

pingles commented Jun 19, 2018

@mattreyuk To get more information, could you add the following environment variables to the health check command, please:

GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=8

#17 had a similar problem: a connection error inside the gRPC client, but the log data/errors were being masked.
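For anyone following along, one way to do that is to wrap the probe command in a shell so the variables apply to the health check itself. This is a minimal sketch against the probe definition from deploy/server.yaml, assuming the image provides /bin/sh (the Alpine-based images do); since exec probes inherit the container's environment, setting the variables on the container, as the next comment does, works as well:

livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    # run the health check with verbose gRPC client logging enabled
    - >-
      GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=8
      /health --cert=/etc/kiam/tls/server.pem --key=/etc/kiam/tls/server-key.pem
      --ca=/etc/kiam/tls/ca.pem --server-address=localhost:443
      --server-address-refresh=2s --timeout=5s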

@maauso

maauso commented Jun 19, 2018

Hi guys,
I'm hitting the same problem. I'm using EKS and created a separate worker stack to run the Kiam servers on, so that I can apply the server IAM role to those nodes.

I'm on Kiam 2.8, and to give the Kiam pods dedicated nodes I'm using the following YAML file:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  namespace: kube-system
  name: kiam-server
spec:
  replicas: 3
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9620"
      labels:
        app: kiam
        role: server
    spec:
      serviceAccountName: kiam-server
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          # labelSelector added so the anti-affinity term actually matches the server pods
          - topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                app: kiam
                role: server
      nodeSelector:
        beta.kubernetes.io/instance-type: t2.medium
      volumes:
        - name: ssl-certs
          hostPath:
            # for AWS linux or RHEL distros
            path: /etc/pki/ca-trust/extracted/pem/
            #path: /usr/share/ca-certificates
        - name: tls
          secret:
            secretName: kiam-server-tls
      containers:
        - name: kiam
          image: quay.io/uswitch/kiam:v2.8
          imagePullPolicy: Always
          env:
          - name: GRPC_GO_LOG_SEVERITY_LEVEL
            value: "info"
          - name: GRPC_GO_LOG_VERBOSITY_LEVEL
            value: "8"
          command:
            - /server
          args:
            - --json-log
            - --bind=0.0.0.0:443
            - --cert=/etc/kiam/tls/server.pem
            - --key=/etc/kiam/tls/server-key.pem
            - --ca=/etc/kiam/tls/ca.pem
            - --role-base-arn-autodetect
            - --sync=1m
            - --prometheus-listen-addr=0.0.0.0:9620
            - --prometheus-sync-interval=5s
          volumeMounts:
            - mountPath: /etc/ssl/certs
              name: ssl-certs
            - mountPath: /etc/kiam/tls
              name: tls
          livenessProbe:
            exec:
              command:
              - /health
              - --cert=/etc/kiam/tls/server.pem
              - --key=/etc/kiam/tls/server-key.pem
              - --ca=/etc/kiam/tls/ca.pem
              - --server-address=localhost:443
              - --server-address-refresh=2s
              - --timeout=5s
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 10
          readinessProbe:
            exec:
              command:
              - /health
              - --cert=/etc/kiam/tls/server.pem
              - --key=/etc/kiam/tls/server-key.pem
              - --ca=/etc/kiam/tls/ca.pem
              - --server-address=localhost:443
              - --server-address-refresh=2s
              - --timeout=5s
            initialDelaySeconds: 3
            periodSeconds: 10
            timeoutSeconds: 10

But I get the same error:

Readiness probe failed: INFO: 2018/06/19 12:39:09 ccBalancerWrapper: updating state and picker called by balancer: IDLE, 0xc4203d1380 INFO: 2018/06/19 12:39:09 dialing to target with scheme: "" INFO: 2018/06/19 12:39:09 could not get resolver for scheme: "" INFO: 2018/06/19 12:39:09 balancerWrapper: is pickfirst: false time="2018-06-19T12:39:09Z" level=fatal msg="error creating server gateway: error waiting for address being available in the balancer: context deadline exceeded"

@pingles
Contributor

pingles commented Jun 19, 2018

@maauso OK, so I wonder if this is related to the fix we made in #88. Is there any way you could switch to v2.7 and see if that works? (That was before the balancer change was made.)

@maauso

maauso commented Jun 19, 2018

Hi @pingles, of course. I used the same YAML but only changed the version, and got the following error:

Failed create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "kiam-server-x4wgm_kube-system" network: rpc error: code = Unavailable desc = grpc: the connection is unavailable

Remember that I'm using EKS with amazon-vpc-cni.

I also noticed that I get both errors when I run two different node groups connected to the same EKS cluster, even though in theory the only difference between them is the instance resources, not the networking.
https://docs.aws.amazon.com/eks/latest/userguide/launch-workers.html


weird behavior XD

@pingles
Contributor

pingles commented Jun 19, 2018

@maauso could you add the additional gRPC environment variables so we can understand more about what's happening, please? If it's the same as #17 then it's likely to do with TLS verification.

@pingles
Contributor

pingles commented Jun 19, 2018

GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=8

is the thing to add to the health check command

@maauso

maauso commented Jun 19, 2018

@pingles with version 2.7 and adding

          - name: GRPC_GO_LOG_SEVERITY_LEVEL
            value: "info"
          - name: GRPC_GO_LOG_VERBOSITY_LEVEL
            value: "8"

the error is:

Readiness probe failed: INFO: 2018/06/19 14:57:49 ccBalancerWrapper: updating state and picker called by balancer: IDLE, 0xc4201df380 INFO: 2018/06/19 14:57:49 dialing to target with scheme: "" INFO: 2018/06/19 14:57:49 could not get resolver for scheme: "" time="2018-06-19T14:57:49Z" level=warning msg="error checking health: rpc error: code = Unavailable desc = there is no address available" 

Also, some Kiam server pods seem to be running but show warnings:

{"level":"info","msg":"started namespace cache controller","time":"2018-06-19T14:57:41Z"}
WARNING: 2018/06/19 14:57:47 transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:443->127.0.0.1:38888: read: connection reset by peer
WARNING: 2018/06/19 14:57:59 transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:443->127.0.0.1:38946: read: connection reset by peer
WARNING: 2018/06/19 14:58:09 transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:443->127.0.0.1:39000: read: connection reset by peer

And with version 2.8:

Readiness probe failed: INFO: 2018/06/19 15:06:36 ccBalancerWrapper: updating state and picker called by balancer: IDLE, 0xc420272ea0 INFO: 2018/06/19 15:06:36 dialing to target with scheme: "" INFO: 2018/06/19 15:06:36 could not get resolver for scheme: "" INFO: 2018/06/19 15:06:36 balancerWrapper: is pickfirst: false time="2018-06-19T15:06:36Z" level=fatal msg="error creating server gateway: error waiting for address being available in the balancer: context deadline exceeded"

Also, the kiam-server pods running version 2.8 show the same warning.

If you need more tests, let me know.

@pingles
Contributor

pingles commented Jun 19, 2018

@maauso Hmmm... I'd have expected more of the log output, like you see from the server. Could you try port-forwarding/running the command from another pod with the additional gRPC env vars? I still feel like this looks the same as the TLS error in #17.

@maauso

maauso commented Jun 19, 2018

@pingles OK, no problem. But first off, which version do you want me to run the test against?
Is the test you need to run

/ # GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=8 /health --cert=/etc/kiam/tls/server.pem --key=/etc/kiam/tls/server-key.pem --ca=/etc/kiam/tls/ca.pem --server-address=localhost:443 

inside the pod that has the problem?

@mattreyuk
Author

@pingles I added those values to the pod env and got the following:

Liveness probe failed: INFO: 2018/06/19 15:31:19 ccBalancerWrapper: updating state and picker called by balancer: IDLE, 0xc4203e87e0 
INFO: 2018/06/19 15:31:19 dialing to target with scheme: "" INFO: 2018/06/19 15:31:19 could not get resolver for scheme: "" 
INFO: 2018/06/19 15:31:19 balancerWrapper: is pickfirst: false 
INFO: 2018/06/19 15:31:19 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.localhost on 101.64.0.10:53: no such host. 
INFO: 2018/06/19 15:31:19 balancerWrapper: got update addr from Notify: [{[::1]:443 <nil>} {127.0.0.1:443 <nil>}] 
INFO: 2018/06/19 15:31:19 ccBalancerWrapper: new subconn: [{[::1]:443 0 <nil>}] 
INFO: 2018/06/19 15:31:19 ccBalancerWrapper: new subconn: [{127.0.0.1:443 0 <nil>}] 
INFO: 2018/06/19 15:31:19 balancerWrapper: handle subconn state change: 0xc4204e13e0, CONNECTING 
INFO: 2018/06/19 15:31:19 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203e87e0 
INFO: 2018/06/19 15:31:19 balancerWrapper: handle subconn state change: 0xc4204e1430, CONNECTING 
INFO: 2018/06/19 15:31:19 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203e87e0 
INFO: 2018/06/19 15:31:19 balancerWrapper: handle subconn state change: 0xc4204e1430, TRANSIENT_FAILURE 
INFO: 2018/06/19 15:31:19 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203e87e0 
INFO: 2018/06/19 15:31:19 balancerWrapper: handle subconn state change: 0xc4204e1430, CONNECTING 
INFO: 2018/06/19 15:31:19 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203e87e0 
INFO: 2018/06/19 15:31:19 balancerWrapper: handle subconn state change: 0xc4204e13e0, TRANSIENT_FAILURE 
INFO: 2018/06/19 15:31:19 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203e87e0 
INFO: 2018/06/19 15:31:19 balancerWrapper: handle subconn state change: 0xc4204e13e0, CONNECTING 
INFO: 2018/06/19 15:31:19 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203e87e0 time="2018-06-19T15:31:19Z" level=fatal msg="error creating server gateway: error waiting for address being available in the balancer: context deadline exceeded"

@pingles
Contributor

pingles commented Jun 19, 2018

@maauso Yeah, try that with 2.7 please.

@mattreyuk That's strange. If you increase the verbosity level, does anything else come up? It would be useful to find out what that TRANSIENT_FAILURE is. Also, could you confirm the hosts/alternate names etc. that were used in the TLS certs for the server? Did you generate them using the cfssl steps mentioned in the repo, or something else?

@mattreyuk
Author

I followed the instructions in /docs/TLS.md and used the *.json files in that directory; the only thing I changed was the names section in each to reflect my locale. The hosts section from server.json looks like:

    "hosts": [
        "kiam-server",
        "kiam-server:443",
        "localhost",
        "localhost:443",
        "localhost:9610"
    ],

I put the verbosity level up to 20 and I don't see anything different logged.

@maauso

maauso commented Jun 19, 2018

@pingles I ran the command inside the container:

/ # GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=8 /health --cert=/etc/kiam/tls/server.pem --key=/etc/kiam/tls/server-key.pem --ca=/etc/kiam/tls/ca.pem --server-address=localhost:443
INFO: 2018/06/19 16:22:18 ccBalancerWrapper: updating state and picker called by balancer: IDLE, 0xc4203452c0
INFO: 2018/06/19 16:22:18 dialing to target with scheme: ""
INFO: 2018/06/19 16:22:18 could not get resolver for scheme: ""
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available 
INFO: 2018/06/19 16:22:18 balancerWrapper: is pickfirst: false
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available 
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available 
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available 
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available 
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available 
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available 
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available 
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available 
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available 
FATA[0001] error retrieving health: rpc error: code = Unavailable desc = there is no address available 
/ # 

It's different from the error @mattdenner got.

@pingles
Contributor

pingles commented Jun 19, 2018

@mattreyuk could you confirm which release of Kiam you're running, please? If it's 2.8, could you move down to 2.7?

@maauso Very strange... not sure what's happening there; it looks like it's unable to resolve localhost. If you change to 127.0.0.1, do you get a TLS error rather than the "no address available"?

@maauso

maauso commented Jun 19, 2018

@pingles it seems it's the same error:

/ # GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=8 /health --cert=/etc/kiam/tls/server.pem --key=/etc/kiam/tls/server-key.pem --ca=/etc/kiam/tls/ca.pem --server-address=127.0.0.1:443
INFO: 2018/06/19 16:36:57 ccBalancerWrapper: updating state and picker called by balancer: IDLE, 0xc4200529c0
INFO: 2018/06/19 16:36:57 dialing to target with scheme: ""
INFO: 2018/06/19 16:36:57 could not get resolver for scheme: ""
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available 
INFO: 2018/06/19 16:36:57 balancerWrapper: is pickfirst: false
INFO: 2018/06/19 16:36:57 balancerWrapper: got update addr from Notify: [{127.0.0.1:443 <nil>}]
INFO: 2018/06/19 16:36:57 ccBalancerWrapper: new subconn: [{127.0.0.1:443 0  <nil>}]
INFO: 2018/06/19 16:36:57 balancerWrapper: handle subconn state change: 0xc420288070, CONNECTING
INFO: 2018/06/19 16:36:57 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4200529c0
INFO: 2018/06/19 16:36:57 balancerWrapper: handle subconn state change: 0xc420288070, TRANSIENT_FAILURE
INFO: 2018/06/19 16:36:57 ccBalancerWrapper: updating state and picker called by balancer: TRANSIENT_FAILURE, 0xc4200529c0
INFO: 2018/06/19 16:36:57 balancerWrapper: handle subconn state change: 0xc420288070, CONNECTING
INFO: 2018/06/19 16:36:57 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4200529c0
WARNING: 2018/06/19 16:36:57 Failed to dial 127.0.0.1:443: connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for kiam-server:443, localhost:443, localhost:9610, not 127.0.0.1:443"; please retry.
INFO: 2018/06/19 16:36:57 balancerWrapper: handle subconn state change: 0xc420288070, SHUTDOWN
INFO: 2018/06/19 16:36:57 ccBalancerWrapper: updating state and picker called by balancer: TRANSIENT_FAILURE, 0xc4200529c0
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available 
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available 
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available 
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available 
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available 
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available 
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available 
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available 
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available 
FATA[0001] error retrieving health: rpc error: code = Unavailable desc = there is no connection available 

I have a doubt: if I want to use Kiam on EKS I should use version 2.8, right? Because it's the version compatible with amazon-vpc-cni.

@pingles
Contributor

pingles commented Jun 19, 2018

@maauso Weird! I wonder if something isn't set properly with the cluster DNS on the pod? Switching to 127.0.0.1 will be fine, although you'd need to generate the cert with the additional alternate name.
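As a sketch, that means adding the loopback entries to the hosts list in server.json and regenerating the certs. Keeping the existing entries and adding new ones additively is the safe route, since these releases match the SAN against the host:port form the client dials (the 127.0.0.1 entries below are illustrative):

    "hosts": [
        "kiam-server",
        "kiam-server:443",
        "localhost",
        "localhost:443",
        "localhost:9610",
        "127.0.0.1",
        "127.0.0.1:443"
    ],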

@maauso

maauso commented Jun 19, 2018

@pingles Maybe, but I launched the workers using the official EKS solution; as you know, I don't have the ability to change any service like DNS or kube-proxy etc. in EKS.

For me, the worst part is that I only hit this "problem" when I have two types of EKS node groups running together. I opened a ticket with AWS to find out whether EKS is ready for this configuration.

If you need any tests, let me know. I'll also report back when I have an update from AWS.

@mattreyuk
Author

Running the server with the v2.7 image seems to work better for me. I still have one of my three master nodes failing health checks; when I run the command inside the container I get results like:

/ # GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=8 /health --cert=/etc/kiam/tls/server.pem --key=/etc/kiam/tls/server-key.pem --ca=/etc/kiam/tls/ca.pem --server-address=localhost:443
INFO: 2018/06/19 17:07:06 ccBalancerWrapper: updating state and picker called by balancer: IDLE, 0xc420079380
INFO: 2018/06/19 17:07:06 dialing to target with scheme: ""
INFO: 2018/06/19 17:07:06 could not get resolver for scheme: ""
WARN[0004] error checking health: rpc error: code = Unavailable desc = there is no address available 
INFO: 2018/06/19 17:07:06 balancerWrapper: is pickfirst: false
WARN[0004] error checking health: rpc error: code = Unavailable desc = there is no address available 
WARN[0004] error checking health: rpc error: code = Unavailable desc = there is no address available 
WARN[0004] error checking health: rpc error: code = Unavailable desc = there is no address available 
WARN[0004] error checking health: rpc error: code = Unavailable desc = there is no address available 
INFO: 2018/06/19 17:07:06 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.localhost on 101.64.0.10:53: no such host.
WARN[0004] error checking health: rpc error: code = Unavailable desc = there is no address available 
WARN[0005] error checking health: rpc error: code = Unavailable desc = there is no address available 
FATA[0006] error retrieving health: rpc error: code = Unavailable desc = there is no address available
 
/ # GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=8 /health --cert=/etc/kiam/tls/server.pem --key=/etc/kiam/tls/server-key.pem --ca=/etc/kiam/tls/ca.pem --server-address=localhost:443
INFO: 2018/06/19 17:07:30 ccBalancerWrapper: updating state and picker called by balancer: IDLE, 0xc4203c7320
INFO: 2018/06/19 17:07:30 dialing to target with scheme: ""
INFO: 2018/06/19 17:07:30 could not get resolver for scheme: ""
WARN[0007] error checking health: rpc error: code = Unavailable desc = there is no address available 
INFO: 2018/06/19 17:07:30 balancerWrapper: is pickfirst: false
INFO: 2018/06/19 17:07:30 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.localhost on 101.64.0.10:53: no such host.
INFO: 2018/06/19 17:07:30 balancerWrapper: got update addr from Notify: [{127.0.0.1:443 <nil>} {[::1]:443 <nil>}]
INFO: 2018/06/19 17:07:30 ccBalancerWrapper: new subconn: [{127.0.0.1:443 0  <nil>}]
INFO: 2018/06/19 17:07:30 ccBalancerWrapper: new subconn: [{[::1]:443 0  <nil>}]
INFO: 2018/06/19 17:07:30 balancerWrapper: handle subconn state change: 0xc420304e60, CONNECTING
INFO: 2018/06/19 17:07:30 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203c7320
INFO: 2018/06/19 17:07:30 balancerWrapper: handle subconn state change: 0xc420304eb0, CONNECTING
INFO: 2018/06/19 17:07:30 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203c7320
INFO: 2018/06/19 17:07:30 balancerWrapper: handle subconn state change: 0xc420304eb0, TRANSIENT_FAILURE
INFO: 2018/06/19 17:07:30 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203c7320
INFO: 2018/06/19 17:07:30 balancerWrapper: handle subconn state change: 0xc420304eb0, CONNECTING
INFO: 2018/06/19 17:07:30 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203c7320
INFO: 2018/06/19 17:07:30 balancerWrapper: handle subconn state change: 0xc420304e60, TRANSIENT_FAILURE
INFO: 2018/06/19 17:07:30 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203c7320
INFO: 2018/06/19 17:07:30 balancerWrapper: handle subconn state change: 0xc420304e60, CONNECTING
INFO: 2018/06/19 17:07:30 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203c7320
WARN[0007] error checking health: rpc error: code = Unavailable desc = there is no connection available 
WARN[0008] error checking health: rpc error: code = Unavailable desc = there is no connection available 
INFO: 2018/06/19 17:07:31 balancerWrapper: handle subconn state change: 0xc420304e60, READY
INFO: 2018/06/19 17:07:31 ccBalancerWrapper: updating state and picker called by balancer: READY, 0xc4203c7320
INFO: 2018/06/19 17:07:31 balancerWrapper: handle subconn state change: 0xc420304eb0, READY
INFO: 2018/06/19 17:07:31 ccBalancerWrapper: updating state and picker called by balancer: READY, 0xc4203c7320
INFO[0008] healthy: ok  
                                
/ # GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=8 /health --cert=/etc/kiam/tls/server.pem --key=/etc/kiam/tls/server-key.pem --ca=/etc/kiam/tls/ca.pem --server-address=localhost:443
INFO: 2018/06/19 17:12:18 ccBalancerWrapper: updating state and picker called by balancer: IDLE, 0xc42039d320
INFO: 2018/06/19 17:12:18 dialing to target with scheme: ""
INFO: 2018/06/19 17:12:18 could not get resolver for scheme: ""
WARN[0018] error checking health: rpc error: code = Unavailable desc = there is no address available 
INFO: 2018/06/19 17:12:18 balancerWrapper: is pickfirst: false
WARN[0019] error checking health: rpc error: code = Unavailable desc = there is no address available 
FATA[0019] error retrieving health: rpc error: code = Unavailable desc = there is no address available 

The failing ones do subjectively seem to take longer to run; is there a timeout value I can try increasing?

@mattreyuk
Author

I changed to --timeout=7s and that seems to have taken care of my failing probes.
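For reference, the change is just the --timeout flag in the two probe definitions from deploy/server.yaml (liveness shown here; the readiness probe takes the same flag). Note that Kubernetes' own timeoutSeconds needs to stay above the health command's timeout so the kubelet doesn't kill the probe before the command can report:

livenessProbe:
  exec:
    command:
    - /health
    - --cert=/etc/kiam/tls/server.pem
    - --key=/etc/kiam/tls/server-key.pem
    - --ca=/etc/kiam/tls/ca.pem
    - --server-address=localhost:443
    - --server-address-refresh=2s
    - --timeout=7s
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 10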

@pingles
Contributor

pingles commented Jun 19, 2018

7 seconds sounds like a long time to start; maybe it's taking longer to initialise the pod caches. Do you have lots and lots of running pods? Does the server have sufficient CPU/memory allocated?

I'll see if I can dig more into the gRPC client code to figure out why it's taking so long to resolve/populate the address, but glad you managed to get it running at least!

@mattreyuk
Author

It's not just at startup: when I run the /health command manually, it fails intermittently well after the pod has started. I can run a few checks in a row and they'll return immediately, but then one will sit there with no output for a second or so and then produce the log stream with FATAL at the end.

The server pods show max CPU usage of about 0.33 each and roughly 33Mi of memory.
I have 164 pods running and 295 terminated (pods from cron jobs).

@ghost

ghost commented Jun 20, 2018

I am also running into this problem on EKS with only a single worker node group. Sometimes when the cluster autoscaler creates a new node, the liveness probe fails for the new agent pod.

@moofish32
Contributor

moofish32 commented Jun 29, 2018

Piling on a bit here -- after getting all the logging enabled I found this warning:
WARNING: 2018/06/29 22:11:17 grpc: addrConn.createTransport failed to connect to {127.0.0.1:443 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for kiam-server:443, localhost:443, localhost:9610, not localhost". Reconnecting...
I fixed this by changing my cert generation script to include only localhost, with the ports removed. This is now working. I don't know what changed, though, because this worked previously, so it's a bit confusing.
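For anyone making the same change, regenerating and redeploying the server cert follows the cfssl flow from docs/TLS.md. A sketch, with file and secret names assumed from the repo docs and the Deployment above:

# regenerate server.pem/server-key.pem from the edited server.json
cfssl gencert -ca=ca.pem -ca-key=ca-key.pem server.json | cfssljson -bare server
# replace the secret the server pods mount at /etc/kiam/tls
kubectl -n kube-system delete secret kiam-server-tls
kubectl -n kube-system create secret generic kiam-server-tls \
  --from-file=ca.pem --from-file=server.pem --from-file=server-key.pem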

@moofish32
Contributor

Can this be closed, given https://github.com/uswitch/kiam/blob/master/docs/server.json#L5? (We could delete the port-specific entries, I believe, but I'll leave that to another test to validate.)

@wpalmeri

wpalmeri commented Jul 3, 2018

Chiming in: my team is also seeing the health check fail repeatedly on 2.8, resulting in CrashLoopBackoff for the kiam-server pods.

@pingles
Contributor

pingles commented Jul 3, 2018

@wpalmeri could you downgrade to 2.7 and see if that works please?

@moofish32 the TLS stuff was changed for 2.8, IIRC: #86

@wpalmeri

wpalmeri commented Jul 4, 2018

@pingles yes, we are stuck on 2.7 for now. We were interested in the 2.8 upgrade as we are also seeing the intermittent health check failures that 2.8 looked to fix.

@TattiQ

TattiQ commented Aug 2, 2018

Hi guys, I also experienced this issue with Kiam 2.8 and health checks failing, and reverted to 2.7. Maybe it makes sense to add this one to the milestone as well? https://github.com/uswitch/kiam/milestone/1

@TattiQ

TattiQ commented Aug 2, 2018

Using 2.7 now, but kind of stuck with this issue, which is apparently solved in 2.8 =(

INFO: 2018/08/02 09:54:05 could not get resolver for scheme: ""
INFO: 2018/08/02 09:54:05 balancerWrapper: is pickfirst: false
WARN[0001] error checking health: rpc error: code = Unavailable desc = there is no address available
WARN[0001] error checking health: rpc error: code = Unavailable desc = there is no address available
INFO: 2018/08/02 09:54:05 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.localhost on 100.64.0.10:53: no such host.
WARN[0001] error checking health: rpc error: code = Unavailable desc = there is no address available
WARN[0001] error checking health: rpc error: code = Unavailable desc = there is no address available
WARN[0002] error checking health: rpc error: code = Unavailable desc = there is no address available

@pingles pingles added this to the 3.0 milestone Aug 2, 2018
@pingles
Contributor

pingles commented Aug 2, 2018

@TattiQ Yep, sorry; this was also reported in #128.

@TattiQ

TattiQ commented Aug 2, 2018

Thanks, @pingles !

@pingles
Contributor

pingles commented Aug 2, 2018

I've also pruned one of the items from 3.0, so the last big thing is finishing #35, which was picked up in #113.

@wpalmeri

@pingles Any word on the 3.0 release? The sporadic readiness probe failures are enough to scare some of our users into preferring user creds over IAM roles. Thanks!

@pingles
Contributor

pingles commented Aug 11, 2018 via email

@kylegoch

kylegoch commented Sep 7, 2018

@pingles What version of Kubernetes are you using for testing?

I have the same issue as @maauso in a 1.11.2 cluster, using kiam 2.7.

Basically, Kiam acts like it can't resolve localhost, so I'm not sure if it's a Kiam issue or a kube-dns issue. Kiam seems to be the common denominator in our clusters, though, as we are not using EKS (and EKS runs on 1.10).

@kylegoch

kylegoch commented Sep 7, 2018

Welp, we just solved this on our end. It wasn't Kiam; disregard the above. It was an issue with kube-dns and security groups. We fixed that and Kiam is all happy now.

@roffe
Contributor

roffe commented Sep 9, 2018

localhost in a Kubernetes cluster goes through the cluster DNS service due to ndots:5; you need to use 127.0.0.1 to get around this.

https://pracucci.com/kubernetes-dns-resolution-ndots-options-and-why-it-may-affect-application-performances.html
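For illustration, a typical pod /etc/resolv.conf looks something like this (nameserver and search domains vary per cluster). With ndots:5, any name containing fewer than five dots, localhost included, is first tried against each search domain via the cluster DNS service, which is where the failed _grpclb._tcp.localhost SRV lookups in the logs above come from:

nameserver 100.64.0.10
search kube-system.svc.cluster.local svc.cluster.local cluster.local
options ndots:5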

@cspargo

cspargo commented Sep 24, 2018

I was also experiencing this with 2.8. I noticed that the YAML deployment config here https://github.com/uswitch/kiam/blob/master/deploy/server.yaml#L60 sets --gateway-timeout-creation=1s, but the Helm chart (which I am using) defaults the timeout to 50ms: https://github.com/helm/charts/blob/a158ba574ebc45aef3749b79bdaded7ecacea363/stable/kiam/values.yaml#L159

I increased the gatewayTimeoutCreation Helm parameter to "1s" and the problem went away.
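If you're deploying via the chart, the override looks something like this (the value path is assumed from the linked values.yaml; the release name is a placeholder):

helm upgrade --install kiam stable/kiam \
  --set server.gatewayTimeoutCreation=1s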

@wpalmeri

wpalmeri commented Oct 8, 2018

@pingles Any update on the 3.0 release? We are still hoping to upgrade. Thanks!

@pingles
Contributor

pingles commented Oct 9, 2018

:-) Good question. For all previous releases I liked that we had "skin in the game" and wouldn't release Kiam until we'd been exposed to the same code for a while on our production clusters. @uswitch/cloud and @tombooth may be able to give some indication.

There are no code changes anticipated, however, beyond what's currently in master or v3-rc1.
The remaining changes were documentation-related which, again, we wanted to expose ourselves to directly.

Remaining:

@pingles
Contributor

pingles commented Oct 9, 2018

I just spoke with @tombooth and we're going to plan the rollout this Friday, so hopefully v3 in the next fortnight (by 26th October).

@saplla

saplla commented Oct 9, 2018

FYI I'm seeing this on 2.7.

@chrisduong

chrisduong commented Oct 22, 2018

I was also experiencing this with 2.8. I noticed that the YAML deployment config here https://github.com/uswitch/kiam/blob/master/deploy/server.yaml#L60 sets --gateway-timeout-creation=1s, but the Helm chart (which I am using) defaults the timeout to 50ms: https://github.com/helm/charts/blob/a158ba574ebc45aef3749b79bdaded7ecacea363/stable/kiam/values.yaml#L159

I increased the gatewayTimeoutCreation Helm parameter to "1s" and the problem went away.

This seems to fix it. Thanks.

@pingles
Contributor

pingles commented Dec 6, 2018

I'm going to close this issue; the mitigation is to set the gateway parameters a little more generously.

@wpalmeri I'm in the process of getting the last bits of documentation finished for v3, so the release is imminent: #195

@Deepak1100

Hi,
after upgrading to 3.0 I am getting the same issue, and I have tried increasing gateway-timeout-creation to bigger values as well, like 5s:

             - /kiam
             - health
             - --cert=/etc/kiam/tls/cert
             - --key=/etc/kiam/tls/key
             - --ca=/etc/kiam/tls/ca
             - --server-address=127.0.0.1:443
             - --timeout=5s
             - --gateway-timeout-creation=2s

error creating server gateway: error dialing grpc server: context deadline exceeded

@pingles
Contributor

pingles commented Dec 18, 2018

@Deepak1100 There's quite a bit of detail above for diagnosing the problem. I'd suggest starting with the environment variables and capturing some log data to figure out why it's not connecting. Given that it's the health command failing, it's almost certainly a TLS issue, as you specify 127.0.0.1, which rules out networking and DNS.
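Concretely, that looks something like the following run against a failing server pod (the pod name is a placeholder; the flags mirror the v3 probe definition above):

kubectl -n kube-system exec kiam-server-xxxxx -- sh -c \
  'GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=8 \
   /kiam health --cert=/etc/kiam/tls/cert --key=/etc/kiam/tls/key \
   --ca=/etc/kiam/tls/ca --server-address=127.0.0.1:443 \
   --timeout=5s --gateway-timeout-creation=2s'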

@Deepak1100

@pingles Yes, that was the issue; it's now resolved. Thanks!

@2rs2ts

2rs2ts commented Feb 26, 2019

Not sure if it's related, but after upgrading to 3.0 we sometimes see the agent get these "context deadline exceeded" errors.

{"level":"info","msg":"configuring iptables","time":"2019-02-26T16:11:29Z"}
{"level":"info","msg":"started prometheus metric listener 0.0.0.0:9621","time":"2019-02-26T16:11:29Z"}
INFO: 2019/02/26 16:11:29 parsed scheme: "dns"
INFO: 2019/02/26 16:11:29 grpc: failed dns TXT record lookup due to lookup kiam-server-v3.cloud-controllers.svc.cluster.local. on 192.168.252.10:53: server misbehaving.
INFO: 2019/02/26 16:11:29 ccResolverWrapper: got new service config:
WARNING: 2019/02/26 16:11:29 grpc: parseServiceConfig error unmarshaling  due to unexpected end of JSON input
INFO: 2019/02/26 16:11:29 ccResolverWrapper: sending new addresses to cc: [{10.42.23.59:443 1 10-42-23-59.kiam-server-v3.cloud-controllers.svc.cluster.local. <nil>} {10.42.17.72:443 1 10-42-17-72.kiam-server-v3.cloud-controllers.svc.cluster.local. <nil>} {10.42.23.140:443 1 10-42-23-140.kiam-server-v3.cloud-controllers.svc.cluster.local. <nil>} {10.42.23.140:443 0  <nil>} {10.42.17.72:443 0  <nil>} {10.42.23.59:443 0  <nil>}]
INFO: 2019/02/26 16:11:29 base.baseBalancer: got new resolved addresses:  [{10.42.23.59:443 1 10-42-23-59.kiam-server-v3.cloud-controllers.svc.cluster.local. <nil>} {10.42.17.72:443 1 10-42-17-72.kiam-server-v3.cloud-controllers.svc.cluster.local. <nil>} {10.42.23.140:443 1 10-42-23-140.kiam-server-v3.cloud-controllers.svc.cluster.local. <nil>} {10.42.23.140:443 0  <nil>} {10.42.17.72:443 0  <nil>} {10.42.23.59:443 0  <nil>}]
INFO: 2019/02/26 16:11:29 base.baseBalancer: handle SubConn state change: 0xc4201b6e20, CONNECTING
INFO: 2019/02/26 16:11:29 base.baseBalancer: handle SubConn state change: 0xc4201b6e70, CONNECTING
INFO: 2019/02/26 16:11:29 base.baseBalancer: handle SubConn state change: 0xc4201b6e90, CONNECTING
INFO: 2019/02/26 16:11:29 base.baseBalancer: handle SubConn state change: 0xc4201b6ec0, CONNECTING
INFO: 2019/02/26 16:11:29 base.baseBalancer: handle SubConn state change: 0xc4201b7040, CONNECTING
INFO: 2019/02/26 16:11:29 base.baseBalancer: handle SubConn state change: 0xc4201b70c0, CONNECTING
{"level":"fatal","msg":"error creating server gateway: error dialing grpc server: context deadline exceeded","time":"2019-02-26T16:11:35Z"}

We give it --gateway-timeout-creation=5s, which seems more than generous...

@2rs2ts

2rs2ts commented Feb 27, 2019

@moofish32 BTW, I just wanted to say: removing the port numbers from the cert SANs breaks 2.7.

@moofish32
Contributor

@2rs2ts - Yes, it will. I think that's all documented in the 3.0 upgrade; I'm not sure about 2.8 anymore. Are you thinking the team should add more docs, or just correcting my earlier statement on close? I was definitely unaware of where this breaking change was landing. Happy to delete the comment if that would make you happy.

@2rs2ts

2rs2ts commented Feb 28, 2019

I was just commenting in case you or anyone else needed to know. I ended up shooting myself in the foot by taking those SANs out while upgrading from 2.7 to 3.0 (the upgrade docs do say you need to do things additively, but your comment made it sound like you should actually remove the old SANs for stability reasons), so if it happened to me maybe it would happen to someone else.
