Kiam server readiness/liveness probe issues (v2.8) #94
@mattreyuk To get more information, could you add the following environment variables to the health check command please:
#17 had a similar problem - a connection error inside the gRPC client, but the log data/errors were being masked.
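(The variable names themselves didn't survive in the comment above; assuming they are gRPC-Go's client debug variables, a minimal sketch of setting them on the server container so the exec health check inherits them:)

```yaml
# Sketch only -- assumes the variables referred to above are gRPC-Go's debug
# variables. Exec probes run inside the container, so setting these on the
# kiam-server container makes the health command emit gRPC connection logs.
spec:
  containers:
    - name: kiam-server
      env:
        - name: GRPC_GO_LOG_SEVERITY_LEVEL
          value: "info"
        - name: GRPC_GO_LOG_VERBOSITY_LEVEL
          value: "2"
```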
Hi guys, the Kiam version I'm using is 2.8. Moreover, in order to have dedicated nodes for the kiam pods I'm using the following yaml file:
But I still have the same error:
Hi @pingles, of course - I used the same yaml and only changed the version, and I got the following error:
Remember that I'm using EKS with amazon-vpc-cni. I also noticed that I get both errors when I'm using two different node groups connected to the same EKS cluster, even though in theory the only difference between them is the resources, not the network. Weird behaviour XD
is the thing to add to the health check command.
@pingles with version 2.7 and adding
the error is:
Also I have some kiam-server pods that seem to be running, but they have some warnings:
Moreover, with version 2.8:
Also, the kiam-server pods running version 2.8 have the same warning. If you need more tests, let me know.
@pingles ok, no problem, but first off, which version do you want me to run the test with?
into the pod where I have the problem?
@pingles I added those values to the pod env and got the following:
@maauso yeah - try that with 2.7 please. @mattreyuk that's strange; if you increase the verbosity level does anything else come up? It'd be useful to find out what that is.
I followed the instructions in
I put the verbosity level up to
@pingles I ran the command inside the docker container.
It's different from the error @mattdenner has.
@mattreyuk could you confirm which release of Kiam you're running please? If it's 2.8 could you move down to 2.7 please. @maauso very strange... not sure what's happening there, it looks like it's unable to resolve localhost. If you change to
@pingles it seems it's the same error:
I have a doubt: if I want to use kiam on EKS I should use version 2.8, right? Because it's compatible with amazon-vpc-cni.
@maauso weird!! I wonder if something isn't set properly with the cluster DNS stuff on the pod? Switching to
@pingles Maybe, but I launched the workers using the official EKS solution; you know that in EKS I don't have the possibility to change any service like DNS or kube-proxy etc. For me, the worst part is that I only hit this problem when I have two types of EKS node groups running together. I opened a ticket with AWS to find out whether EKS is ready for this configuration. If you need any test, let me know. Also, when I get an update from AWS I will tell you.
Running with the server version 2.7 image seems to work better for me. I still have one of my three master nodes failing health checks - when I run the command within the container I get results like:
The failing ones do seem, subjectively, to take longer to run - is there a timeout value I can try increasing?
I changed to
7 seconds sounds like a long time to start - maybe it's taking longer to initialise the pod caches; do you have lots and lots of running pods? Does the server have sufficient cpu/memory allocated? I'll see if I can dig more into the gRPC client code to figure out why it's taking so long to resolve/populate the address, but glad you managed to get it running at least!
It's not just on startup - I see the same thing when I run the command manually. The server pods are showing max cpu usage of about
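If the check is slow rather than broken, the standard kubelet probe fields are where extra time can be bought; a sketch with illustrative values (these are not the project's defaults, and the exec command is abbreviated -- see deploy/server.yaml for the full health command):

```yaml
# Illustrative probe timing only -- values are not kiam's shipped defaults.
livenessProbe:
  exec:
    command:
      - /health            # assumed health command; see deploy/server.yaml for the real flags
  initialDelaySeconds: 30  # allow time for the pod cache to sync on startup
  timeoutSeconds: 10       # give the gRPC dial longer than the 1s kubelet default
  periodSeconds: 10
  failureThreshold: 3
```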
I am also running into this problem on EKS with only a single worker node group. Sometimes when the cluster autoscaler creates a new node, the liveness probe fails for the new agent pod.
Piling on a bit here -- after getting all the logging enabled I found this warning:
Can this get closed? Due to this: https://github.com/uswitch/kiam/blob/master/docs/server.json#L5 (we could delete the port-specific ones, I believe, but I'll leave that up to another test to validate).
Chiming in: my team is also seeing the health check fail repeatedly in 2.8, resulting in CrashLoopBackOff for the kiam-server pods.
@wpalmeri could you downgrade to 2.7 and see if that works please? @moofish32 the TLS stuff was changed for 2.8 IIRC: #86
@pingles yes, we are stuck on 2.7 for now. We were interested in the 2.8 upgrade as we are also seeing the intermittent health check failures that 2.8 looked to fix.
Hi guys, I also experienced this issue with kiam 2.8 and health checks failing, and reverted to 2.7. Maybe it makes sense to add this one to the milestone as well? https://github.com/uswitch/kiam/milestone/1
Using 2.7 now, but kind of stuck with this issue which is apparently solved in 2.8 =(
Thanks, @pingles!
@pingles any word on the 3.0 release? The sporadic readiness probe failures are enough to scare away some of our users into preferring user creds over IAM roles. Thanks!
Hopefully next week - you could use tagged master now if you wanted, as that'll be the first RC we take. What's left for the release is updating the docs.
@pingles What version of Kubernetes are you using for testing? I have the same issue as @maauso in a 1.11.2 cluster, using kiam 2.7, where basically kiam acts like it can't resolve localhost. So I'm not sure if it's a kiam issue or a kube-dns issue. Kiam seems to be the common denominator in our clusters though, as we are not using EKS, and EKS runs on 1.10.
Welp, we just solved this on our end. It wasn't kiam - disregard the above. It was an issue with kube-dns and security groups. We fixed that and kiam is all happy now.
localhost in a Kubernetes cluster goes over the DNS service due to ndots: 5. You need to use 127.0.0.1 to get around this.
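A minimal sketch of that workaround in the probe itself (the health command and flag name follow the deploy manifests and are assumptions here -- check them against the release you run):

```yaml
# Sketch only: dial the loopback IP so the lookup never reaches cluster DNS
# (and never hits the ndots:5 search-path expansion described above).
readinessProbe:
  exec:
    command:
      - /health                          # assumed health command
      - --server-address=127.0.0.1:443   # was localhost:443
```

Note that whatever address the check dials needs to be covered by the server certificate's SANs (see the SAN discussion further down).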
I was also experiencing this with 2.8. I noticed the relevant setting in the yaml deployment configs here: https://github.com/uswitch/kiam/blob/master/deploy/server.yaml#L60. I increased the gatewayTimeoutCreation helm parameter to "1s" and the problem went away.
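For anyone deploying via the Helm chart, that's a one-line values change; a sketch (the nesting under server: is an assumption about the chart layout -- confirm the exact key for your chart version):

```yaml
# values.yaml sketch -- parameter name taken from the comment above; the
# nesting under `server:` is assumed.
server:
  gatewayTimeoutCreation: 1s
```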
@pingles any update on the 3.0 release? We are still hoping to upgrade. Thanks!
:-) good question. For all previous releases I liked that we had "skin in the game" and wouldn't release Kiam until we'd been exposed to the same code for a while on our production clusters. @uswitch/cloud and @tombooth may be able to help give some indication. There are no code-related changes anticipated, however, beyond what's currently in master. Remaining:
I just spoke with @tombooth and we're going to plan the rollout this Friday, so hopefully v3 in the next fortnight (by 26th October).
FYI I'm seeing this on 2.7.
This seems to fix it. Thanks.
Hi, I'm getting the following error:
error creating server gateway: error dialing grpc server: context deadline exceeded
@Deepak1100 there's quite a bit of detail above for diagnosing the problem. I'd suggest starting with the environment variables and capturing some log data to figure out why it's not connecting. Given that it's the health command failing, it's almost certainly a TLS issue, as you specify
@pingles yes, that was the issue.
Not sure if it's related, but after upgrading to 3.0 we sometimes see the agent get these "context deadline exceeded" errors.
We give it
@moofish32 BTW I just wanted to say, removing the port numbers from the cert SANs breaks 2.7.
@2rs2ts - yes, it will. I think that's all documented in the 3.0 upgrade; I'm not sure about 2.8 anymore. Are you thinking the team should add more docs, or just correcting my earlier statement on close? When I wrote it I was definitely unaware of where this breaking change was landing. Happy to delete the comment if that would make you happy.
I was just commenting in case you or anyone else needed to know. I ended up shooting myself in the foot by taking those SANs out while upgrading from 2.7 to 3.0 (the upgrade docs do say you need to do things additively, but your comment made it sound like you should actually remove the old SANs for stability reasons), so if it happened to me maybe it would happen to someone else.
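To make the "additive" point concrete, a sketch of a hosts/SAN list for the server certificate that keeps the old port-qualified names alongside the plain ones while 2.7 and 3.0 components coexist (illustrative, not a verbatim copy of docs/server.json):

```yaml
# Illustrative SAN/hosts list for the server certificate request -- keep the
# port-qualified entry for as long as any 2.7 component still dials it, and
# only drop it once everything is on 3.0.
hosts:
  - "127.0.0.1"
  - "localhost"
  - "localhost:443"   # port-qualified form that 2.7 still expects
```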
I have Kiam set up on a 3-master cluster - the server runs on two of the three masters with some readiness probes failing, but the 3rd node is in a crash loop backoff with the liveness probe failing. The error message is the same for both:
Looking in the server logs for the failing pod, I just see it start up, lots of `found role` and `requesting credentials`, and then finally `stopped`.
I'm using image `quay.io/uswitch/kiam:v2.8` and the standard probe definitions from `/deploy/server.yaml` (this is a `kops` 1.9 cluster so I changed the cert path to `/etc/ssl/certs` and added master tolerations to get it to work).