Unable to assign role due to "pod not found" error #47
Comments
Thanks for raising this. Yep, it's most likely due to the delay between the pod starting and it being picked up by the watcher. You should have seen some more retry messages from the server/agent as it attempted to keep checking for the pod; these should keep happening until the client disconnects.

The prefetch buffer controls the number of new pods that, having been started, will be passed to the prefetcher to request credentials. The error you've reported would occur before the client requests credentials, so I don't think it would help. How many pods are being scheduled at the same time?

The other thing you could do is turn on debug logging (just as you schedule everything; there'll be a lot of noise) and you should see a load of messages.

Could you also give me any details about the way you've got kiam deployed (i.e. how many server processes you run, whether they are also on nodes deployed as spot instances etc.)?
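For intuition, the prefetch buffer described here behaves like a bounded queue sitting between the watcher (which announces newly started pods) and the prefetcher (which requests credentials for them ahead of time). A minimal Python sketch of that pattern — purely illustrative; the names and structure are assumptions, not kiam's actual code:

```python
import queue
import threading

def watcher(pods, buffer):
    # The watcher announces newly started pods; a full buffer applies
    # backpressure (put blocks until the prefetcher frees a slot).
    for pod in pods:
        buffer.put(pod)
    buffer.put(None)  # sentinel: no more pods

def prefetcher(buffer, fetched):
    # The prefetcher drains the buffer and requests credentials for
    # each pod before the pod itself asks for them.
    while (pod := buffer.get()) is not None:
        fetched.append(f"credentials for {pod}")

buffer = queue.Queue(maxsize=10)  # analogous to prefetch-buffer-size
fetched = []
t = threading.Thread(target=prefetcher, args=(buffer, fetched))
t.start()
watcher(["pod-a", "pod-b", "pod-c"], buffer)
t.join()
```

The point of the buffer size is only how many announced pods can queue up before the watcher blocks; it doesn't affect lookups that happen before credentials are requested, which matches the comment above.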
Thank you. Yeah, you're right, the server tries multiple times to find the pod, until eventually the error is propagated to the agent. So I have multiple server-side errors followed by a single agent-side error.

```json
{
  "credentials.access.key": "***",
  "credentials.expiration": "2018-03-15T19:06:48Z",
  "credentials.role": "***",
  "generation.metadata": 0,
  "level": "info",
  "msg": "fetched credentials",
  "pod.iam.role": "***",
  "pod.name": "***",
  "pod.namespace": "default",
  "pod.status.ip": "",
  "pod.status.phase": "Pending",
  "resource.version": "13076366",
  "time": "2018-03-15T18:58:48Z"
}
```

But it had no IP address yet, so it was not written to the cache. Each node runs ~50 pods, so once K8s starts draining it, these 50 pods have to be rescheduled elsewhere.

The issue happens on our production cluster, so I have to act with extra care, but I'll try to change the log level to debug and see if I can get any insights.

We have 3 K8s master nodes that are on-demand instances and are running the kiam-server process, which uses the default
Cool... so that "fetched credentials" message means that it was able to retrieve the credentials from AWS (most likely caused by the prefetcher, given the pod is pending).

Were you able to turn the debug logging on? I'd like to see all entries with the message "processing delta" for the pods that have problems. Every time a pod changes state it'd be tracked with that; if we don't see any messages for the pod(s) in question then something's being dropped for some reason. If we do get log messages then it's potentially a timing issue (i.e. the pod state updates are processed too slowly relative to how quickly pods start and request credentials).

It's good to know that the server processes are running on the On-Demand instances. When you next see the problem, could you check that the server processes aren't being restarted/killed and that they all have a nice long uptime, please?
I plan to enable debug logging this Sunday and then try to reproduce the issue.
Ok cool. That sounds most likely like there's a race between the pod starting (and requesting credentials) and how quickly the pod data arrives from the watcher. Please let me know what you find and post back here; I'm really keen to improve this.
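That race can be pictured as a retry loop: the lookup keeps re-checking the pod cache until either the watcher delivers the pod or a deadline expires. A hypothetical Python sketch of the idea — kiam itself is written in Go, and `find_role`, the dict-as-cache, and the timings here are all illustrative assumptions:

```python
import threading
import time

def find_role(cache, pod_ip, deadline=5.0, interval=0.05):
    """Retry the cache lookup until the pod appears or the deadline passes.

    `cache` is any dict-like mapping of pod IP -> role; in kiam this would
    be the pod cache that the watcher populates from apiserver deltas.
    """
    start = time.monotonic()
    while time.monotonic() - start < deadline:
        role = cache.get(pod_ip)
        if role is not None:
            return role
        time.sleep(interval)  # wait for the watcher to deliver the delta
    raise LookupError("pod not found")

# Simulate the watcher delivering the pod shortly after the client asks:
cache = {}
threading.Timer(0.2, cache.__setitem__, args=("100.115.1.182", "my-role")).start()
role = find_role(cache, "100.115.1.182")
```

If the watcher delivers the pod within the deadline the lookup succeeds; if pod data consistently arrives later than clients ask, every first request loses the race, which is the failure mode being discussed.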
Just to let you know that I've set kiam-server's log level to debug and terminated multiple K8s nodes. But the issue didn't occur. I even tried to terminate multiple nodes at once, but this didn't help either.
Brilliant, thank you.
> It seems to be related either to overall cluster load (traffic we get on Sundays is usually low) or to kiam-server pods' uptime. I'll try to reproduce this again once we reach our peak traffic and will update with the findings.
So here we go (this particular pod was not scheduled due to a K8s node drain, but probably due to the horizontal pod autoscaler kicking in to scale the deployment):

```json
[
  {
    "cache.delta.type": "Added",
    "cache.object": "pod",
    "generation.metadata": 0,
    "level": "debug",
    "msg": "processing delta",
    "pod.iam.role": "***",
    "pod.name": "***-worker-556fcbd4b6-44kss",
    "pod.namespace": "default",
    "pod.status.ip": "",
    "pod.status.phase": "Pending",
    "resource.version": "17510380",
    "time": "2018-03-25T15:38:09Z"
  },
  {
    "credentials.access.key": "ASI***QKQ",
    "credentials.expiration": "2018-03-25T15:48:52Z",
    "credentials.role": "***",
    "generation.metadata": 0,
    "level": "info",
    "msg": "fetched credentials",
    "pod.iam.role": "***",
    "pod.name": "***-worker-556fcbd4b6-44kss",
    "pod.namespace": "default",
    "pod.status.ip": "",
    "pod.status.phase": "Pending",
    "resource.version": "17510380",
    "time": "2018-03-25T15:38:09Z"
  },
  {
    "cache.delta.type": "Added",
    "cache.object": "pod",
    "generation.metadata": 0,
    "level": "debug",
    "msg": "announced pod",
    "pod.iam.role": "***",
    "pod.name": "***-worker-556fcbd4b6-44kss",
    "pod.namespace": "default",
    "pod.status.ip": "",
    "pod.status.phase": "Pending",
    "resource.version": "17510380",
    "time": "2018-03-25T15:38:09Z"
  },
  {
    "credentials.access.key": "ASI***QKQ",
    "credentials.expiration": "2018-03-25T15:48:52Z",
    "credentials.role": "***",
    "generation.metadata": 0,
    "level": "info",
    "msg": "fetched credentials",
    "pod.iam.role": "***",
    "pod.name": "***-worker-556fcbd4b6-44kss",
    "pod.namespace": "default",
    "pod.status.ip": "",
    "pod.status.phase": "Pending",
    "resource.version": "17510389",
    "time": "2018-03-25T15:38:10Z"
  },
  {
    "cache.delta.type": "Updated",
    "cache.object": "pod",
    "generation.metadata": 0,
    "level": "debug",
    "msg": "announced pod",
    "pod.iam.role": "***",
    "pod.name": "***-worker-556fcbd4b6-44kss",
    "pod.namespace": "default",
    "pod.status.ip": "",
    "pod.status.phase": "Pending",
    "resource.version": "17510389",
    "time": "2018-03-25T15:38:10Z"
  },
  {
    "cache.delta.type": "Updated",
    "cache.object": "pod",
    "generation.metadata": 0,
    "level": "debug",
    "msg": "processing delta",
    "pod.iam.role": "***",
    "pod.name": "***-worker-556fcbd4b6-44kss",
    "pod.namespace": "default",
    "pod.status.ip": "",
    "pod.status.phase": "Pending",
    "resource.version": "17510389",
    "time": "2018-03-25T15:38:10Z"
  },
  {
    "credentials.access.key": "ASI***QKQ",
    "credentials.expiration": "2018-03-25T15:48:52Z",
    "credentials.role": "***",
    "generation.metadata": 0,
    "level": "info",
    "msg": "fetched credentials",
    "pod.iam.role": "***",
    "pod.name": "***-worker-556fcbd4b6-44kss",
    "pod.namespace": "default",
    "pod.status.ip": "",
    "pod.status.phase": "Pending",
    "resource.version": "17510393",
    "time": "2018-03-25T15:38:10Z"
  },
  {
    "cache.delta.type": "Updated",
    "cache.object": "pod",
    "generation.metadata": 0,
    "level": "debug",
    "msg": "announced pod",
    "pod.iam.role": "***",
    "pod.name": "***-worker-556fcbd4b6-44kss",
    "pod.namespace": "default",
    "pod.status.ip": "",
    "pod.status.phase": "Pending",
    "resource.version": "17510393",
    "time": "2018-03-25T15:38:10Z"
  },
  {
    "cache.delta.type": "Updated",
    "cache.object": "pod",
    "generation.metadata": 0,
    "level": "debug",
    "msg": "processing delta",
    "pod.iam.role": "***",
    "pod.name": "***-worker-556fcbd4b6-44kss",
    "pod.namespace": "default",
    "pod.status.ip": "",
    "pod.status.phase": "Pending",
    "resource.version": "17510393",
    "time": "2018-03-25T15:38:10Z"
  },
  {
    "level": "error",
    "msg": "error finding pod: pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:18Z"
  },
  {
    "level": "error",
    "msg": "error finding pod: pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:21Z"
  },
  {
    "credentials.access.key": "ASI***QKQ",
    "credentials.expiration": "2018-03-25T15:48:52Z",
    "credentials.role": "***",
    "generation.metadata": 0,
    "level": "info",
    "msg": "fetched credentials",
    "pod.iam.role": "***",
    "pod.name": "***-worker-556fcbd4b6-44kss",
    "pod.namespace": "default",
    "pod.status.ip": "100.115.1.182",
    "pod.status.phase": "Running",
    "resource.version": "17510549",
    "time": "2018-03-25T15:38:24Z"
  },
  {
    "cache.delta.type": "Updated",
    "cache.object": "pod",
    "generation.metadata": 0,
    "level": "debug",
    "msg": "announced pod",
    "pod.iam.role": "***",
    "pod.name": "***-worker-556fcbd4b6-44kss",
    "pod.namespace": "default",
    "pod.status.ip": "100.115.1.182",
    "pod.status.phase": "Running",
    "resource.version": "17510549",
    "time": "2018-03-25T15:38:24Z"
  },
  {
    "cache.delta.type": "Updated",
    "cache.object": "pod",
    "generation.metadata": 0,
    "level": "debug",
    "msg": "processing delta",
    "pod.iam.role": "***",
    "pod.name": "***-worker-556fcbd4b6-44kss",
    "pod.namespace": "default",
    "pod.status.ip": "100.115.1.182",
    "pod.status.phase": "Running",
    "resource.version": "17510549",
    "time": "2018-03-25T15:38:24Z"
  }
]
```
Interesting: so there's a handful of messages at the end where the phase goes from Pending to Running. It's interesting that there are only two errors; it suggests that the agent only attempted to find the pod twice, between 2018-03-25T15:38:18Z and 2018-03-25T15:38:21Z, which is a pretty short retry period (I think the Java SDK has a 5 second read timeout).

I don't suppose you have the agent logs from around the same time? I'd like to know when the client first requests credentials from the agent (causing it to fetch from the server) and when it errors. I'm expecting it's only twice, but it'd be useful to confirm. Is there anything in particular you could share about this client process please: which language/platform/AWS SDK does it use?

I'll also try and do some more digging/tests and see if there's anything else that can be done to speed things up. Thanks for all your help collecting the data!
Thanks a lot for your time and effort. Here are the agent-side logs:

```json
[
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:18Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:18Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:18Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:18Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:18Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:18Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:18Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:18Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:18Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:18Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:19Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:19Z"
  },
  {
    "addr": "100.115.1.182:60826",
    "level": "error",
    "method": "GET",
    "msg": "error processing request: rpc error: code = Unknown desc = pod not found",
    "path": "/latest/meta-data/iam/security-credentials/",
    "status": 500,
    "time": "2018-03-25T15:38:19Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T16:38:19Z"
  },
  {
    "addr": "100.115.1.182:60826",
    "headers": {
      "Content-Type": [
        "text/plain; charset=utf-8"
      ],
      "X-Content-Type-Options": [
        "nosniff"
      ]
    },
    "level": "info",
    "method": "GET",
    "msg": "processed request",
    "path": "/latest/meta-data/iam/security-credentials/",
    "status": 500,
    "time": "2018-03-25T15:38:19Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:19Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:19Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:19Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:19Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:19Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:19Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:19Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:19Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:19Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:20Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:20Z"
  },
  {
    "addr": "100.115.1.182:32818",
    "level": "error",
    "method": "GET",
    "msg": "error processing request: rpc error: code = Unknown desc = pod not found",
    "path": "/latest/meta-data/iam/security-credentials/",
    "status": 500,
    "time": "2018-03-25T15:38:20Z"
  },
  {
    "addr": "100.115.1.182:32818",
    "headers": {
      "Content-Type": [
        "text/plain; charset=utf-8"
      ],
      "X-Content-Type-Options": [
        "nosniff"
      ]
    },
    "level": "info",
    "method": "GET",
    "msg": "processed request",
    "path": "/latest/meta-data/iam/security-credentials/",
    "status": 500,
    "time": "2018-03-25T15:38:20Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:20Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:20Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:20Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:20Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:20Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:20Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:20Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:20Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:20Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:20Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:21Z"
  },
  {
    "level": "warning",
    "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
    "pod.ip": "100.115.1.182",
    "time": "2018-03-25T15:38:21Z"
  },
  {
    "addr": "100.115.1.182:32998",
    "level": "error",
    "method": "GET",
    "msg": "error processing request: rpc error: code = Unknown desc = pod not found",
    "path": "/latest/meta-data/iam/security-credentials/",
    "status": 500,
    "time": "2018-03-25T15:38:21Z"
  },
  {
    "addr": "100.115.1.182:32998",
    "headers": {
      "Content-Type": [
        "text/plain; charset=utf-8"
      ],
      "X-Content-Type-Options": [
        "nosniff"
      ]
    },
    "level": "info",
    "method": "GET",
    "msg": "processed request",
    "path": "/latest/meta-data/iam/security-credentials/",
    "status": 500,
    "time": "2018-03-25T15:38:21Z"
  }
]
```

And just in case, here's the daemon.log from the K8s node itself:
Thanks, and sorry I misunderstood about your first/last messages before. The warnings are all retries within the agent; the "processed request" messages would suggest that the client is retrying twice?

I can't see anything that immediately jumps out as strange. The period the agent retries for seems shorter than it should be (it's configured to retry for up to 5 seconds) and there are no messages about a cancelled context; maybe it's just the clocks in the log messages being a little out.

Do you have any prometheus metrics around the api server response times? It'd be good to see whether there's any variability there (slow responses etc.), to see whether something is limiting how quickly the watcher can forward data to the server's pod cache. We had quite high 95th percentiles from the apiservers at one point because we'd forgotten to set requests and limits (they were just picking up our cluster defaults); increasing those improved performance quite significantly, both making it faster and more consistent.
I'll make a note to add some tests and benchmarks around the code that processes watch deltas, to make sure there isn't anything that blocks while processing them. I don't think there is, but it'd be better to prove it. I'll keep thinking about other things that could be causing slow updates; it definitely feels like that's the cause (either kiam being slow to process, or the api server being slow to serve).
As per the logs, the client tried 3 times. After looking into the SDK documentation, it seems that this is the client's default.
Hello again. The good news is that we finally have Prometheus set up, running, and collecting both system and kiam metrics. The "bad" news is that we rolled the client timeout change out to the most problematic microservices, so the issue's frequency decreased dramatically and it became much harder to reproduce. It would be greatly appreciated if you could share your grafana dashboard for kiam and/or provide some explanation of the metrics it exposes.
The API server metrics would be really interesting, in particular to see whether times for gets/watches on Pods spike around the same time. We had problems before where our 95th percentile api server response times were all over the place because we hadn't set the requests/limits high enough; they dropped when given enough. It'd be useful to see whether slow deltas being delivered to kiam are caused by something within the apiserver or within kiam.

The Prometheus metrics for kiam are not great at the moment; they export the same metrics reported with statsd. #35 is a reminder to just replace the StatsD stuff with direct prometheus metrics, and I'd probably instrument kiam differently to make sure this is an easier thing to debug. For now, tracking the metrics it already exports is what's available. Does that help?

Sorry, I don't have any grafana dashboards yet (we've historically used Datadog most) but will do soon. I'd love to get a kiam dashboard defined; I'll try and address that with #35 also.
Hi @pingles. We are seeing these issues on our cluster. I will also try to dig up the metrics for the kube api. I am running a batch load test in which we simultaneously schedule about 100+ pods, and eventually all the containers exit, unable to retrieve credentials and hence failing to connect to S3.
To get around errors with the pod cache not being primed, I inject the following ENV variables into ANY container that's going to access AWS from my k8s cluster:

https://boto3.readthedocs.io/en/latest/guide/configuration.html

boto3 and the aws cli obey these params, and basically this setting gives the app many more retries to get what it needs during cache priming. Before I added these env variables, many of my pods would always fail at first start, since many of them begin by getting a file or two from S3. With this it was smooth sailing, and pods come up on the first attempt without errors or pod restarts.
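The variable names and values were lost from the comment above, so the following is an assumption: the linked boto3 configuration guide documents two metadata-service environment variables that match this description, and a pod-spec fragment using them might look like this (the values are illustrative, not the commenter's):

```yaml
# Hypothetical pod spec fragment: give the AWS SDK more time and more
# attempts against the metadata API while kiam primes its pod cache.
env:
  - name: AWS_METADATA_SERVICE_TIMEOUT
    value: "5"
  - name: AWS_METADATA_SERVICE_NUM_ATTEMPTS
    value: "20"
```

Raising the attempt count effectively stretches the client's retry window past the delay between pod start and the watcher delivering the pod, which is why first-start failures disappear.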
I will try and post back, thank you.
@roffe that is awesome, it actually worked. At some point I was thinking that the load was too high for the kube API to handle; when we were using kube2iam we were getting panics in kube-api. We are running some massive tests and will keep you posted if you are interested. Also, we would love to understand how to monitor this better.
@kumare3 there is a Grafana dashboard for kiam now. If you use prometheus metrics and the latest 0.3 beta, it will give you a visualization of the metrics: https://github.com/uswitch/kiam/blob/master/docs/METRICS.md
This issue resembles #46, and might be essentially the same one.

We've switched to `kiam` from `kube2iam` a couple of weeks ago. We're running most of our kubernetes nodes on spot instances, so they are being replaced pretty frequently. When a node is drained before termination, its pods are scheduled on other nodes, and sometimes the application they're running starts throwing the following error:

On the agent side I see the following errors:

And these are the errors on the server side:

It looks like we have a delay between the time a pod is scheduled and the time information about it becomes available to kiam. If I delay the startup of the app for a couple of seconds, it fixes the problem. Deleting the problematic pod and letting kubernetes reschedule it does the trick as well.
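The startup-delay workaround mentioned above can be applied without changing the application, for example by wrapping the container command. A hypothetical deployment fragment (the container name, image, binary path, and the 5-second value are all made up for illustration):

```yaml
# Hypothetical fragment: delay the app a few seconds so kiam's watcher
# can observe the pod before the app makes its first credentials request.
containers:
  - name: worker
    image: example/worker:latest
    command: ["sh", "-c", "sleep 5 && exec /usr/local/bin/worker"]
```

Using `exec` keeps the application as PID 1 so container signal handling still works; the trade-off is a fixed delay on every start, even when the cache is already primed.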
After looking into your code, it seems that increasing the `prefetch-buffer-size` value might help, because the issue mostly happens when many pods are scheduled at the same time. But maybe I'm missing something. Any advice would be greatly appreciated.

P.S.: We're using kiam `v2.5` and kubernetes `v1.8.6`.