This repository has been archived by the owner on Mar 5, 2024. It is now read-only.

Issues with IAM role access after a rolling update of the k8s cluster #48

Open
dgeorgievski opened this issue Mar 22, 2018 · 15 comments

@dgeorgievski

dgeorgievski commented Mar 22, 2018

I am using kiam v2.6 with a kops-managed Kubernetes 1.9.5 cluster in AWS. Whenever a rolling update is performed (kops rolling-update), some pods have issues accessing their respective IAM roles.

From the agent error log messages it seems that the mapping between IAM roles and pods is lost: the affected pod is trying to access one IAM role, but kiam believes it is annotated with another.

{
    "addr":"100.124.0.4:36284",
    "level":"error",
    "method":"GET",
    "msg":"error processing request: assume role forbidden: requested 'iam-role-one' but annotated with 'iam-role-two', forbidden",
    "path":"/latest/meta-data/iam/security-credentials/iam-role-one",
    "status":403,
    "time":"2018-03-22T19:45:29Z"
}

The workaround is to restart all kiam servers, after which the correct mapping is restored. I was able to reproduce the issue twice in a row.

The server is started with the following settings:

        - --json-log
        - --level=info
        - --bind=0.0.0.0:443
        - --cert=/etc/kiam/tls/server.pem
        - --key=/etc/kiam/tls/server-key.pem
        - --ca=/etc/kiam/tls/ca.pem
        - --role-base-arn-autodetect
        - --sync=1m
        - --prometheus-listen-addr=0.0.0.0:9620
        - --prometheus-sync-interval=5s

while the agent is started with:

        - --level=info
        - --iptables
        - --host-interface=weave+
        - --json-log
        - --port=8181
        - --cert=/etc/kiam/tls/agent.pem
        - --key=/etc/kiam/tls/agent-key.pem
        - --ca=/etc/kiam/tls/ca.pem
        - --server-address=kiam-server:443
        - --prometheus-listen-addr=0.0.0.0:9620
        - --prometheus-sync-interval=5s

Please let me know if you need any other information.

Regards,

Dimitar

@pingles
Contributor

pingles commented Mar 22, 2018

Interesting. This sounds similar to #47: pods are starting before the watch deltas have been delivered, so the pod cache has the wrong data. Restarting the servers would obviously clear everything and cause a fresh sync, fixing things.

Could you re-run and confirm that it works after 1 minute (the server's --sync flag controls how frequently it runs a full sync of pods), or reduce the sync interval?
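For context on how the sync relates to the cache: the server keeps a pod cache fed by a watch plus a periodic full resync. A minimal client-go sketch of that general pattern (illustration only, not Kiam's actual code; the handler bodies are placeholders) looks something like this:

package main

import (
    "time"

    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/fields"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/cache"
)

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    // Watch all pods. The third argument is the resync period: every
    // interval the informer re-delivers its full store to the handlers,
    // which is the same general idea as the server's periodic full sync
    // of pods. Lowering it narrows the window in which stale entries
    // can survive.
    lw := cache.NewListWatchFromClient(client.CoreV1().RESTClient(), "pods", metav1.NamespaceAll, fields.Everything())
    informer := cache.NewSharedInformer(lw, &v1.Pod{}, time.Minute)

    informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc:    func(obj interface{}) { /* index pod IP -> role annotation */ },
        UpdateFunc: func(oldObj, newObj interface{}) { /* refresh the entry */ },
        DeleteFunc: func(obj interface{}) { /* drop the entry */ },
    })

    stop := make(chan struct{})
    defer close(stop)
    informer.Run(stop) // blocks; watch deltas and resyncs flow through the handlers
}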

I'll try and investigate more tomorrow and see what I can find.

To help, could you give me ballpark figures for the number of server processes, the number of pods being restarted together, etc.? Could you also check that the server processes aren't being throttled and have sufficient limits, please?

Thanks for reporting!

@dgeorgievski
Author

The status of the pods was tested about 10 minutes after the last EC2 node was rotated, which means that, left on their own, the kiam servers would most likely not purge the invalid entries from the cache.

I'll try repeating the steps with the --sync value lowered to 30s, and running the servers and agents in debug mode, which should hopefully produce helpful clues about this issue with the cache.

This is a newly provisioned cluster, and it is not in use at the moment. The only significant load is produced by the kiam servers, which run on 3 m4.xlarge EC2 instances (4 vCPUs), while the agents run on 4 m4.2xlarge instances (8 vCPUs). Kiam server utilization is relatively low, 6% on average, with occasional spikes up to 19-20%.

There are about 50 pods, across about 20 different deployments, for which kiam manages IAM access.

@pingles
Contributor

pingles commented Mar 23, 2018

This definitely sounds funky. We've been running Kiam for around 6-9 months on clusters with thousands of pods and probably under 100 roles and haven't come across this problem.

Could you describe the connection between the Kiam server and the API server(s) in a bit more detail, please? I wonder if there's some additional reliability work that needs to be added around the cache's watcher/sync processes; if the sync happens every minute then there shouldn't be a problem after 10 minutes (assuming no other pod movement).

@dgeorgievski
Author

I've repeated the rolling update of the cluster with the server's --sync=30s, and this time I didn't observe the issue with the cache. Both the servers and agents were started with the log level set to debug.

The following is a summary of what was happening during the rolling update.

The rolling update started. Kops rotates the masters first, one node at a time.

kops rolling-update cluster --name $NAME

node "ip-10-26-66-12.ec2.internal" cordoned
node "ip-10-26-66-12.ec2.internal" cordoned
WARNING: Deleting pods not managed by ReplicationController, ReplicaSet, Job, 
DaemonSet or StatefulSet: etcd-server-events-ip-10-26-66-12.ec2.internal, 
etcd-server-ip-10-26-66-12.ec2.internal, kube-apiserver-ip-10-26-66-12.ec2.internal, 
kube-controller-manager-ip-10-26-66-12.ec2.internal, kube-proxy-ip-10-26-66-12.ec2.internal, kube-scheduler-ip-10-26-66-12.ec2.internal; 
Ignoring DaemonSet-managed pods: kiam-server-xmxhn, weave-net-hz57j
node "ip-10-26-66-12.ec2.internal" drained
I0323 22:17:20.202050    2347 instancegroups.go:237] Stopping instance "i-019efe0582157955b", node "ip-10-26-66-12.ec2.internal", in group "my-master-group".
I0323 22:22:20.474951    2347 instancegroups.go:161] Validating the cluster.
I0323 22:22:20.745330    2347 instancegroups.go:212] Cluster validated.
I0323 22:22:20.891394    2347 instancegroups.go:130] Draining the node: "ip-10-26-76-147.ec2.internal".

Access to the metadata endpoint was observed from a test pod, with the curl command executed every second. While the masters were being rotated, curl would occasionally return the rpc error below; otherwise it would return the correct IAM role associated with the pod.

curl http://169.254.169.254/latest/meta-data/iam/security-credentials/
rpc error: code = DeadlineExceeded desc = context deadline exceeded

During the period when the kiam agent nodes were rotated, no curl errors were reported. The agents correctly reported when pods were deleted (rpc error: code = Unknown desc = pod not found), while the kiam servers reported the deltas being processed.

The results of this test, with the sync value set to 30s, are encouraging, but to prove the result was not an anomaly we need at least a couple of additional data points. I might need a week to complete the new series of tests, and I'd be happy to report the results back in this issue.

@pingles
Contributor

pingles commented Mar 25, 2018 via email

@dgeorgievski
Author

Unfortunately, the issue appeared again in the second test with the server sync set to 30s. I can also confirm that I see the issue mentioned in #47 as well.

For instance, in one deployment with 3 replicas, 2 pods reported the "pod not found" error, while the 3rd reported the issue raised by this ticket ("assume role forbidden"). As usual, restarting the servers fixed the issue.

I also noticed in the server logs that the servers were reporting "announced pod" with pod.status.phase set to Running for the very pods the agents reported as missing.

Unfortunately, there is no log collection in place at the moment, so I am not able to provide the full timeline of events.

@pingles
Contributor

pingles commented Mar 28, 2018 via email

@kevtaylor

We have this same issue: we rolled a cluster this morning and certain pods obtained credentials while others didn't.

When I was looking at the kiam server log there were a lot of "pod not found" errors.

I'm sorry, I then bounced the pods so I wasn't able to capture the details, but I'll keep an eye on it as we roll this out.

@pingles
Contributor

pingles commented Apr 13, 2018

If you could capture any of the server log data, that would be useful. It could potentially be slow updates from the API server (which would ultimately correct themselves after the client retries a few times), or there could be an error that the Kubernetes client code within Kiam doesn't recover from, potentially as a result of restarting the API servers as part of a cluster upgrade.

#46 mentions a type coercion error that I need to fix (it's a trivial change to make) which may be related; if you still have access to your logs elsewhere, it'd be useful to see if the same error is reported.

I'll try to sort #46 ASAP in case it happens to be the same cause, and I'll let you know when an update is ready to go, @kevtaylor.

Thanks to everyone for continuing to report these problems.

@pingles
Contributor

pingles commented Apr 13, 2018

@kevtaylor I just pushed an update in 562b2e6, which is on a branch currently; that should handle the deleted state unknown errors, if you want to give it a try and see if it prevents the errors you saw.
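For anyone following along, the "deleted state unknown" errors come from the tombstone objects client-go passes to delete handlers when a watch has missed the actual delete (for example across an API server restart during a rolling update). The usual handling looks roughly like this; a sketch of the general client-go idiom with a made-up helper, not the exact diff in 562b2e6:

package sketch

import (
    "log"

    v1 "k8s.io/api/core/v1"
    "k8s.io/client-go/tools/cache"
)

// onPodDelete shows the usual client-go pattern for a delete handler: the
// object may be a *v1.Pod, or a cache.DeletedFinalStateUnknown tombstone
// when the watch missed the actual delete.
func onPodDelete(obj interface{}) {
    pod, ok := obj.(*v1.Pod)
    if !ok {
        tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
        if !ok {
            log.Printf("delete handler got unexpected object: %T", obj)
            return
        }
        pod, ok = tombstone.Obj.(*v1.Pod)
        if !ok {
            log.Printf("tombstone contained unexpected object: %T", tombstone.Obj)
            return
        }
    }
    // Drop the pod from the role cache instead of erroring out later.
    removeFromRoleCache(pod)
}

// removeFromRoleCache is a made-up placeholder for whatever structure maps
// pod IPs to IAM role annotations.
func removeFromRoleCache(pod *v1.Pod) {}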

@alexgaganov-fiverr I wonder if this might also help with #47, which you reported; if you could try it out too, that would be super helpful.

I've made a note to try and get more time to tidy and improve the cache handling code; I'm hoping there may be some other structures in the client libs that can avoid a lot of the boilerplate and improve its behaviour.

@pingles
Contributor

pingles commented Apr 13, 2018

If you want to try it, it'll be available on quay.io:

quay.io/uswitch/kiam:562b2e63e008a4733b71d1242804583ff127975a

@kevtaylor

@pingles Rolled a cluster with that version today and it seems to be behaving itself; will keep an eye out for any issues.

@pingles
Contributor

pingles commented Apr 16, 2018

Cool, thank you. Keep us posted!

@dgeorgievski
Author

My apologies for the long absence; I've been busy with work.

Since my last post I've been performing regular tests, with mixed results. I have yet to try running a test with the image mentioned above; I'll try that next. I do, however, have comments on the caching of IAM credentials.

From my limited experience with kiam and kube2iam managing pods' access privileges in AWS, I can tell that caching can improve the reliability of the managed services, but only under certain conditions.

In case of a conflict (a restarted pod appears to have a different IAM annotation than the cached one), I think the pod cache should discard its current entry and refresh the IAM role for the pod instead of throwing assume role errors. In other words, the pod always wins when asserting its correct role.
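Roughly what I have in mind, sketched in Go (the helper names are made up and this is not Kiam's actual code; it assumes the usual iam.amazonaws.com/role annotation):

package sketch

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// roleForPod sketches the "pod always wins" idea: if the cached annotation
// disagrees with the role the pod is requesting, re-read the pod from the
// API server and trust that, instead of returning an assume role error.
func roleForPod(ctx context.Context, client kubernetes.Interface, ns, name, requestedRole string) (string, error) {
    role := cachedRole(ns, name)
    if role == requestedRole {
        return role, nil
    }

    // The cache disagrees with the caller: refresh from the source of truth.
    pod, err := client.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
    if err != nil {
        return "", err
    }
    fresh := pod.Annotations["iam.amazonaws.com/role"]
    updateCache(ns, name, fresh)
    return fresh, nil
}

// cachedRole and updateCache are made-up placeholders for the pod cache.
func cachedRole(ns, name string) string { return "" }
func updateCache(ns, name, role string) {}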

@pingles
Contributor

pingles commented Jun 5, 2018

@dgeorgievski thanks; if you could try the updated image, that would be great, as it fixed a bug which did result in the behaviour described.

Interesting idea about the state update, though; I'll have a think on it.
