Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding RBAC support in kubernetes config (V2) #694

Merged
merged 17 commits into from May 31, 2019

Conversation

simon-mo
Copy link
Contributor

This build on top of #605

  • Enable Kubernetes metric test

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1962/
Test FAILed.

@rkooo567
Copy link
Collaborator

@simon-mo Tests fails with mxnet. I think the mxnet branch we merged is a little unstable? I will run a test again in case.

@rkooo567
Copy link
Collaborator

Jenkins test this please

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1964/
Test PASSed.

@rkooo567
Copy link
Collaborator

We received this error logs

Traceback (most recent call last):
  File "kubernetes_integration_test.py", line 158, in <module>
    new_name=cluster_name)
  File "/home/travis/build/ucbrise/clipper/integration-tests/test_utils.py", line 189, in create_kubernetes_connection
    cl.start_clipper(num_frontend_replicas=num_frontend_replicas)
  File "/home/travis/build/ucbrise/clipper/clipper_admin/clipper_admin/clipper_admin.py", line 139, in start_clipper
    qf_http_timeout_content, num_frontend_replicas)
  File "/home/travis/build/ucbrise/clipper/clipper_admin/clipper_admin/kubernetes/kubernetes_container_manager.py", line 190, in start_clipper
    self._config_rbac()
  File "/home/travis/build/ucbrise/clipper/clipper_admin/clipper_admin/kubernetes/kubernetes_container_manager.py", line 324, in _config_rbac
    body=clusterrole_data, namespace=self.k8s_namespace)
  File "/home/travis/virtualenv/python2.7.14/lib/python2.7/site-packages/kubernetes/client/apis/rbac_authorization_v1alpha1_api.py", line 59, in create_cluster_role
    (data) = self.create_cluster_role_with_http_info(body, **kwargs)
  File "/home/travis/virtualenv/python2.7.14/lib/python2.7/site-packages/kubernetes/client/apis/rbac_authorization_v1alpha1_api.py", line 91, in create_cluster_role_with_http_info
    " to method create_cluster_role" % key
TypeError: Got an unexpected keyword argument 'namespace' to method create_cluster_role

@@ -0,0 +1,5 @@
apiVersion: v1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we should bind our new ServiceAccount to prom_deployment? https://stackoverflow.com/questions/44505461/how-to-configure-a-non-default-serviceaccount-on-a-deployment

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simon-mo So, I understand yaml file is correct, but don't we need to add serviceAccount to our pod -> https://github.com/ucbrise/clipper/blob/develop/clipper_admin/clipper_admin/kubernetes/prom_deployment.yaml?

I found this note from https://github.com/coreos/prometheus-operator/blob/master/Documentation/rbac.md,

Note: A cluster admin is required to create this ClusterRole and create a ClusterRoleBinding or RoleBinding to the ServiceAccount used by the Prometheus Pods. The ServiceAccount used by the Prometheus Pods can be specified in the Prometheus object.

Is it okay not to specify serviceAccount in our metric pod?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct, we need to use this serviceAccount in our prometheus deployment

@simon-mo
Copy link
Contributor Author

There's nothing wrong the yaml file. but we do need to modify the version api call.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1965/
Test PASSed.

@@ -163,6 +168,8 @@ def __init__(self,
configuration.assert_hostname = False
self._k8s_v1 = client.CoreV1Api()
self._k8s_beta = client.ExtensionsV1beta1Api()
self._k8s_rbac=client.RbacAuthorizationV1beta1Api()
Copy link
Collaborator

@rkooo567 rkooo567 May 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing important, but we need spaces around = for this line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I'll make a lint PR formatting the entire code base.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1966/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1967/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1968/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1970/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1971/
Test FAILed.

@rkooo567
Copy link
Collaborator

rkooo567 commented May 15, 2019

So, #687 merged this PR locally and still has an issue accessing the metric service through kube proxy.

I think we should enable https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/, admission controller's ServiceAccount.

-> This didn't resolve the issue. I am not sure if it is essential.

@jacekwachowiak
Copy link
Contributor

jacekwachowiak commented May 15, 2019

Full output from the metrics pod for your information:


[cloud-user@node1 ~]$ kubectl logs metrics-at-default-cluster-5bc5bcb66f-rqb7b
level=info ts=2019-05-15T02:30:54.082508507Z caller=main.go:225 msg="Starting Prometheus" version="(version=2.1.0, branch=HEAD, revision=85f23d82a045d103ea7f3c89a91fba4a93e6367a)"
level=info ts=2019-05-15T02:30:54.082615602Z caller=main.go:226 build_context="(go=go1.9.2, user=root@6e784304d3ff, date=20180119-12:01:23)"
level=info ts=2019-05-15T02:30:54.082635418Z caller=main.go:227 host_details="(Linux 3.10.0-693.el7.x86_64 #1 SMP Thu Jul 6 19:56:57 EDT 2017 x86_64 metrics-at-default-cluster-5bc5bcb66f-rqb7b (none))"
level=info ts=2019-05-15T02:30:54.082649724Z caller=main.go:228 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2019-05-15T02:30:54.08554498Z caller=main.go:499 msg="Starting TSDB ..."
level=info ts=2019-05-15T02:30:54.086248343Z caller=web.go:383 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2019-05-15T02:30:54.095947666Z caller=main.go:509 msg="TSDB started"
level=info ts=2019-05-15T02:30:54.096015563Z caller=main.go:585 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=error ts=2019-05-15T02:30:54.097265516Z caller=manager.go:214 component="discovery manager scrape" msg="Cannot create Kubernetes discovery" err="open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory"
level=info ts=2019-05-15T02:30:54.097327523Z caller=main.go:486 msg="Server is ready to receive web requests."
level=info ts=2019-05-15T02:30:54.097355883Z caller=manager.go:59 component="scrape manager" msg="Starting scrape manager..."

As you can see there is just one error for now. The connection through the browser still fails with:

Error: 'read tcp 10.233.90.0:57818->10.233.96.27:9090: read: connection reset by peer'
Trying to reach: 'http://10.233.96.27:9090/graph'

I'll let you know if I get anything new.

UPDATE:
Changing both of automountServiceAccountToken: false to true in /home/cloud-user/.local/lib/python3.6/site-packages/clipper_admin/kubernetes/prom_deployment.yaml makes the Prometheus pod log error disappear

@rkooo567
Copy link
Collaborator

rkooo567 commented May 21, 2019

@simon-mo If you are busy, I will debug this tomorrow!

Also, this post #687 can be useful for future debug.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/2010/
Test PASSed.

@simon-mo simon-mo self-assigned this May 27, 2019
@rkooo567 rkooo567 added this to backlog in Tasks May 28, 2019
@rkooo567 rkooo567 removed this from backlog in Tasks May 28, 2019
@rkooo567
Copy link
Collaborator

Please refer to this issue. #687

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/2038/
Test PASSed.

@rkooo567
Copy link
Collaborator

@simon-mo I bound default service account to prometheus cluster role, and it seems like it works fine. I think we can start from this and allow users to change settings later.

@simon-mo
Copy link
Contributor Author

@rkooo567 good work! since I authored this PR. I'll leave you to approve and merge. Thanks!

@rkooo567
Copy link
Collaborator

@simon-mo Awesome. I will merge it tonight after cleaning up. Let's release the next version after that.

@withsmilo @RehanSD Can you guys review the PR in them meantime?

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/2045/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/2046/
Test PASSed.

@withsmilo withsmilo self-requested a review May 30, 2019 10:47
@@ -370,6 +377,7 @@ def _start_prometheus(self):
CONFIG_FILES['metric']['deployment'],
version=PROM_VERSION,
cluster_name=self.cluster_name,
service_account_name=self.cluster_name+"-prometheus"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find service_account_name in prom_deployment.yaml.

Copy link
Collaborator

@rkooo567 rkooo567 May 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I decided to use default as prom_deployment service account because

  1. it works
  2. There's no reason to setup specific service account for users. We can just leave it to users.

I will delete this line.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/2051/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/2052/
Test PASSed.

@rkooo567
Copy link
Collaborator

@simon-mo @withsmilo I will merge if one of you approve it

@withsmilo withsmilo self-requested a review May 31, 2019 05:51
Copy link
Collaborator

@withsmilo withsmilo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@rkooo567 rkooo567 merged commit 2599acd into ucbrise:develop May 31, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants