add model-server/estimator to KeplerInternal #322

Merged

Conversation

sunya-ch
Collaborator

@sunya-ch sunya-ch commented Dec 7, 2023

This PR replaces #235 by moving the integration to Kepler-internal API.

Change summary:

  • Add Estimator and ModelServer in KeplerInternalSpec and KeplerInternalStatus
  • Add components/estimator
  • Implement components/model-server when enabled
  • Add Model Server Reconcilers to kepler-internal Reconciler
  • Modify components/exporter to add estimator sidecar if set
  • Add access Role for deployments and persistentvolumeclaims
  • Add AddIfNotEmpty and VolumeFromEmptyDir utility functions (used for Estimator and ModelServer creation)

Bug fixes (not related to model server):

  • Correct reference for UP-TO-DATE status
  • Replace ki.namespace in updateStatus with the deployment namespace from spec.

KeplerInternal

Here is the CR that I used for running in my local cluster.

apiVersion: kepler.system.sustainable.computing.io/v1alpha1
kind: KeplerInternal
metadata:
  annotations:
    kepler.sustainable.computing.io/bpf-attach-method: libbpf
  labels:
    app.kubernetes.io/name: kepler
    app.kubernetes.io/instance: kepler
    app.kubernetes.io/part-of: kepler-operator
  name: kepler
spec:
  exporter:
    deployment:
      image: quay.io/sustainable_computing_io/kepler:release-0.6.1-libbpf
      namespace: kepler-operator
  openshift:
    enabled: true
    dashboard:
      enabled: true
  modelServer:
    enabled: true
  estimator:
    node:
      components:
        sidecar: true
        initUrl: https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/rapl/AbsPower/BPFOnly/KNeighborsRegressorTrainer_1.zip
      total:
        sidecar: true
        initUrl: https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip

KeplerInternal Status

With neither estimator nor modelserver

NAME     PORT   DESIRED   CURRENT   UP-TO-DATE    READY   AVAILABLE   AGE   IMAGE   ESTIMATOR      MODEL-SERVER
kepler   9103   7         7         7             7       7           83s   <abbr>  NotInstalled   NotInstalled

With only estimator

NAME     PORT   DESIRED   CURRENT   UP-TO-DATE    READY   AVAILABLE   AGE   IMAGE   ESTIMATOR      MODEL-SERVER
kepler   9103   7         7         7             7       7           40m   <abbr>  Running        NotInstalled

With estimator+modelserver

NAME     PORT   DESIRED   CURRENT   UP-TO-DATE    READY   AVAILABLE   AGE   IMAGE   ESTIMATOR      MODEL-SERVER
kepler   9103   7         7         7             7       7           15s   <abbr>  Running        Running

Signed-off-by: Sunyanan Choochotkaew <sunyanan.choochotkaew1@ibm.com>

if ms.Enabled {
	exporterConfigMap["MODEL_SERVER_ENABLE"] = "true"
}
modelServerConfig := modelserver.ModelServerConfigForClient(k.ModelServerDeploymentName(), k.Spec.ModelServer)
Collaborator

Suggested change
- modelServerConfig := modelserver.ModelServerConfigForClient(k.ModelServerDeploymentName(), k.Spec.ModelServer)
+ modelServerConfig := modelserver.ConfigForClient(k.ModelServerDeploymentName(), k.Spec.ModelServer)

How about renaming modelserver.ModelServer<X> to modelserver.<X> to avoid stuttering?

Collaborator

I wonder if k.Spec.ModelServer.ClientConfig() would be better, since the function only needs the ModelServer spec to compute the client config.

Collaborator Author

Yes, it is the config for the clients of the model server (i.e., the exporter and estimator containers).

Collaborator Author

@sunya-ch sunya-ch Dec 8, 2023

@sthaha Sorry, I misunderstood your comment.
I have just changed the function name: https://github.com/sustainable-computing-io/kepler-operator/compare/5fa1b54bc0dc3b2f68894af0cc89b7341f8a80f1..6209dadf4f6f04ed3e786989f062f71459dd06f2

Note: I cannot provide k.Spec.ModelServer.ClientConfig().
A method with the receiver (ms *v1alpha1.InternalModelServerSpec) cannot be declared outside v1alpha1; it has to be defined inside the v1alpha1 package.
And I cannot move the method into v1alpha1, because it needs the k8s module and the k8s module imports v1alpha1 (a circular import).
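
For readers less familiar with this constraint: Go only allows methods to be declared on types defined in the same package, so a helper that depends on another package usually ends up as a plain function that takes the spec as an argument. A minimal single-file sketch of that shape, assuming a hypothetical stand-in for InternalModelServerSpec with one illustrative field; configForClient and the MODEL_SERVER_DEPLOYMENT key are made up for this example and are not the operator's actual code:

```go
package main

import "fmt"

// InternalModelServerSpec is a hypothetical stand-in for the API type that
// lives in the v1alpha1 package; its single field is illustrative only.
type InternalModelServerSpec struct {
	Enabled bool
}

// configForClient is a plain function, not a method on the spec type, so it
// could live in a package that imports both the API types and the k8s helpers
// without the API package ever importing k8s.
// "MODEL_SERVER_DEPLOYMENT" is a made-up key used only for this sketch.
func configForClient(deploymentName string, ms *InternalModelServerSpec) map[string]string {
	cfg := map[string]string{}
	if ms == nil || !ms.Enabled {
		return cfg
	}
	cfg["MODEL_SERVER_ENABLE"] = "true"
	cfg["MODEL_SERVER_DEPLOYMENT"] = deploymentName
	return cfg
}

func main() {
	cfg := configForClient("kepler-model-server", &InternalModelServerSpec{Enabled: true})
	fmt.Println(cfg["MODEL_SERVER_ENABLE"], cfg["MODEL_SERVER_DEPLOYMENT"])
}
```

The trade-off is the slightly longer call site (modelserver.ConfigForClient(name, spec) instead of spec.ClientConfig()), in exchange for keeping the API package free of the k8s dependency.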

Collaborator Author

@sthaha Could you please confirm whether the change resolves this?

@sthaha
Collaborator

sthaha commented Dec 7, 2023

@sunya-ch Does this support deploying multiple model servers, in the same way that it currently allows multiple kepler-internals?

It may be worth considering whether it makes sense to have a separate CRD for KeplerModelServer. This would allow multiple model servers to be deployed (if that is even a use case) with different configs.
kepler-internal could then have a spec.modelserver.ref that refers to the KeplerModelServer.

@sunya-ch
Collaborator Author

sunya-ch commented Dec 7, 2023

@sunya-ch Does this support deploying multiple model servers, in the same way that it currently allows multiple kepler-internals?

It may be worth considering whether it makes sense to have a separate CRD for KeplerModelServer. This would allow multiple model servers to be deployed (if that is even a use case) with different configs. kepler-internal could then have a spec.modelserver.ref that refers to the KeplerModelServer.

Yes. I generate the model server name in the same way as the kepler exporter name (based on the CR name plus a model-server suffix). I think we can keep them together for simplicity of deployment. Each exporter can connect to a different model server. The model server is created only when it is enabled. However, kepler can also specify just a model server URL to connect to another model server (including an external one).

@sthaha
Collaborator

sthaha commented Dec 8, 2023

@sunya-ch Overall the code looks good to me; the one missing part is the e2e test.
Could you please also add a simple e2e test that validates the deployment of the model server?

@sunya-ch
Collaborator Author

sunya-ch commented Dec 8, 2023

@sunya-ch Overall the code looks good to me; the one missing part is the e2e test.
Could you please also add a simple e2e test that validates the deployment of the model server?

I cannot find an e2e test for the KeplerInternal CR. I think adding an e2e test for keplerinternal would make this PR too big. Could that be done in another PR? Then I could help add the model server part.
We can just check the KeplerInternal status at .status.modelServer.status and .status.estimator.status.

See the issue open here: #314

@sthaha
Collaborator

sthaha commented Dec 8, 2023

@sunya-ch

I cannot find an e2e test for the KeplerInternal CR. I think adding an e2e test for keplerinternal would make this PR too big. Could that be done in another PR? Then I could help add the model server part.

The current Kepler tests cover almost all the use cases currently supported by kepler-internal, so a set of tests that replicates what creating a kepler already does gives us only a low ROI.

IMHO, all features should be accompanied by tests that validate the most common use cases. It shouldn't be too hard to add an e2e test by making a copy of the existing kepler-e2e.

@sthaha
Collaborator

sthaha commented Dec 8, 2023

@sunya-ch,

annotations:
  kepler.sustainable.computing.io/bpf-attach-method: libbpf

That isn't required for kepler-internal; it is only a hack enabled for kepler so that stable API users are able to deploy the libbpf images, which are kernel agnostic.

@sunya-ch
Collaborator Author

sunya-ch commented Dec 8, 2023

@sunya-ch,

annotations:
  kepler.sustainable.computing.io/bpf-attach-method: libbpf

That isn't required for kepler-internal; it is only a hack enabled for kepler so that stable API users are able to deploy the libbpf images, which are kernel agnostic.

I see. I had it because I first tried to create the CR by specifying the image name, but it turned out that this is not allowed for keplerinternal. I just didn't remove it ;)

@sunya-ch
Collaborator Author

sunya-ch commented Dec 8, 2023

@sunya-ch

I cannot find an e2e test for the KeplerInternal CR. I think adding an e2e test for keplerinternal would make this PR too big. Could that be done in another PR? Then I could help add the model server part.

The current Kepler tests cover almost all the use cases currently supported by kepler-internal, so a set of tests that replicates what creating a kepler already does gives us only a low ROI.

IMHO, all features should be accompanied by tests that validate the most common use cases. It shouldn't be too hard to add an e2e test by making a copy of the existing kepler-e2e.

Yes, it might be, but I still think it should go in a separate PR for better tracking. I can then rebase this PR onto that PR.

@sthaha
Collaborator

sthaha commented Dec 8, 2023

@sunya-ch Hopefully #325 should help with the model-server testing 👼

@sunya-ch sunya-ch marked this pull request as draft December 8, 2023 09:35
@sunya-ch
Collaborator Author

sunya-ch commented Dec 8, 2023

Converted to draft, building on top of PR #325.
Will rebase onto the v1alpha1 branch once that PR has merged.

@sunya-ch sunya-ch force-pushed the model-server-internal branch 2 times, most recently from 6a11dad to 0b10349 Compare December 8, 2023 09:55
@sunya-ch sunya-ch marked this pull request as ready for review December 8, 2023 09:56
@sunya-ch sunya-ch force-pushed the model-server-internal branch 4 times, most recently from 95bf443 to a4177f9 Compare December 8, 2023 11:32
@sunya-ch
Collaborator Author

sunya-ch commented Dec 8, 2023

@sthaha Sorry for the multiple force-pushes.
Most of the main code is unchanged; the failures came from my bad code in the test case.

The additional changes are an e2e test case and an Enabled() function for InternalEstimatorSpec.

func (e InternalEstimatorSpec) Enabled() bool {

pkg/components/estimator/estimator.go Outdated Show resolved Hide resolved
Comment on lines 91 to 95
k8s.VolumeFromHost("lib-modules", "/lib/modules"),
k8s.VolumeFromHost("tracing", "/sys"),
k8s.VolumeFromHost("proc", "/proc"),
k8s.VolumeFromHost("kernel-src", "/usr/src/kernels"),
k8s.VolumeFromHost("kernel-debug", "/sys/kernel/debug"),
Collaborator

We need a better way of handling volumes (in another PR).

E.g. each New<X>Container can return []NamedMount.

type NamedMount string

const (
	HostLibModulesMount  NamedMount = "host-lib-modules"
	HostProc             NamedMount = "host-proc"
	KeplerConfigMapMount NamedMount = "cm-kepler"
)

// Volume builds the volume for a named mount, dispatching on the name prefix.
// Needs "strings", corev1 (k8s.io/api/core/v1) and the operator's k8s helpers.
func (m NamedMount) Volume() corev1.Volume {
	mounts := map[NamedMount]string{
		HostLibModulesMount: "/lib/modules",
		...
	}
	name := string(m)
	switch {
	case strings.HasPrefix(name, "host-"):
		return k8s.VolumeFromHost(name, mounts[m])
	case strings.HasPrefix(name, "cm-"):
		return k8s.VolumeFromConfigMap(name, mounts[m])
	}
	return corev1.Volume{}
}

pkg/reconciler/runner.go Show resolved Hide resolved
pkg/utils/test/assertions.go Outdated Show resolved Hide resolved

func (f Framework) AssertInternalStatus(ki *v1alpha1.KeplerInternal) {
// the status will be updated
ki = f.WaitUntilInternalCondition(ki.Name, v1alpha1.Reconciled, v1alpha1.ConditionTrue)
Collaborator

Suggested change
- ki = f.WaitUntilInternalCondition(ki.Name, v1alpha1.Reconciled, v1alpha1.ConditionTrue)
+ ki = f.WaitUntilInternalCondition(name, v1alpha1.Reconciled, v1alpha1.ConditionTrue)

@sthaha
Collaborator

sthaha commented Dec 11, 2023

@vprashar2929 could you please help validate this on OpenShift? Just ensuring that the supported use case of creating a kepler works is good enough.

@sunya-ch sunya-ch force-pushed the model-server-internal branch 2 times, most recently from 5444912 to 6037dd0 Compare December 11, 2023 08:03
@vprashar2929
Collaborator

Couple of observations while testing this on OpenShift(4.13):

  • When deploying KeplerInternals on OpenShift, there is still an issue with showing the appropriate status of Kepler
oc get keplerinternals.kepler.system.sustainable.computing.io                                                                               
NAME           PORT   DESIRED   CURRENT   UP-TO-DATE   READY   AVAILABLE   AGE    IMAGE                                  ESTIMATOR   MODEL-SERVER
mykepler-101   9103   0         0                      0                   168m   quay.io/rh_ee_vprashar/kepler:latest
  • When the estimator is deployed along with the model server, it checks for the wrong model-server service name: mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local (the openshift-kepler-operator ns doesn't exist)
set NODE_TOTAL_ESTIMATOR to true.
set NODE_TOTAL_INIT_URL to https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip.
set NODE_COMPONENTS_ESTIMATOR to true.
set NODE_COMPONENTS_INIT_URL to https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/rapl/AbsPower/BPFOnly/KNeighborsRegressorTrainer_1.zip.
clean socket
cannot make request to http://mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local:8100/model: HTTPConnectionPool(host='mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local', port=8100): Max retries exceeded with url: /model (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7fca34a02430>: Failed to resolve 'mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local' ([Errno -2] Name or service not known)"))
get archived model
get init url https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip
try getting archieved model from URL: https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip for AbsPower
<Response [200]>
load model from config:  /mnt/download/acpi/AbsPower
cannot make request to http://mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local:8100/model: HTTPConnectionPool(host='mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local', port=8100): Max retries exceeded with url: /model (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7fca33e17460>: Failed to resolve 'mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local' ([Errno -2] Name or service not known)"))
get archived model
get init url https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/rapl/AbsPower/BPFOnly/KNeighborsRegressorTrainer_1.zip
try getting archieved model from URL: https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/rapl/AbsPower/BPFOnly/KNeighborsRegressorTrainer_1.zip for AbsPower
<Response [200]>
load model from config:  /mnt/download/rapl/AbsPower

@sunya-ch sunya-ch force-pushed the model-server-internal branch 2 times, most recently from 3e8fba0 to 6b0fb79 Compare December 12, 2023 00:22
@sunya-ch
Collaborator Author

sunya-ch commented Dec 12, 2023

Couple of observations while testing this on OpenShift(4.13):

  • When deploying KeplerInternals on OpenShift, there is still an issue with showing the appropriate status of Kepler
oc get keplerinternals.kepler.system.sustainable.computing.io                                                                               
NAME           PORT   DESIRED   CURRENT   UP-TO-DATE   READY   AVAILABLE   AGE    IMAGE                                  ESTIMATOR   MODEL-SERVER
mykepler-101   9103   0         0                      0                   168m   quay.io/rh_ee_vprashar/kepler:latest
  • When the estimator is deployed along with the model server, it checks for the wrong model-server service name: mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local (the openshift-kepler-operator ns doesn't exist)

@vprashar2929 Thank you. I did find the hard-coded namespace in the model server default URL. I have now updated it to use the namespace from the kepler spec.

For the status not being updated, could you share the describe result? I cannot find the cause of the issue, since I can see the updated status on my OpenShift cluster (4.12). From the above result, it seems the issue comes not only from the model server but also from the status of the daemonset.
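
The namespace fix mentioned in this comment can be pictured with a small sketch. It assumes the Service name is the CR name plus a "-model-server-svc" suffix and that the client calls port 8100 at /model, as seen in the estimator log; the modelServerURL function and its signature are hypothetical and are not taken from the operator's code:

```go
package main

import "fmt"

// modelServerURL builds the in-cluster URL of the model-server Service.
// The namespace is passed in (for example from the deployment namespace in
// the CR spec) instead of being hard-coded.
// The suffix, port and path mirror the log output quoted in this thread;
// the function itself is a hypothetical illustration.
func modelServerURL(crName, namespace string, port int) string {
	return fmt.Sprintf("http://%s-model-server-svc.%s.svc.cluster.local:%d/model",
		crName, namespace, port)
}

func main() {
	// "kepler-operator" is the namespace used in the example CR above.
	fmt.Println(modelServerURL("mykepler-101", "kepler-operator", 8100))
}
```

With a hard-coded namespace the host part resolves only when the Service actually lives in that namespace, which matches the NameResolutionError reported above.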

if errors.IsNotFound(err) {
	return true, fmt.Errorf("kepler-internal %s is not found", name)
}
statusOK := true
Collaborator

Suggested change
- statusOK := true
+ var statusOK bool

@vprashar2929
Collaborator

Thank you. I did find the hard-coded namespace in the model server default URL. I have now updated it to use the namespace from the kepler spec.

Should we use the namespace from the keplerinternal spec?

For the status not being updated, could you share the describe result? I cannot find the cause of the issue, since I can see the updated status on my OpenShift cluster (4.12). From the above result, it seems the issue comes not only from the model server but also from the status of the daemonset.

When I created the keplerinternals, I used a namespace different from kepler-operator. AFAIK the controller only watches the kepler-operator ns, so if you create an instance with a namespace other than kepler-operator you won't be able to see the status. This is a known issue, and we are planning to fix it by adding a config-map (#312).

@sthaha
Collaborator

sthaha commented Dec 13, 2023

@sunya-ch

I see the following difference between installing kepler using the kepler CRD from the v1alpha1 branch and installing it from this PR. Ideally, we require that optional kepler-internal features do not introduce any difference.

However, I am okay with having this if you could confirm that it has no impact or that it fixes an issue.

configmap

image

Signed-off-by: Sunyanan Choochotkaew <sunyanan.choochotkaew1@ibm.com>
@sunya-ch
Collaborator Author

@sunya-ch

I see the following difference between installing kepler using the kepler CRD from the v1alpha1 branch and installing it from this PR. Ideally, we require that optional kepler-internal features do not introduce any difference.

However, I am okay with having this if you could confirm that it has no impact or that it fixes an issue.

configmap

image

Yes, it has no effect because the default is false. It is only enabled when it is set to true. (reference: https://github.com/sustainable-computing-io/kepler/blob/442bcfe5d3bf26a1285ff0f13d1f5017d28b9e37/pkg/model/model.go#L189)
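
A short sketch of the "default is false" behaviour described here, assuming the setting in question is a boolean string such as MODEL_SERVER_ENABLE read from a key/value config (the ConfigMap diff itself is an image and is not reproduced in this thread). The modelServerEnabled helper is hypothetical and is not Kepler's actual parsing code:

```go
package main

import (
	"fmt"
	"strconv"
)

// modelServerEnabled reports whether the model server is switched on.
// A missing or unparsable value falls back to false, so adding the key with
// "false" (or omitting it) leaves the behaviour unchanged.
// This helper is a hypothetical illustration, not Kepler's implementation.
func modelServerEnabled(cfg map[string]string) bool {
	v, ok := cfg["MODEL_SERVER_ENABLE"]
	if !ok {
		return false
	}
	enabled, err := strconv.ParseBool(v)
	if err != nil {
		return false
	}
	return enabled
}

func main() {
	fmt.Println(modelServerEnabled(map[string]string{}))
	fmt.Println(modelServerEnabled(map[string]string{"MODEL_SERVER_ENABLE": "true"}))
}
```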

@sunya-ch
Collaborator Author

sunya-ch commented Dec 13, 2023

Thank you. I did find the hard-coded namespace in the model server default URL. I have now updated it to use the namespace from the kepler spec.

Should we use the namespace from the keplerinternal spec?

For the status not being updated, could you share the describe result? I cannot find the cause of the issue, since I can see the updated status on my OpenShift cluster (4.12). From the above result, it seems the issue comes not only from the model server but also from the status of the daemonset.

When I created the keplerinternals, I used a namespace different from kepler-operator. AFAIK the controller only watches the kepler-operator ns, so if you create an instance with a namespace other than kepler-operator you won't be able to see the status. This is a known issue, and we are planning to fix it by adding a config-map (#312).

In the end, I guess so: we may allow Kepler to be deployed in any namespace for keplerinternal.
I put a TODO in this PR: https://github.com/sustainable-computing-io/kepler-operator/pull/324/files#diff-435eecb1d2af40ee96747f88ab71eb2758da989c92f76dcb1f714fc1b300c633

We need the Kepler operator to add the new namespace to its cache. For now, we have to rely on the additional namespace list given on the command line. Whoever deploys Kepler has to know in advance the list of namespaces where keplerinternal is allowed to be installed.

flag.CommandLine.Var(flag.Value(&additionalNamespaces), "watch-namespaces",

Collaborator

@sthaha sthaha left a comment

Looks great! Thanks a lot @sunya-ch 🤗

@sthaha sthaha merged commit 4286d9f into sustainable-computing-io:v1alpha1 Dec 13, 2023
9 checks passed
3 participants