update model server support #235

Closed


@sunya-ch sunya-ch commented Sep 14, 2023

This PR updates model server support aiming for release v0.6 as mentioned in #232.

API doc:
https://github.com/sustainable-computing-io/kepler-operator/blob/c15c77621958cc79d1921d9af378915158abc4ca/docs/api.md

The PR contains the following changes:

  • updating KeplerSpec
  • updating the modelserver component
  • adding the estimator component
  • modifying the exporter to run a sidecar if configured
  • adding modelServerReconcilers

Note that the placeholder for setting filters and the model name lives here in kepler:
https://github.com/sustainable-computing-io/kepler/blob/73cb11fb963f425013cf7f03f214c8f8b85c7853/pkg/config/config.go#L390.
However, how it should be used has not been decided yet, so it is not supported end to end.

Example configmap change from full deployment on OpenShift on IBM Cloud
(kepler CR: config/samples/kepler_full_deploy.yaml)

> oc get configmap -n openshift-kepler-operator model-server-cm -oyaml
...
  MODEL_CONFIG: |
    NODE_COMPONENTS_ESTIMATOR=true
    NODE_COMPONENTS_INIT_URL=https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/Linux-4.15.0-213-generic-x86_64_v0.6/rapl/AbsPower/KubeletOnly/GradientBoostingRegressorTrainer_1.zip
  MODEL_SERVER_ENABLE: "true"
  MODEL_SERVER_URL: http://model-server-svc.openshift-kepler-operator.svc.cluster.local:8100

Resources:

> oc get -n openshift-kepler-operator all
NAME                                       READY   STATUS    RESTARTS   AGE
pod/kepler-exporter-ds-48lff               2/2     Running   0          4m10s
pod/kepler-exporter-ds-4nj8g               2/2     Running   0          4m10s
pod/kepler-exporter-ds-5lj62               2/2     Running   0          4m10s
pod/kepler-exporter-ds-9rqlv               2/2     Running   0          4m10s
pod/kepler-exporter-ds-knnsh               2/2     Running   0          4m10s
pod/kepler-exporter-ds-kskmq               2/2     Running   0          4m11s
pod/kepler-exporter-ds-szl8w               2/2     Running   0          2m49s
pod/model-server-deploy-85fd7b8c6d-wkkrg   1/1     Running   0          4m10s

NAME                          TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
service/kepler-exporter-svc   ClusterIP   None         <none>        9103/TCP   4m18s
service/model-server-svc      ClusterIP   None         <none>        8100/TCP   4m17s

NAME                                DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/kepler-exporter-ds   7         7         7       7            7           kubernetes.io/os=linux   4m19s

NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/model-server-deploy   1/1     1            1           4m18s

NAME                                             DESIRED   CURRENT   READY   AGE
replicaset.apps/model-server-deploy-85fd7b8c6d   1         1         1       4m18s

exporter log

> oc logs -n openshift-kepler-operator kepler-exporter-ds-szl8w estimator
I0914 08:42:05.585412 3734085 node_component_energy.go:54] Using the EstimatorSidecar/AbsPower Power Model to estimate Node Component Power

estimator log

> oc logs -n openshift-kepler-operator kepler-exporter-ds-szl8w estimator
set NODE_COMPONENTS_ESTIMATOR to true.
set NODE_COMPONENTS_INIT_URL to https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/Linux-4.15.0-213-generic-x86_64_v0.6/rapl/AbsPower/KubeletOnly/GradientBoostingRegressorTrainer_1.zip.
clean socket
load model from model server:  /mnt/download/rapl/AbsPower

Signed-off-by: Sunyanan Choochotkaew <sunyanan.choochotkaew1@ibm.com>

@sunya-ch
Collaborator Author

sunya-ch commented Sep 14, 2023

Please feel free to review the design first.
I'm working on fixing the code bugs (stringvars flag, missing rbac, ...). I will amend the commit once I confirm that all deployment choices work, at least on my cluster.

@sunya-ch
Collaborator Author

sunya-ch commented Sep 14, 2023

The critical deployment issue is now fixed.

However, please allow me to leave a few issues open (I need help from others to fix them in separate PRs):

  • The Kepler model server container is created after the exporter/estimator, so the model cannot be loaded at startup; the exporter pod has to be restarted manually once the model server pod is ready.
  • e2e integration test (for now I manually deploy the sample deployment choices on my own cluster)

@sunya-ch sunya-ch marked this pull request as ready for review September 14, 2023 08:49
@sthaha sthaha marked this pull request as draft September 14, 2023 20:33
@sthaha sthaha added discussion needed Pre enhancement discussion do-not-merge labels Sep 14, 2023
@sthaha
Collaborator

sthaha commented Sep 14, 2023

@sunya-ch Thanks a lot for adding the feature 🤗
Please allow us some time to go through the feature implementation. My first focus will be on spec.modelServer, to ensure we expose only the minimal set of APIs.

Comment on lines 91 to 95
keplerEstimatorImage := env("KEPLER_ESTIMATOR_IMAGE", estimator.StableImage)
flag.StringVar(&estimator.Config.Image, "kepler-estimator.image", keplerEstimatorImage, "kepler estimator image")

keplerModelServerImage := env("KEPLER_MODEL_SERVER_IMAGE", modelserver.StableImage)
flag.StringVar(&modelserver.Config.Image, "kepler-model-server.image", keplerModelServerImage, "kepler model server image")
Collaborator

(suggestion): we have an opportunity to refactor this pattern of reading the env to set the flag default.
Let's use estimator.image and model-server.image as the flag names, please.

I have a question about the name model-server. Do you know why we call it that? What does it really serve?
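
One way the env-to-flag pattern could be factored out is sketched below. This is only an illustration under assumptions: `envOr`, `imageFlag`, `parseImages`, and the `example.invalid` image names are hypothetical and not part of the operator; only the flag names and environment variable names come from the review thread.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// envOr returns the value of the environment variable key, or fallback when
// the variable is unset or empty. (Hypothetical helper.)
func envOr(key, fallback string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return fallback
}

// imageFlag registers a string flag whose default comes from an environment
// variable, so each image option needs only one line at the call site.
func imageFlag(fs *flag.FlagSet, target *string, name, envKey, fallback string) {
	fs.StringVar(target, name, envOr(envKey, fallback), name+" (default from $"+envKey+")")
}

// parseImages wires up both image flags and parses args.
// The fallback image names are placeholders, not real images.
func parseImages(args []string) (estimatorImage, modelServerImage string, err error) {
	fs := flag.NewFlagSet("operator", flag.ContinueOnError)
	imageFlag(fs, &estimatorImage, "estimator.image",
		"KEPLER_ESTIMATOR_IMAGE", "example.invalid/estimator:stable")
	imageFlag(fs, &modelServerImage, "model-server.image",
		"KEPLER_MODEL_SERVER_IMAGE", "example.invalid/model-server:stable")
	err = fs.Parse(args)
	return estimatorImage, modelServerImage, err
}

func main() {
	e, m, err := parseImages(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Println("estimator image:", e)
	fmt.Println("model server image:", m)
}
```

With this shape, an explicit flag wins over the environment variable, which in turn wins over the built-in fallback.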

Collaborator Author

model-server serves the model selection based on available features. When online-trainer is validated, it will serve the fine-tuned model online to the exporter.

Comment on lines 134 to 137
NodeTotalEstimator *EstimatorSpec `json:"nodeTotalEstimator,omitempty"`
NodeComponentsEstimator *EstimatorSpec `json:"nodeComponentsEstimator,omitempty"`
ContainerTotalEstimator *EstimatorSpec `json:"containerTotalEstimator,omitempty"`
ContainerComponentsEstimator *EstimatorSpec `json:"containerComponentsEstimator,omitempty"`
Collaborator

How about we keep all estimator related spec under

spec:
  estimator:
    node:
      total:
      components:
    container:
      total:
      components:
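
As a rough sketch of how such a nested layout might map to Go API types: the type names below (`EstimatorConfig`, `EstimatorGroup`, `EstimatorSpec`) are assumptions made for illustration, and the `sidecar` and `initUrl` keys follow the sample CRs shown later in this thread. The URL is a placeholder.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// EstimatorConfig is a hypothetical leaf config shared by every estimator.
type EstimatorConfig struct {
	Sidecar bool   `json:"sidecar,omitempty"`
	InitURL string `json:"initUrl,omitempty"`
}

// EstimatorGroup holds the total and components estimators for one scope.
type EstimatorGroup struct {
	Total      *EstimatorConfig `json:"total,omitempty"`
	Components *EstimatorConfig `json:"components,omitempty"`
}

// EstimatorSpec groups all estimator settings under spec.estimator.
// Pointers are used so that unset groups are omitted from the output.
type EstimatorSpec struct {
	Node      *EstimatorGroup `json:"node,omitempty"`
	Container *EstimatorGroup `json:"container,omitempty"`
}

// render serializes the spec to JSON for inspection.
func render(spec EstimatorSpec) string {
	b, err := json.Marshal(spec)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	spec := EstimatorSpec{
		Node: &EstimatorGroup{
			Components: &EstimatorConfig{
				Sidecar: true,
				InitURL: "https://example.invalid/model.zip", // placeholder URL
			},
		},
	}
	fmt.Println(render(spec))
}
```

Compared with four flat `*Estimator` fields, the nesting keeps the node and container scopes symmetric and leaves room for per-scope settings.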

Comment on lines 114 to 120
func GetModifiedCommandAndArgs(inCmd []string) (outCmd, outArgs []string) {
outCmd = bashCommand
outArgs = []string{fmt.Sprintf(waitForSocketCommand, strings.Join(inCmd, " "))}
return
}
Collaborator

Suggested change
- func GetModifiedCommandAndArgs(inCmd []string) (outCmd, outArgs []string) {
-     outCmd = bashCommand
-     outArgs = []string{fmt.Sprintf(waitForSocketCommand, strings.Join(inCmd, " "))}
-     return
- }
+ func estimatorCmdAndArgs(inCmd []string) ([]string, []string) {
+     return bashCommand,
+         []string{fmt.Sprintf(waitForSocketCommand, strings.Join(inCmd, " "))}
+ }

Collaborator Author

Previously, it was used by the exporter module. Now I have put it (together with the tmp volume mount) inside the newly defined AddEstimatorDependency function.

k8s.VolumeFromConfigMap("cfm", ConfigmapName),
}

if estimator.IsEstimatorSidecarEnabled(k) {
Collaborator

Suggested change
- if estimator.IsEstimatorSidecarEnabled(k) {
+ containers := []corev1.Container{*exporterContainer}
+ if estimator.NeedsSidecar(k) {
+     containers = addEstimatorSidecar(exporterContainer)
+ }

Collaborator Author

Since it also requires changes to the volumes, I modified your suggestion as below.

containers, volumes = addEstimatorSidecar(exporterContainer, volumes)

@@ -58,16 +64,13 @@ type ModelServerTrainerSpec struct {

// +kubebuilder:default=true
PromSSLDisable bool `json:"promSSLDisable,omitempty"`
Collaborator

How about we move all the Prometheus config into a separate struct?

@sthaha
Collaborator

sthaha commented Sep 15, 2023

@sunya-ch , Thanks a lot for adding this feature 🙇 .
You can ignore most of the comments in the review; let's focus on getting the spec.modelserver and spec.estimator parts down to the minimal required configuration. We should be able to make assumptions about the model server that is deployed, and thus may not need all the configuration currently in place.

We also need e2e tests to validate the most common configurations and scenarios ...
The status update of kepler should also consider the status of these deployments.

Any thoughts on having both model-server and estimator disabled by default?
cc: @sunya-ch @rootfs @piparul ?

@sunya-ch
Collaborator Author

sunya-ch commented Sep 15, 2023

@sthaha Thank you so much for the review.
I made most of the changes according to your review. Where my change differs slightly from your suggestion, I left a comment below that review comment.

> Any thoughts on having both model-server and estimator disabled by default?

Both should be disabled by default.
The exception is when ModelServerSpec is defined: if any value in that section is set, we should expect a local model server by default (model server enabled). We are also open to a remote model server; in that case the user can disable the local one and provide the target URL and port of the remote server.

Comment on lines 149 to 151
if ms.Port > 0 {
port = ms.Port
}
Collaborator

this shouldn't be needed since the api server should reject anything less than 1

Collaborator Author

@sunya-ch sunya-ch Sep 15, 2023

I don't see a webhook that validates the Kepler CR.
I need this check at least for the unit tests, because there the default is not filled in by kubebuilder as it is via the CSV file.
In a deployment the port defaults to 8100 when the user does not define it; I keep this check just to make sure we fall back to the default port if it is not properly set.

Collaborator

We don't need a webhook if we add the right markers

// +kubebuilder:validation:Minimum
specifies the minimum numeric value that this field can have. Negative numbers are supported.

https://book.kubebuilder.io/reference/markers/crd-validation.html
see: exporter.deployment.port

This also reminds me that we should also separate out deployment related settings to a DeploymentSpec
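
A minimal sketch of what such markers could look like on a port field follows. The struct name `DeploymentSketch` is hypothetical; the `+kubebuilder:default` and `+kubebuilder:validation:Minimum`/`Maximum` markers are standard kubebuilder CRD markers, and the value 8100 is the default port mentioned above. Note that the markers are plain comments to the Go compiler: they only take effect through the generated CRD schema, so a struct built directly in a unit test still has the zero value.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DeploymentSketch is a hypothetical holder for deployment-related settings.
type DeploymentSketch struct {
	// The API server applies the default and rejects out-of-range values
	// once these markers are rendered into the CRD schema.
	// +kubebuilder:default=8100
	// +kubebuilder:validation:Minimum=1
	// +kubebuilder:validation:Maximum=65535
	Port int32 `json:"port,omitempty"`
}

// render serializes the value to JSON for inspection.
func render(d DeploymentSketch) string {
	b, err := json.Marshal(d)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// Built in Go, no defaulting happens: the port stays 0 and is omitted.
	fmt.Println(render(DeploymentSketch{}))
	// An explicit value is serialized as expected.
	fmt.Println(render(DeploymentSketch{Port: 8100}))
}
```

This is why unit tests that bypass the API server need to set the port explicitly in their fixtures.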

msConfig := k8s.StringMap{}

prom := trainer.Prom
if prom != nil {
Collaborator

(nit): prefer early returns for better readability

Suggested change
- if prom != nil {
+ if prom == nil {
+     return msConfig
+ }

Collaborator Author

I intentionally did not return early here, for the future case where the trainer has more settings than just prom.

Collaborator

Please feel free to ignore nits.
imho, it is better to write the code for the current scenario and readjust if/when there is a change in future 👼 (YAGNI) :)

promHeaders.WriteString(h.Key)
promHeaders.WriteString(":")
promHeaders.WriteString(h.Value)
promHeaders.WriteString(",")
Collaborator

is a trailing comma ok?

Collaborator Author

I'm not sure who wrote this part. It will be used here: https://github.com/sustainable-computing-io/kepler-model-server/blob/4af262f2bb2ae2420eaddf64cdc8480eb05ff3e6/src/train/prom/prom_query.py#L26.
Could I leave it for someone else to fix later if it does not match?

Collaborator

Yes, I am happy to have this validated later.
Could you please add a todo .. something like // TODO: ensure trailing , is accepted
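
If the trailing comma turns out not to be accepted by the consumer, one possible alternative is to collect the `key:value` pairs and join them, which never emits a trailing separator. This is a sketch only: the `Header` type and `joinHeaders` function are hypothetical stand-ins, and whether the model server expects this exact format is not confirmed in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// Header is a hypothetical key/value pair for a Prometheus request header.
type Header struct {
	Key   string
	Value string
}

// joinHeaders renders headers as "k1:v1,k2:v2" without a trailing comma.
func joinHeaders(hs []Header) string {
	parts := make([]string, 0, len(hs))
	for _, h := range hs {
		parts = append(parts, h.Key+":"+h.Value)
	}
	return strings.Join(parts, ",")
}

func main() {
	fmt.Println(joinHeaders([]Header{
		{Key: "Authorization", Value: "Bearer token"},
		{Key: "X-Scope", Value: "kepler"},
	}))
}
```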

Comment on lines 456 to 459
if cluster == k8s.OpenShift {
updater := newUpdaterForKepler(k)
rs = append(rs, updater(exporter.NewSCC(components.Full, k)))
}
Collaborator

Suggested change
- if cluster == k8s.OpenShift {
-     updater := newUpdaterForKepler(k)
-     rs = append(rs, updater(exporter.NewSCC(components.Full, k)))
- }

TargetPort: intstr.FromString("http"),
}},
},
}
}

func NewConfigMap(d components.Detail, k *v1alpha1.Kepler) *corev1.ConfigMap {
func GetModelServerConfigForClient(ms *v1alpha1.ModelServerSpec) map[string]string {
Collaborator

Suggested change
- func GetModelServerConfigForClient(ms *v1alpha1.ModelServerSpec) map[string]string {
+ func ModelServerConfigForClient(ms v1alpha1.ModelServerSpec) k8s.StringMap {

Keeping it k8s.StringMap has the advantage that we can merge 2 or more maps.
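
To illustrate the merging advantage, here is a stdlib-only sketch. The `StringMap` type and its `Merge` method below are written for this example and are assumptions; the actual `k8s.StringMap` in the operator may have a different signature. The config keys are taken from the configmap shown in the PR description, and the URL is a placeholder.

```go
package main

import "fmt"

// StringMap is a stand-in for the operator's k8s.StringMap (assumed shape).
type StringMap map[string]string

// Merge returns a new map containing l's entries overlaid with the entries
// of each map in others, in order. Neither l nor others are modified.
func (l StringMap) Merge(others ...StringMap) StringMap {
	out := StringMap{}
	for k, v := range l {
		out[k] = v
	}
	for _, o := range others {
		for k, v := range o {
			out[k] = v
		}
	}
	return out
}

func main() {
	exporterCfg := StringMap{"MODEL_SERVER_ENABLE": "false"}
	modelServerCfg := StringMap{
		"MODEL_SERVER_ENABLE": "true",
		"MODEL_SERVER_URL":    "http://example.invalid:8100", // placeholder URL
	}
	fmt.Println(exporterCfg.Merge(modelServerCfg))
}
```

Returning the map type from each config builder lets callers compose the exporter and model server settings in one expression instead of copying entries by hand.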

@@ -274,3 +289,11 @@ func defaultIfEmpty(given, defaultVal string) string {
}
return defaultVal
}

func GetDefaultModelServer(ms v1alpha1.ModelServerSpec) string {
Collaborator

Suggested change
- func GetDefaultModelServer(ms v1alpha1.ModelServerSpec) string {
+ func serverUrl(ms v1alpha1.ModelServerSpec) string {

@@ -61,6 +61,13 @@ func (l StringMap) ToMap() map[string]string {
return l
}

func (l StringMap) AddIfNotEmpty(k, v string) map[string]string {
if v != "" {
Collaborator

should we check the key k as well?

@sunya-ch
Collaborator Author

I have pushed an update for the review comments that I marked with an icon.

Here are example deployments.

minimum deployment

spec:
  exporter:
    deployment:
      port: 9103

> oc get -n openshift-kepler-operator all
NAME                           READY   STATUS    RESTARTS   AGE
pod/kepler-exporter-ds-d4ctn   1/1     Running   0          11s
pod/kepler-exporter-ds-fd5xt   1/1     Running   0          11s
pod/kepler-exporter-ds-fzjk7   1/1     Running   0          11s
pod/kepler-exporter-ds-n46xf   1/1     Running   0          11s
pod/kepler-exporter-ds-nthsc   1/1     Running   0          11s
pod/kepler-exporter-ds-qm7p4   1/1     Running   0          11s
pod/kepler-exporter-ds-s5t48   1/1     Running   0          11s

NAME                          TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
service/kepler-exporter-svc   ClusterIP   None         <none>        9103/TCP   11s

NAME                                DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/kepler-exporter-ds   7         7         7       7            7           kubernetes.io/os=linux   11s

with estimator only

spec:
  exporter:
    deployment:
      port: 9103
  estimator:
    node:
      components:
        sidecar: true
        initUrl: https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/Linux-4.15.0-213-generic-x86_64_v0.6/rapl/AbsPower/KubeletOnly/GradientBoostingRegressorTrainer_1.zip

NAME                           READY   STATUS    RESTARTS   AGE
pod/kepler-exporter-ds-5g5kk   2/2     Running   0          16s
pod/kepler-exporter-ds-7tg9j   2/2     Running   0          16s
pod/kepler-exporter-ds-fh4f2   2/2     Running   0          16s
pod/kepler-exporter-ds-fqdnf   2/2     Running   0          16s
pod/kepler-exporter-ds-lgfwx   2/2     Running   0          16s
pod/kepler-exporter-ds-nthhd   2/2     Running   0          16s
pod/kepler-exporter-ds-pgrl6   2/2     Running   0          16s

NAME                          TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
service/kepler-exporter-svc   ClusterIP   None         <none>        9103/TCP   16s

NAME                                DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/kepler-exporter-ds   7         7         7       7            7           kubernetes.io/os=linux   17s

full deployment

spec:
  exporter:
    deployment:
      port: 9103
  estimator:
    node:
      components:
        sidecar: true
        initUrl: https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/Linux-4.15.0-213-generic-x86_64_v0.6/rapl/AbsPower/KubeletOnly/GradientBoostingRegressorTrainer_1.zip
  modelServer:
    enabled: true

> oc get all -n openshift-kepler-operator
NAME                                       READY   STATUS    RESTARTS   AGE
pod/kepler-exporter-ds-4bsnt               2/2     Running   0          4m48s
pod/kepler-exporter-ds-679tv               2/2     Running   0          4m48s
pod/kepler-exporter-ds-6cmkf               2/2     Running   0          4m48s
pod/kepler-exporter-ds-9ltv4               2/2     Running   0          4m49s
pod/kepler-exporter-ds-c6wnl               2/2     Running   0          4m48s
pod/kepler-exporter-ds-f2l9z               2/2     Running   0          4m49s
pod/kepler-exporter-ds-z5wkg               2/2     Running   0          2m55s
pod/model-server-deploy-85fd7b8c6d-9dwz9   1/1     Running   0          4m49s

NAME                          TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
service/kepler-exporter-svc   ClusterIP   None         <none>        9103/TCP   4m50s
service/model-server-svc      ClusterIP   None         <none>        8100/TCP   4m50s

NAME                                DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/kepler-exporter-ds   7         7         7       7            7           kubernetes.io/os=linux   4m50s

NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/model-server-deploy   1/1     1            1           4m50s

NAME                                             DESIRED   CURRENT   READY   AGE
replicaset.apps/model-server-deploy-85fd7b8c6d   1         1         1       4m50s

Signed-off-by: Sunyanan Choochotkaew <sunyanan.choochotkaew1@ibm.com>
Comment on lines +65 to +71
if k == "" {
return StringMap{}
}
if v != "" {
l[k] = v
}
return l
Collaborator

nit: please feel free to ignore

Suggested change
- if k == "" {
-     return StringMap{}
- }
- if v != "" {
-     l[k] = v
- }
- return l
+ if k != "" && v != "" {
+     l[k] = v
+ }
+ return l
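
For reference, the simplified form can be tried in isolation as below. This is a self-contained sketch with a local `StringMap` type standing in for the one in `pkg/utils/k8s`; the return type is assumed to be `StringMap` so that calls can be chained. One behavioural difference from the earlier version: with an empty key, the earlier code returned a fresh empty map, whereas this form returns the receiver unchanged.

```go
package main

import "fmt"

// StringMap is a local stand-in for the operator's k8s.StringMap.
type StringMap map[string]string

// AddIfNotEmpty sets l[k] = v only when both key and value are non-empty,
// and returns the receiver so calls can be chained.
// The receiver must be non-nil, since writing to a nil map panics.
func (l StringMap) AddIfNotEmpty(k, v string) StringMap {
	if k != "" && v != "" {
		l[k] = v
	}
	return l
}

func main() {
	cfg := StringMap{}
	cfg.AddIfNotEmpty("MODEL_SERVER_ENABLE", "true").
		AddIfNotEmpty("", "ignored: empty key").
		AddIfNotEmpty("MODEL_SERVER_URL", "") // ignored: empty value
	fmt.Println(cfg)
}
```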

@@ -33,6 +33,7 @@ type Runner struct {
Logger logr.Logger
}

// TODO: make sure that model server container (deployment) is ready before creating kepler daemonset
Collaborator

@sthaha sthaha Sep 17, 2023

We don't need to do that ... we just have to indicate in the status that kepler Available = false until all resources are available. The runner should not hold any specific information about the objects it is reconciling.

If there is a requirement that X should be created before Y, then create a reconciler.WaitFor{ resource: X } and add it to the list of reconcilers.
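
The ordering idea can be sketched without any Kubernetes dependencies. Everything below is hypothetical: the `Reconciler` interface, `Step`, `WaitFor`, and `runInOrder` are stdlib-only stand-ins written for illustration and do not reflect the operator's actual reconciler package. The point is only that a wait step placed in the list stops later reconcilers until a readiness check passes, while the runner itself stays generic.

```go
package main

import (
	"context"
	"fmt"
)

// Reconciler is a hypothetical stand-in for the operator's abstraction.
// done=false means "stop here and retry on the next reconcile".
type Reconciler interface {
	Reconcile(ctx context.Context) (done bool, err error)
}

// Step performs one action, such as creating a resource.
type Step struct {
	Name string
	Run  func(ctx context.Context) error
}

func (s Step) Reconcile(ctx context.Context) (bool, error) {
	if err := s.Run(ctx); err != nil {
		return false, err
	}
	return true, nil
}

// WaitFor holds back later reconcilers until Ready reports true.
type WaitFor struct {
	Name  string
	Ready func(ctx context.Context) (bool, error)
}

func (w WaitFor) Reconcile(ctx context.Context) (bool, error) {
	return w.Ready(ctx)
}

// runInOrder runs reconcilers sequentially and stops at the first one that
// fails or is not done. It returns how many reconcilers completed.
func runInOrder(ctx context.Context, rs []Reconciler) (int, error) {
	completed := 0
	for _, r := range rs {
		done, err := r.Reconcile(ctx)
		if err != nil {
			return completed, err
		}
		if !done {
			return completed, nil
		}
		completed++
	}
	return completed, nil
}

// demo builds the list "create model server, wait for it, create exporter"
// and reports how many reconcilers completed in a single pass.
func demo(modelServerReady bool) int {
	noop := func(context.Context) error { return nil }
	rs := []Reconciler{
		Step{Name: "create model server", Run: noop},
		WaitFor{Name: "model server ready", Ready: func(context.Context) (bool, error) {
			return modelServerReady, nil
		}},
		Step{Name: "create exporter daemonset", Run: noop},
	}
	n, _ := runInOrder(context.Background(), rs)
	return n
}

func main() {
	fmt.Println(demo(false)) // the exporter step is not reached
	fmt.Println(demo(true))  // all reconcilers complete
}
```

In a real controller the "not done" case would typically translate into a requeue rather than blocking, so the runner keeps no knowledge of which resource it is waiting on.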

Collaborator Author

Could you give me specific guidance on this, or can I leave this work to another contributor who can add it quickly? I will move the TODO line to another place, such as line 351 in kepler.go.

svc,
pvc,
)
return rs, nil
Collaborator

We need to ensure that the status of kepler is properly updated based on the state of deployments.

@@ -1,3 +1,4 @@
# TODO: add integration test for all deployment choices (minimum, estimator-only, server-only, full)
Collaborator

@sunya-ch IMHO, as a requirement, all major features must have an e2e test before merging. We can of course create follow-up PRs for testing edge cases, but we must have e2e tests that exercise at least the green path and the most common red path.

Collaborator

@sunya-ch , now that #237 has merged, you should find the oc package quite useful


type EstimatorConfig struct {
// +kubebuilder:default=false
SidecarEnabled bool `json:"sidecar,omitempty"`
Collaborator

Isn't Enabled better? We don't need to be specific about how a feature is enabled; that is an implementation detail.

Suggested change
- SidecarEnabled bool `json:"sidecar,omitempty"`
+ Enabled bool `json:"enabled,omitempty"`

Collaborator Author

Kepler estimator could mean either the local estimator or the sidecar estimator. The config here (e.g., initURL) is common to both estimators. If the sidecar is not enabled, the local estimator inside the kepler exporter will be used.

Collaborator

How about we then specify that in the spec itself

estimator: 
  enabled: true|false
  mode: sidecar|built-in 

Collaborator Author

We will only have two cases, with and without the sidecar. I think it is okay to have only a sidecar field in the spec that is true or false.

@sunya-ch
Collaborator Author

sunya-ch commented Dec 7, 2023

moved to #322

@sunya-ch sunya-ch closed this Dec 7, 2023