
implement model-based power estimator #104

Closed · wants to merge 1 commit

Conversation

@sunya-ch (Collaborator) commented Aug 5, 2022

This PR introduces a dynamic way to estimate power via an Estimator class (pkg/model/estimator.go).

  • the model is expected to be dynamically downloaded into the folder data/model
  • a Python program runs as a child process and applies the trained model to the read values, communicating via a Unix domain socket
  • the model class is implemented in Python and currently supports .h5 Keras models, .sav scikit-learn models, and a simple ratio model whose metric importance is computed from correlation to power

There are three additional points needed to integrate this class into Kepler:

  1. initialize in exporter.go
errCh := make(chan error)
estimator := &model.Estimator{
   Err: errCh,
}
// start the python program (pkg/model/py/estimator.py);
// it will listen for PowerRequest on the unix domain socket "/tmp/estimator.sock"
go estimator.StartPyEstimator()
defer estimator.Destroy()
  2. call the GetPower function in reader.go
// it will create a PowerRequest and send it to estimator.py via the unix domain socket
func (e *Estimator) GetPower(modelName string, xCols []string, xValues [][]float32, corePower, dramPower, gpuPower, otherPower []float32) []float32 {}
  • modelName refers to the model folder in /data/model, which contains a metadata.json giving the remaining details of the model such as the model file, feature-engineering .pkl files, features, error, and so on (the minimum-error model is auto-selected if modelName is empty, ""); see the hypothetical sketch after this list
  • xCols refers to the features
  • xValues refers to the values of each feature for each pod [no. pods x no. features]
  • corePower refers to the core power of each package (leave it empty if not available)
  • dramPower, gpuPower, otherPower are the same as corePower
  3. put initial models into the container's data/model folder (this can be done by statically adding them to the Docker image or via deployment manifest volumes)
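
For illustration, here is a hypothetical metadata.json along the lines described above; every field name in this sketch is an assumption for illustration, not the actual schema:

{
  "model_file": "model.sav",
  "features": ["cpu_cycles", "cache_miss"],
  "fe_files": ["scaler.pkl"],
  "mae": 2.5
}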

Check the example usage in pkg/model/estimator_test.go.

If you agree with this direction, we can modify estimator.py to

  • support other modeling classes
  • select the applicable features from available features
  • connect to kepler-model-server to update the model

Signed-off-by: Sunyanan Choochotkaew <sunyanan.choochotkaew1@ibm.com>

@rootfs (Contributor) commented Aug 5, 2022

thank you @sunya-ch for this impressive work!

I wonder how much CPU and memory the estimator will consume; do you have any data?

Review thread on func (e *Estimator) GetPower(modelName string, xCols []string, xValues [][]float32, corePower, dramPower, gpuPower, otherPower []float32) []float32:

Contributor: Does GetPower get metrics from all Pods as input, and get power consumption in a batch?

Collaborator (author): Power consumption is reported as a list of per-package RAPL power values, while xCols and xValues refer to the read metric values, such as cpu_cycles per pod (for all pods at the same time).

For example, at tick t:
RAPL pkg0: core = 20, dram = 5, gpu = unknown, other = unknown
RAPL pkg1: core = 30, dram = 1, gpu = unknown, other = unknown

There are 3 pods:
Pod A: cpu_cycles = 100, cache_miss=1
Pod B: cpu_cycles = 50, cache_miss=0
Pod C: cpu_cycles = 10, cache_miss=0

xCols = [cpu_cycles, cache_miss]
xValues = [[100, 1], [50, 0], [10, 0]]
corePower = [20, 30]
dramPower = [5, 1]
gpuPower = []
otherPower = []
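
Putting this example into a call (a minimal sketch; estimator is assumed to be initialized as in the exporter.go snippet above):

power := estimator.GetPower(
	"", // modelName: empty string auto-selects the minimum-error model
	[]string{"cpu_cycles", "cache_miss"},    // xCols
	[][]float32{{100, 1}, {50, 0}, {10, 0}}, // xValues: one row per pod (A, B, C)
	[]float32{20, 30},                       // corePower per package
	[]float32{5, 1},                         // dramPower per package
	[]float32{},                             // gpuPower: unknown
	[]float32{},                             // otherPower: unknown
)
// power is expected to hold one estimated value per pod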

Review thread on PowerRequest:

type PowerRequest struct {
	ModelName string      `json:"model_name"`
	XCols     []string    `json:"x_cols"`
	XValues   [][]float32 `json:"x_values"`

Collaborator: What are XCols and XValues? Might be nice to have more meaningful names or some comments here.

Collaborator: You have it in the PR description; it might be good to have it in the code as well.

Collaborator (author): Please see the example above.
As for naming, how about:
XCols --> MetricNames
XValues --> MetricValuesOfAllPods

Contributor: sgtm

@marceloamaral (Collaborator) commented

That is great, I really want to have different Power Models, especially bringing back the Power Model based on Ratio.

@sunya-ch (Collaborator, author) commented Aug 5, 2022

That is great, I really want to have different Power Models, especially bringing back the Power Model based on Ratio.

In the current implementation, I treat the ratio approach the same as the trained approach, considering it as a model.
This way, you can dynamically update the importance of the ratio metric, for example when you find a more highly correlated metric. What do you think?

@marceloamaral (Collaborator) commented

I'm wondering if, instead of calling the Python code, we could have a micro-service running as a Power Model Server (which would run the Python code) and access it through an API (HTTP or gRPC)...
That way, we can enforce a good design pattern, using APIs to communicate between different programming languages.

@rootfs was creating an external server to do something like this, right? A server that receives some data, does some calculations, and responds with some information.
This server could be a container that exposes an API to receive the data and return the energy consumption.
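
(To make the proposal concrete, a hedged sketch of the client side of such an HTTP-based power model server; the server name, endpoint path, payload fields, and response shape are all hypothetical:)

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Hypothetical request payload; field names are illustrative only.
	payload, _ := json.Marshal(map[string]interface{}{
		"metrics": []string{"cpu_cycles", "cache_miss"},
		"values":  [][]float32{{100, 1}, {50, 0}, {10, 0}},
	})
	// POST the metrics to a hypothetical /power endpoint of the model server.
	resp, err := http.Post("http://power-model-server:8080/power", "application/json", bytes.NewReader(payload))
	if err != nil {
		fmt.Println("post:", err)
		return
	}
	defer resp.Body.Close()

	// Assumed response: a JSON array with one power value per pod.
	var powers []float32
	if err := json.NewDecoder(resp.Body).Decode(&powers); err != nil {
		fmt.Println("decode:", err)
		return
	}
	fmt.Println("per-pod power:", powers)
}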

@sunya-ch (Collaborator, author) commented Aug 5, 2022

thank you @sunya-ch for this impressive work!

I wonder how much CPU and memory the estimator will consume; do you have any data?

I have no experimental data yet. It passes the feature values through the Unix domain socket and just applies the mathematical model to them for estimation. The training process is not included here.

@marceloamaral (Collaborator) commented

We will also need some documentation describing how to configure the models, with details about the supported models.

@sunya-ch (Collaborator, author) commented Aug 5, 2022

I'm wondering if, instead of calling the Python code, we could have a micro-service running as a Power Model Server (which would run the Python code) and access it through an API (HTTP or gRPC)... That way, we can enforce a good design pattern, using APIs to communicate between different programming languages.

@rootfs was creating an external server to do something like this, right? A server that receives some data, does some calculations, and responds with some information. This server could be a container that exposes an API to receive the data and return the energy consumption.

The external server is for training the model, and not all data should be sent to it. This module is called for prediction on every data read. I think it is better to use a local socket instead of going over the network to a microservice.
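
(For context, a minimal sketch of what the client side of this local-socket exchange could look like in Go; the request fields mirror the PowerRequest struct shown in this PR, while the core_power JSON tag and the response shape, a plain JSON array of per-pod power values, are assumptions for illustration:)

package main

import (
	"encoding/json"
	"fmt"
	"net"
)

// PowerRequest mirrors the struct in this PR; the core_power tag is assumed.
type PowerRequest struct {
	ModelName string      `json:"model_name"`
	XCols     []string    `json:"x_cols"`
	XValues   [][]float32 `json:"x_values"`
	CorePower []float32   `json:"core_power"`
}

func main() {
	// Connect to the estimator's Unix domain socket.
	conn, err := net.Dial("unix", "/tmp/estimator.sock")
	if err != nil {
		fmt.Println("dial:", err)
		return
	}
	defer conn.Close()

	req := PowerRequest{
		XCols:     []string{"cpu_cycles", "cache_miss"},
		XValues:   [][]float32{{100, 1}, {50, 0}, {10, 0}},
		CorePower: []float32{20, 30},
	}
	// Send the request as JSON and read back the per-pod power values.
	if err := json.NewEncoder(conn).Encode(req); err != nil {
		fmt.Println("encode:", err)
		return
	}
	var powers []float32
	if err := json.NewDecoder(conn).Decode(&powers); err != nil {
		fmt.Println("decode:", err)
		return
	}
	fmt.Println("per-pod power:", powers)
}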

@marceloamaral (Collaborator) commented

If we need to use Python, I will strongly argue for running it as a separate service.

Otherwise, we could consider using a machine learning library in Golang:
https://upstack.co/knowledge/golang-machine-learning

Review thread on func (e *Estimator) GetPower(modelName string, xCols []string, xValues [][]float32, corePower, dramPower, gpuPower, otherPower []float32) []float32:

Collaborator: Did not find where this function is called. Will it replace some code in pkg/collector/reader.go?

Collaborator (author): No, I haven't replaced it in reader.go yet.
We need to convert the current PodEnergy to xCols and xValues, and the package power to corePower, dramPower, and so on.

Collaborator: Ahhh ok, this PR is a draft then!

@sunya-ch (Collaborator, author) commented Aug 5, 2022

If we need to use Python, I will strongly argue for running it as a separate service.

Otherwise, we could consider using a machine learning library in Golang: https://upstack.co/knowledge/golang-machine-learning

I think at this step a Unix domain socket should be fine, because it only applies the trained method (no training or anything fancy beyond reading the trained weights and doing multiplications on the read data).
Migrating to Golang is a big task, and I don't think it would have that much impact. We can put it in the future work once everything is settled.

I will evaluate the end-to-end power estimation time per tick.
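
(As a hypothetical illustration of "reading the trained weights and doing multiplications": for a ratio/linear-style model, the per-pod estimate reduces to a weighted sum over that pod's feature values. The function below is a sketch, not code from this PR.)

// applyLinearModel illustrates applying a trained linear model to one pod:
// a weighted sum of the pod's feature values.
func applyLinearModel(weights, features []float32) float32 {
	var power float32
	for i, w := range weights {
		power += w * features[i]
	}
	return power
}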

@rootfs (Contributor) commented Aug 5, 2022

If we need to use Python, I will strongly argue for running it as a separate service.

Otherwise, we could consider using a machine learning library in Golang: https://upstack.co/knowledge/golang-machine-learning

+1 to that.

The estimator Python code could have its own repo and run as a sidecar, so we don't have to upgrade the Kepler container image whenever the estimator changes.

@sunya-ch force-pushed the model branch 2 times, most recently from 9610edd to 883ba9c on August 8, 2022.
@sunya-ch (Collaborator, author) commented Aug 8, 2022

I amended the commit by:

  • separating the estimator so that a separate Docker image can be built and run as a sidecar container.
    To build the image, run
    make build-estimator
    The new image will be quay.io/sustainable_computing_io/kepler-estimator:latest
  • renaming the xCols and xValues variables.

TO-DO:

  • measure overhead
  • update reader.go to use GetPower
  • add sidecar container to kepler deployment

@rootfs (Contributor) commented Aug 8, 2022

Sounds good. I just created a repo there for your next push: quay.io/sustainable_computing_io/kepler-estimator


Review thread on PowerRequest (after renaming):

type PowerRequest struct {
	ModelName   string   `json:"model_name"`
	MetricNames []string `json:"metrics"`

Contributor: please run gofmt

@sunya-ch (Collaborator, author) commented

These are the results of testing with the number of pods varied from 10 to 100 (since the maximum number of pods per worker node is about 100).

  • general usage from pidstat with a 1-second interval of GetPower requests
    • 0.04% MEM
    • VSZ ≈ 3.8 GB
    • RSS ≈ 0.4 GB (396,764 KB)
# Time        UID       PID    %usr %system  %guest   %wait    %CPU   CPU  minflt/s  majflt/s     VSZ     RSS   %MEM  Command
21:41:27        0   2035757    0.00    0.00    0.00    0.00    0.00    22      0.09      0.00 3878724  396764   0.04  python
21:41:28        0   2035757    0.00    0.00    0.00    0.00    0.00    22      0.09      0.00 3878724  396764   0.04  python
21:41:29        0   2035757    0.00    0.00    0.00    0.00    0.00    22      0.09      0.00 3878724  396764   0.04  python
21:41:30        0   2035757    0.00    0.00    0.00    0.00    0.00    22      0.09      0.00 3878724  396764   0.04  python
21:41:31        0   2035757    0.00    0.00    0.00    0.00    0.00    22      0.09      0.00 3878724  396764   0.04  python
  • elapsed time to handle a request, observed from estimator_client.py: ~0.003 s for the scikit-learn model, ~0.020 s for the ratio model
    • the ratio model invokes roughly 10x more function calls than applying the trained model (I believe it could be further optimized)
      [screenshot: function-call profile]
  • profiled time (with Python cProfile) from estimator_test.py
    • handling a request takes an additional ~0.001 s for finding a specific model by name and ~0.001 s for common tasks
    • all trained models take almost the same elapsed time to get power, about 0.001 s
      [screenshots: cProfile output]

summary

  • with a 1 s request interval, the estimator adds no significant CPU overhead; for memory, VSZ ≈ 3.8 GB and RSS ≈ 0.4 GB
  • common communication latency is about 0.001-0.002 s, and the corresponding model-lookup latency is about 0.001 s
  • computation latency is significantly affected by the model function; the general regressor model from scikit-learn takes about 0.001 s

@rootfs what do you think?

Signed-off-by: Sunyanan Choochotkaew <sunyanan.choochotkaew1@ibm.com>
@rootfs (Contributor) commented Aug 10, 2022

summary

  • with a 1 s request interval, the estimator adds no significant CPU overhead; for memory, VSZ ≈ 3.8 GB and RSS ≈ 0.4 GB
  • common communication latency is about 0.001-0.002 s, and the corresponding model-lookup latency is about 0.001 s
  • computation latency is significantly affected by the model function; the general regressor model from scikit-learn takes about 0.001 s

@rootfs what do you think?

@sunya-ch thank you for this comprehensive study! These results are worth a doc of their own. Please add them to the PR as well.

looking forward to your full integration

@rootfs (Contributor) commented Aug 31, 2022

The work has moved to kepler-estimator; closing.

@rootfs closed this Aug 31, 2022