
implement model-based power estimator #104

Closed · wants to merge 1 commit

Conversation

@sunya-ch (Collaborator) commented Aug 5, 2022

This PR introduces a dynamic way to estimate power via an Estimator class (pkg/model/estimator.go).

  • the model is expected to be dynamically downloaded into the folder data/model
  • a Python program runs as a child process and applies the trained model to the read values, communicating via a Unix domain socket
  • the model class is implemented in Python and currently supports .h5 Keras models, .sav scikit-learn models, and a simple ratio model whose metric importance is computed from correlation to power

There are three additional points needed to integrate this class into Kepler:

  1. initialize in exporter.go
errCh := make(chan error)
estimator := &model.Estimator{
   Err: errCh,
}
// start the python program (pkg/model/py/estimator.py);
// it will listen for PowerRequest on the unix domain socket "/tmp/estimator.sock"
go estimator.StartPyEstimator()
defer estimator.Destroy()
  2. call the GetPower function in reader.go
// it will create a PowerRequest and send it to estimator.py via the unix domain socket
func (e *Estimator) GetPower(modelName string, xCols []string, xValues [][]float32, corePower, dramPower, gpuPower, otherPower []float32) []float32 {}
  • modelName refers to the model folder in /data/model, which contains a metadata.json giving the remaining details of the model such as the model file, feature-engineering .pkl files, features, error, and so on (the minimum-error model is auto-selected if modelName is empty, ""); see the hypothetical sketch after this list
  • xCols refers to the features
  • xValues refers to the values of each feature for each pod [no. pods x no. features]
  • corePower refers to the core power of each package (leave it empty if not available)
  • dramPower, gpuPower, otherPower are the same as corePower
  3. put initial models into the container's data/model folder (this can be done by statically adding them to the Docker image or via deployment manifest volumes)
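
For illustration, here is a hypothetical metadata.json along the lines described above; every field name in this sketch is an assumption for illustration, not the actual schema:

{
  "model_file": "model.sav",
  "features": ["cpu_cycles", "cache_miss"],
  "fe_files": ["scaler.pkl"],
  "mae": 2.5
}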

Check the example usage in pkg/model/estimator_test.go.

If you agree with this direction, we can modify estimator.py to

  • support other modeling classes
  • select the applicable features from available features
  • connect to kepler-model-server to update the model

Signed-off-by: Sunyanan Choochotkaew <sunyanan.choochotkaew1@ibm.com>

@rootfs (Contributor) commented Aug 5, 2022

thank you @sunya-ch for this impressive work!

I wonder how much CPU and memory the estimator will consume; do you have any data?

Review thread on func (e *Estimator) GetPower(modelName string, xCols []string, xValues [][]float32, corePower, dramPower, gpuPower, otherPower []float32) []float32:

Contributor: Does GetPower get metrics from all Pods as input, and get power consumption in a batch?

Collaborator (author): Power consumption is reported as a list of per-package RAPL power values, while xCols and xValues refer to the read metric values, such as cpu_cycles per pod (for all pods at the same time).

For example, at tick t:
RAPL pkg0: core = 20, dram = 5, gpu = unknown, other = unknown
RAPL pkg1: core = 30, dram = 1, gpu = unknown, other = unknown

There are 3 pods:
Pod A: cpu_cycles = 100, cache_miss=1
Pod B: cpu_cycles = 50, cache_miss=0
Pod C: cpu_cycles = 10, cache_miss=0

xCols = [cpu_cycles, cache_miss]
xValues = [[100, 1], [50, 0], [10, 0]]
corePower = [20, 30]
dramPower = [5, 1]
gpuPower = []
otherPower = []
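
Putting this example into a call (a minimal sketch; estimator is assumed to be initialized as in the exporter.go snippet above):

power := estimator.GetPower(
	"", // modelName: empty string auto-selects the minimum-error model
	[]string{"cpu_cycles", "cache_miss"},    // xCols
	[][]float32{{100, 1}, {50, 0}, {10, 0}}, // xValues: one row per pod (A, B, C)
	[]float32{20, 30},                       // corePower per package
	[]float32{5, 1},                         // dramPower per package
	[]float32{},                             // gpuPower: unknown
	[]float32{},                             // otherPower: unknown
)
// power is expected to hold one estimated value per pod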

Review thread on PowerRequest:

type PowerRequest struct {
	ModelName string      `json:"model_name"`
	XCols     []string    `json:"x_cols"`
	XValues   [][]float32 `json:"x_values"`

Collaborator: What are XCols and XValues? Might be nice to have more meaningful names or some comments here.

Collaborator: You have it in the PR description; it might be good to have it in the code as well.

Collaborator (author): Please see the example above.
As for naming, how about:
XCols --> MetricNames
XValues --> MetricValuesOfAllPods

Contributor: sgtm

@marceloamaral (Collaborator) commented

That is great, I really want to have different Power Models, especially bringing back the Power Model based on Ratio.

@sunya-ch (Collaborator, author) commented Aug 5, 2022

That is great, I really want to have different Power Models, especially bringing back the Power Model based on Ratio.

In the current implementation, I treat the ratio approach the same as the trained approach, considering it as a model.
This way, you can dynamically update the importance of the ratio metric, for example when you find a more highly correlated metric. What do you think?

@marceloamaral (Collaborator) commented

I'm wondering if, instead of calling the Python code, we could have a micro-service running as a Power Model Server (which would run the Python code) and access it through an API (HTTP or gRPC)...
That way, we can enforce a good design pattern, using APIs to communicate between different programming languages.

@rootfs was creating an external server to do something like this, right? A server that receives some data, does some calculations, and responds with some information.
This server could be a container that exposes an API to receive the data and return the energy consumption.
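
(To make the proposal concrete, a hedged sketch of the client side of such an HTTP-based power model server; the server name, endpoint path, payload fields, and response shape are all hypothetical:)

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Hypothetical request payload; field names are illustrative only.
	payload, _ := json.Marshal(map[string]interface{}{
		"metrics": []string{"cpu_cycles", "cache_miss"},
		"values":  [][]float32{{100, 1}, {50, 0}, {10, 0}},
	})
	// POST the metrics to a hypothetical /power endpoint of the model server.
	resp, err := http.Post("http://power-model-server:8080/power", "application/json", bytes.NewReader(payload))
	if err != nil {
		fmt.Println("post:", err)
		return
	}
	defer resp.Body.Close()

	// Assumed response: a JSON array with one power value per pod.
	var powers []float32
	if err := json.NewDecoder(resp.Body).Decode(&powers); err != nil {
		fmt.Println("decode:", err)
		return
	}
	fmt.Println("per-pod power:", powers)
}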

@sunya-ch (Collaborator, author) commented Aug 5, 2022

thank you @sunya-ch for this impressive work!

I wonder how much CPU and memory the estimator will consume; do you have any data?

I have no experimental data yet. It passes the feature values through the Unix domain socket and just applies the mathematical model to them for estimation. The training process is not included here.

@marceloamaral (Collaborator) commented

We will also need some documentation describing how to configure the models, with details about the supported models.

@sunya-ch (Collaborator, author) commented Aug 5, 2022

I'm wondering if, instead of calling the Python code, we could have a micro-service running as a Power Model Server (which would run the Python code) and access it through an API (HTTP or gRPC)... That way, we can enforce a good design pattern, using APIs to communicate between different programming languages.

@rootfs was creating an external server to do something like this, right? A server that receives some data, does some calculations, and responds with some information. This server could be a container that exposes an API to receive the data and return the energy consumption.

The external server is for training the model, and not all data should be sent to it. This module is called for prediction on every data read. I think it is better to use a local socket instead of going over the network to a microservice.
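
(For context, a minimal sketch of what the client side of this local-socket exchange could look like in Go; the request fields mirror the PowerRequest struct shown in this PR, while the core_power JSON tag and the response shape, a plain JSON array of per-pod power values, are assumptions for illustration:)

package main

import (
	"encoding/json"
	"fmt"
	"net"
)

// PowerRequest mirrors the struct in this PR; the core_power tag is assumed.
type PowerRequest struct {
	ModelName string      `json:"model_name"`
	XCols     []string    `json:"x_cols"`
	XValues   [][]float32 `json:"x_values"`
	CorePower []float32   `json:"core_power"`
}

func main() {
	// Connect to the estimator's Unix domain socket.
	conn, err := net.Dial("unix", "/tmp/estimator.sock")
	if err != nil {
		fmt.Println("dial:", err)
		return
	}
	defer conn.Close()

	req := PowerRequest{
		XCols:     []string{"cpu_cycles", "cache_miss"},
		XValues:   [][]float32{{100, 1}, {50, 0}, {10, 0}},
		CorePower: []float32{20, 30},
	}
	// Send the request as JSON and read back the per-pod power values.
	if err := json.NewEncoder(conn).Encode(req); err != nil {
		fmt.Println("encode:", err)
		return
	}
	var powers []float32
	if err := json.NewDecoder(conn).Decode(&powers); err != nil {
		fmt.Println("decode:", err)
		return
	}
	fmt.Println("per-pod power:", powers)
}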

@marceloamaral (Collaborator) commented

If we need to use Python, I will strongly argue for running it as a separate service.

Otherwise, we could consider using a machine learning library in Golang:
https://upstack.co/knowledge/golang-machine-learning

Review thread on func (e *Estimator) GetPower(modelName string, xCols []string, xValues [][]float32, corePower, dramPower, gpuPower, otherPower []float32) []float32:

Collaborator: Did not find where this function is called. Will it replace some code in pkg/collector/reader.go?

Collaborator (author): No, I haven't replaced it in reader.go yet.
We need to convert the current PodEnergy to xCols and xValues, and the package power to corePower, dramPower, and so on.

Collaborator: Ahhh ok, this PR is a draft then!

@sunya-ch (Collaborator, author) commented Aug 5, 2022

If we need to use Python, I will strongly argue for running it as a separate service.

Otherwise, we could consider using a machine learning library in Golang: https://upstack.co/knowledge/golang-machine-learning

I think at this step a Unix domain socket should be fine, because it only applies the trained method (no training or anything fancy beyond reading the trained weights and doing multiplications on the read data).
Migrating to Golang is a big task, and I don't think it would have that much impact. We can put it in the future work once everything is settled.

I will evaluate the end-to-end power estimation time per tick.
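
(As a hypothetical illustration of "reading the trained weights and doing multiplications": for a ratio/linear-style model, the per-pod estimate reduces to a weighted sum over that pod's feature values. The function below is a sketch, not code from this PR.)

// applyLinearModel illustrates applying a trained linear model to one pod:
// a weighted sum of the pod's feature values.
func applyLinearModel(weights, features []float32) float32 {
	var power float32
	for i, w := range weights {
		power += w * features[i]
	}
	return power
}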

@rootfs (Contributor) commented Aug 5, 2022

If we need to use Python, I will strongly argue for running it as a separate service.

Otherwise, we could consider using a machine learning library in Golang: https://upstack.co/knowledge/golang-machine-learning

+1 to that.

The estimator Python code could have its own repo and run as a sidecar, so we don't have to upgrade the Kepler container image whenever the estimator changes.

@sunya-ch force-pushed the model branch 2 times, most recently from 9610edd to 883ba9c on August 8, 2022.
@sunya-ch (Collaborator, author) commented Aug 8, 2022

I amended the commit by:

  • separating the estimator so that a separate Docker image can be built and run as a sidecar container.
    To build the image, run
    make build-estimator
    The new image will be quay.io/sustainable_computing_io/kepler-estimator:latest
  • renaming the xCols and xValues variables.

TO-DO:

  • measure overhead
  • update reader.go to use GetPower
  • add sidecar container to kepler deployment

@rootfs (Contributor) commented Aug 8, 2022

Sounds good. I just created a repo there for your next push: quay.io/sustainable_computing_io/kepler-estimator


Review thread on PowerRequest (after renaming):

type PowerRequest struct {
	ModelName   string   `json:"model_name"`
	MetricNames []string `json:"metrics"`

Contributor: please run gofmt

@sunya-ch (Collaborator, author) commented

These are the results of testing with the number of pods varied from 10 to 100 (since the maximum number of pods per worker node is about 100).

  • general usage from pidstat with a 1-second interval of GetPower requests
    • 0.04% MEM
    • VSZ ≈ 3.8 GB
    • RSS ≈ 0.4 GB (396,764 KB)
# Time        UID       PID    %usr %system  %guest   %wait    %CPU   CPU  minflt/s  majflt/s     VSZ     RSS   %MEM  Command
21:41:27        0   2035757    0.00    0.00    0.00    0.00    0.00    22      0.09      0.00 3878724  396764   0.04  python
21:41:28        0   2035757    0.00    0.00    0.00    0.00    0.00    22      0.09      0.00 3878724  396764   0.04  python
21:41:29        0   2035757    0.00    0.00    0.00    0.00    0.00    22      0.09      0.00 3878724  396764   0.04  python
21:41:30        0   2035757    0.00    0.00    0.00    0.00    0.00    22      0.09      0.00 3878724  396764   0.04  python
21:41:31        0   2035757    0.00    0.00    0.00    0.00    0.00    22      0.09      0.00 3878724  396764   0.04  python
  • elapsed time to handle a request, observed from estimator_client.py: ~0.003 s for the scikit-learn model, ~0.020 s for the ratio model
    • the ratio model invokes roughly 10x more function calls than applying the trained model (I believe it could be further optimized)
      [screenshot: function-call profile]
  • profiled time (with Python cProfile) from estimator_test.py
    • handling a request takes an additional ~0.001 s for finding a specific model by name and ~0.001 s for common tasks
    • all trained models take almost the same elapsed time to get power, about 0.001 s
      [screenshots: cProfile output]

summary

  • with a 1 s request interval, the estimator adds no significant CPU overhead; for memory, VSZ ≈ 3.8 GB and RSS ≈ 0.4 GB
  • common communication latency is about 0.001-0.002 s, and the corresponding model-lookup latency is about 0.001 s
  • computation latency is significantly affected by the model function; the general regressor model from scikit-learn takes about 0.001 s

@rootfs what do you think?

Signed-off-by: Sunyanan Choochotkaew <sunyanan.choochotkaew1@ibm.com>
@rootfs (Contributor) commented Aug 10, 2022

summary

  • with a 1 s request interval, the estimator adds no significant CPU overhead; for memory, VSZ ≈ 3.8 GB and RSS ≈ 0.4 GB
  • common communication latency is about 0.001-0.002 s, and the corresponding model-lookup latency is about 0.001 s
  • computation latency is significantly affected by the model function; the general regressor model from scikit-learn takes about 0.001 s

@rootfs what do you think?

@sunya-ch thank you for this comprehensive study! These results are worth a doc of their own. Please add them to the PR as well.

looking forward to your full integration

@rootfs (Contributor) commented Aug 31, 2022

The work has moved to kepler-estimator; closing.

@rootfs closed this Aug 31, 2022