panic when using pre-trained model #1476

lianhao · 2024-05-29T02:42:40Z

What happened?

When running Kepler in Kubernetes with a pre-trained model to estimate process power, the Kepler pod panics right after launch.

The models were trained by following the Kepler model server Tekton training process, using the complete run.

The Kepler container goes into an error state right after it starts:

<omit>
I0529 02:05:36.588422  690634 exporter.go:175] starting to listen on 0.0.0.0:9102
I0529 02:05:36.588445  690634 exporter.go:181] Started Kepler in 2.243991957s
I0529 02:05:39.594488  690634 exporter.go:457] successfully get data with batch get and delete with 700 pids in 3.298332ms
I0529 02:05:39.914526  690634 estimate.go:139] estimator unmarshal error: json: cannot unmarshal array into Go struct field ComponentPowerResponse.powers of type map[string][]float64 ({"powers": [], "msg": "\"None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]\"\n"})
I0529 02:05:39.914657  690634 process_energy.go:210] Could not estimate the Process Platform Power
panic: runtime error: index out of range [0] with length 0

goroutine 33 [running]:
github.com/sustainable-computing-io/kepler/pkg/model.addEstimatedEnergy({0xc000746400, 0x3d, 0xc00075c820?}, 0x0?, 0x1)
        /workspace/pkg/model/process_energy.go:219 +0xbf0
github.com/sustainable-computing-io/kepler/pkg/model.UpdateProcessEnergy(0xc0005d4000?, 0xc000b88660?)
        /workspace/pkg/model/process_energy.go:145 +0x145
github.com/sustainable-computing-io/kepler/pkg/collector/energy.UpdateProcessEnergy(...)
        /workspace/pkg/collector/energy/process_energy_collector.go:26
github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).UpdateProcessEnergyUtilizationMetrics(...)
        /workspace/pkg/collector/metric_collector.go:152
github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).UpdateEnergyUtilizationMetrics(0xc0005d4000)
        /workspace/pkg/collector/metric_collector.go:139 +0x2a
github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).Update(0xb2d05e00?)
        /workspace/pkg/collector/metric_collector.go:113 +0x65
github.com/sustainable-computing-io/kepler/pkg/manager.(*CollectorManager).Start.func1()
        /workspace/pkg/manager/manager.go:75 +0x7b
created by github.com/sustainable-computing-io/kepler/pkg/manager.(*CollectorManager).Start in goroutine 1
        /workspace/pkg/manager/manager.go:67 +0x65
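
The two log lines before the panic suggest the failure chain: the estimator replied with `"powers": []` (a JSON array), which cannot be decoded into a `map[string][]float64`, and the caller then indexed an empty result. The following is a minimal sketch of that chain with a length guard, not Kepler's actual code. The struct only mirrors the field named in the error message, and `decodePowers` and `firstPower` are hypothetical helper names introduced for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// componentPowerResponse mirrors the shape named in the log's error message
// (a "powers" map of float slices). Assumption: the real Kepler struct may differ.
type componentPowerResponse struct {
	Powers map[string][]float64 `json:"powers"`
	Msg    string               `json:"msg"`
}

// decodePowers (hypothetical helper) decodes an estimator reply and returns
// an error instead of a partially filled result when the shape is wrong.
func decodePowers(body []byte) (map[string][]float64, error) {
	var resp componentPowerResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("estimator unmarshal error: %w", err)
	}
	return resp.Powers, nil
}

// firstPower (hypothetical helper) returns the first value only if one exists,
// avoiding "index out of range [0] with length 0".
func firstPower(powers []float64) (float64, bool) {
	if len(powers) == 0 {
		return 0, false
	}
	return powers[0], true
}

func main() {
	// Reply body copied from the log above: "powers" is an empty array, not a map.
	body := []byte(`{"powers": [], "msg": "\"None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]\"\n"}`)

	powers, err := decodePowers(body)
	if err != nil {
		fmt.Println("decode failed:", err)
	}

	// Looking up a key in a nil map yields a nil slice; the guard reports "no value".
	// The key "package" is only an example component name.
	if _, ok := firstPower(powers["package"]); !ok {
		fmt.Println("no estimated power available, skipping update")
	}
}
```

With a guard like this, a failed estimation would be skipped for that interval rather than crashing the exporter.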

The kepler-estimator container logs errors as well:

<omit>
failed to get model from request {"metrics":["bpf_page_cache_hit","task_clock_ms","bpf_cpu_time_ms","bpf_net_tx_irq","bpf_net_rx_irq","bpf_block_irq","cpu_cycles","cpu_instructions","cache_miss"],"values":[[0,0,0,0,0,0,0,0,0]],"output_type":"DynPower","source":"acpi","system_features":["cpu_architecture"],"system_values":["Skylake"],"trainer_name":"GradientBoostingRegressorTrainer","filter":""}
get archived model
failed to get model from request {"metrics":["bpf_page_cache_hit","task_clock_ms","bpf_cpu_time_ms","bpf_net_tx_irq","bpf_net_rx_irq","bpf_block_irq","cpu_cycles","cpu_instructions","cache_miss"],"values":[[0,0,0,0,0,0,0,0,0]],"output_type":"DynPower","source":"intel_rapl","system_features":["cpu_architecture"],"system_values":["Skylake"],"trainer_name":"GradientBoostingRegressorTrainer","filter":""}
get archived model
failed to get model from request {"metrics":["bpf_page_cache_hit","task_clock_ms","bpf_cpu_time_ms","bpf_net_tx_irq","bpf_net_rx_irq","bpf_block_irq","cpu_cycles","cpu_instructions","cache_miss"],"values":[[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0]],"output_type":"DynPower","source":"intel_rapl","system_features":["cpu_architecture"],"system_values":["Skylake"],"trainer_name":"GradientBoostingRegressorTrainer","filter":""}
GradientBoostingRegressorTrainer_1 fail to predict, removed: "None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]"
<omit>

The complete kepler log can be found here:
kepler.log
The complete kepler-estimator log can be found here:
kepler-estimator.log

What did you expect to happen?

Kepler should run without panicking.

How can we reproduce it (as minimally and precisely as possible)?

Run Kepler with the deployment configuration below.

Anything else we need to know?

No response

Kepler image tag

kepler: quay.io/sustainable_computing_io/kepler:latest

estimator: quay.io/sustainable_computing_io/kepler_model_server:latest

Kubernetes version

$ kubectl version
Client Version: v1.29.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.2

Cloud provider or bare metal

bare metal

OS version

$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
$ uname -a
Linux onap02 5.15.0-105-generic #115-Ubuntu SMP Mon Apr 15 09:52:04 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Install tools

Using manifest

Kepler deployment config

On Kubernetes:

$ KEPLER_NAMESPACE=kepler

# provide kepler configmap
$ kubectl describe configmap kepler-cfm -n ${KEPLER_NAMESPACE} 
Name:         kepler-cfm
Namespace:    kepler
Labels:       sustainable-computing.io/app=kepler
Annotations:  <none>

Data
====
CGROUP_METRICS:
----
*
EXPOSE_IRQ_COUNTER_METRICS:
----
true
KEPLER_LOG_LEVEL:
----
5
KEPLER_NAMESPACE:
----
kepler
MODEL_CONFIG:
----
PROCESS_COMPONENTS_ESTIMATOR=true
PROCESS_COMPONENTS_INIT_URL=http://onap01.sh.intel.com/kepler_models/CompleteTrainPipelineExample/intel_rapl/DynPower/Basic/GradientBoostingRegressorTrainer_0.zip
PROCESS_COMPONENTS_TRAINER=GradientBoostingRegressorTrainer
PROCESS_TOTAL_ESTIMATOR=true
PROCESS_TOTAL_INIT_URL=http://onap01.sh.intel.com/kepler_models/CompleteTrainPipelineExample/acpi/DynPower/Basic/GradientBoostingRegressorTrainer_0.zip
PROCESS_TOTAL_TRAINER=GradientBoostingRegressorTrainer

PROMETHEUS_SCRAPE_INTERVAL:
----
30s
CPU_ARCH_OVERRIDE:
----

ENABLE_GPU:
----
true
ENABLE_QAT:
----
false
EXPOSE_CGROUP_METRICS:
----
false
EXPOSE_HW_COUNTER_METRICS:
----
true
BIND_ADDRESS:
----
0.0.0.0:9102
ENABLE_PROCESS_METRICS:
----
false
MAX_LOOKUP_RETRY:
----
1
REDFISH_PROBE_INTERVAL_IN_SECONDS:
----
60
ENABLE_EBPF_CGROUPID:
----
true
METRIC_PATH:
----
/metrics
REDFISH_SKIP_SSL_VERIFY:
----
true

BinaryData
====

Events:  <none>


# provide kepler deployment description
$ kubectl describe daemonset kepler-exporter -n ${KEPLER_NAMESPACE} 
Name:           kepler-exporter
Selector:       app.kubernetes.io/component=exporter,app.kubernetes.io/name=kepler-exporter,sustainable-computing.io/app=kepler
Node-Selector:  <none>
Labels:         sustainable-computing.io/app=kepler
Annotations:    deprecated.daemonset.template.generation: 1
Desired Number of Nodes Scheduled: 1
Current Number of Nodes Scheduled: 1
Number of Nodes Scheduled with Up-to-date Pods: 1
Number of Nodes Scheduled with Available Pods: 0
Number of Nodes Misscheduled: 0
Pods Status:  1 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app.kubernetes.io/component=exporter
                    app.kubernetes.io/name=kepler-exporter
                    sustainable-computing.io/app=kepler
  Service Account:  kepler-sa
  Containers:
   kepler-exporter:
    Image:      quay.io/sustainable_computing_io/kepler:latest
    Port:       9102/TCP
    Host Port:  0/TCP
    Command:
      /bin/sh
      -c
    Args:
      until [ -e /tmp/estimator.sock ]; do sleep 1; done && /usr/bin/kepler -v=5
    Requests:
      cpu:     100m
      memory:  400Mi
    Liveness:  http-get http://:9102/healthz delay=10s timeout=10s period=60s #success=1 #failure=5
    Environment:
      NODE_IP:     (v1:status.hostIP)
      NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /etc/kepler/kepler.config from cfm (ro)
      /etc/redfish from redfish (ro)
      /lib/modules from lib-modules (ro)
      /proc from proc (rw)
      /sys from tracing (ro)
      /tmp from tmp (rw)
      /usr/src from usr-src (rw)
      /var/run from var-run (rw)
   estimator:
    Image:      quay.io/sustainable_computing_io/kepler_model_server:latest
    Port:       <none>
    Host Port:  <none>
    Command:
      python3.8
    Args:
      -u
      src/estimate/estimator.py
    Environment:  <none>
    Mounts:
      /etc/kepler/kepler.config from cfm (ro)
      /mnt from mnt (rw)
      /tmp from tmp (rw)
  Volumes:
   tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
   mnt:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
   proc:
    Type:          HostPath (bare host directory volume)
    Path:          /proc-host
    HostPathType:  Directory
   usr-src:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/src
    HostPathType:  Directory
   lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:  Directory
   tracing:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  Directory
   var-run:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run
    HostPathType:  Directory
   cfm:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kepler-cfm
    Optional:  false
   redfish:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  redfish-4kh9d7bc7m
    Optional:    false
Events:
  Type    Reason            Age   From                  Message
  ----    ------            ----  ----                  -------
  Normal  SuccessfulCreate  29m   daemonset-controller  Created pod: kepler-exporter-9wxsp

Container runtime (CRI) and version (if applicable)

containerd://1.7.13

Related plugins (CNI, CSI, ...) and versions (if applicable)

CNI: kindnet

sunya-ch · 2024-05-30T05:10:59Z

It seems the power model was trained with the CPU time metric exported by Kepler before v0.7 (bpf_cpu_time_us), but the estimator is being called by the newer Kepler, which exports bpf_cpu_time_ms instead. You may have to retrain the power model with the new Kepler version.

I0529 02:05:34.744887  690634 utils.go:86] Available ebpf counters: [bpf_page_cache_hit task_clock_ms bpf_cpu_time_ms bpf_net_tx_irq bpf_net_rx_irq bpf_block_irq cpu_cycles cpu_instructions cache_miss]
...
I0529 02:05:39.914526  690634 estimate.go:139] estimator unmarshal error: json: cannot unmarshal array into Go struct field ComponentPowerResponse.powers of type map[string][]float64 ({"powers": [], "msg": "\"None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]\"\n"})
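
One way to catch this kind of mismatch before any estimation call is to compare the features a model expects against the counters the exporter reports. Below is a minimal sketch, assuming the model's feature list is known up front. `missingFeatures` is a hypothetical helper, not part of Kepler, and the single-entry model feature list is an assumption based only on the name in the error message.

```go
package main

import "fmt"

// missingFeatures (hypothetical helper) returns the model features that are
// absent from the metrics the exporter makes available.
func missingFeatures(modelFeatures, available []string) []string {
	have := make(map[string]struct{}, len(available))
	for _, name := range available {
		have[name] = struct{}{}
	}
	var missing []string
	for _, name := range modelFeatures {
		if _, ok := have[name]; !ok {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Counters listed in the "Available ebpf counters" log line above.
	available := []string{
		"bpf_page_cache_hit", "task_clock_ms", "bpf_cpu_time_ms",
		"bpf_net_tx_irq", "bpf_net_rx_irq", "bpf_block_irq",
		"cpu_cycles", "cpu_instructions", "cache_miss",
	}
	// Assumption: the model expects at least this feature, per the error message.
	modelFeatures := []string{"bpf_cpu_time_us"}

	if missing := missingFeatures(modelFeatures, available); len(missing) > 0 {
		fmt.Println("model needs metrics the exporter does not provide:", missing)
	}
}
```

A check like this only reports the mismatch. Since the two metrics use different units, retraining the model as suggested above is still required.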

lianhao added the kind/bug report bug issue label May 29, 2024
