Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change prometheus metrics type from summary to histogram #1202

Merged
merged 1 commit into from Sep 14, 2020

Conversation

dreamtalen
Copy link
Contributor

@dreamtalen dreamtalen commented Sep 2, 2020

Fix #905

The summary types are tagged for deprecation, Kubernetes recommended to use histograms instead of summaries. The main advantages of histogram types are aggregation and inexpensive. In this commit, we changed three Antrea controller metrics from summary to histogram type. They are DurationAppliedToGroupSyncing, DurationAddressGroupSyncing, and DurationInternalNetworkPolicySyncing.

Did the manually test to check the type of these metrics using following commands:

  1. make
  2. ./test/e2e/infra/vagrant/push_antrea.sh --prometheus
  3. kubectl exec -it <antrea_controller_pod> -n kube-system -- bash
  4. TOKEN=$(cat /var/run/antrea/apiserver/loopback-client-token)
  5. curl --insecure --header "Authorization: Bearer $TOKEN" https://127.0.0.1:10349/metrics

Test results:
# HELP antrea_controller_applied_to_group_sync_duration_milliseconds [ALPHA] The duration of syncing applied-to-group
# TYPE antrea_controller_applied_to_group_sync_duration_milliseconds histogram
# HELP antrea_controller_address_group_sync_duration_milliseconds [ALPHA] The duration of syncing address-group
# TYPE antrea_controller_address_group_sync_duration_milliseconds histogram
# HELP antrea_controller_network_policy_sync_duration_milliseconds [ALPHA] The duration of syncing internal-networkpolicy
# TYPE antrea_controller_network_policy_sync_duration_milliseconds histogram

@antrea-bot
Copy link
Collaborator

Thanks for your PR.
Unit tests and code linters are run automatically every time the PR is updated.
E2e, conformance and network policy tests can only be triggered by a member of the vmware-tanzu organization. Regular contributors to the project should join the org.

The following commands are available:

  • /test-e2e: to trigger e2e tests.
  • /skip-e2e: to skip e2e tests.
  • /test-conformance: to trigger conformance tests.
  • /skip-conformance: to skip conformance tests.
  • /test-whole-conformance: to trigger all conformance tests on linux.
  • /skip-whole-conformance: to skip all conformance tests on linux.
  • /test-networkpolicy: to trigger networkpolicy tests.
  • /skip-networkpolicy: to skip networkpolicy tests.
  • /test-windows-conformance: to trigger windows conformance tests.
  • /skip-windows-conformance: to skip windows conformance tests.
  • /test-windows-networkpolicy: to trigger windows networkpolicy tests.
  • /skip-windows-networkpolicy: to skip windows networkpolicy tests.
  • /test-hw-offload: to trigger ovs hardware offload test.
  • /skip-hw-offload: to skip ovs hardware offload test.
  • /test-all: to trigger all tests (except whole conformance).
  • /skip-all: to skip all tests (except whole conformance).

@codecov-commenter
Copy link

codecov-commenter commented Sep 2, 2020

Codecov Report

Merging #1202 into master will increase coverage by 0.19%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1202      +/-   ##
==========================================
+ Coverage   56.06%   56.26%   +0.19%     
==========================================
  Files         108      108              
  Lines       11887    12027     +140     
==========================================
+ Hits         6665     6767     +102     
- Misses       4634     4671      +37     
- Partials      588      589       +1     
Flag Coverage Δ
#integration-tests 46.12% <ø> (-0.21%) ⬇️
#unit-tests 42.51% <ø> (+0.27%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pkg/agent/types/networkpolicy.go 16.66% <0.00%> (-8.34%) ⬇️
pkg/agent/openflow/network_policy.go 73.61% <0.00%> (-0.84%) ⬇️
...ntroller/networkpolicy/networkpolicy_controller.go 72.92% <0.00%> (-0.13%) ⬇️
pkg/agent/openflow/client.go 35.60% <0.00%> (ø)
pkg/agent/openflow/pipeline.go 57.11% <0.00%> (+0.31%) ⬆️
pkg/ovs/openflow/ofctrl_bridge.go 72.24% <0.00%> (+0.66%) ⬆️
pkg/ovs/openflow/ofctrl_action.go 83.79% <0.00%> (+1.58%) ⬆️
pkg/ovs/openflow/ofctrl_builder.go 74.91% <0.00%> (+2.60%) ⬆️

srikartati
srikartati previously approved these changes Sep 2, 2020
Copy link
Member

@srikartati srikartati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use "Fixes: #" in the PR description, so that the issue could be linked and closed automatically when PR is merged.
LGTM otherwise.

This reminds me that we could enhance testing for Prometheus metrics at the controller like agent (#799 ).. possibly using network policy controller unit tests (networkpolicy_controller_test.go). I will update the issue.

antoninbas
antoninbas previously approved these changes Sep 3, 2020
Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but I think the PR description should mention the effect (if any) of changing the metrics type on the consumers of these metrics. If I have metrics enabled and I upgrade my version of Antrea, will there be any issue in the Prometheus server, on in a dashboard like Grafana?

@ksamoray
Copy link
Contributor

ksamoray commented Sep 9, 2020

LGTM but I think the PR description should mention the effect (if any) of changing the metrics type on the consumers of these metrics. If I have metrics enabled and I upgrade my version of Antrea, will there be any issue in the Prometheus server, on in a dashboard like Grafana?

For that purpose, K8S metrics wrapper has deprecation capabilities. We could mark the summary metric as deprecated and add the histogram.
At a given version, dispose the deprecated metrics.
Not sure if we should go through this at this stage before we have a solid install base. Maybe just document this change in the RN.

@srikartati
Copy link
Member

For that purpose, K8S metrics wrapper has deprecation capabilities. We could mark the summary metric as deprecated and add the histogram.
At a given version, dispose the deprecated metrics.
Not sure if we should go through this at this stage before we have a solid install base. Maybe just document this change in the RN.

@ksamoray Yes, you are right. Ideally, we should go through the process of K8s deprecation capabilities. As mentioned by @antoninbas, some consumers may get affected, if they have predefined Grafana dashboards or query scripts. They may need to update their scripts.
Therefore, we posted this change in slack/google groups to notify if anyone is consuming the metric as a Summary metric and they are affected by it. Since we did not get any response, we thought we could go ahead with this change.

What is RN?

@ksamoray
Copy link
Contributor

What is RN?

Release notes

ksamoray
ksamoray previously approved these changes Sep 10, 2020
@srikartati
Copy link
Member

/test-all

@tnqn
Copy link
Member

tnqn commented Sep 10, 2020

/skip-hw-offload as it's not related.

@tnqn
Copy link
Member

tnqn commented Sep 10, 2020

@dreamtalen the commit message is not well wrapped, could you follow https://github.com/golang/go/wiki/CommitMessage to revise?

the text should be wrapped to ~76 characters (to appease git viewing tools, mainly), unless you really need longer lines (e.g. for ASCII art, tables, or long links)

@dreamtalen
Copy link
Contributor Author

@dreamtalen the commit message is not well wrapped, could you follow https://github.com/golang/go/wiki/CommitMessage to revise?

the text should be wrapped to ~76 characters (to appease git viewing tools, mainly), unless you really need longer lines (e.g. for ASCII art, tables, or long links)

Thanks for pointing out. I will revise it.

@tnqn
Copy link
Member

tnqn commented Sep 11, 2020

Sorry for being nit-picking, the commit message seems not wrapped neatly, and a newline is usually inserted between paragraphs.
This is the auto-wrapped message in my editor (I use 72 chars per line):

The summary types are tagged for deprecation, Kubernetes recommended to
use histograms instead of summaries. The main advantages of histogram
types are aggregation and inexpensive.

In this commit, we changed three Antrea controller metrics from summary
to histogram type. They are DurationAppliedToGroupSyncing,
DurationAddressGroupSyncing, and DurationInternalNetworkPolicySyncing.

Fixes #905

The summary types are tagged for deprecation, Kubernetes recommended to
use histograms instead of summaries. The main advantages of histogram
types are aggregation and inexpensive.

In this commit, we changed three Antrea controller metrics from summary
to histogram type. They are DurationAppliedToGroupSyncing,
DurationAddressGroupSyncing, and DurationInternalNetworkPolicySyncing.

Fixes antrea-io#905
@dreamtalen
Copy link
Contributor Author

Sorry for being nit-picking, the commit message seems not wrapped neatly, and a newline is usually inserted between paragraphs.
This is the auto-wrapped message in my editor (I use 72 chars per line):

The summary types are tagged for deprecation, Kubernetes recommended to
use histograms instead of summaries. The main advantages of histogram
types are aggregation and inexpensive.

In this commit, we changed three Antrea controller metrics from summary
to histogram type. They are DurationAppliedToGroupSyncing,
DurationAddressGroupSyncing, and DurationInternalNetworkPolicySyncing.

Fixes #905

Never mind. Thanks for your advice.

@srikartati
Copy link
Member

/test-all

@srikartati
Copy link
Member

/test-windows-conformance

Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, LGTM

@srikartati
Copy link
Member

srikartati commented Sep 14, 2020

/skip-hw-offload as this is not related test. Merging the PR.

@srikartati srikartati merged commit 7d06431 into antrea-io:master Sep 14, 2020
GraysonWu pushed a commit to GraysonWu/antrea that referenced this pull request Sep 22, 2020
)

The summary types are tagged for deprecation, Kubernetes recommended to
use histograms instead of summaries. The main advantages of histogram
types are aggregation and inexpensive.

In this commit, we changed three Antrea controller metrics from summary
to histogram type. They are DurationAppliedToGroupSyncing,
DurationAddressGroupSyncing, and DurationInternalNetworkPolicySyncing.

Fixes antrea-io#905

Co-authored-by: Yongming Ding <dyongming@vmware.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Prometheus metrics: Change summary type metrics to histogram type metrics.
8 participants