Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deprecate prometheus run time info metrics #1217

Merged
merged 1 commit into from
Sep 18, 2020

Conversation

srikartati
Copy link
Member

@srikartati srikartati commented Sep 8, 2020

Deprecate run time info metrics that provide node name and pod name
from Antrea 0.10.0. These metrics will be removed Antrea 0.11.0. Added
deprecation policy for the Prometheus metric in Antrea documentation.

In addition, antrea-prometheus.yaml is changed to add nodename and
pod name to the instance label instead of the default IP:port. This will help
with promql queries to search for metrics on a specific node or pod.

@antrea-bot
Copy link
Collaborator

Thanks for your PR.
Unit tests and code linters are run automatically every time the PR is updated.
E2e, conformance and network policy tests can only be triggered by a member of the vmware-tanzu organization. Regular contributors to the project should join the org.

The following commands are available:

  • /test-e2e: to trigger e2e tests.
  • /skip-e2e: to skip e2e tests.
  • /test-conformance: to trigger conformance tests.
  • /skip-conformance: to skip conformance tests.
  • /test-whole-conformance: to trigger all conformance tests on linux.
  • /skip-whole-conformance: to skip all conformance tests on linux.
  • /test-networkpolicy: to trigger networkpolicy tests.
  • /skip-networkpolicy: to skip networkpolicy tests.
  • /test-windows-conformance: to trigger windows conformance tests.
  • /skip-windows-conformance: to skip windows conformance tests.
  • /test-windows-networkpolicy: to trigger windows networkpolicy tests.
  • /skip-windows-networkpolicy: to skip windows networkpolicy tests.
  • /test-hw-offload: to trigger ovs hardware offload test.
  • /skip-hw-offload: to skip ovs hardware offload test.
  • /test-all: to trigger all tests (except whole conformance).
  • /skip-all: to skip all tests (except whole conformance).

@codecov-commenter
Copy link

codecov-commenter commented Sep 8, 2020

Codecov Report

Merging #1217 into master will decrease coverage by 1.31%.
The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1217      +/-   ##
==========================================
- Coverage   56.29%   54.97%   -1.32%     
==========================================
  Files         109      110       +1     
  Lines       12045    10573    -1472     
==========================================
- Hits         6781     5813     -968     
+ Misses       4673     4186     -487     
+ Partials      591      574      -17     
Flag Coverage Δ
#integration-tests 44.92% <0.00%> (-1.19%) ⬇️
#unit-tests 41.80% <ø> (-0.73%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pkg/agent/metrics/prometheus.go 35.89% <0.00%> (-3.64%) ⬇️
pkg/k8s/name.go 33.33% <0.00%> (-16.67%) ⬇️
pkg/apis/controlplane/v1beta1/conversion.go 30.00% <0.00%> (-12.86%) ⬇️
pkg/agent/util/ipset/ipset.go 63.63% <0.00%> (-6.74%) ⬇️
pkg/agent/openflow/cookie/allocator.go 64.51% <0.00%> (-6.22%) ⬇️
pkg/util/env/env.go 42.30% <0.00%> (-6.08%) ⬇️
pkg/controller/networkpolicy/store/util.go 63.63% <0.00%> (-5.60%) ⬇️
pkg/agent/types/networkpolicy.go 11.11% <0.00%> (-5.56%) ⬇️
pkg/ovs/openflow/ofctrl_bridge.go 67.98% <0.00%> (-4.93%) ⬇️
pkg/antctl/command_list.go 32.69% <0.00%> (-4.81%) ⬇️
... and 101 more

ksamoray
ksamoray previously approved these changes Sep 9, 2020
@ksamoray
Copy link
Contributor

ksamoray commented Sep 9, 2020

I don't believe that this is relevant here as I doubt that anyone uses these metrics, but maybe it's worthwhile to conclude a deprecation policy for metrics which we change or drop.
K8S metrics wrapper has this capability.

@srikartati
Copy link
Member Author

I don't believe that this is relevant here as I doubt that anyone uses these metrics, but maybe it's worthwhile to conclude a deprecation policy for metrics which we change or drop.
K8S metrics wrapper has this capability.

My understanding of the deprecated version is to have the future Antrea version (+1 to current version) as deprecated version. This serves as a notice to the consumers of the current version. And finally we could remove the metric in the future Antrea version. Does this process sound good?

@antoninbas
Copy link
Contributor

I don't believe that this is relevant here as I doubt that anyone uses these metrics, but maybe it's worthwhile to conclude a deprecation policy for metrics which we change or drop.
K8S metrics wrapper has this capability.

My understanding of the deprecated version is to have the future Antrea version (+1 to current version) as deprecated version. This serves as a notice to the consumers of the current version. And finally we could remove the metric in the future Antrea version. Does this process sound good?

@srikartati could you clarify what you mean here?
ideally, if a metric is present in 0.N.0 and needs to be removed, it would be tagged as deprecated for at least one minor version - e.g. 0.N+1.0 -, and removed after that - e.g. 0.N+2.0.
unfortunately we cannot follow this model for #1202, since we are changing the metric's type (and I guess we do not want to change the metric's name) but maybe for this PR we can follow a more rigorous process as suggested by @ksamoray. IIRC, for #1202 @srikartati posted a notice on Slack & mailing lists a while back.

@ksamoray
Copy link
Contributor

As @srikartati noted, we remove these metrics as their usability is questionable (I'm generous here...). I would remove them without going through deprecation process. As for #1202, maybe we should document the change somewhere beyond slack (e.g release notes document if we have any). I agree that retaining the metrics name is valuable though.

@antoninbas
Copy link
Contributor

It will be documented in the release notes no matter what. The advance notice on slack / email groups was to check if anyone had an objection.

antoninbas
antoninbas previously approved these changes Sep 10, 2020
Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

tnqn
tnqn previously approved these changes Sep 11, 2020
Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@srikartati
Copy link
Member Author

@srikartati could you clarify what you mean here?
ideally, if a metric is present in 0.N.0 and needs to be removed, it would be tagged as deprecated for at least one minor version - e.g. 0.N+1.0 -, and removed after that - e.g. 0.N+2.0.

@antoninbas In high-level, I meant the same but was not sure which release we should remove the metric after it is being deprecated.
I dug into this a bit more on what K8s deprecation policy is. They seem to have two different policies one for all metrics in kubernetes/kubernetes repository and one specific to Kubernetes control plane metrics. The latter seems to be well defined and there is some instrumentation in code to implement this, but it is dependent on the Kubernetes version.

I believe we should define a simple custom deprecation policy for Antrea metrics and not follow the Kubernetes metrics specifically the hidden metrics convention. For this purpose, we could employ the policy you suggested and also follow the guidelines here w.r.t name and help text. In addition, we should describe this policy in the Antrea's Prometheus documentation. What do you think?

@antoninbas
Copy link
Contributor

@srikartati sounds good to me. Feel free to submit a documentation patch for this and to link to / borrow from the K8s documentation, while adjusting / simplifying a few things as needed.

@srikartati srikartati dismissed stale reviews from tnqn, antoninbas, and ksamoray via e2a16be September 11, 2020 22:12
@srikartati srikartati force-pushed the prom_metric_deletion branch 2 times, most recently from e2a16be to fbd01af Compare September 11, 2020 22:29
@srikartati srikartati changed the title Delete prometheus run time info metrics Deprecate prometheus run time info metrics Sep 11, 2020
@@ -69,7 +69,7 @@ rules:

### Antrea Components Scraping configuration
Add the following jobs to Prometheus scraping configuration to enable metrics
collection from Antrea components
collection from Antrea components. Antrea Agent metrics endpoint is exposed through Antrea apiserver on `apiport` config parameter given in `antrea-agent.conf` (default value is 10350). Antrea Controller metrics endpoint is exposed through Antrea apiserver on `apiport` config parameter given in `antrea-controller.conf` (default value is 10349).
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made this change as there were some questions on what ports Prometheus metrics are exposed recently when integrating with Wavefront solution.

Name: "antrea_agent_runtime_info",
Help: "Antrea agent runtime info , defined as labels. The value of the gauge is always set to 1.",

deprecatedGaugeHost := metrics.NewGauge(&metrics.GaugeOpts{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would leave gauge names as they are - as renaming just adds plenty of unnecessary diffs and merge efforts later on.
Just add maybe DeprecatedVersion: "x.x" to the gauge defs, e.g
metrics.GaugeOpts{
Name: "antrea_agent_runtime_info",
Help: "...",
ConstLabels: "..."
StabilityLevel: metrics.STABLE,
DeprecationVersion: "x.x"}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DeprecatedVersion is to implement the deprecation lifecycle for Kubernetes control plane metrics as given here.
The instrumented version is Kubernetes version. We could employ a simpler policy as mentioned in #1217 (comment)

I would like to not add this field to avoid the confusion of the deprecation policy and version numbers.

Can you elaborate on merging difficulties with the change in the variable name? IMO, this serves as a quick indicator in source code that these metrics have to be removed in the future.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @srikartati,
See deprecation example here.
BTW that metric has been actually removed since then :)

In the above example given by you, there is a change in the variable name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @srikartati, DeprecatedVersion should expose deprecation to users (AFAIK k8s metrics will note that in the metric help) while renaming variables is only visible to developers. I'm not sure that this is enough.
Not really sure which confusion you're trying to avoid by not using it.
Merging wise - that change will popup as a diff on every branch comparison...
My vote is still to remove these metrics immediately, but if we decide to deprecate in a later time, lets use the best tools for this purpose.

@@ -78,18 +78,19 @@ func InitializePrometheusMetrics() {
}

klog.Info("Initializing prometheus metrics")
gaugeHost := metrics.NewGauge(&metrics.GaugeOpts{
deprecatedGaugeHost := metrics.NewGauge(&metrics.GaugeOpts{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

@ksamoray
Copy link
Contributor

Hi @srikartati,
See deprecation example here.
BTW that metric has been actually removed since then :)

Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that there is disagreement on what to do here?

  • It makes sense to me to rename the variable with a "deprecated" prefix
  • It also makes sense to use the DeprecatedVersion field. If we do that, I think we can just adhere to the K8s metrics deprecation lifecycle. Since this was tagged as a STABLE metric, we can do the following for these metrics: Deprecated (0.10.0) -> Hidden (0.11.0) -> Deleted (0.12.0). I believe it is better than removing them altogether in 0.10.0 as it sets a better precedent.

Comment on lines 117 to 119
# Deprecation Policy

In Antrea, Prometheus metric deprecation policy is to flag them deprecated in `Antrea 0.N.0` by adding `(Deprecated since Antrea 0.N.0)` in the front of metric's help text. The release target of removing the deprecated metrics would be `Antrea 0.N+1.0`. We follow the policy given in the [Kubernetes Metrics Overhaul](https://github.com/kubernetes/enhancements/blob/master/keps/sig-instrumentation/20181106-kubernetes-metrics-overhaul.md#deprecation-plan) document.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's not make this part of this PR, I don't want it to go unnoticed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this.

@antoninbas
Copy link
Contributor

@srikartati pointed out to me that "hidden" metrics is a k8s concept and not a Prometheus concept. To implement this, we would need to expose the --show-hidden-metrics-for-version flag (kubernetes/kubernetes#85270), just like the k8s binaries. We can look into whether we think this is something worth doing (for the sake of having the same deprecation policy as k8s). I don't think we need to answer this right now though. We should just set the DeprecatedVersion to 0.10.0 for now and decide what we want to do with these metrics before 0.11.

@ksamoray ksamoray dismissed their stale review September 17, 2020 05:47

OK, let's move on then

@ksamoray
Copy link
Contributor

/test-all

@srikartati
Copy link
Member Author

@srikartati pointed out to me that "hidden" metrics is a k8s concept and not a Prometheus concept. To implement this, we would need to expose the --show-hidden-metrics-for-version flag (kubernetes/kubernetes#85270), just like the k8s binaries. We can look into whether we think this is something worth doing (for the sake of having the same deprecation policy as k8s). I don't think we need to answer this right now though. We should just set the DeprecatedVersion to 0.10.0 for now and decide what we want to do with these metrics before 0.11.

Yes, we need to expose the show hidden flag and add some instrumentation code like K8s in Antrea to implement the deprecation life cycle given below:
Deprecated (0.10.0) -> Hidden (0.11.0) -> Deleted (0.12.0)

Instead, I opted for a simpler deprecation policy provided here in the Kubernetes overhaul metrics documentation:
Deprecated(0.10.0)->Delete(0.11.0)

Agree that we can make this decision on Antrea metrics deprecation policy in 0.11.0 time frame. Set the Deprecated version field with Antrea version for now.

Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, a couple more comments

@@ -69,7 +69,7 @@ rules:

### Antrea Components Scraping configuration
Add the following jobs to Prometheus scraping configuration to enable metrics
collection from Antrea components
collection from Antrea components. Antrea Agent metrics endpoint is exposed through Antrea apiserver on `apiport` config parameter given in `antrea-agent.conf` (default value is 10350). Antrea Controller metrics endpoint is exposed through Antrea apiserver on `apiport` config parameter given in `antrea-controller.conf` (default value is 10349).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: line not wrapped

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

StabilityLevel: metrics.STABLE,
deprecatedGaugeHost := metrics.NewGauge(&metrics.GaugeOpts{
Name: "antrea_agent_runtime_info",
Help: "Antrea agent runtime info (Deprecated since Antrea 0.10.0), defined as labels. The value of the gauge is always set to 1.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my understanding was that by setting DeprecatedVersion, K8s would automatically add the deprecation information to the help message. Is it not the case after all? @ksamoray @srikartati

Copy link
Member Author

@srikartati srikartati Sep 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@antoninbas It only happens if metric is initialized through following function:

// Create forces the initialization of metric which has been deferred until
// the point at which this method is invoked. This method will determine whether
// the metric is deprecated or hidden, no-opting if the metric should be considered
// hidden. Furthermore, this function no-opts and returns true if metric is already
// created.
func (r *lazyMetric) Create(version *semver.Version) bool {
	if version != nil {
		r.determineDeprecationStatus(*version)
	}
	// let's not create if this metric is slated to be hidden
	if r.IsHidden() {
		return false
	}
	r.createOnce.Do(func() {
		r.createLock.Lock()
		defer r.createLock.Unlock()
		r.isCreated = true
		if r.IsDeprecated() {
			r.self.initializeDeprecatedMetric()
		} else {
			r.self.initializeMetric()
		}
	})
	return r.IsCreated()
}

Above function is called through only below initialization function:

func NewGaugeFunc(opts GaugeOpts, function func() float64)
go/pkg/mod/k8s.io/component-base@v0.18.4/metrics/gauge.go:189

In K8s 0.18.4 version, this initialization is provided only for Guage type and not for all types of metrics. In addition, most guage type metrics in K8s seemed to be using NewGuage which doesn't support this instrumentation yet. So we need to manually add in "HELP" statement.
May be this will be improved in future K8s version, but this instrumentation code works only for K8s version.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks good to know. Let's see how this evolves in the future.

Deprecate run time info metrics that provide node name and pod name
from Antrea 0.10.0. These metrics will be removed Antrea 0.11.0.

In addition, antrea-prometheus.yaml is changed to add nodename and
pod name to instance label instead of default IP:port. This will help
with promql queries to search for metrics on a specific node or pod.
Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@srikartati
Copy link
Member Author

/test-all

@srikartati
Copy link
Member Author

/skip-hw-offload

@srikartati srikartati merged commit 03bcb94 into antrea-io:master Sep 18, 2020
@srikartati
Copy link
Member Author

As this PR is about deprecating metrics, code coverage is not applicable. Merged this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants