
refactored updateConnectionMetricCache to report connection-metrics correctly #1377


Merged: 1 commit merged into kmesh-net:main from issue1370 on May 19, 2025

Conversation

yp969803
Contributor

@yp969803 yp969803 commented May 8, 2025

What type of PR is this?
/kind bug

What this PR does / why we need it:
refactored updateConnectionMetricCache to report connection-metrics correctly
Which issue(s) this PR fixes:
Fixes #1370

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

correct long_conn metrics 

@Copilot Copilot AI review requested due to automatic review settings May 8, 2025 12:21
@kmesh-bot kmesh-bot added the kind/bug Something isn't working label May 8, 2025

@Copilot Copilot AI left a comment


Pull Request Overview

This PR refactors how connection metrics are reported to ensure the connection metric cache is updated correctly. Key changes include:

  • Removing the tcpConns parameter from updateConnectionMetricCache and its tests.
  • Updating updateConnectionMetricCache to use requestMetric data directly for computing metric differences (see the sketch after this list).
  • Changing the access log condition to skip logging based on totalReports instead of a duration threshold.
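For orientation, the resulting function shape is sketched below. This is a minimal sketch distilled from the snippet the author posts later in this conversation, not the full diff; the body is summarized in comments.

// Refactored entry point: tcpConns is gone, and only the per-report
// requestMetric plus the connection labels are needed.
func (m *MetricController) updateConnectionMetricCache(reqMetric requestMetric, labels connectionMetricLabels) {
	// accumulate reqMetric's delta bytes, packet loss, and retransmits into
	// m.connectionMetricCache[labels]; queue the labels for deletion when
	// reqMetric.state == TCP_CLOSED
}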

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

pkg/controller/telemetry/metric_test.go: Removed the tcpConns field from the test struct and updated the call to updateConnectionMetricCache.
pkg/controller/telemetry/metric.go: Refactored updateConnectionMetricCache to remove the tcpConns parameter and the duration threshold check, using data fields instead.
pkg/controller/telemetry/accesslog.go: Modified the condition for skipping access log output from a duration threshold to checking totalReports.

@yp969803
Contributor Author

yp969803 commented May 8, 2025

@hzxuzhonghu can you review it?


codecov bot commented May 8, 2025

Codecov Report

Attention: Patch coverage is 83.33333% with 1 line in your changes missing coverage. Please review.

Project coverage is 46.28%. Comparing base (579f10b) to head (663d8a7).
Report is 7 commits behind head on main.

Files with missing lines | Patch % | Lines
pkg/controller/telemetry/metric.go | 83.33% | 1 Missing ⚠️

Files with missing lines | Coverage | Δ
pkg/controller/telemetry/metric.go | 59.64% <83.33%> | (ø)

... and 1 file with indirect coverage changes



Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Member

@hzxuzhonghu hzxuzhonghu left a comment


This does not fix it. Please take a look at #1378.

@yp969803
Contributor Author

yp969803 commented May 9, 2025

I have looked at #1378; this PR will work as a successor to it. It refactors the updateConnectionMetricCache code to report correct connection metrics. @hzxuzhonghu

@hzxuzhonghu
Member

Let me have a look. It would be great if you could describe the bug a little bit.

@@ -508,7 +506,7 @@ func (m *MetricController) Run(ctx context.Context, mapOfTcpInfo *ebpf.Map) {
}

connectionLabels := connectionMetricLabels{}
- if m.EnableConnectionMetric.Load() && data.duration > LONG_CONN_METRIC_THRESHOLD {
+ if m.EnableConnectionMetric.Load() {
Member


Would you refactor a little more to combine it with the check at L522? We have so much duplicate flag checking in this function.

@@ -908,7 +906,7 @@ func (m *MetricController) updateServiceMetricCache(data requestMetric, labels s
}
}

func (m *MetricController) updateConnectionMetricCache(data requestMetric, connData connMetric, labels connectionMetricLabels) {
Member


IIUC, this is just a cleanup, because data contains only the delta sent/received bytes while connData contains the totals accumulated so far.
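To make the delta-versus-cumulative distinction concrete, here is a rough sketch of the two shapes under discussion. The requestMetric field names follow the function body quoted later in this thread; the field types and the connMetric fields are assumptions for illustration, not copied from the repository.

// Per-report delta consumed by updateConnectionMetricCache; field names match
// the snippet quoted later in this PR, types are guesses.
type requestMetric struct {
	sentBytes     uint32
	receivedBytes uint32
	packetLost    uint32
	totalRetrans  uint32
	duration      uint64
	state         uint32
}

// Cumulative counters for the whole lifetime of a connection ("data till now");
// hypothetical field names, shown only for contrast with the delta above.
type connMetric struct {
	totalSentBytes     uint64
	totalReceivedBytes uint64
	totalRetrans       uint64
}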

@yp969803
Contributor Author

Currently busy; I will complete this in 1-2 days.

@hzxuzhonghu
Member

OK, so this may not be backported to release-1.1.

@yp969803
Contributor Author

yp969803 commented May 12, 2025

This is needed in the release for correct connection metrics. Can you please wait 2 days for the release? I have my last exam tomorrow. @hzxuzhonghu

@hzxuzhonghu
Member

After review, I think this is a kind of cleanup. I am not sure what bug it is fixing.

@@ -508,7 +506,7 @@ func (m *MetricController) Run(ctx context.Context, mapOfTcpInfo *ebpf.Map) {
}

connectionLabels := connectionMetricLabels{}
- if m.EnableConnectionMetric.Load() && data.duration > LONG_CONN_METRIC_THRESHOLD {
Member


You mean we should remove data.duration > LONG_CONN_METRIC_THRESHOLD, right?

Contributor Author


I added this const to only create connection metrics for connections that have exceeded the 5 s duration. But this is not necessary, as we are already flushing the metrics every 5 s.

@@ -47,8 +47,6 @@ const (
metricFlushInterval = 5 * time.Second

DEFAULT_UNKNOWN = "-"

- LONG_CONN_METRIC_THRESHOLD = uint64(5 * time.Second)
Member


Why did we add this const before?

Contributor Author


I added this const to only create connection metrics for connections that have exceeded the 5 s duration. But this is not necessary, as we are already flushing the metrics every 5 s.
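For context on "already flushing every 5 s": the controller reports cached metrics on a fixed interval (metricFlushInterval = 5 * time.Second, visible in the diff above). Below is a minimal sketch of that pattern, assuming a ticker-driven loop; it is not a copy of the real Run loop in metric.go.

// Assumed shape of the periodic flush: every metricFlushInterval, whatever has
// accumulated in the caches is pushed to Prometheus.
ticker := time.NewTicker(metricFlushInterval)
defer ticker.Stop()
for {
	select {
	case <-ctx.Done():
		return
	case <-ticker.C:
		m.updatePrometheusMetric()
	}
}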

@yp969803
Contributor Author

@hzxuzhonghu can you review and merge?

@hzxuzhonghu
Member

@yp969803 You can rebase your branch with git pull --rebase upstream main; then no merge commit is included.

@@ -508,7 +506,7 @@ func (m *MetricController) Run(ctx context.Context, mapOfTcpInfo *ebpf.Map) {
}

connectionLabels := connectionMetricLabels{}
- if m.EnableConnectionMetric.Load() && reqMetric.duration > LONG_CONN_METRIC_THRESHOLD {
+ if m.EnableConnectionMetric.Load() {
Member


Shouldn't we only generate connection metrics for connections with duration > 5s?
Otherwise, how do you filter out the connection establishment report?


@hzxuzhonghu
Member

@yp969803 Please explain #1377 (comment)

@yp969803
Contributor Author

if v.Duration > LONG_CONN_METRIC_THRESHOLD {

Filtering long connections at the time of reporting to Prometheus.

@yp969803
Contributor Author

I am keeping the connection-metric information from the start of the connection in the cache, so I can calculate the correct metrics.

@yp969803
Contributor Author

yp969803 commented May 15, 2025

func (m *MetricController) updateConnectionMetricCache(reqMetric requestMetric, labels connectionMetricLabels) {
	v, ok := m.connectionMetricCache[labels]
	if ok {
		// existing connection: accumulate this report's deltas
		v.ConnSentBytes = v.ConnSentBytes + float64(reqMetric.sentBytes)
		v.ConnReceivedBytes = v.ConnReceivedBytes + float64(reqMetric.receivedBytes)
		v.ConnPacketLost = v.ConnPacketLost + float64(reqMetric.packetLost)
		v.ConnTotalRetrans = v.ConnTotalRetrans + float64(reqMetric.totalRetrans)
		v.Duration = reqMetric.duration
	} else {
		// first report for this connection: seed a new cache entry
		newConnectionMetricInfo := connectionMetricInfo{}
		newConnectionMetricInfo.ConnSentBytes = float64(reqMetric.sentBytes)
		newConnectionMetricInfo.ConnReceivedBytes = float64(reqMetric.receivedBytes)
		newConnectionMetricInfo.ConnPacketLost = float64(reqMetric.packetLost)
		newConnectionMetricInfo.ConnTotalRetrans = float64(reqMetric.totalRetrans)
		newConnectionMetricInfo.Duration = reqMetric.duration
		m.connectionMetricCache[labels] = &newConnectionMetricInfo
	}
	if reqMetric.state == TCP_CLOSED {
		// closed connections are queued so their series can be deleted from Prometheus
		deleteLock.Lock()
		deleteConnection = append(deleteConnection, &labels)
		deleteLock.Unlock()
	}
}

We can only use the delta of connection metrics here; once a connection passes the 5 s duration, we have to report the metrics accumulated up to that point.

@yp969803
Contributor Author

I have come up with this approach:

func (m *MetricController) updatePrometheusMetric() {
	m.mutex.Lock()
	workloadInfoCache := m.workloadMetricCache
	serviceInfoCache := m.serviceMetricCache
	connectionInfoCache := m.connectionMetricCache
	m.workloadMetricCache = map[workloadMetricLabels]*workloadMetricInfo{}
	m.serviceMetricCache = map[serviceMetricLabels]*serviceMetricInfo{}
	m.connectionMetricCache = map[connectionMetricLabels]*connectionMetricInfo{}
	m.mutex.Unlock()
	// ... (workload and service metric reporting elided) ...

	for k, v := range connectionInfoCache {
		// push connection metrics to prometheus if the connection is long-lived
		if v.Duration > LONG_CONN_METRIC_THRESHOLD {
			connectionLabels := struct2map(k)
			tcpConnectionTotalSendBytes.With(connectionLabels).Add(v.ConnSentBytes)
			tcpConnectionTotalReceivedBytes.With(connectionLabels).Add(v.ConnReceivedBytes)
			tcpConnectionTotalPacketLost.With(connectionLabels).Add(v.ConnPacketLost)
			tcpConnectionTotalRetrans.With(connectionLabels).Add(v.ConnTotalRetrans)
		} else {
			m.connectionMetricCache[k] = v
		}
	}

	// delete metrics
	deleteLock.Lock()
	// Creating a copy reduces the amount of time spent adding locks in the programme
	workloadReplica := deleteWorkload
	deleteWorkload = nil
	serviceReplica := deleteService
	deleteService = nil
	connReplica := deleteConnection
	deleteConnection = []*connectionMetricLabels{}
	deleteLock.Unlock()

	for i := 0; i < len(workloadReplica); i++ {
		deleteWorkloadMetricInPrometheus(workloadReplica[i])
	}
	for i := 0; i < len(serviceReplica); i++ {
		deleteServiceMetricInPrometheus(serviceReplica[i])
	}
	for i := 0; i < len(connReplica); i++ {
		deleteConnectionMetricInPrometheus(connReplica[i])
	}
}
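A worked example of the flow above, with illustrative numbers: suppose a connection has transferred 800 bytes when the first flush fires at 4 s of connection lifetime. Its cache entry has Duration below LONG_CONN_METRIC_THRESHOLD, so the entry is carried over into the fresh cache instead of being reported. By the next flush the connection has lived 9 s and its entry has accumulated 1800 bytes, so the whole accumulated value is Add()ed to the Prometheus counters and the entry is dropped. From then on, each flush adds only the delta accumulated since the previous report, so the counters stay consistent with the per-interval reports.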

@yp969803
Contributor Author

@hzxuzhonghu did you understand?

@hzxuzhonghu
Member

No.

I do think if m.EnableConnectionMetric.Load() && reqMetric.duration > LONG_CONN_METRIC_THRESHOLD { is right.

At connection establishment we do not store the tcp connection, so it does not matter much.

@yp969803
Contributor Author

yp969803 commented May 15, 2025

I think you are correct, it will not matter much. I will refactor it.

@yp969803 yp969803 force-pushed the issue1370 branch 2 times, most recently from b3d0ff2 to 76fa21a Compare May 15, 2025 02:42
@yp969803
Contributor Author

@hzxuzhonghu can you look now? I have refactored it.

Member

@hzxuzhonghu hzxuzhonghu left a comment


/lgtm
/approve

@kmesh-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hzxuzhonghu


Signed-off-by: Yash Patel <yp969803@gmail.com>

rfac: TestMetricController_updatePrometheusMetric

Signed-off-by: Yash Patel <yp969803@gmail.com>
@yp969803
Contributor Author

@hzxuzhonghu waiting for merge

@hzxuzhonghu
Member

/lgtm

@kmesh-bot kmesh-bot added the lgtm label May 16, 2025
@kmesh-bot kmesh-bot merged commit e56e0ac into kmesh-net:main May 19, 2025
11 checks passed

Successfully merging this pull request may close these issues.

Metric sent_bytes received_bytes not consistent with accesslog