
refactored updateConnectionMetricCache to report connection-metrics correctly #1377


Merged: 1 commit merged into kmesh-net:main from issue1370 on May 19, 2025

Conversation

yp969803
Contributor

@yp969803 yp969803 commented May 8, 2025

What type of PR is this?
/kind bug

What this PR does / why we need it:
refactored updateConnectionMetricCache to report connection-metrics correctly
Which issue(s) this PR fixes:
Fixes #1370

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

correct long_conn metrics 

@Copilot Copilot AI review requested due to automatic review settings May 8, 2025 12:21
@kmesh-bot kmesh-bot added the kind/bug Something isn't working label May 8, 2025

@Copilot Copilot AI left a comment


Pull Request Overview

This PR refactors how connection metrics are reported to ensure the connection metric cache is updated correctly. Key changes include:

  • Removing the tcpConns parameter from updateConnectionMetricCache and its tests.
  • Updating updateConnectionMetricCache to use requestMetric data directly for computing metric differences (see the sketch after this list).
  • Changing the access log condition to skip logging based on totalReports instead of a duration threshold.
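For orientation, the resulting function shape is sketched below. This is a minimal sketch distilled from the snippet the author posts later in this conversation, not the full diff; the body is summarized in comments.

// Refactored entry point: tcpConns is gone, and only the per-report
// requestMetric plus the connection labels are needed.
func (m *MetricController) updateConnectionMetricCache(reqMetric requestMetric, labels connectionMetricLabels) {
	// accumulate reqMetric's delta bytes, packet loss, and retransmits into
	// m.connectionMetricCache[labels]; queue the labels for deletion when
	// reqMetric.state == TCP_CLOSED
}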

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

pkg/controller/telemetry/metric_test.go: Removed the tcpConns field from the test struct and updated the call to updateConnectionMetricCache.
pkg/controller/telemetry/metric.go: Refactored updateConnectionMetricCache to remove the tcpConns parameter and the duration threshold check, using data fields instead.
pkg/controller/telemetry/accesslog.go: Modified the condition for skipping access log output from a duration threshold to checking totalReports.

@yp969803
Contributor Author

yp969803 commented May 8, 2025

@hzxuzhonghu can you review it?


codecov bot commented May 8, 2025

Codecov Report

Attention: Patch coverage is 83.33333% with 1 line in your changes missing coverage. Please review.

Project coverage is 46.28%. Comparing base (579f10b) to head (663d8a7).
Report is 7 commits behind head on main.

Files with missing lines | Patch % | Lines
pkg/controller/telemetry/metric.go | 83.33% | 1 Missing ⚠️

Files with missing lines | Coverage | Δ
pkg/controller/telemetry/metric.go | 59.64% <83.33%> | (ø)

... and 1 file with indirect coverage changes



Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Member

@hzxuzhonghu hzxuzhonghu left a comment


This does not fix it. Please take a look at #1378.

@yp969803
Contributor Author

yp969803 commented May 9, 2025

I have looked at #1378; this PR will work as a successor to it. It refactors the updateConnectionMetricCache code to report correct connection metrics. @hzxuzhonghu

@hzxuzhonghu
Member

Let me have a look. It would be great if you could describe the bug a little bit.

@@ -508,7 +506,7 @@ func (m *MetricController) Run(ctx context.Context, mapOfTcpInfo *ebpf.Map) {
}

connectionLabels := connectionMetricLabels{}
- if m.EnableConnectionMetric.Load() && data.duration > LONG_CONN_METRIC_THRESHOLD {
+ if m.EnableConnectionMetric.Load() {
Member


Would you refactor a little more to combine it with the check at L522? We have so much duplicate flag checking in this function.

@@ -908,7 +906,7 @@ func (m *MetricController) updateServiceMetricCache(data requestMetric, labels s
}
}

func (m *MetricController) updateConnectionMetricCache(data requestMetric, connData connMetric, labels connectionMetricLabels) {
Member


IIUC, this is just a cleanup, because data contains only the delta sent/received bytes while connData contains the totals accumulated so far.
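To make the delta-versus-cumulative distinction concrete, here is a rough sketch of the two shapes under discussion. The requestMetric field names follow the function body quoted later in this thread; the field types and the connMetric fields are assumptions for illustration, not copied from the repository.

// Per-report delta consumed by updateConnectionMetricCache; field names match
// the snippet quoted later in this PR, types are guesses.
type requestMetric struct {
	sentBytes     uint32
	receivedBytes uint32
	packetLost    uint32
	totalRetrans  uint32
	duration      uint64
	state         uint32
}

// Cumulative counters for the whole lifetime of a connection ("data till now");
// hypothetical field names, shown only for contrast with the delta above.
type connMetric struct {
	totalSentBytes     uint64
	totalReceivedBytes uint64
	totalRetrans       uint64
}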

@yp969803
Contributor Author

Currently busy; I will complete this in 1-2 days.

@hzxuzhonghu
Member

OK, so this may not be backported to release-1.1.

@yp969803
Contributor Author

yp969803 commented May 12, 2025

This is needed in the release for correct connection metrics. Can you please wait 2 days for the release? I have my last exam tomorrow. @hzxuzhonghu

@hzxuzhonghu
Member

After review, I think this is a kind of cleanup. I am not sure what bug it is fixing.

@@ -508,7 +506,7 @@ func (m *MetricController) Run(ctx context.Context, mapOfTcpInfo *ebpf.Map) {
}

connectionLabels := connectionMetricLabels{}
- if m.EnableConnectionMetric.Load() && data.duration > LONG_CONN_METRIC_THRESHOLD {
Member


You mean we should remove data.duration > LONG_CONN_METRIC_THRESHOLD, right?

Contributor Author


I added this const to only create connection metrics for connections that have exceeded the 5 s duration. But this is not necessary, as we are already flushing the metrics every 5 s.

@@ -47,8 +47,6 @@ const (
metricFlushInterval = 5 * time.Second

DEFAULT_UNKNOWN = "-"

- LONG_CONN_METRIC_THRESHOLD = uint64(5 * time.Second)
Member


Why did we add this const before?

Contributor Author


I added this const to only create connection metrics for connections that have exceeded the 5 s duration. But this is not necessary, as we are already flushing the metrics every 5 s.
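For context on "already flushing every 5 s": the controller reports cached metrics on a fixed interval (metricFlushInterval = 5 * time.Second, visible in the diff above). Below is a minimal sketch of that pattern, assuming a ticker-driven loop; it is not a copy of the real Run loop in metric.go.

// Assumed shape of the periodic flush: every metricFlushInterval, whatever has
// accumulated in the caches is pushed to Prometheus.
ticker := time.NewTicker(metricFlushInterval)
defer ticker.Stop()
for {
	select {
	case <-ctx.Done():
		return
	case <-ticker.C:
		m.updatePrometheusMetric()
	}
}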

@yp969803
Contributor Author

@hzxuzhonghu can you review and merge?

@hzxuzhonghu
Member

@yp969803 You can rebase your branch with git pull --rebase upstream main; then no merge commit is included.

@@ -508,7 +506,7 @@ func (m *MetricController) Run(ctx context.Context, mapOfTcpInfo *ebpf.Map) {
}

connectionLabels := connectionMetricLabels{}
- if m.EnableConnectionMetric.Load() && reqMetric.duration > LONG_CONN_METRIC_THRESHOLD {
+ if m.EnableConnectionMetric.Load() {
Member


Shouldn't we only generate connection metrics for connections with duration > 5s?
Otherwise, how do you filter out the connection establishment report?


@hzxuzhonghu
Member

@yp969803 Please explain #1377 (comment)

@yp969803
Contributor Author

if v.Duration > LONG_CONN_METRIC_THRESHOLD {

Filtering long connections at the time of reporting to Prometheus.

@yp969803
Contributor Author

I am keeping the connection-metric information from the start of the connection in the cache, so I can calculate the correct metrics.

@yp969803
Contributor Author

yp969803 commented May 15, 2025

func (m *MetricController) updateConnectionMetricCache(reqMetric requestMetric, labels connectionMetricLabels) {
	v, ok := m.connectionMetricCache[labels]
	if ok {
		// existing connection: accumulate this report's deltas
		v.ConnSentBytes = v.ConnSentBytes + float64(reqMetric.sentBytes)
		v.ConnReceivedBytes = v.ConnReceivedBytes + float64(reqMetric.receivedBytes)
		v.ConnPacketLost = v.ConnPacketLost + float64(reqMetric.packetLost)
		v.ConnTotalRetrans = v.ConnTotalRetrans + float64(reqMetric.totalRetrans)
		v.Duration = reqMetric.duration
	} else {
		// first report for this connection: seed a new cache entry
		newConnectionMetricInfo := connectionMetricInfo{}
		newConnectionMetricInfo.ConnSentBytes = float64(reqMetric.sentBytes)
		newConnectionMetricInfo.ConnReceivedBytes = float64(reqMetric.receivedBytes)
		newConnectionMetricInfo.ConnPacketLost = float64(reqMetric.packetLost)
		newConnectionMetricInfo.ConnTotalRetrans = float64(reqMetric.totalRetrans)
		newConnectionMetricInfo.Duration = reqMetric.duration
		m.connectionMetricCache[labels] = &newConnectionMetricInfo
	}
	if reqMetric.state == TCP_CLOSED {
		// closed connections are queued so their series can be deleted from Prometheus
		deleteLock.Lock()
		deleteConnection = append(deleteConnection, &labels)
		deleteLock.Unlock()
	}
}

We can only use the delta of connection metrics here; once a connection passes the 5 s duration, we have to report the metrics accumulated up to that point.

@yp969803
Contributor Author

I have come up with this approach:

func (m *MetricController) updatePrometheusMetric() {
	m.mutex.Lock()
	workloadInfoCache := m.workloadMetricCache
	serviceInfoCache := m.serviceMetricCache
	connectionInfoCache := m.connectionMetricCache
	m.workloadMetricCache = map[workloadMetricLabels]*workloadMetricInfo{}
	m.serviceMetricCache = map[serviceMetricLabels]*serviceMetricInfo{}
	m.connectionMetricCache = map[connectionMetricLabels]*connectionMetricInfo{}
	m.mutex.Unlock()
	// ... (workload and service metric reporting elided) ...

	for k, v := range connectionInfoCache {
		// push connection metrics to prometheus if the connection is long-lived
		if v.Duration > LONG_CONN_METRIC_THRESHOLD {
			connectionLabels := struct2map(k)
			tcpConnectionTotalSendBytes.With(connectionLabels).Add(v.ConnSentBytes)
			tcpConnectionTotalReceivedBytes.With(connectionLabels).Add(v.ConnReceivedBytes)
			tcpConnectionTotalPacketLost.With(connectionLabels).Add(v.ConnPacketLost)
			tcpConnectionTotalRetrans.With(connectionLabels).Add(v.ConnTotalRetrans)
		} else {
			m.connectionMetricCache[k] = v
		}
	}

	// delete metrics
	deleteLock.Lock()
	// Creating a copy reduces the amount of time spent adding locks in the programme
	workloadReplica := deleteWorkload
	deleteWorkload = nil
	serviceReplica := deleteService
	deleteService = nil
	connReplica := deleteConnection
	deleteConnection = []*connectionMetricLabels{}
	deleteLock.Unlock()

	for i := 0; i < len(workloadReplica); i++ {
		deleteWorkloadMetricInPrometheus(workloadReplica[i])
	}
	for i := 0; i < len(serviceReplica); i++ {
		deleteServiceMetricInPrometheus(serviceReplica[i])
	}
	for i := 0; i < len(connReplica); i++ {
		deleteConnectionMetricInPrometheus(connReplica[i])
	}
}
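A worked example of the flow above, with illustrative numbers: suppose a connection has transferred 800 bytes when the first flush fires at 4 s of connection lifetime. Its cache entry has Duration below LONG_CONN_METRIC_THRESHOLD, so the entry is carried over into the fresh cache instead of being reported. By the next flush the connection has lived 9 s and its entry has accumulated 1800 bytes, so the whole accumulated value is Add()ed to the Prometheus counters and the entry is dropped. From then on, each flush adds only the delta accumulated since the previous report, so the counters stay consistent with the per-interval reports.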

@yp969803
Contributor Author

@hzxuzhonghu did you understand?

@hzxuzhonghu
Member

No.

I do think if m.EnableConnectionMetric.Load() && reqMetric.duration > LONG_CONN_METRIC_THRESHOLD { is right.

At connection establishment we do not store the tcp connection, so it does not matter much.

@yp969803
Contributor Author

yp969803 commented May 15, 2025

I think you are correct, it will not matter much. I will refactor it.

@yp969803 yp969803 force-pushed the issue1370 branch 2 times, most recently from b3d0ff2 to 76fa21a Compare May 15, 2025 02:42
@yp969803
Contributor Author

@hzxuzhonghu can you look now? I have refactored it.

Member

@hzxuzhonghu hzxuzhonghu left a comment


/lgtm
/approve

@kmesh-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hzxuzhonghu


Signed-off-by: Yash Patel <yp969803@gmail.com>

rfac: TestMetricController_updatePrometheusMetric

Signed-off-by: Yash Patel <yp969803@gmail.com>
@yp969803
Contributor Author

@hzxuzhonghu waiting for merge

@hzxuzhonghu
Member

/lgtm

@kmesh-bot kmesh-bot added the lgtm label May 16, 2025
@kmesh-bot kmesh-bot merged commit e56e0ac into kmesh-net:main May 19, 2025
11 checks passed

Successfully merging this pull request may close these issues.

Metric sent_bytes received_bytes not consistent with accesslog