Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(servicegraph): make virtual node tests reliable on Windows #37550

Merged

Conversation

mapno
Copy link
Contributor

@mapno mapno commented Jan 28, 2025

Description

The virtual node tests (TestVirtualNodeServerLabels and TestVirtualNodeClientLabels) were failing intermittently on Windows due to timing issues. This change:

  • Reduces MetricsFlushInterval and StoreExpirationLoop to 10ms
  • Adds a timeout-based polling mechanism to wait for metrics generation
  • Replaces unreliable sleep with proper metric availability checks
  • Improves error messaging for test failures

Link to tracking issue

Fixes #33679

Testing

Updates tests

Documentation

Copy link
Contributor

@pjanotti pjanotti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still draft but LGTM, minus the change log file. Thanks @mapno!

# Use this changelog template to create an entry for release notes.

# One of 'breaking', 'deprecation', 'new_component', 'enhancement', 'bug_fix'
change_type: bug_fix
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change doesn't require a change log: it is not visible to users. You can remove this file from the change I will add the label to skip the check for a change log.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@pjanotti pjanotti added the Skip Changelog PRs that do not require a CHANGELOG.md entry label Jan 29, 2025
// Wait for metrics to be generated with timeout
deadline := time.Now().Add(5 * time.Second)
var metrics []pmetric.Metrics
for time.Now().Before(deadline) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note: you can keep using the assert.Eventually or assert.EventuallyWithT, the main advantage being the error message in case of failure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-added again

Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Feb 13, 2025
The virtual node tests (TestVirtualNodeServerLabels and TestVirtualNodeClientLabels)
were failing intermittently on Windows due to timing issues. This change:
- Reduces MetricsFlushInterval and StoreExpirationLoop to 10ms
- Adds a timeout-based polling mechanism to wait for metrics generation
- Replaces unreliable sleep with proper metric availability checks
- Improves error messaging for test failures

Fixes open-telemetry#33679
@mapno mapno force-pushed the fix/servicegraph-windows-tests branch from cfda079 to 8d273de Compare February 17, 2025 09:57
@mapno
Copy link
Contributor Author

mapno commented Feb 17, 2025

The CI job for windows tests passed. Unfortunately I don't have a windows machine to verify that it's not flaky :/

@mapno mapno marked this pull request as ready for review February 17, 2025 10:27
@mapno mapno requested a review from a team as a code owner February 17, 2025 10:27
@mapno mapno requested a review from dehaansa February 17, 2025 10:27
@github-actions github-actions bot removed the Stale label Feb 18, 2025
@pjanotti
Copy link
Contributor

@mapno 100s of runs on my box without failures, it seems good by that metric.

virtualNodeDimensions := []string{"peer.service", "db.system", "messaging.system"}
cfg := &Config{
Dimensions: virtualNodeDimensions,
LatencyHistogramBuckets: []time.Duration{time.Duration(0.1 * float64(time.Second)), time.Duration(1 * float64(time.Second)), time.Duration(10 * float64(time.Second))},
Store: StoreConfig{MaxItems: 10},
VirtualNodePeerAttributes: virtualNodeDimensions,
VirtualNodeExtraLabel: true,
MetricsFlushInterval: time.Millisecond,
// Reduce flush interval for faster test execution
MetricsFlushInterval: 1 * time.Millisecond,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practice this shouldn't make any difference, but I'm curious if it is deliberate that here is 1 ms vs 10 ms on the other test. Anyway, as I said this shouldn't make a practical difference.

@pjanotti pjanotti added the ready to merge Code review completed; ready to merge by maintainers label Feb 26, 2025
@andrzej-stencel andrzej-stencel merged commit 7993c9d into open-telemetry:main Mar 3, 2025
176 checks passed
@github-actions github-actions bot added this to the next release milestone Mar 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
connector/servicegraph ready to merge Code review completed; ready to merge by maintainers Skip Changelog PRs that do not require a CHANGELOG.md entry
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[connector/servicegraph] Unit test failures on Windows
5 participants