Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: runtime metrics #375

Merged
merged 32 commits into from
Nov 11, 2021
Merged

feat: runtime metrics #375

merged 32 commits into from
Nov 11, 2021

Conversation

seemk
Copy link
Collaborator

@seemk seemk commented Oct 28, 2021

Description

Add runtime metrics collection in a similar fashion as is done by signalfx-nodejs-collect. However signalfx-nodejs-collect used 2 different native extensions to gather event loop and GC metrics. For event loop metric only the whole cycle duration (including the poll phase) was reported as a metric, which is quite useless. It should have measured event loop lag, which is the whole event loop cycle minus the time it took for IO, timer polling (i.e. no CPU used). Besides this signalfx-nodejs-collect had a lot of dead code regarding events, which were never sent (e.g. on each GC cycle). So the decision was to drop signalfx-nodejs-collect for a custom native extension.

Currently the exporting step will still use the SignalFx client.

List of metrics exported (see signalfx-nodejs-collect for the previous list):

  • nodejs.memory.heap.total (gauge, bytes) - Heap total via process.memoryUsage().heapTotal
  • nodejs.memory.heap.used (gauge, bytes) - Heap used via process.memoryUsage().heapUsed
  • nodejs.memory.rss (gauge, bytes) - Resident set size via process.memoryUsage().rss
  • nodejs.memory.gc.size (cumulative_counter, bytes) - Memory collected by the garbage collector
  • nodejs.memory.gc.pause (cumulative_counter, nanoseconds) - Time spent doing GC
  • nodejs.memory.gc.count (cumulative_counter, count) - Number of times GC ran. (signalfx-nodejs-collect: nodejs.memory.gc.total)
  • nodejs.event_loop.lag.max (gauge, nanoseconds) - Max event loop lag (signalfx-nodejs-collect: nodejs.event_loop.max)
  • nodejs.event_loop.lag.min (gauge, nanoseconds) - Min event loop lag (signalfx-nodejs-collect: nodejs.event_loop.min)

Environment variables introduced:

  • SPLUNK_METRICS_ENABLED (default: 'false') - Metrics are opt-in
  • SPLUNK_METRICS_ENDPOINT (default: http://localhost:9943)
  • SPLUNK_METRICS_EXPORT_INTERVAL (default: 5000)

Misc:

  • Why getSignalFxClient? Custom metrics will still need to be supported. This PR allows to provide your own SignalFx client to startMetrics, but since we configure it with the same access token anyway, they can just use this helper method to access the currently used client.

Type of change

  • New feature (non-breaking change which adds functionality)
  • Documentation change or requires a documentation update

How Has This Been Tested?

  • Tested manually
  • Added automated tests

Checklist:

  • Unit tests have been added/updated
  • Documentation has been updated
  • Change file has been generated (npm run change:new)
  • Delete this branch (after the PR is merged)

@codecov-commenter
Copy link

codecov-commenter commented Oct 28, 2021

Codecov Report

Merging #375 (6f01c1c) into main (6afb5b3) will decrease coverage by 0.54%.
The diff coverage is 91.42%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #375      +/-   ##
==========================================
- Coverage   93.63%   93.08%   -0.55%     
==========================================
  Files           9       12       +3     
  Lines         330      434     +104     
  Branches       87      104      +17     
==========================================
+ Hits          309      404      +95     
- Misses         21       30       +9     
Impacted Files Coverage Δ
src/options.ts 98.09% <88.88%> (-0.88%) ⬇️
src/metrics/index.ts 90.90% <90.90%> (ø)
src/instrument.ts 100.00% <100.00%> (ø)
src/metrics/memory.ts 100.00% <100.00%> (ø)
src/metrics/native/index.ts 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6afb5b3...6f01c1c. Read the comment docs.

@seemk seemk marked this pull request as ready for review November 1, 2021 15:19
@seemk seemk requested review from a team as code owners November 1, 2021 15:19
Comment on lines 61 to 65
| Environment variable<br>``startMetrics()`` argument | Default value | Support | Notes
| --------------------------------------------------------------- | ----------------------- | ------- | ---
| `SPLUNK_METRICS_ENABLED`<br>`enabled` | `false` | Experimental | Enabled metrics export. See [metrics documentation](metrics.md) for more information.
| `SPLUNK_METRICS_ENDPOINT`<br>`endpoint` | `http://localhost:9943` | Experimental | The SignalFx metrics endpoint to send to.
| `SPLUNK_METRICS_EXPORT_INTERVAL`<br>`exportInterval` | `5000` | Experimental | The interval, in milliseconds, of metrics collection and exporting.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps an extra column for the startMetrics argument would make it easier to scan the table.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/SignalFx/Splunk

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a link to SignalFx client. Followed the existing doc style and startTracing did not have the extra column either, should I split both of them to separate column?


#### Additional `startMetrics` config options

- `signalfx`: A JS object with optional `client` and `dimensions` fields. If you have already setup a SignalFx client with custom configuration, it is possible use this for sending instead of creating, configuring a new one. `dimensions` object adds a pre-defined dimension for each datapoint. The format for `dimensions` is `{key: value, ...}`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- `signalfx`: A JS object with optional `client` and `dimensions` fields. If you have already setup a SignalFx client with custom configuration, it is possible use this for sending instead of creating, configuring a new one. `dimensions` object adds a pre-defined dimension for each datapoint. The format for `dimensions` is `{key: value, ...}`.
- `signalfx`: A JS object with optional `client` and `dimensions` fields. If you have already setup a Splunk client with custom configuration, you can use this for sending instead of creating, configuring a new one. `dimensions` object adds a pre-defined dimension for each datapoint. The format for `dimensions` is `{key: value, ...}`.

Can we rename the setting to "splunk"?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, explicitly wanted to make it clear that we are dealing with SignalFx Node.js client. What do you think about removing the signalfx field completely? And replace it with 2 fields: client and dimensions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old library? Then it makes sense. Just wondering about backward compatibility if the switch to OTel metrics in the future (see Owais' comments).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a link to the SignalFx client there and mentioned that metrics will change once OpenTelemetry becomes available.

@owais
Copy link
Contributor

owais commented Nov 2, 2021

How will this work with Otel? What happens when Otel introduces similar runtime metrics? Are we marking this as unstable for now so we have a path to migrate to Otel implementation later?

Or why aren't we contributing this to otel contrib?

@seemk
Copy link
Collaborator Author

seemk commented Nov 3, 2021

How will this work with Otel? What happens when Otel introduces similar runtime metrics? Are we marking this as unstable for now so we have a path to migrate to Otel implementation later?

Or why aren't we contributing this to otel contrib?

Yes, it'll be unstable for now. Runtime metrics are disabled by default for that reason. When OTel semconvs and the metrics SDK is available, there will be a breaking change. And after semconvs are available we can start contributing the module upstream.

@owais
Copy link
Contributor

owais commented Nov 3, 2021

OK. If it is experimental and users understand things can break then it's fine I think. That said, SignalFx never had a single library for metrics and traces. Customers always had to use two different libs. Why wouldn't we let customers use signalfx metrics lib till we have counterparts in otel ready?

@theletterf
Copy link
Member

I tend to agree with @owais here. What's the take of Ivo on this?

scripts/prebuild-current.js Outdated Show resolved Hide resolved
Copy link
Contributor

@rauno56 rauno56 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd still rather see the native extension have a separate life-cycle in another package, but this works for now.

src/metrics/index.ts Outdated Show resolved Hide resolved
src/metrics/native/module.cpp Outdated Show resolved Hide resolved
test/examples/Dockerfile_test Outdated Show resolved Hide resolved
@owais
Copy link
Contributor

owais commented Nov 3, 2021

@ivomagi
Copy link

ivomagi commented Nov 3, 2021

I tend to agree with @owais here. What's the take of Ivo on this?

I think I can restate the goals

  1. Satisfy the need for runtime metrics today, not when Otel metrics SDK is stable and semconvs available
  2. Package the solution in a way where it is easy to adopt
  3. Make sure the means to transition to Otel-world are taken into account and we know what is the upgrade path to Otel semconv and data model and SDK. There is an epic for next PI where this is taken apart in more details.
  4. Provide similar user experience to other instrumentation lis in this aspect (only Java so far)

Whether or not the macro requirements above can be fulfilled with one or two client-facing artifacts is engineering call. Maybe worth a call as well to go over the concerns.

@seemk seemk merged commit 668ffd8 into main Nov 11, 2021
@seemk seemk deleted the metrics branch November 11, 2021 10:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants