Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add OpenTelemetry support for Koa template #706

Merged
merged 10 commits into from
Dec 28, 2021
Merged

Add OpenTelemetry support for Koa template #706

merged 10 commits into from
Dec 28, 2021

Conversation

runk
Copy link
Contributor

@runk runk commented Nov 30, 2021

This brings https://opentelemetry.io/ support, which can be enabled via gantry config (deploys another sidecar that collects the data). According to @bernos it's here to stay and should no longer be considered an experimental feature. Yet has to be widely announced.

All traces can be viewed in "Technology Platforms" DD org.

This PR is a first step - one template only - koa app.

@runk runk requested a review from a team as a code owner November 30, 2021 07:18
@changeset-bot
Copy link

changeset-bot bot commented Nov 30, 2021

🦋 Changeset detected

Latest commit: 0126a2a

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
skuba Minor

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

return sdk.start();
};

const log = (level: string, msg: string, extra = {}) => {
Copy link
Contributor Author

@runk runk Nov 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm hesitant to bring in the standard logger as it might bring more dependencies before the internals are instrumented. This lib really meant to be self-contained.

Thoughts?

@@ -1,5 +1,3 @@
import './register';
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's loaded in app.ts

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is here to make it obvious (to humans but also to e.g. linters) that the import has side effects and must go first. It's easy to shoot yourself in the foot by importing something else before ./app, which is part of the reason why I'd like to get rid of this hook soon™️.

template/koa-rest-api/gantry.apply.yml Show resolved Hide resolved
"@seek/logger": "^5.0.1",
"skuba-dive": "^1.2.0",
"aws-sdk": "^2.1039.0",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: do you need to declare this as a dep?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's up for discussion. It's needed for aws-sdk instrumentation, which I guess is highly desirable for anyone at seek. Perhaps some don't use aws-sdk but I bet it's the minority.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think when I flipped through the GitHub page it said you could use aws-sdk V3 as well?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, otel won't work with aws-sdk < V3. At least it wasn't in early testing. This was because in V2 of the aws SDK, request signing was happening before otel header propagation. This meant that all instrumented requests to AWS services via the AWS SDK would fail on account of invalid signatures. This wasn't an issue in V3

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bernos re aws-sdk v2 - I can confirm instrumentation works for it with current otel libraries.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to just leave it out and let consumers byo sdk?

Copy link
Contributor Author

@runk runk Dec 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends on your definition of "better", as always :). Do we want to bring in aws-sdk instrumentation for consumers out of the box?

Screen Shot 2021-12-01 at 4 21 38 pm

Looking at this trace - I see clear advantages of knowing what the HTTP POST actually was about, which in the case with aws-sdk instrumentation is very illustrative. With minimal built-in http instrumentation it'll be just another opaque HTTP request.

The problem is https://github.com/open-telemetry/opentelemetry-js-contrib/tree/main/plugins/node/opentelemetry-instrumentation-aws-sdk needs aws-sdk (or rather its types)

@etaoins
Copy link
Contributor

etaoins commented Nov 30, 2021

Has this been running in production yet? Any early performance insights versus no instrumentation or Datadog APM?

@runk
Copy link
Contributor Author

runk commented Nov 30, 2021

Has this been running in production yet? Any early performance insights versus no instrumentation or Datadog APM?

Otel is in production for about ~20 services today, however only some have code instrumentation enabled. Most of those 20 services have the flag set in gantry which enables tracing in auth-sidecar only. I can see there are only 4 services that have code instrumentation enabled across a couple of teams. There're some heavy-loaded systems like the API router that is instrumented though. Also it comes down that what we want to instrument, e.g. both aws-sdk and node's http or only one one of those.

Re DD APM - I think it'd be a similar performance hit as with automatic instrumentation as all outgoing requests (and other i/o commands) are wrapped.

Stats for one of the services:

(red line is when it went live)

Screen Shot 2021-12-01 at 9 22 04 am

Screen Shot 2021-12-01 at 9 19 09 am

Number of tasks didn't go up so CPU avg is somewhat reliable representation of possible performance hit.

I'll try to get similar insights for another service that's hit a bit harder.

@runk
Copy link
Contributor Author

runk commented Nov 30, 2021

Okay, there're two discreet parts - app code instrumentation (controlled via tracing.ts) and auth sidecar instrumentation (controlled via gantry config). The previous chart was for the case when we already had Otel enabled in gantry (think auth sidecar) and just added code instrumentation in the app itself. No visible performance hit could be seen.

However, when enabled in gantry AND app, there's about 2-3ms latency increase. I'm inclined to believe it's mainly attributed to sidecar instrumentation.

Screen Shot 2021-12-01 at 9 53 15 am

@etaoins
Copy link
Contributor

etaoins commented Nov 30, 2021

Is that expected @bernos? I don't recall that showing up on indie-graphql-server. Seems like a large regression for a feature teams might not end up using.

Copy link
Contributor

@etaoins etaoins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talking with @runk offline, there's potentially a large performance regression for some workloads (33%-50%). I'm happy to include this as a default off to allow teams to opt-in to tracing if they find it useful. This is similar to what we do with Datadog APM.

@runk
Copy link
Contributor Author

runk commented Dec 1, 2021

Changed to be opt-in.

I think as an alternative (a middle ground?) we can to have the tracing on in auth sidecar by default, but keep app instrumentation off by default. I think this way possible performance hit as you experienced with DD APM will be an unlikely scenario.

@runk
Copy link
Contributor Author

runk commented Dec 1, 2021

Screen Shot 2021-12-01 at 11 10 07 am

The graph shows the service in three modes:

  • Bare app (before 9:15)
  • Sidecar + App instrumentation (9:15-10:38)
  • Sidecar instrumentation only (10:38 onwards)

While latency increase is a single digit, it could mean a ~30% increase for a fast-responding app. Assuming if the app is doing a lot of network calls and all be instrumented (e.g. Experience APIs/Graphql servers), the more serious performance hit of app instrumentation should be expected. I do not see time spent for sidecar instrumentation to differ depending on nature of the app itself.

"@seek/logger": "^5.0.1",
"skuba-dive": "^1.2.0",
"aws-sdk": "^2.1039.0",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, otel won't work with aws-sdk < V3. At least it wasn't in early testing. This was because in V2 of the aws SDK, request signing was happening before otel header propagation. This meant that all instrumented requests to AWS services via the AWS SDK would fail on account of invalid signatures. This wasn't an issue in V3

import { B3InjectEncoding, B3Propagator } from '@opentelemetry/propagator-b3';
import { NodeSDK } from '@opentelemetry/sdk-node';

const main = async () => {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does async code loaded via --require behave any differently to async code called from the entry point? ie. is there a race here between instrumentation being initialised and the first requests being handed by the service?

Copy link
Contributor Author

@runk runk Dec 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I think it's definitely a possibility that promises aren't fulfilled before the app loads. Today there's no way to pause for async execution in the bootstrap file supplied via --require x.js.

Really depends on what happens inside of the sdk.start fn: https://www.npmjs.com/package/@opentelemetry/sdk-node. Looking in code it only really returns a pending promise when we want auto resource detection which is off: https://github.com/seek-oss/skuba/pull/706/files#diff-4836bf38ef9493029c23deb46fdf06733bbc8b8cec520b9f538ea1a68cebf8eaR33

I'm going to remove async from the main fn to keep it simple. However it's only hiding the possible problem with pending promise from sdk.start()

@bernos
Copy link

bernos commented Dec 1, 2021

This is great, happy to roll it out, but please don't default any open telemetry options to enabled by default just yet. The cost implications of mass adoption are still not well understood, and we'd like to avoid any shock there. There are contingencies and compensations that we can make in the event that costs exceed the forecast, but we need a bit more control over the rate of adoption in order to have time to adapt if needs be. Also, I mentioned an incompatibility with aws-sdk < v3 that may have been addressed in newer releases of v2, but there should probably be a warning on the tin.

@runk
Copy link
Contributor Author

runk commented Dec 3, 2021

@samchungy @etaoins do you want to check it one more time before I merge? The notable change is the addition of process exit handling: https://github.com/seek-oss/skuba/pull/706/files#diff-4836bf38ef9493029c23deb46fdf06733bbc8b8cec520b9f538ea1a68cebf8eaR46

@samchungy
Copy link
Contributor

@samchungy @etaoins do you want to check it one more time before I merge? The notable change is the addition of process exit handling: https://github.com/seek-oss/skuba/pull/706/files#diff-4836bf38ef9493029c23deb46fdf06733bbc8b8cec520b9f538ea1a68cebf8eaR46

@72636c your review probably holds the most weight here 😆

@@ -1,5 +1,3 @@
import './register';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is here to make it obvious (to humans but also to e.g. linters) that the import has side effects and must go first. It's easy to shoot yourself in the foot by importing something else before ./app, which is part of the reason why I'd like to get rid of this hook soon™️.

Comment on lines +5 to +8
"@opentelemetry/exporter-collector-grpc": "^0.25.0",
"@opentelemetry/instrumentation-aws-sdk": "^0.1.0",
"@opentelemetry/instrumentation-http": "^0.26.0",
"@opentelemetry/sdk-node": "^0.26.0",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to be concerned about these v0s? Does OpenTelemetry promise some degree of stability?

template/koa-rest-api/Dockerfile Outdated Show resolved Hide resolved
.changeset/brown-bananas-repair.md Outdated Show resolved Hide resolved
template/koa-rest-api/src/tracing.ts Outdated Show resolved Hide resolved
template/koa-rest-api/src/tracing.ts Outdated Show resolved Hide resolved
template/koa-rest-api/gantry.apply.yml Outdated Show resolved Hide resolved
template/koa-rest-api/src/tracing.ts Outdated Show resolved Hide resolved
runk and others added 7 commits December 16, 2021 10:39
Co-authored-by: Ryan Ling <ryan@outlook.com.au>
Co-authored-by: Ryan Ling <ryan@outlook.com.au>
Co-authored-by: Ryan Ling <ryan@outlook.com.au>
Co-authored-by: Ryan Ling <ryan@outlook.com.au>
Co-authored-by: Ryan Ling <ryan@outlook.com.au>
Co-authored-by: Ryan Ling <ryan@outlook.com.au>
import { B3InjectEncoding, B3Propagator } from '@opentelemetry/propagator-b3';
import { NodeSDK } from '@opentelemetry/sdk-node';

const app = 'opentelemetry';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For posterity, I was wondering where this app log field convention was from and it looks like our sidecar follows suit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Old mate ca-go-library 🙈

seek-auth-sidecar was originally ca-auth-sidecar and used the default logger from ca-go-library. The newer name convention comes from Pino/Bunyan.

@72636c 72636c changed the title Add opentelemetry support for koa-app template Add OpenTelemetry support for Koa template Dec 28, 2021
@72636c 72636c merged commit baae6b3 into master Dec 28, 2021
@72636c 72636c deleted the add-otel-support branch December 28, 2021 22:46
@github-actions github-actions bot mentioned this pull request Dec 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants