Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

Conversation

@beyang
Copy link
Member

@beyang beyang commented Mar 25, 2020

Addresses #9300

From the updated CHANGELOG:

  • Distributed tracing is a powerful tool for investigating performance issues. The following changes
    have been made with the goal of making it easier to use distributed tracing with Sourcegraph:
    • The site configuration field "tracing.distributedTracing": { "sampling" } allows a site admin to control which requests generate tracing data.

      • "all" will trace all requests.
      • "selective" will trace all requests initiated from an end-user URL with
        ?trace=1. Non-end-user-initiated requests can set a HTTP header X-Sourcegraph-Should-Trace: true. This is the recommended setting, as "all" can generate large amounts of tracing data that may cause network and memory resource contention in the Sourcegraph instance.
      • "none" turns off tracing.
    • Jaeger is now the officially supported distributed tracer. The following is the recommended site configuration to connect Sourcegraph to a Jaeger agent (which must be deployed on the same host and listening on the default ports):

      "tracing.distributedTracing": {
        "type": "jaeger",
        "sampling": "selective"
      }
      
    • The site configuration field, useJaeger, is deprecated in favor of "tracing.distributedTracing": { "type": "jaeger" }.

    • The site configuration field "experimentalFeatures": { "debug.log": { "opentracing" } } toggles debug logging that logs every call initiated from the opentracing (Jaeger) client.

    • Support for configuring Lightstep as a distributed tracer is deprecated and will be removed in a subsequent release. Because most Sourcegraph instances are deployed on-prem and Lightstep is only available "in the Cloud", usage of Lightstep was very low or non-existent. If you are a paying customer and would like us to maintain support, please email support@sourcegraph.com.

Other notes:

  • Reviewers should try this out in dev, given that Jaeger is now run by default in our dev environment. The main thing to try is to use the "selective" setting and toggle on ?trace=1 in the URL to notice Jaeger trace collection turn on/off for the given request tree.
  • The diff touches many files, because I had to update all invocations of the opentracing API to go through the internal/trace/ot package (which implements the "selective" tracing behavior described in the CHANGELOG).
  • The field name tracing.distributedTracing was made, because I anticipate wanting to add tracing.nettrace shortly. If anyone prefers a different naming scheme or site config structure, please comment.

TODO

  • Investigating remaining issue with debug (toggle doesn't work)

Following merge, I will do the following:

  • Update the Sourcegraph deployment repositories to have users install Jaeger by default. I will also provide instructions to deploy Jaeger for existing instances that do not already have it.
  • Update docs.sourcegraph.com to document this on the site admin tracing docs page.
  • Open up a tech-debt issue to migrate to opentelemetry (which has a context-aware API for starting spans)
  • Look into adding httptrace (for client-side tracing) to debug networking issues.

@beyang beyang force-pushed the bl/jaeger branch 5 times, most recently from 6b1e400 to 4614d3f Compare March 31, 2020 04:24
@codecov
Copy link

codecov bot commented Mar 31, 2020

Codecov Report

Merging #9330 into master will decrease coverage by 0.04%.
The diff coverage is 70.14%.

@@            Coverage Diff             @@
##           master    #9330      +/-   ##
==========================================
- Coverage   41.54%   41.49%   -0.05%     
==========================================
  Files        1333     1333              
  Lines       72658    72664       +6     
  Branches     6582     6583       +1     
==========================================
- Hits        30184    30151      -33     
- Misses      39675    39713      +38     
- Partials     2799     2800       +1
Impacted Files Coverage Δ
internal/conf/parse.go 47.82% <ø> (ø) ⬆️
cmd/frontend/internal/app/jscontext/jscontext.go 2.98% <ø> (ø) ⬆️
cmd/frontend/graphqlbackend/codemod.go 13.4% <0%> (ø) ⬆️
cmd/frontend/graphqlbackend/textsearch.go 49.31% <0%> (ø) ⬆️
internal/vcs/git/shortlog.go 40.74% <0%> (ø) ⬆️
cmd/frontend/internal/cli/http.go 0% <0%> (ø) ⬆️
cmd/gitserver/main.go 15.51% <0%> (ø) ⬆️
internal/gitserver/proxy.go 22.22% <0%> (ø) ⬆️
internal/highlight/highlight.go 49.07% <0%> (-0.23%) ⬇️
cmd/frontend/graphqlbackend/search_results.go 46.92% <0%> (ø) ⬆️
... and 38 more

@beyang beyang force-pushed the bl/jaeger branch 3 times, most recently from c0d132f to 00d6c98 Compare April 1, 2020 00:26
@beyang beyang requested review from emidoots and keegancsmith April 1, 2020 00:35
@beyang beyang changed the title WIP: Jaeger changes Enable selective tracing with Jaeger and update Jaeger site config schema Apr 1, 2020
@beyang beyang marked this pull request as ready for review April 1, 2020 00:45
@beyang beyang requested review from a team and rvantonder as code owners April 1, 2020 00:45
@beyang beyang requested review from a team April 1, 2020 00:45
Copy link
Member Author

@beyang beyang Apr 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sourcegraph/web this look kosher to you?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@beyang beyang removed request for a team and rvantonder April 1, 2020 00:47
Copy link
Member

@keegancsmith keegancsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but there is an approach here which is much more minimal than updating every callsite. And that is to implement your own opentracing.Tracer which either does a noop or uses Jaeger. Then set that as the GlobalTracer.

@beyang
Copy link
Member Author

beyang commented Apr 1, 2020

LGTM, but there is an approach here which is much more minimal than updating every callsite. And that is to implement your own opentracing.Tracer which either does a noop or uses Jaeger. Then set that as the GlobalTracer.

@keegancsmith this is the approach I tried first. However, I got hung up, because I am using a context item to toggle tracing on/off and the opentracing.Tracer methods do not accept a context as an argument. Is there another approach here I might have overlooked?

@beyang beyang force-pushed the bl/jaeger branch 2 times, most recently from e097ac8 to ce8b5d7 Compare April 2, 2020 04:45
@keegancsmith
Copy link
Member

LGTM, but there is an approach here which is much more minimal than updating every callsite. And that is to implement your own opentracing.Tracer which either does a noop or uses Jaeger. Then set that as the GlobalTracer.

@keegancsmith this is the approach I tried first. However, I got hung up, because I am using a context item to toggle tracing on/off and the opentracing.Tracer methods do not accept a context as an argument. Is there another approach here I might have overlooked?

Just took a look. You are right the API is pretty poor here. You could make it such that your GlobalTracer does noop for all "root" spans. Then the part that does the "real" root span uses the correct decision logic. That way we will get the optional propagating down. That seems like a reasonable approach and easy to reason about (ie only create a span if there is a parent, special method for root spans). But this also works to be fair. Also I see opentelemetry is now a thing which we may want to migrate to, so having an intermediate pkg makes that easier.

CHANGELOG.md Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this relate to the retention policy in jaeger? How long is data kept there and what is the default storage capacity in our Kubernetes deployment?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has no bearing on the data retention policy in Jaeger, which must be configured in Jaeger itself. This only affects behavior of the Jaeger client. This point will be addressed elsewhere, in the Jaeger install docs we provide in the Sourcegraph docs. I've added it to the TODO in this other PR: sourcegraph/deploy-sourcegraph#559.

@beyang
Copy link
Member Author

beyang commented Apr 3, 2020

Just took a look. You are right the API is pretty poor here. You could make it such that your GlobalTracer does noop for all "root" spans. Then the part that does the "real" root span uses the correct decision logic. That way we will get the optional propagating down. That seems like a reasonable approach and easy to reason about (ie only create a span if there is a parent, special method for root spans). But this also works to be fair. Also I see opentelemetry is now a thing which we may want to migrate to, so having an intermediate pkg makes that easier.

@keegancsmith Yeah, we should just migrate to opentelemetry. I will open an issue to do that after this is merged. It might be a little tricky given we use some intermediary packages that depend on opentracing.

@beyang beyang merged commit 69763d8 into master Apr 3, 2020
@beyang beyang deleted the bl/jaeger branch April 3, 2020 00:50
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants