Extension registry GraphQL API is slow #10554

Closed
felixfbecker opened this issue May 11, 2020 · 26 comments · Fixed by #24626

Comments

@felixfbecker
Contributor

Currently, the extension registry page usually takes a long time to load, sometimes up to a minute, because of the GraphQL request that queries extensions.

[Screenshots: Screen Shot 2020-05-10 at 12 30 05 and Screen Shot 2020-05-10 at 12 30 14, showing the slow GraphQL request]

Worse, this request is made not only on the registry page, but also whenever extensions are loaded.

And since private instances query extensions from sourcegraph.com, all private instances are affected as well.

The primary work here is on the backend. It's likely that our DB tables and queries are not optimized (e.g. could it be that we keep extensions and extension releases in the same table and run a SELECT DISTINCT on every query?).
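
To make that speculation concrete, here is a minimal Go sketch (my own illustration, not actual Sourcegraph code; the table and column names are assumptions) of the kind of pattern being hypothesized: listing extensions by deduplicating over the much larger releases table rather than reading only the extensions table.

package registry

import (
	"context"
	"database/sql"
)

// listExtensionsViaReleases illustrates the suspected expensive pattern:
// the listing query scans and deduplicates the (large) releases table
// instead of touching only the extensions table.
// Table and column names are hypothetical.
func listExtensionsViaReleases(ctx context.Context, db *sql.DB) ([]string, error) {
	rows, err := db.QueryContext(ctx, `
		SELECT DISTINCT ON (extension_id) extension_name
		FROM registry_extension_releases
		ORDER BY extension_id, created_at DESC`)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var names []string
	for rows.Next() {
		var name string
		if err := rows.Scan(&name); err != nil {
			return nil, err
		}
		names = append(names, name)
	}
	return names, rows.Err()
}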

Related: #1983

cc @sourcegraph/core-services

@felixfbecker felixfbecker added this to To do in Web Team :: Current iteration via automation May 11, 2020
@mrnugget
Contributor

What's the GraphQL request?

@keegancsmith
Member

keegancsmith commented May 11, 2020 via email

@felixfbecker
Contributor Author

@keegancsmith
Member

keegancsmith commented May 11, 2020 via email

@tsenart tsenart added this to the 3.17 milestone May 11, 2020
@tsenart tsenart self-assigned this May 12, 2020
@tsenart
Contributor

tsenart commented May 12, 2020

@felixfbecker: We increased the RAM of the sourcegraph.com Postgres database and tuned some parameters as part of #10598. I think this may improve the performance of this query. Would you be able to check?

@felixfbecker
Contributor Author

Definitely feels faster:

[screenshot]

Will report back if I see a slow load.
Is there a way we could monitor these loading times, e.g. in a Grafana dashboard?

@unknwon
Member

unknwon commented May 13, 2020

In case no one has noticed this before, registry_extension_releases is the third-largest table in our DB:

[screenshot of database table sizes]

Might be worth some cleanup.

@tsenart
Contributor

tsenart commented May 13, 2020

Might be worth some cleanup.

How would we do this cleanup? Is there any data that can be deleted safely?

@tsenart
Contributor

tsenart commented May 13, 2020

Is there a way we could monitor these loading times, e.g. in a Grafana dashboard?

@felixfbecker: I opened a PR to instrument the extension registry API with Prometheus metrics so that we can do that.
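
For reference, a rough sketch of what that instrumentation might look like in Go (the metric and label names match the PromQL queries in the dashboard JSON further down; the wiring itself is my own illustration, not the actual PR):

package registry

import (
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// requestDuration tracks extension registry API latency, partitioned by
// operation (e.g. "list") and HTTP status code.
var requestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
	Name: "src_registry_requests_duration_seconds",
	Help: "Duration of extension registry API requests.",
}, []string{"operation", "code"})

// observe would be called by a handler when a request finishes.
func observe(operation string, statusCode int, start time.Time) {
	requestDuration.With(prometheus.Labels{
		"operation": operation,
		"code":      strconv.Itoa(statusCode),
	}).Observe(time.Since(start).Seconds())
}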

tsenart pushed a commit that referenced this issue May 13, 2020
This commit re-enables and extends the Prometheus instrumentation of the
extension registry API, so that we can chart latency in Grafana.

Related to #10554
tsenart pushed a commit that referenced this issue May 13, 2020
* registry: Instrument API with Prometheus metrics

This commit re-enables and extends the Prometheus instrumentation of the
extension registry API, so that we can chart latency in Grafana.

Related to #10554

* fixup! Address feedback
@tsenart
Contributor

tsenart commented May 13, 2020

@slimsag: I can't seem to save a new dashboard on sourcegraph.com. It'd be useful to be able to play around with those dashboards in the UI to ensure we get the right design, and then export that as JSON to the right git repo for persistence and automation of the provisioning. Any leads on how to do that? I have this at the moment:

{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": "-- Grafana --",
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "gnetId": null,
  "graphTooltip": 0,
  "id": null,
  "links": [],
  "panels": [
    {
      "aliasColors": {},
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": null,
      "fill": 1,
      "fillGradient": 0,
      "gridPos": {
        "h": 9,
        "w": 12,
        "x": 0,
        "y": 0
      },
      "hiddenSeries": false,
      "id": 2,
      "legend": {
        "avg": false,
        "current": false,
        "max": false,
        "min": false,
        "show": true,
        "total": false,
        "values": false
      },
      "lines": true,
      "linewidth": 1,
      "nullPointMode": "null",
      "options": {
        "dataLinks": []
      },
      "percentage": false,
      "pointradius": 2,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "histogram_quantile(0.99, sum by (le)(rate(src_registry_requests_duration_seconds_bucket{code=~\"^2.+\", operation=\"list\"}[5m])))",
          "interval": "",
          "legendFormat": "2xx",
          "refId": "A"
        },
        {
          "expr": "histogram_quantile(0.99, sum by (le)(rate(src_registry_requests_duration_seconds_bucket{code!~\"^2.+\", operation=\"list\"}[5m])))",
          "interval": "",
          "legendFormat": "Non 2xx",
          "refId": "B"
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "Extension Registry API: List Latency",
      "tooltip": {
        "shared": true,
        "sort": 0,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "short",
          "label": "seconds",
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    }
  ],
  "schemaVersion": 22,
  "style": "dark",
  "tags": [],
  "templating": {
    "list": []
  },
  "time": {
    "from": "now-6h",
    "to": "now"
  },
  "timepicker": {
    "refresh_intervals": [
      "5s",
      "10s",
      "30s",
      "1m",
      "5m",
      "15m",
      "30m",
      "1h",
      "2h",
      "1d"
    ]
  },
  "timezone": "",
  "title": "Extension Registry API",
  "uid": null,
  "variables": {
    "list": []
  },
  "version": 0
}

@unknwon
Member

unknwon commented May 13, 2020

Might be worth some cleanup.

How would we do this cleanup? Is there any data that can be deleted safely?

Does the @sourcegraph/web team have more context on this?

@felixfbecker
Contributor Author

Sounds like we are storing past versions in that table. I think those could be deleted, because AFAIK we always serve the latest version (and don't provide any way to introspect or fetch old versions). @lguychard @sqs is this correct?

The table is likely that big because we store in it (a) the extension bundle and (b) the source map for the extension bundle, which likely has inline sources, i.e. the entire contents of the repository plus its dependencies as strings in a JSON file.

I wonder if, instead of Postgres, we should just save these as files on disk...

I also wonder why the releases table would make the registry listing page slow in the first place. Intuitively I would expect that page to only query the extensions table; I can't think of why it would need to do a JOIN.
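
To make the cleanup idea concrete, a rough sketch (assuming a created_at column and an extension_id foreign key on registry_extension_releases — I haven't checked the real schema) of deleting everything except the latest release per extension:

package registry

import (
	"context"
	"database/sql"
)

// deleteOldReleases keeps only the newest release per extension and deletes
// the rest. Column names are assumptions; verify against the real schema
// before running anything like this.
func deleteOldReleases(ctx context.Context, db *sql.DB) (int64, error) {
	res, err := db.ExecContext(ctx, `
		DELETE FROM registry_extension_releases r
		WHERE r.created_at < (
			SELECT max(created_at)
			FROM registry_extension_releases
			WHERE extension_id = r.extension_id
		)`)
	if err != nil {
		return 0, err
	}
	return res.RowsAffected()
}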

@sqs
Member

sqs commented May 13, 2020 via email

@slimsag
Member

slimsag commented May 14, 2020

@slimsag: I can't seem to save a new dashboard on sourcegraph.com. It'd be useful to be able to play around with those dashboards in the UI to ensure we get the right design, and then export that as JSON to the right git repo for persistence and automation of the provisioning. Any leads on how to do that? I have this at the moment:

This approach is (intentionally) forbidden. You should add this to the frontend dashboard in the generator here: https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/tree/monitoring

Specifically, you will want to add a hidden row like this: https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/monitoring/frontend.go#L250-290

You should define Warning alerts for the queries you have (this is required).

I know the rationale for these constraints is not well documented currently; it's on me to write better docs here, and I'm going to do that very soon. If you have any questions in the meantime, please don't hesitate to ask.

@tsenart
Contributor

tsenart commented May 14, 2020

This approach is (intentionally) forbidden. You should add this to the frontend dashboard in the generator here: https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/tree/monitoring

How has your workflow been for iterating on a dashboard? Code-first seems quite cumbersome. I want to be able to iterate on a dashboard in the Grafana UI, then export it as JSON and check it in to whatever provisioning mechanism we have. What concerns do you have with this approach?

You should define Warning alerts for those queries you have (which is required).

I don't want to define any alerts yet. I just want a dashboard for visualization. How shall we proceed?

@Joelkw
Contributor

Joelkw commented Nov 25, 2020

@felixfbecker is this still an issue? Adding to our backlog but feel free to close if it's resolved.

@Joelkw Joelkw added this to To triage in Web Team :: Standalone Issues Backlog via automation Nov 25, 2020
@eseliger
Member

Yes, it is, but I believe it isn't a web task to fix (unless web owns backend code now as well).

@tsenart
Contributor

tsenart commented Nov 26, 2020

We can take another look at this next cycle.

@tsenart tsenart modified the milestones: Backlog, Cloud 2020-12-02 Nov 26, 2020
@felixfbecker
Contributor Author

I read through the thread and it sounds like you added a dashboard to track the performance. Do you have a link to that dashboard? I'd be curious to take a look at it.

@github-actions
Contributor

Heads up @Joelkw @felixfbecker - the "team/extensibility" label was applied to this issue.

@muratsu muratsu added this to To Triage 📥 in Extensibility :: Current Iteration Jun 30, 2021
@Joelkw Joelkw moved this from To Triage 📥 to Backlog in Extensibility :: Current Iteration Jul 9, 2021
@leifwalsh

leifwalsh commented Aug 24, 2021

The graphql?Extensions API is returning a Cache-Control header of no-cache, max-age=0. Any chance that could be set to something more reasonably cacheable? For example, a scheme where the initial load lets the browser use any cached response in the common case, but then asynchronously issues a non-cacheable (no-store, or more likely max-age=0) request to bust the cache if the user's extensions have changed.
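
For illustration, the closest off-the-shelf way to express that scheme is probably the stale-while-revalidate Cache-Control extension. A minimal Go sketch of the server side (the middleware name and directive values are placeholders I made up, not a concrete proposal):

package registry

import "net/http"

// withExtensionsCaching lets the browser reuse a recent cached response and
// refresh it in the background, instead of the current "no-cache, max-age=0".
func withExtensionsCaching(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// private: a user's extension list shouldn't be stored by shared caches.
		// max-age=60: reuse the cached response for up to a minute.
		// stale-while-revalidate=600: after that, serve the stale copy while
		// revalidating asynchronously.
		w.Header().Set("Cache-Control", "private, max-age=60, stale-while-revalidate=600")
		next.ServeHTTP(w, r)
	})
}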

@leifwalsh

It also looks like the response payload contains every extension's icon as a base64-encoded string, as well as each extension's readme, neither of which is actually shown on most page loads. Perhaps those could be URLs in the response instead of embedded content, and only be loaded (with caching) when the page needs to show them? This would probably decrease the size of the response payload a lot (it's currently over a megabyte for me, by far the largest response payload on a page load).
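
As a sketch of that suggestion, icons (and readmes) could be served from their own cacheable endpoint, with the GraphQL response carrying only a URL. Everything below — the handler, the lookupIcon callback, and the URL shape — is hypothetical:

package registry

import "net/http"

// serveExtensionIcon serves a single extension's icon from a dedicated,
// cacheable URL (e.g. /-/registry/icon?extensionID=...) instead of embedding
// it base64-encoded in the GraphQL payload. lookupIcon is a hypothetical
// helper returning the raw icon bytes and their content type.
func serveExtensionIcon(lookupIcon func(id string) ([]byte, string, error)) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		id := r.URL.Query().Get("extensionID")
		icon, contentType, err := lookupIcon(id)
		if err != nil {
			http.Error(w, "icon not found", http.StatusNotFound)
			return
		}
		// Icons rarely change, so let browsers cache them for a day.
		w.Header().Set("Content-Type", contentType)
		w.Header().Set("Cache-Control", "public, max-age=86400")
		w.Write(icon)
	}
}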

@ryanslade
Contributor

@tjkandala Tagging you since I think you recently did some performance work related to this.

@felixfbecker
Contributor Author

Reopening since #24626 said "Partially (but not fully) fixes"

@felixfbecker felixfbecker reopened this Sep 8, 2021
Extensibility :: Current Iteration automation moved this from Done 🚀 to In Progress 👀 Sep 8, 2021
@muratsu
Contributor

muratsu commented Sep 8, 2021

While we've only arrived at a partial fix, the response time has dropped from 900ms to ~70ms, and I think it's fair to consider 70ms fast enough :) Closing this issue since it's no longer relevant.
