This repository was archived by the owner on Sep 30, 2024. It is now read-only.

@emidoots
Member

@emidoots emidoots commented Jan 13, 2021

See #17218 for the general plan / work being done on code insights.

I am working from this codeinsights branch (which will be unstable, buggy, or broken at times). My aim is for this branch to run a very basic code insights backend in dev environments by EOW, including all the app infra changes we need: running TimescaleDB, running the DB migrations to create the schema, and having the GraphQL schema/API/backend make requests and serve data from TimescaleDB.

This branch itself won't be merged, rather I will be taking changes from here once they are solid and sending individual PRs for review (so they meet our quality bar, are not "one big PR", etc.) I will be posting updates here as I send out those PRs.

TODO:

  • Grafana dashboards -- issue filed elsewhere
  • testing: An end-to-end test where we use the GraphQL API to modify user settings/add insights, and the GraphQL API to query data after a while? -- issue filed elsewhere
  • Investigate adding more tests for enterprise/internal/insights/background/* - see https://github.com/sourcegraph/sourcegraph/pull/18267 -- issue filed elsewhere
  • Basic webhook fetching support -- decided to adopt push model, see other issue
  • Deploy TimescaleDB in non-dev envs
  • Generate schema.insights.md documentation
  • dev/db/squash_migrations.sh starts Postgres 9.6 if the connected DB version != 9.6, so it doesn't work with these migrations.
  • internal/db/schemadoc/main.go starts Postgres 9.6 if the connected DB version != 9.6, so it doesn't work with this DB schema yet.
  • Add code TODO for retroactively updating repo names in DB

Feb 21-26:

Feb 15-19:

Feb 8-9:

  • Implement DB store tests
  • Merge DB schema
  • Implement GraphQL backend tests

Feb 1-5:

Jan 18-22:

  • Determine where to store insights (user/org/global settings? likely not DB for now, but file an issue for later?)
  • Run migrations on server startup
  • Plan DB schema & queries/inserts
  • Generate separate schema.md files for codeintel/frontend DBs so we can generate schema.insights.md later
  • Tag TimescaleDB Docker image as sourcegraph/codeinsights-db
  • Run codeinsights-db as part of dev server.
  • Add migrations/codeinsights DB migration foundation
  • Add GraphQL backend scaffolding / stubs

Stephen Gutekanst added 3 commits January 12, 2021 17:11
Signed-off-by: Stephen Gutekanst <stephen.gutekanst@gmail.com>
…tend DBs

This is needed for us to be able to generate a schema.md file for the new Code Insights
DB, which will be a separate TimescaleDB deployment / cannot be part of the same Postgres
DB.

See #17217 for a more detailed explanation.

Signed-off-by: Stephen Gutekanst <stephen.gutekanst@gmail.com>
Signed-off-by: Stephen Gutekanst <stephen.gutekanst@gmail.com>
@sourcegraph sourcegraph deleted a comment from codecov bot Jan 13, 2021
@emidoots
Member Author

PR to generate separate schema.md files for codeintel/frontend DBs so we can generate a schema.md for code insights DB schema soon: https://github.com/sourcegraph/sourcegraph/pull/17228

Stephen Gutekanst added 7 commits January 12, 2021 20:34
Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>
Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>
Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>
Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>
Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>
@codecov

codecov bot commented Jan 16, 2021

Codecov Report

Merging #17227 (ec41897) into main (b96444a) will increase coverage by 0.10%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main   #17227      +/-   ##
==========================================
+ Coverage   51.88%   51.98%   +0.10%     
==========================================
  Files        1713     1707       -6     
  Lines       84963    84926      -37     
  Branches     7748     7510     -238     
==========================================
+ Hits        44082    44150      +68     
+ Misses      36978    36873     -105     
  Partials     3903     3903              
Flag Coverage Δ
go 51.03% <ø> (+0.19%) ⬆️
integration 30.53% <ø> (-0.16%) ⬇️
storybook 30.05% <ø> (-0.36%) ⬇️
typescript 54.29% <ø> (-0.12%) ⬇️
unit 34.80% <ø> (+0.04%) ⬆️
Impacted Files Coverage Δ
internal/db/dbutil/dbutil.go 16.82% <ø> (ø)
cmd/frontend/graphqlbackend/search_structural.go 44.56% <0.00%> (-15.44%) ⬇️
client/web/src/enterprise/site-admin/routes.ts 6.66% <0.00%> (-10.99%) ⬇️
cmd/frontend/graphqlbackend/search_symbols.go 10.00% <0.00%> (-8.83%) ⬇️
...nal/campaigns/resolvers/changeset_apply_preview.go 58.97% <0.00%> (-5.48%) ⬇️
internal/db/users.go 47.49% <0.00%> (-3.37%) ⬇️
cmd/frontend/auth/user.go 76.11% <0.00%> (-2.99%) ⬇️
cmd/frontend/graphqlbackend/zoekt.go 75.46% <0.00%> (-2.98%) ⬇️
internal/db/external_accounts.go 61.11% <0.00%> (-2.87%) ⬇️
enterprise/internal/campaigns/syncer/syncer.go 61.71% <0.00%> (-2.61%) ⬇️
... and 157 more


-- Metadata about this event, this can be any arbitrary JSON metadata which will be returned
-- when querying events, but cannot be filtered on.
metadata jsonb NOT NULL,

-- the repository name at the time the event was created. Note that the repository name may
-- have changed since the event was created (e.g. if the repo was renamed), in which case this
-- describes the outdated repository name.
repo_name citext
Member Author

Ongoing discussions about this:

@emidoots
Member Author

PR to tag the TimescaleDB image as sourcegraph/codeinsights-db: https://github.com/sourcegraph/sourcegraph/pull/17427

@emidoots
Member Author

PR to run codeinsights-db (TimescaleDB) as part of dev server: https://github.com/sourcegraph/sourcegraph/pull/17431

@emidoots
Member Author

PR to add codeinsights-db migrations foundation: https://github.com/sourcegraph/sourcegraph/pull/17432

Stephen Gutekanst added 8 commits January 19, 2021 13:16
Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>
…sights-db

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>
Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>
Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>
Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>
Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

Stephen Gutekanst and others added 4 commits February 12, 2021 16:26
* Query metadata for points.
* Improve formatting of test data.
* Change incorrect `Series *int32` to `Series *string`
* Add TODOs for improved filtering abilities in the future.

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>
Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>
This PR adds support for the store to record data points.

Stacked on top of #18254

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

Stephen Gutekanst added 4 commits February 12, 2021 17:30
Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>
Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>
This will be used both from the GraphQL resolver layer, as well as the
background workers - both of which need to scan the user/org/global settings
for insights defined within.

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>
Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

@emidoots
Member Author

emidoots commented Feb 16, 2021

Data size & retention policy speculation

In the original RFC we speculated the backend:

  • Must support insights over 16,000 repos.
  • Ideally over our largest customers' ~500k+ repos.
  • We speculated customers would have at most ~500 dashboards (spread across ~2k or so devs).
  • We speculated there may be 5 panels per dashboard, each with one to a handful of data series recorded per repo.
  • Examples of series cardinality included:
    • A panel showing 20 Java versions, recorded per repo
    • A panel showing 40 languages, recorded per repo
    • A panel showing PR states (open/closed/merged), recorded per repo, assuming a max of around 20,000 PRs.
  • In an ideal world, keeping data for a long period of time such as 6-12 months (but maybe with limitations as we start to talk about older data).

We will obviously need horizontal scaling in the future to achieve the much larger scales (which should be easy to add with TimescaleDB, but I haven't looked into it.)

In theory, the best interval at which to record data points would be whenever repositories receive new commits. We are doing global searches (currently) at some interval, though, as that is a fair bit easier to implement - and so a question is at what interval we should do that for all repositories.

I would speculate that given all repositories on a Sourcegraph instance - up to 500k - we are unlikely to see a commit push frequency higher than every 5 minutes in general.

We can then speculate on the # of table rows per recording interval based on:

  • 20k-500k repos
  • 10-500 dashboards with ~5 panels each (50-2500 panels)
  • 5-40 semi-unique series per dashboard (let's say 50% are reused / same queries.)
  • # of table rows per recording interval == repos * dashboard_panels * unique_series_per_dashboard
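The formula in the last bullet can be evaluated directly. This is a minimal Python sketch; the function name `rows_per_interval` is hypothetical and not part of the Sourcegraph codebase. It simply multiplies the three factors for the smallest and largest scenarios in the ranges above.

```python
def rows_per_interval(repos: int, panels: int, series_per_dashboard: int) -> int:
    """Rows written per recording interval:
    repos * dashboard_panels * unique_series_per_dashboard."""
    return repos * panels * series_per_dashboard


# Lower end of the ranges above: 20k repos, 50 panels, 5 series.
small = rows_per_interval(20_000, 50, 5)
# Upper end of the ranges above: 500k repos, 2,500 panels, 40 series.
large = rows_per_interval(500_000, 2_500, 40)

print(f"{small:,} rows per recording interval")
print(f"{large:,} rows per recording interval")
```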

Some examples:

  • 20k_repos * 50_panels * 5_series == 5,000,000 rows / recording interval (e.g. 5 minutes)
  • 500k_repos * 2500_panels * 40_series == 50,000,000,000 rows / recording interval (e.g. 5 minutes)

Obviously that's a lot of data and will most likely not "just work" magically. https://docs.timescale.com/latest/faq#scaling says they regularly test with 10+ billion rows and inserting 100-200k rows / second. If we assume we have a budget of 10 billion rows and record 5 million rows every 5 minutes, we would only be able to store the last 2,000 recording intervals (about 6.9 days).
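The retention estimate is the row budget divided by the rows written per interval, multiplied by the interval length. Here is a rough sketch with a hypothetical helper (`retention_days` is an illustrative name, not an existing API), evaluated for a few insert rates against a 10 billion row budget:

```python
def retention_days(row_budget: int, rows_per_interval: int, interval_minutes: float) -> float:
    """Days of history that fit in row_budget when rows_per_interval rows
    are written every interval_minutes."""
    intervals = row_budget / rows_per_interval
    return intervals * interval_minutes / (60 * 24)


ROW_BUDGET = 10_000_000_000  # the 10 billion row figure quoted above

# A few illustrative insert rates (rows per 5-minute interval).
for rows in (5_000_000, 20_000_000, 100_000_000):
    days = retention_days(ROW_BUDGET, rows, interval_minutes=5)
    print(f"{rows:>11,} rows / 5 min -> {days:.1f} days of history")
```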

But that's all very speculative, so let's try some real numbers.

Quick benchmarks

Inserting 1 year of data points for a single repository, with no metadata at a 5 minute interval takes 6-10s using:

-- Insert one synthetic data point every 5 minutes over a full year for a single repository.
INSERT INTO series_points(
    time,
    series_id,
    value,
    metadata_id,
    repo_id,
    repo_name_id,
    original_repo_name_id)
SELECT time,
    0,                 -- series_id
    random()*80 - 40,  -- value: uniformly random in [-40, 40)
    (SELECT id FROM metadata WHERE metadata = '{"hello": "world", "languages": ["Go", "Python", "Java"]}'),
    2,                 -- repo_id
    (SELECT id FROM repo_names WHERE name = 'github.com/gorilla/mux-renamed'),
    (SELECT id FROM repo_names WHERE name = 'github.com/gorilla/mux-original')
FROM generate_series(TIMESTAMP '2021-01-01 00:00:00', TIMESTAMP '2022-01-01 00:00:00', INTERVAL '5 min') AS time;
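For a sense of scale, the number of rows this statement inserts can be worked out without a database. The sketch below mirrors `generate_series`, which includes both endpoints when the step divides the range evenly; `count_points` is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def count_points(start: datetime, stop: datetime, step: timedelta) -> int:
    """Number of timestamps from start to stop (inclusive) spaced step apart."""
    return (stop - start) // step + 1


rows = count_points(
    datetime(2021, 1, 1),
    datetime(2022, 1, 1),
    timedelta(minutes=5),
)
print(f"{rows:,} rows for one repository and one series")
```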

Idle memory / CPU usage is quite minimal (41 MiB / 1% CPU), still, and disk usage grows by about 60 MiB. Most of the costs outside of disk storage will probably be incurred at query time - which can be helped via precomputed aggregations Timescale offers.

But the 60 MiB disk usage gives us a pretty good idea: 20,000 repos * 60 MiB == ~1171 GiB of disk storage for 1 year of data points recorded every 5min for 20k repos.
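The extrapolation above is a single multiplication. This is a quick sketch with a hypothetical helper name. The 60 MiB per repo-year figure is the measurement quoted above, and linear scaling with repository count is an assumption.

```python
MIB_PER_REPO_YEAR = 60  # measured disk growth for 1 repo, 1 year, 5-minute interval


def disk_gib(repos: int, mib_per_repo: float = MIB_PER_REPO_YEAR) -> float:
    """Estimated disk usage in GiB, assuming usage grows linearly with repo count."""
    return repos * mib_per_repo / 1024


for repos in (1_000, 20_000, 500_000):
    print(f"{repos:>7,} repos -> {disk_gib(repos):,.0f} GiB per year of data")
```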

Based on this, I will go with a default recording interval of 10 minutes (we can adjust this if needed later) which would give us ~40k repos worth of data points for a single series type on a reasonable deployment, i.e. it gets us decently far for the initial version.
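Here is one way to read this estimate. It rests on two assumptions that are not stated in the original: disk usage scales linearly with the number of points recorded, and the deployment has roughly the ~1171 GiB disk budget computed above. The helper name `repos_supported` is hypothetical.

```python
MIB_PER_REPO_YEAR_AT_5_MIN = 60  # measured above for a 5-minute interval


def repos_supported(disk_budget_gib: float, interval_minutes: float) -> float:
    """Repos whose 1-year history fits in disk_budget_gib at the given interval,
    assuming storage is proportional to the number of recorded points."""
    mib_per_repo = MIB_PER_REPO_YEAR_AT_5_MIN * 5 / interval_minutes
    return disk_budget_gib * 1024 / mib_per_repo


for interval in (5, 10, 30):
    print(f"{interval:>2} min interval -> ~{repos_supported(1171.875, interval):,.0f} repos")
```

Doubling the interval halves the points per repository, which is where the ~40k figure for 10 minutes comes from under these assumptions.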

The next important things to look at for scaling are:

  • A data retention policy that deletes data points after 1yr.
  • Continuous aggregation: we may be able to do this and keep lower-precision historical data, still at a per-repo level (e.g. one data point per day for data older than 3 months).
  • Keep repository-level metrics for the past ~3 months, but then only keep global (non-repo-specific) insights after that.
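To illustrate the downsampling idea in the second bullet, here is a plain-Python sketch. It is not TimescaleDB's continuous aggregation feature, only the rule itself: points newer than a cutoff are kept as-is and older points are collapsed to one per day. Averaging the day's values and the 90-day cutoff are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def downsample(points, now, full_resolution=timedelta(days=90)):
    """Keep points newer than the cutoff unchanged; collapse older points
    into one averaged point per calendar day (stamped at midnight)."""
    cutoff = now - full_resolution
    recent = [(t, v) for t, v in points if t >= cutoff]
    by_day = defaultdict(list)
    for t, v in points:
        if t < cutoff:
            by_day[t.date()].append(v)
    daily = [
        (datetime.combine(day, datetime.min.time()), sum(vs) / len(vs))
        for day, vs in sorted(by_day.items())
    ]
    return daily + recent


now = datetime(2021, 6, 1)
# One old day recorded at 10-minute resolution (144 points), plus one recent point.
points = [(datetime(2021, 1, 1) + timedelta(minutes=10 * i), float(i)) for i in range(144)]
points.append((now - timedelta(hours=1), 7.0))

result = downsample(points, now)
print(len(points), "points ->", len(result), "points")
```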

Lots of stuff/options to explore here in the TimescaleDB docs.

Stephen Gutekanst added 2 commits February 15, 2021 19:53
Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>
@emidoots
Member Author

insights: actually enqueue/record insights from global settings #18314

@emidoots
Member Author

insights: correct over-reporting / aggregation of data points #18632

@emidoots emidoots closed this Mar 12, 2021
@emidoots emidoots deleted the codeinsights branch March 12, 2021 00:32
Labels

code-insights Issues related to the Code Insights product
