[WIP] Codeinsights MVP backend #17227

emidoots · 2021-01-13T03:15:01Z

See #17218 for the general plan / work being done on code insights.

I am working from this codeinsights branch (which will be unstable/buggy/broken sometimes). My aim is for this branch to run a very basic code insights backend in dev environments by EOW, including all the app infra/changes we need: running TimescaleDB, doing the DB migrations to create the schema, and the GraphQL schema/API/backend making requests and serving data from Timescale.

This branch itself won't be merged. Instead, I will take changes from here once they are solid and send individual PRs for review (so they meet our quality bar, are not "one big PR", etc.). I will post updates here as I send out those PRs.

TODO:

Grafana dashboards -- issue filed elsewhere
testing: An end-to-end test where we use the GraphQL API to modify user settings/add insights, and the GraphQL API to query data after a while? -- issue filed elsewhere
Investigate adding more tests for enterprise/internal/insights/background/* - see https://github.com/sourcegraph/sourcegraph/pull/18267 -- issue filed elsewhere
Basic webhook fetching support -- decided to adopt push model, see other issue
Deploy TimescaleDB in non-dev envs
Generate schema.insights.md documentation
dev/db/squash_migrations.sh starts Postgres 9.6 if the connected DB version != 9.6, so it doesn't work with these migrations.
internal/db/schemadoc/main.go starts Postgres 9.6 if the connected DB version != 9.6, so it doesn't work with this DB schema yet.
Add code TODO for retroactively updating repo names in DB

Feb 21-26:

insights: correct over-reporting / aggregation of data points #18632

Feb 15-19:

Basic search query execution & storage
Investigate using go-mockgen for GraphQL resolver tests: https://github.com/sourcegraph/sourcegraph/pull/18075#discussion_r572359571

Feb 8-9:

Implement DB store tests
Merge DB schema
Implement GraphQL backend tests

Feb 1-5:

Implement DB store
Implement GraphQL backend
Investigate replacing repo_names and metadata with https://docs.timescale.com/latest/using-timescaledb/compression (Tomas suggested, looks really cool) -- CONCLUSION: https://github.com/sourcegraph/sourcegraph/pull/17227#issuecomment-773704129
Start TimescaleDB in testing env; run migrations (enterprise/internal/insights: add basic store package + testing infrastructure #17733; insights: execute database migrations #17586)

Jan 18-22:

Determine where to store insights (user/org/global settings? likely not DB for now, but file an issue for later?)
Run migrations on server startup
Plan DB schema & queries/inserts
Generate separate schema.md files for codeintel/frontend DBs so we can generate schema.insights.md later
Tag TimescaleDB Docker image as sourcegraph/codeinsights-db
Run codeinsights-db as part of dev server.
Add migrations/codeinsights DB migration foundation
Add GraphQL backend scaffolding / stubs

Signed-off-by: Stephen Gutekanst <stephen.gutekanst@gmail.com>

…tend DBs This is needed for us to be able to generate a schema.md file for the new Code Insights DB, which will be a separate TimescaleDB deployment / cannot be part of the same Postgres DB. See #17217 for a more detailed explanation. Signed-off-by: Stephen Gutekanst <stephen.gutekanst@gmail.com>

Signed-off-by: Stephen Gutekanst <stephen.gutekanst@gmail.com>

emidoots · 2021-01-13T03:25:53Z

PR to generate separate schema.md files for codeintel/frontend DBs so we can generate a schema.md for code insights DB schema soon: https://github.com/sourcegraph/sourcegraph/pull/17228

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

codecov · 2021-01-16T02:27:16Z

Codecov Report

Merging #17227 (ec41897) into main (b96444a) will increase coverage by 0.10%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main   #17227      +/-   ##
==========================================
+ Coverage   51.88%   51.98%   +0.10%     
==========================================
  Files        1713     1707       -6     
  Lines       84963    84926      -37     
  Branches     7748     7510     -238     
==========================================
+ Hits        44082    44150      +68     
+ Misses      36978    36873     -105     
  Partials     3903     3903

| Flag | Coverage Δ | |
|---|---|---|
| go | `51.03% <ø> (+0.19%)` | ⬆️ |
| integration | `30.53% <ø> (-0.16%)` | ⬇️ |
| storybook | `30.05% <ø> (-0.36%)` | ⬇️ |
| typescript | `54.29% <ø> (-0.12%)` | ⬇️ |
| unit | `34.80% <ø> (+0.04%)` | ⬆️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| internal/db/dbutil/dbutil.go | `16.82% <ø> (ø)` | |
| cmd/frontend/graphqlbackend/search_structural.go | `44.56% <0.00%> (-15.44%)` | ⬇️ |
| client/web/src/enterprise/site-admin/routes.ts | `6.66% <0.00%> (-10.99%)` | ⬇️ |
| cmd/frontend/graphqlbackend/search_symbols.go | `10.00% <0.00%> (-8.83%)` | ⬇️ |
| ...nal/campaigns/resolvers/changeset_apply_preview.go | `58.97% <0.00%> (-5.48%)` | ⬇️ |
| internal/db/users.go | `47.49% <0.00%> (-3.37%)` | ⬇️ |
| cmd/frontend/auth/user.go | `76.11% <0.00%> (-2.99%)` | ⬇️ |
| cmd/frontend/graphqlbackend/zoekt.go | `75.46% <0.00%> (-2.98%)` | ⬇️ |
| internal/db/external_accounts.go | `61.11% <0.00%> (-2.87%)` | ⬇️ |
| enterprise/internal/campaigns/syncer/syncer.go | `61.71% <0.00%> (-2.61%)` | ⬇️ |

... and 157 more

emidoots · 2021-01-19T19:12:42Z

migrations/codeinsights/1000000001_initial_schema.up.sql

+
+    -- Metadata about this event, this can be any arbitrary JSON metadata which will be returned
+    -- when querying events, but cannot be filtered on.
+    metadata jsonb NOT NULL,


Ongoing discussion about this: https://sourcegraph.slack.com/archives/C014ZCKMCAV/p1610764544015200

emidoots · 2021-01-19T19:13:46Z

migrations/codeinsights/1000000001_initial_schema.up.sql

+    -- the repository name at the time the event was created. Note that the repository name may
+    -- have changed since the event was created (e.g. if the repo was renamed), in which case this
+    -- describes the outdated repository name.
+    repo_name citext


Ongoing discussions about this:

repo vs. global vs. other association of events: https://sourcegraph.slack.com/archives/C014ZCKMCAV/p1610763511014200

what happens if repositories get renamed/removed/etc: https://sourcegraph.slack.com/archives/C014ZCKMCAV/p1610762810012800

emidoots · 2021-01-19T19:20:16Z

PR to tag the TimescaleDB image as sourcegraph/codeinsights-db: https://github.com/sourcegraph/sourcegraph/pull/17427

emidoots · 2021-01-19T19:52:08Z

PR to run codeinsights-db (TimescaleDB) as part of dev server: https://github.com/sourcegraph/sourcegraph/pull/17431

emidoots · 2021-01-19T20:04:54Z

PR to add codeinsights-db migrations foundation: https://github.com/sourcegraph/sourcegraph/pull/17432

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

…sights-db Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

emidoots · 2021-02-12T22:09:37Z

insights: change series_id to string (hash); add index #18230

* Query metadata for points.
* Improve formatting of test data.
* Change incorrect `Series *int32` to `Series *string`
* Add TODOs for improved filtering abilities in the future.

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

This PR adds support for the store to record data points. Stacked on top of #18254 Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

emidoots · 2021-02-12T23:55:31Z

insights: store: query metadata & other minor improvements #18254

emidoots · 2021-02-12T23:56:42Z

insights: store: add support for recording data points #18255

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

This will be used both from the GraphQL resolver layer, as well as the background workers - both of which need to scan the user/org/global settings for insights defined within. Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

emidoots · 2021-02-13T01:08:54Z

insights: add new discovery package for locating insights (and use it) #18264

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

emidoots · 2021-02-13T01:23:38Z

migrations: add insights_query_runner_jobs table #18265

emidoots · 2021-02-16T00:49:40Z

Data size & retention policy speculation

In the original RFC we speculated the backend:

Must support insights over 16,000 repos.
Ideally over our largest customers' ~500k+ repos.
We speculated customers would at max have ~500 dashboards (spread across ~2k or so devs)
We speculated there may be 5 panels per dashboard, each with one to a handful of data series recorded per-repo.
Examples of series cardinality included:
- A panel showing 20 java versions, recorded per repo
- A panel showing 40 languages, recorded per repo
- A panel showing PR states (open/closed/merged), recorded per repo - assuming a max of about 20,000 PRs.
In an ideal world, keeping data for some long period of time like 6-12mo (but maybe with limitations as we start to talk about older data.)

We will obviously need horizontal scaling in the future to achieve the much larger scales (which should be easy to add with TimescaleDB, but I haven't looked into it.)

In theory, the best interval at which to record data points would be whenever repositories receive new commits. We are doing global searches (currently) at some interval, though, as that is a fair bit easier to implement - and so a question is at what interval we should do that for all repositories.

I would speculate that given all repositories on a Sourcegraph instance - up to 500k - we are unlikely to see a commit push frequency higher than every 5 minutes in general.

We can then speculate on the # of table rows per recording interval based on:

20-500k repos
10-500 dashboards with ~5 panels each (50-2500 panels)
5-40 semi-unique series per dashboard (let's say 50% are reused / same queries.)
# of table rows per recording interval == repos * dashboard_panels * unique_series_per_dashboard

Some examples:

20k_repos50_panels5_series == 20,000,000 rows / recording interval (e.g. 5 minutes)
500k_repos2500_panels40_series == 50,000,000 rows/recording interval (e.g. 5 minutes)

Obviously that's a lot of data and will most likely not "just work" magically. https://docs.timescale.com/latest/faq#scaling says they regularly test with 10+ billion rows and inserting 100-200k rows / second. If we assume we have a budget of 10 billion rows and record 20 million rows every 5 minutes, we would only be able to store the last 500 recording intervals (about 1.7 days).

But that's all very speculative; let's try some real measurements.

Quick benchmarks

Inserting 1 year of data points for a single repository, with no metadata at a 5 minute interval takes 6-10s using:

INSERT INTO series_points(
    time,
    series_id,
    value,
    metadata_id,
    repo_id,
    repo_name_id,
    original_repo_name_id)
SELECT time,
    0,
    random()*80 - 40,
    (SELECT id FROM metadata WHERE metadata = '{"hello": "world", "languages": ["Go", "Python", "Java"]}'),
    2,
    (SELECT id FROM repo_names WHERE name = 'github.com/gorilla/mux-renamed'),
    (SELECT id FROM repo_names WHERE name = 'github.com/gorilla/mux-original')
    FROM generate_series(TIMESTAMP '2021-01-01 00:00:00', TIMESTAMP '2022-01-01 00:00:00', INTERVAL '5 min') AS time;

Idle memory / CPU usage is still quite minimal (41 MiB / 1% CPU), and disk usage grows by about 60 MiB. Most of the costs outside of disk storage will probably be incurred at query time, which can be helped by the precomputed aggregations Timescale offers.

But the 60 MiB disk usage gives us a pretty good idea: 20,000 repos * 60 MiB == ~1171 GiB of disk storage for 1 year of data points recorded every 5min for 20k repos.

Based on this, I will go with a default recording interval of 10 minutes (we can adjust this if needed later) which would give us ~40k repos worth of data points for a single series type on a reasonable deployment, i.e. it gets us decently far for the initial version.
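As a rough check on these disk estimates, here is a minimal Go sketch. It assumes, without having measured it, that disk usage scales linearly with the number of recorded points, so doubling the interval halves the per-repo cost. The `diskGiB` function and the constant names are hypothetical.

```go
package main

import "fmt"

const (
	// Measured above: ~60 MiB of disk for one repository and one series
	// recorded every 5 minutes for one year.
	mibPerRepoYear   = 60.0
	baselineInterval = 5.0 // minutes
)

// diskGiB estimates the disk usage in GiB for one year of a single series
// across the given number of repositories. Assumption: usage is proportional
// to the number of data points, i.e. inversely proportional to the interval.
func diskGiB(repos int, intervalMinutes float64) float64 {
	mib := float64(repos) * mibPerRepoYear * baselineInterval / intervalMinutes
	return mib / 1024
}

func main() {
	fmt.Printf("20k repos @ 5 min:  %.1f GiB\n", diskGiB(20_000, 5))
	fmt.Printf("40k repos @ 10 min: %.1f GiB\n", diskGiB(40_000, 10))
}
```

Under that linearity assumption both configurations need the same amount of disk, which is why a 10 minute interval covers roughly twice as many repositories.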

The next important things to look at for scaling are:

A data retention policy that deletes data points after 1yr.
Continuous aggregation: we may be able to do this and keep lower-precision historical data, but still at a per-repo level (e.g. one data point per day if older than 3mo.)
Keep repository-level metrics for the past ~3 months, but then only keep global (non-repo-specific) insights after that.

Lots of stuff/options to explore here in the TimescaleDB docs.

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

emidoots · 2021-02-16T02:54:55Z

insights: actually enqueue/record insights from global settings #18314

emidoots · 2021-02-17T02:20:38Z

ci: build docker-images/codeinsights-db #18350
dev: allow codeinsights-db more time to startup #18359
insights: turn on for Sourcegraph.com #18360

emidoots · 2021-02-24T22:58:56Z

insights: correct over-reporting / aggregation of data points #18632

Stephen Gutekanst added 3 commits January 12, 2021 17:11

migrations: add codeinsights migrations (based on migrations/codeintel)

fa89122

Signed-off-by: Stephen Gutekanst <stephen.gutekanst@gmail.com>

internal/db/dbutil: add codeinsights migrations

3d7323e

Signed-off-by: Stephen Gutekanst <stephen.gutekanst@gmail.com>

sourcegraph deleted a comment from codecov bot Jan 13, 2021

Stephen Gutekanst added 7 commits January 12, 2021 20:34

Merge remote-tracking branch 'origin/main' into codeinsights

ec41897

dev: run codeinsights-db (TimescaleDB)

608d656

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

docker-images: add codeinsights-db (re-tag TimescaleDB)

0bf010e

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

add TODOs about places needing updates for timescaledb

85d359e

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

running personal notes (to be moved to proper dev docs later)

728e4c0

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

add intial DB schema

3477821

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

Merge remote-tracking branch 'origin/main' into codeinsights

1c695d4

felixfbecker reviewed Jan 19, 2021

migrations/codeinsights/1000000001_initial_schema.up.sql (outdated)

Merge remote-tracking branch 'origin/main' into codeinsights

111da5b

emidoots commented Jan 19, 2021

Merge remote-tracking branch 'origin/main' into codeinsights

fc03157

Merge remote-tracking branch 'origin/main' into codeinsights

7f68e88

Stephen Gutekanst added 8 commits January 19, 2021 13:16

rename histogram_events -> gauge_events; document table

66bf94b

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

Merge remote-tracking branch 'origin/main' into codeinsights

3f745c5

dev/drop-entire-local-database-and-redis.sh - make it work for codein…

25482c9

…sights-db Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

README: update

1f755f9

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

DB schema take 2

314dcfd

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

note metadata filtering

7f2412b

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

Merge remote-tracking branch 'origin/main' into codeinsights

8f73de7

scratch the surface of aggregation

e1528b6

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

Merge remote-tracking branch 'origin/main' into codeinsights

17d6e26

Stephen Gutekanst and others added 4 commits February 12, 2021 16:26

Update README.codeinsights.md

2e52637

update resolver data

139e77d

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

insights: store: add support for recording data points

70c85f8

This PR adds support for the store to record data points. Stacked on top of #18254 Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

Stephen Gutekanst added 4 commits February 12, 2021 17:30

go generate ./enterprise/internal/insights/store/ (regenerate mocks)

07b2634

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

go test -update

c645642

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

insights: resolvers: use the new discovery package

c0b23b3

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

Stephen Gutekanst added 4 commits February 12, 2021 18:12

gofmt

bd4f455

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

Merge remote-tracking branch 'origin/main' into codeinsights

b1a847b

Merge branch 'sg/insights-store-inserts' into codeinsights

428a2e2

Merge branch 'sg/insights-discovery' into codeinsights

ee4d0a0

emidoots mentioned this pull request Feb 16, 2021

insights: add background workers which execute search queries and store insights #18267

Merged

Stephen Gutekanst added 2 commits February 15, 2021 19:53

Merge remote-tracking branch 'origin/main' into codeinsights

3deea82

fix merge conflicts

066fbcf

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>

Merge remote-tracking branch 'origin/main' into codeinsights

6fdcbdc

Update README.codeinsights.md

b007308

emidoots closed this Mar 12, 2021

emidoots deleted the codeinsights branch March 12, 2021 00:32

sourcegraph-bot mentioned this pull request Jul 7, 2021

insights: frontend / backend integration #22647

Closed

5 tasks
