[WIP] Codeinsights MVP backend #17227
Signed-off-by: Stephen Gutekanst <stephen.gutekanst@gmail.com>
…tend DBs This is needed for us to be able to generate a schema.md file for the new Code Insights DB, which will be a separate TimescaleDB deployment / cannot be part of the same Postgres DB. See #17217 for a more detailed explanation. Signed-off-by: Stephen Gutekanst <stephen.gutekanst@gmail.com>
PR to generate separate schema.md files for codeintel/frontend DBs so we can generate a schema.md for code insights DB schema soon: https://github.com/sourcegraph/sourcegraph/pull/17228
Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>
Codecov Report
@@ Coverage Diff @@
## main #17227 +/- ##
==========================================
+ Coverage 51.88% 51.98% +0.10%
==========================================
Files 1713 1707 -6
Lines 84963 84926 -37
Branches 7748 7510 -238
==========================================
+ Hits 44082 44150 +68
+ Misses 36978 36873 -105
Partials 3903 3903
-- Metadata about this event, this can be any arbitrary JSON metadata which will be returned
-- when querying events, but cannot be filtered on.
metadata jsonb NOT NULL,
Ongoing discussion about this: https://sourcegraph.slack.com/archives/C014ZCKMCAV/p1610764544015200
-- the repository name at the time the event was created. Note that the repository name may
-- have changed since the event was created (e.g. if the repo was renamed), in which case this
-- describes the outdated repository name.
repo_name citext
Ongoing discussions about this:
- repo vs. global vs. other association of events: https://sourcegraph.slack.com/archives/C014ZCKMCAV/p1610763511014200
- what happens if repositories get renamed/removed/etc: https://sourcegraph.slack.com/archives/C014ZCKMCAV/p1610762810012800
PR to tag the TimescaleDB image as
PR to run codeinsights-db (TimescaleDB) as part of dev server: https://github.com/sourcegraph/sourcegraph/pull/17431
PR to add codeinsights-db migrations foundation: https://github.com/sourcegraph/sourcegraph/pull/17432
…sights-db Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>
* Query metadata for points.
* Improve formatting of test data.
* Change incorrect `Series *int32` to `Series *string`
* Add TODOs for improved filtering abilities in the future.
Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>
This PR adds support for the store to record data points. Stacked on top of #18254 Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>
This will be used both from the GraphQL resolver layer, as well as the background workers - both of which need to scan the user/org/global settings for insights defined within. Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>
Data size & retention policy speculation
In the original RFC we speculated the backend:
We will obviously need horizontal scaling in the future to achieve the much larger scales (which should be easy to add with TimescaleDB, but I haven't looked into it).

In theory, the best interval at which to record data points would be whenever repositories receive new commits. We are doing global searches (currently) at some interval, though, as that is a fair bit easier to implement - and so a question is at what interval we should do that for all repositories.

I would speculate that given all repositories on a Sourcegraph instance - up to 500k - we are unlikely to see a commit push frequency higher than every 5 minutes in general. We can then speculate on the
Some examples:
Obviously that's a lot of data and will most likely not "just work" magically. https://docs.timescale.com/latest/faq#scaling says they regularly test with 10+ billion rows and inserting 100-200k rows / second. If we assume we have a budget of 10 billion rows and record 20mil/5min, we would only be able to store the last 500 recording intervals (1.7 days). But that's all very speculative, let's try some real

Quick benchmarks
Inserting 1 year of data points for a single repository, with no metadata, at a 5 minute interval takes 6-10s using:

Idle memory / CPU usage is still quite minimal (41 MiB / 1% CPU), and disk usage grows by about 60 MiB. Most of the costs outside of disk storage will probably be incurred at query time - which can be helped via the precomputed aggregations Timescale offers.

But the 60 MiB disk usage gives us a pretty good idea: 20,000 repos * 60 MiB == ~1171 GiB of disk storage for 1 year of data points recorded every 5min for 20k repos.

Based on this, I will go with a default recording interval of 10 minutes (we can adjust this later if needed), which would give us ~40k repos worth of data points for a single series type on a reasonable deployment, i.e. it gets us decently far for the initial version.

The next important things to look at for scaling are:
Lots of stuff/options to explore here in the TimescaleDB docs.
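
The arithmetic in the estimates above can be reproduced with a small sketch. This is only an illustration of the numbers quoted in this comment: the function names are hypothetical, and the 10-minute figure assumes disk usage scales linearly with the number of recorded points, which the benchmark did not measure directly.

```python
# Back-of-the-envelope capacity estimates using the figures quoted above.
# Assumption (not measured): disk usage scales linearly with the number of
# data points, so halving the recording frequency halves the per-repo cost.
# All function names here are hypothetical helpers for illustration only.

MIB_PER_GIB = 1024
MINUTES_PER_DAY = 60 * 24


def retention_days(row_budget: int, rows_per_interval: int, interval_minutes: int) -> float:
    """Days of history that fit into a fixed row budget."""
    intervals = row_budget // rows_per_interval
    return intervals * interval_minutes / MINUTES_PER_DAY


def scaled_mib_per_repo_year(measured_mib: float, measured_interval_min: int,
                             target_interval_min: int) -> float:
    """Scale a measured per-repo yearly disk cost to another recording interval."""
    return measured_mib * measured_interval_min / target_interval_min


def disk_gib(repos: int, mib_per_repo_year: float) -> float:
    """GiB of disk needed for one year of data points across all repos."""
    return repos * mib_per_repo_year / MIB_PER_GIB


if __name__ == "__main__":
    # 10 billion rows, 20 million rows every 5 minutes.
    print(retention_days(10_000_000_000, 20_000_000, 5))
    # 20k repos at the measured 60 MiB per repo-year (5 minute interval).
    print(disk_gib(20_000, 60))
    # 40k repos at a 10 minute interval, under the linear-scaling assumption.
    print(disk_gib(40_000, scaled_mib_per_repo_year(60, 5, 10)))
```

Under these assumptions the 10 billion row budget covers 500 intervals (about 1.7 days), and both the 20k-repo / 5-minute and 40k-repo / 10-minute cases come out at roughly 1171 GiB, which is where the ~40k repos figure comes from.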
insights: actually enqueue/record insights from global settings #18314
insights: correct over-reporting / aggregation of data points #18632
See #17218 for the general plan / work being done on code insights.
I am working from this `codeinsights` branch (which will be unstable/buggy/broken sometimes), and my aim is to have this branch running a very basic code insights backend in dev environments by EOW, including all the app infra/changes we need, running timescaledb, doing the DB migrations to create the schema, and the GraphQL schema/API/backend making requests and serving data from timescale.

This branch itself won't be merged. Rather, I will be taking changes from here once they are solid and sending individual PRs for review (so they meet our quality bar, are not "one big PR", etc.) I will be posting updates here as I send out those PRs.
TODO:
- `dev/db/squash_migrations.sh` starts Postgres 9.6 if the connected DB version != 9.6, so it doesn't work with these migrations.
- `internal/db/schemadoc/main.go` starts Postgres 9.6 if the connected DB version != 9.6, so it doesn't work with this DB schema yet.

Feb 21-26:
Feb 15-19:
Feb 8-9:
Feb 1-5:
- `repo_names` and `metadata` with https://docs.timescale.com/latest/using-timescaledb/compression (Tomas suggested, looks really cool) -- CONCLUSION: https://github.com/sourcegraph/sourcegraph/pull/17227#issuecomment-773704129

Jan 18-22:
- `sourcegraph/codeinsights-db`
- `codeinsights-db` as part of dev server.
- `migrations/codeinsights` DB migration foundation