
Switch to EAV tables in GraphQL mirror #1313

Closed
wchargin opened this issue Aug 21, 2019 · 7 comments
@wchargin
Contributor

The GraphQL mirror currently creates one table per object type in the
GitHub schema, used to store primitive fields, with schemata like this
(formatted for readability):

/* 1 */ CREATE TABLE "primitives_Repository" (
/* 2 */     id TEXT NOT NULL PRIMARY KEY,
/* 3 */     "url",
/* 4 */     "name",
/* 5 */     "createdAt",
/* 6 */     FOREIGN KEY(id) REFERENCES objects(id)
/* 7 */ );

Lines 3–5 are dynamically generated, with one column per primitive field
on the GraphQL object type. The ID column and foreign key reference are
always present.

Because the columns are dynamically generated, most queries that alter,
read from, or write to these tables must be as well. But this violates
the First Rule of SQL: only create prepared statements whose bodies are
compile-time constant expressions. Dynamically generating SQL queries
opens us up to SQL injection, and while I’ve been quite careful in the
implementation and the data source is also trusted, it’s still a
dangerous property to have lurking around.
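For concreteness, here is a sketch (in Python with illustrative names; the actual module is JavaScript) of the kind of runtime statement assembly that per-type tables force. Quoting identifiers mitigates injection, but the statement text is still not a compile-time constant:

```python
import sqlite3


def create_primitives_table(conn, type_name, field_names):
    # Hypothetical sketch: the table and column names come from the
    # GraphQL schema, so the DDL must be assembled at runtime --
    # exactly the property the issue wants to eliminate.
    def quote(identifier):
        # Doubling embedded quotes escapes the identifier.
        return '"' + identifier.replace('"', '""') + '"'

    columns = ", ".join(quote(f) for f in field_names)
    conn.execute(
        "CREATE TABLE {} (id TEXT NOT NULL PRIMARY KEY, {})".format(
            quote("primitives_" + type_name), columns
        )
    )


conn = sqlite3.connect(":memory:")
create_primitives_table(conn, "Repository", ["url", "name", "createdAt"])
```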

In retrospect, I think that the fact that these tables require
dynamically generated DDL and DML queries should have been a sign that
the table design itself leaves something to be desired. An obvious
alternative is to use a single table for storing all primitive values:
i.e., an entity–attribute–value table. I considered this when
first designing the schema, but decided against it because I’d heard
vague cautions to avoid the EAV model when possible. After completing
the initial implementation, I don’t think that such criticisms apply
here, primarily because we’re writing a generic family of schemata
(amusingly, mathematicians might call this a schema schema)
rather than a schema for any fixed business application.

This refactoring should make the Mirror module code easier to understand
for newcomers. There will be a constant number of tables. There will
be no dynamically generated queries; all queries will be
compile-time constant expressions. There will be less side-condition
checking and validation logic, because less will be required. And the
code will be safer and more reusable.
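A minimal sketch of the EAV alternative, assuming an illustrative `(object_id, fieldname, value)` layout (the real schema may differ in details):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One fixed table replaces the per-type "primitives_*" family; the
# statement text below never varies with the GraphQL schema.
conn.execute(
    """
    CREATE TABLE primitives (
        object_id TEXT NOT NULL,
        fieldname TEXT NOT NULL,
        value,
        PRIMARY KEY (object_id, fieldname)
    )
    """
)
# GraphQL type and field names now flow in only as bound parameters,
# never as spliced-in identifiers.
conn.execute(
    "INSERT INTO primitives (object_id, fieldname, value) VALUES (?, ?, ?)",
    ("repo:sourcecred/example-github", "name", "example-github"),
)
row = conn.execute(
    "SELECT value FROM primitives WHERE object_id = ? AND fieldname = ?",
    ("repo:sourcecred/example-github", "name"),
).fetchone()
```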

@wchargin wchargin added the Feature Request New feature or request label Aug 21, 2019
@teamdandelion
Contributor

Thanks for documenting this context, @wchargin. My prioritization of the GitHub mirror has decreased, because in the coming quarter I plan to focus more on SourceCred dogfooding and less on supporting third-party projects. But sometime in the next year I still expect to spend a bunch of time on the mirror module, so I might start here as a first step to really grokking and getting comfortable with the module.

(Of course, if anyone else feels like taking this on, more cred to them. But I have a feeling it will be me. :P)

@wchargin
Contributor Author

I was actually leaving this as a TODO for myself, but if you get around
to it first that’d be great!

@wchargin wchargin self-assigned this Sep 1, 2019
wchargin added a commit that referenced this issue Sep 15, 2019
Summary:
See #1313 for context. The plan is to set up dual-writes with `extract`
calls still reading from the old tables until the new ones are complete
and tested. The primary risk to production would be a fatal exception in
the new write paths, which seems like an acceptable risk.

Test Plan:
Unit tests pass.

wchargin-branch: mirror-eav-schema
wchargin added a commit that referenced this issue Oct 12, 2019
Summary:
This flips the switch for all production `Mirror` reads to use the
single `primitives` EAV table as their source of truth, rather than the
legacy type-specific primitives tables. For context and design
discussion, see issue #1313 and commits adjacent to this one.

Test Plan:
All relevant code paths are already tested (see test plans of commits
adjacent to this one). Running `yarn test --full` passes.

wchargin-branch: mirror-eav-flip
wchargin added a commit that referenced this issue Oct 20, 2019
Summary:
This data is now stored in the EAV `primitives` table; see issue #1313 and
adjacent commits for details.

We simultaneously lift the restriction that GraphQL type and field names
be SQL-safe identifiers, as it’s no longer necessary.

Test Plan:
Some test cases queried the legacy primitives tables to check properties
about the database state. These queries have of course been removed;
note that each such removed query was already accompanied by an
equivalent query against the EAV `primitives` table.

Note that `yarn test --full` still passes, and that when manually
loading `sourcecred/example-github` the cache no longer has any of the
legacy tables.

wchargin-branch: mirror-eav-prune-tables
wchargin added a commit that referenced this issue Oct 20, 2019
Summary:
The migration is complete; only EAV primitives remain, so they shall be
called simply “primitives”. See #1313 and adjacent commits for context.

Test Plan:
Running `git grep -iw eav` no longer returns any results.

wchargin-branch: mirror-eav-prune-names
wchargin added a commit that referenced this issue Oct 20, 2019
Summary:
The Mirror module extraction code calculates the set of transitive
dependencies and stores these results in a temporary table to avoid
unnecessary marshalling between JavaScript and C. We originally chose
the temporary table name dynamically, guaranteeing that it was unused.
However, this is unnecessary:

  - The temporary table namespace is unique to each database connection,
    so we need only consider possible conflicts in the same connection.
  - A `Mirror` instance exercises exclusive ownership of its database
    connection, per its constructor docs, so we need only consider
    conflicts within this module.
  - Temporary tables are only used in the `extract` method, so we need
    only consider conflicts in this method.
  - The `extract` method makes no open calls nor recursive calls, and
    does not yield control back to the event loop, so only one stack
    frame can be in `extract` at any time.
  - The `extract` method itself only creates the temporary table once.

Thus, the temporary table creation is safe. Furthermore, the failure
mode is simply that we raise an exception and fail cleanly; there is no
risk of data loss or corruption.

This patch replaces the dynamically generated table name with a fixed
name. On top of the work in #1313, this removes the last instance of SQL
queries that are not compile-time constant expressions.

Test Plan:
Running `yarn unit -f graphql/mirror` suffices.

wchargin-branch: mirror-fixed-temp-table
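The per-connection temporary-table namespace that the argument above relies on is easy to demonstrate (a sketch; the table name here is illustrative, not the module's actual name):

```python
import os
import sqlite3
import tempfile

# Two connections to the same database file: SQLite gives each its
# own temporary-table namespace, so a fixed temp-table name cannot
# collide across connections.
path = os.path.join(tempfile.mkdtemp(), "mirror.db")
a = sqlite3.connect(path)
b = sqlite3.connect(path)
a.execute("CREATE TEMPORARY TABLE transitive_dependencies (id TEXT PRIMARY KEY)")
b.execute("CREATE TEMPORARY TABLE transitive_dependencies (id TEXT PRIMARY KEY)")
a.execute("INSERT INTO transitive_dependencies VALUES ('seen-only-by-a')")
# Connection b sees its own empty table, not connection a's row.
count_b = b.execute("SELECT COUNT(*) FROM transitive_dependencies").fetchone()[0]
```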
@wchargin
Contributor Author

Done!

@wchargin
Contributor Author

wchargin commented Feb 3, 2020

I was interested in the performance implications of this change, so I
inspected a mirror of twbs/bootstrap. The full database is 486 MB on
disk with 124350 SQLite pages (4096 bytes each). The primitives table
takes up 27269 pages (21.9%), and its (object_id, fieldname) index
takes up an additional 11548 pages (9.3%). The only table/index that
takes up more space is the network log, at 44451 pages (35.7%).

Breaking it down a bit more: I was worried that the duplicated field
names would bloat the table, but actually the field names are the
smallest component, and it’s the object names that have bloated more:

for k in object_id fieldname value; do
    printf '%s\t' "${k}" &&
    sqlite3 "${db}" "SELECT SUM(LENGTH(${k})) FROM primitives" |
        numfmt --to=iec
done | expand -t12
object_id   28M
fieldname   4.3M
value       56M

(You may note that those don’t add up to 124350 pages’ worth of bytes;
in the primitives table, only 82.8% of bytes store payload, with 5.2%
used for metadata and 12.1% unused.)

(Note also that the real, irreducible data—value—still accounts for
the majority of space used: around 63%.)

In retrospect, this makes sense, as some of the object names are long.
Commit object IDs tend to be about 76 bytes, with 6 primitive fields
each, so that’s 456 bytes of overhead per commit, and there are 22168
commits, so that’s 10MB already. (This is why we always profile! :-) )
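The back-of-the-envelope figure checks out:

```python
commit_id_bytes = 76    # typical commit object ID length, per above
fields_per_commit = 6   # primitive fields stored for each commit
num_commits = 22168

# 76 * 6 = 456 bytes of repeated ID per commit; times 22168 commits
# is 10,108,608 bytes, i.e. roughly 10 MB.
overhead = commit_id_bytes * fields_per_commit * num_commits
```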

The average ID length per primitives row is 38 bytes.

If this becomes a problem, we could:

  • Alter the primitives table to key objects by rowid rather than
    GraphQL object name, which would take an average of 2–3 bytes rather
    than 38 bytes per row (saving 27MB overall).
  • Pack all primitives for each object into a single row, given that
    we always update and read them all at once, reducing duplication. The
    row contents would have to be some kind of serialized JSON; if we
    pick a fixed key order, we can drop the field names, too.
  • Drop the index on (object_id, fieldname).
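For illustration only, the first option might look something like this (a sketch under assumed names, not a committed design):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- The long GraphQL ID is stored once here; primitives rows carry
    -- only a compact integer key (a SQLite rowid alias, ~2-3 bytes).
    CREATE TABLE objects (
        object_rowid INTEGER PRIMARY KEY,
        graphql_id TEXT NOT NULL UNIQUE
    );
    CREATE TABLE primitives (
        object_rowid INTEGER NOT NULL REFERENCES objects(object_rowid),
        fieldname TEXT NOT NULL,
        value,
        PRIMARY KEY (object_rowid, fieldname)
    );
    """
)
cur = conn.execute(
    "INSERT INTO objects (graphql_id) VALUES (?)", ("repo:twbs/bootstrap",)
)
conn.execute(
    "INSERT INTO primitives VALUES (?, ?, ?)", (cur.lastrowid, "name", "bootstrap")
)
# Reads pay one extra join to translate the GraphQL ID.
row = conn.execute(
    """
    SELECT p.value FROM primitives p
    JOIN objects o USING (object_rowid)
    WHERE o.graphql_id = ? AND p.fieldname = ?
    """,
    ("repo:twbs/bootstrap", "name"),
).fetchone()
```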

The first two options, which alter the table structure, would make human
exploration and debugging of the database harder. The third option would
be fine for readability; we’d have to look at the performance impact.
I’d advise holding off for now, as we haven’t heard any reports that
this is a problem and the network log could provide lower-hanging fruit.

@Beanow
Contributor

Beanow commented Feb 3, 2020

Great investigation 😄

I’d advise holding off for now, as we haven’t heard any reports that
this is a problem and the network log could provide lower-hanging fruit.

I'd concur.
For a single instance, using SC for its primary use case, I don't believe this should be an issue at all.

At scale in this experiment I did find handling the cache size to be the main obstacle. But I think the lesson is that this use case wasn't suitable for a CI container cron job + GitHub Pages, and should have had a dedicated application with direct filesystem access, as any application dealing with databases spanning several GBs reasonably would. Besides, I found a simple gzip to be really effective at bringing the size down for that experiment.

@wchargin
Contributor Author

wchargin commented Feb 3, 2020

Yes, that’s another good point—I found that gzip -9 typically gives
around a 5× compression ratio now that the network log is enabled.
