
Switch to EAV tables in GraphQL mirror #1313

Closed
wchargin opened this issue Aug 21, 2019 · 7 comments
@wchargin
Contributor

The GraphQL mirror currently creates one table per object type in the
GitHub schema, used to store primitive fields, with schemata like this
(formatted for readability):

/* 1 */ CREATE TABLE "primitives_Repository" (
/* 2 */     id TEXT NOT NULL PRIMARY KEY,
/* 3 */     "url",
/* 4 */     "name",
/* 5 */     "createdAt",
/* 6 */     FOREIGN KEY(id) REFERENCES objects(id)
/* 7 */ );

Lines 3–5 are dynamically generated, with one column per primitive field
on the GraphQL object type. The ID column and foreign key reference are
always present.

Because the columns are dynamically generated, most queries that alter,
read from, or write to these tables must be as well. But this violates
the First Rule of SQL: only create prepared statements whose bodies are
compile-time constant expressions. Dynamically generating SQL queries
opens us up to SQL injection, and while I’ve been quite careful in the
implementation and the data source is also trusted, it’s still a
dangerous property to have lurking around.
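For concreteness, here is a sketch (in Python with illustrative names; the actual module is JavaScript) of the kind of runtime statement assembly that per-type tables force. Quoting identifiers mitigates injection, but the statement text is still not a compile-time constant:

```python
import sqlite3


def create_primitives_table(conn, type_name, field_names):
    # Hypothetical sketch: the table and column names come from the
    # GraphQL schema, so the DDL must be assembled at runtime --
    # exactly the property the issue wants to eliminate.
    def quote(identifier):
        # Doubling embedded quotes escapes the identifier.
        return '"' + identifier.replace('"', '""') + '"'

    columns = ", ".join(quote(f) for f in field_names)
    conn.execute(
        "CREATE TABLE {} (id TEXT NOT NULL PRIMARY KEY, {})".format(
            quote("primitives_" + type_name), columns
        )
    )


conn = sqlite3.connect(":memory:")
create_primitives_table(conn, "Repository", ["url", "name", "createdAt"])
```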

In retrospect, I think that the fact that these tables require
dynamically generated DDL and DML queries should have been a sign that
the table design itself leaves something to be desired. An obvious
alternative is to use a single table for storing all primitive values:
i.e., an entity–attribute–value table. I considered this when
first designing the schema, but decided against it because I’d heard
vague cautions to avoid the EAV model when possible. After completing
the initial implementation, I don’t think that such criticisms apply
here, primarily because we’re writing a generic family of schemata
(amusingly, mathematicians might call this a schema schema)
rather than a schema for any fixed business application.

This refactoring should make the Mirror module code easier to understand
for newcomers. There will be a constant number of tables. There will
be no dynamically generated queries; all queries will be
compile-time constant expressions. There will be less side-condition
checking and validation logic, because less will be required. And the
code will be safer and more reusable.
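A minimal sketch of the EAV alternative, assuming an illustrative `(object_id, fieldname, value)` layout (the real schema may differ in details):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One fixed table replaces the per-type "primitives_*" family; the
# statement text below never varies with the GraphQL schema.
conn.execute(
    """
    CREATE TABLE primitives (
        object_id TEXT NOT NULL,
        fieldname TEXT NOT NULL,
        value,
        PRIMARY KEY (object_id, fieldname)
    )
    """
)
# GraphQL type and field names now flow in only as bound parameters,
# never as spliced-in identifiers.
conn.execute(
    "INSERT INTO primitives (object_id, fieldname, value) VALUES (?, ?, ?)",
    ("repo:sourcecred/example-github", "name", "example-github"),
)
row = conn.execute(
    "SELECT value FROM primitives WHERE object_id = ? AND fieldname = ?",
    ("repo:sourcecred/example-github", "name"),
).fetchone()
```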

@wchargin wchargin added the Feature Request New feature or request label Aug 21, 2019
@teamdandelion
Contributor

Thanks for documenting this context, @wchargin. My prioritization of the GitHub mirror has decreased, because in the coming quarter I plan to focus more on SourceCred dogfooding and less on supporting third-party projects. But sometime in the next year I still expect to spend a bunch of time on the mirror module, so I might start here as a first step to really grokking and getting comfortable with the module.

(Of course, if anyone else feels like taking this on, more cred to them. But I have a feeling it will be me. :P)

@wchargin
Contributor Author

I was actually leaving this as a TODO for myself, but if you get around
to it first that’d be great!

@wchargin wchargin self-assigned this Sep 1, 2019
wchargin added a commit that referenced this issue Sep 15, 2019
Summary:
See #1313 for context. The plan is to set up dual-writes with `extract`
calls still reading from the old tables until the new ones are complete
and tested. The primary risk to production would be a fatal exception in
the new write paths, which seems like an acceptable risk.

Test Plan:
Unit tests pass.

wchargin-branch: mirror-eav-schema
wchargin added a commit that referenced this issue Oct 12, 2019
Summary:
This flips the switch for all production `Mirror` reads to use the
single `primitives` EAV table as their source of truth, rather than the
legacy type-specific primitives tables. For context and design
discussion, see issue #1313 and commits adjacent to this one.

Test Plan:
All relevant code paths are already tested (see test plans of commits
adjacent to this one). Running `yarn test --full` passes.

wchargin-branch: mirror-eav-flip
wchargin added a commit that referenced this issue Oct 20, 2019
Summary:
This data is now stored in the EAV `primitives` table; see issue #1313 and
adjacent commits for details.

We simultaneously lift the restriction that GraphQL type and field names
be SQL-safe identifiers, as it’s no longer necessary.

Test Plan:
Some test cases queried the legacy primitives tables to check properties
about the database state. These queries have of course been removed;
note that each such removed query was already accompanied by an
equivalent query against the EAV `primitives` table.

Note that `yarn test --full` still passes, and that when manually
loading `sourcecred/example-github` the cache no longer has any of the
legacy tables.

wchargin-branch: mirror-eav-prune-tables
wchargin added a commit that referenced this issue Oct 20, 2019
Summary:
The migration is complete; only EAV primitives remain, so they shall be
called simply “primitives”. See #1313 and adjacent commits for context.

Test Plan:
Running `git grep -iw eav` no longer returns any results.

wchargin-branch: mirror-eav-prune-names
wchargin added a commit that referenced this issue Oct 20, 2019
Summary:
The Mirror module extraction code calculates the set of transitive
dependencies and stores these results in a temporary table to avoid
unnecessary marshalling between JavaScript and C. We originally chose
the temporary table name dynamically, guaranteeing that it was unused.
However, this is unnecessary:

  - The temporary table namespace is unique to each database connection,
    so we need only consider possible conflicts in the same connection.
  - A `Mirror` instance exercises exclusive ownership of its database
    connection, per its constructor docs, so we need only consider
    conflicts within this module.
  - Temporary tables are only used in the `extract` method, so we need
    only consider conflicts in this method.
  - The `extract` method makes no open calls nor recursive calls, and
    does not yield control back to the event loop, so only one stack
    frame can be in `extract` at any time.
  - The `extract` method itself only creates the temporary table once.

Thus, the temporary table creation is safe. Furthermore, the failure
mode is simply that we raise an exception and fail cleanly; there is no
risk of data loss or corruption.

This patch replaces the dynamically generated table name with a fixed
name. On top of the work in #1313, this removes the last instance of SQL
queries that are not compile-time constant expressions.

Test Plan:
Running `yarn unit -f graphql/mirror` suffices.

wchargin-branch: mirror-fixed-temp-table
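The per-connection temporary-table namespace that the argument above relies on is easy to demonstrate (a sketch; the table name here is illustrative, not the module's actual name):

```python
import os
import sqlite3
import tempfile

# Two connections to the same database file: SQLite gives each its
# own temporary-table namespace, so a fixed temp-table name cannot
# collide across connections.
path = os.path.join(tempfile.mkdtemp(), "mirror.db")
a = sqlite3.connect(path)
b = sqlite3.connect(path)
a.execute("CREATE TEMPORARY TABLE transitive_dependencies (id TEXT PRIMARY KEY)")
b.execute("CREATE TEMPORARY TABLE transitive_dependencies (id TEXT PRIMARY KEY)")
a.execute("INSERT INTO transitive_dependencies VALUES ('seen-only-by-a')")
# Connection b sees its own empty table, not connection a's row.
count_b = b.execute("SELECT COUNT(*) FROM transitive_dependencies").fetchone()[0]
```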
@wchargin
Contributor Author

Done!

@wchargin
Contributor Author

wchargin commented Feb 3, 2020

I was interested in the performance implications of this change, so I
inspected a mirror of twbs/bootstrap. The full database is 486 MB on
disk with 124350 SQLite pages (4096 bytes each). The primitives table
takes up 27269 pages (21.9%), and its (object_id, fieldname) index
takes up an additional 11548 pages (9.3%). The only table/index that
takes up more space is the network log, at 44451 pages (35.7%).

Breaking it down a bit more: I was worried that the duplicated field
names would bloat the table, but actually the field names are the
smallest component, and it’s the object names that have bloated more:

for k in object_id fieldname value; do
    printf '%s\t' "${k}" &&
    sqlite3 "${db}" "SELECT SUM(LENGTH(${k})) FROM primitives" |
        numfmt --to=iec
done | expand -t12
object_id   28M
fieldname   4.3M
value       56M

(You may note that those don’t add up to 124350 pages’ worth of bytes;
in the primitives table, only 82.8% of bytes store payload, with 5.2%
used for metadata and 12.1% unused.)

(Note also that the real, irreducible data—value—still accounts for
the majority of space used: around 63%.)

In retrospect, this makes sense, as some of the object names are long.
Commit object IDs tend to be about 76 bytes, with 6 primitive fields
each, so that’s 456 bytes of overhead per commit, and there are 22168
commits, so that’s 10MB already. (This is why we always profile! :-) )
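The back-of-the-envelope figure checks out:

```python
commit_id_bytes = 76    # typical commit object ID length, per above
fields_per_commit = 6   # primitive fields stored for each commit
num_commits = 22168

# 76 * 6 = 456 bytes of repeated ID per commit; times 22168 commits
# is 10,108,608 bytes, i.e. roughly 10 MB.
overhead = commit_id_bytes * fields_per_commit * num_commits
```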

The average ID length per primitives row is 38 bytes.

If this becomes a problem, we could:

  • Alter the primitives table to key objects by rowid rather than
    GraphQL object name, which would take an average of 2–3 bytes rather
    than 38 bytes per row (saving 27MB overall).
  • Pack all primitives for each object into a single row, given that
    we always update and read them all at once, reducing duplication. The
    row contents would have to be some kind of serialized JSON; if we
    pick a fixed key order, we can drop the field names, too.
  • Drop the index on (object_id, fieldname).
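For illustration only, the first option might look something like this (a sketch under assumed names, not a committed design):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- The long GraphQL ID is stored once here; primitives rows carry
    -- only a compact integer key (a SQLite rowid alias, ~2-3 bytes).
    CREATE TABLE objects (
        object_rowid INTEGER PRIMARY KEY,
        graphql_id TEXT NOT NULL UNIQUE
    );
    CREATE TABLE primitives (
        object_rowid INTEGER NOT NULL REFERENCES objects(object_rowid),
        fieldname TEXT NOT NULL,
        value,
        PRIMARY KEY (object_rowid, fieldname)
    );
    """
)
cur = conn.execute(
    "INSERT INTO objects (graphql_id) VALUES (?)", ("repo:twbs/bootstrap",)
)
conn.execute(
    "INSERT INTO primitives VALUES (?, ?, ?)", (cur.lastrowid, "name", "bootstrap")
)
# Reads pay one extra join to translate the GraphQL ID.
row = conn.execute(
    """
    SELECT p.value FROM primitives p
    JOIN objects o USING (object_rowid)
    WHERE o.graphql_id = ? AND p.fieldname = ?
    """,
    ("repo:twbs/bootstrap", "name"),
).fetchone()
```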

The first two options, which alter the table structure, would make human
exploration and debugging of the database harder. The third option would
be fine for readability; we’d have to look at the performance impact.
I’d advise holding off for now, as we haven’t heard any reports that
this is a problem and the network log could provide lower-hanging fruit.

@Beanow
Contributor

Beanow commented Feb 3, 2020

Great investigation 😄

I’d advise holding off for now, as we haven’t heard any reports that
this is a problem and the network log could provide lower-hanging fruit.

I'd concur.
For a single instance, using SC for its primary use case, I don't believe this should be an issue at all.

At scale in this experiment I did find handling the cache size to be the main obstacle. But I think the lesson is that this use case wasn't suitable for a CI container cron job + GitHub Pages, and should have had a dedicated application with direct filesystem access, as any application dealing with databases spanning several GBs reasonably would. Besides, I found a simple gzip to be really effective at bringing the size down for that experiment.

@wchargin
Contributor Author

wchargin commented Feb 3, 2020

Yes, that’s another good point—I found that gzip -9 typically gives
around a 5× compression ratio now that the network log is enabled.
