Switch to EAV tables in GraphQL mirror #1313
Comments
Thanks for documenting this context, @wchargin. My prioritization of the github mirror has decreased, because in the coming quarter I plan to focus more on SourceCred dogfooding and less on supporting third party projects. But sometime in the next year I still expect to spend a bunch of time on the mirror module, so I might start here as a first step to really grokking and getting comfortable with the module. (Of course, if anyone else feels like taking this on, more cred to them. But I have a feeling it will be me. :P)
I was actually leaving this as a TODO for myself, but if you get around
I opened a few pull requests for this:
Summary: See #1313 for context. The plan is to set up dual-writes with `extract` calls still reading from the old tables until the new ones are complete and tested. The primary risk to production would be a fatal exception in the new write paths, which seems like an acceptable risk. Test Plan: Unit tests pass. wchargin-branch: mirror-eav-schema
Summary: This flips the switch for all production `Mirror` reads to use the single `primitives` EAV table as their source of truth, rather than the legacy type-specific primitives tables. For context and design discussion, see issue #1313 and commits adjacent to this one. Test Plan: All relevant code paths are already tested (see test plans of commits adjacent to this one). Running `yarn test --full` passes. wchargin-branch: mirror-eav-flip
Summary: This data is now stored in the EAV `primitives` table; see issue #1313 and adjacent commits for details. We simultaneously lift the restriction that GraphQL type and field names be SQL-safe identifiers, as it’s no longer necessary. Test Plan: Some test cases queried the legacy primitives tables to check properties about the database state. These queries have been removed; each removed query was already accompanied by an equivalent query against the EAV `primitives` table. Running `yarn test --full` still passes, and when manually loading `sourcecred/example-github` the cache no longer has any of the legacy tables. wchargin-branch: mirror-eav-prune-tables
Summary: The migration is complete; only EAV primitives remain, so they shall be called simply “primitives”. See #1313 and adjacent commits for context. Test Plan: Running `git grep -iw eav` no longer returns any results. wchargin-branch: mirror-eav-prune-names
Summary: The Mirror module extraction code calculates the set of transitive dependencies and stores these results in a temporary table to avoid unnecessary marshalling between JavaScript and C. We originally chose the temporary table name dynamically, guaranteeing that it was unused. However, this is unnecessary:
- The temporary table namespace is unique to each database connection, so we need only consider possible conflicts in the same connection.
- A `Mirror` instance exercises exclusive ownership of its database connection, per its constructor docs, so we need only consider conflicts within this module.
- Temporary tables are only used in the `extract` method, so we need only consider conflicts in this method.
- The `extract` method makes no open calls nor recursive calls, and does not yield control back to the event loop, so only one stack frame can be in `extract` at any time.
- The `extract` method itself only creates the temporary table once.
Thus, the temporary table creation is safe. Furthermore, the failure mode is simply that we raise an exception and fail cleanly; there is no risk of data loss or corruption.
This patch replaces the dynamically generated table name with a fixed name. On top of the work in #1313, this removes the last instance of SQL queries that are not compile-time constant expressions.
Test Plan: Running `yarn unit -f graphql/mirror` suffices.
wchargin-branch: mirror-fixed-temp-table
Done!
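The fixed-name argument can be illustrated with a small sketch. This is not the Mirror module's code (which is JavaScript); it uses Python's `sqlite3`, and the table names `links` and `tmp_transitive_dependencies`, along with the recursive query used to compute the closure, are assumptions made for illustration only. The point is that the temporary table's name is a constant, and the function creates it once and drops it before returning.

```python
import sqlite3


def transitive_dependencies(db, root_id):
    """Collect everything reachable from root_id via a hypothetical `links` table."""
    # Fixed name: temporary tables are scoped to this connection, and this
    # function is the only place that creates the table.
    db.execute(
        "CREATE TEMPORARY TABLE tmp_transitive_dependencies (id TEXT PRIMARY KEY)"
    )
    try:
        # Constant query text; the only runtime input is the bound parameter.
        db.execute(
            """
            WITH RECURSIVE reachable(id) AS (
                VALUES (?)
                UNION
                SELECT links.child_id
                FROM links JOIN reachable ON links.parent_id = reachable.id
            )
            INSERT INTO tmp_transitive_dependencies (id)
            SELECT id FROM reachable
            """,
            (root_id,),
        )
        rows = db.execute(
            "SELECT id FROM tmp_transitive_dependencies ORDER BY id"
        ).fetchall()
        return [row[0] for row in rows]
    finally:
        # If creation ever did collide, the CREATE above would raise and
        # nothing would be written, matching the clean failure mode described.
        db.execute("DROP TABLE tmp_transitive_dependencies")
```

Calling the function twice on the same connection works because the table is dropped on every exit path; two separate connections never see each other's temporary tables.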
I was interested in the performance implications of this change, so I
Breaking it down a bit more: I was worried that the duplicated field

    for k in object_id fieldname value; do
        printf '%s\t' "${k}" &&
        sqlite3 "${db}" "SELECT SUM(LENGTH(${k})) FROM primitives" |
        numfmt --to=iec
    done | expand -t12

(You may note that those don’t add up to 124350 pages’ worth of bytes;
(Note also that the real, irreducible data—
In retrospect, this makes sense, as some of the object names are long. The average length of an average ID per
If this becomes a problem, we could:
The first two options, which alter the table structure, would make human
Great investigation 😄
I'd concur. At scale in this experiment I did find handling the cache size to be the main obstacle. Though I think the lesson should be that this use case wasn't suitable for a CI container cronjob + GitHub Pages, and should have had a dedicated application with direct filesystem access, as any application dealing with databases spanning several GBs reasonably would. Besides, I found a simple gzip to be really effective at bringing the size down to work for that experiment.
Yes, that’s another good point—I found that
Switch to EAV tables in GraphQL mirror
The GraphQL mirror currently creates one table per object type in the
GitHub schema, used to store primitive fields, with schemata like this
(formatted for readability):
Lines 3–5 are dynamically generated, with one column per primitive field
on the GraphQL object type. The ID column and foreign key reference are
always present.
Because the columns are dynamically generated, most queries that alter,
read from, or write to these tables must be as well. But this violates
the First Rule of SQL: only create prepared statements whose bodies are
compile-time constant expressions. Dynamically generating SQL queries
opens us up to SQL injection, and while I’ve been quite careful in the
implementation and the data source is also trusted, it’s still a
dangerous property to have lurking around.
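To make the concern concrete, here is a minimal sketch using Python's `sqlite3` rather than the Mirror module's own JavaScript. The table name `primitives_Issue`, its columns, and the helper names are hypothetical stand-ins for a per-type table. It shows that placeholders bind values but not identifiers, so a per-type layout forces the query text to be assembled at runtime and guarded by validation.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical per-type table: one column per primitive field of the type.
db.execute(
    'CREATE TABLE "primitives_Issue" (id TEXT PRIMARY KEY, title TEXT, url TEXT)'
)
db.execute(
    'INSERT INTO "primitives_Issue" VALUES (?, ?, ?)',
    ("issue:1", "Switch to EAV", "https://example.com/1"),
)


def read_field_dynamic(typename, fieldname, object_id):
    # Table and column names cannot be bound as parameters, so the query
    # text depends on runtime data. Safety rests on this validation step.
    for name in (typename, fieldname):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    sql = f'SELECT "{fieldname}" FROM "primitives_{typename}" WHERE id = ?'
    return db.execute(sql, (object_id,)).fetchone()[0]


def placeholder_as_identifier_fails():
    # Trying to bind a table name with `?` is rejected as a syntax error.
    try:
        db.execute(
            "SELECT title FROM ? WHERE id = ?", ("primitives_Issue", "issue:1")
        )
    except sqlite3.OperationalError:
        return True
    return False
```

Every such query needs its own validation, and a single missed check turns into an injection vector; that is the property the issue describes as dangerous to have lurking around.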
In retrospect, I think that the fact that these tables require
dynamically generated DDL and DML queries should have been a sign that
the table design itself leaves something to be desired. An obvious
alternative is to use a single table for storing all primitive values:
i.e., an entity–attribute–value table. I considered this when
first designing the schema, but decided against it because I’d heard
vague cautions to avoid the EAV model when possible. After completing
the initial implementation, I don’t think that such criticisms apply
here, primarily because we’re writing a generic family of schemata
(amusingly, mathematicians might call this a schema schema)
rather than a schema for any fixed business application.
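As a rough sketch of what a single entity-attribute-value table could look like, again in Python's `sqlite3` for self-containment: the column names `object_id`, `fieldname`, and `value` follow the size breakdown elsewhere in this thread, while the constraints and helper functions are assumptions rather than the module's actual schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# One table for all primitive values, regardless of GraphQL object type.
# The primary key and NOT NULL constraints here are assumptions.
db.execute(
    """
    CREATE TABLE primitives (
        object_id TEXT NOT NULL,
        fieldname TEXT NOT NULL,
        value,
        PRIMARY KEY (object_id, fieldname)
    )
    """
)

# Every statement is a constant string; runtime data enters only as
# bound parameters, never as part of the query text.
WRITE = (
    "INSERT OR REPLACE INTO primitives (object_id, fieldname, value) "
    "VALUES (?, ?, ?)"
)
READ = "SELECT fieldname, value FROM primitives WHERE object_id = ?"


def write_primitives(object_id, fields):
    with db:  # commit all fields of one object together
        db.executemany(WRITE, [(object_id, k, v) for k, v in fields.items()])


def read_primitives(object_id):
    return dict(db.execute(READ, (object_id,)).fetchall())
```

Because field names are now ordinary data, they no longer need to be SQL-safe identifiers: a field called `weird; name` is stored and read back like any other.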
This refactoring should make the Mirror module code easier to understand
for newcomers. There will be a constant number of tables. There will
be no dynamically generated queries; all queries will be
compile-time constant expressions. There will be less side-condition
checking and validation logic, because less will be required. And the
code will be safer and more reusable.