Initial support for "transforming" plugins, incl. dbt#574

Merged
mildbyte merged 18 commits into master from feature/dbt-data-source on Nov 26, 2021

Conversation

@mildbyte
Contributor

  • Add a TransformingDataSource mixin that lets a data source declare which images it requires to run, together with an ImageMounter class that mounts those images at runtime and hands the data source the schemas they are mounted in.
  • dbt_utils changes:
    • Let the sources.yml patcher take in a dbt source <> schema map and overwrite each source with its required schema (instead of the current behaviour where we point the source and the target to the same schema)
    • Add a function that runs the dbt project in compile mode and extracts its manifest (https://docs.getdbt.com/reference/artifacts/manifest-json) to list the models that the project can build
  • Add a DBTDataSource that runs dbt transformations from Git:
    • introspect() compiles the repository's manifest and extracts the list of
      models that materialize as tables, letting the user select which models
      will get loaded into their Splitgraph repo.
    • The plugin requires a map of dbt source names to Splitgraph images (uses
      a JSONSchema with an array)
    • load() uses the new ImageMounter functionality to build the source -> temporary schema map, then calls the dbt
      wrapper that injects the Splitgraph image references in place of dbt sources and builds the required tables
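The load() flow above can be sketched roughly as follows. This is a toy, self-contained illustration of the idea, not Splitgraph's actual API: `FakeImageMounter`, `mounted_schemas` and the schema naming scheme are all hypothetical stand-ins for the real ImageMounter.

```python
from contextlib import contextmanager

class FakeImageMounter:
    """Toy stand-in for ImageMounter: 'mounts' images by handing out
    temporary schema names (a real mounter would check the image out)."""
    def __init__(self):
        self._counter = 0

    def mount(self, image):
        # A real mounter would use all three parts of the image reference.
        namespace, repository, image_hash = image
        self._counter += 1
        return f"sg_tmp_{self._counter}_{repository}"

@contextmanager
def mounted_schemas(mounter, source_image_map):
    """Build the dbt-source -> temporary-schema map for the duration of a run."""
    schema_map = {src: mounter.mount(img) for src, img in source_image_map.items()}
    try:
        yield schema_map
    finally:
        pass  # a real implementation would drop the temporary schemas here

mounter = FakeImageMounter()
sources = {"jaffle_shop": ("splitgraph", "jaffle-shop", "latest")}
with mounted_schemas(mounter, sources) as schema_map:
    print(schema_map)  # -> {'jaffle_shop': 'sg_tmp_1_jaffle-shop'}
```

The context-manager shape mirrors the description above: images stay mounted only while the dbt wrapper runs against the schema map.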

Tested against a real-life project at https://github.com/splitgraph/jaffle_shop_archive/tree/sg-integration-test

Not implemented yet:

  • Exporting dbt dependencies as provenance (and in general, unifying this with some Splitfile execution mechanisms)
  • Graceful error handling, e.g.:
    • scanning the manifest/project for sources that aren't mapped to Splitgraph images
    • scanning for dangling ref() invocations that don't map to any models/sources
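The first unimplemented check (sources without a mapped image) could look roughly like this. It is a minimal sketch, not code from this PR; the manifest layout used here (a top-level "sources" dict keyed by unique ID, each entry carrying a "source_name") follows dbt's documented manifest.json schema.

```python
def find_unmapped_sources(manifest, source_image_map):
    """Return dbt source names that have no Splitgraph image mapped to them."""
    source_names = {
        node["source_name"] for node in manifest.get("sources", {}).values()
    }
    return sorted(source_names - set(source_image_map))

# Trimmed-down example of a compiled manifest:
manifest = {
    "sources": {
        "source.jaffle_shop.raw.orders": {"source_name": "raw"},
        "source.jaffle_shop.stripe.payments": {"source_name": "stripe"},
    }
}
print(find_unmapped_sources(manifest, {"raw": ("splitgraph", "raw-data", "latest")}))
# -> ['stripe']
```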

Add ability to compile and export the model's manifest file.

This sadly requires a connection to an engine, even though it just returns
a manifest.json file with compiled dbt views for every model. This is useful
for finding out which models a dbt repository provides, as well as their
dependency tree.
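Extracting the model list and dependency tree from a compiled manifest can be sketched as below. This is illustrative rather than the PR's actual code, but the field names ("nodes", "resource_type", "config.materialized", "depends_on.nodes") match dbt's manifest.json schema.

```python
def table_models(manifest):
    """Map each model that materializes as a table to its dependency list."""
    result = {}
    for unique_id, node in manifest.get("nodes", {}).items():
        if node.get("resource_type") != "model":
            continue
        if node.get("config", {}).get("materialized") != "table":
            continue
        result[node["name"]] = node.get("depends_on", {}).get("nodes", [])
    return result

# Trimmed-down manifest: one table model depending on one view model.
manifest = {
    "nodes": {
        "model.jaffle_shop.customers": {
            "resource_type": "model",
            "name": "customers",
            "config": {"materialized": "table"},
            "depends_on": {"nodes": ["model.jaffle_shop.stg_customers"]},
        },
        "model.jaffle_shop.stg_customers": {
            "resource_type": "model",
            "name": "stg_customers",
            "config": {"materialized": "view"},
            "depends_on": {"nodes": []},
        },
    }
}
print(table_models(manifest))
# -> {'customers': ['model.jaffle_shop.stg_customers']}
```

Filtering to table materializations is what lets introspect() offer only loadable models to the user.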

Add ability to override each dbt data source's schema separately.

Add ability to filter a list of models when building a dbt model.
  * introspect() compiles the repository's manifest and extracts the list of
    models that materialize as tables, letting the user select which models
    will get loaded into their Splitgraph repo.
  * The plugin requires a map of dbt source names to Splitgraph images (uses
    a JSONSchema with an array)
  * Yet-unimplemented preload step mounts all required images into temporary
    schemas and passes them to the plugin (similar to the Splitfile executor).
  * load() builds the source -> temporary schema map and calls our dbt
    wrapper that builds the required tables.
  * Add a `TransformingDataSource` mixin that provides a context manager temporarily
    mounting images into throwaway schemas
  * Add an `ImageMounter` interface that converts image reference tuples to
    schemas (TODO: Splitfile execution does something similar, need to figure out
    how to merge them)
  * Add the `TransformingDataSource` mixin to the DBT plugin and use it to
    generate the schema map.
  * Switch the type required by data sources back to `PostgresEngine`, as we
    need to use that engine to generate Repository objects.
  * Add tables and mounter to the constructor
  * Get `DBTDataSource` to call the correct parent (`TransformingDataSource`).
(only data sources that support mount are meant to be shown)
@mildbyte mildbyte merged commit 1d171bd into master Nov 26, 2021
@mildbyte mildbyte deleted the feature/dbt-data-source branch November 26, 2021 09:47
mildbyte added a commit that referenced this pull request Dec 17, 2021
Fleshing out the `splitgraph.yml` (aka `repositories.yml`) format that defines a Splitgraph Cloud "project" (datasets, their sources and metadata).

Existing users of `repositories.yml` don't need to change anything, though note that `sgr cloud` commands using the YAML format will now default to `splitgraph.yml` unless explicitly set to `repositories.yml`.


New sgr cloud commands:

See #582 and #587

These let users manipulate Splitgraph Cloud and ingestion jobs from the CLI:

  * `sgr cloud status`: view the status of ingestion jobs in the current project
  * `sgr cloud logs`: view job logs
  * `sgr cloud upload`: upload a CSV file to Splitgraph Cloud (without using the engine)
  * `sgr cloud sync`: trigger a one-off load of a dataset
  * `sgr cloud stub`: generate a `splitgraph.yml` file
  • `sgr cloud seed`: generate a Splitgraph Cloud project with a `splitgraph.yml`, GitHub Actions, dbt, etc.
  * `sgr cloud validate`: merge multiple project files and output the result (like `docker-compose config`)
  * `sgr cloud download`: download a query result from Splitgraph Cloud as a CSV file, bypassing time/query size limits.


repositories.yml/splitgraph.yml format:

Change various commands that use `repositories.yml` to default to `splitgraph.yml` instead. Allow "mixing in" multiple `.yml` files Docker Compose-style, useful for splitting credentials (and not checking them in) and data settings.
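The Docker Compose-style "mixing in" amounts to a recursive merge of the project files. The sketch below is an assumption about the merge semantics (later files override scalars, mappings merge recursively), not Splitgraph's exact implementation; the keys in the example dicts are illustrative.

```python
def deep_merge(base, override):
    """Recursively merge `override` into `base`: nested mappings are merged,
    everything else (scalars, lists) is replaced by the overriding value."""
    result = dict(base)
    for key, value in override.items():
        if isinstance(result.get(key), dict) and isinstance(value, dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = value
    return result

# splitgraph.yml holds data settings; credentials live in an uncommitted file.
project = {
    "repositories": [{"namespace": "org", "repository": "data"}],
    "credentials": {},
}
secrets = {"credentials": {"csv": {"data": {"s3_secret_key": "..."}}}}
print(deep_merge(project, secrets))
```

This is the same shape of operation that `sgr cloud validate` exposes: merge all the `.yml` inputs and print the effective project, like `docker-compose config`.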

Temporary location for the new full documentation on `splitgraph.yml`: https://github.com/splitgraph/splitgraph.com/blob/f7ac524cb5023091832e8bf51b277991c435f241/content/docs/0900_splitgraph-cloud/0500_splitgraph-yml.mdx


Miscellaneous:

  * Initial backend support for "transforming" Splitgraph plugins, including dbt (#574)
  * Dump scheduled ingestion/transformation jobs with `sgr cloud dump` (#577)