Initial support for "transforming" plugins, incl. dbt#574

Merged
mildbyte merged 18 commits into master from feature/dbt-data-source on Nov 26, 2021

Conversation

@mildbyte
Contributor

  • Add a TransformingDataSource mixin that lets a data source declare which images it requires to run, together with an ImageMounter class that mounts those images at runtime and hands the data source the schemas they are mounted in.
  • dbt_utils changes:
    • Let the sources.yml patcher take in a dbt source <> schema map and overwrite each source with its required schema (instead of the current behaviour where we point the source and the target to the same schema)
    • Add a function that runs the dbt project in compile mode and extracts its manifest (https://docs.getdbt.com/reference/artifacts/manifest-json) to list the models that the project can build
  • Add a DBTDataSource that runs dbt transformations from Git:
    • introspect() compiles the repository's manifest and extracts the list of
      models that materialize as tables, letting the user select which models
      will get loaded into their Splitgraph repo.
    • The plugin requires a map of dbt source names to Splitgraph images (uses
      a JSONSchema with an array)
    • load() uses the new ImageMounter functionality to build the source -> temporary schema map, then calls the dbt
      wrapper that injects the Splitgraph image references in place of dbt sources and builds the required tables
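The load() flow above can be sketched roughly as follows. This is a toy, self-contained illustration of the idea, not Splitgraph's actual API: `FakeImageMounter`, `mounted_schemas` and the schema naming scheme are all hypothetical stand-ins for the real ImageMounter.

```python
from contextlib import contextmanager

class FakeImageMounter:
    """Toy stand-in for ImageMounter: 'mounts' images by handing out
    temporary schema names (a real mounter would check the image out)."""
    def __init__(self):
        self._counter = 0

    def mount(self, image):
        # A real mounter would use all three parts of the image reference.
        namespace, repository, image_hash = image
        self._counter += 1
        return f"sg_tmp_{self._counter}_{repository}"

@contextmanager
def mounted_schemas(mounter, source_image_map):
    """Build the dbt-source -> temporary-schema map for the duration of a run."""
    schema_map = {src: mounter.mount(img) for src, img in source_image_map.items()}
    try:
        yield schema_map
    finally:
        pass  # a real implementation would drop the temporary schemas here

mounter = FakeImageMounter()
sources = {"jaffle_shop": ("splitgraph", "jaffle-shop", "latest")}
with mounted_schemas(mounter, sources) as schema_map:
    print(schema_map)  # -> {'jaffle_shop': 'sg_tmp_1_jaffle-shop'}
```

The context-manager shape mirrors the description above: images stay mounted only while the dbt wrapper runs against the schema map.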

Tested against a real-life project at https://github.com/splitgraph/jaffle_shop_archive/tree/sg-integration-test

Not implemented yet:

  • Exporting dbt dependencies as provenance (and in general, unifying this with some Splitfile execution mechanisms)
  • Graceful error handling, e.g.:
    • scanning the manifest/project for sources that aren't mapped to Splitgraph images
    • scanning for dangling ref() invocations that don't map to any models/sources
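The first unimplemented check (sources without a mapped image) could look roughly like this. It is a minimal sketch, not code from this PR; the manifest layout used here (a top-level "sources" dict keyed by unique ID, each entry carrying a "source_name") follows dbt's documented manifest.json schema.

```python
def find_unmapped_sources(manifest, source_image_map):
    """Return dbt source names that have no Splitgraph image mapped to them."""
    source_names = {
        node["source_name"] for node in manifest.get("sources", {}).values()
    }
    return sorted(source_names - set(source_image_map))

# Trimmed-down example of a compiled manifest:
manifest = {
    "sources": {
        "source.jaffle_shop.raw.orders": {"source_name": "raw"},
        "source.jaffle_shop.stripe.payments": {"source_name": "stripe"},
    }
}
print(find_unmapped_sources(manifest, {"raw": ("splitgraph", "raw-data", "latest")}))
# -> ['stripe']
```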

Add ability to compile and export the model's manifest file.

This sadly requires a connection to an engine, even though it just returns
a manifest.json file with compiled dbt views for every model. This is useful
for finding out which models a dbt repository provides, as well as their
dependency tree.
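Extracting the model list and dependency tree from a compiled manifest can be sketched as below. This is illustrative rather than the PR's actual code, but the field names ("nodes", "resource_type", "config.materialized", "depends_on.nodes") match dbt's manifest.json schema.

```python
def table_models(manifest):
    """Map each model that materializes as a table to its dependency list."""
    result = {}
    for unique_id, node in manifest.get("nodes", {}).items():
        if node.get("resource_type") != "model":
            continue
        if node.get("config", {}).get("materialized") != "table":
            continue
        result[node["name"]] = node.get("depends_on", {}).get("nodes", [])
    return result

# Trimmed-down manifest: one table model depending on one view model.
manifest = {
    "nodes": {
        "model.jaffle_shop.customers": {
            "resource_type": "model",
            "name": "customers",
            "config": {"materialized": "table"},
            "depends_on": {"nodes": ["model.jaffle_shop.stg_customers"]},
        },
        "model.jaffle_shop.stg_customers": {
            "resource_type": "model",
            "name": "stg_customers",
            "config": {"materialized": "view"},
            "depends_on": {"nodes": []},
        },
    }
}
print(table_models(manifest))
# -> {'customers': ['model.jaffle_shop.stg_customers']}
```

Filtering to table materializations is what lets introspect() offer only loadable models to the user.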

Add ability to override each dbt data source's schema separately.

Add ability to filter a list of models when building a dbt model.
  * introspect() compiles the repository's manifest and extracts the list of
    models that materialize as tables, letting the user select which models
    will get loaded into their Splitgraph repo.
  * The plugin requires a map of dbt source names to Splitgraph images (uses
    a JSONSchema with an array)
  * Yet-unimplemented preload step mounts all required images into temporary
    schemas and passes them to the plugin (similar to the Splitfile executor).
  * load() builds the source -> temporary schema map and calls our dbt
    wrapper that builds the required tables.
  * Add a `TransformingDataSource` mixin that provides a context manager temporarily
    mounting images into throwaway schemas
  * Add an `ImageMounter` interface that converts image reference tuples to
    schemas (TODO: Splitfile execution does something similar, need to figure out
    how to merge them)
  * Add the `TransformingDataSource` mixin to the DBT plugin and use it to
    generate the schema map.
  * Switch the type required by data sources back to `PostgresEngine`, as we
    need to use that engine to generate Repository objects.
  * Add tables and mounter to the constructor
  * Get `DBTDataSource` to call the correct parent (`TransformingDataSource`).
(only data sources that support mount are meant to be shown)
@mildbyte mildbyte merged commit 1d171bd into master Nov 26, 2021
@mildbyte mildbyte deleted the feature/dbt-data-source branch November 26, 2021 09:47
mildbyte added a commit that referenced this pull request Dec 17, 2021
Fleshing out the `splitgraph.yml` (aka `repositories.yml`) format that defines a Splitgraph Cloud "project" (datasets, their sources and metadata).

Existing users of `repositories.yml` don't need to change anything, though note that `sgr cloud` commands using the YAML format will now default to `splitgraph.yml` unless explicitly set to `repositories.yml`.


New sgr cloud commands:

See #582 and #587

These let users manipulate Splitgraph Cloud and ingestion jobs from the CLI:

  * `sgr cloud status`: view the status of ingestion jobs in the current project
  * `sgr cloud logs`: view job logs
  * `sgr cloud upload`: upload a CSV file to Splitgraph Cloud (without using the engine)
  * `sgr cloud sync`: trigger a one-off load of a dataset
  * `sgr cloud stub`: generate a `splitgraph.yml` file
  • `sgr cloud seed`: generate a Splitgraph Cloud project with a `splitgraph.yml`, GitHub Actions, dbt, etc.
  * `sgr cloud validate`: merge multiple project files and output the result (like `docker-compose config`)
  * `sgr cloud download`: download a query result from Splitgraph Cloud as a CSV file, bypassing time/query size limits.


repositories.yml/splitgraph.yml format:

Change various commands that use `repositories.yml` to default to `splitgraph.yml` instead. Allow "mixing in" multiple `.yml` files Docker Compose-style, useful for splitting credentials (and not checking them in) and data settings.
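The Docker Compose-style "mixing in" amounts to a recursive merge of the project files. The sketch below is an assumption about the merge semantics (later files override scalars, mappings merge recursively), not Splitgraph's exact implementation; the keys in the example dicts are illustrative.

```python
def deep_merge(base, override):
    """Recursively merge `override` into `base`: nested mappings are merged,
    everything else (scalars, lists) is replaced by the overriding value."""
    result = dict(base)
    for key, value in override.items():
        if isinstance(result.get(key), dict) and isinstance(value, dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = value
    return result

# splitgraph.yml holds data settings; credentials live in an uncommitted file.
project = {
    "repositories": [{"namespace": "org", "repository": "data"}],
    "credentials": {},
}
secrets = {"credentials": {"csv": {"data": {"s3_secret_key": "..."}}}}
print(deep_merge(project, secrets))
```

This is the same shape of operation that `sgr cloud validate` exposes: merge all the `.yml` inputs and print the effective project, like `docker-compose config`.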

Temporary location for the new full documentation on `splitgraph.yml`: https://github.com/splitgraph/splitgraph.com/blob/f7ac524cb5023091832e8bf51b277991c435f241/content/docs/0900_splitgraph-cloud/0500_splitgraph-yml.mdx


Miscellaneous:

  * Initial backend support for "transforming" Splitgraph plugins, including dbt (#574)
  * Dump scheduled ingestion/transformation jobs with `sgr cloud dump` (#577)