This repo is an end-to-end implementation of
TPC-DI. The repo is designed to
be run against Databricks, Snowflake, BigQuery, Redshift, and Synapse.
You can select your desired target warehouse with the different targets
defined in profiles.yml; for example, `dbt run --target databricks`
will build the TPC-DI tables in Databricks, while `dbt run --target redshift`
will build them in Redshift. The default is to build the tables in Databricks, but you
can build them in another warehouse by adjusting the model-paths key to the desired warehouse.
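For example, assuming your profiles.yml defines one target per supported warehouse (the target names below are illustrative; use whatever names exist in your own profiles.yml), a build can be pointed at any of them:

```bash
# Target names are placeholders -- match them to your profiles.yml
dbt run --target databricks   # default
dbt run --target snowflake
dbt run --target bigquery
dbt run --target redshift
dbt run --target synapse
```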
The steps below are applicable regardless of the target warehouse. For warehouse-specific considerations,
refer to the README within the relevant subdirectory of the models directory.
This project uses the TPC-DI Data Generator from the TPC-DI Kit (https://github.com/haydarai/tpcdi-kit).
Using the DIGen Tool
java -jar DIGen.jar -o ../staging/10/ -sf 10
Once the data files are generated, upload them to your working cloud storage account. We recommend using the CLI tool for the relevant cloud (AWS CLI for Redshift, Azure CLI for Synapse, gcloud CLI for BigQuery, etc.).
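As a sketch, with the AWS CLI the generated files could be copied to an S3 bucket along these lines (the bucket name and prefix are placeholders; the other clouds have analogous copy commands such as `az storage blob upload-batch` and `gsutil cp`):

```bash
# Placeholder bucket and prefix -- substitute your own staging location
aws s3 cp ../staging/10/ s3://<your-bucket>/tpcdi/sf=10/ --recursive
```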
Please refer to the individual READMEs to determine how best to handle the CustomerMgmt XML file.
(Note: do not run `dbt deps`, as some of the packages/macros the code relies on are custom.)
- Create your prod and staging schemas (if not already created)
- Update profiles.yml with your prod schema and other warehouse-specific configurations
- Update project.yml with the desired scalefactor, bucketname, and staging schema
- `dbt run-operation stage_external_sources`
- `dbt run` (a full example sequence is sketched below)
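Put together, a run against the default Databricks target might look like the sketch below; the target name comes from your profiles.yml and may differ in your setup:

```bash
dbt run-operation stage_external_sources --target databricks   # stage the raw files as external sources
dbt run --target databricks                                     # build the TPC-DI models
```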
Additional contributions are welcome.
For small changes, please submit a pull request with your changes.
For larger changes that would change the majority of code in a single file or span multiple files, please open an issue first for discussion.
This project uses several formatters and linters for different file types. Each linter should be run before submitting
a pull request, and the appropriate changes should be made to resolve any errors or warnings. This project has
built-in checks that run automatically when a pull request is submitted and must pass before the pull request
can be merged. The file `dev-requirements.txt` contains the packages needed to run the linters; they can all be
installed at once with `pip install -r dev-requirements.txt`. It's considered best practice to install the packages
in a Python virtual environment.
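A minimal sketch of that setup, assuming a virtual environment named `.venv` (any name or virtual environment tool works):

```bash
python -m venv .venv                  # create an isolated environment
source .venv/bin/activate             # activate it (.venv\Scripts\activate on Windows)
pip install -r dev-requirements.txt   # install the formatters and linters
```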
Instructions for installing each package individually are linked below if you do not wish to use the `dev-requirements.txt` file.
We use yamllint to lint the YAML files. To run the linter, run `yamllint .` from the root of the project. Errors and warnings will be printed to the console for review.
Instructions to install yamllint.
We use black to format and flake8 to lint Python files. In addition, black-jupyter is an extension of black
for Jupyter notebooks, which end with an `.ipynb` extension. We recommend running `black .` to format the code
before running `flake8 .` to lint it, since black will auto-correct many flake8 violations. Both tools print
errors and warnings to the console for review.
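For convenience, the two commands in the recommended order, run from the project root:

```bash
black .     # auto-format Python files and notebooks
flake8 .    # report any remaining lint violations
```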
Instructions to install black. Instructions to install flake8. Instructions to install black-jupyter.
We use sqlfluff to format and lint the SQL files. To run the formatter, run `sqlfluff format .`, and to run the linter, run `sqlfluff lint .` from the root of the project. Errors and warnings will be printed to the console for review.
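Run from the project root:

```bash
sqlfluff format .   # apply formatting fixes in place
sqlfluff lint .     # report remaining rule violations
```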
No additional configuration, including specification of the SQL dialect, is required: the project uses a global
`.sqlfluff` file at the root directory to manage shared configuration and a `.sqlfluff` file in each warehouse
directory to specify that warehouse's dialect.