Building components of a modern data stack using sample e-commerce data from Google BigQuery. Specifically focused on:
- Transformation: using dbt to build pipelines with testing, CI/CD
- Orchestration: using Prefect to integrate with dbt and schedule automated processes in Python
We will run dbt from our command line. However, you can also use dbt Cloud to run queries from the /models directory. See more details about Getting Started with dbt Core here.
- Initialize the dbt project with dbt init {project-name}. You will add this project-name elsewhere.
- Update values in the dbt_project.yml file as needed. At the least, you'll need to change:
  name: jaffle_shop # Change from the default, `my_new_project`
  ...
  profile: jaffle_shop # Change from the default profile name, `default`
  ...
  models:
    jaffle_shop: # Change from `my_new_project` to match the previous value for `name:`
      ...
- Set up a profile. After initializing, dbt will request the following profile information from you:
  - data warehouse (e.g. bigquery, redshift, snowflake)
  - authentication method (oauth or service_account)
  - keyfile (of the Service Account key)
  - project (i.e. GCP project ID)
  - dataset
  - location (US or EU)
  This will then output the file /{home-dir}/.dbt/profiles.yml. You can make adjustments to this file as needed. If you're connected through Jupyter Lab's Docker image, the profile is stored in /home/jovyan/.dbt/profiles.yml.
- Run dbt. From here, you can run dbt from the command line. Below are some samples:
  dbt run # run all dbt scripts
  dbt run -s order_metrics_by_day # run a specific dbt script
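The sample commands above can also be launched from Python, which becomes convenient once an orchestrator is involved. The following is a minimal sketch using only the standard library. The helper names `build_dbt_command` and `run_dbt` are hypothetical, and the sketch assumes the `dbt` executable is on your PATH and that `project_dir` points at the dbt project.

```python
import subprocess
from typing import List, Optional


def build_dbt_command(select: Optional[str] = None) -> List[str]:
    """Build the argument list for `dbt run`, optionally limited to one model."""
    command = ["dbt", "run"]
    if select:
        # `-s` is the short form of dbt's `--select` flag.
        command += ["-s", select]
    return command


def run_dbt(select: Optional[str] = None, project_dir: str = ".") -> int:
    """Run dbt inside `project_dir` and return its exit code (hypothetical helper)."""
    completed = subprocess.run(build_dbt_command(select), cwd=project_dir)
    return completed.returncode
```

Calling `run_dbt("order_metrics_by_day")` mirrors the second sample command; a non-zero return value indicates that dbt reported a failure.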
...
We must first set up our Prefect Cloud account before interacting with it locally.
- Sign in or register a Prefect Cloud account.
- Create a profile name if requested.
- Create a workspace for your account, or enter an existing workspace.
- Create an API key to authorize a local execution environment. If you already have a key, access it here.
- Log into Prefect Cloud with prefect cloud login. Use the API key you created during the setup.
- (Optional) To change our workspace, enter: prefect cloud workspace set.
- Start the Orion UI: prefect orion start.
  - Note: Prefect points you to the server: http://127.0.0.1:4200/. If you have logged into Prefect Cloud, runs will be visible in your cloud account.
To run dbt from Prefect, we must install the necessary packages and create a connection between the two services. This connection is made through Prefect Blocks.
- Install the prefect-dbt package:
  - pip install prefect-dbt to install
  - import prefect_dbt to reference in code
- Register prefect_dbt blocks with prefect block register -m prefect_dbt.
- Create dbt-specific blocks (CliProfile and targetConfigs) ... update here
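Once the package is installed, a Prefect flow can call dbt through the `trigger_dbt_cli_command` task that ships with prefect-dbt. The following is a minimal sketch, not this project's actual flow. The names `DBT_PROJECT_DIR`, `dbt_commands`, `build_flow`, and the flow name are hypothetical, the `dbt test` step is an assumption added for illustration, and the sketch relies on an existing profiles.yml rather than on the blocks described above. The Prefect imports are deferred so the module can be loaded without Prefect installed.

```python
from pathlib import Path
from typing import List, Optional

# Hypothetical location of the dbt project; adjust to your checkout.
DBT_PROJECT_DIR = Path("jaffle_shop")


def dbt_commands(model: Optional[str] = None) -> List[str]:
    """Return the dbt CLI commands to run, in order, as plain strings."""
    select = f" --select {model}" if model else ""
    return [f"dbt run{select}", f"dbt test{select}"]


def build_flow():
    """Create the Prefect flow; requires `prefect` and `prefect-dbt`."""
    # Imported lazily so the rest of this module works without Prefect.
    from prefect import flow
    from prefect_dbt.cli.commands import trigger_dbt_cli_command

    @flow(name="dbt-transformations")  # hypothetical flow name
    def run_dbt(model: Optional[str] = None):
        for command in dbt_commands(model):
            # Each command runs as a Prefect task inside the dbt project directory.
            trigger_dbt_cli_command(command, project_dir=str(DBT_PROJECT_DIR))

    return run_dbt
```

With both packages installed, `build_flow()(model="order_metrics_by_day")` would run and then test that single model, and the run should appear in the Prefect UI.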
You can schedule and run processes from the Orion Cloud UI. We do this through a Deployment, which consists of:
- Flow. ...
- Work Queue. This is a set (or queue) of Flows to be run in a Deployment. An Agent runs Flows in a given Work Queue. In order to initialize the Work Queue, run:
prefect agent start --work-queue "{work_queue_name}"
In order to run deployments remotely (i.e. from the Orion Cloud), we must add these Blocks to our deployment:
- Storage
- Infrastructure