SQLMesh has native support for running dbt projects with its dbt adapter.
!!! tip
    If you've never used SQLMesh before, learn the basics of how it works in the [SQLMesh Quickstart](../quick_start.md)!
SQLMesh is a Python library you install with the `pip` command. We recommend running your SQLMesh projects in a Python virtual environment, which must be created and activated before running any `pip` commands.
Most people do not use all of SQLMesh's functionality. For example, most projects only run on one SQL execution engine. Therefore, SQLMesh is packaged with multiple "extras," which you may optionally install based on the functionality your project needs. You may specify all your project's extras in a single `pip` call.
At minimum, using the SQLMesh dbt adapter requires installing the `dbt` extra:

```bash
pip install "sqlmesh[dbt]"
```
If your project uses any SQL execution engine other than DuckDB, you must install the extra for that engine. For example, if your project runs on the Postgres SQL engine:

```bash
pip install "sqlmesh[dbt,postgres]"
```
If you would like to use the SQLMesh Browser UI to view column-level lineage, include the `web` extra:

```bash
pip install "sqlmesh[dbt,web]"
```
Learn more about SQLMesh installation and extras here.
Prepare an existing dbt project to be run by SQLMesh by executing the `sqlmesh init` command within the dbt project root directory with the `dbt` template option:

```bash
sqlmesh init -t dbt
```
SQLMesh will use the data warehouse connection target in your dbt project `profiles.yml` file. The target can be changed at any time.
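For example, with a hypothetical `profiles.yml` like the one below (the project name, outputs, and credentials are all illustrative), SQLMesh would connect using the `dev` output selected by the `target` field:

```yaml
# hypothetical profiles.yml - all names and credentials are illustrative
my_dbt_project:
  target: dev          # the output SQLMesh connects with
  outputs:
    dev:
      type: duckdb
      path: local.duckdb
    prod:
      type: postgres
      host: localhost
      port: 5432
      user: app
      password: "{{ env_var('POSTGRES_PASSWORD') }}"
      dbname: analytics
      schema: public
      threads: 4
```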
Models require a start date for backfilling data through use of the `start` configuration parameter. `start` can be defined individually for each model in its `config` block or globally in the `dbt_project.yml` file as follows:

```yaml
models:
  +start: Jan 1 2000
```
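To set `start` for a single model instead, it can be placed in the model's `config` block; a minimal sketch:

```sql
{# setting start in an individual model's config block #}
{{
  config(
    start='Jan 1 2000'
  )
}}
```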
SQLMesh derives a project's configuration from its dbt configuration files. This section outlines additional settings specific to SQLMesh that can be defined.
Certain engines, like Trino, cannot be used to store SQLMesh's state. As a workaround, we recommend specifying a supported state engine using the `state_connection` argument instead. Learn more about how to configure state connections in Python here.
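As a sketch, assuming a local DuckDB file is an acceptable state store (the file name is illustrative), the project's `config.py` (discussed just below) could pass a `state_connection` like this:

```python
from pathlib import Path

from sqlmesh.core.config import DuckDBConnectionConfig
from sqlmesh.dbt.loader import sqlmesh_config

config = sqlmesh_config(
    Path(__file__).parent,
    # keep SQLMesh state out of the unsupported warehouse engine;
    # the DuckDB file name is illustrative
    state_connection=DuckDBConnectionConfig(database="sqlmesh_state.db"),
)
```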
dbt supports passing variable values at runtime with its CLI `vars` option.

In SQLMesh, these variables are passed via configurations. When you initialize a dbt project with `sqlmesh init`, a file `config.py` is created in your project directory. The file creates a SQLMesh `config` object pointing to the project directory:

```python
config = sqlmesh_config(Path(__file__).parent)
```
Specify runtime variables by adding a Python dictionary to the `sqlmesh_config()` `variables` argument. For example, we could specify the runtime variable `is_marketing` and its value `no` as:

```python
config = sqlmesh_config(
    Path(__file__).parent,
    variables={"is_marketing": "no"}
)
```
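Models can then read the value with dbt's standard `var()` jinja method. A minimal sketch (the model and column names are illustrative):

```sql
SELECT *
FROM {{ ref('customers') }}
{% if var('is_marketing') == 'yes' %}
-- illustrative filter applied only under the marketing configuration
WHERE segment = 'marketing'
{% endif %}
```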
Some projects use combinations of runtime variables to control project behavior. Different combinations can be specified in different `sqlmesh_config` objects, with the relevant configuration passed to the SQLMesh CLI command.

For example, consider a project with a special configuration for the `marketing` department. We could create separate configurations to pass at runtime like this:

```python
config = sqlmesh_config(
    Path(__file__).parent,
    variables={"is_marketing": "no", "include_pii": "no"}
)

marketing_config = sqlmesh_config(
    Path(__file__).parent,
    variables={"is_marketing": "yes", "include_pii": "yes"}
)
```
By default, SQLMesh will use the configuration object named `config`. Use a different configuration by passing the object name to SQLMesh CLI commands with the `--config` option. For example, we could run a `plan` with the marketing configuration like this:

```bash
sqlmesh --config marketing_config plan
```

Note that the `--config` option is specified between the word `sqlmesh` and the command being executed (e.g., `plan`, `run`).
SQLMesh automatically registers model descriptions and column comments with the target SQL engine, as described in the Models Overview documentation. Comment registration is on by default for all engines that support it.
dbt offers similar comment registration functionality via its `persist_docs` model configuration parameter, specified by model. SQLMesh comment registration is configured at the project level, so it does not use dbt's model-specific `persist_docs` configuration.
SQLMesh's project-level comment registration defaults are overridden with the `sqlmesh_config()` `register_comments` argument. For example, this configuration turns comment registration off:

```python
config = sqlmesh_config(
    Path(__file__).parent,
    register_comments=False,
)
```
Run SQLMesh as you would with any SQLMesh project: generate and apply plans, run tests or audits, and execute models with a scheduler if desired. You continue to use your dbt file and project format.
Consider the following when using a dbt project:

- SQLMesh will detect and deploy new or modified seeds as part of running the `plan` command and applying changes - there is no separate seed command. Refer to seed models for more information.
- The `plan` command dynamically creates environments, so environments do not need to be hardcoded into your `profiles.yml` file as targets. To get the most out of SQLMesh, point your dbt profile target at the production target and let SQLMesh handle the rest for you.
- The term "test" has a different meaning in dbt than in SQLMesh:
    - dbt "tests" are audits in SQLMesh.
    - SQLMesh "tests" are unit tests, which test query logic before applying a SQLMesh plan.
- dbt's recommended incremental logic is not compatible with SQLMesh, so small tweaks to the models are required (don't worry - dbt can still use the models!).
Incremental loading is a powerful technique when datasets are large and recomputing tables is expensive. SQLMesh offers first-class support for incremental models, and its approach differs from dbt's.
This section describes how to adapt dbt's incremental models to run on SQLMesh while maintaining backwards compatibility with dbt.
SQLMesh supports two approaches to implement idempotent incremental loads:

- Using merge (with the SQLMesh `INCREMENTAL_BY_UNIQUE_KEY` model kind)
- Using insert-overwrite/delete+insert (with the SQLMesh `INCREMENTAL_BY_TIME_RANGE` model kind)
To enable `INCREMENTAL_BY_UNIQUE_KEY` incrementality, the model configuration should contain (see the example after this list):

- The `unique_key` key with the model's unique key field name or names as the value
- The `materialized` key with value `'incremental'`
- Either:
    - No `incremental_strategy` key, or
    - The `incremental_strategy` key with value `'merge'`
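For example, a model with this `config` block (the `id` key column is illustrative) would run as `INCREMENTAL_BY_UNIQUE_KEY`:

```sql
{# the unique_key column name is illustrative #}
{{
  config(
    materialized='incremental',
    unique_key='id',
    incremental_strategy='merge'
  )
}}
```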
To enable `INCREMENTAL_BY_TIME_RANGE` incrementality, the model configuration should contain (see the example after this list):

- The `time_column` key with the model's time column field name as the value (see time column for details)
- The `materialized` key with value `'incremental'`
- Either:
    - The `incremental_strategy` key with value `'insert_overwrite'`, or
    - The `incremental_strategy` key with value `'delete+insert'`
    - Note: in this context, these two strategies are synonyms. Regardless of which one is specified, SQLMesh will use the best incremental strategy for the target engine.
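For example, a model with this `config` block (the `ds` time column is illustrative) would run as `INCREMENTAL_BY_TIME_RANGE`:

```sql
{# the time column name 'ds' is illustrative #}
{{
  config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    time_column='ds'
  )
}}
```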
SQLMesh requires a new jinja block gated by `{% if sqlmesh_incremental is defined %}`. The new block should supersede the existing `{% if is_incremental() %}` block and contain the `WHERE` clause selecting the time interval.

For example, the SQL `WHERE` clause with the "ds" column goes in a new jinja block gated by `{% if sqlmesh_incremental is defined %}` as follows:
```sql
{% if sqlmesh_incremental is defined %}
WHERE
  ds BETWEEN '{{ start_ds }}' AND '{{ end_ds }}'
{% elif is_incremental() %}
  < your existing is_incremental block >
{% endif %}
```
`{{ start_ds }}` and `{{ end_ds }}` are the jinja equivalents of SQLMesh's `@start_ds` and `@end_ds` predefined time macro variables. See all predefined time variables available in jinja.
SQLMesh provides configuration parameters that enable control over how incremental computations occur. These parameters are set in the model's `config` block.
The `batch_size` parameter determines the maximum number of time intervals to run in a single job.

The `lookback` parameter is used to capture late-arriving data. It sets the number of units of late-arriving data the model should expect and must be a positive integer.
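As a sketch, both parameters sit alongside the other incremental settings in the model's `config` block (the values shown are illustrative):

```sql
{# batch_size and lookback values are illustrative #}
{{
  config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    time_column='ds',
    batch_size=7,
    lookback=2
  )
}}
```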
Note: By default, all incremental dbt models are configured to be forward-only. However, you can change this behavior by setting `forward_only: false` either in the configuration of an individual model or globally for all models in the `dbt_project.yml` file. The forward-only mode aligns more closely with the typical operation of dbt and therefore better meets users' expectations.
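For example, a global override in `dbt_project.yml` might look like this:

```yaml
models:
  # opt every incremental model out of forward-only mode
  +forward_only: false
```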
Similarly, the `allow_partials` parameter is set to `true` by default for incremental dbt models unless the time column is specified, or the `allow_partials` parameter is explicitly set to `false` in the model configuration.
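Both settings can also be overridden for an individual model in its `config` block; a minimal sketch:

```sql
{# per-model overrides for forward_only and allow_partials #}
{{
  config(
    materialized='incremental',
    forward_only=false,
    allow_partials=false
  )
}}
```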
SQLMesh automatically detects destructive schema changes to forward-only incremental models and to all incremental models in forward-only plans.
A model's `on_destructive_change` setting determines whether it errors (default), warns, or silently allows the changes. SQLMesh always allows non-destructive forward-only schema changes, such as adding or casting a column in place.

dbt `on_schema_change` configuration values are mapped to these SQLMesh `on_destructive_change` values:
| dbt `on_schema_change` | SQLMesh `on_destructive_change` |
|---|---|
| `ignore` | `warn` |
| `append_new_columns` | `warn` |
| `sync_all_columns` | `allow` |
| `fail` | `error` |
SQLMesh supports both dbt snapshot strategies, `timestamp` and `check`.

The only unsupported snapshot functionality is `invalidate_hard_deletes`, which must be set to `True`. If it is set to `False`, the snapshot will be skipped and a warning will be logged indicating this happened. Support for this will be added soon.
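For reference, a standard dbt snapshot that satisfies this requirement might look like the following (the snapshot, source, and column names are illustrative):

```sql
{% snapshot orders_snapshot %}

{{
  config(
    target_schema='snapshots',
    unique_key='id',
    strategy='timestamp',
    updated_at='updated_at',
    invalidate_hard_deletes=True
  )
}}

select * from {{ source('app', 'orders') }}

{% endsnapshot %}
```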
SQLMesh uses dbt tests to perform SQLMesh audits (coming soon).
Add SQLMesh unit tests to a dbt project by placing them in the "tests" directory.
SQLMesh parses seed CSV files using Pandas' `read_csv` utility and its default column type inference. dbt parses seed CSV files using agate's csv reader and customizes agate's default type inference.

If SQLMesh and dbt infer different column types for a seed CSV file, you may specify a `column_types` dictionary in your `dbt_project.yml` file, where the keys define the column names and the values the data types:
```yaml
seeds:
  <seed name>:
    +column_types:
      <column name>: <SQL data type>
```
Alternatively, you can define this dictionary in the seed's properties configuration file:
```yaml
seeds:
  - name: <seed name>
    config:
      column_types:
        <column name>: <SQL data type>
```
You may also specify a column's SQL data type in its `data_type` key, as shown below. The file must list all columns present in the CSV file; SQLMesh's default type inference will be used for columns that do not specify the `data_type` key.
```yaml
seeds:
  - name: <seed name>
    columns:
      - name: <column name>
        data_type: <SQL data type>
```
SQLMesh does not have its own package manager; however, SQLMesh's dbt adapter is compatible with dbt's package manager. Continue to use `dbt deps` and `dbt clean` to update, add, or remove packages.
Model documentation is available in the SQLMesh UI.
SQLMesh supports running dbt projects using the majority of dbt jinja methods, including:
| Method | Method | Method | Method |
|---|---|---|---|
| adapter (*) | env_var | project_name | target |
| as_bool | exceptions | ref | this |
| as_native | from_yaml | return | to_yaml |
| as_number | is_incremental | run_query | var |
| as_text | load_result | schema | zip |
| api | log | set | |
| builtins | modules | source | |
| config | statement | | |
\* `adapter.rename_relation` and `adapter.expand_target_column_types` are not currently supported.
The dbt jinja methods that are not currently supported are:

- `debug`
- `selected_sources`
- `adapter.expand_target_column_types`
- `adapter.rename_relation`
- `graph.nodes.values`
- `graph.metrics.values`
Submit an issue, and we'll look into it!