Audits are one of the tools SQLMesh provides to validate your models. Along with tests, they are a great way to ensure the quality of your data and to build trust in it across your organization.
Unlike tests, audits are used to validate the output of a model after every run. When you apply a plan, SQLMesh will automatically run each model's audits.
By default, SQLMesh will halt plan application when an audit fails so potentially invalid data does not propagate further downstream. This behavior can be changed for individual audits - refer to Non-blocking audits.
A comprehensive suite of audits can identify data issues upstream, whether they are from your vendors or other teams. Audits also empower your data engineers and analysts to work with confidence by catching problems early as they work on new features or make updates to your models.
NOTE: For incremental models, audits are only applied to intervals being processed - not for the entire underlying table.
In SQLMesh, user-defined audits are defined in .sql
files in an audits
directory in your SQLMesh project. Multiple audits can be defined in a single file, so you can organize them to your liking. Alternatively, audits can be defined inline within the model definition itself.
Audits are SQL queries that should not return any rows; in other words, they query for bad data, so returned rows indicates that something is wrong.
In its simplest form, an audit is defined with the AUDIT
statement along with a query, as in the following example:
AUDIT (
name assert_item_price_is_not_null,
dialect spark
);
SELECT * from sushi.items
WHERE
ds BETWEEN @start_ds AND @end_ds
AND price IS NULL;
In the example, we defined an audit named assert_item_price_is_not_null
, ensuring that every sushi item has a price.
Note: If the query is in a different dialect than the rest of your project, you can specify it in the AUDIT
statement. In the example above we set it to spark
, so SQLGlot will automatically understand how to execute the query behind the scenes.
To run the audit, include it in a model's MODEL
statement:
MODEL (
name sushi.items,
audits (assert_item_price_is_not_null)
);
Now assert_item_price_is_not_null
will run every time the sushi.items
model is run.
Audits can also be parameterized and implemented in a model-agnostic way so the same audit can be used for multiple models.
Consider the following audit definition that checks whether the target column exceeds a configured threshold:
AUDIT (
name does_not_exceed_threshold
);
SELECT * FROM @this_model
WHERE @column >= @threshold;
This example utilizes macros to parameterize the audit. @this_model
is a special macro which refers to the model that is being audited. For incremental models, this macro also ensures that only relevant data intervals are affected.
@column
and @threshold
are parameters whose values are specified in a model definition's MODEL
statement.
Apply the generic audit to a model by referencing it in the MODEL
statement:
MODEL (
name sushi.items,
audits (
does_not_exceed_threshold(column := id, threshold := 1000),
does_not_exceed_threshold(column := price, threshold := 100)
)
);
Notice how column
and threshold
parameters have been set. These values will be propagated into the audit query and substituted into the @column
and @threshold
macro variables.
Note that the same audit can be applied more than once to the a model using different sets of parameters.
Generic audits can define default values as follows:
AUDIT (
name does_not_exceed_threshold,
defaults (
threshold = 10,
column = id
)
);
SELECT * FROM @this_model
WHERE @column >= @threshold;
Alternatively, you can apply specific audits globally by including them in the model defaults configuration:
model_defaults:
audits:
- assert_positive_order_ids
- does_not_exceed_threshold(column := id, threshold := 1000)
We recommended avoiding SQL keywords when naming audit parameters. Quote any audit argument that is also a SQL keyword.
For example, if an audit my_audit
uses a values
parameter, invoking it will require quotes because values
is a SQL keyword:
MODEL (
name sushi.items,
audits (
my_audit(column := a, "values" := (1,2,3))
)
)
You can also define audits directly within a model definition using the same syntax. Multiple audits can be specified within the same SQL model file:
MODEL (
name sushi.items,
audits(does_not_exceed_threshold(column := id, threshold := 1000), price_is_not_null)
);
SELECT id, price
FROM sushi.seed;
AUDIT (name does_not_exceed_threshold);
SELECT * FROM @this_model
WHERE @column >= @threshold;
AUDIT (name price_is_not_null);
SELECT * FROM @this_model
WHERE price IS NULL;
SQLMesh comes with a suite of built-in generic audits that cover a broad set of common use cases. Built-in audits are blocking by default, but they all have non-blocking counterparts which you can use by appending _non_blocking
- see Non-blocking audits.
This section describes the audits, grouped by general purpose.
The forall
audit is the most generic built-in audit, allowing arbitrary boolean SQL expressions.
Ensures that a set of arbitrary boolean expressions evaluate to TRUE
for all rows in the model. The boolean expressions should be written in SQL.
This example asserts that all rows have a price
greater than 0 and a name
value containing at least one character:
MODEL (
name sushi.items,
audits (
forall(criteria := (
price > 0,
LENGTH(name) > 0
))
)
);
These audits concern row counts and presence of NULL
values.
Ensures that the number of rows in the model's table exceeds the threshold.
This example asserts that the model has more than 10 rows:
MODEL (
name sushi.orders,
audits (
number_of_rows(threshold := 10)
)
);
Ensures that specified columns do not contain NULL
values.
This example asserts that none of the id
, customer_id
, or waiter_id
columns contain NULL
values:
MODEL (
name sushi.orders,
audits (
not_null(columns := (id, customer_id, waiter_id))
)
);
Ensures that specified columns contain at least one non-NULL value.
This example asserts that the zip
column contains at least one non-NULL value:
MODEL (
name sushi.customers,
audits (
at_least_one(column := zip)
)
);
Ensures that the specified column's proportion of NULL
values is no greater than a threshold.
This example asserts that the zip
column has no more than 80% NULL
values:
MODEL (
name sushi.customers,
audits (
not_null_proportion(column := zip, threshold := 0.8)
)
);
These audits concern the specific set of data values present in a column.
Ensures that the specified columns are not constant (i.e., have at least two non-NULL values).
This example asserts that the column customer_id
has at least two non-NULL values:
MODEL (
name sushi.customer_revenue_by_day,
audits (
not_constant(column := customer_id)
)
);
Ensures that specified columns contain unique values (i.e., have no duplicated values).
This example asserts that the id
and item_id
columns have unique values:
MODEL (
name sushi.orders,
audits (
unique_values(columns := (id, item_id))
)
);
Ensures that each row has a unique combination of values over the specified columns.
This example asserts that the combination of id
and ds
columns has no duplicated values:
MODEL (
name sushi.orders,
audits (
unique_combination_of_columns(columns := (id, ds))
)
);
Ensures that all rows of the specified column contain one of the accepted values.
!!! note
Rows with NULL
values for the column will pass this audit in most databases/engines. Use the not_null
audit to ensure there are no NULL
values present in a column.
This example asserts that column name
has a value of 'Hamachi', 'Unagi', or 'Sake':
MODEL (
name sushi.items,
audits (
accepted_values(column := name, is_in := ('Hamachi', 'Unagi', 'Sake'))
)
);
Ensures that no rows of the specified column contain one of the not accepted values.
!!! note
This audit does not support rejecting NULL
values. Use the not_null
audit to ensure there are no NULL
values present in a column.
This example asserts that column name
is not one of 'Hamburger' or 'French fries':
MODEL (
name sushi.items,
audits (
not_accepted_values(column := name, is_in := ('Hamburger', 'French fries'))
)
);
These audits concern the distribution of values in numeric columns.
Ensures that each of an ordered numeric column's values contains the previous row's value plus interval
.
For example, with a column having minimum value 1 and maximum value 4 and interval := 1
, it ensures that the rows contain values [1, 2, 3, 4]
.
This example asserts that column item_id
contains sequential values that differ by 1
:
MODEL (
name sushi.items,
audits (
sequential_values(column := item_id, interval := 1)
)
);
Ensures that a column's values are in a numeric range. Range is inclusive by default, such that values equal to the range boundaries will pass the audit.
This example asserts that all rows have a price
greater than or equal 1 and less than or equal to 100:
MODEL (
name sushi.items,
audits (
accepted_range(column := price, min_v := 1, max_v := 100)
)
);
This example specifies the inclusive := false
argument to assert that all rows have a price
greater than 0 and less than 100:
MODEL (
name sushi.items,
audits (
accepted_range(column := price, min_v := 0, max_v := 100, inclusive := false)
)
);
Ensures that each row's numeric range does not overlap with any other row's range.
This example asserts that each row's range [min_price, max_price] does not overlap with any other row's range:
MODEL (
audits (
mutually_exclusive_ranges(lower_bound_column := min_price, upper_bound_column := max_price)
)
);
These audits concern the characteristics of values in character/string columns.
!!! warning Databases/engines may exhibit different behavior for different character sets or languages.
Ensures that no rows of a column contain an empty string value ''
.
This example asserts that no name
is an empty string:
MODEL (
name sushi.items,
audits (
not_empty_string(column := name)
)
);
Ensures that all rows of a column contain a string with the specified number of characters.
This example asserts that all zip
values are 5 characters long:
MODEL (
name sushi.customers,
audits (
string_length_equal(column := zip, v := 5)
)
);
Ensures that all rows of a column contain a string with number of characters in the specified range. Range is inclusive by default, such that values equal to the range boundaries will pass the audit.
This example asserts that all name
values have 5 or more and 50 or fewer characters:
MODEL (
name sushi.customers,
audits (
string_length_between(column := name, min_v := 5, max_v := 50)
)
);
This example specifies the inclusive := false
argument to assert that all rows have a name
with 5 or more and 59 or fewer characters:
MODEL (
name sushi.customers,
audits (
string_length_between(column := zip, min_v := 4, max_v := 60, inclusive := false)
)
);
Ensures that all non-NULL rows of a column contain a string with the UUID structure.
UUID structure determined by matching regular expression '^[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$'
.
This example asserts that all uuid
values have the UUID structure:
MODEL (
audits (
valid_uuid(column := uuid)
)
);
Ensures that all non-NULL rows of a column contain a string with the email address structure.
Email address structure determined by matching regular expression '^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$'
.
This example asserts that all email
values have the email address structure:
MODEL (
audits (
valid_email(column := email)
)
);
Ensures that all non-NULL rows of a column contain a string with the URL structure.
URL structure determined by matching regular expression '^(https?|ftp)://[^\s/$.?#].[^\s]*$'
.
This example asserts that all url
values have the URL structure:
MODEL (
audits (
valid_url(column := url)
)
);
Ensures that all non-NULL rows of a column contain a valid HTTP method.
Valid HTTP methods determined by matching values GET
, POST
, PUT
, DELETE
, PATCH
, HEAD
, OPTIONS
, TRACE
, CONNECT
.
This example asserts that all http_method
values are valid HTTP methods:
MODEL (
audits (
valid_http_method(column := http_method)
)
);
Ensures that all non-NULL rows of a column match at least one of the specified regular expressions.
This example asserts that all todo
values match one of '^\d.*'
(string starts with a digit) or '.*!$'
(ends with an exclamation mark):
MODEL (
audits (
match_regex_pattern_list(column := todo, patterns := ('^\d.*', '.*!$'))
)
);
Ensures that no non-NULL rows of a column match any of the specified regular expressions.
This example asserts that no todo
values match one of '^!.*'
(string starts with an exclamation mark) or '.*\d$'
(ends with a digit):
MODEL (
audits (
match_regex_pattern_list(column := todo, patterns := ('^!.*', '.*\d$'))
)
);
Ensures that all non-NULL rows of a column are LIKE
at least one of the specified patterns.
This example asserts that all name
values are LIKE
one of 'jim%'
or 'pam%'
:
MODEL (
audits (
match_like_pattern_list(column := name, patterns := ('jim%', 'pam%'))
)
);
Ensures that no non-NULL rows of a column are LIKE
any of the specified patterns.
This example asserts that no name
values are LIKE
'%doe'
or '%smith'
:
MODEL (
audits (
not_match_like_pattern_list(column := name, patterns := ('%doe', '%smith'))
)
);
These audits concern the statistical distributions of numeric columns.
!!! note
Audit thresholds will likely require fine-tuning via trial and error for each column being audited.
Ensures that a numeric column's mean is in the specified range. Range is inclusive by default, such that values equal to the range boundaries will pass the audit.
This example asserts that the age
column has a mean of at least 21 and at most 50:
MODEL (
audits (
mean_in_range(column := age, min_v := 21, max_v := 50)
)
);
This example specifies the inclusive := false
argument to assert that age
has a mean greater than 18 and less than 65:
MODEL (
audits (
mean_in_range(column := age, min_v := 18, max_v := 65, inclusive := false)
)
);
Ensures that a numeric column's standard deviation is in the specified range. Range is inclusive by default, such that values equal to the range boundaries will pass the audit.
This example asserts that the age
column has a standard deviation of at least 2 and at most 5:
MODEL (
audits (
stddev_in_range(column := age, min_v := 2, max_v := 5)
)
);
This example specifies the inclusive := false
argument to assert that age
has a standard deviation greater than 3 and less than 6:
MODEL (
audits (
mean_in_range(column := age, min_v := 3, max_v := 6, inclusive := false)
)
);
Ensures that no rows of a numeric column contain a value whose absolute z-score exceeds the threshold.
z-score is calculated as ABS(([row value] - [column mean]) / NULLIF([column standard deviation], 0))
.
This example asserts that the age
column contains no rows with z-scores greater than 3:
MODEL (
audits (
z_score(column := age, threshold := 3)
)
);
Ensures that the symmetrised Kullback-Leibler divergence (aka "Jeffreys divergence" or "Population Stability Index") between two columns does not exceed a threshold.
This example asserts that the symmetrised KL Divergence between columns age
and reference_age
is less than or equal to 0.1:
MODEL (
audits (
kl_divergence(column := age, target_column := reference_age, threshold := 0.1)
)
);
Ensures that the chi-square statistic for two categorical columns does not exceed a critical value.
You can look up the critical value corresponding to a p-value with a table (such as this one) or by using the Python scipy library:
from scipy.stats import chi2
# critical value for p-value := 0.95 and degrees of freedom := 1
chi2.ppf(0.95, 1)
This example asserts that the chi-square statistic0 for columns user_state
and user_type
does not exceed 6.635:
MODEL (
audits (
chi_square(column := user_state, target_column := user_type, critical_value := 6.635)
)
);
You can execute audits with the sqlmesh audit
command as follows:
$ sqlmesh -p project audit -start 2022-01-01 -end 2022-01-02
Found 1 audit(s).
assert_item_price_is_not_null FAIL.
Finished with 1 audit error(s).
Failure in audit assert_item_price_is_not_null for model sushi.items (audits/items.sql).
Got 3 results, expected 0.
SELECT * FROM sqlmesh.sushi__items__1836721418_83893210 WHERE ds BETWEEN '2022-01-01' AND '2022-01-02' AND price IS NULL
Done.
When you apply a plan, SQLMesh will automatically run each model's audits.
By default, SQLMesh will halt the pipeline when an audit fails to prevent potentially invalid data from propagating further downstream. This behavior can be changed for individual audits - see Non-blocking audits.
Audits can be skipped by setting the skip
argument to true
as in the following example:
AUDIT (
name assert_item_price_is_not_null,
skip true
);
SELECT * from sushi.items
WHERE ds BETWEEN @start_ds AND @end_ds AND
price IS NULL;
By default, audits that fail will stop the execution of the pipeline to prevent bad data from propagating further.
An audit can be configured to notify you without blocking the execution of the pipeline when it fails, as in this example:
AUDIT (
name assert_item_price_is_not_null,
blocking false
);
SELECT * from sushi.items
WHERE ds BETWEEN @start_ds AND @end_ds AND
price IS NULL;
The blocking
property can also be set at the use-site of an audit, using the special blocking
argument:
MODEL (
name sushi.items,
audits (
does_not_exceed_threshold(column := id, threshold := 1000, blocking := false),
does_not_exceed_threshold(column := price, threshold := 100, blocking := false)
)
);