data-diff is a free, open-source tool that enables data professionals to detect differences in values between any two tables. It's fast, easy to use, and reliable. Even at massive scale.
Are you a developer with a deep understanding of databases and solid Python knowledge? We're hiring!
pip install data-diff
To try out bleeding-edge features, including materialization of results in your data warehouse:
pip install data-diff --pre
-
pip install 'data-diff[mysql]'
-
pip install 'data-diff[postgresql]'
-
pip install 'data-diff[snowflake]'
-
pip install 'data-diff[presto]'
-
pip install 'data-diff[oracle]'
-
pip install 'data-diff[trino]'
-
pip install 'data-diff[clickhouse]'
-
pip install 'data-diff[vertica]'
-
For BigQuery, see: https://pypi.org/project/google-cloud-bigquery/
Some drivers have dependencies that cannot be installed using pip
and still need to be installed manually.
Once you've installed data-diff
, you can run it from the command line.
data-diff DB1_URI TABLE1_NAME DB2_URI TABLE2_NAME [OPTIONS]
Be sure to read the How to Use section below which gets into specific details about how to build one of these commands depending on your database setup.
Here's an example command for your copy/pasting, taken from the screenshot above when we diffed data between Snowflake and Postgres.
data-diff \
postgresql://<username>:'<password>'@localhost:5432/<database> \
<table> \
"snowflake://<username>:<password>@<password>/<DATABASE>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>" \
<TABLE> \
-k activity_id \
-c activity \
-w "event_timestamp < '2022-10-10'"
Here's a code example from the video, where we compare data between two Snowflake tables within one database.
data-diff \
"snowflake://<username>:<password>@<password>/<DATABASE>/<SCHEMA_1>?warehouse=<WAREHOUSE>&role=<ROLE>" <TABLE_1> \
<SCHEMA_2>.<TABLE_2> \
-k org_id \
-c created_at -c is_internal \
-w "org_id != 1 and org_id < 2000" \
-m test_results_%t \
--materialize-all-rows \
--table-write-limit 10000
In both code examples, I've used <>
carrots to represent values that should be replaced with your values in the database connection strings. For the flags (-k
, -c
, etc.), I opted for "real" values (org_id
, is_internal
) to give you a more realistic view of what your command will look like.
We know, that data-diff DB1_URI TABLE1_NAME DB2_URI TABLE2_NAME [OPTIONS]
command can become long and dense. And maybe you're new to the command line.
We're here to help on slack if you have ANY questions as you use data-diff
in your workflow.
This section gets into more details, including:
- database-specific syntax
- the many options (flags) you can use beyond the examples presented above
- how to run
data-diff
using a TOML configuration file - how to run
data-diff
from Python
To run data-diff
from the command line, run this command:
data-diff DB1_URI TABLE1_NAME DB2_URI TABLE2_NAME [OPTIONS]
Let's break this down. Assume there are two tables stored in two databases, and you want to know the differences between those tables.
DB1_URI
will be a string thatdata-diff
uses to connect to the database where the first table is stored.TABLE1_NAME
is the name of the table in theDB1_URI
database.DB2_URI
will be a string thatdata-diff
uses to connect to the database where the second table is stored.TABLE2_NAME
is the name of the second table in theDB2_URI
database.[OPTIONS]
can be replaced with a variety of additional commands, detailed here.
Database | Connection string | Status |
---|---|---|
PostgreSQL >=10 | postgresql://<user>:'<password>'@<host>:5432/<database> |
💚 |
MySQL | mysql://<user>:<password>@<hostname>:5432/<database> |
💚 |
Snowflake | With password:"snowflake://<USER>:<password>@<ACCOUNT>/<DATABASE>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>" With SSO: "snowflake://<USER>@<ACCOUNT>/<DATABASE>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>&authenticator=externalbrowser" Note: Unless something is explicitly case sensitive (like your password) use all caps. |
💚 |
BigQuery | bigquery://<project>/<dataset> |
💚 |
Redshift | redshift://<username>:<password>@<hostname>:5439/<database> |
💚 |
Oracle | oracle://<username>:<password>@<hostname>/database |
💛 |
Presto | presto://<username>:<password>@<hostname>:8080/<database> |
💛 |
Databricks | databricks://<http_path>:<access_token>@<server_hostname>/<catalog>/<schema> |
💛 |
Trino | trino://<username>:<password>@<hostname>:8080/<database> |
💛 |
Clickhouse | clickhouse://<username>:<password>@<hostname>:9000/<database> |
💛 |
Vertica | vertica://<username>:<password>@<hostname>:5433/<database> |
💛 |
ElasticSearch | 📝 | |
Planetscale | 📝 | |
Pinot | 📝 | |
Druid | 📝 | |
Kafka | 📝 | |
DuckDB | 📝 | |
SQLite | 📝 |
- 💚: Implemented and thoroughly tested.
- 💛: Implemented, but not thoroughly tested yet.
- ⏳: Implementation in progress.
- 📝: Implementation planned. Contributions welcome.
If a database is not on the list, we'd still love to support it. Open an issue to discuss it.
Note: Because URLs allow many special characters, and may collide with the syntax of your command-line,
it's recommended to surround them with quotes. Alternatively, you may provide them in a TOML file via the --config
option.
--help
- Show help message and exit.-k
or--key-columns
- Name of the primary key column. If none provided, default is 'id'.-t
or--update-column
- Name of updated_at/last_updated column-c
or--columns
- Names of extra columns to compare. Can be used more than once in the same command. Accepts a name or a pattern like in SQL. Example:-c col% -c another_col -c %foorb.r%
-l
or--limit
- Maximum number of differences to find (limits maximum bandwidth and runtime)-s
or--stats
- Print stats instead of a detailed diff-d
or--debug
- Print debug info-v
or--verbose
- Print extra info-i
or--interactive
- Confirm queries, implies--debug
--json
- Print JSONL output for machine readability--min-age
- Considers only rows older than specified. Useful for specifying replication lag. Example:--min-age=5min
ignores rows from the last 5 minutes. Valid units:d, days, h, hours, min, minutes, mon, months, s, seconds, w, weeks, y, years
--max-age
- Considers only rows younger than specified. See--min-age
.-j
or--threads
- Number of worker threads to use per database. Default=1.-w
,--where
- An additional 'where' expression to restrict the search space.--conf
,--run
- Specify the run and configuration from a TOML file. (see below)--no-tracking
- data-diff sends home anonymous usage data. Use this to disable it.
The following two options are not available when using the pre release In-DB feature:
--bisection-threshold
- Minimal size of segment to be split. Smaller segments will be downloaded and compared locally.--bisection-factor
- Segments per iteration. When set to 2, it performs binary search.
In-DB commands, available in pre release only:
-m
,--materialize
- Materialize the diff results into a new table in the database. If a table exists by that name, it will be replaced. Use%t
in the name to place a timestamp. Example:-m test_mat_%t
--assume-unique-key
- Skip validating the uniqueness of the key column during joindiff, which is costly in non-cloud dbs.--sample-exclusive-rows
- Sample several rows that only appear in one of the tables, but not the other. Use with-s
.--materialize-all-rows
- Materialize every row, even if they are the same, instead of just the differing rows.--table-write-limit
- Maximum number of rows to write when creating materialized or sample tables, per thread. Default=1000.-a
,--algorithm
[auto|joindiff|hashdiff]
- Force algorithm choice
Data-diff lets you load the configuration for a run from a TOML file.
Reasons to use a configuration file:
-
Convenience: Set-up the parameters for diffs that need to run often
-
Easier and more readable: You can define the database connection settings as config values, instead of in a URI.
-
Gives you fine-grained control over the settings switches, without requiring any Python code.
Use --conf
to specify that path to the configuration file. data-diff will load the settings from run.default
, if it's defined.
Then you can, optionally, use --run
to choose to load the settings of a specific run, and override the settings run.default
. (all runs extend run.default
, like inheritance).
Finally, CLI switches have the final say, and will override the settings defined by the configuration file, and the current run.
Example TOML file:
# Specify the connection params to the test database.
[database.test_postgresql]
driver = "postgresql"
user = "postgres"
password = "Password1"
# Specify the default run params
[run.default]
update_column = "timestamp"
verbose = true
# Specify params for a run 'test_diff'.
[run.test_diff]
verbose = false
# Source 1 ("left")
1.database = "test_postgresql" # Use options from database.test_postgresql
1.table = "rating"
# Source 2 ("right")
2.database = "postgresql://postgres:Password1@/" # Use URI like in the CLI
2.table = "rating_del1"
In this example, running data-diff --conf myconfig.toml --run test_diff
will compare between rating
and rating_del1
.
It will use the timestamp
column as the update column, as specified in run.default
. However, it won't be verbose, since that
flag is overwritten to false
.
Running it with data-diff --conf myconfig.toml --run test_diff -v
will set verbose back to true
.
API reference: https://data-diff.readthedocs.io/en/latest/
Example:
# Optional: Set logging to display the progress of the diff
import logging
logging.basicConfig(level=logging.INFO)
from data_diff import connect_to_table, diff_tables
table1 = connect_to_table("postgresql:///", "table_name", "id")
table2 = connect_to_table("mysql:///", "table_name", "id")
for different_row in diff_tables(table1, table2):
plus_or_minus, columns = different_row
print(plus_or_minus, columns)
Run help(diff_tables)
or read the docs to learn about the different options.
- Open an issue or chat with us on slack.
- Interested in contributing to this open source project? Please see our Contributing Guideline!
- Did we mention we're hiring?
data-diff collects anonymous usage data to help our team improve the tool and to apply development efforts to where our users need them most.
We capture two events: one when the data-diff run starts, and one when it is finished. No user data or potentially sensitive information is or ever will be collected. The captured data is limited to:
- Operating System and Python version
- Types of databases used (postgresql, mysql, etc.)
- Sizes of tables diffed, run time, and diff row count (numbers only)
- Error message, if any, truncated to the first 20 characters.
- A persistent UUID to indentify the session, stored in
~/.datadiff.toml
If you do not wish to participate, the tracking can be easily disabled with one of the following methods:
- In the CLI, use the
--no-tracking
flag. - In the config file, set
no_tracking = true
(for example, under[run.default]
) - If you're using the Python API:
import data_diff
data_diff.disable_tracking() # Call this first, before making any API calls
# Connect and diff your tables without any tracking
data-diff splits the table into smaller segments, then checksums each segment in both databases. When the checksums for a segment aren't equal, it will further divide that segment into yet smaller segments, checksumming those until it gets to the differing row(s).
This approach has performance within an order of magnitude of count(*) when there are few/no changes, but is able to output each differing row! By pushing the compute into the databases, it's much faster than querying for and comparing every row.
†: The implementation for downloading all rows that data-diff
and
count(*)
is compared to is not optimal. It is a single Python multi-threaded
process. The performance is fairly driver-specific, e.g. PostgreSQL's performs 10x
better than MySQL.
data-diff can diff tables within the same database, or across different databases.
Same-DB Diff:
- Uses an outer-join to diff the rows as efficiently and accurately as possible.
- Supports materializing the diff results to a database table.
- Can also collect various extra statistics about the tables.
Cross-DB Diff: Employs a divide and conquer algorithm based on hashing, optimized for few changes.
In this section we'll be doing a walk-through of exactly how data-diff
works, and how to tune --bisection-factor
and --bisection-threshold
.
Let's consider a scenario with an orders
table with 1M rows. Fivetran is
replicating it contionously from PostgreSQL to Snowflake:
┌─────────────┐ ┌─────────────┐
│ PostgreSQL │ │ Snowflake │
├─────────────┤ ├─────────────┤
│ │ │ │
│ │ │ │
│ │ ┌─────────────┐ │ table with │
│ table with ├──┤ replication ├──────▶│ ?maybe? all │
│lots of rows!│ └─────────────┘ │ the same │
│ │ │ rows. │
│ │ │ │
│ │ │ │
│ │ │ │
└─────────────┘ └─────────────┘
In order to check whether the two tables are the same, data-diff splits
the table into --bisection-factor=10
segments.
We also have to choose which columns we want to checksum. In our case, we care
about the primary key, --key-column=id
and the update column
--update-column=updated_at
. updated_at
is updated every time the row is, and
we have an index on it.
data-diff starts by querying both databases for the min(id)
and max(id)
of the table. Then it splits the table into --bisection-factor=10
segments of
1M/10 = 100K
keys each:
┌──────────────────────┐ ┌──────────────────────┐
│ PostgreSQL │ │ Snowflake │
├──────────────────────┤ ├──────────────────────┤
│ id=1..100k │ │ id=1..100k │
├──────────────────────┤ ├──────────────────────┤
│ id=100k..200k │ │ id=100k..200k │
├──────────────────────┤ ├──────────────────────┤
│ id=200k..300k ├─────────────▶│ id=200k..300k │
├──────────────────────┤ ├──────────────────────┤
│ id=300k..400k │ │ id=300k..400k │
├──────────────────────┤ ├──────────────────────┤
│ ... │ │ ... │
├──────────────────────┤ ├──────────────────────┤
│ 900k..100k │ │ 900k..100k │
└───────────────────▲──┘ └▲─────────────────────┘
┃ ┃
┃ ┃
┃ checksum queries ┃
┃ ┃
┌─┻──────────────────┻────┐
│ data-diff │
└─────────────────────────┘
Now data-diff will start running --threads=1
queries in parallel that
checksum each segment. The queries for checksumming each segment will look
something like this, depending on the database:
SELECT count(*),
sum(cast(conv(substring(md5(concat(cast(id as char), cast(timestamp as char))), 18), 16, 10) as unsigned))
FROM `rating_del1`
WHERE (id >= 1) AND (id < 100000)
This keeps the amount of data that has to be transferred between the databases
to a minimum, making it very performant! Additionally, if you have an index on
updated_at
(highly recommended), then the query will be fast, as the database
only has to do a partial index scan between id=1..100k
.
If you are not sure whether the queries are using an index, you can run it with
--interactive
. This puts data-diff in interactive mode, where it shows an
EXPLAIN
before executing each query, requiring confirmation to proceed.
After running the checksum queries on both sides, we see that all segments
are the same except id=100k..200k
:
┌──────────────────────┐ ┌──────────────────────┐
│ PostgreSQL │ │ Snowflake │
├──────────────────────┤ ├──────────────────────┤
│ checksum=0102 │ │ checksum=0102 │
├──────────────────────┤ mismatch! ├──────────────────────┤
│ checksum=ffff ◀──────────────▶ checksum=aaab │
├──────────────────────┤ ├──────────────────────┤
│ checksum=abab │ │ checksum=abab │
├──────────────────────┤ ├──────────────────────┤
│ checksum=f0f0 │ │ checksum=f0f0 │
├──────────────────────┤ ├──────────────────────┤
│ ... │ │ ... │
├──────────────────────┤ ├──────────────────────┤
│ checksum=9494 │ │ checksum=9494 │
└──────────────────────┘ └──────────────────────┘
Now data-diff will do exactly as it just did for the whole table for only
this segment: Split it into --bisection-factor
segments.
However, this time, because each segment has 100k/10=10k
entries, which is
less than the --bisection-threshold
, it will pull down every row in the segment
and compare them in memory in data-diff.
┌──────────────────────┐ ┌──────────────────────┐
│ PostgreSQL │ │ Snowflake │
├──────────────────────┤ ├──────────────────────┤
│ id=100k..110k │ │ id=100k..110k │
├──────────────────────┤ ├──────────────────────┤
│ id=110k..120k │ │ id=110k..120k │
├──────────────────────┤ ├──────────────────────┤
│ id=120k..130k │ │ id=120k..130k │
├──────────────────────┤ ├──────────────────────┤
│ id=130k..140k │ │ id=130k..140k │
├──────────────────────┤ ├──────────────────────┤
│ ... │ │ ... │
├──────────────────────┤ ├──────────────────────┤
│ 190k..200k │ │ 190k..200k │
└──────────────────────┘ └──────────────────────┘
Finally data-diff will output the (id, updated_at)
for each row that was different:
(122001, 1653672821)
If you pass --stats
you'll see stats such as the % of rows were different.
- Ensure that you have indexes on the columns you are comparing. Preferably a
compound index. You can run with
--interactive
to see anEXPLAIN
for the queries. - Consider increasing the number of simultaneous threads executing
queries per database with
--threads
. For databases that limit concurrency per query, such as PostgreSQL/MySQL, this can improve performance dramatically. - If you are only interested in whether something changed, pass
--limit 1
. This can be useful if changes are very rare. This is often faster than doing acount(*)
, for the reason mentioned above. - If the table is very large, consider a larger
--bisection-factor
. Otherwise, you may run into timeouts. - If there are a lot of changes, consider a larger
--bisection-threshold
. - If there are very large gaps in your key column (e.g., 10s of millions of continuous rows missing), then data-diff may perform poorly, doing lots of queries for ranges of rows that do not exist. We have ideas on how to tackle this issue, which we have yet to implement. If you're experiencing this effect, please open an issue, and we will prioritize it.
- The fewer columns you verify (passed with
--columns
), the faster data-diff will be. On one extreme, you can verify every column; on the other, you can verify onlyupdated_at
, if you trust it enough. You can also only verifyid
if you're interested in only presence, such as to detect missing hard deletes. You can do also do a hybrid where you verifyupdated_at
and the most critical value, such as a money value inamount
, but not verify a large serialized column likejson_settings
. - We have ideas for making data-diff even faster that
we haven't implemented yet: faster checksums by reducing type-casts
and using a faster hash than MD5, dynamic adaptation of
bisection_factor
/threads
/bisection_threshold
(especially with large key gaps), and improvements to bypass Python/driver performance limitations when comparing huge amounts of rows locally (i.e. for very highbisection_threshold
values).
This project is licensed under the terms of the MIT License.