A schema-aware data diff tool with both a Rust CLI and a Tauri desktop UI.
Compare schemas for breaking changes, find modified rows with configurable tolerance, and review differences through either scripted CLI workflows or an interactive desktop app.
The project currently ships in two forms:
- `datadiff`: the Rust command-line interface for scripted and batch workflows
- `datadiff-gui`: the Tauri desktop app for interactive schema and data comparisons
| Surface | Sources | Best for |
|---|---|---|
| `datadiff` CLI | CSV files | automation, CI checks, manifest-driven batch runs |
| `datadiff-gui` desktop app | CSV, SQL Server, PostgreSQL, MySQL/MariaDB, SQLite | ad hoc inspection, side-by-side comparisons, saved connection profiles |
- Schema Comparison — detect added/removed columns, type changes, and compatibility issues with optional policy validation
- Data Diffing — identify source-only, target-only, and modified rows with configurable keys and numeric tolerance
- Batch Operations — compare multiple file pairs in one run with a JSON or CSV manifest
- Policy-Driven Validation — enforce schema contracts (required columns, forbidden removals, allowed type promotions)
- Flexible Output — export results as JSON or CSV for downstream automation
- Desktop App — side-by-side GUI built with Tauri for interactive schema and data diffing
- Database Connectors — SQL Server, PostgreSQL, MySQL/MariaDB, and SQLite sources in the desktop app
- Scalable — optimized for large datasets with early termination and column filtering
Tagged releases are the intended stable installation target. Source builds remain the most predictable cross-platform option.
git clone https://github.com/vixinxiviir/datadiff.git
cd datadiff
cargo install --path .
This builds and installs the `datadiff` binary to your Cargo bin directory (usually `~/.cargo/bin`).
cargo build --release --manifest-path tauri-app/src-tauri/Cargo.toml
The desktop binary is produced at `tauri-app/src-tauri/target/release/datadiff-gui` on Linux and macOS, or `datadiff-gui.exe` on Windows.
When available, tagged releases may include prebuilt artifacts for the CLI, the desktop app, and packaging support files. If a release does not include a binary for your platform yet, use the source build instructions above.
The Tauri desktop build depends on the normal Linux WebKitGTK stack. On Arch Linux, the important runtime packages are:
- `webkit2gtk-4.1`
- `gtk3`
- `libsoup3`
- `openssl`
- `librsvg`
For source builds of the current connectors, you should also expect build-time dependencies such as rust, cargo, clang, and cmake.
datadiff --version
datadiff --help
For the desktop app, launch:
datadiff-gui
Use the desktop app when you want to diff database queries or inspect changes interactively:
- Launch `datadiff-gui`.
- Choose the Data Diff or Schema Diff tab.
- Select CSV, SQL Server, PostgreSQL, MySQL/MariaDB, or SQLite for each side.
- For database sources, optionally save connection profiles and reuse them later.
- Run the comparison and inspect row-level and schema-level results side by side.
Compare two CSV files to see what columns changed:
datadiff schema \
--source gold_customers.csv \
--target silver_customers.csv
Output includes:
- Columns added in target
- Columns removed from source
- Type changes and impact classification (SafePromotion, RiskyConversion, Breaking)
- Backward and forward compatibility assessment
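The impact classes listed above can be pictured with a small sketch. The rules below are assumptions made for illustration: only `Int32 -> Int64` being a `SafePromotion` is confirmed by the sample output in this README, and the `Unchanged` label and the helper names are hypothetical. The tool's real classifier may use different rules.

```python
# Illustrative only: these rules are assumptions, not datadiff's actual classifier.
import re


def split_type(type_name):
    """Split a name like 'Int32' into ('Int', 32); return (name, None) otherwise."""
    match = re.fullmatch(r"([A-Za-z]+?)(\d+)", type_name)
    if match is None:
        return type_name, None
    return match.group(1), int(match.group(2))


def classify_type_change(old, new):
    """Return a hypothetical impact class for a column type change."""
    if old == new:
        return "Unchanged"
    old_family, old_bits = split_type(old)
    new_family, new_bits = split_type(new)
    same_family = (
        old_family == new_family and old_bits is not None and new_bits is not None
    )
    if same_family and new_bits > old_bits:
        return "SafePromotion"    # widening, e.g. Int32 -> Int64
    if same_family:
        return "Breaking"         # narrowing could truncate values (assumed)
    if old_family == "Int" and new_family == "Float":
        return "RiskyConversion"  # large integers may lose precision (assumed)
    return "Breaking"             # everything else treated as breaking (assumed)
```

For example, `classify_type_change("Int32", "Int64")` returns `"SafePromotion"` under these assumed rules, while the reverse direction is treated as `"Breaking"`.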
Find which rows were added, removed, or modified:
datadiff data \
--source gold_customers.csv \
--target silver_customers.csv \
--key customer_id
Options:
- `--key`: one or more column names for row matching (can repeat: `--key id --key date`)
- `--exclude-columns`: skip comparing certain columns (comma-separated: `--exclude-columns created_at,updated_at`)
- `--only-columns`: compare only specific columns
- `--numeric-tolerance`: allow numeric values to differ by this amount (e.g., `0.01` for 1%)
- `--diffs-only`: show only modified rows, skip summary tables (much faster)
- `--output`: directory to write exports; must be used together with `--format`
- `--format`: export format, either `json` or `csv`; must be used together with `--output`
- `--temp`: write to a timestamped temp directory instead of `--output`; cannot be combined with `--output` or `--format`
- `--json`: emit the diff payload as JSON to stdout and suppress normal terminal output
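To make the interaction between keys, excluded columns, and numeric tolerance concrete, here is a minimal Python sketch of keyed row matching. It is not the tool's implementation. It assumes rows are dictionaries of strings and that the tolerance is an absolute difference; this README does not say whether `datadiff` applies the tolerance as an absolute or a relative amount. The function names are hypothetical.

```python
# Illustrative sketch; absolute tolerance and dict-based rows are assumptions.

def values_match(a, b, tolerance=0.0):
    """Compare two cell values, allowing numeric drift up to `tolerance`."""
    try:
        return abs(float(a) - float(b)) <= tolerance
    except (TypeError, ValueError):
        return a == b  # non-numeric values must match exactly


def diff_rows(source, target, key, exclude=(), tolerance=0.0):
    """Classify rows (lists of dicts) as source-only, target-only, or modified."""
    def index(rows):
        # Build a lookup from the (possibly multi-column) key to the row.
        return {tuple(row[k] for k in key): row for row in rows}

    src, tgt = index(source), index(target)
    modified = []
    for k in src.keys() & tgt.keys():
        columns = [c for c in src[k] if c not in key and c not in exclude]
        if any(not values_match(src[k][c], tgt[k].get(c), tolerance) for c in columns):
            modified.append(k)
    return {
        "source_only": sorted(src.keys() - tgt.keys()),
        "target_only": sorted(tgt.keys() - src.keys()),
        "modified": sorted(modified),
    }


source = [{"id": "1", "amount": "10.00"}, {"id": "2", "amount": "5.00"}]
target = [{"id": "1", "amount": "10.004"}, {"id": "2", "amount": "6.00"}]
print(diff_rows(source, target, key=["id"], tolerance=0.01))
```

With a tolerance of `0.01`, the row with `id` 1 counts as unchanged and only the row with `id` 2 is reported as modified.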
Example with filters:
datadiff data \
--source raw_events.csv \
--target processed_events.csv \
--key event_id \
--exclude-columns processing_timestamp \
--numeric-tolerance 0.001 \
--output ./reports \
--format json \
--diffs-only
Run multiple file pair comparisons and get an aggregate summary:
datadiff batch \
--manifest pairs.json \
--key id \
--output ./batch_results \
--format json
Batch-specific flags:
- `--manifest-format`: force the manifest parser to `json` or `csv` instead of inferring from file extension
- `--fail-fast`: stop the batch on the first failed pair
- `--diffs-only`: show compact per-pair counts rather than fuller summaries
[
{
"name": "customers_v1_to_v2",
"source": "data/customers_v1.csv",
"target": "data/customers_v2.csv",
"key": "customer_id"
},
{
"name": "orders_daily_check",
"source": "data/orders_daily.csv",
"target": "data/orders_staging.csv",
"key": "order_id,order_date",
"exclude_columns": "processing_notes",
"numeric_tolerance": 0.01,
"diffs_only": true
}
]
Entries can override global settings:
- `key` (string): override `--key` for this pair
- `exclude_columns` (string): comma-separated columns to skip
- `only_columns` (string): comma-separated columns to include only
- `numeric_tolerance` (float): tolerance for this pair
- `diffs_only` (bool): show only diffs for this pair
- `output_base` (string): per-pair output directory
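The override behavior can be sketched as a simple merge of global settings with one manifest entry. This is a hedged illustration, not the tool's code: the function name is hypothetical, and the rule that a missing or empty per-entry value falls back to the global value is an assumption.

```python
# Illustrative sketch; the fallback rule for missing or empty values is assumed.

OVERRIDABLE = (
    "key", "exclude_columns", "only_columns",
    "numeric_tolerance", "diffs_only", "output_base",
)


def resolve_settings(global_settings, entry):
    """Return the effective settings for one manifest entry."""
    effective = dict(global_settings)
    for name in OVERRIDABLE:
        value = entry.get(name)
        if value is not None and value != "":  # empty CSV cells keep the global value
            effective[name] = value
    return effective


global_settings = {"key": "id", "diffs_only": False}
entry = {"name": "orders_daily_check", "key": "order_id,order_date", "diffs_only": True}
print(resolve_settings(global_settings, entry))
```

Here the entry's `key` and `diffs_only` replace the global values, while an entry that omits them would inherit `id` and `False`.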
name,source,target,key,exclude_columns,numeric_tolerance,diffs_only
customers_v1_to_v2,data/customers_v1.csv,data/customers_v2.csv,customer_id,,,
orders_daily_check,data/orders_daily.csv,data/orders_staging.csv,"order_id,order_date",processing_notes,0.01,true
Enforce structural contracts with a JSON policy file:
datadiff schema \
--source gold_schema.csv \
--target silver_schema.csv \
--policy schema-contract.json
{
"required_columns_source": ["id", "created_at"],
"required_columns_target": ["id", "created_at", "modified_at"],
"forbidden_removals": ["id", "customer_id"],
"max_new_columns": 5,
"allowed_type_changes": [
{ "from": "Int32", "to": "Int64" },
{ "from": "Float32", "to": "Float64" },
{ "from": "Int32", "to": "Int32" }
],
"fail_on_breaking": true
}
- `required_columns_source`: columns that must exist in source
- `required_columns_target`: columns that must exist in target
- `forbidden_removals`: columns that cannot be removed
- `max_new_columns`: reject if more than N columns are added
- `allowed_type_changes`: list of type conversions to permit
- `fail_on_breaking`: if true, exit with error on breaking/risky changes
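As a rough picture of how the policy fields could be evaluated, the sketch below checks a policy dictionary against two schemas expressed as column-to-type mappings. It is an assumption-laden illustration rather than the tool's logic: the function name and message wording are hypothetical, type changes are only checked when `allowed_type_changes` is present, and `fail_on_breaking` is left out because it depends on the tool's own impact classification.

```python
# Illustrative sketch of evaluating the policy fields; not datadiff's code.

def check_policy(policy, source_schema, target_schema):
    """Return violation messages; schemas map column name -> type name."""
    violations = []
    for col in policy.get("required_columns_source", []):
        if col not in source_schema:
            violations.append(f"missing required source column: {col}")
    for col in policy.get("required_columns_target", []):
        if col not in target_schema:
            violations.append(f"missing required target column: {col}")

    removed = set(source_schema) - set(target_schema)
    for col in policy.get("forbidden_removals", []):
        if col in removed:
            violations.append(f"forbidden removal: {col}")

    added = set(target_schema) - set(source_schema)
    max_new = policy.get("max_new_columns")
    if max_new is not None and len(added) > max_new:
        violations.append(f"too many new columns: {len(added)} > {max_new}")

    allowed_spec = policy.get("allowed_type_changes")
    if allowed_spec is not None:  # assumed: no list means no type restrictions
        allowed = {(c["from"], c["to"]) for c in allowed_spec}
        for col in sorted(set(source_schema) & set(target_schema)):
            old, new = source_schema[col], target_schema[col]
            if old != new and (old, new) not in allowed:
                violations.append(f"type change not allowed: {col} {old} -> {new}")
    return violations
```

An empty list means the schemas satisfy every rule this sketch checks; a caller could map a non-empty list to a non-zero exit code.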
Schema Comparison Results
---------------------------
Source file: gold_schema.csv
Target file: silver_schema.csv
Columns added in target (1): ["new_field"]
Columns removed from source (0): []
Type changes in shared columns (1):
- customer_id: Int32 -> Int64 (SafePromotion)
Potential renames: none
Compatibility:
- Backward compatible: true
- Forward compatible: false
- Breaking reasons:
- Added column: new_field
Policy check: passed (schema-contract.json)
Terminal shows:
- Summary of row counts (total, source-only, target-only, modified)
- Column-level statistics (nulls, unique values, numeric min/max/mean)
- Most-changed columns
Export JSON includes structured diff results for automation.
Batch Results: 3 pairs
- customers_v1_to_v2: ✓ (5 modified rows)
- orders_daily_check: ✓ (120 target-only rows)
- transactions_staging: ✗ (missing source file)
Total: 2 succeeded, 1 failed
Total rows modified across all pairs: 125
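The succeeded and failed tallies in a summary like this, together with the effect of `--fail-fast`, can be sketched as a loop over manifest entries. The `compare` callable and the result layout are hypothetical stand-ins, not a `datadiff` API.

```python
# Illustrative sketch; `compare` is a hypothetical callable, not a datadiff API.

def run_batch(pairs, compare, fail_fast=False):
    """Run `compare(pair)` for each pair and tally successes and failures."""
    results = []
    for pair in pairs:
        try:
            results.append((pair["name"], True, compare(pair)))
        except Exception as error:  # e.g. a missing source file
            results.append((pair["name"], False, str(error)))
            if fail_fast:
                break  # remaining pairs are skipped
    succeeded = sum(1 for _, ok, _ in results if ok)
    return {
        "results": results,
        "succeeded": succeeded,
        "failed": len(results) - succeeded,
    }
```

Without `fail_fast`, every pair is attempted and failures are only counted; with it, the loop stops at the first failing pair, so later pairs do not appear in the results at all.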
# Check if a new table version is backward compatible
datadiff schema \
--source warehouse/events_v2.csv \
--target warehouse/events_v3.csv \
--policy warehouse/schema-policies.json
# Compare daily ETL inputs to see what changed
datadiff data \
--source raw/daily_2026-03-28.csv \
--target raw/daily_2026-03-29.csv \
--key transaction_id \
--diffs-only \
--output ./etl_check \
--format json
# Run schema checks on all updated tables after a deployment
datadiff schema \
--source prod_snapshot.csv \
--target staging_snapshot.csv \
--policy prod-schema-contract.json
# If schema is OK, check data integrity
datadiff batch \
--manifest prod_validation_pairs.json \
--key id \
--output ./release_validation \
--format json
- Use `--diffs-only` to skip expensive statistics computation
- Use `--exclude-columns` or `--only-columns` to reduce comparison scope
- For multi-column keys, use only the minimal key set needed for matching
- Test policy files on small samples before batch runs
Error: "No columns added in target"
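This is normal when the schemas match and does not indicate a failure. If you expected differences, check that the file paths point at the intended files and that both CSVs use the expected encoding.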
Error: "CSV parsing failed"
Verify the input is a standard CSV with the expected delimiter, quotes, and encoding. The CLI currently documents and targets CSV inputs only.
--output or --format is rejected
Use them together. The CLI requires --output and --format as a pair, while --temp is an alternative output mode.
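When the CLI is driven from a script, the pairing rules can be checked before the command is launched. The helper below is a hypothetical wrapper that builds an argument list using only flags documented in this README. Treating `--temp` as a value-less switch is an assumption.

```python
# Illustrative wrapper; only flags documented in this README are used.

def build_data_command(source, target, keys, output=None, fmt=None, temp=False):
    """Build an argv list for `datadiff data`, enforcing the output-flag rules."""
    if (output is None) != (fmt is None):
        raise ValueError("--output and --format must be used together")
    if temp and output is not None:
        raise ValueError("--temp cannot be combined with --output or --format")
    if fmt is not None and fmt not in ("json", "csv"):
        raise ValueError("--format must be 'json' or 'csv'")

    argv = ["datadiff", "data", "--source", source, "--target", target]
    for key in keys:
        argv += ["--key", key]  # --key may be repeated for multi-column keys
    if output is not None:
        argv += ["--output", output, "--format", fmt]
    if temp:
        argv.append("--temp")  # assumed to be a switch without a value
    return argv
```

The resulting list can be passed to `subprocess.run(argv, check=True)` on a machine where `datadiff` is installed.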
Batch run fails on one pair but not others
Run the failing pair directly with datadiff data using the same filters, or rerun the batch with --fail-fast to stop at the first failing entry.
Database sources are not available in the CLI
That is expected. Database connectors currently live in the desktop app, not the datadiff CLI.
Type classification seems wrong
Polars infers schema from the first 100 rows. If a CSV column contains mixed types, normalize the input first so the sampled rows reflect the full dataset.
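The effect of sampling can be reproduced with a simplified stand-in for schema inference. This is not Polars' algorithm; the function name, the three type labels, and the parsing rules are assumptions chosen to show why a value that appears after the sampled rows goes unnoticed.

```python
# Simplified stand-in for sampled schema inference; not Polars' actual algorithm.

def infer_column_type(values, sample_size=100):
    """Infer 'Int64', 'Float64', or 'String' from the first `sample_size` values."""
    def parses(cast, value):
        try:
            cast(value)
            return True
        except ValueError:
            return False

    sample = values[:sample_size]  # sample_size=None scans the whole column
    if all(parses(int, v) for v in sample):
        return "Int64"
    if all(parses(float, v) for v in sample):
        return "Float64"
    return "String"


# 100 integer-like values followed by one non-numeric value.
column = [str(i) for i in range(100)] + ["N/A"]
print(infer_column_type(column))                    # Int64 (sample misses "N/A")
print(infer_column_type(column, sample_size=None))  # String (full scan sees it)
```

Normalizing the column before the comparison, or moving representative values into the sampled range, avoids this kind of mismatch.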
Contributions are welcome. Open an issue for bugs, feature requests, or release packaging problems, and use pull requests for scoped changes.
- Additional database connectors beyond the current SQL Server, PostgreSQL, MySQL/MariaDB, and SQLite support
- Streaming mode for files larger than RAM
- Plugin system for custom diff rules
- Scheduled reports and alerting