datadiff

A schema-aware data diff tool with both a Rust CLI and a Tauri desktop UI.

Compare schemas for breaking changes, find modified rows with configurable tolerance, and review differences through either scripted CLI workflows or an interactive desktop app.

The project currently ships in two forms:

datadiff — the Rust command-line interface for scripted and batch workflows
datadiff-gui — the Tauri desktop app for interactive schema and data comparisons

Support Matrix

Surface	Sources	Best for
`datadiff` CLI	CSV files	automation, CI checks, manifest-driven batch runs
`datadiff-gui` desktop app	CSV, SQL Server, PostgreSQL, MySQL/MariaDB, SQLite	ad hoc inspection, side-by-side comparisons, saved connection profiles

Features

Schema Comparison — detect added/removed columns, type changes, and compatibility issues with optional policy validation
Data Diffing — identify source-only, target-only, and modified rows with configurable keys and numeric tolerance
Batch Operations — compare multiple file pairs in one run with a JSON or CSV manifest
Policy-Driven Validation — enforce schema contracts (required columns, forbidden removals, allowed type promotions)
Flexible Output — export results as JSON or CSV for downstream automation
Desktop App — side-by-side GUI built with Tauri for interactive schema and data diffing
Database Connectors — SQL Server, PostgreSQL, MySQL/MariaDB, and SQLite sources in the desktop app
Scalable — optimized for large datasets with early termination and column filtering

Installation

Tagged releases are the intended stable installation target. Source builds remain the most predictable cross-platform option.

From Source

git clone https://github.com/vixinxiviir/datadiff.git
cd datadiff
cargo install --path .

This builds and installs the datadiff binary to your Cargo bin directory (usually ~/.cargo/bin).

Desktop App From Source

cargo build --release --manifest-path tauri-app/src-tauri/Cargo.toml

The desktop binary is produced at tauri-app/src-tauri/target/release/datadiff-gui on Linux and macOS, or datadiff-gui.exe on Windows.

Release Artifacts

When available, tagged releases may include prebuilt artifacts for the CLI, the desktop app, and packaging support files. If a release does not include a binary for your platform yet, use the source build instructions above.

Linux Runtime Dependencies

The Tauri desktop build depends on the normal Linux WebKitGTK stack. On Arch Linux, the important runtime packages are:

webkit2gtk-4.1
gtk3
libsoup3
openssl
librsvg

For source builds of the current connectors, you should also expect build-time dependencies such as rust, cargo, clang, and cmake.

Verify Installation

datadiff --version
datadiff --help

For the desktop app, launch:

datadiff-gui

Quick Start

Desktop App

Use the desktop app when you want to diff database queries or inspect changes interactively:

Launch datadiff-gui.
Choose the Data Diff or Schema Diff tab.
Select CSV, SQL Server, PostgreSQL, MySQL/MariaDB, or SQLite for each side.
For database sources, optionally save connection profiles and reuse them later.
Run the comparison and inspect row-level and schema-level results side by side.

1. Basic Schema Comparison

Compare two CSV files to see what columns changed:

datadiff schema \
  --source gold_customers.csv \
  --target silver_customers.csv

Output includes:

Columns added in target
Columns removed from source
Type changes and impact classification (SafePromotion, RiskyConversion, Breaking)
Backward and forward compatibility assessment

2. Data Diffing with Primary Keys

Find which rows were added, removed, or modified:

datadiff data \
  --source gold_customers.csv \
  --target silver_customers.csv \
  --key customer_id

Options:

--key — one or more column names for row matching (can repeat: --key id --key date)
--exclude-columns — skip comparing certain columns (comma-separated: --exclude-columns created_at,updated_at)
--only-columns — compare only specific columns
--numeric-tolerance — allow numeric values to differ by this amount (e.g., 0.01 for 1%)
--diffs-only — show only modified rows, skip summary tables (much faster)
--output — directory to write exports; must be used together with --format
--format — export format: json or csv; must be used together with --output
--temp — write to a timestamped temp directory instead of --output; cannot be combined with --output or --format
--json — emit the diff payload as JSON to stdout and suppress normal terminal output

Example with filters:

datadiff data \
  --source raw_events.csv \
  --target processed_events.csv \
  --key event_id \
  --exclude-columns processing_timestamp \
  --numeric-tolerance 0.001 \
  --output ./reports \
  --format json \
  --diffs-only

3. Batch Comparisons with Manifest

Run multiple file pair comparisons and get an aggregate summary:

datadiff batch \
  --manifest pairs.json \
  --key id \
  --output ./batch_results \
  --format json

Batch-specific flags:

--manifest-format — force the manifest parser to json or csv instead of inferring from file extension
--fail-fast — stop the batch on the first failed pair
--diffs-only — show compact per-pair counts rather than fuller summaries

Manifest Format (JSON)

[
  {
    "name": "customers_v1_to_v2",
    "source": "data/customers_v1.csv",
    "target": "data/customers_v2.csv",
    "key": "customer_id"
  },
  {
    "name": "orders_daily_check",
    "source": "data/orders_daily.csv",
    "target": "data/orders_staging.csv",
    "key": "order_id,order_date",
    "exclude_columns": "processing_notes",
    "numeric_tolerance": 0.01,
    "diffs_only": true
  }
]

Entries can override global settings:

key (string) — override --key for this pair
exclude_columns (string) — comma-separated columns to skip
only_columns (string) — comma-separated columns to include only
numeric_tolerance (float) — tolerance for this pair
diffs_only (bool) — show only diffs for this pair
output_base (string) — per-pair output directory

Manifest Format (CSV)

name,source,target,key,exclude_columns,numeric_tolerance,diffs_only
customers_v1_to_v2,data/customers_v1.csv,data/customers_v2.csv,customer_id,,
orders_daily_check,data/orders_daily.csv,data/orders_staging.csv,"order_id,order_date",processing_notes,0.01,true

Schema Policy & Validation

Enforce structural contracts with a JSON policy file:

datadiff schema \
  --source gold_schema.csv \
  --target silver_schema.csv \
  --policy schema-contract.json

Policy File Format

{
  "required_columns_source": ["id", "created_at"],
  "required_columns_target": ["id", "created_at", "modified_at"],
  "forbidden_removals": ["id", "customer_id"],
  "max_new_columns": 5,
  "allowed_type_changes": [
    { "from": "Int32", "to": "Int64" },
    { "from": "Float32", "to": "Float64" },
    { "from": "Int32", "to": "Int32" }
  ],
  "fail_on_breaking": true
}

required_columns_source — columns that must exist in source
required_columns_target — columns that must exist in target
forbidden_removals — columns that cannot be removed
max_new_columns — reject if more than N columns are added
allowed_type_changes — list of type conversions to permit
fail_on_breaking — if true, exit with error on breaking/risky changes

Output & Exports

Schema Comparison Output (terminal + optional export)

Schema Comparison Results
---------------------------
Source file: gold_schema.csv
Target file: silver_schema.csv

Columns added in target (1): ["new_field"]
Columns removed from source (0): []

Type changes in shared columns (1):
  - customer_id: Int32 -> Int64 (SafePromotion)

Potential renames: none

Compatibility:
  - Backward compatible: true
  - Forward compatible: false
  - Breaking reasons:
    - Added column: new_field

Policy check: passed (schema-contract.json)

Data Diff Output (terminal + JSON/CSV export)

Terminal shows:

Summary of row counts (total, source-only, target-only, modified)
Column-level statistics (nulls, unique values, numeric min/max/mean)
Most-changed columns

Export JSON includes structured diff results for automation.

Batch Summary Output

Batch Results: 3 pairs
- customers_v1_to_v2: ✓ (5 modified rows)
- orders_daily_check: ✓ (120 target-only rows)
- transactions_staging: ✗ (missing source file)

Total: 2 succeeded, 1 failed
Total rows modified across all pairs: 125

Examples

Example 1: Validate a Data Warehouse Schema Change

# Check if a new table version is backward compatible
datadiff schema \
  --source warehouse/events_v2.csv \
  --target warehouse/events_v3.csv \
  --policy warehouse/schema-policies.json

Example 2: Find Unexpected Changes in ETL Output

# Compare daily ETL inputs to see what changed
datadiff data \
  --source raw/daily_2026-03-28.csv \
  --target raw/daily_2026-03-29.csv \
  --key transaction_id \
  --diffs-only \
  --output ./etl_check \
  --format json

Example 3: Batch Validation After Release

# Run schema checks on all updated tables after a deployment
datadiff schema \
  --source prod_snapshot.csv \
  --target staging_snapshot.csv \
  --policy prod-schema-contract.json

# If schema is OK, check data integrity
datadiff batch \
  --manifest prod_validation_pairs.json \
  --key id \
  --output ./release_validation \
  --format json

Performance Tips

Use --diffs-only to skip expensive statistics computation
Use --exclude-columns or --only-columns to reduce comparison scope
For multi-column keys, use only the minimal key set needed for matching
Test policy files on small samples before batch runs

Troubleshooting

Error: "No columns added in target"
Normal when schemas match. Check file paths and CSV encoding.

Error: "CSV parsing failed"
Verify the input is a standard CSV with the expected delimiter, quotes, and encoding. The CLI currently documents and targets CSV inputs only.

--output or --format is rejected
Use them together. The CLI requires --output and --format as a pair, while --temp is an alternative output mode.

Batch run fails on one pair but not others
Run the failing pair directly with datadiff data using the same filters, or rerun the batch with --fail-fast to stop at the first failing entry.

Database sources are not available in the CLI
That is expected. Database connectors currently live in the desktop app, not the datadiff CLI.

Type classification seems wrong
Polars infers schema from the first 100 rows. If a CSV column contains mixed types, normalize the input first so the sampled rows reflect the full dataset.

Contributing

Contributions are welcome. Open an issue for bugs, feature requests, or release packaging problems, and use pull requests for scoped changes.

Roadmap

Additional database connectors beyond the current SQL Server, PostgreSQL, MySQL/MariaDB, and SQLite support
Streaming mode for files larger than RAM
Plugin system for custom diff rules
Scheduled reports and alerting

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
.github/workflows		.github/workflows
examples		examples
packaging/aur		packaging/aur
src		src
tauri-app		tauri-app
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md
myresults		myresults
source.csv		source.csv
target.csv		target.csv

Folders and files

Latest commit

History

Repository files navigation

datadiff

Support Matrix

Features

Installation

From Source

Desktop App From Source

Release Artifacts

Linux Runtime Dependencies

Verify Installation

Quick Start

Desktop App

1. Basic Schema Comparison

2. Data Diffing with Primary Keys

3. Batch Comparisons with Manifest

Manifest Format (JSON)

Manifest Format (CSV)

Schema Policy & Validation

Policy File Format

Output & Exports

Schema Comparison Output (terminal + optional export)

Data Diff Output (terminal + JSON/CSV export)

Batch Summary Output

Examples

Example 1: Validate a Data Warehouse Schema Change

Example 2: Find Unexpected Changes in ETL Output

Example 3: Batch Validation After Release

Performance Tips

Troubleshooting

Contributing

Roadmap

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages