Skip to content

vixinxiviir/datadiff

Repository files navigation

datadiff

A schema-aware data diff tool with both a Rust CLI and a Tauri desktop UI.

Compare schemas for breaking changes, find modified rows with configurable tolerance, and review differences through either scripted CLI workflows or an interactive desktop app.

The project currently ships in two forms:

  • datadiff — the Rust command-line interface for scripted and batch workflows
  • datadiff-gui — the Tauri desktop app for interactive schema and data comparisons

Support Matrix

Surface Sources Best for
datadiff CLI CSV files automation, CI checks, manifest-driven batch runs
datadiff-gui desktop app CSV, SQL Server, PostgreSQL, MySQL/MariaDB, SQLite ad hoc inspection, side-by-side comparisons, saved connection profiles

Features

  • Schema Comparison — detect added/removed columns, type changes, and compatibility issues with optional policy validation
  • Data Diffing — identify source-only, target-only, and modified rows with configurable keys and numeric tolerance
  • Batch Operations — compare multiple file pairs in one run with a JSON or CSV manifest
  • Policy-Driven Validation — enforce schema contracts (required columns, forbidden removals, allowed type promotions)
  • Flexible Output — export results as JSON or CSV for downstream automation
  • Desktop App — side-by-side GUI built with Tauri for interactive schema and data diffing
  • Database Connectors — SQL Server, PostgreSQL, MySQL/MariaDB, and SQLite sources in the desktop app
  • Scalable — optimized for large datasets with early termination and column filtering

Installation

Tagged releases are the intended stable installation target. Source builds remain the most predictable cross-platform option.

From Source

git clone https://github.com/vixinxiviir/datadiff.git
cd datadiff
cargo install --path .

This builds and installs the datadiff binary to your Cargo bin directory (usually ~/.cargo/bin).

Desktop App From Source

cargo build --release --manifest-path tauri-app/src-tauri/Cargo.toml

The desktop binary is produced at tauri-app/src-tauri/target/release/datadiff-gui on Linux and macOS, or datadiff-gui.exe on Windows.

Release Artifacts

When available, tagged releases may include prebuilt artifacts for the CLI, the desktop app, and packaging support files. If a release does not include a binary for your platform yet, use the source build instructions above.

Linux Runtime Dependencies

The Tauri desktop build depends on the normal Linux WebKitGTK stack. On Arch Linux, the important runtime packages are:

  • webkit2gtk-4.1
  • gtk3
  • libsoup3
  • openssl
  • librsvg

For source builds of the current connectors, you should also expect build-time dependencies such as rust, cargo, clang, and cmake.

Verify Installation

datadiff --version
datadiff --help

For the desktop app, launch:

datadiff-gui

Quick Start

Desktop App

Use the desktop app when you want to diff database queries or inspect changes interactively:

  1. Launch datadiff-gui.
  2. Choose the Data Diff or Schema Diff tab.
  3. Select CSV, SQL Server, PostgreSQL, MySQL/MariaDB, or SQLite for each side.
  4. For database sources, optionally save connection profiles and reuse them later.
  5. Run the comparison and inspect row-level and schema-level results side by side.

1. Basic Schema Comparison

Compare two CSV files to see what columns changed:

datadiff schema \
  --source gold_customers.csv \
  --target silver_customers.csv

Output includes:

  • Columns added in target
  • Columns removed from source
  • Type changes and impact classification (SafePromotion, RiskyConversion, Breaking)
  • Backward and forward compatibility assessment

2. Data Diffing with Primary Keys

Find which rows were added, removed, or modified:

datadiff data \
  --source gold_customers.csv \
  --target silver_customers.csv \
  --key customer_id

Options:

  • --key — one or more column names for row matching (can repeat: --key id --key date)
  • --exclude-columns — skip comparing certain columns (comma-separated: --exclude-columns created_at,updated_at)
  • --only-columns — compare only specific columns
  • --numeric-tolerance — allow numeric values to differ by this amount (e.g., 0.01 for 1%)
  • --diffs-only — show only modified rows, skip summary tables (much faster)
  • --output — directory to write exports; must be used together with --format
  • --format — export format: json or csv; must be used together with --output
  • --temp — write to a timestamped temp directory instead of --output; cannot be combined with --output or --format
  • --json — emit the diff payload as JSON to stdout and suppress normal terminal output

Example with filters:

datadiff data \
  --source raw_events.csv \
  --target processed_events.csv \
  --key event_id \
  --exclude-columns processing_timestamp \
  --numeric-tolerance 0.001 \
  --output ./reports \
  --format json \
  --diffs-only

3. Batch Comparisons with Manifest

Run multiple file pair comparisons and get an aggregate summary:

datadiff batch \
  --manifest pairs.json \
  --key id \
  --output ./batch_results \
  --format json

Batch-specific flags:

  • --manifest-format — force the manifest parser to json or csv instead of inferring from file extension
  • --fail-fast — stop the batch on the first failed pair
  • --diffs-only — show compact per-pair counts rather than fuller summaries

Manifest Format (JSON)

[
  {
    "name": "customers_v1_to_v2",
    "source": "data/customers_v1.csv",
    "target": "data/customers_v2.csv",
    "key": "customer_id"
  },
  {
    "name": "orders_daily_check",
    "source": "data/orders_daily.csv",
    "target": "data/orders_staging.csv",
    "key": "order_id,order_date",
    "exclude_columns": "processing_notes",
    "numeric_tolerance": 0.01,
    "diffs_only": true
  }
]

Entries can override global settings:

  • key (string) — override --key for this pair
  • exclude_columns (string) — comma-separated columns to skip
  • only_columns (string) — comma-separated columns to include only
  • numeric_tolerance (float) — tolerance for this pair
  • diffs_only (bool) — show only diffs for this pair
  • output_base (string) — per-pair output directory

Manifest Format (CSV)

name,source,target,key,exclude_columns,numeric_tolerance,diffs_only
customers_v1_to_v2,data/customers_v1.csv,data/customers_v2.csv,customer_id,,
orders_daily_check,data/orders_daily.csv,data/orders_staging.csv,"order_id,order_date",processing_notes,0.01,true

Schema Policy & Validation

Enforce structural contracts with a JSON policy file:

datadiff schema \
  --source gold_schema.csv \
  --target silver_schema.csv \
  --policy schema-contract.json

Policy File Format

{
  "required_columns_source": ["id", "created_at"],
  "required_columns_target": ["id", "created_at", "modified_at"],
  "forbidden_removals": ["id", "customer_id"],
  "max_new_columns": 5,
  "allowed_type_changes": [
    { "from": "Int32", "to": "Int64" },
    { "from": "Float32", "to": "Float64" },
    { "from": "Int32", "to": "Int32" }
  ],
  "fail_on_breaking": true
}
  • required_columns_source — columns that must exist in source
  • required_columns_target — columns that must exist in target
  • forbidden_removals — columns that cannot be removed
  • max_new_columns — reject if more than N columns are added
  • allowed_type_changes — list of type conversions to permit
  • fail_on_breaking — if true, exit with error on breaking/risky changes

Output & Exports

Schema Comparison Output (terminal + optional export)

Schema Comparison Results
---------------------------
Source file: gold_schema.csv
Target file: silver_schema.csv

Columns added in target (1): ["new_field"]
Columns removed from source (0): []

Type changes in shared columns (1):
  - customer_id: Int32 -> Int64 (SafePromotion)

Potential renames: none

Compatibility:
  - Backward compatible: true
  - Forward compatible: false
  - Breaking reasons:
    - Added column: new_field

Policy check: passed (schema-contract.json)

Data Diff Output (terminal + JSON/CSV export)

Terminal shows:

  • Summary of row counts (total, source-only, target-only, modified)
  • Column-level statistics (nulls, unique values, numeric min/max/mean)
  • Most-changed columns

Export JSON includes structured diff results for automation.

Batch Summary Output

Batch Results: 3 pairs
- customers_v1_to_v2: ✓ (5 modified rows)
- orders_daily_check: ✓ (120 target-only rows)
- transactions_staging: ✗ (missing source file)

Total: 2 succeeded, 1 failed
Total rows modified across all pairs: 125

Examples

Example 1: Validate a Data Warehouse Schema Change

# Check if a new table version is backward compatible
datadiff schema \
  --source warehouse/events_v2.csv \
  --target warehouse/events_v3.csv \
  --policy warehouse/schema-policies.json

Example 2: Find Unexpected Changes in ETL Output

# Compare daily ETL inputs to see what changed
datadiff data \
  --source raw/daily_2026-03-28.csv \
  --target raw/daily_2026-03-29.csv \
  --key transaction_id \
  --diffs-only \
  --output ./etl_check \
  --format json

Example 3: Batch Validation After Release

# Run schema checks on all updated tables after a deployment
datadiff schema \
  --source prod_snapshot.csv \
  --target staging_snapshot.csv \
  --policy prod-schema-contract.json

# If schema is OK, check data integrity
datadiff batch \
  --manifest prod_validation_pairs.json \
  --key id \
  --output ./release_validation \
  --format json

Performance Tips

  • Use --diffs-only to skip expensive statistics computation
  • Use --exclude-columns or --only-columns to reduce comparison scope
  • For multi-column keys, use only the minimal key set needed for matching
  • Test policy files on small samples before batch runs

Troubleshooting

Error: "No columns added in target"
Normal when schemas match. Check file paths and CSV encoding.

Error: "CSV parsing failed"
Verify the input is a standard CSV with the expected delimiter, quotes, and encoding. The CLI currently documents and targets CSV inputs only.

--output or --format is rejected
Use them together. The CLI requires --output and --format as a pair, while --temp is an alternative output mode.

Batch run fails on one pair but not others
Run the failing pair directly with datadiff data using the same filters, or rerun the batch with --fail-fast to stop at the first failing entry.

Database sources are not available in the CLI
That is expected. Database connectors currently live in the desktop app, not the datadiff CLI.

Type classification seems wrong
Polars infers schema from the first 100 rows. If a CSV column contains mixed types, normalize the input first so the sampled rows reflect the full dataset.

Contributing

Contributions are welcome. Open an issue for bugs, feature requests, or release packaging problems, and use pull requests for scoped changes.

Roadmap

  • Additional database connectors beyond the current SQL Server, PostgreSQL, MySQL/MariaDB, and SQLite support
  • Streaming mode for files larger than RAM
  • Plugin system for custom diff rules
  • Scheduled reports and alerting

About

Tool for comparing tables and datasets for row differences, summary stats, and schema changes. AI assisted.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors