Diffy contains record diffing classes, which are used by BigDiffy to perform diffs over large datasets. Supported filesystems include local, GCS (`gs://`), and HDFS (`hdfs://`); diffing directly from BigQuery tables is also supported. Currently supported formats are Avro, Protobuf, and BigQuery TableRow, but the CLI only supports Avro and TableRow.
BigDiffy will run a Scio pipeline diffing an LHS and RHS by key and output statistics based on the differences. The output contains three folders:

* `global` - Global counts of Diff types (e.g. `SAME`, `DIFFERENT`, `MISSING_LHS`, `MISSING_RHS`) seen in the entire dataset (See `GlobalStats`)
* `fields` - Field-level statistics including, but not limited to, the number of records with different values per field, min, max, and standard deviation (See `FieldStats`)
* `keys` - All unique keys found in the two datasets and their Diff types, as (key, Diff type) pairs (See `KeyStats`). If fields are different it will output stats for every field which is different, including left and right values, as well as distance if the field is numeric

For full details on statistics and output see `BigDiffy.scala`.
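The by-key classification behind the `keys` output can be sketched in plain Scala. This is only an illustration of the idea (a full outer join by key, then classifying each pair), not the ratatool API; the object, type, and function names below are hypothetical.

```scala
// Toy sketch of BigDiffy-style keyed diffing. Not the ratatool API;
// only the Diff type labels mirror the ones described above.
object KeyDiffSketch {
  sealed trait DiffType
  case object Same extends DiffType
  case object Different extends DiffType
  case object MissingLhs extends DiffType
  case object MissingRhs extends DiffType

  // Full outer join the two keyed datasets, then classify each key.
  def diffByKey[K, V](lhs: Map[K, V], rhs: Map[K, V]): Map[K, DiffType] =
    (lhs.keySet ++ rhs.keySet).map { k =>
      val diff = (lhs.get(k), rhs.get(k)) match {
        case (Some(l), Some(r)) => if (l == r) Same else Different
        case (Some(_), None)    => MissingRhs // key only on the left
        case (None, Some(_))    => MissingLhs // key only on the right
        case (None, None)       => sys.error("unreachable")
      }
      k -> diff
    }.toMap
}
```

In the real pipeline the same classification happens per record over an `SCollection` rather than an in-memory `Map`, and field-level deltas are accumulated alongside it.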
From the CLI
```
BigDiffy - pair-wise field-level statistical diff
Usage: ratatool bigDiffy [dataflow_options] [options]

  --input-mode=(avro|bigquery)   Diff-ing Avro or BQ records
  [--output-mode=(gcs|bigquery)] Saves to a text file in GCS or a BigQuery dataset. Defaults to GCS
  --key=<key>                    '.' separated key field
  --lhs=<path>                   LHS File path or BigQuery table
  --rhs=<path>                   RHS File path or BigQuery table
  --output=<output>              File path prefix for output
  --ignore=<keys>                ',' separated field list to ignore
  --unordered=<keys>             ',' separated field list to treat as unordered
  [--with-header]                Output all TSVs with header rows. Defaults to false
```

Since this runs a Scio/Beam pipeline, Dataflow options will have to be provided. At a minimum, the following should be specified:

```
  --project=<gcp-project-id>   GCP Project used to run your job
  --runner=DataflowRunner      Executes the job on Google Cloud Dataflow
  --stagingLocation=<gcs-path> Location to stage jars for the job. GCS bucket must be created prior to running job.
  --gcpTempLocation=<gcs-path> Location for temporary files. GCS bucket must be created prior to running job.
```

The following options are recommended, but may not be necessary:

```
  --serviceAccount=<your-service-account> Service account used on Dataflow workers. Useful to avoid permissions issues.
  --workerMachineType=<machine-type>      Can be tweaked based on your specific needs, but is not necessary.
  --maxNumWorkers=<num-workers>           Limits the number of workers (machines) used in the job to avoid using up quota.
```

For more details regarding Dataflow options see here: https://cloud.google.com/dataflow/pipelines/specifying-exec-params
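Putting the options above together, an invocation might look like the following. All paths, the project id, bucket names, and the key field are placeholders, not real resources.

```shell
# Hypothetical BigDiffy run comparing two Avro datasets on GCS.
ratatool bigDiffy \
  --input-mode=avro \
  --key=record.id \
  --lhs=gs://my-bucket/lhs/*.avro \
  --rhs=gs://my-bucket/rhs/*.avro \
  --output=gs://my-bucket/diff-output \
  --project=my-gcp-project \
  --runner=DataflowRunner \
  --stagingLocation=gs://my-bucket/staging \
  --gcpTempLocation=gs://my-bucket/tmp
```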
Or from SBT
```scala
libraryDependencies += "com.spotify" %% "ratatool-diffy" % ratatoolVersion
```
The latest version can be found in the main README.