
yaml diff Array of Hashes Options

William W. Kimball, Jr., MBA, MSIS edited this page Oct 29, 2020 · 9 revisions
  1. Introduction
  2. By Position
  3. Deep Comparison by Position
  4. By Value
  5. By Key
    1. Bonus: Configuration Files with Key Comparisons
  6. Deep Comparison by Key
    1. Bonus: Configuration Files with Deep Key Comparisons
  7. Configuration File Options
    1. Configuration File Section: defaults
    2. Configuration File Section: rules
    3. Configuration File Section: keys

This document is part of the body of knowledge about yaml-diff, one of the reference command-line tools provided by the YAML Path project.

Introduction

The yaml-diff command-line tool enables users to control how Arrays-of-Hashes (AoH) are compared. This is different from comparing regular Arrays, discussed elsewhere. The position mode is used by default. It and the other available modes are explored in the following sections. The modes are:

  1. position (the default) treats each element as a whole unit, comparing the left-hand side (LHS) and right-hand side (RHS) elements at the same Array index. Any difference between them is reported as a change of the whole element, no matter how deeply nested the change may be.
  2. dpos also compares elements by position, but compares them deeply, one node at a time. Any differences are reported at each node's own YAML Path instead of at the AoH element path.
  3. value treats each element as a whole unit, synchronizing the two Arrays being compared by equal values before they are checked for differences. The entire Hash must be perfectly identical to be matched up.
  4. key treats each element as a record with an identity key, arranging the two Arrays being compared by matching only these key fields. However, the entire record is treated as a whole unit, so any differences -- no matter how deeply nested in any record -- cause a change of the whole element to be reported.
  5. deep treats each element as a record with an identity key, arranging the two Arrays being compared by matching only these key fields before recursively comparing the LHS and RHS records, one node at a time. Any differences are reported at each node's own YAML Path instead of at the AoH element path.

By Position

When comparing AoH elements by position, the Hashes are treated as whole units. Differences in their child nodes are detected but not individually reported. Rather, each differing Hash is reported in its entirety, in JSON format.

This is the default comparison mode. It is not ideal for every use-case, so several other modes are available.

For an example, consider these two documents and their position-based differences:

File: LHS1.yaml

---
products:
  - product: doodad
    availability:
      start:
        date: 2020-10-10
        time: 08:00
      stop:
        date: 2020-10-29
        time: 17:00
    dimensions:
      width: 5
      height: 5
      depth: 5
      weight: 10
  - product: dumdow
    availability:
      start:
        date: 2020-10-23
        time: 08:00
      stop:
        date: 2020-11-23
        time: 17:00
    dimensions:
      width: 3
      height: 3
      depth: 3
      weight: 27
  - product: doohickey
    availability:
      start:
        date: 2020-08-01
        time: 10:00
      stop:
        date: 2020-09-25
        time: 10:00
    dimensions:
      width: 1
      height: 2
      depth: 3
      weight: 4
  - product: widget
    availability:
      start:
        date: 2020-01-01
        time: 12:00
      stop:
        date: 2020-01-01
        time: 16:00
    dimensions:
      width: 9
      height: 10
      depth: 1
      weight: 4

File: RHS1.yaml

---
products:
  - product: doodad
    availability:
      start:
        date: 2020-10-10
        time: 08:00
      stop:
        date: 2020-10-29
        time: 17:00
    dimensions:
      width: 5
      height: 5
      depth: 5
      weight: 10
  - product: dumdow
    availability:
      start:
        date: 2020-10-23
        time: 08:00
      stop:
        date: 2020-11-23
        time: 17:00
    dimensions:
      width: 4
      height: 3
      depth: 3
      weight: 27
  - product: widget
    availability:
      start:
        date: 2020-01-01
        time: 12:00
      stop:
        date: 2020-01-01
        time: 16:00
    dimensions:
      width: 9
      height: 10
      depth: 1
      weight: 4
  - product: doohickey
    availability:
      start:
        date: 2020-08-01
        time: 10:00
      stop:
        date: 2020-09-25
        time: 10:00
    dimensions:
      width: 1
      height: 2
      depth: 3
      weight: 4

At a glance, we might spot a couple of differences, or think there are more differences than there really are, depending on how these documents are compared. When we instruct yaml-diff to compare these by position, it produces this report:

c products[1]
< {"product": "dumdow", "availability": {"start": {"date": "2020-10-23", "time": "08:00"}, "stop": {"date": "2020-11-23", "time": "17:00"}}, "dimensions": {"width": 3, "height": 3, "depth": 3, "weight": 27}}
---
> {"product": "dumdow", "availability": {"start": {"date": "2020-10-23", "time": "08:00"}, "stop": {"date": "2020-11-23", "time": "17:00"}}, "dimensions": {"width": 4, "height": 3, "depth": 3, "weight": 27}}

c products[2]
< {"product": "doohickey", "availability": {"start": {"date": "2020-08-01", "time": "10:00"}, "stop": {"date": "2020-09-25", "time": "10:00"}}, "dimensions": {"width": 1, "height": 2, "depth": 3, "weight": 4}}
---
> {"product": "widget", "availability": {"start": {"date": "2020-01-01", "time": "12:00"}, "stop": {"date": "2020-01-01", "time": "16:00"}}, "dimensions": {"width": 9, "height": 10, "depth": 1, "weight": 4}}

c products[3]
< {"product": "widget", "availability": {"start": {"date": "2020-01-01", "time": "12:00"}, "stop": {"date": "2020-01-01", "time": "16:00"}}, "dimensions": {"width": 9, "height": 10, "depth": 1, "weight": 4}}
---
> {"product": "doohickey", "availability": {"start": {"date": "2020-08-01", "time": "10:00"}, "stop": {"date": "2020-09-25", "time": "10:00"}}, "dimensions": {"width": 1, "height": 2, "depth": 3, "weight": 4}}

This report has low granularity. It displays entire Hashes when there are differences, no matter how small or large. There is no change to the first Hash -- at index 0 -- between the two documents, so it doesn't appear in the report at all. The next record -- at index 1 -- has a very small change to its width property. Because this is a position-based report, both the LHS and RHS Hashes are reported in their entirety as having been changed. The next two elements are arguably identical between the two documents, other than their ordinal positions. They are reported as whole changes because a comparison by position is concerned only with whether the LHS and RHS elements at the same Array index differ in any way.
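The behavior described above can be approximated in a few lines of Python. This is only an illustrative sketch, not yaml-diff's implementation; the function name diff_by_position and the cut-down records are assumptions made for the example.

```python
def diff_by_position(lhs, rhs, prefix):
    """Report whole elements that differ at the same Array index (illustrative only)."""
    changes = []
    for index in range(max(len(lhs), len(rhs))):
        # None stands in for an element that is missing on one side.
        left = lhs[index] if index < len(lhs) else None
        right = rhs[index] if index < len(rhs) else None
        if left != right:
            changes.append((f"{prefix}[{index}]", left, right))
    return changes


# Cut-down versions of LHS1.yaml and RHS1.yaml (hypothetical data for the sketch).
lhs = [
    {"product": "doodad", "dimensions": {"width": 5}},
    {"product": "dumdow", "dimensions": {"width": 3}},
    {"product": "doohickey", "dimensions": {"width": 1}},
    {"product": "widget", "dimensions": {"width": 9}},
]
rhs = [
    {"product": "doodad", "dimensions": {"width": 5}},
    {"product": "dumdow", "dimensions": {"width": 4}},
    {"product": "widget", "dimensions": {"width": 9}},
    {"product": "doohickey", "dimensions": {"width": 1}},
]

changed_paths = [path for path, _, _ in diff_by_position(lhs, rhs, "products")]
print(changed_paths)  # ['products[1]', 'products[2]', 'products[3]']
```

As in the report above, index 0 is absent because it is unchanged, while the two swapped records each count as a whole-element change.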

Deep Comparison by Position

If you want a high-granularity version of the same report from By Position, use dpos (an abbreviation for "deep position" comparison). Be warned: high-granularity reports can be quite long! Doing so with the same LHS1.yaml and RHS1.yaml documents produces this very detailed report:

c products[1].dimensions.width
< 3
---
> 4

c products[2].product
< doohickey
---
> widget

c products[2].availability.start.date
< 2020-08-01
---
> 2020-01-01

c products[2].availability.start.time
< 10:00
---
> 12:00

c products[2].availability.stop.date
< 2020-09-25
---
> 2020-01-01

c products[2].availability.stop.time
< 10:00
---
> 16:00

c products[2].dimensions.width
< 1
---
> 9

c products[2].dimensions.height
< 2
---
> 10

c products[2].dimensions.depth
< 3
---
> 1

c products[3].product
< widget
---
> doohickey

c products[3].availability.start.date
< 2020-01-01
---
> 2020-08-01

c products[3].availability.start.time
< 12:00
---
> 10:00

c products[3].availability.stop.date
< 2020-01-01
---
> 2020-09-25

c products[3].availability.stop.time
< 16:00
---
> 10:00

c products[3].dimensions.width
< 9
---
> 1

c products[3].dimensions.height
< 10
---
> 2

c products[3].dimensions.depth
< 1
---
> 3

Identified by their YAML Paths, every single leaf-node-level difference between both documents is reported. We can see the anticipated -- very small -- change to the "dumdow" element: its "width" changed from 3 to 4. The remainder of the report is really just overly accurate noise with this particular data. Other comparison modes -- Deep Comparison by Key, in particular -- would be far more useful for this use-case.
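A deep, position-based comparison can be pictured as a recursive walk that descends into both Hashes and reports each difference at its own path. The following Python is a minimal sketch under that assumption; deep_diff and diff_by_deep_position are hypothetical names, the data is cut down, and yaml-diff itself may work differently.

```python
def deep_diff(left, right, path):
    """Yield (action, path, left, right) for each node-level difference."""
    if isinstance(left, dict) and isinstance(right, dict):
        # Visit LHS keys first, then keys that exist only in the RHS.
        for key in list(left) + [k for k in right if k not in left]:
            child = f"{path}.{key}"
            if key not in right:
                yield ("d", child, left[key], None)
            elif key not in left:
                yield ("a", child, None, right[key])
            else:
                yield from deep_diff(left[key], right[key], child)
    elif left != right:
        yield ("c", path, left, right)


def diff_by_deep_position(lhs, rhs, prefix):
    """Deeply compare elements that share an index (surplus elements are ignored here)."""
    report = []
    for index, (left, right) in enumerate(zip(lhs, rhs)):
        report.extend(deep_diff(left, right, f"{prefix}[{index}]"))
    return report


lhs = [
    {"product": "dumdow", "dimensions": {"width": 3, "height": 3}},
    {"product": "doohickey", "dimensions": {"width": 1, "height": 2}},
]
rhs = [
    {"product": "dumdow", "dimensions": {"width": 4, "height": 3}},
    {"product": "widget", "dimensions": {"width": 9, "height": 10}},
]

report = diff_by_deep_position(lhs, rhs, "products")
for action, path, left, right in report:
    print(action, path, left, right)
```

The first pair yields a single width change, while the second pair, two unrelated records at the same index, yields one entry per differing leaf, which is the kind of noise seen in the report above.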

By Value

If we compare the same two documents from By Position using the value mode, we get a much smaller report:

c products[1]
< {"product": "dumdow", "availability": {"start": {"date": "2020-10-23", "time": "08:00"}, "stop": {"date": "2020-11-23", "time": "17:00"}}, "dimensions": {"width": 3, "height": 3, "depth": 3, "weight": 27}}
---
> {"product": "dumdow", "availability": {"start": {"date": "2020-10-23", "time": "08:00"}, "stop": {"date": "2020-11-23", "time": "17:00"}}, "dimensions": {"width": 4, "height": 3, "depth": 3, "weight": 27}}

In this mode, the records for "widget" and "doohickey" were identified as being identical, so they were omitted from the report. However, like the position mode, this value mode still has low granularity. We can see that there was a change to the "dumdow" record but it may be difficult to see precisely what that change is with this mode.

Note that value is not tricked by any reordering of the child nodes within any of the elements. As long as the nodes are identical, their order is irrelevant.
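To illustrate the idea, the sketch below pairs off elements that are equal as whole units and returns whatever is left over on each side. It is an assumption-laden approximation rather than yaml-diff's algorithm: the name diff_by_value is hypothetical, and it relies on the fact that Python dict equality ignores key order, which mirrors the order-insensitivity noted above.

```python
def diff_by_value(lhs, rhs):
    """Pair off identical elements; return the unmatched LHS and RHS leftovers."""
    unmatched_rhs = list(rhs)
    unmatched_lhs = []
    for element in lhs:
        if element in unmatched_rhs:
            # An identical element exists somewhere in the RHS: consume it.
            unmatched_rhs.remove(element)
        else:
            unmatched_lhs.append(element)
    return unmatched_lhs, unmatched_rhs


lhs = [
    {"product": "doodad", "dimensions": {"width": 5}},
    {"product": "dumdow", "dimensions": {"width": 3}},
    {"product": "doohickey", "dimensions": {"width": 1}},
    {"product": "widget", "dimensions": {"width": 9}},
]
rhs = [
    {"product": "doodad", "dimensions": {"width": 5}},
    {"product": "dumdow", "dimensions": {"width": 4}},
    {"product": "widget", "dimensions": {"width": 9}},
    # Same record as the LHS "doohickey", with its child nodes reordered.
    {"dimensions": {"width": 1}, "product": "doohickey"},
]

only_lhs, only_rhs = diff_by_value(lhs, rhs)
print(only_lhs)  # [{'product': 'dumdow', 'dimensions': {'width': 3}}]
print(only_rhs)  # [{'product': 'dumdow', 'dimensions': {'width': 4}}]
```

Only the two "dumdow" variants remain unmatched; how such leftovers are then paired for reporting is outside the scope of this sketch.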

By Key

When the Array-of-Hashes elements are records with identity keys and you want a low-granularity report of differences, use the key comparison mode. The value comparison mode can produce suboptimal results when there are material differences between the two record sets, particularly when otherwise equivalent records both differ slightly and sit at different ordinal positions in the two documents. Further, when the identity key may not be the first attribute of the first record in a particular Array of Hashes, or when you need special handling for certain records, use the yaml-diff Configuration File.

Consider these two variations of the above YAML data files:

File: LHS2.yaml

---
products:
  - product: doodad
    sku: 0-000-0001-0
    availability:
      start:
        date: 2020-10-10
        time: 08:00
      stop:
        date: 2020-10-29
        time: 17:00
    dimensions:
      width: 5
      height: 5
      depth: 5
      weight: 10
  - product: dumdow
    sku: 0-000-0002-0
    availability:
      start:
        date: 2020-10-23
        time: 08:00
      stop:
        date: 2020-11-23
        time: 17:00
    dimensions:
      width: 3
      height: 3
      depth: 3
      weight: 27
  - product: doohickey
    sku: 0-000-0003-0
    availability:
      start:
        date: 2020-08-01
        time: 10:00
      stop:
        date: 2020-09-25
        time: 10:00
    dimensions:
      width: 1
      height: 2
      depth: 3
      weight: 4
  - product: widget
    sku: 0-000-0004-0
    availability:
      start:
        date: 2020-01-01
        time: 12:00
      stop:
        date: 2020-01-01
        time: 16:00
    dimensions:
      width: 9
      height: 10
      depth: 1
      weight: 4

File: RHS2.yaml

---
products:
  - sku: 0-000-0001-0
    availability:
      start:
        date: 2020-10-10
        time: 08:00
      stop:
        date: 2020-10-29
        time: 17:00
    dimensions:
      width: 5
      height: 5
      depth: 5
      weight: 10
  - sku: 0-000-0002-0
    availability:
      start:
        date: 2020-10-23
        time: 08:00
      stop:
        date: 2020-11-23
        time: 17:00
    dimensions:
      width: 4
      height: 3
      depth: 3
      weight: 27
  - dimensions:
      width: 9
      height: 10
      depth: 1
      weight: 4
    sku: 0-000-0004-0
    availability:
      start:
        date: 2020-01-01
        time: 12:00
      stop:
        date: 2020-01-01
        time: 16:00
  - product: doohickey
    availability:
      stop:
        date: 2020-09-25
        time: 10:00
      start:
        date: 2020-08-01
        time: 10:00
    dimensions:
      weight: 4
      width: 1
      depth: 3
      height: 2

Note that all of the "product" fields were removed from the RHS document except for the "doohickey" record, which is missing its "sku" field.

Comparing these two documents with the value mode produces this report:

c products[0]
< {"product": "doodad", "sku": "0-000-0001-0", "availability": {"start": {"date": "2020-10-10", "time": "08:00"}, "stop": {"date": "2020-10-29", "time": "17:00"}}, "dimensions": {"width": 5, "height": 5, "depth": 5, "weight": 10}}
---
> {"sku": "0-000-0001-0", "availability": {"start": {"date": "2020-10-10", "time": "08:00"}, "stop": {"date": "2020-10-29", "time": "17:00"}}, "dimensions": {"width": 5, "height": 5, "depth": 5, "weight": 10}}

c products[1]
< {"product": "dumdow", "sku": "0-000-0002-0", "availability": {"start": {"date": "2020-10-23", "time": "08:00"}, "stop": {"date": "2020-11-23", "time": "17:00"}}, "dimensions": {"width": 3, "height": 3, "depth": 3, "weight": 27}}
---
> {"sku": "0-000-0002-0", "availability": {"start": {"date": "2020-10-23", "time": "08:00"}, "stop": {"date": "2020-11-23", "time": "17:00"}}, "dimensions": {"width": 4, "height": 3, "depth": 3, "weight": 27}}

c products[2]
< {"product": "doohickey", "sku": "0-000-0003-0", "availability": {"start": {"date": "2020-08-01", "time": "10:00"}, "stop": {"date": "2020-09-25", "time": "10:00"}}, "dimensions": {"width": 1, "height": 2, "depth": 3, "weight": 4}}
---
> {"dimensions": {"width": 9, "height": 10, "depth": 1, "weight": 4}, "sku": "0-000-0004-0", "availability": {"start": {"date": "2020-01-01", "time": "12:00"}, "stop": {"date": "2020-01-01", "time": "16:00"}}}

c products[3]
< {"product": "widget", "sku": "0-000-0004-0", "availability": {"start": {"date": "2020-01-01", "time": "12:00"}, "stop": {"date": "2020-01-01", "time": "16:00"}}, "dimensions": {"width": 9, "height": 10, "depth": 1, "weight": 4}}
---
> {"product": "doohickey", "availability": {"stop": {"date": "2020-09-25", "time": "10:00"}, "start": {"date": "2020-08-01", "time": "10:00"}}, "dimensions": {"weight": 4, "width": 1, "depth": 3, "height": 2}}

This report doesn't make a lot of sense. The changes to the earlier records were detected only because those records happen to sit at the same ordinal positions in both documents, and the "doohickey" and "widget" records could not be automatically matched up at all. This is because the value mode falls back to position when records are not identical.

In this case, we need to use the key mode. Contrast the value report above with the report generated using the key mode:

c products[0]
< {"product": "doodad", "sku": "0-000-0001-0", "availability": {"start": {"date": "2020-10-10", "time": "08:00"}, "stop": {"date": "2020-10-29", "time": "17:00"}}, "dimensions": {"width": 5, "height": 5, "depth": 5, "weight": 10}}
---
> {"sku": "0-000-0001-0", "availability": {"start": {"date": "2020-10-10", "time": "08:00"}, "stop": {"date": "2020-10-29", "time": "17:00"}}, "dimensions": {"width": 5, "height": 5, "depth": 5, "weight": 10}}

c products[1]
< {"product": "dumdow", "sku": "0-000-0002-0", "availability": {"start": {"date": "2020-10-23", "time": "08:00"}, "stop": {"date": "2020-11-23", "time": "17:00"}}, "dimensions": {"width": 3, "height": 3, "depth": 3, "weight": 27}}
---
> {"sku": "0-000-0002-0", "availability": {"start": {"date": "2020-10-23", "time": "08:00"}, "stop": {"date": "2020-11-23", "time": "17:00"}}, "dimensions": {"width": 4, "height": 3, "depth": 3, "weight": 27}}

d products[2]
< {"product": "doohickey", "sku": "0-000-0003-0", "availability": {"start": {"date": "2020-08-01", "time": "10:00"}, "stop": {"date": "2020-09-25", "time": "10:00"}}, "dimensions": {"width": 1, "height": 2, "depth": 3, "weight": 4}}

a products[3]
> {"product": "doohickey", "availability": {"stop": {"date": "2020-09-25", "time": "10:00"}, "start": {"date": "2020-08-01", "time": "10:00"}}, "dimensions": {"weight": 4, "width": 1, "depth": 3, "height": 2}}

c products[3]
< {"product": "widget", "sku": "0-000-0004-0", "availability": {"start": {"date": "2020-01-01", "time": "12:00"}, "stop": {"date": "2020-01-01", "time": "16:00"}}, "dimensions": {"width": 9, "height": 10, "depth": 1, "weight": 4}}
---
> {"dimensions": {"width": 9, "height": 10, "depth": 1, "weight": 4}, "sku": "0-000-0004-0", "availability": {"start": {"date": "2020-01-01", "time": "12:00"}, "stop": {"date": "2020-01-01", "time": "16:00"}}}

For this case -- when we need to use an identity key to match up otherwise very different record sets -- the records were properly matched up and the real differences were reported. This includes a correct delete-add difference for the "doohickey" record because it is missing the mandatory identity key in the RHS document and could therefore not be matched for direct comparison.
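The matching step can be sketched as an index of RHS records by their identity key. This Python is illustrative only and is not taken from yaml-diff; the function name diff_by_key, the flattened records, and the ordering of the returned entries are all assumptions of the sketch.

```python
def diff_by_key(lhs, rhs, key):
    """Match records on an identity key; report whole-record c/d/a entries."""
    rhs_by_id = {rec[key]: rec for rec in rhs if key in rec}
    matched = set()
    report = []
    for rec in lhs:
        if key in rec and rec[key] in rhs_by_id:
            matched.add(rec[key])
            if rec != rhs_by_id[rec[key]]:
                report.append(("c", rec[key]))  # matched, but not identical
        else:
            report.append(("d", rec.get(key)))  # no RHS counterpart
    for rec in rhs:
        if key not in rec or rec[key] not in matched:
            report.append(("a", rec.get(key)))  # no LHS counterpart
    return report


# Flattened, cut-down versions of LHS2.yaml and RHS2.yaml (hypothetical data).
lhs = [
    {"product": "doodad", "sku": "0-000-0001-0", "width": 5},
    {"product": "dumdow", "sku": "0-000-0002-0", "width": 3},
    {"product": "doohickey", "sku": "0-000-0003-0", "width": 1},
    {"product": "widget", "sku": "0-000-0004-0", "width": 9},
]
rhs = [
    {"sku": "0-000-0001-0", "width": 5},
    {"sku": "0-000-0002-0", "width": 4},
    {"width": 9, "sku": "0-000-0004-0"},
    {"product": "doohickey", "width": 1},
]

key_report = diff_by_key(lhs, rhs, "sku")
print(key_report)
```

The RHS "doohickey" record has no "sku", so it cannot be indexed; it surfaces as an addition while its LHS counterpart surfaces as a deletion, mirroring the delete-add pair in the report above.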

Bonus: Configuration Files with Key Comparisons

What if you want the "doohickey" product from this record set to be matched despite the record lacking the mandatory "sku"? That's easy: use a yaml-diff Configuration File like so:

[rules]
/products = key

[keys]
/products[product=doohickey] = product

This changes the report to:

c products[0]
< {"product": "doodad", "sku": "0-000-0001-0", "availability": {"start": {"date": "2020-10-10", "time": "08:00"}, "stop": {"date": "2020-10-29", "time": "17:00"}}, "dimensions": {"width": 5, "height": 5, "depth": 5, "weight": 10}}
---
> {"sku": "0-000-0001-0", "availability": {"start": {"date": "2020-10-10", "time": "08:00"}, "stop": {"date": "2020-10-29", "time": "17:00"}}, "dimensions": {"width": 5, "height": 5, "depth": 5, "weight": 10}}

c products[1]
< {"product": "dumdow", "sku": "0-000-0002-0", "availability": {"start": {"date": "2020-10-23", "time": "08:00"}, "stop": {"date": "2020-11-23", "time": "17:00"}}, "dimensions": {"width": 3, "height": 3, "depth": 3, "weight": 27}}
---
> {"sku": "0-000-0002-0", "availability": {"start": {"date": "2020-10-23", "time": "08:00"}, "stop": {"date": "2020-11-23", "time": "17:00"}}, "dimensions": {"width": 4, "height": 3, "depth": 3, "weight": 27}}

c products[2]
< {"product": "doohickey", "sku": "0-000-0003-0", "availability": {"start": {"date": "2020-08-01", "time": "10:00"}, "stop": {"date": "2020-09-25", "time": "10:00"}}, "dimensions": {"width": 1, "height": 2, "depth": 3, "weight": 4}}
---
> {"product": "doohickey", "availability": {"stop": {"date": "2020-09-25", "time": "10:00"}, "start": {"date": "2020-08-01", "time": "10:00"}}, "dimensions": {"weight": 4, "width": 1, "depth": 3, "height": 2}}

c products[3]
< {"product": "widget", "sku": "0-000-0004-0", "availability": {"start": {"date": "2020-01-01", "time": "12:00"}, "stop": {"date": "2020-01-01", "time": "16:00"}}, "dimensions": {"width": 9, "height": 10, "depth": 1, "weight": 4}}
---
> {"dimensions": {"width": 9, "height": 10, "depth": 1, "weight": 4}, "sku": "0-000-0004-0", "availability": {"start": {"date": "2020-01-01", "time": "12:00"}, "stop": {"date": "2020-01-01", "time": "16:00"}}}

Notice that the "doohickey" record was successfully compared against its LHS equivalent, despite lacking a "sku". This custom configuration changed the identity key of just this specific oddball record from "sku" to "product". All other records were still matched up by their "sku" values.
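One way to think about this per-record override is as a function that picks the identity key for each record. The sketch below is a hypothetical model of the [keys] entry shown above, not yaml-diff's code; identity_of and the (field, expected value, key name) override tuples are invented for illustration.

```python
def identity_of(record, default_key, overrides):
    """Return (key_name, value) identifying a record, or None if it has no usable key."""
    for field, expected, key_name in overrides:
        # An override applies only to records selected by field == expected.
        if record.get(field) == expected and key_name in record:
            return (key_name, record[key_name])
    if default_key in record:
        return (default_key, record[default_key])
    return None


# Models "/products[product=doohickey] = product" from the configuration above.
overrides = [("product", "doohickey", "product")]

lhs_doohickey = {"product": "doohickey", "sku": "0-000-0003-0", "width": 1}
rhs_doohickey = {"product": "doohickey", "width": 1}
lhs_doodad = {"product": "doodad", "sku": "0-000-0001-0", "width": 5}

print(identity_of(lhs_doohickey, "sku", overrides))  # ('product', 'doohickey')
print(identity_of(rhs_doohickey, "sku", overrides))  # ('product', 'doohickey')
print(identity_of(lhs_doodad, "sku", overrides))     # ('sku', '0-000-0001-0')
```

Because both "doohickey" records now resolve to the same identity, they can be paired for comparison, while every other record still resolves through "sku".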

Deep Comparison by Key

This is the preferred comparison mode when the AoH elements are records with identity keys that may sit at different ordinal positions and may differ slightly between the two documents, or when you need to see every individual difference between the records, no matter how few or many.

Using this mode against LHS2.yaml and RHS2.yaml (without the bonus configuration file) produces this report:

d products[0].product
< doodad

d products[1].product
< dumdow

c products[1].dimensions.width
< 3
---
> 4

d products[2]
< {"product": "doohickey", "sku": "0-000-0003-0", "availability": {"start": {"date": "2020-08-01", "time": "10:00"}, "stop": {"date": "2020-09-25", "time": "10:00"}}, "dimensions": {"width": 1, "height": 2, "depth": 3, "weight": 4}}

a products[3]
> {"product": "doohickey", "availability": {"stop": {"date": "2020-09-25", "time": "10:00"}, "start": {"date": "2020-08-01", "time": "10:00"}}, "dimensions": {"weight": 4, "width": 1, "depth": 3, "height": 2}}

d products[2].product
< widget

We can clearly see that the most common difference between the records is the deletion of the "product" field. Because the RHS "doohickey" record lacks a "sku", it was still not matched for comparison but was instead marked correctly as a delete-add.
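Conceptually, this mode combines key-based matching with a recursive, node-by-node comparison of each matched pair. The Python below is a rough sketch under that assumption and is not yaml-diff's implementation; deep_diff and diff_deep_by_key are hypothetical names, and reporting matched records at their RHS index is simply a choice made for this sketch.

```python
def deep_diff(left, right, path):
    """Yield (action, path) for each node-level difference between two nodes."""
    if isinstance(left, dict) and isinstance(right, dict):
        for key in list(left) + [k for k in right if k not in left]:
            child = f"{path}.{key}"
            if key not in right:
                yield ("d", child)
            elif key not in left:
                yield ("a", child)
            else:
                yield from deep_diff(left[key], right[key], child)
    elif left != right:
        yield ("c", path)


def diff_deep_by_key(lhs, rhs, key, prefix):
    """Match records on an identity key, then deeply compare each matched pair."""
    rhs_index = {rec[key]: j for j, rec in enumerate(rhs) if key in rec}
    matched = set()
    report = []
    for i, rec in enumerate(lhs):
        j = rhs_index.get(rec[key]) if key in rec else None
        if j is None:
            report.append(("d", f"{prefix}[{i}]"))  # whole record deleted
        else:
            matched.add(j)
            report.extend(deep_diff(rec, rhs[j], f"{prefix}[{j}]"))
    for j in range(len(rhs)):
        if j not in matched:
            report.append(("a", f"{prefix}[{j}]"))  # whole record added
    return report


lhs = [
    {"product": "doodad", "sku": "0-000-0001-0", "dimensions": {"width": 5}},
    {"product": "dumdow", "sku": "0-000-0002-0", "dimensions": {"width": 3}},
    {"product": "doohickey", "sku": "0-000-0003-0", "dimensions": {"width": 1}},
    {"product": "widget", "sku": "0-000-0004-0", "dimensions": {"width": 9}},
]
rhs = [
    {"sku": "0-000-0001-0", "dimensions": {"width": 5}},
    {"sku": "0-000-0002-0", "dimensions": {"width": 4}},
    {"dimensions": {"width": 9}, "sku": "0-000-0004-0"},
    {"product": "doohickey", "dimensions": {"width": 1}},
]

deep_report = diff_deep_by_key(lhs, rhs, "sku", "products")
for action, path in deep_report:
    print(action, path)
```

Matched records contribute only their individual node differences, while the unmatched "doohickey" records are still reported as a whole-record deletion and addition.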

Bonus: Configuration Files with Deep Key Comparisons

Adding the same bonus configuration file from Bonus: Configuration Files with Key Comparisons further reduces the differences report to:

d products[0].product
< doodad

d products[1].product
< dumdow

c products[1].dimensions.width
< 3
---
> 4

d products[3].sku
< 0-000-0003-0

d products[2].product
< widget

Note that the delete-add difference pair was consolidated to show the "sku" field was removed from the oddball record.

Configuration File Options

The yaml-diff tool can read per-YAML-Path comparison options from an INI-style configuration file via its --config (-c) argument. Whereas the --aoh (-O) argument supplies one overarching mode for comparing all AoHs, a configuration file permits far more precise control whenever you need a different mode for specific parts of the documents being compared.

Configuration File Section: defaults

The [defaults] section permits a key named aoh, which behaves identically to the --aoh (-O) command-line argument to the yaml-diff tool. The [defaults]aoh setting is overridden by the same-named command-line argument, when supplied. In practice, this file may look like:

File: merge-options.ini

[defaults]
aoh = position

Note the spaces around the = sign are optional but only an = sign may be used to separate each key from its value.
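The precedence described in this section can be modeled with Python's standard configparser module. This is a sketch of the stated rule only (the command-line value wins over [defaults]aoh, and position applies when neither is given); effective_aoh_mode is a hypothetical helper, and nothing here is claimed about how yaml-diff reads its file internally.

```python
import configparser


def effective_aoh_mode(ini_text, cli_mode=None):
    """Resolve the AoH mode: command-line value, then [defaults]aoh, then position."""
    # Accept only "=" as the key/value separator, as the note above requires.
    parser = configparser.ConfigParser(delimiters=("=",))
    parser.read_string(ini_text)
    if cli_mode is not None:
        return cli_mode
    return parser.get("defaults", "aoh", fallback="position")


CONFIG = """
[defaults]
aoh = key
"""

print(effective_aoh_mode(CONFIG))                   # key
print(effective_aoh_mode(CONFIG, cli_mode="deep"))  # deep
print(effective_aoh_mode(""))                       # position
```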

Configuration File Section: rules

The [rules] section takes YAML Paths as keys and, as values, any of the AoH comparison modes that are available to the --aoh (-O) command-line argument. This enables extremely fine precision for applying the available modes.

This has already been explored at Bonus: Configuration Files with Key Comparisons.
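As a rough model of a per-path lookup, the sketch below reads a [rules] section with Python's configparser and falls back to [defaults]aoh when no rule names the path. It makes two loud assumptions that the text above does not confirm: paths are matched by exact string equality, and a rule takes priority over the default. The helper name mode_for_path is hypothetical.

```python
import configparser


def mode_for_path(ini_text, yaml_path):
    """Look up the AoH mode for one YAML Path (exact-match assumption)."""
    parser = configparser.ConfigParser(delimiters=("=",))
    parser.optionxform = str  # keep the letter case of path keys as written
    parser.read_string(ini_text)
    if parser.has_option("rules", yaml_path):
        return parser.get("rules", yaml_path)
    # No rule for this path: use the file-wide default, else position.
    return parser.get("defaults", "aoh", fallback="position")


CONFIG = """
[defaults]
aoh = dpos

[rules]
/products = key
"""

print(mode_for_path(CONFIG, "/products"))  # key
print(mode_for_path(CONFIG, "/services"))  # dpos
```

The "/services" path is an invented example that has no rule, so it resolves through the [defaults] section.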

Configuration File Section: keys

Like the [rules] section, the [keys] section takes any YAML Paths as keys. In contrast, each entry specifies the identity key for the AoH at the specified YAML Path, overriding implicit identity key detection for the targeted AoHs.

See Bonus: Configuration Files with Key Comparisons for an example.
