DeepDiff to run in multiple passes to diff combinations of results when ignore_order=True #136

seperman · 2019-03-25T23:35:54Z

DeepDiff should run in 2 passes and diff combinations of results when ignore_order=True.

Example:

Currently:

In [2]: from deepdiff import DeepDiff

In [3]: DeepDiff({'a': [1,2,3]}, {'a': [3,2,1, 0]}, ignore_order=True)
Out[3]: {'iterable_item_added': {"root['a'][3]": 0}}

In [4]: DeepDiff({'a': [{'b': [1,2,3]}]}, {'a': [{'b': [3,2,1, 0]}]}, ignore_order=True)
Out[4]:
{'iterable_item_added': {"root['a'][0]": {'b': [3, 2, 1, 0]}},
 'iterable_item_removed': {"root['a'][0]": {'b': [1, 2, 3]}}}

But if DeepDiff compared the added and removed iterable items with each other, it should report the following instead:

In [4]: DeepDiff({'a': [{'b': [1,2,3]}]}, {'a': [{'b': [3,2,1, 0]}]}, ignore_order=True)
Out[4]: {'iterable_item_added': {"root['a'][0]['b'][3]": 0}}
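
To make the idea concrete, here is a minimal pure-Python sketch of diffing in passes. It is not DeepDiff's implementation: the function name `diff_ignore_order`, the result keys, and the positional pairing of leftover items are all assumptions made for illustration. A real implementation would need a smarter way to decide which removed item belongs to which added item.

```python
def diff_ignore_order(old, new, path="root"):
    """Hypothetical helper: diff nested dicts/lists while ignoring list order."""
    result = {"added": {}, "removed": {}, "changed": {}}

    def merge(sub):
        for key in result:
            result[key].update(sub[key])

    if isinstance(old, dict) and isinstance(new, dict):
        for key in old.keys() - new.keys():
            result["removed"][f"{path}[{key!r}]"] = old[key]
        for key in new.keys() - old.keys():
            result["added"][f"{path}[{key!r}]"] = new[key]
        for key in old.keys() & new.keys():
            merge(diff_ignore_order(old[key], new[key], f"{path}[{key!r}]"))
    elif isinstance(old, list) and isinstance(new, list):
        # Pass 1: cancel out items that are equal, regardless of position.
        unmatched_new = list(enumerate(new))
        unmatched_old = []
        for i, item in enumerate(old):
            for pos, (_, candidate) in enumerate(unmatched_new):
                if candidate == item:
                    del unmatched_new[pos]
                    break
            else:
                unmatched_old.append((i, item))
        # Pass 2: pair the leftovers (naively, by position) and diff each
        # pair recursively instead of reporting both items as a whole.
        pairs = min(len(unmatched_old), len(unmatched_new))
        for (i, old_item), (_, new_item) in zip(unmatched_old, unmatched_new):
            merge(diff_ignore_order(old_item, new_item, f"{path}[{i}]"))
        # Anything that could not be paired really was removed or added.
        for i, old_item in unmatched_old[pairs:]:
            result["removed"][f"{path}[{i}]"] = old_item
        for j, new_item in unmatched_new[pairs:]:
            result["added"][f"{path}[{j}]"] = new_item
    elif old != new:
        result["changed"][path] = {"old_value": old, "new_value": new}
    return result


result = diff_ignore_order({'a': [{'b': [1, 2, 3]}]}, {'a': [{'b': [3, 2, 1, 0]}]})
print(result["added"])  # {"root['a'][0]['b'][3]": 0}
```

With this sketch the nested example reports only the added `0`, under a path that includes the `['b']` key, rather than one whole item added and one whole item removed.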


nkaliape · 2019-09-26T17:43:06Z

BTW, it may NOT be just two passes; this needs to be addressed across a multi-level hierarchy.
For example, consider a dict containing an iterable of dictionaries, each of which again contains an iterable of dictionaries, maybe 2 or 3 levels deeper.

testautomation · 2020-04-09T17:04:47Z

@seperman do you need some "real" test-data to play around with?

seperman · 2020-04-29T08:37:52Z

Hi @testautomation
This is addressed in v5, which is due for release. I just ran your input and it looks fine to me. You can pull the dev branch and do a beta test before v5 is released, too. Here are the changes:
https://github.com/seperman/deepdiff/pull/188/files
Thanks

seperman · 2020-04-29T08:38:29Z

tagging @nkaliape too.

testautomation · 2020-04-30T08:41:25Z

@seperman first thing I noticed is

 failed: ModuleNotFoundError: No module named 'numpy'

solved by pip install numpy

Maybe numpy should be added as a dependency so that it is installed automatically when doing pip install deepdiff?
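
The alternative to a hard dependency is to treat numpy as optional and guard the import. A minimal sketch of that pattern follows; the helper name `is_numpy_array` is hypothetical and this is not taken from DeepDiff's source.

```python
try:
    import numpy as np  # optional: only needed for array-aware handling
except ImportError:
    np = None  # fall back to plain-Python behaviour when numpy is missing


def is_numpy_array(obj):
    """Return True only when numpy is installed and obj is an ndarray."""
    return np is not None and isinstance(obj, np.ndarray)


print(is_numpy_array([1, 2, 3]))  # False, with or without numpy installed
```

With a guard like this, importing the package never fails just because numpy is absent.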

testautomation · 2020-04-30T09:16:30Z

looks definitely better than before (where I just had a big "iterable added" + another big "iterable removed")

Now the diff looks much better:

{'dictionary_item_added': {"root['rows'][0][0]['composer']['external_ref']['id']['value']": 'ref',
                           "root['rows'][0][0]['composer']['external_ref']['type']": 'PERSON',
                           "root['rows'][0][0]['content'][0]['activities'][0]['_type']": 'ACTIVITY',
                           "root['rows'][0][0]['context']['participations']": [{'_type': 'PARTICIPATION',
                                                                                'function': {'_type': 'DV_TEXT',
                                                                                             'value': 'legal guardian consent author'},
                                                                                'mode': {'_type': 'DV_CODED_TEXT',
                                                                                         'defining_code': {'code_string': '193',
                                                                                                           'terminology_id': {'value': 'openehr'}},
                                                                                         'value': 'not specified'},
                                                                                'performer': {'_type': 'PARTY_IDENTIFIED',
                                                                                              'name': 'Charles Connor'}}],
                           "root['rows'][1][0]['composer']['external_ref']['id']['value']": 'ref',
                           "root['rows'][1][0]['composer']['external_ref']['type']": 'PERSON',
                           "root['rows'][1][0]['context']['participations']": [{'_type': 'PARTICIPATION',
                                                                                'function': {'_type': 'DV_TEXT',
                                                                                             'value': 'companion'},
                                                                                'mode': {'_type': 'DV_CODED_TEXT',
                                                                                         'defining_code': {'code_string': '193',
                                                                                                           'terminology_id': {'value': 'openehr'}},
                                                                                         'value': 'not specified'},
                                                                                'performer': {'_type': 'PARTY_IDENTIFIED',
                                                                                              'name': 'Betty Bix'}}],
                           "root['rows'][2][0]['composer']['external_ref']['id']['value']": 'ref',
                           "root['rows'][2][0]['composer']['external_ref']['type']": 'PERSON',
                           "root['rows'][3][0]['composer']['external_ref']['id']['value']": 'ref',
                           "root['rows'][3][0]['composer']['external_ref']['type']": 'PERSON',
                           "root['rows'][4][0]['composer']['external_ref']['id']['value']": 'ref',
                           "root['rows'][4][0]['composer']['external_ref']['type']": 'PERSON',
                           "root['rows'][4][0]['context']['participations']": [{'_type': 'PARTICIPATION',
                                                                                'function': {'_type': 'DV_TEXT',
                                                                                             'value': 'legal guardian'},
                                                                                'mode': {'_type': 'DV_CODED_TEXT',
                                                                                         'defining_code': {'code_string': '193',
                                                                                                           'terminology_id': {'value': 'openehr'}},
                                                                                         'value': 'not specified'},
                                                                                'performer': {'_type': 'PARTY_IDENTIFIED',
                                                                                              'name': 'Martha Stewart'}}]},
 'dictionary_item_removed': {"root['rows'][0][0]['archetype_details']['rm_version']": '1.0.2',
                             "root['rows'][0][0]['content'][0]['narrative']['name']": {'value': 'Minimal'},
                             "root['rows'][1][0]['archetype_details']['rm_version']": '1.0.2',
                             "root['rows'][1][0]['content'][0]['data']['items'][0]['value']['other_reference_ranges']": [],
                             "root['rows'][2][0]['archetype_details']['rm_version']": '1.0.2',
                             "root['rows'][2][0]['content'][0]['data']['items'][0]['value']['other_reference_ranges']": [],
                             "root['rows'][2][0]['content'][0]['data']['items'][0]['value']['symbol']['defining_code']['terminology_id']['name']": 'local',
                             "root['rows'][3][0]['archetype_details']['rm_version']": '1.0.2',
                             "root['rows'][3][0]['content'][0]['description']['items'][0]['value']['other_reference_ranges']": [],
                             "root['rows'][4][0]['archetype_details']['rm_version']": '1.0.2',
                             "root['rows'][4][0]['content'][0]['data']['_type']": 'HISTORY',
                             "root['rows'][4][0]['content'][0]['data']['origin']['name']": {'value': 'Event Series'},
                             "root['rows'][4][0]['content'][0]['subject']['name']": {'value': 'Minimal'}},
 'type_changes': {"root['rows'][0][0]['content'][0]['activities'][0]['description']['items'][0]['value']['value']": {'new_type': <class 'float'>,
                                                                                                                     'new_value': 1800.0,
                                                                                                                     'old_type': <class 'str'>,
                                                                                                                     'old_value': 'PT30M'},
                  "root['rows'][2][0]['content'][0]['data']['items'][0]['value']['value']": {'new_type': <class 'int'>,
                                                                                             'new_value': 1,
                                                                                             'old_type': <class 'float'>,
                                                                                             'old_value': 1.0},
                  "root['rows'][3][0]['content'][0]['description']['items'][0]['value']['precision']": {'new_type': <class 'int'>,
                                                                                                        'new_value': 1,
                                                                                                        'old_type': <class 'float'>,
                                                                                                        'old_value': 1.0},
                  "root['rows'][3][0]['content'][0]['description']['items'][0]['value']['type']": {'new_type': <class 'int'>,
                                                                                                   'new_value': 3,
                                                                                                   'old_type': <class 'float'>,
                                                                                                   'old_value': 3.0}},
 'values_changed': {"root['rows'][0][0]['composer']['external_ref']['id']['_type']": {'new_value': 'HIER_OBJECT_ID',
                                                                                      'old_value': 'GENERIC_ID'},
                    "root['rows'][0][0]['context']['start_time']['value']": {'new_value': '2019-01-28T22:22:19,542+01:00',
                                                                             'old_value': '2019-01-28T22:22:19.542+01:00'},
                    "root['rows'][1][0]['composer']['external_ref']['id']['_type']": {'new_value': 'HIER_OBJECT_ID',
                                                                                      'old_value': 'GENERIC_ID'},
                    "root['rows'][1][0]['context']['start_time']['value']": {'new_value': '2019-01-28T22:22:19,979+01:00',
                                                                             'old_value': '2019-01-28T22:22:19.979+01:00'},
                    "root['rows'][2][0]['composer']['external_ref']['id']['_type']": {'new_value': 'HIER_OBJECT_ID',
                                                                                      'old_value': 'GENERIC_ID'},
                    "root['rows'][2][0]['context']['start_time']['value']": {'new_value': '2019-01-28T22:22:19,851+01:00',
                                                                             'old_value': '2019-01-28T22:22:19.851+01:00'},
                    "root['rows'][3][0]['composer']['external_ref']['id']['_type']": {'new_value': 'HIER_OBJECT_ID',
                                                                                      'old_value': 'GENERIC_ID'},
                    "root['rows'][3][0]['content'][0]['time']['value']": {'new_value': '2019-11-20T20:35:26,466Z',
                                                                          'old_value': '2019-11-20T20:35:26.466Z'},
                    "root['rows'][3][0]['context']['start_time']['value']": {'new_value': '2019-11-20T21:35:26,466+01:00',
                                                                             'old_value': '2019-11-20T21:35:26.466+01:00'},
                    "root['rows'][4][0]['composer']['external_ref']['id']['_type']": {'new_value': 'HIER_OBJECT_ID',
                                                                                      'old_value': 'GENERIC_ID'},
                    "root['rows'][4][0]['content'][0]['data']['events'][0]['time']['value']": {'new_value': '2019-01-28T21:22:19,562Z',
                                                                                               'old_value': '2019-01-28T21:22:19.562Z'},
                    "root['rows'][4][0]['content'][0]['data']['origin']['value']": {'new_value': '2019-01-28T21:22:19,552Z',
                                                                                    'old_value': '2019-01-28T21:22:19.552Z'},
                    "root['rows'][4][0]['context']['start_time']['value']": {'new_value': '2019-01-28T22:22:19,501+01:00',
                                                                             'old_value': '2019-01-28T22:22:19.501+01:00'}}}

data and settings used

actual.txt
expected.txt

IGNORE ORDER: True
IGNORE_STRING_CASE: False
IGNORE_TYPE_SUBCLASSES: False
VERBOSE_LEVEL: 2	
KWARGS: {
        'exclude_regex_paths': 
            [
                "root\\['meta'\\]", 
                "\\['columns'\\]\\[\\d+\\]\\['path'\\]"
            ], 
            
        'exclude_obj_callback': <function ignore_type_properties at 0x7fc6515578b0>
        }

# exclude_obj_callback function
def ignore_type_properties(obj, path):
    # openEHR type names whose "_type" entries should be excluded from the diff
    ignorable_types = [
        "ARCHETYPE_ID",
        "ARCHETYPED",
        "CODE_PHRASE",
        "DV_BOOLEAN",
        "DV_CODED_TEXT",
        "DV_COUNT",
        "DV_DATE",
        "DV_DATE_TIME",
        "DV_DURATION",
        "DV_EHR_URI",
        "DV_IDENTIFIER",
        "DV_MULTIMEDIA",
        "DV_ORDINAL",
        "DV_PARSABLE",
        "DV_PROPORTION",
        "DV_QUANTITY",
        "DV_SCALE",
        "DV_STATE",
        "DV_TEXT",
        "DV_TIME",
        "DV_URI",
        "REFERENCE_RANGE",
        "TEMPLATE_ID",
        "TERM_MAPPING",
        "TERMINOLOGY_ID",
    ]
    # Exclude the object only if its path mentions "_type" and its value is listed above
    return "_type" in path and obj in ignorable_types
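
To show what the callback decides for individual nodes, here is a small standalone check with a shortened type list. The sample paths are made up for illustration, modelled on the paths in the diff above.

```python
# Shortened list for illustration only; the real callback lists many more types.
IGNORABLE_TYPES = ["DV_TEXT", "DV_CODED_TEXT", "DV_QUANTITY"]


def ignore_type_properties(obj, path):
    """Exclude a value when its path mentions '_type' and the value is an ignorable type name."""
    return "_type" in path and obj in IGNORABLE_TYPES


samples = [
    # ignorable type name stored under a '_type' key -> excluded
    ("DV_TEXT", "root['rows'][0][0]['context']['participations'][0]['function']['_type']"),
    # '_type' key, but the value is not in the ignorable list -> kept
    ("PARTICIPATION", "root['rows'][0][0]['context']['participations'][0]['_type']"),
    # ignorable-looking value, but not under a '_type' key -> kept
    ("DV_TEXT", "root['rows'][0][0]['content'][0]['narrative']['name']['value']"),
]
decisions = [ignore_type_properties(obj, path) for obj, path in samples]
print(decisions)  # [True, False, False]
```

Note that `"_type" in path` is a substring test, so it would also match any other key that happens to contain `_type`.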

testautomation · 2020-04-30T11:25:56Z

@seperman I may have found an issue. I'm not 100% sure yet, but it seems like my tests ~~got stuck in an infinite loop or at least~~ got exceptionally slow. They have been running locally for almost 1 h now and have neither finished nor failed. FYI: I have a test suite that finishes within ~16 minutes on the CircleCI pipeline and a bit faster when executed locally (~10 minutes). It compares approx. 100 data-sets, some of which have over 20 K lines of JSON.

edit: test just finished (successfully) after 1 h and 8 minutes 😱

I'll try to figure out which data-set causes the slow down.

testautomation · 2020-04-30T11:54:21Z

here are the test results: test_report_and_log.zip but don't care much about them yet. I'll have to repeat the whole procedure bc my VM could have been the root cause - it was running out of disk space 🙈

testautomation · 2020-04-30T14:30:13Z

Repeated the test w/ more RAM and disk space. Same result. But I think the slowdown is reasonable because there is much more going on under the hood now. So it's not an issue with the changes in v5 but simply due to the fact that the diff is HUGE. Btw., Robot's XML file (which is written during test execution) grew to over 1 GB. The way I wrapped DeepDiff in Robot and the logging that I do may also be a reason.

I'll extract the generated test-data from my test log to make it easier to test with Deepdiff directly (w/o other parts like Robot or "writing to XML" involved).

@seperman Is there a way to abort the comparison when, let's say, "enough" diffs were recognized? Something like a diff limit?
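
To illustrate what I mean by a diff limit, here is a rough pure-Python sketch: a generator that yields differences lazily, so the caller can stop after the first few. It is order-sensitive and much simpler than DeepDiff; `iter_diffs`, `first_diffs` and `MISSING` are made-up names, not DeepDiff API.

```python
from itertools import islice

MISSING = object()  # marks "no value on this side"


def iter_diffs(old, new, path="root"):
    """Lazily yield (path, old_value, new_value) for each difference found."""
    if isinstance(old, dict) and isinstance(new, dict):
        for key in sorted(old.keys() | new.keys(), key=repr):
            child = f"{path}[{key!r}]"
            if key not in new:
                yield child, old[key], MISSING
            elif key not in old:
                yield child, MISSING, new[key]
            else:
                yield from iter_diffs(old[key], new[key], child)
    elif isinstance(old, list) and isinstance(new, list):
        for index in range(max(len(old), len(new))):
            child = f"{path}[{index}]"
            if index >= len(new):
                yield child, old[index], MISSING
            elif index >= len(old):
                yield child, MISSING, new[index]
            else:
                yield from iter_diffs(old[index], new[index], child)
    elif old != new:
        yield path, old, new


def first_diffs(old, new, limit):
    """Stop walking the structures once `limit` differences have been produced."""
    return list(islice(iter_diffs(old, new), limit))


old = {"a": [1, 2, 3], "b": {"x": 1}, "c": 5}
new = {"a": [1, 9, 3, 4], "b": {"x": 2}, "c": 6}
limited = first_diffs(old, new, 2)
print(len(limited))  # 2
```

Because the generator is only advanced as far as needed, the remaining parts of both structures are never visited once the limit is reached.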

seperman · 2020-04-30T15:12:30Z

Thanks for the very useful information. DeepDiff now runs recursively between any diffs it finds to see if it can pinpoint the actual difference, so it is way slower than before.

Numpy shouldn't be required. Thanks for reporting it.

I will look into some optimizations. We can add a parameter for max passes to run. Currently that is the max recursion depth allowed.

Have you installed murmur3? It is in the docs. It should increase the CPU usage but dramatically decrease the memory usage. Also, have you tried using pypy3 to run the diff? I'm curious if you will gain any speed.

I will keep you posted once I do some tests.

Thanks
Sep Dehpour


testautomation · 2020-04-30T15:35:48Z

Here is the part I identified to have the most impact in my tests

I've extracted relevant data from that. Here as .txt for a quick look

huge_actual.txt
huge_expected.txt

And here as .json:
huge_JSONs.zip

You can quickly use the latter after extracting it into a folder:

import json
from deepdiff import DeepDiff

# Load both documents; the context managers close the files afterwards
with open('huge_actual.json') as f:
    actual = json.load(f)
with open('huge_expected.json') as f:
    expected = json.load(f)

# this is fast
diff = DeepDiff(actual, expected)

# gets dramatically slower w/ ignore_order
diff = DeepDiff(actual, expected, ignore_order=True)

testautomation · 2020-04-30T15:36:11Z

> We can add a parameter for max passes to run.

That would be really great!

> Have you installed murmur3? / Tried pypy3?

Not yet. I can give it a shot but I have to make sure that it also works on CI. Same w/ pypy3. I'll let you know if I can report something interesting about that.

Cheers

seperman · 2020-06-23T18:56:57Z

DeepDiff 5 is finally here, and it comes with a multiple passes option!
https://zepworks.com/deepdiff/5.0.0/ignore_order.html#max-passes

seperman added the enhancement label Mar 25, 2019

seperman self-assigned this Mar 25, 2019

seperman added this to To do in Running in passes Mar 25, 2019

seperman changed the title ~~DeepDiff to run in 2 passes to diff combinations of results when ignore_order=True~~ DeepDiff to run in multiple passes to diff combinations of results when ignore_order=True Apr 12, 2020

seperman closed this as completed Jun 23, 2020

Running in passes automation moved this from To do to Done Jun 23, 2020
