DeepDiff to run in multiple passes to diff combinations of results when ignore_order=True #136

Closed
seperman opened this issue Mar 25, 2019 · 13 comments

@seperman
Owner

DeepDiff should run in 2 passes and diff combinations of results when ignore_order=True.

Example:

Currently:

In [2]: from deepdiff import DeepDiff

In [3]: DeepDiff({'a': [1,2,3]}, {'a': [3,2,1, 0]}, ignore_order=True)
Out[3]: {'iterable_item_added': {"root['a'][3]": 0}}

In [4]: DeepDiff({'a': [{'b': [1,2,3]}]}, {'a': [{'b': [3,2,1, 0]}]}, ignore_order=True)
Out[4]:
{'iterable_item_added': {"root['a'][0]": {'b': [3, 2, 1, 0]}},
 'iterable_item_removed': {"root['a'][0]": {'b': [1, 2, 3]}}}

But if DeepDiff compared the items reported as added against the items reported as removed, it would produce the following result instead:

In [4]: DeepDiff({'a': [{'b': [1,2,3]}]}, {'a': [{'b': [3,2,1, 0]}]}, ignore_order=True)
Out[4]: {'iterable_item_added': {"root['a'][0]['b'][3]": 0}}
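
The second pass described above can be sketched in plain Python. This is only a minimal illustration of the idea and not DeepDiff's implementation: the function name `diff_unordered` is hypothetical, dictionary keys that exist on only one side are skipped, and leftover items are paired greedily by type, which a real implementation would need to do more carefully.

```python
def diff_unordered(old, new, path="root", report=None):
    """Order-insensitive diff of nested dicts/lists (illustrative sketch only)."""
    if report is None:
        report = {"iterable_item_added": {}, "iterable_item_removed": {},
                  "values_changed": {}}
    if isinstance(old, dict) and isinstance(new, dict):
        # Only keys present on both sides are compared, to keep the sketch short.
        for key in old.keys() & new.keys():
            diff_unordered(old[key], new[key], f"{path}[{key!r}]", report)
    elif isinstance(old, list) and isinstance(new, list):
        # Pass 1: cancel out items that are equal, wherever they sit.
        removed = list(enumerate(old))
        added = []
        for index, item in enumerate(new):
            for pos, (_, candidate) in enumerate(removed):
                if candidate == item:
                    del removed[pos]
                    break
            else:
                added.append((index, item))
        # Pass 2: pair leftover containers of the same type (greedy assumption)
        # and diff inside them instead of reporting them wholesale.
        for entry in list(removed):
            old_index, old_item = entry
            if not isinstance(old_item, (dict, list)):
                continue
            match = next((a for a in added if type(a[1]) is type(old_item)), None)
            if match is not None:
                removed.remove(entry)
                added.remove(match)
                diff_unordered(old_item, match[1], f"{path}[{old_index}]", report)
        # Whatever is still unpaired is a genuine addition or removal.
        for index, item in removed:
            report["iterable_item_removed"][f"{path}[{index}]"] = item
        for index, item in added:
            report["iterable_item_added"][f"{path}[{index}]"] = item
    elif old != new:
        report["values_changed"][path] = {"old_value": old, "new_value": new}
    return report


result = diff_unordered({'a': [{'b': [1, 2, 3]}]}, {'a': [{'b': [3, 2, 1, 0]}]})
# result["iterable_item_added"] contains only the extra 0 inside the nested list
```

Because the function recurses into each paired item, the same two steps repeat at every nesting level rather than only twice.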
@seperman seperman self-assigned this Mar 25, 2019
@seperman seperman added this to To do in Running in passes Mar 25, 2019
@nkaliape

BTW, two passes may NOT be enough; this needs to be handled across multiple levels of the hierarchy.
For example, consider a dict containing an iterable of dictionaries, each of which again contains an iterable of dictionaries, perhaps 2 or 3 levels deeper.

@testautomation

@seperman do you need some "real" test-data to play around with?

@seperman seperman changed the title DeepDiff to run in 2 passes to diff combinations of results when ignore_order=True DeepDiff to run in multiple passes to diff combinations of results when ignore_order=True Apr 12, 2020
@seperman
Owner Author

Hi @testautomation
This is addressed in v5, which is due for release. I just ran your input and it looks fine to me. You can also pull the dev branch and beta test it before v5 is released. Here are the changes:
https://github.com/seperman/deepdiff/pull/188/files
Thanks

@seperman
Owner Author

tagging @nkaliape too.

@testautomation

testautomation commented Apr 30, 2020

@seperman first thing I noticed is

 failed: ModuleNotFoundError: No module named 'numpy'

solved by pip install numpy

Maybe numpy should be added as a dependency so that it is installed automatically when doing pip install deepdiff?

@testautomation

testautomation commented Apr 30, 2020

looks definitely better than before (where I just had a big "iterable added" + another big "iterable removed")

Now the diff looks much better:

{'dictionary_item_added': {"root['rows'][0][0]['composer']['external_ref']['id']['value']": 'ref',
                           "root['rows'][0][0]['composer']['external_ref']['type']": 'PERSON',
                           "root['rows'][0][0]['content'][0]['activities'][0]['_type']": 'ACTIVITY',
                           "root['rows'][0][0]['context']['participations']": [{'_type': 'PARTICIPATION',
                                                                                'function': {'_type': 'DV_TEXT',
                                                                                             'value': 'legal guardian consent author'},
                                                                                'mode': {'_type': 'DV_CODED_TEXT',
                                                                                         'defining_code': {'code_string': '193',
                                                                                                           'terminology_id': {'value': 'openehr'}},
                                                                                         'value': 'not specified'},
                                                                                'performer': {'_type': 'PARTY_IDENTIFIED',
                                                                                              'name': 'Charles Connor'}}],
                           "root['rows'][1][0]['composer']['external_ref']['id']['value']": 'ref',
                           "root['rows'][1][0]['composer']['external_ref']['type']": 'PERSON',
                           "root['rows'][1][0]['context']['participations']": [{'_type': 'PARTICIPATION',
                                                                                'function': {'_type': 'DV_TEXT',
                                                                                             'value': 'companion'},
                                                                                'mode': {'_type': 'DV_CODED_TEXT',
                                                                                         'defining_code': {'code_string': '193',
                                                                                                           'terminology_id': {'value': 'openehr'}},
                                                                                         'value': 'not specified'},
                                                                                'performer': {'_type': 'PARTY_IDENTIFIED',
                                                                                              'name': 'Betty Bix'}}],
                           "root['rows'][2][0]['composer']['external_ref']['id']['value']": 'ref',
                           "root['rows'][2][0]['composer']['external_ref']['type']": 'PERSON',
                           "root['rows'][3][0]['composer']['external_ref']['id']['value']": 'ref',
                           "root['rows'][3][0]['composer']['external_ref']['type']": 'PERSON',
                           "root['rows'][4][0]['composer']['external_ref']['id']['value']": 'ref',
                           "root['rows'][4][0]['composer']['external_ref']['type']": 'PERSON',
                           "root['rows'][4][0]['context']['participations']": [{'_type': 'PARTICIPATION',
                                                                                'function': {'_type': 'DV_TEXT',
                                                                                             'value': 'legal guardian'},
                                                                                'mode': {'_type': 'DV_CODED_TEXT',
                                                                                         'defining_code': {'code_string': '193',
                                                                                                           'terminology_id': {'value': 'openehr'}},
                                                                                         'value': 'not specified'},
                                                                                'performer': {'_type': 'PARTY_IDENTIFIED',
                                                                                              'name': 'Martha Stewart'}}]},
 'dictionary_item_removed': {"root['rows'][0][0]['archetype_details']['rm_version']": '1.0.2',
                             "root['rows'][0][0]['content'][0]['narrative']['name']": {'value': 'Minimal'},
                             "root['rows'][1][0]['archetype_details']['rm_version']": '1.0.2',
                             "root['rows'][1][0]['content'][0]['data']['items'][0]['value']['other_reference_ranges']": [],
                             "root['rows'][2][0]['archetype_details']['rm_version']": '1.0.2',
                             "root['rows'][2][0]['content'][0]['data']['items'][0]['value']['other_reference_ranges']": [],
                             "root['rows'][2][0]['content'][0]['data']['items'][0]['value']['symbol']['defining_code']['terminology_id']['name']": 'local',
                             "root['rows'][3][0]['archetype_details']['rm_version']": '1.0.2',
                             "root['rows'][3][0]['content'][0]['description']['items'][0]['value']['other_reference_ranges']": [],
                             "root['rows'][4][0]['archetype_details']['rm_version']": '1.0.2',
                             "root['rows'][4][0]['content'][0]['data']['_type']": 'HISTORY',
                             "root['rows'][4][0]['content'][0]['data']['origin']['name']": {'value': 'Event Series'},
                             "root['rows'][4][0]['content'][0]['subject']['name']": {'value': 'Minimal'}},
 'type_changes': {"root['rows'][0][0]['content'][0]['activities'][0]['description']['items'][0]['value']['value']": {'new_type': <class 'float'>,
                                                                                                                     'new_value': 1800.0,
                                                                                                                     'old_type': <class 'str'>,
                                                                                                                     'old_value': 'PT30M'},
                  "root['rows'][2][0]['content'][0]['data']['items'][0]['value']['value']": {'new_type': <class 'int'>,
                                                                                             'new_value': 1,
                                                                                             'old_type': <class 'float'>,
                                                                                             'old_value': 1.0},
                  "root['rows'][3][0]['content'][0]['description']['items'][0]['value']['precision']": {'new_type': <class 'int'>,
                                                                                                        'new_value': 1,
                                                                                                        'old_type': <class 'float'>,
                                                                                                        'old_value': 1.0},
                  "root['rows'][3][0]['content'][0]['description']['items'][0]['value']['type']": {'new_type': <class 'int'>,
                                                                                                   'new_value': 3,
                                                                                                   'old_type': <class 'float'>,
                                                                                                   'old_value': 3.0}},
 'values_changed': {"root['rows'][0][0]['composer']['external_ref']['id']['_type']": {'new_value': 'HIER_OBJECT_ID',
                                                                                      'old_value': 'GENERIC_ID'},
                    "root['rows'][0][0]['context']['start_time']['value']": {'new_value': '2019-01-28T22:22:19,542+01:00',
                                                                             'old_value': '2019-01-28T22:22:19.542+01:00'},
                    "root['rows'][1][0]['composer']['external_ref']['id']['_type']": {'new_value': 'HIER_OBJECT_ID',
                                                                                      'old_value': 'GENERIC_ID'},
                    "root['rows'][1][0]['context']['start_time']['value']": {'new_value': '2019-01-28T22:22:19,979+01:00',
                                                                             'old_value': '2019-01-28T22:22:19.979+01:00'},
                    "root['rows'][2][0]['composer']['external_ref']['id']['_type']": {'new_value': 'HIER_OBJECT_ID',
                                                                                      'old_value': 'GENERIC_ID'},
                    "root['rows'][2][0]['context']['start_time']['value']": {'new_value': '2019-01-28T22:22:19,851+01:00',
                                                                             'old_value': '2019-01-28T22:22:19.851+01:00'},
                    "root['rows'][3][0]['composer']['external_ref']['id']['_type']": {'new_value': 'HIER_OBJECT_ID',
                                                                                      'old_value': 'GENERIC_ID'},
                    "root['rows'][3][0]['content'][0]['time']['value']": {'new_value': '2019-11-20T20:35:26,466Z',
                                                                          'old_value': '2019-11-20T20:35:26.466Z'},
                    "root['rows'][3][0]['context']['start_time']['value']": {'new_value': '2019-11-20T21:35:26,466+01:00',
                                                                             'old_value': '2019-11-20T21:35:26.466+01:00'},
                    "root['rows'][4][0]['composer']['external_ref']['id']['_type']": {'new_value': 'HIER_OBJECT_ID',
                                                                                      'old_value': 'GENERIC_ID'},
                    "root['rows'][4][0]['content'][0]['data']['events'][0]['time']['value']": {'new_value': '2019-01-28T21:22:19,562Z',
                                                                                               'old_value': '2019-01-28T21:22:19.562Z'},
                    "root['rows'][4][0]['content'][0]['data']['origin']['value']": {'new_value': '2019-01-28T21:22:19,552Z',
                                                                                    'old_value': '2019-01-28T21:22:19.552Z'},
                    "root['rows'][4][0]['context']['start_time']['value']": {'new_value': '2019-01-28T22:22:19,501+01:00',
                                                                             'old_value': '2019-01-28T22:22:19.501+01:00'}}}

data and settings used

actual.txt
expected.txt

IGNORE ORDER: True
IGNORE_STRING_CASE: False
IGNORE_TYPE_SUBCLASSES: False
VERBOSE_LEVEL: 2
KWARGS: {
    'exclude_regex_paths': [
        "root\\['meta'\\]",
        "\\['columns'\\]\\[\\d+\\]\\['path'\\]"
    ],
    'exclude_obj_callback': <function ignore_type_properties at 0x7fc6515578b0>
}
# exclude_obj_callback function
def ignore_type_properties(obj, path):
    """Return True for "_type" entries whose value is an ignorable type name."""
    ignorable_types = [
        "ARCHETYPE_ID",
        "ARCHETYPED",
        "CODE_PHRASE",
        "DV_BOOLEAN",
        "DV_CODED_TEXT",
        "DV_COUNT",
        "DV_DATE",
        "DV_DATE_TIME",
        "DV_DURATION",
        "DV_EHR_URI",
        "DV_IDENTIFIER",
        "DV_MULTIMEDIA",
        "DV_ORDINAL",
        "DV_PARSABLE",
        "DV_PROPORTION",
        "DV_QUANTITY",
        "DV_SCALE",
        "DV_STATE",
        "DV_TEXT",
        "DV_TIME",
        "DV_URI",
        "REFERENCE_RANGE",
        "TEMPLATE_ID",
        "TERM_MAPPING",
        "TERMINOLOGY_ID",
    ]
    # The membership test already yields a bool, so no conditional is needed.
    return "_type" in path and obj in ignorable_types
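
For anyone reusing these settings, the exclusion rules can be sanity-checked in isolation before running a long comparison. The sketch below is an assumption-laden illustration: it applies the two patterns with `re.search`, uses a shortened, hypothetical subset of the type names, and the helper `is_excluded` is not part of DeepDiff.

```python
import re

# Patterns copied from the settings above, written as raw strings.
EXCLUDE_REGEX_PATHS = [
    r"root\['meta'\]",
    r"\['columns'\]\[\d+\]\['path'\]",
]

# Shortened, hypothetical subset of the ignorable type names.
IGNORABLE_TYPES = ["DV_TEXT", "DV_CODED_TEXT", "DV_QUANTITY"]


def ignore_type_properties(obj, path):
    """Same rule as the callback above, using the shortened list."""
    return "_type" in path and obj in IGNORABLE_TYPES


def is_excluded(obj, path):
    """True if a regex or the callback would exclude this node (assumed search semantics)."""
    if any(re.search(pattern, path) for pattern in EXCLUDE_REGEX_PATHS):
        return True
    return ignore_type_properties(obj, path)


# A "_type" entry holding an ignorable name is excluded ...
assert is_excluded("DV_TEXT", "root['rows'][0][0]['function']['_type']")
# ... but a "_type" entry holding another name is kept.
assert not is_excluded("PARTICIPATION", "root['rows'][0][0]['_type']")
# Column paths and everything under root['meta'] are excluded by regex.
assert is_excluded("/content", "root['columns'][3]['path']")
assert is_excluded("x", "root['meta']['href']")
```

Note that `"_type" in path` is a substring test, so it also matches any key that merely contains `_type`; checking `path.endswith("['_type']")` would be stricter if that matters for the data.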

@testautomation

testautomation commented Apr 30, 2020

@seperman I may have found an issue. I'm not 100% sure yet, but it seems my tests got stuck in an infinite loop or at least became exceptionally slow. They have been running locally for almost 1 h and have neither finished nor failed. FYI, my test suite normally finishes within ~16 minutes on the CircleCI pipeline, and a bit faster when executed locally (~10 minutes). It compares approx. 100 data-sets, some of which have over 20 K lines of JSON.

edit: test just finished (successfully) after 1 h and 8 minutes 😱

I'll try to figure out which data-set causes the slow down.

@testautomation

Here are the test results: test_report_and_log.zip, but don't pay much attention to them yet. I'll have to repeat the whole procedure because my VM could have been the root cause: it was running out of disk space 🙈

@testautomation

Repeated the test w/ more RAM and disk space. Same result. But I think the slowdown is reasonable because there is much more going on under the hood now. So it's not an issue with the changes in v5 but simply due to the fact that the diff is HUGE. Btw, Robot's XML file (which is written during test execution) grew to over 1 GB. The way I wrapped DeepDiff in Robot and the logging that I do may also be a reason.

I'll extract the generated test-data from my test log to make it easier to test with DeepDiff directly (w/o other parts like Robot or "writing to XML" involved).

@seperman Is there a way to abort the comparison when, let's say, "enough" diffs were recognized? Something like a diff limit?
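
To make the "diff limit" idea concrete: if differences are produced lazily, the caller can stop after the first N. The sketch below is a generic illustration and not a DeepDiff feature; `iter_diffs` is a hypothetical helper that only descends into dicts and treats everything else, lists included, as a leaf value.

```python
from itertools import islice


def iter_diffs(old, new, path="root"):
    """Lazily yield (kind, path) pairs for differences between nested dicts."""
    if isinstance(old, dict) and isinstance(new, dict):
        # Sort keys by repr so the order of reported diffs is deterministic.
        for key in sorted(old.keys() | new.keys(), key=repr):
            child = f"{path}[{key!r}]"
            if key not in new:
                yield ("dictionary_item_removed", child)
            elif key not in old:
                yield ("dictionary_item_added", child)
            else:
                yield from iter_diffs(old[key], new[key], child)
    elif old != new:
        yield ("values_changed", path)


old = {"a": 1, "b": 2, "c": 3, "d": 4}
new = {"a": 10, "b": 20, "c": 30, "e": 4}

# Stop after two differences; the remaining keys are never compared.
first_two = list(islice(iter_diffs(old, new), 2))
```

Since the generator only does work when asked for the next item, a huge pair of documents costs no more than the part that is actually inspected.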

@seperman
Owner Author

seperman commented Apr 30, 2020 via email

@testautomation

Here is the part I identified to have the most impact in my tests
image

I've extracted relevant data from that. Here as .txt for a quick look

huge_actual.txt
huge_expected.txt

And here as .json
huge_JSONs.zip

You can quickly put the latter to use by extracting it into a folder and then running:

import json
from deepdiff import DeepDiff

# load both documents and close the files afterwards
with open('huge_actual.json') as f:
    actual = json.load(f)
with open('huge_expected.json') as f:
    expected = json.load(f)

# this is fast
diff = DeepDiff(actual, expected)

# gets dramatically slower w/ ignore_order
diff = DeepDiff(actual, expected, ignore_order=True)
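
To put numbers on "fast" versus "dramatically slower", each call can be wrapped in a small timing helper. This is just a sketch; `timed` is a hypothetical helper name, and it works with any callable, so it is shown here with a stand-in function instead of the real data.

```python
import time


def timed(label, func, *args, **kwargs):
    """Call func(*args, **kwargs), print the elapsed wall time, return both."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f} s")
    return result, elapsed


# Stand-in workload; with the data above one would instead call, for example,
#   timed("ordered", DeepDiff, actual, expected)
#   timed("ignore_order", DeepDiff, actual, expected, ignore_order=True)
total, seconds = timed("sum", sum, range(1_000_000))
```

Running the helper per data-set is also a cheap way to find which input causes the slowdown.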

@testautomation

We can add a parameter for max passes to run.

That would be really great!

Have you installed murmur3? / Tried pypy3?

Not yet. I can give it a shot but I have to make sure that it also works on CI. Same w/ pypy3. I'll let you know if I can report something interesting about that.

Cheers

@seperman
Owner Author

DeepDiff 5 is finally here and it comes with multiple passes option!
https://zepworks.com/deepdiff/5.0.0/ignore_order.html#max-passes

Running in passes automation moved this from To do to Done Jun 23, 2020