
Delta + Numpy structure throws ambiguous true value #194

Closed
David-Herman opened this issue May 18, 2020 · 11 comments
@David-Herman

Describe the bug
I am using the delta of the diff to generate the other input again. When using numpy arrays as input (or dictionaries of lists of arrays, etc.) it throws an exception, ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

To Reproduce
Steps to reproduce the behavior:

  1. Input Code

import numpy as np
from deepdiff import DeepDiff, Delta

a1 = np.array([1, 2, 3, 4])
a2 = np.array([5, 6, 7, 8, 9, 10])
mydiff = DeepDiff(a1, a2)
delta = Delta(mydiff)
delta + a1

  2. Error Trace
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-62-bbde3ce70e97> in <module>
      3 mydiff = DeepDiff(a1, a2)
      4 delta = Delta(mydiff)
----> 5 delta + a1

~\AppData\Local\Continuum\anaconda3\lib\site-packages\deepdiff\delta.py in __add__(self, other)
    146         else:
    147             self.root = deepcopy(other)
--> 148         self._do_values_changed()
    149         self._do_set_item_added()
    150         self._do_set_item_removed()

~\AppData\Local\Continuum\anaconda3\lib\site-packages\deepdiff\delta.py in _do_values_changed(self)
    300         values_changed = self.diff.get('values_changed')
    301         if values_changed:
--> 302             self._do_values_or_type_changed(values_changed)
    303 
    304     def _do_type_changes(self):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\deepdiff\delta.py in _do_values_or_type_changed(self, changes, is_type_change)
    358 
    359             self._set_new_value(parent, parent_to_obj_elem, parent_to_obj_action,
--> 360                                 obj, elements, path, elem, action, new_value)
    361 
    362             self._do_verify_changes(path, expected_old_value, current_old_value)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\deepdiff\delta.py in _set_new_value(self, parent, parent_to_obj_elem, parent_to_obj_action, obj, elements, path, elem, action, new_value)
    234         self._simple_set_elem_value(obj=obj, path_for_err_reporting=path, elem=elem,
    235                                     value=new_value, action=action)
--> 236         if obj_is_new and parent:
    237             # Making sure that the object is re-instated inside the parent especially if it was immutable
    238             # and we had to turn it into a mutable one. In such cases the object has a new id.

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
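For context on the error itself: NumPy raises this ValueError whenever a multi-element array is implicitly coerced to bool, which is what the `if obj_is_new and parent:` line in delta.py ends up doing. A stdlib-only sketch (the `ArrayLike` class below is a hypothetical stand-in for `ndarray`, not DeepDiff code) shows the failure mode and the usual `is not None` workaround:

```python
class ArrayLike:
    """Hypothetical stand-in that mimics numpy.ndarray truthiness:
    coercing a multi-element array to bool raises ValueError."""
    def __init__(self, items):
        self.items = list(items)

    def __bool__(self):
        if len(self.items) != 1:
            raise ValueError(
                "The truth value of an array with more than one element "
                "is ambiguous. Use a.any() or a.all()")
        return bool(self.items[0])


obj = ArrayLike([1, 2, 3, 4])
parent = {"root": obj}

# The `if obj and parent:` pattern implicitly calls bool(obj) and blows up:
try:
    if obj and parent:
        pass
except ValueError as e:
    print("raises:", e)

# An explicit identity check never invokes __bool__ on the array:
if obj is not None and parent:
    print("safe check passed")
```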

Expected behavior
Return of a data structure matching the NumPy array a2.

OS, DeepDiff version and Python version (please complete the following information):

  • OS: Win7
  • Python: 3.7.4 (default, Aug 9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)]
  • DeepDiff Version: '5.0.0' (From Dev today)
@seperman
Owner

@David-Herman Thanks for reporting the issue. This is happening because the shapes of the two arrays are different. I'm fixing it now.

@seperman seperman self-assigned this May 19, 2020
@seperman seperman added the bug label May 19, 2020
@David-Herman
Author

thanks

@seperman
Owner

@David-Herman This is fixed in the dev branch now. I added your example as a test case.

@David-Herman
Author

Hello,

Thanks for the quick commit. I believe the code is causing a regression somewhere that is leading to poor performance (an infinite loop?). I have two HDF5 (.mat) files, around 15-20 MB in size, that unpack into Python objects via scipy.io.loadmat(). I cannot provide the raw data as it is confidential. Please let me know if I can assist further in the debugging.

When I install the prior dev commit (that fails due to the above issue) I get the following.

# pip install git+git://github.com/seperman/deepdiff.git@73fd3f8f8349bfc0d28c3bff6604fb278f666227
%timeit ddiff = DeepDiff(mat1, mat2) # 650 MB peak memory usage

That returns

1.83 s ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

When I install the current head of dev I get the following results,

# pip install git+git://github.com/seperman/deepdiff.git@dev
%timeit ddiff = DeepDiff(mat1, mat2) 

That increasingly uses up free memory; I killed it at 20 GB of usage for the Python process. I did confirm the fix works for the simple example above.

When I interrupt the process I get:

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-5-a9c66558ea9b> in <module>
----> 1 ddiff = DeepDiff(mat1, mat2)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\deepdiff\diff.py in __init__(self, t1, t2, ignore_order, report_repetition, significant_digits, number_format_notation, exclude_paths, exclude_regex_paths, exclude_types, ignore_type_in_groups, ignore_string_type_changes, ignore_numeric_type_changes, ignore_type_subclasses, ignore_string_case, exclude_obj_callback, number_to_string_func, ignore_nan_inequality, ignore_private_variables, verbose_level, view, hasher, hashes, parameters, shared_parameters, max_passes, max_distances_to_keep_track_per_item, max_diffs, cutoff_distance_for_pairs, log_frequency_in_sec, progress_logger, _stats, _cache, _numpy_paths, **kwargs)
    215         try:
    216             root = DiffLevel(t1, t2, verbose_level=self.verbose_level)
--> 217             self.__diff(root, parents_ids=frozenset({id(t1)}))
    218 
    219             self.tree.remove_empty_keys()

~\AppData\Local\Continuum\anaconda3\lib\site-packages\deepdiff\diff.py in __diff(self, level, parents_ids)
   1070 
   1071         elif isinstance(level.t1, Mapping):
-> 1072             self.__diff_dict(level, parents_ids)
   1073 
   1074         elif isinstance(level.t1, tuple):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\deepdiff\diff.py in __diff_dict(self, level, parents_ids, print_as_attribute, override, override_t1, override_t2)
    456                 child_relationship_class=rel_class,
    457                 child_relationship_param=key)
--> 458             self.__diff(next_level, parents_ids_added)
    459 
    460     def __diff_set(self, level):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\deepdiff\diff.py in __diff(self, level, parents_ids)
   1085 
   1086         else:
-> 1087             self.__diff_obj(level, parents_ids)
   1088 
   1089     def _get_view_results(self, view):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\deepdiff\diff.py in __diff_obj(self, level, parents_ids, is_namedtuple)
    322             override=True,
    323             override_t1=t1,
--> 324             override_t2=t2)
    325 
    326     def __skip_this(self, level):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\deepdiff\diff.py in __diff_dict(self, level, parents_ids, print_as_attribute, override, override_t1, override_t2)
    456                 child_relationship_class=rel_class,
    457                 child_relationship_param=key)
--> 458             self.__diff(next_level, parents_ids_added)
    459 
    460     def __diff_set(self, level):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\deepdiff\diff.py in __diff(self, level, parents_ids)
   1079 
   1080         elif isinstance(level.t1, np_ndarray):
-> 1081             self.__diff_numpy_array(level, parents_ids)
   1082 
   1083         elif isinstance(level.t1, Iterable):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\deepdiff\diff.py in __diff_numpy_array(self, level, parents_ids)
    981             level.t1 = level.t1.tolist()
    982             level.t2 = level.t2.tolist()
--> 983             self.__diff_iterable(level, parents_ids)
    984         else:
    985             # metadata same -- the difference is in the content

~\AppData\Local\Continuum\anaconda3\lib\site-packages\deepdiff\diff.py in __diff_iterable(self, level, parents_ids)
    503             self.__diff_iterable_with_deephash(level, parents_ids)
    504         else:
--> 505             self.__diff_iterable_in_order(level, parents_ids)
    506 
    507     def __diff_iterable_in_order(self, level, parents_ids=frozenset({})):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\deepdiff\diff.py in __diff_iterable_in_order(self, level, parents_ids)
    548                     child_relationship_class=child_relationship_class,
    549                     child_relationship_param=i)
--> 550                 self.__diff(next_level, parents_ids_added)
    551 
    552     def __diff_str(self, level):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\deepdiff\diff.py in __diff(self, level, parents_ids)
   1067 
   1068         elif isinstance(level.t1, numbers):
-> 1069             self.__diff_numbers(level)
   1070 
   1071         elif isinstance(level.t1, Mapping):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\deepdiff\diff.py in __diff_numbers(self, level)
    929     def __diff_numbers(self, level):
    930         """Diff Numbers"""
--> 931         t1_type = "number" if self.ignore_numeric_type_changes else level.t1.__class__.__name__
    932         t2_type = "number" if self.ignore_numeric_type_changes else level.t2.__class__.__name__
    933 

KeyboardInterrupt: 

@seperman
Owner

Hi @David-Herman
It is hard to debug just based on these logs. The reason the previous branch never got to 20 GB of memory usage might have been that the crash didn't let it continue.
You are not even using ignore_order=True, which will increase resource usage exponentially.
Can you please also pass log_frequency_in_sec=20 and limit the number of diffs? The diff won't be accurate, but it gives you something just for debugging. You can limit the diffs by passing something like max_diffs=10**5.
If you can then run all the above with some memory profiler and post the results here, that would be great: https://github.com/pythonprofilers/memory_profiler

@seperman
Owner

Also, this is probably happening because we cache almost everything now in order to increase performance. It sounds like your data makes the cache grow without enough cache hits to justify it. I'm going to put some limits on the cache size.
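The bounded-cache idea can be illustrated with a small LRU cache built on `collections.OrderedDict`. This is a generic sketch of the concept, not DeepDiff's actual cache implementation:

```python
from collections import OrderedDict

class BoundedCache:
    """A generic LRU cache: evict the least-recently-used entry once
    more than cache_size items are stored."""
    def __init__(self, cache_size=5000):
        self.cache_size = cache_size
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key in self._data:
            self._data.move_to_end(key)      # mark as recently used
            return self._data[key]
        return default

    def set(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.cache_size:
            self._data.popitem(last=False)   # drop the oldest entry

cache = BoundedCache(cache_size=2)
cache.set("a", 1)
cache.set("b", 2)
cache.set("c", 3)        # evicts "a"
print(cache.get("a"))    # → None
print(cache.get("c"))    # → 3
```

Evicting the least-recently-used entry keeps memory bounded while still rewarding repeated lookups; with few cache hits, a smaller cache size simply means less memory wasted on entries that are never reused.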

@seperman
Owner

Hi @David-Herman
Please pull the latest from the dev branch and rerun your data. The cache size is now limited. You can play with it, though, and see if you get any performance benefits. You can set it via cache_size=number; I set the default to 5000 items. How useful caching is really depends on your dataset. You can reduce the cache size to reduce memory usage.

@David-Herman
Author

David-Herman commented May 21, 2020

Here are some of my results. Thanks for the help.

Here is the install

! pip install git+git://github.com/seperman/deepdiff.git@dev
  Created wheel for deepdiff: filename=deepdiff-5.0.0-cp37-none-any.whl size=67298 sha256=6a8d51f99738d669879dbd8abf97f18f70dc970cb0f69ff37137ffdc5f334bc0

Without ignore_order the memory usage balloons:

%timeit ddiff = DeepDiff(mat1, mat2) -> 20 GB

With it, the memory usage is reasonable but the speed is slow:

ddiff = DeepDiff(mat1, mat2, ignore_order=True, log_frequency_in_sec=20, max_diffs=10**6)

Here is the output

DeepDiff 20 seconds in progress. Pass #47, Diff #192227
DeepDiff 40 seconds in progress. Pass #70, Diff #390876
DeepDiff 60 seconds in progress. Pass #70, Diff #589631
DeepDiff 80 seconds in progress. Pass #70, Diff #789661
DeepDiff 100 seconds in progress. Pass #70, Diff #988053
DeepDiff has reached the max number of diffs of 1000000. You can possibly get more accurate results by increasing the max_diffs parameter.
DeepDiff 120 seconds in progress. Pass #70, Diff #1000001
DeepDiff 140 seconds in progress. Pass #70, Diff #1000001
DeepDiff 160 seconds in progress. Pass #70, Diff #1000001
DeepDiff 180 seconds in progress. Pass #70, Diff #1000001
DeepDiff 200 seconds in progress. Pass #70, Diff #1000001
DeepDiff 220 seconds in progress. Pass #70, Diff #1000001
DeepDiff 240 seconds in progress. Pass #70, Diff #1000001
DeepDiff 260 seconds in progress. Pass #70, Diff #1000001

How do I set the cache_size?

ValueError: The following parameter(s) are not valid: cache_size

Edit: I uninstalled and then reinstalled from dev to get the latest commit. I set cache_size=1 and still saw increasing memory use.

@David-Herman
Author

OK, I have been troubleshooting further with sub-structures of my data structure. I am not encountering memory issues at present. What I have noticed is that when ignore_order=True is used, the call runs slightly slower for most of the data (e.g. 0.0119 vs 0.0139 seconds). Then on some data, ignore_order=True increases the time to 51 seconds, compared to 0.02 seconds for the plain DeepDiff call.

These sub-structures tend to be numpy arrays (500-2000 elements) with dtypes like uint16 ('<u2'). I have tried to generate sample data via np.arange(), which ends up with an 18.5 ms vs 53.1 ms difference with ignore_order=True.

I used np.savetxt() to save two arrays that seem to be causing the issue; the difference is pretty clear:

a1 = np.loadtxt('mat1.txt')
a2 = np.loadtxt('mat2.txt')
%timeit DeepDiff(a1, a2, ignore_order=True)
43.5 s ± 304 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

compared to:

a1 = np.loadtxt('mat1.txt')
a2 = np.loadtxt('mat2.txt')
%timeit DeepDiff(a1, a2)
20.9 ms ± 622 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

here is the input,
mat1.txt
mat2.txt
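A rough, hypothetical illustration of why ignore_order can be so much slower on arrays like these: an order-insensitive diff that only needs exact matches is a cheap multiset operation, but reporting which changed items correspond to each other requires a pairwise closest-match search over the unmatched items, which grows quadratically. Neither function below is DeepDiff's implementation; they only sketch the asymptotics:

```python
from collections import Counter

def unordered_diff(seq1, seq2):
    """Order-insensitive diff via hashing: roughly O(n), but it only
    reports which items were added/removed, not which pairs correspond."""
    c1, c2 = Counter(seq1), Counter(seq2)
    return {
        "removed": list((c1 - c2).elements()),
        "added": list((c2 - c1).elements()),
    }

def closest_pairs(removed, added):
    """Naive closest-match pairing: for each removed item, scan every
    remaining added item -- O(n*m). A pair search like this is what makes
    order-insensitive diffs expensive when two large numeric arrays share
    few exact values."""
    pairs = []
    remaining = list(added)
    for r in removed:
        if not remaining:
            break
        best = min(remaining, key=lambda a: abs(a - r))
        remaining.remove(best)
        pairs.append((r, best))
    return pairs

print(unordered_diff([1, 2, 3, 4], [5, 6, 7, 8, 9, 10]))
print(closest_pairs([1, 10], [2, 11]))
```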

@seperman
Owner

Hi @David-Herman
Thanks for posting the sample data. The cache_size is a new parameter. Please pull the latest dev branch to be able to set it.

@seperman
Owner

Very interesting how long it takes to run this. Let's open a new ticket and continue there, since this ticket was about the delta + numpy error.
