No copy init for StructuredGrid and RectilinearGrid #2698

banesullivan · 2022-05-26T05:34:21Z

Some changes to include in #2697 - note the base branch

This improves the inits of the StructuredGrid and RectilinearGrid classes such that they can be initialized without copying the source array data.

There is a slight potential for the changes to RectlinearGrid to not be backward compatible as it would previously take the np.unique values of the input array (producing a copy). IMO, and from the documentation on that class, the arrays should be provided as unique values and so the internal call to np.unique is unneeded and causes undesired data copying. While in some corner cases, this is not backward compatible, I would argue that those corner cases should not be supported and the original implementation was the library version of synatic sugar.

codecov · 2022-05-26T05:43:44Z

Codecov Report

Merging #2698 (23b801b) into patch/coerce_pointslike_arg_copy (3d0761b) will decrease coverage by 0.00%.
The diff coverage is 93.75%.

@@                         Coverage Diff                          @@
##           patch/coerce_pointslike_arg_copy    #2698      +/-   ##
====================================================================
- Coverage                             93.70%   93.70%   -0.01%     
====================================================================
  Files                                    75       75              
  Lines                                 16178    16192      +14     
====================================================================
+ Hits                                  15160    15172      +12     
- Misses                                 1018     1020       +2

MatthewFlamm · 2022-05-27T16:22:48Z

There is a slight potential for the changes to RectlinearGrid to not be backward compatible as it would previously take the np.unique values of the input array (producing a copy). IMO, and from the documentation on that class, the arrays should be provided as unique values and so the internal call to np.unique is unneeded and causes undesired data copying. While in some corner cases, this is not backward compatible, I would argue that those corner cases should not be supported and the original implementation was the library version of synatic sugar.

I agree with this. It makes no sense to allow non-unique values. What happens if a user IS using data with non-unique values in this PR? Does it silently continue or noisily complain in vtk? If it is silently continuing, it would be nice to put in a check to warn or raise an error. Then users will quickly be able to pinpoint the change here and is also useful in general.

akaszynski · 2022-05-27T16:59:17Z

If it is silently continuing, it would be nice to put in a check to warn or raise an error.

A quick check revealed that non-unique values are silently handled:

import numpy as np
import pyvista

xrng = np.arange(-10, 10, 2, dtype=float)
yrng = np.arange(-10, 10, 5, dtype=float)
zrng = np.arange(-10, 10, 1, dtype=float)


# non-unique xrng
xrng_non_unique = np.array([-10., -10.,  -8.,  -6.,  -4.,  -2.,   0.,   2.,   4.,   6.,   8.])

mesh_non_unique = pyvista.RectilinearGrid(xrng_non_unique, yrng, zrng)
mesh = pyvista.RectilinearGrid(xrng, yrng, zrng)

assert mesh == mesh_non_unique

Only way around this is to check if an array is unique, and it appears fastest way to do so is:

s = np.sort(a, axis=None)
s[:-1][s[1:] == s[:-1]]

From https://stackoverflow.com/questions/11528078/

Implementing...

akaszynski · 2022-05-27T18:14:03Z

I went for raising as grid.<x|y|z> would contain the duplicated values despite them not showing up in the actual grid.

Calling this good to go.

MatthewFlamm · 2022-05-27T18:45:16Z

pyvista/core/grid.py

@@ -161,16 +162,23 @@ def _from_arrays(self, x: np.ndarray, y: np.ndarray, z: np.ndarray):
            Coordinates of the points in z direction.

        """
+
+        def raise_not_unique(arr, name):


Some minor comments:

This is OK here, but it could be a generally useful method.

Do we need the name argument here? Won't the stack trace be enough? Also the variable name is x in this private method, but the user supplied variable could be anything.

Both resolved in 3b85909.

adeak · 2022-05-27T19:07:14Z

Only way around this is to check if an array is unique, and it appears fastest way to do so is:
s = np.sort(a, axis=None)
s[:-1][s[1:] == s[:-1]]

@akaszynski that's a bit surprising. On a quick timing with a 1000-length random array (no unique values, most likely use case) it is indeed slightly faster than np.unique(arr).size == arr.size. However, a few remarks:

we should consider how large arrays we typically expect and make sure we optimise for that size,
if we keep this approach, we only need the boolean mask: (s[1:] == s[:-1]).any(), and axis=None seems superfluous.

Edit: OK, I see you actually need the duplicate values for some reason. I thought we only wanted to check for lack of duplicates.

akaszynski · 2022-05-27T19:20:13Z

I see you actually need the duplicate values for some reason. I thought we only wanted to check for lack of duplicates.

It potentially makes the traceback a bit more helpful by giving the duplicate values. Since we're getting those for free, I figure it's worth it.

akaszynski · 2022-05-27T19:32:43Z

we should consider how large arrays we typically expect and make sure we optimise for that size,

using a typical array of 100-1000 values, I'm seeing a noticeable improvement using sort rather than unique. With 1000 values on my machine:

>>> timeit has_duplicates(arr)
5.6 µs ± 5.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

>>> timeit np.unique(arr).size != arr.size
6.22 µs ± 6.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

if we keep this approach, we only need the boolean mask: (s[1:] == s[:-1]).any(), and axis=None seems superfluous.

On second though, I think it makes more sense, especially when considering comment.

banesullivan · 2022-05-27T19:48:50Z

raise_has_duplicates appears to have a significant(?) memory consumption impact.

Take this test:

import numpy as np
import pyvista as pv
from memory_profiler import profile


@profile
def test_rectilinear():
    xrng = np.arange(-100000, 100000, .2, dtype=float)
    yrng = np.arange(-100000, 100000, .5, dtype=float)
    zrng = np.arange(-100000, 100000, .1, dtype=float)
    mesh = pv.RectilinearGrid(xrng, yrng, zrng)
    x = mesh.x
    x /= 2


if __name__ == '__main__':
    test_rectilinear()

Reuslts without using `raise_has_duplicates`:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     6    138.4 MiB    138.4 MiB           1   @profile
     7                                         def test_rectilinear():
     8    214.7 MiB     76.3 MiB           1       xrng = np.arange(-1000000, 1000000, .2, dtype=float)
     9    245.2 MiB     30.5 MiB           1       yrng = np.arange(-1000000, 1000000, .5, dtype=float)
    10    397.8 MiB    152.6 MiB           1       zrng = np.arange(-1000000, 1000000, .1, dtype=float)
    11    398.0 MiB      0.1 MiB           1       mesh = pv.RectilinearGrid(xrng, yrng, zrng)
    12    398.0 MiB      0.0 MiB           1       x = mesh.x
    13    398.0 MiB      0.0 MiB           1       x /= 2

Reuslts with using `raise_has_duplicates`:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     6    138.3 MiB    138.3 MiB           1   @profile
     7                                         def test_rectilinear():
     8    214.6 MiB     76.3 MiB           1       xrng = np.arange(-1000000, 1000000, .2, dtype=float)
     9    245.1 MiB     30.5 MiB           1       yrng = np.arange(-1000000, 1000000, .5, dtype=float)
    10    397.7 MiB    152.6 MiB           1       zrng = np.arange(-1000000, 1000000, .1, dtype=float)
    11    670.6 MiB    272.9 MiB           1       mesh = pv.RectilinearGrid(xrng, yrng, zrng)
    12    670.6 MiB      0.0 MiB           1       x = mesh.x
    13    670.6 MiB      0.0 MiB           1       x /= 2

Comparison

An addition 272.8 MiB are consumed by raise_has_duplicates for this example

banesullivan · 2022-05-27T19:51:12Z

Should we have a keyword argument check_duplicates that will perform this check if desired? I'm of the opinion that this should be an opt-in feature

banesullivan · 2022-05-27T19:53:30Z

For context, this work and the work in #2697 are for a xarray-pyvista DataArray accessor I am creating where we can easily deal with much larger data sizes and this sort of memory consumption is important to minimize

banesullivan · 2022-05-27T20:01:02Z

I mean, this memory usage isn't that significant when realizing that test is actually pretty ridiculous... That mesh has 6,789,684,830,523,280,511 cells (yes, 6 quintillion cells) and is totally not plotting on my laptop anytime soon

adeak · 2022-05-27T20:02:28Z

An addition 272.8 MiB are consumed by raise_has_duplicates for this example

I'm pretty sure we have to choose between memory and performance here. np.sort itself will double memory because it gives us a new array, and even after that we have to loop over the sorted array ("very" slow) or do the vectorized comparison that is currently implemented (another temporary array of pretty much the same size, although 1-byte bool dtype).

At the worst case we need an N - 1-sized container to keep track of the first N - 1 unique values seen, unless we're willing to go for an O(N^2)-time O(1)-space approach of looping over the array twice and comparing every element. I doubt we want to do this.

akaszynski · 2022-05-27T22:08:59Z

Should we have a keyword argument check_duplicates that will perform this check if desired? I'm of the opinion that this should be an opt-in feature

Let's do that. Didn't realize the scope of the arrays being used here. This is also VTK's default behavior.

akaszynski · 2022-05-27T22:22:41Z

As an added bonus, 23b801b includes updates to our documentation as our class documentation for RectilinearGrid wasn't great, and documenting check_duplicates required the addition of a Parameters section.

akaszynski

I think this is good to go.

Recommending merge by Wednesday 1 June 2022 given:
#2245 (comment)

banesullivan · 2022-05-30T15:59:11Z

I'll merge this into #2697 and do a few final checks, then I think that one will be good to go as well

* _coerce_pointslike_arg copy=False default * Simplify test * Add create no copy tests * Fix PointSet pytestmark * No copy init for StructuredGrid and RectilinearGrid (#2698) * Support no copy with StructuredGrid * Support no copy with RectilinearGrid * Improve convert_array to handle lists * raise on duplication * refactor * add check_duplicates option Co-authored-by: Alex Kaszynski <akascap@gmail.com> * Set dimensions of StructuredGrid in test * Fix up RectilinearGrid docstrings a bit * Remove comments * Use kwargs on RectilinearGrid init * Use may_share_memory * Use array_equal * Fix coerce test with array_equal and may_share_memory * Update pyvista/utilities/misc.py Co-authored-by: Andras Deak <adeak@users.noreply.github.com> * Add duplicate y and z array cases to RectilinearGrid test * move non-unique test to test_grid Co-authored-by: Alex Kaszynski <akascap@gmail.com> Co-authored-by: Andras Deak <deak.andris@gmail.com> Co-authored-by: Andras Deak <adeak@users.noreply.github.com>

banesullivan added 2 commits May 25, 2022 23:13

Support no copy with StructuredGrid

8fe8576

Support no copy with RectilinearGrid

71009e3

github-actions bot added the enhancement Changes that enhance the library label May 26, 2022

Merge branch 'patch/coerce_pointslike_arg_copy' into feat/no-copy-init

da4064c

banesullivan mentioned this pull request May 26, 2022

Fix shared array access on init #2697

Merged

Improve convert_array to handle lists

4345579

raise on duplication

857adc4

MatthewFlamm reviewed May 27, 2022

View reviewed changes

refactor

3b85909

add check_duplicates option

23b801b

akaszynski approved these changes May 30, 2022

View reviewed changes

banesullivan merged commit ef5c097 into patch/coerce_pointslike_arg_copy May 31, 2022

banesullivan deleted the feat/no-copy-init branch May 31, 2022 15:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

No copy init for StructuredGrid and RectilinearGrid #2698

No copy init for StructuredGrid and RectilinearGrid #2698

banesullivan commented May 26, 2022

codecov bot commented May 26, 2022 •

edited

Loading

MatthewFlamm commented May 27, 2022

akaszynski commented May 27, 2022 •

edited

Loading

akaszynski commented May 27, 2022

MatthewFlamm May 27, 2022

akaszynski May 27, 2022

adeak commented May 27, 2022 •

edited

Loading

akaszynski commented May 27, 2022

akaszynski commented May 27, 2022

banesullivan commented May 27, 2022 •

edited

Loading

banesullivan commented May 27, 2022

banesullivan commented May 27, 2022

banesullivan commented May 27, 2022

adeak commented May 27, 2022 •

edited

Loading

akaszynski commented May 27, 2022

akaszynski commented May 27, 2022

akaszynski left a comment

banesullivan commented May 30, 2022

No copy init for StructuredGrid and RectilinearGrid #2698

No copy init for StructuredGrid and RectilinearGrid #2698

Conversation

banesullivan commented May 26, 2022

codecov bot commented May 26, 2022 • edited Loading

Codecov Report

MatthewFlamm commented May 27, 2022

akaszynski commented May 27, 2022 • edited Loading

akaszynski commented May 27, 2022

MatthewFlamm May 27, 2022

Choose a reason for hiding this comment

akaszynski May 27, 2022

Choose a reason for hiding this comment

adeak commented May 27, 2022 • edited Loading

akaszynski commented May 27, 2022

akaszynski commented May 27, 2022

banesullivan commented May 27, 2022 • edited Loading

Reuslts without using raise_has_duplicates:

Reuslts with using raise_has_duplicates:

Comparison

banesullivan commented May 27, 2022

banesullivan commented May 27, 2022

banesullivan commented May 27, 2022

adeak commented May 27, 2022 • edited Loading

akaszynski commented May 27, 2022

akaszynski commented May 27, 2022

akaszynski left a comment

Choose a reason for hiding this comment

banesullivan commented May 30, 2022

codecov bot commented May 26, 2022 •

edited

Loading

akaszynski commented May 27, 2022 •

edited

Loading

adeak commented May 27, 2022 •

edited

Loading

banesullivan commented May 27, 2022 •

edited

Loading

Reuslts without using `raise_has_duplicates`:

Reuslts with using `raise_has_duplicates`:

adeak commented May 27, 2022 •

edited

Loading