Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No copy init for StructuredGrid and RectilinearGrid #2698

Merged
merged 7 commits into from
May 31, 2022

Conversation

banesullivan
Copy link
Member

Some changes to include in #2697 - note the base branch

This improves the inits of the StructuredGrid and RectilinearGrid classes such that they can be initialized without copying the source array data.

There is a slight potential for the changes to RectlinearGrid to not be backward compatible as it would previously take the np.unique values of the input array (producing a copy). IMO, and from the documentation on that class, the arrays should be provided as unique values and so the internal call to np.unique is unneeded and causes undesired data copying. While in some corner cases, this is not backward compatible, I would argue that those corner cases should not be supported and the original implementation was the library version of synatic sugar.

@github-actions github-actions bot added the enhancement Changes that enhance the library label May 26, 2022
@codecov
Copy link

codecov bot commented May 26, 2022

Codecov Report

Merging #2698 (23b801b) into patch/coerce_pointslike_arg_copy (3d0761b) will decrease coverage by 0.00%.
The diff coverage is 93.75%.

@@                         Coverage Diff                          @@
##           patch/coerce_pointslike_arg_copy    #2698      +/-   ##
====================================================================
- Coverage                             93.70%   93.70%   -0.01%     
====================================================================
  Files                                    75       75              
  Lines                                 16178    16192      +14     
====================================================================
+ Hits                                  15160    15172      +12     
- Misses                                 1018     1020       +2     

@MatthewFlamm
Copy link
Contributor

There is a slight potential for the changes to RectlinearGrid to not be backward compatible as it would previously take the np.unique values of the input array (producing a copy). IMO, and from the documentation on that class, the arrays should be provided as unique values and so the internal call to np.unique is unneeded and causes undesired data copying. While in some corner cases, this is not backward compatible, I would argue that those corner cases should not be supported and the original implementation was the library version of synatic sugar.

I agree with this. It makes no sense to allow non-unique values. What happens if a user IS using data with non-unique values in this PR? Does it silently continue or noisily complain in vtk? If it is silently continuing, it would be nice to put in a check to warn or raise an error. Then users will quickly be able to pinpoint the change here and is also useful in general.

@akaszynski
Copy link
Member

akaszynski commented May 27, 2022

If it is silently continuing, it would be nice to put in a check to warn or raise an error.

A quick check revealed that non-unique values are silently handled:

import numpy as np
import pyvista

xrng = np.arange(-10, 10, 2, dtype=float)
yrng = np.arange(-10, 10, 5, dtype=float)
zrng = np.arange(-10, 10, 1, dtype=float)


# non-unique xrng
xrng_non_unique = np.array([-10., -10.,  -8.,  -6.,  -4.,  -2.,   0.,   2.,   4.,   6.,   8.])

mesh_non_unique = pyvista.RectilinearGrid(xrng_non_unique, yrng, zrng)
mesh = pyvista.RectilinearGrid(xrng, yrng, zrng)

assert mesh == mesh_non_unique

Only way around this is to check if an array is unique, and it appears fastest way to do so is:

s = np.sort(a, axis=None)
s[:-1][s[1:] == s[:-1]]

From https://stackoverflow.com/questions/11528078/

Implementing...

@akaszynski
Copy link
Member

I went for raising as grid.<x|y|z> would contain the duplicated values despite them not showing up in the actual grid.

Calling this good to go.

@@ -161,16 +162,23 @@ def _from_arrays(self, x: np.ndarray, y: np.ndarray, z: np.ndarray):
Coordinates of the points in z direction.

"""

def raise_not_unique(arr, name):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments:

  1. This is OK here, but it could be a generally useful method.
  2. Do we need the name argument here? Won't the stack trace be enough? Also the variable name is x in this private method, but the user supplied variable could be anything.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both resolved in 3b85909.

@adeak
Copy link
Member

adeak commented May 27, 2022

Only way around this is to check if an array is unique, and it appears fastest way to do so is:

s = np.sort(a, axis=None)
s[:-1][s[1:] == s[:-1]]

@akaszynski that's a bit surprising. On a quick timing with a 1000-length random array (no unique values, most likely use case) it is indeed slightly faster than np.unique(arr).size == arr.size. However, a few remarks:

  1. we should consider how large arrays we typically expect and make sure we optimise for that size,
  2. if we keep this approach, we only need the boolean mask: (s[1:] == s[:-1]).any(), and axis=None seems superfluous.

Edit: OK, I see you actually need the duplicate values for some reason. I thought we only wanted to check for lack of duplicates.

@akaszynski
Copy link
Member

I see you actually need the duplicate values for some reason. I thought we only wanted to check for lack of duplicates.

It potentially makes the traceback a bit more helpful by giving the duplicate values. Since we're getting those for free, I figure it's worth it.

@akaszynski
Copy link
Member

we should consider how large arrays we typically expect and make sure we optimise for that size,

using a typical array of 100-1000 values, I'm seeing a noticeable improvement using sort rather than unique. With 1000 values on my machine:

>>> timeit has_duplicates(arr)
5.6 µs ± 5.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

>>> timeit np.unique(arr).size != arr.size
6.22 µs ± 6.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

if we keep this approach, we only need the boolean mask: (s[1:] == s[:-1]).any(), and axis=None seems superfluous.

On second though, I think it makes more sense, especially when considering comment.

@banesullivan
Copy link
Member Author

banesullivan commented May 27, 2022

raise_has_duplicates appears to have a significant(?) memory consumption impact.

Take this test:

import numpy as np
import pyvista as pv
from memory_profiler import profile


@profile
def test_rectilinear():
    xrng = np.arange(-100000, 100000, .2, dtype=float)
    yrng = np.arange(-100000, 100000, .5, dtype=float)
    zrng = np.arange(-100000, 100000, .1, dtype=float)
    mesh = pv.RectilinearGrid(xrng, yrng, zrng)
    x = mesh.x
    x /= 2


if __name__ == '__main__':
    test_rectilinear()

Reuslts without using raise_has_duplicates:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     6    138.4 MiB    138.4 MiB           1   @profile
     7                                         def test_rectilinear():
     8    214.7 MiB     76.3 MiB           1       xrng = np.arange(-1000000, 1000000, .2, dtype=float)
     9    245.2 MiB     30.5 MiB           1       yrng = np.arange(-1000000, 1000000, .5, dtype=float)
    10    397.8 MiB    152.6 MiB           1       zrng = np.arange(-1000000, 1000000, .1, dtype=float)
    11    398.0 MiB      0.1 MiB           1       mesh = pv.RectilinearGrid(xrng, yrng, zrng)
    12    398.0 MiB      0.0 MiB           1       x = mesh.x
    13    398.0 MiB      0.0 MiB           1       x /= 2

Reuslts with using raise_has_duplicates:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     6    138.3 MiB    138.3 MiB           1   @profile
     7                                         def test_rectilinear():
     8    214.6 MiB     76.3 MiB           1       xrng = np.arange(-1000000, 1000000, .2, dtype=float)
     9    245.1 MiB     30.5 MiB           1       yrng = np.arange(-1000000, 1000000, .5, dtype=float)
    10    397.7 MiB    152.6 MiB           1       zrng = np.arange(-1000000, 1000000, .1, dtype=float)
    11    670.6 MiB    272.9 MiB           1       mesh = pv.RectilinearGrid(xrng, yrng, zrng)
    12    670.6 MiB      0.0 MiB           1       x = mesh.x
    13    670.6 MiB      0.0 MiB           1       x /= 2

Comparison

An addition 272.8 MiB are consumed by raise_has_duplicates for this example

@banesullivan
Copy link
Member Author

Should we have a keyword argument check_duplicates that will perform this check if desired? I'm of the opinion that this should be an opt-in feature

@banesullivan
Copy link
Member Author

For context, this work and the work in #2697 are for a xarray-pyvista DataArray accessor I am creating where we can easily deal with much larger data sizes and this sort of memory consumption is important to minimize

@banesullivan
Copy link
Member Author

I mean, this memory usage isn't that significant when realizing that test is actually pretty ridiculous... That mesh has 6,789,684,830,523,280,511 cells (yes, 6 quintillion cells) and is totally not plotting on my laptop anytime soon

@adeak
Copy link
Member

adeak commented May 27, 2022

An addition 272.8 MiB are consumed by raise_has_duplicates for this example

I'm pretty sure we have to choose between memory and performance here. np.sort itself will double memory because it gives us a new array, and even after that we have to loop over the sorted array ("very" slow) or do the vectorized comparison that is currently implemented (another temporary array of pretty much the same size, although 1-byte bool dtype).

At the worst case we need an N - 1-sized container to keep track of the first N - 1 unique values seen, unless we're willing to go for an O(N^2)-time O(1)-space approach of looping over the array twice and comparing every element. I doubt we want to do this.

@akaszynski
Copy link
Member

Should we have a keyword argument check_duplicates that will perform this check if desired? I'm of the opinion that this should be an opt-in feature

Let's do that. Didn't realize the scope of the arrays being used here. This is also VTK's default behavior.

@akaszynski
Copy link
Member

As an added bonus, 23b801b includes updates to our documentation as our class documentation for RectilinearGrid wasn't great, and documenting check_duplicates required the addition of a Parameters section.

Copy link
Member

@akaszynski akaszynski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is good to go.

Recommending merge by Wednesday 1 June 2022 given:
#2245 (comment)

@banesullivan
Copy link
Member Author

I'll merge this into #2697 and do a few final checks, then I think that one will be good to go as well

@banesullivan banesullivan merged commit ef5c097 into patch/coerce_pointslike_arg_copy May 31, 2022
@banesullivan banesullivan deleted the feat/no-copy-init branch May 31, 2022 15:59
banesullivan added a commit that referenced this pull request Jun 3, 2022
* _coerce_pointslike_arg copy=False default

* Simplify test

* Add create no copy tests

* Fix PointSet pytestmark

* No copy init for StructuredGrid and RectilinearGrid (#2698)

* Support no copy with StructuredGrid

* Support no copy with RectilinearGrid

* Improve convert_array to handle lists

* raise on duplication

* refactor

* add check_duplicates option

Co-authored-by: Alex Kaszynski <akascap@gmail.com>

* Set dimensions of StructuredGrid in test

* Fix up RectilinearGrid docstrings a bit

* Remove comments

* Use kwargs on RectilinearGrid init

* Use may_share_memory

* Use array_equal

* Fix coerce test with array_equal and may_share_memory

* Update pyvista/utilities/misc.py

Co-authored-by: Andras Deak <adeak@users.noreply.github.com>

* Add duplicate y and z array cases to RectilinearGrid test

* move non-unique test to test_grid

Co-authored-by: Alex Kaszynski <akascap@gmail.com>
Co-authored-by: Andras Deak <deak.andris@gmail.com>
Co-authored-by: Andras Deak <adeak@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Changes that enhance the library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants