
Using deprecated/dangerous method to set 'data' in numpy array #17

Open
davidsmf opened this issue Apr 27, 2017 · 14 comments

Comments

@davidsmf

From the numpy 1.12.0 release notes:

Assignment of ndarray object's data attribute

Assigning the 'data' attribute is an inherently unsafe operation as pointed out in gh-7083. Such a capability will be removed in the future.

This appears to happen where we assign directly to the data of param_array and gradient_full. I've noticed warnings from

paramz\parameterized.py:266
paramz\parameterized.py:267
paramz\core\parameter_core.py:284
paramz\core\parameter_core.py:285
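
For reference, a minimal reproduction of the pattern that triggers the warning (illustrative only; the lines above do the equivalent with param_array and gradient_full):

import numpy as np

a = np.zeros(3)
b = np.ones(3)

# Rebinding an existing array's buffer is the assignment the 1.12.0
# release notes deprecate; it emits a DeprecationWarning on numpy >= 1.12.
a.data = b.data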
@mzwiessele
Member

Yes, that is right. We are making use of the data attribute for in-memory views of parameters. It seems that this will not be possible in the future. This appears to be the source of memory issues reported on GPy (SheffieldML/GPy#341, SheffieldML/GPy#51), though this is not confirmed.

Not using this in-memory view would significantly slow down optimization, as getting and setting parameters would then have to go through Python loops.

Can you think of a fix for this issue, or do you know of another way to create in-memory views of arrays?

See stackoverflow discussions about this:
https://stackoverflow.com/questions/23650796/numpy-set-array-memory
https://stackoverflow.com/questions/39446199/in-numpy-how-do-i-set-array-bs-data-to-reference-the-data-of-array-a
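
For context, a rough sketch of what the in-memory view buys us (the names here are illustrative, not paramz's actual structures):

import numpy as np

# One flat vector owned by the model; each parameter is a view into a slice of it.
flat = np.zeros(4)
weight = flat[0:2]   # view, shares flat's memory
bias = flat[2:4]     # view, shares flat's memory

# The optimizer reads and writes the flat vector in one vectorized operation,
# and every parameter view sees the change immediately, with no Python loop.
flat[:] = [1.0, 2.0, 3.0, 4.0]
assert weight[1] == 2.0 and bias[0] == 3.0
assert np.shares_memory(flat, weight)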

@darthdeus

Is there any reasonable workaround at the moment? Or should we just silence the warning in the meantime?

@mzwiessele
Member

mzwiessele commented Dec 31, 2019 via email

@darthdeus

@mzwiessele Is there an upstream issue in numpy that we're waiting on to be fixed, so it can be tracked here?

@mzwiessele
Member

mzwiessele commented Jan 13, 2020 via email

@g-bassetto

Hi there,

I think I have worked out a possible solution for this problem, but I am not sure how easy it is to integrate into the existing framework.
The idea is very simple: enrich an np.ndarray subclass with an optional delegate field; if the field is set, all calls are forwarded to it, and if it is not set, the method falls back to the array itself.
I paste a short example below to explain what I mean:

import numpy as np
import logging
logging.basicConfig(level=logging.DEBUG)

BUFFER = '__my_buffer__'

class MyArray(np.ndarray):
    
    @property
    def data(self):
        try:
            return getattr(self, BUFFER)
        except AttributeError:
            return super().data
        
    @data.setter
    def data(self, value):
        value = np.asarray(value)
        logging.debug(f'self: {self}; other: {value}')
        assert value.shape == self.shape
        value[...] = self[...]
        logging.debug(f'self: {self}; other: {value}')
        setattr(self, BUFFER, value)
        logging.debug(f'self: {self}; other: {value}')
        
    def __getitem__(self, item):
        if getattr(self, BUFFER, self) is self:
            return super().__getitem__(item)
        else:
            return getattr(self, BUFFER).__getitem__(item)
            
    def __setitem__(self, item, value):
        if getattr(self, BUFFER, self) is self:
            return super().__setitem__(item, value)
        else:
            getattr(self, BUFFER).__setitem__(item, value)
                  
    def __iadd__(self, other):
        if getattr(self, BUFFER, self) is self:
            return super().__iadd__(other)
        else:
            getattr(self, BUFFER).__iadd__(other)
            return self
        
    # ... and so on

buf = np.ones(5)
arr = np.zeros(3).view(MyArray)
assert np.all(arr == 0)
# arr.data = buf[1:2]  # raises assertion error because the sizes are different
arr.data = buf[2:]  # this is ok because the sizes match
assert np.all(arr[:] == 0) 
assert np.all(arr == 0)
assert np.all(buf[:2] == 1) and np.all(buf[2:] == 0)

arr += 2
assert np.all(arr[:] == 2)  # this is ok
# assert np.all(arr == 2)   # this is not ok
assert np.all(buf == [1, 1, 2, 2, 2])

Although this seems reasonable to me, I am not sure it will not break some internal mechanics of the framework, of which I honestly do not yet have an in-depth understanding (I get the overall mechanics of the system, but I am still trying to wrap my head around the logic behind the implementation of IndexOperations).

@g-bassetto

An alternative idea would be to decouple the description of a parameter from its actual data storage.
The external interface of a Param object would still need to match that of ndarray, but internally it would delegate all operations to a proper ndarray. When the parameter is connected to some parent object, all we would need to do is replace its internal storage with the appropriate view from the parent's storage (taking care to copy the old values to the new storage before discarding the old one).
This solution would keep the advantage of the old system, which avoids looping through parameters because all operations are still done in memory, but it would get rid of the deprecation warning. I am willing to give it a try.
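
A very rough sketch of the decoupling idea (all names here are hypothetical, not the existing paramz API):

import numpy as np

class ParamProxy:
    """Hypothetical: a parameter that delegates storage to an internal ndarray,
    which can be swapped for a view into a parent's contiguous array without
    ever touching ndarray.data."""

    def __init__(self, values):
        self._storage = np.array(values, dtype=float)

    def __getitem__(self, item):
        return self._storage[item]

    def __setitem__(self, item, value):
        self._storage[item] = value

    def _bind_to(self, view):
        # Copy the current values into the parent's view, then adopt the
        # view as the new backing storage.
        view[...] = self._storage
        self._storage = view

# Connecting a standalone parameter to a parent's flat array:
parent = np.zeros(5)
p = ParamProxy([1.0, 2.0])
p._bind_to(parent[1:3])
p[0] = 7.0
assert parent[1] == 7.0  # writes now land in the parent's contiguous memory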

@ekalosak

ekalosak commented Sep 2, 2020

This is my impression, but it could use some correction:

The issue appears to boil down to having a pointer "*A" (the ndarray.data attribute) to the first element of the array that is changed (to "*B") while other objects may hold views of or pointers to the original *A. [1] The memory at the location indicated by the first pointer, no longer referenced by the original ndarray, may be garbage collected. Objects holding copies of the *A pointer sidestep Python's built-in memory management, which can cause a segfault when *A is dereferenced. This is why we get the "inherently unsafe" warning: we may segfault if another object holding the original pointer *A tries to access memory at that location after we have reassigned the array's contents to memory starting at *B.

It took some reading to understand @g-bassetto 's suggestion, and it makes some sense to me: essentially, we should provide a wrapper around the ndarray.data buffer that keeps it stable while allowing us to get and set the array's contents. Does anyone see any issues with this approach? Not only would it get rid of the warning, it appears to address the root cause of the warning by more carefully handling the .data attribute.

The only change I might propose is the addition of some initialization of the buffer that would do away with the if getattr(self, BUFFER, self) is self conditions replicated across the magic methods.
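
One way that could look (a sketch only, not tested against paramz's internals): initialize the delegate in __array_finalize__ to a plain-ndarray view of the array's own memory, so the magic methods can always forward to it without the getattr fallback:

import numpy as np

class MyArray(np.ndarray):

    def __array_finalize__(self, obj):
        # Start with a plain-ndarray view of our own memory, so delegation
        # never recurses and no special-case check is needed.
        self._buffer = self.view(np.ndarray)

    def set_buffer(self, value):
        value = np.asarray(value)
        assert value.shape == self.shape
        value[...] = self._buffer  # keep the current contents
        self._buffer = value       # adopt the new backing storage

    def __getitem__(self, item):
        return self._buffer[item]

    def __setitem__(self, item, value):
        self._buffer[item] = value

    def __iadd__(self, other):
        self._buffer += other
        return self

buf = np.ones(5)
arr = np.zeros(3).view(MyArray)
arr.set_buffer(buf[2:])
arr += 2
assert np.all(buf == [1, 1, 2, 2, 2])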

Sources:

  1. https://numpy.org/doc/stable/reference/arrays.interface.html specifically "A reference to the object exposing the array interface must be stored by the new object if the memory area is to be secured."

Edit:
@mzwiessele I'm not terribly sure I understand why the ndarray.data attribute is being set at all - why not simply update the arrays themselves? Is the desire to keep a contiguous memory map of the parameters' arrays?

@ekalosak

ekalosak commented Sep 3, 2020

Also, the link to the numpy mailing list thread above is broken for me, here's a currently working link for anyone else having the same issue: http://numpy-discussion.10968.n7.nabble.com/deprecating-assignment-to-ndarray-data-td42176.html

@ekalosak

ekalosak commented Sep 3, 2020

Why not do an in-place modification of the memory underlying the array, e.g.

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
# forgive the pseudocode: copy element by element through the underlying buffers
for i, bval in enumerate(b.data):
    a.data[i] = bval

Or, even simpler, why not just do a raw copy? Would something like the following work? (from https://github.com/sods/paramz/blob/master/paramz/core/parameter_core.py#L290)

            # pi.param_array.data = parray[pislice].data 
            pi.param_array[:] = parray[pislice][:]

I'm guessing this approach incurs the performance issue discussed towards the top of the thread, though, by using python loops.

It seems like we might have two additional approaches:

  1. Implement the assign_slice operation for numpy (currently a NULL pointer, not implemented [1])
  2. Avoid the .data attribute all together

The second approach is simply

            # pi.param_array.data = parray[pislice].data 
            pi.param_array = parray[pislice]

By assigning the .data attribute, we implicitly assume that no other object holds a stale pointer (post-.data-reassignment) to the memory underlying the original array. I wonder whether this assumption couldn't be leveraged to simply rebind pi.param_array to parray[pislice] directly. In this case, there's no Python loop or [:] copy of values one by one - the same behaviour as reassigning the pi.param_array.data buffer to parray[pislice] is achieved.

EDIT: If I'm not too misled, the second solution I've sketched out matches up with @g-bassetto 's second solution.

Sources:

  1. https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/sequence.c#L75

@g-bassetto

g-bassetto commented Sep 4, 2020

@ekalosak your second approach matches my second one, as you guessed.
As far as I understand, the reason Param objects are a subtype of ndarray is to leverage the numpy implementation that already exists. At first glance, the Param class is a kind of structured array. For this reason, a specific Param instance is meant to be a view on a subset of the parameters of a larger parameter hierarchy associated with some model. The reason they use the ndarray.data attribute is that there is no other way to change the address a view points to (please correct me if I am wrong).
From an implementation point of view, having all the parameters of the root model of the hierarchy live within a contiguous piece of memory is very convenient: most existing optimization tools that operate on scalar cost functions of vector variables will work out of the box, as these tools often assume that the parameters lie contiguously in memory.
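
To make the contiguity point concrete (a toy example, with illustrative names):

import numpy as np
from scipy.optimize import minimize

# A flat, contiguous parameter vector, with per-parameter views cut out of it.
theta = np.array([1.0, 1.0, 1.0])
lengthscale = theta[0:2]  # view into the flat vector
variance = theta[2:3]     # view into the flat vector

def objective(x):
    theta[:] = x  # write the optimizer's iterate into the flat vector
    # the views now reflect the current iterate, with no copying or Python loops
    return np.sum(lengthscale ** 2) + (variance[0] - 3.0) ** 2

result = minimize(objective, theta.copy())  # a generic vector optimizer works out of the box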

The reason I think decoupling the Param class from ndarray is a good idea is that it would make it possible to change the back-end at runtime. The first idea that comes to mind is swapping between numpy and pytorch, for example, in case one's model requires heavy work to be carried out on a GPU. I imagine the Param class acting as a proxy to the actual parameter array, rather than being the parameter array itself, and both my and @ekalosak 's option number 2 go in this direction.

Do you think that what I said is sensible?

@ekalosak

ekalosak commented Sep 5, 2020

I think what you've said is sensible, but it might be out of scope for this narrow issue - though if you see it as a stepping stone to backend agnosticism, that just adds motivation to solve the current, narrow issue.

If I've understood correctly, there's an issue with simply reassigning pi.param_array = parray[pislice]: when we change pi.param_array via a direct Python re-assignment, the memory underlying the new data will be in a different, non-contiguous location:

[attached image: untitled (3)]

If the Param were decoupled from the np.ndarray, I can see we might be able to maintain memory contiguity, but it's not clear how we'd do it aside from directly modifying the array's values in place as is done now. We could gain more confidence that there aren't any stale views of the data that could cause segfaults, but at the same time, decoupling seems like a potentially unnecessary additional level of abstraction.
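
In snippet form, the distinction I'm worried about (hypothetical names):

import numpy as np

parray = np.arange(5.0)        # the model's flat, contiguous vector
param_array = parray[1:3]      # a view: shares parray's memory
assert np.shares_memory(parray, param_array)

# In-place writes keep the link (and keep parray contiguous):
param_array[:] = [7.0, 8.0]
assert parray[1] == 7.0

# Plain rebinding does not: the name now points at separate memory,
# and parray no longer sees any updates.
param_array = np.array([0.0, 0.0])
assert not np.shares_memory(parray, param_array)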

@ekalosak

ekalosak commented Sep 9, 2020

At the same time, @g-bassetto , I wonder if it wouldn't be easier to just implement sq_ass_slice in numpy. Then we could just use param_array[slice] = pi.param_array without the subroutine falling back on python loops. It's wild to me that slice overwriting is not implemented in the C backend...
(https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/sequence.c#L75)

EDIT
Having run some rough benchmarking...

import numpy as np
import time

N = 10**6
a = np.empty(N)
b = np.empty(N)
c = np.ones(N)

t0a = time.time()
for i in range(N):
    a[i] = c[i]
t1a = time.time()
tda = t1a-t0a

t0b = time.time()
b[:] = c
t1b = time.time()
tdb = t1b-t0b

print(f"a: {tda}\nb: {tdb}")

Results:

a: 0.23286128044128418
b: 0.0038118362426757812

I don't quite understand why we wouldn't use the built-in slice overwriting. Seems like the a.data = b.data is unnecessary altogether? There's no slowdown.

@hawkinsp

hawkinsp commented Aug 2, 2024

In NumPy 2.0, .data assignment is a hard error, which breaks this package.
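
For anyone hitting this, the failing pattern is the same assignment quoted at the top of the thread; under NumPy 2.0 it raises instead of warning (the exact exception type and message depend on the NumPy version):

import numpy as np  # >= 2.0

a = np.zeros(3)
b = np.ones(3)
try:
    a.data = b.data  # deprecated since 1.12, rejected by NumPy 2.0
except Exception as err:
    print("NumPy rejects .data assignment:", err)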
