Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lingering memory connections when extracting underlying np.arrays from datasets #8728

Closed
ks905383 opened this issue Feb 9, 2024 · 7 comments

Comments

@ks905383
Copy link
Contributor

ks905383 commented Feb 9, 2024

What is your issue?

I know that generally, ds2 = ds connects the two objects in memory, and changes in one will also cause changes in the other.

However, I generally assume that certain operations should break this connection, for example:

  • extracting the underlying np.array from a dataset (changing its type and destroying a lot of the xarray-specific information: index, dimensions, etc.)
  • using the underlying np.array into a new dataset

In other words, I would expect that using ds['var'].values would be similar to copy.deepcopy(ds['var'].values).

Here's an example that illustrates how in these cases, the objects are still linked in memory:

(apologies for the somewhat hokey example)

import xarray as xr
import numpy as np

# Create a dataset
ds = xr.Dataset(coords = {'lon':(['lon'],np.array([178.2,179.2,-179.8, -178.8,-177.8,-176.8]))})
print('\nds: ')
print(ds)

# Create a new dataset that uses the values of the first dataset
ds2 = xr.Dataset({'lon1':(['lon'],ds.lon.values)},
                  coords = {'lon':(['lon'],ds.lon.values)})
print('\nds2: ')
print(ds2)

# Change ds2's 'lon1' variable 
ds2['lon1'][ds2['lon1']<0] = 360 + ds2['lon1'][ds2['lon1']<0]

# `ds2` is changed as expected
print('\nds2 (should be modified): ')
print(ds2)

# `ds` is changed, which is *not* expected
print('\nds (should not be modified): ')
print(ds)

The question is - am I right (from a UX perspective) to expect these kinds of operations to disconnect the objects in memory? If so, I might try to update the docs to be a bit clearer on this. (or, alternatively, if these kinds of operations should disconnect the objects in memory, maybe it's better to have .values also call .copy(deep=True).values)

Appreciate y'all's thoughts on this!

@ks905383 ks905383 added the needs triage Issue that has not been reviewed by xarray team member label Feb 9, 2024
@dcherian
Copy link
Contributor

dcherian commented Feb 9, 2024

In general, you're expected to deep-copy explicitly to break these "links". This is the numpy paradigm

@max-sixty
Copy link
Collaborator

If you want to read up on this, look for "view vs copy"!

@ks905383
Copy link
Contributor Author

ks905383 commented Feb 9, 2024

Yeah, I guess in this case from a legibility standpoint, the fact that .values 'changes' (from the user point of view) the form (and type) of the data from a DataArray to the underlying numpy array just feels different?

Like I wouldn't expect the following two operations:

a = np.ones(3)
b = a.astype(str)
a[0] = 5
print(b)

and

a = np.ones(3)
b = a
a[0] = 5
print(b)

to behave the same. But I do understand that from the backend perspective, .values seems to be more of the latter than the former, since it is just accessing something that's already there...

(relatedly, would it be worth it to link to the relevant numpy docs in this part of the xarray docs?)

@ks905383
Copy link
Contributor Author

ks905383 commented Feb 9, 2024

A related issue is that this allows you to (possibly inadvertently) circumvent certain xarray safeguards, like the TypeError around not being able to modify IndexVariables:

# Create sample dataset
ds = xr.Dataset({'test':(['lon'],[5,6,7])},coords = {'lon':(('lon'),[0,1,2])})

# Raises TypeError, to avoid changing indices like this
ds['lon'][0] = 2

# Now, extract underly numpy array
a = ds.lon.values

# Change value
a[0] = 2

# This changes `ds` without raising error
print(ds)

@max-sixty
Copy link
Collaborator

max-sixty commented Feb 9, 2024

(relatedly, would it be worth it to link to the relevant numpy docs in this part of the xarray docs?)

Yes! That would be a welcome contribution.


A related issue is that this allows you to (possibly inadvertently) circumvent certain xarray safeguards, like the TypeError around not being able to modify IndexVariables:

Yes. But I'm not sure there's much we can do about this. Our focus should be "if you use xarray operations, you won't get surprises"...

@ks905383
Copy link
Contributor Author

ks905383 commented Feb 9, 2024

Yes! That would be a welcome contribution.

Sounds good, I'll prep a PR

ks905383 added a commit to ks905383/xarray that referenced this issue Feb 13, 2024
- Add reference to numpy docs on view / copies in the corresponding section of the xarray docs, to help clarify pydata#8728 .
- Add note that `da.values()` returns a view in the header for `da.values()`.
@max-sixty max-sixty added topic-documentation and removed needs triage Issue that has not been reviewed by xarray team member labels Feb 26, 2024
dcherian pushed a commit that referenced this issue Mar 25, 2024
* Clarify #8728 in docs

- Add reference to numpy docs on view / copies in the corresponding section of the xarray docs, to help clarify #8728 .
- Add note that `da.values()` returns a view in the header for `da.values()`.

* tweaks to the  header

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* flip order of new .to_values() doc header paragraphs

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@kmuehlbauer
Copy link
Contributor

Resolved by #8744.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants