
Fix DataID overwriting attributes during Scene assignment and DataID not reflecting DataArray attributes #2332

Open · wants to merge 8 commits into base: main

Conversation

@BENR0 (Collaborator) commented Dec 20, 2022

This PR fixes two problems: attributes of a DataArray being overwritten based on the DataID when it is assigned to an existing id, and the DataID not reflecting the DataArray's attributes (see the corresponding issues below).

The behaviour makes some assumptions which I think make sense but should be debated:

  1. If name is not in the attributes, it is added based on the chosen name in the key to which the DataArray is going to be assigned.
  2. This deprecates the possibility to assign plain numpy arrays to the Scene and instead converts them to an xarray DataArray (I think this makes sense because many features of the Scene rely on attributes, and as far as I can see the documentation does not even mention that this is possible).
  3. A new DataID is generated from the DataArray attributes which also updates the _satpy_id attribute so they are in sync.
  4. Currently not used, but the name property of the DataArray is also set to the name in the DataID. This needs to be discussed, since it would be good if the name property also got set by the readers, and I think names need to be unique, which is not handled currently (e.g. what happens with two datasets in the Scene with DataID(name='1') and DataID(name='1', resolution=2000)?).

Known problems

  1. Some tests are failing due to Scene.__getitem__ not "greedy" enough #2331

@djhoese (Member) commented Feb 1, 2023

> The behaviour makes some assumptions which I think make sense but should be debated:

Does this mean this is the behavior added by this PR or this is the behavior existing in main currently? And sorry for not reviewing this earlier. I'll see if I can find some time to review these PRs and issues.

@BENR0 (Collaborator, Author) commented Feb 8, 2023

This behavior is added by the PR.

@djhoese (Member) commented Feb 8, 2023

We should maybe adopt the same logic as what xarray uses (as much as we can) when you assign a DataArray from one Dataset to a new Dataset and with a different name. I'm not sure off the top of my head what happens but I know that Xarray has some more flexibility because it primarily deals with Variable objects and use the DataArrays as a wrapper around them.

> 1. If name is not in the attributes it gets added based on the chosen name in the key to which the DataArray is going to be assigned

The way you have this implemented essentially throws out the key if it isn't a DataID, right? Isn't that extremely confusing?

What do we think about making a copy of the passed DataArray and then modifying the attributes? Regardless of what is done now, I don't think we want to overwrite what the user passed. That causes a lot of confusion if someone does:

```python
scn["new_name"] = scn["old_name"]
```

> 2. This deprecates the possibility to assign plain numpy arrays

If this worked before...I don't like it. Glad it gets removed.

> 3. A new DataID is generated from the DataArray attributes which also updates the _satpy_id attribute so they are in sync.

I think this is good, but as mentioned above I think it should involve some level of copying (either just the attrs or the entire DataArray).

> 4. Currently not used but also the name property of the DataArray is set to the name in the DataID.
>
> I think, they need to be unique which is not handled currently

I think this is a good idea as a best practice kind of thing. Especially if this is what xarray does when you assign a DataArray to a Dataset. I think they need to be unique if they are part of the same Dataset object which is really only relevant in the to_xarray_dataset method and the cf writer...I think. Along that same line, maybe this isn't something that the Scene needs to worry about? We don't currently use the .name of the DataArray/Variable. Oh, also, this could cause conflicts between dask task naming. That might be a reason that they need to be unique too.

@djhoese djhoese added bug component:scene backwards-incompatibility Causes backwards incompatibility or introduces a deprecation labels Feb 8, 2023
```python
try:
    new_info = key.to_dict()
except AttributeError:
    new_info = key
if isinstance(value_info, dict):
```
djhoese (Member):

When is value_info not a dict? You also use it in the above DataID.from_dataarray as if it were a dict, without checking that it is.

BENR0 (Collaborator, Author):

I think value_info should always be a dict since it's the xr.DataArray's attributes... unless the DataArray has no attributes. Hmm, need to check that. It should probably be covered by tests too...

djhoese (Member):

So either the user didn't provide a DataArray (numpy, dask, etc), in which case there would be no .attrs attribute (an exception above). Or we require a DataArray as input and we know it has .attrs and this isinstance check can be removed. Right?

BENR0 (Collaborator, Author):

Yes that seems right. Maybe this check is a leftover from the implementation I had before making sure it always was a DataArray. I will check and clean it up.

Comment on lines 193 to 194
```python
key = DataID.from_dataarray(value, default_keys={k: v for k, v in default_id_keys_config.items()
                                                 if k in value_info.keys()})
```
djhoese (Member):

k in value_info? Drop the .keys() I mean.

djhoese (Member):

I forget the exact name of it, but DataArrays from Satpy will likely have the _satpy_id_keys in the .attrs which could be used instead of default_id_keys_config.

BENR0 (Collaborator, Author):

I couldn't use it here because I drop it in the __setitem__ of the Scene before (see below).

satpy/scene.py Outdated
Comment on lines 811 to 812
```python
if isinstance(value, np.ndarray):
    value = xr.DataArray(value)
```
djhoese (Member):

You could also check for a dask array. There is also a good chance that many parts of Satpy won't work without the sub-array being a dask array so maybe wrap it with da.from_array? Should there be a warning that this conversion is happening? Or...should non-DataArrays just not work at all?

BENR0 (Collaborator, Author):

True, maybe checking for dask arrays is also a good idea. If it is currently allowed to assign plain numpy arrays, I guess assigning dask arrays is allowed too, so they should also be converted to an xr.DataArray. I also thought about not only converting to an xr.DataArray when a user tries to assign a numpy array but also converting it to a dask array, but I can't remember why I didn't. I guess it makes sense. I will change it then.

BENR0 (Collaborator, Author):

Forgot: I think non-DataArrays should not work at all. But I think there is no need for a warning, because the conversion is for the "benefit" of the user: many things might not work if the dataset is only a numpy array, so the conversion, albeit not transparent, makes sure that only DataArrays are assigned.

djhoese (Member):

Ok so you're saying you think users should be able to provide a numpy array, or a dask array, or a DataArray (with numpy or dask underneath), and this will convert whatever is given to a DataArray with dask underneath?

If so, then I'm OK with that. If not, let's discuss.

BENR0 (Collaborator, Author):

Yep. I mean, this would keep it backwards compatible if somebody really is providing numpy arrays in their scripts. I would also be OK with throwing a warning or exception and thereby really forbidding the assignment of numpy or dask arrays. That would be more transparent, I guess. Basically the question is whether we treat being able to provide numpy or dask arrays as a feature or not.

djhoese (Member) commented Feb 9, 2023:

I think I'm leaning towards not allowing non-DataArrays and if the DataArray has a numpy array underneath it should maybe be turned into a dask array. Here's what happens if you assign a numpy array to an xarray Dataset:

```python
In [1]: import xarray as xr
In [2]: import numpy as np
In [3]: ds = xr.Dataset()
In [4]: ds["a"] = np.zeros((10, 10))
---------------------------------------------------------------------------
MissingDimensionsError                    Traceback (most recent call last)
Input In [4], in <cell line: 1>()
----> 1 ds["a"] = np.zeros((10, 10))

...

MissingDimensionsError: cannot set variable 'a' with 2-dimensional data without explicit dimension names. Pass a tuple of (dims, data) instead.
```
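For contrast, the same assignment succeeds once dimension names are supplied, which is exactly what the error message asks for:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset()
# Passing a (dims, data) tuple gives xarray the dimension names it needs.
ds["a"] = (("y", "x"), np.zeros((10, 10)))
assert ds["a"].dims == ("y", "x")
```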

BENR0 (Collaborator, Author):

Ok then I will change that.

```python
if value.attrs.get('name', None) is None or value.attrs['name'] != name:
    value.attrs['name'] = name

value.attrs.pop('_satpy_id', None)
```
djhoese (Member):

Should this not be popped so that DatasetDict.__setitem__ can use the id keys config from it?

BENR0 (Collaborator, Author):

I hope I remember this correctly: at first I tried that, but then found out that the DataID.from_dataarray class method always returns the _satpy_id directly if present, so it wouldn't get updated.

djhoese (Member):

🤔 But after the self._datasets[key] = value shouldn't the new _satpy_id be set?

BENR0 (Collaborator, Author):

Yes, but I call DataID.from_dataarray in the datasets.__setitem__ method, and therefore I need to remove it because otherwise DataID.from_dataarray will return the old id. This could probably be done differently though?

djhoese (Member):

Good point, but then I think this pop should go inside DatasetDict.__setitem__, right?

And actually, instead of from_dataarray you could use data_id.from_dict if you have the original DataID of the DataArray...but yeah if the provided DataArray didn't already have a DataID you'd need to create one. Actually that might not be that bad if you used the default id config (satpy.dataset.dataid.default_id_keys_config).

🤔 Ok that might actually be cleaner. If in the DatasetDict.__setitem__ you create a dictionary of all the metadata, then if there was a DataID in that metadata, pop it. If DataID get the .id_keys else use default_id_keys_config. Then create a new DataID(id_keys, **attrs) and use that. Maybe?
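The flow proposed here might look roughly like the following sketch. A plain tuple stands in for Satpy's DataID so the logic is runnable on its own; the helper name `rebuild_id` and the default keys are illustrative assumptions:

```python
DEFAULT_ID_KEYS = ("name", "resolution")  # stand-in for default_id_keys_config

def rebuild_id(attrs: dict):
    # Collect the metadata, pop any old id, reuse its id_keys if present,
    # otherwise fall back to the defaults, then build a fresh id.
    attrs = dict(attrs)
    old_id = attrs.pop("_satpy_id", None)
    id_keys = old_id[0] if old_id is not None else DEFAULT_ID_KEYS
    # A real implementation would call DataID(id_keys, **attrs) here.
    return (id_keys, tuple((k, attrs[k]) for k in id_keys if k in attrs))

new_id = rebuild_id({"name": "1", "resolution": 2000})
assert new_id[1] == (("name", "1"), ("resolution", 2000))
```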

@BENR0 (Collaborator, Author) commented Feb 9, 2023

> We should maybe adopt the same logic as what xarray uses (as much as we can) when you assign a DataArray from one Dataset to a new Dataset and with a different name. I'm not sure off the top of my head what happens but I know that Xarray has some more flexibility because it primarily deals with Variable objects and use the DataArrays as a wrapper around them.

I agree. I only remember having a look at how it is done in xarray, but I can't remember whether and how much this mimics xarray :-/.

> 1. If name is not in the attributes it gets added based on the chosen name in the key to which the DataArray is going to be assigned

> The way you have this implemented essentially throws out the key if it isn't a DataID, right? Isn't that extremely confusing?

You mean throwing it out of the wishlist and the _datasets dict?

> What do we think about making a copy of the passed DataArray and then modifying the attributes? Regardless of what is done now, I don't think we want to overwrite what the user passed. That causes a lot of confusion if someone does:
>
> ```python
> scn["new_name"] = scn["old_name"]
> ```

Yes, that would not be helpful indeed. It's definitely not covered by unit tests, I guess :-D. I am not sure if I checked this manually.

> 4. Currently not used but also the name property of the DataArray is set to the name in the DataID.
>
> I think, they need to be unique which is not handled currently
>
> I think this is a good idea as a best practice kind of thing. Especially if this is what xarray does when you assign a DataArray to a Dataset. I think they need to be unique if they are part of the same Dataset object which is really only relevant in the to_xarray_dataset method and the cf writer...I think. Along that same line, maybe this isn't something that the Scene needs to worry about? We don't currently use the .name of the DataArray/Variable. Oh, also, this could cause conflicts between dask task naming. That might be a reason that they need to be unique too.

While I agree that currently the .name is not used in Satpy, I added this here to start the discussion about it, because I have run into this on occasion (even though I admit I may have unexpected ways of using Satpy sometimes ;-)), and it is something that would probably be helpful in the future with regard to the to_xarray_dataset method and the CF/Zarr writers.

I think unique names should be based on the DataID information which would probably handle most cases except the ones where the information is not enough to separate two datasets. So if this was implemented as a method in DataID it would need an argument for the _datasets dict to be able to check if it is indeed unique. Or maybe this is better located in the _datasets dict. In any case it should be consistent in a way that not only datasets that get assigned to the Scene by the user get the .name set but also datasets which are loaded.
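A hedged sketch of what deriving a unique name from DataID-like metadata might look like; the separator, the key ordering, and the counter suffix are arbitrary illustrative choices, not Satpy behaviour:

```python
def unique_name(id_dict: dict, existing: set) -> str:
    # Join the id values into a base name, then append a counter until
    # the name does not clash with anything already in the Scene.
    base = "_".join(str(id_dict[k]) for k in sorted(id_dict) if id_dict[k] is not None)
    name, i = base, 1
    while name in existing:
        name, i = f"{base}_{i}", i + 1
    return name

assert unique_name({"name": "1"}, set()) == "1"
assert unique_name({"name": "1", "resolution": 2000}, {"1"}) == "1_2000"
```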

In general I am not really "happy" with the implementation here. It feels kind of hacky. It turned out not to be that easy because of how and when the DataIDs are created/updated and assigned to the wishlist/dependency tree and the _datasets dict (for example, the DataID gets created twice: once in the __setitem__ of the _datasets dict and then again to be able to add it to the wishlist and dependency tree). So while this works, it might not be the "best" solution, and a better one would probably need a lot more refactoring and more intricate knowledge of the whole DataID handling.

Successfully merging this pull request may close these issues.

Reset of modifier attribute of xr.DataArray when assigning to existing scene DataID