
Fix DataID overwriting attributes during Scene assignment and DataID not reflecting DataArray attributes #2332

Open · wants to merge 8 commits into base: main

Conversation

@BENR0 (Collaborator) commented Dec 20, 2022

This PR fixes two problems: attributes of a DataArray being overwritten based on the DataID when it is assigned to an existing id, and the DataID not reflecting the DataArray's attributes (see the corresponding issues below).

The behaviour makes some assumptions which I think make sense but should be debated:

  1. If name is not in the attributes, it is added based on the chosen name in the key to which the DataArray is going to be assigned.
  2. This deprecates the possibility to assign plain numpy arrays to the Scene and instead converts them to an xarray DataArray (I think this makes sense because many features of the Scene rely on attributes, and as far as I can see the documentation does not even mention that this is possible).
  3. A new DataID is generated from the DataArray attributes which also updates the _satpy_id attribute so they are in sync.
  4. Currently not used, but the name property of the DataArray is also set to the name in the DataID. This needs to be discussed, since it would be good if the name property also got set by the readers, and I think names need to be unique, which is not handled currently (e.g. what happens with two datasets in the Scene with DataID(name='1') and DataID(name='1', resolution=2000)?).

Known problems

  1. Some tests are failing due to Scene.__getitem__ not "greedy" enough #2331

@djhoese (Member) commented Feb 1, 2023

> The behaviour makes some assumptions which I think make sense but should be debated:

Does this mean this is the behavior added by this PR or this is the behavior existing in main currently? And sorry for not reviewing this earlier. I'll see if I can find some time to review these PRs and issues.

@BENR0 (Collaborator, Author) commented Feb 8, 2023

This behavior is added by the PR.

@djhoese (Member) commented Feb 8, 2023

We should maybe adopt the same logic as what xarray uses (as much as we can) when you assign a DataArray from one Dataset to a new Dataset and with a different name. I'm not sure off the top of my head what happens but I know that Xarray has some more flexibility because it primarily deals with Variable objects and use the DataArrays as a wrapper around them.

> 1. If name is not in the attributes it gets added based on the chosen name in the key to which the DataArray is going to be assigned

The way you have this implemented essentially throws out the key if it isn't a DataID, right? Isn't that extremely confusing?

What do we think about making a copy of the passed DataArray and then modifying the attributes? Regardless of what is done now, I don't think we want to overwrite what the user passed. That causes a lot of confusion if someone does:

```python
scn["new_name"] = scn["old_name"]
```

> 2. This deprecates the possibility to assign plain numpy arrays

If this worked before...I don't like it. Glad it gets removed.

> 3. A new DataID is generated from the DataArray attributes which also updates the _satpy_id attribute so they are in sync.

I think this is good, but as mentioned above I think it should involve some level of copying (either just the attrs or the entire DataArray).

> 4. Currently not used but also the name property of the DataArray is set to the name in the DataID.
>
> I think, they need to be unique which is not handled currently

I think this is a good idea as a best practice kind of thing. Especially if this is what xarray does when you assign a DataArray to a Dataset. I think they need to be unique if they are part of the same Dataset object which is really only relevant in the to_xarray_dataset method and the cf writer...I think. Along that same line, maybe this isn't something that the Scene needs to worry about? We don't currently use the .name of the DataArray/Variable. Oh, also, this could cause conflicts between dask task naming. That might be a reason that they need to be unique too.

@djhoese djhoese added bug component:scene backwards-incompatibility Causes backwards incompatibility or introduces a deprecation labels Feb 8, 2023
```python
try:
    new_info = key.to_dict()
except AttributeError:
    new_info = key
if isinstance(value_info, dict):
```
djhoese (Member):

When is value_info not a dict? You also use it in the above DataID.from_dataarray as if it were a dict, without checking that it is.

BENR0 (Collaborator, Author):

I think value_info should always be a dict since it's the xr.DataArray's attributes... unless the DataArray has no attributes. Hmm, need to check that. It should probably be covered by tests too...

djhoese (Member):

So either the user didn't provide a DataArray (numpy, dask, etc), in which case there would be no .attrs attribute (an exception above). Or we require a DataArray as input and we know it has .attrs and this isinstance check can be removed. Right?

BENR0 (Collaborator, Author):

Yes that seems right. Maybe this check is a leftover from the implementation I had before making sure it always was a DataArray. I will check and clean it up.

Comment on lines 193 to 194
```python
key = DataID.from_dataarray(value, default_keys={k: v for k, v in default_id_keys_config.items()
                                                 if k in value_info.keys()})
```
djhoese (Member):

k in value_info? Drop the .keys() I mean.

djhoese (Member):

I forget the exact name of it, but DataArrays from Satpy will likely have the _satpy_id_keys in the .attrs which could be used instead of default_id_keys_config.

BENR0 (Collaborator, Author):

I couldn't use it here because I drop it in the __setitem__ of the Scene before (see below).

satpy/scene.py Outdated
Comment on lines 811 to 812
```python
if isinstance(value, np.ndarray):
    value = xr.DataArray(value)
```
djhoese (Member):

You could also check for a dask array. There is also a good chance that many parts of Satpy won't work without the sub-array being a dask array so maybe wrap it with da.from_array? Should there be a warning that this conversion is happening? Or...should non-DataArrays just not work at all?

BENR0 (Collaborator, Author):

True, maybe checking for dask arrays is also a good idea. If it is currently allowed to assign plain numpy arrays, I guess assigning dask arrays is allowed too, so they should also be converted to an xr.DataArray. I also thought about not only converting to an xr.DataArray when a user tries to assign a numpy array but also converting it to a dask array, but I can't remember why I didn't. I guess it makes sense. I will change it then.

BENR0 (Collaborator, Author):

Forgot: I think non-DataArrays should not work at all. But I think there is no need for a warning, because the conversion is for the "benefit" of the user: many things might not work if the dataset is only a numpy array, so the conversion, albeit not transparent, makes sure that only DataArrays are assigned.

djhoese (Member):

Ok so you're saying you think users should be able to provide a numpy array, or a dask array, or a DataArray (with numpy or dask underneath), and this will convert whatever is given to a DataArray with dask underneath?

If so, then I'm OK with that. If not, let's discuss.

BENR0 (Collaborator, Author):

Yep. I mean, this would keep it backwards compatible if somebody really is providing numpy arrays in their scripts. I would also be OK with throwing a warning or exception and thereby really forbidding the assignment of numpy or dask arrays. That would be more transparent, I guess. Basically the question is whether we treat being able to provide numpy or dask arrays as a feature or not.

djhoese (Member) commented Feb 9, 2023:

I think I'm leaning towards not allowing non-DataArrays and if the DataArray has a numpy array underneath it should maybe be turned into a dask array. Here's what happens if you assign a numpy array to an xarray Dataset:

```python
In [1]: import xarray as xr
In [2]: import numpy as np
In [3]: ds = xr.Dataset()
In [4]: ds["a"] = np.zeros((10, 10))
---------------------------------------------------------------------------
MissingDimensionsError                    Traceback (most recent call last)
Input In [4], in <cell line: 1>()
----> 1 ds["a"] = np.zeros((10, 10))

...

MissingDimensionsError: cannot set variable 'a' with 2-dimensional data without explicit dimension names. Pass a tuple of (dims, data) instead.
```
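For contrast, the same assignment succeeds once dimension names are supplied, which is exactly what the error message asks for:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset()
# Passing a (dims, data) tuple gives xarray the dimension names it needs.
ds["a"] = (("y", "x"), np.zeros((10, 10)))
assert ds["a"].dims == ("y", "x")
```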

BENR0 (Collaborator, Author):

Ok then I will change that.

```python
if value.attrs.get('name', None) is None or value.attrs['name'] != name:
    value.attrs['name'] = name

value.attrs.pop('_satpy_id', None)
```
djhoese (Member):

Should this not be popped so that DatasetDict.__setitem__ can use the id keys config from it?

BENR0 (Collaborator, Author):

I hope I remember this correctly: at first I tried that, but then found out that the DataID.from_dataarray class method always returns the _satpy_id directly if present, so it wouldn't get updated.

djhoese (Member):

🤔 But after the self._datasets[key] = value shouldn't the new _satpy_id be set?

BENR0 (Collaborator, Author):

Yes, but I call DataID.from_dataarray in the datasets.__setitem__ method, and therefore I need to remove it because otherwise DataID.from_dataarray will return the old id. This could probably be done differently though?

djhoese (Member):

Good point, but then I think this pop should go inside DatasetDict.__setitem__, right?

And actually, instead of from_dataarray you could use data_id.from_dict if you have the original DataID of the DataArray...but yeah if the provided DataArray didn't already have a DataID you'd need to create one. Actually that might not be that bad if you used the default id config (satpy.dataset.dataid.default_id_keys_config).

🤔 Ok that might actually be cleaner. If in the DatasetDict.__setitem__ you create a dictionary of all the metadata, then if there was a DataID in that metadata, pop it. If DataID get the .id_keys else use default_id_keys_config. Then create a new DataID(id_keys, **attrs) and use that. Maybe?
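The flow proposed here might look roughly like the following sketch. A plain tuple stands in for Satpy's DataID so the logic is runnable on its own; the helper name `rebuild_id` and the default keys are illustrative assumptions:

```python
DEFAULT_ID_KEYS = ("name", "resolution")  # stand-in for default_id_keys_config

def rebuild_id(attrs: dict):
    # Collect the metadata, pop any old id, reuse its id_keys if present,
    # otherwise fall back to the defaults, then build a fresh id.
    attrs = dict(attrs)
    old_id = attrs.pop("_satpy_id", None)
    id_keys = old_id[0] if old_id is not None else DEFAULT_ID_KEYS
    # A real implementation would call DataID(id_keys, **attrs) here.
    return (id_keys, tuple((k, attrs[k]) for k in id_keys if k in attrs))

new_id = rebuild_id({"name": "1", "resolution": 2000})
assert new_id[1] == (("name", "1"), ("resolution", 2000))
```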

@BENR0 (Collaborator, Author) commented Feb 9, 2023

> We should maybe adopt the same logic as what xarray uses (as much as we can) when you assign a DataArray from one Dataset to a new Dataset and with a different name. I'm not sure off the top of my head what happens but I know that Xarray has some more flexibility because it primarily deals with Variable objects and use the DataArrays as a wrapper around them.

I agree. I only remember having a look at how it is done in xarray, but I can't remember whether and how much this mimics xarray :-/.

> 1. If name is not in the attributes it gets added based on the chosen name in the key to which the DataArray is going to be assigned

> The way you have this implemented essentially throws out the key if it isn't a DataID, right? Isn't that extremely confusing?

You mean throwing it out of the wishlist and the _datasets dict?

> What do we think about making a copy of the passed DataArray and then modifying the attributes? Regardless of what is done now, I don't think we want to overwrite what the user passed. That causes a lot of confusion if someone does:
>
> ```python
> scn["new_name"] = scn["old_name"]
> ```

Yes, that would not be helpful indeed. It's definitely not covered by unit tests, I guess :-D. I am not sure if I checked this manually.

> 4. Currently not used but also the name property of the DataArray is set to the name in the DataID.
>
> I think, they need to be unique which is not handled currently
>
> I think this is a good idea as a best practice kind of thing. Especially if this is what xarray does when you assign a DataArray to a Dataset. I think they need to be unique if they are part of the same Dataset object which is really only relevant in the to_xarray_dataset method and the cf writer...I think. Along that same line, maybe this isn't something that the Scene needs to worry about? We don't currently use the .name of the DataArray/Variable. Oh, also, this could cause conflicts between dask task naming. That might be a reason that they need to be unique too.

While I agree that currently the .name is not used in Satpy, I added this here to start the discussion about it, because I have run into this on occasion (even though I admit I may have unexpected ways of using Satpy sometimes ;-)), and it is something that would probably be helpful in the future with regard to the to_xarray_dataset method and the CF/Zarr writers.

I think unique names should be based on the DataID information which would probably handle most cases except the ones where the information is not enough to separate two datasets. So if this was implemented as a method in DataID it would need an argument for the _datasets dict to be able to check if it is indeed unique. Or maybe this is better located in the _datasets dict. In any case it should be consistent in a way that not only datasets that get assigned to the Scene by the user get the .name set but also datasets which are loaded.
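A hedged sketch of what deriving a unique name from DataID-like metadata might look like; the separator, the key ordering, and the counter suffix are arbitrary illustrative choices, not Satpy behaviour:

```python
def unique_name(id_dict: dict, existing: set) -> str:
    # Join the id values into a base name, then append a counter until
    # the name does not clash with anything already in the Scene.
    base = "_".join(str(id_dict[k]) for k in sorted(id_dict) if id_dict[k] is not None)
    name, i = base, 1
    while name in existing:
        name, i = f"{base}_{i}", i + 1
    return name

assert unique_name({"name": "1"}, set()) == "1"
assert unique_name({"name": "1", "resolution": 2000}, {"1"}) == "1_2000"
```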

In general I am not really "happy" with the implementation here. It feels kind of hacky. It turned out not to be that easy because of how and when the DataIDs are created/updated and assigned to the wishlist/dependency tree and the _datasets dict (for example, the DataID gets created twice: once in the __setitem__ of the _datasets dict and then again to be able to add it to the wishlist and dependency tree). So while this works, it might not be the "best" solution, and a better one would probably need a lot more refactoring and more intricate knowledge of the whole DataID handling.

Successfully merging this pull request may close these issues.

Reset of modifier attribute of xr.DataArray when assigning to existing scene DataID