Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor CFWriter.save_datasets and enable retrieval of equivalent xr.Dataset with scn.to_xarray() #2259

Merged
merged 25 commits into from
Jun 27, 2023

Conversation

ghiggi
Copy link
Contributor

@ghiggi ghiggi commented Nov 3, 2022

This PR:

  1. refactors the internal code of CFWriter.save_datasets
  2. add the Scene.to_xarray() method to retrieve the equivalent xr.Dataset which is written by the CFWriter.

Side note:

  • This PR could serve as an intermediate step to write in future a CF-compliant ZarrWriter, which essentially requires a different definition of the encodings. Nevertheless, after the merging of this PR, a CF-compliant Zarr dataset can be created with just scn.to_xarray().to_zarr(). Relates to Writer for Zarr format #2229

  • Tests added

  • Fully documented

  • Add your name to AUTHORS.md if not there already

Copy link
Member

@djhoese djhoese left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for starting this. I see a lot of TODOs for refactoring. Do those still need to be done? Also, could/should this new Scene.to_xarray be a keyword argument on to_xarray_dataset (default to False)? I personally don't really like taking the generic name to_xarray to be this very specific use case.

Lastly, can you add tests?

@djhoese
Copy link
Member

djhoese commented Nov 3, 2022

Also...isn't the full test suite run for draft PRs?

@djhoese djhoese added enhancement code enhancements, features, improvements component:writers component:scene labels Nov 3, 2022
@djhoese
Copy link
Member

djhoese commented Nov 3, 2022

...and why aren't the tests running? This is very strange.

@ghiggi
Copy link
Contributor Author

ghiggi commented Nov 3, 2022

@djhoese no idea why the test is not running ...

Few questions:

  1. I add a cf_compliant and the other arguments to the current Scene.to_xarray_dataset ? Like this?
scn.to_xarray_dataset(datasets=None,
  cf_compliant=True, 
  header_attrs=None,
  exclude_attrs=None,
  flatten_attrs=False,
  pretty=True,
  include_lonlats=True,
  epoch=None,
  include_orig_name=True,
  numeric_name_prefix='CHANNEL_')
  1. I can change the input/outputs of the _get_groups and _set_history functions or should I create new functions and leave them (unused in cf_writer.py) to guarantee backward compatibility if some people importing them?
  2. To complete the refactor I would need to define outside the CFWriter class the da2cf and _collect_datasets methods.
    Also in this case should I guarantee backward compatibility for these static methods?

@djhoese
Copy link
Member

djhoese commented Nov 3, 2022

  1. Oh...that's a lot of keyword arguments. 😨 I don't like it, but I'm also not sure what's best. Maybe we wait and see what other @pytroll/satpy-core maintainers think.
  2. If they have a _ prefix then don't worry about backward compatibility. If they are needed by your new code then extract them from the writer class (assuming they can be extracted) and then call the top level functions. The _ prefix should be removed from top level functions. Also, if it makes more sense to put these functions in a non-writer class to share state or share arguments then that's OK with me too. This would probably mean the CFWriter would create an instance of this new class as a "helper" to do the conversion.
  3. I think my answer is the same as for point 2 above.

@ghiggi
Copy link
Contributor Author

ghiggi commented Dec 5, 2022

Hi @pytroll/satpy-core and @BENR0 . I will likely have time to work on this toward the end of the week.
What's your opinion on the forward way to go?

@BENR0
Copy link
Collaborator

BENR0 commented Dec 6, 2022

I think this should definitely be merged with the to_xarray_dataset method. I just had a quick glance at the code and I don't know if that is already possible (I will try if I have time), but I would also like to have is this working with Multiscene and the timeseries blend to be able to do distributed writing to Zarr.

@BENR0 BENR0 mentioned this pull request Dec 7, 2022
@ghiggi ghiggi marked this pull request as draft February 2, 2023 08:13
@ghiggi ghiggi marked this pull request as ready for review February 2, 2023 08:14
@mraspaud
Copy link
Member

@ghiggi could you fix the merge conflicts? I'll have a shot at review this when it's done

@mraspaud mraspaud added this to the v0.42.0 milestone Apr 17, 2023
@ghiggi
Copy link
Contributor Author

ghiggi commented Apr 17, 2023

@mraspaud I might be able to give it a shot on Wednesday. Meanwhile ... tell me if scn.to_xarray_dataset should have this arguments (same as CF_Writer) ;)

scn.to_xarray_dataset(datasets=None,
  cf_compliant=True, 
  header_attrs=None,
  exclude_attrs=None,
  flatten_attrs=False,
  pretty=True,
  include_lonlats=True,
  epoch=None,
  include_orig_name=True,
  numeric_name_prefix='CHANNEL_')

@mraspaud
Copy link
Member

@mraspaud I might be able to give it a shot on Wednesday. Meanwhile ... tell me if scn.to_xarray_dataset should have this arguments (same as CF_Writer) ;)

scn.to_xarray_dataset(datasets=None,
  cf_compliant=True, 
  header_attrs=None,
  exclude_attrs=None,
  flatten_attrs=False,
  pretty=True,
  include_lonlats=True,
  epoch=None,
  include_orig_name=True,
  numeric_name_prefix='CHANNEL_')

That's a lot of arguments, but if think this is necessary...

@ghiggi
Copy link
Contributor Author

ghiggi commented May 25, 2023

Can you do another round of review @djhoese @mraspaud? I addressed your previous comments.

Tomorrow morning I will finalize the refactoring of the method called by _collect_cf_dataset which are still a bit messy in regards to the extraction of info from the area attrs.

@ghiggi ghiggi requested review from djhoese and mraspaud May 25, 2023 17:31
satpy/writers/cf_writer.py Outdated Show resolved Hide resolved
@djhoese
Copy link
Member

djhoese commented May 26, 2023

I've been staring at this RTD issue for too long along with every other PR tonight. I have no idea what sphinx is complaining about. It seems to think "CHANNEL_" (IN QUOTES!) is referring to some reference that doesn't exist. It is just a string. Why is it having a problem with this? I guess we could try double backticks for both. Let me try that...

@ghiggi
Copy link
Contributor Author

ghiggi commented May 26, 2023

@djhoese @mraspaud there is only left to decide what to do with the get_extra_ds function and the ancillary_variables DataArrays ...

@ghiggi ghiggi requested a review from djhoese May 26, 2023 13:40
@djhoese
Copy link
Member

djhoese commented May 26, 2023

I personally don't use ancillary variables, but I also don't use this writer. That said, ancillary variables are important and should be included in the NetCDF file. If there is a bug in the code so these never get added then we need to know why the tests don't show that as a problem. If there are two redundant pieces of code in the writer that look like they are collecting ancillary variables then yes one of them could/should be removed.

@ghiggi
Copy link
Contributor Author

ghiggi commented May 26, 2023

No there are no redundant pieces of code. Currently, the CFWriter simply doesn't write ancillary variables to disk (since at https://github.com/pytroll/satpy/blob/main/satpy/writers/cf_writer.py#L808 we don't specify keys).
If we would like to do so, we would need to add another argument to the CF Writer ...
And it's not clear what get_extra_ds(https://github.com/pytroll/satpy/blob/main/satpy/writers/cf_writer.py#L228) does except converting a list of DataArrays into a dictionary {name: DataArray}

@djhoese
Copy link
Member

djhoese commented May 26, 2023

Maybe @zxdawn can comment it they're available? To summarize get_extra_ds in the CF writer only adds ancillary variables if a keys keyword argument is provided (you @zxdawn added this), but it seems like keys is never passed.

@ghiggi Can you tell if there are any tests that use ancillary variables (or some of the fake data has them)?

@ghiggi
Copy link
Contributor Author

ghiggi commented May 31, 2023

Regarding the ancillary variables:

  • There is a test that checks that the list of DataArray in the ancillary_variables are resampled !
  • The only reader that exploit ancillary_variables are fci_l1c_nc (for pixel_quality) and nucaps (for Quality_Flag)
    In fci_l1c_nc, the reader adds a xr.DataArray, in nucaps the reader adds a string !
  • Regarding writers, there is a single test for the cf writer (at https://github.com/pytroll/satpy/blob/main/satpy/tests/writer_tests/test_cf.py#L171) which tests that the writer replaces the list of DataArray in the ancillary_variables key with a list of the DataArray names. This behavior is not modified in this PR
  • The possible list of DataArrays is not written to disk (and there is no test about it).

If you agree @djhoese, I would tend to ask you to merge the PR as it is, so that I can proceed with the CF refactoring.
And we can decide and implement the behavior related to ancillary_variables in a separate PR.

@zxdawn
Copy link
Member

zxdawn commented May 31, 2023

ha, it seems #1140 didn't pass the keys. Anyway, maybe it's better to keep ancillary_variables as string attrs? It would be annoying if users only want to load one variable, but they get multiple "redundant" variables.

@djhoese
Copy link
Member

djhoese commented May 31, 2023

The only reader that exploit ancillary_variables

It may be hard to tell which ones use it since some may be loaded automatically by xarray as part of the normal xr.open_dataset usage.

In fci_l1c_nc, the reader adds a xr.DataArray, in nucaps the reader adds a string !

The nucaps way of doing it was a change by me (or rather, how I always did it). @mraspaud preferred including DataArrays in the .attrs. However, this is against the expected usage of xarray as stated by some of the xarray maintainers. There was an issue a while ago where tests were failing because xarray didn't expect DataArrays to be inside a .attrs (or even any complex object at all). They worked around it for us, but we should not assume that this is something that should be allowed moving forward. I would advise FCI maintainers (@ameraner @gerritholl ?) to switch this behavior in the future if at all possible. But then again, I don't think Satpy provides an easy way to refer to the ancillary variables which is why it is still done this way.

Based on your private message to me on slack @ghiggi I am OK with this being merged and think the next PR should be a refactor-only PR (which can include converting tests to pytest). I could see an argument for include an ancillary variable fix, but that might be too much for a single PR and especially confusing with the amount of testing that would be needed.

Given my lack of usage of the CF writer I don't think I should be the one to push the merge button. @mraspaud please give this one last review and merge it if acceptable while keeping in mind that @ghiggi plans to make more PRs after this one.

@ghiggi
Copy link
Contributor Author

ghiggi commented Jun 21, 2023

@mraspaud milestone v0.43.0

@mraspaud mraspaud modified the milestones: v0.42.0, v0.43.0 Jun 21, 2023
@ghiggi
Copy link
Contributor Author

ghiggi commented Jun 27, 2023

@mraspaud @djhoese it's there something that I can still do to allow merging this PR in the next release?
I can work on it tomorrow and Monday in case ;)

@mraspaud
Copy link
Member

@ghiggi I answered yesterday the question about ancillary variables, is there still something to do there?

@ghiggi
Copy link
Contributor Author

ghiggi commented Jun 27, 2023

@mraspaud I think no. Regarding the ancillary variables I currently kept the historical behaviour of the CF-writer:

If we merge this today, tomorrow I might be able to do a quick PR to reorg the functions in the cf directory and convert the tests to pytest.

@mraspaud
Copy link
Member

Ok, let's merge this then, so that you can start on making it nicer.

@mraspaud mraspaud merged commit 8fb627b into pytroll:main Jun 27, 2023
15 of 16 checks passed
@ghiggi ghiggi deleted the refactor-cfwriter branch June 27, 2023 13:11
@ameraner
Copy link
Member

They worked around it for us, but we should not assume that this is something that should be allowed moving forward. I would advise FCI maintainers (@ameraner @gerritholl ?) to switch this behavior in the future if at all possible. But then again, I don't think Satpy provides an easy way to refer to the ancillary variables which is why it is still done this way.

I agree, this is something I wanted to look into, and also profile in terms of extra computational efforts, e.g. when resampling.
In the case of FCI, we can access the ancillary variable as a separate dataset anyway (with channelname_pixel_quality), so I think it's not an issue to disable the automatic DataArray inclusion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component:scene component:writers enhancement code enhancements, features, improvements
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

None yet

7 participants