
Fix dataset writing so computations are shared between tasks #216

Merged
djhoese merged 11 commits into develop from feature-delayed-save-datasets on Mar 12, 2018

Conversation

@djhoese (Member) commented Mar 8, 2018

This PR adds a compute keyword argument to all save_dataset, save_datasets, and save_image methods in the Scene and the writers. It defaults to True in all of these methods, so individual use cases have not changed. It is also handled by the higher level methods so that the user doesn't see a difference except in performance.

Handling writes the way this PR implements allows all dask objects to be computed at the same time, which is how the higher level methods implement it. This means that if tasks share computations (input datasets), those computations are shared and shouldn't have to be recalculated.

Additionally, this will allow for more flexibility in the future. If a user calls scn.save_datasets(..., compute=False), they will get a dask Delayed object back which they can then use with other delayed objects.

If compute is True, or a writer does not support returning a delayed object (and doesn't complain about the compute keyword being passed), then these functions return whatever the result of the delayed function call was. Typically this is just None. This should also allow future flexibility if a writer should return something (maybe the filenames of the files created?).
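The sharing described above can be sketched with plain dask.delayed (a minimal illustration of the idea, not satpy's actual writer code; load_input and write_output are made-up stand-ins):

```python
import dask

calls = []

@dask.delayed
def load_input():
    # Stands in for an expensive shared computation (e.g. a loaded dataset).
    calls.append("load")
    return 42

@dask.delayed
def write_output(data, path):
    # Stands in for a writer; pretend `data` is written to `path`.
    return path

data = load_input()
writes = [write_output(data, "a.tif"), write_output(data, "b.tif")]

# Computing all deferred writes in one call merges their task graphs,
# so the shared load_input task runs exactly once.
results = dask.compute(*writes)
```

Here `calls` ends up as `["load"]` even though two files were "written", which is the performance win the higher level methods get by computing everything together.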

WIP: Docstrings and tests to make sure this functionality completely works.

  • Tests added
  • Tests passed
  • Passes git diff origin/develop **/*py | flake8 --diff
  • Fully documented

@djhoese djhoese added labels enhancement ("code enhancements, features, improvements") and component:writers on Mar 8, 2018
@djhoese djhoese added this to the v0.9 milestone Mar 8, 2018
@djhoese djhoese self-assigned this Mar 8, 2018
@djhoese djhoese requested a review from mraspaud March 8, 2018 03:22
@djhoese (Member, Author) commented Mar 8, 2018

Oops, I forgot this needs my trollimage updates too. Will make a PR tomorrow.

@@ -23,6 +23,7 @@

import logging

import dask
Contributor:

F401 'dask' imported but unused

@djhoese (Member, Author):

@mraspaud Bug in stickler? This import is used.

@mraspaud (Member):

I don't see it being used in this file?

@djhoese (Member, Author):

Ha, I was looking at the wrong file. This was from a previous commit that stickler didn't comment on, I think.

@coveralls commented Mar 12, 2018

Coverage Status

Coverage increased (+0.4%) to 59.923% when pulling c85e435 on feature-delayed-save-datasets into cebde2b on develop.

@mraspaud (Member) left a comment:

Just nagging...

@@ -325,17 +327,44 @@ def get_filename(self, **kwargs):
"No filename pattern or specific filename provided")
return self.filename_parser.compose(kwargs)

def save_datasets(self, datasets, **kwargs):
def save_datasets(self, datasets, compute=True, **kwargs):
@mraspaud (Member):

The docstring should be updated to explain compute, and how that affects the return value.
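One possible shape for such a docstring (the wording below is a suggestion for illustration, not the text that was merged):

```python
def save_datasets(datasets, compute=True, **kwargs):
    """Save all datasets to one or more files.

    Args:
        datasets: Iterable of datasets to save.
        compute (bool): If True (default), compute and write the data
            immediately and return the writer's result (usually None).
            If False, return a dask ``Delayed`` object that performs
            the writes when computed.
    """
```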

for ds in datasets:
self.save_dataset(ds, **kwargs)
res = self.save_dataset(ds, compute=False, **kwargs)
if isinstance(res, tuple):
@mraspaud (Member):

I can't see self.save_dataset returning a tuple, even further down the call chain; am I missing something?
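For context, one reason a deferred writer might return a tuple is to hand back (target, source) pairs so the caller can combine many writes into a single da.store call later. A toy sketch of that pattern (ArrayTarget and this save_dataset are illustrative assumptions, not satpy code):

```python
import dask.array as da
import numpy as np

class ArrayTarget:
    """Toy write target: collects chunk writes into a numpy array."""
    def __init__(self, shape):
        self.data = np.zeros(shape)
    def __setitem__(self, key, value):
        self.data[key] = value

def save_dataset(source, compute=True):
    target = ArrayTarget(source.shape)
    if compute:
        da.store(source, target)
        return target
    # Deferred: return the pieces so the caller can combine many
    # writes into one da.store call later.
    return target, source

source = da.ones((4, 4), chunks=2)
target, src = save_dataset(source, compute=False)
da.store([src], [target])  # one store call can serve several writers
```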


def save_dataset(self, dataset, filename=None, fill_value=None, **kwargs):
def save_dataset(self, dataset, filename=None, fill_value=None,
compute=True, **kwargs):
@mraspaud (Member):

The docstring should mention compute.

@@ -360,15 +389,18 @@ def __init__(self,
self.enhancer = Enhancer(ppp_config_dir=self.ppp_config_dir,
enhancement_config_file=enhancement_config)

def save_dataset(self, dataset, filename=None, fill_value=None, overlay=None, decorate=None, **kwargs):
def save_dataset(self, dataset, filename=None, fill_value=None,
overlay=None, decorate=None, compute=True, **kwargs):
@mraspaud (Member):

The docstring should mention compute and the return value.

dst_ds = None
def save_image(self, img, filename=None, floating_point=False,
compute=True, **kwargs):
"""Save the image to the given *filename* in geotiff_ format.
@mraspaud (Member):

The docstring should mention compute and the return value.

satpy/scene.py Outdated
@@ -766,7 +766,7 @@ def load_writer_config(self, config_files, **kwargs):
return writer

def save_dataset(self, dataset_id, filename=None, writer=None,
overlay=None, **kwargs):
overlay=None, compute=True, **kwargs):
@mraspaud (Member):

The docstring should mention compute and what the function returns.

XArray defaults to using netcdf4-python engine if possible. It falls
back to scipy. PRs are pending for ways to specify the engine from
global xarray options.
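Until such a global option exists, the engine can at least be chosen per call via the real engine parameter of xarray's to_netcdf (the dataset contents and file path here are arbitrary examples):

```python
import os
import tempfile

import numpy as np
import xarray as xr

ds = xr.Dataset({"t": ("x", np.arange(3.0))})
path = os.path.join(tempfile.mkdtemp(), "out.nc")
# engine="scipy" forces the scipy backend; engine="netcdf4" would
# request netcdf4-python instead of relying on the default fallback.
ds.to_netcdf(path, engine="scipy")
```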
@djhoese djhoese merged commit 6657ac3 into develop Mar 12, 2018
@djhoese djhoese deleted the feature-delayed-save-datasets branch March 12, 2018 21:40