
Generated Dask graph is huge - performance issue? #1161

Closed
mangecoeur opened this issue Dec 12, 2016 · 8 comments

@mangecoeur
Contributor

mangecoeur commented Dec 12, 2016

I've been trying to get around some performance issues when subsetting a set of netCDF files opened with open_mfdataset. I managed to print out the generated dask graph for one variable and it doesn't seem right: it's huge (5000 elements) and seems to have a getitem entry for every requested element of that variable.

The code that generates this selection looks roughly like:

paths = WEATHER_MET['latlon'].glob('*_resampled.nc')
dataset = xr.open_mfdataset([str(p) for p in paths])
selection = dataset.sel(time=time_sel).sel_points(method='nearest', tolerance=0.1, lon=lon, lat=lat)
selection *= weights

and the graph for one variable in the selection (the irradiance value) looks like this:

mydask.pdf
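One way to see this scaling is to count the tasks in a variable's dask graph; a minimal sketch with a small synthetic dataset standing in for the netCDF files (dataset and variable names here are illustrative, not from the original files):

```python
import numpy as np
import xarray as xr

# small synthetic dataset standing in for the opened netCDF files
ds = xr.Dataset(
    {"irradiance": (("time", "lat", "lon"), np.random.rand(100, 4, 4))}
).chunk({"time": 10})

# total number of tasks dask would execute for this one variable
graph_size = len(dict(ds["irradiance"].data.dask))
print(graph_size)
```

Each pointwise selection adds its own getitem task, so after a sel_points over thousands of points the count balloons accordingly.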

@shoyer
Member

shoyer commented Dec 12, 2016

Indeed, sel_points currently uses a very inefficient implementation -- it indexes each point and then concatenates them together.

Could be worth experimenting with dask.array.Array.vindex to make a more scalable version: dask/dask#439
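For reference, dask.array's vindex does the pointwise selection in a single step rather than one task per point; a minimal sketch:

```python
import dask.array as da
import numpy as np

arr = da.from_array(np.arange(100).reshape(10, 10), chunks=(5, 5))
rows = np.array([0, 3, 7])
cols = np.array([1, 4, 9])

# pointwise selection, equivalent to NumPy's arr_np[rows, cols]
points = arr.vindex[rows, cols]
print(points.compute())  # → [ 1 34 79]
```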

@mangecoeur
Contributor Author

Ok, I will have a look. Where is this implemented? (I always seem to have trouble pinpointing the dask-specific bits in the codebase :S )

@shoyer
Member

shoyer commented Dec 12, 2016

# TODO: This would be sped up with vectorized indexing. This will
# require dask to support pointwise indexing as well.

Nothing about the logic is dask-specific currently.
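The inefficient pattern the TODO refers to, indexing each point separately and then concatenating the results, looks roughly like this (synthetic data, illustrative names):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"v": (("x", "y"), np.arange(16).reshape(4, 4))}).chunk(2)
xs, ys = [0, 1, 3], [2, 0, 1]

# one isel (and hence one getitem task) per requested point
points = xr.concat(
    [ds.isel(x=i, y=j) for i, j in zip(xs, ys)], dim="points"
)
print(points["v"].values)  # → [ 2  4 13]
```

The graph grows linearly with the number of points, which matches the 5000-element graph reported above.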

@mangecoeur
Contributor Author

mangecoeur commented Dec 12, 2016

Thanks, I've been looking around and I think I'm getting close. However, I'm not sure of the best way to turn the array slice I get from vindex into a DataArray variable. I'm thinking I might put together a draft PR for comments. This is what I have so far:

def isel_points(self, dim='points', **indexers):
    """Returns a new dataset with each array indexed pointwise along the
    specified dimension(s).

    This method selects pointwise values from each array and is akin to
    the NumPy indexing behavior of `arr[[0, 1], [0, 1]]`, except this
    method does not require knowing the order of each array's dimensions.

    Parameters
    ----------
    dim : str or DataArray or pandas.Index or other list-like object, optional
        Name of the dimension to concatenate along. If dim is provided as a
        string, it must be a new dimension name, in which case it is added
        along axis=0. If dim is provided as a DataArray or Index or
        list-like object, its name, which must not be present in the
        dataset, is used as the dimension to concatenate along and the
        values are added as a coordinate.
    **indexers : {dim: indexer, ...}
        Keyword arguments with names matching dimensions and values given
        by array-like objects. All indexers must be the same length and
        1 dimensional.

    Returns
    -------
    obj : Dataset
        A new Dataset with the same contents as this dataset, except each
        array and dimension is indexed by the appropriate indexers. With
        pointwise indexing, the new Dataset will always be a copy of the
        original.

    See Also
    --------
    Dataset.sel
    Dataset.isel
    Dataset.sel_points
    DataArray.isel_points
    """
    from .dataarray import DataArray

    indexer_dims = set(indexers)

    def relevant_keys(mapping):
        return [k for k, v in mapping.items()
                if any(d in indexer_dims for d in v.dims)]

    data_vars = relevant_keys(self.data_vars)
    coords = relevant_keys(self.coords)

    # all the indexers should be iterables
    keys = indexers.keys()
    indexers = [(k, np.asarray(v)) for k, v in iteritems(indexers)]
    # Check that indexers are valid dims, integers, and 1D
    for k, v in indexers:
        if k not in self.dims:
            raise ValueError("dimension %s does not exist" % k)
        if v.dtype.kind != 'i':
            raise TypeError('Indexers must be integers')
        if v.ndim != 1:
            raise ValueError('Indexers must be 1 dimensional')

    # all the indexers should have the same length
    lengths = set(len(v) for k, v in indexers)
    if len(lengths) > 1:
        raise ValueError('All indexers must be the same length')

    # Existing dimensions are not valid choices for the dim argument
    if isinstance(dim, basestring):
        if dim in self.dims:
            # dim is an invalid string
            raise ValueError('Existing dimension names are not valid '
                             'choices for the dim argument in sel_points')
    elif hasattr(dim, 'dims'):
        # dim is a DataArray or Coordinate
        if dim.name in self.dims:
            # dim already exists
            raise ValueError('Existing dimensions are not valid choices '
                             'for the dim argument in sel_points')

    if not utils.is_scalar(dim) and not isinstance(dim, DataArray):
        dim = as_variable(dim, name='points')

    variables = OrderedDict()
    indexers_dict = dict(indexers)
    non_indexed = list(set(self.dims) - indexer_dims)

    # TODO need to figure out how to make sure we get the indexed vs non indexed dimensions in the right order
    for name, var in self.variables.items():
        slc = []
        
        for k in var.dims:
            if k in indexers_dict:
                slc.append(indexers_dict[k])
            else:
                slc.append(slice(None, None))
        if hasattr(var.data, 'vindex'):
            variables[name] = DataArray(var.data.vindex[tuple(slc)], name=name)
        else:
            variables[name] = var[tuple(slc)]
    
    points_len = lengths.pop()
    
    new_variables = OrderedDict()
    for name, var in variables.items():
        if name not in self.dims:
            coords = [variables[k] for k in non_indexed]
            new_variables[name] = DataArray(var, coords=[np.arange(points_len)] + coords, dims=[dim] + non_indexed)

    return xr.merge([v for k, v in new_variables.items() if k not in self.dims])
    # TODO: This would be sped up with vectorized indexing. This will
    # require dask to support pointwise indexing as well.
#     return concat([self.isel(**d) for d in
#                    [dict(zip(keys, inds)) for inds in
#                     zip(*[v for k, v in indexers])]],
#                   dim=dim, coords=coords, data_vars=data_vars)
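The key step in the draft above, building a selector with pointwise index arrays along the indexed dims and full slices elsewhere before handing it to vindex, can be sketched in isolation (dimension names here are illustrative):

```python
import dask.array as da
import numpy as np

data = da.from_array(np.arange(60).reshape(5, 3, 4), chunks=2)
dims = ("time", "lat", "lon")
indexers = {"lat": np.array([0, 1]), "lon": np.array([2, 0])}

# pointwise arrays along indexed dims, full slices along the rest
slc = tuple(indexers.get(d, slice(None)) for d in dims)
points = data.vindex[slc]
print(points.shape)
```

Note that vindex places the pointwise dimension at the front of the result, which is part of why the draft's TODO about getting indexed vs non-indexed dimensions in the right order exists.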

@shoyer
Member

shoyer commented Dec 12, 2016

> I'm thinking I might put together a draft PR for comments

Yes, please do -- that's much easier to read and comment on than what you have here.

@mangecoeur
Contributor Author

Done with PR #1162

@mangecoeur
Contributor Author

Seems to run a lot faster for me too...

shoyer pushed a commit that referenced this issue Jan 23, 2017
* First draft of dask vindex enabled isel_points

WIP to use dask vindex to point based selection

* remove old sel points

* completely re-worked logic

Work to get tests passing - completely re-worked logic to allow for
wide range of Dataset formats and normal behaviour

* improve support for using array dim

* clean up comments

* Still trying to get semantics of dim and coords indexing right

* Tests for sel_points, isel_points working

* more tweaks, fixed tests after merge

* revert to origin test now working

* clean up coordinate construction

* further simplify coords construction

* Update tests to require generation of selection axis coordinate array

* revert change to test to reflect latest dataset behaviour

* fix coord generation to match latest dataset behaviour

Datasets no longer require/generate index coordinate for every dimension

* cleanup

* Simplified code by just looping once over variables

* tidy up

* #1162 further improvements

use _replace_vars_and_dims, simplify new dataset creation, preserve
attributes, clarify dim vs dim_name (don’t re-use variable name to
reduce confusion)

* Formatting

* remove impl detail from docstr

* Add performance improvements section and short descr
@shoyer
Member

shoyer commented Jan 23, 2017

Fixed by #1162

shoyer closed this as completed Jan 23, 2017