Generated Dask graph is huge - performance issue? #1161
Indeed, could be worth experimenting with.
Ok, I will have a look. Where is this implemented? (I always seem to have trouble pinpointing the dask-specific bits in the codebase :S)
Line 1080 in 1615a0f
Nothing about the logic is dask-specific currently.
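For context, dask's `vindex` attribute does pointwise (coordinate-pair) selection in one vectorized operation, rather than one `getitem` task per point. A minimal sketch of the difference being discussed:

```python
import numpy as np
import dask.array as da

# A small chunked array to index pointwise.
arr = da.from_array(np.arange(16).reshape(4, 4), chunks=2)

# Naive pointwise selection builds one getitem task per point,
# so the task graph grows with the number of points requested.
naive = [int(arr[i, j]) for i, j in [(0, 0), (1, 1), (3, 2)]]

# vindex performs the same selection as a single vectorized
# operation, keeping the task graph small.
vectorized = arr.vindex[[0, 1, 3], [0, 1, 2]].compute()
```

Both give the values at points (0, 0), (1, 1), and (3, 2); only the size of the generated graph differs.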
Thanks, I've been looking around and I think I'm getting close. However, I'm not sure of the best way to turn the array slice I get from vindex into a DataArray variable. I'm thinking I might put together a draft PR for comments. This is what I have so far:

```python
# Draft method on Dataset; relies on module-level names already imported
# in xarray's dataset.py (np, OrderedDict, iteritems, utils, as_variable, xr).
def isel_points(self, dim='points', **indexers):
    """Returns a new dataset with each array indexed pointwise along the
    specified dimension(s).

    This method selects pointwise values from each array and is akin to
    the NumPy indexing behavior of `arr[[0, 1], [0, 1]]`, except this
    method does not require knowing the order of each array's dimensions.

    Parameters
    ----------
    dim : str or DataArray or pandas.Index or other list-like object, optional
        Name of the dimension to concatenate along. If dim is provided as a
        string, it must be a new dimension name, in which case it is added
        along axis=0. If dim is provided as a DataArray or Index or
        list-like object, its name, which must not be present in the
        dataset, is used as the dimension to concatenate along and the
        values are added as a coordinate.
    **indexers : {dim: indexer, ...}
        Keyword arguments with names matching dimensions and values given
        by array-like objects. All indexers must be the same length and
        1 dimensional.

    Returns
    -------
    obj : Dataset
        A new Dataset with the same contents as this dataset, except each
        array and dimension is indexed by the appropriate indexers. With
        pointwise indexing, the new Dataset will always be a copy of the
        original.

    See Also
    --------
    Dataset.sel
    Dataset.isel
    Dataset.sel_points
    DataArray.isel_points
    """
    from .dataarray import DataArray

    indexer_dims = set(indexers)

    def relevant_keys(mapping):
        return [k for k, v in mapping.items()
                if any(d in indexer_dims for d in v.dims)]

    data_vars = relevant_keys(self.data_vars)
    coords = relevant_keys(self.coords)

    # all the indexers should be iterables
    keys = indexers.keys()
    indexers = [(k, np.asarray(v)) for k, v in iteritems(indexers)]

    # Check that indexers are valid dims, integers, and 1D
    for k, v in indexers:
        if k not in self.dims:
            raise ValueError("dimension %s does not exist" % k)
        if v.dtype.kind != 'i':
            raise TypeError('Indexers must be integers')
        if v.ndim != 1:
            raise ValueError('Indexers must be 1 dimensional')

    # all the indexers should have the same length
    lengths = set(len(v) for k, v in indexers)
    if len(lengths) > 1:
        raise ValueError('All indexers must be the same length')

    # Existing dimensions are not valid choices for the dim argument
    if isinstance(dim, basestring):
        if dim in self.dims:
            # dim is an invalid string
            raise ValueError('Existing dimension names are not valid '
                             'choices for the dim argument in sel_points')
    elif hasattr(dim, 'dims'):
        # dim is a DataArray or Coordinate
        if dim.name in self.dims:
            # dim already exists
            raise ValueError('Existing dimensions are not valid choices '
                             'for the dim argument in sel_points')

    if not utils.is_scalar(dim) and not isinstance(dim, DataArray):
        dim = as_variable(dim, name='points')

    variables = OrderedDict()
    indexers_dict = dict(indexers)
    non_indexed = list(set(self.dims) - indexer_dims)

    # TODO: need to figure out how to make sure we get the indexed vs
    # non-indexed dimensions in the right order
    for name, var in self.variables.items():
        slc = []
        for k in var.dims:
            if k in indexers_dict:
                slc.append(indexers_dict[k])
            else:
                slc.append(slice(None, None))
        if hasattr(var.data, 'vindex'):
            # dask arrays support pointwise indexing via vindex
            variables[name] = DataArray(var.data.vindex[tuple(slc)],
                                        name=name)
        else:
            variables[name] = var[tuple(slc)]

    points_len = lengths.pop()

    new_variables = OrderedDict()
    for name, var in variables.items():
        if name not in self.dims:
            coords = [variables[k] for k in non_indexed]
            new_variables[name] = DataArray(
                var, coords=[np.arange(points_len)] + coords,
                dims=[dim] + non_indexed)

    return xr.merge([v for k, v in new_variables.items()
                     if k not in self.dims])

    # TODO: This would be sped up with vectorized indexing. This will
    # require dask to support pointwise indexing as well.
    # return concat([self.isel(**d) for d in
    #                [dict(zip(keys, inds)) for inds in
    #                 zip(*[v for k, v in indexers])]],
    #               dim=dim, coords=coords, data_vars=data_vars)
```
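The vindex-based draft above targets NumPy-style pointwise (fancy) indexing rather than orthogonal indexing. For reference, the difference in plain NumPy:

```python
import numpy as np

data = np.arange(12).reshape(3, 4)

# Pointwise: selects the pairs (0, 0) and (2, 3), giving a 1-D result.
points = data[[0, 2], [0, 3]]

# Orthogonal: selects rows [0, 2] crossed with columns [0, 3],
# giving a 2x2 block instead.
block = data[np.ix_([0, 2], [0, 3])]
```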
Yes, please do -- that's much easier to read and comment on than what you have here.
Done with PR #1162
Seems to run a lot faster for me too...
* First draft of dask vindex enabled isel_points: WIP to use dask vindex for point-based selection
* remove old sel points
* completely re-worked logic: work to get tests passing; re-worked to allow for a wide range of Dataset formats and normal behaviour
* improve support for using array dim
* clean up comments
* Still trying to get semantics of dim and coords indexing right
* Tests for sel_points, isel_points working
* more tweaks, fixed tests after merge
* revert to origin test now working
* clean up coordinate construction
* further simplify coords construction
* Update tests to require generation of selection axis coordinate array
* revert change to test to reflect latest dataset behaviour
* fix coord generation to match latest dataset behaviour: Datasets no longer require/generate an index coordinate for every dimension
* cleanup
* Simplified code by just looping once over variables
* tidy up
* #1162 further improvements: use _replace_vars_and_dims, simplify new dataset creation, preserve attributes, clarify dim vs dim_name (don't re-use variable name to reduce confusion)
* Formatting
* remove impl detail from docstr
* Add performance improvements section and short descr
Fixed by #1162
I've been trying to get around some performance issues when subsetting a set of netCDF files opened with `open_mfdataset`. I managed to print out the generated dask graph for one variable and it doesn't seem right: it's huge (5000 elements) and seems to have a getitem entry for every requested element of that variable. The code that generates this select looks roughly like:
and the graph for one variable in the select (the irradiance value) looks like this:
mydask.pdf
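The attached graph isn't reproducible here, but the kind of blow-up described is easy to demonstrate on a synthetic array (not the original code): each per-element getitem adds a task, and the graph size can be inspected directly.

```python
import numpy as np
import dask.array as da

arr = da.from_array(np.arange(1000), chunks=100)

# Select 200 individual elements the slow way: one slice (and at
# least one graph task) per requested element, plus concatenation.
picks = list(range(0, 1000, 5))
slow = da.concatenate([arr[i:i + 1] for i in picks])

# The same selection through vindex keeps the task count close to
# the number of underlying chunks instead of the number of points.
fast = arr.vindex[picks]

slow_tasks = len(dict(slow.__dask_graph__()))
fast_tasks = len(dict(fast.__dask_graph__()))
```

Both produce the same values; only `slow_tasks` scales with the number of requested elements.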