
Implement grid concatenation and standardize datatype casting #2762

Merged
merged 14 commits into master from cast_and_concat on Jun 22, 2018

@philippjfr
Contributor

philippjfr commented Jun 2, 2018

This PR has two main aims:

  1. Standardize and simplify how casting between different datatypes works
  2. Implement concatenation for all datatypes

Before this PR, both casting and concatenation were limited to columnar data formats, which meant that certain operations, e.g. a HoloMap collapse, could not be applied to gridded data. Having a dedicated concat implementation for both columnar and gridded data also allows much more efficient concatenation than the approach currently used by methods like .table and .dframe, and will generalize them so that we can eventually replace the column-specific .table implementation with a general one that returns a dataset of arbitrary type.

Implementing concatenation along HoloMap dimensions also means that Dataset.groupby operations are now reversible, which fixes HoloMap.collapse.

  • Fixes #1417
  • Adds unit tests
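The "groupby is now reversible" claim can be illustrated with a minimal plain-Python sketch (not the HoloViews API; `groupby` and `concat` here are hypothetical stand-ins): splitting a columnar table by a key and then concatenating the groups recovers the original data.

```python
# Minimal sketch of why concatenation makes groupby reversible: splitting
# rows by a key column and concatenating the groups back recovers the
# original table (up to row order).

def groupby(table, key):
    """Split a columnar table (dict of lists) into per-key sub-tables."""
    groups = {}
    for i, k in enumerate(table[key]):
        group = groups.setdefault(k, {c: [] for c in table})
        for c, values in table.items():
            group[c].append(values[i])
    return groups

def concat(tables):
    """Concatenate columnar tables column by column."""
    out = {c: [] for c in tables[0]}
    for t in tables:
        for c in out:
            out[c].extend(t[c])
    return out

table = {'group': ['a', 'a', 'b'], 'value': [1, 2, 3]}
roundtrip = concat(list(groupby(table, 'group').values()))
```

Here the round trip reproduces the input exactly because the groups preserve insertion order; in general only the multiset of rows is guaranteed.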
@@ -1532,7 +1531,7 @@ def groupby_python(self_or_cls, ndmapping, dimensions, container_type,
         selects = get_unique_keys(ndmapping, dimensions)
         selects = group_select(list(selects))
         groups = [(k, group_type((v.reindex(idims) if hasattr(v, 'kdims')
-                                  else [((), (v,))]), **kwargs))
+                                  else [((), v)]), **kwargs))


@philippjfr

philippjfr Jun 2, 2018

Author Contributor

These are NdMapping implementation details left over from when we had an NdElement, the precursor to datasets; they now lead to strange behavior.

datatypes = ['dictionary', 'grid']

try:
    import pandas as pd # noqa (Availability import)
    from .pandas import PandasInterface
    default_datatype = 'dataframe'


@philippjfr

philippjfr Jun 2, 2018

Author Contributor

When converting from gridded to columnar data, the code usually has to cast the data to a specific datatype. Various places hardcoded ['pandas', 'dictionary'] for this; defining a default_datatype avoids having to hardcode it all over the place.
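A hypothetical, trimmed version of the pattern (the `except` branch and the `.clone` call below are illustrative, not the exact module code): the columnar default is computed once at import time and call sites reuse it instead of hardcoding a fallback list.

```python
# Compute the preferred columnar datatype once, at import time.
try:
    import pandas as pd  # noqa (availability import)
    default_datatype = 'dataframe'
except ImportError:
    default_datatype = 'dictionary'

# Call sites then request the columnar default instead of hardcoding
# ['dataframe', 'dictionary'] everywhere, e.g. (hypothetical):
#     dataset.clone(datatype=[default_datatype])
```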


@jbednar

jbednar Jun 2, 2018

Contributor

Shouldn't this be default_columnar_datatype, then? Or are there no cases where columnar data needs to be cast into some gridded data type?


@philippjfr

philippjfr Jun 3, 2018

Author Contributor

Columnar data cannot be cast to gridded data without some kind of aggregation occurring, so that's correct. I would still be okay with changing it to default_columnar_datatype.

arrays = [grid[vdim.name] for grid in grids]
stack = da.stack if any(is_dask(arr) for arr in arrays) else np.stack
new_data[vdim.name] = stack(arrays, -1)
return new_data


@philippjfr

philippjfr Jun 2, 2018

Author Contributor

Since arrays cannot be concatenated along multiple axes at once, the concat implementation on gridded interfaces has two components: a general concat method coordinates hierarchical concatenation along each dimension, using the interface-specific concat_dim method implementations to concatenate along one particular axis or dimension.
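The two-level scheme described above can be sketched in plain NumPy (the helper names here are hypothetical, not the HoloViews interface methods): a generic `concat` walks the key dimensions one at a time and delegates single-axis concatenation to a `concat_dim` implementation.

```python
import numpy as np

def concat_dim(arrays, axis):
    """Interface-specific piece: concatenate grids along a single axis."""
    return np.concatenate(arrays, axis=axis)

def concat(keyed_grids, ndim):
    """Hierarchically concatenate a {key-tuple: array} mapping,
    one dimension at a time, innermost key dimension first."""
    if ndim == 0:
        (grid,) = keyed_grids.values()
        return grid
    groups = {}
    for key, grid in sorted(keyed_grids.items()):
        groups.setdefault(key[:-1], []).append(grid)   # group on outer keys
    merged = {k: concat_dim(v, axis=ndim - 1) for k, v in groups.items()}
    return concat(merged, ndim - 1)

# Four 2x2 blocks keyed by (i, j) assemble into one 4x4 grid.
grids = {(i, j): np.full((2, 2), 10 * i + j) for i in range(2) for j in range(2)}
result = concat(grids, ndim=2)
```

Each recursion level removes one key dimension, so after `ndim` passes a single fully concatenated array remains.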

@philippjfr

Contributor Author

philippjfr commented Jun 2, 2018

Made some additional comments to clarify certain implementation details.

    cast = []
    for ds in datasets:
        if cast_type is not None or ds.interface.datatype != datatype:
            ds = ds.clone(ds, datatype=[datatype], new_type=cast_type)


@philippjfr

philippjfr Jun 2, 2018

Author Contributor

Casting works quite simply: if Interface.initialize is passed another dataset and finds a mismatch between the supplied datatype and the requested datatype, it deconstructs the original dataset into the columnar or gridded tuple format, which is supported by all interfaces. In this way a dataset can easily be cast to any other datatype, except for columnar -> gridded conversions.
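The deconstruct-and-reinitialize idea can be sketched with two toy interfaces (all names here are hypothetical, not the HoloViews interface classes): as long as every interface can export its data to a shared columnar format, casting is a round trip through that format.

```python
# Two mini-interfaces that share a common columnar export format.

class DictInterface:
    datatype = 'dictionary'
    @staticmethod
    def to_columns(data):
        return dict(data)                 # already a dict of columns
    @staticmethod
    def init(columns):
        return dict(columns)

class ListInterface:
    datatype = 'list'
    @staticmethod
    def to_columns(data):
        names, rows = data
        return {n: [r[i] for r in rows] for i, n in enumerate(names)}
    @staticmethod
    def init(columns):
        names = list(columns)
        return names, list(zip(*columns.values()))

def cast(data, source, target):
    """Cast by deconstructing into the shared columnar format."""
    if source.datatype == target.datatype:
        return data
    return target.init(source.to_columns(data))

rows = cast({'x': [0, 1], 'y': [2, 3]}, DictInterface, ListInterface)
```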

@philippjfr

Contributor Author

philippjfr commented Jun 2, 2018

As a followup to this PR we should provide special handling for dask arrays/dataframes during casting. This requires multiple things:

  • Interfaces need to declare if they support lazy data
  • Interfaces need to declare an API to check if the data for a dimension is lazy
  • The .values method on Interfaces needs to provide an option to return a lazy (i.e. dask) array
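The three declarations proposed above might look something like the following sketch; none of these names are real HoloViews API, they only illustrate the proposed shape.

```python
class Interface:
    lazy = False                          # 1. does this interface support lazy data?

    @classmethod
    def is_lazy(cls, dataset, dim):       # 2. is this dimension's data lazy?
        return False

    @classmethod
    def values(cls, dataset, dim, compute=True):
        """3. Optionally return the underlying lazy (e.g. dask) array."""
        data = dataset.data[dim]
        if compute and cls.is_lazy(dataset, dim):
            return data.compute()         # materialize lazy backends
        return data

# Tiny stand-in for a dataset, just for illustration.
class FakeDataset:
    def __init__(self, data):
        self.data = data

vals = Interface.values(FakeDataset({'x': [1, 2, 3]}), 'x')
```

A dask-backed interface would set `lazy = True` and implement `is_lazy` by inspecting the stored arrays.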

@philippjfr philippjfr added this to the v1.10.5 milestone Jun 2, 2018

@philippjfr philippjfr force-pushed the cast_and_concat branch from 3a8b95a to b16c799 Jun 2, 2018

@philippjfr philippjfr modified the milestones: v1.10.5, v1.10.6 Jun 10, 2018

@philippjfr philippjfr force-pushed the cast_and_concat branch from b16c799 to 67bd62a Jun 20, 2018

@philippjfr

Contributor Author

philippjfr commented Jun 20, 2018

Ready for review.

@philippjfr philippjfr removed the WIP label Jun 20, 2018

"""
Concatenates multiple datasets wrapped in an NdMapping type
along all of its dimensions. Before concatenation all datasets
are cast to the same datatype. For columnar data concatenation


@jlstevens

jlstevens Jun 20, 2018

Contributor

'same datatype' determined how?


@philippjfr

philippjfr Jun 20, 2018

Author Contributor

Either explicitly defined or the type of the first dataset that was passed in.


@jlstevens

jlstevens Jun 20, 2018

Contributor

It would be good to state that the datatype is chosen from the first dataset when it is not explicitly set.
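The selection rule discussed in this exchange reduces to a one-liner; this sketch uses hypothetical names to show the precedence (explicit datatype wins, otherwise the first dataset's interface datatype).

```python
from types import SimpleNamespace

def choose_datatype(datasets, datatype=None):
    """Explicit datatype if given, else the first dataset's datatype."""
    return datatype if datatype is not None else datasets[0].interface.datatype

# Stand-in object mimicking ds.interface.datatype, for illustration only.
ds = SimpleNamespace(interface=SimpleNamespace(datatype='grid'))
```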

        datasets = datasets.items()
        keys, datasets = zip(*datasets)
    elif isinstance(datasets, list) and not any(isinstance(v, tuple) for v in datasets):
        keys = [()]*len(datasets)


@jlstevens

jlstevens Jun 20, 2018

Contributor

What are all these empty tuple keys for? Just to get things in the right format?


@philippjfr

philippjfr Jun 20, 2018

Author Contributor

Right, concatenate is usually meant for concatenating along some dimension, but you can also concatenate a simple list of datasets without concatenating along any dimension. For that case we generate empty-tuple keys. Happy to add a comment. Separately, I also need to assert that this only happens for tabular data, since gridded data must be concatenated along some dimension.
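The normalization discussed here can be sketched in plain Python (the function name is hypothetical): a bare list of datasets is padded with empty-tuple keys so that keyed and unkeyed inputs end up in the same (keys, values) format.

```python
def normalize(datasets):
    """Return parallel (keys, values) lists for dict-like, keyed-list,
    or plain-list input."""
    if isinstance(datasets, dict):
        keys, values = zip(*datasets.items())
    elif not any(isinstance(v, tuple) for v in datasets):
        keys, values = [()] * len(datasets), datasets   # pad with empty keys
    else:
        keys, values = zip(*datasets)
    return list(keys), list(values)

keys, values = normalize(['ds1', 'ds2'])
```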

@@ -4,6 +4,9 @@
 from itertools import product

 import iris
+from iris.coords import DimCoord
+from iris.cube import CubeList
+from iris.experimental.equalise_cubes import equalise_attributes


@jlstevens

jlstevens Jun 20, 2018

Contributor

Will be good to have the iris interface moved to geoviews. Could this be done for 1.10.6?


@philippjfr

philippjfr Jun 20, 2018

Author Contributor

Tests need to be moved into the holoviews package first.


@jlstevens

jlstevens Jun 20, 2018

Contributor

You mean 'geoviews' package?


@philippjfr

philippjfr Jun 20, 2018

Author Contributor

No, I mean the /tests need to move to /holoviews/tests.


@philippjfr

philippjfr Jun 20, 2018

Author Contributor

The interface tests are defined as mix-in classes, so if I want to run them in geoviews I have to be able to import them from holoviews. We also promised this to the bokeh folks so they can run our bokeh unit tests easily.

-            col_data = group.last.clone(data)
-            collapsed[key] = col_data
             return collapsed if self.ndims > 1 else collapsed.last
+            group_data = group.last.clone(data)


@jlstevens

jlstevens Jun 20, 2018

Contributor

group_data can be a whole load of different things at different times. Not critical, but I would prefer a name that isn't clobbered so much.

@jlstevens

Contributor

jlstevens commented Jun 20, 2018

Other than a few minor comments this looks good and I'm happy to merge.

@jlstevens

Contributor

jlstevens commented Jun 22, 2018

Tests are green. Merging.

@jlstevens jlstevens merged commit e5d4adc into master Jun 22, 2018

4 checks passed

  • continuous-integration/travis-ci/pr: The Travis CI build passed
  • continuous-integration/travis-ci/push: The Travis CI build passed
  • coverage/coveralls: Coverage increased (+0.3%) to 83.071%
  • s3-reference-data-cache: Test data is cached.

@philippjfr philippjfr deleted the cast_and_concat branch Jul 4, 2018
