Adding a general interface for N-dimensional gridded data #542
I think this will be great! Just to summarize the discussion I've just had with Philipp for future reference:
Anyway, these are my general comments for now. I'm happy to review the code in more detail once you think it is ready.
I'm very happy with this proposal. I think it really will help HoloViews work well in a broad range of other applications, and is worth taking the effort to work on now.
Okay, as far as I can tell I'm now done with this PR. @jlstevens said he'd go through and document the class and methods so he can get a better idea about the implementation. So I'll wait on that to make any more changes, as I'm sure he'll find some further issues.
Note that a lot of the work to allow Raster, Image, Histogram and QuadMesh types to use dense interfaces has been postponed and is not part of this PR. Hopefully for version 1.5 we can unify all these types together, leaving only Path and Annotation types with custom data formats.
I'm going to go through this PR carefully now, making sure I understand it, making comments and updating docstrings as necessary. Then once those issues are addressed I think it can be merged.
@@ -469,6 +464,11 @@ def validate(cls, columns):

    @classmethod
    def check_dense(cls, arrays):
        return any(array.shape not in [arrays[0].shape, (1,)] for array in arrays[1:])
If I understand this code correctly, check_compressed might be a better name...
Edit: How about inverting it and calling it expanded_format?
Done.
Okay, I've gone through and made all the fixes you suggested and tests should pass in a minute. If you could go through it and add docstrings then I think this is ready to merge. Only other thing we should decide is whether to add 'grid' to the
I've had a go updating the class docstring for Once the PR tests pass, I'm happy to merge.
Ok, the PR build is passing. Time to merge!
In HoloViews we now have an interface to hold data in a columnar format. This provides a very powerful interface for some kinds of data; however, when exploring dense high-dimensional arrays it is wasteful because it expands the coordinates of the key dimensions. An alternative format, used by xarray, iris and in some limited ways pandas, stores the index values (or coordinates, as they are sometimes called) separately from the value dimension data, which is stored as an n-dimensional array.
Instead of storing the cartesian product of all key dimension values we store only the outer indices. Working in a columnar format is often a lot easier because merging or adding new or derived dimensions is simpler. However, it is not only inefficient in terms of space but also considerably slower for various operations, particularly groupby, aggregation and reduce operations.
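To make the space and reduce argument concrete, here is a minimal NumPy sketch, not taken from the PR itself. The arrays x, y and z are made-up example data, and the layout assumes x indexes the first axis of z and y the second.

```python
import numpy as np

# Hypothetical example data: outer indices along each axis of z.
x = np.arange(3)                 # coordinates along the first axis
y = np.arange(4) * 0.5           # coordinates along the second axis
z = np.arange(12, dtype=float).reshape(len(x), len(y))

# Columnar form: one row per sample, holding the cartesian product
# of the key dimension values next to the flattened values.
xx, yy = np.meshgrid(x, y, indexing="ij")
columns = np.column_stack([xx.ravel(), yy.ravel(), z.ravel()])

# Number of stored elements in each representation.
dense_size = x.size + y.size + z.size
columnar_size = columns.size

# Reducing over y: the dense form needs a single axis reduction.
dense_mean = z.mean(axis=1)

# The columnar form first has to group the rows by their x value.
keys, inverse = np.unique(columns[:, 0], return_inverse=True)
sums = np.bincount(inverse, weights=columns[:, 2])
counts = np.bincount(inverse)
columnar_mean = sums / counts
```

Both routes give the same means, but the dense form stores 19 elements here against 36 for the columnar form, and the gap widens with every additional key dimension.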
The proposal
The HoloViews Columns interface actually provides a very general way to work with structured data and in theory does not restrict the format of the data. In this notebook I will outline a suggestion to add additional interfaces for the Columns type, which work with N-D gridded data, from here on referred to as dense data. I will set out to show that not only is this format more efficient for various operations, but the implementation is actually fairly simple and fits into our current system.
The datastructure
The current Columns interfaces already have different datastructures, which all fundamentally represent an array of the shape Row x Column, where the rows represent the total number of samples and the columns the combined key and value dimensions. This is fundamentally no different from the COO (Coordinate) sparse matrix format (except that the r, c indices are actually values). The new format would provide a dense equivalent and would differ from these existing implementations in the following ways:
Here the x and y arrays provide the indexes along the first and second axis of the z-array. Using the current formats this would have to be specified as:
Instead of storing the cartesian product as computed by meshgrid, the internal representation stores just the outer indices. The interface then expands these indices if required (which would generally be pretty rare).
or like this:
This comes down to whether the interface should support heterogeneous value dimension types. The current proposal works on the first suggestion but it would be trivial to automatically expand the first format into the second format and store the value dimensions separately internally.
Pros vs Cons
Pros:
Cons:
Obstacles/Problems
dimension_values accepts a product argument (or similar) that defaults to False, returning the full cartesian product by default. Additionally it would also support a flat argument defaulting to True to retain a consistent backward compatible interface.
as_dense and as_sparse methods that convert between dense and sparse representations. The as_sparse representation is obviously very straightforward as it's just the cartesian product of the dense representation. The as_dense implementation requires that the data has been aggregated already. After that it's most easily implemented by combining the sparse columns with a cartesian product of the key dimensions inserting NaNs for all values, aggregating, sorting and reshaping.
To-do list:
dimension_values (interface proposed in Columns row/column based indexing API #541 postponed).
Notebook with examples and profiling
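As a rough illustration of the as_sparse and as_dense conversions discussed under Obstacles/Problems, here is a small NumPy sketch. The function signatures are assumptions made for this example and are not the PR's actual API. Note also that as_dense here places values by index lookup rather than by the merge, aggregate, sort and reshape route described above; for already aggregated data the result should be the same.

```python
import numpy as np

def as_sparse(x, y, z):
    """Expand outer indices (x, y) and a 2-D z array into rows of (x, y, z)."""
    xx, yy = np.meshgrid(x, y, indexing="ij")
    return np.column_stack([xx.ravel(), yy.ravel(), z.ravel()])

def as_dense(columns):
    """Rebuild (x, y, z) from rows of (x, y, z); missing cells become NaN.

    Assumes the rows are already aggregated, i.e. each (x, y) pair
    occurs at most once. Row order does not matter.
    """
    x, xi = np.unique(columns[:, 0], return_inverse=True)
    y, yi = np.unique(columns[:, 1], return_inverse=True)
    z = np.full((x.size, y.size), np.nan)
    z[xi, yi] = columns[:, 2]
    return x, y, z

# Hypothetical example data.
x = np.array([0.0, 1.0, 2.0])
y = np.array([10.0, 20.0])
z = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

rows = as_sparse(x, y, z)

# Reverse the rows and drop the (0, 10) sample to mimic unsorted,
# incomplete columnar data.
shuffled = rows[::-1][:-1]
x2, y2, z2 = as_dense(shuffled)
```

In this example z2 matches z everywhere except the dropped (0, 10) cell, which is filled with NaN.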