Support pivot groups for cross tabulation #48
…efined groups. The implementation is the simplest thing that can work: it doesn't respect filter changes, doesn't handle data added after it is first executed, and is not optimized for performance.
Significant performance improvement over brute force search (no surprise). 71ms to group 100k records into a 3 dimensional pivot group with 2800 distinct keys (was taking ~1900ms before with brute force search).
Pretty cool ;) I wrote dc-js/dc.js#91 to serve the same purpose! |
Thanks. I had seen your fork as well, but for my project I was looking for an API where I wouldn't have to think about the pivot grouping when writing the reduce functions. I haven't looked at how to tie this into one of the various charting components that use crossfilter since currently I'm using crossfilter inside node to create Excel and PDF tabular reports. Hopefully there will be a future version of crossfilter with some version of pivot groups built in. |
I'm not sure I understand the question Brandon asked about "imagine trying The pivot group just supports grouping the records by the dimension and the On Thu, Feb 7, 2013 at 10:21 AM, Brandon notifications@github.com wrote:
|
Yea.. never mind on that. I figured out my problem.

var cf = crossfilter(myData)
  , d1 = cf.dimension(function(d){return d.key1})
  , d2 = cf.dimension(function(d){return d.key2})
  , g1 = d1.group()
  , g2 = d2.group()
  , x = g1.all()
  , s = []
while(x.length) s.push(x.shift()) // empties the array returned by all()
pg = cf.pivotGroup([g1,g2]) // Error should occur here

That is just an example to expose what is happening. It doesn't do anything interesting. |
Has anything been done with this in the last few months? |
I've been using it within my node.js application for report generation for the past several months, but haven't heard from anyone else using it. |
Any plans to merge this into crossfilter mainline? Looks like an important addition for crossfilter. |
I would also be interested in this being made more official |
Thanks for the contribution. Note that the equivalent can be achieved by creating a special dimension with a unique value for each possible combination of pivot groups:

var pivotGroup = c
    .dimension(function(d) { return d.gender + "/" + d.handed; })
    .group(); // default groups use the dimension value

This requires a somewhat careful choice of separator ("/" seems convenient here, but you could use any character, even something like "\0"). A special dimension is required because group keys can only be based on the dimension value. Alternatively, perhaps arbitrary groups on a “dummy dimension”, similar to crossfilter.groupAll, could be supported, allowing group keys to be generated based on a record.

I’m inclined to keep the API simple and fast, and the use of a special dimension seems reasonable here. The advantage over your pivotGroup implementation is that other standard group methods are also available, such as top and order, and the special dimension can also be removed when no longer needed. I’m closing for now, but feel free to add comments if you think there’s a problem with my approach, or an advantage to having a specialised API. |
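To make the compound-key idea concrete, here is a minimal sketch in plain JavaScript that does not use crossfilter at all. The sample `records` and the `groupByCompoundKey` helper are hypothetical and exist only for illustration; in crossfilter the special dimension and its default group would do this work.

```javascript
// Hypothetical sample records; field names follow the example above.
const records = [
  { gender: "F", handed: "L" },
  { gender: "F", handed: "R" },
  { gender: "M", handed: "R" },
  { gender: "F", handed: "R" },
];

// Count records per compound key, built by joining fields with a separator.
// The separator must not occur inside any field value, or keys can collide.
function groupByCompoundKey(rows, fields, separator) {
  const counts = new Map();
  for (const row of rows) {
    const key = fields.map((f) => row[f]).join(separator);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  // Return {key, value} entries sorted by key in ascending string order.
  return [...counts.entries()]
    .map(([key, value]) => ({ key, value }))
    .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

const groups = groupByCompoundKey(records, ["gender", "handed"], "/");
console.log(groups);
```

Each distinct gender/handed combination ends up as one string key, which is why the choice of separator matters.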
The alternative of creating a compound dimension using a separator is reasonable, and I've used it in other situations. One drawback is that I've encountered performance issues using the compound dimension technique in other situations due to the string manipulation. That said, I don't have head-to-head benchmarks of the compound dimension technique against this pull request. With crossfilter, I wanted something that took advantage of the existing group indexes and had an API expressed in terms of tuples as opposed to the compound strings. |
Hi @jasondavies, One problem with the string concatenation solution is that it's very messy to supply tuples that include numbers. They have to be zero-padded. I wonder if you would consider a PR for native support of tuple keys? |
Why do they need to be zero-padded? As long as you pick a suitable separator, no padding is necessary ( |
|
Similar problem if the character code is greater than digits:
I could be wrong, but I don't think lexicographical ordering ever works for tuples of variable length fields. |
Right, the array of all groups as returned by group.all will be in ascending natural order, so I agree the order could be unexpected if you have variable-length keys. However, the grouping behaviour and top-K groups will all work as expected, which is the main point of using a dummy dimension in this manner. I’d be open to considering proper tuple support, but only if performance remains reasonably fast. Though not ideal, you can always clone the returned group.all array and sort it afterwards if you really need it to be sorted differently. |
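The ordering issue, and the clone-and-sort workaround, can be sketched in plain JavaScript. The `keys` array and the `compareTuples` helper are hypothetical; they stand in for the string keys a compound dimension would produce from two numeric fields.

```javascript
// Compound keys built from two numeric fields, joined with "/".
const keys = ["1/2", "1/19", "1/1", "2/9", "1/10"];

// Default string ordering compares character by character,
// so "1/19" sorts before "1/2".
const lexicographic = [...keys].sort();

// Compare keys as tuples of numbers instead, field by field.
function compareTuples(a, b) {
  const ta = a.split("/").map(Number);
  const tb = b.split("/").map(Number);
  for (let i = 0; i < Math.min(ta.length, tb.length); i++) {
    if (ta[i] !== tb[i]) return ta[i] - tb[i];
  }
  return ta.length - tb.length;
}

// Sort a copy, leaving the original array untouched.
const numeric = [...keys].sort(compareTuples);

console.log(lexicographic); // string order
console.log(numeric);       // tuple order
```

Sorting a copy rather than the array itself matters here, since the array returned by a group may be shared with the library's internal state.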
Yes, that almost works, but for example, I would want to be able to select the range [1,1] -> [1,19], but without native support [1,2] would be left out. I think it boils down to parameterizing the ordering function. So there wouldn't be any extra logic during sorting. I would hope this isn't worse than the valueOf() call that is already happening, and you'd pay for what you need. I can benchmark it and find out. Thanks for considering it. |
The dummy dimension is really only intended as a way to group by multiple keys at once, rather than for filtering. If you want to filter, you can use an individual dimension that has the correct ordering. So for your example you could say:

dimension0.filterExact(1);
dimension1.filterRange([1, 19]); |
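The effect of combining the two filters can be emulated in plain JavaScript, as a rough sketch. The `points` data and the `exact` and `range` helpers are hypothetical stand-ins, not crossfilter calls. The sketch assumes the range is half-open (lower bound included, upper bound excluded), which is how crossfilter documents filterRange, so a value of 19 would not be selected.

```javascript
// Hypothetical records keyed by two numeric fields.
const points = [
  { a: 1, b: 1 },
  { a: 1, b: 2 },
  { a: 1, b: 18 },
  { a: 1, b: 19 },
  { a: 2, b: 5 },
];

// Stand-in for an exact-match filter on one field.
const exact = (value) => (x) => x === value;

// Stand-in for a range filter: lower bound inclusive, upper bound exclusive.
const range = ([lo, hi]) => (x) => x >= lo && x < hi;

// A record is selected only if it passes the filter on every field.
const selected = points.filter(
  (p) => exact(1)(p.a) && range([1, 19])(p.b)
);

console.log(selected);
```

Because each field is compared numerically on its own, [1,2] is included without any zero-padding.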
Ah, right. Thanks, got it. And it would be very unlikely that one would want to filter by e.g. range [1,10] -> [2,9]. I will see if I can make some wrappers to hide the mess. |
This is a first cut at adding a "pivotGroup" to maintain cross tabulations across multiple dimensions.
This has tests and decent performance, but needs a bit more work to be generally useful (in particular it does not update correctly if data is added to the crossfilter after the pivotGroup is created). There is some duplication of code with the regular group object that could also be eliminated. Happy to spend more time on this if there is interest in merging it upstream.
The API is very simple. The crossfilter object has a new method pivotGroup that takes an array of group objects as the dimensions for cross tabulation. The pivotGroup has reduce(), reduceCount(), and reduceSum() methods to define the reduce operation in the same manner as regular group objects. The pivotGroup only has two methods for accessing results: size() and all() which also behave similarly to regular group objects, with the keys for the results returned in all() being an array of key values for each dimension. Results from all() are returned sorted.
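As a rough picture of the result shape described above, the following plain-JavaScript sketch builds a cross tabulation with tuple keys and a count per cell. It is not the implementation in this pull request: the `pivotCount` function is hypothetical, and it takes accessor functions rather than crossfilter group objects so that it can run on its own.

```javascript
// Hypothetical records with two fields to cross-tabulate.
const rows = [
  { key1: "a", key2: 1 },
  { key1: "b", key2: 2 },
  { key1: "a", key2: 1 },
  { key1: "a", key2: 2 },
];

// Count records for each combination of accessor values.
// Returns [{ key: [v1, v2, ...], value: count }, ...] sorted by key tuple.
function pivotCount(data, accessors) {
  const cells = new Map();
  for (const d of data) {
    const tuple = accessors.map((fn) => fn(d));
    const id = JSON.stringify(tuple); // internal lookup id only
    const cell = cells.get(id) || { key: tuple, value: 0 };
    cell.value += 1;
    cells.set(id, cell);
  }
  // Sort cells by comparing key tuples element by element.
  return [...cells.values()].sort((x, y) => {
    for (let i = 0; i < x.key.length; i++) {
      if (x.key[i] < y.key[i]) return -1;
      if (x.key[i] > y.key[i]) return 1;
    }
    return 0;
  });
}

const table = pivotCount(rows, [(d) => d.key1, (d) => d.key2]);
console.log(table);
```

Keeping the key as an array means callers never have to split or parse a compound string.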
Construction performance is enhanced by using the FNV-1a hash function from bloomfilter.js by jasondavies.
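For readers unfamiliar with FNV-1a, here is a minimal sketch of the standard 32-bit variant in JavaScript. The function name `fnv1a32` is hypothetical, and the exact variant used in this pull request and in bloomfilter.js is not shown here and may differ (for example in how it mixes or truncates the result).

```javascript
// Standard 32-bit FNV-1a over the UTF-8 bytes of a string.
function fnv1a32(text) {
  const bytes = new TextEncoder().encode(text);
  let hash = 0x811c9dc5; // 32-bit FNV offset basis
  for (const byte of bytes) {
    hash ^= byte; // XOR the byte in first (the "1a" ordering)
    hash = Math.imul(hash, 0x01000193); // multiply by the 32-bit FNV prime
  }
  return hash >>> 0; // convert to an unsigned 32-bit integer
}

// Hashing a joined tuple gives a numeric id usable for a hash-table lookup.
// Joining with "\0" is an assumption made for this example only.
const id = fnv1a32(["F", "R"].join("\0"));
console.log(id);
```

A hash like this turns a tuple of key values into a single number for indexing, which avoids scanning every existing pivot cell for a match.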