Support pivot groups for cross tabulation #48

Closed
wants to merge 8 commits into from

Conversation


@chondl chondl commented Dec 5, 2012

This is a first cut at adding a "pivotGroup" to maintain cross tabulations across multiple dimensions.

This has tests and decent performance, but needs a bit more work to be generally useful (in particular it does not update correctly if data is added to the crossfilter after the pivotGroup is created). There is some duplication of code with the regular group object that could also be eliminated. Happy to spend more time on this if there is interest in merging it upstream.

The API is very simple. The crossfilter object has a new method pivotGroup that takes an array of group objects as the dimensions for cross tabulation. The pivotGroup has reduce(), reduceCount(), and reduceSum() methods to define the reduce operation in the same manner as regular group objects. The pivotGroup only has two methods for accessing results: size() and all() which also behave similarly to regular group objects, with the keys for the results returned in all() being an array of key values for each dimension. Results from all() are returned sorted.

Construction performance is enhanced by using FNV-1a hash function from bloomfilter.js by jasondavis.
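For readers unfamiliar with FNV-1a, here is a minimal sketch of the 32-bit variant. It is not the code from bloomfilter.js or from this pull request; `fnv1a32` and `tupleBucket` are hypothetical names, the input is assumed to be ASCII, and joining the keys with "\0" is only one possible way to hash a tuple of group keys.

```javascript
// 32-bit FNV-1a over the character codes of a string.
// Assumption: the input is ASCII, so each char code fits in one byte.
function fnv1a32(str) {
  var hash = 0x811c9dc5; // 32-bit FNV offset basis
  for (var i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i) & 0xff;   // xor in the next byte first ("1a" order)
    hash = Math.imul(hash, 0x01000193); // then multiply by the FNV prime, mod 2^32
  }
  return hash >>> 0; // reinterpret as an unsigned 32-bit integer
}

// Hypothetical helper: map a tuple of group keys to a bucket index.
function tupleBucket(keys, bucketCount) {
  return fnv1a32(keys.join("\0")) % bucketCount;
}
```

Something like `tupleBucket(['Female', 'Left-handed'], 1024)` would then give a stable bucket for that key combination, which is the kind of lookup that avoids a brute-force search over all existing keys.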

var data = [ { gender:'Female', handed:'Right-handed', score: 9},
             { gender:'Male', handed:'Left-handed', score: 2},
             { gender:'Female', handed:'Right-handed', score: 32},
             { gender:'Male', handed:'Right-handed', score: 22},
             { gender:'Male', handed:'Left-handed', score: 3},
             { gender:'Male', handed:'Right-handed', score: 21},
             { gender:'Female', handed:'Right-handed', score: 99},
             { gender:'Female', handed:'Left-handed', score: 12},
             { gender:'Male', handed:'Right-handed', score: 0},
             { gender:'Female', handed:'Right-handed', score: 1},
           ],
    c = crossfilter(data),
    dim = { gender: c.dimension(function(v) { return v.gender }),
            handed: c.dimension(function(v) { return v.handed }) },
    group = { gender: dim.gender.group(),
              handed: dim.handed.group() },
    pivotGroup = c.pivotGroup([group.gender, group.handed])


assert.deepEqual(pivotGroup.all(), [{key:['Female', 'Left-handed'], value:1}, 
                                    {key:['Female', 'Right-handed'], value:4}, 
                                    {key:['Male', 'Left-handed'], value:2}, 
                                    {key:['Male', 'Right-handed'], value:3}])

dim.gender.filter('Female')
assert.deepEqual(pivotGroup.all(), [{key:['Female', 'Left-handed'], value:1}, 
                                    {key:['Female', 'Right-handed'], value:4}, 
                                    {key:['Male', 'Left-handed'], value:0}, 
                                    {key:['Male', 'Right-handed'], value:0}])



…efined groups.

The implementation is the simplest thing that can work: it doesn't respect filter changes, doesn't handle data added after the first execution, and is not performance-optimized.
Significant performance improvement over brute-force search (no surprise): 71ms to group 100k records into a 3-dimensional pivot group with 2800 distinct keys (previously ~1900ms with brute-force search).
@christophe-g

Pretty cool ;)

Wrote this dc-js/dc.js#91 to serve the same purpose !
Cheers,
C.

@chondl
Author

chondl commented Dec 7, 2012

Thanks. I had seen your fork as well, but for my project I was looking for an API where I wouldn't have to think about the pivot grouping when writing the reduce functions. I haven't looked at how to tie this into one of the various charting components that use crossfilter since currently I'm using crossfilter inside node to create Excel and PDF tabular reports. Hopefully there will be a future version of crossfilter with some version of pivot groups built in.

@chondl
Author

chondl commented Feb 8, 2013

I'm not sure I understand the question Brandon asked about "imagine trying
to sum the values of a column in group A while counting the ones the group
B". Perhaps you can send a simple example with a few records and the
expected results of the calculation?

The pivot group just supports grouping the records by the dimension and the
reduce function has access to all records in the group as they are added
and removed.

On Thu, Feb 7, 2013 at 10:21 AM, Brandon notifications@github.com wrote:

I see the reduce methods in the pivot group, but imagine that I want to
adjust the group to be used in the pivot before the pivot combines them.

So, imagine trying to sum the values of a column in group A while counting
the ones the group B
Then, in the pivot, multiply the 2 values in the reduce.



@Trakkasure

Yeah, never mind on that. I figured out my problem.
Some of my code was altering the result of .all() on a group, because all() doesn't return a copy.
Fixed with a .slice(0) on the return value.
So:

var cf = crossfilter(myData)
  , d1 = cf.dimension(function(d){return d.key1})
  , d2 = cf.dimension(function(d){return d.key2})
  , g1 = d1.group()
  , g2 = d2.group()
  , x = g1.all() // the group's own array, not a copy
  , s = []

while(x.length) s.push(x.shift()) // this empties g1's array as a side effect

var pg = cf.pivotGroup([g1,g2]) // Error should occur here

That is just an example to expose what is happening. It doesn't do anything interesting.
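To make the aliasing problem concrete without depending on crossfilter, here is a small sketch with a stand-in object. `makeGroup` is hypothetical and only mimics the one relevant behaviour described above: all() returns the internal array itself rather than a copy.

```javascript
// Hypothetical stand-in for a group whose all() returns its internal array.
function makeGroup(entries) {
  return { all: function() { return entries; } };
}

var sharedGroup = makeGroup([{key: "a", value: 1}, {key: "b", value: 2}]);

// Safe: drain a shallow copy, leaving the group's own array untouched.
var copy = sharedGroup.all().slice(0), drained = [];
while (copy.length) drained.push(copy.shift());
var lengthAfterCopyDrain = sharedGroup.all().length; // still 2

// Unsafe: draining the returned array itself empties the group's state.
var live = sharedGroup.all();
while (live.length) live.shift();
var lengthAfterLiveDrain = sharedGroup.all().length; // now 0
```

Note that .slice(0) is a shallow copy: the array is new, but the {key, value} objects inside it are still shared with the group.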

@JDvorak

JDvorak commented Jun 20, 2013

Has anything been done with this in the last few months?

@chondl
Author

chondl commented Jun 24, 2013

I've been using it within my node.js application for report generation for the past several months, but haven't heard from anyone else using it.

@ghost

ghost commented Aug 16, 2013

Any plans to merge this into crossfilter mainline? Looks like an important addition for crossfilter.

@JDvorak

JDvorak commented Aug 16, 2013

I would also be interested in this being made more official

@jasondavies
Collaborator

Thanks for the contribution. Note that the equivalent can be achieved by creating a special dimension with a unique value for each possible combination of pivot groups:

var pivotGroup = c
    .dimension(function(d) { return d.gender + "/" + d.handed; })
    .group(); // default groups use the dimension value

This requires somewhat careful choice of separator ("/" seems convenient here, but you could use any character, even something like "\0").
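As a rough illustration of the compound-key idea, independent of crossfilter, the sketch below counts records per "gender/handed" string and splits the key back into a tuple. `compoundKey` and `countByCompoundKey` are hypothetical helpers, and the approach assumes that the separator never occurs inside a field value.

```javascript
var records = [
  { gender: 'Female', handed: 'Right-handed', score: 9 },
  { gender: 'Male',   handed: 'Left-handed',  score: 2 },
  { gender: 'Female', handed: 'Right-handed', score: 32 },
  { gender: 'Male',   handed: 'Right-handed', score: 22 },
  { gender: 'Male',   handed: 'Left-handed',  score: 3 },
  { gender: 'Male',   handed: 'Right-handed', score: 21 },
  { gender: 'Female', handed: 'Right-handed', score: 99 },
  { gender: 'Female', handed: 'Left-handed',  score: 12 },
  { gender: 'Male',   handed: 'Right-handed', score: 0 },
  { gender: 'Female', handed: 'Right-handed', score: 1 }
];

// One string per combination; "/" must not appear in either field.
function compoundKey(d) { return d.gender + "/" + d.handed; }

// Count records per compound key and return entries sorted by key string.
function countByCompoundKey(rows) {
  var counts = {};
  rows.forEach(function(d) {
    var k = compoundKey(d);
    counts[k] = (counts[k] || 0) + 1;
  });
  return Object.keys(counts).sort().map(function(k) {
    return { key: k.split("/"), value: counts[k] };
  });
}

var crossTab = countByCompoundKey(records);
```

For this data, `crossTab` holds the same four counts as the first pivotGroup.all() assertion in the pull request description (1, 4, 2, 3).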

A special dimension is required because group keys can only be based on the dimension value. Alternatively, perhaps arbitrary groups on a “dummy dimension”, similar to crossfilter.groupAll, could be supported, allowing group keys to be generated from a record, e.g. c.group(function(d) { return d.gender + "/" + d.handed; }).

I’m inclined to keep the API simple and fast, and the use of a special dimension seems reasonable here. The advantage over your pivotGroup implementation is that other standard group methods are also available, such as top and order, and the special dimension can also be removed when no longer needed.

I’m closing for now, but feel free to add comments if you think there’s a problem with my approach, or an advantage to having a specialised API.

@chondl
Author

chondl commented Aug 20, 2013

The alternative of creating a compound dimension using a separator is reasonable, and I've used it in other situations.

One drawback is that I've encountered performance issues using the compound dimension technique in other situations due to the string manipulation. That said, I don't have head-to-head benchmarks of the compound dimension technique against this pull request.

With crossfilter, I wanted something that took advantage of the existing group indexes and had an API expressed in terms of tuples as opposed to the compound strings.

@gordonwoodhull

Hi @jasondavies,

One problem with the string concatenation solution is that it's very messy to supply tuples that include numbers. They have to be zero-padded. I wonder if you would consider a PR for native support of tuple keys?

@jasondavies
Collaborator

Why do they need to be zero-padded? As long as you pick a suitable separator, no padding is necessary ("," or "/" would work fine for numbers, or even "\0" as discussed above).

@gordonwoodhull

gordon$ cat commas
1,1
10,2
2,3
1,9
11,15
gordon$ sort commas
1,1
1,9
10,2
11,15
2,3

@gordonwoodhull

Similar problem if the character code is greater than digits:

gordon$ cat bars
1|1
10|2
2|3
1|9
11|15
1|11
10|1
11|1
11|2
gordon$ sort bars
10|1
10|2
11|1
11|15
11|2
1|1
1|11
1|9
2|3

I could be wrong, but I don't think lexicographical ordering ever works for tuples of variable length fields.
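The same effect can be reproduced in JavaScript. This sketch sorts the comma-joined strings with the default string comparison and then sorts the original pairs with an element-wise comparator; `compareTuples` is a hypothetical helper, not part of crossfilter.

```javascript
var pairs = [[1, 1], [10, 2], [2, 3], [1, 9], [11, 15]];

// Default sort compares strings code unit by code unit, so "10,2" < "2,3".
var asStrings = pairs.map(function(p) { return p.join(","); }).sort();
// ["1,1", "1,9", "10,2", "11,15", "2,3"]

// Element-wise comparison: first field decides, later fields break ties.
function compareTuples(a, b) {
  var n = Math.min(a.length, b.length);
  for (var i = 0; i < n; i++) {
    if (a[i] < b[i]) return -1;
    if (a[i] > b[i]) return 1;
  }
  return a.length - b.length; // a shorter tuple sorts before its extensions
}

var asTuples = pairs.slice(0).sort(compareTuples);
// [[1, 1], [1, 9], [2, 3], [10, 2], [11, 15]]
```

The string order matches the `sort commas` output above, while the comparator gives the numeric tuple order.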

@jasondavies
Collaborator

Right, the array of all groups as returned by group.all will be in ascending natural order, so I agree the order could be unexpected if you have variable-length keys. However, the grouping behaviour and top-K groups will all work as expected, which is the main point of using a dummy dimension in this manner.

I’d be open to considering proper tuple support, but only if performance remains reasonably fast. Though not ideal, you can always clone the returned group.all array and sort it afterwards if you really need it to be sorted differently.

@gordonwoodhull

Yes, that almost works, but for example, I would want to be able to select the range [1,1] -> [1,19], but without native support [1,2] would be left out.

I think it boils down to parameterizing the ordering function. So there wouldn't be any extra logic during sorting. I would hope this isn't worse than the valueOf() call that is already happening, and you'd pay for what you need. I can benchmark it and find out. Thanks for considering it.

@jasondavies
Collaborator

The dummy dimension is really only intended as a way to group by multiple keys at once, rather than for filtering. If you want to filter, you can use an individual dimension that has the correct ordering. So for your example you could say:

dimension0.filterExact(1);
dimension1.filterRange([1, 19]);
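In plain JavaScript terms, the two filters combine as a logical AND over the records. The sketch below is not crossfilter code; `filterExactThenRange` and the field names `a` and `b` are hypothetical. It mirrors crossfilter's documented half-open range, where the lower bound is included and the upper bound is excluded.

```javascript
var tupleRecords = [
  { a: 1, b: 1 }, { a: 1, b: 2 }, { a: 1, b: 19 },
  { a: 1, b: 25 }, { a: 2, b: 5 }
];

// Keep records whose first field equals `exact` and whose second
// field lies in the half-open interval [range[0], range[1]).
function filterExactThenRange(rows, exact, range) {
  return rows.filter(function(d) {
    return d.a === exact && d.b >= range[0] && d.b < range[1];
  });
}

var selected = filterExactThenRange(tupleRecords, 1, [1, 19]);
// [{ a: 1, b: 1 }, { a: 1, b: 2 }]
```

Because the upper bound is exclusive, a record with b equal to 19 is left out; selecting it as well would need an upper bound of 20 for integer data.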

@gordonwoodhull

Ah, right. Thanks, got it. And it would be very unlikely that one would want to filter by e.g. range [1,10] -> [2,9]

I will see if I can make some wrappers to hide the mess.
