Support pivot groups for cross tabulation #48

chondl · 2012-12-05T07:34:11Z

This is a first cut at adding a "pivotGroup" to maintain cross tabulations across multiple dimensions.

This has tests and decent performance, but needs a bit more work to be generally useful (in particular it does not update correctly if data is added to the crossfilter after the pivotGroup is created). There is some duplication of code with the regular group object that could also be eliminated. Happy to spend more time on this if there is interest in merging it upstream.

The API is very simple. The crossfilter object has a new method, pivotGroup, that takes an array of group objects as the dimensions for cross tabulation. The pivotGroup has reduce(), reduceCount(), and reduceSum() methods to define the reduce operation in the same manner as regular group objects. The pivotGroup has only two methods for accessing results, size() and all(), which also behave similarly to those of regular group objects, except that each key returned by all() is an array of key values, one per dimension. Results from all() are returned sorted.

Construction performance is enhanced by using the FNV-1a hash function from bloomfilter.js by jasondavies.
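For readers unfamiliar with it, here is a minimal sketch of 32-bit FNV-1a in JavaScript, plus one hypothetical way to hash a tuple of per-dimension group indices with it. This is not the code from this pull request or from bloomfilter.js; the names `fnv1a` and `hashTuple` are illustrative, and the input is assumed to be ASCII.

```javascript
// 32-bit FNV-1a over the low byte of each character (assumes ASCII input).
function fnv1a(str) {
  var h = 0x811c9dc5;                // FNV-1a 32-bit offset basis
  for (var i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i) & 0xff;   // xor in the next byte
    h = Math.imul(h, 0x01000193);    // multiply by the FNV prime, modulo 2^32
  }
  return h >>> 0;                    // reinterpret as an unsigned 32-bit integer
}

// Hypothetical helper: hash a tuple of group indices, e.g. [0, 1].
function hashTuple(indices) {
  return fnv1a(indices.join(","));
}
```

A hash like this can serve as the index into a lookup table of pivot cells, which is presumably what replaces the brute-force search mentioned in the commits; collisions still have to be resolved by comparing the actual tuples.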

var data = [ { gender:'Female', handed:'Right-handed', score: 9},
             { gender:'Male', handed:'Left-handed', score: 2},
             { gender:'Female', handed:'Right-handed', score: 32},
             { gender:'Male', handed:'Right-handed', score: 22},
             { gender:'Male', handed:'Left-handed', score: 3},
             { gender:'Male', handed:'Right-handed', score: 21},
             { gender:'Female', handed:'Right-handed', score: 99},
             { gender:'Female', handed:'Left-handed', score: 12},
             { gender:'Male', handed:'Right-handed', score: 0},
             { gender:'Female', handed:'Right-handed', score: 1}
           ],
    c = crossfilter(data),
    dim = { gender: c.dimension(function(v) { return v.gender }),
            handed: c.dimension(function(v) { return v.handed }) },
    group = { gender: dim.gender.group(),
              handed: dim.handed.group() },
    pivotGroup = c.pivotGroup([group.gender, group.handed])


assert.deepEqual(pivotGroup.all(), [{key:['Female', 'Left-handed'], value:1}, 
                                    {key:['Female', 'Right-handed'], value:4}, 
                                    {key:['Male', 'Left-handed'], value:2}, 
                                    {key:['Male', 'Right-handed'], value:3}])

dim.gender.filter('Female')
assert.deepEqual(pivotGroup.all(), [{key:['Female', 'Left-handed'], value:1}, 
                                    {key:['Female', 'Right-handed'], value:4}, 
                                    {key:['Male', 'Left-handed'], value:0}, 
                                    {key:['Male', 'Right-handed'], value:0}])
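To make the unfiltered results concrete, the sketch below computes the same cross tabulation in plain JavaScript, without crossfilter. It only illustrates what the pivot group is expected to return; it is not the implementation in this pull request, and `crossTab` and its parameters are hypothetical names.

```javascript
// Hypothetical helper: tabulate records by a tuple of keys.
// accessors: one key function per dimension; valueOf: what each record contributes.
function crossTab(records, accessors, valueOf) {
  var cells = {};
  records.forEach(function(d) {
    var key = accessors.map(function(f) { return f(d); });
    var id = JSON.stringify(key);              // unambiguous id for the tuple
    if (!cells[id]) cells[id] = { key: key, value: 0 };
    cells[id].value += valueOf(d);
  });
  // Return the cells sorted by tuple, comparing element by element.
  return Object.keys(cells).map(function(id) { return cells[id]; })
    .sort(function(a, b) {
      for (var i = 0; i < a.key.length; i++) {
        if (a.key[i] < b.key[i]) return -1;
        if (a.key[i] > b.key[i]) return 1;
      }
      return 0;
    });
}

var sample = [
  { gender: 'Female', handed: 'Right-handed', score: 9 },
  { gender: 'Male',   handed: 'Left-handed',  score: 2 },
  { gender: 'Female', handed: 'Right-handed', score: 32 },
  { gender: 'Male',   handed: 'Right-handed', score: 22 },
  { gender: 'Male',   handed: 'Left-handed',  score: 3 },
  { gender: 'Male',   handed: 'Right-handed', score: 21 },
  { gender: 'Female', handed: 'Right-handed', score: 99 },
  { gender: 'Female', handed: 'Left-handed',  score: 12 },
  { gender: 'Male',   handed: 'Right-handed', score: 0 },
  { gender: 'Female', handed: 'Right-handed', score: 1 }
];
var byGender = function(d) { return d.gender; },
    byHanded = function(d) { return d.handed; };

// Counterpart of reduceCount(): every record contributes 1.
var counts = crossTab(sample, [byGender, byHanded], function() { return 1; });
// Counterpart of reduceSum(): every record contributes its score.
var sums = crossTab(sample, [byGender, byHanded], function(d) { return d.score; });
```

Unlike the pivot group in the filtered example, this sketch only emits cells for combinations that actually occur, so it would not produce the zero-valued entries.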

christophe-g · 2012-12-06T20:39:12Z

Pretty cool ;)

Wrote this dc-js/dc.js#91 to serve the same purpose !
Cheers,
C.

chondl · 2012-12-07T17:16:45Z

Thanks. I had seen your fork as well, but for my project I was looking for an API where I wouldn't have to think about the pivot grouping when writing the reduce functions. I haven't looked at how to tie this into one of the various charting components that use crossfilter since currently I'm using crossfilter inside node to create Excel and PDF tabular reports. Hopefully there will be a future version of crossfilter with some version of pivot groups built in.

chondl · 2013-02-08T17:16:20Z

I'm not sure I understand the question Brandon asked about "imagine trying
to sum the values of a column in group A while counting the ones the group
B". Perhaps you can send a simple example with a few records and the
expected results of the calculation?

The pivot group just supports grouping the records by the dimension and the
reduce function has access to all records in the group as they are added
and removed.
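As a sketch of what that means in practice, the reducers below keep both a sum and a count per cell, following the add/remove/initial pattern that crossfilter's regular group.reduce takes. That the pivot group's reduce() accepts exactly these three functions is an assumption based on the description of this pull request, and the `amount` field is made up.

```javascript
// Start value for every cell.
function reduceInitial() { return { sum: 0, count: 0 }; }
// Called when a record enters the cell (added, or no longer filtered out).
function reduceAdd(p, v) { p.sum += v.amount; p.count += 1; return p; }
// Called when a record leaves the cell (filtered out).
function reduceRemove(p, v) { p.sum -= v.amount; p.count -= 1; return p; }

// Simulate records entering and leaving one cell as filters change.
var cell = reduceInitial();
cell = reduceAdd(cell, { amount: 5 });
cell = reduceAdd(cell, { amount: 7 });
cell = reduceRemove(cell, { amount: 5 });

// Any combined figure can be derived from the reduced value afterwards.
var product = cell.sum * cell.count;
```

With the pivot group, these would presumably be passed as pivotGroup.reduce(reduceAdd, reduceRemove, reduceInitial), in the same order as for a regular group.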

On Thu, Feb 7, 2013 at 10:21 AM, Brandon notifications@github.com wrote:

I see the reduce methods in the pivot group, but imagine that I want to
adjust the group to be used in the pivot before the pivot combines them.

So, imagine trying to sum the values of a column in group A while counting
the ones the group B
Then, in the pivot, multiply the 2 values in the reduce.


Trakkasure · 2013-02-08T17:24:26Z

Yea.. never mind on that. I figured out my problem.
It was some of my code altering the result of .all() on a group because all() doesn't return a copy.
Fixed with a .slice(0) on the return.
So:

var cf = crossfilter(myData)
  , d1 = cf.dimension(function(d){return d.key1})
  , d2 = cf.dimension(function(d){return d.key2})
  , g1 = d1.group()
  , g2 = d2.group()
  , x = g1.all() // all() returns the group's internal array, not a copy
  , s = []

while(x.length) s.push(x.shift())

pg = cf.pivotGroup([g1,g2]) // Error should occur here

That is just an example to expose what is happening. It doesn't do anything interesting.
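The underlying pitfall can be reproduced without crossfilter. In the sketch below, `makeGroup` is a hypothetical stand-in for a group whose all() returns its internal array rather than a copy: draining a .slice(0) copy leaves the group intact, while draining the returned array itself empties the group.

```javascript
// Hypothetical stand-in for a group: all() hands out the internal array.
function makeGroup(entries) {
  return { all: function() { return entries; } };
}

var shared = makeGroup([{ key: 'a', value: 1 }, { key: 'b', value: 2 }]);

// Safe: work on a shallow copy.
var copy = shared.all().slice(0);
while (copy.length) copy.shift();
var afterCopyDrain = shared.all().length;

// Unsafe: mutate the array that the group still uses internally.
var live = shared.all();
while (live.length) live.shift();
var afterLiveDrain = shared.all().length;
```

Note that .slice(0) is a shallow copy, so the individual {key, value} objects are still shared with the group and should not be modified either.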

JDvorak · 2013-06-20T18:54:01Z

Has anything been done with this in the last few months?

chondl · 2013-06-24T22:11:22Z

I've been using it within my node.js application for report generation for the past several months, but haven't heard from anyone else using it.

ghost · 2013-08-16T15:50:54Z

Any plans to merge this into crossfilter mainline? Looks like an important addition for crossfilter.

JDvorak · 2013-08-16T19:54:32Z

I would also be interested in this being made more official

jasondavies · 2013-08-16T21:32:02Z

Thanks for the contribution. Note that the equivalent can be achieved by creating a special dimension with a unique value for each possible combination of pivot groups:

var pivotGroup = c
    .dimension(function(d) { return d.gender + "/" + d.handed; })
    .group(); // default groups use the dimension value

This requires somewhat careful choice of separator ("/" seems convenient here, but you could use any character, even something like "\0").
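A short sketch of why the separator matters: if a field value can itself contain the separator, two different tuples collapse into the same key. The helper names `compoundKey` and `splitKey` are illustrative only.

```javascript
function compoundKey(parts, sep) { return parts.join(sep); }
function splitKey(key, sep) { return key.split(sep); }

// "/" collides as soon as a value contains "/":
var k1 = compoundKey(['a/b', 'c'], '/');
var k2 = compoundKey(['a', 'b/c'], '/');
var collides = (k1 === k2);

// "\0" is unlikely to occur in real field values, so the tuple round-trips:
var k3 = compoundKey(['a/b', 'c'], '\0');
var parts = splitKey(k3, '\0');
```

Splitting always yields strings, so numeric fields have to be converted back after the round trip.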

A special dimension is required because group keys can only be based on the dimension value. Alternatively, perhaps arbitrary groups on a “dummy dimension”, similar to crossfilter.groupAll, could be supported, allowing group keys to be generated from a record, e.g. c.group(function(d) { return d.gender + "/" + d.handed; }).

I’m inclined to keep the API simple and fast, and the use of a special dimension seems reasonable here. The advantage over your pivotGroup implementation is that other standard group methods are also available, such as top and order, and the special dimension can also be removed when no longer needed.

I’m closing for now, but feel free to add comments if you think there’s a problem with my approach, or an advantage to having a specialised API.

chondl · 2013-08-20T21:31:22Z

The alternative of creating a compound dimension using a separator is reasonable, and I've used it in other situations.

One drawback is that I've encountered performance issues using the compound dimension technique in other situations due to the string manipulation. That said, I don't have head-to-head benchmarks of the compound dimension technique against this pull request.

With crossfilter, I wanted something that took advantage of the existing group indexes and had an API expressed in terms of tuples as opposed to the compound strings.

gordonwoodhull · 2013-10-15T20:31:29Z

Hi @jasondavies,

One problem with the string concatenation solution is that it's very messy to supply tuples that include numbers. They have to be zero-padded. I wonder if you would consider a PR for native support of tuple keys?

jasondavies · 2013-10-15T21:15:17Z

Why do they need to be zero-padded? As long as you pick a suitable separator, no padding is necessary ("," or "/" would work fine for numbers, or even "\0" as discussed above).

gordonwoodhull · 2013-10-15T21:41:23Z

gordon$ cat commas
1,1
10,2
2,3
1,9
11,15
gordon$ sort commas
1,1
1,9
10,2
11,15
2,3

gordonwoodhull · 2013-10-15T21:46:46Z

Similar problem if the character code is greater than digits:

gordon$ cat bars
1|1
10|2
2|3
1|9
11|15
1|11
10|1
11|1
11|2
gordon$ sort bars
10|1
10|2
11|1
11|15
11|2
1|1
1|11
1|9
2|3

I could be wrong, but I don't think lexicographical ordering ever works for tuples of variable length fields.
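The same ordering shows up in JavaScript, whose default string sort compares UTF-16 code units. A minimal sketch that reproduces the two listings above:

```javascript
// "," (0x2C) sorts before the digits, "|" (0x7C) sorts after them.
var commas = ['1,1', '10,2', '2,3', '1,9', '11,15'].sort();
var bars = ['1|1', '10|2', '2|3', '1|9', '11|15',
            '1|11', '10|1', '11|1', '11|2'].sort();
```

In both cases 10 and 11 end up ahead of 2 in the first field, which is why neither separator gives numeric tuple order.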

jasondavies · 2013-10-15T22:06:42Z

Right, the array of all groups as returned by group.all will be in ascending natural order, so I agree the order could be unexpected if you have variable-length keys. However, the grouping behaviour and top-K groups will all work as expected, which is the main point of using a dummy dimension in this manner.

I’d be open to considering proper tuple support, but only if performance remains reasonably fast. Though not ideal, you can always clone the returned group.all array and sort it afterwards if you really need it to be sorted differently.
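A sketch of that workaround, assuming compound string keys of the form "a,b" with numeric fields. The `entries` array stands in for the result of group.all(), and `compareTupleKeys` is a hypothetical helper.

```javascript
// Stand-in for group.all() on a compound-key dimension (lexicographic order).
var entries = [
  { key: '1,1', value: 3 },
  { key: '1,9', value: 1 },
  { key: '10,2', value: 2 },
  { key: '2,3', value: 5 }
];

// Compare two entries field by field, treating each field as a number.
function compareTupleKeys(a, b) {
  var x = a.key.split(',').map(Number),
      y = b.key.split(',').map(Number);
  for (var i = 0; i < x.length; i++) {
    if (x[i] !== y[i]) return x[i] - y[i];
  }
  return 0;
}

// Clone first so the array owned by the group is left untouched.
var sorted = entries.slice().sort(compareTupleKeys);
var sortedKeys = sorted.map(function(e) { return e.key; });
```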

gordonwoodhull · 2013-10-15T22:16:16Z

Yes, that almost works, but for example, I would want to be able to select the range [1,1] -> [1,19], but without native support [1,2] would be left out.

I think it boils down to parameterizing the ordering function. So there wouldn't be any extra logic during sorting. I would hope this isn't worse than the valueOf() call that is already happening, and you'd pay for what you need. I can benchmark it and find out. Thanks for considering it.
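To illustrate the idea, here is a sketch of a binary search parameterized by a comparator and applied to tuple keys held as arrays. This is not crossfilter's bisect API; `compareTuples` and `bisectLeftBy` are hypothetical names, and the upper bound is written as the exclusive [1,20] to select everything up to [1,19].

```javascript
// Element-by-element comparison of two equal-length tuples.
function compareTuples(a, b) {
  for (var i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) return a[i] < b[i] ? -1 : 1;
  }
  return 0;
}

// Index of the first element that is not less than x under the given ordering.
function bisectLeftBy(compare, sortedArr, x) {
  var lo = 0, hi = sortedArr.length;
  while (lo < hi) {
    var mid = (lo + hi) >>> 1;
    if (compare(sortedArr[mid], x) < 0) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

var tupleKeys = [[1, 1], [1, 2], [1, 19], [2, 3], [10, 2]]; // sorted by compareTuples
var from = bisectLeftBy(compareTuples, tupleKeys, [1, 1]);
var to = bisectLeftBy(compareTuples, tupleKeys, [1, 20]);
var selected = tupleKeys.slice(from, to);
```

With string keys the same range would miss "1,2", because "1,2" sorts after "1,19" lexicographically.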

jasondavies · 2013-10-15T22:22:33Z

The dummy dimension is really only intended as a way to group by multiple keys at once, rather than for filtering. If you want to filter, you can use an individual dimension that has the correct ordering. So for your example you could say:

dimension0.filterExact(1);
dimension1.filterRange([1, 19]);
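In plain JavaScript terms, combining the two filters selects the records shown below. Note that crossfilter's filterRange is half-open: it includes the lower bound and excludes the upper one, so [1, 19] does not match 19 itself. The record fields `a` and `b` are made up for this sketch.

```javascript
var rows = [
  { a: 1, b: 1 }, { a: 1, b: 2 }, { a: 1, b: 19 },
  { a: 2, b: 3 }, { a: 10, b: 2 }
];

var matched = rows.filter(function(d) {
  return d.a === 1                // like dimension0.filterExact(1)
      && d.b >= 1 && d.b < 19;    // like dimension1.filterRange([1, 19])
});
```

To include 19, the upper bound of the range would need to be raised, for example to 20 for integer values.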

gordonwoodhull · 2013-10-15T22:37:04Z

Ah, right. Thanks, got it. And it would be very unlikely that one would want to filter by e.g. range [1,10] -> [2,9]

I will see if I can make some wrappers to hide the mess.

chondl added 8 commits December 4, 2012 13:39

Add pivotGroup to performs a cross tabulation against other already d…

582035e

…efined groups. Implementation is simplest thing that can work: it doesn't respect filter changes, doesn't handle data added after first executed, and is not performance optimized

Add test that pivot group respects filters

f3368b0

Add test for custom reducer

6062616

Update pivot groups when filters change

9b44637

Add a larger data set for integration testing

87e5005

Refactor pivot group construction

eec5113

Use FNV-1a hash to construct pivot group

ba1d79d

Significant performance improvement over brute force search (no surprise). 71ms to group 100k records into a 3 dimensional pivot group with 2800 distinct keys (was taking ~1900ms before with brute force search).

Fix bugs with hash and cardinality 1 dimensions

1d767ab

jasondavies closed this Aug 16, 2013

jasondavies mentioned this pull request Aug 16, 2013

Support multi-value grouping #67

Closed

gordonwoodhull mentioned this pull request Oct 15, 2013

New Series chart requires two renders to render dc-js/dc.js#336

Closed

gordonwoodhull mentioned this pull request Oct 15, 2013

series chart should override doRender not plotData dc-js/dc.js#349

Closed

gordonwoodhull mentioned this pull request Mar 4, 2014

Array support as key for group dc-js/dc.js#535

Closed

gordonwoodhull mentioned this pull request Nov 5, 2015

sort by multiple (nested) dimensions #165

Closed
