
Don't create (or add option to not create) empty columns when unstacking #1181

Closed
wesm opened this issue May 2, 2012 · 10 comments
wesm (Member) commented May 2, 2012

from @rkern on the pydata mailing list:

Hello! I am using the pivot table functionality to good effect. One
issue I am running into, however, is that the columns end up being
roughly a full Cartesian product of the column pivots. That is, there
are blocks of all-NA columns. The row pivots appear to be pruned
already; all-NA rows don't show up. For pivot values that are roughly
tree-structured (i.e. the children of one node mostly don't show up as
children of a neighboring node), this can create pivot tables with
very many columns. Currently, I post-process the pivot tables using
.dropna(axis=1, how='all'), but I have just run into a case where the
intermediate table is too large for my 32-bit machine. Would there be
a good way to change the pivot table computation to get appropriately
sparse trees for both the rows and the columns? I am happy to look
into it myself, but I did want to check to see if it was on anyone's
radar or if anyone had suggestions.
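The failure mode and the workaround described in the quote can be sketched with a small example. This assumes modern pandas (the `index=`/`columns=` keywords rather than the `rows=`/`cols=` of 2012), and uses `dropna=False` to recreate the pre-fix Cartesian-product behavior:

```python
import pandas as pd

# Tree-structured keys: each child value occurs under only one parent,
# so the full Cartesian product of (parent, child) columns is mostly empty.
df = pd.DataFrame({
    'row':    ['r1', 'r1', 'r2', 'r2'],
    'parent': ['A',  'A',  'B',  'B'],
    'child':  ['a1', 'a2', 'b1', 'b2'],
    'value':  [1.0,  2.0,  3.0,  4.0],
})

# dropna=False recreates the old behavior: every (parent, child)
# combination appears as a column, including the all-NA ones.
pt = df.pivot_table(values='value', index='row',
                    columns=['parent', 'child'],
                    aggfunc='mean', dropna=False)

# The post-processing workaround described in the quote:
pruned = pt.dropna(axis=1, how='all')
```

Only the four observed `(parent, child)` pairs survive the pruning; the cost is that the full-width intermediate `pt` must be materialized first, which is exactly the memory problem being reported.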

My response:

That's actually a good point, and it's basically an oversight in the pivot_table implementation. If you look at the code, pivot_table is a convenience function that does a groupby and calls unstack on the aggregated result. I suspect the problem is that unstack creates lots of empty columns.
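The groupby-plus-unstack equivalence mentioned above can be demonstrated directly. A minimal sketch, again using modern pandas keyword names:

```python
import pandas as pd

df = pd.DataFrame({
    'key1':  ['A', 'A', 'B'],
    'key2':  ['one', 'two', 'one'],
    'value': [1.0, 2.0, 3.0],
})

# pivot_table is essentially a groupby followed by an unstack:
agged = df.groupby(['key1', 'key2'])['value'].mean()
by_hand = agged.unstack('key2')

via_pivot = df.pivot_table(values='value', index='key1',
                           columns='key2', aggfunc='mean')
# by_hand and via_pivot hold the same values, including the NaN at
# ('B', 'two') -- the combination never observed in the data.
```

The NaN at `('B', 'two')` is a single empty cell; with many hierarchical column keys, unstacking produces entire blocks of such empty columns, which is the subject of this issue.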
@ghost ghost assigned wesm May 3, 2012
@wesm wesm closed this as completed in 773d861 May 8, 2012
wesm (Member, Author) commented May 8, 2012

@rkern looks like I fixed this one. It was actually much less trivial than I expected. Here's an example table that would have had lots of empty columns before:

In [1]: %paste
import numpy as np
from pandas import DataFrame

fac1 = np.array(['A', 'B', 'C'], dtype='O')
fac2 = np.array(['one', 'two'], dtype='O')

ind1 = np.random.randint(0, 3, size=100000)
ind2 = np.random.randint(0, 2, size=100000)

df = DataFrame({'key1': fac1.take(ind1),
                'key2': fac2.take(ind2),
                'key3': fac2.take(ind2),
                'value1' : np.random.randn(100000),
                'value2' : np.random.randn(100000),
                'value3' : np.random.randn(100000)})
## -- End pasted text --

In [2]: df.pivot_table(rows='key1', cols=['key2', 'key3'])
Out[2]: 
        value1              value2              value3          
key2       one       two       one       two       one       two
key3       one       two       one       two       one       two
key1                                                            
A    -0.001907 -0.003096 -0.011314  0.001509  0.006259  0.008286
B    -0.002674  0.009532  0.001333 -0.005985 -0.005540 -0.010263
C    -0.018193 -0.014543 -0.007773 -0.002644 -0.008985 -0.002281

Take it for a spin with your data set and let me know if you have any problems.

rkern (Contributor) commented May 9, 2012

It doesn't look like it gives the correct (lexicographic) ordering of the columns when there are NA cells. When there is enough data for all cells to be "full", the right ordering is restored.

https://gist.github.com/2643464

wesm (Member, Author) commented May 9, 2012

I see. That is indeed annoying. So the issue is that prior to reshaping the data are lexicographically sorted:

                                    Price
Month Day Index  Symbol Year             
1     1   FTSE   DGE.L  2001    54.163185
                        2003   379.459302
                 GSK.L  2002   497.299116
                        2007     1.664354
                        2010   272.104540
                 TSCO.L 2004     1.451182
                        2007    69.706041
                        2010   154.649158
          NDQ100 GOOG   2000   550.347603
                        2004     3.031219
                 MSFT   2008    31.284452
                        2009   286.642562
          SP500  ADBE   2000     7.973080
                        2001     4.627768
                        2006    14.359992
                        2009  2393.824308
                 NVDA   2000    69.009683
                 ORCL   2000  2713.215936
      2   FTSE   DGE.L  2002     2.058551
                        2009    52.296926
                 GSK.L  2010    30.699186
                 TSCO.L 2003  1062.002363
                        2005    15.939004
                        2009   902.966650
          NDQ100 AAPL   2003   178.357396
                        2006   114.985258
                        2008   763.402429
                        2009    38.491641
                 GOOG   2008     7.095779
                 MSFT   2001   142.510070
                        2003    61.885051
                        2010    46.534496
          SP500  ADBE   2001   779.574063
                        2004    76.471981
                 NVDA   2000  3177.537287
                        2004    31.213180
                        2009    36.103810
                 ORCL   2008   230.984945
      3   FTSE   DGE.L  2000   218.143268
                        2001   275.780256
                        2005     8.186233
                        2009    13.550338
                 GSK.L  2002    56.868936
                        2007    18.372622
                        2008     8.835835
                        2010    72.746109
                 TSCO.L 2005    20.465795
                        2007     8.700926
          NDQ100 AAPL   2002     5.158302
                        2003     0.906602
ipdb> agged.index.lexsort_depth
5

But after reshaping, not so much. So I think the prudent thing is just to sort the columns with sortlevel before returning.
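The column-sorting fix can be illustrated on a toy MultiIndex. Note that in modern pandas the DataFrame-level `sortlevel` method is gone; the same operation is spelled `sort_index(axis=1)`:

```python
import pandas as pd

# Columns in unstack/appearance order rather than lexicographic order:
cols = pd.MultiIndex.from_tuples(
    [('two', 'b'), ('one', 'a'), ('two', 'a'), ('one', 'b')],
    names=['key2', 'key3'])
df = pd.DataFrame([[1, 2, 3, 4]], columns=cols)

# Sort the column MultiIndex lexicographically before returning:
sorted_df = df.sort_index(axis=1)
```

After the sort, the columns read `('one', 'a')`, `('one', 'b')`, `('two', 'a')`, `('two', 'b')`, matching the ordering rkern expected.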

wesm (Member, Author) commented May 9, 2012

OK, it's a done deal.

rkern (Contributor) commented May 9, 2012

I'm getting an error:

[examples]
|1> %run gen_data.py

[examples]
|2> pt = getpt()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/Users/rkern/git/et/pivotgrid/examples/<ipython-input-2-fd166cbd8bda> in <module>()
----> 1 pt = getpt()

/Users/rkern/git/et/pivotgrid/examples/gen_data.py in getpt(n, prng)
     52 def getpt(n=10000, prng=np.random):
     53     df = getdf(n, prng)
---> 54     return df.pivot_table('Price', rows=['Month', 'Day'], cols=['Index', 'Symbol', 'Year'], aggfunc='mean')

/Library/Frameworks/Python.framework/Versions/7.1/lib/python2.7/site-packages/pandas/tools/pivot.py in pivot_table(data, values, rows, cols, aggfunc, fill_value, margins)
     98 
     99     grouped = data.groupby(keys)
--> 100     agged = grouped.agg(aggfunc)
    101 
    102     to_unstack = [agged.index.names[i]

/Library/Frameworks/Python.framework/Versions/7.1/lib/python2.7/site-packages/pandas/core/groupby.py in agg(self, func, *args, **kwargs)
    258         See docstring for aggregate
    259         """
--> 260         return self.aggregate(func, *args, **kwargs)
    261 
    262     def _iterate_slices(self):

/Library/Frameworks/Python.framework/Versions/7.1/lib/python2.7/site-packages/pandas/core/groupby.py in aggregate(self, arg, *args, **kwargs)
   1359         """
   1360         if isinstance(arg, basestring):
-> 1361             return getattr(self, arg)(*args, **kwargs)
   1362 
   1363         result = {}

/Library/Frameworks/Python.framework/Versions/7.1/lib/python2.7/site-packages/pandas/core/groupby.py in mean(self)
    278         except Exception:  # pragma: no cover
    279             f = lambda x: x.mean(axis=self.axis)
--> 280             return self._python_agg_general(f)
    281 
    282     def std(self, ddof=1):

/Library/Frameworks/Python.framework/Versions/7.1/lib/python2.7/site-packages/pandas/core/groupby.py in _python_agg_general(self, func, *args, **kwargs)
    368 
    369         if len(output) == 0:
--> 370             return self._python_apply_general(func, *args, **kwargs)
    371 
    372         mask = counts.ravel() > 0

/Library/Frameworks/Python.framework/Versions/7.1/lib/python2.7/site-packages/pandas/core/groupby.py in _python_apply_general(self, func, *args, **kwargs)
    383 
    384         not_indexed_same = False
--> 385         for key, group in self:
    386             object.__setattr__(group, 'name', key)
    387 

/Library/Frameworks/Python.framework/Versions/7.1/lib/python2.7/site-packages/pandas/core/groupby.py in get_iterator(self, data, axis)
    506         else:
    507             # provide "flattened" iterator for multi-group setting
--> 508             comp_ids, _, ngroups = self.group_info
    509             label_list = self.labels
    510             level_list = self.levels

/Library/Frameworks/Python.framework/Versions/7.1/lib/python2.7/site-packages/pandas/_tseries.so in pandas._tseries.cache_readonly.__get__ (pandas/src/tseries.c:124182)()

/Library/Frameworks/Python.framework/Versions/7.1/lib/python2.7/site-packages/pandas/core/groupby.py in group_info(self)
    561     @cache_readonly
    562     def group_info(self):
--> 563         comp_ids, obs_group_ids = self._get_compressed_labels()
    564 
    565         ngroups = len(obs_group_ids)

/Library/Frameworks/Python.framework/Versions/7.1/lib/python2.7/site-packages/pandas/core/groupby.py in _get_compressed_labels(self)
    568 
    569     def _get_compressed_labels(self):
--> 570         all_labels = [ping.labels for ping in self.groupings]
    571         if self._overflow_possible:
    572             tups = lib.fast_zip(all_labels)

/Library/Frameworks/Python.framework/Versions/7.1/lib/python2.7/site-packages/pandas/core/groupby.py in labels(self)
    980     def labels(self):
    981         if self._labels is None:
--> 982             self._make_labels()
    983         return self._labels
    984 

/Library/Frameworks/Python.framework/Versions/7.1/lib/python2.7/site-packages/pandas/core/groupby.py in _make_labels(self)
   1003         else:
   1004             labs, uniques, counts = algos.factorize(self.grouper,
-> 1005                                                     sort=self.sort)
   1006             uniques = Index(uniques, name=self.name)
   1007             self._labels = labs

/Library/Frameworks/Python.framework/Versions/7.1/lib/python2.7/site-packages/pandas/core/algorithms.py in factorize(values, sort, order, na_sentinel)
    105 
    106         mask = labels < 0
--> 107         labels = reverse_indexer.take(labels)
    108         np.putmask(labels, mask, -1)
    109 

TypeError: array cannot be safely cast to required type

wesm (Member, Author) commented May 9, 2012

Doh, you're on 32-bit. I'm in the midst of sorting out a bunch of int32/int64 issues (started last night); I should have the test suite passing on 32-bit within a few hours.
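For context on the traceback above: NumPy raises "cannot be safely cast to required type" when an int64 label array meets a 32-bit platform's `np.intp` index type. The sketch below illustrates the dtype mismatch and one defensive way around it; it is not the actual patch that went into pandas:

```python
import numpy as np

labels64 = np.array([0, 1, 2], dtype=np.int64)

# Whether int64 -> intp is a "safe" cast depends on the platform
# pointer size, which is why these failures only showed up on 32-bit:
print(np.can_cast(np.int64, np.intp))  # False on 32-bit, True on 64-bit

# Normalizing the label dtype explicitly sidesteps the unsafe cast:
safe_labels = labels64.astype(np.intp, copy=False)
values = np.array([10.0, 20.0, 30.0])
print(values.take(safe_labels))
```

On a 64-bit machine `copy=False` makes the `astype` a no-op, so the normalization costs nothing where the cast was already safe.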

rkern (Contributor) commented May 9, 2012

No rush.

wesm (Member, Author) commented May 9, 2012

Jenkins is very unhappy.

wesm (Member, Author) commented May 9, 2012

Test suite is passing on 32-bit platforms now. Let me know if you run into any problems.

rkern (Contributor) commented May 9, 2012

Works for me. Thanks!
