BUG tukeyhsd nan in group labels #1890

Closed
josef-pkt opened this Issue Aug 16, 2014 · 3 comments

@josef-pkt
Member

Found while checking and fixing #1561, using the example notebook from a comment in that issue.

Integer group labels show up as nans in the tukeyhsd summary when the data comes from a pandas DataFrame/Series.
The first case copied below (In [6]) uses recarrays and is ok; the second (In [7]) uses a pandas DataFrame and shows nans.

The example with string/bytes labels looks completely messed up with a pandas DataFrame: the summary has more lines than pairs, and nans as group/level labels (not copied here).

pandas.__version__ is '0.12.0'

In [6]:

mc = multi.MultiComparison(dta['Rust'], dta['Brand'])
res = mc.tukeyhsd()
print(res.summary())
Multiple Comparison of Means - Tukey HSD,FWER=0.05
===============================================
group1 group2 meandiff  lower    upper   reject
-----------------------------------------------
  1      2      46.3   43.3155  49.2845   True 
  1      3     24.81   21.8255  27.7945   True 
  1      4     -2.67   -5.6545   0.3145  False 
  2      3     -21.49  -24.4745 -18.5055  True 
  2      4     -48.97  -51.9545 -45.9855  True 
  3      4     -27.48  -30.4645 -24.4955  True 
-----------------------------------------------
In [7]:

mc = multi.MultiComparison(dta1['Rust'], dta1['Brand'])#.values)
res = mc.tukeyhsd()
print(res.summary())
Multiple Comparison of Means - Tukey HSD,FWER=0.05
===============================================
group1 group2 meandiff  lower    upper   reject
-----------------------------------------------
 1.0    nan     46.3   43.3155  49.2845   True 
 1.0    nan    24.81   21.8255  27.7945   True 
 1.0    nan    -2.67   -5.6545   0.3145  False 
 nan    nan    -21.49  -24.4745 -18.5055  True 
 nan    nan    -48.97  -51.9545 -45.9855  True 
 nan    nan    -27.48  -30.4645 -24.4955  True 
-----------------------------------------------
@josef-pkt
Member

Bug: I call np.asarray on groups and store the result, but then pass the original groups to np.unique:

        self.groups = np.asarray(groups)

        # Allow for user-provided sorting of groups
        if group_order is None:
            self.groupsunique, self.groupintlab = np.unique(groups,
                                                            return_inverse=True)

As a result, groupsunique is a pandas.Series, not an ndarray.

No idea which pandas version I was using when this worked in the gist notebook:
http://nbviewer.ipython.org/gist/josef-pkt/10000517

@josef-pkt
Member

np.unique on a pandas.Series for groups produces nonsense:

In [27]:

res2.groupsunique
Out[27]:
Treatment
29            medical
20           physical
8              mental
0                 NaN
3              mental
19           physical
10             mental
15           physical
Name: Treatment, dtype: object
@josef-pkt
Member

Using
self.groups = groups = np.asarray(groups)
in MultiComparison.__init__, so that pandas objects are not used internally, fixes the bugs that show up in the notebook.
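
A minimal sketch of the idea behind the fix, not the actual statsmodels code: convert the group labels to a plain ndarray once, and use only that array afterwards. The helper name `encode_groups` is hypothetical; the variable names `groupsunique` and `groupintlab` follow the snippet quoted above.

```python
import numpy as np


def encode_groups(groups):
    """Return (unique labels, integer codes) for a 1-D sequence of group labels.

    Hypothetical helper for illustration only; it is not part of statsmodels.
    """
    # Rebind the name to the converted array so that later calls cannot
    # accidentally pick up the original object (e.g. a pandas Series).
    groups = np.asarray(groups)

    # Both return values are now plain ndarrays.
    groupsunique, groupintlab = np.unique(groups, return_inverse=True)
    return groupsunique, groupintlab


labels, codes = encode_groups([2, 1, 3, 1, 4, 2])
print(labels)  # [1 2 3 4]
print(codes)   # [1 0 2 0 3 1]
```

Rebinding `groups` to the converted array is the design point: any array-like input (list, recarray column, pandas Series) goes through the same code path after the first line.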

josef-pkt added a commit to josef-pkt/statsmodels that referenced this issue Aug 16, 2014
    BUG: MultiComparison use array for groups closes #1890 177b7c3
josef-pkt closed this in #1895 Aug 16, 2014
josef-pkt added this to the 0.6 milestone Aug 16, 2014
PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this issue Sep 2, 2014
    BUG: MultiComparison use array for groups closes #1890 8e77ddd