BUG tukeyhsd nan in group labels #1890

Closed
josef-pkt opened this issue Aug 16, 2014 · 3 comments

@josef-pkt (Member) commented Aug 16, 2014

Found while checking and fixing #1561, using the example notebook from a comment in that issue.

Integer group labels are shown as nan when the data is passed as a pandas DataFrame/Series.
The first case copied below uses recarrays and is OK; the second (In [7]) uses a pandas DataFrame and shows nans.

The example with string/bytes labels looks completely messed up with a pandas DataFrame: the summary has more lines than pairs, and nans appear as group/level labels (not copied here).

pandas.__version__ is '0.12.0'

In [6]:

mc = multi.MultiComparison(dta['Rust'], dta['Brand'])
res = mc.tukeyhsd()
print(res.summary())
Multiple Comparison of Means - Tukey HSD,FWER=0.05
===============================================
group1 group2 meandiff  lower    upper   reject
-----------------------------------------------
  1      2      46.3   43.3155  49.2845   True 
  1      3     24.81   21.8255  27.7945   True 
  1      4     -2.67   -5.6545   0.3145  False 
  2      3     -21.49  -24.4745 -18.5055  True 
  2      4     -48.97  -51.9545 -45.9855  True 
  3      4     -27.48  -30.4645 -24.4955  True 
-----------------------------------------------
In [7]:

mc = multi.MultiComparison(dta1['Rust'], dta1['Brand'])#.values)
res = mc.tukeyhsd()
print(res.summary())
Multiple Comparison of Means - Tukey HSD,FWER=0.05
===============================================
group1 group2 meandiff  lower    upper   reject
-----------------------------------------------
 1.0    nan     46.3   43.3155  49.2845   True 
 1.0    nan    24.81   21.8255  27.7945   True 
 1.0    nan    -2.67   -5.6545   0.3145  False 
 nan    nan    -21.49  -24.4745 -18.5055  True 
 nan    nan    -48.97  -51.9545 -45.9855  True 
 nan    nan    -27.48  -30.4645 -24.4955  True 
-----------------------------------------------

@josef-pkt (Member, Author) commented Aug 16, 2014

Bug: I call np.asarray on groups, but then pass the original groups to np.unique:

        self.groups = np.asarray(groups)

        # Allow for user-provided sorting of groups
        if group_order is None:
            self.groupsunique, self.groupintlab = np.unique(groups,
                                                            return_inverse=True)

groupsunique is then a pandas.Series, not an ndarray.

No idea which pandas version I was using that worked in the gist notebook:
http://nbviewer.ipython.org/gist/josef-pkt/10000517

@josef-pkt (Member, Author) commented Aug 16, 2014

np.unique on a pandas.Series for groups produces nonsense:

In [27]:

res2.groupsunique
Out[27]:
Treatment
29            medical
20           physical
8              mental
0                 NaN
3              mental
19           physical
10             mental
15           physical
Name: Treatment, dtype: object

@josef-pkt (Member, Author) commented Aug 16, 2014

Using
self.groups = groups = np.asarray(groups)
in MultiComparison.__init__, so that pandas objects are not used internally, fixes the bugs that show up in the notebook.
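
A minimal sketch of that idea outside of statsmodels, for illustration only. The helper name `unique_group_labels` is hypothetical and is not part of `MultiComparison`; it only shows the pattern of coercing the group labels to an ndarray first and then passing that same array to `np.unique`, so that both results are plain ndarrays whatever container the caller passed in.

```python
import numpy as np


def unique_group_labels(groups):
    """Hypothetical helper: coerce first, then compute unique labels.

    Rebinding ``groups`` to the ndarray guarantees that np.unique
    sees the array and not the original container (list, Series, ...).
    """
    groups = np.asarray(groups)
    # groupsunique: sorted unique labels
    # groupintlab: for each observation, the index of its label in groupsunique
    groupsunique, groupintlab = np.unique(groups, return_inverse=True)
    return groups, groupsunique, groupintlab


groups, groupsunique, groupintlab = unique_group_labels([2, 1, 3, 1, 4, 2])
print(groupsunique)  # [1 2 3 4]
print(groupintlab)   # [1 0 2 0 3 1]
```

Indexing `groupsunique[groupintlab]` reproduces the original labels, which is a quick sanity check that the labels and the integer codes belong together.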

josef-pkt added a commit to josef-pkt/statsmodels that referenced this issue Aug 16, 2014

josef-pkt added this to the 0.6 milestone Aug 16, 2014

PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this issue Sep 2, 2014
