CompareMeans summary: Add pandas columns names #4774

Open
adrienpacifico opened this issue Jul 3, 2018 · 1 comment

Comments

@adrienpacifico
Contributor

adrienpacifico commented Jul 3, 2018

Problem/question

When pandas DataFrames are passed as input, the rows of the CompareMeans summary are labelled "subset #1", ..., "subset #N".

Would it be possible to use the pandas column names instead?

Small example:

Code

import pandas as pd
from statsmodels.stats.weightstats import CompareMeans

# Two samples that share the same column names
df1 = pd.DataFrame({"Age": [21, 23, 54], "Income": [200, 150, 600]})
df2 = pd.DataFrame({"Age": [30, 26, 34], "Income": [400, 250, 100]})

# Compare the column means of the two samples
print(CompareMeans.from_data(df1, df2).summary())

Output

                          Test for equality of means                          
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
subset #1      2.6667     10.929      0.244      0.819     -27.677      33.011
subset #2     66.6667    166.667      0.400      0.710    -396.074     529.408
==============================================================================

Expected

                          Test for equality of means                          
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Age            2.6667     10.929      0.244      0.819     -27.677      33.011
Income        66.6667    166.667      0.400      0.710    -396.074     529.408
==============================================================================

Solution?

Apparently the names are generated by this line:

xname = ['subset #%d'%(ii + 1) for ii in range(tstat.shape[0])]

What would be necessary to include the column names?

PS: statsmodels version: 0.9.0
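
Until the summary supports names, one possible workaround is to build the table by hand and index it with the DataFrame columns. The following is only a sketch: it assumes the pooled-variance t-test that the default summary appears to report, and the variable name `table` and its column headers are chosen here to mirror the summary output.

```python
import pandas as pd
from statsmodels.stats.weightstats import CompareMeans

df1 = pd.DataFrame({"Age": [21, 23, 54], "Income": [200, 150, 600]})
df2 = pd.DataFrame({"Age": [30, 26, 34], "Income": [400, 250, 100]})

cm = CompareMeans.from_data(df1, df2)

# Pooled-variance t-test and confidence interval for the mean difference
tstat, pvalue, dof = cm.ttest_ind(usevar="pooled")
low, upp = cm.tconfint_diff(alpha=0.05, usevar="pooled")

# Assemble the same columns as the summary, labelled by column name
table = pd.DataFrame(
    {
        "coef": cm.d1.mean - cm.d2.mean,
        "std err": cm.std_meandiff_pooledvar,
        "t": tstat,
        "P>|t|": pvalue,
        "[0.025": low,
        "0.975]": upp,
    },
    index=df1.columns,
)
print(table.round(3))
```

This only works as written when both DataFrames have the same columns in the same order, as in the example.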

@adrienpacifico adrienpacifico changed the title from "CompareMeans summary: Add pandas columns name" to "CompareMeans summary: Add pandas columns names" Jul 3, 2018
@josef-pkt josef-pkt added this to the 0.10 milestone Jul 12, 2018
@josef-pkt
Member

sorry, I'm always getting distracted before replying

The first extension would be to add something like xname as a keyword argument in summary, similar to the results.summary method of the models. (xname was initially used but it's a misnomer)
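
A minimal sketch of what such a keyword could do, kept separate from statsmodels itself. The helper `row_labels` and its signature are hypothetical; only the "subset #N" fallback is taken from the existing code.

```python
def row_labels(n_rows, xname=None):
    """Return one label per summary row.

    Hypothetical helper: falls back to the current 'subset #N' labels
    when no names are supplied by the caller.
    """
    if xname is None:
        return ["subset #%d" % (ii + 1) for ii in range(n_rows)]
    xname = list(xname)
    if len(xname) != n_rows:
        raise ValueError(
            "xname has %d entries, expected %d" % (len(xname), n_rows)
        )
    return xname


print(row_labels(2))
print(row_labels(2, xname=["Age", "Income"]))
```

Keeping the fallback means existing callers that pass no names would see unchanged output.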

Second, DescrStatsW will need to store the names before converting to numpy arrays.
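
One way to capture the names before the array conversion, as a sketch only. `get_column_names` is a hypothetical helper and not part of DescrStatsW; it relies on duck typing, so it assumes DataFrame-like inputs expose `columns` and Series-like inputs expose `name`.

```python
def get_column_names(data):
    """Return column names if `data` carries any, else None.

    Hypothetical helper, meant to be called before the input is
    converted to a plain NumPy array and the labels are lost.
    """
    columns = getattr(data, "columns", None)  # DataFrame-like input
    if columns is not None:
        return [str(c) for c in columns]
    name = getattr(data, "name", None)  # Series-like input
    if name is not None:
        return [str(name)]
    return None  # lists and arrays carry no names


# Plain nested lists have no labels, so nothing can be stored
print(get_column_names([[21, 200], [23, 150], [54, 600]]))
```

For the DataFrames in the example above, this would give `["Age", "Income"]`, which could then be kept as an attribute next to the converted data.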

Third, creating names for the contrasts needs to take several cases into account because CompareMeans allows broadcasting and the column names do not have to be identical.
E.g. if both column names agree, as in the example here, then one name is enough; otherwise we need a combination name like "column1 - column2". (maybe similar to tukey-hsd or pairwise contrast names)
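
The naming rule could be sketched as follows. `contrast_names` is a hypothetical helper, and it assumes NumPy-style broadcasting in which a single column on one side is compared against every column on the other.

```python
def contrast_names(names1, names2):
    """Build one label per contrast from two lists of column names.

    Hypothetical helper: identical names collapse to a single name,
    differing names become 'name1 - name2'.
    """
    n1, n2 = len(names1), len(names2)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples need at least one column name")
    if n1 != n2 and 1 not in (n1, n2):
        raise ValueError("cannot broadcast %d against %d columns" % (n1, n2))
    n = max(n1, n2)
    # Repeat a single name so that both sides have n entries
    left = list(names1) * (n // n1)
    right = list(names2) * (n // n2)
    return [a if a == b else "%s - %s" % (a, b) for a, b in zip(left, right)]


print(contrast_names(["Age", "Income"], ["Age", "Income"]))
print(contrast_names(["Age", "Income"], ["Years", "Income"]))
print(contrast_names(["Age", "Income"], ["Baseline"]))
```

Whether the one-name shortcut is desirable when only some of the columns match is a design choice that this sketch does not settle.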
