CompareMeans summary: Add pandas columns names #4774

adrienpacifico · 2018-07-03T16:41:15Z

Problem/question

When pandas DataFrames are passed as input, the rows of a CompareMeans summary are labeled "subset #1", ..., "subset #N".

Would it be possible to use the pandas column names instead?

Small example:

Code

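import pandas as pd
# Import the class from its submodule explicitly; a bare
# `import statsmodels.stats` is not guaranteed to load weightstats.
from statsmodels.stats.weightstats import CompareMeans

df1 = pd.DataFrame({"Age": [21, 23, 54], "Income": [200, 150, 600]})
df2 = pd.DataFrame({"Age": [30, 26, 34], "Income": [400, 250, 100]})

# One row per column: difference in means between df1 and df2
print(CompareMeans.from_data(df1, df2).summary())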

Output

                          Test for equality of means                          
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
subset #1      2.6667     10.929      0.244      0.819     -27.677      33.011
subset #2     66.6667    166.667      0.400      0.710    -396.074     529.408
==============================================================================

Expected

                          Test for equality of means                          
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Age            2.6667     10.929      0.244      0.819     -27.677      33.011
Income        66.6667    166.667      0.400      0.710    -396.074     529.408
==============================================================================

Solution?

Apparently the names are generated by this line:

statsmodels/statsmodels/stats/weightstats.py

Line 810 in 0b5cea5

xname = ['subset #%d'%(ii + 1) for ii in range(tstat.shape[0])]

What would be necessary to include the column names?

PS: statsmodels version 0.9.0


josef-pkt · 2018-07-12T11:17:17Z

Sorry, I'm always getting distracted before replying.

The first extension would be to add something like xname as a keyword argument in summary, similar to the results.summary method of the models (xname was initially used, but it's a misnomer).

Second, DescrStatsW will need to store the names before converting to numpy arrays.

Third, creating names for the contrasts needs to take several cases into account, because CompareMeans allows broadcasting and the column names do not have to be identical.
E.g. if both column names agree, as in the example here, then one name is enough; otherwise we need a combined name like "column1 - column2" (maybe similar to the tukey-hsd or pairwise contrast names).
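
To make the three steps concrete, here is a rough sketch of how the labels could be chosen. Everything in it is an assumption for illustration only: `get_names`, `contrast_names` and `row_labels` are hypothetical helpers, not existing statsmodels functions, and the broadcasting rule is simplified to "a single column is repeated". The stand-in objects only mimic the `columns` attribute of a DataFrame so that the snippet runs without pandas.

```python
from types import SimpleNamespace


def get_names(data):
    """Return column labels as strings, or None when the input has none.

    Hypothetical helper: it relies only on a ``columns`` attribute, which
    pandas DataFrames provide; plain arrays and lists yield None.
    """
    columns = getattr(data, "columns", None)
    if columns is None:
        return None
    return [str(c) for c in columns]


def contrast_names(names1, names2):
    """Combine two lists of column names into one label per contrast.

    A list of length 1 is repeated to match the other list (a simplified
    stand-in for broadcasting). Equal names collapse to a single name,
    otherwise the label is "name1 - name2".
    """
    n = max(len(names1), len(names2))
    if len(names1) == 1:
        names1 = names1 * n
    if len(names2) == 1:
        names2 = names2 * n
    if len(names1) != len(names2):
        raise ValueError("column names cannot be broadcast together")
    return [a if a == b else "%s - %s" % (a, b)
            for a, b in zip(names1, names2)]


def row_labels(n_rows, data1=None, data2=None, xname=None):
    """Pick row labels: explicit xname, then column names, then default."""
    if xname is not None:
        if len(xname) != n_rows:
            raise ValueError("xname must have one entry per row")
        return list(xname)
    names1, names2 = get_names(data1), get_names(data2)
    if names1 is not None and names2 is not None:
        labels = contrast_names(names1, names2)
        if len(labels) == n_rows:
            return labels
    # Same default labels as the line quoted from weightstats.py
    return ["subset #%d" % (ii + 1) for ii in range(n_rows)]


# Stand-ins for DataFrames; only the ``columns`` attribute matters here.
df1 = SimpleNamespace(columns=["Age", "Income"])
df2 = SimpleNamespace(columns=["Age", "Income"])
df3 = SimpleNamespace(columns=["Salary"])

print(row_labels(2, df1, df2))  # identical names: one name per row
print(row_labels(2, df1, df3))  # different names: "a - b" labels
print(row_labels(2))            # no names available: default labels
```

In a real implementation the names would have to be captured in DescrStatsW before the data is converted to numpy arrays, and an explicit xname passed to summary would take precedence, as in `row_labels` above.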

adrienpacifico changed the title ~~CompareMeans summary: Add pandas columns name~~ CompareMeans summary: Add pandas columns names Jul 3, 2018

josef-pkt added type-enh pandas-integration comp-stats labels Jul 12, 2018

josef-pkt added this to the 0.10 milestone Jul 12, 2018

stevenlis mentioned this issue Jan 28, 2019

suggestion for CompareMeans when passing DataFrame #5479

Open
