ENH: stats: add a few functions for working with contingency tables. #34

Closed
wants to merge 10 commits into
from

Projects

None yet

3 participants

@WarrenWeckesser
Collaborator

This is an updated version of the code from ticket #1203. The main function is conting_test_ind(obs) ("contingency table test for independence"), which computes the statistics for the hypothesis test of independence of a contingency table. See ticket #1203, and the link there to the discussion on the mailing list, for more details.

An earlier version of these functions is also included in the contingency table class attached to ticket #1258. That code will be easy to modify to use the code from this ticket. This separation keeps the code more modular: the code in this pull request includes just a few core computations, while the code in #1258 provides the ContingencyTable class with many more computational methods, along with methods for manipulating data sets.

@rgommers

Not sure I like this name, because this is not the only test of independence. Why not put chi2 in the name?

You could make correction=True the default, since this will give better results for small sample sizes. Even with sample sizes on the order of 100s the correction agrees better with fisher_exact.

Can you explain the relation with stats.chisquare?

@rgommers

I think it's clearer to spell out the name, so _contingency. You don't need to import from it as a users, so no problem it's longer.

@rgommers
Member

Looks good overall.

@WarrenWeckesser
Collaborator

All good suggestions, Ralf, thanks. I renamed the module to _contingency.py, and renamed the main test in the module to chi2_contingency. I also extended the docstring a bit (and put the sections in the right order), and updated the usual related doc files.

Warren Weckesser DOC: Update fisher_exact docstring to refer to chi2_contingency in it…
…s 'See Also' section, and fix a numerical typo in the example.
9108be8
@rgommers

That's a good addition to the docstring. I assume you thought about retrofitting this functionality to chisquare, and decided it wasn't possible? It does bother me a little to have two functions doing essentially the same thing, but I'm not sure there's an easy way around it.

@WarrenWeckesser
Collaborator

That issue was brought up in the megathread last year (thread starts here: http://mail.scipy.org/pipermail/scipy-dev/2010-June/014538.html)

I compared an API with one function to an API with two functions in this post: http://mail.scipy.org/pipermail/scipy-dev/2010-June/014927.html
I preferred two functions, as did Josef. I think it is clearer, since the two have different use cases (goodness of fit vs. test of independence), and handle their arguments differently.

@rgommers
Member

Ah yes, vaguely remember that thread. The reasoning of why two functions sounds reasonable.

Re-reading the chisquare() docstring after reading your second link, I think that that docstring would be much clearer if it mentioned "goodness-of-fit". Your new chi2 docstring is much clearer.

@rgommers

It's not necessary to add "scipy.stats" here. The doc build is smart enough to resolve the see also's if objects are in the same module, or higher up. Only if you want to reference an object from another module, say "ndimage.map_coordinates", then you'd add the prefix.

Collaborator

Fixed, thanks.

@rgommers
Member

Maybe Josef or someone else still wants to review, but looks polished to me.

@josef-pkt josef-pkt and 1 other commented on an outdated diff Jun 18, 2011
scipy/stats/_contingency.py
+ observed = np.asarray(observed, dtype=np.float64)
+
+ # Create a list of the marginal sums.
+ margsums = margins(observed)
+
+ # Create the array of expected frequencies. The shapes of the
+ # marginal sums returned by apply_over_axes() are just what we
+ # need for broadcasting in the following product.
+ d = observed.ndim
+ expected = reduce(np.multiply, margsums) / observed.sum() ** (d - 1)
+ return expected
+
+
+def chi2_contingency(observed, correction=True):
+ """Chi-square test of independence of observations in a contingency table.
+
@josef-pkt
josef-pkt Jun 18, 2011 SciPy member

maybe independence of variables
instead of observations

@WarrenWeckesser
WarrenWeckesser Aug 6, 2011 collaborator

Good idea--done.

@josef-pkt
Member

I'm all in favor as we discussed this last year.

The only thing I don't like so much are that the names. "margins" and "expected_freq" are too generic in the scipy.stats.* namespace, but I'm not sure what nice short alternatives would be, expected_freq_contingency is a bit too long.

maybe "_ct" postfix for contingency table, or "_table". (contingency_expected ?)
I don't know whether there will be also a pivot_table that might make "table" ambiguous as a term.

an aside: expected_freq doesn't remove axis like the usual numpy behavior for reduce operations, but I guess that this is more useful (practicality beats purity)

@rgommers
Member

Agreed that it's a bit generic. I like "_table" better than "_ct", the latter is not understandable without checking the docstring.

This shows that the scipy.stats namespace is way too large though, it has over 200 functions/objects. If one cannot think of clear, concise names then some reorganization may be in order.....

@josef-pkt
Member

another alternative, which also helps to move away from the big namespace, is to keep contingency as a public module and leave margins and expected_freq in their and not add them to the stats.* namespace.

chisquare_contingency is a main function, but margins and expected_freq are only supporting functions and if they are in stats.contingency.margins, stats.contingency.expected_freq the context of the names would be obvious.

This might also be the better way forward, when the contingency table class gets added, with maybe additional support functions.

@rgommers
Member

That would be the best solution imho. It probably requires at least a mention on the ML.

Warren Wecke... added some commits Aug 6, 2011
Warren Weckesser REF: Make contingency.py a public module, and don't pull its contents…
… into the scipy.stats namespace.

Author:    Warren Weckesser <warren.weckesser@enthought.com>
9a13400
Warren Weckesser DOC: A couple small changes to chi2_contingency's docstring. 7452a76
Warren Weckesser Add chi2_contingency to the scipy.stats namespace. a620150
@WarrenWeckesser
Collaborator

The last few commits make contingency a public module within stats; only chi2_contingency is imported into the scipy.stats namespace.

Josef, in your "aside" above, I think you meant to refer to 'margins', not 'expected_freq'. 'margins' intentionally doesn't eliminate the trivial dimensions that are left after the reduction, because the shapes are needed in the 'expected_freq' function.

@josef-pkt
Member

I just quickly skimmed it. I think this looks good. Thanks Warren

2 suggestions:

frequencies: I think of this usually as fractions/proportions not as counts, I would add "frequency counts" instead of frequencies in a few places (not throughout), especially the short description and in the Parameters.

combining one of the examples with an example of np.histogramdd that does the binning and counting could be useful

freq = np.histogramdd(np.random.randn(50,3), bins=3)
freq[0]
array([[[ 1., 0., 2.],
[ 2., 2., 1.],
[ 0., 0., 0.]],

   [[ 0.,  2.,  0.],
    [ 3.,  6.,  3.],
    [ 0.,  6.,  2.]],

   [[ 0.,  6.,  3.],
    [ 1.,  5.,  1.],
    [ 0.,  3.,  1.]]])

alternatively a tutorial section that also shows how to do counting for integers (no ready function available) might eventually be useful.

@WarrenWeckesser
Collaborator

Re: "frequencies": How about this in the chi2_contingency docstring:

observed : array_like
    The contingency table. The table contains the observed frequencies
    (i.e. number of occurrences) of each category.

Regarding your second suggestion: Good idea, but I'd rather add a section to the stats tutorial, where we can use data that is more meaningful than randn(50,3).

@josef-pkt
Member

change to docstring sounds good, (I would have added it to margins and expected_freq, but see below)

I agree about the tutorial, initially I checked the return of histogramdd just to see if it has the required format, but binning continuous variables is not the most common use case, most users will have categorical variables, I guess.

another idea because of "frequency":

'margins', 'expected_freq' are written for chi2 test and need counts for this (so they are fine for this), but for other applications I need the frequency distributions (fraction of total count). From reading your source, both functions should also work if the table is already normalized, (i.e. total sum = 1). Can you verify that?
(Initially I thought about a normalize=False keyword, but it doesn't seem necessary.)

@WarrenWeckesser
Collaborator

'margins', 'expected_freq' are written for chi2 test and need counts for this (so they are fine for this), but for other
applications I need the frequency distributions (fraction of total count). From reading your source, both functions
should also work if the table is already normalized, (i.e. total sum = 1). Can you verify that?

Yes, 'margins' and 'expected_freq' work as expected in this case.

@WarrenWeckesser
Collaborator

Committed in 6fb3507.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment