Permalink
Browse files

ENH: stats: Added the stats.contingency module containing some contin…

…gency table

functions.  The main function, chi2_contingency, is also added to the scipy.stats
namespace.  See pull request 34 and ticket 1203 for details.
  • Loading branch information...
WarrenWeckesser committed Aug 7, 2011
1 parent fa21e84 commit 6fb35072529bcc030b95efa33fb2f5dcaefbe74d
@@ -67,10 +67,15 @@ Additional special matrices (``scipy.linalg``)
The functions ``hilbert`` and ``invhilbert`` were added to ``scipy.linalg``.
-``stats.fisher_exact``
-----------------------
-
-The one-sided form of Fisher's exact test is now also implemented.
+Enhancements to ``scipy.stats``
+-------------------------------
+
+* The *one-sided form* of Fisher's exact test is now also implemented in
+ ``stats.fisher_exact``.
+* The function ``stats.chi2_contingency`` for computing the chi-square test of
+ independence of factors in a contingency table has been added, along with
+ the related utility functions ``stats.contingency.margins`` and
+ ``stats.contingency.expected_freq``.
Deprecated features
View
@@ -10,6 +10,7 @@
from morestats import *
from kde import gaussian_kde
import mstats
+from contingency import chi2_contingency
#remove vonmises_cython from __all__, I don't know why it is included
__all__ = filter(lambda s:not (s.startswith('_') or s.endswith('cython')),dir())
View
@@ -0,0 +1,239 @@
+"""Some functions for working with contingency tables (i.e. cross tabulations).
+"""
+
+# Author: Warren Weckesser, Enthought, Inc.
+
+import numpy as np
+from scipy import special
+
+
+__all__ = ['margins', 'expected_freq', 'chi2_contingency']
+
+
+def margins(a):
+ """Return a list of the marginal sums of the array `a`.
+
+ Parameters
+ ----------
+ a : ndarray
+ The array for which to compute the marginal sums.
+
+ Return Value
+ ------------
+ margsums : list of ndarrays
+ A list of length `a.ndim`. `margsums[k]` is the result
+ of summing `a` over all axes except `k`; it has the same
+ number of dimensions as `a`, but the length of each axis
+ except axis `k` will be 1.
+
+ Examples
+ --------
+ >>> a = np.arange(12).reshape(2, 6)
+ >>> a
+ array([[ 0, 1, 2, 3, 4, 5],
+ [ 6, 7, 8, 9, 10, 11]])
+ >>> m0, m1 = margins(a)
+ >>> m0
+ array([[15],
+ [51]])
+ >>> m1
+ array([[ 6, 8, 10, 12, 14, 16]])
+
+ >>> b = np.arange(24).reshape(2,3,4)
+ >>> m0, m1, m2 = margins(b)
+ >>> m0
+ array([[[ 66]],
+ [[210]]])
+ >>> m1
+ array([[[ 60],
+ [ 92],
+ [124]]])
+ >>> m2
+ array([[[60, 66, 72, 78]]])
+ """
+ margsums = []
+ ranged = range(a.ndim)
+ for k in ranged:
+ marg = np.apply_over_axes(np.sum, a, [j for j in ranged if j != k])
+ margsums.append(marg)
+ return margsums
+
+
+def expected_freq(observed):
+ """Compute the expected frequencies from a contingency table.
+
+ Given an n-dimensional contingency table of observed frequencies,
+ compute the expected frequencies for the table based on the marginal
+ sums under the assumption that the groups associated with each
+ dimension are independent.
+
+ Parameters
+ ----------
+ observed : array_like
+ The table of observed frequencies. (While this function can handle
+ a 1-D array, that case is trivial. Generally `observed` is at
+ least 2-D.)
+
+ Returns
+ -------
+ expected : ndarray of type numpy.float64, same shape as `observed`.
+ The expected frequencies, based on the marginal sums of the table.
+
+ Examples
+ --------
+ >>> observed = np.array([[10, 10, 20],[20, 20, 20]])
+ >>> expected_freq(observed)
+ array([[ 12., 12., 16.],
+ [ 18., 18., 24.]])
+ """
+ # Typically `observed` is an integer array. If `observed` has a large
+ # number of dimensions or holds large values, some of the following
+ # computations may overflow, so we first switch to floating point.
+ observed = np.asarray(observed, dtype=np.float64)
+
+ # Create a list of the marginal sums.
+ margsums = margins(observed)
+
+ # Create the array of expected frequencies. The shapes of the
+ # marginal sums returned by apply_over_axes() are just what we
+ # need for broadcasting in the following product.
+ d = observed.ndim
+ expected = reduce(np.multiply, margsums) / observed.sum() ** (d - 1)
+ return expected
+
+
+def chi2_contingency(observed, correction=True):
+ """Chi-square test of independence of variables in a contingency table.
+
+ This function computes the chi-square statistic and p-value for the
+ hypothesis test of independence of the observed frequencies in the
+ contingency table [1]_ `observed`. The expected frequencies are computed
+ based on the marginal sums under the assumption of independence;
+ see scipy.stats.expected_freq. The number of degrees of freedom is
+ (expressed using numpy functions and attributes)::
+
+ dof = observed.size - sum(observed.shape) + observed.ndim - 1
+
+
+ Parameters
+ ----------
+ observed : array_like
+ The contingency table. The table contains the observed frequencies
+ (i.e. number of occurrences) in each category. In the two-dimensional
+ case, the table is often described as an "R x C table".
+ correction : bool, optional
+ If True, *and* the degrees of freedom is 1, apply Yates' correction
+ for continuity.
+
+ Returns
+ -------
+ chi2 : float
+ The chi-square test statistic. Without the Yates' correction, this
+ is the sum of the squares of the observed values minus the expected
+ values, divided by the expected values. With Yates' correction,
+ 0.5 is subtracted from the squared differences before dividing by
+ the expected values.
+ p : float
+ The p-value of the test
+ dof : int
+ Degrees of freedom
+ expected : ndarray, same shape as `observed`
+ The expected frequencies, based on the marginal sums of the table.
+
+ See Also
+ --------
+ contingency.expected_freq
+ fisher_exact
+ chisquare
+
+ Notes
+ -----
+ An often quoted guideline for the validity of this calculation is that
+ the test should be used only if the observed and expected frequency in
+ each cell is at least 5.
+
+ This is a test for the independence of different categories of a
+ population. The test is only meaningful when the dimension of
+ `observed` is two or more. Applying the test to a one-dimensional
+ table will always result in `expected` equal to `observed` and a
+ chi-square statistic equal to 0.
+
+ This function does not handle masked arrays, because the calculation
+ does not make sense with missing values.
+
+ Like stats.chisquare, this function computes a chi-square statistic;
+ the convenience this function provides is to figure out the expected
+ frequencies and degrees of freedom from the given contingency table.
+ If these were already known, and if the Yates' correction was not
+ required, one could use stats.chisquare. That is, if one calls::
+
+ chi2, p, dof, ex = chi2_contingency(obs, correction=False)
+
+ then the following is true::
+
+ (chi2, p) == stats.chisquare(obs.ravel(), f_exp=ex.ravel(),
+ ddof=obs.size - 1 - dof)
+
+ References
+ ----------
+ .. [1] http://en.wikipedia.org/wiki/Contingency_table
+
+ Examples
+ --------
+ A two-way example (2 x 3):
+
+ >>> obs = np.array([[10, 10, 20], [20, 20, 20]])
+ >>> chi2_contingency(obs)
+ (2.7777777777777777,
+ 0.24935220877729619,
+ 2,
+ array([[ 12., 12., 16.],
+ [ 18., 18., 24.]]))
+
+ A four-way example (2 x 2 x 2 x 2):
+
+ >>> obs = np.array(
+ ... [[[[12, 17],
+ ... [11, 16]],
+ ... [[11, 12],
+ ... [15, 16]]],
+ ... [[[23, 15],
+ ... [30, 22]],
+ ... [[14, 17],
+ ... [15, 16]]]])
+ >>> chi2_contingency(obs)
+ (8.7584514426741897,
+ 0.64417725029295503,
+ 11,
+ array([[[[ 14.15462386, 14.15462386],
+ [ 16.49423111, 16.49423111]],
+ [[ 11.2461395 , 11.2461395 ],
+ [ 13.10500554, 13.10500554]]],
+ [[[ 19.5591166 , 19.5591166 ],
+ [ 22.79202844, 22.79202844]],
+ [[ 15.54012004, 15.54012004],
+ [ 18.10873492, 18.10873492]]]]))
+ """
+ observed = np.asarray(observed)
+ if np.any(observed < 0):
+ raise ValueError("All values in `table` must be nonnegative.")
+
+ expected = expected_freq(observed)
+ if np.any(expected == 0):
+ # Include one of the positions where expected is zero in
+ # the exception message.
+ zeropos = list(np.where(expected == 0)[0])
+ raise ValueError("The internally computed table of expected "
+ "frequencies has a zero element at %s." % zeropos)
+
+ # The degrees of freedom
+ dof = expected.size - sum(expected.shape) + expected.ndim - 1
+
+ if dof == 1 and correction:
+ # Use Yates' correction for continuity.
+ chi2 = ((np.abs(observed - expected) - 0.5) ** 2 / expected).sum()
+ else:
+ # Regular chi-square--no correction.
+ chi2 = ((observed - expected) ** 2 / expected).sum()
+ p = special.chdtrc(dof, chi2)
+ return chi2, p, dof, expected
View
@@ -235,7 +235,6 @@
f_oneway
pearsonr
spearmanr
- fisher_exact
pointbiserialr
kendalltau
linregress
@@ -269,6 +268,19 @@
mood
oneway
+Contingency table functions
+===========================
+
+.. autosummary::
+ :toctree: generated/
+
+ fisher_exact
+ chi2_contingency
+ contingency.expected_freq
+ contingency.margins
+
+General linear model
+====================
.. autosummary::
:toctree: generated/
View
@@ -2238,8 +2238,8 @@ def fisher_exact(table, alternative='two-sided'):
See Also
--------
- chisquare : inexact alternative that can be used when sample sizes are
- large enough.
+ chi2_contingency : Chi-square test of independence of variables in a
+ contingency table.
Notes
-----
@@ -2248,13 +2248,13 @@ def fisher_exact(table, alternative='two-sided'):
Likelihood Estimate", while R uses the "conditional Maximum Likelihood
Estimate".
- For tables with large numbers the (inexact) `chisquare` test can also be
- used.
+ For tables with large numbers the (inexact) chi-square test implemented
+ in the function `chi2_contingency` can also be used.
Examples
--------
Say we spend a few days counting whales and sharks in the Atlantic and
- Indian oceans. In the Atlantic ocean we find 6 whales and 1 shark, in the
+ Indian oceans. In the Atlantic ocean we find 8 whales and 1 shark, in the
Indian ocean 2 whales and 5 sharks. Then our contingency table is::
Atlantic Indian
Oops, something went wrong.

0 comments on commit 6fb3507

Please sign in to comment.