Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Loading…

Review and clean up all Box-Cox functions #3080

Merged
merged 11 commits into from

3 participants

@rgommers
Owner

No description provided.

@coveralls

Coverage Status

Coverage remained the same when pulling fd448b9 on rgommers:boxcox into 572aaf0 on scipy:master.

scipy/stats/morestats.py
@@ -497,52 +497,136 @@ def ppcc_plot(x,a,b,dist='tukeylambda', plot=None, N=80):
def boxcox_llf(lmb, data):
- """The boxcox log-likelihood function.
+ r"""The boxcox log-likelihood function.
+
+ Parameters
+ ----------
+ lmb : scalar
+ Parameter for Box-Cox transformation. See `boxcox` for details.
+ data : array_like
+ Data to calculate Box-Cox log-likelihood for.
+
+ Returns
+ -------
+ llf : float
+ Box-Cox log-likelihood of `data` given `lmbda`.
@WarrenWeckesser Collaborator

Should say "given lmb."

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
scipy/stats/morestats.py
((10 lines not shown))
+ Parameter for Box-Cox transformation. See `boxcox` for details.
+ data : array_like
+ Data to calculate Box-Cox log-likelihood for.
+
+ Returns
+ -------
+ llf : float
+ Box-Cox log-likelihood of `data` given `lmbda`.
+
+ See Also
+ --------
+ boxcox, probplot, boxcox_normplot, boxcox_normmax
+
+ Notes
+ -----
+ The Box-Cox log likelihood function is defined here as
@WarrenWeckesser Collaborator

"log-likelihood" (missing dash)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
scipy/stats/morestats.py
@@ -497,52 +497,136 @@ def ppcc_plot(x,a,b,dist='tukeylambda', plot=None, N=80):
def boxcox_llf(lmb, data):
- """The boxcox log-likelihood function.
+ r"""The boxcox log-likelihood function.
+
+ Parameters
+ ----------
+ lmb : scalar
+ Parameter for Box-Cox transformation. See `boxcox` for details.
+ data : array_like
+ Data to calculate Box-Cox log-likelihood for.
@WarrenWeckesser Collaborator

Needs a sentence explaining that if data is n-dimensional, n > 1, the function is applied along the first axis.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
scipy/stats/morestats.py
@@ -566,64 +649,293 @@ def boxcox(x,lmbda=None,alpha=None):
tuple of floats represents the minimum and maximum confidence limits
given `alpha`.
+ See Also
+ --------
+ probplot, boxcox_normplot, boxcox_normmax, boxcox_llf
+
+ Notes
+ -----
+ The Box-Cox transform is given by::
+
+ y = (x**lmbda - 1) / lmbda, for lmbda > 0
+ log(x), for lmbda = 0
+
+ `boxcox` requires the input data to be positives. Sometimes a Box-Cox
@WarrenWeckesser Collaborator

"... positive." (Not plural.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
scipy/stats/morestats.py
((59 lines not shown))
"""
+ x = np.asarray(x)
+ if x.size == 0:
+ return x
+
if any(x < 0):
@WarrenWeckesser Collaborator

Should be any(x <= 0)

Edit: It appears that 0 works, but it generates spurious RuntimeWarnings. If 0 is acceptable, these warnings should not occur.

@rgommers Owner

Should indeed be <= 0. Zero doesn't work for lmbda = 0, would return log(0). Positive data is standard requirement for this function also in R, Matlab.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
scipy/stats/morestats.py
((303 lines not shown))
- plot.title('Box-Cox Normality Plot')
- plot.xlabel('Prob Plot Corr. Coef.')
- plot.ylabel('Transformation parameter')
+ plot.plot(lmbdas, ppcc, 'x')
+ try:
+ if hasattr(plot, 'set_title'):
+ # Matplotlib Axes instance or something that looks like it
+ plot.set_title('Box-Cox Normality Plot')
+ plot.set_ylabel('Prob Plot Corr. Coef.')
+ plot.set_xlabel('$\lambda$')
+ else:
+ # matplotlib.pyplot module
+ plot.title('Box-Cox Normality Plot')
+ plot.ylabel('Prob Plot Corr. Coef.')
+ plot.xlabel('$\lambda$')
+ except:
@WarrenWeckesser Collaborator

Can we be more selective here? Perhaps except AttributeError:? At least it should be except Exception:, and not an empty except:.

@rgommers Owner

I'll change it to except Exception, but note that the difference is very minor (zero for py3k). There's nothing wrong with bare excepts for things like this and for conditional imports for example.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@WarrenWeckesser
Collaborator

This is great, Ralf! I made a few comments inline.

I think the method="all" option in boxcox_normmax is unnecessary. It is not a lot of code, but I don't think the small increase in convenience is worth it. But that's a judgment call, and I wouldn't hold up merging because of it.

I noticed a bug in boxcox: using a negative integer lambda with an integer array gives incorrect results:

In [27]: x
Out[27]: array([1, 2, 3])

In [28]: boxcox(x, -1)
Out[28]: array([-0.,  1.,  1.])

In [29]: boxcox(x, -1.0)
Out[29]: array([-0.        ,  0.5       ,  0.66666667])

In [30]: boxcox(x, -2)
Out[30]: array([-0. ,  0.5,  0.5])

In [31]: boxcox(x, -2.0)
Out[31]: array([-0.        ,  0.375     ,  0.44444444])
@WarrenWeckesser
Collaborator

I think the core Box-Cox function,

    y = np.where(lmbda == 0, log(x), (x**lmbda - 1) / lmbda)

should be implemented as a ufunc, and perhaps added to scipy.special. This would eliminate some of the spurious Runtime Warnings generated when a value is zero, and it could be reused in stats.power_divergence.

@WarrenWeckesser
Collaborator

I created a pull request that implements the core Box-Cox transformation as a ufunc in scipy.special: #3150

scipy/stats/morestats.py
((67 lines not shown))
if lmbda is not None: # single transformation
- lmbda = lmbda*(x == x)
- y = where(lmbda == 0, log(x), (x**lmbda - 1)/lmbda)
+ lmbda = lmbda * (x == x) # equal to ``lmbda * np.ones(x.shape)``
+ y = np.where(lmbda == 0, log(x), (x**lmbda - 1) / lmbda)
@WarrenWeckesser Collaborator

#3150 was merged, so this line and the previous can be replaced by y = special.boxcox(x, lmbda).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
rgommers added some commits
@rgommers rgommers DOC: complete stats.boxcox docstring. Add add boxcox functions to ref…
…guide.

Also do a PEP8 cleanup of `boxcox`.
4c5eeef
@rgommers rgommers TST: add tests fox stats.boxcox. 1f925d3
@rgommers rgommers DOC: complete stats.boxcox_llf docstring. Also PEP8ify that function. 6fc2476
@rgommers rgommers TST: add tests for stats.boxcox_llf. Also add some input parameter ha…
…ndling.

Add also a reference to the original paper of Box and Cox.
51b71e0
@rgommers rgommers DOC: clean up and document stats.boxcox_normplot.
Also some PEP8 cleanups in related functions.
1bebbce
@rgommers rgommers TST: add tests for stats.boxcox_normplot. 913485f
@rgommers rgommers BUG: fix bug in confidence interval estimation of stats.boxcox. a222aa6
@rgommers rgommers TST: finish stats.boxcox_llf test for 2-D input. Fix bug in it.
boxcox() doesn't accept 2-D input, leave it like that.
f44fef2
@rgommers rgommers ENH: add tests and docstring for stats.boxcox_normmax().
Also refactor boxcox and boxcox_normmax so MLE can be used from the latter.

Changes the `brack` keyword in boxcox_normmax; it differed from that used in
boxcox(), keeping boxcox() results unchanged is preferred here.

This finishes the Statistics Review for all Box-Cox functions:
Closes gh-680
Closes gh-681
Closes gh-682
Closes gh-683
Closes gh-684
23dcf28
@rgommers
Owner

Re method='all': there are more methods that can easily be implemented, I just didn't get around to it. There's an R function (can't find it so quickly) that has seven options. So I'd like to keep 'all'.

@rgommers
Owner

Thanks for the review Warren. Addressed all your comments in the last commit.

scipy/stats/morestats.py
((211 lines not shown))
+
+ methods = {'pearsonr': _pearsonr,
+ 'mle': _mle,
+ 'all': _all}
+ if not method in methods.keys():
+ raise ValueError("Method %s not recognized." % method)
+
+ optimfunc = methods[method]
+ return optimfunc(x, brack)
+
+
+def boxcox_normplot(x, la, lb, plot=None, N=80):
+ """Compute parameters for a Box-Cox normality plot, optionally show it.
+
+ A Box-Cox normality plot shows graphically what the best transformation
+ parameter is to use in `boxcox` for obtained a distribution that is close
@WarrenWeckesser Collaborator

"... to obtain a distribution ..." or perhaps "... for obtaining a distribution ..."

@rgommers Owner

fixed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@coveralls

Coverage Status

Coverage remained the same when pulling 1e1f103 on rgommers:boxcox into 881fadf on scipy:master.

@WarrenWeckesser
Collaborator

Did the Travis tests run? I don't see the usual Travis status bar.

@rgommers
Owner

Yes they did for the second to last commit. The last commit is a one line change to a docstring, so I added [ci skip] to the commit message to not trigger Travis for it (no need to waste resources for zero code changes).

@WarrenWeckesser
Collaborator

OK, I see that now (and I just noticed the little green check mark next to the penultimate commit).

Merging.

@WarrenWeckesser WarrenWeckesser merged commit 757211a into scipy:master
@rgommers rgommers deleted the rgommers:boxcox branch
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Commits on Dec 15, 2013
  1. @rgommers

    DOC: complete stats.boxcox docstring. Add add boxcox functions to ref…

    rgommers authored
    …guide.
    
    Also do a PEP8 cleanup of `boxcox`.
  2. @rgommers
  3. @rgommers
  4. @rgommers

    TST: add tests for stats.boxcox_llf. Also add some input parameter ha…

    rgommers authored
    …ndling.
    
    Add also a reference to the original paper of Box and Cox.
  5. @rgommers

    DOC: clean up and document stats.boxcox_normplot.

    rgommers authored
    Also some PEP8 cleanups in related functions.
  6. @rgommers
  7. @rgommers
  8. @rgommers

    TST: finish stats.boxcox_llf test for 2-D input. Fix bug in it.

    rgommers authored
    boxcox() doesn't accept 2-D input, leave it like that.
  9. @rgommers

    ENH: add tests and docstring for stats.boxcox_normmax().

    rgommers authored
    Also refactor boxcox and boxcox_normmax so MLE can be used from the latter.
    
    Changes the `brack` keyword in boxcox_normmax; it differed from that used in
    boxcox(), keeping boxcox() results unchanged is preferred here.
    
    This finishes the Statistics Review for all Box-Cox functions:
    Closes gh-680
    Closes gh-681
    Closes gh-682
    Closes gh-683
    Closes gh-684
  10. @rgommers
  11. @rgommers

    MAINT: fix grammar in stats.boxcox docstring.

    rgommers authored
    Addresses review comment on gh-3080.
    
    [ci skip]
This page is out of date. Refresh to see the latest.
View
8 scipy/stats/__init__.py
@@ -275,6 +275,13 @@
fligner
mood
+.. autosummary::
+ :toctree: generated/
+
+ boxcox
+ boxcox_normmax
+ boxcox_llf
+
Contingency table functions
===========================
@@ -295,6 +302,7 @@
ppcc_max
ppcc_plot
probplot
+ boxcox_normplot
Masked statistics functions
View
448 scipy/stats/morestats.py
@@ -9,12 +9,13 @@
import numpy as np
from numpy import (isscalar, r_, log, sum, around, unique, asarray,
- zeros, arange, sort, amin, amax, any, where, atleast_1d, sqrt, ceil,
+ zeros, arange, sort, amin, amax, any, atleast_1d, sqrt, ceil,
floor, array, poly1d, compress, not_equal, pi, exp, ravel, angle)
from numpy.testing.decorators import setastest
from scipy.lib.six import string_types
from scipy import optimize
+from scipy import special
from . import statlib
from . import stats
from .stats import find_repeats
@@ -336,7 +337,7 @@ def probplot(x, sparams=(), dist='norm', fit=True, plot=None):
Notes
-----
Even if `plot` is given, the figure is not shown or saved by `probplot`;
- ``plot.show()`` or ``plot.savefig('figname.png')`` should be used after
+ ``plt.show()`` or ``plt.savefig('figname.png')`` should be used after
calling `probplot`.
`probplot` generates a probability plot, which should not be confused with
@@ -497,52 +498,139 @@ def ppcc_plot(x,a,b,dist='tukeylambda', plot=None, N=80):
def boxcox_llf(lmb, data):
- """The boxcox log-likelihood function.
+ r"""The boxcox log-likelihood function.
+
+ Parameters
+ ----------
+ lmb : scalar
+ Parameter for Box-Cox transformation. See `boxcox` for details.
+ data : array_like
+ Data to calculate Box-Cox log-likelihood for. If `data` is
+ multi-dimensional, the log-likelihood is calculated along the first
+ axis.
+
+ Returns
+ -------
+ llf : float or ndarray
+ Box-Cox log-likelihood of `data` given `lmb`. A float for 1-D `data`,
+ an array otherwise.
+
+ See Also
+ --------
+ boxcox, probplot, boxcox_normplot, boxcox_normmax
+
+ Notes
+ -----
+ The Box-Cox log-likelihood function is defined here as
+
+ .. math::
+
+ llf = (\lambda - 1) \sum_i(\log(x_i)) -
+ N/2 \log(\sum_i (y_i - \bar{y})^2 / N),
+
+ where ``y`` is the Box-Cox transformed input data ``x``.
+
+ Examples
+ --------
+ >>> from scipy import stats
+ >>> import matplotlib.pyplot as plt
+ >>> from mpl_toolkits.axes_grid1.inset_locator import inset_axes
+ >>> np.random.seed(1245)
+
+ Generate some random variates and calculate Box-Cox log-likelihood values
+ for them for a range of ``lmbda`` values:
+
+ >>> x = stats.loggamma.rvs(5, loc=10, size=1000)
+ >>> lmbdas = np.linspace(-2, 10)
+ >>> llf = np.zeros(lmbdas.shape, dtype=np.float)
+ >>> for ii, lmbda in enumerate(lmbdas):
+ ... llf[ii] = stats.boxcox_llf(lmbda, x)
+
+ Also find the optimal lmbda value with `boxcox`:
+
+ >>> x_most_normal, lmbda_optimal = stats.boxcox(x)
+
+ Plot the log-likelihood as function of lmbda. Add the optimal lmbda as a
+ horizontal line to check that that's really the optimum:
+
+ >>> fig = plt.figure()
+ >>> ax = fig.add_subplot(111)
+ >>> ax.plot(lmbdas, llf, 'b.-')
+ >>> ax.axhline(stats.boxcox_llf(lmbda_optimal, x), color='r')
+ >>> ax.set_xlabel('lmbda parameter')
+ >>> ax.set_ylabel('Box-Cox log-likelihood')
+
+ Now add some probability plots to show that where the log-likelihood is
+ maximized the data transformed with `boxcox` looks closest to normal:
+
+ >>> locs = [3, 10, 4] # 'lower left', 'center', 'lower right'
+ >>> for lmbda, loc in zip([-1, lmbda_optimal, 9], locs):
+ ... xt = stats.boxcox(x, lmbda=lmbda)
+ ... (osm, osr), (slope, intercept, r_sq) = stats.probplot(xt)
+ ... ax_inset = inset_axes(ax, width="20%", height="20%", loc=loc)
+ ... ax_inset.plot(osm, osr, 'c.', osm, slope*osm + intercept, 'k-')
+ ... ax_inset.set_xticklabels([])
+ ... ax_inset.set_yticklabels([])
+ ... ax_inset.set_title('$\lambda=%1.2f$' % lmbda)
+
+ >>> plt.show()
+
"""
- N = len(data)
- y = boxcox(data,lmb)
- my = np.mean(y, axis=0)
- f = (lmb-1)*sum(log(data),axis=0)
- f -= N/2.0*log(sum((y-my)**2.0/N,axis=0))
- return f
+ data = np.asarray(data)
+ N = data.shape[0]
+ if N == 0:
+ return np.nan
+
+ y = boxcox(data, lmb)
+ y_mean = np.mean(y, axis=0)
+ llf = (lmb - 1) * np.sum(np.log(data), axis=0)
+ llf -= N / 2.0 * np.log(np.sum((y - y_mean)**2. / N, axis=0))
+ return llf
def _boxcox_conf_interval(x, lmax, alpha):
# Need to find the lambda for which
# f(x,lmbda) >= f(x,lmax) - 0.5*chi^2_alpha;1
- fac = 0.5*distributions.chi2.ppf(1-alpha,1)
- target = boxcox_llf(lmax,x)-fac
+ fac = 0.5 * distributions.chi2.ppf(1 - alpha, 1)
+ target = boxcox_llf(lmax, x) - fac
+
+ def rootfunc(lmbda, data, target):
+ return boxcox_llf(lmbda, data) - target
- def rootfunc(lmbda,data,target):
- return boxcox_llf(lmbda,data) - target
- # Find positive endpont
- newlm = lmax+0.5
+ # Find positive endpoint of interval in which answer is to be found
+ newlm = lmax + 0.5
N = 0
- while (rootfunc(newlm,x,target) > 0.0) and (N < 500):
+ while (rootfunc(newlm, x, target) > 0.0) and (N < 500):
newlm += 0.1
N += 1
+
if N == 500:
raise RuntimeError("Could not find endpoint.")
- lmplus = optimize.brentq(rootfunc,lmax,newlm,args=(x,target))
- newlm = lmax-0.5
+
+ lmplus = optimize.brentq(rootfunc, lmax, newlm, args=(x, target))
+
+ # Now find negative interval in the same way
+ newlm = lmax - 0.5
N = 0
- while (rootfunc(newlm,x,target) > 0.0) and (N < 500):
- newlm += 0.1
+ while (rootfunc(newlm, x, target) > 0.0) and (N < 500):
+ newlm -= 0.1
N += 1
+
if N == 500:
raise RuntimeError("Could not find endpoint.")
- lmminus = optimize.brentq(rootfunc, newlm, lmax, args=(x,target))
+
+ lmminus = optimize.brentq(rootfunc, newlm, lmax, args=(x, target))
return lmminus, lmplus
-def boxcox(x,lmbda=None,alpha=None):
- """
+def boxcox(x, lmbda=None, alpha=None):
+ r"""
Return a positive dataset transformed by a Box-Cox power transformation.
Parameters
----------
x : ndarray
- Input array.
+ Input array. Should be 1-dimensional.
lmbda : {None, scalar}, optional
If `lmbda` is not None, do the transformation for that value.
@@ -551,8 +639,7 @@ def boxcox(x,lmbda=None,alpha=None):
alpha : {None, float}, optional
If `alpha` is not None, return the ``100 * (1-alpha)%`` confidence
interval for `lmbda` as the third output argument.
-
- If `alpha` is not None it must be between 0.0 and 1.0.
+ Must be between 0.0 and 1.0.
Returns
-------
@@ -566,64 +653,291 @@ def boxcox(x,lmbda=None,alpha=None):
tuple of floats represents the minimum and maximum confidence limits
given `alpha`.
+ See Also
+ --------
+ probplot, boxcox_normplot, boxcox_normmax, boxcox_llf
+
+ Notes
+ -----
+ The Box-Cox transform is given by::
+
+ y = (x**lmbda - 1) / lmbda, for lmbda > 0
+ log(x), for lmbda = 0
+
+ `boxcox` requires the input data to be positive. Sometimes a Box-Cox
+ transformation provides a shift parameter to achieve this; `boxcox` does
+ not. Such a shift parameter is equivalent to adding a positive constant to
+ `x` before calling `boxcox`.
+
+ The confidence limits returned when `alpha` is provided give the interval
+ where:
+
+ .. math::
+
+ llf(\hat{\lambda}) - llf(\lambda) < \frac{1}{2}\chi^2(1 - \alpha, 1),
+
+ with ``llf`` the log-likelihood function and :math:`\chi^2` the chi-squared
+ function.
+
+ References
+ ----------
+ G.E.P. Box and D.R. Cox, "An Analysis of Transformations", Journal of the
+ Royal Statistical Society B, 26, 211-252 (1964).
+
+ Examples
+ --------
+ >>> from scipy import stats
+ >>> import matplotlib.pyplot as plt
+
+ We generate some random variates from a non-normal distribution and make a
+ probability plot for it, to show it is non-normal in the tails:
+
+ >>> fig = plt.figure()
+ >>> ax1 = fig.add_subplot(211)
+ >>> x = stats.loggamma.rvs(5, size=500) + 5
+ >>> stats.probplot(x, dist=stats.norm, plot=ax1)
+ >>> ax1.set_xlabel('')
+ >>> ax1.set_title('Probplot against normal distribution')
+
+ We now use `boxcox` to transform the data so it's closest to normal:
+
+ >>> ax2 = fig.add_subplot(212)
+ >>> xt, _ = stats.boxcox(x)
+ >>> stats.probplot(xt, dist=stats.norm, plot=ax2)
+ >>> ax2.set_title('Probplot after Box-Cox transformation')
+
+ >>> plt.show()
+
"""
- if any(x < 0):
+ x = np.asarray(x)
+ if x.size == 0:
+ return x
+
+ if any(x <= 0):
raise ValueError("Data must be positive.")
+
if lmbda is not None: # single transformation
- lmbda = lmbda*(x == x)
- y = where(lmbda == 0, log(x), (x**lmbda - 1)/lmbda)
- return y
-
- # Otherwise find the lmbda that maximizes the log-likelihood function.
- def tempfunc(lmb, data): # function to minimize
- return -boxcox_llf(lmb,data)
- lmax = optimize.brent(tempfunc, brack=(-2.0,2.0),args=(x,))
+ return special.boxcox(x, lmbda)
+
+ # If lmbda=None, find the lmbda that maximizes the log-likelihood function.
+ lmax = boxcox_normmax(x, method='mle')
y = boxcox(x, lmax)
+
if alpha is None:
return y, lmax
- # Otherwise find confidence interval
- interval = _boxcox_conf_interval(x, lmax, alpha)
- return y, lmax, interval
+ else:
+ # Find confidence interval
+ interval = _boxcox_conf_interval(x, lmax, alpha)
+ return y, lmax, interval
-def boxcox_normmax(x,brack=(-1.0,1.0)):
- osm_uniform = _calc_uniform_order_statistic_medians(x)
- # this function computes the x-axis values of the probability plot
- # and computes a linear regression (including the correlation)
- # and returns 1-r so that a minimization function maximizes the
- # correlation
- xvals = distributions.norm.ppf(osm_uniform)
+def boxcox_normmax(x, brack=(-2.0, 2.0), method='pearsonr'):
+ """Compute optimal Box-Cox transform parameter for input data.
- def tempfunc(lmbda, xvals, samps):
- y = boxcox(samps,lmbda)
- yvals = sort(y)
- r, prob = stats.pearsonr(xvals, yvals)
- return 1-r
+ Parameters
+ ----------
+ x : array_like
+ Input array.
+ brack : 2-tuple, optional
+ The starting interval for a downhill bracket search with
+ `optimize.brent`. Note that this is in most cases not critical; the
+ final result is allowed to be outside this bracket.
+ method : str, optional
+ The method to determine the optimal transform parameter (`boxcox`
+ ``lmbda`` parameter). Options are:
+
+ 'pearsonr' (default)
+ Maximizes the Pearson correlation coefficient between
+ ``y = boxcox(x)`` and the expected values for ``y`` if `x` would be
+ normally-distributed.
+
+ 'mle'
+ Minimizes the log-likelihood `boxcox_llf`. This is the method used
+ in `boxcox`.
+
+ 'all'
+ Use all optimization methods available, and return all results.
+ Useful to compare different methods.
- return optimize.brent(tempfunc, brack=brack, args=(xvals, x))
+ Returns
+ -------
+ maxlog : float or ndarray
+ The optimal transform parameter found. An array instead of a scalar
+ for ``method='all'``.
+ See Also
+ --------
+ boxcox, boxcox_llf, boxcox_normplot
-def boxcox_normplot(x,la,lb,plot=None,N=80):
- svals = r_[la:lb:complex(N)]
- ppcc = svals*0.0
- k = 0
- for sval in svals:
- # This doesn't use sval, creates constant ppcc, and horizontal line
- z = boxcox(x,sval)
- r1,r2 = probplot(z,dist='norm',fit=1)
- ppcc[k] = r2[-1]
- k += 1
+ Examples
+ --------
+ >>> from scipy import stats
+ >>> import matplotlib.pyplot as plt
+ >>> np.random.seed(1234) # make this example reproducible
+
+ Generate some data and determine optimal ``lmbda`` in various ways:
+
+ >>> x = stats.loggamma.rvs(5, size=30) + 5
+ >>> y, lmax_mle = stats.boxcox(x)
+ >>> lmax_pearsonr = stats.boxcox_normmax(x)
+
+ >>> lmax_mle
+ 7.177...
+ >>> lmax_pearsonr
+ 7.916...
+ >>> stats.boxcox_normmax(x, method='all')
+ array([ 7.91667384, 7.17718692])
+
+ >>> fig = plt.figure()
+ >>> ax = fig.add_subplot(111)
+ >>> stats.boxcox_normplot(x, -10, 10, plot=ax)
+ >>> ax.axvline(lmax_mle, color='r')
+ >>> ax.axvline(lmax_pearsonr, color='g', ls='--')
+
+ >>> plt.show()
+
+ """
+ def _pearsonr(x, brack):
+ osm_uniform = _calc_uniform_order_statistic_medians(x)
+ xvals = distributions.norm.ppf(osm_uniform)
+
+ def _eval_pearsonr(lmbda, xvals, samps):
+ # This function computes the x-axis values of the probability plot
+ # and computes a linear regression (including the correlation) and
+ # returns ``1 - r`` so that a minimization function maximizes the
+ # correlation.
+ y = boxcox(samps, lmbda)
+ yvals = np.sort(y)
+ r, prob = stats.pearsonr(xvals, yvals)
+ return 1 - r
+
+ return optimize.brent(_eval_pearsonr, brack=brack, args=(xvals, x))
+
+ def _mle(x, brack):
+ def _eval_mle(lmb, data):
+ # function to minimize
+ return -boxcox_llf(lmb, data)
+
+ return optimize.brent(_eval_mle, brack=brack, args=(x,))
+
+ def _all(x, brack):
+ maxlog = np.zeros(2, dtype=np.float)
+ maxlog[0] = _pearsonr(x, brack)
+ maxlog[1] = _mle(x, brack)
+ return maxlog
+
+ methods = {'pearsonr': _pearsonr,
+ 'mle': _mle,
+ 'all': _all}
+ if not method in methods.keys():
+ raise ValueError("Method %s not recognized." % method)
+
+ optimfunc = methods[method]
+ return optimfunc(x, brack)
+
+
+def boxcox_normplot(x, la, lb, plot=None, N=80):
+ """Compute parameters for a Box-Cox normality plot, optionally show it.
+
+ A Box-Cox normality plot shows graphically what the best transformation
+ parameter is to use in `boxcox` to obtain a distribution that is close
+ to normal.
+
+ Parameters
+ ----------
+ x : array_like
+ Input array.
+ la, lb : scalar
+ The lower and upper bounds for the ``lmbda`` values to pass to `boxcox`
+ for Box-Cox transformations. These are also the limits of the
+ horizontal axis of the plot if that is generated.
+ plot : object, optional
+ If given, plots the quantiles and least squares fit.
+ `plot` is an object that has to have methods "plot" and "text".
+ The `matplotlib.pyplot` module or a Matplotlib Axes object can be used,
+ or a custom object with the same methods.
+ Default is None, which means that no plot is created.
+ N : int, optional
+ Number of points on the horizontal axis (equally distributed from
+ `la` to `lb`).
+
+ Returns
+ -------
+ lmbdas : ndarray
+ The ``lmbda`` values for which a Box-Cox transform was done.
+ ppcc : ndarray
+ Probability Plot Correlelation Coefficient, as obtained from `probplot`
+ when fitting the Box-Cox transformed input `x` against a normal
+ distribution.
+
+ See Also
+ --------
+ probplot, boxcox, boxcox_normmax, boxcox_llf, ppcc_max
+
+ Notes
+ -----
+ Even if `plot` is given, the figure is not shown or saved by
+ `boxcox_normplot`; ``plt.show()`` or ``plt.savefig('figname.png')``
+ should be used after calling `probplot`.
+
+ Examples
+ --------
+ >>> from scipy import stats
+ >>> import matplotlib.pyplot as plt
+
+ Generate some non-normally distributed data, and create a Box-Cox plot:
+
+ >>> x = stats.loggamma.rvs(5, size=500) + 5
+ >>> fig = plt.figure()
+ >>> ax = fig.add_subplot(111)
+ >>> stats.boxcox_normplot(x, -20, 20, plot=ax)
+
+ Determine and plot the optimal ``lmbda`` to transform ``x`` and plot it in
+ the same plot:
+
+ >>> _, maxlog = stats.boxcox(x)
+ >>> ax.axvline(maxlog, color='r')
+
+ >>> plt.show()
+
+ """
+ x = np.asarray(x)
+ if x.size == 0:
+ return x
+
+ if lb <= la:
+ raise ValueError("`lb` has to be larger than `la`.")
+
+ lmbdas = np.linspace(la, lb, num=N)
+ ppcc = lmbdas * 0.0
+ for i, val in enumerate(lmbdas):
+ # Determine for each lmbda the correlation coefficient of transformed x
+ z = boxcox(x, lmbda=val)
+ _, r2 = probplot(z, dist='norm', fit=True)
+ ppcc[i] = r2[-1]
if plot is not None:
- plot.plot(svals, ppcc, 'x')
- plot.title('Box-Cox Normality Plot')
- plot.xlabel('Prob Plot Corr. Coef.')
- plot.ylabel('Transformation parameter')
+ plot.plot(lmbdas, ppcc, 'x')
+ try:
+ if hasattr(plot, 'set_title'):
+ # Matplotlib Axes instance or something that looks like it
+ plot.set_title('Box-Cox Normality Plot')
+ plot.set_ylabel('Prob Plot Corr. Coef.')
+ plot.set_xlabel('$\lambda$')
+ else:
+ # matplotlib.pyplot module
+ plot.title('Box-Cox Normality Plot')
+ plot.ylabel('Prob Plot Corr. Coef.')
+ plot.xlabel('$\lambda$')
+ except Exception:
+ # Not an MPL object or something that looks (enough) like it.
+ # Don't crash on adding labels or title
+ pass
- return svals, ppcc
+ return lmbdas, ppcc
-def shapiro(x,a=None,reta=False):
+def shapiro(x, a=None, reta=False):
"""
Perform the Shapiro-Wilk test for normality.
View
149 scipy/stats/tests/test_morestats.py
@@ -497,10 +497,151 @@ def test_ppcc_max_bad_arg():
assert_raises(ValueError, stats.ppcc_max, data, dist="plate_of_shrimp")
-def test_boxcox_bad_arg():
- # Raise ValueError if any data value is negative.
- x = np.array([-1])
- assert_raises(ValueError, stats.boxcox, x)
+class TestBoxcox_llf(TestCase):
+
+ def test_basic(self):
+ np.random.seed(54321)
+ x = stats.norm.rvs(size=10000, loc=10)
+ lmbda = 1
+ llf = stats.boxcox_llf(lmbda, x)
+ llf_expected = -x.size / 2. * np.log(np.sum(x.std()**2))
+ assert_allclose(llf, llf_expected)
+
+ def test_array_like(self):
+ np.random.seed(54321)
+ x = stats.norm.rvs(size=100, loc=10)
+ lmbda = 1
+ llf = stats.boxcox_llf(lmbda, x)
+ llf2 = stats.boxcox_llf(lmbda, list(x))
+ assert_allclose(llf, llf2, rtol=1e-12)
+
+ def test_2d_input(self):
+ # Note: boxcox_llf() was already working with 2-D input (sort of), so
+ # keep it like that. boxcox() doesn't work with 2-D input though, due
+ # to brent() returning a scalar.
+ np.random.seed(54321)
+ x = stats.norm.rvs(size=100, loc=10)
+ lmbda = 1
+ llf = stats.boxcox_llf(lmbda, x)
+ llf2 = stats.boxcox_llf(lmbda, np.vstack([x, x]).T)
+ assert_allclose([llf, llf], llf2, rtol=1e-12)
+
+ def test_empty(self):
+ assert_(np.isnan(stats.boxcox_llf(1, [])))
+
+
+class TestBoxcox(TestCase):
+
+ def test_fixed_lmbda(self):
+ np.random.seed(12345)
+ x = stats.loggamma.rvs(5, size=50) + 5
+ xt = stats.boxcox(x, lmbda=1)
+ assert_allclose(xt, x - 1)
+ xt = stats.boxcox(x, lmbda=-1)
+ assert_allclose(xt, 1 - 1/x)
+
+ xt = stats.boxcox(x, lmbda=0)
+ assert_allclose(xt, np.log(x))
+
+ # Also test that array_like input works
+ xt = stats.boxcox(list(x), lmbda=0)
+ assert_allclose(xt, np.log(x))
+
+ def test_lmbda_None(self):
+ np.random.seed(1234567)
+ # Start from normal rv's, do inverse transform to check that
+ # optimization function gets close to the right answer.
+ np.random.seed(1245)
+ lmbda = 2.5
+ x = stats.norm.rvs(loc=10, size=50000)
+ x_inv = (x * lmbda + 1)**(-lmbda)
+ xt, maxlog = stats.boxcox(x_inv)
+
+ assert_almost_equal(maxlog, -1 / lmbda, decimal=2)
+
+ def test_alpha(self):
+ np.random.seed(1234)
+ x = stats.loggamma.rvs(5, size=50) + 5
+
+ # Some regular values for alpha, on a small sample size
+ _, _, interval = stats.boxcox(x, alpha=0.75)
+ assert_allclose(interval, [4.004485780226041, 5.138756355035744])
+ _, _, interval = stats.boxcox(x, alpha=0.05)
+ assert_allclose(interval, [1.2138178554857557, 8.209033272375663])
+
+ # Try some extreme values, see we don't hit the N=500 limit
+ x = stats.loggamma.rvs(7, size=500) + 15
+ _, _, interval = stats.boxcox(x, alpha=0.001)
+ assert_allclose(interval, [0.3988867, 11.40553131])
+ _, _, interval = stats.boxcox(x, alpha=0.999)
+ assert_allclose(interval, [5.83316246, 5.83735292])
+
+ def test_boxcox_bad_arg(self):
+ # Raise ValueError if any data value is negative.
+ x = np.array([-1])
+ assert_raises(ValueError, stats.boxcox, x)
+
+ def test_empty(self):
+ assert_(stats.boxcox([]).shape == (0,))
+
+
+class TestBoxcoxNormmax(TestCase):
+ def setUp(self):
+ np.random.seed(12345)
+ self.x = stats.loggamma.rvs(5, size=50) + 5
+
+ def test_pearsonr(self):
+ maxlog = stats.boxcox_normmax(self.x)
+ assert_allclose(maxlog, 1.804465325046)
+
+ def test_mle(self):
+ maxlog = stats.boxcox_normmax(self.x, method='mle')
+ assert_allclose(maxlog, 1.758101454114)
+
+ # Check that boxcox() uses 'mle'
+ _, maxlog_boxcox = stats.boxcox(self.x)
+ assert_allclose(maxlog_boxcox, maxlog)
+
+ def test_all(self):
+ maxlog_all = stats.boxcox_normmax(self.x, method='all')
+ assert_allclose(maxlog_all, [1.804465325046, 1.758101454114])
+
+
+class TestBoxcoxNormplot(TestCase):
+ def setUp(self):
+ np.random.seed(7654321)
+ self.x = stats.loggamma.rvs(5, size=500) + 5
+
+ def test_basic(self):
+ N = 5
+ lmbdas, ppcc = stats.boxcox_normplot(self.x, -10, 10, N=N)
+ ppcc_expected = [0.57783375, 0.83610988, 0.97524311, 0.99756057,
+ 0.95843297]
+ assert_allclose(lmbdas, np.linspace(-10, 10, num=N))
+ assert_allclose(ppcc, ppcc_expected)
+
+ @dec.skipif(not have_matplotlib)
+ def test_plot_kwarg(self):
+ # Check with the matplotlib.pyplot module
+ fig = plt.figure()
+ fig.add_subplot(111)
+ stats.boxcox_normplot(self.x, -20, 20, plot=plt)
+ plt.close()
+
+ # Check that a Matplotlib Axes object is accepted
+ fig.add_subplot(111)
+ ax = fig.add_subplot(111)
+ stats.boxcox_normplot(self.x, -20, 20, plot=ax)
+ plt.close()
+
+ def test_invalid_inputs(self):
+ # `lb` has to be larger than `la`
+ assert_raises(ValueError, stats.boxcox_normplot, self.x, 1, 0)
+ # `x` can not contain negative values
+ assert_raises(ValueError, stats.boxcox_normplot, [-1, 1] , 0, 1)
+
+ def test_empty(self):
+ assert_(stats.boxcox_normplot([], 0, 1).size == 0)
class TestCircFuncs(TestCase):
Something went wrong with that request. Please try again.