[MRG] ignore NaNs in PowerTransformer #11306

Merged
merged 34 commits into scikit-learn:master on Jun 21, 2018

Conversation

@glemaitre
Contributor

glemaitre commented Jun 17, 2018

Reference Issues/PRs

Towards #10404.

Should be merged after:

  • #11206 (depends on the StandardScaler)
  • #11011 (better test)

What does this implement/fix? Explain your changes.

Any other comments?

glemaitre added some commits Jun 5, 2018

tmp
@glemaitre

Contributor

glemaitre commented Jun 17, 2018

The diff will be cleaner once #11206 is merged.

@glemaitre

Contributor

glemaitre commented Jun 17, 2018

My concern with this PR is this line:

https://github.com/scikit-learn/scikit-learn/pull/11306/files#diff-5ebddebc20987b6125fffc893f5abc4cR2785

It seems that the computation of the lambdas is not robust to NaN values. Therefore, we need to slice the column to find the lambdas and then compute the transform using boxcox (which ignores NaN).
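
For illustration, a minimal sketch of the per-column slicing being discussed (the helper name and loop are assumptions, not the PR's actual code):

import numpy as np
from scipy import stats

def _boxcox_lambdas_ignoring_nan(X):
    # hypothetical helper: estimate one Box-Cox lambda per column of X,
    # dropping NaN entries before calling stats.boxcox, whose lambda
    # estimation is not robust to NaN
    lambdas = []
    for col in X.T:
        _, lmbda = stats.boxcox(col[~np.isnan(col)], lmbda=None)
        lambdas.append(lmbda)
    return np.array(lambdas)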

@jnothman jnothman referenced this pull request Jun 17, 2018

Closed

Disregard NaNs in preprocessing #10404

6 of 7 tasks complete
@jnothman

Member

jnothman commented Jun 17, 2018

It seems that the computation of the lambdas is not robust to NaN values. Therefore, we need to slice the column to find the lambdas and then compute the transform using boxcox (which ignores NaN).

Why is that concerning? It has roughly the same complexity as the np.nan* functions... the only improvement I can see is to avoid making a data copy when there are no NaNs.
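
As an aside, a hedged sketch of that copy-avoiding variant, per column (names are illustrative):

import numpy as np
from scipy import stats

def _boxcox_lambda(col):
    # hypothetical helper: boolean indexing copies the data, so only
    # slice when the column actually contains NaN values
    nan_mask = np.isnan(col)
    if nan_mask.any():
        col = col[~nan_mask]
    _, lmbda = stats.boxcox(col, lmbda=None)
    return lmbda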

glemaitre added some commits Jun 18, 2018

@glemaitre glemaitre changed the title from [WIP] ignore NaNs in PowerTransformer to [MRG] ignore NaNs in PowerTransformer Jun 18, 2018

@glemaitre glemaitre added this to the 0.20 milestone Jun 18, 2018

@glemaitre

Contributor

glemaitre commented Jun 18, 2018

@ogrisel

Besides the following small comment, the PowerTransformer specific part of this PR LGTM.

# the lambdas are not robust to NaN values, so
# get rid of them to compute them.
_, lmbda = stats.boxcox(col[~np.isnan(col)], lmbda=None)
# FIXME: stats.boxcox should be replaced by special.boxcox, which
# handles NaN and does not raise warnings. Check SciPy >= 0.14.

@ogrisel

ogrisel Jun 21, 2018

Member

Why not use special.boxcox if the installed scipy is recent enough? This will make it easier to drop the backward compat code once we do not support scipy < 0.14.

Ideally, the boxcox backward compat function that ignores the nan warnings should be extracted in sklearn.utils.fixes to make it easy to clean up the backward compat code in the future.
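
One possible shape for such a helper in sklearn.utils.fixes, sketched under the assumption that the compat path simply masks NaNs before calling stats.boxcox (the exported name boxcox and the version check are illustrative, not the PR's actual code):

# sketch of sklearn/utils/fixes.py content (illustrative only)
import numpy as np
import scipy
from scipy import stats

sp_version = tuple(int(v) for v in scipy.__version__.split('.')[:2])

if sp_version >= (0, 14):
    # scipy.special.boxcox is element-wise and propagates NaN silently
    from scipy.special import boxcox
else:
    def boxcox(x, lmbda):
        # older scipy: apply stats.boxcox on the non-NaN entries only
        # and put NaN back afterwards, so no NaN warning is raised
        x = np.asarray(x, dtype=np.float64)
        out = np.full_like(x, np.nan)
        mask = ~np.isnan(x)
        out[mask] = stats.boxcox(x[mask], lmbda=lmbda)
        return out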

@glemaitre

glemaitre Jun 21, 2018

Contributor

good point

@jnothman

jnothman Jun 21, 2018

Member

It may not be so easy to backport special.boxcox... It's something like:

from libc.math cimport log, log1p, expm1, exp, fabs
cimport numpy as np



cdef inline double _func_boxcox(double x, double lmbda) nogil:
    # if lmbda << 1 and log(x) < 1.0, the lmbda*log(x) product can lose
    # precision, furthermore, expm1(x) == x for x < eps.
    # For doubles, the range of log is -744.44 to +709.78, with eps being
    # the smallest value produced.  This range means that we will have
    # abs(lmbda)*log(x) < eps whenever abs(lmbda) <= eps/-log(min double)
    # which is ~2.98e-19.  
    if fabs(lmbda) < 1e-19:
        return log(x)
    else:
        return expm1(lmbda * log(x)) / lmbda


cdef void loop_d_dd__As_dd_d(char **args, np.npy_intp *dims, np.npy_intp *steps, void *data) nogil:
    cdef np.npy_intp i, n = dims[0]
    cdef void *func = (<void**>data)[0]
    cdef char *func_name = <char*>(<void**>data)[1]
    cdef char *ip0 = args[0]
    cdef char *ip1 = args[1]
    cdef char *op0 = args[2]
    cdef double ov0
    for i in range(n):
        ov0 = (<double(*)(double, double) nogil>func)(<double>(<double*>ip0)[0], <double>(<double*>ip1)[0])
        (<double *>op0)[0] = <double>ov0
        ip0 += steps[0]
        ip1 += steps[1]
        op0 += steps[2]
    sf_error.check_fpe(func_name)

cdef void loop_d_dd__As_ff_f(char **args, np.npy_intp *dims, np.npy_intp *steps, void *data) nogil:
    cdef np.npy_intp i, n = dims[0]
    cdef void *func = (<void**>data)[0]
    cdef char *func_name = <char*>(<void**>data)[1]
    cdef char *ip0 = args[0]
    cdef char *ip1 = args[1]
    cdef char *op0 = args[2]
    cdef double ov0
    for i in range(n):
        ov0 = (<double(*)(double, double) nogil>func)(<double>(<float*>ip0)[0], <double>(<float*>ip1)[0])
        (<float *>op0)[0] = <float>ov0
        ip0 += steps[0]
        ip1 += steps[1]
        op0 += steps[2]
    sf_error.check_fpe(func_name)


cdef np.PyUFuncGenericFunction ufunc_boxcox_loops[2]
cdef void *ufunc_boxcox_ptr[4]
cdef void *ufunc_boxcox_data[2]
cdef char ufunc_boxcox_types[6]
cdef char *ufunc_boxcox_doc = (
    "boxcox(x, lmbda)\n"
    "\n"
    "Compute the Box-Cox transformation.\n"
    "\n"
    "The Box-Cox transformation is::\n"
    "\n"
    "    y = (x**lmbda - 1) / lmbda  if lmbda != 0\n"
    "        log(x)                  if lmbda == 0\n"
    "\n"
    "Returns `nan` if ``x < 0``.\n"
    "Returns `-inf` if ``x == 0`` and ``lmbda < 0``.\n"
    "\n"
    "Parameters\n"
    "----------\n"
    "x : array_like\n"
    "    Data to be transformed.\n"
    "lmbda : array_like\n"
    "    Power parameter of the Box-Cox transform.\n"
    "\n"
    "Returns\n"
    "-------\n"
    "y : array\n"
    "    Transformed data.\n"
    "\n"
    "Notes\n"
    "-----\n"
    "\n"
    ".. versionadded:: 0.14.0\n"
    "\n"
    "Examples\n"
    "--------\n"
    ">>> from scipy.special import boxcox\n"
    ">>> boxcox([1, 4, 10], 2.5)\n"
    "array([   0.        ,   12.4       ,  126.09110641])\n"
    ">>> boxcox(2, [0, 1, 2])\n"
    "array([ 0.69314718,  1.        ,  1.5       ])")
ufunc_boxcox_loops[0] = <np.PyUFuncGenericFunction>loop_d_dd__As_ff_f
ufunc_boxcox_loops[1] = <np.PyUFuncGenericFunction>loop_d_dd__As_dd_d
ufunc_boxcox_types[0] = <char>NPY_FLOAT
ufunc_boxcox_types[1] = <char>NPY_FLOAT
ufunc_boxcox_types[2] = <char>NPY_FLOAT
ufunc_boxcox_types[3] = <char>NPY_DOUBLE
ufunc_boxcox_types[4] = <char>NPY_DOUBLE
ufunc_boxcox_types[5] = <char>NPY_DOUBLE
ufunc_boxcox_ptr[2*0] = <void*>_func_boxcox
ufunc_boxcox_ptr[2*0+1] = <void*>(<char*>"boxcox")
ufunc_boxcox_ptr[2*1] = <void*>_func_boxcox
ufunc_boxcox_ptr[2*1+1] = <void*>(<char*>"boxcox")
ufunc_boxcox_data[0] = &ufunc_boxcox_ptr[2*0]
ufunc_boxcox_data[1] = &ufunc_boxcox_ptr[2*1]
boxcox = np.PyUFunc_FromFuncAndData(ufunc_boxcox_loops, ufunc_boxcox_data, ufunc_boxcox_types, 2, 2, 1, 0, "boxcox", ufunc_boxcox_doc, 0)

@jnothman

jnothman Jun 21, 2018

Member

But I've now realised you probably aren't suggesting to backport special.boxcox, merely to move these few lines to fixes. Yes.

@ogrisel

ogrisel Jun 21, 2018

Member

Yes :)

@sklearn-lgtm

sklearn-lgtm commented Jun 21, 2018

This pull request introduces 2 alerts when merging 08960ec into e555893 - view on LGTM.com

new alerts:

  • 1 for Unused import
  • 1 for Module is imported with 'import' and 'import from'

Comment posted by LGTM.com

@sklearn-lgtm

sklearn-lgtm commented Jun 21, 2018

This pull request introduces 2 alerts when merging e626a39 into e555893 - view on LGTM.com

new alerts:

  • 1 for Unused import
  • 1 for Module is imported with 'import' and 'import from'

Comment posted by LGTM.com

@jnothman jnothman merged commit c1bc665 into scikit-learn:master Jun 21, 2018

5 of 6 checks passed

  • continuous-integration/appveyor/pr - Waiting for AppVeyor build to complete
  • LGTM analysis: Python - 2 new alerts
  • ci/circleci: deploy - Your tests passed on CircleCI!
  • ci/circleci: python2 - Your tests passed on CircleCI!
  • ci/circleci: python3 - Your tests passed on CircleCI!
  • continuous-integration/travis-ci/pr - The Travis CI build passed