
[MRG+1] fused types in isotonic_regression #9106

Merged
merged 11 commits into scikit-learn:master from vene:isofused on Feb 26, 2019

Conversation

6 participants
@vene
Member

commented Jun 11, 2017

Reference Issue #8769 -ish?

What does this implement/fix? Explain your changes.

Isotonic regression should preserve 32-bit dtypes when possible.

Any other comments?

@vene

Member Author

commented Jun 11, 2017

I implemented and tested dtype preservation for IsotonicRegression as well.

I followed the convention that X should drive the dtype, and checked that interp1d preserves dtype. For this particular estimator, though, this doesn't make much sense: X and y are always just 1d arrays. In any case, it's better to stick to conventions.
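For illustration, a minimal sketch of what that convention would look like from the user side (assuming the dtype-preserving behavior this PR is after; the real check is in test_isotonic_dtype below):

import numpy as np
from sklearn.isotonic import IsotonicRegression

# float32 inputs should drive a float32 fit/predict path once dtypes are preserved
X = np.arange(10, dtype=np.float32)
y = np.random.randn(10).astype(np.float32)

ir = IsotonicRegression().fit(X, y)
print(ir.predict(X).dtype)  # expected: float32 with this PR, float64 before it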

Ping @Henley13 for comments :)

@vene vene changed the title [WIP] fused types in isotonic_regression [MRG] fused types in isotonic_regression Jun 11, 2017

@vene

Member Author

commented Jun 11, 2017

further ping @lesteve, @glemaitre, @raghavrv, @TomDLT just in case someone bites :P I am pretty sure I forgot things here, because I failed to participate in all of the discussions on 32-bit support over the week.

@@ -437,3 +438,15 @@ def test_isotonic_copy_before_fit():
    # https://github.com/scikit-learn/scikit-learn/issues/6628
    ir = IsotonicRegression()
    copy.copy(ir)


def test_isotonic_dtype():

@vene

vene Jun 11, 2017

Author Member

should this kind of test be here or in common tests / estimator tags?

@amueller

amueller Jun 19, 2017

Member

Probably there, but not possible without estimator tags, right? preserves_precision? Have you run it on all_estimators to see the status? We could post an issue with a check list.
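A rough sketch of the survey suggested here (assuming all_estimators as exposed in sklearn.utils.testing at the time; the loop and filtering are illustrative only):

import numpy as np
from sklearn.utils.testing import all_estimators

X = np.random.rand(20, 3).astype(np.float32)
y = np.random.rand(20).astype(np.float32)

for name, Estimator in all_estimators(type_filter='regressor'):
    try:
        pred = Estimator().fit(X, y).predict(X)
    except Exception:
        continue  # estimators needing extra constructor arguments, 1d inputs, etc.
    print(name, pred.dtype)  # float32 only if the estimator preserves precision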

@vene

Member Author

commented Jun 13, 2017

The test failures were genuine: one of the tests calls predict without calling fit. Instead of changing the test, I made the model support this behavior (so it works the same way as on master).

@jnothman jnothman added this to the 0.19 milestone Jun 14, 2017

@@ -375,6 +378,9 @@ def transform(self, T):
            The transformed data
        """
        T = as_float_array(T)
        if hasattr(self, '_dtype'):

@amueller

amueller Jun 19, 2017

Member

Is this required? Is this faster?

@vene

vene Jun 26, 2017

Author Member

Basically, it's a way to check whether there is a saved X. I could replace it with:

if hasattr(self, '_necessary_X_'):
    T = T.astype(self._necessary_X_.dtype, copy=False)

I could check if it makes a difference in terms of speed but it's probably minimal (just attribute accesses, right?)
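For context, a self-contained sketch of the alternative described above (illustrative only, not the actual scikit-learn code): keep the training data around at fit time and cast the transform input to its dtype:

import numpy as np

class DtypePreservingSketch:
    def fit(self, X, y=None):
        # store the training data; its dtype later drives transform
        self._necessary_X_ = np.asarray(X)
        return self

    def transform(self, T):
        T = np.asarray(T)
        if hasattr(self, '_necessary_X_'):
            # no-op when T already matches the fitted dtype
            T = T.astype(self._necessary_X_.dtype, copy=False)
        return T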

        else:
            sample_weight = np.array(sample_weight[order], dtype=np.float64)
            sample_weight = np.array(sample_weight[order])

@TomDLT

TomDLT Jun 26, 2017

Member

Should we make sure that sample_weight has the same dtype as y?
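One way to enforce that, sketched with check_array (an illustration of the suggestion, not necessarily what the PR ended up doing):

import numpy as np
from sklearn.utils import check_array

y = np.array([0., 1., 2.], dtype=np.float32)
sample_weight = np.array([1, 1, 2])  # e.g. integer weights passed by the user

# cast sample_weight to the dtype of y, copying only when needed
sample_weight = check_array(sample_weight, dtype=y.dtype, ensure_2d=False)
assert sample_weight.dtype == y.dtype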

@@ -271,6 +272,7 @@ def _build_y(self, X, y, sample_weight, trim_duplicates=True):
        check_consistent_length(X, y, sample_weight)
        X, y = [check_array(x, ensure_2d=False) for x in [X, y]]

        X = as_float_array(X)

@TomDLT

TomDLT Jun 26, 2017

Member

This will trigger a copy, and is rather redundant with check_array.
I would prefer:

X = check_array(X, dtype=[np.float64, np.float32])
y = check_array(y, dtype=X.dtype, ensure_2d=False)

@vene

vene Jul 3, 2017

Author Member

@TomDLT this has different behavior than as_float_array if the input is int32 (and maybe others too?)

import numpy as np
from sklearn.utils import as_float_array, check_array

for dtype in (np.int32, np.int64, np.uint32, np.uint64, np.float32, np.float64):
    print('input dtype\t', dtype.__name__)
    x = np.arange(5).astype(dtype)
    x = as_float_array(x)
    print('as_float_array\t', x.dtype)

    x = np.arange(5).astype(dtype)
    x = check_array(x, dtype=[np.float64, np.float32], ensure_2d=False)
    print('check_array\t', x.dtype)
    print()

outputs

input dtype	 int32
as_float_array	 float32
check_array	 float64

input dtype	 int64
as_float_array	 float64
check_array	 float64

input dtype	 uint32
as_float_array	 float64
check_array	 float64

input dtype	 uint64
as_float_array	 float64
check_array	 float64

input dtype	 float32
as_float_array	 float32
check_array	 float32

input dtype	 float64
as_float_array	 float64
check_array	 float64

Has there been a decision in the other 32-bit PRs about what to do in this situation?

@TomDLT

TomDLT Jul 3, 2017

Member

this has different behavior than as_float_array if the input is int32

Fair enough, just make sure that it does not copy more than necessary

@vene

vene Jul 3, 2017

Author Member

I removed the redundant call to check_array on X and am now just calling it on y. Still, I think the behavior on int inputs should be consistent across the codebase if possible.

@Henley13 have you already discussed this?

@vene

vene Feb 25, 2019

Author Member

Failing tests on travis seem to suggest that my assumptions here fail on some configs; in particular this:

input dtype	 int32
as_float_array	 float32
check_array	 float64

@agramfort agramfort force-pushed the vene:isofused branch from 63a1f7d to 241c3c6 Feb 25, 2019

@vene

Member Author

commented Feb 25, 2019

thanks @agramfort ! my memory might fail but let me know if I can help

@agramfort

Member

commented Feb 25, 2019

@vene can you replicate the travis failure? It seems I can't

@agramfort agramfort added this to To do in Sprint Paris 2019 Feb 25, 2019

@agramfort agramfort moved this from To do to In progress in Sprint Paris 2019 Feb 25, 2019

@vene

Member Author

commented Feb 25, 2019

@vene can you replicate the travis failure? It seems I can't

I was able to reproduce in a clean python 3.5.6 installed with pyenv, with numpy==1.11.0 scipy==0.17.0

The problem is caused by the call to interp1d, which apparently upcasts the float32s to float64 on this version of scipy (?).

Basically, at this line, T.dtype == float32 but the result is float64...

@vene

Member Author

commented Feb 25, 2019

# file check.py
import numpy as np
import scipy
from scipy import interpolate
print(np.__version__)
print(scipy.__version__)

X = np.random.randn(5).astype(np.float32)
y = np.random.randn(5).astype(np.float32)
T = np.random.randn(5).astype(np.float32)

f_ = scipy.interpolate.interp1d(X, y, kind='linear', bounds_error=False)
out = f_(T)
print(out.dtype)

output

scikit-learn(isofused){main}$ python check.py
1.16.1
1.2.1
float32
scikit-learn(isofused){main}$ pyenv activate skl
scikit-learn(isofused){skl}$ python check.py
1.11.0
0.17.0
float64
@vene

Member Author

commented Feb 25, 2019

I think there are two solutions:

  1. force-cast the return of f_ to the dtype of T, which on newer scipy should be a no-op (see the sketch below)
  2. change our expectations in the tests (at the cost of weakening the tests, I think)

EDIT: pushed a quick fix for solution (1)
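A minimal sketch of solution (1), reusing the reproduction script above; the extra astype is effectively a no-op on scipy versions that already return float32:

import numpy as np
from scipy import interpolate

X = np.random.randn(5).astype(np.float32)
y = np.random.randn(5).astype(np.float32)
T = np.random.randn(5).astype(np.float32)

f_ = interpolate.interp1d(X, y, kind='linear', bounds_error=False)
# force-cast the interpolation result back to the dtype of T
out = f_(T).astype(T.dtype, copy=False)
print(out.dtype)  # float32 regardless of whether interp1d upcast internally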

vene added some commits Feb 25, 2019

flake8 remove blank line
sorry, i'm rusty
@agramfort

Member

commented Feb 25, 2019

This is good to go from my end. Should I be worried that azure and appveyor failed? It seems unrelated...

@vene

Member Author

commented Feb 25, 2019

one or both of the azure failures actually seem related, but pretty weird:

================================== FAILURES ===================================
_______ test_isotonic_regression_with_ties_in_differently_sized_groups ________

    def test_isotonic_regression_with_ties_in_differently_sized_groups():
        """
        Non-regression test to handle issue 9432:
        ***/issues/9432
    
        Compare against output in R:
        > library("isotone")
        > x <- c(0, 1, 1, 2, 3, 4)
        > y <- c(0, 0, 1, 0, 0, 1)
        > res1 <- gpava(x, y, ties="secondary")
        > res1$x
    
        `isotone` version: 1.1-0, 2015-07-24
        R version: R version 3.3.2 (2016-10-31)
        """
        x = np.array([0, 1, 1, 2, 3, 4])
        y = np.array([0, 0, 1, 0, 0, 1])
        y_true = np.array([0., 0.25, 0.25, 0.25, 0.25, 1.])
        ir = IsotonicRegression()
>       ir.fit(x, y)

ir         = IsotonicRegression(increasing=True, out_of_bounds='nan', y_max=None, y_min=None)
x          = array([0, 1, 1, 2, 3, 4])
y          = array([0, 0, 1, 0, 0, 1])
y_true     = array([0.  , 0.25, 0.25, 0.25, 0.25, 1.  ])

C:\Miniconda\envs\testvenv\lib\site-packages\sklearn\tests\test_isotonic.py:190: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
C:\Miniconda\envs\testvenv\lib\site-packages\sklearn\isotonic.py:323: in fit
    X, y = self._build_y(X, y, sample_weight)
C:\Miniconda\envs\testvenv\lib\site-packages\sklearn\isotonic.py:267: in _build_y
    X, y, sample_weight)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   ValueError: Buffer dtype mismatch, expected 'float' but got 'double'

__builtins__ = <builtins>
__doc__    = None
__file__   = 'C:\\Miniconda\\envs\\testvenv\\lib\\site-packages\\sklearn\\_isotonic.cp35-win_amd64.pyd'
__loader__ = <_frozen_importlib_external.ExtensionFileLoader object at 0x000002DE52881B70>
__name__   = 'sklearn._isotonic'
__package__ = 'sklearn'
__pyx_unpickle_Enum = <built-in function __pyx_unpickle_Enum>
__spec__   = ModuleSpec(name='sklearn._isotonic', loader=<_frozen_importlib_external.ExtensionFileLoader object at 0x000002DE52881B70>, origin='C:\\Miniconda\\envs\\testvenv\\lib\\site-packages\\sklearn\\_isotonic.cp35-win_amd64.pyd')
__test__   = {}
_inplace_contiguous_isotonic_regression = <cyfunction _inplace_contiguous_isotonic_regression at 0x000002DE52819D48>
_make_unique = <cyfunction _make_unique at 0x000002DE5288B048>
np         = <module 'numpy' from 'C:\\Miniconda\\envs\\testvenv\\lib\\site-packages\\numpy\\__init__.py'>

sklearn\_isotonic.pyx:84: ValueError
@vene

Member Author

commented Feb 25, 2019

The failure happens when calling _make_unique, which has the signature:

def _make_unique(np.ndarray[dtype=floating] X,
                 np.ndarray[dtype=floating] y,
                 np.ndarray[dtype=floating] sample_weights):

Could this be because the fused type specializations are tied, i.e., either X, y, sample_weights are all float32 or all float64 but not mixed?
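To make the failure mode concrete, a small sketch of the dtype mismatch and one possible caller-side fix (not necessarily the fix that was committed):

import numpy as np

X = np.array([0., 1., 1., 2., 3., 4.], dtype=np.float32)
y = np.array([0., 0., 1., 0., 0., 1.], dtype=np.float32)
sample_weight = np.ones_like(y, dtype=np.float64)  # mixed dtypes trigger "Buffer dtype mismatch"

# arguments sharing one fused type must all resolve to the same concrete dtype,
# so align sample_weight on X before calling _make_unique
sample_weight = sample_weight.astype(X.dtype, copy=False)
assert X.dtype == y.dtype == sample_weight.dtype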

@agramfort

Member

commented Feb 25, 2019

yes it's very likely. Can you fix?

@vene

Member Author

commented Feb 25, 2019

Hopefully! I'm trying to reproduce on windows, but it might take a bit longer for technical reasons.

@vene

Member Author

commented Feb 25, 2019

I was able to reproduce. Will find a fix tomorrow.

@albertcthomas

Contributor

commented Feb 25, 2019

FWIW I can also reproduce on my laptop.

vene added some commits Feb 25, 2019

@vene

Member Author

commented Feb 25, 2019

Thanks @albertcthomas. It was a legitimate bug, as shown by the new failing test in the previous commit 6f2b6d1. It should be fixed now; let's see what CI says.

@agramfort agramfort changed the title [MRG] fused types in isotonic_regression [MRG+1] fused types in isotonic_regression Feb 26, 2019

@agramfort agramfort moved this from In progress to Needs review in Sprint Paris 2019 Feb 26, 2019

@agramfort agramfort merged commit 4e56b29 into scikit-learn:master Feb 26, 2019

18 checks passed

LGTM analysis: C/C++ No code changes detected
LGTM analysis: JavaScript No code changes detected
LGTM analysis: Python No new or fixed alerts
ci/circleci: deploy Your tests passed on CircleCI!
ci/circleci: doc Your tests passed on CircleCI!
ci/circleci: doc-min-dependencies Your tests passed on CircleCI!
ci/circleci: lint Your tests passed on CircleCI!
codecov/patch 100% of diff hit (target 92.49%)
codecov/project Absolute coverage decreased by -0.01% but relative coverage increased by +7.5% compared to b29a961
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed
scikit-learn.scikit-learn Build #20190225.105 succeeded
scikit-learn.scikit-learn (Linux py35_conda_openblas) Linux py35_conda_openblas succeeded
scikit-learn.scikit-learn (Linux py35_np_atlas) Linux py35_np_atlas succeeded
scikit-learn.scikit-learn (Linux pylatest_conda) Linux pylatest_conda succeeded
scikit-learn.scikit-learn (Windows py35) Windows py35 succeeded
scikit-learn.scikit-learn (Windows py37) Windows py37 succeeded
scikit-learn.scikit-learn (macOS pylatest_conda) macOS pylatest_conda succeeded

Sprint Paris 2019 automation moved this from Needs review to Done Feb 26, 2019

@agramfort

Member

commented Feb 26, 2019

thanks @vene

@vene

Member Author

commented Feb 26, 2019

thanks @agramfort !!

xhlulu added a commit to xhlulu/scikit-learn that referenced this pull request Apr 28, 2019

[MRG+1] fused types in isotonic_regression (scikit-learn#9106)
* ENH fused types in isotonic_regression

* make X drive computation dtype in IsotonicRegression

* preserve current behaviour if transform w/o fit

* thoroughly check sample weights; avoid storing dtype explicitly

* consistent testing and behavior

* misc

* update what's new

* fix for interp1d upcast on old scipy

* flake8 remove blank line

sorry, i'm rusty

* add failing test

* FIX dtype bug in _make_unique

